Better Models: Worse Tools

TL;DR

The newest Anthropic language models are reported to produce more malformed tool calls than older models, most visibly when invoking Pi’s edit tool. The reports point to a drop in tool call accuracy in state-of-the-art models and raise questions about training and robustness.

In recent testing, the latest Anthropic models, including Opus 4.8 and Sonnet 5, produced more malformed calls to Pi’s edit tool than older models did. Structured tool use underpins automation and coding workflows, so the trend raises concerns about how robust these models are in practice.

Multiple users and researchers have observed newer Anthropic language models producing tool call payloads that contain invented fields, especially in multi-turn or context-rich interactions. The malformed calls carry extraneous keys such as type, id, and requireUnique, which Pi’s validation schema rejects. The issue is most prominent in Opus 4.8 and Sonnet 5 and has not been observed in older models, which may point to a regression linked to training updates or model architecture changes.
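
To make the rejection concrete, here is a minimal sketch of strict validation in which any key outside the schema counts as an error. The article does not list the actual fields of Pi’s edit tool, so path, edits, oldText, and newText are hypothetical placeholders; only type, id, and requireUnique come from the reports.

```python
# Hypothetical schema: the real field names of Pi's edit tool are not given in
# this article, so "path", "edits", "oldText" and "newText" are placeholders.
ALLOWED_TOP_LEVEL = {"path", "edits"}
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_edit_call(payload):
    """Return a list of human-readable schema violations (empty means valid)."""
    if not isinstance(payload, dict):
        return ["payload must be an object"]

    errors = []
    for key in sorted(set(payload) - ALLOWED_TOP_LEVEL):
        errors.append(f"unexpected key: {key}")
    for key in sorted(ALLOWED_TOP_LEVEL - set(payload)):
        errors.append(f"missing key: {key}")

    edits = payload.get("edits", [])
    if not isinstance(edits, list):
        errors.append("edits must be a list")
        return errors

    for index, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{index}] must be an object")
            continue
        for key in sorted(set(edit) - ALLOWED_EDIT_KEYS):
            errors.append(f"unexpected key: edits[{index}].{key}")
        for key in sorted(ALLOWED_EDIT_KEYS - set(edit)):
            errors.append(f"missing key: edits[{index}].{key}")
    return errors


clean_call = {
    "path": "src/app.py",
    "edits": [{"oldText": "foo", "newText": "bar"}],
}
# Same core payload, plus the extraneous keys named in the reports.
malformed_call = {
    "path": "src/app.py",
    "edits": [{"oldText": "foo", "newText": "bar", "type": "replace",
               "id": "edit-1", "requireUnique": True}],
}

print(validate_edit_call(clean_call))  # []
for problem in validate_edit_call(malformed_call):
    print(problem)
```

The second call carries a usable edit, yet it fails validation with one error per extra key. That is the failure pattern described above: a correct core payload wrapped in fields the schema does not allow.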

Researchers note that the errors tend to occur in complex, context-dependent prompts, such as multi-file edits, and are less frequent in simple, single-turn prompts. In the reported cases the core payload is generated correctly; the call fails because the model adds extra, invalid fields to the structured output.

At a glance

When: ongoing; observations made in July 2026

The development: Researchers have identified that newer Anthropic models, such as Opus 4.8 and Sonnet 5, increasingly produce invalid tool call payloads, worsening the reliability of AI tool integration.

Implications for AI Reliability in Tool Integration

This development matters because structured tool calls are essential for automating coding, data manipulation, and other technical tasks, and the reports suggest that state-of-the-art models may be getting less robust at producing them. Invalid or malformed payloads mean higher failure rates, less trust in AI systems, and extra validation layers, all of which complicate deployment in critical applications.

Evolution of Tool Call Handling in Anthropic Models

Earlier versions of Anthropic’s models were trained primarily on plain text and limited tool interaction schemas, which resulted in relatively stable tool call outputs. However, recent models, especially those integrated with Claude Code and advanced harnesses, have been trained on more complex tool invocation schemas, including nested and serialized JSON formats. This shift aims to improve flexibility but appears to have introduced new failure modes, notably the production of nonsensical or extra fields in tool calls.

The issue was first noticed in July 2026 by researchers testing multi-turn interactions, where the models’ tool calls increasingly included extraneous keys that violate the expected schema, leading to rejection and retries. Notably, this problem does not appear in older models, suggesting a change in training data or model architecture may be responsible.

“The newer models are producing nested edit calls with a zoo of invented keys, which are then rejected by Pi’s validation schema. It’s a regression in tool call reliability.”
— Researcher observing the issue

Extent and Causes of the Tool Call Degradation

It remains unclear whether this issue is widespread across all recent models, specific to certain configurations, or a transient artifact of recent training updates. The precise mechanisms causing models to generate extraneous fields are still under investigation, with some hypotheses pointing to changes in training data, schema enforcement, or decoding strategies.
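
One way to narrow down the scope question is to log every tool call together with the model, the kind of scenario, and whether it passed validation, then compare rejection rates. The sketch below assumes such a log exists. The record fields (model, scenario, valid) and the model-a/model-b labels are hypothetical, and the sample records are placeholders, not measurements.

```python
from collections import defaultdict


def malformed_rates(records):
    """Map each (model, scenario) pair to its share of rejected tool calls."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for record in records:
        key = (record["model"], record["scenario"])
        totals[key] += 1
        if not record["valid"]:
            rejected[key] += 1
    return {key: rejected[key] / totals[key] for key in totals}


# Placeholder records only: these are NOT measurements from the article.
sample_log = [
    {"model": "model-a", "scenario": "single-turn", "valid": True},
    {"model": "model-a", "scenario": "multi-file", "valid": True},
    {"model": "model-b", "scenario": "single-turn", "valid": True},
    {"model": "model-b", "scenario": "multi-file", "valid": False},
    {"model": "model-b", "scenario": "multi-file", "valid": True},
]

for (model, scenario), rate in sorted(malformed_rates(sample_log).items()):
    print(f"{model:8} {scenario:12} {rate:.0%}")
```

Splitting by scenario as well as by model matters here, because the reports tie the failures to multi-turn and multi-file contexts; a single per-model average could hide that pattern.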

Monitoring and Mitigating Future Tool Call Failures

Researchers and developers are expected to conduct further tests across different model versions and training regimes to determine the scope of this issue. Efforts may include refining training data, applying stricter decoding constraints, or developing post-processing validation to improve tool call accuracy. Model developers are also likely to issue updates or patches to address these failures as they become better understood.
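
As an illustration of the post-processing option, the sketch below drops unknown keys from an edit call before it reaches the validator and reports what it removed. It is not Pi’s actual behavior. The field names path, edits, oldText, and newText are hypothetical, and the function assumes the payload is already a dictionary whose edits entry is a list of dictionaries.

```python
# Hypothetical field names; the article does not list the edit tool's schema.
ALLOWED_TOP_LEVEL = {"path", "edits"}
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def strip_unknown_keys(payload):
    """Return (cleaned_payload, dropped_keys) without mutating the input."""
    dropped = []
    cleaned = {}
    for key, value in payload.items():
        if key in ALLOWED_TOP_LEVEL:
            cleaned[key] = value
        else:
            dropped.append(key)

    if "edits" in payload:
        cleaned_edits = []
        for index, edit in enumerate(payload["edits"]):
            kept = {}
            for key, value in edit.items():
                if key in ALLOWED_EDIT_KEYS:
                    kept[key] = value
                else:
                    dropped.append(f"edits[{index}].{key}")
            cleaned_edits.append(kept)
        cleaned["edits"] = cleaned_edits
    return cleaned, dropped


noisy_call = {
    "path": "src/app.py",
    "edits": [{"oldText": "foo", "newText": "bar", "type": "replace",
               "id": "edit-1", "requireUnique": True}],
}

cleaned, dropped = strip_unknown_keys(noisy_call)
print(cleaned)
print(dropped)
```

Returning the dropped keys rather than discarding them silently is a deliberate choice in this sketch: a harness that repairs calls quietly would hide the very regression the reports describe, so the list is meant to be logged or counted.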

Key Questions

Why are newer models producing worse tool calls?

The exact cause is still under investigation, but it may relate to recent training practices that incorporate more complex schemas, leading models to generate extraneous or malformed fields in structured outputs.

Does this affect all AI models?

No. Based on current reports, the issue appears specific to recent versions of Anthropic’s models, such as Opus 4.8 and Sonnet 5, and has not been observed in older models.

How does this impact AI deployment?

Malformed tool calls can cause increased errors and retries, reducing reliability and trustworthiness in automated tasks such as coding, editing, or data management.

Will this problem be fixed?

Developers are expected to investigate and address this issue through training adjustments, schema constraints, or validation improvements in upcoming model updates.

Source: Hacker News

Author

Deep Intellica Team
