TL;DR
Recent observations indicate that the newest Anthropic language models generate more malformed tool calls than older models, especially when calling Pi’s edit tool. This suggests a decline in tool call accuracy in state-of-the-art models and raises questions about training and robustness.
Recent testing shows that the latest Anthropic models, including Opus 4.8 and Sonnet 5, generate more malformed tool calls than older models when invoking Pi’s edit tool. Because reliable tool use is critical for deploying AI in automation and coding tasks, the trend raises concerns about how robust these models are in practice.
Multiple users and researchers have observed that newer Anthropic language models produce tool call payloads containing invented, nonsensical fields, especially in multi-turn or context-rich interactions. These malformed calls often contain extraneous keys such as type, id, and requireUnique, which Pi’s validation schema rejects. The issue is most prominent in Opus 4.8 and Sonnet 5 and has not been observed in older models, which points to a possible regression linked to training updates or model architecture changes.
Researchers note that the errors tend to occur in complex, context-dependent prompts, such as multi-file edits, and are less frequent in simple, single-turn prompts. In the reported cases the core payload is generated correctly; the failure is the extra, invalid fields that the models add alongside it.
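The rejection behaviour described above can be illustrated with a minimal sketch of a strict validator. The article does not give the real schema of Pi’s edit tool, so the field names path, old_text, and new_text and the function validate_edit_call are hypothetical; only the extraneous keys type, id, and requireUnique come from the reports.

```python
# Hypothetical schema: the article does not list the real fields of Pi's edit tool.
ALLOWED_KEYS = {"path", "old_text", "new_text"}


def validate_edit_call(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the call is accepted."""
    keys = set(payload)
    errors = []
    for key in sorted(ALLOWED_KEYS - keys):
        errors.append(f"missing required field: {key}")
    for key in sorted(keys - ALLOWED_KEYS):
        errors.append(f"unexpected field: {key}")
    return errors


# The core fields are correct, but the call also carries the invented keys
# named in the reports, so a strict schema rejects the whole payload.
malformed = {
    "path": "src/app.py",
    "old_text": "foo",
    "new_text": "bar",
    "type": "edit",
    "id": "1",
    "requireUnique": True,
}

well_formed = {"path": "src/app.py", "old_text": "foo", "new_text": "bar"}

print(validate_edit_call(malformed))
print(validate_edit_call(well_formed))
```

Under these assumptions the first call is rejected solely because of the three unexpected keys, even though every required field is present and valid, which matches the pattern the researchers describe.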
Implications for AI Reliability in Tool Integration
This development matters because it points to a potential decline in the robustness of state-of-the-art language models when making structured tool calls, which are essential for automating coding, data manipulation, and other technical tasks. Invalid or malformed tool call payloads lead to higher failure rates and lower trust in AI systems, and they force developers to add validation layers that complicate deployment in critical applications.

Evolution of Tool Call Handling in Anthropic Models
Earlier versions of Anthropic’s models were trained primarily on plain text and limited tool interaction schemas, which resulted in relatively stable tool call outputs. However, recent models, especially those integrated with Claude Code and advanced harnesses, have been trained on more complex tool invocation schemas, including nested and serialized JSON formats. This shift aims to improve flexibility but appears to have introduced new failure modes, notably the production of nonsensical or extra fields in tool calls.
The issue was first noticed in July 2026 by researchers testing multi-turn interactions, where the models’ tool calls increasingly included extraneous keys that violate the expected schema, leading to rejection and retries. Notably, this problem does not appear in older models, suggesting a change in training data or model architecture may be responsible.
“The newer models are producing nested edit calls with a zoo of invented keys, which are then rejected by Pi’s validation schema. It’s a regression in tool call reliability.”
— Researcher observing the issue

Extent and Causes of the Tool Call Degradation
It remains unclear whether this issue is widespread across all recent models, specific to certain configurations, or a transient artifact of recent training updates. The precise mechanisms causing models to generate extraneous fields are still under investigation, with some hypotheses pointing to changes in training data, schema enforcement, or decoding strategies.

Monitoring and Mitigating Future Tool Call Failures
Researchers and developers are expected to conduct further tests across different model versions and training regimes to determine the scope of this issue. Efforts may include refining training data, applying stricter decoding constraints, or developing post-processing validation to improve tool call accuracy. Model developers are also likely to issue updates or patches to address these failures as they become better understood.
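One form the post-processing validation mentioned above could take is a lenient pre-filter that drops unknown keys before the payload reaches the strict schema and records what was dropped. This is only a sketch under stated assumptions: the article does not describe any such layer in Pi, and the field names and the function sanitize_tool_call are hypothetical.

```python
# Hypothetical schema: the real fields of Pi's edit tool are not given in the article.
ALLOWED_KEYS = {"path", "old_text", "new_text"}


def sanitize_tool_call(payload: dict, allowed: set[str]) -> tuple[dict, list[str]]:
    """Split a payload into the fields the schema knows and the keys it does not.

    Returns the cleaned payload plus a sorted list of dropped keys, so the
    caller can log how often a model invents fields instead of hiding it.
    """
    cleaned = {key: value for key, value in payload.items() if key in allowed}
    dropped = sorted(key for key in payload if key not in allowed)
    return cleaned, dropped


raw_call = {
    "path": "src/app.py",
    "old_text": "foo",
    "new_text": "bar",
    "type": "edit",
    "id": "1",
    "requireUnique": True,
}

cleaned, dropped = sanitize_tool_call(raw_call, ALLOWED_KEYS)
print(cleaned)
print(dropped)
```

The trade-off is that silently discarding fields avoids a retry but can mask a model regression, so logging the dropped keys per model version keeps the problem measurable. A payload that is missing required fields should still be rejected rather than repaired.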

Key Questions
Why are newer models producing worse tool calls?
The exact cause is still under investigation, but it may relate to recent training practices that incorporate more complex schemas, leading models to generate extraneous or malformed fields in structured outputs.
Does this affect all AI models?
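No. Based on current reports, the issue appears specific to recent Anthropic models such as Opus 4.8 and Sonnet 5 and has not been observed in older models. How widespread it is across other recent models and configurations remains unclear.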
How does this impact AI deployment?
Malformed tool calls can cause increased errors and retries, reducing reliability and trustworthiness in automated tasks such as coding, editing, or data management.
Will this problem be fixed?
Developers are expected to investigate and address this issue through training adjustments, schema constraints, or validation improvements in upcoming model updates.
Source: Hacker News