Claude Opus 4.8 and Sonnet 5 Generate Invalid Tool Calls with Nested Schemas

Armin Ronacher reports that newer Anthropic models (Opus 4.8, Sonnet 5) invent extra keys in nested tool call arguments, breaking Pi's edit tool. The failure is context-dependent, and Ronacher attributes it to post-training on Claude Code's forgiving harness. Strict mode fixes it.

3 min read · Jul 5, 2026

Opus 4.8 and Sonnet 5 Invent Keys in Nested Tool Calls

Armin Ronacher (creator of Flask) discovered that Anthropic's latest models—Opus 4.8 and Sonnet 5—frequently add made-up fields to nested tool call arguments. The edit tool in his project Pi accepts an array of edits with oldText and newText. The models append keys like requireUnique, matchCase, oldText2, and even event.0.additionalProperties. The actual edit content is byte-correct, but the extra keys cause schema validation to fail.

This regression is surprising: older models (e.g., Opus 4.5) handled the schema correctly. Ronacher tested across multiple sessions and found a failure rate of around 20% for Opus 4.8 on one user's transcript. Stripping thinking blocks from history halved the rate. Enabling strict mode eliminated it entirely.
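
The validation failure described above can be illustrated with a minimal sketch. It assumes each edit object may only contain oldText and newText (the two fields the article names); the function name find_extra_keys and the sample arguments are hypothetical and are not taken from Pi's code.

```python
# Assumption: an edit object may contain only these two keys.
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def find_extra_keys(edits):
    """Return (index, key) pairs for every key outside the allowed set."""
    problems = []
    for index, edit in enumerate(edits):
        for key in edit:
            if key not in ALLOWED_EDIT_KEYS:
                problems.append((index, key))
    return problems


# Hypothetical arguments shaped like the reported failures: the edit text
# is usable, but an invented key rides along in the first object.
edits = [
    {"oldText": "foo", "newText": "bar", "requireUnique": True},
    {"oldText": "a", "newText": "b"},
]

print(find_extra_keys(edits))
```

A strict validator rejects the whole call as soon as this list is non-empty, even though the oldText and newText values would have applied cleanly.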

Tool Calls Are Learned Text, Not Magic

LLM tool calls rely on in-band signalling: the model generates structured text that the API interprets as a function invocation. For Anthropic models, this appears as XML-like tags:



some/file.py

[
  {
    "oldText": "text to replace",
    "newText": "replacement text"
  }
]

Top-level string parameters appear inline; arrays of objects are embedded as raw JSON. Without grammar-constrained decoding, the model follows learned patterns. Ronacher hypothesizes that Anthropic's post-training (likely including Claude Code) biases models toward Claude Code's flat edit schema (file_path, old_string, new_string, replace_all).
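
To make the contrast between the two shapes concrete, here is a rough sketch of both schemas as Python dicts. The field names oldText, newText, edits, file_path, old_string, new_string and replace_all come from the article; the top-level name path, the types, and the required lists are assumptions for illustration, and property_paths is a hypothetical helper.

```python
# Sketch of a nested edit schema like the one described for Pi.
NESTED_EDIT_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},  # assumed name for the file parameter
        "edits": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "oldText": {"type": "string"},
                    "newText": {"type": "string"},
                },
                "required": ["oldText", "newText"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["path", "edits"],
}

# Sketch of the flat schema the article attributes to Claude Code.
FLAT_EDIT_SCHEMA = {
    "type": "object",
    "properties": {
        "file_path": {"type": "string"},
        "old_string": {"type": "string"},
        "new_string": {"type": "string"},
        "replace_all": {"type": "boolean"},  # assumed to be the optional flag
    },
    "required": ["file_path", "old_string", "new_string"],
}


def property_paths(schema, prefix=""):
    """List dotted paths of all properties, descending into arrays and objects."""
    paths = []
    for name, sub in schema.get("properties", {}).items():
        path = prefix + name
        paths.append(path)
        if sub.get("type") == "array":
            paths.extend(property_paths(sub.get("items", {}), path + "[]."))
        elif sub.get("type") == "object":
            paths.extend(property_paths(sub, path + "."))
    return paths


print(property_paths(NESTED_EDIT_SCHEMA))
print(property_paths(FLAT_EDIT_SCHEMA))
```

In the flat schema every field is a top-level parameter. In the nested one, the fields the model has to get right sit one level down inside a JSON array, which is where the invented keys show up.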

Why Newer Models Are Worse

Claude Code's closed-source harness is extremely forgiving. Minified code reveals it accepts parameter aliases (e.g., old_str for old_string, path for file_path), filters unknown keys, and repairs Unicode escapes. Reinforcement learning in such an environment rewards sloppy tool calls—the harness absorbs errors, so the model never learns strict schema adherence.
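
A forgiving harness of the kind described can be sketched in a few lines. This is not Claude Code's implementation, which is closed source; it only mirrors two of the behaviours the article lists (parameter aliases and dropping unknown keys), and the function name forgiving_normalize is hypothetical.

```python
# Aliases mentioned in the article: old_str -> old_string, path -> file_path.
ALIASES = {"old_str": "old_string", "path": "file_path"}
KNOWN_KEYS = {"file_path", "old_string", "new_string", "replace_all"}


def forgiving_normalize(args):
    """Map aliases to canonical names and silently drop anything unknown."""
    normalized = {}
    for key, value in args.items():
        canonical = ALIASES.get(key, key)
        if canonical in KNOWN_KEYS:
            # Keep the first value seen if an alias and its canonical name collide.
            normalized.setdefault(canonical, value)
    return normalized


clean_call = {"file_path": "a.py", "old_string": "x", "new_string": "y"}
sloppy_call = {"path": "a.py", "old_str": "x", "new_string": "y", "matchCase": True}

print(forgiving_normalize(sloppy_call) == forgiving_normalize(clean_call))
```

Because the sloppy call and the clean call normalize to the same arguments, an outcome-based reward computed after this step cannot tell them apart.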

Ronacher's key insight: "The better-trained model might actually fight you harder because its prior is stronger." Opus 4.8 and Sonnet 5 have a stronger prior that an edit tool should have a flat structure with one optional flag. When faced with Pi's nested edits[] array, they invent plausible names for the perceived missing field.

Strict Mode Fixes It

Anthropic's strict mode enforces JSON schema conformance via grammar-constrained sampling, preventing the model from emitting invalid keys. Ronacher confirmed that turning on strict mode eliminates the failures. However, strict mode imposes complexity limits on tool definitions, which is why Claude Code doesn't use it.
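
As a hedged sketch of what a strict tool definition might look like, the snippet below builds the tool as a plain dict and makes no API call. The assumptions are: strict mode is requested with a per-tool "strict" flag, the tool is named edit, the file parameter is named path, and every object in the schema should be closed with additionalProperties set to false. Check the current Anthropic tool use documentation for the exact field names and the supported schema subset before relying on any of this.

```python
import copy


def close_objects(schema):
    """Return a copy of schema with additionalProperties=False on every object."""
    schema = copy.deepcopy(schema)

    def walk(node):
        if not isinstance(node, dict):
            return
        if node.get("type") == "object":
            node["additionalProperties"] = False
            for sub in node.get("properties", {}).values():
                walk(sub)
        elif node.get("type") == "array":
            walk(node.get("items"))

    walk(schema)
    return schema


# Hypothetical input schema for a nested edit tool.
EDIT_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "edits": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "oldText": {"type": "string"},
                    "newText": {"type": "string"},
                },
                "required": ["oldText", "newText"],
            },
        },
    },
    "required": ["path", "edits"],
}

edit_tool = {
    "name": "edit",
    "description": "Apply a list of exact text replacements to one file.",
    "input_schema": close_objects(EDIT_INPUT_SCHEMA),
    # Assumption: strict mode is opted into with this per-tool flag.
    "strict": True,
}
```

With constrained sampling, a closed schema leaves no valid continuation in which the model can emit a key such as requireUnique, which is why the failures disappear rather than merely becoming rarer.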

Comparison with OpenAI

OpenAI's Codex models (tested up to version 5.5) do not exhibit this regression. Their Harmony format uses <|constrain|>json markers that allow the inference stack to switch to JSON-constrained sampling for tool call bodies. This makes schema adherence more reliable.

Practical Implications

• If you build tool harnesses for Anthropic models, expect nested JSON array parameters to produce hallucinated keys. Use strict mode or implement server-side filtering.
• Avoid complex nested schemas. Flat parameter lists (like Claude Code's own tools) are less prone to invented keys.
• Test with agentic histories. The failure is context-dependent, so single-turn prompts may not trigger it.
• Consider grammar-constrained decoding for custom harnesses to enforce the schema at generation time.
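
The filtering fallback mentioned above can be sketched as a schema-driven pass that strips unknown keys and reports what it dropped. This is a minimal illustration that only handles the object and array cases; the function name strip_unknown, the schema, and the sample arguments are hypothetical.

```python
def strip_unknown(value, schema, path=""):
    """Return (cleaned_value, dropped_paths) for a JSON-Schema-like dict."""
    dropped = []
    kind = schema.get("type")
    if kind == "object" and isinstance(value, dict):
        props = schema.get("properties", {})
        cleaned = {}
        for key, item in value.items():
            child = f"{path}.{key}" if path else key
            if key in props:
                cleaned[key], sub = strip_unknown(item, props[key], child)
                dropped.extend(sub)
            else:
                dropped.append(child)  # unknown key: drop it, but record it
        return cleaned, dropped
    if kind == "array" and isinstance(value, list):
        cleaned = []
        for index, item in enumerate(value):
            new_item, sub = strip_unknown(item, schema.get("items", {}), f"{path}[{index}]")
            cleaned.append(new_item)
            dropped.extend(sub)
        return cleaned, dropped
    return value, dropped


# Hypothetical schema and arguments for a nested edit tool.
schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "edits": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "oldText": {"type": "string"},
                    "newText": {"type": "string"},
                },
            },
        },
    },
}
args = {
    "path": "a.py",
    "edits": [{"oldText": "x", "newText": "y", "matchCase": False}],
}

cleaned, dropped = strip_unknown(args, schema)
print(cleaned)
print(dropped)
```

Logging the dropped paths rather than discarding them silently keeps the failure rate measurable, which matters when testing against long agentic histories.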

The uncomfortable lesson: tool schemas are not neutral. Anthropic's post-training pipeline optimizes for one specific, forgiving tool ecology. Alternative schemas become increasingly off-distribution as models improve. Until Anthropic documents or opens their harness, developers building on their API must account for this silent regression.

Editor's Take

I've been building LLM-based tools for two years, and this regression worries me. I switched from OpenAI to Anthropic for better coding performance, but now I'm seeing my edit tool calls fail 20% of the time with Opus 4.8. The fix—strict mode—limits schema complexity, which defeats the purpose of using nested arrays. Honestly, I'm considering moving back to OpenAI until Anthropic addresses this. The core issue is that closed-source post-training creates invisible biases that only surface in production.

— DevDigest Editorial

Key Takeaways

• Enable strict mode in Anthropic API calls to enforce JSON schema and prevent hallucinated keys.
• Flatten tool schemas to avoid nested JSON arrays; use individual parameters instead.
• Implement server-side filtering to strip unexpected keys from tool call arguments as a fallback.

Why It Matters

If you use Anthropic models with custom tools, newer models (Opus 4.8, Sonnet 5) may silently break nested schemas by adding extra keys. This regression is not present in older Anthropic models or in OpenAI's models. You need to add validation or switch to flat schemas to maintain reliability.
