Single-Judge Evals Are Misleading
A recent benchmark of six AI agents across eleven skills, graded by three different LLM judges (Sonnet, GPT-5.5, Opus-4-7), shows that scores and rankings shift dramatically depending on the judge. One model, gpt-5.3, swung 47 points on a single skill between the most and least generous judge. The lesson: if you trust eval numbers from a single LLM judge, you're partly measuring judge preference, not model capability.
The Setup
The benchmark tested six models (opus-4-7, composer, gpt-5.5, gpt-5.4, gpt-5.3, gpt-5-codex) on eleven agent skills, with five scenarios per skill. The only variable was the grading model. Each judge scored every run independently using the same rubric. The Tessl UI normally averages results, but the raw data reveals significant variance.
Judge Strictness Varies
Averaged across all models and skills, Sonnet graded most generously (76.1 baseline, 90.3 with-skill), GPT-5.5 graded strictest (70.7 baseline, 83.4 with-skill), and Opus-4-7 fell in between (72.6 baseline, 88.3 with-skill). The gap between Sonnet and GPT-5.5 is 5.4 points at baseline and 6.9 points with-skill. If your pipeline uses Sonnet as its default judge, expect scores 5–7 points higher than a stricter grader would return.
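The gap arithmetic can be checked directly from the judge averages quoted above. This is a minimal sketch; the dictionary layout and the function name `strictness_gap` are illustrative assumptions, not part of the benchmark's tooling.

```python
# Judge-level averages quoted in this article: (baseline, with_skill).
JUDGE_AVERAGES = {
    "Sonnet": (76.1, 90.3),
    "GPT-5.5": (70.7, 83.4),
    "Opus-4-7": (72.6, 88.3),
}


def strictness_gap(generous: str, strict: str) -> tuple[float, float]:
    """Return (baseline gap, with-skill gap) between two judges, in points."""
    g_base, g_skill = JUDGE_AVERAGES[generous]
    s_base, s_skill = JUDGE_AVERAGES[strict]
    # Round to one decimal to match the precision of the published numbers.
    return round(g_base - s_base, 1), round(g_skill - s_skill, 1)


baseline_gap, with_skill_gap = strictness_gap("Sonnet", "GPT-5.5")
print(f"baseline gap: {baseline_gap}, with-skill gap: {with_skill_gap}")
```

The two gaps differ, which is why a single "average gap" figure hides whether you are comparing baseline or with-skill runs.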
Rankings Shift Under Different Judges
While opus-4-7 held first place under all three judges, every other model moved position. For example, gpt-5.3 ranked third under Sonnet (91.9) but fell to fifth under both GPT-5.5 (75.7) and Opus (84.0). Conversely, gpt-5.5 ranked fifth under Sonnet (87.4) but climbed to second under both GPT-5.5 (88.4) and Opus (92.3). The swing, meaning the gap between a model's highest and lowest score across the three judges, ranged from 2.5 points for composer to 16.2 points for gpt-5.3. That 16.2-point swing is larger than the gap between first and last place in the averaged rankings.
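Swing and per-judge ordering are simple to compute once scores are keyed by model and judge. The sketch below uses only the three models whose scores under all three judges are quoted in this article, so the ordering it prints is relative to that subset rather than the full six-model ranking. The names `SCORES`, `swing`, and `order_under` are illustrative assumptions.

```python
# Scores quoted in this article, keyed by model and then by judge.
SCORES = {
    "opus-4-7": {"Sonnet": 94.5, "GPT-5.5": 89.2, "Opus-4-7": 96.5},
    "gpt-5.5": {"Sonnet": 87.4, "GPT-5.5": 88.4, "Opus-4-7": 92.3},
    "gpt-5.3": {"Sonnet": 91.9, "GPT-5.5": 75.7, "Opus-4-7": 84.0},
}


def swing(by_judge: dict[str, float]) -> float:
    """Gap between the most and least generous judge for one model."""
    return round(max(by_judge.values()) - min(by_judge.values()), 1)


def order_under(judge: str) -> list[str]:
    """Models sorted best-first according to a single judge."""
    return sorted(SCORES, key=lambda model: SCORES[model][judge], reverse=True)


for model, by_judge in SCORES.items():
    print(model, "swing:", swing(by_judge))

for judge in ("Sonnet", "GPT-5.5", "Opus-4-7"):
    print(judge, "order:", order_under(judge))
```

Even within this subset, gpt-5.3 and gpt-5.5 trade places depending on which judge produces the ordering.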
Self-Judge Bias Is Measurable
The data suggests that LLM judges can favor their own model family, though not uniformly. Opus gave opus-4-7 a score of 96.5, while Sonnet gave it 94.5 and GPT-5.5 gave it 89.2. That is a boost of about 4.6 points over the 91.85 average of the other two judges. The gpt-5.5 case did not show the same pattern: GPT-5.5 scored its own model lower (88.4) than the other judges' average (89.9). Self-favor exists but is not symmetric. If you use Claude models to grade Claude outputs, this data points to an upward bias of roughly 4–5 points, though that estimate rests on a single model.
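The self-judge comparison is one subtraction: the judge's score for its own model minus the mean of the other judges' scores for that model. A minimal sketch follows, using the scores quoted above; the function name `self_preference` is an illustrative assumption.

```python
from statistics import mean


def self_preference(scores_by_judge: dict[str, float], own_judge: str) -> float:
    """Own-judge score minus the mean score from all other judges.

    Positive means the judge scored its own model higher than its peers did.
    """
    own = scores_by_judge[own_judge]
    others = [s for judge, s in scores_by_judge.items() if judge != own_judge]
    return own - mean(others)


# Scores quoted in this article for the two models that also served as judges.
opus_scores = {"Sonnet": 94.5, "GPT-5.5": 89.2, "Opus-4-7": 96.5}
gpt55_scores = {"Sonnet": 87.4, "GPT-5.5": 88.4, "Opus-4-7": 92.3}

opus_bias = self_preference(opus_scores, "Opus-4-7")
gpt55_bias = self_preference(gpt55_scores, "GPT-5.5")
print(f"Opus-4-7 judging opus-4-7: {opus_bias:+.2f}")
print(f"GPT-5.5 judging gpt-5.5:  {gpt55_bias:+.2f}")
```

The signs differ between the two cases, which is the asymmetry described above.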
What the gpt-5.3 Drop Tells Us
gpt-5.3 scored 91.9 under Sonnet, 75.7 under GPT-5.5, and 84.0 under Opus. Two judges independently gave substantially lower scores, indicating the Sonnet-only number was inflated. Per-skill data suggests gpt-5.3 produces outputs that are directionally correct but not precisely correct. Sonnet gives partial credit; GPT-5.5 does not. If you care about exact specification compliance, GPT-5.5's score is more informative. For general capability, the average of the three judges is probably the better estimate.
Lift Scores Also Vary
Lift—the gap between baseline and with-skill performance—also varies by judge. For instance, gpt-5.3's lift ranged from 16.1 (Sonnet) to 22.9 (Opus). The rubric was identical; the disagreement is about whether gpt-5.3's output counted as genuine compliance or a close approximation. A single judge cannot tell you which reading is correct.
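The same effect shows up in aggregate. The article does not quote gpt-5.3's per-judge baselines, so the sketch below computes lift from the judge-level averages given in the Judge Strictness section instead. The names `JUDGE_AVERAGES` and `lift` are illustrative assumptions.

```python
# Judge-level averages quoted in this article: (baseline, with_skill).
JUDGE_AVERAGES = {
    "Sonnet": (76.1, 90.3),
    "GPT-5.5": (70.7, 83.4),
    "Opus-4-7": (72.6, 88.3),
}


def lift(baseline: float, with_skill: float) -> float:
    """Lift is the with-skill score minus the baseline score."""
    return round(with_skill - baseline, 1)


lift_by_judge = {judge: lift(*pair) for judge, pair in JUDGE_AVERAGES.items()}
lift_spread = round(max(lift_by_judge.values()) - min(lift_by_judge.values()), 1)

for judge, value in lift_by_judge.items():
    print(f"{judge}: lift {value}")
print(f"spread in lift across judges: {lift_spread}")
```

Averaged over all models and skills, the judges still disagree about how much the skills helped, although by less than in the gpt-5.3 case.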
Practical Fixes
- Run multiple judges and average. Three independent judges smooth out individual preferences and produce more stable numbers.
- Design rubrics with binary criteria wherever possible. "Is the file deleted?" yields consistent scores across judges; "How well did the agent explain the migration?" does not.
- If you know which model you'll use in development, favor that model as the judge.
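To make the binary-criteria point concrete, here is a minimal sketch in which each judge returns a yes/no verdict per criterion. The criterion names, judge labels, and verdicts are hypothetical data invented for illustration; they are not taken from the benchmark or from any Tessl rubric format.

```python
def score(verdicts: dict[str, bool]) -> float:
    """Percentage of binary criteria a judge marked as satisfied."""
    return 100 * sum(verdicts.values()) / len(verdicts)


def disagreements(by_judge: dict[str, dict[str, bool]]) -> list[str]:
    """Criteria on which the judges did not all return the same verdict."""
    criteria = next(iter(by_judge.values())).keys()
    return [c for c in criteria if len({v[c] for v in by_judge.values()}) > 1]


# Hypothetical verdicts from two judges on the same run (assumed data).
verdicts_by_judge = {
    "judge_a": {
        "file_deleted": True,
        "tests_pass": True,
        "migration_applied": True,
        "explanation_clear": True,
    },
    "judge_b": {
        "file_deleted": True,
        "tests_pass": True,
        "migration_applied": True,
        "explanation_clear": False,
    },
}

for judge, verdicts in verdicts_by_judge.items():
    print(judge, score(verdicts))
print("disputed criteria:", disagreements(verdicts_by_judge))
```

In this made-up example the only disputed criterion is the subjective one, so listing disputed criteria is a quick way to find rubric items that need tightening.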
Code Example: Running a Multi-Judge Eval with Tessl
To run a benchmark with a specific agent and scorer, use the Tessl CLI:
$ curl -fsSL https://get.tessl.io | sh
$ tessl scenario generate --count=5
$ tessl scenario download --last
$ tessl eval run --agent=claude:claude-opus-4-6 --scorer-agent codex:gpt-5.5
Repeat the last command with different --scorer-agent values (e.g., claude:claude-sonnet-4-6, claude:claude-opus-4-7) and average the results.
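The repetition can be scripted. The sketch below only builds and prints the three command lines, using the flags and agent identifiers exactly as they appear above; it does not run them. The helper name `eval_command` is an illustrative assumption, and collecting scores from the output is left out because the output format is not described here.

```python
import shlex

# Agent and scorer identifiers copied from the commands above.
AGENT = "claude:claude-opus-4-6"
SCORER_AGENTS = [
    "codex:gpt-5.5",
    "claude:claude-sonnet-4-6",
    "claude:claude-opus-4-7",
]


def eval_command(agent: str, scorer_agent: str) -> list[str]:
    """Build the argument list for one eval run with one judge."""
    return ["tessl", "eval", "run", f"--agent={agent}", "--scorer-agent", scorer_agent]


commands = [eval_command(AGENT, scorer) for scorer in SCORER_AGENTS]

# Dry run: print each command instead of executing it.
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the runs rather than print them, each list could be passed to `subprocess.run(cmd, check=True)`.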
Conclusion
Single-judge evals are unreliable. The models that performed consistently across all judges—gpt-5.4 and composer—share one characteristic: their outputs are correct rather than approximately correct. A strict grader and a generous one disagree less when the answer is unambiguous. If you're publishing or trusting eval numbers from a single LLM judge, you're benchmarking judge preference as much as model capability. Run multiple judges, average the scores, and design for binary criteria.



