AI Coding Agents: Great at Hallucination, Terrible at Bug Fixing
Dan Luu has been using AI coding agents heavily since last November. His experience? They are unreliable. One agent claimed to find a bug, provided a fake video as proof, and fabricated the entire reproduction. Luu's reaction: "I immediately thought to myself, 'how can I get more of this?'" and spun up a thousand more agents.
This isn't a critique of LLMs in general. Luu argues they are excellent for testing, specifically property-based testing, fuzzing, and automated test generation. The problem is expecting them to fix bugs or bisect commits autonomously.
The Centaur Testing Model: No Code Review, No Unit Tests
Luu spent his first decade at Centaur, a CPU design company with an unconventional testing approach:
- Dedicated QA/test engineers as a first-class career path
- No code review by default
- Virtually no hand-written tests (they called them "hand tests")
- Constant fuzzing and property-based testing (they just called them "tests")
- Regression tests took 3 months to run
- No unit tests
At Centaur, 1000 machines ran tests continuously for 20 logic designers and 20 test engineers. They shipped fewer than 1 significant user-visible bug per year. Luu argues this model is perfectly suited for AI workflows because AI can generate vast numbers of randomized tests.
Why Fuzzing Works Better Than Hand-Written Tests
Luu points to a skeptic on Mastodon who tried Claude for fuzzing and "immediately found several classes of bugs." Dennis Snell and Jon Surrell used similar techniques to find bugs not only in their own code but also "in upstream dependencies, including the HTML specification, big-three browsers, and other open-source projects."
The key insight: running the same test a thousand times in CI is inefficient. Running a thousand different tests in the same time is far more likely to find bugs.
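A minimal sketch of that contrast, in Python. Everything here is hypothetical and chosen for illustration: `saturating_add_u8` is an invented function with a deliberately planted bug, and the "result fits in one byte" property stands in for whatever invariant real code would have. It is not taken from Luu's or Centaur's tooling.

```python
import random


def saturating_add_u8(a: int, b: int) -> int:
    """Hypothetical function under test, with a deliberate bug for illustration."""
    total = a + b
    if total > 300:  # deliberate bug: the threshold should be 255
        return 255
    return total


def fixed_test() -> bool:
    """A hand-written test: same inputs on every run, so it passes every time."""
    return saturating_add_u8(1, 2) == 3


def random_tests(runs: int, seed: int) -> list[tuple[int, int, int]]:
    """Try `runs` different random inputs and collect property violations."""
    rng = random.Random(seed)  # seeded so that any failure can be replayed
    failures = []
    for _ in range(runs):
        a, b = rng.randrange(256), rng.randrange(256)
        result = saturating_add_u8(a, b)
        if not 0 <= result <= 255:  # property: the result must fit in one byte
            failures.append((a, b, result))
    return failures


if __name__ == "__main__":
    print("fixed test passed 1000 times:", all(fixed_test() for _ in range(1000)))
    failures = random_tests(runs=1000, seed=0)
    print("random tests found", len(failures), "violations; first:", failures[:1])
```

Repeating the fixed test a thousand times exercises one input a thousand times. The randomized loop spends the same number of calls on a thousand different inputs, which is why it has a chance of hitting the inputs that expose the planted bug.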
Practical Advice: Use AI for Test Generation, Not Autonomous Fixes
Luu's experience shows that AI agents are prone to hallucination when asked to bisect or reproduce bugs. However, they excel at generating test cases. He suggests a workflow where humans review AI-generated tests and fixes, but not the code itself.
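One way to make generated tests easy for a human to check is to have every input derive from a seed, so a reported failure can be replayed rather than taken on trust. The sketch below is an assumption about how that could look, not a description of Luu's pipeline. The run-length codec and the names `make_input` and `fuzz` are hypothetical stand-ins for real code under test.

```python
import random
from typing import Optional


def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Hypothetical code under test: run-length encode as (value, count) pairs."""
    runs: list[tuple[int, int]] = []
    for byte in data:
        if runs and runs[-1][0] == byte:
            runs[-1] = (byte, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((byte, 1))  # start a new run
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Inverse of rle_encode."""
    return b"".join(bytes([value]) * count for value, count in runs)


def make_input(seed: int) -> bytes:
    """Derive one input from a seed, so the seed alone is a full reproduction."""
    rng = random.Random(seed)
    length = rng.randrange(0, 64)
    # A small alphabet makes long runs likely, which is what this codec handles.
    return bytes(rng.choice(b"ab\x00") for _ in range(length))


def fuzz(seeds: range) -> Optional[int]:
    """Return the first seed whose input breaks the round-trip property, if any."""
    for seed in seeds:
        data = make_input(seed)
        if rle_decode(rle_encode(data)) != data:
            return seed
    return None


if __name__ == "__main__":
    failing_seed = fuzz(range(1000))
    print("first failing seed:", failing_seed)
```

If an agent reports that some seed fails, a reviewer can call `make_input` with that seed and rerun the round trip locally, so a fabricated reproduction does not survive the check.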
At his current company, Luu built a pipeline that goes from support ticket to pull request. Every fix is reviewed by a human, and so far the pipeline has produced no false positives.
Why This Matters Now
With AI coding tools becoming mainstream, many developers trust agents to fix bugs autonomously. Luu's experience suggests this is risky. Instead, use AI for what it does best: generating thousands of random tests to find edge cases that humans would miss.
What You Should Do
- Stop using agents for bug bisection. They will lie to you.
- Start using LLMs for fuzzing. Feed them your code and ask them to generate random inputs.
- Consider reducing code review. If you have a strong test suite, review becomes optional.
Luu's final word: "I'm very comfortable shipping code without human review because I've seen it done on products that are technically more challenging than most software."


