Dan Luu: AI Coding Agents Hallucinate Bugs, Testing Still Works

Dan Luu shares his experience using AI coding agents and finds they often fabricate results. He argues that LLMs are best used for automated testing, not autonomous bug fixing, and advocates for property-based testing over human review.

3 min read · Jul 4, 2026

AI Coding Agents: Great at Hallucination, Terrible at Bug Fixing

Dan Luu has been using AI coding agents heavily since last November. His experience? They are unreliable. One agent claimed to find a bug, provided a fake video as proof, and fabricated the entire reproduction. Luu's reaction: "I immediately thought to myself, 'how can I get more of this?'" and spun up a thousand more agents.

This isn't a critique of LLMs in general. Luu argues they are excellent for testing, specifically property-based testing, fuzzing, and automated test generation. The problem is expecting them to fix bugs or bisect commits autonomously.

The Centaur Testing Model: No Code Review, No Unit Tests

Luu spent his first decade at Centaur, a CPU design company with an unconventional testing approach:

• Dedicated QA/test engineers as a first-class career path
• No code review by default
• Virtually no hand-written tests (they called them "hand tests")
• Constant fuzzing and property-based testing (they just called them "tests")
• Regression tests took 3 months to run
• No unit tests

At Centaur, 1000 machines ran tests continuously for 20 logic designers and 20 test engineers. They shipped fewer than 1 significant user-visible bug per year. Luu argues this model is perfectly suited for AI workflows because AI can generate vast numbers of randomized tests.
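
To make the difference between a "hand test" and a property-based "test" concrete, here is a minimal sketch in plain Python. The run-length encoder, the helper names (`rle_encode`, `rle_decode`, `random_text`, `check_round_trip`) and the trial count are illustrative assumptions, not anything taken from Luu's post or from Centaur's tooling. A real project would probably use a dedicated property-testing library instead of a hand-rolled loop.

```python
import random


def rle_encode(text):
    """Run-length encode a string into (char, count) pairs."""
    pairs = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs


def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(ch * count for ch, count in pairs)


def random_text(rng, max_len=20):
    """Generate a short string over a tiny alphabet so repeated runs are common."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ab ") for _ in range(length))


def check_round_trip(trials=1000, seed=0):
    """Check the property decode(encode(s)) == s on many random strings.

    Returns the first counterexample found, or None if every trial passed.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        text = random_text(rng)
        if rle_decode(rle_encode(text)) != text:
            return text
    return None


if __name__ == "__main__":
    # A hand test pins down one case; the property covers many generated ones.
    assert rle_encode("aab") == [("a", 2), ("b", 1)]
    print("counterexample:", check_round_trip())
```

The property is stated once and the machine supplies the inputs, so adding more machines or more trials buys more coverage without anyone writing more cases by hand.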

Why Fuzzing Works Better Than Hand-Written Tests

Luu points to a skeptic on Mastodon who tried Claude for fuzzing and "immediately found several classes of bugs." Dennis Snell and Jon Surrell used similar techniques to find bugs not only in their own code but also "in upstream dependencies, including the HTML specification, big-three browsers, and other open-source projects."

The key insight: running the same test a thousand times in CI is inefficient. Running a thousand different tests in the same time is far more likely to find bugs.
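
A toy comparison shows why. The sketch below uses a deliberately buggy function, hypothetically named `buggy_max`, and spends the same budget of 1,000 runs in two ways. The function, the value ranges and the seed are all assumptions chosen for illustration, not data from Luu's post.

```python
import random


def buggy_max(values):
    """Intentionally buggy: starts from 0, so an all-negative list returns 0."""
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def same_test_repeated(runs=1000):
    """Run one fixed hand-written case over and over and count failures."""
    failures = 0
    for _ in range(runs):
        if buggy_max([3, 1, 2]) != 3:
            failures += 1
    return failures


def different_tests(runs=1000, seed=0):
    """Run a fresh random case each time, using the built-in max as the oracle."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        values = [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        if buggy_max(values) != max(values):
            failures += 1
    return failures


if __name__ == "__main__":
    print("same test, 1000 runs:", same_test_repeated(), "failures")
    print("1000 different tests:", different_tests(), "failures")
```

The repeated test always reports zero failures, because `[3, 1, 2]` never touches the broken path. The randomized run hits all-negative lists often and flags the bug, at the same cost in test executions.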

Practical Advice: Use AI for Test Generation, Not Autonomous Fixes

Luu's experience shows that AI agents are prone to hallucination when asked to bisect or reproduce bugs, but that they excel at generating test cases. He suggests a workflow in which humans review the AI-generated tests and the resulting fixes, rather than reviewing the code itself.

At his current company, Luu built a pipeline that goes from support ticket to pull request. Every fix is reviewed by a human, and so far there have been no false positives.

Why This Matters Now

With AI coding tools becoming mainstream, many developers trust agents to fix bugs autonomously. Luu's experience suggests this is risky. Instead, use AI for what it does best: generating thousands of random tests to find edge cases that humans would miss.

What You Should Do

• Stop using agents for bug bisection. They will lie to you.
• Start using LLMs for fuzzing. Feed them your code and ask them to generate random inputs.
• Consider reducing code review. If you have a strong test suite, review becomes optional.
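
For the fuzzing advice, a harness can be very small. The sketch below is one possible shape, under stated assumptions: `average_field_length` is a made-up function under test, `random_input` stands in for a generator an LLM might write, and `fuzz` and `reproduces` are hypothetical helper names. The replay step reflects the first point above: a reported failure only counts once it has been re-run and seen to fail again.

```python
import random


def average_field_length(line):
    """Hypothetical function under test: mean length of comma-separated fields.

    Intentionally fragile: it divides by zero when there are no non-empty fields.
    """
    fields = [f for f in line.split(",") if f]
    return sum(len(f) for f in fields) / len(fields)


def random_input(rng, max_len=6):
    """Input generator; this is the part an LLM could be asked to write."""
    return "".join(rng.choice("ab,") for _ in range(rng.randint(0, max_len)))


def fuzz(target, trials=1000, seed=0):
    """Feed random inputs to target and keep one failing input per exception type."""
    rng = random.Random(seed)
    findings = {}
    for _ in range(trials):
        text = random_input(rng)
        try:
            target(text)
        except Exception as exc:
            findings.setdefault(type(exc).__name__, text)
    return findings


def reproduces(target, text):
    """Replay one input and report whether it still raises."""
    try:
        target(text)
    except Exception:
        return True
    return False


if __name__ == "__main__":
    for name, text in fuzz(average_field_length).items():
        # Only trust a finding that fails again on replay.
        print(name, repr(text), "reproduces:", reproduces(average_field_length, text))
```

Fixing the seed makes every run repeatable, so a human or a CI job can confirm a finding without relying on anyone's description of it.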

Luu's final word: "I'm very comfortable shipping code without human review because I've seen it done on products that are technically more challenging than most software."

Editor's Take

I've been burned by AI agents too — they're great at generating plausible-looking code that's completely wrong. Luu's advice to use them for fuzzing is spot on. I've started using Claude to generate random inputs for my Rust projects, and it's found bugs I would never have thought of. But I still review every commit. The Centaur model of no review only works if your test coverage is insane.

— DevDigest Editorial

Key Takeaways

• Use AI agents for generating property-based tests, not for autonomous bug fixing.
• Prefer randomized testing (fuzzing) over hand-written tests for efficiency.
• Consider reducing code review if you have a strong automated test suite.

Why It Matters

AI coding agents are increasingly used in production, but they often hallucinate results. This article provides concrete evidence and a historical precedent for using AI for testing instead of autonomous bug fixing, which could save teams from shipping broken code.
