Frontier LLMs broke open CTF: GPT-5.5 one-shots insane heap pwn

Open online CTF competitions are no longer about human skill. The author, a former top-10 CTFTime competitor with TheHackersCrew, argues that frontier LLMs have automated the majority of challenges. The scoreboard now measures orchestration ability and token budget, not security expertise.

The breaking point: Claude Opus 4.5 and GPT-5.5

When Claude Opus 4.5 dropped, the tone shifted. Almost every medium difficulty challenge, and some hard challenges, became agent-solvable. Claude Code packaged everything into a CLI, making it trivial to build an orchestrator that spins up a Claude instance for every challenge via the CTFd API. Teams that refused to use AI were playing a slower version of the competition.

GPT-5.5 and GPT-5.5 Pro sealed the deal. By benchmark metrics, GPT-5.5 is close to Claude Mythos' capability, and Pro likely surpasses it. These models can one-shot Insane difficulty active leakless heap pwn challenges on HackTheBox. If you orchestrate Pro against Insane challenges in a 48-hour CTF, there is a good chance you get the flag before the event ends.

The scoreboard is broken

The CTFTime leaderboard no longer reflects human skill. The 2026 scoreboard is unrecognisable compared to every year before it. TheHackersCrew and many other large teams either do not play, play with far fewer people, or struggle to cut into the top 10. Unregulated cheating is through the roof. Some of the best CTFs, like Plaid CTF, are not running anymore.

Organizers can't fight back

CTF organizers have tried techniques to break or deter LLM solutions, but they are temporary friction at best. Claude Code does not meaningfully care about old refusal-string tricks anymore. Frontier models are getting better at noticing prompt injections. Web search capabilities weaken challenges based on technologies released after the training cutoff. Rules that ask people not to use LLMs are ignored and almost impossible to enforce in open online events.

If organizers make normal challenges, agents solve too much. If they make challenges deliberately hostile to agents, the challenges often become guessy, overengineered, or unpleasant for humans too. That is not a real fix.

The beginner's ladder is gone

CTFs were not just a set of puzzles. They were a ladder. Beginners could see themselves improve, solve more challenges, place higher, join better teams, and become more competitive over time. That feedback loop is breaking. If the visible scoreboard is dominated by teams using AI, a beginner is pushed toward using AI before they have built the instincts the AI is replacing. That prevents active learning, and active struggle is the bit that actually teaches you.

Beginners are better off using picoGym, HackTheBox, and other lab environments where the point is actually learning instead of pretending the public scoreboard still reflects human growth.

The "AI is a chess engine" analogy fails

Chess engines are not allowed during competitive play. They are used for analysis, training, commentary, and practice. They enrich the game around the competition without replacing the person competing. Imagine giving every competitive chess player the best chess engine and letting them use it freely during matches. Would that be considered fair? Would it be fun to watch? The same questions apply to CTFs.

What now?

The community around CTFing has been an amazing place to learn, grow, and connect. That's something we shouldn't lose. As a community, we should strive to stay together and build new avenues to stay passionate and keep learning. Security-adjacent social events like SecTalks, student conferences, and local meetups are great ways to stay connected. Learning platforms and the communities they provide through Discord are also a valuable resource.