The Nuke Heard Round the Bench

An AI agent, playing Civilization VI as Portugal, built two nuclear devices and leveled Toulouse. It was trying to stop France's culture victory. It lost anyway: France won by diplomacy seven turns after the second strike. This is not a bug report. It's the headline result from CivBench, a new benchmark designed to test whether AI can sustain strategic reasoning over hundreds of decisions.

What Is CivBench?

CivBench is the creation of Luke Wilko, an AI researcher at the Tony Blair Institute. He hacked Civilization VI's debug port into an MCP (Model Context Protocol) server with 76 tools, letting frontier models play the game through text-based function calls. The agent sees no map and no animations, just raw data. A call to get_game_overview returns a few lines of text like:

Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed
Gold: 628 (+20/turn) | Science: 26.6 | Culture: 16.2
Cities: 3 | Population: 21 | Units: 4

To see nearby threats, it must call get_units. If it doesn't ask, the threat doesn't exist in its world.
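To make the narrowness of that view concrete, here is a minimal sketch of how a harness might parse the overview text. It assumes the "Key: value" layout of the sample above holds in general; parse_overview is a hypothetical helper for illustration, not part of CivBench.

```python
import re

# Sample text copied from the get_game_overview output quoted above.
SAMPLE = """Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed
Gold: 628 (+20/turn) | Science: 26.6 | Culture: 16.2
Cities: 3 | Population: 21 | Units: 4"""


def parse_overview(text):
    """Extract the turn counter and every numeric 'Key: value' field.

    Hypothetical helper: assumes the layout shown in SAMPLE.
    """
    state = {}
    turn = re.search(r"Turn (\d+)/(\d+)", text)
    if turn:
        state["turn"] = int(turn.group(1))
        state["max_turn"] = int(turn.group(2))
    # Matches pairs such as "Gold: 628" or "Science: 26.6".
    for key, value in re.findall(r"(\w+): (-?\d+(?:\.\d+)?)", text):
        state[key.lower()] = float(value)
    return state


state = parse_overview(SAMPLE)
print(state["turn"], state["gold"], state["units"])  # 150 628.0 4.0
```

Everything the agent knows after this call fits in that small dictionary. Nothing in it describes foreign units, which is why a separate get_units call is needed before a threat can register at all.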

The Sensorium Effect

Wilko calls this the "sensorium effect": when an agent perceives everything through separate tool calls, it goes blind to anything it doesn't think to ask about. In one early game, the agent played Byzantium—a civilization built around religion—and never founded one. Russia converted the entire map while the agent had no religion-monitoring tools. A human would have seen missionary icons for 100 turns.

Even when the tools exist, the agent can ignore them. Playing India, the agent had religion-monitoring tools and standing instructions to respond to conversion. It ignored both and kept pushing science. France won a religious victory after 76 turns of conversion.
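One way to picture the effect is to track how long it has been since each perception tool was last called. The sketch below is an illustration under stated assumptions, not something taken from the CivBench code: StalenessTracker is a hypothetical class, and get_religion_status is a made-up tool name (only get_units appears in the article).

```python
class StalenessTracker:
    """Tracks the last turn on which each watched tool was called."""

    def __init__(self, watchlist, max_age):
        self.watchlist = list(watchlist)
        self.max_age = max_age
        self.last_called = {}  # tool name -> turn of most recent call

    def record(self, tool, turn):
        """Note that `tool` was called on `turn`."""
        self.last_called[tool] = turn

    def stale(self, turn):
        """Return watched tools never called, or not called within max_age turns."""
        overdue = []
        for tool in self.watchlist:
            last = self.last_called.get(tool)
            if last is None or turn - last > self.max_age:
                overdue.append(tool)
        return overdue


# "get_religion_status" is a hypothetical tool name used for illustration.
tracker = StalenessTracker(["get_units", "get_religion_status"], max_age=5)
tracker.record("get_units", turn=10)
print(tracker.stale(turn=12))  # ['get_religion_status']
print(tracker.stale(turn=20))  # ['get_units', 'get_religion_status']
```

A list like this only surfaces what the agent has not looked at. As the India game shows, knowing that a tool is overdue does not guarantee the agent will act on what it returns.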

The Knowing–Doing Gap

The agent knows every strategy guide. Ask it how to play Macedon, and it'll recite: build Encampments early, train units through the Basilikoi Paides, convert conquest into science. In its Macedon game, it wrote a detailed plan across four eras. It never built the Encampment. Not once in 110 turns. It defaulted to a generic science sprint, the same strategy for every civilization. Its diary repeatedly noted: "I need to build military infrastructure." Each time, no action.

This mirrors findings from the BALROG benchmark across game environments: a persistent gap between models' ability to articulate optimal strategies and their ability to execute them.

The Nuke in Detail

Playing Portugal, the agent finally found a non-science strategy: trade routes → gold → envoys → city-state alliances → diplomatic favor → World Congress votes. It built Commercial Hubs in every city, peaked at 400+ gold per turn, and reached 18 of 20 diplomatic victory points by the endgame.

But France was running two clocks: a culture victory (26 foreign tourists away) and a diplomatic victory (2 votes away). The agent locked onto the culture threat. Its diary: "This is the PRIMARY THREAT." Lesser counters failed: Rock Bands couldn't be activated through the debug protocol, melee combat dealt zero damage, and a space project was blocked by a production bug.

So the agent executed a 50-turn plan: it set Nuclear Fission as its research target, named Toulouse as the target in its diary, started the Manhattan Project, and brokered a joint war with Korea. When conventional warfare failed, it used its Lua execution tool to probe the engine's code until it found how nuclear launch commands worked. At turn 305, the first nuke hit Toulouse. At turn 311, a second.

The culture clock stopped. France won by diplomacy at turn 318.

The agent's post-game note: France "reached 20 first through… WC votes that we couldn't monitor, victory progress tool broken." It nuked the threat it could see and lost to the threat it couldn't.
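The two-clock failure can be sketched as a triage problem in which an unmonitored clock is reported as unknown rather than silently dropped. This is an illustrative sketch only: Clock and triage are hypothetical names, and the turns_left value for the culture clock is a placeholder, not a figure from the game logs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clock:
    name: str
    # Estimated turns until this victory lands.
    # None means no working tool reports on this clock.
    turns_left: Optional[float]


def triage(clocks: List[Clock]) -> Tuple[List[str], List[str]]:
    """Split clocks into unmonitored ones and monitored ones ranked by urgency."""
    unmonitored = [c.name for c in clocks if c.turns_left is None]
    monitored = sorted(
        (c for c in clocks if c.turns_left is not None),
        key=lambda c: c.turns_left,
    )
    return unmonitored, [c.name for c in monitored]


clocks = [
    Clock("culture", turns_left=12.0),     # placeholder estimate, not from the logs
    Clock("diplomatic", turns_left=None),  # the progress tool was broken
]
print(triage(clocks))  # (['diplomatic'], ['culture'])
```

Ranking only the monitored clocks would name culture as the primary threat, which is what the agent did. Reporting the unmonitored list separately keeps the diplomatic clock on the table even when nothing is measuring it.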

What This Means for AI Safety

Wilko's point: this isn't a Civ bug. Any AI operating through tool calls in a complex environment is exposed to the same effect. It misses what it doesn't think to ask about, and it ignores what it does see when that doesn't fit the current plan.

CivBench is open-source. The MCP server, agent logs, and replay files are available on GitHub. Wilko's previous benchmark, GovBench, tested AI on 3,497 multiple-choice questions about UK legislation. GPT-5 scored 99.26%. But as Wilko writes: "I'd measured recall and called it reasoning."

CivBench measures something harder: whether an AI can hold a goal across hundreds of decisions, notice when the world has changed, and adapt. The answer so far: not reliably.

Try It Yourself

Clone the repo, point it at your Civ VI install, and watch an agent flail. The 76-tool MCP server works with any model supporting function calling. Wilko's logs show Claude Code as both developer and playtester. The sensorium effect and knowing-doing gap are reproducible. Run your own games. See if your model does better.

If it does, publish the logs. If it doesn't, you'll understand why Wilko says: "We are much better at measuring the first thing than the second."