Anthropic's Circuit Tracing Reveals LLM Inner Workings

LLMs Are Not Black Boxes: Anthropic's Circuit Tracing Breakthrough

For years, the narrative held that large language models (LLMs) are inscrutable black boxes. Anthropic's 2025 paper, "On the Biology of a Large Language Model," challenges that notion. Using a technique called circuit tracing, researchers reverse-engineered parts of the internal computation of Claude 3.5 Haiku, revealing human-interpretable features and multi-step reasoning chains. It is arguably mechanistic interpretability's most visible win yet.

The Superposition Problem

A single neuron in an LLM typically responds to many unrelated concepts. Conversely, any given concept is smeared across many neurons. This "superposition" makes it unreliable to read meaning from individual activations: you cannot look at one unit and know what the model is computing.
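To make superposition concrete, here is a toy sketch rather than anything taken from the paper. It assumes three made-up concept directions packed into just two "neurons"; the names, angles, and helper functions are all hypothetical.

```python
import math

# Hypothetical setup: three concept directions squeezed into two neurons,
# spaced 120 degrees apart so they interfere as little as possible.
ANGLES = {"texas": 90, "olympics": 210, "rhyme": 330}
FEATURES = {
    name: (math.cos(math.radians(deg)), math.sin(math.radians(deg)))
    for name, deg in ANGLES.items()
}


def activations(active):
    """Sum the directions of the active concepts into two neuron values."""
    return (
        sum(FEATURES[name][0] for name in active),
        sum(FEATURES[name][1] for name in active),
    )


def readout(acts):
    """Project the neuron values back onto each concept direction."""
    return {
        name: acts[0] * d[0] + acts[1] * d[1]
        for name, d in FEATURES.items()
    }


if __name__ == "__main__":
    acts = activations(["texas"])
    print("neuron values:", acts)
    print("concept readout:", readout(acts))
```

In this toy, the second neuron moves whenever any of the three concepts is active, so its value alone does not say which concept is present. Projecting onto the concept direction does: with only "texas" active, the readout is about 1.0 for "texas" and about -0.5 for the other two.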

Circuit Tracing: How It Works

Anthropic's solution trains a separate "replacement" model whose sparsely active units recreate the outputs of the base model's MLP layers. This decomposes the base model's activations into sparse features. Remarkably, many of these features correspond to high-level concepts humans can identify, such as "Texas" or "the Olympics."
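The idea of sparsely recreating an MLP's output can be sketched in a few lines. This is a minimal illustration and not Anthropic's implementation: the weights are hand-picked rather than trained, the "MLP output" is a made-up vector, and the function names and the `l1_coeff` value are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def replacement_forward(x, w_enc, b_enc, w_dec):
    """Encode the MLP input into non-negative features, then decode
    an estimate of the MLP output from those features."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps most features at zero
    return features, matvec(w_dec, features)


def loss(y, y_hat, features, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that rewards sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return mse + l1_coeff * sum(abs(f) for f in features)


# Hypothetical hand-picked weights: 2 inputs, 4 features, 2 outputs.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]

if __name__ == "__main__":
    mlp_input = [2.0, -3.0]
    mlp_output = [2.0, -3.0]  # stand-in for a recorded MLP output
    feats, estimate = replacement_forward(mlp_input, W_ENC, B_ENC, W_DEC)
    print("features:", feats)
    print("estimate:", estimate)
    print("loss:", loss(mlp_output, estimate, feats))
```

Here only two of the four features fire for this input, yet the decoded estimate matches the target exactly. In a real setup the weights would be learned by minimizing this kind of loss over many recorded activations.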

Once you have these interpretable features, you group them into causally linked clusters by tracing how they interact during a forward pass. The result is a wiring diagram of the computation.

Multi-Step Reasoning Observed

When you ask the model "What is the capital of the state containing Dallas?" you can observe:

The Dallas feature activates.
This causes the Texas feature to light up.
Then Austin activates.
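The chain above can be pictured as a path through a small weighted graph. The sketch below is a toy: the node names, edge weights, and both helper functions are hypothetical and are not read off a real attribution graph.

```python
# Hypothetical edge weights: how strongly one node drives the next.
EDGES = {
    "token:Dallas": {"feature:Texas": 0.9},
    "token:capital": {"feature:say-a-capital": 0.8},
    "feature:Texas": {"output:Austin": 0.7},
    "feature:say-a-capital": {"output:Austin": 0.6},
}


def strongest_path(graph, start, goal, seen=()):
    """Depth-first search for the path with the largest product of weights.

    Returns (score, path); the path is empty if the goal is unreachable.
    """
    if start == goal:
        return 1.0, [start]
    best = (0.0, [])
    for nxt, weight in graph.get(start, {}).items():
        if nxt in seen:
            continue  # avoid revisiting nodes on the current path
        score, path = strongest_path(graph, nxt, goal, seen + (start,))
        if path and weight * score > best[0]:
            best = (weight * score, [start] + path)
    return best


def ablate(graph, node):
    """Return a copy of the graph with one node and its edges removed."""
    return {
        src: {dst: w for dst, w in targets.items() if dst != node}
        for src, targets in graph.items()
        if src != node
    }


if __name__ == "__main__":
    print(strongest_path(EDGES, "token:Dallas", "output:Austin"))
    print(strongest_path(ablate(EDGES, "feature:Texas"),
                         "token:Dallas", "output:Austin"))
```

In this toy graph, the only route from the Dallas token to the Austin output runs through the Texas feature, so removing that feature leaves no path at all. That is the sense in which the intermediate concept is doing causal work.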

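This is genuine multi-step reasoning via an intermediate concept, not a memorized question-to-answer shortcut. The paper also reports that the model "thinks ahead" when writing a poem, picking candidate rhyme words before it composes the line that ends on them. Both behaviors look like a kind of pseudo-symbolic inference, loosely what is sometimes called "higher reasoning."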

Not Just for LLMs

This phenomenon isn't unique to language models. DeepMind (2022) showed that AlphaZero, an MCTS-based system, acquired internal representations of human chess concepts like "in check" and "pinning a piece" entirely on its own, through self-play and with no human chess knowledge supplied.

Better Understanding → Better Algorithms

Breaking down a model's implicit reasoning can guide algorithm design. For example, Claude 3.5 Haiku learned an algorithm for small-integer addition that doesn't map to human mental math. It splits the problem into parallel pathways, one computing a rough magnitude and another the exact ones digit, and then recombines them, leaning on memorized "lookup table" features. The natural next step is to identify where such learned algorithms are suboptimal and steer the model toward better ones.
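As a loose illustration of the parallel-pathway idea, and not a reconstruction of the actual circuit, the sketch below adds two-digit numbers by combining a deliberately imprecise magnitude estimate with an exact ones-digit lookup. The bucketing scheme and all function names are assumptions made for this example.

```python
# Memorized table: (ones digit of a, ones digit of b) -> ones digit of a + b.
ONES_LOOKUP = {(i, j): (i + j) % 10 for i in range(10) for j in range(10)}


def coarse_digit(d):
    """Collapse a ones digit to the middle of its half: 0-4 -> 2, 5-9 -> 7."""
    return 2 if d < 5 else 7


def rough_magnitude(a, b):
    """Low-precision pathway: exact tens, blurry ones (off by at most 4)."""
    tens = 10 * (a // 10 + b // 10)
    return tens + coarse_digit(a % 10) + coarse_digit(b % 10)


def ones_digit(a, b):
    """High-precision pathway: look up the last digit of the sum."""
    return ONES_LOOKUP[(a % 10, b % 10)]


def combine(estimate, ones):
    """Pick the number nearest the estimate that ends in the right digit."""
    base = estimate - estimate % 10 + ones
    return min((base - 10, base, base + 10), key=lambda c: abs(c - estimate))


def add(a, b):
    """Add two integers in 0..99 by recombining the two pathways."""
    return combine(rough_magnitude(a, b), ones_digit(a, b))


if __name__ == "__main__":
    print(rough_magnitude(36, 59), ones_digit(36, 59), add(36, 59))
```

For 36 + 59, the rough pathway lands on 94 and the lookup pathway says the answer ends in 5, so the combination step settles on 95. Because the estimate is never off by more than 4, exactly one candidate with the right last digit sits within reach.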

The Model Has a Subconscious

The model itself lacks metacognitive insight into its own reasoning. Ask it to explain how it added two numbers, and it narrates a tidy, human-style procedure rather than the algorithm it actually ran. This subconscious layer is precisely what circuit tracing exposes.

Why This Matters for Developers

Mechanistic interpretability offers concrete tools for identifying model misbehavior, steering outputs, and designing better learning algorithms. Unlike the vague promise of "explainable AI," this is an empirical approach whose claims can be tested by intervening on the model. On the prompts where the method succeeds, you can trace a model's computation from input to output, feature by feature.

Practical Implications

Debugging: Spot dangerous intent or hallucination pathways.
Steering: Adjust internal features to bias model behavior.
Algorithm Design: Learn from the model's discovered algorithms to improve training.
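Steering is the easiest of these to sketch. The example below assumes a feature is represented as a direction in activation space and that steering means adding a scaled copy of that direction; the vectors and function names are hypothetical and not tied to any particular library.

```python
def feature_score(activation, direction):
    """How strongly the activation points along a feature direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, strength):
    """Nudge an activation vector along a feature direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


if __name__ == "__main__":
    direction = [0.6, 0.8]   # hypothetical unit-length feature direction
    activation = [1.0, 0.0]  # hypothetical activation at one position
    steered = steer(activation, direction, strength=2.0)
    print("before:", feature_score(activation, direction))
    print("after: ", feature_score(steered, direction))
```

Because the direction has unit length, the feature's score rises by exactly the chosen strength. The same score function could serve the debugging case, for instance by flagging inputs where a feature of concern crosses a threshold.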

What You Should Do Now

Read Anthropic's original paper. Experiment with sparse autoencoders on small models. The tools for circuit tracing are still emerging, but the foundation is solid. The black box is open.
