LLMs Are Not Black Boxes: Anthropic's Circuit Tracing Breakthrough
For years, the prevailing narrative held that large language models (LLMs) are inscrutable black boxes. Anthropic's 2025 paper, "On the Biology of a Large Language Model," challenges that notion. Using a technique called circuit tracing, the researchers reverse-engineered parts of the internal computation of Claude 3.5 Haiku on a set of case-study prompts, revealing human-interpretable features and multi-step reasoning chains. It is arguably mechanistic interpretability's most visible result so far.
The Superposition Problem
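A single neuron in an LLM typically participates in many unrelated concepts. Conversely, any given concept is spread across many neurons. This happens because the model represents more concepts than it has neurons, packing them into overlapping directions in activation space, a phenomenon called "superposition." As a result, reading meaning from individual neuron activations is unreliable: you cannot just look at one unit and know what the model is computing.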
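To make the idea concrete, here is a toy sketch (not taken from the paper) that packs three concepts into a two-neuron activation space. The concept names and the choice of directions spaced 120 degrees apart are assumptions made purely for illustration.

```python
import math

# Hypothetical toy: 3 concepts stored in a 2-neuron activation space.
# Three directions cannot all be orthogonal in 2 dimensions, so they overlap.
CONCEPTS = ["texas", "olympics", "rhyme"]
DIRECTIONS = {
    name: (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i, name in enumerate(CONCEPTS)
}

def activation(active):
    """Sum the directions of the active concepts into one 2-neuron activation."""
    x = sum(DIRECTIONS[c][0] for c in active)
    y = sum(DIRECTIONS[c][1] for c in active)
    return (x, y)

def readout(act):
    """Project the activation onto every concept direction (dot product)."""
    return {c: act[0] * d[0] + act[1] * d[1] for c, d in DIRECTIONS.items()}

# How strongly neuron 0 responds to each concept: it is nonzero for all three.
neuron0_weights = {c: d[0] for c, d in DIRECTIONS.items()}

# Activate only "texas" and read every concept back out.
scores = readout(activation(["texas"]))
```

Neuron 0 carries weight for all three concepts, so its value alone does not say which one is present. Reading out along each concept direction recovers "texas" most strongly, but the other two concepts pick up nonzero interference, which is the cost of storing more concepts than there are neurons.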
Circuit Tracing: How It Works
Anthropic's solution trains a separate "replacement" model that reproduces the outputs of the base model's MLP layers using sparsely active features. This decomposes the base model's computation into features of which only a few are active on any given input. Remarkably, many of these features correspond to high-level concepts humans can identify, like "Texas" or "the Olympics."
Once you have these interpretable features, you trace how they influence one another during a forward pass on a specific prompt, then group related features into causally linked clusters. The result is an attribution graph: a wiring diagram of the computation for that prompt.
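The training objective behind such a replacement layer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's architecture: the weight matrices, the `l1_coeff` value, and the function names are all hypothetical, and the real replacement model is considerably more involved.

```python
def relu(v):
    """Elementwise ReLU: negative pre-activations become inactive features."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def transcoder_loss(mlp_in, mlp_out, w_enc, w_dec, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    features = relu(matvec(w_enc, mlp_in))   # sparse feature activations
    recon = matvec(w_dec, features)          # approximation of the MLP output
    mse = sum((r - t) ** 2 for r, t in zip(recon, mlp_out)) / len(mlp_out)
    sparsity = sum(features)                 # L1 norm (features are >= 0)
    return mse + l1_coeff * sparsity, features

# Hypothetical weights: 2-dimensional MLP input/output, 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_dec = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

loss, features = transcoder_loss([1.0, 0.0], [2.0, 0.0], w_enc, w_dec)
```

In this example only the first feature fires, the reconstruction matches the target exactly, and the remaining loss comes entirely from the sparsity penalty. Training would adjust `w_enc` and `w_dec` to trade reconstruction accuracy against the number of active features.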
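The tracing step can be pictured as building a weighted graph over active features. The sketch below assumes a purely linear interaction between features, so that the direct effect of one feature on another is its activation times a connecting weight. The feature names, weights, and pruning threshold are hypothetical, and the paper's actual procedure is more involved.

```python
def attribution_edges(activations, weights, threshold=0.1):
    """Return edges whose direct contribution is large enough to keep.

    activations: feature name -> activation on this prompt
    weights: (source, target) -> hypothetical linear effect of source on target
    """
    edges = {}
    for (src, dst), w in weights.items():
        contribution = activations.get(src, 0.0) * w
        if abs(contribution) >= threshold:  # prune negligible edges
            edges[(src, dst)] = contribution
    return edges

# Hypothetical feature activations for one prompt.
activations = {"Dallas": 1.0, "capital": 0.8, "Texas": 0.9}
weights = {
    ("Dallas", "Texas"): 0.9,
    ("Texas", "Austin"): 0.7,
    ("capital", "Austin"): 0.5,
    ("Dallas", "Austin"): 0.05,
}

edges = attribution_edges(activations, weights)
```

Here the weak direct edge from "Dallas" to "Austin" is pruned, leaving a graph in which the path through "Texas" dominates. Pruning is what turns a dense web of tiny influences into a diagram a person can read.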
Multi-Step Reasoning Observed
When you ask the model "What is the capital of the state containing Dallas?" you can observe:
- The Dallas feature activates.
- This causes the Texas feature to light up.
- Then Austin activates.
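This is evidence of multi-step reasoning via intermediate concepts: the model composes two facts internally rather than retrieving a single memorized answer. The model even "thinks ahead" when writing a poem, selecting candidate rhyme words before composing the line that ends in them. In effect, it performs a kind of pseudo-symbolic inference over learned features.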
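A causal claim like this is usually checked by intervening on the intermediate step and seeing whether the answer changes accordingly. The following toy sketch mimics that logic with plain dictionaries. It is an analogy, not the model's mechanism, and the `override_state` parameter and lookup tables are assumptions for illustration.

```python
# Hypothetical two-hop "circuit": city -> state -> capital.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def answer(city, override_state=None):
    """Resolve a capital in two hops, optionally patching the middle step."""
    state = CITY_TO_STATE[city]        # hop 1: which state contains the city?
    if override_state is not None:
        state = override_state         # intervention on the intermediate concept
    return STATE_TO_CAPITAL[state]     # hop 2: what is that state's capital?

baseline = answer("Dallas")
patched = answer("Dallas", override_state="California")
```

If the intermediate "state" step were not really used, patching it would leave the output unchanged. In this toy, swapping the state flips the answer from Austin to Sacramento, which is the signature of a genuine two-hop computation.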
Not Just for LLMs
This phenomenon isn't unique to language models. DeepMind (2022) showed that AlphaZero, an MCTS-based system trained purely by self-play, learned internal representations of human chess concepts like "in check" and "pinning a piece" without being given any human games or chess knowledge beyond the rules.
Better Understanding → Better Algorithms
Breaking down a model's implicit reasoning can guide algorithm design. For example, Claude 3.5 Haiku learned an algorithm for small-integer addition that doesn't map to human mental math. It splits the problem into parallel pathways, one computing a rough magnitude and another the precise ones digit, and recombines them, leaning on memorized "lookup table" features. A natural next step is to identify where such learned algorithms are suboptimal and steer the model toward better ones.
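The flavor of this parallel-pathway strategy can be captured in a short sketch. This is a loose analogy rather than the model's actual circuit: rounding each operand to the nearest multiple of 5 as the "rough magnitude," and the window used to recombine the two pathways, are assumptions chosen so the toy stays exact.

```python
# Memorized lookup table: ones digits of the operands -> ones digit of the sum.
ONES_TABLE = {(x, y): (x + y) % 10 for x in range(10) for y in range(10)}

def round_to_5(n):
    """Round a non-negative integer to the nearest multiple of 5 (error <= 2)."""
    return 5 * ((n + 2) // 5)

def add_via_pathways(a, b):
    """Add two non-negative integers by combining two imprecise pathways."""
    ones = ONES_TABLE[(a % 10, b % 10)]          # precise ones-digit pathway
    estimate = round_to_5(a) + round_to_5(b)     # rough magnitude, off by <= 4
    # Exactly one integer within 4 of the estimate has the right ones digit.
    window = range(estimate - 4, estimate + 5)
    return next(c for c in window if c % 10 == ones)

result = add_via_pathways(36, 59)
```

Neither pathway knows the answer on its own: the magnitude pathway is only accurate to within 4, and the lookup table only knows the last digit. Because candidates sharing a ones digit are 10 apart, combining the two pins down the sum exactly.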
The Model Has a Subconscious
The model itself lacks metacognitive insight into its own reasoning. Ask it to explain how it added two numbers, and it narrates a tidy, human-style procedure (add the ones, carry the one) rather than the algorithm it actually ran. This "subconscious" layer is precisely what circuit tracing exposes.
Why This Matters for Developers
Mechanistic interpretability offers concrete tools for identifying model misbehavior, steering outputs, and designing better learning algorithms. Unlike the vague promise of "explainable AI," this is an empirical approach whose claims can be tested by intervening on the features directly. For a growing set of prompts, you can trace a model's reasoning from input to output, feature by feature.
Practical Implications
- Debugging: Spot the features and pathways that fire on harmful requests or lead to hallucinations.
- Steering: Amplify or suppress internal features to bias model behavior.
- Algorithm Design: Learn from the model's discovered algorithms to improve training.
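Steering in particular has a simple core: nudge an activation vector along a feature's direction and let the rest of the forward pass run. The sketch below shows only that arithmetic. The vectors, the `strength` values, and the function names are hypothetical, and a real experiment would apply this inside a running model.

```python
def steer(activation, direction, strength):
    """Shift an activation vector along a feature direction."""
    return [a + strength * d for a, d in zip(activation, direction)]

def feature_score(activation, direction):
    """Dot product: how strongly the activation expresses the feature."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical 3-dimensional activation and a unit-length feature direction.
activation = [0.2, 0.1, 0.0]
feature_direction = [0.0, 1.0, 0.0]

before = feature_score(activation, feature_direction)
amplified = steer(activation, feature_direction, strength=2.0)
suppressed = steer(activation, feature_direction, strength=-0.1)
after = feature_score(amplified, feature_direction)
```

A positive strength amplifies the feature and a negative one suppresses it, while components orthogonal to the direction stay untouched. Choosing the strength is an empirical matter, since pushing too far can degrade the model's output.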
What You Should Do Now
Read Anthropic's original paper. Experiment with sparse autoencoders on small models. The tools for circuit tracing are still emerging, but the foundation is solid. The black box is starting to open.




