The $200 Crash That Changed Everything

A support automation agent got stuck in a retry loop. No single run burned $200, but the projection did: the same bad loop, the same document search, and the same model calls, left running inside an overnight batch. The answer looked polished enough to pass a sleepy review. The trace behind it was not polished at all.

That's when the author stopped treating the agent like a chat feature and started treating it like a system that needs a black box.

Not a dashboard. Not a full observability stack. Not another hosted service. Just one local file that can answer one question: what actually happened during this run?

The Shape Of The Problem

A normal Python script usually fails in one place. An agent fails across a chain:

User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer

If you only log the final answer, you have a diary entry. If you record the chain, you have evidence.

The simplest useful format is JSONL—one event per line:

{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}

JSONL appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.
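
To make the crash-survival point concrete, here is a minimal reader sketch. It assumes the worst a crash does is leave a truncated final line, and the name read_events is hypothetical, not part of the recorder shown in this article.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def read_events(path: str) -> Iterator[dict[str, Any]]:
    """Yield one event per line, skipping lines that are not valid JSON."""
    with Path(path).open(encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                # A crash mid-write can leave a truncated final line.
                continue
            if isinstance(event, dict):
                yield event
```

Every complete line stays readable even if the last write was cut short. A single large JSON document with a missing closing bracket would fail to parse at all.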

The 71-Line Recorder

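Here's the core AgentBlackBox class. It does four things: stamps every event with a run ID, appends events to a JSONL file, redacts values stored under secret-looking keys, and times each tool call while recording any exception it raises.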

from __future__ import annotations
import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4

SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)

@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)

def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            if SECRET_KEYS.search(str(key)):
                cleaned[key] = "[redacted]"
            else:
                cleaned[key] = sanitize(item)
        return cleaned
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value

class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )
        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)
        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )

The sanitize() function is not perfect—it's a seatbelt, not a vault. Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.
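
One gap worth knowing about: sanitize() only inspects dictionary keys, so a secret embedded in a string value (a header line, a pasted command) passes straight through. A possible complement is sketched below. The name scrub_values and both patterns are assumptions for illustration, and they will not match every credential format.

```python
import re
from typing import Any

# Hypothetical patterns; tune them for the credentials your stack uses.
SECRET_VALUES = re.compile(
    r"(sk-[A-Za-z0-9_-]{8,}|Bearer\s+[A-Za-z0-9._~+/=-]+)"
)


def scrub_values(value: Any) -> Any:
    """Redact secret-looking substrings inside string values."""
    if isinstance(value, str):
        return SECRET_VALUES.sub("[redacted]", value)
    if isinstance(value, dict):
        return {key: scrub_values(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub_values(item) for item in value]
    return value
```

Running this after the key-based pass would give two layers of protection, though it is still a seatbelt: anything the patterns do not match is written as-is.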

Add A Cheap Run Guard

Most runaway agent stories start with a loop that looked harmless. The black box should also record when it refused to continue:

class RunStopped(RuntimeError):
    pass

def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )
    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")

This is not exact billing—use your provider response for real token counts when you have them. The goal is a local tripwire that leaves a clear reason when the run stops.
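
To see both tripwires fire, here is a small self-contained sketch. MemoryBox, guard, and run_until_stopped are hypothetical stand-ins: guard is a condensed copy of the checks above, and MemoryBox keeps events in a list instead of a file. The loop is deliberately unbounded, which is the situation where the turn limit matters.

```python
class RunStopped(RuntimeError):
    pass


class MemoryBox:
    """Hypothetical in-memory recorder with the same record() shape."""

    def __init__(self) -> None:
        self.events = []

    def record(self, event_type: str, **data) -> None:
        self.events.append({"type": event_type, **data})


def guard(box, turn: int, max_turns: int, spent_usd: float, max_usd: float) -> None:
    """Condensed version of the guard: record the reason, then raise."""
    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")


def run_until_stopped(cost_per_turn: float, max_turns: int, max_usd: float) -> MemoryBox:
    box = MemoryBox()
    turn, spent_usd = 0, 0.0
    try:
        while True:  # the guard is the only exit, as in a runaway retry loop
            turn += 1
            guard(box, turn, max_turns, spent_usd, max_usd)
            spent_usd += cost_per_turn  # pretend each turn costs this much
    except RunStopped:
        pass
    return box
```

With an expensive turn the budget check stops the run first. With a cheap turn the turn limit does. Either way the last recorded event says why.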

A Tiny Agent Loop

Here's a fake loop that uses the black box:

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    # Placeholder per-token prices for the demo, not real provider rates.
    return input_tokens * 0.0000005 + output_tokens * 0.0000015

def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01
    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)
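    for turn in range(1, max_turns + 1):
        # range() already caps the turn count here, so in this demo only the
        # budget check can fire; the turn check matters in an unbounded loop.
        stop_if_needed(
            box,
            turn=turn,
            max_turns=max_turns,
            spent_usd=spent_usd,
            max_usd=max_usd,
        )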
        box.record("turn_start", turn=turn, message_count=len(messages))
        # Pretend the model picked this tool input.
        query = question if turn == 1 else "python jsonl duckdb traces"
        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")
        messages.append({"role": "tool", "content": "\n".join(docs)})
        turn_cost = estimate_cost(
            input_tokens=sum(len(m["content"].split()) for m in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost
        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )
    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer

Run it with a normal question, then with a bad one (e.g., "timeout during document search"). The second run fails, but it leaves a trail in traces/run.jsonl.
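
The loop calls search_docs, which is not defined in the listing. A hypothetical stand-in is sketched below so the example can run end to end. The failure trigger (any query containing "timeout") and the returned text are assumptions made for the demo, not a real search client.

```python
import time


def search_docs(query: str, api_key: str) -> list[str]:
    """Hypothetical stand-in for a real document search client."""
    time.sleep(0.01)  # simulate a little network latency
    if "timeout" in query.lower():
        # Simulated failure so the black box has something to record.
        raise TimeoutError("Document search timed out")
    return [
        f"Doc about {query}: append one JSON event per line.",
        f"Doc about {query}: query the trace after the run.",
    ]
```

Because the first turn passes the question through as the query, a question containing "timeout" fails on turn one and the tool_error event lands in the trace before the exception propagates.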

Query The Crash With DuckDB

Install DuckDB: pip install duckdb. Then query the trace:

import duckdb

def query_trace(path: str = "traces/run.jsonl") -> None:
    con = duckdb.connect()
    con.sql(f"""
        create or replace view events as
        select *
        from read_json_auto('{path}');
    """)
    print("Event counts")
    con.sql("""
        select type, count(*) as events
        from events
        group by type
        order by events desc;
    """).show()
    print("Tool errors")
    con.sql("""
        select
            data.tool as tool,
            data.error_type as error_type,
            data.error as error,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_error';
    """).show()
    print("Slow tools")
    con.sql("""
        select
            data.tool as tool,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_end'
        order by data.duration_ms desc
        limit 5;
    """).show()

Running query_trace() gives output like:

+-------------+--------+
| type        | events |
+-------------+--------+
| guard_check |      4 |
| turn_start  |      3 |
| tool_start  |      3 |
| tool_end    |      2 |
| tool_error  |      1 |
| guard_stop  |      1 |
+-------------+--------+

And the crash row is now a query result:

+-------------+--------------+---------------------------+-------------+
| tool        | error_type   | error                     | duration_ms |
+-------------+--------------+---------------------------+-------------+
| search_docs | TimeoutError | Document search timed out |      147.82 |
+-------------+--------------+---------------------------+-------------+

You can answer questions that normal print logs make annoying: Which tools failed most often? Which tool was slowest? Which turn crossed the budget? Did the same input fail repeatedly? Did the guard stop the run, or did the tool crash first?
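
If DuckDB is not available, two of those questions can be answered with the standard library alone. The sketch below is one way to do it. failure_summary is a hypothetical helper, and it assumes tool calls are not interleaved, because the recorder writes the input on tool_start and not on tool_error.

```python
import json
from collections import Counter


def failure_summary(path: str) -> tuple[Counter, Counter]:
    """Count tool errors per tool and per (tool, input) pair."""
    last_input: dict[str, str] = {}  # most recent input seen for each tool
    by_tool: Counter = Counter()
    by_input: Counter = Counter()
    with open(path, encoding="utf-8") as file:
        for line in file:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a blank or truncated line
            if not isinstance(event, dict):
                continue
            data = event.get("data", {})
            if not isinstance(data, dict):
                continue
            tool = data.get("tool")
            if event.get("type") == "tool_start":
                last_input[tool] = json.dumps(data.get("input"), sort_keys=True)
            elif event.get("type") == "tool_error":
                by_tool[tool] += 1
                by_input[(tool, last_input.get(tool, "unknown"))] += 1
    return by_tool, by_input
```

by_tool.most_common() ranks tools by failure count, and any by_input entry above 1 means the same input failed more than once.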

What To Record In A Real Project

For a real project, add: model, provider, prompt_hash, tool_schema_version, input_tokens, output_tokens, finish_reason, retry_count, user_id_hash, environment. Don't record raw access tokens, private documents, full customer prompts, full tool responses with sensitive data, cookies, or request headers.

The boring security rule: Record enough to debug behavior. Do not record enough to harm someone.
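
Fields like user_id_hash and prompt_hash can be produced with the standard library. The sketch below is one possible approach. The function names, the keyed-hash choice for user IDs, and the 16-character truncation are assumptions for illustration. Where the key lives and how it is rotated is up to you.

```python
import hashlib
import hmac


def user_id_hash(user_id: str, key: bytes) -> str:
    """Keyed SHA-256 digest, so raw user IDs stay out of the trace."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def prompt_hash(prompt: str) -> str:
    """Short unkeyed digest, enough to tell whether two runs shared a prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
```

A keyed hash makes it harder to recover a user ID by hashing guesses, as long as the key is kept out of the trace file. The prompt hash only needs to tell prompts apart, so an unkeyed digest is enough when the prompt itself is not sensitive.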

The Pattern In One Sentence

Every agent run should produce a local, append-only event stream that is safe to keep, easy to query, and useful after the process crashes.