The $200 Crash That Changed Everything
A support automation agent got stuck in a retry loop. No single run burned $200, but the projection did: the same bad loop, the same document search, and the same model calls, left inside an overnight batch. The answer looked polished enough to pass a sleepy review. The trace behind it was not polished at all.
That's when the author stopped treating the agent like a chat feature and started treating it like a system that needs a black box.
Not a dashboard. Not a full observability stack. Not another hosted service. Just one local file that can answer:
- What did the agent try?
- Which tool did it call?
- What input did the tool receive?
- Did the tool fail?
- How long did it take?
- Did the run cross a cost or turn limit?
- Can I query the run after everything is over?
The Shape Of The Problem
A normal Python script usually fails in one place. An agent fails across a chain:
User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer
If you only log the final answer, you have a diary entry. If you record the chain, you have evidence.
The simplest useful format is JSONL—one event per line:
{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}
JSONL appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.
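To make the crash-tolerance point concrete, here is a minimal sketch using only the standard library. The helper names append_event and read_events are illustrative and not part of the recorder in this article. The reader skips any line that does not parse, which is what a crash in the middle of a write would leave behind.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def append_event(path: Path, event: dict[str, Any]) -> None:
    # One JSON object per line; "a" mode never rewrites earlier events.
    with path.open("a", encoding="utf-8") as file:
        file.write(json.dumps(event) + "\n")


def read_events(path: Path) -> Iterator[dict[str, Any]]:
    # Yield every parseable line and skip damaged ones, such as a final
    # line cut short by a crash mid-write.
    with path.open(encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue


trace = Path(tempfile.mkdtemp()) / "demo.jsonl"
append_event(trace, {"type": "tool_start", "tool": "search_docs"})
append_event(trace, {"type": "tool_end", "tool": "search_docs", "ok": True})
# Simulate a crash that left a half-written line at the end of the file.
with trace.open("a", encoding="utf-8") as file:
    file.write('{"type":"turn_e')
print([event["type"] for event in read_events(trace)])
```

With a single large JSON document, the same truncation would make the whole file unparseable; here only the damaged line is lost.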
The 71-Line Recorder
Here's the core AgentBlackBox class. It does four things:
- gives every run a unique id
- writes append-only JSONL events
- measures tool duration
- sanitizes obvious secrets before writing anything to disk
from __future__ import annotations

import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4

SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)


def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            if SECRET_KEYS.search(str(key)):
                cleaned[key] = "[redacted]"
            else:
                cleaned[key] = sanitize(item)
        return cleaned
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )
        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)
        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
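The sanitize() function is not perfect; it's a seatbelt, not a vault. It redacts by key name only, so a secret embedded in a string value passes through untouched, and because the match is a substring search it also redacts harmless keys such as input_tokens. Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.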
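A quick way to see what the key-based redaction catches and what it misses is to run it on a small sample. The sketch below repeats SECRET_KEYS and a condensed sanitize() so that it runs on its own; the sample event is made up for illustration.

```python
import re
from typing import Any

# Same pattern as the recorder, repeated so this snippet is self-contained.
SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


def sanitize(value: Any) -> Any:
    # Condensed version of the recorder's sanitize(); same behavior.
    if isinstance(value, dict):
        return {
            key: "[redacted]" if SECRET_KEYS.search(str(key)) else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


event = {
    "query": "rate limits",                        # ordinary key: kept
    "headers": [{"Authorization": "Bearer abc"}],  # nested secret key: redacted
    "note": "called with api_key=abc",             # secret inside a value: kept
    "input_tokens": 512,                           # contains "token": redacted
}
cleaned = sanitize(event)
for key, value in cleaned.items():
    print(key, "->", value)
```

If token counts need to survive, one option is to record them under a key the pattern does not match, or to tighten the pattern; either choice is a trade-off between over-redacting and under-redacting.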
Add A Cheap Run Guard
Most runaway agent stories start with a loop that looked harmless. The black box should also record when it refused to continue:
class RunStopped(RuntimeError):
    pass


def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )
    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")
This is not exact billing. Use your provider's response for real token counts when you have them. The goal is a local tripwire that leaves a clear reason when the run stops.
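The guard matters most in a loop that has no exit of its own. The following sketch shows that case under a few assumptions: ListRecorder is a hypothetical in-memory stand-in for AgentBlackBox, the guard is a trimmed copy that skips the guard_check event, and the per-turn cost is invented.

```python
from __future__ import annotations

from typing import Any


class RunStopped(RuntimeError):
    pass


class ListRecorder:
    """Hypothetical in-memory stand-in for AgentBlackBox, for this sketch only."""

    def __init__(self) -> None:
        self.events: list[dict[str, Any]] = []

    def record(self, event_type: str, **data: Any) -> None:
        self.events.append({"type": event_type, **data})


def stop_if_needed(box, *, turn, max_turns, spent_usd, max_usd) -> None:
    # Same two checks as the guard in the article, minus the guard_check event.
    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")


def runaway_loop(box: ListRecorder, cost_per_turn: float) -> None:
    # An open-ended retry loop: only the guard can end it.
    turn, spent_usd = 0, 0.0
    while True:
        turn += 1
        stop_if_needed(box, turn=turn, max_turns=50, spent_usd=spent_usd, max_usd=0.01)
        spent_usd += cost_per_turn  # pretend each retry costs this much


box = ListRecorder()
try:
    runaway_loop(box, cost_per_turn=0.004)
except RunStopped as stop:
    print(stop)
print(box.events[-1]["reason"])
```

Whichever limit trips first is written as the last event, so the trace ends with the reason the run stopped rather than with silence.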
A Tiny Agent Loop
Here's a fake loop that uses the black box:
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    # Illustrative per-token prices, not real provider rates.
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01
    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

    for turn in range(1, max_turns + 1):
        stop_if_needed(box, turn=turn, max_turns=max_turns, spent_usd=spent_usd, max_usd=max_usd)
        box.record("turn_start", turn=turn, message_count=len(messages))

        # Pretend the model picked this tool input.
        query = question if turn == 1 else "python jsonl duckdb traces"
        # search_docs is not defined in this article; it is assumed to return
        # a list of strings and to raise when the search fails.
        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")
        messages.append({"role": "tool", "content": "\n".join(docs)})

        # Word count stands in for a real token count.
        turn_cost = estimate_cost(
            input_tokens=sum(len(m["content"].split()) for m in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost
        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )

    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer
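Run it with a normal question, then with a bad one (e.g., "timeout during document search"). Assuming search_docs raises on the bad query, the second run fails, but it leaves a trail: both runs append to traces/run.jsonl, each under its own run_id.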
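The loop calls search_docs, which the article does not show. A hypothetical stand-in like the one below is enough to try both runs; its behavior (failing whenever the query mentions "timeout") is an assumption chosen to match the bad question above, not the author's implementation.

```python
from __future__ import annotations


def search_docs(query: str, api_key: str) -> list[str]:
    """Hypothetical stand-in for a real document search call."""
    # api_key is accepted only to mirror the call in run_agent; it is unused.
    if "timeout" in query.lower():
        raise TimeoutError("Document search timed out")
    return [f"Stub result for: {query}"]
```

Because the first turn uses the question itself as the query, a question containing "timeout" fails on turn 1 and produces a tool_error event before the exception propagates.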
Query The Crash With DuckDB
Install DuckDB: pip install duckdb. Then query the trace:
import duckdb


def query_trace(path: str = "traces/run.jsonl") -> None:
    con = duckdb.connect()
    con.sql(f"""
        create or replace view events as
        select *
        from read_json_auto('{path}');
    """)

    print("Event counts")
    con.sql("""
        select type, count(*) as events
        from events
        group by type
        order by events desc;
    """).show()

    print("Tool errors")
    con.sql("""
        select
            data.tool as tool,
            data.error_type as error_type,
            data.error as error,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_error';
    """).show()

    print("Slow tools")
    con.sql("""
        select
            data.tool as tool,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_end'
        order by data.duration_ms desc
        limit 5;
    """).show()
Running query_trace() gives output like:
+-------------+--------+
| type        | events |
+-------------+--------+
| guard_check |      4 |
| turn_start  |      3 |
| tool_start  |      3 |
| tool_end    |      2 |
| tool_error  |      1 |
| guard_stop  |      1 |
+-------------+--------+
And the crash row is now a query result:
+-------------+--------------+---------------------------+-------------+
| tool        | error_type   | error                     | duration_ms |
+-------------+--------------+---------------------------+-------------+
| search_docs | TimeoutError | Document search timed out |      147.82 |
+-------------+--------------+---------------------------+-------------+
You can answer questions that normal print logs make annoying: Which tools failed most often? Which tool was slowest? Which turn crossed the budget? Did the same input fail repeatedly? Did the guard stop the run, or did the tool crash first?
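One of those questions, whether the same input failed repeatedly, needs a pairing step, because the recorder writes the input on tool_start and the error on tool_error. The sketch below does that pairing in plain Python rather than SQL. The function name repeated_failures is illustrative, and it assumes tool calls within a run do not nest, so the most recent tool_start of a run belongs to the error that follows it.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Any, Iterable


def repeated_failures(events: Iterable[dict[str, Any]]) -> Counter:
    """Count tool errors per (tool, input) across runs."""
    last_start: dict[str, dict[str, Any]] = {}
    failures: Counter = Counter()
    for event in events:
        data = event.get("data", {})
        if event["type"] == "tool_start":
            # Remember the latest tool_start of each run.
            last_start[event["run_id"]] = data
        elif event["type"] == "tool_error":
            start = last_start.get(event["run_id"], {})
            # Serialize the input with sorted keys so equal inputs compare equal.
            key = (data.get("tool"), json.dumps(start.get("input"), sort_keys=True))
            failures[key] += 1
    return failures


# Made-up events in the same shape the recorder writes.
sample = [
    {"run_id": "a", "type": "tool_start", "data": {"tool": "search_docs", "input": {"query": "x"}}},
    {"run_id": "a", "type": "tool_error", "data": {"tool": "search_docs", "error": "boom"}},
    {"run_id": "b", "type": "tool_start", "data": {"tool": "search_docs", "input": {"query": "x"}}},
    {"run_id": "b", "type": "tool_error", "data": {"tool": "search_docs", "error": "boom"}},
    {"run_id": "b", "type": "tool_start", "data": {"tool": "search_docs", "input": {"query": "y"}}},
    {"run_id": "b", "type": "tool_end", "data": {"tool": "search_docs", "ok": True}},
]
counts = repeated_failures(sample)
for (tool, tool_input), count in counts.most_common():
    print(tool, tool_input, count)
```

Feeding it the parsed lines of traces/run.jsonl gives the same answer for a real trace. If tool calls can overlap, recording a shared call id on both events would make the pairing exact.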
What To Record In A Real Project
For a real project, add:
- model
- provider
- prompt_hash
- tool_schema_version
- input_tokens
- output_tokens
- finish_reason
- retry_count
- user_id_hash
- environment
Don't record:
- raw access tokens
- private documents
- full customer prompts
- full tool responses with sensitive data
- cookies
- request headers
The boring security rule: Record enough to debug behavior. Do not record enough to harm someone.
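Fields such as prompt_hash and user_id_hash let a trace group related runs without storing the underlying text. A minimal sketch of how they might be computed follows; the function names mirror the field names, and the choice of SHA-256 and of a keyed hash for user ids is an assumption, not something the article specifies.

```python
import hashlib
import hmac


def prompt_hash(prompt: str) -> str:
    # Plain SHA-256: identical prompts map to the same digest, so runs can
    # be grouped without storing the prompt text.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def user_id_hash(user_id: str, secret: bytes) -> str:
    # Keyed hash (HMAC-SHA256): without the secret, a guessed user id
    # cannot be checked against the trace.
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# The secret here is a placeholder; a real one would come from configuration.
example_secret = b"replace-with-a-real-secret"
print(prompt_hash("How do rate limits work?")[:12])
print(user_id_hash("user-123", example_secret)[:12])
```

An unkeyed hash of a short or predictable prompt can still be guessed by hashing candidates, so it is a grouping key rather than a privacy guarantee.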
The Pattern In One Sentence
Every agent run should produce a local, append-only event stream that is safe to keep, easy to query, and useful after the process crashes.



