airCloset CTO Ryan Tsuji detailed how their in-house AI platform, cortex, automates incident response and recurrence prevention. The system merges fix PRs automatically and simultaneously adds new lint rules, CI guards, type constraints, or guideline entries to reject the same anti-pattern in the future. In the past 30 days, cortex merged 115 self-healing PRs: 54 from deploy failures caught before shipping to production, and 61 from production-runtime alerts that were absorbed before user impact. Only a low single-digit number of incidents per month requires human intervention.
The Three Layers: Observation, Repair, Strengthening
Self-healing requires three layers built on top of the Product Graph (cpg) from Part 2 and a robust observability stack:
- Observation: Real-time detection via OTel SDK, Loki logs, Mimir metrics, Tempo traces, Faro frontend monitoring, and Grafana alerts. Log levels are defined by business impact: `warn` for self-recoverable issues, `error` for data recovery needed, `fatal` for feature-wide failure.
- Repair: On alert, the AI investigates via Loki logs and cpg, opens a fix PR, runs auto-review, auto-merges, and auto-redeploys. The flow uses `git worktree add -b hotfix/auto-alert-{service}-{ts} origin/main` and `claude -p` for root cause analysis.
- Strengthening: Every fix PR must add a new Guide: a `@cortex/eslint-plugin-graph` rule (26 rules exist), a `scripts/check-*.ts` CI guard (13 guards), or a recurrence-prevention.md entry. This ensures the same anti-pattern is auto-rejected from then on.
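The Repair layer's stages and branch naming can be sketched roughly as follows. This is an illustrative sketch, not cortex's actual code: the `Alert` shape, the function names, the choice of Unix seconds for `{ts}`, and the stop-at-first-failure behavior are all assumptions.

```typescript
// Hypothetical sketch of the Repair layer; names and shapes are assumptions.

/** An alert as the repair layer might receive it (field names are assumed). */
interface Alert {
  service: string; // e.g. "gcs-transformer"
  firedAt: Date;   // when the alert fired
}

/** Builds a branch name following the hotfix/auto-alert-{service}-{ts} pattern. */
function hotfixBranchName(alert: Alert): string {
  // Assumption: {ts} is a Unix timestamp in seconds.
  const ts = Math.floor(alert.firedAt.getTime() / 1000);
  return `hotfix/auto-alert-${alert.service}-${ts}`;
}

/** The ordered stages of the repair flow, as listed in the text. */
const REPAIR_STAGES = [
  "investigate",
  "open-fix-pr",
  "auto-review",
  "auto-merge",
  "auto-redeploy",
] as const;
type RepairStage = (typeof REPAIR_STAGES)[number];

/**
 * Runs the stages in order and stops at the first one that fails,
 * returning the stages that completed (assumed behavior).
 */
function runRepair(run: (stage: RepairStage) => boolean): RepairStage[] {
  const done: RepairStage[] = [];
  for (const stage of REPAIR_STAGES) {
    if (!run(stage)) break;
    done.push(stage);
  }
  return done;
}
```

Stopping at the first failed stage means a rejected auto-review never reaches merge or redeploy, which is presumably where the few human-handled incidents come from.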
Observability Prerequisites
Without proper observability, self-healing can't detect anomalies. cortex uses a strict log level definition based on business impact, not exception class. For example, a "record not found" situation is error if the record must exist, warn if it's a user search with zero hits. Alert rules are managed declaratively in Pulumi, grouped by service into categories like BOT, Pipeline, Transformer, Generator, Gemini, CI, Deploy, and Service Catch-All. Adding a new service auto-spawns dashboards and alerts with one line of infra code.
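A minimal sketch of choosing a log level by business impact rather than by exception class, using the "record not found" example above. The type and function names are hypothetical, and the impact categories simply restate the three level definitions from the text.

```typescript
// Hypothetical sketch: the log level follows business impact, not exception class.

type LogLevel = "warn" | "error" | "fatal";

/** Business impact categories, restating the level definitions in the text. */
type Impact =
  | "self-recoverable"      // the system recovers on its own
  | "data-recovery-needed"  // someone or something must repair data
  | "feature-wide-failure"; // an entire feature is down

/** Maps impact to log level. */
function levelFor(impact: Impact): LogLevel {
  switch (impact) {
    case "self-recoverable":
      return "warn";
    case "data-recovery-needed":
      return "error";
    case "feature-wide-failure":
      return "fatal";
  }
}

/**
 * The same "record not found" condition gets a different level
 * depending on whether the record was required to exist.
 */
function recordNotFoundLevel(context: "must-exist" | "user-search"): LogLevel {
  const impact: Impact =
    context === "must-exist" ? "data-recovery-needed" : "self-recoverable";
  return levelFor(impact);
}
```

Because alerts key off the level, keeping this mapping in one place is what lets an `error` reliably mean "something needs repair" across services.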
What Self-Healing Can't Catch
Self-healing only reacts to what the observation layer can detect: logic-level errors, exceptions, deploy failures, and threshold-based metric anomalies. It cannot catch UI logic errors with no error logs, silent data corruption, or perceived UX degradation unless latency thresholds trip. These remain blind spots that require continued investment in observability.
Recurrence Prevention: The [Recurrence] Loop
The system forces every fix to add a new Guide. For example, a repeated error pattern in the gcs-transformer service (25 of 61 production alerts) gets converted into a lint rule or type constraint. The auto-review pipeline has a [Recurrence] lens that checks if the fix adds a Guide; if not, the PR is rejected. This compounds quality gates over time, gradually eliminating entire classes of incidents.
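The [Recurrence] check could look something like the sketch below, which approves a PR only if its changed files include at least one Guide. This is an assumption-heavy illustration: the real lens is an AI review step, the `eslint-plugin-graph/` directory is a guessed location for the plugin's rules, and a path-based check cannot see Guides expressed as type constraints.

```typescript
// Hypothetical sketch of a [Recurrence] gate based on changed file paths.

/** Path patterns that count as a Guide in this sketch. */
const GUIDE_PATTERNS: RegExp[] = [
  /(^|\/)eslint-plugin-graph\/.+\.ts$/, // assumed location of @cortex/eslint-plugin-graph rules
  /(^|\/)scripts\/check-[^/]+\.ts$/,    // CI guards: scripts/check-*.ts
  /(^|\/)recurrence-prevention\.md$/,   // guideline entry
];

interface ReviewResult {
  approved: boolean;
  guides: string[]; // changed files recognized as Guides
  reason: string;
}

/** Rejects a fix PR that does not add or change at least one Guide. */
function recurrenceLens(changedFiles: string[]): ReviewResult {
  const guides = changedFiles.filter((file) =>
    GUIDE_PATTERNS.some((pattern) => pattern.test(file)),
  );
  if (guides.length === 0) {
    return {
      approved: false,
      guides,
      reason: "fix does not add a Guide (lint rule, CI guard, or guideline entry)",
    };
  }
  return { approved: true, guides, reason: "fix adds a Guide" };
}
```

For example, a PR touching only `src/handler.ts` would be rejected, while one that also touches `scripts/check-imports.ts` (a made-up guard name) would pass.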
Real-World Numbers and Caveats
Tsuji notes that the 115 figure is slightly inflated because a new no-silent-catch lint rule exposed previously hidden production errors. Sweeping existing silent catches in batches caused a spike in alerts: "monitoring caught up to reality." Once the recurrence loop converts these patterns into lint rules, the number should converge. He emphasizes that running 115 manual cycles of alert → context switch → fix → review → deploy would exhaust engineering bandwidth; the system instead absorbs them without anyone noticing.
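To make the no-silent-catch idea concrete, here is a rough sketch of a rule in the standard ESLint rule shape (`meta` plus `create` returning AST visitors). The source does not show the actual rule, so the detection logic is an assumption: this version flags only completely empty catch blocks. The local interfaces are simplified stand-ins for the ESLint and ESTree types so the sketch has no dependencies.

```typescript
// Hypothetical sketch of a no-silent-catch rule; the real rule's logic is not shown in the source.

// Simplified local stand-ins for ESTree/ESLint types.
interface BlockStatement { type: "BlockStatement"; body: unknown[] }
interface CatchClause { type: "CatchClause"; body: BlockStatement }
interface RuleContext {
  report(descriptor: { node: CatchClause; message: string }): void;
}

const noSilentCatch = {
  meta: {
    type: "problem",
    docs: { description: "disallow catch blocks that swallow errors silently" },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      // Called for every `catch (...) { ... }` in the linted file.
      CatchClause(node: CatchClause): void {
        // Assumption: "silent" means the catch body has no statements at all.
        if (node.body.body.length === 0) {
          context.report({
            node,
            message: "Empty catch block hides the error; log it or rethrow.",
          });
        }
      },
    };
  },
};
```

Fixing a flagged catch by logging the error is what surfaces previously hidden failures to the observation layer, which matches the alert spike described above.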
The Bottom Line
Self-healing without recurrence prevention is just patching symptoms. cortex's approach—fix the incident and close the recurrence class simultaneously—turns incident response into a compounding quality investment. The prerequisite is a unified knowledge graph (cpg) and production-grade observability; without those, AI-driven auto-repair multiplies risk instead of reducing it.


