airCloset's AI Merged 115 Self-Healing PRs in 30 Days

airCloset's internal AI platform, cortex, automatically investigates production alerts, fixes code, and adds guardrails to prevent recurrence. In the past 30 days, it merged 115 self-healing PRs without human involvement, and only a low single-digit number of incidents required human escalation.

3 min read · Jun 2, 2026

airCloset CTO Ryan Tsuji detailed how the company's in-house AI platform, cortex, automates incident response and recurrence prevention. The system merges fix PRs automatically and, in the same PR, adds new lint rules, CI guards, type constraints, or guideline entries that reject the same anti-pattern in the future. In the past 30 days, cortex merged 115 self-healing PRs: 54 from deploy failures caught before they reached production, and 61 from production-runtime alerts absorbed before user impact. Only a low single-digit number of incidents per month require human intervention.

The Three Layers: Observation, Repair, Strengthening

Self-healing requires three layers built on top of the Product Graph (cpg) from Part 2 and a robust observability stack:

Observation: Real-time detection via OTel SDK, Loki logs, Mimir metrics, Tempo traces, Faro frontend monitoring, and Grafana alerts. Log levels are defined by business impact: warn for self-recoverable issues, error for data recovery needed, fatal for feature-wide failure.
Repair: On alert, the AI investigates via Loki logs and cpg, opens a fix PR, runs auto-review, auto-merges, and auto-redeploys. The flow creates an isolated hotfix branch with git worktree add -b hotfix/auto-alert-{service}-{ts} origin/main and runs claude -p for root cause analysis.
Strengthening: Every fix PR must add a new Guide—a @cortex/eslint-plugin-graph rule (26 rules exist), a scripts/check-*.ts CI guard (13 guards), or a recurrence-prevention.md entry. This ensures the same anti-pattern is auto-rejected from then on.
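
The branch-naming step in the Repair layer can be sketched as below. This is a minimal illustration, not cortex's code: the helper names, the timestamp format behind {ts}, and the worktree path are all assumptions. Note that git's own syntax is git worktree add -b <new-branch> <path> <commit-ish>, so the command as quoted appears to omit the path argument; the sketch takes it as a hypothetical parameter.

```typescript
// Hypothetical helpers: the source shows only the {service} and {ts} placeholders.
function hotfixBranchName(service: string, now: Date): string {
  // Assumed format: compact UTC timestamp such as 20260602T101500Z.
  const ts = now.toISOString().replace(/[-:]/g, "").replace(/\.\d{3}Z$/, "Z");
  return `hotfix/auto-alert-${service}-${ts}`;
}

// Builds the argv for: git worktree add -b <new-branch> <path> origin/main
// worktreePath is an assumption; the quoted command does not show one.
function worktreeArgs(service: string, worktreePath: string, now: Date): string[] {
  return [
    "worktree",
    "add",
    "-b",
    hotfixBranchName(service, now),
    worktreePath,
    "origin/main",
  ];
}
```

Passing the arguments as an array (for example to Node's child_process.execFile with "git" as the command) avoids shell quoting issues if a service name ever contains unusual characters.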

Observability Prerequisites

Without proper observability, self-healing can't detect anomalies. cortex uses a strict log level definition based on business impact, not exception class. For example, a "record not found" situation is error if the record must exist, warn if it's a user search with zero hits. Alert rules are managed declaratively in Pulumi, grouped by service into categories like BOT, Pipeline, Transformer, Generator, Gemini, CI, Deploy, and Service Catch-All. Adding a new service auto-spawns dashboards and alerts with one line of infra code.
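
One way to picture "level by business impact" is a function that never looks at the exception type at all. The sketch below is illustrative only: the Impact shape and function names are hypothetical, and it assumes that a missing must-exist record counts as "data recovery needed".

```typescript
type LogLevel = "warn" | "error" | "fatal";

// Hypothetical description of a failure in business terms, not exception terms.
interface Impact {
  featureWide: boolean;        // a whole feature is failing
  dataRecoveryNeeded: boolean; // someone has to repair or restore data
}

function levelFor(impact: Impact): LogLevel {
  if (impact.featureWide) return "fatal";
  if (impact.dataRecoveryNeeded) return "error";
  // Simplification: everything else is treated as self-recoverable.
  return "warn";
}

// The "record not found" case from the text: the same condition maps to
// different levels depending on whether the record was required to exist.
function levelForMissingRecord(recordMustExist: boolean): LogLevel {
  return levelFor({ featureWide: false, dataRecoveryNeeded: recordMustExist });
}
```

A user search with zero hits would call levelForMissingRecord(false) and log at warn, while a lookup of a record that must exist would call levelForMissingRecord(true) and log at error.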

What Self-Healing Can't Catch

Self-healing only reacts to what the observation layer can detect: logic-level errors, exceptions, deploy failures, and threshold-based metric anomalies. It cannot catch UI logic errors with no error logs, silent data corruption, or perceived UX degradation unless latency thresholds trip. These remain blind spots that require continued investment in observability.

Recurrence Prevention: The [Recurrence] Loop

The system forces every fix to add a new Guide. For example, a repeated error pattern in the gcs-transformer service (25 of 61 production alerts) gets converted into a lint rule or type constraint. The auto-review pipeline has a [Recurrence] lens that checks if the fix adds a Guide; if not, the PR is rejected. This compounds quality gates over time, gradually eliminating entire classes of incidents.
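
A rough sketch of what a [Recurrence]-style check could look like follows. It is not the actual lens: the path patterns and type names are assumptions about repository layout, and a path-based check like this cannot see a Guide that takes the form of a type constraint inside ordinary source files.

```typescript
// Hypothetical path patterns; the source names the artifacts but not where they live.
const GUIDE_PATTERNS: RegExp[] = [
  /(^|\/)eslint-plugin-graph\/.+\.ts$/, // a rule in @cortex/eslint-plugin-graph (assumed directory)
  /(^|\/)scripts\/check-[^/]+\.ts$/,    // a scripts/check-*.ts CI guard
  /(^|\/)recurrence-prevention\.md$/,   // a recurrence-prevention.md entry
];

interface ChangedFile {
  path: string;
  addedLines: number;
}

// True when at least one Guide file gained lines in this PR.
function addsGuide(files: ChangedFile[]): boolean {
  return files.some(
    (f) => f.addedLines > 0 && GUIDE_PATTERNS.some((p) => p.test(f.path)),
  );
}

// A fix PR that adds no Guide is rejected, as described above.
function recurrenceVerdict(files: ChangedFile[]): "pass" | "reject" {
  return addsGuide(files) ? "pass" : "reject";
}
```

Requiring added lines, not just a touched file, keeps a whitespace-only or deletion-only change to a Guide file from satisfying the check.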

Real-World Numbers and Caveats

Tsuji notes that the 115 figure is slightly inflated because a new no-silent-catch lint rule exposed previously hidden production errors. Sweeping existing silent catches in batches caused a spike in alerts—"monitoring caught up to reality." Once the recurrence loop converts these into lint, the number should converge. He emphasizes that doing 115 manual cycles of alert → context switch → fix → review → deploy would bankrupt engineering bandwidth. The system absorbs them unnoticed.
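
To make the "monitoring caught up to reality" point concrete, here is the kind of code a no-silent-catch rule would presumably flag, next to a version that surfaces the failure. The source does not show the rule's definition, so the function names and the Logger type are hypothetical.

```typescript
// Hypothetical logger signature for illustration.
type Logger = (level: "warn" | "error", message: string) => void;

function parsePayloadSilently(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    // Silent catch: the failure is swallowed, so no log line or alert ever fires.
    return null;
  }
}

function parsePayloadObservably(raw: string, log: Logger): unknown {
  try {
    return JSON.parse(raw);
  } catch (err) {
    // Same fallback value, but the failure now reaches the observation layer.
    const reason = err instanceof Error ? err.message : String(err);
    log("error", `payload parse failed: ${reason}`);
    return null;
  }
}
```

Both functions return the same value on bad input; the only difference is that the second one emits a log line. Replacing the first form with the second across a codebase would naturally produce a burst of new alerts for errors that were already happening.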

The Bottom Line

Self-healing without recurrence prevention is just patching symptoms. cortex's approach—fix the incident and close the recurrence class simultaneously—turns incident response into a compounding quality investment. The prerequisite is a unified knowledge graph (cpg) and production-grade observability; without those, AI-driven auto-repair multiplies risk instead of reducing it.

Editor's Take

I've been skeptical of AI writing production code, but the recurrence prevention layer changes the equation. Instead of just patching symptoms, cortex turns every fix into a permanent quality gate. I'd want to see how well the lint rules generalize across a codebase, but the 115 PRs merged without human involvement is a concrete metric that's hard to ignore. I think this is the first real glimpse of a sustainable AI-assisted incident response workflow.

— DevDigest Editorial

Key Takeaways

• Implement a unified knowledge graph (like cpg) to give AI context across code, docs, DB, and infra; without it, auto-repair is dangerous.
• Define log levels by business impact, not exception class, to reduce alert fatigue and ensure critical issues are caught.
• Force every automated fix to add a new lint rule, CI guard, or type constraint to prevent recurrence and compound quality over time.

Why It Matters

This demonstrates a practical implementation of AI-driven incident response that not only fixes issues automatically but prevents them from recurring. For teams drowning in on-call alerts, this pattern could drastically reduce burnout and improve code quality over time.

