769 PRs in 30 Days, Median Merge Time 31 Minutes, Human Review ~0%

airCloset's internal AI platform, codenamed cortex, runs an automated PR review pipeline that merged 769 PRs in 30 days with near-zero human involvement. The median time to merge is 31 minutes; 1 in 5 PRs merge within 10 minutes and roughly half within 30 minutes. The AI reviewer covers 100% of PRs, averaging 10.8 review-fix loop iterations per PR (max 56).

The Bottleneck Problem

As AI writing speed increases, human review becomes the bottleneck. Anthropic's internal blog on Claude Code describes the same pattern: senior engineers shifted from writing code to reviewing AI output. cortex hit the same wall. When Claude Code ran at full throttle, writing speed jumped by an order of magnitude, but human review throughput did not keep pace. If the reviewer took a day off, the entire org stalled.

cortex's solution: move the reviewer role to AI as well. Humans tune the prompts and guidelines—operating on the policy layer, not the execution layer.

Three Conditions for AI Review to Work

  1. Sufficient context. A generic AI reviewer sees only the PR diff. cortex feeds the Product Graph (cpg) from Part 2, a knowledge graph fusing code, docs, DB schemas, and infra, into the AI reviewer. With that context, the reviewer catches missed upstream/downstream fixes, missed doc updates, and tests that should have been updated but weren't.

  2. Non-improvisational reviews. Review guidelines are passed as a mandatory citation source. cortex open-sourced a snapshot at air-closet/cortex-review-guidelines (JP/EN). The live guidelines evolve daily.

  3. False positives don't block merges. A severity hierarchy (Critical/Major/Minor/Nit) with strict no-downgrade rules prevents blanket blocks.

Pipeline Architecture

The implementation is a script running on each developer's machine. GitHub webhooks hit an in-house Event Relay server, which persists each event to Firestore; each machine subscribes to the relay as an SSE client. On reconnect, the client's Last-Event-ID tells the relay which missed events to replay, so no events are lost. Reviewer-mode machines stay always-on; author mode runs in the background on the PR author's machine.

# Example: starting reviewer mode
cortex-review --mode reviewer --pr 1234

The pipeline evolved through three iterations:

  • GitHub webhook → smee.io → each machine (connection drops)
  • GitHub webhook → Cloudflare Tunnel → each machine (missed deliveries)
  • GitHub webhook → in-house Event Relay with Firestore → SSE (zero loss)
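
The replay step behind the third iteration can be sketched as follows. This is a minimal illustration, not cortex's relay: `EventLog` is a hypothetical in-memory stand-in for the Firestore-backed store, and the integer event IDs are an assumption. Only the `id:` / `data:` framing and the Last-Event-ID idea come from the SSE protocol itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for the persisted event store."""

    events: list[tuple[int, str]] = field(default_factory=list)

    def append(self, payload: str) -> int:
        """Store an event under a monotonically increasing ID and return the ID."""
        event_id = len(self.events) + 1
        self.events.append((event_id, payload))
        return event_id

    def replay_after(self, last_event_id: int | None) -> list[tuple[int, str]]:
        """Return every event newer than the client's Last-Event-ID (all if None)."""
        if last_event_id is None:
            return list(self.events)
        return [(eid, payload) for eid, payload in self.events if eid > last_event_id]


def to_sse_frame(event_id: int, payload: str) -> str:
    """Serialize one event in text/event-stream framing (one data: line per payload line)."""
    data_lines = "".join(f"data: {line}\n" for line in payload.split("\n"))
    return f"id: {event_id}\n{data_lines}\n"
```

A client that disconnected after event 1 would reconnect with Last-Event-ID 1 and receive `replay_after(1)`. Because events are stored before they are streamed, a dropped connection delays delivery rather than losing it.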

When the reviewer machine receives an event, it spawns claude -p and walks 9 dimensions sequentially: Graph, Architecture, Security, Test, Doc, Impact, Observability, AI-Antipattern, Recurrence. A single session shares context across dimensions, avoiding the token bloat and cross-reference issues of parallel sub-agents. At the end, the AI emits a verdict marker and posts APPROVE or REQUEST_CHANGES via gh pr review.
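
The final step, turning the verdict marker into a `gh pr review` call, might look roughly like this. The article does not give the marker's format, so the `VERDICT: ...` line is a hypothetical stand-in, as are the function names. The `--approve`, `--request-changes`, and `--body` flags are standard GitHub CLI options. The sketch only builds the command and does not run it.

```python
import re

# Hypothetical marker format: a line such as "VERDICT: APPROVE" in the review output.
VERDICT_RE = re.compile(r"^VERDICT:\s*(APPROVE|REQUEST_CHANGES)\s*$", re.MULTILINE)

GH_FLAGS = {"APPROVE": "--approve", "REQUEST_CHANGES": "--request-changes"}


def extract_verdict(review_output: str) -> str:
    """Return the last verdict marker in the output; fail loudly if none is present."""
    matches = VERDICT_RE.findall(review_output)
    if not matches:
        raise ValueError("no verdict marker found in review output")
    return matches[-1]


def gh_review_command(pr_number: int, verdict: str, body: str) -> list[str]:
    """Build the argv for `gh pr review`, suitable for subprocess.run."""
    return ["gh", "pr", "review", str(pr_number), GH_FLAGS[verdict], "--body", body]
```

Raising on a missing marker, rather than defaulting to APPROVE, keeps a truncated or crashed session from approving a PR by accident.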

9 Review Dimensions with Tagged Output

Tag               Dimension                 Primary Target
[Graph]           Product Graph integrity   @graph-* JSDoc, node dependencies, doc consistency
[Doc]             Doc consistency           Doc updates following code changes
[Impact]          Impact analysis           Missed upstream/downstream fixes
[Security]        Security                  Auth, input validation, secrets
[Architecture]    Composable Architecture   app/package boundaries, dependency direction
[Test]            Test quality              Coverage, matchers, naming
[Observability]   Observability             Structured logging, no-truncate rules
[AI-Antipattern]  AI-generated code traps   Hallucinated APIs, fallback overuse, dead code
[Recurrence]      Recurrence prevention     Bug-fix triage (lint / horizontal rollout / new guideline)
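
Tagged output makes findings easy to group by dimension after the fact. The sketch below assumes each finding is a single line that starts with one of the bracketed tags from the table. That line format and the function name are assumptions for illustration, since the article shows only the tags themselves.

```python
import re
from collections import defaultdict

# The nine tags listed in the table above.
KNOWN_TAGS = {
    "Graph", "Doc", "Impact", "Security", "Architecture",
    "Test", "Observability", "AI-Antipattern", "Recurrence",
}

# Assumed line shape: "[Tag] free-form finding text".
FINDING_RE = re.compile(r"^\[(?P<tag>[A-Za-z-]+)\]\s*(?P<text>.+)$")


def group_by_dimension(review_output: str) -> dict[str, list[str]]:
    """Collect finding texts under their dimension tag, ignoring untagged lines."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in review_output.splitlines():
        match = FINDING_RE.match(line.strip())
        if match and match.group("tag") in KNOWN_TAGS:
            grouped[match.group("tag")].append(match.group("text").strip())
    return dict(grouped)
```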

Severity rules:

  • Critical: Security, data corruption, prod-risk → REQUEST_CHANGES
  • Major: Spec violation, architecture violation, missing tests → REQUEST_CHANGES
  • Minor: Naming, maintainability → REQUEST_CHANGES (must be resolved)
  • Nit: Style preference → APPROVE (comment only)

The no-downgrade rule states: "Following existing patterns" is not a valid reason to downgrade; "Will be addressed in a separate PR" is not valid; "Leave a TODO/FIXME" is not a valid deferral path.
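
The severity rules reduce to a small decision function: any finding above Nit blocks the merge. A minimal sketch, where the `Severity` enum and `decide_verdict` are illustrative names rather than cortex's own code:

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"
    NIT = "Nit"


# Per the rules above, everything except Nit results in REQUEST_CHANGES.
BLOCKING = {Severity.CRITICAL, Severity.MAJOR, Severity.MINOR}


def decide_verdict(findings: list[Severity]) -> str:
    """Return REQUEST_CHANGES if any finding is blocking, otherwise APPROVE."""
    if any(severity in BLOCKING for severity in findings):
        return "REQUEST_CHANGES"
    return "APPROVE"
```

A PR with only Nit comments is approved with those comments attached, while a single Minor finding is enough to send it back for another fix loop.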

Operational Details

  • Draft PRs are skipped; review starts when flipped to Ready for Review.
  • Specific PRs can be manually targeted via CLI after CI failure.
  • Auto-merge is the PR author's call (on by default; can be disabled for prod changes).
  • A 500-lines-per-file lint keeps files small enough for a single AI session.
  • CLAUDE.md is swapped to a review-specific version at startup, removing development-time noise.
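
Two of these rules, the draft skip and the 500-line limit, are simple enough to sketch. The `action` and `pull_request.draft` fields are part of GitHub's `pull_request` webhook payload. Which actions cortex actually reacts to is not stated, so the set below is an assumption, as are the function names.

```python
MAX_LINES_PER_FILE = 500

# Assumed trigger set; the article only says review starts at "Ready for Review".
REVIEWABLE_ACTIONS = {"opened", "reopened", "synchronize", "ready_for_review"}


def should_review(event: dict) -> bool:
    """Skip draft PRs; review only on actions that can change what needs reviewing."""
    pull_request = event.get("pull_request", {})
    if pull_request.get("draft", False):
        return False
    return event.get("action") in REVIEWABLE_ACTIONS


def oversized_files(files: dict[str, str], limit: int = MAX_LINES_PER_FILE) -> list[str]:
    """Return the paths whose content exceeds the per-file line limit, sorted."""
    return sorted(
        path for path, text in files.items() if len(text.splitlines()) > limit
    )
```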

Why Sequential Single-Session Review?

Initially, cortex tried parallel sub-agents for the 9 dimensions. Three problems emerged:

  • cpg/guidelines/PR diff injected 9 times (token cost ballooned)
  • Cross-dimension findings couldn't reference each other
  • Aggregating 9 outputs into a single verdict required extra machinery

A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context, and one verdict marker at the end is the entire aggregation step.
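
The token argument can be made concrete with a deliberately simplified model. It counts each piece of context once per session it is loaded into and ignores per-turn re-reads and prompt caching. The numbers in the usage lines are made up for illustration, not measurements from cortex.

```python
def parallel_input_tokens(shared: int, per_dimension: int, dimensions: int = 9) -> int:
    """Each sub-agent loads the shared context (cpg, guidelines, diff) on its own."""
    return dimensions * (shared + per_dimension)


def sequential_input_tokens(shared: int, per_dimension: int, dimensions: int = 9) -> int:
    """One session loads the shared context once, then walks every dimension."""
    return shared + dimensions * per_dimension


# Hypothetical sizes: 40k tokens of shared context, 2k of instructions per dimension.
print(parallel_input_tokens(40_000, 2_000))    # 378000
print(sequential_input_tokens(40_000, 2_000))  # 58000
```

Under this model the gap grows with the size of the shared context, which is why injecting the cpg nine times was the dominant cost.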