cortex Auto-Review: AI Reviews 769 PRs/Month With Near-Zero Human Input

769 PRs in 30 Days, Median Merge Time 31 Minutes, Human Review ~0%

airCloset's internal AI platform, codenamed cortex, has been running an automated PR review pipeline that merges 769 PRs per month with near-zero human involvement. The median time to merge is 31 minutes, with 1 in 5 merged within 10 minutes and half within 30 minutes. The AI reviewer covers 100% of PRs, averaging 10.8 review-fix loop iterations per PR (max 56).

The Bottleneck Problem

As AI writing speed increases, human review becomes the bottleneck. Anthropic's internal blog on Claude Code confirms this pattern: senior engineers shifted from writing code to reviewing AI output. cortex hit the same wall. When Claude Code ran at full throttle, writing speed jumped an order of magnitude, but human review time only grew linearly. If the reviewer took a day off, the entire org stalled.

cortex's solution: move the reviewer role to AI as well. Humans tune the prompts and guidelines, operating on the policy layer rather than the execution layer.

Three Conditions for AI Review to Work

Sufficient context. A generic AI reviewer sees only the PR diff. cortex feeds the Product Graph (cpg) from Part 2, a knowledge graph fusing code, docs, DB schemas, and infra, into the AI reviewer. With that context, the reviewer catches missed upstream/downstream fixes, as well as docs and tests that should have been updated but weren't.
Non-improvisational reviews. Review guidelines are passed as a mandatory citation source. cortex open-sourced a snapshot at air-closet/cortex-review-guidelines (JP/EN). The live guidelines evolve daily.
False positives don't block merges. A severity hierarchy (Critical/Major/Minor/Nit) with strict no-downgrade rules prevents blanket blocks.

Pipeline Architecture

The implementation is a script running on each developer's machine. GitHub webhooks hit an in-house Event Relay server, which persists them to Firestore, and each machine subscribes as an SSE client. On reconnect, the Last-Event-ID header lets the relay replay missed events, so no events are lost. Reviewer-mode machines stay always-on; author mode runs in the background on the PR author's machine.

# Example: starting reviewer mode
cortex-review --mode reviewer --pr 1234

The pipeline evolved through three iterations:

GitHub webhook → smee.io → each machine (connection drops)
GitHub webhook → Cloudflare Tunnel → each machine (missed deliveries)
GitHub webhook → in-house Event Relay with Firestore → SSE (zero loss)
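
The replay-on-reconnect behavior of the third iteration can be sketched roughly as follows. This is a minimal illustration, not cortex's actual relay: the `RelayLog` and `Subscriber` classes are hypothetical names, an in-memory list stands in for Firestore, and the network layer is omitted entirely. It only shows why a persisted, ordered log plus a client-side last-seen id lets a machine catch up after a dropped connection.

```python
from dataclasses import dataclass, field


@dataclass
class RelayLog:
    """In-memory stand-in for the persisted event store (Firestore in the article)."""
    events: list = field(default_factory=list)  # list of (event_id, payload)

    def append(self, payload: str) -> int:
        event_id = len(self.events) + 1  # monotonically increasing id
        self.events.append((event_id, payload))
        return event_id

    def replay_after(self, last_event_id: int) -> list:
        """Return every event whose id is greater than the client's Last-Event-ID."""
        return [(eid, p) for eid, p in self.events if eid > last_event_id]


@dataclass
class Subscriber:
    """Hypothetical SSE client that remembers the last id it processed."""
    last_event_id: int = 0
    seen: list = field(default_factory=list)

    def reconnect(self, log: RelayLog) -> None:
        # On (re)connect the client reports its last id and receives the backlog.
        for event_id, payload in log.replay_after(self.last_event_id):
            self.seen.append(payload)
            self.last_event_id = event_id


log = RelayLog()
machine = Subscriber()

log.append("pull_request.opened #1")
machine.reconnect(log)                          # online: receives event 1
log.append("pull_request.synchronize #1")       # arrives while the machine is offline
log.append("pull_request.ready_for_review #2")  # also missed
machine.reconnect(log)                          # reconnect: events 2 and 3 are replayed
```

Because the log is the source of truth, a machine that was asleep or offline simply resumes from its last id instead of depending on the webhook being redelivered.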

When the reviewer machine receives an event, it spawns claude -p and walks 9 dimensions sequentially: Graph, Architecture, Security, Test, Doc, Impact, Observability, AI-Antipattern, Recurrence. A single session shares context across dimensions, avoiding the token bloat and cross-reference issues of parallel sub-agents. At the end, the AI emits a verdict marker and posts APPROVE or REQUEST_CHANGES via gh pr review.
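
A rough sketch of the last step, turning the session output into a `gh pr review` call, might look like this. The article does not give the marker format, so the `VERDICT:` line, the prompt wording, and the helper names are all assumptions made for illustration. The `--approve`, `--request-changes`, and `--body` flags are standard `gh pr review` options. Falling back to REQUEST_CHANGES when no marker is found is a design choice of this sketch, not something the source states.

```python
import re

# Hypothetical marker format; the article only says "a verdict marker" is emitted.
VERDICT_RE = re.compile(r"^VERDICT:\s*(APPROVE|REQUEST_CHANGES)\s*$", re.MULTILINE)

# Dimension order as listed in the article.
DIMENSIONS = [
    "Graph", "Architecture", "Security", "Test", "Doc",
    "Impact", "Observability", "AI-Antipattern", "Recurrence",
]


def build_prompt(pr_number: int) -> str:
    """Build one prompt that walks all dimensions in a single session."""
    steps = "\n".join(f"{i}. [{name}]" for i, name in enumerate(DIMENSIONS, 1))
    return (
        f"Review PR #{pr_number}. Walk these dimensions in order:\n{steps}\n"
        "Finish with one line: 'VERDICT: APPROVE' or 'VERDICT: REQUEST_CHANGES'."
    )


def parse_verdict(output: str) -> str:
    """Take the last marker in the output; fail closed if none is present."""
    matches = VERDICT_RE.findall(output)
    return matches[-1] if matches else "REQUEST_CHANGES"


def gh_review_command(pr_number: int, verdict: str, body: str) -> list[str]:
    """Build (but do not run) the gh invocation for the given verdict."""
    flag = "--approve" if verdict == "APPROVE" else "--request-changes"
    return ["gh", "pr", "review", str(pr_number), flag, "--body", body]
```

In a real script the prompt would be passed to `claude -p` and the resulting command list handed to `subprocess.run`; both are left out here so the sketch has no side effects.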

9 Review Dimensions with Tagged Output

Tag	Dimension	Primary Target
[Graph]	Product Graph integrity	@graph-* JSDoc, node dependencies, doc consistency
[Doc]	Doc consistency	Doc updates following code changes
[Impact]	Impact analysis	Missed upstream/downstream fixes
[Security]	Security	Auth, input validation, secrets
[Architecture]	Composable Architecture	app/package boundaries, dependency direction
[Test]	Test quality	Coverage, matchers, naming
[Observability]	Observability	Structured logging, no-truncate rules
[AI-Antipattern]	AI-generated code traps	Hallucinated APIs, fallback overuse, dead code
[Recurrence]	Recurrence prevention	Bug-fix triage (lint / horizontal rollout / new guideline)

Severity rules:

Critical: Security, data corruption, prod-risk → REQUEST_CHANGES
Major: Spec violation, architecture violation, missing tests → REQUEST_CHANGES
Minor: Naming, maintainability → REQUEST_CHANGES (must be resolved)
Nit: Style preference → APPROVE (comment only)

The no-downgrade rule rejects three common justifications for lowering a finding's severity: "Following existing patterns" is not a valid reason to downgrade, "Will be addressed in a separate PR" is not valid, and "Leave a TODO/FIXME" is not a valid deferral path.
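
The severity rules reduce to a small decision function: any finding at Minor or above blocks, and only Nit-level findings pass. The sketch below assumes findings arrive as `(tag, severity)` pairs, which is a hypothetical shape chosen for illustration; the no-downgrade rule is a prompt-level policy and is not modeled here.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"
    NIT = "Nit"


# Per the severity rules, everything except Nit results in REQUEST_CHANGES.
BLOCKING = {Severity.CRITICAL, Severity.MAJOR, Severity.MINOR}


def verdict(findings: list) -> str:
    """findings: list of (tag, Severity) pairs, e.g. ("[Security]", Severity.CRITICAL)."""
    if any(severity in BLOCKING for _tag, severity in findings):
        return "REQUEST_CHANGES"
    return "APPROVE"
```

For example, a review containing only `("[Test]", Severity.NIT)` yields APPROVE, while adding a single `("[Doc]", Severity.MINOR)` flips the verdict to REQUEST_CHANGES.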

Operational Details

Draft PRs are skipped; review starts when flipped to Ready for Review.
A specific PR can be targeted manually via the CLI after a CI failure.
Auto-merge is the PR author's call (on by default; can be disabled for prod changes).
A 500-lines-per-file lint keeps files small enough for a single AI session.
CLAUDE.md is swapped to a review-specific version at startup, removing development-time noise.
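
The 500-lines-per-file lint is simple to approximate. The article does not say which tool enforces it, so the following is only a standalone sketch: the function name, the file suffixes, and the choice to flag files with strictly more than 500 lines are assumptions.

```python
from pathlib import Path

MAX_LINES = 500  # limit stated in the article


def oversized_files(root: str, suffixes=(".ts", ".tsx")) -> list:
    """Return (path, line_count) for source files longer than MAX_LINES.

    The suffix list is an assumption; adjust it to the repository's languages.
    """
    offenders = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            with path.open(encoding="utf-8", errors="replace") as handle:
                count = sum(1 for _ in handle)
            if count > MAX_LINES:
                offenders.append((str(path), count))
    return offenders
```

Run in CI, a non-empty result would fail the build, which keeps every file within what one review session can read in full.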

Why Sequential Single-Session Review?

Initially, cortex tried parallel sub-agents for the 9 dimensions. Three problems emerged:

cpg/guidelines/PR diff injected 9 times (token cost ballooned)
Cross-dimension findings couldn't reference each other
Aggregating 9 outputs into a single verdict required extra machinery

A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context, and one verdict marker at the end is the entire aggregation step.
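
The token argument can be made concrete with back-of-the-envelope arithmetic. The numbers below are purely illustrative and do not come from the article, and the model deliberately ignores prompt caching and multi-turn re-reads; it only captures the difference between loading the shared context (cpg, guidelines, PR diff) nine times and loading it once.

```python
def parallel_input_tokens(shared_context: int, per_dimension: int, dimensions: int = 9) -> int:
    """Each sub-agent receives its own copy of the shared context."""
    return dimensions * (shared_context + per_dimension)


def sequential_input_tokens(shared_context: int, per_dimension: int, dimensions: int = 9) -> int:
    """One session loads the shared context once, then adds per-dimension instructions."""
    return shared_context + dimensions * per_dimension


# Hypothetical sizes: 40,000 tokens of shared context, 1,000 tokens per dimension.
parallel = parallel_input_tokens(40_000, 1_000)      # 9 * 41,000 = 369,000
sequential = sequential_input_tokens(40_000, 1_000)  # 40,000 + 9,000 = 49,000
```

Under these assumed sizes the parallel layout consumes roughly 7.5 times the input tokens, and the gap widens as the shared context grows.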
