Cloudflare's AI Code Review: 7 Agents, One Coordinator, No Noise

Cloudflare's engineering team got tired of waiting hours for code reviews. Their solution: a CI-native orchestration system that launches up to seven specialized AI agents per merge request, managed by a coordinator agent that deduplicates findings and posts a single structured review comment. The system has processed tens of thousands of internal MRs, approving clean code and blocking merges on genuine bugs and security vulnerabilities.

Instead of building a monolithic agent, they built an orchestrator around OpenCode, an open-source coding agent. Cloudflare engineers have contributed over 45 pull requests upstream to OpenCode.

Architecture: Plugin-Based Isolation

The system uses a composable plugin architecture with three lifecycle phases: bootstrap, configure, and postConfigure. Each plugin implements a ReviewPlugin interface. Bootstrap hooks run concurrently and are non-fatal (e.g., a template fetch failure doesn't stop the review). Configure hooks run sequentially and are fatal (if GitLab can't connect, the job stops).
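The difference between the two failure modes can be sketched in a few lines. This is a minimal illustration, not Cloudflare's code: only the three phase names come from the article, while the shape of ReviewPlugin, the runLifecycle function, and the choice to run postConfigure sequentially are assumptions.

```typescript
// Hypothetical plugin shape; only the three phase names come from the article.
interface ReviewPlugin {
  name: string;
  bootstrap?: () => Promise<void>;
  configure?: () => Promise<void>;
  postConfigure?: () => Promise<void>;
}

interface LifecycleResult {
  // Bootstrap failures are collected here instead of being thrown.
  bootstrapWarnings: string[];
}

async function runLifecycle(plugins: ReviewPlugin[]): Promise<LifecycleResult> {
  const bootstrapWarnings: string[] = [];

  // Bootstrap: start every hook at once; a failure becomes a warning.
  await Promise.all(
    plugins.map(async (p) => {
      try {
        await p.bootstrap?.();
      } catch (err) {
        bootstrapWarnings.push(`${p.name}: ${String(err)}`);
      }
    }),
  );

  // Configure: one plugin at a time; the first throw rejects the whole run.
  for (const p of plugins) {
    await p.configure?.();
  }

  // postConfigure: assumed to be sequential too (the article does not say).
  for (const p of plugins) {
    await p.postConfigure?.();
  }

  return { bootstrapWarnings };
}
```

With this split, a plugin whose bootstrap hook throws still gets its configure hook called, while a throwing configure hook rejects the promise returned by runLifecycle and ends the job.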

Plugins interact through a ConfigureContext API—they can register agents, add AI providers, set environment variables, inject prompt sections, and alter agent permissions. No plugin has direct access to the final configuration object. The core assembler merges everything into an opencode.json file.
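One way to picture this indirection is a context object that only records requests, with a separate assembler that merges them. The sketch below is an assumption throughout: the method names, the AgentDef type, and the output shape are hypothetical and do not reflect the real ConfigureContext API or the opencode.json schema.

```typescript
// All names here are hypothetical; the article lists capabilities, not signatures.
interface AgentDef {
  prompt: string;
  model?: string;
}

interface ConfigureContext {
  registerAgent(name: string, def: AgentDef): void;
  setEnv(key: string, value: string): void;
  addPromptSection(agent: string, section: string): void;
}

// Illustrative output shape, not the actual opencode.json schema.
interface AssembledConfig {
  agents: Record<string, AgentDef>;
  env: Record<string, string>;
}

function createAssembler(): { ctx: ConfigureContext; assemble: () => AssembledConfig } {
  const agents = new Map<string, AgentDef>();
  const env = new Map<string, string>();
  const sections = new Map<string, string[]>();

  // Plugins receive only this object; they never see the merged config.
  const ctx: ConfigureContext = {
    registerAgent: (name, def) => { agents.set(name, { ...def }); },
    setEnv: (key, value) => { env.set(key, value); },
    addPromptSection: (agent, section) => {
      sections.set(agent, (sections.get(agent) ?? []).concat(section));
    },
  };

  // The assembler is the single place where recorded requests are merged.
  function assemble(): AssembledConfig {
    const config: AssembledConfig = { agents: {}, env: {} };
    agents.forEach((def, name) => {
      const extra = sections.get(name) ?? [];
      config.agents[name] = { ...def, prompt: [def.prompt].concat(extra).join("\n\n") };
    });
    env.forEach((value, key) => { config.env[key] = value; });
    return config;
  }

  return { ctx, assemble };
}
```

Because plugins hold only ctx, two plugins can contribute to the same agent's prompt without either one overwriting the other's work.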

Here's the plugin roster for a typical internal review:

Plugin                              Responsibility
@opencode-reviewer/gitlab           GitLab VCS provider, MR data, MCP comment server
@opencode-reviewer/cloudflare       AI Gateway configuration, model tiers, failback chains
@opencode-reviewer/codex            Internal compliance checking against engineering RFCs
@opencode-reviewer/braintrust       Distributed tracing and observability
@opencode-reviewer/agents-md        Verifies the repo's AGENTS.md is up to date
@opencode-reviewer/reviewer-config  Remote per-reviewer model overrides from a Cloudflare Worker
@opencode-reviewer/telemetry        Fire-and-forget review tracking

All VCS-specific coupling is isolated in a single ci-config.ts file.

Why OpenCode?

OpenCode is structured as a server first, with a text-based UI and desktop app as clients. This let Cloudflare create sessions programmatically, send prompts through an SDK, and collect results from multiple concurrent sessions without working around a CLI.

The orchestration works in two layers:

  1. Coordinator Process: Spawns OpenCode as a child process using Bun.spawn. The coordinator prompt is passed via stdin (not command-line arguments) to avoid the Linux kernel's ARG_MAX limit—Cloudflare hit E2BIG errors on large MRs before switching. The process runs with --format json, outputting JSONL events on stdout.
const proc = Bun.spawn(
  ["bun", opencodeScript, "--print-logs", "--log-level", logLevel,
   "--format", "json", "--agent", "review_coordinator", "run"],
  {
    stdin: Buffer.from(prompt),
    env: {
      ...sanitizeEnvForChildProcess(process.env),
      OPENCODE_CONFIG: process.env.OPENCODE_CONFIG_PATH ?? "",
      BUN_JSC_gcMaxHeapSize: "2684354560", // 2.5 GB heap cap
    },
    stdout: "pipe",
    stderr: "pipe",
  },
);
  2. Review Plugin: Inside OpenCode, a runtime plugin provides the spawn_reviewers tool. When the coordinator LLM decides to review code, it calls this tool, launching sub-reviewer sessions via OpenCode's SDK client:
const createResult = await this.client.session.create({
  body: { parentID: input.parentSessionID },
  query: { directory: dir },
});
// Send the prompt asynchronously (non-blocking)
this.client.session.promptAsync({
  path: { id: task.sessionID },
  body: {
    parts: [{ type: "text", text: promptText }],
    agent: input.agent,
    model: { providerID, modelID },
  },
});

Each sub-reviewer runs in its own OpenCode session with its own agent prompt, free to read source files, run grep, or search the codebase. They return findings as structured XML when finished.

JSONL for Streaming

Cloudflare uses JSONL (JSON Lines) for structured logging. Each line is a valid, self-contained JSON object, so unlike a standard JSON array, a consumer can read the first entry without parsing the whole document. This avoids buffering massive payloads in memory and handles early exits gracefully, since every complete line already written stays parseable. Separately, the header rewrite that appears to apply to outgoing model requests (the cf-aig-* names are AI Gateway headers) looks like:

Stripped:   authorization, cf-access-token, host
Added:      cf-aig-authorization: Bearer 
cf-aig-metadata: {"userId": ""}
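Returning to the JSONL stream: the line-at-a-time property is what makes incremental parsing simple. The reader below is a minimal sketch of that idea. The JsonlReader class is hypothetical, and skipping lines that fail to parse is an assumption, not documented behavior of the Cloudflare pipeline.

```typescript
// Incremental JSONL reader: feed it arbitrary chunks, get back completed records.
class JsonlReader {
  // Holds the trailing partial line until the rest of it arrives.
  private remainder = "";

  push(chunk: string): unknown[] {
    const lines = (this.remainder + chunk).split("\n");
    // The last element is either "" (chunk ended on a newline) or a partial line.
    this.remainder = lines.pop() ?? "";

    const records: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        records.push(JSON.parse(trimmed));
      } catch {
        // Assumption: a line that is not JSON is skipped, not treated as fatal.
      }
    }
    return records;
  }
}
```

A chunk boundary can fall in the middle of a record, so the reader never parses the final segment of a chunk until a newline confirms that the line is complete.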

The streaming pipeline buffers output and flushes every 100 lines or 50ms, which avoids the cost of a synchronous appendFileSync call for every line. It watches for specific triggers: step_finish events to track token usage and cost, error events to drive retry logic, and reason: "length" on a step_finish event, which signals max_tokens truncation and triggers an automatic retry.
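Both halves of that pipeline can be sketched compactly. Everything named here is an assumption: the StreamEvent field layout, the classify function, and the BatchedLog class are hypothetical. The batching logic also checks the 50ms window only when a line is pushed, whereas a production version would likely arm a timer so that an idle buffer still drains.

```typescript
// Hypothetical event shape; the article names the events, not their fields.
interface StreamEvent {
  type: string;
  reason?: string;
}

type Action = "record_usage" | "retry_truncated" | "retry_error" | "ignore";

// Map each event to the reaction described in the text.
function classify(event: StreamEvent): Action {
  if (event.type === "error") return "retry_error";
  if (event.type === "step_finish") {
    // reason "length" means the model stopped at max_tokens.
    return event.reason === "length" ? "retry_truncated" : "record_usage";
  }
  return "ignore";
}

// Collects lines and hands them to the sink in batches.
class BatchedLog {
  private pending: string[] = [];
  private lastFlush: number;
  private sink: (batch: string) => void;
  private now: () => number;
  private maxLines: number;
  private maxAgeMs: number;

  constructor(
    sink: (batch: string) => void,
    now: () => number = () => Date.now(),
    maxLines = 100,
    maxAgeMs = 50,
  ) {
    this.sink = sink;
    this.now = now;
    this.maxLines = maxLines;
    this.maxAgeMs = maxAgeMs;
    this.lastFlush = now();
  }

  push(line: string): void {
    this.pending.push(line);
    const tooMany = this.pending.length >= this.maxLines;
    const tooOld = this.now() - this.lastFlush >= this.maxAgeMs;
    if (tooMany || tooOld) this.flush();
  }

  // Call once more at shutdown so the final partial batch is not lost.
  flush(): void {
    if (this.pending.length > 0) {
      this.sink(this.pending.join("\n") + "\n");
      this.pending = [];
    }
    this.lastFlush = this.now();
  }
}
```

In a real job the sink would wrap the file append, so a burst of 100 lines costs one synchronous write instead of 100.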

A heartbeat log prints "Model is thinking... (Ns since last output)" every 30 seconds to prevent users from canceling jobs that appear hung.
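A heartbeat like this needs only a timestamp of the last output and an interval timer. The sketch below assumes hypothetical helper names (heartbeatMessage, startHeartbeat) and prints on every tick, as the article describes. Whether the real system suppresses the message while output is flowing is not stated.

```typescript
// Format the message quoted in the article from two millisecond timestamps.
function heartbeatMessage(lastOutputMs: number, nowMs: number): string {
  const seconds = Math.floor((nowMs - lastOutputMs) / 1000);
  return `Model is thinking... (${seconds}s since last output)`;
}

// Hypothetical wiring: the caller supplies the last-output timestamp and a logger.
// Returns a function that stops the heartbeat.
function startHeartbeat(
  getLastOutputMs: () => number,
  log: (message: string) => void,
  intervalMs = 30000,
): () => void {
  const timer = setInterval(() => {
    log(heartbeatMessage(getLastOutputMs(), Date.now()));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The stream reader would update the last-output timestamp each time a JSONL event arrives, and the job would call the returned stop function when the child process exits.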

Specialized Agents Over One Big Prompt

Instead of one model with a massive generic prompt, each agent has a tightly scoped prompt with explicit instructions on what to flag and—more importantly—what to ignore. The security reviewer's prompt includes:

## What to Flag
- Injection vulnerabilities (SQL, XSS, command, path traversal)
- Authentication/authorisation bypasses in changed code
- Hardcoded secrets, credentials, or API keys
- Insecure cryptographic usage
- Missing input validation on untrusted data at trust boundaries

## What NOT to Flag
- Theoretical risks that require unlikely preconditions
- Defense-in-depth suggestions when primary defenses are adequate
- Issues in unchanged code that this MR doesn't affect
- "Consider using library X" style suggestions

Telling the LLM what not to do is where the actual prompt engineering value resides. Without these boundaries, you get a firehose of speculative warnings that developers learn to ignore.

Every reviewer produces findings in structured XML with severity: critical (will cause an outage or is exploitable), warning (measurable regression or concrete risk), or suggestion (an improvement worth considering).
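Once findings are parsed out of the XML, the three severity levels give the coordinator a natural ordering. The article says the coordinator agent deduplicates findings but does not say how. The sketch below is a deterministic stand-in that keys on file and line and keeps the most severe entry. The Finding fields and the dedupeFindings function are hypothetical, and a location key is cruder than model judgment because two unrelated issues on the same line would collapse into one.

```typescript
type Severity = "critical" | "warning" | "suggestion";

// Hypothetical fields; the article specifies only the three severity levels.
interface Finding {
  file: string;
  line: number;
  severity: Severity;
  message: string;
  reviewer: string;
}

const RANK: Record<Severity, number> = { critical: 3, warning: 2, suggestion: 1 };

// Keep one finding per file:line, preferring the highest severity,
// then order the survivors from most to least severe.
function dedupeFindings(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (!existing || RANK[f.severity] > RANK[existing.severity]) {
      byLocation.set(key, f);
    }
  }
  const unique: Finding[] = [];
  byLocation.forEach((f) => { unique.push(f); });
  return unique.sort((a, b) => RANK[b.severity] - RANK[a.severity]);
}
```

If a security reviewer and a code-quality reviewer both report the same line, the merged review shows that location once, at the higher of the two severities.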

Key Takeaway

Cloudflare's approach suggests that specialized, orchestrated agents can outperform a monolithic prompt for code review. The plugin architecture keeps VCS and AI provider details isolated, which makes the system easier to adapt to other platforms. If you're building similar tooling, start with a server-first agent like OpenCode, use JSONL for streaming, and invest heavily in telling each agent what not to flag.