Cloudflare's AI Code Review: 7 Agents, One Coordinator, No Noise
Cloudflare's engineering team got tired of waiting hours for code reviews. Their solution: a CI-native orchestration system that launches up to seven specialized AI agents per merge request, managed by a coordinator agent that deduplicates findings and posts a single structured review comment. The system has processed tens of thousands of internal MRs, approving clean code and blocking merges on genuine bugs and security vulnerabilities.
Instead of building a monolithic agent, they built an orchestrator around OpenCode, an open-source coding agent. Cloudflare engineers have contributed over 45 pull requests upstream to OpenCode.
Architecture: Plugin-Based Isolation
The system uses a composable plugin architecture with three lifecycle phases: bootstrap, configure, and postConfigure. Each plugin implements a ReviewPlugin interface. Bootstrap hooks run concurrently and are non-fatal (e.g., a template fetch failure doesn't stop the review). Configure hooks run sequentially and are fatal (if GitLab can't connect, the job stops).
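The difference between the two failure modes can be sketched as follows. This is a minimal illustration, not Cloudflare's implementation: only the `ReviewPlugin` name and the three phase names come from the text, while the hook signatures, the `runLifecycle` helper, and the choice to run `postConfigure` sequentially are assumptions.

```typescript
// Hypothetical plugin shape: only the interface name and phase names are from the source.
interface ReviewPlugin {
  name: string;
  bootstrap?: () => Promise<void>;
  configure?: () => Promise<void>;
  postConfigure?: () => Promise<void>;
}

async function runLifecycle(plugins: ReviewPlugin[]): Promise<string[]> {
  const warnings: string[] = [];

  // Bootstrap: all hooks start at once; a failure is recorded, never rethrown.
  await Promise.all(
    plugins.map(async (p) => {
      try {
        await p.bootstrap?.();
      } catch (err) {
        warnings.push(`${p.name}: bootstrap failed: ${String(err)}`);
      }
    }),
  );

  // Configure: one plugin at a time; the first throw aborts the whole run.
  for (const p of plugins) {
    await p.configure?.();
  }

  // postConfigure: execution order is not described in the source; sequential is assumed.
  for (const p of plugins) {
    await p.postConfigure?.();
  }

  return warnings;
}
```

A caller would typically log the returned warnings and let any exception from `configure` fail the CI job.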
Plugins interact through a ConfigureContext API—they can register agents, add AI providers, set environment variables, inject prompt sections, and alter agent permissions. No plugin has direct access to the final configuration object. The core assembler merges everything into an opencode.json file.
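One way to picture this indirection is a context object that closes over private state, with a separate assemble step producing the merged result. The sketch below is an assumption-heavy illustration: the method names, the `AgentSpec` fields, and the output keys are all hypothetical, since the text lists what plugins can do but not the actual API or the opencode.json schema.

```typescript
// Hypothetical names throughout: the source lists what plugins can do,
// not the real ConfigureContext method names or opencode.json keys.
interface AgentSpec {
  prompt: string;
  permissions: Record<string, "allow" | "deny">;
}

interface ConfigureContext {
  registerAgent(name: string, spec: AgentSpec): void;
  setEnv(key: string, value: string): void;
  addPromptSection(agent: string, section: string): void;
}

interface AssembledConfig {
  agent: Record<string, AgentSpec>;
  env: Record<string, string>;
}

function createAssembler(): { ctx: ConfigureContext; assemble: () => AssembledConfig } {
  // Private state: plugins only ever see `ctx`, never these objects.
  const agents: Record<string, AgentSpec> = {};
  const env: Record<string, string> = {};

  const ctx: ConfigureContext = {
    registerAgent(name, spec) {
      agents[name] = { prompt: spec.prompt, permissions: { ...spec.permissions } };
    },
    setEnv(key, value) {
      env[key] = value;
    },
    addPromptSection(agent, section) {
      const spec = agents[agent];
      if (spec === undefined) throw new Error(`unknown agent: ${agent}`);
      spec.prompt = `${spec.prompt}\n\n${section}`;
    },
  };

  // The assembler hands back a deep copy, so callers cannot mutate plugin state.
  const assemble = (): AssembledConfig =>
    JSON.parse(JSON.stringify({ agent: agents, env })) as AssembledConfig;

  return { ctx, assemble };
}
```

Because plugins receive only `ctx`, the core keeps full control over how contributions are merged and serialized.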
Here's the plugin roster for a typical internal review:
| Plugin | Responsibility |
|---|---|
| @opencode-reviewer/gitlab | GitLab VCS provider, MR data, MCP comment server |
| @opencode-reviewer/cloudflare | AI Gateway configuration, model tiers, failback chains |
| @opencode-reviewer/codex | Internal compliance checking against engineering RFCs |
| @opencode-reviewer/braintrust | Distributed tracing and observability |
| @opencode-reviewer/agents-md | Verifies the repo's AGENTS.md is up to date |
| @opencode-reviewer/reviewer-config | Remote per-reviewer model overrides from a Cloudflare Worker |
| @opencode-reviewer/telemetry | Fire-and-forget review tracking |
All VCS-specific coupling is isolated in a single ci-config.ts file.
Why OpenCode?
OpenCode is structured as a server first, with a text-based UI and a desktop app as clients. This let Cloudflare create sessions programmatically, send prompts through an SDK, and collect results from multiple concurrent sessions without wrapping a CLI.
The orchestration works in two layers:
- Coordinator Process: Spawns OpenCode as a child process using Bun.spawn. The coordinator prompt is passed via stdin (not command-line arguments) to avoid the Linux kernel's ARG_MAX limit; Cloudflare hit E2BIG errors on large MRs before switching. The process runs with --format json, outputting JSONL events on stdout.
    const proc = Bun.spawn(
      ["bun", opencodeScript, "--print-logs", "--log-level", logLevel,
       "--format", "json", "--agent", "review_coordinator", "run"],
      {
        stdin: Buffer.from(prompt), // prompt travels over stdin, not argv
        env: {
          ...sanitizeEnvForChildProcess(process.env),
          OPENCODE_CONFIG: process.env.OPENCODE_CONFIG_PATH ?? "",
          BUN_JSC_gcMaxHeapSize: "2684354560", // 2.5 GB heap cap
        },
        stdout: "pipe", // JSONL events are read from here
        stderr: "pipe",
      },
    );
- Review Plugin: Inside OpenCode, a runtime plugin provides the spawn_reviewers tool. When the coordinator LLM decides to review code, it calls this tool, launching sub-reviewer sessions via OpenCode's SDK client:
    const createResult = await this.client.session.create({
      body: { parentID: input.parentSessionID },
      query: { directory: dir },
    });

    // Send the prompt asynchronously (non-blocking)
    this.client.session.promptAsync({
      path: { id: task.sessionID },
      body: {
        parts: [{ type: "text", text: promptText }],
        agent: input.agent,
        model: { providerID, modelID },
      },
    });
Each sub-reviewer runs in its own OpenCode session with its own agent prompt, free to read source files, run grep, or search the codebase. They return findings as structured XML when finished.
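The fan-out itself can be sketched independently of the SDK. In the example below, `runSession` is a hypothetical stand-in for "create a session, send the prompt, wait for the findings", and treating one reviewer's failure as non-fatal to the others is an assumption, not something the source states.

```typescript
// `runSession` is a hypothetical stand-in for the real session calls.
type RunSession = (agent: string) => Promise<string>;

interface ReviewerOutcome {
  agent: string;
  ok: boolean;
  output: string; // findings on success, error text on failure
}

async function fanOutReviewers(
  agents: string[],
  runSession: RunSession,
): Promise<ReviewerOutcome[]> {
  // All reviewers start together; results come back in the same order as `agents`.
  return Promise.all(
    agents.map(async (agent): Promise<ReviewerOutcome> => {
      try {
        return { agent, ok: true, output: await runSession(agent) };
      } catch (err) {
        // Assumption: one failed reviewer should not discard the others' findings.
        return { agent, ok: false, output: String(err) };
      }
    }),
  );
}
```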
JSONL for Streaming
Cloudflare uses JSONL (JSON Lines) for structured logging. Each line is a valid, self-contained JSON object, so unlike a standard JSON array you don't need to parse the whole document to read the first entry. This avoids buffering massive payloads in memory, and a process that exits early still leaves every completed line readable. In practice, the output looks like:
    Stripped: authorization, cf-access-token, host
    Added: cf-aig-authorization: Bearer
    cf-aig-metadata: {"userId": ""}
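The line-at-a-time property is what makes incremental parsing straightforward. Here is a minimal sketch, assuming stdout arrives in arbitrary chunks that may split a line in half. `JsonlSplitter` is a hypothetical name, and silently skipping lines that are not valid JSON is an assumption rather than documented behavior.

```typescript
class JsonlSplitter {
  // Holds the trailing, not-yet-terminated part of the last chunk.
  private partial = "";

  // Feed one chunk of text; returns every complete JSON value it finished.
  push(chunk: string): unknown[] {
    const pieces = (this.partial + chunk).split("\n");
    // The last piece has no newline yet, so keep it for the next chunk.
    this.partial = pieces.pop() ?? "";

    const events: unknown[] = [];
    for (const line of pieces) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        events.push(JSON.parse(trimmed));
      } catch (err) {
        // Assumption: non-JSON lines (e.g. stray log text) are ignored.
      }
    }
    return events;
  }
}
```

Each call does work proportional to the new chunk only, so memory use stays bounded by the longest single line.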
The streaming pipeline buffers output and flushes every 100 lines or 50ms, which avoids paying for a slow appendFileSync call on every line. It watches for specific triggers: step_finish events to track token usage and costs, error events for retry logic, and reason: "length" in step_finish to detect max_tokens truncation and automatically retry.
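The "100 lines or 50ms, whichever comes first" rule can be sketched as a small buffer with a count threshold and a one-shot timer. The class name, constructor parameters, and the injected `sink` callback are hypothetical; in a real pipeline the sink would presumably wrap a single file append per batch.

```typescript
class BatchedLineWriter {
  private lines: string[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly sink: (batch: string) => void;
  private readonly maxLines: number;
  private readonly maxDelayMs: number;

  constructor(sink: (batch: string) => void, maxLines = 100, maxDelayMs = 50) {
    this.sink = sink; // hypothetical: e.g. one append to the log file per batch
    this.maxLines = maxLines;
    this.maxDelayMs = maxDelayMs;
  }

  write(line: string): void {
    this.lines.push(line);
    if (this.lines.length >= this.maxLines) {
      this.flush(); // size threshold reached
    } else if (this.timer === undefined) {
      // First line of a new batch: make sure it is flushed within maxDelayMs.
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.lines.length === 0) return;
    const batch = this.lines.join("\n") + "\n";
    this.lines = [];
    this.sink(batch);
  }
}
```

Calling `flush()` once more when the child process exits ensures the final partial batch is not lost.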
A heartbeat log prints "Model is thinking... (Ns since last output)" every 30 seconds to prevent users from canceling jobs that appear hung.
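A heartbeat of this kind needs little more than a timestamp of the last output and an interval timer. The sketch below assumes the caller tracks that timestamp; the function names and parameters are hypothetical, and only the message text and the 30-second interval come from the description above.

```typescript
// Formats the heartbeat message from two millisecond timestamps.
function heartbeatLine(lastOutputAtMs: number, nowMs: number): string {
  const seconds = Math.floor((nowMs - lastOutputAtMs) / 1000);
  return `Model is thinking... (${seconds}s since last output)`;
}

// Starts the periodic log and returns a function that stops it.
function startHeartbeat(
  getLastOutputAtMs: () => number,
  log: (line: string) => void,
  intervalMs = 30000,
): () => void {
  const timer = setInterval(() => {
    log(heartbeatLine(getLastOutputAtMs(), Date.now()));
  }, intervalMs);
  return () => clearInterval(timer);
}
```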
Specialized Agents Over One Big Prompt
Instead of one model with a massive generic prompt, each agent has a tightly scoped prompt with explicit instructions on what to flag and—more importantly—what to ignore. The security reviewer's prompt includes:
    ## What to Flag
    - Injection vulnerabilities (SQL, XSS, command, path traversal)
    - Authentication/authorisation bypasses in changed code
    - Hardcoded secrets, credentials, or API keys
    - Insecure cryptographic usage
    - Missing input validation on untrusted data at trust boundaries

    ## What NOT to Flag
    - Theoretical risks that require unlikely preconditions
    - Defense-in-depth suggestions when primary defenses are adequate
    - Issues in unchanged code that this MR doesn't affect
    - "Consider using library X" style suggestions
Telling the LLM what not to do is where the actual prompt engineering value resides. Without these boundaries, you get a firehose of speculative warnings that developers learn to ignore.
Every reviewer produces findings in structured XML with severity: critical (will cause an outage or is exploitable), warning (measurable regression or concrete risk), or suggestion (an improvement worth considering).
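Once findings are parsed out of the XML, the three severity levels map naturally onto a small typed model. The sketch below is purely illustrative: the `Finding` fields are hypothetical, and the merge rule (same file and line collapses to the most severe entry) is an assumed stand-in for the deduplication that the coordinator agent performs through its prompt.

```typescript
type Severity = "critical" | "warning" | "suggestion";

// Hypothetical finding shape; the source does not specify the XML fields.
interface Finding {
  severity: Severity;
  file: string;
  line: number;
  message: string;
  reviewer: string;
}

// Lower rank means more severe.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  warning: 1,
  suggestion: 2,
};

// Assumed rule: findings on the same file and line collapse to the most severe one.
function mergeFindings(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (
      existing === undefined ||
      SEVERITY_RANK[f.severity] < SEVERITY_RANK[existing.severity]
    ) {
      byLocation.set(key, f);
    }
  }
  // Most severe findings first in the final comment.
  return Array.from(byLocation.values()).sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity],
  );
}
```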
Key Takeaway
Cloudflare's experience makes a strong case that specialized, orchestrated agents work better than a single monolithic prompt for code review. The plugin architecture keeps VCS and AI provider specifics swappable. If you're building similar tooling, start with a server-first agent like OpenCode, use JSONL for streaming, and invest heavily in telling each agent what not to flag.






