GPT-5.5 Codex Responses Cluster at Exactly 516 Reasoning Tokens

A GitHub issue (openai/codex#30364) reports that GPT-5.5 responses in Codex disproportionately land at exactly 516 reasoning tokens, with secondary spikes at 1034 and 1552. The analysis covers 390,195 response-level token records from 865 sessions between February 1 and June 27, 2026.

The Numbers Don't Lie

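GPT-5.5 accounts for only 19.3% of all responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio (the share of responses with at least 516 reasoning tokens that stop at exactly 516) is 44.0%, compared to 1.3% for all other models combined, a 33.6x difference. The clustering is concentrated in two models, GPT-5.5 and to a lesser extent GPT-5.4: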

Model | Response records | Exact 516 / >=516
gpt-5.5 | 75,401 | 44.0%
gpt-5.4 | 25,214 | 19.8%
gpt-5.2 | 247,575 | 0.34%
gpt-5.3-codex | 13,333 | 0.0%
gpt-5.3-codex-spark | 26,179 | 0.0%
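
As a quick consistency check on the figures above, the arithmetic can be redone from the published numbers. This is only a sketch using the rounded values quoted in this article; the variable names are illustrative. The five listed models do not add up to the full 390,195 records, so a small remainder presumably belongs to models not shown, and the rounded ratios give a lift of about 33.8x, which suggests the 33.6x figure was computed from unrounded ratios.

```python
# Figures copied from the article; nothing here is new data.
records = {
    "gpt-5.5": 75_401,
    "gpt-5.4": 25_214,
    "gpt-5.2": 247_575,
    "gpt-5.3-codex": 13_333,
    "gpt-5.3-codex-spark": 26_179,
}
TOTAL_RECORDS = 390_195  # total stated in the issue summary

# Share of all response records that belong to gpt-5.5.
share_gpt55 = records["gpt-5.5"] / TOTAL_RECORDS

# Records not attributed to any of the five listed models.
unlisted = TOTAL_RECORDS - sum(records.values())

# Lift computed from the rounded ratios quoted in the text (44.0% vs 1.3%).
lift = 44.0 / 1.3

print(f"gpt-5.5 share of responses: {share_gpt55:.1%}")
print(f"records outside the five listed models: {unlisted}")
print(f"lift from rounded ratios: {lift:.1f}x")
```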

The anomaly grew over time. The exact-516 / >=516 ratio rose from 0.11% in February to 53.30% in May, then fell back to 35.84% in June.

Declining Reasoning Intensity

As the clustering increased, overall reasoning-token intensity decreased. Mean reasoning tokens fell from 268.1 in February to 106.9 in May, while P90 dropped from 772 to 344. June saw a partial recovery to a mean of 168.5 and a P90 of 515, but both figures remained below February levels.

Month | Mean reasoning tokens | P90 reasoning tokens
Feb 2026 | 268.1 | 772
Mar 2026 | 256.8 | 723
Apr 2026 | 228.7 | 669
May 2026 | 106.9 | 344
Jun 2026 | 168.5 | 515
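
To track the same two statistics on your own usage, a minimal sketch along these lines may help. It assumes you can export (month, reasoning_tokens) pairs from your logs; the function names and the sample data are made up for illustration. The issue does not say which percentile convention it used, so the nearest-rank method here is an assumption and may differ slightly from the published P90 values.

```python
from collections import defaultdict
from statistics import mean

def p90_nearest_rank(values):
    """90th percentile by the nearest-rank method (one of several conventions)."""
    ordered = sorted(values)
    # ceil(0.9 * n) in integer arithmetic, as a 1-based rank.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]

def monthly_reasoning_stats(records):
    """records: iterable of (month, reasoning_tokens) pairs, e.g. ("2026-02", 516)."""
    by_month = defaultdict(list)
    for month, tokens in records:
        by_month[month].append(tokens)
    return {
        month: {"mean": mean(vals), "p90": p90_nearest_rank(vals)}
        for month, vals in sorted(by_month.items())
    }

# Toy data, not taken from the issue.
sample = [("2026-02", t) for t in (0, 100, 200, 300, 400, 500, 600, 700, 800, 900)]
sample += [("2026-05", t) for t in (0, 0, 50, 100, 516)]
stats = monthly_reasoning_stats(sample)

for month, row in stats.items():
    print(month, row)
```

A falling mean together with a P90 that settles near 516 would match the shape described in the table.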

Why This Looks Suspicious

The fixed values (516, 1034, and 1552) are evenly spaced, exactly 518 tokens apart. That regularity suggests threshold boundaries rather than natural variation. The issue author notes this is not proof of hidden chain-of-thought truncation, but the pattern is consistent with a "thresholded reasoning-budget behavior."
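
The spacing can be checked directly. The sketch below confirms that the three reported values fit the form 518k - 2 for k = 1, 2, 3; that formula is only a description of these three points, not a documented mechanism. The helper `nearest_spike` is a hypothetical convenience for seeing how close one of your own token counts sits to a reported spike.

```python
SPIKES = [516, 1034, 1552]  # values reported in the issue

# Differences between consecutive spikes.
gaps = [b - a for a, b in zip(SPIKES, SPIKES[1:])]

# Do all three values match 518 * k - 2 for k = 1, 2, 3?
fits_pattern = all(v == 518 * k - 2 for k, v in enumerate(SPIKES, start=1))

def nearest_spike(tokens, spikes=SPIKES):
    """Return (spike, distance) for the reported spike closest to `tokens`."""
    spike = min(spikes, key=lambda s: abs(s - tokens))
    return spike, abs(spike - tokens)

print(gaps)
print(fits_pattern)
print(nearest_spike(1030))
```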

Related issue #29353 reported a task-level reproduction where GPT-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. This new analysis adds aggregate evidence across a larger time window.

Potential Causes

The issue asks the Codex team to investigate whether GPT-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior causing termination around these fixed token counts. Possible explanations include:

  • Budget cap: A hard limit on reasoning tokens per response.
  • Degraded tier: A fallback path that truncates reasoning.
  • Scheduler issue: A load-balancing mechanism that caps reasoning under high demand.

What Developers Should Do

If you're using Codex with GPT-5.5 for complex tasks, watch for responses ending exactly at 516 reasoning tokens. You can check this in Codex telemetry or token_count metadata. If you see this pattern, consider:

  1. Verify with other models: Run the same task on GPT-5.2 or GPT-5.4 and compare reasoning token distributions.
  2. Monitor quality: Separate exact-516 responses from longer-reasoning ones and evaluate correctness.
  3. Report anomalies: File issues with exact token counts and session IDs.
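
The second step above could be sketched as follows. It assumes you already have your own correctness judgment for each response (Codex does not report one); the `Response` type, its fields, and the sample rows are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Response:
    model: str
    reasoning_tokens: int
    correct: bool  # outcome of your own evaluation, not something Codex reports

def accuracy_by_bucket(responses, spike=516):
    """Split responses into below / exact / above the spike and return accuracy per bucket."""
    buckets = {"below": [], "exact": [], "above": []}
    for r in responses:
        if r.reasoning_tokens == spike:
            key = "exact"
        elif r.reasoning_tokens > spike:
            key = "above"
        else:
            key = "below"
        buckets[key].append(r.correct)
    # None marks an empty bucket, so it is not mistaken for 0% accuracy.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}

# Toy data for illustration only.
sample = [
    Response("gpt-5.5", 516, False),
    Response("gpt-5.5", 516, True),
    Response("gpt-5.5", 900, True),
    Response("gpt-5.2", 120, True),
]
result = accuracy_by_bucket(sample)
print(result)
```

Comparing the "exact" bucket against "above" is the most direct test of whether stopping at 516 tokens coincides with worse answers; with small samples the difference may simply be noise.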

To check your own data, you can query token_count events by model and compute the ratio of exact-516 to >=516:

-- Pseudocode for token_count analysis; table and column names are placeholders, adapt them to your own schema
SELECT model,
       COUNT(*) AS total_responses,
       SUM(CASE WHEN reasoning_output_tokens = 516 THEN 1 ELSE 0 END) AS exact_516,
       SUM(CASE WHEN reasoning_output_tokens >= 516 THEN 1 ELSE 0 END) AS at_least_516,
       ROUND(1.0 * SUM(CASE WHEN reasoning_output_tokens = 516 THEN 1 ELSE 0 END) /
             NULLIF(SUM(CASE WHEN reasoning_output_tokens >= 516 THEN 1 ELSE 0 END), 0), 4) AS ratio
FROM token_count_events
WHERE timestamp >= '2026-02-01' AND timestamp < '2026-07-01'
GROUP BY model;
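
To try the query without a data warehouse, one option is to run it against an in-memory SQLite database from Python. This is a sketch under assumptions: the table layout mirrors the placeholder names in the query, the rows are invented, and an ORDER BY is added only to make the output order stable. Note that a model with no responses at or above 516 tokens gets a NULL ratio (None in Python) because of the NULLIF guard.

```python
import sqlite3

QUERY = """
SELECT model,
       COUNT(*) AS total_responses,
       SUM(CASE WHEN reasoning_output_tokens = 516 THEN 1 ELSE 0 END) AS exact_516,
       SUM(CASE WHEN reasoning_output_tokens >= 516 THEN 1 ELSE 0 END) AS at_least_516,
       ROUND(1.0 * SUM(CASE WHEN reasoning_output_tokens = 516 THEN 1 ELSE 0 END) /
             NULLIF(SUM(CASE WHEN reasoning_output_tokens >= 516 THEN 1 ELSE 0 END), 0), 4) AS ratio
FROM token_count_events
WHERE timestamp >= '2026-02-01' AND timestamp < '2026-07-01'
GROUP BY model
ORDER BY model;
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_count_events "
    "(model TEXT, timestamp TEXT, reasoning_output_tokens INTEGER)"
)

# Invented rows for illustration; timestamps are ISO strings so text comparison works.
rows = [
    ("gpt-5.5", "2026-03-01T10:00:00", 516),
    ("gpt-5.5", "2026-03-02T10:00:00", 516),
    ("gpt-5.5", "2026-04-01T10:00:00", 1034),
    ("gpt-5.5", "2026-04-02T10:00:00", 200),
    ("gpt-5.5", "2026-07-02T10:00:00", 516),  # outside the date window, ignored
    ("gpt-5.2", "2026-03-01T10:00:00", 100),
    ("gpt-5.2", "2026-03-05T10:00:00", 300),
]
conn.executemany("INSERT INTO token_count_events VALUES (?, ?, ?)", rows)

results = conn.execute(QUERY).fetchall()
for row in results:
    print(row)
conn.close()
```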

The Bigger Picture

This may be more than a Codex-specific bug: it raises questions about how reasoning budgets are managed across model versions. If GPT-5.5 is silently capping reasoning, developers building on it may see inconsistent quality. The issue asks OpenAI for more transparency about this behavior.

Next Steps

  • File a report: If you see similar patterns, add to the GitHub issue.
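  • Pin models: For critical tasks, consider pinning to GPT-5.2 or GPT-5.4 until resolved. Note that GPT-5.4 also shows a 19.8% exact-516 / >=516 ratio in the table above, so GPT-5.2 is the less affected of the two.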
  • Monitor telemetry: Track reasoning token distributions in your own usage.

The Codex team has not yet responded. Watch the issue for updates.