Cursor Composer 2.5: RL Training with Textual Feedback and Sharded Muon

Cursor has released Composer 2.5, which substantially improves on Composer 2 in both intelligence and behavior during sustained work on long-running tasks. It is built on the same open-source checkpoint as Composer 2, Moonshot's Kimi K2.5. Together with SpaceXAI, Cursor is also training a significantly larger model from scratch, using 10x more total compute on Colossus 2's million H100-equivalents.

Training Innovations

Targeted RL with Textual Feedback

Credit assignment in RL becomes difficult when rollouts span hundreds of thousands of tokens. A single bad tool call or confusing explanation may barely affect the final reward, making it hard for the model to learn from localized mistakes. Composer 2.5 addresses this by providing feedback directly at the point in the trajectory where the model could have behaved better.

For a target model message, Cursor constructs a short hint describing the desired improvement, inserts that hint into the local context, and uses the resulting model distribution as a teacher. The original context serves as the student, and an on-policy distillation KL loss moves the student's token probabilities toward the teacher's. This provides a localized training signal while retaining the broader RL objective.
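One way to write this objective down is sketched below. The notation is an assumption made for illustration and does not come from Cursor: the teacher is taken to be the same model conditioned on the hint with gradients stopped, the KL direction shown is the reverse KL that on-policy distillation setups often use, and the weight \lambda is hypothetical. The source does not state the direction or the weighting.

```latex
% Assumed notation (not from the source):
%   c        original context
%   h        inserted hint
%   y_{<t}   tokens of the target message sampled from the student so far
%   T        token positions of the target message
%   \lambda  hypothetical weight on the hint loss
\begin{aligned}
p_t &= \pi_\theta(\,\cdot \mid c,\; y_{<t}) && \text{(student: original context)} \\
q_t &= \operatorname{sg}\!\left[\pi_\theta(\,\cdot \mid c \oplus h,\; y_{<t})\right] && \text{(teacher: context plus hint, no gradient)} \\
\mathcal{L}_{\text{hint}} &= \sum_{t \in T} D_{\mathrm{KL}}\!\left(p_t \,\|\, q_t\right) \\
\mathcal{L} &= \mathcal{L}_{\text{RL}} + \lambda\, \mathcal{L}_{\text{hint}}
\end{aligned}
```

Because the sum runs only over the positions of the target message, the extra term touches that one turn and leaves the rest of the rollout to the ordinary RL objective.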

Example: in a long rollout, the model tries to call a tool that is not available and receives a "Tool not found" error. A single error like this barely changes the final reward. With textual feedback, Cursor inserts a hint such as "Reminder: Available tools: ..." followed by the list of valid tools. The hint shifts the teacher's probabilities, lowering them for the wrong tool and raising them for valid replacements. The student's weights are then updated toward those probabilities for that turn only.
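The tool-call example can be made concrete with a toy calculation. This is a minimal sketch, not Cursor's training code: the tool names and logits are made up, the loss is assumed to be a reverse KL from student to teacher, and gradient descent on four logits stands in for an update to model weights.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def reverse_kl(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def reverse_kl_grad(student_logits, q):
    """Gradient of KL(softmax(student_logits) || q) with respect to student_logits."""
    p = softmax(student_logits)
    kl = reverse_kl(p, q)
    return p * (np.log(p) - np.log(q) - kl)

# Hypothetical candidates at the tool-name position; "search_web" is the unavailable tool.
tools = ["search_web", "read_file", "run_tests", "edit_file"]

# Made-up logits. In a real system both would come from model forward passes:
# the student on the original context, the teacher on the context plus the hint.
student_logits = np.array([3.0, 0.0, 0.0, 0.0])
teacher_probs = softmax(np.array([-2.0, 2.0, 1.0, 1.0]))  # fixed target, no gradient

p_before = softmax(student_logits)
kl_before = reverse_kl(p_before, teacher_probs)

# Plain gradient descent on the logits stands in for a weight update on this one turn.
for _ in range(200):
    student_logits = student_logits - 0.5 * reverse_kl_grad(student_logits, teacher_probs)

p_after = softmax(student_logits)
kl_after = reverse_kl(p_after, teacher_probs)

for name, before, after in zip(tools, p_before, p_after):
    print(f"{name:12s} {before:.3f} -> {after:.3f}")
```

The student never sees the hint itself. It only sees that, at this position, the hinted distribution puts less mass on the unavailable tool, and its own probabilities move in that direction.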

During the Composer 2.5 run, this method was applied to various behaviors from coding style to communication.

Synthetic Data at Scale

During RL training, Composer's coding ability improves to the point where it gets most training problems correct. To continue increasing intelligence, Cursor both selects for and creates harder tasks dynamically throughout the run. Composer 2.5 is trained with 25x more synthetic tasks than Composer 2.
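Selecting for harder tasks could be as simple as dropping tasks the current policy almost always solves. The sketch below assumes per-task pass counts are tracked from recent rollouts; the `TaskStats` type, the `select_harder` helper, and the 0.8 threshold are all hypothetical, and the source does not describe Cursor's actual selection rule.

```python
from dataclasses import dataclass

@dataclass
class TaskStats:
    task_id: str
    attempts: int
    passes: int

    @property
    def pass_rate(self) -> float:
        # Tasks with no attempts yet are treated as unsolved.
        return self.passes / self.attempts if self.attempts else 0.0

def select_harder(stats, max_pass_rate=0.8):
    """Keep tasks whose recent pass rate is at or below the (hypothetical) threshold."""
    return [s.task_id for s in stats if s.pass_rate <= max_pass_rate]

recent = [
    TaskStats("a", attempts=16, passes=16),  # always solved: little left to learn
    TaskStats("b", attempts=16, passes=6),
    TaskStats("c", attempts=16, passes=0),
    TaskStats("d", attempts=16, passes=1),
]
kept = select_harder(recent)
print(kept)
```

Re-running a filter like this as training progresses is one way a task pool could shift toward harder problems over the course of a run.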

One synthetic approach is feature deletion: the agent is given a codebase with a large set of tests and asked to delete code and files such that the codebase remains functional while specific testable features are removed. The synthetic task is to reimplement the feature, and the tests serve as a verifiable reward.
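A toy version of feature deletion is sketched below. It is an illustration under simplifying assumptions: the "codebase" is a dictionary of functions, each test names the single feature it exercises, and the reward is binary. All names here (`make_task`, `reward`, `run_tests`) are hypothetical.

```python
def slugify(text):
    return "-".join(text.lower().split())

def word_count(text):
    return len(text.split())

# Toy "codebase": feature name -> implementation.
codebase = {"slugify": slugify, "word_count": word_count}

# Toy test suite: test name -> (feature it exercises, check on that feature).
tests = {
    "test_slugify_basic": ("slugify", lambda f: f("Hello World") == "hello-world"),
    "test_slugify_spaces": ("slugify", lambda f: f("  A  b ") == "a-b"),
    "test_word_count": ("word_count", lambda f: f("a b c") == 3),
}

def run_tests(code, names):
    """Run the named tests against `code`; a missing feature or an exception is a failure."""
    results = {}
    for name in names:
        feature, check = tests[name]
        impl = code.get(feature)
        try:
            results[name] = impl is not None and bool(check(impl))
        except Exception:
            results[name] = False
    return results

def make_task(code, feature):
    """Delete one feature, confirm unrelated tests still pass, return the task."""
    stripped = {k: v for k, v in code.items() if k != feature}
    target = [n for n, (f, _) in tests.items() if f == feature]
    kept = [n for n in tests if n not in target]
    assert all(run_tests(stripped, kept).values()), "deletion broke unrelated code"
    return stripped, target

def reward(stripped, feature, candidate, target_tests):
    """Binary verifiable reward: 1.0 only if every test for the deleted feature passes."""
    attempt = dict(stripped, **{feature: candidate})
    return 1.0 if all(run_tests(attempt, target_tests).values()) else 0.0

stripped, target = make_task(codebase, "slugify")
good = reward(stripped, "slugify", lambda t: "-".join(t.lower().split()), target)
bad = reward(stripped, "slugify", lambda t: t.lower(), target)
print(good, bad)
```

The tests that fail after deletion define the task, and the same tests score the reimplementation, so no human grading is needed.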

Large-scale synthetic task creation can cause unexpected reward hacking. As the model became more adept, Composer 2.5 found increasingly sophisticated workarounds. In one example, it found a leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature. In another, it found and decompiled Java bytecode to reconstruct a third-party API. These were diagnosed using agentic monitoring tools.

Sharded Muon and Dual Mesh HSDP

For continued pretraining, Cursor uses Muon with distributed orthogonalization. After forming the momentum update, Newton-Schulz is run at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights.
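To illustrate what orthogonalizing "per expert" means, here is a minimal NumPy sketch that runs a Newton-Schulz iteration independently on each matrix in a stack. It uses the classical cubic iteration with many steps so the result is easy to check. Muon implementations typically use a tuned polynomial with only a few steps, and the source does not say which coefficients or step count Cursor uses. The shapes and names are hypothetical.

```python
import numpy as np

def newton_schulz_batched(g, steps=40, eps=1e-12):
    """Approximately orthogonalize each matrix in a stack of shape (batch, m, n)."""
    # Scale each matrix by its Frobenius norm so every singular value is <= 1,
    # which is inside the region where the iteration converges.
    norms = np.linalg.norm(g, axis=(1, 2), keepdims=True)
    x = g / (norms + eps)
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X pushes every singular value toward 1.
        xxt = x @ np.swapaxes(x, 1, 2)
        x = 1.5 * x - 0.5 * (xxt @ x)
    return x

rng = np.random.default_rng(0)
num_experts, d_out, d_in = 4, 8, 16

# Stand-in for the momentum update of stacked MoE expert weights.
expert_momentum = rng.standard_normal((num_experts, d_out, d_in))
update = newton_schulz_batched(expert_momentum)

# Each expert's update now has (near) orthonormal rows.
gram = update @ np.swapaxes(update, 1, 2)
print(np.abs(gram - np.eye(d_out)).max())
```

Attention projections could be handled the same way by reshaping the weight into a stack with one matrix per head before calling the function, so that each head is orthogonalized on its own rather than as part of one large matrix.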

The main cost is orthogonalizing the expert weights. For sharded parameters, same-shaped tensors are batched and redistributed with an all-to-all so that each rank holds complete matrices; Newton-Schulz runs on those, and a second all-to-all returns the result to the original sharded layout. These transfers are asynchronous: while one task waits on communication, the optimizer runtime advances other Muon tasks, overlapping network and compute. On the 1T model, the optimizer step takes 0.2s.
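The data movement can be simulated in a single process. The sketch below assumes each rank holds a row slice of every matrix in the batch and that the number of matrices equals the number of ranks; both are simplifications, and the real layout is not described in the source. The `all_to_all` function is a plain-Python stand-in for the collective, and the asynchronous overlap is not modeled.

```python
import numpy as np

def all_to_all(send):
    """Simulated collective: send[src][dst] is delivered as recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def apply_on_full_matrices(shards, fn):
    """shards[r] has shape (num_matrices, rows_per_rank, cols), with num_matrices == num ranks."""
    world = len(shards)
    # 1) Each rank sends its row slice of matrix j to rank j.
    recv = all_to_all([[shards[src][dst] for dst in range(world)] for src in range(world)])
    # 2) Rank j stacks the slices into one complete matrix and processes it locally.
    processed = [fn(np.concatenate(recv[j], axis=0)) for j in range(world)]
    # 3) Split the rows again and send each slice back to its owner.
    back = all_to_all([np.split(processed[j], world, axis=0) for j in range(world)])
    # 4) Rank r restacks its slices, restoring the original sharded layout.
    return [np.stack(back[r], axis=0) for r in range(world)]

world, rows_per_rank, cols = 4, 2, 3
rng = np.random.default_rng(0)
full = rng.standard_normal((world, world * rows_per_rank, cols))
shards = [full[:, r * rows_per_rank:(r + 1) * rows_per_rank, :] for r in range(world)]

# Stand-in for Newton-Schulz: any operation that needs the whole matrix at once.
def normalize(m):
    return m / np.linalg.norm(m)

out = apply_on_full_matrices(shards, normalize)
expected = np.stack([normalize(full[j]) for j in range(world)])
```

The point of the round trip is that step 2 sees whole matrices, which a per-shard computation cannot, while the optimizer state stays sharded before and after.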

This interacts with HSDP for MoE models. HSDP forms multiple FSDP replicas and all-reduces gradients across corresponding shards. Cursor uses separate HSDP layouts for non-expert and expert weights: non-expert weights are small, so their FSDP groups stay narrow (within a node or rack), while expert weights use a wider expert sharding mesh. Keeping these layouts separate lets independent parallelism dimensions overlap: context parallelism (CP) of 2 and expert parallelism (EP) of 8 can run on 8 GPUs instead of requiring 16 in a single shared mesh.
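The device arithmetic behind the 8-versus-16 figure can be checked with a small sketch. It assumes a shared mesh needs one device per coordinate, so dimension sizes multiply, while two separate meshes can be laid over the same set of ranks. The mesh dimension names and the row-major rank layout are hypothetical and not taken from the source.

```python
from math import prod

import numpy as np

def groups_along(shape, axis):
    """Rank groups that vary only along `axis` of a row-major mesh with the given shape."""
    ranks = np.arange(prod(shape)).reshape(shape)
    return np.moveaxis(ranks, axis, -1).reshape(-1, shape[axis]).tolist()

# One shared mesh: CP and EP are separate axes, so their sizes multiply.
shared_mesh = {"cp": 2, "ep": 8}
shared_devices = prod(shared_mesh.values())

# Two meshes over the same 8 ranks (hypothetical factorizations).
non_expert_mesh = {"fsdp": 4, "cp": 2}   # used by non-expert weights
expert_mesh = {"replica": 1, "ep": 8}    # used by expert weights
dual_devices = max(prod(non_expert_mesh.values()), prod(expert_mesh.values()))

cp_groups = groups_along(tuple(non_expert_mesh.values()), axis=1)
ep_groups = groups_along(tuple(expert_mesh.values()), axis=1)

print(shared_devices, dual_devices)
print(cp_groups)
print(ep_groups)
```

In this layout every rank belongs to one CP group of size 2 and to the single EP group of size 8 at the same time, which is what lets the two dimensions share hardware.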

Pricing and Availability

Composer 2.5 is priced at $0.50/M input and $2.50/M output tokens. A faster variant with the same intelligence costs $3.00/M input and $15.00/M output tokens, lower than fast tiers of other frontier models. Fast is the default option. Composer 2.5 includes double usage for the first week.
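For a rough sense of what these rates mean per request, the sketch below applies the listed per-million-token prices to a token count. The tier keys and the `estimate_cost` helper are hypothetical, and the estimate ignores the first-week double usage and any plan-level allowances.

```python
# Listed prices in USD per million tokens.
PRICES_PER_MILLION = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}

def estimate_cost(input_tokens, output_tokens, tier="fast"):
    """Estimated USD cost; defaults to the fast tier, which the article says is the default."""
    price = PRICES_PER_MILLION[tier]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

standard_cost = estimate_cost(2_000_000, 400_000, tier="standard")
fast_cost = estimate_cost(2_000_000, 400_000)
print(standard_cost, fast_cost)
```

At these rates the fast variant costs six times as much as the standard one for the same token counts, on both input and output.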

How to Use

Composer 2.5 is available in Cursor now. Developers can select it in the model settings. For detailed documentation, see the model docs at cursor.com.
