Magenta RealTime 2: Live AI Music Model Runs on MacBook at 40ms Latency

Magenta RealTime 2 Ships Open-Weight Live Music Model with 40ms Frame Latency

Google's Magenta team has released Magenta RealTime 2 (MRT2), an open-weights model and inference engine for real-time AI music generation. The 2.4B-parameter model runs on Apple Silicon MacBooks with a 40ms frame size and roughly 200ms control latency, about a 15x improvement over the roughly 3s latency of the first version.

Unlike offline generative music models that process a prompt into a static track, MRT2 is a live, interactive instrument. It accepts MIDI, text, and audio inputs, and generates audio continuously with low latency. The model is released under an open license, along with a C++ inference engine, a Python library, and example applications.

Architecture: Frame-Level Autoregression with Sliding Window Attention

MRT2 is a codec language model using the SpectroStream codec to compress 48kHz stereo audio into tokens at 3 kbps (25 Hz frame rate, 12 residual vector quantization tokens per frame, vocabulary size 1024). The key architectural change from the original Magenta RealTime is moving from chunk-level to frame-level autoregression.
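The codec figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in the article (25 Hz, 12 tokens per frame, vocabulary of 1024); the variable names are illustrative and not part of any MRT2 API.

```python
import math

FRAME_RATE_HZ = 25       # frames per second (from the article)
TOKENS_PER_FRAME = 12    # RVQ tokens per frame (from the article)
VOCAB_SIZE = 1024        # codebook entries per token (from the article)

bits_per_token = math.log2(VOCAB_SIZE)                 # bits needed to index one codebook entry
frame_ms = 1000 / FRAME_RATE_HZ                        # duration of one frame in milliseconds
tokens_per_second = FRAME_RATE_HZ * TOKENS_PER_FRAME   # tokens the model must emit per second
bitrate_bps = tokens_per_second * bits_per_token       # token stream bitrate in bits per second

print(f"{frame_ms:.0f} ms per frame, {tokens_per_second} tokens/s, "
      f"{bitrate_bps / 1000:.0f} kbps")
```

Each token carries 10 bits, so 300 tokens per second gives the quoted 3 kbps, and a 25 Hz frame rate gives the 40ms frame size.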

The original MRT processed 2-second chunks (400 tokens) at a time, creating a minimum 2-second control delay. MRT2 processes individual frames (12 tokens, 40ms) using a decoder-only Transformer with causal sliding window attention. This removes the chunk-sized bottleneck: MIDI, text, and audio conditioning is aligned to frames and injected at every step, so the model can react within a single frame.
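To make the frame-level loop concrete, here is a toy sketch in which the controls are read once per 40ms frame. The function toy_step is a hypothetical stand-in that simply echoes the control value; it is not the MRT2 model or its API, and only the frame size and token count come from the article.

```python
from typing import Callable, Iterator, List, Tuple

FRAME_MS = 40            # one frame at 25 Hz (from the article)
TOKENS_PER_FRAME = 12    # RVQ tokens per frame (from the article)


def toy_step(history: List[List[int]], control: int) -> List[int]:
    """Hypothetical stand-in for one decoder step; not the MRT2 model.

    It ignores the history and echoes the control value into every token,
    so the effect of the conditioning is easy to see.
    """
    return [control] * TOKENS_PER_FRAME


def generate(read_control: Callable[[int], int],
             num_frames: int) -> Iterator[Tuple[int, List[int]]]:
    """Frame-level autoregression: sample the controls once per frame."""
    history: List[List[int]] = []
    for step in range(num_frames):
        control = read_control(step)         # freshest control for this frame
        frame = toy_step(history, control)   # past frames plus current control
        history.append(frame)
        yield step * FRAME_MS, frame


def read_control(step: int) -> int:
    """A performer switches the control from 0 to 7 at frame 3 (t = 120 ms)."""
    return 7 if step >= 3 else 0


timeline = [(t_ms, frame[0]) for t_ms, frame in generate(read_control, 6)]
print(timeline)
```

In this toy setup the change made at 120 ms shows up in the frame generated at 120 ms. With 2-second chunks, the same change would have to wait for the next chunk boundary.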

To handle long sequences with bounded memory, the sliding window attention evicts old key-value cache entries beyond a fixed window size. The team added learnable attention sink embeddings to prevent quality degradation when initial tokens are evicted, and dropped positional embeddings (NoPE) to improve length generalization, having found that RoPE hurt performance beyond the training length.
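The eviction behavior can be illustrated with a small sketch. It assumes a single cache holding opaque entries; in the real model the entries would be per-layer key and value tensors and the sinks would be learned embeddings. The class name, window size, and sink count are all hypothetical.

```python
from collections import deque
from typing import Any, List, Sequence


class SlidingWindowCache:
    """Toy cache: fixed sink entries plus a bounded window of recent entries."""

    def __init__(self, window: int, sinks: Sequence[Any]) -> None:
        self.sinks: List[Any] = list(sinks)   # always attended to, never evicted
        self.recent: deque = deque(maxlen=window)  # oldest entry is dropped when full

    def append(self, entry: Any) -> None:
        """Add the newest entry, evicting the oldest one if the window is full."""
        self.recent.append(entry)

    def visible(self) -> List[Any]:
        """Entries the next step may attend to: sinks first, then the window."""
        return self.sinks + list(self.recent)


cache = SlidingWindowCache(window=4, sinks=["sink"])
for step in range(10):
    cache.append(step)

print(cache.visible())
```

However long generation runs, the cache never holds more than the window size plus the number of sinks, which is what keeps memory bounded during an open-ended performance.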

Inference Engine: C++ with MLX on Apple Silicon

The inference engine is written in C++ and uses Apple's MLX framework to run on Apple Silicon GPUs. The model is implemented in Python using the SequenceLayers library, then compiled into an .mlxfn file (bundling weights and computational graph). The C++ engine loads this file and executes it via the MLX runtime, handling audio buffering, resampling, and MIDI input.

Real-time performance (generating audio faster than playback) requires specific hardware:

Base model (2.4B): MacBook M3 Pro or higher, or M2 Max or higher
Small model (230M): Any Apple Silicon Mac, including MacBook Air

Both models can run offline (non-real-time) on any Apple Silicon Mac.
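Whether a given machine keeps up with playback can be expressed as a real-time factor: seconds of audio produced divided by seconds spent computing. The sketch below assumes a hypothetical generate_frame callable standing in for one model step; it does not call the MRT2 engine, and the sleep-based stub is only a placeholder workload.

```python
import time
from typing import Callable

FRAME_SECONDS = 0.040  # audio produced per generated frame (from the article)


def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Above 1.0 means audio is generated faster than it plays back."""
    return audio_seconds / compute_seconds


def measure(generate_frame: Callable[[], None], num_frames: int = 10) -> float:
    """Time num_frames calls and return the resulting real-time factor."""
    start = time.perf_counter()
    for _ in range(num_frames):
        generate_frame()
    elapsed = time.perf_counter() - start
    return real_time_factor(num_frames * FRAME_SECONDS, elapsed)


def fake_generate_frame() -> None:
    """Placeholder workload; a real measurement would run one model step here."""
    time.sleep(0.001)


rtf = measure(fake_generate_frame)
print(f"real-time factor: {rtf:.1f}x", "(real time)" if rtf > 1 else "(too slow)")
```

A factor below 1.0 corresponds to the offline case: the model still produces audio, just slower than it can be played.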

Example Applications and Integrations

The release includes a suite of example applications: standalone apps, DAW plugins, and extensions. These demonstrate sound cloning, style blending, and live accompaniment. The Python library (pip install magenta-rt) provides inference via JAX/MLX using SequenceLayers.

How to Get Started

Download the apps from the Magenta website (requires Apple Silicon Mac).
Install the Python library: pip install magenta-rt
Use the C++ inference engine for DAW integration or custom instrument development.

The team plans to add finetuning support and more performance tools, and will be at the Music Technology Hackathon in Boston showcasing MRT2.

Technical Details: Latency Breakdown

Control latency is ~200ms, composed of:

Frame processing: 40ms (one frame)
Depth decode: time to decode 12 RVQ tokens per frame
Codec decode: time to convert tokens to audio waveform

The exact breakdown depends on hardware and model size, but the team reports a ~15x improvement over MRT's 3s latency.
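The headline numbers can be checked with simple arithmetic. The article gives no per-component split beyond the 40ms frame, so the sketch below only derives what follows directly from the quoted totals; the variable names are illustrative.

```python
MRT1_LATENCY_MS = 3000   # reported latency of the original MRT (from the article)
MRT2_LATENCY_MS = 200    # approximate MRT2 control latency (from the article)
FRAME_MS = 40            # one frame (from the article)

speedup = MRT1_LATENCY_MS / MRT2_LATENCY_MS         # improvement factor
frames_of_latency = MRT2_LATENCY_MS / FRAME_MS      # control latency measured in frames
remaining_ms = MRT2_LATENCY_MS - FRAME_MS           # budget left after one frame

print(f"{speedup:.0f}x faster, {frames_of_latency:.0f} frames of latency, "
      f"{remaining_ms} ms left for everything other than the frame itself")
```

The roughly 160 ms remainder covers depth decode, codec decode, and any other overhead; how it divides among them is hardware-dependent and not stated in the article.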

Citation

If you use MRT2 in your work, cite:

@article{mrt2,
  title  = {Magenta RealTime 2: Open \& Local Live Music Models},
  author = {Magenta Team},
  year   = {2026},
  note   = {https://magenta.withgoogle.com/magenta-realtime-2}
}
