The machine learning community has been stuck on the autoregressive bottleneck for years, but a new paper shows that it's possible to apply diffusion models to discrete data at scale. The researchers trained two coding-focused models, Mercury Coder Mini and Mercury Coder Small, that shatter the current speed-quality tradeoff.
Independent evaluations had the Mini model hitting an absurd throughput of 1109 tokens per second on H100 GPUs, while the Small model reaches 737 tokens per second. They essentially outperform existing speed-optimized frontier models by up to ten times in throughput without sacrificing coding capabilities. On practical benchmarks and human evaluations like Copilot Arena, the Mini tied for second place in quality against huge models like GPT-4o while maintaining an average latency of just 25 ms. The models matched the performance of established speed-optimized models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite across multiple programming languages while decoding several times faster.
The advantage of diffusion over classical autoregressive models stems from its ability to perform parallel generation, which greatly improves speed. Standard language models are chained to a sequential decoding process in which they must generate an answer exactly one token at a time. Mercury abandons this sequential bottleneck entirely by training a Transformer to predict multiple tokens in parallel. The model starts with a sequence of pure random noise and applies a denoising process that iteratively refines all tokens simultaneously, in a coarse-to-fine manner, until the final text emerges. Because generation happens in parallel rather than sequentially, the algorithm achieves significantly higher arithmetic intensity and can fully saturate modern GPU architectures. The team paired this parallel decoding capability with a custom inference engine featuring dynamic batching and specialized kernels to squeeze out maximum hardware utilization.
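To make the coarse-to-fine idea concrete, here is a minimal toy sketch of mask-based parallel denoising. Everything in it is an illustrative stand-in, not Mercury's actual architecture: the tiny vocabulary, the random "denoiser" in place of a trained Transformer, the random confidence scores, and the linear unmasking schedule are all assumptions made purely to show the control flow of refining every position in parallel over a few steps.

```python
import random

# Toy sketch of coarse-to-fine parallel denoising for discrete diffusion
# text generation. Vocabulary, "denoiser", confidence scores, and schedule
# are all illustrative stand-ins, not Mercury's actual model.

VOCAB = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a Transformer: propose a token for EVERY position at once.

    A real model would output a probability distribution per position,
    conditioned on the whole partially masked sequence; here we sample
    uniformly just to illustrate the parallel structure.
    """
    return [random.choice(VOCAB) for _ in tokens]

def diffusion_decode(length=8, steps=4, seed=0):
    random.seed(seed)
    tokens = [MASK] * length  # start from a fully "noised" (masked) sequence
    for step in range(steps):
        proposals = toy_denoiser(tokens)          # all positions, in parallel
        scores = [random.random() for _ in tokens]  # stand-in confidences
        # Coarse-to-fine schedule: grow the total number of committed
        # tokens linearly with the step count.
        target = length * (step + 1) // steps
        masked = [i for i in range(length) if tokens[i] == MASK]
        n_commit = target - (length - len(masked))
        # Commit the most "confident" predictions among still-masked slots.
        masked.sort(key=lambda i: -scores[i])
        for i in masked[:n_commit]:
            tokens[i] = proposals[i]
    return tokens

print(diffusion_decode())
```

Note that each step does one full forward pass over the whole sequence, so the number of model calls equals the (small, fixed) number of denoising steps rather than the sequence length; this is the source of the higher arithmetic intensity the paragraph above describes.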