Inception Labs' Mercury 2 Scores 90% on AIME 2026, Outpaces Google's DiffusionGemma

Inception Labs introduced Mercury 2 on Thursday, positioning it as the world's fastest reasoning language model at roughly 1,000 tokens per second. The model scored 90% on the AIME 2026 benchmark, outperforming Google's recently released DiffusionGemma, which scored 69.1% on the same test at similar generation speeds. Both models use diffusion-based parallel generation rather than sequential, token-by-token generation, reflecting a broader architectural shift toward faster inference.

Mercury 2 Outperforms DiffusionGemma on Mathematics Benchmark

Mercury 2 generates about 1,000 tokens per second (tokens are the chunks of text an AI model reads and writes), against roughly 89 tokens per second for Anthropic's Claude Haiku 4.5 Reasoning and 71 for OpenAI's GPT-5 Mini, according to Inception Labs' announcement. AIME 2026 is built from real American Invitational Mathematics Examination problems and is scored as the percentage solved correctly. Mercury 2 scored 90% on it. Google tested DiffusionGemma on the same set, where it scored 69.1%, while the standard, non-diffusion Gemma 4 scored 88.3%.
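To put those throughput figures in perspective, the short Python sketch below converts them into wall-clock generation time. The 2,000-token response length is an assumption chosen for illustration, and the calculation ignores network latency, time to first token, and differences in how many reasoning tokens each model emits.

```python
# Throughput figures quoted in Inception Labs' announcement (tokens per second).
tokens_per_second = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}

RESPONSE_TOKENS = 2000  # assumed response length, for illustration only


def seconds_to_generate(rate: float, tokens: int = RESPONSE_TOKENS) -> float:
    """Time to emit `tokens` tokens at a steady `rate` in tokens per second."""
    return tokens / rate


# Seconds needed by each model for the assumed response length.
generation_seconds = {
    name: round(seconds_to_generate(rate), 1)
    for name, rate in tokens_per_second.items()
}

# How many times faster Mercury 2 is than each model at the quoted rates.
mercury_speedup = {
    name: round(tokens_per_second["Mercury 2"] / rate, 1)
    for name, rate in tokens_per_second.items()
}

for name in tokens_per_second:
    print(f"{name}: {generation_seconds[name]} s, Mercury 2 is {mercury_speedup[name]}x as fast")
```

At the quoted rates, Mercury 2 would be roughly 11 times as fast as Claude Haiku 4.5 Reasoning and roughly 14 times as fast as GPT-5 Mini on raw generation speed.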

On GPQA, a PhD-level science benchmark scored the same way, the gap is much narrower: Mercury 2 scored 77% against DiffusionGemma's 73.2%. Google's developer guide recommends standard Gemma 4 for applications that demand maximum quality, conceding that DiffusionGemma trails it across the board. DiffusionGemma is free and open-weight on Hugging Face. Mercury 2 is a paid, closed-weight API model.

Diffusion Models Replace Sequential Token Generation

Both models drop the typewriter approach to writing. A standard chatbot writes one word, checks what it just wrote, then writes the next, looping until the answer is finished. Diffusion models instead fill a block of text with random placeholder tokens, then strip away the noise over a handful of parallel passes until the whole block locks into a finished response at once. It is the same trick that turns static into a photo in image generators like Stable Diffusion.
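The difference in the number of model passes can be sketched with a toy example in Python. This is not Mercury 2's or DiffusionGemma's actual algorithm: the `toy_model` function, the `<mask>` placeholder, and the even unmasking schedule are all assumptions made for illustration, and a real model predicts tokens rather than looking them up.

```python
import math
import random

MASK = "<mask>"  # hypothetical placeholder token


def toy_model(position, target):
    """Stand-in for a neural network: it just looks up the 'correct' token."""
    return target[position]


def sequential_decode(target):
    """Typewriter style: one model call per token, left to right."""
    output, calls = [], 0
    for position in range(len(target)):
        output.append(toy_model(position, target))
        calls += 1
    return output, calls


def diffusion_decode(target, passes, seed=0):
    """Start from a fully masked block and fill several positions per pass."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    per_pass = math.ceil(len(target) / passes)
    calls = 0
    while MASK in block:
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Each pass resolves a batch of positions anywhere in the block.
        for position in rng.sample(masked, min(per_pass, len(masked))):
            block[position] = toy_model(position, target)
        calls += 1  # one pass over the whole block counts as one call
    return block, calls


target = "the whole block locks into a finished response at once".split()
seq_out, seq_calls = sequential_decode(target)
diff_out, diff_calls = diffusion_decode(target, passes=4)
print(f"sequential calls: {seq_calls}, diffusion-style passes: {diff_calls}")
```

Both decoders produce the same ten tokens, but the sequential one needs one call per token while the diffusion-style one finishes in four passes. In this toy, the speed advantage comes entirely from resolving many positions per pass.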

Augment Code Reports 82% Latency Reduction in Production

Augment Code, an AI coding-agent company, swapped Mercury 2 in for Anthropic's Claude Opus 4.7 on its context-compaction subagent and saw an 82% drop in latency and a 90% cut in cost, while reporting the same output quality, according to a joint case study.
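As a quick sanity check on what those percentages mean, the Python sketch below applies them to made-up baselines. The 10-second latency and $1.00 cost are hypothetical values, not figures from the case study.

```python
LATENCY_REDUCTION = 0.82  # reported drop in latency
COST_REDUCTION = 0.90     # reported cut in cost

BASELINE_LATENCY_S = 10.0  # hypothetical baseline, for illustration only
BASELINE_COST_USD = 1.00   # hypothetical baseline, for illustration only


def after_reduction(baseline: float, reduction: float) -> float:
    """Value that remains after cutting `baseline` by the fraction `reduction`."""
    return baseline * (1.0 - reduction)


def improvement_factor(reduction: float) -> float:
    """Express a fractional reduction as an 'N times better' multiplier."""
    return 1.0 / (1.0 - reduction)


new_latency_s = round(after_reduction(BASELINE_LATENCY_S, LATENCY_REDUCTION), 2)
new_cost_usd = round(after_reduction(BASELINE_COST_USD, COST_REDUCTION), 2)
latency_factor = round(improvement_factor(LATENCY_REDUCTION), 2)
cost_factor = round(improvement_factor(COST_REDUCTION), 2)

print(f"latency: {BASELINE_LATENCY_S} s -> {new_latency_s} s ({latency_factor}x faster)")
print(f"cost: ${BASELINE_COST_USD} -> ${new_cost_usd} ({cost_factor}x cheaper)")
```

Put differently, an 82% drop in latency means the subagent responds about 5.6 times faster, and a 90% cut in cost means each call costs one tenth as much.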

Inception Labs Secures $50 Million Funding Round

Inception Labs raised $50 million in funding with backing from Nvidia's venture arm and individual investors Andrew Ng and Andrej Karpathy. The startup was built on research from its founder Stefano Ermon, a Stanford professor who co-authored some of the score-based diffusion techniques that power today's image generators.

Parallel Generation Enables Multi-Agent System Architecture

Complex AI systems are orchestras of specialized helpers: one for deep reasoning, several for quick summarization, routing, tool lookup, and output checking. Sequential models make those utility calls expensive and slow. Parallel diffusion models make them cheap and fast enough to use liberally. Mercury 2 is available only through a cloud API for now, and the wider ecosystem, including local runtimes and agent frameworks, is still catching up.
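The fan-out pattern described above can be sketched with Python's `asyncio`. The helper roles, the `utility_call` function, and the simulated latency are hypothetical stand-ins; no real model API is called here.

```python
import asyncio


async def utility_call(role: str, text: str, latency_s: float) -> str:
    """Hypothetical stand-in for one fast sub-agent request."""
    await asyncio.sleep(latency_s)  # simulates the model's response time
    return f"{role}: handled {len(text)} chars"


async def orchestrate(task: str, latency_s: float = 0.01) -> list[str]:
    """Fan one task out to several utility helpers and wait for all of them."""
    roles = ["summarize", "route", "tool_lookup", "check_output"]
    results = await asyncio.gather(
        *(utility_call(role, task, latency_s) for role in roles)
    )
    return list(results)  # gather preserves the order of the roles


results = asyncio.run(orchestrate("Refactor the billing module"))
for line in results:
    print(line)
```

Because the helpers run concurrently, the total wait is close to the slowest single call rather than the sum of all calls. The lower each call's latency and cost, the more freely an orchestrator can afford to fan out like this.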

Speed-Sensitive Workflows Benefit from Diffusion Approach

Use cases include real-time programming where the model keeps up with edits, multi-agent coding or support systems where lots of fast sub-calls happen, voice interfaces that don't feel laggy, and any latency-sensitive autocomplete or next-action prediction. At scale, the cost and energy savings from higher throughput on standard hardware add up fast, according to Inception Labs.

FAQ

What did Inception Labs announce on Thursday?

Inception Labs introduced Mercury 2 on Thursday, calling it the world's fastest reasoning language model. It generates about 1,000 tokens per second and scored 90% on the AIME 2026 benchmark.

How does Mercury 2 compare to Google's DiffusionGemma on benchmarks?

Mercury 2 scored 90% on AIME 2026, while Google's DiffusionGemma scored 69.1% on the same test. On GPQA, a PhD-level science benchmark, Mercury 2 scored 77% against DiffusionGemma's 73.2%.

What cost and latency improvements did Augment Code report?

Augment Code swapped Mercury 2 in for Anthropic's Claude Opus 4.7 on its context-compaction subagent and saw an 82% drop in latency and a 90% cut in cost, while reporting the same output quality, according to a joint case study.
