
Google DeepMind officially released and open-sourced DiffusionGemma on June 10 as a new member of the open-source Gemma 4 family. DiffusionGemma pairs a diffusion-based text generation architecture with a Mixture of Experts (MoE) design. In every benchmark Google has published, DiffusionGemma scores lower than the standard Gemma 4.
Official speed test data and hardware specifications
According to the confirmed numbers published by Google:
Speed tests (Google’s own figures, not independently verified)
Nvidia RTX 5090 (consumer-grade): about 700 tokens/second
Nvidia H100 (data-center-grade): more than 1,000 tokens/second
Self-reported speedup: about 4x over same-size autoregressive Gemma models
Architecture and parameters
Total parameters: 26 billion (26B)
Active parameters during inference: 3.8 billion (3.8B)
VRAM requirement: runs on high-end GPUs with 18GB of VRAM (quantized versions in particular)
Maximum parallel processing: up to 256 tokens processed simultaneously
License: Apache 2.0
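As a rough check on the 18GB figure, the weight footprint can be estimated from the parameter count and the number of bits stored per weight. The sketch below is back-of-envelope arithmetic only. The bit widths are illustrative assumptions, not Google's published quantization formats, and the estimate ignores activations, caches, and runtime overhead.

```python
# Back-of-envelope weight-memory estimate. The parameter counts come from
# the article; the bit widths are illustrative assumptions, not official
# quantization formats.

TOTAL_PARAMS = 26e9    # total parameters (26B)
ACTIVE_PARAMS = 3.8e9  # parameters active during inference (3.8B)


def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Return the size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters that are active in the MoE design."""
    return active / total


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
    print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

Under these assumptions the weights take about 52 GB at 16 bits, 26 GB at 8 bits, and 13 GB at 4 bits, so only the 4-bit case fits within 18GB. That is consistent with the note that quantized versions are the ones suited to such GPUs. About 14.6% of the parameters are active at a time.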
Generation mechanism: key differences between diffusion and autoregression
Standard autoregressive models generate one token at a time, in sequence. Each token depends on the tokens generated before it, and the bottleneck is memory bandwidth: every time a token is output, the model weights must be read from memory.
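The memory-bandwidth bottleneck can be made concrete with a simple upper bound: if every decoding step must stream the active weights from memory once, throughput cannot exceed bandwidth divided by the bytes read per step. The sketch below assumes a hypothetical 1,000 GB/s of memory bandwidth and 16-bit weights. Both are illustrative values, not measurements of any particular GPU or of DiffusionGemma.

```python
# Upper bound on decoding speed when each step streams the weights once.
# The bandwidth and bytes-per-weight values used below are hypothetical.


def max_tokens_per_second(
    bandwidth_gb_s: float,
    active_params: float,
    bytes_per_weight: float,
    tokens_per_step: int = 1,
) -> float:
    """Bandwidth-limited throughput: steps per second times tokens per step."""
    gb_read_per_step = active_params * bytes_per_weight / 1e9
    steps_per_second = bandwidth_gb_s / gb_read_per_step
    return steps_per_second * tokens_per_step


if __name__ == "__main__":
    # One token per weight read, as in autoregressive decoding.
    print(f"{max_tokens_per_second(1000.0, 3.8e9, 2.0):.0f} tokens/s")
```

With these assumed numbers the bound is roughly 130 tokens per second when one token is produced per weight read. Producing more tokens per read raises the bound proportionally, until compute rather than bandwidth becomes the limit.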
DiffusionGemma follows a different process: it first places placeholder tokens across the entire output region, then performs multiple rounds of denoising. In each round, the tokens at all positions are updated simultaneously and correct one another, until the whole output converges to its final form. This compute-intensive parallel calculation shifts the bottleneck from memory bandwidth to GPU compute, making fuller use of modern GPUs’ parallel capabilities.
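The placeholder-then-denoise loop can be illustrated with a toy example. This is a minimal sketch, not Google's algorithm. `toy_predict` is a hypothetical stand-in for a neural denoiser that already knows the target text and only models how confidence grows as neighboring positions are filled in. As a further simplification, tokens are not revised once committed.

```python
# Toy illustration of parallel iterative denoising over placeholder tokens.
# toy_predict is a hypothetical stand-in for a neural denoiser: it "knows"
# the target and only models how confidence grows with filled-in context.

MASK = "<mask>"
TARGET = ["the", "cat", "sat", "on", "the", "mat"]


def toy_predict(tokens: list[str], target: list[str]) -> list[tuple[str, float]]:
    """Propose a token and a confidence for every position in one pass."""
    proposals = []
    for i in range(len(tokens)):
        neighbors = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        known = sum(1 for t in neighbors if t != MASK)
        # Confidence rises with known neighbors; the index term breaks ties.
        confidence = 0.5 + 0.2 * known - 0.01 * i
        proposals.append((target[i], confidence))
    return proposals


def denoise(target: list[str], per_round: int) -> tuple[list[str], int]:
    """Start from all placeholders; commit the most confident proposals each round."""
    tokens = [MASK] * len(target)
    rounds = 0
    while MASK in tokens:
        proposals = toy_predict(tokens, target)  # all positions scored together
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_round]:
            tokens[i] = proposals[i][0]
        rounds += 1
    return tokens, rounds


if __name__ == "__main__":
    result, rounds = denoise(TARGET, per_round=2)
    print(rounds, " ".join(result))
```

Committing two positions per round, the six-token output is complete after 3 rounds, where a token-by-token decoder would need 6 sequential steps. A real model would trade the number of rounds against output quality.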
In its official documents, Google cites Sudoku solving as an example of the non-linear logic tasks where DiffusionGemma has a structural advantage. Correct answers to such tasks often depend on complex relationships between positions, which the strictly left-to-right generation of autoregressive models handles poorly.
Benchmark results: all published test scores are lower than Gemma 4
In its release materials, Google confirmed that DiffusionGemma scores lower than the standard Gemma 4 in every published benchmark. In other words, the roughly 4x speedup comes with a consistent drop in generation quality. A BlockTempo article notes that this trade-off plays out very differently across use cases: where latency matters or large volumes of output are needed, the speed advantage is practical; where quality requirements are higher, the standard Gemma 4 remains the more reliable choice.
Google’s official list of use cases for DiffusionGemma includes in-line editing, molecular sequence generation, mathematical drawing, and non-linear tasks with complex logical dependencies.
FAQ
What is the essential difference in generation mechanism between DiffusionGemma and standard autoregressive language models?
Standard autoregressive models generate text linearly, one token at a time, with each token depending on the ones before it. DiffusionGemma first fills the entire output region with placeholder tokens, then performs multiple rounds of denoising, updating all positions simultaneously in each round. It then outputs the complete sequence at once, which makes its generation logic closer to how Stable Diffusion generates images.
What hardware can DiffusionGemma run on locally?
According to Google’s official description, DiffusionGemma can run on high-end GPUs with 18GB of VRAM, quantized versions in particular. Google’s own testing shows a consumer-grade Nvidia RTX 5090 reaching about 700 tokens per second. These figures are self-reported by Google and have not been independently verified.
Have DiffusionGemma’s speed figures passed third-party verification?
Not yet. BlockTempo explicitly stated that all of the speed figures come from Google’s own testing and have not been independently verified. Under real-world conditions, with different workloads and generation lengths, the actual speedup may differ from the official numbers.