According to Beating monitoring, Google has released DiffusionGemma, an open-source text generation model that uses a diffusion-based mechanism to generate text in parallel blocks rather than sequentially, one token at a time. The model uses a mixture-of-experts architecture with 26B total parameters, of which only 3.8B are active per forward pass, and is reported to deliver a 4x speedup in local GPU inference.
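To put the mixture-of-experts figures in perspective, the short calculation below works out what share of the weights is used per forward pass. It relies only on the 26B and 3.8B figures quoted above. The function name is illustrative and not part of any DiffusionGemma tooling.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the share of parameters used in one forward pass."""
    if total_params_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_params_b / total_params_b


# Figures from the report: 26B total parameters, 3.8B active per forward pass.
frac = active_fraction(26.0, 3.8)
print(f"{frac:.1%} of parameters active per forward pass")  # prints 14.6% ...
```

In other words, roughly one parameter in seven takes part in any given forward pass. The rest still has to be held in memory.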
On a single NVIDIA H100 GPU, DiffusionGemma exceeds 1,000 tokens per second, while the consumer-grade RTX 5090 exceeds 700 tokens per second. With 4-bit floating-point quantization, the model requires under 18GB of VRAM. The DiffusionGemma weights are available on Hugging Face and are supported by MLX, vLLM, Unsloth, and NVIDIA NeMo.
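As a rough sanity check on the VRAM figure, the sketch below estimates weight storage alone from the parameter count and bit width. It assumes all 26B parameters are stored at the stated precision and counts 1 GB as 1e9 bytes. Quantization scales, activations, and runtime buffers are not modeled, so the gap between the estimate and the quoted 18GB is only presumed to cover them.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


# Assumption: every one of the 26B parameters is quantized to 4 bits.
fp4_gb = weight_memory_gb(26.0, 4)
# For comparison: the same weights at 16 bits per parameter.
bf16_gb = weight_memory_gb(26.0, 16)

print(f"4-bit weights:  {fp4_gb:.1f} GB")   # prints 13.0 GB
print(f"16-bit weights: {bf16_gb:.1f} GB")  # prints 52.0 GB
```

Under these assumptions the 4-bit weights take about 13 GB, which is consistent with the reported total of under 18GB and explains why the quantized model fits on a single consumer GPU.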