Google’s DiffusionGemma: Faster Text AI with Parallel Diffusion

Google has just open-sourced an experimental model that could change how fast AI generates text locally. DiffusionGemma, a 26-billion-parameter Mixture of Experts model under the permissive Apache 2.0 license, replaces the usual step-by-step token generation with a parallel diffusion process. The result? On high-end GPUs, it can deliver up to four times the throughput of standard autoregressive models—ideal for real-time editing or rapid prototyping.

A New Approach to Text Generation

Most language models today build text one token at a time, each depending on the previous one. DiffusionGemma flips the script. It starts with a “canvas” of random placeholder tokens and denoises them in batches, locking in high-confidence parts while refining the rest. This bidirectional diffusion allows the model to self-correct mid-process, something autoregressive systems can’t do once a token is committed. On a single NVIDIA H100, the model reaches over 1,000 tokens per second; even on a GeForce RTX 5090, it sustains more than 700 tokens per second.

Built for Speed, Not Perfection

DiffusionGemma runs on the Gemma 4 backbone but activates only about 3.8 billion parameters during inference, keeping its VRAM footprint under 18 GB. It handles interleaved text, image, and video inputs with a 256K-token context window across more than 140 languages. Google is transparent about the trade-off: this model prioritizes speed and layout flexibility over raw output quality. For production work requiring the highest fidelity, the company still points developers to the standard Gemma 4 autoregressive models.

What This Means for Local AI Workflows

Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.

Google’s DiffusionGemma: Faster Text AI with Parallel Diffusion

A New Approach to Text Generation

Built for Speed, Not Perfection

What This Means for Local AI Workflows

Essential tech, every morning