
DiffusionGemma: Google releases an AI model 4x faster — and it runs on your own computer

by MassAI · published June 11, 2026

On June 10, Google DeepMind released DiffusionGemma — an experimental open model that no longer writes text word by word, but "develops" it in whole blocks, in parallel. The result: generation up to 4x faster, on an ordinary consumer GPU, under a permissive Apache 2.0 license. How it works, what trade-offs it makes, and what it means for companies that want AI without sending their data to the cloud.

What DiffusionGemma is

DiffusionGemma is built on the Gemma 4 architecture, Google's family of open models, but replaces the classic generation method with one still rare among language models: text diffusion. Technically, it is a 26-billion-parameter Mixture-of-Experts model with only 3.8 billion parameters active for any given token, which keeps the compute per step low enough for accessible hardware. All 26 billion parameters still have to sit in memory; quantized, the model fits in 18 GB of VRAM, within the limits of a top consumer graphics card (RTX 4090/5090). The weights are public on Hugging Face under an Apache 2.0 license and are free to use, including commercially.
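The memory figure is easier to judge with a little arithmetic. The sketch below estimates the size of the weights alone at a few precisions. The bit widths are assumptions (the article does not say which quantization the 18 GB figure refers to), and the function name is made up for illustration.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 26e9    # from the article: 26 billion parameters in total
ACTIVE_PARAMS = 3.8e9  # from the article: 3.8 billion active parameters

# Assumed precisions; the article does not name the quantization scheme.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

# Share of the model that does work at any one time.
print(f"Active share: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

One plausible reading, not confirmed by the source, is that the 18 GB figure corresponds to roughly 4-bit weights plus working memory for activations and context. The active share explains the speed, not the footprint: every expert has to be loaded even though only a few are used at a time.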

How "text diffusion" works

Classic models (GPT, Claude, standard Gemma) write one token, roughly a word or word fragment, then the next, then the next; every step waits for the previous one. DiffusionGemma instead starts from a "canvas" of 256 placeholder tokens and refines it iteratively, in parallel, until the whole block of text becomes coherent. Then it moves on to the next block. It is the same principle image generators use, applied to text.
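The refinement loop can be sketched with a toy. The code below is not DiffusionGemma's algorithm: it assumes a masked-diffusion style of decoding in which each step commits the most confident proposals, and it replaces the neural network with a stub (`toy_denoiser`, a hypothetical name) that simply reads the answer from a known target. Unlike the real model, the toy never revises a token once it is committed.

```python
import random

MASK = "_"  # placeholder token on the canvas

def toy_denoiser(canvas, target, rng):
    """Stand-in for a trained network. For every masked position it proposes
    a token and a confidence score. A real model would predict both from the
    whole canvas; here the token is read from `target` and the confidence is
    random."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(canvas) if tok == MASK}

def generate_block(target, steps, seed=0):
    """Fill a fully masked canvas in at most `steps` parallel refinement steps."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [canvas.copy()]
    for step in range(steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break
        # Commit enough positions to finish within the remaining steps
        # (ceiling division), picking the most confident proposals first.
        k = -(-len(proposals) // (steps - step))
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(canvas.copy())
    return canvas, history

target = "the whole block is refined in parallel".split()
final, history = generate_block(target, steps=3)
for n, state in enumerate(history):
    print(n, " ".join(state))
```

The point of the toy is the shape of the loop: seven tokens are produced in three passes rather than seven, and positions are filled in confidence order rather than left to right.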

Two practical effects follow. First: because every position "sees" the whole block (bidirectional attention), the model can fix earlier mistakes as it goes, a native form of real-time self-correction. Second: classic decoding has to read the model's weights from memory once for every generated token, so its speed is capped by memory bandwidth. A diffusion step reads the weights once and updates a whole block of positions in parallel on the GPU, so the limit shifts to compute, exactly the resource modern cards have in abundance.
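The difference between the two attention patterns is easy to see as a mask. The sketch below compares a causal mask with a block-wise one that is bidirectional inside a block and causal across blocks. The block-wise layout is an assumption drawn from the description above, not a documented detail of DiffusionGemma, and the function names are illustrative.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.
    Classic decoding: each position sees only itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_mask(n, block_size):
    """Assumed block-wise pattern: a position sees every position in its own
    block (bidirectional) and in all earlier blocks, but no later block."""
    return [[j // block_size <= i // block_size for j in range(n)]
            for i in range(n)]

def show(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (may not)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print("causal:")
print(show(causal_mask(6)))
print("block-wise, block size 3:")
print(show(block_mask(6, 3)))
```

In the causal mask the first position sees only itself. In the block-wise mask it sees its whole block, which is what lets a later token influence the choice of an earlier one.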

How fast it is, in numbers

Google's measurements put generation at up to 4x faster than classic token-by-token decoding: over 700 tokens per second on a consumer RTX 5090 and over 1,000 tokens per second on a server-class H100 GPU. For perspective, at those speeds a multi-page report of roughly 2,000 tokens generates in 2–3 seconds.
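The report estimate follows from simple division. In the sketch below the throughput figures come from the article, while the report length of 2,000 tokens is an assumption, and the baseline is derived by taking the "up to 4x" figure at face value.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Time to produce `tokens` at a steady throughput."""
    return tokens / tokens_per_second

REPORT_TOKENS = 2000  # assumption: a multi-page report
SPEEDUP = 4           # from the article: "up to 4x", treated here as exact

# Throughput figures quoted in the article (both are lower bounds: "over").
for gpu, tps in (("RTX 5090", 700), ("H100", 1000)):
    fast = seconds_to_generate(REPORT_TOKENS, tps)
    # Implied classic-decoding time if the full 4x speedup applied.
    slow = seconds_to_generate(REPORT_TOKENS, tps / SPEEDUP)
    print(f"{gpu}: about {fast:.1f} s, versus about {slow:.1f} s at a quarter of the speed")
```

Because the speedup is an upper bound, the slower figure is a best case for the comparison rather than a measured baseline.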

The model also offers a 262,144-token context window, the equivalent of several hundred pages of documents processed in a single session, and supports more than 140 languages.
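The "several hundred pages" estimate can be checked with rough conversion factors. Both ratios below are assumptions (about 0.75 English words per token and 500 words per page); other languages and denser layouts will give different numbers.

```python
CONTEXT_TOKENS = 262_144  # from the article; equal to 2**18

WORDS_PER_TOKEN = 0.75    # assumption: rough average for English text
WORDS_PER_PAGE = 500      # assumption: a dense single-spaced page

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
pages = words / WORDS_PER_PAGE
print(f"about {words:,.0f} words, or about {pages:.0f} pages")
```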

The trade-off: speed versus quality

Google is transparent here: raw output quality is below standard Gemma 4. DiffusionGemma is meant for tasks where speed and latency matter more than maximum nuance — interactive assistants, structured text, high-volume processing. For maximum production quality, Google still recommends classic Gemma 4.

What happens after fine-tuning on well-defined tasks is the interesting part: on a Sudoku benchmark, the tuned model solved 80% of puzzles — versus nearly 0% for the base model — and in 12 steps instead of 48. The business signal: on well-scoped problems, specialization turns raw speed into precision.
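The article does not say how the Sudoku benchmark is scored. A minimal sketch, assuming a puzzle counts as solved only when the returned grid is a valid Sudoku and keeps the given digits, could look like this. All function names are illustrative, and the example grid comes from a standard shifting pattern rather than from the benchmark.

```python
def is_valid_solution(grid):
    """grid: 9 rows of 9 ints. True when every row, column and 3x3 box
    contains each digit from 1 to 9 exactly once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [{grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
             for r in range(0, 9, 3) for c in range(0, 9, 3)]
    return all(unit == digits for unit in rows + cols + boxes)

def respects_givens(puzzle, grid):
    """puzzle uses 0 for empty cells; every given digit must be unchanged."""
    return all(p == 0 or p == g
               for prow, grow in zip(puzzle, grid)
               for p, g in zip(prow, grow))

def solve_rate(puzzles, outputs):
    """Fraction of puzzles whose output is valid and keeps the givens."""
    solved = sum(is_valid_solution(g) and respects_givens(p, g)
                 for p, g in zip(puzzles, outputs))
    return solved / len(puzzles)

# A known-valid grid built from a shifting pattern, and a puzzle made from it
# by blanking the first row.
solution = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [[0] * 9] + [row[:] for row in solution[1:]]

# A wrong answer: swap two cells in the first row and overwrite one of them.
wrong = [row[:] for row in solution]
wrong[0][0] = wrong[0][1]

print(solve_rate([puzzle, puzzle], [solution, wrong]))
```

Exact checking of this kind is what makes Sudoku a convenient test of specialization: an answer is either fully right or not, with no partial credit for fluent output.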

Where to get it

The model weights are on Hugging Face under Apache 2.0. The standard tooling supports the model from day one: vLLM, Hugging Face Transformers, SGLang and MLX (for Mac). Fine-tuning is supported in Unsloth, NVIDIA NeMo and Hackable Diffusion (JAX). In the cloud, it is available through Google Cloud Model Garden and NVIDIA NIM.

Why it matters for business

Data stays in-house. A capable model running on your own hardware means documents, contracts and customer data are processed without ever leaving the company network — a direct argument for GDPR and confidentiality.

Cost per use: close to zero. After the hardware investment, there is no per-token fee; what remains is electricity and maintenance. For high volumes of repetitive processing (classification, data extraction, summaries), the economics change fundamentally.
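Whether the economics really change depends on volume, which a break-even calculation makes concrete. Every price in the sketch below is a hypothetical placeholder, not a quote from Google or any cloud provider; only the throughput of 700 tokens per second comes from the article.

```python
def break_even_tokens(hardware_cost, cloud_price_per_million,
                      local_cost_per_million=0.0):
    """Tokens needed before local hardware becomes cheaper than a per-token
    cloud price. Returns None when running locally saves nothing per token."""
    saving_per_million = cloud_price_per_million - local_cost_per_million
    if saving_per_million <= 0:
        return None
    return hardware_cost / saving_per_million * 1_000_000

# Hypothetical figures for illustration only.
HARDWARE_COST = 3000.0         # assumed price of a workstation GPU setup
CLOUD_PRICE_PER_MILLION = 0.50 # assumed cloud price per million tokens
LOCAL_COST_PER_MILLION = 0.05  # assumed electricity and upkeep per million tokens

tokens = break_even_tokens(HARDWARE_COST, CLOUD_PRICE_PER_MILLION,
                           LOCAL_COST_PER_MILLION)
days_at_full_speed = tokens / 700 / 86_400  # 700 tokens/s, from the article
print(f"break-even after about {tokens / 1e9:.1f} billion tokens")
print(f"that is about {days_at_full_speed:.0f} days of nonstop generation")
```

With placeholder numbers like these, the hardware pays for itself only at sustained high volume, which matches the article's focus on repetitive bulk processing.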

The strategic signal. Google keeps pushing the second big direction of the moment: capable AI on local hardware, not just in the cloud. For most companies, the cloud remains the practical path — that is where the top models live. But the hybrid architecture — large cloud models for the heavy work, fast local models for sensitive data and high volumes — gets more realistic with every release like this. Worth watching.
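The hybrid architecture comes down to a routing decision per task. The sketch below is one possible policy, not a recommendation from Google: the `Task` fields, the `route` function and the rule order are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Task:
    text: str
    contains_sensitive_data: bool  # e.g. contracts or customer records
    needs_top_quality: bool = False

def route(task: Task) -> str:
    """Return "local" or "cloud" for a task.
    Rule order is an assumption: confidentiality wins over quality."""
    if task.contains_sensitive_data:
        return "local"   # data never leaves the company network
    if task.needs_top_quality:
        return "cloud"   # heavy work goes to the largest models
    return "local"       # routine high-volume work stays on own hardware

jobs = [
    Task("Summarize this customer contract", contains_sensitive_data=True),
    Task("Draft a market analysis", False, needs_top_quality=True),
    Task("Classify 10,000 support tickets", False),
]
for job in jobs:
    print(f"{route(job):5}  {job.text}")
```

Putting the confidentiality check first means a sensitive task is handled locally even when it would benefit from a stronger cloud model, which is the trade-off the hybrid approach accepts.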

Sources: Google Developers Blog · NVIDIA Blog
