
Google DeepMind Releases DiffusionGemma, Running Up to 4x Faster on NVIDIA RTX GPUs
Key Takeaways
- DiffusionGemma generates entire text blocks in parallel.
- Up to 4x faster text generation than traditional LLMs on NVIDIA hardware.
- Built on the Gemma 4 architecture.
A new local text model
Google DeepMind has released DiffusionGemma, an open model in the Gemma 4 family announced on June 10, 2026 and designed to generate text in parallel rather than token by token.
“Another day, another AI model from Google”
The model is built on Google’s Gemma 4 architecture and optimized to run on NVIDIA’s RTX GPUs, RTX PRO platform, and DGX Spark systems, where it is said to deliver up to 4x faster text generation than traditional large language models (LLMs).

DiffusionGemma uses a parallel processing approach that denoises up to 256 tokens per step, and it is described as suited for latency-sensitive applications such as chatbots, agentic workflows, and on-device AI assistants.
According to the performance claims, it delivers up to 1,000 tokens per second on a single NVIDIA H100 Tensor Core GPU and up to 150 tokens per second on DGX Spark systems.
Ars Technica adds that DiffusionGemma can produce an entire block of text in parallel, and describes the end result of the process as a “denoised” text canvas.
How it generates text
Ars Technica says DiffusionGemma starts from a field of placeholder tokens and runs over the canvas multiple times, generating likely tokens and using those to improve its estimates of the others.
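The repeated passes over a placeholder canvas can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm: `toy_predict` is a hypothetical stand-in for the model that already knows the target text, the confidence formula is invented for illustration, and committed tokens are never revised here, whereas the real model is described as able to correct itself.

```python
import random

MASK = "<mask>"  # placeholder token on the blank canvas


def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for a model.

    Proposes a (token, confidence) pair for every masked slot. It cheats by
    reading the target text; a real model would predict tokens instead.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        filled = sum(t != MASK for t in neighbours)
        # Assumed rule: already-filled neighbours make a slot easier to estimate.
        confidence = 0.3 + 0.3 * filled + 0.1 * rng.random()
        proposals[i] = (target[i], confidence)
    return proposals


def denoise(target, tokens_per_pass=4, seed=0):
    """Fill the canvas over several passes, committing the most confident slots."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        proposals = toy_predict(canvas, target, rng)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )
        for i in most_confident[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes
```

Calling `denoise("the quick brown fox jumps over the lazy dog today".split())` fills the ten-word canvas in three passes at four tokens per pass, where a token-by-token loop would need ten steps.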
The model is described as a Mixture of Experts (MoE) with a total of 26 billion parameters, while only 3.8 billion are activated during inference.
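As a quick check on those figures, the arithmetic below computes what share of the parameters is active per token. The `weight_memory_gb` helper is a hypothetical addition: the bytes-per-parameter value is an assumption supplied by the caller, since the reports quoted here do not say which numeric precision the weights ship in.

```python
TOTAL_PARAMS = 26e9   # total parameters, as reported
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, as reported


def active_fraction(total, active):
    """Share of the model's parameters used for each token."""
    return active / total


def weight_memory_gb(total, bytes_per_param):
    """Rough weight storage in GB for an assumed precision (hypothetical helper).

    Counts weights only; activations and caches are ignored.
    """
    return total * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"Active share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    # Assumed precisions, for illustration only.
    for label, nbytes in (("16-bit", 2), ("4-bit", 0.5)):
        print(f"{label} weights: {weight_memory_gb(TOTAL_PARAMS, nbytes):.0f} GB")
```

Roughly 15 percent of the parameters do the work for any given token, which lowers the compute per token. In a typical MoE deployment the full set of weights still has to be stored, so memory needs track the 26 billion total rather than the 3.8 billion active.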

Ars Technica reports that in testing on an RTX 5090, DiffusionGemma produces around 700 tokens per second, and that on a single NVIDIA H100 AI accelerator it can produce 1,000+ tokens per second.
The Ars Technica account also frames the approach as shifting the bottleneck from memory bandwidth to compute by generating up to 256 tokens in parallel.
In a separate evaluation claim, Crypto Briefing says DiffusionGemma has achieved sampling speeds of approximately 1,479 tokens per second and links the speed to diffusion-style iterative refinement rather than committing to each token permanently.
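To put the reported throughput figures in terms of waiting time, the sketch below converts tokens per second into seconds for a reply of a given length. The rates are the ones quoted in this article; the 512-token reply length is an arbitrary assumption, and because most figures are "up to" peaks the results are best-case estimates, not measurements.

```python
# Throughput figures as quoted in the article, in tokens per second.
REPORTED_TOKENS_PER_SECOND = {
    "NVIDIA H100 (performance claim)": 1000,
    "DGX Spark (performance claim)": 150,
    "RTX 5090 (Ars Technica testing)": 700,
    "Crypto Briefing sampling figure": 1479,  # hardware not stated in the article
}


def seconds_for(tokens, tokens_per_second):
    """Best-case generation time, ignoring prompt processing and startup cost."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 512  # assumed reply length: two 256-token blocks
    for source, rate in REPORTED_TOKENS_PER_SECOND.items():
        print(f"{source}: {seconds_for(reply_tokens, rate):.2f} s")
```

At these rates a 512-token reply takes about half a second on an H100 and a little over three seconds on a DGX Spark.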
Hardware fit and next steps
The open-weight model is released under an Apache 2.0 license, and it is presented as supporting local deployment without requiring cloud-based resources or per-token costs.
“DiffusionGemma offers 4x faster output with simultaneous text generation. The open model ditches the word-by-word approach, generating entire text blocks in parallel while correcting itself in real time.”
blockchain.news says developers can test DiffusionGemma locally using Hugging Face Transformers, with support for NVIDIA’s RTX and DGX platforms available out of the box, and it also points to NVIDIA’s free API testing at build.nvidia.com.
For fine-tuning and tooling, the same source lists NVIDIA NeMo and Unsloth, along with preconfigured DGX Spark playbooks.
Ars Technica adds that Google says the parallel approach is faster and more efficient when running on local hardware like an NVIDIA DGX or a “humble gaming GPU,” and it connects the method to non-linear tasks such as in-line editing, molecular sequencing, and mathematical graphing.
Crypto Briefing frames the practical implication as businesses benefiting from rapid text generation where latency matters, while also emphasizing that the open model and NVIDIA optimization remove barriers to trying it on widely available NVIDIA hardware.