Google DeepMind Releases DiffusionGemma, Running Up to 4x Faster on NVIDIA RTX GPUs

10 June, 2026 · Technology and Science · 3 sources

The story in 15 seconds

DiffusionGemma generates entire text blocks in parallel.
Claimed speedup: up to 4x faster text generation than traditional LLMs.
Built on Gemma 4 architecture.

The divide · 1 of 3

Speed benchmarks vary substantially across outlets.

blockchain.news

“deliver up to 4x faster text generation compared to traditional large language models (LLMs).”

Ars Technica

“With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second.”

Different benchmark setups can obscure real-world performance expectations.

How each outlet frames it

Every outlet we compared, the headline it ran, and a link to the original article.

Source diversity: 3 sources (Western Mainstream, Other, Western Alternative)

Western Mainstream

Ars Technica

Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

10 June, 2026

Other

blockchain.news

NVIDIA Powers Google DeepMind's DiffusionGemma for High-Speed AI

10 June, 2026

Western Alternative

Crypto Briefing

DiffusionGemma offers 4x faster output with simultaneous text generation

10 June, 2026

Full story

A new local text model

Google DeepMind has released DiffusionGemma, announced on June 10, 2026. The model is a member of the Gemma 4 open model family and is designed to generate text in parallel rather than token by token.

“Another day, another AI model from Google”

Ars Technica

The model is built on Google’s Gemma 4 architecture and optimized to run on NVIDIA’s RTX GPUs, RTX PRO platform, and DGX Spark systems. On that hardware, it is said to deliver up to 4x faster text generation than traditional large language models (LLMs).

Ars Technica

DiffusionGemma uses a parallel processing approach that denoises up to 256 tokens per step, and it is described as suited for latency-sensitive applications such as chatbots, agentic workflows, and on-device AI assistants.

In performance claims, it delivers up to 1,000 tokens per second on a single NVIDIA H100 Tensor Core GPU and up to 150 tokens per second on DGX Spark systems.

Ars Technica adds that DiffusionGemma can produce an entire block of text in parallel, and describes the end result of the process as a “denoised” text canvas.

How it generates text

Ars Technica says DiffusionGemma starts from a field of placeholder tokens and runs over the canvas multiple times, generating likely tokens and using those to improve its estimates of the others.
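The multi-pass idea can be illustrated with a toy sketch. The Python below is not DiffusionGemma's algorithm; it is a generic masked-refinement loop under stated assumptions. The `toy_denoise` function, the `<mask>` placeholder, and the random confidence scores are all hypothetical stand-ins for what a real model would predict, and the toy only fills tokens in without ever revising them.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not the model's real one


def toy_denoise(target, steps=4, seed=0):
    """Reveal a known target sequence over several passes.

    A real model would predict every masked position with a neural network.
    Here the "prediction" is simply the known target token, and a random
    score stands in for the model's confidence (both are assumptions).
    Returns the list of canvas snapshots, one per pass.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # Score all masked positions "in parallel" within this pass.
        confidence = {i: rng.random() for i in masked}
        # Keep enough tokens per pass to finish within the remaining passes.
        passes_left = steps - step
        keep = -(-len(masked) // passes_left)  # ceiling division
        for i in sorted(masked, key=confidence.get, reverse=True)[:keep]:
            canvas[i] = target[i]
        history.append(list(canvas))
    return history


if __name__ == "__main__":
    words = "the model fills the whole canvas in parallel".split()
    for n, snapshot in enumerate(toy_denoise(words)):
        print(n, " ".join(snapshot))
```

Each printed line is one pass over the canvas: the first is all placeholders, and later passes have progressively fewer of them until the full sentence appears.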

The model is described as a Mixture of Experts (MoE) with a total of 26 billion parameters, while only 3.8 billion are activated during inference.
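To put the two parameter counts in proportion, the short Python sketch below computes the share of parameters active per inference step and a rough weight footprint. The byte widths per parameter are assumptions chosen for illustration; none of the outlets states which precision the released weights use, and the estimate covers weights only. The helper names are hypothetical.

```python
TOTAL_PARAMS = 26e9    # total parameters, as reported
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, as reported


def active_fraction(active=ACTIVE_PARAMS, total=TOTAL_PARAMS):
    """Share of the model's parameters that take part in inference."""
    return active / total


def weight_memory_gb(params, bytes_per_param):
    """Rough size of the weights alone, in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"Active share: {active_fraction():.1%}")
    # Assumed precisions; the article does not say which one applies.
    for label, nbytes in (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
        size = weight_memory_gb(TOTAL_PARAMS, nbytes)
        print(f"{label}: ~{size:.0f} GB of weights")
```

In a typical MoE setup all experts still have to be stored, so the footprint is estimated from the 26 billion total rather than the 3.8 billion active parameters.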

blockchain.news

Ars Technica reports that in testing with an RTX 5090, DiffusionGemma produces around 700 tokens per second, and that with a single Nvidia H100 AI accelerator it can produce 1,000+ tokens per second.

The Ars Technica account also frames the approach as shifting the bottleneck from memory bandwidth to compute by generating up to 256 tokens in parallel.
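One rough way to see the bandwidth-to-compute shift is to count arithmetic per byte of weights read. The Python sketch below assumes that each weight loaded from memory contributes one multiply-add (2 FLOPs) for every token processed in the same step, and that weights are stored in 16 bits. It ignores attention, caching, and expert routing, so the numbers are illustrative only. `flops_per_weight_byte` is a hypothetical helper, not something taken from the outlets.

```python
def flops_per_weight_byte(tokens_per_step, bytes_per_param=2.0):
    """Approximate FLOPs performed per byte of weights read in one step.

    Assumes one multiply-add (2 FLOPs) per weight per token and that the
    weights are read from memory once per step (a simplification).
    """
    return 2.0 * tokens_per_step / bytes_per_param


if __name__ == "__main__":
    one_at_a_time = flops_per_weight_byte(1)
    block = flops_per_weight_byte(256)
    print(f"1 token per step:    {one_at_a_time:.0f} FLOPs per byte")
    print(f"256 tokens per step: {block:.0f} FLOPs per byte")
```

Under these assumptions, a step that handles more tokens does more arithmetic for the same weight traffic, which is the sense in which the limit moves from memory bandwidth toward compute.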

In a separate evaluation claim, Crypto Briefing says DiffusionGemma has achieved sampling speeds of approximately 1,479 tokens per second and links the speed to diffusion-style iterative refinement rather than committing to each token permanently.
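Because the outlets quote different figures, it can help to convert each into the time needed for one 256-token block. The Python sketch below uses only the throughput numbers reported above and treats them as steady rates, which is an assumption; the figures come from different setups and are not directly comparable. The dictionary labels and the `seconds_for` helper are hypothetical.

```python
# Tokens per second as reported by the outlets (different setups).
REPORTED_TOKENS_PER_SECOND = {
    "DGX Spark (up to)": 150,
    "RTX 5090 (around)": 700,
    "H100 (reported as up to 1,000 and as 1,000+)": 1000,
    "Crypto Briefing sampling figure (approx.)": 1479,
}


def seconds_for(tokens, tokens_per_second):
    """Time to emit `tokens` at a constant rate (a simplifying assumption)."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    for label, rate in REPORTED_TOKENS_PER_SECOND.items():
        print(f"{label}: {seconds_for(256, rate):.2f} s per 256 tokens")
```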

Hardware fit and next steps

The open-weight model is released under an Apache 2.0 license, and it is presented as supporting local deployment without requiring cloud-based resources or per-token costs.

“DiffusionGemma offers 4x faster output with simultaneous text generation. The open model ditches the word-by-word approach, generating entire text blocks in parallel while correcting itself in real time”

Crypto Briefing

blockchain.news says developers can test DiffusionGemma locally using Hugging Face Transformers, with support for NVIDIA’s RTX and DGX platforms available out of the box, and it also points to NVIDIA’s free API testing at build.nvidia.com.

For fine-tuning and tooling, the same source lists NVIDIA NeMo and Unsloth, along with preconfigured DGX Spark playbooks.

Ars Technica adds that Google says the parallel approach is faster and more efficient when running on local hardware like an Nvidia DGX or a “humble gaming GPU,” and it connects the method to non-linear tasks such as in-line editing, molecular sequencing, and mathematical graphing.

Crypto Briefing frames the practical implication as businesses benefiting from rapid text generation where latency matters, while also emphasizing that the open model and NVIDIA optimization remove barriers to trying it on widely available NVIDIA hardware.
