Artificial intelligenceJuly 1, 2026· via MarkTechPost

NVIDIA’s TwoTower diffusion model speeds up text generation without sacrificing quality

NVIDIA’s TwoTower diffusion model speeds up text generation without sacrificing quality

NVIDIA has just released Nemotron-Labs-TwoTower, a diffusion language model that promises faster text generation without sacrificing quality. Built on a frozen Nemotron-3-Nano-30B-A3B autoregressive backbone, this open-weight model introduces a dual-tower architecture that decouples context processing from iterative refinement.

A dual-tower architecture for parallel decoding

Traditional autoregressive models generate text one token at a time, creating a throughput bottleneck. Diffusion language models aim to solve this by generating and refining tokens in parallel, but most approaches rely on a single network to handle both tasks. TwoTower changes that by splitting the work into two specialized towers: a frozen context tower and a trained denoiser tower. The context tower maintains the autoregressive model’s capabilities, producing key-value caches and final states for the prompt and committed tokens. Meanwhile, the denoiser tower refines noisy token blocks using bidirectional in-block attention, guided by the context tower’s representations through layer-aligned cross-attention. This design keeps most of the original model’s quality—retaining 98.7% on aggregate benchmarks—while delivering a 2.42× speedup in wall-clock generation.

Training on a fraction of the backbone’s data

The denoiser tower was trained on roughly 2.1 trillion tokens, a small fraction of the 25 trillion tokens used to pretrain the backbone. Despite this limited fine-tuning, TwoTower achieves competitive results across standard benchmarks such as MMLU, MMLU-Pro, and ARC-Challenge. The model supports multiple decoding modes, including diffusion, mock-AR, and standard AR decoding, offering flexibility for different use cases.

Practical gains for high-throughput applications

Evaluations run on two H100 GPUs in BF16 precision show TwoTower’s efficiency at its default operating point (confidence unmasking threshold γ=0.8, block size S=16). The approach is especially promising for applications that demand high throughput without a significant drop in output quality. By decoupling context processing from iterative refinement, NVIDIA’s new model could help developers balance performance and accuracy in large-scale text generation tasks.


Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.

Read the original source on MarkTechPost →

← Back to home