NVIDIA’s TwoTower diffusion model speeds up text generation without sacrificing quality

NVIDIA has just released Nemotron-Labs-TwoTower, an open-weight diffusion language model that promises faster text generation with little loss in quality. It is built on a frozen Nemotron-3-Nano-30B-A3B autoregressive backbone and introduces a dual-tower architecture that decouples context processing from iterative refinement.

A dual-tower architecture for parallel decoding

Traditional autoregressive models generate text one token at a time, which creates a throughput bottleneck. Diffusion language models aim to solve this by generating and refining tokens in parallel, but most approaches rely on a single network to handle both context processing and denoising. TwoTower splits that work between two specialized towers: a frozen context tower and a trained denoiser tower. The context tower preserves the autoregressive model’s capabilities, producing key-value caches and final states for the prompt and committed tokens. The denoiser tower refines noisy token blocks using bidirectional in-block attention, guided by the context tower’s representations through layer-aligned cross-attention. This design keeps most of the original model’s quality, retaining 98.7% of it on aggregate benchmarks, while delivering a 2.42× speedup in wall-clock generation.
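To make the attention pattern concrete, here is a minimal Python sketch of which positions a denoiser query could see, compared with ordinary causal attention. It rests on assumptions: the layout (a prompt followed by fixed-size blocks) and the helper names `block_index`, `causal_visible`, `denoiser_visible` and `render` are hypothetical, and the sketch folds cross-attention to the context tower and in-block attention into a single visibility rule. It is not NVIDIA's implementation.

```python
def block_index(pos, prompt_len, block_size):
    """Return -1 for prompt positions, else the 0-based generated block index."""
    if pos < prompt_len:
        return -1
    return (pos - prompt_len) // block_size


def causal_visible(query, key):
    """Autoregressive rule: a position sees itself and everything before it."""
    return key <= query


def denoiser_visible(query, key, prompt_len, block_size):
    """Assumed denoiser rule for a generated position.

    The query sees the prompt and earlier committed blocks (standing in for
    cross-attention to the context tower) and every slot of its own block
    (bidirectional in-block attention), but nothing in later blocks.
    """
    query_block = block_index(query, prompt_len, block_size)
    if query_block < 0:
        raise ValueError("denoiser queries must be generated positions")
    return block_index(key, prompt_len, block_size) <= query_block


def render(prompt_len, block_size, num_blocks):
    """One row per generated position; 'x' marks a visible key, '.' a hidden one."""
    total = prompt_len + block_size * num_blocks
    return [
        "".join(
            "x" if denoiser_visible(q, k, prompt_len, block_size) else "."
            for k in range(total)
        )
        for q in range(prompt_len, total)
    ]


if __name__ == "__main__":
    for row in render(prompt_len=2, block_size=2, num_blocks=2):
        print(row)
```

The difference from causal attention shows up inside a block: with a two-token prompt and blocks of two, position 2 can see position 3 under the denoiser rule but not under the causal one.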

Training on a fraction of the backbone’s data

The denoiser tower was trained on roughly 2.1 trillion tokens, about 8% of the 25 trillion tokens used to pretrain the backbone. Despite this comparatively small training budget, TwoTower achieves competitive results across standard benchmarks such as MMLU, MMLU-Pro, and ARC-Challenge. The model supports multiple decoding modes, including diffusion, mock-AR, and standard AR decoding, offering flexibility for different use cases.

Practical gains for high-throughput applications

Evaluations run on two H100 GPUs in BF16 precision show TwoTower’s efficiency at its default operating point (confidence unmasking threshold γ=0.8, block size S=16). The approach is especially promising for applications that demand high throughput without a significant drop in output quality. By decoupling context processing from iterative refinement, NVIDIA’s new model could help developers balance performance and accuracy in large-scale text generation tasks.
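The role of the confidence threshold and block size can be illustrated with a toy decoding loop. This is a sketch under stated assumptions, not the model's actual decoder: `toy_denoiser` returns random proposals in place of a real denoiser tower, the context tower is left out entirely, and the rule of always committing the single most confident slot is an assumed fallback added so the loop terminates. All names are hypothetical.

```python
import random

MASK = None  # hypothetical placeholder for a still-noisy (masked) slot


def toy_denoiser(block, rng):
    """Stand-in for the denoiser tower: propose (token, confidence) per slot.

    A real denoiser would condition on the context tower's cache; here the
    proposals are random so the control flow can run on its own.
    """
    return [
        (rng.randrange(1000), rng.random()) if tok is MASK else (tok, 1.0)
        for tok in block
    ]


def decode(prompt, num_blocks, block_size=16, gamma=0.8, seed=0):
    """Generate num_blocks blocks; return (tokens, denoiser_steps)."""
    rng = random.Random(seed)
    committed = list(prompt)
    steps = 0
    for _ in range(num_blocks):
        block = [MASK] * block_size
        while any(tok is MASK for tok in block):
            proposals = toy_denoiser(block, rng)
            steps += 1
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            # Assumed fallback: the most confident slot is always committed,
            # so every step makes progress even if nothing clears gamma.
            best = max(masked, key=lambda i: proposals[i][1])
            for i in masked:
                if proposals[i][1] >= gamma or i == best:
                    block[i] = proposals[i][0]
        committed.extend(block)  # the finished block joins the context
    return committed, steps


if __name__ == "__main__":
    tokens, steps = decode(prompt=[1, 2, 3], num_blocks=4)
    generated = len(tokens) - 3
    print(f"{generated} tokens in {steps} denoiser steps")
```

In this toy, a lower threshold commits more slots per step and so needs fewer denoiser passes per block, while a threshold that nothing can clear falls back to one token per step. The real trade-off also depends on how quality changes as more tokens are committed at once, which the sketch does not model.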

Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.
