Diffusion ASR breaks new ground with multilingual speech-to-text

A YC-backed startup has quietly open-sourced what may become a blueprint for the next wave of speech recognition. Interfaze has released diffusion-gemma-asr-small, the first multilingual audio diffusion ASR model. It transcribes six languages using a single 42-million-parameter adapter bolted onto a frozen 26-billion-parameter backbone.
Why diffusion changes the game
Most speech-to-text systems rely on autoregressive decoders that generate tokens one at a time. Diffusion models flip the script: they refine an entire transcript in parallel, treating text generation as a denoising problem. The new model uses DiffusionGemma’s parallel denoising decoder, which replaces the usual step-by-step approach with uniform random-token diffusion. Rather than masking tokens or predicting them left to right, it fills a fixed-length canvas with random vocabulary tokens and gradually denoises them into coherent text.
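To make the idea concrete, here is a minimal toy sketch of uniform random-token denoising in Python. It is not the DiffusionGemma decoder: `toy_denoiser` is a hypothetical stand-in that peeks at the target text, where a real model would predict tokens from the audio and the current canvas. The character vocabulary and the linear noise schedule are also assumptions chosen for illustration.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")  # toy character vocabulary (assumption)

def toy_denoiser(canvas, target, accuracy, rng):
    """Hypothetical stand-in for a trained model.

    Updates every position in parallel: each slot becomes the correct token
    with probability `accuracy`, otherwise it keeps its current token.
    """
    return [t if rng.random() < accuracy else c for c, t in zip(canvas, target)]

def denoise(target, steps=16, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: every slot of the fixed-length canvas holds a random token.
    canvas = [rng.choice(VOCAB) for _ in target]
    for step in range(1, steps + 1):
        progress = step / steps
        # Predict all positions at once; no left-to-right order is involved.
        prediction = toy_denoiser(canvas, target, accuracy=progress, rng=rng)
        # Re-noise a shrinking share of slots with uniform random tokens
        # (assumed linear schedule that reaches zero noise on the last step).
        noise_level = 1.0 - progress
        canvas = [rng.choice(VOCAB) if rng.random() < noise_level else p
                  for p in prediction]
    return "".join(canvas)

print(denoise("hello world"))
```

Because the assumed schedule reaches full accuracy and zero noise on the final step, the toy always ends on the target string. A trained model offers no such guarantee, which is why step count and schedule matter in practice.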
A lean adapter on a massive backbone
Because only 42 million parameters are trained on top of a frozen 26-billion-parameter DiffusionGemma mixture-of-experts model, training costs stay manageable and the backbone’s broad knowledge is preserved. The adapter ships under Apache 2.0, while the backbone and whisper-small encoder remain in their respective repositories under their original licenses. The team reports a word error rate of 6.6% on LibriSpeech, against 8.3% for Whisfusion, though the model still trails autoregressive Whisper overall.
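A quick back-of-the-envelope check puts those figures in proportion. The short Python sketch below uses only the numbers quoted above and leaves out the frozen whisper-small encoder, whose size the article does not state. The helper names are made up for illustration.

```python
ADAPTER_PARAMS = 42_000_000        # trainable parameters, as reported
BACKBONE_PARAMS = 26_000_000_000   # frozen backbone parameters, as reported

def trainable_fraction(trainable, frozen):
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)

def relative_reduction(baseline, new):
    """Relative improvement of `new` over `baseline` (lower is better)."""
    return (baseline - new) / baseline

frac = trainable_fraction(ADAPTER_PARAMS, BACKBONE_PARAMS)
wer_gain = relative_reduction(baseline=8.3, new=6.6)  # LibriSpeech WER, in percent

print(f"trainable share of parameters: {frac:.2%}")
print(f"relative WER reduction vs Whisfusion: {wer_gain:.1%}")
```

On these numbers, well under 1% of the parameters are trained, and the reported gap to Whisfusion corresponds to roughly a one-fifth relative reduction in word error rate.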
From raw audio to denoised text
The pipeline does not feed raw waveforms directly into the LLM. Instead, a frozen whisper-small encoder converts 30 seconds of speech into 1,500 acoustic frames. A small trainable projector compresses these into 188 “audio tokens,” which are scattered into reserved slots in DiffusionGemma’s prompt. LoRA adapters let the backbone attend to the new modality, and the decoder denoises a 192-token transcript canvas bidirectionally in roughly 16 steps. The result is a compact, modular architecture that separates feature extraction, projection, and decoding into distinct stages, with only the projector and the LoRA adapters being trained.
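The token counts in that pipeline can be traced with a small shape-bookkeeping sketch in Python. The article does not say how the projector gets from 1,500 frames to 188 tokens. The code assumes it merges 8 consecutive frames per token with zero-padding, which happens to give ceil(1500 / 8) = 188. The `<audio>` placeholder, the prompt layout, and the 4-dimensional toy features are likewise hypothetical, and the learned projection layer is left out.

```python
import math

FRAMES = 1500   # encoder frames for a 30 s window (from the article)
STACK = 8       # ASSUMPTION: frames merged per audio token; not stated in the article

def stack_frames(frames, stack=STACK):
    """Concatenate each run of `stack` consecutive frames, zero-padding the last run."""
    dim = len(frames[0])
    n_groups = math.ceil(len(frames) / stack)
    padded = frames + [[0.0] * dim] * (n_groups * stack - len(frames))
    return [sum(padded[i * stack:(i + 1) * stack], []) for i in range(n_groups)]

def scatter(prompt, audio_tokens, placeholder="<audio>"):
    """Fill each reserved placeholder slot in the prompt with the next audio token."""
    if prompt.count(placeholder) != len(audio_tokens):
        raise ValueError("placeholder count must match audio token count")
    it = iter(audio_tokens)
    return [next(it) if tok == placeholder else tok for tok in prompt]

# Toy 4-dimensional features standing in for real encoder outputs.
frames = [[float(i)] * 4 for i in range(FRAMES)]
audio_tokens = stack_frames(frames)

# Hypothetical prompt layout: one reserved slot per audio token.
prompt = ["<bos>"] + ["<audio>"] * len(audio_tokens) + ["transcribe:"]
filled = scatter(prompt, audio_tokens)

print(len(frames), "frames ->", len(audio_tokens), "audio tokens")
```

Under this assumption the last audio token holds four real frames and four padded ones, since 1,500 is not a multiple of 8. The real projector may use a different compression scheme that yields the same count.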
Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.

