Flash-KMeans slashes GPU k-means latency with smarter data flow
Modern AI pipelines now call k-means inside training loops and inference steps, so every millisecond counts. Researchers from UC Berkeley and UT Austin just open-sourced Flash-KMeans, an IO-aware implementation that leaves the math of Lloyd’s algorithm untouched but restructures how data moves through an NVIDIA GPU. The result: up to 17.9× end-to-end speedup versus the best baseline on an H200, 33× against NVIDIA cuML, and over 200× versus FAISS.
Smarter kernels, not shortcuts
Unlike methods that skip work by pruning or sampling, Flash-KMeans keeps every centroid and every point in the computation. It simply fuses the distance loop with an online argmin so the full N×K distance matrix never leaves high-bandwidth memory. Borrowing ideas from FlashAttention, the FlashAssign kernel streams tiles of points and centroids from HBM into on-chip SRAM, then reduces the dominant IO complexity from O(NK) to O(Nd + Kd). The research team reports kernel-level gains up to 21.2×, cutting a 122.5 ms assignment pass down to 5.8 ms in one test.
Contention-free updates
The centroid update phase in standard code relies on scatter-style atomic adds that serialize when many threads target the same “hot” cluster. Flash-KMeans replaces this with Sort-Inverse Update. It argsorts the 1D assignment vector so identical cluster IDs form contiguous segments, then each thread block reduces a segment in SRAM before issuing a single atomic add. The approach drops the number of atomic operations and lifts effective bandwidth from roughly 50 GB/s to much higher levels, yielding up to 6.3× kernel speedups.
Installable via pip under Apache 2.0, Flash-KMeans targets batched k-means workloads where latency per call matters more than peak FLOPs.
Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.

