MoonMath AI’s HIP Kernel Boosts AMD MI300X Attention Speed
MoonMath AI has released an open-source bf16 forward attention kernel for AMD’s MI300X GPU, written in HIP rather than hand-optimized assembly. In the reported benchmarks it outperforms AMD’s AITER v3 in every tested configuration, with a peak speedup of 1.26× and average speedups of 1.18×, 1.15×, and 1.08×, depending on the rounding mode. The kernel is available under the MIT license, targets AMD’s gfx942 ISA, and is designed exclusively for the MI300X.
A Practical Approach to GPU Optimization
Unlike traditional assembly-based kernels, MoonMath’s implementation stays in compiler-friendly HIP code while retaining low-level control over execution. The core technique wraps a single matrix fused multiply-add (MFMA) instruction in a device function and uses extended inline-assembly constraints to describe its operands, so no registers have to be managed by hand. Tying the input and output registers together lets the compiler avoid unnecessary data movement, which keeps the kernel source clean without giving up performance.
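To make the tied-operand idea concrete, here is a minimal host-side Python sketch of what one MFMA step computes: a small tile product added into an accumulator that serves as both input and output. It is not the kernel's HIP code; the function name `mfma_accumulate` and the 2x2x2 tile shape are illustrative assumptions, and real MFMA tiles on gfx942 are larger.

```python
def mfma_accumulate(acc, a, b):
    """Model one MFMA step: acc += a @ b, updating acc in place.

    acc: M x N list of floats (accumulator tile, both input and output)
    a:   M x K list of floats
    b:   K x N list of floats
    """
    m, k, n = len(a), len(b), len(b[0])
    assert len(acc) == m and all(len(row) == n for row in acc)
    assert all(len(row) == k for row in a)
    for i in range(m):
        for j in range(n):
            total = acc[i][j]          # start from the existing accumulator value
            for p in range(k):
                total += a[i][p] * b[p][j]
            acc[i][j] = total          # write back into the same storage
    return acc


# Hypothetical 2x2x2 tile, chosen only to keep the example readable.
acc = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
mfma_accumulate(acc, a, b)
print(acc)
```

Because the result overwrites the accumulator rather than landing in a separate buffer, no copy is needed between consecutive steps. That is the property the tied register constraint expresses to the compiler.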
Memory and Execution Strategy
Performance gains come from deliberate memory placement: key matrices are loaded into the local data share (LDS), value vectors are cached in L1, and query vectors and accumulators stay in registers. The kernel processes eight waves per compute unit block, split into two synchronized groups that alternate between matrix operations and softmax calculations. Two barrier synchronizations per iteration keep the groups in step, so the matrix core remains active throughout execution.
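The alternating schedule described above can be sketched as a small Python simulation. It is an illustration under stated assumptions, not the kernel's actual control flow: the equal four-wave split, the function name `pingpong_schedule`, and the placement of one barrier after each phase are assumptions. The article states only eight waves, two groups, and two barriers per iteration.

```python
WAVES_PER_BLOCK = 8   # stated in the article
NUM_GROUPS = 2        # stated in the article
WAVES_PER_GROUP = WAVES_PER_BLOCK // NUM_GROUPS  # assumption: equal split


def pingpong_schedule(num_iterations):
    """Return (timeline, barrier_count) for a two-group ping-pong loop."""
    timeline = []
    barriers = 0
    for iteration in range(num_iterations):
        for phase in range(NUM_GROUPS):
            mfma_group = phase            # group currently using the matrix core
            softmax_group = 1 - phase     # the other group runs softmax
            timeline.append((iteration, phase, mfma_group, softmax_group))
            barriers += 1                 # both groups sync before swapping roles
    return timeline, barriers


timeline, barriers = pingpong_schedule(3)
for iteration, phase, mfma_group, softmax_group in timeline:
    print(f"iter {iteration} phase {phase}: "
          f"group {mfma_group} -> MFMA, group {softmax_group} -> softmax")
print("barriers per iteration:", barriers // 3)
```

In every phase of this model exactly one group issues matrix work while the other does softmax, which is why the matrix core is never left idle.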
Real-World Impact and Limitations
A production deployment in SGLang demonstrated a 1.23× speedup for the Wan2.1 video diffusion model with no quality regression. The kernel has firm constraints, however: it accepts only bf16 inputs in BSHD or BHSD layout with a fixed head dimension of 128, and it does not support causal masking, grouped-query attention, or variable-length batching. Outputs remain numerically consistent with AITER v3: the kernel matches its rounding rules and handles edge cases such as NaN and Inf deterministically.
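A caller could guard against these limits before dispatching to the kernel. The Python sketch below is hypothetical: `AttnConfig`, `unsupported_reasons`, and `element_offset` are illustrative names, not part of MoonMath's or SGLang's API, and the offset formulas assume dense, contiguous tensors.

```python
from dataclasses import dataclass


@dataclass
class AttnConfig:
    dtype: str
    layout: str          # "BSHD" or "BHSD"
    head_dim: int
    num_q_heads: int
    num_kv_heads: int
    causal: bool = False
    varlen: bool = False


def unsupported_reasons(cfg):
    """Return the constraints from the article that cfg violates."""
    reasons = []
    if cfg.dtype != "bf16":
        reasons.append("dtype must be bf16")
    if cfg.layout not in ("BSHD", "BHSD"):
        reasons.append("layout must be BSHD or BHSD")
    if cfg.head_dim != 128:
        reasons.append("head dimension must be 128")
    if cfg.num_q_heads != cfg.num_kv_heads:
        reasons.append("grouped-query attention is not supported")
    if cfg.causal:
        reasons.append("causal masking is not supported")
    if cfg.varlen:
        reasons.append("variable-length batching is not supported")
    return reasons


def element_offset(layout, b, s, h, d, S, H, D):
    """Flat index of element (batch b, position s, head h, dim d)."""
    if layout == "BSHD":
        return ((b * S + s) * H + h) * D + d
    if layout == "BHSD":
        return ((b * H + h) * S + s) * D + d
    raise ValueError(f"unknown layout: {layout}")


ok = AttnConfig("bf16", "BSHD", 128, 16, 16)
gqa_causal = AttnConfig("bf16", "BHSD", 128, 32, 8, causal=True)
print(unsupported_reasons(ok))
print(unsupported_reasons(gqa_causal))
```

The two layouts differ only in whether the sequence or the head axis varies faster, which is what `element_offset` makes explicit.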
Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.

