Prime RL 0.6.0 pushes trillion-parameter MoE models to new scale

Prime Intellect has released prime-rl 0.6.0, an open framework that trains trillion-parameter Mixture-of-Experts (MoE) models on agentic reinforcement-learning workloads. According to the release, step times stay under five minutes while processing 256 rollouts at sequence lengths of up to 131k tokens, using only 28 H200 nodes.

Breaking the trillion-parameter barrier for agentic RL

The new version extends prime-rl’s asynchronous RL pipeline to MoE scales that were previously impractical. By running the trainer and the inference engine as separate, disaggregated components, the framework avoids leaving GPUs idle during long-tail rollouts and synchronizes the two sides only at policy-update points. A run of zai-org/GLM-5.1 starts with a single command on a Slurm cluster, demonstrating that large-scale agentic training can begin in minutes rather than days.
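The asynchronous pattern described above can be sketched in a few lines of Python. This is a toy illustration and not prime-rl's code: the names `rollout_worker`, `trainer`, and `run`, the shared `policy` dictionary, and the sleep-based "generation time" are all hypothetical stand-ins. It shows only the general idea that inference workers keep producing rollouts while the trainer consumes whatever is ready, with the policy version bump as the sole synchronization point.

```python
import asyncio
import random


async def rollout_worker(worker_id, policy, queue, rng):
    """Hypothetical inference worker: produces rollouts without waiting for the trainer."""
    while True:
        version = policy["version"]  # weights version this rollout is generated with
        await asyncio.sleep(rng.uniform(0.001, 0.01))  # stand-in for variable rollout length
        await queue.put({"worker": worker_id, "policy_version": version})


async def trainer(policy, queue, batch_size, num_steps):
    """Hypothetical trainer: takes the first `batch_size` finished rollouts per step."""
    history = []
    for _ in range(num_steps):
        batch = [await queue.get() for _ in range(batch_size)]
        # How many policy versions behind the oldest rollout in this batch is.
        max_lag = max(policy["version"] - r["policy_version"] for r in batch)
        policy["version"] += 1  # policy-update point: the only place both sides sync
        history.append({"step": policy["version"], "max_lag": max_lag})
    return history


async def run(num_workers=4, batch_size=8, num_steps=5, seed=0):
    rng = random.Random(seed)
    policy = {"version": 0}
    queue = asyncio.Queue(maxsize=batch_size * 2)
    workers = [
        asyncio.create_task(rollout_worker(i, policy, queue, rng))
        for i in range(num_workers)
    ]
    try:
        return await trainer(policy, queue, batch_size, num_steps)
    finally:
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


history = asyncio.run(run())
for entry in history:
    print(entry)
```

Because the trainer never waits for a specific worker, one slow rollout delays only itself. The trade-off, visible in `max_lag`, is that some rollouts in a batch may have been generated with slightly older weights.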

Optimized inference: speed meets stability

Inference throughput becomes the bottleneck in long-horizon RL, so prime-rl introduces several targeted optimizations. FP8 inference with custom kernels lowers prefill and decode latency without sacrificing stability. Wide Expert Parallelism spreads experts across at least 32 GPUs while maintaining large data-parallel ranks, enabling efficient expert serving. Prefill/Decode disaggregation keeps decode workers responsive even when tool outputs inflate the number of prefill tokens, and tiered KV-cache offloading pools RAM and disk across nodes to handle high concurrency. A fork of vLLM-router routes requests by KV-cache reuse, queue depth, and live load to keep worker utilization balanced.
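To make the routing idea concrete, here is a minimal sketch of a scoring function that combines the three signals named above. It is an illustration under stated assumptions, not the vLLM-router fork's actual logic: the `Worker` fields, the `score` and `pick_worker` functions, and the weights `w_cache`, `w_queue`, and `w_load` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix_tokens: int  # prompt tokens already present in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    load: float                # current utilization, from 0.0 to 1.0


def score(worker, prompt_tokens, w_cache=1.0, w_queue=0.1, w_load=0.5):
    """Higher is better: reward KV-cache reuse, penalize queueing and load.

    The weights are arbitrary illustrative values, not tuned settings.
    """
    reuse = worker.cached_prefix_tokens / prompt_tokens if prompt_tokens else 0.0
    return w_cache * reuse - w_queue * worker.queue_depth - w_load * worker.load


def pick_worker(workers, prompt_tokens):
    """Route the request to the worker with the best combined score."""
    return max(workers, key=lambda w: score(w, prompt_tokens))


workers = [
    Worker("decode-0", cached_prefix_tokens=9000, queue_depth=6, load=0.9),
    Worker("decode-1", cached_prefix_tokens=4000, queue_depth=1, load=0.3),
    Worker("decode-2", cached_prefix_tokens=0, queue_depth=0, load=0.1),
]
chosen = pick_worker(workers, prompt_tokens=10000)
print(chosen.name)
```

In this toy setup the worker with the most cached tokens is not chosen, because its long queue and high load outweigh the reuse benefit. That is the kind of balance a router weighing all three signals has to strike.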

Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.
