Kwai AI's New Training Method Cuts Steps by 90% While Surpassing DeepSeek-R1-Zero in Math and Code

From Xtcworld, the free encyclopedia of technology

Breaking News — In a major breakthrough for large language model (LLM) training, researchers from Kuaishou's Kwaipilot team have unveiled SRPO (Two-Staged history-Resampling Policy Optimization), a reinforcement learning framework that achieves performance on par with DeepSeek-R1-Zero using just one-tenth of the training steps. The method, detailed in a newly released technical report, marks the first time a purely RL-trained model has matched R1-Zero-level results simultaneously in both mathematics and code domains.

“SRPO demonstrates that we can dramatically reduce computational cost without sacrificing reasoning capability,” said Dr. Li Wei, lead researcher on the Kwaipilot team. “This is a game-changer for scaling RL in LLMs.” The team has open-sourced the SRPO-Qwen-32B model to accelerate further research.

The SRPO model, built on the same Qwen2.5-32B base used by DeepSeek, scored 50 on the AIME24 math benchmark and 41.6 on the LiveCodeBench code benchmark, surpassing DeepSeek-R1-Zero-32B on both. Notably, it achieves this with only 10% of the training steps required by R1-Zero.

Challenges with Vanilla GRPO

Standard GRPO (Group Relative Policy Optimization) faces well-documented bottlenecks. The Kwaipilot team’s initial experiments with GRPO hit performance ceilings, preventing them from reaching R1-Zero-level capabilities. Two primary issues emerged:

  • Cross-domain conflicts: mixing math and code data causes suboptimal performance in both domains.
  • Low training efficiency: similar reward values within sampled groups lead to near-zero advantages and thus vanishing gradients (see the sketch below).
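To make the second bottleneck concrete, here is a minimal sketch of the group-relative advantage calculation GRPO relies on; the function is illustrative and not taken from the Kwaipilot codebase. When every rollout in a group earns the same reward, the normalized advantages collapse to zero and the update carries no learning signal.

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and standard deviation of its group (all rollouts of one prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Mixed outcomes give informative advantages:
print(group_relative_advantage(np.array([1.0, 0.0, 1.0, 0.0])))  # [ 1. -1.  1. -1.]

# Uniform outcomes (all correct, or all wrong) give near-zero advantages,
# so the policy gradient for this group vanishes:
print(group_relative_advantage(np.array([1.0, 1.0, 1.0, 1.0])))  # [0. 0. 0. 0.]
```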

“When you train on mixed domains with GRPO, the model struggles to specialize,” Dr. Li explained. “Math prompts long reasoning chains; code doesn’t, and the algorithm gets confused.”

Background: The RL Training Bottleneck

OpenAI’s o1 series and DeepSeek-R1 proved that large-scale RL can elicit sophisticated reasoning in LLMs. However, the core training methods remain opaque, and most community efforts focus narrowly on mathematical reasoning. Cross-domain generalization—such as training a single model to excel at both math and code—has been largely unexplored.

GRPO, the standard algorithm used for such tasks, suffers from inefficient sample utilization and struggles to cultivate specialized reasoning skills when datasets mix multiple domains. These challenges have hindered the scaling of RL for LLMs.


How SRPO Solves These Issues

SRPO introduces a two-staged history-resampling approach. Instead of treating all training samples uniformly, the method first separates domains and resamples trajectories based on historical performance. This allows the model to learn domain-specific reasoning patterns without interference.
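The report’s exact resampling criteria are not reproduced in this article; as a rough sketch under that caveat, a history-based filter might retain only prompts whose past rollout groups were not uniformly solved. All names and thresholds below are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical tracker: prompt id -> mean group reward per past epoch.
reward_history: defaultdict[str, list[float]] = defaultdict(list)

def record_epoch(prompt_id: str, group_rewards: list[float]) -> None:
    """Log the mean reward of one sampled rollout group for later filtering."""
    reward_history[prompt_id].append(sum(group_rewards) / len(group_rewards))

def keep_for_resampling(prompt_id: str) -> bool:
    """Keep a prompt only if history shows it is not trivially solved:
    an all-correct group yields zero advantage (see the GRPO sketch above)."""
    history = reward_history[prompt_id]
    if not history:
        return True            # unseen prompt: always worth sampling
    return max(history) < 1.0  # drop prompts the policy already always solves
```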

The second stage re-integrates domains with adaptive weighting, ensuring that training efficiency remains high even when group rewards are similar. “By resampling histories, we maintain gradient diversity and avoid the stagnation that plagues GRPO,” said Dr. Li.
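The article does not spell out the weighting rule, so the following is only one plausible reading: sample domains in proportion to how far the policy is from solving them, which keeps gradients flowing for whichever domain currently lags. The function and its parameters are assumptions, not taken from the report.

```python
import math

def adaptive_domain_weights(solve_rates: dict[str, float],
                            temperature: float = 1.0) -> dict[str, float]:
    """Softmax over per-domain failure rates: a domain the policy solves
    less often gets a larger share of the mixed training batch."""
    scores = {d: math.exp((1.0 - r) / temperature) for d, r in solve_rates.items()}
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

# Example: math is 70% solved, code only 40%, so code is sampled more often.
print(adaptive_domain_weights({"math": 0.7, "code": 0.4}))
# -> {'math': ~0.43, 'code': ~0.57}
```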

Experimental results show that SRPO not only matches but exceeds DeepSeek-R1-Zero’s performance in both math and code, despite using far fewer training steps. The team plans to extend SRPO to additional domains and larger base models.

What This Means

The implications are significant. If SRPO’s efficiency gains hold at scale, AI researchers and companies could train powerful reasoning models at a fraction of the current cost. This could accelerate the development of specialized LLMs for scientific, engineering, and code generation tasks.

“SRPO shows that we don’t need brute-force scaling to achieve advanced reasoning,” Dr. Li concluded. “With smarter algorithms, we can unlock new capabilities while reducing energy and time costs.” The open-source release of SRPO-Qwen-32B invites the community to build on this work, potentially reshaping the landscape of RL-based LLM training.