The field of large language models (LLMs) is seeing rapid advances in reasoning and mathematical capability. Recent work focuses on improving both accuracy and efficiency through training recipes that combine supervised fine-tuning (SFT) and reinforcement learning (RL), yielding state-of-the-art results on challenging benchmarks, including mathematical Olympiad competitions. Notably, researchers have found that a prolonged SFT phase can substantially raise accuracy, while a subsequent RL phase can shorten solutions and improve token efficiency. In addition, adaptive guidance and difficulty-aware reinforcement learning frameworks have been proposed to stabilize training and further improve reasoning performance. These advances have far-reaching implications for building powerful and robust reasoning models. Some notable papers include:
- A Practical Two-Stage Recipe for Mathematical LLMs, which introduces a systematic methodology for combining SFT and RL to maximize accuracy and efficiency.
- KAT-V1, which presents an open-source 40B large language model that addresses the overthinking problem in reasoning-intensive tasks through an automatic-thinking training paradigm.
- GHPO, which proposes a difficulty-aware reinforcement learning framework that adaptively balances direct imitation learning with exploration-based reinforcement learning (a minimal sketch of this switching idea follows the list).
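To make the difficulty-aware switching idea concrete, below is a minimal, self-contained Python sketch of the general pattern: per problem, estimate how often the current policy solves it, fall back to an imitation-style (SFT-like) update when the estimated accuracy is low, and use an RL-style (REINFORCE) update otherwise. This is not GHPO's actual algorithm; the toy policy (per-problem logits over candidate answers), the `HARD_THRESHOLD` switching rule, and the 0/1 correctness reward are all simplifying assumptions made purely for illustration.

```python
"""Toy sketch of difficulty-aware training that blends imitation learning and
reinforcement learning. All components here are illustrative stand-ins."""
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": per-problem logits over a small set of candidate answers.
N_PROBLEMS, N_CANDIDATES = 4, 5
logits = rng.normal(size=(N_PROBLEMS, N_CANDIDATES))
correct = rng.integers(0, N_CANDIDATES, size=N_PROBLEMS)  # ground-truth answer index

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def rollout_accuracy(p_idx, n_rollouts=16):
    """Estimate how often the current policy samples the correct answer."""
    probs = softmax(logits[p_idx])
    samples = rng.choice(N_CANDIDATES, size=n_rollouts, p=probs)
    return float((samples == correct[p_idx]).mean())

LR = 0.5
HARD_THRESHOLD = 0.2   # assumed cutoff: below this accuracy, use imitation instead of RL

for step in range(200):
    p = rng.integers(0, N_PROBLEMS)
    probs = softmax(logits[p])
    acc = rollout_accuracy(p)

    if acc < HARD_THRESHOLD:
        # Imitation-style update: cross-entropy gradient toward the reference answer,
        # standing in for SFT on guided/hinted solutions for hard problems.
        grad = probs.copy()
        grad[correct[p]] -= 1.0
        logits[p] -= LR * grad
    else:
        # RL-style update: REINFORCE on a sampled answer with a 0/1 correctness reward
        # and the estimated accuracy as a simple variance-reducing baseline.
        a = rng.choice(N_CANDIDATES, p=probs)
        reward = 1.0 if a == correct[p] else 0.0
        grad = -(reward - acc) * (np.eye(N_CANDIDATES)[a] - probs)
        logits[p] -= LR * grad

print("final per-problem accuracy:",
      [round(rollout_accuracy(i, 200), 2) for i in range(N_PROBLEMS)])
```

The same skeleton also reflects the two-stage theme above in miniature: the imitation branch plays the role of the SFT phase that builds baseline accuracy, while the RL branch refines behavior once the model can already reach correct answers on its own.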