Research on Large Language Models (LLMs) is increasingly focused on the limitations of current reinforcement learning methods, such as capability boundary collapse and inefficient training. One line of work explores hybrid approaches that combine internal exploitation with external data to strengthen reasoning and push beyond the capability boundary of the base model. Another focuses on simulating and optimizing LLM inference systems, including high-fidelity inference simulators and heterogeneity-aware training simulators. Decoupling the inference and training phases of RL pipelines and improving the efficiency of RL fine-tuning are also active directions. Notable papers include RL-PLUS, which proposes a hybrid-policy optimization approach to address capability boundary collapse; Frontier, a high-fidelity simulator designed for the next generation of LLM inference systems; Echo, an RL system that decouples the inference and training phases across heterogeneous swarms; and Shuffle-R1, a framework that improves RL fine-tuning efficiency through dynamic trajectory sampling and batch composition.
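To make the "internal exploitation plus external data" idea concrete, below is a minimal, hedged sketch of a hybrid-policy objective: a PPO-style clipped surrogate on the model's own rollouts combined with a supervised term on external demonstration data. The function name, the fixed weighting, and all tensor shapes are assumptions for illustration only; this is not the actual RL-PLUS formulation.

```python
# Illustrative sketch of a hybrid-policy loss (assumed formulation, not RL-PLUS itself):
# an on-policy clipped surrogate plus a weighted NLL term on external trajectories.
import torch
import torch.nn.functional as F

def hybrid_policy_loss(new_logits, old_logits, actions, advantages,
                       ext_logits, ext_targets, clip_eps=0.2, ext_weight=0.5):
    """Mix an on-policy clipped surrogate with an external-data likelihood term."""
    # On-policy term: PPO-style clipped surrogate on the model's own rollouts.
    new_logp = F.log_softmax(new_logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    old_logp = F.log_softmax(old_logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # External term: negative log-likelihood on curated external trajectories,
    # intended to pull the policy beyond what its own rollouts alone can reach.
    ext_loss = F.cross_entropy(ext_logits.reshape(-1, ext_logits.size(-1)),
                               ext_targets.reshape(-1))

    return policy_loss + ext_weight * ext_loss

# Tiny usage example with random tensors standing in for model outputs.
if __name__ == "__main__":
    B, T, V = 2, 4, 8  # batch size, sequence length, vocabulary size
    loss = hybrid_policy_loss(
        new_logits=torch.randn(B, T, V),
        old_logits=torch.randn(B, T, V),
        actions=torch.randint(0, V, (B, T)),
        advantages=torch.randn(B, T),
        ext_logits=torch.randn(B, T, V),
        ext_targets=torch.randint(0, V, (B, T)),
    )
    print(loss.item())
```

The fixed `ext_weight` here is a simplification; how the on-policy and external terms are balanced is precisely where hybrid-policy methods differ.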