Research on large language model reasoning is shifting toward methods that align models with human-annotated demonstrations more effectively and efficiently while strengthening their reasoning capabilities. Recent work targets known limitations of supervised fine-tuning and of reinforcement learning with verifiable rewards, including overfitting, poor out-of-domain generalization, and capability regression. Proposed remedies include on-policy techniques, self-rewarding mechanisms, and progressive reward structures aimed at improving generalization and data efficiency, alongside growing interest in reasoning models that are robust to flawed-positive rollouts and reward hacking. Notable papers include Self-Rewarding PPO, which combines supervised fine-tuning with proximal policy optimization to align more effectively from demonstration data, and FAPO, which proposes a parameter-free reward penalty for flawed-positive rollouts, letting the policy exploit them as useful shortcuts during the warm-up stage while optimization gradually shifts toward reliable reasoning in the later refinement stage. Illustrative sketches of both ideas follow.
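
The combination described for Self-Rewarding PPO can be pictured as a single objective that mixes a supervised term on the demonstrations with a PPO-style surrogate driven by a reward the policy produces for itself. The sketch below is illustrative only: the toy policy, the use of the demonstration's own likelihood as the self-reward, and the mixing weight are assumptions, not the paper's formulation.

```python
# Minimal sketch (not the paper's implementation) of a combined SFT + PPO-style
# objective, assuming the "self-reward" is derived from the policy's own
# likelihood of the demonstration. The toy policy and constants are illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, HIDDEN, SEQ_LEN = 32, 16, 8

# Toy "policy": maps a per-step hidden state to next-token logits.
policy = torch.nn.Linear(HIDDEN, VOCAB)
old_policy = torch.nn.Linear(HIDDEN, VOCAB)
old_policy.load_state_dict(policy.state_dict())  # frozen behavior policy for the PPO ratio

def token_logprobs(model, states, actions):
    """Log-probability of each demonstration token under the given model."""
    logits = model(states)  # (T, VOCAB)
    return F.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

# A fake demonstration: per-step states plus the annotated next tokens.
states = torch.randn(SEQ_LEN, HIDDEN)
actions = torch.randint(0, VOCAB, (SEQ_LEN,))

logp = token_logprobs(policy, states, actions)
logp_old = token_logprobs(old_policy, states, actions).detach()

# (1) SFT term: negative log-likelihood of the demonstration tokens.
sft_loss = -logp.mean()

# (2) PPO term with a self-derived reward: the policy's average likelihood of
#     the demonstration acts as a scalar reward (an assumption standing in for
#     the paper's reward definition), crudely centered in place of a baseline.
self_reward = logp.mean().exp().detach()  # in (0, 1]: higher when the policy favors the demo
advantage = self_reward - 0.5
ratio = torch.exp(logp - logp_old)
clipped = torch.clamp(ratio, 0.8, 1.2)
ppo_loss = -torch.min(ratio * advantage, clipped * advantage).mean()

loss = sft_loss + 0.5 * ppo_loss  # illustrative mixing weight
loss.backward()
print(f"sft={sft_loss.item():.3f}  ppo={ppo_loss.item():.3f}  total={loss.item():.3f}")
```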
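
To make the FAPO description concrete, the toy below shapes rewards so that flawed-positive rollouts (correct final answer, flawed reasoning) earn full credit early in training and are progressively penalized later. It is a sketch of the general idea only: the `Rollout` fields, the availability of a flaw detector, and the linear penalty schedule are assumptions, and the scheduled penalty here is explicitly not the paper's parameter-free formulation.

```python
# Illustrative flaw-aware reward shaping in the spirit of the FAPO description.
# The flaw detector, the linear schedule, and all constants are assumptions;
# the actual method is described as using a parameter-free penalty.
from dataclasses import dataclass

@dataclass
class Rollout:
    is_correct: bool  # final answer verified against the ground truth
    is_flawed: bool   # reasoning flagged as flawed by some detector (assumed available)

def shaped_reward(rollout: Rollout, step: int, total_steps: int) -> float:
    """Return the training reward for one rollout at a given training step."""
    if not rollout.is_correct:
        return 0.0  # incorrect rollouts receive no reward
    if not rollout.is_flawed:
        return 1.0  # clean correct rollouts keep the full reward
    # Flawed-positive rollout: full credit early (warm-up shortcut),
    # progressively penalized as training moves into the refinement stage.
    progress = min(step / max(total_steps, 1), 1.0)
    return 1.0 - progress  # decays from 1.0 to 0.0 over training

if __name__ == "__main__":
    demo = Rollout(is_correct=True, is_flawed=True)
    for step in (0, 500, 1000):
        print(step, shaped_reward(demo, step, total_steps=1000))
```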