Large Language Model Optimization Advances

The field of large language models is moving toward more efficient and stable optimization methods. Recent research focuses on improving the trade-off between optimization granularity and training stability, with several new frameworks and techniques proposed and state-of-the-art results reported on a range of benchmarks. Notable papers include ESPO, which introduces an entropy importance sampling framework for policy optimization that reconciles fine-grained control with training stability, and DVPO, which combines conditional risk theory with distributional value modeling to balance robustness and generalization. Other contributions, such as Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective and Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation, bring new perspectives to optimization and sampling for large language models.
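To ground the discussion, the sketch below shows the generic clipped importance-sampling surrogate with an entropy bonus that token-level policy-optimization methods for LLMs typically start from. It is not the ESPO or DVPO objective; the function name, shapes, and coefficient values are illustrative assumptions, and the specific weightings proposed in these papers are not reproduced here.

```python
# Hypothetical sketch: a generic clipped importance-sampling surrogate with an
# entropy bonus, the common starting point for token-level RL on LLMs.
# This is NOT the ESPO or DVPO objective; names and values are illustrative.
import numpy as np

def clipped_is_objective(logp_new, logp_old, advantages, entropy,
                         clip_eps=0.2, ent_coef=0.01):
    """Per-token clipped surrogate objective with an entropy bonus.

    logp_new, logp_old: log-probabilities of sampled tokens under the current
        and behavior policies (shape: [num_tokens]).
    advantages: per-token advantage estimates (shape: [num_tokens]).
    entropy: per-token policy entropy (shape: [num_tokens]).
    """
    ratio = np.exp(logp_new - logp_old)                       # importance weights
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = np.minimum(unclipped, clipped)                # pessimistic bound
    return float(np.mean(surrogate) + ent_coef * np.mean(entropy))

# Toy usage with random numbers standing in for model outputs.
rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.5, size=128)
logp_new = logp_old + rng.normal(0.0, 0.1, size=128)
adv = rng.normal(0.0, 1.0, size=128)
ent = rng.uniform(0.5, 2.0, size=128)
print(clipped_is_objective(logp_new, logp_old, adv, ent))
```

The entropy term here only hints at the role entropy plays in the papers above; methods like ESPO reweight the importance-sampling step itself, rather than simply adding a bonus.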

Sources

ESPO: Entropy Importance Sampling Policy Optimization

What Is Preference Optimization Doing, How and Why?

Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
