The field of large language models is moving toward more sophisticated reinforcement learning techniques, with a particular focus on multi-objective optimization. Researchers are exploring new methods to mitigate reward hacking, improve alignment with human preferences, and enhance overall model performance. Notable papers in this area include:
- OrthAlign, which resolves gradient-level conflicts in multi-objective preference alignment via orthogonal subspace decomposition (a generic projection-based sketch of the underlying conflict-resolution idea follows this list).
- MO-GRPO, which proposes a simple normalization method that reweights the reward functions so that each contributes evenly to the loss (see the normalization sketch after this list).
- Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards, which presents a unified framework for aligning large language models across various domains and objectives.
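To make the notion of a gradient-level conflict concrete, the sketch below uses a PCGrad-style projection: when two objectives' gradients point in opposing directions, the conflicting component of one is removed by projecting it onto the other's orthogonal complement. This is a minimal illustration of gradient conflict resolution in general, not OrthAlign's actual orthogonal subspace decomposition; all variable names are illustrative.

```python
import numpy as np

def project_conflicting(g_a: np.ndarray, g_b: np.ndarray) -> np.ndarray:
    """If g_a conflicts with g_b (negative dot product), remove the component
    of g_a that points against g_b; otherwise return g_a unchanged."""
    dot = np.dot(g_a, g_b)
    if dot < 0:
        return g_a - (dot / (np.dot(g_b, g_b) + 1e-12)) * g_b
    return g_a

# Two conflicting objective gradients in a toy 3-d parameter space
# (names are hypothetical, for illustration only).
g_helpfulness = np.array([1.0, -2.0, 0.5])
g_harmlessness = np.array([-0.5, 1.0, 1.0])

# Combine the de-conflicted helpfulness gradient with the harmlessness gradient.
update = project_conflicting(g_helpfulness, g_harmlessness) + g_harmlessness
print(update)
```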
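The normalization idea can be illustrated in a few lines of NumPy: standardizing each reward dimension within a group of sampled responses keeps a reward with a large numeric scale from dominating the combined signal. This is a minimal sketch of per-reward standardization under equal weights, not MO-GRPO's exact formulation; the function name and toy reward values are hypothetical.

```python
import numpy as np

def normalize_rewards(reward_matrix: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize each reward dimension across a group of sampled responses.

    reward_matrix has shape (num_responses, num_rewards); each column is one
    reward function evaluated on every response in the group. Standardizing
    per column prevents any single reward's scale from dominating.
    """
    mean = reward_matrix.mean(axis=0, keepdims=True)
    std = reward_matrix.std(axis=0, keepdims=True)
    return (reward_matrix - mean) / (std + eps)

# Toy group of 4 responses scored by two rewards on very different scales.
rewards = np.array([
    [0.1, 120.0],
    [0.4,  80.0],
    [0.2, 200.0],
    [0.9,  50.0],
])
normalized = normalize_rewards(rewards)
combined = normalized.sum(axis=1)  # equal-weight combination after scaling
print(combined)
```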