The field of large language models (LLMs) is rapidly evolving, with a growing focus on ensuring these models align with human values and intentions. Recent research has explored a range of techniques for achieving this alignment, including preference-based methods, supervised fine-tuning, and direct preference optimization (DPO). These approaches aim to balance competing alignment objectives, such as faithful instruction-following and capturing nuanced human intent. Noteworthy papers in this area include PITA, which introduces a framework that integrates preference feedback directly into LLM token generation, eliminating the need for a pre-trained reward model, and MaPPO, which learns from preferences while explicitly incorporating prior reward knowledge into the optimization objective, improving alignment and mitigating the oversimplified binary classification of responses.
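To make the preference-optimization idea underlying these methods concrete, the sketch below shows the standard DPO objective that work in this area builds on or modifies: the policy is trained to assign a higher implicit reward to the preferred response than to the rejected one, relative to a frozen reference model. This is a minimal, generic illustration, not code from PITA or MaPPO; the function name, argument names, and the choice of beta are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model, shape (batch,). beta scales the
    implicit reward margin (a common default is 0.1; illustrative here).
    """
    # Implicit rewards: log-ratio of policy to reference probabilities.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style logistic loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example usage with dummy log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

Methods like MaPPO can be read as modifying this objective (for example, by injecting prior reward estimates rather than treating the pair as a purely binary label), while PITA moves the use of preference signals into the token-level generation process itself.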