Advances in Reinforcement Learning

The field of reinforcement learning is moving toward more efficient and stable methods for policy optimization and exploration. Recent work has focused on improving the accuracy of return estimates, mitigating estimation bias, and developing more robust algorithms for multi-objective decision-making. Notable directions include using behavior policies to collect off-policy data (sketched below), integrating flow-based generative models into actor-critic architectures, and constraining trajectory entropy within maximum-entropy reinforcement learning. These innovations have shown promising gains in sample efficiency, performance, and stability across a range of environments. Noteworthy papers include:

Behaviour Policy Optimization, which extends two policy-gradient methods with return estimates that have provably lower variance.

Mind Your Entropy, which proposes a trajectory entropy-constrained reinforcement learning framework to address challenges in maximum-entropy frameworks.

One-Step Generative Policies with Q-Learning, which introduces a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow.

Stabilizing Policy Gradient Methods via Reward Profiling, which proposes a universal reward-profiling framework that integrates seamlessly with any policy gradient algorithm.
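For context on the off-policy setting several of these works build on, below is a minimal sketch of the standard per-decision importance-sampling return estimate, in which a behavior policy mu gathers the data and a target policy pi is evaluated. The function name and toy numbers are illustrative assumptions; this is generic textbook machinery, not the lower-variance estimator constructed in Behaviour Policy Optimization.

```python
import numpy as np

def importance_sampled_return(rewards, target_logp, behavior_logp, gamma=0.99):
    """Per-decision importance-sampling estimate of the discounted return.

    rewards:       r_0 .. r_T collected under the behavior policy mu.
    target_logp:   log pi(a_t | s_t) for the target policy being evaluated.
    behavior_logp: log mu(a_t | s_t) for the behavior policy that acted.

    Each reward r_t is reweighted by the cumulative likelihood ratio
    prod_{k<=t} pi(a_k|s_k) / mu(a_k|s_k), which keeps the estimate
    unbiased for the target policy's return.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.exp(np.cumsum(np.asarray(target_logp) - np.asarray(behavior_logp)))
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * ratios * rewards))


# Toy usage: a 3-step trajectory where the target policy matches the behavior
# policy on the first action but is twice as likely to pick the later ones.
print(importance_sampled_return(
    rewards=[1.0, 0.0, 1.0],
    target_logp=np.log([0.5, 0.6, 0.6]),
    behavior_logp=np.log([0.5, 0.3, 0.3]),
))
```

The variance of such products of likelihood ratios is exactly what motivates optimizing the behavior policy itself, which is the concern of the first paper listed above.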

Sources

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Convergence of Flow-Policy Gradient Learning for Linear Quadratic Regulator Problems

Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL

Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow

An Online Multiobjective Policy Gradient for Long-run Average-reward Markov Decision Process

STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization

Limitations of Scalarisation in MORL: A Comparative Study in Discrete Environments

Stabilizing Policy Gradient Methods via Reward Profiling
