Advances in Offline-to-Online Reinforcement Learning

The field of reinforcement learning is moving toward more practical and sample-efficient methods for real-world deployment. Recent work has focused on offline-to-online reinforcement learning, which pretrains a policy on offline datasets and then fine-tunes it with online interaction. This approach has shown promise in addressing long-standing challenges of purely online reinforcement learning, such as the cost of extensive exploration and the risk of unsafe behavior during training. Notably, researchers have proposed methods to improve the stability and plasticity of offline-to-online reinforcement learning, including density-ratio weighted behavioral cloning and policy gradient guidance. These advances could enable more reliable and adaptive robotic navigation systems, as well as more efficient and effective reinforcement learning algorithms in general.
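As a rough illustration of the density-ratio weighted behavioral cloning idea mentioned above, here is a minimal sketch: each dataset transition gets a density-ratio weight (high for in-distribution data, low for corrupted or off-distribution data), and the behavioral-cloning loss is a weighted negative log-likelihood. The function names, the clipping constant, and the normalization scheme are illustrative assumptions, not the exact formulation from any of the papers listed below.

```python
def density_ratio_weights(ratios, clip=10.0):
    # Clip and normalize per-sample density ratios so that corrupted
    # (low-ratio) transitions contribute little to the cloning loss.
    # The clipping constant is an illustrative choice.
    clipped = [min(max(r, 0.0), clip) for r in ratios]
    total = sum(clipped)
    return [c / total for c in clipped] if total > 0 else [0.0] * len(clipped)


def weighted_bc_loss(policy_logprobs, ratios):
    # Density-ratio weighted behavioral cloning loss: the negative
    # log-likelihood of the dataset actions under the current policy,
    # weighted per sample by the normalized density ratio.
    weights = density_ratio_weights(ratios)
    return -sum(w * lp for w, lp in zip(weights, policy_logprobs))
```

In this sketch, a corrupted sample with ratio 0.01 receives roughly one hundredth the weight of a clean sample with ratio 1.0, so its (typically very poor) log-likelihood barely moves the loss.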

Noteworthy papers include Adaptive Policy Backbone via Shared Network, a meta-transfer RL method that enables parameter-efficient fine-tuning while preserving prior knowledge during adaptation; In-Context Compositional Q-Learning for Offline Reinforcement Learning, which formulates Q-learning as a contextual inference problem and achieves bounded Q-function approximation error; and Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption, which proposes a method termed RPEX to alleviate the heavy-tailedness induced by data corruption and achieves state-of-the-art offline-to-online performance.
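The policy-expansion family of methods that RPEX builds on can be sketched in a few lines: the frozen offline policy is kept in a candidate set alongside the learnable online policy, and at each step the critic picks whichever candidate action it values higher, with occasional random selection for exploration. This is a generic sketch of the policy-expansion pattern under assumed continuous actions, not RPEX's specific robustification; all names and the epsilon-greedy selection rule are illustrative.

```python
import random


def expanded_policy_action(offline_policy, online_policy, state, critic,
                           epsilon=0.1):
    # Policy-expansion sketch: both the frozen offline policy and the
    # learnable online policy propose an action for the current state,
    # and the critic selects the higher-valued one. With probability
    # epsilon a candidate is chosen uniformly at random for exploration.
    candidates = [offline_policy(state), online_policy(state)]
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda a: critic(state, a))
```

Because the offline policy stays in the candidate set unchanged, early online training cannot catastrophically degrade behavior: the critic can keep deferring to the pretrained policy until the online policy's actions actually score higher.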

Sources

Adaptive Policy Backbone via Shared Network

Rethinking Reward Miscalibration of GRPO in Agentic RL

In-Context Compositional Q-Learning for Offline Reinforcement Learning

Humanline: Online Alignment as Perceptual Loss

Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

Learning Distinguishable Representations in Deep Q-Networks for Linear Transfer

Accelerating Transformers in Online RL

Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning

Integrating Offline Pre-Training with Online Fine-Tuning: A Reinforcement Learning Approach for Robot Social Navigation

The Three Regimes of Offline-to-Online Reinforcement Learning

Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

Policy Gradient Guidance Enables Test Time Control
