Advances in Offline Reinforcement Learning

The field of offline reinforcement learning is moving toward addressing security risks and distributional shifts in pre-collected data. Researchers are exploring methods to quantify and mitigate these issues, including sequence-level measures of data-policy coverage and constraint-aware off-policy correction, with the aim of improving the robustness and reliability of offline reinforcement learning algorithms. Noteworthy papers in this area include:

  • A study on Collapsing Sequence-Level Data-Policy Coverage via Poisoning Attack, which introduces a poisoning attack that collapses sequence-level coverage and thereby exacerbates distributional shift.
  • A paper on Implicit Constraint-Aware Off-Policy Correction, which embeds structural priors directly inside every Bellman update so that the prescribed structure holds exactly at each iteration (a minimal projection-style sketch of this pattern follows the list).
  • A General Framework for Off-Policy Learning with Partially-Observed Reward, which proposes Hybrid Policy Optimization for Partially-Observed Reward (HyPeR) to exploit secondary rewards alongside a partially-observed target reward (a hedged sketch of such a hybrid objective follows the list).
  • CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization, which combines robust loss functions with advantage-based prioritized experience replay to down-weight corrupted or low-quality transitions (a sketch of this style of update follows the list).
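
The projection pattern behind implicit constraint-aware correction can be illustrated with a small fitted-Q loop. This is a minimal sketch under an assumed structure: the Q-function is required to lie in a fixed linear subspace spanned by feature columns Phi, and every Bellman target is orthogonally projected onto that subspace before the next iteration, so the structure holds exactly. The subspace choice, shapes, and function names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def project_onto_subspace(q, Phi):
    """Orthogonal projection of q onto span(Phi)."""
    coeffs, *_ = np.linalg.lstsq(Phi, q, rcond=None)
    return Phi @ coeffs

def projected_fitted_q(P, R, Phi, gamma=0.99, iters=200):
    """Tabular fitted-Q iteration with a projection step inside every update.

    P:   (S*A, S) transition matrix, rows ordered state-major (index s * A + a)
    R:   (S*A,)   expected rewards for each state-action pair
    Phi: (S*A, k) basis whose span encodes the prescribed structure (assumption)
    """
    S = P.shape[1]
    A = P.shape[0] // S
    q = np.zeros(S * A)
    for _ in range(iters):
        v = q.reshape(S, A).max(axis=1)         # greedy state values
        target = R + gamma * (P @ v)            # standard Bellman backup
        q = project_onto_subspace(target, Phi)  # enforce the structure exactly
    return q.reshape(S, A)
```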
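
For the partially-observed-reward setting, the general shape of a hybrid objective can be sketched as a convex combination of two inverse-propensity estimates: one built from the samples where the target reward was actually observed, and one built from the always-observed secondary reward. The fixed mixing coefficient lam, the clipping constant, and the plain IPW estimator are illustrative assumptions; HyPeR's actual estimator may differ.

```python
import torch

def hybrid_ipw_loss(policy_logp, behavior_prob, target_r, target_mask,
                    secondary_r, lam=0.5, clip=10.0):
    """policy_logp: log pi(a|x) of logged actions; behavior_prob: logging
    propensities pi0(a|x); target_mask: 1.0 where the target reward is observed."""
    w = torch.clamp(torch.exp(policy_logp) / behavior_prob, max=clip)  # clipped IPW weights
    # Value estimate from the partially-observed target reward, averaged over
    # the samples where it was actually recorded.
    n_obs = target_mask.sum().clamp(min=1.0)
    v_target = (target_mask * w * target_r).sum() / n_obs
    # Value estimate from the secondary reward, observed on every sample.
    v_secondary = (w * secondary_r).mean()
    # Minimise the negative of a convex combination of the two estimates.
    return -(lam * v_target + (1.0 - lam) * v_secondary)
```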
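
Finally, combining a robust loss with advantage weighting can be sketched as advantage-weighted regression in which the value function is fit with a Huber loss and the exponentiated advantages are clipped; advantage-based prioritisation is shown as a simple softmax over advantages. The network sizes, temperature beta, and clipping constant w_max are illustrative assumptions rather than CAWR's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obs):
        return self.v(obs).squeeze(-1)

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        dist = torch.distributions.Normal(self.mu(obs), self.log_std.exp())
        return dist.log_prob(act).sum(-1)

def robust_awr_losses(actor, critic, batch, beta=1.0, gamma=0.99, w_max=20.0):
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * critic(next_obs)
        advantage = target - critic(obs)
        weight = torch.clamp(torch.exp(advantage / beta), max=w_max)
    # Robust (Huber) regression: large TD errors from corrupted transitions
    # contribute linearly rather than quadratically to the value loss.
    value_loss = F.smooth_l1_loss(critic(obs), target)
    # Advantage-weighted behavioural cloning: imitate dataset actions in
    # proportion to their clipped, exponentiated advantage.
    policy_loss = -(weight * actor.log_prob(obs, act)).mean()
    return value_loss, policy_loss

def advantage_priorities(advantages, temperature=1.0):
    # Advantage-based prioritisation (illustrative): sampling probabilities
    # proportional to softmax(advantage / temperature), so low-advantage
    # transitions are drawn less often.
    return torch.softmax(advantages / temperature, dim=0)
```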

Sources

Collapsing Sequence-Level Data-Policy Coverage via Poisoning Attack in Offline Reinforcement Learning

Implicit Constraint-Aware Off-Policy Correction for Offline Reinforcement Learning

A General Framework for Off-Policy Learning with Partially-Observed Reward

CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization
