Research in offline reinforcement learning is increasingly focused on security risks and distributional shifts in pre-collected data. Recent work quantifies and mitigates these issues through techniques such as sequence-level data-policy coverage analysis and implicit constraint-aware off-policy correction, with the shared goal of making offline RL algorithms more robust and reliable. Noteworthy papers in this area include:
- A study on Collapsing Sequence-Level Data-Policy Coverage via Poisoning Attack, which introduces a poisoning attack that collapses sequence-level data-policy coverage and thereby exacerbates distributional shift.
- A paper on Implicit Constraint-Aware Off-Policy Correction, which embeds structural priors directly inside every Bellman update to enforce the prescribed structure exactly (see the first sketch after this list).
- A General Framework for Off-Policy Learning with Partially-Observed Reward, which proposes Hybrid Policy Optimization for Partially-Observed Reward (HyPeR) to exploit secondary rewards alongside the partially-observed target reward (see the second sketch after this list).
- CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization, which combines robust loss functions with advantage-based prioritized experience replay to filter out poor explorations (see the final sketch after this list).
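To make the constraint-aware Bellman update concrete, here is a minimal tabular sketch in which each backup is immediately followed by an exact projection of the updated Q-values onto a prescribed structure, taken here to be the column span of a known feature matrix `phi`. The choice of constraint, the projection step, and all function names are illustrative assumptions rather than the construction used in the paper.

```python
import numpy as np

def projection_matrix(phi):
    """Orthogonal projector onto the column span of the feature matrix phi."""
    return phi @ np.linalg.pinv(phi.T @ phi) @ phi.T

def constrained_q_iteration(transitions, num_states, num_actions, phi,
                            gamma=0.99, alpha=0.1, num_sweeps=50):
    """Tabular Q-iteration where every Bellman backup is followed by an exact
    projection of Q(s, .) onto the prescribed structure (here: span of phi)."""
    q = np.zeros((num_states, num_actions))
    proj = projection_matrix(phi)  # phi has shape (num_actions, num_features)
    for _ in range(num_sweeps):
        for s, a, r, s_next, done in transitions:
            target = r + (0.0 if done else gamma * q[s_next].max())
            q[s, a] += alpha * (target - q[s, a])
            # Enforce the structural prior exactly inside the update,
            # rather than penalizing violations through an auxiliary loss.
            q[s] = proj @ q[s]
    return q
```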
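The HyPeR idea of drawing on a secondary reward when the target reward is only partially observed can be illustrated with a simple importance-weighted value estimate that uses the target reward wherever it was logged and falls back to a scaled secondary reward elsewhere. The mixing rule, the field names, and the coefficient `lam` are assumptions made for illustration; the paper's actual estimator is not reproduced here.

```python
from typing import Dict, List

def hybrid_value_estimate(logged: List[Dict], lam: float = 0.5) -> float:
    """Off-policy value estimate from logged data with a partially-observed
    target reward and an always-observed secondary reward."""
    total = 0.0
    for x in logged:
        w = x["pi_prob"] / x["beh_prob"]   # importance weight pi(a|s) / mu(a|s)
        if x["r_target"] is not None:      # target reward was observed
            total += w * x["r_target"]
        else:                              # fall back to the secondary reward
            total += w * lam * x["r_second"]
    return total / len(logged)
```

For example, `logged = [{"pi_prob": 0.4, "beh_prob": 0.5, "r_target": 1.0, "r_second": 0.7}, {"pi_prob": 0.2, "beh_prob": 0.5, "r_target": None, "r_second": 0.3}]` mixes one record with an observed target reward and one without.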
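Finally, a CAWR-style policy update can be sketched as advantage-weighted regression in which the regression term uses a robust (Huber) per-sample loss and minibatches are drawn with advantage-based priorities, so low-advantage ("poor") transitions are rarely replayed. The specific loss, clipping values, and priority rule below are stand-ins assumed for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cawr_style_update(policy, optimizer, states, actions, advantages,
                      beta=1.0, batch_size=256):
    """One gradient step of robust, advantage-weighted regression."""
    adv = torch.as_tensor(advantages, dtype=torch.float32)
    # Advantage-based prioritized sampling: transitions with higher advantage
    # are replayed more often; near-zero priorities filter out poor ones.
    prio = torch.clamp(adv, min=0.0) + 1e-3
    idx = torch.multinomial(prio, num_samples=min(batch_size, len(prio)),
                            replacement=True)
    s = torch.as_tensor(states, dtype=torch.float32)[idx]
    a = torch.as_tensor(actions, dtype=torch.float32)[idx]
    w = torch.exp(adv[idx] / beta).clamp(max=20.0)   # AWR exponential weights

    pred = policy(s)                                  # deterministic action head
    # Robust per-sample (Huber) loss limits the influence of corrupted actions.
    per_sample = F.smooth_l1_loss(pred, a, reduction="none").sum(dim=-1)
    loss = (w * per_sample).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `policy` is any `torch.nn.Module` mapping state batches to action predictions, e.g. a small MLP, and `optimizer` is a standard optimizer over its parameters.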