Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization

Woosung Kim, Donghyeon Ki, Byung Jun Lee

Research output: Contribution to journal › Conference article › peer-review


One of the major challenges of offline reinforcement learning (RL) is dealing with the distribution shift that stems from the mismatch between the trained policy and the data-collection policy. Stationary distribution correction estimation (DICE) algorithms address this issue by regularizing policy optimization with an f-divergence between the state-action visitation distributions of the data-collection policy and the optimized policy. While this regularization integrates naturally into an objective whose solution is the optimal state-action visitation distribution, such implicit policy optimization frameworks have shown limited performance in practice. We observe that the reduced performance is attributable to biased estimates and to the properties of the conjugate function arising from the f-divergence regularization. In this paper, we improve the regularized implicit policy optimization framework by relieving the bias and reshaping the conjugate function through a relaxation of its constraints. We show that this relaxation adjusts the degree to which suboptimal samples are involved in optimization, and we derive a new offline RL algorithm that benefits from the relaxed framework, improving on a previous implicit policy optimization algorithm by a large margin.
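For context, the f-divergence-regularized objective that the abstract refers to typically takes the following form in the DICE literature (a minimal sketch in standard notation; the symbols $d^{\pi}$, $d^{D}$, $\alpha$, and $f$ are conventions from that literature, not details given in this abstract):

```latex
% Regularized policy optimization over state-action visitation distributions:
% maximize expected reward while penalizing the f-divergence between the
% visitation distribution d^pi of the optimized policy and d^D of the
% data-collection policy; alpha > 0 controls the regularization strength.
\max_{\pi} \; \mathbb{E}_{(s,a) \sim d^{\pi}}\!\left[ r(s,a) \right]
  - \alpha \, D_f\!\left( d^{\pi} \,\|\, d^{D} \right),
\qquad
D_f\!\left( d^{\pi} \,\|\, d^{D} \right)
  = \mathbb{E}_{(s,a) \sim d^{D}}\!\left[ f\!\left( \frac{d^{\pi}(s,a)}{d^{D}(s,a)} \right) \right].
```

Solving this objective in practice passes through the convex conjugate $f^{*}$ of $f$, which is the conjugate function whose properties and reshaping the abstract discusses.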

Original language: English
Pages (from-to): 13185-13192
Number of pages: 8
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Issue number: 12
Publication status: Published - 25 Mar 2024
Event: 38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 2024 - 27 Feb 2024

Bibliographical note

Publisher Copyright:
Copyright © 2024, Association for the Advancement of Artificial Intelligence. All rights reserved.

ASJC Scopus subject areas

  • Artificial Intelligence
