OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

  • Jongmin Lee*
  • , Wonseok Jeon*
  • , Byung Jun Lee
  • , Joelle Pineau
  • , Kee Eung Kim
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.

Original languageEnglish
Title of host publicationProceedings of the 38th International Conference on Machine Learning, ICML 2021
PublisherML Research Press
Pages6120-6130
Number of pages11
ISBN (Electronic)9781713845065
Publication statusPublished - 2021
Externally publishedYes
Event38th International Conference on Machine Learning, ICML 2021 - Virtual, Online
Duration: 2021 Jul 182021 Jul 24

Publication series

NameProceedings of Machine Learning Research
Volume139
ISSN (Electronic)2640-3498

Conference

Conference38th International Conference on Machine Learning, ICML 2021
CityVirtual, Online
Period21/7/1821/7/24

Bibliographical note

Publisher Copyright:
Copyright © 2021 by the author(s)

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation'. Together they form a unique fingerprint.

Cite this