Batch Reinforcement Learning with Hyperparameter Gradients

  • Byung Jun Lee*
  • Jongmin Lee
  • Peter Vrancx
  • Dongho Kim
  • Kee Eung Kim

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

1 Citation (Scopus)

Abstract

We consider the batch reinforcement learning problem, where the agent must learn only from a fixed batch of data, without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy, since the estimation otherwise becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly results in a policy that merely follows the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that performs gradient-based optimization of the hyperparameter using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks, by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.
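
The abstract describes a trade-off between staying close to the data collection policy and improving on it, with the regularization strength tuned by a hypergradient computed on held-out data. The snippet below is a minimal, hypothetical sketch of that idea, not the authors' BOPAH implementation: it assumes a softmax policy over a small discrete space, random placeholder advantage estimates for the training and held-out splits, a single KL-penalty coefficient alpha, and an unrolled inner loop so that alpha can be updated by gradient ascent on the held-out estimate. All names and numerical choices are illustrative.

```python
# Illustrative sketch only; assumptions (policy class, estimators, objective)
# are ours, not taken from the paper.
import torch

torch.manual_seed(0)
n_states, n_actions = 5, 3

# Behaviour (data-collection) policy and advantage estimates. In practice these
# would be estimated from logged transitions; here they are random placeholders
# so the sketch runs standalone.
behavior_logits = torch.randn(n_states, n_actions)
adv_train = torch.randn(n_states, n_actions)   # estimated on the training split
adv_valid = torch.randn(n_states, n_actions)   # estimated on the held-out split

log_alpha = torch.tensor(0.0, requires_grad=True)  # KL-penalty hyperparameter
outer_opt = torch.optim.Adam([log_alpha], lr=0.05)

def inner_objective(theta, alpha):
    """Policy improvement regularized toward the behaviour policy."""
    log_pi = torch.log_softmax(theta, dim=-1)
    pi = log_pi.exp()
    improvement = (pi * adv_train).sum()
    kl = (pi * (log_pi - torch.log_softmax(behavior_logits, dim=-1))).sum()
    return improvement - alpha * kl

def heldout_objective(theta):
    """Surrogate estimate of policy performance on the held-out split."""
    pi = torch.softmax(theta, dim=-1)
    return (pi * adv_valid).sum()

for outer_step in range(100):
    alpha = log_alpha.exp()
    theta = behavior_logits.clone().requires_grad_(True)  # start at behaviour policy
    # Inner loop: differentiable policy-improvement steps, so the resulting
    # policy stays a function of alpha (this is what enables a hypergradient).
    for _ in range(10):
        obj = inner_objective(theta, alpha)
        grad, = torch.autograd.grad(obj, theta, create_graph=True)
        theta = theta + 0.5 * grad
    # Outer step: adjust alpha by gradient ascent on the held-out objective.
    outer_loss = -heldout_objective(theta)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()

print(f"tuned KL-penalty coefficient: {log_alpha.exp().item():.3f}")
```

The key ingredient is the inner update with create_graph=True, which keeps the learned policy differentiable with respect to the hyperparameter; the paper's actual off-policy estimators and regularization scheme may differ from this simplified version.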

Original language: English
Journal: Proceedings of Machine Learning Research
Volume: 119
Publication status: Published - 2020
Externally published: Yes
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 2020 Jul 13 - 2020 Jul 18

Bibliographical note

Publisher Copyright:
© 2020 by the author(s).

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
