Batch reinforcement learning with hyperparameter gradients

Byung Jun Lee, Jongmin Lee, Peter Vrancx, Dongho Kim, Kee Eung Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

We consider the batch reinforcement learning problem where the agent needs to learn only from a fixed batch of data, without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy since the estimation becomes highly unstable otherwise due to the off-policy nature of the problem. However, imposing this requirement too strongly will result in a policy that merely follows the data collection policy. Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks, by finding a good balance to the trade-off between adhering to the data collection policy and pursuing the possible policy improvement.
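The trade-off described in the abstract can be illustrated in a far simpler setting than the one the paper addresses. The following is a minimal sketch and not the BOPAH algorithm itself. It assumes a single-state problem, a KL-regularized policy of the form pi(a) ∝ mu(a) exp(Q(a) / alpha), and action-value estimates that have already been computed separately from a training split and a held-out split of the batch. All function and variable names (`regularized_policy`, `tune_alpha`, `q_train`, `q_valid`) are hypothetical and chosen for illustration only.

```python
import math


def regularized_policy(mu, q, alpha):
    """Policy maximizing E_pi[q] - alpha * KL(pi || mu): pi(a) ∝ mu(a) * exp(q(a) / alpha)."""
    logits = [math.log(m) + qa / alpha for m, qa in zip(mu, q)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def heldout_value_and_grad(mu, q_train, q_valid, alpha):
    """Held-out value of the training-split policy and its derivative w.r.t. alpha.

    Differentiating the softmax form gives dJ/dalpha = -Cov_pi(q_train, q_valid) / alpha**2.
    """
    pi = regularized_policy(mu, q_train, alpha)
    value = sum(p * qv for p, qv in zip(pi, q_valid))
    mean_train = sum(p * qt for p, qt in zip(pi, q_train))
    cov = sum(p * (qt - mean_train) * qv for p, qt, qv in zip(pi, q_train, q_valid))
    return value, -cov / alpha ** 2


def tune_alpha(mu, q_train, q_valid, alpha=1.0, lr=0.1, steps=200):
    """Gradient ascent on log(alpha) so that alpha stays positive (an illustrative choice)."""
    log_alpha = math.log(alpha)
    lo, hi = math.log(1e-3), math.log(1e3)  # assumed bounds to keep the sketch stable
    for _ in range(steps):
        alpha = math.exp(log_alpha)
        _, grad = heldout_value_and_grad(mu, q_train, q_valid, alpha)
        log_alpha += lr * grad * alpha  # chain rule: dJ/dlog(alpha) = alpha * dJ/dalpha
        log_alpha = min(max(log_alpha, lo), hi)
    return math.exp(log_alpha)
```

In this toy setting, when the held-out estimates agree with the training estimates, the gradient pushes alpha down and the policy moves away from the data collection policy `mu`. When they disagree, alpha grows and the policy stays close to `mu`.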

Original language: English
Title of host publication: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Publisher: International Machine Learning Society (IMLS)
Pages: 5681-5691
Number of pages: 11
ISBN (Electronic): 9781713821120
Publication status: Published - 2020
Externally published: Yes
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 2020 Jul 13 → 2020 Jul 18

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020
Volume: PartF168147-8

Conference

Conference: 37th International Conference on Machine Learning, ICML 2020
City: Virtual, Online
Period: 20/7/13 → 20/7/18

Bibliographical note

Funding Information:
This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019R1A2C1087634 and NRF-2019M3F2A1072238), the Ministry of Science and Information Communication Technology (MSIT) of Korea (IITP No. 2020-0-00940, IITP 2019-0-00075 and IITP No. 2017-0-01779 XAI), and POSCO.

Publisher Copyright:
© International Conference on Machine Learning, ICML 2020. All rights reserved.

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software
