Abstract
We consider the batch reinforcement learning problem where the agent needs to learn only from a fixed batch of data, without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy since the estimation becomes highly unstable otherwise due to the off-policy nature of the problem. However, imposing this requirement too strongly will result in a policy that merely follows the data collection policy. Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks, by finding a good balance to the trade-off between adhering to the data collection policy and pursuing the possible policy improvement.
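The trade-off described in the abstract can be illustrated on a toy one-step problem. The sketch below is not the BOPAH algorithm itself, whose details are not given here. It assumes a KL penalty toward the data collection policy with coefficient `alpha`, action-value estimates `q_train` and `q_valid` computed from two splits of the batch, and a plain gradient-ascent update on `log(alpha)`. All function names, the step size, and the clipping bounds are hypothetical.

```python
import numpy as np

def kl_regularized_policy(pi_b, q, alpha):
    """Maximiser of E_pi[q] - alpha * KL(pi || pi_b) over distributions pi."""
    logits = np.log(pi_b) + q / alpha
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def heldout_value(pi_b, q_train, q_valid, alpha):
    """Value of the policy fitted on q_train, scored with held-out q_valid."""
    return float(kl_regularized_policy(pi_b, q_train, alpha) @ q_valid)

def heldout_value_grad(pi_b, q_train, q_valid, alpha):
    """d/d(alpha) of heldout_value, equal to -Cov_pi(q_train, q_valid) / alpha**2."""
    pi = kl_regularized_policy(pi_b, q_train, alpha)
    cov = pi @ (q_train * q_valid) - (pi @ q_train) * (pi @ q_valid)
    return float(-cov / alpha**2)

def tune_alpha(pi_b, q_train, q_valid, alpha=1.0, lr=0.1, steps=200):
    """Gradient ascent on log(alpha), which keeps alpha positive."""
    log_alpha = np.log(alpha)
    for _ in range(steps):
        a = np.exp(log_alpha)
        # Chain rule: dV/dlog(alpha) = alpha * dV/dalpha.
        log_alpha += lr * a * heldout_value_grad(pi_b, q_train, q_valid, a)
        # Hypothetical bounds that keep alpha in a numerically safe range.
        log_alpha = float(np.clip(log_alpha, -5.0, 5.0))
    return float(np.exp(log_alpha))

# Made-up numbers for illustration only.
pi_b = np.array([0.5, 0.3, 0.2])      # data collection policy
q_train = np.array([1.0, 2.0, 0.5])   # estimates from the training split
q_valid = np.array([1.2, 1.5, 0.1])   # estimates from the held-out split
alpha_star = tune_alpha(pi_b, q_train, q_valid)
```

A large `alpha` returns a policy close to `pi_b`, while a small `alpha` returns a nearly greedy policy with respect to `q_train`. The held-out gradient moves `alpha` toward the greedy end only as long as the two splits agree on which actions are good.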
Original language | English |
---|---|
Title of host publication | 37th International Conference on Machine Learning, ICML 2020 |
Editors | Hal Daume, Aarti Singh |
Publisher | International Machine Learning Society (IMLS) |
Pages | 5681-5691 |
Number of pages | 11 |
ISBN (Electronic) | 9781713821120 |
Publication status | Published - 2020 |
Externally published | Yes |
Event | 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online Duration: 2020 Jul 13 → 2020 Jul 18 |
Publication series
Name | 37th International Conference on Machine Learning, ICML 2020 |
---|---|
Volume | PartF168147-8 |
Conference
Conference | 37th International Conference on Machine Learning, ICML 2020 |
---|---|
City | Virtual, Online |
Period | 2020 Jul 13 → 2020 Jul 18 |
Bibliographical note
Funding Information: This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019R1A2C1087634 and NRF-2019M3F2A1072238), the Ministry of Science and Information communication Technology (MSIT) of Korea (IITP No. 2020-0-00940, IITP 2019-0-00075 and IITP No. 2017-0-01779 XAI), and POSCO.
Publisher Copyright:
© International Conference on Machine Learning, ICML 2020. All rights reserved.
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Human-Computer Interaction
- Software