
Two-level Q-learning: learning from conflict demonstrations

Published online by Cambridge University Press:  12 November 2019

Mao Li*
Affiliation:
Computer Science Department, University of York, York, United Kingdom e-mail: [email protected]
Yi Wei
Affiliation:
Computer Science Department, Shandong University, Jinan, China e-mail: [email protected]
Daniel Kudenko
Affiliation:
Computer Science Department, University of York, York, United Kingdom e-mail: [email protected]; JetBrains Research, St Petersburg, Russia e-mail: [email protected]

Abstract

One way to address the low sample efficiency of reinforcement learning (RL) is to employ human expert demonstrations to speed up the learning process (RL from demonstration, or RLfD). Research so far has focused on demonstrations from a single expert; little attention has been given to the case where demonstrations are collected from multiple experts whose expertise may vary across different aspects of the task. In such scenarios, the demonstrations are likely to contain conflicting advice in many parts of the state space. We propose a two-level Q-learning algorithm in which the RL agent not only learns a policy for choosing the optimal action but also learns to select the most trustworthy expert for the current state. Our approach thus removes the traditional assumption that demonstrations come from a single source and are largely conflict-free. We evaluate the technique on three different domains. The results show that the state-of-the-art RLfD baseline either fails to converge or performs no better than conventional Q-learning, whereas the performance of our algorithm increases as more experts are involved in the learning process, indicating that the proposed approach handles conflicting demonstrations well.
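To make the idea concrete, the following is a minimal sketch of how such a two-level scheme could look in Python: a high-level Q-table learns which advice source (one of several experts, or the agent's own greedy policy) to trust in each state, while a low-level Q-table learns the task policy with ordinary tabular Q-learning. This is only an illustration of the idea summarized above, not the authors' implementation; the environment interface (env.reset(), env.step()), the expert interface (a callable mapping a state to an action), and all hyper-parameter values are assumptions.

    # Hypothetical sketch of a two-level tabular Q-learning loop (not the paper's code).
    # Assumes discrete (hashable) states, an env with reset() -> state and
    # step(action) -> (next_state, reward, done), and experts given as callables
    # that map a state to a suggested action.
    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # illustrative hyper-parameters

    def two_level_q_learning(env, experts, n_actions, episodes=500):
        q_task = defaultdict(float)    # low level:  Q[(state, action)] for the task policy
        q_trust = defaultdict(float)   # high level: Q[(state, source)] for expert trust
        sources = list(range(len(experts))) + ["self"]   # experts plus the agent itself

        def suggest(source, state):
            # "self" means the agent follows its own greedy task policy.
            if source == "self":
                return max(range(n_actions), key=lambda a: q_task[(state, a)])
            return experts[source](state)

        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # High level: epsilon-greedy choice of which source to follow here.
                if random.random() < EPSILON:
                    source = random.choice(sources)
                else:
                    source = max(sources, key=lambda s: q_trust[(state, s)])
                action = suggest(source, state)

                next_state, reward, done = env.step(action)
                best_next = max(q_task[(next_state, a)] for a in range(n_actions))
                td_target = reward + (0.0 if done else GAMMA * best_next)

                # Low level: standard Q-learning backup for the task values.
                q_task[(state, action)] += ALPHA * (td_target - q_task[(state, action)])
                # High level: credit the chosen source with the same return, so trust
                # shifts toward experts whose advice pays off in this state.
                q_trust[(state, source)] += ALPHA * (td_target - q_trust[(state, source)])

                state = next_state
        return q_task, q_trust

One design choice worth noting in this sketch is that the agent's own greedy policy is treated as just another advice source, so the high-level table can learn to ignore all experts in regions of the state space where their demonstrations conflict or are unreliable.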

Type: Adaptive and Learning Agents
Copyright: © Cambridge University Press 2019

