
Two-level Q-learning: learning from conflict demonstrations

Published online by Cambridge University Press:  12 November 2019

Mao Li*
Affiliation:
Computer Science Department, University of York, York, United Kingdom e-mail: [email protected]
Yi Wei
Affiliation:
Computer Science Department, Shandong University, Jinan, China e-mail: [email protected]
Daniel Kudenko
Affiliation:
Computer Science Department, University of York, York, United Kingdom e-mail: [email protected]; JetBrains Research, St Petersburg, Russia e-mail: [email protected]

Abstract

One way to address the low sample efficiency of reinforcement learning (RL) is to employ human expert demonstrations to speed up the learning process (RL from demonstration, or RLfD). Research so far has focused on demonstrations from a single expert; little attention has been given to the case where demonstrations are collected from multiple experts whose expertise may vary across different aspects of the task. In such scenarios, the demonstrations are likely to contain conflicting advice in many parts of the state space. We propose a two-level Q-learning algorithm in which the RL agent not only learns a policy for choosing the optimal action but also learns to select the most trustworthy expert for the current state. Our approach thus removes the traditional assumption that demonstrations come from a single source and are largely conflict-free. We evaluate the technique on three different domains. The results show that the state-of-the-art RLfD baseline either fails to converge or performs no better than conventional Q-learning, whereas the performance of our algorithm increases as more experts are involved in the learning process, indicating that the proposed approach handles conflicting demonstrations well.
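To make the idea concrete, the following is a minimal sketch of how such a two-level scheme could look in Python: a high-level Q-table learns which advice source (one of several experts, or the agent's own greedy policy) to trust in each state, while a low-level Q-table learns the task policy with ordinary tabular Q-learning. This is only an illustration of the idea summarized above, not the authors' implementation; the environment interface (env.reset(), env.step()), the expert interface (a callable mapping a state to an action), and all hyper-parameter values are assumptions.

    # Hypothetical sketch of a two-level tabular Q-learning loop (not the paper's code).
    # Assumes discrete (hashable) states, an env with reset() -> state and
    # step(action) -> (next_state, reward, done), and experts given as callables
    # that map a state to a suggested action.
    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # illustrative hyper-parameters

    def two_level_q_learning(env, experts, n_actions, episodes=500):
        q_task = defaultdict(float)    # low level:  Q[(state, action)] for the task policy
        q_trust = defaultdict(float)   # high level: Q[(state, source)] for expert trust
        sources = list(range(len(experts))) + ["self"]   # experts plus the agent itself

        def suggest(source, state):
            # "self" means the agent follows its own greedy task policy.
            if source == "self":
                return max(range(n_actions), key=lambda a: q_task[(state, a)])
            return experts[source](state)

        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # High level: epsilon-greedy choice of which source to follow here.
                if random.random() < EPSILON:
                    source = random.choice(sources)
                else:
                    source = max(sources, key=lambda s: q_trust[(state, s)])
                action = suggest(source, state)

                next_state, reward, done = env.step(action)
                best_next = max(q_task[(next_state, a)] for a in range(n_actions))
                td_target = reward + (0.0 if done else GAMMA * best_next)

                # Low level: standard Q-learning backup for the task values.
                q_task[(state, action)] += ALPHA * (td_target - q_task[(state, action)])
                # High level: credit the chosen source with the same return, so trust
                # shifts toward experts whose advice pays off in this state.
                q_trust[(state, source)] += ALPHA * (td_target - q_trust[(state, source)])

                state = next_state
        return q_task, q_trust

One design choice worth noting in this sketch is that the agent's own greedy policy is treated as just another advice source, so the high-level table can learn to ignore all experts in regions of the state space where their demonstrations conflict or are unreliable.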

Type: Adaptive and Learning Agents
Copyright: © Cambridge University Press 2019

