
Automatic landmark discovery for learning agents under partial observability

Published online by Cambridge University Press:  02 August 2019

Alper Demir
Affiliation:
Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey
Erkin Çilden
Affiliation:
RF and Simulation Systems Directorate, STM Defense Technologies Engineering and Trade Inc., 06530 Ankara, Turkey
Faruk Polat
Affiliation:
Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey

Abstract

In the reinforcement learning context, a landmark is a compact piece of information that uniquely identifies a state in problems with hidden states. Landmarks have been shown to support finding good memoryless policies for Partially Observable Markov Decision Processes (POMDPs) that contain at least one landmark. SarsaLandmark, an adaptation of Sarsa(λ), is known to promise better learning performance under the assumption that all landmarks of the problem are known in advance.
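To make the landmark notion concrete, here is a minimal sketch that assumes a toy problem whose hidden states each emit one fixed observation. The function name `find_landmarks`, the `corridor` mapping, and the deterministic-sensing assumption are all hypothetical illustrations, not part of the paper. The sketch also reads the true state-to-observation mapping, which a learning agent under partial observability does not have.

```python
from collections import defaultdict

def find_landmarks(observation_of):
    """Return {observation: state} for observations emitted by exactly one state.

    `observation_of` is a hypothetical dict mapping each hidden state to the
    single observation perceived there (deterministic sensing is assumed).
    """
    states_by_obs = defaultdict(set)
    for state, obs in observation_of.items():
        states_by_obs[obs].add(state)
    # An observation is a landmark here if only one state can produce it.
    return {
        obs: next(iter(states))
        for obs, states in states_by_obs.items()
        if len(states) == 1
    }

# Toy corridor: the two end cells look distinct, the middle cells are aliased.
corridor = {0: "left_wall", 1: "hall", 2: "hall", 3: "hall", 4: "right_wall"}
landmarks = find_landmarks(corridor)
```

In this toy corridor the aliased observation `"hall"` is not a landmark, while the two wall observations each pin down a single state.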

In this paper, we propose a framework built upon SarsaLandmark that automatically identifies landmarks within the problem during learning, without sacrificing quality and without requiring prior information about the problem structure. For this purpose, the framework fuses SarsaLandmark with a well-known multiple-instance learning algorithm, namely Diverse Density (DD). Through further experimentation, we also provide deeper insight into our concept filtering heuristic for accelerating DD, abbreviated as DDCF (Diverse Density with Concept Filtering), which proves suitable for POMDPs with landmarks. DDCF outperforms its antecedent in terms of computation speed and solution quality without loss of generality.
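As a rough illustration of the Diverse Density idea, the sketch below scores discrete candidate concepts with the commonly described noisy-or formulation: a concept scores highly when it is likely present in every positive bag and absent from every negative bag. The abstract does not state how bags are built or labelled, nor how concept filtering works, so the bag contents, the function name `diverse_density`, and the probabilities `p_match` and `p_miss` are assumptions made only for this example. It is not the authors' DDCF.

```python
from math import prod

def diverse_density(concept, positive_bags, negative_bags,
                    p_match=0.9, p_miss=0.05):
    """Noisy-or diverse density of a discrete concept (illustrative sketch).

    p_match / p_miss are hypothetical probabilities that an instance belongs
    to the concept when it equals / differs from the concept.
    """
    def p_in(instance):
        return p_match if instance == concept else p_miss

    def p_none(bag):
        # Probability that no instance in the bag belongs to the concept.
        return prod(1.0 - p_in(x) for x in bag)

    positive_term = prod(1.0 - p_none(bag) for bag in positive_bags)
    negative_term = prod(p_none(bag) for bag in negative_bags)
    return positive_term * negative_term

# Hypothetical bags of observations: "door" occurs in every positive bag
# and in no negative bag.
positive_bags = [["a", "door", "b"], ["c", "door"], ["door", "b", "d"]]
negative_bags = [["a", "c"], ["b", "d"]]

candidates = sorted({x for bag in positive_bags for x in bag})
scores = {c: diverse_density(c, positive_bags, negative_bags) for c in candidates}
best = max(scores, key=scores.get)
```

Every candidate drawn from the positive bags is scored against all bags, so the cost grows with the number of candidates. Restricting the candidate set is one natural place for a speed-up heuristic to act, although the abstract does not say whether DDCF does this.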

The methods are empirically shown to be effective via extensive experimentation on a number of known and newly introduced problems with hidden state, and the results are discussed.

Type
Research Article
Copyright
© Cambridge University Press, 2019 

