Robot imitation from multimodal observation with unsupervised cross-modal representation

Xuanhui Xu; Mingyu You; Hongjun Zhou; Bin He

doi:10.1017/S0263574724000626

Robot imitation from multimodal observation with unsupervised cross-modal representation

Published online by Cambridge University Press: 08 November 2024

Xuanhui Xu

Mingyu You ,

Hongjun Zhou and

Bin He

Show author details

Xuanhui Xu: Affiliation:
College of Electronic and Information Engineering, Tongji University, ShangHai, China
Mingyu You*: Affiliation:
College of Electronic and Information Engineering, Tongji University, ShangHai, China National Key Laboratory of Autonomous Intelligent Unmanned Systems, Frontiers Science Center for Intelligent Autonomous Systems, Ministry of Education, Tongji University, ShangHai, China
Hongjun Zhou: Affiliation:
College of Electronic and Information Engineering, Tongji University, ShangHai, China
Bin He: Affiliation:
College of Electronic and Information Engineering, Tongji University, ShangHai, China National Key Laboratory of Autonomous Intelligent Unmanned Systems, Frontiers Science Center for Intelligent Autonomous Systems, Ministry of Education, Tongji University, ShangHai, China
*: Corresponding author: Mingyu You; Email: [email protected]

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

Imitation from Observation (IfO) prompts the robot to imitate tasks from unlabeled videos via reinforcement learning (RL). The performance of the IfO algorithm depends on its ability to extract task-relevant representations since images are informative. Existing IfO algorithms extract image representations by using a simple encoding network or pre-trained network. Due to the lack of action labels, it is challenging to design a supervised task-relevant proxy task to train the simple encoding network. Representations extracted by a pre-trained network such as Resnet are often task-irrelevant. In this article, we propose a new approach for robot IfO via multimodal observations. Different modalities describe the same information from different sides, which can be used to design an unsupervised proxy task. Our approach contains two modules: the unsupervised cross-modal representation (UCMR) module and a self-behavioral cloning (self-BC)-based RL module. The UCMR module learns to extract task-relevant representations via a multimodal unsupervised proxy task. The Self-BC for further offline policy optimization collects successful experiences during the RL training. We evaluate our approach on the real robot pouring water task, quantitative pouring task, and pouring sand task. The robot achieves state-of-the-art performance.

Keywords

Imitation from observation representation learning multimodal

Type: Research Article
Information: Robotica , Volume 42 , Issue 10 , October 2024 , pp. 3247 - 3262

DOI: https://doi.org/10.1017/S0263574724000626 [Opens in a new window]
Copyright: © The Author(s), 2024. Published by Cambridge University Press

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

Chen, Y., Zeng, C., Wang, Z., Lu, P. and Yang, C., “Zero-shot sim-to-real transfer of reinforcement learning framework for robotics manipulation with demonstration and force feedback,” Robotica 41(3), 1015–1024 (2023).CrossRef Google Scholar

Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E. A. and Boots, B., “Imitation learning for agile autonomous driving,” Int. J. Robot. Res. 39(2-3), 286–302 (2019).CrossRef Google Scholar

Hermann, L., Argus, M., Eitel, A., Amiranashvili, A., Burgard, W. and Brox, T.. “Adaptive Curriculum Generation from Demonstrations for Sim-to-Real Visuomotor Control,” IEEE International Conference on Robotics and Automation (ICRA), Paris, France (2020) pp. 6498–6505.Google Scholar

Torabi, F., Warnell, G. and Stone, P., “Generative adversarial imitation from observation,” arXiv preprint arXiv: 1807.06158 (2018).Google Scholar

Karnan, H., Warnell, G., Xiao, X. S. and Stone, P., “VOILA: Visual-Observation-Only Imitation Learning for Autonomous Navigation,” IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, USA (2022) pp. 2497–2503.Google Scholar

Shah, R. and Kumar, V., “RRL: Resnet as Representation for Reinforcement Learning,” 2021 In International Conference on Machine Learning (ICML), (2021) pp. 9465–9476.Google Scholar

Cole, E., Yang, X., Wilber, K., Aodha, O. M. and Belongie, S., “When Does Contrastive Visual Representation Learning Work?,” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA (2022) pp.14755–14764.Google Scholar

Saito, N., Ogata, T., Funabashi, S., Mori, H. and Sugano, S., “How to select and use tools? : Active perception of target objects using multimodal deep learning,” IEEE Robot. Autom. Lett. 6(2), 2517–2524 (2021).CrossRef Google Scholar

Zhang, D., Ju, R. and Cao, Z., “Reinforcement learning-based motion control for snake robots in complex environments,” Robotica 42(4), 947–961 (2024).CrossRef Google Scholar

Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S. and Levine, S., “Time-Contrastive Networks: Self-Supervised Learning from Video,” 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia (2018) pp.1134–1141.Google Scholar

Torabi, F., Warnell, G. and Stone, P., “Behavioral Cloning from Observation,” 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden (2018), pp. 4950–4957.Google Scholar

Cobbe, K., Klimov, O., Hesse, C., Kim, T. and Schulman, J., “Quantifying Generalization in Reinforcement Learning,” International Conference on Machine Learning (ICML), Long Beach, CA, USA (1289) pp. 1282–1289.Google Scholar

Liu, Y., Gupta, A., Abbeel, P. and Levine, S., “Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation,” 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, (1125) pp. 1118–1125.Google Scholar

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., “Generative adversarial networks,” Commun. Acm. 63(11), 139–144 (2020).CrossRef Google Scholar

Torabi, E., Geiger, S., Warnell, G. and Stone, P., “Sample-efficient adversarial imitation learning from observation,” J. Mach. Learn. Res. 25(31), 1–32 (2024).Google Scholar

Lee, M., Tan, M., Zhu, Y. and Bohg, J., “Detect, Reject, Correct: Crossmodal Compensation of Corrupted Sensors,” IEEE International Conference on Robotics and Automation (ICRA), Xian, China (2021) pp. 909–916.Google Scholar

Tremblay, J. F., Manderson, T., Noca, A., Dudek, G. and Meger, D., “Multimodal dynamics modeling for off-road autonomous vehicles,” 2021 IEEE International Conference on Robotics and Automation (ICRA), Xian, China (1802) pp. 1796–1802.Google Scholar

Marwan, Q. M., Chua, S. C. and Kwek, L. C., “Comprehensive review on reaching and grasping of objects in robotics,” Robotica 39(10), 1849–1882 (2021).CrossRef Google Scholar

Gangapurwala, S., Geisert, M., Orsolino, R., Fallon, M. and Havoutis, I., “Rloc: Terrain-aware legged locomotion using reinforcement learning and optimal control,” IEEE Trans. Robot. 38(5), 2908–2927 (2022).CrossRef Google Scholar

Brunke, L., Greeff, M., Hall, A. W., Yuan, Z. C., Zhou, S. Q., Panerati, J. and Schoellig, A. P., “Safe learning in robotics: From learning-based control to safe reinforcement learning,” Annu. Rev. Contr. Robot. Auton. Sys. 5(1), 411–444 (2022).CrossRef Google Scholar

Saha, P., Liu, Y., Gick, B. and Fels, S., “Ultra2Speech – A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images,” In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, Proceedings, Part III 23, Springer International Publishing (2020) pp. 473–482.Google Scholar

Chen, X. and He, K.. “Exploring Simple Siamese Representation Learning,” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 15750–15758.CrossRef Google Scholar

Yi, D., Ahn, J. and Ji, S., “An effective optimization method for machine learning based on ADAM,” Appl. Sci. 10(3), 1073 (2020).CrossRef Google Scholar

Gu, Y., Cheng, Y. H., Chen, C. L. P. and Wang, X. S., “Proximal policy optimization with policy feedback,” IEEE Trans. Sys. Man. Cyber Syst. 52(7), 4600–4610 (2021).CrossRef Google Scholar

Schulman, J., Levine, S., Moritz, P., Jordan, M. and Abbeel, P., “Trust Region Policy Optimization,” International Conference on Machine Learning (ICML), Lille, France (1897) pp. 1889–1897.Google Scholar

Li, T., Wu, Y., Cui, X., Dong, H., Fang, F. and Russell, S., “Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient,” Proc. Sym. Edu. Adva. Artifi. Intel. (AAAI) 33(01), 4213–4220 (2019).Google Scholar

Yuan, R., Zhang, F., Wang, Y., Fu, Y. and Wang, S., “A Q-learning approach based on human reasoning for navigation in a dynamic environment,” Robotica 37(3), 445–468 (2019).CrossRef Google Scholar

Ruder, S., “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv: 1609.04747, (2016).Google Scholar

Yu, B. and Tao, D., “Heatmap regression via randomized rounding,” IEEE Trans. Pattern Anal. 44(11), 8276–8289 (2021).CrossRef Google Scholar

van der Maaten, L. and Hinton, G., “Visualizing data using t-SNE,” J. Mach. Learn. Res. 9(11), 2579–2605 (2008).Google Scholar

Wang, C., Duan, H. and Li, L., “Design, simulation, control of a hybrid pouring robot: Enhancing automation level in the foundry industry,” Robotica 42(4), 1018–1038 (2024).Google Scholar