first commit
commit 7cec2d9ba0

202221090225_武博文_中期答辩.pptx  BIN  Normal file
Binary file not shown.

202221090225_武博文_开题.pptx  BIN  Normal file
Binary file not shown.

202221090225_武博文_开题报告表.docx  231  Normal file
| @ -0,0 +1,231 @@ | ||||
|                             电 子 科 技 大 学 | ||||
|                学术学位研究生学位论文开题报告表 | ||||
| 	攻读学位级别: □博士        ☑硕士 | ||||
| 	学科专业:   软件工程                 | ||||
| 	学        院:   信息与软件工程学院       | ||||
| 	学        号:   202221090225             | ||||
| 	姓        名:   武博文                   | ||||
| 	论文题目:   室外动态场景下基于实例    | ||||
| 	                 分割的视觉SLAM研究       | ||||
| 	指导教师:   王春雨                    | ||||
| 	填表日期:   2023  年  12  月  15  日 | ||||
|                         电子科技大学研究生院 | ||||
|               | ||||
|  学位论文研究内容 | ||||
|                                  课题类型:□基础研究    □应用基础研究    ☑应用研究 | ||||
|                                  课题来源:□纵向        □横向            ☑自拟 | ||||
|                                       学位论文研究内容 | ||||
| 学位论文的研究目标、研究内容及拟解决的关键性问题(可续页) | ||||
|  研究目标 | ||||
| 目前机器人SLAM算法主要分为激光SLAM和视觉SLAM,区别在于传感器分别是激光雷达和相机。随着移动机器人的普及以及应用场景的增多,激光SLAM由于激光雷达的高价格,难以应用在小电器以及低成本机器人上,而视觉SLAM凭借相机价格较低,体积较少,能够采集多维度信息等优势,逐渐成为目前SLAM算法中研究的主流方向。 | ||||
| 视觉同步定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)在机器人视觉感知领域中占有重要地位。最先进的V-SLAM算法提供了高精度定位和场景重建的能力[[1][]]。然而,它们大多忽略了动态对象所产生的不良影响。在这些研究中,环境被认为是完全静止的,这种强假设使得系统在复杂的动态环境中会产生严重的误差导致位姿估计误差较大,甚至导致定位失败。因此研究动态场景下的相机运动和物体运动是十分有必要的。 | ||||
| 拟在存在动态物体的室外场景下,使用相机作为传感器,研究如何区分真正移动的动态物体和潜在运动但是静止的物体,更好地利用静态特征点,提高相机运动估计的准确性和SLAM系统的鲁棒性。 | ||||
|      研究内容 | ||||
|   动态场景作为V-SLAM走向实际应用的一大阻碍,具有较大的难度和挑战性。也是许多学者研究的内容。本文拟研究在室外动态场景下如何识别动态物体,设计动态物体识别算法,将动态物体对相机位姿估计的影响降低,获得较为精准的相机位姿。在获得较为精准的相机位姿后,跟踪动态物体,建立动态物体跟踪集合,对新出现的物体和消失的物体记录。最后,将观测量,如相机位姿和物体位姿等传入后端,建立全局优化,根据优化后的地图点建立地图。 | ||||
|   针对如何识别室外动态物体的问题,研究深度学习和几何约束相结合的动态点判定方法,设计识别运动物体的算法,去除语义信息未包括的动点,正确恢复相机位姿。 | ||||
|   针对运动物体跟踪,研究在语义信息中的不同物体的跟踪方法,设计区分不同物体以及其运动,恢复运动物体的位姿。 | ||||
|   针对后端优化,研究应用动态物体信息的优化方法,同时优化相机位姿和物体位姿,得到更精确的相机位姿。 | ||||
|      拟解决的关键性问题 | ||||
|  动态物体判别问题 | ||||
|   动态物体判别是整个动态SLAM问题要解决的一个关键环节,其最终解决的效果好坏直接影响到相机位姿的估计和后端的建图效果。该问题解决的是在相机不同的两帧之间,物体的空间位置是否发生了移动的问题。只使用实例分割获得的语义信息只能判定已知语义的物体是动态物体,但是不能确定在当前图像物体是否真的发生了移动。同时在处理未知运动物体方面,语义信息会失效,需要结合几何信息设计一种算法来判定物体是否真正运动,以将动态物体的特征点与静态背景特征点做区分。 | ||||
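| 下面给出一个结合极线约束判定动点的示意代码(Python + OpenCV),仅为在若干假设下的草图,并非本文最终算法:假设 pts_prev、pts_curr 为两帧间已匹配好的特征点像素坐标,dyn_mask 为当前帧实例分割得到的二值掩码,阈值 thresh_px 为示例取值。
| import numpy as np
| import cv2
| 
| def epipolar_dynamic_check(pts_prev, pts_curr, dyn_mask, thresh_px=1.0):
|     """返回每个当前帧特征点是否违反静态极线约束(True 视为动点)。"""
|     pts_prev = np.asarray(pts_prev, dtype=np.float32)
|     pts_curr = np.asarray(pts_curr, dtype=np.float32)
|     # 1. 只用掩码外的匹配点(近似静态背景)通过 RANSAC 估计基础矩阵 F
|     in_mask = dyn_mask[pts_curr[:, 1].astype(int), pts_curr[:, 0].astype(int)] > 0
|     F, _ = cv2.findFundamentalMat(pts_prev[~in_mask], pts_curr[~in_mask],
|                                   cv2.FM_RANSAC, 1.0, 0.99)
|     # 2. 计算上一帧各点在当前帧中的极线,并求当前点到对应极线的距离
|     lines = cv2.computeCorrespondEpilines(pts_prev.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
|     a, b, c = lines[:, 0], lines[:, 1], lines[:, 2]
|     dist = np.abs(a * pts_curr[:, 0] + b * pts_curr[:, 1] + c) / np.sqrt(a ** 2 + b ** 2)
|     # 3. 距离超过阈值的点不满足静态场景的极线约束,判为动点;
|     #    掩码内但距离很小的点(如停靠的汽车)仍可保留为静点参与位姿估计
|     return dist > thresh_px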
|  动态物体跟踪问题 | ||||
|   对动态点的处理常常是在判定为动态点后,直接将其从特征点中去除,不再考虑这些特征点的意义。但是这些特征点也是地图中的点,对于动态物体存在跟踪的价值,因此研究动态物体所产生的特征点的存储和利用是关键点。在动态场景下,动态特征点非常可能不是来源一个物体,即在一个图像中可能存在多个动态物体,因此需要研究不同物体在不同帧间的关联关系,建立唯一的匹配,实现动态物体的分别跟踪。 | ||||
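| 针对上述不同物体在帧间的关联问题,下面给出一个基于掩码 IoU 与匈牙利算法的帧间物体关联示意代码(Python,数据结构与阈值均为假设,仅作思路说明):
| import numpy as np
| from scipy.optimize import linear_sum_assignment
| 
| def mask_iou(m1, m2):
|     inter = np.logical_and(m1, m2).sum()
|     union = np.logical_or(m1, m2).sum()
|     return inter / union if union > 0 else 0.0
| 
| def associate_masks(prev_masks, curr_masks, iou_thresh=0.3):
|     """prev_masks、curr_masks 为相邻两帧实例分割输出的二值掩码列表,
|     返回 (匹配对, 新出现物体下标, 消失物体下标),实现不同物体的分别跟踪。"""
|     if not prev_masks or not curr_masks:
|         return [], list(range(len(curr_masks))), list(range(len(prev_masks)))
|     iou = np.array([[mask_iou(p, c) for c in curr_masks] for p in prev_masks])
|     rows, cols = linear_sum_assignment(-iou)      # 匈牙利算法最大化总 IoU
|     matches = [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_thresh]
|     matched_prev = {r for r, _ in matches}
|     matched_curr = {c for _, c in matches}
|     new_objects = [c for c in range(len(curr_masks)) if c not in matched_curr]
|     lost_objects = [r for r in range(len(prev_masks)) if r not in matched_prev]
|     return matches, new_objects, lost_objects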
|  同步跟踪和优化问题 | ||||
|   在求解相机位姿后,跟踪动态物体的运动,获得运动物体的位姿,物体运动信息是预测得来的信息,可以经过局部优化或全局优化获得更精准的信息。但一般的优化只进行线性优化或者只对相机位姿优化,忽略了动态物体点的有效信息。因此拟设计一种优化的过程,确定优化变量,实现更准确的位姿估计,生成更准确的地图点,解决动态物体有效信息不完全利用的问题。 | ||||
| 
 | ||||
| 
 | ||||
|  学位论文研究依据 | ||||
| 学位论文的选题依据和研究意义,国内外研究现状和发展态势,主要参考文献,以及已有的工作积累和研究成果。(应有2000字) | ||||
|  选题依据和研究意义 | ||||
|   同步定位与地图构建(SLAM)是搭载激光雷达、IMU(Inertial Measurement Unit)、相机等传感器的移动载体在未知环境下同步进行定位与地图构建的过程[[][2][]]。SLAM一般可分为激光SLAM和视觉SLAM。激光SLAM利用激光雷达、编码器和惯性测量单元(IMU)等多种传感器相结合,已在理论和应用方面相对成熟。然而,激光雷达具有较高的价格使其难以普及到个人小型设备,并且雷达信息获取量有限。视觉SLAM利用视觉传感器,如单目、双目和RGB-D(带有深度信息的彩色图像)相机等,来构建环境地图。相机能够获取丰富的图像信息,并且视觉传感器具有低廉的价格,简单的结构和小巧便携的特点,因此成为近年来研究者们关注的热点,也成为SLAM技术中的主要研究方向。视觉SLAM能够广泛应用于无人驾驶,自主机器人,导盲避障等领域,对视觉SLAM的研究具有现实意义。 | ||||
|   经过近二十年的发展,视觉同时定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)框架已趋于成熟。现阶段,V-SLAM系统大多数建立在非动态环境的假设上,即假设移动载体在跟踪过程中不存在动态物体。然而,这种假设是一种强假设,在现实场景中很难成立。在室内场景中,常出现移动的人和桌椅等等;在室外场景中,常常出现移动的车和动物等等,这些动态物体的出现对V-SLAM系统的影响巨大,尤其是对V-SLAM中的前端模块的影响。SLAM前端求解存在两种方案,直接法和特征点法。直接法基于光度不变假设来描述像素随时间在图像之间的运动方式,每个像素在两帧之间的运动是一致的,通过此估计相机的运动。然而由于相机获得的图像受到光线,噪声等影响,光度不变假设往往不成立,如果再出现动态物体,直接使用此方法更会影响相机的运动估计。特征点法是一种间接的方法,它首先提取图像的特征点,然后通过两帧间特征点的匹配和位置变化求解相机运动。特征点的选择与使用大幅提高了V-SLAM系统定位的准确性,例如著名开源视觉SLAM框架ORB-SLAM2[[3]]、ORB-SLAM3[[4]]、VINS-Mono[[5]]都采用了特征点法。但是,一旦出现动态物体,这些特征点中会包含动态物体上的点,动态物体的移动造成了特征点移动的不一致性,从而对相机运动的估计造成了巨大影响。这种影响会导致后端失效,定位精度大幅减弱,不能忽视。随着视觉SLAM技术的发展,如何解决动态影响受到广泛关注,具有重要的研究价值。 | ||||
|  国内外研究现状和发展态势 | ||||
|   2.1视觉SLAM研究现状 | ||||
|   视觉SLAM问题最早可追溯到滤波技术的提出,Smith等人提出了采用状态估计理论的方法处理机器人在定位和建图等方面的问题[6]。随后出现各种基于滤波算法的SLAM系统,例如粒子滤波[7]和卡尔曼滤波[8]。2007年视觉SLAM取得重大突破,A. J. Davison等人提出第一个基于单目相机的视觉SLAM系统MonoSLAM[9]。该系统基于扩展卡尔曼滤波算法(Extended Kalman Filter, EKF),是首个达到实时效果的单目视觉SLAM系统,在此之前其他的算法都是对预先拍好的视频进行处理,无法做到同步。同年,Klein等人提出了PTAM(Parallel Tracking And Mapping)[10],创新地以并行线程的方式进行跟踪和建图,这种并行的方式也是当下SLAM框架的主流。PTAM应用了关键帧和非线性优化理论而非当时多数的滤波方案,为后续基于非线性优化的视觉SLAM开辟了道路。 | ||||
|   2014年慕尼黑工业大学计算机视觉组Jakob Engel等人[[][11][]]提出LSD-SLAM,该方案是一种基于直接法的单目视觉SLAM算法,不需要计算特征点,通过最小化光度误差进行图像像素信息的匹配,实现了效果不错的建图。该方案的出现证明了基于直接法的视觉SLAM系统的有效性,为后续的研究奠定了基础。同年SVO被Forster等人提出[[][12][]]。这是一种基于稀疏直接法的视觉SLAM方案,结合了特征点和直接法,使用了特征点,但是不计算特征点的描述子,特征点的匹配使用其周围像素利用直接法匹配。 | ||||
|   2015年Mur-Artal等人参考PTAM关键帧和并行线程的方案,提出了ORB-SLAM框架[[][13][]]。该框架是一种完全基于特征点法的单目视觉SLAM系统,包括了跟踪,建图和回环检测三个并行线程。最为经典的是该系统采用的ORB特征点,能实现提取速度和效果的平衡。但是其系统只适用于单目相机,精度低且应用场景受限。随着相机的进步,2017年Mur-Artal 等人对ORB-SLAM进行了改进,扩展了对双目和RGB-D相机的支持,提出ORB-SLAM2[[][3][]]。相比于原版,该系统支持三种相机,同时新增重定位,全局优化和地图复用等功能,更具鲁棒性。 | ||||
|   2017年,香港科技大学Qin Tong等人[[][1][4]]提出VINS Mono系统,该系统在单目相机中融合IMU传感器,在视觉信息短暂失效时可利用IMU估计位姿,视觉信息在优化时可以修正IMU数据的漂移,两者的结合表现出了优良的性能。2019年提出改进版系统VINS-Fusion[[][1][5][]],新增对双目相机和GPS传感器的支持,融合后的系统效果更优。 | ||||
|   2020年Carlos Campos等[[][4][]]提出了ORB-SLAM3,该系统在ORB-SLAM2的基础上,加入了对视觉惯性传感器融合的支持,并在社区开源。系统对算法的多个环节进行改进优化,例如加入了多地图系统和新的重定位模块,能够适应更多的场景,同时精度相比上一版增加2-3倍。在2021年底,系统更新了V1.0版本,继承了ORB-SLAM2的优良性能,成为现阶段最有代表性的视觉SLAM系统之一。 | ||||
|   2.2动态SLAM研究现状 | ||||
|   针对动态物体的影响,已经有许多研究人员开展了相关工作,尝试解决动态场景下的视觉SLAM问题。解决这一问题的主要挑战就是如何高效地检测到动态物体和其特征点,并将动态特征点剔除以恢复相机运动。 | ||||
|   最早的解决思路是根据几何约束来筛除动态物体的特征点,如WANG 等[[][1][6][]]首次使用 K-Means 将由 RGB-D相机计算的3D点聚类,并使用连续图像之间的极线约束计算区域中内点关键点数量的变化,内点数量较少的区域被认定是动态的。Fang[[][1][7][]]使用光流法检测图像之间的动态物体所在位置,对其特征点进行滤除。该方法利用光流提高检测的精度,有效地降低了帧之间极线约束的误差。尽管基于几何约束的方法可以在一定程度消除动态特征点的影响,但随着深度学习的发展,图像中语义信息逐渐被重视和利用起来。 | ||||
|   现阶段有许多优秀的深度学习网络,如YOLO[[1][8][]],SegNet[[1][9][]],Mask R-CNN[[][20][]]等等。这些神经网络有着强大的特征提取能力和语义信息提取能力,可以帮助SLAM系统更轻松地辨别出动态物体的存在,从而消除其影响。Fangwei Zhong等人提出的Detect-SLAM[[2][1][]],利用目标检测网络获取环境中的动态的人和车等,为了实时性,只在关键帧中进行目标检测,最后去除所有检测到的动态点来恢复相机位姿。LIU和MIURA[[][2][2][]]提出了RDS-SLAM。基于ORB-SLAM3[4]的RDS-SLAM框架使用模型的分割结果初始化移动对象的移动概率,将概率传播到随后的帧,以此来区分动静点。这种只基于深度学习的方法仅能提供图像中的语义信息,但无法判断图像中的物体是否真的在运动,比如静止的人或者路边停靠的汽车。若根据语义信息将其标记为动态物体后直接去除其特征点,这种方法会导致系统丢失有用的特征点,对相机的运动估计有所影响。因此仅利用深度学习不能很好解决动态物体对SLAM系统的影响。 | ||||
|   许多研究开始探索语义信息和几何信息的结合。例如清华大学Chao Yun等提出的DS-SLAM[[][2][3][]],该系统首先利用SegNet网络进行语义分割,再利用极线约束过滤移动的物体,达到了不错的效果。Berta Bescos等人首次利用Mask R-CNN网络进行实例分割,提出了DynaSLAM[[][2][4][]]。该系统结合基于多视几何深度的动态物体分割和区域生长算法,大幅降低了位姿估计的误差。 | ||||
|   利用深度学习得到的语义信息与几何信息相结合来解决SLAM中的动态场景问题渐渐成为一种主流,但是上述大多数系统只是为了恢复相机的位姿而剔除动态物体的特征点,而没有估计动态物体的位姿。同时估计相机运动和跟踪动态物体运动,并将动态物体的点加入优化步骤,正在发展为一种趋势。Javier Civera等人提出的DOT SLAM(Dynamic Object Tracking for Visual SLAM)[25]主要工作在前端,结合实例分割为动态对象生成掩码,通过最小化光度重投影误差跟踪物体。AirDOS由卡内基梅隆大学Yuheng Qiu等人提出[26],将刚性和运动约束引入以建模铰接物体,通过联合优化相机位姿、物体运动和物体三维结构来纠正相机位姿估计。VDO-SLAM[27]利用Mask R-CNN掩码和光流区分动静点,将动态环境下的SLAM表示为整体的图优化,同时估计相机位姿和物体位姿。 | ||||
|   总体来说,目前动态场景下的视觉SLAM问题的解决需要借助几何信息和深度学习的语义信息,语义信息提供更准确的物体,几何信息提供物体真实的运动状态,两者结合来估计相机运动和跟踪物体。 | ||||
|      主要参考文献 | ||||
|  J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611 - 625, Mar. 2016. | ||||
|  孔德磊, 方正. 基于事件的视觉传感器及其应用综述[J]. 信息与控制, 2021, 50(1): 1-19. KONG D L, FANG Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19. | ||||
|  Mur-Artal R , JD Tardós. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras[J]. IEEE Transactions on Robotics, 2017. | ||||
|  Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual - Inertial, and Multimap SLAM[J]. IEEE Transactions on Robotics: A publication of the IEEE Robotics and Automation Society, 2021, 37(6): 1874-1890. | ||||
|  Tong, Qin, Peiliang, et al. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2018. | ||||
|  Smith R, Self M, Cheeseman P. Estimating Uncertain Spatial Relationships in Robotics [J]. Machine Intelligence & Pattern Recognition, 1988, 5(5):435-461. | ||||
|  Grisetti G, Stachniss C, Burgard W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters [J]. IEEE Transactions on Robotics, 2007, 23(1):34-46. | ||||
|  Kalman R E. A New Approach To Linear Filtering and Prediction Problems[J]. Journal of Basic Engineering,  1960, 82D:35-45.DOI:10.1115/1.3662552. | ||||
|  Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-Time Single Camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6):1052-1067. | ||||
|  Klein G, Murray D. Parallel Tracking and Mapping for Small AR Workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007:1-10. | ||||
|  ENGEL J, SCHOPS T, CREMERS D, LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision(ECCV), 2014:834 - 849. | ||||
|  FORSTER C, PIZZOLI M, SCARAMUZZA D. SVO: Fast semi-direct monocular visual odometry[C]. Hong Kong, China: IEEE International Conference on Robotics and Automation (ICRA), 2014: 15-22. | ||||
|  MURARTAL R, MONTIEL J M, TARDOS J D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System[J]. IEEE Transactions on Robotics, 2015, 31(5):1147-1163. | ||||
|  TONG Q, PEILIANG L, SHAOJIE S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2017,99:1-17. | ||||
|  QIN T, PAN J, CAO S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. ArXiv, 2019:1901.03638. | ||||
|  WANG R, WAN W, WANG Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143. | ||||
|  Fang Y, Dai B. An improved moving target detecting and tracking based on Optical Flow technique and Kalman filter[J]. IEEE, 2009.DOI:10.1109/ICCSE.2009.5228464. | ||||
|  Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[C]//Computer Vision & Pattern Recognition. IEEE, 2016.DOI:10.1109/CVPR.2016.91. | ||||
|  Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. | ||||
|  Gkioxari G, He K, Piotr Dollár, et al. Mask R-CNN [J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 42(2): 386-397. | ||||
|  Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.DOI:10.1109/WACV.2018.00115. | ||||
|  LIU Y, MIURA J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785. | ||||
|  C. Yu, et al. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments[A]. //2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168-1174. | ||||
|  B. Bescos, J. M. Fácil, J. Civera and J. Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4)4076-4083. | ||||
|  Ballester I, Fontan A, Civera J, et al. DOT: Dynamic Object Tracking for Visual SLAM[J].  2020.DOI:10.48550/arXiv.2010.00052. | ||||
|  Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. | ||||
|  Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A Visual Dynamic Object-aware SLAM System[J].  2020.DOI:10.48550/arXiv.2005.11052. | ||||
|  高翔, 张涛等. 视觉SLAM十四讲[M]. 第二版. 北京:电子工业出版社, 2019. | ||||
|  已有的工作积累和研究成果 | ||||
|  工作积累 | ||||
| 研究生期间的学习主要以《视觉SLAM十四讲》[28]为主,阅读了大量SLAM相关文献,并在虚拟机环境下测试过ORB-SLAM2、VDO-SLAM等多种框架在公开数据集KITTI序列上的性能。掌握了框架的主要函数,可以通过编程实现环境的搭建和算法的编写与测试。 | ||||
|  研究成果 | ||||
| 暂无 | ||||
|  学位论文研究计划及预期目标 | ||||
| 1.拟采取的主要理论、研究方法、技术路线和实施方案(可续页) | ||||
| 1.1 主要理论和研究方法 | ||||
| 一个典型的视觉SLAM系统一般可以分为五个子模块,包括传感器,前端,后端优化,回环检测和建图。如图3-1所示。 | ||||
|                                         | ||||
|                              图3-1 SLAM模块图 | ||||
| 对于视觉SLAM而言,传感器为相机,前端又称为视觉里程计,主要根据相机信息估计相邻两个时刻内的运动(即位姿变化)。后端优化位姿,回环检测是检测相机是否经过相同的场景,与建图有着密切的联系。本文的主要工作集中在前端和后端。在光照变化不明显,没有动态物体的场景下,SLAM基本模块已经很完善。要解决动态场景下的问题,需要在此模块的基础上,结合深度学习模型来实现语义级别的SLAM。 | ||||
| 在后端优化方面,本文基于因子图优化。因子图是应用贝叶斯定律的估计模型:给定观测Z求解状态X的概率,即后验概率P(X|Z),它正比于似然P(Z|X)与先验P(X)的乘积,如公式(1)所示。 | ||||
| $P(X|Z)=\dfrac{P(Z|X)P(X)}{P(Z)}=k\,P(Z|X)P(X)$   (1) | ||||
| 贝叶斯定律左侧称为后验概率,右侧的P(Z|X)称为似然,P(X)称为先验。直接求后验分布是困难的,但求一个状态最优估计,使得在该状态下后验概率最大化是可行的,如公式(2)所示。因此求解最大后验概率,等价于求解最大化似然和先验的乘积。 | ||||
| $X^{*}=\arg\max P(X|Z)=\arg\max P(Z|X)P(X)$   (2) | ||||
| 求解最大似然估计时,考虑观测数据的条件概率满足高斯分布,可以通过最小化负对数来求高斯分布的最大似然,这样就可以得到一个最小二乘问题,如公式(3)所示,它的解等价于状态的最大似然估计。其中公式(3)中的f(x)为噪声符合高斯分布时状态X的误差项。 | ||||
| $X^{*}=\arg\max P(X|Z)=\arg\max \log P(X|Z)=\arg\min \|f(x)\|_{2}^{2}$   (3) | ||||
| 在SLAM问题中,每一个观测变量在贝叶斯网络中都是相互独立的,因此所有条件概率是乘积的形式,且可分解,对应于因子图中的每一项。因子图包含节点和边,节点为状态变量节点,表示待估计的变量,如位姿,3D点等。边为误差项,即因子,表示变量之间的误差项。因子图还会包含一个先验因子,用来固定系统的解,以满足可解。因子图的求解就是使得所有因子的乘积最大化的状态量,该步骤可转化为最小二乘问题,最终解得的系统状态是在概率上最可能的系统状态。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | ||||
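| 为直观说明上述"最大后验等价于最小化残差加权平方和"的推导,下面给出一个一维线性高斯情形的小例子(示意代码,先验与观测数值均为假设),其最小二乘解即信息加权平均:
| import numpy as np
| 
| # 先验因子: x ~ N(0, 2^2);两个观测因子: z_i = x + n_i, n_i ~ N(0, 1^2)
| prior_mean, prior_sigma = 0.0, 2.0
| z, obs_sigma = np.array([1.2, 0.8]), 1.0
| 
| def neg_log_posterior(x):
|     """负对数后验(略去常数项)= 各因子残差的加权平方和"""
|     r_prior = (x - prior_mean) / prior_sigma
|     r_obs = (z - x) / obs_sigma
|     return r_prior ** 2 + np.sum(r_obs ** 2)
| 
| # 线性高斯问题的最小二乘有闭式解:信息(方差倒数)加权平均
| info = 1 / prior_sigma ** 2 + len(z) / obs_sigma ** 2
| x_map = (prior_mean / prior_sigma ** 2 + z.sum() / obs_sigma ** 2) / info
| print(x_map, neg_log_posterior(x_map))   # x_map ≈ 0.889,即负对数后验的最小点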
| 在研究时,从主要理论出发,阅读大量室外动态场景下的视觉SLAM文献,对文献总结和理解,学习方法的异同,优化自己的算法。从实践出发,多写代码尝试不同的算法,测试算法性能,通过实验得到良好的解决方案。 | ||||
|   1.2 技术路线和实施方案 | ||||
| 本文预计的技术路线和实施方案如图3-2所示: | ||||
|                       图3-2 技术路线和实施方案 | ||||
| 在室外动态场景下基于实例分割的SLAM算法,首先需要解决深度学习模型的数据预处理,然后应用得到的语义信息和几何约束设计算法来实现动静点判定。根据静点估计相机的运动,根据动点估计运动物体的运动,不同的运动物体分别跟踪。最终研究相机位姿、运动物体位姿和地图点的全局优化,实现建图。 | ||||
| 本文预计的详细技术路线和实施方案如下: | ||||
|  基于实例分割和聚类的动态物体判别方法 | ||||
| 在室外动态场景下,提出一种基于实例分割和超像素聚类的动态物体识别算法。通过实例分割得到物体掩码,将掩码内的点作为动点候选点,通过特征提取的点与动点候选点做差,得到静点候选点。静点候选点通过聚类后重投影到前一帧,计算点误差,提出一种基于误差比的动点判断方法,解决语义未知的动态物体判定问题。对于语义已知的掩码物体,同样使用该方法判定是否真的在运动。研究思路如图3-3所示。 | ||||
|          图3-3 基于实例分割和聚类的动态物体判别方法 | ||||
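| 按上述误差比思路,下面给出一个动态物体判定的示意代码(Python,输入与阈值均为假设,并非最终实现):假设已用估计的相机位姿把各特征点重投影到上一帧并得到误差 err,labels 为各点所属的实例编号(-1 表示背景静点候选)。
| import numpy as np
| 
| def judge_dynamic_instances(err, labels, ratio_thresh=2.0):
|     """返回 {实例编号: 是否判定为真正运动的动态物体}。"""
|     err = np.asarray(err, dtype=float)
|     labels = np.asarray(labels)
|     if not np.any(labels == -1):                     # 没有背景点时无法给出基准
|         return {}
|     bg_err = np.median(err[labels == -1])            # 静点候选的误差基准
|     result = {}
|     for obj_id in np.unique(labels):
|         if obj_id == -1:
|             continue
|         obj_err = np.median(err[labels == obj_id])
|         # 掩码内点的误差显著大于背景基准时,认为该实例在真实运动
|         result[int(obj_id)] = bool(obj_err > ratio_thresh * bg_err)
|     return result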
|  依赖掩码内动点集合的动态物体跟踪方法 | ||||
| 研究具有掩码的动态物体的运动,提出一种在室外场景下,全局的动态物体跟踪方法。首先通过掩码,稠密地提取像素点,每隔2个点取一个点,以保证物体跟踪时特征点的数量。再通过运动判定和语义标签得到真的在运动的物体,设计一个存储集合来管理这些物体像素点,同时利用提取的像素点估计不同物体的位姿,物体位姿的求解建立在刚体假设之上。研究思路如图3-4所示。 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
|                         图3-4 动态物体跟踪方法 | ||||
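| 下面给出掩码内按步长取点以及刚体假设下由两帧3D点对求物体运动的示意代码(Python,仅为思路草图,点的匹配与深度恢复过程从略):
| import numpy as np
| 
| def sample_mask_pixels(mask, stride=2):
|     """在二值掩码内每隔 stride 个像素取一个点,返回像素坐标 (u, v)。"""
|     sub = np.zeros_like(mask)
|     sub[::stride, ::stride] = mask[::stride, ::stride]
|     vs, us = np.nonzero(sub)
|     return np.stack([us, vs], axis=1)
| 
| def estimate_rigid_motion(P_prev, P_curr):
|     """P_prev、P_curr 为同一物体两帧间匹配好的3D点 (N, 3),
|     用 SVD(Kabsch)求使 ||R*P_prev + t - P_curr|| 最小的旋转 R 和平移 t。"""
|     c_prev, c_curr = P_prev.mean(axis=0), P_curr.mean(axis=0)
|     H = (P_prev - c_prev).T @ (P_curr - c_curr)
|     U, _, Vt = np.linalg.svd(H)
|     R = Vt.T @ U.T
|     if np.linalg.det(R) < 0:                  # 处理反射解
|         Vt[-1, :] *= -1
|         R = Vt.T @ U.T
|     t = c_curr - R @ c_prev
|     return R, t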
|  因子图优化方法 | ||||
| 研究基于因子图的相机位姿和物体位姿优化,该方法将动态SLAM问题表示为一个图优化问题,以构建全局一致的地图。因子图中的变量节点表示待估计求解的状态变量,如相机位姿、物体位姿和地图点;因子节点表示由观测得到的约束,连接相关的变量节点。拟设计的因子图如图3-5所示。 | ||||
|                                图3-5 因子图 | ||||
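| 下面给出一个用 GTSAM 的 Python 绑定联合优化相机位姿与物体位姿的极简因子图示意代码(假设已安装 gtsam;键值约定、因子数值与噪声模型均为示例,并非图3-5的最终设计):
| import numpy as np
| import gtsam
| 
| graph = gtsam.NonlinearFactorGraph()
| noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1] * 6))
| 
| # 键值约定(示例):0、1 为两帧相机位姿,10、11 为同一动态物体在两帧的位姿
| X0, X1, O0, O1 = 0, 1, 10, 11
| 
| # 先验因子固定第一帧相机,保证系统可解
| graph.add(gtsam.PriorFactorPose3(X0, gtsam.Pose3(), noise))
| # 视觉里程计因子:相机两帧间的相对运动观测
| graph.add(gtsam.BetweenFactorPose3(X0, X1, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(1.0, 0.0, 0.0)), noise))
| # 物体观测因子:每帧中相机到物体的相对位姿(由掩码内动点估计)
| graph.add(gtsam.BetweenFactorPose3(X0, O0, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.0, 2.0, 5.0)), noise))
| graph.add(gtsam.BetweenFactorPose3(X1, O1, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.0, 2.0, 4.5)), noise))
| # 物体运动因子:刚体假设下物体两帧间的相对运动约束
| graph.add(gtsam.BetweenFactorPose3(O0, O1, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(1.0, 0.0, -0.5)), noise))
| 
| # 初值可以较粗糙,由 Levenberg-Marquardt 迭代得到最大后验估计
| initial = gtsam.Values()
| for key in (X0, X1, O0, O1):
|     initial.insert(key, gtsam.Pose3())
| result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
| print(result.atPose3(X1))   # 优化后的第二帧相机位姿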
| 2.研究计划可行性,研究条件落实情况,可能存在的问题及解决办法(可续页) | ||||
| 2.1 可行性分析 | ||||
|   得益于视觉SLAM的逐渐发展,动态物体问题已经有了不少解决思路,尤其是前端部分的研究更多,每年都有一定的论文产出,可作为参考。其次,随着深度学习的模型逐渐完善,实例分割技术和光流检测等技术也能有比较好的效果,对动态SLAM问题的解决有所助益。因此,在理论上和实践上,本论文的研究方向均具有可行性。 | ||||
| 2.2 研究条件 | ||||
|   (1) 教研室的科研氛围,指导老师和教研室老师们的意见,师兄们的帮助。教研室已经发了不少相关论文和专利; | ||||
|   (2) 教研室完备的硬件环境,服务器,移动小车和各种摄像头等硬件设施; | ||||
|   (3) 研究内容相关的论文和书籍,有足够的理论基础支撑研究; | ||||
| 2.3 可能存在的问题及解决办法 | ||||
|   (1) 全局优化的结果不如原始数据 | ||||
|   在将预测值进行全局优化时,不确定预测值的误差大小,会导致一些误差较大的预测值加入全局优化,使得优化后的效果不如原始数据。针对这样的问题,首先考虑优化对象的选择,增加或删除优化值,为了更准确的效果。其次考虑在加入优化前对预测值做处理,比如绝对阈值处理或相对阈值。 | ||||
|   (2) 实施方案未能达到较好的效果 | ||||
|   若出现这样的问题,则需要和导师师兄交流,讨论原因做好记录,找到问题所在,并根据实际情况调整技术路线,设计新的方案来达到效果。 | ||||
|    | ||||
| 
 | ||||
| 3.研究计划及预期成果 | ||||
|                                       研究计划 | ||||
| 起止年月             完成内容 | ||||
| 2023.12-2024.02      研究动态物体判别方法 | ||||
| 2024.02-2024.04      研究动态物体跟踪方法 | ||||
| 2024.04-2024.06      研究包含动态物体的局部优化和全局优化 | ||||
| 2024.06-2024.08      验证地图精度指标,改进算法 | ||||
| 2024.08-2024.11      测试数据集,做实验 | ||||
| 2024.11-2025.03      撰写硕士学位论文 | ||||
|                                       预期创新点及成果形式 | ||||
|  预期创新点 | ||||
|  设计基于实例分割和聚类的动态物体判别方法 | ||||
|  提出基于掩码的动态物体同步跟踪方法 | ||||
|  设计因子图,实现更优的全局优化 | ||||
|    | ||||
|  成果形式 | ||||
|  学术论文 | ||||
|   发表一篇学术论文 | ||||
|  专利 | ||||
|   申请发明专利1-2项 | ||||
|  论文 | ||||
|   撰写硕士学位论文1篇 | ||||
| 
 | ||||
|  开题报告审查意见 | ||||
| 1.导师对学位论文选题和论文计划可行性意见,是否同意开题: | ||||
| 
 | ||||
| 导师(组)签字:                                                          年     月     日 | ||||
| 2.开题报告考评组意见 | ||||
|                                  开题日期 | ||||
|                                         | ||||
|                                  开题地点 | ||||
| 
 | ||||
|                                  考评专家 | ||||
| 
 | ||||
|                                  考评成绩 | ||||
| 合格    票       基本合格    票       不合格    票 | ||||
|                                   结    论 | ||||
| □通过            □原则通过           □不通过  | ||||
| 通过:表决票均为合格 | ||||
| 原则通过:表决票中有1票为基本合格或不合格,其余为合格和基本合格 | ||||
| 不通过:表决票中有2票及以上为不合格 | ||||
| 考评组对学位论文的选题、研究计划及方案实施的可行性的意见和建议: | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 考评组签名:                    | ||||
|                                                             年     月     日 | ||||
| 3.学院意见: | ||||
| 
 | ||||
| 负责人签名:                             年     月     日 | ||||
|                                         | ||||
							
								
								
									
202221090225_武博文_文献综述.docx  65  Normal file
| @ -0,0 +1,65 @@ | ||||
|      电子科技大学学术学位硕士研究生学位论文文献综述 | ||||
| 姓名:武博文 | ||||
|  学号:202221090225 | ||||
|  学科:软件工程 | ||||
| 综述题目:室外动态场景下基于实例分割的视觉SLAM研究   | ||||
|                                                                                 | ||||
| 导师意见: | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 导师签字: | ||||
| 日期: | ||||
| 
 | ||||
|   选题依据和研究意义 | ||||
| 同步定位与地图构建(SLAM)是搭载激光雷达、IMU(Inertial Measurement Unit)、相机等传感器的移动载体在未知环境下同步进行定位与地图构建的过程[[][1][]]。SLAM一般可分为激光SLAM和视觉SLAM。激光SLAM利用激光雷达、编码器和惯性测量单元(IMU)等多种传感器相结合,已在理论和应用方面相对成熟。然而,激光雷达具有较高的价格使其难以普及到个人小型设备,并且雷达信息获取量有限。视觉SLAM利用视觉传感器,如单目、双目和RGB-D(带有深度信息的彩色图像)相机等,来构建环境地图。相机能够获取丰富的图像信息,并且视觉传感器具有低廉的价格,简单的结构和小巧便携的特点,因此成为近年来研究者们关注的热点,也成为SLAM技术中的主要研究方向。视觉SLAM能够广泛应用于无人驾驶,自主机器人,导盲避障等领域,对视觉SLAM的研究具有现实意义。 | ||||
| 经过近二十年的发展,视觉同时定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)框架已趋于成熟,在机器人视觉感知领域中占有重要地位,最先进的V-SLAM算法提供了高精度定位和场景重建的能力[[][2][]]。现阶段,V-SLAM系统大多数建立在非动态环境的假设上,即假设移动载体在跟踪过程中不存在动态物体。然而,这种假设是一种强假设,在现实场景中很难成立。在室内场景中,常出现移动的人和桌椅等等;在室外场景中,常常出现移动的车和动物等等,这些动态物体的出现对V-SLAM系统的影响巨大,尤其是对V-SLAM中的前端模块的影响。SLAM前端求解存在两种方案,直接法和特征点法。直接法基于光度不变假设来描述像素随时间在图像之间的运动方式,每个像素在两帧之间的运动是一致的,通过此估计相机的运动。然而由于相机获得的图像受到光线,噪声等影响,光度不变假设往往不成立,如果再出现动态物体,直接使用此方法更会影响相机的运动估计。特征点法是一种间接的方法,它首先提取图像的特征点,然后通过两帧间特征点的匹配和位置变化求解相机运动。特征点的选择与使用大幅提高了V-SLAM系统定位的准确性,例如著名开源视觉SLAM框架ORB-SLAM2[[3]]、ORB-SLAM3[[4]]、VINS-Mono[[5]]都采用了特征点法。但是,一旦出现动态物体,这些特征点中会包含动态物体上的点,动态物体的移动造成了特征点移动的不一致性,从而对相机运动的估计造成了巨大影响。这种影响会导致后端失效,定位精度大幅减弱,不能忽视。随着视觉SLAM技术的发展,如何解决动态影响受到广泛关注,具有重要的研究价值。 | ||||
|  国内外研究现状和发展态势 | ||||
| 2.1 视觉SLAM研究现状 | ||||
| 视觉SLAM问题最早可追溯到滤波技术的提出,Smith等人提出了采用状态估计理论的方法处理机器人在定位和建图等方面的问题[6]。随后出现各种基于滤波算法的SLAM系统,例如粒子滤波[7]和卡尔曼滤波[8]。2007年视觉SLAM取得重大突破,A. J. Davison等人提出第一个基于单目相机的视觉SLAM系统MonoSLAM[9]。该系统基于扩展卡尔曼滤波算法(Extended Kalman Filter, EKF),是首个达到实时效果的单目视觉SLAM系统,在此之前其他的算法都是对预先拍好的视频进行处理,无法做到同步。MonoSLAM的发布标志着视觉SLAM的研究从理论层面转到了实际应用,具有里程碑式意义。同年,Klein等人提出了PTAM(Parallel Tracking And Mapping)[10],创新地以并行线程的方式进行跟踪和建图,解决了MonoSLAM计算复杂度高的问题,这种并行的方式也是当下SLAM框架的主流。PTAM应用了关键帧和非线性优化理论而非当时多数的滤波方案,为后续基于非线性优化的视觉SLAM开辟了道路。 | ||||
| 2014年慕尼黑工业大学计算机视觉组Jakob Engel等人[[11]]提出LSD-SLAM,该方案是一种基于直接法的单目视觉SLAM算法,不需要计算特征点,通过最小化光度误差进行图像像素信息的匹配,实现了效果不错的建图,可以生成半稠密的深度图。该方案的出现证明了基于直接法的视觉SLAM系统的有效性,为后续的研究奠定了基础。但该方案仍旧存在尺度不确定性问题,以及在相机快速移动时容易丢失目标的问题等等。同年SVO(semi-direct monocular visual odometry)被Forster等人提出[[12]]。这是一种基于稀疏直接法的视觉SLAM方案,结合了特征点和直接法,使用了特征点,但是不计算特征点的描述子,特征点的匹配使用特征点周围像素利用直接法匹配。SVO有着较快的计算速度,但是缺少了后端的功能,对相机的运动估计有较为明显的累计误差,应用场景受限。 | ||||
| 2015年Mur-Artal等人参考PTAM关键帧和并行线程的方案,提出了ORB-SLAM框架[[13]]。该框架是一种完全基于特征点法的单目视觉SLAM系统,包括了跟踪,建图和回环检测三个并行线程。跟踪线程负责提取ORB[[][14][]](oriented FAST and rotated BRIEF)特征点,这该系统最为经典的一部分,采用的ORB特征点具有良好的尺度不变性和旋转不变性,能实现提取速度和效果的平衡。跟踪线程还完成估计位姿的工作,并且适时选出新的关键帧来实现建图。建图线程接收跟踪线程选出的关键帧,删除冗余的关键帧和地图点,再进行全局优化。回环线程接收建图线程筛选后的关键帧,与其他关键图进行回环检测,然后更新相机位姿和地图。ORB-SLAM因为回环检测线程的加入,有限消除了累计误差的影响,提高了定位和建图的准确性。但是其系统只适用于单目相机,精度低且应用场景受限。随着相机的进步,2017年Mur-Artal 等人对ORB-SLAM进行了改进,扩展了对双目和RGB-D相机的支持,提出ORB-SLAM2[[3]]。相比于原版,该系统支持三种相机,同时新增重定位,全局优化和地图复用等功能,更具鲁棒性。 | ||||
| 2017年,香港科技大学Qin Tong等人[[1][5][]]提出VINS Mono系统,该系统在单目相机中融合IMU传感器,在视觉信息短暂失效时可利用IMU估计位姿,视觉信息在优化时可以修正IMU数据的漂移,两者的结合表现出了优良的性能。2019年提出改进版系统VINS-Fusion[[1][6][]],新增对双目相机和GPS传感器的支持,融合后的系统效果更优。 | ||||
| 2020年Carlos Campos等提出了ORB-SLAM3[[4]],该系统在ORB-SLAM2的基础上,加入了对视觉惯性传感器融合的支持,并在社区开源。系统对算法的多个环节进行改进优化,例如加入了多地图系统和新的重定位模块,能够适应更多的场景,同时精度相比上一版增加2-3倍。在2021年底,系统更新了V1.0版本,继承了ORB-SLAM2的优良性能,成为现阶段最有代表性的视觉SLAM系统之一。 | ||||
| 2.2 动态SLAM研究现状 | ||||
| 针对动态物体的影响,已经有许多研究人员开展了相关工作,尝试解决动态场景下的视觉SLAM问题。解决这一问题的主要挑战就是如何高效地检测到动态物体和其特征点,并将动态特征点剔除以恢复相机运动。 | ||||
| 最早的解决思路是根据几何约束来筛除动态物体的特征点,如WANG等[[1][7][]]首次使用K-Means将由RGB-D相机计算的3D点聚类,并使用连续图像之间的极线约束计算区域中内点关键点数量的变化,内点数量较少的区域被认定是动态的。利用极线约束是一种判断动态物体特征点的常见方法,但是如果相邻帧间存在高速移动物体或者运动物体沿着极线方向移动,这种方法效果会大大减弱。为了更好地利用几何信息,研究人员提出借助光流信息来提高动态物体的检测。Fang[[1][8][]]使用光流法检测图像之间的动态物体所在位置,对其特征点进行滤除。该方法利用光流提高检测的精度,有效地降低了帧之间极线约束的误差。尽管基于几何约束的方法可以在一定程度消除动态特征点的影响,但随着深度学习的发展,图像中语义信息逐渐被重视和利用起来。 | ||||
| 现阶段有许多优秀的深度学习网络,如YOLO[[1][9][]],SegNet[[][20][]],Mask R-CNN[[2][1][]]等等。这些神经网络有着强大的特征提取能力和语义信息提取能力,可以帮助SLAM系统更轻松地辨别出动态物体的存在,提供语义先验信息,从而消除其影响。Fangwei Zhong等人提出的Detect-SLAM[[2][2][]],利用目标检测网络获取环境中的动态的人和车等,为了实时性,只在关键帧中进行目标检测,最后去除所有检测到的动态点来恢复相机位姿。LIU和MIURA[[2][3][]]提出了 RDS-SLAM。基于ORB-SLAM3[[4]]的RDS-SLAM框架使用模型的分割结果初始化移动对象的移动概率,将概率传播到随后的帧,以此来区分动静点。这种只基于深度学习的方法仅能提供图像中的语义信息,但无法判断图像中的物体是否真的在运动,比如静止的人或者路边停靠的汽车。若根据语义信息将其标记为动态物体后直接去除其特征点,这种方法会导致系统丢失有用的特征点,对相机的运动估计有所影响。因此仅利用深度学习不能很好解决动态物体对SLAM系统的影响。 | ||||
| 许多研究开始探索语义信息和几何信息的结合。清华大学Chao Yu等提出了DS-SLAM[24],该系统首先利用SegNet网络进行语义分割,再利用极线约束过滤移动的物体,达到了不错的效果。Berta Bescos等人首次利用Mask R-CNN网络进行实例分割,提出了DynaSLAM[25]。该系统结合基于多视几何深度的动态物体分割和区域生长算法,大幅降低了位姿估计的误差。Runz等人提出了MaskFusion,一种考虑物体语义和动态性的RGB-D SLAM系统[26]。该系统基于Mask R-CNN语义分割和几何分割,将语义分割和SLAM放在两个线程中以保证整个系统的实时性,但其物体边界分割常包含背景,仍有改善空间。Ran等人提出了RS-SLAM,一种使用RGB-D相机消除动态环境不良影响的SLAM系统[27]。该系统采用语义分割识别动态对象,通过动态对象和可移动对象的几何关系来判断可移动对象是否移动;动态内容随后被剔除,跟踪模块对剔除后的静态背景图像帧进行ORB特征提取并估计相机位姿。 | ||||
| 利用深度学习得来的语义信息和几何信息结合来解决SLAM中的动态场景问题渐渐成了一种主流,但是上述大多系统只是为了恢复相机的位姿而剔除动态物体的特征点,而没有估计动态物体的位姿。同时估计相机运动和跟踪动态物体运动,将动态物体的点加入优化步骤正在发展为一种趋势。Henein等人提出一种新的基于特征的,无模型的动态SLAM算法Dynamic SLAM(Dynamic SLAM: The Need For Speed)[[][28][]]。该方法利用语义分割场景中的刚体物体的运动,并提取运动物体的速度,有效性在各种虚拟和真实数据集上得到了验证。Javier Civera等人提出的DOT SLAM(Dynamic Object Tracking for Visual SLAM)[[2][9][]]主要工作在前端,结合实例分割为对态对象生成掩码,通过最小化光度重投影误差跟踪物体。AirDOS被卡内基梅隆大学Yuheng Qiu等人提出[[][30][]],将刚性和运动约束引入模型铰接对象,通过联合优化相机位姿、物体运动和物体三维结构,来纠正相机位姿估计。VDO SLAM[[][31][]]利用Mask R-CNN掩码和光流区分动静点,将动态环境下的SLAM表示为整体的图优化,同时估计相机位姿和物体位姿。 | ||||
| 总体来说,目前动态场景下的视觉SLAM问题的解决需要借助几何信息和深度学习的语义信息,语义信息提供更准确的物体,几何信息提供物体真实的运动状态,两者结合来估计相机运动和跟踪物体。 | ||||
| 
 | ||||
|                                  参考文献 | ||||
|  孔德磊, 方正. 基于事件的视觉传感器及其应用综述[J]. 信息与控制, 2021, 50(1): 1-19. KONG D L, FANG Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19. | ||||
|  J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611 - 625, Mar. 2016. | ||||
|  Mur-Artal R , JD Tardós. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras[J]. IEEE Transactions on Robotics, 2017. | ||||
|  Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual - Inertial, and Multimap SLAM[J]. IEEE Transactions on Robotics: A publication of the IEEE Robotics and Automation Society, 2021, 37(6): 1874-1890. | ||||
|  Tong, Qin, Peiliang, et al. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2018. | ||||
|  Smith R, Self M, Cheeseman P. Estimating Uncertain Spatial Relationships in Robotics [J]. Machine Intelligence & Pattern Recognition, 1988, 5(5):435-461. | ||||
|  Grisetti G, Stachniss C, Burgard W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters [J]. IEEE Transactions on Robotics, 2007, 23(1):34-46. | ||||
|  Kalman R E. A New Approach To Linear Filtering and Prediction Problems[J]. Journal of Basic Engineering,  1960, 82D:35-45.DOI:10.1115/1.3662552. | ||||
|  Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-Time Single Camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6):1052-1067. | ||||
|  Klein G, Murray D. Parallel Tracking and Mapping for Small AR Workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007:1-10. | ||||
|  ENGEL J, SCHOPS T, CREMERS D, LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision(ECCV), 2014:834 - 849. | ||||
|  FORSTER C, PIZZOLI M, SCARAMUZZA D. SVO: Fast semi-direct monocular visual odometry[C]. Hong Kong, China: IEEE International Conference on Robotics and Automation (ICRA), 2014: 15-22. | ||||
|  MURARTAL R, MONTIEL J M, TARDOS J D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System[J]. IEEE Transactions on Robotics, 2015, 31(5):1147-1163. | ||||
|  Rublee E,Rabaud V,Konolige K,et al.ORB:An efficient alternative to SIFT or SURF[C].2011 International conference on computer vision. IEEE, 2011:2564-2571. | ||||
|  TONG Q, PEILIANG L, SHAOJIE S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2017,99:1-17. | ||||
|  QIN T, PAN J, CAO S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. ArXiv, 2019:1901.03638. | ||||
|  WANG R, WAN W, WANG Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143. | ||||
|  Fang Y, Dai B. An improved moving target detecting and tracking based on Optical Flow technique and Kalman filter[J]. IEEE, 2009.DOI:10.1109/ICCSE.2009.5228464. | ||||
|  Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[C]//Computer Vision & Pattern Recognition. IEEE, 2016.DOI:10.1109/CVPR.2016.91. | ||||
|  Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. | ||||
|  Gkioxari G, He K, Piotr Dollár, et al. Mask R-CNN [J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 42(2): 386-397. | ||||
|  Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.DOI:10.1109/WACV.2018.00115. | ||||
|  LIU Y, MIURA J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785. | ||||
|  C. Yu, et al. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments[A]. //2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168-1174. | ||||
|  B. Bescos, J. M. Fácil, J. Civera and J. Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4)4076-4083. | ||||
|  Runz M, Buffier M, Agapito L. MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects[J]. 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2018, pp. 10-20. | ||||
|  T. Ran, L. Yuan, J. Zhang, D. Tang and L. He. RS-SLAM: A Robust Semantic SLAM in Dynamic Environments Based on RGB-D Sensor[J]. IEEE Sensors Journal, 2021, vol. 21, no. 18, pp. 20657-20664. | ||||
|  M. Henein, J. Zhang, R. Mahony and V. Ila. Dynamic SLAM: The Need For Speed[C]. 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020: 2123-2129. | ||||
|  Ballester I, Fontan A, Civera J, et al. DOT: Dynamic Object Tracking for Visual SLAM[J].  2020.DOI:10.48550/arXiv.2010.00052. | ||||
|  Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. | ||||
|  Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A Visual Dynamic Object-aware SLAM System[J]. 2020.DOI:10.48550/arXiv.2005.11052. | ||||
							
								
								
									
docker/wbw-slam/Dockerfile  3  Normal file
| @ -0,0 +1,3 @@ | ||||
| # 基于 CUDA 11.1.1 + Ubuntu 18.04 的开发镜像(构建示例:docker build -t wbw-slam docker/wbw-slam) | ||||
| FROM nvidia/cuda:11.1.1-devel-ubuntu18.04 | ||||
|  | ||||
| # 容器内统一以 /root 作为工作目录 | ||||
| WORKDIR /root | ||||
							
								
								
									
docker/wbw-slam/run.txt  10  Normal file
| @ -0,0 +1,10 @@ | ||||
| 
 | ||||
| docker run --name wbw-slam --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8080:5901 -p 8081:20 wbw-slam /bin/bash  | ||||
| 
 | ||||
| 
 | ||||
| docker run --name wbw-docker --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8083:5901 -p 8084:20 wbw-docker /bin/bash  | ||||
| // 启动docker | ||||
| docker start wbw-slam | ||||
| docker exec -it wbw-slam bash | ||||
| 
 | ||||
| 
 | ||||
							
								
								
									
										
docker/wbw_docker_export.tar  BIN  Normal file
Binary file not shown.
							
								
								
									
动态slam/06_tar.txt  1101  Normal file
File diff suppressed because it is too large.
							
								
								
									
										
动态slam/2020年-2022年开源动态SLAM.zip  BIN  Normal file
Binary file not shown.
							
								
								
									
动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx  38  Normal file
| @ -0,0 +1,38 @@ | ||||
|                      2020-2023年开源的动态SLAM论文 | ||||
| 一、2020年 | ||||
| 1.Zhang J, Henein M, Mahony R, et al. VDO-SLAM: a visual dynamic object-aware SLAM system[J]. arXiv preprint arXiv:2005.11052, 2020. | ||||
| https://github.com/halajun/vdo_slam | ||||
| 2.Bescos B, Cadena C, Neira J. Empty cities: A dynamic-object-invariant space for visual SLAM[J]. IEEE Transactions on Robotics, 2020, 37(2): 433-451. | ||||
| https://github.com/bertabescos/EmptyCities_SLAM | ||||
| 3.Vincent J, Labbé M, Lauzon J S, et al. Dynamic object tracking and masking for visual SLAM[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 4974-4979. | ||||
| https://github.com/introlab/dotmask | ||||
| 
 | ||||
| 二、2021年 | ||||
| 1.Liu Y, Miura J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. Ieee Access, 2021, 9: 23772-23785. | ||||
|  https://github.com/yubaoliu/RDS-SLAM/ | ||||
|  2.Bao R, Komatsu R, Miyagusuku R, et al. Stereo camera visual SLAM with hierarchical masking and motion-state classification at outdoor construction sites containing large dynamic objects[J]. Advanced Robotics, 2021, 35(3-4): 228-241. | ||||
|  https://github.com/RunqiuBao/kenki-positioning-vSLAM | ||||
|  3.Wimbauer F, Yang N, Von Stumberg L, et al. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 6112-6122. | ||||
| https://github.com/Brummi/MonoRec | ||||
|  4.Wang W, Hu Y, Scherer S. Tartanvo: A generalizable learning-based vo[C]//Conference on Robot Learning. PMLR, 2021: 1761-1772. | ||||
| https://github.com/castacks/tartanvo | ||||
| 5.Zhan H, Weerasekera C S, Bian J W, et al. DF-VO: What should be learnt for visual odometry?[J]. arXiv preprint arXiv:2103.00933, 2021. | ||||
| https://github.com/Huangying-Zhan/DF-VO | ||||
| 
 | ||||
| 三、2022年 | ||||
| 1.Liu J, Li X, Liu Y, et al. RGB-D inertial odometry for a resource-restricted robot in dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 9573-9580. | ||||
| https://github.com/HITSZ-NRSL/Dynamic-VINS | ||||
| 2.Song S, Lim H, Lee A J, et al. Dynavins: A visual-inertial slam for dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 11523-11530. | ||||
| https://github.com/url-kaist/dynavins | ||||
| 3.Wang H, Ko J Y, Xie L. Multi-modal Semantic SLAM for Complex Dynamic Environments[J]. arXiv e-prints, 2022: arXiv: 2205.04300. | ||||
|  https://github.com/wh200720041/MMS_SLAM | ||||
| 4.Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. | ||||
| https://github.com/haleqiu/AirDOS | ||||
| 5.Cheng S, Sun C, Zhang S, et al. SG-SLAM: a real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information[J]. IEEE Transactions on Instrumentation and Measurement, 2022, 72: 1-12. | ||||
| https://github.com/silencht/SG-SLAM | ||||
| 6.Esparza D, Flores G. The STDyn-SLAM: a stereo vision and semantic segmentation approach for VSLAM in dynamic outdoor environments[J]. IEEE Access, 2022, 10: 18201-18209. | ||||
| https://github.com/DanielaEsparza/STDyn-SLAM | ||||
| 7.Shen S, Cai Y, Wang W, et al. DytanVO: Joint refinement of visual odometry and motion segmentation in dynamic environments[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 4048-4055. | ||||
| https://github.com/castacks/DytanVO | ||||
| 
 | ||||
| 
 | ||||
| @ -0,0 +1,381 @@ | ||||
|                                         Dynamic Object Tracking and Masking for Visual SLAM | ||||
| 
 | ||||
|                                                 Jonathan Vincent, Mathieu Labbe´, Jean-Samuel Lauzon, Franc¸ois Grondin, | ||||
|                                                                    Pier-Marc Comtois-Rivet, Franc¸ois Michaud | ||||
| 
 | ||||
| arXiv:2008.00072v1 [cs.CV] 31 Jul 2020     Abstract— In dynamic environments, performance of visual               the proposed method. Our research hypothesis is that a | ||||
|                                         SLAM techniques can be impaired by visual features taken                  deep learning algorithm can be used to semantically segment | ||||
|                                         from moving objects. One solution is to identify those objects            object instances in images using a priori semantic knowledge | ||||
|                                         so that their visual features can be removed for localization and         of dynamic objects, enabling the identification, tracking and | ||||
|                                         mapping. This paper presents a simple and fast pipeline that              removal of dynamic objects from the scenes using extended | ||||
|                                         uses deep neural networks, extended Kalman filters and visual              Kalman filters to improve both localization and mapping in | ||||
|                                         SLAM to improve both localization and mapping in dynamic                  vSLAM. By doing so, the approach, referred to as Dynamic | ||||
|                                         environments (around 14 fps on a GTX 1080). Results on the                Object Tracking and Masking for vSLAM (DOTMask)1 | ||||
|                                         dynamic sequences from the TUM dataset using RTAB-Map                     aims at providing six benefits: 1) increased visual odometry | ||||
|                                         as visual SLAM suggest that the approach achieves similar                 performance; 2) increased quality of loop closure detection; | ||||
|                                         localization performance compared to other state-of-the-art               3) produce 3D maps free of dynamic objects; 4) tracking of | ||||
|                                         methods, while also providing the position of the tracked                 dynamic objects; 5) modular and fast pipeline. | ||||
|                                         dynamic objects, a 3D map free of those dynamic objects, better | ||||
|                                         loop closure detection with the whole pipeline able to run on a              The paper is organized as follows. Section II presents re- | ||||
|                                         robot moving at moderate speed.                                           lated work of approaches taking into consideration dynamic | ||||
|                                                                                                                   objects during localization and during mapping. Section III | ||||
|                                                                I. INTRODUCTION                                    describes our approach applied as a pre-processing module | ||||
|                                                                                                                   to RTAB-Map [5], a vSLAM approach. Section IV presents | ||||
|                                            To perform tasks effectively and safely, autonomous mo-                the experimental setup, and Section V provides comparative | ||||
|                                         bile robots need accurate and reliable localization from their            results on dynamic sequences taken from the TUM dataset. | ||||
|                                         representation of the environment. Compared to LIDARs | ||||
|                                         (Light Detection And Ranging sensors) and GPS (Global                                             II. RELATED WORK | ||||
|                                         Positioning System), using visual images for Simultaneous | ||||
|                                         Localization and Mapping (SLAM) adds significant infor-                       Some approaches take into consideration dynamic objects | ||||
|                                         mation about the environment [1], such as color, textures,                during localization. For instance, BaMVO [6] uses a RGB- | ||||
|                                         surface composition that can be used for semantic interpre-               D camera to estimate ego-motion. It uses a background | ||||
|                                         tation of the environment. Standard visual SLAM (vSLAM)                   model estimator combined with an energy-based dense visual | ||||
|                                         techniques perform well in static environments by being                   odometry technique to estimate the motion of the camera. Li | ||||
|                                         able to extract stable visual features from images. However,              et al. [7] developed a static point weighting method which | ||||
|                                         in environments with dynamic objects (e.g., people, cars,                 calculates a weight for each edge point in a keyframe. This | ||||
|                                         animals), performance decreases significantly because visual               weight indicates the likelihood of that specific edge point | ||||
|                                         features may come from those objects, making localization                 being part of the static environment. Weights are determined | ||||
|                                         less reliable [1]. Deep learning architectures have recently              by the movement of a depth edge point between two frames | ||||
|                                         demonstrated interesting capabilities to achieve semantic seg-            and are added to an Intensity Assisted Iterative Closest Point | ||||
|                                         mentation from images, outperforming traditional techniques               (IA-ICP) method used to perform the registration task in | ||||
|                                         in tasks such as image classification [2]. For instance, Segnet            SLAM. Sun et al. [8] present a motion removal approach to | ||||
|                                         [3] is commonly used for semantic segmentation [4]. It uses               increase the localization reliability in dynamic environments. | ||||
|                                         an encoder and a decoder to achieve pixel wise semantic                   It consists of three steps: 1) detecting moving objects’ motion | ||||
|                                         segmentation of a scene.                                                  based on ego-motion compensated using image differencing; | ||||
|                                                                                                                   2) using a particle filter for tracking; and 3) applying a | ||||
|                                            This paper introduces a simple and fast pipeline that                  Maximum-A-Posterior (MAP) estimator on depth images | ||||
|                                         uses neural networks, extended Kalman filters and vSLAM                    to determine the foreground. This approach is used as the | ||||
|                                         algorithm to deal with dynamic objects. Experiments con-                  frontend of Dense Visual Odometry (DVO) SLAM [9]. Sun | ||||
|                                         ducted on the TUM dataset demonstrate the robustness of                   et al. [10] uses a similar foreground technique but instead | ||||
|                                                                                                                   of using a MAP they use a foreground model which is | ||||
|                                            This work was supported by the Institut du ve´hicule innovant (IVI),   updated on-line. All of these approaches demonstrate good | ||||
|                                         Mitacs, InnovE´ E´ and NSERC. J. Vincent, M. Labbe´, J.-S. Lauzon,        localization results using the Technical University of Munich | ||||
|                                         F. Grondin and F. Michaud are with the Interdisciplinary Institute for    (TUM) dataset [11], however, mapping is yet to be addressed. | ||||
|                                         Technological Innovation (3IT), Dept. Elec. Eng. and Comp. Eng., | ||||
|                                         Universite´ de Sherbrooke, 3000 boul. de l’Universite´, Que´bec (Canada)     1https://github.com/introlab/dotmask | ||||
|                                         J1K 0A5. P.-M. Comtois-Rivet is with the Institut du Ve´hicule Innovant | ||||
|                                         (IVI), 25, boul. Maisonneuve, Saint-Je´roˆme, Que´bec (Canada), J5L 0A1. | ||||
|                                         {Jonathan.Vincent2, Mathieu.m.Labbe, Jean-Samuel.Lauzon, | ||||
|                                         Francois.Grondin2, Francois.Michaud}@USherbrooke.ca, | ||||
|                                         Pmcrivet@ivisolutions.ca | ||||
| Depth Image RGB Image  Instance segmentation            Dynamic   is then applied to the original depth image, resulting in a | ||||
|                              DOS                         Object   masked depth image (MDI). The DOS is also sent to the | ||||
|                                                          Classes  Tracking module. After computing a 3D centroid for each | ||||
|                                                                   masked object, the Tracking module predict the position and | ||||
|                                                         MDI       velocity of the objects. This information is then used by the | ||||
|                                                                   Moving Object Classification module (MOC) to classify the | ||||
| MO-MDI                 Tracking/MOC           Camera              object as idle or not based on its class, its estimated velocity | ||||
|                                               World               and its shape deformation. Moving objects are removed | ||||
|                                                Pose               from the original depth image, resulting in the Moving | ||||
|                                                                   Object Masked Depth Image (MO-MDI). The original RGB | ||||
|                        vSLAM                                      image, the MDI and the MO-MDI are used by the vSLAM | ||||
|                                                                   algorithm. It uses the depth images as a mask for feature | ||||
|                                               Odometry            extraction thus ignoring features from the masked regions. | ||||
|                                                                   The MO-MDI is used by the visual odometry algorithm of | ||||
|                                                Map                the vSLAM approach while the MDI is used by both its | ||||
|                                                                   mapping and loop closure algorithms, resulting in a map free | ||||
|                 Fig. 1: Architecture of DOTMask                   of dynamic objects while still being able to use the features | ||||
|                                                                   of the idle objects for visual odometry. The updated camera | ||||
|    SLAM++ [12] and Semantic Fusion [13] focus on                  pose is then used in the Tracking module to estimate the | ||||
| the mapping aspect of SLAM in dynamic environments.               position and velocity of the dynamic objects resulting in a | ||||
| SLAM++ [12] is an object-oriented SLAM which achieves             closed loop. | ||||
| efficient semantic scene description using 3D object recog- | ||||
| nition. SLAM++ defines objects using areas of interest             A. Instance Segmentation | ||||
| to subsequently locate and map them. However, it needs | ||||
| predefined 3D object models to work. Semantic Fusion                  Deep learning algorithms such as Mask R-CNN recently | ||||
| [13] creates a semantic segmented 3D map in real time             proved to be useful to accomplish instance semantic seg- | ||||
| using RGB-CNN [14], a convolutional deep learning neural          mentation [4]. A recent and interesting architecture for | ||||
| network, and a dense SLAM algorithm. However, SLAM++              fast instance segmentation is the YOLACT [18] and its | ||||
| and Semantic Fusion do not address SLAM localization              update YOLACT++ [19]. This network aims at providing | ||||
| accuracy in dynamic environments, neither do they remove          similar results as the Mask-RCNN or the Fully Convolutional | ||||
| dynamic objects in the 3D map.                                    Instance-aware Semantic Segmentation (FCIS) [20] but at a | ||||
|                                                                   much lower computational cost. YOLACT and YOLACT++ | ||||
|    Other approaches use deep learning algorithm to provide        can achieve real-time instance segmentation. Development in | ||||
| improved localisation and mapping. Fusion++ [15] and MID-         neural networks has been incredibly fast in the past few years | ||||
| Fusion [16] uses object-level octree-based volumetric repre-      and probably will be in the years to come. DOTMask was | ||||
| sentation to estimate both the camera pose and the object         designed the be modular and can easily change the neural | ||||
| positions. They use deep learning techniques to segment ob-       network used in the pipeline. In its current state, DOTMask | ||||
| ject instances. DynaSLAM [17] proposes to combine multi-          works with Mask-RCNN, YOLACT and YOLACT++. The | ||||
view geometry models and deep-learning-based algorithms to detect dynamic objects and to remove them from the images prior to a vSLAM algorithm. They also use inpainting to recreate the image without object occlusion. DynaSLAM achieves impressive results on the TUM dataset. However, these approaches are not optimized for real-time operation.

III. DYNAMIC OBJECT TRACKING AND MASKING FOR VSLAM

The objective of our work is to provide a fast and complete solution for visual SLAM in dynamic environments. Figure 1 illustrates the DOTMask pipeline. As a general overview of the approach, a set of objects of interest (OOI) is defined using a priori knowledge and understanding of the dynamic object classes that can be found in the environment. Instance segmentation is done using a neural network trained to identify the object classes from an RGB image. For each dynamic object instance, its bounding box, class type and binary mask are grouped for convenience and referred to as the dynamic object state (DOS). The binary mask of the DOS […]

YOLACT is much faster than the two others and the loss in precision does not impact our results. This is why this architecture is used in our tests. The instance segmentation module takes the input RGB image and outputs the bounding box, class and binary mask for each instance.

B. Tracking Using EKF

Using the DOS from the Instance Segmentation module and odometry from vSLAM, the Tracking module predicts the pose and velocity of the objects in the world frame. This is useful when the camera is moving at a speed similar to the objects to track (e.g., moving cars on the highway, a robot following a pedestrian) or when idle objects have a high amount of features (e.g., a person wearing a plaid shirt).

First, the Tracking module receives the DOS and the original depth image as a set, defined as D_k = {d_k^1, ..., d_k^I}, where d_k^i = (T_k, B_k^i, ζ_k^i) is the object instance detected by the Instance Segmentation module, with i ∈ I, I = {1, ..., L}, L being the total number of object detections in the frame at time k. T ∈ R^{m×n} is the depth image, B ∈ Z_2^{m×n} is the binary mask and ζ ∈ J is the class ID, with J = {1, ..., W}, and W is the number of total trained classes in the Instance Segmentation module.

The DOS and the original depth image are used by EKF to estimate the dynamic objects' positions and velocities. EKF provides steady tracking of each object instance corresponding to the object type detected by the neural network. An EKF is instantiated for each new object, and a priori knowledge from the set of dynamic object classes defines some of the filter's parameters. This instantiation is made using the following parameters: the class of the object, its binary mask and its 3D centroid position. The 3D centroid is defined as the center of the corresponding bounding box. If the tracked object is observed in the DOS, its position is updated accordingly, otherwise its predicted position using EKF is used. If no observations of the object are made for e consecutive frames, the object is considered removed from the scene and therefore the filter is discarded. The Tracking module outputs the estimated velocity of the objects to the MOC module. The MOC module will classify the objects as idle or not based on the object class, the filter velocity estimation and the object deformation.
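As an illustration of how one DOS entry and the quantities used to instantiate its filter (class, binary mask, bounding-box centroid) could be represented, here is a minimal Python sketch; the field and function names are illustrative and not taken from the DOTMask implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DOSEntry:
    """One element d_k^i of the dynamic object state produced for frame k."""
    class_id: int            # ζ, index into the W trained classes
    bounding_box: tuple      # (x_min, y_min, x_max, y_max) in pixels
    mask: np.ndarray         # binary mask B of shape (m, n)

    def centroid_px(self):
        """Center of the bounding box, used as the pixel centroid of the object."""
        x_min, y_min, x_max, y_max = self.bounding_box
        return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
```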
To explain further how the Tracking module works, the following subsections present in more detail the Prediction and Update steps of the EKF used by DOTMask.

1) Prediction: Let us define the hidden state x ∈ R^{6×1} as the 3D position and velocity of an object referenced in the global map in Cartesian coordinates. The a priori estimate of the state at time k ∈ N is predicted based on the previous state at time k − 1 as in (1):

x̂_{k|k−1} = F x̂_{k−1|k−1}   with   F = [ I_3  Δt·I_3 ; 0_3  I_3 ]        (1)

where F ∈ R^{6×6} is the state transition matrix, Δt ∈ R^+ is the time between each prediction, 0_3 is a 3 × 3 zero matrix and I_3 is a 3 × 3 identity matrix. Note that the value of Δt is redefined before each processing cycle.

The a priori estimate of the state covariance P_{k|k−1} ∈ R^{6×6} at time k is predicted based on the previous state at time k − 1 as given by (2):

P_{k|k−1} = F P_{k−1|k−1} F^T + Q        (2)

where Q ∈ R^{6×6} is the process noise covariance matrix defined using the random acceleration model (3):

Q = Γ Σ Γ^T   with   Γ = [ (Δt²/2)·I_{3×3}   Δt·I_{3×3} ]^T        (3)

where Γ ∈ R^{6×3} is the mapping between the random acceleration vector a ∈ R^3 and the state x, and Σ ∈ R^{3×3} is the covariance matrix of a. The acceleration components a_x, a_y and a_z are assumed to be uncorrelated.

The dynamics of every detected object may vary greatly depending on its class. For instance, a car does not have the same dynamics as a mug. To better track different types of objects, a covariance matrix is defined for each class to better represent their respective process noise.
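To make the prediction step concrete, the following minimal Python sketch implements equations (1)–(3) for a single tracked object, assuming a 6-dimensional state [x, y, z, vx, vy, vz] and a per-class random-acceleration variance; names such as `accel_var` and `predict` are illustrative, not taken from the DOTMask code.

```python
import numpy as np

def predict(x_est, P_est, dt, accel_var):
    """One EKF prediction step for a constant-velocity object model.

    x_est     : (6,) state [x, y, z, vx, vy, vz] in the world frame
    P_est     : (6, 6) state covariance
    dt        : time elapsed since the last prediction (s)
    accel_var : per-class random acceleration variance
    """
    I3 = np.eye(3)
    # State transition matrix F of equation (1)
    F = np.block([[I3, dt * I3],
                  [np.zeros((3, 3)), I3]])
    # Random acceleration model of equation (3)
    Gamma = np.vstack([0.5 * dt**2 * I3, dt * I3])   # (6, 3)
    Sigma = accel_var * I3                           # uncorrelated ax, ay, az
    Q = Gamma @ Sigma @ Gamma.T
    # Equations (1) and (2)
    x_pred = F @ x_est
    P_pred = F @ P_est @ F.T + Q
    return x_pred, P_pred
```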
2) Update: In EKF, the Update step starts by evaluating the innovation ỹ_k defined as (4):

ỹ_k = z_k − ĥ_k(x̂_{k|k−1})        (4)

where z_k ∈ R^3 is a 3D observation of a masked object in reference to the camera for each object instance, with z = [z_x z_y z_z]^T, z_x = (µ_x − C_x)·z_z/f_x and z_y = (µ_y − C_y)·z_z/f_y, where C_x and C_y are the principal point coordinates and f_x and f_y are the focal lengths expressed in pixels. z_z is approximated using the average depth from the masked region on the depth image. The expressions µ_x and µ_y stand for the center of the bounding box.

To simplify the following equations, (s, c) represent respectively the sine and cosine of the Euler angles φ, θ, ψ (roll, pitch, yaw). h(x_k) ∈ R^4 is the observation function which maps the true state space x_k to the observed state space z_k. ĥ(x_k) is the first three terms of h(x_k). However, in our case, the transform between those spaces is not linear, justifying the use of EKF. The non-linear rotation matrix used to transform the estimated state x̂_k into the observed state z_k follows the (x, y, z) Tait-Bryan convention and is given by h(x̂_k) = [h_φ h_θ h_ψ 1], where:

h_φ = (c_φ c_θ)x̂_x + (c_φ s_θ s_ψ − c_ψ s_φ)x̂_y + (s_φ s_ψ + c_φ c_ψ s_θ)x̂_z + c_x
h_θ = (c_θ s_φ)x̂_x + (c_φ c_ψ + s_φ s_θ s_ψ)x̂_y + (c_ψ s_φ s_θ − c_φ s_ψ)x̂_z + c_y
h_ψ = −(s_θ)x̂_x + (c_θ s_ψ)x̂_y + (c_θ c_ψ)x̂_z + c_z        (5)

and c_x, c_y and c_z are the coordinates of the camera referenced to the world, which are derived using vSLAM odometry.

The innovation covariance S_k ∈ R^{3×3} is defined as follows, where H_k ∈ R^{3×6} stands for the Jacobian of h(x̂_k):

S_k = H_k P_{k|k−1} H_k^T + R_k        (6)

where R_k ∈ R^{3×3} is the covariance of the observation noise; its diagonal terms stand for the imprecision of the RGB-D camera. The near-optimal Kalman gain K_k ∈ R^{6×3} is defined as follows:

K_k = P_{k|k−1} H_k^T (S_k)^{−1}        (7)

Finally, the updated state estimate x̂_{k|k} and the covariance estimate are given respectively by (8) and (9):

x̂_{k|k} = x̂_{k|k−1} + K_k ỹ_k        (8)

P_{k|k} = (I_6 − K_k H_k) P_{k|k−1}        (9)
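A compact sketch of how an observation z_k can be formed from a masked detection and folded into the filter (Eqs. (4), (6)–(9)), assuming a pinhole camera model and that the Jacobian H_k and the evaluated observation function ĥ(x̂_{k|k−1}) are supplied by the caller; the names are illustrative, not from the DOTMask implementation.

```python
import numpy as np

def observation_from_mask(bbox_center, mask, depth_image, fx, fy, cx, cy):
    """Back-project the masked detection into a camera-frame observation z = [zx, zy, zz]."""
    mu_x, mu_y = bbox_center                      # bounding-box center in pixels
    z_z = float(np.mean(depth_image[mask > 0]))   # average depth over the binary mask
    z_x = (mu_x - cx) * z_z / fx                  # pinhole back-projection
    z_y = (mu_y - cy) * z_z / fy
    return np.array([z_x, z_y, z_z])

def update(x_pred, P_pred, z, h_pred, H, R):
    """Generic EKF update, equations (4) and (6)-(9)."""
    y = z - h_pred                                # innovation (4)
    S = H @ P_pred @ H.T + R                      # innovation covariance (6)
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain (7)
    x_upd = x_pred + K @ y                        # state update (8)
    P_upd = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred   # covariance update (9)
    return x_upd, P_upd
```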
 | ||||
C. Moving Object Classification

The MOC module classifies dynamic objects as either moving or idle. It takes as inputs the dynamic object's class, velocity and mask. The object velocity comes from the Tracking module estimation. The object class and mask are directly obtained from the DOS. The object class defines if the object is rigid or not. The deformation of a non-rigid object is computed using the intersection over union (IoU) of the masks of the object at time k and k − 1. The IoU algorithm takes two arbitrary convex shapes M_{k−1}, M_k and is defined as IoU = |M_k ∩ M_{k−1}| / |M_k ∪ M_{k−1}|, where |·| is the cardinality of the set. A dynamic object is classified as moving if its velocity is higher than a predefined threshold or if it is a non-rigid object with an IoU above another predefined threshold. The original depth image is then updated resulting in the MO-MDI. The MO-MDI is sent to the vSLAM odometry to update the camera pose.
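A minimal sketch of this classification rule; the per-class velocity threshold follows Table I, while expressing the deformation test as 1 − IoU exceeding a threshold is an assumption about how the IoU criterion is applied, and all names and the default threshold value are illustrative.

```python
import numpy as np

def mask_iou(mask_prev, mask_curr):
    """IoU of two binary masks, |Mk ∩ Mk-1| / |Mk ∪ Mk-1|."""
    inter = np.logical_and(mask_prev, mask_curr).sum()
    union = np.logical_or(mask_prev, mask_curr).sum()
    return inter / union if union > 0 else 0.0

def is_moving(velocity, mask_prev, mask_curr, rigid,
              vel_threshold, deform_threshold=0.5):
    """Classify a tracked object as moving (True) or idle (False).

    velocity      : (3,) velocity estimate from the EKF, world frame (m/s)
    rigid         : whether the object class is rigid (e.g., chair) or not (e.g., person)
    vel_threshold : per-class velocity threshold (m/s), cf. Table I
    """
    if np.linalg.norm(velocity) > vel_threshold:
        return True
    if not rigid:
        # Non-rigid objects are also flagged through mask deformation between k-1 and k.
        deformation = 1.0 - mask_iou(mask_prev, mask_curr)
        return deformation > deform_threshold
    return False
```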
IV. EXPERIMENTAL SETUP

To test our DOTMask approach, we chose to use the TUM dataset because it presents challenging indoor dynamic RGB-D sequences with ground truth to evaluate visual odometry techniques. Also, TUM is commonly used to compare with other state-of-the-art techniques. We used sequences in low dynamic and highly dynamic environments.

For our experimental setup, ROS is used as a middleware to make the interconnections between the input images, segmentation network, EKF and RTAB-Map. The deep learning library PyTorch is used for the instance segmentation algorithm. The ResNet-50-FPN backbone is used for the YOLACT architecture because this configuration achieves the best results at a higher framerate [18]. Our Instance Segmentation module is based on the implementation of YOLACT by dbolya² and its pre-trained weights. The network is trained on all 91 classes of the COCO dataset. The COCO dataset is often used to compare state-of-the-art instance segmentation approaches, which is why we chose to use it in our trials. In our tests, person, chair, cup and bottle are the OOI used because of their presence in the TUM dataset and in our in-house tests. The RTAB-Map library [5] is also used, which includes various state-of-the-art visual odometry algorithms, a loop closure detection approach and a 3D map renderer.

Table I presents the parameters used for DOTMask in our trials, based on empirical observations in the evaluated TUM sequences and our understanding of the nature of the objects. A probability threshold p and a maximum instance number m are used to reduce the number of object instances to feed into the pipeline. Only detections with a score above p are used and at most m object detections are processed. This provides faster and more robust tracking.

TABLE I: Experimental Parameters

  Description                                 Value
  Frames to terminate object tracking (e)     10
  Score threshold (s)                         0.1
  Maximum number of observations (m)          5
  Velocity threshold for a person             0.01 m/s
  Velocity threshold for the other objects    0.1 m/s
  Random acceleration for a person            0.62 m/s²
  Random acceleration for other objects       1.0 m/s²

² https://github.com/dbolya/yolact
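To illustrate how the probability threshold p and the maximum instance number m of Table I could prune the raw detections before tracking, here is a small hedged sketch; the detection dictionary layout is an assumption, not the actual YOLACT output format.

```python
def filter_detections(detections, p=0.1, m=5,
                      classes_of_interest=("person", "chair", "cup", "bottle")):
    """Keep at most m detections of the objects of interest whose score is above p."""
    kept = [d for d in detections
            if d["score"] > p and d["class_name"] in classes_of_interest]
    kept.sort(key=lambda d: d["score"], reverse=True)
    return kept[:m]
```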
V. RESULTS

Trials were conducted in comparison with approaches by Kim and Kim [6], Sun et al. [8], Bescos et al. [17] and RTAB-Map, the latter being also used with DOTMask. Figure 2a shows two original RGB frames in the TUM dataset, along with their superimposed RGB and depth images with the features used by RTAB-Map (Fig. 2b) and with DOTMask (Fig. 2c). Using the depth image as a mask to filter outlying features, dynamic objects (i.e., humans and chairs in this case) are filtered out because the MDI includes the semantic mask. The MO-MDI is used by RTAB-Map to compute visual odometry, keeping only the features from static objects, as seen in Fig. 2c (left vs right), with the colored dots representing visual features used for visual odometry. In the left image of Fig. 2c, the man on the left is classified by the Tracking module as moving, while the man on the right is classified as being idle, resulting in keeping his visual features. In the right image of Fig. 2c, the man on the right is also classified as moving because he is standing up, masking his visual features. Figure 3 illustrates the influence of the MDI, which contains the depth mask of all the dynamic objects, either idle or not, to generate a map free of dynamic objects. This has two benefits: it creates a more visually accurate 3D rendered map, and it improves loop closure detection. The differences in the 3D generated maps between RTAB-Map without and with DOTMask are very apparent: there are fewer artifacts of dynamic objects and less drifting. The fr3/walking static sequence shows improved quality in the map, while the fr3/walking rpy sequence presents some undesirable artifacts. These artifacts are caused either by the mask failing to identify dynamic objects that are tilted or upside down or by the time delay between the RGB image and its corresponding depth image. The fr3/sitting static sequence shows the result when masking idle objects, resulting in completely removing the dynamic objects from the scene.

Fig. 2: RTAB-Map features (colored dots) not appearing on moving objects with DOTMask: (a) original RGB image; (b) RGB and depth image superposed without DOTMask; (c) RGB and depth image superposed with DOTMask.

Table II characterizes the overall SLAM quality in terms of absolute trajectory error (ATE). In almost all cases, DOTMask improves the ATE compared to RTAB-Map alone (as seen in the last column of the table). While DynaSLAM is better in almost every sequence, DOTMask is not far off, with closer values compared to the other techniques.

TABLE II: Absolute Trajectory Error (ATE) RMSE in cm

  TUM Seqs          BaMVO   Sun et al.  DynaSLAM  RTAB-Map  DOTMask  Impr. (%)
  fr3/sit static     2.48      -           -         1.70     0.60     64.71
  fr3/sit xyz        4.82     3.17        1.5        1.60     1.80    -12.50
  fr3/wlk static    13.39     0.60        2.61      10.7      0.80     92.52
  fr3/wlk xyz       23.26     9.32        1.50      24.50     2.10     91.42
  fr3/wlk rpy       35.84    13.33        3.50      22.80     5.30     76.75
  fr3/wlk halfsph   17.38    12.52        2.50      14.50     4.00     72.41

Table III presents the number of loop closure detections, the mean translation error (Terr) and the mean rotational error (Rerr) on each sequence, both with and without DOTMask. In all sequences, DOTMask helps RTAB-Map to make more loop closures while also lowering both mean errors. Since loop closure features are computed from the depth image (MDI), using DOTMask forces RTAB-Map to use only features from static objects, hence providing better loop closures.

TABLE III: Loop Closure Analysis

                        RTAB-Map                        DOTMask
  TUM Seqs          Nb loop  Terr (cm)  Rerr (deg)   Nb loop  Terr (cm)  Rerr (deg)
  fr3/sit static       33      1.80       0.26         1246     0.60       0.21
  fr3/sit xyz         288      2.10       0.42         1486     2.50       0.45
  fr3/wlk static      105      9.00       0.18         1260     7.00       0.15
  fr3/wlk xyz          55      6.5        0.99         1516     2.9        0.45
  fr3/wlk halfs.      121      5.90       0.84          964     4.90       0.79
  fr3/wlk rpy          94      6.7        1.06          965     6.00       1.04

On the fr3/sitting xyz sequence, RTAB-Map alone provides better performance in both ATE and loop closure detection. In this entire sequence, the dynamic objects do not move. While the MO-MDI enables features from idle dynamic objects to be used by the odometry algorithm, the MDI does not enable those same features for the loop closure algorithm. Since nothing is moving in this particular sequence, all features will help to provide a better localisation. However, this case is not representative of dynamic environments.

Table IV presents the average computation time to process a frame for each approach, without vSLAM and odometry algorithms. Results are processed on a computer equipped with a GTX 1080 GPU and an i5-8600K CPU. DOTMask was also tested on a laptop with a GTX 1050, where it achieved an average of 8 frames per second. At 70 ms, it can run on a mobile robot operating at a moderate speed. The fastest method is BaMVO with only 42 ms cycle time.

TABLE IV: Timing Analysis

  Approach    Img. Res.  Avg. Time   CPU         GPU
  BaMVO       320×240     42.6 ms    i7 3.3GHz    -
  Sun et al.  640×480     500 ms     i5           -
  DynaSLAM    640×480     500 ms     -            -
  DOTMask     640×480      70 ms     i5-8600K    GTX1080
  DOTMask     640×480     125 ms     i7-8750H    GTX1050

Figure 4 shows the tracked dynamic objects in the ROS visualizer RViz. DOTMask generates ROS transforms to track the position of the objects. Those transforms could easily be used in other ROS applications. Figure 5 shows the difference between RTAB-Map and DOTMask in a real scene where a robot moves at a similar speed as dynamic objects (chairs and humans). The pink and blue lines represent the odometry of RTAB-Map without and with DOTMask. These results suggest qualitatively that DOTMask improves the odometry and the 3D map.

Fig. 3: RTAB-Map 3D rendered map from the TUM sequences (a) fr3/sitting static, (b) fr3/walking static and (c) fr3/walking rpy, without (top) and with (bottom) DOTMask.

Fig. 4: Position of tracked dynamic objects shown in RViz.

Fig. 5: 3D map and odometry improved with DOTMask: (a) RTAB-Map alone; (b) RTAB-Map with DOTMask.

VI. CONCLUSION

This paper presents DOTMask, a fast and modular pipeline that uses a deep learning algorithm to semantically segment images, enabling the tracking and masking of dynamic objects in scenes to improve both localization and mapping in vSLAM. Our approach aims at providing a simple and complete pipeline to allow mobile robots to operate in dynamic environments. Results on the TUM dataset suggest that using DOTMask with RTAB-Map provides similar performance compared to other state-of-the-art localization approaches while providing an improved 3D map, dynamic object tracking and higher loop closure detection. While DOTMask does not outperform DynaSLAM on the TUM dataset or outrun BaMVO, it proves to be a good compromise for robotic applications. Because the DOTMask pipeline is highly modular, it can also evolve with future improvements of deep learning architectures and new sets of dynamic object classes. In future work, we want to use the tracked dynamic objects to create a global 3D map with object permanence, and explore more complex neural networks³ to add body keypoint tracking, which could significantly improve human feature extraction. We would also like to explore techniques to detect outlier segmentations from the neural network to improve robustness.

³ https://github.com/daijucug/Mask-RCNN-TF detection-human segment-body keypoint-regression

REFERENCES

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 55–81, 2015.
[2] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 3642–3649.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[4] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, "A review on deep learning techniques applied to semantic segmentation," arXiv preprint arXiv:1704.06857, 2017.
[5] M. Labbé and F. Michaud, "Online global loop closure detection for large-scale multi-session graph-based SLAM," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2014, pp. 2661–2666.
[6] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Trans. Robotics, vol. 32, no. 6, pp. 1565–1573, 2016.
[7] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263–2270, 2017.
[8] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[9] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2013, pp. 2100–2106.
[10] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robotics and Autonomous Systems, vol. 108, pp. 115–128, 2018.
[11] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct. 2012.
[12] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous localisation and mapping at the level of objects," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
[13] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in Proc. IEEE Int. Conf. Robotics and Automation, 2017, pp. 4628–4635.
[14] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1520–1528.
[15] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, "Fusion++: Volumetric object-level SLAM," in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 32–41.
[16] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5231–5237.
[17] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[18] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in ICCV, 2019.
[19] ——, "YOLACT++: Better real-time instance segmentation," 2019.
[20] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, "Fully convolutional instance-aware semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 2359–2367.
							| @ -0,0 +1,679 @@ | ||||
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera

Felix Wimbauer¹, Nan Yang¹,², Lukas von Stumberg¹, Niclas Zeller¹,², Daniel Cremers¹,²
¹ Technical University of Munich, ² Artisense
{wimbauer, yangn, stumberg, zellern, cremers}@in.tum.de
(Indicates equal contribution.)

Abstract

In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other multi-view stereo methods, MonoRec is able to reconstruct both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we furthermore demonstrate that MonoRec is able to generalize well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera. Code and related materials are available at https://vision.in.tum.de/research/monorec.

Figure 1: MonoRec can deliver high-quality dense reconstruction from a single moving camera. The figure shows an example of a large-scale outdoor point cloud reconstruction (KITTI Odometry sequence 07) by simply accumulating predicted depth maps. Please refer to our project page for the video of the entire reconstruction of the sequence.

1. Introduction

1.1. Real-world Scene Capture from Video

Obtaining a 3D understanding of the entire static and dynamic environment can be seen as one of the key challenges in robotics, AR/VR, and autonomous driving. As of today, this is achieved based on the fusion of multiple sensor sources (incl. cameras, LiDARs, RADARs and IMUs). This guarantees dense coverage of the vehicle's surroundings and accurate ego-motion estimation. However, driven by the high cost as well as the challenge to maintain cross-calibration of such a complex sensor suite, there is an increasing demand for reducing the total number of sensors. Over the past years, researchers have therefore put a lot of effort into solving the problem of perception with only a single monocular camera. Considering recent achievements in monocular visual odometry (VO) [8, 58, 51] with respect to ego-motion estimation, this was certainly successful. Nevertheless, reliable dense 3D mapping of the static environment and moving objects is still an open research topic.
| 
 | ||||
To tackle the problem of dense 3D reconstruction based on a single moving camera, there are basically two parallel lines of research. On one side, there are dense multi-view stereo (MVS) methods, which evolved over the last decade [39, 45, 2] and saw a great improvement through the use of convolutional neural networks (CNNs) [23, 61, 57]. On the other side, there are monocular depth prediction methods which purely rely on deep learning [7, 16, 58]. Though all these methods show impressive performance, both types also have their respective shortcomings. For MVS the overall assumption is a stationary environment to be reconstructed, so the presence of dynamic objects deteriorates their performance. Monocular depth prediction methods, in contrast, perform very well in reconstructing moving objects, as predictions are made only based on individual images. At the same time, due to their use of a single image only, they strongly rely on the perspective appearance of objects as observed with specific camera intrinsics and extrinsics and therefore do not generalize well to other datasets.

1.2. Contribution

To combine the advantages of both deep MVS and monocular depth prediction, we propose MonoRec, a novel monocular dense reconstruction architecture that consists of a MaskModule and a DepthModule. We encode the information from multiple consecutive images using cost volumes which are constructed based on the structural similarity index measure (SSIM) [54] instead of the sum of absolute differences (SAD) like prior works. The MaskModule is able to identify moving pixels and downweights the corresponding voxels in the cost volume. Thereby, in contrast to other MVS methods, MonoRec does not suffer from artifacts on moving objects and therefore delivers depth estimations on both static and dynamic objects.

With the proposed multi-stage training scheme, MonoRec achieves state-of-the-art performance compared to other MVS and monocular depth prediction methods on the KITTI dataset [14]. Furthermore, we validate the generalization capabilities of our network on the Oxford RobotCar dataset [35] and the TUM-Mono dataset [9]. Figure 1 shows a dense point cloud reconstructed by our method on one of our test sequences of KITTI.

2. Related Work

2.1. Multi-view Stereo

Multi-view stereo (MVS) methods estimate a dense representation of the 3D environment based on a set of images with known poses. Over the past years, several methods have been developed to solve the MVS problem [46, 28, 30, 2, 47, 49, 39, 13, 45, 60] based on classical optimization. Recently, due to the advance of deep neural networks (DNNs), different learning based approaches were proposed. This representation can be volumetric [26, 27, 36] or 3D point cloud based [3, 12]. Most popular are still depth map representations predicted from a 3D cost volume [23, 53, 61, 66, 22, 56, 41, 24, 33, 62, 19, 64, 57]. Huang et al. [23] proposed one of the first cost-volume based approaches. They compute a set of image-pair-wise plane-sweep volumes with respect to a reference image and use a CNN to predict one single depth map based on this set. Zhou et al. [66] also use photometric cost volumes as the inputs of the deep neural networks and employ a two-stage approach for dense depth prediction. Yao et al. [61] instead calculate a single cost volume using deep features of all input images.

2.2. Dense Depth Estimation in Dynamic Scenes

Reconstructing dynamic scenes is challenging since the moving objects violate the static-world assumption of classical multi-view stereo methods. Russell et al. [43] and Ranftl et al. [40] build on motion segmentation and perform classical optimization. Li et al. [32] proposed to estimate dense depth maps from scenes with moving people. All these methods need additional inputs, e.g., optical flow, object masks, etc., for the inference, while MonoRec requires only the posed images as the inputs. Another line of research is monocular depth estimation [7, 6, 29, 31, 11, 59, 16, 48, 67, 63, 65, 52, 18, 17, 58]. These methods are not affected by moving objects, but the depth estimation is not necessarily accurate, especially in unseen scenarios. Luo et al. [34] proposed a test-time optimization method which is not real-time capable. In a concurrent work, Watson et al. [55] address moving objects with the consistency between monocular depth estimation and multi-view stereo, while MonoRec predicts the dynamic masks explicitly by the proposed MaskModule.

2.3. Dense SLAM

Several of the methods cited above solve both the problem of dense 3D reconstruction and camera pose estimation [48, 67, 63, 65, 66, 59, 58]. Nevertheless, these methods either solve both problems independently or only integrate one into the other (e.g. [66, 58]). Newcombe et al. [37] instead jointly optimize the 6DoF camera pose and the dense 3D scene structure. However, due to its volumetric map representation it is only applicable to small-scale scenes. Recently, Bloesch et al. [1] proposed a learned code representation which can be optimized jointly with the 6DoF camera poses. This idea is pursued by Czarnowski et al. [5] and integrated into a full SLAM system. All the above-mentioned methods, however, do not address the issue of moving objects. Instead, the proposed MonoRec network explicitly deals with moving objects and achieves superior accuracy both on moving and on static structures. Furthermore, prior works show that the accuracy of camera tracking does not necessarily improve with more points [8, 10]. MonoRec therefore focuses solely on delivering dense reconstruction using poses from a sparse VO system and shows state-of-the-art results on public benchmarks. Note that, this way, MonoRec can be easily combined with any VO system with arbitrary sensor setups.
[Figure 2 diagram: Reprojections → Cost Volume Construction → weighted cost volume → MaskModule → masked cost volume → DepthModule; per-frame features aggregated by max pooling; ResNet-18 image features feed both decoders.]

Figure 2: MonoRec Architecture: It first constructs a photometric cost volume from multiple input frames. Unlike prior works, we use the SSIM [54] metric instead of SAD to measure the photometric consistency. The MaskModule aims to detect inconsistencies between the different input frames to determine moving objects. The multi-frame cost volume C is multiplied with the predicted mask and then passed to the DepthModule which predicts a dense inverse depth map. In both the decoders of MaskModule and DepthModule, the cost volume features are concatenated with pre-trained ResNet-18 features.
3. The MonoRec Network

MonoRec uses a set of consecutive frames and the corresponding camera poses to predict a dense depth map for the given keyframe. The MonoRec architecture combines a MaskModule and a DepthModule. MaskModule predicts moving object masks that improve depth accuracy and allow us to eliminate noise in 3D reconstructions. DepthModule predicts a depth map from the masked cost volume. In this section, we first describe the different modules of our architecture, and then discuss the specialized multi-stage semi-supervised training scheme.

3.1. Preliminaries

Our method aims to predict a dense inverse depth map D_t of the selected keyframe from a set of consecutive frames {I_1, · · · , I_N}. We denote the selected keyframe as I_t and the others as I_t′ (t′ ∈ {1, · · · , N} \ t). Given the camera intrinsics, the inverse depth map D_t, and the relative camera pose T_t′^t ∈ SE(3) between I_t and I_t′, we can perform the reprojection from I_t′ to I_t as

I_t′^t = I_t′ ⟨ proj(D_t, T_t′^t) ⟩        (1)

where proj() is the projection function and ⟨·⟩ is the differentiable sampler [25]. This reprojection formulation is important for both the cost volume formation (Sec. 3.2) and the self-supervised loss term (Sec. 3.4).

In the following, we refer to the consecutive frames as temporal stereo (T) frames. During training, we use an additional static stereo (S) frame I_t^S for each sample, which was captured by a synchronized stereo camera at the same time as the respective keyframe.
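To make Eq. (1) concrete, here is a minimal PyTorch sketch of the reprojection, under the assumptions that a metric depth map of the keyframe (rather than inverse depth) is passed in and that the pose maps keyframe coordinates into the source frame; all names are illustrative rather than taken from the MonoRec code.

```python
import torch
import torch.nn.functional as F

def reproject(img_src, depth_key, T_src_key, K):
    """Warp a source frame I_t' into the keyframe, a sketch of Eq. (1).

    img_src   : (B, 3, H, W) source image I_t'
    depth_key : (B, 1, H, W) depth of the keyframe
    T_src_key : (B, 4, 4) transform from keyframe to source coordinates
    K         : (B, 3, 3) camera intrinsics
    """
    B, _, H, W = depth_key.shape
    device = depth_key.device
    # Pixel grid of the keyframe in homogeneous coordinates
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)
    # Back-project to 3D keyframe coordinates and transform into the source frame
    cam = torch.linalg.inv(K) @ pix * depth_key.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_src = (T_src_key @ cam_h)[:, :3]
    # Project into the source image and sample with the differentiable sampler
    proj = K @ cam_src
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)
```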
3.2. Cost Volume

A cost volume encodes geometric information from the different frames in a tensor that is suited as input for neural networks. For a number of discrete depth steps, the temporal stereo frames are reprojected to the keyframe and a pixel-wise photometric error is computed. Ideally, the lower the photometric error, the better the depth step approximates the real depth at a given pixel. Our cost volume follows the general formulation of the prior works [37, 66]. Nevertheless, unlike the previous works that define the photometric error pe() as a patch-wise SAD, we propose to use the SSIM as follows:

pe(x, d) = (1 − SSIM(I_t′^t(x, d), I_t(x))) / 2        (2)

with a 3 × 3 patch size. Here I_t′^t(x, d) defines the intensity at pixel x of the image I_t′ warped with constant depth d. In practice, we clamp the error to [0, 1]. The cost volume C stores at C(x, d) the aggregated photometric consistency for pixel x and depth d:

C(x, d) = 1 − 2 · (1 / Σ_t′ ω_t′(x)) · Σ_t′ pe_t′(x, d) · ω_t′(x)        (3)

where d ∈ {d_i | d_i = d_min + (i/M) · (d_max − d_min)}. The weighting term ω_t′(x) weights the optimal depth step based on the photometric error, while the others are weighted lower:

ω_t′(x) = 1 − (1 / (M − 1)) · Σ_{d ≠ d*} exp(−α · (pe_t′(x, d) − pe_t′(x, d*))²)        (4)

with d*_t′ = arg min_d pe_t′(x, d). Note that C(x, d) has the range [−1, 1] where −1/1 indicates the lowest/highest photometric consistency.

In the following section, we denote cost volumes calculated based on the keyframe I_t and only one non-keyframe I_t′ by C_t′(x, d) where applicable.
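The following sketch shows how Eqs. (2) and (3) could be assembled in PyTorch. It reuses the `reproject` helper from the previous sketch, approximates the 3×3 SSIM with average pooling, and, for brevity, weights all source frames equally instead of applying the per-pixel weighting ω of Eq. (4); the depth steps are assumed to be metric depths.

```python
import torch
import torch.nn.functional as F

def ssim_3x3(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM over 3x3 windows, averaged over channels."""
    mu_a, mu_b = F.avg_pool2d(a, 3, 1, 1), F.avg_pool2d(b, 3, 1, 1)
    var_a = F.avg_pool2d(a * a, 3, 1, 1) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, 3, 1, 1) - mu_b ** 2
    cov = F.avg_pool2d(a * b, 3, 1, 1) - mu_a * mu_b
    s = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
        ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return s.mean(dim=1, keepdim=True)

def photometric_error(img_key, img_warped):
    """pe(x, d) = (1 - SSIM) / 2, clamped to [0, 1] as in Eq. (2)."""
    return torch.clamp((1.0 - ssim_3x3(img_warped, img_key)) / 2.0, 0.0, 1.0)

def cost_volume(img_key, imgs_src, poses_src, K, depth_steps):
    """Simplified C(x, d) of Eq. (3): equal frame weights instead of the weights of Eq. (4)."""
    steps = []
    for d in depth_steps:
        plane = torch.full_like(img_key[:, :1], float(d))      # constant-depth plane
        pe = torch.stack([photometric_error(img_key, reproject(img, plane, T, K))
                          for img, T in zip(imgs_src, poses_src)], dim=0).mean(dim=0)
        steps.append(1.0 - 2.0 * pe)                            # map consistency to [-1, 1]
    return torch.cat(steps, dim=1)                              # (B, M, H, W)
```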
| 3.3. Network Architecture                                      Figure 3: Auxiliary Training Masks: Examples of aux- | ||||
|                                                                iliary training masks from the training set that are used as | ||||
|    As shown in Figure 2, the proposed network architec-        reference. | ||||
| ture contains two sub-modules, namely, MaskModule and | ||||
| DepthModule.                                                   MaskModule refinement stage and the DepthModule refine- | ||||
|                                                                ment stage are executed successively. | ||||
| MaskModule MaskModule aims to predict a mask Mt | ||||
| where Mt(x) ∈ [0, 1] indicates the probability of a pixel      Bootstrapping In the bootstrapping stage, MaskModule | ||||
| x in It belonging to a moving object. Determining mov-         and DepthModule are trained separately. DepthModule | ||||
| ing objects from It alone is an ambiguous task and hard to     takes the non-masked C as the input and predicts Dt. The | ||||
| be generalizable. Therefore, we propose to use the set of      training objective of DepthModule is defined as a multi- | ||||
| cost volumes {Ct |t ∈ {1, · · · , N } \ t} which encode the    scale (s ∈ [0, 3]) semi-supervised loss. It combines a self- | ||||
| geometric priors between It and {It |t ∈ {1, · · · , N } \ t}  supervised photometric loss and an edge-aware smoothness | ||||
| respectively. We use Ct instead of C since the inconsis-       term, as proposed in [17], with a supervised sparse depth | ||||
| tent geometric information from different Ct is a strong       loss. | ||||
| prior for moving object prediction – dynamic pixels yield | ||||
| inconsistent optimal depth steps in different Ct . However,                       3 | ||||
| geometric priors alone are not enough to predict moving | ||||
| objects, since poorly-textured or non-Lambertian surfaces      Ldepth = Lself,s + αLsparse,s + βLsmooth,s. (5) | ||||
| can lead to inconsistencies as well. Furthermore, the cost | ||||
| volumes tend to reach a consensus on wrong depths that                          s=0 | ||||
| semantically don’t fit into the context of the scene for ob- | ||||
| jects that move at constant speed . Therefore, we further      The self-supervised loss is computed from the photometric | ||||
| leverage pre-trained ResNet-18 [21] features of It to en-      errors between the keyframe and the reprojected temporal | ||||
| code semantic priors in addition to the geometric ones. The    stereo and static stereo frames: | ||||
| network adapts a U-Net architecture design [42] with skip | ||||
| connections. All cost volumes are passed through the en-       Lself,s = min                   1  −  SSIM(Itt  , It) | ||||
| coders with shared weights. The features from different cost                                 λ | ||||
| volumes are aggregated using max-pooling and then passed                        t ∈t ∪{tS }          2 | ||||
| through the decoder. In this way, MaskModule can be ap-                                                                     (6) | ||||
| plied to different numbers of frames without retraining. | ||||
DepthModule: DepthModule predicts a dense pixel-wise inverse depth map D_t of I_t. To this end, the module receives the complete cost volume C concatenated with the keyframe I_t. Unlike MaskModule, here we use C instead of C_t' since multi-frame cost volumes in general lead to higher depth accuracy and robustness against photometric noise [37]. To eliminate wrong depth predictions for moving objects, we perform a pixel-wise multiplication between M_t and the cost volume C for every depth step d. This way, there won't be any maxima (i.e., strong priors) left in regions of moving objects, such that DepthModule has to rely on information from the image features and the surroundings to infer the depth of moving objects. We employ a U-Net architecture with multi-scale depth outputs from the decoder [17]. Finally, DepthModule outputs an interpolation factor between d_min and d_max. In practice, we use s = 4 scales of depth prediction.
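A minimal sketch of the masking step described above (hypothetical tensor shapes, not the released implementation): multiplying every depth slice of the cost volume by (1 − M_t) removes the strong geometric priors for pixels predicted as moving, so the decoder must rely on image context there.

```python
import torch

def mask_cost_volume(cost_volume, moving_mask):
    # cost_volume: (B, D, H, W), one slice per discrete depth step d.
    # moving_mask: (B, 1, H, W), values in [0, 1], 1 = probably moving.
    # Broadcasting multiplies every depth slice by (1 - M_t) so that no
    # maxima remain in regions of moving objects.
    return cost_volume * (1.0 - moving_mask)
```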
3.4. Multi-stage Training

In this section, we propose a multi-stage training scheme for the networks. Specifically, the bootstrapping stage, the MaskModule refinement stage, and the DepthModule refinement stage are executed successively.

Bootstrapping: In the bootstrapping stage, MaskModule and DepthModule are trained separately. DepthModule takes the non-masked C as the input and predicts D_t. The training objective of DepthModule is defined as a multi-scale (s ∈ [0, 3]) semi-supervised loss. It combines a self-supervised photometric loss and an edge-aware smoothness term, as proposed in [17], with a supervised sparse depth loss:

    L_depth = Σ_{s=0}^{3} ( L_self,s + α L_sparse,s + β L_smooth,s ).    (5)

The self-supervised loss is computed from the photometric errors between the keyframe and the reprojected temporal stereo and static stereo frames:

    L_self,s = min_{t' ∈ ({1,...,N} \ {t}) ∪ {t_S}} [ λ (1 − SSIM(I_t'^t, I_t)) / 2 + (1 − λ) ||I_t'^t − I_t||_1 ],    (6)

where λ = 0.85. Note that L_self,s takes the per-pixel minimum, which has been shown to be superior compared to the per-pixel average [17]. The sparse supervised depth loss is defined as

    L_sparse,s = ||D_t − D_VO||_1,    (7)

where the ground-truth sparse depth maps (D_VO) are obtained by a visual odometry system [59]. Note that all the supervision signals of DepthModule are generated from either the images themselves or the visual odometry system, without any manual labeling or LiDAR depth.
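The following single-scale sketch illustrates how the terms of Eqs. (5)-(7) fit together. It assumes an external per-pixel SSIM function and already-warped frames, and it omits the smoothness term, so it is an illustration of the loss structure rather than the training code.

```python
import torch

def photometric_error(warped, target, ssim_fn, lam=0.85):
    # Per-pixel photometric error used in Eq. (6):
    # lam * (1 - SSIM) / 2 + (1 - lam) * L1, averaged over color channels.
    # ssim_fn is assumed to return a per-pixel SSIM map of the same shape.
    ssim_term = (1.0 - ssim_fn(warped, target)) / 2.0
    l1_term = (warped - target).abs()
    per_pixel = lam * ssim_term + (1.0 - lam) * l1_term
    return per_pixel.mean(dim=1, keepdim=True)       # (B, 1, H, W)

def bootstrap_depth_loss(warped_frames, keyframe, pred_depth, sparse_depth,
                         ssim_fn, alpha=4.0):
    # Eq. (6): per-pixel *minimum* of the photometric errors over all
    # reprojected temporal-stereo and static-stereo frames.
    errors = torch.stack(
        [photometric_error(w, keyframe, ssim_fn) for w in warped_frames])
    l_self = errors.min(dim=0).values.mean()
    # Eq. (7): sparse L1 depth loss against the VO depth map, evaluated
    # only where the sparse map is valid (assumed encoded as > 0).
    valid = sparse_depth > 0
    l_sparse = (pred_depth - sparse_depth)[valid].abs().mean() \
        if valid.any() else pred_depth.sum() * 0.0
    return l_self + alpha * l_sparse                 # smoothness term omitted
```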
MaskModule is trained with the mask loss L_mask, which is the weighted binary cross entropy between the predicted mask M_t and the auxiliary ground-truth moving object mask M_aux. We generate M_aux by leveraging a pre-trained Mask R-CNN and the trained DepthModule as explained above. We first define the movable object classes, e.g., cars, cyclists, etc., and then obtain the instance segmentations of these object classes for the training images. A movable instance is classified as a moving instance if it has a high ratio of photometrically inconsistent pixels between temporal stereo and static stereo. Specifically, for each image, we predict its depth maps D_t and D_t^S using the cost volumes formed by the temporal stereo images, C, and the static stereo images, C_S, respectively. Then a pixel x is regarded as a moving pixel if two of the following three metrics are above predefined thresholds: (1) the static stereo photometric error using D_t, i.e., pe_{t_S}(x, D_t(x)); (2) the average temporal stereo photometric error using D_t^S, i.e., pe_{t'}(x, D_t^S(x)); (3) the difference between D_t(x) and D_t^S(x). Please refer to our supplementary materials for more details. Figure 3 shows some examples of the generated auxiliary ground-truth moving object masks.

Figure 3: Auxiliary Training Masks: Examples of auxiliary training masks from the training set that are used as reference.
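The two-out-of-three voting rule can be written down compactly. The threshold values below are placeholders (the paper defers the concrete numbers to its supplementary material), so this is only a sketch of the labeling logic.

```python
import torch

def label_moving_pixels(pe_static, pe_temporal, depth_t, depth_ts,
                        thr_static, thr_temporal, thr_depth):
    # A pixel is labeled as moving if at least two of the three metrics
    # exceed their (here hypothetical) thresholds:
    #  (1) static-stereo photometric error computed with D_t,
    #  (2) average temporal-stereo photometric error computed with D_t^S,
    #  (3) the absolute difference between D_t and D_t^S.
    votes = ((pe_static > thr_static).to(torch.int32)
             + (pe_temporal > thr_temporal).to(torch.int32)
             + ((depth_t - depth_ts).abs() > thr_depth).to(torch.int32))
    return votes >= 2        # boolean moving-object mask per pixel
```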
MaskModule Refinement: The bootstrapping stage for MaskModule is limited in two ways: (1) heavy augmentation is needed since mostly only a very small percentage of pixels on the image belongs to moving objects; (2) the auxiliary masks are not necessarily related to the geometric prior in the cost volume, which slows down the convergence. Therefore, to improve the mask prediction, we utilize the trained DepthModule from the bootstrapping stage. We leverage the fact that the depth prediction for moving objects, and consequently the photometric consistency, should be better with a static stereo prediction than with a temporal stereo one. Therefore, similar to the classification of moving pixels as explained in the previous section, we obtain D_t^S and D_t from two forward passes using C_S and C as inputs, respectively. Then we compute the static stereo photometric error L^S_self,s using D_t^S as depth and the temporal stereo photometric error L^T_self,s using D_t as depth. To train M_t, we interpret it as pixel-wise interpolation factors between L^S_self,s and L^T_self,s, and minimize the summation:

    L_m_ref = Σ_{s=0}^{3} [ M_t L^S_depth,s + (1 − M_t) L^T_depth,s ] + L_mask.    (8)

Figure 4: Refinement Losses: a) MaskModule refinement and b) DepthModule refinement loss functions. Dashed outlines denote that no gradient is being computed for the respective forward pass in the module.

Figure 4(a) shows the diagram illustrating the different loss terms. Note that we still add the supervised mask loss L_mask as a regularizer to stabilize the training. This way, the new gradients are directly related to the geometric structure in the cost volume and help to improve the mask prediction accuracy and alleviate the danger of overfitting.
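Read as code, Eq. (8) treats the predicted mask as a per-pixel blending weight between the two depth losses. The sketch below assumes the per-pixel loss maps have already been computed, with no gradient flowing back into DepthModule as in Figure 4(a); it is an illustration, not the released implementation.

```python
def mask_refinement_loss(mask, loss_depth_static, loss_depth_temporal,
                         loss_mask_bce):
    # mask: M_t in [0, 1]; loss_depth_static / loss_depth_temporal: per-pixel
    # depth losses computed from the static-stereo and temporal-stereo
    # forward passes of the (frozen) DepthModule; loss_mask_bce: weighted
    # binary cross entropy against the auxiliary mask, used as a regularizer.
    blended = mask * loss_depth_static + (1.0 - mask) * loss_depth_temporal
    return blended.mean() + loss_mask_bce
```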
DepthModule Refinement: The bootstrapping stage does not distinguish between the moving pixels and static pixels when training DepthModule. Therefore, we aim to refine DepthModule such that it is able to predict proper depths also for moving objects. The key idea is that, by utilizing M_t, only the static stereo loss is backpropagated for moving pixels, while for static pixels the temporal stereo, static stereo, and sparse depth losses are backpropagated. Because moving objects make up only a small percentage of all pixels in a keyframe, the gradients from the photometric error are rather weak. To solve this, we perform a further static stereo forward pass and use the resulting depth map D_t^S as a prior for moving objects. Therefore, as shown in Figure 4(b), the loss for refining DepthModule is defined as

    L_d_ref,s = (1 − M_t)(L_self,s + α L_sparse,s) + M_t (L^S_self,s + γ ||D_t − D_t^S||_1) + β L_smooth,s.    (9)
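A per-scale sketch of Eq. (9), assuming per-pixel loss maps and the default weights reported in the implementation details below; again an illustration rather than the released code. Static pixels keep the bootstrapping terms, while moving pixels are supervised by the static-stereo photometric error and pulled towards the static-stereo depth prior.

```python
def depth_refinement_loss(mask, l_self, l_sparse, l_self_static,
                          depth_t, depth_ts, l_smooth,
                          alpha=4.0, gamma=4.0, beta=1e-3):
    # mask: M_t; l_self / l_sparse: per-pixel temporal-stereo photometric and
    # sparse depth losses; l_self_static: per-pixel static-stereo photometric
    # loss; depth_t / depth_ts: depths from the temporal and static stereo
    # cost volumes; l_smooth: scalar edge-aware smoothness term.
    static_pixels = (1.0 - mask) * (l_self + alpha * l_sparse)
    moving_pixels = mask * (l_self_static + gamma * (depth_t - depth_ts).abs())
    return (static_pixels + moving_pixels).mean() + beta * l_smooth
```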
3.4.1 Implementation Details

The networks are implemented in PyTorch [38] with image size 512 × 256. For the bootstrapping stage, we train DepthModule for 70 epochs with learning rate lr = 1e−4 for the first 65 epochs and lr = 1e−5 for the remaining ones. MaskModule is trained for 60 epochs with lr = 1e−4. During MaskModule refinement, we train for 32 epochs with lr = 1e−4, and during DepthModule refinement we train for 15 epochs with lr = 1e−4 and another 4 epochs at lr = 1e−5. The hyperparameters α, β and γ are set to 4, 10^−3 × 2^−s and 4, respectively. For inference, MonoRec can achieve 10 fps with batch size 1 using 2 GB of memory.
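The learning-rate schedule described above amounts to a simple step decay. A minimal sketch with a stand-in module (the real DepthModule and its data pipeline are of course more involved) mirrors the reported bootstrapping settings:

```python
import torch

# Stand-in for DepthModule; the schedule follows the reported bootstrapping
# settings: 70 epochs at lr = 1e-4, dropped to 1e-5 after epoch 65.
model = torch.nn.Conv2d(32, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[65], gamma=0.1)

for epoch in range(70):
    # ... one full training epoch of DepthModule would run here ...
    scheduler.step()
```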
4. Experiments

To evaluate the proposed method, we first compare against state-of-the-art monocular depth prediction and MVS methods with our train/test split of the KITTI dataset [15]. Then, we perform extensive ablation studies to show the efficacy of our design choices. In the end, we demonstrate the generalization capabilities of different methods on Oxford RobotCar [35] and TUM-Mono [9] using the model trained on KITTI.

Figure 5: Qualitative Results on KITTI: The upper part of the figure shows the results for a selected number of frames from the KITTI test set. The compared PackNet model was trained in a semi-supervised fashion using LiDAR as the ground truth. Besides the depth maps, we also show the 3D point clouds by reprojecting the depth and viewing from two different perspectives. For comparison we show the LiDAR ground truth from the corresponding perspectives. Our method clearly shows the best prediction quality. The lower part of the figure shows large scale reconstructions as point clouds accumulated from multiple frames. The red insets depict the reconstructed artifacts from moving objects. With the proposed MaskModule, we can effectively filter out the moving objects to avoid those artifacts in the final reconstruction.
Method                      Training  Dataset          Input   Abs Rel  Sq Rel  RMSE   RMSElog  δ < 1.25  δ < 1.25²  δ < 1.25³
Colmap [44] (geometric)     -         -                KF + 2  0.099    3.451   5.632  0.184    0.952     0.979      0.986
Colmap [44] (photometric)   -         -                KF + 2  0.190    6.826   7.781  0.531    0.893     0.932      0.947
Monodepth2 [17]             MS        Eigen Split      KF      0.082    0.405   3.129  0.127    0.931     0.985      0.996
PackNet [20]                MS        CS+Eigen Split   KF      0.080    0.331   2.914  0.124    0.929     0.987      0.997
PackNet [20]                MS, D     CS+Eigen Split   KF      0.077    0.290   2.688  0.118    0.935     0.988      0.997
DORN [11]                   D         Eigen Split      KF      0.077    0.290   2.723  0.113    0.949     0.988      0.996
DeepMVS [23]                D                          KF + 2  0.103    1.160   3.968  0.166    0.896     0.947      0.978
DeepMVS [23] (pretr.)       D         Odom. Split      KF + 2  0.088    0.644   3.191  0.146    0.914     0.955      0.982
DeepTAM [66] (only FB)      MS, D*    Odom. Split      KF + 2  0.059    0.474   2.769  0.096    0.964     0.987      0.994
DeepTAM [66] (1x Ref.)      MS, D*    Odom. Split      KF + 2  0.053    0.351   2.480  0.089    0.971     0.990      0.995
MonoRec                     MS, D*    Odom. Split      KF + 2  0.050    0.295   2.266  0.082    0.973     0.991      0.996

Table 1: Quantitative Results on KITTI: Comparison between MonoRec and other methods on our KITTI test set. The Dataset column shows the training dataset used by the corresponding method; note that the Eigen split is a superset of our odometry split. Best / second best results are marked bold / underlined. The evaluation results show that our method achieves the best overall performance. Legend: M: monocular images, S: stereo images, D: GT depth, D*: depths from DVSO, KF: keyframe, KF + 2: keyframe + 2 mono frames, CS: Cityscapes [4], pretr.: pretrained network, FB: fixed band module of DeepTAM, Ref.: narrow band refinement module of DeepTAM.
Figure 6: Qualitative Improvement: Effects of cost volume masking and depth refinement. (a) Keyframe, (b) w/o MaskModule, (c) MaskModule, (d) MaskModule + D. Ref.

4.1. The KITTI Dataset

The Eigen split [6] is the most popular training/test split for evaluating depth estimation on KITTI. We cannot make use of it directly since MonoRec requires temporally continuous images with estimated poses. Hence, we select our training/testing splits as the intersection between the KITTI Odometry benchmark and the Eigen split, which results in 13714/8634 samples for training/testing. We obtain the relative poses between the images from the monocular VO system DVSO [59]. During training, we also leverage the point clouds generated by DVSO as the sparse depth supervision signals. For training MaskModule we only use images that contain moving objects in the generated auxiliary masks, 2412 in total. For all the following evaluation results we use the improved ground truth [50] and cap depths at 80 m.

We first compare our method against the recent state of the art, including an optimization-based method (Colmap), self-supervised monocular methods (Monodepth2 and PackNet), a semi-supervised monocular method using sparse LiDAR data (PackNet), a supervised monocular method (DORN), and MVS methods (DeepMVS and DeepTAM), shown in Table 1. Note that the training code of DeepTAM was not published; we therefore implemented it ourselves for training and testing using our split to deliver a fair comparison. Our method outperforms all the other methods by a notable margin despite relying on images only, without using LiDAR ground truth for training.

This is also clearly reflected in the qualitative results shown in Figure 5. Compared with monocular depth estimation methods, our method delivers very sharp edges in the depth maps and can recover finer details. In comparison to the other MVS methods, it can better deal with moving objects, which is further illustrated in Figure 7.

A single depth map usually cannot really reflect the quality for large scale reconstruction. We therefore also visualize the accumulated points using the depth maps from multiple frames in the lower part of Figure 5. We can see that our method can deliver very high quality reconstruction and, due to our MaskModule, is able to remove artifacts caused by moving objects. We urge readers to watch the supplementary video for more convincing comparisons.

Ablation Studies. We also investigated the contribution of the different components towards the method's performance. Table 2 shows quantitative results of our ablation studies, which confirm that all our proposed contributions improve the depth prediction over the baseline method. Furthermore, Figure 6 demonstrates the qualitative improvement achieved by MaskModule and refinement training.
Model     SSIM  MaskModule  D. Ref.  M. Ref.   Abs Rel  Sq Rel  RMSE   RMSElog  δ < 1.25  δ < 1.25²  δ < 1.25³
Baseline                                        0.056    0.342   2.624  0.092    0.965     0.990      0.994
Baseline                                        0.054    0.346   2.444  0.088    0.970     0.989      0.995
MonoRec                                         0.054    0.306   2.372  0.087    0.970     0.990      0.995
MonoRec                                         0.051    0.346   2.361  0.085    0.972     0.990      0.995
MonoRec                                         0.052    0.302   2.303  0.087    0.969     0.990      0.995
MonoRec                                         0.050    0.295   2.266  0.082    0.973     0.991      0.996

Table 2: Ablation Study: Baseline consists of only DepthModule using the unmasked cost volume (CV). Baseline without SSIM uses a 5x5 patch that has the same receptive field as SSIM. Using SSIM to form the CV gives a significant improvement. For MonoRec, only the addition of MaskModule without refinement does not yield significant improvements. The DepthModule refinement gives a major improvement. The best performance is achieved by combining all the proposed components.
Figure 7: Comparison on Moving Objects Depth Estimation (columns: keyframe, MonoRec, mask prediction, DeepTAM, DeepMVS): In comparison to other MVS methods, MonoRec is able to predict plausible depths. Furthermore, the depth prediction has less noise and artifacts in static regions of the scene.
4.2. Oxford RobotCar and TUM-Mono

To demonstrate the generalization capabilities of MonoRec, we test our KITTI model on the Oxford RobotCar dataset and the TUM-Mono dataset. Oxford RobotCar is a street-view dataset and shows a similar motion pattern and view perspective to KITTI. TUM-Mono, however, is recorded by a handheld monochrome camera, so it demonstrates very different motion and image quality compared to KITTI. The results are shown in Figure 8. The monocular methods struggle to generalize to a new context. The compared MVS methods show more artifacts and cannot predict plausible depths for the moving objects. In contrast, our method is able to generalize well to the new scenes for both depth and moving object predictions. Since Oxford RobotCar also provides LiDAR depth data, we further show a quantitative evaluation in the supplementary material.

Figure 8: Oxford RobotCar and TUM-Mono: All results are obtained by the respective best-performing variant in Table 1. MonoRec shows stronger generalization capability than the monocular methods. Compared to DeepMVS and DeepTAM, MonoRec delivers depth maps with fewer artifacts and predicts the moving object masks in addition.

5. Conclusion

We have presented MonoRec, a deep architecture that estimates accurate dense 3D reconstructions from only a single moving camera. We first propose to use SSIM as the photometric measurement to construct the cost volumes. To deal with dynamic objects, we propose a novel MaskModule which predicts moving object masks from the input cost volumes. With the predicted masks, the proposed DepthModule is able to estimate accurate depths for both static and dynamic objects. Additionally, we propose a novel multi-stage training scheme together with a semi-supervised loss formulation for training the depth prediction. All combined, MonoRec is able to outperform the state-of-the-art MVS and monocular depth prediction methods both qualitatively and quantitatively on KITTI and also shows strong generalization capability on Oxford RobotCar and TUM-Mono. We believe that this capacity to recover accurate dense 3D reconstructions from a single moving camera will help to establish the camera as the lead sensor for autonomous systems.

Acknowledgement: This work was supported by the Munich Center for Machine Learning and by the ERC Advanced Grant SIMULACRON.
References

[1] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. CodeSLAM - learning a compact, optimisable representation for dense visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2560–2568, 2018.
[2] Neill D. F. Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European Conference on Computer Vision (ECCV), pages 766–779, 2008.
[3] Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In International Conference on Computer Vision (ICCV), 2019.
[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[5] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters (RA-L), 5(2):721–728, 2020.
[6] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In International Conference on Computer Vision (ICCV), pages 2650–2658, 2015.
[7] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NIPS), 2014.
[8] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(3):611–625, 2018.
[9] Jakob Engel, Vladyslav Usenko, and Daniel Cremers. A photometrically calibrated benchmark for monocular visual odometry. arXiv, July 2016.
[10] Alejandro Fontan, Javier Civera, and Rudolph Triebel. Information-driven direct RGB-D odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2020.
[11] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2002–2011, 2018.
[12] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 1362–1376, 2010.
[13] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In International Conference on Computer Vision (ICCV), 2015.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), pages 1229–1235, 2013.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012.
[16] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In International Conference on Computer Vision (ICCV), pages 3828–3838, 2019.
[18] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In International Conference on Computer Vision (ICCV), 2019.
[19] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[20] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2485–2494, 2020.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[22] Yuxin Hou, Juho Kannala, and Arno Solin. Multi-view stereo by temporal nonparametric fusion. In International Conference on Computer Vision (ICCV), 2019.
[23] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2821–2830, 2018.
[24] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end deep plane sweep stereo. In International Conference on Learning Representations (ICLR), 2019.
[25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
[26] Mengqi Ji, Jürgen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In International Conference on Computer Vision (ICCV), pages 2326–2334, 2017.
[27] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Neural Information Processing Systems (NIPS), pages 364–375, 2017.
[28] Kiriakos N. Kutulakos and Steven M. Seitz. A theory of shape by space carving. In International Conference on Computer Vision (ICCV), 1999.
[29] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), 2016.
[30] Maxime Lhuillier and Long Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 418–433, 2005.
[31] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[32] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4521–4530, 2019.
[33] Keyang Luo, Tao Guan, Lili Ju, Haipeng Huang, and Yawei Luo. P-MVSNet: Learning patch-wise matching confidence aggregation for multi-view stereo. In International Conference on Computer Vision (ICCV), 2019.
[34] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics, 39(4), 2020.
[35] Will Maddern, Geoff Pascoe, Chris Linegar, and Paul Newman. 1 Year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotics Research (IJRR), 36(1):3–15, 2017.
[36] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3D scene reconstruction from posed images. In European Conference on Computer Vision (ECCV), 2020.
[37] Richard A. Newcombe, Steven J. Lovegrove, and Andrew J. Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), 2011.
[38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems (NeurIPS), pages 8026–8037, 2019.
[39] Matia Pizzoli, Christian Forster, and Davide Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. In IEEE International Conference on Robotics and Automation (ICRA), 2014.
[40] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun. Dense monocular depth estimation in complex dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4058–4066, 2016.
[41] Andrea Romanoni and Matteo Matteucci. TAPA-MVS: Textureless-aware PAtchMatch multi-view stereo. In International Conference on Computer Vision (ICCV), 2019.
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
[43] Chris Russell, Rui Yu, and Lourdes Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. In European Conference on Computer Vision (ECCV), pages 583–598. Springer, 2014.
[44] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[45] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), pages 501–518, 2016.
[46] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
[47] Jan Stühmer, Stefan Gumhold, and Daniel Cremers. Real-time dense geometry from a handheld camera. In DAGM Conference on Pattern Recognition, pages 11–20, 2010.
[48] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[49] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications (MVA), pages 903–920, 2011.
[50] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
[51] Vladyslav Usenko, Nikolaus Demmel, David Schubert, Jörg Stückler, and Daniel Cremers. Visual-inertial mapping with non-linear factor recovery. IEEE Robotics and Automation Letters (RA-L), 5(2):422–429, 2020.
[52] Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[53] Kaixuan Wang and Shaojie Shen. MVDepthNet: Real-time multiview depth estimation neural network. In International Conference on 3D Vision (3DV), 2018.
[54] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[55] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[56] Youze Xue, Jiansheng Chen, Weitao Wan, Yiqing Huang, Cheng Yu, Tianpeng Li, and Jiayu Bao. MVSCRF: Learning multi-view stereo with conditional random fields. In International Conference on Computer Vision (ICCV), 2019.
[57] Jiayu Yang, Wei Mao, Jose M. Alvarez, and Miaomiao Liu. Cost volume pyramid based depth inference for multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[58] Nan Yang, Lukas von Stumberg, Rui Wang, and Daniel Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[59] Nan Yang, Rui Wang, Jörg Stückler, and Daniel Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In European Conference on Computer Vision (ECCV), pages 817–833, 2018.
[60] Yao Yao, Shiwei Li, Siyu Zhu, Hanyu Deng, Tian Fang, and Long Quan. Relative camera refinement for accurate dense reconstruction. In International Conference on 3D Vision (3DV), 2017.
[61] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), pages 785–801, 2018.
[62] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent MVSNet for high-resolution multi-view stereo depth inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[63] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[64] Zehao Yu and Shenghua Gao. Fast-MVSNet: Sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[65] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian M. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[66] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep tracking and mapping. In European Conference on Computer Vision (ECCV), pages 822–838, 2018.
[67] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
|  | ||||
| @ -0,0 +1,738 @@ | ||||
Received December 21, 2020, accepted January 6, 2021, date of publication January 11, 2021, date of current version February 10, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3050617

RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods

YUBAO LIU AND JUN MIURA, (Member, IEEE)
Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi 441-8580, Japan

Corresponding author: Yubao Liu (yubao.liu.ra@tut.jp)

This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 17H01799.
ABSTRACT Scene rigidity is a strong assumption in typical visual Simultaneous Localization and Mapping (vSLAM) algorithms. Such a strong assumption limits the use of most vSLAM systems in dynamic real-world environments, which are the target of several relevant applications such as augmented reality, semantic mapping, unmanned autonomous vehicles, and service robotics. Many solutions have been proposed that use different kinds of semantic segmentation methods (e.g., Mask R-CNN, SegNet) to detect dynamic objects and remove outliers. However, to the best of our knowledge, such methods wait for the semantic results in the tracking thread of their architecture, and the processing time therefore depends on the segmentation method used. In this paper, we present RDS-SLAM, a real-time visual dynamic SLAM algorithm that is built on ORB-SLAM3 and adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real time. These novel threads run in parallel with the others, so the tracking thread no longer needs to wait for the semantic information. Besides, we propose an algorithm to obtain the latest available semantic information, thereby making it possible to use segmentation methods with different speeds in a uniform way. We update and propagate semantic information using the moving probability, which is saved in the map and used to remove outliers from tracking using a data association algorithm. Finally, we evaluate the tracking accuracy and real-time performance using the public TUM RGB-D datasets and a Kinect camera in dynamic indoor scenarios. Source code and demo: https://github.com/yubaoliu/RDS-SLAM.git

INDEX TERMS Dynamic SLAM, ORB SLAM, Mask R-CNN, SegNet, real-time.
I. INTRODUCTION

Simultaneous localization and mapping (SLAM) [1] is a fundamental technique for many applications such as augmented reality (AR), robotics, and unmanned autonomous vehicles (UAV). Visual SLAM (vSLAM) [2] uses the camera as the input and is useful in scene understanding and decision making. However, the strong assumption of scene rigidity limits the use of most vSLAM systems in real-world environments. Dynamic objects will cause many bad or unstable data associations that accumulate drift during the SLAM process. In Fig. 1, for example, assume m1 is on a person and its position changes in the scene. The bad or unstable data associations (the red lines in Fig. 1) will lead to incorrect camera ego-motion estimation in dynamic environments. Usually, there are two basic requirements for vSLAM: robustness in tracking and real-time performance. Therefore, the challenge in allowing vSLAM to be deployed in the real world is how to detect dynamic objects in a populated scene and prevent the tracking algorithm from using data associations related to such dynamic objects in real time.

FIGURE 1. Example of data association in vSLAM under a dynamic scene. F_t (t ≥ 0) is the frame and KF_t is the selected keyframe. m_i, i ∈ {0, 1, ...} is the map point. Assume m1 moved to the new position m1' because it belongs to a moving object. The red line indicates the unstable or bad data association.

We classify the solutions into two classes: pure geometric-based [3]–[7] and semantic-based [8]–[13] methods. The geometric-based approaches cannot remove all potential dynamic objects, e.g., people who are sitting. Features on such objects are unreliable and also need to be removed from tracking and mapping. The semantic-based methods use semantic segmentation or object detection approaches to obtain pixel-wise masks or bounding boxes of potential dynamic objects. Sitting people can be detected and removed from tracking and mapping using the semantic information, and a map of static objects can be built. Usually, in semantic-based methods, geometric checks, such as Random Sample Consensus (RANSAC) [14] and multi-view geometry, are also used to remove outliers.

These semantic-based methods first detect or segment objects and then remove outliers from tracking. The tracking thread has to wait for the semantic information before tracking
(camera ego-motion estimation), which is called the blocked model in this paper (as shown in Fig. 2). Their processing speed is limited by the time consumed by the semantic segmentation method used. For example, Mask R-CNN requires about 200 ms [15] to segment one image, and this will limit the real-time performance of the entire system.

FIGURE 2. Blocked model. The semantic model can use different kinds of segmentation methods, e.g., Mask R-CNN and SegNet. Note that this is not exactly the same as the semantic-based methods mentioned in [8]–[13]. The tracking process is blocked to wait for the results of the semantic model.

Our main challenge is how to execute vSLAM in real time under dynamic scenes with various pixel-wise semantic segmentation methods that run at different speeds, such as SegNet and Mask R-CNN. We propose a semantic thread to wait for the semantic information. It runs in parallel with the tracking thread, and the tracking thread does not need to wait for the segmentation result. Therefore, the tracking thread can execute in real time. We call this the non-blocked model in this paper. Faster segmentation methods (e.g., SegNet) can update semantic information more frequently than slower methods (e.g., Mask R-CNN). Although we cannot control the segmentation speed, we can use a strategy to obtain the latest available semantic information to remove outliers from the current frame.
|    Because the semantic thread runs in parallel with the track-                   our implementation. | ||||
| ing thread, we use the map points to save and share the | ||||
| semantic information. As shown in Fig. 1, we update and                           1) KEYFRAME | ||||
| propagate semantic information using the moving probability                       Keyframes [18] is a subset of selected frames to avoid | ||||
| and classify map points into three categories, static, dynamic,                   unnecessary redundancy in tracking and optimization. Each | ||||
| and unknown, according to the moving probability threshold. | ||||
|                                                                                                                                                                                  23773 | ||||
| VOLUME 9, 2021 | ||||
| Y. Liu, J. Miura: RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods | ||||
| 
 | ||||
Each keyframe stores 1) a rigid body transformation of the camera pose that transforms points from the world coordinate system to the camera coordinate system, and 2) ORB features, associated or not to a map point. In this paper, keyframes are selected by the same policy as in ORB-SLAM3; a keyframe is selected if all the following conditions are met: 1) 20 frames have passed since the last global relocalization or the last keyframe insertion; 2) the local mapping thread is idle; 3) the current frame tracks at least 50 points and less than 90% of the points of the reference keyframe.
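As an illustration only, the selection policy above could be checked as in the following C++ sketch; the struct and field names are invented for the example and are not taken from the ORB-SLAM3 or RDS-SLAM code.

    // Hypothetical sketch of the ORB-SLAM3-style keyframe selection policy
    // described above. All names are illustrative.
    struct TrackingState {
        int  framesSinceRelocOrKeyframe;  // frames since last relocalization or keyframe
        bool localMappingIdle;            // is the local mapping thread idle?
        int  trackedPoints;               // points tracked in the current frame
        int  refKeyframePoints;           // points tracked by the reference keyframe
    };

    bool NeedNewKeyFrame(const TrackingState& s) {
        const bool enoughFramesPassed = s.framesSinceRelocOrKeyframe >= 20;      // condition 1
        const bool mappingIdle        = s.localMappingIdle;                      // condition 2
        const bool enoughButFewerThanRef =
            s.trackedPoints >= 50 &&
            s.trackedPoints < 0.9 * s.refKeyframePoints;                         // condition 3
        return enoughFramesPassed && mappingIdle && enoughButFewerThanRef;
    }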
2) COVISIBILITY GRAPH
The covisibility graph [16] is an undirected weighted graph in which each node is a keyframe and each edge holds the number of commonly observed map points.

3) ATLAS
The Atlas [19] is a multi-map representation that handles an unlimited number of sub-maps. Two kinds of maps, the active map and the non-active maps, are managed in the atlas. When camera tracking is considered lost and relocalization has failed for a few frames, the active map becomes a non-active map and a new map is initialized. In the atlas, keyframes and map points are managed using the covisibility graph and the spanning tree.

4) BUNDLE ADJUSTMENT (BA)
BA [21] is the problem of refining a visual reconstruction to produce jointly optimal estimates of the 3D structure and the viewing parameters. Local BA is used in the local mapping thread to optimize only the camera pose. Loop closing launches a thread to perform full BA after the pose-graph optimization to jointly optimize the camera poses and the corresponding landmarks.

B. GEOMETRIC-BASED SOLUTIONS
Li et al. [5] proposed a real-time depth-edge-based RGB-D SLAM system for dynamic environments based on frame-to-keyframe registration. They only use depth edge points, which have an associated weight indicating their probability of belonging to a dynamic object. Sun et al. [6] classify pixels using the segmentation of the quantized depth image and calculate the intensity difference between consecutive RGB images. Tan et al. [3] propose a novel online keyframe representation and updating method to adaptively model dynamic environments. The camera pose can be estimated reliably even in challenging situations using a novel prior-based adaptive RANSAC algorithm to efficiently remove outliers.

   Although geometric-based vSLAM solutions for dynamic environments can restrict the effect of dynamic objects to some extent, they have some limitations: 1) they cannot detect potentially dynamic objects that temporarily keep static; 2) they lack semantic information, so dynamic objects cannot be judged using priori knowledge of the scene.

C. SEMANTIC-BASED SOLUTIONS
DS-SLAM [10], implemented on ORB-SLAM2 [16], combines a semantic segmentation network (SegNet [22]) with a moving consistency check to reduce the impact of dynamic objects and produces a dense semantic octree map [23]. DS-SLAM assumes that the feature points on people are most likely to be outliers. If a person is determined to be static, the matching points on that person can also be used to predict the pose of the camera.

   DynaSLAM [9], also built on ORB-SLAM2, is robust in dynamic scenarios for monocular, stereo, and RGB-D datasets by adding the capabilities of dynamic object detection and background inpainting. It can detect moving objects either by multi-view geometry, deep learning, or both, and inpaints the frame background occluded by dynamic objects using a static map of the scene. It uses Mask R-CNN to segment out all the priori dynamic objects, such as people or vehicles. DynaSLAM II [24] tightly integrates a multi-object tracking capability, but it only works for rigid objects; in the dynamic scenes of the TUM [25] dataset, however, people change their shape by sometimes standing and sometimes sitting.

   Detect-SLAM [12], also built on ORB-SLAM2, integrates visual SLAM with a single-shot multi-box detector (SSD) [26] to make the two functions mutually beneficial. They call the probability of a feature point belonging to a moving object the moving probability, and they distinguish keypoints into four states: high-confidence static, low-confidence static, low-confidence dynamic, and high-confidence dynamic. Considering the delay of detection and the spatio-temporal consistency of successive frames, they only run SSD on the color images of keyframes, meanwhile propagating the probability frame-by-frame in the tracking thread. Once the detection result is obtained, they insert the keyframe into the local map, update the moving probability on the local map, and then update the moving probability of the 3D points in the local map that are matched with the keyframe.

   DM-SLAM [11] combines Mask R-CNN, optical flow, and an epipolar constraint to judge outliers. Its Ego-motion Estimation module estimates the initial pose of the camera, similar to the low-cost tracking module in DynaSLAM. DM-SLAM also uses features on priori dynamic objects, if they are not moving heavily, to reduce the feature-less cases caused by removing all priori dynamic objects.

   Fan et al. [8] proposed a novel semantic SLAM system with a more accurate point cloud map in dynamic environments; they use BlitzNet [27] to obtain the masks and bounding boxes of the dynamic objects in the image.

   All these methods use the blocked model: they wait for the semantic result of every frame or keyframe before estimating the camera pose. As a result, their processing speed is limited by the specific CNN models they use. In this paper, we propose RDS-SLAM, which uses the non-blocked model, and show its real-time performance by comparing it with these methods.
| 
 | ||||
FIGURE 3. System architecture. Models in orange are modified blocks based on ORB-SLAM3. Models in magenta are newly added features. Blocks in blue are important data structures.

III. SYSTEM OVERVIEW
Each frame first passes through the tracking thread. The initial camera pose is estimated for the current frame by tracking against the last frame, and it is further optimized by tracking against the local map. Then, keyframes are selected, and they are used in semantic tracking, semantic-based optimization, and the local mapping thread. We modify several models in the tracking and local mapping threads to remove outliers from camera ego-motion estimation using the semantic information. In the tracking thread, we propose a data association algorithm to use as many features on static objects as possible.

   The semantic thread runs in parallel with the others, so as not to block the tracking thread, and saves the semantic information into the atlas. Semantic labels are used to generate the mask image of the priori dynamic objects. The moving probability of the map points matched with features in the keyframes is updated using the semantic information. Finally, the camera pose is optimized using the semantic information in the atlas.

   We will introduce the new features and modified models in the following sections. We skip the detailed explanations of the modules that are the same as those of ORB-SLAM3.
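The non-blocked design can be pictured with the following minimal C++ sketch (our own illustration, not the actual RDS-SLAM implementation): the semantic thread is launched alongside tracking, and the two exchange information only through the shared atlas.

    #include <mutex>
    #include <thread>

    // Illustrative sketch only: the tracking and semantic threads run in
    // parallel and communicate solely through the shared atlas, so tracking
    // never blocks on segmentation. All names are ours.
    struct Atlas {
        std::mutex mtx;
        // map points with moving probabilities, keyframe list KF, ...
    };

    void TrackingLoop(Atlas& atlas) {
        // estimate the pose of the incoming frame using the moving
        // probabilities currently stored in the atlas, then decide whether
        // to insert a new keyframe into KF
        std::lock_guard<std::mutex> lock(atlas.mtx);
        // ... track last frame / track local map ...
    }

    void SemanticLoop(Atlas& atlas) {
        // select semantic keyframes, run the (slow) segmentation model, and
        // write the updated moving probabilities back into the atlas
        std::lock_guard<std::mutex> lock(atlas.mtx);
        // ... update moving probabilities ...
    }

    int main() {
        Atlas atlas;
        std::thread tracking(TrackingLoop, std::ref(atlas));  // never waits for the CNN
        std::thread semantic(SemanticLoop, std::ref(atlas));  // runs at its own pace
        tracking.join();
        semantic.join();
        return 0;
    }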
| 
 | ||||
IV. SEMANTIC THREAD
The semantic thread is responsible for generating semantic information and updating it into the atlas map. Before we introduce the detailed implementation of the semantic thread, we use a simple example to explain the general flow, as shown in Fig. 4. Assume the keyframes are selected every two frames. The keyframes are selected by ORB-SLAM3 and we insert them into a keyframe list KF sequentially. Assume that, at time t = 12, KF2-KF6 are inside KF. The next step is to select keyframes from KF to request semantic labels from the semantic server. We call this process the semantic keyframe selection process in this paper. We take one keyframe from the head of KF (KF2) and one from the back of KF (KF6) to request the semantic labels. Then, we calculate the mask of the priori dynamic objects using the semantic labels S2 and S6. Next, we update the moving probability of the map points stored in the atlas. The moving probability will be used later to remove outliers in the tracking thread.

   Alg. 1 shows the detailed implementation of the semantic thread. The first step is to select semantic keyframes from the keyframe list KF (Line 2). Next, we request semantic labels from the semantic model and obtain the returned labels SLs (Line 3). Lines 4-8 save and process the semantic result for each item returned: Line 6 generates the mask image of dynamic objects and Line 7 updates the moving probability stored in the atlas. We will introduce each sub-module of the semantic thread sequentially (see Fig. 3).
| 
 | ||||
FIGURE 4. Semantic tracking example. Assume keyframes KFn are selected every two frames Fn and inserted into the keyframe list KF. We choose keyframes from KF to request semantic labels Sn. Then we update the moving probability in the atlas using the mask image of dynamic objects reproduced from the semantic label. Blue circles stand for static map points, red circles for dynamic map points, and green ones are unknown.
| 
 | ||||
Algorithm 1 Semantic Tracking Thread
Require: KeyFrame list: KF
 1: while not_request_finish() do
 2:   SK = semantic_keyframe_selection(KF)
 3:   SLs = request_segmentation(SK)
 4:   for i = 0; i < SLs.size(); i++ do
 5:     KeyFrame kf = SK[i]
 6:     kf->mask = GenerateMaskImage(SLs[i])
 7:     kf->UpdatePrioriMovingProbability()
 8:   end for
 9: end while

A. SEMANTIC KEYFRAME SELECTION ALGORITHM
The semantic keyframe selection algorithm selects the keyframes for which semantic labels will be requested. We need to keep the real-time performance while using different kinds of semantic segmentation methods. However, some of them, such as Mask R-CNN, are time-consuming, and the current frame in tracking may not obtain new semantic information if we segment every keyframe sequentially.

   To evaluate this distance quantitatively, we define the semantic delay d as the distance between the id of the latest frame that has a semantic label (St), i.e., that holds the latest semantic information, and the id of the current frame (Ft):

    d = FrameID(Ft) − FrameID(St).   (1)

   Fig. 5 shows the semantic delay for several cases. The general idea is to segment each frame or keyframe sequentially, according to the time sequence, as shown in Fig. 5 (a). We call this kind of model the sequential segmentation model. However, this monotonically increases the time delay when using time-consuming segmentation methods, as shown by the blue line in Fig. 6. For instance, at time t = 10 (F10), the semantic model has completed the segmentation of KF0 (F0) and the semantic delay is d = 10. Similarly, at time t = 40 (F40), the semantic delay becomes 34. That is, the last frame that has semantic information is 34 frames behind the current frame, so the current frame cannot obtain the latest semantic information.

   To shorten this distance, suppose that we segment two frames at the same time (Fig. 5 (b)). Then the delay becomes 12 − 2 = 10 if KF0 and KF1 are segmented at the same time. The delay still grows linearly, as shown by the red line in Fig. 6.

   To further shorten the semantic delay, we use a bi-directional model. We do not segment keyframes sequentially. Instead, we perform semantic segmentation using keyframes taken both from the front and from the back of the list, to use the latest semantic information possible, as shown in Fig. 5 (c) and by the yellow line in Fig. 6. The semantic delay becomes a constant value. In practice, the delay of the bi-directional model is not always 10; the distance is influenced by the segmentation method used, the frequency of keyframe selection, and the processing speed of the related threads.
| 
 | ||||
FIGURE 5. Bi-direction model vs. sequential model. Assume we use Mask R-CNN (200 ms) and ORB-SLAM3 (20 ms), and a keyframe is selected every two frames. There is a delay of about 200/20 = 10 frames while waiting for the semantic result.
FIGURE 6. Semantic delay of the sequential model vs. the bi-direction model.
FIGURE 7. Semantic time line. The left side shows the contents of the keyframe list KF and the right side shows the time line of requesting semantic labels. A keyframe in green has already obtained its semantic information in a previous round.

The left side of Fig. 7 shows a semantic keyframe selection example and the right side of Fig. 7 shows the time line of requesting semantic information from the semantic model/server. We take keyframes from both the head and the back of KF to request the semantic labels. (Round 1) At time t = 2, two keyframes, KF0 and KF1, are selected. Segmentation finishes at t = 12. By this time, new keyframes have been selected and inserted into KF (see Round 2). Then we take two elements, KF2 from the front and KF6 from the back, to request the semantic labels. At time t = 22, we receive the semantic result and continue with the next round (Round 3).

   We can obtain relatively new information if we segment the keyframe at the tail of the KF list. Then why do we also need to segment the keyframe at the front of the list? Different from the blocked model, there is no semantic information for the first few frames (about 10 frames if Mask R-CNN is used) in our method. Since the processing speed of the tracking thread is usually faster than that of the semantic thread, vSLAM may have already accumulated large errors because of the dynamic objects. Therefore, we need to correct these drift errors using the semantic information by popping out the keyframes at the front of the KF list sequentially and feeding them to the semantic-based optimization thread to correct/optimize the camera poses.
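A minimal sketch of this bi-directional selection, assuming KF is kept as a double-ended queue and ignoring the bookkeeping of keyframes that were already segmented in a previous round (names are ours), is shown below.

    #include <deque>
    #include <vector>

    // Sketch of the bi-directional semantic keyframe selection: one keyframe
    // is taken from the front of KF, whose label is used to correct already
    // accumulated drift, and one from the back, whose label provides the
    // newest information for tracking.
    struct KeyFrame;  // defined elsewhere in the SLAM system

    std::vector<KeyFrame*> SelectSemanticKeyFrames(const std::deque<KeyFrame*>& KF) {
        std::vector<KeyFrame*> selected;
        if (KF.empty()) return selected;
        selected.push_back(KF.front());        // oldest pending keyframe
        if (KF.size() > 1)
            selected.push_back(KF.back());     // newest keyframe
        return selected;
    }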
| 
 | ||||
B. SEMANTIC SEGMENTATION
In our experiments, we use two models with different speeds, Mask R-CNN (slower) and SegNet (faster), as shown in Fig. 8. Mask R-CNN [15] is trained on MS COCO [28], which has both pixel-wise semantic segmentation results and instance labels; we implemented it based on the TensorFlow version of Matterport.1 SegNet [22], implemented using Caffe,2 is trained on the PASCAL VOC 2012 [29] dataset, where 20 classes are offered. We did not refine the networks using the TUM dataset because SLAM usually runs in an unknown environment.

1 https://github.com/matterport/Mask_RCNN
2 https://github.com/alexgkendall/SegNet-Tutorial

C. SEMANTIC MASK GENERATION
We merge all the binary mask images of the instance segmentation results into one mask image that is used to generate the mask image (Fig. 8) of people. Then we calculate the priori moving probability of map points using the mask. In practice, since the segmentation on object boundaries is sometimes unreliable, the features on the boundaries cannot be detected as outliers if the mask image is applied directly, as shown in Fig. 9 (a). Therefore, we dilate the mask using a morphological filter to include the edge of dynamic objects, as shown in Fig. 9 (b).
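A possible OpenCV sketch of this merge-and-dilate step is shown below; the kernel shape and size are assumptions, since they are not stated above, and the function name is ours.

    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Rough sketch of the mask handling described in Section IV-C. Assumed
    // details: 8-bit binary masks where non-zero pixels belong to a person,
    // and a 15x15 rectangular kernel (the real kernel size is not given).
    cv::Mat BuildDilatedPersonMask(const std::vector<cv::Mat>& instanceMasks,
                                   int rows, int cols) {
        cv::Mat merged = cv::Mat::zeros(rows, cols, CV_8UC1);
        for (const cv::Mat& m : instanceMasks)      // merge all instance masks
            cv::bitwise_or(merged, m, merged);
        // Dilate so that unreliable object boundaries are also covered
        cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(15, 15));
        cv::Mat dilated;
        cv::dilate(merged, dilated, kernel);
        return dilated;
    }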
| 
 | ||||
FIGURE 8. Semantic information. ‘‘M’’ stands for Mask R-CNN and ‘‘S’’ for SegNet. (e) shows the outliers, marked in red, which are detected using the mask image.
FIGURE 9. Mask dilation. Removing outliers on the edge of dynamic objects.
FIGURE 10. Segmentation failure case. Some features on the body of the person (a) cannot be identified as outliers using the unsound mask (c) generated from the semantic result (b). Therefore, those features are wrongly labeled as static in this frame.
FIGURE 11. Moving probability. θs is the static threshold and θd is the dynamic threshold.
D. MOVING PROBABILITY UPDATE
In order not to wait for the semantic information in the tracking thread, we isolate the semantic segmentation from tracking. We use the moving probability to convey semantic information from the semantic thread to the tracking thread. The moving probability is used to detect and remove outliers in tracking.

1) DEFINITION OF MOVING PROBABILITY
Since vSLAM usually runs in an unknown environment, the semantic result is not always robust if the CNN is not well trained or refined for the current environment (Fig. 10). To detect outliers, it is more reasonable to consider the spatio-temporal consistency of successive frames, rather than to use the semantic result of only one frame. Therefore, we use the moving probability to leverage the semantic information of successive keyframes.

   We define the moving probability p(m_t^i), with m_t^i ∈ M, of each map point i at the current time t, as shown in Fig. 11. The status of a map point is more likely dynamic if its moving probability is closer to one, and more likely static if it is closer to zero. To simplify the notation, we abbreviate the moving probability of map point i at time t, p(m_t^i), to p(m_t). Each map point has two possible statuses (M), dynamic and static, and the initial probability (initial belief) is set to 0.5:

    M = {static (s), dynamic (d)},
    bel(m_0 = d) = bel(m_0 = s) = 0.5.

2) DEFINITION OF OBSERVED MOVING PROBABILITY
Considering the fact that the semantic segmentation is not 100% accurate, we define the observed moving probability as:

    p(z_t = d | m_t = d) = α,
    p(z_t = s | m_t = d) = 1 − α,
    p(z_t = s | m_t = s) = β,
    p(z_t = d | m_t = s) = 1 − β.

The values α and β are given manually and are related to the accuracy of the semantic segmentation. In the experiments, we set both α and β to 0.9, supposing that the semantic segmentation is fairly reliable.
| 
 | ||||
3) MOVING PROBABILITY UPDATE
The moving probability at the current time, bel(m_t), is estimated based on the observations z_{1:t} (semantic segmentation) and the initial status m_0. We formulate the moving probability update as a Bayesian filter [30] problem:

    bel(m_t) = p(m_t | z_{1:t}, m_0)
             = η p(z_t | m_t, z_{1:t−1}, m_0) p(m_t | z_{1:t−1}, m_0)
             = η p(z_t | m_t) p(m_t | z_{1:t−1}, m_0)
             = η p(z_t | m_t) bel⁻(m_t).   (2)

Eq. (2) exploits Bayes' rule and the conditional independence assumption that the current observation z_t relies only on the current status m_t; η is a constant. The prediction (prior belief) bel⁻(m_t) is calculated by:

    bel⁻(m_t) = ∫ p(m_t | m_{t−1}, z_{1:t−1}) p(m_{t−1} | z_{1:t−1}) dm_{t−1}
              = ∫ p(m_t | m_{t−1}) bel(m_{t−1}) dm_{t−1}.   (3)

In Eq. (3), we exploit the assumption that our state is complete. This implies that if we know the previous state m_{t−1}, past measurements convey no information regarding the state m_t. We assume the state transition probabilities p(m_t = d | m_{t−1} = s) = 0 and p(m_t = d | m_{t−1} = d) = 1 because we cannot detect sudden changes of objects. η is calculated by (bel(m_t = d) + bel(m_t = s))/2. The probability of a map point being dynamic is calculated by:

    bel(m_t = d) = p(m_t = d | m_{t−1} = d) bel(m_{t−1} = d).   (4)

4) JUDGEMENT OF STATIC AND DYNAMIC POINTS
Whether a point is dynamic or static is judged using the predefined probability thresholds θd and θs (see Fig. 11), which are set to 0.6 and 0.4, respectively, in the experiments:

    Status(m_t^i) = dynamic,  if p(m_t) > θd
                    static,   if p(m_t) < θs        (5)
                    unknown,  otherwise.
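The update of Eqs. (2)–(5) can be sketched in C++ as follows. This is an illustration only: it uses the standard Bayes normalizer for η (which differs slightly from the constant given above), the thresholds θd = 0.6 and θs = 0.4, and α = β = 0.9 from the experiments; all names are ours.

    // Illustrative sketch of the moving probability update and judgement.
    enum class PointStatus { Static, Dynamic, Unknown };

    struct MapPointBelief {
        double belDynamic = 0.5;   // bel(m = d), initial belief
        double belStatic  = 0.5;   // bel(m = s)
    };

    // z_is_dynamic: true if the matched pixel falls inside the (dilated) mask.
    void UpdateMovingProbability(MapPointBelief& b, bool z_is_dynamic,
                                 double alpha = 0.9, double beta = 0.9) {
        // With p(d|d) = 1 and p(d|s) = 0 the prediction step keeps the
        // previous belief, so only the observation model is applied here.
        const double pz_d = z_is_dynamic ? alpha : (1.0 - alpha);  // p(z | m = d)
        const double pz_s = z_is_dynamic ? (1.0 - beta) : beta;    // p(z | m = s)
        const double d = pz_d * b.belDynamic;
        const double s = pz_s * b.belStatic;
        const double eta = 1.0 / (d + s);                          // normalizer
        b.belDynamic = eta * d;
        b.belStatic  = eta * s;
    }

    PointStatus Classify(const MapPointBelief& b,
                         double theta_d = 0.6, double theta_s = 0.4) {
        if (b.belDynamic > theta_d) return PointStatus::Dynamic;
        if (b.belDynamic < theta_s) return PointStatus::Static;
        return PointStatus::Unknown;
    }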
V. TRACKING THREAD
The tracking thread runs in real-time and tends to accumulate drift error due to incorrect or unstable data associations between the 3D map points and the 2D features of each frame caused by dynamic objects. We modify the Track Last Frame model and the Track Local Map model of the ORB-SLAM3 tracking thread to remove outliers (see Fig. 3). We propose a data association algorithm that uses as many good data associations as possible, based on the moving probability stored in the atlas.

A. TRACK LAST FRAME
Alg. 2 shows the data association algorithm of the track last frame model. For each feature i in the last frame, we first get its matched map point m (Line 2). Next, we find the matched feature in the current frame by comparing the descriptor distance of the ORB features (Line 3). After that, in order to remove the bad influence of dynamic map points, we skip the map points that have a high moving probability (Lines 4-6). Then, two kinds of map points are left, static and unknown ones. We want to use only static map points as far as we can. Therefore, we classify the remaining map points into two subsets, a static subset and an unknown subset, according to their moving probability (Lines 7-12). Finally, we use the selected, relatively good matches: we first use all the good data stored in the static subset (Lines 14-16), and if their number is not enough (less than the threshold τ = 20, the value used in ORB-SLAM3), we also use the data in the unknown subset (Lines 17-21).

Algorithm 2 Robust Data Association Algorithm
Require: Current frame: Ft
         Last frame: Ft−1
         Unknown subset: Unknown<FeatureId, MapPoint*>
         Static subset: Static<FeatureId, MapPoint*>
         Thresholds: θd, θs, τ = 20
 1: for i = 0; i < Ft−1.Features.size(); i++ do
 2:   MapPoint* m = Ft−1.MapPoints[i]
 3:   f = FindMatchedFeatures(Ft, m)
 4:   if p(m) > θd then
 5:     continue
 6:   end if
 7:   if p(m) < θs then
 8:     Static.insert(f, m)
 9:   end if
10:   if θs ≤ p(m) ≤ θd then
11:     Unknown.insert(f, m)
12:   end if
13: end for
14: for it = Static.begin(); it != Static.end(); it++ do
15:   Ft.MapPoints[it->first] = it->second
16: end for
17: if Static.size() < τ then
18:   for it = Unknown.begin(); it != Unknown.end(); it++ do
19:     Ft.MapPoints[it->first] = it->second
20:   end for
21: end if

   We try to exclude outliers from tracking using the moving probability stored in the atlas. How well the outliers are removed has a great influence on the tracking accuracy. We show the results for a few frames in Fig. 12. All the features in the first few frames are green because no semantic information can be used yet and the moving probability of all map points is still 0.5, the initial value. The features in red belong to dynamic objects, and they are harder to match with the last frame than static features (blue features). The green features almost disappear over time because the map points obtain semantic information. We only use features in the static subset if its size is large enough to estimate the camera ego-motion.
| 
 | ||||
FIGURE 12. Results after tracking the last frame. ‘‘M’’ stands for Mask R-CNN and ‘‘S’’ for SegNet. The features in red are not used in tracking. Blue features belong to the static subset and green features belong to the unknown subset.
FIGURE 13. Results after tracking the local map. ‘‘M’’ stands for Mask R-CNN and ‘‘S’’ for SegNet.

B. TRACK LOCAL MAP
The basic idea of the data association algorithm in the Track Local Map model is similar to Alg. 2. The difference is that here we use all the map points in the local map to find good data associations. The data association result after tracking the local map is shown in Fig. 13. More map points are used for matching in this model than in tracking the last frame. The features on the people are almost all successfully detected or not matched/used.

VI. OPTIMIZATION
A. SEMANTIC-BASED OPTIMIZATION
We optimize the camera pose using the keyframes given by the semantic keyframe selection algorithm. Considering that the tracking thread runs much faster than the semantic thread, drift has already accumulated to some extent under the influence of dynamic objects. Therefore, we try to correct the camera pose using semantic information. We modify the error term used in ORB-SLAM3 by using the moving probability of map points for weighting, as shown below. In the experiments, we only use the matched static map points for optimization.

   Assume Xjw ∈ R3 is the 3D position of map point j in the world coordinate system, and the pose of the i-th keyframe in the world coordinate system is Tiw ∈ SE(3). The camera pose Tiw is optimized by minimizing the reprojection error with respect to the matched keypoint xij ∈ R2 of the map point. The error term for the observation of map point j in keyframe i is:

    e(i, j) = (xij − πi(Tiw, Xjw)) (1 − p(mj)),   (6)
| 
 | ||||
where πi is the projection function that projects a 3D map point into a 2D pixel in keyframe i. The larger the moving probability, the smaller the contribution to the error. The cost function to be optimized is:

    C = ∑_{i,j} ρ( e(i, j)ᵀ Ω_{i,j}⁻¹ e(i, j) ),   (7)

where ρ is the Huber robust cost function and Ω_{i,j} is the covariance matrix.
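For illustration, the weighted error term of Eq. (6) can be sketched for a simple pinhole camera as follows; the intrinsics are passed explicitly here, whereas the real system uses the calibrated ORB-SLAM3 camera model, and the names are ours.

    #include <Eigen/Core>
    #include <Eigen/Geometry>

    // Project a 3D world point into pixel coordinates of keyframe i
    // (fx, fy, cx, cy are assumed pinhole intrinsics).
    Eigen::Vector2d Project(const Eigen::Isometry3d& Tiw, const Eigen::Vector3d& Xw,
                            double fx, double fy, double cx, double cy) {
        const Eigen::Vector3d Xc = Tiw * Xw;                 // world -> camera
        return Eigen::Vector2d(fx * Xc.x() / Xc.z() + cx,
                               fy * Xc.y() / Xc.z() + cy);
    }

    // Residual for the observation of map point j in keyframe i: the larger
    // the moving probability p(m_j), the smaller its contribution (Eq. (6)).
    Eigen::Vector2d WeightedReprojectionError(const Eigen::Vector2d& x_ij,
                                              const Eigen::Isometry3d& Tiw,
                                              const Eigen::Vector3d& Xjw,
                                              double p_mj,
                                              double fx, double fy,
                                              double cx, double cy) {
        return (x_ij - Project(Tiw, Xjw, fx, fy, cx, cy)) * (1.0 - p_mj);
    }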
|                                                                     measuring the performance of the vSLAM. The root mean | ||||
B. BUNDLE ADJUSTMENT IN LOCAL MAPPING THREAD
We modify the local BA model to reduce the influence of dynamic map points using semantic information. What we modified are: 1) the error term, in which the moving probability is used, as shown in Eq. (6); 2) only keyframes that have already obtained semantic information are used for BA.

VII. EXPERIMENTAL RESULTS
We evaluate the tracking accuracy using the TUM [25] indoor dataset and demonstrate the real-time performance by comparing with state-of-the-art vSLAM methods using, when possible, the results reported in the original papers.

A. SYSTEM SETUP
Our system is evaluated using a GeForce RTX 2080Ti GPU, CUDA 11.1, and Docker.3 Docker is used to deploy different kinds of semantic segmentation methods on the same machine. We also use a Kinect v2 camera4 to evaluate in a real environment.

3 https://docs.docker.com/
4 https://github.com/code-iai/iai_kinect2

B. TRACKING ACCURACY EVALUATION
The proposed method was compared against ORB-SLAM3 and similar semantic-based algorithms to quantify the tracking performance of our proposal in dynamic scenarios.

   The TUM RGB-D dataset contains color and depth images along the ground-truth trajectory of the sensor. In the sequences named ‘‘fr3/walking_*’’ (labeled as f3/w/*), two people walk through an office. This is intended to evaluate the robustness of vSLAM to quickly moving dynamic objects in large parts of a visible scene. Four types of camera motion are included in the walking sequences: 1) ‘‘xyz’’, where the Asus Xtion camera is manually moved along three directions (xyz); 2) ‘‘static’’, where the camera is kept in place manually; 3) ‘‘halfsphere’’, where the camera is moved on a small half-sphere of approximately one-meter diameter; 4) ‘‘rpy’’, where the camera is rotated along the principal axes (roll-pitch-yaw). In the experiment, the person is treated as the only priori dynamic object in the TUM dataset.

   We compared the camera trajectory with ORB-SLAM3,5 DS-SLAM,6 and DynaSLAM.7 Fig. 14 compares the trajectories obtained using their source code, and therefore the trajectories are not exactly the same as the ones in their original papers. We evaluated our system using both Mask R-CNN (M) and SegNet (S). The trajectory of DynaSLAM using Mask R-CNN is very similar to that of our Mask R-CNN version, as shown in Fig. 14 (m-p) and Fig. 14 (q-t). The performance of our SegNet version (Fig. 14 (i and j)) is similar to that of DS-SLAM (Fig. 14 (e and f)).

5 https://github.com/UZ-SLAMLab/ORB_SLAM3
6 https://github.com/ivipsourcecode/DS-SLAM
7 https://github.com/BertaBescos/DynaSLAM

   The error in the estimated trajectory was calculated by comparing it with the ground truth, using two prominent measurements: the absolute trajectory error (ATE) and the relative pose error (RPE) [25], which are well-suited for measuring the performance of vSLAM. The root mean squared error (RMSE) and the standard deviation (S.D.) of ATE and RPE are compared. Each sequence was run at least five times, as dynamic objects are prone to increase the non-deterministic effect. We compared our method with ORB-SLAM3 [18], DS-SLAM [10], DynaSLAM [9], SLAM-PCD [8], DM-SLAM [11], and Detect-SLAM [12]. The comparison results are summarized in Tables 1, 2, and 3. DynaSLAM reported that they obtained the best performance using the combination of Mask R-CNN and a geometric model. In this paper, we mainly focus on the time cost caused by semantic segmentation. In contrast to the very heavy geometric model that DynaSLAM uses, we only use very light geometric checks, such as RANSAC and the photometric error, to deal with the outliers that do not belong to the priori dynamic objects.
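As a reference for the metric used above, a minimal sketch of the ATE RMSE computation is given below; it assumes the estimated trajectory has already been associated and aligned with the ground truth (the alignment step of the TUM evaluation tools is omitted), and the function name is ours.

    #include <algorithm>
    #include <cmath>
    #include <vector>
    #include <Eigen/Core>

    // RMSE of the translational differences between aligned trajectories.
    double AteRmse(const std::vector<Eigen::Vector3d>& estimated,
                   const std::vector<Eigen::Vector3d>& groundTruth) {
        double sumSq = 0.0;
        const std::size_t n = std::min(estimated.size(), groundTruth.size());
        for (std::size_t i = 0; i < n; ++i)
            sumSq += (estimated[i] - groundTruth[i]).squaredNorm();
        return std::sqrt(sumSq / static_cast<double>(n));
    }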
| 
 | ||||
FIGURE 14. Trajectory comparison frame by frame. ‘‘M’’ stands for ‘‘Mask R-CNN’’ and ‘‘S’’ for ‘‘SegNet’’.
TABLE 1. Results of absolute trajectory error on TUM (m). Ours (1) and (3) are results evaluated using only keyframes.

   Our proposal outperforms the original ORB-SLAM3 (RGB-D mode only, without IMU) and obtains performance similar to DynaSLAM, SLAM-PCD, and DM-SLAM, whose tracking error is already very small. Different from them, we use the non-blocked model. The first few frames do not have any semantic information, and the number of keyframes that have a semantic label is smaller than when using the blocked model because the processing speed of the tracking thread is much faster than that of the semantic segmentation (especially for the heavy model, Mask R-CNN). However, we achieved similar tracking performance using less semantic information.

C. REAL ENVIRONMENT EVALUATION
We test our system using a Kinect2 RGB-D camera, as shown in Fig. 15. All the features are in the initial status in the first few frames because they have not yet obtained any semantic information. Static features are increasingly detected over time and used to estimate the camera pose. The features on the person are detected and excluded from tracking. The algorithm runs at around 30 Hz, as shown in Table 4.

D. EXECUTION TIME
Tab. 4 compares the execution time of the vSLAM algorithms. In the blocked model, the tracking thread needs to wait for the semantic label, so the speed of the other methods is tied to the semantic segmentation method used: the heavier the semantic model, the higher the total time consumption. Although DynaSLAM achieved good tracking performance, its processing time is long due to Mask R-CNN. As we know, DynaSLAM is not a real-time algorithm. DS-SLAM is the second fastest algorithm because it uses a lightweight semantic segmentation method, SegNet. However, the architecture it uses is also a blocked model, and the execution time will increase if a more time-consuming method is used. Our method uses the non-blocked model and runs at an almost constant speed regardless of the segmentation method.
TABLE 2. Results of translational relative pose error (RPE) on TUM (m). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 3. Results of rotational relative pose error (RPE) on TUM (m). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 4. Execution time comparison on the TUM dataset. We use the data from the original papers when possible; if not provided, we approximate the processing time.
TABLE 5. Semantic keyframe number comparison (Mask R-CNN).

   We evaluate the error metrics of the TUM dataset at 15 Hz by manually adding some time delay in the tracking thread, because the TUM sequences are very short and only very little semantic information can be obtained in such a short time. We compare the time and the number of keyframes that obtained a semantic label (the semantic keyframe number) in Tab. 5. We only compare the Mask R-CNN version because SegNet is faster and can segment almost all the keyframes in each sequence. We assume the time cost of Mask R-CNN is 0.2 s per frame. The total time of running the fr3/w/xyz sequence is about 57.3 s at 15 Hz, but only 28.3 s at 30 Hz. In this short time, the number of semantic keyframes at 30 Hz (143) is about half of that at 15 Hz (286). Usually, the more keyframes are segmented, the better the tracking accuracy that can be achieved. This depends on the specific application and the segmentation method used.

   In the bi-direction model, we select two keyframes at the same time. We offer two strategies to segment them: 1) infer the images at the same time as a batch on the same GPU; 2) infer the images on the same GPU sequentially (one by one).
| 
 | ||||
FIGURE 15. Result in a real environment. The green features are in the initial status and their moving probability is 0.5. The blue features are static features and the red ones are outliers. (a) shows the originally detected ORB features, (b) is the output after the track last frame process, and (c) is the result after the track local map process.
FIGURE 16. Semantic delay on the TUM w/xyz dataset. The average value for the Mask R-CNN case is 10 and for SegNet it is 5.

   We suggest using strategy (1) if the GPU can infer a batch of images at the same time. Our Mask R-CNN version uses (1) because we found that we need 0.3-0.4 s in case (1) and 0.2 s in case (2). Our SegNet version is evaluated using strategy (2) because SegNet is very fast and the keyframes can be segmented sequentially.

E. SEMANTIC DELAY EVALUATION
We have analyzed the semantic delay by assuming that a keyframe is selected every two frames (see Fig. 6). In the experiment, we follow the keyframe selection policy used in ORB-SLAM3, and we compare the semantic delay of the Mask R-CNN case and the SegNet case using the TUM dataset, as shown in Fig. 16. The semantic delay is influenced by these factors: 1) the segmentation speed; 2) the keyframe selection policy; 3) the undetermined influence caused by the different running speeds of multiple threads (e.g., the loop closing thread); 4) the hardware configuration. In the fr3/w/xyz sequence, the camera sometimes moves very slowly and sometimes moves forward or backward. As a result, this changes the keyframe selection frequency and causes the variance of the semantic delay.

VIII. CONCLUSION
A novel vSLAM system, semantic-based real-time visual SLAM for dynamic environments (RDS-SLAM) using an RGB-D camera, is presented. We modify ORB-SLAM3 and add a semantic thread and a semantic-based optimization thread to remove the influence of dynamic objects using semantic information. These new threads run in parallel with the tracking thread and, therefore, the tracking thread is not blocked waiting for semantic information. We propose a keyframe selection strategy for semantic segmentation that obtains the latest semantic information possible and can deal with segmentation methods with different speeds. We update and propagate semantic information using the moving probability, which is used to detect and remove outliers from tracking with a data association algorithm. We evaluated the tracking performance and the processing time using the TUM dataset. The comparison against state-of-the-art vSLAMs shows that our method achieves good tracking performance and can track each frame in real-time. The fastest speed of the system is about 30 Hz, which is similar to the tracking speed of ORB-SLAM3. In future work, we will try to 1) deploy our system on a real robot, 2) extend our system to stereo and monocular camera systems, and 3) build a semantic map.
| 
 | ||||
|                                                                               REFERENCES | ||||
| 
 | ||||
|                                                                                [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, | ||||
|                                                                                      I. Reid, and J. J. Leonard, ‘‘Past, present, and future of simultaneous | ||||
|                                                                                      localization and mapping: Toward the robust-perception age,’’ IEEE Trans. | ||||
|                                                                                      Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016. [Online]. Available: | ||||
|                                                                                      http://ieeexplore.ieee.org/document/7747236/ | ||||
| 
 | ||||
|                                                                                [2] T. Taketomi, H. Uchiyama, and S. Ikeda, ‘‘Visual SLAM algorithms: | ||||
|                                                                                      A survey from 2010 to 2016,’’ IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1, | ||||
|                                                                                      pp. 1–11, Dec. 2017, doi: 10.1186/s41074-017-0027-2. | ||||
| 
 | ||||
|                                                                                                                                                                                  VOLUME 9, 2021 | ||||
| Y. Liu, J. Miura: RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods | ||||
| 
 | ||||
|  [3] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, ‘‘Robust monocular              [25] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, ‘‘A bench- | ||||
|        SLAM in dynamic environments,’’ in Proc. IEEE Int. Symp. Mixed Aug-                 mark for the evaluation of RGB-D SLAM systems,’’ in Proc. IEEE/RSJ | ||||
|        mented Reality (ISMAR), Oct. 2013, pp. 209–218. [Online]. Available:                Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580. [Online]. Available: | ||||
|        http://ieeexplore.ieee.org/                                                         http://vision.in.tum.de/data/datasets/ | ||||
| 
 | ||||
|  [4] W. Dai, Y. Zhang, P. Li, and Z. Fang, ‘‘RGB-D SLAM in dynamic envi-            [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, | ||||
|        ronments using points correlations,’’ IEEE Robot. Autom. Lett., vol. 2,             and A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Computer | ||||
|        no. 4, pp. 2263–2270, Nov. 2018. [Online]. Available: https://arxiv.org/            Vision—ECCV 2016 (Lecture Notes in Computer Science: Lecture Notes | ||||
|        pdf/1811.03217v1.pdf                                                                in Artificial Intelligence: Lecture Notes in Bioinformatics), vol. 9905. | ||||
|                                                                                            Cham, Switzerland: Springer, 2016, pp. 21–37. [Online]. Available: | ||||
|  [5] S. Li and D. Lee, ‘‘RGB-D SLAM in dynamic environments using static                   https://github.com/weiliu89/caffe/tree/ssd | ||||
|        point weighting,’’ IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263–2270, | ||||
|        Oct. 2017.                                                                   [27] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, ‘‘BlitzNet: A real-time | ||||
|                                                                                            deep network for scene understanding,’’ in Proc. IEEE Int. Conf. Comput. | ||||
|  [6] Y. Sun, M. Liu, and M. Q.-H. Meng, ‘‘Improving RGB-D SLAM in                          Vis. (ICCV), Oct. 2017, pp. 4174–4182. | ||||
|        dynamic environments: A motion removal approach,’’ Robot. Auto. Syst., | ||||
|        vol. 89, pp. 110–122, Mar. 2017.                                             [28] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, | ||||
|                                                                                            and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in | ||||
|  [7] D.-H. Kim, S.-B. Han, and J.-H. Kim, ‘‘Visual odometry algorithm                      Computer Vision—ECCV 2014 (Lecture Notes in Computer Science: Lec- | ||||
|        using an RGB-D sensor and IMU in a highly dynamic environ-                          ture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), | ||||
|        ment,’’ in Robot Intelligence Technology and Applications 3, vol. 345.              vol. 8693. Cham, Switzerland: Springer, 2014, pp. 740–755. | ||||
|        New York, NY, USA: Springer-Verlag, 2015, pp. 11–26. [Online]. Avail- | ||||
|        able: https://link.springer.com/chapter/10.1007/978-3-319-16841-8_2          [29] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and | ||||
|                                                                                            A. Zisserman, ‘‘The Pascal visual object classes (VOC) challenge,’’ | ||||
|  [8] Y. Fan, Q. Zhang, S. Liu, Y. Tang, X. Jing, J. Yao, and H. Han, ‘‘Semantic            Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010. [Online]. | ||||
|        SLAM with more accurate point cloud map in dynamic environments,’’                  Available: http://link.springer.com/10.1007/s11263-009-0275-4 | ||||
|        IEEE Access, vol. 8, pp. 112237–112252, 2020. | ||||
|                                                                                     [30] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA, | ||||
|  [9] B. Bescos, J. M. Facil, J. Civera, and J. Neira, ‘‘DynaSLAM: Tracking,                USA: MIT Press, 2012, p. 2012. [Online]. Available: https://mitpress- | ||||
|        mapping, and inpainting in dynamic scenes,’’ IEEE Robot. Autom. Lett.,              mit-edu.proxy.library.uu.nl/books/probabilistic-robotics%0Ahttp:// | ||||
|        vol. 3, no. 4, pp. 4076–4083, Oct. 2018.                                            mitpress.mit.edu/books/probabilistic-robotics | ||||
| 
 | ||||
| [10] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, ‘‘DS-                                          YUBAO LIU received the bachelor’s degree in | ||||
|        SLAM: A semantic visual SLAM towards dynamic environments,’’                                                computer science from Qufu Normal University, | ||||
|        in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018,                                        Qufu, China, in 2012, and the master’s degree in | ||||
|        pp. 1168–1174.                                                                                              computer science from Capital Normal University, | ||||
|                                                                                                                    Beijing, China, in 2015. He is currently pursuing | ||||
| [11] J. Cheng, Z. Wang, H. Zhou, L. Li, and J. Yao, ‘‘DM-SLAM: A feature-                                          the Ph.D. degree with the Toyohashi University of | ||||
|        based SLAM system for rigid dynamic scenes,’’ ISPRS Int. J. Geo-Inf.,                                       Technology, Toyohashi, Japan. In 2015, he joined | ||||
|        vol. 9, no. 4, pp. 1–18, 2020.                                                                              Intel Research Center, Beijing, and he transferred | ||||
|                                                                                                                    to Isoftstone, Beijing, in 2016, as a Senior Soft- | ||||
| [12] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, ‘‘Detect-SLAM:                                             ware Engineer, working on computer vision and | ||||
|        Making object detection and SLAM mutually beneficial,’’ in Proc. IEEE         AR. His research interests include pattern recognition and SLAM for AR | ||||
|        Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1001–1010.            and smart robotics. | ||||
| 
 | ||||
| [13] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, ‘‘Dynamic-                                                     JUN MIURA (Member, IEEE) received the | ||||
|        SLAM: Semantic monocular visual localization and mapping based on                                           B.Eng. degree in mechanical engineering and | ||||
|        deep learning in dynamic environment,’’ Robot. Auto. Syst., vol. 117,                                       the M.Eng. and Dr.Eng. degrees in informa- | ||||
|        pp. 1–16, Jul. 2019. [Online]. Available: https://linkinghub.elsevier.com/                                  tion engineering from The University of Tokyo, | ||||
|        retrieve/pii/S0921889018308029                                                                              Tokyo, Japan, in 1984, 1986, and 1989, respec- | ||||
|                                                                                                                    tively. In 1989, he joined the Department of | ||||
| [14] M. A. Fischler and R. Bolles, ‘‘Random sample consensus: A paradigm for                                       Computer-Controlled Mechanical Systems, Osaka | ||||
|        model fitting with applications to image analysis and automated cartogra-                                    University, Suita, Japan. Since April 2007, he has | ||||
|        phy,’’ Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.                                                      been a Professor with the Department of Computer | ||||
|                                                                                                                    Science and Engineering, Toyohashi University of | ||||
| [15] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, ‘‘Mask R-CNN,’’ in Proc.    Technology, Toyohashi, Japan. From March 1994 to February 1995, he was a | ||||
|        IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.               Visiting Scientist with the Computer Science Department, Carnegie Mellon | ||||
|                                                                                     University, Pittsburgh, PA, USA. He has published over 220 articles in | ||||
| [16] R. Mur-Artal and J. D. Tardos, ‘‘ORB-SLAM2: An open-source SLAM                international journal and conferences in the areas of intelligent robotics, | ||||
|        system for monocular, stereo, and RGB-D cameras,’’ IEEE Trans. Robot.,       mobile service robots, robot vision, and artificial intelligence. He received | ||||
|        vol. 33, no. 5, pp. 1255–1262, Oct. 2017.                                    several awards, including the Best Paper Award from the Robotics Society | ||||
|                                                                                     of Japan, in 1997, the Best Paper Award Finalist at ICRA-1995, and the Best | ||||
| [17] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ‘‘ORB: An efficient          Service Robotics Paper Award Finalist at ICRA-2013. | ||||
|        alternative to SIFT or SURF,’’ in Proc. Int. Conf. Comput. Vis., Nov. 2011, | ||||
|        pp. 2564–2571.                                                                                                                                                              23785 | ||||
| 
 | ||||
| [18] C. Campos, R. Elvira, J. J. Gómez Rodríguez, J. M. M. Montiel, and | ||||
|        J. D. Tardós, ‘‘ORB-SLAM3: An accurate open-source library for visual, | ||||
|        visual-inertial and multi-map SLAM,’’ 2020, arXiv:2007.11898. [Online]. | ||||
|        Available: http://arxiv.org/abs/2007.11898 | ||||
| 
 | ||||
| [19] R. Elvira, J. D. Tardos, and J. M. M. Montiel, ‘‘ORBSLAM-Atlas: | ||||
|        A robust and accurate multi-map system,’’ in Proc. IEEE/RSJ Int. Conf. | ||||
|        Intell. Robots Syst. (IROS), Nov. 2019, pp. 6253–6259. | ||||
| 
 | ||||
| [20] C. Kerl, J. Sturm, and D. Cremers, ‘‘Dense visual SLAM for RGB-D | ||||
|        cameras,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, | ||||
|        pp. 2100–2106. | ||||
| 
 | ||||
| [21] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, | ||||
|        ‘‘Bundle adjustment—A modern synthesis,’’ in Proc. Int. Workshop Vis. | ||||
|        Algorithms, 2000, pp. 298–372. [Online]. Available: http://link.springer. | ||||
|        com/10.1007/3-540-44480-7_21 | ||||
| 
 | ||||
| [22] V. Badrinarayanan, A. Kendall, and R. Cipolla, ‘‘SegNet: A deep convolu- | ||||
|        tional encoder-decoder architecture for image segmentation,’’ IEEE Trans. | ||||
|        Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017. | ||||
| 
 | ||||
| [23] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, | ||||
|        ‘‘OctoMap: An efficient probabilistic 3D mapping framework based on | ||||
|        octrees,’’ Auto. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013. | ||||
| 
 | ||||
| [24] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, ‘‘DynaSLAM II: Tightly- | ||||
|        coupled multi-object tracking and SLAM,’’ 2020, arXiv:2010.07820. | ||||
|        [Online]. Available: http://arxiv.org/abs/2010.07820 | ||||
| 
 | ||||
| VOLUME 9, 2021 | ||||
|  | ||||
| @ -0,0 +1,760 @@ | ||||
|                                         12 | ||||
| 
 | ||||
|                                              STEREO CAMERA VISUAL SLAM WITH HIERARCHICAL | ||||
|                                         MASKING AND MOTION-STATE CLASSIFICATION AT OUTDOOR | ||||
| 
 | ||||
|                                               CONSTRUCTION SITES CONTAINING LARGE DYNAMIC | ||||
|                                                                            OBJECTS | ||||
| 
 | ||||
| arXiv:2101.06563v1 [cs.RO] 17 Jan 2021                  Runqiu Bao                               Ren Komatsu | ||||
|                                             Dept. of Precision Engineering             Dept. of Precision Engineering | ||||
|                                         The University of Tokyo, Tokyo, Japan     The University of Tokyo, Tokyo, Japan | ||||
|                                         bao@robot.t.u-tokyo.ac.jp               komatsu@robot.t.u-tokyo.ac.jp | ||||
| 
 | ||||
|                                                           Renato Miyagusuku | ||||
|                                           Dept. of Mechanical and Intelligent Engineering | ||||
|                                         Utsunomiya University, Utsunomiya, Tochigi, Japan | ||||
|                                           miyagusuku@cc.utsunomiya-u.ac.jp | ||||
| 
 | ||||
|                                                               Masaki Chino                      Atsushi Yamashita | ||||
|                                                          Construction Division           Dept. of Precision Engineering | ||||
|                                         HAZAMA ANDO CORPORATION, Tokyo, Japan       The University of Tokyo, Tokyo, Japan | ||||
|                                                 chino.masaki@ad-hzm.co.jp       yamashita@robot.t.u-tokyo.ac.jp | ||||
| 
 | ||||
|                                                       Hajime Asama | ||||
|                                              Dept. of Precision Engineering | ||||
|                                         The University of Tokyo, Tokyo, Japan | ||||
|                                         asama@robot.t.u-tokyo.ac.jp | ||||
| 
 | ||||
|                                                                                                  January 19, 2021 | ||||
| 
 | ||||
|                                                                                          ABSTRACT | ||||
| 
 | ||||
|                                                At modern construction sites, utilizing GNSS (Global Navigation Satellite System) to measure the | ||||
|                                                real-time location and orientation (i.e. pose) of construction machines and navigate them is very | ||||
|                                                common. However, GNSS is not always available. Replacing GNSS with on-board cameras and | ||||
|                                                visual simultaneous localization and mapping (visual SLAM) to navigate the machines is a cost- | ||||
|                                                effective solution. Nevertheless, at construction sites, multiple construction machines will usually | ||||
|                                                work together and side-by-side, causing large dynamic occlusions in the cameras’ view. Standard | ||||
|                                                visual SLAM cannot handle large dynamic occlusions well. In this work, we propose a motion | ||||
|                                                segmentation method to efficiently extract static parts from crowded dynamic scenes to enable robust | ||||
|                                                tracking of camera ego-motion. Our method utilizes semantic information combined with object-level | ||||
|                                                geometric constraints to quickly detect the static parts of the scene. Then, we perform a two-step | ||||
|                                                coarse-to-fine ego-motion tracking with reference to the static parts. This leads to a novel dynamic | ||||
|                                                visual SLAM formation. We test our proposals through a real implementation based on ORB-SLAM2, | ||||
|                                                and datasets we collected from real construction sites. The results show that when standard visual | ||||
| 
 | ||||
|                                         ∗Code available at: https://github.com/RunqiuBao/kenki-positioning-vSLAM | ||||
|                                         †∗Corresponding author Email: bao@robot.t.u-tokyo.ac.jp | ||||
|                                                                                                    A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
|            SLAM fails, our method can still retain accurate camera ego-motion tracking in real-time. Comparing | ||||
|            to state-of-the-art dynamic visual SLAM methods, ours shows outstanding efficiency and competitive | ||||
|            result trajectory accuracy. | ||||
| 
 | ||||
|            Keywords | ||||
|            dynamic visual SLAM, motion segmentation, hierarchical masking, object motion-state classification, | ||||
|            ego-motion tracking | ||||
| 
 | ||||
| 1 Introduction | ||||
| 
 | ||||
| Knowledge of real-time location and orientation (i.e. pose) of construction machines, such as bulldozers, excavators, | ||||
| and vibration rollers, is essential for the automation of construction sites. Currently, RTK-GNSS (Real-Time Kinematic | ||||
| - Global Navigation Satellite System) is widely used because of its centimeter-level location accuracy. However, | ||||
| in addition to the high price, the location output of RTK-GNSS could be unstable due to loss of satellite signals | ||||
| underground, near mountains and trees, and between tall buildings. Therefore, replacing RTK-GNSS with on-board | ||||
| cameras and visual SLAM (vSLAM) has been proposed [1]. Assuming machine’s starting pose is known in a global | ||||
| coordinate system, relative pose outputs from vSLAM can be used to navigate the machine. | ||||
| However at construction sites, several machines usually work together and side-by-side (Figure 1), which results in | ||||
| large dynamic occlusions in the view of the cameras. Such dynamic occlusions can occupy more than 50% of the image. | ||||
| It leads to a dramatic drop in tracking accuracy or even tracking failure when using standard vSLAM. We introduce | ||||
| this problem distinctly in the context of dynamic vSLAM and propose an original stereo camera dynamic vSLAM | ||||
| formation. | ||||
| To deal with dynamic occlusions, our idea is to firstly detect static objects and backgrounds, and then track ego-motion | ||||
| with reference to them. To achieve this, we need to estimate the real motion-states of objects. We use learning-based | ||||
| object detection and instance segmentation combined with object-wise geometric measurement in stereo frames to label | ||||
| the motion-states of object instances and generate occlusion masks for dynamic objects. Additionally, two types of | ||||
| occlusion masks are applied to balance accuracy and computation cost, bounding box mask for small occlusions and | ||||
| pixel-wise for large occlusions. Pixel-wise masks describe boundaries of objects more accurately. While bounding | ||||
| boxes are faster to predict, it is not so accurate. | ||||
| In a nutshell, our contributions in this work include: (1) a semantic-geometric approach to detect static objects and | ||||
| static backgrounds for stereo vSLAM, (2) a masking technique for dynamic objects called hierarchical masking, (3) a | ||||
| novel stereo camera dynamic visual SLAM system for construction sites. | ||||
| The remainder of this work is organized as follows: In Section 2, we summarize the existing research on dynamic visual | ||||
| SLAM and motion segmentation methods, and describe the features of this work. In Section 3, the system structure | ||||
| and our original proposals (two algorithms) are introduced. In Section 4, to test the performance of our proposals, | ||||
| we conducted experiments at real construction sites and built datasets for algorithm evaluation. We used Absolute | ||||
| Trajectroy RMSE [2] to evaluate accuracy of the location outputs of the vSLAM system. Finally, Section 5 contains the | ||||
| conclusions and future work plan. | ||||
| 
 | ||||
| Figure 1: Simultaneous working of construction machines causing large-area moving occlusions in on-board cameras’ | ||||
| view. | ||||
| 
 | ||||
|                                                                       2 | ||||
|                                                                                                    A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| Figure 2: Cameras are mounted on top of our construction machine facing to the sides, and RTK-GNSS is used to | ||||
| collect ground truth positions. | ||||
| 
 | ||||
| 2 Related Work | ||||
| 
 | ||||
| 2.1 Dynamic Visual SLAM | ||||
| 
 | ||||
| Standard visual SLAM (vSLAM) assumes that the environment is static. Correspondingly, vSLAM for dynamic | ||||
| environments (Dynamic vSLAM or Robust vSLAM) distinguishes static and dynamic features and computes pose | ||||
| estimation based solely on static features. | ||||
| Depending on the application, dynamic vSLAM can be categorized into two classes. One solely builds a static | ||||
| background model, ignoring moving objects [3, 4, 2]. The other aims at not only creating a static background map, but | ||||
| simultaneously maintaining sub-maps of moving objects [5, 6, 7]. Our task, i.e. positioning of construction machines, | ||||
| requires fast and accurate camera ego-motion tracking and thus belongs to the first class. | ||||
| Real-time positioning task at construction sites brought new problem to vSLAM. Specifically, we found that at a busy | ||||
| construction site, there are often many machines, trucks and persons moving around which become large dynamic | ||||
| occlusions (occlusion rate >50% from time to time) in the camera view. Besides, such occlusions usually contain | ||||
| more salient feature points than earthen ground and cause chaos in feature-based camera ego-motion tracking. Even | ||||
| existing dynamic vSLAM solutions may suffer from different issues and are thus not the optimal solution to this task. | ||||
| For example, [8, 9, 10, 11] proposed very fast methods for dealing with dynamic objects. Yet, they did not explicitly | ||||
| consider overly-large dynamic occlusions and thus might suffer from accuracy drop. [2] and [6] proposed very robust | ||||
| methods for masking dynamic occlusions. But both of them require heavy computation and are not suitable for real-time | ||||
| positioning task. Therefore, we proposed our own dynamic vSLAM solution for real-time positioning at dynamic | ||||
| construction sites. | ||||
| In a dynamic vSLAM system, there are mainly two major modules: (1) motion segmentation and (2) localization and | ||||
| mapping [12]. Motion segmentation is the key part that distinguishes an outstanding dynamic vSLAM system from the | ||||
| rests. | ||||
| 
 | ||||
| 2.2 Motion Segmentation | ||||
| 
 | ||||
| Motion segmentation is aimed at detecting moving parts in the image and classifying the features into two groups, static | ||||
| and dynamic features. | ||||
| Standard visual SLAM achieves this by applying robust statistical approaches to the estimation of geometric models, | ||||
| such as Random Sample Consensus (RANSAC) [13]. However, such approach may fail when large dynamic occlusions | ||||
| exist, and static features are not in the majority. Other approaches leverage external sensors such as inertial measurement | ||||
| units (IMU) to fix camera ego-motion. In the following, we focus on visual-only approaches to distinguish static and | ||||
| dynamic features. Muhamad et al. [12] summarizes this research area well, for more details please refer to the study. | ||||
| The most intuitive approach for motion segmentation is using semantic information to separate object instances that | ||||
| may move in the scene. To obtain semantic information, Bârsan et al. [6] used learning-based instance segmentation | ||||
| to generate pixel-wise masks for object instances. Cui et al. [14] proposed only using bounding boxes obtained from | ||||
| YOLO v3 [15] to filter dynamic objects, which can reduce computation cost. However, these works simply assume that | ||||
| movable objects are dynamic. End-to-end learning-based methods for motion segmentation (without prior information | ||||
| about the environment) are still scarce [12]. | ||||
| Another common strategy for motion segmentation is utilizing geometric constraints. It leverages the fact that dynamic | ||||
| features will violate constraints defined in multi-view geometry for static scenes. Kundu et al. [16] detected dynamic | ||||
| features by checking if the points lie on the epipolar line in the subsequent view and used Flow Vector Bound (FVB) to | ||||
| distinguish motion-states of 3D points moving along the epipolar line. Migliore et al. [17] kept checking the intersection | ||||
| 
 | ||||
|                                                                       3 | ||||
|                                                                                                    A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| between three projected viewing rays in three different views to confirm static points. Tan et al. [18] projected existing | ||||
| map points into the current frame to check if the feature is dynamic. It is difficult for us to evaluate these methods. | ||||
| However, one obvious drawback is that they require complicated modifications to the bottom components of standard | ||||
| visual SLAM algorithm without the static environment assumption. We argue that such modifications are not good for | ||||
| the modularity of a vSLAM system. | ||||
| 
 | ||||
| As a novel hybrid approach, Berta et al. [2], in their work named Dynaslam, proposed to combine learning-based | ||||
| instance segmentation with multi-view geometry to refine masks for objects that are not a priori dynamic, but movable. | ||||
| Our system follows the hybrid fashion of Dynaslam, but we treat motion segmentation as an object-level classification | ||||
| problem. Our idea is, by triangulating and measuring positions of points inside the bounding boxes and comparing them | ||||
| between frames, we can estimate object-level motion-states for every bounding box (assuming objects are all rigid). If | ||||
| we know the motion-state of every bounding box, the surroundings can be easily divided into static and dynamic parts. | ||||
| 
 | ||||
| Besides, bounding boxes of large dynamic occlusions reduce available static features. We will show that it is essential to | ||||
| keep the overall masked area under a certain threshold if possible. Hence, we designed an algorithm named hierarchical | ||||
| masking to refine a pixel-wise mask inside the bounding box when the overall masked area extends past a threshold to | ||||
| save scarce static features. This hierarchical masking algorithm is also an original proposal from us. | ||||
| 
 | ||||
| 3 Stereo Camera Dynamic Visual SLAM robust against Large Dynamic Occlusions | ||||
| 
 | ||||
| The core problem in this research is to achieve fast and accurate camera ego-motion tracking when there are large | ||||
| occlusions in the camera’s view. Subsection 3.1 is a general introduction of the system pipeline. In Subsection 3.2, the | ||||
| principle of feature-based camera ego-motion tracking with occlusion masks for dynamic occlusions is introduced. In | ||||
| order to balance computation speed and accuracy in occlusion mask generation, a hierarchical masking approach is | ||||
| proposed in Subsection 3.3. Last, through stereo triangulation and comparison, object instances in the current frame | ||||
| will be assigned a predicted motion-state label, static or dynamic, which leads to further mask refining and a second | ||||
| around of tracking. | ||||
| 
 | ||||
| 3.1 System Overview | ||||
| 
 | ||||
| The system installation is illustrated in Figure 2 and the system pipeline is shown in Figure 3. Inputs are stereo frames | ||||
| (left image and right image) captured by a stereo camera. Then semantic information, including object labels and | ||||
| bounding boxes, are extracted using learning-based object detection. In addition, a hierarchical mask generation | ||||
| approach is proposed to balance mask accuracy and generation speed. Object masks exclude suspicious dynamic objects | ||||
| from the static background. The features in the static background are then used in the initial tracking of camera pose. | ||||
| 
 | ||||
| After initial tracking, a rough pose of the new frame is known, with which we distinguish static objects from other | ||||
| objects. This is done by triangulating object-level 3D key points in the reference and current frame and comparing the | ||||
| 3D position errors to distinguish whether the object is moving or not. Large static objects can provide more salient static | ||||
| features for improving tracking accuracy. Dynamic objects will be kept masked in the second ego-motion tracking. | ||||
| This two-round coarse-to-fine tracking scheme helps detect static objects and improve pose estimation accuracy. | ||||
| 
 | ||||
| After the second round of tracking, there will be mapping and pose graph optimization steps as most of state-of-the-art | ||||
| vSLAM algorithms do. | ||||
| 
 | ||||
| 3.2 Feature-based Camera Ego-motion Tracking by Masking Dynamic Occlusions | ||||
| 
 | ||||
| Camera ego-motion tracking framework used here is based on ORB-SLAM2 stereo [19]. When a new frame comes in, | ||||
| first, a constant velocity motion model is used to predict the new camera pose with which we can search for map points | ||||
| and 2D feature points matches. After enough matches are found, a new pose can be re-estimated by Perspective-n-point | ||||
| (PnP) algorithm [20]. Motion-only bundle adjustment (BA) is then used for further pose optimization. Motion-only | ||||
| BA estimates the camera pose of the new stereo frame, including orientation R ∈ SO (3) and position t ∈ R3, by | ||||
| minimizing the reprojection error between matched 3D points xi ∈ R3 in the SLAM coordinates and feature points | ||||
| pi(.) in the new frame, where i = 1, 2, ..., N . pi(.) include monocular feature points pm i ∈ R2 and stereo feature points | ||||
| pis ∈ R3. | ||||
| Now supposing M out of N 3D points are on a rigid body dynamic object that had a pose change R , t in the physical | ||||
| world, and their 3D coordinates change from xi to xi, for i = 1, 2, ..., M . The rigid body transformation can be | ||||
| 
 | ||||
|                                                                       4 | ||||
|                                                                           A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| Figure 3: An overview of the proposed system. Inputs are stereo frames (all the processes are on the left image. Right | ||||
| image is only for triangulating 3D points). After semantic information extraction, occlusion masks of objects are | ||||
| generated and used in filtering potential dynamic features. The initial ego-motion tracking is based purely on the static | ||||
| background. Then more static objects are found and used as references in the second round of tracking to get more | ||||
| accuracy results. The final output is the camera pose R and t of the current frame in the SLAM coordinates. | ||||
| 
 | ||||
| expressed as xi = R xi + t . Pose change estimation can be expressed as: | ||||
| 
 | ||||
|          M                                                                2 | ||||
| 
 | ||||
| {R, t, R , t } = arg min ρ                 pi(.) − π(.) (R (R xi + t ) + t) Σ | ||||
| 
 | ||||
|                              R,t,R ,t i=1                                      (1) | ||||
| 
 | ||||
|    N                                          2 | ||||
| 
 | ||||
| +        ρ pi(.) − π(.) (Rxi + t) Σ , | ||||
| 
 | ||||
|    M +1 | ||||
| 
 | ||||
| where ρ is the robust Huber cost function that controls the error growth of the quadratic function, and Σ is the covariance | ||||
| matrix associated to the scale of the feature point. The project functions π(.) include monocular πm and rectified stereo | ||||
| 
 | ||||
|                                            5 | ||||
|                                                                            A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| πs, as defined in [19]: | ||||
| 
 | ||||
|                         πm  X                =  fxX/Z + cx   =  ul  ,            (2) | ||||
|                             Y                   fyY /Z + cy     vl | ||||
|                             Z | ||||
| 
 | ||||
|                         X                       fxX/Z + cx          ul | ||||
| 
 | ||||
|                         πs Y =                  fyY /Z + cy     = vl ,           (3) | ||||
| 
 | ||||
|                         Z                 fx (X − b) /Z + cx        ur | ||||
| 
 | ||||
| where (fx, fy) is the focal length, (cx, cy) is the principal point and b the baseline. (ul, vl) represents the monocular | ||||
| feature points and (ul, vl, ur) the stereo feature points. | ||||
| 
 | ||||
| However, solving this equation (1) is not easy, not to mention that there could be more than one dynamic objects in real | ||||
| world. If we only want to estimate R, t, equation (1) can be simplified to: | ||||
| 
 | ||||
|                                           N                                2 | ||||
| 
 | ||||
|                         {R, t} = arg min        ρ pi(.) − π(.) (Rxi + t)      ,  (4) | ||||
| 
 | ||||
|                             R,t i=M +1                                     Σ | ||||
| 
 | ||||
| which means only using static points in the scene to estimate the camera pose. If dynamic feature points as moving | ||||
| outliers are not excluded, the estimation result will be wrong. | ||||
| 
 | ||||
| To separate static and dynamic feature points, our approach is to use a binary image as mask (for the left image of the | ||||
| input stereo frame). The mask has the same size as the input image, and pixels with value 0 indicate static area, while | ||||
| pixels with value 1 indicate dynamic area. Suppose that Imask (u, v) refers to a pixel in the mask image Imask. Sp is a | ||||
| set of static pixels and Dp is a set of dynamic pixels, | ||||
| 
 | ||||
|                                             0,  Imask (u, v) ∈ Sp       .        (5) | ||||
|                         Imask (u, v) = 1,       Imask (u, v) ∈ Dp | ||||
| 
 | ||||
| Figure 4 shows examples of mask (with alpha blending). To generate a mask, we first get bounding boxes or pixel-wise | ||||
| segmentation results from learning-based object detection and instance segmentation (Subsection 3.3). Then, for those | ||||
| objects with a priori dynamic semantic label such as "car", "person", "truck", etc., we change the pixels’ value to 1 in | ||||
| the mask image, while keeping the others as 0. We also apply geometrical measurement and calculate a motion-state | ||||
| label for every object bounding box. Inside a static bounding box, we change the pixels’ value to 0 whatever it was | ||||
| (Subsection 3.4). Later during ego-motion tracking period, only the areas where the mask value equals 0 will be used to | ||||
| estimate camera pose as described by Equation (4). | ||||
| 
 | ||||
| 3.3 Hierarchical Object Masking | ||||
| 
 | ||||
| The switching between two types of masks forms a hierarchical masking strategy that balances computation speed and | ||||
| mask accuracy. | ||||
| 
 | ||||
| To reduce computation cost, we first used object detectors, e.p. EfficientDet [21], to predict object instances and | ||||
| recognize their bounding boxes. Such learning-based object detector is a deep neural network, which can predict all the | ||||
| bounding boxes, class labels, and class probabilities directly from an image in one evaluation. A bounding box only | ||||
| represents a rough boundary of the object, so when using it as an object mask, background feature points inside the | ||||
| rectangle are also classified as "object". It is, therefore, only a rough boundary description. | ||||
| 
 | ||||
| There were cases when bounding boxes occupied most of the area in the image, which led to a shortage of available static | ||||
| features, and thus the accuracy of the ego-motion tracking declined. In such cases, we perform pixel-wise segmentation | ||||
| on the image to save more static features. For pixel-wise segmentation, we also use deep learning approaches, such as | ||||
| Mask R-CNN [22]. Pixel-wise segmentation takes more time and slows down the system output rate. Thus, only in | ||||
| extreme cases when the frame is so crowded with object bounding boxes, should pixel-wise segmentation be performed. | ||||
| 
 | ||||
|                                                                       6 | ||||
|               A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
|                                         Figure 4: Two kinds of masks and masked features. | ||||
| 
 | ||||
| Algorithm 1: Hierarchical Masking | ||||
| 
 | ||||
| Input: stereo images in current frame, Icl, Icr; Mased Area Ratio threshold, τmar. | ||||
| Output: image mask for the left image in current frame, Imask. | ||||
| 
 | ||||
|      Initialisation: a blank image mask, Imask; initial Masked Area Ratio as 0, mar = 0; | ||||
|  1: Imask=objectDetectionAndMasking(Icl); | ||||
|  2: mar=calMaskedAreaRatio(Imask); | ||||
|  3: if (mar ≥ τmar) then | ||||
|  4: Imask=pixelwiseSegmentationAndMasking(Icl); | ||||
|  5: end if | ||||
|  6: return Imask | ||||
| 
 | ||||
| The switching to pixel-wise segmentation is controlled by an index named Masked Area Ratio (mar). If Am is the total | ||||
| area of bounding boxes in pixels and Af is the total area of the image in pixels, then we have, | ||||
| 
 | ||||
| mar = Am .                                                                                 (6) | ||||
|           Af | ||||
| 
 | ||||
| If mar is larger than the threshold τmar, it means the current frame is quite crowded and pixel-wise segmentation is | ||||
| necessary. | ||||
| 
 | ||||
| Hierarchical object masking is concluded as following: when we get one frame input, we first use object detector | ||||
| performing object detection and obtain bounding boxes. Then mar is calculated. If mar is higher than a pre-set | ||||
| threshold τmar, then we perform pixel-wise segmentation and output the pixel-wise object mask. If mar is smaller than | ||||
| the threshold, then the bounding box mask are directly forwarded as object mask. This algorithm is summarized in | ||||
| Algorithm 1. | ||||
| 
 | ||||
| 3.4 Objects’ Motion-state Classification for Further Mask Refinement | ||||
| 
 | ||||
| After the first ego-motion tracking, with reference to the background, we roughly know the pose of the current frame. | ||||
| Based on the current pose, we triangulate object-level 3D points on all the detected object instances in the current frame | ||||
| and a selected reference frame and distinguish whether they have moved. Feature points inside static bounding boxes | ||||
| are then unmasked and used as valid static references in the second round of tracking. This algorithm (Algorithm 2) | ||||
| named motion-state classification is detailed in the following. | ||||
| 
 | ||||
| To classify objects’ motion-state, first, a reference frame needs to be selected from previous frames. In this work, we | ||||
| used the N -th frame before the current frame as reference frame. N is determined based on the machines’ velocity. | ||||
| For example, for vibration rollers moving at 4 km/h mostly, FPS/3 to FPS/2 can be selected as N (FPS stands for | ||||
| the frame rate of camera recording, namely Frame Per Second). For domestic automobiles running at higher speed, | ||||
| N should be selected smaller so that there is appropriate visual change between current and reference frame. This | ||||
| strategy is simple but effective, given the simple moving pattern of construction machines. There are more sophisticated | ||||
| methods for selecting the best reference frame as stated in [2] and [18]. | ||||
| 
 | ||||
| Then, suppose there are objects {obj1, obj2, ..., objm} in the reference frame (RF) and objects {obj1, obj2, ..., objn} | ||||
| in the current frame (CF). We associate the m objects in RF with the n objects in CF by feature matching. If the | ||||
| 
 | ||||
| 7 | ||||
|                                                                                                    A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| Figure 5: Associate bounding boxes between the Reference Frame (RF) and Current Frame (CF) using feature matching. | ||||
| Triangulate object-level 3D points in RF, then triangulate corresponding 3D points in CF and compare their positions in | ||||
| the two measurements. If most of point-wise position errors of an object (bounding box) are smaller than three times | ||||
| the standard variation of static background points, the object is labeled as ‘static’ during camera pose change from RF | ||||
| to CF. | ||||
| 
 | ||||
|                                   Figure 6: Algorithm 2: Objects’ Motion-state Classification. | ||||
| object instances are associated successfully between two frames, which means the object is co-visible in both frames, | ||||
| we triangulate 3D points within the bounding boxes in both frames in SLAM coordinates and calculate point-wise | ||||
| position errors. 3D points’ position errors of static background are assumed to obey zero-mean Gaussian distribution. | ||||
| The standard deviation, σbkg, is determined beforehand and used as the threshold for classification. For static objects, | ||||
| principally all 3D points’ position error should be less than three times of σbkg. But considering the inaccuracy of a | ||||
| bounding box, we loosened the condition to 70%, i.e. objects are classified as "static" when more than 70% of its 3D | ||||
| points have a position error smaller than (3 × σbkg). However, outliers of feature matching usually result in very large | ||||
| position errors. We only keep points with position error smaller than the median to exclude outliers. Figure 5 shows the | ||||
| 
 | ||||
|                                                                       8 | ||||
|                                  A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| (a) Construction site bird view  (b) Vibration roller | ||||
| 
 | ||||
| Figure 7: Experiment setting. | ||||
| 
 | ||||
| principle of the geometric constraint, the left one is a dynamic object and the right one is a static object. Figure 6 shows | ||||
| input and output as well as main ideas of Algorithm 2. Details about how to implement this algorithm can be found in | ||||
| our code repository. | ||||
| 
 | ||||
| 4 Experimental Evaluations | ||||
| 
 | ||||
| 4.1 Testing Environments and Datasets | ||||
| 
 | ||||
| To evaluate our proposed approaches, we conducted experiments at two construction sites in Japan with a machine | ||||
| called vibration roller as shown in Figure 7(b). Vibration roller is used to flatten the earthen basement of structures and | ||||
| facilities. For efficiency of work, there are usually multiple rollers running simultaneously and side by side, thus large | ||||
| moving occlusions become a serious problem for visual SLAM. | ||||
| 
 | ||||
| In all experiments, a stereo camera was mounted on the cabin top of a roller facing the side. The baseline of the stereo | ||||
| camera was about 1 m. The roller moved along a typical trajectory (Figure 7(a)) with maximum speed of 11 km/h. The | ||||
| ground truth trajectories were recorded using RTK-GNSS. We synchronized ground truth and estimated camera poses | ||||
| by minimizing Absolute Trajectory RMSE ([2, 19, 23]) and choosing appropriate time offsets between GNSS’s and | ||||
| the camera’s timer. Then the estimated camera trajectories will be aligned with ground truth trajectories by Umeyama | ||||
| algorithm [24]. We evaluate the accuracy of camera pose outputs of the vSLAM system with reference to the associated | ||||
| ground truth by Absolute Trajectory RMSE (AT-RMSE). | ||||
| 
 | ||||
| Video data were collected at the site and evaluated in the lab. Image resolution was 3840 × 2160, and frame rate was 60 | ||||
| fps. For efficient evaluation, we downsampled the image sequences to 960 × 540 and 6 fps. We eventually collected | ||||
| five image sequences, three with dynamic machines inside, the 4th one containing only two static machines, and the | ||||
| 5th one was without any occlusions. The specifications of the computer being used were Intel Core i7-9700K CPU, | ||||
| and NVIDIA GeForce RTX 2080 Ti GPU. We used a tool provided by [25] for trajectory accuracy evaluation and | ||||
| visualization. | ||||
| 
 | ||||
| When evaluating our vSLAM system implemetation, all the masks including bounding box and pixel-wise masks | ||||
| are generated beforehand using EfficientDet [21] and Detectron2 [26] version of Mask R-CNN [22]. EfficientDet is | ||||
| reported to be able to prioritize detection speed or detection accuracy through configuration. In our implementation, we | ||||
| used EfficientDet-D0 and the weights were trained on MS COCO dataset [27]. The weights for Mask R-CNN are also | ||||
| trained on MS COCO datasets [27]. Without fine-tuning, they are already good enough for this study. Besides, when | ||||
| calculating overall computation time per frame, we record time consumption for vSLAM tracking part as well as mask | ||||
| generation part respectively, and then add them together. Note that in hierarchical masking, the additional time caused | ||||
| by pixel-wise segmentation will be averaged into all the frames. | ||||
| 
 | ||||
|                                                                       9 | ||||
|                                                                               A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| (a) Absolute position error of every camera pose         (b) Camera trajectory with colormapped position error | ||||
| 
 | ||||
| Figure 8: Quantitative evaluation for estimated trajectory of image sequence 1 "kumamoto1". | ||||
| 
 | ||||
|                       Table 1: Details about the five image sequences. | ||||
| 
 | ||||
|    Dataset details     kumamoto1       kumamoto2               chiba1              chiba2            chiba3 | ||||
| Max. occlusion ratio       0.493           0.445                0.521               0.633               0.0 | ||||
|                            0/1263          0/1186              12/647              69/668 | ||||
|  MAR>0.5 frames                                              0 to 4 km/h        0 to 4 km/h           0/708 | ||||
|   Machines’ speed       0 to 4 km/h     0 to 4 km/h                                                0 to 4 km/h | ||||
|                           1 roller        1 roller       1 roller (dynamic)   2 rollers (static) | ||||
|     Occlusions &                                           1 roller (static)                      no occlusions | ||||
|  their motion-states     (dynamic)       (dynamic) | ||||
|                        7 color cones   7 color cones | ||||
| 
 | ||||
|                            (static)        (static) | ||||
|                       1 checkerboard  1 checkerboard | ||||
| 
 | ||||
|                            (static)        (static) | ||||
| 
 | ||||
| 4.2 Performance Evaluation with Our Construction Site Datasets | ||||
| 
 | ||||
| Figure 8(a) shows the absolute position error of every camera pose between the estimated trajectory using the proposed | ||||
| system and ground truth of a sequence (kumamoto2). Figure 8(b) is a bird’s eye view of the camera trajectory with | ||||
| colormapped absolute position error. There are totally 5 sequences prepared, we repeat such evaluation 10 times for | ||||
| each sequence. The details about the five sequences are described in Table 1. Figure 9(a) shows the distribution of | ||||
| 
 | ||||
| (a) Estimated trajectory accuracy (lower is better)      (b) Averaged computation speed (lower is better) | ||||
| 
 | ||||
|                       Figure 9: Performance comparison on our construction site datasets. | ||||
| 
 | ||||
|                                                      10 | ||||
|                                                       A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| (a) Three machines working parallelly to each other.  (b) From view point of the on-board camera | ||||
| 
 | ||||
| Figure 10: Dynamic scene and hierarchical masking example. | ||||
| 
 | ||||
| Absolute Trajectory RMSE of all five sequences. We compare our proposed system with a simple baseline system, with | ||||
| DynaSLAM [2] and with the original ORB-SLAM2 stereo. The baseline system is also based on ORB-SLAM2 but | ||||
| is able to detect and remove moving objects. Its “moving object removal” method is derived from Detect-SLAM [9], | ||||
| which performs bounding box detection and masks all movable bounding boxes detected. In the results, our proposed | ||||
| system shows better trajectory accuracy in 3 sequences out of five comparing to the baseline, including kumamoto1, | ||||
| chiba1 and chiba3. If the baseline represents fast and efficient handling of dynamic objects, DynaSLAM is much | ||||
| heavier computationally. But the motion segmentation method in DynaSLAM is pixel-level precise and indeed the | ||||
| current state-of-the-art. The experiment results shows that, DynaSLAM does show slight superiority of trajectory | ||||
| accuracy in sequences including kumamoto1, chiba1. The original ORB-SLAM2 stereo can only survive chiba2 and | ||||
| chiba3, which are completely static. In addition, trajectory accuracy of chiba2 and chiba3 are generally better than | ||||
| those of dynamic sequences, no matter which method. Dynamic occlusions do cause irreversible influence on camera | ||||
| ego-motion tracking. | ||||
| 
 | ||||
| Averaged computation speed comparisons are shown in Figure 9(b). Our proposed system is relatively slow than the | ||||
| baseline and orb-slam2 stereo at the beginning. However, our method is able to be significantly accelerated by utilizing | ||||
| parallel computing such as GPU acceleration. In our implementation named "ours_gpu" in Figure 9, we enabled GPU | ||||
| acceleration for all the ORB feature extractions, and the speed can be improved notably. However, the trajectory | ||||
| accuracy became different from "ours" to a certain extent, although theoretically they should be the same. We are still | ||||
| looking for the root cause. Finally, time cost of DynaSLAM (only tracking, without background inpainting) is 2 to 3 | ||||
| times of ours_gpu. Large computation latency is not preferable, since our targeted task is real-time positioning and | ||||
| navigation of a construction machine. | ||||
| 
 | ||||
| 4.3 Ablation Study | ||||
| 
 | ||||
| 4.3.1 Hierarchical Object Masking | ||||
| 
 | ||||
| Hierarchical masking aims to efficiently propose an appropriate initial mask in case there are overly-large dynamic | ||||
| occlusions in the image. Figure 10(a) shows a scene when the machine was working along with two other machines and | ||||
| thus had two large occlusions in the camera view. Figure 10(b) shows a sample image recorded by the on-board camera. | ||||
| Notice that the two rectangles labeled as truck are bounding boxes detected by object detection algorithm, and the | ||||
| color masks inside the bounding boxes are by pixel-wise segmentation. Besides, ORB feature points are extracted | ||||
| and marked on this image. Green points are static features on the static background, blue points are those included | ||||
| by bounding boxes but not included by pixel-wise masks, and red points are features masked by pixel-wise masks. | ||||
| It is obvious that bounding box mask causes many innocent static features being treated as dynamic. Through a toy | ||||
| experiment, we can see how it will cause shortage of available feature points and lead to worse pose tracking accuracy. | ||||
| Then by a real example in our datasets, we explain the effectiveness of hierarchical masking. | ||||
| 
 | ||||
| 11 | ||||
|                                                  A PREPRINT - JANUARY 19, 2021 | ||||
| 
 | ||||
| Figure 11: A toy experiment: estimated trajectory accuracy when putting different sizes of occlusions on the 4th image | ||||
| sequence "chiba2". | ||||
| 
 | ||||
| Table 2: Tracking accuracy of "chiba2" with three different mask types. | ||||
| 
 | ||||
| Mask type          AT-RMSE, m              Max. occlusion ratio | ||||
|                    (average of 10 trials) | ||||
| 
 | ||||
| B-box mask         0.0437                  0.63 | ||||
| 
 | ||||
| Hierarchical mask  0.0404                  0.50 | ||||
| 
 | ||||
| Pixel-wise mask    0.0397                  0.32 | ||||
| 
 | ||||
| (1) A toy experiment | ||||
| We put a fake constant dynamic occlusion at the center of the mask images of the 4th image sequence chiba2 (static | ||||
| scene). And we adjusted the size of this area to simulate different occlusion ratio and see how the result trajectory | ||||
| accuracy changes. The result is plotted in Figure 11. Before occlusion ratio reaches 0.6, trajectory error only varies | ||||
| over a small range; when occlusion ratio exceeds 0.7, the RMSE increases exponentially due to shortage of available | ||||
| features. Therefore, when occlusion ratio of the image approaches the critical point of 0.6, we define it as a large | ||||
| occlusion condition, requiring the refinement of the bounding box mask to a pixel-wise mask to suppress the growing | ||||
| error. Besides, when occlusion ratio is larger than 0.6, tracking lost will frequently happen which is not preferred when | ||||
| navigating a construction machine. To avoid tracking lost and relocalization, we set the threshold (τmar in section 3.3) | ||||
| to 0.5 as a safty limit. | ||||
| 
 | ||||
| However, when the occlusion ratio is far smaller than 0.6, the bounding box mask is sufficient and also faster to obtain. With our | ||||
| computer, generating bounding box masks for one image frame takes 0.0207 seconds on average, while a pixel-wise | ||||
| mask takes 0.12 seconds. | ||||
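| This selection policy boils down to a small decision rule. The following is a minimal sketch of the idea, assuming | ||||
| hypothetical helpers detect_bboxes and segment_pixelwise for the two mask sources and the τmar = 0.5 threshold above; | ||||
| it is an illustration, not the exact implementation used in our system. | ||||
| 
 | ||||
| import numpy as np | ||||
| 
 | ||||
| TAU_MAR = 0.5  # masked-area-ratio threshold (safety limit below the 0.6 critical point) | ||||
| 
 | ||||
| def occlusion_ratio(mask: np.ndarray) -> float: | ||||
|     """Fraction of image pixels covered by the dynamic-object mask.""" | ||||
|     return float(mask.sum()) / mask.size | ||||
| 
 | ||||
| def hierarchical_mask(image, detect_bboxes, segment_pixelwise): | ||||
|     """Start from cheap bounding-box masks (~0.02 s/frame) and refine to a | ||||
|     pixel-wise mask (~0.12 s/frame) only when the masked area grows too large.""" | ||||
|     h, w = image.shape[:2] | ||||
|     bbox_mask = np.zeros((h, w), dtype=bool) | ||||
|     for x0, y0, x1, y1 in detect_bboxes(image): | ||||
|         bbox_mask[y0:y1, x0:x1] = True | ||||
|     if occlusion_ratio(bbox_mask) <= TAU_MAR: | ||||
|         return bbox_mask              # the cheap mask is good enough | ||||
|     return segment_pixelwise(image)   # refined mask with a smaller occlusion ratio | ||||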
| 
 | ||||
| (2) An overly large occlusion case | ||||
| 
 | ||||
| In order to demonstrate the effectiveness of hierarchical masking when facing overly large occlusions, we show an | ||||
| example in sequence "chiba2". From the 3500th frame to the 4500th frame (1000 frames in the original 60 fps sequence) | ||||
| of "chiba2", we encountered an overly large occlusion. As Table 2 shows, when changing from the bounding box | ||||
| mask to the pixel-wise mask, the maximum masked area ratio dropped from 0.63 to 0.32 and, correspondingly, the trajectory | ||||
| error decreased. Hierarchical masking benefits trajectory accuracy, and it costs much less time than using only the | ||||
| pixel-wise mask: in this example, only 2/3 of the frames during this period require a pixel-wise mask, and the | ||||
| maximum masked area ratio is constrained within 0.5. Note that although the Absolute Trajectory RMSE difference | ||||
| between 0.0404 and 0.0437 in Table 2 seems trivial, this is partially due to the trajectory alignment algorithm | ||||
| [24] we used for evaluation; the actual accuracy difference can be larger. | ||||
| 
 | ||||
| 4.3.2 Objects’ Motion-state Classification | ||||
| 
 | ||||
| Not all a priori dynamic objects are moving. Ignoring static objects leads to loss of information, especially when | ||||
| they are salient and occupy a large area in the image. Therefore, we designed the objects’ motion-state classification | ||||
| algorithm to detect static objects and unmask them for ego-motion tracking. Figure 12 shows dynamic and static objects | ||||
| detected in the image sequences and scores related to the likelihood that they are dynamic. We also show an example | ||||
| 
 | ||||
| 
 | ||||
| Figure 12: Illustration of the classification result. In the left column, the third row shows that there is one machine | ||||
| classified as "moving" and another classified as "static" in this frame. The second row shows the position errors of | ||||
| 3D points on these two machines between this frame and the reference frame. Points on the "moving" machine have | ||||
| higher position errors. Similarly, in the right column, there are also "moving" machines (two parts of one machine) and | ||||
| a "static" color cone detected. | ||||
| 
 | ||||
| Table 3: Tracking accuracy with motion-state classification. | ||||
| 
 | ||||
| Mask type                 AT-RMSE, m   Max. occlusion ratio | ||||
| All objects masked        0.04973      0.63 | ||||
| Static objects unmasked   0.04198      0.0 | ||||
| 
 | ||||
| of using the proposed algorithm in visual SLAM. Again, we use the 3500th frame to the 4500th frame (1000 frames) of the | ||||
| "chiba2" sequence; since the machines are completely static during this period, they are detected as static and unmasked. | ||||
| Table 3 shows how this influences the tracking accuracy. | ||||
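| As a rough illustration of how such a classification can be scripted (the exact criterion of our algorithm in Section 3.4 is | ||||
| not reproduced here), one can compare an object's 3D point position errors against the background threshold σbkg. The | ||||
| helper names, array shapes, and the decision margin below are assumptions made purely for this sketch. | ||||
| 
 | ||||
| import numpy as np | ||||
| 
 | ||||
| SIGMA_BKG = 0.12  # empirically chosen background-error threshold (m) | ||||
| 
 | ||||
| def classify_motion_state(pts_ref, pts_cur, T_ref_cur, sigma_bkg=SIGMA_BKG): | ||||
|     """Label an a priori dynamic object as 'static' or 'moving' by comparing the | ||||
|     position errors of its 3D points (w.r.t. the reference frame) to sigma_bkg. | ||||
|     pts_ref, pts_cur: (N, 3) object points in the reference / current frame, | ||||
|     T_ref_cur: 4x4 pose of the current camera expressed in the reference frame.""" | ||||
|     pts_h = np.hstack([pts_cur, np.ones((len(pts_cur), 1))]) | ||||
|     pts_in_ref = (T_ref_cur @ pts_h.T).T[:, :3]      # express current points in the reference frame | ||||
|     errors = np.linalg.norm(pts_in_ref - pts_ref, axis=1) | ||||
|     median_err = float(np.median(errors)) | ||||
|     if median_err < sigma_bkg: | ||||
|         return "static"                              # points agree with the static background | ||||
|     if median_err > 3.0 * sigma_bkg:                 # margin chosen only for illustration | ||||
|         return "moving" | ||||
|     return "unclassified"                            # ambiguous objects are ignored | ||||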
| 
 | ||||
| However, there is still one bottleneck in this algorithm: σbkg is an essential parameter for the performance of motion- | ||||
| state classification. For all the evaluations above with the four image sequences, we set σbkg to 0.12, which was | ||||
| determined empirically. To explore the influence of this parameter on system performance, we vary σbkg between 0 | ||||
| and 0.6 and evaluate the classifier in terms of ROC (Receiver Operating Characteristics). Since the final target is to find | ||||
| static objects, "static" is regarded as positive and "dynamic" as negative, ignoring objects that cannot be classified. The | ||||
| ROC curve is shown in Figure 13. The true positive rate (TPR, sensitivity) on the y axis is the ratio of true positives | ||||
| to the sum of true positives and false negatives. The false positive rate on the x axis is the ratio of false positives to the | ||||
| sum of false positives and true negatives. According to this curve, the Area Under the Curve (AUC) reached 0.737, | ||||
| which shows it to be a valid classifier. The red dot in the plot is the position where σbkg = 0.12. | ||||
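| For reference, the ROC points for such a threshold sweep can be computed along the following lines; the per-object | ||||
| motion scores and ground-truth labels are placeholders here, and this is only a sketch of the evaluation, not our script. | ||||
| 
 | ||||
| import numpy as np | ||||
| 
 | ||||
| def roc_from_threshold_sweep(scores, gt_is_static, thresholds): | ||||
|     """Sweep the sigma_bkg threshold over per-object motion scores (lower score = | ||||
|     more likely static) and collect (FPR, TPR) points, treating 'static' as positive.""" | ||||
|     gt = np.asarray(gt_is_static, dtype=bool) | ||||
|     scores = np.asarray(scores, dtype=float) | ||||
|     fprs, tprs = [], [] | ||||
|     for th in thresholds: | ||||
|         pred_static = scores < th | ||||
|         tp = np.sum(pred_static & gt) | ||||
|         fp = np.sum(pred_static & ~gt) | ||||
|         fn = np.sum(~pred_static & gt) | ||||
|         tn = np.sum(~pred_static & ~gt) | ||||
|         tprs.append(tp / max(tp + fn, 1))     # sensitivity | ||||
|         fprs.append(fp / max(fp + tn, 1)) | ||||
|     order = np.argsort(fprs) | ||||
|     auc = np.trapz(np.asarray(tprs)[order], np.asarray(fprs)[order])  # area under the curve | ||||
|     return np.asarray(fprs), np.asarray(tprs), auc | ||||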
| 
 | ||||
| 4.4 Evaluation with KITTI Dataset | ||||
| 
 | ||||
| The KITTI Dataset [28] provides stereo camera sequences in outdoor urban and highway environments. It has been a | ||||
| widespread benchmark for evaluating vSLAM system performance, especially trajectory accuracy. Works such as | ||||
| [2, 19] all provide evaluation results with KITTI. Some sequences in KITTI contain normal-size dynamic | ||||
| occlusions, so KITTI is also appropriate for evaluating our method. Table 4 shows the evaluation results. The | ||||
| comparison includes four systems: our proposed system, the baseline, DynaSLAM, and ORB-SLAM2 stereo, as in | ||||
| Section 4.2. For the baseline, DynaSLAM, and ORB-SLAM2 stereo, all the settings remain the same as before. For our | ||||
| system, τmar (Section 3.3) remains 0.5 and σbkg (Section 3.4) remains 0.12. However, N (Section 3.4) is | ||||
| changed to 2, since the frame rate of KITTI is 10 fps and the cars are much faster than our construction machines. We | ||||
| ran each sequence 10 times with each system and recorded the averaged Absolute Trajectory RMSE (AT-RMSE, m) as | ||||
| well as the averaged computation time per frame (s). For our system, we recorded results both with GPU acceleration (w | ||||
| A) and without GPU acceleration (w/o A). Among the four systems, the best AT-RMSE for each sequence is marked | ||||
| in bold font and the best computation time in bold, italic font. Note that the AT-RMSE results of DynaSLAM | ||||
| 
 | ||||
| 
 | ||||
| Figure 13: ROC curve for the motion-state classification when σbkg was between 0 and 0.6, estimated with the 3rd | ||||
| image sequence "chiba1". The Area Under Curve (AUC) reached 0.737. Red dot is the position where σbkg = 0.12. | ||||
| 
 | ||||
|           Table 4: Trajectory accuracy and time consumption evaluation on KITTI Dataset. | ||||
| 
 | ||||
|                      ours (w/o A / w A)                        baseline            dynaslam (tracking)   orb-slam2 | ||||
| Sequence   AT-RMSE (m)        time per frame (s)     AT-RMSE   time per   AT-RMSE   time per    AT-RMSE   time per | ||||
|                                                      (m)       frame (s)  (m)       frame (s)   (m)       frame (s) | ||||
| KITTI 00   2.1290 / 1.7304    0.2018 / 0.1565        2.0173    0.0912     3.9691    0.3354      1.7304    0.0703 | ||||
| KITTI 01   8.4940 / 8.7620    0.1860 / 0.1305        9.1271    0.0917     21.8982   0.3273      8.7620    0.0734 | ||||
| KITTI 02   5.1759 / 4.7338    0.1764 / 0.1194        4.9280    0.0935     5.9401    0.3243      4.9994    0.0771 | ||||
| KITTI 03   3.2169 / 3.4246    0.1462 / 0.0983        3.1174    0.0898     4.7770    0.3459      3.0735    0.0723 | ||||
| KITTI 04   1.0835 / 1.2937    0.1811 / 0.1297        0.9970    0.0864     1.3371    0.3420      1.0079    0.0672 | ||||
| KITTI 05   2.1243 / 2.2529    0.1915 / 0.1382        2.0528    0.0923     1.7644    0.3482      1.9751    0.0717 | ||||
| KITTI 06   2.1718 / 2.2651    0.2076 / 0.1546        1.9338    0.0943     2.0627    0.3434      1.8793    0.0752 | ||||
| KITTI 07   1.2323 / 1.3159    0.1791 / 0.1337        1.1799    0.0843     1.1285    0.3493      0.9733    0.0632 | ||||
| KITTI 08   4.5641 / 5.2294    0.1945 / 0.1445        4.7857    0.0882     3.7062    0.3488      4.6483    0.0675 | ||||
| KITTI 09   4.9692 / 5.8698    0.1760 / 0.1231        7.1441    0.0865     4.2753    0.3463      5.9788    0.0657 | ||||
| KITTI 10   2.5849 / 2.6375    0.1522 / 0.1022        2.6986    0.0912     2.2028    0.3466      2.6699    0.0631 | ||||
| 
 | ||||
| and ORB-SLAM2 stereo differ from those in the original papers. This is because we only align the trajectory with the ground | ||||
| truth, without adjusting scale, before calculating the trajectory error, since our target is online positioning with vSLAM. | ||||
| 
 | ||||
| From Table 4, we see that in terms of computation speed, ORB-SLAM2 stereo is always the best, because it adopts | ||||
| the static-environment assumption. DynaSLAM is the slowest. Ours is slightly slower than the baseline and | ||||
| ORB-SLAM2 stereo; however, GPU acceleration improves the speed to a tolerable level. In terms of | ||||
| AT-RMSE, the results vary, but DynaSLAM and ORB-SLAM2 stereo have the most bold numbers. | ||||
| In the KITTI dataset, there are moving automobiles, bicycles, and persons in some frames, but they are not overly large. | ||||
| In fact, there are only 6 frames in "07" in which the occlusion ratio exceeds 0.5. Besides, automobiles on | ||||
| the street do not contain as many salient feature points as construction machines; most of them have texture-less, | ||||
| smooth surfaces. Therefore, our proposed system is not advantageous on KITTI. | ||||
| 
 | ||||
| 5 Conclusions & Future Work | ||||
| 
 | ||||
| We presented a stereo vSLAM system for dynamic outdoor construction sites. The key contributions are, first, a | ||||
| hierarchical masking strategy that efficiently refines overly large occlusion masks in time, and second, a semantic- | ||||
| geometric approach for objects’ motion-state classification together with a two-step coarse-to-fine ego-motion tracking scheme. | ||||
| Our system accurately retrieved the motion trajectories of a stereo camera at construction sites, and most of the | ||||
| surrounding objects’ motion-states in the scene were correctly predicted. Hierarchical object masking has also been | ||||
| 
 | ||||
| 
 | ||||
| proved to be a simple but useful strategy. Our proposed system can deal with dynamic and crowded environments in which | ||||
| standard vSLAM systems may fail to keep tracking. | ||||
| 
 | ||||
| In future work, the method for selecting reference frames can be optimized to enable more robust object motion-state | ||||
| classification. Moreover, we plan to combine vSLAM with an inertial measurement unit (IMU) sensor for higher-accuracy | ||||
| positioning. However, the fierce, high-frequency vibration of the vibration roller may cause severe noise in the | ||||
| IMU measurements, which would degrade pose accuracy. Therefore, we will look into this problem and, meanwhile, | ||||
| also explore other visual-SLAM-related topics at construction sites. | ||||
| 
 | ||||
| References | ||||
| 
 | ||||
|  [1] Runqiu Bao, Ren Komatsu, Renato Miyagusuku, Masaki Chino, Atsushi Yamashita, and Hajime Asama. Cost- | ||||
|       effective and robust visual based localization with consumer-level cameras at construction sites. In Proceedings of | ||||
|       the 2019 IEEE Global Conference on Consumer Electronics (GCCE 2019), pages 983–985, 2019. | ||||
| 
 | ||||
|  [2] Berta Bescos, José M Fácil, Javier Civera, and José Neira. Dynaslam: Tracking, mapping, and inpainting in | ||||
|       dynamic scenes. IEEE Robotics and Automation Letters, 3(4):4076–4083, 2018. | ||||
| 
 | ||||
|  [3] Mariano Jaimez, Christian Kerl, Javier Gonzalez-Jimenez, and Daniel Cremers. Fast odometry and scene flow | ||||
|       from rgb-d cameras based on geometric clustering. In Proceedings of the 2017 IEEE International Conference on | ||||
|       Robotics and Automation (ICRA 2017), pages 3992–3999, 2017. | ||||
| 
 | ||||
|  [4] Dan Barnes, Will Maddern, Geoffrey Pascoe, and Ingmar Posner. Driven to distraction: Self-supervised distractor | ||||
|       learning for robust monocular visual odometry in urban environments. In Proceedings of the 2018 IEEE | ||||
|       International Conference on Robotics and Automation (ICRA 2018), pages 1894–1900, 2018. | ||||
| 
 | ||||
|  [5] Binbin Xu, Wenbin Li, Dimos Tzoumanikas, Michael Bloesch, Andrew Davison, and Stefan Leutenegger. Mid- | ||||
|       fusion: Octree-based object-level multi-instance dynamic slam. In Proceedings of the 2019 IEEE International | ||||
|       Conference on Robotics and Automation (ICRA 2019), pages 5231–5237, 2019. | ||||
| 
 | ||||
|  [6] Ioan Andrei Bârsan, Peidong Liu, Marc Pollefeys, and Andreas Geiger. Robust dense mapping for large-scale | ||||
|       dynamic environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation | ||||
|       (ICRA 2018), pages 7510–7517, 2018. | ||||
| 
 | ||||
|  [7] Martin Runz, Maud Buffier, and Lourdes Agapito. Maskfusion: Real-time recognition, tracking and reconstruction | ||||
|       of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented | ||||
|       Reality (ISMAR 2018), pages 10–20, 2018. | ||||
| 
 | ||||
|  [8] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Qiao Fei. Ds-slam: A semantic visual slam | ||||
|       towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent | ||||
|       Robots and Systems (IROS 2018), pages 1168–1174. IEEE, 2018. | ||||
| 
 | ||||
|  [9] Fangwei Zhong, Sheng Wang, Ziqi Zhang, and Yizhou Wang. Detect-slam: Making object detection and slam | ||||
|       mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision | ||||
|       (WACV 2018), pages 1001–1010. IEEE, 2018. | ||||
| 
 | ||||
| [10] Linhui Xiao, Jinge Wang, Xiaosong Qiu, Zheng Rong, and Xudong Zou. Dynamic-slam: Semantic monocular | ||||
|       visual localization and mapping based on deep learning in dynamic environment. Robotics and Autonomous | ||||
|       Systems, 117:1–16, 2019. | ||||
| 
 | ||||
| [11] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Visual slam in human populated | ||||
|       environments: Exploring the trade-off between accuracy and speed of yolo and mask r-cnn. In Proceedings of the | ||||
|       2019 International Conference on Advanced Robotics (ICAR 2019), pages 135–140. IEEE, 2019. | ||||
| 
 | ||||
| [12] Muhamad Risqi U Saputra, Andrew Markham, and Niki Trigoni. Visual slam and structure from motion in | ||||
|       dynamic environments: A survey. ACM Computing Surveys (CSUR), 51(2):1–36, 2018. | ||||
| 
 | ||||
| [13] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications | ||||
|       to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. | ||||
| 
 | ||||
| [14] Zhaopeng Cui, Lionel Heng, Ye Chuan Yeo, Andreas Geiger, Marc Pollefeys, and Torsten Sattler. Real-time dense | ||||
|       mapping for self-driving vehicles using fisheye cameras. In Proceedings of the 2019 International Conference on | ||||
|       Robotics and Automation (ICRA 2019), pages 6087–6093, 2019. | ||||
| 
 | ||||
| [15] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. | ||||
| 
 | ||||
| [16] Abhijit Kundu, K Madhava Krishna, and Jayanthi Sivaswamy. Moving object detection by multi-view geometric | ||||
|       techniques from a single camera mounted robot. In Proceedings of the 2009 IEEE/RSJ International Conference | ||||
|       on Intelligent Robots and Systems (IROS 2009), pages 4306–4312, 2009. | ||||
| 
 | ||||
| 
 | ||||
| [17] Davide Migliore, Roberto Rigamonti, Daniele Marzorati, Matteo Matteucci, and Domenico G Sorrenti. Use a | ||||
|       single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. | ||||
|       In Proceedings of the 2009 ICRA Workshop on Safe navigation in open and dynamic environments: Application to | ||||
|       autonomous vehicles, pages 12–17, 2009. | ||||
| 
 | ||||
| [18] Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust monocular slam in dynamic | ||||
|       environments. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality | ||||
|       (ISMAR 2013), pages 209–218, 2013. | ||||
| 
 | ||||
| [19] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d | ||||
|       cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017. | ||||
| 
 | ||||
| [20] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular | ||||
|       slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015. | ||||
| 
 | ||||
| [21] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. arXiv | ||||
|       preprint arXiv:1911.09070, 2019. | ||||
| 
 | ||||
| [22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the 2017 IEEE | ||||
|       International Conference on Computer Vision (ICCV 2017), pages 2961–2969, 2017. | ||||
| 
 | ||||
| [23] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the | ||||
|       evaluation of rgb-d slam systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent | ||||
|       Robots and Systems (IROS 2012), pages 573–580. IEEE, 2012. | ||||
| 
 | ||||
| [24] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE | ||||
|       Transactions on Pattern Analysis & Machine Intelligence, (4):376–380, 1991. | ||||
| 
 | ||||
| [25] Michael Grupp. evo: Python package for the evaluation of odometry and slam. https://github.com/ | ||||
|       MichaelGrupp/evo, 2017. | ||||
| 
 | ||||
| [26] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https: | ||||
|       //github.com/facebookresearch/detectron2, 2019. | ||||
| 
 | ||||
| [27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and | ||||
|       C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of the 2014 European | ||||
|       Conference on Computer Vision (ECCV 2014), pages 740–755, 2014. | ||||
| 
 | ||||
| [28] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The | ||||
|       International Journal of Robotics Research, 32(11):1231–1237, 2013. | ||||
| 
 | ||||
|  | ||||
| @ -0,0 +1,724 @@ | ||||
|                                         TartanVO: A Generalizable Learning-based VO | ||||
| 
 | ||||
|                                         Wenshan Wang∗  Yaoyu Hu  Sebastian Scherer | ||||
| 
 | ||||
|                                         Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University | ||||
| 
 | ||||
| arXiv:2011.00359v1 [cs.CV] 31 Oct 2020 | ||||
|                                                   Abstract: We present the first learning-based visual odometry (VO) model, | ||||
|                                                   which generalizes to multiple datasets and real-world scenarios, and outperforms | ||||
|                                                   geometry-based methods in challenging scenes. We achieve this by leveraging | ||||
|                                                   the SLAM dataset TartanAir, which provides a large amount of diverse synthetic | ||||
|                                                   data in challenging environments. Furthermore, to make our VO model generalize | ||||
|                                                   across datasets, we propose an up-to-scale loss function and incorporate the cam- | ||||
|                                                   era intrinsic parameters into the model. Experiments show that a single model, | ||||
|                                                   TartanVO, trained only on synthetic data, without any finetuning, can be general- | ||||
|                                                   ized to real-world datasets such as KITTI and EuRoC, demonstrating significant | ||||
|                                                   advantages over the geometry-based methods on challenging trajectories. Our | ||||
|                                                   code is available at https://github.com/castacks/tartanvo. | ||||
| 
 | ||||
|                                                   Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow | ||||
| 
 | ||||
|                                         1 Introduction | ||||
| 
 | ||||
|                                         Visual SLAM (Simultaneous Localization and Mapping) is becoming increasingly important for | ||||
|                                         autonomous robotic systems due to its ubiquitous availability and the information richness of | ||||
|                                         images [1]. Visual odometry (VO) is one of the fundamental components in a visual SLAM system. | ||||
|                                         Impressive progress has been made in both geometric-based methods [2, 3, 4, 5] and learning-based | ||||
|                                         methods [6, 7, 8, 9]. However, it remains a challenging problem to develop a robust and reliable VO | ||||
|                                         method for real-world applications. | ||||
| 
 | ||||
|                                         On one hand, geometric-based methods are not robust enough in many real-life situations [10, 11]. | ||||
|                                         On the other hand, although learning-based methods demonstrate robust performance on many vi- | ||||
|                                         sual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical | ||||
|                                         flow, we have not yet seen the same story happening to VO. | ||||
| 
 | ||||
|                                         It is widely accepted that by leveraging a large amount of data, deep-neural-network-based methods | ||||
|                                         can learn a better feature extractor than engineered ones, resulting in a more capable and robust | ||||
|                                         model. But why haven’t we seen the deep learning models outperform geometry-based methods yet? | ||||
|                                         We argue that there are two main reasons. First, the existing VO models are trained with insufficient | ||||
|                                         diversity, which is critical for learning-based methods to be able to generalize. By diversity, we | ||||
|                                         mean diversity both in the scenes and motion patterns. For example, a VO model trained only on | ||||
|                                         outdoor scenes is unlikely to be able to generalize to an indoor environment. Similarly, a model | ||||
|                                         trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion, | ||||
|                                         is unlikely to be applicable to drones. Second, most of the current learning-based VO models neglect | ||||
|                                         some fundamental nature of the problem which is well formulated in geometry-based VO theories. | ||||
|                                         From the theory of multi-view geometry, we know that recovering the camera pose from a sequence | ||||
|                                         of monocular images has scale ambiguity. Besides, recovering the pose needs to take account of the | ||||
|                                         camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing | ||||
|                                         with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail | ||||
|                                         in another dataset, no matter how good the feature extractor is. | ||||
| 
 | ||||
|                                         To this end, we propose a learning-based method that can solve the above two problems and can | ||||
|                                         generalize across datasets. Our contributions come in three folds. First, we demonstrate the crucial | ||||
|                                         effects of data diversity on the generalization ability of a VO model by comparing performance on | ||||
|                                         different quantities of training data. Second, we design an up-to-scale loss function to deal with the | ||||
| 
 | ||||
|                                            ∗Corresponding author: wenshanw@andrew.cmu.edu | ||||
| 
 | ||||
|                                         4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA. | ||||
| scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model enabling | ||||
| generalization across different cameras. To our knowledge, our model is the first learning-based VO | ||||
| that has competitive performance in various real-world datasets without finetuning. Furthermore, | ||||
| compared to geometry-based methods, our model is significantly more robust in challenging scenes. | ||||
| A demo video can be found at: https://www.youtube.com/watch?v=NQ1UEh3thbU | ||||
| 
 | ||||
| 2 Related Work | ||||
| 
 | ||||
| Besides early studies of learning-based VO models [12, 13, 14, 15], an increasing number of end-to-end | ||||
| learning-based VO models have been studied, with improved accuracy and robustness. The majority | ||||
| of the recent end-to-end models adopt an unsupervised-learning design [6, 16, 17, 18], due to the | ||||
| complexity and high cost associated with collecting ground-truth data. However, supervised | ||||
| models trained on labeled odometry data still perform better [19, 20]. | ||||
| 
 | ||||
| To improve the performance, end-to-end VO models tend to have auxiliary outputs related to camera | ||||
| motions, such as depth and optical flow. With depth prediction, models obtain supervision signals | ||||
| by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can | ||||
| be interpreted as matching the temporal observations in the 3D space. A similar effect of temporal | ||||
| matching can be achieved by producing the optical flow, e.g., [16, 22, 18] jointly predict depth, | ||||
| optical flow, and camera motion. | ||||
| 
 | ||||
| Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D | ||||
| matching. Then, camera motion estimators can process the optical flow data rather than directly | ||||
| working on raw images [20, 23]. If designed this way, components for estimating the camera motion | ||||
| can even be trained separately on available optical flow data [19]. We follow these designs and use | ||||
| the optical flow as an intermediate representation. | ||||
| 
 | ||||
| It is well known that monocular VO systems have scale ambiguity. Nevertheless, most of the super- | ||||
| vised learning models do not handle this issue and directly use the difference between the model | ||||
| prediction and the true camera motion as the supervision [20, 24, 25]. In [19], the scale is handled | ||||
| by dividing the optical flow into sub-regions and imposing consistency of the motion predictions | ||||
| among these regions. In non-learning methods, scale ambiguity can be solved if a 3D map is avail- | ||||
| able [26]. Ummenhofer et al. [20] introduce depth prediction to correct the scale drift. Tateno | ||||
| et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection | ||||
| technique from SLAM systems. Recently, Zhan et al. [29] use PnP techniques to explicitly solve | ||||
| for the scale factor. The above methods introduce extra complexity to the VO system; however, the | ||||
| scale ambiguity is not totally suppressed for monocular setups, especially in the evaluation stage. | ||||
| Instead, some models choose to only produce up-to-scale predictions. Wang et al. [30] reduce the | ||||
| scale ambiguity in the monocular depth estimation task by normalizing the depth prediction before | ||||
| computing the loss function. Similarly, we will focus on predicting the translation direction rather | ||||
| than recovering the full scale from monocular images, by defining a new up-to-scale loss function. | ||||
| 
 | ||||
| Learning-based models suffer from generalization issues when tested on images from a new en- | ||||
| vironment or a new camera. Most of the VO models are trained and tested on the same dataset | ||||
| [16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] only test their generalization ability on the | ||||
| depth prediction, not on the camera pose estimation. Recent efforts, such as [33], use model adap- | ||||
| tation to deal with new environments; however, additional training is needed on a per-environment | ||||
| or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset | ||||
| generalization, by incorporating the camera intrinsics directly into the model. | ||||
| 
 | ||||
| Figure 1: The two-stage network architecture. The model consists of a matching network, which | ||||
| estimates optical flow from two consecutive RGB images, followed by a pose network predicting | ||||
| camera motion from the optical flow. | ||||
| 
 | ||||
| 3 Approach | ||||
| 
 | ||||
| 3.1 Background | ||||
| 
 | ||||
| We focus on the monocular VO problem, which takes two consecutive undistorted images {It, It+1}, | ||||
| and estimates the relative camera motion δtt+1 = (T, R), where T ∈ R3 is the 3D translation and | ||||
| R ∈ so(3) denotes the 3D rotation. According to the epipolar geometry theory [34], the geometry- | ||||
| based VO comes in two folds. Firstly, visual features are extracted and matched from It and It+1. | ||||
| Then using the matching results, it computes the essential matrix leading to the recovery of the | ||||
| up-to-scale camera motion δtt+1. | ||||
| 
 | ||||
| Following the same idea, our model consists of two sub-modules. One is the matching module | ||||
| Mθ(It, It+1), estimating the dense matching result Ftt+1 from two consecutive RGB images (i.e. | ||||
| optical flow). The other is a pose module Pφ(Ftt+1) that recovers the camera motion δtt+1 from the | ||||
| matching result (Fig. 1). This modular design is also widely used in other learning-based methods, | ||||
| especially in unsupervised VO [13, 19, 16, 22, 18]. | ||||
| 
 | ||||
| 3.2 Training on large scale diverse data | ||||
| 
 | ||||
| The generalization capability has always been one of the most critical issues for learning-based | ||||
| methods. Most of the previous supervised models are trained on the KITTI dataset, which contains | ||||
| 11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented | ||||
| the training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV). | ||||
| They reported that the performance is limited by the lack of training data and the more complex | ||||
| dynamics of a flying robot. Surprisingly, most unsupervised methods also only train their models in | ||||
| very uniform scenes (e.g., KITTI and Cityscape [37]). To our knowledge, no learning-based model | ||||
| has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To | ||||
| achieve this, we argue that the training data has to cover diverse scenes and motion patterns. | ||||
| 
 | ||||
| TartanAir [11] is a large scale dataset with highly diverse scenes and motion patterns, containing | ||||
| more than 400,000 data frames. It provides multi-modal ground truth labels including depth, seg- | ||||
| mentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and | ||||
| sci-fi environments. The data is collected with a simulated pinhole camera, which moves with ran- | ||||
| dom and rich 6DoF motion patterns in the 3D space. | ||||
| 
 | ||||
| We take advantage of the monocular image sequences {It}, the optical flow labels {Ftt+1}, and the | ||||
| ground truth camera motions {δtt+1} in our task. Our objective is to jointly minimize the optical | ||||
| flow loss Lf and the camera motion loss Lp. The end-to-end loss is defined as: | ||||
| 
 | ||||
| L = λ L_f + L_p = λ ‖M_θ(I_t, I_{t+1}) − F_t^{t+1}‖ + ‖P_φ(F̂_t^{t+1}) − δ_t^{t+1}‖        (1) | ||||
| 
 | ||||
| where λ is a hyper-parameter balancing the two losses. We use ˆ· to denote the estimated variable | ||||
| from our model. | ||||
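| As a minimal PyTorch-style sketch (where matching_net and pose_net stand for Mθ and Pφ, with architectures as in | ||||
| Sec. 4.1, and the value of λ is purely illustrative), the two modules and the joint loss of Eq. (1) fit together as follows: | ||||
| 
 | ||||
| import torch | ||||
| 
 | ||||
| def joint_loss(matching_net, pose_net, img_t, img_t1, flow_gt, motion_gt, lam=0.1): | ||||
|     """End-to-end loss L = lam * Lf + Lp of Eq. (1); lam balances the two terms | ||||
|     (the value used here is illustrative, not the one from the paper).""" | ||||
|     flow_pred = matching_net(img_t, img_t1)    # M_theta: dense matching (optical flow) | ||||
|     motion_pred = pose_net(flow_pred)          # P_phi: camera motion from the flow | ||||
|     L_f = torch.norm(flow_pred - flow_gt)      # optical-flow loss | ||||
|     L_p = torch.norm(motion_pred - motion_gt)  # camera-motion loss (later replaced by Eq. 3) | ||||
|     return lam * L_f + L_p | ||||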
| 
 | ||||
| Since TartanAir is purely synthetic, the biggest question is: can a model learned from simulation | ||||
| data generalize to real-world scenes? As discussed by Wang et al. [11], a large number of | ||||
| studies show that a model trained purely in simulation, but with broad diversity, can be | ||||
| transferred to the real world with relative ease. This is also known as domain randomization [38, 39]. In our | ||||
| experiments, we show that the diverse simulated data indeed enable the VO model to generalize to | ||||
| real-world data. | ||||
| 
 | ||||
| 3.3 Up-to-scale loss function | ||||
| 
 | ||||
| The motion scale is unobservable from a monocular image sequence. In geometry-based methods, | ||||
| the scale is usually recovered from other sources of information, ranging from a known object size or | ||||
| the camera height to extra sensors such as an IMU. However, in most existing learning-based VO studies, | ||||
| the models generally neglect the scale problem and try to recover the motion with scale. This is | ||||
| feasible if the model is trained and tested with the same camera and in the same type of scenario. | ||||
| For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground and with a | ||||
| fixed orientation. A model can learn to remember the scale in this particular setup. Obviously, the | ||||
| model will have huge problems when tested with a different camera configuration. Imagine if the | ||||
| 
 | ||||
| Figure 2: a) Illustration of the FoV and image resolution in TartanAir, EuRoC, and KITTI datasets. | ||||
| b) Calculation of the intrinsics layer. | ||||
| 
 | ||||
| camera in KITTI moves a little upwards and becomes higher from the ground, the same amount of | ||||
| camera motion would cause a smaller optical flow value on the ground, which is inconsistent with | ||||
| the training data. Although the model could potentially learn to pick up other clues such as object | ||||
| size, it is still not fully reliable across different scenes or environments. | ||||
| 
 | ||||
| Following the geometry-based methods, we only recover an up-to-scale camera motion from the | ||||
| monocular sequences. Knowing that the scale ambiguity only affects the translation T, we design | ||||
| a new loss function for T and keep the loss for the rotation R unchanged. We propose two up-to-scale | ||||
| loss functions for L_p: the cosine similarity loss L_p^cos and the normalized distance loss L_p^norm. L_p^cos | ||||
| is defined by the cosine angle between the estimated T̂ and the label T: | ||||
| 
 | ||||
| L_p^cos = (T̂ · T) / max(‖T̂‖ · ‖T‖, ε) + ‖R̂ − R‖        (2) | ||||
| 
 | ||||
| Similarly, for L_p^norm, we normalize the translation vector before calculating the distance between | ||||
| the estimation and the label: | ||||
| 
 | ||||
| L_p^norm = ‖ T̂ / max(‖T̂‖, ε) − T / max(‖T‖, ε) ‖ + ‖R̂ − R‖        (3) | ||||
| 
 | ||||
| where ε = 1e-6 is used to avoid division-by-zero errors. From our preliminary empirical comparison, | ||||
| the above two formulations have similar performance. In the following sections, we will use Eq. 3 | ||||
| to replace L_p in Eq. 1. Later, we show by experiments that the proposed up-to-scale loss function is | ||||
| crucial for the model’s generalization ability. | ||||
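| Since Eq. 3 is the variant used in the rest of the paper, a minimal PyTorch sketch of it could look as follows (batched | ||||
| (B, 3) translation and rotation tensors are assumed; this is an illustration rather than the reference implementation): | ||||
| 
 | ||||
| import torch | ||||
| 
 | ||||
| EPS = 1e-6  # avoids division by zero, as in Eq. (3) | ||||
| 
 | ||||
| def up_to_scale_loss(T_hat, T, R_hat, R): | ||||
|     """Normalized-distance translation loss of Eq. (3) plus the rotation loss. | ||||
|     T_hat, T, R_hat, R: tensors of shape (B, 3).""" | ||||
|     T_hat_n = T_hat / torch.clamp(T_hat.norm(dim=1, keepdim=True), min=EPS) | ||||
|     T_n = T / torch.clamp(T.norm(dim=1, keepdim=True), min=EPS) | ||||
|     trans_loss = (T_hat_n - T_n).norm(dim=1)   # compares only the translation direction | ||||
|     rot_loss = (R_hat - R).norm(dim=1) | ||||
|     return (trans_loss + rot_loss).mean() | ||||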
| 
 | ||||
| 3.4 Cross-camera generalization by encoding camera intrinsics | ||||
| 
 | ||||
| In epipolar geometry theory, the camera intrinsics are required when recovering the camera pose | ||||
| from the essential matrix (assuming the images are undistorted). In fact, learning-based methods | ||||
| are unlikely to generalize to data with different camera intrinsics. Imagine a simple case in which the | ||||
| camera switches to a lens with a larger focal length. Assuming the image resolution remains the | ||||
| same, the same amount of camera motion will introduce larger optical flow values, which we call | ||||
| the intrinsics ambiguity. | ||||
| 
 | ||||
| A tempting solution for the intrinsics ambiguity is warping the input images to match the camera in- | ||||
| trinsics of the training data. However, this is not practical, especially when the cameras differ | ||||
| too much. As shown in Fig. 2-a, if a model is trained on TartanAir, the warped KITTI image only | ||||
| covers a small part of TartanAir’s field of view (FoV). After training, a model learns to exploit | ||||
| cues from all possible positions in the FoV and the interrelationships among those cues. Some cues | ||||
| no longer exist in the warped KITTI images, leading to drastic performance drops. | ||||
| 
 | ||||
| 3.4.1 Intrinsics layer | ||||
| 
 | ||||
| We propose to train a model that takes both RGB images and camera intrinsics as input, thus the | ||||
| model can directly handle images coming from various camera settings. Specifically, instead of re- | ||||
| covering the camera motion Ttt+1 only from the feature matching Ftt+1, we design a new pose net- | ||||
| work Pφ(Ftt+1, K), which depends also on the camera intrinsic parameters K = {fx, fy, ox, oy}, | ||||
| where fx and fy are the focal lengths, and ox and oy denote the position of the principal point. | ||||
| 
 | ||||
| Figure 3: The data augmentation procedure of random cropping and resizing. In this way we gener- | ||||
| ate a wide range of camera intrinsics (FoV 40◦ to 90◦). | ||||
| 
 | ||||
| As for the implementation, we concatenate an IL (intrinsics layer) Kc ∈ R2×H×W (H and W | ||||
| are image height and width respectively) to Ftt+1 before going into Pφ. To compose Kc, we first | ||||
| generate two index matrices Xind and Yind for the x and y axis in the 2D image frame (Fig. 2-b). | ||||
| Then the two channels of Kc are calculated from the following formula: | ||||
| 
 | ||||
| K^c_x = (X_ind − o_x)/f_x                                 (4) | ||||
| K^c_y = (Y_ind − o_y)/f_y | ||||
| 
 | ||||
| The concatenation of Ftt+1 and Kc augments the optical flow estimation with 2D position informa- | ||||
| tion. Similar to the situation where geometry-based methods have to know the 2D coordinates of the | ||||
| matched features, Kc provides the necessary position information. In this way, intrinsics ambiguity | ||||
| 
 | ||||
| is explicitly handled by coupling 2D positions and matching estimations (Ftt+1). | ||||
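| A small sketch of how such an intrinsics layer can be built (NumPy is used here purely for illustration; the shapes and | ||||
| names follow Eq. (4) and Fig. 2-b): | ||||
| 
 | ||||
| import numpy as np | ||||
| 
 | ||||
| def intrinsics_layer(H, W, fx, fy, ox, oy): | ||||
|     """Build the 2-channel intrinsics layer Kc of Eq. (4): | ||||
|     Kc_x = (X_ind - ox) / fx and Kc_y = (Y_ind - oy) / fy, | ||||
|     where X_ind / Y_ind are the pixel-index grids of the image plane.""" | ||||
|     y_ind, x_ind = np.mgrid[0:H, 0:W].astype(np.float32) | ||||
|     kx = (x_ind - ox) / fx | ||||
|     ky = (y_ind - oy) / fy | ||||
|     return np.stack([kx, ky], axis=0)   # shape (2, H, W), concatenated with the flow | ||||
| 
 | ||||
| # Example with TartanAir's intrinsics (fx = fy = 320, ox = 320, oy = 240, 640x480 images). | ||||
| Kc = intrinsics_layer(480, 640, 320.0, 320.0, 320.0, 240.0) | ||||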
| 
 | ||||
| 3.4.2 Data generation for various camera intrinsics | ||||
| 
 | ||||
| To make a model generalizable across different cameras, we need training data with various camera | ||||
| intrinsics. TartanAir only has one set of camera intrinsics, where fx = fy = 320, ox = 320, | ||||
| and oy = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input | ||||
| images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next, | ||||
| we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can | ||||
| crop and resize the IL with the image, without recomputing the IL. To cover typical cameras with | ||||
| FoVs between 40◦ and 90◦, we find that using random resizing factors up to 2.5 is sufficient during | ||||
| RCR. Note that the ground truth optical flow should also be scaled by the resizing factor. We | ||||
| use very aggressive cropping and shifting in our training, which means the optical center can be | ||||
| far from the image center. Although the resulting intrinsic parameters are uncommon in modern | ||||
| cameras, we find that generalization improves. | ||||
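| A simplified sketch of this RCR augmentation is given below (channel-last (H, W, 2) flow and IL arrays and an | ||||
| aspect-preserving crop are assumed for brevity; the actual augmentation also shifts the crop more aggressively): | ||||
| 
 | ||||
| import numpy as np | ||||
| import cv2  # used only for resizing in this sketch | ||||
| 
 | ||||
| def random_crop_resize(img, flow, intr_layer, max_scale=2.5, rng=np.random): | ||||
|     """Random cropping and resizing (RCR) to simulate different camera intrinsics. | ||||
|     The optical flow is rescaled by the resize factors; the intrinsics layer is | ||||
|     simply cropped and resized together with the image (no recomputation needed).""" | ||||
|     H, W = img.shape[:2] | ||||
|     scale = rng.uniform(1.0, max_scale)              # resize factor, up to 2.5 | ||||
|     ch, cw = int(H / scale), int(W / scale)          # crop size | ||||
|     y0 = rng.randint(0, H - ch + 1)                  # random crop location | ||||
|     x0 = rng.randint(0, W - cw + 1) | ||||
|     crop = lambda a: a[y0:y0 + ch, x0:x0 + cw] | ||||
|     img_out = cv2.resize(crop(img), (W, H)) | ||||
|     intr_out = cv2.resize(crop(intr_layer), (W, H)) | ||||
|     flow_out = cv2.resize(crop(flow), (W, H)) | ||||
|     flow_out[..., 0] *= W / float(cw)                # scale the u component | ||||
|     flow_out[..., 1] *= H / float(ch)                # scale the v component | ||||
|     return img_out, flow_out, intr_out | ||||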
| 
 | ||||
| 4 Experimental Results | ||||
| 
 | ||||
| 4.1 Network structure and training detail | ||||
| 
 | ||||
| Network We utilize the pre-trained PWC-Net [40] as the matching network Mθ, and a modified | ||||
| ResNet50 [41] as the pose network Pφ. We remove the batch normalization layers from the ResNet, | ||||
| and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical | ||||
| flow at a size of H/4 × W/4, so Pφ is trained at 1/4 resolution, consuming very little GPU memory. The | ||||
| overall inference time (including both Mθ and Pφ) is 40ms on an NVIDIA GTX 1080 GPU. | ||||
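| A rough PyTorch sketch of such a pose network is shown below (the 4-channel input, i.e. 2-channel flow plus 2-channel | ||||
| IL, the way batch normalization is removed, and the 3-vector rotation head are simplifying assumptions of this sketch): | ||||
| 
 | ||||
| import torch | ||||
| import torch.nn as nn | ||||
| from torchvision.models import resnet50 | ||||
| 
 | ||||
| class PoseNet(nn.Module): | ||||
|     """Sketch of P_phi: a ResNet50 backbone over the optical flow concatenated | ||||
|     with the intrinsics layer, with two heads for translation and rotation.""" | ||||
|     def __init__(self): | ||||
|         super().__init__() | ||||
|         backbone = resnet50(num_classes=256, norm_layer=nn.Identity)  # drop batch norm | ||||
|         # 4 input channels: 2 for the flow, 2 for the intrinsics layer. | ||||
|         backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False) | ||||
|         self.backbone = backbone | ||||
|         self.trans_head = nn.Linear(256, 3)   # translation T | ||||
|         self.rot_head = nn.Linear(256, 3)     # rotation R as a 3-vector | ||||
| 
 | ||||
|     def forward(self, flow_with_intrinsics):  # (B, 4, H/4, W/4) | ||||
|         feat = self.backbone(flow_with_intrinsics) | ||||
|         return self.trans_head(feat), self.rot_head(feat) | ||||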
| 
 | ||||
| Training Our model is implemented by PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs. | ||||
| There are two training stages. First, Pφ is trained separately using ground truth optical flow and | ||||
| camera motions for 100,000 iterations with a batch size of 100. In the second stage, Pφ and Mθ are | ||||
| connected and jointly optimized for 50,000 iterations with a batch size of 64. During both training | ||||
| stages, the learning rate is set to 1e-4 with a decay rate of 0.2 at 1/2 and 7/8 of the total training | ||||
| steps. The RCR is applied on the optical flow, RGB images, and the IL (Sec 3.4.2). | ||||
| 
 | ||||
| 4.2 How the training data quantity affects the generalization ability | ||||
| 
 | ||||
| To show the effects of data diversity, we compare the generalization ability of the model trained | ||||
| with different amounts of data. We use 20 environments from the TartanAir dataset, and set aside | ||||
| 3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in | ||||
| 
 | ||||
| Figure 4: Generalization ability with respect to different quantities of training data. Model Pφ is | ||||
| trained on true optical flow. Blue: training loss, orange: testing loss on three unseen environments. | ||||
| Testing loss drops constantly with increasing quantity of training data. | ||||
| 
 | ||||
| Figure 5: Comparison of the loss curve w/ and w/o up-to-scale loss function. a) The training and | ||||
| testing loss w/o the up-to-scale loss. b) The translation and rotation losses of a). A big gap exists between | ||||
| the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/ | ||||
| up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases. | ||||
| 
 | ||||
| more than 400,000 training frames and about 40,000 testing frames. As a comparison, KITTI and | ||||
| EuRoC datasets provide 23,201 and 26,604 pose labeled frames, respectively. Besides, data in | ||||
| KITTI and EuRoC are much more uniform in the sense of scene type and motion pattern. As shown | ||||
| in Fig. 4, we set up three experiments, where we use 20,000 (comparable to KITTI and EuRoC), | ||||
| 100,000, and 400,000 frames of data for training the pose network Pφ. The experiments show that | ||||
| the generalization ability, measured by the gap between training loss and testing loss on unseen | ||||
| environments, improves constantly with increasing training data. | ||||
| 
 | ||||
| 4.3 Up-to-scale loss function | ||||
| 
 | ||||
| Without the up-to-scale loss, we observe that there is a gap between the training and testing loss even | ||||
| when training with a large amount of data (Fig. 5-a). When we plot the translation loss and rotation loss | ||||
| separately (Fig. 5-b), we see that the translation error is the main contributor to the gap. After we | ||||
| apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5- | ||||
| c,d). During testing, we align the translation with the ground truth to recover the scale in the | ||||
| same way as described in [16, 6]. | ||||
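| That scale-recovery step can be as simple as fitting one least-squares scale factor between the estimated and ground- | ||||
| truth translations; the snippet below is an illustrative stand-in and may differ in detail from the protocol of [16, 6]. | ||||
| 
 | ||||
| import numpy as np | ||||
| 
 | ||||
| def align_translation_scale(t_est, t_gt): | ||||
|     """Fit a single scale factor s minimizing ||s * t_est - t_gt||^2 over all frames | ||||
|     and return the rescaled translations (t_est, t_gt: arrays of shape (N, 3)).""" | ||||
|     t_est, t_gt = np.asarray(t_est, float), np.asarray(t_gt, float) | ||||
|     s = np.sum(t_est * t_gt) / max(np.sum(t_est * t_est), 1e-12) | ||||
|     return s * t_est | ||||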
| 
 | ||||
| 4.4 Camera intrinsics layer | ||||
| 
 | ||||
| The IL is critical to the generalization ability across datasets. Before we move to other datasets, | ||||
| we first design an experiment to investigate the properties of the IL using the pose network Pφ. As | ||||
| shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training | ||||
| and testing losses are low. However, these two models output nonsense values on data with RCR | ||||
| augmentation. One interesting finding is that adding the IL does not help in the case of only one type | ||||
| of intrinsics. This indicates that the network has learned a very different algorithm compared with | ||||
| the geometry-based methods, where the intrinsics are necessary to recover the motion. The last two | ||||
| columns show that the IL is critical when the input data is augmented by RCR (i.e., various intrin- | ||||
| sics). Another interesting observation is that training a model with RCR and the IL leads to a lower testing | ||||
| loss (last column) than training on only one type of intrinsics (first two columns). This indicates that | ||||
| by generating data with various intrinsics, we learned a more robust model for the VO task. | ||||
| 
 | ||||
| Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is | ||||
| critical with the presence of RCR. The model trained with RCR reaches lower testing loss than | ||||
| those without RCR. | ||||
| 
 | ||||
| Training configuration      w/o RCR, w/o IL     w/o RCR, w/ IL        w/ RCR, w/o IL          w/ RCR, w/ IL | ||||
| Training loss                    0.0325             0.0311                0.1534                 0.0499 | ||||
| Test-loss on data w/ RCR             -                  -                 0.1999                 0.0723 | ||||
| Test-loss on data w/o RCR        0.0744             0.0714                0.1630                 0.0549 | ||||
| 
 | ||||
| Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a super- | ||||
| vised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates | ||||
| information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08 | ||||
| and uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44] | ||||
| and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB- | ||||
| SLAM [3] are geometry-based monocular VO. ORB-SLAM uses the bundle adjustment on multiple | ||||
| frames to optimize the trajectory. Our method works in a pure VO manner (it only takes two frames). | ||||
| It has never seen any KITTI data before testing, and yet achieves competitive results. | ||||
| 
 | ||||
| Seq                    06             07             09             10             Ave | ||||
|                        trel   rrel    trel   rrel    trel   rrel    trel   rrel    trel   rrel | ||||
| DeepVO [43]*†          5.42   5.82    3.91   4.6     -      -       8.11   8.83    5.81   6.41 | ||||
| Wang et al. [9]*†      -      -       -      -       8.04   1.51    6.23   0.97    7.14   1.24 | ||||
| UnDeepVO [44]*         6.20   1.98    3.15   2.48    -      -       10.63  4.65    6.66   3.04 | ||||
| GeoNet [16]*           9.28   4.34    8.27   5.93    26.93  9.54    20.73  9.04    16.3   7.21 | ||||
| VISO2-M [45]           7.3    6.14    23.61  19.11   4.04   1.43    25.2   3.8     15.04  7.62 | ||||
| ORB-SLAM [3]†          18.68  0.26    10.96  0.37    15.3   0.26    3.71   0.3     12.16  0.3 | ||||
| TartanVO (ours)        4.72   2.95    4.32   3.41    6.0    3.11    6.89   2.73    5.48   3.05 | ||||
| 
 | ||||
| trel: average translational RMSE drift (%) on a length of 100–800 m. | ||||
| rrel: average rotational RMSE drift (◦/100 m) on a length of 100–800 m. | ||||
| *: the starred methods are trained or finetuned on the KITTI dataset. | ||||
| †: these methods use multiple frames to optimize the trajectory after the VO process. | ||||
| 
 | ||||
| 4.5 Generalize to real-world data without finetuning | ||||
| 
 | ||||
| KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We | ||||
| compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al. | ||||
| [9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods | ||||
| (VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the | ||||
| KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a | ||||
| synthetic dataset. Besides, many algorithms use multiple frames to further optimize the trajectory. | ||||
| In contrast, our model only takes two consecutive images. As listed in Table 2, TartanVO achieves | ||||
| comparable performance, even though no finetuning or backend optimization is performed. | ||||
| 
 | ||||
| EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor en- | ||||
| vironment. There are 3 levels of difficulty with respect to the motion pattern and the lighting con- | ||||
| ditions. Few learning-based methods have ever been tested on EuRoC due to the lack of training | ||||
| data. The changing lighting conditions and aggressive rotations pose real challenges to geometry- | ||||
| based methods as well. In Table 3, we compare with geometry-based methods including SVO [46], | ||||
| ORB-SLAM [3], DSO [5] and LSD-SLAM [2]. Note that all these geometry-based methods per- | ||||
| form some type of backend optimization on selected keyframes along the trajectory. In contrast, our | ||||
| model only estimates the frame-by-frame camera motion, which can be considered the frontend | ||||
| module of these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE) | ||||
| of 6 medium and difficult trajectories. Our method shows the best performance on the two most dif- | ||||
| ficult trajectories VR1-03 and VR2-03, where the MAV has very aggressive motion. A visualization | ||||
| of the trajectories is shown in Fig. 6. | ||||
| 
 | ||||
| Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories2 that | ||||
| cover many extremely difficult cases, including changing illumination, dynamic objects, fog and | ||||
| rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with | ||||
| ORB-SLAM using ATE. Our model shows a more robust performance in these challenging cases. | ||||
| 
 | ||||
|     2https://github.com/castacks/tartanair tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge | ||||
| 
 | ||||
| Table 3: Comparison of ATE on EuRoC dataset. We are among very few learning-based methods, | ||||
| which can be tested on this dataset. Same as the geometry-based methods, our model has never seen | ||||
| the EuRoC data before testing. We show the best performance on two difficult sequences VR1-03 | ||||
| and VR2-03. Note our method doesn’t contain any backend optimization module. | ||||
| 
 | ||||
|                   Seq.             MH-04  MH-05  VR1-02  VR1-03  VR2-02  VR2-03 | ||||
|                   SVO [46]         1.36   0.51   0.47    x       0.47    x | ||||
| Geometry-based *  ORB-SLAM [3]     0.20   0.19   x       x       0.07    x | ||||
|                   DSO [5]          0.25   0.11   0.11    0.93    0.13    1.16 | ||||
|                   LSD-SLAM [2]     2.13   0.85   1.11    x       x       x | ||||
| Learning-based †  TartanVO (ours)  0.74   0.68   0.45    0.64    0.67    1.04 | ||||
| * These results are from [46]. † Other learning-based methods [36] did not report numerical results. | ||||
| 
 | ||||
| Figure 6: The visualization of 6 EuRoC trajectories in Table 3. Black: ground truth trajectory, | ||||
| orange: estimated trajectory. | ||||
| 
 | ||||
| Table 4: Comparison of ATE on TartanAir dataset. These trajectories are not contained in the | ||||
| 
 | ||||
| training set. We run ORB-SLAM 5 times on each trajectory and report the best result. | ||||
| 
 | ||||
| Seq              MH000  MH001  MH002  MH003  MH004  MH005  MH006  MH007 | ||||
| ORB-SLAM [3]     1.3    0.04   2.37   2.45   x      x      21.47  2.73 | ||||
| TartanVO (ours)  4.88   0.26   2      0.94   1.07   3.19   1      2.04 | ||||
| 
 | ||||
| Figure 7: TartanVO outputs competitive results on D435i IR data compared to the T265 (equipped with a | ||||
| fish-eye stereo camera and an IMU). a) The hardware setup. b) Trial 1: smooth and slow motion. c) | ||||
| Trial 2: smooth and medium speed. d) Trial 3: aggressive and fast motion. See the videos for details. | ||||
| 
 | ||||
| RealSense Data Comparison We test TartanVO using data collected by a customized sensor | ||||
| setup. As shown in Fig. 7 a), a RealSense D435i is fixed on top of a RealSense T265 tracking | ||||
| camera. We use the left near-infrared (IR) image of the D435i in our model and compare the result with the | ||||
| trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar | ||||
| paths with increasing motion difficulty. From Fig. 7 b) to d), we observe that although TartanVO | ||||
| has never seen real-world images or IR data during training, it still generalizes well and predicts | ||||
| odometry closely matching the output of the T265, which is a dedicated device estimating the camera | ||||
| motion with a fish-eye stereo camera pair and an IMU. | ||||
| 
 | ||||
| 5 Conclusions | ||||
| 
 | ||||
| We presented TartanVO, a generalizable learning-based visual odometry model. By training our model | ||||
| with a large amount of data, we showed the effect of diverse data on the model’s ability to gener- | ||||
| alize. A smaller gap between training and testing losses can be expected with the newly defined | ||||
| up-to-scale loss, further increasing the generalization capability. We show by extensive experiments | ||||
| that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO | ||||
| can generalize to unseen datasets and achieve performance even better than dedicated learning mod- | ||||
| els trained directly on those datasets. Our work opens up many exciting future research directions, | ||||
| such as generalizable learning-based VIO, stereo VO, and multi-frame VO. | ||||
| 
 | ||||
| Acknowledgments | ||||
| 
 | ||||
| This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai | ||||
| Yu from Carnegie Mellon University for preparing simulation results and experimental setups. | ||||
| 
 | ||||
| References | ||||
| 
 | ||||
|  [1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendo´n-Mancha. Visual simultaneous localization and | ||||
|       mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015. | ||||
| 
 | ||||
 [2] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
| 
 | ||||
 [3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM
      system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
| 
 | ||||
|  [4] C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi-direct monocular visual odometry. In ICRA, | ||||
|       pages 15–22. IEEE, 2014. | ||||
| 
 | ||||
|  [5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and | ||||
|       machine intelligence, 40(3):611–625, 2017. | ||||
| 
 | ||||
|  [6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from | ||||
|       video. In CVPR, 2017. | ||||
| 
 | ||||
|  [7] S. Vijayanarasimhan, S. Ricco, C. Schmidy, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of | ||||
|       structure and motion from video. In arXiv:1704.07804, 2017. | ||||
| 
 | ||||
|  [8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odom- | ||||
|       etry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, | ||||
|       2018. | ||||
| 
 | ||||
|  [9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer. Improving learning-based ego- | ||||
|       motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International | ||||
|       Conference on Intelligent Robots and Systems (IROS), pages 970–976. IEEE, 2019. | ||||
| 
 | ||||
| [10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and | ||||
|       future directions. Robotics and Autonomous Systems, 98:67–88, 2017. | ||||
| 
 | ||||
| [11] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A | ||||
|       dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots | ||||
|       and Systems (IROS), 2020. | ||||
| 
 | ||||
| [12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry. | ||||
|       In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 47–52. IEEE, | ||||
|       2008. | ||||
| 
 | ||||
| [13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation | ||||
|       (ICRA), 2012 IEEE International Conference on, pages 3482–3489. IEEE, 2012. | ||||
| 
 | ||||
| [14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual | ||||
|       odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014. | ||||
| 
 | ||||
| [15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned | ||||
|       depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, | ||||
|       pages 6243–6252, 2017. | ||||
| 
 | ||||
| [16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In | ||||
|       Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, | ||||
|       2018. | ||||
| 
 | ||||
| [17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular | ||||
|       depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE | ||||
|       Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018. | ||||
| 
 | ||||
| [18] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collabora- | ||||
|       tion: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In | ||||
|       Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June | ||||
|       2019. | ||||
| 
 | ||||
| [19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with cnns for | ||||
|       frame-to-frame ego-motion estimation. RAL, 1(1):18–25, 2016. | ||||
| 
 | ||||
| [20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and | ||||
|       motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer | ||||
|       Vision and Pattern Recognition (CVPR), July 2017. | ||||
| 
 | ||||
| [21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3vo: Deep depth, deep pose and deep uncertainty | ||||
|       for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition | ||||
|       (CVPR), June 2020. | ||||
| 
 | ||||
| [22] Y. Zou, Z. Luo, and J.-B. Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task | ||||
|       consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. | ||||
| 
 | ||||
| [23] H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In Proceedings of the | ||||
|       European Conference on Computer Vision (ECCV), September 2018. | ||||
| 
 | ||||
| [24] C. Tang and P. Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018. | ||||
| 
 | ||||
| [25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. Ls-net: Learning to solve | ||||
|       nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018. | ||||
| 
 | ||||
| [26] H. Li, W. Chen, j. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of ab- | ||||
|       solute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference | ||||
|       on Robotics and Automation (ICRA), May 2020. | ||||
| 
 | ||||
| [27] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned | ||||
|       depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition | ||||
|       (CVPR), July 2017. | ||||
| 
 | ||||
| [28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detec- | ||||
|       tion and visual odometry towards monocular deep slam. In Proceedings of the IEEE/CVF International | ||||
|       Conference on Computer Vision (ICCV), October 2019. | ||||
| 
 | ||||
| [29] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid. Visual odometry revisited: What should be learnt? | ||||
|       In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020. | ||||
| 
 | ||||
| [30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct | ||||
|       methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), | ||||
|       June 2018. | ||||
| 
 | ||||
| [31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. Unos: Unified unsupervised optical-flow and | ||||
|       stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer | ||||
|       Vision and Pattern Recognition (CVPR), June 2019. | ||||
| 
 | ||||
| [32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monoc- | ||||
|       ular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision | ||||
|       and Pattern Recognition (CVPR), June 2018. | ||||
| 
 | ||||
| [33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online | ||||
|       adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition | ||||
|       (CVPR), June 2020. | ||||
| 
 | ||||
[34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern
      Analysis and Machine Intelligence, 26(6):756–770, 2004.
| 
 | ||||
| [35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International | ||||
|       Journal of Robotics Research, 32(11):1231–1237, 2013. | ||||
| 
 | ||||
| [36] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The | ||||
|       euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, | ||||
|       2016. | ||||
| 
 | ||||
| [37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and | ||||
|       B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE | ||||
|       conference on computer vision and pattern recognition, pages 3213–3223, 2016. | ||||
| 
 | ||||
| [38] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transfer- | ||||
|       ring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017. | ||||
| 
 | ||||
| [39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, | ||||
|       and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain ran- | ||||
|       domization. In CVPR Workshops, pages 969–977, 2018. | ||||
| 
 | ||||
| [40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and | ||||
|       cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages | ||||
|       8934–8943, 2018. | ||||
| 
 | ||||
| [41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the | ||||
|       IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. | ||||
| 
 | ||||
| [42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and | ||||
|       A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. | ||||
| 
 | ||||
| [43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep | ||||
|       recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International | ||||
|       Conference on, pages 2043–2050. IEEE, 2017. | ||||
| 
 | ||||
| [44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep | ||||
|       learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. | ||||
|       IEEE, 2018. | ||||
| 
 | ||||
| [45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for au- | ||||
|       tonomous driving. IEEE Transactions on Pattern Analysis & Machine Intelligence, pages 1–1, 2015. | ||||
| 
 | ||||
[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. SVO: Semidirect visual odometry
      for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.
| 
 | ||||
| A Additional experimental details | ||||
| 
 | ||||
| In this section, we provide additional details for the experiments, including the network structure, | ||||
| training parameters, qualitative results, and quantitative results. | ||||
| 
 | ||||
| A.1 Network Structure | ||||
| 
 | ||||
Our network consists of two sub-modules, namely the matching network Mθ and the pose network
Pφ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two
consecutive images of size 640 × 448 (PWC-Net only accepts image sizes that are multiples of 64). The
output optical flow, which is 160 × 112 in size, is fed into the pose network. The structure of the
pose network is detailed in Table 5. The overall inference time (including both Mθ and Pφ) is 40 ms
on an NVIDIA GTX 1080 GPU.
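
To make the data flow concrete, the following is a minimal PyTorch-style sketch of this two-stage
inference. The class and attribute names (TartanVOInference, matching_net, pose_net) and the bilinear
resize are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F

class TartanVOInference(torch.nn.Module):
    """Illustrative two-stage VO inference: optical flow first, then relative pose."""

    def __init__(self, matching_net, pose_net):
        super().__init__()
        self.matching_net = matching_net  # a PWC-Net-like flow network (assumed interface)
        self.pose_net = pose_net          # the pose regressor of Table 5 (assumed interface)

    def forward(self, img1, img2):
        # PWC-Net-style networks expect sizes that are multiples of 64,
        # so both frames are resized to 448 x 640 (H x W) before matching.
        img1 = F.interpolate(img1, size=(448, 640), mode='bilinear', align_corners=False)
        img2 = F.interpolate(img2, size=(448, 640), mode='bilinear', align_corners=False)
        flow = self.matching_net(img1, img2)   # assumed output shape: B x 2 x 112 x 160
        trans, rot = self.pose_net(flow)       # B x 3 up-to-scale translation, B x 3 rotation
        return flow, trans, rot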
| 
 | ||||
Table 5: Parameters of the proposed pose network. Constructions of residual blocks are designated
in brackets, multiplied by the number of stacked blocks. Downsampling is performed by Conv1 and
at the beginning of each residual block. After the residual blocks, we reshape the feature map into a
one-dimensional vector, which goes through three fully connected layers in the translation head and
the rotation head, respectively.

    Name     Layer setting                             Output dimension
    Input    optical flow                              1/4 H × 1/4 W × 2      (112 × 160)
    Conv1    3 × 3, 32                                 1/8 H × 1/8 W × 32     (56 × 80)
    Conv2    3 × 3, 32                                 1/8 H × 1/8 W × 32     (56 × 80)
    Conv3    3 × 3, 32                                 1/8 H × 1/8 W × 32     (56 × 80)
    Block1   ResBlock [3 × 3, 64;  3 × 3, 64]  × 3     1/16 H × 1/16 W × 64   (28 × 40)
    Block2   ResBlock [3 × 3, 128; 3 × 3, 128] × 4     1/32 H × 1/32 W × 128  (14 × 20)
    Block3   ResBlock [3 × 3, 128; 3 × 3, 128] × 6     1/64 H × 1/64 W × 128  (7 × 10)
    Block4   ResBlock [3 × 3, 256; 3 × 3, 256] × 7     1/128 H × 1/128 W × 256 (4 × 5)
    Block5   ResBlock [3 × 3, 256; 3 × 3, 256] × 3     1/256 H × 1/256 W × 256 (2 × 3)

    FC trans: fc1 (256 · 6) × 128, fc2 128 × 32, fc3 32 × 3 → output 3
    FC rot:   fc1 (256 · 6) × 128, fc2 128 × 32, fc3 32 × 3 → output 3
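
A compact PyTorch sketch reproducing the shape progression of Table 5 is given below. The class names,
ReLU placement, and the 1 × 1 projection shortcut in the residual blocks are assumptions made for a
runnable example; only the layer widths, block counts, and output sizes come from the table.

import torch
import torch.nn as nn

def conv(in_c, out_c, stride=1):
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, stride, 1), nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity (or 1x1 projection) shortcut."""
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, stride, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_c, out_c, 3, 1, 1))
        self.skip = (nn.Identity() if stride == 1 and in_c == out_c
                     else nn.Conv2d(in_c, out_c, 1, stride))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

def stage(in_c, out_c, n):
    # Downsampling happens in the first block of every stage, as in Table 5.
    return nn.Sequential(ResBlock(in_c, out_c, stride=2),
                         *[ResBlock(out_c, out_c) for _ in range(n - 1)])

class PoseNet(nn.Module):
    """Regresses 3-DoF translation and 3-DoF rotation from a B x 2 x 112 x 160 flow map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            conv(2, 32, stride=2), conv(32, 32), conv(32, 32),  # Conv1-3 -> 56 x 80
            stage(32, 64, 3),     # Block1 -> 28 x 40
            stage(64, 128, 4),    # Block2 -> 14 x 20
            stage(128, 128, 6),   # Block3 -> 7 x 10
            stage(128, 256, 7),   # Block4 -> 4 x 5
            stage(256, 256, 3))   # Block5 -> 2 x 3

        def head():  # three fully connected layers, same layout for both heads
            return nn.Sequential(nn.Linear(256 * 2 * 3, 128), nn.ReLU(inplace=True),
                                 nn.Linear(128, 32), nn.ReLU(inplace=True),
                                 nn.Linear(32, 3))
        self.trans_head, self.rot_head = head(), head()

    def forward(self, flow):
        feat = self.encoder(flow).flatten(1)   # B x (256 * 2 * 3)
        return self.trans_head(feat), self.rot_head(feat)

# Shape check: PoseNet()(torch.zeros(1, 2, 112, 160)) returns two 1 x 3 tensors.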
| 
 | ||||
Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric.
These trajectories are not contained in the training set. We run ORB-SLAM 5 times and report the
best result.
| 
 | ||||
Seq              SH000   SH001   SH002   SH003   SH004   SH005   SH006   SH007
ORB-SLAM         x       3.5     x       x       x       x       x       x
TartanVO (ours)  2.52    1.61    3.65    0.29    3.36    4.74    3.72    3.06
| 
 | ||||
| A.2 Testing Results on TartanAir | ||||
| 
 | ||||
TartanAir provides 16 challenging testing trajectories. We reported 8 of them in the experiment
section; the remaining 8 are shown in Table 6. We compare TartanVO against the monocular
ORB-SLAM algorithm. Due to the randomness in ORB-SLAM, we run it for 5 trials and report the
best result. We consider a trial a failure if ORB-SLAM tracks less than 80% of the trajectory. A
visualization of all 16 trajectories (including the 8 shown in the experiment section) is given in
Figure 8.
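
The selection rule described above fits in a few lines; the helper name and the (ATE, tracked-fraction)
tuple layout below are assumptions for illustration, not part of the evaluation code.

def best_successful_ate(trials, min_tracked=0.8):
    """trials: list of (ate_m, tracked_fraction) pairs, one per ORB-SLAM run.
    A run is a failure if it tracks less than `min_tracked` of the trajectory;
    returns the best ATE among successful runs, or None if every run failed."""
    ok = [ate for ate, frac in trials if frac >= min_tracked]
    return min(ok) if ok else None

# Five hypothetical runs on one sequence; two lose tracking early and are ignored.
print(best_successful_ate([(1.30, 0.95), (2.40, 0.90), (9.99, 0.30), (1.10, 0.85), (9.99, 0.50)]))  # -> 1.1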
| 
 | ||||
Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line
represents the ground truth. The trajectories estimated by TartanVO and the ORB-SLAM monocular
algorithm are shown as orange and blue lines, respectively. ORB-SLAM frequently loses tracking in
these challenging cases and fails on 9 of the 16 testing trajectories. Note that we run full-fledged
ORB-SLAM with local bundle adjustment, global bundle adjustment, and loop closure components.
In contrast, although TartanVO only takes in two images, it is much more robust than ORB-SLAM.
| 
 | ||||
|  | ||||
| @ -0,0 +1,518 @@ | ||||
|                                                                                                                                                      2022 IEEE International Conference on Robotics and Automation (ICRA) | ||||
|                                                                                                                                                      May 23-27, 2022. Philadelphia, PA, USA | ||||
| 
 | ||||
AirDOS: Dynamic SLAM benefits from Articulated Objects

Yuheng Qiu1, Chen Wang1, Wenshan Wang1, Mina Henein2, and Sebastian Scherer1

Abstract— Dynamic Object-aware SLAM (DOS) exploits object-level information to enable robust motion
estimation in dynamic environments. Existing methods mainly focus on identifying and excluding dynamic
objects from the optimization. In this paper, we show that feature-based visual SLAM systems can also
benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) the
3D structure of each rigid part of an articulated object remains consistent over time; (2) the points on the
same rigid part follow the same motion. In particular, we present AirDOS, a dynamic object-aware system
that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the
camera pose, object motion, and the object 3D structure, we can rectify the camera pose estimation,
preventing tracking loss, and generate 4D spatio-temporal maps for both dynamic objects and static
scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in
challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic
object-aware SLAM system demonstrating that camera pose estimation can be improved by incorporating
dynamic articulated objects.

Fig. 1. (a) Example of a highly dynamic environment (Shibuya, Tokyo) cluttered with humans, which
represents a challenge for visual SLAM; existing dynamic SLAM algorithms often fail in this scenario.
(b) Example of the TartanAir Shibuya dataset. (c) Example of the estimated full map with dynamic
objects and static background (KITTI tracking dataset, training 19).

I. INTRODUCTION

Simultaneous localization and mapping (SLAM) is a fundamental research problem in many robotic
applications. Despite its success in static environments, the performance degradation and lack of
robustness in the dynamic world have become a major hurdle for its practical applications [1], [2]. To
address the challenges of dynamic environments, most SLAM algorithms adopt an elimination strategy
that treats moving objects as outliers and estimates the camera pose only based on the measurements
of static landmarks [3], [4]. This strategy can handle environments with a small number of dynamics,
but cannot address challenging cases where dynamic objects cover a large field of view, as in Fig. 1(a).

Some efforts have been made to include dynamic objects in the SLAM process. Very few methods try to
estimate the pose of simple rigid objects [5], [6] or estimate their motion model [7], [8]. For example,
CubeSLAM [6] introduces a simple 3D cuboid to model rigid objects. Dynamic SLAM [9] estimates 3D
motions of dynamic objects. However, these methods can only cover special rigid objects, e.g., cubes [6]
and quadrics [5], and do not show that camera pose estimation can be improved by the introduction of
dynamic objects [7]–[9]. This introduces our main question:

Can we make use of moving objects in SLAM to improve camera pose estimation rather than filtering
them out?

In this paper, we extend the simple rigid objects to general articulated objects, defined as objects
composed of one or more rigid parts (links) connected by joints allowing rotational motion [10], e.g.,
vehicles and humans in Fig. 2, and utilize the properties of articulated objects to improve the camera
pose estimation. Namely, we jointly optimize (1) the 3D structural information and (2) the motion of
articulated objects. To this end, we introduce (1) a rigidity constraint, which assumes that the distance
between any two points located on the same rigid part remains constant over time, and (2) a motion
constraint, which assumes that feature points on the same rigid part follow the same 3D motion. This
allows us to build a 4D spatio-temporal map including both dynamic and static structures.

In summary, the main contributions of this paper are:
• A new pipeline, named AirDOS, is introduced for stereo SLAM to jointly optimize the camera poses,
  the trajectories of dynamic objects, and the map of the environment.
• We introduce simple yet efficient rigidity and motion constraints for general dynamic articulated
  objects.
• We introduce a new benchmark, TartanAir Shibuya, on which we demonstrate, for the first time, that
  dynamic articulated objects can benefit camera pose estimation in visual SLAM.

*This work was supported by the Sony award #A023367. Source code: https://github.com/haleqiu/AirDOS.
1Yuheng Qiu, Chen Wang, Wenshan Wang, and Sebastian Scherer are with the Robotics Institute,
Carnegie Mellon University, Pittsburgh, PA 15213, USA. {yuhengq, wenshanw, basti}@andrew.cmu.edu;
chenwang@dr.com
2Mina Henein is with the System, Theory and Robotics Lab, Australian National University.
mina.henein@anu.edu.au
| 
 | ||||
Fig. 2. An example of the point-segment model of articulated dynamic objects. In urban environments,
we can model rigid objects like vehicles and semi-rigid objects like pedestrians as articulated objects.
p_i^k and p_j^k are the i-th and j-th dynamic features on the moving object at time k; p_i^{k+1} and
p_j^{k+1} are the same dynamic features after the motion ^l T^k, at time k+1. In this model, the segment
s_ij is invariant over time and motion.

II. RELATED WORK

Recent works on dynamic SLAM roughly fall into three categories: elimination strategy, motion
constraint, and rigidity constraint, which will be reviewed respectively.

A. Elimination Strategy

Algorithms in this category filter out the dynamic objects and only utilize the static structures of the
environment for pose estimation. Therefore, most of the algorithms in this category apply elimination
strategies like RANSAC [11] and robust loss functions [12] to eliminate the effects of dynamic objects.
For example, ORB-SLAM [3] applies RANSAC to select and remove points that cannot converge to a
stable pose estimation. DynaSLAM [13] detects the moving objects by multi-view geometry and deep
learning modules, which allows inpainting the frame background that has been occluded by dynamic
objects. Bârsan et al. [14] use both instance-aware semantic segmentation and sparse scene flow to
classify objects as either background, moving, or potentially moving objects. Dai et al. [15] utilize the
distance correlation of map points to segment dynamic objects from the static background. To reduce
the computational cost, Ji et al. [16] combine semantic segmentation and geometry modules, clustering
the depth image into a few regions and identifying dynamic regions via reprojection errors.

B. Motion Constraint

Most algorithms in this category estimate the motion of dynamic objects but do not show that the
motion constraint can contribute to the camera pose estimation, and would thus suffer in highly
dynamic environments. For example, Hahnel et al. [17] track the dynamic objects in the SLAM system.
Wang et al. [18] introduce a simultaneous localization, mapping, and moving object tracking (SLAMMOT)
algorithm, which tracks moving objects with a learned motion model based on a dynamic Bayesian
network. Reddy et al. [19] use optical flow to segment moving objects, and apply a smooth trajectory
constraint to enforce the smoothness of objects' motion. Judd et al. [8] propose multi-motion visual
odometry (MVO), which simultaneously estimates the camera pose and the object motion. The work by
Henein et al. [7], [20], [21], of which the most recent is VDO-SLAM [20], generates a map of dynamic
and static structure and estimates the velocities of rigid moving objects using motion constraints.
Rosinol et al. [22] propose 3D dynamic scene graphs to detect and track dense human meshes in dynamic
scenes; this method constrains the humans' maximum walking speed for a consistency check.

C. Rigidity Constraint

The rigidity constraint assumes that pair-wise distances of points on the same rigid body remain the
same over time. It has been applied to segment moving objects in dynamic environments dating back to
the 1980s. Zhang et al. [23] propose to use the rigidity constraint to match moving rigid bodies.
Thompson et al. [24] use a similar idea and propose a rigidity geometry test for moving rigid object
matching. Previous research utilized the rigidity assumption to segment moving rigid objects, while in
this paper we use the rigidity constraint to recover objects' structure.

To model rigid objects, SLAM++ [25] introduced pre-defined CAD models into the object matching and
pose optimization. QuadricSLAM [5] utilizes dual quadrics as a 3D object representation to capture the
orientation and scale of object landmarks. Yang and Scherer [6] propose a monocular object SLAM
system named CubeSLAM for 3D cuboid object detection and multi-view object SLAM. As mentioned
earlier, the above methods can only model simple rigid objects, e.g., cubes, while we target more
general objects, i.e., articulated objects, which cover common dynamic objects such as vehicles and
humans.

III. METHODOLOGY

A. Background and Notation

Visual SLAM in static environments is often formulated as a factor graph optimization [26]. The
objective (1) is to find the robot states x_k ∈ X, k ∈ [0, n_x], and the static landmarks p_i ∈ P_s,
i ∈ [0, n_{p_s}], that best fit the observations of the landmarks z_i^k ∈ Z, where n_x denotes the total
number of robot states and n_{p_s} denotes the number of static landmarks. This is often based on a
reprojection error minimization e_{i,k} = h(x_k, p_i) − z_i^k with:

    X*, P* = argmin_{X, P_s} Σ_{i,k} e_{i,k}^T Ω_{i,k}^{-1} e_{i,k}    (1)

where h(x_k, p_i) denotes the 3D point observation function and Ω_{i,k} denotes the observation
covariance matrix.
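
As a concrete reading of the residual e_{i,k} = h(x_k, p_i) − z_i^k in eq. (1), the following NumPy sketch
evaluates one reprojection error for a pinhole camera. The explicit 4 × 4 pose and intrinsics layout is an
illustrative assumption, not the parameterization used in the paper.

import numpy as np

def reprojection_residual(T_wc, K, p_w, z_uv):
    """e_{i,k} = h(x_k, p_i) - z_i^k for one landmark observation.

    T_wc : 4x4 camera pose (world -> camera), standing in for the robot state x_k
    K    : 3x3 pinhole intrinsics
    p_w  : 3D landmark p_i in world coordinates
    z_uv : observed pixel z_i^k
    """
    p_c = (T_wc @ np.append(p_w, 1.0))[:3]   # landmark expressed in the camera frame
    uv = (K @ p_c)[:2] / p_c[2]              # pinhole projection h(x_k, p_i)
    return uv - z_uv

# Toy example: identity pose and a landmark 2 m straight ahead project exactly
# onto the principal point, so the residual vanishes.
K = np.array([[320.0, 0.0, 320.0], [0.0, 320.0, 240.0], [0.0, 0.0, 1.0]])
print(reprojection_residual(np.eye(4), K, np.array([0.0, 0.0, 2.0]), np.array([320.0, 240.0])))  # [0. 0.]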
| 
 | ||||
[Fig. 3 graphics: a human skeleton with 14 key points (nose, neck, left/right shoulders, elbows, hands,
knees, and feet) and the factor graphs of the rigidity and motion constraints. Panels: (a) Rigidity
Constraint Factor Graph, (b) Human Rigidity, (c) Motion Constraint Factor Graph.]

Fig. 3. (a) Factor graph of the rigidity constraint. Black nodes represent the camera pose, blue nodes the
dynamic points, and red nodes the rigid segment lengths. Cyan and red rectangles represent the
measurements of points and rigidity, respectively. (b) A human can be modeled with points and segments
based on the body parts' rigidity. (c) Factor graph of the motion constraint. The orange node is the
estimated motion and the green rectangles denote the motion constraints.

In dynamic SLAM, the reprojection error e_p of dynamic feature points is also considered:

    e_p = h(x_k, ^l p_i^k) − ^l z_i^k,    (2)

where ^l p_i^k ∈ P_d are the dynamic points and ^l z_i^k are the corresponding observations of dynamic
points.

B. Rigidity Constraint

Let s_ij be the segment length between two feature points ^l p_i^k and ^l p_j^k. The rigidity constraint
states that s_ij is invariant over time, i.e., s_ij^k = s_ij^{k+1}, if ^l p_i^k and ^l p_j^k are on the same
rigid part of an articulated object, as shown in Fig. 2. Inspired by this, we model the dynamic articulated
object using a rigidity constraint, and thus we can define the rigidity error e_r as

    e_r = || ^l p_i^k − ^l p_j^k || − s_ij.    (3)

Fig. 3(a) shows the factor graph of the rigidity constraint, where the length of segment s_ij is invariant
after the motion. The benefits of involving the rigidity error (3) are two-fold. First, it offers a temporal
geometric constraint for dynamic points, which is able to correct the scale and 3D structure of dynamic
objects. Second, it provides a geometric check, which eliminates incorrectly matched points.

We model humans as a special articulated object shown in Fig. 3(b), where each human is described by
14 key points, including the nose, shoulders, elbows, hands, waist, knees, feet, etc. In the experiments,
we detect the human key points using the off-the-shelf algorithm AlphaPose [27].

C. Motion Constraint

We adopt the motion constraint from [7], which does not need a prior geometric model. For every
feature point on the same rigid part of an articulated object l, we have

    ^l p̄_i^{k+1} = ^l T ^l p̄_i^k,    (4)

where ^l T ∈ SE(3) is a motion transform associated with the object l and the bar indicates homogeneous
coordinates. Therefore, we can define the loss function for the motion constraint as:

    e_m = || ^l p̄_i^{k+1} − ^l T ^l p̄_i^k ||.    (5)

The motion constraint simultaneously estimates the objects' motion ^l T and enforces each point
^l p_i^k to follow the same motion pattern [7]. This motion model ^l T assumes that the object is rigid;
thus, for articulated objects, we apply the motion constraint on each rigid part of the articulated object.
In Fig. 3(c) we show the factor graph of the motion constraint.

In highly dynamic environments, even if we filter out the moving objects, the tracking of static features
is easily interrupted by the moving objects. By enforcing the motion constraints, dynamic objects are
able to contribute to the motion estimation of the camera pose. Therefore, when the static features are
not reliable enough, moving objects can correct the camera pose estimation, preventing tracking loss.

D. Bundle Adjustment

The bundle adjustment (BA) jointly optimizes the static points p_i, the dynamic points ^l p_i^k, the
segments s_ij, the camera poses x_k, and the dynamic object motions ^l T. This can be formulated as the
factor graph optimization:

    X*, P*, S*, T* = argmin_{X, P, S, T} Σ ( e_r^T Ω_{i,j}^{-1} e_r + e_m^T Ω_{i,l}^{-1} e_m + e_p^T Ω_{i,k}^{-1} e_p ),    (6)

where P is the union of P_s and P_d. This problem can be solved using the Levenberg-Marquardt
algorithm.

IV. SYSTEM OVERVIEW

We propose the framework AirDOS, shown in Fig. 4, for dynamic stereo visual SLAM; it consists of
three modules: pre-processing, tracking, and back-end bundle adjustment.

In the pre-processing and tracking modules, we first extract ORB features [28] and perform an
instance-level segmentation [29] to identify potential moving objects. We then estimate the initial
ego-motion by tracking the static features. For articulated objects like humans, we perform AlphaPose
[27] to extract the human key points and calculate their 3D positions by triangulating the corresponding
key points from the stereo images. We then track the moving humans using the optical flow generated
by PWC-Net [30]. The tracking module provides a reliable initialization for the camera pose and also
the object poses of dynamic objects.

In the back-end optimization, we construct a global map consisting of camera poses, static points,
dynamic points, and the motion of objects. We perform local bundle adjustment with dynamic objects
in the co-visibility graph [31] built
| 
 | ||||
[Fig. 4 block diagram. Pre-processing: the stereo images pass through instance-level segmentation,
human pose detection, and optical flow estimation. Tracking: static feature extraction and ego-motion
estimation, 3D human pose triangulation, motion estimation, and dynamic object tracking against the
local map. Back-end optimization: local and global bundle adjustment over a map of camera poses,
static points, dynamic points, object rigidity, and motion.]
| 
 | ||||
| Fig. 4. The framework of AirDOS, which is composed of three modules, i.e., pre-processing, tracking, and back-end optimization. | ||||
| 
 | ||||
TABLE I
PERFORMANCE ON KITTI DATASETS BASED ON ATE (m).

                    W/ Mask                   W/O Mask
Sequence      ORB-SLAM    AirDOS        ORB-SLAM    AirDOS
Test 18       0.933       0.934         0.937       0.948
Test 28       2.033       2.027         2.031       2.021
Train 13      1.547       1.618         1.551       1.636
Train 14      0.176       0.172         0.174       0.169
Train 15      0.240       0.234         0.240       0.234
Train 19      2.633       2.760         2.642       2.760
| 
 | ||||
Fig. 5. Qualitative analysis of the KITTI tracking dataset, training 19. Applying the rigidity constraint
and motion constraint improves the estimation of the objects' structure. [Legend: camera pose, object 1,
object 2; AirDOS vs. ORB-SLAM.]

from the co-visible landmarks, for the sake of efficiency. Similar to the strategy of RANSAC, we
eliminate the factors and edges which contribute a large error based on the rigidity constraint (3) and
the motion constraint (5). This process helps to identify mismatched or falsely estimated human poses.
Visual SLAM algorithms usually perform bundle adjustment only on selected keyframes because static
feature observations are repeated. However, in highly dynamic environments like the ones presented in
this paper, this might easily result in the loss of dynamic object tracking; therefore, we perform bundle
adjustment on every frame to capture the full trajectory.
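
To make "a large error based on the rigidity constraint (3) and motion constraint (5)" concrete, here is a
minimal NumPy sketch of the two residuals. The function names and the 4 × 4 homogeneous
representation of the object motion ^l T are illustrative assumptions, not the AirDOS implementation; in
the bundle adjustment of eq. (6) these scalars, weighted by their covariances, are stacked with the
reprojection errors and minimized with Levenberg-Marquardt.

import numpy as np

def rigidity_residual(p_i, p_j, s_ij):
    """Eq. (3): deviation of the current distance between two points on the
    same rigid part from the stored segment length s_ij."""
    return np.linalg.norm(p_i - p_j) - s_ij

def motion_residual(p_next, p_curr, T):
    """Eq. (5): distance between the observed point at time k+1 and the point
    at time k moved by the rigid-part motion T (4x4 matrix in SE(3))."""
    pred = (T @ np.append(p_curr, 1.0))[:3]   # homogeneous transform of the point
    return np.linalg.norm(p_next - pred)

# Toy check: two points on the same rigid part, translated by (1, 0, 0).
T = np.eye(4); T[0, 3] = 1.0
p_i, p_j = np.array([0.0, 0.0, 2.0]), np.array([0.0, 0.5, 2.0])
s_ij = np.linalg.norm(p_i - p_j)
p_i_next = (T @ np.append(p_i, 1.0))[:3]
p_j_next = (T @ np.append(p_j, 1.0))[:3]
print(rigidity_residual(p_i_next, p_j_next, s_ij))   # 0.0: the segment length is preserved
print(motion_residual(p_i_next, p_i, T))             # 0.0: the point follows the object motion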
| 
 | ||||
V. EXPERIMENTS

A. Metric, Baseline, and Implementation

We use the Absolute Translation Error (ATE) to evaluate our algorithm. Our method is compared
against the state-of-the-art ORB-SLAM [3], (1) with and (2) without the masking of potential dynamic
objects, and against the RGB-D dynamic SLAM algorithm of [20]. Similar to the setup described in
Section IV, we modified ORB-SLAM to perform BA on every frame with the observations from dynamic
features, so as to capture the full trajectory of the moving objects. In the experiments, we applied the
same parameters to AirDOS and ORB-SLAM, i.e., the number of feature points extracted per frame, the
threshold for RANSAC, and the covariance of the reprojection error.

B. Performance on KITTI Tracking Dataset

The KITTI tracking dataset [32] contains 50 sequences (29 for testing, 21 for training) with multiple
moving objects. We select 6 sequences that contain moving pedestrians. For evaluation, we generate the
ground truth using IMU and GPS. As shown in Table I, the ATEs of both our method and ORB-SLAM
are small in all sequences, which means that both methods perform well here. The main reason is that
the moving objects are relatively far and small, and there are plentiful static features in these sequences.
Moreover, most sequences contain simple translational motion, which makes these cases very simple.

Although the camera trajectories are similar, our algorithm recovers a better human model, as shown in
Fig. 5. ORB-SLAM generates noisy human poses when the human is far
| 
 | ||||
TABLE II
EXPERIMENTS ON TARTAN-AIR DATASET WITH AND WITHOUT MASK

                                             W/ Mask                               W/O Mask
Dataset                 Sequence    AirDOS    ORB-SLAM    VDO-SLAM [20]    AirDOS    ORB-SLAM
Standing Human          I           0.0606    0.0788      0.0994           0.0469    0.1186
                        II          0.0193    0.0060      0.6129           -         -
Road Crossing (Easy)    III         0.0951    0.0657      0.3813           0.0278    0.0782
                        IV          0.0331    0.0196      0.3879           0.1106    0.0927
                        V           0.0206    0.0148      0.2175           0.0149    0.0162
Road Crossing (Hard)    VI          0.2230    1.0984      0.2400           3.6700    4.3907
                        VII         0.5625    0.8476      0.6628           1.1572    1.4632
Overall                             0.1449    0.3044      0.3717           0.8379    1.0226

Results show Absolute Trajectory Error (ATE) in meters (m). '-' means that SLAM failed in this sequence.
| 
 | ||||
(a) Standing Human  (b) Road Crossing
Fig. 6. (a) Example of the Tartan-Air dataset, where almost everyone is standing. (b) Example of moving humans in a road crossing.
| 
 | ||||
That's because the rigidity constraint helps to recover the structure of the moving articulated objects. Also, the motion constraint can improve the accuracy of the dynamic objects' trajectories. Given the observations from the entire trajectory, our algorithm recovers the human pose and eliminates the mismatched dynamic feature points.

C. Performance on TartanAir Shibuya Dataset

   We notice that the moving objects in the KITTI dataset only cover a small field of view. To address the challenges of the highly dynamic environment, we build the TartanAir Shibuya dataset as shown in Fig. 6, and demonstrate that our method outperforms the existing dynamic SLAM algorithms on this benchmark. Our previous work TartanAir [33] is a very challenging visual SLAM dataset consisting of binocular RGB-D sequences together with additional per-frame information such as camera poses, optical flow, and semantic annotations. In this paper, we use the same pipeline to generate TartanAir Shibuya, which simulates the world's busiest road intersection at Shibuya, Tokyo, shown in Fig. 1. It covers much more challenging viewpoints and diverse motion patterns for articulated objects than TartanAir.
   We separate the TartanAir Shibuya dataset into two groups: Standing Humans in Fig. 6(a) and Road Crossing in Fig. 6(b), with easy and difficult categories. Each sequence contains 100 frames and more than 30 tracked moving humans. In the Standing Humans sequences, most of the humans stand still, while a few of them move around the space. In Road Crossing, there are multiple moving humans coming from different directions. For the difficult sequences, dynamic objects often enter the scene abruptly, in which case the visual odometry of traditional methods fails easily.

Fig. 7. Qualitative analysis of the TartanAir sequence IV. The moving objects tracked by ORB-SLAM are noisy, while our proposed method generates smooth trajectories. This shows that dynamic objects and the camera pose can benefit each other in visual SLAM.

   1) Evaluation: To test the robustness of our system when the visual odometry is interrupted by dynamic objects, or in cases where the segmentation might fail due to indirect occlusions such as illumination changes, we evaluate the performance in two settings: with and without masking the dynamic features during ego-motion estimation.
   As shown in Table II, with human masks, our algorithm obtains 39.5% and 15.2% improvements compared to ORB-SLAM [3] and VDO-SLAM [20] in the overall performance. In Sequences II, IV and V, both ORB-SLAM and our algorithm show good performance, where all ATEs are lower than 0.04. We notice that the performance of VDO-SLAM is not as good as ORB-SLAM. This may be because VDO-SLAM relies heavily on optical flow for feature matching, so it is likely to confuse background features with dynamic features.
   Our algorithm also outperforms ORB-SLAM without masking the potential moving objects. As shown in Sequences I, III, VI, and VII of Table II, our method obtains a higher accuracy than ORB-SLAM by 0.0717, 0.050, 0.721 and 0.306. Overall, we achieve an improvement of 18.1%. That's because moving objects can easily lead traditional visual odometry to fail, but we take the observations from moving articulated objects to rectify the camera poses, and filter out the mismatched dynamic features.
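To make the "with mask" setting concrete, the sketch below drops feature points that fall on segmented, potentially dynamic objects before they are used for ego-motion estimation. The array layout and helper name are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def filter_dynamic_features(keypoints, dynamic_mask):
    """Keep only keypoints that do not lie on potentially dynamic objects.

    keypoints:    (N, 2) array of (u, v) pixel coordinates.
    dynamic_mask: (H, W) boolean array, True where a person/vehicle was segmented.
    """
    u = np.clip(keypoints[:, 0].astype(int), 0, dynamic_mask.shape[1] - 1)
    v = np.clip(keypoints[:, 1].astype(int), 0, dynamic_mask.shape[0] - 1)
    keep = ~dynamic_mask[v, u]
    return keypoints[keep], keep
```

Only the surviving features would feed the ego-motion stage, while the masked points can still be tracked separately as candidate dynamic objects.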
| 
 | ||||
                                                     TABLE III
                                       ABLATION STUDY ON SIMULATED DATASET

Groups              | I: RPE-R / RPE-T / ATE    | II: RPE-R / RPE-T / ATE   | III: RPE-R / RPE-T / ATE  | IV: RPE-R / RPE-T / ATE   | Overall: RPE-R / RPE-T / ATE
Before BA           | 0.4898 / 15.991 / 83.441  | 0.6343 / 17.728 / 109.968 | 1.1003 / 21.070 / 138.373 | 0.7925 / 19.242 / 168.312 | 0.7908 / 18.8328 / 125.024
BA w/ static point  | 0.0989 / 3.3184 / 15.002  | 0.1348 / 3.7146 / 25.796  | 0.2028 / 4.2522 / 17.085  | 0.1389 / 3.5074 / 35.521  | 0.1537 / 3.7540 / 23.351
BA w/o motion       | 0.0988 / 3.3176 / 15.019  | 0.1349 / 3.7176 / 25.708  | 0.2035 / 4.2631 / 16.985  | 0.1388 / 3.5069 / 35.269  | 0.1538 / 3.7565 / 23.245
BA w/o rigidity     | 0.0962 / 3.2245 / 14.881  | 0.1282 / 3.4984 / 25.704  | 0.1871 / 4.0387 / 16.921  | 0.1226 / 3.2397 / 35.426  | 0.1410 / 3.5148 / 23.233
BA in Equation (6)  | 0.0958 / 3.2177 / 14.879  | 0.1276 / 3.4824 / 25.703  | 0.1870 / 4.0372 / 16.914  | 0.1215 / 3.2227 / 35.412  | 0.1407 / 3.5085 / 23.227

Results show RPE-T and ATE in centimeters (cm) and RPE-R in degrees (°).
| 
 | ||||
   It can be seen in Fig. 7 that ORB-SLAM was interrupted by the moving humans and failed when making a large rotation. By tracking moving humans, our method outperforms ORB-SLAM when making a turn. Also, a better camera pose estimation can in turn benefit the moving objects' trajectories. As can be seen, the objects' trajectories generated by ORB-SLAM are noisy and inconsistent, while ours are smoother. In general, the proposed motion constraint and rigidity constraint have a significant impact on the difficult sequences, where ORB-SLAM outputs inaccurate trajectories due to dynamic objects.
| 
 | ||||
                      VI. ABLATION STUDY

   We perform an ablation study to show the effects of the introduced rigidity and motion constraints. Specifically, we demonstrate that the motion constraint and rigidity constraint improve the camera pose estimation via bundle adjustment.

A. Implementation

   We simulate dynamic articulated objects that follow a simple constant motion pattern, and initialize the robot's state with Gaussian noise of σ = 0.05 m on translation and σ = 2.9° on rotation. We also generate static features around the path of the robot, and simulate a sensor with a finite field of view. The measurement of each point also has a noise of σ = 0.05 m. We generate 4 groups of sequences with different lengths, and each group consists of 10 sequences that are initialized with the same number of static and dynamic features. We set the ratio of static to dynamic landmarks to 1:1.8.
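A minimal sketch of such a perturbed initialization is given below, assuming the stated noise levels (σ = 0.05 m on translation, σ = 2.9° on rotation) and, for brevity, applying the rotational noise to yaw only; this is illustrative and not the authors' simulator.

```python
import numpy as np

def perturb_robot_states(gt_positions, gt_yaws, sigma_t=0.05, sigma_r_deg=2.9, seed=0):
    """Add Gaussian noise to ground-truth robot states to initialize the bundle adjustment.

    gt_positions: (N, 3) ground-truth positions in meters.
    gt_yaws:      (N,) ground-truth yaw angles in radians.
    """
    rng = np.random.default_rng(seed)
    noisy_p = gt_positions + rng.normal(0.0, sigma_t, gt_positions.shape)
    noisy_yaw = gt_yaws + np.deg2rad(rng.normal(0.0, sigma_r_deg, gt_yaws.shape))
    return noisy_p, noisy_yaw
```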
B. Results

   We evaluate the performance of (a) bundle adjustment with static features only, (b) bundle adjustment without the motion constraint, (c) bundle adjustment without the rigidity constraint, and (d) bundle adjustment with both the motion constraint and rigidity constraint. We use the Absolute Translation Error (ATE) and the Relative Pose Error of Rotation (RPE-R) and Translation (RPE-T) as our evaluation metrics.
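For completeness, the sketch below shows how RPE-T and RPE-R are typically computed from consecutive relative poses; it is our illustration and assumes the trajectories are given as lists of 4x4 homogeneous camera-to-world matrices at matching timestamps.

```python
import numpy as np

def relative_pose_errors(T_est, T_gt):
    """Per-pair relative pose errors: translation (input units) and rotation (degrees)."""
    rpe_t, rpe_r = [], []
    for k in range(len(T_est) - 1):
        d_est = np.linalg.inv(T_est[k]) @ T_est[k + 1]   # estimated relative motion
        d_gt = np.linalg.inv(T_gt[k]) @ T_gt[k + 1]      # ground-truth relative motion
        e = np.linalg.inv(d_gt) @ d_est                  # residual transform
        rpe_t.append(np.linalg.norm(e[:3, 3]))
        cos_angle = np.clip((np.trace(e[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rpe_r.append(np.degrees(np.arccos(cos_angle)))
    return np.array(rpe_t), np.array(rpe_r)
```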
   As shown in Table III, both motion and rigidity constraints are able to improve the camera pose estimation, while the best performance is obtained when the two constraints are applied together. An interesting phenomenon is that the rigidity constraint can also benefit the objects' trajectory estimation. In Group I, we evaluate the estimation of dynamic points with settings (b), (c), and (d), with 100 repeated experiments. We find that the ATE of dynamic object feature points in setting (c) is 5.68 ± 0.30 lower than in setting (b), while setting (d) is 5.71 ± 0.31 lower than (b). This is because the motion constraint assumes that every dynamic feature on the same object follows the same motion pattern, which requires the object to be rigid. From another point of view, the rigidity constraint provides a good initialization of the objects' 3D structure, and so indirectly improves the estimation of the objects' trajectories. In general, the ablation study proves that applying motion and rigidity constraints to dynamic articulated objects can benefit the camera pose estimation.

C. Computational Analysis

   Finally, we evaluate the running time of the rigidity constraint and motion constraint in the optimization. The back-end optimization is implemented in C++ with a modified g2o [34] solver. With the same setup as Section VI-A, we randomly initialized 10 different sequences with 18 frames. In each frame, we can observe 8 static landmarks and 12 dynamic landmarks from one moving object. In Table IV, we show the (i) convergence time and (ii) runtime per iteration of Group I in the ablation study. Our method takes 53.54 ms to converge, which is comparable to 39.22 ms for the optimization with the re-projection error only.
   In this paper, the semantic masks [29] and human poses [27] are pre-processed as an input to the system. The experiments are carried out on an Intel Core i7 with 16 GB RAM.

                                TABLE IV
                  TIME ANALYSIS OF BUNDLE ADJUSTMENT

                           Convergence Time (ms)   Runtime/iter (ms)
BA w/ reprojection error          39.22                 4.024
BA w/o Rigidity                   45.47                 4.078
BA w/o Motion                     45.37                 4.637
BA in Equation (6)                53.54                 4.792

                              CONCLUSION

   In this paper, we introduce the rigidity constraint and motion constraint to model dynamic articulated objects. We propose a new pipeline, AirDOS, for stereo SLAM, which jointly optimizes the trajectories of dynamic objects, the map of the environment, and the camera poses, improving robustness and accuracy in dynamic environments. We evaluate our algorithm on the KITTI Tracking and TartanAir Shibuya datasets, and demonstrate that camera pose estimation and dynamic objects can benefit each other, especially when there is an aggressive rotation or static features are not enough to support the visual odometry.
| 
 | ||||
                            REFERENCES

 [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
 [2] C. Wang, J. Yuan, and L. Xie, "Non-iterative SLAM," in International Conference on Advanced Robotics (ICAR). IEEE, 2017, pp. 83–90.
 [3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [4] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
 [5] L. Nicholson, M. Milford, and N. Sünderhauf, "QuadricSLAM: Dual quadrics from object detections as landmarks in object-oriented SLAM," IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2018.
 [6] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D object SLAM," IEEE Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019.
 [7] M. Henein, G. Kennedy, R. Mahony, and V. Ila, "Exploiting rigid body motion for SLAM in dynamic environments," environments, vol. 18, p. 19, 2018.
 [8] K. M. Judd, J. D. Gammell, and P. Newman, "Multimotion visual odometry (MVO): Simultaneous estimation of camera and third-party motions," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956.
 [9] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The need for speed," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2123–2129.
[10] G. Stamou, M. Krinidis, E. Loutas, N. Nikolaidis, and I. Pitas, "4.11 - 2D and 3D motion tracking in digital video," Handbook of Image and Video Processing, 2005.
[11] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
| 
 | ||||
[12] C. Kerl, J. Sturm, and D. Cremers, "Robust odometry estimation for RGB-D cameras," in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 3748–3754.
[13] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[14] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, "Robust dense mapping for large-scale dynamic environments," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7510–7517.
[15] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, "RGB-D SLAM in dynamic environments using point correlations," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[16] T. Ji, C. Wang, and L. Xie, "Towards real-time semantic RGB-D SLAM in dynamic environments," in 2021 International Conference on Robotics and Automation (ICRA), 2021.
[17] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, "Map building with mobile robots in dynamic environments," in 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), vol. 2. IEEE, 2003, pp. 1557–1563.
[18] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous localization, mapping and moving object tracking," The International Journal of Robotics Research, vol. 26, no. 9, pp. 889–916, 2007.
[19] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, "Dynamic body VSLAM with semantic constraints," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1897–1904.
[20] J. Zhang, M. Henein, R. Mahony, and V. Ila, "VDO-SLAM: A visual dynamic object-aware SLAM system," arXiv preprint arXiv:2005.11052, 2020.
[21] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The need for speed," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 2123–2129.
[22] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, "3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans," arXiv preprint arXiv:2002.06289, 2020.
[23] Z. Zhang, O. D. Faugeras, and N. Ayache, "Analysis of a sequence of stereo scenes containing multiple moving objects using rigidity constraints," in ICCV, 1988.
[24] W. B. Thompson, P. Lechleider, and E. R. Stuck, "Detecting moving objects using the rigidity constraint," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 2, pp. 162–166, 1993.
[25] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous localisation and mapping at the level of objects," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
[26] M. Kaess, A. Ranganathan, and F. Dellaert, "iSAM: Incremental smoothing and mapping," IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1365–1378, 2008.
[27] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, "RMPE: Regional multi-person pose estimation," in ICCV, 2017.
[28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2564–2571.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[30] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
[31] C. Mei, G. Sibley, and P. Newman, "Closing loops without places," in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3738–3744.
[32] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[33] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
[34] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 3607–3613.
| 
 | ||||
|  | ||||
| @ -0,0 +1,663 @@ | ||||
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022
| 
 | ||||
| DynaVINS: A Visual-Inertial SLAM for | ||||
|           Dynamic Environments | ||||
| 
 | ||||
Seungwon Song, Hyungtae Lim, Graduate Student Member, IEEE, Alex Junho Lee, and Hyun Myung, Senior Member, IEEE
| 
 | ||||
   Abstract—Visual inertial odometry and SLAM algorithms are widely used in various fields, such as service robots, drones, and autonomous vehicles. Most SLAM algorithms are based on the assumption that landmarks are static. However, in the real world, various dynamic objects exist, and they degrade the pose estimation accuracy. In addition, temporarily static objects, which are static during observation but move when they are out of sight, trigger false positive loop closings. To overcome these problems, we propose a novel visual-inertial SLAM framework, called DynaVINS, which is robust against both dynamic objects and temporarily static objects. In our framework, we first present a robust bundle adjustment that can reject the features from dynamic objects by leveraging pose priors estimated by the IMU preintegration. Then, a keyframe grouping and a multi-hypothesis-based constraints grouping method are proposed to reduce the effect of temporarily static objects in the loop closing. Subsequently, we evaluated our method on a public dataset that contains numerous dynamic objects. Finally, the experimental results corroborate that our DynaVINS has promising performance compared with other state-of-the-art methods by successfully rejecting the effect of dynamic and temporarily static objects.

   Index Terms—Visual-inertial SLAM, SLAM, visual tracking.

   Manuscript received 27 April 2022; accepted 22 August 2022. Date of publication 31 August 2022; date of current version 6 September 2022. This letter was recommended for publication by Associate Editor M. Magnusson and Editor S. Behnke upon evaluation of the reviewers' comments. This work was supported in part by the "Indoor Robot Spatial AI Technology Development" project funded by KT, KT award under Grant B210000715, and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) under Grant 2020-0-00440, Development of Artificial Intelligence Technology that Continuously Improves Itself as the Situation Changes in the Real World. The students are supported by the BK21 FOUR from the Ministry of Education (Republic of Korea). (Corresponding author: Hyun Myung.)
   Seungwon Song, Hyungtae Lim, and Hyun Myung are with the School of Electrical Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail: sswan55@kaist.ac.kr; shapelim@kaist.ac.kr; hmyung@kaist.ac.kr).
   Alex Junho Lee is with the Department of Civil and Environmental Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail: alex_jhlee@kaist.ac.kr).
   Our code is available: https://github.com/url-kaist/dynaVINS
   This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3203231, provided by the authors.
   Digital Object Identifier 10.1109/LRA.2022.3203231

Fig. 1. Our algorithm, DynaVINS, in various dynamic environments. (a)–(b) Feature rejection results in the city_day sequence of the VIODE dataset [13]. Even if most features are dynamic, DynaVINS can discard the effect of the dynamic features. (c) Separation of feature matching results into multiple hypotheses in the E shape sequence of our dataset. Even if a temporarily static object exists, only a hypothesis from static objects is determined as true positive. Features with high and low weights are denoted as green circles and red crosses, respectively, in both cases.

                           I. INTRODUCTION

   SIMULTANEOUS localization and mapping (SLAM) algorithms have been widely exploited in various robotic applications that require precise positioning or navigation in environments where GPS signals are blocked. Various types of sensors have been used in SLAM algorithms. In particular, visual sensors such as monocular cameras [1], [2], [3] and stereo cameras [4], [5], [6] are widely used because of their relatively low cost and weight with rich information.
   Various visual SLAM methods have been studied for more than a decade. However, most researchers have assumed that landmarks are implicitly static; thus, many visual SLAM methods still have potential risks when interacting with real-world environments that contain various dynamic objects. Only recently, several studies focused on dealing with dynamic objects solely using visual sensors.
   Most of the studies [7], [8], [9] address the problems by detecting the regions of dynamic objects via depth clustering, feature reprojection, or deep learning. Moreover, some researchers incorporate the dynamics of the objects into the optimization framework [10], [11], [12]. However, geometry-based methods require accurate camera poses; hence they can only deal with limited fractions of dynamic objects. In addition, deep-learning-aided methods have the limitation of solely working for predefined objects.
   In the meanwhile, visual-inertial SLAM (VI-SLAM) frameworks [2], [3], [4], [5], [6] have been proposed by integrating an inertial measurement unit (IMU) into the visual SLAM. Unlike the visual SLAMs, a motion prior from the IMU helps the VI-SLAM algorithms to tolerate scenes with dynamic objects to some degree. However, if the dominant dynamic objects occlude
| 
 | ||||
| 
 | ||||
| 
 | ||||
most of the view, as shown in Fig. 1(b), the problem cannot be solved solely using the motion prior.
   In addition, in real-world applications, temporarily static objects are static while being observed but in motion when they are not under observation. These objects may lead to a critical failure in the loop closure process due to false positives, as shown in Fig. 1(c). To deal with temporarily static objects, robust back-end methods [14], [15], [16], [17] have been proposed to reduce the effect of the false positive loop closures in the optimization. However, since they focused on instantaneous false positive loop closures, they cannot deal with the persistent false positive loop closures caused by the temporarily static objects.
   In this study, to address the aforementioned problems, we propose a robust VI-SLAM framework, called DynaVINS, which is robust against dynamic and temporarily static objects. Our contributions are summarized as follows:
   • The robust VI-SLAM approach is proposed to handle dominant, undefined dynamic objects that cannot be solved solely by learning-based or vision-only methods.
   • A novel bundle adjustment (BA) pipeline is proposed for simultaneously estimating camera poses and discarding the features from the dynamic objects that deviate significantly from the motion prior.
   • A robust global optimization with constraints grouped into multiple hypotheses is proposed to reject persistent loop closures from the temporarily static objects.
   In the remainder of this letter, we introduce the robust BA method for optimizing moving windows in Section III, methods for the robust global optimization in Section IV, and compare our proposed method with other state-of-the-art (SOTA) methods in various environments in Section V.

                          II. RELATED WORKS

A. Visual-Inertial SLAM

   As mentioned earlier, to address the limitations of the visual SLAM framework, VI-SLAM algorithms have been recently proposed to correct the scale and camera poses by adopting the IMU. MSCKF [3] was proposed as an extended Kalman filter (EKF)-based VI-SLAM algorithm. ROVIO [6] also used an EKF, but proposed a fully robocentric and direct VI-SLAM framework running in real time.
   There are other approaches using optimization. OKVIS [5] proposed a keyframe-based framework and fuses the IMU preintegration residual and the reprojection residual in an optimization. ORB-SLAM3 [4] used an ORB descriptor for the feature matching, and poses and feature positions are corrected through an optimization. VINS-Fusion [2], an extended version of VINS-Mono, supports a stereo camera and adopts feature tracking, rather than descriptor matching, which makes the algorithm faster and more robust.
   However, the VI-SLAM methods described above still have potential limitations in handling the dominant dynamic objects and the temporarily static objects.

B. Dynamic Objects Rejection in Visual and VI SLAM

   Numerous researchers have proposed various methods to handle dynamic objects in visual and VI SLAM algorithms. Fan et al. [8] proposed a multi-view geometry-based method using an RGB-D camera. After obtaining camera poses by minimizing the reprojection error, the type of each feature point is determined as dynamic or static by the geometric relationship between the camera movement and the feature. Canovas et al. [9] proposed a similar method, but adopted a surfel, similar to a polygon, to enable real-time performance by reducing the number of items to be computed. However, multi-view geometry-based algorithms assume that the camera pose estimation is accurate enough, leading to failure when the camera pose estimation is inaccurate owing to the dominant dynamic objects.
   One of the solutions to this problem is to employ a wheel encoder. G2P-SLAM [18] rejected loop closure matching results with a high Mahalanobis distance from the pose estimated by the wheel odometry, which is invariant to the effect of dynamic and temporarily static objects. Despite the advantages of the wheel encoder, these methods are highly dependent on the wheel encoder, limiting their applicability.
   Another feasible approach is to adopt deep learning networks to identify predefined dynamic objects. In DynaSLAM [7], masked areas of the predefined dynamic objects obtained by a deep learning network were eliminated and the remainder was determined via multi-view geometry. In Dynamic SLAM [19], a compensation method was adopted to make up for missed detections in a few keyframes using sequential data. Although the deep learning methods can successfully discard the dynamic objects even if they are temporarily static, these methods are somewhat problematic for the following two reasons: a) the types of dynamic objects have to be predefined, and b) sometimes, only a part of the dynamic object is visible, as shown in Fig. 1(b). For these reasons, the objects may occasionally not be detected.
   On the other hand, methods for tracking a dynamic object's motion have been proposed. RigidFusion [10] assumed that only a single dynamic object is in the environment and estimated the motion of the dynamic object. Qiu et al. [12] combined a deep learning method and VINS-Mono [2] to track poses of the camera and object simultaneously. DynaSLAM II [11] identified dynamic objects, similar to DynaSLAM [7]; then, within the BA factor graph, the poses of static features and the camera were estimated while estimating the motion of the dynamic objects simultaneously.

C. Robust Back-End

   In the graph SLAM field, several researchers have attempted to discard incorrectly created constraints. For instance, max-mixture [14] employed a single integrated Bayesian framework to eliminate the incorrect loop closures, while switchable constraints [15] were proposed to adjust the weight of each constraint to eliminate false positive loop closures in the optimization. However, false-positive loop closures can be expected to be consistent and to occur persistently due to the temporarily static objects. These robust kernels are not appropriate for handling such persistent loop closures.
   On the other hand, the Black-Rangarajan (B-R) duality [20] was proposed to unify robust estimation and the outlier rejection process. Some methods [16], [17] utilize B-R duality in point cloud registration and pose graph optimization (PGO) to reduce the effect of false-positive matches even if they are dominant. These methods are useful for rejecting outliers in a PGO. However, repeatedly detected false-positive loop closures from similar objects are not considered. Moreover, B-R duality has not yet been utilized in the BA of VI-SLAM.
   To address the aforementioned limitations, we improve the VI-SLAM to minimize the effect of the dynamic and temporarily static objects by adopting the B-R duality not only in the graph structure but also in the BA framework, by reflecting the IMU prior and the feature tracking information.
| 
 | ||||
| 
 | ||||
Fig. 2. The pipeline of our robust visual-inertial SLAM. Features are tracked in mono or stereo images and IMU data are preintegrated in the sensor preprocessing step. Then, the robust BA is applied to discard tracked features from dynamic objects, and only the features from static objects will remain. Keyframes are grouped using the number of tracked features, and loop closures detected in current keyframe groups are clustered into hypotheses. Each hypothesis with its weight is used or rejected in the selective optimization. Using the proposed framework, a trajectory robust against dynamic and temporarily static objects can be obtained.
| 
 | ||||
                III. ROBUST BUNDLE ADJUSTMENT

A. Notation

   In this letter, the following notations are defined. The i-th camera frame and the j-th tracked feature are denoted as $C_i$ and $f_j$, respectively. For two frames $C_A$ and $C_B$, $T^B_A \in SE(3)$ denotes the pose of $C_A$ relative to $C_B$, and the pose of $C_A$ in the world frame $W$ can be denoted as $T^W_A$.
   $\mathcal{B}$ is the set of indices of the IMU preintegrations, and $\mathcal{P}$ is the set of visual pairs $(i, j)$, where $i$ corresponds to the frame $C_i$ and $j$ to the feature $f_j$. Because the feature $f_j$ is tracked across multiple camera frames, different camera frames can contain the same feature $f_j$. Thus, the set of indices of all tracked features in the current moving window is denoted as $\mathcal{F}_P$, and the set of indices of the camera frames that contain the feature $f_j$ is denoted as $\mathcal{P}(f_j)$.
   In the visual-inertial optimization framework of the current sliding window, $\mathcal{X}$ represents the full state vector that contains the sets of poses and velocities of the keyframes, the biases of the IMU, i.e., acceleration and gyroscope biases, and the estimated depths of the features, as in [2].
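The index sets above can be pictured with ordinary containers; the toy values below are our own illustration of the notation, not data from the letter.

```python
# Visual pairs (i, j): frame C_i observes feature f_j.
P = {(0, 7), (1, 7), (2, 7), (1, 9), (2, 9)}

# F_P: indices of all features tracked in the current sliding window.
F_P = {j for (_, j) in P}

# P(f_j): indices of the camera frames that contain feature f_j.
def frames_observing(j, pairs=P):
    return {i for (i, jj) in pairs if jj == j}

assert F_P == {7, 9}
assert frames_observing(7) == {0, 1, 2}
```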
| 
 | ||||
B. Conventional Bundle Adjustment

   In the conventional visual-inertial state estimator [2], the visual-inertial BA formulation is defined as follows:

$$\min_{\mathcal{X}} \Big\{ \| r_p - H_p \mathcal{X} \|^2 + \sum_{k \in \mathcal{B}} \big\| r_I\big(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}\big) \big\|^2_{P^{b_k}_{b_{k+1}}} + \sum_{(i,j) \in \mathcal{P}} \rho_H\Big( \big\| r_P\big(\hat{z}^{C_i}_j, \mathcal{X}\big) \big\|^2_{P^{C_i}_j} \Big) \Big\}, \qquad (1)$$

where $\rho_H(\cdot)$ denotes the Huber loss [21]; $r_p$, $r_I$, and $r_P$ represent the residuals for the marginalization, IMU, and visual reprojection measurements, respectively; $\hat{z}^{b_k}_{b_{k+1}}$ and $\hat{z}^{C_i}_j$ stand for the observations of the IMU and feature points; $H_p$ denotes a measurement estimation matrix of the marginalization, and $P$ denotes the covariance of each term. For convenience, $r_I(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X})$ and $r_P(\hat{z}^{C_i}_j, \mathcal{X})$ are simplified as $r^k_I$ and $r^P_{j,i}$, respectively.
   The Huber loss does not work successfully once the ratio of outliers increases. This is because the Huber loss does not entirely reject the residuals from outliers [22]. On the other hand, the redescending M-estimators, such as Geman-McClure (GMC) [23], ignore the outliers perfectly once the residuals are over a specific range owing to their zero gradients. Unfortunately, this truncation triggers a problem that features considered as outliers would never become inliers, even though the features originated from static objects.
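To illustrate this trade-off numerically, the short sketch below compares the Huber loss with the redescending Geman-McClure loss on squared residuals (scale parameters chosen arbitrarily by us). The Huber loss keeps growing, so large residuals still pull on the optimum, whereas the Geman-McClure loss saturates, so a feature once treated as an outlier contributes almost no gradient and stays rejected.

```python
import numpy as np

def huber(r_sq, delta=1.0):
    """Huber loss evaluated on squared residuals r_sq = ||r||^2."""
    r = np.sqrt(r_sq)
    return np.where(r <= delta, 0.5 * r_sq, delta * (r - 0.5 * delta))

def geman_mcclure(r_sq, mu=1.0):
    """Geman-McClure loss: saturates toward mu, so outlier gradients vanish."""
    return mu * r_sq / (mu + r_sq)

r_sq = np.array([0.1, 1.0, 10.0, 100.0])
print(huber(r_sq))          # [0.05 0.5  2.66 9.5 ] -> outliers still influence the fit
print(geman_mcclure(r_sq))  # [0.09 0.5  0.91 0.99] -> outliers effectively ignored
```

Note that the converged loss derived later in (8) has exactly this saturating Geman-McClure form, which reflects the Black-Rangarajan connection the authors build on.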
| 
 | ||||
   To address these problems, our BA method consists of two parts: a) a regularization factor that leverages the IMU preintegration, and b) a momentum factor that considers the previous state of each weight to cover the case where the preintegration becomes temporarily inaccurate.

C. Regularization Factor

   First, to reject the outlier features while robustly estimating the poses, we propose a novel loss term inspired by the B-R duality [20] as follows:

$$\rho(w_j, r^P_j) = w_j^2 r^P_j + \lambda_w \Phi^2(w_j), \qquad (2)$$

where $r^P_j$ denotes $\sum_{i \in \mathcal{P}(f_j)} \| r^P_{j,i} \|^2$ for simplicity, $w_j \in [0, 1]$ denotes the weight corresponding to each feature $f_j$, and $f_j$ with $w_j$ close to 1 is determined as a static feature; $\lambda_w \in \mathbb{R}^+$ is a constant parameter; $\Phi(w_j)$ denotes the regularization factor of the weight $w_j$ and is defined as follows:

$$\Phi(w_j) = 1 - w_j. \qquad (3)$$

   Then, $\rho(w_j, r^P_j)$ in (2) is adopted instead of the Huber norm in the visual reprojection term in (1). Hence, the BA formulation can be expressed as:

$$\min_{\mathcal{X}, \mathcal{W}} \Big\{ \| r_p - H_p \mathcal{X} \|^2 + \sum_{k \in \mathcal{B}} \| r^k_I \|^2 + \sum_{j \in \mathcal{F}_P} \rho(w_j, r^P_j) \Big\}, \qquad (4)$$

where $\mathcal{W} = \{ w_j \mid j \in \mathcal{F}_P \}$ represents the set of all weights. By adopting the weight and regularization factor inspired by B-R duality, the influence of features with a high reprojection error compared to the estimated state can be reduced while maintaining the state estimation performance. The details will be covered in the remainder of this subsection.
   (4) is solved using an alternating optimization [20]. Because the current state $\mathcal{X}$ can be estimated from the IMU preintegration and the previously optimized state, unlike other methods [16], [17], $\mathcal{W}$ is updated first with $\mathcal{X}$ fixed. Then, $\mathcal{X}$ is optimized with $\mathcal{W}$ fixed.
   While optimizing $\mathcal{W}$, all terms except the weights are constants. Hence, the formulation for optimizing the weights can be expressed as follows:

$$\min_{\mathcal{W}} \Big\{ \sum_{j \in \mathcal{F}_P} \rho(w_j, r^P_j) \Big\}. \qquad (5)$$
| 
 | ||||
| 
 | ||||
Fig. 4. Framework of the robust BA. Each feature has a weight and is used in the visual residual. Each weight has been optimized through the regularization factor and the weight momentum factor. Preintegrated IMU data are used in the IMU residual term. All parameters are optimized in the robust BA.

Fig. 3. Changes of the loss functions w.r.t. various parameters. (a) $\rho(w_j, r^P_j)$ w.r.t. $w_j$ in the alternating optimization for $\lambda_w = 1$. $\bar{\rho}(r^P_j)$ represents the converged loss. (b) $\bar{\rho}(r^P_j)$ w.r.t. $\lambda_w$. (c) $\bar{\rho}_m(r^P_j)$ w.r.t. $\bar{w}_j$ for $n_j = 5$. (d) $\bar{\rho}_m(r^P_j)$ w.r.t. $n_j$ for $\bar{w}_j = 0$.
   Because the weights $w_j$ are independent of each other, (5) can be optimized independently for each $w_j$ as follows:

$$\min_{w_j \in [0,1]} \Big\{ w_j^2 \Big( \sum_{i \in \mathcal{P}(f_j)} \| r^P_{j,i} \|^2 \Big) + \lambda_w \Phi^2(w_j) \Big\}. \qquad (6)$$

Because the terms in (6) are in a quadratic form w.r.t. $w_j$, the optimal $w_j$ can be derived as follows:

$$w_j = \frac{\lambda_w}{r^P_j + \lambda_w}. \qquad (7)$$

   As mentioned previously, the weights are first optimized based on the estimated state. Thus the weights of features with high reprojection errors start with small values. However, as shown in Fig. 3(a), the loss of a feature $\rho(w_j, r^P_j)$ is a convex function unless the weight is zero, so there is a non-zero gradient not only in the loss of an inlier feature but also in the loss of an outlier feature. This means that a new feature affects the BA regardless of its type at first.
   While the optimization step is repeated until the states and the weights have converged, the weights of the outlier features are lowered and their losses are flattened further. As a result, the losses of the outlier features approach a zero gradient and cannot affect the BA.
   After convergence, the weight can be expressed using the reprojection error as in (7). Thus the converged loss $\bar{\rho}(r^P_j)$ can be derived by applying (7) to (2) as follows:

$$\bar{\rho}(r^P_j) = \frac{\lambda_w r^P_j}{\lambda_w + r^P_j}. \qquad (8)$$

   As shown in Fig. 3(b), increasing $\lambda_w$ affects $\bar{\rho}(r^P_j)$ in two directions: it increases the gradient value and the convexity. By increasing the gradient value, the visual reprojection residuals affect the BA more than the marginalization and IMU preintegration residuals. And by increasing the convexity, some of the outlier features can affect the BA.
   To sum up, the proposed factor benefits from both the Huber loss and GMC by adjusting the weights in an adaptive way; our method efficiently filters out outliers, but does not entirely ignore outliers in the optimization at first either.
|                                                                                                ⎨                                                        ⎬ | ||||
|                                                                                                             2+       rkI 2 +                   wj , rjP ⎭ . | ||||
|    As shown in Fig. 3(b), increasing λw affects ρ¯(rjP ) in two di-                      min   ⎩  rp − HpX                                 ρm           (11) | ||||
| rections: increasing the gradient value and convexity. By increas- | ||||
|                                                                                          X ,W                   k∈B               j∈FP | ||||
| 
 | ||||
| ing the gradient value, the visual reprojection residuals affect                         (11) can be solved by using the alternating optimization in the | ||||
| 
 | ||||
| the BA more than the marginalization and IMU preintegration                              same way as (4). The alternating optimization is iterated until X | ||||
|                                                                                          and W are converged. Then, the converged loss ρ¯m(rjP ) can be | ||||
| residuals. And by increasing the convexity, some of the outlier                          derived. ρ¯m(rjP ) w.r.t. w¯j and nj is shown in Fig. 3(c) and (d), | ||||
|                                                                                          respectively. | ||||
| features can affect the BA. | ||||
| 
 | ||||
| To sum up, the proposed factor benefits from both Huber | ||||
| 
 | ||||
| loss and GMC by adjusting the weights in an adaptive way;                                   As shown in Fig. 3(c), if w¯ is low, the gradient of the loss is | ||||
|                                                                                          small even when rjP is close to 0. Thus, the features presumably | ||||
| our method efficiently filters out outliers, but does not entirely | ||||
| 
 | ||||
| ignore outliers in the optimization at first as well. | ||||
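As a quick sanity check on (7) and (8), the closed-form weight and the converged loss can be reproduced symbolically. The snippet below is a minimal sketch (not from the paper): it minimizes the per-feature loss w²·r + λw(1 − w)² over w, assuming the regularization factor Φ(w) = 1 − w (consistent with (7) and with the loop-closure regularizer in (17)), with r standing for the summed squared reprojection error of the feature.

    # Symbolic check of the closed-form weight (7) and the converged loss (8).
    # Assumption: per-feature loss  w^2 * r + lambda_w * (1 - w)^2,  with r >= 0.
    import sympy as sp

    w, r, lam = sp.symbols('w r lambda_w', nonnegative=True)
    loss = w**2 * r + lam * (1 - w)**2            # regularized per-feature loss
    w_opt = sp.solve(sp.diff(loss, w), w)[0]      # stationary point in w
    print(sp.simplify(w_opt))                     # lambda_w/(lambda_w + r)   -> (7)
    print(sp.simplify(loss.subs(w, w_opt)))       # lambda_w*r/(lambda_w + r) -> (8)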
| 
 | ||||
| 
 | ||||
Furthermore, as shown in Fig. 3(d), if w̄j is zero, the gradient gets smaller as nj increases; hence the tracked outlier feature has less effect on the BA, and the longer it is tracked, the less it affects the BA.
| 
 | ||||
For the stereo camera configuration, in addition to the reprojection on one camera, reprojections on the other camera in the same keyframe, rPstereo, or on another keyframe, rPanother, exist. In that case, weights are also applied to the reprojection rPanother because it is also affected by the movement of features, while rPstereo is invariant to the movement of features and is only adopted as the criterion for the depth estimation.
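To make the weighting scheme of (7)-(11) concrete, the following Python sketch mimics the weight step of the alternating optimization on toy scalar residuals. It is only an illustration under simplifying assumptions (scalar summed squared residuals, the BA/state step stubbed out), not the DynaVINS implementation; the closed form comes from minimizing the per-feature loss of (10) in w.

    import numpy as np

    def weight_update(r_sq, w_prev, n_track, lam_w=1.0, lam_m=0.1):
        # Closed-form minimizer of  w^2*r_sq + lam_w*(1-w)^2 + lam_m*(n*(w_prev-w))^2,
        # clipped to [0, 1]; r_sq is the summed squared reprojection error.
        num = lam_w + lam_m * n_track**2 * w_prev
        den = r_sq + lam_w + lam_m * n_track**2
        return float(np.clip(num / den, 0.0, 1.0))

    def robust_ba_weights(features, lam_w=1.0, lam_m=0.1, iters=5):
        # Alternate between (i) the closed-form weight step and (ii) a state step.
        # Each feature is a dict with keys: r_sq, w_prev (weight from the previous
        # window), n (number of times it has been tracked). The state step would
        # re-linearize the weighted BA problem and refresh r_sq; it is omitted here.
        weights = [f["w_prev"] for f in features]
        for _ in range(iters):
            weights = [weight_update(f["r_sq"], f["w_prev"], f["n"], lam_w, lam_m)
                       for f in features]
            # (state step omitted)
        return weights

    # Example: an inconsistent (outlier-like) feature is down-weighted, while a
    # consistently tracked feature keeps a weight close to 1.
    feats = [{"r_sq": 25.0, "w_prev": 0.2, "n": 4},
             {"r_sq": 0.04, "w_prev": 0.9, "n": 6}]
    print(robust_ba_weights(feats))   # ~[0.05, 0.91]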
| 
 | ||||
IV. SELECTIVE GLOBAL OPTIMIZATION

In the VIO framework, the drift is inevitably cumulative along the trajectory because the optimization is performed only within the moving window. Hence, a loop closure detection, e.g. using DBoW2 [24], is necessary to optimize all trajectories.

In a typical visual SLAM, all loop closures are exploited even if some of them are from temporarily static objects. Those false positive loop closures may lead to the failure of the SLAM framework. Moreover, features from the temporarily static objects and from the static objects may exist in the same keyframe. Therefore, in this section, we propose a method to eliminate the false positive loop closures while maintaining the true positive loop closures.

A. Keyframe Grouping

Unlike conventional methods that treat loop closures individually, in this study, loop closures from the same features are grouped, even if they are from different keyframes. As a result, only one weight per group is used, allowing for effective optimization.

As shown in Fig. 5(a), before grouping the loop closures, adjacent keyframes that share at least a minimum number of tracked features have to be grouped. The group starting from the i-th camera frame Ci is defined as follows:

      Group(Ci) = { Ck | |Fik| ≥ α, k ≥ i },                                           (12)

where α represents a minimum number of tracked features, and Fik represents the set of features tracked from Ci to Ck. For simplicity, Group(Ci) will be denoted as Gi hereinafter.

Fig. 5. The procedure of the multiple hypotheses clustering. (a) Keyframes that share the minimum number of the tracked features are grouped. (b) There are two types of features used for matchings: static and temporarily static features. k,mTWi, the estimated pose of Ci, can be estimated using the matching result Tmk and the local relative pose Tki. An accurate keyframe pose can be estimated if static features are used for the matching. (c) The temporarily static feature has moved from its previous position. However, the matching result is based on the previous position of the feature. Thus, the estimated keyframe pose will be inaccurate. Finally, the feature matching results with similar TWi are clustered based on the Euclidean distance.

B. Multiple Hypotheses Clustering

After keyframes are grouped as in the previous subsection, DBoW2 is employed to identify the similar keyframe Cm for each keyframe Ck in the current group Gi starting from Ci (Ck ∈ Gi and m < i). Note that Ck is skipped if there is no similar keyframe. After identifying up to three different m for each k, a feature matching is conducted between Ck and these keyframes, and the relative pose Tmk can be obtained. Using Tmk, the estimated pose of Ck in the world frame, mTWk, can be obtained as follows:

      mTWk = Tmk · TWm,                                                                (13)

where TWm represents the pose of Cm in the world frame.

However, it is difficult to directly compute the similarity between the loop closures from different keyframes in the current group. Assuming that the relative pose Tki between Ck and Ci is sufficiently accurate, the estimated pose of Ci in the world frame can be expressed as follows:

      k,mTWi = Tki · mTWk.                                                             (14)

If the features used for the matchings are from the same object, the estimated TWi of the matchings will be located close to each other, even if the Ck and Cm of the matchings are different. Hence, after calculating the Euclidean distances between the loop closures' estimated TWi, the similar loop closures with small Euclidean distance can be clustered as shown in Fig. 5(c).

Depending on which loop closure cluster is selected, the trajectory resulting from the graph optimization varies. Therefore, each cluster can be called a hypothesis. To reduce the computational cost, the top-two hypotheses are adopted by comparing the cardinality of the loop closures within each hypothesis. These two hypotheses of the current group Gi are denoted as Hi0 and Hi1.

However, it is not yet possible to distinguish between true and false positive hypotheses. Hence, the method for determining the true positive hypothesis among the candidate hypotheses will be described in the next section.
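A minimal sketch of the grouping and clustering steps in (12)-(14) is given below. It assumes simplified data structures (features as integer ids, poses as 4x4 NumPy matrices) and a greedy Euclidean clustering on the translation part only, so it should be read as an illustration of the described procedure rather than the authors' implementation; alpha and dist_thresh are hypothetical parameters.

    import numpy as np

    def group_keyframes(tracked_ids, i, alpha=30):
        # Eq. (12): keyframes C_k (k >= i) that still share at least `alpha`
        # continuously tracked features with C_i. tracked_ids[k] is the set of
        # feature ids visible in keyframe k.
        group = []
        for k in range(i, len(tracked_ids)):
            if len(tracked_ids[i] & tracked_ids[k]) < alpha:
                break
            group.append(k)
        return group

    def cluster_hypotheses(estimated_TWi, dist_thresh=0.5):
        # Cluster loop-closure candidates whose estimated pose of C_i, obtained
        # as in (13)-(14), lands close together (Euclidean distance on translation).
        clusters = []
        for idx, T in estimated_TWi.items():
            t = T[:3, 3]
            for cl in clusters:
                if np.linalg.norm(t - cl["center"]) < dist_thresh:
                    cl["members"].append(idx)
                    cl["center"] = np.mean(
                        [estimated_TWi[m][:3, 3] for m in cl["members"]], axis=0)
                    break
            else:
                clusters.append({"center": t, "members": [idx]})
        # Keep the top-two clusters by cardinality as the hypotheses H_i0, H_i1.
        clusters.sort(key=lambda c: len(c["members"]), reverse=True)
        return clusters[:2]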
| 
 | ||||
| 
 | ||||
C. Selective Optimization for Constraint Groups

Most of the recent visual SLAM algorithms use a graph optimization. Let C, T, L, and W denote the sets of keyframes, poses, loop closures, and all weights, respectively. Then the graph optimization can be denoted as:

      min_T { Σ_{i∈C} ‖r(Tii+1, T)‖²_PTii+1  (local edge)
              + Σ_{(j,k)∈L} ρH( ‖r(Tkj, T)‖²_PL )  (loop closure edge) },              (15)

where Tii+1 represents the local pose between two adjacent keyframes Ci and Ci+1; Tkj is the relative pose between Cj and Ck from the loop closure; PTii+1 and PL denote the covariance of the local pose and of the loop closure, respectively.

For the two hypotheses of group Gi, the weights are denoted as wi0 and wi1, the sum of the weights as wi, and the set of hypotheses as H. Using a similar procedure as in Section III-C, the Black-Rangarajan duality is applied to (15) as follows:

      min_{T,W} { Σ_{i∈C} ‖r(Tii+1, T)‖²_PTii+1
                  + Σ_{Hi∈H} [ Σ_{(j,k)∈Hi0} (wi0 / |Hi0|) ‖r(Tkj, T)‖²_PL   (residual for hypothesis 0)
                               + Σ_{(j,k)∈Hi1} (wi1 / |Hi1|) ‖r(Tkj, T)‖²_PL (residual for hypothesis 1, optional)
                               + λl Φl²(wi) ] },                             (hypothesis regularization function)   (16)

where λl ∈ R+ is a constant parameter. The regularization factor for the loop closure, Φl, is defined as follows:

      Φl(wi) = 1 − wi = 1 − (wi0 + wi1),                                               (17)

where wi0, wi1 ∈ [0, 1]. To ensure that the weights are not affected by the number of loop closures in the hypothesis, the weights are divided by the cardinality of each hypothesis.

Then, (16) is optimized in the same manner as (11). Accordingly, only the hypothesis with a high weight is adopted in the optimization. In addition, all weights can be close to 0 when all hypotheses are false positives due to multiple temporarily static objects. Hence, the failure caused by false positive hypotheses can be prevented.

Because keyframe poses are changed after the optimization, the hypothesis clustering in Section IV-B is conducted again for all groups for the next optimization.
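The selective behavior of (16)-(17) can be illustrated with a small numerical sketch: given the summed squared loop-closure residuals of the two hypotheses at the current trajectory estimate, search for the weights that minimize a (16)-style cost. The brute-force grid search and the parameter values are illustrative assumptions, not part of the paper.

    import numpy as np

    def hypothesis_weights(S0, n0, S1, n1, lam_l=1.0, grid=101):
        # Minimize (w0/n0)*S0 + (w1/n1)*S1 + lam_l*(1 - (w0 + w1))**2 over a grid
        # of (w0, w1) in [0, 1]^2; S0, S1 are summed squared residuals of the two
        # hypotheses, n0, n1 their cardinalities (cf. the division by |Hi0|, |Hi1|).
        ws = np.linspace(0.0, 1.0, grid)
        W0, W1 = np.meshgrid(ws, ws, indexing="ij")
        cost = (W0 / n0) * S0 + (W1 / n1) * S1 + lam_l * (1.0 - (W0 + W1)) ** 2
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        return ws[i], ws[j]

    # A geometrically consistent hypothesis keeps a weight near 1, while an
    # inconsistent (false-positive) hypothesis is driven toward 0.
    print(hypothesis_weights(S0=0.2, n0=8, S1=40.0, n1=5))   # ~(0.99, 0.0)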
V. EXPERIMENTAL RESULTS

To evaluate the proposed algorithm, we compare ours with SOTA algorithms, namely, VINS-Fusion [2], ORB-SLAM3 [4], and DynaSLAM [7]. Each algorithm is tested in a mono-inertial (-M-I) and a stereo-inertial (-S-I) mode. Note that an IMU is not used in DynaSLAM, so it is only tested in a stereo (-S) mode and compared with the -S-I mode of the other algorithms. This could be somewhat unfair, but the comparison is conducted to stress the necessity of an IMU when dealing with dynamic environments.

A. Dataset

VIODE Dataset: The VIODE dataset [13] is a simulated dataset that contains many moving objects, such as cars or trucks, compared with conventional datasets. In addition, the dataset includes overall occlusion situations, where most parts of the image are occluded by dominant dynamic objects, as shown in Fig. 1. Note that the sub-sequence names none to high indicate how many dynamic objects exist in the scene.

Our Dataset: Unfortunately, the VIODE dataset does not contain harsh loop closing situations caused by temporarily static objects. Accordingly, we obtained our own dataset with four sequences to evaluate our global optimization. First, the Static sequence validates the dataset. In the Dynamic follow sequence, a dominant dynamic object moves in front of the camera. Next, in the Temporal static sequence, the same object is observed from multiple locations. In other words, the object is static while being observed, and then it moves to a different position. Finally, in the E-shape sequence, the camera moves along the shape of the letter E. The checkerboard is moved while not being observed, thus it will be observed at the three end-vertices of the E-shaped trajectory in the camera perspective, which triggers false-positive loop closures. Note that the feature-rich checkerboard is used in the experiment to address the effect of false loop closures.

B. Error Metrics

The accuracy of the estimated trajectory from each algorithm is measured by the Absolute Trajectory Error (ATE) [25], which directly measures the difference between points of the ground truth and the aligned estimated trajectory. In addition, for the VIODE dataset, the degradation rate [13], rd = ATEhigh/ATEnone, is calculated to determine the robustness of the algorithm.
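Both metrics are straightforward to reproduce; the sketch below computes the RMSE of the ATE after a standard rigid (Kabsch/Umeyama, no scale) alignment, plus the degradation rate rd. The choice of alignment is an assumption consistent with [25], not a detail specified in this paper.

    import numpy as np

    def ate_rmse(gt_xyz, est_xyz):
        # RMSE of the Absolute Trajectory Error after aligning the estimated
        # positions (N,3) to the ground truth (N,3) with a rigid transform.
        mu_gt, mu_est = gt_xyz.mean(0), est_xyz.mean(0)
        H = (est_xyz - mu_est).T @ (gt_xyz - mu_gt)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = mu_gt - R @ mu_est
        aligned = est_xyz @ R.T + t
        return float(np.sqrt(np.mean(np.sum((gt_xyz - aligned) ** 2, axis=1))))

    def degradation_rate(ate_high, ate_none):
        # r_d = ATE_high / ATE_none, as used for the VIODE dataset.
        return ate_high / ate_none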
| 
 | ||||
| 
 | ||||
C. Evaluation on the VIODE Dataset

First, the effects of the proposed factors on BA time cost and accuracy are analyzed, as shown in Table I. Ours with only the regularization factor already achieves a better result than VINS-Fusion; with the momentum factor added, it not only outperforms VINS-Fusion but also takes less time, owing to the previous weight information. Moreover, although the BA time of ours is increased by the additional optimizations, it is sufficient for high-level control of robots.

TABLE I: ABLATION EXPERIMENT

As shown in Table II and Fig. 6, the SOTA methods show precise pose estimation results in static environments. However, they struggle with the effect of dominant dynamic objects. In particular, even though DynaSLAM employs a semantic segmentation module, it tends to diverge or shows a large ATE compared with the other methods as the number of dynamic objects increases (from none to high). This performance degradation is due to the overall occlusion situations, leading to the failure of the semantic segmentation module and the absence of features from static objects.

TABLE II: COMPARISON WITH STATE-OF-THE-ART METHODS (RMSE OF ATE IN [M])

Fig. 6. ATE results of state-of-the-art algorithms and ours on the city_day sequences of the VIODE dataset [13]. Note that the y-axis is expressed in logarithmic scale. Our algorithm shows promising performance with less performance degeneration compared with the other state-of-the-art methods.

Similarly, although ORB-SLAM3 tries to reject the frames with inaccurate features, it diverges when dominant dynamic objects exist in the parking_lot mid, high and city_day high sequences. However, especially in the parking_lot low sequence, there is only one vehicle; it is far from the camera and occludes only an unnecessary background environment. As a consequence, ORB-SLAM3-S-I outperforms the other algorithms there.

VINS-Fusion is less hindered by the dynamic objects because it tries to remove the features with an incorrectly estimated depth (negative or far) after the BA. However, those features have already affected the BA before they are removed. As a result, as the number of features from dynamic objects increases, the trajectory error of VINS-Fusion gets higher.

In contrast, our proposed method shows promising performance in both mono-inertial and stereo-inertial modes. For example, in the parking_lot high sequence, as shown in Fig. 7(a)-(b), ours performs stable pose estimation even when the other algorithms are influenced by dynamic objects. Moreover, even though the number of dynamic objects increases, the performance degradation remains small compared to the other methods in all scenes. This confirms that our method overcomes the problems caused by dynamic objects owing to our robust BA method, which is also supported by Table III. In other words, our proposed method successfully rejects all dynamic features by adjusting the weights in an adaptive way. Also, our method is even robust against the overall occlusion situations, as shown in Fig. 1(b).

Fig. 7. Results of the state-of-the-art algorithms and ours on the parking_lot high sequence of the VIODE dataset [13]. (a) Trajectory of each algorithm in the 3D feature map, which is the result of our proposed algorithm. Features with low weight are depicted in red. (b) Enlarged view of (a). All other algorithms except ours lost track or had noisy trajectories while observing dynamic objects; as shown in (c), the feature weighting result of our algorithm, features from dynamic objects (red crosses) have low weight while robust features (green circles) have high weight.

TABLE III: COMPARISON OF DEGRADATION RATE rd

Interestingly, our proposed robust BA method also enables robustness against changes in illuminance by rejecting inconsistent features (e.g., the low-weight features in the dark area of Fig. 7(c)). Accordingly, our method shows remarkable performance compared with the SOTA methods in the city_night scenes, where not only do dynamic objects exist, but there is also a lack of illuminance. Note that the -M-I mode of ours has a better result than -S-I. This is because the stereo reprojection, rPstereo, can be inaccurate in low-light conditions.
| 
| 
 | ||||
Fig. 8. Results of the algorithms on the E-shape sequence. (a) Trajectory results. The other algorithms are inaccurate due to false positive loop closures. (b) A loop closure rejection result of our algorithm. Constraints with low weight (red lines) do not contribute to the optimized trajectory.

D. Evaluation on Our Dataset

In the Static case, all algorithms have low ATE values. This sequence validates that our dataset is correctly obtained.

However, in Dynamic follow, the other algorithms tried to track the occluding object. Hence, not only failures of the BA but also false-positive loop closures are triggered. Consequently, the other algorithms except ours have higher ATEs.

Furthermore, in Temporal static, ORB-SLAM3 and VINS-Fusion can eliminate the false-positive loop closure in the stereo-inertial case. However, in the mono-inertial case, due to an inaccurate depth estimation, they cannot reject the false-positive loop closures. Additionally, VINS-Fusion with Switchable Constraints [15] can also reject the false-positive loop closures, but ours shows a better performance, as shown in Table II.

Finally, in the E-shape case, the other algorithms fail to optimize the trajectory, as illustrated in Fig. 8(a), owing to the false-positive loop closures. Also, VINS-Fusion with Switchable Constraints cannot reject the false-positive loop closures that are continuously generated. However, ours optimizes the weight of each hypothesis, not of individual loop closures. Hence, false-positive loop closures are rejected in the optimization irrespective of their number, as illustrated in Fig. 8(b). Ours does not use any object-wise information from the image; hence the features from the same object can be divided into different hypotheses, as depicted in Fig. 1(c).

VI. CONCLUSION

In this study, DynaVINS has been proposed, which is a robust visual-inertial SLAM framework based on the robust BA and the selective global optimization in dynamic environments. The experimental evidence corroborated that our algorithm works better than other algorithms in simulations and in actual environments with various dynamic objects. In future works, we plan to improve the speed and the performance. Moreover, we will adapt the concept of DynaVINS to the LiDAR-Visual-Inertial (LVI) SLAM framework.

REFERENCES

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147-1163, Oct. 2015.
[2] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004-1020, Aug. 2018.
[3] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proc. IEEE Int. Conf. Robot. Automat., 2007, pp. 3565-3572.
[4] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874-1890, Dec. 2021.
[5] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," Int. J. Robot. Res., vol. 34, no. 3, pp. 314-334, 2015.
[6] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 298-304.
[7] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4076-4083, Oct. 2018.
[8] Y. Fan, H. Han, Y. Tang, and T. Zhi, "Dynamic objects elimination in SLAM based on image fusion," Pattern Recognit. Lett., vol. 127, pp. 191-201, 2019.
[9] B. Canovas, M. Rombaut, A. Nègre, D. Pellerin, and S. Olympieff, "Speed and memory efficient dense RGB-D SLAM in dynamic scenes," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 4996-5001.
[10] R. Long, C. Rauch, T. Zhang, V. Ivan, and S. Vijayakumar, "RigidFusion: Robot localisation and mapping in environments with large dynamic rigid objects," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 3703-3710, Apr. 2021.
[11] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled multi-object tracking and SLAM," IEEE Robot. Automat. Lett., vol. 6, no. 3, pp. 5191-5198, Jul. 2021.
[12] K. Qiu, T. Qin, W. Gao, and S. Shen, "Tracking 3-D motion of dynamic objects using monocular visual-inertial sensing," IEEE Trans. Robot., vol. 35, no. 4, pp. 799-816, Aug. 2019.
[13] K. Minoda, F. Schilling, V. Wüest, D. Floreano, and T. Yairi, "VIODE: A simulated dataset to address the challenges of visual-inertial odometry in dynamic environments," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1343-1350, Apr. 2021.
[14] E. Olson and P. Agarwal, "Inference on networks of mixtures for robust robot mapping," Int. J. Robot. Res., vol. 32, no. 7, pp. 826-840, 2013.
[15] N. Sünderhauf and P. Protzel, "Switchable constraints for robust pose graph SLAM," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 1879-1884.
[16] H. Yang, P. Antonante, V. Tzoumas, and L. Carlone, "Graduated non-convexity for robust spatial perception: From non-minimal solvers to global outlier rejection," IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 1127-1134, Apr. 2020.
[17] Q.-Y. Zhou, J. Park, and V. Koltun, "Fast global registration," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782.
[18] S. Song, H. Lim, S. Jung, and H. Myung, "G2P-SLAM: Generalized RGB-D SLAM framework for mobile robots in low-dynamic environments," IEEE Access, vol. 10, pp. 21370-21383, 2022.
[19] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1-16, 2019.
[20] M. J. Black and A. Rangarajan, "On the unification of line processes, outlier rejection, and robust statistics with applications in early vision," Int. J. Comput. Vis., vol. 19, no. 1, pp. 57-91, 1996.
[21] P. J. Huber, "Robust estimation of a location parameter," in Breakthroughs in Statistics, 1992, pp. 492-518.
[22] P. Babin, P. Giguère, and F. Pomerleau, "Analysis of robust functions for registration algorithms," in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 1451-1457.
[23] S. Geman, D. E. McClure, and D. Geman, "A nonlinear filter for film restoration and other problems in image processing," CVGIP: Graph. Models Image Process., vol. 54, no. 4, pp. 281-289, 1992.
[24] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robot., vol. 28, no. 5, pp. 1188-1197, Oct. 2012.
[25] Z. Zhang and D. Scaramuzza, "A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 7244-7251.
| 
 | ||||
|  | ||||
| @ -0,0 +1,476 @@ | ||||
2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
May 29 - June 2, 2023. London, UK

DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in Dynamic Environments

Shihao Shen, Yilin Cai, Wenshan Wang, Sebastian Scherer

Code is available at https://github.com/Geniussh/DytanVO
S. Shen, Y. Cai, W. Wang, and S. Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. {shihaosh, yilincai, wenshanw, basti}@andrew.cmu.edu

Abstract— Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is largely used to discard dynamic associations before estimating camera motions, but at the cost of discarding static features, and it is hard to scale up to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first supervised learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real time and predicts camera ego-motion in an iterative fashion. Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems which optimize the trajectory on the backend. Experiments on plentiful unseen environments also demonstrate our method's generalizability.

Fig. 1: An overview of DytanVO. (a) Input frames at times t0 and t1. (b) Optical flow output from the matching network. (c) Motion segmentation output after iterations. (d) Trajectory estimation on sequence RoadCrossing VI from the AirDOS-Shibuya dataset, which is a highly dynamic environment cluttered with humans. Ours is the only learning-based VO that keeps track.

I. INTRODUCTION

Visual odometry (VO), one of the most essential components for pose estimation in the visual Simultaneous Localization and Mapping (SLAM) system, has attracted significant interest in robotic applications over the past few years [1]. A lot of research has been conducted to develop accurate and robust monocular VO systems using geometry-based methods [2], [3]. However, these require significant engineering effort for each module to be carefully designed and finetuned [4], which makes them difficult to deploy readily in the open world with complex environmental dynamics, changes of illumination or inevitable sensor noise.

On the other hand, recent learning-based methods [4]-[7] are able to outperform geometry-based methods in more challenging environments, such as large motion, fog or rain effects and lack of features. However, they will easily fail in dynamic environments if they do not take into consideration independently moving objects that cause unpredictable changes in illumination or occlusions. To this end, recent works utilize abundant unlabeled data and adopt either self-supervised learning [8], [9] or unsupervised learning [10], [11] to handle dynamic scenes. Although they achieve outstanding performance on particular tasks, such as autonomous driving, they produce worse results if applied to very different data distributions, such as micro air vehicles (MAV) that operate with aggressive and frequent rotations that cars do not have. Learning without supervision is hindered from generalizing due to biased data with simple motion patterns. Therefore, we approach the dynamic VO problem as supervised learning so that the model can map inputs to complex ego-motion ground truth and be more generalizable.

To identify dynamic objects, object detection or semantic segmentation techniques are largely relied on to mask all movable objects, such as pedestrians and vehicles [12]-[15]. Their associated features are discarded before applying geometry-based methods. However, there are two issues with utilizing semantic information in dynamic VO. First, class-specific detectors for semantic segmentation heavily depend on appearance cues, but not every object that can move is present in the training categories, leading to false negatives. Second, even if all moving objects in a scene are within the categories, algorithms cannot distinguish between "actually moving" and "static but able to move". In dynamic VO, where static features are crucial to robust ego-motion estimation, one should segment objects based on pure motion (motion segmentation) rather than heuristic appearance cues.

Motion segmentation utilizes relative motion between consecutive frames to remove the effect of camera movement from the 2D motion fields and calculates residual optical flow to account for moving regions. But paradoxically, ego-motion cannot be correctly estimated in dynamic scenes without a robust segmentation. There exists such a mutual dependence between motion segmentation and ego-motion estimation that has never been explored in supervised learning methods.
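The residual-flow idea described above can be sketched directly: warp each pixel with the camera ego-motion using depth and intrinsics, subtract the resulting rigid flow from the estimated optical flow, and flag pixels with a large residual as moving. The snippet below is a geometric illustration under the assumption that depth and ego-motion are available; DytanVO itself learns the segmentation jointly rather than thresholding.

    import numpy as np

    def residual_flow_mask(flow, depth, K, R, t, thresh=3.0):
        # flow: (H,W,2) optical flow from frame t to t+1; depth: (H,W) of frame t;
        # K: 3x3 intrinsics; (R, t): camera motion from frame t to t+1.
        H_, W_ = depth.shape
        u, v = np.meshgrid(np.arange(W_), np.arange(H_))
        pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
        pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project (frame t)
        pts2 = R @ pts + t.reshape(3, 1)                      # static-scene prediction
        proj = K @ pts2
        proj = proj[:2] / np.clip(proj[2:], 1e-6, None)       # re-project to frame t+1
        ego_flow = (proj - pix[:2]).T.reshape(H_, W_, 2)      # flow induced by ego-motion
        residual = np.linalg.norm(flow - ego_flow, axis=-1)   # residual optical flow
        return residual > thresh                              # True where independently moving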
| 
 | ||||
Therefore, motivated by jointly refining the VO and motion segmentation, we propose our learning-based dynamic VO (DytanVO). To the best of our knowledge, our work is the first supervised learning-based VO for dynamic environments. The main contributions of this paper are threefold:

• A novel learning-based VO is introduced to leverage the interdependence among camera ego-motion, optical flow and motion segmentation.
• We introduce an iterative framework where both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications (a structural sketch of such a loop follows this list).
• Among learning-based VO solutions, our method achieves state-of-the-art performance in real-world dynamic scenes without finetuning. Furthermore, our method even performs comparably with visual SLAM solutions that optimize trajectories on the backend.
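As a structural illustration only, the iterative scheme named in the second contribution could look like the loop below; the network interfaces (matching_net, seg_net, pose_net) and the fixed iteration count are assumptions made for the sketch and do not reflect implementation details of DytanVO.

    def joint_refinement(img_t, img_t1, matching_net, seg_net, pose_net, iters=3):
        # matching_net: (I_t, I_t1) -> dense optical flow
        # pose_net:     (flow, mask) -> camera ego-motion, ignoring masked pixels
        # seg_net:      (flow, ego-motion) -> motion-segmentation mask
        flow = matching_net(img_t, img_t1)
        mask = None                       # first pose estimate uses all pixels
        for _ in range(iters):
            motion = pose_net(flow, mask)
            mask = seg_net(flow, motion)  # residual-flow-based segmentation
        return motion, mask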
| 
 | ||||
II. RELATED WORK

Learning-based VO solutions aim to avoid the hard-coded modules that require significant engineering effort for design and finetuning in classic pipelines [1], [16]. For example, Valada [17] applies auxiliary learning to leverage relative pose information to constrain the search space and produce consistent motion estimation. Another class of learning-based methods relies on dense optical flow to estimate pose, as it provides more robust and redundant modalities for feature association in VO [5], [18], [19]. However, their frameworks are built on the assumption of photometric consistency, which only holds in a static environment without independently moving objects. They easily fail when dynamic objects unpredictably cause occlusions or illumination changes.

Semantic information is largely used by earlier works in VO or visual SLAM to handle dynamic objects in the scene, obtained by either a feature-based method or a learning-based method. Feature-based methods utilize hand-designed features to recognize semantic entities [20]. An exemplary system proposed by [21] computes SIFT descriptors from monocular image sequences in order to recognize semantic objects. On the other hand, data-driven CNN-based semantic methods have been widely used to improve the performance, such as DS-SLAM [22] and SemanticFusion [23]. A few works on semantic VO/SLAM have fused the semantic information from recognition modules to enhance motion estimation and vice versa [24], [25]. However, all these methods are limited to a fixed set of semantic categories, which leads to false negatives when scaling to unusual real-

on geometric constraints arising from epipolar geometry and rigid transformations, which are vulnerable to motion ambiguities, such as objects moving in a direction colinear with the camera motion being indistinguishable from the background given only ego-motion and optical flow. On the other hand, MaskVO [8] and SimVODIS++ [9] approach the problem by learning to mask dynamic feature points in a self-supervised manner. CC [11] couples motion segmentation, flow, depth and camera motion models, which are jointly solved in an unsupervised way. Nevertheless, these self-supervised or unsupervised methods are trained on self-driving vehicle data dominated by pure translational motions with little rotation, which makes it difficult for them to generalize to completely different data distributions such as handheld cameras or drones. Our work introduces a framework that jointly refines camera ego-motion and motion segmentation in an iterative way that is robust against motion ambiguities and generalizes to the open world.

III. METHODOLOGY

A. Datasets

Built on TartanVO [5], our method retains its generalization capability while handling dynamic environments in multiple types of scenes, such as car, MAV, indoor and outdoor. Besides taking camera intrinsics as an extra layer into the network to adapt to various camera settings, as explored in [5], we train our model on large amounts of synthetic data with broad diversity, which has been shown capable of facilitating easy adaptation to the real world [27]-[29].

Our model is trained on both TartanAir [27] and SceneFlow [30]. The former contains more than 400,000 data frames with ground truth of optical flow and camera pose in static environments only. The latter provides 39,000 frames in highly dynamic environments, with each trajectory having backward/forward passes, different objects and motion characteristics. Although SceneFlow does not provide ground truth of motion segmentation, we are able to recover it by making use of its ground truth of disparity, optical flow and disparity change maps.

B. Architecture

Our network architecture is illustrated in Fig. 2 and is based on TartanVO. Our method takes in two consecutive undistorted images It, It+1 and outputs the relative camera motion δt→t+1 = (R|T), where T ∈ R³ is the 3D translation and R ∈ SO(3) is the 3D rotation. Our framework consists of three sub-modules: a matching network, a motion segmentation network, and a pose network. We estimate dense optical flow Ft→t+1 with a matching network, Mθ(It, It+1), from two consecutive images. The network is built based
| world applications such as offroad driving or MAV, and         on PWC-Net [31]. The motion segmentation network Uγ, | ||||
| requires continuous efforts in ground-truth labeling.          based on a lightweight U-Net [32], takes in the relative | ||||
|                                                                camera motion output, R|T, optical flow from Mθ, and the | ||||
|    Instead of utilizing appearance cues for segmentation,      original input frames. It outputs a probability map, ztt+1, | ||||
| efforts are made to segment based on geometry cues. Flow-      of every pixel belonging to a dynamic object or not, which | ||||
| Fusion [26] iteratively refines its ego-motion estimation by   is thresholded and turned into a binary segmentation mask, | ||||
| computing residual optical flow. GeoNet [10] divides its       Stt+1. The optical flow is then stacked with the mask and | ||||
| system into two sub-tasks by separately predicting static | ||||
| scene structure and dynamic motions. However, both depend | ||||
| 
 | ||||
| Fig. 2: Overview of our three-stage network architecture. It consists of a matching network which estimates optical flow from two | ||||
| consecutive images, a pose network that estimates pose based on optical flow without dynamic movements, and a motion segmentation | ||||
| network that outputs a probability mask of the dynamicness. The matching network is forwarded only once while the pose network and | ||||
| the segmentation network are iterated to jointly refine pose estimate and motion segmentation. In the first iteration, we randomly initialize | ||||
| the segmentation mask. In each iteration, optical flow is set to zero inside masked regions. | ||||
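To make the data flow of Fig. 2 concrete, the following PyTorch-style sketch composes the three sub-modules for a single pass. The module interfaces, tensor shapes and names are illustrative placeholders based on the description above, not the released implementation.

import torch

def dytanvo_single_pass(matching_net, seg_net, pose_net,
                        img_t, img_t1, intrinsics_layer, seg_mask):
    """One pass through the three sub-modules (illustrative sketch).

    img_t, img_t1    : (B, 3, H, W) consecutive undistorted frames
    intrinsics_layer : (B, 2, H/4, W/4) per-pixel intrinsics encoding K_C
    seg_mask         : (B, 1, H/4, W/4) binary mask, 1 = dynamic pixel
    """
    # Matching network (PWC-Net based): dense optical flow at 1/4 resolution.
    flow = matching_net(img_t, img_t1)                  # (B, 2, H/4, W/4)

    # Zero out the flow inside masked (dynamic) regions.
    flow_masked = flow * (1.0 - seg_mask)

    # 5-channel stack: masked flow (2) + mask (1) + intrinsics layer (2).
    pose_input = torch.cat([flow_masked, seg_mask, intrinsics_layer], dim=1)

    # Pose network (ResNet50 backbone, two heads): up-to-scale rotation/translation.
    rot, trans = pose_net(pose_input)                   # (B, 3), (B, 3)

    # Segmentation network (lightweight U-Net): per-pixel probability of being
    # dynamic, conditioned on flow, the input frames and the motion estimate.
    prob = seg_net(flow, img_t, img_t1, rot, trans)     # (B, 1, H/4, W/4)
    return flow, rot, trans, prob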
| 
 | ||||
C. Motion segmentation

   Earlier dynamic VO methods that use motion segmentation rely on purely geometric constraints arising from epipolar geometry and rigid transformations [12], [26], so that they can threshold a residual optical flow which is designed to account for moving regions. However, they are prone to catastrophic failures under two cases: (1) points in 3D moving along epipolar lines cannot be identified from the background given only monocular cues; (2) pure geometry methods leave no tolerance to noisy optical flow and less accurate camera motion estimations, which in our framework is very likely to happen in the first few iterations. Therefore, following [34], to deal with the ambiguities above, we explicitly model cost maps as inputs into the segmentation network after upgrading the 2D optical flow to 3D through optical expansion [35], which estimates the relative depth based on the scale change of overlapping image patches. The cost maps are tailored to coplanar and colinear motion ambiguities that cause segmentation failures in geometry-based motion segmentation. More details can be found in [34].
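For intuition about the geometric cue that such geometry-only methods threshold, the sketch below computes a classical per-point epipolar (Sampson) residual from flow correspondences. It only illustrates why colinear motion is ambiguous (points moving along epipolar lines keep a small residual); it is not the cost-map construction of [34], and the fundamental matrix F is assumed to come from the current ego-motion estimate.

import numpy as np

def sampson_residual(pts_t, flow, F):
    """Per-point epipolar residual (Sampson distance).

    pts_t : (N, 2) pixel coordinates in frame t
    flow  : (N, 2) optical flow from t to t+1
    F     : (3, 3) fundamental matrix consistent with the camera ego-motion

    Independently moving points usually violate the epipolar constraint and
    receive a large residual, except when they move along the epipolar line
    (the colinear/coplanar ambiguity), where the residual stays small.
    """
    ones = np.ones((len(pts_t), 1))
    x1 = np.hstack([pts_t, ones])               # homogeneous points in frame t
    x2 = np.hstack([pts_t + flow, ones])        # matched points in frame t+1
    Fx1 = x1 @ F.T                              # rows are F @ x1
    Ftx2 = x2 @ F                               # rows are F^T @ x2
    num = np.sum(x2 * Fx1, axis=1) ** 2         # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / np.maximum(den, 1e-12)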
D. Iteratively refine camera motion

   We provide an overview of our iterative framework in Algorithm 1. During inference, the matching network is forwarded only once, while the pose network and the segmentation network are iterated to jointly refine ego-motion estimation and motion segmentation. In the first iteration, the segmentation mask is initialized randomly using [36]. The criterion to stop iterating is straightforward: the rotational and translational differences of R|T between two iterations must be smaller than prefixed thresholds ϵ. Instead of having a fixed constant to threshold probability maps into segmentation masks, we predetermine a decaying parameter that empirically reduces the threshold over time, in order to discourage inaccurate masks in earlier iterations while embracing refined masks in later ones.

Algorithm 1: Inference with Iterations
   Given two consecutive frames I_t, I_t+1 and intrinsics K
   Initialize iteration number: i ← 1
   Initialize difference in output camera motions: δR|T ← ∞
   F_t→t+1 ← OpticalFlow(I_t, I_t+1)
   while δR|T ≥ stopping criterion ϵ do
       if i = 1 then
           S_t→t+1 ← getCowmask(I_t)
       else
           z_t→t+1 ← MotionSegmentation(F_t→t+1, I_t, R|T)
           S_t→t+1 ← mask where z_t→t+1 ≥ z_threshold
       F̃_t→t+1 ← set F_t→t+1 = 0 where S_t→t+1 = 1
       R|T ← PoseNetwork(F̃_t→t+1, S_t→t+1, K)
       δR|T ← difference between R|T of this iteration and the previous one
       i ← i + 1

   Intuitively, during early iterations the estimated motion is less accurate, which leads to false positives in the segmentation output (assigning high probabilities to static areas). However, because the optical flow map still provides enough correspondences even when non-dynamic regions are cut out of it, P_ϕ is able to robustly leverage the segmentation mask S_t→t+1 concatenated with F̃_t→t+1 and output a reasonable camera motion. In later iterations, U_γ is expected to output increasingly precise probability maps such that static regions in the optical flow map are no longer "wasted", and hence P_ϕ can be improved accordingly.
| 
 | ||||
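Algorithm 1 can be rendered in Python roughly as follows. The cow-mask initializer, the threshold decay factor and the exact convergence test are illustrative assumptions rather than the released code; they follow the description of the decaying threshold and the R|T-difference stopping criterion given above.

import torch

def infer_with_iterations(matching_net, seg_net, pose_net, img_t, img_t1,
                          intrinsics_layer, make_cow_mask,
                          eps=1e-3, max_iters=3, z_thresh=0.7, decay=0.8):
    """Iteratively refine ego-motion and motion segmentation (sketch).

    make_cow_mask is a hypothetical helper returning a random, locally
    connected (B, 1, H/4, W/4) mask for the first iteration.
    """
    flow = matching_net(img_t, img_t1)            # matching net runs only once
    mask = make_cow_mask(img_t)
    prev = None
    for i in range(max_iters):
        if i > 0:
            # Re-segment using the motion estimated in the previous iteration.
            prob = seg_net(flow, img_t, img_t1, rot, trans)
            mask = (prob >= z_thresh).float()
            z_thresh *= decay                     # embrace finer masks later on
        flow_masked = flow * (1.0 - mask)         # zero flow in dynamic regions
        pose_input = torch.cat([flow_masked, mask, intrinsics_layer], dim=1)
        rot, trans = pose_net(pose_input)
        cur = torch.cat([rot, trans], dim=1)
        if prev is not None and (cur - prev).abs().max() < eps:
            break                                 # R|T changed by less than epsilon
        prev = cur
    return rot, trans, mask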
Fig. 3: Motion segmentation output at each iteration when testing on unseen data. (a) Running inference on the hardest sequence in AirDOS-Shibuya, with multiple people moving in different directions, with our segmentation network. (b) Inference on the sequence from FlyingThings3D where dynamic objects take up more than 60% of the area. The ground truth (GT) mask on Shibuya is generated by the segmentation network with GT ego-motion as input.

   In practice, we find that 3 iterations are more than enough to get both camera motion and segmentation refined. To clear up any ambiguity, a 1-iteration pass is composed of one M_θ forward pass and one P_ϕ forward pass with a random mask, while a 3-iteration pass consists of one M_θ forward pass, two U_γ forward passes and three P_ϕ forward passes. In Fig. 3 we illustrate how segmentation masks evolve over three iterations on unseen data. The mask at the first iteration contains a significant amount of false positives but quickly converges beyond the second iteration. This verifies our assumption that the pose network is robust against false positives in segmentation results.

E. Supervision

   We train our pose network to be robust against large areas of false positives. On training data without any dynamic object, we adopt the cow-mask [36] to create sufficiently random yet locally connected segmentation patterns, since a motion segmentation could occur at any size, shape and position in an image while exhibiting locally explainable structures corresponding to the types of moving objects. In addition, we apply curriculum learning to the pose network, where we gradually increase the maximum percentage of dynamic areas in SceneFlow from 15%, 20%, 30%, 50% to 100%. Since TartanAir only contains static scenes, we adjust the size of the cow-masks accordingly.
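A cow-mask-like pattern can be approximated by thresholding smoothed Gaussian noise; the sketch below is an illustrative approximation with assumed parameters (filter sigma, coverage), not the exact procedure of [36].

import numpy as np
from scipy.ndimage import gaussian_filter

def cow_mask(h, w, sigma=16.0, coverage=0.3, rng=None):
    """Random, locally connected binary mask covering roughly `coverage` of the image.

    Smooth white noise with a Gaussian filter and keep the top `coverage`
    fraction of values; the result has blob-like, locally connected structures.
    """
    rng = rng or np.random.default_rng()
    noise = gaussian_filter(rng.standard_normal((h, w)), sigma)
    thresh = np.quantile(noise, 1.0 - coverage)
    return (noise >= thresh).astype(np.float32)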
   We supervise our network on the camera motion loss L_P. Under the monocular setting, we only recover an up-to-scale camera motion. We follow [5] and normalize the translation vector before calculating the distance to ground truth. Given the ground truth motion R|T,

   L_P = || T̂ / max(||T̂||, ϵ) − T / max(||T||, ϵ) || + || R̂ − R ||        (1)

where ϵ = 1e-6 prevents numerical instability and ˆ· denotes estimated quantities.

   Our framework can also be trained in an end-to-end fashion, in which case the objective becomes an aggregated loss of the optical flow loss L_M, the camera motion loss L_P and the motion segmentation loss L_U, where L_M is the L1 norm between the predicted flow and the ground truth flow, whereas L_U is the binary cross entropy loss between the predicted probability and the segmentation label:

   L = λ₁ L_M + λ₂ L_U + L_P        (2)

   From preliminary empirical comparison, end-to-end training gives similar performance to training the pose network only, because we use λ₁ and λ₂ to regularize the objective such that the training is biased toward mainly improving the odometry rather than optimizing the other two tasks. This is ideal since the pose network is very tolerant of false positives in segmentation results (shown in III-D). In the following section, we show our results of supervising only on Eq. 1, with the motion segmentation network fixed.
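As a concrete rendering of Eqs. (1) and (2), the following PyTorch sketch computes the up-to-scale pose loss and the aggregated objective. The rotation representation (a 3-vector residual), the batch reduction and the λ values are assumptions made for illustration.

import torch
import torch.nn.functional as F

def pose_loss(t_hat, t_gt, r_hat, r_gt, eps=1e-6):
    """Up-to-scale camera motion loss of Eq. (1): normalize translations first."""
    t_hat_n = t_hat / torch.clamp(t_hat.norm(dim=-1, keepdim=True), min=eps)
    t_gt_n = t_gt / torch.clamp(t_gt.norm(dim=-1, keepdim=True), min=eps)
    return (t_hat_n - t_gt_n).norm(dim=-1).mean() + (r_hat - r_gt).norm(dim=-1).mean()

def total_loss(flow_hat, flow_gt, prob_hat, seg_gt,
               t_hat, t_gt, r_hat, r_gt, lam1=0.1, lam2=0.1):
    """Aggregated end-to-end objective of Eq. (2): L = λ1·L_M + λ2·L_U + L_P."""
    l_m = F.l1_loss(flow_hat, flow_gt)                  # optical flow loss L_M
    l_u = F.binary_cross_entropy(prob_hat, seg_gt)      # segmentation loss L_U
    return lam1 * l_m + lam2 * l_u + pose_loss(t_hat, t_gt, r_hat, r_gt)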
                        IV. EXPERIMENTAL RESULTS

A. Implementation details

   1) Network: We initialize the matching network M_θ with the pre-trained model from TartanVO [5], and fix the motion segmentation network U_γ with the pre-trained weights from Yang et al. [34]. The pose network P_ϕ uses ResNet50 [33] as the backbone, removes the batch normalization layers, and adds two output heads for rotation R and translation T. M_θ outputs optical flow at a size of H/4 × W/4. P_ϕ takes in a 5-channel input, i.e., F̃_t→t+1 ∈ R^(2×H/4×W/4), S_t→t+1 ∈ R^(H/4×W/4) and K_C ∈ R^(2×H/4×W/4). The concatenation of F̃_t→t+1 and K_C augments the optical flow input with 2D positional information, while concatenating F̃_t→t+1 with S_t→t+1 encourages the network to learn dynamic representations.

   2) Training: Our method is implemented in PyTorch [43] and trained on 2 NVIDIA A100 Tensor Core GPUs. We train the network in two stages on TartanAir, which includes only static scenes, and SceneFlow [30]. In the first stage, we train P_ϕ independently using ground truth optical flow, camera motion, and motion segmentation mask in a curriculum-learning fashion. We generate random cow-masks [36] on TartanAir as motion segmentation input. Each curriculum is initialized with weights from the previous curriculum and takes 100,000 iterations with a batch size of 256. In the second stage, P_ϕ and M_θ are jointly optimized for another 100,000 iterations with a batch size of 64. During curriculum learning, the learning rate starts at 2e-4, while the second stage uses a learning rate of 2e-5. Both stages apply a decay rate of 0.2 to the learning rate every 50,000 iterations. Random cropping and resizing (RCR) [5] as well as frame skipping are applied to both datasets.
| 
 | ||||
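The two-stage schedule above can be summarized in a short sketch. The optimizer choice (Adam) and the data-sampling helper are assumptions; the curriculum ratios, iteration counts, batch size and learning-rate decay follow the numbers quoted in the text.

import torch

DYNAMIC_AREA_CURRICULUM = [0.15, 0.20, 0.30, 0.50, 1.00]  # max dynamic-area ratio

def train_pose_stage_one(pose_net, sample_batch, loss_fn,
                         iters_per_step=100_000, batch_size=256, lr=2e-4):
    """First-stage curriculum training of the pose network (illustrative sketch).

    sample_batch(max_ratio, batch_size) is a hypothetical helper returning
    (pose_input, t_gt, r_gt) built from GT flow, GT motion and a cow-mask whose
    dynamic area does not exceed max_ratio; loss_fn is e.g. the Eq. (1) loss.
    """
    opt = torch.optim.Adam(pose_net.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50_000, gamma=0.2)
    for max_ratio in DYNAMIC_AREA_CURRICULUM:             # 15% -> 100%
        for _ in range(iters_per_step):
            pose_input, t_gt, r_gt = sample_batch(max_ratio, batch_size)
            r_hat, t_hat = pose_net(pose_input)
            loss = loss_fn(t_hat, t_gt, r_hat, r_gt)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()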
TABLE I: ATE (m) results on dynamic sequences from AirDOS-Shibuya. Our method gives outstanding performance among VO methods. DeepVO, TrianFlow and CC are trained on KITTI only and unable to generalize to complex motion patterns. All SLAM methods use bundle adjustment (BA) on multiple frames to optimize the trajectory, hence we only numerically compare ours with pure VO methods. The best and the second best VO performances are highlighted as bold and underlined. We use "-" to denote SLAM methods that fail to initialize.

                                     StandingHuman         RoadCrossing (Easy)          RoadCrossing (Hard)
                                     I         II          III       IV        V        VI        VII

   SLAM    DROID-SLAM [37]         0.0051    0.0073      0.0103    0.0120    0.2778    0.0253    0.5788
   method  AirDOS w/ mask [38]     0.0606    0.0193      0.0951    0.0331    0.0206    0.2230    0.5625
           ORB-SLAM w/ mask [39]   0.0788    0.0060      0.0657    0.0196    0.0148    1.0984    0.8476
           VDO-SLAM [40]           0.0994    0.6129      0.3813    0.3879    0.2175    0.2400    0.6628
           DynaSLAM [41]              -      0.8836      0.3907    0.4196    0.4925    0.6446    0.6539

   VO      DeepVO [4]              0.3956    0.6351      0.7788    0.3436    0.5434    0.7223    0.9633
   method  TrianFlow [42]          0.9743    1.3835      1.3348    1.6172    1.4769    1.7154    1.9075
           CC [11]                 0.4527    0.7714      0.5406    0.6345    0.5411    0.8558    1.0896
           TartanVO [5]            0.0600    0.1605      0.2762    0.1814    0.2174    0.3228    0.5009
           Ours                    0.0327    0.1017      0.0608    0.0516    0.0755    0.0365    0.0660
| 
 | ||||
   3) Runtime: Although our method iterates multiple times to refine both segmentation and camera motion, we find in practice that 3 iterations are more than enough due to the robustness of P_ϕ, as shown in Fig. 3. On an NVIDIA RTX 2080 GPU, inference takes 40 ms with 1 iteration, 100 ms with 2 iterations and 160 ms with 3 iterations.

   4) Evaluation: We use the Absolute Trajectory Error (ATE) to evaluate our algorithm against other state-of-the-art methods, including both VO and visual SLAM. We evaluate our method on the AirDOS-Shibuya dataset [38] and the KITTI Odometry dataset [44]. Additionally, in the supplemental material, we test our method on data collected in a cluttered intersection to demonstrate that our method can scale to real-world dynamic scenes competitively.
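ATE compares an estimated trajectory with the ground truth after aligning the two. A minimal sketch, assuming a similarity (Umeyama) alignment appropriate for up-to-scale monocular estimates followed by a translational RMSE, is given below for reference; it is not the exact evaluation script used here.

import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute Trajectory Error: similarity-align (N, 3) estimated positions to
    the ground truth, then return the RMSE of the translational differences."""
    mu_e, mu_g = est_xyz.mean(0), gt_xyz.mean(0)
    e, g = est_xyz - mu_e, gt_xyz - mu_g
    U, D, Vt = np.linalg.svd(g.T @ e / len(e))          # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:
        S[2, 2] = -1.0                                  # avoid reflections
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / e.var(axis=0).sum()
    t = mu_g - scale * R @ mu_e
    aligned = scale * (R @ est_xyz.T).T + t
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))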
B. Performance on AirDOS-Shibuya Dataset

   We first provide an ablation study of the number of iterations (iter) in Tab. III, using three sequences from AirDOS-Shibuya [38]. The quantitative results are consistent with Fig. 3, where the pose network quickly converges after the first iteration. We also compare the 3-iteration finetuned model after jointly optimizing P_ϕ and M_θ (second stage), which shows less improvement because the optical flow estimation on AirDOS-Shibuya already has high quality.

TABLE III: Experiments on the number of iterations, in ATE (m)

                 Standing I    RoadCrossing III    RoadCrossing VII
   1 iter          0.0649          0.1666              0.3157
   2 iter          0.0315          0.0974              0.0658
   3 iter          0.0327          0.0608              0.0660
   Finetuned       0.0384          0.0631              0.0531

   We then compare our method with others on the seven sequences from AirDOS-Shibuya in Tab. I and demonstrate that our method outperforms existing state-of-the-art VO algorithms. This benchmark covers much more challenging viewpoints and diverse motion patterns for articulated objects than our training data. The seven sequences are categorized into three levels of difficulty: most humans stand still in Standing Human with few of them moving around, Road Crossing (Easy) contains multiple humans moving in and out of the camera's view, and in Road Crossing (Hard) humans enter the camera's view abruptly. Besides VO methods, we also compare ours with SLAM methods that are able to handle dynamic scenes. DROID-SLAM [37] is a learning-based SLAM trained on TartanAir. AirDOS [38], VDO-SLAM [40] and DynaSLAM [41] are three feature-based SLAM methods targeting dynamic scenes. We provide the performance of AirDOS and ORB-SLAM [39] after masking the dynamic features during their ego-motion estimation. DeepVO [4], TartanVO and TrianFlow [42] are three learning-based VO methods not targeting dynamic scenes, while CC [11] is an unsupervised VO resolving dynamic scenes through motion segmentation.

   Our model achieves the best performance in all sequences among VO baselines and is competitive even among SLAM methods. DeepVO, TrianFlow and CC perform badly on the AirDOS-Shibuya dataset because they are trained on KITTI only and are not able to generalize. TartanVO performs better, but it is still susceptible to the disturbance of dynamic objects. On RoadCrossing V, as shown in Fig. 1, all VO baselines fail except ours. In hard sequences where there are more aggressive camera movements and abundant moving objects, ours outperforms dynamic SLAM methods such as AirDOS, VDO-SLAM and DynaSLAM by more than 80%. While DROID-SLAM remains competitive most of the time, it loses track of RoadCrossing V and VII as soon as a walking person occupies a large area in the image. Note that ours only takes 0.16 seconds per inference with 3 iterations, but DROID-SLAM takes an extra 4.8 seconds to optimize the trajectory. More qualitative results are in the supplemental material.

C. Performance on KITTI

   We also evaluated our method against others on sequences from the KITTI Odometry dataset [44] in Tab. II. Our method outperforms other VO baselines in 6 out of 8 dynamic sequences, with an improvement of 27.7% on average against the second best method. DeepVO, TrianFlow and CC are trained on some of the sequences in KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic data.
 | ||||
TABLE II: Results of ATE (m) on dynamic sequences from KITTI Odometry. Original sequences are trimmed into shorter ones that contain dynamic objects¹. DeepVO [4], TrianFlow [42] and CC [11] are trained on KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic data. Without backend optimization unlike SLAM, we achieve the best performance on 00, 02, 04, and competitive performance on the rest among all methods including SLAM.

                                       00        01        02        03        04        07        08        10

   SLAM    DROID-SLAM [37]           0.0148    49.193    0.1064    0.0119    0.0374    0.1939    0.9713    0.0368
   method  ORB-SLAM w/ mask [39]     0.0187      -       0.0796    0.1519    0.0198    0.2108    1.0479    0.0246
           DynaSLAM [41]             0.0138      -       0.1046      -       0.1450    0.3187    1.0559    0.0264

   VO      DeepVO [4]               (0.0206)   1.2896   (0.2975)   0.0783    0.0506    1.5540   (3.8984)   0.2545
   method  TrianFlow [42]            0.6966   (8.2127)  (1.8759)   1.6862    1.2950    0.6789   (1.0411)  (0.0346)
           CC [11]                   0.0253   (0.3060)  (0.2559)   0.0505    0.0337    0.7108    0.9776    0.1024
           TartanVO [5]              0.0345    4.7080    0.1049    0.2832    0.0743    0.6367    1.0344    0.0280
           Ours                      0.0126    0.4081    0.0594    0.0406    0.0180    0.7262    0.6547    0.1042

We use (·) to denote that the sequence is in the training set of the corresponding method.
| 
 | ||||
| Fig. 4: Qualitative results on dynamic sequences in KITTI Odometry 01, 03, 04 and 10. The first row is our segmentation outputs of | ||||
| moving objects. The second row is the visualization after aligning the scales of trajectories with ground truth all at once. Ours produces | ||||
| precise odometry given large areas in the image being dynamic even among methods that are trained on KITTI. Note that the trajectories | ||||
| do not always reflect the ATE results due to alignment. | ||||
| 
 | ||||
   Moreover, we achieve the best ATE on 3 sequences among both VO and SLAM without any optimization. We provide qualitative results in Fig. 4 on four challenging sequences with fast-moving vehicles or dynamic objects occupying large areas in the images. Note that on sequence 01, which starts with a high-speed vehicle passing by, both ORB-SLAM and DynaSLAM fail to initialize, while DROID-SLAM loses track from the beginning. Even though CC uses 01 in its training set, ours gives only 0.1 higher ATE, while being 0.88 lower than the third best baseline. On sequence 10, where a huge van takes up significant areas in the center of the image, ours is the only VO that keeps track robustly.

D. Diagnostics

   While we observe that our method is robust to heavily dynamic scenes with as much as 70% of the image covered by dynamic objects, it still fails when all foreground objects are moving, leaving only a textureless background. This is most likely to happen when dynamic objects take up large areas in the image. For example, when testing on the test set of FlyingThings3D [30], where 80% of the image is dynamic, our method masks almost the entire optical flow map as zeros, leading to the divergence of motion estimation and segmentation. Future work could hence consider incorporating dynamic object-awareness into the framework and utilizing dynamic cues instead of fully discarding them. Additionally, learning-based VO tends to overfit on simple translational movements such as in KITTI, which is resolved in our method by training on datasets with broad diversity; however, our method gives worse performance when there is little or zero camera motion, caused by the bias in currently available datasets. One should consider training on zero-motion inputs in addition to frame skipping.

                        V. CONCLUSION

   In this paper, we propose a learning-based dynamic VO (DytanVO) which can jointly refine the estimation of camera pose and the segmentation of dynamic objects. We demonstrate that both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications. We evaluate our method on the KITTI Odometry and AirDOS-Shibuya datasets, and demonstrate state-of-the-art performance in dynamic environments without finetuning or optimization on the backend. Our work introduces new directions for dynamic visual SLAM algorithms.

   ¹Sequences listed are trimmed into lengths of 28, 133, 67, 31, 40, 136, 51 and 59 respectively, which contain moving pedestrians, vehicles and cyclists.
| 
 | ||||
                            REFERENCES

 [1] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
 [2] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
 [3] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22, IEEE, 2014.
 [4] S. Wang, R. Clark, H. Wen, and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043–2050, IEEE, 2017.
 [5] W. Wang, Y. Hu, and S. Scherer, "TartanVO: A generalizable learning-based VO," arXiv preprint arXiv:2011.00359, 2020.
 [6] H. Zhou, B. Ummenhofer, and T. Brox, "DeepTAM: Deep tracking and mapping," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 822–838, 2018.
 [7] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, "Self-supervised deep visual odometry with online adaptation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6339–6348, 2020.
 [8] W. Xuan, R. Ren, S. Wu, and C. Chen, "MaskVO: Self-supervised visual odometry with a learnable dynamic mask," in 2022 IEEE/SICE International Symposium on System Integration (SII), pp. 225–231, IEEE, 2022.
 [9] U.-H. Kim, S.-H. Kim, and J.-H. Kim, "SimVODIS++: Neural semantic visual odometry in dynamic environments," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4244–4251, 2022.
[10] Z. Yin and J. Shi, "GeoNet: Unsupervised learning of dense depth, optical flow and camera pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992, 2018.
[11] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black, "Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12240–12249, 2019.
[12] H. Liu, G. Liu, G. Tian, S. Xin, and Z. Ji, "Visual SLAM based on dynamic object removal," in 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 596–601, IEEE, 2019.
[13] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in 2019 International Conference on Robotics and Automation (ICRA), pp. 5231–5237, IEEE, 2019.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263–2270, 2017.
[15] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[16] F. Fraundorfer and D. Scaramuzza, "Visual odometry: Part II: Matching, robustness, optimization, and applications," IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 78–90, 2012.
[17] A. Valada, N. Radwan, and W. Burgard, "Deep auxiliary learning for visual localization and odometry," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6939–6946, IEEE, 2018.
[18] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2015.
[19] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, "Visual odometry revisited: What should be learnt?," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4203–4210, IEEE, 2020.
[20] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1565–1573, 2016.
[21] S. Pillai and J. Leonard, "Monocular SLAM supported object recognition," arXiv preprint arXiv:1506.01732, 2015.
[22] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168–1174, IEEE, 2018.
[23] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635, IEEE, 2017.
[24] L. An, X. Zhang, H. Gao, and Y. Liu, "Semantic segmentation–aided visual odometry for urban autonomous driving," International Journal of Advanced Robotic Systems, vol. 14, no. 5, p. 1729881417735667, 2017.
[25] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, "VSO: Visual semantic odometry," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250, 2018.
[26] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7322–7328, IEEE, 2020.
[27] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–4916, IEEE, 2020.
[28] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30, IEEE, 2017.
[29] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977, 2018.
[30] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048, 2016.
[31] D. Sun, X. Yang, M. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," CoRR, vol. abs/1709.02371, 2017.
[32] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[34] G. Yang and D. Ramanan, "Learning to segment rigid motions from two frames," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1266–1275, 2021.
[35] G. Yang and D. Ramanan, "Upgrading optical flow to 3D scene flow through optical expansion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1334–1343, 2020.
[36] G. French, A. Oliver, and T. Salimans, "Milking CowMask for semi-supervised image classification," arXiv preprint arXiv:2003.12022, 2020.
[37] Z. Teed and J. Deng, "DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras," Advances in Neural Information Processing Systems, vol. 34, pp. 16558–16569, 2021.
[38] Y. Qiu, C. Wang, W. Wang, M. Henein, and S. Scherer, "AirDOS: Dynamic SLAM benefits from articulated objects," in 2022 International Conference on Robotics and Automation (ICRA), pp. 8047–8053, IEEE, 2022.
[39] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[40] J. Zhang, M. Henein, R. Mahony, and V. Ila, "VDO-SLAM: A visual dynamic object-aware SLAM system," arXiv preprint arXiv:2005.11052, 2020.
[41] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[42] W. Zhao, S. Liu, Y. Shu, and Y.-J. Liu, "Towards better generalization: Joint depth-pose learning without PoseNet," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9151–9161, 2020.
[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
| 
 | ||||
|  | ||||
| @ -0,0 +1,510 @@ | ||||
|                                        Multi-modal Semantic SLAM for Complex Dynamic Environments | ||||
| 
 | ||||
|                                                                   Han Wang*, Jing Ying Ko* and Lihua Xie, Fellow, IEEE | ||||
| 
 | ||||
arXiv:2205.04300v1 [cs.RO] 9 May 2022

   Abstract— Simultaneous Localization and Mapping (SLAM) is one of the most essential techniques in many real-world robotic applications. The assumption of static environments is common in most SLAM algorithms, which, however, is not the case for most applications. Recent work on semantic SLAM aims to understand the objects in an environment and distinguish dynamic information from a scene context by performing image-based segmentation. However, the segmentation results are often imperfect or incomplete, which can subsequently reduce the quality of mapping and the accuracy of localization. In this paper, we present a robust multi-modal semantic framework to solve the SLAM problem in complex and highly dynamic environments. We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model. Moreover, both geometric-only clustering and visual semantic information are combined to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur. Thorough experiments have been conducted to evaluate the performance of the proposed method. The results show that our method can precisely identify dynamic objects under recognition imperfection and motion blur. Moreover, the proposed SLAM framework is able to efficiently build a static dense map at a processing rate of more than 10 Hz, which can be implemented in many practical applications. Both the training data and the proposed method are open sourced¹.

Fig. 1: System overview of the proposed multi-modal semantic SLAM. Compared to traditional semantic SLAM, we propose to use a multi-modal method to improve the efficiency and accuracy of existing SLAM methods in complex and dynamic environments. Our method significantly reduces the localization drifts caused by dynamic objects and performs dense semantic mapping in real time.

                        I. INTRODUCTION

   Simultaneous Localization and Mapping (SLAM) is one of the most significant capabilities in many robot applications such as self-driving cars, unmanned aerial vehicles, etc. Over the past few decades, SLAM algorithms have been extensively studied in both visual SLAM, such as ORB-SLAM [1], and LiDAR-based SLAM, such as LOAM [2] and LeGO-LOAM [3]. Unfortunately, many existing SLAM algorithms assume the environment to be static and cannot handle dynamic environments well. The localization is often achieved via visual or geometric features such as feature points, lines and planes, without including semantic information to represent the surrounding environment, which can only work well under static environments. However, the real world is generally complex and dynamic. In the presence of moving objects, pose estimation might suffer from drifting, which may cause system failure if there are wrong correspondences or insufficient matching features [4]. The presence of dynamic objects can greatly degrade the accuracy of localization and the reliability of the mapping during the SLAM process.

   Advancements in deep learning have enabled the development of various instance segmentation networks based on 2D images [5]–[6]. Most existing semantic SLAMs leverage the success of deep learning-based image segmentation, e.g., Dynamic-SLAM [7] and DS-SLAM [8]. However, the segmentation results are not ideal under dynamic environments. Various factors such as small-scale objects, objects under occlusion and motion blur contribute to challenges in 2D instance segmentation. For example, an object is only partially recognized under motion blur or when it is near the border of the image. These can degrade the accuracy of localization and the reliability of the mapping. Some recent works target performing deep learning on 3D point clouds to achieve semantic recognition [9]–[10]. However, 3D point cloud instance segmentation does not perform as well as its 2D counterpart, due to its smaller scale of training data and high computational cost.

   *Jing Ying Ko and Han Wang contribute equally to this paper and are considered as joint first authors.
   The research is supported by the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation.
   Jing Ying Ko, Han Wang and Lihua Xie are with the School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. e-mail: {hwang027, E170043}@e.ntu.edu.sg; elhxie@ntu.edu.sg
   ¹https://github.com/wh200720041/MMS_SLAM
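As an illustration of the multi-modal idea sketched in Fig. 1 (the authors' actual pipeline is detailed later in the paper), a hedged sketch of fusing geometric clustering with projected 2D instance masks might look as follows: cluster the LiDAR scan, project each cluster into the image, and mark the whole cluster dynamic if enough of its projected points fall inside a dynamic-class instance mask. The function signature, projection convention and voting threshold are assumptions for illustration only.

import numpy as np

def label_clusters_dynamic(points, clusters, dyn_mask, K, T_cam_lidar, vote_ratio=0.5):
    """points: (N, 3) LiDAR points; clusters: list of index arrays (e.g. from
    Euclidean clustering); dyn_mask: (H, W) binary image, 1 where a dynamic-class
    instance was segmented; K: (3, 3) camera intrinsics; T_cam_lidar: (4, 4)
    extrinsics. Returns a per-point boolean flag marking dynamic points."""
    h, w = dyn_mask.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]               # LiDAR -> camera frame
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.maximum(uv[:, 2:3], 1e-6)        # perspective projection
    dynamic = np.zeros(len(points), dtype=bool)
    for idx in clusters:
        u = np.clip(uv[idx, 0].astype(int), 0, w - 1)
        v = np.clip(uv[idx, 1].astype(int), 0, h - 1)
        in_front = cam[idx, 2] > 0                       # only points in front of the camera
        votes = dyn_mask[v, u].astype(bool) & in_front
        if in_front.any() and votes.sum() / in_front.sum() > vote_ratio:
            dynamic[idx] = True                          # whole cluster marked dynamic
    return dynamic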
counterpart due to its smaller scale of training data and high computational cost. There are several reasons: 1) 3D point cloud instance segmentation such as PointGroup takes a long computation time (491 ms) [11]; 2) it is much less efficient to label a point cloud, since the geometric information is not as straightforward as the visual information; 3) it is inevitable to change the viewpoint in order to label a point cloud [12], which increases the labeling time.

In this paper, we propose a robust and computationally efficient multi-modal semantic SLAM framework to tackle the limitation of existing SLAM methods in dynamic environments. We modify the existing backbone network to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result than our baseline instance segmentation model. Moreover, we combine the geometric-only clustering and visual semantic information to reduce the effect of motion blur. Eventually, the multi-modal semantic recognition is integrated into the SLAM framework, which is able to provide real-time localization in different dynamic environments. The experimental results show that the segmentation errors due to misclassification, small-scale objects and occlusion can be well resolved with our proposed method. The main contributions of this paper are summarized as follows:

• We propose a robust and fast multi-modal semantic SLAM framework that targets the SLAM problem in complex and dynamic environments. Specifically, we combine the geometric-only clustering and visual semantic information to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur.
• We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result than our baseline instance segmentation model.
• A thorough evaluation of the proposed method is presented. The results show that our method is able to provide reliable localization and a semantic dense map.

The rest of the paper is organized as follows: Section II presents an overview of the related works regarding the three main SLAM methods in dynamic environments. Section III describes the details of the proposed SLAM framework. Section IV provides quantitative and qualitative experimental results in dynamic environments. Section V concludes this paper.

II. RELATED WORK

In this section, we present the existing works that address SLAM problems in dynamic environments. The existing dynamic SLAM approaches can be categorized into three main methods: feature consistency verification methods, deep learning-based methods and multi-modal-based methods.

A. Feature Consistency Verification

Dai et al. [13] present a segmentation method that uses the correlation between points to distinguish moving objects from the stationary scene, which has a low computational requirement. Lee et al. [14] introduce a real-time depth edge-based RGB-D SLAM system to deal with a dynamic environment. A static weighting method is proposed to measure the likelihood of an edge point being part of the static environment and is further used for the registration of the frame-to-keyframe point cloud. These methods can generally achieve real-time implementation without increasing the computational complexity, and they need no prior knowledge about the dynamic objects. However, they are unable to continuously track potential dynamic objects; e.g., a person who stops at a location temporarily between moves is considered a static object in their work.

B. Deep Learning-Based Dynamic SLAM

Deep learning-based dynamic SLAM usually performs better than feature consistency verification, as it provides conceptual knowledge of the surrounding environment to perform the SLAM tasks. Xun et al. [15] propose a feature-based visual SLAM algorithm based on ORB-SLAM2, where a front-end semantic segmentation network is introduced to filter out dynamic feature points and subsequently fine-tune the camera pose estimation, thus making the tracking algorithm more robust. Reference [16] combines a semantic segmentation network with a moving consistency check method to reduce the impact of dynamic objects and generate a dense semantic octree map. A visual SLAM system proposed by [17] develops a dynamic object detector with multi-view geometry and background inpainting, which aims to estimate a static map and reuse it in long-term applications. However, Mask R-CNN is considered computationally intensive; as a result, the whole framework can only be performed offline.

Deep learning-based LiDAR SLAM in dynamic environments is relatively less popular than visual SLAM. Reference [18] integrates semantic information by using a fully convolutional neural network to embed semantic labels into a dense surfel-based map representation. However, the adopted segmentation network is based on 3D point clouds, which is less effective compared to 2D segmentation networks. Reference [19] develops a laser-inertial odometry and mapping method which consists of four sequential modules to perform real-time and robust pose estimation for large-scale highway environments. Reference [20] presents a dynamic-objects-free LOAM system by overlapping segmented images onto LiDAR scans. Although deep learning-based methods can effectively alleviate the impact of dynamic objects on the SLAM performance, they are normally difficult to operate in real time due to the high computational complexity of deep neural networks.
[Fig. 2: Flow chart of the proposed method. Our system consists of four modules: (a) instance segmentation module (data acquisition, data labelling, data training, model generalization, convolutional neural network); (b) multi-modal fusion module (point cloud -> geometric clustering -> semantic fusion; image -> segmentation & classification -> motion blur compensation); (c) localization module (feature extraction, data association, iterative pose estimation, key frame selection, feature map update); (d) global optimization & mapping module (static map construction, map optimization, dynamic mapping, global map fusion, visualization, 3D map and localization output).]
| 
 | ||||
C. Multi-modal-based Dynamic SLAM

Multi-modal approaches are also explored to deal with dynamic environments. Reference [21] introduces a multi-modal sensor-based semantic mapping algorithm to improve the semantic 3D map in large-scale as well as featureless environments. Although this work is similar to our proposed method, it incurs a higher computational cost than our proposed method. A LiDAR-camera SLAM system [22] is presented that applies a sparse subspace clustering-based motion segmentation method to build a static map in dynamic environments. Reference [23] incorporates the information of a monocular camera and a laser range finder to remove the feature outliers related to dynamic objects. However, both references [22] and [23] only work well in low-dynamic environments.

III. METHODOLOGY

In this section, the proposed method is discussed in detail. Fig. 2 illustrates an overview of our framework. It is mainly composed of four modules, namely the instance segmentation module, multi-modal fusion module, localization module and global optimization & mapping module. The instance segmentation module uses a real-time instance segmentation network to extract the semantic information of all potential dynamic objects that are present in an RGB image. The convolutional neural network is trained offline and is later deployed online to achieve real-time performance. Concurrently, the multi-modal fusion module transfers relevant semantic data to the LiDAR through sensor fusion and subsequently uses the multi-modal information to further strengthen the segmentation results. The static information is used in the localization module to find the robot pose, while both static and dynamic information are utilized in the global optimization and mapping module to build a 3D dense semantic map.

A. Instance Segmentation & Semantic Learning

A recent 2D instance segmentation framework [24] is employed in our work due to its ability to outperform other state-of-the-art instance segmentation models in both segmentation accuracy and inference speed. Given an input image I, our adopted instance segmentation network predicts a set of {C_i, M_i}_{i=1}^{n}, where C_i is a class label, M_i is a binary mask, and n is the number of instances in the image. The image is spatially separated into N x N grid cells. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the semantic category C_{ij} and semantic mask M_{ij} of the object in the category branch B_c and mask branch P_m respectively:

B_c(I, \theta_c): I \to \{ C_{ij} \in \mathbb{R}^{\lambda} \mid i, j = 0, 1, \ldots, N \},   (1a)
P_m(I, \theta_m): I \to \{ M_{ij} \in \mathbb{R}^{\phi} \mid i, j = 0, 1, \ldots, N \},   (1b)

where \theta_c and \theta_m are the parameters of the category branch B_c and mask branch P_m respectively, \lambda is the number of classes, and \phi is the total number of grid cells. The category branch and mask branch are implemented with a Fully Convolutional Network (FCN). C_{ij} has a total of \lambda elements; each element of C_{ij} indicates the class probability for an object instance at grid cell (i, j). In parallel with the category branch, M_{ij} has a total of N^2 elements [24]. Each positive grid cell (i, j) generates the corresponding instance mask in the k-th element, where k = i * N + j. Since our proposed SLAM system is intentionally designed for real-world robotics applications, the computational cost of performing instance segmentation is our primary concern. Therefore, we use a light-weight version of SOLOv2 with lower accuracy to achieve real-time instance segmentation. To improve the segmentation accuracy, several methods have been implemented to build a more effective and robust feature representation discriminator in the backbone network. Firstly, we modify our backbone architecture from the original Feature Pyramid Network (FPN) to the Recursive Feature Pyramid (RFP) network [25]. Theoretically, RFP instills the idea of looking twice or more by integrating additional feedback from the FPN into the bottom-up backbone layers. This recursively strengthens the existing FPN and provides increasingly stronger feature representations.
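To make the grid-cell formulation in (1a)-(1b) concrete, the following is a minimal, framework-agnostic sketch (not the authors' released code) of how a SOLOv2-style category output of shape (N, N, lambda) and mask output of shape (N*N, H, W) could be decoded into instances, using the cell-to-mask index k = i * N + j. The score and mask thresholds are illustrative assumptions.

```python
import numpy as np

def decode_solo_outputs(cat_scores, mask_preds, score_thr=0.3, mask_thr=0.5):
    """Decode SOLOv2-style outputs.

    cat_scores: (N, N, num_classes) class probabilities per grid cell.
    mask_preds: (N*N, H, W) soft masks, one per grid cell (k = i*N + j).
    Returns a list of (class_id, score, binary_mask).
    """
    N = cat_scores.shape[0]
    instances = []
    for i in range(N):
        for j in range(N):
            c = int(np.argmax(cat_scores[i, j]))
            score = float(cat_scores[i, j, c])
            if score < score_thr:
                continue                      # cell is not "positive"
            k = i * N + j                     # cell (i, j) -> k-th mask channel
            binary_mask = mask_preds[k] > mask_thr
            if binary_mask.any():
                instances.append((c, score, binary_mask))
    return instances

# toy usage with random tensors standing in for network outputs
N, num_classes, H, W = 12, 81, 96, 160
cat_scores = np.random.rand(N, N, num_classes) * 0.4
mask_preds = np.random.rand(N * N, H, W)
print(len(decode_solo_outputs(cat_scores, mask_preds)))
```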
[Fig. 3: Comparison of the original SOLOv2 (panels 1a, 2a, 3a) with the proposed SOLOv2 with DetectoRS (panels 1b, 2b, 3b). Our segmentation results achieve higher accuracy: in (1b), our method preserves a more detailed mask for the rider on a motorcycle compared to the SOLOv2 result in (1a); in (2b), we can handle the occluded object, which is not detected in (2a); in (3b), our method accurately predicts the mask for a handbag compared to (3a).]
| 
 | ||||
By offsetting richer information with the small receptive field in the lower-level feature maps, we are able to improve the segmentation performance on small objects. Meanwhile, the ability of RFP to adaptively strengthen and suppress neuron activation enables the instance segmentation network to handle occluded objects more efficiently. On the other hand, we replace the convolutional layers in the backbone architecture with Switchable Atrous Convolution (SAC). SAC operates as a soft switch function, which is used to collect the outputs of convolutional computation with different atrous rates. Therefore, we are able to learn the optimal coefficients from SAC and can adaptively select the size of the receptive field. This allows SOLOv2 to efficiently extract important spatial information.

The outputs are pixel-wise instance masks for each dynamic object, as well as their corresponding bounding boxes and class types. To better integrate the dynamic information into the SLAM algorithm, the output binary masks are transformed into a single image containing all pixel-wise instance masks in the scene. A pixel covered by a mask is considered to be in the "dynamic state", and otherwise in the "static state". The binary mask is then passed to the semantic fusion module to generate a 3D dynamic mask.

B. Multi-Modal Fusion

1) Motion Blur Compensation: Instance segmentation has achieved good performance on public datasets such as the COCO dataset and the Objects365 dataset [24]-[26]. However, in practice the target may be partially recognized or incomplete due to motion blur on moving objects, resulting in ambiguous boundaries of a moving object. Moreover, the motion blur effect is further enlarged when projecting the 2D pixel-wise semantic mask of a dynamic object to a 3D semantic label, leading to point misalignment and inconsistency of feature point extraction. In the experiments, we find that the ambiguous boundaries of dynamic targets degrade the localization accuracy and produce noise when performing the mapping task. Therefore, we first apply morphological dilation, convolving the 2D pixel-wise mask image with a structuring element to gradually expand the boundaries of the dynamic-object regions. The morphological dilation result marks the ambiguous boundaries around the dynamic objects. We take both the dynamic objects and their boundaries as the dynamic information, which is further refined in the multi-modal fusion step.

2) Geometric Clustering & Semantic Fusion: Compensation via connectivity analysis in Euclidean space [27] is also implemented in our work. The instance segmentation network has excellent recognition capability in most practical situations; however, motion blur limits the segmentation performance due to ambiguous pixels between regions, leading to undesirable segmentation errors. Therefore, we combine the point cloud clustering results and the segmentation results to better refine the dynamic objects. In particular, we perform the connectivity analysis on the geometric information and merge it with the vision-based segmentation results.

A raw LiDAR scan often contains tens of thousands of points. To increase the efficiency of our work, the 3D point cloud is first downsized to reduce the scale of the data and used as the input for point cloud clustering. Then the instance segmentation results are projected to the point cloud coordinates to label each point. A point cloud cluster is considered a dynamic cluster when most of its points (90%) are dynamically labelled points. A static point is re-labeled with the dynamic tag when it is close to a dynamic point cluster, and a dynamic point is re-labelled when there is no dynamic point cluster nearby.
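The two refinement steps just described can be sketched compactly. The snippet below is our own illustration, not the released implementation: it dilates the 2D dynamic mask to also cover blurred boundaries, then refines per-point labels with the 90% majority rule over Euclidean clusters. The structuring-element size is an assumption, and the proximity-based re-labelling of isolated points is omitted for brevity.

```python
import numpy as np
import cv2

def dilate_dynamic_mask(mask, kernel_size=15):
    """Expand dynamic-object regions so ambiguous (blurred) boundaries are also masked."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask.astype(np.uint8), kernel, iterations=1) > 0

def fuse_labels(cluster_ids, dyn_label, ratio=0.9):
    """Refine per-point dynamic labels using Euclidean cluster ids.

    cluster_ids: (M,) cluster index per point (from connectivity analysis).
    dyn_label:   (M,) bool, True if the projected 2D mask marked the point dynamic.
    A cluster becomes dynamic if at least `ratio` of its points are dynamic;
    every point then inherits its cluster's label.
    """
    refined = np.zeros_like(dyn_label)
    for cid in np.unique(cluster_ids):
        idx = cluster_ids == cid
        refined[idx] = dyn_label[idx].mean() >= ratio
    return refined

# toy usage
mask = np.zeros((120, 160), bool)
mask[40:80, 60:100] = True
print(dilate_dynamic_mask(mask).sum() > mask.sum())          # True: region grew
clusters = np.array([0, 0, 0, 1, 1])
labels = np.array([True, True, False, False, False])
print(fuse_labels(clusters, labels))   # cluster 0 has only 2/3 dynamic points -> all False
```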
[Fig. 4: Different types of AGVs used in our warehouse environment: (a) the grabbing AGV with a robot arm; (b) forklift AGV; (c) scanning AGV; (d) the Pioneer robot; (e) the transportation AGV with a conveyor belt; (f) the warehouse environment.]

C. Localization & Pose Estimation

1) Feature Extraction: After applying multi-modal dynamic segmentation, the point cloud is divided into a dynamic point cloud P_D and a static point cloud P_S. The static point cloud is subsequently used for the localization and mapping module based on our previous work [28]. Compared to existing SLAM approaches such as LOAM [2], the framework proposed in [28] is able to support real-time performance at 30 Hz, which is a few times faster. It is also resistant to illumination variation compared to visual SLAM systems such as ORB-SLAM [1] and VINS-Mono [29]. For each static point p_k in P_S, we search for its set of nearby static points S_k by radius search in Euclidean space. Let |S| denote the cardinality of a set S; the local smoothness is then defined by

\sigma_k = \frac{1}{|S_k|} \sum_{p_i \in S_k} \left( \| p_k \| - \| p_i \| \right).   (2)

The edge features are defined as the points with large \sigma_k and the planar features are defined as the points with small \sigma_k.
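As a small numerical sketch of (2), the snippet below gathers neighbours within a radius using a k-d tree and computes sigma_k as the mean range difference; edge-like and planar-like points are then selected by the magnitude of sigma_k. The radius and the quantile thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_smoothness(points, radius=0.5):
    """sigma_k = (1/|S_k|) * sum_i (||p_k|| - ||p_i||), Eq. (2)."""
    tree = cKDTree(points)
    ranges = np.linalg.norm(points, axis=1)
    sigma = np.zeros(len(points))
    for k, p in enumerate(points):
        idx = tree.query_ball_point(p, r=radius)
        idx.remove(k)                      # exclude the point itself
        if idx:
            sigma[k] = np.mean(ranges[k] - ranges[idx])
    return sigma

# toy usage: split a random scan into edge-like and planar-like points
pts = np.random.rand(500, 3) * 10.0
sigma = local_smoothness(pts)
edge_idx = np.where(np.abs(sigma) > np.quantile(np.abs(sigma), 0.9))[0]
planar_idx = np.where(np.abs(sigma) < np.quantile(np.abs(sigma), 0.5))[0]
print(len(edge_idx), len(planar_idx))
```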
2) Data Association: The final robot pose is calculated by minimizing the point-to-edge and point-to-plane distances. An edge feature point p_E in P_E can be transformed into the local map coordinate frame by \hat{p}_E = T \cdot p_E, where T \in SE(3) is the current pose. We search for the 2 nearest edge features p_{E_1} and p_{E_2} in the local edge feature map, and the point-to-edge residual is defined by [28]

f_E(\hat{p}_E) = \frac{\| (\hat{p}_E - p_{E_1}) \times (\hat{p}_E - p_{E_2}) \|}{\| p_{E_1} - p_{E_2} \|},   (3)

where \times denotes the cross product. Similarly, given a planar feature point p_L in P_L and its transformed point \hat{p}_L = T \cdot p_L, we search for the 3 nearest points p_{L_1}, p_{L_2}, and p_{L_3} in the local planar map. The point-to-plane residual is defined by

f_L(\hat{p}_L) = (\hat{p}_L - p_{L_1})^{T} \cdot \frac{(p_{L_1} - p_{L_2}) \times (p_{L_1} - p_{L_3})}{\| (p_{L_1} - p_{L_2}) \times (p_{L_1} - p_{L_3}) \|}.   (4)

3) Pose Estimation: The final robot pose is calculated by minimizing the sum of the point-to-plane and point-to-edge residuals:

T^{*} = \arg\min_{T} \sum_{p_E \in P_E} f_E(\hat{p}_E) + \sum_{p_L \in P_L} f_L(\hat{p}_L).   (5)
| 
 | ||||
This non-linear optimization problem can be solved by the Gauss-Newton method, and we can derive an optimal robot pose based on the static information.
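The following is a compact numerical sketch of the residuals (3)-(4) and the objective (5). It only evaluates the cost terms under a given rotation and translation; the nearest-feature search and the full Gauss-Newton solver of [28] are not reproduced, and the absolute value applied to the plane residual inside the total cost is our own simplification.

```python
import numpy as np

def point_to_edge_residual(p_hat, e1, e2):
    """Eq. (3): distance from the transformed point p_hat to the line through e1, e2."""
    return np.linalg.norm(np.cross(p_hat - e1, p_hat - e2)) / np.linalg.norm(e1 - e2)

def point_to_plane_residual(p_hat, l1, l2, l3):
    """Eq. (4): signed distance from p_hat to the plane spanned by l1, l2, l3."""
    n = np.cross(l1 - l2, l1 - l3)
    return float(np.dot(p_hat - l1, n / np.linalg.norm(n)))

def total_cost(R, t, edge_pts, edge_pairs, plane_pts, plane_triplets):
    """Eq. (5): sum of edge and plane residuals under the pose (R, t)."""
    cost = 0.0
    for p, (e1, e2) in zip(edge_pts, edge_pairs):
        cost += point_to_edge_residual(R @ p + t, e1, e2)
    for p, (l1, l2, l3) in zip(plane_pts, plane_triplets):
        cost += abs(point_to_plane_residual(R @ p + t, l1, l2, l3))
    return cost

# toy usage with an identity pose
R, t = np.eye(3), np.zeros(3)
edge_pts = [np.array([1.0, 0.1, 0.0])]
edge_pairs = [(np.array([1.0, 0.0, -1.0]), np.array([1.0, 0.0, 1.0]))]
plane_pts = [np.array([0.0, 0.0, 0.2])]
plane_triplets = [(np.zeros(3), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))]
print(total_cost(R, t, edge_pts, edge_pairs, plane_pts, plane_triplets))
```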
| 
 | ||||
4) Feature Map Update & Key Frame Selection: Once the optimal pose is derived, the features are updated to the local edge map and the local plane map respectively, which will be used for the data association on the next frame. Note that building and updating a global dense map is often computationally costly. Hence, the global static map is updated based on keyframes. A key frame is selected when the translational change of the robot pose is greater than a predefined translation threshold, or the rotational change of the robot pose is greater than a predefined rotation threshold.
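A minimal sketch of this keyframe rule is shown below; the two thresholds are placeholders, not values from the paper.

```python
import numpy as np

def is_keyframe(R_prev, t_prev, R_cur, t_cur,
                trans_thresh=0.3, rot_thresh_deg=15.0):
    """Select a keyframe when the translation or rotation since the last
    keyframe exceeds a predefined threshold."""
    dt = np.linalg.norm(t_cur - t_prev)
    dR = R_prev.T @ R_cur
    angle = np.degrees(np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)))
    return dt > trans_thresh or angle > rot_thresh_deg

# usage: 20 cm forward and 5 deg yaw -> not a keyframe with these thresholds
yaw = np.radians(5.0)
R_cur = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0, 0.0, 1.0]])
print(is_keyframe(np.eye(3), np.zeros(3), R_cur, np.array([0.2, 0.0, 0.0])))
```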
D. Global Map Building

The semantic map is separated into a static map and a dynamic map. Note that the visual information given previously is also used to construct the colored dense static map. Specifically, the visual information is obtained by re-projecting 3D points into the image plane. After each update, the map is down-sampled by using a 3D voxelized grid approach [30] in order to prevent memory overflow. The dynamic map is built from P_D and is used to reveal the dynamic objects. The dynamic information can be used for high-level tasks such as motion planning.

IV. EXPERIMENT EVALUATION

In this section, experimental results are presented to demonstrate the effectiveness of our proposed method. First, our experimental setup is discussed in detail. Second, we elaborate on how we acquire the data of potential moving objects in a warehouse environment. Third, we evaluate the segmentation performance of our adopted instance segmentation model. Subsequently, we explain how we perform dense mapping and dynamic tracking. Lastly, we evaluate the performance of our proposed method with regard to localization drift under dynamic environments.

A. Experimental Setup

For our experimental setup, the Robot Operating System (ROS) is utilized as the interface for the integration of the semantic learning module and the SLAM algorithm, as shown in Fig. 1. An Intel RealSense LiDAR camera L515 is used to capture RGB images and point clouds at a fixed frame rate. All the experiments are performed on a computer with an Intel i7 CPU and an Nvidia GeForce RTX 2080 Ti GPU.

[Fig. 5: Static map creation and final semantic mapping result: (a) static map built by the proposed SLAM framework; (b) final semantic mapping result. The instance segmentation is shown on the left. Human operators are labeled by red bounding boxes and AGVs are labeled by green bounding boxes.]
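Relating to the voxel-grid down-sampling used for map maintenance in Section III-D, the snippet below is a minimal numpy sketch that keeps one centroid per occupied voxel; the actual system uses the PCL voxel filter [30], and the leaf size here is an assumption.

```python
import numpy as np

def voxel_downsample(points, leaf=0.1):
    """Keep one representative point (the centroid) per occupied voxel."""
    keys = np.floor(points / leaf).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    counts = np.bincount(inverse).astype(float)
    out = np.zeros((inverse.max() + 1, 3))
    for d in range(3):
        out[:, d] = np.bincount(inverse, weights=points[:, d]) / counts
    return out

pts = np.random.rand(10000, 3) * 5.0
print(voxel_downsample(pts, leaf=0.25).shape)   # far fewer points than the input
```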
| 
 | ||||
B. Data Acquisition

Humans are often considered dynamic objects in many scenarios such as autonomous driving and smart warehouse logistics. Therefore, we choose 5,000 human images from the COCO dataset. In the experiment, the proposed method is evaluated in the warehouse environment shown in Fig. 4. Besides considering humans as dynamic objects, an advanced factory requires human-to-robot and robot-to-robot collaboration, so the Automated Guided Vehicles (AGVs) are also potential dynamic objects. Hence, a total of 3,000 AGV images are collected to train the instance segmentation network, and some of the AGVs are shown in Fig. 4.

In order to address the small-dataset problem, we implement the copy-paste augmentation method proposed by [31] to enhance the generalization ability of the network and directly improve its robustness. To be specific, this method generates new images by applying random scale jittering to two random training images and randomly choosing a subset of object instances from one image to paste onto the other image.

C. Evaluation on Instance Segmentation Performance

In this part, we evaluate the segmentation performance on the COCO dataset with regard to the segmentation loss and mean Average Precision (mAP). The purpose of this evaluation is to compare our adopted instance segmentation network, SOLOv2, with the proposed method. The results are illustrated in Table I. Our adopted instance segmentation network, SOLOv2, is built on MMDetection 2.0 [32], an open-source object detection toolbox based on PyTorch. We trained SOLOv2 on the COCO dataset, which consists of 81 classes. We choose ResNet-50 as our backbone architecture, since this configuration satisfies our requirements for real-world robotics applications. Instead of training the network from scratch, we make use of the parameters of ResNet-50 pre-trained on ImageNet. For a fair comparison, all the models are trained under the same configuration: synchronized stochastic gradient descent with a total of 8 images per mini-batch for 36 epochs.

For SOLOv2 with the Recursive Feature Pyramid (RFP), we modify our backbone architecture from the Feature Pyramid Network (FPN) to the RFP network. In this experiment, we set the number of stages to 2, allowing SOLOv2 to look at the image twice. As illustrated in Table I, the RFP network brings a significant improvement in segmentation performance. On the other hand, we replace all 3x3 convolutional layers in the backbone network with Switchable Atrous Convolution (SAC), which increases the segmentation accuracy by 2.3%. By implementing both SAC and the RFP network in SOLOv2, the segmentation performance is further improved by 5.9% with only a 17 ms increase in inference time. Overall, SOLOv2 learns to look at the image twice with adaptive receptive fields, and is therefore able to highlight important semantic information for the instance segmentation network. The segmentation result is further visualized in Fig. 3.
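Before turning to the results in Table I, the snippet below is a simplified sketch of the copy-paste augmentation described in Section IV-B (see [31]): scale jittering and instance selection are reduced to their bare essentials, both images are assumed to share the same resolution, and the instance format (an image plus a list of boolean masks) is our own assumption.

```python
import random
import numpy as np
import cv2

def scale_jitter(img, masks, s):
    """Resize an image and its instance masks by factor s, then pad/crop to the original size."""
    h, w = img.shape[:2]
    nh, nw = int(h * s), int(w * s)
    img_r = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    masks_r = [cv2.resize(m.astype(np.uint8), (nw, nh)) > 0 for m in masks]
    canvas = np.zeros_like(img)
    canvas_masks = [np.zeros((h, w), bool) for _ in masks]
    ch, cw = min(h, nh), min(w, nw)
    canvas[:ch, :cw] = img_r[:ch, :cw]
    for cm, mr in zip(canvas_masks, masks_r):
        cm[:ch, :cw] = mr[:ch, :cw]
    return canvas, canvas_masks

def copy_paste(img_a, masks_a, img_b, scale_range=(0.8, 1.2)):
    """Paste a random subset of scaled instances from image A onto image B."""
    img_a, masks_a = scale_jitter(img_a, masks_a, random.uniform(*scale_range))
    out = img_b.copy()
    for i in random.sample(range(len(masks_a)), k=max(1, len(masks_a) // 2)):
        out[masks_a[i]] = img_a[masks_a[i]]      # overwrite the pasted pixels
    return out

# toy usage: paste a white square instance onto a black image
a = np.full((64, 64, 3), 255, np.uint8)
b = np.zeros((64, 64, 3), np.uint8)
m = np.zeros((64, 64), bool); m[16:32, 16:32] = True
print(copy_paste(a, [m], b).max())   # 255: pasted pixels appear in b
```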
| 
 | ||||
TABLE I: Performance comparison of instance segmentation.

  Model                      | Segmentation Loss | Mean AP (%) | Inference Time (ms)
  SOLOv2                     | 0.52              | 38.8        | 54.0
  SOLOv2 + RFP               | 0.36              | 41.2        | 64.0
  SOLOv2 + SAC               | 0.39              | 39.8        | 59.0
  SOLOv2 + DetectoRS (Ours)  | 0.29              | 43.4        | 71.0

TABLE II: Ablation study of localization drifts under dynamic environments.

  Methods                             | ATDE (cm) | MTDE (cm)
  W/O Semantic Recognition            | 4.834     | 1.877
  Vision-based Semantic Recognition   | 1.273     | 0.667
  Multi-Modal Recognition (Ours)      | 0.875     | 0.502
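Table II reports the two drift metrics defined in Section IV-E. As a minimal sketch, assuming the robot is physically stationary so that drift can be measured as the translation relative to the initial pose, ATDE and MTDE could be computed as follows; the trajectory format is an assumption for illustration.

```python
import numpy as np

def drift_metrics(positions):
    """ATDE: mean per-frame translational drift; MTDE: maximum drift.
    `positions` is an (N, 3) array of estimated positions of a robot that is
    physically stationary, so drift = distance from the first pose."""
    drifts = np.linalg.norm(positions - positions[0], axis=1)
    return float(drifts.mean()), float(drifts.max())

# toy usage: a small random walk standing in for a drifting estimate
traj = np.cumsum(np.random.randn(300, 3) * 0.001, axis=0)
atde, mtde = drift_metrics(traj)
print(f"ATDE = {atde * 100:.3f} cm, MTDE = {mtde * 100:.3f} cm")
```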
[Fig. 6: Localization comparison in a dynamic environment. The ground truth, the original localization result without filtering, and the localization result with our proposed multi-modal semantic filtering are plotted in red, green, and orange respectively.]

[Fig. 7: Ablation study of localization drifts: (a) original image view; (b) the visual semantic recognition result based on the proposed method; (c) localization drifts observed due to the moving objects; the localization drifts are highlighted by red circles.]

D. Dense Mapping and Dynamic Tracking

To evaluate the performance of our multi-modal semantic SLAM in dynamic environments, the proposed method is implemented on the warehouse AGVs shown in Fig. 4. In a smart manufacturing factory, both human operators and different types of AGVs (e.g., forklift AGVs, transportation AGVs and robot-arm-equipped AGVs) are supposed to work in a collaborative manner. Therefore, the capability of each AGV to localize itself among moving human operators and other AGVs is an essential technology towards Industry 4.0. In many warehouse environments, the remaining objects such as operating machines or tables can be taken as a static environment. Hence, we only consider humans and AGVs as dynamic objects in order to reduce the computational cost. In the experiment, an AGV is manually controlled to move around and build the warehouse environment map, while human operators walk frequently in the warehouse. The localization result is shown in Fig. 6, where we compare the ground truth, the proposed SLAM method, and the original SLAM without our filtering approach. It can be seen that when the dynamic object appears (shown in blue), the proposed multi-modal semantic SLAM is more robust and stable than the traditional SLAM. The mapping results are shown in Fig. 5. The proposed method is able to efficiently identify the potential dynamic objects and separate them from the static map. Although the human operators walk frequently in front of the robot, they are completely removed from the static map. All potential dynamic objects are enclosed by bounding boxes and added into a final semantic map to visualize the status of each object in real time, where the moving humans are colored red and the AGVs are colored green. Our method is able to identify and locate multiple targets in the complex dynamic environment.

E. Ablation Study of Localization Drifts

To further evaluate the localization performance under dynamic conditions, we compare the localization drifts of different dynamic filtering approaches. Firstly, we keep the robot still and let a human operator walk frequently in front of the robot. The localization drifts are recorded in order to evaluate the performance in the presence of dynamic objects. Specifically, we calculate the Average Translational Drift Error (ATDE) and the Maximum Translational Drift Error (MTDE) to verify the localization, where the ATDE is the average translational error of each frame and the MTDE is the maximum translational drift caused by the walking human. The results are shown in Table II. We first remove the semantic recognition module from the SLAM system and evaluate the performance. Then we use visual semantic recognition (SOLOv2) to remove the dynamic information. The results are compared with the proposed semantic multi-modal SLAM. It can be seen that, compared to the original SLAM, the proposed method significantly reduces the localization drift. Compared to vision-only filtering methods, the proposed multi-modal semantic SLAM is more stable and accurate in the presence of dynamic objects.

V. CONCLUSION

In this paper, we have presented a semantic multi-modal framework to tackle the SLAM problem in dynamic environments, which is able to effectively reduce the impact of dynamic objects in complex dynamic environments. Our approach aims to provide a modular pipeline to allow real-world applications in dynamic environments. Meanwhile, a 3D dense stationary map is constructed with the dynamic information removed. To verify the effectiveness of the proposed method in a complex dynamic environment, our method is evaluated on warehouse AGVs used for smart manufacturing. The results show that our proposed method can significantly improve the existing semantic SLAM algorithms in terms of robustness and accuracy.
REFERENCES

[1] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[2] J. Zhang and S. Singh, "LOAM: Lidar odometry and mapping in real-time," in Robotics: Science and Systems, vol. 2, no. 9, 2014.
[3] T. Shan and B. Englot, "LeGO-LOAM: Lightweight and ground-optimized lidar odometry and mapping on variable terrain," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4758–4765.
[4] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," IEEE International Symposium on Mixed and Augmented Reality, vol. 1, pp. 209–218, 2013.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE International Conference on Computer Vision, 2017.
[6] D. Bolya, Z. Chong, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," IEEE International Conference on Computer Vision, 2019.
[7] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robotics and Autonomous Systems, vol. 117, pp. 1–16, 2019.
[8] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1168–1174.
[9] L. Han, T. Zheng, L. Xu, and L. Fang, "OccuSeg: Occupancy-aware 3D instance segmentation," IEEE International Conference on Computer Vision and Pattern Recognition, 2020.
[10] J. Li, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," IEEE International Conference on Computer Vision and Pattern Recognition, 2020.
[11] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4867–4876.
[12] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[13] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Schere, "RGB-D SLAM in dynamic environments using point correlations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, 2020.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2262–2270, 2017.
[15] Y. Xun and C. Song, "SAD-SLAM: A visual SLAM based on semantic and depth information," IEEE International Conference on Intelligent Robots and Systems, 2021.
[16] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," IEEE International Conference on Intelligent Robots and Systems, pp. 1168–1174, 2018.
[17] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[18] X. Chen, A. Milioto, E. Palazzolo, P. Giguère, and C. Stachniss, "SuMa++: Efficient lidar-based semantic SLAM," IEEE International Conference on Intelligent Robots and Systems, 2019.
[19] S. Zhao, Z. Fang, H. Li, and S. Scherer, "A robust laser-inertial odometry and mapping method for large-scale highway environments," IEEE International Conference on Intelligent Robots and Systems, 2019.
[20] R. Jian, W. Su, R. Li, S. Zhang, J. Wei, B. Li, and R. Huang, "A semantic segmentation based lidar SLAM system towards dynamic environments," IEEE International Conference on Intelligent Robotics and Applications, pp. 582–590, 2019.
[21] J. Jeong, T. S. Yoon, and P. J. Bae, "Towards a meaningful 3D map using a 3D lidar and a camera," Sensors, vol. 18, no. 8, 2018.
[22] C. Jiang, D. P. Paudel, Y. Fougerolle, D. Fofi, and C. Demonceaux, "Static-map and dynamic object reconstruction in outdoor scenes using 3-D motion segmentation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 324–331, 2016.
[23] X. Zhang, A. B. Rad, and Y.-K. Wong, "Sensor fusion of monocular cameras and laser range finders for line-based simultaneous localization and mapping (SLAM) tasks in autonomous mobile robots," Sensors, vol. 12, pp. 429–452, 2012.
[24] X. Wang, R. Zhang, K. Tao, L. Lei, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," IEEE Computer Vision and Pattern Recognition, 2020.
[25] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," IEEE Computer Vision and Pattern Recognition, 2020.
[26] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2918–2928.
[27] R. B. Rusu, "Semantic 3D object maps for everyday manipulation in human living environments," KI-Künstliche Intelligenz, vol. 24, no. 4, pp. 345–348, 2010.
[28] H. Wang, C. Wang, and L. Xie, "Lightweight 3-D localization and mapping for solid-state lidar," IEEE Robotics and Automation Letters, 2020.
[29] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[30] R. B. Rusu and S. Cousins, "3D is here: Point cloud library (PCL)," in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 1–4.
[31] G. Ghiasi, C. Yin, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," IEEE Computer Vision and Pattern Recognition, 2020.
[32] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, W. Jingdong, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open MMLab detection toolbox and benchmark," IEEE Computer Vision and Pattern Recognition, 2019.
|  | ||||
| @ -0,0 +1,478 @@ | ||||
| IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022                                             9573 | ||||
| 
 | ||||
| RGB-D Inertial Odometry for a Resource-Restricted | ||||
|            Robot in Dynamic Environments | ||||
| 
 | ||||
|                     Jianheng Liu , Xuanfu Li, Yueqian Liu , and Haoyao Chen , Member, IEEE | ||||
| 
 | ||||
Abstract—Current simultaneous localization and mapping (SLAM) algorithms perform well in static environments but easily fail in dynamic environments. Recent works introduce deep learning-based semantic information into SLAM systems to reduce the influence of dynamic objects. However, it is still challenging to achieve robust localization in dynamic environments for resource-restricted robots. This paper proposes a real-time RGB-D inertial odometry system for resource-restricted robots in dynamic environments named Dynamic-VINS. Three main threads run in parallel: object detection, feature tracking, and state optimization. The proposed Dynamic-VINS combines object detection and depth information for dynamic feature recognition and achieves performance comparable to semantic segmentation. Dynamic-VINS adopts grid-based feature detection and proposes a fast and efficient method to extract high-quality FAST feature points. The IMU is applied to predict motion for feature tracking and moving consistency check. The proposed method is evaluated on both public datasets and real-world applications and shows competitive localization accuracy and robustness in dynamic environments. To the best of our knowledge, it is currently the best-performing real-time RGB-D inertial odometry for resource-restricted platforms in dynamic environments. The proposed system is open source at: https://github.com/HITSZ-NRSL/Dynamic-VINS.git

Index Terms—Localization, visual-inertial SLAM.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) is a foundational capability for many emerging applications, such as autonomous mobile robots and augmented reality. Cameras, as portable sensors, are commonly equipped on mobile robots and devices. Therefore, visual SLAM (vSLAM) has received tremendous attention over the past decades. Many works [1]–[4] have been proposed to improve visual SLAM systems' performance. Most of the existing vSLAM systems depend on a static world assumption. Stable features in the environment are used to form a solid constraint for Bundle Adjustment [5]. However, in real-world scenarios such as shopping malls and subways, dynamic objects such as moving people, vehicles, and unknown objects have an adverse impact on pose optimization. Although some approaches like RANSAC [6] can suppress the influence of dynamic features to a certain extent, they become overwhelmed when a vast number of dynamic objects appear in the scene.

Therefore, it is necessary for the system to consciously reduce dynamic objects' influence on the estimation results. Pure geometric methods [7]–[9] are widely used to handle dynamic objects, but they are unable to cope with latent or slightly moving objects. With the development of deep learning, many researchers have tried combining multi-view geometric methods with semantic information [10]–[13] to implement a robust SLAM system in dynamic environments. To avoid the accidental deletion of stable features through object detection [14], recent dynamic SLAM systems [15], [16] exploit the advantages of pixel-wise semantic segmentation for a better recognition of dynamic features. Due to the expensive computing resource consumption of semantic segmentation, it is difficult for a semantic-segmentation-based SLAM system to run in real time. Therefore, some researchers have tried to perform semantic segmentation only on keyframes and track moving objects via moving-probability propagation [17], [18] or direct methods [19] on each frame. In the case of missed detections or object tracking failures, the pose optimization is imprecise. Moreover, since semantic segmentation is performed after keyframe selection, real-time precise pose estimation is inaccessible, and unstable dynamic features in the original frames may also cause redundant keyframe creation and unnecessary computational burdens.

The above systems still require too many computing resources to perform robust real-time localization in dynamic environments for Size, Weight, and Power (SWaP)-restricted mobile robots or devices. Some researchers [20]–[22] try to run visual odometry in real time on embedded computing devices, yet keyframe-based visual odometry is not performed [23], which makes their accuracy unsatisfactory. At the same time, more and more embedded computing platforms are equipped with NPU/GPU computing units, such as the HUAWEI Atlas200, NVIDIA Jetson, etc. This enables lightweight deep learning networks to run on embedded computing platforms in real time.

Manuscript received 25 February 2022; accepted 20 June 2022. Date of publication 15 July 2022; date of current version 26 July 2022. This letter was recommended for publication by Associate Editor L. Paull and Editor J. Civera upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grants U21A20119 and U1713206 and in part by the Shenzhen Science and Innovation Committee under Grants JCYJ20200109113412326, JCYJ20210324120400003, JCYJ20180507183837726, and JCYJ20180507183456108. (Corresponding Author: Haoyao Chen.)

Jianheng Liu, Yueqian Liu, and Haoyao Chen are with the School of Mechanical Engineering and Automation, Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong 518055, China (e-mail: liujianhengchris@qq.com; yueqianliu@outlook.com; hychen5@hit.edu.cn).

Xuanfu Li is with the Department of HiSilicon Research, Huawei Technology Co., Ltd, Shenzhen, Guangdong 518129, China (e-mail: lixuanfu@huawei.com).

This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3191193, provided by the authors.

Digital Object Identifier 10.1109/LRA.2022.3191193
| 
 | ||||
| 
 | ||||
| Fig. 1. The framework of Dynamic-VINS. The contributing modules are highlighted and surrounded by dash lines with different colors. Three main threads run | ||||
| in parallel in Dynamic-VINS. Features are tracked and detected in the feature tracking thread. The object detection thread detects dynamic objects in each frame in | ||||
| real-time. The state optimization thread summarizes the features information, object detection results, and depth image to recognize the dynamic features. Finally, | ||||
| stable features and IMU preintegration results are used for pose estimation. | ||||
| 
 | ||||
platforms. However, these works still struggle to balance efficiency and accuracy for mobile robot applications.

To address all these issues, this paper proposes a real-time RGB-D inertial odometry for resource-restricted robots in dynamic environments named Dynamic-VINS. It enables edge computing devices to provide instant, robust state feedback for mobile platforms with little computation burden. An efficient dynamic feature recognition module that does not require a high-precision depth camera can be used in mobile devices equipped with depth-measurement modules. The main contributions of this paper are as follows:

1) An efficient optimization-based RGB-D inertial odometry is proposed to provide real-time state estimation results for resource-restricted robots in dynamic and complex environments.
2) Lightweight feature detection and tracking are proposed to cut the computing burden. In addition, dynamic feature recognition modules combining object detection and depth information are proposed to provide robust dynamic feature recognition in complex and outdoor environments.
3) Validation experiments are performed to show the proposed system's competitive accuracy, robustness, and efficiency on resource-restricted platforms in dynamic environments.

II. SYSTEM OVERVIEW

The proposed SLAM system is extended from VINS-Mono [2] and VINS-RGBD [25]; our framework is shown in Fig. 1, and the contributing modules are highlighted with different colors. For efficiency, three main threads (surrounded by dashed lines) run in parallel in Dynamic-VINS: object detection, feature tracking, and state optimization. Color images are passed to both the object detection thread and the feature tracking thread. IMU measurements between two consecutive frames are preintegrated [26] for feature tracking, the moving consistency check, and state optimization.

In the feature tracking thread, features are tracked with the help of IMU preintegration and detected by grid-based feature detection. The object detection thread detects dynamic objects in each frame in real time. Then, the state optimization thread summarizes the feature information, object detection results, and depth image to recognize the dynamic features. A missed detection compensation module is conducted in case of missed detection. The moving consistency check procedure combines the IMU preintegration and historical pose estimation results to identify potential dynamic features. Finally, stable features and IMU preintegration results are used for the pose estimation, and the propagation of the IMU is responsible for an IMU-rate pose estimation result. Loop closure is also supported in this system, but this paper pays more attention to localization independent of loop closure.

III. METHODOLOGY

This study proposes lightweight, high-quality feature tracking and detection methods to accelerate the system. Semantic and geometry information from the input RGB-D images and IMU preintegration are applied for dynamic feature recognition and the moving consistency check. The missed detection compensation module plays a subsidiary role to object detection in case of missed detection. Dynamic features on unknown objects are further identified by the moving consistency check. The proposed methods are divided into five parts for a detailed description.

A. Feature Matching

For each incoming image, the feature points are tracked using the KLT sparse optical flow method [27]. In this paper, the IMU measurements between frames are used to predict the motion of features. A better initial position estimate of each feature is thus provided, which improves the efficiency of feature tracking by reducing the number of optical flow pyramid layers. It can also effectively discard unstable features such as noise and dynamic features with inconsistent motion. The basic idea is illustrated in Fig. 2.

In the previous frame, stable features are colored red, and newly detected features are colored blue. When the current frame arrives, the IMU measurements between the current and previous frames are used to predict the feature positions (green) in the current frame. The optical flow uses the predicted feature position as the initial position to look for a matched feature in the current frame.
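To make the idea above concrete, the following is a minimal sketch (not the authors' implementation) of KLT tracking seeded with IMU-predicted feature positions; `predicted_pts` stands for the output of a hypothetical IMU-based motion-prediction step, and the window size and pyramid depth are placeholder values.

```python
import cv2
import numpy as np

def track_with_imu_prediction(prev_img, cur_img, prev_pts, predicted_pts):
    """KLT tracking seeded with IMU-predicted feature positions.

    prev_pts, predicted_pts: float32 arrays of shape (N, 1, 2).
    Returns the tracked points and a boolean mask of stable features.
    """
    cur_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, prev_pts, predicted_pts.copy(),
        winSize=(21, 21),
        maxLevel=1,                      # fewer pyramid layers thanks to the IMU prior
        flags=cv2.OPTFLOW_USE_INITIAL_FLOW)
    stable = status.reshape(-1) == 1     # features that failed to track are "unstable"
    return cur_pts, stable
```

Passing the prediction through OPTFLOW_USE_INITIAL_FLOW is what allows a shallow pyramid to suffice when the IMU prior is good.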
| 
 | ||||
| 
 | ||||
Fig. 2. Illustration of feature tracking and detection. Stable features and new features are colored red and blue, respectively. The green circles denote the prediction for optical flow. The successfully tracked features turn red; otherwise, the features turn purple. The orange and purple dashed-line circles serve as masks for a uniform feature distribution and reliable feature detection. New feature points are detected from unmasked areas in the current frame.

Fig. 3. Illustration of semantic mask setting for dynamic feature recognition when every pixel's depth is available (d > 0). The left scene shows the case where the depth of the farthest corner of an object's bounding box exceeds the depth of its center by more than a threshold ε; a semantic mask with weighted depth is then set between them to separate features on dynamic objects from the background. Otherwise, the semantic mask is set behind the bounding box's center at a distance of ε, as shown on the right.
| 
 | ||||
The successfully tracked features turn red, while those that fail to be tracked are marked as unstable features (purple). In order to avoid the repetition and aggregation of feature detection, an orange circular mask centered on each stable feature is set; the region where the unstable features are located is considered an unstable feature detection region and masked with a purple circle to avoid unstable feature detection. According to the mask, new features are detected from the unmasked areas in the current frame and colored blue.

The above means can obtain uniformly distributed features to capture comprehensive constraints and avoid repeatedly extracting unstable features in areas with blur or weak texture. Long-term feature tracking can reduce the time consumption with the help of the grid-based feature detection described in the following.

B. Grid-Based Feature Detection

The system maintains a minimum number of features for stability. Therefore, feature points need to be extracted from the frame constantly. This study adopts grid-based feature detection. The image is divided into grids, and the boundary of each grid is padded to prevent the features at the edge of the grid from being ignored; the padding enables the current grid to obtain adjacent pixel information for feature detection. Unlike traversing the whole image to detect features, only the grids with insufficient matched features conduct feature detection. A grid cell that fails to detect features due to weak texture, or that is covered by the mask, will be skipped in the next detection frame to avoid repeated useless detection. The thread pool technique is used to exploit the parallel performance of grid-based feature detection. Thus, the time consumption of feature detection is significantly reduced without loss.

The FAST feature detector [28] can efficiently extract feature points but easily treats noise as features and extracts similar clustered features. Therefore, the mask ideas from Section III-A and Non-Maximum Suppression are combined to select high-quality and uniformly distributed FAST features.

C. Dynamic Feature Recognition

Most feature points can be stably tracked through the above improvement. However, long-term tracked features on dynamic objects always come with abnormal motion and introduce wrong constraints to the system. For the sake of efficiency and computational cost, a real-time single-stage object detection method, YOLOv3 [11], is used to detect many kinds of dynamic scene elements such as people and vehicles. If a detected bounding box covers a large region of the image, blindly deleting the feature points in the bounding box might leave no available features to provide constraints. Therefore, semantic-segmentation-like masks are helpful to keep the system running by tracking features not occluded by dynamic objects.

This paper combines object detection and depth information for highly efficient dynamic feature recognition that achieves performance comparable to semantic segmentation. The farther the depth camera measures, the worse its accuracy is. This problem makes some methods that rely fully on the depth information, such as Seed Filling, DBSCAN, and K-Means, exhibit poor performance with a low-accuracy depth camera, as shown in Fig. 5(a). Therefore, a set of points in the detected bounding box and the depth information are integrated to obtain performance comparable to semantic segmentation, as illustrated in Fig. 3.

A pixel's depth d is available if d > 0; otherwise, d = 0. The bounding box corners of most dynamic objects correspond to background points, and dynamic objects commonly have a relatively large depth gap with the background. The Kth dynamic object's largest background depth ${}^{K}d_{max}$ is obtained as follows:

$$
{}^{K}d_{max}=\max\left\{{}^{K}d_{tl},\,{}^{K}d_{tr},\,{}^{K}d_{bl},\,{}^{K}d_{br}\right\},\tag{1}
$$

where ${}^{K}d_{tl}$, ${}^{K}d_{tr}$, ${}^{K}d_{bl}$, and ${}^{K}d_{br}$ are the depth values of the Kth object detection bounding box's corners, respectively. Next, the Kth bounding box's depth threshold ${}^{K}\bar{d}$ is defined as

$$
{}^{K}\bar{d}=\begin{cases}
\frac{1}{2}\left({}^{K}d_{max}+{}^{K}d_{c}\right), & \text{if } {}^{K}d_{max}-{}^{K}d_{c}>\epsilon,\ {}^{K}d_{c}>0,\\[2pt]
{}^{K}d_{c}+\epsilon, & \text{if } {}^{K}d_{max}-{}^{K}d_{c}<\epsilon,\ {}^{K}d_{c}>0,\\[2pt]
{}^{K}d_{max}, & \text{if } {}^{K}d_{max}>0,\ {}^{K}d_{c}=0,\\[2pt]
+\infty, & \text{otherwise},
\end{cases}\tag{2}
$$

where ${}^{K}d_{c}$ is the depth value of the bounding box's center and ε > 0 is a predefined distance chosen according to the most common dynamic objects' size in scenes.
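As a rough illustration of the grid-based feature detection in Section III-B above (a simplified sketch under assumed parameters, not the Dynamic-VINS implementation; the thread-pool parallelism and the skip list for repeatedly failing cells are omitted), FAST is run only in grid cells that currently hold too few tracked features, with each cell padded so that boundary features are not lost:

```python
import cv2
import numpy as np

def detect_in_grids(img, mask, tracked_pts, rows=7, cols=8, pad=4, min_per_cell=3):
    """Run FAST only in grid cells that hold too few tracked features.

    mask: uint8 image, 0 where detection is forbidden (circular masks, dynamic areas).
    tracked_pts: (N, 2) array of currently tracked feature positions.
    """
    h, w = img.shape[:2]
    ch, cw = h // rows, w // cols
    counts = np.zeros((rows, cols), dtype=int)
    for x, y in tracked_pts:
        counts[min(int(y) // ch, rows - 1), min(int(x) // cw, cols - 1)] += 1

    fast = cv2.FastFeatureDetector_create(threshold=20)
    new_pts = []
    for r in range(rows):
        for c in range(cols):
            if counts[r, c] >= min_per_cell:
                continue                      # enough features in this cell already
            y0, y1 = max(r * ch - pad, 0), min((r + 1) * ch + pad, h)
            x0, x1 = max(c * cw - pad, 0), min((c + 1) * cw + pad, w)
            cell, cell_mask = img[y0:y1, x0:x1], mask[y0:y1, x0:x1]
            for kp in fast.detect(cell, cell_mask):
                new_pts.append((kp.pt[0] + x0, kp.pt[1] + y0))   # back to image coords
    return np.array(new_pts, dtype=np.float32)
```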
| 
 | ||||
| 
 | ||||
| Fig. 4. Results of missed detection compensation. The dynamic feature recognition results are shown in the first row. The green box shows the dynamic object’s | ||||
| position from the object detection results. The second row shows the generated semantic mask. With the help of missed detection compensation, even if object | ||||
| detection failed in (b) and (d), a semantic mask including all dynamic objects could be built. | ||||
| 
 | ||||
The depth threshold ${}^{K}\bar{d}$ is defined in the middle of the center's depth ${}^{K}d_{c}$ and the deepest background depth ${}^{K}d_{max}$. When the dynamic object has a close connection with the background or is behind another object (${}^{K}d_{max}-{}^{K}d_{c}<\epsilon$), the depth threshold is defined at a distance ε from the dynamic object. If the depth is unavailable, a conservative strategy is adopted and an infinite depth is chosen as the threshold.

On the semantic mask, the area covered by the Kth dynamic object's bounding box is set to the weighted depth ${}^{K}\bar{d}$; the area without dynamic objects is set to 0. Each incoming feature's depth d is compared with the corresponding pixel's depth threshold $\bar{d}$ on the semantic mask. If $d<\bar{d}$, the feature is considered a dynamic one; otherwise, the feature is considered a stable one. Therefore, the region where the depth value is smaller than the weighted depth $\bar{d}$ constitutes the generalized semantic mask, as shown in Figs. 4 and 5(b).

Fig. 5. Results of dynamic feature recognition. The stable features are circled in yellow. The dynamic feature recognition results generated by Seed Filling and the proposed method are shown in (a) and (b), respectively. The weighted depth $\bar{d}$ is colored gray; brighter means a bigger value. A feature point on the white area will be marked as a dynamic feature.

Considering that dynamic objects may stay in the field of view for a long time, the dynamic features are tracked but not used for pose estimation, instead of being deleted directly. According to its recorded information, each incoming feature point from the feature tracking thread is judged as a historical dynamic feature or not. The above methods avoid blindly deleting feature points while ensuring efficiency. They save the time of detecting features on dynamic objects, are robust to missed object detections, and recycle false-positive dynamic features, as illustrated in Section III-E.

D. Missed Detection Compensation

Since object detection might sometimes fail, the proposed Dynamic-VINS utilizes the previous detection results to predict the following detection result and thus compensate for missed detections. It is assumed that the dynamic objects in adjacent frames have a consistent motion. Once a dynamic object is detected, its pixel velocity and bounding box are updated. Assuming that j is the current detected frame and j − 1 is the previous detected frame, the pixel velocity ${}^{K}v_{c_{j}}$ (pixel/frame) of the Kth dynamic object between frames is defined as

$$
{}^{K}v_{c_{j}}={}^{K}u^{c}_{c_{j}}-{}^{K}u^{c}_{c_{j-1}},\tag{3}
$$

where ${}^{K}u^{c}_{c_{j}}$ and ${}^{K}u^{c}_{c_{j-1}}$ represent the pixel locations of the Kth object detection bounding box's center in the jth frame and the (j − 1)th frame, respectively. A weighted predicted velocity ${}^{K}\hat{v}$ is defined as

$$
{}^{K}\hat{v}_{c_{j+1}}=\frac{1}{2}\left({}^{K}v_{c_{j}}+{}^{K}\hat{v}_{c_{j}}\right).\tag{4}
$$

As the update goes on, the velocities of older frames receive a lower weight in ${}^{K}\hat{v}$. If the object fails to be detected in the next frame, the bounding box ${}^{K}Box$, containing the corners' pixel locations ${}^{K}u_{tl}$, ${}^{K}u_{tr}$, ${}^{K}u_{bl}$, and ${}^{K}u_{br}$, is updated based on the predicted velocity ${}^{K}\hat{v}$ as follows:

$$
{}^{K}\widehat{Box}_{c_{j+1}}={}^{K}Box_{c_{j}}+{}^{K}\hat{v}_{c_{j+1}}.\tag{5}
$$

When the number of consecutive missed detections exceeds a threshold, this dynamic object's compensation is abandoned. The result is shown in Fig. 4. It improves the recall rate of object detection and is helpful for a more consistent dynamic feature recognition.
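The weighted-depth test of Eqs. (1)–(2) and the per-feature comparison against the semantic mask can be sketched as follows (a simplified illustration under assumed conventions, not the paper's code; the ε value is an arbitrary placeholder and boxes are given as integer pixel rectangles):

```python
import numpy as np

def box_depth_threshold(depth, box, eps=0.8):
    """Weighted depth threshold for one detected bounding box, following Eqs. (1)-(2).

    depth: float depth image in metres (0 where invalid); box: (x0, y0, x1, y1).
    eps is the predefined distance epsilon (placeholder value).
    """
    x0, y0, x1, y1 = box
    corners = [depth[y0, x0], depth[y0, x1 - 1], depth[y1 - 1, x0], depth[y1 - 1, x1 - 1]]
    d_max = max(corners)                               # largest background depth, Eq. (1)
    d_c = depth[(y0 + y1) // 2, (x0 + x1) // 2]        # depth of the box centre
    if d_c > 0 and d_max - d_c > eps:
        return 0.5 * (d_max + d_c)
    if d_c > 0 and d_max - d_c < eps:
        return d_c + eps
    if d_max > 0 and d_c == 0:
        return d_max
    return np.inf                                      # conservative fallback, Eq. (2)

def is_dynamic(feature_uv, feature_depth, boxes, depth):
    """A feature is dynamic if it lies in a box and is closer than the box threshold."""
    u, v = feature_uv
    for box in boxes:
        x0, y0, x1, y1 = box
        if x0 <= u < x1 and y0 <= v < y1 and feature_depth < box_depth_threshold(depth, box):
            return True
    return False
```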
| 
 | ||||
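Similarly, the missed detection compensation of Eqs. (3)–(5) amounts to keeping a per-object, exponentially weighted pixel velocity and shifting the last bounding box by it while detections are missing. A minimal sketch (class and parameter names are illustrative, not from the paper):

```python
import numpy as np

class BoxPredictor:
    """Propagate a bounding box when detection misses a frame, following Eqs. (3)-(5)."""

    def __init__(self, box, max_missed=3):
        self.box = np.asarray(box, dtype=float)   # (x0, y0, x1, y1)
        self.v_hat = np.zeros(2)                  # weighted pixel velocity of the centre
        self.prev_center = self._center()
        self.missed = 0
        self.max_missed = max_missed

    def _center(self):
        x0, y0, x1, y1 = self.box
        return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])

    def update(self, detected_box):
        """Detection available: measure velocity, Eq. (3), and blend it, Eq. (4)."""
        self.box = np.asarray(detected_box, dtype=float)
        v = self._center() - self.prev_center
        self.v_hat = 0.5 * (v + self.v_hat)
        self.prev_center = self._center()
        self.missed = 0

    def predict(self):
        """Detection missed: shift the whole box by the predicted velocity, Eq. (5)."""
        self.missed += 1
        if self.missed > self.max_missed:
            return None                           # give up compensating this object
        self.box = self.box + np.tile(self.v_hat, 2)   # shift (x0, y0) and (x1, y1)
        return self.box
```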
| 
 | ||||
E. Moving Consistency Check

Since object detection can only recognize artificially defined dynamic objects and has a missed detection problem, the state optimization will still be affected by unknown moving objects, such as books moved by people. Dynamic-VINS combines the pose predicted by the IMU and the optimized poses in the sliding window to recognize dynamic features.

Consider the kth feature, first observed in the ith image and observed by m other images in the sliding window. The average reprojection residual $r_{k}$ of the feature observations in the sliding window is defined as

$$
r_{k}=\frac{1}{m}\sum_{j}\left\|u_{k}^{c_{i}}-\pi\!\left(T_{bc}^{-1}\,T_{wb_{i}}^{-1}\,T_{wb_{j}}\,T_{bc}\,P_{k}^{c_{j}}\right)\right\|,\tag{6}
$$

where the sum runs over the m frames j that observe the feature; $u_{k}^{c_{i}}$ is the observation of the kth feature in the ith frame; $P_{k}^{c_{j}}$ is the 3-D location of the kth feature in the jth camera frame; $T_{bc}$ and $T_{wb_{j}}$ are the transforms from the camera frame to the body frame and from the jth body frame to the world frame, respectively; and $\pi$ represents the camera projection model. When $r_{k}$ is over a preset threshold, the kth feature is considered a dynamic feature.

As shown in Fig. 7, the moving consistency check (MCC) module can find unstable features. However, some stable features are misidentified (top-left image), and features on standing people are not recognized (bottom-right image). A low threshold keeps a high recall rate of unstable features. Furthermore, a misidentified unstable feature with more observations will be recycled if its reprojection error is lower than the threshold.

IV. EXPERIMENTAL RESULTS

Quantitative experiments1 are performed to evaluate the proposed system's accuracy, robustness, and efficiency. The public SLAM evaluation datasets OpenLORIS-Scene [29] and TUM RGB-D [30] provide sensor data and ground truth to evaluate SLAM systems in complex dynamic environments. Since our system is built on VINS-Mono [2] and VINS-RGBD [25], they are used as the baselines to demonstrate our improvement. VINS-Mono [2] provides robust and accurate visual-inertial odometry by fusing IMU preintegration and feature observations. VINS-RGBD [25] integrates an RGB-D camera based on VINS-Mono for better performance. Furthermore, DS-SLAM [15] and Ji et al. [24], state-of-the-art semantic algorithms based on ORB-SLAM2 [4], are also included for comparison.

The accuracy is evaluated by the Root-Mean-Square Error (RMSE) of the Absolute Trajectory Error (ATE), the Translational Relative Pose Error (T.RPE), and the Rotational Relative Pose Error (R.RPE). The Correct Rate (CR) [29], measuring the correct rate over the whole period of data, is used to evaluate the robustness. The RMSE of an algorithm is calculated only over its successful tracking outputs. Therefore, the longer an algorithm tracks successfully, the more error it is likely to accumulate. This implies that evaluating algorithms purely by ATE could be misleading; on the other hand, considering only CR could also be misleading.

In order to demonstrate the efficiency of the proposed system, all experiments of Dynamic-VINS are performed on the embedded edge computing devices HUAWEI Atlas200 DK and NVIDIA Jetson AGX Xavier, and the compared algorithms' results are included from their original papers. The Atlas200 DK has an 8-core A55 Arm CPU (1.6 GHz), 8 GB of RAM, and a 2-core HUAWEI DaVinci NPU. The Jetson AGX Xavier has an 8-core ARMv8.2 64-bit CPU (2.25 GHz), 16 GB of RAM, and a 512-core NVIDIA Volta GPU. The results tested on the two devices are named Dynamic-VINS-Atlas and Dynamic-VINS-Jetson, respectively. To the best of our knowledge, the proposed method is the best-performing real-time RGB-D inertial odometry for dynamic environments on resource-restricted embedded platforms.

A. OpenLORIS-Scene Dataset

OpenLORIS-Scene [29] is a real-world indoor dataset with a large variety of challenging scenarios such as dynamic scenes, featureless frames, and dim illumination. The results on the OpenLORIS-Scene dataset are shown in Fig. 6, including the results of VINS-Mono, ORB-SLAM2, and DS-SLAM from [29] as baselines.

The OpenLORIS dataset includes five scenes and 22 sequences in total. The proposed Dynamic-VINS shows the best robustness among the tested algorithms. In the office scenes, which are primarily static environments, all the algorithms can track successfully and achieve decent accuracy. It is challenging for the pure visual SLAM systems to track stable features in the home and corridor scenes, which contain large areas of textureless walls and dim lighting. Thanks to the IMU sensor, the VINS systems show superior robustness when the camera is unreliable. The home and cafe scenarios contain a number of sitting people with a bit of motion, while the market scenes contain lots of moving pedestrians and objects with unpredictable motion. The market scenes also cover the largest area and contain highly dynamic objects, as shown in Fig. 5. Although DS-SLAM is able to filter out some dynamic features, its performance is still unsatisfactory. VINS-RGBD performs similarly to Dynamic-VINS in relatively static scenes, while its accuracy drops in the highly dynamic market scenes. The proposed Dynamic-VINS can effectively deal with complex dynamic environments and improve robustness and accuracy.

B. TUM RGB-D Dataset

The TUM RGB-D dataset [30] offers several sequences containing dynamic objects in indoor environments. The highly dynamic fr3_walking sequences are chosen for evaluation, where two people walk around a desk and change the chairs' positions while the camera moves in different motions. As the VINS system does not support a VO mode and the TUM RGB-D dataset does not provide IMU measurements, a VO mode is implemented by simply disabling the IMU-related modules in Dynamic-VINS for these experiments. The results are shown in Table I. The compared methods' results are included from their originally published papers. The algorithms based on ORB-SLAM2 and semantic segmentation perform better.
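For reference, the moving consistency check of Eq. (6) reduces to averaging a feature's reprojection error over the window and thresholding it. A minimal numeric sketch under assumed conventions (4 × 4 homogeneous transforms, a pinhole intrinsic matrix K; the threshold value in the comment is an arbitrary placeholder):

```python
import numpy as np

def average_reprojection_residual(u_i, P_cj, T_bc, T_wb_i, T_wb_j_list, K):
    """Average residual of Eq. (6) for one feature.

    u_i        : observed pixel in frame i, shape (2,)
    P_cj       : 3-D point in the camera frame of each observing frame j, list of (3,)
    T_bc       : 4x4 camera-to-body transform
    T_wb_i     : 4x4 body-to-world transform of frame i
    T_wb_j_list: 4x4 body-to-world transforms of the observing frames j
    K          : 3x3 camera intrinsics
    """
    residuals = []
    for T_wb_j, P in zip(T_wb_j_list, P_cj):
        P_h = np.append(P, 1.0)
        # camera_j -> body_j -> world -> body_i -> camera_i
        P_ci = np.linalg.inv(T_bc) @ np.linalg.inv(T_wb_i) @ T_wb_j @ T_bc @ P_h
        u_proj = (K @ P_ci[:3])[:2] / P_ci[2]
        residuals.append(np.linalg.norm(u_i - u_proj))
    return np.mean(residuals)

# A feature is flagged as dynamic when the residual exceeds a preset threshold,
# e.g. average_reprojection_residual(...) > 3.0  (pixel value chosen arbitrarily here).
```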
| 
 | ||||
| 1The experimental video is available at https://youtu.be/y0U1IVtFBwY. | ||||
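The ATE RMSE reported above is, after trajectory alignment and timestamp association, just the root-mean-square of the translational errors. A minimal sketch (the alignment step itself, e.g. an SE(3)/Umeyama fit, is assumed to have been done already):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of Absolute Trajectory Error over aligned, time-associated positions (N, 3)."""
    err = est_xyz - gt_xyz                      # per-pose translational errors
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```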
| 
 | ||||
| 
 | ||||
| Fig. 6. Per-sequence testing results with the OpenLORIS-Scene datasets. Each black dot on the top line represents the start of one data sequence. For each | ||||
| algorithm, blue dots indicate successful initialization moments, and blue lines indicate successful tracking span. The percentage value on the top left of each scene | ||||
| is the average correct rate; the higher the correct rate of an algorithm, the more robust it is. The float value on the first line below is average ATE RMSE and the | ||||
| values on the second line below are T.RPE and R.RPE from left to right, and smaller means more accurate. | ||||
| 
 | ||||
|                                                                                             TABLE I | ||||
RESULTS OF RMSE OF ATE [m], T.RPE [m/s], AND R.RPE [◦/s] ON THE TUM RGB-D fr3_walking DATASETS
| 
 | ||||
|                                                                                  TABLE II | ||||
ABLATION EXPERIMENT RESULTS OF RMSE OF ATE [m], T.RPE [m/s], AND R.RPE [◦/s] ON THE TUM RGB-D fr3_walking DATASETS
| 
 | ||||
Fig. 7. Results of the Moving Consistency Check. Features without a yellow circle are the outliers marked by the Moving Consistency Check module.

Although Dynamic-VINS is not designed for pure visual odometry, it still shows competitive performance and a significant improvement over ORB-SLAM2.

To validate the effectiveness of each module in Dynamic-VINS, ablation experiments are conducted as shown in Table II. The system without the circular masks (W/O CIRCULAR MASK) from Section III-A and Section III-B fails to extract evenly distributed stable features, which seriously degrades the accuracy. Without object detection (W/O OBJECT DETECTION), dynamic features introduce wrong constraints that impair the system's accuracy. Dynamic-VINS-W/O-SEG-LIKE-MASK shows the results of masking all features in the bounding boxes; the background features help the system maintain as many stable features as possible to provide more visual constraints. The moving consistency check plays an important role when object detection fails, as shown in the column W/O-MCC.

C. Runtime Analysis

This part compares VINS-Mono, VINS-RGBD, and Dynamic-VINS for runtime analysis. These methods are expected to track and detect 130 feature points, and the frames in Dynamic-VINS are divided into 7 × 8 grids. The object detection runs on the NPU/GPU in parallel to the CPU. The average computation times of each module and thread are calculated on the OpenLORIS market scenes; the results on both embedded platforms are shown in Table III. It should be noted that the average computation time is only updated when the module is used. Specifically, in the VINS architecture, the feature detection is executed at a frequency consistent with the state optimization thread, which means the frequency of feature detection is lower than that of the Feature Tracking Thread.
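The bookkeeping behind Table III — a per-module average that is only updated on the frames in which that module actually runs — can be captured with a small helper like the following (illustrative names, not part of the released code):

```python
import time
from collections import defaultdict

class ModuleTimer:
    """Running average of per-module computation time, updated only when the module runs."""

    def __init__(self):
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    def measure(self, name, fn, *args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.total_ms[name] += (time.perf_counter() - start) * 1000.0
        self.calls[name] += 1
        return result

    def average_ms(self, name):
        return self.total_ms[name] / self.calls[name] if self.calls[name] else 0.0
```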
| 
 | ||||
| 
 | ||||
|                                                                                            TABLE III | ||||
|                                   AVERAGE COMPUTATION TIME [ms] OF EACH MODULE AND THREAD ON OPENLORIS market SCENES | ||||
| 
 | ||||
| * Tracking Thread, Optimization Thread and Object Detection correspond to the three different threads shown in Fig. 1, respectively. | ||||
| † Dynamic Feature Recognition Modules sum up the Dynamic Feature Recognition, Missed Detection Compensation, and Moving Consistency Check modules. | ||||
| 
 | ||||
Fig. 8. A compact aerial robot equipped with an RGB-D camera, an autopilot with IMUs, an onboard computer, and an embedded edge computing device. The whole size is about 255 × 165 mm.

On edge computing devices with AI accelerator modules, the single-stage object detection method is computed by an NPU or GPU without consuming CPU resources and can output inference results in real time. With the same parameters, Dynamic-VINS shows a significant improvement in feature detection efficiency on both embedded platforms and is the only one able to achieve instant feature tracking and detection on the HUAWEI Atlas200 DK. The dynamic feature recognition modules (Dynamic Feature Recognition, Missed Detection Compensation, and Moving Consistency Check) used to recognize dynamic features take only a tiny part of the total time. For real-time applications, the system is able to output a faster frame-to-frame pose and a higher-frequency IMU-propagated pose rather than waiting for the complete optimization result.

D. Real-World Experiments

A compact aerial robot is shown in Fig. 8. An RGB-D camera (Intel RealSense D455) provides 30 Hz color and aligned depth images. An autopilot (CUAV X7pro) with an onboard IMU (ADIS16470, 200 Hz) is used to provide IMU measurements. The aerial robot is equipped with an onboard computer (Intel NUC, i7-5557U CPU) and an embedded edge computing device (HUAWEI Atlas200 DK). These two computation resource providers play different roles on the aerial robot. The onboard computer is in charge of peripheral management and other core functions requiring more CPU resources, such as planning and mapping. The edge computing device, as auxiliary equipment, offers instant state feedback and object detection results to the onboard computer.

Fig. 9. The estimated trajectories in the outdoor environment aligned with the Google map. The green line is the estimated trajectory from Dynamic-VINS, the red line is from VINS-RGBD, and the yellow line represents the loop closure that happened at the end of the dataset.

Fig. 10. Results of dynamic feature recognition in outdoor environments. The dynamic feature recognition modules are still able to segment dynamic objects but with a larger mask region.

Large-scale outdoor datasets with moving people and vehicles on the HITSZ and THUSZ campuses are recorded with the aerial robot described above carried by hand for safety. The total path lengths are approximately 800 m and 1220 m, respectively. The datasets have a similar scene at the beginning and the end for loop closure, while loop closure fails on the THUSZ campus dataset. VINS-RGBD and Dynamic-VINS run the datasets on an NVIDIA Jetson AGX Xavier. The estimated trajectories and the loop closure trajectory aligned with the Google map are shown in Fig. 9. In outdoor environments, the depth camera is limited in range and affected by sunlight. The dynamic feature recognition modules can still segment dynamic objects, but with a larger mask region, as shown in Fig. 10. Compared with the loop closure results, Dynamic-VINS provides a robust and stable pose estimation with little drift.
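As context for the "IMU-propagated pose" mentioned in the runtime discussion, the high-rate output is obtained by propagating the latest optimized state with raw IMU samples between visual frames. A generic strapdown propagation step is sketched below (standard integration under an assumed gravity convention and bias-corrected measurements, not the VINS-Mono code):

```python
import numpy as np

def propagate_imu(p, v, R, acc, gyro, dt, g=np.array([0.0, 0.0, -9.81])):
    """One IMU-rate propagation step of position p, velocity v, rotation R (3x3 body-to-world).

    acc, gyro: bias-corrected accelerometer / gyroscope samples in the body frame.
    """
    theta = gyro * dt                          # incremental rotation vector
    angle = np.linalg.norm(theta)
    if angle > 1e-12:
        axis = theta / angle
        K = np.array([[0, -axis[2], axis[1]],
                      [axis[2], 0, -axis[0]],
                      [-axis[1], axis[0], 0]])
        dR = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)  # Rodrigues
    else:
        dR = np.eye(3)
    a_world = R @ acc + g                      # specific force rotated to world + gravity
    p = p + v * dt + 0.5 * a_world * dt * dt
    v = v + a_world * dt
    R = R @ dR
    return p, v, R
```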
| 
 | ||||
| 
 | ||||
V. CONCLUSION

This paper presents a real-time RGB-D inertial odometry for resource-restricted robots in dynamic environments. Cost-efficient feature tracking and detection methods are proposed to cut down the computing burden. A lightweight object-detection-based method is introduced to deal with dynamic features in real time. Validation experiments show the proposed system's competitive accuracy, robustness, and efficiency in dynamic environments. Furthermore, Dynamic-VINS is able to run on resource-restricted platforms and output an instant pose estimation. In the future, the proposed approaches are expected to be validated on existing popular SLAM frameworks. The missed detection compensation module is expected to develop into a moving object tracking module, and semantic information will be further introduced for high-level guidance of mobile robots or mobile devices in complex dynamic environments.

REFERENCES

[1] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2018.
[2] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, Aug. 2018.
[3] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, "OpenVINS: A research platform for visual-inertial estimation," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 4666–4672.
[4] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[5] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment—A modern synthesis," in Proc. Int. Workshop Vis. Algorithms, 1999, pp. 298–372.
[6] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[7] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auton. Syst., vol. 89, pp. 110–122, 2017.
[8] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss, "ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 7855–7862.
[9] W. Dai et al., "RGB-D SLAM in dynamic environments using point correlations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 373–389, Jan. 2022.
[10] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[11] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[12] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[13] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[14] L. Xiao et al., "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1–16, 2019.
[15] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 1168–1174.
[16] B. Bescos, J. M. Facil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[17] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, "Detect-SLAM: Making object detection and SLAM mutually beneficial," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 1001–1010.
[18] Y. Liu and J. Miura, "RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods," IEEE Access, vol. 9, pp. 23772–23785, 2021.
[19] I. Ballester, A. Fontán, J. Civera, K. H. Strobl, and R. Triebel, "DOT: Dynamic object tracking for visual SLAM," in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 11705–11711.
[20] K. Schauwecker, N. R. Ke, S. A. Scherer, and A. Zell, "Markerless visual control of a quad-rotor micro aerial vehicle by means of on-board stereo processing," in Proc. Auton. Mobile Syst., 2012, pp. 11–20.
[21] Z. Z. Nejad and A. Hosseininaveh Ahmadabadian, "ARM-VO: An efficient monocular visual odometry for ground vehicles on ARM CPUs," Mach. Vis. Appl., vol. 30, no. 6, pp. 1061–1070, 2019.
[22] S. Bahnam, S. Pfeiffer, and G. C. H. E. de Croon, "Stereo visual inertial odometry for robots with limited computational resources," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021, pp. 9154–9159.
[23] G. Younes et al., "Keyframe-based monocular SLAM: Design, survey, and future directions," Robot. Auton. Syst., vol. 98, pp. 67–88, 2017.
[24] T. Ji, C. Wang, and L. Xie, "Towards real-time semantic RGB-D SLAM in dynamic environments," in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 11175–11181.
[25] Z. Shan, R. Li, and S. Schwertfeger, "RGBD-inertial trajectory estimation and mapping for ground robots," Sensors, vol. 19, no. 10, 2019, Art. no. 2251.
[26] C. Forster et al., "IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," in Proc. Robot.: Sci. Syst., 2015.
[27] B. D. Lucas et al., "An iterative image registration technique with an application to stereo vision," in Proc. DARPA Image Understanding Workshop, 1981, pp. 121–130.
[28] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443.
[29] X. Shi et al., "Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 3139–3145.
[30] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 573–580.
| 
 | ||||
|  | ||||
| @ -0,0 +1,665 @@ | ||||
| IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 72, 2023                                           7501012 | ||||
| 
 | ||||
| SG-SLAM: A Real-Time RGB-D Visual SLAM | ||||
|   Toward Dynamic Scenes With Semantic and | ||||
|                Geometric Information | ||||
| 
 | ||||
|     Shuhong Cheng , Changhe Sun , Shijun Zhang , Student Member, IEEE, and Dianfan Zhang | ||||
| 
 | ||||
Abstract—Simultaneous localization and mapping (SLAM) is one of the fundamental capabilities for intelligent mobile robots to perform state estimation in unknown environments. However, most visual SLAM systems rely on the static scene assumption and consequently have severely reduced accuracy and robustness in dynamic scenes. Moreover, the metric maps constructed by many systems lack semantic information, so the robots cannot understand their surroundings at a human cognitive level. In this article, we propose SG-SLAM, a real-time RGB-D semantic visual SLAM system based on the ORB-SLAM2 framework. First, SG-SLAM adds two new parallel threads: an object detecting thread to obtain 2-D semantic information and a semantic mapping thread. Then, a fast dynamic feature rejection algorithm combining semantic and geometric information is added to the tracking thread. Finally, 3-D point clouds and 3-D semantic objects are generated in the semantic mapping thread and published to the robot operating system (ROS) for visualization. We performed an experimental evaluation on the TUM dataset, the Bonn dataset, and the OpenLORIS-Scene dataset. The results show that SG-SLAM is not only one of the most real-time, accurate, and robust systems in dynamic scenes but also allows the creation of intuitive semantic metric maps.

Index Terms—Dynamic scenes, geometric constraint, semantic metric map, visual-based measurement, visual simultaneous localization and mapping (SLAM).

Manuscript received 25 August 2022; revised 31 October 2022; accepted 23 November 2022. Date of publication 9 December 2022; date of current version 17 January 2023. This work was supported in part by the National Key Research and Development Program under Grant 2021YFB3202303, in part by the S&T Program of Hebei under Grant 20371801D, in part by the Hebei Provincial Department of Education for Cultivating Innovative Ability of Postgraduate Students under Grant CXZZBS2022145, and in part by the Hebei Province Natural Science Foundation Project under Grant E2021203018. The Associate Editor coordinating the review process was Dr. Jae-Ho Han. (Corresponding authors: Shijun Zhang; Dianfan Zhang.)

Shuhong Cheng and Changhe Sun are with the School of Electrical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: shhcheng@ysu.edu.cn; silencht@qq.com).

Shijun Zhang is with the School of Mechanical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: 980871977@qq.com).

Dianfan Zhang is with the Key Laboratory of Special Delivery Equipment, Yanshan University, Qinhuangdao 066004, China (e-mail: zdf@ysu.edu.cn).

Digital Object Identifier 10.1109/TIM.2022.3228006

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) has an important role in the state perception of mobile robots. It can help a robot in an unknown environment with an unknown pose to incrementally build a globally consistent map and simultaneously measure its pose in this map [1]. Due to the continuing and rapid development of cameras and computing systems, we have access to cheaper, faster, higher-quality, and smaller vision-based sensors. This also helps vision-based measurement (VBM) become more ubiquitous and applicable [2]. Hence, in the past years, a large number of excellent visual SLAM systems have emerged, such as PTAM [3], ORB-SLAM2 [4], DVO [5], and Kimera [6]. Some of these visual SLAM systems are quite mature and have achieved good performance under certain specific environmental conditions.

As SLAM enters the age of robust perception [7], systems face higher requirements in terms of robustness and high-level understanding. However, many classical vision-based SLAM systems still fall short of these requirements in some practical scenarios. On the one hand, most visual SLAM systems work under the static scene assumption, which makes them less accurate and less robust in real dynamic scenes (e.g., scenes containing walking people and moving vehicles). On the other hand, most existing SLAM systems only construct a globally consistent metric map of the robot's working environment [8]. However, the metric map does not help the robot to understand its surroundings at a higher semantic level.

Most visual SLAM algorithms rely on the static scene assumption, which is why the presence of dynamic objects can cause these algorithms to produce wrong data associations. The outliers obtained from dynamic objects can seriously impair the accuracy and stability of the algorithms. Even though these algorithms show superior performance in some specific scenarios, it is difficult to extend them to actual production and living scenarios containing dynamic objects. Some recent works, such as [9], [10], [11], and [12], have used methods that combine geometric and semantic information to eliminate the adverse effects of dynamic objects. These algorithms, mainly based on deep learning, achieve significant improvements in accuracy, but they suffer from shortcomings in scene generalizability or real-time performance due to various factors. Therefore, how to skillfully detect and process dynamic objects in the scene is crucial for the system to operate accurately, robustly, and in real time.

Traditional SLAM systems construct only a sparse metric map [3], [4]. This metric map consists of simple geometries (points, lines, and surfaces), and every pose is strictly related to the global coordinate system. Enabling a robot to perform advanced tasks with intuitive human–robot interaction requires it to understand its surroundings at a human
| 
 | ||||
| 
 | ||||
| 
 | ||||
| Fig. 1. Overview of the framework of the SG-SLAM system. The original work of ORB-SLAM2 is presented on an aqua-green background, while our main | ||||
| new or modified work is presented on a red background. | ||||
| 
 | ||||
cognitive level. However, the metric map lacks the necessary semantic information and therefore cannot provide this capability. With the rapid development of deep learning in recent years, some neural networks can effectively capture the semantic information in a scene. Therefore, the metric map can be extended to a semantic metric map by integrating semantic information. The semantic information contained in the semantic metric map can provide the robot with the capability to understand its surroundings at a higher level.

This article focuses on a dynamic feature rejection algorithm that integrates semantic and geometric information, which not only significantly improves the accuracy of system localization but also has excellent computational efficiency. Thus, our algorithm is very useful from an instrumentation and measurement point of view [2]. This article also focuses on how to construct the semantic metric map to improve the perceptual level at which the robot understands the surrounding scenes. The overall framework of the SG-SLAM system is shown in Fig. 1.

The main contributions of this article include the following.
1) A complete real-time RGB-D visual SLAM system called SG-SLAM is proposed using ORB-SLAM2 as a framework. Compared to ORB-SLAM2, it has higher accuracy and robustness in dynamic scenes and can publish a semantic metric map through the robot operating system (ROS) [13].
2) A fast dynamic feature rejection algorithm is proposed by combining geometric information and semantic information. The geometric information is calculated from the epipolar constraint between image frames (see the sketch after this list), while the semantic information about dynamic objects is obtained through an NCNN-based [14] object detection network in a new thread. The algorithm speed is greatly improved by appropriate modifications and a combination of classical methods while maintaining accuracy.
3) An independent semantic metric mapping thread that can generate semantic objects and OctoMaps [15] using the ROS interface is embedded in SG-SLAM. These maps can be useful in subsequent localization, navigation, and object capture tasks.
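The epipolar-constraint part of contribution 2 can be illustrated with a short sketch (an assumed, generic formulation rather than the authors' exact algorithm): estimate the fundamental matrix from matched features with RANSAC, then flag matches whose distance to their epipolar line exceeds a threshold as candidate dynamic points. The threshold value is a placeholder.

```python
import cv2
import numpy as np

def epipolar_outliers(pts1, pts2, dist_thresh=1.0):
    """Flag matches that violate the epipolar constraint between two frames.

    pts1, pts2: (N, 2) float32 matched pixel coordinates in the previous/current frame.
    Returns a boolean array, True where the match is a suspected dynamic point.
    """
    F, _inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return np.zeros(len(pts1), dtype=bool)   # estimation failed; flag nothing
    ones = np.ones((len(pts1), 1), dtype=np.float64)
    p1 = np.hstack([pts1, ones])                 # homogeneous coordinates
    p2 = np.hstack([pts2, ones])
    lines = (F @ p1.T).T                         # epipolar lines l = F x1 in image 2
    num = np.abs(np.sum(lines * p2, axis=1))     # |x2^T F x1|
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return num / den > dist_thresh               # point-to-epipolar-line distance test
```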
| 
 | ||||

The remaining sections of this article are organized as follows. The work related to this system is described in Section II. Section III shows the details related to the implementation of this system. Section IV provides an experimental evaluation and analysis of the results. The conclusions and future works of this article are presented in Section V.

II. RELATED WORKS

A. SLAM in Dynamic Scenes

Most current visual SLAMs assume that the working scene is static and rigid. When these systems work in dynamic scenes, erroneous data associations due to the static scene assumption can seriously weaken the accuracy and stability of the system. The presence of dynamic objects in the scene divides all features into two categories: static features and dynamic features. How to detect and reject dynamic features is the key to the problem solution. The previous research work can be divided into three categories: the geometric information method, the semantic information method, and the method combining geometric and semantic information.

The main idea of the geometric information method is to assume that only static features can satisfy the geometric constraints of the algorithm. A remarkable early monocular dynamic object detection system comes from the work of Kundu et al. [16]. The system creates two geometric constraints to detect dynamic objects based on the multiview geometry [17]. One of the most important is the epipolar constraint defined by the fundamental matrix. The idea is that a static feature point in the current image must lie on the epipolar line corresponding to the same feature point in the previous image. A feature point is considered dynamic if its distance from the corresponding epipolar line exceeds an empirical threshold. The fundamental matrix of the system is calculated with the help of an odometer. In a purely visual system, the fundamental matrix can be calculated by the seven-point method based on RANSAC [18]. The algorithm of Kundu et al. [16] has the advantages of fast speed and strong scene generalization. However, it lacks a high-level understanding of the scene, so the empirical threshold is difficult to select and the accuracy is not high. In addition, some works use the direct method for motion detection of scenes, such as [19], [20], [21], and [22]. Direct method algorithms are faster and can utilize more image information. However, they are less robust in complex environments because they are based on the gray-scale invariance assumption.

The main idea of the semantic information method is to brutally reject features in dynamic regions that are obtained a priori using deep learning techniques. Zhang et al. [23] used the YOLO [24] object detection method to obtain the semantic information of dynamic objects in the working scene and then rejected the dynamic feature points based on the semantic information to improve the accuracy of the system. However, the way YOLO extracts semantic information by bounding box will cause a part of the static feature points to be wrongly regarded as outliers and eliminated. Similarly, Dynamic-SLAM proposed by Xiao et al. [25] has the same problem of directly rejecting all features within the bounding box. Liu and Miura [26] adopted a semantic segmentation method to detect dynamic objects and remove outliers in keyframes. The semantic segmentation method solves the problem of wrong recognition due to bounding boxes to a certain extent. However, the semantic information method relies heavily on the quality of the neural network, so it is difficult to meet the requirements of speed and accuracy at the same time.

Recently, much work has adopted the method of combining geometric and semantic information. For the RGB-D camera, Bescos et al. [9] used the semantic segmentation results of Mask R-CNN [27] combined with multiview geometry to detect dynamic objects and reject outliers. Yu et al. [10] used an optical flow-based moving consistency check method to detect all feature points and simultaneously performed semantic segmentation of the image using SegNet [28] in an independent thread. If the moving consistency checking method detects more than a certain percentage of dynamic points within the range of the human object, all feature points that lie inside the object are directly rejected. Wu et al. [11] used YOLO to detect a priori dynamic objects in the scene and then combined it with the depth-RANSAC method to reject the feature points inside the range of dynamic objects. Chang et al. [12] segmented the dynamic objects by YOLACT and then removed the outliers inside the objects. Then, geometric constraints are introduced to further filter the missing dynamic points.

The above methods have achieved quite good results in terms of accuracy improvement. Nevertheless, all of these methods rely heavily on semantic information and, to a lesser extent, on geometric information. Thus, more or less all of them have the following shortcomings.
1) Inability to correctly handle dynamic features outside of the prior object [10], [11], [23], [25], [26]. For example, chairs are static objects by default but dynamic while being moved by a person; moving cats appear in the scene, while the neural network is not trained on the category of cats; the detection algorithm has a low recall problem.
2) The a priori dynamic object remains stationary, yet the feature points in its range are still brutally rejected, resulting in less available association data [11], [12], [23], [25], [26]. For example, a person who is sitting still is nevertheless considered a dynamic object.
3) The real-time performance is weak [9], [10], [11], [12]. The average frame rate of the system is low due to factors such as complex semantic segmentation networks or unreasonable system architecture.

We propose an efficient dynamic feature rejection algorithm combining geometric and semantic information to solve the above problems. Unlike most current work that relies heavily on deep learning, our algorithm uses mainly geometric information and then supplements it with semantic information. This shift in thinking allows our algorithm to avoid the shortcomings associated with relying too much on deep learning.

B. Semantic Mapping

Many current visual SLAMs only provide a metric map that only satisfies the basic functions of localization and navigation of mobile robots, such as the sparse feature point map constructed by ORB-SLAM2. If a mobile robot is to perceive its surroundings at the human conceptual level, it is necessary to incorporate semantic information in the metric map to form a semantic map. The semantic metric map can help robots to act according to human rules, execute high-level tasks, and communicate with humans at the conceptual level.

In an earlier study, Mozos et al. [29] used the hidden Markov model to partition the metric map into different functional locations (rooms, corridors, and doorways). The work of Nieto-Granda et al. [30] deployed a mapping module based on the Rao–Blackwellized particle filtering technique on a ROS [13] and used the Gaussian model to partition the map into marked semantic regions. Subsequently, the development of deep learning has greatly contributed to the advancement of object detection and semantic segmentation algorithms. Sünderhauf et al. [31] used SSD [32] to detect objects in each RGB keyframe and then assigned a 3-D point cloud to each object using an adaptive 3-D unsupervised segmentation method. This work is based on the data association mechanism of ICP-like matching scores to decide whether to create new objects in the semantic map or to associate them with existing objects. Zhang et al. [23] acquired semantic maps of the working scene through the YOLO object detection module and localization module in the RGB-D SLAM system. In summary, many works only stop at using SLAM to help with semantic mapping and do not fully utilize the acquired semantic information to help to track. DS-SLAM, a semantic mapping system proposed by Yu et al. [10], adopted semantic segmentation information to build semantic maps. However, DS-SLAM only simply attaches semantic labels to the metric map for visual display. The lack of position coordinates for the objects described in mathematical form limits the system's ability to perform advanced task planning.

III. SYSTEM OVERVIEW

In this section, we will introduce the technical details of the SG-SLAM system from five aspects. First, we introduce the framework and the basic flow of the system. Second, we give information about the object detecting thread. Then, the geometric principle of the epipolar constraint method for judging dynamic features is illustrated. Subsequently, the dynamic feature rejection strategy is proposed. Finally, we propose methods to acquire semantic objects and build semantic maps.

A. System Framework

The SG-SLAM proposed in this article is developed based on the ORB-SLAM2 system, which is a feature point-based classical visual SLAM system. ORB-SLAM2 consists of three main parallel threads: tracking, local mapping, and loop closing. With the evaluation of many popular public datasets, ORB-SLAM2 is one of the systems that achieve state-of-the-art accuracy. Therefore, SG-SLAM selects ORB-SLAM2 as the base framework to provide global localization and mapping functions.

As shown in Fig. 1, the SG-SLAM system adds two more parallel threads: the object detecting thread and the semantic mapping thread. The multithreading mechanism improves the system operation efficiency. The purpose of adding an object detecting thread is to use the neural network to obtain the 2-D semantic information. This 2-D semantic information then provides a priori dynamic object information for the dynamic feature rejection strategy. The semantic mapping thread integrates the 2-D semantic information and 3-D point cloud information from keyframes to generate a 3-D semantic object database. An intuitive semantic metric map is obtained by publishing the 3-D point cloud, 3-D semantic objects, and camera pose to the ROS system. The semantic metric maps can help mobile robots understand their surroundings and perform advanced tasks from a higher cognitive level compared to the sparse feature point maps of ORB-SLAM2.

When the SG-SLAM system is running, the image frames captured from the RGB-D camera are first fed together to the tracking thread and the object detecting thread. The object detecting thread starts to perform object recognition on the input RGB images. At the same time, the tracking thread also starts to extract ORB feature points from the input frames. After the extraction is completed, the iterative Lucas–Kanade optical flow method with pyramids is used to match the sparse feature points between the current frame and previous frames. Then, the seven-point method based on RANSAC is used to compute the fundamental matrix between the two frames. This reduces the adverse effects due to incorrect data association in dynamic regions. Compared with feature extraction and fundamental matrix computation, the object detection task is more time-consuming. In other words, when the fundamental matrix is computed, the tracking thread needs to wait for the result of the object detecting thread. Since the tracking thread adopts object detection rather than semantic segmentation, the blocking time is not too long [26]. This enhances the real-time performance of the system. Next, the tracking thread combines the epipolar constraint and 2-D semantic information to reject the dynamic feature points. The camera pose is computed and released to ROS according to the remaining static feature points.

The new keyframes are fed into the local mapping thread and the loop closing thread for pose optimization, which is the same as the original ORB-SLAM2 system. The difference is that the depth image of the new keyframe is used to generate a 3-D point cloud in the semantic mapping thread. Next, the 3-D point cloud is combined with the 2-D semantic information to generate a 3-D semantic object database. There are problems such as high computational effort and redundant information between normal frames in semantic map construction. Thus, the practice of processing only keyframe data here improves the efficiency of mapping. The reuse of 2-D semantic information also improves the real-time performance of the system. Finally, the 3-D point cloud and the 3-D semantic object data are published to the 3-D visualization tool Rviz for map display using the interface of the ROS system.
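
To make this data flow concrete, the following is a minimal Python/OpenCV sketch of the matching and fundamental-matrix step; the function name, parameter values, and the use of OpenCV's FM_RANSAC option (instead of the exact seven-point-in-RANSAC scheme described here) are illustrative assumptions rather than the SG-SLAM implementation, which is written in C++ on top of ORB-SLAM2.

import cv2
import numpy as np

def match_and_estimate_F(prev_gray, curr_gray):
    """Sketch of the tracking front end: ORB corners in the current frame,
    pyramidal Lucas-Kanade optical flow back to the previous frame, and a
    RANSAC fundamental matrix between the two frames."""
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints = orb.detect(curr_gray, None)
    if len(keypoints) < 8:
        return None, None, None
    curr_pts = np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)

    # Track current-frame points back into the previous frame.
    prev_pts, status, _ = cv2.calcOpticalFlowPyrLK(curr_gray, prev_gray, curr_pts, None)
    good = status.reshape(-1) == 1
    curr_pts, prev_pts = curr_pts[good], prev_pts[good]

    # With this argument order, OpenCV returns F such that x_curr^T F x_prev = 0.
    # Dynamic regions are usually a minority of the image, so RANSAC inliers
    # come mostly from the static background.
    F, _ = cv2.findFundamentalMat(prev_pts, curr_pts, cv2.FM_RANSAC, 1.0, 0.99)
    return F, prev_pts.reshape(-1, 2), curr_pts.reshape(-1, 2)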

The adoption of object detection networks (rather than semantic segmentation), multithreading, keyframe-based mapping, and data reuse mechanisms overcomes the real-time performance shortcomings listed in Section II-A.

B. Object Detection

Due to the limitations in battery life, mobile robots generally choose ARM architecture processors with high performance per watt. NCNN is a high-performance neural network inference computing framework optimized for mobile platforms; since NCNN is implemented in pure C++ with no third-party dependencies, it can be easily integrated into SLAM systems. Thus, we choose it as the base framework for the object detecting thread.

Many SLAM systems, such as [9], [10], [11], and [12], run slowly due to complex semantic segmentation networks or unreasonable system architectures. SLAM, as a fundamental component for state estimation of mobile robots, must have good real-time performance to ensure the smooth operation of upper-level tasks. To improve the object detection speed as much as possible, the single-shot multibox detector SSD is chosen as the detection head. In addition, we use MobileNetV3 [33] as a drop-in replacement for the backbone feature extractor in SSDLite. Finally, the network was trained using the PASCAL VOC 2007 dataset [34].

In reality, other detectors can be used flexibly depending on the hardware performance to achieve a balance between accuracy and speed.
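
As an illustration of this flexibility, the snippet below uses an off-the-shelf SSDLite + MobileNetV3-Large detector from torchvision as a stand-in for the NCNN-deployed network; the torchvision model (COCO-pretrained rather than trained on PASCAL VOC 2007), the score threshold, and the class-id set are assumptions for demonstration, not the detector shipped with SG-SLAM.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stand-in detector (torchvision >= 0.13 weights API); COCO class id 1 is "person".
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT")
model.eval()

def detect_prior_dynamic_objects(rgb_image, dynamic_class_ids=(1,), score_thr=0.5):
    """Return 2-D bounding boxes of a priori dynamic classes in an RGB image."""
    with torch.no_grad():
        out = model([to_tensor(rgb_image)])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = out["scores"] > score_thr
    boxes, labels = out["boxes"][keep], out["labels"][keep]
    return [box.tolist() for box, lab in zip(boxes, labels)
            if int(lab) in dynamic_class_ids]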

C. Epipolar Constraints

SG-SLAM uses geometric information obtained from the epipolar constraint to determine whether feature points are dynamic or not. The judgment pipeline of the epipolar constraint is very straightforward. First, match the ORB feature points of two consecutive frames. Next, solve the fundamental matrix. Finally, the distance is calculated between the feature point of the current frame and its corresponding epipolar line. The bigger the distance is, the more likely the feature point is dynamic.

To solve the fundamental matrix, it is necessary to have the correct data association between the feature points. However, the purpose of solving the fundamental matrix is to judge whether the data association is correct or not. This becomes a classic chicken-or-the-egg problem. ORB-SLAM2 takes the Bag-of-Words method to accelerate feature matching, and the continued use of this method cannot eliminate the adverse effect of outliers. Hence, to obtain a relatively accurate fundamental matrix, SG-SLAM uses the pyramidal iterative Lucas-Kanade optical flow method to calculate the matching point set of features. Inspired by Yu et al. [10], the matching point pairs located at the edges of images and with excessive differences in appearance are then removed to further reduce erroneous data associations. Then, the seven-point method based on RANSAC is used to calculate the fundamental matrix between two frames. In general, the proportion of dynamic regions is relatively small compared to the whole image. Thus, the RANSAC algorithm can effectively reduce the adverse effects of wrong data association in dynamic regions.

Fig. 2. Epipolar constraints.

According to the pinhole camera model, as shown in Fig. 2, the camera observes the same spatial point P from different angles. O1 and O2 denote the optical centers of the camera. P1 and P2 are the matching feature points of the spatial point P mapped in the previous frame and the current frame, respectively. The short dashed lines L1 and L2 are the epipolar lines in the frames. The homogeneous coordinate forms of P1 and P2 are denoted as follows:

    P_1 = [x_1, y_1, 1]^T,  P_2 = [x_2, y_2, 1]^T                    (1)

where x and y denote the coordinate values of the feature points in the image pixel coordinate system. Then, the epipolar line L2 in the current frame can be calculated from the fundamental matrix (denoted as F) with the following equation:

    L_2 = [X, Y, Z]^T = F P_1 = F [x_1, y_1, 1]^T                    (2)

where X, Y, and Z are the components of the line vector. According to [16], the epipolar constraint can be formulated as follows:

    P_2^T F P_1 = P_2^T L_2 = 0.                                     (3)

Next, the distance between the feature point P_i (i = 2, 4) and the corresponding epipolar line is defined as the offset distance, denoted by the symbol d. The offset distance can be described as follows:

    d_i = |P_i^T F P_1| / \sqrt{X^2 + Y^2}.                          (4)

If the point P is a static space point, then jointly with (3) and (4), the offset distance of the point P2 is

    d_2 = |P_2^T F P_1| / \sqrt{X^2 + Y^2} = 0.                      (5)

Equation (5) demonstrates that in the ideal case, the feature point P2 in the current frame falls exactly on the epipolar line L2. In reality, however, the offset distance is generally greater than zero but below an empirical threshold ε due to the influence of various types of noise.
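
The offset distance in (4) can be evaluated directly from the fundamental matrix. The short NumPy sketch below mirrors (2)-(5); the variable names follow the notation above and are not taken from the SG-SLAM code.

import numpy as np

def epipolar_offset_distance(F, p1, p2):
    """Offset distance from the current-frame point p2 to the epipolar line
    L2 = F * P1, i.e., d = |P2^T F P1| / sqrt(X^2 + Y^2) as in (4);
    p1 and p2 are (x, y) pixel coordinates in the previous and current frame."""
    P1 = np.array([p1[0], p1[1], 1.0])
    P2 = np.array([p2[0], p2[1], 1.0])
    X, Y, _ = F @ P1                     # epipolar line L2 in the current frame, (2)
    return abs(P2 @ (F @ P1)) / np.sqrt(X**2 + Y**2 + 1e-12)

# A feature is flagged as dynamic when this distance exceeds an empirical
# threshold; in the ideal static case the distance is zero, as in (5).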

Algorithm 1 Dynamic Feature Rejection Strategy
Input: Previous frame, F1; Current frame, F2; Previous frame's feature points, P1; Current frame's feature points, P2; Standard empirical threshold, εstd;
Output: The set of static feature points in the current frame, S;
 1: P1 = CalcOpticalFlowPyrLK(F2, F1, P2)
 2: Remove matched pairs that are located at the edges and have too much variation in appearance
 3: FundamentalMatrix = FindFundamentalMat(P2, P1, 7-point method based on RANSAC)
 4: for each matched pair (p1, p2) in (P1, P2) do
 5:     if DynamicObjectsExist && IsInDynamicRegion(p2) then
 6:         if CalcEpiLineDistance(p2, p1, FundamentalMatrix) × GetDynamicWeightValue(p2) < εstd then
 7:             Append p2 to S
 8:         end if
 9:     else
10:         if CalcEpiLineDistance(p2, p1, FundamentalMatrix) < εstd then
11:             Append p2 to S
12:         end if
13:     end if
14: end for

If the point P is not a static spatial point, as shown in Fig. 2, when the camera moves from the previous frame to the current frame, the point P also moves to P′. In this case, the point P1 is matched with the point P4 mapped from P′ to the current frame. If point P moves without degeneration [16], then in general, the offset distance of P4 is greater than the threshold ε. In other words, the feature points can be judged as dynamic or not by comparing the offset distance with the empirical threshold ε.

D. Dynamic Feature Rejection Strategy

To avoid the shortcomings of relying heavily on deep learning for dynamic feature judgment, our algorithm relies mainly on geometric information. The geometric information method judges whether a feature is dynamic by comparing the offset distance d with an empirical threshold ε. However, the threshold ε value is very difficult to set [12]: setting it too small will make many static feature points wrongly judged as dynamic points, and setting it too large will miss many true dynamic feature points. This is because the purely geometric method cannot understand the scene at the semantic level and can only mechanically process all feature points using a fixed threshold.

To solve the above problem, all objects that can be detected by the object detecting thread are first classified as static objects and dynamic objects based on a priori knowledge. Any object with moving properties is defined as a dynamic object (e.g., a person or car); otherwise, it is a static object. Then, weight values w are defined for both kinds of objects. The standard empirical threshold εstd is set in a very straightforward way: just make sure that only obviously true dynamic feature points are rejected when using it. The dynamic weight value w is an a priori value in the range of 1–5, which is set according to the probability of the object moving. For example, a human normally moves with a high probability, and then w = 5; a chair normally does not move, and then w = 2.

With these preparations, all feature points in the current frame can be judged one by one. The dynamic feature rejection strategy is described in Algorithm 1.
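
A minimal sketch of this weighted decision rule, the core of Algorithm 1, is given below; the weight table, the threshold value, and the helper epipolar_offset_distance from the earlier sketch are illustrative assumptions rather than the exact values or code used in SG-SLAM.

# A priori dynamic weights per detected class (range 1-5): classes that usually
# move get large weights, usually-static classes get small weights.
DYNAMIC_WEIGHT = {"person": 5, "car": 5, "chair": 2}
EPSILON_STD = 1.0  # standard empirical threshold (illustrative value)

def is_static(p1, p2, F, detections, epsilon_std=EPSILON_STD):
    """Decide whether the matched pair (p1, p2) is static by combining the
    epipolar offset distance with the semantic weight of the region around p2."""
    d = epipolar_offset_distance(F, p1, p2)
    for (x0, y0, x1, y1), cls in detections:   # 2-D boxes from the detection thread
        if x0 <= p2[0] <= x1 and y0 <= p2[1] <= y1:
            w = DYNAMIC_WEIGHT.get(cls, 1)
            return d * w < epsilon_std         # p2 lies inside a prior dynamic region
    return d < epsilon_std                     # p2 lies outside any detected region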

E. Semantic Mapping

The ROS [13] is a set of software tool libraries that help developers quickly build robot applications. Rviz is a visualization tool in the ROS. In addition to the tracking thread that publishes camera poses to the ROS, the semantic mapping thread also publishes two kinds of data: 3-D point clouds and 3-D semantic objects. These data are then processed by Rviz to display an intuitive map interface.

For efficiency, only keyframes are used to construct semantic metric maps. When a new keyframe arrives, the semantic mapping thread immediately uses its depth image and pose to generate a 3-D ordered point cloud. The 3-D point cloud is subsequently published to the ROS, and a global Octo-map is built incrementally by the Octomap_server package. The global Octo-map has the advantages of being updatable, flexible, and compact, which can easily serve navigation and obstacle avoidance tasks. However, the Octo-map lacks semantic information, so it limits the capability of advanced task planning between mobile robots and semantic objects. Hence, a map with semantic objects and their coordinates is also necessary. The semantic mapping thread generates the 3-D semantic objects by combining 2-D semantic information with 3-D point clouds, and the main process is described as follows.

The 2-D object bounding box is captured in the dynamic feature rejection algorithm stage. The 3-D point cloud in the bounding box region is fetched to calculate the 3-D semantic object information. Yet, since the bounding box contains some noisy regions of nontarget objects, it cannot accurately segment the semantic object outline. To acquire relatively accurate position and size information of the objects, the bounding box is first reduced appropriately.
Next, we calculate the average depth of the point cloud corresponding to the bounding box region. Then, the depth of each point in the original bounding box is compared with this average depth, and the point is rejected if the difference is too large. Eventually, we filter the remaining point cloud and calculate its size and spatial centroid coordinates.

The above operation is performed for each piece of 2-D semantic information (except dynamic objects, e.g., people and dogs) in the current keyframe to obtain the 3-D semantic object data. During the operation of the system, the 3-D semantic object database can be continuously merged or updated according to the object class, centroid, and size information. By publishing this database through the ROS interface, the semantic metric maps can be visualized.
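
The sketch below illustrates this box-to-object step under stated assumptions: the depth image is back-projected with pinhole intrinsics inside a shrunk bounding box, points far from the average depth are rejected, and the centroid and extent of the remainder are reported. The intrinsics, depth scale, shrink ratio, and depth tolerance are placeholder values, not those used by SG-SLAM.

import numpy as np

def semantic_object_from_box(depth, box, fx, fy, cx, cy,
                             depth_scale=0.001, shrink=0.8, tol=0.3):
    """Estimate a 3-D semantic object's centroid and size from one 2-D box."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    # Shrink the box to cut away background around the object outline.
    x0 += int(w * (1 - shrink) / 2); x1 -= int(w * (1 - shrink) / 2)
    y0 += int(h * (1 - shrink) / 2); y1 -= int(h * (1 - shrink) / 2)

    v, u = np.mgrid[y0:y1, x0:x1]
    z = depth[y0:y1, x0:x1].astype(np.float32) * depth_scale   # depth in meters
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    if z.size == 0:
        return None

    # Keep only points whose depth is close to the average depth of the box.
    keep = np.abs(z - z.mean()) < tol
    if not np.any(keep):
        return None
    x = (u[keep] - cx) * z[keep] / fx
    y = (v[keep] - cy) * z[keep] / fy
    pts = np.stack([x, y, z[keep]], axis=1)                     # camera-frame points
    return pts.mean(axis=0), pts.max(axis=0) - pts.min(axis=0)  # centroid, size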

IV. EXPERIMENTAL RESULTS

In this section, we will experimentally evaluate and demonstrate the SG-SLAM system in four aspects. First, the tracking performance is evaluated with two public datasets. Second, we demonstrate the effectiveness of the dynamic feature rejection strategy and analyze the advantages of the fusion algorithm compared to the individual algorithms. Next, the system's real-time performance is evaluated. Finally, the visualization of the semantic objects and the global Octo-map is shown. The experiments were performed mainly on the NVIDIA Jetson AGX Xavier development kit with Ubuntu 18.04 as the system environment.

TABLE I: RESULTS OF METRIC ROTATIONAL DRIFT (RPE)
TABLE II: RESULTS OF METRIC TRANSLATIONAL DRIFT (RPE)
TABLE III: RESULTS OF METRIC ABSOLUTE TRAJECTORY ERROR (ATE)

A. Performance Evaluation on TUM RGB-D Dataset

The TUM RGB-D dataset [35] is a large dataset provided by the Technical University of Munich Computer Vision Group to create a novel benchmark for visual odometry and SLAM systems. To evaluate the accuracy and robustness of the SG-SLAM system in dynamic scenes, the experiments mainly use five sequences under the dynamic objects category in the dataset. The first four of them are high dynamic scene sequences, and the fifth one, as a supplement, is a low dynamic scene sequence.

There are two main error evaluation metrics for the experiment. One is the absolute trajectory error (ATE), which is directly used to measure the difference between the ground-truth trajectory and the estimated trajectory. The other is the relative pose error (RPE), which is mainly used to measure rotational drift and translational drift. To evaluate the improvement in performance relative to the original system, the experimental results of SG-SLAM were compared with ORB-SLAM2. The evaluation comparison results in the five dynamic scene sequences are shown in Tables I–III.
| is evaluated. Finally, the visualization of the semantic         dynamic sequences compared to the ORB-SLAM2 system. | ||||
| objects and the global Octo-map are shown. The experi-           Figs. 3 and 4 show the experimental results of ATE and | ||||
| ments were performed mainly on the NVIDIA Jetson AGX             RPE for the two systems at five sequences with an RGB-D | ||||
| Xavier development kit with Ubuntu 18.04 as the system           camera input. As shown in the figure, the accuracy of the | ||||
| environment.                                                     estimation results of our system in the high dynamic scene | ||||
|                                                                  sequences [Figs. 3(a)–(d) and 4(a)–(d)] is significantly higher | ||||
|                                                                  than ORB-SLAM2. In the experiments with low dynamic | ||||

Fig. 3. ATE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere. (e) fr3/sitting_static.

Fig. 4. RPE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere. (e) fr3/sitting_static.

To further evaluate the effectiveness of the proposed algorithm, it is also compared with M-removal DVO [22], RDS-SLAM [26], ORB-SLAM3 [36], and other similar algorithms. The results are shown in Table IV. Although the DynaSLAM system using pixel-level semantic segmentation achieves a slight lead in individual sequence results, its real-time performance is weak (as shown in Table VII). All other methods have difficulty in achieving the highest accuracy of experimental results because of certain shortcomings described in Section II. Overall, from the experimental results, it can be concluded that SG-SLAM achieves a state-of-the-art level in terms of average accuracy improvement over all sequences.

B. Performance Evaluation on Bonn RGB-D Dataset

The Bonn RGB-D Dynamic Dataset [37] is a dataset with 24 dynamic sequences for the evaluation of RGB-D SLAM provided by Bonn University in 2019. To validate the generalization performance of the dynamic feature rejection algorithm, we performed another experimental evaluation using this dataset.

The experiment mainly selected nine representative sequences in the dataset. Among them, the "crowd" sequences are scenes of three people walking randomly in the room. The "moving no box" sequences show a person moving a box from the floor to a desk. The "person tracking" sequences are scenes where the camera is tracking a walking person. The "synchronous" sequences present scenes of several people jumping together in the same direction over and over again. In order to evaluate the accuracy performance of our system, it is mainly compared with the original ORB-SLAM2 system and the current state-of-the-art YOLO-SLAM system.

The evaluation comparison results in the nine dynamic scene sequences are shown in Table V. Only in the two "synchronous" sequences does SG-SLAM not perform as well as YOLO-SLAM. The main reason is that the human jump direction in the scene is similar to the epipolar line direction, leading to different degrees of degeneration of the algorithm [16]. The results in Table V show that our algorithm outperforms other algorithms in most sequences. This not only proves once again that the SG-SLAM system achieves state-of-the-art accuracy and robustness in dynamic scenes but also proves its generalizability.

TABLE IV: RESULTS OF METRIC ATE

C. Effectiveness of Dynamic Feature Rejection Strategy

SG-SLAM combines geometrical and semantic information to reject dynamic features, drawing on the advantages and avoiding the disadvantages of both methods. To validate the effectiveness of the fusion of geometric and semantic information algorithms, we designed comparative experiments. Fig. 5 shows the experimental results of these methods for detecting dynamic points. First, SG-SLAM (S) denotes a semantic information-only algorithm to reject dynamic feature points. Next, SG-SLAM (G) is only the geometry algorithm based on the epipolar constraint. Finally, SG-SLAM (S + G) uses a fusion algorithm based on geometric and semantic information. The experimental results are shown in Table VI.

Fig. 5. Dynamic feature rejection effect demonstration. The empirical threshold ε in (b) is 0.2 and in (c) is 1.0. (a) ORB-SLAM2. (b) and (c) SG-SLAM (G). (d) SG-SLAM (S). (e) SG-SLAM (S + G).

Fig. 5(a) shows the results of ORB-SLAM2 extracting feature points: essentially no dynamic regions are processed. Fig. 5(b) and (c) shows the results of using only the epipolar constraint method at different empirical thresholds. At the low threshold [see Fig. 5(b)], many static feature points are misdetected and rejected (e.g., feature points at the corners of the TV monitor); at the high threshold [see Fig. 5(c)], some dynamic feature points on walking people are missed. Next, Fig. 5(d) shows the results of feature point extraction using only the semantic information method: all feature points around the human body are brutally rejected. Finally, the experimental results of the SG-SLAM system combining semantic and geometric information are shown in Fig. 5(e). SG-SLAM rejects all feature points on the human body and retains as many static feature points outside the human body as possible, and the rejection effect is better than the first two algorithms. The experimental results of the two algorithms based on separate information are mutually superior and inferior in different sequences. The algorithm combining both pieces of information shows the most accurate experimental results in all sequences. From the results in Table VI, the experimental data of each algorithm match the intuitive rejection effect in Fig. 5. This proves the effectiveness of the fusion of geometric and semantic information algorithms.

D. Timing Analysis

As the basic component of robot state estimation, the speed of SLAM directly affects the smooth execution of higher level tasks. Thus, we tested the average time cost of processing each frame when the system is running and compared it with other systems.

The experimental time-consuming results and hardware platforms are shown in Table VII. Since systems such as DS-SLAM, DynaSLAM, and YOLACT-based SLAM use pixel-level semantic segmentation networks, their average time cost per frame is expensive. YOLO-SLAM uses the end-to-end YOLO fast object detection algorithm, but it is very slow due to limitations such as system architecture optimization and hardware performance. The SG-SLAM system significantly increases frame processing speed by using multithreading, SSD object detection algorithms, and data multiplexing mechanisms. Compared to ORB-SLAM2, our work increases the average processing time per frame by less than 10 ms, which can meet the real-time performance requirements of mobile robots.

E. Semantic Mapping

To show the actual semantic mapping effect, the SG-SLAM system conducts mapping experiments in the TUM RGB-D dataset and the OpenLORIS-Scene dataset [38]. OpenLORIS-Scene is a dataset of data recorded by robots in real scenes using a motion capture system to obtain real trajectories. This dataset is intended to help evaluate the maturity of SLAM and scene understanding algorithms in real deployments.

TABLE V: RESULTS OF METRIC ATE
TABLE VI: RESULTS OF METRIC ATE
TABLE VII: TIME ANALYSIS

Fig. 6. Semantic object map for fr3_walking_xyz sequence.

Fig. 7. (a) Semantic object map and (b) global Octo-map for the cafe1-2 sequence of the OpenLORIS-Scene dataset.

Fig. 6 shows the semantic object mapping effect of SG-SLAM in the fr3_walking_xyz sequence of the TUM RGB-D dataset. Fig. 7(a) and (b) shows the semantic object map and the global Octo-map built in the cafe1-2 sequence of the OpenLORIS-Scene dataset, respectively. The coordinates of the objects shown in the map are transformed from the origin point where the SLAM system is running. The semantic metric map and the global Octo-map not only enable mobile robots to navigate and avoid obstacles but also enable them to understand scenes at a higher level and perform advanced tasks.

V. CONCLUSION

This article presents a real-time semantic visual SLAM system, SG-SLAM, toward dynamic scenes with an RGB-D camera input. SG-SLAM adds two new threads based on ORB-SLAM2: the object detecting thread and the semantic mapping thread. The system significantly improves real-time performance, accuracy, and robustness in dynamic scenes with the dynamic feature rejection algorithm. The semantic mapping thread reuses the 2-D semantic information to build the semantic object map with object coordinates and the global Octo-map. Experiments prove that improved traditional algorithms can achieve superior performance when deep learning is introduced and coupled with proper engineering implementations.

There are still some disadvantages of the system that need to be addressed in the future, for example, the degeneration problem in which dynamic objects moving along the epipolar line direction can cause the dynamic feature rejection algorithm to fail, the precision of the semantic metric map, quantitative experimental analysis of the map, and so on.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I," IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 99–110, Jun. 2006.
[2] S. Shirmohammadi and A. Ferrero, "Camera as the instrument: The rising trend of vision based measurement," IEEE Instrum. Meas. Mag., vol. 17, no. 3, pp. 41–47, Jun. 2014.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. 6th IEEE ACM Int. Symp. Mixed Augmented Reality, Nov. 2007, pp. 225–234.
[4] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[5] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100–2106.
[6] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, "Kimera: An open-source library for real-time metric-semantic localization and mapping," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 1689–1696.
[7] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[8] I. Kostavelis and A. Gasteratos, "Semantic mapping for mobile robotics tasks: A survey," Robot. Auton. Syst., vol. 66, pp. 86–103, Apr. 2015.
[9] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[10] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[11] W. Wu, L. Guo, H. Gao, Z. You, Y. Liu, and Z. Chen, "YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint," Neural Comput. Appl., vol. 34, pp. 1–16, Apr. 2022.
[12] J. Chang, N. Dong, and D. Li, "A real-time dynamic object segmentation framework for SLAM system in dynamic scenes," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[13] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., Kobe, Japan, 2009, vol. 3, no. 3, p. 5.
[14] Tencent. (2017). NCNN. [Online]. Available: https://github.com/Tencent/ncnn
[15] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, 2013.
[16] A. Kundu, K. M. Krishna, and J. Sivaswamy, "Moving object detection by multi-view geometric techniques from a single camera mounted robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2009, pp. 4306–4312.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[18] M. A. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[19] M. Piaggio, R. Fornaro, A. Piombo, L. Sanna, and R. Zaccaria, "An optical-flow person following behaviour," in Proc. IEEE Int. Symp. Intell. Control (ISIC), IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA), Intell. Syst. Semiotics (ISAS), 1998, pp. 301–306.
[20] D. Nguyen, C. Hughes, and J. Horgan, "Optical flow-based moving-static separation in driving assistance systems," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 1644–1651.
[21] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 7322–7328.
[22] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auton. Syst., vol. 108, pp. 115–128, Oct. 2018.
[23] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, "Semantic SLAM based on object detection and improved octomap," IEEE Access, vol. 6, pp. 75545–75559, 2018.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[25] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1–16, Jul. 2019.
[26] Y. Liu and J. Miura, "RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods," IEEE Access, vol. 9, pp. 23772–23785, 2021.
[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. ICCV, Jun. 2017, pp. 2961–2969.
[28] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder–decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Jan. 2017.
[29] Ó. M. Mozos, R. Triebel, P. Jensfelt, A. Rottmann, and W. Burgard, "Supervised semantic labeling of places using information extracted from sensor data," Robot. Auton. Syst., vol. 55, no. 5, pp. 391–402, May 2007.
[30] C. Nieto-Granda, J. G. Rogers, A. J. B. Trevor, and H. I. Christensen, "Semantic map partitioning in indoor environments using regional analysis," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2010, pp. 1451–1456.
[31] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, "Meaningful maps with object-oriented semantic mapping," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5079–5085.
[32] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[33] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 1314–1324.
[34] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge 2007 results," 2008. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[35] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[36] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, Dec. 2021.
[37] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss, "ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 7855–7862.
[38] X. Shi et al., "Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 3139–3145.
| 
 | ||||
| 
 | ||||
Shuhong Cheng was born in Daqing, Heilongjiang, China, in 1978. She received the B.S., M.S., and Ph.D. degrees from Yanshan University, Qinhuangdao, China, in 2001, 2007, and 2012, respectively. She studied as a Visiting Scholar at the University of Reading, Reading, U.K., in 2014. After her Ph.D. degree, she has been working as a Professor at Yanshan University since 2019. She has published about 50 papers in journals and international conferences and holds eight computer software copyrights. She has been granted more than four Chinese invention patents. Since 2012, she has presided over and undertaken more than ten national projects. Her current research interests are rehabilitation robots, assistive robots for the disabled and the elderly, and computer vision.

Shijun Zhang (Student Member, IEEE) was born in Lianyungang, China, in 1993. He received the bachelor’s and master’s degrees in control engineering from Yanshan University, Qinhuangdao, China, in 2016 and 2019, respectively, where he is currently pursuing the Ph.D. degree in mechanical engineering. His main research directions include mobile robot control and perception, computer vision, and deep learning.

Changhe Sun was born in Tangshan, China, in 1996. He received the bachelor’s degree in communication engineering from the Chongqing University of Technology, Chongqing, China, in 2019. He is currently pursuing the master’s degree with the School of Electrical Engineering, Yanshan University, Qinhuangdao, China. His main research interests include simultaneous localization and mapping (SLAM), computer vision, and robotics.

Dianfan Zhang was born in Jilin, China, in 1978. He received the bachelor’s and master’s degrees in control engineering and the Ph.D. degree from Yanshan University, Qinhuangdao, China, in 2001, 2006, and 2010, respectively. His main research directions include mobile robot control and signal processing.
| 
 | ||||
|  | ||||
| @ -0,0 +1,520 @@ | ||||
| Received January 10, 2022, accepted January 27, 2022, date of publication February 7, 2022, date of current version February 18, 2022. | ||||
| Digital Object Identifier 10.1109/ACCESS.2022.3149885 | ||||
| 
 | ||||
| The STDyn-SLAM: A Stereo Vision and Semantic | ||||
| Segmentation Approach for VSLAM in Dynamic | ||||
| Outdoor Environments | ||||
| 
 | ||||
| DANIELA ESPARZA AND GERARDO FLORES, (Member, IEEE) | ||||
| 
 | ||||
| Laboratorio de Percepción y Robótica [LAPyR], Centro de Investigaciones en Óptica (CIO), León, Guanajuato 37150, Mexico | ||||
| 
 | ||||
| Corresponding author: Gerardo Flores (gflores@cio.mx) | ||||
| This work was supported in part by the Consejo Nacional de Ciencia y Tecnología (CONACYT), Fondo Institucional de Fomento Regional | ||||
| para el Desarrollo Científico, Tecnológico y de Innovación (FORDECYT) under Grant 292399. | ||||
| 
 | ||||
  ABSTRACT Visual Simultaneous Localization and Mapping (VSLAM) estimates a map of the environment and the system pose from the scene's features. Commonly, VSLAM algorithms assume a static environment; however, dynamic objects are present in the vast majority of real-world applications. This work presents a feature-based SLAM system focused on dynamic environments that uses convolutional neural networks, optical flow, and depth maps to detect objects in the scene. The proposed system employs a stereo camera as the primary sensor to capture the scene. The neural network is responsible for object detection and segmentation to avoid erroneous maps and wrong system locations. Moreover, the proposed system is fast enough to run in real time, in both outdoor and indoor environments. The proposed approach is compared with the state of the art, and we present several outdoor experimental results that corroborate its effectiveness. Our code is available online.
| 
 | ||||
|   INDEX TERMS VSLAM, dynamic environment, stereo vision, neural network. | ||||
| 
 | ||||
I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) systems are strategic for developing the following navigation techniques. This is mainly due to their fundamental utility in solving the problem of autonomous exploration tasks in unknown environments such as mines, highways, farmlands, underwater/aerial environments, and, in broad terms, indoor and outdoor scenes. The problem of SLAM for indoor environments has been investigated for years, where usually RGB-D cameras or Lidars are the primary sensors to capture the scenes [1]–[3]. Indoors, dynamic objects are usually more controllable, unlike outdoors, where dynamic objects are inherent to the scene.

On the other hand, the vast majority of SLAM systems are focused on the assumption of static environments, such as HECTOR-SLAM [4], Kintinuous [5], MonoSLAM [6], PTAM [7], SVO [8], LSD-SLAM [9], among others. Since this assumption is strong, the system is restricted to work in static environments. However, in dynamic environments, moving objects can generate an erroneous map and wrong poses because dynamic features cause a bad pose estimation and incorrect data. For this reason, new approaches have arisen for solving the dynamic environment problem, such as NeuroSLAM [10], hierarchical Outdoor SLAM [11], and Large-Scale Outdoor SLAM [12].

In this work, we propose a method called STDyn-SLAM for solving the VSLAM problem in dynamic outdoor environments using stereo vision [19]. Fig. 1 depicts a sketch of our proposal in real experiments. The first row shows the input images, where a potentially dynamic object is present in the scene and is detected by a semantic segmentation neural network. Fig. 1d depicts the 3D reconstruction excluding dynamic objects. To evaluate our system, we carried out experiments in different outdoor scenes, and we qualitatively compared the 3D reconstructions taking into account the exclusion of dynamic objects. We conducted experiments using sequences from the KITTI dataset and compared them with state-of-the-art systems. Furthermore, our approach is implemented in ROS, in which we use the depth image from a stereo camera to build the 3D reconstruction using the OctoMap. Also, we analyzed the processing time using different datasets. Further, we publish our code on GitHub.1 Also, a video is available on YouTube. The main contributions are itemized as follows:

   • We propose a stereo SLAM for dynamic environments that uses a semantic segmentation neural network and geometrical constraints to eliminate the dynamic objects.
   • We use the depth image from a stereo camera to build the 3D reconstruction using the OctoMap. The depth image is not necessary for the SLAM process.
   • This work was tested using the KITTI and EurocMav datasets, and we compared our system with the stereo-configuration systems from the state of the art. In addition, we obtained results from outdoor and indoor environments of our own sequences.
   • Some results are shown in a YouTube video, and the STDyn-SLAM is available as a GitHub repo.

The rest of the paper is structured as follows. Section II mentions the related work on SLAM in dynamic environments. Then, in Section III, we show the main results and the STDyn-SLAM algorithm. Section IV presents the real-time experiments of STDyn-SLAM in outdoor environments with moving objects; we compare our approach with state-of-the-art methods using the KITTI dataset. Finally, the conclusions and the future work are given in Section V.

1 https://github.com/DanielaEsparza/STDyn-SLAM

(The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
TABLE 1. This table shows the state-of-the-art SLAM problem considering dynamic environments.

FIGURE 1. The STDyn-SLAM results in scenes with moving objects. First row: input images with two dynamic objects. Second row: 3D reconstruction performed by the STDyn-SLAM discarding moving objects.

II. RELATED WORK

A. CLASSIC APPROACHES
The classical methods do not consider artificial intelligence. Some of these approaches are based on optical flow, epipolar geometry, or a combination of the two. For example, in [20], Yang et al. propose a SLAM system using an RGB-D camera and two encoders for estimating the pose and building an OctoMap. The dynamic pixels are removed using an object detector and a K-means to segment the point cloud. On the other hand, in [21], Gimenez et al. present a CP-SLAM based on continuous probabilistic mapping and a Markov random field; they use the iterated conditional modes. Wang et al. [22] propose a SLAM system for indoor environments based on an RGB-D camera. They use the number of features on the static scene and assume that the parallax between consecutive images is a movement constraint. In [23], Cheng, Sun, and Meng implement an optical-flow and five-point algorithm approach to obtain dynamic features.
FIGURE 2. A block diagram showing the algorithm steps of the STDyn-SLAM.

In [24], Ma and Jia proposed a visual SLAM for dynamic environments, detecting the moving objects in the scene using optical flow. Furthermore, they use the RANSAC algorithm to improve the computation of the homography matrix. In [25], Sun et al. proposed an RGB-D system for detecting moving objects based on ego-motion to compensate for the camera movement, then obtaining the frame difference. The result of the frame difference helps to detect the moving object. After that, Sun et al. proposed in [26] an RGB-D system for motion removal based on a foreground model. This system does not require prior information.

B. ARTIFICIAL-INTELLIGENCE-BASED APPROACHES
Thanks to the growing use of deep learning, researchers have proposed several SLAM systems using artificial-intelligence-based approaches. Table 1 summarizes the state of the art in this regard. Some works, such as Dosovitskiy et al. [27], Ilg et al. [28], and Mayer et al. [29], used optical flow and supervised learning for detecting and segmenting moving objects.

In [30], Xu et al. proposed an instance segmentation of the objects in the scene based on the COCO dataset [31]. The geometric and motion properties are detected and used to improve the mask boundaries. Also, they tracked the visible and moving objects and estimated the system's pose. Several works are based on RGB-D cameras, such as [15], [17], and [18]. Cui and Ma [15] proposed SOF-SLAM, an RGB-D system based on ORB-SLAM2, which combines a neural network for semantic segmentation and optical flow for removing dynamic features. Zhao et al. [17] proposed an RGB-D framework for dynamic scenes, where they combined the Mask R-CNN, edge refinement, and optical flow to detect the probably dynamic objects. Henein et al. [18] proposed a system based on an RGB-D camera and proprioceptive sensors for tackling the SLAM problem. They employ a factor graph model and an instance-level object segmentation algorithm for the classification of objects and the tracking of features. The proprioceptive sensors are used to estimate the camera pose. Also, some works use a monocular camera, for instance, the DSOD-SLAM presented in [16]. Ma et al. employ a semantic segmentation network, a depth prediction network, and geometry properties to improve the results in dynamic environments. Our work is built on the well-known ORB-SLAM2 [32], taking some ideas from the DS-SLAM system [33]. In DS-SLAM, the authors used stored images from an RGB-D camera for solving the SLAM problem in indoor dynamic environments. Nevertheless, the depth map obtained from an RGB-D camera is hard to use in external environments. In [34], Cheng et al. proposed a SLAM system for building a semantic map in dynamic environments using a CRF-RNN for segmenting objects. Bescos et al. in [14] proposed a system for object detection using the Mask R-CNN, and their method proposes inpainting the background using the information from previous images. An update of [14] is [35], where Bescos et al. proposed a visual SLAM based on the trajectories of the objects and a bundle adjustment.

III. METHODS
In this section, we present and describe the framework of the STDyn-SLAM with all the parts that compose it.
A block diagram describing the framework's pipeline is depicted in Fig. 2, where the inputs at the instant t are the stereo pair, the depth image, and the left image captured at t − 1 (a.k.a. the previous left image). The process starts with extracting ORB features in the stereo pair and the past left image. Then, it follows the optical flow and epipolar geometry image processing. Next, the neural network segments potentially moving objects in parallel in the current left image. To remove outliers (features inside dynamic objects) and estimate the visual odometry, it is necessary to compute the semantic information and the movement-checking process. Finally, the 3D reconstruction is computed from the segmented image, the visual odometry, the current left frame, and the depth image. These processes are explained in detail in the following subsections.
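As an illustration of this data flow only, the short Python sketch below strings the stages together in the order just described; the stage functions are passed in as parameters because they are placeholders for the components of Sections III-A to III-D, not the authors' actual interfaces.

# Minimal sketch of the per-frame pipeline described above. The stage
# functions (orb_fn, flow_fn, ...) are hypothetical placeholders for the
# components detailed in Sections III-A to III-D, not the released API.
def process_frame(left_t, right_t, depth_t, left_prev,
                  orb_fn, flow_fn, fundamental_fn, segment_fn,
                  reject_fn, pose_fn, cloud_fn):
    feats = orb_fn(left_t, right_t, left_prev)      # ORB features (stereo pair + previous left)
    flow = flow_fn(left_prev, left_t)               # optical flow on the left images
    F = fundamental_fn(flow)                        # fundamental matrix / epipolar geometry
    masks = segment_fn(left_t)                      # semantic segmentation of the left image
    static = reject_fn(feats, F, masks)             # drop features on potentially moving objects
    pose = pose_fn(static)                          # ORB-SLAM2-style tracking on static features
    cloud = cloud_fn(left_t, depth_t, masks, pose)  # local point cloud without dynamic pixels
    return pose, cloud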
A. STEREO PROCESS
Motivated by the vast applications of robotics outdoors, where dynamic objects are present, we proposed that our STDyn-SLAM system be focused on stereo vision. A considerable advantage of this is that the depth estimation from a stereo camera is directly given as a distance measure. The process described in this part is depicted in Fig. 2, where three main tasks are developed: feature extraction, optical flow, and epipolar geometry. Let us begin with the first one.

The first step of the stereo process is acquiring the left, right, and depth frames from a stereo camera. Then, a local feature detector is applied to the stereo pair and the previous left image. As a feature detector, we use the Oriented FAST and Rotated BRIEF (ORB) detector, which yields the well-known ORB features [36]. Once the ORB features are found, optical flow and a process using epipolar geometry are conducted.

To avoid dynamic objects not classified by the neural network (explained in the following subsection), the STDyn-SLAM computes optical flow using the previous and current left frames. This step employs a Harris detector to compute the optical flow. Remember, these features are different from the ORB ones. A Harris point pair is discarded if at least one of the points is on the edge or corner of the image or close to it.

From the fundamental matrix, the ORB features, and the optical flow, we compute the epipolar lines. Thus, we can map the matched features from the current left frame into the previous left frame. The distance from the corresponding epipolar line to the mapped feature in the past left image determines an inlier or an outlier. Please refer to the remove-outliers block in Fig. 2. Notice that the ORB features of the car in the left image were removed, but the points on the right frame remain unchanged. This is because removing the points in the right images adds computational cost and is unnecessary.
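The feature-extraction and optical-flow steps above can be approximated with OpenCV's Python API as follows. This is a rough sketch under the assumption of grayscale inputs; the detector settings (number of ORB features, corner count, RANSAC threshold) are illustrative values, not the parameters used by STDyn-SLAM.

import cv2
import numpy as np

def stereo_front_end(prev_left_gray, curr_left_gray, curr_right_gray):
    # ORB features on the current stereo pair and on the previous left frame.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_left, desc_left = orb.detectAndCompute(curr_left_gray, None)
    kp_right, desc_right = orb.detectAndCompute(curr_right_gray, None)
    kp_prev, desc_prev = orb.detectAndCompute(prev_left_gray, None)

    # Harris corners on the previous left frame, tracked into the current
    # left frame with pyramidal Lucas-Kanade optical flow.
    p0 = cv2.goodFeaturesToTrack(prev_left_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=8,
                                 useHarrisDetector=True)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_left_gray, curr_left_gray,
                                                p0, None)
    good_prev = p0[status.ravel() == 1].reshape(-1, 2)
    good_curr = p1[status.ravel() == 1].reshape(-1, 2)

    # Fundamental matrix from the tracked pairs; RANSAC rejects gross errors.
    F, _mask = cv2.findFundamentalMat(good_prev, good_curr,
                                      cv2.FM_RANSAC, 1.0, 0.99)
    return (kp_left, desc_left), (kp_right, desc_right), (kp_prev, desc_prev), F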
B. ARTIFICIAL NEURAL NETWORK'S ARCHITECTURE
The approach we use is to eliminate the ORB features on dynamic objects. To address this, we need to discern the natural dynamic objects among all the objects in the scene. It is here where the NN depicted in Fig. 2 is introduced. In the NN block of that figure, a semantic segmentation neural network is shown, with the left image as input and a segmented image with the objects of interest as output. This NN is a pixel-wise classification and segmentation framework. The STDyn-SLAM implements a particular NN of this kind called SegNet [37], which is an encoder-decoder network based on the VGG-16 model [38]. The encoder of this architecture counts thirteen convolutional layers with batch normalization and a ReLU non-linearity, divided into five encoders, with five non-overlapping max-pooling and sub-sampling layers located at the end of each encoder. Since each encoder is connected to a corresponding decoder, the decoder architecture has the same number of layers as the encoder architecture, and every decoder has an upsampling layer first. The last layer is a softmax classifier. SegNet classifies the pixels using a model based on the PASCAL VOC dataset [39], which consists of twenty classes. Each pixel can be classified into one of the following classes: airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV/monitor.

Notwithstanding the above, not all feature points in the left frame are matched in the right frame. For that reason, and to save computing resources, the SegNet classifies the objects of interest only on the left input image.
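Assuming the segmentation network returns a dense class-ID map for the left image, rejecting features on potentially dynamic classes reduces to a per-pixel lookup, as in the sketch below. The standard PASCAL VOC label indices are used, and which classes count as dynamic is only an example, not a value taken from the authors' code.

import numpy as np

# Example set of potentially dynamic classes (standard PASCAL VOC indices):
# bicycle, bird, bus, car, cat, dog, horse, motorbike, person, sheep.
DYNAMIC_CLASS_IDS = [2, 3, 6, 7, 8, 12, 13, 14, 15, 17]

def keep_static_keypoints(keypoints, descriptors, class_map):
    """Reject ORB keypoints whose pixel lies on a potentially dynamic class.

    class_map is an HxW array of per-pixel class IDs predicted for the left image.
    """
    h, w = class_map.shape
    dynamic = np.isin(class_map, DYNAMIC_CLASS_IDS)
    kept_kp, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if 0 <= v < h and 0 <= u < w and not dynamic[v, u]:
            kept_kp.append(kp)
            kept_desc.append(desc)
    return kept_kp, np.asarray(kept_desc)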
1) OUTLIERS REMOVAL
Once all the previous steps have been accomplished, a threshold is selected to determine whether a feature is an inlier or an outlier. Fig. 3 depicts the three cases of a mapped feature. Let x1, x2, and x3 denote the ORB features from the previous left image; x1', x2', and x3' are the corresponding features from the current left image; X and X' represent the homogeneous coordinates of x and x', respectively; F is the fundamental matrix; and l1 = FX1, l2 = FX2, and l3 = FX3 are the epipolar lines. The first and second cases correspond to inliers: x1 is over l1, and the distance from x2 to l2 is less than the threshold. The third case is an outlier because the distance from x3 to l3 is greater than the threshold. To compute the distance between the point x' and the epipolar line l, we proceed as follows,

    d(X', l) = |X'^T F X| / \sqrt{(FX)_1^2 + (FX)_2^2}        (1)

where the subindices in (FX)_1 and (FX)_2 denote the elements of the epipolar line. If the distance is larger than the threshold, the feature point is considered an outlier, i.e., a dynamic feature.
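Equation (1) translates directly into a few lines of NumPy. The sketch below assumes (N, 2) arrays of matched pixel coordinates and a threshold expressed in pixels; the 1-pixel value is illustrative, not the paper's exact setting.

import numpy as np

def epipolar_distances(F, x_prev, x_curr):
    """Distance from each current-frame point to its epipolar line l = F X, as in Eq. (1)."""
    X_prev = np.hstack([x_prev, np.ones((len(x_prev), 1))])   # homogeneous X
    X_curr = np.hstack([x_curr, np.ones((len(x_curr), 1))])   # homogeneous X'
    lines = (F @ X_prev.T).T                                   # epipolar lines l = F X
    num = np.abs(np.sum(X_curr * lines, axis=1))               # |X'^T F X|
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)         # sqrt((FX)_1^2 + (FX)_2^2)
    return num / den

def dynamic_feature_mask(F, x_prev, x_curr, threshold_px=1.0):
    # Points farther than the threshold from their epipolar line are treated
    # as dynamic features (outliers).
    return epipolar_distances(F, x_prev, x_curr) > threshold_px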
FIGURE 3. The cases of inliers and outliers. Green: x1 and x2 are inliers; the distance from each point to its corresponding epipolar line l is less than a threshold. Red: x3 is an outlier, since the distance is greater than the threshold.

FIGURE 4. Diagram of the ROS nodes of the STDyn-SLAM required to generate the trajectory and 3D reconstruction. The circles represent each process's ROS node, and the arrows are the ROS topics published by the ROS nodes. The continued arrows depict the final ROS topics.

FIGURE 5. The STDyn-SLAM when a static object becomes dynamic. Images a) and b) correspond to the left images from a sequence. Image c) is the 3D reconstruction of the environment; the trajectory is shown in red dots. The OctoMap node fills empty areas along the sequence of images.

FIGURE 6. The 3D reconstruction from STDyn-SLAM in an indoor environment. A moving person appears in the scene, crossing from left to right. The VSLAM system considers the person as a dynamic object.

Remember that the SegNet, described before, semantically segments the left image into object classes. The semantic segmentation enhances the rejection of ORB features on the possible dynamic objects. The ORB features inside segmented objects, and thus possible moving objects, are rejected. The remaining points are matched with the ORB features from the right image.

C. VISUAL ODOMETRY
Because the system is based on ORB-SLAM2, the VSLAM computes the odometry visually. Therefore, the next step needs the ORB features to estimate the depth for each feature pair. The features are classified into mono and stereo and will be necessary to track the camera's pose. Again, this step is merely a process from ORB-SLAM2.

D. 3D RECONSTRUCTION
Finally, the STDyn-SLAM builds a 3D reconstruction from the left, segmented, and depth images using the visual odometry. First, the 3D reconstruction process checks each pixel of the segmented image to reject the points corresponding to the classes of objects selected as dynamic in Section III-B. Then, if the pixel is not considered a dynamic object, the equivalent pixel from the depth image is added to the point cloud, and the assigned color of the point is obtained from the left frame. This step builds a local point cloud only at the current pose of the system, and then the OctoMap [40] joins and updates the local point clouds into a full point cloud.

Remark 1: It is essential to mention that we merely applied the semantic segmentation, optical flow, and geometry constraints to the left image to avoid increasing the execution time. Moreover, the right-hand-side frame segmentation is unnecessary because the feature selection rejects the ORB features inside dynamic objects from the left image, so the corresponding points from the right frame will not be matched.
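The per-pixel rejection described in Section III-D can be summarized as the back-projection below, assuming a pinhole model with known intrinsics (fx, fy, cx, cy) and a metric depth image; the intrinsics, depth cut-off, and class handling are generic placeholders rather than the ZED-specific values.

import numpy as np

def local_point_cloud(depth_m, class_map, color_bgr, fx, fy, cx, cy,
                      dynamic_ids, max_depth=15.0):
    """Back-project the non-dynamic pixels of a depth image into a colored cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    keep = (depth_m > 0) & (depth_m < max_depth) & ~np.isin(class_map, dynamic_ids)
    z = depth_m[keep]
    x = (u[keep] - cx) * z / fx
    y = (v[keep] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)   # camera-frame coordinates
    colors = color_bgr[keep]               # color taken from the left frame
    return points, colors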
FIGURE 7. The 3D reconstruction, with the presence of static objects (two parked cars) and dynamic objects (a person and two dogs). Notice that the person and dogs are not visualized in the scene due to the effect of the STDyn-SLAM. Fig. a) depicts the static objects. Nevertheless, the vehicles are potentially dynamic objects; thus, in Fig. b), the STDyn-SLAM excludes these bodies considering their possible movement.

IV. EXPERIMENTS
This section tests our algorithm STDyn-SLAM in real-time scenes and on the KITTI datasets. Our system's experiments were compared to other state-of-the-art systems to evaluate the 3D reconstruction and the odometry. The results of the 3D map were qualitatively assessed because of the nature of the experiment. We employ the Absolute Pose Error (APE) metric for the odometry.

A. HARDWARE AND SOFTWARE SETUP
We tested our system on an Intel Core i7-7820HK laptop computer with 32 GB RAM and a GeForce GTX 1070 GPU. Moreover, we used as input a ZED camera, which is a stereo camera developed by Stereolabs. We selected the HD720 resolution. The ZED camera resolutions are WVGA (672 × 376), HD720 (1280 × 720), HD1080 (1920 × 1080), and 2.2K (2208 × 1242).

The STDyn-SLAM is developed natively on ROS. Our system's main inputs are the left and right images, but the depth map is necessary to build the point cloud. However, if this is not available, it is possible to execute the STDyn-SLAM only with the stereo images and then obtain the trajectory. On the other hand, the STDyn node in ROS generates two main topics: the Odom and the ORB_SLAM2_PointMap_SegNetM/Point_Clouds topics. The point cloud topic is the input of the octomap_server node; this node publishes the joined point cloud of the scene.

Fig. 4 depicts the ROS nodes required by the STDyn-SLAM to generate the trajectory and the 3D reconstruction. The camera node publishes the stereo images and computes the depth map from the left and right frames. Then, the STDyn-SLAM calculates the odometry and the local point cloud. The OctoMap combines and updates the current local point cloud with the previous global map to visualize the global point cloud. It is worth mentioning that the user can choose the maximum depth of the local point cloud. All the ROS topics can be shown through the viewer.
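To give an idea of how the published topics can be consumed, the minimal rospy listener below subscribes to the point cloud topic named above and only reports the size of each incoming cloud; it is a usage example, not part of STDyn-SLAM, and the exact topic namespace may differ in a given launch configuration.

#!/usr/bin/env python
import rospy
from sensor_msgs.msg import PointCloud2

def on_cloud(msg):
    # Each message is a local point cloud; octomap_server fuses them into the global map.
    rospy.loginfo("local cloud received: %d x %d points", msg.height, msg.width)

if __name__ == "__main__":
    rospy.init_node("stdyn_cloud_listener")
    rospy.Subscriber("/ORB_SLAM2_PointMap_SegNetM/Point_Clouds", PointCloud2,
                     on_cloud, queue_size=1)
    rospy.spin()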
B. REAL-TIME EXPERIMENTS
We present real-time experiments under three different scenarios, explained next.

First, we test the STDyn-SLAM in an outdoor environment where a car is parked and then moves forward. In this case, a static object (a car) becomes dynamic; see Fig. 5. This figure shows the 3D reconstruction, where the car appears static in the first images of the sequence, Fig. 5 a). Then, the car becomes a dynamic object when it moves forward (Fig. 5 b), and the STDyn-SLAM is capable of filling the empty zone if the scene is covered again, as is the case in Fig. 5 c).

The second experiment tests our system in an indoor environment. The scene consists of a moving person crossing from left to right. Subfigures a and b of Fig. 6 depict the left and right images, and c shows the 3D reconstruction. The area occupied by the moving person is filled after the zone becomes visible again.

The third experiment consists of a scene sequence with two parked cars, a walking person, and a dog. Even though the vehicles are static, the rest of the objects move. Fig. 7a shows the scene taking into account the potentially dynamic entities. However, a car can change its position; the STDyn-SLAM excludes the probable moving bodies (the parked cars) to avoid multiple plottings throughout the reconstruction. This is depicted in Fig. 7b.

We compared the point clouds from the RTABMAP and the STDyn-SLAM systems as a fourth experiment. The sequence was carried out outdoors with a walking person and two dogs. Since RTABMAP generates a point cloud of the scene, we decided to compare it with our system. To build the 3D reconstructions from RTABMAP, we provided the left and depth images, the camera info, and the odometry as inputs for the RTABMAP. We used stereo and depth images; the intrinsic parameters are saved in a text file in the ORB-SLAM2 package. Fig. 8 shows the 3D reconstructions. In Fig. 8a, our system excludes the dynamic objects. On the other hand, in Fig. 8b, RTABMAP plotted the dynamic objects on different sides of the scene, resulting in an incorrect map of the environment.
TABLE 2. Comparison of Absolute Pose Error (APE) on KITTI dataset.
TABLE 3. Comparison of Absolute Pose Error (APE) on Euroc-Mav dataset.
TABLE 4. Comparison of Relative Pose Error (RPE) on KITTI dataset.
TABLE 5. Comparison of Relative Pose Error (RPE) on Euroc-Mav dataset.

FIGURE 8. Experiment comparison between the STDyn-SLAM and the RTABMAP [41]. Image a) shows the 3D reconstruction given by STDyn-SLAM; it eliminates dynamic objects' effect on the mapping. Image b) shows the point cloud created by RTABMAP; notice how dynamic objects are mapped along the trajectory. This is undesirable behavior.

C. COMPARISON OF STATE-OF-ART AND OUR SLAM USING KITTI AND EurocMav DATASETS
We compare our VSLAM with the DynaSLAM [14] and ORB-SLAM2 approaches. We selected sequences with dynamic objects, with and without loop closure, to evaluate the SLAM systems. Therefore, we chose the 00−10 sequences from the odometry KITTI dataset [42], and furthermore all sequences from the EurocMav dataset except V1_03 and V2_03. Moreover, we employed the EVO tools [43] to evaluate the Absolute Pose Error (APE) and the Relative Pose Error (RPE), and the RGB-D tools [44] to calculate the Absolute Trajectory Error (ATE).

We present the results of APE, RPE, and ATE in different tables, divided depending on the dataset evaluated. Tables 3 and 4 show the APE experiments on the KITTI and EurocMav datasets, respectively. Tables 4 and 5 correspond to RPE, and Tables 6 and 7 present the ATE results. We did not evaluate the EurocMav sequences with DynaSLAM due to the excessive processing time to compute the trajectories.

To evaluate the significant difference of the ATE evaluation, we implemented the Score Sρ [45] over the sequences of the EurocMav and KITTI datasets of Tables 6 and 7. The results in Table 8 show an improvement of our system against ORB-SLAM2 in the trajectories of the EurocMav dataset. In the KITTI dataset, STDyn-SLAM and ORB-SLAM2 are not significantly different. In evaluating our system against DynaSLAM, DynaSLAM is slightly better.
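For reference, once an estimated trajectory has been aligned to the ground truth (as the evo and RGB-D benchmark tools do internally), the translational APE/ATE statistic reduces to an RMSE over per-pose position errors; the NumPy sketch below shows only that final computation and omits the alignment step.

import numpy as np

def translational_rmse(gt_xyz, est_xyz):
    """RMSE of translation errors between corresponding, already-aligned poses."""
    err = np.linalg.norm(gt_xyz - est_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy usage example:
if __name__ == "__main__":
    gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
    est = np.array([[0.1, 0.0, 0.0], [1.0, 0.1, 0.0], [2.0, 0.0, 0.1]])
    print(translational_rmse(gt, est))   # 0.1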
TABLE 6. Comparison of Absolute Trajectory Error (ATE) on KITTI dataset.
TABLE 7. Comparison of Absolute Trajectory Error (ATE) on Euroc-Mav dataset.
TABLE 8. Comparison of Score Sρ (a, b) on the datasets.
TABLE 9. Processing time.

D. PROCESSING TIME
In this section, we analyzed the processing time of this work. For the study, we evaluated several datasets with different types of images. The analysis consists of obtaining the processing time of each sequence with the same characteristics and calculating the average of the sequences' means. Table 9 shows the times obtained with the datasets. We used the KITTI and EurocMav datasets for the RGB and Gray columns. Since those sequences do not provide a depth image, we did not map a 3D reconstruction. For the last column, we utilized our own trajectories. In addition, our dataset contains depth images, so we plotted a 3D reconstruction. For this reason, the processing time is longer.

V. CONCLUSION
This work presents the STDyn-SLAM system for outdoor and indoor environments where dynamic objects are present. The STDyn-SLAM is based on images captured by a stereo pair for the 3D reconstruction of scenes, where the possible dynamic objects are discarded from the map; this allows a trustworthy point cloud. The system's capability for computing a reconstruction and localization in real time depends on the computer's processing power, since a GPU is necessary to support the processing. However, with a medium-range computer, the algorithms work correctly.

In the future, we plan to implement an optical flow approach based on the last generation of neural networks to improve dynamic object detection. The implementation of neural networks allows replacing classic methods such as geometric constraints. Furthermore, we plan to increase the size of the 3D map to reconstruct larger areas and obtain longer reconstructions of the scenes. The next step is implementing the algorithm in an aerial manipulator constructed in the lab.

SUPPLEMENTARY MATERIAL
The implementation of our system is released on GitHub and is available under the following link: https://github.com/DanielaEsparza/STDyn-SLAM

Besides, this letter has supplementary video material available at https://youtu.be/3tnkwvRnUss, provided by the authors.
| 
 | ||||
|                                                                           REFERENCES | ||||
| 
 | ||||
 [1] J. Castellanos, J. Montiel, J. Neira, and J. Tardos, “The SPmap: A probabilistic framework for simultaneous localization and map building,” IEEE Trans. Robot. Autom., vol. 15, no. 5, pp. 948–952, 1999.
 [2] G. Dissanayake, H. Durrant-Whyte, and T. Bailey, “A computationally efficient solution to the simultaneous localisation and map building (SLAM) problem,” in Proc. IEEE Int. Conf. Robot. Automation. Symposia (ICRA), 2000, pp. 1009–1014.
 [3] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, “FastSLAM: A factored solution to the simultaneous localization and mapping problem,” in Proc. AAAI Nat. Conf. Artif. Intell., 2002, pp. 593–598.
 [4] S. Kohlbrecher, O. von Stryk, J. Meyer, and U. Klingauf, “A flexible and scalable SLAM system with full 3D motion estimation,” in Proc. IEEE Int. Symp. Saf., Secur., Rescue Robot., Nov. 2011, pp. 155–160.
 [5] T. Whelan, J. McDonald, M. Kaess, M. Fallon, H. Johannsson, and J. J. Leonard, “Kintinuous: Spatially extended KinectFusion,” in Proc. RSS Workshop RGB-D, Adv. Reasoning with Depth Cameras, Jul. 2012, pp. 1–10.
 [6] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007.
 [7] Y. Kameda, “Parallel tracking and mapping for small AR workspaces (PTAM) augmented reality,” J. Inst. Image Inf. Telev. Engineers, vol. 66, no. 1, pp. 45–51, 2012.
 [8] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “SVO: Semidirect visual odometry for monocular and multicamera systems,” IEEE Trans. Robot., vol. 33, no. 2, pp. 249–265, Apr. 2017.
 [9] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in Proc. Eur. Conf. Comput. Vis. (ECCV), D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham, Switzerland: Springer, 2014, pp. 834–849.
[10] F. Yu, J. Shang, Y. Hu, and M. Milford, “NeuroSLAM: A brain-inspired SLAM system for 3D environments,” Biol. Cybern., vol. 113, nos. 5–6, pp. 515–545, Dec. 2019.
[11] D. Schleicher, L. M. Bergasa, M. Ocana, R. Barea, and M. E. Lopez, “Real-time hierarchical outdoor SLAM based on stereovision and GPS fusion,” IEEE Trans. Intell. Transp. Syst., vol. 10, no. 3, pp. 440–452, Sep. 2009.
| 
 | ||||
|                                                                                                                                                                            VOLUME 10, 2022 | ||||
| D. Esparza, G. Flores: STDyn-SLAM: Stereo Vision and Semantic Segmentation Approach for VSLAM | ||||
| 
 | ||||
| [12] R. Ren, H. Fu, and M. Wu, ‘‘Large-scale outdoor SLAM based on 2D                   [35] B. Bescos, C. Campos, J. D. Tardos, and J. Neira, ‘‘DynaSLAM II: Tightly- | ||||
|        LiDAR,’’ Electronics, vol. 8, no. 6, p. 613, May 2019.                                  coupled multi-object tracking and SLAM,’’ IEEE Robot. Autom. Lett., | ||||
|                                                                                                vol. 6, no. 3, pp. 5191–5198, Jul. 2021. | ||||
| [13] S. Yang and S. Scherer, ‘‘CubeSLAM: Monocular 3-D object SLAM,’’ | ||||
|        IEEE Trans. Robot., vol. 35, no. 4, pp. 925–938, Aug. 2019.                      [36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ‘‘ORB: An efficient | ||||
|                                                                                                alternative to SIFT or SURF,’’ in Proc. Int. Conf. Comput. Vis., Nov. 2011, | ||||
| [14] B. Bescos, J. M. Fácil, J. Civera, and J. L. Neira, ‘‘DynaSLAM: Tracking,                 pp. 2564–2571. | ||||
|        mapping, and inpainting in dynamic scenes,’’ IEEE Robot. Autom. Lett., | ||||
|        vol. 3, no. 4, pp. 4076–4083, Oct. 2018.                                         [37] V. Badrinarayanan, A. Kendall, and R. Cipolla, ‘‘SegNet: A deep | ||||
|                                                                                                convolutional encoder-decoder architecture for image segmentation,’’ | ||||
| [15] L. Cui and C. Ma, ‘‘SOF-SLAM: A semantic visual SLAM for                                  IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, | ||||
|        dynamic environments,’’ IEEE Access, vol. 7, pp. 166528–166539,                         Dec. 2017. | ||||
|        2019. | ||||
|                                                                                         [38] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for | ||||
| [16] P. Ma, Y. Bai, J. Zhu, C. Wang, and C. Peng, ‘‘DSOD: DSO                                  large-scale image recognition,’’ in Proc. Int. Conf. Learn. Represent. | ||||
|        in dynamic environments,’’ IEEE Access, vol. 7, pp. 178300–178309,                      (ICLR), San Diego, CA, USA, Jul. 2015, pp. 1–14. | ||||
|        2019. | ||||
|                                                                                         [39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and W. Zisserman, | ||||
| [17] L. Zhao, Z. Liu, J. Chen, W. Cai, W. Wang, and L. Zeng, ‘‘A compatible                    ‘‘The PASCAL visual object classes (VOC) challenge,’’ Int. J. Comput. | ||||
|        framework for RGB-D SLAM in dynamic scenes,’’ IEEE Access, vol. 7,                      Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010. | ||||
|        pp. 75604–75614, 2019. | ||||
|                                                                                         [40] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, | ||||
| [18] M. Henein, J. Zhang, R. Mahony, and V. Ila, ‘‘Dynamic SLAM: The                           ‘‘OctoMap: An efficient probabilistic 3D mapping framework based on | ||||
|        need for speed,’’ in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2020,                octrees,’’ Auto. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013. [Online]. | ||||
|        pp. 2123–2129.                                                                          Available: https://octomap.github.io | ||||
| 
 | ||||
| [19] S. Trejo, K. Martinez, and G. Flores, ‘‘Depth map estimation methodology           [41] M. Labbé and F. Michaud, ‘‘Long-term online multi-session graph- | ||||
|        for detecting free-obstacle navigation areas,’’ in Proc. Int. Conf. Unmanned            based SPLAM with memory management,’’ Auto. Robots, vol. 42, no. 6, | ||||
|        Aircr. Syst. (ICUAS), Jun. 2019, pp. 916–922.                                           pp. 1133–1150, 2018. | ||||
| 
 | ||||
| [20] D. Yang, S. Bi, W. Wang, C. Yuan, W. Wang, X. Qi, and Y. Cai, ‘‘DRE-               [42] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driving? | ||||
|        SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot,’’                      The KITTI vision benchmark suite,’’ in Proc. Int. Conf. Pattern Recognit., | ||||
|        Remote Sens., vol. 11, no. 4, p. 380, Feb. 2019.                                        Jun. 2012, pp. 3354–3361. | ||||
| 
 | ||||
| [21] J. Gimenez, A. Amicarelli, J. M. Toibero, F. di Sciascio, and R. Carelli,          [43] (2017). U. Technologies. EVO: Python Package for the Evalua- | ||||
|        ‘‘Continuous probabilistic SLAM solved via iterated conditional modes,’’                tion of Odometry and SLAM. [Online]. Available: https://github.com/ | ||||
|        Int. J. Autom. Comput., vol. 16, no. 6, pp. 838–850, Aug. 2019.                         MichaelGrupp/evo | ||||
| 
 | ||||
| [22] R. Wang, W. Wan, Y. Wang, and K. Di, ‘‘A new RGB-D SLAM method                     [44] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, ‘‘A | ||||
|        with moving object detection for dynamic indoor scenes,’’ Remote Sens.,                 benchmark for the evaluation of RGB-D SLAM systems,’’ in Proc. | ||||
|        vol. 11, no. 10, p. 1143, May 2019.                                                     IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580. | ||||
| 
 | ||||
| [23] J. Cheng, Y. Sun, and M. Q.-H. Meng, ‘‘Improving monocular visual                  [45] R. Muñoz-Salinas and R. Medina-Carnicer, ‘‘UcoSLAM: Simultaneous | ||||
|        SLAM in dynamic environments: An optical-flow-based approach,’’ Adv.                     localization and mapping by fusion of keypoints and squared planar | ||||
|        Robot., vol. 33, no. 12, pp. 576–589, Jun. 2019.                                        markers,’’ Pattern Recognit., vol. 101, May 2020, Art. no. 107193. | ||||
| 
 | ||||
| DANIELA ESPARZA received the B.S. degree in robotic engineering from the Universidad Politécnica del Bicentenario, México, in 2017, and the master's degree in optomechatronics from the Center for Research in Optics, in 2019, where she is currently pursuing the Ph.D. degree in mechatronics and mechanical design.
| Her research interests include artificial vision, such as 3D reconstruction and deep learning applied to SLAM, developed on platforms such as mobile robots.
| 
| GERARDO FLORES (Member, IEEE) received the B.S. degree (Hons.) in electronic engineering from the Instituto Tecnológico de Saltillo, Mexico, in 2007, the M.S. degree in automatic control from CINVESTAV-IPN, Mexico City, in 2010, and the Ph.D. degree in systems and information technology from the Heudiasyc Laboratory, Université de Technologie de Compiègne–Sorbonne Universités, France, in October 2014.
| Since August 2016, he has been a full-time Researcher and the Head of the Perception and Robotics Laboratory, Center for Research in Optics, León, Guanajuato, Mexico. His current research interests include the theoretical and practical problems arising from the development of autonomous robotic and vision systems. He has been an Associate Editor of Mathematical Problems in Engineering, since 2020.
3	动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx	Normal file
							| @ -0,0 +1,3 @@ | ||||
| 
 | ||||
| junwen Lai | ||||