first commit

wubw 2025-04-23 15:46:42 +08:00
commit 7cec2d9ba0
31 changed files with 100116 additions and 0 deletions

Binary file not shown.

Binary file not shown.


@@ -0,0 +1,231 @@
University of Electronic Science and Technology of China
Thesis Proposal Form for Academic Degree Graduate Students
Degree level: □ Doctoral  ■ Master's
Discipline: Software Engineering
School: School of Information and Software Engineering
Student ID: 202221090225
Name: Wu Bowen
Thesis title: Research on Visual SLAM Based on Instance Segmentation in Outdoor Dynamic Scenes
Supervisor: Wang Chunyu
Date: December 15, 2023
Graduate School, University of Electronic Science and Technology of China
Thesis Research Content
Project type
□ Basic research  □ Applied basic research  ■ Applied research
Project source
□ Vertical (government-funded)  □ Horizontal (industry-funded)  ■ Self-proposed
Research objectives, research content, and key problems to be solved (may be continued on additional pages)
Research Objectives
Current robot SLAM algorithms fall into two main categories, LiDAR SLAM and visual SLAM, distinguished by their sensors (LiDAR versus camera). As mobile robots spread and their application scenarios multiply, the high price of LiDAR makes LiDAR SLAM hard to deploy on small appliances and low-cost robots, whereas visual SLAM benefits from cameras that are inexpensive, compact, and able to capture rich, multi-dimensional information, and has gradually become the mainstream research direction in SLAM.
Visual Simultaneous Localization and Mapping (V-SLAM) plays an important role in robotic visual perception. State-of-the-art V-SLAM algorithms provide highly accurate localization and scene reconstruction [1]. However, most of them ignore the adverse effects of dynamic objects: the environment is assumed to be completely static. This strong assumption leads to severe errors in complex dynamic environments, producing large pose-estimation errors and even localization failure. Studying camera motion and object motion in dynamic scenes is therefore necessary.
This thesis intends to use a camera as the sensor in outdoor scenes containing dynamic objects, to study how to distinguish objects that are actually moving from potentially movable but currently static objects, to make better use of static feature points, and thereby to improve the accuracy of camera motion estimation and the robustness of the SLAM system.
Research Content
Dynamic scenes are a major obstacle to bringing V-SLAM into practical use; they are difficult and challenging and have been studied by many researchers. This thesis will investigate how to identify dynamic objects in outdoor dynamic scenes and design a dynamic-object recognition algorithm that reduces their influence on camera pose estimation, yielding a more accurate camera pose. With an accurate camera pose, dynamic objects are then tracked: a tracking set of dynamic objects is maintained that records newly appearing and disappearing objects. Finally, the observations (camera poses, object poses, etc.) are passed to the back end for global optimization, and a map is built from the optimized map points.
For identifying outdoor dynamic objects: study a dynamic-point decision method that combines deep learning and geometric constraints, design an algorithm for recognizing moving objects, remove dynamic points not covered by the semantic information, and correctly recover the camera pose.
For moving-object tracking: study how to track the different objects given the semantic information, design a scheme that distinguishes the objects and their motions, and recover the poses of the moving objects.
For back-end optimization: study optimization methods that exploit dynamic-object information, jointly optimizing camera poses and object poses to obtain more accurate camera poses.
Key Problems to Be Solved
Dynamic object discrimination
Dynamic-object discrimination is a key step of the whole dynamic SLAM problem; how well it is solved directly affects camera pose estimation and back-end mapping. The problem is to decide whether an object's spatial position has changed between two camera frames. Using only the semantic information from instance segmentation, objects of known "dynamic" classes can be labeled as dynamic, but it cannot be determined whether the object is actually moving in the current image. Moreover, semantic information fails for unknown moving objects. An algorithm that also uses geometric information is therefore needed to decide whether an object is truly moving, so that dynamic-object feature points can be separated from static background feature points.
Dynamic object tracking
Dynamic points are usually simply discarded once identified, and their information is no longer considered. Yet these feature points are also map points and are valuable for tracking dynamic objects, so how to store and use them is a key issue. In a dynamic scene the dynamic feature points very likely do not come from a single object, i.e., an image may contain several dynamic objects. The association of objects across frames must therefore be studied to establish unique matches and track each dynamic object separately.
Joint tracking and optimization
After solving for the camera pose, the motion of dynamic objects is tracked to obtain their poses. Since object motion is predicted information, local or global optimization can refine it. However, typical optimization schemes either only perform linear optimization or only optimize camera poses, ignoring the useful information carried by dynamic object points. This thesis therefore intends to design an optimization procedure, determine the optimization variables, achieve more accurate pose estimates and map points, and thus solve the problem of under-utilizing the information from dynamic objects.
Thesis Research Basis
Basis for topic selection and research significance; domestic and international research status and development trends; main references; existing work and research achievements (at least 2,000 characters).
Basis for Topic Selection and Research Significance
Simultaneous Localization and Mapping (SLAM) is the process by which a mobile platform equipped with sensors such as LiDAR, an IMU (Inertial Measurement Unit), and cameras localizes itself and builds a map of an unknown environment at the same time [2]. SLAM is generally divided into LiDAR SLAM and visual SLAM. LiDAR SLAM, which fuses LiDAR, wheel encoders, and an IMU, is relatively mature in both theory and application. However, LiDAR is expensive, which keeps it out of small personal devices, and the information it provides is limited. Visual SLAM builds the environment map with visual sensors such as monocular, stereo, and RGB-D (color plus depth) cameras. Cameras capture rich image information and are cheap, structurally simple, and compact, so visual SLAM has become a research hotspot in recent years and the main direction within SLAM. It can be widely applied to autonomous driving, autonomous robots, and guidance and obstacle avoidance for the visually impaired, so research on visual SLAM has practical significance.
After nearly two decades of development, the Visual Simultaneous Localization and Mapping (V-SLAM) framework has matured. Most current V-SLAM systems are built on the assumption of a non-dynamic environment, i.e., that no dynamic objects appear while the platform is tracking. This is a strong assumption that rarely holds in reality: indoor scenes contain moving people, chairs, and tables, and outdoor scenes contain moving vehicles and animals. Such dynamic objects severely affect V-SLAM systems, especially the front-end module. The SLAM front end can be solved in two ways: the direct method and the feature-point method. The direct method relies on the photometric-constancy assumption to describe how pixels move between images over time, assuming every pixel moves consistently between two frames, and estimates camera motion from this. Because images are affected by lighting and noise, photometric constancy often does not hold, and dynamic objects make direct methods even less reliable for motion estimation. The feature-point method is indirect: it first extracts feature points and then solves for camera motion from the matches and position changes of those points between frames. Feature points greatly improve V-SLAM localization accuracy; well-known open-source frameworks such as ORB-SLAM2 [3], ORB-SLAM3 [4], and VINS-Mono [5] all use them. But once dynamic objects appear, some feature points lie on those objects, and their motion breaks the consistency of feature displacement, severely biasing the camera motion estimate. This can cause back-end failure and a large drop in localization accuracy and cannot be ignored. As visual SLAM develops, how to handle dynamic effects has attracted wide attention and has important research value.
Research Status and Development Trends at Home and Abroad
2.1 Research status of visual SLAM
The visual SLAM problem can be traced back to the introduction of filtering techniques: Smith et al. proposed using state-estimation theory to handle robot localization and mapping [6]. Various filter-based SLAM systems followed, e.g., particle filters [7] and Kalman filters [8]. In 2007 visual SLAM achieved a major breakthrough when A. J. Davison et al. proposed the first monocular visual SLAM system, MonoSLAM [9]. Based on the Extended Kalman Filter (EKF), it was the first monocular visual SLAM system to run in real time; earlier algorithms processed pre-recorded video and could not operate online. In the same year, Klein et al. proposed PTAM (Parallel Tracking And Mapping) [10], which innovatively ran tracking and mapping in parallel threads, a design that remains mainstream in today's SLAM frameworks. PTAM adopted keyframes and nonlinear optimization rather than the filtering schemes prevalent at the time, opening the way for later optimization-based visual SLAM.
In 2014, Jakob Engel et al. from the Computer Vision Group at the Technical University of Munich proposed LSD-SLAM [11], a direct-method monocular visual SLAM algorithm that needs no feature points: it matches pixel information by minimizing photometric error and achieves good mapping results. It demonstrated the effectiveness of direct-method visual SLAM and laid the foundation for later work. In the same year, Forster et al. proposed SVO [12], a sparse direct-method visual SLAM scheme that combines features with the direct method: feature points are used but their descriptors are not computed, and matching is done with the direct method on the pixels around each feature.
In 2015, Mur-Artal et al., drawing on PTAM's keyframes and parallel threads, proposed the ORB-SLAM framework [13], a monocular visual SLAM system based entirely on feature points, with three parallel threads for tracking, mapping, and loop closing. Its most classic element is the ORB feature, which balances extraction speed and quality. However, the system only supported monocular cameras, had limited accuracy, and restricted application scenarios. With advances in cameras, Mur-Artal et al. improved ORB-SLAM in 2017, adding support for stereo and RGB-D cameras, and proposed ORB-SLAM2 [3]. Compared with the original, it supports three camera types and adds relocalization, global optimization, and map reuse, making it more robust.
In 2017, Qin Tong et al. from the Hong Kong University of Science and Technology proposed VINS-Mono [14], which fuses an IMU with a monocular camera: when visual information briefly fails, the IMU can estimate the pose, while visual information corrects IMU drift during optimization, and the combination performs well. In 2019 they proposed the improved VINS-Fusion [15], adding support for stereo cameras and GPS, with even better performance after fusion.
In 2020, Carlos Campos et al. proposed ORB-SLAM3 [4], which adds visual-inertial fusion on top of ORB-SLAM2 and is open source. Many parts of the algorithm were improved, e.g., a multi-map system and a new relocalization module, adapting to more scenarios while improving accuracy two- to three-fold over the previous version. At the end of 2021 the system was updated to V1.0; inheriting the strengths of ORB-SLAM2, it has become one of the most representative visual SLAM systems today.
2.2 Research status of dynamic SLAM
Many researchers have addressed the influence of dynamic objects and tried to solve visual SLAM in dynamic scenes. The main challenge is to detect dynamic objects and their feature points efficiently and to remove the dynamic feature points so that camera motion can be recovered.
The earliest approaches filtered out dynamic-object feature points with geometric constraints. WANG et al. [16] first used K-Means to cluster the 3D points computed by an RGB-D camera and used the epipolar constraint between consecutive images to measure the change in the number of inlier keypoints per region; regions with few inliers were deemed dynamic. Fang et al. [17] used optical flow to detect where dynamic objects lie between images and filtered out their feature points; optical flow improves detection accuracy and effectively reduces the error of inter-frame epipolar constraints. Although geometric-constraint methods can mitigate dynamic feature points to some extent, with the development of deep learning the semantic information in images has received increasing attention.
There are now many strong deep networks, e.g., YOLO [18], SegNet [19], and Mask R-CNN [20]. Their powerful feature-extraction and semantic-extraction abilities help a SLAM system recognize dynamic objects more easily and eliminate their influence. Detect-SLAM, proposed by Fangwei Zhong et al. [21], uses an object-detection network to find dynamic people, cars, and so on; for real-time performance, detection is only run on keyframes, and all detected dynamic points are removed before recovering the camera pose. LIU and MIURA [22] proposed RDS-SLAM, built on ORB-SLAM3 [4]; it initializes a moving probability for objects from the segmentation results and propagates the probability to subsequent frames to separate dynamic from static points. Purely learning-based methods, however, only provide semantic information and cannot tell whether an object is actually moving, e.g., a standing person or a car parked at the roadside. Marking such objects as dynamic from semantics alone and discarding their feature points loses useful features and harms camera motion estimation. Deep learning alone therefore cannot fully resolve the influence of dynamic objects on SLAM.
Many studies have begun to combine semantic and geometric information. For example, DS-SLAM, proposed by Chao Yu et al. at Tsinghua University [23], first performs semantic segmentation with SegNet and then filters moving objects with the epipolar constraint, with good results. Berta Bescos et al. were the first to use Mask R-CNN instance segmentation, proposing DynaSLAM [24], which combines dynamic-object segmentation based on multi-view geometry and depth with a region-growing algorithm and greatly reduces pose-estimation error.
Combining semantic information from deep learning with geometric information to solve the dynamic-scene problem in SLAM is becoming mainstream, but most of the above systems only remove dynamic-object feature points to recover the camera pose and do not estimate the objects' poses. Estimating camera motion while tracking dynamic objects and adding dynamic-object points into the optimization is developing into a trend. DOT (Dynamic Object Tracking for visual SLAM), proposed by Javier Civera et al. [25], works mainly in the front end: it uses instance segmentation to generate masks for dynamic objects and tracks the objects by minimizing photometric reprojection error. AirDOS, proposed by Yuheng Qiu et al. at Carnegie Mellon University [26], introduces rigidity and motion constraints to model articulated objects and corrects camera pose estimation by jointly optimizing camera poses, object motion, and object 3D structure. VDO-SLAM [27] uses Mask R-CNN masks and optical flow to separate dynamic from static points, formulates SLAM in dynamic environments as a single graph optimization, and estimates camera and object poses simultaneously.
In summary, solving visual SLAM in dynamic scenes currently requires both geometric information and semantic information from deep learning: semantics localize objects more accurately, geometry reveals their true motion state, and the two are combined to estimate camera motion and track objects.
Main References
[1] Engel J, Koltun V, Cremers D. Direct sparse odometry[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 40(3): 611-625.
[2] Kong D L, Fang Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19.
[3] Mur-Artal R, Tardós J D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras[J]. IEEE Transactions on Robotics, 2017.
[4] Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM[J]. IEEE Transactions on Robotics, 2021, 37(6): 1874-1890.
[5] Qin T, Li P, Shen S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator[J]. IEEE Transactions on Robotics, 2018.
[6] Smith R, Self M, Cheeseman P. Estimating uncertain spatial relationships in robotics[J]. Machine Intelligence & Pattern Recognition, 1988, 5(5): 435-461.
[7] Grisetti G, Stachniss C, Burgard W. Improved techniques for grid mapping with Rao-Blackwellized particle filters[J]. IEEE Transactions on Robotics, 2007, 23(1): 34-46.
[8] Kalman R E. A new approach to linear filtering and prediction problems[J]. Journal of Basic Engineering, 1960, 82(Series D): 35-45. DOI: 10.1115/1.3662552.
[9] Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-time single camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 1052-1067.
[10] Klein G, Murray D. Parallel tracking and mapping for small AR workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007: 1-10.
[11] Engel J, Schöps T, Cremers D. LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision (ECCV), 2014: 834-849.
[12] Forster C, Pizzoli M, Scaramuzza D. SVO: Fast semi-direct monocular visual odometry[C]. IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 2014: 15-22.
[13] Mur-Artal R, Montiel J M M, Tardós J D. ORB-SLAM: A versatile and accurate monocular SLAM system[J]. IEEE Transactions on Robotics, 2015, 31(5): 1147-1163.
[14] Qin T, Li P, Shen S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator[J]. IEEE Transactions on Robotics, 2017, 99: 1-17.
[15] Qin T, Pan J, Cao S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. arXiv preprint arXiv:1901.03638, 2019.
[16] Wang R, Wan W, Wang Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143.
[17] Fang Y, Dai B. An improved moving target detecting and tracking based on optical flow technique and Kalman filter[C]. IEEE, 2009. DOI: 10.1109/ICCSE.2009.5228464.
[18] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, real-time object detection[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. DOI: 10.1109/CVPR.2016.91.
[19] Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
[20] He K, Gkioxari G, Dollár P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397.
[21] Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making object detection and SLAM mutually beneficial[C]. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018. DOI: 10.1109/WACV.2018.00115.
[22] Liu Y, Miura J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785.
[23] Yu C, et al. DS-SLAM: A semantic visual SLAM towards dynamic environments[C]. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018: 1168-1174.
[24] Bescos B, Fácil J M, Civera J, et al. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4): 4076-4083.
[25] Ballester I, Fontán A, Civera J, et al. DOT: Dynamic object tracking for visual SLAM[J]. arXiv preprint arXiv:2010.00052, 2020.
[26] Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]. 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053.
[27] Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A visual dynamic object-aware SLAM system[J]. arXiv preprint arXiv:2005.11052, 2020.
[28] Gao X, Zhang T, et al. 14 Lectures on Visual SLAM[M]. 2nd ed. Beijing: Publishing House of Electronics Industry, 2019.
Existing Work and Research Achievements
Work accumulation
During graduate study I have mainly studied the "14 Lectures on Visual SLAM" textbook [28], read a large number of SLAM-related papers, and tested the performance of several frameworks, including ORB-SLAM2 and VDO-SLAM, on sequences of the public KITTI dataset in a virtual-machine environment. I am familiar with the main functions of these frameworks and can set up the environment, implement algorithms, and run tests through programming.
Research achievements
None yet.
Thesis Research Plan and Expected Goals
1. Main theories, research methods, technical route, and implementation plan (may be continued on additional pages)
1.1 Main theories and research methods
A typical visual SLAM system can generally be divided into five sub-modules: sensors, front end, back-end optimization, loop closure detection, and mapping, as shown in Fig. 3-1.
Fig. 3-1 SLAM module diagram
For visual SLAM the sensor is a camera. The front end, also called visual odometry, estimates the motion (pose change) between two adjacent moments from camera information. The back end optimizes the poses; loop closure detection determines whether the camera has revisited the same scene and is closely related to mapping. The main work of this thesis is concentrated in the front end and the back end. In scenes with little illumination change and no dynamic objects the basic SLAM modules are already mature; to handle dynamic scenes, a deep learning model must be combined with these modules to realize semantic-level SLAM.
Back-end optimization is based on factor graphs. A factor graph is an estimation model built on Bayes' rule: the Bayesian model seeks X given Z, written P(X|Z), which is proportional to the likelihood of Z given X times the prior, as in Eq. (1):

$$P(X \mid Z) = \frac{P(Z \mid X)\,P(X)}{P(Z)} = k \, P(Z \mid X)\,P(X) \tag{1}$$

The left-hand side of Bayes' rule is called the posterior; on the right, P(Z|X) is the likelihood and P(X) the prior. Computing the posterior distribution directly is difficult, but finding the state estimate that maximizes the posterior probability is feasible, as in Eq. (2); maximizing the posterior is therefore equivalent to maximizing the product of the likelihood and the prior.

$$X^{*} = \arg\max_{X} P(X \mid Z) = \arg\max_{X} P(Z \mid X)\,P(X) \tag{2}$$

When solving the maximum-likelihood estimate, if the conditional probability of the observations is Gaussian, the maximum likelihood of a Gaussian can be found by minimizing the negative log-likelihood, which yields a least-squares problem as in Eq. (3); its solution is equivalent to the maximum-likelihood estimate of the state. In Eq. (3), f(x) is the error term of X under Gaussian noise.

$$X^{*} = \arg\max_{X} P(X \mid Z) = \arg\max_{X} \log P(X \mid Z) = \arg\min_{X} \lVert f(x) \rVert_{2}^{2} \tag{3}$$

In the SLAM problem every observation in the Bayesian network is conditionally independent, so the joint conditional probability factors into a product whose terms correspond to the factors of the graph. A factor graph contains nodes and edges: variable nodes represent the quantities to be estimated, such as poses and 3D points; edges are the error terms, i.e., the factors, representing the error between variables. A prior factor is also included to fix the gauge so that the system is solvable. Solving a factor graph means finding the state that maximizes the product of all factors, which can be converted into a least-squares problem; the resulting state is the most probable system state.
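As a concrete illustration of Eq. (3), the following minimal sketch (a hypothetical toy problem, not the thesis code) shows how a MAP estimate under Gaussian noise reduces to a least-squares problem over stacked factor residuals:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy illustration of Eq. (3): with Gaussian noise, the MAP estimate of a
# state X given measurements Z reduces to least squares over the stacked
# residuals f(x). Here X is three 1-D "poses"; Z are noisy relative
# displacements plus one prior factor that fixes x0 (values are made up).
z_rel = np.array([1.0, 2.1])   # measured x1-x0 and x2-x1
prior = 0.0                    # prior factor anchoring x0

def residuals(x):
    # Each entry is one factor of the graph: prediction minus measurement.
    return np.array([
        x[0] - prior,              # prior factor
        (x[1] - x[0]) - z_rel[0],  # odometry factor x0 -> x1
        (x[2] - x[1]) - z_rel[1],  # odometry factor x1 -> x2
    ])

x_map = least_squares(residuals, x0=np.zeros(3)).x
print(x_map)  # approximately [0.0, 1.0, 3.1]
```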
During the research I will start from these main theories, read a large number of papers on visual SLAM in outdoor dynamic scenes, summarize them, compare the similarities and differences of their methods, and refine my own algorithm. On the practical side I will write code, try different algorithms, test their performance, and arrive at a good solution through experiments.
1.2 Technical route and implementation plan
The planned technical route and implementation plan are shown in Fig. 3-2.
Fig. 3-2 Technical route and implementation plan
For instance-segmentation-based SLAM in outdoor dynamic scenes, the data preprocessing for the deep learning model must be handled first; the resulting semantic information and geometric constraints are then used to design an algorithm that classifies points as dynamic or static. Camera motion is estimated from the static points and object motion from the dynamic points, with each moving object tracked separately. Finally, global optimization of camera poses, object poses, and map points is studied to build the map.
The planned detailed technical route and implementation plan are as follows:
Dynamic object discrimination based on instance segmentation and clustering
In outdoor dynamic scenes, a dynamic-object recognition algorithm based on instance segmentation and superpixel clustering is proposed. Instance segmentation yields object masks; points inside a mask are dynamic-point candidates, and subtracting them from the extracted feature points gives static-point candidates. The static candidates are clustered and reprojected into the previous frame to compute point errors, and an error-ratio-based criterion decides which points are dynamic, solving the problem of judging dynamic objects whose semantics are unknown. For mask objects with known semantics, the same criterion decides whether they are actually moving. The idea is shown in Fig. 3-3, and a small sketch of the error-ratio test follows the figure.
Fig. 3-3 Dynamic object discrimination based on instance segmentation and clustering
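The following sketch illustrates the intended error-ratio test; the function name, interface, and threshold are illustrative assumptions rather than the final algorithm:

```python
import numpy as np
import cv2

# Minimal sketch of the error-ratio test: reproject a candidate cluster's
# 3-D points into the previous frame with the estimated relative pose and
# compare the cluster's mean reprojection error against the error level of
# the static background. All names and the threshold are assumptions.
def cluster_is_dynamic(pts3d_cluster, obs2d_prev, rvec, tvec, K,
                       background_err, ratio_thresh=3.0):
    proj, _ = cv2.projectPoints(pts3d_cluster.astype(np.float64),
                                rvec, tvec, K, None)
    err = np.linalg.norm(proj.reshape(-1, 2) - obs2d_prev, axis=1).mean()
    # A cluster whose error is several times the background error has
    # likely moved between the two frames.
    return err / max(background_err, 1e-6) > ratio_thresh
```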
Dynamic object tracking based on the set of dynamic points inside a mask
Study the motion of masked dynamic objects and propose a global dynamic-object tracking method for outdoor scenes. Pixels are sampled densely inside the mask (one point every two pixels) to guarantee enough feature points while tracking an object. Motion judgment and semantic labels then identify the objects that are actually moving; a storage set is designed to manage these object pixels, and the sampled pixels are used to estimate the pose of each object, where object pose estimation is built on the rigid-body assumption. The idea is shown in Fig. 3-4, followed by a short sketch.
Fig. 3-4 Dynamic object tracking method
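A minimal sketch of the stride-2 mask sampling and the rigid-body pose recovery assumed above (illustrative only, not the final implementation):

```python
import numpy as np

# Sample mask pixels on a stride-2 grid, then recover an object's relative
# motion from matched 3-D points of two frames with the Kabsch/SVD
# alignment, under the rigid-body assumption stated above.
def sample_mask_pixels(mask, stride=2):
    ys, xs = np.nonzero(mask)
    keep = (ys % stride == 0) & (xs % stride == 0)
    return np.stack([xs[keep], ys[keep]], axis=1)   # (u, v) pixel coords

def rigid_transform(P, Q):
    """Least-squares R, t such that Q ~= R @ P_i + t for matched Nx3 sets."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp
```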
Factor graph optimization
Study factor-graph-based optimization of camera poses and object poses; this treats dynamic SLAM as a graph optimization problem in order to build a globally consistent map. The variable nodes of the factor graph are the state variables to be estimated, while the factor nodes between variables encode the observations and act as constraints. The planned factor graph is shown in Fig. 3-5.
Fig. 3-5 Factor graph
2. Feasibility of the research plan, status of research conditions, possible problems and solutions (may be continued on additional pages)
2.1 Feasibility analysis
Thanks to the steady development of visual SLAM, there are already many approaches to the dynamic-object problem, especially for the front end, and papers are published every year that can serve as references. In addition, as deep learning models mature, instance segmentation and optical-flow estimation achieve good results and can assist in solving dynamic SLAM. The research direction of this thesis is therefore feasible both in theory and in practice.
2.2 Research conditions
(1) The research atmosphere of the lab, advice from the supervisor and other teachers, and help from senior students; the lab has already published a number of related papers and patents;
(2) The lab's complete hardware environment: servers, mobile robot platforms, and various cameras;
(3) Papers and books related to the research content, providing a sufficient theoretical basis.
2.3 Possible problems and solutions
(1) Global optimization results worse than the raw data
When predicted values are added to the global optimization, their errors are unknown, so predictions with large errors may enter the optimization and make the optimized result worse than the raw data. To address this, first reconsider the choice of optimization variables, adding or removing optimized quantities to obtain more accurate results; second, preprocess the predictions before adding them to the optimization, e.g., by absolute or relative thresholding (a small sketch follows).
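A possible form of such preprocessing, shown only as an illustrative sketch with placeholder thresholds:

```python
import numpy as np

# Gate predicted object-pose errors before they enter the global
# optimization: drop predictions whose error exceeds an absolute bound or
# a multiple of the median error. Threshold values are placeholders.
def gate_predictions(errors, abs_thresh=0.5, rel_factor=3.0):
    errors = np.asarray(errors)
    rel_thresh = rel_factor * np.median(errors)
    keep = (errors < abs_thresh) & (errors < rel_thresh)
    return keep  # boolean mask of predictions allowed into the optimization
```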
(2) The implementation plan does not achieve good results
If this happens, discuss the causes with the supervisor and senior students, keep records, locate the problem, adjust the technical route according to the actual situation, and design a new plan to reach the goal.
3. Research plan and expected results
Period            Work to complete
2023.12-2024.02   Study the dynamic object discrimination method
2024.02-2024.04   Study the dynamic object tracking method
2024.04-2024.06   Study local and global optimization that includes dynamic objects
2024.06-2024.08   Validate map accuracy metrics and improve the algorithm
2024.08-2024.11   Run experiments on the test datasets
2024.11-2025.03   Write the master's thesis
Expected innovations
Design a dynamic object discrimination method based on instance segmentation and clustering
Propose a mask-based simultaneous dynamic object tracking method
Design a factor graph that achieves better global optimization
Forms of results
Academic papers
Publish one academic paper
Patents
Apply for one or two invention patents
Thesis
Complete one master's thesis
Proposal Review Opinions
1. The supervisor's opinion on the thesis topic and the feasibility of the research plan, and whether the proposal is approved:
Supervisor (group) signature:        Date:
2. Opinion of the proposal review committee
Proposal date
Proposal venue
Review experts
Review result
Qualified: __ votes   Basically qualified: __ votes   Unqualified: __ votes
Conclusion
□ Pass  □ Pass in principle  □ Fail
Pass: all votes are "qualified"
Pass in principle: one vote is "basically qualified" or "unqualified", and the rest are "qualified" or "basically qualified"
Fail: two or more votes are "unqualified"
The committee's opinions and suggestions on the topic selection, research plan, and feasibility of implementation:
Committee signatures:
Date:
3. Opinion of the school:
Responsible person signature:        Date:


@@ -0,0 +1,65 @@
Literature Review for the Academic Master's Degree, University of Electronic Science and Technology of China
Name: Wu Bowen
Student ID: 202221090225
Discipline: Software Engineering
Review topic: Research on Visual SLAM Based on Instance Segmentation in Outdoor Dynamic Scenes
Supervisor's comments:
Supervisor signature:
Date:
Basis for Topic Selection and Research Significance
Simultaneous Localization and Mapping (SLAM) is the process by which a mobile platform equipped with sensors such as LiDAR, an IMU (Inertial Measurement Unit), and cameras localizes itself and builds a map of an unknown environment at the same time [1]. SLAM is generally divided into LiDAR SLAM and visual SLAM. LiDAR SLAM, which fuses LiDAR, wheel encoders, and an IMU, is relatively mature in both theory and application. However, LiDAR is expensive, which keeps it out of small personal devices, and the information it provides is limited. Visual SLAM builds the environment map with visual sensors such as monocular, stereo, and RGB-D (color plus depth) cameras. Cameras capture rich image information and are cheap, structurally simple, and compact, so visual SLAM has become a research hotspot in recent years and the main direction within SLAM. It can be widely applied to autonomous driving, autonomous robots, and guidance and obstacle avoidance for the visually impaired, so research on visual SLAM has practical significance.
After nearly two decades of development, the Visual Simultaneous Localization and Mapping (V-SLAM) framework has matured; it plays an important role in robotic visual perception, and state-of-the-art V-SLAM algorithms provide highly accurate localization and scene reconstruction [2]. Most current V-SLAM systems are built on the assumption of a non-dynamic environment, i.e., that no dynamic objects appear while the platform is tracking. This is a strong assumption that rarely holds in reality: indoor scenes contain moving people, chairs, and tables, and outdoor scenes contain moving vehicles and animals. Such dynamic objects severely affect V-SLAM systems, especially the front-end module. The SLAM front end can be solved in two ways: the direct method and the feature-point method. The direct method relies on the photometric-constancy assumption to describe how pixels move between images over time, assuming every pixel moves consistently between two frames, and estimates camera motion from this. Because images are affected by lighting and noise, photometric constancy often does not hold, and dynamic objects make direct methods even less reliable for motion estimation. The feature-point method is indirect: it first extracts feature points and then solves for camera motion from the matches and position changes of those points between frames. Feature points greatly improve V-SLAM localization accuracy; well-known open-source frameworks such as ORB-SLAM2 [3], ORB-SLAM3 [4], and VINS-Mono [5] all use them. But once dynamic objects appear, some feature points lie on those objects, and their motion breaks the consistency of feature displacement, severely biasing the camera motion estimate. This can cause back-end failure and a large drop in localization accuracy and cannot be ignored. As visual SLAM develops, how to handle dynamic effects has attracted wide attention and has important research value.
Research Status and Development Trends at Home and Abroad
2.1 Research status of visual SLAM
The visual SLAM problem can be traced back to the introduction of filtering techniques: Smith et al. proposed using state-estimation theory to handle robot localization and mapping [6]. Various filter-based SLAM systems followed, e.g., particle filters [7] and Kalman filters [8]. In 2007 visual SLAM achieved a major breakthrough when A. J. Davison et al. proposed the first monocular visual SLAM system, MonoSLAM [9]. Based on the Extended Kalman Filter (EKF), it was the first monocular visual SLAM system to run in real time; earlier algorithms processed pre-recorded video and could not operate online. The release of MonoSLAM marked the shift of visual SLAM research from theory to practical application and was a milestone. In the same year, Klein et al. proposed PTAM (Parallel Tracking And Mapping) [10], which innovatively ran tracking and mapping in parallel threads, alleviating MonoSLAM's high computational complexity; this parallel design remains mainstream in today's SLAM frameworks. PTAM adopted keyframes and nonlinear optimization rather than the filtering schemes prevalent at the time, opening the way for later optimization-based visual SLAM.
In 2014, Jakob Engel et al. from the Computer Vision Group at the Technical University of Munich proposed LSD-SLAM [11], a direct-method monocular visual SLAM algorithm that needs no feature points: it matches pixel information by minimizing photometric error, achieves good mapping results, and can generate semi-dense depth maps. It demonstrated the effectiveness of direct-method visual SLAM and laid the foundation for later work, but it still suffers from scale ambiguity and easily loses track when the camera moves quickly. In the same year, Forster et al. proposed SVO (semi-direct monocular visual odometry) [12], a sparse direct-method scheme that combines features with the direct method: feature points are used but their descriptors are not computed, and matching is done with the direct method on the pixels around each feature. SVO is fast, but it lacks a back end, so its camera motion estimates accumulate noticeable drift and its application scenarios are limited.
In 2015, Mur-Artal et al., drawing on PTAM's keyframes and parallel threads, proposed the ORB-SLAM framework [13], a monocular visual SLAM system based entirely on feature points, with three parallel threads for tracking, mapping, and loop closing. The tracking thread extracts ORB (oriented FAST and rotated BRIEF) features [14], the most classic part of the system; ORB features have good scale and rotation invariance and balance extraction speed and quality. The tracking thread also estimates the pose and selects new keyframes for mapping. The mapping thread receives the keyframes selected by tracking, removes redundant keyframes and map points, and performs global optimization. The loop-closing thread receives the filtered keyframes, performs loop detection against the other keyframes, and then updates the camera poses and the map. Thanks to the loop-closing thread, ORB-SLAM effectively eliminates accumulated drift and improves localization and mapping accuracy. However, the system only supported monocular cameras, had limited accuracy, and restricted application scenarios. With advances in cameras, Mur-Artal et al. improved ORB-SLAM in 2017, adding support for stereo and RGB-D cameras, and proposed ORB-SLAM2 [3]. Compared with the original, it supports three camera types and adds relocalization, global optimization, and map reuse, making it more robust.
In 2017, Qin Tong et al. from the Hong Kong University of Science and Technology proposed VINS-Mono [15], which fuses an IMU with a monocular camera: when visual information briefly fails, the IMU can estimate the pose, while visual information corrects IMU drift during optimization, and the combination performs well. In 2019 they proposed the improved VINS-Fusion [16], adding support for stereo cameras and GPS, with even better performance after fusion.
In 2020, Carlos Campos et al. proposed ORB-SLAM3 [4], which adds visual-inertial fusion on top of ORB-SLAM2 and is open source. Many parts of the algorithm were improved, e.g., a multi-map system and a new relocalization module, adapting to more scenarios while improving accuracy two- to three-fold over the previous version. At the end of 2021 the system was updated to V1.0; inheriting the strengths of ORB-SLAM2, it has become one of the most representative visual SLAM systems today.
2.2 Research status of dynamic SLAM
Many researchers have addressed the influence of dynamic objects and tried to solve visual SLAM in dynamic scenes. The main challenge is to detect dynamic objects and their feature points efficiently and to remove the dynamic feature points so that camera motion can be recovered.
The earliest approaches filtered out dynamic-object feature points with geometric constraints. WANG et al. [17] first used K-Means to cluster the 3D points computed by an RGB-D camera and used the epipolar constraint between consecutive images to measure the change in the number of inlier keypoints per region; regions with few inliers were deemed dynamic. The epipolar constraint is a common way to judge dynamic feature points, but its effectiveness drops sharply if an object moves fast between adjacent frames or moves along the epipolar line. To make better use of geometric information, researchers proposed using optical flow to improve dynamic-object detection. Fang et al. [18] used optical flow to detect where dynamic objects lie between images and filtered out their feature points; optical flow improves detection accuracy and effectively reduces the error of inter-frame epipolar constraints. Although geometric-constraint methods can mitigate dynamic feature points to some extent, with the development of deep learning the semantic information in images has received increasing attention.
There are now many strong deep networks, e.g., YOLO [19], SegNet [20], and Mask R-CNN [21]. Their powerful feature-extraction and semantic-extraction abilities help a SLAM system recognize dynamic objects more easily, providing semantic priors that allow their influence to be removed. Detect-SLAM, proposed by Fangwei Zhong et al. [22], uses an object-detection network to find dynamic people, cars, and so on; for real-time performance, detection is only run on keyframes, and all detected dynamic points are removed before recovering the camera pose. LIU and MIURA [23] proposed RDS-SLAM, built on ORB-SLAM3 [4]; it initializes a moving probability for objects from the segmentation results and propagates the probability to subsequent frames to separate dynamic from static points. Purely learning-based methods, however, only provide semantic information and cannot tell whether an object is actually moving, e.g., a standing person or a car parked at the roadside. Marking such objects as dynamic from semantics alone and discarding their feature points loses useful features and harms camera motion estimation. Deep learning alone therefore cannot fully resolve the influence of dynamic objects on SLAM.
Many studies have begun to combine semantic and geometric information. DS-SLAM, proposed by Chao Yu et al. at Tsinghua University [24], first performs semantic segmentation with SegNet and then filters moving objects with the epipolar constraint, with good results. Berta Bescos et al. were the first to use Mask R-CNN instance segmentation, proposing DynaSLAM [25], which combines dynamic-object segmentation based on multi-view geometry and depth with a region-growing algorithm and greatly reduces pose-estimation error. Runz et al. proposed MaskFusion, an object-aware semantic and dynamic RGB-D SLAM system [26]. Based on Mask R-CNN semantic segmentation and geometric segmentation, it runs segmentation and the SLAM pipeline in separate threads to keep the whole system real time, but its object boundaries often include background, leaving room for improvement. Ran et al. proposed RS-SLAM, an RGB-D SLAM that addresses the adverse effects of dynamic environments [27]. It uses semantic segmentation to identify dynamic objects and judges whether movable objects are actually moving from the geometric relation between dynamic and movable objects; the dynamic content is then removed, and the tracking module extracts ORB features from the remaining static background and estimates the camera pose.
Combining semantic information from deep learning with geometric information to solve the dynamic-scene problem in SLAM is becoming mainstream, but most of the above systems only remove dynamic-object feature points to recover the camera pose and do not estimate the objects' poses. Estimating camera motion while tracking dynamic objects and adding dynamic-object points into the optimization is developing into a trend. Henein et al. proposed Dynamic SLAM (Dynamic SLAM: The Need For Speed) [28], a new feature-based, model-free dynamic SLAM algorithm; it uses semantic segmentation to estimate the motion of rigid objects in the scene and extract their velocities, and its effectiveness was validated on various virtual and real datasets. DOT (Dynamic Object Tracking for visual SLAM), proposed by Javier Civera et al. [29], works mainly in the front end: it uses instance segmentation to generate masks for dynamic objects and tracks the objects by minimizing photometric reprojection error. AirDOS, proposed by Yuheng Qiu et al. at Carnegie Mellon University [30], introduces rigidity and motion constraints to model articulated objects and corrects camera pose estimation by jointly optimizing camera poses, object motion, and object 3D structure. VDO-SLAM [31] uses Mask R-CNN masks and optical flow to separate dynamic from static points, formulates SLAM in dynamic environments as a single graph optimization, and estimates camera and object poses simultaneously.
In summary, solving visual SLAM in dynamic scenes currently requires both geometric information and semantic information from deep learning: semantics localize objects more accurately, geometry reveals their true motion state, and the two are combined to estimate camera motion and track objects.
References
[1] Kong D L, Fang Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19.
[2] Engel J, Koltun V, Cremers D. Direct sparse odometry[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 40(3): 611-625.
[3] Mur-Artal R, Tardós J D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras[J]. IEEE Transactions on Robotics, 2017.
[4] Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM[J]. IEEE Transactions on Robotics, 2021, 37(6): 1874-1890.
[5] Qin T, Li P, Shen S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator[J]. IEEE Transactions on Robotics, 2018.
[6] Smith R, Self M, Cheeseman P. Estimating uncertain spatial relationships in robotics[J]. Machine Intelligence & Pattern Recognition, 1988, 5(5): 435-461.
[7] Grisetti G, Stachniss C, Burgard W. Improved techniques for grid mapping with Rao-Blackwellized particle filters[J]. IEEE Transactions on Robotics, 2007, 23(1): 34-46.
[8] Kalman R E. A new approach to linear filtering and prediction problems[J]. Journal of Basic Engineering, 1960, 82(Series D): 35-45. DOI: 10.1115/1.3662552.
[9] Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-time single camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 1052-1067.
[10] Klein G, Murray D. Parallel tracking and mapping for small AR workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007: 1-10.
[11] Engel J, Schöps T, Cremers D. LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision (ECCV), 2014: 834-849.
[12] Forster C, Pizzoli M, Scaramuzza D. SVO: Fast semi-direct monocular visual odometry[C]. IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 2014: 15-22.
[13] Mur-Artal R, Montiel J M M, Tardós J D. ORB-SLAM: A versatile and accurate monocular SLAM system[J]. IEEE Transactions on Robotics, 2015, 31(5): 1147-1163.
[14] Rublee E, Rabaud V, Konolige K, et al. ORB: An efficient alternative to SIFT or SURF[C]. 2011 International Conference on Computer Vision. IEEE, 2011: 2564-2571.
[15] Qin T, Li P, Shen S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator[J]. IEEE Transactions on Robotics, 2017, 99: 1-17.
[16] Qin T, Pan J, Cao S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. arXiv preprint arXiv:1901.03638, 2019.
[17] Wang R, Wan W, Wang Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143.
[18] Fang Y, Dai B. An improved moving target detecting and tracking based on optical flow technique and Kalman filter[C]. IEEE, 2009. DOI: 10.1109/ICCSE.2009.5228464.
[19] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, real-time object detection[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. DOI: 10.1109/CVPR.2016.91.
[20] Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
[21] He K, Gkioxari G, Dollár P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397.
[22] Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making object detection and SLAM mutually beneficial[C]. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018. DOI: 10.1109/WACV.2018.00115.
[23] Liu Y, Miura J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785.
[24] Yu C, et al. DS-SLAM: A semantic visual SLAM towards dynamic environments[C]. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018: 1168-1174.
[25] Bescos B, Fácil J M, Civera J, et al. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4): 4076-4083.
[26] Runz M, Buffier M, Agapito L. MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects[C]. 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2018: 10-20.
[27] Ran T, Yuan L, Zhang J, et al. RS-SLAM: A robust semantic SLAM in dynamic environments based on RGB-D sensor[J]. IEEE Sensors Journal, 2021, 21(18): 20657-20664.
[28] Henein M, Zhang J, Mahony R, et al. Dynamic SLAM: The need for speed[C]. 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020: 2123-2129.
[29] Ballester I, Fontán A, Civera J, et al. DOT: Dynamic object tracking for visual SLAM[J]. arXiv preprint arXiv:2010.00052, 2020.
[30] Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]. 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053.
[31] Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A visual dynamic object-aware SLAM system[J]. arXiv preprint arXiv:2005.11052, 2020.


@@ -0,0 +1,3 @@
# CUDA 11.1.1 development image on Ubuntu 18.04 as the base environment
FROM nvidia/cuda:11.1.1-devel-ubuntu18.04
# All subsequent commands and container sessions start in /root
WORKDIR /root

docker/wbw-slam/run.txt Normal file

@@ -0,0 +1,10 @@
// Create the containers: enable all GPUs, mount /home/wbw/data at /data,
// forward the display-related environment variables, and publish the
// container ports on the host
docker run --name wbw-slam --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8080:5901 -p 8081:20 wbw-slam /bin/bash
docker run --name wbw-docker --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8083:5901 -p 8084:20 wbw-docker /bin/bash
// Start the existing container and open a shell inside it
docker start wbw-slam
docker exec -it wbw-slam bash

Binary file not shown.

动态slam/06_tar.txt Normal file

File diff suppressed because it is too large.

Binary file not shown.


@@ -0,0 +1,38 @@
Open-source dynamic SLAM papers, 2020-2023
I. 2020
1.Zhang J, Henein M, Mahony R, et al. VDO-SLAM: a visual dynamic object-aware SLAM system[J]. arXiv preprint arXiv:2005.11052, 2020.
https://github.com/halajun/vdo_slam
2.Bescos B, Cadena C, Neira J. Empty cities: A dynamic-object-invariant space for visual SLAM[J]. IEEE Transactions on Robotics, 2020, 37(2): 433-451.
https://github.com/bertabescos/EmptyCities_SLAM
3.Vincent J, Labbé M, Lauzon J S, et al. Dynamic object tracking and masking for visual SLAM[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 4974-4979.
https://github.com/introlab/dotmask
II. 2021
1.Liu Y, Miura J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. Ieee Access, 2021, 9: 23772-23785.
https://github.com/yubaoliu/RDS-SLAM/
2.Bao R, Komatsu R, Miyagusuku R, et al. Stereo camera visual SLAM with hierarchical masking and motion-state classification at outdoor construction sites containing large dynamic objects[J]. Advanced Robotics, 2021, 35(3-4): 228-241.
https://github.com/RunqiuBao/kenki-positioning-vSLAM
3.Wimbauer F, Yang N, Von Stumberg L, et al. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 6112-6122.
https://github.com/Brummi/MonoRec
4.Wang W, Hu Y, Scherer S. Tartanvo: A generalizable learning-based vo[C]//Conference on Robot Learning. PMLR, 2021: 1761-1772.
https://github.com/castacks/tartanvo
5.Zhan H, Weerasekera C S, Bian J W, et al. DF-VO: What should be learnt for visual odometry?[J]. arXiv preprint arXiv:2103.00933, 2021.
https://github.com/Huangying-Zhan/DF-VO
III. 2022
1.Liu J, Li X, Liu Y, et al. RGB-D inertial odometry for a resource-restricted robot in dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 9573-9580.
https://github.com/HITSZ-NRSL/Dynamic-VINS
2.Song S, Lim H, Lee A J, et al. Dynavins: A visual-inertial slam for dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 11523-11530.
https://github.com/url-kaist/dynavins
3.Wang H, Ko J Y, Xie L. Multi-modal Semantic SLAM for Complex Dynamic Environments[J]. arXiv e-prints, 2022: arXiv: 2205.04300.
https://github.com/wh200720041/MMS_SLAM
4.Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053.
https://github.com/haleqiu/AirDOS
5.Cheng S, Sun C, Zhang S, et al. SG-SLAM: a real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information[J]. IEEE Transactions on Instrumentation and Measurement, 2022, 72: 1-12.
https://github.com/silencht/SG-SLAM
6.Esparza D, Flores G. The STDyn-SLAM: a stereo vision and semantic segmentation approach for VSLAM in dynamic outdoor environments[J]. IEEE Access, 2022, 10: 18201-18209.
https://github.com/DanielaEsparza/STDyn-SLAM
7.Shen S, Cai Y, Wang W, et al. DytanVO: Joint refinement of visual odometry and motion segmentation in dynamic environments[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 4048-4055.
https://github.com/castacks/DytanVO


@@ -0,0 +1,381 @@
Dynamic Object Tracking and Masking for Visual SLAM
Jonathan Vincent, Mathieu Labbé, Jean-Samuel Lauzon, François Grondin,
Pier-Marc Comtois-Rivet, François Michaud
arXiv:2008.00072v1 [cs.CV] 31 Jul 2020

Abstract— In dynamic environments, performance of visual SLAM techniques can be impaired by visual features taken from moving objects. One solution is to identify those objects so that their visual features can be removed for localization and mapping. This paper presents a simple and fast pipeline that uses deep neural networks, extended Kalman filters and visual SLAM to improve both localization and mapping in dynamic environments (around 14 fps on a GTX 1080). Results on the dynamic sequences from the TUM dataset using RTAB-Map as visual SLAM suggest that the approach achieves similar localization performance compared to other state-of-the-art methods, while also providing the position of the tracked dynamic objects, a 3D map free of those dynamic objects, and better loop closure detection, with the whole pipeline able to run on a robot moving at moderate speed.

I. INTRODUCTION

To perform tasks effectively and safely, autonomous mobile robots need accurate and reliable localization from their representation of the environment. Compared to LIDARs (Light Detection And Ranging sensors) and GPS (Global Positioning System), using visual images for Simultaneous Localization and Mapping (SLAM) adds significant information about the environment [1], such as color, textures, surface composition that can be used for semantic interpretation of the environment. Standard visual SLAM (vSLAM) techniques perform well in static environments by being able to extract stable visual features from images. However, in environments with dynamic objects (e.g., people, cars, animals), performance decreases significantly because visual features may come from those objects, making localization less reliable [1]. Deep learning architectures have recently demonstrated interesting capabilities to achieve semantic segmentation from images, outperforming traditional techniques in tasks such as image classification [2]. For instance, Segnet [3] is commonly used for semantic segmentation [4]. It uses an encoder and a decoder to achieve pixel-wise semantic segmentation of a scene.

This paper introduces a simple and fast pipeline that uses neural networks, extended Kalman filters and a vSLAM algorithm to deal with dynamic objects. Experiments conducted on the TUM dataset demonstrate the robustness of the proposed method. Our research hypothesis is that a deep learning algorithm can be used to semantically segment object instances in images using a priori semantic knowledge of dynamic objects, enabling the identification, tracking and removal of dynamic objects from the scenes using extended Kalman filters to improve both localization and mapping in vSLAM. By doing so, the approach, referred to as Dynamic Object Tracking and Masking for vSLAM (DOTMask)1, aims at providing the following benefits: 1) increased visual odometry performance; 2) increased quality of loop closure detection; 3) produce 3D maps free of dynamic objects; 4) tracking of dynamic objects; 5) modular and fast pipeline.

The paper is organized as follows. Section II presents related work of approaches taking into consideration dynamic objects during localization and during mapping. Section III describes our approach applied as a pre-processing module to RTAB-Map [5], a vSLAM approach. Section IV presents the experimental setup, and Section V provides comparative results on dynamic sequences taken from the TUM dataset.

II. RELATED WORK

Some approaches take into consideration dynamic objects during localization. For instance, BaMVO [6] uses a RGB-D camera to estimate ego-motion. It uses a background model estimator combined with an energy-based dense visual odometry technique to estimate the motion of the camera. Li et al. [7] developed a static point weighting method which calculates a weight for each edge point in a keyframe. This weight indicates the likelihood of that specific edge point being part of the static environment. Weights are determined by the movement of a depth edge point between two frames and are added to an Intensity Assisted Iterative Closest Point (IA-ICP) method used to perform the registration task in SLAM. Sun et al. [8] present a motion removal approach to increase the localization reliability in dynamic environments. It consists of three steps: 1) detecting moving objects' motion based on ego-motion compensated using image differencing; 2) using a particle filter for tracking; and 3) applying a Maximum-A-Posterior (MAP) estimator on depth images to determine the foreground. This approach is used as the frontend of Dense Visual Odometry (DVO) SLAM [9]. Sun et al. [10] uses a similar foreground technique but instead of using a MAP they use a foreground model which is updated on-line. All of these approaches demonstrate good localization results using the Technical University of Munich (TUM) dataset [11], however, mapping is yet to be addressed.

This work was supported by the Institut du véhicule innovant (IVI), Mitacs, InnovÉÉ and NSERC. J. Vincent, M. Labbé, J.-S. Lauzon, F. Grondin and F. Michaud are with the Interdisciplinary Institute for Technological Innovation (3IT), Dept. Elec. Eng. and Comp. Eng., Université de Sherbrooke, 3000 boul. de l'Université, Québec (Canada) J1K 0A5. P.-M. Comtois-Rivet is with the Institut du Véhicule Innovant (IVI), 25, boul. Maisonneuve, Saint-Jérôme, Québec (Canada), J5L 0A1. {Jonathan.Vincent2, Mathieu.m.Labbe, Jean-Samuel.Lauzon, Francois.Grondin2, Francois.Michaud}@USherbrooke.ca, Pmcrivet@ivisolutions.ca

1 https://github.com/introlab/dotmask
Depth Image RGB Image Instance segmentation Dynamic is then applied to the original depth image, resulting in a
DOS Object masked depth image (MDI). The DOS is also sent to the
Classes Tracking module. After computing a 3D centroid for each
masked object, the Tracking module predict the position and
MDI velocity of the objects. This information is then used by the
Moving Object Classification module (MOC) to classify the
MO-MDI Tracking/MOC Camera object as idle or not based on its class, its estimated velocity
World and its shape deformation. Moving objects are removed
Pose from the original depth image, resulting in the Moving
Object Masked Depth Image (MO-MDI). The original RGB
vSLAM image, the MDI and the MO-MDI are used by the vSLAM
algorithm. It uses the depth images as a mask for feature
Odometry extraction thus ignoring features from the masked regions.
The MO-MDI is used by the visual odometry algorithm of
Map the vSLAM approach while the MDI is used by both its
mapping and loop closure algorithms, resulting in a map free
Fig. 1: Architecture of DOTMask of dynamic objects while still being able to use the features
of the idle objects for visual odometry. The updated camera
SLAM++ [12] and Semantic Fusion [13] focus on pose is then used in the Tracking module to estimate the
the mapping aspect of SLAM in dynamic environments. position and velocity of the dynamic objects resulting in a
SLAM++ [12] is an object-oriented SLAM which achieves closed loop.
efficient semantic scene description using 3D object recog-
nition. SLAM++ defines objects using areas of interest A. Instance Segmentation
to subsequently locate and map them. However, it needs
predefined 3D object models to work. Semantic Fusion Deep learning algorithms such as Mask R-CNN recently
[13] creates a semantic segmented 3D map in real time proved to be useful to accomplish instance semantic seg-
using RGB-CNN [14], a convolutional deep learning neural mentation [4]. A recent and interesting architecture for
network, and a dense SLAM algorithm. However, SLAM++ fast instance segmentation is the YOLACT [18] and its
and Semantic Fusion do not address SLAM localization update YOLACT++ [19]. This network aims at providing
accuracy in dynamic environments, neither do they remove similar results as the Mask-RCNN or the Fully Convolutional
dynamic objects in the 3D map. Instance-aware Semantic Segmentation (FCIS) [20] but at a
much lower computational cost. YOLACT and YOLACT++
Other approaches use deep learning algorithm to provide can achieve real-time instance segmentation. Development in
improved localisation and mapping. Fusion++ [15] and MID- neural networks has been incredibly fast in the past few years
Fusion [16] uses object-level octree-based volumetric repre- and probably will be in the years to come. DOTMask was
sentation to estimate both the camera pose and the object designed the be modular and can easily change the neural
positions. They use deep learning techniques to segment ob- network used in the pipeline. In its current state, DOTMask
ject instances. DynaSLAM [17] proposes to combine multi- works with Mask-RCNN, YOLACT and YOLACT++. The
view geometry models and deep-learning-based algorithms YOLACT is much faster than the two others and the loss
to detect dynamic objects and to remove them from the im- in precision doesnt impact our results. This is why this
ages prior to a vSLAM algorithm. They also uses inpainting architecture is used in our tests. The instance segmentation
to recreate the image without object occlusion. DynaSLAM module takes the input RGB image and outputs the bounding
achieves impressive results on the TUM dataset. However, box, class and binary mask for each instance.
these approaches are not optimized for real-time operation.
B. Tracking Using EKF
III. DYNAMIC OBJECT TRACKING AND MASKING FOR
VSLAM Using the DOS from the Instance Segmentation module
and odometry from vSLAM, the Tracking module predicts
The objective of our work is to provide a fast and complete the pose and velocity of the objects in the world frame. This
solution for visual SLAM in dynamic environments. Figure is useful when the camera is moving at speed similar to the
1 illustrates the DOTMask pipeline. As a general overview objects to track (e.g., moving cars on the highway, robot
of the approach, a set of objects of interest (OOI) are defined following a pedestrian) or when idle objects have a high
using a priori knowledge and understanding of dynamic amount of features (e.g., person wearing a plaid shirt).
objects classes that can be found in the environment. Instance
segmentation is done using a neural network trained to First, the Tracking module receives the DOS and the
identify the object classes from an RGB image. For each original depth image as a set, defined as Dk = {d1k, ..., dkI },
dynamic object instance, its bounding box, class type and where dki = Tk, Bki , ζik is the object instance detected
binary mask are grouped for convenience and referred as the by the Instance Segmentation module, with i ∈ I, I =
dynamic object state (DOS). The binary mask of the DOS {1, ..., L}, L being the total number of object detection
in the frame at time k. T ∈ R^(m×n) is the depth image, B ∈ Z_2^(m×n) is the binary mask and ζ ∈ J is the class ID, with J = {1, ..., W}, and W is the number of total trained classes in the Instance Segmentation module.

The DOS and the original depth image are used by EKF to estimate the dynamic objects' positions and velocities. EKF provides steady tracking of each object instance corresponding to the object type detected by the neural network. An EKF is instantiated for each new object, and a priori knowledge from the set of dynamic object classes defines some of the filter's parameters. This instantiation is made using the following parameters: the class of the object, its binary mask and its 3D centroid position. The 3D centroid is defined as the center of the corresponding bounding box. If the tracked object is observed in the DOS, its position is updated accordingly, otherwise its predicted position using EKF is used. If no observations of the object are made for e number of frames, the object is considered removed from the scene and therefore the filter is discarded. The Tracking module outputs the estimated velocity of the objects to the MOC module. The MOC module will classify the objects as idle or not based on the object class, the filter velocity estimation and the object deformation.

To explain further how the Tracking module works, the following subsections present in more detail the Prediction and Update steps of EKF used by DOTMask.

1) Prediction: Let us define the hidden state x ∈ R^(6×1) as the 3D position and velocity of an object referenced in the global map in Cartesian coordinates. The a priori estimate of the state at time k ∈ N is predicted based on the previous state at time k-1 as in (1):

$$\hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1}, \qquad F = \begin{bmatrix} I_3 & \Delta t\, I_3 \\ 0_3 & I_3 \end{bmatrix} \tag{1}$$

where F ∈ R^(6×6) is the state transition matrix, Δt ∈ R+ is the time between each prediction, 0_3 is a 3×3 zero matrix and I_3 is a 3×3 identity matrix. Note that the value of Δt is redefined before each processing cycle.

The a priori estimate of the state covariance P_{k|k-1} ∈ R^(6×6) at time k is predicted based on the previous state at time k-1 as given by (2):

$$P_{k|k-1} = F\,P_{k-1|k-1}\,F^{T} + Q \tag{2}$$

where Q ∈ R^(6×6) is the process noise covariance matrix defined using the random acceleration model (3):

$$Q = \Gamma \Sigma \Gamma^{T}, \qquad \Gamma = \begin{bmatrix} \tfrac{\Delta t^{2}}{2} I_{3\times3} & \Delta t\, I_{3\times3} \end{bmatrix}^{T} \tag{3}$$

where Γ ∈ R^(6×3) is the mapping between the random acceleration vector a ∈ R^3 and the state x, and Σ ∈ R^(3×3) is the covariance matrix of a. The acceleration components a_x, a_y and a_z are assumed to be uncorrelated.

The dynamics of every detected object may vary greatly depending on its class. For instance, a car does not have the same dynamics as a mug. To better track different types of objects, a covariance matrix is defined for each class to better represent their respective process noise.

2) Update: In EKF, the Update step starts by evaluating the innovation ỹ_k defined as (4):

$$\tilde{y}_k = z_k - \hat{h}(\hat{x}_{k|k-1}) \tag{4}$$

where z_k ∈ R^3 is a 3D observation of a masked object in reference to the camera for each object instance, with z = [z_x z_y z_z]^T, z_x = (μ_x - C_x) z_z / f_x and z_y = (μ_y - C_y) z_z / f_y, where C_x and C_y are the principal point coordinates and f_x and f_y are the focal lengths expressed in pixels. z_z is approximated using the average depth from the masked region on the depth image. The expressions μ_x and μ_y stand for the center of the bounding box.

To simplify the following equations, (s, c) represent respectively the sine and cosine operations of the Euler angles φ, θ, ψ (roll, pitch, yaw). h(x_k) ∈ R^4 is the observation function which maps the true state space x_k to the observed state space z_k. ĥ(x_k) is the first three terms of h(x_k). However, in our case, the transform between those spaces is not linear, justifying the use of EKF. The non-linear rotation used to transform the estimated state x̂_k into the observed state z_k follows the (x, y, z) Tait-Bryan convention and is given by h(x̂_k) = [h_φ h_θ h_ψ 1], where:

$$\begin{aligned} h_\phi &= (c_\phi c_\theta)\hat{x}_x + (c_\phi s_\theta s_\psi - c_\psi s_\phi)\hat{x}_y + (s_\phi s_\psi + c_\phi c_\psi s_\theta)\hat{x}_z + c_x \\ h_\theta &= (c_\theta s_\phi)\hat{x}_x + (c_\phi c_\psi + s_\phi s_\theta s_\psi)\hat{x}_y + (c_\psi s_\phi s_\theta - c_\phi s_\psi)\hat{x}_z + c_y \\ h_\psi &= (-s_\theta)\hat{x}_x + (c_\theta s_\psi)\hat{x}_y + (c_\theta c_\psi)\hat{x}_z + c_z \end{aligned} \tag{5}$$

and c_x, c_y and c_z are the coordinates of the camera referenced to the world, which are derived using vSLAM odometry.

The innovation covariance S_k ∈ R^(3×3) is defined as follows, where the expression H_k ∈ R^(3×6) stands for the Jacobian of h(x̂_k):

$$S_k = H_k P_{k|k-1} H_k^{T} + R_k \tag{6}$$

where R_k ∈ R^(3×3) is the covariance of the observation noise; its diagonal terms stand for the imprecision of the RGB-D camera. The near optimal Kalman gain K_k ∈ R^(6×3) is defined as follows:

$$K_k = P_{k|k-1} H_k^{T} S_k^{-1} \tag{7}$$

Finally, the updated state estimate x̂_{k|k} and the covariance estimate are given respectively by (8) and (9):

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k \tag{8}$$

$$P_{k|k} = (I_6 - K_k H_k) P_{k|k-1} \tag{9}$$
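The prediction and update steps above can be summarized in the following numeric sketch; it is an illustration of Eqs. (1)-(9) with a simplified linear position observation (H = [I3 0]) rather than the camera-frame observation model h(x) used by DOTMask, so it is not the authors' implementation:

```python
import numpy as np

# Constant-velocity tracker sketch: state x = [position, velocity] in R^6.
# Simplification: the object position is observed directly, so the update
# is linear; DOTMask instead observes the object in the camera frame.
def predict(x, P, dt, sigma_a):
    F = np.block([[np.eye(3), dt * np.eye(3)],
                  [np.zeros((3, 3)), np.eye(3)]])
    G = np.vstack([0.5 * dt**2 * np.eye(3), dt * np.eye(3)])  # Eq. (3)
    Q = G @ (sigma_a**2 * np.eye(3)) @ G.T
    return F @ x, F @ P @ F.T + Q                              # Eqs. (1)-(2)

def update(x, P, z, R):
    H = np.hstack([np.eye(3), np.zeros((3, 3))])
    y = z - H @ x                                              # Eq. (4)
    S = H @ P @ H.T + R                                        # Eq. (6)
    K = P @ H.T @ np.linalg.inv(S)                             # Eq. (7)
    return x + K @ y, (np.eye(6) - K @ H) @ P                  # Eqs. (8)-(9)
```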
C. Moving Object Classification

The MOC module classifies dynamic objects as either moving or idle. It takes as inputs the dynamic object's class, velocity and mask. The object velocity comes from the tracking module estimation. The object class and mask are directly obtained from the DOS. The object class defines if the object is rigid or not. The deformation of a non-rigid object is computed using the intersection over union (IoU) of the masks of the object at time k and k-1. The IoU algorithm takes two arbitrary convex shapes M_{k-1}, M_k and is defined as IoU = |M_k ∩ M_{k-1}| / |M_k ∪ M_{k-1}|, where | · | is the cardinality of the set. A dynamic object is classified as moving if its velocity is higher than a predefined threshold or if it is a non-rigid object with an IoU above another predefined threshold. The original depth image is then updated, resulting in the MO-MDI. The MO-MDI is sent to the vSLAM odometry to update the camera pose.

TABLE I: Experimental Parameters (description / value, as recoverable from the extraction)
Frames to terminate object tracking: 10
Score threshold (s): 0.1
Maximum number of observations (m): 5
Velocity threshold for a person: 0.01 m/sec
Velocity threshold for the other objects: 0.1 m/sec
Random acceleration for a person: 0.62 m/s^2
Random acceleration for other objects: 1.0 m/s^2

Fig. 2: RTAB-Map features (colored dots) not appearing on moving objects with DOTMask. (a) Original RGB image; (b) RGB and depth image superposed without DOTMask; (c) RGB and depth image superposed with DOTMask.
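As an illustration of the mask-deformation test described above (not the authors' code; threshold values are placeholders and the comparison direction follows the wording of the text):

```python
import numpy as np

# IoU of two binary masks and the moving/idle rule described in the text:
# high speed marks the object as moving, and for non-rigid objects the mask
# IoU between frames k-1 and k is compared against a deformation threshold.
def mask_iou(mask_prev, mask_curr):
    inter = np.logical_and(mask_prev, mask_curr).sum()
    union = np.logical_or(mask_prev, mask_curr).sum()
    return inter / union if union else 0.0

def is_moving(speed, vel_thresh, rigid, iou=None, iou_thresh=0.5):
    deforming = (not rigid) and iou is not None and iou > iou_thresh
    return speed > vel_thresh or deforming
```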
IV. EXPERIMENTAL SETUP dataset, along with their superimposed RGB and depth
images with features used by RTAB-Map (Fig. 2b) and with
To test our DOTMask approach, we chose to use the TUM DOTMask (Fig. 2c). Using the depth image as a mask to
dataset because it presents challenging indoor dynamic RGB- filter outlying features, dynamic objects (i.e., humans and
D sequences with ground truth to evaluate visual odometry chairs in this case) are filtered out because the MDI includes
techniques. Also, TUM is commonly used to compare with the semantic mask. The MO-MDI is used by RTAB-Map
other state-of-the-art techniques. We used sequences in low to compute visual odometry, keeping only the features from
dynamic and highly dynamic environments. static objects as seen in Fig. 2c (left vs right) with the colored
dots representing visual features used for visual odometry. In
For our experimental setup, ROS is used as a middleware the left image of Fig. 2c, the man on the left is classified
to make the interconnections between the input images, by the Tracking module as moving, while the man on the
segmentation network, EKF and RTAB-Map. The deep learn- right is classified as being idle, resulting in keeping his
ing library PyTorch is used for the instance segmentation visual features. In the rigth image of Fig. 2c, the man on the
algorithm. The ResNet-50-FPN backbone is used for the right is also classified as moving because he is standing up,
YOLACT architecture because this configuration achieves masking his visual features. Figure 3 illustrates the influence
the best results at a higher framerate [18]. Our Instance of MDI, which contains the depth mask of all the dynamic
segmentation module is based on the implementation of objects, either idle or not, to generate a map free of dynamic
YOLACT by dbolya2 and its pre-trained weights. The net- objects. This has two benefits: it creates a more visually
work is trained on all 91 classes of the COCO dataset. accurate 3D rendered map, and it improves loop closure
The COCO dataset is often used to compare state-of-the-art detection. The differences in the 3D generated maps between
instance segmentation approaches, which is why we chose to RTAB-Map without and with DOTMask are very apparent:
use it in our trials. In our tests, person, chair, cup and bottle there are less artifacts of dynamic objects and less drifting.
are the the OOI used because of their presence in the TUM The fr3/walking static sequence shows improved quality in
dataset and in our in-house tests.The RTAB-Map library [5] the map, while the fr3/walking rpy sequence presents some
is also used, which includes various state-of-the-art visual undesirable artifacts. These artifacts are caused either by the
odometry algorithms, a loop closure detection approach and mask failing to identify dynamic objects that are tilted or
a 3D map render. upside down or by the time delay between the RGB image
and its corresponding depth image. The fr3/sitting static
Table I presents the parameters used for DOTMask in our
trials, based on empirical observations in the evaluated TUM
sequences and our understanding of the nature of the objects.
A probability threshold p and a maximum instance number
m are used to reduce the number of object instances to feed
into the pipeline. Only detections with a score above p are
used and at maximum, m objects detections are processed.
This provides faster and more robust tracking.
V. RESULTS
Trials were conducted in comparison with approaches
by Kim and Kim [6], Sun et al. [8], Bescos et al. [17]
and RTAB-Map, the latter being also used with DOTMask.
Figure 2a shows two original RGB frames in the TUM
2https://github.com/dbolya/yolact
TABLE II: Absolute Transitional Error (ATE) RMSE in cm TABLE IV: Timing Analysis
TUM Seqs BaMVO Aproach Img. Res. Avg. Time CPU GPU
Sun et al.
DynaSLAM BaMVO. 320×240 42.6 ms i7 3.3GHz -
RTAB-Map Sun et al. 640×480 500 ms i5 -
DOTMask DynaSLAM 640×480 500 ms - -
Impr. (%) DOTMask 640×480 70 ms GTX1080
DOTMask 640×480 125 ms i5-8600K GTX1050
fr3/sit static 2.48 - - 1.70 0.60 64.71 i7-8750H
fr3/sit xyz 1.60 1.80 -12.50
fr3/wlk static 4.82 3.17 1.5 10.7 0.80 92.52
fr3/wlk xyz 24.50 2.10 91.42
fr3/wlk rpy 13.39 0.60 2.61 22.80 5.30 76.75
fr3/wlk halfsph 14.50 4.00 72.41
23.26 9.32 1.50 a mobile robot operating at a moderate speed. The fastest
method is BaMVO with only 42 ms cycle time.
35.84 13.33 3.50
Figure 4 shows the tracked dynamic objects in the ROS
17.38 12.52 2.50 visualizer RViz. DOTMask generates ROS transforms to
track the position of the objects. Those transforms could
TABLE III: Loop Closure Analysis easily be used in other ROS applications. Figure 5 shows the
difference between RTAB-Map and DOTMask in a real scene
TUM Seqs RTAB-Map DOTMask where a robot moves at a similar speed as dynamic objects
Nb Terr Rerr Nb Terr Rerr (chairs and humans). The pink and blue lines represent the
fr3/sit static loop (cm) (deg) loop (cm) (deg) odometry of RTAB-Map without and with DOTMask. These
fr3/sit xyz results suggest qualitatively that DOTMask improves the
fr3/wlk static 33 1.80 0.26 1246 0.60 0.21 odometry and the 3D map.
fr3/wlk xyz 288 2.10 0.42 1486 2.50 0.45
fr3/wlk halfs. 105 9.00 0.18 1260 7.00 0.15 VI. CONCLUSION
fr3/wlk rpy 55 6.5 0.99 1516 2.9 0.45
121 5.90 0.84 964 4.90 0.79 This paper presents DOTMask, a fast and modular pipeline
94 6.7 1.06 965 6.00 1.04 that uses a deep learning algorithm to semantically segment
images, enabling the tracking and masking of dynamic
shows the result when masking idle object, resulting in objects in scenes to improve both localization and mapping in
completely removing the dynamic objects from the scene. vSLAM. Our approach aims at providing a simple and com-
plete pipeline to allow mobile robots to operate in dynamic
Table II characterizes the overall SLAM quality in terms environments. Results on the TUM dataset suggest that using
of absolute trajectory error (ATE). In almost all cases, DOTMask with RTAB-Map provides similar performance
DOTMask improves the ATE compared to RTAB-Map alone compared to other state-of-the-art localization approaches
(as seen in the last column of the table). Table II characterizes while providing an improved 3D map, dynamic objects
the overall SLAM quality in terms of absolute trajectory tracking and higher loop closure detection. While DOTMask
error (ATE). While DynaSLAM is better in almost every does not outperform DynaSLAM on the TUM dataset or
sequences, DOTMask is not far off with closer values com- outrun BaMVO, it reveals to be a good compromise for
pared to the other techniques. robotic applications. Because DOTMask pipeline is highly
modular, it can also evolve with future improvements of
Table III presents the number of loop closure detections, deep learning architectures and new sets of dynamic object
the mean translation error (Terr) and the mean rotational classes. In future work, we want to use the tracked dynamic
error (Rerr) on each sequences both with and without objects to create a global 3D map with object permanence,
DOTMask. In all sequences, DOTMask helps RTAB-Map and explore more complex neural networks3 to add body
to make more loop closures while also lowering both mean keypoint tracking, which could significantly improve human
errors. Since loop closure features are computed from the feature extraction. We would also like to explore techniques
depth image (MDI), using DOTMask forces RTAB-Map to to detect outlier segmentations from the neural network to
use only features from static object hence providing better improve robustness.
loop closures.
REFERENCES
On the fr3/sitting xyz sequence, RTAB-Map alone pro-
vides better performance in both ATE and loop closure [1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendo´n-Mancha,
detection. In this entire sequence, the dynamic objects do “Visual simultaneous localization and mapping: A survey,” Artificial
not move. While the MO-MDI enables features from idle Intelligence Review, vol. 43, no. 1, pp. 5581, 2015.
dynamic objects to be used by the odometry algorithm, the
MDI does not enables those same features for the loop [2] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural
closure algorithm. Since nothing is moving in this particular networks for image classification,” in Proc. IEEE Conf. Computer
sequence, all features will help to provide a better locali- Vision and Pattern Recognition, 2012, pp. 36423649.
sation. However, this case is not representative of dynamic
environments. [3] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep
convolutional encoder-decoder architecture for image segmentation,”
Table IV presents the average computation time to process IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39,
a frame for each approach without vSLAM and odometry no. 12, pp. 24812495, 2017.
algorithms. Results are processed on a computer equipped
with a GTX 1080 GPU and a I5-8600k CPU. DOTMask was 3https://github.com/daijucug/Mask-RCNN-TF detection-human segment-
also tested on a laptop with a GTX 1050 where it achieved body keypoint-regression
an average of 8 frames per second. At 70 ms, it can run on
(a) fr3/sitting static (b) fr3/walking static (c) fr3/walking rpy
Fig. 3: RTAB-Map 3D rendered map from the TUM sequences, without (top) and with (bottom) DOTMask
Fig. 4: Position of tracked dynamic objects shown in RVIZ static point weighting,” IEEE Robotics and Automation Letters, vol. 2,
no. 4, pp. 22632270, 2017.
(a) RTAB-Map alone (b) RTAB-Map with DOTMask [8] Y. Sun, M. Liu, and M. Q.-H. Meng, “Improving RGB-D SLAM in
dynamic environments: A motion removal approach,” Robotics and
Fig. 5: 3D map and odometry improved with DOTMask Autonomous Systems, vol. 89, pp. 110122, 2017.

View File

@ -0,0 +1,679 @@
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera
Felix Wimbauer1, Nan Yang1,2, Lukas von Stumberg1, Niclas Zeller1,2, Daniel Cremers1,2
1 Technical University of Munich, 2 Artisense
{wimbauer, yangn, stumberg, zellern, cremers}@in.tum.de
Indicates equal contribution.
Abstract
In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other multi-view stereo methods, MonoRec is able to reconstruct both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we furthermore demonstrate that MonoRec is able to generalize well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera. Code and related materials are available at https://vision.in.tum.de/research/monorec.

Figure 1: MonoRec can deliver high-quality dense reconstruction from a single moving camera. The figure shows an example of a large-scale outdoor point cloud reconstruction (KITTI Odometry sequence 07) by simply accumulating predicted depth maps. Please refer to our project page for the video of the entire reconstruction of the sequence.

1. Introduction
1.1. Real-world Scene Capture from Video
Obtaining a 3D understanding of the entire static and dynamic environment can be seen as one of the key challenges in robotics, AR/VR, and autonomous driving. As of today, this is achieved based on the fusion of multiple sensor sources (incl. cameras, LiDARs, RADARs and IMUs). This guarantees dense coverage of the vehicle's surroundings and accurate ego-motion estimation. However, driven by the high cost as well as the challenge of maintaining cross-calibration of such a complex sensor suite, there is an increasing demand for reducing the total number of sensors. Over the past years, researchers have therefore put a lot of effort into solving the problem of perception with only a single monocular camera. Considering recent achievements in monocular visual odometry (VO) [8, 58, 51] with respect to ego-motion estimation, this was certainly successful. Nevertheless, reliable dense 3D mapping of the static environment and moving objects is still an open research topic.

To tackle the problem of dense 3D reconstruction based on a single moving camera, there are basically two parallel lines of research. On one side, there are dense multi-view stereo (MVS) methods, which evolved over the last decade [39, 45, 2] and saw a great improvement through the use of convolutional neural networks (CNNs) [23, 61, 57]. On the other side, there are monocular depth prediction methods which purely rely on deep learning [7, 16, 58]. Though all these methods show impressive performance, both types also have their respective shortcomings. For MVS the overall assumption is a stationary environment to be reconstructed, so the presence of dynamic objects deteriorates their performance. Monocular depth prediction methods, in contrast, perform very well in reconstructing moving objects, as predictions are made only based on individual images. At the same time, due to their use of a single image only, they strongly rely on the perspective appearance of objects as observed with specific camera intrinsics and extrinsics and therefore do not generalize well to other datasets.

1.2. Contribution
To combine the advantages of both deep MVS and monocular depth prediction, we propose MonoRec, a novel monocular dense reconstruction architecture that consists of a MaskModule and a DepthModule. We encode the information from multiple consecutive images using cost volumes which are constructed based on the structural similarity index measure (SSIM) [54] instead of the sum of absolute differences (SAD) like prior works. The MaskModule is able to identify moving pixels and downweights the corresponding voxels in the cost volume. Thereby, in contrast to other MVS methods, MonoRec does not suffer from artifacts on moving objects and therefore delivers depth estimations for both static and dynamic objects.

With the proposed multi-stage training scheme, MonoRec achieves state-of-the-art performance compared to other MVS and monocular depth prediction methods on the KITTI dataset [14]. Furthermore, we validate the generalization capabilities of our network on the Oxford RobotCar dataset [35] and the TUM-Mono dataset [9]. Figure 1 shows a dense point cloud reconstructed by our method on one of our test sequences of KITTI.

2. Related Work
2.1. Multi-view Stereo
Multi-view stereo (MVS) methods estimate a dense representation of the 3D environment based on a set of images with known poses. Over the past years, several methods have been developed to solve the MVS problem [46, 28, 30, 2, 47, 49, 39, 13, 45, 60] based on classical optimization. Recently, due to the advance of deep neural networks (DNNs), different learning based approaches were proposed. This representation can be volumetric [26, 27, 36] or 3D point cloud based [3, 12]. Most popular are still depth map representations predicted from a 3D cost volume [23, 53, 61, 66, 22, 56, 41, 24, 33, 62, 19, 64, 57]. Huang et al. [23] proposed one of the first cost-volume based approaches. They compute a set of image-pair-wise plane-sweep volumes with respect to a reference image and use a CNN to predict one single depth map based on this set. Zhou et al. [66] also use the photometric cost volumes as the inputs of the deep neural networks and employ a two-stage approach for dense depth prediction. Yao et al. [61] instead calculate a single cost volume using deep features of all input images.

2.2. Dense Depth Estimation in Dynamic Scenes
Reconstructing dynamic scenes is challenging since the moving objects violate the static-world assumption of classical multi-view stereo methods. Russell et al. [43] and Ranftl et al. [40] build on motion segmentation and perform classical optimization. Li et al. [32] proposed to estimate dense depth maps from scenes with moving people. All these methods need additional inputs, e.g., optical flow, object masks, etc., for the inference, while MonoRec requires only the posed images as inputs. Another line of research is monocular depth estimation [7, 6, 29, 31, 11, 59, 16, 48, 67, 63, 65, 52, 18, 17, 58]. These methods are not affected by moving objects, but the depth estimation is not necessarily accurate, especially in unseen scenarios. Luo et al. [34] proposed a test-time optimization method which is not real-time capable. In a concurrent work, Watson et al. [55] address moving objects with the consistency between monocular depth estimation and multi-view stereo, while MonoRec predicts the dynamic masks explicitly by the proposed MaskModule.

2.3. Dense SLAM
Several of the methods cited above solve both the problem of dense 3D reconstruction and camera pose estimation [48, 67, 63, 65, 66, 59, 58]. Nevertheless, these methods either solve both problems independently or only integrate one into the other (e.g. [66, 58]). Newcombe et al. [37] instead jointly optimize the 6DoF camera pose and the dense 3D scene structure. However, due to its volumetric map representation it is only applicable to small-scale scenes. Recently, Bloesch et al. [1] proposed a learned code representation which can be optimized jointly with the 6DoF camera poses. This idea is pursued by Czarnowski et al. [5] and integrated into a full SLAM system. All the above-mentioned methods, however, do not address the issue of moving objects. Instead, the proposed MonoRec network explicitly deals with moving objects and achieves superior accuracy both on moving and on static structures. Furthermore, prior works show that the accuracy of camera tracking does not necessarily improve with more
points [8, 10]. MonoRec therefore focuses solely on delivering dense reconstruction using poses from a sparse VO system and shows state-of-the-art results on public benchmarks. Note that, this way, MonoRec can be easily combined with any VO system with arbitrary sensor setups.

Figure 2: MonoRec Architecture: It first constructs a photometric cost volume from multiple input frames. Unlike prior works, we use the SSIM [54] metric instead of SAD to measure the photometric consistency. The MaskModule aims to detect inconsistencies between the different input frames to determine moving objects. The multi-frame cost volume C is multiplied with the predicted mask and then passed to the DepthModule which predicts a dense inverse depth map. In both the decoders of MaskModule and DepthModule, the cost volume features are concatenated with pre-trained ResNet-18 features.

3. The MonoRec Network
MonoRec uses a set of consecutive frames and the corresponding camera poses to predict a dense depth map for the given keyframe. The MonoRec architecture combines a MaskModule and a DepthModule. MaskModule predicts moving object masks that improve depth accuracy and allow us to eliminate noise in 3D reconstructions. DepthModule predicts a depth map from the masked cost volume. In this section, we first describe the different modules of our architecture, and then discuss the specialized multi-stage semi-supervised training scheme.

3.1. Preliminaries
Our method aims to predict a dense inverse depth map $D_t$ of the selected keyframe from a set of consecutive frames $\{I_1, \cdots, I_N\}$. We denote the selected keyframe as $I_t$ and the others as $I_{t'}$ ($t' \in \{1, \cdots, N\} \setminus t$). Given the camera intrinsics, the inverse depth map $D_t$, and the relative camera pose $T_t^{t'} \in SE(3)$ between $I_{t'}$ and $I_t$, we can perform the reprojection from $I_{t'}$ to $I_t$ as

$I_{t' \to t} = I_{t'}\left\langle \mathrm{proj}\left(D_t, T_t^{t'}\right)\right\rangle$,   (1)

where $\mathrm{proj}(\cdot)$ is the projection function and $\langle\cdot\rangle$ is the differentiable sampler [25]. This reprojection formulation is important for both the cost volume formation (Sec. 3.2) and the self-supervised loss term (Sec. 3.4).

In the following, we refer to the consecutive frames as temporal stereo (T) frames. During training, we use an additional static stereo (S) frame $I_t^S$ for each sample, which was captured by a synchronized stereo camera at the same time as the respective keyframe.
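As a concrete illustration of the reprojection in Eq. (1), the following PyTorch sketch warps an image into the keyframe using the keyframe's inverse depth, the intrinsics, and a relative pose. It is a minimal sketch under stated assumptions, not the authors' implementation: function and argument names are invented, and for the cost volume the same warp would be applied with a constant depth hypothesis instead of the full depth map.

    import torch
    import torch.nn.functional as F

    def reproject(img_tprime, inv_depth_t, K, T_tprime_t):
        """Warp I_t' into the keyframe view, cf. Eq. (1).

        img_tprime:  (B, C, H, W) image of the other frame I_t'
        inv_depth_t: (B, 1, H, W) keyframe inverse depth D_t
        K:           (B, 3, 3) camera intrinsics
        T_tprime_t:  (B, 4, 4) transform from keyframe coordinates to I_t'
        """
        B, _, H, W = inv_depth_t.shape
        device = inv_depth_t.device

        # Keyframe pixel grid in homogeneous coordinates.
        ys, xs = torch.meshgrid(
            torch.arange(H, device=device, dtype=torch.float32),
            torch.arange(W, device=device, dtype=torch.float32),
            indexing="ij")
        ones = torch.ones_like(xs)
        pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, 3, -1)

        # Back-project with depth = 1 / inverse depth.
        depth = 1.0 / inv_depth_t.clamp(min=1e-6)
        cam_pts = torch.inverse(K) @ pix * depth.view(B, 1, -1)

        # Transform into the frame of I_t' and project again.
        cam_pts_h = torch.cat(
            [cam_pts, torch.ones(B, 1, cam_pts.shape[-1], device=device)], dim=1)
        pts_tprime = (T_tprime_t @ cam_pts_h)[:, :3]
        proj = K @ pts_tprime
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

        # Normalize to [-1, 1] and sample with the differentiable sampler.
        u = 2.0 * uv[:, 0] / (W - 1) - 1.0
        v = 2.0 * uv[:, 1] / (H - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
        return F.grid_sample(img_tprime, grid, padding_mode="border", align_corners=True)

The bilinear grid_sample call plays the role of the differentiable sampler [25], so gradients flow back to the predicted inverse depth.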
3.2. Cost Volume
A cost volume encodes geometric information from the different frames in a tensor that is suited as input for neural networks. For a number of discrete depth steps, the temporal stereo frames are reprojected to the keyframe and a pixel-wise photometric error is computed. Ideally, the lower the photometric error, the better the depth step approximates the real depth at a given pixel. Our cost volume follows the general formulation of the prior works [37, 66]. Nevertheless, unlike the previous works that define the photometric error $pe(\cdot)$ as a patch-wise SAD, we propose to use the SSIM as follows:

$pe_{t'}(x, d) = \frac{1 - \mathrm{SSIM}\left(I_{t' \to t}(x, d),\, I_t(x)\right)}{2}$   (2)

with a 3x3 patch size. Here $I_{t' \to t}(x, d)$ defines the intensity at pixel $x$ of the image $I_{t'}$ warped with constant depth $d$. In practice, we clamp the error to $[0, 1]$. The cost volume $C$ stores at $C(x, d)$ the aggregated photometric consistency for pixel $x$ and depth $d$:

$C(x, d) = 1 - \frac{2}{\sum_{t'} \omega_{t'}(x)} \cdot \sum_{t'} pe_{t'}(x, d)\, \omega_{t'}(x)$,   (3)

where $d \in \{d_i \mid d_i = d_{\min} + \tfrac{i}{M}\,(d_{\max} - d_{\min})\}$. The weighting term $\omega_{t'}(x)$ weights the optimal depth step higher based on the photometric error while others are weighted lower:

$\omega_{t'}(x) = 1 - \frac{1}{M - 1} \sum_{d \neq d_{t'}^{*}} \exp\left(-\alpha\left(pe_{t'}(x, d) - pe_{t'}(x, d_{t'}^{*})\right)^{2}\right)$   (4)

with $d_{t'}^{*} = \arg\min_d pe_{t'}(x, d)$. Note that $C(x, d)$ has the range $[-1, 1]$, where $-1$/$1$ indicates the lowest/highest photometric consistency.

In the following section, we denote cost volumes calculated based on the keyframe $I_t$ and only one non-keyframe $I_{t'}$ by $C_{t'}(x, d)$ where applicable.
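To make Eq. (2) and Eq. (3) concrete, the sketch below computes an SSIM-based photometric error and aggregates it into a cost volume. It is only an illustration under assumptions: the 3x3 SSIM uses standard constants, the per-frame weights of Eq. (4) are assumed to be given, and the tensor layout is a choice made here rather than the authors' code.

    import torch
    import torch.nn.functional as F

    def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
        # 3x3 patch statistics, matching the 3x3 patch size stated for Eq. (2).
        mu_x = F.avg_pool2d(x, 3, 1, 1)
        mu_y = F.avg_pool2d(y, 3, 1, 1)
        sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
        sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
        num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
        den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
        return num / den

    def cost_volume(keyframe, warped_stack, weights):
        """Aggregate SSIM-based photometric errors into C(x, d), cf. Eq. (2)-(3).

        keyframe:     (B, C, H, W) keyframe image I_t
        warped_stack: (T, D, B, C, H, W) frames I_t' warped to the keyframe at D
                      constant depth hypotheses (e.g. via the reprojection sketch above)
        weights:      (T, B, 1, H, W) per-frame weights w_t'(x) from Eq. (4), assumed given
        """
        T, D = warped_stack.shape[:2]
        slices = []
        for d in range(D):
            pe_d = torch.stack([
                ((1 - ssim(warped_stack[t, d], keyframe)) / 2)
                .clamp(0, 1).mean(dim=1, keepdim=True)        # average over color channels
                for t in range(T)])                           # (T, B, 1, H, W)
            weighted = (pe_d * weights).sum(dim=0) / weights.sum(dim=0).clamp(min=1e-6)
            slices.append(1 - 2 * weighted)                   # maps errors into [-1, 1]
        return torch.stack(slices, dim=2)                     # (B, 1, D, H, W)

With zero photometric error everywhere the slice value is 1, and with maximal error it is -1, which matches the stated range of C(x, d).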
3.3. Network Architecture
As shown in Figure 2, the proposed network architecture contains two sub-modules, namely, MaskModule and DepthModule.

MaskModule. MaskModule aims to predict a mask $M_t$ where $M_t(x) \in [0, 1]$ indicates the probability of a pixel $x$ in $I_t$ belonging to a moving object. Determining moving objects from $I_t$ alone is an ambiguous task and hard to generalize. Therefore, we propose to use the set of cost volumes $\{C_{t'} \mid t' \in \{1, \cdots, N\} \setminus t\}$ which encode the geometric priors between $I_t$ and $\{I_{t'} \mid t' \in \{1, \cdots, N\} \setminus t\}$, respectively. We use $C_{t'}$ instead of $C$ since the inconsistent geometric information from different $C_{t'}$ is a strong prior for moving object prediction: dynamic pixels yield inconsistent optimal depth steps in different $C_{t'}$. However, geometric priors alone are not enough to predict moving objects, since poorly-textured or non-Lambertian surfaces can lead to inconsistencies as well. Furthermore, the cost volumes tend to reach a consensus on wrong depths that semantically don't fit into the context of the scene for objects that move at constant speed. Therefore, we further leverage pre-trained ResNet-18 [21] features of $I_t$ to encode semantic priors in addition to the geometric ones. The network adapts a U-Net architecture design [42] with skip connections. All cost volumes are passed through encoders with shared weights. The features from different cost volumes are aggregated using max-pooling and then passed through the decoder. In this way, MaskModule can be applied to different numbers of frames without retraining.
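The shared-encoder plus max-pool design is what makes MaskModule agnostic to the number of input frames. The sketch below shows only that fusion pattern; the layer sizes, channel counts, and the plain convolutional stand-in for the U-Net decoder (and the ResNet-18 feature concatenation, which is omitted) are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class MaskAggregation(nn.Module):
        """Shared per-cost-volume encoder with max-pool fusion over a variable
        number of cost volumes C_t' (illustrative sketch only)."""
        def __init__(self, depth_steps=32, feat=64):
            super().__init__()
            self.encoder = nn.Sequential(                 # shared weights for every C_t'
                nn.Conv2d(depth_steps, feat, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True))
            self.decoder = nn.Sequential(                 # stands in for the U-Net decoder
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat, 1, 3, padding=1), nn.Sigmoid())

        def forward(self, cost_volumes):
            # cost_volumes: list of (B, D, H, W) tensors, one per non-keyframe I_t'.
            feats = torch.stack([self.encoder(cv) for cv in cost_volumes])  # (T, B, F, H, W)
            fused = feats.max(dim=0).values   # max-pool over frames: frame-count agnostic
            return self.decoder(fused)        # moving-object probability mask M_t

Because the fusion is an element-wise maximum over however many encoded cost volumes are supplied, the same trained weights can be evaluated with two, four, or more frames.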
DepthModule. DepthModule predicts a dense pixel-wise inverse depth map $D_t$ of $I_t$. To this end, the module receives the complete cost volume $C$ concatenated with the keyframe $I_t$. Unlike MaskModule, here we use $C$ instead of $C_{t'}$ since multi-frame cost volumes in general lead to higher depth accuracy and robustness against photometric noise [37]. To eliminate wrong depth predictions for moving objects, we perform pixel-wise multiplication between $M_t$ and the cost volume $C$ for every depth step $d$. This way, there won't be any maxima (i.e. strong priors) left in regions of moving objects, such that DepthModule has to rely on information from the image features and the surroundings to infer the depth of moving objects. We employ a U-Net architecture with multi-scale depth outputs from the decoder [17]. Finally, DepthModule outputs an interpolation factor between $d_{\min}$ and $d_{\max}$. In practice, we use $s = 4$ scales of depth prediction.

3.4. Multi-stage Training
In this section, we propose a multi-stage training scheme for the networks. Specifically, the bootstrapping stage, the MaskModule refinement stage and the DepthModule refinement stage are executed successively.

Bootstrapping. In the bootstrapping stage, MaskModule and DepthModule are trained separately. DepthModule takes the non-masked $C$ as the input and predicts $D_t$. The training objective of DepthModule is defined as a multi-scale ($s \in [0, 3]$) semi-supervised loss. It combines a self-supervised photometric loss and an edge-aware smoothness term, as proposed in [17], with a supervised sparse depth loss:

$\mathcal{L}_{depth} = \sum_{s=0}^{3} \mathcal{L}_{self,s} + \alpha \mathcal{L}_{sparse,s} + \beta \mathcal{L}_{smooth,s}$.   (5)

The self-supervised loss is computed from the photometric errors between the keyframe and the reprojected temporal stereo and static stereo frames:

$\mathcal{L}_{self,s} = \min_{t'} \left( \lambda\, \frac{1 - \mathrm{SSIM}(I_{t' \to t}, I_t)}{2} + (1 - \lambda)\, \|I_{t' \to t} - I_t\|_1 \right)$,   (6)

where the minimum is taken over the temporal stereo frames and the static stereo frame $t_S$, and $\lambda = 0.85$. Note that $\mathcal{L}_{self,s}$ takes the per-pixel minimum, which has been shown to be superior to the per-pixel average [17]. The sparse supervised depth loss is defined as

$\mathcal{L}_{sparse,s} = \|D_t - D_{VO}\|_1$,   (7)

where the ground-truth sparse depth maps ($D_{VO}$) are obtained by a visual odometry system [59]. Note that all the supervision signals of DepthModule are generated from either the images themselves or the visual odometry system, without any manual labeling or LiDAR depth.
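A single-scale sketch of the bootstrapping objective in Eq. (5)-(7) is given below. It reuses the ssim() helper from the cost-volume sketch above, and several details are assumptions: the sparse-supervision mask (DVSO points are only available at sparse pixels), the per-scale scaling of the smoothness weight, and all function and argument names are introduced here for illustration only.

    import torch

    def photometric(pred, target, lam=0.85):
        # lambda * (1 - SSIM)/2 + (1 - lambda) * L1, per pixel, cf. Eq. (6);
        # ssim() is the helper defined in the cost-volume sketch above.
        return lam * ((1 - ssim(pred, target)) / 2) + (1 - lam) * (pred - target).abs()

    def edge_aware_smoothness(inv_depth, image):
        # First-order edge-aware smoothness as in [17], on mean-normalized inverse depth.
        d = inv_depth / (inv_depth.mean(dim=[2, 3], keepdim=True) + 1e-7)
        dx = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
        dy = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
        ix = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
        iy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
        return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

    def depth_loss(warped, keyframe, inv_depth, sparse_vo, sparse_mask,
                   alpha=4.0, beta=1e-3):
        """One scale of Eq. (5): per-pixel-minimum photometric loss over all
        reprojected frames (temporal stereo plus the static stereo frame),
        sparse DVSO depth loss, and edge-aware smoothness.
        warped: (T, B, C, H, W) stack of frames already warped into the keyframe."""
        pe = torch.stack([photometric(w, keyframe).mean(1, keepdim=True) for w in warped])
        l_self = pe.min(dim=0).values.mean()          # per-pixel minimum, then average
        l_sparse = ((inv_depth - sparse_vo).abs() * sparse_mask).sum() \
                   / sparse_mask.sum().clamp(min=1)
        l_smooth = edge_aware_smoothness(inv_depth, keyframe)
        return l_self + alpha * l_sparse + beta * l_smooth

The values alpha = 4 and a smoothness weight on the order of 1e-3 follow the hyperparameters reported in the implementation details (Sec. 3.4.1); the full loss sums this term over four scales.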
MaskModule is trained with the mask loss $\mathcal{L}_{mask}$, which is the weighted binary cross entropy between the predicted mask $M_t$ and the auxiliary ground-truth moving object mask $M_{aux}$. We generate $M_{aux}$ by leveraging a pre-trained Mask R-CNN and the trained DepthModule as explained above. We firstly define the movable object classes, e.g., cars, cyclists, etc., and then obtain the instance segmentations of these object classes for the training images. A movable instance is classified as a moving instance if it has a high ratio of photometrically inconsistent pixels between temporal stereo and static stereo. Specifically, for each image, we predict its depth maps $D_t$ and $D_t^S$ using the cost volumes formed by temporal stereo images $C$ and static stereo images $C^S$, respectively. Then a pixel $x$ is regarded as a moving pixel if two of the following three metrics are above predefined thresholds: (1) the static stereo photometric error using $D_t$, i.e., $pe_{t_S}(x, D_t(x))$; (2) the average temporal stereo photometric error using $D_t^S$, i.e., $pe_{t'}(x, D_t^S(x))$; (3) the difference between $D_t(x)$ and $D_t^S(x)$. Please refer to our supplementary materials for more details. Figure 3 shows some examples of the generated auxiliary ground-truth moving object masks.

Figure 3: Auxiliary Training Masks: Examples of auxiliary training masks from the training set that are used as reference.
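The per-pixel part of the auxiliary mask generation, i.e. the two-of-three test described above, can be summarized as in the sketch below; an instance would then be flagged as moving if the fraction of such pixels inside its instance mask is high. The threshold values here are placeholders (the paper defers the actual values to its supplementary material), and the function name is invented.

    def is_moving_pixel(pe_static, pe_temporal, d_temporal, d_static,
                        th_pe_s=0.3, th_pe_t=0.3, th_d=0.1):
        """Two-of-three moving-pixel test for building M_aux (illustrative thresholds).
        pe_static:   static-stereo photometric error evaluated with D_t
        pe_temporal: average temporal-stereo photometric error evaluated with D_t^S
        d_temporal / d_static: the two inverse depth predictions D_t and D_t^S
        All inputs are per-pixel tensors of the same shape."""
        votes = ((pe_static > th_pe_s).float()
                 + (pe_temporal > th_pe_t).float()
                 + ((d_temporal - d_static).abs() > th_d).float())
        return votes >= 2   # boolean per-pixel moving mask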
MaskModule Refinement. The bootstrapping stage for MaskModule is limited in two ways: (1) heavy augmentation is needed since mostly only a very small percentage of pixels in the image belongs to moving objects; (2) the auxiliary masks are not necessarily related to the geometric prior in the cost volume, which slows down the convergence. Therefore, to improve the mask prediction, we utilize the trained DepthModule from the bootstrapping stage. We leverage the fact that the depth prediction for moving objects, and consequently the photometric consistency, should be better with a static stereo prediction than with a temporal stereo one. Therefore, similar to the classification of moving pixels as explained in the previous section, we obtain $D_t^S$ and $D_t$ from two forward passes using $C^S$ and $C$ as inputs, respectively. Then we compute the static stereo photometric error $\mathcal{L}^S_{self,s}$ using $D_t^S$ as depth and the temporal stereo photometric error $\mathcal{L}^T_{self,s}$ using $D_t$ as depth. To train $M_t$, we interpret it as pixel-wise interpolation factors between $\mathcal{L}^S_{self,s}$ and $\mathcal{L}^T_{self,s}$, and minimize the summation:

$\mathcal{L}_{m\_ref} = \sum_{s=0}^{3} \left( M_t\, \mathcal{L}^S_{depth,s} + (1 - M_t)\, \mathcal{L}^T_{depth,s} \right) + \mathcal{L}_{mask}$.   (8)

Figure 4(a) shows the diagram illustrating the different loss terms. Note that we still add the supervised mask loss $\mathcal{L}_{mask}$ as a regularizer to stabilize the training. This way, the new gradients are directly related to the geometric structure in the cost volume and help to improve the mask prediction accuracy and alleviate the danger of overfitting.
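The core of Eq. (8), and of the DepthModule refinement loss that follows, is that the predicted mask acts as a per-pixel interpolation factor between two loss maps. The sketch below shows this pattern at a single scale; the inputs are assumed to be precomputed per-pixel loss maps, and the function name is chosen here for illustration.

    def mask_refinement_loss(mask, loss_static_stereo, loss_temporal_stereo, loss_mask_bce):
        """Single-scale sketch of Eq. (8): M_t interpolates per pixel between the
        static-stereo-based and temporal-stereo-based depth losses, plus the
        auxiliary BCE mask loss as a regularizer. The loss maps are (B, 1, H, W)
        tensors; loss_mask_bce is a scalar. The paper sums this over scales s = 0..3."""
        per_pixel = mask * loss_static_stereo + (1 - mask) * loss_temporal_stereo
        return per_pixel.mean() + loss_mask_bce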
Figure 4: Refinement Losses: a) MaskModule refinement and b) DepthModule refinement loss functions. Dashed outlines denote that no gradient is being computed for the respective forward pass in the module.

DepthModule Refinement. The bootstrapping stage does not distinguish between moving pixels and static pixels when training DepthModule. Therefore, we aim to refine DepthModule such that it is able to predict proper depths also for moving objects. The key idea is that, by utilizing $M_t$, only the static stereo loss is backpropagated for moving pixels, while for static pixels the temporal stereo, static stereo and sparse depth losses are backpropagated. Because moving objects make up only a small percentage of all pixels in a keyframe, the gradients from the photometric error are rather weak. To solve this, we perform a further static stereo forward pass and use the resulting depth map $D_t^S$ as a prior for moving objects. Therefore, as shown in Figure 4(b), the loss for refining DepthModule is defined as

$\mathcal{L}_{d\_ref,s} = (1 - M_t)\,(\mathcal{L}_{self,s} + \alpha \mathcal{L}_{sparse,s}) + M_t\,\left(\mathcal{L}^S_{self,s} + \gamma\, \|D_t - D_t^S\|_1\right) + \beta \mathcal{L}_{smooth,s}$.   (9)

3.4.1 Implementation Details
The networks are implemented in PyTorch [38] with image size 512 x 256. For the bootstrapping stage, we train DepthModule for 70 epochs with learning rate lr = 1e-4 for the first 65 epochs and lr = 1e-5 for the remaining ones. MaskModule is trained for 60 epochs with lr = 1e-4. During MaskModule refinement, we train for 32 epochs with lr = 1e-4, and during DepthModule refinement we train for 15 epochs with lr = 1e-4 and another 4 epochs at lr = 1e-5. The hyperparameters alpha, beta and gamma are set to 4, 1e-3 * 2^s and 4, respectively. For inference, MonoRec can achieve 10 fps with batch size 1 using 2 GB of memory.

4. Experiments
To evaluate the proposed method, we first compare against state-of-the-art monocular depth prediction and MVS methods with our train/test split of the KITTI dataset [15]. Then, we perform extensive ablation studies to show the efficacy of our design choices. In the end, we demonstrate the generalization capabilities of different methods on Oxford RobotCar [35] and TUM-Mono [9] using the model trained on KITTI.
(Figure 5 panels: for KITTI sequences 07 (KF 395) and 00 (KF 1482), the keyframe, predicted depth maps, and two reprojected point-cloud views for MonoRec, PackNet, and DORN, together with the LiDAR ground truth from the corresponding viewpoints; below, accumulated point clouds with and without mask filtering.)

Figure 5: Qualitative Results on KITTI: The upper part of the figure shows the results for a selected number of frames from the KITTI test set. The compared PackNet model was trained in a semi-supervised fashion using LiDAR as the ground truth. Besides the depth maps, we also show the 3D point clouds by reprojecting the depth and viewing from two different perspectives. For comparison we show the LiDAR ground truth from the corresponding perspectives. Our method clearly shows the best prediction quality. The lower part of the figure shows large scale reconstructions as point clouds accumulated from multiple frames. The red insets depict the reconstructed artifacts from moving objects. With the proposed MaskModule, we can effectively filter out the moving objects to avoid those artifacts in the final reconstruction.
Method | Training | Dataset | Input | Abs Rel | Sq Rel | RMSE | RMSE_log | d<1.25 | d<1.25^2 | d<1.25^3
Colmap [44] (geometric) | - | - | KF+2 | 0.099 | 3.451 | 5.632 | 0.184 | 0.952 | 0.979 | 0.986
Colmap [44] (photometric) | - | - | KF+2 | 0.190 | 6.826 | 7.781 | 0.531 | 0.893 | 0.932 | 0.947
Monodepth2 [17] | MS | Eigen Split | KF | 0.082 | 0.405 | 3.129 | 0.127 | 0.931 | 0.985 | 0.996
PackNet [20] | MS | CS+Eigen Split | KF | 0.080 | 0.331 | 2.914 | 0.124 | 0.929 | 0.987 | 0.997
PackNet [20] | MS, D | CS+Eigen Split | KF | 0.077 | 0.290 | 2.688 | 0.118 | 0.935 | 0.988 | 0.997
DORN [11] | D | Eigen Split | KF | 0.077 | 0.290 | 2.723 | 0.113 | 0.949 | 0.988 | 0.996
DeepMVS [23] | D | Eigen Split | KF+2 | 0.103 | 1.160 | 3.968 | 0.166 | 0.896 | 0.947 | 0.978
DeepMVS [23] (pretr.) | D | Odom. Split | KF+2 | 0.088 | 0.644 | 3.191 | 0.146 | 0.914 | 0.955 | 0.982
DeepTAM [66] (only FB) | MS, D* | Odom. Split | KF+2 | 0.059 | 0.474 | 2.769 | 0.096 | 0.964 | 0.987 | 0.994
DeepTAM [66] (1x Ref.) | MS, D* | Odom. Split | KF+2 | 0.053 | 0.351 | 2.480 | 0.089 | 0.971 | 0.990 | 0.995
MonoRec | MS, D* | Odom. Split | KF+2 | 0.050 | 0.295 | 2.266 | 0.082 | 0.973 | 0.991 | 0.996

Table 1: Quantitative Results on KITTI: Comparison between MonoRec and other methods on our KITTI test set. The Dataset column shows the training dataset used by the corresponding method; please note that the Eigen split is a superset of our odometry split. Best / second best results are marked bold / underlined in the original table. The evaluation shows that our method achieves overall the best performance. Legend: M: Monocular images, S: Stereo images, D: GT depth, D*: Depths from DVSO, KF: Keyframe, KF + 2: Keyframe + 2 mono frames, CS: Cityscapes [4], pretr.: Pretrained network, FB: Fixed band module of DeepTAM, Ref.: Narrow band refinement module of DeepTAM.
Figure 6: Qualitative Improvement: Effects of cost volume masking and depth refinement. Panels: (a) Keyframe, (b) W/o MaskModule, (c) MaskModule, (d) MaskModule + D. Ref.

4.1. The KITTI Dataset
The Eigen split [6] is the most popular training/test split for evaluating depth estimation on KITTI. We cannot make use of it directly since MonoRec requires temporally continuous images with estimated poses. Hence, we select our training/testing splits as the intersection between the KITTI Odometry benchmark and the Eigen split, which results in 13714/8634 samples for training/testing. We obtain the relative poses between the images from the monocular VO system DVSO [59]. During training, we also leverage the point clouds generated by DVSO as the sparse depth supervision signals. For training MaskModule we only use images that contain moving objects in the generated auxiliary masks, 2412 in total. For all the following evaluation results we use the improved ground truth [50] and cap depths at 80 m.

We first compare our method against the recent state of the art, including an optimization based method (Colmap), self-supervised monocular methods (Monodepth2 and PackNet), a semi-supervised monocular method using sparse LiDAR data (PackNet), a supervised monocular method (DORN) and MVS methods (DeepMVS and DeepTAM), shown in Table 1. Note that the training code of DeepTAM was not published; we therefore implemented it ourselves for training and testing using our split to deliver a fair comparison. Our method outperforms all the other methods with a notable margin despite relying on images only, without using LiDAR ground truth for training.

This is also clearly reflected in the qualitative results shown in Figure 5. Compared with monocular depth estimation methods, our method delivers very sharp edges in the depth maps and can recover finer details. In comparison to the other MVS methods, it can better deal with moving objects, which is further illustrated in Figure 7.

A single depth map usually cannot really reflect the quality for large scale reconstruction. We therefore also visualize the accumulated points using the depth maps from multiple frames in the lower part of Figure 5. We can see that our method can deliver very high quality reconstruction and, due to our MaskModule, is able to remove artifacts caused by moving objects. We urge readers to watch the supplementary video for more convincing comparisons.

Ablation Studies. We also investigated the contribution of the different components towards the method's performance. Table 2 shows quantitative results of our ablation studies, which confirm that all our proposed contributions improve the depth prediction over the baseline method. Furthermore, Figure 6 demonstrates the qualitative improvement achieved by MaskModule and refinement training.
Model | SSIM | MaskModule | D. Ref. | M. Ref. | Abs Rel | Sq Rel | RMSE | RMSE_log | d<1.25 | d<1.25^2 | d<1.25^3
Baseline | | | | | 0.056 | 0.342 | 2.624 | 0.092 | 0.965 | 0.990 | 0.994
Baseline | | | | | 0.054 | 0.346 | 2.444 | 0.088 | 0.970 | 0.989 | 0.995
MonoRec | | | | | 0.054 | 0.306 | 2.372 | 0.087 | 0.970 | 0.990 | 0.995
MonoRec | | | | | 0.051 | 0.346 | 2.361 | 0.085 | 0.972 | 0.990 | 0.995
MonoRec | | | | | 0.052 | 0.302 | 2.303 | 0.087 | 0.969 | 0.990 | 0.995
MonoRec | | | | | 0.050 | 0.295 | 2.266 | 0.082 | 0.973 | 0.991 | 0.996

Table 2: Ablation Study: Baseline consists of only DepthModule using the unmasked cost volume (CV). Baseline without SSIM uses a 5x5 patch that has the same receptive field as SSIM. Using SSIM to form the CV gives a significant improvement. For MonoRec, only the addition of MaskModule without refinement does not yield significant improvements. The DepthModule refinement gives a major improvement. The best performance is achieved by combining all the proposed components. (The per-row component check marks of the original table did not survive extraction.)

Figure 7: Comparison on Moving Objects Depth Estimation (panels: Keyframe, MonoRec, Mask Prediction, DeepTAM, DeepMVS): In comparison to other MVS methods, MonoRec is able to predict plausible depths. Furthermore, the depth prediction has less noise and artifacts in static regions of the scene.

4.2. Oxford RobotCar and TUM-Mono
To demonstrate the generalization capabilities of MonoRec, we test our KITTI model on the Oxford RobotCar dataset and the TUM-Mono dataset. Oxford RobotCar is a street view dataset and shows a similar motion pattern and view perspective to KITTI. TUM-Mono, however, is recorded by a handheld monochrome camera, so it demonstrates very different motion and image quality compared to KITTI. The results are shown in Figure 8. The monocular methods struggle to generalize to a new context. The compared MVS methods show more artifacts and cannot predict plausible depths for the moving objects. In contrast, our method is able to generalize well to the new scenes for both depth and moving object predictions. Since Oxford RobotCar also provides LiDAR depth data, we further show a quantitative evaluation in the supplementary material.

Figure 8: Oxford RobotCar and TUM-Mono (panels: Keyframe, Monodepth2, PackNet, DORN, DeepMVS, DeepTAM, MonoRec): All results are obtained by the respective best-performing variant in Table 1. MonoRec shows stronger generalization capability than the monocular methods. Compared to DeepMVS and DeepTAM, MonoRec delivers depth maps with fewer artifacts and predicts the moving object masks in addition.

5. Conclusion
We have presented MonoRec, a deep architecture that estimates accurate dense 3D reconstructions from only a single moving camera. We first propose to use SSIM as the photometric measurement to construct the cost volumes. To deal with dynamic objects, we propose a novel MaskModule which predicts moving object masks from the input cost volumes. With the predicted masks, the proposed DepthModule is able to estimate accurate depths for both static and dynamic objects. Additionally, we propose a novel multi-stage training scheme together with a semi-supervised loss formulation for training the depth prediction. All combined, MonoRec is able to outperform the state-of-the-art MVS and monocular depth prediction methods both qualitatively and quantitatively on KITTI and also shows strong generalization capability on Oxford RobotCar and TUM-Mono. We believe that this capacity to recover accurate dense 3D reconstructions from a single moving camera will help to establish the camera as the lead sensor for autonomous systems.

Acknowledgement: This work was supported by the Munich Center for Machine Learning and by the ERC Advanced Grant SIMULACRON.
References
[1] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM. CVPR, pp. 2560-2568, 2018.
[2] N. D. F. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. ECCV, pp. 766-779, 2008.
[3] R. Chen, S. Han, J. Xu, and H. Su. Point-based multi-view stereo network. ICCV, 2019.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR, pp. 3213-3223, 2016.
[5] J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE RA-L, 5(2):721-728, 2020.
[6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. ICCV, pp. 2650-2658, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. NIPS, 2014.
[8] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE TPAMI, 40(3):611-625, 2018.
[9] J. Engel, V. Usenko, and D. Cremers. A photometrically calibrated benchmark for monocular visual odometry. arXiv, July 2016.
[10] A. Fontan, J. Civera, and R. Triebel. Information-driven direct RGB-D odometry. CVPR, pp. 4929-4937, 2020.
[11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. CVPR, pp. 2002-2011, 2018.
[12] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE TPAMI, pp. 1362-1376, 2010.
[13] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. ICCV, 2015.
[14] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. IJRR, pp. 1229-1235, 2013.
[15] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. CVPR, pp. 3354-3361, 2012.
[16] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CVPR, 2017.
[17] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow. Digging into self-supervised monocular depth estimation. ICCV, pp. 3828-3838, 2019.
[18] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. ICCV, 2019.
[19] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. CVPR, 2020.
[20] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon. 3D packing for self-supervised monocular depth estimation. CVPR, pp. 2485-2494, 2020.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, pp. 770-778, 2016.
[22] Y. Hou, J. Kannala, and A. Solin. Multi-view stereo by temporal nonparametric fusion. ICCV, 2019.
[23] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. DeepMVS: Learning multi-view stereopsis. CVPR, pp. 2821-2830, 2018.
[24] S. Im, H.-G. Jeon, S. Lin, and I. S. Kweon. DPSNet: End-to-end deep plane sweep stereo. ICLR, 2019.
[25] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. NIPS, pp. 2017-2025, 2015.
[26] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. ICCV, pp. 2326-2334, 2017.
[27] A. Kar, C. Häne, and J. Malik. Learning a multi-view stereo machine. NIPS, pp. 364-375, 2017.
[28] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. ICCV, 1999.
[29] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. 3DV, 2016.
[30] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE TPAMI, pp. 418-433, 2005.
[31] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. CVPR, 2015.
[32] Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu, and W. T. Freeman. Learning the depths of moving people by watching frozen people. CVPR, pp. 4521-4530, 2019.
[33] K. Luo, T. Guan, L. Ju, H. Huang, and Y. Luo. P-MVSNet: Learning patch-wise matching confidence aggregation for multi-view stereo. ICCV, 2019.
[34] X. Luo, J.-B. Huang, R. Szeliski, K. Matzen, and J. Kopf. Consistent video depth estimation. 39(4), 2020.
[35] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 Year, 1000 km: The Oxford RobotCar dataset. IJRR, 36(1):3-15, 2017.
[36] Z. Murez, T. van As, J. Bartolozzi, A. Sinha, V. Badrinarayanan, and A. Rabinovich. Atlas: End-to-end 3D scene reconstruction from posed images. ECCV, 2020.
[37] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. ICCV, 2011.
[38] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. NeurIPS, pp. 8026-8037, 2019.
[39] M. Pizzoli, C. Forster, and D. Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. ICRA, 2014.
[40] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. CVPR, pp. 4058-4066, 2016.
[41] A. Romanoni and M. Matteucci. TAPA-MVS: Textureless-aware PatchMatch multi-view stereo. ICCV, 2019.
[42] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. MICCAI, pp. 234-241, 2015.
[43] C. Russell, R. Yu, and L. Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. ECCV, pp. 583-598, 2014.
[44] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. CVPR, pp. 4104-4113, 2016.
[45] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection for unstructured multi-view stereo. ECCV, pp. 501-518, 2016.
[46] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. CVPR, 1997.
[47] J. Stühmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. DAGM Conference on Pattern Recognition, pp. 11-20, 2010.
[48] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. CVPR, 2017.
[49] E. Tola, C. Strecha, and P. Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications, pp. 903-920, 2011.
[50] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. 3DV, pp. 11-20, 2017.
[51] V. Usenko, N. Demmel, D. Schubert, J. Stückler, and D. Cremers. Visual-inertial mapping with non-linear factor recovery. IEEE RA-L, 5(2):422-429, 2020.
[52] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. CVPR, 2018.
[53] K. Wang and S. Shen. MVDepthNet: Real-time multiview depth estimation neural network. 3DV, 2018.
[54] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE TIP, 13(4):600-612, 2004.
[55] J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. CVPR, 2021.
[56] Y. Xue, J. Chen, W. Wan, Y. Huang, C. Yu, T. Li, and J. Bao. MVSCRF: Learning multi-view stereo with conditional random fields. ICCV, 2019.
[57] J. Yang, W. Mao, J. M. Alvarez, and M. Liu. Cost volume pyramid based depth inference for multi-view stereo. CVPR, 2020.
[58] N. Yang, L. von Stumberg, R. Wang, and D. Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. CVPR, 2020.
[59] N. Yang, R. Wang, J. Stückler, and D. Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. ECCV, pp. 817-833, 2018.
[60] Y. Yao, S. Li, S. Zhu, H. Deng, T. Fang, and L. Quan. Relative camera refinement for accurate dense reconstruction. 3DV, 2017.
[61] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. ECCV, pp. 785-801, 2018.
[62] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan. Recurrent MVSNet for high-resolution multi-view stereo depth inference. CVPR, 2019.
[63] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. CVPR, 2018.
[64] Z. Yu and S. Gao. Fast-MVSNet: Sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement. CVPR, 2020.
[65] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. M. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. CVPR, 2018.
[66] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. ECCV, pp. 822-838, 2018.
[67] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. CVPR, 2017.

View File

@ -0,0 +1,738 @@
Received December 21, 2020, accepted January 6, 2021, date of publication January 11, 2021, date of current version February 10, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3050617
RDS-SLAM: Real-Time Dynamic SLAM Using
Semantic Segmentation Methods
YUBAO LIU AND JUN MIURA, (Member, IEEE)
Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi 441-8580, Japan
Corresponding author: Yubao Liu (yubao.liu.ra@tut.jp)
This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 17H01799.
ABSTRACT Scene rigidity is a strong assumption in typical visual Simultaneous Localization and Mapping (vSLAM) algorithms. Such a strong assumption limits the usage of most vSLAM systems in dynamic real-world environments, which are the target of several relevant applications such as augmented reality, semantic mapping, unmanned autonomous vehicles, and service robotics. Many solutions have been proposed that use different kinds of semantic segmentation methods (e.g., Mask R-CNN, SegNet) to detect dynamic objects and remove outliers. However, as far as we know, such methods wait for the semantic results in the tracking thread of their architecture, and the processing time depends on the segmentation method used. In this paper, we present RDS-SLAM, a real-time visual dynamic SLAM algorithm that is built on ORB-SLAM3 and adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real-time. These novel threads run in parallel with the others, and therefore the tracking thread does not need to wait for the semantic information anymore. Besides, we propose an algorithm to obtain the latest available semantic information, thereby making it possible to use segmentation methods with different speeds in a uniform way. We update and propagate semantic information using the moving probability, which is saved in the map and used to remove outliers from tracking using a data association algorithm. Finally, we evaluate the tracking accuracy and real-time performance using the public TUM RGB-D datasets and a Kinect camera in dynamic indoor scenarios. Source code and demo: https://github.com/yubaoliu/RDS-SLAM.git
INDEX TERMS Dynamic SLAM, ORB SLAM, Mask R-CNN, SegNet, real-time.
I. INTRODUCTION
Simultaneous localization and mapping (SLAM) [1] is a fundamental technique for many applications such as augmented reality (AR), robotics, and unmanned autonomous vehicles (UAV). Visual SLAM (vSLAM) [2] uses the camera as the input and is useful in scene understanding and decision making. However, the strong assumption of scene rigidity limits the use of most vSLAM systems in real-world environments. Dynamic objects will cause many bad or unstable data associations that accumulate drift during the SLAM process. In Fig. 1, for example, assume m1 is on a person and its position changes in the scene. The bad or unstable data associations (the red lines in Fig. 1) will lead to incorrect camera ego-motion estimation in dynamic environments. Usually, there are two basic requirements for vSLAM: robustness in tracking and real-time performance. Therefore, how to detect dynamic objects in a populated scene and prevent the tracking algorithm from using data associations related to such dynamic objects in real-time is the challenge to allow vSLAM to be deployed in the real world.

We classify the solutions into two classes: pure geometric-based [3]-[7] and semantic-based [8]-[13] methods. The geometric-based approaches cannot remove all potential dynamic objects, e.g., people who are sitting. Features on such objects are unreliable and also need to be removed from tracking and mapping. The semantic-based methods use semantic segmentation or object detection approaches to obtain pixel-wise masks or bounding boxes of potential dynamic objects. Sitting people can be detected and removed from tracking and mapping using the semantic information, and a map of static objects can be built. Usually, in semantic-based methods, geometric checks, such as Random Sample Consensus (RANSAC) [14] and multi-view geometry, are also used to remove outliers.

These semantic-based methods first detect or segment objects and then remove outliers from tracking. The tracking thread has to wait for semantic information before tracking
(camera ego-motion estimation), which is called the blocked model in this paper (as shown in Fig. 2). Their processing speed is limited by the time consumed by the semantic segmentation method used. For example, Mask R-CNN requires about 200 ms [15] to segment one image, and this will limit the real-time performance of the entire system.

Our main challenge is how to execute vSLAM in real-time under dynamic scenes with various pixel-wise semantic segmentation methods that run at different speeds, such as SegNet and Mask R-CNN. We propose a semantic thread to wait for the semantic information. It runs in parallel with the tracking thread, and the tracking thread does not need to wait for the segmentation result. Therefore, the tracking thread can execute in real-time. We call this the non-blocked model in this paper. Faster segmentation methods (e.g., SegNet) can update semantic information more frequently than slower methods (e.g., Mask R-CNN). Although we cannot control the segmentation speed, we can use a strategy to obtain the latest possible semantic information to remove outliers from the current frame.

Because the semantic thread runs in parallel with the tracking thread, we use the map points to save and share the semantic information. As shown in Fig. 1, we update and propagate semantic information using the moving probability and classify map points into three categories, static, dynamic, and unknown, according to the moving probability threshold. These classified map points will be used to select data associations that are as stable as possible in tracking.

FIGURE 1. Example of data association in vSLAM under a dynamic scene. F_t (t >= 0) is the frame and KF_t is the selected keyframe. m_i, i in {0, 1, ...} is a map point. Assume m1 moved to a new position m1' because it belongs to a moving object. The red line indicates an unstable or bad data association.

The main contributions of this paper are:
(1) We propose a novel semantic-based real-time dynamic vSLAM algorithm, RDS-SLAM, in which the tracking thread does not need to wait for the semantic results anymore. This method efficiently and effectively uses semantic segmentation results for dynamic object detection and outlier removal while keeping the algorithm's real-time nature.
(2) We propose a keyframe selection strategy that uses the latest available semantic information for outlier removal with any semantic segmentation method, regardless of its speed, in a uniform way.
(3) We show that the real-time performance of the proposed method is better than that of existing similar methods using the TUM dataset.

The rest of the paper is structured as follows. Section II discusses related work. Section III describes a system overview. Sections IV, V, and VI detail the implementation of the proposed methods. Section VII shows experimental results, and Section VIII presents the conclusions and discusses future work.

FIGURE 2. Blocked model. The semantic model can use different kinds of segmentation methods, e.g., Mask R-CNN and SegNet. Note that this is not exactly the same as the semantic-based methods mentioned in [8]-[13]. The tracking process is blocked to wait for the results of the semantic model.

II. RELATED WORK
A. VISUAL SLAM
vSLAM [2] can be classified into feature-based methods and direct methods. Mur-Artal et al. presented ORB-SLAM2 [16], a complete SLAM system for monocular, stereo, and RGB-D cameras, which works in real-time on standard CPUs in a wide variety of environments. This system estimates the ego-motion of the camera by matching the corresponding ORB [17] features between the current frame and previous frames, and it has three parallel threads: tracking, local mapping, and loop closing. Carlos et al. proposed the latest version, ORB-SLAM3 [18], mainly adding two novelties: 1) a feature-based tightly-integrated visual-inertial SLAM that fully relies on maximum-a-posteriori (MAP) estimation; 2) a multiple map system (ATLAS [19]) that relies on a new place recognition method with improved recall. In contrast to feature-based methods, direct methods operate on pixel intensities. For example, Kerl et al. proposed a dense visual SLAM method, DVO [20], for RGB-D cameras that minimizes both the photometric and the depth error over all pixels. However, none of the above methods can address the common problem of dynamic objects. Detecting and dealing with dynamic objects in a dynamic scene in real-time is a challenging task in vSLAM.

Our work follows the implementation of ORB-SLAM3 [18]. The concepts in ORB-SLAM3 (keyframe, covisibility graph, ATLAS, and bundle adjustment (BA)) are also used in our implementation.
1) KEYFRAME
Keyframes [18] are a subset of selected frames used to avoid unnecessary redundancy in tracking and optimization. Each keyframe stores 1) a rigid body transformation of the camera pose that transforms points from the world to the camera coordinate system; 2) ORB features, associated or not to a map point. In this paper, keyframes are selected by the same policy as ORB-SLAM3; a keyframe is selected if all the following conditions are met: 1) 20 frames have passed since the last global relocalization or the last keyframe insertion; 2) the local mapping thread is idle; 3) the current frame tracks at least 50 points and fewer than 90% of the points of the reference keyframe.
DynaSLAM [9], also built on ORB-SLAM2, is robust
2) COVISIBILITY GRAPH in dynamic scenarios for monocular, stereo, and RGB-D
Covisibility graph [16] is represented as an undirected datasets, by adding the capabilities of dynamic object detec-
weighted graph, in which each node is a keyframe and the tion and background inpainting. It can detect the moving
edge holds the number of commonly observed map points. objects either by multi-view geometry, deep learning, or both
and inpaint the frame background that has been occluded by
3) ATLAS dynamic objects using a static map of the scene. It uses Mask
The Atlas [19] is a multi-map representation that handles an R-CNN to segment out all the priori dynamic objects, such as
unlimited number of sub-maps. Two kinds of maps, active people or vehicles. DynaSLAM II [24] tightly integrates the
map and non-active map, are managed in the atlas. When multi-object tracking capability. But this method only works
the camera tracking is considered lost and relocalization was for rigid objects. However, in the dynamic scene of TUM [25]
failed for a few frames, the active map becomes a non-active dataset, people change their shape by sometimes standing and
map, and a new map will be initialized. In the atlas, keyframes sometimes sitting.
and map points are managed using the covisibility graph and
the spanning tree. Detect-SLAM [12], also built on ORB-SLAM2, integrates
visual SLAM with single-shot multi-box detector (SSD) [26]
4) BUNDLE ADJUSTMENT (BA) to make the two functions mutually beneficial. They call
BA [21] is the problem of refining a visual reconstruction to the probability of a feature point belonging to a moving
produce jointly optimal 3D structure and viewing parameter object the moving probability. They distinguish keypoints
estimates. Local BA is used in the local mapping thread into four states, high-confidence static, low-confidence static,
to optimize only the camera pose. Loop closing launches a low-confidence dynamic, and high-confidence dynamic.
thread to perform full BA after the pose-graph optimization Considering the delay of detection and the spatio-temporal
to jointly optimize the camera pose and the corresponding consistency of successive frames, they only use the color
landmarks. images of keyframes to detect using SSD, meanwhile prop-
agating probability frame-by-frame in the tracking thread.
B. GEOMETRIC-BASED SOLUTIONS Once the detection result is obtained, they insert the keyframe
Li et al. [5] proposed a real-time depth edge-based RGB-D into the local map and update the moving probability on
SLAM system for dynamic environments based on the frame- the local map. Then they update the moving probabil-
to-keyframe registration. They only use depth edge points ity of 3D points in the local map that matched with the
which have an associated weight indicating its probability keyframe.
of belonging to a dynamic object. Sun et al. [6] classify
pixels using the segmentation of the quantized depth image DM-SLAM [11] combines Mask R-CNN, optical flow, and
and calculate the difference in intensity between consec- epipolar constraint to judge outliers. The Ego-motion Estima-
utive RGB images. Tan et al. [3] propose a novel online tion module estimates the initial pose of the camera, similar
keyframe representation and updating method to adaptively to the Low-cost tracking module in DynaSLAM. DM-SLAM
model the dynamic environments. The camera pose can reli- also uses features in priori dynamic objects, if they are not
ably be estimated even in challenging situations using a moving heavily, to reduce the feature-less case caused by
novel prior-based adaptive RANSAC algorithm to efficiently removing all priori dynamic objects.
remove outliers.
Fan et al. [8] proposed a novel semantic SLAM system
Although the geometric-based vSLAM solution in with a more accurate point cloud map in dynamic environ-
dynamic environments can restrict the effect of the dynamic ments and they use BlizNet [27] to obtain the masks and
objects to some extent, there are some limitations: 1) they bounding boxes of the dynamic objects in the image.
cannot detect the potential dynamic objects that temporarily
keep static; 2) lack of semantic information. We cannot judge All these methods use the blocked model. They wait for the
dynamic objects using priori knowledge of the scene. semantic results of every frame or keyframe before estimating
the camera pose. As a result, their processing speed are
23774 limited by the specific CNN models they used. In this paper,
we propose RDS-SLAM that uses the non-blocked model and
shows its real-time performance by comparing it with those
methods.
VOLUME 9, 2021
Y. Liu, J. Miura: RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods
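To make the contrast between the blocked and the non-blocked model concrete, the following minimal Python sketch shows a tracking loop that never waits for segmentation; it is only an illustration of the idea, and the class and function names (SharedSemanticState, semantic_worker, track_frame, segment) are hypothetical, not taken from the RDS-SLAM code. The direct 0.9/0.1 probability assignment is also a simplification of the moving-probability update described later.
import threading
from queue import Queue

class SharedSemanticState:
    """Shared store (standing in for the atlas): map point id -> moving probability."""
    def __init__(self):
        self.lock = threading.Lock()
        self.moving_prob = {}

def semantic_worker(keyframe_queue: Queue, state: SharedSemanticState, segment):
    # Non-blocked model: the slow segmentation call runs here, never in the tracking loop.
    while True:
        kf = keyframe_queue.get()
        mask = segment(kf.image)                      # slow call (e.g., ~200 ms for Mask R-CNN)
        with state.lock:
            for pid, (u, v) in kf.observations():     # hypothetical accessor: point id -> pixel
                is_dynamic = mask[int(v), int(u)] > 0
                state.moving_prob[pid] = 0.9 if is_dynamic else 0.1

def tracking_loop(frames, keyframe_queue: Queue, state: SharedSemanticState, track_frame):
    for frame in frames:
        with state.lock:
            probs = dict(state.moving_prob)           # use whatever semantics exist so far
        pose, is_keyframe = track_frame(frame, probs) # never blocks on segmentation
        if is_keyframe:
            keyframe_queue.put(frame)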
FIGURE 3. System architecture. Models with orange color are modified blocks based on ORB-SLAM3. Models with magenta color are newly added
features. Blocks in blue are important data structures.
III. SYSTEM OVERVIEW
Each frame first passes through the tracking thread. The initial camera pose is estimated for the current frame after being tracked with the last frame, and it is further optimized by being tracked with the local map. Then, keyframes are selected, and they are used in semantic tracking, semantic-based optimization, and the local mapping thread. We modify several models in the tracking and local mapping threads to remove outliers from camera ego-motion estimation using the semantic information. In the tracking thread, we propose a data association algorithm to use as many features on static objects as possible.
The semantic thread runs in parallel with the others, so as not to block the tracking thread, and saves the semantic information into the atlas. Semantic labels are used to generate the mask image of the priori dynamic objects. The moving probability of the map points matched with features in the keyframes is updated using the semantic information. Finally, the camera pose is optimized using the semantic information in the atlas.
We will introduce the new features and modified models in the following sections. We skip the detailed explanations of the modules that are the same as those of ORB-SLAM3.
IV. SEMANTIC THREAD
The semantic thread is responsible for generating semantic information and updating it into the atlas map. Before we introduce the detailed implementation of the semantic thread, we use a simple example to explain the general flow, as shown in Fig. 4. Assume that keyframes are selected every two frames. The keyframes are selected by ORB-SLAM3 and inserted into a keyframe list KF sequentially. Assume that, at time t = 12, KF2-KF6 are inside KF. The next step is to select keyframes from KF to request semantic labels from the semantic server. We call this process the semantic keyframe selection process in this paper. We take one keyframe from the head of KF (KF2) and one from the back of KF (KF6) to request the semantic labels. Then, we calculate the mask of the priori dynamic objects using semantic labels S2 and S6. Next, we update the moving probability of the map points stored in the atlas. The moving probability will be used later to remove outliers in the tracking thread.
Alg. 1 shows the detailed implementation of the semantic thread. The first step is to select semantic keyframes from the keyframe list KF (Line 2). Next, we request semantic labels from the semantic model, which returns the semantic labels SLs (Line 3). Lines 4-8 save and process the semantic result for each item returned. Line 6 generates the mask image of dynamic objects, and Line 7 updates the moving probability stored in the atlas. We will introduce each sub-module of the semantic thread sequentially (see Fig. 3).
Algorithm 1 Semantic Tracking Thread
Require: KeyFrame list: KF
1: while not_request_finish() do
2:   SK = semantic_keyframe_selection(KF)
3:   SLs = request_segmentation(SK)
4:   for i = 0; i < SLs.size(); i++ do
5:     KeyFrame kf = SK[i]
6:     kf->mask = GenerateMaskImage(SLs[i])
7:     kf->UpdatePrioriMovingProbability()
8:   end for
9: end while
FIGURE 4. Semantic tracking example. Assume a keyframe KFn is selected every two frames Fn and inserted into the keyframe list KF. We choose keyframes from KF to request semantic labels Sn. Then we update the moving probability in the atlas using the mask image of dynamic objects reproduced from the semantic label. Blue circles stand for static map points and red circles for dynamic map points. Others, marked in green, are unknown.
A. SEMANTIC KEYFRAME SELECTION ALGORITHM
The semantic keyframe selection algorithm selects the keyframes for which semantic labels will be requested. We need to keep the real-time performance while using different kinds of semantic segmentation methods. However, some of them, such as Mask R-CNN, are time-consuming, and the current frame in tracking may not obtain new semantic information if we segment every keyframe sequentially.
To evaluate this distance quantitatively, we define the semantic delay as the distance between the id of the latest frame that has a semantic label (S_t), which holds the latest semantic information, and the id of the current frame (F_t):
d = FrameID(F_t) - FrameID(S_t).    (1)
Fig. 5 shows the semantic delay for several cases. The most general idea is to segment each frame or keyframe sequentially, according to the time sequence, as shown in Fig. 5 (a). We call this kind of model the sequential segmentation model. However, this monotonically increases the time delay when using time-consuming segmentation methods, as shown by the blue line in Fig. 6. For instance, at time t = 10 (F10), the semantic model has completed the segmentation of KF0 (F0), and the semantic delay is d = 10. Similarly, at time 40 (F40), the semantic delay becomes 34. That is, the last frame that has semantic information is 34 frames behind the current frame. The current frame cannot obtain the latest semantic information.
To shorten the distance, suppose that we segment two frames at the same time (Fig. 5 (b)). Then the delay becomes 12 - 2 = 10 if KF0 and KF1 are segmented at the same time. The delay still grows linearly, as shown by the red line in Fig. 6.
To further shorten the semantic delay, we use a bi-directional model. We do not segment keyframes sequentially. Instead, we do semantic segmentation using keyframes taken both from the front and the back of the list, so as to use semantic information that is as recent as possible, as shown in Fig. 5 (c) and by the yellow line in Fig. 6. The semantic delay becomes a constant value. In practice, the delay of the bi-directional model is not always 10. The distance is influenced by the segmentation method used, the frequency of keyframe selection, and the processing speed of the related threads.
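A minimal sketch of the bi-directional selection idea described above, assuming the keyframe list is kept as a deque; the helper names and the exact bookkeeping are illustrative, not the authors' actual implementation.
from collections import deque

def select_semantic_keyframes(kf_list: deque, segmented_ids: set):
    """Pick up to two keyframes: the oldest one (popped from the front, later fed to the
    semantic-based optimization thread) and the newest one (peeked at the back, which keeps
    the semantic delay roughly constant). Keyframes that already have labels are skipped."""
    selected = []
    # Front: oldest un-segmented keyframe.
    while kf_list and kf_list[0].kf_id in segmented_ids:
        kf_list.popleft()                      # already handled in a previous round
    if kf_list:
        selected.append(kf_list.popleft())
    # Back: newest un-segmented keyframe (left in the list so later rounds can still see it).
    for kf in reversed(kf_list):
        if kf.kf_id not in segmented_ids:
            selected.append(kf)
            break
    return selected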
The left side of Fig. 7 shows a semantic keyframe selection example, and the right side of Fig. 7 shows the timeline of requesting semantic information from the semantic model/server. We take keyframes from both the head and the back of KF to request the semantic labels. (Round 1) At time t = 2, two keyframes, KF0 and KF1, are selected. Segmentation finishes at t = 12. By this time, new keyframes have been selected and inserted into KF (see Round 2). Then we take two elements, KF2 from the front and KF6 from the back, to request the semantic labels. At time t = 22, we receive the semantic result and continue with the next round (Round 3).
FIGURE 5. Bi-direction model vs. sequential model. Assume we use Mask R-CNN (200 ms) and ORB-SLAM3 (20 ms), and a keyframe is selected every two frames. There is a delay of about 200/20 = 10 frames while waiting for the semantic result.
FIGURE 6. Semantic delay of the sequential model vs. the bi-direction model.
FIGURE 7. Semantic time line. The left side is the contents of the keyframe list KF and the right side is the time line of requesting semantic labels. A keyframe in green color means this item already obtained its semantic information in a previous round.
We can obtain relatively new information if we segment the keyframe at the tail of the KF list. Then why do we also need to segment the keyframe at the front of the list? Different from the blocked model, there is no semantic information for the first few frames (about 10 frames if Mask R-CNN is used) in our method. Since the processing speed of the tracking thread is usually faster than that of the semantic thread, vSLAM may have already accumulated large errors because of the dynamic objects. Therefore, we need to correct these drift errors using the semantic information by popping out the keyframes at the front of the KF list and feeding them sequentially to the semantic-based optimization thread to correct/optimize the camera poses.
B. SEMANTIC SEGMENTATION
In our experiment, we use two models with different speeds, Mask R-CNN (slower) and SegNet (faster), as shown in Fig. 8. Mask R-CNN [15] is trained with MS COCO [28], which has both pixel-wise semantic segmentation results and instance labels. We implemented it based on the TensorFlow version of Matterport (https://github.com/matterport/Mask_RCNN). SegNet [22], implemented using Caffe (https://github.com/alexgkendall/SegNet-Tutorial), is trained with the PASCAL VOC [29] 2012 dataset, where 20 classes are offered. We did not refine the networks using the TUM dataset because SLAM usually runs in an unknown environment.
C. SEMANTIC MASK GENERATION
We merge all the binary mask images of the instance segmentation results into one mask image, which is used to generate the mask image of people (Fig. 8). Then we calculate the priori moving probability of map points using the mask. In practice, since the segmentation on object boundaries is sometimes unreliable, the features on the boundaries cannot be detected as outliers if the mask image is applied directly, as shown in Fig. 9 (a).
Therefore, we dilate the mask using a morphological filter to include the edges of dynamic objects, as shown in Fig. 9 (b).
FIGURE 8. Semantic information. M stands for Mask R-CNN and S for SegNet. (e) shows the outliers, marked in red color, which are detected using the mask image.
FIGURE 9. Mask dilation. Remove outliers on the edge of dynamic objects.
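The mask merging and dilation step described above can be sketched with OpenCV as follows; the kernel size is an illustrative choice, not a value reported in the paper.
import cv2
import numpy as np

def build_dynamic_mask(instance_masks, kernel_size=15):
    """Merge per-instance binary masks of priori dynamic objects (e.g., people) into a
    single mask and dilate it so that features on unreliable object boundaries are covered."""
    if not instance_masks:
        return None
    merged = np.zeros_like(instance_masks[0], dtype=np.uint8)
    for m in instance_masks:
        merged = cv2.bitwise_or(merged, m.astype(np.uint8))
    # Morphological dilation enlarges the mask to absorb the object edges.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.dilate(merged, kernel, iterations=1)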
D. MOVING PROBABILITY UPDATE
In order not to wait for the semantic information in the tracking thread, we isolate the semantic segmentation from tracking. We use the moving probability to convey semantic information from the semantic thread to the tracking thread. The moving probability is used to detect and remove outliers in tracking.
FIGURE 10. Segmentation failure case. Some features on the body of the person (a) cannot be identified as outliers using the unsound mask (c) generated by the semantic result (b). Therefore, those features are wrongly labeled as static in this frame.
FIGURE 11. Moving probability. θs is the static threshold and θd is the dynamic threshold value.
1) DEFINITION OF MOVING PROBABILITY
As we know, vSLAM usually runs in an unknown environment, and the semantic result is not always robust if the CNN is not well trained or refined for the current environment (Fig. 10). To detect outliers, it is more reasonable to consider the spatio-temporal consistency of frames, rather than to just use the semantic result of one frame. Therefore, we use the moving probability to leverage the semantic information of successive keyframes.
We define the moving probability (p(m_i^t), m_i^t ∈ M) of each map point i at the current time as shown in Fig. 11. The status of the map point is more likely dynamic if its moving probability is closer to one, and more likely static if it is closer to zero. To simplify notation, we abbreviate the moving probability of map point i at time t, p(m_i^t), to p(m_t). Each map point has two statuses (M), dynamic and static, and the initial probability (initial belief) is set to 0.5:
M = {static (s), dynamic (d)},   bel(m_0 = d) = bel(m_0 = s) = 0.5.
2) DEFINITION OF OBSERVED MOVING PROBABILITY
Considering the fact that semantic segmentation is not 100% accurate, we define the observed moving probability as:
p(z_t = d | m_t = d) = α,   p(z_t = s | m_t = d) = 1 - α,
p(z_t = s | m_t = s) = β,   p(z_t = d | m_t = s) = 1 - β.
The values α and β are given manually and are related to the accuracy of the semantic segmentation. In the experiment, we set α and β to 0.9, supposing that the semantic segmentation is fairly reliable.
3) MOVING PROBABILITY UPDATE
The moving probability at the current time, bel(m_t), is predicted based on the observations z_{1:t} (semantic segmentation) and the initial status m_0. We formulate the moving probability update problem as a Bayesian filter [30] problem:
bel(m_t) = p(m_t | z_{1:t}, m_0)
         = η p(z_t | m_t, z_{1:t-1}, m_0) p(m_t | z_{1:t-1}, m_0)
         = η p(z_t | m_t) p(m_t | z_{1:t-1}, m_0)
         = η p(z_t | m_t) bel⁻(m_t).    (2)
Eq. (2) exploits the Bayes rule and the conditional independence that the current observation z_t only relies on the current status m_t; η is a constant. The prediction bel⁻(m_t) is calculated by:
bel⁻(m_t) = ∫ p(m_t | m_{t-1}, z_{1:t-1}) p(m_{t-1} | z_{1:t-1}) dm_{t-1}
          = ∫ p(m_t | m_{t-1}) bel(m_{t-1}) dm_{t-1}.    (3)
In Eq. (3), we exploit the assumption that our state is complete. This implies that, if we know the previous state m_{t-1}, past measurements convey no information regarding the state m_t. We assume the state transition probabilities p(m_t = d | m_{t-1} = s) = 0 and p(m_t = d | m_{t-1} = d) = 1 because we cannot detect sudden changes of objects. η is calculated by (bel(m_t = d) + bel(m_t = s))/2. The probability of a map point belonging to the dynamic status is calculated by:
bel(m_t = d) = p(m_t = d | m_{t-1} = d) bel(m_{t-1} = d).    (4)
4) JUDGEMENT OF STATIC AND DYNAMIC POINTS
Whether a point is dynamic or static is judged using predefined probability thresholds, θd and θs (see Fig. 11). They are set to 0.6 and 0.4, respectively, in the experiment:
Status(m_t^i) = dynamic if p(m_t) > θd,  static if p(m_t) < θs,  unknown otherwise.    (5)
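A small Python sketch of the update and thresholding described by Eqs. (2)-(5); α, β, θd, and θs follow the values reported in the text, the function names are illustrative, and the normalization step is written in the standard Bayes-filter form, which may differ in detail from the authors' implementation.
ALPHA, BETA = 0.9, 0.9        # observed moving probability reliability (Section IV-D)
THETA_D, THETA_S = 0.6, 0.4   # dynamic / static thresholds

def update_moving_probability(prev_belief_d, observed_dynamic):
    """One Bayes-filter step for a single map point.
    prev_belief_d: previous belief that the point is dynamic, bel(m_{t-1} = d).
    observed_dynamic: True if the matched feature falls inside the dynamic-object mask."""
    # Prediction: with p(d|s) = 0 and p(d|d) = 1, the predicted belief equals the previous one.
    pred_d = prev_belief_d
    pred_s = 1.0 - prev_belief_d
    # Observation update using alpha / beta.
    if observed_dynamic:
        unnorm_d = ALPHA * pred_d
        unnorm_s = (1.0 - BETA) * pred_s
    else:
        unnorm_d = (1.0 - ALPHA) * pred_d
        unnorm_s = BETA * pred_s
    return unnorm_d / (unnorm_d + unnorm_s)   # normalized bel(m_t = d)

def classify(p_dynamic):
    if p_dynamic > THETA_D:
        return "dynamic"
    if p_dynamic < THETA_S:
        return "static"
    return "unknown"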
V. TRACKING THREAD
The tracking thread runs in real-time and tends to accumulate drift error due to the incorrect or unstable data associations between 3D map points and 2D features in each frame caused by dynamic objects. We modify the Track Last Frame model and the Track Local Map model of the ORB-SLAM3 tracking thread to remove outliers (see Fig. 3). We propose a data association algorithm that uses data associations that are as good as possible, based on the moving probability stored in the atlas.
A. TRACK LAST FRAME
Alg. 2 shows the data association algorithm in the track last frame model.
Algorithm 2 Robust Data Association Algorithm
Require: Current Frame: Ft
         Last Frame: Ft-1
         Unknown subset: Unknown<FeatureId, MapPoint*>
         Static subset: Static<FeatureId, MapPoint*>
         Thresholds: θd, θs, τ = 20
1:  for i = 0; i < Ft-1.Features.size(); i++ do
2:    MapPoint* m = Ft-1.MapPoints[i]
3:    f = FindMatchedFeatures(Ft, m)
4:    if p(m) > θd then
5:      continue
6:    end if
7:    if p(m) < θs then
8:      Static.insert(f, m)
9:    end if
10:   if θs ≤ p(m) ≤ θd then
11:     Unknown.insert(f, m)
12:   end if
13: end for
14: for it = Static.begin(); it != Static.end(); it++ do
15:   Ft.MapPoints[it->first] = it->second
16: end for
17: if Static.size() < τ then
18:   for it = Unknown.begin(); it != Unknown.end(); it++ do
19:     Ft.MapPoints[it->first] = it->second
20:   end for
21: end if
For each feature i in the last frame, we first get its matched map point m (Line 2). Next, we find the matched feature in the current frame by comparing the descriptor distance of the ORB features (Line 3). After that, in order to remove the bad influence of dynamic map points, we skip those map points that have a high moving probability (Lines 4-6). Then, there are two kinds of map points left, static and unknown map points. We want to use only the static map points as far as we can. Therefore, we classify the remaining map points into two subsets, a static subset and an unknown subset, according to their moving probability (Lines 7-12). Finally, we use the selected, relatively good matches. We first use all the good data stored in the static subset (Lines 14-16). If the amount of these data is not enough (less than the threshold τ = 20, the value used in ORB-SLAM3), we also use the data in the unknown subset (Lines 17-21).
We try to exclude outliers from tracking using the moving probability stored in the atlas. How well the outliers are removed has a great influence on the tracking accuracy. We show the results for a few frames in Fig. 12. All the features in the first few frames are in green color because no semantic information can be used yet and the moving probability of all map points is 0.5, the initial value. The features in red belong to dynamic objects, and they are harder to match with the last frame than static features (blue features). The green features almost disappear over time because the map points obtain semantic information. We only use features in the static subset if its size is enough to estimate the camera ego-motion.
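For readers who prefer code, the subset logic of Alg. 2 above can be sketched in Python; dictionaries stand in for the Static/Unknown containers, and the matching function is assumed to be provided elsewhere.
def associate_last_frame(last_frame, cur_frame, moving_prob, find_match,
                         theta_d=0.6, theta_s=0.4, tau=20):
    """Associate the current frame's features with the last frame's map points,
    preferring map points believed to be static (sketch of Alg. 2)."""
    static_subset, unknown_subset = {}, {}
    for mp in last_frame.map_points:
        feat_id = find_match(cur_frame, mp)      # ORB descriptor matching
        if feat_id is None:
            continue
        p = moving_prob.get(mp.id, 0.5)
        if p > theta_d:                          # likely dynamic: skip
            continue
        if p < theta_s:
            static_subset[feat_id] = mp
        else:
            unknown_subset[feat_id] = mp
    matches = dict(static_subset)
    if len(static_subset) < tau:                 # fall back to the unknown subset
        matches.update(unknown_subset)
    return matches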
FIGURE 12. Results after tracking the last frame. M stands for Mask R-CNN and S for SegNet. The features in red color are not used in tracking. Blue features belong to the static subset and green features belong to the unknown subset.
FIGURE 13. Results after tracking the local map. M stands for Mask R-CNN and S for SegNet.
B. TRACK LOCAL MAP
The basic idea of the data association algorithm in the Track Local Map model is similar to Alg. 2. The difference is that here we use all the map points in the local map to find good data associations. The data association result after tracking the local map is shown in Fig. 13. More map points are used for matching in this model than in the track last frame model. The features on the people are almost all successfully detected or not matched/used.
VI. OPTIMIZATION
A. SEMANTIC-BASED OPTIMIZATION
We optimize the camera pose using the keyframes given by the semantic keyframe selection algorithm. Considering that the tracking thread runs much faster than the semantic thread, drift has already accumulated to some extent due to the influence of dynamic objects. Therefore, we try to correct the camera pose using semantic information. We modify the error term used in ORB-SLAM3 by using the moving probability of map points for weighting, as shown below. In the experiment, we only use the matched static map points for optimization.
Assume X_j^w ∈ R^3 is the 3D position of a map point j in the world coordinate system, and the pose of the i-th keyframe in the world coordinate system is T_iw ∈ SE(3). The camera pose T_iw is optimized by minimizing the reprojection error with respect to the matched keypoint x_ij ∈ R^2 of the map point. The error term for the observation of a map point j in a keyframe i is
e(i, j) = (x_ij - π_i(T_iw, X_j^w)) (1 - p(m_j)),    (6)
where π_i is the projection function that projects a 3D map point into a 2D pixel in keyframe i. The larger the moving probability is, the smaller the contribution to the error. The cost function to be optimized is
C = Σ_{i,j} ρ( e_{i,j}^T Σ_{i,j}^{-1} e_{i,j} ),    (7)
where ρ is the Huber robust cost function and Σ_{i,j} is the covariance matrix.
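A numpy sketch of the weighted error term in Eq. (6) and the robust cost in Eq. (7), assuming a simple pinhole projection helper; the Huber threshold value is an illustrative assumption, not a value given in the paper.
import numpy as np

def project(T_cw, X_w, K):
    """Pinhole projection of a world point X_w given a 4x4 camera pose T_cw and intrinsics K."""
    X_c = T_cw[:3, :3] @ X_w + T_cw[:3, 3]
    uv = K @ X_c
    return uv[:2] / uv[2]

def huber(r2, delta=5.991):
    # Robust Huber cost on the squared Mahalanobis residual r2 (delta plays the role of b^2).
    return r2 if r2 <= delta else 2.0 * np.sqrt(delta * r2) - delta

def semantic_weighted_cost(obs, poses, points, K, moving_prob, sigma_inv):
    """obs: list of (kf_id, pt_id, measured_xy). Residuals are down-weighted by (1 - p(m_j))."""
    total = 0.0
    for kf_id, pt_id, xy in obs:
        e = np.asarray(xy) - project(poses[kf_id], points[pt_id], K)
        e = e * (1.0 - moving_prob.get(pt_id, 0.5))   # Eq. (6): dynamic points contribute less
        total += huber(float(e @ sigma_inv @ e))       # Eq. (7)
    return total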
B. BUNDLE ADJUSTMENT IN THE LOCAL MAPPING THREAD
We modify the local BA model to reduce the influence of dynamic map points using semantic information. What we modify are: 1) the error term, in which the moving probability is used, as shown in Eq. (6); 2) only keyframes that have already obtained semantic information are used for BA.
VII. EXPERIMENTAL RESULTS
We evaluate the tracking accuracy using the TUM [25] indoor dataset and demonstrate the real-time performance by comparing with state-of-the-art vSLAM methods using, when possible, the results in the original papers.
A. SYSTEM SETUP
Our system is evaluated using a GeForce RTX 2080Ti GPU, CUDA 11.1, and Docker (https://docs.docker.com/). Docker is used to deploy different kinds of semantic segmentation methods on the same machine. We also use a Kinect v2 camera (https://github.com/code-iai/iai_kinect2) to evaluate in a real environment.
B. TRACKING ACCURACY EVALUATION
The proposed method was compared against ORB-SLAM3 and similar semantic-based algorithms to quantify the tracking performance of our proposal in dynamic scenarios.
The TUM RGB-D dataset contains color and depth images along the ground-truth trajectory of the sensor. In the sequences named fr3/walking_* (labeled as f3/w/*), two people walk through an office. This is intended to evaluate the robustness of vSLAM in the case of quickly moving dynamic objects in large parts of a visible scene. Four types of camera motion are included in the walking data sequences: 1) xyz, where the Asus Xtion camera is manually moved along three directions (x, y, z); 2) static, where the camera is kept in place manually; 3) halfsphere, where the camera is moved on a small half-sphere of approximately one-meter diameter; 4) rpy, where the camera is rotated along the principal axes (roll-pitch-yaw). In the experiment, the person is dealt with as the only priori dynamic object in the TUM dataset.
We compared the camera trajectories with ORB-SLAM3 (https://github.com/UZ-SLAMLab/ORB_SLAM3), DS-SLAM (https://github.com/ivipsourcecode/DS-SLAM), and DynaSLAM (https://github.com/BertaBescos/DynaSLAM). Fig. 14 compares the trajectories obtained using their source code, and therefore the trajectories are not exactly the same as the ones in their original papers. We evaluated our system using both Mask R-CNN (M) and SegNet (S). The trajectory of DynaSLAM using Mask R-CNN is very similar to that of our Mask R-CNN version, as shown in Fig. 14 (m-p) and Fig. 14 (q-t). The performance of our SegNet version (Fig. 14 (i and j)) is similar to DS-SLAM (Fig. 14 (e and f)).
The error in the estimated trajectory was calculated by comparing it with the ground truth, using two prominent measurements: absolute trajectory error (ATE) and relative pose error (RPE) [25], which are well-suited for measuring the performance of vSLAM. The root mean squared error (RMSE) and the standard deviation (S.D.) of ATE and RPE are compared. Each sequence was run at least five times, as dynamic objects are prone to increase the non-deterministic effect. We compared our method with ORB-SLAM3 [18], DS-SLAM [10], DynaSLAM [9], SLAM-PCD [8], DM-SLAM [11], and Detect-SLAM [12]. The comparison results are summarized in Tables 1, 2, and 3. DynaSLAM reported that they obtained the best performance using the combination of Mask R-CNN and a geometric model. In this paper, we mainly focus on the time cost problem caused by semantic segmentation. Contrary to the very heavy geometric model that DynaSLAM uses, we only use very light geometric checks, such as RANSAC and the photometric error, to deal with the outliers that do not rely on priori dynamic objects.
Our proposal outperforms the original ORB-SLAM3 (RGB-D mode only, without IMU) and obtains performance similar to DynaSLAM, SLAM-PCD, and DM-SLAM, for which the tracking error is already very small. Different from them, we use the non-blocked model. The first few frames do not have any semantic information. The number of keyframes that have a semantic label is smaller than when using the blocked model, because the processing speed of the tracking thread is much faster than the semantic segmentation (especially for the heavy model, Mask R-CNN). However, we achieved a similar tracking performance using less semantic information.
C. REAL ENVIRONMENT EVALUATION
We test our system using a Kinect2 RGB-D camera, as shown in Fig. 15. All the features are in the initial status in the first few frames because they have not yet obtained any semantic information. The static features are increasingly detected over time and used to estimate the camera pose. The features on the person are detected and excluded from tracking. The algorithm runs at around 30 Hz, as shown in Table 4.
D. EXECUTION TIME
Tab. 4 compares the execution time of the vSLAM algorithms. In the blocked model, the tracking thread needs to wait for the semantic label. The speed of the other methods is therefore related to the semantic segmentation methods used.
The heavier the semantic model used, the higher the total time consumption. Although DynaSLAM achieves good tracking performance, its processing time is long due to Mask R-CNN. As we know, DynaSLAM is not a real-time algorithm. DS-SLAM is the second fastest algorithm because it uses a lightweight semantic segmentation method, SegNet. However, the architecture used is also a blocked model, and the execution time will increase if a more time-consuming method is used. Our method uses the non-blocked model and runs at an almost constant speed regardless of the segmentation method.
FIGURE 14. Trajectory comparison frame by frame. M stands for Mask R-CNN and S for SegNet.
TABLE 1. Results of absolute trajectory error (ATE) on TUM (m). Ours (1) and (3) are results evaluated only on keyframes.
TABLE 2. Results of translational relative pose error (RPE) (m). Ours (1) and (3) are results evaluated only on keyframes.
TABLE 3. Results of rotational relative pose error (RPE) (m). Ours (1) and (3) are results evaluated only on keyframes.
TABLE 4. Execution time comparison on the TUM dataset. We use the data from the original papers where possible. If not provided in their papers, we approximate the processing time.
TABLE 5. Semantic keyframe number comparison (Mask R-CNN).
We evaluate the error metric of the TUM dataset at 15 Hz by manually adding some time delay in the tracking thread, because the TUM dataset is very short and only very little semantic information can be obtained in this short time. We compare the time and the number of keyframes that obtained a semantic label (semantic keyframe number) in Tab. 5. We only compare the Mask R-CNN version because SegNet is faster and can segment almost all the keyframes in each dataset. We assume the time cost of Mask R-CNN is 0.2 s for segmenting each frame. The total time of running the fr3/w/xyz dataset is about 57.3 s at 15 Hz, but only 28.3 s at 30 Hz. In this short time, the number of semantic keyframes at 30 Hz (143) is two times smaller than at 15 Hz (286). Usually, the more keyframes are segmented, the better the tracking accuracy that can be achieved. This depends on the specific application and the segmentation methods used.
In the bi-direction model, we select two keyframes at the same time. We offer two strategies to segment them: 1) infer the images at the same time as a batch on the same GPU, or 2) infer the images on the same GPU sequentially (one by one).
We suggest using strategy (1) if the GPU can infer a batch of images at the same time. Our Mask R-CNN version uses (1) because we found that we need 0.3-0.4 s in case (1) and 0.2 s in case (2). Our SegNet version is evaluated using strategy (2) because SegNet is very fast and the keyframes can be segmented sequentially.
FIGURE 15. Result in the real environment. The green features are in the initial status and their moving probability is 0.5. The blue features are static features and the red ones are outliers. (a) is the originally detected ORB features. (b) is the output after the track last frame process and (c) is the result after the track local map process.
FIGURE 16. Semantic delay on the TUM w/xyz dataset. The average value for the Mask R-CNN case is 10 and for SegNet it is 5.
E. SEMANTIC DELAY EVALUATION
We analyzed the semantic delay by assuming that a keyframe is selected every two frames (see Fig. 6). In the experiment, we follow the keyframe selection policy used in ORB-SLAM3, and we compare the semantic delay of the Mask R-CNN case and the SegNet case using the TUM dataset, as shown in Fig. 16. The semantic delay is influenced by these factors: 1) the segmentation speed, 2) the keyframe selection policy, 3) the undetermined influence caused by the different running speeds of multiple threads (e.g., the loop closing thread), and 4) the hardware configuration. In the fr3/w/xyz dataset, the camera sometimes moves very slowly and sometimes moves forward or backward. As a result, this changes the keyframe selection frequency and causes the variance of the semantic delay.
VIII. CONCLUSION
A novel vSLAM system, semantic-based real-time visual SLAM (RDS-SLAM) for dynamic environments using an RGB-D camera, is presented. We modify ORB-SLAM3 and add a semantic tracking thread and a semantic-based optimization thread to remove the influence of dynamic objects using semantic information. These new threads run in parallel with the tracking thread, and therefore the tracking thread is not blocked waiting for semantic information. We proposed a keyframe selection strategy for semantic segmentation that obtains semantic information that is as recent as possible and that can deal with segmentation methods of different speeds. We update and propagate semantic information using the moving probability, which is used to detect and remove outliers from tracking with a data association algorithm. We evaluated the tracking performance and the processing time using the TUM dataset. The comparison against state-of-the-art vSLAM systems shows that our method achieves good tracking performance and can track each frame in real-time. The fastest speed of the system is about 30 Hz, which is similar to the tracking speed of ORB-SLAM3. In future work, we will try to 1) deploy our system on a real robot, 2) extend our system to stereo and monocular camera systems, and 3) build a semantic map.
REFERENCES
[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309-1332, Dec. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7747236/
[2] T. Taketomi, H. Uchiyama, and S. Ikeda, "Visual SLAM algorithms: A survey from 2010 to 2016," IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1, pp. 1-11, Dec. 2017, doi: 10.1186/s41074-017-0027-2.
[3] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Oct. 2013, pp. 209-218. [Online]. Available: http://ieeexplore.ieee.org/
[4] W. Dai, Y. Zhang, P. Li, and Z. Fang, "RGB-D SLAM in dynamic environments using points correlations," IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263-2270, Nov. 2018. [Online]. Available: https://arxiv.org/pdf/1811.03217v1.pdf
[5] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263-2270, Oct. 2017.
[6] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auto. Syst., vol. 89, pp. 110-122, Mar. 2017.
[7] D.-H. Kim, S.-B. Han, and J.-H. Kim, "Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment," in Robot Intelligence Technology and Applications 3, vol. 345. New York, NY, USA: Springer-Verlag, 2015, pp. 11-26. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-16841-8_2
[8] Y. Fan, Q. Zhang, S. Liu, Y. Tang, X. Jing, J. Yao, and H. Han, "Semantic SLAM with more accurate point cloud map in dynamic environments," IEEE Access, vol. 8, pp. 112237-112252, 2020.
[9] B. Bescos, J. M. Facil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076-4083, Oct. 2018.
[10] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168-1174.
[11] J. Cheng, Z. Wang, H. Zhou, L. Li, and J. Yao, "DM-SLAM: A feature-based SLAM system for rigid dynamic scenes," ISPRS Int. J. Geo-Inf., vol. 9, no. 4, pp. 1-18, 2020.
[12] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, "Detect-SLAM: Making object detection and SLAM mutually beneficial," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1001-1010.
[13] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auto. Syst., vol. 117, pp. 1-16, Jul. 2019. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0921889018308029
[14] M. A. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381-395, 1981.
[15] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980-2988.
[16] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255-1262, Oct. 2017.
[17] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2564-2571.
[18] C. Campos, R. Elvira, J. J. Gómez Rodríguez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM," 2020, arXiv:2007.11898. [Online]. Available: http://arxiv.org/abs/2007.11898
[19] R. Elvira, J. D. Tardos, and J. M. M. Montiel, "ORBSLAM-Atlas: A robust and accurate multi-map system," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 6253-6259.
[20] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100-2106.
[21] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment - A modern synthesis," in Proc. Int. Workshop Vis. Algorithms, 2000, pp. 298-372. [Online]. Available: http://link.springer.com/10.1007/3-540-44480-7_21
[22] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481-2495, Dec. 2017.
[23] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auto. Robots, vol. 34, no. 3, pp. 189-206, Apr. 2013.
[24] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled multi-object tracking and SLAM," 2020, arXiv:2010.07820. [Online]. Available: http://arxiv.org/abs/2010.07820
[25] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573-580. [Online]. Available: http://vision.in.tum.de/data/datasets/
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Computer Vision - ECCV 2016 (Lecture Notes in Computer Science), vol. 9905. Cham, Switzerland: Springer, 2016, pp. 21-37. [Online]. Available: https://github.com/weiliu89/caffe/tree/ssd
[27] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, "BlitzNet: A real-time deep network for scene understanding," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4174-4182.
[28] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision - ECCV 2014 (Lecture Notes in Computer Science), vol. 8693. Cham, Switzerland: Springer, 2014, pp. 740-755.
[29] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303-338, Jun. 2010. [Online]. Available: http://link.springer.com/10.1007/s11263-009-0275-4
[30] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA, USA: MIT Press, 2012. [Online]. Available: http://mitpress.mit.edu/books/probabilistic-robotics
YUBAO LIU received the bachelor's degree in computer science from Qufu Normal University, Qufu, China, in 2012, and the master's degree in computer science from Capital Normal University, Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the Toyohashi University of Technology, Toyohashi, Japan. In 2015, he joined Intel Research Center, Beijing, and he transferred to Isoftstone, Beijing, in 2016, as a Senior Software Engineer, working on computer vision and AR. His research interests include pattern recognition and SLAM for AR and smart robotics.
JUN MIURA (Member, IEEE) received the B.Eng. degree in mechanical engineering and the M.Eng. and Dr.Eng. degrees in information engineering from The University of Tokyo, Tokyo, Japan, in 1984, 1986, and 1989, respectively. In 1989, he joined the Department of Computer-Controlled Mechanical Systems, Osaka University, Suita, Japan. Since April 2007, he has been a Professor with the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Japan. From March 1994 to February 1995, he was a Visiting Scientist with the Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA. He has published over 220 articles in international journals and conferences in the areas of intelligent robotics, mobile service robots, robot vision, and artificial intelligence. He received several awards, including the Best Paper Award from the Robotics Society of Japan in 1997, the Best Paper Award Finalist at ICRA-1995, and the Best Service Robotics Paper Award Finalist at ICRA-2013.

View File

@ -0,0 +1,760 @@
STEREO CAMERA VISUAL SLAM WITH HIERARCHICAL
MASKING AND MOTION-STATE CLASSIFICATION AT OUTDOOR
CONSTRUCTION SITES CONTAINING LARGE DYNAMIC
OBJECTS
arXiv:2101.06563v1 [cs.RO] 17 Jan 2021
Runqiu Bao Ren Komatsu
Dept. of Precision Engineering Dept. of Precision Engineering
The University of Tokyo, Tokyo, Japan The University of Tokyo, Tokyo, Japan
bao@robot.t.u-tokyo.ac.jp komatsu@robot.t.u-tokyo.ac.jp
Renato Miyagusuku
Dept. of Mechanical and Intelligent Engineering
Utsunomiya University, Utsunomiya, Tochigi, Japan
miyagusuku@cc.utsunomiya-u.ac.jp
Masaki Chino Atsushi Yamashita
Construction Division Dept. of Precision Engineering
HAZAMA ANDO CORPORATION, Tokyo, Japan The University of Tokyo, Tokyo, Japan
chino.masaki@ad-hzm.co.jp yamashita@robot.t.u-tokyo.ac.jp
Hajime Asama
Dept. of Precision Engineering
The University of Tokyo, Tokyo, Japan
asama@robot.t.u-tokyo.ac.jp
January 19, 2021
ABSTRACT
At modern construction sites, utilizing GNSS (Global Navigation Satellite System) to measure the real-time location and orientation (i.e. pose) of construction machines and navigate them is very common. However, GNSS is not always available. Replacing GNSS with on-board cameras and visual simultaneous localization and mapping (visual SLAM) to navigate the machines is a cost-effective solution. Nevertheless, at construction sites, multiple construction machines will usually work together and side-by-side, causing large dynamic occlusions in the cameras' view. Standard visual SLAM cannot handle large dynamic occlusions well. In this work, we propose a motion segmentation method to efficiently extract static parts from crowded dynamic scenes to enable robust tracking of camera ego-motion. Our method utilizes semantic information combined with object-level geometric constraints to quickly detect the static parts of the scene. Then, we perform a two-step coarse-to-fine ego-motion tracking with reference to the static parts. This leads to a novel dynamic visual SLAM formation. We test our proposals through a real implementation based on ORB-SLAM2, and datasets we collected from real construction sites. The results show that when standard visual SLAM fails, our method can still retain accurate camera ego-motion tracking in real-time. Compared to state-of-the-art dynamic visual SLAM methods, ours shows outstanding efficiency and competitive trajectory accuracy.
Code available at: https://github.com/RunqiuBao/kenki-positioning-vSLAM
Corresponding author. Email: bao@robot.t.u-tokyo.ac.jp
Keywords
dynamic visual SLAM, motion segmentation, hierarchical masking, object motion-state classification, ego-motion tracking
1 Introduction
Knowledge of real-time location and orientation (i.e. pose) of construction machines, such as bulldozers, excavators,
and vibration rollers, is essential for the automation of construction sites. Currently, RTK-GNSS (Real-Time Kinematic
- Global Navigation Satellite System) is widely used because of its centimeter-level location accuracy. However,
in addition to the high price, the location output of RTK-GNSS could be unstable due to loss of satellite signals
underground, near mountains and trees, and between tall buildings. Therefore, replacing RTK-GNSS with on-board
cameras and visual SLAM (vSLAM) has been proposed [1]. Assuming the machine's starting pose is known in a global coordinate system, relative pose outputs from vSLAM can be used to navigate the machine.
However, at construction sites, several machines usually work together and side-by-side (Figure 1), which results in large dynamic occlusions in the cameras' view. Such dynamic occlusions can occupy more than 50% of the image.
It leads to a dramatic drop in tracking accuracy or even tracking failure when using standard vSLAM. We introduce
this problem distinctly in the context of dynamic vSLAM and propose an original stereo camera dynamic vSLAM
formation.
To deal with dynamic occlusions, our idea is to firstly detect static objects and backgrounds, and then track ego-motion
with reference to them. To achieve this, we need to estimate the real motion-states of objects. We use learning-based
object detection and instance segmentation combined with object-wise geometric measurement in stereo frames to label
the motion-states of object instances and generate occlusion masks for dynamic objects. Additionally, two types of
occlusion masks are applied to balance accuracy and computation cost: bounding box masks for small occlusions and pixel-wise masks for large occlusions. Pixel-wise masks describe object boundaries more accurately, while bounding boxes are faster to predict but less accurate.
In a nutshell, our contributions in this work include: (1) a semantic-geometric approach to detect static objects and
static backgrounds for stereo vSLAM, (2) a masking technique for dynamic objects called hierarchical masking, (3) a
novel stereo camera dynamic visual SLAM system for construction sites.
The remainder of this work is organized as follows: In Section 2, we summarize the existing research on dynamic visual
SLAM and motion segmentation methods, and describe the features of this work. In Section 3, the system structure
and our original proposals (two algorithms) are introduced. In Section 4, to test the performance of our proposals,
we conducted experiments at real construction sites and built datasets for algorithm evaluation. We used the Absolute Trajectory RMSE [2] to evaluate the accuracy of the location outputs of the vSLAM system. Finally, Section 5 contains the
conclusions and future work plan.
Figure 1: Simultaneous working of construction machines causes large-area moving occlusions in the on-board cameras' view.
Figure 2: Cameras are mounted on top of our construction machine facing to the sides, and RTK-GNSS is used to
collect ground truth positions.
2 Related Work
2.1 Dynamic Visual SLAM
Standard visual SLAM (vSLAM) assumes that the environment is static. Correspondingly, vSLAM for dynamic
environments (Dynamic vSLAM or Robust vSLAM) distinguishes static and dynamic features and computes pose
estimation based solely on static features.
Depending on the application, dynamic vSLAM can be categorized into two classes. One solely builds a static
background model, ignoring moving objects [3, 4, 2]. The other aims at not only creating a static background map, but
simultaneously maintaining sub-maps of moving objects [5, 6, 7]. Our task, i.e. positioning of construction machines,
requires fast and accurate camera ego-motion tracking and thus belongs to the first class.
The real-time positioning task at construction sites brings a new problem to vSLAM. Specifically, we found that at a busy
construction site, there are often many machines, trucks and persons moving around which become large dynamic
occlusions (occlusion rate >50% from time to time) in the camera view. Besides, such occlusions usually contain
more salient feature points than earthen ground and cause chaos in feature-based camera ego-motion tracking. Even
existing dynamic vSLAM solutions may suffer from different issues and are thus not the optimal solution to this task.
For example, [8, 9, 10, 11] proposed very fast methods for dealing with dynamic objects. Yet, they did not explicitly
consider overly-large dynamic occlusions and thus might suffer from accuracy drop. [2] and [6] proposed very robust
methods for masking dynamic occlusions. But both of them require heavy computation and are not suitable for real-time
positioning task. Therefore, we proposed our own dynamic vSLAM solution for real-time positioning at dynamic
construction sites.
In a dynamic vSLAM system, there are mainly two major modules: (1) motion segmentation and (2) localization and
mapping [12]. Motion segmentation is the key part that distinguishes an outstanding dynamic vSLAM system from the
rests.
2.2 Motion Segmentation
Motion segmentation is aimed at detecting moving parts in the image and classifying the features into two groups, static
and dynamic features.
Standard visual SLAM achieves this by applying robust statistical approaches to the estimation of geometric models,
such as Random Sample Consensus (RANSAC) [13]. However, such approach may fail when large dynamic occlusions
exist, and static features are not in the majority. Other approaches leverage external sensors such as inertial measurement
units (IMU) to fix camera ego-motion. In the following, we focus on visual-only approaches to distinguish static and
dynamic features. Muhamad et al. [12] summarize this research area well; for more details, please refer to that study.
The most intuitive approach for motion segmentation is using semantic information to separate object instances that
may move in the scene. To obtain semantic information, Bârsan et al. [6] used learning-based instance segmentation
to generate pixel-wise masks for object instances. Cui et al. [14] proposed only using bounding boxes obtained from
YOLO v3 [15] to filter dynamic objects, which can reduce computation cost. However, these works simply assume that
movable objects are dynamic. End-to-end learning-based methods for motion segmentation (without prior information
about the environment) are still scarce [12].
Another common strategy for motion segmentation is utilizing geometric constraints. It leverages the fact that dynamic
features will violate constraints defined in multi-view geometry for static scenes. Kundu et al. [16] detected dynamic
features by checking if the points lie on the epipolar line in the subsequent view and used Flow Vector Bound (FVB) to
distinguish motion-states of 3D points moving along the epipolar line. Migliore et al. [17] kept checking the intersection
between three projected viewing rays in three different views to confirm static points. Tan et al. [18] projected existing
map points into the current frame to check if the feature is dynamic. It is difficult for us to evaluate these methods.
However, one obvious drawback is that they require complicated modifications to the low-level components of the standard visual SLAM algorithm so that the static-environment assumption is removed. We argue that such modifications are not good for
the modularity of a vSLAM system.
As a novel hybrid approach, Bescos et al. [2], in their work named DynaSLAM, proposed to combine learning-based instance segmentation with multi-view geometry to refine masks for objects that are not a priori dynamic, but movable. Our system follows the hybrid fashion of DynaSLAM, but we treat motion segmentation as an object-level classification
problem. Our idea is, by triangulating and measuring positions of points inside the bounding boxes and comparing them
between frames, we can estimate object-level motion-states for every bounding box (assuming objects are all rigid). If
we know the motion-state of every bounding box, the surroundings can be easily divided into static and dynamic parts.
Besides, bounding boxes of large dynamic occlusions reduce available static features. We will show that it is essential to
keep the overall masked area under a certain threshold if possible. Hence, we designed an algorithm named hierarchical
masking to refine a pixel-wise mask inside the bounding box when the overall masked area extends past a threshold to
save scarce static features. This hierarchical masking algorithm is also an original proposal from us.
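The hierarchical masking idea described above can be sketched as follows: bounding-box masks are used by default, and a pixel-wise mask is computed only for boxes that would push the total masked area past a threshold. The threshold value and the helper names are illustrative assumptions, not values from the authors' implementation.
import numpy as np

def hierarchical_mask(image_shape, detections, segment_instance, area_threshold=0.5):
    """detections: list of (x, y, w, h) boxes of potentially dynamic objects.
    segment_instance(box) -> full-image binary mask with the pixel-wise segmentation
    of the object inside that box (slower to compute but tighter than the box)."""
    h_img, w_img = image_shape[:2]
    mask = np.zeros((h_img, w_img), dtype=np.uint8)
    # Consider the largest occlusions first, since they dominate the masked area.
    for (x, y, w, h) in sorted(detections, key=lambda b: b[2] * b[3], reverse=True):
        boxed = mask.copy()
        boxed[y:y + h, x:x + w] = 1
        if boxed.mean() > area_threshold:
            # The box mask would hide too much of the image: refine to a pixel-wise mask.
            mask = np.maximum(mask, segment_instance((x, y, w, h)).astype(np.uint8))
        else:
            mask = boxed
    return mask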
3 Stereo Camera Dynamic Visual SLAM robust against Large Dynamic Occlusions
The core problem in this research is to achieve fast and accurate camera ego-motion tracking when there are large
occlusions in the camera's view. Subsection 3.1 is a general introduction of the system pipeline. In Subsection 3.2, the
principle of feature-based camera ego-motion tracking with occlusion masks for dynamic occlusions is introduced. In
order to balance computation speed and accuracy in occlusion mask generation, a hierarchical masking approach is
proposed in Subsection 3.3. Last, through stereo triangulation and comparison, object instances in the current frame
will be assigned a predicted motion-state label, static or dynamic, which leads to further mask refining and a second
round of tracking.
3.1 System Overview
The system installation is illustrated in Figure 2 and the system pipeline is shown in Figure 3. Inputs are stereo frames
(left image and right image) captured by a stereo camera. Then semantic information, including object labels and
bounding boxes, are extracted using learning-based object detection. In addition, a hierarchical mask generation
approach is proposed to balance mask accuracy and generation speed. Object masks exclude suspicious dynamic objects
from the static background. The features in the static background are then used in the initial tracking of camera pose.
After initial tracking, a rough pose of the new frame is known, with which we distinguish static objects from other
objects. This is done by triangulating object-level 3D key points in the reference and current frame and comparing the
3D position errors to distinguish whether the object is moving or not. Large static objects can provide more salient static
features for improving tracking accuracy. Dynamic objects will be kept masked in the second ego-motion tracking.
This two-round coarse-to-fine tracking scheme helps detect static objects and improve pose estimation accuracy.
After the second round of tracking, there are mapping and pose graph optimization steps, as in most state-of-the-art vSLAM algorithms.
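A high-level sketch of the two-round coarse-to-fine pipeline in Figure 3; all function and method names here are placeholders standing in for the corresponding modules, not the authors' actual API.
def process_stereo_frame(left, right, detector, tracker, map_):
    """One iteration of the proposed pipeline (sketch)."""
    # 1. Semantic information: object labels and bounding boxes on the left image.
    detections = detector.detect(left)
    # 2. Hierarchical masking: mask all potentially dynamic objects.
    mask = tracker.build_occlusion_mask(left.shape, detections)
    # 3. Initial (coarse) ego-motion tracking on the static background only.
    coarse_pose = tracker.track(left, right, mask)
    # 4. Object-level motion-state classification via stereo triangulation and the coarse
    #    pose; objects classified as static are released from the mask.
    static_objects, dynamic_objects = tracker.classify_motion_states(detections, coarse_pose)
    refined_mask = tracker.build_occlusion_mask(left.shape, dynamic_objects)
    # 5. Second (fine) tracking round with the refined mask, then mapping/optimization.
    fine_pose = tracker.track(left, right, refined_mask)
    map_.update(fine_pose, left, right, refined_mask)
    return fine_pose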
3.2 Feature-based Camera Ego-motion Tracking by Masking Dynamic Occlusions
The camera ego-motion tracking framework used here is based on stereo ORB-SLAM2 [19]. When a new frame comes in,
a constant velocity motion model is first used to predict the new camera pose, with which we can search for matches
between map points and 2D feature points. After enough matches are found, a new pose can be re-estimated by the Perspective-n-Point
(PnP) algorithm [20]. Motion-only bundle adjustment (BA) is then used for further pose optimization. Motion-only
BA estimates the camera pose of the new stereo frame, including orientation R ∈ SO(3) and position t ∈ R^3, by
minimizing the reprojection error between matched 3D points x_i ∈ R^3 in the SLAM coordinates and feature points
p_i^(·) in the new frame, where i = 1, 2, ..., N. Here p_i^(·) includes monocular feature points p_i^m ∈ R^2 and stereo feature points
p_i^s ∈ R^3.
Figure 3: An overview of the proposed system. Inputs are stereo frames (all processing is performed on the left image; the right
image is only used for triangulating 3D points). After semantic information extraction, occlusion masks of objects are
generated and used to filter potential dynamic features. The initial ego-motion tracking is based purely on the static
background. Then more static objects are found and used as references in the second round of tracking to obtain more
accurate results. The final output is the camera pose R and t of the current frame in the SLAM coordinates.
Now suppose that M out of the N 3D points lie on a rigid-body dynamic object that underwent a pose change R', t' in the physical
world, so that their 3D coordinates change from x_i to x_i' for i = 1, 2, ..., M. The rigid-body transformation can be
expressed as x_i' = R' x_i + t'. Pose change estimation can then be written as

{R, t, R', t'} = argmin_{R, t, R', t'}  Σ_{i=1}^{M} ρ( ‖ p_i^(·) − π^(·)( R (R' x_i + t') + t ) ‖²_Σ ) + Σ_{i=M+1}^{N} ρ( ‖ p_i^(·) − π^(·)( R x_i + t ) ‖²_Σ ),   (1)
where ρ is the robust Huber cost function that controls the error growth of the quadratic function, and Σ is the covariance
matrix associated with the scale of the feature point. The projection functions π^(·) include the monocular projection π_m and the rectified stereo
projection π_s, as defined in [19]:

π_m([X, Y, Z]^T) = [ f_x X/Z + c_x ,  f_y Y/Z + c_y ]^T = [ u_l , v_l ]^T,   (2)

π_s([X, Y, Z]^T) = [ f_x X/Z + c_x ,  f_y Y/Z + c_y ,  f_x (X − b)/Z + c_x ]^T = [ u_l , v_l , u_r ]^T,   (3)

where (f_x, f_y) is the focal length, (c_x, c_y) is the principal point, and b is the baseline. (u_l, v_l) represents a monocular
feature point and (u_l, v_l, u_r) a stereo feature point.
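To make the two projection models concrete, the following is a minimal NumPy sketch of π_m and π_s as defined above (the function and variable names are ours, chosen for illustration, not taken from the authors' code):

import numpy as np

def project_mono(point, fx, fy, cx, cy):
    # pi_m: 3D point (X, Y, Z) in the camera frame -> left-image pixel (u_l, v_l)
    X, Y, Z = point
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def project_stereo(point, fx, fy, cx, cy, b):
    # pi_s: adds the right-image column u_r of a rectified stereo pair with baseline b
    X, Y, Z = point
    u_l = fx * X / Z + cx
    v_l = fy * Y / Z + cy
    u_r = fx * (X - b) / Z + cx
    return np.array([u_l, v_l, u_r])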
However, solving equation (1) is not easy, not to mention that there could be more than one dynamic object in the real
world. If we only want to estimate R and t, equation (1) can be simplified to

{R, t} = argmin_{R, t}  Σ_{i=M+1}^{N} ρ( ‖ p_i^(·) − π^(·)( R x_i + t ) ‖²_Σ ),   (4)

which means using only the static points in the scene to estimate the camera pose. If dynamic feature points are not
excluded as moving outliers, the estimation result will be wrong.
To separate static and dynamic feature points, our approach is to use a binary image as a mask (for the left image of the
input stereo frame). The mask has the same size as the input image; pixels with value 0 indicate static areas, while
pixels with value 1 indicate dynamic areas. Suppose that I_mask(u, v) refers to a pixel in the mask image I_mask, S_p is the
set of static pixels, and D_p is the set of dynamic pixels. Then

I_mask(u, v) = 0 if (u, v) ∈ S_p, and I_mask(u, v) = 1 if (u, v) ∈ D_p.   (5)
Figure 4 shows examples of masks (with alpha blending). To generate a mask, we first get bounding boxes or pixel-wise
segmentation results from learning-based object detection and instance segmentation (Subsection 3.3). Then, for those
objects with an a priori dynamic semantic label such as "car", "person", "truck", etc., we set the corresponding pixel values to 1 in
the mask image, while keeping the others at 0. We also apply geometric measurements and calculate a motion-state
label for every object bounding box. Inside a static bounding box, we set the pixel values to 0 whatever they were
(Subsection 3.4). Later, during ego-motion tracking, only the areas where the mask value equals 0 are used to
estimate the camera pose, as described by Equation (4).
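As an illustration of how the mask enters the tracking step, here is a hedged sketch (our own simplification, not the authors' implementation) that drops masked feature points and then estimates a rough pose with OpenCV's RANSAC PnP solver; keypoints_2d, points_3d and the camera matrix K are assumed inputs:

import numpy as np
import cv2

def estimate_pose_with_mask(keypoints_2d, points_3d, mask_img, K):
    # Keep only correspondences whose pixel lies on a static area (mask value 0).
    uv = np.round(keypoints_2d).astype(int)
    static = mask_img[uv[:, 1], uv[:, 0]] == 0          # mask indexed as (row=v, col=u)
    p2d = keypoints_2d[static].astype(np.float64)
    p3d = points_3d[static].astype(np.float64)
    # Rough camera pose from the static points only, i.e. the spirit of Equation (4).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(p3d, p2d, K, None)
    R, _ = cv2.Rodrigues(rvec)                          # rotation matrix from the Rodrigues vector
    return ok, R, tvec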
3.3 Hierarchical Object Masking
The switching between two types of masks forms a hierarchical masking strategy that balances computation speed and
mask accuracy.
To reduce computation cost, we first use an object detector, e.g. EfficientDet [21], to predict object instances and
recognize their bounding boxes. Such a learning-based object detector is a deep neural network that can predict all the
bounding boxes, class labels, and class probabilities directly from an image in one evaluation. A bounding box only
represents a rough boundary of the object, so when it is used as an object mask, background feature points inside the
rectangle are also classified as "object". It is, therefore, only a rough boundary description.
There were cases when bounding boxes occupied most of the area in the image, which led to a shortage of available static
features, and thus the accuracy of the ego-motion tracking declined. In such cases, we perform pixel-wise segmentation
on the image to save more static features. For pixel-wise segmentation, we also use deep learning approaches, such as
Mask R-CNN [22]. Pixel-wise segmentation takes more time and slows down the system output rate. Thus, pixel-wise
segmentation should be performed only in extreme cases, when the frame is crowded with object bounding boxes.
Figure 4: Two kinds of masks and masked features.
Algorithm 1: Hierarchical Masking
Input: stereo images in current frame, Icl, Icr; Masked Area Ratio threshold, τmar.
Output: image mask for the left image in current frame, Imask.
Initialisation: a blank image mask, Imask; initial Masked Area Ratio as 0, mar = 0;
1: Imask=objectDetectionAndMasking(Icl);
2: mar=calMaskedAreaRatio(Imask);
3: if (mar ≥ τmar) then
4: Imask=pixelwiseSegmentationAndMasking(Icl);
5: end if
6: return Imask
The switching to pixel-wise segmentation is controlled by an index named Masked Area Ratio (mar). If Am is the total
area of bounding boxes in pixels and Af is the total area of the image in pixels, then we have,
mar = Am / Af.   (6)
If mar is larger than the threshold τmar, it means the current frame is quite crowded and pixel-wise segmentation is
necessary.
Hierarchical object masking is summarized as follows: when a new frame comes in, we first run the object detector
to obtain bounding boxes. Then mar is calculated. If mar is higher than a pre-set threshold τmar, we perform
pixel-wise segmentation and output the pixel-wise object mask. If mar is smaller than the threshold, the bounding box
mask is directly forwarded as the object mask. This procedure is summarized in Algorithm 1.
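A minimal Python sketch of Algorithm 1 is given below. The detector and segmenter are passed in as callables (placeholders for, e.g., EfficientDet and Mask R-CNN wrappers); only the switching logic follows the paper, the rest is our own assumption:

import numpy as np

def masked_area_ratio(mask):
    # mar = A_m / A_f, Eq. (6): masked pixels over total pixels.
    return float(np.count_nonzero(mask)) / mask.size

def bbox_mask(shape, boxes):
    # Build a rectangle mask from detector bounding boxes given as (x1, y1, x2, y2).
    mask = np.zeros(shape, dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask

def hierarchical_mask(image_left, detect_boxes, segment_pixelwise, tau_mar=0.5):
    # Cheap bounding-box mask first; fall back to pixel-wise segmentation
    # only when the masked area ratio exceeds tau_mar.
    mask = bbox_mask(image_left.shape[:2], detect_boxes(image_left))
    if masked_area_ratio(mask) >= tau_mar:
        mask = segment_pixelwise(image_left)
    return mask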
3.4 Objects' Motion-state Classification for Further Mask Refinement
After the first ego-motion tracking, with reference to the background, we roughly know the pose of the current frame.
Based on the current pose, we triangulate object-level 3D points on all the detected object instances in the current frame
and a selected reference frame and distinguish whether they have moved. Feature points inside static bounding boxes
are then unmasked and used as valid static references in the second round of tracking. This algorithm (Algorithm 2)
named motion-state classification is detailed in the following.
To classify objects' motion-states, a reference frame first needs to be selected from previous frames. In this work, we
use the N-th frame before the current frame as the reference frame. N is determined based on the machine's velocity.
For example, for vibration rollers mostly moving at 4 km/h, FPS/3 to FPS/2 can be selected as N (FPS stands for
the frame rate of camera recording, namely Frames Per Second). For domestic automobiles running at higher speeds,
N should be selected smaller so that there is an appropriate visual change between the current and reference frames. This
strategy is simple but effective, given the simple moving pattern of construction machines. There are more sophisticated
methods for selecting the best reference frame, as stated in [2] and [18].
Then, suppose there are objects {obj1, obj2, ..., objm} in the reference frame (RF) and objects {obj1, obj2, ..., objn}
in the current frame (CF). We associate the m objects in RF with the n objects in CF by feature matching. If the
Figure 5: Associate bounding boxes between the Reference Frame (RF) and Current Frame (CF) using feature matching.
Triangulate object-level 3D points in RF, then triangulate corresponding 3D points in CF and compare their positions in
the two measurements. If most of the point-wise position errors of an object (bounding box) are smaller than three times
the standard deviation of static background points, the object is labeled as static during the camera pose change from RF
to CF.
Figure 6: Algorithm 2: Objects' Motion-state Classification.
object instances are associated successfully between two frames, which means the object is co-visible in both frames,
we triangulate 3D points within the bounding boxes in both frames in SLAM coordinates and calculate point-wise
position errors. The 3D position errors of static background points are assumed to obey a zero-mean Gaussian distribution.
The standard deviation, σbkg, is determined beforehand and used as the threshold for classification. For static objects,
in principle all 3D point position errors should be less than three times σbkg. However, considering the inaccuracy of a
bounding box, we loosen the condition to 70%, i.e. an object is classified as "static" when more than 70% of its 3D
points have a position error smaller than (3 × σbkg). Outliers of feature matching usually result in very large
position errors, so we only keep points with position errors smaller than the median to exclude such outliers. Figure 5 shows the
principle of the geometric constraint: the left object is a dynamic object and the right one is a static object. Figure 6 shows the
input and output as well as the main ideas of Algorithm 2. Details about how to implement this algorithm can be found in
our code repository.
(a) Construction site bird view (b) Vibration roller
Figure 7: Experiment setting.
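The decision rule of Algorithm 2 can be sketched as follows (a simplified reading of the description above; the 3·σbkg test, the 70% ratio and the median-based outlier rejection are from the text, while the function signature is ours):

import numpy as np

def classify_motion_state(points_rf, points_cf, sigma_bkg=0.12, static_ratio=0.7):
    # points_rf, points_cf: (N, 3) object-level 3D points triangulated in the
    # reference frame and the current frame, both in SLAM coordinates.
    errors = np.linalg.norm(points_cf - points_rf, axis=1)   # point-wise position errors
    kept = errors[errors <= np.median(errors)]               # discard gross matching outliers
    if kept.size == 0:
        return "unknown"
    fraction = np.mean(kept < 3.0 * sigma_bkg)               # per-point 3-sigma test
    return "static" if fraction > static_ratio else "moving"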
4 Experimental Evaluations
4.1 Testing Environments and Datasets
To evaluate our proposed approaches, we conducted experiments at two construction sites in Japan with a machine
called a vibration roller, as shown in Figure 7(b). Vibration rollers are used to flatten the earthen base of structures and
facilities. For work efficiency, there are usually multiple rollers running simultaneously and side by side, so large
moving occlusions become a serious problem for visual SLAM.
In all experiments, a stereo camera was mounted on the cabin top of a roller facing the side. The baseline of the stereo
camera was about 1 m. The roller moved along a typical trajectory (Figure 7(a)) with maximum speed of 11 km/h. The
ground truth trajectories were recorded using RTK-GNSS. We synchronized the ground truth and the estimated camera poses
by minimizing the Absolute Trajectory RMSE ([2, 19, 23]) and choosing an appropriate time offset between the GNSS and
the camera's timer. The estimated camera trajectories are then aligned with the ground truth trajectories by the Umeyama
algorithm [24]. We evaluate the accuracy of the camera pose outputs of the vSLAM system with reference to the associated
ground truth by the Absolute Trajectory RMSE (AT-RMSE).
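For reference, this evaluation step can be reproduced with a short NumPy sketch: a rigid (rotation plus translation, no scale) Umeyama alignment of corresponding positions followed by the AT-RMSE. This is our own illustrative code, not the evaluation tool used in the paper:

import numpy as np

def umeyama_align(est, gt):
    # Least-squares rotation R and translation t mapping the estimated positions onto
    # the ground truth ones (est, gt are (N, 3) arrays of corresponding positions).
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    cov = (gt - mu_g).T @ (est - mu_e) / est.shape[0]
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # enforce a proper rotation (det = +1)
    R = U @ S @ Vt
    t = mu_g - R @ mu_e
    return R, t

def at_rmse(est, gt):
    # Absolute Trajectory RMSE after alignment.
    R, t = umeyama_align(est, gt)
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))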
Video data were collected at the site and evaluated in the lab. Image resolution was 3840 × 2160, and frame rate was 60
fps. For efficient evaluation, we downsampled the image sequences to 960 × 540 and 6 fps. We eventually collected
five image sequences: three with dynamic machines inside, the 4th one containing only two static machines, and the
5th one without any occlusions. The specifications of the computer used were an Intel Core i7-9700K CPU,
and NVIDIA GeForce RTX 2080 Ti GPU. We used a tool provided by [25] for trajectory accuracy evaluation and
visualization.
When evaluating our vSLAM system implementation, all the masks, including bounding box and pixel-wise masks,
are generated beforehand using EfficientDet [21] and the Detectron2 [26] version of Mask R-CNN [22]. EfficientDet is
reported to be able to prioritize detection speed or detection accuracy through configuration. In our implementation, we
used EfficientDet-D0, and the weights were trained on the MS COCO dataset [27]. The weights for Mask R-CNN were also
trained on the MS COCO dataset [27]. Without fine-tuning, they are already good enough for this study. Besides, when
calculating the overall computation time per frame, we record the time consumption of the vSLAM tracking part and the mask
generation part separately, and then add them together. Note that in hierarchical masking, the additional time caused
by pixel-wise segmentation is averaged over all the frames.
(a) Absolute position error of every camera pose (b) Camera trajectory with colormapped position error
Figure 8: Quantitative evaluation for estimated trajectory of image sequence 1 "kumamoto1".
Table 1: Details about the five image sequences.
Dataset details | kumamoto1 | kumamoto2 | chiba1 | chiba2 | chiba3
Max. occlusion ratio | 0.493 | 0.445 | 0.521 | 0.633 | 0.0
MAR>0.5 frames | 0/1263 | 0/1186 | 12/647 | 69/668 | 0/708
Machines speed | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h
Occlusions & their motion-states | 1 roller (dynamic) | 1 roller (dynamic) | 1 roller (dynamic), 1 roller (static), 7 color cones (static), 1 checkerboard (static) | 2 rollers (static), 7 color cones (static), 1 checkerboard (static) | no occlusions
4.2 Performance Evaluation with Our Construction Site Datasets
Figure 8(a) shows the absolute position error of every camera pose between the estimated trajectory using the proposed
system and ground truth of a sequence (kumamoto2). Figure 8(b) is a bird's-eye view of the camera trajectory with
colormapped absolute position error. Five sequences were prepared in total, and we repeat this evaluation 10 times for
each sequence. The details of the five sequences are given in Table 1. Figure 9(a) shows the distribution of
(a) Estimated trajectory accuracy (lower is better) (b) Averaged computation speed (lower is better)
Figure 9: Performance comparison on our construction site datasets.
(a) Three machines working in parallel with each other. (b) From the viewpoint of the on-board camera
Figure 10: Dynamic scene and hierarchical masking example.
Absolute Trajectory RMSE of all five sequences. We compare our proposed system with a simple baseline system, with
DynaSLAM [2] and with the original ORB-SLAM2 stereo. The baseline system is also based on ORB-SLAM2 but
is able to detect and remove moving objects. Its “moving object removal” method is derived from Detect-SLAM [9],
which performs bounding box detection and masks all detected movable bounding boxes. In the results, our proposed
system shows better trajectory accuracy than the baseline in three out of five sequences, namely kumamoto1,
chiba1 and chiba3. If the baseline represents fast and efficient handling of dynamic objects, DynaSLAM is much
heavier computationally, but its motion segmentation method is pixel-level precise and indeed the current state of the art.
The experimental results show that DynaSLAM achieves slightly better trajectory accuracy on sequences such as
kumamoto1 and chiba1. The original ORB-SLAM2 stereo can only survive chiba2 and
chiba3, which are completely static. In addition, the trajectory accuracy of chiba2 and chiba3 is generally better than
that of the dynamic sequences, regardless of the method. Dynamic occlusions do have an irreversible influence on camera
ego-motion tracking.
Averaged computation speed comparisons are shown in Figure 9(b). Our proposed system is slower than the
baseline and ORB-SLAM2 stereo to begin with. However, our method can be significantly accelerated by
parallel computing such as GPU acceleration. In our implementation named "ours_gpu" in Figure 9, we enabled GPU
acceleration for all the ORB feature extractions, and the speed improves notably. However, the trajectory
accuracy differs from "ours" to a certain extent, although theoretically they should be the same. We are still
looking for the root cause. Finally, the time cost of DynaSLAM (tracking only, without background inpainting) is 2 to 3
times that of ours_gpu. Large computation latency is undesirable, since our targeted task is real-time positioning and
navigation of a construction machine.
4.3 Ablation Study
4.3.1 Hierarchical Object Masking
Hierarchical masking aims to efficiently propose an appropriate initial mask in case there are overly-large dynamic
occlusions in the image. Figure 10(a) shows a scene when the machine was working along with two other machines and
thus had two large occlusions in the camera view. Figure 10(b) shows a sample image recorded by the on-board camera.
Notice that the two rectangles labeled as truck are bounding boxes detected by the object detection algorithm, and the
color masks inside the bounding boxes are produced by pixel-wise segmentation. Besides, ORB feature points are extracted
and marked on this image. Green points are static features on the static background, blue points are those included
by bounding boxes but not by pixel-wise masks, and red points are features masked by pixel-wise masks.
It is obvious that the bounding box mask causes many innocent static features to be treated as dynamic. Through a toy
experiment, we show how this causes a shortage of available feature points and leads to worse pose tracking accuracy.
Then, with a real example from our datasets, we explain the effectiveness of hierarchical masking.
Figure 11: A toy experiment: estimated trajectory accuracy when putting different sizes of occlusions on the 4th image
sequence "chiba2".
Table 2: Tracking accuracy of "chiba2" with three different mask types.
Mask type | AT-RMSE (m, average of 10 trials) | Max. occlusion ratio
B-box mask | 0.0437 | 0.63
Hierarchical mask | 0.0404 | 0.50
Pixel-wise mask | 0.0397 | 0.32
(1) A toy experiment
We put a fake constant dynamic occlusion at the center of the mask images of the 4th image sequence, chiba2 (a static
scene). We then adjusted the size of this area to simulate different occlusion ratios and observe how the resulting trajectory
accuracy changes. The result is plotted in Figure 11. Before the occlusion ratio reaches 0.6, the trajectory error only varies
over a small range; when the occlusion ratio exceeds 0.7, the RMSE increases exponentially due to a shortage of available
features. Therefore, when the occlusion ratio of the image approaches the critical point of 0.6, we define it as a large
occlusion condition, requiring refinement of the bounding box mask to a pixel-wise mask to suppress the growing
error. Besides, when the occlusion ratio is larger than 0.6, tracking loss happens frequently, which is not acceptable when
navigating a construction machine. To avoid tracking loss and relocalization, we set the threshold (τmar in Section 3.3)
to 0.5 as a safety margin.
However, when the occlusion ratio is far below 0.6, the bounding box mask is sufficient and also faster to obtain. On our
computer, generating bounding box masks for one image frame takes 0.0207 seconds on average, while a pixel-wise
mask takes 0.12 seconds.
(2) An overly large occlusion case
In order to demonstrate the effectiveness of hierarchical masking when facing overly large occlusions, we show an
example from the "chiba2" sequence. From the 3500th to the 4500th frame (1000 frames in the original 60 fps sequence) of the
"chiba2" sequence, we encountered an overly large occlusion. As Table 2 shows, when changing from the bounding box
mask to the pixel-wise mask, the maximum masked area ratio was reduced from 0.63 to 0.32 and, correspondingly, the trajectory
error decreased. Hierarchical masking benefits trajectory accuracy and costs much less time than using
pixel-wise masks only: in this example, only 2/3 of the frames during this period need a pixel-wise mask, and the
maximum masked area ratio is constrained to 0.5. Note that although the Absolute Trajectory RMSE difference
between 0.0404 and 0.0437 in Table 2 seems trivial, this is partially due to the trajectory alignment algorithm
[24] used for evaluation; the actual accuracy difference can be larger.
4.3.2 Objects' Motion-state Classification
Not all a priori dynamic objects are moving. Ignoring static objects leads to a loss of information, especially when
they are salient and occupy a large area in the image. Therefore, we designed the objects' motion-state classification
algorithm to detect static objects and unmask them for ego-motion tracking. Figure 12 shows dynamic and static objects
detected in the image sequences and scores related to the likelihood that they are dynamic. We also show an example
Figure 12: Illustration of the classification result. In the left column, the third row shows that there is one machine
classified as "moving" and another classified as "static" in this frame. The second row shows the position errors of
3D points on these two machines between this frame and the reference frame. Points on the "moving" machine have
higher position errors. Similarly, in the right column, there are also "moving" machines (two parts of one machine) and
a "static" color cone detected.
Table 3: Tracking accuracy with motion-state classification.
Mask type | AT-RMSE (m) | Max. occlusion ratio
All objects masked | 0.04973 | 0.63
Static objects unmasked | 0.04198 | 0.0
of using the proposed algorithm in visual SLAM. Again, we use the 3500th to the 4500th frame (1000 frames) of the
"chiba2" sequence; since the machines are completely static during this period, they are detected as static and unmasked.
Table 3 shows how this influences the tracking accuracy.
However, there is still one bottleneck in this algorithm: σbkg is an essential parameter for the performance of motion-state
classification. For all the evaluations above with the four image sequences, we set σbkg to 0.12, which was
determined empirically. To explore the influence of this parameter on system performance, we vary σbkg between 0
and 0.6 and evaluate the classifier in terms of its ROC (Receiver Operating Characteristic) curve. Since the final target is to find
static objects, "static" is regarded as positive and "dynamic" as negative, ignoring objects that cannot be classified. The
ROC curve is shown in Figure 13: the true positive rate (TPR, sensitivity) on the y axis is the ratio of true positives
to the sum of true positives and false negatives, and the false positive rate on the x axis is the ratio of false positives to the
sum of false positives and true negatives. According to this curve, the Area Under the Curve (AUC) reached 0.737,
which indicates a valid classifier. The red dot in the plot marks the position where σbkg = 0.12.
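The ROC sweep over σbkg can be computed from the per-object position-error statistics and manual motion labels, for instance with the following hedged sketch (the data layout is an assumption on our side):

import numpy as np

def roc_curve_points(objects, sigmas, static_ratio=0.7):
    # objects: list of (errors, is_static_gt) pairs, where `errors` are the kept point-wise
    # 3D position errors of one detected object and is_static_gt its manual label.
    points = []
    for sigma in sigmas:
        tp = fp = tn = fn = 0
        for errors, is_static_gt in objects:
            pred_static = np.mean(np.asarray(errors) < 3.0 * sigma) > static_ratio
            if pred_static and is_static_gt:
                tp += 1
            elif pred_static and not is_static_gt:
                fp += 1
            elif is_static_gt:
                fn += 1
            else:
                tn += 1
        tpr = tp / max(tp + fn, 1)           # sensitivity: static objects found
        fpr = fp / max(fp + tn, 1)           # dynamic objects wrongly labeled static
        points.append((fpr, tpr))
    return points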
4.4 Evaluation with KITTI Dataset
The KITTI Dataset [28] provides stereo camera sequences in outdoor urban and highway environments. It has been a
widespread benchmark for evaluating vSLAM system performance, especially trajectory accuracy. Works such as
[2, 19] all provide evaluation results on KITTI. Some sequences in KITTI contain normal-size dynamic
occlusions, so KITTI is also appropriate for evaluating our method. Table 4 shows the evaluation results. The
comparison includes four systems: our proposed system, the baseline, DynaSLAM and ORB-SLAM2 stereo, the same as in
Section 4.2. For the baseline, DynaSLAM and ORB-SLAM2 stereo, all settings remain the same as before. For our
system, τmar (Section 3.3) remains 0.5 and σbkg (Section 3.4) remains 0.12. However, N (Section 3.4) is
changed to 2, since the frame rate of KITTI is 10 fps and the cars are much faster than our construction machines. We
ran each sequence 10 times with each system and recorded the averaged Absolute Trajectory RMSE (AT-RMSE, m) as
well as the averaged computation time per frame (s). For our system, we recorded results both with GPU acceleration (w
A) and without GPU acceleration (w/o A). Among the four systems, the best AT-RMSE for each sequence is marked
in bold and the best computation time in bold italic. Note that the AT-RMSE results of DynaSLAM
Figure 13: ROC curve for the motion-state classification when σbkg was between 0 and 0.6, estimated with the 3rd
image sequence "chiba1". The Area Under Curve (AUC) reached 0.737. Red dot is the position where σbkg = 0.12.
Table 4: Trajectory accuracy and time consumption evaluation on KITTI Dataset.
ours baseline dynaslam (tracking) orb-slam2
Sequence AT-RMSE (m) time per AT-RMSE time per AT-RMSE time per AT-RMSE time per
frame (s)
KITTI 00 (m) frame (s) (m) frame (s) (m) frame (s)
KITTI 01 w/o A w A w/o A w A
KITTI 02
KITTI 03 2.1290 1.7304 0.2018 0.1565 2.0173 0.0912 3.9691 0.3354 1.7304 0.0703
KITTI 04 9.1271 0.0917 21.8982 0.3273 8.7620 0.0734
KITTI 05 8.4940 8.7620 0.1860 0.1305 4.9280 0.0935 5.9401 0.3243 4.9994 0.0771
KITTI 06 3.1174 0.0898 4.7770 0.3459 3.0735 0.0723
KITTI 07 5.1759 4.7338 0.1764 0.1194 0.9970 0.0864 1.3371 0.3420 1.0079 0.0672
KITTI 08 2.0528 0.0923 1.7644 0.3482 1.9751 0.0717
KITTI 09 3.2169 3.4246 0.1462 0.0983 1.9338 0.0943 2.0627 0.3434 1.8793 0.0752
KITTI 10 1.1799 0.0843 1.1285 0.3493 0.9733 0.0632
1.0835 1.2937 0.1811 0.1297 4.7857 0.0882 3.7062 0.3488 4.6483 0.0675
7.1441 0.0865 4.2753 0.3463 5.9788 0.0657
2.1243 2.2529 0.1915 0.1382 2.6986 0.0912 2.2028 0.3466 2.6699 0.0631
2.1718 2.2651 0.2076 0.1546
1.2323 1.3159 0.1791 0.1337
4.5641 5.2294 0.1945 0.1445
4.9692 5.8698 0.1760 0.1231
2.5849 2.6375 0.1522 0.1022
and ORB-SLAM2 stereo are different from those in the original papers. This is because we only align the trajectory with the ground
truth, without adjusting the scale, before calculating the trajectory error, since our target is online positioning with vSLAM.
From Table 4, we see that in terms of computation speed, ORB-SLAM2 stereo is always the fastest, because it
adopts the static environment assumption. DynaSLAM is the slowest. Ours is slightly slower than the baseline and
ORB-SLAM2 stereo; however, GPU acceleration improves the speed to a tolerable level. In terms of
AT-RMSE, the results vary, but DynaSLAM and ORB-SLAM2 stereo do have the most bold numbers.
In the KITTI dataset, there are moving automobiles, bicycles and pedestrians in some frames, but they are not overly large.
In fact, there are only 6 frames in "07" in which the occlusion ratio exceeds 0.5. Besides, automobiles on
the street do not provide as many salient feature points as construction machines; most of them have texture-less,
smooth surfaces. Therefore, our proposed system has no particular advantage on KITTI.
5 Conclusions & Future Work
We presented a stereo vSLAM system for dynamic outdoor construction sites. The key contributions are, first, a
hierarchical masking strategy that can refine an overly large occlusion mask in a timely and efficient way, and second, a
semantic-geometric approach for objects' motion-state classification together with a two-step coarse-to-fine ego-motion tracking scheme.
Our system accurately retrieved the motion trajectories of a stereo camera at construction sites, and most of the
surrounding objects' motion-states in the scene were correctly predicted. Hierarchical object masking has also been
proved to be a simple but useful strategy. Our proposed system can deal with dynamic and crowded environments in which
standard vSLAM systems may fail to keep tracking.
In future work, the method for selecting reference frames can be optimized to enable more robust object motion-state
classification. Moreover, we plan to combine vSLAM with an inertial measurement unit (IMU) sensor for higher-accuracy
positioning. However, the fierce, high-frequency vibration of the vibration roller may cause severe noise in the
IMU measurements, which would degrade pose accuracy. We will therefore look into this problem and meanwhile
explore other visual SLAM related research topics at construction sites.
References
[1] Runqiu Bao, Ren Komatsu, Renato Miyagusuku, Masaki Chino, Atsushi Yamashita, and Hajime Asama. Cost-effective and robust visual based localization with consumer-level cameras at construction sites. In Proceedings of the 2019 IEEE Global Conference on Consumer Electronics (GCCE 2019), pages 983-985, 2019.
[2] Berta Bescos, José M Fácil, Javier Civera, and José Neira. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4):4076-4083, 2018.
[3] Mariano Jaimez, Christian Kerl, Javier Gonzalez-Jimenez, and Daniel Cremers. Fast odometry and scene flow from RGB-D cameras based on geometric clustering. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA 2017), pages 3992-3999, 2017.
[4] Dan Barnes, Will Maddern, Geoffrey Pascoe, and Ingmar Posner. Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), pages 1894-1900, 2018.
[5] Binbin Xu, Wenbin Li, Dimos Tzoumanikas, Michael Bloesch, Andrew Davison, and Stefan Leutenegger. MID-Fusion: Octree-based object-level multi-instance dynamic SLAM. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA 2019), pages 5231-5237, 2019.
[6] Ioan Andrei Bârsan, Peidong Liu, Marc Pollefeys, and Andreas Geiger. Robust dense mapping for large-scale dynamic environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), pages 7510-7517, 2018.
[7] Martin Runz, Maud Buffier, and Lourdes Agapito. MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2018), pages 10-20, 2018.
[8] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Qiao Fei. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018), pages 1168-1174. IEEE, 2018.
[9] Fangwei Zhong, Sheng Wang, Ziqi Zhang, and Yizhou Wang. Detect-SLAM: Making object detection and SLAM mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV 2018), pages 1001-1010. IEEE, 2018.
[10] Linhui Xiao, Jinge Wang, Xiaosong Qiu, Zheng Rong, and Xudong Zou. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robotics and Autonomous Systems, 117:1-16, 2019.
[11] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Visual SLAM in human populated environments: Exploring the trade-off between accuracy and speed of YOLO and Mask R-CNN. In Proceedings of the 2019 International Conference on Advanced Robotics (ICAR 2019), pages 135-140. IEEE, 2019.
[12] Muhamad Risqi U Saputra, Andrew Markham, and Niki Trigoni. Visual SLAM and structure from motion in dynamic environments: A survey. ACM Computing Surveys (CSUR), 51(2):1-36, 2018.
[13] Martin A Fischler and Robert C Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.
[14] Zhaopeng Cui, Lionel Heng, Ye Chuan Yeo, Andreas Geiger, Marc Pollefeys, and Torsten Sattler. Real-time dense mapping for self-driving vehicles using fisheye cameras. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA 2019), pages 6087-6093, 2019.
[15] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[16] Abhijit Kundu, K Madhava Krishna, and Jayanthi Sivaswamy. Moving object detection by multi-view geometric techniques from a single camera mounted robot. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages 4306-4312, 2009.
[17] Davide Migliore, Roberto Rigamonti, Daniele Marzorati, Matteo Matteucci, and Domenico G Sorrenti. Use a single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. In Proceedings of the 2009 ICRA Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, pages 12-17, 2009.
[18] Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust monocular SLAM in dynamic environments. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2013), pages 209-218, 2013.
[19] Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255-1262, 2017.
[20] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, 2015.
[21] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.
[22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), pages 2961-2969, 2017.
[23] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), pages 573-580. IEEE, 2012.
[24] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):376-380, 1991.
[25] Michael Grupp. evo: Python package for the evaluation of odometry and SLAM. https://github.com/MichaelGrupp/evo, 2017.
[26] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 2014 European Conference on Computer Vision (ECCV 2014), pages 740-755, 2014.
[28] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231-1237, 2013.
@ -0,0 +1,724 @@
TartanVO: A Generalizable Learning-based VO
Wenshan Wang Yaoyu Hu Sebastian Scherer
Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University
arXiv:2011.00359v1 [cs.CV] 31 Oct 2020
Abstract: We present the first learning-based visual odometry (VO) model,
which generalizes to multiple datasets and real-world scenarios, and outperforms
geometry-based methods in challenging scenes. We achieve this by leveraging
the SLAM dataset TartanAir, which provides a large amount of diverse synthetic
data in challenging environments. Furthermore, to make our VO model generalize
across datasets, we propose an up-to-scale loss function and incorporate the cam-
era intrinsic parameters into the model. Experiments show that a single model,
TartanVO, trained only on synthetic data, without any finetuning, can be general-
ized to real-world datasets such as KITTI and EuRoC, demonstrating significant
advantages over the geometry-based methods on challenging trajectories. Our
code is available at https://github.com/castacks/tartanvo.
Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow
1 Introduction
Visual SLAM (Simultaneous Localization and Mapping) becomes more and more important for
autonomous robotic systems due to its ubiquitous availability and the information richness of im-
ages [1]. Visual odometry (VO) is one of the fundamental components in a visual SLAM system.
Impressive progress has been made in both geometric-based methods [2, 3, 4, 5] and learning-based
methods [6, 7, 8, 9]. However, it remains a challenging problem to develop a robust and reliable VO
method for real-world applications.
On one hand, geometric-based methods are not robust enough in many real-life situations [10, 11].
On the other hand, although learning-based methods demonstrate robust performance on many vi-
sual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical
flow, we have not yet seen the same story happening to VO.
It is widely accepted that by leveraging a large amount of data, deep-neural-network-based methods
can learn a better feature extractor than engineered ones, resulting in a more capable and robust
model. But why haven't we seen deep learning models outperform geometry-based methods yet?
We argue that there are two main reasons. First, the existing VO models are trained with insufficient
diversity, which is critical for learning-based methods to be able to generalize. By diversity, we
mean diversity both in the scenes and motion patterns. For example, a VO model trained only on
outdoor scenes is unlikely to be able to generalize to an indoor environment. Similarly, a model
trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion,
will unlikely be applicable to drones. Second, most of the current learning-based VO models neglect
some fundamental nature of the problem which is well formulated in geometry-based VO theories.
From the theory of multi-view geometry, we know that recovering the camera pose from a sequence
of monocular images has scale ambiguity. Besides, recovering the pose needs to take account of the
camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing
with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail
in another dataset, no matter how good the feature extractor is.
To this end, we propose a learning-based method that can solve the above two problems and can
generalize across datasets. Our contributions are threefold. First, we demonstrate the crucial
effects of data diversity on the generalization ability of a VO model by comparing performance on
different quantities of training data. Second, we design an up-to-scale loss function to deal with the
scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model enabling
generalization across different cameras. To our knowledge, our model is the first learning-based VO
that has competitive performance on various real-world datasets without finetuning. Furthermore,
compared to geometry-based methods, our model is significantly more robust in challenging scenes.
A demo video can be found at: https://www.youtube.com/watch?v=NQ1UEh3thbU
Corresponding author: wenshanw@andrew.cmu.edu
4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.
2 Related Work
Besides early studies of learning-based VO models [12, 13, 14, 15], more and more end-to-end
learning-based VO models have been studied with improved accuracy and robustness. The majority
of the recent end-to-end models adopt the unsupervised-learning design [6, 16, 17, 18], due to the
complexity and the high-cost associated with collecting ground-truth data. However, supervised
models trained on labeled odometry data still have a better performance [19, 20].
To improve the performance, end-to-end VO models tend to have auxiliary outputs related to camera
motions, such as depth and optical flow. With depth prediction, models obtain supervision signals
by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can
be interpreted as matching the temporal observations in the 3D space. A similar effect of temporal
matching can be achieved by producing the optical flow, e.g., [16, 22, 18] jointly predict depth,
optical flow, and camera motion.
Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D
matching. Then, camera motion estimators can process the optical flow data rather than directly
working on raw images [20, 23]. If designed this way, components for estimating the camera motion
can even be trained separately on available optical flow data [19]. We follow these designs and use
the optical flow as an intermediate representation.
It is well known that monocular VO systems have scale ambiguity. Nevertheless, most of the super-
vised learning models did not handle this issue and directly use the difference between the model
prediction and the true camera motion as the supervision [20, 24, 25]. In [19], the scale is handled
by dividing the optical flow into sub-regions and imposing a consistency of the motion predictions
among these regions. In non-learning methods, scale ambiguity can be solved if a 3D map is avail-
able [26]. Ummenhofer et al. [20] introduce depth prediction to correct the scale drift. Tateno
et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection
technique from SLAM systems. Recently, Zhan et al. [29] use PnP techniques to explicitly solve
for the scale factor. The above methods introduce extra complexity to the VO system; however, the
scale ambiguity is not totally suppressed for monocular setups, especially in the evaluation stage.
Instead, some models choose to only produce up-to-scale predictions. Wang et al. [30] reduce the
scale ambiguity in the monocular depth estimation task by normalizing the depth prediction before
computing the loss function. Similarly, we will focus on predicting the translation direction rather
than recovering the full scale from monocular images, by defining a new up-to-scale loss function.
Learning-based models suffer from generalization issues when tested on images from a new en-
vironment or a new camera. Most of the VO models are trained and tested on the same dataset
[16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] only test their generalization ability on the
depth prediction, not on the camera pose estimation. Recent efforts, such as [33], use model adap-
tation to deal with new environments, however, additional training is needed on a per-environment
or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset
generalization, by incorporating the camera intrinsics directly into the model.
Figure 1: The two-stage network architecture. The model consists of a matching network, which
estimates optical flow from two consecutive RGB images, followed by a pose network predicting
camera motion from the optical flow.
3 Approach
3.1 Background
We focus on the monocular VO problem, which takes two consecutive undistorted images {I_t, I_{t+1}}
and estimates the relative camera motion δ_t^{t+1} = (T, R), where T ∈ R^3 is the 3D translation and
R ∈ so(3) denotes the 3D rotation. According to epipolar geometry theory [34], geometry-based
VO proceeds in two steps. First, visual features are extracted and matched from I_t and I_{t+1}.
Then, using the matching results, it computes the essential matrix, leading to the recovery of the
up-to-scale camera motion δ_t^{t+1}.
Following the same idea, our model consists of two sub-modules. One is the matching module
M_θ(I_t, I_{t+1}), estimating the dense matching result F_t^{t+1} (i.e. optical flow) from two consecutive RGB images.
The other is a pose module P_φ(F_t^{t+1}) that recovers the camera motion δ_t^{t+1} from the
matching result (Fig. 1). This modular design is also widely used in other learning-based methods,
especially in unsupervised VO [13, 19, 16, 22, 18].
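The modular design can be expressed as a thin PyTorch wrapper around the two sub-networks; the class below is a sketch under our own naming assumptions (the actual matching and pose networks are described in Section 4.1):

import torch
import torch.nn as nn

class TwoStageVO(nn.Module):
    # Matching network M_theta (optical flow) followed by a pose network P_phi
    # that regresses the up-to-scale motion from the flow (plus, later, the intrinsics layer).
    def __init__(self, matching_net: nn.Module, pose_net: nn.Module):
        super().__init__()
        self.matching_net = matching_net
        self.pose_net = pose_net

    def forward(self, img_t, img_t1, intrinsics_layer=None):
        flow = self.matching_net(img_t, img_t1)               # F_t^{t+1}
        if intrinsics_layer is not None:                      # optional K^c channels (Sec. 3.4)
            flow = torch.cat([flow, intrinsics_layer], dim=1)
        translation, rotation = self.pose_net(flow)           # up-to-scale T and R
        return flow, translation, rotation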
3.2 Training on large scale diverse data
The generalization capability has always been one of the most critical issues for learning-based
methods. Most of the previous supervised models are trained on the KITTI dataset, which contains
11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented
the training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV).
They reported that the performance is limited by the lack of training data and the more complex
dynamics of a flying robot. Surprisingly, most unsupervised methods also only train their models in
very uniform scenes (e.g., KITTI and Cityscape [37]). To our knowledge, no learning-based model
has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To
achieve this, we argue that the training data has to cover diverse scenes and motion patterns.
TartanAir [11] is a large scale dataset with highly diverse scenes and motion patterns, containing
more than 400,000 data frames. It provides multi-modal ground truth labels including depth, seg-
mentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and
sci-fi environments. The data is collected with a simulated pinhole camera, which moves with ran-
dom and rich 6DoF motion patterns in the 3D space.
We take advantage of the monocular image sequences {I_t}, the optical flow labels {F_t^{t+1}}, and the
ground truth camera motions {δ_t^{t+1}} in our task. Our objective is to jointly minimize the optical
flow loss L_f and the camera motion loss L_p. The end-to-end loss is defined as:

L = λ L_f + L_p = λ ‖ M_θ(I_t, I_{t+1}) − F_t^{t+1} ‖ + ‖ P_φ(F̂_t^{t+1}) − δ_t^{t+1} ‖,   (1)

where λ is a hyper-parameter balancing the two losses. We use the hat symbol ( ˆ ) to denote a variable estimated
by our model.
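A hedged PyTorch sketch of this end-to-end loss, assuming the flow tensors have shape (B, 2, H, W) and the motion tensors shape (B, 6) (our assumed layout, not specified in the paper):

import torch

def end_to_end_loss(flow_pred, flow_gt, motion_pred, motion_gt, lam=0.1):
    # L = lam * L_f + L_p, Eq. (1); lam is a balancing hyper-parameter (value here is a placeholder).
    flow_loss = torch.norm(flow_pred - flow_gt, dim=1).mean()        # per-pixel flow error, averaged
    motion_loss = torch.norm(motion_pred - motion_gt, dim=1).mean()  # later replaced by the up-to-scale loss
    return lam * flow_loss + motion_loss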
Since TartanAir is purely synthetic, the biggest question is: can a model learned from simulated
data generalize to real-world scenes? As discussed by Wang et al. [11], a large number of
studies show that when a model is trained purely in simulation but with broad diversity, it can be
easily transferred to the real world. This is also known as domain randomization [38, 39]. In our
experiments, we show that the diverse simulated data indeed enable the VO model to generalize to
real-world data.
3.3 Up-to-scale loss function
The motion scale is unobservable from a monocular image sequence. In geometry-based methods,
the scale is usually recovered from other sources of information ranging from known object size or
camera height to extra sensors such as IMU. However, in most existing learning-based VO studies,
the models generally neglect the scale problem and try to recover the motion with scale. This is
feasible if the model is trained and tested with the same camera and in the same type of scenario.
For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground and a
fixed orientation. A model can learn to remember the scale in this particular setup. Obviously, the
model will have huge problems when tested with a different camera configuration. Imagine if the
Figure 2: a) Illustration of the FoV and image resolution in TartanAir, EuRoC, and KITTI datasets.
b) Calculation of the intrinsics layer.
camera in KITTI moves a little upwards and becomes higher from the ground, the same amount of
camera motion would cause a smaller optical flow value on the ground, which is inconsistent with
the training data. Although the model could potentially learn to pick up other clues such as object
size, it is still not fully reliable across different scenes or environments.
Following the geometry-based methods, we only recover an up-to-scale camera motion from the
monocular sequences. Knowing that the scale ambiguity only affects the translation T , we design
a new loss function for T and keep the loss for rotation R unchanged. We propose two up-to-scale
loss functions for L_p: the cosine similarity loss L_p^{cos} and the normalized distance loss L_p^{norm}. L_p^{cos}
is defined by the cosine angle between the estimated T̂ and the label T:

L_p^{cos} = ( T̂ · T ) / max( ‖T̂‖ · ‖T‖, ε ) + ‖ R̂ − R ‖.   (2)

Similarly, for L_p^{norm}, we normalize the translation vector before calculating the distance between
the estimation and the label:

L_p^{norm} = ‖ T̂ / max(‖T̂‖, ε) − T / max(‖T‖, ε) ‖ + ‖ R̂ − R ‖,   (3)

where ε = 10^{-6} is used to avoid division by zero. From our preliminary empirical comparison,
the above two formulations have similar performance. In the following sections, we use Eq. 3
to replace L_p in Eq. 1. Later, we show by experiments that the proposed up-to-scale loss function is
crucial for the model's generalization ability.
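Eq. 3 translates directly into a few lines of PyTorch; the sketch below assumes batched translation and rotation vectors of shape (B, 3) and is our illustration rather than the released implementation:

import torch

def up_to_scale_loss(T_hat, T, R_hat, R, eps=1e-6):
    # Normalized-distance loss (Eq. 3): compare translation directions only,
    # keep the plain rotation loss unchanged.
    t_hat = T_hat / torch.clamp(T_hat.norm(dim=1, keepdim=True), min=eps)
    t_gt = T / torch.clamp(T.norm(dim=1, keepdim=True), min=eps)
    translation_loss = (t_hat - t_gt).norm(dim=1).mean()
    rotation_loss = (R_hat - R).norm(dim=1).mean()
    return translation_loss + rotation_loss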
3.4 Cross-camera generalization by encoding camera intrinsics
In epipolar geometry theory, the camera intrinsics is required when recovering the camera pose
from the essential matrix (assuming the images are undistorted). In fact, learning-based methods
are unlikely to generalize to data with different camera intrinsics. Imagine a simple case where the
camera switches to a lens with a larger focal length. Assuming the image resolution remains the
same, the same amount of camera motion will introduce bigger optical flow values, which we call
the intrinsics ambiguity.
A tempting solution for intrinsics ambiguity is warping the input images to match the camera in-
trinsics of the training data. However, this is not quite practical especially when the cameras differ
too much. As shown in Fig. 2-a, if a model is trained on TartanAir, the warped KITTI image only
covers a small part of TartanAir's field of view (FoV). After training, a model learns to exploit
cues from all possible positions in the FoV and the interrelationships among those cues. Some cues
no longer exist in the warped KITTI images, leading to drastic performance drops.
3.4.1 Intrinsics layer
We propose to train a model that takes both RGB images and camera intrinsics as input, thus the
model can directly handle images coming from various camera settings. Specifically, instead of
recovering the camera motion T_t^{t+1} only from the feature matching F_t^{t+1}, we design a new pose
network P_φ(F_t^{t+1}, K), which also depends on the camera intrinsic parameters K = {f_x, f_y, o_x, o_y},
where f_x and f_y are the focal lengths, and o_x and o_y denote the position of the principal point.
Figure 3: The data augmentation procedure of random cropping and resizing. In this way we gener-
ate a wide range of camera intrinsics (FoV 40◦ to 90◦).
As for the implementation, we concatenate an IL (intrinsics layer) K^c ∈ R^{2×H×W} (H and W
are the image height and width, respectively) to F_t^{t+1} before going into P_φ. To compose K^c, we first
generate two index matrices X_ind and Y_ind for the x and y axis in the 2D image frame (Fig. 2-b).
Then the two channels of K^c are calculated from the following formula:

K^c_x = (X_ind − o_x) / f_x,   K^c_y = (Y_ind − o_y) / f_y.   (4)

The concatenation of F_t^{t+1} and K^c augments the optical flow estimation with 2D position information.
Similar to the situation where geometry-based methods have to know the 2D coordinates of the
matched features, K^c provides the necessary position information. In this way, intrinsics ambiguity
is explicitly handled by coupling 2D positions and matching estimations (F_t^{t+1}).
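Composing the IL only takes a meshgrid; a minimal NumPy sketch (with an example call that assumes a 640 × 480 image for the TartanAir-style intrinsics quoted in Sec. 3.4.2) is:

import numpy as np

def intrinsics_layer(H, W, fx, fy, ox, oy):
    # Two-channel intrinsics layer K^c of Eq. (4): per-pixel normalized image coordinates.
    x_ind, y_ind = np.meshgrid(np.arange(W), np.arange(H))
    kx = (x_ind - ox) / fx
    ky = (y_ind - oy) / fy
    return np.stack([kx, ky], axis=0)        # shape (2, H, W), concatenated with the flow

# Example (image size 640 x 480 assumed by us):
# K_c = intrinsics_layer(480, 640, fx=320, fy=320, ox=320, oy=240)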
3.4.2 Data generation for various camera intrinsics
To make a model generalizable across different cameras, we need training data with various camera
intrinsics. TartanAir only has one set of camera intrinsics, where fx = fy = 320, ox = 320,
and oy = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input
images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next,
we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can
crop and resize the IL with the image, without recomputing the IL. To cover typical cameras with
FoV between 40° and 90°, we find that using random resizing factors up to 2.5 is sufficient during
RCR. Note the ground truth optical flow should also be scaled with respect to the resizing factor. We
use very aggressive cropping and shifting in our training, which means the optical center could be
way off the image center. Although the resulting intrinsic parameters will be uncommon in modern
cameras, we find the generalization is improved.
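The RCR augmentation can be sketched as follows (a simplified version under our assumptions: images and flow stored as H × W × C arrays, the IL as 2 × H × W, and OpenCV used for resizing):

import numpy as np
import cv2

def random_crop_resize(image, flow, intrin_layer, max_scale=2.5):
    # Crop at a random location and size, then resize back to the original resolution.
    H, W = image.shape[:2]
    scale = np.random.uniform(1.0, max_scale)
    ch, cw = int(H / scale), int(W / scale)                  # crop size
    y0 = np.random.randint(0, H - ch + 1)
    x0 = np.random.randint(0, W - cw + 1)
    def crop_resize(arr):
        return cv2.resize(arr[y0:y0 + ch, x0:x0 + cw], (W, H))
    image_out = crop_resize(image)
    intrin_out = crop_resize(intrin_layer.transpose(1, 2, 0)).transpose(2, 0, 1)
    flow_out = crop_resize(flow).astype(np.float32)
    flow_out[..., 0] *= W / float(cw)                        # flow values scale with the resize factor
    flow_out[..., 1] *= H / float(ch)
    return image_out, flow_out, intrin_out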
4 Experimental Results
4.1 Network structure and training detail
Network We utilize the pre-trained PWC-Net [40] as the matching network Mθ, and a modified
ResNet50 [41] as the pose network Pφ. We remove the batch normalization layers from the ResNet,
and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical
flow in size of H/4 × W/4, so Pφ is trained on 1/4 size, consuming very little GPU memory. The
overall inference time (including both Mθ and Pφ) is 40ms on an NVIDIA GTX 1080 GPU.
Training Our model is implemented by PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs.
There are two training stages. First, Pφ is trained separately using ground truth optical flow and
camera motions for 100,000 iterations with a batch size of 100. In the second stage, Pφ and Mθ are
connected and jointly optimized for 50,000 iterations with a batch size of 64. During both training
stages, the learning rate is set to 1e-4 with a decay rate of 0.2 at 1/2 and 7/8 of the total training
steps. The RCR is applied on the optical flow, RGB images, and the IL (Sec 3.4.2).
4.2 How the training data quantity affects the generalization ability
To show the effects of data diversity, we compare the generalization ability of the model trained
with different amounts of data. We use 20 environments from the TartanAir dataset, and set aside
3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in
Figure 4: Generalization ability with respect to different quantities of training data. Model Pφ is
trained on true optical flow. Blue: training loss, orange: testing loss on three unseen environments.
Testing loss drops constantly with increasing quantity of training data.
Figure 5: Comparison of the loss curve w/ and w/o up-to-scale loss function. a) The training and
testing loss w/o the up-to-scale loss. b) The translation and rotation losses of a). A big gap exists between
the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/
up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases.
more than 400,000 training frames and about 40,000 testing frames. As a comparison, KITTI and
EuRoC datasets provide 23,201 and 26,604 pose labeled frames, respectively. Besides, data in
KITTI and EuRoC are much more uniform in the sense of scene type and motion pattern. As shown
in Fig. 4, we set up three experiments, where we use 20,000 (comparable to KITTI and EuRoC),
100,000, and 400,000 frames of data for training the pose network Pφ. The experiments show that
the generalization ability, measured by the gap between training loss and testing loss on unseen
environments, improves constantly with increasing training data.
4.3 Up-to-scale loss function
Without the up-to-scale loss, we observe that there is a gap between the training and testing losses even
when training with a large amount of data (Fig. 5-a). When we plot the translation loss and rotation loss
separately (Fig. 5-b), it is clear that the translation error is the main contributor to the gap. After we
apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5-c,d).
During testing, we align the translation with the ground truth to recover the scale, in the same way as
described in [16, 6].
4.4 Camera intrinsics layer
The IL is critical to the generalization ability across datasets. Before we move to other datasets,
we first design an experiment to investigate the properties of the IL using the pose network Pφ. As
shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training
and testing loss are low. But these two models would output nonsense values on data with RCR
augmentation. One interesting finding is that adding the IL doesn't help in the case of only one type
of intrinsics. This indicates that the network has learned a very different algorithm compared with
the geometry-based methods, where the intrinsics is necessary to recover the motion. The last two
columns show that the IL is critical when the input data is augmented by RCR (i.e. various intrin-
sics). Another interesting thing is that training a model with RCR and IL leads to a lower testing
loss (last column) than only training on one type of intrinsics (first two columns). This indicates that
by generating data with various intrinsics, we learned a more robust model for the VO task.
Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is
critical with the presence of RCR. The model trained with RCR reaches lower testing loss than
those without RCR.
Training configuration w/o RCR, w/o IL w/o RCR, w/ IL w/ RCR, w/o IL w/ RCR, w/ IL
Training loss 0.0325 0.0311 0.1534 0.0499
Test-loss on data w/ RCR - - 0.1999 0.0723
Test-loss on data w/o RCR 0.0744 0.0714 0.1630 0.0549
Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a super-
vised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates
information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08
and uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44]
and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB-
SLAM [3] are geometry-based monocular VO. ORB-SLAM uses the bundle adjustment on multiple
frames to optimize the trajectory. Our method works in a pure VO manner (only takes two frames).
It has never seen any KITTI data before the testing, and yet achieves competitive results.
Seq                 06              07              09              10              Ave
                    trel   rrel     trel   rrel     trel   rrel     trel   rrel     trel   rrel
DeepVO [43]*†       5.42   5.82     3.91   4.6      -      -        8.11   8.83     5.81   6.41
Wang et al. [9]*†   -      -        -      -        8.04   1.51     6.23   0.97     7.14   1.24
UnDeepVO [44]*      6.20   1.98     3.15   2.48     -      -        10.63  4.65     6.66   3.04
GeoNet [16]*        9.28   4.34     8.27   5.93     26.93  9.54     20.73  9.04     16.3   7.21
VISO2-M [45]        7.3    6.14     23.61  19.11    4.04   1.43     25.2   3.8      15.04  7.62
ORB-SLAM [3]†       18.68  0.26     10.96  0.37     15.3   0.26     3.71   0.3      12.16  0.3
TartanVO (ours)     4.72   2.95     4.32   3.41     6.0    3.11     6.89   2.73     5.48   3.05
trel: average translational RMSE drift (%) on path lengths of 100–800 m.
rrel: average rotational RMSE drift (◦/100 m) on path lengths of 100–800 m.
*: the starred methods are trained or finetuned on the KITTI dataset.
†: these methods use multiple frames to optimize the trajectory after the VO process.
4.5 Generalize to real-world data without finetuning
KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We
compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al.
[9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods
(VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the
KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a
synthetic dataset. Besides, many algorithms use multiple frames to further optimize the trajectory.
In contrast, our model only takes two consecutive images. As listed in Table 2, TartanVO achieves
comparable performance, even though neither finetuning nor backend optimization is performed.
EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor en-
vironment. There are 3 levels of difficulties with respect to the motion pattern and the light con-
dition. Few learning-based methods have ever been tested on EuRoC due to the lack of training
data. The changing light conditions and aggressive rotations also pose real challenges to geometry-
based methods. In Table 3, we compare with geometry-based methods including SVO [46],
ORB-SLAM [3], DSO [5] and LSD-SLAM [2]. Note that all these geometry-based methods per-
form some types of backend optimization on selected keyframes along the trajectory. In contrast, our
model only estimates the frame-by-frame camera motion, which could be considered as the frontend
module in these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE)
of 6 medium and difficult trajectories. Our method shows the best performance on the two most dif-
ficult trajectories VR1-03 and VR2-03, where the MAV has very aggressive motion. A visualization
of the trajectories is shown in Fig. 6.
Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories2 that
cover many extremely difficult cases, including changing illumination, dynamic objects, fog and
rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with the
ORB-SLAM using ATE. Our model shows a more robust performance in these challenging cases.
2 https://github.com/castacks/tartanair_tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge
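For reference, the ATE values reported in the following tables are obtained, as is standard practice, by aligning the estimated trajectory to the ground truth and then computing the RMSE of the position errors; a compact sketch, assuming time-synchronized (N, 3) position arrays and not tied to any particular evaluation toolbox, is:

import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: (N, 3) positions with matching timestamps."""
    mu_e, mu_g = est_xyz.mean(0), gt_xyz.mean(0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    # Closed-form (Umeyama) similarity alignment: rotation R, scale s, translation t.
    U, D, Vt = np.linalg.svd(G.T @ E / len(E))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    aligned = (s * (R @ est_xyz.T)).T + t
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(1).mean()))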
Table 3: Comparison of ATE on the EuRoC dataset. We are among the very few learning-based methods
that can be tested on this dataset. Like the geometry-based methods, our model has never seen
the EuRoC data before testing. We show the best performance on the two difficult sequences VR1-03
and VR2-03. Note that our method doesn't contain any backend optimization module.
Seq. MH-04 MH-05 VR1-02 VR1-03 VR2-02 VR2-03
SVO [46] 1.36 0.51 0.47 x 0.47 x
Geometry-based * ORB-SLAM [3] 0.20 0.19 x x 0.07 x
DSO [5] 0.25 0.11 0.11 0.93 0.13 1.16
LSD-SLAM [2] 2.13 0.85 1.11 x x x
Learning-based † TartanVO (ours) 0.74 0.68 0.45 0.64 0.67 1.04
* These results are from [46]. † Other learning-based methods [36] did not report numerical results.
Figure 6: The visualization of 6 EuRoC trajectories in Table 3. Black: ground truth trajectory,
orange: estimated trajectory.
Table 4: Comparison of ATE on TartanAir dataset. These trajectories are not contained in the
training set. We repeatedly run ORB-SLAM 5 times and report the best result.
Seq MH000 MH001 MH002 MH003 MH004 MH005 MH006 MH007
ORB-SLAM [3] 1.3 0.04 2.37 2.45 x x 21.47 2.73
TartanVO (ours) 4.88 0.26 2 0.94 1.07 3.19 1 2.04
Figure 7: TartanVO outputs competitive results on D435i IR data compared to the T265 (equipped with
a fish-eye stereo camera and an IMU). a) The hardware setup. b) Trail 1: smooth and slow motion. c)
Trail 2: smooth and medium speed. d) Trail 3: aggressive and fast motion. See videos for details.
RealSense Data Comparison We test TartanVO using data collected by a customized sensor
setup. As shown in Fig. 7 a), a RealSense D435i is fixed on top of a RealSense T265 tracking
camera. We use the left near-infrared (IR) image of the D435i in our model and compare it with the
trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar
paths with increasing motion difficulty. From Fig. 7 b) to d), we observe that although TartanVO
has never seen real-world images or IR data during training, it still generalizes well and predicts
odometry closely matching the output of the T265, which is a dedicated device estimating the camera
motion with a pair of fish-eye cameras and an IMU.
5 Conclusions
We presented TartanVO, a generalizable learning-based visual odometry. By training our model
with a large amount of data, we show the effectiveness of diverse data for the model's generalization
ability. A smaller gap between training and testing losses can be expected with the newly defined
up-to-scale loss, further increasing the generalization capability. We show by extensive experiments
that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO
can generalize to unseen datasets and achieve performance even better than dedicated learning models
trained directly on those datasets. Our work opens up many exciting future research directions,
such as generalizable learning-based VIO, stereo VO, and multi-frame VO.
Acknowledgments
This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai
Yu from Carnegie Mellon University for preparing simulation results and experimental setups.
References
[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and
mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.
[2] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular slam. In ECCV, 2014.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam
system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
[4] C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi-direct monocular visual odometry. In ICRA,
pages 15–22. IEEE, 2014.
[5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and
machine intelligence, 40(3):611–625, 2017.
[6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from
video. In CVPR, 2017.
[7] S. Vijayanarasimhan, S. Ricco, C. Schmidy, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of
structure and motion from video. In arXiv:1704.07804, 2017.
[8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odom-
etry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542,
2018.
[9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer. Improving learning-based ego-
motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 970–976. IEEE, 2019.
[10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and
future directions. Robotics and Autonomous Systems, 98:67–88, 2017.
[11] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A
dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), 2020.
[12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry.
In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 47–52. IEEE,
2008.
[13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation
(ICRA), 2012 IEEE International Conference on, pages 3482–3489. IEEE, 2012.
[14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual
odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014.
[15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned
depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 6243–6252, 2017.
[16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2,
2018.
[17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular
depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018.
[18] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collabora-
tion: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2019.
[19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with cnns for
frame-to-frame ego-motion estimation. RAL, 1(1):18–25, 2016.
[20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and
motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3vo: Deep depth, deep pose and deep uncertainty
for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[22] Y. Zou, Z. Luo, and J.-B. Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task
consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[23] H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In Proceedings of the
European Conference on Computer Vision (ECCV), September 2018.
[24] C. Tang and P. Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.
[25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. Ls-net: Learning to solve
nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018.
[26] H. Li, W. Chen, j. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of ab-
solute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA), May 2020.
[27] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned
depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), July 2017.
[28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detec-
tion and visual odometry towards monocular deep slam. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV), October 2019.
[29] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid. Visual odometry revisited: What should be learnt?
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.
[30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct
methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
June 2018.
[31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. Unos: Unified unsupervised optical-flow and
stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), June 2019.
[32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monoc-
ular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2018.
[33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online
adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE transactions on pattern
analysis and machine intelligence, 26(6):756–770, 2004.
[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International
Journal of Robotics Research, 32(11):1231–1237, 2013.
[36] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The
euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163,
2016.
[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and
B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 3213–3223, 2016.
[38] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transfer-
ring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017.
[39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon,
and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain ran-
domization. In CVPR Workshops, pages 969–977, 2018.
[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and
cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
8934–8943, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and
A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
[43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep
recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International
Conference on, pages 2043–2050. IEEE, 2017.
[44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep
learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291.
IEEE, 2018.
[45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for au-
tonomous driving. IEEE Transactions on Pattern Analysis & Machine Intelligence, pages 1–1, 2015.
[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. Svo: Semidirect visual odometry
for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.
A Additional experimental details
In this section, we provide additional details for the experiments, including the network structure,
training parameters, qualitative results, and quantitative results.
A.1 Network Structure
Our network consists of two sub-modules, namely, the matching network Mθ and the pose network
Pφ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two
consecutive images of size 640 × 448 (PWC-Net only accepts image sizes that are multiples of 64). The
output optical flow, which is 160 x 112 in size, is fed into the pose network. The structure of the
pose network is detailed in Table 5. The overall inference time (including both Mθ and Pφ) is 40ms
on an NVIDIA GTX 1080 GPU.
Table 5: Parameters of the proposed pose network. Constructions of residual blocks are designated
in brackets multiplied by the number of stacked blocks. Downsampling is performed by Conv1, and
at the beginning of each residual block. After the residual blocks, we reshape the feature map into a
one-dimensional vector, which goes through three fully connected layers in the translation head and
rotation head, respectively.
Name Layer setting Output dimension
Input            1/4 H × 1/4 W × 2                                          112 × 160
Conv1            3 × 3, 32                   1/8 H × 1/8 W × 32             56 × 80
Conv2            3 × 3, 32                   1/8 H × 1/8 W × 32             56 × 80
Conv3            3 × 3, 32                   1/8 H × 1/8 W × 32             56 × 80
ResBlock
Block1           [3 × 3, 64; 3 × 3, 64] × 3        1/16 H × 1/16 W × 64     28 × 40
Block2           [3 × 3, 128; 3 × 3, 128] × 4      1/32 H × 1/32 W × 128    14 × 20
Block3           [3 × 3, 128; 3 × 3, 128] × 6      1/64 H × 1/64 W × 128    7 × 10
Block4           [3 × 3, 256; 3 × 3, 256] × 7      1/128 H × 1/128 W × 256  4 × 5
Block5           [3 × 3, 256; 3 × 3, 256] × 3      1/256 H × 1/256 W × 256  2 × 3
FC trans                                     FC rot
Trans head fc1   (256 × 6) × 128             Rot head fc1   (256 × 6) × 128
Trans head fc2   128 × 32                    Rot head fc2   128 × 32
Trans head fc3   32 × 3                      Rot head fc3   32 × 3
Output           3                           Output         3
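As a reading aid for Table 5, the following PyTorch-style sketch mirrors the overall structure (a strided convolutional trunk, five residual stages, and separate three-layer translation and rotation heads); the strides, normalization, and activation choices are assumptions, and this is not the released TartanVO code:

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, cin, cout, stride):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(cout), nn.BatchNorm2d(cout)
        self.relu = nn.ReLU(inplace=True)
        self.down = (nn.Conv2d(cin, cout, 1, stride, bias=False)
                     if (stride != 1 or cin != cout) else nn.Identity())

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + self.down(x))

def _stage(cin, cout, n):
    # First block downsamples by 2, the rest keep the resolution.
    return nn.Sequential(BasicBlock(cin, cout, 2),
                         *[BasicBlock(cout, cout, 1) for _ in range(n - 1)])

class PoseNet(nn.Module):
    def __init__(self, in_ch=2):                      # optical flow (u, v); use 4 if IL channels are appended
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU(inplace=True),
            _stage(32, 64, 3), _stage(64, 128, 4), _stage(128, 128, 6),
            _stage(128, 256, 7), _stage(256, 256, 3),
            nn.AdaptiveAvgPool2d((2, 3)))             # -> 256 x 2 x 3 = 1536 features
        def head():
            return nn.Sequential(nn.Linear(256 * 6, 128), nn.ReLU(inplace=True),
                                 nn.Linear(128, 32), nn.ReLU(inplace=True),
                                 nn.Linear(32, 3))
        self.trans_head, self.rot_head = head(), head()

    def forward(self, x):                             # x: (B, in_ch, 112, 160)
        f = self.trunk(x).flatten(1)
        return self.trans_head(f), self.rot_head(f)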
Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric.
These trajectories are not contained in the training set. We repeatedly run ORB-SLAM 5 times and report the best result.
Seq SH000 SH001 SH002 SH003 SH004 SH005 SH006 SH007
ORB-SLAM x 3.5 x x x x x x
TartanVO (ours) 2.52 1.61 3.65 0.29 3.36 4.74 3.72 3.06
A.2 Testing Results on TartanAir
TartanAir provides 16 challenging testing trajectories. We reported 8 trajectories in the experiment
section. The remaining 8 trajectories are shown in Table 6. We compare TartanVO against the ORB-SLAM
monocular algorithm. Due to the randomness in ORB-SLAM, we run it for 5
trials and report the best result. We consider a trial a failure if ORB-SLAM tracks less than 80%
of the trajectory. A visualization of all the 16 trajectories (including the 8 trajectories shown in the
experiment section) is shown in Figure 8.
Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line
represents the ground truth. The estimated trajectories by TartanVO and the ORB-SLAM monocular
algorithm are shown as orange and blue lines, respectively. The ORB-SLAM algorithm frequently
loses tracking in these challenging cases; it fails in 9 of the 16 testing trajectories. Note that we run
the full-fledged ORB-SLAM with local bundle adjustment, global bundle adjustment, and loop closure
components. In contrast, although TartanVO only takes in two images, it is much more robust than
ORB-SLAM.

View File

@ -0,0 +1,518 @@
2022 IEEE International Conference on Robotics and Automation (ICRA)
May 23-27, 2022. Philadelphia, PA, USA
AirDOS: Dynamic SLAM benefits from Articulated Objects
Yuheng Qiu1, Chen Wang1, Wenshan Wang1, Mina Henein2, and Sebastian Scherer1
Abstract— Dynamic Object-aware SLAM (DOS) exploits object-level information to enable robust motion estimation in dynamic environments. Existing methods mainly focus on identifying and excluding dynamic objects from the optimization. In this paper, we show that feature-based visual SLAM systems can also benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) The 3D structure of each rigid part of an articulated object remains consistent over time; (2) The points on the same rigid part follow the same motion. In particular, we present AirDOS, a dynamic object-aware system that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the camera pose, object motion, and the object 3D structure, we can rectify the camera pose estimation, preventing tracking loss, and generate 4D spatio-temporal maps for both dynamic objects and static scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic object-aware SLAM system demonstrating that camera pose estimation can be improved by incorporating dynamic articulated objects.

Fig. 1. (a) Example of a highly dynamic environment (Challenge of Shibuya Tokyo) cluttered with humans, which represents a challenge for visual SLAM; existing dynamic SLAM algorithms often fail in this challenging scenario. (b) Example of the TartanAir Shibuya Dataset. (c) Example of the estimated full map with dynamic objects and static background (KITTI tracking dataset, training 19).

I. INTRODUCTION

Simultaneous localization and mapping (SLAM) is a fundamental research problem in many robotic applications. Despite its success in static environments, the performance degradation and lack of robustness in the dynamic world has become a major hurdle for its practical applications [1], [2]. To address the challenges of dynamic environments, most SLAM algorithms adopt an elimination strategy that treats moving objects as outliers and estimates the camera pose only based on the measurements of static landmarks [3], [4]. This strategy can handle environments with a small number of dynamics, but cannot address challenging cases where dynamic objects cover a large field of view as in Fig. 1(a).

Some efforts have been made to include dynamic objects in the SLAM process. Very few methods try to estimate the pose of simple rigid objects [5], [6] or estimate their motion model [7], [8]. For example, CubeSLAM [6] introduces a simple 3D cuboid to model rigid objects. Dynamic SLAM [9] estimates 3D motions of dynamic objects. However, these methods can only cover special rigid objects, e.g., cubes [6] and quadrics [5], and do not show that camera pose estimation can be improved by the introduction of dynamic objects [7]-[9]. This introduces our main question:

Can we make use of moving objects in SLAM to improve camera pose estimation rather than filtering them out?

In this paper, we extend the simple rigid objects to general articulated objects, defined as objects composed of one or more rigid parts (links) connected by joints allowing rotational motion [10], e.g., vehicles and humans in Fig. 2, and utilize the properties of articulated objects to improve the camera pose estimation. Namely, we jointly optimize (1) the 3D structural information and (2) the motion of articulated objects. To this end, we introduce (1) a rigidity constraint, which assumes that the distance between any two points located on the same rigid part remains constant over time, and (2) a motion constraint, which assumes that feature points on the same rigid parts follow the same 3D motion. This allows us to build a 4D spatio-temporal map including both dynamic and static structures.

In summary, the main contributions of this paper are:
• A new pipeline, named AirDOS, is introduced for stereo SLAM to jointly optimize the camera poses, trajectories of dynamic objects, and the map of the environment.
• We introduce simple yet efficient rigidity and motion constraints for general dynamic articulated objects.
• We introduce a new benchmark, TartanAir Shibuya, on which we demonstrate that, for the first time, dynamic articulated objects can benefit the camera pose estimation in visual SLAM.
*This work was supported by the Sony award #A023367.
Source Code: https://github.com/haleqiu/AirDOS.
1Yuheng Qiu, Chen Wang, Wenshan Wang, and Sebastian
Scherer are with the Robotics Institute, Carnegie Mellon University,
Pittsburgh, PA 15213, USA {yuhengq, wenshanw, basti}
@andrew.cmu.edu; chenwang@dr.com
2Mina Henein is with the System, Theory and Robotics Lab, Australian
National University. mina.henein@anu.edu.au
Fig. 2. An example of the articulated dynamic object point-segment model. In urban environments, we can model rigid objects like vehicles and semi-rigid objects like pedestrians as articulated objects. p_i^k and p_j^k are the i-th and j-th dynamic features on the moving object at time k; p_i^{k+1} and p_j^{k+1} are the dynamic features after the motion ^l T^k at time k+1. In this model, the segment s_ij is invariant over time and motion.

II. RELATED WORK

Recent works on dynamic SLAM roughly fall into three categories: elimination strategy, motion constraint, and rigidity constraint, which will be reviewed, respectively.

A. Elimination Strategy

Algorithms in this category filter out the dynamic objects and only utilize the static structures of the environment for pose estimation. Therefore, most of the algorithms in this category apply elimination strategies like RANSAC [11] and robust loss functions [12] to eliminate the effects of dynamic objects. For example, ORB-SLAM [3] applies RANSAC to select and remove points that cannot converge to a stable pose estimation. DynaSLAM [13] detects the moving objects by multi-view geometry and deep learning modules. This allows inpainting the frame background that has been occluded by dynamic objects. Bârsan et al. [14] use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving objects. Dai et al. [15] utilize the distance correlation of map points to segment dynamic objects from the static background. To reduce the computational cost, Ji et al. [16] combine semantic segmentation and geometry modules, which cluster the depth image into a few regions and identify dynamic regions via reprojection errors.

B. Motion Constraint

Most algorithms in this category estimate the motion of dynamic objects but do not show that the motion constraint can contribute to the camera pose estimation, and would thus suffer in highly dynamic environments. For example, Hahnel et al. [17] track the dynamic objects in the SLAM system. Wang et al. [18] introduce a simultaneous localization, mapping, and moving object tracking (SLAMMOT) algorithm, which tracks moving objects with a learned motion model based on a dynamic Bayesian network. Reddy et al. [19] use optical flow to segment moving objects, and apply a smooth trajectory constraint to enforce the smoothness of objects' motion. Judd et al. [8] propose multi-motion visual odometry (MVO), which simultaneously estimates the camera pose and the object motion. The works by Henein et al. [7], [20], [21], of which the most recent is VDO-SLAM [20], generate a map of dynamic and static structure and estimate velocities of rigid moving objects using motion constraints. Rosinol et al. [22] propose 3D dynamic scene graphs to detect and track dense human meshes in dynamic scenes. This method constrains the human's maximum walking speed for a consistency check.

C. Rigidity Constraint

The rigidity constraint assumes that pair-wise distances of points on the same rigid body remain the same over time. It was applied to segment moving objects in dynamic environments dating back to the 1980s. Zhang et al. [23] propose to use the rigidity constraint to match moving rigid bodies. Thompson et al. [24] use a similar idea of the rigidity constraint and propose a rigidity geometry test for moving rigid object matching. Previous research utilized the rigidity assumption to segment moving rigid objects, while in this paper, we use the rigidity constraint to recover objects' structure.

To model rigid objects, SLAM++ [25] introduced pre-defined CAD models into the object matching and pose optimization. QuadricSLAM [5] utilizes dual quadrics as a 3D object representation, to represent the orientation and scale of object landmarks. Yang and Scherer [6] propose a monocular object SLAM system named CubeSLAM for 3D cuboid object detection and multi-view object SLAM. As mentioned earlier, the above methods can only model simple rigid objects, e.g., cubes, while we target more general objects, i.e., articulated objects, which can cover common dynamic objects such as vehicles and humans.

III. METHODOLOGY

A. Background and Notation

Visual SLAM in static environments is often formulated as a factor graph optimization [26]. The objective (1) is to find the robot states x_k ∈ X, k ∈ [0, n_x] and the static landmarks p_i ∈ P_s, i ∈ [0, n_{p_s}] that best fit the observations of the landmarks z_i^k ∈ Z, where n_x denotes the total number of robot states and n_{p_s} denotes the number of static landmarks. This is often based on the minimization of the reprojection error e_{i,k} = h(x_k, p_i) - z_i^k:

X, P = argmin_{X, P_s} Σ_{i,k} e_{i,k}^T Ω_{i,k}^{-1} e_{i,k},    (1)

where h(x_k, p_i) denotes the 3D point observation function and Ω_{i,k} denotes the observation covariance matrix.
In dynamic SLAM, the reprojection error e_p of dynamic feature points is also considered:

e_p = h(x_k, ^l p_i^k) - ^l z_i^k,    (2)

where ^l p_i^k ∈ P_d are the dynamic points on object l and ^l z_i^k are the corresponding observations of the dynamic points.

B. Rigidity Constraint

Let s_ij be the segment length between two feature points ^l p_i^k and ^l p_j^k; the rigidity constraint is that s_ij is invariant over time, i.e., s_ij^k = s_ij^{k+1}, if ^l p_i^k and ^l p_j^k are on the same rigid part of an articulated object, as shown in Fig. 2. Inspired by this, we model the dynamic articulated object using a rigidity constraint, and thus we can define the rigidity error e_r as

e_r = || ^l p_i^k - ^l p_j^k || - s_ij.    (3)

Fig. 3(a) shows the factor graph of the rigidity constraint, where the length of the segment s_ij is invariant after the motion. The benefits of involving the rigidity error (3) are two-fold. First, it offers a temporal geometric constraint for dynamic points, which is able to correct the scale and 3D structure of dynamic objects. Second, it provides a geometric check, which eliminates incorrectly matched points.

We model humans as a special articulated object, shown in Fig. 3(b), where each human can be described by 14 key points, including the nose, shoulders, elbows, hands, waists, knees, feet, etc. In the experiments, we detect the human key points using the off-the-shelf algorithm Alpha-Pose [27].

Fig. 3. (a) Factor graph of the rigidity constraint. Black nodes represent the camera pose, blue nodes the dynamic points, and red nodes indicate the rigid segment length. Cyan and red rectangles represent the measurements of points and the rigidity, respectively. (b) A human can be modeled with points and segments based on the rigidity of the body parts. (c) Factor graph of the motion constraint. The orange node is the estimated motion and the green rectangles denote the motion constraints.

C. Motion Constraint

We adopt the motion constraint from [7], which does not need a prior geometric model. For every feature point on the same rigid part of an articulated object l, we have

^l p̄_i^{k+1} = ^l T ^l p̄_i^k,    (4)

where ^l T ∈ SE(3) is a motion transform associated with the object l and the bar indicates homogeneous coordinates. Therefore, we can define the loss function for the motion constraint as:

e_m = || ^l p̄_i^{k+1} - ^l T ^l p̄_i^k ||.    (5)

The motion constraint simultaneously estimates the object's motion ^l T and enforces each point ^l p_i^k to follow the same motion pattern [7]. This motion model ^l T assumes that the object is rigid; thus, for articulated objects, we apply the motion constraint on each rigid part of the articulated object. In Fig. 3(c) we show the factor graph of the motion constraint.

In highly dynamic environments, even if we filter out the moving objects, the tracking of static features is easily interrupted by the moving objects. By enforcing the motion constraints, dynamic objects will be able to contribute to the motion estimation of the camera pose. Therefore, when the static features are not reliable enough, moving objects can correct the camera pose estimation, preventing tracking loss.

D. Bundle Adjustment

The bundle adjustment (BA) jointly optimizes the static points p_i, dynamic points ^l p_i^k, segments s_ij, camera poses x_k, and dynamic object motions ^l T. This can be formulated as the factor graph optimization:

X, P, S, T = argmin_{X, P, S, T} Σ ( e_r^T Ω_{i,j}^{-1} e_r + e_m^T Ω_{i,l}^{-1} e_m + e_p^T Ω_{i,k}^{-1} e_p ),    (6)

where P is the union of P_s and P_d. This problem can be solved using the Levenberg-Marquardt algorithm.

IV. SYSTEM OVERVIEW

We propose the framework AirDOS, shown in Fig. 4, for dynamic stereo visual SLAM, which consists of three modules: pre-processing, tracking, and back-end bundle adjustment.

In the pre-processing and tracking modules, we first extract ORB features [28] and perform an instance-level segmentation [29] to identify potential moving objects. We then estimate the initial ego-motion by tracking the static features. For articulated objects like humans, we perform Alpha-Pose [27] to extract the human key points and calculate their 3D positions by triangulating the corresponding key points from stereo images. We then track the moving humans using the optical flow generated by PWC-Net [30]. The tracking module provides a reliable initialization for the camera pose and also the object poses of dynamic objects.

In the back-end optimization, we construct a global map consisting of camera poses, static points, dynamic points, and the motion of objects. We perform local bundle adjustment with dynamic objects in the co-visibility graph [31] built from the co-visible landmarks for the sake of efficiency. Similar to the strategy of RANSAC, we eliminate the factors and edges which contribute a large error based on the rigidity constraint (3) and motion constraint (5). This process helps to identify the mismatched or falsely estimated human poses. Visual SLAM algorithms usually only perform bundle adjustment on selected key-frames due to the repeated static feature observations. However, in highly dynamic environments, like the ones presented in this paper, this might easily result in loss of dynamic object tracking; therefore we perform bundle adjustment on every frame to capture the full trajectory.
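To make the residual definitions in (2), (3), and (5) concrete, here is a minimal NumPy sketch of how the reprojection, rigidity, and motion errors could be evaluated for a single point or point pair; the actual back-end builds these as factors in a C++ g2o graph, so the function names and plain-array representation here are illustrative assumptions only:

import numpy as np

def reprojection_residual(h, x_k, p_ik, z_ik):
    """e_p: difference between the predicted and measured observation of a
    dynamic point, with h(.) the camera observation function."""
    return h(x_k, p_ik) - z_ik

def rigidity_residual(p_ik, p_jk, s_ij):
    """e_r: the distance between two points on the same rigid part should
    stay equal to the (estimated) segment length s_ij."""
    return np.linalg.norm(p_ik - p_jk) - s_ij

def motion_residual(p_ik_next, T_l, p_ik):
    """e_m: a point on rigid part l at time k, moved by the SE(3) motion T_l
    (4x4 homogeneous matrix), should land on its position at time k+1."""
    p_h = np.append(p_ik, 1.0)                 # homogeneous coordinates
    return p_ik_next - (T_l @ p_h)[:3]

In the bundle adjustment of (6), these residuals are stacked, weighted by their information matrices, and minimized jointly with the camera poses.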
Fig. 4. The framework of AirDOS, which is composed of three modules, i.e., pre-processing, tracking, and back-end optimization.
TABLE I
PERFORMANCE ON KITTI DATASETS BASED ON ATE (m).

            W/ Mask                W/O Mask
Sequence    AirDOS    ORB-SLAM     AirDOS    ORB-SLAM
Test 18     0.933     0.934        0.937     0.948
Test 28     2.033     2.027        2.031     2.021
Train 13    1.547     1.618        1.551     1.636
Train 14    0.176     0.172        0.174     0.169
Train 15    0.240     0.234        0.240     0.234
Train 19    2.633     2.760        2.642     2.760
Fig. 5. Qualitative analysis of the KITTI tracking dataset, training 19. Applying the rigidity constraint and motion constraint improves the estimation of the objects' structure.

V. EXPERIMENTS

A. Metric, Baseline, and Implementation

We use the Absolute Translation Error (ATE) to evaluate our algorithm. Our method is compared against the state-of-the-art methods: ORB-SLAM [3] (1) with and (2) without the masking of potential dynamic objects, and the RGB-D dynamic SLAM algorithm [20]. Similar to the setup described in Section IV, we modified ORB-SLAM to perform BA on every frame with the observations from dynamic features, so as to capture the full trajectory of the moving objects. In the experiments, we applied the same parameters to AirDOS and ORB-SLAM, i.e., the number of feature points extracted per frame, the threshold for RANSAC, and the covariance of the reprojection error.

B. Performance on KITTI Tracking Dataset

The KITTI Tracking dataset [32] contains 50 sequences (29 for testing, 21 for training) with multiple moving objects. We select 6 sequences that contain moving pedestrians. For evaluation, we generate the ground truth using IMU and GPS. As shown in Table I, the ATEs of both our method and ORB-SLAM are small in all sequences, which means that both methods perform well in these sequences. The main reason is that the moving objects are relatively far and small, and there are plentiful static features in these sequences. Moreover, most sequences have a simple translational movement, which makes these cases very simple.

Although the camera trajectories are similar, our algorithm recovers a better human model, as shown in Fig. 5.
TABLE II
EXPERIMENTS ON TARTAN-AIR DATASET WITH AND WITHOUT MASK

                              W/ Mask                                  W/O Mask
Datasets               Seq.   AirDOS    ORB-SLAM   VDO-SLAM [20]       AirDOS    ORB-SLAM
Standing Human         I      0.0606    0.0788     0.0994              0.0469    0.1186
                       II     0.0193    0.0060     0.6129              -         -
Road Crossing (Easy)   III    0.0951    0.0657     0.3813              0.0278    0.0782
                       IV     0.0331    0.0196     0.3879              0.1106    0.0927
                       V      0.0206    0.0148     0.2175              0.0149    0.0162
Road Crossing (Hard)   VI     0.2230    1.0984     0.2400              3.6700    4.3907
                       VII    0.5625    0.8476     0.6628              1.1572    1.4632
Overall                       0.1449    0.3044     0.3717              0.8379    1.0226

Results show Absolute Trajectory Error (ATE) in meters (m). "-" means that SLAM failed in this sequence.
The ORB-SLAM generates noisy human poses when the human is far away from the camera. That's because the rigidity constraint helps to recover the structure of the moving articulated objects. Also, the motion constraint can improve the accuracy of the dynamic objects' trajectories. Given the observations from the entire trajectory, our algorithm recovers the human pose and eliminates the mismatched dynamic feature points.

Fig. 6. (a) Example of the Tartan-Air dataset, where almost everyone is standing (Standing Human). (b) Example of moving humans in a road crossing (Road Crossing).

Fig. 7. Qualitative analysis of the TartanAir sequence IV. The moving objects tracked by ORB-SLAM are noisy, while our proposed method generates a smooth trajectory. We show that dynamic objects and the camera pose can benefit each other in visual SLAM.

C. Performance on TartanAir Shibuya Dataset

We notice that the moving objects in the KITTI dataset only cover a small field of view. To address the challenges of the highly dynamic environment, we build the TartanAir Shibuya dataset, as shown in Fig. 6, and demonstrate that our method outperforms the existing dynamic SLAM algorithms in this benchmark. Our previous work TartanAir [33] is a very challenging visual SLAM dataset consisting of binocular RGB-D sequences together with additional per-frame information such as camera poses, optical flow, and semantic annotations. In this paper, we use the same pipeline to generate TartanAir Shibuya, which simulates the world's busiest road intersection at Shibuya, Tokyo, shown in Fig. 1. It covers much more challenging viewpoints and diverse motion patterns for articulated objects than TartanAir.

We separate the TartanAir Shibuya dataset into two groups, Standing Humans in Fig. 6(a) and Road Crossing in Fig. 6(b), with easy and difficult categories. Each sequence contains 100 frames and more than 30 tracked moving humans. In the Standing Human sequences, most of the humans stand still, while a few of them move around the space. In Road Crossing, there are multiple moving humans coming from different directions. For the difficult sequences, dynamic objects often enter the scene abruptly, in which case the visual odometry of traditional methods fails easily.

1) Evaluation: To test the robustness of our system when the visual odometry is interrupted by dynamic objects, or in cases where the segmentation might fail due to indirect occlusions such as illumination changes, we evaluate the performance in two settings: with and without masking the dynamic features during ego-motion estimation.

As shown in Table II, with human masks, our algorithm obtains 39.5% and 15.2% improvements compared to ORB-SLAM [3] and VDO-SLAM [20] in the overall performance. In Sequences II, IV and V, both ORB-SLAM and our algorithm show a good performance, where all ATEs are lower than 0.04. We notice that the performance of VDO-SLAM is not as good as ORB-SLAM. This may be because VDO-SLAM relies heavily on the optical flow for feature matching; it is likely to confuse background features with dynamic features.

Our algorithm also outperforms ORB-SLAM without masking the potential moving objects. As shown in Sequences I, III, V, and VI of Table II, our method obtains a higher accuracy than ORB-SLAM by 0.0717, 0.050, 0.721 and 0.306. Overall, we achieve an improvement of 18.1%. That's because moving objects can easily lead the traditional visual odometry to fail, but we take the observations from moving articulated objects to rectify the camera poses, and
TABLE III
ABLATION STUDY ON SIMULATED DATASET.
Groups RPE-R I ATE RPE-R II ATE RPE-R III ATE RPE-R IV ATE RPE-R Overall
RPE-T RPE-T RPE-T RPE-T RPE-T ATE
Before BA 0.4898 83.441 0.6343 109.968 1.1003 138.373 0.7925 168.312 0.7908
BA w/ static point 0.0989 15.991 15.002 0.1348 17.728 25.796 0.2028 21.070 17.085 0.1389 19.242 35.521 0.1537 18.8328 125.024
0.0988 3.3184 15.019 0.1349 3.7146 25.708 0.2035 4.2522 16.985 0.1388 3.5074 35.269 0.1538 3.7540 23.351
BA w/o motion 0.0962 3.3176 14.881 0.1282 3.7176 25.704 0.1871 4.2631 16.921 0.1226 3.5069 35.426 0.1410 3.7565 23.245
BA w/o rigidity 0.0958 3.2245 14.879 0.1276 3.4984 25.703 0.1870 4.0387 16.914 0.1215 3.2397 35.412 0.1407 3.5148 23.233
BA in Equation (6) 3.2177 3.4824 4.0372 3.2227 3.5085 23.227
Results show RPE-T and ATE in centimeter (cm) and RPE-R in degree (°).
It can be seen in Fig. 7 that ORB-SLAM was interrupted by the moving humans and failed when making a large rotation. By tracking moving humans, our method outperforms ORB-SLAM when making a turn. Also, a better camera pose estimation can in turn benefit the moving objects' trajectories. As can be seen, the objects' trajectories generated by ORB-SLAM are noisy and inconsistent, while ours are smoother. In general, the proposed motion constraint and rigidity constraint have a significant impact on the difficult sequences, where ORB-SLAM outputs inaccurate trajectories due to dynamic objects.

VI. ABLATION STUDY

We perform an ablation study to show the effects of the introduced rigidity and motion constraints. Specifically, we demonstrate that the motion constraint and rigidity constraint improve the camera pose estimation via bundle adjustment.

A. Implementation

We simulate dynamic articulated objects that follow a simple constant motion pattern, and initialize the robot's state with Gaussian noise of σ = 0.05 m on translation and σ = 2.9° on rotation. We also generate static features around the path of the robot, and simulate a sensor with a finite field of view. The measurement of a point also has a noise of σ = 0.05 m. We generate 4 groups of sequences with different lengths, and each group consists of 10 sequences that are initialized with the same number of static and dynamic features. We set the ratio of static to dynamic landmarks to 1:1.8.

B. Results

We evaluate the performance of (a) bundle adjustment with static features only, (b) bundle adjustment without the motion constraint, (c) bundle adjustment without the rigidity constraint, and (d) bundle adjustment with both the motion constraint and rigidity constraint. We use the Absolute Translation Error (ATE) and the Relative Pose Error of Rotation (RPE-R) and Translation (RPE-T) as our evaluation metrics.

As shown in Table III, both motion and rigidity constraints are able to improve the camera pose estimation, while the best performance is obtained when the two constraints are applied together. An interesting phenomenon is that the rigidity constraint can also benefit the objects' trajectory estimation. In Group I, we evaluate the estimation of dynamic points with settings (b), (c), and (d), with 100 repeated experiments. We find that the ATE of dynamic object feature points in setting (c) is 5.68 ± 0.30 lower than in setting (b), while setting (d) is 5.71 ± 0.31 lower than (b). This is because the motion constraint assumes that every dynamic feature on the same object follows the same motion pattern, which requires the object to be rigid. From another point of view, the rigidity constraint provides a good initialization to the object's 3D structure, and so indirectly improves the estimation of the object's trajectory. In general, the ablation study proves that applying motion and rigidity constraints to dynamic articulated objects can benefit the camera pose estimation.

C. Computational Analysis

Finally, we evaluate the running time of the rigidity constraint and motion constraint in the optimization. The back-end optimization is implemented in C++ with a modified g2o [34] solver. With the same setup as Section VI-A, we randomly initialized 10 different sequences with 18 frames. In each frame, we can observe 8 static landmarks and 12 dynamic landmarks from one moving object. In Table IV, we show (i) the convergence time and (ii) the runtime per iteration of Group I in the ablation study. Our method takes 53.54 ms to converge, which is comparable to 39.22 ms for the optimization with the re-projection error only.

In this paper, the semantic masks [29] and human poses [27] are pre-processed as inputs to the system. The experiments are carried out on an Intel Core i7 with 16 GB RAM.

TABLE IV
TIME ANALYSIS OF BUNDLE ADJUSTMENT

                          Convergence Time (ms)    Runtime/iter (ms)
BA w/ reprojection error  39.22                    4.024
BA w/o Rigidity           45.47                    4.078
BA w/o Motion             45.37                    4.637
BA in Equation (6)        53.54                    4.792

CONCLUSION

In this paper, we introduce the rigidity constraint and motion constraint to model dynamic articulated objects. We propose a new pipeline, AirDOS, for stereo SLAM which jointly optimizes the trajectories of dynamic objects, the map of the environment, and the camera poses, improving the robustness and accuracy in dynamic environments. We evaluate our algorithm on the KITTI tracking and TartanAir Shibuya datasets, and demonstrate that camera pose estimation and dynamic objects can benefit each other, especially when there is an aggressive rotation or static features are not enough to support the visual odometry.
REFERENCES

[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[2] C. Wang, J. Yuan, and L. Xie, “Non-iterative SLAM,” in International Conference on Advanced Robotics (ICAR). IEEE, 2017, pp. 83–90.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[4] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
[5] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2018.
[6] S. Yang and S. Scherer, “Cubeslam: Monocular 3-d object slam,” IEEE Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019.
[7] M. Henein, G. Kennedy, R. Mahony, and V. Ila, “Exploiting rigid body motion for slam in dynamic environments,” environments, vol. 18, p. 19, 2018.
[8] K. M. Judd, J. D. Gammell, and P. Newman, “Multimotion visual odometry (mvo): Simultaneous estimation of camera and third-party motions,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956.
[9] M. Henein, J. Zhang, R. Mahony, and V. Ila, “Dynamic slam: The need for speed,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2123–2129.
[10] G. Stamou, M. Krinidis, E. Loutas, N. Nikolaidis, and I. Pitas, “4.11 - 2d and 3d motion tracking in digital video,” Handbook of Image and Video Processing, 2005.
[11] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[12] C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for rgb-d cameras,” in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 3748–3754.
[13] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, “Dynaslam: Tracking, mapping, and inpainting in dynamic scenes,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[14] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, “Robust dense mapping for large-scale dynamic environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7510–7517.
[15] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, “Rgb-d slam in dynamic environments using point correlations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[16] T. Ji, C. Wang, and L. Xie, “Towards real-time semantic rgb-d slam in dynamic environments,” in 2021 International Conference on Robotics and Automation (ICRA), 2021.
[17] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, “Map building with mobile robots in dynamic environments,” in 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), vol. 2. IEEE, 2003, pp. 1557–1563.
[18] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, “Simultaneous localization, mapping and moving object tracking,” The International Journal of Robotics Research, vol. 26, no. 9, pp. 889–916, 2007.
[19] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, “Dynamic body vslam with semantic constraints,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1897–1904.
[20] J. Zhang, M. Henein, R. Mahony, and V. Ila, “Vdo-slam: a visual dynamic object-aware slam system,” arXiv preprint arXiv:2005.11052, 2020.
[21] M. Henein, J. Zhang, R. Mahony, and V. Ila, “Dynamic slam: The need for speed,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 2123–2129.
[22] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” arXiv preprint arXiv:2002.06289, 2020.
[23] Z. Zhang, O. D. Faugeras, and N. Ayache, “Analysis of a sequence of stereo scenes containing multiple moving objects using rigidity constraints,” in ICCV, 1988.
[24] W. B. Thompson, P. Lechleider, and E. R. Stuck, “Detecting moving objects using the rigidity constraint,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 2, pp. 162–166, 1993.
[25] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “Slam++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
[26] M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental smoothing and mapping,” IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1365–1378, 2008.
[27] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi-person pose estimation,” in ICCV, 2017.
[28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2564–2571.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[30] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
[31] C. Mei, G. Sibley, and P. Newman, “Closing loops without places,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3738–3744.
[32] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[33] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
[34] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 3607–3613.

View File

@ -0,0 +1,663 @@
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 11523
DynaVINS: A Visual-Inertial SLAM for
Dynamic Environments
Seungwon Song, Hyungtae Lim, Graduate Student Member, IEEE, Alex Junho Lee, and Hyun Myung, Senior Member, IEEE
Abstract—Visual inertial odometry and SLAM algorithms are widely used in various fields, such as service robots, drones, and autonomous vehicles. Most of the SLAM algorithms are based on the assumption that landmarks are static. However, in the real world, various dynamic objects exist, and they degrade the pose estimation accuracy. In addition, temporarily static objects, which are static during observation but move when they are out of sight, trigger false positive loop closings. To overcome these problems, we propose a novel visual-inertial SLAM framework, called DynaVINS, which is robust against both dynamic objects and temporarily static objects. In our framework, we first present a robust bundle adjustment that can reject the features from dynamic objects by leveraging pose priors estimated by the IMU preintegration. Then, keyframe grouping and a multi-hypothesis-based constraints grouping method are proposed to reduce the effect of temporarily static objects in the loop closing. Subsequently, we evaluate our method on a public dataset that contains numerous dynamic objects. Finally, the experimental results corroborate that our DynaVINS has promising performance compared with other state-of-the-art methods by successfully rejecting the effect of dynamic and temporarily static objects.

Index Terms—Visual-inertial SLAM, SLAM, visual tracking.

Fig. 1. Our algorithm, DynaVINS, in various dynamic environments. (a)(b) Feature rejection results in the city_day sequence of the VIODE dataset [13]. Even if most features are dynamic, DynaVINS can discard the effect of the dynamic features. (c) Separation of feature matching results into multiple hypotheses in the E shape sequence of our dataset. Even if a temporarily static object exists, only the hypothesis from static objects is determined as a true positive. Features with high and low weights are denoted as green circles and red crosses, respectively, in both cases.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) algorithms have been widely exploited in various robotic applications that require precise positioning or navigation in environments where GPS signals are blocked. Various types of sensors have been used in SLAM algorithms. In particular, visual sensors such as monocular cameras [1], [2], [3] and stereo cameras [4], [5], [6] are widely used because of their relatively low cost and weight with rich information.

Various visual SLAM methods have been studied for more than a decade. However, most researchers have assumed that landmarks are implicitly static; thus, many visual SLAM methods still have potential risks when interacting with real-world environments that contain various dynamic objects. Only recently have several studies focused on dealing with dynamic objects solely using visual sensors.

Most of the studies [7], [8], [9] address the problems by detecting the regions of dynamic objects via depth clustering, feature reprojection, or deep learning. Moreover, some researchers incorporate the dynamics of the objects into the optimization framework [10], [11], [12]. However, geometry-based methods require accurate camera poses; hence they can only deal with limited fractions of dynamic objects. In addition, deep-learning-aided methods have the limitation of solely working for predefined objects.

Meanwhile, visual-inertial SLAM (VI-SLAM) frameworks [2], [3], [4], [5], [6] have been proposed by integrating an inertial measurement unit (IMU) into the visual SLAM. Unlike the visual SLAMs, a motion prior from the IMU helps the VI-SLAM algorithms to tolerate scenes with dynamic objects to some degree. However, if the dominant dynamic objects occlude

Manuscript received 27 April 2022; accepted 22 August 2022. Date of publication 31 August 2022; date of current version 6 September 2022. This letter was recommended for publication by Associate Editor M. Magnusson and Editor S. Behnke upon evaluation of the reviewers' comments. This work was
supported in part by the Indoor Robot Spatial AI Technology Development”
project funded by KT, KT award under Grant B210000715 and in part by the
Institute of Information & Communications Technology Planning & Evaluation
(IITP) grant funded by Korea government (MSIT) under Grant 2020-0-00440,
Development of Artificial Intelligence Technology that Continuously Improves
Itself as the Situation Changes in the Real World. The students are supported
by the BK21 FOUR from the Ministry of Education (Republic of Korea).
(Corresponding author: Hyun Myung.)
Seungwon Song, Hyungtae Lim, and Hyun Myung are with the School of
Electrical Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail:
sswan55@kaist.ac.kr; shapelim@kaist.ac.kr; hmyung@kaist.ac.kr).
Alex Junho Lee is with the Department of Civil and Environmen-
tal Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail:
alex_jhlee@kaist.ac.kr).
Our code is available: https://github.com/url-kaist/dynaVINShttps://github.
com/url-kaist/dynaVINS
This letter has supplementary downloadable material available at
https://doi.org/10.1109/LRA.2022.3203231, provided by the authors.
Digital Object Identifier 10.1109/LRA.2022.3203231
In addition, in real-world applications, temporarily static objects are static while being observed but in motion when they are not under observation. These objects may lead to a critical failure in the loop closure process due to false positives as shown in Fig. 1(c). To deal with temporarily static objects, robust back-end methods [14], [15], [16], [17] have been proposed to reduce the effect of the false positive loop closures in optimization. However, since they focused on instantaneous false positive loop closures, they cannot deal with the persistent false positive loop closures caused by the temporarily static objects.

In this study, to address the aforementioned problems, we propose a robust VI-SLAM framework, called DynaVINS, which is robust against dynamic and temporarily static objects. Our contributions are summarized as follows:

• The robust VI-SLAM approach is proposed to handle dominant, undefined dynamic objects that cannot be solved solely by learning-based or vision-only methods.
• A novel bundle adjustment (BA) pipeline is proposed for simultaneously estimating camera poses and discarding the features from the dynamic objects that deviate significantly from the motion prior.
• A robust global optimization with constraints grouped into multiple hypotheses is proposed to reject persistent loop closures from the temporarily static objects.

In the remainder of this letter, we introduce the robust BA method for optimizing moving windows in Section III and the methods for the robust global optimization in Section IV, and compare our proposed method with other state-of-the-art (SOTA) methods in various environments in Section V.

II. RELATED WORKS

A. Visual-Inertial SLAM

As mentioned earlier, to address the limitations of the visual SLAM framework, VI-SLAM algorithms have been recently proposed to correct the scale and camera poses by adopting the IMU. MSCKF [3] was proposed as an extended Kalman filter (EKF)-based VI-SLAM algorithm. ROVIO [6] also used an EKF, but proposed a fully robocentric and direct VI-SLAM framework running in real time.

There are other approaches using optimization. OKVIS [5] proposed a keyframe-based framework and fuses the IMU preintegration residual and the reprojection residual in an optimization. ORB-SLAM3 [4] used an ORB descriptor for the feature matching, and poses and feature positions are corrected through an optimization. VINS-Fusion [2], an extended version of VINS-Mono, supports a stereo camera and adopts a feature tracking, rather than a descriptor matching, which makes the algorithm faster and more robust.

However, these VI-SLAM methods described above still have potential limitations in handling the dominant dynamic objects and the temporarily static objects.

B. Dynamic Objects Rejection in Visual and VI SLAM

Numerous researchers have proposed various methods to handle dynamic objects in visual and VI SLAM algorithms. Fan et al. [8] proposed a multi-view geometry-based method using an RGB-D camera. After obtaining camera poses by minimizing the reprojection error, the type of each feature point is determined as dynamic or static by the geometric relationship between the camera movement and the feature. Canovas et al. [9] proposed a similar method, but adopted a surfel, similar to a polygon, to enable real-time performance by reducing the number of items to be computed. However, multi-view geometry-based algorithms assumed that the camera pose estimation is accurate enough, leading to failure when the camera pose estimation is inaccurate owing to the dominant dynamic objects.

One of the solutions to this problem is to employ a wheel encoder. G2P-SLAM [18] rejected loop closure matching results with a high Mahalanobis distance from the pose estimated by the wheel odometry, which is invariant to the effect of dynamic and temporarily static objects. Despite the advantages of the wheel encoder, these methods are highly dependent on the wheel encoder, limiting their applicability.

Another feasible approach is to adopt deep learning networks to identify predefined dynamic objects. In DynaSLAM [7], masked areas of the predefined dynamic objects obtained by a deep learning network were eliminated and the remainder was determined via multi-view geometry. In Dynamic-SLAM [19], a compensation method was adopted to make up for missed detections in a few keyframes using sequential data. Although the deep learning methods can successfully discard the dynamic objects even if they are temporarily static, these methods are somewhat problematic for the following two reasons: a) the types of dynamic objects have to be predefined, and b) sometimes, only a part of the dynamic object is visible as shown in Fig. 1(b). For these reasons, the objects may not be detected occasionally.

On the other hand, methods for tracking a dynamic object's motion have been proposed. RigidFusion [10] assumed that only a single dynamic object is in the environment and estimated the motion of the dynamic object. Qiu et al. [12] combined a deep learning method and VINS-Mono [2] to track poses of the camera and object simultaneously. DynaSLAM II [11] identified dynamic objects, similar to DynaSLAM [7]; then, within the BA factor graph, the poses of static features and the camera were estimated while estimating the motion of the dynamic objects simultaneously.

C. Robust Back-End

In the graph SLAM field, several researchers have attempted to discard incorrectly created constraints. For instance, max-mixture [14] employed a single integrated Bayesian framework to eliminate the incorrect loop closures, while switchable constraints [15] were proposed to adjust the weight of each constraint to eliminate false positive loop closures in the optimization. However, false-positive loop closures can be expected to be consistent and to occur persistently owing to the temporarily static objects. These robust kernels are not appropriate for handling such persistent loop closures.

On the other hand, the Black-Rangarajan (B-R) duality [20] was proposed to unify robust estimation and the outlier rejection process. Some methods [16], [17] utilize B-R duality in point cloud registration and pose graph optimization (PGO) to reduce the effect of false-positive matches even if they are dominant. These methods are useful for rejecting outliers in a PGO. However, repeatedly detected false-positive loop closures from similar objects are not considered. Moreover, B-R duality has not yet been utilized in the BA of VI-SLAM.

To address the aforementioned limitations, we improve the VI-SLAM to minimize the effect of the dynamic and temporarily static objects by adopting the B-R duality not only in the graph structure but also in the BA framework by reflecting the IMU prior and the feature tracking information.
Fig. 2. The pipeline of our robust visual-inertial SLAM. Features are tracked in mono or stereo images and IMU data are preintegrated in the sensor preprocessing step. Then, the robust BA is applied to discard tracked features from dynamic objects so that only the features from static objects remain. Keyframes are grouped using the number of tracked features, and loop closures detected in the current keyframe group are clustered into hypotheses. Each hypothesis with its weight is used or rejected in the selective optimization. Using the proposed framework, a trajectory robust against dynamic and temporarily static objects can be obtained.

III. ROBUST BUNDLE ADJUSTMENT

A. Notation

In this letter, the following notations are defined. The i-th camera frame and the j-th tracked feature are denoted as C_i and f_j, respectively. For two frames C_A and C_B, T_B^A \in SE(3) denotes the pose of C_A relative to C_B, and the pose of C_A in the world frame W is denoted as T_W^A.

\mathcal{B} is a set of indices of the IMU preintegrations, and \mathcal{P} is a set of visual pairs (i, j), where i corresponds to the frame C_i and j to the feature f_j. Because the feature f_j is tracked across multiple camera frames, different camera frames can contain the same feature f_j. Thus, a set of indices of all tracked features in the current moving window is denoted as \mathcal{F_P}, and a set of indices of the camera frames that contain the feature f_j is denoted as \mathcal{P}(f_j).

In the visual-inertial optimization framework of the current sliding window, \mathcal{X} represents the full state vector that contains sets of poses and velocities of the keyframes, the biases of the IMU, i.e., acceleration and gyroscope biases, and the estimated depths of the features as in [2].

B. Conventional Bundle Adjustment

In the conventional visual-inertial state estimator [2], the visual-inertial BA formulation is defined as follows:

\min_{\mathcal{X}} \Big\{ \|r_p - H_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \big\|r_I\big(\hat{z}_{b_k}^{b_{k+1}}, \mathcal{X}\big)\big\|^2_{P_{b_k}^{b_{k+1}}} + \sum_{(i,j)\in\mathcal{P}} \rho_H\Big( \big\|r_P\big(\hat{z}_j^{C_i}, \mathcal{X}\big)\big\|^2_{P_j^{C_i}} \Big) \Big\},   (1)

where \rho_H(\cdot) denotes the Huber loss [21]; r_p, r_I, and r_P represent the residuals for marginalization, IMU, and visual reprojection measurements, respectively; \hat{z}_{b_k}^{b_{k+1}} and \hat{z}_j^{C_i} stand for the observations of IMU and feature points; H_p denotes a measurement estimation matrix of the marginalization, and P denotes the covariance of each term. For convenience, r_I(\hat{z}_{b_k}^{b_{k+1}}, \mathcal{X}) and r_P(\hat{z}_j^{C_i}, \mathcal{X}) are simplified as r_I^k and r_{j,i}^P, respectively.

The Huber loss does not work successfully once the ratio of outliers increases. This is because the Huber loss does not entirely reject the residuals from outliers [22]. On the other hand, the redescending M-estimators, such as Geman-McClure (GMC) [23], ignore the outliers perfectly once the residuals are over a specific range owing to their zero gradients. Unfortunately, this truncation triggers a problem that features considered as outliers would never become inliers even though the features originated from static objects.

To address these problems, our BA method consists of two parts: a) a regularization factor that leverages the IMU preintegration and b) a momentum factor that considers the previous state of each weight to cover the case where the preintegration becomes temporarily inaccurate.

C. Regularization Factor

First, to reject the outlier features while robustly estimating the poses, we propose a novel loss term inspired by the B-R duality [20] as follows:

\rho(w_j, r_j^P) = w_j^2\, r_j^P + \lambda_w \Phi^2(w_j),   (2)

where r_j^P denotes \sum_{i\in\mathcal{P}(f_j)} \|r_{j,i}^P\|^2 for simplicity; w_j \in [0, 1] denotes the weight corresponding to each feature f_j, and f_j with w_j close to 1 is determined as a static feature; \lambda_w \in \mathbb{R}^+ is a constant parameter; \Phi(w_j) denotes the regularization factor of the weight w_j and is defined as follows:

\Phi(w_j) = 1 - w_j.   (3)

Then, \rho(w_j, r_j^P) in (2) is adopted instead of the Huber norm in the visual reprojection term in (1). Hence, the BA formulation can be expressed as:

\min_{\mathcal{X},\mathcal{W}} \Big\{ \|r_p - H_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \|r_I^k\|^2 + \sum_{j\in\mathcal{F_P}} \rho(w_j, r_j^P) \Big\},   (4)

where \mathcal{W} = \{w_j \mid j \in \mathcal{F_P}\} represents the set of all weights. By adopting the weight and the regularization factor inspired by B-R duality, the influence of features with a high reprojection error compared to the estimated state can be reduced while maintaining the state estimation performance. The details will be covered in the remainder of this subsection.

(4) is solved using an alternating optimization [20]. Because the current state \mathcal{X} can be estimated from the IMU preintegration and the previously optimized state, unlike other methods [16], [17], \mathcal{W} is updated first with the fixed \mathcal{X}. Then, \mathcal{X} is optimized with the fixed \mathcal{W}.

While optimizing \mathcal{W}, all terms except the weights are constants. Hence, the formulation for optimizing the weights can be expressed as follows:

\min_{\mathcal{W}} \Big\{ \sum_{j\in\mathcal{F_P}} \rho(w_j, r_j^P) \Big\}.   (5)
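To make the alternating scheme concrete, the following is a minimal NumPy sketch of the weight-update half of one alternating step: with the state X held fixed, the accumulated squared reprojection residual r_j^P of each feature is a constant, so each w_j minimizes a scalar quadratic and can be updated independently. The function name and the toy residual values are illustrative and not taken from the letter.

import numpy as np

def update_weights(residuals_sq, lambda_w):
    """Weight-update half-step of the alternating optimization of (5).

    residuals_sq : array of r_j^P, i.e. the summed squared reprojection
                   residuals of each tracked feature under the fixed state X.
    lambda_w     : regularization constant lambda_w > 0.

    Minimizing  w^2 * r + lambda_w * (1 - w)^2  over w in [0, 1] per feature
    gives the closed form w = lambda_w / (r + lambda_w), cf. (7) below.
    """
    w = lambda_w / (residuals_sq + lambda_w)
    return np.clip(w, 0.0, 1.0)

# Toy example: two well-tracked (static) features and one dynamic feature
# with a large reprojection residual under the IMU-predicted state.
r_sq = np.array([0.02, 0.05, 9.0])
print(update_weights(r_sq, lambda_w=1.0))   # approx. [0.98, 0.95, 0.10]

The state-update half-step (optimizing X with W fixed) is the usual nonlinear least-squares solve of (4) and is omitted here.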
Fig. 3. Changes of the loss functions w.r.t. various parameters. (a) ρ(w_j, r_j^P) w.r.t. w_j in the alternating optimization for λ_w = 1; ρ̄(r_j^P) represents the converged loss. (b) ρ̄(r_j^P) w.r.t. λ_w. (c) ρ̄_m(r_j^P) w.r.t. w̄_j for n_j = 5. (d) ρ̄_m(r_j^P) w.r.t. n_j for w̄_j = 0.

Because the weights w_j are independent of each other, (5) can be optimized independently for each w_j as follows:

\min_{w_j\in[0,1]} \Big\{ w_j^2 \Big(\sum_{i\in\mathcal{P}(f_j)} \|r_{j,i}^P\|^2\Big) + \lambda_w \Phi^2(w_j) \Big\}.   (6)

Because the terms in (6) are in a quadratic form w.r.t. w_j, the optimal w_j can be derived as follows:

w_j = \frac{\lambda_w}{r_j^P + \lambda_w}.   (7)

As mentioned previously, the weights are first optimized based on the estimated state. Thus the weights of features with high reprojection errors start with small values. However, as shown in Fig. 3(a), the loss of the feature ρ(w_j, r_j^P) is a convex function unless the weight is zero, so there is a non-zero gradient not only in the loss of an inlier feature but also in the loss of an outlier feature, which means that a new feature affects the BA regardless of its type at first.

While the optimization step is repeated until the states and the weights converge, the weights of the outlier features are lowered and their losses are flattened. As a result, the losses of the outlier features approach zero gradient and cannot affect the BA.

After convergence, the weight can be expressed using the reprojection error as in (7). Thus the converged loss ρ̄(r_j^P) can be derived by applying (7) to (2) as follows:

\bar{\rho}(r_j^P) = \frac{\lambda_w\, r_j^P}{\lambda_w + r_j^P}.   (8)

As shown in Fig. 3(b), increasing λ_w affects ρ̄(r_j^P) in two directions: increasing the gradient value and the convexity. By increasing the gradient value, the visual reprojection residuals affect the BA more than the marginalization and IMU preintegration residuals. And by increasing the convexity, some of the outlier features can affect the BA.

To sum up, the proposed factor benefits from both the Huber loss and GMC by adjusting the weights in an adaptive way; our method efficiently filters out outliers, but does not entirely ignore outliers in the optimization at first as well.
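As a quick numerical illustration of this trade-off, the sketch below evaluates the converged loss (8) and its slope, treating r as the accumulated squared reprojection residual, next to the gradient of a Huber loss taken on the residual magnitude. The parameter values are arbitrary; the point is only the qualitative behaviour described above: the converged loss flattens for large residuals so converged outliers stop influencing the BA, whereas the Huber gradient never vanishes.

import numpy as np

def converged_loss(r_sq, lambda_w=1.0):
    # Converged loss of the proposed factor, eq. (8).
    return lambda_w * r_sq / (lambda_w + r_sq)

def converged_loss_grad(r_sq, lambda_w=1.0):
    # Slope of (8) w.r.t. r: lambda_w^2 / (lambda_w + r)^2 -> 0 for large
    # residuals, so converged outlier features no longer affect the BA.
    return lambda_w ** 2 / (lambda_w + r_sq) ** 2

def huber_grad(r_sq, delta=1.0):
    # Gradient magnitude of the Huber loss w.r.t. the residual magnitude:
    # it never vanishes, so outliers always keep some influence.
    r = np.sqrt(r_sq)
    return np.where(r <= delta, r, delta)

r_sq = np.array([0.1, 1.0, 10.0, 100.0])
print(converged_loss(r_sq))       # saturates towards lambda_w
print(converged_loss_grad(r_sq))  # approx. [0.83, 0.25, 0.0083, 0.0001]
print(huber_grad(r_sq))           # approx. [0.32, 1.0, 1.0, 1.0]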
Fig. 4. Framework of the robust BA. Each feature has a weight and is used in the visual residual. Each weight is optimized through the regularization factor and the weight momentum factor. Preintegrated IMU data are used in the IMU residual term. All parameters are optimized in the robust BA.

D. Weight Momentum Factor

When the motion becomes aggressive, the IMU preintegration becomes imprecise, and thus the estimated state becomes inaccurate. In this case, the reprojection residuals of the features from the static objects become larger; hence, by the regularization factor, those features will be ignored in the BA process even though their previous weights were close to one.

If λ_w is increased to solve this problem, even the features with high reprojection residuals caused by dynamic objects are used. Therefore, the result of the BA will be inaccurate. Thus, increasing λ_w is not enough to cope with this problem.

To solve this issue, an additional factor, a weight momentum factor, is proposed to make the previously estimated feature weights unaffected by an aggressive motion.

Because the features are continuously tracked, each feature f_j is optimized n_j times with its previous weight w̄_j. In order to make the current weight tend to remain at w̄_j, and to increase the degree of this tendency as n_j increases, the weight momentum factor Ψ(w_j) is designed as follows:

\Psi(w_j) = n_j(\bar{w}_j - w_j).   (9)

Then, adding (9) to (2), the modified loss term can be derived as follows:

\rho_m(w_j, r_j^P) = w_j^2 \sum_{i\in\mathcal{P}(f_j)} \|r_{j,i}^P\|^2 + \lambda_w \Phi^2(w_j) + \lambda_m \Psi^2(w_j),   (10)

where \lambda_m \in \mathbb{R}^+ represents a constant parameter to adjust the effect of the momentum factor on the BA.

In summary, the proposed robust BA can be illustrated as in Fig. 4. The previous weights of the tracked features are used in the weight momentum factor, and the weights of all features in the current window are used in the regularization factor. As a result, the robust BA is expressed as follows:

\min_{\mathcal{X},\mathcal{W}} \Big\{ \|r_p - H_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \|r_I^k\|^2 + \sum_{j\in\mathcal{F_P}} \rho_m(w_j, r_j^P) \Big\}.   (11)

(11) can be solved by using the alternating optimization in the same way as (4). The alternating optimization is iterated until \mathcal{X} and \mathcal{W} converge. Then, the converged loss ρ̄_m(r_j^P) can be derived. ρ̄_m(r_j^P) w.r.t. w̄_j and n_j is shown in Fig. 3(c) and (d), respectively.

As shown in Fig. 3(c), if w̄_j is low, the gradient of the loss is small even when r_j^P is close to 0. Thus, the features presumably originated from dynamic objects do not have much impact on the BA even if their reprojection errors are low in the current step. In addition, the gradient of the loss increases for features whose w̄_j is close to 1, so even though the current residual is high, the optimization is performed in the direction of reducing the residual rather than the weight.
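The letter gives the modified loss (10) but does not write out the corresponding weight update. A closed form follows from the quadratic form in the same way that (7) follows from (6); the sketch below is that derivation, which is my own reconstruction rather than an equation from the letter, and it reduces to (7) when λ_m = 0 or n_j = 0.

import numpy as np

def update_weight_with_momentum(r_sq, w_prev, n_tracked, lambda_w, lambda_m):
    """Per-feature weight minimizing eq. (10) for a fixed state X.

    Setting d/dw [ w^2 r + lambda_w (1-w)^2 + lambda_m n^2 (w_prev - w)^2 ] = 0
    gives w = (lambda_w + lambda_m n^2 w_prev) / (r + lambda_w + lambda_m n^2).
    """
    m = lambda_m * n_tracked ** 2
    w = (lambda_w + m * w_prev) / (r_sq + lambda_w + m)
    return np.clip(w, 0.0, 1.0)

# A long-tracked static feature whose residual spikes during aggressive motion
# (imprecise IMU preintegration): the momentum term keeps its weight up.
print(update_weight_with_momentum(r_sq=5.0, w_prev=0.95, n_tracked=8,
                                  lambda_w=1.0, lambda_m=0.1))  # approx. 0.57
# The same residual for a feature with no tracking history.
print(update_weight_with_momentum(r_sq=5.0, w_prev=0.0, n_tracked=0,
                                  lambda_w=1.0, lambda_m=0.1))  # approx. 0.17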
Furthermore, as shown in Fig. 3(d), if w̄_j is zero, the gradient gets smaller as n_j increases; hence a tracked outlier feature has less effect on the BA, and the longer it is tracked, the less it affects the BA.

For the stereo camera configuration, in addition to the reprojection on one camera, reprojections on the other camera in the same keyframe, r_P^stereo, or in another keyframe, r_P^another, exist. In that case, weights are also applied to the reprojection r_P^another because it is also affected by the movement of features, while r_P^stereo is invariant to the movement of features and is only adopted as the criterion for the depth estimation.

IV. SELECTIVE GLOBAL OPTIMIZATION

In the VIO framework, the drift is inevitably cumulative along the trajectory because the optimization is performed only within the moving window. Hence, a loop closure detection, e.g., using DBoW2 [24], is necessary to optimize the whole trajectory.

In a typical visual SLAM, all loop closures are exploited even if some of them are from temporarily static objects. Those false positive loop closures may lead to the failure of the SLAM framework. Moreover, features from the temporarily static objects and from the static objects may exist in the same keyframe. Therefore, in this section, we propose a method to eliminate the false positive loop closures while maintaining the true positive loop closures.

Fig. 5. The procedure of the multiple hypotheses clustering. (a) Keyframes that share the minimum number of tracked features are grouped. (b) There are two types of features used for matchings: static and temporarily static features. k,mT_W^i, the estimated pose of C_i, can be estimated using the matching result T_m^k and the local relative pose T_k^i. An accurate keyframe pose can be estimated if static features are used for the matching. (c) The temporarily static feature has moved from its previous position. However, the matching result is based on the previous position of the feature. Thus, the estimated keyframe pose will be inaccurate. Finally, the feature matching results with similar T_W^i are clustered based on the Euclidean distance.

A. Keyframe Grouping

Unlike conventional methods that treat loop closures individually, in this study, loop closures from the same features are grouped, even if they are from different keyframes. As a result, only one weight per group is used, allowing for effective optimization.

As shown in Fig. 5(a), before grouping the loop closures, adjacent keyframes that share at least a minimum number of tracked features have to be grouped. The group starting from the i-th camera frame C_i is defined as follows:

\mathrm{Group}(C_i) = \big\{ C_k \,\big|\, |F_i^k| \ge \alpha,\; k \ge i \big\},   (12)

where \alpha represents a minimum number of tracked features, and F_i^k represents the set of features tracked from C_i to C_k. For simplicity, Group(C_i) will be denoted as G_i hereinafter.

B. Multiple Hypotheses Clustering

After keyframes are grouped as in the previous subsection, DBoW2 is employed to identify a similar keyframe C_m for each keyframe C_k in the current group G_i starting from C_i (C_k \in G_i and m < i). Note that C_k is skipped if there is no similar keyframe. After identifying up to three different m for each k, a feature matching is conducted between C_k and these keyframes, and the relative pose T_m^k can be obtained. Using T_m^k, the estimated pose of C_k in the world frame, mT_W^k, can be obtained as follows:

{}^{m}T_W^k = T_m^k \cdot T_W^m,   (13)

where T_W^m represents the pose of C_m in the world frame.

However, it is difficult to directly compute the similarity between the loop closures from different keyframes in the current group. Assuming that the relative pose T_k^i between C_k and C_i is sufficiently accurate, the estimated pose of C_i in the world frame can be expressed as follows:

{}^{k,m}T_W^i = T_k^i \cdot {}^{m}T_W^k.   (14)

If the features used for the matchings are from the same object, the estimated T_W^i of the matchings will be located close to each other, even if C_k and C_m of the matchings are different. Hence, after calculating the Euclidean distances between the loop closures' estimated T_W^i, the similar loop closures with a small Euclidean distance can be clustered as shown in Fig. 5(c).

Depending on which loop closure cluster is selected, the trajectory result from the graph optimization varies. Therefore, each cluster can be called a hypothesis. To reduce the computational cost, the top-two hypotheses are adopted by comparing the cardinality of the loop closures within each hypothesis. These two hypotheses of the current group G_i are denoted as H_i^0 and H_i^1.

However, it is not yet possible to distinguish between true and false positive hypotheses. Hence, the method for determining the true positive hypothesis among the candidate hypotheses will be described in the next section.
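Before moving to the selective optimization itself, the following is a compact Python sketch of the grouping and clustering steps of Sections IV-A and IV-B. The helper names, the assumption that the shared-feature count only decreases with k, the use of 4x4 homogeneous pose matrices, and the greedy translation-only Euclidean clustering are illustrative simplifications, not the letter's implementation; the pose composition order simply mirrors how (13)-(14) are written.

import numpy as np

def keyframe_group(i, shared_counts, alpha):
    """Group(C_i) from eq. (12): keyframes k >= i that still share at least
    alpha tracked features with C_i; shared_counts[k] stands for |F_i^k|."""
    group = []
    for k in range(i, len(shared_counts)):
        if shared_counts[k] < alpha:
            break
        group.append(k)
    return group

def estimate_world_pose(T_m_k, T_W_m, T_k_i):
    """Propagate a loop-closure match (C_k against an earlier C_m) to an
    estimated world pose of C_i, following the composition order of (13)-(14)."""
    T_W_k = T_m_k @ T_W_m          # (13)
    return T_k_i @ T_W_k           # (14)

def cluster_hypotheses(estimated_poses, dist_thresh):
    """Greedy Euclidean clustering of the estimated world poses of C_i; each
    cluster of mutually consistent loop closures forms one hypothesis."""
    clusters = []
    for idx, T in enumerate(estimated_poses):
        t = T[:3, 3]
        for cluster in clusters:
            ref = estimated_poses[cluster[0]][:3, 3]
            if np.linalg.norm(t - ref) < dist_thresh:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    # Keep the top-two clusters by cardinality as hypotheses H_i^0 and H_i^1.
    clusters.sort(key=len, reverse=True)
    return clusters[:2]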
C. Selective Optimization for Constraint Groups

Most of the recent visual SLAM algorithms use a graph optimization. Let \mathcal{C}, \mathcal{T}, \mathcal{L}, and \mathcal{W} denote the sets of keyframes, poses, loop closures, and all weights, respectively. Then the graph optimization can be denoted as:

\min_{\mathcal{T}} \Big\{ \sum_{i\in\mathcal{C}} \|r(T_i^{i+1}, \mathcal{T})\|^2_{P_{T_i^{i+1}}} + \sum_{(j,k)\in\mathcal{L}} \rho_H\big( \|r(T_k^j, \mathcal{T})\|^2_{P_L} \big) \Big\},   (15)

where the two terms are the local edges and the loop closure edges; T_i^{i+1} represents the local pose between two adjacent keyframes C_i and C_{i+1}; T_k^j is the relative pose between C_j and C_k from the loop closure; P_{T_i^{i+1}} and P_L denote the covariances of the local pose and the loop closure, respectively.

For the two hypotheses of group G_i, the weights are denoted as w_i^0 and w_i^1, the sum of the weights as w_i, and the set of hypotheses as \mathcal{H}. Using a similar procedure as in Section III-C, the Black-Rangarajan duality is applied to (15) as follows:

\min_{\mathcal{T},\mathcal{W}} \Big\{ \sum_{i\in\mathcal{C}} \|r(T_i^{i+1}, \mathcal{T})\|^2_{P_{T_i^{i+1}}} + \sum_{H_i\in\mathcal{H}} \Big( \sum_{(j,k)\in H_i^0} \frac{w_i^0}{|H_i^0|} \|r(T_k^j, \mathcal{T})\|^2_{P_L} + \sum_{(j,k)\in H_i^1} \frac{w_i^1}{|H_i^1|} \|r(T_k^j, \mathcal{T})\|^2_{P_L} + \lambda_l \Phi_l^2(w_i) \Big) \Big\},   (16)

where the terms inside the group sum are, respectively, the residual for hypothesis 0, the residual for hypothesis 1 (optional), and the hypothesis regularization function, and λ_l \in \mathbb{R}^+ is a constant parameter. The regularization factor for the loop closure, Φ_l, is defined as follows:

\Phi_l(w_i) = 1 - w_i = 1 - (w_i^0 + w_i^1),   (17)

where w_i^0, w_i^1 \in [0, 1]. To ensure that the weights are not affected by the number of loop closures in the hypothesis, the weights are divided by the cardinality of each hypothesis.

Then, (16) is optimized in the same manner as (11). Accordingly, only the hypothesis with a high weight is adopted in the optimization. In addition, all weights can be close to 0 when all hypotheses are false positives due to multiple temporarily static objects. Hence, the failure caused by false positive hypotheses can be prevented.

Because the keyframe poses are changed after the optimization, the hypothesis clustering in Section IV-B is conducted again for all groups before the next optimization.
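The sketch below only evaluates the hypothesis-weighted loop-closure portion of (16) for fixed poses, to show how the per-hypothesis weights, the division by cluster cardinality, and the regularization λ_l Φ_l² interact. The residuals are stand-in scalars rather than SE(3) errors, and the surrounding pose-graph solver is omitted.

import numpy as np

def hypothesis_loop_cost(res_sq_h0, res_sq_h1, w0, w1, lambda_l):
    """Loop-closure part of eq. (16) for one keyframe group G_i.

    res_sq_h0 / res_sq_h1 : squared loop-closure residuals of hypotheses
                            H_i^0 and H_i^1 (stand-in scalars here).
    w0, w1                : hypothesis weights in [0, 1].
    The weights are divided by the hypothesis cardinality so a hypothesis is
    not favored merely for containing many loop closures, and
    Phi_l = 1 - (w0 + w1) penalizes rejecting every hypothesis, cf. (17).
    """
    cost = w0 * np.sum(res_sq_h0) / len(res_sq_h0)
    cost += w1 * np.sum(res_sq_h1) / len(res_sq_h1)
    cost += lambda_l * (1.0 - (w0 + w1)) ** 2
    return cost

# Hypothesis 0: consistent loop closures from static structure; hypothesis 1:
# loop closures against a temporarily static object that has since moved.
h0 = np.array([0.01, 0.02, 0.015])
h1 = np.array([4.0, 3.5])
for w0, w1 in [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]:
    print((w0, w1), hypothesis_loop_cost(h0, h1, w0, w1, lambda_l=0.5))
# The cost is lowest at (w0, w1) = (1, 0): the static hypothesis is kept and
# the false-positive one is rejected.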
V. EXPERIMENTAL RESULTS

To evaluate the proposed algorithm, we compare ours with SOTA algorithms, namely, VINS-Fusion [2], ORB-SLAM3 [4], and DynaSLAM [7]. Each algorithm is tested in a mono-inertial (-M-I) and a stereo-inertial (-S-I) mode. Note that an IMU is not used in DynaSLAM, so it is only tested in a stereo (-S) mode and compared with the -S-I mode of the other algorithms. This could be somewhat unfair, but the comparison is conducted to stress the necessity of an IMU when dealing with dynamic environments.

TABLE I: ABLATION EXPERIMENT

A. Dataset

VIODE Dataset: The VIODE dataset [13] is a simulated dataset that contains many moving objects, such as cars or trucks, compared with conventional datasets. In addition, the dataset includes overall occlusion situations, where most parts of the image are occluded by dominant dynamic objects as shown in Fig. 1. Note that the sub-sequence names none to high indicate how many dynamic objects exist in the scene.

Our Dataset: Unfortunately, the VIODE dataset does not contain harsh loop closing situations caused by temporarily static objects. Accordingly, we obtained our own dataset with four sequences to evaluate our global optimization. First, the Static sequence validates the dataset. In the Dynamic follow sequence, a dominant dynamic object moves in front of the camera. Next, in the Temporal static sequence, the same object is observed from multiple locations. In other words, the object is static while being observed, and then it moves to a different position. Finally, in the E-shape sequence, the camera moves along the shape of the letter E. The checkerboard is moved while not being observed, thus it is observed at the three end-vertices of the E-shaped trajectory in the camera perspective, which triggers false-positive loop closures. Note that the feature-rich checkerboard is used in the experiment to address the effect of false loop closures.

B. Error Metrics

The accuracy of the estimated trajectory from each algorithm is measured by the Absolute Trajectory Error (ATE) [25], which directly measures the difference between points of the ground truth and the aligned estimated trajectory. In addition, for the VIODE dataset, the degradation rate [13], r_d = ATE_high / ATE_none, is calculated to determine the robustness of the algorithm.
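A minimal sketch of the two metrics follows, assuming the estimated trajectory has already been aligned to the ground truth (the alignment step of [25] is omitted); the array shapes and the example numbers are illustrative only.

import numpy as np

def ate_rmse(gt_xyz, est_xyz_aligned):
    """RMSE of the Absolute Trajectory Error between ground-truth positions
    and an already-aligned estimated trajectory, both of shape (N, 3)."""
    err = np.linalg.norm(gt_xyz - est_xyz_aligned, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def degradation_rate(ate_by_level):
    """r_d = ATE_high / ATE_none over the VIODE sub-sequences."""
    return ate_by_level["high"] / ate_by_level["none"]

# Illustrative numbers only.
print(degradation_rate({"none": 0.05, "high": 0.07}))  # 1.4  -> robust
print(degradation_rate({"none": 0.05, "high": 0.60}))  # 12.0 -> degrades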
C. Evaluation on the VIODE Dataset

First, the effects of the proposed factors on BA time cost and accuracy are analyzed as shown in Table I. Ours with only the regularization factor already has a better result than VINS-Fusion, but with the momentum factor added, not only does it outperform VINS-Fusion, it also takes less time owing to the previous weight information. Moreover, although the BA time of ours is increased due to the additional optimizations, it is sufficient for high-level control of robots.

As shown in Table II and Fig. 6, the SOTA methods show precise pose estimation results in static environments. However, they struggle with the effect of dominant dynamic objects. In particular, even though DynaSLAM employs a semantic segmentation module, DynaSLAM tends to diverge or shows a large ATE compared with other methods as the number of dynamic objects increases (from none to high). This performance degradation is due to the overall occlusion situations, leading to the failure of the semantic segmentation module and the absence of features from static objects.
TABLE II: COMPARISON WITH STATE-OF-THE-ART METHODS (RMSE OF ATE IN [M])

Fig. 6. ATE results of state-of-the-art algorithms and ours on the city_day sequences of the VIODE dataset [13]. Note that the y-axis is expressed in logarithmic scale. Our algorithm shows promising performance with less performance degeneration compared with the other state-of-the-art methods.

Fig. 7. Results of the state-of-the-art algorithms and ours on the parking_lot high sequence of the VIODE dataset [13]. (a) Trajectory of each algorithm in the 3D feature map, which is the result of our proposed algorithm. Features with low weight are depicted in red. (b) Enlarged view of (a). All algorithms except ours lost track or had noisy trajectories while observing dynamic objects, and as in (c), the feature weighting result of our algorithm, features from dynamic objects (red crosses) have low weight while robust features (green circles) have high weight.

TABLE III: COMPARISON OF DEGRADATION RATE r_d

Similarly, although ORB-SLAM3 tries to reject the frames with inaccurate features, it diverges when dominant dynamic objects exist in the parking_lot mid, high and city_day high sequences. However, especially in the parking_lot low sequence, there is only one vehicle that is far from the camera, and it occludes an unnecessary background environment. As a consequence, ORB-SLAM3-S-I outperforms the other algorithms there.

VINS-Fusion is less hindered by the dynamic objects because it tries to remove the features with an incorrectly estimated depth (negative or far) after the BA. However, those features have already affected the BA before they are removed. As a result, as the number of features from dynamic objects increases, the trajectory error of VINS-Fusion gets higher.

In contrast, our proposed method shows promising performance in both mono-inertial and stereo-inertial modes. For example, in the parking_lot high sequence, as shown in Fig. 7(a)-(b), ours performs stable pose estimation even when other algorithms are influenced by dynamic objects. Moreover, even though the number of dynamic objects increases, the performance degradation remains small compared to other methods in all scenes. This confirms that our method overcomes the problems caused by dynamic objects owing to our robust BA method, which is also supported by Table III. In other words, our proposed method successfully rejects the dynamic features by adjusting the weights in an adaptive way. Also, our method is even robust against the overall occlusion situations, as shown in Fig. 1(b).

Interestingly, our proposed robust BA method also provides robustness against changes in illuminance by rejecting inconsistent features (e.g., low-weight features in the dark area of Fig. 7(c)). Accordingly, our method shows remarkable performance compared with the SOTA methods in the city_night scenes where not only dynamic objects exist, but also there is a lack of illuminance. Note that the -M-I mode of ours has a better result than -S-I. This is because the stereo reprojection, r_P^stereo, can be inaccurate in low-light conditions.
Fig. 8. Results of the algorithms on the E-shape sequence. (a) Trajectory results. Other algorithms are inaccurate due to false positive loop closures. (b) A loop closure rejection result of our algorithm. Constraints with low weight (red lines) do not contribute to the optimized trajectory.

D. Evaluation on Our Dataset

In the Static case, all algorithms have low ATE values. This sequence validates that our dataset is correctly obtained. However, in Dynamic follow, other algorithms tried to track the occluding object. Hence, not only failures of the BA but also false-positive loop closures are triggered. Consequently, the algorithms other than ours have higher ATEs.

Furthermore, in Temporal static, ORB-SLAM3 and VINS-Fusion can eliminate the false-positive loop closure in the stereo-inertial case. However, in the mono-inertial case, due to an inaccurate depth estimation, they cannot reject the false-positive loop closures. Additionally, VINS-Fusion with Switchable Constraints [15] can also reject the false-positive loop closures, but ours has a better performance as shown in Table II.

Finally, in the E-shape case, other algorithms fail to optimize the trajectory, as illustrated in Fig. 8(a), owing to the false-positive loop closures. Also, VINS-Fusion with Switchable Constraints cannot reject the false-positive loop closures that are continuously generated. However, ours optimizes the weight of each hypothesis, not of individual loop closures. Hence, false-positive loop closures are rejected in the optimization irrespective of their number, as illustrated in Fig. 8(b). Ours does not use any object-wise information from the image; hence the features from the same object can be divided into different hypotheses, as depicted in Fig. 1(c).

VI. CONCLUSION

In this study, DynaVINS has been proposed, which is a robust visual-inertial SLAM framework based on the robust BA and the selective global optimization in dynamic environments. The experimental evidence corroborated that our algorithm works better than other algorithms in simulations and in actual environments with various dynamic objects. In future works, we plan to improve the speed and the performance. Moreover, we will adapt the concept of DynaVINS to the LiDAR-Visual-Inertial (LVI) SLAM framework.

REFERENCES

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147-1163, Oct. 2015.
[2] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004-1020, Aug. 2018.
[3] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proc. IEEE Int. Conf. Robot. Automat., 2007, pp. 3565-3572.
[4] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874-1890, Dec. 2021.
[5] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," Int. J. Robot. Res., vol. 34, no. 3, pp. 314-334, 2015.
[6] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 298-304.
[7] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4076-4083, Oct. 2018.
[8] Y. Fan, H. Han, Y. Tang, and T. Zhi, "Dynamic objects elimination in SLAM based on image fusion," Pattern Recognit. Lett., vol. 127, pp. 191-201, 2019.
[9] B. Canovas, M. Rombaut, A. Nègre, D. Pellerin, and S. Olympieff, "Speed and memory efficient dense RGB-D SLAM in dynamic scenes," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 4996-5001.
[10] R. Long, C. Rauch, T. Zhang, V. Ivan, and S. Vijayakumar, "RigidFusion: Robot localisation and mapping in environments with large dynamic rigid objects," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 3703-3710, Apr. 2021.
[11] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled multi-object tracking and SLAM," IEEE Robot. Automat. Lett., vol. 6, no. 3, pp. 5191-5198, Jul. 2021.
[12] K. Qiu, T. Qin, W. Gao, and S. Shen, "Tracking 3-D motion of dynamic objects using monocular visual-inertial sensing," IEEE Trans. Robot., vol. 35, no. 4, pp. 799-816, Aug. 2019.
[13] K. Minoda, F. Schilling, V. Wüest, D. Floreano, and T. Yairi, "VIODE: A simulated dataset to address the challenges of visual-inertial odometry in dynamic environments," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1343-1350, Apr. 2021.
[14] E. Olson and P. Agarwal, "Inference on networks of mixtures for robust robot mapping," Int. J. Robot. Res., vol. 32, no. 7, pp. 826-840, 2013.
[15] N. Sünderhauf and P. Protzel, "Switchable constraints for robust pose graph SLAM," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 1879-1884.
[16] H. Yang, P. Antonante, V. Tzoumas, and L. Carlone, "Graduated non-convexity for robust spatial perception: From non-minimal solvers to global outlier rejection," IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 1127-1134, Apr. 2020.
[17] Q.-Y. Zhou, J. Park, and V. Koltun, "Fast global registration," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782.
[18] S. Song, H. Lim, S. Jung, and H. Myung, "G2P-SLAM: Generalized RGB-D SLAM framework for mobile robots in low-dynamic environments," IEEE Access, vol. 10, pp. 21370-21383, 2022.
[19] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1-16, 2019.
[20] M. J. Black and A. Rangarajan, "On the unification of line processes, outlier rejection, and robust statistics with applications in early vision," Int. J. Comput. Vis., vol. 19, no. 1, pp. 57-91, 1996.
[21] P. J. Huber, "Robust estimation of a location parameter," in Breakthroughs in Statistics, 1992, pp. 492-518.
[22] P. Babin, P. Giguère, and F. Pomerleau, "Analysis of robust functions for registration algorithms," in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 1451-1457.
[23] S. Geman, D. E. McClure, and D. Geman, "A nonlinear filter for film restoration and other problems in image processing," CVGIP: Graph. Models Image Process., vol. 54, no. 4, pp. 281-289, 1992.
[24] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robot., vol. 28, no. 5, pp. 1188-1197, Oct. 2012.
[25] Z. Zhang and D. Scaramuzza, "A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 7244-7251.

View File

@ -0,0 +1,476 @@
2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
May 29 - June 2, 2023. London, UK
DytanVO: Joint Refinement of Visual Odometry and Motion
Segmentation in Dynamic Environments
Shihao Shen, Yilin Cai, Wenshan Wang, Sebastian Scherer
Code is available at https://github.com/Geniussh/DytanVO
S. Shen, Y. Cai, W. Wang, and S. Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. {shihaosh, yilincai, wenshanw, basti}@andrew.cmu.edu
DOI: 10.1109/ICRA48891.2023.10161306

Fig. 1: An overview of DytanVO. (a) Input frames at time t0 and t1. (b) Optical flow output from the matching network. (c) Motion segmentation output after iterations. (d) Trajectory estimation on sequence RoadCrossing VI from the AirDOS-Shibuya dataset, which is a highly dynamic environment cluttered with humans. Ours is the only learning-based VO that keeps track.

Abstract— Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is largely used to discard dynamic associations before estimating camera motions, but at the cost of discarding static features, and it is hard to scale up to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first supervised learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real-time and predicts camera ego-motion in an iterative fashion. Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems which optimize the trajectory on the backend. Experiments on plentiful unseen environments also demonstrate our method's generalizability.

I. INTRODUCTION

Visual odometry (VO), one of the most essential components for pose estimation in the visual Simultaneous Localization and Mapping (SLAM) system, has attracted significant interest in robotic applications over the past few years [1]. A lot of research work has been conducted to develop an accurate and robust monocular VO system using geometry-based methods [2], [3]. However, they require significant engineering effort for each module to be carefully designed and finetuned [4], which makes it difficult for them to be readily deployed in the open world with complex environmental dynamics, changes of illumination or inevitable sensor noises.

On the other hand, recent learning-based methods [4]-[7] are able to outperform geometry-based methods in more challenging environments such as large motion, fog or rain effects, and lack of features. However, they easily fail in dynamic environments if they do not take into consideration independently moving objects that cause unpredictable changes in illumination or occlusions. To this end, recent works utilize abundant unlabeled data and adopt either self-supervised learning [8], [9] or unsupervised learning [10], [11] to handle dynamic scenes. Although they achieve outstanding performance on particular tasks, such as autonomous driving, they produce worse results if applied to very different data distributions, such as micro air vehicles (MAV) that operate with aggressive and frequent rotations that cars do not have. Learning without supervision is hindered from generalizing due to biased data with simple motion patterns. Therefore, we approach the dynamic VO problem as supervised learning so that the model can map inputs to complex ego-motion ground truth and be more generalizable.

To identify dynamic objects, object detection or semantic segmentation techniques are largely relied on to mask all movable objects, such as pedestrians and vehicles [12]-[15]. Their associated features are discarded before applying geometry-based methods. However, there are two issues with utilizing semantic information in dynamic VO. First, class-specific detectors for semantic segmentation heavily depend on appearance cues, but not every object that can move is present in the training categories, leading to false negatives. Second, even if all moving objects in a scene are within the categories, algorithms could not distinguish between "actually moving" versus "static but being able to move". In dynamic VO, where static features are crucial to robust ego-motion estimation, one should segment objects based on pure motion (motion segmentation) rather than heuristic appearance cues.

Motion segmentation utilizes relative motion between consecutive frames to remove the effect of camera movement from the 2D motion fields and calculates residual optical flow to account for moving regions. But paradoxically, ego-motion cannot be correctly estimated in dynamic scenes without a robust segmentation. There exists such a mutual dependence between motion segmentation and ego-motion estimation that has never been explored in supervised learning methods.
Therefore, motivated by jointly refining the VO and motion segmentation, we propose our learning-based dynamic VO (DytanVO). To our best knowledge, our work is the first supervised learning-based VO for dynamic environments. The main contributions of this paper are threefold:

• A novel learning-based VO is introduced to leverage the interdependence among camera ego-motion, optical flow and motion segmentation.
• We introduce an iterative framework where both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications.
• Among learning-based VO solutions, our method achieves state-of-the-art performance in real-world dynamic scenes without finetuning. Furthermore, our method performs even comparably with visual SLAM solutions that optimize trajectories on the backend.

II. RELATED WORK

Learning-based VO solutions aim to avoid hard-coded modules that require significant engineering efforts for design and finetuning in classic pipelines [1], [16]. For example, Valada [17] applies auxiliary learning to leverage relative pose information to constrain the search space and produce consistent motion estimation. Another class of learning-based methods relies on dense optical flow to estimate pose, as it provides more robust and redundant modalities for feature association in VO [5], [18], [19]. However, their frameworks are built on the assumption of photometric consistency, which only holds in a static environment without independently moving objects. They easily fail when dynamic objects unpredictably cause occlusions or illumination changes.

Semantic information is largely used by earlier works in VO or visual SLAM to handle dynamic objects in the scene, obtained by either a feature-based method or a learning-based method. Feature-based methods utilize hand-designed features to recognize semantic entities [20]. An exemplary system proposed by [21] computes SIFT descriptors from monocular image sequences in order to recognize semantic objects. On the other hand, data-driven CNN-based semantic methods have been widely used to improve the performance, such as DS-SLAM [22] and SemanticFusion [23]. A few works on semantic VO/SLAM have fused the semantic information from recognition modules to enhance motion estimation and vice versa [24], [25]. However, all these methods are prone to limited semantic categories, which leads to false negatives when scaling to unusual real-world applications such as offroad driving or MAV, and requires continuous effort in ground-truth labeling.

Instead of utilizing appearance cues for segmentation, efforts have been made to segment based on geometry cues. FlowFusion [26] iteratively refines its ego-motion estimation by computing residual optical flow. GeoNet [10] divides its system into two sub-tasks by separately predicting static scene structure and dynamic motions. However, both depend on geometric constraints arising from epipolar geometry and rigid transformations, which are vulnerable to motion ambiguities, such as objects moving in the colinear direction relative to the camera being indistinguishable from the background given only ego-motion and optical flow. On the other hand, MaskVO [8] and SimVODIS++ [9] approach the problem by learning to mask dynamic feature points in a self-supervised manner. CC [11] couples motion segmentation, flow, depth and camera motion models, which are jointly solved in an unsupervised way. Nevertheless, these self-supervised or unsupervised methods are trained on self-driving vehicle data dominated by pure translational motions with little rotation, which makes them difficult to generalize to completely different data distributions such as handheld cameras or drones. Our work introduces a framework that jointly refines camera ego-motion and motion segmentation in an iterative way that is robust against motion ambiguities and generalizes to the open world.

III. METHODOLOGY

A. Datasets

Built on TartanVO [5], our method retains its generalization capability while handling dynamic environments in multiple types of scenes, such as car, MAV, indoor and outdoor. Besides taking camera intrinsics as an extra layer into the network to adapt to various camera settings as explored in [5], we train our model on large amounts of synthetic data with broad diversity, which is shown capable of facilitating easy adaptation to the real world [27]-[29].

Our model is trained on both TartanAir [27] and SceneFlow [30]. The former contains more than 400,000 data frames with ground truth of optical flow and camera pose in static environments only. The latter provides 39,000 frames in highly dynamic environments, with each trajectory having backward/forward passes, different objects and motion characteristics. Although SceneFlow does not provide ground truth of motion segmentation, we are able to recover it by making use of its ground truth of disparity, optical flow and disparity change maps.

B. Architecture

Our network architecture is illustrated in Fig. 2, which is based on TartanVO. Our method takes in two consecutive undistorted images I_t, I_{t+1} and outputs the relative camera motion δ_t^{t+1} = (R|T), where T ∈ R^3 is the 3D translation and R ∈ SO(3) is the 3D rotation. Our framework consists of three sub-modules: a matching network, a motion segmentation network, and a pose network. We estimate dense optical flow F_t^{t+1} with a matching network, M_θ(I_t, I_{t+1}), from the two consecutive images. The network is built based on PWC-Net [31]. The motion segmentation network U_γ, based on a lightweight U-Net [32], takes in the relative camera motion output, R|T, the optical flow from M_θ, and the original input frames. It outputs a probability map, z_t^{t+1}, of every pixel belonging to a dynamic object or not, which is thresholded and turned into a binary segmentation mask, S_t^{t+1}. The optical flow is then stacked with the mask and the intrinsics layer K_C, followed by setting all optical flow inside the masked regions to zeros, i.e., F̃_t^{t+1}. The last module is a pose network P_φ, with ResNet50 [33] as the backbone, which takes in the previous stack and outputs the camera motion.

Fig. 2: Overview of our three-stage network architecture. It consists of a matching network which estimates optical flow from two consecutive images, a pose network that estimates pose based on optical flow without dynamic movements, and a motion segmentation network that outputs a probability mask of the dynamicness. The matching network is forwarded only once while the pose network and the segmentation network are iterated to jointly refine the pose estimate and the motion segmentation. In the first iteration, we randomly initialize the segmentation mask. In each iteration, optical flow is set to zero inside masked regions.
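The data flow into the pose network described above can be summarized with the shape-level sketch below. The image size, the two-channel intrinsics layer, and the exact channel layout of the stacked pose-network input are assumptions made for illustration and do not come from the released implementation.

import numpy as np

# Illustrative shapes for one frame pair (H x W images).
H, W = 480, 640
flow = np.zeros((H, W, 2), dtype=np.float32)      # F_t^{t+1} from M_theta
prob = np.zeros((H, W), dtype=np.float32)         # z_t^{t+1} from U_gamma
mask = (prob >= 0.5).astype(np.float32)           # binary S_t^{t+1}
K_C = np.zeros((H, W, 2), dtype=np.float32)       # intrinsics layer, as in TartanVO

flow_masked = flow * (1.0 - mask)[..., None]      # zero the flow in dynamic regions
pose_input = np.concatenate(
    [flow_masked, mask[..., None], K_C], axis=-1) # stack fed to P_phi (ResNet50)
print(pose_input.shape)                           # (480, 640, 5)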
C. Motion segmentation

Earlier dynamic VO methods that use motion segmentation rely on purely geometric constraints arising from epipolar geometry and rigid transformations [12], [26] so that they can threshold residual optical flow, which is designed to account for moving regions. However, they are prone to catastrophic failures in two cases: (1) points in 3D moving along epipolar lines cannot be identified from the background given only monocular cues; (2) pure geometry methods leave no tolerance for noisy optical flow and less accurate camera motion estimations, which in our framework is very likely to happen in the first few iterations. Therefore, following [34], to deal with the ambiguities above, we explicitly model cost maps as inputs into the segmentation network after upgrading the 2D optical flow to 3D through optical expansion [35], which estimates the relative depth based on the scale change of overlapping image patches. The cost maps are tailored to the coplanar and colinear motion ambiguities that cause segmentation failures in geometry-based motion segmentation. More details can be found in [34].

D. Iteratively refine camera motion

We provide an overview of our iterative framework in Algorithm 1. During inference, the matching network is forwarded only once while the pose network and the segmentation network are iterated to jointly refine the ego-motion estimation and the motion segmentation. In the first iteration, the segmentation mask is initialized randomly using [36]. The criterion to stop iterating is straightforward: the rotational and translational differences of R|T between two iterations must be smaller than prefixed thresholds ϵ. Instead of using a fixed constant to threshold the probability maps into segmentation masks, we predetermine a decaying parameter that empirically reduces the input threshold over time, in order to discourage inaccurate masks in earlier iterations while embracing refined masks in later ones.

Algorithm 1: Inference with Iterations
  Given two consecutive frames I_t, I_{t+1} and intrinsics K
  Initialize iteration number: i ← 1
  Initialize difference in output camera motions: δR|T ← ∞
  ^iF_t^{t+1} ← OpticalFlow(I_t, I_{t+1})
  while δR|T ≥ stopping criterion ϵ do
      if first iteration then
          ^iS_t^{t+1} ← getCowmask(I_t)
      else
          ^iz_t^{t+1} ← MotionSegmentation(^iF_t^{t+1}, I_t, ^iR|^iT)
          ^iS_t^{t+1} ← mask(^iz_t^{t+1} ≥ z_threshold)
      ^iF̃_t^{t+1} ← set ^iF_t^{t+1} = 0 where ^iS_t^{t+1} == 1
      ^iR|^iT ← PoseNetwork(^iF̃_t^{t+1}, ^iS_t^{t+1}, K)
      δR|T ← ^iR|^iT − ^{i−1}R|^{i−1}T
      i ← i + 1

Intuitively, during early iterations, the estimated motion is less accurate, which leads to false positives in the segmentation output (assigning high probabilities to static areas). However, because the optical flow map still provides enough correspondences even after cutting out non-dynamic regions, P_φ is able to robustly leverage the segmentation mask S_t^{t+1} concatenated with F̃_t^{t+1} and outputs a reasonable camera motion. In later iterations, U_γ is expected to output increasingly precise probability maps such that static regions in the optical flow map are no longer "wasted" and hence P_φ can be improved accordingly.
4050
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:38:36 UTC from IEEE Xplore. Restrictions apply.
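As a concrete illustration of the loop in Algorithm 1, the following Python sketch restates the iterative refinement with placeholder callables (opt_flow, seg_net, pose_net and cow_mask stand in for Mθ, Uγ, Pϕ and the cow-mask generator); the threshold, decay and stopping values are assumptions rather than the authors' settings.

import numpy as np

def dytanvo_infer(img_t, img_t1, K, opt_flow, seg_net, pose_net, cow_mask,
                  eps=1e-3, z_thresh0=0.9, decay=0.8, max_iters=3):
    """Schematic restatement of Algorithm 1: the matching network runs once,
    then segmentation and pose estimation iterate until the pose stops changing."""
    flow = opt_flow(img_t, img_t1)                # matching network, forwarded only once
    prev_pose, z_thresh = None, z_thresh0
    for i in range(1, max_iters + 1):
        if i == 1:
            mask = cow_mask(img_t)                # random, locally connected initial mask
        else:
            prob = seg_net(flow, img_t, prev_pose)
            mask = prob >= z_thresh               # probability map -> binary mask
            z_thresh *= decay                     # decaying threshold over iterations
        masked_flow = np.where(mask[..., None], 0.0, flow)   # zero flow on dynamic pixels
        pose = pose_net(masked_flow, mask, K)                # R|T, up to scale
        if prev_pose is not None and np.linalg.norm(pose - prev_pose) < eps:
            break                                 # rotation/translation change below threshold
        prev_pose = pose
    return pose, mask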
Intuitively, during early iterations the estimated motion is less accurate, which leads to false positives in the segmentation output (assigning high probabilities to static areas). However, because the optical flow map still provides enough correspondences even after non-dynamic regions are cut out of it, Pϕ is able to robustly leverage the segmentation mask S_t^{t+1} concatenated with F̃_t^{t+1} and output a reasonable camera motion. In later iterations, Uγ is expected to output increasingly precise probability maps, such that static regions in the optical flow map are no longer "wasted" and Pϕ can be improved accordingly.

In practice, we find that 3 iterations are more than enough to get both camera motion and segmentation refined. To clear up any ambiguity, a 1-iteration pass is composed of one Mθ forward pass and one Pϕ forward pass with a random mask, while a 3-iteration pass consists of one Mθ forward pass, two Uγ forward passes and three Pϕ forward passes. In Fig. 3 we illustrate how the segmentation masks evolve over three iterations on unseen data. The mask at the first iteration contains a significant amount of false positives but quickly converges beyond the second iteration. This verifies our assumption that the pose network is robust against false positives in the segmentation results.

Fig. 3: Motion segmentation output at each iteration when testing on unseen data. (a) Running inference with our segmentation network on the hardest sequence in AirDOS-Shibuya, with multiple people moving in different directions. (b) Inference on a sequence from FlyingThings3D where dynamic objects take up more than 60% of the area. The ground truth (GT) mask on Shibuya is generated by the segmentation network with GT ego-motion as input.
E. Supervision

We train our pose network to be robust against large areas of false positives. On training data without any dynamic object, we adopt the cow-mask [36] to create sufficiently random yet locally connected segmentation patterns, since a motion segmentation could occur at any size, any shape and any position in an image while exhibiting locally explainable structures corresponding to the types of moving objects. In addition, we apply curriculum learning to the pose network, where we gradually increase the maximum percentage of dynamic areas in SceneFlow from 15%, 20%, 30%, 50% to 100%. Since TartanAir only contains static scenes, we adjust the size of the cow-masks accordingly.

We supervise our network on the camera motion loss L_P. Under the monocular setting, we only recover an up-to-scale camera motion. We follow [5] and normalize the translation vector before calculating the distance to the ground truth. Given the ground truth motion R|T,

L_P = \left\| \frac{\hat{T}}{\max(\|\hat{T}\|, \epsilon)} - \frac{T}{\max(\|T\|, \epsilon)} \right\| + \|\hat{R} - R\|,   (1)

where ε = 1e-6 prevents numerical instability and \hat{\cdot} denotes estimated quantities.
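A minimal PyTorch sketch of the up-to-scale pose loss in Eq. (1); the function name and the choice of a flattened rotation representation (e.g., axis-angle) are illustrative assumptions, not the authors' code.

import torch

def pose_loss(t_hat, R_hat, t_gt, R_gt, eps=1e-6):
    """Up-to-scale pose loss of Eq. (1): distance between direction-normalized
    translations plus a rotation term; rotations are assumed to be flattened vectors."""
    t_hat_n = t_hat / torch.clamp(t_hat.norm(dim=-1, keepdim=True), min=eps)
    t_gt_n = t_gt / torch.clamp(t_gt.norm(dim=-1, keepdim=True), min=eps)
    return (t_hat_n - t_gt_n).norm(dim=-1).mean() + (R_hat - R_gt).norm(dim=-1).mean()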
Our framework can also be trained in an end-to-end fashion, in which case the objective becomes an aggregated loss of the optical flow loss L_M, the camera motion loss L_P and the motion segmentation loss L_U, where L_M is the L1 norm between the predicted flow and the ground-truth flow, whereas L_U is the binary cross-entropy loss between the predicted probability and the segmentation label:

L = λ_1 L_M + λ_2 L_U + L_P.   (2)

From preliminary empirical comparison, end-to-end training gives similar performance to training the pose network only, because we use λ_1 and λ_2 to regularize the objective such that the training is biased toward mainly improving the odometry rather than optimizing the other two tasks. This is ideal since the pose network is very tolerant of false positives in segmentation results (shown in III-D). In the following section, we show our results of supervising only on Eq. 1 while fixing the motion segmentation network.

IV. EXPERIMENTAL RESULTS

A. Implementation details

1) Network: We initialize the matching network Mθ with the pre-trained model from TartanVO [5], and fix the motion segmentation network Uγ with the pre-trained weights from Yang et al. [34]. The pose network Pϕ uses ResNet50 [33] as the backbone, removes the batch normalization layers, and adds two output heads for rotation R and translation T. Mθ outputs optical flow at a size of H/4 × W/4. Pϕ takes in a 5-channel input, i.e., F̃_t^{t+1} ∈ R^{2×H/4×W/4}, S_t^{t+1} ∈ R^{H/4×W/4} and K_C ∈ R^{2×H/4×W/4}. The concatenation of F̃_t^{t+1} and K_C augments the optical flow input with 2D positional information, while concatenating F̃_t^{t+1} with S_t^{t+1} encourages the network to learn dynamic representations.
second stage, Pϕ and Mθ are jointly optimized for another
We supervise our network on the camera motion loss LP . 100,000 iterations with a batch size of 64. During curriculum
Under the monocular setting, we only recover an up-to-scale learning, the learning rate starts at 2e-4, while the second
camera motion. We follow [5] and normalize the translation stage uses a learning rate of 2e-5. Both stages apply a
vector before calculating the distance to ground truth. Given decay rate of 0.2 to the learning rate every 50,000 iterations.
Random cropping and resizing (RCR) [5] as well as frame
skipping are applied to both datasets.
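The schedule above can be summarized in a short PyTorch sketch; the learning rates, decay factor and step size come from the text, while the choice of Adam is an assumption since the optimizer is not stated.

import torch

def make_stage_optimizer(params, stage):
    """Two-stage schedule: stage 1 (curriculum) starts at lr=2e-4, stage 2 (joint
    optimization) at lr=2e-5; both decay the learning rate by 0.2 every 50,000 iterations."""
    lr = 2e-4 if stage == 1 else 2e-5
    optimizer = torch.optim.Adam(params, lr=lr)       # optimizer choice is an assumption
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.2)
    return optimizer, scheduler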
TABLE I: ATE (m) results on dynamic sequences from AirDOS-Shibuya. Our method gives outstanding performance among VO methods. DeepVO, TrianFlow and CC are trained on KITTI only and are unable to generalize to complex motion patterns. All SLAM methods use bundle adjustment (BA) on multiple frames to optimize the trajectory, hence we only numerically compare ours with pure VO methods. The best and the second best VO performances are highlighted in the original as bold and underlined. "-" denotes SLAM methods that fail to initialize.

                                     StandingHuman        RoadCrossing (Easy)          RoadCrossing (Hard)
                                     I         II         III       IV        V        VI        VII
SLAM  DROID-SLAM [37]                0.0051    0.0073     0.0103    0.0120    0.2778   0.0253    0.5788
      AirDOS w/ mask [38]            0.0606    0.0193     0.0951    0.0331    0.0206   0.2230    0.5625
      ORB-SLAM w/ mask [39]          0.0788    0.0060     0.0657    0.0196    0.0148   1.0984    0.8476
      VDO-SLAM [40]                  0.0994    0.6129     0.3813    0.3879    0.2175   0.2400    0.6628
      DynaSLAM [41]                  0.8836    0.3907     0.4196    0.4925    -        0.6446    0.6539
VO    DeepVO [4]                     0.3956    0.6351     0.7788    0.3436    0.5434   0.7223    0.9633
      TrianFlow [42]                 0.9743    1.3835     1.3348    1.6172    1.4769   1.7154    1.9075
      CC [11]                        0.4527    0.7714     0.5406    0.6345    0.5411   0.8558    1.0896
      TartanVO [5]                   0.0600    0.1605     0.2762    0.1814    0.2174   0.3228    0.5009
      Ours                           0.0327    0.1017     0.0608    0.0516    0.0755   0.0365    0.0660
3) Runtime: Although our method iterates multiple times to refine both segmentation and camera motion, we find in practice that 3 iterations are more than enough due to the robustness of Pϕ, as shown in Fig. 3. On an NVIDIA RTX 2080 GPU, inference takes 40 ms with 1 iteration, 100 ms with 2 iterations and 160 ms with 3 iterations.

4) Evaluation: We use the Absolute Trajectory Error (ATE) to evaluate our algorithm against other state-of-the-art methods, including both VO and visual SLAM. We evaluate our method on the AirDOS-Shibuya dataset [38] and the KITTI Odometry dataset [44]. Additionally, in the supplemental material, we test our method on data collected at a cluttered intersection to demonstrate that our method scales competitively to real-world dynamic scenes.
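For reference, ATE over a sequence reduces to the RMSE of translational differences after trajectory alignment; a minimal sketch, assuming the estimated trajectory has already been aligned (and scale-corrected for monocular VO):

import numpy as np

def absolute_trajectory_error(pred_xyz, gt_xyz):
    """RMSE of per-frame translational differences between an aligned estimated
    trajectory and ground truth, both given as N x 3 arrays of positions."""
    diff = pred_xyz - gt_xyz
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))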
B. Performance on AirDOS-Shibuya Dataset

We first provide an ablation study on the number of iterations (iter) in Tab. III, using three sequences from AirDOS-Shibuya [38]. The quantitative results are consistent with Fig. 3, where the pose network quickly converges after the first iteration. We also compare the 3-iteration finetuned model obtained after jointly optimizing Pϕ and Mθ (second stage), which shows less improvement because the optical flow estimation on AirDOS-Shibuya already has high quality.

TABLE III: Experiments on the number of iterations, ATE (m).

            Standing I   RoadCrossing III   RoadCrossing VII
1 iter      0.0649       0.1666             0.3157
2 iter      0.0315       0.0974             0.0658
3 iter      0.0327       0.0608             0.0660
Finetuned   0.0384       0.0631             0.0531

We then compare our method with others on the seven sequences from AirDOS-Shibuya in Tab. I and demonstrate that our method outperforms existing state-of-the-art VO algorithms. This benchmark covers much more challenging viewpoints and diverse motion patterns for articulated objects than our training data. The seven sequences are categorized into three levels of difficulty: in Standing Human most humans stand still with a few moving around, Road Crossing (Easy) contains multiple humans moving in and out of the camera's view, and in Road Crossing (Hard) humans enter the camera's view abruptly. Besides VO methods, we also compare ours with SLAM methods that are able to handle dynamic scenes. DROID-SLAM [37] is a learning-based SLAM trained on TartanAir. AirDOS [38], VDO-SLAM [40] and DynaSLAM [41] are three feature-based SLAM methods targeting dynamic scenes. We provide the performance of AirDOS and ORB-SLAM [39] after masking the dynamic features during their ego-motion estimation. DeepVO [4], TartanVO and TrianFlow [42] are three learning-based VO methods not targeting dynamic scenes, while CC [11] is an unsupervised VO resolving dynamic scenes through motion segmentation.

Our model achieves the best performance in all sequences among VO baselines and is competitive even among SLAM methods. DeepVO, TrianFlow and CC perform badly on the AirDOS-Shibuya dataset because they are trained on KITTI only and are not able to generalize. TartanVO performs better, but it is still susceptible to the disturbance of dynamic objects. On RoadCrossing V, as shown in Fig. 1, all VO baselines fail except ours. In hard sequences with more aggressive camera movements and abundant moving objects, ours outperforms dynamic SLAM methods such as AirDOS, VDO-SLAM and DynaSLAM by more than 80%. While DROID-SLAM remains competitive most of the time, it loses track on RoadCrossing V and VII as soon as a walking person occupies a large area in the image. Note that ours takes only 0.16 seconds per inference with 3 iterations, whereas DROID-SLAM takes an extra 4.8 seconds to optimize the trajectory. More qualitative results are in the supplemental material.

C. Performance on KITTI

We also evaluated our method against others on sequences from the KITTI Odometry dataset [44] in Tab. II. Our method outperforms other VO baselines in 6 out of 8 dynamic sequences, with an improvement of 27.7% on average against the second-best method.
TABLE II: Results of ATE (m) on dynamic sequences from KITTI Odometry. Original sequences are trimmed into shorter ones that contain dynamic objects¹. DeepVO [4], TrianFlow [42] and CC [11] are trained on KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic data. Without backend optimization, unlike SLAM, we achieve the best performance on 00, 02, 04, and competitive performance on the rest among all methods including SLAM. We use (·) to denote that the sequence is in the training set of the corresponding method.

                             00        01        02        03        04        07        08        10
SLAM  DROID-SLAM [37]        0.0148    49.193    0.1064    0.0119    0.0374    0.1939    0.9713    0.0368
      ORB-SLAM w/ mask [39]  0.0187    -         0.0796    0.1519    0.0198    0.2108    1.0479    0.0246
      DynaSLAM [41]          0.0138    -         0.1046    0.1450    0.3187    -         1.0559    0.0264
VO    DeepVO [4]             (0.0206)  1.2896    (0.2975)  0.0783    0.0506    1.5540    (3.8984)  0.2545
      TrianFlow [42]         0.6966    (8.2127)  (1.8759)  1.6862    1.2950    0.6789    (1.0411)  (0.0346)
      CC [11]                0.0253    (0.3060)  (0.2559)  0.0505    0.0337    0.7108    0.9776    0.1024
      TartanVO [5]           0.0345    4.7080    0.1049    0.2832    0.0743    0.6367    1.0344    0.0280
      Ours                   0.0126    0.4081    0.0594    0.0406    0.0180    0.7262    0.6547    0.1042

¹ The sequences listed are trimmed into lengths of 28, 133, 67, 31, 40, 136, 51 and 59 respectively, and contain moving pedestrians, vehicles and cyclists.
Fig. 4: Qualitative results on dynamic sequences in KITTI Odometry 01, 03, 04 and 10. The first row is our segmentation outputs of
moving objects. The second row is the visualization after aligning the scales of trajectories with ground truth all at once. Ours produces
precise odometry given large areas in the image being dynamic even among methods that are trained on KITTI. Note that the trajectories
do not always reflect the ATE results due to alignment.
DeepVO, TrianFlow and CC are trained on some of the sequences in KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic data. Moreover, we achieve the best ATE on 3 sequences among both VO and SLAM without any optimization. We provide qualitative results in Fig. 4 on four challenging sequences with fast-moving vehicles or dynamic objects occupying large areas of the image. Note that on sequence 01, which starts with a high-speed vehicle passing by, both ORB-SLAM and DynaSLAM fail to initialize, while DROID-SLAM loses track from the beginning. Even though CC uses 01 in its training set, ours gives only 0.1 higher ATE, while being 0.88 lower than the third-best baseline. On sequence 10, where a huge van takes up a significant area in the center of the image, ours is the only VO that keeps track robustly.

D. Diagnostics

While we observe that our method is robust to heavily dynamic scenes with as much as 70% of the image covered by dynamic objects, it still fails when all foreground objects are moving and only a textureless background is left. This is most likely to happen when dynamic objects take up large areas in the image. For example, when testing on the test set of FlyingThings3D [30], where 80% of the image is dynamic, our method masks almost the entire optical flow map as zeros, leading to the divergence of motion estimation and segmentation. Future work could hence consider incorporating dynamic-object awareness into the framework and utilizing dynamic cues instead of fully discarding them. Additionally, learning-based VO tends to overfit on simple translational movements such as those in KITTI, which is resolved in our method by training on datasets with broad diversity; however, our method gives worse performance when there is little or zero camera motion, caused by the bias in currently available datasets. One should consider training on zero-motion inputs in addition to frame skipping.

V. CONCLUSION

In this paper, we propose a learning-based dynamic VO (DytanVO) which can jointly refine the estimation of camera pose and the segmentation of dynamic objects. We demonstrate that both ego-motion estimation and motion segmentation can converge quickly within the time constraints of real-time applications. We evaluate our method on the KITTI Odometry and AirDOS-Shibuya datasets, and demonstrate state-of-the-art performance in dynamic environments without finetuning or backend optimization. Our work introduces new directions for dynamic visual SLAM algorithms.
REFERENCES

[1] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80-92, 2011.
[2] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611-625, 2017.
[3] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15-22, IEEE, 2014.
[4] S. Wang, R. Clark, H. Wen, and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043-2050, IEEE, 2017.
[5] W. Wang, Y. Hu, and S. Scherer, "TartanVO: A generalizable learning-based VO," arXiv preprint arXiv:2011.00359, 2020.
[6] H. Zhou, B. Ummenhofer, and T. Brox, "DeepTAM: Deep tracking and mapping," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 822-838, 2018.
[7] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, "Self-supervised deep visual odometry with online adaptation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6339-6348, 2020.
[8] W. Xuan, R. Ren, S. Wu, and C. Chen, "MaskVO: Self-supervised visual odometry with a learnable dynamic mask," in 2022 IEEE/SICE International Symposium on System Integration (SII), pp. 225-231, IEEE, 2022.
[9] U.-H. Kim, S.-H. Kim, and J.-H. Kim, "SimVODIS++: Neural semantic visual odometry in dynamic environments," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4244-4251, 2022.
[10] Z. Yin and J. Shi, "GeoNet: Unsupervised learning of dense depth, optical flow and camera pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983-1992, 2018.
[11] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black, "Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12240-12249, 2019.
[12] H. Liu, G. Liu, G. Tian, S. Xin, and Z. Ji, "Visual SLAM based on dynamic object removal," in 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 596-601, IEEE, 2019.
[13] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in 2019 International Conference on Robotics and Automation (ICRA), pp. 5231-5237, IEEE, 2019.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263-2270, 2017.
[15] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110-122, 2017.
[16] F. Fraundorfer and D. Scaramuzza, "Visual odometry: Part II: Matching, robustness, optimization, and applications," IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 78-90, 2012.
[17] A. Valada, N. Radwan, and W. Burgard, "Deep auxiliary learning for visual localization and odometry," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6939-6946, IEEE, 2018.
[18] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18-25, 2015.
[19] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, "Visual odometry revisited: What should be learnt?," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4203-4210, IEEE, 2020.
[20] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1565-1573, 2016.
[21] S. Pillai and J. Leonard, "Monocular SLAM supported object recognition," arXiv preprint arXiv:1506.01732, 2015.
[22] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168-1174, IEEE, 2018.
[23] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628-4635, IEEE, 2017.
[24] L. An, X. Zhang, H. Gao, and Y. Liu, "Semantic segmentation-aided visual odometry for urban autonomous driving," International Journal of Advanced Robotic Systems, vol. 14, no. 5, p. 1729881417735667, 2017.
[25] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, "VSO: Visual semantic odometry," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 234-250, 2018.
[26] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7322-7328, IEEE, 2020.
[27] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909-4916, IEEE, 2020.
[28] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23-30, IEEE, 2017.
[29] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969-977, 2018.
[30] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040-4048, 2016.
[31] D. Sun, X. Yang, M. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," CoRR, vol. abs/1709.02371, 2017.
[32] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241, Springer, 2015.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[34] G. Yang and D. Ramanan, "Learning to segment rigid motions from two frames," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1266-1275, 2021.
[35] G. Yang and D. Ramanan, "Upgrading optical flow to 3D scene flow through optical expansion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1334-1343, 2020.
[36] G. French, A. Oliver, and T. Salimans, "Milking CowMask for semi-supervised image classification," arXiv preprint arXiv:2003.12022, 2020.
[37] Z. Teed and J. Deng, "DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras," Advances in Neural Information Processing Systems, vol. 34, pp. 16558-16569, 2021.
[38] Y. Qiu, C. Wang, W. Wang, M. Henein, and S. Scherer, "AirDOS: Dynamic SLAM benefits from articulated objects," in 2022 International Conference on Robotics and Automation (ICRA), pp. 8047-8053, IEEE, 2022.
[39] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015.
[40] J. Zhang, M. Henein, R. Mahony, and V. Ila, "VDO-SLAM: A visual dynamic object-aware SLAM system," arXiv preprint arXiv:2005.11052, 2020.
[41] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076-4083, 2018.
[42] W. Zhao, S. Liu, Y. Shu, and Y.-J. Liu, "Towards better generalization: Joint depth-pose learning without PoseNet," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9151-9161, 2020.
[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, 2013.

View File

@ -0,0 +1,510 @@
Multi-modal Semantic SLAM for Complex Dynamic Environments
Han Wang*, Jing Ying Ko* and Lihua Xie, Fellow, IEEE
arXiv:2205.04300v1 [cs.RO] 9 May 2022

Abstract— Simultaneous Localization and Mapping (SLAM) is one of the most essential techniques in many real-world robotic applications. The assumption of static environments is common in most SLAM algorithms, which, however, is not the case for most applications. Recent work on semantic SLAM aims to understand the objects in an environment and distinguish dynamic information from a scene context by performing image-based segmentation. However, the segmentation results are often imperfect or incomplete, which can subsequently reduce the quality of mapping and the accuracy of localization. In this paper, we present a robust multi-modal semantic framework to solve the SLAM problem in complex and highly dynamic environments. We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model. Moreover, both geometric-only clustering and visual semantic information are combined to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur. Thorough experiments have been conducted to evaluate the performance of the proposed method. The results show that our method can precisely identify dynamic objects under recognition imperfection and motion blur. Moreover, the proposed SLAM framework is able to efficiently build a static dense map at a processing rate of more than 10 Hz, which can be implemented in many practical applications. Both the training data and the proposed method are open sourced¹.

Fig. 1: System overview of the proposed multi-modal semantic SLAM. Compared to traditional semantic SLAM, we propose to use a multi-modal method to improve the efficiency and accuracy of existing SLAM methods in complex and dynamic environments. Our method significantly reduces the localization drifts caused by dynamic objects and performs dense semantic mapping in real time.
* Jing Ying Ko and Han Wang contribute equally to this paper and are considered joint first authors.
The research is supported by the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation.
Jing Ying Ko, Han Wang and Lihua Xie are with the School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. e-mail: {hwang027, E170043}@e.ntu.edu.sg; elhxie@ntu.edu.sg
¹ https://github.com/wh200720041/MMS_SLAM

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) is one of the most significant capabilities in many robot applications such as self-driving cars, unmanned aerial vehicles, etc. Over the past few decades, SLAM algorithms have been extensively studied in both visual SLAM, such as ORB-SLAM [1], and LiDAR-based SLAM, such as LOAM [2] and LeGO-LOAM [3]. Unfortunately, many existing SLAM algorithms assume the environment to be static and cannot handle dynamic environments well. The localization is often achieved via visual or geometric features such as feature points, lines and planes, without including semantic information to represent the surrounding environment, which can only work well under static environments. However, the real world is generally complex and dynamic. In the presence of moving objects, pose estimation might suffer from drifting, which may cause system failure if there are wrong correspondences or insufficient matching features [4]. The presence of dynamic objects can greatly degrade the accuracy of localization and the reliability of the mapping during the SLAM process.

Advancements in deep learning have enabled the development of various instance segmentation networks based on 2D images [5][6]. Most existing semantic SLAMs leverage the success of deep learning-based image segmentation, e.g., Dynamic-SLAM [7] and DS-SLAM [8]. However, the segmentation results are not ideal under dynamic environments. Various factors such as small-scale objects, objects under occlusion and motion blur contribute to challenges in 2D instance segmentation. For example, an object is only partially recognized under motion blur or when it is near the border of the image. These can degrade the accuracy of localization and the reliability of the mapping. Some recent works target performing deep learning on 3D point clouds to achieve semantic recognition [9][10]. However, 3D point cloud instance segmentation does not perform as well as its 2D
counterpart due to its smaller scale of training data and high computational cost. There are several reasons: 1) 3D point cloud instance segmentation such as PointGroup takes a long computation time (491 ms) [11]; 2) it is much less efficient to label a point cloud, since the geometric information is not as straightforward as the visual information; 3) it is inevitable to change the viewpoint in order to label a point cloud [12], which increases the labeling time.

In this paper, we propose a robust and computationally efficient multi-modal semantic SLAM framework to tackle the limitation of existing SLAM methods in dynamic environments. We modify the existing backbone network to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model. Moreover, we combine geometric-only clustering and visual semantic information to reduce the effect of motion blur. Eventually, the multi-modal semantic recognition is integrated into the SLAM framework, which is able to provide real-time localization in different dynamic environments. The experiment results show that the segmentation errors due to misclassification, small-scale objects and occlusion can be well solved with our proposed method. The main contributions of this paper are summarized as follows:

• We propose a robust and fast multi-modal semantic SLAM framework that targets solving the SLAM problem in complex and dynamic environments. Specifically, we combine geometric-only clustering and visual semantic information to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur.

• We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model.

• A thorough evaluation of the proposed method is presented. The results show that our method is able to provide reliable localization and a semantic dense map.

The rest of the paper is organized as follows: Section II presents an overview of the related works regarding the three main SLAM methods in dynamic environments. Section III describes the details of the proposed SLAM framework. Section IV provides quantitative and qualitative experimental results in dynamic environments. Section V concludes this paper.

II. RELATED WORK

In this section, we present the existing works that address SLAM problems in dynamic environments. The existing dynamic SLAM can be categorized into three main methods: feature consistency verification methods, deep learning-based methods and multi-modal-based methods.

A. Feature Consistency Verification

Dai et al. [13] present a segmentation method using the correlation between points to distinguish moving objects from the stationary scene, which has a low computational requirement. Lee et al. [14] introduce a real-time depth edge-based RGB-D SLAM system to deal with a dynamic environment. A static weighting method is proposed to measure the likelihood of an edge point being part of the static environment and is further used for the registration of the frame-to-keyframe point cloud. These methods generally can achieve real-time implementation without increasing the computational complexity. Additionally, they need no prior knowledge about the dynamic objects. However, they are unable to continuously track potential dynamic objects; e.g., a person who stops at a location temporarily between moves is considered a static object in their work.

B. Deep Learning-Based Dynamic SLAM

Deep learning-based dynamic SLAM usually performs better than feature consistency verification, as it provides conceptual knowledge of the surrounding environment to perform the SLAM tasks. Xun et al. [15] propose a feature-based visual SLAM algorithm based on ORB-SLAM2, where a front-end semantic segmentation network is introduced to filter out dynamic feature points and subsequently fine-tune the camera pose estimation, thus making the tracking algorithm more robust. Reference [16] combines a semantic segmentation network with a moving-consistency check method to reduce the impact of dynamic objects and generate a dense semantic octree map. A visual SLAM system proposed by [17] develops a dynamic object detector with multi-view geometry and background inpainting, which aims to estimate a static map and reuse it in long-term applications. However, Mask R-CNN is considered computationally intensive; as a result, the whole framework can only be performed offline.

Deep learning-based LiDAR SLAM in dynamic environments is relatively less popular than visual SLAM. Reference [18] integrates semantic information by using a fully convolutional neural network to embed these labels into a dense surfel-based map representation. However, the adopted segmentation network is based on 3D point clouds, which is less effective compared to 2D segmentation networks. Reference [19] develops a laser-inertial odometry and mapping method which consists of four sequential modules to perform real-time and robust pose estimation for large-scale highway environments. Reference [20] presents a dynamic-objects-free LOAM system by overlapping segmented images onto LiDAR scans. Although deep learning-based methods can effectively alleviate the impact of dynamic objects on SLAM performance, they are normally difficult to operate in real time due to the implementation of deep-learning neural networks, which possess high computational complexity.
Fig. 2: Flow chart of the proposed method. Our system consists of four modules: (a) instance segmentation module; (b) multi-modal fusion module; (c) localization module; (d) global optimization and mapping module.
C. Multi-modal-based Dynamic SLAM

Multi-modal approaches are also explored to deal with dynamic environments. Reference [21] introduces a multi-modal sensor-based semantic mapping algorithm to improve the semantic 3D map in large-scale as well as featureless environments. Although this work is similar to our proposed method, it incurs higher computational cost compared to ours. A LiDAR-camera SLAM system [22] is presented by applying a sparse subspace clustering-based motion segmentation method to build a static map in dynamic environments. Reference [23] incorporates the information of a monocular camera and a laser range finder to remove the feature outliers related to dynamic objects. However, both references [22] and [23] can only work well in low-dynamic environments.

III. METHODOLOGY

In this section, the proposed method will be discussed in detail. Fig. 2 illustrates an overview of our framework. It is mainly composed of four modules, namely the instance segmentation module, the multi-modal fusion module, the localization module and the global optimization & mapping module. The instance segmentation module uses a real-time instance segmentation network to extract the semantic information of all potential dynamic objects that are present in an RGB image. The convolutional neural network is trained offline and is later deployed online to achieve real-time performance. Concurrently, the multi-modal fusion module transfers relevant semantic data to the LiDAR through sensor fusion and subsequently uses the multi-modal information to further strengthen the segmentation results. The static information is used in the localization module to find the robot pose, while both static and dynamic information are utilized in the global optimization and mapping module to build a 3D dense semantic map.

A. Instance Segmentation & Semantic Learning

A recent 2D instance segmentation framework [24] is employed in our work due to its ability to outperform other state-of-the-art instance segmentation models in both segmentation accuracy and inference speed. Given an input image I, our adopted instance segmentation network predicts a set of {C_i, M_i}_{i=1}^{n}, where C_i is a class label, M_i is a binary mask and n is the number of instances in the image. The image is spatially separated into N × N grid cells. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the semantic category C_{ij} and the semantic mask M_{ij} of the object in the category branch B^c and the mask branch P^m respectively:

B^c(I, \theta^c) : I \to \{C_{ij} \in \mathbb{R}^{\lambda} \mid i, j = 0, 1, \ldots, N\},   (1a)
P^m(I, \theta^m) : I \to \{M_{ij} \in \mathbb{R}^{\phi} \mid i, j = 0, 1, \ldots, N\},   (1b)

where θ^c and θ^m are the parameters of the category branch B^c and the mask branch P^m respectively, λ is the number of classes, and φ is the total number of grid cells. The category branch and the mask branch are implemented with a Fully Connected Network (FCN). C_{ij} has a total of λ elements, each of which indicates the class probability of an object instance at grid cell (i, j). In parallel with the category branch, M_{ij} has a total of N² elements [24]. Each positive grid cell (i, j) generates the corresponding instance mask in the k-th element, where k = i · N + j. Since our proposed SLAM system is intentionally designed for real-world robotics applications, the computational cost of performing instance segmentation is our primary concern. Therefore, we use a light-weight version of SOLOv2 with lower accuracy to achieve real-time instance segmentation. To improve the segmentation accuracy, several methods have been implemented to build a more effective and robust feature representation discriminator in the backbone network.
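A small sketch of how the category and mask branch outputs of Eqs. (1a)-(1b) can be decoded into per-instance masks using the cell-to-channel mapping k = i · N + j; the thresholds and array shapes are illustrative assumptions, not the paper's implementation.

import numpy as np

def decode_instances(cat_probs, mask_maps, score_thresh=0.3, mask_thresh=0.5):
    """Decode SOLOv2-style outputs: cat_probs has shape (N, N, num_classes) for an
    N x N grid, mask_maps has shape (N*N, H, W); grid cell (i, j) owns mask channel
    k = i * N + j. Returns a list of (class_id, binary_mask) pairs."""
    N = cat_probs.shape[0]
    instances = []
    for i in range(N):
        for j in range(N):
            cls = int(np.argmax(cat_probs[i, j]))
            if cat_probs[i, j, cls] < score_thresh:
                continue                          # cell predicts no confident instance
            k = i * N + j                         # k-th mask channel belongs to cell (i, j)
            instances.append((cls, mask_maps[k] > mask_thresh))
    return instances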
Firstly, we modify our backbone architecture from the original Feature Pyramid Network (FPN) to the Recursive Feature Pyramid (RFP) network [25]. Theoretically, RFP instills the idea of looking twice or more by integrating additional feedback from the FPN into the bottom-up backbone layers. This recursively strengthens the existing FPN and provides increasingly stronger feature representations. By offsetting the richer information against the small receptive field in the lower-level feature maps, we are able to improve the segmentation performance on small objects. Meanwhile, the ability of RFP to adaptively strengthen and suppress neuron activation enables the instance segmentation network to handle occluded objects more efficiently. On the other hand, we replace the convolutional layers in the backbone architecture with Switchable Atrous Convolution (SAC). SAC operates as a soft switch function, which is used to collect the outputs of convolutional computation with different atrous rates. Therefore, we are able to learn the optimal coefficients from SAC and can adaptively select the size of the receptive field. This allows SOLOv2 to efficiently extract important spatial information.

Fig. 3: Comparison of the original SOLOv2 with the proposed method (SOLOv2 with DetectoRS). Our segmentation results achieve higher accuracy: in (1b), our method preserves a more detailed mask for the rider on a motorcycle compared to the SOLOv2 result in (1a); in (2b), we can handle the occluded object, which is not detected in (2a); in (3b), our method accurately predicts the mask for a handbag compared to (3a).

The outputs are pixel-wise instance masks for each dynamic object, as well as their corresponding bounding boxes and class types. To better integrate the dynamic information into the SLAM algorithm, the output binary masks are transformed into a single image containing all pixel-wise instance masks in the scene. A pixel onto which a mask falls is considered to be in the "dynamic state" and otherwise in the "static state". The binary mask is then passed to the semantic fusion module to generate a 3D dynamic mask.

B. Multi-Modal Fusion

1) Motion Blur Compensation: Instance segmentation has achieved good performance on public datasets such as the COCO dataset and the Objects365 dataset [24][26]. However, in practice the target may be partially recognized or incomplete due to motion blur on moving objects, resulting in ambiguous boundaries of a moving object. Moreover, the motion blur effect is further enlarged when projecting the 2D pixel-wise semantic mask of a dynamic object to a 3D semantic label, leading to point misalignment and inconsistency in feature point extraction. In the experiments, we find that the ambiguous boundaries of dynamic targets degrade the localization accuracy and produce noise when performing the mapping task. Therefore, we first apply morphological dilation, convolving the 2D pixel-wise mask image with a structuring element to gradually expand the boundaries of the dynamic object regions. The morphological dilation result marks the ambiguous boundaries around the dynamic objects. We take both the dynamic objects and their boundaries as the dynamic information, which will be further refined in the multi-modal fusion section.
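A minimal OpenCV sketch of the dilation step, assuming a 2D binary mask; the structuring-element size is an assumption since the paper does not state it.

import cv2
import numpy as np

def expand_dynamic_mask(mask, kernel_size=15, iterations=1):
    """Morphologically dilate the dynamic-object mask so that the blurred, ambiguous
    boundary around a moving object is also treated as dynamic. The kernel size is
    illustrative only."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    dilated = cv2.dilate(mask.astype(np.uint8), kernel, iterations=iterations)
    boundary = cv2.subtract(dilated, mask.astype(np.uint8))   # the expanded ring
    return dilated, boundary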
2) Geometric Clustering & Semantic Fusion: Compensation via connectivity analysis in Euclidean space [27] is also implemented in our work. The instance segmentation network has excellent recognition capability in most practical situations; however, motion blur limits the segmentation performance due to ambiguous pixels between regions, leading to undesirable segmentation errors. Therefore, we combine the point cloud clustering results and the segmentation results to better refine the dynamic objects. In particular, we perform the connectivity analysis on the geometric information and merge it with the vision-based segmentation results.

A raw LiDAR scan often contains tens of thousands of points. To increase the efficiency of our work, the 3D point cloud is first downsized to reduce the scale of the data and used as the input for point cloud clustering. Then the instance segmentation results are projected into the point cloud coordinates to label each point. A point cloud cluster is considered a dynamic cluster when most of its points (90%) are dynamically labelled points. A static point is re-labeled with the dynamic tag when it is close to a dynamic point cluster, and a dynamic point is re-labelled when there is no dynamic point cluster nearby.
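The cluster-level fusion rule can be sketched as follows, assuming per-point cluster IDs from Euclidean clustering and per-point dynamic labels projected from the image mask; the 90% vote comes from the text, while the distance threshold is an assumption.

import numpy as np

def fuse_labels(points, cluster_ids, projected_dynamic, dist_thresh=0.3):
    """Cluster-level fusion of visual and geometric cues: a cluster is dynamic if
    >= 90% of its points carry a projected dynamic label; static points close to a
    dynamic cluster are relabeled dynamic, and dynamic points with no dynamic cluster
    nearby are relabeled static. points: N x 3 array; dist_thresh in metres (illustrative)."""
    dynamic = projected_dynamic.astype(bool).copy()
    dyn_cluster_pts = []
    for cid in np.unique(cluster_ids):
        idx = np.where(cluster_ids == cid)[0]
        if dynamic[idx].mean() >= 0.9:
            dynamic[idx] = True                    # whole cluster marked dynamic
            dyn_cluster_pts.append(points[idx])
    if not dyn_cluster_pts:
        return np.zeros_like(dynamic)              # no confirmed dynamic cluster at all
    dyn_pts = np.vstack(dyn_cluster_pts)
    dists = np.linalg.norm(points[:, None, :] - dyn_pts[None, :, :], axis=2).min(axis=1)
    dynamic[dists < dist_thresh] = True            # static point near a dynamic cluster
    dynamic[dists >= dist_thresh] = False          # isolated dynamic point relabeled static
    return dynamic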
C. Localization & Pose Estimation

1) Feature Extraction: After applying multi-modal dynamic segmentation, the point cloud is divided into a dynamic point cloud P_D and a static point cloud P_S. The static point cloud is subsequently used for the localization and mapping module based on our previous work [28]. Compared to existing SLAM approaches such as LOAM [2], the proposed framework in [28] is able to support real-time performance at 30 Hz, which is a few times faster. It is also resistant to illumination variation compared to visual SLAMs such as ORB-SLAM [1] and VINS-MONO [29]. For each static point p_k ∈ P_S, we can search for its nearby static point set S_k by radius search in Euclidean space. Let |S| be the cardinality of a set S; the local smoothness is then defined by

\sigma_k = \frac{1}{|S_k|} \sum_{p_i \in S_k} \left( \|p_k\| - \|p_i\| \right).   (2)

The edge features are defined by the points with large σ_k and the planar features are defined by the points with small σ_k.

2) Data Association: The final robot pose is calculated by minimizing the point-to-edge and point-to-plane distances. An edge feature point p_E ∈ P_E can be transformed into the local map coordinate frame by p̂_E = T · p_E, where T ∈ SE(3) is the current pose. We search for the 2 nearest edge features p_E^1 and p_E^2 in the local edge feature map, and the point-to-edge residual is defined by [28]

f_E(\hat{p}_E) = \frac{\| (\hat{p}_E - p_E^1) \times (\hat{p}_E - p_E^2) \|}{\| p_E^1 - p_E^2 \|},   (3)

where × denotes the cross product. Similarly, given a planar feature point p_L ∈ P_L and its transformed point p̂_L = T · p_L, we search for the 3 nearest points p_L^1, p_L^2 and p_L^3 in the local planar map. The point-to-plane residual is defined by

f_L(\hat{p}_L) = \frac{\left| (\hat{p}_L - p_L^1)^{T} \cdot \left( (p_L^1 - p_L^2) \times (p_L^1 - p_L^3) \right) \right|}{\| (p_L^1 - p_L^2) \times (p_L^1 - p_L^3) \|}.   (4)

3) Pose Estimation: The final robot pose is calculated by minimizing the sum of the point-to-edge and point-to-plane residuals:

T = \arg\min_{T} \sum_{p_E \in P_E} f_E(\hat{p}_E) + \sum_{p_L \in P_L} f_L(\hat{p}_L).   (5)

This non-linear optimization problem can be solved by the Gauss-Newton method, and we can derive an optimal robot pose based on the static information.
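A minimal numpy restatement of the residuals in Eqs. (3)-(4); the full pose estimate additionally minimizes the sum in Eq. (5) with Gauss-Newton, which is omitted here.

import numpy as np

def point_to_edge_residual(p_hat, e1, e2):
    """Eq. (3): distance from the transformed point p_hat to the line through
    the two nearest edge features e1 and e2 (all 3-vectors)."""
    return np.linalg.norm(np.cross(p_hat - e1, p_hat - e2)) / np.linalg.norm(e1 - e2)

def point_to_plane_residual(p_hat, q1, q2, q3):
    """Eq. (4): distance from the transformed point p_hat to the plane spanned by
    the three nearest planar features q1, q2 and q3."""
    n = np.cross(q1 - q2, q1 - q3)
    return abs(np.dot(p_hat - q1, n)) / np.linalg.norm(n)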
4) Feature Map Update & Key Frame Selection: Once the optimal pose is derived, the features are updated into the local edge map and the local plane map respectively, which will be used for data association on the next frame. Note that building and updating a global dense map is often computationally costly. Hence, the global static map is updated based on keyframes. A keyframe is selected when the translational change of the robot pose is greater than a predefined translation threshold, or when the rotational change of the robot pose is greater than a predefined rotation threshold.

Fig. 4: Different types of AGVs used in our warehouse environment: (a) the grabbing AGV with a robot arm; (b) forklift AGV; (c) scanning AGV; (d) the Pioneer robot; (e) the transportation AGV with a conveyor belt; (f) warehouse environment.

D. Global Map Building

The semantic map is separated into a static map and a dynamic map. Note that the visual information given previously is also used to construct the colored dense static map. Specifically, the visual information can be obtained by re-projecting 3D points into the image plane. After each update, the map is down-sampled by using a 3D voxelized grid approach [30] in order to prevent memory overflow. The dynamic map is built from P_D and is used to reveal the dynamic objects. The dynamic information can be used for high-level tasks such as motion planning.

IV. EXPERIMENT EVALUATION

In this section, experimental results will be presented to demonstrate the effectiveness of our proposed method. First, our experimental setup will be discussed in detail. Second, we elaborate on how we acquire the data of potential moving objects in a warehouse environment. Third, we evaluate the segmentation performance of our adopted instance segmentation model. Subsequently, we explain how we perform dense mapping and dynamic tracking. Lastly, we evaluate the performance of our proposed method regarding localization drifts under dynamic environments.

A. Experimental Setup

For our experimental setup, the Robot Operating System (ROS) is utilized as the interface for the integration of the semantic learning module and the SLAM algorithm, as shown in Fig. 1. An Intel RealSense LiDAR camera L515 is used to capture RGB images and point clouds at a fixed frame rate. All the experiments are performed on a computer with an Intel i7 CPU and an Nvidia GeForce RTX 2080 Ti GPU.
Fig. 5: Static map creation and final semantic mapping result: (a) static map built by the proposed SLAM framework; (b)
final semantic mapping result. The instance segmentation is shown on the left. Human operators are labeled by red bounding
boxes and AGVs are labeled by green bounding boxes.
B. Data Acquisition

Humans are often considered dynamic objects in many scenarios such as autonomous driving and smart warehouse logistics. Therefore we choose 5,000 human images from the COCO dataset. In the experiment, the proposed method is evaluated in the warehouse environment shown in Fig. 4. Other than considering humans as dynamic objects, an advanced factory requires human-to-robot and robot-to-robot collaboration, so the Automated Guided Vehicles (AGVs) are also potential dynamic objects. Hence a total of 3,000 AGV images are collected to train the instance segmentation network, and some of the AGVs are shown in Fig. 4.

In order to solve the small-dataset problem, we implement the copy-paste augmentation method proposed by [31] to enhance the generalization ability of the network and directly improve its robustness. To be specific, this method generates new images by applying random scale jittering to two random training samples and randomly choosing a subset of object instances from one image to paste onto the other image.
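A schematic sketch of the copy-paste composite in the spirit of [31], assuming both images have the same size and a non-empty mask list; the random scale-jittering step is omitted for brevity and the subset-selection policy is an assumption.

import numpy as np

def copy_paste(img_a, masks_a, img_b, rng=None):
    """Minimal copy-paste composite: choose a random subset of instance masks from
    image A and paste those pixels onto image B (same resolution assumed)."""
    if rng is None:
        rng = np.random.default_rng()
    out = img_b.copy()
    pasted = []
    n = int(rng.integers(1, len(masks_a) + 1))
    for m in rng.choice(len(masks_a), size=n, replace=False):
        ys, xs = np.where(masks_a[m])
        out[ys, xs] = img_a[ys, xs]        # overwrite background with the pasted instance
        pasted.append(masks_a[m])
    return out, pasted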
C. Evaluation on Instance Segmentation Performance

In this part, we evaluate the segmentation performance on the COCO dataset with regard to the segmentation loss and the mean Average Precision (mAP). The purpose of this evaluation is to compare our adopted instance segmentation network, SOLOv2, with the proposed method. The results are illustrated in Table I. Our adopted instance segmentation network, SOLOv2, is built based on MMDetection 2.0 [32], an open-source object detection toolbox based on PyTorch. We trained SOLOv2 on the COCO dataset, which consists of 81 classes. We choose ResNet-50 as our backbone architecture since this configuration satisfies our requirements for real-world robotics applications. Instead of training the network from scratch, we make use of the parameters of ResNet-50 pre-trained on ImageNet. For a fair comparison, all models are trained under the same configuration: they are trained with synchronized stochastic gradient descent with a total of 8 images per mini-batch for 36 epochs.

For SOLOv2 with the Recursive Feature Pyramid (RFP), we modify our backbone architecture from the Feature Pyramid Network (FPN) to the RFP network. In this experiment, we only set the number of stages to 2, allowing SOLOv2 to look at the image twice. As illustrated in Table I, the implementation of the RFP network brings a significant improvement in segmentation performance. On the other hand, we replace all 3x3 convolutional layers in the backbone network with Switchable Atrous Convolution (SAC), which increases the segmentation accuracy by 2.3%. By implementing both SAC and the RFP network in SOLOv2, the segmentation performance is further improved by 5.9% with only a 17 ms increase in inference time. Overall, SOLOv2 learns to look at the image twice with adaptive receptive fields, and is therefore able to highlight important semantic information for the instance segmentation network. The segmentation results are further visualized in Fig. 3.

TABLE I: Performance comparison of instance segmentation.

Model                       Segmentation Loss   Mean AP (%)   Inference Time (ms)
SOLOv2                      0.52                38.8          54.0
SOLOv2 + RFP                0.36                41.2          64.0
SOLOv2 + SAC                0.39                39.8          59.0
SOLOv2 + DetectoRS (Ours)   0.29                43.4          71.0

TABLE II: Ablation study of localization drifts under dynamic environments.

Methods                             ATDE (cm)   MTDE (cm)
W/O Semantic Recognition            4.834       1.877
Vision-based Semantic Recognition   1.273       0.667
Multi-Modal Recognition (Ours)      0.875       0.502
Fig. 6: Localization comparison in a dynamic environment. The ground truth, the original localization result without filtering and the localization result with our proposed multi-modal semantic filtering are plotted in red, green and orange respectively.

Fig. 7: Ablation study of localization drifts: (a) original image view; (b) the visual semantic recognition result based on the proposed method; (c) localization drifts observed due to the moving objects. The localization drifts are highlighted with a red circle.

D. Dense Mapping and Dynamic Tracking

To evaluate the performance of our multi-modal semantic SLAM in dynamic environments, the proposed method is implemented on the warehouse AGVs shown in Fig. 4. In a smart manufacturing factory, both human operators and different types of AGVs (e.g., forklift AGVs, transportation AGVs and robot-arm-equipped AGVs) are supposed to work in a collaborative manner. Therefore, the capability of each AGV to localize itself among moving human operators and other AGVs is an essential technology towards Industry 4.0. In many warehouse environments, the remaining objects, such as operating machines or tables, can be taken as the static environment. Hence we only consider humans and AGVs as dynamic objects in order to reduce the computational cost. In the experiment, an AGV is manually controlled to move around and build the warehouse environment map simultaneously, while human operators walk frequently in the warehouse. The localization result is shown in Fig. 6, where we compare the ground truth, the proposed SLAM method and the original SLAM without our filtering approach. It can be seen that when the dynamic object appears (in blue), the proposed multi-modal semantic SLAM is more robust and stable than the traditional SLAM. The mapping results are shown in Fig. 5. The proposed method is able to efficiently identify the potential dynamic objects and separate them from the static map. Although the human operators walk frequently in front of the robot, they are completely removed from the static map. All potential dynamic objects are enclosed by bounding boxes and added into a final semantic map to visualize the status of each object in real time, where the moving human is colored in red and the AGVs are colored in green. Our method is able to identify and locate multiple targets in the complex dynamic environment.

E. Ablation Study of Localization Drifts

To further evaluate the performance of localization under dynamic profiles, we compare the localization drifts of different dynamic filtering approaches. First, we keep the robot still and let a human operator walk frequently in front of the robot. The localization drifts are recorded in order to evaluate the performance under dynamic objects. Specifically, we calculate the Average Translational Drift Error (ATDE) and the Maximum Translational Drift Error (MTDE) to verify the localization, where the ATDE is the average translational error of each frame and the MTDE is the maximum translational drift caused by the walking human. The results are shown in Table II. We first remove the semantic recognition module from the SLAM and evaluate the performance. Then we use visual semantic recognition (SOLOv2) to remove the dynamic information. The results are compared with the proposed multi-modal semantic SLAM. It can be seen that, compared to the original SLAM, the proposed method significantly reduces the localization drift. Compared to the vision-only filtering method, the proposed multi-modal semantic SLAM is more stable and accurate in the presence of dynamic objects.
The mapping results are shown in Fig. 5. The proposed approach aims to provide a modular pipeline to allow real-
method is able to efficiently identify the potential dynamic world applications in dynamic environments. Meanwhile, a
objects and separate them from the static map. Although the 3D dense stationary map is constructed with the removal
human operators are walking frequently in front of the robot, of dynamic information. To verify the effectiveness of the
they are totally removed from the static map. All potential proposed method in a dynamic complex environment, our
dynamic objects are enclosed by bounding boxes and are method is evaluated on warehouse AGVs used for smart
added into a final semantic map to visualize the status of manufacturing. The results show that our proposed method
each object in real time, where the moving human is colored can significantly improve the existing semantic SLAM algo-
in red and the AGVs are colored in green. Our method is rithm in terms of robustness and accuracy.
able to identify and locate multiple targets in the complex
dynamic environment.
View File
@ -0,0 +1,478 @@
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 9573
RGB-D Inertial Odometry for a Resource-Restricted
Robot in Dynamic Environments
Jianheng Liu, Xuanfu Li, Yueqian Liu, and Haoyao Chen, Member, IEEE

Manuscript received 25 February 2022; accepted 20 June 2022. Date of publication 15 July 2022; date of current version 26 July 2022. This letter was recommended for publication by Associate Editor L. Paull and Editor J. Civera upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grants U21A20119 and U1713206 and in part by the Shenzhen Science and Innovation Committee under Grants JCYJ20200109113412326, JCYJ20210324120400003, JCYJ20180507183837726, and JCYJ20180507183456108. (Corresponding Author: Haoyao Chen.) Jianheng Liu, Yueqian Liu, and Haoyao Chen are with the School of Mechanical Engineering and Automation, Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong 518055, China (e-mail: liujianhengchris@qq.com; yueqianliu@outlook.com; hychen5@hit.edu.cn). Xuanfu Li is with the Department of HiSilicon Research, Huawei Technology Co., Ltd, Shenzhen, Guangdong 518129, China (e-mail: lixuanfu@huawei.com). This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3191193, provided by the authors. Digital Object Identifier 10.1109/LRA.2022.3191193

Abstract—Current simultaneous localization and mapping (SLAM) algorithms perform well in static environments but easily fail in dynamic environments. Recent works introduce deep learning-based semantic information to SLAM systems to reduce the influence of dynamic objects. However, it is still challenging to apply a robust localization in dynamic environments for resource-restricted robots. This paper proposes a real-time RGB-D inertial odometry system for resource-restricted robots in dynamic environments named Dynamic-VINS. Three main threads run in parallel: object detection, feature tracking, and state optimization. The proposed Dynamic-VINS combines object detection and depth information for dynamic feature recognition and achieves performance comparable to semantic segmentation. Dynamic-VINS adopts grid-based feature detection and proposes a fast and efficient method to extract high-quality FAST feature points. The IMU is applied to predict motion for feature tracking and moving consistency check. The proposed method is evaluated on both public datasets and real-world applications and shows competitive localization accuracy and robustness in dynamic environments. Yet, to the best of our knowledge, it is the best-performance real-time RGB-D inertial odometry for resource-restricted platforms in dynamic environments for now. The proposed system is open source at: https://github.com/HITSZ-NRSL/Dynamic-VINS.git

Index Terms—Localization, visual-inertial SLAM.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) is a foundational capability for many emerging applications, such as autonomous mobile robots and augmented reality. Cameras as portable sensors are commonly equipped on mobile robots and devices. Therefore, visual SLAM (vSLAM) has received tremendous attention over the past decades. Lots of works [1]–[4] are proposed to improve visual SLAM systems' performance. Most of the existing vSLAM systems depend on a static world assumption. Stable features in the environment are used to form a solid constraint for Bundle Adjustment [5]. However, in real-world scenarios like shopping malls and subways, dynamic objects such as moving people, vehicles, and unknown objects have an adverse impact on pose optimization. Although some approaches like RANSAC [6] can suppress the influence of dynamic features to a certain extent, they become overwhelmed when a vast number of dynamic objects appear in the scene.

Therefore, it is necessary for the system to reduce dynamic objects' influence on the estimation results consciously. Pure geometric methods [7]–[9] are widely used to handle dynamic objects, but they are unable to cope with latent or slightly moving objects. With the development of deep learning, many researchers have tried combining multi-view geometric methods with semantic information [10]–[13] to implement a robust SLAM system in dynamic environments. To avoid the accidental deletion of stable features through object detection [14], recent dynamic SLAM systems [15], [16] exploit the advantages of pixel-wise semantic segmentation for a better recognition of dynamic features. Due to the expensive computing resource consumption of semantic segmentation, it is difficult for a semantic-segmentation-based SLAM system to run in real-time. Therefore, some researchers have tried to perform semantic segmentation only on keyframes and track moving objects via moving probability propagation [17], [18] or a direct method [19] on each frame. In the cases of missed detections or object tracking failures, the pose optimization is imprecise. Moreover, since semantic segmentation is performed after keyframe selection, real-time precise pose estimation is inaccessible, and unstable dynamic features in the original frame may also cause redundant keyframe creation and unnecessary computational burdens.

The above systems still require too many computing resources to perform robust real-time localization in dynamic environments for Size, Weight, and Power (SWaP) restricted mobile robots or devices. Some researchers [20]–[22] try to run visual odometry in real-time on embedded computing devices, yet keyframe-based visual odometry is not performed [23], which makes their accuracy unsatisfactory. At the same time, increasingly many embedded computing platforms are equipped with NPU/GPU computing units, such as HUAWEI Atlas200, NVIDIA Jetson, etc. This enables lightweight deep learning networks to run on the embedded computing platform in real-time.
Some studies [14], [24] implemented a keyframe-based dynamic SLAM system running on embedded computing platforms. However, these works are still difficult to balance efficiency and accuracy for mobile robot applications.

To address all these issues, this paper proposes a real-time RGB-D inertial odometry for resource-restricted robots in dynamic environments named Dynamic-VINS. It enables edge computing devices to provide instant robust state feedback for mobile platforms with little computation burden. An efficient dynamic feature recognition module that does not require a high-precision depth camera can be used in mobile devices equipped with depth-measure modules. The main contributions of this paper are as follows:

1) An efficient optimization-based RGB-D inertial odometry is proposed to provide real-time state estimation results for resource-restricted robots in dynamic and complex environments.
2) Lightweight feature detection and tracking are proposed to cut the computing burden. In addition, dynamic feature recognition modules combining object detection and depth information are proposed to provide robust dynamic feature recognition in complex and outdoor environments.
3) Validation experiments are performed to show the proposed system's competitive accuracy, robustness, and efficiency on resource-restricted platforms in dynamic environments.

II. SYSTEM OVERVIEW

The proposed SLAM system in this paper is extended based on VINS-Mono [2] and VINS-RGBD [25]; our framework is shown in Fig. 1, and the contributing modules are highlighted with different colors. For efficiency, three main threads (surrounded by dash lines) run in parallel in Dynamic-VINS: object detection, feature tracking, and state optimization. Color images are passed to both the object detection thread and the feature tracking thread. IMU measurements between two consecutive frames are preintegrated [26] for feature tracking, moving consistency check, and state optimization.

Fig. 1. The framework of Dynamic-VINS. The contributing modules are highlighted and surrounded by dash lines with different colors. Three main threads run in parallel in Dynamic-VINS. Features are tracked and detected in the feature tracking thread. The object detection thread detects dynamic objects in each frame in real-time. The state optimization thread summarizes the features' information, object detection results, and depth image to recognize the dynamic features. Finally, stable features and IMU preintegration results are used for pose estimation.

In the feature tracking thread, features are tracked with the help of IMU preintegration and detected by grid-based feature detection. The object detection thread detects dynamic objects in each frame in real-time. Then, the state optimization thread summarizes the features' information, object detection results, and depth image to recognize the dynamic features. A missed detection compensation module is conducted in case of missed detection. The moving consistency check procedure combines the IMU preintegration and historical pose estimation results to identify potential dynamic features. Finally, stable features and IMU preintegration results are used for the pose estimation, and the propagation of the IMU is responsible for an IMU-rate pose estimation result. Loop closure is also supported in this system, but this paper pays more attention to the localization independent of loop closure.

III. METHODOLOGY

This study proposes lightweight, high-quality feature tracking and detection methods to accelerate the system. Semantic and geometry information from the input RGB-D images and IMU preintegration are applied for dynamic feature recognition and moving consistency check. The missed detection compensation module plays a subsidiary role to object detection in case of missed detection. Dynamic features on unknown objects are further identified by the moving consistency check. The proposed methods are divided into five parts for a detailed description.

A. Feature Matching

For each incoming image, the feature points are tracked using the KLT sparse optical flow method [27]. In this paper, the IMU measurements between frames are used to predict the motion of features. A better initial position estimation of features is provided to improve the efficiency of feature tracking by reducing optical flow pyramid layers. It can effectively discard unstable features such as noise and dynamic features with inconsistent motion. The basic idea is illustrated in Fig. 2.

Fig. 2. Illustration of feature tracking and detection. Stable features and new features are colored red and blue, respectively. The green circles denote the prediction for optical flow. The successfully tracked features turn red; otherwise, the features turn purple. The orange and purple dash-line circles as masks are set for a uniform feature distribution and reliable feature detection. New feature points are detected from unmasked areas in the current frame.

In the previous frame, stable features are colored red, and newly detected features are colored blue. When the current frame arrives, the IMU measurements between the current and previous frames are used to predict the feature positions (green) in the current frame. Optical flow uses the predicted feature position as the initial position to look for a matched feature in the current frame. The successfully tracked features are turned red, while those that failed to be tracked are marked as unstable features (purple). In order to avoid the repetition and aggregation of feature detection, an orange circular mask centered on each stable feature is set; the region where the unstable features are located is considered an unstable feature detection region and masked with a purple circle to avoid unstable feature detection. According to the mask, new features are detected from unmasked areas in the current frame and colored blue.

The above means can obtain uniformly distributed features to capture comprehensive constraints and avoid repeatedly extracting unstable features on areas with blur or weak texture. Long-term feature tracking can reduce the time consumption with the help of grid-based feature detection in the following.
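To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of KLT tracking seeded with externally predicted feature positions, in the spirit of the IMU-aided prediction described above: it uses OpenCV's calcOpticalFlowPyrLK with the OPTFLOW_USE_INITIAL_FLOW flag, and predicted_pts stands in for the IMU-based motion prediction.

```python
import cv2
import numpy as np

def track_with_prediction(prev_img, cur_img, prev_pts, predicted_pts):
    """Track prev_pts from prev_img into cur_img, seeding KLT with predictions.

    prev_pts, predicted_pts: (N, 1, 2) float32 pixel coordinates.
    A good initial guess allows fewer pyramid levels (maxLevel=1 here).
    Returns the successfully tracked point pairs.
    """
    cur_pts = predicted_pts.astype(np.float32).copy()
    cur_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img,
        prev_pts.astype(np.float32), cur_pts,
        winSize=(21, 21), maxLevel=1,
        flags=cv2.OPTFLOW_USE_INITIAL_FLOW)
    good = status.reshape(-1) == 1
    return cur_pts[good], prev_pts[good]
```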
B. Grid-Based Feature Detection

The system maintains a minimum number of features for stability. Therefore, feature points need to be extracted from the frame constantly. This study adopts grid-based feature detection. The image is divided into grids, and the boundary of each grid is padded to prevent the features at the edge of the grid from being ignored; the padding enables the current grid to obtain adjacent pixel information for feature detection. Unlike traversing the whole image to detect features, only the grids with insufficient matched features will conduct feature detection. A grid cell that fails to detect features due to weak texture or is covered by the mask will be skipped in the next detection frame to avoid repeated useless detection. The thread pool technique is used to exploit the parallel performance of grid-based feature detection. Thus, the time consumption of feature detection is significantly reduced without loss.

The FAST feature detector [28] can efficiently extract feature points but easily treats noise as features and extracts similar clustered features. Therefore, the ideas of the mask in Section III-A and Non-Maximum Suppression are combined to select high-quality and uniformly distributed FAST features.
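A rough sketch of grid-based detection along these lines is shown below (an illustration under stated assumptions, not the authors' exact code): the grid size, padding, per-cell quota, and mask handling are illustrative, and in practice only cells with too few tracked features would be processed.

```python
import cv2

def grid_fast_detect(gray, mask, grid=(7, 8), pad=4, per_cell=1, fast_thresh=20):
    """Detect FAST corners cell by cell on a padded grid.

    gray: uint8 grayscale image; mask: uint8 image, 0 where detection is
    forbidden (around tracked / unstable features), 255 elsewhere.
    Returns (x, y) locations, keeping only the strongest responses per cell.
    """
    h, w = gray.shape
    rows, cols = grid
    cell_h, cell_w = h // rows, w // cols
    fast = cv2.FastFeatureDetector_create(threshold=fast_thresh,
                                          nonmaxSuppression=True)
    new_pts = []
    for r in range(rows):
        for c in range(cols):
            # Pad the cell so corners on the cell border are not missed.
            y0, y1 = max(0, r * cell_h - pad), min(h, (r + 1) * cell_h + pad)
            x0, x1 = max(0, c * cell_w - pad), min(w, (c + 1) * cell_w + pad)
            kps = fast.detect(gray[y0:y1, x0:x1], mask[y0:y1, x0:x1])
            # Keep only the strongest responses in this cell (a crude NMS stand-in).
            kps = sorted(kps, key=lambda k: k.response, reverse=True)[:per_cell]
            new_pts.extend((k.pt[0] + x0, k.pt[1] + y0) for k in kps)
    return new_pts
```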
C. Dynamic Feature Recognition

Most feature points can be stably tracked through the above improvements. However, long-term tracking features on dynamic objects always come with abnormal motion and introduce wrong constraints to the system. For the sake of efficiency and computational cost, a real-time single-stage object detection method, YOLOv3 [11], is used to detect many kinds of dynamic scene elements like people and vehicles. If a detected bounding box covers a large region of the image, blindly deleting feature points in the bounding box might result in no available features to provide constraints. Therefore, semantic-segmentation-like masks are helpful to keep the system running by tracking features not occluded by dynamic objects.

This paper combines object detection and depth information for highly efficient dynamic feature recognition to achieve performance comparable to semantic segmentation. The farther the depth camera measures, the worse its accuracy is. This problem makes some methods, such as Seed Filling, DBSCAN, and K-Means, which make full use of the depth information, exhibit poor performance with a low-accuracy depth camera, as shown in Fig. 5(a). Therefore, a set of points in the detected bounding box and depth information are integrated to obtain comparable performance to semantic segmentation, as illustrated in Fig. 3.

Fig. 3. Illustration of semantic mask setting for dynamic feature recognition when all pixels' depth is available (d > 0). The left scene represents the case when an object bounding box's farthest corner depth is bigger than the center depth by more than a threshold, and a semantic mask with weighted depth is set between them to separate features on dynamic objects from the background. Otherwise, the semantic mask is set behind the bounding box's center at the distance of $\epsilon$, as shown on the right.

A pixel's depth $d$ is available if $d > 0$; otherwise, $d = 0$. Considering that the bounding box corners of most dynamic objects correspond to background points, and that dynamic objects commonly have a relatively large depth gap with the background, the $K$-th dynamic object's largest background depth ${}^{K}d_{max}$ is obtained as follows:

$$ {}^{K}d_{max} = \max\left({}^{K}d_{tl},\; {}^{K}d_{tr},\; {}^{K}d_{bl},\; {}^{K}d_{br}\right), \tag{1} $$

where ${}^{K}d_{tl}$, ${}^{K}d_{tr}$, ${}^{K}d_{bl}$, and ${}^{K}d_{br}$ are the depth values of the $K$-th object detection bounding box's corners, respectively. Next, the $K$-th bounding box's depth threshold ${}^{K}\bar{d}$ is defined as

$$ {}^{K}\bar{d} = \begin{cases} \dfrac{1}{2}\left({}^{K}d_{max} + {}^{K}d_{c}\right), & \text{if } {}^{K}d_{max} - {}^{K}d_{c} > \epsilon,\; {}^{K}d_{c} > 0,\\ {}^{K}d_{c} + \epsilon, & \text{if } {}^{K}d_{max} - {}^{K}d_{c} \le \epsilon,\; {}^{K}d_{c} > 0,\\ {}^{K}d_{max}, & \text{if } {}^{K}d_{max} > 0,\; {}^{K}d_{c} = 0,\\ +\infty, & \text{otherwise}, \end{cases} \tag{2} $$

where ${}^{K}d_{c}$ is the depth value of the bounding box's center and $\epsilon > 0$ is a predefined distance chosen according to the size of the most common dynamic objects in the scenes. The depth threshold ${}^{K}\bar{d}$ is defined in the middle of the center's depth ${}^{K}d_{c}$ and the deepest background depth ${}^{K}d_{max}$. When the dynamic object has a close connection with the background or is behind an object (${}^{K}d_{max} - {}^{K}d_{c} \le \epsilon$), the depth threshold is defined at distance $\epsilon$ from the dynamic object. If the depth is unavailable, a conservative strategy is adopted to choose an infinite depth as the threshold.
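A compact sketch of this thresholding logic is given below (our paraphrase of Eqs. (1)–(2), not the released code); the symbol eps, the box representation, and the handling of the boundary case in the second branch are assumptions made for illustration.

```python
import math

def depth_threshold(d_tl, d_tr, d_bl, d_br, d_c, eps):
    """Weighted depth threshold of one bounding box, following Eqs. (1)-(2).

    Corner depths are assumed to hit the background and the center depth the
    object; eps is the predefined margin for common dynamic-object sizes.
    """
    d_max = max(d_tl, d_tr, d_bl, d_br)          # Eq. (1): deepest corner
    if d_c > 0 and d_max - d_c > eps:
        return 0.5 * (d_max + d_c)               # halfway between object and background
    if d_c > 0 and d_max - d_c <= eps:
        return d_c + eps                         # object touches / sits behind background
    if d_max > 0 and d_c == 0:
        return d_max                             # center depth missing: fall back to corners
    return math.inf                              # no usable depth: treat whole box as dynamic

def is_dynamic(u, v, d, box_xyxy, d_bar):
    """Flag a feature (u, v) with depth d as dynamic if it lies inside the
    detected box and is closer than the box's threshold (d < d_bar)."""
    x_min, y_min, x_max, y_max = box_xyxy
    inside = x_min <= u <= x_max and y_min <= v <= y_max
    return inside and 0 < d < d_bar              # requiring d > 0 is an assumption here
```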
On the semantic mask, the area covered by the $K$-th dynamic object bounding box is set to the weighted depth ${}^{K}\bar{d}$; the area without dynamic objects is set to 0. Each incoming feature's depth $d$ is compared with the corresponding pixel's depth threshold $\bar{d}$ on the semantic mask. If $d < \bar{d}$, the feature is considered a dynamic one. Otherwise, the feature is considered a stable one. Therefore, the region where the depth value is smaller than the weighted depth $\bar{d}$ constitutes the generalized semantic mask, as shown in Figs. 4 and 5(b).

Fig. 4. Results of missed detection compensation. The dynamic feature recognition results are shown in the first row. The green box shows the dynamic object's position from the object detection results. The second row shows the generated semantic mask. With the help of missed detection compensation, even if object detection failed in (b) and (d), a semantic mask including all dynamic objects could be built.

Fig. 5. Results of dynamic feature recognition. The stable features are circled in yellow. The dynamic feature recognition results generated by Seed Filling and the proposed method are shown in (a) and (b), respectively. The weighted depth $\bar{d}$ is colored gray; the brighter, the bigger the value. The feature points on the white area will be marked as dynamic features.

Considering that dynamic objects may exist in the field of view for a long time, the dynamic features are tracked but not used for pose estimation, different from directly deleting dynamic features. According to its recorded information, each incoming feature point from the feature tracking thread will be judged on whether it is a historical dynamic feature or not. The above methods can avoid blindly deleting feature points while ensuring efficiency. They save the time of detecting features on dynamic objects, are robust to missed detections of object detection, and recycle false-positive dynamic features, as illustrated in Section III-E.

D. Missed Detection Compensation

Since object detection might sometimes fail, the proposed Dynamic-VINS utilizes the previous detection results to predict the following detection result to compensate for missed detections. It is assumed that the dynamic objects in adjacent frames have a consistent motion. Once a dynamic object is detected, its pixel velocity and bounding box will be updated. Assuming that $j$ is the current detected frame and $j-1$ is the previous detected frame, the pixel velocity ${}^{K}v_{c_j}$ (pixel/frame) of the $K$-th dynamic object between frames is defined as

$$ {}^{K}v_{c_j} = {}^{K}u^{c}_{c_j} - {}^{K}u^{c}_{c_{j-1}}, \tag{3} $$

where ${}^{K}u^{c}_{c_j}$ and ${}^{K}u^{c}_{c_{j-1}}$ represent the pixel location of the $K$-th object detection bounding box's center in the $j$-th frame and the $(j-1)$-th frame, respectively. A weighted predicted velocity ${}^{K}\hat{v}$ is defined as

$$ {}^{K}\hat{v}_{c_{j+1}} = \frac{1}{2}\left({}^{K}v_{c_j} + {}^{K}\hat{v}_{c_j}\right). \tag{4} $$

With the update going on, the velocities of older frames will have a lower weight in ${}^{K}\hat{v}$. If the object fails to be detected in the next frame, the bounding box ${}^{K}Box$, containing the corners' pixel locations ${}^{K}u_{tl}$, ${}^{K}u_{tr}$, ${}^{K}u_{bl}$, and ${}^{K}u_{br}$, will be updated based on the predicted velocity ${}^{K}\hat{v}$ as follows:

$$ {}^{K}\widehat{Box}_{c_{j+1}} = {}^{K}Box_{c_j} + {}^{K}\hat{v}_{c_{j+1}}. \tag{5} $$

When the missed detection time is over a threshold, this dynamic object's compensation will be abandoned. The result is shown in Fig. 4. It improves the recall rate of object detection and is helpful for a more consistent dynamic feature recognition.
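The compensation step can be pictured with the following sketch (illustrative names and an assumed give-up policy, not the authors' code): the box-center velocity is smoothed as in Eq. (4), and the whole box is extrapolated by it when the detector misses, as in Eq. (5).

```python
import numpy as np

class BoxPredictor:
    """Missed-detection compensation in the spirit of Eqs. (3)-(5):
    smooth the pixel velocity of a box center and extrapolate the box
    when the detector misses it (names and give-up policy are illustrative)."""

    def __init__(self, box, max_missed=3):
        self.box = np.asarray(box, dtype=float)   # [u_tl, v_tl, u_br, v_br]
        self.v_hat = np.zeros(2)                  # weighted predicted velocity
        self.prev_center = self._center()
        self.missed = 0
        self.max_missed = max_missed

    def _center(self):
        return np.array([(self.box[0] + self.box[2]) / 2.0,
                         (self.box[1] + self.box[3]) / 2.0])

    def update_detected(self, box):
        """Detection available: measure velocity (Eq. 3) and smooth it (Eq. 4)."""
        self.box = np.asarray(box, dtype=float)
        v = self._center() - self.prev_center
        self.v_hat = 0.5 * (v + self.v_hat)
        self.prev_center = self._center()
        self.missed = 0

    def update_missed(self):
        """Detection missed: shift the box by the predicted velocity (Eq. 5),
        or give up after too many consecutive misses."""
        self.missed += 1
        if self.missed > self.max_missed:
            return None
        self.box += np.tile(self.v_hat, 2)        # move both corners equally
        self.prev_center = self._center()
        return self.box
```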
E. Moving Consistency Check

Since object detection can only recognize artificially defined dynamic objects and has a missed detection problem, the state optimization will still be affected by unknown moving objects like books moved by people. Dynamic-VINS combines the pose predicted by the IMU and the optimized poses in the sliding window to recognize dynamic features.

Consider that the $k$th feature is first observed in the $i$th image and is observed by other $m$ images in the sliding window. The average reprojection residual $r_k$ of the feature observations in the sliding window is defined as

$$ r_k = \frac{1}{m}\sum_{j}\left\| u_k^{c_i} - \pi\!\left( T_b^{c}\, T_w^{b_i}\, T_{b_j}^{w}\, T_c^{b}\, P_k^{c_j} \right) \right\|, \tag{6} $$

where $u_k^{c_i}$ is the observation of the $k$th feature in the $i$th frame; $P_k^{c_j}$ is the 3-D location of the $k$th feature in the $j$th camera frame; $T_c^{b}$ and $T_{b_j}^{w}$ are the transforms from the camera frame to the body frame and from the $j$th body frame to the world frame, respectively; and $\pi$ represents the camera projection model. When $r_k$ is over a preset threshold, the $k$th feature is considered a dynamic feature.

As shown in Fig. 7, the moving consistency check (MCC) module can find out unstable features. However, some stable features are misidentified (top left image), and features on standing people are not recognized (bottom right image). A low threshold holds a high recall rate of unstable features. Further, a misidentified unstable feature with more observations will be recycled if its reprojection error is lower than the threshold.
IV. EXPERIMENTAL RESULTS

Quantitative experiments¹ are performed to evaluate the proposed system's accuracy, robustness, and efficiency. Public SLAM evaluation datasets, OpenLORIS-Scene [29] and TUM RGB-D [30], provide sensor data and ground truth to evaluate SLAM systems in complex dynamic environments. Since our system is built on VINS-Mono [2] and VINS-RGBD [25], they are used as the baselines to demonstrate our improvement. VINS-Mono [2] provides robust and accurate visual-inertial odometry by fusing IMU preintegration and feature observations. VINS-RGBD [25] integrates an RGB-D camera based on VINS-Mono for better performance. Furthermore, DS-SLAM [15] and Ji et al. [24], state-of-the-art semantic algorithms based on ORB-SLAM2 [4], are also included for comparison.

The accuracy is evaluated by the Root-Mean-Square Error (RMSE) of the Absolute Trajectory Error (ATE), the Translational Relative Pose Error (T.RPE), and the Rotational Relative Pose Error (R.RPE). The Correct Rate (CR) [29], measuring the correct rate over the whole period of data, is used to evaluate the robustness. The RMSE of an algorithm is calculated only for its successful tracking outputs. Therefore, the longer an algorithm tracks successfully, the more error is likely to accumulate. It implies that evaluating algorithms purely by ATE could be misleading. On the other hand, considering only CR could also be misleading.

¹ The experimental video is available at https://youtu.be/y0U1IVtFBwY.
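For reference, the ATE RMSE used above can be computed roughly as follows (a generic sketch that assumes the estimated and ground-truth positions are already time-associated and aligned to a common frame; it is not the benchmarks' official evaluation script).

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of Absolute Trajectory Error over time-associated, pre-aligned
    (N, 3) position sequences given in meters."""
    err = np.linalg.norm(np.asarray(est_xyz, float) - np.asarray(gt_xyz, float), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```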
In order to demonstrate the efficiency of the proposed system, all experiments of Dynamic-VINS are performed on the embedded edge computing devices HUAWEI Atlas200 DK and NVIDIA Jetson AGX Xavier, and the compared algorithms' results are included from their original papers. Atlas200 DK has an 8-core A55 Arm CPU (1.6 GHz), 8 GB of RAM, and a 2-core HUAWEI DaVinci NPU. Jetson AGX Xavier has an 8-core ARMv8.2 64-bit CPU (2.25 GHz), 16 GB of RAM, and a 512-core Nvidia Volta GPU. The results tested on both devices are named Dynamic-VINS-Atlas and Dynamic-VINS-Jetson, respectively. Yet, to the best of our knowledge, the proposed method is the best-performance real-time RGB-D inertial odometry for dynamic environments on resource-restricted embedded platforms.

A. OpenLORIS-Scene Dataset

OpenLORIS-Scene [29] is a real-world indoor dataset with a large variety of challenging scenarios like dynamic scenes, featureless frames, and dim illumination. The results on the OpenLORIS-Scene dataset are shown in Fig. 6, including the results of VINS-Mono, ORB-SLAM2, and DS-SLAM from [29] as baselines.

Fig. 6. Per-sequence testing results with the OpenLORIS-Scene datasets. Each black dot on the top line represents the start of one data sequence. For each algorithm, blue dots indicate successful initialization moments, and blue lines indicate the successful tracking span. The percentage value on the top left of each scene is the average correct rate; the higher the correct rate of an algorithm, the more robust it is. The float value on the first line below is the average ATE RMSE, and the values on the second line below are T.RPE and R.RPE from left to right; smaller means more accurate.

The OpenLORIS dataset includes five scenes and 22 sequences in total. The proposed Dynamic-VINS shows the best robustness among the tested algorithms. In office scenes that are primarily static environments, all the algorithms can track successfully and achieve a decent accuracy. It is challenging for the pure visual SLAM systems to track stable features in home and corridor scenes that contain a large area of textureless walls and dim lighting. Thanks to the IMU sensor, the VINS systems show robustness superiority when the camera is unreliable. The scenarios of home and cafe contain a number of sitting people with a bit of motion, and market contains lots of moving pedestrians and objects with unpredictable motion. The market scenes cover the largest area and contain highly dynamic objects, as shown in Fig. 5. Although DS-SLAM is able to filter out some dynamic features, its performance is still unsatisfactory. VINS-RGBD has a similar performance with Dynamic-VINS in relatively static scenes, while VINS-RGBD's accuracy drops in highly dynamic market scenes. The proposed Dynamic-VINS can effectively deal with complex dynamic environments and improve robustness and accuracy.

B. TUM RGB-D Dataset

The TUM RGB-D dataset [30] offers several sequences containing dynamic objects in indoor environments. The highly dynamic fr3_walking sequences are chosen for evaluation, where two people walk around a desk and change the chairs' positions while the camera moves in different motions. As the VINS system does not support a VO mode and the TUM RGB-D dataset does not provide IMU measurements, a VO mode is implemented by simply disabling modules relevant to the IMU in Dynamic-VINS for experiments. The results are shown in Table I. The compared methods' results are included from their original published papers. The algorithms based on ORB-SLAM2 and semantic segmentation perform better.
Although Dynamic-VINS is not designed for pure visual odometry, it still shows competitive performance and has a significant improvement over ORB-SLAM2.

TABLE I: Results of RMSE of ATE [m], T.RPE [m/s], and R.RPE [°/s] on the TUM RGB-D fr3_walking datasets.

To validate the effectiveness of each module in Dynamic-VINS, ablation experiments are conducted as shown in Table II. The system without applying circular masks (W/O CIRCULAR MASK) from Section III-A and Section III-B fails to extract evenly distributed stable features, which seriously degrades the accuracy performance. Without the object detection (W/O OBJECT DETECTION), dynamic features introduce wrong constraints that impair the system's accuracy. Dynamic-VINS-W/O-SEG-LIKE-MASK shows the results when all features in the bounding boxes are masked. The background features help the system maintain as many stable features as possible to provide more visual constraints. The moving consistency check plays an important role when object detection fails, as shown in the column W/O-MCC.

TABLE II: Ablation experiment results of RMSE of ATE [m], T.RPE [m/s], and R.RPE [°/s] on the TUM RGB-D fr3_walking datasets.

Fig. 7. Results of Moving Consistency Check. Features without a yellow circle are the outliers marked by the Moving Consistency Check module.

C. Runtime Analysis

This part compares VINS-Mono, VINS-RGBD, and Dynamic-VINS for runtime analysis. These methods are expected to track and detect 130 feature points, and the frames in Dynamic-VINS are divided into 7x8 grids. The object detection runs on the NPU/GPU in parallel to the CPU. The average computation times of each module and thread are calculated on the OpenLORIS market scenes; the results run on both embedded platforms are shown in Table III. It should be noted that the average computation time is only updated when the module is used. Specifically, in the VINS architecture, the feature detection is executed at a consistent frequency with the state optimization thread, which means the frequency of feature detection is lower than that of the Feature Tracking Thread.
TABLE III: Average computation time [ms] of each module and thread on the OpenLORIS market scenes. (*Tracking Thread, Optimization Thread, and Object Detection correspond to the three different threads shown in Fig. 1, respectively. †Dynamic Feature Recognition Modules sum up the Dynamic Feature Recognition, Missed Detection Compensation, and Moving Consistency Check modules.)

On edge computing devices with AI accelerator modules, the single-stage object detection method is computed by an NPU or GPU without costing CPU resources and can output inference results in real-time. With the same parameters, Dynamic-VINS shows significant improvement in feature detection efficiency on both embedded platforms and is the one able to achieve instant feature tracking and detection on the HUAWEI Atlas200 DK. The dynamic feature recognition modules (Dynamic Feature Recognition, Missed Detection Compensation, Moving Consistency Check) used to recognize dynamic features only take a tiny part of the consumed time. For real-time applications, the system is able to output a faster frame-to-frame pose and a higher-frequency IMU-propagated pose rather than waiting for the complete optimization result.

D. Real-World Experiments

A compact aerial robot is shown in Fig. 8. An RGB-D camera (Intel Realsense D455) provides 30 Hz color and aligned depth images. An autopilot (CUAV X7pro) with an onboard IMU (ADIS16470, 200 Hz) is used to provide IMU measurements. The aerial robot is equipped with an onboard computer (Intel NUC, i7-5557U CPU) and an embedded edge computing device (HUAWEI Atlas200 DK). These two computation resource providers play different roles in the aerial robot. The onboard computer is in charge of peripheral management and other core functions requiring more CPU resources, such as planning and mapping. The edge computing device, as auxiliary equipment, offers instant state feedback and object detection results to the onboard computer.

Fig. 8. A compact aerial robot equipped with an RGB-D camera, an autopilot with IMUs, an onboard computer, and an embedded edge computing device. The whole size is about 255 × 165 mm.

Large-scale outdoor datasets with moving people and vehicles on the HITSZ and THUSZ campuses are recorded by the handheld aerial robot above for safety. The total path lengths are approximately 800 m and 1220 m, respectively. The dataset has a similar scene at the beginning and the end for loop closure, while loop closure fails in the THUSZ campus dataset. VINS-RGBD and Dynamic-VINS run the dataset on NVIDIA Jetson AGX Xavier. The estimated trajectories and the loop closure trajectory aligned with the Google map are shown in Fig. 9. In outdoor environments, the depth camera is limited in range and affected by the sunlight. The dynamic feature recognition modules can still segment dynamic objects but with a larger mask region, as shown in Fig. 10. Compared with the loop closure results, Dynamic-VINS could provide a robust and stable pose estimation with little drift.

Fig. 9. The estimated trajectories in the outdoor environment aligned with the Google map. The green line is the estimated trajectory from Dynamic-VINS, the red line is from VINS-RGBD, and the yellow line represents the loop closure that happened at the end of the dataset.

Fig. 10. Results of dynamic feature recognition in outdoor environments. The dynamic feature recognition modules are still able to segment dynamic objects but with a larger mask region.
V. CONCLUSION

This paper presents a real-time RGB-D inertial odometry for resource-restricted robots in dynamic environments. Cost-efficient feature tracking and detection methods are proposed to cut down the computing burden. A lightweight object-detection-based method is introduced to deal with dynamic features in real-time. Validation experiments show the proposed system's competitive accuracy, robustness, and efficiency in dynamic environments. Furthermore, Dynamic-VINS is able to run on resource-restricted platforms to output an instant pose estimation. In the future, the proposed approaches are expected to be validated on the existing popular SLAM frameworks. The missed detection compensation module is expected to develop into a moving object tracking module, and semantic information will be further introduced for high-level guidance on mobile robots or mobile devices in complex dynamic environments.

REFERENCES

[1] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2018.
[2] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, Aug. 2018.
[3] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, "OpenVINS: A research platform for visual-inertial estimation," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 4666–4672.
[4] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[5] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment—A modern synthesis," in Proc. Int. Workshop Vis. Algorithms, 1999, pp. 298–372.
[6] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[7] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auton. Syst., vol. 89, pp. 110–122, 2017.
[8] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss, "ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 7855–7862.
[9] W. Dai et al., "RGB-D SLAM in dynamic environments using point correlations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 373–389, Jan. 2022.
[10] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[11] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[12] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[14] L. Xiao et al., "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1–16, 2019.
[15] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 1168–1174.
[16] B. Bescos, J. M. Facil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[17] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, "Detect-SLAM: Making object detection and SLAM mutually beneficial," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 1001–1010.
[18] Y. Liu and J. Miura, "RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods," IEEE Access, vol. 9, pp. 23772–23785, 2021.
[19] I. Ballester, A. Fontán, J. Civera, K. H. Strobl, and R. Triebel, "DOT: Dynamic object tracking for visual SLAM," in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 11705–11711.
[20] K. Schauwecker, N. R. Ke, S. A. Scherer, and A. Zell, "Markerless visual control of a quad-rotor micro aerial vehicle by means of on-board stereo processing," in Proc. Auton. Mobile Syst., 2012, pp. 11–20.
[21] Z. Z. Nejad and A. Hosseininaveh Ahmadabadian, "ARM-VO: An efficient monocular visual odometry for ground vehicles on ARM CPUs," Mach. Vis. Appl., vol. 30, no. 6, pp. 1061–1070, 2019.
[22] S. Bahnam, S. Pfeiffer, and G. C. H. E. de Croon, "Stereo visual inertial odometry for robots with limited computational resources," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021, pp. 9154–9159.
[23] G. Younes et al., "Keyframe-based monocular SLAM: Design, survey, and future directions," Robot. Auton. Syst., vol. 98, pp. 67–88, 2017.
[24] T. Ji, C. Wang, and L. Xie, "Towards real-time semantic RGB-D SLAM in dynamic environments," in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 11175–11181.
[25] Z. Shan, R. Li, and S. Schwertfeger, "RGBD-inertial trajectory estimation and mapping for ground robots," Sensors, vol. 19, no. 10, 2019, Art. no. 2251.
[26] C. Forster et al., "IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," in Proc. Robot.: Sci. Syst., 2015.
[27] B. D. Lucas et al., "An iterative image registration technique with an application to stereo vision," in Proc. DARPA Image Understanding Workshop, 1981, pp. 121–130.
[28] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443.
[29] X. Shi et al., "Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 3139–3145.
[30] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 573–580.
View File
@ -0,0 +1,665 @@
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 72, 2023 7501012
SG-SLAM: A Real-Time RGB-D Visual SLAM
Toward Dynamic Scenes With Semantic and
Geometric Information
Shuhong Cheng, Changhe Sun, Shijun Zhang, Student Member, IEEE, and Dianfan Zhang

Manuscript received 25 August 2022; revised 31 October 2022; accepted 23 November 2022. Date of publication 9 December 2022; date of current version 17 January 2023. This work was supported in part by the National Key Research and Development Program under Grant 2021YFB3202303, in part by the S&T Program of Hebei under Grant 20371801D, in part by the Hebei Provincial Department of Education for Cultivating Innovative Ability of Postgraduate Students under Grant CXZZBS2022145, and in part by the Hebei Province Natural Science Foundation Project under Grant E2021203018. The Associate Editor coordinating the review process was Dr. Jae-Ho Han. (Corresponding authors: Shijun Zhang; Dianfan Zhang.) Shuhong Cheng and Changhe Sun are with the School of Electrical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: shhcheng@ysu.edu.cn; silencht@qq.com). Shijun Zhang is with the School of Mechanical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: 980871977@qq.com). Dianfan Zhang is with the Key Laboratory of Special Delivery Equipment, Yanshan University, Qinhuangdao 066004, China (e-mail: zdf@ysu.edu.cn). Digital Object Identifier 10.1109/TIM.2022.3228006

Abstract—Simultaneous localization and mapping (SLAM) is one of the fundamental capabilities for intelligent mobile robots to perform state estimation in unknown environments. However, most visual SLAM systems rely on the static scene assumption and consequently have severely reduced accuracy and robustness in dynamic scenes. Moreover, the metric maps constructed by many systems lack semantic information, so the robots cannot understand their surroundings at a human cognitive level. In this article, we propose SG-SLAM, which is a real-time RGB-D semantic visual SLAM system based on the ORB-SLAM2 framework. First, SG-SLAM adds two new parallel threads: an object detecting thread to obtain 2-D semantic information and a semantic mapping thread. Then, a fast dynamic feature rejection algorithm combining semantic and geometric information is added to the tracking thread. Finally, they are published to the robot operating system (ROS) system for visualization after generating 3-D point clouds and 3-D semantic objects in the semantic mapping thread. We performed an experimental evaluation on the TUM dataset, the Bonn dataset, and the OpenLORIS-Scene dataset. The results show that SG-SLAM is not only one of the most real-time, accurate, and robust systems in dynamic scenes but also allows the creation of intuitive semantic metric maps.

Index Terms—Dynamic scenes, geometric constraint, semantic metric map, visual-based measurement, visual simultaneous localization and mapping (SLAM).

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) has an important role in the state perception of mobile robots. It can help a robot in an unknown environment with an unknown pose to incrementally build a globally consistent map and simultaneously measure its pose in this map [1]. Due to the continuing and rapid development of cameras and computing systems, we have access to cheaper, faster, higher quality, and smaller vision-based sensors. It also helps vision-based measurement (VBM) become more ubiquitous and applicable [2]. Hence, in the past years, a large number of excellent visual SLAM systems have emerged, such as PTAM [3], ORB-SLAM2 [4], DVO [5], and Kimera [6]. Some of these visual SLAM systems are quite mature and have achieved good performance under certain specific environmental conditions.

As SLAM enters the age of robust perception [7], the system has higher requirements in terms of robustness and high-level understanding characteristics. However, many vision-based classical SLAM systems still fall short of these requirements in some practical scenarios. On the one hand, most visual SLAM systems work based on the static scene assumption, which makes the system less accurate and less robust in real dynamic scenes (e.g., scenes containing walking people and moving vehicles). On the other hand, most existing SLAM systems only construct a globally consistent metric map of the robot's working environment [8]. However, the metric map does not help the robot to understand its surroundings at a higher semantic level.

Most visual SLAM algorithms rely on the static scene assumption, which is why the presence of dynamic objects can cause these algorithms to produce wrong data correlations. The outliers obtained from dynamic objects can seriously impair the accuracy and stability of the algorithms. Even though these algorithms show superior performance in some specific scenarios, it is difficult to extend them to actual production and living scenarios containing dynamic objects. Some recent works, such as [9], [10], [11], and [12], have used methods that combine geometric and semantic information to eliminate the adverse effects of dynamic objects. These algorithms, mainly using deep learning, achieve significant improvements in experimental accuracy, but they suffer from shortcomings in scene generalizability or real-time performance due to various factors. Therefore, how to skillfully detect and process dynamic objects in the scene is crucial for the system to operate accurately, robustly, and in real time.

Traditional SLAM systems construct only a sparse metric map [3], [4]. This metric map consists of simple geometries (points, lines, and surfaces) and every pose is strictly related to the global coordinate system. Enabling a robot to perform advanced tasks with intuitive human–robot interaction requires it to understand its surroundings at a human cognitive level.
However, the metric map lacks the necessary semantic information and therefore cannot provide this capability. With the rapid development of deep learning in recent years, some neural networks can effectively capture the semantic information in the scenes. Therefore, the metric map can be extended to a semantic metric map by integrating semantic information. The semantic information contained in the semantic metric map can provide the robot with the capability to understand its surroundings at a higher level.

This article focuses on a dynamic feature rejection algorithm that integrates semantic and geometric information, which not only significantly improves the accuracy of system localization but also has excellent computational efficiency. Thus, our algorithm is very useful from an instrumentation and measurement point of view [2]. This article also focuses on how to construct the semantic metric map to improve the perceptual level of the robot to understand the surrounding scenes. The overall framework of the SG-SLAM system is shown in Fig. 1.

Fig. 1. Overview of the framework of the SG-SLAM system. The original work of ORB-SLAM2 is presented on an aqua-green background, while our main new or modified work is presented on a red background.

The main contributions of this article include the following.
1) A complete real-time RGB-D visual SLAM system called SG-SLAM is proposed using ORB-SLAM2 as a framework. Compared to ORB-SLAM2, it has higher accuracy and robustness in dynamic scenes and can publish a semantic metric map through the robot operating system (ROS) system [13].
2) A fast dynamic feature rejection algorithm is proposed by combining geometric information and semantic information. The geometric information is calculated from the epipolar constraint between image frames (a generic sketch of such a check is given after this list). Also, the semantic information about dynamic objects is obtained through an NCNN-based [14] object detection network in a new thread. The algorithm speed is greatly improved by appropriate modifications and a combination of classical methods while maintaining accuracy.
3) An independent semantic metric mapping thread that can generate semantic objects and OctoMap maps [15] using the ROS interface is embedded in SG-SLAM. These maps can be useful in subsequent localization, navigation, and object capture tasks.
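As a generic illustration of the epipolar-constraint idea referenced in contribution 2 (this is a sketch using OpenCV, not SG-SLAM's NCNN-based pipeline or its exact geometric test), matched keypoints whose distance to their epipolar line exceeds a threshold can be treated as candidate dynamic points:

```python
import cv2
import numpy as np

def epipolar_dynamic_flags(pts1, pts2, dist_thresh=1.0):
    """Flag matches that violate the epipolar constraint between two frames.

    pts1, pts2: (N, 2) float32 matched pixel coordinates (frame 1 / frame 2).
    Returns a boolean array; True marks a candidate dynamic point.
    """
    F, _mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return np.zeros(len(pts1), dtype=bool)   # not enough matches to decide
    # Epipolar lines in image 2 corresponding to the points of image 1.
    lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
    a, b, c = lines2[:, 0], lines2[:, 1], lines2[:, 2]
    u, v = pts2[:, 0], pts2[:, 1]
    dist = np.abs(a * u + b * v + c) / np.sqrt(a ** 2 + b ** 2)
    return dist > dist_thresh
```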
CHENG et al.: SG-SLAM: A REAL-TIME RGB-D VISUAL SLAM TOWARD DYNAMIC SCENES 7501012
ROS interface is embedded in SG-SLAM. These maps regarded as outliers and eliminated. Similarly, Dynamic-
can be useful in subsequent localization, navigation, and SLAM proposed by Xiao et al. [25] has the same problem
object capture tasks. of directly rejecting all features within the bounding box.
Liu and Miura [26] adopted a semantic segmentation method
The remaining sections of this article are organized as to detect dynamic objects and remove outliers in keyframes.
follows. The work related to this system is described in The semantic segmentation method solves the problem of
Section II. Section III shows the details related to the imple- wrong recognition due to bounding boxes to a certain extent.
mentation of this system. Section IV provides an experimental However, the semantic information method relies heavily on
evaluation and analysis of the results. The conclusions and the quality of the neural network, so it is difficult to meet the
future works of this article are presented in Section V. requirements of speed and accuracy at the same time.
II. RELATED WORKS Recently, much work has taken on the method of combining
geometric and semantic information. For the RGB-D camera,
A. SLAM in Dynamic Scenes Bescos et al. [9] used the semantic segmentation results of
Mask R-CNN [27] combined with multiview geometry to
Most current visual SLAMs assume that the working scene detect dynamic objects and reject outliers. Yu et al. [10]
is static and rigid. When these systems work in dynamic used an optical flow-based moving consistency check method
scenes, erroneous data associations due to the static scene to detect all feature points and simultaneously performed
assumption can seriously weaken the accuracy and stability semantic segmentation of the image using SegNet [28] in
of the system. The presence of dynamic objects in the scene an independent thread. If the moving consistency checking
makes all features divided into two categories: static features method detects more than a certain percentage of dynamic
and dynamic features. How to detect and reject dynamic points within the range of the human object, all feature points
features is the key to the problem solution. The previous that lie inside the object are directly rejected. Wu et al. [11]
research work can be divided into three categories: geomet- used YOLO to detect a priori dynamic objects in the scene
ric information method, semantic information method, and and then combined it with the depth-RANSAC method to
method combining geometric and semantic information. reject the feature points inside the range of dynamic objects.
Chang et al. [12] segmented the dynamic objects by YOLACT
Geometric information method, whose main idea is to and then removed the outliers inside the objects. Then, geo-
assume that only static features can satisfy the geometric metric constraints are introduced to further filter the missing
constraints of the algorithm. A remarkable early monocular dynamic points.
dynamic object detection system comes from the work of
Kundu et al. [16]. The system creates two geometric con- The above methods have achieved quite good results in
straints to detect dynamic objects based on the multiview terms of accuracy improvement. Nevertheless, the idea of all
geometry [17]. One of the most important is the epipolar these methods relies heavily on semantic information and, to a
constraint defined by the fundamental matrix. The idea is lesser extent, on geometric information. Thus, more or less all
that a static feature point in the current image must lie on of them have the following shortcomings.
the pole line corresponding to the same feature point in the
previous image. A feature point is considered dynamic if 1) Inability to correctly handle dynamic features outside of
its distance from the corresponding polar line exceeds an the prior object [10], [11], [23], [25], [26]. For example,
empirical threshold. The fundamental matrix of the system is chairs are static objects by default, but dynamic during
calculated with the help of an odometer. In a purely visual being moved by a person; moving cats appear in the
system, the fundamental matrix can be calculated by the scene, while the neural network is not trained on the
seven-point method based on RANSAC [18]. The algorithm category of cats; low recall problem for the detection
of Kundu et al. [16] has the advantages of fast speed and algorithm.
strong scene generalization. However, it lacks a high-level
understanding of the scene, so the empirical threshold is 2) The a priori dynamic object remains stationary yet still
difficult to select and the accuracy is not high. In addition, brutally rejects the feature points in its range, resulting in
some works use the direct method for motion detection of less available association data [11], [12], [23], [25], [26].
scenes, such as [19], [20], [21], and [22]. The direct method For example, a person who is sitting still is nevertheless
algorithms are faster and can utilize more image information. considered a dynamic object.
However, it is less robust in complex environments because it
is based on the gray-scale invariance assumption. 3) The real-time performance is weak [9], [10], [11], [12].
The average frame rate of the system is low due to
Semantic information method, whose main idea is brutally factors such as complex semantic segmentation networks
rejecting features in dynamic regions that are obtained a priori or unreasonable system architecture.
using deep learning techniques. Zhang et al. [23] used the
YOLO [24] object detection method to obtain the semantic We propose an efficient dynamic feature rejection algorithm
information of dynamic objects in the working scene and combining geometric and semantic information to solve the
then reject the dynamic feature points based on the semantic above problem. Unlike most current work that relies heavily
information to improve the accuracy of the system. However, on deep learning, our algorithm uses mainly geometric infor-
the way YOLO extracts semantic information by bounding mation and then supplements it with semantic information.
box will cause a part of static feature points to be wrongly This shift in thinking allows our algorithm to avoid the short-
comings associated with relying too much on deep learning.
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 06,2023 at 09:28:10 UTC from IEEE Xplore. Restrictions apply.
B. Semantic Mapping

Many current visual SLAM systems only provide a metric map that satisfies the basic functions of localization and navigation of mobile robots, such as the sparse feature point map constructed by ORB-SLAM2. If a mobile robot is to perceive its surroundings at the human conceptual level, it is necessary to incorporate semantic information in the metric map to form a semantic map. The semantic metric map can help robots to act according to human rules, execute high-level tasks, and communicate with humans at the conceptual level.

In an earlier study, Mozos et al. [29] used the hidden Markov model to partition the metric map into different functional locations (rooms, corridors, and doorways). The work of Nieto-Granda et al. [30] deployed a mapping module based on the Rao-Blackwellized particle filtering technique on a ROS [13] and used the Gaussian model to partition the map into marked semantic regions. Subsequently, the development of deep learning has greatly contributed to the advancement of object detection and semantic segmentation algorithms. Sünderhauf et al. [31] used SSD [32] to detect objects in each RGB keyframe and then assigned a 3-D point cloud to each object using an adaptive 3-D unsupervised segmentation method. This work is based on a data association mechanism of ICP-like matching scores to decide whether to create new objects in the semantic map or to associate them with existing objects. Zhang et al. [23] acquired semantic maps of the working scene through the YOLO object detection module and the localization module in an RGB-D SLAM system. In summary, many works only stop at using SLAM to help with semantic mapping and do not fully utilize the acquired semantic information to help tracking. DS-SLAM, a semantic mapping system proposed by Yu et al. [10], adopted semantic segmentation information to build semantic maps. However, DS-SLAM only attaches semantic labels to the metric map for visual display. The lack of position coordinates for the objects described in mathematical form limits the system's ability to perform advanced task planning.

III. SYSTEM OVERVIEW

In this section, we will introduce the technical details of the SG-SLAM system from five aspects. First, we introduce the framework and the basic flow of the system. Second, we give information about the object detecting thread. Then, the geometric principle of the epipolar constraint method for judging dynamic features is illustrated. Subsequently, the dynamic feature rejection strategy is proposed. Finally, we propose methods to acquire semantic objects and build semantic maps.

A. System Framework

The SG-SLAM proposed in this article is developed based on the ORB-SLAM2 system, which is a classical feature point-based visual SLAM system. ORB-SLAM2 consists of three main parallel threads: tracking, local mapping, and loop closing. In evaluations on many popular public datasets, ORB-SLAM2 is one of the systems that achieve state-of-the-art accuracy. Therefore, SG-SLAM selects ORB-SLAM2 as the base framework to provide global localization and mapping functions.

As shown in Fig. 1, the SG-SLAM system adds two more parallel threads: the object detecting thread and the semantic mapping thread. The multithreading mechanism improves the system's operating efficiency. The purpose of adding an object detecting thread is to use the neural network to obtain 2-D semantic information. This 2-D semantic information then provides a priori dynamic object information for the dynamic feature rejection strategy. The semantic mapping thread integrates the 2-D semantic information and the 3-D point cloud information from keyframes to generate a 3-D semantic object database. An intuitive semantic metric map is obtained by publishing the 3-D point cloud, the 3-D semantic objects, and the camera pose to the ROS system. Compared to the sparse feature point maps of ORB-SLAM2, the semantic metric maps can help mobile robots understand their surroundings and perform advanced tasks at a higher cognitive level.

When the SG-SLAM system is running, the image frames captured from the RGB-D camera are first fed together to the tracking thread and the object detecting thread. The object detecting thread starts to perform object recognition on the input RGB images. At the same time, the tracking thread also starts to extract ORB feature points from the input frames. After the extraction is completed, the iterative Lucas-Kanade optical flow method with pyramids is used to match the sparse feature points between the current frame and previous frames. Then, the seven-point method based on RANSAC is used to compute the fundamental matrix between the two frames. This reduces the adverse effects due to incorrect data correlation in dynamic regions. Compared with feature extraction and fundamental matrix computation, the object detection task is more time-consuming. In other words, once the fundamental matrix has been computed, the tracking thread needs to wait for the result of the object detecting thread. Since the system adopts object detection rather than semantic segmentation, the blocking time is not too long [26]. This enhances the real-time performance of the system. Next, the tracking thread combines the epipolar constraint and the 2-D semantic information to reject the dynamic feature points. The camera pose is computed and published to ROS according to the remaining static feature points.

The new keyframes are fed into the local mapping thread and the loop closing thread for pose optimization, which is the same as in the original ORB-SLAM2 system. The difference is that the depth image of the new keyframe is used to generate a 3-D point cloud in the semantic mapping thread. Next, the 3-D point cloud is combined with the 2-D semantic information to generate a 3-D semantic object database. Semantic map construction suffers from problems such as high computational effort and redundant information between normal frames. Thus, processing only keyframe data here improves the efficiency of mapping. The reuse of the 2-D semantic information also improves the real-time performance of the system. Finally, the 3-D point cloud and the 3-D semantic object data are published to the 3-D visualization tool Rviz for map display using the interface of the ROS system.
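SG-SLAM itself is built on ORB-SLAM2 and NCNN, so presumably in C++; purely as an illustration of the thread hand-off described above, here is a minimal, hypothetical Python sketch. The DetectionWorker class, the detect_objects callable, and the tracker object with its extract_orb/match_lk_optical_flow/fundamental_matrix_ransac/reject_dynamic/estimate_pose methods are assumed stand-ins, not the system's actual API.

```python
import queue
import threading


class DetectionWorker:
    """Runs the object detector in its own thread so the tracking thread can
    extract features and compute the fundamental matrix in parallel."""

    def __init__(self, detect_objects):
        # detect_objects: a callable mapping an RGB image to a list of
        # (class_name, bounding_box) pairs; assumed to wrap whatever
        # detector is available (e.g., an SSD-style network).
        self._detect = detect_objects
        self._in = queue.Queue(maxsize=1)
        self._out = queue.Queue(maxsize=1)
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            image = self._in.get()
            self._out.put(self._detect(image))

    def submit(self, image):
        # Hand the new frame to the detector before geometric processing starts.
        self._in.put(image)

    def result(self):
        # Blocks only if detection has not finished yet, mirroring the short
        # wait of the tracking thread described in the text.
        return self._out.get()


def track_frame(frame_rgb, tracker, detector):
    """One tracking iteration: geometry runs while detection is in flight."""
    detector.submit(frame_rgb)                      # start detection early
    tracker.extract_orb(frame_rgb)                  # ORB feature extraction
    matches = tracker.match_lk_optical_flow()       # pyramidal LK matching
    F = tracker.fundamental_matrix_ransac(matches)  # RANSAC-based estimate
    detections = detector.result()                  # wait for 2-D semantics
    static = tracker.reject_dynamic(matches, F, detections)
    return tracker.estimate_pose(static)
```

The design point is only that detection and geometry overlap in time, which is the reason the blocking wait stays short.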
The adoption of object detection networks (rather than semantic segmentation), multithreading, keyframe-based mapping, and data reuse mechanisms overcomes the real-time performance shortcomings listed in Section II-A.

B. Object Detection

Due to the limitations in battery life, mobile robots generally choose ARM architecture processors with high performance per watt. NCNN is a high-performance neural network inference computing framework optimized for mobile platforms; since NCNN is implemented in pure C++ with no third-party dependencies, it can be easily integrated into SLAM systems. Thus, we choose it as the base framework for the object detecting thread.

Many SLAM systems, such as [9], [10], [11], and [12], run slowly due to complex semantic segmentation networks or unreasonable system architectures. SLAM, as a fundamental component for the state estimation of mobile robots, must have good real-time performance to ensure the smooth operation of upper-level tasks. To improve the object detection speed as much as possible, the single-shot multibox detector (SSD) is chosen as the detection head. In addition, we use MobileNetV3 [33] as a drop-in replacement for the backbone feature extractor in SSDLite. Finally, the network was trained using the PASCAL VOC 2007 dataset [34].

In reality, other detectors can be used flexibly depending on the hardware performance to achieve a balance between accuracy and speed.

C. Epipolar Constraints

Fig. 2. Epipolar constraints.

SG-SLAM uses geometric information obtained from the epipolar constraint to determine whether feature points are dynamic or not. The judgment pipeline of the epipolar constraint is very straightforward. First, match the ORB feature points of two consecutive frames. Next, solve the fundamental matrix. Finally, the distance is calculated between the feature point of the current frame and its corresponding polar line. The bigger the distance is, the more likely the feature point is dynamic.

To solve the fundamental matrix, it is necessary to have the correct data association between the feature points. However, the purpose of solving the fundamental matrix is to judge whether the data association is correct or not. This becomes a classic chicken-or-egg problem. ORB-SLAM2 takes the bag-of-words method to accelerate feature matching, and the continued use of this method cannot eliminate the adverse effect of outliers. Hence, to obtain a relatively accurate fundamental matrix, SG-SLAM uses the pyramidal iterative Lucas-Kanade optical flow method to calculate the matching point set of features. Inspired by Yu et al. [10], the matching point pairs located at the edges of images and with excessive differences in appearance are then removed to further reduce erroneous data associations. Then, the seven-point method based on RANSAC is used to calculate the fundamental matrix between two frames. In general, the proportion of dynamic regions is relatively small compared to the whole image. Thus, the RANSAC algorithm can effectively reduce the adverse effects of wrong data association in dynamic regions.

According to the pinhole camera model, as shown in Fig. 2, the camera observes the same spatial point P from different angles. O1 and O2 denote the optical centers of the camera. P1 and P2 are the matching feature points to which the spatial point P maps in the previous frame and the current frame, respectively. The short dashed lines L1 and L2 are the epipolar lines in the two frames. The homogeneous coordinate forms of P1 and P2 are denoted as

$P_1 = [x_1, y_1, 1]^{T}, \quad P_2 = [x_2, y_2, 1]^{T}$    (1)

where x and y denote the coordinate values of the feature points in the image pixel coordinate system. Then, the polar line L2 in the current frame can be calculated from the fundamental matrix (denoted as F) as

$L_2 = [X, Y, Z]^{T} = F P_1 = F [x_1, y_1, 1]^{T}$    (2)

where X, Y, and Z are the coefficients of the line vector. According to [16], the epipolar constraint can be formulated as

$P_2^{T} F P_1 = P_2^{T} L_2 = 0.$    (3)

Next, the distance between the feature point Pi (i = 2, 4) and the corresponding polar line is defined as the offset distance, denoted by the symbol d. The offset distance can be described as

$d_i = \dfrac{\left| P_i^{T} F P_1 \right|}{\sqrt{X^2 + Y^2}}.$    (4)

If the point P is a static space point, then jointly with (3) and (4), the offset distance of the point P2 is

$d_2 = \dfrac{\left| P_2^{T} F P_1 \right|}{\sqrt{X^2 + Y^2}} = 0.$    (5)

Equation (5) demonstrates that, in the ideal case, the feature point P2 in the current frame falls exactly on the polar line L2. In reality, however, the offset distance is generally greater than zero but below an empirical threshold ε due to the influence of various types of noise.
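The offset-distance test in (2)-(5) can be written compactly with OpenCV and NumPy. The sketch below is an illustration, not SG-SLAM's code: cv2.findFundamentalMat with FM_RANSAC stands in for the seven-point-plus-RANSAC estimation described above, and the 1.0-pixel reprojection threshold and 0.99 confidence are assumed values. Comparing the returned distances against the empirical threshold ε reproduces the test implied by (5).

```python
import cv2
import numpy as np


def offset_distances(pts_prev, pts_curr):
    """Offset distance d_i of each current-frame point from the epipolar line
    induced by its previous-frame match, following Eqs. (2)-(4).

    pts_prev, pts_curr: (N, 2) arrays of matched pixel coordinates,
    P1 in the previous frame and P2 in the current frame.
    """
    p1 = np.asarray(pts_prev, dtype=np.float32)
    p2 = np.asarray(pts_curr, dtype=np.float32)

    # RANSAC-based fundamental matrix from the matched set; threshold and
    # confidence are illustrative, not values reported in the paper.
    F, _ = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        raise RuntimeError("fundamental matrix estimation failed")

    ones = np.ones((len(p1), 1), dtype=np.float64)
    P1 = np.hstack([p1.astype(np.float64), ones])   # homogeneous P1
    P2 = np.hstack([p2.astype(np.float64), ones])   # homogeneous P2

    lines = (F @ P1.T).T                            # epipolar lines L2 = F P1, rows [X, Y, Z]
    num = np.abs(np.sum(P2 * lines, axis=1))        # |P2^T F P1|
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)  # sqrt(X^2 + Y^2)
    return num / den                                # d_i, to be compared with epsilon
```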
Algorithm 1 Dynamic Feature Rejection Strategy
Input: Previous frame, F1; Current frame, F2; Previous frame's feature points, P1; Current frame's feature points, P2; Standard empirical threshold, εstd
Output: The set of static feature points among the current frame's feature points, S
1: P1 = CalcOpticalFlowPyrLK(F2, F1, P2)
2: Remove matched pairs that are located at the edges and have too much variation in appearance
3: FundamentalMatrix = FindFundamentalMat(P2, P1, 7-point method based on RANSAC)
4: for each matched pair p1, p2 in P1, P2 do
5:   if (DynamicObjectsExist && IsInDynamicRegion(p2)) then
6:     if (CalcEpiLineDistance(p2, p1, FundamentalMatrix) × GetDynamicWeightValue(p2) < εstd) then
7:       Append p2 to S
8:     end if
9:   else
10:    if (CalcEpiLineDistance(p2, p1, FundamentalMatrix) < εstd) then
11:      Append p2 to S
12:    end if
13:  end if
14: end for
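A rough Python rendering of Algorithm 1 follows; it is a sketch, not the SG-SLAM implementation. The helpers passed in as arguments are assumptions: offset_distances can be the function sketched after (5), while in_dynamic_region and dynamic_weight stand for the a priori semantic test and the weight w in [1, 5] introduced in Section III-D below.

```python
def reject_dynamic_features(pts_prev, pts_curr, detections, eps_std,
                            offset_distances, in_dynamic_region, dynamic_weight):
    """Return the indices of current-frame feature points judged static,
    mirroring Algorithm 1.

    detections : 2-D semantic results (class label + bounding box) from the
                 object detecting thread; an empty list plays the role of
                 'DynamicObjectsExist' being false.
    eps_std    : standard empirical threshold.
    The last three arguments are injected helpers (assumed, not SG-SLAM API):
      offset_distances(pts_prev, pts_curr) -> per-point epipolar distance d_i
      in_dynamic_region(point, detections) -> True if the point lies inside an
                                              a priori dynamic object box
      dynamic_weight(point, detections)    -> weight w in [1, 5] for that object
    """
    d = offset_distances(pts_prev, pts_curr)
    static_idx = []
    for i, p2 in enumerate(pts_curr):
        if detections and in_dynamic_region(p2, detections):
            # Inside an a priori dynamic object: scale the epipolar distance by
            # the object's motion-probability weight before thresholding.
            if d[i] * dynamic_weight(p2, detections) < eps_std:
                static_idx.append(i)
        elif d[i] < eps_std:
            # Outside any prior dynamic object: plain epipolar-distance test.
            static_idx.append(i)
    return static_idx
```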
If the point P is not a static spatial point, as shown in Fig. 2, when the camera moves from the previous frame to the current frame, the point P also moves to P′. In this case, the point P1 is matched with the point P4 mapped from P′ to the current frame. If the point P moves without degeneration [16], then in general, the offset distance of P4 is greater than the threshold ε. In other words, feature points can be judged as dynamic or not by comparing the offset distance with the empirical threshold ε.

D. Dynamic Feature Rejection Strategy

To avoid the shortcomings of relying heavily on deep learning for dynamic feature judgment, our algorithm relies mainly on geometric information. The geometric information method judges whether a feature is dynamic by comparing the offset distance d with an empirical threshold ε. However, the threshold ε is very difficult to set [12]: setting it too small will make many static feature points wrongly judged as dynamic points, and setting it too large will miss many true dynamic feature points. This is because the purely geometric method cannot understand the scene at the semantic level and can only mechanically process all feature points using a fixed threshold.

To solve the above problem, all objects that can be detected by the object detecting thread are first classified as static objects and dynamic objects based on a priori knowledge. Any object with moving properties is defined as a dynamic object (e.g., a person or a car); otherwise, it is a static object. Then, two quantities are defined: the standard empirical threshold εstd and the dynamic weight value w. The standard empirical threshold εstd is set in a very straightforward way: just make sure that only obviously true dynamic feature points are rejected when using it alone. The dynamic weight value w is an a priori value in the range of 1-5, which is set according to the probability of the object moving. For example, a human normally moves with a high probability, and then w = 5; a chair normally does not move, and then w = 2.

With these preparations, all feature points in the current frame can be judged one by one. The dynamic feature rejection strategy is described in Algorithm 1.

E. Semantic Mapping

The ROS [13] is a set of software tool libraries that help developers quickly build robot applications. Rviz is a visualization tool in the ROS. In addition to the tracking thread that publishes camera poses to the ROS, the semantic mapping thread also publishes two kinds of data: 3-D point clouds and 3-D semantic objects. These data are then processed by Rviz to display an intuitive map interface.

For efficiency, only keyframes are used to construct semantic metric maps. When a new keyframe arrives, the semantic mapping thread immediately uses its depth image and pose to generate a 3-D ordered point cloud. The 3-D point cloud is subsequently published to the ROS, and a global Octo-map is built incrementally by the Octomap_server package. The global Octo-map has the advantages of being updatable, flexible, and compact, which can easily serve navigation and obstacle avoidance tasks. However, the Octo-map lacks semantic information, so it limits the capability of advanced task planning between mobile robots and semantic objects. Hence, a map with semantic objects and their coordinates is also necessary. The semantic mapping thread generates the 3-D semantic objects by combining the 2-D semantic information with the 3-D point clouds, and the main process is described as follows.

The 2-D object bounding box is captured in the dynamic feature rejection algorithm stage. The 3-D points in the bounding box region are fetched to calculate the 3-D semantic object information. Yet, since the bounding box contains some noisy regions of nontarget objects, it cannot accurately segment the semantic object outline. To acquire relatively accurate position and size information of the objects, the bounding box is first reduced appropriately. Next, we calculate the average
TABLE I
RESULTS OF METRIC ROTATIONAL DRIFT (RPE)

TABLE II
RESULTS OF METRIC TRANSLATIONAL DRIFT (RPE)

TABLE III
RESULTS OF METRIC ABSOLUTE TRAJECTORY ERROR (ATE)

depth of the point cloud corresponding to the bounding box region. Then, the depth of each point in the original bounding box is compared with the average depth, and the point is rejected if the difference is too large. Eventually, we filter the remaining point cloud and calculate its size and spatial centroid coordinates.

The above operation is performed for each piece of 2-D semantic information (except dynamic objects, e.g., people and dogs) in the current keyframe to obtain the 3-D semantic object data. During the operation of the system, the 3-D semantic object database can be continuously merged or updated according to the object class, centroid, and size information. By publishing this database through the ROS interface, the semantic metric maps can be visualized.

IV. EXPERIMENTAL RESULTS

In this section, we will experimentally evaluate and demonstrate the SG-SLAM system in four aspects. First, the tracking performance is evaluated with two public datasets. Second, we demonstrate the effectiveness of the dynamic feature rejection strategy and analyze the advantages of the fusion algorithm compared to the individual algorithms. Next, the system's real-time performance is evaluated. Finally, the visualization of the semantic objects and the global Octo-map is shown. The experiments were performed mainly on the NVIDIA Jetson AGX Xavier development kit with Ubuntu 18.04 as the system environment.

A. Performance Evaluation on TUM RGB-D Dataset

The TUM RGB-D dataset [35] is a large dataset provided by the Technical University of Munich Computer Vision Group to create a novel benchmark for visual odometry and SLAM systems. To evaluate the accuracy and robustness of the SG-SLAM system in dynamic scenes, the experiments mainly use five sequences under the dynamic objects category of the dataset. The first four of them are high dynamic scene sequences, and the fifth one, as a supplement, is a low dynamic scene sequence.

There are two main error evaluation metrics for the experiment. One is the absolute trajectory error (ATE), which is directly used to measure the difference between the ground-truth trajectory and the estimated trajectory. The other is the relative pose error (RPE), which is mainly used to measure rotational drift and translational drift. To evaluate the improvement in performance relative to the original system, the experimental results of SG-SLAM were compared with ORB-SLAM2. The evaluation comparison results on the five dynamic scene sequences are shown in Tables I-III.

The experimental results in Tables I-III show that our system improves by more than 93% in most metrics on the high dynamic sequences compared to the ORB-SLAM2 system. Figs. 3 and 4 show the experimental results of ATE and RPE for the two systems on the five sequences with an RGB-D camera input. As shown in the figures, the accuracy of the estimation results of our system in the high dynamic scene sequences [Figs. 3(a)-(d) and 4(a)-(d)] is significantly higher than that of ORB-SLAM2. In the experiments with low dynamic
Fig. 3. ATE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere. (e) fr3/sitting_static.

Fig. 4. RPE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere. (e) fr3/sitting_static.

scene sequences [Figs. 3(e) and 4(e)], the accuracy improvement is only 31.03% because the area and magnitude of dynamic object activity are small.

To further evaluate the effectiveness of the proposed algorithm, it is also compared with M-removal DVO [22], RDS-SLAM [26], ORB-SLAM3 [36], and other similar algorithms. The results are shown in Table IV. Although the DynaSLAM system using pixel-level semantic segmentation achieves a slight lead in individual sequence results, its real-time performance is weak (as shown in Table VII). All other methods have difficulty in achieving the highest accuracy because of the shortcomings described in Section II. Overall, from the experimental results, it can be concluded that SG-SLAM achieves a state-of-the-art level in terms of average accuracy improvement over all sequences.

B. Performance Evaluation on Bonn RGB-D Dataset

The Bonn RGB-D Dynamic Dataset [37] is a dataset with 24 dynamic sequences for the evaluation of RGB-D SLAM provided by Bonn University in 2019. To validate the generalization performance of the dynamic feature rejection algorithm, we performed another experimental evaluation using this dataset.

The experiment mainly selected nine representative sequences in the dataset. Among them, the "crowd" sequences are scenes of three people walking randomly in a room. The "moving no box" sequences show a person moving a box from the floor to a desk. The "person tracking" sequences are scenes where the camera is tracking a walking person. The "synchronous" sequences present scenes of several people jumping together in the same direction over and over again. In order to evaluate the accuracy performance of our system, it is mainly compared with the original ORB-SLAM2 system and the current state-of-the-art YOLO-SLAM system.

The evaluation comparison results on the nine dynamic scene sequences are shown in Table V. Only in the two "synchronous" sequences does SG-SLAM not perform as well as YOLO-SLAM. The main reason is that the human
TABLE IV
RESULTS OF METRIC ATE

Fig. 5. Dynamic feature rejection effect demonstration. The empirical threshold ε in (b) is 0.2 and in (c) is 1.0. (a) ORB-SLAM2. (b) and (c) SG-SLAM (G). (d) SG-SLAM (S). (e) SG-SLAM (S + G).

jump direction in the scene is similar to the polar line direction, leading to different degrees of degeneration of the algorithm [16]. The results in Table V show that our algorithm outperforms the other algorithms in most sequences. Not only does this once again prove that the SG-SLAM system achieves state-of-the-art accuracy and robustness in dynamic scenes, but it also proves its generalizability.

C. Effectiveness of Dynamic Feature Rejection Strategy

SG-SLAM combines geometric and semantic information to reject dynamic features, drawing on the advantages and avoiding the disadvantages of both methods. To validate the effectiveness of fusing geometric and semantic information, we designed comparative experiments. Fig. 5 shows the experimental results of these methods for detecting dynamic points. First, SG-SLAM (S) denotes a semantic information-only algorithm to reject dynamic feature points. Next, SG-SLAM (G) is only the geometry algorithm based on the epipolar constraint. Finally, SG-SLAM (S + G) uses the fusion algorithm based on geometric and semantic information. The experimental results are shown in Table VI.

Fig. 5(a) shows the results of ORB-SLAM2 extracting feature points: essentially no dynamic regions are processed. Fig. 5(b) and (c) shows the results of using only the epipolar constraint method at different empirical thresholds. At the low threshold [see Fig. 5(b)], many static feature points are misdetected and rejected (e.g., feature points at the corners of the TV monitor); at the high threshold [see Fig. 5(c)], some dynamic feature points on the walking people are missed. Next, Fig. 5(d) shows the results of feature point extraction using only the semantic information method: all feature points around the human body are brutally rejected. Finally, the experimental results of the SG-SLAM system combining semantic and geometric information are shown in Fig. 5(e). SG-SLAM rejects all feature points on the human body and retains as many static feature points outside the human body as possible, and its rejection effect is better than that of the first two algorithms.

The experimental results of the two algorithms based on separate information are mutually superior and inferior in different sequences. The algorithm combining both pieces of information shows the most accurate experimental results in all sequences. From the results in Table VI, the experimental data of each algorithm match the intuitive rejection effect in Fig. 5. This proves the effectiveness of fusing geometric and semantic information.

D. Timing Analysis

As the basic component of robot state estimation, the speed of SLAM directly affects the smooth execution of higher level tasks. Thus, we tested the average time cost of processing each frame when the system is running and compared it with other systems.

The timing results and hardware platforms are shown in Table VII. Since systems such as DS-SLAM, DynaSLAM, and YOLACT-based SLAM use pixel-level semantic segmentation networks, their average time cost per frame is expensive. YOLO-SLAM uses the end-to-end YOLO fast object detection algorithm, but it is very slow due to limitations such as system architecture optimization and hardware performance. The SG-SLAM system significantly increases the frame processing speed by using multithreading, the SSD object detection algorithm, and data multiplexing mechanisms. Compared to ORB-SLAM2, our work increases the average processing time per frame by less than 10 ms, which can meet the real-time performance requirements of mobile robots.

E. Semantic Mapping

To show the actual semantic mapping effect, the SG-SLAM system conducts mapping experiments on the TUM RGB-D dataset and the OpenLORIS-Scene dataset [38]. OpenLORIS-Scene is a dataset recorded by robots in real scenes, using a motion capture system to obtain real trajectories. This dataset is intended to help evaluate the
TABLE V
RESULTS OF METRIC ATE

TABLE VI
RESULTS OF METRIC ATE

TABLE VII
TIME ANALYSIS

Fig. 6. Semantic object map for the fr3_walking_xyz sequence.

Fig. 7. (a) Semantic object map and (b) global Octo-map for the cafe1-2 sequence of the OpenLORIS-Scene dataset.

maturity of SLAM and scene understanding algorithms in real deployments.

Fig. 6 shows the semantic object mapping effect of SG-SLAM in the fr3_walking_xyz sequence of the TUM RGB-D dataset. Fig. 7(a) and (b) shows the semantic object map and the global Octo-map built in the cafe1-2 sequence of the OpenLORIS-Scene dataset, respectively. The coordinates of the objects shown in the map are transformed from the origin point where the SLAM system is running. The semantic metric map and the global Octo-map not only enable mobile robots to navigate and avoid obstacles but also enable them to understand scenes at a higher level and perform advanced tasks.
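The object coordinates visualized in Figs. 6 and 7 come from the bounding-box depth filtering described in Section III-E (shrink the box, average the depth, discard far-off points, take the centroid). The sketch below is one hedged reading of that procedure, not the paper's code; the shrink factor, depth tolerance, and pinhole back-projection details are illustrative assumptions.

```python
import numpy as np


def semantic_object_from_box(depth, box, fx, fy, cx, cy,
                             shrink=0.8, depth_tol=0.3):
    """Estimate the centroid and rough size of one detected object from a
    depth image, following the bounding-box filtering idea of Section III-E.

    depth : (H, W) depth image in meters; box: (x_min, y_min, x_max, y_max);
    fx, fy, cx, cy: pinhole intrinsics. shrink and depth_tol are assumed,
    illustrative defaults, not values reported in the paper.
    """
    x0, y0, x1, y1 = box
    # Shrink the box toward its center to cut away background at the borders.
    w, h = (x1 - x0) * shrink, (y1 - y0) * shrink
    ux, uy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    x0, x1 = int(ux - w / 2), int(ux + w / 2)
    y0, y1 = int(uy - h / 2), int(uy + h / 2)

    patch = depth[y0:y1, x0:x1]
    valid = patch > 0
    if not np.any(valid):
        return None
    mean_d = patch[valid].mean()
    # Keep only points whose depth stays close to the average box depth.
    keep = valid & (np.abs(patch - mean_d) < depth_tol)

    vs, us = np.nonzero(keep)
    z = patch[keep]
    x = (us + x0 - cx) * z / fx               # back-project to camera coordinates
    y = (vs + y0 - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    centroid = pts.mean(axis=0)               # spatial centroid of the object
    size = pts.max(axis=0) - pts.min(axis=0)  # rough axis-aligned extent
    return centroid, size
```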
V. CONCLUSION

This article presents SG-SLAM, a real-time semantic visual SLAM toward dynamic scenes with an RGB-D camera input. SG-SLAM adds two new threads based on ORB-SLAM2: the object detecting thread and the semantic mapping thread. The system significantly improves real-time performance, accuracy, and robustness in dynamic scenes with the dynamic feature rejection algorithm. The semantic mapping thread reuses the 2-D semantic information to build the semantic object map with object coordinates and the global Octo-map. Experiments prove that improved traditional algorithms can achieve superior performance when deep learning is introduced and coupled with a proper engineering implementation.

There are still some shortcomings of the system that need to be addressed in the future. For example, the degeneration problem of dynamic objects moving along the polar line direction can cause the dynamic feature rejection algorithm to fail; the precision of the semantic metric map and the quantitative experimental analysis can also be improved.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I," IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 99-110, Jun. 2006.
[2] S. Shirmohammadi and A. Ferrero, "Camera as the instrument: The rising trend of vision based measurement," IEEE Instrum. Meas. Mag., vol. 17, no. 3, pp. 41-47, Jun. 2014.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. 6th IEEE ACM Int. Symp. Mixed Augmented Reality, Nov. 2007, pp. 225-234.
[4] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255-1262, Oct. 2017.
[5] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100-2106.
[6] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, "Kimera: An open-source library for real-time metric-semantic localization and mapping," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 1689-1696.
[7] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309-1332, Dec. 2016.
[8] I. Kostavelis and A. Gasteratos, "Semantic mapping for mobile robotics tasks: A survey," Robot. Auton. Syst., vol. 66, pp. 86-103, Apr. 2015.
[9] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076-4083, Oct. 2018.
[10] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168-1174.
[11] W. Wu, L. Guo, H. Gao, Z. You, Y. Liu, and Z. Chen, "YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint," Neural Comput. Appl., vol. 34, pp. 1-16, Apr. 2022.
[12] J. Chang, N. Dong, and D. Li, "A real-time dynamic object segmentation framework for SLAM system in dynamic scenes," IEEE Trans. Instrum. Meas., vol. 70, pp. 1-9, 2021.
[13] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., Kobe, Japan, 2009, vol. 3, no. 3, p. 5.
[14] Tencent. (2017). NCNN. [Online]. Available: https://github.com/Tencent/ncnn
[15] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189-206, 2013.
[16] A. Kundu, K. M. Krishna, and J. Sivaswamy, "Moving object detection by multi-view geometric techniques from a single camera mounted robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2009, pp. 4306-4312.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[18] M. A. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381-395, 1981.
[19] M. Piaggio, R. Fornaro, A. Piombo, L. Sanna, and R. Zaccaria, "An optical-flow person following behaviour," in Proc. IEEE Int. Symp. Intell. Control (ISIC), IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA), Intell. Syst. Semiotics (ISAS), 1998, pp. 301-306.
[20] D. Nguyen, C. Hughes, and J. Horgan, "Optical flow-based moving-static separation in driving assistance systems," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 1644-1651.
[21] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 7322-7328.
[22] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auton. Syst., vol. 108, pp. 115-128, Oct. 2018.
[23] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, "Semantic SLAM based on object detection and improved octomap," IEEE Access, vol. 6, pp. 75545-75559, 2018.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263-7271.
[25] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1-16, Jul. 2019.
[26] Y. Liu and J. Miura, "RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods," IEEE Access, vol. 9, pp. 23772-23785, 2021.
[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. ICCV, Jun. 2017, pp. 2961-2969.
[28] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481-2495, Jan. 2017.
[29] Ó. M. Mozos, R. Triebel, P. Jensfelt, A. Rottmann, and W. Burgard, "Supervised semantic labeling of places using information extracted from sensor data," Robot. Auton. Syst., vol. 55, no. 5, pp. 391-402, May 2007.
[30] C. Nieto-Granda, J. G. Rogers, A. J. B. Trevor, and H. I. Christensen, "Semantic map partitioning in indoor environments using regional analysis," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2010, pp. 1451-1456.
[31] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, "Meaningful maps with object-oriented semantic mapping," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5079-5085.
[32] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21-37.
[33] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 1314-1324.
[34] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge 2007 results," 2008. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[35] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573-580.
[36] C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874-1890, Dec. 2021.
[37] E. Palazzolo, J. Behley, P. Lottes, P. Giguere, and C. Stachniss, "ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 7855-7862.
[38] X. Shi et al., "Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 3139-3145.
Shuhong Cheng was born in Daqing, Heilongjiang, China, in 1978. She received the B.S., M.S., and Ph.D. degrees from Yanshan University, Qinhuangdao, China, in 2001, 2007, and 2012, respectively. She studied as a Visiting Scholar at the University of Reading, Reading, U.K., in 2014. After her Ph.D. degree, she has been working as a Professor at Yanshan University since 2019. She has published about 50 papers in journals and international conferences and holds eight computer software copyrights. She has been granted more than four Chinese invention patents. Since 2012, she has presided over and undertaken more than ten national projects. Her current research interests are in rehabilitation robots, assisting robots for the disabled and the elderly, and computer vision.

Shijun Zhang (Student Member, IEEE) was born in Lianyungang, China, in 1993. He received the bachelor's and master's degrees in control engineering from Yanshan University, Qinhuangdao, China, in 2016 and 2019, respectively, where he is currently pursuing the Ph.D. degree in mechanical engineering. His main research directions include mobile robot control and perception, computer vision, and deep learning.

Changhe Sun was born in Tangshan, China, in 1996. He received the bachelor's degree in communication engineering from the Chongqing University of Technology, Chongqing, China, in 2019. He is currently pursuing the master's degree with the School of Electrical Engineering, Yanshan University, Qinhuangdao, China. His main research interests include simultaneous localization and mapping (SLAM), computer vision, and robotics.

Dianfan Zhang was born in Jilin, China, in 1978. He received the bachelor's and master's degrees in control engineering and the Ph.D. degree from Yanshan University, Qinhuangdao, China, in 2001, 2006, and 2010, respectively. His main research directions include mobile robot control and signal processing.
View File
@ -0,0 +1,520 @@
Received January 10, 2022, accepted January 27, 2022, date of publication February 7, 2022, date of current version February 18, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3149885
The STDyn-SLAM: A Stereo Vision and Semantic
Segmentation Approach for VSLAM in Dynamic
Outdoor Environments
DANIELA ESPARZA AND GERARDO FLORES, (Member, IEEE)
Laboratorio de Percepción y Robótica [LAPyR], Centro de Investigaciones en Óptica (CIO), León, Guanajuato 37150, Mexico
Corresponding author: Gerardo Flores (gflores@cio.mx)
This work was supported in part by the Consejo Nacional de Ciencia y Tecnología (CONACYT), Fondo Institucional de Fomento Regional
para el Desarrollo Científico, Tecnológico y de Innovación (FORDECYT) under Grant 292399.
ABSTRACT Visual Simultaneous Localization and Mapping (VSLAM) is a system that uses scene features to estimate a map and the system pose. Commonly, VSLAM algorithms are focused on a static environment; however, dynamic objects are present in the vast majority of real-world applications. This work presents a feature-based SLAM system focused on dynamic environments that uses convolutional neural networks, optical flow, and depth maps to detect objects in the scene. The proposed system employs a stereo camera as the primary sensor to capture the scene. The neural network is responsible for object detection and segmentation to avoid erroneous maps and wrong system locations. Moreover, the proposed system's processing time is fast, and it can run in real time in outdoor and indoor environments. The proposed approach has been compared with the state of the art; besides, we present several outdoor experimental results that corroborate the approach's effectiveness. Our code is available online.
INDEX TERMS VSLAM, dynamic environment, stereo vision, neural network.
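Before the details in the following sections, the core idea stated in the abstract, dropping features that fall on segmented dynamic objects before pose estimation, can be illustrated with a short sketch. The function below is hypothetical and is not STDyn-SLAM code; the PASCAL VOC class id used for "person" is only an assumed convention, and keypoints are assumed to lie inside the image.

```python
import numpy as np


def filter_keypoints_by_mask(keypoints, seg_mask, dynamic_ids=(15,)):
    """Keep only keypoints that do not lie on pixels labeled as a dynamic class.

    keypoints   : (N, 2) array of (u, v) pixel coordinates.
    seg_mask    : (H, W) array of per-pixel class ids from the segmentation net.
    dynamic_ids : class ids treated as dynamic; 15 ('person' in the usual
                  PASCAL VOC ordering) is only an illustrative assumption.
    """
    u = keypoints[:, 0].astype(int)
    v = keypoints[:, 1].astype(int)
    labels = seg_mask[v, u]               # class id under each keypoint
    keep = ~np.isin(labels, dynamic_ids)  # True where the pixel is static
    return keypoints[keep]
```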
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) systems are strategic for developing the following navigation techniques. This is mainly due to their fundamental utility in solving the problem of autonomous exploration tasks in unknown environments such as mines, highways, farmlands, underwater/aerial environments, and, in broad terms, indoor and outdoor scenes. The problem of SLAM for indoor environments has been investigated for years, where usually RGB-D cameras or Lidars are the primary sensors to capture scenes [1]-[3]. Indoors, dynamic objects are usually more controllable, unlike outdoors, where dynamic objects are inherent to the scene.

On the other hand, the vast majority of SLAM systems are focused on the assumption of static environments, such as HECTOR-SLAM [4], Kintinuous [5], MonoSLAM [6], PTAM [7], SVO [8], LSD-SLAM [9], among others. Since this assumption is strong, the system is restricted to work in static environments. However, in dynamic environments, moving objects can generate an erroneous map and wrong poses because dynamic features cause a bad pose estimation and incorrect data. For this reason, new approaches have arisen for solving the dynamic environment problem, such as NeuroSLAM [10], hierarchical Outdoor SLAM [11], and Large-Scale Outdoor SLAM [12].

In this work, we propose a method called STDyn-SLAM for solving the VSLAM problem in dynamic outdoor environments using stereo vision [19]. Fig. 1 depicts a sketch of our proposal in real experiments. The first row shows the input images, where a potentially dynamic object is present in the scene and is detected by a semantic segmentation neural network. Fig. 1d depicts the 3D reconstruction excluding dynamic objects. To evaluate our system, we carried out experiments in different outdoor scenes, and we qualitatively compared the 3D reconstructions taking into account the exclusion of dynamic objects. We conducted experiments using sequences from the KITTI dataset, and they are compared with state-of-the-art systems. Furthermore, our approach is implemented in ROS, in which we use the depth image from a stereo camera for making the 3D reconstruction using the octomap. Also, we analyzed the processing time using
TABLE 1. This table shows the state-of-the-art of the SLAM problem considering dynamic environments.

FIGURE 1. The STDyn-SLAM results in scenes with moving objects. First row: input images with two dynamic objects. Second row: 3D reconstruction performed by the STDyn-SLAM discarding moving objects.

different datasets. Further, we publish our code, which is available on GitHub.1 Also, a video is available on YouTube. The main contributions are itemized as follows:
• We propose a stereo SLAM for dynamic environments using a semantic segmentation neural network and geometrical constraints to eliminate the dynamic objects.
• We use the depth image from a stereo camera for making the 3D reconstruction using the octomap. The depth image is not necessary for the SLAM process.
• This work was tested using the KITTI and EuRoC MAV datasets, and we compared our system with the stereo configuration systems from the state of the art. In addition, we obtained results from outdoor and indoor environments of our own sequences.
• Some results are shown in a YouTube video, and the STDyn-SLAM is available as a GitHub repo.

1https://github.com/DanielaEsparza/STDyn-SLAM

The rest of the paper is structured as follows. Section II mentions the related work of SLAM in dynamic environments. Then, in Section III, we show the main results and the STDyn-SLAM algorithm. Section IV presents the real-time experiments of STDyn-SLAM in outdoor environments with moving objects; we compare our approach with state-of-the-art methods using the KITTI dataset. Finally, the conclusions and the future work are given in Section V.

II. RELATED WORK

A. CLASSIC APPROACHES

The classical methods do not consider artificial intelligence. Some of these approaches are based on optical flow, epipolar geometry, or a combination of the two. For example, in [20], Yang et al. propose a SLAM system using an RGB-D camera and two encoders for estimating the pose and building an OctoMap. The dynamic pixels are removed using an object detector and K-means to segment the point cloud. On the other hand, in [21], Gimenez et al. present CP-SLAM, based on continuous probabilistic mapping and a Markov random field; they use the iterated conditional modes. Wang et al. [22] propose a SLAM system for indoor environments based on an RGB-D camera. They use the number of features on the static scene and assume that the parallax between consecutive images is a movement constraint. In [23], Cheng, Sun, and Meng implement an optical-flow and five-point algorithm approach to obtain dynamic features. In [24], Ma and Jia proposed a visual SLAM for dynamic
FIGURE 2. A block diagram showing the algorithm steps of the STDyn-SLAM.

environments, detecting the moving objects in the scene using optical flow. Furthermore, they use the RANSAC algorithm to improve the computation of the homography matrix. In [25], Sun et al. proposed an RGB-D system for detecting moving objects based on ego-motion to compensate for the camera movement and then obtaining the frame difference. The result of the frame difference helps to detect the moving object. After that, Sun et al. proposed in [26] an RGB-D system for motion removal based on a foreground model. This system does not require prior information.

B. ARTIFICIAL-INTELLIGENCE-BASED APPROACHES

Thanks to the growing use of deep learning, researchers have proposed some SLAM systems using artificial-intelligence-based approaches. Table 1 summarizes the state of the art in this regard. Some works, such as Dosovitskiy et al. [27], Ilg et al. [28], and Mayer et al. [29], used optical flow and supervised learning for detecting and segmenting moving objects.

In [30], Xu et al. proposed an instance segmentation of the objects in the scene based on the COCO dataset [31]. The geometric and motion properties are detected and used to improve the mask boundaries. Also, they tracked the visible objects and moving objects and estimated the system's pose. Several works are based on RGB-D cameras, such as [15], [17], and [18]. Cui and Ma [15] proposed SOF-SLAM, an RGB-D system based on ORB-SLAM2, which combines a neural network for semantic segmentation and optical flow for removing dynamic features. Zhao et al. [17] proposed an RGB-D framework for dynamic scenes, where they combined the Mask R-CNN, edge refinement, and optical flow to detect the probably dynamic objects. Henein et al. [18] proposed a system based on an RGB-D camera and proprioceptive sensors for tackling the SLAM problem. They employ a factor graph model and an instance-level object segmentation algorithm for the classification of objects and the tracking of features. The proprioceptive sensors are used to estimate the camera pose. Also, some works use a monocular camera, for instance, the DSOD-SLAM presented in [16]. Ma et al. employ a semantic segmentation network, a depth prediction network, and geometry properties to improve the results in dynamic environments. Our work is built on the well-known ORB-SLAM2 [32], taking some ideas from the DS-SLAM system [33]. In DS-SLAM, the authors used stored images from an RGB-D camera for solving the SLAM problem in indoor dynamic environments. Nevertheless, the depth map obtained from an RGB-D camera is hard to use in external environments. In [34], Cheng et al. proposed a SLAM system for building a semantic map in dynamic environments using CRF-RNN for segmenting objects. Bescos et al. in [14] proposed a system for object detection using the Mask R-CNN, and their method inpaints the background using the information from previous images. An update of [14] is [35], where Bescos et al. proposed a visual SLAM based on the trajectories of the objects and a bundle adjustment.

III. METHODS

In this section, we present and describe the framework of the STDyn-SLAM with all the parts that compose it. A block
D. Esparza, G. Flores: STDyn-SLAM: Stereo Vision and Semantic Segmentation Approach for VSLAM
A block diagram describing the framework's pipeline is depicted in Fig. 2, where the inputs at the instant time t are the stereo pair, the depth image, and the left image captured at t − 1 (a.k.a. the previous left image). The process starts with extracting ORB features in the stereo pair and in the past left image. Then, the optical flow and epipolar geometry image processing follow. Next, the neural network segments potentially moving objects in parallel in the current left image. To remove outliers (features inside dynamic objects) and estimate the visual odometry, it is necessary to compute the semantic information and the movement checking process. Finally, the 3D reconstruction is computed from the segmented image, the visual odometry, the current left frame, and the depth image. These processes are explained in detail in the following subsections.

A. STEREO PROCESS
Motivated by the vast applications of robotics outdoors, where dynamic objects are present, we propose that our STDyn-SLAM system be focused on stereo vision. A considerable advantage of this is that the depth estimation from a stereo camera is directly given as a distance measure. The process described in this part is depicted in Fig. 2, where three main tasks are developed: feature extraction, optical flow, and epipolar geometry. Let's begin with the former.

The first step of the stereo process is acquiring the left, right, and depth frames from a stereo camera. Then, a local feature detector is applied to the stereo pair and the previous left image. As a feature detector, we use the Oriented FAST and Rotated BRIEF (ORB) detector, which yields the well-known ORB features [36]. Once the ORB features are found, optical flow and a process using epipolar geometry are conducted.

To avoid dynamic objects not classified by the neural network (explained in the following subsection), the STDyn-SLAM computes optical flow using the previous and current left frames. This step employs a Harris detector to compute the optical flow. Remember, these features are different from the ORB ones. A Harris point pair is discarded if at least one of the points lies on the image edge or close to it.

From the fundamental matrix, ORB features, and optical flow, we compute the epipolar lines. Thus, we can map the matched features from the current left frame into the previous left frame. The distance from the corresponding epipolar line to the mapped feature in the past left image determines an inlier or an outlier. Please refer to the remove-outliers block in Fig. 2. Notice that the ORB features of the car in the left image were removed, but the points on the right frame remain unchanged. This is because removing the points in the right images adds computational cost and is unnecessary.
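A minimal sketch of this front-end step is given below, assuming a Python/OpenCV setting (the released package itself is a ROS-based implementation); the detector parameters, the function name track_front_end, and the image variables prev_left and curr_left are illustrative choices, not values taken from the paper.

    import cv2
    import numpy as np

    def track_front_end(prev_left, curr_left):
        """Sketch of the stereo front-end: ORB features plus Harris-seeded optical flow."""
        # ORB features on the current left frame (used later for matching and odometry).
        orb = cv2.ORB_create(nfeatures=2000)
        kp, des = orb.detectAndCompute(curr_left, None)

        # Harris corners on the previous left frame, tracked into the current one with
        # pyramidal Lucas-Kanade optical flow (these points are independent of the ORB ones).
        p0 = cv2.goodFeaturesToTrack(prev_left, maxCorners=500, qualityLevel=0.01,
                                     minDistance=7, useHarrisDetector=True)
        p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_left, curr_left, p0, None)
        good_prev = p0[status.ravel() == 1].reshape(-1, 2)
        good_curr = p1[status.ravel() == 1].reshape(-1, 2)

        # Fundamental matrix from the tracked pairs; RANSAC rejects gross mismatches
        # and provides the epipolar geometry used for the outlier test below.
        F, inlier_mask = cv2.findFundamentalMat(good_prev, good_curr, cv2.FM_RANSAC)
        return kp, des, good_prev, good_curr, F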
B. ARTIFICIAL NEURAL NETWORK'S ARCHITECTURE
The approach we use is to eliminate the ORB features on dynamic objects. To address this, we need to discern the natural dynamic objects among all the objects in the scene. It is here where the NN depicted in Fig. 2 is introduced. In the NN block of that figure, a semantic segmentation neural network is shown, with the left image as input and a segmented image with the objects of interest as output. This NN is a pixel-wise classification and segmentation framework. The STDyn-SLAM implements a particular NN of this kind called SegNet [37], which is an encoder-decoder network based on the VGG-16 model [38]. The encoder of this architecture has thirteen convolutional layers with batch normalization and a ReLU non-linearity, divided into five encoders, with five non-overlapping max-pooling and sub-sampling layers located at the end of each encoder. Since each encoder is connected to a corresponding decoder, the decoder architecture has the same number of layers as the encoder architecture, and every decoder has an upsampling layer at first. The last layer is a softmax classifier. SegNet classifies pixel-wise using a model based on the PASCAL VOC dataset [39], which consists of twenty classes. Each pixel can be classified into one of the following classes: airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV/monitor.

Notwithstanding the above, not all feature points in the left frame are matched in the right frame. For that reason, and to save computing resources, SegNet classifies the objects of interest only on the left input image.
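As a rough illustration of this kind of encoder-decoder stage, the following PyTorch-style sketch shows convolution + batch normalization + ReLU blocks with the max-pooling indices reused for upsampling; it is not the authors' SegNet model, and the channel sizes, input resolution, and per-pixel classifier are placeholder assumptions.

    import torch
    import torch.nn as nn

    class EncoderStage(nn.Module):
        """Conv + BatchNorm + ReLU blocks followed by 2x2 max-pooling that keeps indices."""
        def __init__(self, in_ch, out_ch, n_convs=2):
            super().__init__()
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            self.convs = nn.Sequential(*layers)
            self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

        def forward(self, x):
            return self.pool(self.convs(x))  # (pooled features, pooling indices)

    class DecoderStage(nn.Module):
        """Unpooling with the stored indices, then the mirrored conv blocks."""
        def __init__(self, in_ch, out_ch, n_convs=2):
            super().__init__()
            self.unpool = nn.MaxUnpool2d(2, stride=2)
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            self.convs = nn.Sequential(*layers)

        def forward(self, x, indices):
            return self.convs(self.unpool(x, indices))

    # Example: one stage down and up, plus a per-pixel classifier over 21 outputs
    # (20 PASCAL VOC classes + background); the sizes are illustrative only.
    enc, dec = EncoderStage(3, 64), DecoderStage(64, 64)
    classifier = nn.Conv2d(64, 21, kernel_size=1)
    x = torch.randn(1, 3, 360, 480)
    feats, idx = enc(x)
    logits = classifier(dec(feats, idx))  # (1, 21, 360, 480) class scores per pixel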
1) OUTLIERS REMOVAL
FIGURE 3. The cases of inliers and outliers. Green: x1 and x2 are inliers; the distance from the point to the corresponding epipolar line l' is less than a threshold. Red: x3 is an outlier, since the distance is greater than the threshold.

Once all the previous steps have been accomplished, a threshold is selected to determine whether each feature is an inlier or an outlier. Fig. 3 depicts the three cases of a mapped feature. Let x1, x2, and x3 denote the ORB features from the previous left image; x'1, x'2, and x'3 are the corresponding features from the current left image; X and X' represent the homogeneous coordinates of x and x', respectively; F is the fundamental matrix; and l'1 = FX1, l'2 = FX2, and l'3 = FX3 are the epipolar lines. The first and second cases correspond to inliers: x'1 lies on l'1, and the distance from x'2 to l'2 is less than the threshold. The third case is an outlier because the distance from x'3 to l'3 is greater than the threshold. To compute the distance between the point x' and the epipolar line l', we proceed as follows,

    d(X', l') = \frac{X'^{\top} F X}{\sqrt{(FX)_1^2 + (FX)_2^2}},    (1)

where the subindices of (FX)_1 and (FX)_2 denote the elements of the epipolar line. If the distance is larger than the threshold, the feature point is considered an outlier, i.e., a dynamic feature.

Remember that SegNet, described before, semantically segments the left image into object classes. The semantic segmentation enhances the rejection of ORB features on the possible dynamic objects. The ORB features inside segmented objects, and thus possible moving objects, are rejected. The remaining points are matched with the ORB features from the right image.
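A small NumPy sketch of this combined test is shown below; it implements the distance of Eq. (1) (taking its absolute value) and adds the semantic check. The threshold value and the mask convention (non-zero pixel = potentially dynamic class) are assumptions for illustration.

    import numpy as np

    def epipolar_distance(F, x_prev, x_curr):
        """Distance from the current-frame point to the epipolar line l' = F X
        of its previous-frame match, as in Eq. (1)."""
        X = np.append(x_prev, 1.0)    # homogeneous coordinates of the previous point
        Xp = np.append(x_curr, 1.0)   # homogeneous coordinates of the current point
        l = F @ X                     # epipolar line in the current image
        return abs(Xp @ l) / np.hypot(l[0], l[1])

    def is_static_feature(F, x_prev, x_curr, seg_mask, thresh=1.0):
        """Keep a feature only if it passes both the geometric and the semantic test."""
        u, v = int(round(x_curr[0])), int(round(x_curr[1]))
        on_dynamic_class = seg_mask[v, u] != 0   # assumed: non-zero label = dynamic class
        return (epipolar_distance(F, x_prev, x_curr) < thresh) and not on_dynamic_class

Features failing either test are treated as dynamic and excluded from the pose estimation, as described above.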
C. VISUAL ODOMETRY
Because the system is based on ORB-SLAM2, the VSLAM computes the odometry visually. Therefore, the next step needs the ORB features to estimate the depth for each feature pair. The features are classified as mono and stereo and will be necessary to track the camera's pose. Again, this step is merely the process from ORB-SLAM2.

D. 3D RECONSTRUCTION
Finally, the STDyn-SLAM builds a 3D reconstruction from the left, segmented, and depth images using the visual odometry. First, the 3D reconstruction process checks each pixel of the segmented image to reject the points corresponding to the classes of objects selected as dynamic in Section III-B. Then, if the pixel is not considered a dynamic object, the equivalent pixel from the depth image is added to the point cloud, and the assigned color of the point is obtained from the left frame. This step builds a local point cloud only at the current pose of the system, and then OctoMap [40] joins and updates the local point clouds into a full point cloud.

Remark 1: It is essential to mention that we apply the semantic segmentation, optical flow, and geometry constraints only to the left image to avoid increasing the execution time. Moreover, the right-hand-side frame segmentation is unnecessary because the feature selection rejects the ORB features inside dynamic objects from the left image, so the corresponding points from the right frame will not be matched.
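A rough sketch of this back-projection step is given below, assuming a pinhole camera model with intrinsics fx, fy, cx, cy, a metric depth image aligned with the left frame, and a set of segmentation label IDs treated as dynamic; the names and label values are illustrative, not the package's API.

    import numpy as np

    def local_point_cloud(depth, seg, left_rgb, fx, fy, cx, cy, dynamic_ids=(7, 12, 15)):
        """Back-project non-dynamic pixels to 3D and color them from the left frame.
        dynamic_ids is an assumed tuple of segmentation labels (e.g. car, dog, person)."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        keep = (depth > 0) & ~np.isin(seg, list(dynamic_ids))  # drop dynamic classes
        z = depth[keep]
        x = (u[keep] - cx) * z / fx
        y = (v[keep] - cy) * z / fy
        points = np.stack([x, y, z], axis=1)   # Nx3 points in the camera frame
        colors = left_rgb[keep]                # Nx3 RGB colors taken from the left image
        return points, colors

In the actual pipeline, OctoMap [40] then fuses these local clouds into the global map.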
IV. EXPERIMENTS
This section tests our STDyn-SLAM algorithm in real-time scenes and on the KITTI datasets. Our system's experiments were compared to other state-of-the-art systems to evaluate the 3D reconstruction and the odometry. The results of the 3D map were assessed qualitatively because of the nature of the experiment. We employ the Absolute Pose Error (APE) metric for the odometry.

A. HARDWARE AND SOFTWARE SETUP
We tested our system on an Intel Core i7-7820HK laptop computer with 32 GB of RAM and a GeForce GTX 1070 GPU. Moreover, we used as input a ZED camera, which is a stereo camera developed by Stereolabs. We selected the HD720 resolution. The ZED camera resolutions are WVGA (672 × 376), HD720 (1280 × 720), HD1080 (1920 × 1080), and 2.2K (2208 × 1242).

The STDyn-SLAM is developed natively on ROS. Our system's main inputs are the left and right images, but the depth map is necessary to build the point cloud. However, if it is not available, it is possible to execute the STDyn-SLAM only with the stereo images and then obtain the trajectory. On the other hand, the STDyn node in ROS generates two main topics: the Odom and the ORB_SLAM2_PointMap_SegNetM/Point_Clouds topics. The point cloud topic is the input of the octomap_server node; this node publishes the joined point cloud of the scene.

FIGURE 4. Diagram of the ROS nodes of the STDyn-SLAM required to generate the trajectory and 3D reconstruction. The circles represent each process's ROS node, and the arrows are the ROS topics published by the ROS nodes. The continuous arrows depict the final ROS topics.

Fig. 4 depicts the ROS nodes required by the STDyn-SLAM to generate the trajectory and the 3D reconstruction. The camera node publishes the stereo images and computes the depth map from the left and right frames. Then, the STDyn-SLAM calculates the odometry and the local point cloud. The OctoMap node combines and updates the current local point cloud with the previous global map to visualize the global point cloud. It is worth mentioning that the user can choose the maximum depth of the local point cloud. All the ROS topics can be shown through the viewer.
B. REAL-TIME EXPERIMENTS
We present real-time experiments under three different scenarios, explained next.

First, we test the STDyn-SLAM in an outdoor environment where a car is parked and then moves forward. In this case, a static object (a car) becomes dynamic; see Fig. 5. This figure shows the 3D reconstruction, where the car appears static in the first images of the sequence, Fig. 5 a). Then, the car becomes a dynamic object when it moves forward (Fig. 5 b), so the STDyn-SLAM is capable of filling the empty zone if the scene is covered again, as is the case in Fig. 5 c).

FIGURE 5. The STDyn-SLAM when a static object becomes dynamic. Images a) and b) correspond to the left images from a sequence. Image c) is the 3D reconstruction of the environment; the red dots are the trajectory. The OctoMap node fills empty areas along the sequence of images.

The second experiment tests our system in an indoor environment. The scene consists of a moving person crossing from left to right. Subfigures a) and b) of Fig. 6 depict the left and right images, and c) shows the 3D reconstruction. The area occupied by the moving person is filled after the zone becomes visible.

FIGURE 6. The 3D reconstruction from STDyn-SLAM in an indoor environment. In the scene there is a moving person crossing from left to right. The VSLAM system considers the person as a dynamic object.

The third experiment consists of a scene sequence with two parked cars, a walking person, and a dog. Even though the vehicles are static, the rest of the objects move. Fig. 7a shows the scene taking into account the potentially dynamic entities. However, a car can change its position; the STDyn-SLAM excludes the probable moving bodies (the parked cars) to avoid multiple plotting throughout the reconstruction. This is depicted in Fig. 7b.

FIGURE 7. The 3D reconstruction, with the presence of static objects (two parked cars) and dynamic objects (a person and two dogs). Notice that the person and dogs are not visualized in the scene due to the effect of the STDyn-SLAM. Fig. a) depicts the static objects. Nevertheless, the vehicles are potentially dynamic objects; thus, in Fig. b), the STDyn-SLAM excludes these bodies considering their possible movement.

As a fourth experiment, we compared the point clouds from RTABMAP and the STDyn-SLAM. The sequence was carried out outdoors with a walking person and two dogs. Since RTABMAP generates a point cloud of the scene, we decided to compare it with our system. To build the 3D reconstructions from RTABMAP, we provided the left and depth images, the camera info, and the odometry as inputs. We used stereo and depth images; the intrinsic parameters are saved in a text file in the ORB-SLAM2 package. Fig. 8 shows the 3D reconstructions. In Fig. 8a, our system excludes the dynamic objects. On the other hand, in Fig. 8b, RTABMAP plotted the dynamic objects on different sides of the scene, resulting in an incorrect map of the environment.

FIGURE 8. Experiment comparison between the STDyn-SLAM and RTABMAP [41]. Image a) shows the 3D reconstruction given by STDyn-SLAM; it eliminates the effect of dynamic objects on the mapping. Image b) shows the point cloud created by RTABMAP; notice how dynamic objects are mapped along the trajectory. This is undesirable behavior.

C. COMPARISON OF THE STATE OF THE ART AND OUR SLAM USING THE KITTI AND EurocMav DATASETS
We compare our VSLAM with the DynaSLAM1 [14] and ORB-SLAM2 approaches. We selected sequences with dynamic objects, with loop closure, and without loop closure to evaluate the SLAM systems. Therefore, we chose the 00–10 sequences from the KITTI odometry dataset [42] and, furthermore, all sequences from the EurocMav dataset except V1_03 and V2_03. Moreover, we employed the EVO tools [43] to evaluate the Absolute Pose Error (APE) and the Relative Pose Error (RPE), and the RGB-D tools [44] to calculate the Absolute Trajectory Error (ATE).

We present the results of APE, RPE, and ATE in different tables, divided depending on the dataset evaluated. Tables 2 and 3 show the APE experiments on the KITTI and EurocMav datasets, respectively. Tables 4 and 5 correspond to the RPE, and Tables 6 and 7 present the ATE results. We did not evaluate the EurocMav dataset with DynaSLAM1 due to the excessive processing time needed to compute the trajectories.

TABLE 2. Comparison of Absolute Pose Error (APE) on the KITTI dataset.
TABLE 3. Comparison of Absolute Pose Error (APE) on the Euroc-Mav dataset.
TABLE 4. Comparison of Relative Pose Error (RPE) on the KITTI dataset.
TABLE 5. Comparison of Relative Pose Error (RPE) on the Euroc-Mav dataset.
TABLE 6. Comparison of Absolute Trajectory Error (ATE) on the KITTI dataset.
TABLE 7. Comparison of Absolute Trajectory Error (ATE) on the Euroc-Mav dataset.
TABLE 8. Comparison of the Score Sρ(a, b) on the datasets.

To evaluate whether the differences in the ATE evaluation are significant, we computed the Score Sρ [45] over the sequences of the EurocMav and KITTI datasets of Tables 6 and 7. The results in Table 8 show an improvement of our system over ORB-SLAM2 on the trajectories of the EurocMav dataset. On the KITTI dataset, STDyn-SLAM and ORB-SLAM2 are not significantly different. In evaluating our system against DynaSLAM1, DynaSLAM1 is slightly better.
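For reference, the ATE comparison reduces to a root-mean-square error over associated positions; the following NumPy sketch omits the trajectory alignment that evaluation tools such as EVO [43] and the RGB-D tools [44] perform, so it is only an illustration of the metric, not a replacement for those tools.

    import numpy as np

    def ate_rmse(traj_est, traj_gt):
        """Root-mean-square translational error between two associated trajectories,
        given as Nx3 arrays of positions already expressed in a common frame."""
        err = np.linalg.norm(traj_est - traj_gt, axis=1)
        return float(np.sqrt(np.mean(err ** 2)))

    # Illustrative use with synthetic data: a trajectory and a slightly noisy estimate.
    gt = np.cumsum(np.random.randn(100, 3) * 0.1, axis=0)
    est = gt + np.random.randn(100, 3) * 0.02
    print(f"ATE RMSE: {ate_rmse(est, gt):.3f} m")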
D. PROCESSING TIME
In this section, we analyze the processing time of this work. For the study, we evaluate several datasets with different types of images. The analysis consists of obtaining the processing time of each sequence with the same characteristics and calculating the average of the sequence means. Table 9 shows the times obtained with the datasets. We use the KITTI and EurocMav datasets for the RGB and Gray columns. Since those sequences do not provide a depth image, we did not build a 3D reconstruction for them. For the last column, we utilized our own trajectories. In addition, our dataset contains depth images, so we plotted a 3D reconstruction. For this reason, the processing time is longer.

TABLE 9. Processing time.

V. CONCLUSION
This work presents the STDyn-SLAM system for outdoor and indoor environments where dynamic objects are present. The STDyn-SLAM is based on images captured by a stereo pair for the 3D reconstruction of scenes, where the possible dynamic objects are discarded from the map; this allows a trustworthy point cloud. The system's capability for computing a reconstruction and localization in real time depends on the computer's processing power, since a GPU is necessary to support the processing. However, with a medium-range computer, the algorithms work correctly.

In the future, we plan to implement an optical flow approach based on the latest generation of neural networks to improve dynamic object detection. The implementation of neural networks allows replacing classic methods such as geometric constraints. Furthermore, we plan to increase the size of the 3D map to reconstruct larger areas and obtain longer reconstructions of the scenes. The next step is implementing the algorithm on an aerial manipulator constructed in the lab.

SUPPLEMENTARY MATERIAL
The implementation of our system is released on GitHub and is available under the following link: https://github.com/DanielaEsparza/STDyn-SLAM
Besides, this letter has supplementary video material available at https://youtu.be/3tnkwvRnUss, provided by the authors.
REFERENCES
[1] J. Castellanos, J. Montiel, J. Neira, and J. Tardos, The SPmap: A probabilistic framework for simultaneous localization and map building, IEEE Trans. Robot. Autom., vol. 15, no. 5, pp. 948–952, 1999.
[2] G. Dissanayake, H. Durrant-Whyte, and T. Bailey, A computationally efficient solution to the simultaneous localisation and map building (SLAM) problem, in Proc. IEEE Int. Conf. Robot. Automation. Symposia (ICRA), 2000, pp. 1009–1014.
[3] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, FastSLAM: A factored solution to the simultaneous localization and mapping problem, in Proc. AAAI Nat. Conf. Artif. Intell., 2002, pp. 593–598.
[4] S. Kohlbrecher, O. von Stryk, J. Meyer, and U. Klingauf, A flexible and scalable SLAM system with full 3D motion estimation, in Proc. IEEE Int. Symp. Saf., Secur., Rescue Robot., Nov. 2011, pp. 155–160.
[5] T. Whelan, J. McDonald, M. Kaess, M. Fallon, H. Johannsson, and J. J. Leonard, Kintinuous: Spatially extended KinectFusion, in Proc. RSS Workshop RGB-D, Adv. Reasoning with Depth Cameras, Jul. 2012, pp. 1–10.
[6] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, MonoSLAM: Real-time single camera SLAM, IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007.
[7] Y. Kameda, Parallel tracking and mapping for small AR workspaces (PTAM) augmented reality, J. Inst. Image Inf. Telev. Engineers, vol. 66, no. 1, pp. 45–51, 2012.
[8] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, SVO: Semidirect visual odometry for monocular and multicamera systems, IEEE Trans. Robot., vol. 33, no. 2, pp. 249–265, Apr. 2017.
[9] J. Engel, T. Schöps, and D. Cremers, LSD-SLAM: Large-scale direct monocular SLAM, in Proc. Eur. Conf. Comput. Vis. (ECCV), D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham, Switzerland: Springer, 2014, pp. 834–849.
[10] F. Yu, J. Shang, Y. Hu, and M. Milford, NeuroSLAM: A brain-inspired SLAM system for 3D environments, Biol. Cybern., vol. 113, nos. 5–6, pp. 515–545, Dec. 2019.
[11] D. Schleicher, L. M. Bergasa, M. Ocana, R. Barea, and M. E. Lopez, Real-time hierarchical outdoor SLAM based on stereovision and GPS fusion, IEEE Trans. Intell. Transp. Syst., vol. 10, no. 3, pp. 440–452, Sep. 2009.
[12] R. Ren, H. Fu, and M. Wu, Large-scale outdoor SLAM based on 2D LiDAR, Electronics, vol. 8, no. 6, p. 613, May 2019.
[13] S. Yang and S. Scherer, CubeSLAM: Monocular 3-D object SLAM, IEEE Trans. Robot., vol. 35, no. 4, pp. 925–938, Aug. 2019.
[14] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes, IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[15] L. Cui and C. Ma, SOF-SLAM: A semantic visual SLAM for dynamic environments, IEEE Access, vol. 7, pp. 166528–166539, 2019.
[16] P. Ma, Y. Bai, J. Zhu, C. Wang, and C. Peng, DSOD: DSO in dynamic environments, IEEE Access, vol. 7, pp. 178300–178309, 2019.
[17] L. Zhao, Z. Liu, J. Chen, W. Cai, W. Wang, and L. Zeng, A compatible framework for RGB-D SLAM in dynamic scenes, IEEE Access, vol. 7, pp. 75604–75614, 2019.
[18] M. Henein, J. Zhang, R. Mahony, and V. Ila, Dynamic SLAM: The need for speed, in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2020, pp. 2123–2129.
[19] S. Trejo, K. Martinez, and G. Flores, Depth map estimation methodology for detecting free-obstacle navigation areas, in Proc. Int. Conf. Unmanned Aircr. Syst. (ICUAS), Jun. 2019, pp. 916–922.
[20] D. Yang, S. Bi, W. Wang, C. Yuan, W. Wang, X. Qi, and Y. Cai, DRE-SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot, Remote Sens., vol. 11, no. 4, p. 380, Feb. 2019.
[21] J. Gimenez, A. Amicarelli, J. M. Toibero, F. di Sciascio, and R. Carelli, Continuous probabilistic SLAM solved via iterated conditional modes, Int. J. Autom. Comput., vol. 16, no. 6, pp. 838–850, Aug. 2019.
[22] R. Wang, W. Wan, Y. Wang, and K. Di, A new RGB-D SLAM method with moving object detection for dynamic indoor scenes, Remote Sens., vol. 11, no. 10, p. 1143, May 2019.
[23] J. Cheng, Y. Sun, and M. Q.-H. Meng, Improving monocular visual SLAM in dynamic environments: An optical-flow-based approach, Adv. Robot., vol. 33, no. 12, pp. 576–589, Jun. 2019.
[24] Y. Ma and Y. Jia, Robust SLAM algorithm in dynamic environment using optical flow, in Proc. Chin. Intell. Syst. Conf., Y. Jia, J. Du, and W. Zhang, Eds. Singapore: Springer, 2020, pp. 681–689.
[25] Y. Sun, M. Liu, and M. Q.-H. Meng, Improving RGB-D SLAM in dynamic environments: A motion removal approach, Robot. Auton. Syst., vol. 89, pp. 110–122, Mar. 2017.
[26] Y. Sun, M. Liu, and M. Q.-H. Meng, Motion removal for reliable RGB-D SLAM in dynamic environments, Robot. Auton. Syst., vol. 108, pp. 115–128, Oct. 2018.
[27] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, FlowNet: Learning optical flow with convolutional networks, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[28] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, FlowNet 2.0: Evolution of optical flow estimation with deep networks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2462–2470.
[29] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4040–4048.
[30] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, MID-fusion: Octree-based object-level multi-instance dynamic SLAM, in Proc. Int. Conf. Robot. Automat. (ICRA), May 2019, pp. 5231–5237.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, Microsoft COCO: Common objects in context, in Computer Vision, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 740–755.
[32] R. Mur-Artal and J. D. Tardós, ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras, IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[33] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, DS-SLAM: A semantic visual SLAM towards dynamic environments, in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[34] J. Cheng, Y. Sun, and M. Q.-H. Meng, Robust semantic mapping in challenging environments, Robotica, vol. 38, no. 2, pp. 256–270, Feb. 2020.
[35] B. Bescos, C. Campos, J. D. Tardos, and J. Neira, DynaSLAM II: Tightly-coupled multi-object tracking and SLAM, IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5191–5198, Jul. 2021.
[36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ORB: An efficient alternative to SIFT or SURF, in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
[37] V. Badrinarayanan, A. Kendall, and R. Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[38] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, Jul. 2015, pp. 1–14.
[39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
[40] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, OctoMap: An efficient probabilistic 3D mapping framework based on octrees, Auto. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013. [Online]. Available: https://octomap.github.io
[41] M. Labbé and F. Michaud, Long-term online multi-session graph-based SPLAM with memory management, Auto. Robots, vol. 42, no. 6, pp. 1133–1150, 2018.
[42] A. Geiger, P. Lenz, and R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in Proc. Int. Conf. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[43] U. Technologies. (2017). EVO: Python Package for the Evaluation of Odometry and SLAM. [Online]. Available: https://github.com/MichaelGrupp/evo
[44] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, A benchmark for the evaluation of RGB-D SLAM systems, in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[45] R. Muñoz-Salinas and R. Medina-Carnicer, UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers, Pattern Recognit., vol. 101, May 2020, Art. no. 107193.

DANIELA ESPARZA received the B.S. degree in robotic engineering from the Universidad Politécnica del Bicentenario, México, in 2017, and the master's degree in optomechatronics from the Center for Research in Optics, in 2019, where she is currently pursuing the Ph.D. degree in mechatronics and mechanical design. Her research interests include artificial vision, such as 3D reconstruction and deep learning applied to SLAM, developed on platforms such as mobile robots.

GERARDO FLORES (Member, IEEE) received the B.S. degree (Hons.) in electronic engineering from the Instituto Tecnológico de Saltillo, Mexico, in 2007, the M.S. degree in automatic control from CINVESTAV-IPN, Mexico City, in 2010, and the Ph.D. degree in systems and information technology from the Heudiasyc Laboratory, Université de Technologie de Compiègne–Sorbonne Universités, France, in October 2014. Since August 2016, he has been a full-time Researcher and the Head of the Perception and Robotics Laboratory, Center for Research in Optics, León, Guanajuato, Mexico. His current research interests include the theoretical and practical problems arising from the development of autonomous robotic and vision systems. He has been an Associate Editor of Mathematical Problems in Engineering, since 2020.

View File

@ -0,0 +1,3 @@
junwen Lai

View File

@ -0,0 +1,846 @@
jinja2=2.10 -> markupsafe[version='>=0.23|>=0.23,<2']
_anaconda_depends=2019.03 -> jinja2 -> markupsafe[version='<2.0|>=0.23|>=0.23,<2|>=0.23,<2.1|>=2.0|>=2.0.0rc2|>=2.1.1']
jupyter=1.0.0 -> nbconvert -> markupsafe[version='>=2.0']
Package pycairo conflicts for:
nltk=3.4 -> matplotlib -> pycairo
anaconda=custom -> _anaconda_depends -> pycairo
_anaconda_depends=2019.03 -> pycairo
seaborn=0.9.0 -> matplotlib[version='>=1.4.3'] -> pycairo
scikit-image=0.15.0 -> matplotlib[version='>=2.0.0'] -> pycairo
Package isort conflicts for:
pylint=2.3.1 -> isort[version='>=4.2.5']
spyder=3.3.3 -> pylint -> isort[version='>=4.2.5|>=4.2.5,<5|>=4.2.5,<6']
isort=4.3.16
_anaconda_depends=2019.03 -> pylint -> isort[version='>=4.2.5|>=4.2.5,<5|>=4.2.5,<6']
anaconda=custom -> _anaconda_depends -> isort
_anaconda_depends=2019.03 -> isort
Package pyflakes conflicts for:
spyder=3.3.3 -> pyflakes
anaconda=custom -> _anaconda_depends -> pyflakes
pyflakes=2.1.1
_anaconda_depends=2019.03 -> pyflakes
Package pycurl conflicts for:
anaconda=custom -> _anaconda_depends -> pycurl
pycurl=7.43.0.2
_anaconda_depends=2019.03 -> pycurl
Package pycodestyle conflicts for:
spyder=3.3.3 -> pycodestyle
_anaconda_depends=2019.03 -> pycodestyle
pycodestyle=2.5.0
anaconda=custom -> _anaconda_depends -> pycodestyle
Package singledispatch conflicts for:
distributed=1.26.0 -> singledispatch
ipykernel=5.1.0 -> tornado[version='>=4.0'] -> singledispatch==3.4.0.3
nltk=3.4 -> singledispatch
matplotlib=3.0.3 -> tornado -> singledispatch==3.4.0.3
_anaconda_depends=2019.03 -> singledispatch
terminado=0.8.1 -> tornado[version='>=4'] -> singledispatch==3.4.0.3
jupyter_client=5.2.4 -> tornado[version='>=4.1'] -> singledispatch==3.4.0.3
bokeh=1.0.4 -> tornado[version='>=4.3'] -> singledispatch==3.4.0.3
numba=0.43.1 -> singledispatch
dask=1.1.4 -> distributed[version='>=1.26.0'] -> singledispatch
spyder=3.3.3 -> pylint -> singledispatch
anaconda=custom -> _anaconda_depends -> singledispatch
_anaconda_depends=2019.03 -> astroid -> singledispatch==3.4.0.3
singledispatch=3.4.0.3
notebook=5.7.8 -> tornado[version='>=4.1,<7'] -> singledispatch==3.4.0.3
anaconda-project=0.8.2 -> tornado[version='>=4.2'] -> singledispatch==3.4.0.3
distributed=1.26.0 -> tornado[version='<6.2'] -> singledispatch==3.4.0.3
Package gast conflicts for:
gast=0.2.2
tensorflow=1.13.1 -> gast[version='>=0.2.0']
Package cudnn conflicts for:
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> cudnn[version='7.3.*|>=7.6,<8.0a0|>=7.6.5.32,<8.0a0|>=8.4.1.50,<9.0a0|>=8.8.0.121,<9.0a0|>=8.2.1.32,<9.0a0|>=8.1.0.77,<9.0a0|>=8.9,<9.0a0|>=8.9.2.26,<9.0a0|>=8.2,<9.0a0|>=8.2.1,<9.0a0|>=7.6.5,<8.0a0|>=7.6.4,<8.0a0|>=7.3.1,<8.0a0|>=7.3.0,<=8.0a0']
cupy=6.0.0 -> cudnn[version='>=7.1.3,<8.0a0|>=7.3.1,<8.0a0']
pytorch=1.1.0 -> cudnn[version='>=7.3.1,<8.0a0']
cudnn=7.6.0
tensorflow=1.13.1 -> tensorflow-base==1.13.1=gpu_py27h8f37b9b_0 -> cudnn[version='>=7.3.1,<8.0a0']
Package libdeflate conflicts for:
anaconda=custom -> _anaconda_depends -> libdeflate
_anaconda_depends=2019.03 -> libtiff -> libdeflate[version='>=1.10,<1.11.0a0|>=1.12,<1.13.0a0|>=1.13,<1.14.0a0|>=1.14,<1.15.0a0|>=1.16,<1.17.0a0|>=1.17,<1.18.0a0|>=1.18,<1.19.0a0|>=1.19,<1.20.0a0|>=1.8,<1.9.0a0|>=1.7,<1.8.0a0']
pillow=6.0.0 -> libtiff[version='>=4.0.9,<4.4.0a0'] -> libdeflate[version='>=1.10,<1.11.0a0|>=1.8,<1.9.0a0|>=1.7,<1.8.0a0|>=1.19,<1.20.0a0|>=1.18,<1.19.0a0|>=1.17,<1.18.0a0|>=1.16,<1.17.0a0|>=1.14,<1.15.0a0|>=1.13,<1.14.0a0|>=1.12,<1.13.0a0']
Package smart_open conflicts for:
anaconda=custom -> _anaconda_depends -> smart_open
nltk=3.4 -> gensim -> smart_open[version='>=1.2.1|>=1.8.1']
Package gmp conflicts for:
nbconvert=5.4.1 -> pandoc[version='>=1.12.1,<2.0.0'] -> gmp=6.1
mpc=1.1.0 -> mpfr[version='>=4.0.2,<5.0a0'] -> gmp[version='>=6.2.1,<7.0a0']
gmpy2=2.0.8 -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0']
gmp=6.1.2
mpc=1.1.0 -> gmp[version='>=5.0.1,<7|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.1.2']
pandoc=2.2.3.2 -> gmp
gmpy2=2.0.8 -> mpc[version='>=1.1.0,<2.0a0'] -> gmp[version='>=5.0.1,<7|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0']
mpfr=4.0.1 -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0']
sympy=1.3 -> gmpy2[version='>=2.0.8'] -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0']
anaconda=custom -> _anaconda_depends -> gmp
_anaconda_depends=2019.03 -> gmp
_anaconda_depends=2019.03 -> gmpy2 -> gmp[version='6.1.*|>=5.0.1,<7|>=6.1.2|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0']
Package numexpr conflicts for:
anaconda=custom -> _anaconda_depends -> numexpr
seaborn=0.9.0 -> pandas[version='>=0.14.0'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0']
_anaconda_depends=2019.03 -> pandas -> numexpr[version='2.0.*|2.1.*|2.2.*|2.3.*|2.4.*|2.5.*|>=2.6.2|>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0']
_anaconda_depends=2019.03 -> numexpr
numexpr=2.6.9
dask=1.1.4 -> pandas[version='>=0.19.0,<2.0.0a0'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0']
statsmodels=0.9.0 -> pandas[version='>=0.14'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0']
bkcharts=0.2 -> pandas -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0']
pytables=3.5.1 -> numexpr
Package iniconfig conflicts for:
pytest-astropy=0.5.0 -> pytest[version='>=3.1'] -> iniconfig
anaconda=custom -> _anaconda_depends -> iniconfig
pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] -> iniconfig
pytest-doctestplus=0.3.0 -> pytest[version='>=3.0'] -> iniconfig
pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0'] -> iniconfig
_anaconda_depends=2019.03 -> pytest -> iniconfig
pytest-arraydiff=0.3 -> pytest -> iniconfig
Package contextlib2 conflicts for:
contextlib2=0.5.5
anaconda=custom -> _anaconda_depends -> contextlib2
_anaconda_depends=2019.03 -> contextlib2
importlib_metadata=0.8 -> contextlib2
path.py=11.5.0 -> importlib_metadata[version='>=0.5'] -> contextlib2
Package sympy conflicts for:
sympy=1.3
_anaconda_depends=2019.03 -> sympy
anaconda=custom -> _anaconda_depends -> sympy
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> sympy
Package pyodbc conflicts for:
anaconda=custom -> _anaconda_depends -> pyodbc
_anaconda_depends=2019.03 -> pyodbc
pyodbc=4.0.26
Package pytorch conflicts for:
torchvision=0.3.0 -> pytorch[version='1.1.*|>=1.1.0']
pytorch=1.1.0
Package qtawesome conflicts for:
anaconda=custom -> _anaconda_depends -> qtawesome
_anaconda_depends=2019.03 -> qtawesome
qtawesome=0.5.7
spyder=3.3.3 -> qtawesome[version='>=0.4.1']
_anaconda_depends=2019.03 -> spyder -> qtawesome[version='>=0.4.1|>=0.5.7|>=1.0.2|>=1.2.1']
Package exceptiongroup conflicts for:
jupyter_console=6.0.0 -> ipython -> exceptiongroup
ipykernel=5.1.0 -> ipython[version='>=5.0'] -> exceptiongroup
pytest-astropy=0.5.0 -> pytest[version='>=3.1'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8']
pytest-arraydiff=0.3 -> pytest -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8']
pytest-doctestplus=0.3.0 -> pytest[version='>=3.0'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8']
pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8']
_anaconda_depends=2019.03 -> ipython -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8']
ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> exceptiongroup
pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8']
Package dbus conflicts for:
keyring=18.0.0 -> secretstorage -> dbus[version='>=1.13.18,<2.0a0']
anaconda=custom -> _anaconda_depends -> dbus
_anaconda_depends=2019.03 -> dbus
_anaconda_depends=2019.03 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0|>=1.13.0,<2.0a0|>=1.13.18,<2.0a0']
pyqt=5.9.2 -> dbus[version='>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0']
matplotlib=3.0.3 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0']
qt=5.9.7 -> dbus[version='>=1.13.2,<2.0a0|>=1.13.6,<2.0a0']
secretstorage=3.1.1 -> dbus
dbus=1.13.6
spyder=3.3.3 -> pyqt[version='>=5.6,<5.7'] -> dbus[version='>=1.10.22,<2.0a0|>=1.13.6,<2.0a0|>=1.13.12,<2.0a0|>=1.13.2,<2.0a0|>=1.12.2,<2.0a0']
qtconsole=4.4.3 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0']
Package greenlet conflicts for:
_anaconda_depends=2019.03 -> greenlet
anaconda=custom -> _anaconda_depends -> greenlet
gevent=1.4.0 -> greenlet[version='>=0.4.14']
_anaconda_depends=2019.03 -> bokeh -> greenlet[version='!=0.4.17|0.4.*|>=2.0.0|>=1.1.3,<2.0|>=1.1.0,<2.0|>=0.4.17,<2.0|>=0.4.17|>=0.4.14|>=0.4.13|>=0.4.10|>=0.4.9']
greenlet=0.4.15
Package graphite2 conflicts for:
pango=1.42.4 -> harfbuzz[version='>=2.7.2,<3.0a0'] -> graphite2[version='1.3.*|>=1.3.11,<2.0a0|>=1.3.10,<2.0a0']
anaconda=custom -> _anaconda_depends -> graphite2
_anaconda_depends=2019.03 -> graphite2
pango=1.42.4 -> graphite2[version='>=1.3.12,<2.0a0|>=1.3.13,<2.0a0|>=1.3.14,<2.0a0']
harfbuzz=1.8.8 -> graphite2[version='>=1.3.11,<2.0a0']
graphite2=1.3.13
_anaconda_depends=2019.03 -> harfbuzz -> graphite2[version='1.3.*|>=1.3.14,<2.0a0|>=1.3.13,<2.0a0|>=1.3.11,<2.0a0|>=1.3.10,<2.0a0|>=1.3.12,<2.0a0']
Package pthread-stubs conflicts for:
qt=5.9.7 -> libxcb -> pthread-stubs
libxcb=1.13 -> pthread-stubs
gst-plugins-base=1.14.0 -> libxcb[version='>=1.14,<2.0a0'] -> pthread-stubs
harfbuzz=1.8.8 -> libxcb[version='>=1.13,<2.0a0'] -> pthread-stubs
cairo=1.14.12 -> libxcb -> pthread-stubs
_anaconda_depends=2019.03 -> libxcb -> pthread-stubs
Package astropy conflicts for:
astropy=3.1.2
anaconda=custom -> _anaconda_depends -> astropy
_anaconda_depends=2019.03 -> astropy
Package pyasn1 conflicts for:
urllib3=1.24.1 -> cryptography[version='>=1.3.4'] -> pyasn1[version='>=0.1.8']
anaconda=custom -> _anaconda_depends -> pyasn1
_anaconda_depends=2019.03 -> cryptography -> pyasn1[version='0.1.7|0.1.9|>=0.1.8']
secretstorage=3.1.1 -> cryptography -> pyasn1[version='0.1.7|0.1.9|>=0.1.8']
Package ninja conflicts for:
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> ninja
ninja=1.9.0
pytorch=1.1.0 -> ninja
Package tensorboard conflicts for:
tensorboard=1.13.1
tensorflow=1.13.1 -> tensorboard[version='1.13.*|>=1.13.0,<1.14.0a0|>=1.13.0,<1.14.0']
Package bokeh conflicts for:
anaconda=custom -> _anaconda_depends -> bokeh
dask=1.1.4 -> bokeh[version='>=0.13.0|>=0.13.0,<3.0.0a0']
_anaconda_depends=2019.03 -> bokeh
bokeh=1.0.4
_anaconda_depends=2019.03 -> dask -> bokeh[version='<3.0a0|>=0.13.0,<3.0.0a0|>=1.0.0,!=2.0.0,<3.0.0a0|>=2.1.1,<3.0.0a0|>=2.4.2,<3.0.0a0|>=2.4.2|>=2.4.2,!=3.0.*|>=2.4.2,<3|>=1.0.0,<3.0.0a0|>=2.4.2,<3.0|>=2.1.1|>=1.0.0,!=2.0.0|>=1.0.0|>=0.13.0|>=0.12.3|>=0.12.1']
Package future conflicts for:
path.py=11.5.0 -> backports.os -> future
_anaconda_depends=2019.03 -> future
backports.os=0.1.1 -> future
pytorch=1.1.0 -> future
anaconda=custom -> _anaconda_depends -> future
torchvision=0.3.0 -> future
Package path.py conflicts for:
_anaconda_depends=2019.03 -> path.py
ipython=7.4.0 -> pickleshare -> path.py
anaconda=custom -> _anaconda_depends -> path.py
spyder=3.3.3 -> pickleshare -> path.py
path.py=11.5.0
Package dbus-python conflicts for:
keyring=18.0.0 -> secretstorage -> dbus-python
_anaconda_depends=2019.03 -> secretstorage -> dbus-python
Package _ipython_minor_entry_point conflicts for:
jupyter_console=6.0.0 -> ipython -> _ipython_minor_entry_point=8.7.0
ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> _ipython_minor_entry_point=8.7.0
ipykernel=5.1.0 -> ipython[version='>=5.0'] -> _ipython_minor_entry_point=8.7.0
_anaconda_depends=2019.03 -> ipython -> _ipython_minor_entry_point=8.7.0
Package gmpy2 conflicts for:
sympy=1.3 -> gmpy2[version='>=2.0.8']
sympy=1.3 -> mpmath[version='>=0.19'] -> gmpy2
_anaconda_depends=2019.03 -> sympy -> gmpy2[version='>=2.0.8']
anaconda=custom -> _anaconda_depends -> gmpy2
_anaconda_depends=2019.03 -> gmpy2
gmpy2=2.0.8
Package fonttools conflicts for:
scikit-image=0.15.0 -> matplotlib-base[version='>=2.0.0'] -> fonttools[version='>=4.22.0']
seaborn=0.9.0 -> matplotlib-base -> fonttools[version='>=4.22.0']
anaconda=custom -> _anaconda_depends -> fonttools
Package blis conflicts for:
numpy=1.16.2 -> libblas[version='>=3.8.0,<4.0a0'] -> blis[version='0.5.1.*|>=0.5.2,<0.5.3.0a0|>=0.6.0,<0.6.1.0a0|>=0.6.1,<0.6.2.0a0|>=0.7.0,<0.7.1.0a0|>=0.8.0,<0.8.1.0a0|>=0.8.1,<0.8.2.0a0|>=0.9.0,<0.9.1.0a0']
scipy=1.2.1 -> libblas[version='>=3.8.0,<4.0a0'] -> blis[version='0.5.1.*|>=0.5.2,<0.5.3.0a0|>=0.6.0,<0.6.1.0a0|>=0.6.1,<0.6.2.0a0|>=0.7.0,<0.7.1.0a0|>=0.8.0,<0.8.1.0a0|>=0.8.1,<0.8.2.0a0|>=0.9.0,<0.9.1.0a0']
Package qtconsole conflicts for:
_anaconda_depends=2019.03 -> spyder -> qtconsole[version='>=4.2|>=4.6.0|>=4.7.7|>=5.0.1|>=5.0.3|>=5.1.0|>=5.1.0,<5.2.0|>=5.2.1,<5.3.0|>=5.3.0,<5.4.0|>=5.3.2,<5.4.0|>=5.4.0,<5.5.0|>=5.4.2,<5.5.0|>=5.5.0,<5.6.0']
qtconsole=4.4.3
anaconda=custom -> _anaconda_depends -> qtconsole
spyder=3.3.3 -> qtconsole[version='>=4.2']
jupyter=1.0.0 -> qtconsole
_anaconda_depends=2019.03 -> qtconsole
Package filelock conflicts for:
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> filelock
anaconda=custom -> _anaconda_depends -> filelock
_anaconda_depends=2019.03 -> filelock
Package libnghttp2 conflicts for:
_anaconda_depends=2019.03 -> libcurl -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0']
tensorflow=1.13.1 -> libcurl[version='>=7.64.1,<9.0a0'] -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0']
pycurl=7.43.0.2 -> libcurl[version='>=7.64.1,<9.0a0'] -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0']
anaconda=custom -> _anaconda_depends -> libnghttp2
Package secretstorage conflicts for:
secretstorage=3.1.1
spyder=3.3.3 -> keyring -> secretstorage[version='>=3|>=3.2']
_anaconda_depends=2019.03 -> secretstorage
keyring=18.0.0 -> secretstorage
anaconda=custom -> _anaconda_depends -> secretstorage
_anaconda_depends=2019.03 -> keyring -> secretstorage[version='>=3|>=3.2']
Package pyobjc-framework-cocoa conflicts for:
_anaconda_depends=2019.03 -> send2trash -> pyobjc-framework-cocoa
notebook=5.7.8 -> send2trash -> pyobjc-framework-cocoa
Package astroid conflicts for:
spyder=3.3.3 -> pylint -> astroid[version='1.0.1|1.1.0|1.1.1|1.2.1|1.3.2|1.3.4|1.4.4|2.5.6|>=2.11.0,<=2.12.0|>=2.11.2,<=2.12.0|>=2.11.3,<=2.12.0|>=2.11.5,<2.12.0|>=2.11.6,<2.12.0|>=2.12.10,<2.14.0-dev0|>=2.12.11,<2.14.0-dev0|>=2.12.12,<2.14.0-dev0|>=2.12.13,<2.14.0-dev0|>=2.14.1,<2.16.0-dev0|>=2.14.2,<2.16.0-dev0|>=2.15.0,<2.17.0-dev0|>=2.15.2,<2.17.0-dev0|>=2.15.4,<2.17.0-dev0|>=2.15.6,<2.17.0-dev0|>=2.15.7,<2.17.0-dev0|>=2.15.8,<2.17.0-dev0|>=3.0.0,<3.1.0-dev0|>=3.0.1,<3.1.0-dev0|>=2.12.9,<2.14.0-dev0|>=2.12.4,<2.14.0-dev0|>=2.9.0,<2.10|>=2.8.0,<2.9|>=2.7.2,<2.8|>=2.6.5,<2.7|>=2.6.4,<2.7|>=2.6.2,<2.7|>=2.6.1,<2.7|>=2.5.7,<2.7|>=2.5.1,<2.6|>=2.4.0,<=2.5|>=2.4.0,<2.5|>=2.3.0,<2.4|>=2.2.0,<3|>=2.2.0|>=2.0.0|>=1.6,<2.0|>=1.5.1|>=1.4.5,<1.5.0|>=2.14.2,<=2.16.0|>=2.6.5,<=2.7|>=2.6.2,<=2.7|>=2.5.8,<=2.7|>=1.4.1,<1.5.0']
_anaconda_depends=2019.03 -> astroid
pylint=2.3.1 -> astroid[version='>=2.2.0']
anaconda=custom -> _anaconda_depends -> astroid
astroid=2.2.5
_anaconda_depends=2019.03 -> pylint -> astroid[version='1.0.1|1.1.0|1.1.1|1.2.1|1.3.2|1.3.4|1.4.4|2.5.6|>=2.11.0,<=2.12.0|>=2.11.2,<=2.12.0|>=2.11.3,<=2.12.0|>=2.11.5,<2.12.0|>=2.11.6,<2.12.0|>=2.12.10,<2.14.0-dev0|>=2.12.11,<2.14.0-dev0|>=2.12.12,<2.14.0-dev0|>=2.12.13,<2.14.0-dev0|>=2.14.1,<2.16.0-dev0|>=2.14.2,<2.16.0-dev0|>=2.15.0,<2.17.0-dev0|>=2.15.2,<2.17.0-dev0|>=2.15.4,<2.17.0-dev0|>=2.15.6,<2.17.0-dev0|>=2.15.7,<2.17.0-dev0|>=2.15.8,<2.17.0-dev0|>=3.0.0,<3.1.0-dev0|>=3.0.1,<3.1.0-dev0|>=2.12.9,<2.14.0-dev0|>=2.12.4,<2.14.0-dev0|>=2.9.0,<2.10|>=2.8.0,<2.9|>=2.7.2,<2.8|>=2.6.5,<2.7|>=2.6.4,<2.7|>=2.6.2,<2.7|>=2.6.1,<2.7|>=2.5.7,<2.7|>=2.5.1,<2.6|>=2.4.0,<=2.5|>=2.4.0,<2.5|>=2.3.0,<2.4|>=2.2.0,<3|>=2.2.0|>=2.0.0|>=1.6,<2.0|>=1.5.1|>=1.4.5,<1.5.0|>=2.14.2,<=2.16.0|>=2.6.5,<=2.7|>=2.6.2,<=2.7|>=2.5.8,<=2.7|>=1.4.1,<1.5.0']
Package xorg-libice conflicts for:
cairo=1.14.12 -> xorg-libsm -> xorg-libice[version='1.0.*|>=1.1.1,<2.0a0']
cairo=1.14.12 -> xorg-libice
Package anaconda-project conflicts for:
anaconda=custom -> _anaconda_depends -> anaconda-project
anaconda-project=0.8.2
_anaconda_depends=2019.03 -> anaconda-client -> anaconda-project[version='>=0.9.1']
_anaconda_depends=2019.03 -> anaconda-project
Package parso conflicts for:
spyder=3.3.3 -> jedi[version='>=0.9'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0']
ipython=7.4.0 -> jedi[version='>=0.10'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0']
_anaconda_depends=2019.03 -> parso
_anaconda_depends=2019.03 -> jedi -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0|>=0.7.0,<0.9.0|0.7.0.*|0.5.2.*']
parso=0.3.4
jedi=0.13.3 -> parso[version='>=0.3.0|>=0.3.0,<0.8.0a0']
anaconda=custom -> _anaconda_depends -> parso
Package typing conflicts for:
spyder=3.3.3 -> sphinx -> typing
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> typing
anaconda=custom -> _anaconda_depends -> typing
_anaconda_depends=2019.03 -> typing
numpydoc=0.8.0 -> sphinx -> typing
Package clyent conflicts for:
clyent=1.2.2
anaconda-project=0.8.2 -> anaconda-client -> clyent[version='>=1.2.0|>=1.2.2']
anaconda-client=1.7.2 -> clyent[version='>=1.2.0|>=1.2.2']
_anaconda_depends=2019.03 -> clyent
anaconda=custom -> _anaconda_depends -> clyent
_anaconda_depends=2019.03 -> anaconda-client -> clyent[version='>=1.2.0|>=1.2.2']
Package jupyterlab_pygments conflicts for:
anaconda=custom -> _anaconda_depends -> jupyterlab_pygments
notebook=5.7.8 -> nbconvert -> jupyterlab_pygments
jupyter=1.0.0 -> nbconvert -> jupyterlab_pygments
spyder=3.3.3 -> nbconvert -> jupyterlab_pygments
_anaconda_depends=2019.03 -> nbconvert -> jupyterlab_pygments
Package pytest conflicts for:
pytest-doctestplus=0.3.0 -> pytest[version='>=2.8|>=3.0']
anaconda=custom -> _anaconda_depends -> pytest
_anaconda_depends=2019.03 -> pytest
pytest=4.3.1
pytest-astropy=0.5.0 -> pytest[version='>=3.1']
pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0']
pytest-remotedata=0.3.1 -> pytest[version='>=3.1']
_anaconda_depends=2019.03 -> astropy -> pytest[version='<3.7|<4|>=2.8|>=4.6|>=3.1|>=3.1.0|>=4.0|>=3.0|>=2.8.0']
pytest-astropy=0.5.0 -> pytest-arraydiff[version='>=0.1'] -> pytest[version='>=2.8.0|>=2.8|>=3.0|>=4.0|>=4.6']
astropy=3.1.2 -> pytest-astropy -> pytest[version='>=3.1.0|>=3.1|>=4.6']
pytest-arraydiff=0.3 -> pytest
Package jsonschema conflicts for:
anaconda=custom -> _anaconda_depends -> jsonschema
_anaconda_depends=2019.03 -> jsonschema
jsonschema=3.0.1
ipywidgets=7.4.2 -> nbformat[version='>=4.2.0'] -> jsonschema[version='>=2.4,!=2.5.0|>=2.6']
nbformat=4.4.0 -> jsonschema[version='>=2.4,!=2.5.0']
anaconda-client=1.7.2 -> nbformat[version='>=4.4.0'] -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6']
nbconvert=5.4.1 -> nbformat[version='>=4.4'] -> jsonschema[version='>=2.4,!=2.5.0|>=2.6']
notebook=5.7.8 -> nbformat -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6']
_anaconda_depends=2019.03 -> jupyterlab_server -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6|>=3.0.1|>=4.17.3|>=4.18|>=4.18.0|>=3.2.0']
Package tblib conflicts for:
tblib=1.3.2
_anaconda_depends=2019.03 -> distributed -> tblib[version='>=1.6.0']
dask=1.1.4 -> distributed[version='>=1.26.0'] -> tblib[version='>=1.6.0']
distributed=1.26.0 -> tblib
_anaconda_depends=2019.03 -> tblib
anaconda=custom -> _anaconda_depends -> tblib
Package sphinxcontrib-websupport conflicts for:
sphinxcontrib-websupport=1.1.0
_anaconda_depends=2019.03 -> sphinxcontrib-websupport
numpydoc=0.8.0 -> sphinx -> sphinxcontrib-websupport
anaconda=custom -> _anaconda_depends -> sphinxcontrib-websupport
spyder=3.3.3 -> sphinx -> sphinxcontrib-websupport
Package tqdm conflicts for:
_anaconda_depends=2019.03 -> anaconda-client -> tqdm[version='>=4.56.0']
anaconda=custom -> _anaconda_depends -> tqdm
_anaconda_depends=2019.03 -> tqdm
anaconda-project=0.8.2 -> anaconda-client -> tqdm[version='>=4.56.0']
tqdm=4.32.2
Package brotli-python conflicts for:
anaconda-client=1.7.2 -> urllib3[version='<2.0.0a'] -> brotli-python[version='>=1.0.9']
_anaconda_depends=2019.03 -> urllib3 -> brotli-python[version='>=1.0.9']
Package jdcal conflicts for:
jdcal=1.4
anaconda=custom -> _anaconda_depends -> jdcal
_anaconda_depends=2019.03 -> jdcal
_anaconda_depends=2019.03 -> openpyxl -> jdcal==1.0
openpyxl=2.6.1 -> jdcal
Package werkzeug conflicts for:
anaconda=custom -> _anaconda_depends -> werkzeug
_anaconda_depends=2019.03 -> werkzeug
flask=1.0.2 -> werkzeug[version='>=0.14|>=0.15,<2.0']
werkzeug=0.14.1
tensorboard=1.13.1 -> werkzeug[version='>=0.11.10|>=0.11.15']
_anaconda_depends=2019.03 -> flask -> werkzeug[version='0.8.3|>=0.14|>=0.15|>=0.15,<2.0|>=2.0|>=2.2.0|>=2.2.2|>=2.3.0|>=2.3.3|>=2.3.7|>=3.0.0|>=0.7|>=0.7,<1.0.0']
tensorflow=1.13.1 -> tensorboard[version='>=1.13.0,<1.14.0a0'] -> werkzeug[version='>=0.11.10|>=0.11.15']
Package sphinxcontrib-qthelp conflicts for:
numpydoc=0.8.0 -> sphinx -> sphinxcontrib-qthelp
_anaconda_depends=2019.03 -> sphinx -> sphinxcontrib-qthelp
anaconda=custom -> _anaconda_depends -> sphinxcontrib-qthelp
spyder=3.3.3 -> sphinx -> sphinxcontrib-qthelp
Package cairo conflicts for:
pango=1.42.4 -> harfbuzz[version='>=1.7.6,<2.0a0'] -> cairo[version='1.14.*|>=1.14.12,<2.0.0a0']
pango=1.42.4 -> cairo[version='>=1.14.12,<2.0a0|>=1.16.0,<2.0.0a0']
anaconda=custom -> _anaconda_depends -> cairo
_anaconda_depends=2019.03 -> cairo
_anaconda_depends=2019.03 -> harfbuzz -> cairo[version='1.12.*|1.14.*|>=1.14.12,<2.0.0a0|>=1.16.0,<2.0.0a0|>=1.16.0,<2.0a0|>=1.18.0,<2.0a0|>=1.14.12,<2.0a0|>=1.14.10,<2.0a0|>=1.12.10|>=1.14.10,<2.0.0a0']
cairo=1.14.12
harfbuzz=1.8.8 -> cairo[version='>=1.14.12,<2.0.0a0|>=1.14.12,<2.0a0']
Package qtpy conflicts for:
spyder=3.3.3 -> qtpy[version='>=1.5.0']
qtpy=1.7.0
spyder=3.3.3 -> qtawesome[version='>=0.4.1'] -> qtpy[version='>=2.0.1|>=2.4.0']
jupyter=1.0.0 -> qtconsole-base -> qtpy[version='>=2.0.1|>=2.4.0']
_anaconda_depends=2019.03 -> qtconsole -> qtpy[version='>=1.1|>=1.2.0|>=1.5.0|>=2.0.1|>=2.4.0|>=2.1.0']
qtawesome=0.5.7 -> qtpy
anaconda=custom -> _anaconda_depends -> qtpy
_anaconda_depends=2019.03 -> qtpy
Package pycparser conflicts for:
anaconda=custom -> _anaconda_depends -> pycparser
_anaconda_depends=2019.03 -> pycparser
pycparser=2.19
gevent=1.4.0 -> cffi[version='>=1.11.5'] -> pycparser
cffi=1.12.2 -> pycparser
pytorch=1.1.0 -> cffi -> pycparser
cryptography=2.6.1 -> cffi[version='>=1.7'] -> pycparser
Package mpi conflicts for:
hdf5=1.10.4 -> openmpi[version='>=3.1,<3.2.0a0'] -> mpi==1.0[build='openmpi|mpich']
anaconda=custom -> _anaconda_depends -> mpi
h5py=2.9.0 -> openmpi[version='>=3.1.4,<3.2.0a0'] -> mpi==1.0[build='openmpi|mpich']
Package cycler conflicts for:
_anaconda_depends=2019.03 -> matplotlib -> cycler[version='>=0.10|>=0.10.0']
matplotlib=3.0.3 -> cycler[version='>=0.10']
_anaconda_depends=2019.03 -> cycler
scikit-image=0.15.0 -> matplotlib-base[version='>=2.0.0'] -> cycler[version='>=0.10|>=0.10.0']
anaconda=custom -> _anaconda_depends -> cycler
cycler=0.10.0
seaborn=0.9.0 -> matplotlib-base -> cycler[version='>=0.10|>=0.10.0']
nltk=3.4 -> matplotlib -> cycler[version='>=0.10|>=0.10.0']
Package cached-property conflicts for:
_anaconda_depends=2019.03 -> h5py -> cached-property
keras-applications=1.0.7 -> h5py -> cached-property
anaconda=custom -> _anaconda_depends -> cached-property
Package boto conflicts for:
anaconda=custom -> _anaconda_depends -> boto
boto=2.49.0
_anaconda_depends=2019.03 -> boto
Package wheel conflicts for:
anaconda=custom -> _anaconda_depends -> wheel
_anaconda_depends=2019.03 -> wheel
pip=19.0.3 -> wheel
wheel=0.33.1
python=3.6.8 -> pip -> wheel
Package wurlitzer conflicts for:
_anaconda_depends=2019.03 -> spyder-kernels -> wurlitzer[version='>=1.0.3']
spyder-kernels=0.4.2 -> wurlitzer
_anaconda_depends=2019.03 -> wurlitzer
wurlitzer=1.0.2
spyder=3.3.3 -> spyder-kernels[version='>=0.4.2,<1'] -> wurlitzer
anaconda=custom -> _anaconda_depends -> wurlitzer
Package get_terminal_size conflicts for:
ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> get_terminal_size
ipykernel=5.1.0 -> ipython[version='>=5.0'] -> get_terminal_size
jupyter_console=6.0.0 -> ipython -> get_terminal_size
get_terminal_size=1.0.0
anaconda=custom -> _anaconda_depends -> get_terminal_size
_anaconda_depends=2019.03 -> get_terminal_size
Package pyqtchart conflicts for:
spyder=3.3.3 -> pyqt=5 -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5']
matplotlib=3.0.3 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5']
qtconsole=4.4.3 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5']
_anaconda_depends=2019.03 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5']
Package numba conflicts for:
_anaconda_depends=2019.03 -> numba
anaconda=custom -> _anaconda_depends -> numba
numba=0.43.1
Package mccabe conflicts for:
_anaconda_depends=2019.03 -> mccabe
_anaconda_depends=2019.03 -> pylint -> mccabe[version='>=0.6,<0.7|>=0.6,<0.8']
spyder=3.3.3 -> pylint -> mccabe[version='>=0.6,<0.7|>=0.6,<0.8']
mccabe=0.6.1
pylint=2.3.1 -> mccabe
anaconda=custom -> _anaconda_depends -> mccabe
Package jaraco.itertools conflicts for:
_anaconda_depends=2019.03 -> zipp -> jaraco.itertools
importlib_metadata=0.8 -> zipp[version='>=0.3.2'] -> jaraco.itertools
Package pycrypto conflicts for:
_anaconda_depends=2019.03 -> pycrypto
anaconda=custom -> _anaconda_depends -> pycrypto
pycrypto=2.6.1
Package _anaconda_depends conflicts for:
_anaconda_depends=2019.03
anaconda=custom -> _anaconda_depends
Package pkg-config conflicts for:
dbus=1.13.6 -> glib -> pkg-config
_anaconda_depends=2019.03 -> glib -> pkg-config
Package jupyter conflicts for:
_anaconda_depends=2019.03 -> jupyter
anaconda=custom -> _anaconda_depends -> jupyter
jupyter=1.0.0
Package scikit-image conflicts for:
scikit-image=0.15.0
anaconda=custom -> _anaconda_depends -> scikit-image
_anaconda_depends=2019.03 -> scikit-image
Package tensorflow-estimator conflicts for:
tensorflow-estimator=1.13.0
tensorflow=1.13.1 -> tensorflow-estimator[version='>=1.13.0,<1.14.0a0|>=1.13.0,<1.14.0rc0']
Package dataclasses conflicts for:
anaconda=custom -> _anaconda_depends -> dataclasses
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> dataclasses
nltk=3.4 -> gensim -> dataclasses
tensorboard=1.13.1 -> werkzeug[version='>=0.11.10'] -> dataclasses
_anaconda_depends=2019.03 -> werkzeug -> dataclasses
flask=1.0.2 -> werkzeug[version='>=0.14'] -> dataclasses
Package jupyter-lsp conflicts for:
_anaconda_depends=2019.03 -> jupyterlab -> jupyter-lsp[version='>=2.0.0']
jupyter=1.0.0 -> jupyterlab -> jupyter-lsp[version='>=2.0.0']
Package et_xmlfile conflicts for:
et_xmlfile=1.0.1
openpyxl=2.6.1 -> et_xmlfile
anaconda=custom -> _anaconda_depends -> et_xmlfile
_anaconda_depends=2019.03 -> et_xmlfile
Package heapdict conflicts for:
_anaconda_depends=2019.03 -> heapdict
heapdict=1.0.0
distributed=1.26.0 -> zict[version='>=0.1.3'] -> heapdict
zict=0.1.4 -> heapdict
anaconda=custom -> _anaconda_depends -> heapdict
Package spyder conflicts for:
anaconda=custom -> _anaconda_depends -> spyder
_anaconda_depends=2019.03 -> spyder
spyder=3.3.3
Package notebook-shim conflicts for:
_anaconda_depends=2019.03 -> jupyterlab -> notebook-shim[version='>=0.2|>=0.2,<0.3']
jupyterlab_server=0.2.0 -> notebook -> notebook-shim[version='>=0.2,<0.3']
jupyter=1.0.0 -> notebook -> notebook-shim[version='>=0.2|>=0.2,<0.3']
widgetsnbextension=3.4.2 -> notebook[version='>=4.4.1'] -> notebook-shim[version='>=0.2,<0.3']
jupyterlab=0.35.4 -> notebook[version='>=4.3.1'] -> notebook-shim[version='>=0.2,<0.3']
Package xlsxwriter conflicts for:
anaconda=custom -> _anaconda_depends -> xlsxwriter
_anaconda_depends=2019.03 -> xlsxwriter
xlsxwriter=1.1.5
Package qtconsole-base conflicts for:
jupyter=1.0.0 -> qtconsole-base
_anaconda_depends=2019.03 -> jupyter -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0']
spyder=3.3.3 -> qtconsole[version='>=4.2'] -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0']
jupyter=1.0.0 -> qtconsole -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0']
Package pycosat conflicts for:
_anaconda_depends=2019.03 -> pycosat
pycosat=0.6.3
anaconda=custom -> _anaconda_depends -> pycosat
Package xyzservices conflicts for:
dask=1.1.4 -> bokeh[version='>=0.13.0'] -> xyzservices[version='>=2021.09.1']
_anaconda_depends=2019.03 -> bokeh -> xyzservices[version='>=2021.09.1']
Package brotlipy conflicts for:
anaconda-client=1.7.2 -> urllib3[version='<2.0.0a'] -> brotlipy[version='>=0.6.0']
anaconda=custom -> _anaconda_depends -> brotlipy
_anaconda_depends=2019.03 -> urllib3 -> brotlipy[version='>=0.6.0']
Package libtool conflicts for:
_anaconda_depends=2019.03 -> libtool
anaconda=custom -> _anaconda_depends -> libtool
libtool=2.4.6
Package backports.os conflicts for:
anaconda=custom -> _anaconda_depends -> backports.os
_anaconda_depends=2019.03 -> backports.os
path.py=11.5.0 -> backports.os
backports.os=0.1.1
Package tbb4py conflicts for:
anaconda=custom -> _anaconda_depends -> tbb4py
mkl_random=1.0.2 -> numpy-base[version='>=1.0.2,<2.0a0'] -> tbb4py
_anaconda_depends=2019.03 -> numpy-base -> tbb4py
Package libllvm8 conflicts for:
numba=0.43.1 -> llvmlite[version='>=0.28.0'] -> libllvm8[version='>=8.0.1,<8.1.0a0']
_anaconda_depends=2019.03 -> llvmlite -> libllvm8[version='>=8.0.1,<8.1.0a0']
Package anaconda-anon-usage conflicts for:
anaconda-project=0.8.2 -> anaconda-client -> anaconda-anon-usage[version='>=0.4.0']
_anaconda_depends=2019.03 -> anaconda-client -> anaconda-anon-usage[version='>=0.4.0']
Package pcre2 conflicts for:
dbus=1.13.6 -> libglib[version='>=2.70.2,<3.0a0'] -> pcre2[version='>=10.37,<10.38.0a0|>=10.40,<10.41.0a0|>=10.42,<10.43.0a0']
pango=1.42.4 -> libglib[version='>=2.64.6,<3.0a0'] -> pcre2[version='>=10.37,<10.38.0a0|>=10.40,<10.41.0a0|>=10.42,<10.43.0a0']
Package cupti conflicts for:
tensorflow=1.13.1 -> tensorflow-base==1.13.1=gpu_py27h8f37b9b_0 -> cupti
torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> cupti
Package unicodecsv conflicts for:
anaconda=custom -> _anaconda_depends -> unicodecsv
unicodecsv=0.14.1
_anaconda_depends=2019.03 -> unicodecsv
Package dill conflicts for:
_anaconda_depends=2019.03 -> dask -> dill[version='0.2.2|0.2.3|0.2.4|>=0.3.7|>=0.3.6|>=0.2']
spyder=3.3.3 -> pylint -> dill[version='>=0.2|>=0.3.6|>=0.3.7']
Package cryptography-vectors conflicts for:
_anaconda_depends=2019.03 -> cryptography -> cryptography-vectors[version='2.3.*|2.3.1.*']
urllib3=1.24.1 -> cryptography[version='>=1.3.4'] -> cryptography-vectors[version='2.3.*|2.3.1.*']
pyopenssl=19.0.0 -> cryptography[version='>=2.2.1'] -> cryptography-vectors[version='2.3.*|2.3.1.*']
secretstorage=3.1.1 -> cryptography -> cryptography-vectors[version='2.3.*|2.3.1.*']
Package xlrd conflicts for:
anaconda=custom -> _anaconda_depends -> xlrd
_anaconda_depends=2019.03 -> xlrd
xlrd=1.2.0
Package seaborn conflicts for:
anaconda=custom -> _anaconda_depends -> seaborn
_anaconda_depends=2019.03 -> seaborn
seaborn=0.9.0
Package mpi4py conflicts for:
keras-applications=1.0.7 -> h5py -> mpi4py[version='>=3.0']
h5py=2.9.0 -> mpi4py
_anaconda_depends=2019.03 -> h5py -> mpi4py[version='>=3.0']
Package selectors2 conflicts for:
spyder-kernels=0.4.2 -> wurlitzer -> selectors2
_anaconda_depends=2019.03 -> wurlitzer -> selectors2
Package referencing conflicts for:
_anaconda_depends=2019.03 -> jsonschema -> referencing[version='>=0.28.4']
nbformat=4.4.0 -> jsonschema[version='>=2.4,!=2.5.0'] -> referencing[version='>=0.28.4']
Package pyside conflicts for:
nltk=3.4 -> matplotlib -> pyside[version='1.1.2|1.2.1']
_anaconda_depends=2019.03 -> matplotlib -> pyside[version='1.1.2|1.2.1']
Package gevent conflicts for:
_anaconda_depends=2019.03 -> bokeh -> gevent==1.0.1
anaconda=custom -> _anaconda_depends -> gevent
_anaconda_depends=2019.03 -> gevent
gevent=1.4.0
Package pbr conflicts for:
pytables=3.5.1 -> mock -> pbr[version='1.3.0|>=1.3']
tensorflow=1.13.1 -> mock[version='>=2.0.0'] -> pbr[version='>=1.3']
tensorflow-estimator=1.13.0 -> mock[version='>=2.0.0'] -> pbr[version='>=1.3']
Package keras-base conflicts for:
keras-applications=1.0.7 -> keras[version='>=2.1.6'] -> keras-base[version='2.2.0.*|2.2.2.*|2.2.4.*|2.3.1.*|2.4.3.*']
keras-preprocessing=1.0.9 -> keras[version='>=2.1.6'] -> keras-base[version='2.2.0.*|2.2.2.*|2.2.4.*|2.3.1.*|2.4.3.*']
Package openpyxl conflicts for:
anaconda=custom -> _anaconda_depends -> openpyxl
_anaconda_depends=2019.03 -> openpyxl
openpyxl=2.6.1
Package distribute conflicts for:
_anaconda_depends=2019.03 -> pip -> distribute
python=3.6.8 -> pip -> distribute
The following specifications were found to be incompatible with your system:
- feature:/linux-64::__cuda==11.7=0
- feature:/linux-64::__glibc==2.27=0
- feature:/linux-64::__linux==5.4.0=0
- feature:/linux-64::__unix==0=0
- feature:|@/linux-64::__cuda==11.7=0
- feature:|@/linux-64::__glibc==2.27=0
- feature:|@/linux-64::__linux==5.4.0=0
- feature:|@/linux-64::__unix==0=0
- _anaconda_depends=2019.03 -> click -> __unix
- _anaconda_depends=2019.03 -> click -> __win
- _anaconda_depends=2019.03 -> gst-plugins-base -> __glibc[version='>=2.17|>=2.17,<3.0.a0']
- _anaconda_depends=2019.03 -> ipykernel -> __linux
- astropy=3.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- bitarray=0.8.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- blosc=1.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- bottleneck=1.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- bzip2=1.0.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- c-ares=1.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- cairo=1.14.12 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- cffi=1.12.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- cryptography=2.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- cudatoolkit=9 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
- cupy=6.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- curl=7.64.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- cython=0.29.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- cytoolz=0.9.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- dbus=1.13.6 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
- distributed=1.26.0 -> click[version='>=6.6'] -> __unix
- distributed=1.26.0 -> click[version='>=6.6'] -> __win
- expat=2.2.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- fastcache=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- fastrlock=0.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- flask=1.0.2 -> click[version='>=5.1'] -> __unix
- flask=1.0.2 -> click[version='>=5.1'] -> __win
- fontconfig=2.13.0 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17']
- freetype=2.9.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- fribidi=1.0.5 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- gevent=1.4.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- glib=2.56.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- gmp=6.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- gmpy2=2.0.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- graphite2=1.3.13 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
- greenlet=0.4.15 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- grpcio=1.16.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- gst-plugins-base=1.14.0 -> gstreamer[version='>=1.14.0,<2.0a0'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0']
- gstreamer=1.14.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- h5py=2.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- harfbuzz=1.8.8 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17']
- hdf5=1.10.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- icu=58.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __linux
- ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __unix
- ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __win
- ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __linux
- ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __osx
- ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __win
- ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> __unix
- jbig=2.1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
- jpeg=9b -> libgcc-ng[version='>=7.2.0'] -> __glibc[version='>=2.17']
- jupyter=1.0.0 -> ipykernel -> __linux
- jupyter=1.0.0 -> ipykernel -> __win
- jupyter_console=6.0.0 -> ipykernel -> __linux
- jupyter_console=6.0.0 -> ipykernel -> __win
- jupyter_console=6.0.0 -> ipython -> __unix
- kiwisolver=1.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- krb5=1.16.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- lazy-object-proxy=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libcurl=7.64.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libedit=3.1.20181209 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libffi=3.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libpng=1.6.36 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libprotobuf=3.8.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libsodium=1.0.16 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libssh2=1.8.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libtiff=4.0.10 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libtool=2.4.6 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
- libuuid=1.0.3 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
- libxcb=1.13 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
- libxml2=2.9.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- libxslt=1.1.33 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
- llvmlite=0.28.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- lxml=4.3.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- lzo=2.10 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
- markupsafe=1.1.1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
- matplotlib=3.0.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- mistune=0.8.4 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
- mkl-service=1.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- mkl_fft=1.0.10 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- mkl_random=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- mpc=1.1.0 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
- mpfr=4.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- msgpack-python=0.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- nccl=1.3.5 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- ncurses=6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- ninja=1.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- notebook=5.7.8 -> ipykernel -> __linux
- notebook=5.7.8 -> ipykernel -> __win
- numba=0.43.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- numexpr=2.6.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- numpy-base=1.16.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- numpy=1.16.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- openssl=1.1.1c -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pandas=0.24.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pango=1.42.4 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
- pcre=8.43 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pillow=6.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pixman=0.38.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- psutil=5.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pycosat=0.6.3 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
- pycrypto=2.6.1 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
- pycurl=7.43.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pyodbc=4.0.26 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pyqt=5.9.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pyrsistent=0.14.11 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
- pytables=3.5.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- python=3.6.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pytorch=1.1.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pywavelets=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pyyaml=5.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- pyzmq=18.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- qt=5.9.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- qtconsole=4.4.3 -> ipykernel[version='>=4.1'] -> __linux
- qtconsole=4.4.3 -> ipykernel[version='>=4.1'] -> __win
- readline=7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- ruamel_yaml=0.15.46 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17']
- scikit-image=0.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- scikit-learn=0.20.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- scipy=1.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- sip=4.19.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- snappy=1.1.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- spyder-kernels=0.4.2 -> ipykernel[version='>4.9.0'] -> __linux
- spyder-kernels=0.4.2 -> ipykernel[version='>4.9.0'] -> __win
- sqlalchemy=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- sqlite=3.27.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- statsmodels=0.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- tensorboard=1.13.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- tensorflow=1.13.1 -> libgcc-ng[version='>=5.4.0'] -> __glibc[version='>=2.17']
- tk=8.6.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- torchvision=0.3.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0']
- torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> __cuda[version='>=11.8']
- tornado=6.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- typed-ast=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- unixodbc=2.3.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- urllib3=1.24.1 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __unix
- urllib3=1.24.1 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __win
- wrapt=1.11.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- xz=5.2.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- yaml=0.1.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- zeromq=4.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
- zlib=1.2.11 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
- zstd=1.3.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
Your installed version is: not available

9
动态slam/run.txt Normal file
View File

@ -0,0 +1,9 @@
python evaluation.py --result_dir=./data/ --eva_seqs=../pose_est/06/06_pred
python evaluate_kitti.py ./pose_gt/06.txt ./06_est.txt
python tartanair_evaluator.py
conda env create -f requirement.yml -p /root/miniconda3/envs/dfvo

724
动态slam/tartan.pdf Normal file
View File

@ -0,0 +1,724 @@
TartanVO: A Generalizable Learning-based VO
Wenshan Wang Yaoyu Hu Sebastian Scherer
Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University
arXiv:2011.00359v1 [cs.CV] 31 Oct 2020
Abstract: We present the first learning-based visual odometry (VO) model,
which generalizes to multiple datasets and real-world scenarios, and outperforms
geometry-based methods in challenging scenes. We achieve this by leveraging
the SLAM dataset TartanAir, which provides a large amount of diverse synthetic
data in challenging environments. Furthermore, to make our VO model generalize
across datasets, we propose an up-to-scale loss function and incorporate the cam-
era intrinsic parameters into the model. Experiments show that a single model,
TartanVO, trained only on synthetic data, without any finetuning, can be general-
ized to real-world datasets such as KITTI and EuRoC, demonstrating significant
advantages over the geometry-based methods on challenging trajectories. Our
code is available at https://github.com/castacks/tartanvo.
Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow
1 Introduction
Visual SLAM (Simultaneous Localization and Mapping) becomes more and more important for
autonomous robotic systems due to its ubiquitous availability and the information richness of im-
ages [1]. Visual odometry (VO) is one of the fundamental components in a visual SLAM system.
Impressive progress has been made in both geometric-based methods [2, 3, 4, 5] and learning-based
methods [6, 7, 8, 9]. However, it remains a challenging problem to develop a robust and reliable VO
method for real-world applications.
On one hand, geometric-based methods are not robust enough in many real-life situations [10, 11].
On the other hand, although learning-based methods demonstrate robust performance on many vi-
sual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical
flow, we have not yet seen the same story happening to VO.
It is widely accepted that by leveraging a large amount of data, deep-neural-network-based methods
can learn a better feature extractor than engineered ones, resulting in a more capable and robust
model. But why haven't we seen the deep learning models outperform geometry-based methods yet?
We argue that there are two main reasons. First, the existing VO models are trained with insufficient
diversity, which is critical for learning-based methods to be able to generalize. By diversity, we
mean diversity both in the scenes and motion patterns. For example, a VO model trained only on
outdoor scenes is unlikely to be able to generalize to an indoor environment. Similarly, a model
trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion,
will unlikely be applicable to drones. Second, most of the current learning-based VO models neglect
some fundamental nature of the problem which is well formulated in geometry-based VO theories.
From the theory of multi-view geometry, we know that recovering the camera pose from a sequence
of monocular images has scale ambiguity. Besides, recovering the pose needs to take account of the
camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing
with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail
in another dataset, no matter how good the feature extractor is.
To this end, we propose a learning-based method that can solve the above two problems and can
generalize across datasets. Our contributions come in three folds. First, we demonstrate the crucial
effects of data diversity on the generalization ability of a VO model by comparing performance on
different quantities of training data. Second, we design an up-to-scale loss function to deal with the
Corresponding author: wenshanw@andrew.cmu.edu
4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.
scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model enabling
generalization across different cameras. To our knowledge, our model is the first learning-based VO
that has competitive performance in various real-world datasets without finetuning. Furthermore,
compared to geometry-based methods, our model is significantly more robust in challenging scenes.
A demo video can be found: https://www.youtube.com/watch?v=NQ1UEh3thbU
2 Related Work
Besides early studies of learning-based VO models [12, 13, 14, 15], more and more end-to-end
learning-based VO models have been studied with improved accuracy and robustness. The majority
of the recent end-to-end models adopt the unsupervised-learning design [6, 16, 17, 18], due to the
complexity and the high-cost associated with collecting ground-truth data. However, supervised
models trained on labeled odometry data still have a better performance [19, 20].
To improve the performance, end-to-end VO models tend to have auxiliary outputs related to camera
motions, such as depth and optical flow. With depth prediction, models obtain supervision signals
by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can
be interpreted as matching the temporal observations in the 3D space. A similar effect of temporal
matching can be achieved by producing the optical flow, e.g., [16, 22, 18] jointly predict depth,
optical flow, and camera motion.
Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D
matching. Then, camera motion estimators can process the optical flow data rather than directly
working on raw images[20, 23]. If designed this way, components for estimating the camera motion
can even be trained separately on available optical flow data [19]. We follow these designs and use
the optical flow as an intermediate representation.
It is well known that monocular VO systems have scale ambiguity. Nevertheless, most of the super-
vised learning models did not handle this issue and directly use the difference between the model
prediction and the true camera motion as the supervision [20, 24, 25]. In [19], the scale is handled
by dividing the optical flow into sub-regions and imposing a consistency of the motion predictions
among these regions. In non-learning methods, scale ambiguity can be solved if a 3D map is avail-
able [26]. Ummenhofer et al. [20] introduce the depth prediction to correcting the scale-drift. Tateno
et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection
technique from SLAM systems. Recently, Zhan et al. [29] use PnP techniques to explicitly solve
for the scale factor. The above methods introduce extra complexity to the VO system, however, the
scale ambiguity is not totally suppressed for monocular setups especially in the evaluation stage.
Instead, some models choose to only produce up-to-scale predictions. Wang et al. [30] reduce the
scale ambiguity in the monocular depth estimation task by normalizing the depth prediction before
computing the loss function. Similarly, we will focus on predicting the translation direction rather
than recovering the full scale from monocular images, by defining a new up-to-scale loss function.
Learning-based models suffer from generalization issues when tested on images from a new en-
vironment or a new camera. Most of the VO models are trained and tested on the same dataset
[16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] only test their generalization ability on the
depth prediction, not on the camera pose estimation. Recent efforts, such as [33], use model adap-
tation to deal with new environments, however, additional training is needed on a per-environment
or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset
generalization, by incorporating the camera intrinsics directly into the model.
Figure 1: The two-stage network architecture. The model consists of a matching network, which
estimates optical flow from two consecutive RGB images, followed by a pose network predicting
camera motion from the optical flow.
3 Approach
3.1 Background
We focus on the monocular VO problem, which takes two consecutive undistorted images {It, It+1},
and estimates the relative camera motion δtt+1 = (T, R), where T ∈ R3 is the 3D translation and
R ∈ so(3) denotes the 3D rotation. According to the epipolar geometry theory [34], the geometry-
based VO comes in two folds. Firstly, visual features are extracted and matched from It and It+1.
Then using the matching results, it computes the essential matrix leading to the recovery of the
up-to-scale camera motion δtt+1.
Following the same idea, our model consists of two sub-modules. One is the matching module
Mθ(It, It+1), estimating the dense matching result Ftt+1 from two consecutive RGB images (i.e.
optical flow). The other is a pose module Pφ(Ftt+1) that recovers the camera motion δtt+1 from the
matching result (Fig. 1). This modular design is also widely used in other learning-based methods,
especially in unsupervised VO [13, 19, 16, 22, 18].
3.2 Training on large scale diverse data
The generalization capability has always been one of the most critical issues for learning-based
methods. Most of the previous supervised models are trained on the KITTI dataset, which contains
11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented
the training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV).
They reported that the performance is limited by the lack of training data and the more complex
dynamics of a flying robot. Surprisingly, most unsupervised methods also only train their models in
very uniform scenes (e.g., KITTI and Cityscape [37]). To our knowledge, no learning-based model
has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To
achieve this, we argue that the training data has to cover diverse scenes and motion patterns.
TartanAir [11] is a large scale dataset with highly diverse scenes and motion patterns, containing
more than 400,000 data frames. It provides multi-modal ground truth labels including depth, seg-
mentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and
sci-fi environments. The data is collected with a simulated pinhole camera, which moves with ran-
dom and rich 6DoF motion patterns in the 3D space.
We take advantage of the monocular image sequences {It}, the optical flow labels {Ftt+1}, and the
ground truth camera motions {δtt+1} in our task. Our objective is to jointly minimize the optical
flow loss Lf and the camera motion loss Lp. The end-to-end loss is defined as:
L = \lambda L_f + L_p = \lambda \| M_\theta(I_t, I_{t+1}) - F_t^{t+1} \| + \| P_\phi(\hat{F}_t^{t+1}) - \delta_t^{t+1} \|    (1)
where λ is a hyper-parameter balancing the two losses. We use ˆ· to denote the estimated variable
from our model.
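For concreteness, a minimal sketch of the joint objective in Eq. (1) is given below, treating the flow term as a per-pixel L2 distance; the tensor shapes, the default weight, and the function name are illustrative assumptions rather than the released code.

import torch

def joint_loss(flow_pred, flow_gt, motion_pred, motion_gt, lam=1.0):
    # Sketch of Eq. (1): weighted optical-flow loss plus camera-motion loss.
    # Assumed shapes: flow_*: (B, 2, H, W), motion_*: (B, 6) translation+rotation.
    l_f = (flow_pred - flow_gt).norm(dim=1).mean()      # matching-network term
    l_p = (motion_pred - motion_gt).norm(dim=1).mean()  # pose-network term
    return lam * l_f + l_p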
Since TartanAir is purely synthetic, the biggest question is can a model learned from simulation
data be generalized to real-world scenes? As discussed by Wang et al. [11], a large number of
studies show that training purely in simulation but with broad diversity, the model learned can be
easily transferred to the real world. This is also known as domain randomization [38, 39]. In our
experiments, we show that the diverse simulated data indeed enable the VO model to generalize to
real-world data.
3.3 Up-to-scale loss function
The motion scale is unobservable from a monocular image sequence. In geometry-based methods,
the scale is usually recovered from other sources of information ranging from known object size or
camera height to extra sensors such as IMU. However, in most existing learning-based VO studies,
the models generally neglect the scale problem and try to recover the motion with scale. This is
feasible if the model is trained and tested with the same camera and in the same type of scenario.
For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground and a
fixed orientation. A model can learn to remember the scale in this particular setup. Obviously, the
model will have huge problems when tested with a different camera configuration. Imagine if the
Figure 2: a) Illustration of the FoV and image resolution in TartanAir, EuRoC, and KITTI datasets.
b) Calculation of the intrinsics layer.
camera in KITTI moves a little upwards and becomes higher from the ground, the same amount of
camera motion would cause a smaller optical flow value on the ground, which is inconsistent with
the training data. Although the model could potentially learn to pick up other clues such as object
size, it is still not fully reliable across different scenes or environments.
Following the geometry-based methods, we only recover an up-to-scale camera motion from the
monocular sequences. Knowing that the scale ambiguity only affects the translation T , we design
a new loss function for T and keep the loss for rotation R unchanged. We propose two up-to-scale
loss functions for L_p: the cosine similarity loss L_p^{cos} and the normalized distance loss L_p^{norm}. L_p^{cos}
is defined by the cosine angle between the estimated \hat{T} and the label T:

L_p^{cos} = \frac{\hat{T} \cdot T}{\max(\|\hat{T}\| \cdot \|T\|, \epsilon)} + \|\hat{R} - R\|    (2)

Similarly, for L_p^{norm}, we normalize the translation vector before calculating the distance between
the estimation and the label:

L_p^{norm} = \left\| \frac{\hat{T}}{\max(\|\hat{T}\|, \epsilon)} - \frac{T}{\max(\|T\|, \epsilon)} \right\| + \|\hat{R} - R\|    (3)

where \epsilon = 10^{-6} is used to avoid division by zero. From our preliminary empirical comparison,
the above two formulations have similar performance. In the following sections, we will use Eq 3
to replace Lp in Eq 1. Later, we show by experiments that the proposed up-to-scale loss function is
crucial for the model's generalization ability.
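As a small illustration, the normalized-distance form in Eq. (3) could be implemented roughly as below, assuming batched (B, 3) tensors for the predicted and ground-truth translation and rotation; the function name and batching are illustrative assumptions, not the released code.

import torch

def up_to_scale_loss(t_hat, t_gt, r_hat, r_gt, eps=1e-6):
    # Sketch of Eq. (3): penalize only the translation direction, keep the
    # rotation term unchanged.
    t_hat_n = t_hat / torch.clamp(t_hat.norm(dim=1, keepdim=True), min=eps)
    t_gt_n = t_gt / torch.clamp(t_gt.norm(dim=1, keepdim=True), min=eps)
    return ((t_hat_n - t_gt_n).norm(dim=1) + (r_hat - r_gt).norm(dim=1)).mean()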
3.4 Cross-camera generalization by encoding camera intrinsics
In epipolar geometry theory, the camera intrinsics is required when recovering the camera pose
from the essential matrix (assuming the images are undistorted). In fact, learning-based methods
are unlikely to generalize to data with different camera intrinsics. Imagine a simple case that the
camera changes a lens with a larger focal length. Assume the resolution of the image remains the
same, the same amount of camera motion will introduce bigger optical flow values, which we call
the intrinsics ambiguity.
A tempting solution for intrinsics ambiguity is warping the input images to match the camera in-
trinsics of the training data. However, this is not quite practical especially when the cameras differ
too much. As shown in Fig. 2-a, if a model is trained on TartanAir, the warped KITTI image only
covers a small part of TartanAir's field of view (FoV). After training, a model learns to exploit
cues from all possible positions in the FoV and the interrelationship among those cues. Some cues
no longer exist in the warped KITTI images leading to drastic performance drops.
3.4.1 Intrinsics layer
We propose to train a model that takes both RGB images and camera intrinsics as input, thus the
model can directly handle images coming from various camera settings. Specifically, instead of re-
covering the camera motion Ttt+1 only from the feature matching Ftt+1, we design a new pose net-
work Pφ(Ftt+1, K), which depends also on the camera intrinsic parameters K = {fx, fy, ox, oy},
where fx and fy are the focal lengths, and ox and oy denote the position of the principal point.
Figure 3: The data augmentation procedure of random cropping and resizing. In this way we gener-
ate a wide range of camera intrinsics (FoV 40◦ to 90◦).
As for the implementation, we concatenate an IL (intrinsics layer) Kc ∈ R2×H×W (H and W
are image height and width respectively) to Ftt+1 before going into Pφ. To compose Kc, we first
generate two index matrices Xind and Yind for the x and y axis in the 2D image frame (Fig. 2-b).
Then the two channels of Kc are calculated from the following formula:
K_x^c = (X_{ind} - o_x)/f_x,    K_y^c = (Y_{ind} - o_y)/f_y    (4)
The concatenation of Ftt+1 and Kc augments the optical flow estimation with 2D position informa-
tion. Similar to the situation where geometry-based methods have to know the 2D coordinates of the
matched features, Kc provides the necessary position information. In this way, intrinsics ambiguity
is explicitly handled by coupling 2D positions and matching estimations (Ftt+1).
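To make Eq. (4) concrete, the intrinsics layer can be precomputed once per camera, for example with the NumPy sketch below (the function name and the channel-first output layout are illustrative assumptions):

import numpy as np

def intrinsics_layer(h, w, fx, fy, ox, oy):
    # Pixel index grids: x varies along the width, y along the height.
    x_ind, y_ind = np.meshgrid(np.arange(w), np.arange(h))
    kx = (x_ind - ox) / fx   # first channel of Eq. (4)
    ky = (y_ind - oy) / fy   # second channel of Eq. (4)
    return np.stack([kx, ky], axis=0).astype(np.float32)  # shape (2, H, W)

# TartanAir intrinsics quoted in Sec. 3.4.2 (640 x 480 images assumed).
il = intrinsics_layer(480, 640, fx=320.0, fy=320.0, ox=320.0, oy=240.0)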
3.4.2 Data generation for various camera intrinsics
To make a model generalizable across different cameras, we need training data with various camera
intrinsics. TartanAir only has one set of camera intrinsics, where fx = fy = 320, ox = 320,
and oy = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input
images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next,
we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can
crop and resize the IL with the image, without recomputing the IL. To cover typical cameras with
FoV between 40◦ to 90◦, we find that using random resizing factors up to 2.5 is sufficient during
RCR. Note the ground truth optical flow should also be scaled with respect to the resizing factor. We
use very aggressive cropping and shifting in our training, which means the optical center could be
way off the image center. Although the resulting intrinsic parameters will be uncommon in modern
cameras, we find the generalization is improved.
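A rough sketch of the random cropping and resizing (RCR) described above is given below, under the assumptions that all arrays are stored channel-last (H, W, C), that the crop box is sampled uniformly, and that OpenCV is used for resizing; the exact sampling strategy in the released code may differ.

import numpy as np
import cv2

def random_crop_resize(img, flow, il, rng, max_scale=2.5):
    # Sample a crop whose resize factor back to full resolution stays <= max_scale.
    h, w = img.shape[:2]
    ch = int(rng.integers(int(h / max_scale), h + 1))
    cw = int(rng.integers(int(w / max_scale), w + 1))
    y0 = int(rng.integers(0, h - ch + 1))
    x0 = int(rng.integers(0, w - cw + 1))

    def crop_resize(a):
        return cv2.resize(a[y0:y0 + ch, x0:x0 + cw], (w, h))

    sx, sy = w / cw, h / ch
    img_out = crop_resize(img)
    flow_out = crop_resize(flow)
    flow_out[..., 0] *= sx        # flow values scale with the resize factor
    flow_out[..., 1] *= sy
    il_out = crop_resize(il)      # the IL is cropped/resized with the image,
                                  # so the new intrinsics need not be recomputed
    return img_out, flow_out, il_out

# rng = np.random.default_rng(0)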
4 Experimental Results
4.1 Network structure and training detail
Network We utilize the pre-trained PWC-Net [40] as the matching network Mθ, and a modified
ResNet50 [41] as the pose network Pφ. We remove the batch normalization layers from the ResNet,
and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical
flow in size of H/4 × W/4, so Pφ is trained on 1/4 size, consuming very little GPU memory. The
overall inference time (including both Mθ and Pφ) is 40ms on an NVIDIA GTX 1080 GPU.
Training Our model is implemented by PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs.
There are two training stages. First, Pφ is trained separately using ground truth optical flow and
camera motions for 100,000 iterations with a batch size of 100. In the second stage, Pφ and Mθ are
connected and jointly optimized for 50,000 iterations with a batch size of 64. During both training
stages, the learning rate is set to 1e-4 with a decay rate of 0.2 at 1/2 and 7/8 of the total training
steps. The RCR is applied on the optical flow, RGB images, and the IL (Sec 3.4.2).
4.2 How the training data quantity affects the generalization ability
To show the effects of data diversity, we compare the generalization ability of the model trained
with different amounts of data. We use 20 environments from the TartanAir dataset, and set aside
3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in
Figure 4: Generalization ability with respect to different quantities of training data. Model Pφ is
trained on true optical flow. Blue: training loss, orange: testing loss on three unseen environments.
Testing loss drops constantly with increasing quantity of training data.
Figure 5: Comparison of the loss curve w/ and w/o up-to-scale loss function. a) The training and
testing loss w/o the up-to-scale loss. b) The translation and rotation loss of a). Big gap exists between
the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/
up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases.
more than 400,000 training frames and about 40,000 testing frames. As a comparison, KITTI and
EuRoC datasets provide 23,201 and 26,604 pose labeled frames, respectively. Besides, data in
KITTI and EuRoC are much more uniform in the sense of scene type and motion pattern. As shown
in Fig. 4, we set up three experiments, where we use 20,000 (comparable to KITTI and EuRoC),
100,000, and 400,000 frames of data for training the pose network Pφ. The experiments show that
the generalization ability, measured by the gap between training loss and testing loss on unseen
environments, improves constantly with increasing training data.
4.3 Up-to-scale loss function
Without the up-to-scale loss, we observe that there is a gap between the training and testing loss even
trained with a large amount of data (Fig. 5-a). When we plot the translation loss and rotation loss
separately (Fig. 5-b), it shows that the translation error is the main contributor to the gap. After we
apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5-
c,d). During testing, we align the translation with the ground truth to recover the scale using the
same way as described in [16, 6].
4.4 Camera intrinsics layer
The IL is critical to the generalization ability across datasets. Before we move to other datasets,
we first design an experiment to investigate the properties of the IL using the pose network Pφ. As
shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training
and testing loss are low. But these two models would output nonsense values on data with RCR
augmentation. One interesting finding is that adding IL doesn't help in the case of only one type
of intrinsics. This indicates that the network has learned a very different algorithm compared with
the geometry-based methods, where the intrinsics is necessary to recover the motion. The last two
columns show that the IL is critical when the input data is augmented by RCR (i.e. various intrin-
sics). Another interesting thing is that training a model with RCR and IL leads to a lower testing
loss (last column) than only training on one type of intrinsics (first two columns). This indicates that
by generating data with various intrinsics, we learned a more robust model for the VO task.
Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is
critical with the presence of RCR. The model trained with RCR reaches lower testing loss than
those without RCR.
Training configuration w/o RCR, w/o IL w/o RCR, w/ IL w/ RCR, w/o IL w/ RCR, w/ IL
Training loss 0.0325 0.0311 0.1534 0.0499
Test-loss on data w/ RCR - - 0.1999 0.0723
Test-loss on data w/o RCR 0.0744 0.0714 0.1630 0.0549
Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a super-
vised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates
information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08
and uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44]
and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB-
SLAM [3] are geometry-based monocular VO. ORB-SLAM uses the bundle adjustment on multiple
frames to optimize the trajectory. Our method works in a pure VO manner (only takes two frames).
It has never seen any KITTI data before the testing, and yet achieves competitive results.
Seq                 06            07            09            10            Ave
                    trel   rrel   trel   rrel   trel   rrel   trel   rrel   trel   rrel
DeepVO [43]*†       5.42   5.82   3.91   4.6    -      -      8.11   8.83   5.81   6.41
Wang et al. [9]*†   -      -      -      -      8.04   1.51   6.23   0.97   7.14   1.24
UnDeepVO [44]*      6.20   1.98   3.15   2.48   -      -      10.63  4.65   6.66   3.04
GeoNet [16]*        9.28   4.34   8.27   5.93   26.93  9.54   20.73  9.04   16.3   7.21
VISO2-M [45]        7.3    6.14   23.61  19.11  4.04   1.43   25.2   3.8    15.04  7.62
ORB-SLAM [3]†       18.68  0.26   10.96  0.37   15.3   0.26   3.71   0.3    12.16  0.3
TartanVO (ours)     4.72   2.95   4.32   3.41   6.0    3.11   6.89   2.73   5.48   3.05
trel: average translational RMSE drift (%) on a length of 100-800 m.
rrel: average rotational RMSE drift (◦/100 m) on a length of 100-800 m.
*: the starred methods are trained or finetuned on the KITTI dataset.
†: these methods use multiple frames to optimize the trajectory after the VO process.
4.5 Generalize to real-world data without finetuning
KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We
compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al.
[9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods
(VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the
KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a
synthetic dataset. Besides, many algorithms use multiple frames to further optimize the trajectory.
In contrast, our model only takes two consecutive images. As listed in Table 2, TartanVO achieves
comparable performance, even though no finetuning or backend optimization is performed.
EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor en-
vironment. There are 3 levels of difficulties with respect to the motion pattern and the light con-
dition. Few learning-based methods have ever been tested on EuRoC due to the lack of training
data. The changing light condition and aggressive rotation also pose real challenges to geometry-
based methods as well. In Table 3, we compare with geometry-based methods including SVO [46],
ORB-SLAM [3], DSO [5] and LSD-SLAM [2]. Note that all these geometry-based methods per-
form some types of backend optimization on selected keyframes along the trajectory. In contrast, our
model only estimates the frame-by-frame camera motion, which could be considered as the frontend
module in these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE)
of 6 medium and difficult trajectories. Our method shows the best performance on the two most dif-
ficult trajectories VR1-03 and VR2-03, where the MAV has very aggressive motion. A visualization
of the trajectories is shown in Fig. 6.
Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories2 that
cover many extremely difficult cases, including changing illumination, dynamic objects, fog and
rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with the
ORB-SLAM using ATE. Our model shows a more robust performance in these challenging cases.
2https://github.com/castacks/tartanair tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge
Table 3: Comparison of ATE on the EuRoC dataset. We are among the very few learning-based methods
that can be tested on this dataset. Like the geometry-based methods, our model has never seen
the EuRoC data before testing. We show the best performance on the two difficult sequences VR1-03
and VR2-03. Note our method doesn't contain any backend optimization module.
Seq. MH-04 MH-05 VR1-02 VR1-03 VR2-02 VR2-03
Geometry-based*
  SVO [46]          1.36  0.51  0.47  x     0.47  x
  ORB-SLAM [3]      0.20  0.19  x     x     0.07  x
  DSO [5]           0.25  0.11  0.11  0.93  0.13  1.16
  LSD-SLAM [2]      2.13  0.85  1.11  x     x     x
Learning-based†
  TartanVO (ours)   0.74  0.68  0.45  0.64  0.67  1.04
* These results are from [46]. † Other learning-based methods [36] did not report numerical results.
Figure 6: The visualization of 6 EuRoC trajectories in Table 3. Black: ground truth trajectory,
orange: estimated trajectory.
Table 4: Comparison of ATE on TartanAir dataset. These trajectories are not contained in the
training set. We repeatedly run ORB-SLAM 5 times and report the best result.
Seq MH000 MH001 MH002 MH003 MH004 MH005 MH006 MH007
ORB-SLAM [3] 1.3 0.04 2.37 2.45 x x 21.47 2.73
TartanVO (ours) 4.88 0.26 2 0.94 1.07 3.19 1 2.04
Figure 7: TartanVO outputs competitive results on D345i IR data compared to T265 (equipped with
fish-eye stereo camera and an IMU). a) The hardware setup. b) Trail 1: smooth and slow motion. c)
Trail 2: smooth and medium speed. d) Trail 3: aggressive and fast motion. See videos for details.
RealSense Data Comparison We test TartanVO using data collected by a customized sensor
setup. As shown in Fig. 7 a), a RealSense D345i is fixed on top of a RealSense T265 tracking
camera. We use the left near-infrared (IR) image on D345i in our model and compare it with the
trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar
paths with increasing motion difficulties. From Fig. 7 b) to d), we observe that although TartanVO
has never seen real-world images or IR data during training, it still generalizes well and predicts
odometry closely matching the output of T265, which is a dedicated device estimating the camera
motion with a pair of fish-eye stereo camera and an IMU.
5 Conclusions
We presented TartanVO, a generalizable learning-based visual odometry. By training our model
with a large amount of data, we show the effectiveness of diverse data on the ability of model gener-
alization. A smaller gap between training and testing losses can be expected with the newly defined
up-to-scale loss, further increasing the generalization capability. We show by extensive experiments
that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO
can generalize to unseen datasets and achieve performance even better than dedicated learning mod-
els trained directly on those datasets. Our work introduces many exciting future research directions
such as generalizable learning-based VIO, Stereo-VO, multi-frame VO.
Acknowledgments
This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai
Yu from Carnegie Mellon University for preparing simulation results and experimental setups.
References
[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendo´n-Mancha. Visual simultaneous localization and
mapping: a survey. Artificial Intelligence Review, 43(1):5581, 2015.
[2] J. Engel, T. Schops, and D. Cremers. LSD-SLAM: Large-scale direct monocular slam. In ECCV, 2014.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam
system. IEEE transactions on robotics, 31(5):11471163, 2015.
[4] C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi-direct monocular visual odometry. In ICRA,
pages 1522. IEEE, 2014.
[5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and
machine intelligence, 40(3):611625, 2017.
[6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from
video. In CVPR, 2017.
[7] S. Vijayanarasimhan, S. Ricco, C. Schmidy, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of
structure and motion from video. In arXiv:1704.07804, 2017.
[8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odom-
etry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513542,
2018.
[9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer. Improving learning-based ego-
motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 970976. IEEE, 2019.
[10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and
future directions. Robotics and Autonomous Systems, 98:6788, 2017.
[11] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A
dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), 2020.
[12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry.
In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 4752. IEEE,
2008.
[13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation
(ICRA), 2012 IEEE International Conference on, pages 34823489. IEEE, 2012.
[14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual
odometry. Robotics and Autonomous Systems, 62(12):17171730, 2014.
[15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned
depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 62436252, 2017.
[16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2,
2018.
[17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular
depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 340349, 2018.
[18] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collabora-
tion: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2019.
[19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with cnns for
frame-to-frame ego-motion estimation. RAL, 1(1):1825, 2016.
[20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and
motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3vo: Deep depth, deep pose and deep uncertainty
for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[22] Y. Zou, Z. Luo, and J.-B. Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task
consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[23] H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In Proceedings of the
European Conference on Computer Vision (ECCV), September 2018.
[24] C. Tang and P. Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.
[25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. Ls-net: Learning to solve
nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018.
[26] H. Li, W. Chen, j. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of ab-
solute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA), May 2020.
[27] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned
depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), July 2017.
[28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detec-
tion and visual odometry towards monocular deep slam. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV), October 2019.
[29] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid. Visual odometry revisited: What should be learnt?
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.
[30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct
methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
June 2018.
[31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. Unos: Unified unsupervised optical-flow and
stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), June 2019.
[32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monoc-
ular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2018.
[33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online
adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[34] D. Niste´r. An efficient solution to the five-point relative pose problem. IEEE transactions on pattern
analysis and machine intelligence, 26(6):756770, 2004.
[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International
Journal of Robotics Research, 32(11):12311237, 2013.
[36] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The
euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):11571163,
2016.
[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and
B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 32133223, 2016.
[38] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transfer-
ring deep neural networks from simulation to the real world. In IROS, pages 2330. IEEE, 2017.
[39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon,
and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain ran-
domization. In CVPR Workshops, pages 969977, 2018.
[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and
cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
89348943, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 770778, 2016.
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and
A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
[43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep
recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International
Conference on, pages 20432050. IEEE, 2017.
[44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep
learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 72867291.
IEEE, 2018.
[45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for au-
tonomous driving. IEEE Transactions on Pattern Analysis & Machine Intelligence, pages 11, 2015.
[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. Svo: Semidirect visual odometry
for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249265, 2016.
A Additional experimental details
In this section, we provide additional details for the experiments, including the network structure,
training parameters, qualitative results, and quantitative results.
A.1 Network Structure
Our network consists of two sub-modules, namely, the matching network Mθ and the pose network
Pφ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two
consecutive images of size 640 x 448 (PWC-Net only accepts image sizes that are multiples of 64). The
output optical flow, which is 160 x 112 in size, is fed into the pose network. The structure of the
pose network is detailed in Table 5. The overall inference time (including both Mθ and Pφ) is 40ms
on an NVIDIA GTX 1080 GPU.
Table 5: Parameters of the proposed pose network. Constructions of residual blocks are designated
in brackets multiplied by the number of stacked blocks. Downsampling is performed by Conv1, and
at the beginning of each residual block. After the residual blocks, we reshape the feature map into a
one-dimensional vector, which goes through three fully connected layers in the translation head and
rotation head, respectively.
Name      Layer setting                    Output dimension
Input     -                                (1/4)H × (1/4)W × 2        112 × 160
Conv1     3 × 3, 32                        (1/8)H × (1/8)W × 32       56 × 80
Conv2     3 × 3, 32                        (1/8)H × (1/8)W × 32       56 × 80
Conv3     3 × 3, 32                        (1/8)H × (1/8)W × 32       56 × 80
ResBlock
Block1    [3 × 3, 64;  3 × 3, 64] × 3      (1/16)H × (1/16)W × 64     28 × 40
Block2    [3 × 3, 128; 3 × 3, 128] × 4     (1/32)H × (1/32)W × 128    14 × 20
Block3    [3 × 3, 128; 3 × 3, 128] × 6     (1/64)H × (1/64)W × 128    7 × 10
Block4    [3 × 3, 256; 3 × 3, 256] × 7     (1/128)H × (1/128)W × 256  4 × 5
Block5    [3 × 3, 256; 3 × 3, 256] × 3     (1/256)H × (1/256)W × 256  2 × 3
FC trans                                   FC rot
Trans head fc1   (256 × 6) × 128           Rot head fc1   (256 × 6) × 128
Trans head fc2   128 × 32                  Rot head fc2   128 × 32
Trans head fc3   32 × 3                    Rot head fc3   32 × 3
Output           3                         Output         3
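The FC heads at the bottom of Table 5 are two small MLPs applied to the flattened 2 × 3 × 256 feature map. A hedged PyTorch sketch is shown below; the layer widths follow the table, while the class name, the ReLU activations, and the rest of the interface are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class PoseHead(nn.Module):
    # Illustrative two-head regressor; not the authors' exact implementation.
    def __init__(self, in_dim=256 * 6):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 32), nn.ReLU(),
                                 nn.Linear(32, 3))
        self.trans_head = mlp()   # fc1 -> fc2 -> fc3 for translation
        self.rot_head = mlp()     # fc1 -> fc2 -> fc3 for rotation

    def forward(self, feat):      # feat: (B, in_dim) flattened feature vector
        return self.trans_head(feat), self.rot_head(feat)

# t, r = PoseHead()(torch.randn(8, 256 * 6))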
Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric.
These trajectories are not contained in the training set. We repeatedly run ORB-SLAM for 5 times
and report the best result.
Seq SH000 SH001 SH002 SH003 SH004 SH005 SH006 SH007
ORB-SLAM x 3.5 x x x x x x
TartanVO (ours) 2.52 1.61 3.65 0.29 3.36 4.74 3.72 3.06
A.2 Testing Results on TartanAir
TartanAir provides 16 challenging testing trajectories. We reported 8 trajectories in the experiment
section. The remaining 8 trajectories are shown in Table 6. We compare TartanVO against the ORB-SLAM
monocular algorithm. Due to the randomness in ORB-SLAM, we repeatedly run ORB-SLAM for 5
trials and report the best result. We consider a trial a failure if ORB-SLAM tracks less than 80%
of the trajectory. A visualization of all the 16 trajectories (including the 8 trajectories shown in the
experiment section) is shown in Figure 8.
Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line
represents the ground truth. The estimated trajectories by TartanVO and the ORB-SLAM monocular
algorithm are shown in orange and blue lines, respectively. The ORB-SLAM algorithm frequently
loses tracking in these challenging cases, failing on 9 of the 16 testing trajectories. Note that we run
the full-fledged ORB-SLAM with local bundle adjustment, global bundle adjustment, and loop closure
components. In contrast, although TartanVO only takes in two images, it is much more robust than
ORB-SLAM.

View File

@ -0,0 +1,12 @@
tartanvo
shibuya_Standing01
sum: 99
total time: 8.106080770492554
average time: 0.08458585690970373
KITTI-04 sequence
sum: 270
total time: 20.52476716041565
average time: 0.07601765614968758

View File

@ -0,0 +1,266 @@
University of Electronic Science and Technology of China
Mid-term Assessment Form for the Degree Thesis of Academic Degree Graduate Students
Degree level: □ Doctoral    Master's
Discipline: Software Engineering
School: School of Information and Software Engineering
Student ID: 202221090225
Name: 武博文
Thesis title: Research on Visual SLAM Based on Instance Segmentation in Outdoor Dynamic Scenes
Supervisor: 王春雨
Date of completion: September 15, 2024
Graduate School of the University of Electronic Science and Technology of China
Main Work Completed
1. Date the proposal defense was passed: December 21, 2023
2. Coursework
Have the credit requirements specified in the training program been met?
□ Yes    No
3. Thesis research progress
Summarize the theoretical analysis or computational work and the experimental (or empirical) work (may be continued on extra pages)
I. Moving-object discrimination algorithm based on instance segmentation and optical-flow detection
Theoretical analysis
Dynamic-object discrimination is a key step of the whole dynamic SLAM problem; how well it is solved directly affects the camera pose estimation and the quality of back-end mapping. The problem to be solved is whether an object's spatial position has moved between two different camera frames. Using only the semantic information obtained from instance segmentation, one can only decide that objects of known (semantically movable) classes are potentially dynamic, but cannot determine whether the object has actually moved in the current image. Moreover, semantic information fails for unknown moving objects. Therefore, on the basis of ORB_SLAM2, a discrimination method based on real-time optical-flow detection and instance segmentation was designed.
Optical-flow detection is a technique for estimating the motion of pixels in an image sequence, i.e., the motion trajectories of pixels between consecutive frames. Instance segmentation must not only detect the objects in an image but also delineate a pixel-level mask for each object instance: it distinguishes both different categories and different instances of the same category.
The designed moving-object discrimination method is as follows, with the algorithm flow shown in Figure 1-1.
Figure 1-1 Flowchart of the moving-object discrimination algorithm
First, moving-object candidates are determined. Instance segmentation provides the object masks of the current frame, and the pixels with non-zero mask values inside instance O_i are taken as candidate dynamic points p_io. Meanwhile, optical-flow detection gives, for every pixel p_i of the current frame, a flow value that can be split into an x component f_ix and a y component f_iy; when a flow value is non-zero, the pixel is considered to be moving in that direction. The two components are combined into a single flow magnitude f_i, computed as in Equation (1):

f_i = \sqrt{f_{ix}^2 + f_{iy}^2}    (1)

An optical-flow threshold Th_f is set: when the flow magnitude f_i exceeds the threshold, the pixel is regarded as exhibiting optical-flow motion and is marked as a flow-dynamic point p_if; otherwise it is a static point (Th_f is set to 0.12 in the system). For the candidate dynamic points of instance O_i, the flow magnitudes and the number of flow-dynamic points are computed. The motion state D_i of instance O_i can then be expressed through Equation (2):

r_d = \frac{p_{if}}{p_{io}}    (2)

where r_d is the ratio of the number of flow-dynamic points to the number of candidate dynamic points in the instance, p_io is the total number of candidate dynamic points of instance O_i, and p_if is the total number of flow-dynamic points; in this way the optical-flow information is fused with the semantic information. Finally, the motion state of the object is decided from D_i (0 for static, 1 for moving), as in Equation (3):

D_i = \begin{cases} 1, & r_d \ge Th_d \\ 0, & r_d < Th_d \end{cases}    (3)

where Th_d is the threshold for deciding instance motion: when the ratio is greater than or equal to the threshold, the instance is considered to be truly moving; otherwise it is a static object (Th_d is set to 0.7 in the system).
Second, for points outside any instance, optical-flow detection alone is used to decide whether a pixel p_i is dynamic. If it is static, it is added to the candidate points for camera motion estimation, which keeps the number of static points sufficient.
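A minimal NumPy sketch of the discrimination rule in Equations (1)-(3) is given below. It assumes the flow field and the per-instance masks are already available from the flow and segmentation networks; the array layout and function name are illustrative, and the thresholds follow the values stated above (Th_f = 0.12, Th_d = 0.7).

    import numpy as np

    def classify_instances(flow, instance_masks, th_f=0.12, th_d=0.7):
        """Decide which segmented instances are actually moving.

        flow:           (H, W, 2) array with x/y optical-flow components.
        instance_masks: dict {instance_id: (H, W) boolean mask}.
        Returns dict {instance_id: 1 if moving, 0 if static}.
        """
        # Equation (1): per-pixel flow magnitude f_i.
        flow_mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
        flow_dynamic = flow_mag > th_f            # pixels with optical-flow motion

        states = {}
        for obj_id, mask in instance_masks.items():
            p_io = int(mask.sum())                # candidate dynamic points of O_i
            if p_io == 0:
                states[obj_id] = 0
                continue
            p_if = int((flow_dynamic & mask).sum())   # flow-dynamic points inside O_i
            r_d = p_if / p_io                     # Equation (2)
            states[obj_id] = int(r_d >= th_d)     # Equation (3)
        return states

    if __name__ == "__main__":
        h, w = 48, 64
        flow = np.zeros((h, w, 2)); flow[10:20, 10:30] = 0.5       # a moving patch
        masks = {1: np.zeros((h, w), bool), 2: np.zeros((h, w), bool)}
        masks[1][10:20, 10:30] = True                              # moving instance
        masks[2][30:40, 40:60] = True                              # static instance
        print(classify_instances(flow, masks))                     # {1: 1, 2: 0}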
Work carried out
The detailed structure of the DM-SLAM system is shown in Figure 1-2. The input color and depth images first pass through instance segmentation to obtain mask information and through optical-flow prediction to obtain flow information, which associates two adjacent frames. Next, static points and dynamic-object points are separated by the masks; at this stage the dynamic points are only candidates. The candidate dynamic points are fused with the optical-flow information to judge which objects are truly moving, yielding the moving objects and their point sets. Camera pose estimation is performed at the same time: matching the static points gives an initial camera pose. With the initial camera pose and the point sets of the moving objects, object pose estimation is carried out to compute the poses of the moving objects. Finally, the camera poses and object poses are passed to the back end for joint optimization, and the final poses are output.
Figure 1-2 Architecture of the DM-SLAM system
Instance segmentation focuses on outdoor instance objects such as cars, bicycles, and trees. Mainstream instance segmentation models such as YOLACT, PolarMask, and Mask R-CNN were studied; Mask R-CNN is currently adopted and trained for instance segmentation, achieving fairly precise boundaries with high accuracy.
Optical-flow prediction focuses on real-time performance and accuracy. Different optical-flow models such as LiteFlowNet, PWC-Net, and RAFT were investigated; RAFT is currently selected and fine-tuned, and a real-time optical-flow module was added to predict the flow of the input images.
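As an illustration of how dense flow between two consecutive frames can be obtained with a pretrained RAFT model, the snippet below uses the torchvision implementation (assuming torchvision >= 0.13 is installed). It is a generic usage sketch, not the fine-tuned model referred to above; the image sizes are placeholders and only need to be multiples of 8.

    import torch
    from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    preprocess = weights.transforms()          # converts and normalizes image pairs for RAFT

    # Two consecutive frames as (B, 3, H, W) uint8 tensors; H and W must be multiples of 8.
    frame1 = torch.randint(0, 256, (1, 3, 360, 640), dtype=torch.uint8)
    frame2 = torch.randint(0, 256, (1, 3, 360, 640), dtype=torch.uint8)

    with torch.no_grad():
        img1, img2 = preprocess(frame1, frame2)
        flow = model(img1, img2)[-1]           # RAFT returns a list of refinements; take the last
    print(flow.shape)                          # torch.Size([1, 2, 360, 640]): x/y flow per pixel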
The dynamic-object discrimination method was designed and the corresponding module implemented, with the algorithm flow shown in Figure 1-1. The KITTI-13 sequence was tested and the relevant metrics were compared with the evo evaluation tool, as shown in Table 1-1. The comparison shows that on this sequence the RMSE of my system improves by 11.1%, indicating that my system can reduce the pose error, which is owed to the effectiveness of the dynamic-object discrimination method. The tracking process is shown in Figure 1-3, where different colors denote different instances (i.e., dynamic objects) and black points denote static background points.
Figure 1-3 System tracking visualization
Table 1-1 Error comparison between DM-SLAM and the baseline system on part of the KITTI-13 sequence

Metric | VDO-SLAM APE | VDO-SLAM RPE | DM-SLAM APE | DM-SLAM RPE
max    | 0.35         | 0.10         | 0.32        | 0.10
mean   | 0.16         | 0.06         | 0.15        | 0.06
rmse   | 0.18         | 0.07         | 0.16        | 0.07
min    | 0.02         | 0.04         | 0.03        | 0.03
Experiments compared the trajectory errors, including the absolute trajectory error and the relative trajectory error, of several existing dynamic SLAM systems (ORB_SLAM2, DyStSLAM, and DynaSLAM) on different sequences of the KITTI Odometry dataset, as shown in Table 1-2. These data will be used in future tests of my system to evaluate the robustness of DM-SLAM.
Table 1-2 Trajectory error comparison of the systems on the KITTI dataset

Seq | ORB_SLAM2 APE | ORB_SLAM2 RPE | DyStSLAM APE | DyStSLAM RPE | DynaSLAM APE | DynaSLAM RPE
00  | 1.30  | 0.70 | 1.28 | 0.70 | 3.88  | 1.35
01  | 10.18 | 1.44 | 9.64 | 1.49 | 10.23 | 1.33
02  | 6.28  | 0.78 | 4.91 | 0.75 | 6.37  | 1.42
03  | 0.73  | 0.73 | 0.58 | 0.70 | 4.69  | 2.73
04  | 0.25  | 0.50 | 0.23 | 0.47 | 1.18  | 1.11
05  | 0.76  | 0.39 | 0.78 | 0.40 | 1.72  | 0.69
06  | 0.80  | 0.53 | 0.72 | 0.50 | 1.95  | 0.94
07  | 0.55  | 0.51 | 0.54 | 0.50 | 1.12  | 0.84
08  | 3.59  | 1.02 | 3.27 | 1.01 | 3.97  | 1.22
09  | 2.71  | 0.86 | 2.95 | 0.85 | 5.24  | 1.31
10  | 1.02  | 0.60 | 1.01 | 0.61 | 2.33  | 0.98
II. Tracking algorithm for multiple moving rigid objects
Theoretical analysis
Studying the motion of different instances and tracking dynamic objects, especially in complex outdoor scenes, is of great significance. Vehicles are rigid objects, and deriving their motion model is one of the core techniques in autonomous driving systems. Accurately modeling vehicle motion enables real-time tracking, prediction of future trajectories, and effective obstacle-avoidance and path-planning decisions. In addition, the estimated poses of dynamic objects can be used in joint optimization, making effective use of the information.
Rigid-body motion involves two aspects: translation and rotation. To describe such motion, several reference frames are usually defined. The world (or global) frame is a fixed reference frame used to describe the position and motion of the rigid body in space. The body frame is attached to the rigid body and moves with it; its origin is usually chosen at the body's centroid, and my system takes the centroid of the 3D object points as the origin of the object frame. The derivation of the motion model of a moving rigid object in an outdoor dynamic scene is given below.
Let the motion of the rigid body between frames k-1 and k, expressed in the body frame, be {}^{R_{k-1}}_{k-1}U_k ∈ SE(3), and let its motion in the global frame be denoted {}^{0}_{k-1}U_k, where {0} denotes the global frame.
Consider the 3D coordinates of the i-th object point at frame k in the global frame; transforming between the coordinate frames gives Equation (4):
(4)
Moreover, for a rigid object the coordinates of all object points remain unchanged in the body frame, as stated in Equation (5):
(5)
Combining Equations (4) and (5) yields Equation (6):
(6)
Equation (6) links, through a homogeneous transformation, the same 3D point on the moving rigid body across different frames, and shows the relation between the body-frame motion {}^{R_{k-1}}_{k-1}U_k and the global frame. Therefore, in the global frame the 3D points satisfy Equation (7):
(7)
where {}^{0}_{k-1}U_k ∈ SE(3) denotes the motion of the object points in the global frame. Equation (7) is the core of object motion estimation: it expresses the rigid-object motion purely through 3D points, removing the need to estimate the object's global pose. The principle of the rigid-object motion model is illustrated in Figure 1-4.
Figure 1-4 Schematic of the rigid-body motion model
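For reference, one consistent way to write out the model sketched in Equations (4)-(7) is given below, using {0} for the global frame, {R_k} for the body frame at frame k, {}^{0}m_k^{i} for the i-th object point at frame k in the global frame, and {}^{0}X_{R_k} ∈ SE(3) for the object pose. These symbols are assumptions of this reconstruction, not the thesis's own notation.

    % Hedged reconstruction of the rigid-object motion model (cf. Eqs. (4)-(7)).
    \begin{align*}
      {}^{0}m_k^{i}     &= {}^{0}X_{R_k}\, {}^{R_k}m^{i}
        && \text{object-to-global transform} \\
      {}^{R_{k-1}}m^{i} &= {}^{R_k}m^{i}
        && \text{rigidity: body-frame coordinates are constant} \\
      {}^{0}m_k^{i}     &= {}^{0}X_{R_k}\, \big({}^{0}X_{R_{k-1}}\big)^{-1}\, {}^{0}m_{k-1}^{i}
        && \text{combine the two relations} \\
      {}^{0}m_k^{i}     &= {}^{0}_{k-1}U_k\, {}^{0}m_{k-1}^{i},
        \quad {}^{0}_{k-1}U_k := {}^{0}X_{R_k}\, \big({}^{0}X_{R_{k-1}}\big)^{-1} \in SE(3)
    \end{align*}

The last line matches the statement above: {}^{0}_{k-1}U_k relates the same 3D point across frames directly, so the object's global pose never has to be estimated explicitly.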
In a practical setting, instance segmentation of the image at frame k yields the dynamic points of an instance, with pixel coordinates p_k^i and corresponding 3D points q_k^i; the corresponding 3D points q_{k-1}^i of frame k-1 are already known. At time k-1, taking the object motion of frame k-1 as the reference, Equation (8) holds:
(8)
Reprojecting the 3D points of frame k-1 through the camera coordinate frame onto frame k gives the projected points \hat{p}_k^i, as in Equation (9):
(9)
where {}^{k-1}_{k-1}H_k ∈ SE(3) denotes, in the camera frame of frame k-1, the inverse of the pose change from frame k-1 to frame k, and π(·) is the camera projection function. Parameterizing {}^{k-1}_{k-1}H_k as {}^{k-1}_{k-1}H_k = \exp({}^{k-1}_{k-1}\varepsilon_k) with {}^{k-1}_{k-1}\varepsilon_k ∈ se(3), and analogously to camera pose estimation, the optimal {}^{k-1}_{k-1}H_k is found by minimization, as in Equation (10):

{}^{k-1}_{k-1}\varepsilon_k^{*} = \arg\min \sum_{i}^{n} \rho_h\left( p_k^i - \hat{p}_k^i \right)    (10)

where ρ_h is a robust kernel function. Given n pairs of a 2D point at frame k and a 3D point at frame k-1 on the object, the motion of the object points is recovered via Equation (11):

{}^{k-1}_{k-1}U_k = {}^{k-1}_{k-1}T_k \, {}^{k-1}_{k-1}H_k    (11)
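The following is a small sketch of the minimization in Equation (10) using SciPy: the SE(3) increment is parameterized by an axis-angle rotation and a translation, the 3D points of frame k-1 are transformed and projected with a pinhole model, and a Huber-robustified least-squares problem stands in for the robust kernel ρ_h. It is an illustrative implementation only; the intrinsics and helper names are assumptions, not the actual DM-SLAM code.

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def project(points, K):
        """Pinhole projection of (N, 3) camera-frame points with intrinsics K."""
        uv = (K @ points.T).T
        return uv[:, :2] / uv[:, 2:3]

    def residuals(xi, q_prev, p_curr, K):
        """Reprojection residuals for the 6-DoF transform xi = (rotvec, t)."""
        R, t = Rotation.from_rotvec(xi[:3]).as_matrix(), xi[3:]
        q_transformed = (R @ q_prev.T).T + t          # apply the transform to the k-1 points
        return (project(q_transformed, K) - p_curr).ravel()

    def estimate_object_motion(q_prev, p_curr, K):
        """Solve the analogue of Eq. (10): minimize the robust reprojection error."""
        sol = least_squares(residuals, x0=np.zeros(6), loss="huber", f_scale=1.0,
                            args=(q_prev, p_curr, K))
        R, t = Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
        H = np.eye(4); H[:3, :3], H[:3, 3] = R, t
        return H

    if __name__ == "__main__":
        K = np.array([[718.856, 0.0, 607.19], [0.0, 718.856, 185.22], [0.0, 0.0, 1.0]])
        rng = np.random.default_rng(0)
        q_prev = rng.uniform([-2, -1, 5], [2, 1, 15], size=(50, 3))      # 3D points at k-1
        R_true = Rotation.from_rotvec([0.0, 0.05, 0.0]).as_matrix()      # ground-truth motion
        t_true = np.array([0.3, 0.0, 0.8])
        p_curr = project((R_true @ q_prev.T).T + t_true, K)              # observed 2D points at k
        print(np.round(estimate_object_motion(q_prev, p_curr, K), 3))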
Work carried out
Rigid-body motion was studied and a convention for defining objects was designed: the pixels of each object are obtained from the instance segmentation mask and back-projected through the camera to obtain 3D points in the current camera frame; the centroid of the object points is defined as the object center and serves as the origin of the rigid-object coordinate frame. When an object first appears, the positive direction of the line from the origin to the point with the lowest depth value inside the mask is taken as the x-axis, the direction perpendicular to it as the y-axis, and the normal of the plane spanned by the x- and y-axes as the z-axis. When the object reappears, its coordinate frame is kept unchanged.
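A small NumPy sketch of this frame construction (centroid as origin, x-axis towards the lowest-depth masked point, y- and z-axes completed by orthogonalization) is given below. The orthogonalization step is an assumption about how "the perpendicular direction" is chosen, since any vector perpendicular to x would satisfy the description.

    import numpy as np

    def build_object_frame(points, depths):
        """Construct an object coordinate frame from its 3D points.

        points: (N, 3) object points in the current camera frame.
        depths: (N,) depth of each point (used to pick the x-axis reference point).
        Returns (origin, R) where the columns of R are the x/y/z axes.
        """
        origin = points.mean(axis=0)                      # centroid = frame origin
        ref = points[np.argmin(depths)]                   # lowest-depth point in the mask
        x_axis = ref - origin
        x_axis /= np.linalg.norm(x_axis)
        # Pick a helper direction not parallel to x, then orthogonalize to get y (assumption).
        helper = np.array([0.0, 0.0, 1.0])
        if abs(np.dot(helper, x_axis)) > 0.9:
            helper = np.array([0.0, 1.0, 0.0])
        y_axis = helper - np.dot(helper, x_axis) * x_axis
        y_axis /= np.linalg.norm(y_axis)
        z_axis = np.cross(x_axis, y_axis)                 # normal of the x-y plane
        return origin, np.stack([x_axis, y_axis, z_axis], axis=1)

    if __name__ == "__main__":
        pts = np.random.default_rng(1).uniform([-1, -0.5, 8], [1, 0.5, 10], size=(200, 3))
        origin, R = build_object_frame(pts, depths=pts[:, 2])
        print(origin.round(2), "\n", (R.T @ R).round(3))  # R is orthonormal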
Global object tracking was implemented: the rigid-object motion estimation method for outdoor scenes is applied to estimate object motion poses; the moving instances are determined with the method of Research Content I, and moving instances are associated across frames via masks and optical flow. A minimization objective is designed and the corresponding poses are solved through reprojection. The final results are shown in Figures 1-5 and 1-6: the system simultaneously tracks different instances and computes their motion poses.
Figure 1-5 Object association (1)
Figure 1-6 Object association (2)
4. Interim research outcomes
In accordance with the format requirements of the Specifications for Writing Graduate Dissertations, list by category the interim research outcomes related to the thesis, such as journal papers, conference papers, patents, and research awards; only outcomes as first author, or as second author when the supervisor is first author, may be listed; for accepted, submitted, or to-be-submitted work, indicate the status in parentheses (may be continued on extra pages)
Main existing problems and solutions
1. Research work not completed according to the proposal plan; fundamental or technical difficulties in the research; and limitations such as experimental conditions (may be continued on extra pages)
(1) Real-time performance of instance segmentation
While pursuing high segmentation accuracy, the model used can hardly meet the system's real-time requirements, so the images are pre-processed to keep the system running normally.
(2) Design of the back-end optimization
The current back-end optimization is based on graph optimization with static and dynamic points as factors, which improves the tracked pose estimates to some extent; however, a complete back-end optimization graph still needs to be designed and its effect verified experimentally.
(3) Robustness verification of the whole system
More tests are needed to verify the robustness of the system.
2. Solutions to the above problems; further adjustments to the thesis research content, theoretical methods, technical route, and implementation plan; and the next-step research plan (may be continued on extra pages)
(1) For the real-time instance segmentation problem, further study real-time instance segmentation models and seek a balance between accuracy and real-time performance; add a separate thread that runs the instance segmentation task concurrently, with communication implemented via ROS.
(2) For the back-end module, design a new optimization objective and refine the optimization factors; try adding information such as object velocity to the joint optimization so as to obtain more accurate camera poses while estimating the trajectories of multiple moving objects.
(3) For the whole system, verify its robustness on the KITTI dataset and then further test it on other public outdoor datasets.
The next-step research plan is shown in Table 2-1.
Table 2-1 Next-step research plan

Period          | Work to complete
2024.09-2024.10 | Complete the design and implementation of the back end
2024.10-2024.11 | Study real-time instance segmentation
2024.10-2024.11 | Complete tests of the DM-SLAM system on different datasets; complete one patent application
2024.11-2024.12 | Write and complete the first draft of the master's thesis
2025.01-2025.03 | Revise and improve the master's thesis
2025.03-2025.04 | Complete the master's thesis defense
Mid-term Assessment Review Comments
1. Supervisor's comments on the work progress and research plan:
Progress is in line with expectations, and the research plan is reasonable and feasible.
Signature of supervisor (group):    September 20, 2024
2. Comments of the mid-term assessment expert panel
Assessment date
2024.09.20
Assessment venue
Conference Room B105, 1st floor, Building B, International Innovation Center
Tencent Meeting: 479-687-638
Assessment experts
杨远望、庄杰、李耶
Assessment result
Pass: 3 votes    Basically pass: 0 votes    Fail: 0 votes
Conclusion
Passed    □ Conditionally passed    □ Not passed
Passed: all votes are Pass
Conditionally passed: 1 vote is Basically pass or Fail, and the rest are Pass or Basically pass
Not passed: 2 or more votes are Fail
Comments on the progress of the thesis work, the candidate's research ability and work style, and suggestions on the next-step research plan, and whether the candidate is suitable to continue pursuing the degree:
The research work is progressing normally, the plan is feasible, and the candidate is suitable to continue pursuing the degree.
Signatures of the assessment panel:
September 20, 2024
3. School's comments:
Signature of person in charge:        Date: