
RGB-D visual odometry by constructing and matching features at superpixel level

Published online by Cambridge University Press:  18 September 2024

Meiyi Yang
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, China; Department of Mechanical Engineering, City University of Hong Kong, Kowloon, Hong Kong
Junlin Xiong*
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, China
Youfu Li
Affiliation:
Department of Mechanical Engineering, City University of Hong Kong, Kowloon, Hong Kong
*
Corresponding author: Junlin Xiong; Email: [email protected]

Abstract

Visual odometry (VO) is a key technology for estimating camera motion from captured images. In this paper, we propose a novel RGB-D visual odometry that constructs and matches features at the superpixel level and adapts to different environments better than state-of-the-art solutions. Superpixels are content-sensitive and aggregate information well; they can thus characterize the complexity of the environment. Firstly, we design the superpixel-based feature SegPatch and its corresponding 3D representation MapPatch. By using neighboring information, SegPatch remains distinctive in environments with different texture densities. Because depth measurements are included, the MapPatch represents the scene structurally. Then, the distance between SegPatches is defined to characterize regional similarity, and a graph search in scale space is used for searching and matching. As a result, the accuracy and efficiency of the matching process are improved. Additionally, we minimize the reprojection error between matched SegPatches and estimate camera poses from all these correspondences. Our proposed VO is evaluated on the TUM dataset both quantitatively and qualitatively, showing a good balance in adapting to environments under different realistic conditions.

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

Visual odometry (VO) enables robots to perceive their surrounding environments and localize themselves, and is thus essential in many industries, such as autonomous driving and augmented reality. VO is generally regarded as a key component of visual-based Simultaneous Localization And Mapping (v-SLAM), which further incorporates loop closure detection and global optimization modules to alleviate the accumulated drift in the long run [Reference Company-Corcoles, Garcia-Fidalgo and Ortiz1]. Compared with LiDAR- and GPS-based methods, VO estimates the trajectories of robots by analyzing the sequences of captured images. Cameras provide abundant texture information at a lower hardware cost, and the depth of the environment can be measured efficiently with the availability of RGB-D cameras, especially indoors.

In recent years, significant advancements have been made in real-time methods for SLAM and VO. Typical systems [Reference Mur-Artal, Montiel and Tardos2Reference Engel, Koltun and Cremers6] operate directly on pixels and estimate camera poses through correspondences between pixels. Feature-based methods [Reference Mur-Artal, Montiel and Tardos2Reference Wang and Lin4] find corresponding feature points and estimate camera poses by solving the PnP problem. Feature points are distinctive in texture-rich environments, but their performance generally degrades in low-texture environments. Direct methods [Reference Kerl, Sturm and Cremers5, Reference Engel, Koltun and Cremers6], which operate on the raw measurement values, are often affected by brightness changes. Deep learning-based approaches [Reference Jia, Pu, Chen, Cheng, Liao and Yang7, Reference Yang, v. Stumberg, Wang and Cremers8] have shown promising results with end-to-end training, but face challenges in generalization and computational efficiency. High-level features such as lines [Reference Taubner, Tschopp, Novkovic, Siegwart and Furrer9Reference Zhao, Huang, Yan and Dissanayake11] and planes [Reference Yang, Song, Kaess and Scherer12, Reference Yang and Scherer13] describe the environment structurally. Fitted from large numbers of points, planes can effectively suppress the influence of measurement noise, especially in indoor environments. Indoor environments mostly consist of manufactured objects and structures whose surfaces are usually planes with limited size, contours, and edges. Planes representing indoor environments can thus be exploited to construct constraints for robust pose estimation.

In this paper, we design features at the superpixel level. A superpixel is a set of adjacent pixels with similar characteristics, including similar brightness, similar texture, and low contour energy inside the region [Reference Ren and Malik14]. Superpixels have good properties such as content sensitivity, high information aggregation, and edge preservation. We design SegPatch as a superpixel-based feature for scene representation instead of points and planes. SegPatch exploits the regional information of the superpixel to approximate the minor details of small structures, achieving higher computational efficiency than planes in complex environments. At the same time, SegPatch removes spatial redundancy, making it less affected by measurement noise and more distinctive, thus offsetting the lack of feature points in low-texture environments. As a result, SegPatch can robustly characterize environments with various texture densities. Correspondingly, MapPatch is designed for mapping: structural parameters are estimated from depth measurements and updated with the local map.

Figure 1. Pipeline of superpixel-based visual odometry.

We propose a superpixel-based visual odometry to achieve robust and accurate pose estimation. Using an RGB-D camera, we decompose the RGB image into superpixels and estimate structural parameters from the depth image. SegPatches are then constructed and their corresponding MapPatches are initialized. We propose a novel matching algorithm to establish correspondences between two consecutive frames or between frames and the local map. The similarity between SegPatches is measured by a defined distance. To improve computational efficiency, we propose a novel search method that follows the idea of approximate nearest neighbor search with hierarchical navigable small world graphs (HNSW [Reference Malkov and Yashunin15]) in scale space. By constructing constraints derived from brightness variation, epipolar geometry, and normal consistency, the camera pose is finally estimated from all matched correspondences, and the local map is built simultaneously. Figure 1 shows the framework of our proposed VO system. We provide a real-world evaluation on the TUM RGB-D dataset [Reference Sturm, Engelhard, Endres, Burgard and Cremers16] to demonstrate the feasibility of superpixel-based features in a VO system and the good balance of our proposed VO in adapting to different environments.

In summary, the contributions of this paper are as follows:

  1. We design a superpixel-based feature named SegPatch. Through the approximation of small structures and the elimination of redundant information, SegPatch shows good adaptability and uniqueness in different environments.

  2. We propose a novel matching algorithm for SegPatches. Efficient search strategies and accurate similarity measurements robustly provide credible correspondences for further use.

  3. We propose a superpixel-based RGB-D visual odometry and evaluate it on the TUM indoor dataset, showing good adaptability in various environments and reaching results comparable with state-of-the-art VO systems.

The rest of the paper is structured as follows. Section 2 provides background on related work. We first introduce the superpixel-based feature SegPatch in Section 3, then present the matching algorithm and pose estimation method within the VO framework (Section 4), followed by the implementation and experiments in Section 5. Section 6 concludes the paper.

2. Related works

According to the tracking and mapping methods used, traditional geometry-based VO can be roughly divided into feature-based and direct methods [Reference Yang, Wang and Cremers17]. Among them, feature-based approaches are typically more robust to illumination changes than direct methods. In this paper, our proposed visual odometry constructs and matches features at the superpixel level. As such, our method is relevant to the existing VO methods reviewed below.

Point-feature-based methods. A number of works using point features have been reported; notable examples are ORB-SLAM [Reference Mur-Artal, Montiel and Tardos2], RGBD-SLAM [Reference Endres, Hess, Sturm, Cremers and Burgard18], and DVO [Reference Kerl, Sturm and Cremers5]. Most of these methods have been extended to stereo or RGB-D versions, such as ORB-SLAM2 [Reference Mur-Artal and Tardós19] and DVO for RGB-D cameras [Reference Kerl, Sturm and Cremers20]. Point-feature-based methods generally extract and match salient features and minimize the geometric reprojection error. The availability of robust feature detectors and descriptors enables feature-based methods to construct correspondences even under large inter-frame motion [Reference Pretto, Menegatti, Bennewitz, Burgard and Pagello21]. Therefore, point-feature-based methods often perform robustly in texture-rich scenes involving varying illumination, occlusions, and large viewpoint changes. However, the heavy dependence on texture impairs their robustness in low-texture environments due to the shortage of reliable feature points.

Plane- and surfel-based methods. To address the problems of point features, planes have been exploited to improve performance in a few systems. Earlier works such as [Reference Weingarten and Siegwart22] added planes to the extended Kalman filter state vector. However, the growing size of the dense covariance matrix leads to a huge computational cost, so these methods are only applicable to small scenes. In ref. [Reference Lee, Lim, Lee, An and Oh23], a fast plane extraction and matching method for indoor mapping was proposed, and a minimal representation for infinite planes was introduced in ref. [Reference Kaess24]. CPA-SLAM [Reference Ma, Kerl, Stückler and Cremers25] modeled the environment with a global plane model and used the planes for tracking and global graph optimization. KDP-SLAM [Reference Hsiao, Westman, Zhang and Kaess26] then proposed a keyframe-based approach built on dense plane extraction and matching using a projective plane association algorithm. Besides, the authors of [Reference Taguchi, Jian, Ramalingam and Feng27, Reference Zhang, Wang, Qi, Liao and Wei28] used points and planes together and optimized them jointly for higher accuracy and efficiency. Structural properties of indoor scenes, such as the Manhattan World assumption, were exploited to construct relationships between planes and points [Reference Li, Yunus, Brasch, Navab and Tombari29].

Compared to planes extracted from point clouds as above, surfels [Reference Pfister, Zwicker, van Baar and Gross30], as finite planar elements, often provide more accurate surface representation, better surface fitting, and more efficient computation. DPPTAM [Reference Concha and Civera31] modeled the environment with 3D points in high-gradient areas and 3D piecewise planes in low-gradient areas for dense scene reconstruction. The Probabilistic Surfel Map [Reference Yan, Ye and Ren32] was proposed to maintain a globally consistent map with both photometric and geometric uncertainties encoded. UcoSLAM [Reference Muñoz-Salinas and Medina-Carnicer33] combined keypoints and squared planar markers as features, alleviating the relocalization problem in repetitive environments. In SP-SLAM [Reference Cho, Jo and Kim34], feature points and surfels were also optimized together to eliminate the effect of different texture densities on the system. MSC-VO [Reference Company-Corcoles, Garcia-Fidalgo and Ortiz1] introduced structural constraints combined with estimated Manhattan axes and reprojection errors, allowing the method to work in a wider variety of scenarios. In ManhattanSLAM [Reference Yunus, Li and Tombari35], surfels were initialized directly from sparse planes by extracting superpixels to exploit the structural information of indoor scenes.

Learning-based methods. The great progress of deep learning has stimulated learning-based VO, which predicts the relative poses between images with supervised [Reference Wang, Clark, Wen and Trigoni36] or unsupervised learning [Reference Li, Wang, Long and Gu37]. DeepVO [Reference Wang, Clark, Wen and Trigoni36] employed an RNN to predict camera poses end-to-end from input image sequences, and UnDeepVO [Reference Li, Wang, Long and Gu37] concurrently trained a pose network and a depth network to form a CNN-based VO system. DVSO [Reference Yang, Wang, Stuckler and Cremers38] proposed a virtual stereo term that incorporates the depth estimation from a semi-supervised network into a direct VO pipeline. D$^2$VO [Reference Jia, Pu, Chen, Cheng, Liao and Yang7] presented a monocular visual odometry combining deep learning with a direct method, and D3VO [Reference Yang, v. Stumberg, Wang and Cremers8] exploited deep networks on three levels – deep depth, pose, and uncertainty estimation. However, although learning-based VO has advantages in feature extraction and robustness, it suffers from high computational demands and limited generalization.

In this article, we propose a new VO method in which SegPatch is used as a superpixel-based feature. SegPatch compensates for the shortcomings of points and surfels in representing various environments. Moreover, our superpixel-based method also holds potential for combination with learning-based methods, given its ability to extract informative and compact feature descriptors from images. A detailed explanation of the superpixel-based VO is given in the following sections.

3. Concepts and model formulation

This section explains the fundamental concepts that are necessary to follow the remainder of the paper. Our method is the first VO system using superpixel-based features; thus, we design the superpixel-based feature SegPatch as its basis.

3.1. Superpixel

Superpixels are sub-regions of an over-segmented image that group similar information and are sensitive to variations in texture density. Using the SLIC [Reference Achanta, Shaji, Smith, Lucchi, Fua and Süsstrunk39] algorithm, superpixels are initialized on a regular grid, and the pixels are then clustered locally. To keep the central intensity gradient fluctuating within a specific range, we set upper and lower limits for merging and splitting. Iterations are repeated until every pixel is classified stably. An image $I$ is finally decomposed into $M$ superpixels $I=\{S_i, i = 1,\ldots, M\}$ , where $\forall S_i, S_j \in I, i\neq j$ , $S_i \bigcap S_j = \emptyset$ . For a region $\Omega \in I$ , the higher the texture density in $\Omega$ , the more superpixels are generated, and vice versa. Thus, the distribution of superpixels of different sizes reflects the complexity of the environment. Due to intra-region similarity and inter-region dissimilarity, superpixels maintain only the major information of their sub-regions, ensuring their effectiveness in various environments.
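As a concrete illustration of this decomposition step, the following minimal C++ sketch uses the SLIC implementation in OpenCV's ximgproc contrib module; the input filename, grid spacing, and compactness value are placeholders, and the merge/split limits on the central intensity gradient described above are omitted.

```cpp
// Minimal sketch: decompose an RGB frame into SLIC superpixels with OpenCV
// (requires the opencv_contrib ximgproc module). The merge/split thresholds on
// the central intensity gradient described in the text are not implemented here.
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/ximgproc/slic.hpp>

int main() {
    cv::Mat bgr = cv::imread("rgb.png");            // hypothetical input frame
    if (bgr.empty()) return 1;
    cv::Mat lab;
    cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);      // SLIC clusters in LAB space

    const int region_size = 20;                     // initial grid spacing (pixels)
    const float ruler = 10.0f;                      // compactness weight
    auto slic = cv::ximgproc::createSuperpixelSLIC(lab, cv::ximgproc::SLIC,
                                                   region_size, ruler);
    slic->iterate(10);                              // local clustering iterations

    cv::Mat labels;                                 // per-pixel superpixel index
    slic->getLabels(labels);
    const int M = slic->getNumberOfSuperpixels();   // number of superpixels in I

    std::cout << "decomposed into " << M << " superpixels" << std::endl;
    return 0;
}
```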

3.2. SegPatch and MapPatch

Because of its irregular and unstable shape, a superpixel alone is insufficient to provide a robust image descriptor. We propose SegPatch as a superpixel-based feature, extended from SuperPatch [Reference Giraud, Ta, Bugeau, Coupe and Papadakis40]. A SegPatch $\textbf{S}_i$ is centered at the superpixel $S_i$ and combines it with its neighboring superpixels $\{S_{i^{\prime}}\}$ such that $\textbf{S}_i = \{ S_{i^{\prime}}| \| c_i - c_{i^{\prime}} \|_2 \leq \mathcal{R} \}$ , where $c_i$ is the spatial barycenter and $\mathcal{R}$ is a parameter based on the scale factor and superpixel granularity. We use $\mathcal{R}$ to control the number of neighboring superpixels so that every SegPatch has a similar topological size.

The SegPatch is then constructed as a dual descriptor $\textbf{S}_i = \{F_{S_i}, \textbf{X}_{S_i}\}$ , including the inner-feature $F_{S_i}$ and the inter-features $\textbf{X}_{S_i}$ . The inner-feature $F_{S_i} = \{\textbf{p}_i, \textbf{c}_i, r_i, s_i\}$ contains the information of the central superpixel, where $\textbf{p}_i = (u_i, v_i)^T$ is the coordinate of the barycenter, $\textbf{c}_i = (l_i, a_i, b_i)$ is the mean color in LAB color space, $\pi r_i^2$ represents the size of the central superpixel, and $s_i$ is the scale factor. The inter-features $\textbf{X}_{S_i} = \{X_{S_{i^{\prime}}}\}_{i^{\prime}=1,2,\ldots m}$ describe the relationship between the central superpixel and its neighboring superpixels. For every neighboring element, $X_{S_{i^{\prime}}} = (u_i - u_{i^{\prime}}, v_i - v_{i^{\prime}})$ is a vector pointing to the central superpixel, where $(u_{i^{\prime}},v_{i^{\prime}})$ is the barycenter of the neighboring superpixel. Letting $I(S_i)$ be the mean intensity of the superpixel, the gradient of the SegPatch

(1) \begin{align} \left \{ \begin{aligned} g_x &= \sum _{S_{i^{\prime}}\in \textbf{S}_i} (I(S_{i^{\prime}})-I(S_i)) \cdot (u_{i^{\prime}} - u_i) \\ g_y &= \sum _{S_{i^{\prime}}\in \textbf{S}_i} (I(S_{i^{\prime}})-I(S_i)) \cdot (v_{i^{\prime}} - v_i) \end{aligned}\right . \end{align}

is a holistic character to describe the direction of change in a small region.
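To make the dual descriptor concrete, the sketch below shows one possible in-memory layout for a SegPatch together with the gradient of Eq. (1); the struct and field names are illustrative assumptions rather than the authors' implementation.

```cpp
// Illustrative data layout for a SegPatch and the gradient of Eq. (1).
// Struct and field names are assumptions, not the authors' implementation.
#include <utility>
#include <vector>

struct Superpixel {
    float u, v;          // barycenter p_i = (u_i, v_i)
    float l, a, b;       // mean LAB color c_i
    float r;             // pi * r^2 approximates the superpixel area
    float s;             // scale factor of the pyramid layer
    float intensity;     // mean intensity I(S_i)
};

struct SegPatch {
    Superpixel center;                 // inner-feature F_{S_i}
    std::vector<Superpixel> neighbors; // superpixels within topological radius R
};

// Gradient (g_x, g_y) of a SegPatch, Eq. (1): intensity differences weighted by
// the displacement of each neighbor's barycenter from the central barycenter.
std::pair<float, float> segPatchGradient(const SegPatch& sp) {
    float gx = 0.f, gy = 0.f;
    for (const auto& n : sp.neighbors) {
        const float dI = n.intensity - sp.center.intensity;
        gx += dI * (n.u - sp.center.u);
        gy += dI * (n.v - sp.center.v);
    }
    return {gx, gy};
}
```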

Figure 2. SegPatch construction. Superpixels are generated into different sizes in different texture regions. The SegPatch is constructed on the neighborhood with a topological radius $\mathcal{R}$ . The SegPatch is described by a dual descriptor, including inner- and inter-information of the center superpixel. The arrow shows the direction of the SegPatch.

In 3D space, we further construct the MapPatch for scene reconstruction. A MapPatch is described as $\Sigma _M = [\{P_i\}_M,\textbf{n}_M,R_M]$ and maintains the indices of its corresponding SegPatches. $\{P_i\}_M$ is the set of control points in the world coordinate frame, where $P_i$ is initialized from the barycenter of SegPatch $\textbf{S}_i$ , that is, $P_i = K^{-1}\textbf{p}_i$ . The SVD method is used to initialize the normal $\textbf{n}_M$ from the depth data. The radius $R_M = 2rz/(f_x+f_y)$ is initialized so that the projection of the MapPatch covers the corresponding superpixel in the observed frame. When a MapPatch is observed from different perspectives, the number of its control points increases, and the normal $\textbf{n}_M$ is fine-tuned by minimizing a fitting error defined as

(2) \begin{align} E_{\textbf{n}_M} = \sum _i L_\delta (\textbf{n}_M \cdot (P_i-\bar{P}_M)+b) \end{align}

where $\bar{P}_M$ is the mean of the control points $P_i$ and $b$ estimates the bias. $R_M$ is always kept as the minimal radius from $\bar{P}_M$ that covers all the corresponding superpixels in the different frames.
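The normal initialization can be sketched with Eigen as follows: the control points are centered on their mean and the singular vector associated with the smallest singular value is taken as $\textbf{n}_M$. The multi-view refinement of Eq. (2) with the robust loss is not shown, and the function name is illustrative.

```cpp
// Sketch of the SVD-based normal initialization for a MapPatch using Eigen.
// The fine-tuning of Eq. (2) over multiple views is not shown.
#include <Eigen/Dense>
#include <vector>

Eigen::Vector3d estimateNormal(const std::vector<Eigen::Vector3d>& points) {
    // Mean of the control points (P_bar in the text).
    Eigen::Vector3d mean = Eigen::Vector3d::Zero();
    for (const auto& p : points) mean += p;
    mean /= static_cast<double>(points.size());

    // Centered data matrix, one row per control point.
    Eigen::MatrixXd A(points.size(), 3);
    for (int i = 0; i < static_cast<int>(points.size()); ++i)
        A.row(i) = (points[i] - mean).transpose();

    // The right singular vector of the smallest singular value is the normal.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinV);
    return svd.matrixV().col(2).normalized();
}
```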

The SegPatch construction is illustrated in Figure 2. The definitions of SegPatch and MapPatch fully exploit the sensitivity of superpixels to texture density. SegPatch adaptively adjusts its size in areas with different texture densities, ensuring its robustness in various environments. In low-texture regions, a SegPatch covers a larger area, and its saliency is greatly improved by removing image redundancy; using the structural parameters, MapPatch models the environment structurally. In texture-rich areas, superpixels degenerate towards individual pixels, and SegPatch, as a superpixel-based feature, retains uniqueness and robustness due to the inclusion of local information. Built from small SegPatches, the accuracy of MapPatch does not suffer from the approximation; on the contrary, the accuracy of the reconstruction is guaranteed by the fine-grained structure of the fragments. Additionally, as a regional feature, SegPatch is also insensitive to image noise and motion blur.

3.3. Scale space and image pyramid

Figure 3. Instead of extracting superpixels in different sizes, the use of Gaussian down-sampling can better reflect the distance of observation. (a) Shows the image pyramid constructed for superpixel segmentation. We introduce the idea of HNSW algorithm (b) for fast searching and matching through the image pyramid. (c) Shows the process of the proposed matching algorithm. Different from the original HNSW algorithm, the corresponding node in the next layer is not the best-matching node in this layer, but a new node split from it in the scale space.

SegPatch needs to be constructed at different scales because searching for correspondences requires comparing image content seen at different scales. The scale space is commonly represented by an image pyramid: the images are repeatedly smoothed with a Gaussian filter and then sub-sampled to obtain the higher levels of the pyramid. The down-sampling procedure reflects the variation of the observation distance. We extract superpixels at each layer of the pyramid and assign them to superpixels in the next higher layer. Superpixels at the same level that are assigned to the same superpixel above are sibling nodes of each other and children of that superpixel. Thus, the pyramid forms a fully connected hierarchical graph in the form of a multi-way tree. The construction of the image pyramid is shown in Figure 3a.
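A minimal sketch of the pyramid construction with OpenCV is given below; the assignment of a child superpixel to its parent in the coarser layer (by down-scaled barycenter position) is only indicated in a comment, as the exact bookkeeping is implementation-specific.

```cpp
// Sketch: build an N-layer Gaussian image pyramid with OpenCV. Superpixels are
// then extracted per layer; a child superpixel is assigned to the parent whose
// region contains its down-scaled barycenter. Parameter names are illustrative.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Mat> buildPyramid(const cv::Mat& image, int layers) {
    std::vector<cv::Mat> pyramid{image};
    for (int l = 1; l < layers; ++l) {
        cv::Mat down;
        cv::pyrDown(pyramid.back(), down);   // Gaussian smoothing + 2x sub-sampling
        pyramid.push_back(down);
    }
    return pyramid;
}

// Example parent lookup between adjacent layers (scale factor beta = 2):
// parent_label = labels_coarse.at<int>(cvRound(v / 2.0), cvRound(u / 2.0));
```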

After constructing image pyramids and extracting superpixels for multiple images of different scenes and views, we find that the number of superpixels decreases exponentially from lower to higher layers, and that the sizes of superpixels segmented at the same scale follow the same exponential distribution. The spatial distribution of the superpixel counts across layers is also uniform. The pyramid can thus distinguish texture-rich regions from low-texture regions.

4. Superpixel-based visual odometry

The structure of the proposed superpixel-based visual odometry is shown in Figure 1. The system starts with RGB-D image preprocessing, in which images are segmented into superpixels, SegPatches are constructed in scale space, and MapPatches are initialized. The searching and matching procedure provides correspondences between two consecutive frames or between frames and the local map. Finally, the camera pose is estimated by minimizing the reprojection errors between SegPatch measurements and observations. The local map is updated in a separate thread. This section describes the key methods of the proposed VO in detail.

4.1. SegPatch comparison, searching, and matching

For two SegPatches, the numbers of elements and the geometries are generally different, so they are difficult to compare directly. To measure the similarity between two SegPatches $\textbf{S}_i^A$ and $\textbf{S}_j^B$ in different images $I_A$ and $I_B$ , the distance is defined as follows.

(3) \begin{align} D\left(\textbf{S}_i^A, \textbf{S}_j^B\right) = \frac{\sum _{S^A_{i^{\prime}}}\sum _{S^B_{j^{\prime}}}\omega \left(S^A_{i^{\prime}},S^B_{j^{\prime}}\right)d\left(F_{i^{\prime}}, F_{j^{\prime}}\right)}{\sum _{S^A_{i^{\prime}}}\sum _{S^B_{j^{\prime}}}\omega \left(S^A_{i^{\prime}},S^B_{j^{\prime}}\right)} \end{align}

where $d$ is the Euclidean distance between the inner features. The weight $\omega$ represents the relative overlap of the neighboring superpixels $S^A_{i^{\prime}}$ and $S^B_{j^{\prime}}$ and follows a symmetric Gaussian form

(4) \begin{align} \omega \left(S^A_{i^{\prime}},S^B_{j^{\prime}}\right) = \exp \left({-x_{i^{\prime}j^{\prime}}^Tx_{i^{\prime}j^{\prime}}/\sigma _1^2}\right)\varpi \left(S^A_{i^{\prime}}\right)\varpi \left(S^B_{j^{\prime}}\right) \end{align}

where $x_{i^{\prime}j^{\prime}} = (X_{S_{i^{\prime}}}-X_{S_{j^{\prime}}}) / s_is_j$ is the relative distance between $\textbf{p}_{i^{\prime}}$ and $\textbf{p}_{j^{\prime}}$ , and $\varpi (S^A_{i^{\prime}})$ weights the influence of $S^A_{i^{\prime}}$ according to its distance to $S^A_{i}$ as $\varpi (S^A_{i^{\prime}}) = \exp \left (-|X_{S_{i^{\prime}}}|^2/(s_i\sigma _2)^2\right )$ ; $\varpi (S^B_{j^{\prime}})$ is defined analogously. $s_i$ , $s_j$ are scale coefficients, and $\sigma _1$ , $\sigma _2$ are two control parameters depending on the superpixel segmentation. Depending on the superpixel decomposition scale, $\sigma _1 = \frac{1}{2}\sqrt{N/M}$ for an image with $N$ pixels decomposed into $M$ superpixels is set as half of the average distance between superpixel barycenters. Depending on the SegPatch size, $\sigma _2 = \sqrt{2}\mathcal{R}$ is set to weight the contribution of the closest superpixels.
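For illustration, the following sketch evaluates Eqs. (3)-(4) on the SegPatch structure sketched earlier; the inner-feature distance $d$ is reduced to the LAB color difference here, while the full descriptor distance and the concrete values of $\sigma_1$, $\sigma_2$ are choices that follow the definitions above.

```cpp
// Sketch of the SegPatch distance of Eqs. (3)-(4), reusing the SegPatch struct
// from the earlier sketch. For simplicity, d(F_i', F_j') compares mean LAB
// colors only; the full inner-feature distance is an implementation choice.
#include <cmath>
#include <limits>

float innerDistance(const Superpixel& a, const Superpixel& b) {
    const float dl = a.l - b.l, da = a.a - b.a, db = a.b - b.b;
    return std::sqrt(dl * dl + da * da + db * db);
}

float segPatchDistance(const SegPatch& A, const SegPatch& B,
                       float sigma1, float sigma2) {
    float num = 0.f, den = 0.f;
    for (const auto& na : A.neighbors) {
        // Inter-feature X_{S_i'}: offset of the central barycenter from the neighbor.
        const float xa_u = A.center.u - na.u, xa_v = A.center.v - na.v;
        const float wa = std::exp(-(xa_u * xa_u + xa_v * xa_v) /
                                  ((A.center.s * sigma2) * (A.center.s * sigma2)));
        for (const auto& nb : B.neighbors) {
            const float xb_u = B.center.u - nb.u, xb_v = B.center.v - nb.v;
            const float wb = std::exp(-(xb_u * xb_u + xb_v * xb_v) /
                                      ((B.center.s * sigma2) * (B.center.s * sigma2)));
            // Scale-normalized relative offset x_{i'j'} between the two neighbors.
            const float du = (xa_u - xb_u) / (A.center.s * B.center.s);
            const float dv = (xa_v - xb_v) / (A.center.s * B.center.s);
            const float w = std::exp(-(du * du + dv * dv) / (sigma1 * sigma1)) * wa * wb;
            num += w * innerDistance(na, nb);   // weighted inner-feature distance
            den += w;
        }
    }
    return den > 0.f ? num / den : std::numeric_limits<float>::max();
}
```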

The matching task in VO consists of matching between two consecutive frames (2d-2d) and matching between frames and the local map (2d-3d). PatchMatch (PM) [Reference Barnes, Shechtman, Finkelstein and Goldman41] is a classical method for computing pixel-based patch correspondences between two images. As an extension of PM, SuperPatchMatch (SPM) [Reference Giraud, Ta, Bugeau, Coupe and Papadakis40] provides correspondences of irregular structures from superpixel decompositions. These two methods share the key idea that good correspondences can be propagated to adjacent patches within an image. However, the propagation step takes a long time to find the optimal solution.

Instead of processing images in scan order as in PM and SPM, our matching algorithm exploits a graph search method for fast candidate search. We propose a new modification of the HNSW algorithm [Reference Malkov and Yashunin15] and demonstrate that it is applicable here. HNSW is an approximate K-nearest neighbor search method based on navigable small-world graphs with a controllable hierarchy. The image pyramid satisfies the four requirements for applying the HNSW algorithm: (1) the image pyramid is a hierarchical graph; (2) the number of nodes (superpixels) decreases upwards with exponential decay; (3) the highest layer onto which a node is projected is determined by an exponentially decaying probability function based on its position and size; (4) a node can be found in all layers from its highest layer down. Furthermore, assuming that the image sequence is consecutive and the motion is not too fast, we set a trust region for optimal selection: a candidate located in the trust region has a higher probability of being the best match. The search proceeds layer by layer from coarse to fine in both scale and position. The HNSW algorithm and our proposed method are shown in Figure 3.

The following steps are then performed to find and improve the correspondences between frames. For clarity, we assume an $N$ -layer image pyramid with top layer $I_0$ ; the current best match of a SegPatch $\textbf{S}_i^A \in I^A$ in $I^B_k$ is denoted as $\mathcal{M}^B_k(i)$ , storing the distance and the index of the corresponding SegPatch. First, for every SegPatch $\textbf{S}_i^A \in I^A$ , we delineate a trust region in $I_0^B$ according to the position and size of $\textbf{S}_i^A$ . Matching candidates are selected from the trust region. After similarity measurement and comparison, the best match $\mathcal{M}^B_0(i)$ is determined. Then, the children of $\mathcal{M}^B_0(i)$ become the candidates of the next layer. Several other SegPatches in the trust region are also selected randomly to escape from possible local minima. This process is repeated until the bottom of the pyramid is reached. Instead of updating the best-matching SegPatch during traversal, obtaining the best match in every layer individually ensures scale stability. Generally, due to the small motion between two consecutive frames, there is little scale change between two corresponding SegPatches; sometimes, however, most SegPatches scale up or down when the camera moves forwards or backwards. As with feature points, the consistency of the intensity-orientation variation is checked, and incompatible matches are removed at the end.
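The 2d-2d search can be sketched as the coarse-to-fine loop below, reusing the SegPatch distance defined above. The pyramid containers, the trust-region test, and the children lookup are simplified stand-ins for the graph described in the text, the random candidates used to escape local minima are only noted in a comment, and coordinates are assumed to be expressed in a common base-layer frame.

```cpp
// Simplified coarse-to-fine 2d-2d search over the pyramid of frame B,
// built on top of the SegPatch struct and segPatchDistance() sketched above.
#include <limits>
#include <vector>

struct PyramidLayer {
    std::vector<SegPatch> patches;            // SegPatches of this layer
    std::vector<std::vector<int>> children;   // indices into the next (finer) layer
};

// True if candidate 'c' lies inside the trust region delineated around 'query'.
bool inTrustRegion(const SegPatch& query, const SegPatch& c, float radius) {
    const float du = query.center.u - c.center.u;
    const float dv = query.center.v - c.center.v;
    return du * du + dv * dv <= radius * radius;
}

// Returns the index of the best match in the finest layer reached (-1 if none).
int matchTopDown(const SegPatch& query, const std::vector<PyramidLayer>& pyrB,
                 float sigma1, float sigma2, float trustRadius) {
    std::vector<int> candidates;
    for (int i = 0; i < (int)pyrB[0].patches.size(); ++i) candidates.push_back(i);

    int best = -1;
    for (size_t layer = 0; layer < pyrB.size(); ++layer) {
        int bestThisLayer = -1;
        float bestDist = std::numeric_limits<float>::max();
        for (int idx : candidates) {
            const SegPatch& c = pyrB[layer].patches[idx];
            if (!inTrustRegion(query, c, trustRadius)) continue;
            const float d = segPatchDistance(query, c, sigma1, sigma2);
            if (d < bestDist) { bestDist = d; bestThisLayer = idx; }
        }
        if (bestThisLayer < 0) break;          // no candidate inside the trust region
        best = bestThisLayer;                  // best match of this layer
        if (layer + 1 == pyrB.size()) break;   // reached the bottom layer
        // Candidates of the next layer: children of this layer's best match,
        // plus a few random SegPatches to escape local minima (omitted here).
        candidates = pyrB[layer].children[bestThisLayer];
    }
    return best;
}
```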

Figure 4. Projection relationships between SegPatch and MapPatch. When a MapPatch is projected onto the image, there may be a small deviation from the correct corresponding SegPatch. Thus, the trust region is set smaller to find the best match.

The 2d-3d matching is used to construct a stronger connection with the local map. It is a guided match because 2d-3d matching is performed when an estimated or predicted camera pose is already available. Following the projection geometry, the MapPatch is reprojected onto the current image. As shown in Figure 4, the projected position and size of the MapPatch may not coincide exactly with the optimal SegPatch, but lie nearby. Compared to 2d-2d matching, a smaller trust region is required here to find the nearest SegPatch. Matching candidates are also selected through the image pyramid, where the similarity between the latest corresponding SegPatch and the ones within the trust region is measured. Benefiting from the motion prior, the correspondences are constructed with high probability.
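The guided 2d-3d step can be illustrated by reprojecting a MapPatch with the predicted pose, as in the Eigen-based sketch below; the struct fields follow the MapPatch definition above, while the function and variable names are assumptions.

```cpp
// Sketch: reproject a MapPatch into the current frame with a predicted pose to
// obtain the centre and radius of the (smaller) 2d-3d trust region.
#include <Eigen/Dense>

struct MapPatch {
    Eigen::Vector3d P;   // representative control point (world frame)
    Eigen::Vector3d n;   // unit normal n_M
    double R;            // physical radius R_M
};

struct Projection { double u, v, r; };   // projected centre and pixel radius

Projection projectMapPatch(const MapPatch& mp,
                           const Eigen::Matrix3d& Rcw, const Eigen::Vector3d& tcw,
                           double fx, double fy, double cx, double cy) {
    const Eigen::Vector3d Pc = Rcw * mp.P + tcw;   // world -> camera frame
    Projection proj;
    proj.u = fx * Pc.x() / Pc.z() + cx;            // pinhole projection
    proj.v = fy * Pc.y() / Pc.z() + cy;
    // Pixel radius consistent with R_M = 2 r z / (f_x + f_y) in the text.
    proj.r = mp.R * (fx + fy) / (2.0 * Pc.z());
    return proj;
}
```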

The proposed matching algorithm applies well to different cases. It is worth noting that our distance calculation does not focus on the difference in size and shape between the central superpixels, but rather on the similarity of the properties and geometry inside the SegPatch. As a result, the accuracy of the similarity measure is not affected by the deformations observed across views, motion blur, or variations in texture density. The graph search method based on the image pyramid reduces the computational complexity. In summary, our proposed matching algorithm accurately provides robust correspondences with a small computational load. Figure 5 shows some of the matches at different texture densities, illustrating the robustness of our method.

4.2. Pose estimation

The camera pose is finally estimated through all these correspondences. Relative motion estimation is ideally performed by maximizing the overlapping area between the matched superpixels in two images:

(5) \begin{align} \xi ^* = \arg \max _\xi \sum _{\substack{S_A \in\, \textbf{S}_A\\ S_B \in\, \textbf{S}_B}} \frac{T_\xi (S_A) \bigcap S_B}{T_\xi (S_A) \bigcup S_B} \end{align}

However, due to the irregular shape of the superpixels, this computation requires the expensive count of overlapping pixels, which cancels the computational advantages of the superpixel representation. For this reason, we simplify the problem to find a suboptimal solution.

Figure 5. Matches in different cases.

Figure 6. Fitting function for the IoU of two intersecting circles in the plane. In the second row, we show the comparison between our proposed fitting function and the original IoU curve: $\phi (x)$ shows the true IoU of the intersecting circles and $\psi (x)$ is the fitting function. The x-axis represents $\mathsf{d}/\min (\mathsf{R}_1,\mathsf{R}_2)$ , with $\mathsf{R}_1/\mathsf{R}_2 = 1, 1.05, 1.1, 1.2$ as examples. The third row shows the error histogram of the fitting function. Experiments show that the maximum errors of the fitting functions are 0.0438, 0.0182, 0.0114, 0.0222, the mean errors are 0.0192, 0.0107, 0.0114, 0.0062, and the error variances are 1.9073e-4, 3.2020e-5, 1.0026e-5, 2.3721e-5, respectively. The numerical experiment demonstrates that the quadratic function $\psi (x)$ can approximate the IoU well when the two circles are of similar size.

Considering two intersecting circles with the same radius $R_\circ$ on the plane, their Intersection over Union (IoU) is a nonlinear function of radius $R_\circ$ and center distance $d_\circ$ :

\begin{align*} \phi _{IOU}(R_\circ, d_\circ ) = \frac {\arccos \left(\frac {d_\circ }{2R_\circ }\right)R_\circ ^2-\frac {d_\circ }{2}\sqrt {R_\circ ^2-\left(\frac {d_\circ }{2}\right)^2}}{\left(\pi -\arccos \left(\frac {d_\circ }{2R_\circ }\right)\right)R_\circ ^2+\frac {d_\circ }{2}\sqrt {R_\circ ^2-\left(\frac {d_\circ }{2}\right)^2}},\quad d_\circ \in [0,2R_\circ ) \end{align*}

Let $k = \frac{d_\circ }{2R_\circ }$ , $\phi _{IOU}(R_\circ, d_\circ )$ could be written as

\begin{align*} \phi _{IOU}(R_\circ, d_\circ ) = \phi _{IOU}(k) = \frac {\arccos (k)-k\sqrt {1-k^2}}{\pi -\arccos (k)+k\sqrt {1-k^2}},\quad k \in [0,1) \end{align*}

To simplify the computation, $\phi _{IOU}(k)$ can be further approximated by a quadratic function of $d_\circ$

\begin{align*} \phi _{IOU\_appro}(d_\circ ) = (1-k)^2 = \left(\frac {2R_\circ -d_\circ }{2R_\circ }\right)^2,\quad d_\circ \in [0,2R_\circ ) \end{align*}

Numerical results show that the proposed approximation scheme is efficient. More generally, for two circles with different radii ${R_\circ }_1$ and ${R_\circ }_2$ , the IoU is approximated as

\begin{align*} \phi _{IOU\_appro}(d_\circ ) = \left(\frac {{R_\circ }_1+{R_\circ }_2-d_\circ }{2\max({R_\circ }_1,{R_\circ }_2)}\right)^2 + c,\quad d_\circ \in \left [|{R_\circ }_1-{R_\circ }_2|,{R_\circ }_1+{R_\circ }_2\right ) \end{align*}

As shown in Figure 6, numerical experiments show that when the radii of the two circles are similar, the accuracy of the approximation is within an acceptable range for the following use. Given that there is generally no significant scale variation between consecutive images, corresponding SegPatches tend to have similar sizes. Thus, fitting the complex nonlinear function by a simple quadratic function is feasible for obtaining a suboptimal IoU solution. Equation (5) is then simplified as

(6) \begin{align} \begin{aligned} \xi ^* &= \arg \max _\xi \sum _{\substack{S_A \in\, \textbf{S}_A\\ S_B \in\, \textbf{S}_B}} \left (\frac{r_1+r_2 - \|T_\xi (x_A) - x_B \|_2 }{2\max (r_1,r_2)}\right )^2 \\ &\Rightarrow \arg \max _\xi \sum _{\substack{S_A \in\, \textbf{S}_A\\ S_B \in\, \textbf{S}_B}} \left (\frac{r_1+r_2}{2\max (r_1,r_2)}- \frac{\|T_\xi (x_A) - x_B \|_2}{2\max (r_1,r_2)} \right )^2 \\ &\Rightarrow \arg \min _\xi \sum _{\substack{S_A \in\, \textbf{S}_A\\ S_B \in\, \textbf{S}_B}} \left (\frac{\|T_\xi (x_A) - x_B \|_2}{2\max (r_1,r_2)}\right )^2 \\ &\Rightarrow \arg \min _\xi \sum _{\substack{S_A \in\, \textbf{S}_A\\ S_B \in\, \textbf{S}_B}} \mu _{AB}\|T_\xi (x_A) - x_B \|_2^2 \end{aligned} \end{align}

where $\mu _{AB} = 1/(2\max (r_1,r_2))^2$ is a weight depending on the sizes of the matched SegPatch pair. Thus, the problem is approximately reduced to minimizing the relative distance between superpixel coordinates. The resulting weighted least-squares problem can be solved by the Gauss-Newton or Levenberg-Marquardt algorithm.
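The sketch below implements the quadratic IoU surrogate and the per-correspondence weight $\mu_{AB}$ of Eq. (6); feeding the resulting weighted residuals into a standard Gauss-Newton or Levenberg-Marquardt solver (for example via g2o or Ceres) is left out, and the function names are illustrative.

```cpp
// Sketch of the quadratic IoU approximation and the weight mu_AB of Eq. (6).
// The SE(3) pose update itself would feed these weighted residuals into a
// Gauss-Newton / Levenberg-Marquardt solver; that part is omitted.
#include <algorithm>
#include <cmath>

// Approximate IoU of two circles with radii r1, r2 and centre distance d.
// The containment case (d < |r1 - r2|) is ignored in this sketch.
double approxIoU(double r1, double r2, double d) {
    if (d >= r1 + r2) return 0.0;                             // disjoint circles
    const double x = std::min(1.0, (r1 + r2 - d) / (2.0 * std::max(r1, r2)));
    return x * x;                                             // quadratic surrogate
}

// Weight of one matched SegPatch pair in the least-squares objective, Eq. (6).
double pairWeight(double r1, double r2) {
    const double m = 2.0 * std::max(r1, r2);
    return 1.0 / (m * m);
}

// Weighted squared reprojection residual for one correspondence:
// mu_AB * || T_xi(x_A) - x_B ||^2, with T_xi(x_A) already projected to pixels.
double weightedResidual(double uA, double vA, double uB, double vB,
                        double r1, double r2) {
    const double du = uA - uB, dv = vA - vB;
    return pairWeight(r1, r2) * (du * du + dv * dv);
}
```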

4.3. Local mapping

The local map is updated simultaneously to provide an optimal reconstruction of the surroundings of the camera. Newly observed MapPatches are added. A MapPatch is initialized by one frame but can be observed by others; therefore, it is reprojected onto the other active frames, and 2d-3d correspondences are searched as detailed above. MapPatches observed from multiple views are fused. By gathering information from SegPatches in different frames, a MapPatch maintains the key information of control points, normal, and size. As the camera moves, invisible MapPatches are removed. The local mapping is also in charge of culling redundant frames: only frames active in the current sliding window and their corresponding high-quality MapPatches are preserved in the local map. The global map is then generated by fusing and optimizing the local map. The proposed superpixel-based visual odometry thus proceeds with continuous pose estimation and mapping.
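A schematic of this bookkeeping is given below; the container layout, member names, and culling thresholds are assumptions made for illustration, not the authors' data structures.

```cpp
// Schematic bookkeeping of the local-mapping step described above.
#include <algorithm>
#include <vector>

struct Frame { int id; };

struct MapPatchEntry {
    std::vector<int> observingFrames;   // frames whose SegPatches observe it
    int observations() const { return (int)observingFrames.size(); }
};

struct LocalMap {
    std::vector<Frame> frames;          // sliding window of active frames
    std::vector<MapPatchEntry> patches;

    void cull(int currentFrameId, int windowSize, int minObservations) {
        // Drop frames that have left the sliding window.
        frames.erase(std::remove_if(frames.begin(), frames.end(),
                     [&](const Frame& f) {
                         return currentFrameId - f.id >= windowSize;
                     }), frames.end());
        // Drop MapPatches that are no longer well observed (low quality).
        patches.erase(std::remove_if(patches.begin(), patches.end(),
                      [&](const MapPatchEntry& m) {
                          return m.observations() < minObservations;
                      }), patches.end());
    }
};
```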

5. Experiments

We evaluate the accuracy and efficiency of the matching algorithm on several sequences of the publicly available TUM RGB-D dataset [Reference Sturm, Engelhard, Endres, Burgard and Cremers16] and evaluate the localization performance against state-of-the-art techniques. Our method remains practicable on these real-world image sequences under realistic conditions such as image noise, motion blur, poor focus, dynamic lighting, and other difficulties. The proposed method was implemented in $C\texttt{++}$ on a standard Linux computer with four cores at 2.50 GHz and 8 GB of RAM.

5.1. Matching accuracy and efficiency

For each image in a sequence, we construct a 4-layer image pyramid with a scale factor of $\beta =2$ . After superpixel segmentation, a SegPatch is generated from every superpixel with valid depth and a convex polygonal shape.

We evaluate the matching accuracy by the absolute distance between the centers of two matched superpixels and their average overlap in two successive images. The degree of overlap can be approximately quantified by the relative distance $\gamma = \frac{\|c_1-c_2 \|_{_2}}{r_1}$ , where $\pi r_1^2$ is the size of the central superpixel, $c_1$ is the ideal location of the SegPatch, and $c_2$ is the barycenter of the matched SegPatch. We counted more than 16 thousand groups of 2d-2d matched superpixels in images randomly selected from different sequences. Figure 7 shows the displacement for every matched superpixel. Most matched SegPatches are offset by less than 10 pixels from the ground truth, that is, about 2.5 superpixels at the superpixel level, as shown in Figure 7a and Figure 7b, respectively. The average displacement distribution is shown in Figure 7c, demonstrating regional accuracy.

Figure 7. Displacement for every superpixel when performing the proposed matching algorithm. The absolute displacement (a) shows that most matched superpixels are offset by less than 10 pixels, or no more than 2.5 equal-size superpixels, as shown in (b). The average displacement distribution (c) confirms this.

Figure 8. Influence of radius $r$ for relative displacement (a) in the proposed SegPatch matching algorithms. The average displacement (b) between each superpixel and one of its closest matches in the successive images.

In Figure 8, we show the average relative distance with respect to the size of the trust region. Figure 8a is the direct statistical distribution over all superpixels, and Figure 8b is the relationship obtained by Gaussian fitting. It illustrates that a balanced trust region size provides higher matching accuracy. In areas with poor texture, a large trust region prevents the matching algorithm from falling into a local optimum, while in areas with rich texture, a small trust region protects the accuracy of the algorithm from the influence of noise. Therefore, setting the size of the trust region according to the size and scale of the SegPatch helps to provide accurate correspondences efficiently.

As for computational efficiency, our approach uses superpixels as the fundamental processing units for input images, which involves an additional superpixel segmentation step. Considering the image size to be $MN$ pixels and the number of extracted features to be $K$ , we examine two aspects, feature extraction and matching, to analyze the computational complexity. On the one hand, for the computational efficiency of feature extraction, we compare the time complexity of our superpixel segmentation and SegPatch extraction with traditional feature-based methods. Point-feature-based SLAM methods, such as ORB-SLAM [Reference Mur-Artal, Montiel and Tardos2], exhibit a time complexity of $O(MN)$ for feature extraction. Plane-based feature extraction methods using the AHC [Reference Feng, Taguchi and Kamat42] or ICP [Reference Besl and McKay43] algorithms have higher time complexities, $O((MN)^2)$ and $O((MN)^3)$ respectively, for plane extraction, and the complexity of plane-based methods rises further in high-texture regions. Our superpixel segmentation and SegPatch extraction have a time complexity of $O(MN+K)$ , so no higher complexity is introduced in the feature extraction process. On the other hand, for the computational efficiency of the matching process, we compare the time complexity of our proposed 2d-2d matching algorithm with SPM [Reference Giraud, Ta, Bugeau, Coupe and Papadakis40]. SPM relies on the assumption that when a patch in $I_A$ corresponds to a patch in $I_B$ , the adjacent patches in $I_A$ should also match adjacent patches in $I_B$ . All SuperPatches are processed in scan order to search and propagate through patch-based approximate nearest neighbor (ANN) search, giving SPM a time complexity of $O(K^2)$ . Due to the uncertainty of the random assignment initialization and random search, the propagation takes time to improve the result, and the running time grows exponentially with the number of iterations. In our algorithm, we build a trust region and search the image pyramid top-down, exploiting the consistency and continuity of the shifts within a single image. This reduces the search time for matching and increases efficiency, so the matching time is reduced to $O(K\lg K)$ . Therefore, considering the aforementioned aspects, the overall time complexity of our algorithm is $O(MN+K+K\lg K)$ , indicating its superior efficiency in terms of computation time.

Table I. Computational cost and accuracy for matching process.

Table II. RGB-D odometry benchmark ATE RMSE (m).

Here, we concentrate exclusively on comparing the runtime efficiency of methods at the superpixel level. The computational cost and the accuracy comparison with SPM are shown in Table I, where the accuracy is measured by the percentage of matched features within 2.5 equal-size superpixels of the ground truth. Computational time and accuracy are given per frame in a single thread, with the same initial region size and ruler used for superpixel extraction. SPM results are obtained with $k=20$ ANN and $R=50$ neighboring pixels. We report results for the open-source Matlab version and for our $\text{C}\texttt{++}$ reimplementation for a fair comparison. In our method, the matching candidates are selected from the image pyramid within the trust region, and in practice we do not perform matching on every superpixel in the image: to efficiently triangulate and obtain MapPatches after obtaining correspondences, only SegPatches with valid depth measurements are processed. A multithreaded implementation is also used to shorten the computation time (around 0.02 s per frame in practice) and ensure real-time operation of the system. The experiments show that our proposed matching algorithm is efficient and accurate.

5.2. Pose estimation accuracy

We choose representative sequences from the public TUM RGB-D dataset to evaluate our system. $\texttt{fr1/desk}$ , $\texttt{fr1/room}$ , and $\texttt{fr2/desk}$ represent indoor environments with rich texture, while $\texttt{fr3/str_notex_far}$ and $\texttt{fr3/str_notex_near}$ represent low-texture indoor scenes. We follow the same evaluation protocol as ref. [Reference Sturm, Engelhard, Endres, Burgard and Cremers16] by computing the maximum drift (deviation from the ground-truth pose) across the whole sequence in translation and rotation (represented as quaternions) of the camera motion for all trajectories. We compared the estimated camera poses with the ground-truth trajectories by computing the relative trajectory root-mean-square error (RMSE) and the absolute trajectory error (ATE) RMSE in various challenging scenes. After projecting the local MapPatches onto the current frame using the estimate from the 2d-2d matching results, the pose estimation limits the relative translation error to around 2.16 cm and the rotation error to below 1.05 degrees. For most sequences, we achieve results comparable with mainstream VO algorithms, meaning that our method can maintain correct camera trajectories in challenging scenes. Table II shows the results. It is worth noting that although our method may not be the best under any single texture density, it shows a good balance in adapting to different environments, as shown in Figure 9. Additional components, such as global bundle adjustment and loop closure, are left as future work to make the system more competitive.
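For reference, the ATE RMSE reported in Table II can be computed as in the following sketch, assuming the estimated and ground-truth positions have already been time-associated and rigidly aligned as in the TUM benchmark tools.

```cpp
// Sketch: ATE RMSE over a trajectory, assuming estimated and ground-truth
// positions are already time-associated and aligned (as in the TUM tools).
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

double ateRmse(const std::vector<Eigen::Vector3d>& estimated,
               const std::vector<Eigen::Vector3d>& groundTruth) {
    const size_t n = std::min(estimated.size(), groundTruth.size());
    double sumSq = 0.0;
    for (size_t i = 0; i < n; ++i)
        sumSq += (estimated[i] - groundTruth[i]).squaredNorm();   // per-pose error
    return std::sqrt(sumSq / static_cast<double>(n));
}
```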

Figure 9. Compared with other typical VO systems, our visual odometry achieves a balanced performance in environments with different texture densities.

6. Conclusion

We have presented a novel superpixel-based visual odometry for motion estimation. The design of the superpixel-based feature SegPatch is essential for scene representation. The searching and matching algorithm constructs reliable SegPatch correspondences with high efficiency. The proposed visual odometry operates robustly under various conditions such as different texture densities, image noise, and motion blur. We have evaluated the quality of matching and pose estimation on a variety of real-world sequences to show how our method works and to demonstrate its utility for SLAM.

Author contributions

All authors contributed to the study conception, design, and implementation. Material preparation, data collection, and analysis were performed by Meiyi Yang. The original draft of the manuscript was written by Meiyi Yang and reviewed and edited by Junlin Xiong and Youfu Li. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Financial support

This work was supported by the National Natural Science Foundation of China under Grants 62273320 and 62173286.

Competing interests

The authors declare that no competing interests exist.

Ethical approval

Not applicable.

References

Company-Corcoles, J. P., Garcia-Fidalgo, E. and Ortiz, A., “MSC-VO: Exploiting Manhattan and structural constraints for visual odometry,” IEEE Robot Autom Lett 7(2), 2803–2810 (2022).
Mur-Artal, R., Montiel, J. M. and Tardos, J. D., “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans Robot 31(5), 1147–1163 (2015).
Liu, H., Li, C., Chen, G., Zhang, G., Kaess, M. and Bao, H., “Robust keyframe-based dense SLAM with an RGB-D camera,” arXiv preprint arXiv:1711.05166 (2017).
Wang, Y.-T. and Lin, G.-Y., “Improvement of speeded-up robust features for robot visual simultaneous localization and mapping,” Robotica 32(4), 533–549 (2014).
Kerl, C., Sturm, J. and Cremers, D., “Robust Odometry Estimation for RGB-D Cameras,” In: IEEE International Conference on Robotics and Automation 2013, (IEEE, 2013) pp. 3748–3754.
Engel, J., Koltun, V. and Cremers, D., “Direct sparse odometry,” IEEE Trans Pattern Anal 40(3), 611–625 (2017).
Jia, Q., Pu, Y., Chen, J., Cheng, J., Liao, C. and Yang, X., “D$^2$VO: Monocular Deep Direct Visual Odometry,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE, 2020) pp. 10158–10165.
Yang, N., v. Stumberg, L., Wang, R. and Cremers, D., “D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry,” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020) pp. 1278–1289.
Taubner, F., Tschopp, F., Novkovic, T., Siegwart, R. and Furrer, F., “LCD-Line Clustering and Description for Place Recognition,” In: 2020 International Conference on 3D Vision (3DV), (IEEE, 2020) pp. 908–917.
Pumarola, A., Vakhitov, A., Agudo, A., Sanfeliu, A. and Moreno-Noguer, F., “PL-SLAM: Real-time Monocular Visual SLAM with Points and Lines,” In: Proceedings - IEEE International Conference on Robotics and Automation, (IEEE, 2017) pp. 4503–4508.
Zhao, L., Huang, S., Yan, L. and Dissanayake, G., “A new feature parametrization for monocular SLAM using line features,” Robotica 33(3), 513–536 (2015).
Yang, S., Song, Y., Kaess, M. and Scherer, S., “Pop-up SLAM: Semantic Monocular Plane SLAM for Low-texture Environments,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2016, (IEEE, 2016) pp. 1222–1229.
Yang, S. and Scherer, S., “Monocular object and plane SLAM in structured environments,” IEEE Robot Autom Lett 4(4), 3145–3152 (2019).
Ren, X. and Malik, J., “Learning a Classification Model for Segmentation,” In: Proceedings of the IEEE International Conference on Computer Vision 2003, (IEEE, 2003) pp. 10–17.
Malkov, Y. A. and Yashunin, D. A., “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans Pattern Anal 42(4), 824–836 (2020).
Sturm, J., Engelhard, N., Endres, F., Burgard, W. and Cremers, D., “A Benchmark for the Evaluation of RGB-D SLAM Systems,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2012, (IEEE, 2012) pp. 573–580.
Yang, N., Wang, R. and Cremers, D., “Feature-based or direct: An evaluation of monocular visual odometry,” arXiv preprint arXiv:1705.04300 (2017).
Endres, F., Hess, J., Sturm, J., Cremers, D. and Burgard, W., “3-D mapping with an RGB-D camera,” IEEE Trans Robot 30(1), 177–187 (2014).
Mur-Artal, R. and Tardós, J. D., “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans Robot 33(5), 1255–1262 (2017).
Kerl, C., Sturm, J. and Cremers, D., “Dense Visual SLAM for RGB-D Cameras,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems 2013, (IEEE, 2013) pp. 2100–2106.
Pretto, A., Menegatti, E., Bennewitz, M., Burgard, W. and Pagello, E., “A Visual Odometry Framework Robust to Motion Blur,” In: IEEE International Conference on Robotics and Automation 2009, (IEEE, 2009) pp. 2250–2257.
Weingarten, J. and Siegwart, R., “EKF-based 3D SLAM for Structured Environment Reconstruction,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems 2005, (IEEE, 2005) pp. 3834–3839.
Lee, T.-K., Lim, S., Lee, S., An, S. and Oh, S.-Y., “Indoor Mapping Using Planes Extracted from Noisy RGB-D Sensors,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems 2012, (IEEE, 2012) pp. 1727–1733.
Kaess, M., “Simultaneous Localization and Mapping with Infinite Planes,” In: IEEE International Conference on Robotics and Automation (ICRA) 2015, (IEEE, 2015) pp. 4605–4611.
Ma, L., Kerl, C., Stückler, J. and Cremers, D., “CPA-SLAM: Consistent Plane-Model Alignment for Direct RGB-D SLAM,” In: IEEE International Conference on Robotics and Automation (ICRA) 2016, (IEEE, 2016) pp. 1285–1291.
Hsiao, M., Westman, E., Zhang, G. and Kaess, M., “Keyframe-based Dense Planar SLAM,” In: IEEE International Conference on Robotics and Automation (ICRA) 2017, (IEEE, 2017) pp. 5110–5117.
Taguchi, Y., Jian, Y.-D., Ramalingam, S. and Feng, C., “Point-Plane SLAM for Hand-Held 3D Sensors,” In: IEEE International Conference on Robotics and Automation 2013, (IEEE, 2013) pp. 5182–5189.
Zhang, X., Wang, W., Qi, X., Liao, Z. and Wei, R., “Point-plane SLAM using supposed planes for indoor environments,” Sensors 19(17), 3795 (2019).
Li, Y., Yunus, R., Brasch, N., Navab, N. and Tombari, F., “RGB-D SLAM with Structural Regularities,” In: IEEE International Conference on Robotics and Automation (ICRA) 2021, (IEEE, 2021) pp. 11581–11587.
Pfister, H., Zwicker, M., van Baar, J. and Gross, M., “Surfels: Surface Elements as Rendering Primitives,” In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, (ACM Press/Addison-Wesley Publishing Co, 2000) pp. 335–342.
Concha, A. and Civera, J., “DPPTAM: Dense Piecewise Planar Tracking and Mapping from A Monocular Sequence,” In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2015, (IEEE, 2015) pp. 5686–5693.
Yan, Z., Ye, M. and Ren, L., “Dense visual SLAM with probabilistic surfel map,” IEEE Trans Vis Comput Gr 23(11), 2389–2398 (2017).
Muñoz-Salinas, R. and Medina-Carnicer, R., “UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers,” Pattern Recognit 101, 107193 (2020).
Cho, H. M., Jo, H. and Kim, E., “SP-SLAM: Surfel-point simultaneous localization and mapping,” IEEE/ASME Trans Mechatron 27(5), 2568–2579 (2021).
Yunus, R., Li, Y. and Tombari, F., “ManhattanSLAM: Robust Planar Tracking and Mapping Leveraging Mixture of Manhattan Frames,” In: IEEE International Conference on Robotics and Automation (ICRA) 2021, (IEEE, 2021) pp. 6687–6693.
Wang, S., Clark, R., Wen, H. and Trigoni, N., “DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks,” In: IEEE International Conference on Robotics and Automation (ICRA) 2017, (IEEE, 2017) pp. 2043–2050.
Li, R., Wang, S., Long, Z. and Gu, D., “UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning,” In: IEEE International Conference on Robotics and Automation (ICRA) 2018, (IEEE, 2018) pp. 7286–7291.
Yang, N., Wang, R., Stuckler, J. and Cremers, D., “Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry,” In: Proceedings of the European Conference on Computer Vision (ECCV), (Springer, 2018) pp. 835–852.
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P. and Süsstrunk, S., “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans Pattern Anal 34(11), 2274–2281 (2012).
Giraud, R., Ta, V.-T., Bugeau, A., Coupe, P. and Papadakis, N., “SuperPatchMatch: An algorithm for robust correspondences using superpixel patches,” IEEE Trans Image Process 26(8), 4068–4078 (2017).
Barnes, C., Shechtman, E., Finkelstein, A. and Goldman, D. B., “PatchMatch: A randomized correspondence algorithm for structural image editing,” ACM Trans Graphic 28(3), 24 (2009).
Feng, C., Taguchi, Y. and Kamat, V. R., “Fast Plane Extraction in Organized Point Clouds Using Agglomerative Hierarchical Clustering,” In: IEEE International Conference on Robotics and Automation (ICRA) 2014, (IEEE, 2014) pp. 6218–6225.
Besl, P. J. and McKay, N. D., “Method for Registration of 3-D Shapes,” In: Sensor Fusion IV: Control Paradigms and Data Structures 1992, vol. 1611 (SPIE, 1992) pp. 586–606.
Whelan, T., Salas-Moreno, R. F., Glocker, B., Davison, A. J. and Leutenegger, S., “ElasticFusion: Real-time dense SLAM and light source estimation,” Int J Robot Res 35(14), 1697–1716 (2016).
Li, S., Wu, X., Cao, Y. and Zha, H., “Generalizing to the Open World: Deep Visual Odometry With Online Adaptation,” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, (IEEE, 2021) pp. 13184–13193.
Li, S., Xue, F., Wang, X., Yan, Z. and Zha, H., “Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry,” In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2019, (IEEE, 2019) pp. 2851–2860.