1. Introduction
The unmanned aerial vehicle (UAV) has the advantages of light weight, flexibility, and high maneuverability. When equipped with environmental sensing devices such as LiDAR and vision sensors, a UAV can replace workers in complex patrol inspection and similar tasks in underground mines. In this way, the incidence of coal mine safety accidents can be greatly reduced and production efficiency can be improved. Furthermore, the mechanization, automation, informatization, and intelligence of the coal mining industry can be rapidly advanced. However, the underground mine environment is extremely complex: the roadways are narrow and featureless, there is no GPS signal, and lighting conditions are poor. These limitations make it difficult for existing localization methods to achieve accurate positioning in underground mines. The two types of sensors commonly used for localization in general environments are cameras and LiDAR. Therefore, we review vision-based and LiDAR-based localization methods applied in the underground mine environment below.
Vision-based localization
ORB-SLAM [Reference Mur-Artal, Montiel and Tardos1–Reference Campos, Elvira, Rodríguez, Montiel and Tardós3] is currently the most widely used robot localization method based on visual feature point matching. Rogers et al. [Reference Rogers, Gregory, Fink and Stump4] tested the localization accuracy of ORB-SLAM2 in an open-source underground mine environment. However, the lack of image texture features in underground roadways leads to a maximum localization error of ORB-SLAM of up to 30 m. Compared to visual localization algorithms based on feature matching, direct sparse visual odometry exploits pixel gradients rather than image features to estimate the relative pose change and is therefore usually applied in featureless environments. Özaslan et al. [Reference Özaslan, Shen, Mulgaonkar, Michael and Kumar5–Reference Özaslan, Loianno, Keller, Taylor, Kumar, Wozencraft and Hood7] equipped a UAV with an active light and a camera and estimated the axial displacement of the UAV based on Lucas–Kanade optical flow [Reference Shin, Kim, Kang, Lee, Paik, Abidi and Abidi8] during flight in a dark tunnel. The experimental results show that the maximum localization error is up to 40%. This large localization error results from the fact that the active lighting carried by the UAV in the dark tunnel makes it difficult to satisfy the gray-scale invariance assumption of optical flow. To improve the localization accuracy of vision-based methods in underground mines, Jacobson et al. [Reference Jacobson, Zeng, Smith, Boswell, Peynot and Milford9] proposed a semi-supervised SLAM algorithm for pose tracking of mining vehicles. The semi-supervised SLAM established multiple positioning nodes, stored the keyframe images to form an image database, and then used optical flow to estimate the vehicle pose. However, semi-supervised SLAM requires a large amount of image data and manual intervention. With the development of robot intelligence, semantic features [Reference Zhu, Wang, Blum, Liu, Song, Pollefeys and Wang10] are also conducive to navigation and localization. Furthermore, to cope with the sensitivity of single-camera images to lighting changes and motion blur, visual-inertial odometry (VIO) has been applied to robot localization in underground mines. Kramer et al. [Reference Kramer, Kasper and Heckman11] tested the localization accuracy of the existing OKVIS [Reference Leutenegger, Lynen, Bosse, Siegwart and Furgale12] algorithm. Chen et al. [Reference Chen, Henawy, Kocer and Seet13] evaluated the accuracy of the VINS-Mono [Reference Qin, Li and Shen14] state estimator in featureless indoor environments similar to underground mines. Papachristos et al. [Reference Papachristos, Mascarich and Alexis15–Reference Khattak, Papachristos and Alexis17] combined images from an infrared camera with data from an IMU to localize the UAV indoors and proposed the RITIO and KTIO algorithms [Reference Khattak, Papachristos and Alexis18]. Although infrared images can adapt to the harsh lighting of underground mines to some extent, compared with visible-light images the lack of sufficient features still leads to large localization errors. In summary, since the mine is located hundreds of meters underground, where most areas have only weak light and some areas are completely dark, vision-based localization methods cannot capture enough features or satisfy the gray-scale invariance assumption, making it difficult to achieve accurate localization.
LiDAR-based localization
In the visually degraded underground mine environment, LiDAR sensors can directly measure the distance and orientation of target objects, offering a wide sensing range and high accuracy compared to vision sensors. Thus, LiDAR-based methods are more suitable for robot localization and mapping [Reference Fasiolo, Scalera and Maset19]. Meanwhile, LiDAR data are almost unaffected by changes in ambient light, making them suitable for dark coal mine environments. LiDAR sensors can be divided into single-line LiDAR (2D LiDAR) and multi-line LiDAR (3D LiDAR) depending on the number of laser beams. 2D LiDAR, which can only scan in a plane and cannot measure object height, was commonly used for the localization of indoor mobile robots in the early stage. The ATRV mobile robot developed by Bakambu et al. [Reference Bakambu and Polotski20] was equipped with two 2D LiDARs to explore the mine, using line segments scanned by the 2D LiDARs as the basic elements for localization and mapping. In addition, the Groundhog robot developed by Thrun et al. [Reference Thrun, Thayer, Whittaker, Baker, Burgard, Ferguson, Hähnel, Montemerlo, Morris, Omohundro and Reverte21] aligned 2D LiDAR point clouds with a map to estimate the robot pose based on Iterative Closest Point (ICP) matching. Since 2D LiDAR cannot measure spatial information, the above studies can only acquire the x-y translation and heading angle of the mobile robot. Compared to 2D LiDAR, 3D LiDAR can measure the spatial geometric structure of objects. LOAM [Reference Zhang and Singh22] and Lego_LOAM [Reference Shan and Englot23] are two typical 3D point cloud-based localization algorithms. Li et al. [Reference Li, Zhu, You, Wang and Tang24] combined Lego_LOAM with a Normal Distributions Transform (NDT)-based feature matching algorithm to estimate the pose of a mobile robot in a mine; the experimental results showed a maximum error of up to 29%. Papachristos et al. [Reference Papachristos, Khattak, Mascarich and Alexis25] fused LOAM and IMU data using an Extended Kalman Filter (EKF) to estimate the pose of a UAV during flight in a mine. However, they did not quantitatively analyze the localization accuracy since it is difficult to obtain the ground truth of the UAV pose in a real mine environment. Chow et al. [Reference Chow, Kocer, Henawy, Seet, Li, Yau and Pratama26] conducted an experimental analysis of three SLAM methods, Hector_slam [Reference Kohlbrecher, Von Stryk, Meyer and Klingauf27], Gmapping [Reference Grisetti, Stachniss and Burgard28], and Cartographer [Reference Hess, Kohler, Rapp and Andor29], in an indoor environment. The results showed that the trajectory generated by Cartographer fluctuated greatly and that Gmapping failed in some tests. Koval et al. [Reference Koval, Kanellakis and Nikolakopoulos30] used a dataset collected in an underground area with multiple tunnels to experimentally analyze various LiDAR-based positioning algorithms. However, all of the tested algorithms showed large drift along the Z-axis since the underground mine environment had sparse features and only one LiDAR was used. In addition to traditional pose estimation methods, Wang et al. [Reference Wang, Wu, Liu and Wang31] proposed a supervised 3D point cloud learning model, PWCLO-Net, for LiDAR localization.
However, supervised learning methods require a large amount of point cloud data labeled with ground truth for training, and ground truth trajectories in underground mines are hard to obtain. In conclusion, LiDAR sensors have the advantage of accurate ranging and are not affected by illumination. The localization accuracy of LiDAR-based methods can reach the decimeter level in general outdoor scenes. However, in degraded mine environments, the sparse point clouds measured by LiDAR cannot directly distinguish the similar geometric structures of mine roadways, resulting in non-negligible cumulative errors in LiDAR-based localization.
Multi-sensor fusion-based localization
As mentioned above, each type of sensor has inherent defects: visible-light cameras are sensitive to illumination, infrared images have low resolution, and LiDAR point clouds are sparse. Multi-sensor data fusion can effectively combine the advantages of each sensor to better adapt to complex mines. Alexis et al. [Reference Alexis32] proposed a concurrent fusion framework consisting of LOAM, ROVIO, and KTIO to estimate the pose of a UAV in the mine; however, they did not perform a quantitative evaluation of the localization accuracy. Jacobson et al. [Reference Jacobson, Zeng, Smith, Boswell, Peynot and Milford33] fused LiDAR point clouds with IMU data for pose estimation and used visual images for scene recognition, but their method needs extensive human intervention to build maps for localization. In addition, some studies combined active sensors such as RFID, UWB, and WIFI with environmental sensing sensors such as LiDAR and cameras to localize robots in a mine. Lavigne et al. [Reference Lavigne and Marshall34] fused 2D LiDAR, RFID, and absolute optical encoders to construct an underground localization network of passive RFID tags. Nevertheless, this method restricts localization to a two-dimensional plane, which results in large deviations in the robot's pose when the height of the mine changes. Li et al. [Reference Li, Zhu, You and Tang35] proposed a UWB-based underground mine pseudo-GPS system to localize the robot in the mine. Unfortunately, the deployment cost of UWB is high, which limits its practical application in underground positioning. Wang et al. [Reference Wang, Wang, Ding, Liu, Wang, Fan, Bai, Hongbiao and Du36] integrated multi-source data from WIFI, LiDAR, and a camera to estimate the pose of an underground search and rescue robot. However, the intricate tunnel structure of the mine strongly affects signal transmission, making it difficult to obtain the robot's pose in areas where the WIFI signal intensity is low. In summary, the existing multi-sensor fusion-based localization methods are also limited in perceptually degraded mines, and it remains challenging to accurately estimate the UAV pose in a complex mine.
Based on the above review of localization methods for robots/UAVs in mines, vision-based and LiDAR-based methods suffer from localization error accumulation due to the lack of sufficient image features and the highly similar geometric structures. Most current robot/UAV localization methods for underground mines are based on the alignment of image features or geometric features. With the development of robot intelligence, semantic features [Reference Li, Gu, Li, Li, Guo, Gao, Zhao, Yang, Li and Dong37, Reference Ma, Jiang, Ou and Xu38] have become conducive to improving navigation and localization. Inspired by manual inspection, intersections are the critical semantic features for achieving accurate UAV localization in underground mines. Therefore, based on our previous work in [Reference Chen, Feng, Zhao, Wang and Liang39] and [Reference Chen, Feng, Wang and Liang40], this paper proposes a semantic knowledge database-based localization method for UAVs in underground mines. Based on the intersection recognition method proposed in [Reference Chen, Feng, Wang and Liang40], a semantic knowledge database is established by segmenting the intersection point clouds from the pre-built map of the mine. Furthermore, a global pose constraint between the current frame and the semantic knowledge database is constructed by detecting semantic intersection features of the mine in real time during UAV flight. Combined with the relative pose constraints between the keyframes of the LiDAR point cloud, a pose graph model is established for UAV pose optimization.
2. System overview
To realize stable UAV flight over a long distance in underground mines, a UAV pose estimation framework is proposed in this paper based on LiDAR odometry constraint and semantic intersection pose constraint. The framework is shown in Figure 1. The gray part represents the process of relative pose constraint factor construction, and the yellow part represents the process of global pose constraint factor construction.
To construct the relative pose constraint during UAV flight, a keyframe selection procedure is performed first to improve the computational efficiency of pose estimation, using equal-distance-interval and equal-angle-interval selection strategies [Reference Zhou, Koppel and Kaess41, Reference Zhang42]. That is, if the relative translation distance between the new frame and the last keyframe is larger than a set distance threshold, or if the rotation angle between the new frame and the last keyframe is larger than a set angle threshold, the new frame is selected as a keyframe.
Based on the selected keyframes, the geometric features of the point cloud are extracted to estimate the pose change between neighboring keyframes. Inspired by the feature extraction method proposed in LOAM, the curvature of each point in the measured LiDAR point cloud is calculated to extract edge features and plane features. By establishing point-to-line and point-to-plane spatial distance functions based on the extracted edge and plane features between keyframes, the relative pose and local map can be estimated. Furthermore, the relative pose constraint factor is constructed taking into account the measurement noise during UAV flight.
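As an illustration of this curvature-based feature extraction, the following sketch classifies the points of one scan line into edge and plane candidates using a LOAM-style smoothness measure; the neighborhood size and thresholds are illustrative assumptions, not the values used in this paper.

```python
# Illustrative LOAM-style smoothness computation for one LiDAR scan line.
# Points with high curvature are edge candidates; points with low curvature are
# plane candidates. Neighborhood size k and thresholds are assumed values.
import numpy as np

def classify_scan_line(points, k=5, edge_thresh=0.2, plane_thresh=0.02):
    """points: (N, 3) array of one scan line, ordered by azimuth angle."""
    edge_idx, plane_idx = [], []
    for i in range(k, len(points) - k):
        neighbors = points[i - k:i + k + 1]
        diff = np.sum(neighbors - points[i], axis=0)          # sum of (p_j - p_i)
        c = np.linalg.norm(diff) / (2 * k * np.linalg.norm(points[i]) + 1e-9)
        if c > edge_thresh:
            edge_idx.append(i)        # sharp edge feature
        elif c < plane_thresh:
            plane_idx.append(i)       # planar surface feature
    return edge_idx, plane_idx
```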
As stated in [Reference Chalvatzaras, Pratikakis and Amanatiadis43, Reference Liu, Di and Xu44], map-based localization algorithms are currently considered the most accurate ones. In this paper, the intersection semantic knowledge database is established based on the pre-built point cloud map. Such pre-built point cloud maps are often referred to as High Definition (HD) Maps, whose resolution can reach centimeter-level accuracy. The intersection pose constraint factor building process is shown in the yellow part of Figure 1, and the green part of Figure 1 shows the building procedure of the mine intersection semantic knowledge database. The pre-built point cloud map of the underground mine is the basis of the semantic knowledge database and can be reconstructed using our previous work [Reference Chen, Feng, Zhao, Wang and Liang39]. Based on the pre-built point cloud map, the dense point cloud of each intersection is segmented. Accordingly, the location of each intersection, as well as the Polar Wall Contour Vector (PWCV) and Cylinder Wall Contour (CWC) descriptors, is generated to build the intersection semantic knowledge database. Then, the semantic features of the sampled LiDAR point cloud are detected in real time during UAV flight. A candidate intersection is selected by comparing the intersection similarity between the current keyframe and the intersections in the semantic knowledge database. Next, the ICP algorithm is applied to register the detected intersection with the corresponding intersection in the mine intersection semantic knowledge database. If the ICP distance after registration is below the minimum ICP distance threshold, the intersection pose factor is constructed. Combining the relative pose constraint factor and the intersection pose constraint factor, the pose factor graph model is established for UAV pose estimation.
3. Intersection-based factor graph model
The factor graph model is a Bayesian network-based graph optimization model, which was proposed by Kschischang et al. [Reference Kschischang, Frey and Loeliger45] and is suitable for modeling complex state estimation problems in SLAM and Structure From Motion (SFM). The factor graph is an undirected graph that contains two types of nodes, namely variable nodes and factor nodes. The variable nodes represent the unknown variables to be estimated, such as the UAV poses during flight. The factor nodes represent probabilistic constraint relationships between variable nodes, which can be obtained from sensor measurements or prior knowledge. The factor graph can handle back-end optimization problems incrementally: when new edges and nodes are added to the graph model, only the nodes connected to the newly added nodes need to be re-optimized. Therefore, this paper proposes an intersection-based factor graph model for UAV pose estimation.
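To make the idea concrete, the following minimal sketch (not the paper's implementation) optimizes three one-dimensional poses connected by two relative "odometry" factors and one absolute "intersection" factor; the measurement values and noise levels are invented for illustration.

```python
# Toy factor graph: variable nodes are three 1-D poses; factor nodes contribute
# noise-weighted residuals that are minimized jointly, mirroring Eq. (7).
import numpy as np
from scipy.optimize import least_squares

odom = [1.0, 1.0]            # relative displacements x1->x2 and x2->x3 (odometry factors)
inter_obs = 2.2              # absolute position of x3 observed at a known intersection
sigma_odom, sigma_inter = 0.1, 0.05   # assumed measurement standard deviations

def residuals(x):
    return np.array([
        (x[1] - x[0] - odom[0]) / sigma_odom,    # relative pose factor
        (x[2] - x[1] - odom[1]) / sigma_odom,    # relative pose factor
        (x[2] - inter_obs) / sigma_inter,        # intersection (absolute) factor
    ])

x0 = np.array([0.0, 1.0, 2.0])                   # initial guess from dead reckoning
sol = least_squares(residuals, x0)
print(sol.x)    # poses are pulled toward the absolute intersection observation
```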
The proposed UAV pose factor graph model is shown in Figure 2. The UAV pose $\boldsymbol{x}_{\boldsymbol{i}}$ is the variable node to be estimated. $I_k$ is the position of the intersection. The gray box represents the relative pose constraint factor between LiDAR keyframes. The yellow box represents the intersection pose constraint factor between the current keyframe and the observed intersection. The pose variable of the graph factor model to be optimized is denoted as:
where $\boldsymbol{\chi }$ is the set of pose variables to be estimated. $\boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}}=(t_x,t_y,t_z)$ is the x-y-z translation of the UAV in the world coordinate system. $\boldsymbol{\theta}_{\boldsymbol{i}}^{\boldsymbol{W}}=(\theta _x,\theta _y,\theta _z)$ represents the roll-pitch-yaw rotation of the UAV in the world coordinate system. According to Bayes' rule, the joint probability expression of $\boldsymbol{\chi }$ is
where $\boldsymbol{Z}$ represents the set of observations. The first term $P({I_k} |{\boldsymbol{x}_{\boldsymbol{i}}})$ is the likelihood, which represents the probability of obtaining the intersection observation $I_k$ at UAV pose $\boldsymbol{x}_{\boldsymbol{i}}$ . The second term $P(\boldsymbol{x}_{\textbf{i}}|{\boldsymbol{x}_{\boldsymbol{i}\boldsymbol-\textbf{1}}})$ is the prior probability, which represents the relationship between poses ${\boldsymbol{x}_{\boldsymbol{i}\boldsymbol-\textbf{1}}}$ and $\boldsymbol{x}_{\boldsymbol{i}}$ . More generally, Eq. (2) can be represented by the product of the relative pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{L}} (\boldsymbol{\chi}_{\boldsymbol{i}})$ and the intersection pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{I}} (\boldsymbol{\chi}_{\boldsymbol{i}})$ as:
where $\boldsymbol{\chi}_{\boldsymbol{i}}$ denotes the set of variables $\boldsymbol{x}_{\boldsymbol{i}}$ associated with the factor $\boldsymbol{\psi}_{\boldsymbol{i}}$ . In the following, we use the unified notation $\boldsymbol{\psi}_{\boldsymbol{i}}$ to represent either of the factors $\boldsymbol{\psi}_{\boldsymbol{L}}$ and $\boldsymbol{\psi}_{\boldsymbol{I}}$ . Each factor $\boldsymbol{\psi}_{\boldsymbol{i}}$ is a function of the pose variables in $\boldsymbol{\chi}_{\boldsymbol{i}}$ .
Based on the above derivation, solving the UAV pose factor graph model amounts to finding the optimal estimate of the variables $\boldsymbol{\chi }$ given the prior relative poses and the known observations. It can then be converted into a maximum likelihood problem as:
With the assumption that the measurement noise $\boldsymbol{\varpi}_{\boldsymbol{i}}$ follows the Gaussian distribution [Reference Thrun46], the maximum likelihood problem can be transformed into a simpler form. The general form of the measurement function is
where the noise term $\boldsymbol{\varpi}_{\boldsymbol{i}}$ follows the distribution $\boldsymbol{\varpi}_{\boldsymbol{i}} \sim \textrm{N}(0,\boldsymbol{\Omega}_{\boldsymbol{i}})$ . Then, the probability density function expansion of the corresponding constraint factor $\boldsymbol{\psi}_{\boldsymbol{i}}$ takes the form:
where $\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ denotes the residual function of the $i$ th constraint factor ${\boldsymbol{\psi}_{\boldsymbol{i}}} (\boldsymbol{\chi}_{\boldsymbol{i}})$ . $\left \|{\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )} \right \|_{\boldsymbol{\Omega}_{\boldsymbol{i}}}^2 = \boldsymbol{e}_{\boldsymbol{i}}{\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )^T}\boldsymbol{\Omega}_{\boldsymbol{i}}^{ - 1}\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ is defined as the square of the Mahalanobis distance with covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{i}}$ . Substituting Eq. (6) into Eq. (4), the maximum likelihood estimation of the pose variable $\boldsymbol{\chi }$ is equivalent to minimizing the negative logarithm of the constraint factor. That is, the objective function for the optimization of the UAV flight pose $\boldsymbol{\chi } _{MLE}^ *$ can be established as:
where $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ is the residual function of the LiDAR relative pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ . $\boldsymbol{\Omega}_{\boldsymbol{L}}$ is the covariance matrix of the LiDAR relative pose estimation, which determines the weighting coefficient of the residual function $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ in the UAV pose objective function. $\boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ is the residual function of the intersection pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ . $\boldsymbol{\Omega}_{\boldsymbol{I}}$ is the covariance matrix of the intersection observation, which determines the weighting coefficient of the residual function $\boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ in the UAV pose objective function. The following subsections separately describe the modeling of the LiDAR relative pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ and the intersection pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ in detail.
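For readability, the objective implied by the definitions above can be written compactly as follows (a reconstruction using the notation already introduced; the exact typesetting of Eq. (7) may differ):

\begin{equation*} \boldsymbol{\chi }_{MLE}^{*} = \mathop{\arg \min }\limits_{\boldsymbol{\chi }} \left \{ \sum \limits _{i} \left \| \boldsymbol{e}_{\boldsymbol{L}}\left ( \boldsymbol{\chi}_{\boldsymbol{i}} \right ) \right \| _{\boldsymbol{\Omega}_{\boldsymbol{L}}}^{2} + \sum \limits _{i} \left \| \boldsymbol{e}_{\boldsymbol{I}}\left ( \boldsymbol{\chi}_{\boldsymbol{i}} \right ) \right \| _{\boldsymbol{\Omega}_{\boldsymbol{I}}}^{2} \right \} \end{equation*}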
3.1. LiDAR relative pose constraint factor
The LiDAR scans at 10 Hz, and consecutive scans contain redundant information, so LiDAR keyframes are selected to build the UAV factor graph model. An equal-interval relative pose change sampling criterion is applied to ensure a uniform distribution of the LiDAR keyframes. That is, the first scanned LiDAR frame is selected as the first keyframe point cloud. Then, each newly scanned frame is aligned with the previous keyframe to calculate the relative pose change $\boldsymbol{\Delta} \boldsymbol{T}$ :
where $\boldsymbol{\Delta} \boldsymbol{R}$ is the relative rotation matrix and $\boldsymbol{\Delta} \boldsymbol{t}$ is the relative translation vector. Based on $\boldsymbol{\Delta} \boldsymbol{R}$ and $\boldsymbol{\Delta} \boldsymbol{t}$ , the translation distance $||\boldsymbol{\Delta} \boldsymbol{t}||_2$ and the relative rotation angle $\Delta \theta$ between the current frame and the previous keyframe can be calculated by:
If the relative distance $||\boldsymbol{\Delta} \boldsymbol{t}||_2$ is larger than the set minimum translation distance threshold $\varepsilon _t$ or the relative rotation angle $\Delta \theta$ is larger than the set minimum rotation angle threshold $\varepsilon _\theta$ , the current frame can be selected as a new keyframe. Then, the new keyframe can be added to the factor graph model as a new node.
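A minimal sketch of this keyframe selection test is given below; the threshold values are placeholders rather than the ones used in the experiments.

```python
# Keyframe selection sketch: a frame becomes a keyframe when its translation or
# rotation relative to the last keyframe exceeds the thresholds eps_t / eps_theta.
import numpy as np

def is_new_keyframe(delta_T, eps_t=0.5, eps_theta=np.deg2rad(10.0)):
    """delta_T: 4x4 relative transform between the current frame and the last keyframe."""
    delta_R, delta_t = delta_T[:3, :3], delta_T[:3, 3]
    dist = np.linalg.norm(delta_t)                              # ||delta_t||_2
    # rotation angle recovered from the trace of delta_R (axis-angle magnitude)
    cos_theta = np.clip((np.trace(delta_R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_theta)                                # delta_theta
    return dist > eps_t or angle > eps_theta
```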
Once a new keyframe is added to the factor graph model, the relative pose constraint between the new and previous keyframe needs to be constructed. For two keyframe point clouds $i$ and $j$ , the relative pose transformation matrix $T_i^j$ can be calculated by registering the spatial geometry features. The relative pose transformation matrix $T_i^j$ satisfies the following equation:
where $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ is the transformation matrix from the $i$ th keyframe to the world coordinate system and $\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}$ is the transformation matrix from the $j$ th keyframe to the world coordinate system.
Due to the accumulated error caused by the environmental perception deviation during UAV flight, Eq. (10) will not be strictly valid. Then, the residual function $\boldsymbol{e}_{\boldsymbol{L}}$ between the $i$ th keyframe and $j$ th keyframe can be denoted as:
In the above residual function, there are two pose variables to be optimized: the pose $\boldsymbol{x}_{\boldsymbol{i}}$ of the $i$ th keyframe and the pose $\boldsymbol{x}_{\boldsymbol{j}}$ of the $j$ th keyframe. Therefore, it is required to find the derivative of $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to the variables $\boldsymbol{x}_{\boldsymbol{i}}$ and $\boldsymbol{x}_{\boldsymbol{j}}$ , which is equivalent to finding the derivative of ${\boldsymbol{e}_{\boldsymbol{L}}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}$ . However, the transformation matrices $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}$ are not closed under addition and subtraction. Thus, the Lie group and Lie algebra [Reference Gao, Zhang, Gao and Zhang47] are introduced to perform the differentiation. The transformation matrix $\boldsymbol{T}$ belongs to the special Euclidean group $SE(3)$ , whose corresponding Lie algebra is $se(3)$ . In the Lie algebra, each pose is denoted by $\boldsymbol{\xi } = \left [ \begin{array}{l}\boldsymbol{\rho } \\\boldsymbol{\phi } \end{array} \right ]$ , where $\boldsymbol{\xi }$ is a six-dimensional vector: the first three dimensions $\boldsymbol{\rho }$ denote the translation and the last three dimensions $\boldsymbol{\phi }$ denote the rotation.
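For concreteness, under the convention that $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{j}}$ maps coordinates from keyframe $i$ to keyframe $j$ (so that Eq. (10) reads $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}} = \boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{j}}$), a standard form of such a residual is

\begin{equation*} \boldsymbol{e}_{\boldsymbol{L}}\left ( \boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}} \right ) = \ln \left ( \left ( \boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{j}} \right )^{-1} \left ( \boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}} \right )^{-1} \boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}} \right )^{\vee} \end{equation*}

which vanishes exactly when Eq. (10) holds; the exact expression used in Eq. (11) may order the terms differently, but the perturbation analysis below applies in the same way.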
According to the transformation relation between Lie groups and Lie algebras, the Lie algebra form of the pose $\boldsymbol{x}_{\boldsymbol{i}}$ of the $i$ th keyframe is denoted as $\boldsymbol{\xi}_{\boldsymbol{i}}$ , and the Lie algebra form between $i$ th and $j$ th keyframe is denoted as $\boldsymbol{\xi}_{\boldsymbol{ij}}$ . By adding the left perturbation terms ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{i}}$ and ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{j}}$ to the poses $\boldsymbol{\xi}_{\boldsymbol{i}}$ and $\boldsymbol{\xi}_{\boldsymbol{j}}$ , the residual function $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ is changed to the following form:
Based on the derivation rule of Lie algebra, the derivatives of the residual function $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to $\boldsymbol{\xi}_{\boldsymbol{i}}$ and $\boldsymbol{\xi}_{\boldsymbol{j}}$ are equivalent to the derivatives of the residual function $\hat{\boldsymbol{e}}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to the left perturbation terms ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{i}}$ and ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{j}}$ :
where $\boldsymbol{J}_{\boldsymbol{r}}$ is the right Jacobian matrix and $Ad(\cdot )$ represents the adjoint matrix.
Furthermore, it is also necessary to estimate the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{L}}$ of the relative pose factors. The inverse of the covariance matrix $(\boldsymbol{\Omega}_{\boldsymbol{L}})^{-1}$ is alternatively called the information matrix, which reflects the weight of residuals of each factor in the factor graph model. The covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{L}}$ between $i$ th and $j$ th keyframes can be constructed based on the uncertainty of the matched geometric feature points. Supposing that $\boldsymbol{{}^{\boldsymbol{i}}\boldsymbol{p}_{\boldsymbol{f}}}$ and $\boldsymbol{{}^{\boldsymbol{j}}\boldsymbol{p}_{\boldsymbol{f}}}$ are a pair of feature points between the $i$ th and $j$ th keyframes, the two feature points satisfy the following projection relationship:
Based on the above feature points, the covariance matrix can be estimated by calculating the sum of all matched feature point pairs in the two keyframes:
where $N_f$ is the total number of matched feature point pairs. $\boldsymbol{\Lambda}_{\boldsymbol{L}}$ is the zero-mean Gaussian noise matrix of the LiDAR. $\boldsymbol{H}$ is the Jacobian matrix, which is defined as:
The noise matrix $\boldsymbol{\Lambda}_{\boldsymbol{L}}$ is defined as:
where $\sigma _x$ , $\sigma _y$ , and $\sigma _z$ are the measurement noise along the $X$ , $Y$ , and $Z$ axes of the LiDAR sensor, in meters. By setting the covariance matrix at the initial pose to the zero matrix $\boldsymbol{0_{6 \times 6}}$ , the covariance matrix corresponding to the $k$ th keyframe can be iterated according to the following equation as the LiDAR odometry accumulates:
where $\boldsymbol{H}_{\boldsymbol{k,k}\textbf{-}\textbf{1}}$ is the Jacobian matrix between the $k$ th keyframe and the $(k-1)$ th keyframe.
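The following sketch illustrates such an iterative covariance accumulation along the odometry chain. The exact form of Eq. (18) is not reproduced here; the sketch assumes the common first-order propagation $\boldsymbol{\Omega}_{k} = \boldsymbol{H}_{k,k-1}\boldsymbol{\Omega}_{k-1}\boldsymbol{H}_{k,k-1}^{T} + \boldsymbol{\Omega}_{L,k}$, which may differ from the paper's equation.

```python
# Illustrative first-order covariance accumulation along the LiDAR odometry chain.
# relative_covs[k] and jacobians[k] are the 6x6 covariance and Jacobian of the
# relative pose between keyframes k and k+1 (an assumed propagation form).
import numpy as np

def accumulate_covariance(relative_covs, jacobians):
    omega = np.zeros((6, 6))              # covariance at the initial pose
    history = [omega]
    for H, omega_rel in zip(jacobians, relative_covs):
        omega = H @ omega @ H.T + omega_rel
        history.append(omega)
    return history                        # covariance of every keyframe pose
```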
3.2. Intersection pose constraint factor
Since most areas of underground roadways are long and narrow corridors with high similarity of geometric structures, the point cloud scanned by LiDAR in such degraded scenarios lacks adequate discriminative features, resulting in increasing accumulated positioning errors. Therefore, the intersection pose constraint factor is added to the factor graph model to provide reliable pose constraint for UAV pose optimization.
Based on our previous work in [Reference Chen, Feng, Wang and Liang40], we first construct an intersection semantic knowledge database from the pre-built point cloud map of the underground mine. The constructed semantic knowledge database contains the geometric invariant point location $\boldsymbol{C}^{\boldsymbol{I}_{\boldsymbol{i}}}$ , the dense point cloud $\boldsymbol{P}^{\boldsymbol{I}_{\boldsymbol{i}}}$ , the PWCV descriptor $\boldsymbol{V}_{\boldsymbol{d}}^{\boldsymbol{I}_{\boldsymbol{i}}}$ , and the CWC descriptor ${\boldsymbol{M}_{\boldsymbol{d}}^{\boldsymbol{I}_{\boldsymbol{i}}}}$ of each intersection. While the UAV is inspecting the underground mine, the process of constructing the pose constraint between the current keyframe and the intersections in the semantic knowledge database is shown in Figure 3. The detailed steps are as follows.
1. The keyframe point cloud is preprocessed first. The key steps of preprocessing include point cloud filtering, mine ground plane detection, and wall point cloud segmentation.
2. Based on the segmented wall point cloud, the topology feature detection is performed to identify whether the current keyframe is an intersection or not. If the current keyframe is not an intersection, it returns to the first step. If the current keyframe is an intersection, then the geometrical invariant point location of the intersection is computed.
3. Centering at the geometrical invariant point within radius $R_{pwcv}$ , the intersection point cloud is encoded as a PWCV descriptor $\boldsymbol{V}_{\boldsymbol{curr}}$ . Then, the similarity between the current PWCV descriptor $\boldsymbol{V}_{\boldsymbol{curr}}$ and each PWCV descriptor in the semantic knowledge database is calculated one by one.
4. According to the calculated similarities, the candidate intersections with high similarity are selected. By setting a minimum PWCV similarity threshold $\varepsilon _{pwcv}$ , the intersections with similarity greater than $\varepsilon _{pwcv}$ form the candidate intersection subset.
5. The matched intersection is determined by the size of the candidate intersection subset, $N_{inter}$ . If $N_{inter}$ is equal to 0, the current keyframe is not a real intersection; the subsequent processing is stopped, and the process returns to the first step to wait for the next keyframe. If $N_{inter}$ is equal to 1, the unique intersection in the candidate subset is the matched intersection, and the process proceeds directly to the consecutive recognition check in step 7. If $N_{inter}$ is larger than 1, the current keyframe is similar to multiple intersections in the semantic knowledge database, and it is difficult to identify the matched intersection using only the PWCV similarity results. It is then necessary to further compare the similarity of CWC descriptors between the current keyframe and the intersections in the semantic knowledge database.
6. The CWC descriptor of the current keyframe is generated to select the final matched intersection. The CWC is a cylinder wall contour descriptor, which is generated by adding height feature encoding on the basis of the PWCV descriptor. A single-frame LiDAR point cloud is sparse and contains limited spatial information. Therefore, before generating the CWC descriptor of the current keyframe, the nearest $N_{his}$ neighboring keyframes are transformed into the current keyframe to stack a dense point cloud with richer spatial features. The dense point cloud is stacked by:
(20) \begin{equation} \boldsymbol{P}_{\boldsymbol{dense}} = \boldsymbol{P}_{\boldsymbol{curr}} + \sum \limits _{i = 1}^{{N_{his}}} \boldsymbol{T}_{\boldsymbol{hi}{\boldsymbol{s}_{\boldsymbol{i}}}}^{\boldsymbol{curr}}{\boldsymbol{P}_{\boldsymbol{hi}{\boldsymbol{s}_{\boldsymbol{i}}}}} \end{equation}

where $\boldsymbol{P}_{\boldsymbol{dense}}$ is the dense point cloud after stacking, $\boldsymbol{P}_{\boldsymbol{curr}}$ is the point cloud of the current keyframe, $\boldsymbol{P}_{\boldsymbol{hi}{\boldsymbol{s}}_{\boldsymbol{i}}}$ is the $i$ th nearest-neighbor keyframe point cloud, and $\boldsymbol{T}_{\boldsymbol{his}_{\boldsymbol{i}}}^{curr}$ is the transformation matrix from the $i$ th nearest-neighbor keyframe to the current keyframe. Based on the stacked dense point cloud $\boldsymbol{P}_{\boldsymbol{dense}}$ , the CWC descriptor $\boldsymbol{M}_{\boldsymbol{curr}}$ is generated. Meanwhile, its similarity with the CWC descriptors of the intersections in the candidate subset is calculated one by one. The intersection with the highest similarity is selected as the final matched intersection.
7. To avoid false intersection recognition, the number of consecutive recognitions of the same intersection is counted. When the UAV flies through an intersection, the intersection should be recognized by multiple keyframes. Initially, the number of keyframes $N_{same}$ that have consecutively recognized the same intersection is set to 0. If the matched intersection ID of the current keyframe is the same as that of the previous keyframe, $N_{same}$ is increased by 1. If the matched intersection ID of the current keyframe differs from that of the previous keyframe, $N_{same}$ is reset to 0 and the subsequent steps are terminated; the process then returns to the first step and waits for the next keyframe.
8. Whether the same intersection has been stably detected is judged based on the number of consecutive recognitions $N_{same}$ . While $N_{same}$ is smaller than the set minimum number of consecutive recognitions $\varepsilon _{same}$ , the subsequent steps are skipped until $N_{same}$ reaches $\varepsilon _{same}$ . Then, the process proceeds to the next step, and $N_{same}$ is reset to 0.
9. The ICP distance $d_{icp}$ between the current keyframe and the matched intersection is used for distance verification. If the ICP distance $d_{icp}$ is larger than the minimum distance threshold $\varepsilon _{icp}$ , the match is considered wrong and the intersection pose constraint is not added to the factor graph model. If the ICP distance $d_{icp}$ is smaller than $\varepsilon _{icp}$ , the match is considered correct. Then, the intersection pose constraint between the current keyframe and the matched intersection in the semantic knowledge database is added to the factor graph model for UAV pose optimization. The overall decision logic of steps 4-9 is sketched in the code after this list.
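A compact sketch of the decision logic in steps 4-9 is given below. It operates on similarity scores and an ICP distance that are assumed to be computed elsewhere (PWCV/CWC generation and registration are omitted); all function names, the state dictionary, and the threshold values are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical intersection-matching decision logic (steps 4-9).
# pwcv_sims / cwc_sims: dict mapping intersection ID -> similarity in [0, 1];
# icp_dist: ICP distance between the current keyframe and the matched intersection.
def select_intersection(pwcv_sims, cwc_sims, icp_dist, state,
                        eps_pwcv=0.8, eps_same=3, eps_icp=0.5):
    candidates = [i for i, s in pwcv_sims.items() if s > eps_pwcv]     # step 4
    if not candidates:                                                  # step 5: no match
        state["n_same"], state["last_id"] = 0, None
        return None
    if len(candidates) == 1:
        matched = candidates[0]
    else:                                                               # step 6: CWC check
        matched = max(candidates, key=lambda i: cwc_sims[i])
    # step 7: count consecutive recognitions of the same intersection ID
    state["n_same"] = state["n_same"] + 1 if matched == state.get("last_id") else 0
    state["last_id"] = matched
    if state["n_same"] < eps_same:                                      # step 8: not stable yet
        return None
    state["n_same"] = 0
    if icp_dist > eps_icp:                                              # step 9: reject wrong match
        return None
    return matched   # an intersection pose constraint can be added for this ID

# usage: state = {"n_same": 0, "last_id": None}; call once per intersection-like keyframe
```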
According to the above procedures, the intersection pose constraint can be established. When the UAV detects a stable intersection at pose $\boldsymbol{x}_{\boldsymbol{i}}$ , the spatial pose relationship among the UAV pose (the LiDAR coordinate system $O_LX_LY_LZ_L$ ), the world coordinate system $O_WX_WY_WZ_W$ , and the intersection coordinate system $O_IX_IY_IZ_I$ is shown in Figure 4.
As shown in Figure 4, $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{W}}$ , displayed as the orange dashed line, is the 3D vector of the observed intersection in the world coordinate system at pose $\boldsymbol{x}_{\boldsymbol{i}}$ . $\boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}$ are displayed as the purple dashed lines: $\boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}}$ represents the 3D position of the LiDAR sensor in the world coordinate system, and $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}$ represents the 3D vector of the observed intersection in the current LiDAR coordinate system. Therefore, the observation $\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}}$ of the intersection in the world coordinate system can be denoted as:
where $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ is the rotation matrix of the LiDAR coordinate system to the world coordinate system.
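A minimal numpy sketch of this observation composition, and of the residual against the database prior used below, is shown here; the variable names follow the text, but the exact composition is an assumption based on the coordinate frames in Figure 4.

```python
# Intersection observation in the world frame and the resulting residual (a sketch).
import numpy as np

def intersection_residual(R_i_W, t_i_W, t_I_i, t_I_W_prior):
    """R_i_W, t_i_W: rotation and translation of the LiDAR (UAV) in the world frame.
    t_I_i: intersection position observed in the current LiDAR frame.
    t_I_W_prior: intersection location stored in the semantic knowledge database."""
    z_I_W = R_i_W @ t_I_i + t_i_W     # observed intersection expressed in the world frame
    return z_I_W - t_I_W_prior        # residual e_I driving the intersection factor
```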
The error between the intersection observation $\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}}$ at pose $\boldsymbol{x}_{\boldsymbol{i}}$ and the prior intersection location $\overset{\frown}{\boldsymbol{t}}_{\boldsymbol{I}}^{\boldsymbol{W}}$ provided by the intersection semantic knowledge database constitutes the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ of the current intersection pose constraint factor:
where the variables to be optimized are as follows:
Similarly, since the rotation matrix $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ is not closed under addition and subtraction, the derivative of the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ with respect to $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ cannot be computed directly. It is required to convert the rotation matrix $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ into its Lie algebra form for differentiation. The rotation matrix $\boldsymbol{R}$ belongs to the special orthogonal group SO(3), and the Lie algebra form of SO(3) is $\boldsymbol{\phi }$ .
Thus, the Lie algebra form of the rotation matrix $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ in the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ (Eq. (22)) is defined as $\boldsymbol{\phi}_{\boldsymbol{i}}$ . Then, the derivative of the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ with respect to $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ can be denoted as:
By introducing the left perturbation terms ${\boldsymbol\delta\boldsymbol\phi}_{\boldsymbol{i}}$ , the derivative of the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{x}_{\boldsymbol{i}}} \right )$ with respect to the rotation $\boldsymbol{\phi}_{\boldsymbol{i}}$ is equal to:
Expanding $\exp \left ({{{\left ( \boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}} \right )}^ \wedge }} \right )$ in Eq. (25) with a Taylor expansion, the derivative $\frac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{{\phi _{\boldsymbol{i}}}}}}$ can be simplified as:
In addition, according to Eq. (22), the derivatives of the residual function $\boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})$ with respect to the optimization variables $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}$ are as follows:
To construct the intersection pose constraint factor, it is also important to analyze the uncertainty of the intersection observation by estimating the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{I}}$ . Based on the registered points correspondence between the current keyframe and the matched intersection in the semantic knowledge database, the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{I}}$ can be calculated. Assuming $\boldsymbol{{}^{\boldsymbol{W}}{\boldsymbol{p}}_{{\boldsymbol{I}}_{\boldsymbol{f}}}}$ and $\boldsymbol{^{\boldsymbol{i}}\boldsymbol{p}_{\boldsymbol{I}_{\boldsymbol{f}}}}$ are a pair of registered feature points in the semantic knowledge database and the $i$ th keyframe, they will obey the following projection function:
Then, the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{I}}$ can be estimated by summing over all matched feature point pairs between the $i$ th keyframe and the intersection in the semantic knowledge database:
where $N_{I_f}$ is the total number of matched point pairs. $\boldsymbol{\Lambda}_{\boldsymbol{I}}$ is the zero-mean Gaussian noise matrix of the matched intersection. Based on Eq. (29), the Jacobian matrix $\boldsymbol{Q}$ is defined as:
Based on the PWCV descriptor generation process proposed in [Reference Chen, Feng, Wang and Liang40], the nonzero portion of the wall contour component of the final PWCV descriptor is tightly correlated with the relative pose between the current UAV and the detected intersection geometric invariant point. Therefore, the completeness of the wall contour component of the PWCV descriptor is used to estimate the noise matrix $\boldsymbol{\Lambda}_{\boldsymbol{I}}$ . Figure 5(a) shows the PWCV descriptor of an example intersection in the semantic database, where the number of nonzero dimensions of the wall contour component is $k_b$ . Figure 5(b) shows the PWCV descriptor of the detected intersection during UAV flight, where the number of nonzero dimensions of the wall contour component is $k_c$ . The noise matrix $\boldsymbol{\Lambda}_{\boldsymbol{I}}$ is defined as:
Thus, the relative pose constraint factor residual function (Eq. (11)) and its covariance matrix (Eq. (16)), as well as the intersection pose constraint factor residual function (Eq. (22)) and its covariance matrix (Eq. (30)), have been established. By substituting the residual functions and their covariance matrices into the UAV pose factor graph model (Eq. (7)), the UAV pose can be optimized through nonlinear iterations based on the derived derivatives (Eqs. (13), (14), (26), and (27)).
4. Experiments and results
In this section, two UAV localization tests are conducted in the large-scale Edgar Mine and in a mine-like indoor corridor environment. In both experiments, we compare the localization accuracy of the proposed method with two typical LiDAR-based localization methods, LOAM and LOAM + Scan Context. LOAM is an open-loop laser odometry localization method. LOAM + Scan Context adds the Scan Context global point cloud descriptor to LOAM for scene recognition; loop constraints based on Scan Context are added to correct the cumulative localization error during UAV flight after the same scene is recognized.
4.1. UAV localization in Edgar Mine
The first UAV localization experiment is conducted in the simulated Edgar Mine, which is developed by importing the Edgar Mine environment model file from DARPA [Reference Rogers, Gregory, Fink and Stump48] into the ROS Gazebo platform. As shown in Figure 5 (a), the total length of Edgar Mine is 2.8 km, and it consists of a number of narrow roadways and several intersections. A DJI UAV platform equipped with a Velodyne 16 LiDAR and a Realsense D435 camera is added to ROS Gazebo for mine exploration, as shown in Figure 5 (b).
First, the UAV is controlled manually to fly slowly along the Edgar Mine tunnels for one loop to record the LiDAR data and RGB-D image data. The Edgar Mine point cloud map is reconstructed from the obtained sensor data based on our previous work [Reference Chen, Feng, Zhao, Wang and Liang39]. Furthermore, the semantic topology information is segmented from the pre-built map to construct a semantic knowledge database for the Edgar Mine. Finally, a long-distance flight trajectory is recorded for comparing UAV flight localization accuracy. The trajectory is shown in Figure 6, and its total length is 5.68 km. The LiDAR point cloud data and the ground truth pose are recorded during the UAV flight. In particular, the ground truth pose is obtained by a high-precision IMU sensor without zero drift.
4.1.1. Semantic knowledge database construction
With reference to the point cloud fusion framework [Reference Chen, Feng, Zhao, Wang and Liang39], the LiDAR data and depth images are fused based on the Bayesian Kriging model to reconstruct a single-frame high-precision dense point cloud of the mine roadway. Furthermore, ISS3D key points and FPFH descriptors are extracted from the reconstructed single-frame dense point cloud for coarse spatial feature matching and ICP registration. The reconstructed point cloud map of Edgar Mine is shown in Figure 7.
The roadway network of Edgar Mine is complicated and contains several loops and intersections (the ID of each intersection is shown in the yellow boxes in Figure 7). To construct the intersection semantic knowledge database, the dense point cloud of each intersection is first segmented from the reconstructed map. Then, the intersection type, geometric invariant point location, PWCV descriptor, and CWC descriptor are generated for each intersection based on our previous work [Reference Chen, Feng, Wang and Liang40]. The constructed intersection semantic knowledge database is shown in Figure 8.
4.1.2. UAV localization accuracy analysis
Figure 9 shows the UAV pose trajectories estimated by the competing methods and the proposed localization method. Comparing the trajectories plotted in Figure 9, it can be found that the trajectory estimated by LOAM gradually deviates from the ground truth trajectory. This is because the LOAM localization algorithm relies only on the registration of neighboring point clouds for pose estimation, which results in error accumulation during long-distance flights in underground mines with similar geometry. LOAM + Scan Context shows a large localization deviation because the similar mine roadways cause Scan Context to recognize scenes incorrectly. In contrast, the localization method proposed in this paper can accurately recognize different intersections without false scene recognition. Therefore, even when a prior map is not available, the accumulated error at a revisited intersection can be eliminated, reducing the error of the whole trajectory. Furthermore, by establishing the semantic knowledge database, the intersection pose constraint between the current frame and the intersection in the semantic knowledge database can be added. Once a stable intersection is detected during flight, the localization error accumulated between the last detected intersection and the current intersection can be eliminated immediately.
To quantitatively analyze the localization error of each method [Reference Zhang and Scaramuzza49], the maximum error (MAE), root mean square error (RMSE), and relative error percentage (REP) are used to evaluate the localization accuracy. REP-1, REP-2, REP-3, REP-4, and REP-5 stand for the relative error evaluated at 20%, 40%, 60%, 80%, and 100% of the trajectory length. The localization errors are listed in Table I.
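The metrics can be computed from time-aligned estimated and ground truth trajectories as sketched below; MAE and RMSE follow their usual definitions over per-pose translation errors, while the REP computation here (error at a fraction of the trajectory divided by the distance traveled) is an assumption about the definition used in Table I.

```python
# Sketch of the evaluation metrics: MAE (maximum error), RMSE, and REP at
# 20/40/60/80/100 % of the trajectory length (REP definition assumed).
import numpy as np

def localization_metrics(est, gt):
    """est, gt: (N, 3) time-aligned estimated and ground truth positions."""
    err = np.linalg.norm(est - gt, axis=1)               # per-pose translation error
    mae = err.max()
    rmse = np.sqrt(np.mean(err ** 2))
    travelled = np.cumsum(np.r_[0.0, np.linalg.norm(np.diff(gt, axis=0), axis=1)])
    rep = []
    for frac in (0.2, 0.4, 0.6, 0.8, 1.0):                # REP-1 ... REP-5
        idx = min(np.searchsorted(travelled, frac * travelled[-1]), len(err) - 1)
        rep.append(100.0 * err[idx] / max(travelled[idx], 1e-9))
    return mae, rmse, rep
```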
As listed in Table I, the MAE, RMSE, and REP-5 of LOAM are 39.15 m, 10.15 m, and 0.51%, respectively. The MAE, RMSE, and REP-5 of LOAM + Scan Context are 6.62 m, 2.63 m, and 0.22%, which cannot completely eliminate the accumulated error of LOAM. In comparison, the MAE, RMSE, and REP-5 of the proposed method without the prior map are 2.22 m, 1.22 m, and 0.17%. By adding the intersection pose constraint based on the semantic knowledge database built from the pre-built map, the MAE, RMSE, and REP-5 of the proposed method are reduced to 2.06 m, 0.60 m, and 0.13%, a more than threefold improvement in relative error compared to LOAM. It can be concluded that the proposed method achieves accurate localization by correcting the accumulative error of the open-loop LOAM.
To further analyze the relationship between the distribution of localization error and the flight trajectory, the localization error maps of the comparison methods and the proposed method are plotted in Figure 10.
It can be seen from Figure 10(a) that the localization error of LOAM increases continuously with the UAV flight distance. The error distribution map in Figure 10(b) shows that LOAM + Scan Context can detect a loop in the same scene when the pose change is small. However, the localization error still accumulates with the flight distance, resulting in a maximum error of up to 6.6 m. As shown in Figure 10(c), by detecting the same intersection from different directions without a pre-built map, the proposed method can establish stable loop constraints, and the maximum error is reduced to 2.2 m. Figure 10(d) shows the error distribution map of the proposed method based on the semantic knowledge database. When an intersection in the semantic knowledge database is detected during UAV flight, the localization error is eliminated immediately, so the localization error only accumulates between two consecutive intersections.
4.2. UAV localization in mine-like indoor corridor
To evaluate the localization accuracy and adaptability of the proposed method in real environments, a UAV hardware platform matching the simulated UAV platform is designed and used to conduct localization experiments in a mine-like indoor corridor. The components of the UAV hardware platform are shown in Figure 11.
Since the ground truth pose of the UAV cannot be directly measured in a narrow indoor corridor, a localization accuracy evaluation method based on multiple Apriltags [Reference Wang and Olson50] is applied to compare the localization errors of the different methods. As shown in Figure 12(a), the experiment was conducted in an indoor corridor with a total length of 210 m and a minimum width of 1.3 m. As shown in Figure 12(b), 20 Apriltags were deployed for positioning accuracy evaluation, and the position of each Apriltag is listed in Table II. Figure 12(c) shows an example of an Apriltag pasted on the wall. To ensure the safety of the localization experiment, the UAV was mounted on a mobile tripod, making it easy to adjust its pose during data collection.
4.2.1. Semantic knowledge database construction
First, the prior map of the indoor corridor is reconstructed from the point cloud data scanned by the Velodyne 16 and the depth images measured by the Realsense D435, based on our previous work [Reference Chen, Feng, Zhao, Wang and Liang39]. The reconstructed point cloud map of the indoor corridor is shown in Figure 13.
It can be seen from Figure 13 that the indoor corridor contains five intersections, numbered 1 to 5. The dense point cloud of each intersection is segmented from the map to construct the semantic knowledge database; the segmented point clouds are shown in Figure 14. It can be found from Figure 14 that the geometric structures of intersections 1, 2, and 5 are similar, making it difficult to distinguish them relying only on the point cloud data.
Based on our previous work [Reference Chen, Feng, Wang and Liang40], the geometric invariant point of each intersection is detected for generating the PWCV and CWC descriptors, which are shown in Figure 15. Correspondingly, the PWCV and CWC similarity matrices are computed for recognizing the different intersections, with the results shown in Figure 16. It can be seen that the PWCV and CWC similarities between different intersections are less than 0.8. In the UAV localization experiment, the minimum similarity threshold is therefore set to 0.8 so that different intersections can be distinguished correctly. Based on the segmented intersection point clouds, the detected invariant points, and the descriptors, the semantic knowledge database of the indoor corridor is constructed, with the same structure as in Figure 8.
4.2.2. UAV localization accuracy analysis
The UAV platform was driven through the indoor corridor for data recording, and the UAV pose was changed by adjusting the tripod to simulate the flight process. The driving trajectory is shown as the red line in Figure 17. The trajectory passed sequentially through the Apriltags numbered 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-14-13-12-11-10-9-8-7-6-5-16-17-18-19-0, and the total length of the trajectory is 420 m.
Based on the recorded LiDAR data, the UAV pose is estimated by LOAM, LOAM + Scan Context, and the proposed method. The localization errors are computed as the differences between the estimated and ground truth locations of the Apriltags listed in Table II. The localization error curves of each method are plotted in Figure 18, where the horizontal coordinate represents the IDs of the 31 Apriltags passed by the UAV in sequence and the vertical coordinate represents the calculated localization error.
As shown in Figure 18, the localization error of LOAM increases continuously with the driving distance, with an average error of 14.18 m and a maximum error of 51.8 m, owing to the similar geometric structure of the indoor corridor. The similar geometric structure also causes mismatches in LOAM + Scan Context, whose average error of 20.9 m is even higher than that of the open-loop LOAM positioning method. As the red curve in Figure 18 shows, the proposed localization method decreases the average localization error to 5.0 m without a prior map, since it can accurately recognize the same intersection from different directions and thereby construct stable loop constraints. Furthermore, after adding the intersection constraint factor provided by the prior map, the proposed method can eliminate the localization error accumulated from the previous intersection to the current one whenever it passes through an intersection in the semantic knowledge database. Thus, the localization error of the proposed method (with map) is corrected in segments, which greatly improves the accuracy of pose estimation and reduces the average positioning error to 1.7 m.
5. Conclusion
In this paper, a semantic knowledge database-based localization method is proposed for UAV inspection in perceptually degraded underground mines. First, the relative pose constraint factor is constructed based on the spatial geometric features between neighboring LiDAR keyframes to realize local UAV pose estimation. Furthermore, the dense point cloud of each intersection is extracted from the prior map, and the geometric invariant point, PWCV, and CWC descriptors are generated to construct the semantic knowledge database. Moreover, the intersection pose constraint factor is constructed by comparing the semantic topology of the current LiDAR keyframe with the intersections in the semantic knowledge database. Based on the pose factor graph model, the relative pose constraint factor and the intersection pose constraint factor are combined to optimize the UAV flight pose. Finally, the experimental results in the Edgar Mine and the mine-like indoor corridor demonstrate that the UAV localization method proposed in this paper can eliminate the accumulated error in segments, achieve high localization accuracy, and meet the needs of underground inspection and positioning.
Author contributions
Qinghua Liang and Min Chen wrote the main manuscript. Min Chen and Minghui Zhao conducted the experiments and analyzed the data. Shigang Wang revised the manuscript. All authors reviewed the manuscript.
Financial support
This work is supported by the project “Research on key technologies of UAV in a coal mine” (project number 2019-TD-2-CXY007).
Competing interests
The authors declare that no competing interests exist.
Ethical approval
Not applicable.