1. Introduction
The fusion of visual and inertial measurement data has attracted extensive attention because such systems are small, consume little power, and have a simple mechanical structure [Reference Yang and Shen1–Reference Martinelli6]; they have been widely used in mobile robots [Reference Jung, Lee and Myung7–Reference Liu, Sun, Pu and Yan8], virtual reality, augmented reality [Reference Muñoz-Salinas, Sarmadi, Cazzato and Medina-Carnicer9], and other fields. Existing visual-inertial odometry (VIO) algorithms can be divided into nonlinear optimization frameworks and filter frameworks according to the back-end optimization method. The nonlinear optimization method obtains the best state estimate by minimizing the residuals generated by the camera and inertial measurement unit (IMU) measurements, which can achieve higher pose estimation accuracy; however, its iterative optimization consumes considerable computing resources [Reference Qin, Li and Shen10–Reference Ling, Bao, Jie, Zhu, Li, Tang, Liu, Liu and Zhang13]. Algorithms based on filter frameworks involve no iterative optimization and consume fewer computing resources than the nonlinear optimization algorithms [Reference Zhang, Dong, Wang and Sun14–Reference Li and Mourikis19], while achieving pose estimation accuracy comparable to that of nonlinear optimization. The VIO system rests on the complementary nature of image data and IMU data [Reference Sabatelli, Galgani, Fanucci and Rocchi20]: IMU measurements can compensate for short-term camera tracking failures, and camera constraints can correct the inaccurate attitude predicted by the IMU. Only by knowing the precise sampling times of the camera and the IMU can their data be accurately fused. However, the timestamps of the camera and the IMU are affected by triggering and transmission delays, resulting in unknown time offsets between the two sensors [Reference Qin and Shen21], which degrade the accuracy of the VIO. Therefore, studying online time calibration between the camera and the IMU is of great significance for improving the accuracy of VIO pose estimation.
In view of the VIO pose estimation errors caused by time asynchrony when fusing the camera and the IMU, many time calibration methods have been proposed. Zhang et al. [Reference Zhang, Li and Zhu22] used a delayed sensor to measure the time offset and then, under the assumption that the time offset is completely known, proposed a visual-assisted inertial navigation method for VIO pose estimation. Choi et al. [Reference Choi, Choi, Park and Chung23] proposed an improved algorithm to estimate the VIO pose under the assumption that the time offset is approximately known. However, these algorithms all assume that the time offset is known and do not attempt to estimate it; in actual engineering, it is difficult to know the time offset in advance. Furgale et al. [Reference Furgale, Rehder and Siegwart24] proposed a spatiotemporal calibration algorithm, Kalibr, which jointly estimates the time offset and the spatial transformation between the camera and the IMU within the maximum likelihood estimation framework. Kelly et al. [Reference Kelly and Sukhatme25] first computed the rotation estimates of the camera and the IMU separately and registered them for time alignment with a variant of the iterative closest point method in rotation space. However, this method uses only the rotation measurements rather than all available measurements, which degrades the positioning accuracy of the VIO system. Both of the above algorithms require a calibration board and focus on offline computation, so they cannot solve the online time calibration problem. Qin et al. [Reference Qin and Shen21] proposed an online temporal calibration VIO algorithm based on a projection model, which models and compensates for the time offset under the assumption that feature points move at a constant velocity between adjacent image frames. Liu et al. [Reference Liu and Meng26] proposed an online time calibration algorithm for VIO based on an improved projection model, which relaxes the constant-velocity assumption on the feature points. However, such projection-model-based algorithms place high demands on the extraction and distribution of front-end feature points; in practice, illumination changes make it difficult to achieve the desired effect. Li et al. [Reference Li and Mourikis19] proposed an online time calibration method based on the Extended Kalman Filter (EKF), which does not require high-quality front-end feature points, is less affected by factors such as illumination, and has good computational efficiency. Guo et al. [Reference Guo, Kottas, DuToit, Ahmed, Li and Roumeliotis27] adopted the same algorithm and introduced a ratio variable to account for the pose displacement caused by the time offset. However, these algorithms do not consider the large noise in the gyroscope measurements, which causes the IMU-predicted pose to deviate from the real pose and reduces the pose estimation accuracy.
To solve the problem of significant VIO pose estimation errors caused by the time offset when fusing the camera and the IMU, this paper proposes a robust VIO method based on double-stage EKF online time calibration to improve the accuracy of mobile robot pose estimation. The proposed algorithm consists of a complementary Kalman filter and a multistate constrained Kalman filter (MSCKF). The complementary Kalman filter takes the accelerometer measurements as observation vectors, corrects the gyroscope bias, and outputs a more accurate initial pose for the MSCKF. Then, the unknown time offset is added to the state vector of the MSCKF, and the feature point reprojection error is computed and used to update the time offset, the IMU state, and the camera states. Finally, the Schur complement model is used to marginalize old camera states, which keeps the computation scale small, preserves prior constraint information, and further improves VIO accuracy.
The main contributions of this article are as follows:
1. A novel double-stage EKF VIO is proposed. When the visual information has large errors or is even invalid, the complementary Kalman filter corrects the gyroscope bias and computes an accurate initial IMU attitude estimate, so the VIO system obtains a more accurate attitude estimate.
2. The time offset is added to the state vector of the VIO system and calibrated online under the constraint of the feature point reprojection error, improving the accuracy of the camera-IMU fusion system.
The rest of this paper is organized as follows. Section II describes the problem formulation, and Section III introduces the overall framework of the proposed time alignment algorithm. Sections IV and V detail the double-stage Kalman filter of the proposed algorithm, Section VI presents the experimental results on the EuRoC dataset and on real mobile robot localization, and Section VII concludes the paper. In addition, the main notations used in this article are given in Table I.
2. Problem formulation
The VIO system consists of a camera and an IMU, and each sensor provides measurement data at a constant frequency. In practice, the timestamps of the individual sensors are often affected by triggering and transmission delays, resulting in an unknown time offset $t_{d}$ between the sensors. Figure 1 shows how sensor delays generate the time offset: the IMU data wait $t_{b}$ and the camera data wait $t_{a}$ before being received, where $t_{a}$ is not equal to $t_{b}$ and both are unknown quantities. The time offset between the sensors is defined as $t_{d}=t_{a}-t_{b}$ , and its sign is determined by the relative size of $t_{a}$ and $t_{b}$ . The IMU timestamp is used as the reference time in this paper. As shown in Figure 1, $t_{d}$ is a negative value.
Due to the negative effect of $t_{d}$ , the IMU preintegration results between each image frame are inconsistent with the actual pose transformation data, and the system will generate wrong state estimations. At the same time, the attitude information in the IMU preintegration is provided by the gyroscope measurement. However, the gyroscope measurement contains significant noise, which will make the IMU preintegration result deviate from the actual value. Therefore, in implementing the VIO system, it is essential to fully consider the measurement noise of the gyroscope and the time offset between the camera and the IMU.
3. System overview
The proposed VIO system is divided into two parts: the first part is the front-end feature point matching and tracking, and the second part is the online time calibration framework. The second part mainly includes two filters, the first-stage complementary Kalman filter and the second-stage MSCKF. The overall system flow is shown in Figure 2.
In the first part, the feature points are extracted and tracked using the Lucas–Kanade optical flow algorithm from the input image, and the position of the feature points is predicted based on the IMU preintegration results [Reference Baker and Matthews28]. When the number of feature points is less than the set threshold, new feature points are extracted outside the area where the matched feature points reside to ensure that the number of feature points meets the threshold requirements [Reference Rosten and Drummond29–Reference Forster, Carlone, Dellaert and Scaranmuzza31]. To improve the system’s robustness, we evenly divide the image to ensure that each sub-region can extract a certain number of feature points.
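A minimal sketch of such a front end is given below, assuming OpenCV; the grid size, FAST threshold, and per-cell minimum are illustrative values rather than the paper's settings, and `track_and_replenish` is a hypothetical helper name.

```python
import cv2
import numpy as np

GRID_ROWS, GRID_COLS = 4, 5     # illustrative grid layout
MIN_PER_CELL = 4                # assumed per-cell feature threshold

def track_and_replenish(prev_img, curr_img, prev_pts, predicted_pts):
    """Track features with pyramidal Lucas-Kanade, seeded by the IMU-predicted
    positions, then top up under-populated grid cells with new FAST corners.
    prev_pts and predicted_pts are (N, 1, 2) float32 arrays."""
    # LK tracking; predicted_pts (from IMU preintegration) seeds the search.
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, curr_img, prev_pts, predicted_pts.copy(),
        winSize=(21, 21), maxLevel=3, flags=cv2.OPTFLOW_USE_INITIAL_FLOW)
    tracked = curr_pts[status.ravel() == 1].reshape(-1, 2)

    # Mask out neighborhoods of surviving tracks so new corners land elsewhere.
    mask = np.full(curr_img.shape[:2], 255, np.uint8)
    for x, y in tracked:
        cv2.circle(mask, (int(x), int(y)), 10, 0, -1)

    h, w = curr_img.shape[:2]
    ch, cw = h // GRID_ROWS, w // GRID_COLS
    fast = cv2.FastFeatureDetector_create(threshold=20)
    new_pts = []
    for r in range(GRID_ROWS):
        for c in range(GRID_COLS):
            in_cell = np.sum((tracked[:, 0] // cw == c) & (tracked[:, 1] // ch == r))
            if in_cell < MIN_PER_CELL:
                y0, x0 = r * ch, c * cw
                kps = fast.detect(curr_img[y0:y0 + ch, x0:x0 + cw],
                                  mask[y0:y0 + ch, x0:x0 + cw])
                kps = sorted(kps, key=lambda k: -k.response)
                new_pts += [[k.pt[0] + x0, k.pt[1] + y0]
                            for k in kps[:MIN_PER_CELL - int(in_cell)]]
    if new_pts:
        tracked = np.vstack([tracked, np.float32(new_pts)])
    return tracked.reshape(-1, 1, 2).astype(np.float32)
```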
The second part consists of two Kalman filters. The first-stage complementary filter exploits the complementary characteristics of the accelerometer and the gyroscope of the IMU. First, the rotation information estimated from the gyroscope serves as the state vector, and the accelerometer measurement serves as the observation vector. Second, the error ${\textbf{r}}$ between the acceleration measurement and gravity is used to update the rotation information. Finally, the updated rotation information is passed to the second-stage MSCKF. The main task of the second-stage MSCKF is to solve for the unknown time offset. We add the time offset and the latest IMU-predicted camera state to the state vector of the VIO system, which is updated using the reprojection errors of the same feature points observed from multiple camera frames as residuals. The Schur complement model is used to marginalize old camera states, which not only keeps the sliding window at a fixed size and improves computational efficiency but also preserves the prior information in the marginalized image frames and improves the estimation accuracy of the camera pose.
4. Complementary Kalman filter
Define the IMU pose ${\textbf{X}}_{pose}$ as the state vector of the complementary Kalman filter. The details are as follows:
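Consistent with the definitions below, the state vector presumably stacks the orientation quaternion and the gyroscope bias:

$$ {\textbf{X}}_{pose}=\left[\begin{array}{cc} {}_{G}^{I}{\textbf{q}}^{T} & {\textbf{b}}_{g}^{T} \end{array}\right]^{T} $$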
where ${}^{I}\!\!{}_{G}{{\textbf{q}}}{}=\left[\begin{array}{l@{\quad}l@{\quad}l@{\quad}l} q_{x} & q_{y} & q_{z} & q_{w} \end{array}\right]^{T}$ is the unit quaternion that describes the rotation from the global frame to the IMU frame, and ${\textbf{b}}_{g}$ represents the bias of the gyroscope, which is modelled as a random walk process driven by the white Gaussian noise ${\textbf{n}}_{{{{{\unicode{x03C9}}}}} g}$ .
4.1. Process model
The state equation in continuous time form is as follows:
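With the quantities defined below, this is presumably the standard quaternion propagation model:

$$ {}_{G}^{I}\dot{{\textbf{q}}}=\frac{1}{2}\boldsymbol{\Omega }({\unicode{x1D6DA}})\,{}_{G}^{I}{\textbf{q}},\qquad \dot{{\textbf{b}}}_{g}={\textbf{n}}_{{{{{\unicode{x03C9}}}}} g} $$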
where ${\unicode{x1D6DA}}={\unicode{x1D6DA}}_{m}-{\textbf{b}}_{g}-{\textbf{n}}_{g}$ is the true value of the angular velocity of the IMU, ${\textbf{n}}_{g}$ is the Gaussian noise of the gyroscope. ${\unicode{x1D6DA}}_{m}=\left[\begin{array}{l@{\quad}l@{\quad}l} {{{{\unicode{x03C9}}}}} _{x} & {{{{\unicode{x03C9}}}}} _{y} & {{{{\unicode{x03C9}}}}} _{z} \end{array}\right]^{T}$ represents the angular velocity measured by the gyroscope. $\boldsymbol{\Omega }({\unicode{x1D6DA}})=\left[\begin{array}{l@{\quad}l} -\lfloor {\unicode{x1D6DA}}\times \rfloor & {\unicode{x1D6DA}}\\[5pt] -{\unicode{x1D6DA}}^{T} & 0 \end{array}\right]$ with $\lfloor {\unicode{x1D6DA}}\times \rfloor$ denotes the skew-symmetric matrix of ${\unicode{x1D6DA}}$ .
Applying the expectation operator to the above formula, the approximate state equation can be obtained as follows:
where $\hat{{\unicode{x1D6DA}}}={\unicode{x1D6DA}}_{m}-\hat{{\textbf{b}}}_{g}$ is the estimate of the true angular velocity of the gyroscope. To propagate the state uncertainty, the state transition matrix is calculated as follows:
where ${{\Delta}} t$ is the time interval between two IMU data frames. From the above formula, the discrete state transition equation of the system can be obtained as
where ${\textbf{H}}_{p}$ represents the state transition matrix of the gyroscope state. The specific values are shown in ref. [Reference Sun, Mohta, Pfrommer, Watterson, Liu, Mulgaonkar, Taylor and Kumar18]. ${\textbf{y}}(k)$ obeys a Gaussian distribution with a mean of 0 and a covariance matrix ${\textbf{Q}}_{k}$ .
where $\sigma _{q}$ and $\sigma _{bg}$ are constants.
Finally, the propagated covariance of the gyroscope state can be expressed as follows:
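In the standard EKF form, this is presumably

$$ {\textbf{P}}_{pos{e_{k+1|k}}}={\textbf{H}}_{p}{\textbf{P}}_{pos{e_{k|k}}}{\textbf{H}}_{p}^{T}+{\textbf{Q}}_{k} $$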
4.2. Observation model
The actual value of the accelerometer is
where ${\textbf{a}}_{m}=\left[\begin{array}{l@{\quad}l@{\quad}l} a_{x} & a_{y} & a_{z} \end{array}\right]^{T}$ represents the acceleration measured by the accelerometer and ${\textbf{b}}_{a}$ represents the bias of the accelerometer. From the attitude predicted by the gyroscope, we can obtain the predicted acceleration $\hat{{\textbf{a}}}$
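which, assuming the accelerometer measurement is dominated by gravity (the sign convention is assumed here, with $C(\cdot)$ the quaternion-to-rotation-matrix map defined in Section 5.1), is the gravity vector rotated into the IMU frame:

$$ \hat{{\textbf{a}}}=C({}_{G}^{I}\hat{{\textbf{q}}})\,{\textbf{g}} $$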
where ${\textbf{g}}$ is the gravity vector. The residual of the accelerometer can be expressed as
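which, linearized about the current estimate, presumably takes the form

$$ {\textbf{r}}={\textbf{a}}_{m}-\hat{{\textbf{a}}}\approx {\textbf{H}}_{g}\tilde{{\textbf{X}}}_{pose}+{\textbf{v}}(k) $$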
where ${\textbf{H}}_{g}$ represents the Jacobian matrix of the acceleration measurement with respect to the state of the gyroscope, and the specific values are shown in ref. [Reference Sun, Mohta, Pfrommer, Watterson, Liu, Mulgaonkar, Taylor and Kumar18]. ${\textbf{v}}(k)$ obeys a Gaussian distribution with mean 0 and covariance ${\textbf{R}}_{k}$ .
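Presumably ${\textbf{R}}_{k}=\sigma _{a}^{2}{\textbf{I}}_{3}$ ,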
where $\sigma _{a}$ is a constant. Finally, update ${\textbf{X}}_{pose}$ by the residual
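presumably via the standard EKF update equations

$$ {\textbf{K}}={\textbf{P}}_{pos{e_{k+1|k}}}{\textbf{H}}_{g}^{T}\left({\textbf{H}}_{g}{\textbf{P}}_{pos{e_{k+1|k}}}{\textbf{H}}_{g}^{T}+{\textbf{R}}_{k}\right)^{-1},\qquad {\textbf{X}}_{pos{e_{k+1|k+1}}}={\textbf{X}}_{pos{e_{k+1|k}}}+{\textbf{K}}{\textbf{r}},\qquad {\textbf{P}}_{pos{e_{k+1|k+1}}}=({\textbf{I}}-{\textbf{K}}{\textbf{H}}_{g}){\textbf{P}}_{pos{e_{k+1|k}}} $$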
where $\textbf{K}$ is the Kalman gain, and ${\textbf{X}}_{pos{e_{k+1|k+1}}}$ and ${\textbf{P}}_{pos{e_{k+1|k+1}}}$ are the state vector and covariance matrix estimated at time $k+1$ , respectively. By fusing the accelerometer measurements, we can calculate the precise IMU pose ${}^{I}\!\!{}_{G}{{\textbf{q}}}{}$ and gyroscope bias ${\textbf{b}}_{g}$ for the IMU state vector in the second-stage filter.
5. The principle of MSCKF
The state vector of the IMU in the second-stage filter MSCKF is as follows:
where the unit quaternion ${}^{I}\!\!{}_{G}{{\textbf{q}}}{}$ and the vector ${\textbf{b}}_{g}$ are the optimal estimates from the first-stage complementary Kalman filter. The vectors ${}^{G}{{\textbf{v}}}{_{I}^{}}$ and ${}^{G}{{\textbf{p}}}{_{I}^{}}$ represent the velocity and position of the IMU frame, respectively. The vector ${\textbf{b}}_{a}$ represents the linear acceleration deviation of the accelerometer. ${}^{I}\!\!{}_{C}{{\textbf{q}}}{}$ and ${}^{I}{{\textbf{p}}}{_{C}^{}}$ describe the relative transformation between the camera and the IMU.
The standard additive error is used to define the position, velocity, bias, and time offset errors (e.g., ${}^{G}{\tilde{{\textbf{p}}}}{_{I}^{}}={}^{G}{{\textbf{p}}}{_{I}^{}}-{}^{G}{\hat{{\textbf{p}}}}{_{I}^{}}$ ), whereas the minimal 3D representation is used for the orientation error, that is,
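In the standard small-angle form (consistent with the $[q_{x}\;q_{y}\;q_{z}\;q_{w}]^{T}$ convention above), the error quaternion is

$$ \delta {\textbf{q}}={\textbf{q}}\otimes \hat{{\textbf{q}}}^{-1}\approx \left[\begin{array}{cc} \frac{1}{2}\tilde{\boldsymbol{\theta }}^{T} & 1 \end{array}\right]^{T} $$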
where $\tilde{\boldsymbol{\theta }}\in \mathrm{\mathbb{R}}^{3}$ represents a small angle rotation. According to the above error definition, the IMU error state vector $\tilde{{\textbf{X}}}_{I}$ is given:
Both the IMU state and N camera states are included in the VIO system state vector, and the entire error state vector of the VIO system is
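With N camera states in the sliding window, this presumably stacks as

$$ \tilde{{\textbf{X}}}=\left[\begin{array}{ccccc} \tilde{{\textbf{X}}}_{I}^{T} & \tilde{{\textbf{x}}}_{{C_{1}}}^{T} & \tilde{{\textbf{x}}}_{{C_{2}}}^{T} & \cdots & \tilde{{\textbf{x}}}_{{C_{N}}}^{T} \end{array}\right]^{T} $$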
The error vector for each camera state consists of the minimal orientation error and the additive position error of the corresponding camera frame.
5.1. Process model
The state equation of the IMU in continuous time form is
where $C(\!\cdot\!)$ is the function that converts the quaternion into the rotation matrix. We can get the propagation model of the IMU state through the linear continuous kinematics model of the IMU error state, and it is shown as follows:
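Given the definitions of ${\textbf{F}}$ and ${\textbf{G}}$ below, the linearized error-state model reads

$$ \dot{\tilde{{\textbf{X}}}}_{I}={\textbf{F}}\tilde{{\textbf{X}}}_{I}+{\textbf{G}}{\textbf{n}}_{I} $$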
where ${\textbf{n}}_{I}=\left(\begin{array}{l@{\quad}l@{\quad}l@{\quad}l} {\textbf{n}}_{g}^{T} & {\textbf{n}}_{{{{{\unicode{x03C9}}}}} g}^{T} & {\textbf{n}}_{a}^{T} & {\textbf{n}}_{{{{{\unicode{x03C9}}}}} a}^{T} \end{array}\right)^{T}$ . The vectors ${\textbf{n}}_{g}$ and ${\textbf{n}}_{a}$ are the Gaussian noise of the gyroscope and accelerometer measurements, and ${\textbf{n}}_{{{{{\unicode{x03C9}}}}} g}$ and ${\textbf{n}}_{{{{{\unicode{x03C9}}}}} a}$ are the random walk rates of the gyroscope and accelerometer measurement biases. The details of ${\textbf{F}}$ and ${\textbf{G}}$ are shown in ref. [Reference Sun, Mohta, Pfrommer, Watterson, Liu, Mulgaonkar, Taylor and Kumar18]. Fourth-order Runge–Kutta numerical integration is used to propagate the state of the IMU. The discrete-time state transition matrix $\boldsymbol{\Phi }_{k}$ and the discrete-time noise covariance matrix ${\textbf{Q}}_{k}$ are as follows:
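These presumably follow the usual continuous-to-discrete conversion, with ${\textbf{Q}}$ denoting the continuous-time covariance of ${\textbf{n}}_{I}$ (a symbol introduced here for exposition):

$$ \boldsymbol{\Phi }_{k}=\boldsymbol{\Phi }(t_{k+1},t_{k})=\exp \left(\int _{t_{k}}^{t_{k+1}}{\textbf{F}}(\tau )\,d\tau \right),\qquad {\textbf{Q}}_{k}=\int _{t_{k}}^{t_{k+1}}\boldsymbol{\Phi }(t_{k+1},\tau ){\textbf{G}}{\textbf{Q}}{\textbf{G}}^{T}\boldsymbol{\Phi }(t_{k+1},\tau )^{T}\,d\tau $$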
The covariance prediction matrix of the VIO system at time $k$ can be described as
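In the MSCKF this takes the standard partitioned form (the block names ${\textbf{P}}_{II}$ , ${\textbf{P}}_{IC}$ , and ${\textbf{P}}_{CC}$ for the IMU, IMU-camera, and camera blocks are assumed here):

$$ {\textbf{P}}_{k+1|k}=\left[\begin{array}{cc} \boldsymbol{\Phi }_{k}{\textbf{P}}_{II_{k|k}}\boldsymbol{\Phi }_{k}^{T}+{\textbf{Q}}_{k} & \boldsymbol{\Phi }_{k}{\textbf{P}}_{IC_{k|k}}\\ {\textbf{P}}_{IC_{k|k}}^{T}\boldsymbol{\Phi }_{k}^{T} & {\textbf{P}}_{CC_{k|k}} \end{array}\right] $$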
The pose of the newly added camera state ${\textbf{x}}_{{C_{i}}}$ in the state vector can be calculated from the IMU preintegration result and the relative transformation between the camera and the IMU.
The augmented covariance matrix is as follows:
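Following standard MSCKF practice, this is presumably

$$ {\textbf{P}}_{k|k}\leftarrow \left[\begin{array}{c} {\textbf{I}}\\ {\textbf{J}} \end{array}\right]{\textbf{P}}_{k|k}\left[\begin{array}{c} {\textbf{I}}\\ {\textbf{J}} \end{array}\right]^{T} $$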
where $\textbf{J}$ is shown in ref. [Reference Sun, Mohta, Pfrommer, Watterson, Liu, Mulgaonkar, Taylor and Kumar18].
5.2. Measurement model
The image data are used to correct the state of the IMU. At time $t$ , there is a time offset $t_{d}$ between the camera and the IMU. Assume that the left and right cameras track the same feature point $f_{i}$ at the same time, with the left camera pose $\left[\begin{array}{l@{\quad}l} {}_{G}^{C1}{{\textbf{q}}}{}(t+t_{d}) & {}^{G}{{\textbf{p}}}{_{C1}^{}}(t+t_{d}) \end{array}\right]$ and the right camera pose $\left[\begin{array}{l@{\quad}l} {}_{G}^{C2}{{\textbf{q}}}{}(t+t_{d}) & {}^{G}{{\textbf{p}}}{_{C2}^{}}(t+t_{d}) \end{array}\right]$ . The observation of the stereo camera is
where $\left[\begin{array}{l@{\quad}l@{\quad}l} {}^{Ck}{x}{_{i}^{}}(t+t_{d}) & {}^{Ck}{y}{_{i}^{}}(t+t_{d}) & {}^{Ck}{z}{_{i}^{}}(t+t_{d}) \end{array}\right]^{T},k=1,2$ represents the position of the feature point in the left and right images and is shown as follows:
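presumably the usual frame transformation

$$ \left[\begin{array}{c} {}^{Ck}x_{i}(t+t_{d})\\ {}^{Ck}y_{i}(t+t_{d})\\ {}^{Ck}z_{i}(t+t_{d}) \end{array}\right]=C\!\left({}_{G}^{Ck}{\textbf{q}}(t+t_{d})\right)\left({}^{G}{{\textbf{p}}}_{fi}-{}^{G}{{\textbf{p}}}_{Ck}(t+t_{d})\right),\qquad k=1,2 $$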
where ${}^{G}{{\textbf{p}}}{_{fi}^{}}$ is the 3D position of the feature point $f_{i}$ in the global frame, which can be calculated by least-squares triangulation from multiple observations of the feature point. After obtaining ${}^{G}{{\textbf{p}}}{_{fi}^{}}$ , we can calculate the reprojection error of the feature point observed in the current frame, as follows:
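in linearized form, presumably

$$ {\textbf{r}}^{i}={\textbf{z}}_{i}-\hat{{\textbf{z}}}_{i}\approx {\textbf{H}}_{C}^{i}\tilde{{\textbf{X}}}+{\textbf{H}}_{{f_{i}}}\,{}^{G}{\tilde{{\textbf{p}}}}_{fi}+{\textbf{n}}_{{f_{i}}} $$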
where ${\textbf{n}}_{{f_{i}}}$ is the measurement noise. The measurement Jacobian matrices ${\textbf{H}}_{C}^{i}$ and ${\textbf{H}}_{{f_{i}}}$ are shown in ref. [Reference Sun, Mohta, Pfrommer, Watterson, Liu, Mulgaonkar, Taylor and Kumar18].
The above is the observation residual of a feature point in a camera frame, and the residual block obtained by superimposing multiple observations of the feature point $f_{i}$ is shown as follows:
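presumably of the form

$$ {\textbf{r}}_{i}={\textbf{H}}_{X}^{i}\tilde{{\textbf{X}}}+{\textbf{H}}_{{f_{i}}}\,{}^{G}{\tilde{{\textbf{p}}}}_{fi}+{\textbf{n}}_{i} $$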
Since ${}^{G}{\tilde{{\textbf{p}}}}{_{fi}^{}}$ is computed using the camera poses, its uncertainty is correlated with the camera pose in the state. To ensure that the uncertainty of ${}^{G}{\tilde{{\textbf{p}}}}{_{fi}^{}}$ does not affect the residual, the residual ${\textbf{r}}_{i}$ in Eq. (36) is projected onto the left null space of ${\textbf{H}}_{{f_{i}}}$ :
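that is, presumably

$$ {\textbf{r}}_{0}^{i}={\textbf{A}}^{T}{\textbf{r}}_{i}={\textbf{A}}^{T}{\textbf{H}}_{X}^{i}\tilde{{\textbf{X}}}+{\textbf{A}}^{T}{\textbf{n}}_{i}\triangleq {\textbf{H}}_{X,0}^{i}\tilde{{\textbf{X}}}+{\textbf{n}}_{i,0} $$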
where $\textbf{A}$ is a unitary matrix whose columns form a basis of the left null space of ${\textbf{H}}_{{f_{i}}}$ , so that ${\textbf{A}}^{T}{\textbf{H}}_{{f_{i}}}={\textbf{0}}$ and the feature-position term vanishes. The state of the IMU is first propagated using the gyroscope and accelerometer measurements. Then, the new camera state is estimated from the IMU preintegration result and added to the state vector of the VIO system, and the augmented covariance matrix is calculated. Finally, the VIO system state is updated using the reprojection errors of the multi-constrained feature points.
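Stacking the projected residuals ${\textbf{r}}_{0}^{i}$ of all features observed in the current frame presumably gives

$$ {\textbf{r}}'={\textbf{H}}\tilde{{\textbf{X}}}+{\textbf{n}}' $$

with ${\textbf{n}}'$ the stacked measurement noise (a notation introduced here),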
where ${\textbf{r}}'$ is the reprojection error of all points in the current frame, composed of the individual ${\textbf{r}}_{0}^{i}$ , and ${\textbf{H}}$ is the observation matrix, composed of the corresponding ${\textbf{H}}_{X,0}^{i}$ .
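As a concrete illustration, a minimal numpy sketch of this update step follows; the QR compression of the stacked system and the Joseph-form covariance update reflect common MSCKF practice rather than details confirmed by this paper, and `pixel_sigma` is an assumed noise parameter.

```python
import numpy as np

def msckf_update(P, H, r, pixel_sigma=0.035):
    """EKF update with the stacked residual r' = H @ x_err + noise.
    P: (n, n) state covariance, H: (m, n) stacked observation matrix,
    r: (m,) stacked residual after the null-space projection."""
    # QR-compress the measurement when it is taller than the state:
    # H = Q1 R1, so (R1, Q1^T r) carries the same information at lower cost.
    if H.shape[0] > H.shape[1]:
        Q1, R1 = np.linalg.qr(H)
        H, r = R1, Q1.T @ r
    R = (pixel_sigma ** 2) * np.eye(H.shape[0])  # isotropic measurement noise
    S = H @ P @ H.T + R                          # innovation covariance
    K = np.linalg.solve(S, H @ P).T              # gain K = P H^T S^{-1}
    dx = K @ r                                   # error-state correction
    I_KH = np.eye(P.shape[0]) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T      # Joseph form, stays PSD
    return dx, P_new
```

The correction `dx` is then folded back into the nominal states: quaternion states via a small-angle update, the remaining states additively.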
5.3. Schur complement marginalization
As the VIO system runs, the number of camera states grows and the matrix dimension increases, which raises the computational cost of camera pose estimation. To meet real-time requirements, it is necessary to marginalize camera states that carry little information. Deleting such a camera state directly would lose the prior information it shares with other camera states. The Schur complement model is therefore used to marginalize old camera states, reducing the computational cost while preserving their prior information.
Figure 3 shows the process of adding new states and deleting old states in the sliding window where ${\textbf{X}}_{I}$ is the IMU state, ${\textbf{C}}_{1}$ is the old camera state that needs to be deleted, and ${\textbf{C}}_{7}$ is the new camera state that needs to be added to the sliding window. First, the state of ${\textbf{C}}_{7}$ is added to the system state vector. At the same time, the cross-covariance part between ${\textbf{C}}_{7}$ and ${\textbf{C}}_{6}$ , and ${\textbf{X}}_{I}$ is added to the system covariance matrix. Then, the state ${\textbf{C}}_{1}$ needs to be deleted, but ${\textbf{C}}_{1}$ is connected to ${\textbf{C}}_{0}$ and ${\textbf{C}}_{2}$ . Deleting it directly will cause the loss of prior information, so the marginalization is performed through the Schur complement model.
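Using the block names defined below, the retained covariance presumably takes the standard Schur complement form

$$ {\textbf{P}}_{AA}^{new}={\textbf{P}}_{AA}-{\textbf{P}}_{AB}{\textbf{P}}_{BB}^{-1}{\textbf{P}}_{AB}^{T} $$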
where ${\textbf{P}}_{BB}$ is the covariance matrix of state ${\textbf{C}}_{1}$ , ${\textbf{P}}_{AA}$ is the covariance matrix of the states to be preserved, and ${\textbf{P}}_{AB}$ is the cross-covariance between state ${\textbf{C}}_{1}$ and the preserved states. During marginalization, state ${\textbf{C}}_{1}$ is first moved to the top of the covariance matrix, and the cross-covariance blocks ${\textbf{P}}_{AB}$ and ${\textbf{P}}_{AB}^{T}$ are extracted. Finally, a new covariance matrix ${\textbf{P}}_{AA}^{new}$ is obtained through the Schur complement model, which preserves the prior information linking ${\textbf{C}}_{1}$ with ${\textbf{C}}_{0}$ and ${\textbf{C}}_{2}$ and adds a new constraint between ${\textbf{C}}_{0}$ and ${\textbf{C}}_{2}$ .
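For concreteness, a small numpy sketch of this marginalization using the block names above (the index bookkeeping via `idx_B` is an assumption of the sketch):

```python
import numpy as np

def marginalize_oldest_camera(P, idx_B):
    """Drop the state block idx_B (e.g. camera C1) from covariance P via the
    Schur complement described in the text, keeping its prior information."""
    idx_A = np.setdiff1d(np.arange(P.shape[0]), idx_B)  # states to keep
    P_AA = P[np.ix_(idx_A, idx_A)]
    P_AB = P[np.ix_(idx_A, idx_B)]
    P_BB = P[np.ix_(idx_B, idx_B)]
    # P_AA_new = P_AA - P_AB P_BB^{-1} P_AB^T
    P_AA_new = P_AA - P_AB @ np.linalg.solve(P_BB, P_AB.T)
    return 0.5 * (P_AA_new + P_AA_new.T)  # re-symmetrize for numerical safety
```

Index selection here plays the role of "moving ${\textbf{C}}_{1}$ to the top" in the description above.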
The algorithm flow of this paper is shown in Algorithm 1.
6. Experiments
In this section, the performance of the proposed online time calibration algorithm is verified by two groups of experiments. The first experiment is on the EuRoC dataset, comparing against the original MSCKF [Reference Liu and Meng26], the C-MSCKF algorithm with a complementary Kalman filter [Reference Cen, Jiang, Tan, Su and Xue32], the optimization-based VINS-MONO without loop closure detection [Reference Li and Mourikis19], and the Kalman-filter-based OPEN-VINS algorithm [Reference Geneva, Eckenhoff, Lee, Yang and Huang5]. The root mean square error (RMSE) is used to analyze the performance of the proposed method. The second experiment is conducted in real scenes. First, on a circle with a fixed radius of 2.5 m, the proposed method is compared with the VINS-MONO and OPEN-VINS algorithms with online time calibration, evaluating performance by comparing the estimated trajectories with the real trajectory. Second, the proposed method is compared with the MSCKF algorithm in a closed indoor environment, and the performance improvement is assessed through the loop-closure accuracy. All algorithms run on a laptop with an i5-8250U (quad-core, 1.6 GHz) CPU and 8 GB RAM.
6.1. EuRoC data set
The EuRoC dataset is collected by an unmanned aerial vehicle and contains stereo image pairs at 20 Hz and IMU measurements at 200 Hz. In the original dataset [Reference Burri, Nikolic, Gohl, Schneider, Rehder, Omari, Achtelik and Siegwart33], thanks to the high-precision cameras and IMUs, the image stream and the IMU data sequence are fully time-synchronized. To verify the performance of the proposed method, the timestamps of the image stream are uniformly shifted. The open-source toolbox provided in ref. [Reference Zhang and Scaramuzza34] is used to evaluate the VIO performance quantitatively, with the RMSE of the absolute trajectory error as the evaluation metric. All algorithms run 10 Monte Carlo experiments on the modified EuRoC dataset.
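For example, a uniform shift can be applied to the image index file of a EuRoC-style sequence; this is a minimal sketch, where `shift_image_timestamps` and the 10 ms value are illustrative, and the two-column `data.csv` layout follows the public EuRoC format.

```python
import csv

OFFSET_NS = 10_000_000  # 10 ms in EuRoC's nanosecond timestamps

def shift_image_timestamps(src_csv, dst_csv, offset_ns=OFFSET_NS):
    """Uniformly shift every image timestamp to inject a known camera-IMU offset."""
    with open(src_csv) as f_in, open(dst_csv, "w", newline="") as f_out:
        reader, writer = csv.reader(f_in), csv.writer(f_out)
        writer.writerow(next(reader))  # header: "#timestamp [ns],filename"
        for timestamp, filename in reader:
            writer.writerow([int(timestamp) + offset_ns, filename])
```

Running the VIO on the shifted sequence then lets the preset offset serve as ground truth for the estimated $t_{d}$ .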
Figure 4 shows the time offset estimation results of the proposed method, OPEN-VINS, and VINS-MONO on four sequences (V1_01, V2_01, MH_04, and MH_05) with a time offset of 10 ms. From the results in Figure 4, it can be observed that the proposed method achieves the highest precision in time offset estimation. VINS-MONO, which relies on minimizing the reprojection error of feature points, depends on high-quality image features, leading to inaccurate estimates. OPEN-VINS employs a Kalman filter framework to predict the time offset but does not sufficiently consider the impact of imprecise gyroscope attitudes on system accuracy, also resulting in inaccurate estimates. The average time offset estimated by each algorithm is calculated and compared with the preset time offset to evaluate the time estimation performance of each algorithm. Combining the information in Figure 4 and Table 2, it is evident that the proposed online time calibration accurately predicts the specified time offset, with estimation errors within 0.5 ms. By incorporating the estimated offset into the sensor timestamps and synchronizing the sensors in software, the effectiveness of the proposed method is demonstrated.
Figure 5 compares the trajectory estimation accuracy of the proposed method, MSCKF, C-MSCKF, VINS-MONO, and OPEN-VINS on the EuRoC dataset with a 20 ms time offset. The gray dotted line represents the ground truth. Figure 5(a), (b), (c), and (d) are the XY-plane projections of the trajectories of the five algorithms on the four sequences V1_01, V2_01, MH_04, and MH_05 of the modified EuRoC dataset, respectively. Table 2 shows the time offset estimation results of the proposed method and the trajectory RMSE of the five algorithms on the four sequences. It can be seen from Table 2 and Figure 5 that the proposed method has higher accuracy. The advantage of the proposed method over the other methods mainly comes from two points. First, the first-stage complementary Kalman filter corrects the gyroscope attitude estimate using accelerometer measurements, so the IMU outputs a more accurate attitude and provides a good initial estimate for the VIO system, which the other methods do not achieve. Second, the time offsets between the sensors, which desynchronize the estimated and observed states of the Kalman filter, are fully considered: a one-dimensional time offset is added to the state vector of the VIO and continuously updated using the camera observations, minimizing the error between the state estimated by the IMU and the state observed by the camera and thus improving the positioning accuracy of the system.
Figure 6 shows the average back-end running time of the four algorithms. The running times of the proposed method, MSCKF, OPEN-VINS, and VINS-MONO are 8.72 ms, 7.73 ms, 18.14 ms, and 29.9 ms, respectively. The optimization-based VINS-MONO has the highest time complexity, more than three times that of the proposed method, mainly because VINS-MONO jointly optimizes all optimizable state vectors; as new images arrive, the data size grows, increasing the time complexity of the algorithm. The complexity of the EKF-based OPEN-VINS is also relatively high because it combines a variety of accuracy-improving modules, raising overall complexity along with performance. Compared with the MSCKF algorithm, the proposed method adds a first-stage complementary Kalman filter and includes the time offset in the VIO system state vector, so the problem size is slightly larger and the running time slightly higher than the original MSCKF; however, the positioning accuracy of the proposed method is significantly higher than that of the MSCKF algorithm.
6.2. Experimental verification in real scenes
In the real experiment, the camera is fixed on a mobile robot platform, and the robot moves in a circle. A VICON motion capture system records the robot's 3D position as ground truth. The experimental environment is shown in Figure 7. A ZED2 stereo camera is used; this sensor contains two global shutter cameras with a field of view of $120^{\circ}\times 110^{\circ}\times 70^{\circ}$ (diagonal, horizontal, and vertical, respectively) and a six-axis IMU. The camera frequency is set to 20 Hz, the image resolution to 1280 × 720, and the IMU frequency to 200 Hz.
In Figure 8(a), the gray dotted line represents the ground truth, and the blue, green, and red lines represent VINS-MONO (without loop closure), OPEN-VINS, and the proposed method, respectively. Figure 8(b) shows the time-varying time offset estimated by the proposed method during motion. It can be seen from Figure 8(a) that the trajectory of the proposed method is more accurate and closer to the ground truth. This shows that using the accelerometer to dynamically correct the gyroscope bias improves the initial attitude and trajectory accuracy of the VIO system. At the same time, this paper treats the time offset in the sensor fusion process as an unknown variable and estimates it dynamically with the MSCKF, which further improves the trajectory accuracy of the VIO system.
In Figure 9, we compare the final trajectories of the proposed method and MSCKF in a closed environment. Figure 9(a) shows the indoor experimental scene; Figure 9(b) shows the trajectories, where the blue line belongs to the MSCKF algorithm and the red line to the proposed method; and Figure 9(c) shows the time-varying time offset estimated by the proposed method. From Figure 9(b), it can be seen that, due to the IMU rotation error, the start and end points of the trajectory estimated by MSCKF deviate considerably. The proposed method corrects the rotation obtained from the gyroscope using the accelerometer and fully accounts for the time-varying offset in the fusion process, so the IMU rotation data are more stable, the image constraints are more accurate, and the start and end points coincide. These experiments demonstrate that the double-stage EKF-based VIO algorithm proposed in this paper achieves higher positioning accuracy at a small additional computational cost.
7. Conclusion
This paper proposes a new online time calibration framework based on a double-stage EKF for the VIO system. It solves the problem of inaccurate time offset estimation caused by trigger delays and IMU noise during camera-IMU fusion. The proposed algorithm estimates the time offset accurately, and its trajectory RMSE is better than that of the optimization-based VINS-MONO algorithm and the filter-based MSCKF algorithm. The main work is as follows:
1. The proposed algorithm adds an unknown time offset to the VIO system state vector, estimates the time offset between the camera and the IMU online through camera observation constraints, and adds a complementary filter to improve accuracy. Compared with excellent online calibration algorithms of the same type, the proposed method achieves higher accuracy and robustness.
2. Compared with the current best VIO algorithms, the proposed algorithm has obvious advantages in time complexity: about 30% of the optimization-based VINS-MONO algorithm and about 48% of the EKF-based OPEN-VINS algorithm, making it easier to deploy on low-cost computing platforms.
During the operation of a VIO system, many factors affect the estimation results. Lighting conditions, equipment movement speed, and long-corridor scenes affect the extraction and matching of image feature points, leading to inaccurate observations for the Kalman filter and degrading system accuracy. The accuracy of each sensor and severe time offsets between sensors also affect positioning accuracy. In the future, we will continue to research these problems.
If you need the dataset and related documentation, please contact the author.
Author contributions
S.L. and J.N. designed the study, conducted data gathering, and performed statistical analyses. J.N., C.G., and Y.Y. wrote the article. L.M. provided critical revisions to the manuscript.
Financial support
This work was supported by the Chongqing Natural Science Foundation Joint Fund for Innovation and Development (No. CSTB2024NSCQ-LZX0035), the Science and Technology Research Project of Chongqing Municipal Education Commission (No. KJZD-M202300605), the Young Talent Project of the Nanning Municipal "Yongjiang Program" (RC20230107), the Chongqing Municipal General Project for Scientific and Technological Innovation and Applied Development Special Project (CSTB2022TIAD-GPX0028), and the Natural Science Foundation of Chongqing Municipality (CSTB2022NSCQ-MSX0230).
Competing interests
The authors declare that they have no competing interests.
Ethical approval
None.