1. Introduction
In recent years, mobile robots have been continuously expanded to industries, services, transportation, and other fields [Reference Nascimento, Dórea and Gonçalves1–Reference Li, Ren, Zhao, Deng and Feng4]. The application of mobile robots in complex, cluttered environments requires robots to avoid obstacles safely while following their desired path. Thus, the path-following and obstacle avoidance of wheeled mobile robots have been an active research direction and have been widely studied (see refs. [Reference Chen, Liu, He, Qiao and Ji5–Reference Alonso-Mora, Beardsley and Siegwart7]).
Among the recent works, the path-following and obstacle avoidance of mobile robots in complex environments can be achieved via two general methods [Reference Hoy, Matveev and Al8]: reactive methods [9] and motion planning methods [Reference Funke, Brown, Erlien and Gerdes10].
Motion planning methods, such as model predictive control (MPC) [Reference Funke, Brown, Erlien and Gerdes10, Reference Hagen, Kufoalor, Brekke and T.A.Johansen11], obtain a smooth collision-free path according to the receding horizon principle, which solves the optimal path-following control problem with collision avoidance. However, the classic MPC with numerical solvers will have a high computational cost, especially when using nonlinear dynamical models. Despite the fact that the computational efficiency can be enhanced by the improved MPC [Reference Brito, Floor, Ferranti and Alonso-Mora12] with efficient solvers, this approach is still restrictive in environments with a large number of obstacles of different sizes. The reactive algorithms have advantages in the above scenarios. As a reactive method, the potential field-based (PF) approach [9], which treats each obstacle as a repulsive point to keep the robot away from it, is widely implemented in mobile robots. The PF methods for path-following and obstacle avoidance are addressed in refs. [Reference Hoy, Matveev and Al8, Reference Leica, Herrera, Rosales, Roberti, Toibero and Carelli13, Reference Sudhakara, Ganapathy, Priyadharshini and Sundaran14]. This approach, however, does not take into account the high-level objectives, such as energy consumption in obstacle avoidance. Moreover, its learning and optimization capabilities in complex environments need to be strengthened.
As a branch of machine learning, reinforcement learning (RL) and approximate dynamic programming (ADP) are advanced tools to solve nonlinear optimal control problems and have aroused the interest of researchers nowadays [Reference Hu, Guan, Lewis and Chen15, Reference Dong, Zhao and Yang16]. Many works have been developed for path-following control with obstacle avoidance of wheeled mobile robots [Reference Zhang, Chen, Zhu and Z.Yan17–Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19]. The integrated path-following control and obstacle avoidance of mobile robots can be transformed into a learning-based optimal control problem with constraints.
Related research works in the safe RL setting focus on optimal control with state constraints. These approaches transform the constrained optimal control problem into an unconstrained one, such as the reward/cost shaping-based RL approach, where the cost function is incorporated with safety guarantee terms (such as the Lyapunov functions [Reference Dong, Tang and Yuan20] or barrier functions [Reference Deptula, Chen, Licitra, Rosenfeld and Dixon18, Reference Ohnishi, Wang, Notomista and Egerstedt21, Reference Wen and Topcu22]) in the training process. By embedding costs of safety constraints into the reward function, a barrier-based online learning ADP algorithm was addressed in ref. [Reference Cohen and Belta23] for optimal control problems with the safe guarantee. In ref. [Reference Yang, He, Vamvoudakis, Modares and Wunsch24], a barrier function-based system transformation method was presented, transforming the original problem into an unconstrained one with virtual state variables. With the idea of constraint system transformation, a constrained cross-entropy-based approach was developed in ref. [Reference Wen and Topcu22] for a safe RL problem with constraints defined as the expected cost. Some other exciting works for safety-critical systems may refer to [Reference Ames, Xu, Grizzle and P.Tabuada25], with the unification of control barrier functions and control Lyapunov functions for automotive applications. In ref. [Reference Deptula, Chen, Licitra, Rosenfeld and Dixon18], a model-based ADP method was proposed for motion planning with constraints of moving avoidance regions.
Noteworthy, these techniques may not deal with conflicting control objectives as the optimal control objective and constraint satisfaction are contradictory. For example, in path-following control of wheeled mobile robots with obstacle avoidance, ensuring both optimal tracking control and obstacle avoidance is a difficult task since the convergence of the actor–critic (AC) might not be guaranteed by the reward/cost shaping-based RL approach due to conflicting learning objectives. In addition, the energy loss and control costs are valuable to consider in the obstacle avoidance process.
In this paper, to decouple the impact of path-following control and obstacle avoidance on mobile robots, we propose a potential field-based dual heuristic programming (P-DHP) algorithm with two progressive AC structures to achieve a trade-off between control cost and safe performance. To decouple the conflicting learning goals, the training process was decoupled into two parts in the proposed P-DHP. Firstly, a path-following control policy was learned by value iteration with an initial AC based on the standard neural network. Then, with the neural network weights fixed, the PF term is used to construct another AC structure to learn a safe control policy with collision avoidance capability. Moreover, another cost is designed in the second part of the learning process. This cost is minimized to reduce the energy loss and state errors of the robot with obstacle avoidance.
Different from ref. [Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19], in our approach, the control policy’s learning mode for path-following control and obstacle avoidance is asynchronous to guarantee convergence in the case of conflicting learning objectives. Besides, this paper also contributes to the optimization of reducing the energy loss and state errors in the obstacle avoidance process.
The remainder of the paper is constructed as follows. Section II formulates the tracking error model and problem formulation. In Section III, we present the proposed P-DHP approach for mobile robots, while the simulated and experimental results are demonstrated in Section IV.
Notation
For a vector $h\in \mathbb{R}^m$ , $\lVert h \rVert _2$ is defined as the Eulidean norm, and $\lVert h \rVert _R^2$ is denoted as $h^{\top }Rh$ . For a function $\phi (\nu (k))$ with an input variable $\nu (k)$ , we denote $\phi (k)$ to represent $\phi (\nu (k))$ for short.
2. Tracking error model and problem formulation
2.1. Tracking error model
The wheeled mobile robot is shown in Fig. 1, and it is assumed that its particle and center point are coincident. The model is described as
where $q=(x,y)$ is the position of the robot in the global coordinate frame and $\theta$ is the heading angle related to the robot’s kinematic model. $v$ and $\omega$ denote the linear velocity and angular velocity of the robot, respectively. $a_{v}$ and $a_{\omega }$ are the accelerations of $v$ and $\omega$ , respectively, related to the dynamic model of the robot. The control input is $u=[a_{v},a_{\omega }]^{\top }$ , and the state of the robot is defined as $p=[x,y,\theta,v,\omega ]^{\top }$ .
Suppose a virtual robot guides the robot to move forward, and its state is described as $p_r=[{x}_{r},{y}_{r},{\theta }_{r},{v}_{r},{\omega }_{r}]^{\top }$ , satisfying:
where $u_r=[a_{v,r}, a_{w,r}]^{\top }$ is the control input of the virtual leader robot.
The tracking error model of the robot can be written as
where $T$ is the coordinate transformation matrix from the global coordinate frame $XOY$ to the robot’s local coordinate frame $xoy$ , which is
Then, we can get the tracking error model in derivative form by differentiating both sides of (3), which is
According to Eqs. (1)–(2) and (5), the discrete-time tracking error model according to the forward-Euler discretization can be written as
where $e=[e_{x},e_{y},e_{\theta },e_{v},e_{\omega }]^{\top }$ , $v(k)=v_r(k)-e_{v}(k)$ , $\omega (k)=\omega _r(k)-e_{\omega }(k)$ , $k$ is the discrete-time index, and $\Delta h$ is the adopted sampling interval.
The nonlinear tracking error model can be abbreviated as
where $f$ is the error transition function, and $g$ is the input mapping function.
2.2. Problem formulation
We define the performance index of path-following control as
where
and $\gamma \in (0,1]$ is the discounting factor, $Q^{\top }=Q \in \mathbb{R}^{5\times 5}$ , $R^{\top }=R\in \mathbb{R}^{2\times 2}$ , $Q\succ 0$ , $R\succ 0$ .
The cost $J(e(k))$ is minimized subject to the following constraints $q(k)\in \mathcal{X}, \forall k\in [1,\infty ]$ , where
$q_{o},o\in \mathbb{N}_{1}^{n}$ , is the position of the obstacle $o$ . $r$ and $r_o$ are the radii of the smallest approximate circles surrounding the robot and obstacle $o$ , $o\in \mathbb{N}_{1}^{n}$ , respectively.
Below, we give the definition of the potential field function and the assumption on the detection rules for the robot.
Definition 1 (Potential field function):
For the robot’s position $q$ and the obstacle’s position $q_{o}\in \mathbb{R}^2$ , $o\in \mathbb{N}_{1}^{n}$ , the potential field function $\mathcal{B}_{o}$ is defined as
where $\kappa _{o}$ is the relaxation factor of the potential field. $\lVert q-q_{o} \rVert _2$ , $o\in \mathbb{N}_{1}^{n}$ , represents the distance between the robot and the obstacle $o$ . The depiction of the above parameters is shown in Fig. 2.
Assumption 1 (Detection rules): The robot can detect the position and angle of obstacles in a sector area (with the robot as the center and a radius of $r_{d}$ ).
3. P-DHP for path-following control with collision avoidance
3.1. Design of P-DHP
To solve the path-following control problem with safety constraints, we propose a control policy that has two control parts, which is described as
where $\bar{u}^{i_1}(e(k))$ is a control policy that corresponds to a fixed weight matrix without the obstacles’ constraints. $\eta _o \in \mathbb{R}^{2\times 2}$ , $o\in \mathbb{N}_{1}^{n}$ , are variables to be optimized for obstacles of different sizes and positions (see Section 3.2). $\nabla \mathcal{B}_{o}=\partial \mathcal{B}_{o}/\partial q$ .
Remark 1: As the robot approaches the obstacle, the input item $\nabla \mathcal{B}_{o}$ could generate a repulsive force to drive $q(k)$ to satisfy the constraint (10). Meanwhile, the parameter $\eta _o$ is continuously optimized to ensure safety in path tracking. In the case that the input item $\nabla \mathcal{B}_{o}$ is identically equal to 0, the policy $\bar{u}(k)$ is to approach the path-following control policy under no obstacle.
To decouple the conflicting learning objectives of path-following control and obstacle avoidance, the learning process is divided into two parts, where $\bar u$ and $\eta _o$ in (12) are learned asynchronously by value iteration to guarantee convergence.
Then, the performance index of path-following control is redefined as
where $r_1(\bar e(k),\bar u(k))=\lVert \bar e(k) \rVert _{Q_1}^2+\lVert \bar u(k)-u_r(k)\rVert _{R_1}^2$ , $Q_1^{\top }=Q_1 \in \mathbb{R}^{5\times 5}$ , $R_1^{\top }=R_1\in \mathbb{R}^{2\times 2}$ , $Q_1\succ 0$ , $R_1\succ 0$ . $\bar e(k+1)$ is the state of the robot that satisfies $\bar e(k+1)=f(\bar e(k))+g(\bar u(k))$ .
Define $J_1^*$ as the optimal value function of $J_1$ under the optimal control policy $\bar u^*$ . According to the Hamilton–Jacobi–Bellman (HJB) equation, one has
and
Define $i$ as the number of iterations. Firstly, the control policy $\bar u(k)$ in the unconstrained scenario is learned within iterations $i=1,\ldots,i_{1}$ , and the update rule in each iteration of $J_1(k)$ and $\bar u(k)$ can be obtained as
where $i_{1}$ is the iterative value as $J_1^{i}(k)$ converges (see Algorithm 1 for details).
With the fixed control policy $\bar u^{i_1}(k)$ , the variable $\eta _o$ in the second item of (12) is updated within iterations $i=i_{1},\ldots,i_{2}$ to minimize the reconstructed cost function $ J_2(k)$ for collision avoidance constraints, which is
where $r_2(\mathcal{B}_{o}(k),e(k),u(k))=\lVert e(k)-{\bar e}(k)\rVert _{Q_2}^2+\lVert u(k)-\bar{u}^{i_1}(k)\rVert _{R_2}^2+\mu \mathcal{B}_{o}(k)$ , ${Q}_2^{\top }={Q_2}\in \mathbb{R}^{5\times 5}$ , ${R}_2^{\top }={R_2}\in \mathbb{R}^{2\times 2}$ , ${Q_2}\succ 0$ , ${R_2}\succ 0$ , and $\mu \gt 0$ is a tuning parameter.
Remark 2: The control objective is to minimize the cost $J_2$ , which is equivalent to reducing the tracking error and input energy consumption while satisfying the obstacle avoidance constraint. The trade-off among obstacle avoidance, reduced tracking error and input energy consumption can be adjusted by the value of $\mu$ and the weight of matrix $Q_2$ and $R_2$ , respectively.
Let $\overrightarrow{\eta }$ be the sequence of ${\eta }_{1},\ldots,{\eta }_{n}$ . Then $J_{2}$ and $\overrightarrow{\eta }$ can be updated in each iteration $i=i_{1},\ldots,i_{2}$ by
The main steps of the proposed P-DHP can be seen in Algorithm 1.
In the following, we prove the convergence of our proposed P-DHP.
Theorem 1 (Convergence of (16) and (17)): If the initial condition satisfies $J_1^{0}(k) \geq{r_1}(\bar e(k), \bar u^{0}(k))+\gamma{J}_1^{0}(k+1)$ , then it follows that
1) ${J}_1^{i+1}(k)\leq{J}_1^{i}(k)$ .
2) ${\mathrm{lim}}_{i\rightarrow \infty }{J}_1^{i}(k)={J}_1^{*}(k)$ .
Proof. For the proof details please refer to refs. [Reference Luo, Liu, Huang and Liu26] and [Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19].
Theorem 2 (Convergence of (20) and (21)) [Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19]: If $ J_2^{i_1}(k) \geq$ $\sum \nolimits _{o=1}^{n}{r_2}(\mathcal{B}_{o}(k),e(k), u^{i_1}(k))+\gamma J_2^{i_1}(k)$ and ${u}^{i_1}(k)=\bar u^{i_1}(k)+\sum \nolimits _{o=1}^{n}\eta _{o}^{i_1}(k)\nabla \mathcal{B}_{o}(k)$ is a safe control policy to restrain $q^{i_1}(k)\in \mathcal{X}$ , then
1) $J_2^{i+1}(k)\leq J_2^{i}(k).$
2) ${u}^{i}(k)=\bar u^{i_1}(k)+\sum \nolimits _{o=1}^{n}\eta _{o}^{i}(k)\nabla \mathcal{B}_{o}(k)$ , $i\geq i_1$ , is a safe control policy.
Proof. 1): The proof is similar to that in Theorem 1, which is to replace $J_1^{i}(k)$ with $J_2^{i}(k)$ .
2): As ${u}^{i_1}(k)$ is a safe control policy, one has $J_2^{i_1}(k)\leq J_2^{f}$ , where $ J_2^{f}$ is the threshold of the safe value function. According to Theorem 1, for $i\geq i_1$ , it satisfies $J_2^{i}(k) \leq J_2^{i_1}(k)\leq J_2^{f}$ , i.e., $q^{i}(k)\in \mathcal{X}$ . Therefore, ${u}^{i}(k)$ , $i\geq i_1$ , is a safe control policy.
3.2. AC learning for P-DHP
In this section, the AC learning structures (see Fig. 3) are introduced to implement Algorithm 1. The AC learning structures consist of two decoupling parts. The actor-1–critic-1 network corresponds to the iteration of $J_1(k)$ and $\bar u(k)$ , and the actor-2–critic-2 network corresponds to the iteration of $J_2(k)$ and $\overrightarrow{\eta}$ , respectively. The AC network weights are updated in an incremental learning mode.
The critic-1 network is constructed as
where the costate $\nabla \hat{J_1}(k)=\partial \hat{J_1}(k)/\partial \bar e(k)$ , $W_{c}\in \mathbb{R}^{5\times 5}$ is the weighting matrix, and $h_{c}$ is the activation function.
Define $\nabla{J_1^{*}}=\partial{J_1^{*}}/\partial \bar e$ . To minimize the bias between $\nabla \hat{J_1}$ and $\nabla J_1^*$ , we designed a target to be followed by $\nabla \hat{J_1}$ , i.e.,
where $\nabla f=\partial{f}/\partial{\bar e}$ .
Let the quadratic approximation error of the critic-1 network be defined as $E_{c}(k)=||\nabla{\hat J_1}(k)-\nabla J_1^d(k)||_2^2$ . The goal of the critic-1 network is to minimize the distance between $\nabla{\hat{J_1}}(k)$ and $\nabla J_1^d(k)$ by updating weights. At each time instant $k$ , the update rule of $W_{c}$ is
where $\alpha _{c}$ is the learning rate of $W_{c}$ .
Then we design the actor-1 network for the robot, which is
where the $W_{a}\in \mathbb{R}^{5\times 2}$ is the weighting matrix, and $h_{a}$ is the activation function.
In view of (17), we define the $ \bar u^{d}(k)$ as the desired value of $\hat{\bar u}(k)$ , which is expressed as
where $\nabla g=\partial{g}/\partial{\bar u}$ , and $\nabla \hat{J_1}$ is the output of the critic-1 network.
The quadratic approximation error of actor-1 network can be defined as $E_{a}(k)=||\hat{\bar u}(k)-\bar u^d(k)||_2^2$ , and the update rule of $W_{a}$ is
where $\alpha _{a}$ is the learning rate of $W_{a}$ . Note that the learning process (27) is performed in $i_1$ iterations. We define the converged matrix $W_{a}$ as $\textbf{W}_{a}$ and the learned control policy $\hat{\bar u}$ as ${\hat{\bar{\textbf{u}}}}$ , respectively.
Suppose that the potential fields of obstacles do not overlap. In the following design, we decompose the actor-2–critic-2 network into $n$ subnetworks to learn the variable $\eta _o$ , $o\in \mathbb{N}_1^{n}$ , in (12) respectively, which corresponds to obstacles $o$ , $o\in \mathbb{N}_1^{n}$ , of different sizes and positions.
The sub-critic-2 network for each obstacle $o$ is given as
where $\rho _{o}\in \mathbb{R}^{2\times 2}$ , and $\nabla \hat J_{2,o}=\partial \hat J_{2,o}/\partial e$ . $\textbf{W}_c$ is the learned matrix of $W_{c}$ , and $\mu _{c}\gt 0$ is a tuning parameter.
The desired value of $\nabla \hat J_{2,o}$ i.e., $\nabla{J}_{2,o}^{d}$ is defined as
where $\nabla f=\partial{f}/\partial{e}$ .
Denote ${E}_{c,o}(k)=||\nabla \hat {J}_{2,o}(k)-\nabla {J}_{2,o}^d(k)||_2^2$ as the quadratic approximation error of the sub-critic-2 network. The update rule of $\rho _{o}$ can be obtained as
where $\beta _{c}$ is the learning rate of $\rho _{o}$ , $o\in \mathbb{N}_{1}^{n}$ .
In view of (12), the sub-actor-2 network for each obstacle $o$ is designed with the following form:
where $\hat{u}_o(k)$ is the control policy for the robot to avoid the obstacle $o$ , which is equal to ${\hat u}(k)$ in the case of only one obstacle $o$ to be avoided.
Define $\varepsilon _o =2R_2({\hat u_o}-\hat{\bar{\textbf{u}}})$ . Then the desired value of $\varepsilon _o(k)$ is redefined as
where $\nabla g=\partial{g}/\partial{u}$ .
In the same way, the quadratic approximation error of the sub-actor-2 network and the update rule of $\eta _o$ are given as
where $\beta _{a}$ is the learning rate of $\eta _{o}$ , $o\in \mathbb{N}_{1}^{n}$ .
Given that all variables $\eta _o$ , $o\in \mathbb{N}_{1}^{n}$ , have been trained until converging, the overall control policy to be deployed to the robot is
4. Simulation and experimental results
We first test our algorithm by numerical simulation in a MATLAB environment and deploy the control policy on a real-world differential-drive wheeled robot platform. Suppose the robot is equipped with sensors that can measure the obstacles’ approximate size, relative distance, and angle in the simulation and experiment.
4.1. Simulation experiments
4.1.1. Parameter settings and training process
In the simulation, a relaxed potential function is defined and used as follows to guarantee the smoothness in the control process, i.e.,
where $d_o=\lVert q-q_{o} \Vert_2$ and $0\lt \delta _{o}\lt d_{o}$ is an adjustable relaxing factor. $\sigma _{o}(q)$ is a derivative and monotonic function, where $\sigma_{o}(q)=\frac{1}{2}\sigma_{1}(d_{o}-{\delta_{o}})^2+\sigma_{2}(d_{o}-{\delta_{o}})+\sigma_{3}$ , $\sigma _{1}=\nabla ^2_{d_{o}} \mathcal{B}_{o}(q)|_{d_{o}=\delta _{o}}$ , $\sigma _{2}=\nabla _{d_{o}} \mathcal{B}_{o}(q)|_{d_{o}=\delta _{o}}$ , $\sigma _{3}=\mathcal{B}_{o}(q)|_{d_{o}=\delta _{o}}$ .
Also, the perceived radius values of obstacles are properly expanded to construct $\nabla \mathcal{\bar B}_{o}(q)$ , which replaces $\nabla \mathcal{B}_{o}(q)$ in (12) to ensure control safety. The reconstructed $\mathcal{\bar B}_{o}(q)$ is defined as
where $r_{d}$ indicates the maximum distance that the robot can detect the obstacles, and the definitions of relevant parameters can be seen in Fig. 2. The parameters satisfy $\kappa _{o}\lt \bar \kappa _{o}$ and $r_{o}+r\lt r_{d}$ . Similar to $\mathcal{B}_{o}^{r}(q)$ in (36), $ \mathcal{\bar B}_{o}(q)$ is deflated into $\mathcal{\bar B}_{o}^{r}(q)$ by $\bar \delta _{o}$ , $0\lt \bar \delta _{o}\lt r_{d}$ .
In the training process, the discounting factor was $\gamma =0.95$ . The penalty matrices were set as $Q_1=I_5$ , $R_1=0.01I_2$ , $Q_2=0.001I_5$ , $R_2=0.001I_2$ , $\mu =1$ , $\mu _c=0.001$ , $\alpha _{a}=0.2$ , $\alpha _{c}=0.4$ , $\beta _{a}=0.4$ , $\beta _{c}=0.4$ , $\delta _o=1.5$ , $\bar \delta _o=5$ , $\kappa _{o}=5.5$ , $\bar \kappa _{o}=600$ , $r_d=6$ , and $\Delta h=0.05\,\text{s}$ . In the learning process, the weights were initialized with uniformly distributed random values. The simulations were implemented in the Matlab 2019b environment on a laptop with an AMD Ryzen 7 4800H CPU.
Firstly, we tested our algorithm in a straight-line path scenario with one obstacle in the center of the road. The recorded performance costs were defined as $J^i_{r,1}=\sum \nolimits _{k=1}^{K_1}J_1^i(k)$ and $ J^i_{r,2}=\sum \nolimits _{k=1}^{K_2}J_2^i(k)$ with $K_1=1000$ and $K_2=1000$ in each training iteration. The variation results of $J^i_{r,1}$ and $J^i_{r,2}$ in Fig. 4 show that both the performance costs decrease with iteration increases and converge in 30 iterations. Moreover, the robot’s trajectories in the learning process and the actor weights of the robot in a single-obstacle scenario are displayed in Figs. 5 and 6. The results illustrate that the robot gradually moves away from the obstacle by continuous iterative learning until the robot’s trajectory is circumscribed by the edge of the obstacle’s potential field.
4.1.2. Comparison results with state-of-the-art approaches
In this subsection, we compared our approach with the reward/cost shaping-based RL approach (CS-based RL) [Reference Perkins and Barto27], the MPC using a disjunctive chance constraint (MPC-dc) [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28], and the MPC using a linearized chance constraint (MPC-lc) [Reference Zhu and Alonso-Mora29]. Please refer to Fig. 9 for the adopted reference paths in the simulation tests.
In the CS-based RL approach, the cost function was constructed as $r=\lVert e(k) \rVert _{Q_1}^2+\lVert u(k)-u_r(k)\rVert _{R_1}^2+\mu\nabla\mathcal{B}_{o}(k)$ , and the parameters of $\nabla\mathcal{B}_{o}$ were fine-tuned. For a fair comparison, the same DHP-based AC learning structure(actor-1-critic-1) was used in CS-based RL. The comparative results in 100 tests with different learning rates are illustrated in Table I and Fig. 7, which show that our approach outperforms the CS-based RL approach in the success rate of the learning process. Although the success rate between ours and CS-based RL is similar as the learning rate is set small, as shown in Fig. 8, the CS-based RL approach has a probability of failure to avoid obstacles due to the lack of corrective terms $\nabla \mathcal{B}_{o}$ in the control policy. As the learning rate increases, the learning failure rate of the CS-based RL approach also increases and has a longer obstacle avoidance distance with more energy loss, whereas our method possesses stable control performance for different learning rates. This illustrates, in part, that only shaping the cost function may not be sufficient to learn a convergent control policy efficiently via a standard AC learning algorithm. To solve this problem, our P-DHP approach are proposed to resolve the conflict between the path-following control and obstacle avoidance goals in the learning process.
The MPC-dc [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28] and MPC-lc [Reference Zhu and Alonso-Mora29] algorithms were implemented by the CasADi toolbox with an Ipopt solver [Reference Andersson, Gillis, Horn, Rawlings and Diehl30]. The stage costs in MPC-dc and MPC-lc were designed the same as (9) in our approach. The performance indicator $J_e$ was defined as $J_e=\text{1/}\bar K\sum_{k\in K}{\lVert e(k) \rVert_2}$ , where $K$ is the time interval in obstacle avoidance, and $\bar K$ is the length of $K$ . The comparative study was carried out under the scenario with three different sizes of obstacles, where the side lengths of the bounding boxes were set as $2.3/\sqrt{2}\,\text{m}$ , $2/\sqrt{2}\,\text{m}$ , and $2.6/\sqrt{2} \,\text{m}$ . The prediction horizon was set as $N=30$ . For the details of the simulation parameters, please refer to ref. [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28].
The comparative simulation results are illustrated in Fig. 9 and Tables II and III. The results show that our approach outperforms the fixed-parameters MPC-lc and MPC-dc with smaller cost values and shorter path length in obstacle avoidance, demonstrating that our approach has better control performance at different reference velocities. This is due to the learning optimization mechanism and policy design of our algorithm. In addition, our approach has advantages in computational efficiency (see Table III).
4.2. Real-world experiments
We also tested our algorithm on the experimental platform of a real-world differential-drive wheeled robot. As shown in Fig.10,the robot was controlled by a laptop equipped with the Ubuntu operating system, and the state of the robot and the positions of obstacles were obtained by mounted satellite inertial. In the experiment, we deployed the offline training weights from the simulation to generate control policy $u=[a_{v}, a_{\omega }]^{\top }$ to drive the underlying system of the robot, and the sampling interval was $\Delta h=0.1\,\text{s}$ .
The real-world experimental results are shown in Figs. 11 and 12, which illustrate that the robots can follow the desired path from an initial state, meanwhile avoiding all encountered obstacles on the path and recovering path following after collision avoidance. Moreover, the above-reported results show that our approach can resolve the weight divergence problem caused by the conflict between path following and obstacle avoidance.
5. Conclusion
This paper proposed a P-DHP algorithm for path-following control of wheeled robots with obstacle avoidance. As for the main characteristics, the proposed P-DHP utilizes two coupled AC networks to resolve the conflicting goals, developing a near-optimal and safe control policy for path-following and obstacle avoidance. The convergence and safety of our method were proven. Our approach has been evaluated in simulation and tested on a real-world differential-drive mobile robot. The simulated and experimental results illustrate that our approach can realize path-following control with collision avoidance under conflicting learning objectives, showing the advantages in computational efficiency and smaller cost values compared with state-of-the-art approaches. Future work will focus on the application of model-free RL on wheeled mobile robots.
Author contributions
All authors have made great contributions to this paper.
Financial support
The work was supported by the National Natural Science Foundation of China under Grant U21A20518, 62003361, and 61825305, China Postdoctoral Science Foundation under Grant 47680, and the National Key R&D Program of China 2018YFB1305105.
Conflicts of interest
The authors declare no conflicts of interest exist.
Ethical approval
Not applicable.