Hostname: page-component-586b7cd67f-dlnhk Total loading time: 0 Render date: 2024-11-27T20:01:44.772Z Has data issue: false hasContentIssue false

Potential field-based dual heuristic programming for path-following and obstacle avoidance of wheeled mobile robots

Published online by Cambridge University Press:  27 April 2023

Yaoqian Peng
Affiliation:
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
Xinglong Zhang*
Affiliation:
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
Haibin Xie*
Affiliation:
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
Xin Xu
Affiliation:
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
*
*Corresponding authors. E-mail: [email protected]; [email protected]
*Corresponding authors. E-mail: [email protected]; [email protected]
Rights & Permissions [Opens in a new window]

Abstract

Path-following control of wheeled mobile robots has been a crucial research topic in robotic control theory and applications. In path-following control with obstacles, the path-following control and collision avoidance goals might be conflicting, making it challenging to obtain near-optimal solutions for path-following control and obstacle avoidance with low tracking error and input energy consumption. To address this problem, we propose a potential field-based dual heuristic programming (P-DHP) algorithm with an actor–critic (AC) structure for path-following control of mobile robots with obstacle avoidance. In the proposed P-DHP, the path-following control and collision avoidance problems are decoupled into two ones to resolve the control conflict. Firstly, a neural network-based AC is constructed to approximate the near-optimal path-following control policy in a no-obstacle environment. Then, with the trained path-following control policy fixed, a potential field-based control policy structure is constructed by another AC network to generate opposite control forces as the robot moves toward the obstacle, which can guarantee the robot’s control safety and reduce the tracking error and input energy consumption in obstacle avoidance. The simulated and experimental results show that P-DHP can realize near-optimal path-following control with the satisfaction of safety constraints and outperforms state-of-the-art approaches in control performance.

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press

1. Introduction

In recent years, mobile robots have been continuously expanded to industries, services, transportation, and other fields [Reference Nascimento, Dórea and Gonçalves1Reference Li, Ren, Zhao, Deng and Feng4]. The application of mobile robots in complex, cluttered environments requires robots to avoid obstacles safely while following their desired path. Thus, the path-following and obstacle avoidance of wheeled mobile robots have been an active research direction and have been widely studied (see refs. [Reference Chen, Liu, He, Qiao and Ji5Reference Alonso-Mora, Beardsley and Siegwart7]).

Among the recent works, the path-following and obstacle avoidance of mobile robots in complex environments can be achieved via two general methods [Reference Hoy, Matveev and Al8]: reactive methods [9] and motion planning methods [Reference Funke, Brown, Erlien and Gerdes10].

Motion planning methods, such as model predictive control (MPC) [Reference Funke, Brown, Erlien and Gerdes10, Reference Hagen, Kufoalor, Brekke and T.A.Johansen11], obtain a smooth collision-free path according to the receding horizon principle, which solves the optimal path-following control problem with collision avoidance. However, the classic MPC with numerical solvers will have a high computational cost, especially when using nonlinear dynamical models. Despite the fact that the computational efficiency can be enhanced by the improved MPC [Reference Brito, Floor, Ferranti and Alonso-Mora12] with efficient solvers, this approach is still restrictive in environments with a large number of obstacles of different sizes. The reactive algorithms have advantages in the above scenarios. As a reactive method, the potential field-based (PF) approach [9], which treats each obstacle as a repulsive point to keep the robot away from it, is widely implemented in mobile robots. The PF methods for path-following and obstacle avoidance are addressed in refs. [Reference Hoy, Matveev and Al8, Reference Leica, Herrera, Rosales, Roberti, Toibero and Carelli13, Reference Sudhakara, Ganapathy, Priyadharshini and Sundaran14]. This approach, however, does not take into account the high-level objectives, such as energy consumption in obstacle avoidance. Moreover, its learning and optimization capabilities in complex environments need to be strengthened.

As a branch of machine learning, reinforcement learning (RL) and approximate dynamic programming (ADP) are advanced tools to solve nonlinear optimal control problems and have aroused the interest of researchers nowadays [Reference Hu, Guan, Lewis and Chen15, Reference Dong, Zhao and Yang16]. Many works have been developed for path-following control with obstacle avoidance of wheeled mobile robots [Reference Zhang, Chen, Zhu and Z.Yan17Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19]. The integrated path-following control and obstacle avoidance of mobile robots can be transformed into a learning-based optimal control problem with constraints.

Related research works in the safe RL setting focus on optimal control with state constraints. These approaches transform the constrained optimal control problem into an unconstrained one, such as the reward/cost shaping-based RL approach, where the cost function is incorporated with safety guarantee terms (such as the Lyapunov functions [Reference Dong, Tang and Yuan20] or barrier functions [Reference Deptula, Chen, Licitra, Rosenfeld and Dixon18, Reference Ohnishi, Wang, Notomista and Egerstedt21, Reference Wen and Topcu22]) in the training process. By embedding costs of safety constraints into the reward function, a barrier-based online learning ADP algorithm was addressed in ref. [Reference Cohen and Belta23] for optimal control problems with the safe guarantee. In ref. [Reference Yang, He, Vamvoudakis, Modares and Wunsch24], a barrier function-based system transformation method was presented, transforming the original problem into an unconstrained one with virtual state variables. With the idea of constraint system transformation, a constrained cross-entropy-based approach was developed in ref. [Reference Wen and Topcu22] for a safe RL problem with constraints defined as the expected cost. Some other exciting works for safety-critical systems may refer to [Reference Ames, Xu, Grizzle and P.Tabuada25], with the unification of control barrier functions and control Lyapunov functions for automotive applications. In ref. [Reference Deptula, Chen, Licitra, Rosenfeld and Dixon18], a model-based ADP method was proposed for motion planning with constraints of moving avoidance regions.

Noteworthy, these techniques may not deal with conflicting control objectives as the optimal control objective and constraint satisfaction are contradictory. For example, in path-following control of wheeled mobile robots with obstacle avoidance, ensuring both optimal tracking control and obstacle avoidance is a difficult task since the convergence of the actor–critic (AC) might not be guaranteed by the reward/cost shaping-based RL approach due to conflicting learning objectives. In addition, the energy loss and control costs are valuable to consider in the obstacle avoidance process.

In this paper, to decouple the impact of path-following control and obstacle avoidance on mobile robots, we propose a potential field-based dual heuristic programming (P-DHP) algorithm with two progressive AC structures to achieve a trade-off between control cost and safe performance. To decouple the conflicting learning goals, the training process was decoupled into two parts in the proposed P-DHP. Firstly, a path-following control policy was learned by value iteration with an initial AC based on the standard neural network. Then, with the neural network weights fixed, the PF term is used to construct another AC structure to learn a safe control policy with collision avoidance capability. Moreover, another cost is designed in the second part of the learning process. This cost is minimized to reduce the energy loss and state errors of the robot with obstacle avoidance.

Different from ref. [Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19], in our approach, the control policy’s learning mode for path-following control and obstacle avoidance is asynchronous to guarantee convergence in the case of conflicting learning objectives. Besides, this paper also contributes to the optimization of reducing the energy loss and state errors in the obstacle avoidance process.

The remainder of the paper is constructed as follows. Section II formulates the tracking error model and problem formulation. In Section III, we present the proposed P-DHP approach for mobile robots, while the simulated and experimental results are demonstrated in Section IV.

Notation

For a vector $h\in \mathbb{R}^m$ , $\lVert h \rVert _2$ is defined as the Eulidean norm, and $\lVert h \rVert _R^2$ is denoted as $h^{\top }Rh$ . For a function $\phi (\nu (k))$ with an input variable $\nu (k)$ , we denote $\phi (k)$ to represent $\phi (\nu (k))$ for short.

2. Tracking error model and problem formulation

2.1. Tracking error model

The wheeled mobile robot is shown in Fig. 1, and it is assumed that its particle and center point are coincident. The model is described as

(1) \begin{equation} [\dot{x},\dot{y},\dot{\theta }, \dot{v},\dot{\omega }]^{\top }=[v\cos \theta, v\sin \theta,\omega, a_{v}, a_{\omega }]^{\top }, \end{equation}

where $q=(x,y)$ is the position of the robot in the global coordinate frame and $\theta$ is the heading angle related to the robot’s kinematic model. $v$ and $\omega$ denote the linear velocity and angular velocity of the robot, respectively. $a_{v}$ and $a_{\omega }$ are the accelerations of $v$ and $\omega$ , respectively, related to the dynamic model of the robot. The control input is $u=[a_{v},a_{\omega }]^{\top }$ , and the state of the robot is defined as $p=[x,y,\theta,v,\omega ]^{\top }$ .

Fig. 1. Tracking error model of the wheeled robot.

Suppose a virtual robot guides the robot to move forward, and its state is described as $p_r=[{x}_{r},{y}_{r},{\theta }_{r},{v}_{r},{\omega }_{r}]^{\top }$ , satisfying:

(2) \begin{equation} [\dot x_r,\dot y_r,\dot \theta _r, \dot v_r,\dot \omega _r]^{\top }=[v_r\cos \theta _r, v_r\sin \theta _r,\omega _r, a_{v,r}, a_{\omega,r}]^{\top }, \end{equation}

where $u_r=[a_{v,r}, a_{w,r}]^{\top }$ is the control input of the virtual leader robot.

The tracking error model of the robot can be written as

(3) \begin{equation} \begin{aligned} e=T(p_r-p), \end{aligned} \end{equation}

where $T$ is the coordinate transformation matrix from the global coordinate frame $XOY$ to the robot’s local coordinate frame $xoy$ , which is

(4) \begin{equation} T=\left [ \begin{array}{c@{\quad}c@{\quad}c} \cos \theta & \sin \theta & 0\\ \\[-8pt] -\sin \theta & \cos \theta & 0\\ \\[-8pt] 0& 0& I_3\\ \end{array} \right ]. \end{equation}

Then, we can get the tracking error model in derivative form by differentiating both sides of (3), which is

(5) \begin{equation} \dot{e}=T{(\dot{p_r}-\dot{p})}+\dot{T}{(p_r-p)}. \end{equation}

According to Eqs. (1)–(2) and (5), the discrete-time tracking error model according to the forward-Euler discretization can be written as

(6) \begin{equation} \left \{\begin{aligned} e_x\left( k+1 \right) &=e_x\left( k \right) +\Delta h\left( \omega \left( k \right) e_y\left( k \right) -v\left( k \right) + \right. \left. v_r\left( k \right) \cos e_{\theta}\left( k \right) \right)\\e_y\left( k+1 \right) &=e_y\left( k \right) +\Delta h\left( -\omega \left( k \right) e_x\left( k \right) + \right. \left. v_r\left( k \right) \sin e_{\theta}\left( k \right) \right)\\e_{\theta}\left( k+1 \right) &=e_{\theta}\left( k \right) +\Delta h\left( \omega _r\left( k \right) -\omega \left( k \right) \right)\\e_v\left( k+1 \right) &=e_v\left( k \right) +\Delta h\left( a_{v,r}\left( k \right) -a_v\left( k \right) \right)\\e_{\omega}\left( k+1 \right) &=e_{\omega}\left( k \right) +\Delta h\left( a_{\omega ,r}\left( k \right) -a_{\omega}\left( k \right) \right) , \end{aligned}\right. \end{equation}

where $e=[e_{x},e_{y},e_{\theta },e_{v},e_{\omega }]^{\top }$ , $v(k)=v_r(k)-e_{v}(k)$ , $\omega (k)=\omega _r(k)-e_{\omega }(k)$ , $k$ is the discrete-time index, and $\Delta h$ is the adopted sampling interval.

The nonlinear tracking error model can be abbreviated as

(7) \begin{equation} e(k+1)=f(e(k))+g(u(k)), \end{equation}

where $f$ is the error transition function, and $g$ is the input mapping function.

2.2. Problem formulation

We define the performance index of path-following control as

(8) \begin{equation} J(e(k))=\sum \nolimits _{k=0}^{\infty } \gamma ^{k} r(e(k),u(k)), \end{equation}

where

(9) \begin{equation} r(e(k),u(k))=\lVert e(k) \rVert _Q^2+\lVert u(k)-u_r(k)\rVert _{R}^2, \end{equation}

and $\gamma \in (0,1]$ is the discounting factor, $Q^{\top }=Q \in \mathbb{R}^{5\times 5}$ , $R^{\top }=R\in \mathbb{R}^{2\times 2}$ , $Q\succ 0$ , $R\succ 0$ .

The cost $J(e(k))$ is minimized subject to the following constraints $q(k)\in \mathcal{X}, \forall k\in [1,\infty ]$ , where

(10) \begin{equation} \begin{aligned} \mathcal{X}=\{q|\lVert q-q_{o}\rVert _2 \gt r+r_o\}, \end{aligned} \end{equation}

$q_{o},o\in \mathbb{N}_{1}^{n}$ , is the position of the obstacle $o$ . $r$ and $r_o$ are the radii of the smallest approximate circles surrounding the robot and obstacle $o$ , $o\in \mathbb{N}_{1}^{n}$ , respectively.

Below, we give the definition of the potential field function and the assumption on the detection rules for the robot.

Definition 1 (Potential field function):

For the robot’s position $q$ and the obstacle’s position $q_{o}\in \mathbb{R}^2$ , $o\in \mathbb{N}_{1}^{n}$ , the potential field function $\mathcal{B}_{o}$ is defined as

(11) \begin{equation} \mathcal{B}_{o}(q)= \begin{cases} \dfrac{1}{2} \kappa _{o}\left(\dfrac{1}{\lVert q-q_{o} \rVert _2}-\dfrac{1}{r_{o}+r}\right)^2, &\lVert q-q_{o} \rVert _2 \leq r_{o}+r \\ 0, & \lVert q-q_{o} \rVert _2 \gt r_{o}+r,\end{cases} \end{equation}

where $\kappa _{o}$ is the relaxation factor of the potential field. $\lVert q-q_{o} \rVert _2$ , $o\in \mathbb{N}_{1}^{n}$ , represents the distance between the robot and the obstacle $o$ . The depiction of the above parameters is shown in Fig. 2.

Fig. 2. Depiction of robot’s parameters in obstacle avoidance.

Assumption 1 (Detection rules): The robot can detect the position and angle of obstacles in a sector area (with the robot as the center and a radius of $r_{d}$ ).

3. P-DHP for path-following control with collision avoidance

3.1. Design of P-DHP

To solve the path-following control problem with safety constraints, we propose a control policy that has two control parts, which is described as

(12) \begin{equation} {u}(k)=\bar{u}^{i_1}(e(k))+\sum \nolimits _{o=1}^{n}\eta _{o}(k)\nabla \mathcal{B}_{o}(k), \end{equation}

where $\bar{u}^{i_1}(e(k))$ is a control policy that corresponds to a fixed weight matrix without the obstacles’ constraints. $\eta _o \in \mathbb{R}^{2\times 2}$ , $o\in \mathbb{N}_{1}^{n}$ , are variables to be optimized for obstacles of different sizes and positions (see Section 3.2). $\nabla \mathcal{B}_{o}=\partial \mathcal{B}_{o}/\partial q$ .

Remark 1: As the robot approaches the obstacle, the input item $\nabla \mathcal{B}_{o}$ could generate a repulsive force to drive $q(k)$ to satisfy the constraint (10). Meanwhile, the parameter $\eta _o$ is continuously optimized to ensure safety in path tracking. In the case that the input item $\nabla \mathcal{B}_{o}$ is identically equal to 0, the policy $\bar{u}(k)$ is to approach the path-following control policy under no obstacle.

To decouple the conflicting learning objectives of path-following control and obstacle avoidance, the learning process is divided into two parts, where $\bar u$ and $\eta _o$ in (12) are learned asynchronously by value iteration to guarantee convergence.

Then, the performance index of path-following control is redefined as

(13) \begin{equation} J_1( \bar e(k))=\sum \nolimits _{k=0}^{\infty } \gamma ^{k} r_1(\bar e(k),\bar u(k)), \end{equation}

where $r_1(\bar e(k),\bar u(k))=\lVert \bar e(k) \rVert _{Q_1}^2+\lVert \bar u(k)-u_r(k)\rVert _{R_1}^2$ , $Q_1^{\top }=Q_1 \in \mathbb{R}^{5\times 5}$ , $R_1^{\top }=R_1\in \mathbb{R}^{2\times 2}$ , $Q_1\succ 0$ , $R_1\succ 0$ . $\bar e(k+1)$ is the state of the robot that satisfies $\bar e(k+1)=f(\bar e(k))+g(\bar u(k))$ .

Define $J_1^*$ as the optimal value function of $J_1$ under the optimal control policy $\bar u^*$ . According to the Hamilton–Jacobi–Bellman (HJB) equation, one has

(14) \begin{equation} \begin{aligned} J_1^{*}(k)=\underset{\,\, {\bar u}(k)}{\rm min}\,\,{{r_1(\bar e(k),{\bar u}(k))}+\gamma J_1^{*}(k+1)},\\ \end{aligned} \end{equation}

and

(15) \begin{equation} \begin{aligned}{\bar u}^{*}(k)&=\underset{\,\,{\bar u}(k)}{\mathrm{argmin}}\,\,{r_1(\bar e(k),\bar u(k))+ \gamma J_1^{*}(k+1)}. \end{aligned} \end{equation}

Define $i$ as the number of iterations. Firstly, the control policy $\bar u(k)$ in the unconstrained scenario is learned within iterations $i=1,\ldots,i_{1}$ , and the update rule in each iteration of $J_1(k)$ and $\bar u(k)$ can be obtained as

(16) \begin{equation} \begin{aligned} J_1^{i+1}(k)={r_1(\bar e(k),{\bar u}^{i}(k))}+\gamma J_1^{i}(k+1),\\ \end{aligned} \end{equation}
(17) \begin{equation} \begin{aligned}{\bar u}^{i+1}(k)&=\underset{\,\,{\bar u}(k)}{\mathrm{argmin}}\,\,{r_1(\bar e(k),\bar u(k))+ \gamma J_1^{i+1}(k+1)}, \end{aligned} \end{equation}

where $i_{1}$ is the iterative value as $J_1^{i}(k)$ converges (see Algorithm 1 for details).

With the fixed control policy $\bar u^{i_1}(k)$ , the variable $\eta _o$ in the second item of (12) is updated within iterations $i=i_{1},\ldots,i_{2}$ to minimize the reconstructed cost function $ J_2(k)$ for collision avoidance constraints, which is

(18) \begin{equation} J_2(k)=\sum _{o=1}^{n}J_{2,o}(k), \end{equation}
(19) \begin{equation} J_{2,o}(k)=\sum _{k=0}^{\infty } \gamma ^{k}r_2(\mathcal{B}_{o}(k),e(k),u(k)), \end{equation}

where $r_2(\mathcal{B}_{o}(k),e(k),u(k))=\lVert e(k)-{\bar e}(k)\rVert _{Q_2}^2+\lVert u(k)-\bar{u}^{i_1}(k)\rVert _{R_2}^2+\mu \mathcal{B}_{o}(k)$ , ${Q}_2^{\top }={Q_2}\in \mathbb{R}^{5\times 5}$ , ${R}_2^{\top }={R_2}\in \mathbb{R}^{2\times 2}$ , ${Q_2}\succ 0$ , ${R_2}\succ 0$ , and $\mu \gt 0$ is a tuning parameter.

Remark 2: The control objective is to minimize the cost $J_2$ , which is equivalent to reducing the tracking error and input energy consumption while satisfying the obstacle avoidance constraint. The trade-off among obstacle avoidance, reduced tracking error and input energy consumption can be adjusted by the value of $\mu$ and the weight of matrix $Q_2$ and $R_2$ , respectively.

Let $\overrightarrow{\eta }$ be the sequence of ${\eta }_{1},\ldots,{\eta }_{n}$ . Then $J_{2}$ and $\overrightarrow{\eta }$ can be updated in each iteration $i=i_{1},\ldots,i_{2}$ by

(20) \begin{equation} \begin{aligned} J_{2}^{i+1}(k)=\sum _{o=1}^n{r_2(\mathcal{B}_{o}(k),e(k),{u}^{i}(k))}+\gamma J_{2}^{i}(k+1),\\ \end{aligned} \end{equation}
(21) \begin{equation} \begin{aligned} \overrightarrow{\eta }^{i+1}(k)&=\underset{\,\, \overrightarrow{\eta }(k)}{\mathrm{argmin}}\,\,\sum _{o=1}^n{r_2(\mathcal{B}_{o}(k),e(k),u(k))+ \gamma J_{2}^{i+1}(k+1).} \end{aligned} \end{equation}

The main steps of the proposed P-DHP can be seen in Algorithm 1.

In the following, we prove the convergence of our proposed P-DHP.

Theorem 1 (Convergence of (16) and (17)): If the initial condition satisfies $J_1^{0}(k) \geq{r_1}(\bar e(k), \bar u^{0}(k))+\gamma{J}_1^{0}(k+1)$ , then it follows that

1) ${J}_1^{i+1}(k)\leq{J}_1^{i}(k)$ .

2) ${\mathrm{lim}}_{i\rightarrow \infty }{J}_1^{i}(k)={J}_1^{*}(k)$ .

Proof. For the proof details please refer to refs. [Reference Luo, Liu, Huang and Liu26] and [Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19].

Theorem 2 (Convergence of (20) and (21)) [Reference Zhang, Peng, Luo, Pan, Xu and H.Xie19]: If $ J_2^{i_1}(k) \geq$ $\sum \nolimits _{o=1}^{n}{r_2}(\mathcal{B}_{o}(k),e(k), u^{i_1}(k))+\gamma J_2^{i_1}(k)$ and ${u}^{i_1}(k)=\bar u^{i_1}(k)+\sum \nolimits _{o=1}^{n}\eta _{o}^{i_1}(k)\nabla \mathcal{B}_{o}(k)$ is a safe control policy to restrain $q^{i_1}(k)\in \mathcal{X}$ , then

1) $J_2^{i+1}(k)\leq J_2^{i}(k).$

2) ${u}^{i}(k)=\bar u^{i_1}(k)+\sum \nolimits _{o=1}^{n}\eta _{o}^{i}(k)\nabla \mathcal{B}_{o}(k)$ , $i\geq i_1$ , is a safe control policy.

Proof. 1): The proof is similar to that in Theorem 1, which is to replace $J_1^{i}(k)$ with $J_2^{i}(k)$ .

2): As ${u}^{i_1}(k)$ is a safe control policy, one has $J_2^{i_1}(k)\leq J_2^{f}$ , where $ J_2^{f}$ is the threshold of the safe value function. According to Theorem 1, for $i\geq i_1$ , it satisfies $J_2^{i}(k) \leq J_2^{i_1}(k)\leq J_2^{f}$ , i.e., $q^{i}(k)\in \mathcal{X}$ . Therefore, ${u}^{i}(k)$ , $i\geq i_1$ , is a safe control policy.

Algorithm 1. Pseudocode of P-DHP.

3.2. AC learning for P-DHP

In this section, the AC learning structures (see Fig. 3) are introduced to implement Algorithm 1. The AC learning structures consist of two decoupling parts. The actor-1–critic-1 network corresponds to the iteration of $J_1(k)$ and $\bar u(k)$ , and the actor-2–critic-2 network corresponds to the iteration of $J_2(k)$ and $\overrightarrow{\eta}$ , respectively. The AC network weights are updated in an incremental learning mode.

The critic-1 network is constructed as

(22) \begin{equation} \nabla \hat{J_1}(k) =W_{c}^{\top }(k)h_{c}(\bar e(k)), \end{equation}

where the costate $\nabla \hat{J_1}(k)=\partial \hat{J_1}(k)/\partial \bar e(k)$ , $W_{c}\in \mathbb{R}^{5\times 5}$ is the weighting matrix, and $h_{c}$ is the activation function.

Define $\nabla{J_1^{*}}=\partial{J_1^{*}}/\partial \bar e$ . To minimize the bias between $\nabla \hat{J_1}$ and $\nabla J_1^*$ , we designed a target to be followed by $\nabla \hat{J_1}$ , i.e.,

(23) \begin{equation} \begin{aligned} \nabla J_1^{d}(k)=2Q_1\bar e(k)+\gamma \nabla f(k)^{\top }\nabla \hat{J_1}(k+1), \end{aligned} \end{equation}

where $\nabla f=\partial{f}/\partial{\bar e}$ .

Let the quadratic approximation error of the critic-1 network be defined as $E_{c}(k)=||\nabla{\hat J_1}(k)-\nabla J_1^d(k)||_2^2$ . The goal of the critic-1 network is to minimize the distance between $\nabla{\hat{J_1}}(k)$ and $\nabla J_1^d(k)$ by updating weights. At each time instant $k$ , the update rule of $W_{c}$ is

(24) \begin{equation} \begin{aligned} W_{c}(k+1) &=W_{c}(k) -\alpha _{c}\frac{\partial E_{c}(k)}{\partial W_{c}(k)}, \end{aligned} \end{equation}

where $\alpha _{c}$ is the learning rate of $W_{c}$ .

Then we design the actor-1 network for the robot, which is

(25) \begin{equation} \hat{\bar u}(k) =W_{a}^{\top }(k)h_{a}(\bar e(k)), \end{equation}

where the $W_{a}\in \mathbb{R}^{5\times 2}$ is the weighting matrix, and $h_{a}$ is the activation function.

In view of (17), we define the $ \bar u^{d}(k)$ as the desired value of $\hat{\bar u}(k)$ , which is expressed as

(26) \begin{equation} \bar u^{d}(k) =-\frac{1}{2}\gamma{R_1}^{-1}\nabla g(k)^{\top }\nabla \hat{J_1}(k+1) +u_r(k), \end{equation}

where $\nabla g=\partial{g}/\partial{\bar u}$ , and $\nabla \hat{J_1}$ is the output of the critic-1 network.

Fig. 3. The structure of the actor–critic learning algorithm.

The quadratic approximation error of actor-1 network can be defined as $E_{a}(k)=||\hat{\bar u}(k)-\bar u^d(k)||_2^2$ , and the update rule of $W_{a}$ is

(27) \begin{equation} \begin{aligned} W_{a}(k+1) &=W_{a}(k) -\alpha _{a}\frac{\partial E_{a}(k)}{\partial W_{a}(k)},\\ \end{aligned} \end{equation}

where $\alpha _{a}$ is the learning rate of $W_{a}$ . Note that the learning process (27) is performed in $i_1$ iterations. We define the converged matrix $W_{a}$ as $\textbf{W}_{a}$ and the learned control policy $\hat{\bar u}$ as ${\hat{\bar{\textbf{u}}}}$ , respectively.

Suppose that the potential fields of obstacles do not overlap. In the following design, we decompose the actor-2–critic-2 network into $n$ subnetworks to learn the variable $\eta _o$ , $o\in \mathbb{N}_1^{n}$ , in (12) respectively, which corresponds to obstacles $o$ , $o\in \mathbb{N}_1^{n}$ , of different sizes and positions.

The sub-critic-2 network for each obstacle $o$ is given as

(28) \begin{equation} \nabla{\hat{J}_{2,o}}(k) =\mu _{c}{\textbf{W}_{\textbf{c}}}^{\top }{h}_{c}(e(k))+\rho _{o}(k)\nabla \mathcal{B}_{o}(k), \end{equation}

where $\rho _{o}\in \mathbb{R}^{2\times 2}$ , and $\nabla \hat J_{2,o}=\partial \hat J_{2,o}/\partial e$ . $\textbf{W}_c$ is the learned matrix of $W_{c}$ , and $\mu _{c}\gt 0$ is a tuning parameter.

The desired value of $\nabla \hat J_{2,o}$ i.e., $\nabla{J}_{2,o}^{d}$ is defined as

(29) \begin{equation} \begin{aligned} \nabla \hat{J}_{2,o}^{d}(k)=2Q_2(e(k)-\bar e(k))+\mu \left(\frac{\partial{q}(k)}{\partial{e}(k)}\right)^{\top }\nabla \mathcal{B}_{o}(k)+\gamma \nabla f(k)^{\top }\nabla \hat{J}_{2,o}(k+1). \end{aligned} \end{equation}

where $\nabla f=\partial{f}/\partial{e}$ .

Denote ${E}_{c,o}(k)=||\nabla \hat {J}_{2,o}(k)-\nabla {J}_{2,o}^d(k)||_2^2$ as the quadratic approximation error of the sub-critic-2 network. The update rule of $\rho _{o}$ can be obtained as

(30) \begin{equation} \begin{aligned} \rho _{o}(k+1)=\rho _{o}(k) -\beta _{c}\frac{\partial{E}_{c,o}(k)}{\partial \rho _{o}(k)},\\ \end{aligned} \end{equation}

where $\beta _{c}$ is the learning rate of $\rho _{o}$ , $o\in \mathbb{N}_{1}^{n}$ .

In view of (12), the sub-actor-2 network for each obstacle $o$ is designed with the following form:

(31) \begin{equation} \hat{u}_o(k) =\textbf{W}_{\textbf{a}}^{\top }{h}_{a}(e(k))+\eta _{o}(k)\nabla \mathcal{B}_{o}(k), \end{equation}

where $\hat{u}_o(k)$ is the control policy for the robot to avoid the obstacle $o$ , which is equal to ${\hat u}(k)$ in the case of only one obstacle $o$ to be avoided.

Define $\varepsilon _o =2R_2({\hat u_o}-\hat{\bar{\textbf{u}}})$ . Then the desired value of $\varepsilon _o(k)$ is redefined as

(32) \begin{equation} \varepsilon^d_o(k) =-\gamma\nabla g(k)^{\top} \nabla \hat{J}_{2,o}(k+1). \end{equation}

where $\nabla g=\partial{g}/\partial{u}$ .

In the same way, the quadratic approximation error of the sub-actor-2 network and the update rule of $\eta _o$ are given as

(33) \begin{equation} \begin{aligned}E_{a,o}(k)=||\varepsilon_o(k)-\varepsilon_o^d(k)||_2^2, \end{aligned} \end{equation}
(34) \begin{equation} \begin{aligned} \eta _{o}(k+1) =\eta _{o}(k) -\beta _{a}\frac{\partial{E}_{a,o}(k)}{\partial \eta _{o}(k)}, \\ \end{aligned} \end{equation}

where $\beta _{a}$ is the learning rate of $\eta _{o}$ , $o\in \mathbb{N}_{1}^{n}$ .

Given that all variables $\eta _o$ , $o\in \mathbb{N}_{1}^{n}$ , have been trained until converging, the overall control policy to be deployed to the robot is

(35) \begin{equation} \hat{u}(k) =\textbf{W}_{\textbf{a}}^{\top }{h}_{a}(e(k))+\sum \nolimits _{o=1}^{n}\eta _{o}(k)\nabla \mathcal{B}_{o}(k). \end{equation}

4. Simulation and experimental results

We first test our algorithm by numerical simulation in a MATLAB environment and deploy the control policy on a real-world differential-drive wheeled robot platform. Suppose the robot is equipped with sensors that can measure the obstacles’ approximate size, relative distance, and angle in the simulation and experiment.

4.1. Simulation experiments

4.1.1. Parameter settings and training process

In the simulation, a relaxed potential function is defined and used as follows to guarantee the smoothness in the control process, i.e.,

(36) \begin{equation} \mathcal{B}_{o}^{r}(q)= \begin{cases}\mathcal{B}_{o}(q), &d_{o} \geq \delta _{o} \\ \\[-8pt] \sigma _{o}(q), &d_{o} \lt \delta _{o}, \end{cases} \end{equation}

where $d_o=\lVert q-q_{o} \Vert_2$ and $0\lt \delta _{o}\lt d_{o}$ is an adjustable relaxing factor. $\sigma _{o}(q)$ is a derivative and monotonic function, where $\sigma_{o}(q)=\frac{1}{2}\sigma_{1}(d_{o}-{\delta_{o}})^2+\sigma_{2}(d_{o}-{\delta_{o}})+\sigma_{3}$ , $\sigma _{1}=\nabla ^2_{d_{o}} \mathcal{B}_{o}(q)|_{d_{o}=\delta _{o}}$ , $\sigma _{2}=\nabla _{d_{o}} \mathcal{B}_{o}(q)|_{d_{o}=\delta _{o}}$ , $\sigma _{3}=\mathcal{B}_{o}(q)|_{d_{o}=\delta _{o}}$ .

Also, the perceived radius values of obstacles are properly expanded to construct $\nabla \mathcal{\bar B}_{o}(q)$ , which replaces $\nabla \mathcal{B}_{o}(q)$ in (12) to ensure control safety. The reconstructed $\mathcal{\bar B}_{o}(q)$ is defined as

(37) \begin{equation} \mathcal{\bar B}_{o}(q)= \begin{cases} \dfrac{1}{2} \bar \kappa _{o}\left(\dfrac{1}{d_{o}}-\dfrac{1}{r_{d}}\right)^2, &d_{o} \leq r_{d} \\ \\[-8pt] 0, & d_{o} \gt r_{d},\end{cases} \end{equation}

where $r_{d}$ indicates the maximum distance that the robot can detect the obstacles, and the definitions of relevant parameters can be seen in Fig. 2. The parameters satisfy $\kappa _{o}\lt \bar \kappa _{o}$ and $r_{o}+r\lt r_{d}$ . Similar to $\mathcal{B}_{o}^{r}(q)$ in (36), $ \mathcal{\bar B}_{o}(q)$ is deflated into $\mathcal{\bar B}_{o}^{r}(q)$ by $\bar \delta _{o}$ , $0\lt \bar \delta _{o}\lt r_{d}$ .

In the training process, the discounting factor was $\gamma =0.95$ . The penalty matrices were set as $Q_1=I_5$ , $R_1=0.01I_2$ , $Q_2=0.001I_5$ , $R_2=0.001I_2$ , $\mu =1$ , $\mu _c=0.001$ , $\alpha _{a}=0.2$ , $\alpha _{c}=0.4$ , $\beta _{a}=0.4$ , $\beta _{c}=0.4$ , $\delta _o=1.5$ , $\bar \delta _o=5$ , $\kappa _{o}=5.5$ , $\bar \kappa _{o}=600$ , $r_d=6$ , and $\Delta h=0.05\,\text{s}$ . In the learning process, the weights were initialized with uniformly distributed random values. The simulations were implemented in the Matlab 2019b environment on a laptop with an AMD Ryzen 7 4800H CPU.

Firstly, we tested our algorithm in a straight-line path scenario with one obstacle in the center of the road. The recorded performance costs were defined as $J^i_{r,1}=\sum \nolimits _{k=1}^{K_1}J_1^i(k)$ and $ J^i_{r,2}=\sum \nolimits _{k=1}^{K_2}J_2^i(k)$ with $K_1=1000$ and $K_2=1000$ in each training iteration. The variation results of $J^i_{r,1}$ and $J^i_{r,2}$ in Fig. 4 show that both the performance costs decrease with iteration increases and converge in 30 iterations. Moreover, the robot’s trajectories in the learning process and the actor weights of the robot in a single-obstacle scenario are displayed in Figs. 5 and 6. The results illustrate that the robot gradually moves away from the obstacle by continuous iterative learning until the robot’s trajectory is circumscribed by the edge of the obstacle’s potential field.

Fig. 4. Variations of $J^{i}_{r,1}$ and $J^{i}_{r,2}$ in the learning process.

Fig. 5. Iteration of robot’s trajectories in a single-obstacle scenario based on P-DHP algorithm. The radii of the robot and obstacle are $r=1\,\text{m}$ and $r_o=1\,\text{m}$ , respectively.

Fig. 6. Actor-2 weights of robot with the increase of iteration, where $ \eta_{o}(i,j)$ refers to the element in the i-th row and j-th column of matrix $\eta_{o},$ and $o=1$ .

Table I. Comparative results of P-DHP and CS-based RL in 100 tests.

4.1.2. Comparison results with state-of-the-art approaches

In this subsection, we compared our approach with the reward/cost shaping-based RL approach (CS-based RL) [Reference Perkins and Barto27], the MPC using a disjunctive chance constraint (MPC-dc) [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28], and the MPC using a linearized chance constraint (MPC-lc) [Reference Zhu and Alonso-Mora29]. Please refer to Fig. 9 for the adopted reference paths in the simulation tests.

In the CS-based RL approach, the cost function was constructed as $r=\lVert e(k) \rVert _{Q_1}^2+\lVert u(k)-u_r(k)\rVert _{R_1}^2+\mu\nabla\mathcal{B}_{o}(k)$ , and the parameters of $\nabla\mathcal{B}_{o}$ were fine-tuned. For a fair comparison, the same DHP-based AC learning structure(actor-1-critic-1) was used in CS-based RL. The comparative results in 100 tests with different learning rates are illustrated in Table I and Fig. 7, which show that our approach outperforms the CS-based RL approach in the success rate of the learning process. Although the success rate between ours and CS-based RL is similar as the learning rate is set small, as shown in Fig. 8,  the CS-based RL approach has a probability of failure to avoid obstacles due to the lack of corrective terms $\nabla \mathcal{B}_{o}$ in the control policy. As the learning rate increases, the learning failure rate of the CS-based RL approach also increases and has a longer obstacle avoidance distance with more energy loss, whereas our method possesses stable control performance for different learning rates. This illustrates, in part, that only shaping the cost function may not be sufficient to learn a convergent control policy efficiently via a standard AC learning algorithm. To solve this problem, our P-DHP approach are proposed to resolve the conflict between the path-following control and obstacle avoidance goals in the learning process.

Fig. 7. Simulation results of our approach and the CS-based RL [Reference Perkins and Barto27] with $v_r=1\,\text{m/s}$ , $r=1.3\,\text{m}$ , $r_o=0.7\,\text{m}$ or $r_o=1\,\text{m}$ . The CS-based RL method has a certain probability of learning failure, see Table I for details.

Fig. 8. Statistics on the closest distance from the robot to the obstacle in 100 repetitive tests. The comparisons were made between P-DHP and the CS-based RL [Reference Perkins and Barto27] in the single-obstacle scenario (shown in Fig.5). The safe bound is represented in red, and the outliers are drawn as blue circles.

The MPC-dc [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28] and MPC-lc [Reference Zhu and Alonso-Mora29] algorithms were implemented by the CasADi toolbox with an Ipopt solver [Reference Andersson, Gillis, Horn, Rawlings and Diehl30]. The stage costs in MPC-dc and MPC-lc were designed the same as (9) in our approach. The performance indicator $J_e$ was defined as $J_e=\text{1/}\bar K\sum_{k\in K}{\lVert e(k) \rVert_2}$ , where $K$ is the time interval in obstacle avoidance, and $\bar K$ is the length of $K$ . The comparative study was carried out under the scenario with three different sizes of obstacles, where the side lengths of the bounding boxes were set as $2.3/\sqrt{2}\,\text{m}$ , $2/\sqrt{2}\,\text{m}$ , and $2.6/\sqrt{2} \,\text{m}$ . The prediction horizon was set as $N=30$ . For the details of the simulation parameters, please refer to ref. [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28].

The comparative simulation results are illustrated in Fig. 9 and Tables II and III. The results show that our approach outperforms the fixed-parameters MPC-lc and MPC-dc with smaller cost values and shorter path length in obstacle avoidance, demonstrating that our approach has better control performance at different reference velocities. This is due to the learning optimization mechanism and policy design of our algorithm. In addition, our approach has advantages in computational efficiency (see Table III).

Table II. Numerical comparison of P-DHP, MPC-dc and MPC-lc.

Table III. Computational time comparison ( $v_r=1\,\text{m/s}$ ).

Fig. 9. Simulation results of our approach, MPC-lc [Reference Zhu and Alonso-Mora29] and MPC-dc [Reference Castillo-Lopez, Ludivig, Sajadi-Alamdari, Sanchez-Lopez, Olivares-Mendez and Voos28] with $v_r=1.6\,\text{m/s}$ : the filled blue circular area represents the obstacles and the light blue dotted line indicates the influence range of the obstacle potential field. The simulation scene contains three sizes of obstacles, corresponding to $r_o=0.7\,\text{m}$ , $r_o=1\,\text{m}$ and $r_o=1.3\,\text{m}$ respectively. The radius of the robot is $r = 1.3\,\text{m}$ .

4.2. Real-world experiments

We also tested our algorithm on the experimental platform of a real-world differential-drive wheeled robot. As shown in Fig.10,the robot was controlled by a laptop equipped with the Ubuntu operating system, and the state of the robot and the positions of obstacles were obtained by mounted satellite inertial. In the experiment, we deployed the offline training weights from the simulation to generate control policy $u=[a_{v}, a_{\omega }]^{\top }$ to drive the underlying system of the robot, and the sampling interval was $\Delta h=0.1\,\text{s}$ .

Fig. 10. Experimental platform of the differential-drive wheeled robot.

Fig. 11. Experiment results of robot’s path aerial view in a real-world scenario.

Fig. 12. Experiment results of robot’s state errors in real-world scenario.

The real-world experimental results are shown in Figs. 11 and 12, which illustrate that the robots can follow the desired path from an initial state, meanwhile avoiding all encountered obstacles on the path and recovering path following after collision avoidance. Moreover, the above-reported results show that our approach can resolve the weight divergence problem caused by the conflict between path following and obstacle avoidance.

5. Conclusion

This paper proposed a P-DHP algorithm for path-following control of wheeled robots with obstacle avoidance. As for the main characteristics, the proposed P-DHP utilizes two coupled AC networks to resolve the conflicting goals, developing a near-optimal and safe control policy for path-following and obstacle avoidance. The convergence and safety of our method were proven. Our approach has been evaluated in simulation and tested on a real-world differential-drive mobile robot. The simulated and experimental results illustrate that our approach can realize path-following control with collision avoidance under conflicting learning objectives, showing the advantages in computational efficiency and smaller cost values compared with state-of-the-art approaches. Future work will focus on the application of model-free RL on wheeled mobile robots.

Author contributions

All authors have made great contributions to this paper.

Financial support

The work was supported by the National Natural Science Foundation of China under Grant U21A20518, 62003361, and 61825305, China Postdoctoral Science Foundation under Grant 47680, and the National Key R&D Program of China 2018YFB1305105.

Conflicts of interest

The authors declare no conflicts of interest exist.

Ethical approval

Not applicable.

References

Nascimento, T. P., Dórea, C. E. and Gonçalves, L. M. G., “Nonholonomic mobile robots’ trajectory tracking model predictive control: A survey,” Robotica. 36(5), 676696 (2018).10.1017/S0263574717000637CrossRefGoogle Scholar
Li, Z., Zhao, K., Zhang, L., Wu, X., Zhang, T., Li, Q., Li, X. and Su, C., “Human-in-the-loop control of a wearable lower limb exoskeleton for stable dynamic walking,” IEEE/ASME Trans. Mechatron. 26(5), 27002711 (2021).10.1109/TMECH.2020.3044289CrossRefGoogle Scholar
Su, H., Hu, Y., Karimi, H. R., Knoll, A., Ferrigno, G. and De Momi, E., “Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results,” Neural Netw. 131, 291299 (2020).10.1016/j.neunet.2020.07.033CrossRefGoogle ScholarPubMed
Li, Z., Ren, Z., Zhao, K., Deng, C. and Feng, Y., “Human-cooperative control design of a walking exoskeleton for body weight support,” IEEE Trans. Ind. Inform. 16(5), 29852996 (2020).10.1109/TII.2019.2900121CrossRefGoogle Scholar
Chen, Z., Liu, Y., He, W., Qiao, H. and Ji, H., “Adaptive-neural-network-based trajectory tracking control for a nonholonomic wheeled mobile robot with velocity constraints,” IEEE Trans. Ind. Electron. 68(6), 50575067 (2021).10.1109/TIE.2020.2989711CrossRefGoogle Scholar
Wiig, M. S., Pettersen, K. Y. and Krogstad, T. R., “Collision avoidance for underactuated marine vehicles using the constant avoidance angle algorithm,” IEEE Trans. Control. Syst. Technol. 28(3), 951966 (2019).10.1109/TCST.2019.2903451CrossRefGoogle Scholar
Alonso-Mora, J., Beardsley, P. and Siegwart, R., “Cooperative collision avoidance for nonholonomic robots,” IEEE Trans. Robot. 34(2), 404420 (2018).10.1109/TRO.2018.2793890CrossRefGoogle Scholar
Hoy, M., Matveev, A. S. and Al, A. V. S., “Algorithms for collision-free navigation of mobile robots in complex cluttered environments: A survey,” Robotica. 33(3), 463497 (2015).10.1017/S0263574714000289CrossRefGoogle Scholar
Khatib, Real-time obstacle avoidance for manipulators and mobile robots,” Int. J. Robot. Res. 5(1), 500505 (1986).Google Scholar
Funke, J., Brown, M., Erlien, S. M. and Gerdes, J. C., “Collision avoidance and stabilization for autonomous vehicles in emergency scenarios,” IEEE Trans. Control. Syst. Technol. 25(4), 12041216 (2017).10.1109/TCST.2016.2599783CrossRefGoogle Scholar
Hagen, I. B., Kufoalor, D. K. M., Brekke, E. F. and T.A.Johansen, M., “MPC-Based Collision Avoidance Strategy for Existing Marine Vessel Guidance Systems,” In: 2018 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018) pp. 76187623.10.1109/ICRA.2018.8463182CrossRefGoogle Scholar
Brito, B., Floor, B., Ferranti, L. and Alonso-Mora, J., “Model predictive contouring control for collision avoidance in unstructured dynamic environments,” IEEE Robot. Autom. Lett. 4(4), 44594466 (2019).10.1109/LRA.2019.2929976CrossRefGoogle Scholar
Leica, P., Herrera, M., Rosales, C., Roberti, F., Toibero, J. and Carelli, R., “Dynamic Obstacle Avoidance Based on Time-Variation of a Potential Field for Robots Formations,” In: 2017 IEEE Second Ecuador Technical Chapters Meeting (ETCM) (IEEE, 2017) pp. 16.Google Scholar
Sudhakara, P., Ganapathy, V., Priyadharshini, B. and Sundaran, K., “Obstacle avoidance and navigation planning of a wheeled mobile robot using amended artificial potential field method,” Proc. Comput. Sci. 133, 9981004 (2018).10.1016/j.procs.2018.07.076CrossRefGoogle Scholar
Hu, B., Guan, Z.-H., Lewis, F. L. and Chen, C. P., “Adaptive tracking control of cooperative robot manipulators with markovian switched couplings,” IEEE Trans. Ind. Electron. 68(3), 24272436 (2021).CrossRefGoogle Scholar
Dong, H., Zhao, X. and Yang, H., “Reinforcement learning-based approximate optimal control for attitude reorientation under state constraints,” IEEE Trans. Control. Syst. Technol. 29(4), 16641673 (2020).10.1109/TCST.2020.3007401CrossRefGoogle Scholar
Zhang, Z., Chen, S., Zhu, X. and Z.Yan, T., “Two hybrid end-effector posture-maintaining and obstacle-limits avoidance schemes for redundant robot manipulators,” IEEE Trans. Ind. Inform. 16(2), 754763 (2020).CrossRefGoogle Scholar
Deptula, P., Chen, H.-Y., Licitra, R. A., Rosenfeld, J. A. and Dixon, W. E., “Approximate optimal motion planning to avoid unknown moving avoidance regions,” IEEE Trans. Robot. 36(2), 414430 (2020).10.1109/TRO.2019.2955321CrossRefGoogle Scholar
Zhang, X., Peng, Y., Luo, B., Pan, W., Xu, X. and H.Xie, M., “Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles,” arXiv preprint, arXiv:2112.11217 (2021).Google Scholar
Dong, Y., Tang, X. and Yuan, Y., “Principled reward shaping for reinforcement learning via Lyapunov stability theory,” Neurocomputing. 393, 8390 (2020).10.1016/j.neucom.2020.02.008CrossRefGoogle Scholar
Ohnishi, M., Wang, L., Notomista, G. and Egerstedt, M., “Barrier-certified adaptive reinforcement learning with applications to brushbot navigation,” IEEE Trans. Robot. 35(5), 11861205 (2019).10.1109/TRO.2019.2920206CrossRefGoogle Scholar
Wen, M. and Topcu, U., “Constrained cross-entropy method for safe reinforcement learning,” IEEE Trans. Autom. Control 66(7), 31233137 (2021).10.1109/TAC.2020.3015931CrossRefGoogle Scholar
Cohen, M. H. and Belta, C., “Approximate Optimal Control for Safety-Critical Systems with Control Barrier Functions,” In: 2020 59th IEEE Conference on Decision and Control (CDC) (IEEE, 2020) pp. 20622067.10.1109/CDC42340.2020.9303896CrossRefGoogle Scholar
Yang, Y. Y., He, W., Vamvoudakis, K. G., Modares, H. M. and Wunsch, D. C., “Safety-Aware Reinforcement Learning Framework with an Actor-Critic-Barrier Structure,” In: 2019 American Control Conference (ACC) (IEEE, 2019) pp. 23522358.CrossRefGoogle Scholar
Ames, A., Xu, X., Grizzle, J. and P.Tabuada, C., “Control barrier function based quadratic programs for safety critical systems,” IEEE Robot. Automat. Lett. 62(8), 38613876 (2017).Google Scholar
Luo, B., Liu, D., Huang, T. and Liu, J., “Output tracking control based on adaptive dynamic programming with multistep policy evaluation,” IEEE Trans. Syst. Man Cybernet. Syst. 49(10), 21552165 (2019).10.1109/TSMC.2017.2771516CrossRefGoogle Scholar
Perkins, T. J. and Barto, A. G., “Lyapunov design for safe reinforcement learning,” J. Mach. Learn. Res. 3(12), 803832 (2002).Google Scholar
Castillo-Lopez, M., Ludivig, P., Sajadi-Alamdari, S. A., Sanchez-Lopez, J. L., Olivares-Mendez, M. A. and Voos, H., “A real-time approach for chance-constrained motion planning with dynamic obstacles,” IEEE Robot. Automat. Lett. 5(2), 36203625 (2020).CrossRefGoogle Scholar
Zhu, H. and Alonso-Mora, J., “Chance-constrained collision avoidance for mavs in dynamic environments,” IEEE Robot. Automat. Lett. 4(2), 776783 (2019).CrossRefGoogle Scholar
Andersson, J. A. E., Gillis, J., Horn, G., Rawlings, J. B. and Diehl, M., “CasADi: A software framework for nonlinear optimization and optimal control,” Math. Program. Comput. 1(1), 136 (2019).CrossRefGoogle Scholar
Figure 0

Fig. 1. Tracking error model of the wheeled robot.

Figure 1

Fig. 2. Depiction of robot’s parameters in obstacle avoidance.

Figure 2

Algorithm 1. Pseudocode of P-DHP.

Figure 3

Fig. 3. The structure of the actor–critic learning algorithm.

Figure 4

Fig. 4. Variations of $J^{i}_{r,1}$ and $J^{i}_{r,2}$ in the learning process.

Figure 5

Fig. 5. Iteration of robot’s trajectories in a single-obstacle scenario based on P-DHP algorithm. The radii of the robot and obstacle are $r=1\,\text{m}$ and $r_o=1\,\text{m}$, respectively.

Figure 6

Fig. 6. Actor-2 weights of robot with the increase of iteration, where $ \eta_{o}(i,j)$ refers to the element in the i-th row and j-th column of matrix $\eta_{o},$ and $o=1$.

Figure 7

Table I. Comparative results of P-DHP and CS-based RL in 100 tests.

Figure 8

Fig. 7. Simulation results of our approach and the CS-based RL [27] with $v_r=1\,\text{m/s}$, $r=1.3\,\text{m}$, $r_o=0.7\,\text{m}$ or $r_o=1\,\text{m}$. The CS-based RL method has a certain probability of learning failure, see Table I for details.

Figure 9

Fig. 8. Statistics on the closest distance from the robot to the obstacle in 100 repetitive tests. The comparisons were made between P-DHP and the CS-based RL [27] in the single-obstacle scenario (shown in Fig.5). The safe bound is represented in red, and the outliers are drawn as blue circles.

Figure 10

Table II. Numerical comparison of P-DHP, MPC-dc and MPC-lc.

Figure 11

Table III. Computational time comparison ($v_r=1\,\text{m/s}$).

Figure 12

Fig. 9. Simulation results of our approach, MPC-lc [29] and MPC-dc [28] with $v_r=1.6\,\text{m/s}$: the filled blue circular area represents the obstacles and the light blue dotted line indicates the influence range of the obstacle potential field. The simulation scene contains three sizes of obstacles, corresponding to $r_o=0.7\,\text{m}$, $r_o=1\,\text{m}$ and $r_o=1.3\,\text{m}$ respectively. The radius of the robot is $r = 1.3\,\text{m}$.

Figure 13

Fig. 10. Experimental platform of the differential-drive wheeled robot.

Figure 14

Fig. 11. Experiment results of robot’s path aerial view in a real-world scenario.

Figure 15

Fig. 12. Experiment results of robot’s state errors in real-world scenario.