Nomenclature
- RBF: radial basis function
- RL: reinforcement learning
- VST: virtual sliding target
1.0 Introduction
Reentry vehicles are a class of unpowered aircraft capable of flying in near space, characterised by an aerodynamic profile with a large lift-to-drag ratio, high manoeuverability and strong penetration ability. Unlike traditional inertial vehicles, reentry vehicles rely on aerodynamic control and can perform large lateral manoeuvers, which significantly improves survivability [Reference Bairstow1, Reference Zhang Mingang2]. However, where there is a spear, there is a shield: reconnaissance, surveillance and interception technologies for reentry vehicles have also made great progress throughout the world [Reference Rahmani-Nejad3, Reference Liu, Yan, Liu, Dai, Yan and Xin4]. The interference and interception measures faced by a vehicle in the dive phase are more diverse than those in the midcourse and reentry phases, and the probability of being successfully intercepted is greatly increased [Reference Yu5]. Therefore, designing a dive-phase manoeuver for reentry vehicles with high penetration ability and strong anti-disturbance ability is a research topic of considerable interest [Reference Dwivedi, Bhale, Bhattacharyya and Padhi6, Reference He, Yan and Tang7]. When the reentry vehicle has no prior information about the interceptor, it is essential to choose a manoeuver mode that is difficult to predict and has a high penetration probability. Typical choices are the snake manoeuver, the roller manoeuver and the spiral manoeuver [Reference Li, Zhang and Tang8]. Among them, the spiral manoeuver has become a promising research direction in recent years because of its large manoeuver range, time-varying manoeuver frequency and unpredictable trajectory [Reference Rusnak and Peled-Eitan9].
The main guidance techniques used for the dive phase are optimal guidance law, predictor-corrector guidance law, sliding mode variable structure guidance law and proportional navigation guidance law [Reference Yibo, Xiaokui, Guangshan and Jiashun10, Reference Zhang and Wang11]. An optimal guidance law design method based on block pulse function is proposed by Dou et al. [Reference Dou and Dou12]. It is able to optimise the landing angle, miss distance and control energy consumption simultaneously. Wang et al. [Reference Qing13] propose an energy-based predictor-corrector guidance algorithm to design longitudinal and lateral guidance laws, respectively. Li and Qian [Reference Li and Qian14] design a three-dimensional guidance law considering target manoeuver, impact angle constraint and input saturation by using integral sliding mode control and adaptive control. Dhananjay and Ghose [Reference Dhananjay and Ghose15] develop a proportional navigation guidance law incorporating a time-to-go estimation algorithm to strike stationary targets. Among these guidance methods, proportional navigation guidance has been widely studied because of its simplicity, high efficiency and small miss distance. Nevertheless, proportional navigation guidance is not suited to flight missions that require specific manoeuvering modes. For this reason, some studies have introduced the concept of the virtual sliding target (VST) to extend the application scope of proportional navigation guidance. Intuitively, the real flight trajectory of the vehicle is mapped to the trajectory of a virtual sliding target whose end point coincides with the real target point. As long as the vehicle flies in the direction pointing to the virtual sliding target, it will eventually hit the real target. In Mozaffari et al. [Reference Mozaffari, Safarinejadian and Binazadeh16] and Raju and Ghose [Reference Raju and Ghose17], specific manoeuvers of the vehicle are achieved by controlling the speed and direction of the virtual sliding target. In this case, the parameters of the virtual sliding target are selected empirically. By introducing a virtual sliding target, Hu et al. [Reference Hu, Han and Xin18] design a two-stage guidance method combining a non-singular terminal sliding mode guidance law and a proportional navigation guidance law, which can satisfy both impact angle and time constraints. For stationary targets, an adaptive proportional navigation guidance law based on the virtual sliding target is proposed by He and Yan [Reference He and Yan19] to guide the vehicle through the spiral dive manoeuver. The scope of application of this method is extended to low-speed moving targets in He et al. [Reference He, Yan and Tang20]. Despite the substantial contributions made by these works, none of them considers the effect of unknown disturbances on the guidance effectiveness.
Fortunately, many nonlinear control methods are available to deal with the adverse effects of unknown disturbances on the controlled system. Some of these are inherently robust to unknown disturbances, while others incorporate various types of disturbance observers [Reference Ning, Liu, Wang and Luo21, Reference Yin, Wang, Xiong, Xiang, Liu, Fan and Xue22]. Sliding mode control is frequently used to design controllers for vehicles because of its insensitivity to matched disturbances [Reference Mao, Dou, Yang, Tian and Zong23]. Shen et al. [Reference Shen, Xia, Zhang and Cui24] propose a continuous adaptive super-twisting sliding mode tracking control method, which combines a conventional super-twisting sliding mode controller with an adaptive gain technique to overcome bounded disturbances. By exploiting the constraint handling capability and enhanced anti-disturbance capability of model predictive control, Chai et al. [Reference Chai, Tsourdos, Gao, Chai and Xia25] present a robust model predictive attitude control algorithm. A nonlinear feedback law is designed, and the system constraints are tightened to ensure that robust constraints are satisfied for all allowed uncertainties. A resilient attitude control method for spacecraft is proposed by Cao and Xiao [Reference Cao and Xiao26], which utilises a nonlinear disturbance observer to compensate for unknown disturbances. Xiang et al. [Reference Xiang, Yanli, Peng and Haibing27] propose an adaptive backstepping attitude control method for hypersonic vehicles in which a nonlinear disturbance observer is also used to estimate unknown disturbances. In Refs [Reference Cheng, Wang and Gong28–Reference Zhang, Chen, Fu and Huang30], a single group or two groups of neural networks are utilised to fit the unmodeled dynamics and external disturbances of hypersonic vehicles. In addition, reinforcement learning (RL) has been introduced into controller design. Ouyang et al. [Reference Ouyang, Dong, Wei and Sun31] introduce an actor-critic design into the tracking control problem of elastic joint robots to fit the system uncertainties. Shi et al. [Reference Shi, Wang and Cheng32] propose a robust adaptive safety control framework for hypersonic vehicles based on reinforcement learning, in which the actor-critic networks are used to approximate the optimal controller. Wang et al. [Reference Wang and Liu33] develop a reinforcement learning-based adaptive tracking control method for a class of semi-Markovian non-Lipschitz uncertain systems, in which actor-critic networks are used to handle unmatched disturbances. The actor-critic networks in reinforcement learning not only inherit the good nonlinear processing ability of neural networks, but also introduce an error-related cost function. Therefore, in theory, they can perform better than plain neural networks.
Combined with the previous discussion, this paper further studies the problem of spiral-diving manoeuver guidance for reentry vehicles considering unknown disturbances based on the results achieved in He and Yan [Reference He and Yan19] and He et al. [Reference He, Yan and Tang20]. The main contributions of this paper are as follows:
- Compared with He and Yan [Reference He and Yan19] and He et al. [Reference He, Yan and Tang20], this paper considers the adverse effects of unknown disturbances on the spiral-diving manoeuver of reentry vehicles. Specifically, the guidance command tracking control problem model considering unknown disturbances is established. This model is abstracted as a first-order coupled multivariable nonlinear system, which is more relevant to engineering practice.
- The coordinate transformation technique is employed to overcome the controller design challenge caused by the coupling of control variables. Combined with the recursive design technique, the first-order time derivative of the control variables is finally obtained, and the control variables can be obtained by integrating it.
- By designing the actor-critic networks and the corresponding adaptive weight update law, the unknown disturbances are compensated with high accuracy. Furthermore, by using the Lyapunov method, it is proved that the tracking errors are uniformly ultimately bounded. As a result, the assumption that the actual guidance commands are equivalent to the desired guidance commands made in the convergence analysis of the guidance parameters in He and Yan [Reference He and Yan19] and He et al. [Reference He, Yan and Tang20] is verified in this paper.
2.0 Problem formulation and preliminaries
2.1 Problem statement
The centroid dynamic model of an unpowered reentry vehicle subjected to unknown disturbances is as follows:
where $x,y,z$ are the position coordinates of the vehicle in the inertial frame. $V$ is the velocity. $\theta $ represents the angle between the velocity vector and the horizontal plane, i.e., the path angle. When the velocity vector is above the horizontal plane, $\theta $ is positive. ${\psi _v}$ represents the angle between the projection of the velocity vector in the horizontal plane and the $x$ axis, i.e., the deflection angle, measured counterclockwise in the horizontal plane. $\alpha $ and ${\gamma _v}$ are the angle-of-attack and bank angle. $m$ and $g$ are the mass and gravitational acceleration. ${d_1}\left( t \right)$ and ${d_2}\left( t \right)$ are the unknown disturbances acting on the vehicle. $X$ and $Y$ are the drag and lift forces, and the expression of $Y$ is
where $C_y^0$ and $C_y^\alpha $ are the lift coefficients. ${S_{ref}}$ is the reference area. $q$ is the dynamic pressure, and its expression is
where $\rho $ is the atmospheric density.
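The aerodynamic quantities defined above can be illustrated with a minimal numerical sketch. The dynamic pressure uses the standard definition $q = \frac{1}{2}\rho {V^2}$. The lift model $Y = \left( {C_y^0 + C_y^\alpha \alpha } \right)q{S_{ref}}$ is an assumed linear-in-$\alpha$ form, suggested only by the two coefficients named in the text, and the coefficient and density values are hypothetical placeholders rather than data from this paper.

```python
import math


def dynamic_pressure(rho: float, velocity: float) -> float:
    """Standard dynamic pressure q = 0.5 * rho * V^2 (Pa for SI inputs)."""
    return 0.5 * rho * velocity ** 2


def lift_force(alpha: float, q: float, s_ref: float,
               cy0: float, cy_alpha: float) -> float:
    """ASSUMED linear lift model Y = (C_y^0 + C_y^alpha * alpha) * q * S_ref."""
    return (cy0 + cy_alpha * alpha) * q * s_ref


# Hypothetical atmospheric density (kg/m^3); V and S_ref follow the
# simulation section of the paper (1200 m/s, 1.8 m^2).
rho = 0.0136
velocity = 1200.0
s_ref = 1.8

q = dynamic_pressure(rho, velocity)
# Hypothetical lift coefficients and a 5 degree angle-of-attack.
lift = lift_force(math.radians(5.0), q, s_ref, cy0=0.05, cy_alpha=1.2)
print(f"q = {q:.1f} Pa, Y = {lift:.1f} N")
```

Because $q$ grows with the square of $V$, the available lift, and hence the achievable lateral manoeuver, drops quickly as the vehicle decelerates during the dive.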
Define $\xi = {\left[ {\theta, {\psi _v}} \right]^T}$ as the state variables. Then, when the reentry vehicle performs a spiral-diving manoeuver, the desired guidance commands can be expressed as ${\xi _d} = {\left[ {{\theta _d},{\psi _{vd}}} \right]^T}$ . The problem of spiral-diving guidance for a reentry vehicle subject to unknown disturbances can thus be translated into the problem of tracking the desired guidance commands, which can be written as follows:
where $u = {\left[ {\alpha, {\gamma _v}} \right]^T}$ are the control variables. ${\chi _o}$ is the output vector. $d = {\left[ {{d_1},{d_2}} \right]^T}$ . ${f_1}$ and ${g_1}$ are smooth nonlinear functions that can be expressed as
The tracking error can be expressed as
Assumption 1. The system represented by Equation 4 is controllable, which satisfies
Lemma 1. [Reference He and Dong34] For a Lyapunov function $L\left( t \right)$ , if its initial value $L\left( 0 \right)$ is bounded and its time derivative satisfies
where $\kappa \gt 0$ and $\delta \gt 0$ are constants, then $L\left( t \right)$ is bounded.
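Lemma 1 can be checked numerically with a small sketch. It integrates the worst case $\dot L = - \kappa L + \delta $ and compares the result with the comparison-lemma bound $L\left( t \right) \le \left( {L\left( 0 \right) - \delta /\kappa } \right){e^{ - \kappa t}} + \delta /\kappa $, so that $L$ settles at the ultimate bound $\delta /\kappa $. The numerical values of $L\left( 0 \right)$, $\kappa $ and $\delta $ are hypothetical and chosen only for illustration.

```python
import math


def simulate_worst_case(l0: float, kappa: float, delta: float,
                        t_end: float = 10.0, dt: float = 1e-3):
    """Forward-Euler integration of dL/dt = -kappa * L + delta."""
    value, t = l0, 0.0
    history = [(t, value)]
    for _ in range(int(round(t_end / dt))):
        value += dt * (-kappa * value + delta)
        t += dt
        history.append((t, value))
    return history


def comparison_bound(t: float, l0: float, kappa: float, delta: float) -> float:
    """Closed-form solution of the worst case, which bounds L(t) from above."""
    return (l0 - delta / kappa) * math.exp(-kappa * t) + delta / kappa


# Hypothetical values for illustration only.
L0, KAPPA, DELTA = 5.0, 2.0, 0.4
history = simulate_worst_case(L0, KAPPA, DELTA)
final_value = history[-1][1]
print(f"L(t_end) = {final_value:.4f}, ultimate bound delta/kappa = {DELTA / KAPPA:.4f}")
```

A larger $\kappa $ or a smaller $\delta $ shrinks the ultimate bound, which is why the gain conditions derived later in the stability analysis matter for tracking accuracy.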
Lemma 2. [Reference Xing-Kai35] For vectors $A \in {\mathbb{R}^n}$ and $B \in {\mathbb{R}^n}$ , there always holds
Remark 1. As shown in Equation 1, the impact of unknown disturbances is considered in this paper when studying the spiral-diving guidance problem. This aspect is not addressed in He and Yan [Reference He and Yan19] and He et al. [Reference He, Yan and Tang20]. For this reason, this paper adds the design of the desired guidance command tracking system shown in Equation 4. Further, the objective of this paper is to design a reinforcement learning based adaptive controller for system 4 such that $\|{e_1}\| \le {\tilde e_1}$ as $t \to \infty $ , where ${\tilde e_1}$ is a sufficiently small positive constant.
2.2 Spiral trajectory parameters solving
The logarithmic spiral trajectory can be defined as
where ${r_0}$ is the initial polar diameter. $\vartheta $ is the polar angle. ${\rm{\Lambda }}$ is the angle between the component of the vehicle velocity vector in yaw plane and the polar diameter. It can be found from Equation 12 that once the values of ${r_0}$ and ${\rm{\Lambda }}$ are acquired, the shape of the spiral trajectory can be determined uniquely.
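The geometric meaning of ${\rm{\Lambda }}$ can be illustrated with a short sketch. It assumes the common logarithmic spiral form $r = {r_0}{e^{\vartheta \cot {\rm{\Lambda }}}}$; the sign convention of Equation 12 may differ, so this form is an assumption. Under that form, the angle between the tangent and the polar diameter equals ${\rm{\Lambda }}$ at every polar angle, and the arc length between two polar angles is $\left| {{r_b} - {r_a}} \right|/\left| {\cos {\rm{\Lambda }}} \right|$, which is one way a remaining trajectory length could be evaluated. The values of ${r_0}$ and ${\rm{\Lambda }}$ are hypothetical.

```python
import math


def spiral_radius(theta: float, r0: float, lam: float) -> float:
    """ASSUMED spiral form r = r0 * exp(theta * cot(lam)), lam in (0, pi)."""
    return r0 * math.exp(theta / math.tan(lam))


def spiral_point(theta: float, r0: float, lam: float):
    """Cartesian coordinates of the spiral point in the polar frame."""
    r = spiral_radius(theta, r0, lam)
    return r * math.cos(theta), r * math.sin(theta)


def tangent_radius_angle(theta: float, r0: float, lam: float,
                         h: float = 1e-6) -> float:
    """Angle between the tangent (central difference) and the radius vector."""
    x, y = spiral_point(theta, r0, lam)
    x_fwd, y_fwd = spiral_point(theta + h, r0, lam)
    x_bwd, y_bwd = spiral_point(theta - h, r0, lam)
    tx, ty = x_fwd - x_bwd, y_fwd - y_bwd
    cos_angle = (tx * x + ty * y) / (math.hypot(tx, ty) * math.hypot(x, y))
    return math.acos(cos_angle)


def arc_length(theta_a: float, theta_b: float, r0: float, lam: float) -> float:
    """Closed-form arc length |r_b - r_a| / |cos(lam)|, valid for lam != pi/2."""
    r_a = spiral_radius(theta_a, r0, lam)
    r_b = spiral_radius(theta_b, r0, lam)
    return abs(r_b - r_a) / abs(math.cos(lam))


# Hypothetical parameters: 60 km initial polar diameter, inward spiral.
R0 = 60000.0
LAM = math.radians(100.0)
for theta in (0.0, 1.0, 3.0):
    angle = math.degrees(tangent_radius_angle(theta, R0, LAM))
    print(f"theta = {theta:.1f} rad, tangent-radius angle = {angle:.3f} deg")
print(f"length over one turn = {arc_length(0.0, 2.0 * math.pi, R0, LAM):.1f} m")
```

With ${\rm{\Lambda }} \gt \pi /2$ in this assumed form the radius shrinks as the polar angle grows, which corresponds to a trajectory winding inward toward the pole.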
Figure 3 in He et al. [Reference He, Yan and Tang20] shows the geometric representation of the spiral trajectory in the yaw plane, where ${M_0}$ and ${M_s}$ are the initial position and the current desired position of the vehicle. $T$ is the target position. $p{x_p}{z_p}$ denotes the polar frame: the pole $p$ coincides with the rotation centre of the spiral trajectory, the polar axis $p{z_p}$ points to ${M_0}$ , and the polar axis $p{x_p}$ is perpendicular to $p{z_p}$ . ${r_{s0}}$ , ${r_s}$ and ${r_{s1}}$ are the polar diameters at ${M_0}$ , ${M_s}$ and $T$ , and the corresponding polar angles and deflection angles are ${\vartheta _0}$ , $\vartheta $ , ${\vartheta _1}$ and ${\psi _{vs0}}$ , ${\psi _{vs}}$ , ${\psi _{vs1}}$ , respectively. $\eta $ is the rotation angle of the polar frame with respect to the inertial frame. Letting $\left( {{x_0},{z_0}} \right)$ and $\left( {{x_1},{z_1}} \right)$ represent the coordinates of ${M_0}$ and $T$ , respectively, the initial condition set ${H_0}$ and terminal constraint set ${H_1}$ of the vehicle can be defined as
It is worth noting that the sign convention for the deflection angle in this paper is opposite to that in He et al. [Reference He, Yan and Tang20]. Therefore, the expressions used to determine the spiral trajectory parameters are re-derived here. Referring to Fig. 3 in He et al. [Reference He, Yan and Tang20], the main geometrical relation can be expressed as follows:
Substituting ${\psi _{vs0}}$ and ${\psi _{vs1}}$ into Equation 13, we get
The coordinates of the pole $p$ can be solved by
where ${K_0}$ and ${K_1}$ are slopes of the rays $p{M_0}$ and $pT$ , respectively, and their expressions are:
Refer to He et al. [Reference He, Yan and Tang20] for the calculation of ${x_p}$ and ${z_p}$ when ${K_0}$ and ${K_1}$ do not exist. Next, the lengths of the polar diameters ${r_{s0}}$ and ${r_{s1}}$ can be calculated:
Dividing Equation 19 by Equation 18 and combining Equations 12 and 17 yields
where $\mu = {\rm{arctan}}\left[ {\left( {{x_0} - {x_1}} \right)/\left( {{z_0} - {z_1}} \right)} \right]$ .
Once the values in the sets ${H_0}$ and ${H_1}$ are given, ${\rm{\Lambda }}$ can be acquired by solving Equation 20. By substituting ${\rm{\Lambda }}$ into Equations 16, 18, 14 and 15, the pole coordinates $\left( {{x_p},{z_p}} \right)$ , the initial polar diameter ${r_0}$ , the rotation angle $\eta $ of the polar frame with respect to the inertial frame and the terminal polar angle ${\vartheta _1}$ can be calculated, respectively.
Remark 2. This subsection presents the procedure for calculating the spiral trajectory parameters in the yaw plane. Without loss of generality, the spiral trajectory in three-dimensional space can be obtained by stretching the yaw plane spiral trajectory along the vertical direction.
2.3 Neural networks in reinforcement learning
Neural networks are an important part of reinforcement learning and are powerful in coping with nonlinearities. With the help of neural networks, a nonlinear function $f$ can be expressed as
where $W \in {\mathbb{R}^l}$ is the weight vector, $l$ is the number of nodes in the hidden layer. $\bar Z = {\left[ {{{\bar z}_1},{{\bar z}_2}, \ldots, {{\bar z}_m}} \right]^T} \in {\mathbb{R}^m}$ are the inputs of neural networks with dimension $m$ . ${\rm{\Phi }}\left( {\bar Z} \right) = {\left[ {{\varphi _{b1}},{\varphi _{b2}}, \ldots {\varphi _{bl}}} \right]^T} \in {\mathbb{R}^l}$ are the basis functions. $\varepsilon \left( {\bar Z} \right)$ is the function reconstruction error. The optimal approximation can be obtained by properly selecting the number of network nodes.
Radial basis function (RBF) neural networks are selected as the basic network structure in this paper, and their basis functions can be described as
where ${\zeta _j} = {\left[ {{\zeta _{j1}},{\zeta _{j2}}, \ldots, {\zeta _{jm}}} \right]^T}$ is the centre vector of the $j$ -th node in the hidden layer. ${\varsigma _j}$ is the width value.
Lemma 3. [Reference Ouyang, Dong, Wei and Sun31] The basis function ${\rm{\Phi }}\left( {\bar Z} \right)$ of neural networks is bounded, which satisfies $\left\|{\rm{\Phi }}\left( {\bar Z} \right)\right\| \le {{\rm{\Phi }}_M}$ and $\left\|\dot{\Phi}\left( {\bar Z} \right)\right\| \le {{\rm{\Phi }}_{dM}}$ , where ${{\rm{\Phi }}_M}$ and ${{\rm{\Phi }}_{dM}}$ are positive constants.
Lemma 4. [Reference Ouyang, Dong, Wei and Sun31] If the ideal weight ${W^{\rm{*}}}$ is obtained, then there exists $\left| {\varepsilon \left( {\bar Z} \right)} \right| \le {\varepsilon _m}$ and $\left| {\dot \varepsilon \left( {\bar Z} \right)} \right| \le {\varepsilon _{dm}}$ , where ${\varepsilon _m}$ and ${\varepsilon _{dm}}$ are positive constants.
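The network structure of this subsection can be sketched in a few lines. The sketch assumes a Gaussian basis ${\varphi _{bj}} = \exp \left( { - \|\bar Z - {\zeta _j}\|^2/\varsigma _j^2} \right)$; the exact scaling of the width term in the paper's basis function may differ, so this form is an assumption, and the centres, widths and weights are hypothetical. Since every Gaussian basis value lies in $\left( {0,1} \right]$, the norm of ${\rm{\Phi }}$ cannot exceed $\sqrt l $, which is consistent with the boundedness stated in Lemma 3.

```python
import math


def rbf_basis(z, centres, widths):
    """ASSUMED Gaussian basis: phi_j = exp(-||z - zeta_j||^2 / width_j^2)."""
    phi = []
    for zeta, width in zip(centres, widths):
        sq_dist = sum((zi - ci) ** 2 for zi, ci in zip(z, zeta))
        phi.append(math.exp(-sq_dist / width ** 2))
    return phi


def rbf_output(weights, phi):
    """Network output W^T * Phi(Z) for a scalar-valued approximation."""
    return sum(w * p for w, p in zip(weights, phi))


# Hypothetical network with l = 3 hidden nodes and m = 2 inputs.
centres = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
widths = [1.0, 1.0, 1.0]
weights = [0.5, -0.2, 0.8]

phi = rbf_basis((0.0, 0.0), centres, widths)
output = rbf_output(weights, phi)
phi_norm = math.sqrt(sum(p * p for p in phi))
print(f"Phi = {[round(p, 4) for p in phi]}, ||Phi|| = {phi_norm:.4f}, output = {output:.4f}")
```

In this sketch the node centred at the input itself returns a basis value of one, while the other two nodes contribute $e^{-2}$ each, showing how the width controls the locality of each node.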
3.0 Main results
This section presents the reinforcement learning spiral-diving manoeuver guidance method proposed in this paper. The concept of the virtual sliding target is employed to design the desired proportional navigation guidance law. With the help of the coordinate transformation technique, the controller design challenge arising from the coupling of control variables in tracking the desired guidance commands is overcome. The actor-critic networks and the corresponding adaptive weight update laws are then designed to approximate the unknown disturbances. After proving that the tracking errors are uniformly ultimately bounded by using the Lyapunov method, the range of the guidance parameters is derived. The diagram of the proposed reinforcement learning spiral-diving manoeuver guidance framework for the reentry vehicle is shown in Fig. 1.
3.1 Desired proportional navigation guidance law
Figure 4 in He et al. [Reference He, Yan and Tang20] displays the motion of the vehicle and the virtual sliding target in the yaw plane, where $M$ is the projection of the current position of the vehicle in the yaw plane, and ${M_s}$ is the point on the spiral trajectory closest to $M$ .
Assumption 2. The polar angle at $M$ is the same as the polar angle at ${M_s}$ . Moreover, in order to keep the shape of the spiral trajectory invariant, the pole $p$ is assumed to have the same dynamic properties as the target.
Under Assumption 2, the polar angle $\vartheta $ corresponding to $M$ can be obtained by solving the following equation:
The time derivative of Equation 23 can be arranged to obtain the time derivative of $\vartheta $ :
where ${V_t}$ and ${\psi _{vt}}$ are the magnitude and direction angle of the target velocity, respectively. Note that ${\psi _{vt}}$ is undefined when the target is stationary, i.e., ${V_t} = 0$ .
The trajectory of the virtual sliding target $T{\rm{'}}$ is designed based on the curve involute principle and is denoted as
where ${r_{vt}} = {\left[ {{x_{vt}},{z_{vt}}} \right]^T}$ represents the coordinate vector of $T{\rm{'}}$ . ${l_{go}}$ is the remaining length of the spiral trajectory and its value can be obtained by integrating Equation 12. The time derivative of Equation 25 yields
The virtual line-of-sight deflection angle from the current position of $T{\rm{'}}$ pointing to $M$ is defined as
Taking the time derivative of Equation 27, and combining with Equation 1 and Equation 26, the following equation can be obtained:
where
$s = \sqrt {{{\left( {x - {x_{vt}}} \right)}^2} + {{\left( {z - {z_{vt}}} \right)}^2}} $ is the remaining flight distance.
Similarly, the virtual line-of-sight path angle from the current position of $T{\rm{'}}$ pointing to $M$ is defined as
Combining Equations 1 and 26, the time derivative of Equation 31 can be derived:
where $r = \sqrt {{s^2} + {y^2}} $ is the distance of the vehicle from the virtual target.
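The two range quantities defined above, the remaining flight distance $s$ in the yaw plane and the spatial distance $r$ to the virtual sliding target, can be computed directly from their definitions. The sketch below uses hypothetical positions for the vehicle and the virtual sliding target; only the two formulas come from the text.

```python
import math


def remaining_flight_distance(x: float, z: float,
                              x_vt: float, z_vt: float) -> float:
    """s = sqrt((x - x_vt)^2 + (z - z_vt)^2), the yaw-plane range."""
    return math.hypot(x - x_vt, z - z_vt)


def distance_to_virtual_target(s: float, y: float) -> float:
    """r = sqrt(s^2 + y^2), the spatial range including the altitude y."""
    return math.hypot(s, y)


# Hypothetical vehicle position (m) and virtual sliding target position (m).
x, y, z = 3000.0, 12000.0, 4000.0
x_vt, z_vt = 0.0, 0.0

s = remaining_flight_distance(x, z, x_vt, z_vt)
r = distance_to_virtual_target(s, y)
print(f"s = {s:.1f} m, r = {r:.1f} m")
```

Note that $r$ uses the vehicle altitude $y$ directly, which matches the definition in the text and implies that the virtual sliding target moves in the horizontal plane at zero height.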
Furthermore, taking the time derivative of Equation 13 yields
Based on Equations 28, 32 and 33, the desired proportional navigation guidance law of the vehicle with respect to the virtual sliding target, as shown in Equations 34 and 35, can be designed:
where ${\lambda _1}$ and ${\lambda _2}$ are user-defined guidance parameters, and their value ranges will be determined later.
3.2 Reinforcement learning adaptive controller
For the first-order multivariate tightly coupled system 4 considering the effects of unknown disturbances, treat ${g_1}\left( {\xi, u} \right)$ as the virtual control variable and define
where $\upsilon $ is the virtual control law.
Taking the time derivative of Equation 7 and combining it with Equation 36 yields
so the virtual controller can be designed as
where ${k_1} \gt 0$ is a user-defined control gain. $\hat d$ is the estimation of $d$ .
Taking the time derivative of Equation 36 leads to
so the time derivative of the controller $u$ can be designed as
where ${k_2} \gt 0$ is a user-defined control gain. By integrating Equation 40, $u$ can be obtained.
From Equations 38 and 40, it is clear that obtaining an estimate of $d$ is a prerequisite for designing the controller $u$ . The actor-critic networks in reinforcement learning provide an effective way to solve this problem.
In the framework of the actor network, $d$ can theoretically be expressed by
where ${{\rm{\Phi }}_a} \in {\mathbb{R}^{{l_a}}}$ is the basis function of dimension ${l_a}$ , which satisfies $\|{{\rm{\Phi }}_a}\| \le {{\rm{\Phi }}_{aM}}$ . ${\varepsilon _a} \in {\mathbb{R}^2}$ is the actor reconstruction error and satisfies $\|{\varepsilon _a}\| \le {\varepsilon _{am}}$ . $W_a^{\rm{*}} \in {\mathbb{R}^{{l_a} \times 2}}$ is the ideal actor network weight. In practice, only an estimate of $d$ is available:
where ${\hat W_a} \in {\mathbb{R}^{{l_a} \times 2}}$ is the estimated weight of the actor network.
In the framework of the critic network, the integral penalty function can be designed as
where ${\mathcal{L}}\left( t \right) = e_1^TQ{e_1}$ , $Q \in {\mathbb{R}^{2 \times 2}}$ is a positive definite matrix. $J$ can theoretically be expressed by
where ${{\rm{\Phi }}_c} \in {\mathbb{R}^{{l_c}}}$ is the basis function of dimension ${l_c}$ , which satisfies $\|{\dot{\Phi}_c}\| \le {{\rm{\Phi }}_{cdM}}$ . ${\varepsilon _c}$ is the critic reconstruction error and satisfies $\left| {{{\dot \varepsilon }_c}} \right| \le {\varepsilon _{cdm}}$ . $W_c^{\rm{*}} \in {\mathbb{R}^{{l_c}}}$ is the ideal critic network weight. In practice, only an estimate of $J$ is available:
where ${\hat W_c} \in {\mathbb{R}^{{l_c}}}$ is the estimated weight of the critic network.
Define the weight error of the critic network as ${\tilde W_c} = {\hat W_c} - W_c^{\rm{*}}$ . In addition, define the critic error as
and the critic error function can be designed as
According to the gradient descent criterion, the adaptive update law of ${\hat W_c}$ can be deduced as follows:
where ${\lambda _c} \gt 0$ and ${\hbar _c} \gt 0$ are the user-defined learning rates of the critic network.
Define the weight error of the actor network as ${\tilde W_a} = {\hat W_a} - W_a^{\rm{*}}$ , and define the approximation error ${H_a}$ as
Then, the actor error can be defined as
where ${{\rm{\Omega }}_a} \in {\mathbb{R}^{2 \times 1}}$ is the user-defined gain matrix satisfying $\|{{\rm{\Omega }}_a}\| \le {{\rm{\Omega }}_{aM}}$ , and ${{\rm{\Omega }}_{aM}}$ is a positive constant.
Furthermore, the actor error function can be designed as
According to the gradient descent criterion, the adaptive update law of ${\hat W_a}$ can be deduced as follows:
where $a \gt 0$ and ${\hbar _a} \gt 0$ are the user-defined learning rates of the actor network.
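The idea of compensating an unknown disturbance with an adaptively updated weight can be illustrated on a deliberately simplified toy problem. The sketch below is not the actor-critic update law of this paper. It swaps in a single-weight estimator with a constant basis and a standard leakage ($\sigma $-modification) adaptive law, applied to a hypothetical scalar error dynamics $\dot e = - ke + d - \hat d$. All names, gains and the disturbance value are assumptions made for illustration.

```python
def simulate_tracking(adaptive: bool, k: float = 2.0, gamma: float = 10.0,
                      sigma: float = 0.01, d: float = 1.0,
                      t_end: float = 20.0, dt: float = 1e-3):
    """Euler simulation of de/dt = -k*e + d - d_hat with optional adaptation.

    Toy stand-in for disturbance compensation (NOT the paper's actor-critic):
      d_hat     = w_hat * phi, with a constant basis phi = 1
      dw_hat/dt = gamma * (phi * e - sigma * w_hat)   (leakage adaptive law)
    """
    phi = 1.0
    e, w_hat = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        d_hat = w_hat * phi if adaptive else 0.0
        e_dot = -k * e + d - d_hat
        w_dot = gamma * (phi * e - sigma * w_hat) if adaptive else 0.0
        e += dt * e_dot
        w_hat += dt * w_dot
    return e, w_hat


e_plain, _ = simulate_tracking(adaptive=False)
e_adaptive, w_final = simulate_tracking(adaptive=True)
print(f"steady-state error without compensation: {e_plain:.4f}")
print(f"steady-state error with compensation:    {e_adaptive:.4f}")
print(f"estimated disturbance: {w_final:.4f}")
```

Without compensation the error settles near $d/k$, while the adaptive estimate drives it down to a small residual set by the leakage term. This mirrors, in a much simpler setting, the uniformly ultimately bounded behaviour that the stability analysis establishes for the full scheme.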
3.3 Stability and convergence analysis
Theorem 1. Consider Assumptions 1–2 and Lemmas 1–4. If the control law 40 is designed for system 4, the actor-critic networks 42 and 45 and the corresponding adaptive weight update laws 48 and 52 are designed to cope with $d$ , and the Lyapunov candidate function shown in Equation 53 is constructed, then the tracking error ${e_1}$ is uniformly ultimately bounded. Likewise, the weight errors ${\tilde W_a}$ and ${\tilde W_c}$ of the actor-critic networks are uniformly ultimately bounded.
Proof. Construct the Lyapunov candidate function as follows:
where
By taking the time derivative of Equation 54 and combining Equations 37–42, it can be deduced that
By taking the time derivative of Equation 55 and combining Equation 52, it can be deduced that
Similarly, by taking the time derivative of Equation 56 and combining Equation 48, it can be deduced that
Finally, by taking the time derivative of Equation 53 and substituting Equations 57–59, we get
Equation 60 satisfies $\dot L \le - \kappa L\left( t \right) + \delta $ under the condition that $\left( {{k_1} - 1} \right) \gt 0$ , $\left( {{k_2} - Tr\left( {\frac{{\partial g}}{{\partial \xi }}{{\left( {\frac{{\partial g}}{{\partial \xi }}} \right)}^T}} \right)} \right) \gt 0$ , $\left( {{\hbar _a} - 3{\rm{\Phi }}_{aM}^2} \right) \gt 0$ and $\left( {{\hbar _c} - {\rm{\Phi }}_{cdM}^2 - {\rm{\Phi }}_{cM}^2{\rm{\Omega }}_{aM}^2} \right) \gt 0$ , where
Therefore, ${e_1}$ , ${e_2}$ , ${\tilde W_a}$ and ${\tilde W_c}$ are uniformly ultimately bounded.
Remark 3. The uniform ultimate boundedness of ${e_1}$ indicates that $\theta $ and ${\psi _v}$ converge to small neighbourhoods of ${\theta _d}$ and ${\psi _{vd}}$ as $t \to \infty $ , that is, the objective of this paper highlighted in Remark 1 is satisfied. Similarly, the uniform ultimate boundedness of ${\tilde W_a}$ and ${\tilde W_c}$ indicates that ${\hat W_a}$ and ${\hat W_c}$ remain in small neighbourhoods of $W_a^{\rm{*}}$ and $W_c^{\rm{*}}$ as $t \to \infty $ , so that $\hat d$ approximates $d$ closely. In conclusion, the designed actor-critic networks and the corresponding adaptive weight update laws can cope with unknown disturbances well.
Theorem 2. For the spiral trajectory in the yaw plane, consider Theorem 1 and the geometric relationship shown in Fig. 4 of He et al. [Reference He, Yan and Tang20]. In addition, let the initial angle between the velocity vector of the vehicle and the virtual line-of-sight be such that $\left| {\delta {\psi _v}\left( 0 \right)} \right| \lt \frac{\pi }{2}$ . If $\left| \theta \right| \lt \frac{\pi }{2}$ , the guidance parameter ${\lambda _1} \lt - 1$ renders $s \to 0$ as $t \to \infty $ , regardless of the value of $V$ . The guidance parameter ${\lambda _1} \lt - 2$ not only makes the flight trajectory converge to the spiral trajectory, but also makes the velocity vector of the vehicle converge to the virtual line-of-sight, meaning that $\varphi - {\psi _v} \to - \frac{\pi }{2}$ and $\varphi - {\psi _{vs}} \to - \frac{\pi }{2}$ .
Proof. It follows from Theorem 1 that
where $\left| {{e_{12}}} \right|$ is an arbitrarily small constant. The time derivative of Equation 61 gives
Taking the time derivative of Equation 29 and substituting Equations 28, 34 and 62, the following equation can be derived:
Neglecting the second-order small quantity $\dot \vartheta {\rm{si}}{{\rm{n}}^2}\left( {{\rm{\Delta }}\varphi } \right)$ and the action term $\frac{{{V_t}}}{s}{\rm{cos}}\left( {{\psi _{vt}} - \varphi } \right)$ of the low-speed moving target in Equation 63, it can be rewritten as
A similar treatment of the time derivative of $s$ produces
Dividing Equation 65 by Equation 64 gives Equation 66:
Integrating Equation 66 gives Equation 67:
where $\ell \gt 0$ is the bounded integration constant.
If $\delta {\psi _v}\left( t \right)$ satisfies $0 \lt \delta {\psi _v}\left( 0 \right) \lt \frac{\pi }{2}$ at $t = 0$ , then by substituting Equation 67 into Equation 64, we can get
Note that $V{\rm{cos}}\theta \gt 0$ holds in every flight state because $\left| \theta \right| \lt \frac{\pi }{2}$ . In the case $0 \lt \delta {\psi _v}\left( 0 \right) \lt \frac{\pi }{2}$ , when the guidance parameter ${\lambda _1} \lt - 1$ , there is $\delta {\dot \psi _v} \lt 0$ , which indicates that $\delta {\psi _v} \to 0$ as $t \to \infty $ . From Equation 67, it can be found that the remaining flight distance between the vehicle and the virtual sliding target $s \to 0$ as $\delta {\psi _v} \to 0$ . Furthermore, from Equation 29, it can be found that $\varphi - {\psi _v} \to - \frac{\pi }{2}$ as $\delta {\psi _v} \to 0$ . From Equation 68, when ${\lambda _1} \lt - 2$ , there is $\delta {\dot \psi _v} \to 0$ as $\delta {\psi _v} \to 0$ , so $\dot \varphi - {\dot \psi _v} \to 0$ , and combining Equations 34 and 62, it can be observed that ${\dot \psi _{vs}} \to {\dot \psi _v}$ . Therefore, the flight trajectory converges to the spiral trajectory. This means that ${\rm{\Delta }}\varphi \to 0$ , i.e. $\varphi - {\psi _{vs}} \to - \frac{\pi }{2}$ . Together, $\delta {\psi _v} \to 0$ and ${\rm{\Delta }}\varphi \to 0$ indicate that the velocity vector of the vehicle converges to the virtual line-of-sight. The same conclusion can be obtained when $ - \frac{\pi }{2} \lt \delta {\psi _v}\left( 0 \right) \lt 0$ .
Theorem 3. Considering Theorem 1 and Theorem 2, the guidance parameter ${\lambda _2} \gt 1$ renders the distance from the vehicle to the virtual sliding target $r \to 0$ as $t \to \infty $ . The guidance parameter ${\lambda _2} \gt 2$ also renders $\phi + \theta \to 0$ .
Proof. It follows from Theorem 1 that
where $\left| {{e_{11}}} \right|$ is an arbitrarily small constant. The time derivative of Equation 69 gives
Define $\sigma = \phi + \theta $ . Taking its time derivative and combining Equations 35 and 70, we get
Taking the time derivative of $r$ , and then neglecting the low-speed moving target action term and noting that $\delta {\psi _v} \to 0$ , ${\rm{\Delta }}\varphi \to 0$ as $t \to \infty $ , yields
Similarly, there is
Dividing Equation 72 by Equation 71 yields
Integrating Equation 74 gives Equation 75:
where $\ell {\rm{'}} \gt 0$ is the bounded integration constant. Combining Equations 71, 73 and 75 gives Equation 76:
When $0 \lt \sigma \left( 0 \right) \lt \pi $ , if the guidance parameter ${\lambda _2} \gt 1$ , then $\dot \sigma \lt 0$ . Therefore $\sigma \to 0$ as $t \to \infty $ . From Equation 75, it can be found that the spatial distance from the vehicle to the virtual sliding target $r \to 0$ as $\sigma \to 0$ . If ${\lambda _2} \gt 2$ , there is $\dot \sigma \to 0$ . Because $\dot \sigma = \left( {1 - {\lambda _2}} \right)\dot \phi $ , it follows that $\dot \phi \to 0$ . Thus, $\phi $ approaches a constant and $\theta $ approaches the negative of the same constant. That is, in the pitch plane, $\phi + \theta \to 0$ . The same conclusion can be obtained when $ - \pi \lt \sigma \left( 0 \right) \lt 0$ .
Remark 4. As can be seen from Equations 61, 62 and Equations 69, 70, the assumptions that ${\dot \psi _v} = {\dot \psi _{vd}}$ and $\dot \theta = {\dot \theta _d}$ made in He and Yan [Reference He and Yan19] and He et al. [Reference He, Yan and Tang20] are verified.
4.0 Simulations
In this section, simulations are presented to demonstrate the validity and superiority of the proposed reinforcement learning based adaptive spiral-diving guidance method. Specifically, the validity of the proposed method is verified by striking a stationary target and a low-speed moving target. For convenience, the former is denoted as Case 1, and the latter is denoted as Case 2. In addition, the superiority of the proposed method is demonstrated by comparing it with two alternatives for handling unknown disturbances: a method without RL and a method using RBF neural networks.
The parameters of the vehicle are: vehicle mass $m = 200$ kg and reference area ${S_{ref}} = 1.8$ m $^2$ . The initial position is ${\left( {{x_0},{y_0},{z_0}} \right)^T} = {\left( {0,32,60} \right)^T}$ km, and the initial velocity is ${V_0} = 1200$ m/s. The initial path angle is ${\theta _0} = - {2^ \circ }$ , and the initial deflection angle is ${\psi _{v0}} = {28^ \circ }$ . The terminal deflection angle is ${\psi _{vf}} = {363^ \circ }$ . Moreover, the gravitational acceleration is $g = 9.81$ m/s $^2$ . The position of the stationary target is ${\left( {0,0,0} \right)^T}$ km, which is also the starting point of the low-speed moving target. Note that the low-speed target moves only in the horizontal plane with velocity ${V_t} = 9$ m/s and direction angle ${\psi _{vt}} = - {90^ \circ }$ . Other parameters of the two cases are shown in Table 1.
The simulation results for Case 1 and Case 2 are shown in Figs. 2 and 3, respectively. The 3-D spiral trajectories of the vehicle in the two cases are shown in Figs. 2(a) and 3(a), and the corresponding trajectories of the vehicle, the target and the virtual sliding target in the yaw plane are shown in Figs. 2(b) and 3(b). They indicate that the vehicle is able to hit the target in both cases, with miss distances of 0.449 m and 0.6092 m, respectively. Figures 2(c) and 3(c) show the vehicle velocity profiles in the two cases. Figures 2(d) and 3(d) show the real path angle versus the desired path angle, and Figs. 2(e) and 3(e) show the real deflection angle versus the desired deflection angle. At the moment of hitting the target, the desired path angle and the desired deflection angle exhibit a small jump. The reason is that the vehicle needs to slow its descent rate in the y-direction in order to adjust its motion in the x- and z-directions and reduce the miss distance. The tracking errors of the path angle and the deflection angle in the two cases are shown in Figs. 2(f) and 3(f). The errors remain bounded, which is consistent with the uniform ultimate boundedness of the tracking error. The control input profiles under the reinforcement learning based adaptive law for the two cases are depicted in Figs. 2(g) and 3(g). Figures 2(h) and 3(h) show the adaptive adjustment profiles of the weights in the two cases, and their effects are verified in Figs. 2(i) and 3(i). In other words, the unknown disturbances are well compensated. In brief, the above simulation results demonstrate the validity of the proposed method.
Without loss of generality, a comparative simulation based on Case 2 is carried out among the proposed method, the Without RL method and the RBF method. The striking effects of the three methods are shown in Fig. 4. As shown in Table 2, the miss distances under the three methods are 0.6092 m, 0.8718 m and 0.7602 m, respectively. The strike accuracy of the proposed method is thus improved by 30.12% and 19.86% compared with the Without RL method and the RBF method, respectively.
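The quoted improvements correspond to the relative reduction in miss distance with respect to each baseline. The short sketch below reproduces that arithmetic from the three reported miss distances; the function name `relative_improvement` is an illustrative choice and is not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Percentage reduction in miss distance relative to a baseline method."""
    return (baseline - proposed) / baseline * 100.0


# Miss distances (m) for Case 2 as reported in the text.
MISS_PROPOSED = 0.6092
MISS_WITHOUT_RL = 0.8718
MISS_RBF = 0.7602

gain_vs_without_rl = relative_improvement(MISS_WITHOUT_RL, MISS_PROPOSED)
gain_vs_rbf = relative_improvement(MISS_RBF, MISS_PROPOSED)

print(f"improvement over Without RL method: {gain_vs_without_rl:.2f} %")
print(f"improvement over RBF method: {gain_vs_rbf:.2f} %")
```

Rounded to two decimals, the two values agree with the 30.12% and 19.86% stated above.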
5.0 Conclusion
In this paper, a reinforcement learning based adaptive method has been developed for a class of spiral-diving manoeuver guidance problems of reentry vehicles subject to unknown disturbances. By designing the actor-critic networks and the corresponding adaptive weight update laws, the unknown disturbances are well compensated. In addition, by introducing the coordinate transformation technique, the controller design problem caused by the coupling of control variables is overcome. As a result, a novel reinforcement learning based adaptive guidance framework has been constructed such that the desired guidance commands can be tracked stably. Numerical simulations have been provided to demonstrate the validity and superiority of the proposed method. Building on this work, we will study the cooperative spiral-diving guidance of reentry vehicle formations.
Acknowledgements
This work was supported by the Foundation of National Key Laboratory of Science and Technology on Test Physics and Numerical Mathematics and The Foundation of Shanghai Astronautics Science and Technology Innovation, China.