
Multi-objective reward shaping for global and local trajectory planning of wing-in-ground crafts based on deep reinforcement learning

Published online by Cambridge University Press:  14 June 2023

H. Hu
Affiliation:
State Key Laboratory of Structural Analysis for Industrial Equipment, School of Naval Architecture Engineering, Dalian University of Technology, Dalian, China
D. Li
Affiliation:
School of Aeronautic Science and Engineering, Beihang University, Beijing, China
G. Zhang*
Affiliation:
State Key Laboratory of Structural Analysis for Industrial Equipment, School of Naval Architecture Engineering, Dalian University of Technology, Dalian, China Collaborative Innovation Center for Advanced Ship and Deep-Sea Exploration, Shanghai, China
Z. Zhang
Affiliation:
State Key Laboratory of Structural Analysis for Industrial Equipment, School of Naval Architecture Engineering, Dalian University of Technology, Dalian, China
*
Corresponding author: G. Zhang; Email: [email protected]

Abstract

The control of a wing-in-ground craft (WIG) must usually accommodate several needs, such as cruising, speed, survival and stealth. Different degrees of emphasis on these requirements lead to different trajectories, but no method has yet integrated and quantified them. Moreover, in most previous studies on other vehicles, the multi-objective trajectory is planned globally, and local planning is lacking. For the multi-objective trajectory planning of WIGs, this paper proposes a multi-objective function in polynomial form, in which each item represents an independent requirement and is adjusted by a linear or exponential weight; the magnitude of each weight indicates how much relative attention is paid to the corresponding demand. Trajectories of a virtual WIG model above a wave-trough terrain are planned using reward shaping based on the introduced multi-objective function and deep reinforcement learning (DRL). Two conditions, global and local, are considered: a single scheme of weights is assigned to the whole environment, or two different schemes of weights are assigned to the two parts of the environment. The effectiveness of the multi-objective reward function is analysed from the local and global perspectives. The reward function provides WIGs with a universal framework in which the magnitude of the weights can be adjusted to meet different degrees of requirements on cruising, speed, stealth and survival, and it helps guide a WIG toward an expected trajectory in engineering practice.

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of Royal Aeronautical Society

Nomenclature

$c$    chord of the WIG’s main wing
$CG$    centre of the WIG’s gravity
${L_1}$    vertical distance between the centre of thrust and $CG$ during level flight
${L_2}$    vertical distance from $CG$ to the hull bottom during level flight
${F^i}$    inertial coordinate frame
${F^b}$    body coordinate frame of the WIG
${F^v}$    vehicle coordinate frame (axes aligned with ${F^i}$)
${F_x}$    drag
${F_z}$    lift
${M_y}$    moment
lr    learning rate
${v_{cr}}$    cruising speed
${\rm{\Delta }}t$    timestep
d    span covered in one timestep when cruising
${h_{ub}}$    altitude of the upper boundary
${h_{cr}}$    altitude of cruising
${h_{lb}}$    altitude of the lower boundary
${h_{sur}}$    altitude of the surface
${w_{l,cr}}$    linear weight of the requirement for cruising
${w_{e,cr}}$    exponential weight of the requirement for cruising
${w_{l,spd}}$    linear weight of the requirement for speed
${w_{e,spd}}$    exponential weight of the requirement for speed
${w_{l,bnd}}$    linear weight of the requirement for bounded constraints
${w_{e,bnd}}$    exponential weight of the requirement for bounded constraints
${V_s}$    horizontal speed

Greek symbol

$\phi $    potential function
${\phi _{cr}}$    performance for cruising
${\phi _{spd}}$    performance for speed
${\phi _{ste}}$    performance for stealth
${\phi _{sur}}$    performance for survival
${\phi _{bnd}}$    performance for bounded constraints
${{\rm{\Phi }}_{cr}}$    reward function for the requirement of cruising
${{\rm{\Phi }}_{spd}}$    reward function for the requirement of speed
${{\rm{\Phi }}_{ste}}$    reward function for the requirement of stealth
${{\rm{\Phi }}_{sur}}$    reward function for the requirement of survival
${{\rm{\Phi }}_{bnd}}$    reward function for the requirement of bounded constraints
${\rm{\Phi }}$    reward function

1.0 Introduction

When a wing-in-ground craft (WIG) flies over the ground or a wave surface, the ground-effect zone extends one chord of the main wing in height, within which the WIG exploits the ground effect to enhance its lift. In addition, to reduce the induced drag and achieve a higher lift-to-drag ratio, the WIG’s flying altitude should stay within one span of its main wing. The requirements on the movement of a WIG usually involve four aspects (Fig. 1). Firstly, it should fly at a fixed altitude, as in cruising, for steady control, because the aerodynamics of a WIG in the ground-effect zone vary strongly and nonlinearly with the state, including altitude, which undermines the WIG’s stability when the altitude changes. Secondly, the speed of the WIG is a focus when it participates in emergency rescue, disaster relief or commercial activities. Thirdly, the WIG takes advantage of sea clutter for stealthy motion when operating near the surface; owing to its low-flying characteristics and very low signature to surface radar, it can serve as a stealth fighter in military applications. In general, the lower a WIG operates, the greater its transport and stealth capacity, but the risk of colliding with the surface or of wave slamming also increases. So, fourthly, the WIG’s survival must be considered. However, the four requirements may conflict in practical operation. For instance, when a WIG skims over a wave, the demand for high speed leads it to stride over the trough or cut through the crest, causing deviation from the cruise altitude and weakening its stealth or survivability. Consequently, an allocation of weights over the four aspects is required to meet the specific demands placed on the WIG in practice.

Figure 1. Bounded constraints and the optimal cruising path when a WIG operates.

Finding a compromise among the four requirements for WIGs is a typical multi-objective optimisation problem. DRL has shown its strength in a growing number of engineering areas [Reference Brunke, Greeff, Hall, Yuan, Zhou, Panerati and Schoellig1], and it has been applied to multi-objective problems in previous studies.

For vehicles in the air, Dooraki et al. [Reference Dooraki and Lee2] trained an unmanned aerial vehicle (UAV) for autonomous navigation in static and dynamically challenging environments. They defined several rewards, including avoiding collisions, moving to the next area, reaching the goal location and keeping movement. Xu et al. [Reference Xu, Jiang, Wang and Wang3] aimed to perform autonomous UAV obstacle avoidance and target tracking. Their reward function demonstrated that colliding with an obstacle or exceeding the environmental boundary is a symbol of failure, and if the pursuer chases the evader without colliding, the task is successful.

Concerning vehicles on the water’s surface, Wang et al. [Reference Wang, Luo, Li and Xie4] developed a reward function for autonomous obstacle avoidance control of an unmanned surface vessel (USV) by encouraging it to arrive at its destination as quickly as possible and punishing the occurrence of any collision or stalemate. Xu [Reference Xu, Lu, Liu and Zhang5] created intelligent collision avoidance technology for USV to ensure navigation safety. Rewards were divided into five parts, including sailing towards the target, keeping the right heading angle, stopping smoothly near the target point, avoiding collisions and obeying the standard for international shipping. Zhou et al. [Reference Zhou, Wu, Zhang, Guo and Liu6] generated collision-free trajectories for USVs in constrained maritime environments. They designed the rewards to include encouraging USVs to reach the goal area, preventing any collisions, ensuring the adjustment of USVs’ headings in a desired manner, finding the shortest path to destination and keeping a safe distance from obstacles.

As for underwater vehicles, Liu et al. [Reference Liu, Liu, Wu and Yu7] attempted the three-dimensional path following of an underactuated robotic dolphin in complex hydrodynamics, with a reward towards decreasing cross-track and vertical-track errors, encouraging it to move forward and avoid episode termination. Sun et al. [Reference Sun, Luo, Ran and Zhang8] solved the safe navigation problem of autonomous underwater vehicles (AUVs) in a complex and changeable environment with various mountains. For the design of the reward function, they considered the AUV’s tendency towards target, obstacle avoidance, stability of the heading and speed.

With regard to autonomous vehicles on the ground, Chen et al. [Reference Chen, Yuan and Tomizuka9] realised urban autonomous driving decision-making under complex road geometry and multi-agent interactions. They used five parts of reward, including encouraging forward movement, improving driving smoothness, penalising collisions with other surrounding vehicles, penalising running out of the lane and penalising stopping still. Deshpande and Spalanzani [Reference Deshpande and Spalanzani10] achieved autonomous navigation in structured urban environments among pedestrians. They set the rewards, including keeping a high speed, avoiding collisions with pedestrians and penalising stopping in a near collision situation. Wang et al. [Reference Wang, Wang and Cui11] developed an autonomous driving policy that takes into account the variety of traffic scenes and the uncertainty of interactions among surrounding vehicles. They considered the construction of rewards, including punishing the collision, smoothing the steering of the vehicle during cornering, encouraging running at a pre-set speed and minimising the cross-track error and heading angle error. Hu et al. [Reference Hu, Li, Yang, Bai, Li, Sun and Yang12] created a small-scale intelligent vehicle tracking and adaptive cruise control system. Their reward function aimed to minimise cross-track error from the centreline while increasing vehicle speed. Hu et al. [Reference Hu, Li, Hu, Song, Dong, Kong, Xu and Ren13] proposed a rear anti-collision decision-making methodology based on DRL for autonomous commercial vehicles. Their reward function is constructed by avoiding forward and rear collisions with the safety clearance threshold, preventing rollover, penalising the three behaviours mentioned before and improving driving smoothness. Ye et al. [Reference Ye, Cheng, Wang, Chan and Zhang14] took account of the problem of path planning for an autonomous vehicle that moves on a freeway. Their reward function reflected penalising collisions, deviations from the desired speed, unnecessary lane changes and accelerations or decelerations. Luo et al. [Reference Luo, Zhou and Wen15] developed the autonomous driving technique within the context of path following and collision avoidance. They defined the rewards as including motivating the vehicle to reach the goal, penalising the cross-track error and heading error and avoiding obstacles. Bakker and Grammatico [Reference Bakker and Grammatico16] employed deep reinforcement learning to automate driving on highways. Their reward design primarily penalised excessive speed, the occurrence of danger, collision, line change, overtaking and leaving the left lane, while encouraging the state on the destination lane. Schmidt et al. [Reference Schmidt, Kontes, Plinge and Mutschler17] attempted to make autonomous vehicles operate not only efficiently, but also safely and consistently. Their reward structure included encouraging high speed and safe distance and penalising crashes.

In addition to normal operations, there are also other integrated missions for autonomous driving. Ye et al. [Reference Ye, Cheng, Wang, Chan and Zhang14] proposed an automated lane change strategy based on lane change behaviours that overtake a slower vehicle and adapt to a merging lane ahead. Their strategy was centred around evaluation of lateral and longitudinal direction jerk, travel time and relative distance to the target lane, as well as the risk of collisions and near collisions. Xu et al. [Reference Xu, Pei and Lv18] put forward a method based on safe RL for a complex scenario where the number of vehicle lanes is reduced. Their reward components involved preventing collisions, maintaining the desired speed and avoiding meaningless lane changes. Lv et al. [Reference Lv, Pei, Chen and Xu19] created a motion planning strategy for autonomous driving tasks in highway scenarios in which an autonomous vehicle merges into a two-lane road traffic flow and performs lane-changing manoeuvers. They set rewards that encourage autonomous driving to maintain a desired velocity and for driving behaviour to be smooth, comfortable and safe. Car-following, human-car interaction and parking are also concerns. Peake et al. [Reference Peake, McCalmon, Raiford, Liu and Alqahtani20] used cooperative adaptive cruise control for platooning and multi-agent reinforcement learning. When setting the rewards, they considered the likelihood of collisions, the stability of the inter-vehicle distance and the travel time. Wurman et al. [Reference Wurman, Barrett, Kawamoto, MacGlashan, Subramanian, Walsh, Capobianco, Devlic, Eckert, Fuchs, Gilpin, Khandelwal, Kompella, Lin, MacAlpine, Oller, Seno, Sherstan, Thomure, Aghabozorgi, Barrett, Douglas, Whitehead, Dürr, Stone, Spranger and Kitano21] did research on automobile racing, which involves making real-time decisions in physical systems while interacting with humans. Their rewards consisted of course progress, off-course penalty, wall penalty, tire-slip penalty, passing bonus, any-collision penalty, rear-end penalty and unsporting-collision penalty. Zhang et al. [Reference Zhang, Chen, Song and Hu22] made reinforcement learning-based motion planning for automatic parking systems, taking the prevention of collisions with surrounding cars, comfort of punishing strenuous movement, parking efficiency of minimising the time for parking and final parking posture of the parking slot stage into consideration in the reward function.

To sum up, previous research has taken several factors into consideration. A closer look at these papers shows that each factor is multiplied by a weight and the weighted factors are then summed to form the reward function. Moreover, trajectory planning has mostly been conducted globally, while local trajectory planning is less investigated. The polynomial form is therefore worth applying to the multi-objective problem of WIGs and examining from both the global and local aspects.

The novelty of this paper lies in proposing a basic framework that allows WIGs to compromise among requirements including cruising, speed, stealth and survival to obtain an expected trajectory. In addition to the global trajectory planning widely performed by previous researchers, we also carry out local trajectory planning for a more comprehensive examination. The framework is designed in a polynomial form whose items describe these requirements, respectively. Items are enlarged or reduced with different linear and exponential weights to represent varying degrees of emphasis on these requirements. It serves as a reference for WIGs’ trajectory planning via DRL in order to meet the comprehensive demands.

The scientific contribution of this paper is a quantitative tool for WIGs’ global and local trajectory planning that accounts for different emphases on cruising, speed, stealth and survival. Assigning weights to each item of the proposed polynomial reward function corresponds to altering the emphasis placed on each requirement, and the expected trajectory can be obtained by adjusting the item weights.

In the following sections, we first briefly describe a small WIG model and how its flight is simulated. We then explain how the reward function is constructed. The final part examines the performance of the reward functions and their corresponding trajectories over a terrain built from piecewise lines shaped like one wave trough. Note that the WIG’s movement inside the corridor is restricted by the requirements for stealth and survival, which form an upper and a lower boundary; for convenience, the two components are merged into a single factor termed “bounded constraints”. The code for this study is available at https://github.com/HuanHu2019/Reward-shaping-for-WIGs-Multi-objective-trajectory-planning. Regarding the computational cost of this study, which uses DRL as the optimisation algorithm, the optimisation for a single case takes roughly eight days on average on a Dell Precision Tower 7910 computer with an Intel(R) Xeon(R) E5-2680 v3 2.50GHz CPU and 64.0 GB of RAM.

2.0 Model and method

2.1 Flight simulation

2.1.1 Model

In this paper, a virtual small WIG is used to investigate reward shaping. The main wing’s aerofoil is DHMYU2.65-20, 1.818-40, 1.515-60, -4.5-3, a thin S-type cross-section chosen for utilising the ground effect. The aerofoils of the vertical tail and horizontal stabiliser are NACA0002. The main wing’s chord length is 0.4m, with an aspect ratio of 3 and an angle of incidence of 5$^\circ$. The vertical tail and horizontal stabiliser share the same chord length of 0.24m, and their aspect ratios are 5/3 and 4, respectively. The angle of incidence of the horizontal stabiliser is 1.5$^\circ$. The main wing and horizontal stabiliser are both rectangular, while the vertical tail’s sweep angle is 45$^\circ$. Its overall dimensions are 1.74m in length, 1.2m in width and 0.425m in height. It weighs 1.2kg and has a moment of inertia of 0.1875kg$\cdot$m$^2$. With the chord length of the main wing written as $c$, Fig. 2(a) shows how the main components are arranged when the origin is fixed at 0.375$c$ from the main wing’s leading edge, and the measurement points for the vertical tail and horizontal stabiliser are placed at one-quarter chord intervals. To measure the WIG’s flying altitude, the vertical projection of the centre of gravity ($CG$) onto the WIG’s hull bottom is found first, and the distance from the projection point to the surface is then measured [Reference Yuan23]. Figure 3 depicts this measurement, and the flying altitude $h$ is obtained by

(1) \begin{align} h = z - {L_2} \cdot {\rm{sec}}\!\left( \alpha \right)\end{align}

where $z$ is the altitude of $CG$ , and $\alpha $ is the practical pitch angle. The notation of ${L_2}$ is the vertical distance from $CG$ to the hull bottom when $\alpha $ = ${0^ \circ }$ .

Figure 2. (a) Model and positional parameters of main parts. (b) Symmetric grid by VLM and image method.

Figure 3. Diagram of the forces and torques on the WIG.

2.1.2 Aerodynamics

Methods based on the Navier-Stokes equations for the WIG’s aerodynamics are impractical here because DRL demands a very large number of samples. With the aid of Tornado, a freeware MATLAB application [Reference Melin24], the flow over the WIG model is simulated using the vortex lattice method (VLM). The principle behind the simulation of potential flow is to solve the equation

(2) \begin{align} \!\left[ {\begin{array}{c@{\quad}c@{\quad}c@{\quad}c}{{w_{11}}} & {{w_{12}}} & \cdots & {{w_{1n}}} \\[5pt] {{w_{21}}} & \ddots & & \vdots \\[5pt] \vdots & & \ddots & \\[5pt] {{w_{n1}}} & \cdots & & {{w_{nn}}}\end{array}} \right] \cdot \!\left[ {\begin{array}{c}{{{\rm{\Gamma }}_1}}\\[5pt] \vdots \\[5pt] \vdots \\[5pt] {{{\rm{\Gamma }}_n}}\end{array}} \right] = \left[ {\begin{array}{c}{{b_1}}\\[5pt] \vdots \\[5pt] \vdots \\[5pt] {{b_n}}\end{array}} \right]\end{align}

where $w$ is the flow induced by each vortex through each panel, ${\rm{\Gamma }}$ is the vortex strength solved for, and $b$ is the flow through each panel determined by the flight condition. The detailed derivation can be found in Melin’s study [Reference Melin24]. By positioning an identical wing symmetrically below the ground plane, the image approach imitates the ground boundary and enforces the non-penetration boundary condition [Reference Barber, Leonardi and Archer25]. The grid is built from the two symmetric pieces, and its 3D panels, collocation points and normals are shown in Fig. 2(b).
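To make the structure of Equation (2) and the image method concrete, the sketch below solves a two-dimensional lumped-vortex analogue in Python: a flat plate is discretised into panels, each carrying a point vortex, and a mirror vortex of opposite sign is placed below the ground plane so that no flow passes through the surface. This is only an illustrative reduction of the approach, not Tornado’s 3D implementation; the panel count, air density and flight condition used here are assumptions.

```python
import numpy as np

def lift_in_ground_effect(alpha_deg, h, c=0.4, n_panels=40, v_inf=8.88, rho=1.225):
    """2D lumped-vortex analogue of Equation (2) with a ground image: a flat
    plate at height h above the ground, each panel carrying a point vortex at
    its quarter-chord point, plus a mirror vortex of opposite sign below the
    ground plane so that no flow passes through the surface."""
    alpha = np.radians(alpha_deg)
    dx = c / n_panels
    x_v = (np.arange(n_panels) + 0.25) * dx      # vortex locations
    x_c = (np.arange(n_panels) + 0.75) * dx      # collocation points

    W = np.zeros((n_panels, n_panels))           # influence matrix, the 'w' of Equation (2)
    for i in range(n_panels):
        for j in range(n_panels):
            r = x_c[i] - x_v[j]
            W[i, j] = -1.0 / (2.0 * np.pi * r)                    # real vortex at height h
            W[i, j] += r / (2.0 * np.pi * (r**2 + (2.0 * h)**2))  # image vortex at height -h

    b = -v_inf * alpha * np.ones(n_panels)       # flow through each panel, the 'b' of Equation (2)
    gamma = np.linalg.solve(W, b)                # vortex strengths, the 'Gamma' of Equation (2)
    return rho * v_inf * gamma.sum()             # Kutta-Joukowski lift per unit span

# Lift rises as the plate approaches the ground (ground effect):
print(lift_in_ground_effect(5.0, 0.24), lift_in_ground_effect(5.0, 2.0))
```

With a single panel this scheme recovers the classical thin-aerofoil result, and the image term increases the computed lift as the plate approaches the ground, which is the qualitative behaviour the full 3D VLM with the image method captures.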

The range of the main wing’s and horizontal stabiliser’s angle-of-attack is limited because potential flow is not applicable when the wing or stabiliser operates at a large pitch angle. The WIG’s overall pitch angle is therefore restricted to between $ - {11.5^ \circ }$ and ${5.5^ \circ }$, and any state that violates this restriction is treated as a failure. In this work, the slight motion of the body is ignored. Since the potential-flow method can only calculate lift and induced drag, the viscous drag is approximated by empirical formulas [Reference Raymer26] encoded in Tornado. At each state, the WIG applies elevator and thrust actions, and the aerodynamics are obtained in the manner described; the resulting lift, drag and moment are then fed into the calculation of the next state in the subsequent timestep.

Figure 4. Three coordinates for the analysis on the WIG’s motion.

2.1.3 Flight dynamics

In the longitudinal plane, the WIG’s motion is limited to three degrees of freedom (DOF). They are one rotational axis and two directional axes, allowing the WIG to pitch and move vertically and horizontally. To decompose the dynamic behaviour, three coordinate systems are introduced in Fig. 4. Firstly, the inertial frame ${F^i}$ is an earth-fixed coordinate system with a forward-pointing unit vector ${i^i}$ and a downward-pointing unit vector ${k^i}$ , and an origin at the defined home location. Secondly, ${F^v}$ is the WIG frame, with its origin at WIG’s $CG$ and its axes aligned with ${F^i}$ ’s axis. Thirdly, ${F^b}$ is the body frame, and its unit vectors ${i^b}$ and ${k^b}$ point to WIG’s nose and belly, respectively [Reference Beard and McLain27]. The force and torque applied to the system are illustrated in Fig. 3. The total force is divided into two parts. The first is the aerodynamics of the entire body: the lift ${F_L}$ , the drag ${F_D}$ , and the torque $M$ . They are the total force and total torque based on ${F_{L,h.s}}$ , ${F_{D,h.s}}$ , ${M_{h.s}}$ from the horizontal stabiliser, ${F_{L,v.t}}$ , ${F_{D,v.t}}$ , ${M_{v.t}}$ from the vertical tail, and ${F_{L,m.w}}$ , ${F_{D,m.w}}$ , ${M_{m.w}}$ from the main wing. The second force is the thruster force T. Hence, we can calculate the total force and torque as follows:

(3) \begin{align} {F_x} = {F_x} + T \cdot {\rm{cos}}\!\left( \alpha \right)\end{align}
(4) \begin{align} {F_z} = {F_z} + T \cdot {\rm{sin}}\!\left( \alpha \right)\end{align}
(5) \begin{align} {M_y} = M + T \cdot {L_1}\end{align}

where $\alpha $ is the practical pitch angle, and ${L_1}$ is the vertical distance between the centre of thrust and $CG$ when $\alpha $ = ${0^ \circ }$ .

Forces ${F_x}$ , ${F_z}$ , and moment ${M_y}$ will be fed into the following equations for the motion response after the aerodynamics of the WIG have been calculated [Reference Diston28].

(6) \begin{align} \dot{\boldsymbol{u}} = \boldsymbol{u} \times \boldsymbol\omega + \boldsymbol{C}_{\boldsymbol{g}}^{\boldsymbol{b}} \cdot \!\left( {\frac{\boldsymbol{F}}{\boldsymbol{m}} + {\boldsymbol{g}^{\boldsymbol{n}}}} \right)\end{align}
(7) \begin{align} \dot{\boldsymbol\omega} = {\boldsymbol{I}^{ - 1}}\!\left( { - \boldsymbol\omega \times \boldsymbol{I}\boldsymbol\omega + \boldsymbol{M}} \right)\end{align}

where $\boldsymbol{u} = {\!\left[ {u,0,w} \right]^T}$ , $\boldsymbol\omega = {\!\left[ {0,q,0} \right]^T}$ , $\boldsymbol{F} = {\!\left[ {{F_x},0,{F_z}} \right]^T}$ , $\boldsymbol{M} = {\!\left[ {0,{M_y},0} \right]^T}$ , ${\boldsymbol{g}^{\boldsymbol{n}}} = {\!\left[ {0,0,g} \right]^T}$ , $m$ represents the mass of the WIG, and $u$ and $w$ are the decomposed velocities along ${i^b}$ and ${k^b}$ in ${F^b}$ . $\dot u$ and $\dot w$ are the corresponding accelerations. $q$ symbolises the angular velocity about ${j^b}$ in ${F^b}$ . ${F_x}$ and ${F_z}$ are forces along ${i^f}$ and ${k^f}$ in ${F^f}$ . $g$ signifies the acceleration of gravity. ${\boldsymbol{C}}_{\boldsymbol{g}}^{\boldsymbol{b}}$ is the transformation from ${F^f}$ to ${F^b}$ . $\boldsymbol{V}$ in ${F^i}$ can be transformed from $\boldsymbol{u}$ in ${F^b}$ by $\boldsymbol{V} = \boldsymbol{C}_{\boldsymbol{g}}^{\boldsymbol{b}} \cdot \boldsymbol{u}$ , where $\boldsymbol{V} = {\!\left[ {{V_x},0,{V_z}} \right]^T}$ , and ${V_x}$ and ${V_z}$ indicate velocities along ${i^i}$ and ${k^i}$ in ${F^i}$ . $\boldsymbol{I}$ denotes the moment of inertia. A flowchart of the calculation process for the WIG’s flight dynamics is shown in Fig. 5, and the key quantities of flying altitude and horizontal speed are marked with a red circle.

Figure 5. Calculating process of dynamics for the WIG’s flight.

After the lift, drag and moment have been obtained from the aerodynamic calculation, the motion of the WIG can be integrated to give the new position and attitude at the next timestep. The cycle then repeats as the WIG interacts with its working environment.
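For reference, a minimal sketch of one update of the longitudinal three-degree-of-freedom dynamics of Equations (6) and (7) is given below. It assumes an explicit-Euler integration with the timestep of the case study, uses the pitch angle as the attitude state, and measures $z$ upwards as the altitude of $CG$ (as in Equation (1)); the actual integration scheme of the original simulation is not specified in the text.

```python
import numpy as np

def step_3dof(state, Fx, Fz, My, m=1.2, Iy=0.1875, g=9.81, dt=0.02):
    """One explicit-Euler update of the longitudinal 3-DOF motion, Equations (6)-(7).
    state = (x, z, theta, u, w, q): horizontal position, altitude of CG, pitch
    angle, body-axis velocities and pitch rate. Fx, Fz, My are the totals of
    Equations (3)-(5), assumed here to be resolved along the body axes."""
    x, z, theta, u, w, q = state
    c, s = np.cos(theta), np.sin(theta)

    # Equation (6), reduced to the pitch plane: body-axis accelerations
    u_dot = -q * w + Fx / m - g * s
    w_dot = q * u + Fz / m + g * c

    # Equation (7): only the pitch component survives in the longitudinal plane
    q_dot = My / Iy

    # kinematics: body-axis velocities rotated into the inertial frame
    x_dot = c * u + s * w
    z_dot = s * u - c * w      # altitude rate, with k^b pointing towards the belly

    return (x + x_dot * dt, z + z_dot * dt, theta + q * dt,
            u + u_dot * dt, w + w_dot * dt, q + q_dot * dt)
```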

2.2 Deep reinforcement learning (DRL)

DRL learns a policy by trial and error in order to achieve a specific target. The algorithm’s objective is to maximise the cumulative reward as the agent interacts with the environment and receives a reward at each step. Deep learning provides the perceptual ability to observe the environment and supplies information about its current state, while reinforcement learning provides the decision-making capacity to connect current states and actions, evaluating the value of actions based on the predicted reward. The DRL process flow is depicted in Fig. 6. Using policy gradient approaches, the policy is iterated as [Reference Schulman, Wolski, Dhariwal, Radford and Klimov29]

(8) \begin{align} {L^{PG}}\!\left( \theta \right) = {\hat{\mathbb{E}}_{t}}\!\left[ {{\rm{log}}{\pi _\theta }\!\left( {{a_t}|{s_t}} \right){{\hat{A}}_t}} \right]\end{align}

where $\theta $ is the policy parameter and ${\pi _\theta }$ is the stochastic policy that takes action ${a_t}$ at timestep $t$ in the state ${s_t}$ . ${\hat A_t}\!\left( {s,a} \right) = {\bf{E}}\!\left[ {r\!\left( {s,a} \right) - r\!\left( s \right)} \right]$ , where $r\!\left( {s,a} \right)$ represents the expected reward of action $a$ from state $s$ and $r\!\left( s \right)$ represents the expected reward of the entire state $s$ prior to action selection. The loss on which a gradient ascent step is performed is denoted by $L$ . In an algorithm that alternates between sampling and optimisation, $\hat{\mathbb{E}}_{t}$ represents the empirical average over a finite batch of samples. The connection between these quantities and the WIG’s dynamics is illustrated in Fig. 6. The notation of $s$ is the state of the WIG, $s = \left( {\alpha,\dot \alpha,\dot q,x,u,\dot u,z,w,\dot w} \right)$ , which are longitudinal displacements, velocities and accelerations. $r$ is the reward given to the state by the reward function ${\rm{\Phi }}$ based on its cruise, speed, stealth and survival performance; ${\rm{\Phi }}$ = $r\!\left( s \right)$ . The actions are denoted by $a$ = $\!\left( {T,\delta } \right)$ , where $T$ and $\delta $ are the throttle and elevator, respectively. In the loop, the policy is updated to learn which action $a$ to take under the state $s$ to maximise the total reward. In the learning process, the WIG’s state continues to change based on the flight dynamics, and the next state after an action is taken is rewarded by the reward function.

Figure 6. The workflow of deep reinforcement learning.

The policy gradient method used in this paper for DRL is Proximal Policy Optimisation (PPO) [Reference Schulman, Wolski, Dhariwal, Radford and Klimov29], which updates the policy slightly from the previous policy for low variance during training. Two deep neural networks (NN) are employed as part of the Actor-Critic model, with the Actor selecting actions and the Critic predicting state values. This paper applies ElegantRL [Reference Liu, Li and Zheng30], an open source framework for implementing DRL.
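For completeness, the clipped surrogate objective that PPO optimises in place of the plain policy-gradient loss of Equation (8) can be sketched as follows. This is the standard PPO-clip form, not a transcription of ElegantRL’s implementation, and the clipping ratio of 0.2 is the common default rather than a value taken from Table 1.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO: the probability ratio between the
    updated policy and the sampling policy is clipped to [1 - eps, 1 + eps],
    which keeps each update close to the previous policy and lowers the
    variance during training. Returns a loss to minimise."""
    ratio = torch.exp(log_prob_new - log_prob_old)   # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```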

In this study, the Critic’s and Actor’s NNs each contain five layers with 64 neurons in the intermediate layers. With the exception of the activation in the fourth layer of the Actor’s NN, which is Hardswish, all activation functions in both NNs are ReLU. Table 1 lists the training hyperparameters, where $N$ is the episodic max step, i.e. the maximum number of timesteps for the WIG’s operation. If the episode reward does not grow over a certain number of consecutive epochs during training, $lr$ is switched to the next spare value. For example, suppose training begins with $lr$ = $5 \cdot {10^{ - 5}}$ ; if the episode reward does not increase over the next 200 steps, $lr$ is switched to ${10^{ - 5}}$ . After this hyperparameter update has been performed three times, training is finished.
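The following sketch shows one plausible reading of the network architecture and learning-rate fallback described above, written with PyTorch. The state and action dimensions follow the definitions of $s$ and $a$ in Section 2.2; the exact placement of the Hardswish activation and the spare learning-rate values beyond ${10^{-5}}$ are assumptions.

```python
import torch.nn as nn

def make_actor(state_dim=9, action_dim=2, width=64):
    """Actor: five linear layers with 64 neurons in the intermediate layers,
    ReLU activations except a Hardswish after the fourth layer; outputs the
    throttle and elevator actions."""
    return nn.Sequential(
        nn.Linear(state_dim, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.Hardswish(),
        nn.Linear(width, action_dim),
    )

def make_critic(state_dim=9, width=64):
    """Critic: the same five-layer layout with ReLU activations, predicting
    the state value."""
    return nn.Sequential(
        nn.Linear(state_dim, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 1),
    )

# Learning-rate fallback: start at 5e-5 and, whenever the episode reward stops
# improving for long enough, switch to the next spare value; training ends
# after three switches (values after 1e-5 are assumed for illustration).
lr_schedule = [5e-5, 1e-5, 5e-6, 1e-6]
```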

Table 1. Hyperparameters of DRL for training

3.0 Reward shaping

The reward function determines the performance of the WIG after training, as DRL attempts to steer the WIG to receive the most rewards. It is essential to design a reward mechanism that matches the expectations. An inappropriate reward function may result in deviation from expectations. The reward function for the WIG’s comprehensive needs is composed of four components: cruising, speed, survival and stealth.

3.1 Cruising

The first item is about the performance of cruising, which is described as the deviation from the cruising altitude by

(9) \begin{align} {\phi _{cr}} = \frac{{abs\!\left( {h - {h^{\rm{*}}}} \right)}}{c}\end{align}

where $h$ is the altitude of the WIG and ${h^{\rm{*}}}$ is the target cruising altitude. The mathematical symbol $abs$ is used to calculate their absolute difference, which is normalised by the height of the ground-effect zone $c$ . It demonstrates that when elevating or descending from the cruising altitude, the reward will decrease. The larger the deviation of the current altitude from cruising altitude is, the smaller the reward will be. Figure 7(a) depicts the function curve of ${\phi _{cr}}$ when ${h^{\rm{*}}}$ = 0.6 $c$ .

Figure 7. Function curves for the reward of cruising and speed.

3.2 Speed

The second item indirectly demonstrates performance of the horizontal speed as

(10) \begin{align} {\phi _{spd}} = \frac{{{x_n} - {x_{n - 1}}}}{{{d_{max}}}}\end{align}

where ${x_n}$ represents the current horizontal position at the $n$ - $th$ step and ${x_{n - 1}}$ represents the previous horizontal position at the ( $n$ -1)- $th$ step. Their difference shows how far the WIG goes, and it is normalised by ${v_{max}}{\rm{\Delta }}t$ which is the longest span for the WIG’s operation at a single timestep, ${d_{max}} = {v_{max}}{\rm{\Delta }}t$ , where ${v_{max}}$ = 18m/s is the WIG’s largest horizontal speed. The horizontal speed in each step is evaluated by ${V_s} = \left( {{x_n} - {x_{n - 1}}} \right)/{\rm{\Delta }}t$ . The greater the difference between the horizontal positions of the two consecutive steps is, the higher the speed will be, and then the larger the reward will be. Figure 7(b) depicts the function curve of ${\phi _{spd}}$ .

3.3 Survival and stealth

Since the requirements of survival and stealth are both bounded constraints, the third item in the reward function can be expressed as their combination. Hence, the third item combines ${\phi _{ste}}$ and ${\phi _{sur}}$ , which are separate terms for the WIG’s performance concerning stealth and survival, respectively.

(11) \begin{align} {\phi _{bnd}} = {\phi _{ste}} + {\phi _{sur}}\end{align}
(12) \begin{align} {\phi _{ste}} = \left\{ {\begin{array}{l@{\quad}l@{\quad}l}{0.5 - \dfrac{{h - {h_{ub}}}}{{{\rm{\Delta }}{h_{bnd}}}}} & {{\rm{if}}} & {h \gt {h_{ub}}}\\[5pt] {0.5} & {{\rm{if}}} & {h \le {h_{ub}}}\end{array}} \right.\end{align}
(13) \begin{align} {\phi _{sur}} = \left\{ {\begin{array}{l@{\quad}l@{\quad}l}{0.5 - \dfrac{{{h_{lb}} - h}}{{{\rm{\Delta }}{h_{bnd}}}}} & {{\rm{if}}} & {h \lt {h_{lb}}}\\[5pt] {0.5} & {{\rm{if}}} & {h \geq {h_{lb}}}\end{array}} \right.\end{align}

where ${h_{ub}}$ and ${h_{lb}}$ are the upper and lower boundary altitudes, respectively. Since the WIG’s position above the upper boundary is not allowed due to the requirement of stealth, the reward will decrease if the WIG moves above the upper boundary. The greater the vertical distance by which the WIG overshoots the upper boundary, the smaller the reward will be. The reward is normalised by ${\rm{\Delta }}{h_{bnd}}$ = ${h_{ub}}$ - ${h_{lb}}$ , the difference between the two boundaries. When the WIG’s altitude is below the upper boundary, the reward is constant. If ${h_{ub}}$ = 0.7 $c$ , ${h_{lb}}$ = 0.5 $c$ and $h$ is limited within the range [0.3 $c$ , 0.9 $c$ ], the function curve of ${\phi _{ste}}$ can be drawn as in Fig. 8(a). Similarly, according to the requirement of survival, the WIG’s position is not permitted to be below the lower boundary, so the reward will also decrease if the WIG moves below the lower boundary. The further the WIG undershoots the lower boundary, the more significantly the reward is reduced, and this reward is also normalised. When the WIG’s altitude is above the lower boundary, the reward is constant. Figure 8(b) depicts the function curve of ${\phi _{sur}}$ under the same condition. When ${\phi _{ste}}$ and ${\phi _{sur}}$ are put together, the curve of ${\phi _{bnd}}$ is shown in Fig. 8(c).

Figure 8. Function curves for the reward of survival and stealth.

3.4 Reward function

In short, we design the reward function consisting of the items mentioned above as

(14) \begin{align} {\rm{\Phi }} = {{\rm{\Phi }}_{cr}} + {{\rm{\Phi }}_{spd}} + {{\rm{\Phi }}_{bnd}}\end{align}
(15) \begin{align} {{\rm{\Phi }}_{cr}} = {w_{l,cr}} \cdot {(1 + {\phi _{cr}})^{{w_{e,cr}}}}\end{align}
(16) \begin{align} {{\rm{\Phi }}_{spd}} = {w_{l,spd}} \cdot {(1 + {\phi _{spd}})^{{w_{e,spd}}}}\end{align}
(17) \begin{align} {{\rm{\Phi }}_{bnd}} = {w_{l,bnd}} \cdot {(1 + {\phi _{bnd}})^{{w_{e,bnd}}}}\end{align}

where ${{\rm{\Phi }}_{cr}}$ , ${{\rm{\Phi }}_{spd}}$ , ${{\rm{\Phi }}_{bnd}}$ are independent reward functions for the performance of cruising, speed and the bounded constraints, respectively. ${w_{l,cr}}$ , ${w_{l,spd}}$ , ${w_{l,bnd}}$ are linear weights, and ${w_{e,cr}}$ , ${w_{e,spd}}$ , ${w_{e,bnd}}$ are exponential weights for the independent reward functions. The values of ${\phi _{cr}}$ , ${\phi _{spd}}$ and ${\phi _{bnd}}$ are limited within the range [0, 1]. To conveniently regulate the change in the exponential form, 1 is added to each of the three items so that the base is limited within [1, 2].
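Equations (9)–(17) translate directly into a small reward routine. The sketch below is a literal transcription using the case-study constants (${h^{\rm{*}}}$ = 0.6$c$, $c$ = 0.4m, ${v_{max}}$ = 18m/s, ${\rm{\Delta }}t$ = 0.02s) and clipping each potential term to [0, 1] as stated above; it is an illustration of the formulas rather than the authors’ code.

```python
import numpy as np

def reward(h, x_n, x_prev, h_ub, h_lb, weights,
           h_star=0.24, c=0.4, v_max=18.0, dt=0.02):
    """Multi-objective reward of Equations (9)-(17). `weights` maps each item
    to its (linear, exponential) pair, e.g. {'cr': (1, 1), 'spd': (1, 1), 'bnd': (1, 1)}."""
    # Equation (9): deviation from the cruising altitude, normalised by the chord
    phi_cr = np.clip(abs(h - h_star) / c, 0.0, 1.0)

    # Equation (10): progress in one timestep, normalised by d_max = v_max * dt
    phi_spd = np.clip((x_n - x_prev) / (v_max * dt), 0.0, 1.0)

    # Equations (11)-(13): stealth (upper boundary) and survival (lower boundary)
    dh_bnd = h_ub - h_lb
    phi_ste = 0.5 if h <= h_ub else 0.5 - (h - h_ub) / dh_bnd
    phi_sur = 0.5 if h >= h_lb else 0.5 - (h_lb - h) / dh_bnd
    phi_bnd = np.clip(phi_ste + phi_sur, 0.0, 1.0)

    # Equations (14)-(17): weighted polynomial combination
    total = 0.0
    for name, phi in (('cr', phi_cr), ('spd', phi_spd), ('bnd', phi_bnd)):
        w_lin, w_exp = weights[name]
        total += w_lin * (1.0 + phi) ** w_exp
    return total
```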

4.0 Cases and analysis

The purpose of this section is to examine whether the designed reward function can yield the expected trajectory by modifying the weights via DRL. The values of ${w_{l,cr}}$ , ${w_{e,cr}}$ , ${w_{l,spd}}$ , ${w_{e,spd}}$ , ${w_{l,bnd}}$ and ${w_{e,bnd}}$ indicate varying degrees of focus on the three requirements, namely maintaining cruise, maximising speed, and operating within the corridor for stealth and survival. If the reward function is effective, the trajectory will indicate a preference for the requirement with the greater relative weight. Two types of cases are considered. For the WIG’s global trajectory planning, a single reward function with a specified scheme of weights is used over the entire environment. For the WIG’s local trajectory planning, different schemes of weights are applied to construct the reward functions in consecutive parts of the environment.

4.1 Environment

For simplicity, the environment for examination is a terrain composed of piecewise lines that simulate the trough of the wave. The cruising altitude is ${h^{\rm{*}}}$ = $0.6c$ , the cruising speed is ${v_{cru}}$ = 8.88 m/s, and the simulation timestep is ${\rm{\Delta }}t$ = 0.02 s. During cruising, a single stride spans $d$ = ${v_{cru}}{\rm{\Delta }}t$ = 1.78m. As for the boundary limitations, the WIG must fly below ${h_{ub}}$ = $0.7c$ for stealth requirements and above ${h_{lb}}$ = $0.5c$ for survival requirements. In Fig. 9, the topographic parameters, including slope and distance, are illustrated. The corner positions are denoted from P.1 to P.4. They divide the continuous environment into six lines, L.1 to L.6, with L.3 and L.4 representing each half of the line from P.2 to P.3. We identify two regions, with R.1 spanning from L.1 to L.3 and R.2 spanning from L.4 to L.6. There are 64 total steps in the operational task.
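A piece-wise linear surface of this kind is easy to encode; the sketch below uses hypothetical corner coordinates (the actual slopes and distances are those of Fig. 9) and shows how the flying altitude relative to the local surface, which the corridor between ${h_{lb}}$ and ${h_{ub}}$ constrains, can be evaluated.

```python
import numpy as np

# Hypothetical corner coordinates (x, surface altitude) for the trough-like
# terrain: start of L.1, P.1, P.2, P.3, P.4, end of L.6. The true slopes and
# distances are those of Fig. 9; the values here are illustrative only.
corner_x = np.array([0.0, 2.0, 4.0, 7.0, 9.0, 11.0])
corner_z = np.array([0.0, 0.0, -0.12, -0.12, 0.0, 0.0])

def surface_altitude(x):
    """Piece-wise linear surface altitude h_sur at horizontal position x."""
    return np.interp(x, corner_x, corner_z)

def flying_altitude(z_hull_bottom, x):
    """Flying altitude h: vertical distance from the hull-bottom projection of
    CG to the local surface (cf. Equation (1)). Stealth and survival require
    h_lb = 0.5c <= h <= h_ub = 0.7c, with c = 0.4 m."""
    return z_hull_bottom - surface_altitude(x)
```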

Figure 9. The trough-like environment of piece-wise lines used in DRL for the WIG’s trajectory planning.

4.2 Case 1

4.2.1 Conditions

Case 1 assigns a fixed scheme of weights for the reward function over the two regions. As the number of weights is six, it is difficult to analyse the influence of multiple variables at the same time, so they are divided into three pairs: ( ${w_{l,cr}}$ , ${w_{e,cr}}$ ), ( ${w_{l,spd}}$ , ${w_{e,spd}}$ ) and ( ${w_{l,bnd}}$ , ${w_{e,bnd}}$ ). For simplicity, with two pairs of weights fixed, we compare trajectories that differ only in the remaining pair. The three pairs of weights then correspond to three subcases: Case 1.1, Case 1.2 and Case 1.3. For example, in Case 1.1, with the other five weights set to 1, ${T_{g,1}}$ assigns ${w_{l,cr}}$ = 0.5 and ${T_{g,2}}$ assigns ${w_{e,cr}}$ = 2, where the subscript $g$ denotes global trajectory planning. In contrast with ${T_{g,0}}$ , whose weights are all 1, ${T_{g,1}}$ shows the influence when the proportion of ${{\rm{\Phi }}_{cr}}$ decreases relatively, while ${T_{g,2}}$ shows the influence when the proportion of ${{\rm{\Phi }}_{cr}}$ increases relatively. Similarly, four further trajectories are compared with the reference ${T_{g,0}}$ for the other two requirements. The schemes of weights over the whole region for Case 1 are listed in Table 2.

4.2.2 Discussion

To give an assessment of the extent to which the requirements are met in the whole region, we use $\sum_{R.1+R.2}\phi_{cr},\; \sum_{R.1+R.2}\phi_{spd}$ and $\sum_{R.1+R.2}\phi_{bnd}$ to represent the performance of the trajectory on cruising, speed, and the bounded constraints composed of survival and stealth.
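These measures are simple accumulations of the per-step potential terms over the trajectory. A bookkeeping sketch is given below, which also splits the sums by region for the local analysis used later; the trajectory data format and the dividing coordinate are assumptions made for illustration.

```python
def region_performance(trajectory, split_x):
    """Accumulate the per-step potentials over R.1 (x <= split_x) and R.2
    (x > split_x); adding the two regions gives the whole-region measures.
    `trajectory` is assumed to be a list of dicts with keys
    'x', 'phi_cr', 'phi_spd' and 'phi_bnd'."""
    sums = {'R.1': {'cr': 0.0, 'spd': 0.0, 'bnd': 0.0},
            'R.2': {'cr': 0.0, 'spd': 0.0, 'bnd': 0.0}}
    for step in trajectory:
        region = 'R.1' if step['x'] <= split_x else 'R.2'
        for key in ('cr', 'spd', 'bnd'):
            sums[region][key] += step['phi_' + key]
    return sums
```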

Going from ${T_{g,1}}$ to ${T_{g,0}}$ , and subsequently to ${T_{g,2}}$ , corresponds to a gradual increase of the cruise weight in the reward function. Figure 10(a) depicts the full view of the trajectory. It can be seen that with the increase in cruising weight, the WIG keeps cruising better in L.1 and L.6, and it is closer to the cruise route in L.3 $\sim $ L.4. Panels (b) to (e) show the views around the four turning positions, respectively. It can be found that with the increase in cruise weight, the trajectory deviates less from the cruise route at the turning positions, and the whole trajectory goes deeper towards the bottom of the trough. The horizontal speed history of the trajectory is shown in (f). As the cruise weight increases, the speed increases more slowly and the entire span of the WIG decreases, bringing the WIG’s endpoint closer to its initial location. Panels (g) to (i) show the performance of the trajectory with respect to the three requirements, from which the trend of the overall performance of each aspect can be seen as the cruise weight changes. When the cruise weight increases, the overall cruise performance improves. At the same time, the effect of the bounded constraints is reinforced, but the overall horizontal speed decreases, indicating that speed is traded for better cruising, while the performance of the bounded constraints also improves.

Table 2. Schemes of weights in the whole region for Case 1

Figure 10. Trajectories planned globally in Case 1.1 for the requirement of cruising.

From ${T_{g,3}}$ to ${T_{g,0}}$ and finally to ${T_{g,4}}$ , the weight related to horizontal speed gradually increases. Figure 11(a) shows the full picture of the trajectory. It can be seen that with the increase of the speed weight, the WIG’s trajectory in L.6 ends farther along. Panels (b) to (e) show the views around the four turning positions, respectively. It can be observed that with the increase in speed weight, the trajectory deviates further from the cruise route at the turning positions, appearing flatter overall and dipping less towards the trough floor. Panel (f) displays the horizontal speed history of the trajectory. As the speed weight increases, the speed increases more rapidly and the whole span of the WIG becomes longer; therefore, the WIG’s ending trajectory point moves further from its initial position. Panels (g) to (i) show the performance of the trajectory on the three requirements, from which the trend of the global performance of each aspect can be seen as the speed weight changes. When the speed weight increases, the overall performance on speed improves. However, the influence on cruising and the bounded constraints differs: the performance of cruising initially improves and subsequently degrades, while the effect of the bounded constraints gradually degrades, showing that the performance of the bounded constraints, and partly that of cruising, is sacrificed to achieve the higher speed.

Figure 11. Trajectories planned globally in Case 1.2 for the requirement of speed.

The gradual increase of the bounded constraint weight corresponds to going from ${T_{g,5}}$ to ${T_{g,0}}$ and finally to ${T_{g,6}}$ . Figure 12(a) shows the full perspective of the trajectory. With the growth of the bounded constraint weight, the trajectory in L.1 exhibits a rising trend, the overshoot in L.6 advances, and the slope of the uphill and downhill motions in L.2 and L.5 increases. Panels (b) to (e) show the views around the four turning positions. It is found that when the bounded constraint weight increases, the trajectory deviates less from the cruise route at the turning positions, and the overall trajectory conforms to the shape of the trough. The horizontal speed history of the trajectory is illustrated in (f). As the weight of the bounded constraints increases, the speed increases more slowly and the whole span of the WIG becomes smaller, so the end trajectory point of the WIG is closer to the initial position. Panels (g) to (i) show the performance of the trajectory with respect to the three requirements, from which the trend of the global performance of each aspect can be observed as the bounded constraint weight changes. When the bounded constraint weight increases, the overall bounded constraint performance improves. Simultaneously, the cruising performance improves, demonstrating that even though there is a significant deviation from the cruise route in L.1 and L.6, the bounded constraints in L.2 $\sim $ L.5 lower the degree of deviation. Also, the overall horizontal speed is lowered, indicating that speed is traded for a stronger effect of the bounded constraints, while the cruising performance is also enhanced.

Figure 12. Trajectories planned globally in Case 1.3 for requirements of survival and stealth.

4.3 Case 2

4.3.1 Conditions

Case 2 assigns different reward functions to the separate regions. The whole environment is divided into two regions, R.1 and R.2. If each region were assigned its own scheme in which every weight varies and each weight could take two values, the number of trajectories would amount to $64^2$. For simplicity, we keep the weighting scheme in R.1 fixed as the reference ${T_{g,0}}$ and treat the scheme in R.2 as the control variable. The scheme of weights in R.2 changes in one pair of weights, and the manner of variation is similar to Case 1. Likewise, the three pairs of weights give three subcases. In Case 2.1, for example, ${T_{l,1}}$ and ${T_{l,2}}$ are assigned ${w_{l,cr}}$ = 0.5 and ${w_{e,cr}}$ = 2, respectively, in R.2, with the other scheme parameters in R.2 and R.1 remaining the same as the reference ${T_{g,0}}$ , where the subscript $l$ denotes local trajectory planning. The influence of varying the proportion of ${{\rm{\Phi }}_{cr}}$ can then be observed. The schemes of weights in the two regions for Case 2 are listed in Table 3.

Table 3. Schemes of weights in the two regions for Case 2

The middle of the trough is regarded as the dividing line between the regions, and the assessment criterion for the performance of horizontal speed has to change because the reward for speed is accumulated along the way. For global planning, the total span corresponds to the overall horizontal speed, since the number of steps is constant. However, when the reward function is employed for local planning, no matter how many steps the WIG spends in R.1, the final cumulative reward there is almost fixed, so the reward function cannot play a role in regulating the speed. Therefore, when analysing local speed performance, the span is used instead of the speed, and the notation $spn$ replaces $spd$ . A larger span means better performance on speed.
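Local planning therefore amounts to switching the weight scheme of the polynomial reward according to the region the WIG currently occupies. A minimal sketch is given below, reusing the reward() routine sketched in Section 3.4 and writing the Case 2.1 schemes as (linear, exponential) pairs.

```python
# Weight schemes written as (linear, exponential) pairs per item. T_g,0 uses
# all ones; in Case 2.1, T_l,1 and T_l,2 keep the reference scheme in R.1 and
# change only the cruising weights in R.2 (w_l,cr = 0.5 and w_e,cr = 2).
REFERENCE = {'cr': (1, 1), 'spd': (1, 1), 'bnd': (1, 1)}
SCHEMES_T_L1 = {'R.1': REFERENCE,
                'R.2': {'cr': (0.5, 1), 'spd': (1, 1), 'bnd': (1, 1)}}
SCHEMES_T_L2 = {'R.1': REFERENCE,
                'R.2': {'cr': (1, 2), 'spd': (1, 1), 'bnd': (1, 1)}}

def local_reward(h, x_n, x_prev, h_ub, h_lb, schemes, split_x):
    """Local planning: select the weight scheme of the region the WIG is
    currently in (the middle of the trough as the dividing line) and evaluate
    the same polynomial reward; `reward` is the routine sketched in Section 3.4."""
    region = 'R.1' if x_n <= split_x else 'R.2'
    return reward(h, x_n, x_prev, h_ub, h_lb, schemes[region])
```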

4.3.2 Discussion

To give an assessment of the extent to which the requirements are met in R.2, we use $\sum_{R.2}\phi_{cr},\; \sum_{R.2}\phi_{spn}$ and $\sum_{R.2}\phi_{bnd}$ to represent the local performance of the trajectory on cruising, speed, and the bounded constraints in R.2. The same treatment for the local performance is applied to R.1.

From ${T_{l,1}}$ to ${T_{g,0}}$ and then to ${T_{l,2}}$ , the cruise weight in R.2 gradually increases. Figure 13(a) depicts the full picture of the trajectory. As the cruise weight in R.2 increases, the WIG in L.4 $\sim $ L.5 moves closer to the cruise route, and the WIG in L.6 can maintain the cruise more effectively. In L.1 $\sim $ L.2, the trajectory deviates significantly from the cruise route, whereas the trajectory in L.3 tends to approach the cruising altitude. From (b) to (e), which depict the trajectory near the four turning positions, it can be seen that as the cruise weight in R.2 grows, the trajectory at P.3 deviates less from the cruise route. Whereas the trajectory deviates more from the cruise path at P.1 in R.1, it deviates less at P.2, as if adjustments were made in R.1 in preparation for better cruising in R.2. (f) shows the horizontal speed history of the trajectory. With the increase of cruising weight in R.2, the change in the WIG’s horizontal speed is negligible, the increase is very gradual, and the entire WIG span grows marginally. From (g) to (i) are the performances of the trajectory with respect to the requirements of the three aspects, from which we can see the trend of the performance of each aspect in R.1 and R.2 with the change of the cruise weight in R.2. As the cruising weight in R.2 increases, the cruising performance in R.2 improves along with the longer horizontal span and more effective bounded constraints, but in R.1, the performance of both cruising and bounded constraints degrades to varying degrees. The performance of the WIG’s span changes slightly because it is measured from the initial point to the trajectory point near the dividing line. It suggests that the enhanced performance of cruising in R.2 comes at the expense of that in R.1.

Figure 13. Trajectories planned locally in Case 2.1 for the requirement of cruising.

Going from ${T_{l,3}}$ to ${T_{g,0}}$ and subsequently to ${T_{l,4}}$ corresponds to a gradual increase of the speed weight in R.2. Figure 14(a) shows the full view of the trajectory. It can be seen that with the increase of the speed weight in R.2, the WIG’s stopping position in L.6 is farther. The trajectory in L.5 appears flatter, and it exhibits a lesser tendency to drop in L.4. Panels (b) to (e) show the views around the four turning positions, respectively. It can be found that with the increase in speed weight in R.2, the trajectory deviates more from the cruise route at P.4. At P.1, the trajectory deviates more from the cruise route, whereas at P.2, the trajectory becomes gradually flatter, as though corrections were made in R.1 to prepare for quicker movement in R.2. The horizontal speed history of the trajectory is shown in (f). With the further increase of the speed weight in R.2, the horizontal speed of the trajectory increases more dramatically, and the entire WIG span expands. Panels (g) to (i) show the performance of the trajectory regarding the three requirements, from which the trend of the performance of each aspect in R.1 and R.2 can be seen as the speed weight in R.2 changes. When the speed weight in R.2 increases, the WIG span in R.2 grows longer, whereas the performance of cruising and the bounded constraints in R.1 degrades progressively. The performance of the span is measured from the initial point to the trajectory point near the dividing line, so its variation is minimal. The WIG’s span in R.2 is lengthened at the cost of its performance in R.1.

Figure 14. Trajectories planned locally in Case 2.2 for the requirement of speed.

Figure 15. Trajectories planned locally in Case 2.3 for requirements of survival and stealth.

From ${T_{l,5}}$ to ${T_{g,0}}$ and finally to ${T_{l,6}}$ , the bounded constraint weight in R.2 gradually increases. Figure 15(a) shows the full perspective of the trajectory. It can be seen that with the increase of the weight of the bounded constraints in R.2, the slope of the trajectory in L.4 $\sim $ L.5 gets larger, which is more in line with the uphill part of the trough, contributing to the overshooting behaviour at the beginning of L.6. In L.1 $\sim $ L.2 of R.1, the divergence from the cruise route is greater. Panels (b) to (e) show the views around the four turning positions; it can be found that as the weight of the bounded constraints in R.2 increases, the trajectory deviates less from the cruise route at P.4. However, at P.1, the trajectory deviates more from the cruise route, and at P.2, it descends to provide a better initial position for the subsequent terrain-fitting movement, as if making adjustments in R.1 to prepare for better performance of the bounded constraints in R.2. Panel (f) displays the horizontal speed history of the trajectory. As the weight of the bounded constraints in R.2 increases, there is a slight increase in the horizontal speed, and the whole span of the WIG is enlarged in small increments. Panels (g) to (i) show the performance of the trajectory with respect to the three requirements, from which the trend of the performance of each aspect in R.1 and R.2 can be seen as the weight of the bounded constraints in R.2 changes. When the weight of the bounded constraints in R.2 increases, the effect of the bounded constraints in R.2 is amplified, and the performance of the horizontal span and cruising improves, whereas in R.1, the performance of cruising and speed degrades to varying degrees. Because the span is measured from the trajectory point close to the dividing line, its variation is minimal. This indicates that the effect of the bounded constraints in R.2 is enhanced at the expense of the performance in R.1.

4.4 Comparison between Case 1 and Case 2

When comparing the performance of global and local planning for each item of the requirements, we pay more attention to the difference between each item than its magnitude. Therefore, ${T_{g,0}}$ is used as the baseline to analyse the effect of the adjustment by global planning and local planning when the weights change.

4.4.1 Cruising

Figure 16 shows the performance of trajectories planned locally and globally on cruising in R.1, R.2 and the two combined. When the cruising weight is decreased, the whole-region cruising performance deteriorates more under global planning than under local planning; when the cruising weight is increased, the whole-region cruising performance improves more under global planning than under local planning.

Figure 16. Performances of trajectories on cruising in different regions.

From a regional standpoint, the reverse is true. As the cruise weight drops, the local cruising performance deteriorates more under local planning than under global planning; when the cruise weight increases, the degree of improvement in local cruising performance is greater under local planning than under global planning.

4.4.2 Survival and stealth

Figure 17 shows performances of trajectories planned locally and globally on survival and stealth in R.1, R.2 and both of them. Changing the bounded constraint weight has a similar tendency as changing the cruise weight. For the global performance of bounded constraints, global planning exhibits more effective adjustment. For the local performance of bounded constraints, local planning provides more effective regulation.

Figure 17. Performances of trajectories on survival and stealth in different regions.

Figure 18. Performances of trajectories on speed in different regions.

4.4.3 Speed

Since the regions are separated topographically and the reward for speed is based on the positions at two consecutive steps, the reward for the WIG’s speed is computed differently from those for cruising and the bounded constraints. Generally speaking, the larger the span, the greater the reward concerning speed. Therefore, whether planning globally or locally, a larger span in a single step is pursued.

Figure 18 shows performances of trajectories planned locally and globally on speed in R.1, R.2, and both of them. When the speed weight in R.2 grows, local planning not only encourages the trajectory to have a greater span in R.2, but also allocates as many R.1 trajectory points as possible to R.2. Additionally, the span in R.1 is not adequately extended. Under global planning, as few R.1 trajectory points as possible are assigned to R.2. The elongated span in R.1 permits more trajectory points in R.2, and the lengthened span in R.2 is also greater for the greater reward. Thus, global planning for the entire region outperforms local planning for R.2 both in terms of global speed and local span.

As the speed weight in R.2 decreases, local planning reduces the span in R.2 to achieve better performance on cruise and bounded constraints and even arranges the trajectory points in R.2 to R.1. The span in the R.2 region is shortened further. Under global planning, the speed weight in R.1 is also reduced, and the improvement in performance on cruising and bounded constraints lengthens the WIG’s span in R.1, resulting in the evacuation of some trajectory points to R.2. The WIG’s span in R.2 is elongated. Hence, whether viewed globally or locally, the degree of span reduction obtained by global planning is less than that gained by local planning.

Case 1, which conducts trajectory planning globally, evaluates the global performance, and Case 2, which conducts trajectory planning locally, evaluates the local performance. Their cross-examinations are presented in the Appendix. For Case 1, which mainly concerns global trajectory planning, the local performance on the requirements is illustrated in Figs. A1 to A3; the same is done for Case 2 in Figs. A4 to A6. In the same place, the performance and corresponding reward for each item in the different regions are listed in Table A1. All trajectories in this study are depicted in Fig. A7.

5.0 Conclusion

This study focuses on the multiple needs of WIGs’ trajectory planning, such as cruising, speed, survival and stealth. It proposes a framework in polynomial form, in which each item is designed as an independent reward function based on how a specific need is defined. The sum of all the items integrates these requirements, and each basic item is multiplied by a linear weight or assigned an exponential weight whose relative magnitude shows how much emphasis is laid on the corresponding requirement.

To examine the performance of the framework on WIGs’ trajectory planning, a terrain comprising piece-wise lines shaped like one wave trough is taken as the interacting environment, and the reward function based on the polynomial framework guides the training via DRL to generate the trajectory. In the same environment, in addition to the global trajectory planning widely examined in previous studies, local trajectory planning is also evaluated in this paper. The former keeps the same scheme of weights throughout, and the latter uses different schemes of weights in consecutive parts of the environment. The weight of each item is assigned in a series to represent different degrees of attention allocated to the requirements, and then, by the method of control variates, trajectories are compared in the aspects of cruising, speed, stealth and survival.

It is found that relative emphasis on a certain requirement produces the corresponding behaviour locally or globally, and that the reward function can guide an expected trajectory when the relative weights are tuned appropriately. The framework helps achieve balanced behaviour among the requirements and is thus applicable to WIGs' trajectory planning in engineering.

Acknowledgements

This work was partially funded by the National Natural Science Foundation of China (52061135107), the Fundamental Research Fund for the Central Universities (DUT20TD108), the Liao Ning Revitalization Talents Program (XLYC1908027) and the Dalian Innovation Research Team in Key Areas (No. 2020RT03). The authors also acknowledge the computational support of the Supercomputing Center of Dalian University of Technology.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

APPENDIX

Figure A1. Globally planned trajectories’ local performance on requirements for Case 1.1.

Figure A2. Globally planned trajectories’ local performance on requirements for Case 1.2.

Figure A3. Globally planned trajectories’ local performance on requirements for Case 1.3.

Figure A4. Locally planned trajectories’ global performance on requirements for Case 2.1.

Figure A5. Locally planned trajectories’ global performance on requirements for Case 2.2.

Figure A6. Locally planned trajectories’ global performance on requirements for Case 2.3.

Figure A7. Trajectories planned globally and locally.

Table A1. Reward and performance in regions for each trajectory
