Hostname: page-component-78c5997874-lj6df Total loading time: 0 Render date: 2024-11-16T13:21:28.234Z Has data issue: false hasContentIssue false

An improved fuzzy inference strategy using reinforcement learning for trajectory-tracking of a mobile robot under a varying slip ratio

Published online by Cambridge University Press:  25 January 2024

Muhammad Qomaruz Zaman
Affiliation:
Graduate Institute of Manufacturing Technology, National Taipei University of Technology, Taipei, Taiwan
Hsiu-Ming Wu*
Affiliation:
Department of Intelligent Automation Engineering, National Taipei University of Technology, Taipei, Taiwan
*
Corresponding author: Hsiu-Ming Wu; Email: [email protected]
Rights & Permissions [Opens in a new window]

Abstract

In this study, a fuzzy reinforcement learning control (FRLC) is proposed to achieve trajectory tracking of a differential drive mobile robot (DDMR). The proposed FRLC approach designs fuzzy membership functions to fuzzify the relative position and heading between the current position and a prescribed trajectory. Instead of fuzzy inference rules, the relationship between the fuzzy inputs and actuator voltage outputs is built using a reinforcement learning (RL) agent. Herein, the deep deterministic policy gradient (DDPG) methodology consisted of actor and critic neural networks is employed in the RL agent. Simulations are conducted with considering varying slip ratio disturbances, different initial positions, and two different trajectories in the testing environment. In the meantime, a comparison with the classical DDPG model is presented. The results show that the proposed FRLC is capable of successfully tracking different trajectories under varying slip ratio disturbances as well as having performance superiority to the classical DDPG model. Moreover, experimental results validate that the proposed FRLC is also applicable to real mobile robots.

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

Applications of mobile robots have become more and more ubiquitous in daily life over the last decades. In contrast to people, mobile robots are generally capable of enduring repetitive tasks [Reference Dekle1] and offer a less expensive alternative in the labor markets [Reference Fan, Hu and Tang2]. Robotic labor has been employed in various fields, including in the farmland [Reference Zhao, Wang, Qi and Runge3Reference Pak, Kim, Park and Son7], hospitals for patient care [Reference Jung, Kim, Lee, Bang, Kim and Kim8, Reference Niloy, Shama, Chakrabortty, Ryan, Badal, Tasneem, Ahamed, Moyeen, Das, Ali, Islam and Saha9], and the food service industry or restaurants [Reference Chen, Lin and Lai10, Reference Zheng, Chen and Cheng11]. Based on the aforementioned tasks, mobile robots are required to follow a designed trajectory. This remains a significant and essential problem in terms of differential drive mobile robots (DDMRs), which are nonholonomic. Therefore, it is important to further investigate and develop control techniques making mobile robots to autonomously track a planned trajectory.

Some control approaches employing artificial intelligence (AI) technologies have been successfully used to improve the trajectory-tracking performance of a DDMR. For example, in refs. [Reference de Jesus, Bottega, de Souza Leite Cuadros and Gamarra12Reference Gao, Ye, Guo and Li14], reinforcement learning (RL) agents are able to generate an optimum path for trajectory tracking by utilizing grid world simulation. Furthermore, RL agents are known to be able to produce control inputs based on sophisticated data without explicitly telling its correlation. Control feedback based on multiple tracking error inputs has been investigated in refs. [Reference Khlif, Nahla and Safya15Reference Wu, Wang, Esfahani and Yuan18], whereas alternative methods such as direct utilization of camera feeds or LiDAR measurements as inputs have also been explored in refs. [Reference Song, Li, Wang and Wang19Reference Xie, Miao, Wang, Blunsom, Wang, Chen, Markham and Trigoni24]. Based on previous works, it is imperative to note that RL differs from supervised learning in terms of training approach as rewards generated from agent–environment interaction are utilized during the process. In some instances, an RL agent is required to explore virtual environments to gain the experience and knowledge needed for trajectory-tracking purposes. The primary challenge facing the above-mentioned works pertains to ensuring that the learned policies from the virtual environments can be effectively applied to practical situations. When it comes to actual circumstances, DDMRs are controlled using actuator voltages according to its kinematics, dynamics, and actuator. However, the works in refs. [Reference de Jesus, Bottega, de Souza Leite Cuadros and Gamarra12Reference Xie, Miao, Wang, Blunsom, Wang, Chen, Markham and Trigoni24] do not involve the dynamics of the DDMR, which results in indirect control inputs using turning directions or velocities.

It is important to note that a specific voltage value does not directly correspond to a particular velocity, and a zero voltage does not result in an immediate halt. Due to the inherent nonlinearity of a DDMR, voltage inputs lead to a gradual increase or decrease in its velocity, serving to counteract its moment of inertia. This constitutes a fundamental distinction when employing the dynamic model of the DDMR and its associated actuator model. In contrast, within the context of the grid world simulation (see refs. [Reference Quan, Li and Zhang13Reference Khlif, Nahla and Safya15]), the DDMR is expected to transit instantaneously between grids. This, however, proves to be an impractical solution for position-tracking problems, as positional accuracy hinges on the resolution of the grid and it neglects to simulate the temporal aspect of grid-to-grid movement. Within the context of the kinematic model, a constant velocity denotes a state of motion devoid of acceleration, as kinematics does not account for the forces causing motion. By disregarding factors like mass and inertia, DDMR’s motion lacks momentum making every velocity command equivalent to the DDMR’s actual velocity. While both the kinematic model and the grid world simulation reduce simulation complexity, they underscore the disparities between the simulated environment and real-world conditions. Consequently, it is imperative to incorporate the kinematic, dynamic, and actuator models of the DDMR in the training process. This ensures that the RL agent learns nonlinearities in DDMR’s motion model and effectively utilizes voltages as control inputs.

Although RL agents possess impressive reasoning abilities, they can generate unexpected control commands in the presence of input that significantly differs from what is encountered during the training process. Therefore, it is essential to train the RL agent using an extensive dataset that covers all possible inputs. Fuzzy logic control (FLC) presents an advantageous alternative in these circumstances because it limits the inputs within the range of the predefined membership functions as demonstrated in refs. [Reference Ye, Shao, Liu and Yu25Reference Hua, Wang, Zhang, Alattas, Mohammadzadeh and The Vu35]. The membership functions are typically designed to overlap with one another. This approach provides interpolative reasoning capabilities as highlighted in refs. [Reference Ye, Shao, Liu and Yu25Reference Chwa and Boo28] where the FLC shows a high resilience against uncertainties and disturbances. Furthermore, as indicated in refs. [Reference Cherroun, Boumehraz and Kouzou29Reference Subbash and Chong31], it is possible to create membership functions that emphasize the significance of inputs within specific value ranges. Then, an alternate inference rule can be developed to further prioritize these particular ranges of values. However, as documented by refs. [Reference Saidi, Nemra and Tadjine32Reference Hua, Wang, Zhang, Alattas, Mohammadzadeh and The Vu35], the fuzzy inference system (FIS) can be very complex and the performance is highly limited on the designer’s knowledge. Therefore, it becomes imperative to explore alternative approaches to tune the FIS for optimal outcomes.

The conceptualization of the fuzzy inference system (FIS) was marked by the introduction of two distinguished methods, namely the Mamdani and the Takagi-Sugeno (T-S) methods [Reference Prieto, Aguilar, Cardenas-Maciel, Lopez-Renteria and Cazarez-Castro36Reference Tognetti and Linhares39]. The former relies on fuzzy outputs that require a subsequent defuzzification, while the latter employs fired multilinear equations followed by a weighted averaging technique. However, both methodologies face a dimensionality bottleneck where the complexity of the FIS grows exponentially with increased input variables. Consequently, explicating the devised FIS becomes notably intricate. In refs. [Reference Ruiz-García, Hagras, Pomares and Ruiz40Reference Mendel42], a viable approach to alleviate this complexity lies in the optimization of input variables, which can be accomplished through the integration of a type-2 FIS. This involves incorporating uncertainties or adaptabilities for each fuzzy variable that has been verified to enhance the optimization of input variables at the cost of increasing parameter intricacy in the design process. Furthermore, endeavors have been made to address this limitation through the adoption of a supervised learning paradigm, specifically the adaptive neuro-fuzzy inference system (ANFIS), which leverages the neural network’s capability to approximate any functions including those related to FIS [Reference Hu and Zheng43Reference Islam, Anderson, Pinar, Havens, Scott and Keller45]. Data-driven calibration has also been employed to fine-tune parameters such as the uncertainties in the type-2 membership functions [Reference Dombi and Hussain46]. However, the latter both methodologies rely on the availability of input–output data for training. In contrast, RL leverages interactions with the environment to facilitate the training of neural networks. Therefore, the replacement of the FIS with a neural network under the guidance of the RL algorithm may offer several advantages in circumventing the aforementioned limitations rendering this research direction worthy of exploration.

The primary objective of this work is to attain the trajectory-tracking capability using pose feedback and voltage control commands on a DDMR using the proposed fuzzy reinforcement learning control (FRLC). In order to break through the limitations found in the previous studies, this work combines the advantages of both FLC and the RL agent. Thus, the fuzzified inputs are utilized for the control feedback with an aim to decrease complexity and enhance the robustness against disturbance. This entails constructing a correlation between control feedback and commands utilizing an RL agent instead of employing a FIS. Further contributions as well as strategies that have not been explored by previous studies are outlined as follows:

  1. 1. This study integrates the kinematics, dynamics, and actuator model of a DDMR into the RL training process. The application of its dynamics and actuator model has led to the incorporation of voltage control inputs that have not been explored by refs. [Reference de Jesus, Bottega, de Souza Leite Cuadros and Gamarra12Reference Hua, Wang, Zhang, Alattas, Mohammadzadeh and The Vu35]. Despite the ensuing challenges related to increased system orders and complexity, the integration of both dynamics and kinematics yields a more comprehensive robot model crucial for the successful simulated training of an RL agent. This encompasses consideration of acceleration, mass, and moment of inertia in the law of motion which are essential for the effective simulated training of an RL agent.

  2. 2. Continuous action control commands are used in the RL agent. Unlike simulated games that have discrete action space, continuous action space is preferable in the robot control field to gain a smoother performance and safer implementation. The utilization of a continuous action space contrasts with earlier approaches in refs. [Reference Quan, Li and Zhang13, Reference Khlif, Nahla and Safya15, Reference Bai, Yan, Pan and Guo17Reference Peng, Chen, Zhang, Chen, Tseng, Wu and Meen23], which apply the use of discrete action space causing zigzag trajectory response. Furthermore, unlike previous studies where the linear and angular velocities outputs were closely proportional to the distance and heading control feedback (as evident in refs. [Reference de Jesus, Bottega, de Souza Leite Cuadros and Gamarra12Reference Xie, Miao, Wang, Blunsom, Wang, Chen, Markham and Trigoni24]), this study challenges the RL agent to produce two continuous voltages simultaneously for trajectory tracking of the DDMR.

  3. 3. Fuzzy membership functions are designed as the state inputs of the RL agent. Fuzzification is a versatile tool to encode signals within upper and lower bounds. Additionally, selective classification within a range of values in a fuzzy membership can be implemented to place greater emphasis on critical information provided to the RL agent. This methodology constitutes an innovative contribution that deviates from previous works in refs. [Reference de Jesus, Bottega, de Souza Leite Cuadros and Gamarra12Reference Xie, Miao, Wang, Blunsom, Wang, Chen, Markham and Trigoni24]. In addition, unlike prior works in refs. [Reference Ye, Shao, Liu and Yu25Reference Mendel42] which completely rely on designers’ expert knowledge to relate the fuzzy inputs and outputs, this study avoids complete dependency on expert knowledge by utilizing the gathered knowledge of an RL agent after extensive training in a simulated environment.

  4. 4. Simulation results for the DDMR under a varying slip ratio are demonstrated to show the effectiveness of the proposed approach in both theory and implementation. Additionally, similar trajectories and initial conditions are employed in experiments as those presented in the simulations to further support our findings. However, numerous existing works such as refs. [Reference Gao, Ye, Guo and Li14, Reference Gao, Gao, Liang, Zhang, Deng and Zhu16, Reference Bai, Yan, Pan and Guo17, Reference Wang, He and Sun21, Reference Peng, Chen, Zhang, Chen, Tseng, Wu and Meen23, Reference Cherroun, Boumehraz and Kouzou29Reference Hua, Wang, Zhang, Alattas, Mohammadzadeh and The Vu35] do not demonstrate experimental results.

The remainder of the paper is organized as follows: the system modeling of the DDMR and problem formulation are stated in Section 2. Then, the proposed FRLC design is described in Section 3. Section 4 presents the results and discussions of the simulation and experiment. Finally, the concluding statements are given in Section 5.

2. System modeling

The DDMR has a symmetrical shape with two wheels on both sides serving as the primary driving mechanism. Additionally, two caster wheels are attached to both the back and front chassis to maintain balance. As depicted in Fig. 1, the DDMR is positioned within a dual-coordinate frame of reference. Here, $\{R\}$ denotes the moving frame affixed to the robot’s center coinciding with the location of its center of mass. Besides, a fixed frame denoted as $\{I\}$ is securely anchored to the floor establishing the reference frame where the DDMR’s movements are measured. The distance between the wheels is defined as $2l_r$ with each wheel characterized by a radius of $r_w$ . An essential premise governing the DDMR’s motion stipulates the condition of pure rolling. This motion is further constrained by the nonholonomic condition expressed as $^I\dot{x}_R\sin \theta + ^I\dot{y}_R\cos \theta = 0$ . This implies that the DDMR cannot move laterally, necessitating a change in heading to alter its direction. With this setup in mind, the motion of the robot with respect to the fixed frame [Reference Velagic, Lacevic and Osmic47Reference Stefek, Pham, Krivanek and Pham50] can be defined as follows:

Figure 1. Geometric relationships of the DDMR: notations $\{R\}$ and $\{I\}$ are the moving and fixed frames, respectively. The positional discrepancy of the DDMR concerning the trajectory is quantified by the distance error $e_d$ and heading error $e_h$ . The DDMR’s X-axis denotes the forward direction of its motion.

(1) \begin{equation} \begin{bmatrix} ^I\dot{x}_R\\[5pt]^I\dot{y}_R\\[5pt]\dot{\theta } \end{bmatrix}=\dfrac{r_w}{2} \begin{bmatrix} \cos \theta & \quad \cos \theta \\[5pt] \sin \theta & \quad \sin \theta \\[5pt] \dfrac{1}{l_r} & \quad -\dfrac{1}{l_r} \end{bmatrix} \begin{bmatrix} \dot{\varphi }_u\\[5pt] \dot{\varphi }_v \end{bmatrix} \end{equation}

where $\dot{\varphi }_u$ and $\dot{\varphi }_v$ denote the angular velocities of the left and right wheels, respectively. The orientation of the robot $\theta$ is measured from the angle of the X-axis of frame $\{R\}$ and $\{I\}$ . The angular velocities of the wheels ( $\dot{\varphi }_u, \dot{\varphi }_v$ ) are related to the linear and angular velocities ( $^R\dot{x}, \dot{\theta }$ ) of the DDMR with respect to the moving frame given by

(2) \begin{equation} \left \{\begin{array}{c@{\quad}c} \dot{\varphi }_u &= \dfrac{\displaystyle ^R\dot{x} + \dot{\theta }l_r}{\displaystyle r_w(1-S_{ru})} \\[15pt] \dot{\varphi }_v &= \dfrac{\displaystyle ^R\dot{x} - \dot{\theta }l_r}{\displaystyle r_w(1-S_{rv})} \end{array}\right. \end{equation}

In addition, to incorporate the impact of slip effects between the wheel and the track during acceleration, slip ratios for the left and right wheels, denoted as $(S_{ru}, S_{rv})\in [0,1)$ , are introduced in Eq. (2). Following the elucidations outlined by refs. [Reference Ma, Zhao, Zhao, Lu and Chen51Reference Zhang, Han, Zhang, Guo and Lu53], $S_{ru}$ and $S_{rv}$ signify the ratio between angular and linear velocities relative to the wheel radius. In this framework, the numerators of Eq. (2) represent the linear velocities of the wheel. Consequently, the condition $(S_{ru}, S_{rv})= 0$ denotes equality between the linear velocity per wheel radius and the respective angular velocity. In such instances, slip is disregarded due to optimal traction between the wheel and track, resulting in the full conversion of the wheel’s angular velocity to linear velocity. Conversely, the wheel rotates on its axle amidst the absence of linear movement as $(S_{ru}, S_{rv}) \rightarrow 1$ . It is worth noting that in this research, the exploration of the slip ratio is confined to the kinematics context.

The accelerations of the DDMR ( $^R\ddot{x}, \ddot{\theta }$ ) can be obtained as the right and left wheel torques ( $\tau _u, \tau _v$ ) are applied. Assume that the robot body and the wheels are a unified rigid body with mass $M$ and moment of inertia $J$ , the dynamics of the DDMR [Reference Hatab48, Reference Tzafestas54, Reference Nurmaini55] is modeled as

(3) \begin{equation} ^R\ddot{x} = \frac{1}{Mr_w}(\tau _u + \tau _v)\\[5pt] \end{equation}
(4) \begin{equation} \ddot{\theta } = \frac{l_r}{Jr_w}(\tau _u - \tau _v) \end{equation}

To control the DDMR using voltage inputs, the actuator model of a DC motor is considered. The actuator model is vital since it will make the DDMR model more realistic. Here, identical DC motor dynamics for the left and right wheels are given by

(5) \begin{equation} \left \{ \begin{array}{l} V_a = R_a i_a + L_a \dfrac{d\,i_a}{dt} + e_a \\[5pt] e_a = K_b \omega _m\\[5pt] \tau _m = K_t i_a \end{array} \right. \end{equation}

where the armature voltage $V_a$ is the input. The notations $R_a$ and $L_a$ represent the armature resistance and inductance, respectively. The torque $\tau _m$ is the output generated by the armature current $i_a$ with the torque ratio $K_t$ . Also, the back emf $e_a$ voltage is generated as the shaft rotational velocity $\omega _m$ arises with the constant factor $K_b$ .

Based on the above-mentioned mathematical models of kinematics, dynamics, and actuator dynamics, a trajectory-tracking control scheme is proposed with the objective of controlling the DDMR along the desired trajectory in spite of suffering a varying slip ratio. The proposed control is designed using position and heading errors as inputs and actuator voltages as outputs. The overall control diagram is illustrated in Fig. 2.

Figure 2. Overall control diagram: the blocks outside the RL agent represent the RL environment, providing both rewards and states to the agent. The RL agent produces right and left motor voltages $(V_u, V_v)$ as actions to induce a transition of the DDMR to a new state. Subsequently, the state is compared to the prescribed trajectory, forming the feedback loop. The RL agent also functions as a fuzzy inference engine, specifically optimized to mitigate tracking errors.

3. Fuzzy reinforcement learning control design

The proposed FRLC scheme takes two tracking errors, that is, distance error $e_d$ and heading error $e_\theta$ , and then generates actuator voltages, that is, left motor voltage $V_v$ and right motor voltage $V_u$ . The tracking errors are calculated relative to the moving frame $\{R\}$ as seen in Fig. 1. Hence, the reference point $(^Ix_G,\ ^Iy_G)$ in the trajectory $\mathcal{B}$ with respect to fixed frame $\{I\}$ is first transformed to the moving frame $\{R\}$ as follows:

(6) \begin{equation} \begin{bmatrix} ^Rx_G\\[5pt]^Ry_G\\[5pt]1 \end{bmatrix}= \begin{bmatrix} \cos \theta & -\sin \theta & ^Ix_R\\[5pt] \sin \theta & \cos \theta & ^Iy_R\\[5pt] 0&0&1 \end{bmatrix}^{-1} \begin{bmatrix} ^Ix_G\\[5pt]^Iy_G\\[5pt]1 \end{bmatrix} \end{equation}

where $(^Ix_R, \ ^Iy_R)$ is the position of the robot with respect to the fixed frame $\{I\}$ . Then, the distance error can be calculated as

(7) \begin{equation} e_d = \sqrt{(^Rx_G)^2+(^Ry_G)^2} \end{equation}

and the following heading error can be calculated as

(8) \begin{equation} e_\theta = \text{atan2}\!\left(^Ry_G, ^Rx_G\right) \end{equation}

where $\text{atan2}({\cdot})$ is a full quadrant inverse tangent function that yields an angle between $[{-}\pi, \pi ]$ [Reference Chapman56Reference Zhou, Dietrich, Walden, Kolb and Doppelbauer59].

Next, $e_d$ and $e_\theta$ are fuzzified with membership functions $\mu _d$ and $\mu _\theta$ as shown in Fig. 3. The mentioned membership functions $(\mu _d, \mu _\theta )$ , respectively, represent linguistic terms of $f_d = \{Z\,:\, Zero, M\,:\, Medium, F\,:\, Far\}$ and $f_h = \{NB\,:\, Negative\ Big\,, N\,:\,Negative, Z\,:\, Zero, P\,:\,Positive,$ $PB\,:\, Positive\ Big\}$ . Here, $f_d$ and $f_h$ denote the firing degrees of membership functions $(\mu _d, \mu _\theta )$ . It is worth noting that the membership function $Z$ incorporated in $f_h$ has been formulated with a reduced range of values. The objective of this approach is to emphasize the relevant facts regarding the proximity between the DDMR and the prescribed trajectory. It is anticipated that by providing more information within this range of distance, the DDMR will exhibit better accuracy in tracking performance. By using $f_d$ and $f_h$ , the RL agent is trained and learned to generate right and left motor voltages $(V_u, V_v)$ and further drive the robot along the prescribed trajectory.

Figure 3. The designed fuzzy membership functions $(\mu _d, \mu _\theta )$ for $e_d$ and $e_\theta$ : the membership functions for $e_d$ encompass linguistic terms such as Zero (Z), Medium (M), and Far (F). For $e_\theta$ , the linguistic terms include Negative Big (NB), Negative (N), Zero (Z), Positive (P), and Positive Big (PB).

In this study, deep deterministic policy gradient (DDPG) [Reference Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver and Wierstra60Reference Liu, Wang, Wang, Jia and Shi62] approach in the RL agent is utilized to build the relation between $(f_d,f_h)$ and $(V_u, V_v)$ . It is important to note that DDPG is Markov decision process (MDP) based RL that requires the interaction of an agent and an environment. By referring to Fig. 2, the environment is the blocks except the RL agent block. For each time step $t$ , the environment provides state information $s_t$ and reward $r_{Dt}$ based on the action $a_t$ taken by the RL agent. In this way, the RL agent repeats the MDP and creates experience replay memory denoted as a tuple $h = (s_t, a_t, s_{t+1}, r_t)$ . For each step, the actions are chosen by following a deterministic policy function $\mu _D$ that maps the state space $\mathcal{S}$ to an action space $\mathcal{A}$ to maximize the sum of future discounted rewards given as

(9) \begin{align} R_{D(t)} &= \sum\nolimits_{k=0}^T \gamma ^k r_{D(t+k)}\nonumber\\[5pt] &= r_{Dt} + \gamma R_{D(t+1)} \end{align}

where $\gamma$ is a discount factor. As explained in ref. [Reference Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver and Wierstra60], the action-value function is used as follows:

(10) \begin{equation} Q(s_t, a_t)= E \!\left [ r_D(s_{t}, a_{t}) + \gamma Q(s_{(t+1)}, \mu _D (s_{(t+1)})) \right ] \end{equation}

where $Q(s_t, a_t)$ represents an expected discounted future reward after taking action $a_t$ at state $s_t$ and following policy $\mu _D$ afterward.

In the framework of DDPG, a dual neural network configuration is deployed, comprising the actor-network and critic-network illustrated in Fig. 4. The critic-network, parametrized by $\theta ^Q$ , serves the purpose of approximating the action-value function $Q(s,a\mid \theta ^Q)$ . In parallel, the actor-network, parametrized by $\theta ^\mu$ , is employed to approximate the deterministic policy function $\mu _D(s\mid \theta ^\mu )$ . The RL agent employs the actor-network $\theta ^\mu$ to generate the action $a_t$ for the purpose of directing the movement of the DDMR based on the presence of state $s_t$ . Following this action, the RL agent receives a reward $r_D$ from the environment and transitions to the subsequent state $s_{(t+1)}$ . The iterative process repeats and the RL agent accumulate the experience tuple $(s_t, a_t, s_{t+1}, r_t)$ . To encourage exploration, an action noise in accordance with the Ornstein–Uhlenbeck method [Reference Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver and Wierstra60] is added to the output of the actor-network $\theta ^\mu$ . Notably, the variance of this action noise is iteratively reduced in every step of the training process.

Figure 4. Diagram of the actor-critic neural networks: the actor-network, denoted as $\mu _D(s\mid \theta ^\mu )$ , is in charge of approximating the deterministic policy function and subsequently generating actions based on the provided state. Simultaneously, the critic-network, represented as $Q(s,a\mid \theta ^Q)$ , approximates the action-value function. This network architecture takes the current and preceding firing degree outputs of the fuzzy membership function, expressed as $[f_{d(t-1)}, f_{h(t-1)}, f_{d(t)}, f_{h(t)}] \in \mathbb{R}^{16 \times 1}$ .

The critic-network $\theta ^Q$ is used to evaluate the action taken by the actor-network $\theta ^\mu$ at a particular state. Its parameters are updated using the gradient method in terms of action-value loss given as

(11) \begin{equation} L(\theta ^Q) = \left( y_t - Q\!\left(s_t, \mu _D \!\left(s_t\mid \theta ^\mu \right)\right) \right)^2 \end{equation}

where

(12) \begin{equation} y_t = r_D(s_{t}, a_{t}) + \gamma Q^{\prime}\!\left(s_{(t+1)}, \mu^{\prime}_D\!\left(s_{(t+1)}\mid \theta ^{\mu ^{\prime}}\right)\mid \theta ^{Q^{\prime}}\right)\end{equation}

where $Q^{\prime}\!\left(s,a\mid \theta ^{Q^{\prime}}\right)$ and $\mu^{\prime}_D\!\left(s\mid \theta ^{\mu ^{\prime}}\right)$ are target networks of the critic and actor, respectively. The parameters of networks $\theta ^{Q}$ and $\theta ^{\mu }$ are initially assigned randomly. Target network parameters denoted as $\theta ^{Q^{\prime}}$ and $\theta ^{\mu ^{\prime}}$ are then created as duplicates of the original networks.

In terms of the training process, a batch of experience data is sampled from the experience replay memory. The reward term $r_D$ in Eq. (12) can be straightforwardly obtained from each data tuple stored in the memory. Then, the subsequent state $s_{(t+1)}$ is extracted from the data tuple and fed into the target actor-network $\mu^{\prime}_D\!\left(s_{(t+1)}\mid \theta ^{\mu ^{\prime}}\right)$ to obtain the subsequent action $a_{(t+1)}$ . Consequently, the Q-value of the target network denoted as $Q^{\prime}\!\left(s_{(t+1)}, a_{(t+1)}\right)$ in Eq. (12) is acquired. Similarly, the Q-value term $Q(s_{t}, a_{t})$ in Eq. (11) is computed using the states $s_{t}$ and actions $a_{t}$ obtained from the data tuple within the experience replay memory. This training cycle is known as the temporal difference (TD) learning method [Reference Watkins and Dayan63].

The actor-network $\mu _D(s\mid \theta ^\mu )$ is updated by using the gradient method of the expected discounted future reward as follows:

(13) \begin{equation} \nabla _{\theta ^\mu } J \approx \nabla _a Q(s,\mu _D (s\mid \theta ^\mu )\mid \theta ^{Q}) \nabla _{\theta ^\mu }\mu _D(s\mid \theta ^\mu ) \end{equation}

By using TD soft update, the target networks are updated as

(14) \begin{equation} \left \{ \begin{array}{l} \theta ^{Q^{\prime}} \longleftarrow \tau \theta ^{Q} + (1-\tau ) \theta ^{Q^{\prime}}\\[5pt] \theta ^{\mu ^{\prime}} \longleftarrow \tau \theta ^{\mu } + (1-\tau ) \theta ^{\mu ^{\prime}} \end{array} \right. \end{equation}

where $\tau$ denotes a smooth factor. In Fig. 4, Dense notation represents a fully connected neural network layer. There are three activation function used here, rectified linear unit (ReLU), sigmoid function, and linear function. Additionally, this study adopts a concat layer in which two previous neural network layers are concatenated to create combined inputs for the next neural network layer.

In this study, states of the agent are $ s_t = [f_{d(t-1)}, f_{h(t-1)}, f_{d(t)}, f_{h(t)}]$ which form a $\mathbb{R}\mathrm{}^{16}$ vector. At $t=0$ , the initial membership degrees are $f_{d(t=0)} = [0,0,1]$ and $f_{h(t=0)} = [0,0,0,0,1]$ . The actions for each time step are continuous right and left motor voltages $a_t=[V_u, V_v]$ ranged between $[0,20]$ volts. Consequently, rewards are given by the environment and are designed such that the RL agent achieves trajectory tracking. To do so, some key performance factors must be rewarded. The first reward $g_1$ is related to the tracking performance as

(15) \begin{equation} g_1 = \frac{1}{2}\left(1-\frac{1}{9}e_d\right) + \frac{1}{4}\left(1-\frac{\pi }{6}\mid e_\theta \mid \right) - \frac{1}{4}\text{sgn}(\dot{\mid e_\theta \mid }) \end{equation}

where $g_1$ is designed so that $e_\theta$ and $e_d$ asymptotically converge to zero as $t$ progresses. Second, a penalty $g_2 = -10$ is given when $\mid e_\theta \mid \gt \frac{8\pi }{9}$ . This is intended to encourage the RL to take the shortest turning angle to reduce $e_\theta$ . The reward $g_3 = 10$ is given when $e_d\lt 0.1$ . The term $g_3$ is designed to make the robot maintain its position while the trajectory is reached. Henceforth, the sum of rewards from the environment to the RL agent for each time step $t$ is calculated as

(16) \begin{equation} r_{Dt} = g_1 + g_2 + g_3 \end{equation}

In the context of trajectory tracking on a DDMR, minimizing spatial error ( $e_d$ ) from a predetermined trajectory is fundamental. Therefore, $e_d$ related rewards hold greater significance than $e_\theta$ . In addition, DDMRs are nonholonomic in nature and require sequential reduction of $e_\theta$ prior to $e_d$ value. The entire control process is operated by utilizing fuzzified $(e_d, e_\theta )$ as inputs leading to $(V_u, V_v)$ generation through an RL agent output. In order to evaluate the effectiveness of the proposed FRLC, both simulations and experiments are conducted and discussed in details in the next section.

4. Discussions of simulation and experimental results

The proposed FRLC is trained using MATLAB/SIMULINK where a random goal point is generated at the commencement of each episode. Throughout the training process, the duration of each episode spans 20 s, with a time step of 0.05 s. The specific system parameters governing the DDMR are enumerated in Table I within the simulation. Additionally, the training configuration encompasses an experience replay memory capacity of $10^6$ tuples, coupled with a batch size of 64 for the purpose of training. The parameter $\gamma$ , representing the discounted reward factor, is set to 0.99 to give a heightened emphasis on future rewards for the RL agent. In parallel, the noise variance $\mathcal{N}$ influencing action exploration commences at 0.3 volts and undergoes iterative updates as dictated by the recursive formula $\mathcal{N} \!\leftarrow \mathcal{N}(1-10^3)$ for each step of the training process. Lastly, the target network parameters are subject to iterative updates at every training step employing a soft update parameter value of $\tau =10^{-3}$ .

Table I. The used system parameters of the DDMR during simulations.

The training is performed until the average steady-state error ( $e_d$ ) attains a threshold below 0.2 m signifying the attainment of a predefined performance threshold. Drawing insights from prior training outcomes, it has been ascertained that 2500 training episodes suffice to attain this objective. The training results are concisely summarized through the presentation of average reward values computed over intervals of 20 episodes as shown in Fig. 5. Here, the training results of both the proposed FRLC and a classical DDPG model are exhibited for comparison. The classical DDPG model uses the same network structure and reward function with the proposed FRLC, but the state inputs of the classical DDPG consist of the tracking error values, that is, $\left [e_{d(t-1)}, e_{\theta (t-1)},e_{d(t)}, e_{\theta (t)}\right ]$ .

Figure 5. Comparison of average cumulative training phase rewards over 20 episodes: this representation illustrates that the proposed FRLC exhibits accelerated learning. Throughout the training phase, the proposed FRLC accumulates $8.5 \times 10^6$ reward points surpassing the classical DDPG model’s accumulation of $6.9 \times 10^6$ reward points.

The results show that the proposed FRLC outperforms the conventional DDPG in terms of cumulative rewards per episode. Specifically, the cumulative rewards attained by the FRLC are greater than those of the classical DDPG, reaching $8.5 \times 10^6$ reward points as opposed to the latter $6.9 \times 10^6$ reward points. It is also worth noting that the proposed FRLC achieves an accumulation of 3000 reward points at episode 181, indicating a faster pace compared to the classical DDPG, which attains this milestone at episode 411. The advantage of FRLC is further reflected in the stability of reward accumulation, particularly evident between episodes 1000 and 1500. This stability can be attributed to the state exploration process. In the case of FRLC, states are constrained by the fuzzy linguistic terms allowing for easier and broader exploration. Conversely, classical DDPG’s state inputs lack such bounds. When encountering a novel state, the classical DDPG agent performs a substantial update to the established network parameters resulting in a temporary loss of reward exploitation capacity.

Figure 6. Comparative responses of point-to-point tracking: this response illustrates the tracking performance of both the proposed FRLC and the classical DDPG model during the training phase. While the DDMR exhibits superior heading control, the proposed FRLC showcases an exceptional ability to accurately halt at the goal point, consequently maximizing reward accumulation.

Figure 7. Comparative tracking error ( $e_d, e_\theta$ ) responses: this figure presents a comparative analysis of tracking error ( $e_d, e_\theta$ ) responses between the proposed FRLC and the classical DDPG model. While the spatial plot in Fig. 6 depicts a more direct path from classical DDPG, it is observed that the convergence speed in $e_\theta$ for classical DDPG is only slightly better than that of FRLC. The response in $e_d$ further indicates that FRLC approaches the goal at a higher speed. The wider curve in the FRLC response in Fig. 6 can be attributed to the combined effects of turning with a faster approaching speed.

To validate the performance of the training result, the robot is initially placed at $(1,0)$ facing to the east and the goal position is at $({-}4,0)$ giving initial errors of $e_d(0)=5$ m and $e_\theta (0)=180^\circ$ . The tracking responses of the proposed FRLC and the classical DDPG are shown in Fig. 6. Moreover, their tracking errors and the corresponding actuator voltages are exhibited in Figs. 7 and 8, respectively. Both the proposed FRLC and the classical DDPG models successfully accomplish the task of point-to-point tracking. Nevertheless, it can be seen from Fig. 6 that the proposed FRLC has the ability to halt at the desired goal position and exploit the reward. On the other hand, the DDPG model makes a circular motion around the desired goal position to exploit the reward which is suboptimal compared to FRLC. This strategy can be confirmed by observing the actuator voltages shown in Fig. 8. Both $V_u$ and $V_v$ of the proposed FRLC converge to $0$ after $t\gt 2$ s, while the proposed FRLC is applied. On the contrary, the DDMR makes a counter-clockwise movement after $t\gt 3$ s, while the classical DDPG model is applied as shown by $V_v=0$ volts and $V_u=20$ volts. Another form of validation arises from the examination of the DDMR’s distance to the goal ( $e_d$ ) as a function of time. As depicted in Fig. 7, the proposed FRLC maintains a stable $e_d$ for $t\gt 1.9$ s, whereas the classical DDPG reveals fluctuations in $e_d$ at $t\gt 2.8$ s, due to its circular motion.

Figure 8. Comparative actuator voltage ( $V_u, V_v$ ) responses in the proposed FRLC and classical DDPG: in classical DDPG, a distinctive method of maintaining a straight path is observed and achieved through alternating voltages (as seen during $0\lt t\lt 3$ s). However, the FRLC exhibits a superior strategy characterized by a quicker tracking error response as demonstrated in Fig. 6. Additionally, after $t\gt 3$ s, the FRLC employs a unique strategy by completely regulating the voltage inputs to zero, exemplifying a halting approach as indicated in Fig. 6.

The motion of the DDMR is also discernible through the analysis of motor torques ( $\tau _u, \tau _v$ ), as depicted in Fig. 9. In both the FRLC and classical DDPG simulations, the initial $\tau _u$ exceeds $\tau _v$ , indicating the DDMR’s left-turning acceleration, as illustrated in Fig. 5. At subsequent simulation time, the torque response in the classical DDPG simulation exhibits alternating values corresponding to the voltage response in Fig. 8. These torque responses are intricately linked to changes in actuating voltages. Following this, the FRLC simulation, characterized by fewer variations in actuation voltage, produces comparatively smoother torque responses. Then, the classical DDPG torque response reveals $\tau _u \gt \tau _v$ at $t \gt 3$ s, signifying the turning motion. Notably, the negative value of $\tau _v$ aligns with the opposing torque generated by the back electromotive force. Conversely, in the FRLC simulation, both torque and voltage responses approach zero at $t \gt 2$ , indicating a consistent halting motion.

Figure 9. Response comparisons of motor torques ( $\tau _u, \tau _v$ ) between the proposed FRLC and classical DDPG: observations from both methodologies reveal an initial predominance of right motor torque, indicative of left-turning acceleration in the DDMR. At $t \gt 2$ s, both motor torques in the FRLC approach to zero. Conversely, a slightly higher value in the right motor torque persists at $t \gt 3$ s in the classical DDPG, sustaining the right-turning motion of the DDMR.

Another noteworthy phenomenon is that the classical DDPG has a more direct path compared to the proposed FRLC. On the other hand, as shown in Fig. 7, the difference in reaching time of $e_\theta$ is not significantly different, that is, at $t=0.30$ s for FRLC and $t=0.44$ s for DDPG. Furthermore, the straight path produced by classical DDPG contains oscillations that are apparent in the temporal domain as shown in Fig. 7. These oscillations can be attributed to the classical DDPG’s strategy of alternating voltages as depicted in Fig. 8. Additionally, the proposed FRLC shows superior results in terms of $e_d$ . It is shown in Fig. 7 that the proposed FRLC reaches the goal point at $t=1.9$ s compared to DDPG’s $t=2.8$ s. DDMR’s turning motion combined with a faster approaching speed explains the wider curve made by the proposed FRLC depicted in Fig. 7. Moreover, the steady-state $e_d$ of the proposed FRLC is smaller than $0.06$ m compared to the classical DDPG with a steady-state error $e_d\lt 0.39$ . In summary, the proposed FRLC accumulates 583.9 reward points during this validation phase using a point-to-point tracking task compared to the classical DDPG which accumulates 112.3 reward points. Based on the aforementioned assessments, it is concluded that the proposed FRLC is superior to the classical DDPG in solving this task.

The trajectory designed for the testing phase is different compared to its training phase. This is a typical way to further investigate the performance of an RL agent under conditions that have not been learned during the training phase. Hence, a desired circular trajectory is used for the testing routine and designed as

(17) \begin{equation} \left \{ \begin{array}{l} x_{ref} = 1.2 + 0.8 *\cos\! (2\pi t/T + \pi/2) \\[5pt] y_{ref} = 0.8 *\sin\! (2\pi t/T + \pi/2) \end{array} \right. \end{equation}

where the episode length $T$ is $60$ s. The two initial poses $(^Ix_B,^Iy_B, \theta )_{t=0}$ of case 1 and case 2 are, respectively, $(1.15, 0, 0)$ and $(1.94, 1.46, 0.8)$ . Additionally, a varying slip ratio is separately injected into each wheel to test the robustness of the proposed FRLC as

(18) \begin{equation} \left \{ \begin{array}{l} S_{ru} = \text{MAX}(0, 0.15\sin \!\left(\dfrac{2\pi t}{0.42T}\right) + 0.05\cos \!\left(\dfrac{2\pi t}{0.1T}\right) \\[18pt] S_{rv} = \text{MAX}(0, 0.15\sin \!\left(\dfrac{2\pi t}{0.38T}\right) + 0.05\cos \!\left(\dfrac{2\pi t}{0.13T}\right) \end{array} \right. \end{equation}

where $MAX(0,\cdot )$ is a function that yields any value bigger than zero. The response of the varying slip ratio is shown in Fig. 10. By using Eq. (18), the trajectory can be imagined and simulated like a slippery floor.

Figure 10. Varying slip ratios ( $S_{ru}, S_{rv}$ ) for both left and right wheels: the proposed FRLC, which has undergone training to track a goal point rather than a trajectory, is subjected to the additional challenge of varying slip ratios in the wheel dynamics to assess its robustness. These scenarios present novel conditions for the RL agent unencountered during its training phase.

The tracking responses of a circular trajectory under the proposed FRLC in the simulated environment are shown in Fig. 11. It is observed that the tracking performance of both initial cases is quite satisfactory as shown by the overlapping trajectory and the traced DDMR path. The accuracy of the circular trajectory tracking can be further observed in the $e_d$ responses shown in Fig. 12. Here, the steady state is reached at $t\gt 7.6$ s, and the error is confined within $e_d\lt 0.05$ m. Likewise, the proposed FRLC shows proficient heading control during the tracking of the designed circular trajectory as evidenced by the limited heading error $e_\theta \lt 12^\circ$ at $t\gt 11$ s. In Fig. 12, both $e_d$ and $e_\theta$ can be seen to remain constant despite the variations in the slip ratio. However, fluctuations appeared in actuator voltages (shown in Fig. 13) at $t=30$ and $t=55$ s. This particular timestamp is in relation to the peaks of varying slip ratios as shown in Fig. 10. Therefore, it can be concluded that the proposed FRLC successfully counteracts the varying slip ratios thereby maintaining the trajectory tracking accuracy.

Figure 11. Simulation results for tracking a circular trajectory: the DDMR is tasked to follow this trajectory from varying initial conditions. Notably, the DDMR’s path closely mirrors the intended trajectory, signifying a noteworthy degree of accuracy.

Figure 12. Tracking error responses $(e_d, e_\theta )$ of the circular trajectory in the simulation: the accuracy of the proposed FRLC extends to the temporal domain, with steady-state errors confined to $e_d \lt 0.05$ m while $t\gt 7.6$ s and $e_\theta \lt 12^\circ$ after $t\gt 11$ s.

Figure 13. Voltage responses $(V_u, V_v)$ during the tracking of the desired circular trajectory in the simulation: the trained FRLC approach effectively maintains an excellent tracking accuracy by adapting the voltage to overcome variations in slip ratio, particularly at $t=30$ and $t=55$ s.

Additionally, Fig. 14 showcases the torque responses, revealing a range of values similar to those observed in the preceding simulation (see Fig. 9). Specifically, the minimum and maximum torque values of $-1.2$ and $1.6$ Nm, respectively, are associated with case #1 simulation. This correlates with faster turning as indicated by a quicker reaching time for $e_\theta$ , as depicted in Fig. 12. Subsequently, the torques gradually approach to zero as the tracking errors are stabilized (at $t\gt 10$ in Fig. 12). This observation is also linked to the minimal adjustments in actuator voltages throughout this temporal interval. The torque responses contribute as an additional metric for tracking evaluation, aligning with the phenomena observed in the tracking error and actuator voltage responses.

Figure 14. Torque responses ( $\tau _u, \tau _v$ ) during the tracking simulation of the circular trajectory: the torque is representative of the difference in motor voltages. As the voltages settle and their magnitude approaches to a steady state, the torque response also approaches to zero.

In contrast to a circular trajectory characterized by a consistent counterclockwise curve, the Lemniscate of Gerono introduces more variations in curve sharpness and directions which poses more challenges to the proposed FRLC. Here, the following Lemniscate of Gerono trajectory is designed to further investigate the tracking performance of the proposed FRLC,

(19) \begin{equation} \left \{ \begin{array}{l} x_{ref} = 1.2 + 0.6\sin\! (4\pi t/T) \\[5pt] y_{ref} = -1 + 2\cos\! (2\pi t/T) \end{array} \right. \end{equation}

where the episode length $T$ is 70 s. The same varying slip ratio in Eq. (18) is also injected into the left and right wheels of the DDMR. The result is shown in Fig. 15. Notably, the traced DDMR path accurately aligns with the designed trajectory. This implies that, under the guidance of the proposed FRLC, the DDMR proficiently navigates the diverse curves and directions inherent to the Lemniscate of Gerono trajectory. Having undergone extensive testing with various initial conditions and trajectories, it can be concluded that the proposed FRLC has learned to maximize the trajectory-tracking policy.

Figure 15. Simulation results for Lemniscate of Gerono trajectory tracking: the Lemniscate of Gerono exhibits greater variability in curve sharpness and turning direction compared to the circular trajectory. Nonetheless, the DDMR showcases a high level of proficiency in navigating this trajectory starting from two initial cases.

In order to evaluate the practical applicability of the proposed FRLC in real-world mobile robotic systems, a series of experiments were conducted utilizing an ADLINK NeuronBot, featuring an Intel i5 processor and 8 GB of RAM. The robot operating system (ROS) is installed on the DDMR platform to facilitate communication between the actuators and the controller. Additionally, a desktop PC featuring an Intel i7 CPU, 16 GB of RAM, and an NVIDIA 2060 GPU is employed to implement the FRLC control methodology. In this context, the addition of a more powerful desktop PC enhances the computational capacity of the DDMR, as the DDMR itself lacks the size to accommodate the weight and dissipate the heat generated by a GPU. Furthermore, the DDMR is constrained in its ability to supply the required power to the GPU, given that it is solely powered by 24-volt batteries. Communication between the desktop PC and the DDMR platform is established via WiFi, leveraging a protocol facilitated by ROS. The robot’s pose is determined through odometry, employing integrated rotary encoders as the measuring apparatus. Notably, the DDMR utilized in the experiments is actuated by a CHP-36GP-555S DC motor, featuring an onboard Hall effect rotary sensor rated to generate 2363 pulses per rotation. The rotational velocity is computed using the formula $\dot \varphi _{u,v} = \text{(counted pulse)}_{u,v}*2\pi/(\Delta t*2363)$ where $\Delta t$ represents the time interval between measurements. The pose of the DDMR is estimated by integrating velocities over time, as outlined in Eq. (1). Additionally, the prescribed trajectory point is stored internally in the DDMR’s memory. Furthermore, the NeuronBot was equipped with a LiDAR system allowing for the preconstruction of an occupancy grid map which served as the basis for representing the robot within a fixed reference frame $\{I\}$ . Thus, within this experimental framework, all state inputs are quantifiable and computable, facilitated by sensors embedded in the DDMR. In the absence of this functionality, the deployment of the RL agent is constrained to simulated environments, rendering it non-applicable in real-world contexts. This essential feature serves as a crucial validation of the practical viability associated with the incorporation of the proposed FRLC in practical settings.

Figure 16 provides a visual representation of the experimental setup offering insight into its environmental condition. The experiments replicate the same trajectories and initial conditions employed in the simulations. As a result, the tracking responses for both circular and Lemniscate of Gerono trajectories are showcased in Figs. 17 and 18, respectively. Here, an occupancy grid map serves as the representation for the DDMR’s position.

Figure 16. Experimental configuration: the employed experimental platform is the ADLINK NeuronBot, characterized by its cubic form and black aesthetic. Illustrated here is the circular trajectory, featuring a radius of $0.8$ m.

Figure 17. Experimental results for circular trajectory tracking: the experiment replicates the initial conditions from the simulation. The traced path of the DDMR is visualized on a pre-established occupancy grid map of the room. These findings affirm the DDMR’s ability to track the prescribed circular trajectory based on the proposed FRLC.

Figure 18. Experimental results for Lemniscate of Gerono trajectory tracking: the successful navigation of both circular and Lemniscate of Gerono trajectories by the DDMR across diverse initial conditions is indicative to the effectiveness of the learned trajectory tracking policy facilitated by the proposed FRLC.

In practical scenarios, a discernible attenuation is present in both received pose data and transmitted control commands primarily stemming from fluctuations in WiFi signal strength. As the DDMR is set into motion, the delayed measurement introduces a mismatch between the received data and the actual system state. Consequently, control outputs are calculated based on this lagging information. Furthermore, an additional attenuation is present in the transmission of control commands. The combination of these factors leads to control commands that lack precise correspondence to the real-world state. This stands in contrast to the training phase where data transmission encounters no attenuation. Additionally, during training, the RL agent operates in an environment free from disturbances and measurement errors, whereas real-world conditions introduce various uncertainties and measurement noise. These factors collectively contribute to the emergence of high frequencies, evident in Figs. 17 and 18. To address this challenge, enhancing the RL agent’s capacity to comprehend temporal data is advisable for the continued investigation of this research pathway. This topic coincides with the investigations conducted on recurrent neural networks (RNN). Introducing an encoder network layer within the RL agent also offers an alternative strategy to improve the extraction of key information and bolster resilience against measurement noises. Additionally, conducting an online training phase in a real-world environment is considered beneficial for the subsequent iteration in the advancement of this research direction. Despite challenges and disparities between the simulation and real-world implementation, it is evident that under identical initial conditions, the DDMR’s motion closely mirrors that observed in simulations. Consequently, the DDMR adeptly adheres to the planned trajectories, substantiating not only the theoretical feasibility but also the practical applicability of the proposed fuzzy reinforcement learning controller (FRLC).

5. Conclusions

The FRLC technique is proposed for a DDMR to track prescribed trajectories. By employing fuzzified tracking errors as inputs and generating actuator voltages through a trained RL agent, the proposed approach showcases substantial advancements over classical DDPG. Notably, the FRLC accumulates an additional $1.6\times 10^{6}$ and $471$ reward points during the training and validation phases, respectively, surpassing the classical DDPG. Furthermore, the FRLC exhibits faster convergence, achieving 3000 reward points in 230 episodes earlier than its classical counterpart. Through simulations under different initial conditions and varying slip ratio disturbances, the FRLC demonstrates precision and swift response, achieving a steady-state error of 0.05 m within 7.6 s. It also displays proficient heading control, limiting the heading error to $12^\circ$ within 11 s. The FRLC’s adeptness in accurately maneuvering through a Lemniscate of Gerono trajectory, characterized by varying turning sharpness and directions amidst slip ratio disturbances, is further highlighted. Additionally, experimental verifications are performed based on identical initial conditions and trajectories as presented in the simulations. The results provide tangible evidence of the FRLC’s applicability to real-world mobile robots.

The prospective research directions in this domain involve integrating RNNs and encoder networks into the RL agent. This integration aims to improve the extraction of principal information from temporal data sequences and enhance the RL agent’s robustness against external disturbances. It should be emphasized that the trained RL agent is likely specialized for the control of a DDMR under specified dynamic characteristics. In alignment with the overarching objective of advancing Artificial General Intelligence, the uncertainties stemming from drastic variations in the physical attributes of the DDMR necessitate extensive investigations to develop a versatile RL agent capable of effectively controlling a DDMR under diverse dynamic configurations. Moreover, the development of online training capabilities in a real-world environment deserves further research exploration to incorporate both disturbance and uncertainty elements into the knowledge base of the RL agent and enhance its resilience against these elements in practical settings.

Author contributions

The research was a collaborative effort and all the authors played an important role in contributing to it.

Financial support

This research was supported by Taiwan’s Ministry of Science and Technology under grant number MOST 111-2221-E-027-130.

Competing interests

The authors declare no competing interests exist.

Ethical approval

Not applicable.

References

Dekle, R., “Robots and industrial labor: Evidence from Japan,” J. Jpn. Int. Econ. 58, 101108 (2020).CrossRefGoogle Scholar
Fan, H., Hu, Y. and Tang, L., “Labor costs and the adoption of robots in China,” J. Econ. Behav. Organ. 186, 608631 (2021).CrossRefGoogle Scholar
Zhao, W., Wang, X., Qi, B. and Runge, T., “Ground-level mapping and navigating for agriculture based on iot and computer vision,” IEEE Access 8, 221975221985 (2020).CrossRefGoogle Scholar
Fernandez, B., Herrera, P. J. and Cerrada, J. A., “A simplified optimal path following controller for an agricultural skid-steering robot,” IEEE Access 7, 9593295940 (2019).CrossRefGoogle Scholar
Dutta, A., Roy, S., Kreidl, O. P. and Bölöni, L., “Multi-robot information gathering for precision agriculture: Current state, scope, and challenges,” IEEE Access 9, 161416161430 (2021).CrossRefGoogle Scholar
Gao, X., Li, J., Fan, L., Zhou, Q., Yin, K., Wang, J., Song, C., Huang, L. and Wang, Z., “Review of wheeled mobile robots’ navigation problems and application prospects in agriculture,” IEEE Access 6, 4924849268 (2018).CrossRefGoogle Scholar
Pak, J., Kim, J., Park, Y. and Son, H. I., “Field evaluation of path-planning algorithms for autonomous mobile robot in smart farms,” IEEE Access 10, 6025360266 (2022).CrossRefGoogle Scholar
Jung, Y., Kim, Y., Lee, W. H., Bang, M. S., Kim, Y. and Kim, S., “Path planning algorithm for an autonomous electric wheelchair in hospitals,” IEEE Access 8, 208199208213 (2020).CrossRefGoogle Scholar
Niloy, M. A. K., Shama, A., Chakrabortty, R. K., Ryan, M. J., Badal, F. R., Tasneem, Z., Ahamed, M. H., Moyeen, S. I., Das, S. K., Ali, M. F., Islam, M. R. and Saha, D. K., “Critical design and control issues of indoor autonomous mobile robots: A review,” IEEE Access 9, 3533835370 (2021).CrossRefGoogle Scholar
Chen, C. S., Lin, C. J. and Lai, C. C., “Non-contact service robot development in fast-food restaurants,” IEEE Access 10, 3146631479 (2022).CrossRefGoogle Scholar
Zheng, Y., Chen, S. and Cheng, H., “Real-time cloud visual simultaneous localization and mapping for indoor service robots,” IEEE Access 8, 1681616829 (2020).CrossRefGoogle Scholar
de Jesus, J. C., Bottega, J. A., de Souza Leite Cuadros, M. A. and Gamarra, D. F. T., “Deep deterministic policy gradient for navigation of mobile robots,” J. Intell. Fuzzy Syst. 40(1), 349361 (2021).CrossRefGoogle Scholar
Quan, H., Li, Y. and Zhang, Y., “A novel mobile robot navigation method based on deep reinforcement learning,” Int. J. Adv. Rob. Syst. 17(3), 5 (2020).Google Scholar
Gao, J., Ye, W., Guo, J. and Li, Z., “Deep reinforcement learning for indoor mobile robot path planning,” Ah S Sens. 20(19), 5493 (2020).CrossRefGoogle ScholarPubMed
Khlif, N., Nahla, K. and Safya, B., “Reinforcement learning with modified exploration strategy for mobile robot path planning,” Robotica 41(9), 115 (2023).CrossRefGoogle Scholar
Gao, X., Gao, R., Liang, P., Zhang, Q., Deng, R. and Zhu, W., “A hybrid tracking control strategy for nonholonomic wheeled mobile robot incorporating deep reinforcement learning approach,” IEEE Access 9, 1559215602 (2021).CrossRefGoogle Scholar
Bai, C., Yan, P., Pan, W. and Guo, J., “Learning-based multi-robot formation control with obstacle avoidance,” IEEE Trans. Intell. Transp. 23(8), 1181111822 (2021).CrossRefGoogle Scholar
Wu, K., Wang, H., Esfahani, M. A. and Yuan, S., “Bnd*-ddqn: Learn to steer autonomously through deep reinforcement learning,” IEEE Trans. Cognit. Dev. Syst. 13(2), 249261 (2021).CrossRefGoogle Scholar
Song, H., Li, A., Wang, T. and Wang, M., “Multimodal deep reinforcement learning with auxiliary task for obstacle avoidance of indoor mobile robot,” Ah S Sens. 21(4), 2 (2021).Google ScholarPubMed
Shi, H., Shi, L., Xu, M. and Hwang, K.-S., “End-to-end navigation strategy with deep reinforcement learning for mobile robots,” IEEE Trans. Ind. Inf. 16(4), 23932402 (2020).CrossRefGoogle Scholar
Wang, Y., He, H. and Sun, C., “Learning to navigate through complex dynamic environment with modular deep reinforcement learning,” IEEE Trans. Games 10(4), 400412 (2018).CrossRefGoogle Scholar
Luong, M. and Pham, C., “Incremental learning for autonomous navigation of mobile robots based on deep reinforcement learning,” J. Intell. Rob. Syst. 101(1), 1 (2021).CrossRefGoogle Scholar
Peng, X., Chen, R., Zhang, J., Chen, B., Tseng, H.-W., Wu, T.-L. and Meen, T.-H., “Enhanced autonomous navigation of robots by deep reinforcement learning algorithm with multistep method,” Sens. Mater. 33(2), SI, 825842 (2021).Google Scholar
Xie, L., Miao, Y., Wang, S., Blunsom, P., Wang, Z., Chen, C., Markham, A. and Trigoni, N., “Learning with stochastic guidance for robot navigation,” IEEE Trans. Neural Network Learn. Syst. 32(1), 166176 (2021).CrossRefGoogle ScholarPubMed
Ye, C., Shao, J., Liu, Y. and Yu, S., “Fuzzy active disturbance rejection control method for an omnidirectional mobile robot with my3 wheel,” Ind. Robot Int. J. Rob. Res. Appl. 50(4), 706716 (2023).CrossRefGoogle Scholar
Boo, J. and Chwa, D., “Fuzzy integral sliding mode observer-based formation control of mobile robots with kinematic disturbance and unknown leader and follower velocities,” IEEE Access 10, 7692676938 (2022).CrossRefGoogle Scholar
Van, M. and Ge, S. S., “Adaptive fuzzy integral sliding-mode control for robust fault-tolerant control of robot manipulators with disturbance observer,” IEEE Trans. Fuzzy Syst. 29(5), 12841296 (2021).CrossRefGoogle Scholar
Chwa, D. and Boo, J., “Adaptive fuzzy output feedback simultaneous posture stabilization and tracking control of wheeled mobile robots with kinematic and dynamic disturbances,” IEEE Access 8, 228863228878 (2020).CrossRefGoogle Scholar
Cherroun, L., Boumehraz, M. and Kouzou, A., Mobile Robot Path Planning Based on Optimized Fuzzy Logic Controllers (Springer Singapore, Singapore, 2019) pp. 255283.Google Scholar
Campos, J., Jaramillo, S., Morales, L., Camacho, O., Chávez, D. and Pozo, D., “Pso Tuning for Fuzzy pd + i Controller Applied to a Mobile Robot Trajectory Control,” 2018 International Conference on Information Systems and Computer Science (INCISCOS) (2018) pp. 6268.Google Scholar
Subbash, P. and Chong, K. T., “Adaptive network fuzzy inference system based navigation controller for mobile robot,” Front. Inf. Tech. Electron. Eng. 20(2), 141151 (2019).CrossRefGoogle Scholar
Saidi, Y., Nemra, A. and Tadjine, M., “Robust mobile robot navigation using fuzzy type 2 with wheel slip dynamic modeling and parameters uncertainties,” Int. J. Modell. Simul. 40(6), 397420 (2020).CrossRefGoogle Scholar
Amador-Angulo, L., Castillo, O., Melin, P. and Castro, J. R., “Interval type-3 fuzzy adaptation of the bee colony optimization algorithm for optimal fuzzy control of an autonomous mobile robot,” Micromachines-BASEL 13(9), 1490 (2022).CrossRefGoogle ScholarPubMed
Ha, V. Q., Pham, S. H.-T. and Vu, N. T.-T., “Adaptive fuzzy type-ii controller for wheeled mobile robot with disturbances and wheelslips,” J. Rob. 2021, 111 (2021).CrossRefGoogle Scholar
Hua, G., Wang, F., Zhang, J., Alattas, K. A., Mohammadzadeh, A. and The Vu, M., “A new type-3 fuzzy predictive approach for mobile robots,” Mathematics 10(17), 3186 (2022).CrossRefGoogle Scholar
Prieto, P. J., Aguilar, L. T., Cardenas-Maciel, S. L., Lopez-Renteria, J. A. and Cazarez-Castro, N. R., “Stability analysis for mamdani-type integral fuzzy-based sliding-mode control of systems under persistent disturbances,” IEEE Trans. Fuzzy Syst. 30(6), 16401647 (2022).CrossRefGoogle Scholar
Selvachandran, G., Quek, S. G., Lan, L. T. H., Son, L. H., Giang, N. L., Ding, W., Abdel-Basset, M. and de Albuquerque, V. H. C., “A new design of mamdani complex fuzzy inference system for multiattribute decision making problems,” IEEE Trans. Fuzzy Syst. 29(4), 716730 (2021).CrossRefGoogle Scholar
Xia, H., Tang, J., Yu, W., Cui, C. and Qiao, J., “Takagi–sugeno fuzzy regression trees with application to complex industrial modeling,” IEEE Trans. Fuzzy Syst. 31(7), 22102224 (2023).CrossRefGoogle Scholar
Tognetti, E. S. and Linhares, T. M., “Dynamic output feedback controller design for uncertain takagi–sugeno fuzzy systems: A premise variable selection approach,” IEEE Trans. Fuzzy Syst. 29(6), 15901600 (2021).CrossRefGoogle Scholar
Ruiz-García, G., Hagras, H., Pomares, H. and Ruiz, I. R., “Toward a fuzzy logic system based on general forms of interval type-2 fuzzy sets,” IEEE Trans. Fuzzy Syst. 27(12), 23812395 (2019).CrossRefGoogle Scholar
Singh, D. J., Verma, N. K., Ghosh, A. K. and Malagaudanavar, A., “An approach towards the design of interval type-3 t–s fuzzy system,” IEEE Trans. Fuzzy Syst. 30(9), 38803893 (2022).CrossRefGoogle Scholar
Mendel, J. M., “Comparing the performance potentials of interval and general type-2 rule-based fuzzy systems in terms of sculpting the state space,” IEEE Trans. Fuzzy Syst. 27(1), 5871 (2019).CrossRefGoogle Scholar
Hu, Q. and Zheng, B., “An efficient takagi–sugeno fuzzy zeroing neural network for solving time-varying sylvester equation,” IEEE Trans. Fuzzy Syst. 31(7), 24012411 (2023).CrossRefGoogle Scholar
Zhang, L., Shi, Y., Chang, Y.-C. and Lin, C.-T., “Federated fuzzy neural network with evolutionary rule learning,” IEEE Trans. Fuzzy Syst. 31(5), 16531664 (2023).CrossRefGoogle Scholar
Islam, M. A., Anderson, D. T., Pinar, A. J., Havens, T. C., Scott, G. and Keller, J. M., “Enabling explainable fusion in deep learning with fuzzy integral neural networks,” IEEE Trans. Fuzzy Syst. 28(7), 12911300 (2020).CrossRefGoogle Scholar
Dombi, J. and Hussain, A., “Data-driven interval type-2 fuzzy inference system based on the interval type-2 distending function,” IEEE Trans. Fuzzy Syst. 31(7), 23452359 (2023).CrossRefGoogle Scholar
Velagic, J., Lacevic, B. and Osmic, N.. Nonlinear Motion Control of Mobile Robot Dynamic Model (IntechOpen, Rijeka, 2008) ch. 27.CrossRefGoogle Scholar
Hatab, R., “Dynamic modelling of differential-drive mobile robots using lagrange and newton-euler methodologies: A unified framework,” Adv. Rob. Autom. 02, 1000107 (2013).Google Scholar
Ali, M. A. and Mailah, M., “A simulation and experimental study on wheeled mobile robot path control in road roundabout environment,” Int. J. Adv. Rob. Syst. 16(2), 1729881419834778 (2019).Google Scholar
Stefek, A., Pham, T. V., Krivanek, V. and Pham, K. L., “Energy comparison of controllers used for a differential drive wheeled mobile robot,” IEEE Access 8, 170915170927 (2020).CrossRefGoogle Scholar
Ma, Y., Zhao, J., Zhao, H., Lu, C. and Chen, H., “Mpc-based slip ratio control for electric vehicle considering road roughness,” IEEE Access 7, 5240552413 (2019).CrossRefGoogle Scholar
Cao, K., Hu, M., Wang, D., Qiao, S., Guo, C., Fu, C. and Zhou, A., “All-wheel-drive torque distribution strategy for electric vehicle optimal efficiency considering tire slip,” IEEE Access 9, 2524525257 (2021).CrossRefGoogle Scholar
Zhang, N., Han, Z., Zhang, Z., Guo, K. and Lu, X., “MAS-based slip ratio fault-tolerant control in finite time for EV,” IEEE Access 9, 4564245654 (2021).CrossRefGoogle Scholar
Tzafestas, S. G.. Introduction to Mobile Robot Control (Elsevier, Oxford, 2013).Google Scholar
Nurmaini, S. and Chusniah, “Differential Drive Mobile Robot Control using Variable Fuzzy Universe of Discourse,” 2017 International Conference on Electrical Engineering and Computer Science (ICECOS) (2017) pp. 5055.Google Scholar
Chapman, S. J., Fortran 90/95 for Scientists and Engineers, 1st ed. (WCB/McGraw-Hill, Boston, 1998).Google Scholar
MathWorks, MATLAB The Language of Technical Computing: Function Reference Volume 1: A - E Version 7 (The MathWorks, Inc, Natick, 2004).Google Scholar
Torres, V. and Valls, J., “A fast and low-complexity operator for the computation of the arctangent of a complex number,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25(9), 26632667 (2017).CrossRefGoogle Scholar
Zhou, J., Dietrich, M., Walden, P., Kolb, J. and Doppelbauer, M., “The Resolution of atan2-Function,” 2020 IEEE Sensors (2020) pp. 14.Google Scholar
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N. M. O., Erez, T., Tassa, Y., Silver, D. and Wierstra, D., Continuous control with deep reinforcement learning, CoRR, vol. abs/1509.02971 (2016).Google Scholar
Zhang, M., Zhang, Y., Gao, Z. and He, X., “An improved ddpg and its application based on the double-layer bp neural network,” IEEE Access 8, 177734177744 (2020).CrossRefGoogle Scholar
Liu, D., Wang, W., Wang, L., Jia, H. and Shi, M., “Dynamic pricing strategy of electric vehicle aggregators based on ddpg reinforcement learning algorithm,” IEEE Access 9, 2155621566 (2021).CrossRefGoogle Scholar
Watkins, C. and Dayan, P., “Q-learning,” Mach. Learn. 8(3-4), 279292 (5 1992).CrossRefGoogle Scholar
Figure 0

Figure 1. Geometric relationships of the DDMR: notations $\{R\}$ and $\{I\}$ are the moving and fixed frames, respectively. The positional discrepancy of the DDMR concerning the trajectory is quantified by the distance error $e_d$ and heading error $e_h$. The DDMR’s X-axis denotes the forward direction of its motion.

Figure 1

Figure 2. Overall control diagram: the blocks outside the RL agent represent the RL environment, providing both rewards and states to the agent. The RL agent produces right and left motor voltages $(V_u, V_v)$ as actions to induce a transition of the DDMR to a new state. Subsequently, the state is compared to the prescribed trajectory, forming the feedback loop. The RL agent also functions as a fuzzy inference engine, specifically optimized to mitigate tracking errors.

Figure 2

Figure 3. The designed fuzzy membership functions $(\mu _d, \mu _\theta )$ for $e_d$ and $e_\theta$: the membership functions for $e_d$ encompass linguistic terms such as Zero (Z), Medium (M), and Far (F). For $e_\theta$, the linguistic terms include Negative Big (NB), Negative (N), Zero (Z), Positive (P), and Positive Big (PB).

Figure 3

Figure 4. Diagram of the actor-critic neural networks: the actor-network, denoted as $\mu _D(s\mid \theta ^\mu )$, is in charge of approximating the deterministic policy function and subsequently generating actions based on the provided state. Simultaneously, the critic-network, represented as $Q(s,a\mid \theta ^Q)$, approximates the action-value function. This network architecture takes the current and preceding firing degree outputs of the fuzzy membership function, expressed as $[f_{d(t-1)}, f_{h(t-1)}, f_{d(t)}, f_{h(t)}] \in \mathbb{R}^{16 \times 1}$.

Figure 4

Table I. The used system parameters of the DDMR during simulations.

Figure 5

Figure 5. Comparison of average cumulative training phase rewards over 20 episodes: this representation illustrates that the proposed FRLC exhibits accelerated learning. Throughout the training phase, the proposed FRLC accumulates $8.5 \times 10^6$ reward points surpassing the classical DDPG model’s accumulation of $6.9 \times 10^6$ reward points.

Figure 6

Figure 6. Comparative responses of point-to-point tracking: this response illustrates the tracking performance of both the proposed FRLC and the classical DDPG model during the training phase. While the DDMR exhibits superior heading control, the proposed FRLC showcases an exceptional ability to accurately halt at the goal point, consequently maximizing reward accumulation.

Figure 7

Figure 7. Comparative tracking error ($e_d, e_\theta$) responses: this figure presents a comparative analysis of tracking error ($e_d, e_\theta$) responses between the proposed FRLC and the classical DDPG model. While the spatial plot in Fig. 6 depicts a more direct path from classical DDPG, it is observed that the convergence speed in $e_\theta$ for classical DDPG is only slightly better than that of FRLC. The response in $e_d$ further indicates that FRLC approaches the goal at a higher speed. The wider curve in the FRLC response in Fig. 6 can be attributed to the combined effects of turning with a faster approaching speed.

Figure 8

Figure 8. Comparative actuator voltage ($V_u, V_v$) responses in the proposed FRLC and classical DDPG: in classical DDPG, a distinctive method of maintaining a straight path is observed and achieved through alternating voltages (as seen during $0\lt t\lt 3$ s). However, the FRLC exhibits a superior strategy characterized by a quicker tracking error response as demonstrated in Fig. 6. Additionally, after $t\gt 3$ s, the FRLC employs a unique strategy by completely regulating the voltage inputs to zero, exemplifying a halting approach as indicated in Fig. 6.

Figure 9

Figure 9. Response comparisons of motor torques ($\tau _u, \tau _v$) between the proposed FRLC and classical DDPG: observations from both methodologies reveal an initial predominance of right motor torque, indicative of left-turning acceleration in the DDMR. At $t \gt 2$ s, both motor torques in the FRLC approach to zero. Conversely, a slightly higher value in the right motor torque persists at $t \gt 3$ s in the classical DDPG, sustaining the right-turning motion of the DDMR.

Figure 10

Figure 10. Varying slip ratios ($S_{ru}, S_{rv}$) for both left and right wheels: the proposed FRLC, which has undergone training to track a goal point rather than a trajectory, is subjected to the additional challenge of varying slip ratios in the wheel dynamics to assess its robustness. These scenarios present novel conditions for the RL agent unencountered during its training phase.

Figure 11

Figure 11. Simulation results for tracking a circular trajectory: the DDMR is tasked to follow this trajectory from varying initial conditions. Notably, the DDMR’s path closely mirrors the intended trajectory, signifying a noteworthy degree of accuracy.

Figure 12

Figure 12. Tracking error responses $(e_d, e_\theta )$ of the circular trajectory in the simulation: the accuracy of the proposed FRLC extends to the temporal domain, with steady-state errors confined to $e_d \lt 0.05$ m while $t\gt 7.6$ s and $e_\theta \lt 12^\circ$ after $t\gt 11$ s.

Figure 13

Figure 13. Voltage responses $(V_u, V_v)$ during the tracking of the desired circular trajectory in the simulation: the trained FRLC approach effectively maintains an excellent tracking accuracy by adapting the voltage to overcome variations in slip ratio, particularly at $t=30$ and $t=55$ s.

Figure 14

Figure 14. Torque responses ($\tau _u, \tau _v$) during the tracking simulation of the circular trajectory: the torque is representative of the difference in motor voltages. As the voltages settle and their magnitude approaches to a steady state, the torque response also approaches to zero.

Figure 15

Figure 15. Simulation results for Lemniscate of Gerono trajectory tracking: the Lemniscate of Gerono exhibits greater variability in curve sharpness and turning direction compared to the circular trajectory. Nonetheless, the DDMR showcases a high level of proficiency in navigating this trajectory starting from two initial cases.

Figure 16

Figure 16. Experimental configuration: the employed experimental platform is the ADLINK NeuronBot, characterized by its cubic form and black aesthetic. Illustrated here is the circular trajectory, featuring a radius of $0.8$ m.

Figure 17

Figure 17. Experimental results for circular trajectory tracking: the experiment replicates the initial conditions from the simulation. The traced path of the DDMR is visualized on a pre-established occupancy grid map of the room. These findings affirm the DDMR’s ability to track the prescribed circular trajectory based on the proposed FRLC.

Figure 18

Figure 18. Experimental results for Lemniscate of Gerono trajectory tracking: the successful navigation of both circular and Lemniscate of Gerono trajectories by the DDMR across diverse initial conditions is indicative to the effectiveness of the learned trajectory tracking policy facilitated by the proposed FRLC.