
Reinforcement-learning-based actuator selection method for active flow control

Published online by Cambridge University Press:  12 January 2023

Romain Paris*
Affiliation:
DAAA, ONERA, Université Paris Saclay, F-92190 Meudon, France
Samir Beneddine
Affiliation:
DAAA, ONERA, Université Paris Saclay, F-92190 Meudon, France
Julien Dandois
Affiliation:
DAAA, ONERA, Université Paris Saclay, F-92190 Meudon, France
*
Email address for correspondence: [email protected]

Abstract

This paper addresses the issue of actuator selection for active flow control by proposing a novel method built on top of a reinforcement learning agent. Starting from a pre-trained agent using numerous actuators, the algorithm estimates the impact of a potential actuator removal on the value function, which quantifies the agent's performance. It is applied to two test cases, the one-dimensional Kuramoto–Sivashinsky equation and a laminar bidimensional flow around an airfoil at $Re=1000$ for different angles of attack ranging from $12^{\circ }$ to $20^{\circ }$, to demonstrate its capabilities and limits. The proposed actuator-sparsification method relies on a sequential elimination of the least relevant action components, starting from a fully developed layout. The relevancy of each component is evaluated using metrics based on the value function. Results show that, although limited by this intrinsic elimination paradigm (i.e. the sequential elimination), the obtained actuator patterns and policies perform well and allow us to draw an accurate approximation of the Pareto front of performance versus actuator budget.

Type
JFM Papers
Copyright
© The Author(s), 2023. Published by Cambridge University Press

1. Introduction

Flow control is one of the most widely used approaches to improve the aero/hydrodynamic qualities of vehicles. It encompasses a spectrum of methods ranging from passive control to closed-loop nonlinear control. Passive control, through shape optimisation and/or the use of riblets, vortex generators, etc., has long been a go-to strategy thanks to its simplicity and robustness (Bruneau & Mortazavi Reference Bruneau and Mortazavi2008; Seshagiri, Cooper & Traub Reference Seshagiri, Cooper and Traub2009; Joubert et al. Reference Joubert, Le Pape, Heine and Huberson2013; Evans et al. Reference Evans, Hamed, Gorumlu, Doosttalab, Aksak, Chamorro and Castillo2018). Yet, better performance may often be reached through active control. The simplest approaches are open-loop strategies, where the control law is predetermined irrespective of the state and response of the flow. While they have proved able to reach control goals in numerous cases (Seifert et al. Reference Seifert, Bachar, Koss, Shepshelovich and Wygnanski1993; Seifert & Pack Reference Seifert and Pack1999), their generally high energetic cost and their inability to adapt to the response of the environment and to correct control errors argue for closed-loop methods.

Closed-loop control design is nowadays a well-documented and studied domain. Most synthesis methods rely on a linear model of the controlled system to optimise a feedback control law under criteria of stability, robustness, tracking-error minimisation and/or disturbance rejection. The underlying linearity assumption is the main obstacle to transposing such design strategies to flow control. In particular, the system dimensionality, nonlinear response and simulation cost are issues specific to fluid mechanics. A wide literature has proposed methods adapted to these constraints for designing closed-loop control laws, as highlighted by Brunton & Noack (Reference Brunton and Noack2015). Among recent related work, one may cite Seidel, Fagley & McLaughlin (Reference Seidel, Fagley and McLaughlin2018), who developed a multi-step method to design controllers, using open-loop responses to forcings and building a reduced-order model of the system.

Flow control aims to mitigate or suppress the detrimental effects of some dynamical mechanisms. Most of these well-documented issues concern noise emission, flow separation (Zaman, McKinzie & Rumsey Reference Zaman, McKinzie and Rumsey1989; Gupta & Ansell Reference Gupta and Ansell2019) causing a drop in performance (increased drag and/or decreased lift), or alternating loads inducing critical structural fatigue. Airfoil flow separation control (Wu et al. Reference Wu, Lu, Denny, Fan and Wu1998) has been a matter of interest for a few decades already, with studies using a wide variety of control methods at various flow and stall regimes (Seifert, Darabi & Wyganski Reference Seifert, Darabi and Wyganski1996; Amitay & Glezer Reference Amitay and Glezer2002; Shimomura et al. Reference Shimomura, Ogawa, Sekimoto, Nonomura, Oyama, Fujii and Nishida2017; Yeh & Taira Reference Yeh and Taira2019).

In the last few years, machine learning, and deep learning in particular, has demonstrated remarkable performance in a wide variety of fields such as natural language processing (Otter, Medina & Kalita Reference Otter, Medina and Kalita2020), image recognition (Ghosh et al. Reference Ghosh, Das, Das and Maulik2019) or robotics (Ibarz et al. Reference Ibarz, Tan, Finn, Kalakrishnan, Pastor and Levine2021). This progress is mainly due to the rise in accessible computing power and the use of neural networks (NNs), which act as quasi-universal function approximators that can be optimised fairly easily. This explains the flourishing literature investigating the potential of such techniques in fluid mechanics, as highlighted by Brunton, Noack & Koumoutsakos (Reference Brunton, Noack and Koumoutsakos2020) and Vinuesa & Brunton (Reference Vinuesa and Brunton2021).

Mirroring computer vision and image classification methods, multiple studies such as Zhang, Sung & Mavris (Reference Zhang, Sung and Mavris2018), Hui et al. (Reference Hui, Bai, Wang and Zhang2020), Bhatnagar et al. (Reference Bhatnagar, Afshar, Pan, Duraisamy and Kaushik2019) and Sekar et al. (Reference Sekar, Zhang, Shu and Khoo2019) aim at deriving aerodynamic coefficients, predicting flow fields or airfoil shapes using convolutional neural networks in a supervised manner. These studies take advantage of the NN's interpolation ability, optimising the bias-variance trade-off and surpassing the performance of most response-surface-based methods such as kriging. This enables faster design cycles, relying on these intermediate-fidelity models. Linear decomposition methods such as proper orthogonal decomposition (POD) or dynamic mode decomposition and their variations are increasingly challenged by nonlinear reduced-order modelling tools that demonstrate their capacity to compress state information further without losing precision (Wang et al. Reference Wang, Xiao, Fang, Govindan, Pain and Guo2018; Lee & Carlberg Reference Lee and Carlberg2020; Fukami et al. Reference Fukami, Hasegawa, Nakamura, Morimoto and Fukagata2021; Kneer et al. Reference Kneer, Sayadi, Sipp, Schmid and Rigas2021; Pichi et al. Reference Pichi, Ballarin, Rozza and Hesthaven2021). Some studies, such as Lusch, Kutz & Brunton (Reference Lusch, Kutz and Brunton2018), even integrate a reduced-order dynamical model, where predefined constraints (such as linearity and sparsity) help understand full-state dynamics. Recurrent structures such as long short-term memory cells provide a convenient way to integrate model dynamics (Mohan & Gaitonde Reference Mohan and Gaitonde2018; Hasegawa et al. Reference Hasegawa, Fukami, Murata and Fukagata2020). Others leverage the generative capacities of generative adversarial networks and (variational) auto-encoders to generate boundary-layer velocity profiles or realistic homogeneous turbulence for direct numerical simulations (Milano & Koumoutsakos Reference Milano and Koumoutsakos2002; Kochkov et al. Reference Kochkov, Smith, Alieva, Wang, Brenner and Hoyer2021), or conversely to model turbulence (Duraisamy, Iaccarino & Xiao Reference Duraisamy, Iaccarino and Xiao2019). Yet these neural structures display poor extrapolation capabilities. Some authors try to mitigate this issue by embedding some key characteristics of fluid mechanics such as mass conservation or invariances in the models (Wang, Wu & Xiao Reference Wang, Wu and Xiao2017; Djeumou et al. Reference Djeumou, Neary, Goubault, Putot and Topcu2022; Wong et al. Reference Wong, Ooi, Gupta and Ong2022).

Flow control also follows this trend of increasing integration of advanced machine learning methods, notably leveraging the paradigm of reinforcement learning (RL), based on the idea of trial and error. Reinforcement learning consists in evaluating a given control law (also called a policy) on a target system (referred to as the environment). The evaluation data are then used to tweak the control law to maximise a given performance metric, with the underlying idea of promoting actions or strategies that are beneficial with respect to this metric. In the case of deep reinforcement learning (DRL), the policy is embodied by a NN structure tasked with providing a control action given an observed measurement. Exploring new control strategies plays a decisive role in the performance of such methods. A flourishing literature (Rabault et al. Reference Rabault, Ren, Zhang, Tang and Xu2020; Garnier et al. Reference Garnier, Viquerat, Rabault, Larcher, Kuhnle and Hachem2021; Vona & Lauga Reference Vona and Lauga2021; Zhang et al. Reference Zhang, Wang, Wang and Wang2021; Li & Zhang Reference Li and Zhang2022; Mao, Zhong & Yin Reference Mao, Zhong and Yin2022) implements these methods and demonstrates their significant potential for flow control. Applications range from the control of a Rayleigh–Bénard convection cell (Beintema et al. Reference Beintema, Corbetta, Biferale and Toschi2020) to destabilised one-dimensional (1-D) liquid films (Belus et al. Reference Belus, Rabault, Viquerat, Che, Hachem and Reglade2019) and even to transitional flows, which are of crucial interest for the control of turbulence; for instance, Ren, Rabault & Tang (Reference Ren, Rabault and Tang2021) successfully implemented RL-trained flow control on a cylinder wake.

Q-learning, one of the two main families of DRL, consists in implicitly encoding the policy in a Q-network that provides the value of every state–action pair, the policy being the action that maximises this value for each given state. Deep reinforcement learning was implemented on a cylinder flow by Koizumi, Tsutsumi & Shima (Reference Koizumi, Tsutsumi and Shima2018), using a Q-learning algorithm to learn how to stabilise the cylinder wake efficiently. Bucci et al. (Reference Bucci, Semeraro, Allauzen, Wisniewski, Cordier and Mathelin2019) used deep deterministic policy gradient (DDPG) to steer the Kuramoto–Sivashinsky equation to its fixed points, and Shimomura et al. (Reference Shimomura, Sekimoto, Oyama, Fujii and Nishida2020) used deep Q-networks to optimise the frequency of plasma actuator bursts to control airfoil flow separation. These methods rely on a self-consistency equation for their optimisation and, thus, display a well-known tendency to be unstable. Most policy optimisation algorithms, the other main family of DRL training methods, derive from proximal policy optimisation (PPO) introduced by Schulman et al. (Reference Schulman, Wolski, Dhariwal, Radford and Klimov2017). This algorithm directly implements the idea of promoting successful control actions and avoiding detrimental ones. Proximal policy optimisation was used by Rabault et al. (Reference Rabault, Kuchta, Jensen, Réglade and Cerardi2019) to control a low-Reynolds-number cylinder wake using surface-mounted jets. Wang et al. (Reference Wang, Mei, Aubry, Chen, Wu and Wu2022) also resorted to PPO to control a confined bidimensional airfoil flow at low Reynolds number, using three suction-side synthetic jets to reduce drag. In this paper, a variant named proximal policy optimisation with covariance matrix adaptation (PPO-CMA) (Hämäläinen et al. Reference Hämäläinen, Babadi, Ma and Lehtinen2020) is used as the base learning algorithm.

Fluid mechanics contrasts with classical artificial intelligence application domains by its sample cost, dimensionality and nonlinearity, as mentioned earlier. These factors are characteristic of the difficulties encountered when scaling up most machine learning methods to real-world applications. Experimental application is an intermediate step between simple numerically simulated cases and pre-industrial prototypes. Yet, for such data-greedy methods and for specific cases, experiments may be less expensive per sample than numerical simulation, since they tend to have high fixed costs but low marginal costs. While, for numerical simulation, state measurements and forcing actions are only limited by the stability of numerical schemes, experiments require sensors to be non-intrusive and actuators to comply with both physical integration restrictions and intrinsic performance limits, for instance, in terms of frequency bandwidth or power.

In this context, the issue of sensor and actuator location or selection becomes critical. The need for an efficient yet parsimonious control set-up motivates all the following methods. In their multi-step heuristic approach to closed-loop flow control design, Seidel et al. (Reference Seidel, Fagley and McLaughlin2018) propose to rely on the correlation between observed relevant phenomena and sensor signals to choose observations, and on clues provided by the POD mode decomposition of the flow instabilities to choose actuator locations. Cohen, Siegel & McLaughlin (Reference Cohen, Siegel and McLaughlin2006) and Willcox (Reference Willcox2006) also leverage POD analyses to derive optimised sensor placement.

Multiple studies rely on adjoint sensitivity analysis (Chomaz Reference Chomaz2005) and on the ‘wavemaker’, the overlap between the direct and adjoint sensitivity modes introduced by Giannetti & Luchini (Reference Giannetti and Luchini2007), to derive appropriate sensor and actuator placements. Li & Zhang (Reference Li and Zhang2022) used a computation of the wavemaker on a confined cylinder flow to place the sensors used as feedback for a reinforcement-learning-trained policy. Natarajan, Freund & Bodony (Reference Natarajan, Freund and Bodony2016) resorted to the wavemaker to optimally locate both sensors and actuators in a diffuser flow.

Sashittal & Bodony (Reference Sashittal and Bodony2021) applied a related method to a data-driven, linearised model of their systems to position sensors. They used this method to control both the linearised complex Ginzburg–Landau equation and the flow around an inclined flat plate. Linear-quadratic-Gaussian control on balanced truncated reduced models has been employed to derive optimal sensor and actuator placement using gradient descent methods (Manohar, Kutz & Brunton Reference Manohar, Kutz and Brunton2021; Yao, Sun & Hemati Reference Yao, Sun and Hemati2022). Modelling the control system as a linear plant is also used by Bhattacharjee et al. (Reference Bhattacharjee, Hemati, Klose and Jacobs2018), who took advantage of the eigensystem realisation algorithm to compare the controllability (in a $\mathcal {H}_2$ framework) of multiple jet actuators placed on the suction side of an airfoil, in order to select the best one depending on the performance criterion (lift, or the angle of attack at which the flow separates).

Oehler & Illingworth (Reference Oehler and Illingworth2018) and Jin, Illingworth & Sandberg (Reference Jin, Illingworth and Sandberg2022) used both optimal estimation and full-state information control to derive optimal sensor and actuator placement for the linearised complex Ginzburg–Landau system and a low-Reynolds-number cylinder wake, respectively. Yeh & Taira (Reference Yeh and Taira2019) also employed resolvent-based analyses to discover the optimal forcing variable and location for an actuator aiming at preventing airfoil flow separation. Luhar, Sharma & McKeon (Reference Luhar, Sharma and McKeon2014) leveraged resolvent analyses to assess the potential of opposition control for drag reduction and turbulence control.

Among nonlinear actuator methods, one may cite the study of Rogers (Reference Rogers2000), who derived a set of actuator layouts on a stealth bomber to satisfy manoeuvrability goals using a genetic algorithm. Paris, Beneddine & Dandois (Reference Paris, Beneddine and Dandois2021) proposed a sensor selection algorithm using stochastic gates and leveraging the RL paradigm to filter out sensor measurement while preserving performance as much as possible.

Most of the previously cited actuator location/selection methods rely on linear analyses. In the current study we introduce a reinforcement-learning-based method, leveraging some ideas previously introduced in Paris et al. (Reference Paris, Beneddine and Dandois2021), but this time to select actuators instead of sensors. The present paper provides an in-depth study of a new sparsity-seeking algorithm for actuators, a question that, to our knowledge, has never been addressed in the context of flow control by RL. By selecting two test cases, each challenging in its own way, the paper gives a critical analysis of the algorithm. This allows a more general discussion about the difficulties encountered when one tries to use RL for automatic actuator placement. Therefore, the goal of the present paper is twofold: introducing the first RL algorithm that tackles the essential question of actuator placement for flow control, and identifying future challenges and research directions to pave the way for upcoming work on the topic. Before describing the proposed action-sparsity-seeking algorithm, the RL method used as its base is introduced in § 2. Both test cases are then described and the results of the method are discussed in §§ 3 and 4. The conclusion discusses the limitations of the proposed sparsity-promoting method.

2. An action-sparsity-seeking algorithm

2.1. The generic RL loop

This study relies on the derivation of closed-loop control laws (called policies) trained by RL. As described in figure 1, the environment encloses the experimental set-up or the numerical simulation and provides partial state observations $s_t$ and a reward $r_t$, quantifying the instantaneous fitness of the current state with respect to a predefined performance metric. The environment can receive forcing actions $a_t$ that modify its behaviour when marched forward in time: $s_{t+1}\sim T(\ {\cdot}\ |s_t,a_t)$. The agent has two roles. It provides control actions $a_t$ based on observations $s_t$ following its policy ${\rm \pi}$: $a_t\sim {\rm \pi}(\ {\cdot }\ |s_t)$. It is also tasked with implementing the training algorithm, which uses the collected data samples $(s_t,a_t,r_t)$ to tune the parameters of the policy ${\rm \pi}$ in order to maximise the expected return $R_t=\mathbb {E}_{s_\tau \sim T,a_\tau \sim {\rm \pi}}\left [\sum _{\tau =t}^{\infty }\gamma ^{\tau -t}r_\tau \right ]$ for every partial state $s_t$, where $\gamma \in [0,1]$ is a scalar discount factor. It quantifies the preference for immediate rewards over future rewards, and is typically set to values slightly below 1.
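For illustration, this interaction loop and the computation of the discounted return can be sketched as follows in Python; the `env.reset`/`env.step` and `agent.act` interfaces are hypothetical names used only for this sketch and do not correspond to the actual coupling with the solvers used later.

```python
import numpy as np

def run_episode(env, agent, n_steps, gamma=0.99):
    """Collect one roll-out (s_t, a_t, r_t) and compute the discounted returns.

    `env.reset()`, `env.step(a)` and `agent.act(s)` are assumed interfaces
    (hypothetical names), not the ones of the solvers used in this work.
    """
    s = env.reset()
    states, actions, rewards = [], [], []
    for _ in range(n_steps):
        a = agent.act(s)           # a_t ~ pi(. | s_t)
        s_next, r = env.step(a)    # s_{t+1} ~ T(. | s_t, a_t), reward r_t
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    # R_t = sum_{tau >= t} gamma^(tau - t) r_tau, accumulated backwards in time
    returns = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return states, actions, rewards, returns
```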

Figure 1. The RL feedback loop.

2.2. Proximal policy optimisation with covariance matrix adaptation

The RL algorithm used here is on-policy. This means that, at any time in the training, the data used for optimisation have been provided by the previous control episode (also called a roll-out) using the current policy ${\rm \pi}$ only. Thus, the training can be split into epochs, each made of one or several episodes (data collection phase, evaluation of the current policy) and an optimisation phase using the collected data.

Proximal policy optimisation is one of the state-of-the-art on-policy approaches, proposed by Schulman et al. (Reference Schulman, Wolski, Dhariwal, Radford and Klimov2017). It relies on an actor–critic structure, where the agent contains two NNs: an actor (${\rm \pi}$) and a critic ($V$). The actor provides the control action $\boldsymbol{a}_t$ using $\boldsymbol{s}_t$ as an input. This action is sampled stochastically from a normal distribution $\boldsymbol{a}_t\sim {\rm \pi}_\theta (\ {\cdot }\ |s_t) = \mathcal {N}(\boldsymbol{\mu} _\theta (s_t),\sigma )$, where $\theta$ are the trainable parameters (weights and biases) of the actor NN, $\boldsymbol{\mu} _\theta$ is the deterministic output of ${\rm \pi}$ (the optimal action to implement according to the actor) and $\sigma$ is a predefined standard deviation. This random sampling of the control action helps to solve the exploration–exploitation dilemma efficiently for high-dimensional and continuous state-action spaces. This strategy enables us to explore only the vicinity of the deterministic (and best) trajectory in the state-action space, and $\sigma$ is simply set to zero during policy testing. The critic is trained to output an estimate $V_\phi (s_t)$ of the value $V^{\rm \pi} (s_t)$ of the current observed partial state, where $\phi$ are the parameters of the critic NN. This value is the expected return $R_t=\mathbb {E}_{s_\tau \sim T,a_\tau \sim {\rm \pi}}\left [\sum _{\tau =t}^{\infty }\gamma ^{\tau -t} r_\tau \right ]$ previously introduced, i.e. the expected discounted sum of rewards under the current policy ${\rm \pi}$. Here $V_\phi$ is trained in a supervised fashion using the observed returns of the current episode as a target; ${\rm \pi}$ is trained to maximise the likelihood of successful actions (those that provide an increased return) and to reject poorly valued ones. More details can be found in the original article (Schulman et al. Reference Schulman, Wolski, Dhariwal, Radford and Klimov2017).

The standard PPO algorithm uses a clipping mechanism to prevent excessively large policy updates, which ‘can prematurely shrink the exploration variance’ according to Hämäläinen et al. (Reference Hämäläinen, Babadi, Ma and Lehtinen2020). To address this issue, these authors introduced PPO-CMA as a variant of PPO. The main difference with the standard version of PPO is that $\sigma$ becomes a trained output of ${\rm \pi}$ alongside $\mu _\theta$. The training of $\sigma$ is performed using buffered data from the last episodes, which is a slight departure from the on-policy paradigm. This enables a dynamic exploration, large at the beginning of training and smaller at convergence. In this study, the covariance matrix $\sigma$ is diagonal; thus, $\sigma$ is cast to a vector of length $n_{act}$ in the following. The structure of PPO-CMA is summarised in figure 2. The neural networks’ architecture and most of the hyperparameters of the learning algorithm are summed up in table 4 and kept fixed (unless explicitly stated). Their values have been set by trial and error; the optimality of this choice of hyperparameters is not guaranteed and is beyond the scope of the current study.
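For illustration, the actor (outputting $\boldsymbol{\mu} _\theta$ and $\log \boldsymbol{\sigma}$) and the critic $V_\phi$ may be sketched in PyTorch as below. Layer sizes and class names are placeholders rather than the architectures of table 4, and the PPO-CMA update rules themselves (clipped policy-gradient step and CMA-style variance adaptation) are not reproduced here.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Actor with a state-dependent mean and log-std (diagonal covariance),
    sigma being a trained output alongside mu as in PPO-CMA. Layer sizes
    are illustrative placeholders."""
    def __init__(self, n_obs, n_act, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_obs, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, n_act)
        self.log_sigma_head = nn.Linear(hidden, n_act)

    def forward(self, s):
        h = self.body(s)
        return self.mu_head(h), self.log_sigma_head(h)

    def sample(self, s):
        mu, log_sigma = self.forward(s)
        dist = torch.distributions.Normal(mu, log_sigma.exp())
        a = dist.sample()                      # stochastic action used for exploration
        return a, dist.log_prob(a).sum(-1)     # log-likelihood needed by the policy update

class Critic(nn.Module):
    """Value network V_phi(s), trained by regression on the observed returns."""
    def __init__(self, n_obs, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obs, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)
```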

Figure 2. The PPO-CMA agent structure: compared with PPO, the actor has one extra output $\sigma$ that allows for a dynamic adaptation of the exploration.

2.3. A stochastic gating mechanism to sparsify actions

In this part the proposed action-sparsity-seeking mechanism is introduced. The proposed structure aims at learning a binary mask $\underline {\boldsymbol{p}}\in \{0,1\}^{n_{act}}$, reducing the number of non-null action components to a prescribed amount. Let $\underline {\boldsymbol{a}}_{dense} \sim {\rm \pi}= \mathcal {N}(\underline {\boldsymbol{\mu} }_{dense},\underline {\boldsymbol{\sigma} }_{dense})$ be the stochastic action provided by the actor. As illustrated in figure 3, the gated action $\underline {\boldsymbol{a}}_{sparse}$ is defined so that

(2.1)\begin{equation} \underline{\boldsymbol{a}}_{sparse} =\underline{\boldsymbol{p}}\odot\underline{\boldsymbol{a}}_{dense}+(1-\underline{\boldsymbol{p}})\odot\underline{\bar{\boldsymbol{a}}}, \end{equation}

with $\odot$ the element-wise (Hadamard) product and $\underline {\bar {\boldsymbol{a}}}$ a vector of substitution values. In our case $\underline {\bar {\boldsymbol{a}}}$ is the null vector (the action is simply clipped to zero); thus, the previous relation reduces to

(2.2)\begin{equation} \underline{\boldsymbol{a}}_{sparse}=\underline{\boldsymbol{p}}\odot\underline{\boldsymbol{a}}_{dense}. \end{equation}

When $p_i=0$, the $i$th component $a_{dense,i}$ is always filtered out by the gate layer ($a_{sparse,i}=0$). Conversely, when $p_i=1$, the probability of filtering out the action is null. During the sparsity-seeking phase of training, the gate-opening probabilities $\underline {\alpha }$ take continuous values in $[0,1]^{n_{act}}$, so that the mask $\underline {p}$ remains stochastic. At convergence and for evaluation, this learned mask $\underline {p}$ is constant and contains only binary values. As values of $\underline {\mu }_{dense}$ and $\underline {\sigma }_{dense}$ need to be provided for the optimisation of the actor, clipping is also applied to these vectors, by ‘replaying’ the same values of $\underline {p}$ as for the sparse $\underline {a}_{sparse}$ in the following way:

(2.3)\begin{equation} \left.\begin{gathered} \underline{\boldsymbol{\mu}}_{sparse} =\underline{\boldsymbol{p}}\odot\underline{\boldsymbol{\mu}}_{dense}, \\ \log\underline{\boldsymbol{\sigma}}_{sparse} =\underline{\boldsymbol{p}}\odot\log\underline{\boldsymbol{\sigma}}_{dense}+(1-\underline{\boldsymbol{p}}) \left(-\tfrac{1}{2}\log(2{\rm \pi})\right). \end{gathered}\right\} \end{equation}

This clipping of $\log \underline {\boldsymbol{\sigma} }$ is defined to keep the log likelihood of an action independent of the number of clipped components (for which the value is deterministically set to $0$).
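As an illustration, (2.1)–(2.3) amount to simple element-wise operations once the mask $\underline {\boldsymbol{p}}$ is known, as in the following NumPy sketch (variable names are illustrative):

```python
import numpy as np

HALF_LOG_2PI = 0.5 * np.log(2.0 * np.pi)

def gate(a_dense, mu_dense, log_sigma_dense, p):
    """Apply the (sampled or replayed) mask p, i.e. eqs (2.1)-(2.3) with a_bar = 0.

    The clipped log-sigma is set to -0.5*log(2*pi) so that a clipped component
    contributes exactly zero to the Gaussian log-likelihood of the action.
    """
    a_sparse = p * a_dense
    mu_sparse = p * mu_dense
    log_sigma_sparse = p * log_sigma_dense + (1.0 - p) * (-HALF_LOG_2PI)
    return a_sparse, mu_sparse, log_sigma_sparse
```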

Figure 3. Structure of the stochastic gating layer (SGL) used to filter the actor output. Here $\mu _d$ and $\mu _s$ respectively stand for $\mu _{dense}$ and $\mu _{sparse}$.

During training, the value of $\underline {p}$ is either sampled from a Bernoulli distribution or ‘replayed’ for optimisation, using a structure similar to the stochastic gating layer (SGL) proposed by Louizos, Welling & Kingma (Reference Louizos, Welling and Kingma2018), which was previously used by Paris et al. (Reference Paris, Beneddine and Dandois2021) to reduce the number of observations for flow control by RL. It is used here in a simplified version: $\underline {p}$ is derived as a vector of Bernoulli trials,

(2.4)\begin{equation} p_i= f(\alpha_i,u_i) = \begin{cases} 1, & \text{if } u_i\leq \alpha_i\\ 0, & \text{otherwise} \end{cases} \end{equation}

where $\underline {u}\in [0,1]^{n_{act}}$ is a random vector and $\underline {\alpha }$ is a trainable vector that sets the probability for the gate to be open. The sampling of $\underline {u}$ depends on the training phase as summarised in table 1. During the sampling phase, the values $\underline {u}_{samp}$ are sampled from a uniform distribution and stored in the training buffer $\mathcal {B}$ alongside observations, actions and rewards. These values are then used during the optimisation phase to replay the state $\underline {p}$ of the gate and back-propagate faithful gradients.
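A minimal sketch of this sampling/replay mechanism is given below; the buffering of $\underline {u}$ alongside $(s_t,a_t,r_t)$ follows table 1, while the function interface itself is an illustrative assumption.

```python
import numpy as np

def sample_gate(alpha, u=None, rng=None):
    """Sample the gate state p of eq. (2.4), or replay it from a stored u."""
    if u is None:                        # sampling phase: draw u and store it in the buffer
        rng = rng or np.random.default_rng()
        u = rng.uniform(size=alpha.shape)
    p = (u <= alpha).astype(float)       # Bernoulli trial: gate i open with probability alpha_i
    return p, u

# usage: sample during the roll-out, store u, replay the exact same mask during optimisation
alpha = np.array([1.0, 0.7, 0.0, 1.0])
p, u = sample_gate(alpha)                 # stored in the buffer alongside (s_t, a_t, r_t)
p_replayed, _ = sample_gate(alpha, u=u)   # identical gate state during the update
assert np.allclose(p, p_replayed)
```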

Table 1. Values of $\underline {u}$ and $\underline {p}$ during the different phases of the training process.

2.4. Sequential elimination process and ranking metric

The actor–critic structure is first trained in a standard fashion, with $\alpha _i=1$ values guaranteeing fully open gates. Once this first phase is completed and the desired control performance is reached, the sparsity-seeking phase starts. As shown in figure 4(a), the main goal is to close a prescribed number of gates by tuning $\underline {\alpha }$. The update of $\underline {\alpha }$ is performed one component at a time, in a sequential fashion, gradually tuning $\alpha _i$ from 1 to 0, which corresponds to the elimination of the $i$th action component. The choice of the eliminated action is based on a Polyak-averaged metric (denoted as $m$) computed on each active component, the worst-ranked component being chosen to be eliminated next. Figure 4 illustrates this sequential elimination process, which iterates over a three-phase loop as follows.

  (i) A first standard training phase where the policy is trained with all its active action components.

  (ii) A second phase where the ranking metric $m$ is estimated.

  (iii) A third phase consisting of the action clipping itself.

Figure 4. (a) Schematics of the proposed elimination process. Gates’ opening probabilities are represented as blue squares in the SGL, their colour saturation indicating the opening probability. (b) Illustration of the proposed sequential elimination process. The algorithm iterates over this loop until the breaking condition (i.e. the required number of eliminated actuators) is met.

The algorithm keeps iterating over this loop until the prescribed number of active components is met, as illustrated in figure 4(b). In this paper the ranking metric is based on a ‘what-if’ analysis, where the effect of clipping an action component on the overall performance is estimated by computing the expected clipped value function $V^i$ over the system's dynamics,

(2.5)\begin{equation} m_i = \mathbb{E}_{s\sim T,a\sim{\rm \pi}^i}\left[V^i(s)\right]\quad \forall\ i\in\{0,\ldots,n_{act}-1\}, \end{equation}

where ${\rm \pi} ^i$ denotes the actor with its $i$th component clipped, i.e. ${\rm \pi} ^i_j = (1-\delta _{i,j}){\rm \pi} _j$. Once the component to be eliminated is chosen, the corresponding $\alpha _i$ is slowly tuned down over a predefined number of training epochs, to let the agent adapt to the elimination before fine tuning the agent on the new action layout.
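Schematically, the loop may be summarised by the following Python sketch; `train_phase`, `estimate_metric` and `eliminate` are hypothetical callables standing for the three phases above (the stability criteria and Polyak averaging discussed in § 2.5 are omitted), not the actual routines used in this work.

```python
def sparsify(n_act, n_target, train_phase, estimate_metric, eliminate):
    """One-by-one elimination loop (schematic sketch).

    train_phase(active)        -- phase (i): (re)train the agent on the current layout
    estimate_metric(i, active) -- phase (ii): return m_i = E[V^i(s)] estimated from
                                  roll-outs performed with component i clipped
    eliminate(i)               -- phase (iii): anneal alpha_i from 1 down to 0
    """
    active = set(range(n_act))
    while len(active) > n_target:
        train_phase(active)                                    # phase (i)
        m = {i: estimate_metric(i, active) for i in active}    # phase (ii)
        least_relevant = max(active, key=m.get)                # clipping it degrades the value least
        eliminate(least_relevant)                              # phase (iii)
        active.remove(least_relevant)
    return active
```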

2.5. Practical implementation

To properly learn the metric $m$ in phase (ii) of the elimination loop, roll-outs have to be performed on the environment using ${\rm \pi} ^i$ in order to collect relevant observations and return values. These clipped data are buffered (in a buffer $\mathcal {B}_i$). Similarly to the original actor–critic neural structure of PPO-CMA, the $V^i$ functions are embodied by NNs (one extra NN per action component). During that metric-evaluation phase, these networks are trained using the specific buffered data collected on clipped roll-outs, in an identical supervised fashion as for the standard value function $V$, using

(2.6)\begin{equation} \mathcal{L}_{V^i} = \frac{1}{|\mathcal{B}_i|}\sum_{(s_t,R_t) \in\mathcal{B}_i}\left(R_t-V^i(s_t)\right)^2 \end{equation}

as the training loss. Note that these extra roll-outs represent a significant cost overhead. This may strongly impact training time when running on costly environments and when the number of action components $n_{act}$ is high. To save computation time, clipped roll-outs corresponding to already eliminated components are not performed and their corresponding value function networks are no longer trained. This identified limitation (extra roll-outs and networks) of the present algorithm is discussed in the conclusion.
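The corresponding supervised update of each $V^i$ may be sketched as follows in PyTorch; the buffer layout, optimiser and iteration count are illustrative assumptions rather than the settings of table 4.

```python
import torch

def train_clipped_critic(v_i, buffer_i, n_iter=100, lr=1e-3):
    """Regression of V^i on the returns observed in clipped roll-outs, i.e. loss (2.6).

    `buffer_i` is assumed to hold a tensor of states of shape (N, n_obs) and the
    corresponding returns R_t of shape (N,), collected with component i clipped.
    """
    states, returns = buffer_i
    opt = torch.optim.Adam(v_i.parameters(), lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        loss = ((returns - v_i(states)) ** 2).mean()   # L_{V^i}
        loss.backward()
        opt.step()
    return v_i
```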

The practical implementation of the three phases of the loop is described in figure 5: phase (i) is depicted on the left-hand side, corresponding to a standard training with the SGL $\alpha$ values kept frozen, while phases (ii) and (iii) are reported on the right-hand side, where the clipped value functions are trained on the collected data and the selected action component is then clipped. In practice, proceeding from phase (i) to phase (ii) and from (ii) to (iii) is only done if predefined performance stability criteria are met. Otherwise, the agent remains in the current phase until it complies with them. This allows for flexible elimination scheduling, saving training epochs. These stability criteria are based on the comparison of different Polyak averages of the metric $m$ and of the standard value function (depending on the phase). Figure 6 provides an illustrative example of the elimination process. During phase (iii) of the process, the $\alpha _i$ component corresponding to the eliminated actuator is linearly tuned down from $1$ to $0$ (over 200 epochs for both of the test cases introduced later).

Figure 5. Action-sparsity-seeking PPO-CMA agent structure and training operations. (a) First training phase. Greyed-out structures are not used. (b) Second and third training phases. Phases 2 and 3 alternate between training of the policy ${\rm \pi}$ and main critic $V$ and specific training of each critic $V^i$.

Figure 6. Example of the one-by-one elimination strategy drawn from a Kuramoto–Sivashinsky test case (§ 3). The performance curve (red line) has been smoothed using a rolling average of length 20 for the sake of readability. Shaded areas in the background illustrate learning phases. Black dots represent the evaluation runs and the checkpoint back-ups of the agent performed during training.

3. The 1-D Kuramoto–Sivashinsky equation

The first test case considered is the control of the 1-D Kuramoto–Sivashinsky (KS) equation. As shown later in this section, this case is very interesting because it is computationally inexpensive, allowing us to perform a brute-force search of the optimal actuator layouts, yet complex when analysing the obtained optimal layouts. It is therefore a particularly challenging test case for the present algorithm that highlights some of its strengths and weaknesses.

The KS equation is a well-studied fourth-order partial differential equation exhibiting a chaotic behaviour and describing the unstable evolution of flame fronts (Sivashinsky Reference Sivashinsky1980). On a periodic domain of length $L$, the KS equation reads

(3.1)\begin{equation} \frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}+\frac{\partial^2 u}{\partial x^2}+\frac{\partial^4 u} {\partial x^4} = a\quad \text{for all}\ t: u(0,t) = u(L,t),\end{equation}

where $a$ is the control forcing described later. For small spatial domains (here $L=22$), the KS equation exhibits three fixed points (i.e. three steady solutions of (3.1) named E1, E2 and E3) and low-dimensional instabilities, similar to some low-Reynolds-number Navier–Stokes flows, as illustrated in figure 7. Further mathematical details about the control of the KS equation can be found in the work of Bucci et al. (Reference Bucci, Semeraro, Allauzen, Cordier and Mathelin2022).

Figure 7. (a) Spatio-temporal representation of the dynamics of the KS equation on $500$ non-dimensional time units, corresponding to $2000$ control steps. (b) Shape of the three fixed points of the KS equation.

The numerical simulation is carried out using a 64-mode Fourier spatial decomposition and a third-order semi-implicit Runge–Kutta scheme (implicit formulation for the linear terms, explicit for the nonlinear term) marched in time with a time step of $0.05$. This numerical set-up is based on the work of Bucci et al. (Reference Bucci, Semeraro, Allauzen, Wisniewski, Cordier and Mathelin2019) and a code from pyKS. The control term is designed to mimic spatially localised Gaussian forcing actions,

(3.2)\begin{equation} a(x,t) = \sum_{i=0}^{n-1}a_i(t)\frac{1}{\sqrt{2{\rm \pi}\sigma}} \exp{\left(-\frac{(x-x^{act}_i)^2}{2\sigma^2}\right)}, \end{equation}

where $n$ is the number of control actions, $x^{act}_i,\ i\in \{0,\ldots,n-1\}$, the locations of the centres of the Gaussian kernels and $a_i$ the amplitude of the forcing implemented around $x^{act}_i$. Unless otherwise stated, the forcing action has eight components implemented at equispaced locations ($x^{act}_i\in \{0,1,\ldots,7\}L/8$), $a_i\in [-0.5,0.5]$ and $\sigma =0.4$. The partial state observations are provided by measurements of $u$ interspersed between the control action locations, so that $x^{obs}_i\in \{1,3,5,7,9,11,13,15\}L/16$.
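For concreteness, the forcing term (3.2) and the actuator/sensor placement described above can be assembled as in the following NumPy sketch; the 64-point grid matches the Fourier discretisation, and the periodic images of the kernels are neglected, which is a safe simplification for $\sigma =0.4$ on a domain of length $22$.

```python
import numpy as np

L = 22.0                                      # domain length
n_act, sigma = 8, 0.4                         # number of actuators and kernel width
x = np.linspace(0.0, L, 64, endpoint=False)   # 64 collocation points
x_act = np.arange(n_act) * L / n_act          # equispaced actuator centres {0,...,7} L/8
x_obs = (2 * np.arange(8) + 1) * L / 16       # sensors interspersed between actuators

def forcing(a):
    """Spatial forcing a(x,t) of eq. (3.2) for a vector of amplitudes a_i in [-0.5, 0.5]."""
    kernels = np.exp(-(x[None, :] - x_act[:, None]) ** 2 / (2.0 * sigma ** 2))
    kernels /= np.sqrt(2.0 * np.pi * sigma)   # normalisation as written in (3.2)
    return (a[:, None] * kernels).sum(axis=0)
```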

A control step consists of an update of the forcing action $a_t$, followed by five numerical time steps and the measurement of the observations and reward. This duration of five numerical steps per control step is chosen as a trade-off between two design criteria (as discussed by Rabault & Kuhnle Reference Rabault and Kuhnle2019). The first is imposed by the flow dynamics, which involve characteristic evolution time scales that need to be correctly observed by the agent. As measurements are performed at the end of each control step, ${\rm \Delta} t$ must be short enough to discretise the observation signal properly and avoid aliasing, so that the agent can capture the flow dynamics. The second criterion is imposed by the training algorithm itself. The RL algorithm should be able to observe the impact of a given action in a short-term future (generally $<100$ control steps), which argues for longer control steps, corresponding to a more distant physical horizon.

A standard run of the KS equation lasts for 250 control steps. The reset state is seeded using a Gaussian noise of amplitude $0.01$ and run for a random number (from $40$ to $100$) of control steps without control action, so that control starts on a random state with a fully developed instability.

For the present study, the aim of the control is to stabilise $u$ around the fixed point $E_1$, and, therefore, the reward $r_t$ is defined as

(3.3)\begin{equation} \left.\begin{gathered} MSE_t = \|u(\ {\cdot}\ ,t) - u_{E1}\|_2 = \sqrt{\frac{1}{L}\int_0^L \left(u(x,t)-u_{E1}\right)^2 {{\rm d}x}},\\ r_t = \frac{MSE_{ref}-MSE_t-0.1 ||\underline{\boldsymbol{a}}_t||^2}{|MSE_{ref}|}, \end{gathered}\right\} \end{equation}

where $u_{E1}$ describes the fixed point $E_1$ (refer to figure 7b), $MSE_{ref}$ is the time-averaged reference mean squared error of the uncontrolled state and $\underline {a}_t$ is the control action at time $t$. This way $r_t$ ranges over $]-\infty,1]$. Note that the configuration being $x$ invariant, any translated version of $E_1$ is still a fixed point and could be considered a valid control target. Given the chosen reward, the system is here driven toward the specific fixed point $E_1$ from figure 7(b). Accounting for the $x$ invariance would require a more complex reward formulation, which is not explored in this paper.
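A direct transcription of this reward on the discrete grid reads as follows; the grid spacing `dx`, the precomputed reference error `mse_ref` and the function name are assumptions of this sketch.

```python
import numpy as np

def ks_reward(u, u_E1, a, mse_ref, dx):
    """Reward of eq. (3.3): distance to the fixed point E1, penalised by the action norm."""
    L = dx * u.size
    mse_t = np.sqrt((1.0 / L) * np.sum((u - u_E1) ** 2) * dx)   # discrete version of the L2 norm
    return (mse_ref - mse_t - 0.1 * np.sum(a ** 2)) / abs(mse_ref)
```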

3.1. Results on the KS test case

3.1.1. Pre-trained performances

The initial training phase displays a steep increase of training performances until epoch $200$, as shown in figure 9. Normalised performance then slowly tends towards $0.95$. During this second phase the agent fine tunes the policy in order to reach the target more precisely and to reduce the transient duration. Figure 10 shows ensemble averages of the performance and action norm during test runs at the end of the pre-training phase. One can clearly notice that the short transient from the baseline state to the target state is the consequence of large amplitude actions. Their amplitude decreases rapidly once the target state is reached.

Figure 8. Location of observations (measuring $u$ at these locations) and Gaussian supports of the control actions overlaid on the $E1$ fixed point.

Figure 9. Average learning curve (red line) and evolution of the number of steps needed to reach a given percentage (reported on the line) of the maximum performance (orange lines). Dotted lines describe the ensemble minimum and maximum and shaded areas illustrate the standard deviation across the batch. These averages are computed on a 10-case training batch. Orange lines have been smoothed using a rolling average of width 75 for the sake of readability.

Figure 10. (c) Evaluation performances at the end of the pre-training. Averages are computed using 20 test runs. Normalised performance of the baseline (blue line) and of the controlled cases (red line) is to be read with the two left axes. Action $L_2$ norm (green line) is reported on the right axis. (a,b) Evolution of the state corresponding to the baseline and controlled ‘example’ runs, respectively.

3.1.2. A systematic ablation study

A first systematic study has been performed to provide a reference against which to assess the performance of the proposed method. This study consists in training from scratch a policy for every possible actuator layout. This represents 255 different cases, each layout being assessed using five test cases trained over 4000 epochs. Such an exhaustive study is only possible here because of the low computational cost of running the environment.

It has been found that the optimal layout evolution is complex and incompatible with a one-by-one elimination strategy: for a given number of actuators $n_{act}$, the best layout is not necessarily a subset (a ‘child’) of the optimal layout for $n_{act}+1$ actuators. Additionally, for a given $n_{act}$, the notion of ‘best layout’ is subject to caution since several sub-optimal layouts display performances almost equal to the optimal. Overall, layouts with actuators evenly spread out in the domain demonstrate the best performances, without significantly differing from each other.

These aspects make the comparison of the exhaustive ablation study with our method complex. Thus, it has been chosen to compare ensemble averages over the test batches rather than discussing slightly sub-optimal choices.

3.1.3. Sparsification process

The actuator elimination process is launched once the agent reaches a steady performance. The procedure described in § 2.5 is implemented and the specific hyperparameter values used for this case are reported in table 4 of the appendices. Figure 11(a) synthesises the results of the sparsification process. The elimination of the first three actuators marginally impacts performance. However, from five to one actuator, the average performance steadily decreases. A comparison with the results from the systematic study is also shown in the figure (blue lines). The obtained performance is below the average reward of the systematic study only for two actuators. This performance drop may be due either to a significantly sub-optimal layout selection by the algorithm (which would indicate a shortcoming of the proposed algorithm) or to unconverged policies on these newly selected layouts. To address this point, figure 11(b) indicates the performance to expect if the obtained agents were to go through further fine tuning. This is done by considering the performance of the corresponding layout in the systematic study. One can notice that the average performance is higher with the proposed method than for the systematic study, even with only two actuators. The fact that fine tuning would enable better-than-average performance with two actuators can be explained by the slow convergence of the performance of two-actuator layouts, which the algorithm considers stable even though it is still evolving slowly. A lower stability threshold may solve this issue, but it is likely to significantly slow down the elimination process.

Note that for one actuator, the average performance is slightly better than the optimal from the systematic study. One can assume that one-actuator agents obtained by the elimination method take advantage of a broader exploration of the state-action space thanks to training phases with more than one actuator compared with the agents converged in the systematic study. This observation is further discussed in Appendix A.3.

Figure 11. (a) Average training performance with respect to the number of active action components. The results of the proposed method (red line) are to be compared with those of the systematic study (blue line). The red shaded area illustrates the standard deviation across the batch, whereas the blue shaded area represents the performance envelope of all possible layouts. Data points are denoted by grey ‘plus’ signs. (b) Expected performances of the proposed methods after an extra fine-tuning phase of the agents (orange line) compared with the systematic study (blue line).

Figure 12 provides a more complete view of the elimination process by displaying the most frequent layouts obtained and the main paths towards sparsity. In a similar fashion to binomial coefficients, the number of layouts tends to increase towards four and five active components, but the three most selected layouts at each elimination step make up at least $65\,\%$ of the test batch. From two to one actuator, most of the individuals ($85\,\%$) converge toward the optimal one-actuator layout. Orange-coloured layouts are the optimal ones for a given complexity, according to the systematic study. As pointed out before, these optimal layouts may not be ‘children’ of one another and may therefore be incompatible with our sequential elimination approach. This is one identified shortcoming of the present method: in the one-by-one elimination process the algorithm makes a nearly optimal choice for a given number of actuators, but suffers from the constraint of its ‘inherited’ layout, which restricts the choice of the next obtained layout.

Figure 12. Detailed elimination history for the three main configurations for each complexity. Stripes connecting layouts have a size proportional to the number of observed transitions (e.g. transitions from the eight-actuator layout to the seven-actuator layout having all active components but the last one, represent 72 out of the 200 test cases). Orange individuals denote overall best layouts for a given number of actuators.

Statistics about the obtained layouts are summarised in figure 13. The left-hand side graph shows the entropy $H$ of the layouts with respect to the number of actuators. Here $H$ is computed as

(3.4)\begin{equation} H ={-}\sum_{l\in {layouts}} f_l \log\left(f_l\right), \end{equation}

where $f_l$ is the observed frequency of a given actuator layout over the batch. Comparing the entropy obtained with the proposed algorithm (red line) with that of an equivalent random sampling of the same size (orange line) demonstrates the selectivity of the method. The right-hand side of the figure depicts the selection frequency of each action component for layouts having one to four active actuators. Here and in the following, actuators are numbered in the same way as in figure 8. The elimination results show that actuators 3 and 4 are almost systematically eliminated for layouts of four or fewer action components, while actuators 0, 2 and, to a lesser extent, actuators 5 and 6 seem more important.

Figure 13. (a) Entropy of the obtained layouts with respect to the number of active actuators. The maximum reachable entropy is that reached by the systematic study (blue line). A random sampling of layouts of the same size as the study (200 test cases) would have the entropy represented by the orange line. (b) Frequency histograms of the obtained layouts (bar plots) compared with the absolute best layout from the systematic study (blue numbered boxes) for 1 to 4 action components. Population sizes (on which frequencies are computed) are reported below each histogram. Fixed point $E_1$ has been plotted in the background as a reference.

Yet, the importance of a given action component is a relative notion since it depends on the context of the rest of the gated layout. For instance, actuator 1 may only be important for the performance if both actuators 0 and 2 are disabled, and less useful otherwise. Thus, the frequency analysis of figure 13 provides only a synthetic glimpse of the results, but confirms that despite generating multiple layouts at some stages of the process, the method remains notably selective.

3.1.4. Partial conclusion on the study of the KS equation

Despite being a computationally cheap, well-documented, 1-D test case, the KS equation shows that even simple environments may present significant challenges for an actuator selection algorithm. First, the KS equation seems to be controllable with a very small number of actuators (two) but only as long as they are properly spread out along the $x$ axis. This constraint is only partially overcome by the elimination method, mostly due to the difficulty in isolating the impact of one actuator from the rest of the layout (having active neighbours tends to reduce the estimated actuator's importance). Thus, starting from eight actuators, the method has to perform six optimal (or nearly optimal since the order in which action components are removed can be interchanged) choices to reach the ‘right’ two-actuator layout, leaving very little room for a sub-optimal – yet efficient – layout choice along the process. Second, this case may also highlight that simply switching a given action component off to assess its importance has limitations since it does not take the adaptation of the policy ${\rm \pi}$ into account. This being said, starting from a given layout, the algorithm seems to perform the sequential elimination satisfactorily: despite relying on stochastic processes, the study shows that the algorithm has a high selectivity, and among the choice left by the sequential elimination paradigm, it yields layouts that will generally be among the high-performance ones.

To better characterise the performance reached by the algorithm, one may compare the estimated state values that drive the algorithm with the real values computed from the actual returns after elimination. Table 2 uses $S(V_{syst.})$ (the standard deviation of the performance of all possible layouts from the systematic study) as a reference value to quantify the level of accuracy needed to properly estimate the value of a given actuator elimination. Should the estimation error ${\rm \Delta} V=V_{estim.} - V_{obs.}$ be close to $S(V_{syst.})$, the estimation would not be precise enough to reliably select the ‘right’ actuator. The highest error is encountered for transition $2 \rightarrow 1$ and makes up around $1/3$ of the typical dispersion of values. Despite this large estimation error, most of the elimination choices lead to the optimal one-actuator layout, as explained before. Then come transitions $4 \rightarrow 3$ and $8 \rightarrow 7$. The first may come from the adaptation of the policy, which is not taken into account by the estimated value $V_{estim.}$. The latter is likely due to the very low dispersion of $V_{syst.}$ values (eliminating any actuator has nearly the same impact when all the others are kept active). For all the transitions, $V_{estim.}$ underestimates $V_{obs.}$ (${\rm \Delta} V<0$ on average), as one can reasonably expect. But overall, one may see that the metric used for the actuator elimination is rather well estimated by the algorithm, showing that the resulting sub-optimal choices appear to come from the complexity of the KS case regarding optimal actuator placement. This case allows us to pinpoint some interesting improvement directions for future work, discussed in the conclusion.

Table 2. Averaged relative value error ${\rm \Delta} V = V_{estim.} - V_{obs.}$ for each actuator elimination. Here $V_{estim.}$ is the forecast value for the actuator elimination and $V_{obs.}$ is the observed value function after the elimination; $S(V_{syst.})$ is the standard deviation of the average performances of all possible layouts (without taking any performance threshold into account).

4. Low Reynolds NACA 0012 stalled flow

The second test case concerns the control of a bidimensional flow around a stalled NACA 0012 at a chordwise Reynolds number $Re_c=1000$. As illustrated in figure 14, the origin of the domain $(0,0)$ is located at the airfoil leading edge, the computational domain is ‘C shaped’ and extends up to a distance of $20$ chord lengths ($C=1$). The simulation is built in the reference frame of the airfoil, meaning that the angle of attack ($\alpha$) is imposed by the upstream flow conditions, modelled here as a far-field boundary condition.

Figure 14. (a) Flow domain geometry, not to scale. $\alpha$ denotes the angle of attack and $C$ is the (unitary) chord length. (b) Boundary conditions on the airfoil, with $n_{act}=10$. Blue numbered boxes symbolise actuators, number 0 being the upstream-most actuator and number 9 the downstream-most one.

The free-stream flow is uniform at $M_\infty =0.1$. In the following, all quantities are made non-dimensional by the characteristic length $C$, the inflow density $\rho _\infty$, the velocity $U_\infty$ and the static temperature $T_\infty$. The flow solution is computed via direct numerical simulation (no turbulence modelling) using ONERA's FastS finite volume solver (Dandois, Mary & Brion Reference Dandois, Mary and Brion2018). The spatial discretisation is based on a simplified version (Mary Reference Mary1999) of the second-order accurate AUSM$+$(P) scheme proposed by Edwards & Liou (Reference Edwards and Liou1998), and the time integration of the equations is performed with a second-order implicit Euler scheme, with a global non-dimensional time step $dt=1.3\times 10^{-3}$. The structured mesh is made of $120,000$ nodes distributed such that the vicinity of the airfoil and its wake are properly resolved. The boundary conditions are specified in figure 14.

A control step ${\rm \Delta} t$ lasts for $58$ numerical time steps of the simulation, corresponding to ${\rm \Delta} t= 7.9\times 10^{-2}$ time units. Similarly to the KS test case, this choice was made to comply both with the need for a long enough horizon for the agent and with a precise enough discretisation of the measured signal. Here ${\rm \Delta} t$ corresponds to ${\approx }1/50$ of a characteristic period, assuming that most of the consequences of a control action are measured within the next two characteristic periods.

Control action is performed on the airfoil suction side through a series of $n_{act}$ independent jet inlets. Negative control actions correspond to suction and positive ones to blowing, the latter being performed at an angle of $-80^{\circ }$ with respect to the local wall normal. The control action (ranging over $[-1,1]^{n_{act}}$) is first rescaled before each control step to the actuators’ action limits (here $\pm$2) and converted to a mass-flow set point for the current control step. This command is used to compute the mass-flow boundary condition imposed at each time step using a $52$-iteration interpolation ramp ($90\,\%$ of the step), in order to avoid abrupt changes that may not be handled by the numerical solver, in a similar fashion to Paris et al. (Reference Paris, Beneddine and Dandois2021) and Rabault et al. (Reference Rabault, Kuchta, Jensen, Réglade and Cerardi2019). Thus, for the $i$th numerical iteration of control step $t$, the mass-flow per unit area $q_j^i$ implemented for actuator $j$ reads

(4.1)\begin{equation} q_j^i = \rho_\infty U_\infty\left(a_{j,t-1}(1-r_i) + a_{j,t} r_i\right),\quad \text{with } r_i = \min(i/52,1). \end{equation}

The stagnation enthalpy is held constant (at the upstream flow value) for injection actions. Figure 15 illustrates a standard set-up for this case.
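The ramp of (4.1) amounts to the following per-iteration computation, written here in non-dimensional units; the function and variable names are illustrative.

```python
RHO_INF = 1.0   # non-dimensional upstream density
U_INF = 1.0     # non-dimensional upstream velocity

def mass_flow(i, a_prev, a_curr, n_ramp=52):
    """Mass-flow set point per unit area at numerical iteration i of a control step, eq. (4.1).

    The command is interpolated linearly from the previous action a_prev to the
    current one a_curr over the first n_ramp iterations (90% of the 58-iteration
    control step), then held constant.
    """
    r = min(i / n_ramp, 1.0)
    return RHO_INF * U_INF * (a_prev * (1.0 - r) + a_curr * r)
```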

Figure 15. Instantaneous $Y$ velocity flow field, with an arbitrary control action (here with $n_{act}=20$), in the free-stream reference frame. White dots represent the sensor locations. The coloured triangles near the airfoil depict the action, their heights and colours representing each action amplitude. The dashed diamond shapes mark the maximum actions (both positive and negative). The strong variations in velocity in the vicinity of the actuators are due to the presence of interspersed wall boundary conditions in-between the actuators.

Both drag and lift coefficients ($C_d$ and $C_l$) are computed on the airfoil via the resultant force of the flow $\underline {F}$,

(4.2)\begin{gather} \underline{F} = \begin{pmatrix} F_x\\ F_y\end{pmatrix} = \oint_{{airfoil}}\underline{\underline{\boldsymbol{\mathsf{\sigma}}}}\ \boldsymbol{{\cdot}}\ {\underline{\boldsymbol{n}}}{\rm d} S, \end{gather}
(4.3)\begin{gather}C_d = \frac{1}{\dfrac{1}{2}\rho_\infty U_\infty^2 C}\begin{pmatrix} \cos\alpha & \sin\alpha \end{pmatrix}\begin{pmatrix} F_x\\ F_y\end{pmatrix}, \end{gather}
(4.4)\begin{gather}C_l = \frac{1}{\dfrac{1}{2}\rho_\infty U_\infty^2 C}\begin{pmatrix} -\sin\alpha & \cos\alpha\end{pmatrix} \begin{pmatrix}F_x\\ F_y\end{pmatrix}, \end{gather}

where $\underline {\boldsymbol{n}}$ is the unit airfoil surface normal vector, $\underline {\underline {\boldsymbol{\sigma} }}$ is the stress tensor, $\underline {\boldsymbol{e}}_x = (1,0)$, $\underline {\boldsymbol{e}}_y = (0,1)$ and $\alpha$ is the angle of attack of the airfoil. Note that the lift and drag coefficients are computed by integration along a closed contour around the airfoil, in the presence of the actuators.
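Equations (4.3) and (4.4) are a rotation of the integrated force into the wind frame, as in the sketch below; the surface integration of (4.2) itself is left to the solver, and the names are illustrative.

```python
import numpy as np

def aero_coefficients(F, alpha_deg, rho_inf=1.0, u_inf=1.0, chord=1.0):
    """Rotate the integrated force F = (F_x, F_y) into drag and lift coefficients."""
    alpha = np.deg2rad(alpha_deg)
    q = 0.5 * rho_inf * u_inf ** 2 * chord                      # dynamic pressure times chord
    c_d = ( np.cos(alpha) * F[0] + np.sin(alpha) * F[1]) / q    # eq. (4.3)
    c_l = (-np.sin(alpha) * F[0] + np.cos(alpha) * F[1]) / q    # eq. (4.4)
    return c_d, c_l
```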

4.1. Non-controlled flow

In this study the angle of attack $\alpha$ ranges from $12$ to $20$ degrees. At $Re_C=1000$, the flow is unsteady and displays a laminar vortex shedding (Wu et al. Reference Wu, Lu, Denny, Fan and Wu1998). This instability causes both lift and drag coefficients to vary periodically, yielding undesired alternating loads on the airfoil. The evolution of both the amplitude and the periodicity of these loads is shown in figure 16. The left-hand side graph shows the evolution of the lift and drag coefficients with respect to the angle of attack $\alpha$. The Strouhal number decreases up to $\alpha =22^{\circ }$ and remains nearly steady for larger $\alpha$ values, in agreement with Roshko's correlation formula for a low-Reynolds-number cylinder (Roshko Reference Roshko1953). The latter is computed considering the cross-sectional ‘area’ of the airfoil facing the air flow as the diameter of an equivalent cylinder. The formula is used within its range of validity since the diameter-based Reynolds number $Re_{\emptyset }$ ranges from $233$ ($\alpha =12^{\circ }$) to $451$ ($\alpha =26^{\circ }$).

Figure 16. (a) Evolution of the lift and drag coefficients and of the Strouhal number with the angle of attack $\alpha$. The shaded areas depict the standard deviation of $C_l$ and $C_d$. The secondary Strouhal number emerging for $\alpha \geq 22^{\circ }$ is drawn in a dashed line. The red dotted line shows the predicted Strouhal number for an equivalent cylinder using Roshko's formula. (b,c) The $C_l$ and $C_d$ power spectral density (PSD) spectra plotted for $\alpha =20^{\circ }$ (b) and $\alpha =25^{\circ }$ (c). Green vertical lines denote the main Strouhal numbers. Other Strouhal numbers (for the other values of $\alpha$) are reported as grey lines for comparison.

Figure 16(b,c) depicts the lift and drag coefficient spectra for angles of attack of $20^{\circ }$ and $25^{\circ }$. For $\alpha =20^{\circ }$, both coefficients have strongly peaked harmonic spectra with a fundamental Strouhal number $St(\alpha =20)\approx 0.52$. For $\alpha =25^{\circ }$, interspersed peaks appear in-between the pre-existing harmonics. The main Strouhal number slightly decreases to $St(\alpha =25)\approx 0.46$ and a halved Strouhal number $St'$ can be measured at around $0.23$. The emergence of this subharmonic for $\alpha \geq 22^{\circ }$, whose evolution is reported on the left-hand graph (green dashed line), is associated with a symmetry breaking between two successive shed vortex pairs.
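The dominant Strouhal numbers reported above can be extracted from the force-coefficient time series by locating the peak of their power spectral density. A minimal sketch is given below, assuming a uniformly sampled $C_l$ signal; the sampling step `dt`, the window length and the helper name `dominant_strouhal` are illustrative choices, not quantities defined in the paper.

```python
import numpy as np
from scipy.signal import welch

def dominant_strouhal(c_l, dt, u_inf=1.0, chord=1.0):
    """Return the Strouhal number of the strongest PSD peak of a C_l series."""
    f, psd = welch(c_l - np.mean(c_l), fs=1.0 / dt, nperseg=min(len(c_l), 4096))
    return f[np.argmax(psd)] * chord / u_inf       # St = f C / U_inf

# Example with a synthetic signal oscillating at St = 0.52.
dt, n = 0.01, 20000
t = dt * np.arange(n)
c_l = 0.8 + 0.3 * np.sin(2.0 * np.pi * 0.52 * t)
print(dominant_strouhal(c_l, dt))                  # close to 0.52, within the frequency resolution
```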

Figure 17 illustrates this symmetry breaking over two vortex periods. During the first period, the negative-vorticity, suction-side vortex (coloured in blue) induces the downward shedding of a weak pressure-side vortex and feeds a second, positive-vorticity vortex on the suction side, near the trailing edge. At the end of the first period, this trailing-edge vortex induces an upward shedding of the clockwise-rotating suction-side vortex. During the second period, the trailing-edge vortex is stronger and located higher than its counterpart one period earlier. It sheds horizontally and stretches the induced leading-edge vortex. Compared with the case $\alpha =20^{\circ }$, where trailing-edge and leading-edge vortices have comparable sizes and strengths, at $\alpha =25^{\circ }$ the leading-edge and trailing-edge vortices are alternately dominant over time.

Figure 17. Snapshots of the flow vorticity around the airfoil over two characteristic periods $T$. Blue-framed thumbnails correspond to $\alpha =20^{\circ }$, orange-framed ones correspond to $\alpha =25^{\circ }$.

For any angle of attack $\alpha$, one can define the characteristic period $T=1/St(\alpha )$ of the unstable phenomenon. This time unit is later used in the study to size the control step. The main goal of the controller is to minimise lift fluctuations using as little control power as possible. Thus, the reward $r_t$ is defined as

(4.5)\begin{equation} r_t ={-} S\left(C_l\right)_{2T} - S\left(C_d\right)_{2T} - 0.05 \frac{1}{n_{act}}\sum_{i=0}^{n_{act}-1}|\langle a_i\rangle_{2T}|, \end{equation}

where $S(C_l)_{2T}$ and $S(C_d)_{2T}$ are the standard deviations of the lift and drag coefficients computed over two characteristic periods, and $|\langle a_i\rangle _{2T}|$ is the absolute value of the $i$th action component averaged over the same two periods.
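For clarity, the reward (4.5) can be evaluated from rolling windows of the force coefficients and of the applied actions, as in the sketch below. This is a minimal illustration; the window length and the helper name `compute_reward` are assumptions, the only fixed ingredients being the two standard deviations and the $0.05$ action penalty of (4.5).

```python
import numpy as np

def compute_reward(c_l_window, c_d_window, actions_window, penalty=0.05):
    """Reward of (4.5) over the last two shedding periods.

    c_l_window, c_d_window : 1-D arrays of C_l and C_d samples over 2T
    actions_window         : array of shape (n_samples, n_act) over 2T
    """
    mean_abs_action = np.mean(np.abs(actions_window.mean(axis=0)))  # (1/n_act) sum |<a_i>|
    return -np.std(c_l_window) - np.std(c_d_window) - penalty * mean_abs_action

# Steady coefficients and a weak mean suction give a reward close to zero (the optimum).
print(compute_reward(np.full(80, 0.8), np.full(80, 0.2), np.full((80, 10), -0.1)))
```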

Figure 18 describes the ranges of the separation and reattachment points with respect to the angle of attack. The separation point ranges between $17.3$ % and $18.2$ % (of the chord length) for $\alpha =12^{\circ }$, between $8.6$ % and $9.7$ % for $\alpha =15^{\circ }$ and between $3.6$ % and $4.4$ % for $\alpha =20^{\circ }$. This expected upstream motion of the separation point comes along with a strong variation of the reattachment point, due to the vortex-shedding unsteadiness growing with $\alpha$, as shown in the figure. For $\alpha =15^{\circ }$, no reattachment is measured, but it can be considered to lie at the trailing edge.
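The separation and reattachment locations discussed here are typically identified as the chordwise positions where the wall shear stress (or skin-friction coefficient) changes sign along the suction side. The sketch below illustrates one hedged way of doing this from a sampled skin-friction distribution; the function name and the linear interpolation between grid points are choices made for this illustration, not the paper's post-processing code.

```python
import numpy as np

def separation_reattachment(x_c, cf):
    """Locate sign changes of the skin-friction coefficient along the suction side.

    x_c : chordwise positions x/C, ordered from leading to trailing edge
    cf  : skin-friction coefficient sampled at x_c
    Returns the x/C of separation (+ to -) and reattachment (- to +) points.
    """
    idx = np.where(np.sign(cf[:-1]) * np.sign(cf[1:]) < 0)[0]
    # Linear interpolation of the zero crossing inside each bracketing interval.
    x_zero = x_c[idx] - cf[idx] * (x_c[idx + 1] - x_c[idx]) / (cf[idx + 1] - cf[idx])
    separation = x_zero[cf[idx] > 0]
    reattachment = x_zero[cf[idx] < 0]
    return separation, reattachment

# Example: a flow separating near x/C = 0.09 and reattaching near x/C = 0.95.
x = np.linspace(0.0, 1.0, 200)
cf = 0.01 * (0.09 - x) * (0.95 - x)
print(separation_reattachment(x, cf))
```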

Figure 18. Separation (filled areas) and reattachment (hatched areas) point ranges with respect to the angle of attack. For $\alpha =15^{\circ }$, the reattachment takes place at the trailing edge.

4.2. Results on the NACA test case

The proposed algorithm has been run on test cases with the previously introduced environment set-up for angles of attack ($\alpha$) of $12^{\circ }$, $15^{\circ }$ and $20^{\circ }$. The choice of these three cases is motivated by the significantly different dynamical behaviour they exhibit. The specific hyperparameter values used for this study are reported in table 4. For the sake of conciseness, comparison of the results between these cases is only presented and discussed when noticeable differences arise.

4.2.1. Pre-trained performances

For each angle of attack, 10 independently initialised test cases have been trained. Figure 19 illustrates the evolution of both the average training reward and the standard deviation of the lift ($C_l$) and drag ($C_d$) coefficients at the end of each training run. During the pre-training phase (from epoch 0 to 700) and for all cases, the average reward grows rapidly until epochs 300–400, then stabilises at its maximum value. This comes with a steep reduction of the fluctuations of both $C_l$ and $C_d$. The average exploration component $\sigma$ of the policy steadily decreases during training (not shown). This behaviour is expected from the agent, which starts with a broad exploration of the possible actions and then narrows it down towards a more deterministic control strategy.

Figure 19. Ensemble-averaged learning curves of the pre-training phase for $\alpha =12^{\circ }$ (a), $\alpha =15^{\circ }$ (b) and $\alpha =20^{\circ }$ (c). Dotted lines describe the ensemble minimum and maximum and shaded areas illustrate the standard deviation across the batch.

Figure 20 shows the ensemble averages of evaluations performed at the end of this pre-training phase for $\alpha =15^{\circ }$, together with vorticity snapshots of the flow along a randomly selected test run. As soon as the control starts, the control action swiftly drives the lift and drag coefficients towards favourable and stable values. As shown by both the extremal envelopes and the flow snapshots, this correlates with a stationarisation of the flow, through the suppression of vortex shedding, and with flow reattachment. The control actions consist of moderate suction, as expected. In the following, actuators are numbered according to figure 14.

Figure 20. Ensemble averages of evaluation runs performed at the end of the pre-training phase (epoch $700$) for $\alpha =15^{\circ }$. Here $10$ independent test cases are evaluated on 10 test runs (100 trajectories in total). Dotted lines describe the ensemble minimum and maximum and shaded areas illustrate the standard deviation across the batch. Vorticity snapshots illustrate the flow state at key moments of a randomly chosen run.

Contrary to the KS case, a systematic study to compute optimal layouts for the airfoil cases is not possible, owing to the prohibitive computational resources it would require. This flow configuration is nonetheless interesting because the control laws are simple enough that a first analysis of each action component provides insight into the usefulness of each actuator. This makes it a relevant source of information for evaluating the performance of the sparsification algorithm. The following paragraphs provide the main conclusions obtained from a careful observation of each action component for the converged policies. The figures supporting these claims are provided in Appendix A.4.

For $\alpha =20^{\circ }$, all action components display strong variations during the transient phase, which lasts for about eight non-dimensional time units. They all remain unsteady afterwards, synchronised with the vortex shedding that has not been entirely suppressed. Action component 0 and, to a lesser extent, components 1, 2 and 3 display a strong suction forcing after the transient. These forcings ensure the flow reattachment.

For the cases with $\alpha =15^{\circ }$, actuators 3, 4 and 6 apply rapidly varying actions. The reattachment is again ensured by actuators 0 and 3 and, to a lesser extent, actuator 1. It is worth noting that after the transient (lasting around six non-dimensional time units), all the action components except actuator 6 become relatively steady.

Finally, for $\alpha =12^{\circ }$, only actuator 3 displays strong action variations during the transient, which lasts around five non-dimensional time units. During the stabilised phase, action components 2, 3 and 7 display a strong and relatively steady suction forcing. Here again, actuators 6 and 9 remain relatively unsteady compared with the others.

4.2.2. Sparsification process

Figure 21 synthesises the results of the sparsification for the cases with $\alpha =15^{\circ }$. Since the performances follow similar trends for $\alpha =12^{\circ }$ and $\alpha =20^{\circ }$, the sparsification process is discussed for $\alpha =15^{\circ }$ only. Two different trends can be distinguished in the performances. First, from ten down to four actuators, only the average values of the drag ($\overline {C_d}$) and lift ($\overline {C_l}$) coefficients (figure 21c,d) are significantly impacted by the elimination of actuators, respectively increasing and dropping as the number of actuators is reduced. The standard deviations of these coefficients (on which the reward is based), which quantify the steadiness of the mechanical loads, remain rather constant. As the number of actuators further decreases, from four to one, both average coefficients continue and amplify their previous trends, and the standard deviations increase significantly, denoting an expected decrease in overall performance. For the vast majority of the data points, the control yields significantly better performances than the non-controlled flow (denoted as ‘baseline’ in figure 21).

Figure 21. Averaged training performance indicators with respect to the number of active action components for $\alpha =15^{\circ }$. Data points are denoted by grey ‘plus’ signs. Ensemble averages are computed on these data points (batch size $5$). The blue dashed line represents the uncontrolled (baseline) performance, solid lines denote the evolution of the ensemble average and shaded areas illustrate the standard deviation across the batch. All indicators are computed on the last 40 control steps of a training run. (a) Evolution of the standard deviation of the drag coefficient $S(C_d)$. (b) Evolution of the standard deviation of the lift coefficient $S(C_l)$. (c) Time-averaged drag coefficient $\overline {C_d}$. (d) Time-averaged lift coefficient $\overline {C_l}$.

Figure 22 further describes the elimination choices with respect to both the angle of attack $\alpha$ and the number of active action components. In all cases, the algorithm chooses actuators consistently with the first analysis proposed at the end of § 4.2.1. With five and three components, the eliminations for all studied angles of attack display similar patterns, removing actuators 4 and 5 and preserving the others, with a slight advantage for the upstream ones (0 to 3). For $\alpha =15^{\circ }$, $\alpha =20^{\circ }$ and a reduced number of action components (1, 2 and 3), the algorithm favours locations near the separation range, as one may intuitively expect. However, for $\alpha =12^{\circ }$, the remaining actuators are located away from the separation range. This turns out to be more beneficial for this specific angle of attack, as shown in Appendix A.4 (figure 25).

Figure 22. Frequency histograms of the obtained layouts (bar plots) for $\alpha =12^{\circ }$ (top row), $\alpha =15^{\circ }$ (middle row) and $\alpha =20^{\circ }$ (bottom row). The blue shaded areas correspond to the range of evolution of the separation point. These frequencies are observed on the test batch of size 10, used in this study.

4.2.3. Partial conclusion on the study of the NACA flow

The study of this test case has first shown that, except for $\alpha =20^{\circ }$, a nearly complete stabilisation of the flow using reduced-amplitude control actions could be obtained with the proposed layout and the described RL method. The proposed action-sparsification algorithm has produced the expected results while largely preserving control performance during the elimination phases. For reduced numbers of action components, the feedback provided by the sensors enables the policy to properly synchronise the control actions with the remaining unsteadiness and, thus, to reduce the effective load variations on the airfoil.

4.2.4. Remark on the obtained closed-loop control laws

The airfoil case has been selected because of the rather simple control policies it yields (in most of the cases, a suction in the vicinity of the separation region), which makes it a good candidate to assess the behaviour of our actuator elimination algorithm. As the control actions appear nearly steady during the late phase of control (whether it is with the full 10-actuator layout or the sparse ones), one may question the intrinsic closed-loop nature of the control law. Note that this question is not in the original scope of the present paper. It is nonetheless an interesting side point that is briefly addressed in the following.

Closed-loop test runs have first been performed for ten, five and two actuators for all considered angles of attack, to be used as baseline values. The corresponding action sequences have then been re-implemented in an open-loop fashion on randomly reset cases. Table 3 presents the variation of various performance indicators at the end of the test runs.
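The open-loop comparison amounts to replaying a recorded closed-loop action sequence while discarding the sensor feedback. The sketch below illustrates this protocol under the assumption of a gym-like environment interface that returns the force coefficients in its `info` dictionary; the environment API, the field names and the 40-step measurement window are illustrative assumptions, not the actual simulation interface.

```python
import numpy as np

def replay_open_loop(env, recorded_actions, window=40):
    """Replay a recorded action sequence open-loop and return the final-phase
    standard deviations of C_l and C_d (the quantities targeted by the reward)."""
    env.reset()                                 # random initial phase of the shedding
    c_l, c_d = [], []
    for action in recorded_actions:             # actions recorded during a closed-loop run
        _, _, _, info = env.step(action)        # observations are simply ignored
        c_l.append(info["C_l"])
        c_d.append(info["C_d"])
    return np.std(c_l[-window:]), np.std(c_d[-window:])
```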

Table 3. Comparison of the final (in the ‘stabilised phase’) performances in closed-loop and open-loop conditions. Performance variations are measured as a relative variation with respect to the closed-loop performance. Green coloured figures indicate that the open-loop control performs better on the metric than the closed loop whereas red coloured ones state the opposite. Ensemble averages are computed on batches of 10 test runs.

The relative variations of the time-averaged lift coefficient are systematically positive, whereas the drag coefficient evolves negatively. These variations nonetheless remain small compared with the absolute coefficient values. The standard deviations quantifying the steadiness of the loads on the airfoil (which constitute the actual control objective defined in the reward) are systematically increased when the control actions are implemented in an open-loop fashion. For 10 actuators, the large relative increase must be put into perspective: the reference standard deviation is very low, which inflates the relative variation. The difference between open loop and closed loop is visible for five and two actuators, where the closed loop does not totally stabilise the flow. Thus, for 10 actuators the policy can be considered close to an open-loop forcing, whereas open-loop forcing is less efficient for a reduced number of actuators, where the remaining load unsteadiness can be damped by an in-phase closed-loop control action.

To conclude, the relative improvement obtained by using a closed-loop strategy over an open-loop one is significant here, although the corresponding absolute gains are more marginal. It may therefore be feasible to obtain satisfactory control performances on this case through a parametric open-loop control study (with constant or variable suction), where the number of actuators, the action amplitude and sign, and the actuator positions would be varied. Yet, even for this simple flow configuration, such an exhaustive study would face the ‘curse of dimensionality’ inherited from the combinatorial search for the optimal layout (with, for instance, 252 different layouts having five active action components). Consequently, if carried out blindly with respect to the flow physics, this search would end up being significantly more expensive than the present algorithm, even for this relatively easy-to-control flow.

To assess whether the control yields a net energy gain, the time-averaged drag-power reduction has been compared with the estimated actuator power required for the control. The drag power $P_{drag}$ reads $P_{drag}=\frac {1}{2}\rho_\infty U_\infty ^3 L\langle C_{x}\rangle$, where $\langle C_{x}\rangle$ is the time-averaged drag coefficient. The actuation power $P_{act}$ is modelled as $P_{act}=\sum _{i=1}^{10}|q_i{\rm \Delta} p_i S_i/\rho |/r_{a}\approx \sum _{i=1}^{10}|a_{t,i}|S_i\rho _\infty U_\infty U_{jet}^2 /r_{a}$, considering that compressibility effects are negligible, $S_i$ being the ‘area’ of actuator $i$, $U_{jet}$ the injection velocity (kept constant), ${\rm \Delta} p_i$ the excess pressure delivered by the actuator and $r_{a}$ the power-efficiency ratio of the actuators. Assuming a perfect actuator efficiency ($r_{a}=1$), the power saved thanks to the control ranges from around $5\,\%$ (for $\alpha =12^{\circ }$) to $35\,\%$ (for $\alpha =20^{\circ }$). Small variations of these values can be observed depending on the type of control (closed loop or open loop) and on the number of active actuators. Conversely, one may compute the minimum actuator efficiency $r_{a,min}$ needed to obtain a positive energy saving. In that case, $r_{a,min}$ ranges from $65\,\%$ (for $\alpha =12^{\circ }$) to less than $20\,\%$ (for $\alpha =20^{\circ }$).
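The energy balance described above can be written as a short routine. This is a minimal sketch of the stated power model, with the actuator efficiency entering as a divisor of the delivered fluid power; the function name `power_balance` and the default parameter values are assumptions introduced for illustration.

```python
import numpy as np

def power_balance(c_x_baseline, c_x_controlled, mean_abs_actions, s_act,
                  rho_inf=1.0, u_inf=1.0, u_jet=1.0, l_ref=1.0, r_a=1.0):
    """Compare the saved drag power with the modelled actuation power.

    mean_abs_actions : time-averaged |a_i| for each actuator, shape (n_act,)
    s_act            : actuator 'areas' S_i, shape (n_act,)
    r_a              : actuator power-efficiency ratio (1 = perfect actuator)
    """
    p_drag = lambda c_x: 0.5 * rho_inf * u_inf**3 * l_ref * c_x
    p_saved = p_drag(c_x_baseline) - p_drag(c_x_controlled)
    p_fluid = np.sum(mean_abs_actions * s_act) * rho_inf * u_inf * u_jet**2
    p_act = p_fluid / r_a                     # consumed power grows as efficiency drops
    net_saving = (p_saved - p_act) / p_drag(c_x_baseline)
    r_a_min = p_fluid / p_saved               # minimum efficiency for a net energy gain
    return net_saving, r_a_min
```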

5. Conclusion

This study proposes and tests an actuator-selection method built on top of a pre-trained on-policy RL agent. The method relies on a sequential elimination of actuators based on a ‘what-if’ analysis of the removal of each action component, eliminating the component with the smallest estimated impact on performance. This process is repeated (after a ‘recovery’ period) until the prescribed number of actuators is reached.
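In pseudocode form, the elimination loop summarised above can be sketched as follows. This is a schematic outline only: the method names (`estimate_value_without`, `eliminate`, `train`) are placeholders for the per-candidate value estimation, gate closure and recovery training described earlier in the paper, and do not correspond to an actual implementation.

```python
def sparsify(agent, env, n_target, recovery_epochs):
    """Sequential 'what-if' elimination down to n_target active actuators."""
    active = list(range(agent.n_act))
    while len(active) > n_target:
        # Metric evaluation: estimate, for each remaining actuator, the value the
        # agent would retain if that actuator were removed.
        scores = {j: agent.estimate_value_without(j, env) for j in active}
        # Elimination: remove the component whose removal costs the least.
        weakest = max(scores, key=scores.get)
        agent.eliminate(weakest)
        active.remove(weakest)
        # Policy adaptation: 'recovery' period with the reduced layout.
        agent.train(env, epochs=recovery_epochs)
    return active
```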

This method has first been applied to the 1-D periodic KS equation, starting from eight evenly distributed actuators, whose number has been successively reduced to one. The proposed elimination algorithm yields actuator layouts and sparse policies performing reasonably well compared with the maximum foreseeable performance, highlighting the ability of the method to provide a relevant approximation of the Pareto front of performance versus number of actuators at a lower cost than the exhaustive combinatorial study. The reasons for not reaching optimal performance have been investigated in detail. This study has shown that, despite being a computationally cheap test case, the KS equation poses numerous challenges to the method that are to be addressed in forthcoming developments, highlighting interesting research directions for the future. One of them is the development of more sophisticated approaches than the one-by-one elimination strategy proposed here. This strategy is indeed problematic when the estimates are close to one another while the corresponding elimination choices lead to very different layouts, policies and performances at the end of the selection process.

The ‘what-if’ analysis paradigm may also show limitations in cases where actuators can compensate for eliminated ones, since the method does not account for the adaptation of the policy following an elimination, as illustrated by the KS test case. One may imagine that this policy adaptation could retrospectively change the actuators’ ranking in some cases. The strategy would be more accurate if it could account for this adaptation, or if it could anticipate the subsequent elimination choices, mimicking a chess player thinking two or three moves ahead. In the present framework, this is unfortunately not tractable, since it would require around $n_{act}$ copies of the agent, each trained in the same way as the reference agent. Therefore, the design of viable alternative strategies is a complex question that may be very interesting to explore in the future.

Then, a stalled bidimensional flow around a NACA 0012 airfoil has been considered as a control test case more representative of fluid-mechanics control applications, both in terms of computational cost and dimensionality (even though it remains far from actual industrial flow configurations). The pre-trained controller first showed a nearly complete stabilisation of the flow for the studied angles of attack. The method again succeeded in eliminating actuators while maintaining a high level of stabilisation performance. Even though this flow can be controlled efficiently using an appropriate constant open-loop forcing, the test case again underlines the relevance of the method for finding an action-parsimonious control policy (in terms of actuator number, position and amplitude).

Yet, as it stands, the computational cost of the present algorithm makes it very expensive for actual industrial flow configurations, especially if a very large number of actuators (before elimination) is considered. Note that, to our knowledge, this is unfortunately the case for all state-of-the-art RL approaches for flow control. The overhead cost of the present algorithm stems from two factors. The first is the one-by-one elimination design choice, which forces a sequential elimination and imposes a minimum number of training epochs to reach the prescribed number of actuators. On the bright side, this strategy is useful when the final required number of actuators is not known in advance, since the method then estimates the best layout for each number of actuators and the corresponding performance. Still, one may want to derive a parallel elimination method that directly reaches the prescribed number of actuators; such a method would have to account for the policy-adaptation issue previously highlighted. The second factor of extra cost is the need to converge the different value estimates for each elimination candidate using data from dedicated roll-outs. Measures have already been implemented to mitigate this cost, e.g. already eliminated components are systematically skipped. Yet, this cost could be divided by a factor of $n_{act}$ if the estimates used for ranking actuators could be converged using data from ‘standard’ roll-outs, in an ‘on-the-fly’ fashion. Therefore, future work may search for accurate ways to estimate the value function of sparse policies without dedicated roll-outs. This is a challenging question, and several attempts in this direction for the present study have not yielded satisfactory results yet (not shown here). Finally, the mirror issue of completing a ‘deficient’ or partial actuator layout could also be considered in future work. This may require much more complex methods and paradigms, since estimating the impact of an extra forcing action with its own optimised dynamics would require much more knowledge of the environment's dynamics. The issue of redistributing an initially ‘unbalanced’ layout also falls into this category.

All these identified issues are avenues to further improve a method tackling the crucial topic of optimal actuator layout and control behaviour, especially in the context of a gradual scale-up in case complexity towards more realistic configurations and ‘industry-ready’ control methods.

Funding

This work is funded by the French Agency for Innovation and Defence (AID) via a PhD scholarship. Their support is gratefully acknowledged.

Declaration of interests

The authors report no conflict of interest.

Appendix A

A.1. Hyperparameters

Table 4 presents the main numerical parameters of both the simulated case and the learning algorithm, with the notations used in the paper (if introduced).

Table 4. Additional numerical parameters.

A.2. Elimination speed

Figure 23 illustrates the average cost of elimination depending on the eliminated action component. This cost is the number of training epochs required to run phases 2 (metric evaluation), 3 (elimination) and 1 (policy adaptation), in that order, for a given elimination. One can notice a rather stable overhead of around $500$ epochs from elimination $8\rightarrow 7$ to $4\rightarrow 3$, followed by an increase of this cost for the last two eliminations. This increase is mainly due to the policy-adaptation phase, which takes longer with fewer active actuators. This behaviour can be correlated with the slower convergence of the learning curves for layouts having one or two actuators compared with the others in the systematic study (not shown here).

Figure 23. Duration of the elimination phases with respect to the eliminated action component. The ensemble average (solid line) is computed across the batch of 200 test cases. The shaded area illustrates the corresponding ensemble standard deviation.

A.3. Performances with one actuator on the KS equation

With only one actuator, the obtained layouts slightly outperform the best individual of the systematic study, as shown in figure 24. As discussed, this moderate over-performance of the agents obtained by the elimination process, compared with those converged in the systematic study, may be attributed to a better exploration of the state-action space during training. Comparing the two example runs (blue-framed and red-framed graphs), one can notice that both agents tend to stabilise the state around the same target. While still being sub-optimal, this stationary state is close to the second fixed point $E_2$ (if $E_2$ were the control target, the reward would be around $0.72$) and still performs better than the average non-controlled state. The gap in performance appears to result from a better stabilisation of the state by the ‘sparsified’ agent.

Figure 24. (e) Comparison of the performances of both one-actuator agents (having only actuator 0 activated) obtained via the systematic study (blue line) and the elimination process (red line) with the non-controlled baseline (grey line) and an eight-actuator agent (orange line). Solid lines represent ensemble averages computed over batches of 100 runs, shaded areas account for the standard deviation across the batch, while dash-dotted lines depict the example run reported on the upper graphs. (a–d) Evolution of the state on example runs for each agent.

A.4. Other results on the NACA test case

Figure 22 shows the observed actuator-selection frequencies with respect to the total number of active components and the angle of attack. This selection appears to be, as expected, correlated with the position of the uncontrolled flow-separation range for $\alpha =15^{\circ }$ and $\alpha =20^{\circ }$, whereas for $\alpha =12^{\circ }$ the sparsification method keeps action components significantly upstream of the separation range. Figure 25 shows the difference in lift and drag performance for a constant open-loop forcing implemented, respectively, on actuators 3 and 6. One can notice a significant advantage of implementing the forcing on the upstream actuator 3 rather than on actuator 6, thus justifying the elimination choice of the algorithm.

Figure 25. Comparison of lift and drag open-loop control performances using actuator number 3 (as prescribed by the sparsification method), located upstream of the separation range, or using actuator number 6, located near the separation range. Ensemble averages are computed on a batch of 40 test runs.

Figure 26 compares the ensemble average of the control actions on test runs of the pre-trained agents (using all the available action components) for the different angles of attack.

Figure 26. Ensemble average of control actions of the pre-trained policies for $\alpha =12^{\circ }$ (a), $\alpha =15^{\circ }$ (b) and $\alpha =20^{\circ }$ (c). Averages are computed on batches of 50 test runs. Shaded areas illustrate the standard deviation across the batch. Action component numbering follows the notation introduced in figure 14.

References

Amitay, M. & Glezer, A. 2002 Role of actuation frequency in controlled flow reattachment over a stalled airfoil. AIAA J. 40 (2), 209–216.
Beintema, G., Corbetta, A., Biferale, L. & Toschi, F. 2020 Controlling Rayleigh–Bénard convection via reinforcement learning. J. Turbul. 21 (9–10), 585–605.
Belus, V., Rabault, J., Viquerat, J., Che, Z., Hachem, E. & Reglade, U. 2019 Exploiting locality and translational invariance to design effective deep reinforcement learning control of the 1-dimensional unstable falling liquid film. AIP Adv. 9 (12), 125014.
Bhatnagar, S., Afshar, Y., Pan, S., Duraisamy, K. & Kaushik, S. 2019 Prediction of aerodynamic flow fields using convolutional neural networks. Comput. Mech. 64 (2), 525–545.
Bhattacharjee, D., Hemati, M., Klose, B. & Jacobs, G. 2018 Optimal actuator selection for airfoil separation control. AIAA Paper 18-3692.
Bruneau, C.-H. & Mortazavi, I. 2008 Numerical modelling and passive flow control using porous media. Comput. Fluids 37 (5), 488–498.
Brunton, S.L. & Noack, B.R. 2015 Closed-loop turbulence control: progress and challenges. Appl. Mech. Rev. 67 (5), 050801.
Brunton, S.L., Noack, B.R. & Koumoutsakos, P. 2020 Machine learning for fluid mechanics. Annu. Rev. Fluid Mech. 52, 477–508.
Bucci, M.A., Semeraro, O., Allauzen, A., Cordier, L. & Mathelin, L. 2022 Nonlinear optimal control using deep reinforcement learning. In IUTAM Laminar-Turbulent Transition, pp. 279–290. Springer.
Bucci, M.A., Semeraro, O., Allauzen, A., Wisniewski, G., Cordier, L. & Mathelin, L. 2019 Control of chaotic systems by deep reinforcement learning. Proc. R. Soc. A 475 (2231), 20190351.
Chomaz, J.-M. 2005 Global instabilities in spatially developing flows: non-normality and nonlinearity. Annu. Rev. Fluid Mech. 37, 357–392.
Cohen, K., Siegel, S. & McLaughlin, T. 2006 A heuristic approach to effective sensor placement for modeling of a cylinder wake. Comput. Fluids 35 (1), 103–120.
Dandois, J., Mary, I. & Brion, V. 2018 Large-eddy simulation of laminar transonic buffet. J. Fluid Mech. 850, 156–178.
Djeumou, F., Neary, C., Goubault, E., Putot, S. & Topcu, U. 2022 Neural networks with physics-informed architectures and constraints for dynamical systems modeling. In Learning for Dynamics and Control Conference (ed. N. Lawrence & M. Reid), pp. 263–277. PMLR.
Duraisamy, K., Iaccarino, G. & Xiao, H. 2019 Turbulence modeling in the age of data. Annu. Rev. Fluid Mech. 51, 357–377.
Edwards, J.R. & Liou, M.-S. 1998 Low-diffusion flux-splitting methods for flows at all speeds. AIAA J. 36 (9), 1610–1617.
Evans, H.B., Hamed, A.M., Gorumlu, S., Doosttalab, A., Aksak, B., Chamorro, L.P. & Castillo, L. 2018 Engineered bio-inspired coating for passive flow control. Proc. Natl Acad. Sci. USA 115 (6), 1210–1214.
Fukami, K., Hasegawa, K., Nakamura, T., Morimoto, M. & Fukagata, K. 2021 Model order reduction with neural networks: application to laminar and turbulent flows. SN Comput. Sci. 2 (6), 1–16.
Garnier, P., Viquerat, J., Rabault, J., Larcher, A., Kuhnle, A. & Hachem, E. 2021 A review on deep reinforcement learning for fluid mechanics. Comput. Fluids 225, 104973.
Ghosh, S., Das, N., Das, I. & Maulik, U. 2019 Understanding deep learning techniques for image segmentation. ACM Comput. Surv. 52 (4), 1–35.
Giannetti, F. & Luchini, P. 2007 Structural sensitivity of the first instability of the cylinder wake. J. Fluid Mech. 581, 167–197.
Gupta, R. & Ansell, P.J. 2019 Unsteady flow physics of airfoil dynamic stall. AIAA J. 57 (1), 165–175.
Hämäläinen, P., Babadi, A., Ma, X. & Lehtinen, J. 2020 PPO-CMA: Proximal policy optimization with covariance matrix adaptation. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing, pp. 1–6. IEEE.
Hasegawa, K., Fukami, K., Murata, T. & Fukagata, K. 2020 Machine-learning-based reduced-order modeling for unsteady flows around bluff bodies of various shapes. Theor. Comput. Fluid Dyn. 34 (4), 367–383.
Hui, X., Bai, J., Wang, H. & Zhang, Y. 2020 Fast pressure distribution prediction of airfoils using deep learning. Aerosp. Sci. Technol. 105, 105949.
Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P. & Levine, S. 2021 How to train your robot with deep reinforcement learning: lessons we have learned. Intl J. Rob. Res. 40 (4–5), 698–721.
Jin, B., Illingworth, S.J. & Sandberg, R.D. 2022 Optimal sensor and actuator placement for feedback control of vortex shedding. J. Fluid Mech. 932, A2.
Joubert, G., Le Pape, A., Heine, B. & Huberson, S. 2013 Vortical interactions behind deployable vortex generator for airfoil static stall control. AIAA J. 51 (1), 240–252.
Kingma, D.P. & Ba, J. 2015 Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, Conference Track Proceedings.
Kneer, S., Sayadi, T., Sipp, D., Schmid, P. & Rigas, G. 2021 Symmetry-aware autoencoders: s-PCA and s-nlPCA. arXiv:2111.02893.
Kochkov, D., Smith, J.A., Alieva, A., Wang, Q., Brenner, M.P. & Hoyer, S. 2021 Machine learning-accelerated computational fluid dynamics. Proc. Natl Acad. Sci. USA 118 (21), e2101784118.
Koizumi, H., Tsutsumi, S. & Shima, E. 2018 Feedback control of Kármán vortex shedding from a cylinder using deep reinforcement learning. AIAA Paper 18-3691.
Lee, K. & Carlberg, K.T. 2020 Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. J. Comput. Phys. 404, 108973.
Li, J. & Zhang, M. 2022 Reinforcement-learning-based control of confined cylinder wakes with stability analyses. J. Fluid Mech. 932, A44.
Louizos, C., Welling, M. & Kingma, D.P. 2018 Learning sparse neural networks through $l_0$ regularization. In Sixth International Conference on Learning Representations.
Luhar, M., Sharma, A.S. & McKeon, B.J. 2014 Opposition control within the resolvent analysis framework. J. Fluid Mech. 749, 597–626.
Lusch, B., Kutz, J.N. & Brunton, S.L. 2018 Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9 (1), 4950.
Manohar, K., Kutz, J.N. & Brunton, S.L. 2021 Optimal sensor and actuator selection using balanced model reduction. IEEE Trans. Automat. Contr. 67 (4), 2108–2115.
Mao, Y., Zhong, S. & Yin, H. 2022 Active flow control using deep reinforcement learning with time delays in Markov decision process and autoregressive policy. Phys. Fluids 34 (5), 053602.
Mary, I. 1999 Méthode de Newton approchée pour le calcul d’écoulements instationnaires comportant des zones à très faibles nombres de Mach. PhD thesis, Paris 11.
Milano, M. & Koumoutsakos, P. 2002 Neural network modeling for near wall turbulent flow. J. Comput. Phys. 182 (1), 1–26.
Mohan, A.T. & Gaitonde, D.V. 2018 A deep learning based approach to reduced order modeling for turbulent flow control using LSTM neural networks. arXiv:1804.09269.
Natarajan, M., Freund, J.B. & Bodony, D.J. 2016 Actuator selection and placement for localized feedback flow control. J. Fluid Mech. 809, 775–792.
Oehler, S.F. & Illingworth, S.J. 2018 Sensor and actuator placement trade-offs for a linear model of spatially developing flows. J. Fluid Mech. 854, 34–55.
Otter, D.W., Medina, J.R. & Kalita, J.K. 2020 A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32 (2), 604–624.
Paris, R., Beneddine, S. & Dandois, J. 2021 Robust flow control and optimal sensor placement using deep reinforcement learning. J. Fluid Mech. 913, A25.
Pichi, F., Ballarin, F., Rozza, G. & Hesthaven, J.S. 2021 An artificial neural network approach to bifurcating phenomena in computational fluid dynamics. arXiv:2109.10765.
Rabault, J., Kuchta, M., Jensen, A., Réglade, U. & Cerardi, N. 2019 Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control. J. Fluid Mech. 865, 281–302.
Rabault, J. & Kuhnle, A. 2019 Accelerating deep reinforcement learning strategies of flow control through a multi-environment approach. Phys. Fluids 31 (9), 094105.
Rabault, J., Ren, F., Zhang, W., Tang, H. & Xu, H. 2020 Deep reinforcement learning in fluid mechanics: a promising method for both active flow control and shape optimization. J. Hydrodyn. 32 (2), 234–246.
Ren, F., Rabault, J. & Tang, H. 2021 Applying deep reinforcement learning to active flow control in weakly turbulent conditions. Phys. Fluids 33 (3), 037121.
Rogers, J. 2000 A parallel approach to optimum actuator selection with a genetic algorithm. AIAA Paper 2000-4484.
Roshko, A. 1953 On the development of turbulent wakes from vortex streets, report 1191. Tech. Rep. 2913. California Institute of Technology.
Sashittal, P. & Bodony, D.J. 2021 Data-driven sensor placement for fluid flows. Theor. Comput. Fluid Dyn. 35 (5), 709–729.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. 2017 Proximal policy optimization algorithms. arXiv:1707.06347.
Seidel, J., Fagley, C. & McLaughlin, T. 2018 Feedback flow control: a heuristic approach. AIAA J. 56 (10), 3825–3834.
Seifert, A., Bachar, T., Koss, D., Shepshelovich, M. & Wygnanski, I. 1993 Oscillatory blowing: a tool to delay boundary-layer separation. AIAA J. 31 (11), 2052–2060.
Seifert, A., Darabi, A. & Wyganski, I. 1996 Delay of airfoil stall by periodic excitation. J. Aircraft 33 (4), 691–698.
Seifert, A. & Pack, L.G. 1999 Oscillatory control of separation at high Reynolds numbers. AIAA J. 37 (9), 1062–1071.
Sekar, V., Zhang, M., Shu, C. & Khoo, B.C. 2019 Inverse design of airfoil using a deep convolutional neural network. AIAA J. 57 (3), 993–1003.
Seshagiri, A., Cooper, E. & Traub, L.W. 2009 Effects of vortex generators on an airfoil at low Reynolds numbers. J. Aircraft 46 (1), 116–122.
Shimomura, S., Ogawa, T., Sekimoto, S., Nonomura, T., Oyama, A., Fujii, K. & Nishida, H. 2017 Experimental analysis of closed-loop flow control around airfoil using DBD plasma actuator. In ASME 2017 Fluids Engineering Division Summer Meeting (ed. ASME), p. V01CT22A004. American Society of Mechanical Engineers.
Shimomura, S., Sekimoto, S., Oyama, A., Fujii, K. & Nishida, H. 2020 Closed-loop flow separation control using the deep Q network over airfoil. AIAA J. 58 (10), 4260–4270.
Sivashinsky, G.I. 1980 On flame propagation under conditions of stoichiometry. SIAM J. Appl. Maths 39 (1), 67–82.
Vinuesa, R. & Brunton, S.L. 2021 The potential of machine learning to enhance computational fluid dynamics. arXiv:2110.02085.
Vona, M. & Lauga, E. 2021 Stabilizing viscous extensional flows using reinforcement learning. Phys. Rev. E 104 (5), 055108.
Wang, J.-X., Wu, J.-L. & Xiao, H. 2017 Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Phys. Rev. Fluids 2 (3), 034603.
Wang, Y.-Z., Mei, Y.-F., Aubry, N., Chen, Z., Wu, P. & Wu, W.-T. 2022 Deep reinforcement learning based synthetic jet control on disturbed flow over airfoil. Phys. Fluids 34 (3), 033606.
Wang, Z., Xiao, D., Fang, F., Govindan, R., Pain, C.C. & Guo, Y. 2018 Model identification of reduced order fluid dynamics systems using deep learning. Intl J. Numer. Meth. Fluids 86 (4), 255–268.
Willcox, K. 2006 Unsteady flow sensing and estimation via the gappy proper orthogonal decomposition. Comput. Fluids 35 (2), 208–226.
Wong, J.C., Ooi, C., Gupta, A. & Ong, Y.-S. 2022 Learning in sinusoidal spaces with physics-informed neural networks. IEEE Trans. Artif. Intell.
Wu, J.-Z., Lu, X.-Y., Denny, A.G., Fan, M. & Wu, J.-M. 1998 Post-stall flow control on an airfoil by local unsteady forcing. J. Fluid Mech. 371, 21–58.
Yao, H., Sun, Y. & Hemati, M.S. 2022 Feedback control of transitional shear flows: sensor selection for performance recovery. Theor. Comput. Fluid Dyn. 36 (4), 597–626.
Yeh, C.-A. & Taira, K. 2019 Resolvent-analysis-based design of airfoil separation control. J. Fluid Mech. 867, 572–610.
Zaman, K.B.M.Q., McKinzie, D.J. & Rumsey, C.L. 1989 A natural low-frequency oscillation of the flow over an airfoil near stalling conditions. J. Fluid Mech. 202, 403–442.
Zhang, T., Wang, R., Wang, Y. & Wang, S. 2021 Locomotion control of a hybrid propulsion biomimetic underwater vehicle via deep reinforcement learning. In 2021 IEEE International Conference on Real-time Computing and Robotics (ed. M.K. O'Malley), pp. 211–216. IEEE.
Zhang, Y., Sung, W.J. & Mavris, D.N. 2018 Application of convolutional neural network to predict airfoil lift coefficient. AIAA Paper 2018-1903.