Hostname: page-component-586b7cd67f-t8hqh Total loading time: 0 Render date: 2024-11-23T17:22:14.687Z Has data issue: false hasContentIssue false

Data-driven science and machine learning methods in laser–plasma physics

Published online by Cambridge University Press:  30 May 2023

Andreas Döpp*
Affiliation:
Ludwig-Maximilians-Universität München, Garching, Germany Department of Physics, Clarendon Laboratory, University of Oxford, Oxford, UK
Christoph Eberle
Affiliation:
Ludwig-Maximilians-Universität München, Garching, Germany
Sunny Howard
Affiliation:
Ludwig-Maximilians-Universität München, Garching, Germany Department of Physics, Clarendon Laboratory, University of Oxford, Oxford, UK
Faran Irshad
Affiliation:
Ludwig-Maximilians-Universität München, Garching, Germany
Jinpu Lin
Affiliation:
Ludwig-Maximilians-Universität München, Garching, Germany
Matthew Streeter
Affiliation:
School for Mathematics and Physics, Queen’s University Belfast, Belfast, UK
*
Correspondence to: Andreas Döpp, Ludwig-Maximilians-Universität München, Am Coulombwall 1, 85748 Garching, Germany. Email: [email protected]

Abstract

Laser-plasma physics has developed rapidly over the past few decades as lasers have become both more powerful and more widely available. Early experimental and numerical research in this field was dominated by single-shot experiments with limited parameter exploration. However, recent technological improvements make it possible to gather data for hundreds or thousands of different settings in both experiments and simulations. This has sparked interest in using advanced techniques from mathematics, statistics and computer science to deal with, and benefit from, big data. At the same time, sophisticated modeling techniques also provide new ways for researchers to deal effectively with situation where still only sparse data are available. This paper aims to present an overview of relevant machine learning methods with focus on applicability to laser-plasma physics and its important sub-fields of laser-plasma acceleration and inertial confinement fusion.

Type
Review
Creative Commons
Creative Common License - CCCreative Common License - BYCreative Common License - NCCreative Common License - ND
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (https://creativecommons.org/licenses/by-nc-nd/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use and/or adaptation of the article.
Copyright
© The Author(s), 2023. Published by Cambridge University Press in association with Chinese Laser Press

1 Introduction

1.1 Laser–plasma physics

Over the past decades, the development of increasingly powerful laser systems[ Reference Danson, Hillier, Hopps and Neely1, Reference Danson, Haefner, Bromage, Butcher, Chanteloup, Chowdhury, Galvanauskas, Gizzi, Hein, Hillier, Hopps, Kato, Khazanov, Kodama, Korn, Li, Li, Limpert, Ma, Nam, Neely, Papadopoulos, Penman, Qian, Rocca, Shaykin, Siders, Spindloe, Szatmári, Trines, Zhu, Zhu and Zuegel2] has enabled the study of light–matter interaction across many regimes. Of particular interest is the interaction of intense laser pulses with plasma, which is characterized by strong nonlinearities that occur across many scales in space and time[ Reference Pukhov3, Reference Marklund and Shukla4]. These laser–plasma interactions are of interest both for fundamental physics research and as emerging technologies for potentially disruptive applications.

Regarding fundamental research, high-power lasers have, for instance, been used to study transitions from classical electrodynamics to quantum electrodynamics (QED) via the radiation reaction, where a particle’s backreaction to its radiation field manifests itself in an additional force[ Reference Cole, Behm, Gerstmayr, Blackburn, Wood, Baird, Duff, Harvey, Ilderton, Joglekar, Krushelnick, Kuschel, Marklund, McKenna, Murphy, Poder, Ridgers, Samarin, Sarri, Symes, Thomas, Warwick, Zepf, Najmudin and Mangles5 Reference Blackburn7]. Recent proposals to extend intensities to the Schwinger limit[ Reference Quéré and Vincenti8], where the electric field strength of the light is comparable to the Coulomb field, could allow the study of novel phenomena expected to occur due to a breakdown of perturbation theory. In an only slightly less extreme case, high-energy density physics (HEDP)[ Reference Drake9] research uses lasers for the production and study of states of matter that cannot be reached otherwise in terrestrial laboratories. This includes creating and investigating material under extreme pressures and temperatures, leading to exotic states such as warm–dense matter[ Reference Mahieu, Jourdain, Phuoc, Dorchies, Goddet, Lifschitz, Renaudin and Lecherbourg10 Reference Beier, Allison, Efthimion, Flippo, Gao, Hansen, Hill, Hollinger, Logantha, Musthafa, Nedbailo, Senthilkumaran, Shepherd, Shlyaptsev, Song, Wang, Dollar, Rocca and Hussein12].

Apart from the fundamental interest, there is also considerable interest in developing novel applications that are enabled by these laser–plasma interactions. Two particularly promising application areas have emerged over the past decades, namely the production of high-energy radiation beams (electrons, positrons, ions, X-rays, gamma-rays) and laser-driven fusion.

Laser–plasma acceleration (LPA) aims to accelerate charged particles to high energies over short distances by inducing charge separation in the plasma, for example, in the form of plasma waves to accelerate electrons or by stripping electrons from thin-foil targets to accelerate ions. The former scenario is called laser wakefield acceleration (LWFA)[ Reference Esarey, Schroeder and Leemans13 Reference Hooker15]. Here a high-power laser propagates through a tenuous plasma and drives a plasma wave, which can take the shape of a spherical ion cavity directly behind the laser. The fields within this so-called bubble typically reach around 100 GV/m, allowing LWFA to accelerate electrons from rest to GeV energies within centimeters[ Reference Leemans, Nagler, Gonsalves, Tóth, Nakamura, Geddes, Esarey, Schroeder and Hooker16 Reference Gonsalves, Nakamura, Daniels, Benedetti, Pieronek, de Raadt, Steinke, Bin, Bulanov, van Tilborg, Geddes, Schroeder, Tóth, Esarey, Swanson, Fan-Chiang, Bagdasarov, Bobrova, Gasilov, Korn, Sasorov and Leemans21]. While initial experiments were single-shot in nature and with significant shot-to-shot variations, the performance of LWFA has drastically increased in recent years. Particularly worth mentioning in this regard are the pioneering works on LWFA at the kHz-level repetition rate, starting with an early demonstration in 2013[ Reference He, Hou, Nees, Easter, Faure, Krushelnick and Thomas22] and in the following years mostly developed by Faure et al. [ Reference Faure, Gustas, Guénot, Vernier, Böhle, Ouillé, Haessler, Lopez-Martens and Lifschitz23], Rovige et al. [ Reference Rovige, Huijts, Andriyash, Vernier, Tomkus, Girdauskas, Raciukaitis, Dudutis, Stankevic, Gecys, Ouille, Cheng, Lopez-Martens and Faure24] and Salehi et al. [ Reference Salehi, Le, Railing, Kolesik and Milchberg25], as well as the day-long stable operation achieved by Maier et al. [ Reference Maier, Delbos, Eichner, Hübner, Jalas, Jeppe, Jolly, Kirchen, Leroux, Messner, Schnepp, Trunk, Walker, Werle and Winkler26]. As a result of these efforts, typical experimental datasets in publications have significantly increased in size. To give an example, first studies on the so-called beamloading effect in LWFA were done on sets of tens of shots[ Reference Rechatin, Faure, Davoine, Lundh, Lim, Ben-Ismaïl, Burgy, Tafzi, Lifschitz, Lefebvre and Malka27], whereas newer studies include hundreds of shots[ Reference Götzfried, Döpp, Gilljohann, Foerster, Ding, Schindler, Schilling, Buck, Veisz and Karsch28] or, most recently, thousands of shots[ Reference Kirchen, Jalas, Messner, Winkler, Eichner, Hübner, Hülsenbusch, Jeppe, Parikh, Schnepp and Maier29].

Laser-driven accelerators can also function as bright radiation sources via the processes of bremsstrahlung emission[ Reference Glinec, Faure, Le Dain, Darbon, Hosokai, Santos, Lefebvre, Rousseau, Burgy, Mercier and Malka30, Reference Döpp, Guillaume, Thaury, Lifschitz, Sylla, Goddet, Tafzi, Iaquanello, Lefrou, Rousseau, Conejero, Ruiz, Phuoc and Malka31], betatron radiation[ Reference Rousse, Phuoc, Shah, Pukhov, Lefebvre, Malka, Kiselev, Burgy, Rousseau, Umstadter and Hulin32, Reference Kneip, McGuffey, Martins, Martins, Bellei, Chvykov, Dollar, Fonseca, Huntington, Kalintchenko, Maksimchuk, Mangles, Matsuoka, Nagel, Palmer, Schreiber, Phuoc, Thomas, Yanovsky, Silva, Krushelnick and Najmudin33] and Compton scattering[ Reference Phuoc, Corde, Thaury, Malka, Tafzi, Goddet, Shah, Sebban and Rousse34 Reference Khrennikov, Wenz, Buck, Xu, Heigoldt, Veisz and Karsch36]. These sources have been used for a variety of proof-of-concept applications[ Reference Albert and Thomas37], ranging from spectroscopy studies of warm–dense matter[ Reference Mahieu, Jourdain, Phuoc, Dorchies, Goddet, Lifschitz, Renaudin and Lecherbourg10] and over imaging[ Reference Kneip, McGuffey, Dollar, Bloom, Chvykov, Kalintchenko, Krushelnick, Maksimchuk, Mangles, Matsuoka, Najmudin, Palmer, Schreiber, Schumaker, Thomas and Yanovsky38, Reference Fourmaux, Corde, Phuoc, Lassonde, Lebrun, Payeur, Martin, Sebban, Malka, Rousse and Kieffer39] to X-ray computed tomography (CT)[ Reference Wenz, Schleede, Khrennikov, Bech, Thibault, Heigoldt, Pfeiffer and Karsch40 Reference Guénot, Svendsen, Lehnert, Ulrich, Persson, Permogorov, Zigan, Wensing, Lundh and Berrocal43]. It has also recently been demonstrated that LWFA can produce electron beams with sufficiently high beam quality to drive free-electron lasers (FELs)[ Reference Wang, Feng, Ke, Yu, Xu, Qi, Chen, Qin, Zhang, Fang, Liu, Jiang, Wang, Wang, Yang, Wu, Leng, Liu, Li and Xu44, Reference Labat, Cabadağ, Ghaith, Irman, Berlioux, Berteaud, Blache, Bock, Bouvet, Briquez, Chang, Corde, Debus, De Oliveira, Duval, Dietrich, El Ajjouri, Eisenmann, Gautier, Gebhardt, Grams, Helbig, Herbeaux, Hubert, Kitegi, Kononenko, Kuntzsch, LaBerge, Lê, Leluan, Loulergue, Malka, Marteau, Guyen, Oumbarek-Espinos, Pausch, Pereira, Püschel, Ricaud, Rommeluere, Roussel, Rousseau, Schöbel, Sebdaoui, Steiniger, Tavakoli, Thaury, Ufer, Valléau, Vandenberghe, Vétéran, Schramm and Couprie45], offering a potential alternative driver for next generation light sources[ Reference Emma, van Tilborg, Albert, Labate, England, Gessner, Fiuza, Obst-Huebl, Zholents, Murokh and Rosenzweig46].

Laser-ion acceleration[ Reference Daido, Nishiuchi and Pirozhkov47, Reference Macchi, Borghesi and Passoni48] uses similar laser systems as LWFA, but typically operates with more tightly focused beams to reach even higher intensities. Here the goal is to separate a population of electrons from the ions and then use this electron cloud to strip ions from the target. The ions are accelerated to high energies by the fields that are generated by the charge separation process. This method has been used to accelerate ions to energies of a few tens of MeV/u in recent years. In an alternative scheme, radiation pressure acceleration[ Reference Robinson, Zepf, Kar, Evans and Bellei49, Reference Henig, Steinke, Schnürer, Sokollik, Hörlein, Kiefer, Jung, Schreiber, Hegelich, Yan, Meyer-ter-Vehn, Tajima, Nickles, Sandner and Habs50], the laser field is used to directly accelerate a target. Even though it uses the same or similar lasers, ion acceleration typically operates at much lower repetition rate because of its thin targets, which are not as easily replenished as gas-based plasma sources used in LWFA. Recent target design focuses on the mitigation of this issue, for instance using cryogenic jets[ Reference Fraga, Kalinin, Khnel, Hochhaus, Schottelius, Polz, Kaluza, Neumayer and Grisenti51 Reference Rehwald, Assenbaum, Bernert, Curry, Gauthier, Glenzer, Göde, Schoenwaelder, Schramm, Treffert and Zeil54] or liquid crystals[ Reference Schumacher, Poole, Willis, Cochran, Daskalova, Purcell and Heery55].

Another application of potentially high societal relevance is laser-driven inertial confinement fusion (ICF)[ Reference Moses56, Reference Betti and Hurricane57], where the aim is to induce fusion by heating matter to extremely high temperatures through laser–plasma interaction. As the name suggests, confinement is reached via the inertia of the plasma, which is orders of magnitude larger than the thermal energy. To achieve spatially homogenous heating that can penetrate deep into the fusion target, researchers commonly resort to driving the ignition process indirectly. In this method, light is focused into an empty cylindrical cavity, called a hohlraum, which is used to radiate a nearly isotropic blackbody spectrum that extends into the X-ray regime and is subsequently absorbed by the imploding capsule[ Reference Murakami and Meyer-ter-Vehn58]. In contrast to this, direct-drive methods aim to directly drive the thermonuclear fusion process[ Reference Craxton, Anderson, Boehly, Goncharov, Harding, Knauer, McCrory, McKenty, Meyerhofer, Myatt, Schmitt, Sethian, Short, Skupsky, Theobald, Kruer, Tanaka, Betti, Collins, Delettrez, Hu, Marozas, Maximov, Michel, Radha, Regan, Sangster, Seka, Solodov, Soures, Stoeckl and Zuegel59]. In this case, the laser is focused directly onto the fuel capsule. Direct-drive poses some challenges that are not present in indirect-drive schemes. For instance, the light has to penetrate through the high-density plasma shell surrounding the capsule and the illumination is less homogeneous, because of which the compressed target can be subject to large hydrodynamic instabilities. Advanced ignition schemes aim to separate compression of the thermonuclear fuel from triggering the ignition process. Examples are fast ignition[ Reference Kodama, Norreys, Mima, Dangor, Evans, Fujita, Kitagawa, Krushelnick, Miyakoshi, Miyanaga, Norimatsu, Rose, Shozaki, Shigemori, Sunahara, Tampo, Tanaka, Toyama, Yamanaka and Zepf60], which uses the high-intensity laser pulse to directly heat the compressed and dense fusion target, and shock ignition[ Reference Betti, Zhou, Anderson, Perkins, Theobald and Solodov61], which uses a shock wave to compress the target. Recently, the first ICF experiment at the National Ignition Facility (NIF) has reported reaching the burning-plasma state via the combination of indirect-drive with advanced target design[ Reference Zylstra, Hurricane, Callahan, Kritcher, Ralph, Robey, Ross, Young, Baker, Casey, Döppner, Divol, Hohenberger, Le Pape, Pak, Patel, Tommasini, Ali, Amendt, Atherton, Bachmann, Bailey, Benedetti, Berzak Hopkins, Betti, Bhandarkar, Biener, Bionta, Birge, Bond, Bradley, Braun, Briggs, Bruhn, Celliers, Chang, Chapman, Chen, Choate, Christopherson, Clark, Crippen, Dewald, Dittrich, Edwards, Farmer, Field, Fittinghoff, Frenje, Gaffney, Gatu Johnson, Glenzer, Grim, Haan, Hahn, Hall, Hammel, Harte, Hartouni, Heebner, Hernandez, Herrmann, Herrmann, Hinkel, Ho, Holder, Hsing, Huang, Humbird, Izumi, Jarrott, Jeet, Jones, Kerbel, Kerr, Khan, Kilkenny, Kim, Kleinrath, Kleinrath, Kong, Koning, Kroll, Kruse, Kustowski, Landen, Langer, Larson, Lemos, Lindl, Ma, MacDonald, MacGowan, Mackinnon, MacLaren, MacPhee, Marinak, Mariscal, Marley, Masse, Meaney, Meezan, Michel, Millot, Milovich, Moody, Moore, Morton, Murphy, Newman, Di Nicola, Nikroo, Nora, Patel, Pelz, Peterson, Ping, Pollock, Ratledge, Rice, Rinderknecht, Rosen, Rubery, Salmonson, Sater, Schiaffino, Schlossberg, Schneider, Schroeder, Scott, Sepke, Sequoia, Sherlock, Shin, Smalyuk, Spears, Springer, Stadermann, Stoupin, Strozzi, Suter, Thomas, Town, Tubman, Trosseille, Volegov, Weber, Widmann, Wild, Wilde, Van Wonterghem, Woods, Woodworth, Yamaguchi, Yang and Zimmerman62]. This breakthrough has re-enforced scientific and commercial interest in ICF, which is now also being pursued by a number of start-up companies.

1.2 Why data-driven techniques?

In recent years, data-driven methods and machine learning have been applied to a wide range of physics problems[ Reference Carleo, Cirac, Cranmer, Daudet, Schuld, Tishby, Vogt-Maranto and Zdeborová63], including for instance quantum physics[ Reference Dunjko and Briegel64, Reference Schütt, Chmiela, von Lilienfeld, Tkatchenko, Tsuda and Müller65], particle physics[ Reference Radovic, Williams, Rousseau, Kagan, Bonacorsi, Himmel, Aurisano, Terao and Wongjirad66], condensed matter physics[ Reference Bedolla, Padierna and Castaneda-Priego67], electron microscopy[ Reference Ede68] and fluid dynamics[ Reference Kutz69]. In comparison, its use in laser–plasma physics is still in its infancy and is curiously driven by both data abundance and data scarcity. Regarding the former, fast developments in both laser technology[ Reference Leemans70] and plasma targetry[ Reference Prencipe, Fuchs, Pascarelli, Schumacher, Stephens, Alexander, Briggs, Büscher, Cernaianu, Choukourov, De Marco, Erbe, Fassbender, Fiquet, Fitzsimmons, Gheorghiu, Hund, Huang, Harmand, Hartley, Irman, Kluge, Konopkova, Kraft, Kraus, Leca, Margarone, Metzkes, Nagai, Nazarov, Lutoslawski, Papp, Passoni, Pelka, Perin, Schulz, Smid, Spindloe, Steinke, Torchio, Vass, Wiste, Zaffino, Zeil, Tschentscher, Schramm and Cowa71 Reference Condamine, Jourdain, Hernandez, Taylor, Bohlin, Fajstavr, Jeong, Kumar, Laštovička, Renner and Weber74] nowadays permit the operation of laser–plasma experiments – in particular laser–plasma accelerators – at joule-level energies and multi-Hz to kHz repetition rates[ Reference Nakamura, Mao, Gonsalves, Vincenti, Mittelberger, Daniels, Magana, Toth and Leemans75 Reference Jourdain, Chaulagain, Havlík, Kramer, Kumar, Majerová, Tikhonchuk, Korn and Weber79]. The vast amounts of data generated by these experiments can be used to develop data-driven models, which are then employed in lieu of conventional theoretical or empirical approaches, or to augment them. In contrast, laser systems used for inertial fusion research produce MJ-level laser pulses and operate at repetition rates as low as one shot per day. With such a sparse number of independent experimental runs, data-driven models are used to extract as much information as possible from the existing data or combine them with other information sources, such as simulations.

The success of all of the above applications depends on the precise control of a complex nonlinear system. In order to optimize and control the process of laser–plasma interaction, it is essential to understand the underlying physics and to be able to model the complex plasma response to an applied laser pulse. However, this is complicated by the fact that it is a strongly nonlinear, multi-scale, multi-physics phenomenon. While analytical and numerical models have been essential tools for understanding laser–plasma interactions, they have several limitations. Firstly, analytical models are often limited to low-order approximations and therefore cannot accurately predict the behavior of complex laser–plasma systems. Secondly, accurate numerical simulations require immense computational resources, often millions of core hours, which limits their use for optimizing and controlling real-world laser–plasma experiments. In addition, the huge range of temporal and spatial scales means that in practice many physical processes (ionization, particle collisions, etc.) can only be treated approximately in large-scale numerical simulations. Because of this, one active area of research is to automatically extract knowledge from data in order to build faster computational models that can be used for prediction, optimization, classification and other tasks.

Another important problem is that the diagnostics employed in laser–plasma physics experiments typically only provide incomplete and insufficient information about the interaction, and key properties must be inferred from the limited set of available observables. Such inverse problems can be hard to solve, especially when some information is lost in the measurement process and the problem becomes ill-posed. Modern methods, such as compressed sensing (CS) and deep learning, are strong candidates to facilitate the solution of such problems and thus retrieve so-far elusive information from experiments.

The goal of this review is to summarize the rapid, recent developments of data-driven science and machine learning in laser–plasma physics, with particular emphasis on LPA, and to provide guidance for novices regarding the tools available for specific applications. We would like to start with a disclaimer that the lines between machine learning and other methods are often blurred.Footnote 1 Given the multitude of often competing subdivisions, we have chosen to organize this review based on a few broad classes of problems that we believe to have highest relevance for laser–plasma physics and its applications. These problems are modeling and prediction (Section 2), inverse problems (Section 3), optimization (Section 4), unsupervised learning for data analysis (Section 5) and, lastly, image analysis with supervised learning techniques (Section 6). Partial overlap between these applications and the tools used is unavoidable and is where possible indicated via cross-references. This is particularly true in the case of neural networks, which have found a broad usage across applications. Each section includes introductions to the most common techniques that address the problems outlined above. We explicitly include what can be considered as ‘classical’ data-driven techniques in order to provide a better context for more recent methods. Furthermore, examples for implementations of specific techniques in laser–plasma physics and related fields are highlighted in separate text boxes. We hope that these will help the reader to get a better idea of which methods might be most adequate for their own research. An overview of the basic application areas is shown in Figure 1.

Figure 1 Overview of some of the machine learning applications discussed in this manuscript. (a) General configuration of laser–plasma interaction setups, applicable to both experiments and simulations. The system will have a number of input parameters of the laser and target. Some of these are known and actively controlled (e.g., laser energy, plasma density), some are monitored and others are unknown and essentially contribute as noise to the observations. Predictive models take the known input parameters $x$ and use some models to predict the output $y$ . These models are discussed in Section 2.1 and some of them are sketched in (b). Inversely, in some cases one will want to derive the initial conditions from the output. These inverse problems are discussed in Section 3. In other cases one might be interested in a temporal evolution, discussed in Section 2.2. The output from observations or models can be used to optimize certain objectives, which can then be fed back to the control system to adjust the input parameters (see Section 4). Observations may also require further processing, for example, the image processing in (c) to detect patterns or objects. Note that sub-figure (a) is for illustrative purposes only and based on synthetic data.

To maintain brevity and readability, some generalizations and simplifications are made. For detailed descriptions and strict definitions of methods, the reader may kindly refer to the references given throughout the text. Furthermore, we would like to draw the reader’s attention to some recent reviews on the application of machine learning techniques in the related fields of plasma physics[ Reference Spears, Brase, Bremer, Chen, Field, Gaffney, Kruse, Langer, Lewis, Nora, Peterson, Thiagarajan, Van Essen and Humbird80 Reference Kambara, Kawaguchi, Lee, Ikuse, Hamaguchi, Ohmori and Ishikawa82], ultra-fast optics[ Reference Genty, Salmela, Dudley, Brunner, Kokhanovskiy, Kobtsev and Turitsyn83] and HEDP[ Reference Hatfield, Gaffney, Anderson, Ali, Antonelli, du Pree, Citrin, Fajardo, Knapp, Kettle, Kustowski, MacDonald, Mariscal, Martin, Nagayama, Palmer, Peterson, Rose, Ruby, Shneider, Streeter, Trickey and Williams84].

2 Modeling and prediction

Many real-life and simulated systems are expensive to evaluate in terms of time, money or other limited resources. This is particularly true for laser–plasma physics, which either hinges on the limited access to high-power laser facilities or requires high-performance computing to accurately model the ultra-fast laser–plasma interaction. It is therefore desirable to find models of the system, sometimes called digital twins[ Reference Rasheed, San and Kvamsdal85], which are comparatively cheap to evaluate and whose predictions can be used for extensive analyses. In engineering and especially in the context of model-based optimization (see Section 4), such lightweight models are often referred to as surrogate models. Reduced-order models feature fewer degrees of freedom than the original system, which is often achieved using methods of dimensionality reduction (see Section 5.3).

The general challenge in modeling is to find a good approximation of ${f}^{\ast }(x)$ for the real system $f(x)$ based on only a limited number of examples $f\left({x}_n\right) = {y}_n$ , the training data. Here ${x}_n$ is an $n$ -sized set of vector-valued input parameters and ${y}_n$ is the corresponding output. To complicate things further, any real measurement will be subject to noise, so ${y}_n$ has to be interpreted as a combination of the true value and some random noise contribution. Another complication arises from having imperfect or unrepeatable controllers for the input parameters ${x}_n.$ This can result in having different output values for the same set of input parameters.

The predictive models discussed in Section 2.1 below are mostly used to interpolate between training data, whereas the related problem of forecasting (Section 2.2) explicitly deals with the issue of extrapolating from existing data to new, so-far unseen data. In Section 2.3 we briefly discuss how models can be used to provide direct feedback for laser–plasma experiments.

2.1 Predictive models

In this section, we describe some of the most common ways to create predictive models, starting with the ‘classic’ approaches of spline interpolation and polynomial regression, before discussing some modern machine learning techniques.

2.1.1 Spline interpolation

The simplest way of constructing a model for a system, be it a real-world system or some complex simulation, is to use a set of $n$ training points and to predict that every unknown position will have the same value as the nearest neighboring training point (see Figure 2(a)). A straightforward extension with slightly higher computational requirements is to connect the training points with straight lines, resulting in a piecewise linear function. Both methods, however, are not continuously differentiable, and for instance less suited for the integration of the model into an optimization process (see Section 4). A more advanced approach to interpolating the training points is to use splines (see Figure 2(b)), which require the piecewise-interpolated functions to agree up to a certain derivative order. For instance, cubic splines are continuously differentiable up to the second derivative. While higher-order spline interpolation works well in 1D cases, it becomes increasingly difficult in multi-dimensional settings. Furthermore, the interpolation approach does not allow for incorporating uncertainty or stochasticity, which is present in real-world measurements. Therefore, it will treat noise components as a part of the system, including, for instance, outliers.

Figure 2 Illustration of standard approaches to making predictive models in machine learning. The data were sampled from the function $y = x\left(1+\sin {x}^2\right)+\epsilon$ with random Gaussian noise, $\epsilon$ , for which $\left\langle {\epsilon}^2\right\rangle = 1$ . The data have been fitted by (a) nearest neighbor interpolation, (b) cubic spline interpolation, (c) linear regression of a third-order polynomial and (d) Gaussian process regression.

2.1.2 Regression

In some specific cases the shortcomings of interpolation approaches can be addressed by using regression models. For instance, simple systems can often be described using a polynomial model, where the coefficients of the polynomial are determined via a least-squares fit (see Figure 2(c)), that is, minimizing the sum of the squares of the difference between the predicted and observed data. The results are generally improved by including more terms in the polynomial but this can lead to overfitting – a situation where the model describes the noise in the data better than the actual trends, and consequently will not generalize well to unseen data. Regression is not restricted to polynomials, but may use all kinds of mathematical models, often motivated by a known underlying physics model. Crucially, any regression model requires the prior definition of a function to be fitted, thus posing constraints on the relationships that can be modeled. In practice this is one of the main problems with this approach, as complex systems can scarcely be described using simple analytical models. Before using these models, it is thus important to verify their validity, for example, using a measure for the goodness of fit such as the correlation coefficient ${R}^2$ , the ${\chi}^2$ test or the mean-squared error (MSE).

2.1.3 Probabilistic models

The field of probabilistic modeling relies on the assumption that the relation between the observed data and the underlying system is stochastic in nature. This means that the observed data are assumed to be drawn from a probability distribution for each set of input parameters to a generative model. Inversely, one can use statistical methods to infer the parameters that best explain the observed data. We will discuss such inference problems in more detail in the context of solving (ill-posed) inverse problems in Section 3.2.

Probabilistic models can generally be divided into two methodologies, frequentist and Bayesian. At the heart of the frequentist approach lies the concept of likelihood, $p\left(y|\theta \right)$ , which is the probability of some outcomes or observations $y$ given the model parameters $\theta$ (see, e.g., Ref. [Reference Anderson and Burnham86] for an introduction). A model is fitted by maximum likelihood estimation (MLE), which means finding the model parameters $\widehat{\theta}$ that maximize the likelihood, $p\left(y|\widehat{\theta}\right)$ . When observations $y = \left({y}_1,{y}_2,\dots \right)$ are independent, the probability of observing $y$ is the product of the probabilities of each observation, $p(y) = {\prod}_{i = 1}^np\left({y}_i\right)$ . As sums are generally easier to handle than products, this is often expressed in terms of the log-likelihood function that is given by the sum of the logarithms of each observation’s probability, that is

(1) $$\begin{align}\log p\left(y|\theta \right) = \sum \limits_{i = 1}^n\log p\left({y}_i|\theta \right).\end{align}$$

Probabilities have values between 0 and 1, so the log-likelihood is always negative and, the logarithm being a strictly monotonic function, minimizing the log-likelihood maximizes the likelihood:

(2) $$\begin{align}{\widehat{\theta}}_{\mathrm{MLE}} = \underset{\theta }{\arg \max}\left\{p\left(y|\theta \right)\right\} = \underset{\theta }{\arg \min}\left\{\log p\left(y|\theta \right)\right\}.\end{align}$$

The optimum can be found using a variety of optimization methods, for example, gradient descent, which are described in more detail in Section 4. A simple example is the use of MLE for parameter estimation in regression problems. In the case in which the error $\left( Ax-y\right)$ is normally distributed, this turns out to be equivalent to the least-squares method we discussed in the previous section.

The MLE is often seen as the simplest and most practical approach to probabilistic modeling. One advantage is that it does not require any a priori assumptions about the model parameters, rather only about the probability distributions of the data. However, this can also be a drawback if useful prior knowledge of the model parameters is available. In this case one would turn to Bayesian statistics. Here one assesses information about the probability that a hypothesis about the parameter values $\theta$ is correct given a set of data $x$ . In this context the probabilistic model is viewed as a collection of conditional probability densities $p\left(\theta |y\right)$ for each set of observed data $y$ , with the aim of finding the posterior distribution  $p\left(\theta |y\right)$ , that is, the probability of some parameters given the data observations. This can be done by applying Bayes’ rule, as follows:

(3) $$\begin{align}p\left(\theta |y\right) = \frac{p\left(y|\theta \right)p\left(\theta \right)}{p(y)},\end{align}$$

where $p\left(y|\theta \right)$ is known from above as the likelihood function and $p\left(\theta \right)$ denotes the prior distribution, that is, our a priori knowledge about the parameters before we observe any data $y$ . The denominator, $p(y) = \int p\left(y|{\theta}^{\prime}\right)\cdot p\left({\theta}^{\prime}\right)\mathrm{d}{\theta}^{\prime }$ , is called the evidence and ensures that both sides of Bayes’ rule are properly normalized by integrating over all possible input parameters $\theta$ . Once the posterior distribution is known, we can maximize it and get the maximum a posteriori (MAP) estimate:

(4) $$\begin{align}{\widehat{\theta}}_{\mathrm{MAP}} = \underset{\theta }{\arg \max}\left\{p\left(\theta |y\right)\right\}.\end{align}$$

As mentioned above, a particular strength of the Bayesian approach is that we can encode a priori information in the prior distribution. Taking the example of polynomial regression, we could, for instance, set a priori distribution $p\left(\theta \right)$ for the regression coefficients $\theta$ that favors small coefficients, thus penalizing high-order polynomials. Another advantage of using the Bayesian framework is that one can quantify the uncertainty in the model result. This is particularly simple to compute in the special case of Gaussian distributions and their generalization, Gaussian processes (GPs), which we are going to discuss in the next section.

2.1.4 Gaussian process regression

A popular version of Bayesian probabilistic modeling is GP regression[ Reference MacKay87, Reference Rasmussen88]. This kind of modeling was pioneered in geostatistics in the context of mining exploration and is historically also referred to as kriging, after the South African engineer Danie G. Krige, who invented the method in the 1950s. Conceptually, one can think of it as loosely related to the spline interpolation method, as it is also locally ‘interpolates’ the training points. Compared to splines and conventional regression methods, kriging has a number of advantages. Being a regression method, kriging can deal with noisy training points, as seen in experimental data. At the same time, the use of GPs involves minimal assumptions and can in principle model any kind of function. Lastly, the ‘interpolation’ is done in a probabilistic way, that is, a probability distribution is assigned to the function values at unknown positions (see Figure 2(d)). This allows for quantifying the uncertainty of the prediction. These features make kriging an attractive method for the construction of surrogate models for complex systems for which only a limited number of the function evaluations is possible, for example, due to the long runtime of the system or the high costs of the function evaluations. GP regression forms the backbone of most implementations of Bayesian optimization (BO), which we will discuss in Section 4.5, including examples for potential use cases.

Mathematically speaking, a GP is an infinite collection of normal random variables, where each finite subset is jointly normally distributed. The mean vector and covariance matrix of the multivariate normal distribution are thereby generalized to the mean function $\mu (x)$ and the covariance function $\sigma \left(x,{x}^{\prime}\right)$ , respectively, where we use the short-hand notation $x = \left({x}_1,{x}_2,\dots \right)$ to denote the function inputs as a vector of real numbers.

A GP can be written in the following form:

(5) $$\begin{align}f(x)\sim \mathcal{GP}\left(\mu (x),\sigma \left(x,{x}^{\prime}\right)\right),\end{align}$$

denoting that the random function $f(x)$ follows a GP with mean function $\mu (x)$ and covariance function $\sigma \left(x,{x}^{\prime}\right)$ .

The mean function $\mu (x)$ is defined as the expectation of the GP, that is, $\mu (x) = \left\langle f(x)\right\rangle$ , whereas the covariance function is defined as $\sigma \left(x,{x}^{\prime}\right) = \left\langle f(x)-\mu (x)\left(f\left({x}^{\prime}\right)-\mu \left({x}^{\prime}\right)\right)\right\rangle$ . Note that in the special case of the constant mean function $\mu (x) = 0$ and a constant diagonal covariance function $\sigma \left(x,{x}^{\prime}\right) = {\sigma}^2\delta \left(x-{x}^{\prime}\right)$ , the GP simply reduces to a set of a normal random variable with zero mean and variance ${\sigma}^2$ commonly referred to as white noise. The covariance function is also referred to as the kernel. Its value at locations $x$ and ${x}^{\prime }$ is proportional to the correlation between the function values $f(x)$ and $f\left({x}^{\prime}\right)$ . A common choice for the covariance function is the squared exponential function, also known as the radial basis function (RBF) kernel:

(6) $$\begin{align}\sigma \left(x,{x}^{\prime}\right) = \exp \left(-\frac{{\left(x-{x}^{\prime}\right)}^2}{2{\mathrm{\ell}}^2}\right),\end{align}$$

where $\mathrm{\ell}$ is the length scale parameter, which controls the rate at which the correlation between $f(x)$ and $f\left({x}^{\prime}\right)$ decays. This kernel hyperparameter can usually be optimized by using the training data to minimize the log marginal likelihood.

It is possible to encode prior knowledge by choosing a specialized kernel that imposes certain restrictions on the model. A variety of such kernels exist. For instance, the periodic kernel (also known as exp-sine-square kernel) given by the following:

(7) $$\begin{align}\sigma \left(x,{x}^{\prime}\right) = \exp \left(-\frac{2\sin^2\left(\pi d\left(x,{x}^{\prime}\right)/\lambda \right)}{{\mathrm{\ell}}^2}\right),\end{align}$$

with the Euclidean distance $d\left(x,{x}^{\prime}\right)$ , length scale $\mathrm{\ell}$ and periodicity $\lambda$ as free hyperparameters, is particularly suitable to model systems that show an oscillatory behavior.

A useful property of kernels is the generation of new kernels through the addition or multiplication of existing kernels[ Reference Genton and Mach89]. This property provides another way to leverage prior information about the form of the function to increase the predictive accuracy of the model. For instance, measurement errors incorporated into the model by adding a Gaussian white noise kernel, as for instance done in Figure 2.

Figure 3 visualizes the covariance functions and their influence on the result of GP regression. This includes correlation matrices for the three covariance functions discussed above (white noise, RBF and periodic kernels). Once the mean and the covariance functions are fully defined, we can use training points for regression, that is, fit the GP and kernel parameters to the data and obtain the posterior distribution (see the right-hand panel of Figure 3).

Figure 3 Gaussian process regression: illustration of different covariance functions, prior distributions and (fitted) posterior distributions. Left: correlation matrices between two values $x$ and ${x}^{\prime }$ using different covariance functions (white noise, radial basis function and periodic). Center: samples of the prior distribution defined by the prior mean $\mu (x) = 0$ and the indicated covariance functions. Note that the sampled functions are depicted with increasing transparency for visual clarity. Right: posterior distribution given observation points sampled from $y = {\cos}^2x+\epsilon$ , where $\epsilon$ is random Gaussian noise with ${\sigma}_{\epsilon} = 0.1$ . Note how the variance between observations increases when no noise term is included in the kernel (top row). Within the observation window the fitted kernels show little difference, but outside of it the RBF kernel decays to the mean $\mu = 0$ dependent on the length scale $\mathrm{\ell}$ . This can be avoided if there exists prior knowledge about the data that can be encoded in the covariance function, in this case periodicity, as can be seen in the regression using a periodic kernel.

Figure 4 Sketch of a random forest, an architecture for regression or classification consisting of multiple decision trees, whose individual predictions are combined into an ensemble prediction, for example, via majority voting or averaging.

2.1.5 Decision trees and forests

A decision tree is a general-purpose machine learning method that learns a tree-like model of decision rules from the data to make predictions[ Reference Breiman, Friedman, Olshen and Stone90]. It works by splitting the dataset recursively into smaller groups, called nodes. Each node represents a decision point, and the tree branches out from the node according to the decisions that are made. The leaves of the tree represent the final prediction; see Figure 4 for an illustration. This can be either a categorical value in classification tasks (see Section 6) or a numerical value prediction in regression tasks. An advantage of this approach is that it can learn nonlinear relationships in data without having to specify them.

To generate a decision tree one starts at the root of the tree and determines a decision node such that it optimizes an underlying decision metric, such as the MSE in regression settings or entropy and information gain in a classification setting. At each decision point the dataset is split and subsequently the metric is re-evaluated for the resulting groups, generating the next layer of decision nodes. This process is repeated until the leaves are reached. The more decision layers are used, called the depth of the tree, the more complex relationships can be modeled.

Decision trees are easy to implement and can provide accurate predictions even for large datasets. However, with an increasing number of decision layers they may become computationally expensive and may overfit the data, the latter being in particular a problem with noisy data. One method to address overfitting is called pruning, where branches from the tree that do not improve the performance are removed. Another effective method is to use decision-tree-based ensemble algorithms instead of a single decision tree.

Figure 5 Example of gradient boosting with decision trees. Firstly, a decision tree ${g}_1$ is fitted to the data. In the next step, the residual difference between training data and the prediction of this tree is calculated and used to fit a second decision tree ${g}_2$ . This process is repeated $n$ times, with each new tree ${g}_n$ learning to correct only the remaining difference to the training data. Data in this example are sampled from the same function as in Figure 2 and each tree has a maximum depth of two decision layers.

One example of such an algorithm is the random forest, an ensemble algorithm that uses bootstrap aggregating or bagging to fit trees to random subsets of the data and the predictions of individually fitted decision trees are combined by majority vote or average to obtain a more accurate prediction. Another type of ensemble algorithms is boosting, where the trees are trained sequentially and each tries to correct its predecessor. A popular implementation is AdaBoost[ Reference Freund, Schapire and Jpn91], where the weights of the samples are changed according to the success of the predictions of the previous trees. Gradient boosting methods[ Reference Friedman92, Reference Friedman93] also use the concept of sequentially adding predictors, but while AdaBoost adjusts weights according to the residuals of each prediction, gradient boosting methods fit new residual trees to the remaining differences at each step (see Figure 5 for an example).

Compared to other machine learning algorithms, a great advantage of a decision tree is its explicitness in data analysis, especially in nonlinear high-dimensional problems. By splitting the dataset into branches, the decision tree naturally reveals the importance of each variable regarding the decision metric. This is in contrast to ‘black-box’ models, such as those created by neural networks, which must be interpreted by post hoc analysis.

Decision trees can also be used to seed neural networks by giving the initial weight parameters to train a deep neural network. Examples of using a decision tree as an initializer are the deep jointly-informed neural networks (DJINNs) developed by Humbird et al. [ Reference Humbird, Peterson and McClarren94], which have been widely applied in the high-power laser community, especially in analyzing ICF datasets. The algorithm first constructs a tree or a random forest with the tree depth set as a tunable hyperparameter. It then maps the tree to a neural network, or maps the forest to an ensemble of networks. The structure of the network (number of neurons and hidden layer, initial weights, etc.) reflects the structure of the tree. The neural network is then trained using back-propagation. The use of decision trees for initialization largely reduces the computational cost while maintaining comparable performance to optimized neural network architectures. The DJINN algorithm has been applied to several classification and regression tasks in high-power laser experiments, such as ICF[ Reference Humbird, Peterson, Spears and McClarren95 Reference Kluth, Humbird, Spears, Peterson, Scott, Patel, Koning, Marinak, Divol and Young97] and LWFA[ Reference Lin, Qian, Murphy, Hsu, Hero, Ma, Thomas and Krushelnick98].

2.1.6 Neural networks

Neural networks offer a versatile framework for fitting of arbitrary functions by training of a network of connected nodes or neurons, which are loosely inspired by biological neurons. In the case of a fully connected neural network, historically called a multilayer perceptron, the inputs to one node are the outputs of all of the nodes from the preceding layer. The very first layer is simply all of the inputs for the function to be modeled, and the very last layer is the function outputs. One of the very attractive properties of neural networks is the capacity to have many outputs and inputs, each of which can be multi-dimensional. The weighting of each connection ${w}_i$ and the bias for each node $b$ are the parameters of the network, which must be trained to provide the best approximation of the function to be modeled. Each node gets activated depending on both weights and biases according to a pre-defined activation function. The nonlinear properties of activation functions are widely seen as the key properties to allow neural networks to model arbitrary functions and achieve general learning properties.

One of the simplest, but also most common, activation functions is the rectified linear unit (ReLU), which is defined as follows:

(8) $$\begin{align}\mathrm{ReLU}(x) = \left\{\!\!\begin{array}{ll}x,& \mathrm{if}\;x\ge 0,\\ {}0,& \mathrm{otherwise}.\end{array}\right.\end{align}$$

The function argument $x = \sum {x}_i{w}_i+b$ is the sum of the weights of incoming connections, multiplied with their values ${x}_i$ , and the bias of the node. The ReLU function is easy to compute, and it has the main advantage that it is linear for $x>0$ , which greatly simplifies the gradient calculation in training (see the next paragraph). Drawbacks are that it is not differentiable at $x = 0$ , it is unbounded, and since it yields a constant value below $0$ , it has the potential to produce ‘dead’ neurons. The last issue is solved in the so-called leaky ReLU, which uses a reduced gradient for $x<0$ , giving an output of $\alpha x$ where $0<\alpha <1$ . Other common activation functions include the logistic or sigmoid function, $\kern0.1em \mathrm{sig}\kern0.1em (x) = {(1+{e}^{-x})}^{-1}$ , and the hyperbolic tangent function, $\tanh (x) = ({e}^x-{e}^{-x})/({e}^x+{e}^{-x})$ .

The training is performed iteratively by passing corresponding input–output pairs to the network and comparing the true outputs to those given by the network to calculate the loss function. This is often chosen to be the MSE between the model output and the reference from the training data, but many other types of loss functions are also used (see Section 4.1.1) and the choice of loss function strongly affects the model’s training. The loss is then used to modify the weights and biases via an algorithm known as ‘back-propagation’[ Reference Wythoff99]. Here the gradient of the loss function is calculated with respect to the weights and biases via the chain rule. One can then optimize the parameters using, for instance, stochastic gradient descent (SGD) or Adam (adaptive moment estimation)[ Reference Kingma and Ba100].

A number of hyperparameters are used to control the training process. Typical ones include the following: the number of epochs – the number of times the model sees each data point; the batch size – the number of data points the model sees before updating its gradients; and the learning rate – a factor determining the magnitude of the update to the gradient. Each factor can have a critical effect on the training of the model.

Figure 6 Simplified sketch of some popular neural network architectures. The simplest possible neural network is the perceptron, which consists of an input, which is fed into the neuron that processes the input based on the weights, an individual bias and its activation function. Multiple such layers can be stacked within so-called hidden layers, resulting in the popular multilayer perceptron (or fully connected network). Besides the direct connection between subsequent layers, there are also special connections common in many modern neural network architectures. Examples are the recurrent connection (which feeds the output of the current layer back into the input of the current layer), the convolutional connection (which replaces the direct connection between two layers by the convolutional operation) and the residual connection (which adds the input to the output of the current layer; note that the above illustration is simplified and the layers should be equal in size).

In the training process, the weights and biases are optimized in order to best approximate the function for converting the inputs into the corresponding outputs. The resulting model will yield an approximation for the unknown function $f(x)$ . However, learning via this method only optimizes how well the model reproduces the training data and, in the worst case, the network essentially ‘memorizes’ the training data, and learns little about the desired function $f(x)$ . To quantify this, a subset of the data is kept separate and used solely for validating the model. Ideally, the loss should be similar on both training and validation data, and if the model has a significantly smaller loss on the training data than on the validation set it is said to overfit, which signifies that the model has learnt the random noise of the training set. The validation loss thus quantifies how well this model generalizes and, with it, how useful it is for predictions on unseen data.

Common methods to combat overfitting include early stopping, that is, terminating the training process once the validation loss stagnates; or the use of dropout layers to randomly switch off some fraction of the network connections for each batch of the training process, thereby preventing the network from learning random noise by reducing its capacity. A further approach is to incorporate variational layers into the network. In these layers, pairs of nodes are used that represent the mean $\mu$ and standard deviation $\sigma$ of a Gaussian distribution. During training, output values are sampled from this distribution and then passed on the subsequent layers, thereby requiring the network to react smoothly to these small random variations to achieve a small training loss. Both dropout layers and variational layers provide regularization that smooths the network response and prevents single nodes from dominating, resulting in improved interpolation performance and better validation loss[ Reference Kristiadi, Hein and Hennig101, Reference Wager, Wang and Liang102] (see Section 3.3 for a brief general explanation of regularization).

Figure 7 Real-world example of a multilayer perceptron for beam parameter prediction. (a) The network layout[ Reference Kirchen, Jalas, Messner, Winkler, Eichner, Hübner, Hülsenbusch, Jeppe, Parikh, Schnepp and Maier29] consists of 15 input neurons, two hidden layers with 30 neurons and three output neurons (charge, mean energy and energy spread). The input is derived from parasitic laser diagnostics (laser pulse energy ${E}_{\mathrm{laser}}$ , central wavelength ${\lambda}_0$ and spectral bandwidth $\Delta \lambda$ , longitudinal focus position ${z}_{\mathrm{foc}}$ and Zernike coefficients of the wavefront). Neurons use a nonlinear ReLU activation and 20% of neurons drop out for regularization during training. The (normalized) predictions are compared to the training data to evaluate the accuracy of the model, in this case using the mean absolute ${\mathrm{\ell}}_1$ error as the loss function. In training, the gradient of the loss function is then propagated back through the network to adjust its weights and biases. (b) Measured and predicted median energy ( $\overline{E}$ ) and (c) measured and predicted energy spread ( $\Delta$ E), both for a series of 50 consecutive shots. Sub-figures (b) and (c) are adapted from Ref. [Reference Kirchen, Jalas, Messner, Winkler, Eichner, Hübner, Hülsenbusch, Jeppe, Parikh, Schnepp and Maier29].

Beyond the classical multilayer perceptron network, there exists a plethora of different neural network architectures, as partially illustrated in Figure 6, which are suited to different tasks. In particular, convolutions layers, which extract relevant features from one and two dimensions by learning suitable convolution matrices, are commonly used in the analysis of physical signals and images. Architecture selection and hyperparameter tuning are central challenges in the implementation of neural networks, and are often performed by an additional machine learning algorithm, for example, using BO (see Section 4.5).

The great strength of neural networks is their flexibility and relatively straightforward implementation, with many openly accessible software platforms available to choose from. However, trained networks are effectively ‘black-box’ functions, and do not, in their basic form, incorporate uncertainty quantification. As a result, the networks may make over-confident predictions about unseen data while giving no explanation for those predictions, leading to false conclusions. Various methods exist for incorporating uncertainty quantification into neural networks (see, for example, Ref. [Reference Gawlikowski, Tassi, Ali, Lee, Humt, Feng, Kruspe, Triebel, Jung, Roscher, Shahzad, Yang, Bamler and Zhu103]), such as by including variational layers (discussed above) and training an ensemble of networks on different training data subsets. There are several approaches to try and make neural network models explainable and a review of methods for network interpretation is given, for instance, by Montavon et al. [ Reference Montavon, Samek and Müller104].

Another strength of neural networks is that the performance of a model on a new task can be improved by leveraging knowledge from a pre-trained model on a related task. This so-called transfer learning is typically used in scenarios where it is difficult or expensive to train a model from scratch on a new task, or when there is a limited amount of training data available. For example, transfer learning has been used to successfully train models for image classification and object detection tasks (see Section 6), using pre-trained models that have been trained on large image datasets such as ImageNet[ Reference Huh, Agrawal and Efros105]. In the context of laser–plasma physics, transfer learning could be used to improve the performance of a neural network by leveraging knowledge from a pre-trained model that has been trained on data from previous experiments or simulations.

Kirchen et al. [ Reference Kirchen, Jalas, Messner, Winkler, Eichner, Hübner, Hülsenbusch, Jeppe, Parikh, Schnepp and Maier29] recently demonstrated the utility of even seemingly small neural networks for modeling laser–plasma accelerator performance. Their multilayer perceptron design is shown in Figure 7 and, as can been seen in Figures 7(b) and 7(c), the network accurately predicts electron beam energy and energy spread. Interestingly, this performance is achieved without using target parameters such as plasma density as an input, thus highlighting the relative importance of laser stabilization in this context. Gonoskov et al. [ Reference Gonoskov, Wallin, Polovinkin and Meyerov106] trained a neural network to read features in theoretical and experimental spectra from high-order-harmonic generation (HHG) in high-power laser–plasma interactions. Rodimkov et al. [ Reference Rodimkov, Bhadoria, Volokitin, Efimenko, Polovinkin, Blackburn, Marklund, Gonoskov and Meyerov107] used a neural network to extract information that was not directly measured in experiments, including the preplasma scale length and the pulse carrier-envelope phase, from the spectrum of extreme ultraviolet (XUV) radiation generated in laser–plasma interactions. Another recent example of deep learning modeling is the work by Djordjević et al. [ Reference Djordjević, Kemp, Kim, Simpson, Wilks, Ma and Mariscal108], where the authors used a multilayer perceptron to model the output of a laser-driven ion accelerator based on training with 1000 simulations. In the work of Watt[ Reference Watt109], a strong-field QED model incorporating a trained neural network was used to provide an additional computation package for the Geant4 particle physics platform. Neural networks are also trained to assist hohlraum design for ICF experiments by predicting the time evolution of the radiation temperature, in the recent work by McClarren et al. [ Reference McClarren, Tregillis, Urbatsch and Dodd110]. In the work of Simpson et al. [ Reference Simpson, Mariscal, Williams, Scott, Grace and Ma111], a fully connected neural network with three hidden layers is constructed to assist the analysis of an X-ray spectrometer, which measures the X-rays driven by MeV electrons produced from high-power laser–solid interaction. Finally, Streeter et al. [ Reference Streeter, Colgan, Cobo, Arran, Los, Watt, Bourgeois, Calvin, Carderelli, Cavanagh, Dann, Fitzgarrald, Gerstmayr, Joglekar, Kettle, Mckenna, Murphy, Najmudin, Parsons, Qian, Rajeev, Ridgers, Symes, Thomas, Sarri and Mangles112] used convolutional neural networks (CNNs) to predict the electron spectrum produced by a laser wakefield accelerator, taking measurements from secondary laser and plasma diagnostics as the inputs.

2.1.7 Physics-informed machine learning models

The ultimate application of machine learning for modeling physics systems would arguably be to create an ‘artificial intelligence physicist’, as coined by Wu and Tegmark[ Reference Wu and Tegmark113]. One prominent idea at the backbone of how we build physical models is Occam’s razor, which states that given multiple hypotheses, the simplest one that is consistent with the data is to be preferred. In addition to this guiding principle, it is furthermore assumed that a physical model can be described using mathematical equations. A program should therefore be able to automatically create such equations, given experimental data. While a competitive artificial intelligence (AI) physicist is still years away,Footnote 2 first steps have been undertaken in this direction. For instance, in 2009 Schmidt and Lipson[ Reference Schmidt and Lipson116] presented a genetic algorithm approach (Section 4.4) that independently searched and identified governing mathematical representations such as the Hamiltonian from real-life measurements of some mechanical systems. More recently, research has concentrated more on the concept of Occam’s razor and the underlying idea that the ‘simplest’ representation can be seen as the sparsest in some domain. A key contribution was SINDy (sparse identification of nonlinear dynamics), a framework for discovering sparse nonlinear dynamical systems from data[ Reference Brunton, Proctor and Kutz117]. As in CS (Section 3.4), the sparsity constraint was imposed using an ${\mathrm{\ell}}_1$ -norm regularization, which results in the identification of the fewest terms needed to describe the dynamics.

An important step towards combining physics and machine learning was undertaken in physics-informed neural networks (PINNs)[ Reference Karniadakis, Kevrekidis, Lu, Perdikaris, Wang and Yang118]. A PINN is essentially a neural network that uses physics equations, which are often described in form of ordinary differential equations (ODEs) or partial differential equations (PDEs), as a regularization to select a solution that is in agreement with physics. This is achieved by defining a custom loss function that includes a discrepancy term, the residuals of the underlying ODEs or PDEs, in addition to the usual data-based loss components. As such, solutions that obey the selected physics are enforced. In contrast to SINDy, there is no sparsity constraint imposed on the network weights, meaning that the network could still be quite complex. Early examples of PINNs were published by Raissi et al. [ Reference Raissi, Perdikaris and Karniadakis119 Reference Raissi, Yazdani and Karniadakis122] and Long et al. [ Reference Long, Lu, Ma and Dong123] in 2017–2020. Since then, the architecture has been applied to a wide range of problems in the natural sciences, with a quasi-exponential growth in publications[ Reference Cuomo, Di Cola, Giampaolo, Rozza, Raissi and Piccialli124]. Applications include, for instance, (low-temperature) plasma physics, where PINNs have been successfully used to solve the Boltzmann equation[ Reference Kawaguchi and Murakami125], and quantum physics, where PINNs were used to solve the Schrödinger equation of a quantum harmonic oscillator[ Reference Stiller, Bethke, Böhme, Pausch, Torge, Debus, Vorberger, Bussmann and Hoffmann126]. The work by Stiller et al. [ Reference Stiller, Bethke, Böhme, Pausch, Torge, Debus, Vorberger, Bussmann and Hoffmann126] uses a scalable neural solver that could possibly also be extended to solve, for example, the Vlasov–Maxwell system governing laser–plasma interaction.

2.2 Time series forecasting

A related problem meriting its own discussion is time series forecasting. While models in the previous section are based on interpolation or regression within given data points, forecasting explicitly deals with the issue of extrapolating parameter values to the future based on prior observations. Most notably this includes modeling of long-term trends (in a laser context often referred to as drifts) or periodic oscillations of parameters and short-term fluctuations referred to as jitter.

If available, models may also use covariates, auxiliary variables that are correlated to the observable, to improve the forecast. These covariates may even extend into the future (seasonal changes being the traditional example in economic forecasting).

In this section we are going to first discuss two common approaches to time series forecasting, autoregressive models and state-space models (SSMs), followed by a discussion of modern techniques based on neural networks. Note that the modeling approaches presented in the preceding section may also be used, to some extent, to extrapolate data. We are not aware of any recent examples on time series forecasting in the context of laser–plasma physics. However, we feel that the topic merits inclusion in this overview as implementations are likely in the near future, for example, for laser stabilization purposes.

2.2.1 Classical models

To model a time series one usually starts with a set of assumptions regarding its structure, that is, the interdependence between values at different times. A simple, approximate assumption is that the observed values in a discrete time series are linearly related. An important, wide-spread class of such models is the so-called autoregressive models, which assert that the next value ${y}_t$ in a time series is given as a linear function of the $p$ prior values ${y}_{t-1},{y}_{t-2},\dots, {y}_{t-p}$ . In addition, each value is assumed to be corrupted by additive white noise ${\varepsilon}_t\sim \mathcal{N}\left(0,{\sigma}^2\right)$ , representing, for example, measurement errors or inherent statistical fluctuations of the underlying process, which endows the models with stochasticity. In its most common form, the autoregressive model of order $p$ , denoted by $\mathrm{AR}(p)$ , is defined via the recurrence relation:

(9) $$\begin{align}{y}_t = \sum \limits_{i = 1}^p{\varphi}_i{y}_{t-i}+{\varepsilon}_t,\end{align}$$

where ${\varphi}_1,\dots, {\varphi}_p$ are the model’s parameters to be estimated. In this form, the problem of finding the model’s parameters is a linear regression problem that can be solved via the least-squares method:

(10) $$\begin{align}\left\{{\widehat{\varphi}}_1,\dots, {\widehat{\varphi}}_p\right\} = \mathop{\mathrm{argmin}}_{{\varphi_1,\dots, {\varphi}_p}}{\left({y}_t-\sum \limits_{i = 1}^p{\varphi}_i{y}_{t-i}\right)}^2.\end{align}$$

In another approach we might assume that the next value ${y}_t$ is instead given by a linear combination of (external) statistical fluctuations, sometimes called shocks, from the past $q$ points in time, again encoded in white noise terms ${\varepsilon}_t\sim \mathcal{N}\left(0,{\sigma}^2\right)$ . The corresponding model, called the moving average model and denoted by $\mathrm{MA}(q)$ , is defined as follows:

(11) $$\begin{align}{y}_t = \mu +\sum \limits_{i = 1}^q{\vartheta}_i{\varepsilon}_{t-i}+{\varepsilon}_t,\end{align}$$

where $\mu$ is the mean of the time series.

The parameters of the moving average process cannot be inferred by linear methods such as least squares, but rather have to be estimated by means of maximum likelihood methods.

Contrary to the $\mathrm{AR}(p)$ model, which is only stationaryFootnote 3 for certain parameters ${\varphi}_1,\dots, {\varphi}_p,$ the $\mathrm{MA}(q)$ model is always (strictly) stationary per definition. Put loosely, stationarity can be understood as the stochastic properties (mean, variance, …) of the time series being constant over time.

Another distinction between the autoregressive and the moving average model is how far into the future the effects of statistical fluctuations (shocks) are propagated. In the moving average model the white noise terms are only propagated $q$ steps into the future, while in the autoregressive model the effects are propagated infinitely far into the future. This is because the white noise terms are part of the prior values, which themselves are part of the future values in Equation (9).

If the time series in question cannot be explained by the $\mathrm{AR}(p)$ or the $\mathrm{MA}(q)$ model alone, both models can be combined into what is called an autoregressive-moving-average model, denoted by $\mathrm{ARMA}\left(p,q\right)$ :

(12) $$\begin{align}{y}_t = \mu +\sum \limits_{i = 1}^p{\varphi}_i{y}_{t-i}+\sum \limits_{j = 1}^q{\vartheta}_j{\varepsilon}_{t-j}+{\varepsilon}_t.\end{align}$$

Note that for the special cases of $p = 0$ and $q = 0$ , the $\mathrm{ARMA}\left(p,q\right)$ reduces to the $\mathrm{MA}(q)$ and $\mathrm{AR}(p)$ , respectively.

When fitting autoregressive models, special care should be taken that the time series to be modeled is stationary before fitting, otherwise spurious correlations are introduced. Spurious correlations are apparent correlations between two or more time series that are not causally related, thereby potentially leading to fallacious conclusions as warned of in the well-known adage ‘correlation does not imply causation’.

If the time series $y$ is non-stationary in general but stationary with respect to its mean, that is, the variations relative to the mean value are stationary, it might be possible to transform it into a stationary time series. For this we introduce a new series ${z}_t = {\nabla}^d{y}_t$ by differencing the original series $d$ times, where we defined the differencing operator $\nabla {y}_t\equiv {y}_t-{y}_{t-1}$ . Applying the differencing operator once ( $d = 1$ ), for example, removes a linear trend from ${y}_t$ , applying it twice removes a quadratic trend and so forth. If $z$ is stationary after differencing $y$   $d$ -times it can be readily modeled by Equation (12), which leads us to the autoregres sive-integrated-moving-average model $\mathrm{ARIMA}\left(p,d,q\right)$ :

(13) $$\begin{align}{z}_t = \sum \limits_{i = 1}^p{\varphi}_i{z}_{t-i}+\sum \limits_{j = 1}^q{\vartheta}_j{\varepsilon}_{t-j}+{\varepsilon}_t.\end{align}$$

Note that in contrast to the $\mathrm{ARMA}\left(p,q\right)$ model in Equation (12), the time series ${z}_t$ appearing here is a $d$ -times differenced version of the original series ${y}_t$ . Further extensions that will only be noted here for completeness are seasonal ARIMA models, called SARIMA models, that allow the modeling of time series that exhibit seasonality (periodicity), and exogeneous ARIMA models, called ARIMAX models, that allow the modeling of time series that are influenced by a separate, external time series. Both extensions can be combined to yield the so-called SARIMAX model, which is general enough to cover a large class of time series problems. However, we are still in any case limited to problems in which the time series is generated by a linear stochastic process. To cover more general nonlinear problems, we introduce SSMs.

2.2.2 State-space models

SSMs offer a very general framework to model time series data[ Reference Commandeur and Koopman127]. In this framework, presuming that the time series is based on some underlying system, it is assumed that there exists a certain true state of the system ${x}_t$ of which we observe a value ${y}_t$ subject to measurement noise. The true state ${x}_t$ , usually inaccessible and hidden to us, and the observed state ${y}_t$ are modeled by the state equation and the observation equation, respectively. A prominent example of SSMs in machine learning is hidden Markov models (HMMs)[ Reference Zucchini and MacDonald128], in which the hidden state ${x}_t$ is modeled as a Markov process. In general there are no restrictions on the functional form of the state and observation equations; however, the most common type of SSM is the linear Gaussian SSM, also referred to as the dynamic linear model, in which the state and observation equations are modeled as linear equations and the noise is assumed to be Gaussian. The prototypical example of such linear Gaussian SSMs is the Kalman filter [ Reference Kalman129], ubiquitous in control theory, robotics and many other fields[ Reference Auger, Hilairet, Guerrero, Monmasson, Orlowska-Kowalska and Katsura130], including for instance adaptive optics[ Reference Poyneer and Véran131] and particle accelerator control[ Reference Ushakov, Echevarria and Neumann132]. The term ‘filter’ originates from the fact that it filters out noise to estimate the underlying state of a system. The state equations can be written as follows:

(14) $$\begin{align}{x}_t &= {A}_t{x}_{t-1}+{B}_t{u}_t+{C}_t{\varepsilon}_t,\nonumber\\ {}{y}_t &= {D}_t{x}_t+{E}_t{\eta}_t,\end{align}$$

where the system noise ${\varepsilon}_t$ and observation noise ${\eta}_t$ are sampled from a normal distribution ${\varepsilon}_t,{\eta}_t\sim \mathcal{N}\left(0,1\right)$ . Note that the separation of these two noise terms differs from the so-far discussed models. This permits the modeling of systems that are driven by noise, while also accounting for observation noise of possibly differing amplitude. Another difference is the addition of a second process ${u}_t$ , commonly called the control input in signal processing, which is a deterministic external process driving the system state. This permits the modeling of systems with known external inputs (e.g., a controller-driven motor) or with known deterministic drivers (e.g., system temperature). Note that we can recover the autoregressive models discussed earlier as a special case of the Kalman filter. For example, the $\mathrm{AR}(1)$ process is obtained by setting ${A}_t = \mathrm{const}. = {\varphi}_1$ , ${B}_t = 0$ , ${C}_t = \mathrm{const}. = {\sigma}^2$ , ${D}_t = 1$ and ${E}_t = 0$ . For a more detailed discussion on Kalman filters we refer the interested reader to the literature[ Reference Welch and Bishop133]. Other (nonlinear) ways to perform inference in SSMs exist, for instance using sequential Monte Carlo estimation[ Reference Ristic, Arulampalam and Gordon134].

2.2.3 Forecasting networks

While predictions based on autoregression or SSMs can suffice for many applications, the nonlinear nature of neural networks can be harnessed to model time series with complex, nonlinear dependencies[ Reference Fawaz, Forestier, Weber, Idoumghar and Muller135, Reference Lim and Zohren136]. A particularly relevant architecture is the recurrent neural networks (RNNs). These are similar to the ‘traditional’ neural networks we discussed in Section 2.1.6, but they have a ‘memory’ that allows them to retain information from previous inputs. This is implemented in a somewhat similar way to the linear recurrence relations discussed in the previous sections, with a key difference being that the state ${x}_t$ of RNNs is updated according to an arbitrary nonlinear function. The current state ${x}_t$ is calculated from the previous states ${x}_{t-1},{x}_{t-2},\dots$ through the recurrence equation:

(15) $$\begin{align}{x}_t = f\left({w}_{\mathrm{rec}}{x}_{t-1}+{b}_{\mathrm{rec}}\right),\end{align}$$

where $f$ is a nonlinear activation function, ${w}_{\mathrm{rec}}$ is a matrix of recurrent weights and ${b}_{\mathrm{rec}}$ is a bias vector. This equation allows the update to each element of the current state vector, ${x}_t$ , to be dependent on the whole of the previous state vector, ${x}_{t-1}$ . The output ${y}_t$ of the network is calculated from the current state ${x}_t$ through the output equation:

(16) $$\begin{align}{y}_t = f\left({w}_{\mathrm{out}}{x}_t+{b}_{\mathrm{out}}\right),\end{align}$$

where ${w}_{\mathrm{out}}$ contains the output weights and ${b}_{\mathrm{out}}$ is another bias vector. Thus, the entire network state is updated from the previous state through a recurrent equation, and the current output is calculated from the current state through an output equation. The network state therefore contains all previous information about the time series. However, as the network state is updated by a simple multiplication of the weights ${w}_{\mathrm{rec}}$ with the previous state ${x}_{t-1}$ , the so-called vanishing gradient problem may occur[ Reference Hochreiter137, Reference Hochreiter138]. This problem is solved in a seminal work by Hochreiter and Schmidhuber[ Reference Hochreiter and Schmidhuber139], who introduced a special type of RNN, the long short-term memory (LSTM) network. In contrast to the simple update equation of RNNs, LSTMs use a special type of memory cell that can learn long-term dependencies. This includes a so-called forget gate  ${f}_t$ to update the previous state ${x}_{t-1}$ to the current state ${x}_t$ :

(17) $$\begin{align}{x}_t = {f}_t\odot {x}_{t-1}+{i}_t\odot {h}_t,\end{align}$$

where $\odot$ denotes the element-wise Hadamard product, ${\left[A\odot B\right]}_{ij} = {A}_{ij}{B}_{ij}$ . In addition, the LSTM uses two more gates, the input gate ${i}_t$ and output gate ${o}_t$ , where the latter is used to calculate the hidden state ${h}_t$ from the previous hidden state via ${h}_t = {o}_t\odot {h}_{t-1}$ . The three gates ${f}_t$ , ${i}_t$ and ${o}_t$ are calculated from the current input ${x}_t$ , the previous hidden state ${h}_{t-1}$ and the previous gate states ${f}_{t-1}$ , ${i}_{t-1}$ and ${o}_{t-1}$ using individually set weights and biases. The gates determine how much of the previous state ${x}_{t-1}$ and the current hidden state ${h}_t$ are used to calculate the current state ${x}_t$ . The output of the LSTM network is then calculated from the current state ${x}_t$ through the same output equation, Equation (16), as used in the RNN. Despite it being introduced in the early 1990s, the LSTM architecture remains one of the most popular network architectures for predictive tasks to date.

A more recently introduced type of neural network that can learn to interpret and generate sequences of data is the transformer. Transformers are similar to RNNs, but instead of processing the time series in a sequential manner, they use a so-called attention mechanism[ Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin140] to capture dependencies between all elements of the time series simultaneously and, thus, focus on specific parts of an input sequence. Assuming again a time series ${x}_t,{x}_{t-1},\dots$ , a transformer maps each point ${x}_t$ to a representation ${h}_t$ using a linear layer, for example, ${h}_t = {wx}_t+b$ . The transformer network then calculates a new representation for each point ${x}_t$ using the attention mechanism:

(18) $$\begin{align}{\tilde{h}}_t = \sum {\alpha}_{ti}{h}_i, \end{align}$$

where the attention weights ${\alpha}_{ti}$ are calculated using the point representations ${h}_i$ and the previous representation ${\tilde{h}}_t$ . The attention weights ${\alpha}_{ti}$ are used to calculate the new representation ${\tilde{h}}_t$ of each point ${x}_t$ from all previous representations ${\tilde{h}}_i$ , $i<t$ , of the previous point ${x}_i$ , $i<t$ . The new representations are then used to calculate the output ${y}_t$ of the network as in Equation (16). Thus, the attention mechanism of a transformer network can be interpreted as a nonlinear function that updates the representation of each point ${x}_t$ from all previous representations of the previous points ${x}_i$ , $i<t$ . It can be applied multiple times and enables transformers to learn complex patterns in data, outperforming RNNs on a variety of tasks, such as machine translation and language modeling. It has also been shown that transformers are more efficient than RNNs, meaning they can be trained on larger datasets in less time. One recent example of a transformer developed for time series prediction is the Temporal Fusion Transformer[ Reference Lim, Ark, Loeff and Pfister141]. Beyond RNNs, LSTMs and transformers, there exist many other network architectures that can be used for forecasting, for example, fully convolutional networks (FCNs), as discussed by Wang et al. [ Reference Wang, Yan and Oates142].

For longer-term forecasting, predictive networks are usually employed iteratively, meaning that a single-step forecast uses only real data, while a two-step forecast will use the historical data and the most recent prediction value, and so forth. In the context of laser–plasma physics and experiments, predictive neural networks could be used to model the time series data from diagnostic measurements in order to make predictions about the future performance, for example, for predictive steering of laser or particle beams.

2.3 Prediction and feedback

Both (surrogate) models and forecasts can be used to make predictions about the state of a system at a so-far unknown position, in parameter space or time, respectively. This type of operation is sometimes referred to as open-loop prediction. Closed-loop prediction and feedback, on the other hand, use predictions and compare them to the actual system state, continuously updating and improving the surrogate model of the system. This is particularly relevant for dynamic systems, where parameters change over time. A complete discussion on how to implement a closed-loop system using machine learning goes beyond the scope of this review, as it would also require an extensive discussion of control systems and so forth. The following discussion will thus be restricted to a brief outline of a few relevant concepts.

A feedback loop is most generally a system where a part of the output serves as input to the system itself, and we have already discussed some models with feedback in the context of forecasting (Section 2.2). Another well-known engineering implementation of closed-loop operation is the proportional–integral–derivative (PID) controller. A PID controller adjusts the process variable (PV) with a proportional, integral and derivative response to the difference between a measured PV and a desired set point. Here the proportional term serves to increase the response time of the system to changes in the set point. The integral term is used to eliminate the residual steady-state error that occurs if the PV is not equal to the set point. The derivative term improves the stability of the control, reducing the tendency of the PID output to overshoot the set point.

In the context of laser–plasma physics, the implementation of the feedback loop would most likely look slightly different. One possible implementation would be that a model is used to predict the output of the system at a given set of parameters, which is then compared to the actual output and used to update the model again. In order to keep improving the model, it is important that new data should be acquired in regions of parameter space that deviate from the previously acquired data. In other words, it is important to explore parameter space in addition to exploiting the knowledge from existing data points. This can be done through random sampling or through so-called BO, as we will discuss in Section 4.

Such a model that is continuously updated with new data is sometimes referred to as a dynamic model. There are two main approaches to updating such models.

  • Firstly, completely re-training the model with all available data, including the newly acquired data points. This results in an increasingly large training dataset and training time.

  • The second method is to update the model by adding new points to the training set and then re-training the existing model on these points in a process known as incremental learning. This method is thus much faster and uses less memory than a full re-training.

However, not all techniques we discussed to construct models are compatible with both training methods. For instance, GP regression can historically only be trained on full datasets, hindering its applicability in settings requiring (near) real-time updates. However, recent work on regression in a streaming setting might alleviate this problem[ Reference Bui, Nguyen and Turner143]. Learning dynamic models is further complicated by the fact that systems may change over the course of time during which training data are acquired, a problem referred to as concept drift[ Reference Žliobaitė, Pechenizkiy and Gama144].

Many laboratories are currently stepping up their efforts to integrate prediction and feedback systems into their lasers and experiments[ Reference Cassou145, Reference Weiße, Doyle, Gebhard, Balling, Schweiger, Haberstroh, D. Geulig, Lin, Irshad, Esslinger, Gerlach, Gilljohann, Vaidyanathan, Siebert, Münzer, Schilling, Schreiber, G. Thirolf, Karsch and Döpp146]. One of the first groups to extensively make use of these techniques was the team around A. Maier at DESY’s LUX facility[ Reference Maier, Delbos, Eichner, Hübner, Jalas, Jeppe, Jolly, Kirchen, Leroux, Messner, Schnepp, Trunk, Walker, Werle and Winkler26]. Among the most comprehensive proposals are plans recently presented by Ma et al. from NIF, proposing an extensive integration of feedback systems for high-energy-density (HED) experiments[ Reference Ma, Mariscal, Anirudh, Bremer, Djordjevic, Galvin, Grace, Herriot, Jacobs, Kailkhura, Hollinger, Kim, Liu, Ludwig, Neely, Rocca, Scott, Simpson, Spears, Spinka, Swanson, Thiagarajan, Van Essen, Wang, Wilks, Williams, Zhang, Herrmann and Haefner147]. The system, referred to as a ‘full integrated high-repetition rate HED laser–plasma experiment’, consists of multiple linked feedback loops. These are hierarchically organized, starting from a ‘laser loop’, over an ‘experiment loop’ to a ‘modeling & simulation loop’ at its highest level.

3 Inverse problems

In the previous section we looked at the forward problem of modeling the black-box function $f(x) = y$ . In many scientific and engineering applications, it is necessary to solve the inverse problem, namely to determine $x$ given $y$ and $f$ (or an approximation ${f}^{\ast}\approx f$ ). Inverse problems are extremely common in experimental physics, as they essentially describe the measurement process and subsequent retrieval of underlying properties in a physics experiment.

In many cases the problem takes the form of a discrete linear system:

(19) $$\begin{align}Ax = y,\end{align}$$

where $y$ is the known observation and $A$ describes the measurement process acting on the unknown quantity $x$ to be estimated. In most cases we can assume that the system behaves linearly with respect to the input parameters $x$ and, hence, $A$ can be written as a single sensing matrix. However, more generally, $A$ can be thought of as an operator, which maps the input vector $x$ to the observation vector $y$ .Footnote 4

A classical example of an inverse problem is CT, whose goal is to reconstruct an object from a limited set of projections at different angles. Other examples are wavefront sensing in optics, the ‘FROG’ algorithm for ultra-fast pulse measurements, the ‘unfolding’ of X-ray spectra or, in the context of particle accelerators, the estimation of particle distributions from different measurements of a beam.

3.1 Least-squares solution

A common approach to the problem described by Equation (19) is to use the least-squares approach. Here, the problem is reformulated as minimizing the quadratic, positive error between the observation and the estimate:

(20) $$\begin{align}{\widehat{x}}_{\mathrm{LS}} = \underset{x}{\mathrm{argmin}\kern0.1em }\left\{{\left\Vert Ax-y\right\Vert}^2\right\}.\end{align}$$

If $A$ is square and non-singular, the solution to this equation is obtained via the inverse, ${A}^{-1}y$ . For non-square matrices, the pseudo-inverse can be used, which can for instance be computed via singular value decomposition (SVD)[ Reference Barata and Hussein148]. Alternatively, a multitude of iterative optimization methods can be employed, for example, iterative shrinkage-thresholding algorithms[ Reference Bioucas-Dias and Figueiredo149, Reference Beck and Teboulle150] or gradient descent, to name a few. For more details on the solution to this optimization problem, the reader is referred to Ref. [Reference Nocedal and Wright151].

The approach given by Equation (20) is widely used in overdetermined problems, where the regression between data points with redundant information can help to reduce the influence of measurement noise. Prominent application areas are, for instance, wavefront measurements and adaptive optics[ Reference Tyson and Frazier152].

3.2 Statistical inference

As for predictive models (Section 2.1.3), one can also approach inverse problems via probabilistic methods, for example, if the underlying model $f(x)$ is stochastic in nature or measurements are corrupted by noise. One popular approach is to use MLE[ Reference Tarantola153]. As we have seen before, MLE consists of finding the value that maximizes the likelihood (see Equation (1)). To this end, one often uses Markov chain Monte Carlo (MCMC) methods[ Reference Geyer154] and/or gradient descent algorithms. Alternatively, expectation-maximization (EM) is a popular method to compute the maximum likelihood estimate[ Reference Dempster, Laird, Rubin and Stat155] and is, for instance, often used in statistical iterative tomography (SIR)[ Reference Zhang, Wang, Zeng, Tao and Ma156]. EM alternates between an estimate step, where one computes the expectation of the likelihood function, and a maximization step, where one obtains a new estimate that maximizes the posterior distribution. By sequentially repeating the two steps, the estimate converges to the maximum likelihood estimate.

Both the least-squares and MLE approaches suffer from the issue that the result is a point estimate. Thus, if the underlying problem is ill-posed or underdetermined, this estimate is often not unique or representative, or the least-squares solution is prone to artifacts resulting from small fluctuations in the observation. Consequently, it is often desirable to obtain an estimate of the entire solution space, that is, a probability distribution of the unknown parameter $x$ . The latter allows one to not only compute the estimate $\widehat{x}$ , but also to obtain a measure of the estimation uncertainty ${\sigma}_{\widehat{x}}.$

To this end, one can reformulate the inverse problem as a Bayesian inference problem and get both an expectation value and an uncertainty from the posterior probability $p\left(y|x\right)$ . As we have seen in Section 2.1.3, calculating the posterior requires knowledge of the likelihood $p\left(x|y\right)$ [ Reference Tarantola153]. However, in some settings the forward problem $f(x)$ can be very expensive to evaluate, which is for instance the case for accurate simulations of laser–plasma physics. In this case the likelihood function $p\left(x|y\right)$ becomes intractable because it requires the computation of $f(x)$ for many values of $x$ . Historically, solutions to this problem have been referred to as methods of likelihood-free inference, but more recently the term simulation-based inference is being used increasingly[ Reference Cranmer, Brehmer and Louppe157]. One of the most popular methods in this regard is approximate Bayesian computation (ABC)[ Reference Sisson, Fan and Beaumont158],Footnote 5 which addresses the issue by employing a reject–accept scheme to estimate the posterior distribution $p\left(y|x\right)$ without needing to compute the likelihood $p\left(x|y\right)$ for any value of $x$ . Instead, ABC first randomly selects samples ${x}^{\prime }$ from a pre-defined prior distribution, which is usually done using an MCMC[ Reference Gilks, Richardson and Spiegelhalter159] algorithm or, in more recent work, via BO[ Reference Gutmann, Corander and Mach160]. Defining good priors might require some expert knowledge about the system. Subsequently, one simulates the observation ${{y}^{\prime} = f\left({x}^{\prime}\right)}$ corresponding to each sample. Finally, the ABC algorithm accepts only those samples that match the observation within a set tolerance $\epsilon$ , thereby essentially approximating the likelihood function with a rejection probability $\rho \left(y,{y}^{\prime}\right)$ . This yields a reduced set of samples, whose distribution approximates the posterior distribution. The tolerance $\epsilon$ determines the accuracy of the estimate; the smaller $\epsilon$ , the more accurate the prediction, but due to the higher rate of rejection one also requires more samples of the forward process. This can make ABC prohibitively expensive, especially for high-dimensional problems. Approximate frequentist computation[ Reference Brehmer, Cranmer, Louppe and Pavez161] is a conceptually similar approach that approximates the likelihood function instead of the posterior. Information on this method and other simulation-based inference techniques, including their recent development in the context of deep learning, for example, for density estimation in high dimensions, can be found, for instance, in the recent overview paper by Cranmer et al. [ Reference Cranmer, Brehmer and Louppe157].

3.3 Regularization

One way to improve the estimate of the inverse in the case of ill-posed or underdetermined problems is to use regularization methods. Regularization works by further conditioning the solution space, thus replacing the ill-conditioned problem with an approximate, well-conditioned one. In variational regularization methods this is done by simultaneously solving a second minimization problem that incorporates a desirable property of the solution, for example, minimization of the total variation to remove noise.Footnote 6 Such regularization problems hence take the following form:

(21) $$\begin{align}{\widehat{x}}_{\mathrm{REG}} = \underset{x}{\mathrm{argmin}\kern0.1em }\left\{{\left\Vert Ax-y\right\Vert}^2+\lambda \mathrm{\mathcal{R}}(x)\right\},\end{align}$$

where $\lambda$ is a hyperparameter controlling the impact of the regularization and $\mathrm{\mathcal{R}}$ is the regularization criterion, for example, a matrix description of the first derivative. The effect of the regularization can be further adjusted by the norm that is used to calculate the residual term, for example, using the absolute difference, the squared difference or a mix of both, such as the Huber norm.Footnote 7 The more complex optimization problem in Equation (21) itself is typically solved using iterative optimization algorithms, as already mentioned in the previous section. As will be discussed in Section 3.6, there have been recent developments in data-driven approaches to learning the correct form of $\mathrm{\mathcal{R}}(x)$ , by representing it with a neural network.

A recent demonstration in the context of LPA is the work by Döpp et al. [ Reference Döpp, Hehn, Götzfried, Wenz, Gilljohann, Ding, Schindler, Pfeiffer and Karsch162, Reference Götzfried, Döpp, Gilljohann, Ding, Schindler, Wenz, Hehn, Pfeiffer and Karsch163], who presented results on tomographic reconstruction using betatron X-ray images of a human bone sample; see Figure 8. The 180 projection images were acquired within 3 minutes and an iterative reconstruction of the object’s attenuation coefficients was performed. For regularization, a variation of Equation (21) was used that used the Huber norm and included a weighting factor in the data term, whose objective was to adjust the weight of poorly illuminated detector pixels.

Figure 8 Tomography of a human bone sample using a laser-driven betatron X-ray source. Reconstructed from 180 projections using statistical iterative reconstruction. Based on the data presented by Döpp et al. [ Reference Döpp, Hehn, Götzfried, Wenz, Gilljohann, Ding, Schindler, Pfeiffer and Karsch162].

3.4 Compressed sensing

Compressed sensing (CS)[ Reference Candes and Wakin164 Reference Yuan, Brady and Katsaggelos166] is a relatively new research field that has attracted significant interest in recent years, since it efficiently deals with the problem of reconstructing a signal from a small number of measurements. The mathematical theory of CS has proven that it is possible to reliably reconstruct a complex signal from a small number of measurements, even below the Shannon–Nyquist sampling rate, as long as two conditions are satisfied. Firstly, the signal must be ‘sparse’ in some other representation (i.e., it must contain few non-zero elements). In this case we can replace the dense unknown variable $x$ with its sparse counterpart $\tilde{x}$ by using the transformation matrix $\Psi$ corresponding to a different representation, that is, $x = \Psi \tilde{x}$ . Here, $\Psi$ could for instance be the wavelet transformation. The second condition concerns the way the measurements are taken (hence compressed sensing): the sensing matrix $A$ must be incoherent with respect to the sparse basis $\Psi$ , which ensures that the sparse representation of the signal is fully sampled.

At its core, CS is closely related to the concepts of regularization discussed in the previous section. In order to reconstruct the signal from the measurements, the ideal regularization, $\mathrm{\mathcal{R}}\left(\tilde{x}\right)$ , is that which sums the number of non-zero components of $\tilde{x}$ and thus promotes sparsity when minimized[ Reference Donoho167]. However, it has been shown that using the vastly more computationally efficient ${\mathrm{\ell}}_1$ norm leads to the same results on many occasions[ Reference Candès, Wakin and Boyd168], and thus the CS reconstruction problem can be written as follows:

(22) $$\begin{align}{\widehat{x}}_{\mathrm{CS}} = \underset{\tilde{x}}{\mathrm{argmin}\kern0.1em }\left\{{\left\Vert A\Psi \tilde{x}-y\right\Vert}^2+\lambda {\left\Vert \tilde{x}\right\Vert}_1\right\}.\end{align}$$

It should be noted that CS is not the first method to achieve sub-Nyquist sampling in a certain domain; see, for example, band-limited sampling[ Reference Kohlenberg169]. The strength of the formalism is rather that it is very flexible, because it only requires the coefficients of a signal to be sparse, without exactly knowing beforehand which ones are non-zero.

It should be noted that in the most general case, the basis on which the signal is sparse, $\Psi$ , is unknown. Nonetheless, the incoherence with the sampling basis can be satisfied by sampling randomly. To reconstruct the signal from such measurements, one returns to solving Equation (21). This can still be considered as CS, due to the fact that the nature of the sensing process is designed to exploit the sparsity of the signal. Therefore, CS can be used alongside deep-learning-based approaches to solve Equation (21), which will be discussed in the subsequent sections.

CS has been used in a number of fields related to laser–plasma physics, for example ultra-fast polarimetric imaging [ Reference Liang, Wang, Zhu and Wang170] and ICF radiation symmetry analysis[ Reference Huang, Jiang, Li, Wang and Chen171]. There exist also several examples from the context of tomographic reconstruction[ Reference Kassubeck, Wenger, Jennings, Gomez, Harding, Schwarz and Magnor172, Reference Ma, Hua, Liu, He, Zhang, Chen, Yang, Ning, Yang, Zhang, Pai, Gu and Lu173]. Such enhanced reconstruction algorithms can reduce the number of projections beyond what is usually possible with regression techniques. For instance, in the work by Ma et al. [ Reference Ma, Hua, Liu, He, Zhang, Chen, Yang, Ning, Yang, Zhang, Pai, Gu and Lu173] a test object was illuminated using a Compton X-ray source and a compressive tomographic reconstruction algorithm[ Reference Yu and Wang174] was used to reconstruct the sample’s well-defined inner structure from only 31 projections.

3.5 End-to-end deep learning methods

In recent years, there has been growing interest in applying deep learning to inverse problems[ Reference Arridge, Maass, Öktem and Schönlieb175]. In general, these can be categorized into two types of approaches, namely those that aim to entirely solve the inverse problem end-to-end using a neural network and hybrid approaches that replace a subpart of the solution with a network (see the next section). Denoting the artificial neural network (ANN) as $\mathcal{A}$ , the estimate ${\widehat{x}}_{\mathcal{A}}$ from the end-to-end network could be simply written as follows:

(23) $$\begin{align}{\widehat{x}}_{\mathcal{A}} = \mathcal{A}y.\end{align}$$

ANNs can thus be interpreted as a nonlinear extension of linear algebra methods, such as SVD[ Reference Bermeitinger, Hrycej and Handschuh176]. As such, for well-posed problems, the ANN is acting similarly to the least-squares method and, if provided with non-biased training data, directly approximates the (pseudo-)inverse.

However, using neural networks for ill-posed problems is more difficult, as training tends to become unstable when the networks need to generate data, that is, the layer containing the desired output $x$ is larger than the input $y$ . Fortunately, several network architectures have been developed that perform very well at these tasks, such as generative adversarial networks (GANs)[ Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio177]. GANs are two-part neural networks, one of which is trying to generate data that resemble the input (the generator), while the other is trying to distinguish between the generated and real data (the discriminator). As the two networks compete against each other they improve their respective skills, and the generator will eventually be able to create high-quality data. The generator is then used to estimate the solution of the inverse problem. Training GANs can still be challenging, with common issues such as mode collapse[ Reference Thanh-Tung and Tran178]. Autoencoders (AEs) are a simpler architecture that can perform similar tasks with relevance to inverse problems[ Reference Peng, Jalali and Yuan179]. U-Nets[ Reference Ronneberger, Fischer and Brox180] are a popular architecture that draws inspiration from AEs and have proven to be very powerful for inverse problems, especially in imaging. They essentially combine features of AEs with FCNs, in particular ResNets (see also Section 6.1.2): similar to an AE, U-Nets consist of an encoder part, followed by a decoder part that usually mirrors the encoder. At the same time, the layers in the network are skip-connected, meaning that the output from the previous layer is concatenated with the output of the corresponding layer in the encoder part of the network (see Figure 6). This allows the network to retain information about the input data even when it is transformed to a lower dimensional space.

One sub-class of ANNs that has gained considerable recent interest in the context of inverse problems is the invertible neural network (INN)[ Reference Ardizzone, Kruse, Wirkert, Rahner, Pellegrini, Klessen, Maier-Hein, Rother and Köthe181]; see sketch in Figure 9. Mathematically, an INN is a bijective function between the input and output of the network, meaning that it can be exactly inverted. Because of this, an INN trained to approximate the forward function $A$ will implicitly also learn an estimate for its inverse. In order to be applied to inverse problems, both forward and inverse mapping should be efficiently computable. In ill-posed problems there is an inherent information loss present in the forward process, which is either counteracted by the introduction of additional latent output variables $z$ , which capture the information about $x$ that is missing in $y$ , or by adding a zero-padding of corresponding size. The architecture and training of INNs are inspired by recent advances in normalizing flows[ Reference Rezende and Mohamed182]. The name stems from the fact that the mapping learned by the network is composed of a series of invertible transformations, called coupling blocks[ Reference Dinh, Krueger and Bengio183], which ‘normalize’ the data, meaning that they move the data closer to a standard normal distribution $\mathcal{N}\left(0,1\right)$ . The choice of coupling block basically restricts the Jacobian of the transformation between the standard normal latent variable and the output. INNs can be trained bi-directionally, meaning that the loss function is optimized using both the loss of the forward pass and the inverse pass. In addition, an unsupervised loss such as maximum mean discrepancy[ Reference Gretton, Borgwardt, Rasch, Schölkopf, Smola and Mach184] or negative log-likelihood[ Reference Ardizzone, Lüth, Kruse, Rother and Köthe185] can be used to encourage the latent variable to be as close to a standard normal variable as possible.

Figure 9 Deep-learning for inverse problems. Sketch explaining the relation among predictive models, inverse models and fully invertible models.

Figure 10 Application of the end-to-end reconstruction of a wavefront using a convolutional U-Net architecture[ Reference Ronneberger, Fischer and Brox180]. The spot patterns from a Shack–Hartmann sensor are fed into the network, yielding a high-fidelity prediction. Adapted from Ref. [Reference Hu, Hu, Gong and Si188].

The loss function of end-to-end networks can also be modified such that a classical forward model is used in the loss function. Such PINNs (see Section 2.1.7) respect the physical constraints of the problem, which can lead to more accurate and physically plausible solutions for the inverse problem.

An early example from the context of ultra-fast laser diagnostics was published by Krumbügel et al. in 1996[ Reference Krumbügel, Ladera, DeLong, Fittinghoff, Sweetser and Trebino186], where the authors tested a neural network for retrieval of laser pulse profiles from FROG traces. At the time, numerical capabilities were much more limited than today and the FROG trace containing some $64\times 64$ data points was at the time considered as ‘far too much for a neural net’. Because of this, the data had to be compressed and only a polynomial dependence was trained, limiting reconstruction performance. The vast increase in computing power over the past decades has largely alleviated this issue and the problem was revisited by Zahavy et al. in 2018[ Reference Zahavy, Dikopoltsev, Moss, Haham, Cohen, Mannor and Segev187]. Their ‘DeepFROG’ network is based on a CNN architecture and, now using the entire 2D FROG trace as input, achieves a similar performance as iterative methods.

Another example for an end-to-end learning of a well-conditioned problem was published by Simpson et al. [ Reference Simpson, Mariscal, Williams, Scott, Grace and Ma111], who trained a fully connected three-layer network on a large set (47,845 samples) of synthetic spectrometer data to retrieve key experimental metrics, such as particle temperature.

U-Nets have been used for various inverse problems, including, for example, the reconstruction of wavefronts from Shack–Hartmann sensor images as presented by Hu et al. [ Reference Hu, Hu, Gong and Si188] (see Figure 10).

An example for the use of INNs in the context of LPA was recently published by Bethke et al. [ Reference Bethke, Pausch, Stiller, Debus, Bussmann and Hoffmann189]. In their work they used an INN called iLWFA to learn a forward surrogate model (from simulation parameters to the beam energy spectrum) and then used the bijective property of the INN to calculate the inverse, that is, deduce the simulation parameters from a spectrum.

3.6 Hybrid methods

In contrast to end-to-end approaches, there exist a variety of hybrid schemes that employ neural networks to solve part of the inverse problem. A collection of such methods focuses on splitting Equation (21) into two subproblems, separating the quadratic loss from the regularization term. The former is easily minimized as a least-squares problem (discussed in Section 3.1) and the optimum form of regularization can then be learned by a neural network. Crucially, this network can be smaller and more parameter-efficient than in end-to-end approaches, as it has a simpler task and less abstraction to perform. Furthermore, the direct relation that this network has the regularization $\mathrm{\mathcal{R}}(x)$ allows one to pick a network structure to exploit prior knowledge about the data.

There are multiple methods available to perform such a separation, for example half quadratic splitting (HQS) or the alternating direction method of multipliers (ADMM). HQS is the simplest method and involves substituting an auxiliary variable, $p$ , for $x$ in the regularization term and then separating Equation (21) into two subproblems, which are solved iteratively. The approximate equality of $p$ and $x$ is ensured by the introduction of a further quadratic loss term into each of the subproblems: $\beta {\left\Vert x-p\right\Vert}^2$ . If the same neural network (with the same set of weights) is used to represent the regularization subproblem in each iteration, the method is referred to as plug-and-play, but if each iteration uses a separate network, the method is deep unrolling[ Reference Monga, Li and Eldar190]. Such methods have achieved unprecedented accuracy whilst reducing computational cost.

Figure 11 Deep unrolling for hyperspectral imaging. The left-hand side displays an example of the coded shot, that is, a spatial-spectral interferogram hypercube randomly sampled onto a 2D sensor. The bottom left shows a magnification of a selected section. On the right-hand side is the corresponding reconstructed spectrally resolved hypercube. Adapted from Ref. [Reference Howard, Esslinger, Wang, Norreys and Döpp192].

It is worth noting that there exist other similar methods – for instance, neural networks can be used to learn an appropriate regression function $\mathrm{\mathcal{R}}(x)$ in Equation (21), as for instance done in network Tikhonov (NETT)[ Reference Li, Schwab, Antholzer and Haltmeier191].

Howard et al. [ Reference Howard, Esslinger, Wang, Norreys and Döpp192] recently presented an implementation of CS using deep unrolling for single-shot spatio-temporal characterization of laser pulses. Such a characterization is a typical example of an underdetermined problem, where one wishes to capture 3D information (across the pulse’s spatial and temporal domains) on a 2D sensor. Whilst previous methods mostly resorted to scanning, resulting in long characterization times and blindness to shot-to-shot fluctuations, this work presented a single-shot approach, which has numerable benefits. The authors’ implementation is based on a lateral shearing interferometer to encode the spatial wavefront in an interferogram for each spectral channel of the pulse, creating an interferogram cube. An optical system called CASSI[ Reference Gehm, John, Brady, Willett and Schulz193] was then used to randomly sample this 3D data onto a 2D sensor resulting in a ‘coded shot’, thereby fulfilling the conditions of CS. For the reconstruction of the interferogram cube, the HQS method was utilized, and the regularization term was represented by a network with 3D convolutional layers that can capture correlations between the spectral and spatial domains. An example for a successful reconstruction is shown in Figure 11.

Table 1 Summary of a few representative papers on machine-learning-aided optimization in the context of laser–plasma acceleration and high-power laser experiments.

4 Optimization

One of the most common problems in applied laser–plasma physics experiments, in particular LPA, is the optimization of the performance through manipulation of the machine controls. Here the goal is to minimize or maximize an objective function, a metric of the system performance according to some pre-defined criteria (see Section 4.1.1). A simple case of this would be to optimize the total charge of the accelerated electron beam, but in principle, any measurable beam quantity or combination of quantities can be used to create the objective function.

There are many different general techniques for tackling optimization problems, and their suitability depends on the type of problem. Single shots in an LPA experiment (or a single run of numerical simulation) can be considered the evaluation of an unknown function that one wishes to optimize. The form of this function is not typically known, due to the lack of a full theoretical description of the experiment, and therefore the Jacobian of this function is also unknown. The input to this function is typically highly dimensional, due to the large number of machine control parameters that affect the output, and these input parameters are coupled in a complex and often nonlinear manner. Evaluation of this unknown function is also relatively costly, meaning that optimization should be as efficient as possible to minimize the required beam time or computational resources. In addition, the unknown function has some stochasticity, due to the statistical noise in the measurement and also due to the natural variation of unmeasured input parameters, which nonetheless may contribute significantly to variations in the output. Finally, there are usually constraints placed upon the input parameters due to the operation range of physical devices and machine safety requirements. These constraints might also be non-trivial due to coupling between different input parameters, and may also depend on system outputs (e.g., to avoid beam loss in an accelerator).

Due to all these considerations, not all optimization algorithms are suitable and only a few different types have been explored in published work; see Table 1 for a selection of representative papers. The following sections will focus on these methods. The reader is referred to dedicated reviews, for example, the comprehensive work by Nocedal and Wright[ Reference Nocedal and Wright151] on numerical optimization algorithms, for further reading.

4.1 General concepts

4.1.1 Objective functions

Most optimization algorithms are based on maximizing or minimizing the value of the objective function, which has to be constructed in a way that it represents the actual, user-defined objective of the optimization problem. Typically, the objective function produces a single value, where higher (or lower) values represent a more optimized state.Footnote 8 The optimization problem is then a case of finding the parameters that maximize (or minimize) the objective function.

When the objective is to construct a model, the objective function encodes some measure of similarity between the model and what it is supposed to represent, for example, a measure of how well the model fits some training data. For example, deep learning algorithms minimize an objective function that encodes the difference between desired output values for a given input and actual results produced by the algorithm. A common, basic metric for this is the MSE cost function, in which the difference between predicted and actual values is squared. We already mentioned this metric at several points of this review, for example, in Section 3.1. The MSE objective function belongs to a class of distance metrics, the $\mathrm{\ell}$ -norms, which are defined by the following:

(24) $$\begin{align}{\mathrm{\ell}}_p\left(x,y\right) = {\left(\sum \parallel x-y{\parallel}^p\right)}^{1/p},\end{align}$$

where $x$ and $y$ are vectors, and we use the notation ${{\mathrm{\ell}}_2 = \mathrm{MSE}}$ . One can also use other $\mathrm{\ell}$ -norms, for example, ${\mathrm{\ell}}_1$ , which is more robust to outliers than ${\mathrm{\ell}}_2$ since it penalizes large deviations less severely. Note that the number of non-zero values is sometimes referred to as the ‘ ${\mathrm{\ell}}_0$ ’ norm (see Section 3.4), even though it does not follow from Equation (24).

Another popular similarity measure is the Kullback–Leibler (KL) divergence, sometimes called relative entropy, which is used to find the closest probability distribution for a given model. It is defined as follows:

(25) $$\begin{align}\mathrm{KL}\left(p|q\right) = \sum p(x)\log \frac{p(x)}{q(x)},\end{align}$$

where $p(x)$ is the probability of observing the value $x$ according to the model, and $q(x)$ is the actual probability of observing $x$ . The KL divergence is minimized when the model prediction $p(x)$ is as close to the actual distribution $q(x)$ as possible. It is a relative measure, that is, it is only defined for pairs of probability distributions. The KL divergence is closely related to the cross-entropy cost function:

(26) $$\begin{align}\mathrm{CE}\left(p,q\right) = -\sum p(x)\log q(x) = \mathrm{KL}\left(p|q\right)-H(p),\end{align}$$

where $p(x)$ and $q(x)$ are probability distributions, and $H(p) = -\sum p(x)\log p(x)$ is the Shannon entropy of the distribution $p(x)$ .

Other objective functions may rely on the maximization or minimization of certain parameters, for example, the beam energy or energy-bandwidth of particle beams produced by a laser–plasma accelerator. Both of these are examples of what is sometimes referred to as summary statistics, as they condense information from more complex distributions, in this case the electron energy distribution. While simple at the first glance, these objectives need to be properly defined and there are often different ways to do so[ Reference Irshad, Karsch and Döpp198]. In the example above, energy and bandwidth are examples for the central tendency and the statistical dispersion of the energy distribution, respectively. These can be measured using different metrics, such as the weighted arithmetic or truncated mean, the median, mode, percentiles, for the former; and full width at half maximum, median absolute deviation, standard deviation, maximum deviation, for the latter. Each of these seemingly similar measures emphasizes different features of the distribution they are calculated from, which can affect the outcome of optimization tasks. Sometimes one might also want to include higher-order momenta as objectives, such as the skewness, or use integrals, for example, the total beam charge.

4.1.2 Pareto optimization

In practice, optimization problems often constitute multiple, sometimes competing objectives ${g}_i$ . As the objective function should only yield a single scalar value, one has to condense these objectives in a process known as scalarization. Scalarization can, for instance, take the form of a weighted product $g = \prod {g}_i^{\alpha_i}$ or sum $g = \sum {\alpha}_i{g}_i$ of the individual objectives ${g}_i$ with the hyperparameters ${\alpha}_i$ describing the weight. Another common scalarization technique is $\epsilon$ -constraint scalarization, where one seeks to reformulate the problem of optimizing multiple objectives into a problem of single-objective optimization conditioned on constraints. In this method the goal is to optimize one of the ${g}_i$ given some bounds on the other objectives. All of these techniques introduce some explicit bias in the optimization, which may not necessarily represent the desired outcome. Because of this, the hyperparameters of the scalarization may have to be optimized themselves by running optimizations several times.

Figure 12 Pareto front. Illustration of how a multi-objective function $f(x) = y$ acts on a 2D input space $x = \left({x}_1,{x}_2\right)$ and transforms it to an objective space $y = \left({y}_1,{y}_2\right)$ on the right. The entirety of possible input positions is uniquely color-coded on the left and the resulting position in the objective space is shown in the same color on the right. The Pareto-optimal solutions form the Pareto front, indicated on the right, whereas the corresponding set of coordinates in the input space is called the Pareto set. Note that both the Pareto front and Pareto set may be continuously defined locally, but can also contain discontinuities when local maxima become involved. Adapted from Ref. [Reference Irshad, Karsch and Döpp199].

A more general approach is Pareto optimization, where the entire vector of individual objectives $g = \left({g}_1,\dots, {g}_N\right)$ is optimized. To do so, instead of optimizing individual objectives, it is based on the concept of dominance. A solution is said to dominate other solutions if it is both at least as good on all objectives and strictly better than the other solutions on at least one objective. Pareto optimization uses implicit scalarization by building a set of non-dominated solutions, called the Pareto front, and maximizing the diversity of solutions within this set. The latter can, for instance, be quantified by the hypervolume of the set or the spread of solutions along each individual objective. As it works on the solution for all objectives at once, Pareto optimization is commonly referred to as multi-objective optimization. An illustration of both the Pareto front and set is shown in Figure 12, where a multi-objective function $f$ ‘morphs’ the input space into the objective space. In this example $f$ is a modified version of the Branin–Currin function[ Reference Dixon and Szego200, Reference Currin, Mitchell, Morris, Ylvisaker and Amer201], exhibiting a single, global maximum in ${y}_2$ but multiple local maxima in  ${y}_1$ . The individual 2D outputs ${y}_1 = {f}_1\left({x}_1,{x}_2\right)$ , with ${f}_1$ being the Branin function, and ${y}_2 = {f}_2\left({x}_1,{x}_2\right)$ , with ${f}_2$ being the Currin function, are depicted with red and blue colormaps at the bottom.

4.2 Grid search and random search

Once an objective is defined, we can try to optimize it. One of the simplest methods to do so is a grid search, where the input parameter space is explored by taking measurements in regularly spaced intervals. This technique is particularly simple to implement, especially in experiments, and therefore remains very popular in the setting of experimental laser physics. However, the method severely suffers from the ‘curse of dimensionality’, meaning that the number of measurements increases exponentially with the number of dimensions considered. In practice, the parameter space therefore has to be low-dimensional (1D, 2D or 3D at most) and it is applied to the optimization of selected parameters that appear to influence the outcome the most. One issue with grid scans is that they can lead to aliasing, that is, high-frequency information can be missed due to the discrete grid with a fixed sampling frequency.

A popular variation, in particular in laser–plasma experiments, is the use of sequential 1D line searches. Here, one identifies the optimum in one dimension, then performs a scan of another parameter and moves to its optimum, and so forth. This method can converge much faster to an optimum, but is only applicable in settings with a single, global optimum.

Random search is a related method where the sampling of the input parameter space is random instead of regular. This method can be more efficient than grid search, especially if the system involves coupled parameters and has a lower effective dimensionality[ Reference Bergstra, Bengio and Mach202]. It is therefore often used in optimization problems with a large number of free parameters. However, purely random sampling also has drawbacks. For instance, it has a tendency to cluster, that is, to sample points very close to others, while leaving some areas unexplored. This behavior is undesirable for signals without high-frequency components and instead one would rather opt for a sampling that explores more of the parameter space. For this case, a variety of schemes exist that combine features of grid search and random search. Two popular examples are jittered sampling and Latin hypercube sampling[ Reference Kollig and Keller203]. For the former, samples are randomly placed within regularly spaced intervals, while the latter does so while maintaining an even distribution in the parameter space.

Grid and random search are often used for initial exploration of a parameter space to seed subsequent optimization with more advanced algorithms. An example for this is shown in Figure 15 (a), where grid search is combined with the downhill simplex method discussed in the next section.

4.3 Downhill simplex method and gradient-based algorithms

In the downhill simplex method, also known as the Nelder–Mead algorithm[ Reference Nelder and Mead204], an array of $\left(n+1\right)$ input parameter sets from an $n$ -dimensional space is evaluated to get the corresponding function values. The worst-performing evaluation point is modified at each iteration by translating it towards or through the center of the simplex. This continues until the simplex converges to a point at which the function is optimized. The method is simple and efficient, which is why it is popular in various optimization settings. The convergence speed crucially depends on the initial choice of the simplex, with a large distance between input parameters leading to a more global optimization, while small simplex settings result in local optimization.

In the limit of small simplex size, the Nelder–Mead algorithm is conceptually related to gradient-based methods for optimization. The latter are based on the concept of using the gradient of the objective function to find the direction of the steepest slope. The objective function is then minimized along this direction using a suitable algorithm, such as gradient descent. Momentum descent is a popular variation of gradient descent where the gradient of the function is multiplied by a value, in analogy to physics called momentum, before being subtracted from the current position. This can help the algorithm converge to the local minimum faster. These methods typically require more and smaller iterations than the downhill simplex method, but can be more accurate.

Figure 13 Genetic algorithm optimization. (a) Basic working principle of a genetic algorithm. (b) Sketch of a feedback-optimized LWFA via genetic algorithm. (c) Optimized electron beam spatial profiles using different figures of merit. Subfigures (b) and (c) adapted from Ref. [Reference He, Hou, Lebailly, Nees, Krushelnick and Thomas194].

In both cases, measurement noise can easily result in a wrong estimation of the gradient. In the setting of laser–plasma experiments, it is therefore important to reduce such noise, for example, by taking several measurements at the same position. While this may be possible when working with high-repetition-rate systems, as was demonstrated by Dann et al. [ Reference Dann, Baird, Bourgeois, Chekhlov, Eardley, Gregory, Gruse, Hah, Hazra, Hawkes, Hooker, Krushelnick, Mangles, Marshall, Murphy, Najmudin, Nees, Osterhoff, Parry, Pourmoussavi, Rahul, Rajeev, Rozario, Scott, Smith, Springate, Tang, Tata, Thomas, Thornton, Symes and Streeter195], gradient-based methods are in general less suitable to be used in high-power-laser settings. Other popular derivative-based algorithms are the conjugate gradient (CG) method, the quasi-Newton (QN) method and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method, all of which are discussed by Nocedal and Whright[ Reference Nocedal and Wright151].

4.4 Genetic algorithms

One of the first families of algorithms to find application in field LPA was genetic algorithms. As a sub-class of evolutionary methods, these nature-inspired algorithms start with a pre-defined or random set, called the population, of measurements for different input settings. Each free parameter is called a gene and, similar to natural evolution, these genes can either crossover between most successful settings or randomly change (mutate). This process is guided by a so-called fitness function, which is designed to yield a single-valued figure of merit that is aligned with the optimization goal. Depending on the objective, the individual measurements are ranked from most to least fit. The fittest ‘individuals’ are then used to spawn a new generation of ‘children’, that is, a new set of measurements based on crossover and mutation of the parent genes. A popular variation in genetic algorithms is differential evolution[ Reference Storn and Price205], which employs a different type of crossover. Instead of crossing over two parents to create a child, differential evolution uses three parent measurements. The child is then created by adding a weighted difference between the parents to a random parent. This process is repeated until a new generation is created; see Figure 15(b) for an example.

One particular strength of genetic algorithms compared to many other optimization methods is their ability to perform multi-objective optimization, that is, when multiple, potentially conflicting, objectives are to be optimized. A popular example would be non-dominated sorting genetic algorithms (NSGAs)[ Reference Deb, Agrawal, Pratap and Meyarivan206]. Here the name-giving sorting technique ranks the individuals by their population dominance, that is, how many other individuals in the population the individual dominates. Other multi-objective approaches are, for instance, based on optimizing niches, similar-valued regions of the Pareto front[ Reference Horn, Nafpliotis and Goldberg207].

It should be noted that genetic algorithms intrinsically operate on population batches and not on a single individual. While this can be beneficial for parallel processing in simulations, it can make it more difficult to employ in an online optimization setup.

Genetic algorithms have been used since the early 2000s to control high-harmonic generation[ Reference Bartels, Backus, Zeek, Misoguti, Vdovin, Christov, Murnane and Kapteyn208, Reference Yoshitomi, Nees, Miyamoto, Sekikawa, Kanai, Mourou and Watanabe209], cluster dynamics[ Reference Moore, Mendham, Symes, Robinson, Springate, Mason, Smith, Tisch and Marangos210, Reference Zamith, Martchenko, Ni, Aseyev, Muller and Vrakking211] and ion acceleration[ Reference Nayuki, Fujii, Oishi, Takano, Wang, Andreev, Nemoto and Ueda212]. An influential example in the context of high-power lasers was published by He et al. in 2015[ Reference He, Hou, Lebailly, Nees, Krushelnick and Thomas194], and a sketch of their feedback-looped experimental setup is presented in Figure 13. In the paper, a genetic algorithm is used to optimize various parameters of a laser–plasma accelerator, namely the electron beam angular profile, energy distribution, transverse emittance and optical pulse compression. This was done by controlling the voltage on 37 actuators of a deformable mirror (DM), which was used to shape a laser pulse before it entered a gas jet target. The genetic algorithm was initialized with a population of 100 individuals and each subsequent generation was generated based on the 10 fittest individuals[ Reference He, Hou, Gao, Lebailly, Nees, Clarke, Krushelnick and Thomas213]. A similar experiment was performed using self-modulated LWFA[ Reference Lin, Ma, Schwartz, Woodbury, Nees, Mathis, Thomas, Krushelnick and Milchberg214] driven by a mid-infrared laser pulse. The electron beam charge, energy spectra and beam pointing have been optimized. The combination of genetic algorithm and DM has been applied to other high-power laser experiments as well[ Reference Hah, Jiang, He, Nees, Hou, Thomas and Krushelnick215 Reference Englesbe, Lin, Nees, Lucero, Krushelnick and Schmitt-Sody219].

Streeter et al. [ Reference Streeter, Dann, Scott, Baird, Murphy, Eardley, Smith, Rozario, Gruse, Mangles, Najmudin, Tata, Krishnamurthy, Rahul, Hazra, Pourmoussavi, Osterhoff, Hah, Bourgeois, Thornton, Gregory, Hooker, Chekhlov, Hawkes, Parry, Marshall, Tang, Springate, Rajeev, Thomas and Symes220] used a similar technique to optimize ultra-fast heating of atomic clusters at joule-level energies, but instead of controlling the spatial wavefront, they controlled the spectral phase up to its fourth order. The genetic algorithm was based on a population of 15 individuals and $4{-}8$ children generations were evaluated. This method can also be used for optimizing other laser parameters, such as focal spot size, focus position and chirp [ Reference Dann, Baird, Bourgeois, Chekhlov, Eardley, Gregory, Gruse, Hah, Hazra, Hawkes, Hooker, Krushelnick, Mangles, Marshall, Murphy, Najmudin, Nees, Osterhoff, Parry, Pourmoussavi, Rahul, Rajeev, Rozario, Scott, Smith, Springate, Tang, Tata, Thomas, Thornton, Symes and Streeter195].

Another example is the use of differential evolution for optimization of the laser pulse duration in a SPIDER and DAZZLER feedback loop, as presented by Peceli and Mazurek[ Reference Peceli and Mazurek221]. The genetic algorithm method has also been employed in ion acceleration in a laser–plasma accelerator, where Smith et al. [ Reference Smith, Orban, Morrison, George, Ngirmang, Chowdhury and Roquemore222] optimized the conversion efficiency from laser energy to ion energy by exploring thousands of target density profiles in 1D particle-in-cell (PIC) simulations.

4.5 Bayesian optimization

BO[ Reference Shahriari, Swersky, Wang, Adams and de Freitas223, Reference Frazier224] is a model-based global optimization method that uses probabilistic modeling, which was discussed in Section 2.1.3. The strength of BO lies in its efficiency with regard to the number of evaluations. This is particularly useful for problems with comparably high evaluation costs or long evaluation times. To achieve this, BO uses the probabilistic surrogate model to make predictions about the behavior of the system at new input parameter settings, providing both a value estimate and an uncertainty.

At the core of BO lies what is called the acquisition function, a pre-defined function that suggests the next points to probe on a probabilistic model. The latter is usuallyFootnote 9 a GP fitted from training points (see Section 2.1.4), thus providing a cheap-to-evaluate surrogate function. A simple and intuitive example of an acquisition function is the upper confidence bound:

(27) $$\begin{align}\mathrm{UCB}(x) = \mu (x)+\kappa \sigma (x),\end{align}$$

which weighs the mean prediction $\mu$ versus the variance prediction $\sigma$ , with a user-chosen hyperparameter $\kappa$ . For $\kappa = 0$ the optimization will act entirely exploitatively, that is, it will move to the position of the highest expected return, whereas a large $\kappa$ incentivizes the reduction of the uncertainty and exploration of the parameter space. Other common acquisition functions are the expected improvement[ Reference Jones, Schonlau and Welch225], knowledge gradient[ Reference Frazier, Powell and Dayanik226] and entropy search[ Reference Hennig, Schuler and Mach227]. As the surrogate model can be probed at near-negligible evaluation cost, this optimization can be performed using numerical optimization methods, such as the gradient-based methods discussed in Section 4.3. The position of the acquisition function’s optimum is then used as an input parameter setting to evaluate the actual system. This process is repeated until some convergence criteria are achieved, a pre-defined number of iterations is reached or the allocated resources have been otherwise exhausted.

BO provides a very flexible framework that can be further adapted to various different optimization settings. For instance, it has proven to be a valid alternative to evolutionary methods when it comes to solving multi-objective optimization problems. The importance of this method for LPA stems from the fact that many optimization goals, such as beam energy and beam charge, are conflicting in nature and require the definition of a trade-off. The goal of Pareto optimization is to find the Pareto front, which is a surface in the output objective space that comprises all the non-dominated solutions (see Section 4.1.2). A common metric that is used to measure the closeness of a set of points to the Pareto-optimal points is the hypervolume. The BO algorithm works by using the expected hypervolume improvement[ Reference Yang, Emmerich, Deutz and Bäck228] to increase the extent of the current non-dominated solutions, thus optimizing all possible combinations of individual objectives. Note that the Pareto front, like the global optimum of single-objective optimization, is usually not known a priori.

Figure 14 Bayesian optimization of a laser–plasma X-ray source. (a) The objective function (X-ray counts) as a function of iteration number (top) and the variation of the control parameters (bottom) during optimization. (b) X-ray images obtained for the initial (bottom) and optimal (top) settings. Adapted from Ref. [Reference Shalloo, Dann, Gruse, Underwood, Antoine, Arran, Backhouse, Baird, Balcazar, Bourgeois, Cardarelli, Hatfield, Kang, Krushelnick, Mangles, Murphy, Lu, Osterhoff, Põder, Rajeev, Ridgers, Rozario, Selwood, Shahani, Symes, Thomas, Thornton, Najmudin and Streeter196].

Figure 15 Illustration of different optimization strategies for a non-trivial 2D system, here based on a simulated laser wakefield accelerator with laser focus and plasma density as free parameters. The total beam charge, shown as contour lines in plots (a)–(c) serves as the optimization goal. The position of the optimum is marked by a red circle, located at a focus position of $-0.74\;\mathrm{mm}$ and a plasma density of $8\times {10}^{18}\ {\mathrm{cm}}^{-3}$ . In panel (a), a grid search strategy with subsequent local optimization using the downhill simplex (Nelder–Mead) algorithm is shown. Panel (b) illustrates differential evolution and (c) is based on Bayesian optimization using the common expected improvement acquisition function. The performance for all three examples is compared in panel (d). It shows the typical behavior that Bayesian optimization needs the least and the grid search requires the most iterations. The local search via the Nelder–Mead algorithm converges within some 20 iterations, but requires a good initial guess (here provided by the grid search). Individual evaluations are shown as shaded dots. Note how the Bayesian optimization starts exploring once it has found the maximum, whereas the evolutionary algorithm tends more towards exploitation around the so-far best value. This behavior is extreme for the local Nelder–Mead optimizer, which only aims to exploit and maximize to local optimum.

Another possible way to extend BO is to incorporate different information sources[ Reference Marco, Berkenkamp, Hennig, Schoellig, Krause, Schaal and Trimpe229]. This is often done by adding an additional information input dimension to the data model. This method is often referred to as multi-task (MT) or multi-fidelity (MF) BO. Both terms are used somewhat interchangeably in the literature, although MT often refers to optimization with entirely different systems (codes, etc.), whereas MF focuses on different fidelity (resolution, etc.) of the same information source. The core concept behind these methods is that the acquisition function not only encodes the objective, but also minimizes the measurement cost. These multi-information-source methods have the potential to speed up optimization significantly. They can also be combined with multi-objective optimization, as shown by Irshad et al. [ Reference Irshad, Karsch and Döpp199].

A first demonstration of BO in the context of LPA was presented by Lehe[ Reference Lehe230], who used the method to determine the injection threshold in a set of PIC simulations. The use of BO in experiments was pioneered by Shalloo et al. [ Reference Shalloo, Dann, Gruse, Underwood, Antoine, Arran, Backhouse, Baird, Balcazar, Bourgeois, Cardarelli, Hatfield, Kang, Krushelnick, Mangles, Murphy, Lu, Osterhoff, Põder, Rajeev, Ridgers, Rozario, Selwood, Shahani, Symes, Thomas, Thornton, Najmudin and Streeter196] (Figure 14), who demonstrated optimization of electron and X-ray beam properties from LWFA by automated control of laser and plasma control parameters. Another work by Jalas et al. [ Reference Jalas, Kirchen, Messner, Winkler, Hübner, Dirkwinkel, Schnepp, Lehe and Maier197] focused on improving the spectral charge density using the objective function $\sqrt{Q}\tilde{E}/{E}_{\mathrm{MAD}}$ , thus incorporating the beam charge $Q$ , the median energy $\tilde{E}$ and the energy spread defined here by the median absolute deviation ${E}_{\mathrm{MAD}}$ . In contrast to Shalloo et al., they used shot-to-shot measurements of the control parameters to train the model rather than relying on the accuracy of their controllers, reducing the level of output variation attributed purely to stochasticity. BO has also been applied to the optimization of laser-driven ion acceleration in numerical simulations[ Reference Dolier, King, Wilson, Gray and McKenna231] and experiments[ Reference Loughran, Streeter, Ahmed, Astbury, Balcazar, Borghesi, Bourgeois, Curry, Dann, DiIorio, Dover, Dzelzanis, Ettlinger, Gauthier, Giuffrida, Glenn, Glenzer, Green, Gray, Hicks, Hyland, Istokskaia, King, Margarone, McCusker, McKenna, Najmudin, Parisuaña, Parsons, Spindloe, Symes, Thomas, Treffert, Xu and Palmer232]. A first implementation of multi-objective optimization in numerical simulations of plasma acceleration was published by Irshad et al. [ Reference Irshad, Karsch and Döpp198], who showed that multi-objective optimization can lead to superior performance compared with manually trying different trade-off definitions or settings for the individual objectives, in this case, the beam charge, energy spread and distance to a target energy. An example of MT BO in laser–plasma simulations was recently published by Pousa et al. [ Reference Pousa, Hudson, Huebl, Jalas, Kirchen, Larson, Lehé, de la Ossa, Thévenet and Vay233], who combined the Wake-T and FBPIC codes, while Irshad et al. [ Reference Irshad, Karsch and Döpp198] used the FBPIC code at different resolutions for MF optimization.

4.6 Reinforcement learning

Reinforcement learning (RL)[ Reference Sutton and Barto234] differs fundamentally from the optimization methods discussed so far. RL is a method of learning the optimal behavior (the policy $\unicode{x3c0}$ ) to control a system. The learning occurs via repeated trial and error, called episodes, where each episode is started in an initial state of the system, and then the agent (i.e., the optimizer) interacts with the system according to its policy. The agent then receives a reward signal, which is a scalar value that indicates the success of the current episode and its goal is to maximize the expected reward, analogous to the objective function in other optimization methods. The agent is said to learn when it is able to improve its policy to achieve a higher expected reward.

Figure 16 Sketch of deep reinforcement learning. The agent, which consists of a policy and a learning algorithm that updates the policy, sends an action to the environment. In the case of model-based reinforcement learning, the action is sent to the model, which is then applied to the environment. Upon the action to the environment, an observation is made and sent back to the agent as a reward. The reward is used to update the policy via the learning algorithm in the agent, which leads to an action in the next iteration.

The policy itself has traditionally been represented using a Markov decision process (MDP), but in recent years deep reinforcement learning (DRL) has become widely used, in which the policy is represented using deep neural networks[ Reference Arulkumaran, Deisenroth, Brundage and Bharath235, Reference Li236]. However, while we commonly update weights and biases via back-propagation in supervised deep learning, the learning in DRL is done in an unsupervised way. Indeed, while the agent is trying to learn the optimal policy to maximize the reward signal, the reward signal itself is unknown to the agent. The agent only knows the reward signal at the end of the episode, so it is not possible to perform back-propagation. Instead, the policy network can, for instance, be updated using evolutionary strategies (see Section 4.4), where the agents achieving the highest reward are selected to create a new generation of agents. Another common approach is to use a so-called actor–critic strategy[ Reference Grondman, Busoniu, Lopes and Babuska237], where a second network is introduced, called the critic. At the end of each episode, the critic is trained to estimate the expected long-term reward from the current state, called the value. This expected reward signal is then used to train the actor network to adjust the policy. The policy gradient[ Reference Sutton, McAllester, Singh and Mansour238] is a widely used algorithm for training the policy network using the critic network to calculate the expected reward signal.

RL algorithms can be further divided into two main classes: model-free and model-based learning. Model-free methods learn as discussed above directly by trial and error, only implicitly learning about the environment. Model-based methods, on the other hand, explicitly build a model of the environment, which can be used for both planning and learning (somewhat similar to optimization using surrogate models discussed earlier). The arguably most popular method to build models in RL is again the use of neural networks, as they can learn complex, nonlinear relationships and are also capable of learning from streaming data, which is essential in RL. A crucial advantage of the model-based approach is that it can drastically speed up training, although performance is always limited by the quality of the model. In the case of real-life systems this is sometimes referred to as the ‘reality gap’.

One crucial advantage of RL is that once the training process is completed, the computational requirements of running an optimization are heavily reduced. A simplified representation of the RL workflow is sketched in Figure 16.

An example of an RL application is the work of Kain et al. [ Reference Kain, Hirlander, Goddard, Velotti, Porta, Bruchon and Valentino239] for trajectory optimization in the Advanced Wakefield (AWAKE) experiment in plasma wakefield acceleration, which found the optimum in just a few iterations based on 300 iterations of training. There are many other examples for the use of RL at accelerator facilities, for example, Refs. [Reference Gao, Chen, Robertazzi and Brown240Reference John, Herwig, Kafkes, Mitrevski, Pellico, Perdue, Quintero-Parra, Schupbach, Seiya, Tran, Schram, Duarte, Huang and Keller243].

5 Unsupervised learning

In this section we are going to discuss unsupervised learning techniques for exploratory data analysis. The term ‘unsupervised’ refers to the case where one does not have access to training labels, and therefore the aim is not to find a mapping between training data and labels, as is often the case for deep learning. Rather, the aim is to detect relationships between features of the data.

For high-power laser experiments, many parameters will be coupled in some way such that there are correlations between different measurable input quantities. For example, increasing the laser energy in the amplifier chain of a high-power laser can affect the laser spectrum or beam profile. To understand the effect a change to any one of these parameters will have, it is important to consider their correlation. However, an experimental setup can easily involve tens of parameters and interpreting correlations becomes increasingly difficult. In this context, it can be useful to distill the most important (combinations of) parameters in a process called dimensionality reduction. The same method also plays a crucial role in efficient data compression, which is becoming increasingly important due to the large amount of data produced in both experiments and simulation. These methods are also closely related to pattern recognition, which addresses the issue of automatically discovering patterns in data.

5.1 Clustering

Data clustering is a machine learning task of finding similar data points and dividing them into groups, even if the data points are not labeled. This can, for instance, be useful to separate a signal from the background in physics experiments.

A popular centroid-based clustering algorithm is the $K$ -means algorithm, which consists of two steps. First, the algorithm randomly assigns a cluster label to each point. Then, in the second step, it calculates the center point of each cluster and re-assigns the cluster label to each point based on the proximity to the cluster center. This process is repeated until the cluster assignment does not change anymore. The $K$ in the algorithm’s name represents the number of clusters, which can be guessed or – more quantitatively – be estimated using methods such as the ‘silhouette’ value or the ‘elbow method’[ Reference Rousseeuw244]. As the method is quite sensitive to the initial random choice of the cluster assignment, it is often run several times with different initial choices to find the optimal classification.

In contrast to centroid-based clustering, in which each cluster is defined by a center point, in distribution-based clustering, each cluster is defined by a (multivariate) probability distribution. In the simplest case this can be a Gaussian distribution with a certain mean and variance for each cluster. More advanced methods use a Gaussian mixture model (GMM), in which each cluster is represented as a combination of Gaussian distributions. A popular distribution-based clustering method is the EM algorithm[ Reference Dempster, Laird, Rubin and Stat155].

An example for the application of a GMM in data processing is shown in Figure 17. There a number of energy spectra from a laser wakefield accelerator are displayed. As the spectra exhibit multiple peaks, standard metrics such as the mean and standard deviation are not characteristic of either the peaks’ positions or their widths. To avoid this problem a mixture model is used that isolates the spectral peaks. To this end, a combination of Gaussian distributions is fitted to the data and then a spectral bin is assigned with a certain probability to each distribution.

Figure 17 Data treatment using a Gaussian mixture model (GMM). Top: 10 consecutive shots from a laser wakefield accelerator. Middle: the same shots using a GMM to isolate the spectral peak at around 250 MeV. Bottom: average spectra with and without GMM cleaning. Adapted from Ref. [Reference Irshad, Eberle, Foerster, Grafenstein, Haberstroh, Travac, Weisse, Karsch and Döpp245].

Figure 18 Correlogram – a visualization of the correlation matrix – of different variables versus yield at the NIF. Color indicates the value of the correlation coefficient. In this particular representation the correlation is also encoded in the shape and angle of the ellipses, helping intuitive understanding. The strongest correlation to the fusion yield is observed with the implosion velocity ${V}_{\mathrm{imp}}$ and the ion temperature ${T}_{\mathrm{ion}}$ . There is also a clear anti-correlation observable between the down-scattered ratio (DSR) and ${T}_{\mathrm{ion}}$ and, in accordance with the previously stated correlation of ${T}_{\mathrm{ion}}$ and yield, a weak anti-correlation of the DSR and yield. Note that all variables perfectly correlate with themselves by definition. Plot was generated based on data presented by Hsu et al. [ Reference Hsu, Cheng and Bradley96]. Further explanation (labels, etc.) can be found therein.

5.2 Correlation analysis

A simple method for exploring correlations is the correlation matrix, which is a type of matrix that is used to measure the relationship between two or more variables. We can calculate its coefficients, also known as Pearson correlation coefficients, on a set of $n$ measurements of each pair of parameters ${x}_i$ and ${y}_i$ as follows:

(28) $$\begin{align}{r}_{xy} = \frac{\sum_{i = 1}^n\left({x}_i-\overline{x}\right)\left({y}_i-\overline{y}\right)}{\sqrt{\sum_{i = 1}^n{\left({x}_i-\overline{x}\right)}^2}\sqrt{\sum_{i = 1}^n{\left({y}_i-\overline{y}\right)}^2}},\end{align}$$

where $\overline{x}$ and $\overline{y}$ are the mean values of variables $x$ and $y$ , respectively. The correlation coefficient $r$ is a number between –1 and 1. A value of 1 means that two variables are perfectly correlated, while a value of –1 means that two variables are perfectly anti-correlated. A value of 0 means that there is no correlation between two variables.

The correlation matrix and its visualization, sometimes called a correlogram, allow for a quick way to look for interesting and unexpected correlations. An example of this is shown in Figure 18. Note that by reducing correlations to a single linear term one can miss more subtle or complex relationships between variables. For such cases more general measures of correlation exist, such as the Spearman correlation coefficient that measures how well two variables can be described by any monotonic function.

5.3 Dimensionality reduction

Many datasets contain high-dimensional data but are governed by a few important underlying parameters. Signal decomposition and dimensionality reduction are processes that reduce the dimensionality of the data by separating a signal into its components or projecting it onto a lower-dimensional subspace so that the essential structure of the data is preserved. There are many ways to perform dimensionality reduction, two of the most common ones being principal component analysis (PCA) and AEs.

PCA is a very popular linear transformation technique that is used to convert a set of observations of possibly correlated variables into a (smaller) set of values of linearly uncorrelated variables, called the principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. One method to perform PCA is to use SVD, which is used to decompose the matrix of data into a set of eigenvectors and eigenvalues. PCA shares a close relationship with correlation analysis, as the eigenvectors of the correlation matrix match those of the covariance matrix, which is utilized in defining PCA. In addition, the eigenvalues of the correlation matrix equate to the squared eigenvalues of the covariance matrix, provided that the data have been normalized. Kernel PCA[ Reference Schölkopf, Smola and Müller246] is an extension of PCA that uses a nonlinear transformation of the data to obtain the principal components. A relatively new variation, with some relation to the priorly discussed CS (Section 3.4), is robust principal component analysis (RPCA)[ Reference Candès, Li, Ma and Wright247]. RPCA is a modification of the original algorithm, which is better suitable to handle the presence of outliers in datasets. PCA should not be confused with the similarly named independent component analysis (ICA)[ Reference Hyvärinen and Oja248], which is a popular technique to decompose a multivariate signal into a sum of statistically independent non-Gaussian signals.

There are also many neural network approaches to dimensionality reduction, one of the most popular ones being AEs. The purpose of an AE is to learn an approximation to the identity function, that is, a function that reproduces the input. In a standard AE, a neutral network is created with an intermediate bottleneck layer with a reduced number of nodes, known as the latent space. During training, the neural network hyperparameters are optimized so that the output matches the input as closely as possible, typically by minimizing the MSE. Due to the bottleneck, the AE automatically discovers an efficient representation for the data in the latent space. The hidden layers up to the latent space are known as the decoder and the hidden layers from the latent space to the output layer are the encoder. The encoder can then be used separately to perform dimension reduction, equivalent to lossy data compression. With the corresponding decoder, an approximation of the original data can be extracted from its latent space.

There exist many different types of AE architecture, a particularly popular one being variational autoencoders (VAEs). VAEs replace deterministic encoder–decoder layers with a stochastic architecture (c.f. Figure 6) to allow the model to provide a probability distribution over the latent space. As a result, a VAE’s latent space is smooth and continuous in contrast to a standard AE’s latent space, which is discrete. This allows VAEs to also generate new data by sampling from the latent space.

AEs have also shown a strong potential as advanced compression techniques that can be highly adapted to many kinds of inputs. In this case, one trains an AE model to find an approximation of the identity function for some raw data. After training, the raw data are sent through the encoding layer; only the dimensionality reduces and highly compressed latent space representation in the bottleneck layer is saved. Decompression is achieved by sending the data through the decoding layer. This method is not only relevant to reduce disc space occupied by data; AEs are nowadays frequently used as an integral part of complex machine learning tasks, where the latent space is used as a lower-dimensional input for, for example, a diffusion network (as part of what is called a latent diffusion model[ Reference Rombach, Blattmann, Lorenz, Esser and Ommer249]) or a Bayesian optimizer[ Reference Griffiths and Hernández-Lobato250].

An example of the modeling using the latent space of an AE was recently published by Willmann et al. [ Reference Willmann, Stiller, Debus, Irman, Pausch, Chang, Bussmann and Hoffmann251]. Working on the problem of simulating shadowgrams from plasma probing, they used an AE to reduce the 3D input data and then trained a small four-layer perceptron network to learn how to approximate the shadowgram formation. An example of pure compression was recently presented by Stiller et al. [ Reference Stiller, Makdani, Pöschel, Pausch, Debus, Bussmann and Hoffmann252], who applied an AE to compress data from PIC simulations, showing promising first results.

6 Image analysis

In the previous section we discussed how to analyze datasets by looking for correlations or compressed representations. A closely related group of tasks occurs when dealing with image data, that is, image recognition or classification, object detection and segmentation. While the methods in the previous section dealt with features learnt from the data itself, the techniques discussed in this section aim to identify or locate specific features in our data (in particular images). As such, these are all considered supervised learning methods.

6.1 Classification

The classification problem in machine learning is the problem of correctly labeling a data point from a given set of data points with the correct label. The data points are labeled with a categorical class label, such as ‘cat’, ‘dog’, ‘electron’ or ‘ion’ (see Figure 19(a)).

Figure 19 Illustration of common computer vision tasks. (a) Classification is used to assign (multiple) labels to data. (b) Detection goes a step further and adds bounding boxes. (c) Segmentation provides pixel maps with exact boundaries of the object or feature.

In the following we are going to briefly discuss some of the most important machine learning techniques used for classification. It should be noted that classification is closely related to regression, with the main difference being essentially that the model’s output is a class value instead of a value prediction. As such, methods for working in regression can in general also be applied to classification tasks. One example is the decision tree method, which we already discussed in Section 2.1.5.

6.1.1 Support vector machines

Support vector machines (SVMs) are a popular set of machine learning methods used primarily in classification and recognition. For a simple binary classification task, a SVM draws a hyperplane that divides all data points of one category from all data points of the other category. While there could be many hyperplanes, an SVM looks for the optimal hyperplane that best separates the dataset by maximizing the margin between the hyperplane and the data points in both categories. The points that locate right on the margin are called ‘supporting vectors’. For a dataset with more than two categories, classification is performed by dividing the multi-classification problem into several binary classification problems and then finding a hyperplane for each. In practice, the data points in two categories can mix together so that they cannot be clearly divided by a linear hyperplane. For such nonlinear classification tasks, the kernel trick is used to compute the nonlinear features of the data points and map them to a high-dimensional space, so that a hyperplane can be found to separate the high-dimensional space. The hyperplane will then be mapped back to the original space.

The ideal application scenario for an SVM is for datasets with small samples but high dimensions. The choice of various kernel functions also adds to the flexibility of this method. However, an SVM would be very computationally expensive for large datasets. Besides, its accuracy can significantly decrease when analyzing datasets with large noise levels, as the hyperplanes cannot be defined precisely. Therefore, especially in the context of high-power laser experiments, one has to be cautious when applying SVMs if there are considerable stability issues.

6.1.2 Convolutional neural networks

CNNs are a type of neural network (cf. Section 2.1.6) that is particularly well-suited for image classification[ Reference Rawat and Wang253], but are also used in various other problems.

Such a network is composed of sequential convolutional layers, in each of which an $N\times N$ kernel (or ‘filter’ matrix) is convolved with the output of the previous layer. This operation is done by sliding the kernel over the input image, and each pixel in the output layer is calculated by the dot product of the kernel with a sub-section of the input image centered around the corresponding pixel. Resultingly, convolutional layers are capable of detecting local patterns in the $N\times N$ region of the kernel. The kernel is parameterized with weights, which are learned via back-propagation as in a regular neural network. Within a layer, there can also be multiple channels of the output, practically thought of as multiple kernels being passed over the image, allowing for different features to be detected. The early layers of a CNN detect simple structures such as edges, but by adding multiple layers with varying kernel sizes, the network can perform high-level abstraction in order to detect complicated patterns.

The convolutional layer makes the CNN very efficient for image classification. It allows the network to learn translation-invariant features – a feature learned at a certain position of an image can be recognized at any position on the same image.

In order to detect patterns that are non-local in the image, pooling is often applied. There exist many schemes of pooling, but the general concept is to take a set of pixels in the input and to apply some operation that turns them into 1 pixel. Examples include max-pooling (taking the maximum of the set of pixels) and average pooling. This operation decreases the dimensions of the image, and therefore allows a subsequent convolutional layer with the same $N\times N$ kernel to detect features that were much further apart in the original image. Typical CNNs will use multiple pooling layers to decrease the dimensions until a kernel can nearly span the whole image to detect any non-local patterns. The output is then flattened and several fully connected layers can be used to manipulate the data for the relevant (i.e., regression or classification) task.

While the use of deeper CNNs with an increasing number of layers tends to improve performance[ Reference Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich254], architectures can suffer from unstable gradients in training via back-propagation[ Reference Balduzzi, Frean, Leary, Lewis, Ma and McWilliams255], showing that some deeper architectures are not as easy to optimize. One particularly influential solution to this problem was the introduction of the residual shortcut, where the input to a block is added to the output of the block. In back-propagation, this enables the ‘skipping’ of layers, effectively simplifying the network and leading to better convergence. This was first proposed and implemented in ResNet [ Reference He, Zhang, Ren and Sun256], which has since become a standard for deep CNN architectures[ Reference Szegedy, Ioffe, Vanhoucke and Alemi257], with any number of variations. It should be mentioned though that a number of competitive networks without residual shortcuts exist, for example, AlexNet [ Reference Krizhevsky, Sutskever and Hinton258].

6.2 Object detection

In the context of image data, object detection can be seen as an extension of classification, yielding both a label as in classification tasks and the position of that object, illustrated in Figure 19(b). The main challenge is that complex images can contain many objects with different features, while the number of objects can also differ from one image to another. Therefore, object detection techniques require a certain flexibility regarding their structure.

The Viola–Jones algorithm[ Reference Viola and Jones259] is one of the most popular object detection algorithms from the 2000s, pre-dating the recent network-based object detectors. Generally, it involves detecting objects by looking at the image as a set of small patches, and identifying so-called Haar-like features. The latter are patterns that occur frequently in images, and can be used to distinguish between different objects. The Viola–Jones algorithm detects objects by first analyzing an image at different scales. For each scale, it looks for features by scanning the image with a sliding window, and for each window, it computes a list of features used to identify the object. If the object is detected, the algorithm returns the bounding box of the object. The Viola–Jones framework does not allow for simultaneous classification and instead requires a subsequent classifier, such as an SVM, for that task. Compared to its network-based alternatives, the Viola–Jones algorithm is worse in terms of precision but better in terms of computational cost, allowing real-time detection at high frame rates[ Reference Nguyen-Meidine, Granger, Kiran and Blais-Morin260].

Object detection networks are more complicated than a regular CNN, because the length of the output layer of the neural network cannot be pre-defined due to the unknown number of objects on an image. A possible solution is to divide the image into many regions and to construct a CNN for each region, but that leads to significant computational cost. Two families of networks are developed to detect objects at reasonable computational cost. The region proposal network (RPN) takes an image as input and outputs a set of proposals for possible objects in the image. A CNN is used to find features in each possible region and to classify the feature into known category. The RPN is trained to maximize the overlap between the proposals and the ground truth objects. The state-of-the-art algorithm in this family is the Faster-RCNN[ Reference Ren, He, Girshick and Sun261].

An alternative to region-based CNNs is the YOLO[ Reference Redmon, Divvala, Girshick and Farhadi262] family. YOLO stands for ‘You Only Look Once’ as it only looks at the image once with a single neural network to make predictions. This is different from other object detection algorithms, which often employ many neural networks for the image. As a result, YOLO is typically much faster, allowing for real-time object detection. The disadvantage is that it is not as accurate as some of the other object detection algorithms. Nevertheless, the fast inference speed of YOLO is particularly appealing to high-power laser facilities with high-repetition-rate capability.

6.3 Segmentation

Semantic segmentation[ Reference Chen, Papandreou, Kokkinos, Murphy and Yuille263] is a related task in computer vision that seeks to create a pixel-by-pixel mapping of the input image to a class label, not only a bounding box as in object detection (see Figure 19(c)). By doing so, segmentation defines the exact boundary of the objects.

Many semantic segmentation architectures are based on FCNs[ Reference Long, Shelhamer and Darrell264] and have evolved considerably in recent years. Since FCNs suffered from the issue of semantic gaps, where the output has a much lower resolution than the input, skip connections were introduced to allow the gradient to back-propagate through the layers to improve the performance. Examples of such advanced network architectures are the U-Net[ Reference Ronneberger, Fischer and Brox180] and the DeepLab network[ Reference Chen, Papandreou, Kokkinos, Murphy and Yuille263], which are based on ResNet-101. Both of these architectures use residual skip connections to maintain gradient flow. The advantage of semantic segmentation compared to standard object detection is that the network can easily localize multiple objects of the same class in an image. The disadvantage is that one needs to train a separate network for each class.

Related is instance segmentation[ Reference Hafiz and Bhat265], which goes a step further and distinguishes each individual instance of an object, not just the class. Instance segmentation is a significant challenge, as it requires the network to be able to distinguish between two instances of the same object, for example, two cats. Instance segmentation architectures are typically based on mask R-CNN[ Reference He, Gkioxari, Dollár and Girshick266], which combines a CNN with a region-based convolutional neural network (R-CNN)[ Reference Girshick, Donahue, Darrell and Malik267] and an FCN[ Reference Long, Shelhamer and Darrell264]. Note that the mask R-CNN can be used for both semantic and instance segmentation.

Figure 20 Application of object detection to a few-cycle shadowgram of a plasma wave: the plasma wave, the shadowgram of a hydrodynamic shock and the diffraction pattern caused by dust are correctly identified by the object detector and located with bounding boxes. Adapted from Ref. [Reference Lin, Haberstroh, Karsch and Döpp273].

One of the prime examples for machine-learning-assisted image analysis is the automated detection and classification of laser damage. Researchers at the NIF have pioneered this approach with several works on neural networks for damage classification. For instance, Amorin et al. [ Reference Amorin, Kegelmeyer and Kegelmeyer268] trained CNNs based on the AlexNet and Inception architectures to distinguish between different kinds of laser damage. Another example for the use of both SVM- and CNN-based classification in high-power laser systems was recently presented by Pascu[ Reference Pascu269], who used both techniques for (supervised) anomaly detection in a laser beam profile at the ELI-NP facility. Chu et al. [ Reference Chu, Zhang, Tian, Zhang, Wang, Chen and Geng270] presented a first application of image segmentation to locate laser-induced defects on optics in real time using a U-Net. Ben Soltane et al. [ Reference Ben Soltane, Hallo, Lacombe, Lamaignère, Bonod and Néauport271] recently presented a deep learning pipeline to estimate the size of damages in glass windows at the Laser Mégajoule (LMJ) facility, using a similar U-Net architecture for segmentation. Li et al. [ Reference Li, Han, Ouyang, Zhang, Guo, Liu and Zhu272] combined damage detection via a deep neural network with postprocessing to position laser damages in 3D space. The axial distance between the damage site and the imaging system is obtained numerically by the principle of holographic focusing. More examples for applications of object detection in a high-power laser laboratory have been reported in the work of Lin et al. [ Reference Lin, Haberstroh, Karsch and Döpp273]. In addition to the aforementioned case of optical damages in a high-power laser beamline, the authors fine-tuned the YOLO network for object detection in the few-cycle shadowgraphy of plasma waves and electron beam detection in an electron spectrometer. An example of the detected features in a shadowgram is presented in Figure 20. The position and size of the detected objects are used to determine information about the physical quantities, such as the plasma wavelength and plasma density distribution.

7 Discussion and conclusions

In this paper, we have presented an overview of techniques and recent developments in machine learning for laser–plasma physics. As we have seen, early proof-of-concept papers appeared in the late 1990s and early 2000s, but the computing power available at the time was typically not sufficient to make the approaches competitive with established methods or to reach a suitable level of accuracy. In the mid-2010s, a resurgence of interest in the field began, with an ever-increasing number of publications. A significant fraction of the papers that have been reviewed here are experimental in nature, especially regarding optimization (see Table 2). On one hand, this can be attributed to the increasing digitization of the laboratory environment, with control systems, data acquisition and other developments providing access to large amounts of data. On the other hand, the complexity of modern experiments acts as a catalyst for the development of automated data analysis and optimization techniques. Deployment of machine learning techniques in a real-world environment can however be challenging, for example, due to noise, jitter and drifts. This is certainly one of the reasons why the most advanced machine learning techniques, such as multi-objective optimization or deep CS, tend to be first tested with simulations.

Table 2 Summary of papers used as application examples in this review, sorted by year for each section.

Among the methods being employed we also observed some general trends. For instance, while genetic algorithms have been very popular in the past for global optimization, there has been an increasing number of papers focusing on BO. This is likely due to the fact that both simulations and experiments in the context of laser–plasma physics are very costly, making the use of Bayesian approaches more appealing. In most experimental settings, local optimization algorithms, such as gradient descent or the Nelder–Mead algorithm, are less suitable because of the large number of iterations needed and their sensitivity to noise. RL, especially in its model-free incarnation, suffers from the same issue, which explains why only very few examples of its use exist in adjacent research fields. While rather popular among data scientists due to their simplicity and interpretability, decision tree methods have not seen wide application in laser–plasma physics. In part, this is likely due to the fact that these methods are often considered to have more limited capabilities in comparison to neural networks, making it more attractive to directly use deep neural networks. In the context of ill-posed inverse problems it is to be expected that end-to-end neural networks or hybrid approaches will gradually replace traditional methods, such as regularization via total variation. That said, the simplicity and bias-free nature of least-squares methods are likely to ensure their continued popularity, at least in the context of easier to solve well-posed problems.

Much of the success of machine learning techniques stems from the fact that they are able to leverage prior knowledge, be it in the form of physical laws (e.g., via PINNs) or in the form of training data (e.g., via deep learning). Regarding the latter, the importance of preparing input data cannot be overstated. A popular saying in supervised learning is ‘garbage in, garbage out’, meaning that the quality of a model’s output heavily depends on the quality of the training data. Important steps are, for instance, pre-processing[ Reference Garca, Ramrez-Gallego, Luengo, Bentez and Herrera276] (noise removal, normalization, etc.) and data augmentation[ Reference Shorten and Khoshgoftaar277] (rotation, shifts, etc.). The latter is of particular importance when dealing with experimental data, for which data acquisition is usually costly, making it challenging to acquire enough data to train a well-performing model. Furthermore, even when using regularization techniques, diversity of training data is very important to ensure good generalization capabilities and to avoid bias.

Two outstanding issues for the wide adoption of machine learning models in the laser–plasma community are interpretability and trustworthiness, both regarding the model itself and on the user side. While some machine learning models such as decision trees can be interpreted comparably easily, the inner workings of advanced models such as deep neural networks are often difficult to understand. This issue is amplified by the fact that integrated machine learning environments allow users to quickly build complex models without a thorough understanding of the underlying principles. We hope that this review will help to alleviate this issue, by providing a better understanding of the origin, capabilities and limitations of different machine learning techniques. Regarding trustworthiness, quantification of aleatoric uncertainty in training data and epistemic uncertainty of the model remains an important research area[ Reference Hüllermeier and Waegeman278]. For instance, well-tested models may break down when exposed to unexpected input data, for example, due to drifts in experimental conditions or changes in the experimental setup. Such issues can for instance be addressed by incorporating uncertainty quantification into models to highlight unreliable predictions.

As our discussion has shown, there is an ever-increasing interest in data-driven and machine learning techniques within the community and we hope that our paper provides useful guidance for those starting to work in this rapidly evolving field. To facilitate some hands-on experimentation, we conclude with a short guide on how to get started in implementing the techniques we have discussed in this paper. Most of these are readily implemented in several extensive libraries, Scikit-learn[ Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, Duchesnay and Mach279], TensorFlow[ Reference Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur, Levenberg, Mané, Monga, Moore, Murray, Olah, Schuster, Shlens, Steiner, Sutskever, Talwar, Tucker, Vanhoucke, Vasudevan, Viégas, Vinyals, Warden, Wattenberg, Wicke, Yu and Zheng280] and PyTorch[ Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, Desmaison, Kopf, Yang, DeVito, Raison, Tejani, Chilamkurthy, Steiner, Fang, Bai and Chintala281] being among the most popular ones. Each of these libraries has its own strengths and weaknesses. In particular, deep learning libraries such as TensorFlow and PyTorch are tailored for fast computations on graphics processing units (GPUs), whereas libraries such as Scikit-learn are designed for more general machine learning tasks. Higher level frameworks exist to facilitate the training of neural networks, for example, MLflow or PyTorch lightning.

The Darts library[ Reference Herzen, Lässig, Piazzetta, Neuer, Tafti, Raille, Pottelbergh, Pasieka, Skrodzki, Huguenin, Dumonal, Kościsz, Bader, Gusset, Benheddi, Williamson, Kosinski, Petrik and Grosch282] contains implementations of various time series forecasting models, and also acts as a wrapper for numerous other libraries related to forecasting. Many numerical optimization algorithms, such as derivative-based methods and differential evolution, can be easily explored using the optimization library of SciPy[ Reference Virtanen, Gommers, Oliphant, Haberland, Reddy, Cournapeau, Burovski, Peterson, Weckesser, Bright, van der Walt, Brett, Wilson, Millman, Mayorov, Nelson, Jones, Kern, Larson, Carey, Polat, Feng, Moore, VanderPlas, Laxalde, Perktold, Cimrman, Henriksen, Quintero, Harris, Archibald, Ribeiro, Pedregosa and van Mulbregt283]. While this includes, for instance, differential evolution, more sophisticated evolutionary algorithms such as multi-objective evolutionary methods require dedicated libraries such as pymoo[ Reference Blank and pymoo284] or PyGMO[ Reference Biscani and Izzo285]. BO can for instance be implemented within Scikit-learn or using the experimentation platform Ax. The highly optimized BoTorch library[ Reference Balandat, Karrer, Jiang, Daulton, Letham, Wilson and Bakshy286] can be used for more advanced applications, including for instance multi-objective MF optimization. Some libraries are specifically tailored to hyperparameter optimization, such as the popular optuna library[ Reference Akiba, Sano, Yanase, Ohta and Koyama287].

While all of the above examples are focused on Python as an underlying programming language, machine learning tasks can also be performed using many other programming platforms or languages, such as MATLAB or Julia. Jupyter notebooks provide a good starting point, and some online implementations, such as Google Colab, even give limited access to GPUs for training. The reader is encouraged to explore the various frameworks and libraries to find the one that best suits their needs.

Acknowledgements

The authors thank all participants of the LPA Online Workshop on Control Systems and Machine Learning for their contributions to the workshop, many of which are referenced throughout this manuscript. We particularly thank N. Hoffmann, S. Jalas, M. Kirchen, R. Shalloo and A. G. R. Thomas for helpful feedback and comments on this manuscript. The authors acknowledge the use of GPT-3[ Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever and Amodei288] (text-davinci-003) in the copy-editing process of this manuscript.

Footnotes

1 For instance, deep learning is a sub-field of machine learning where deep neural networks are used, and this term is nowadays often used interchangeably with neural networks. Similarly, data-driven science is sometimes used instead of machine learning, but also includes a variety of methods from computer science, applied mathematics and statistics, as well as data science, which is a field in itself. Even within itself, machine learning is a heavily segmented research field, whose community can famously be divided into five ‘tribes’, namely symbolists, connectionists, evolutionists, Bayesians and analogizers. Each of these groups has pioneered different tools, all of which have in recent years experienced a resurgence in popularity. The arguably most popular branch is connectionism, which focuses on the use of artificial neural networks. However, some challenges seen in laser–plasma physics require different approaches and, for instance, Bayesian optimization has recently drawn considerable research interest. Another line of division is often drawn between supervised and unsupervised methods. While supervised methods are usually trained from known datasets to build a model that can classify new data, unsupervised methods attempt to find some structure in the data without such pre-existing knowledge. Alternatively, one can distinguish between online and batch methods, where the former learn from data as it becomes available, and can therefore be used in an experimental setting, while the latter require access to the full dataset before learning can begin. Yet another important distinction can be made between parametric and non-parametric methods, where the former rely on a set of parameters that is fixed and known in advance, while the latter do not make this assumption, but rather learn the model parameters from the data.

2 Recently there has also been astonishing progress in the area of large language models such as generative pre-trained transformer (GPT) models[Reference Radford, Narasimhan, Salimans and Sutskever114]. Very similar to the forecasting networks discussed in Section 2.2.3, these use an attention mechanism to predict the next token (word, etc.) following input provided by the user. It was found that large models ( ${{\sim}\!{10}^{8}-10^{9}}$ parameters) become increasingly capable with sufficient training, for example, gaining the ability to do basic math and to write code, including to some claims first hints at artificial general intelligence[Reference Bubeck, Chandrasekaran, Eldan, Gehrke, Horvitz, Kamar, Lee, Lee, Li, Lundberg, Nori, Palangi, Ribeiro and Zhang115].

3 A time series $\left\{{y}_t\right\}$ is said to be strictly stationary, given the joint cumulative distribution $F\left({y}_0,\dots, {y}_t\right)$ , if $F\left({y}_0,\dots, {y}_t\right) = F\left({y}_{0+\tau },\dots, {y}_{t+\tau}\right)\kern1em \forall \tau \in \mathrm{\mathbb{N}}$ . Correspondingly, all moments of the joint distribution are invariant under time translation. If this invariance only holds for the first two moments (mean, variance) the time series is said to be (weakly) stationary.

4 The sensing matrix/operator is known by different names, for example, the instrument response, the system matrix, the response matrix, the transfer function or, most generally, the forward operator.

5 While both rely on Bayesian statistics and thus have some conceptual overlap, approximate Bayesian computation should not be confused with Bayesian optimization, discussed in Section 4.5.

6 Another class of regularization methods is based on smoothing in some sense. For instance, filtered back projection can be understood as a smoothing of the projection line integrals by means of a convolution with a filter function.

7 It should be noted that a very similar problem formulation can also be found in the context of training neural networks, where regularization terms are often added to the cost function.

8 Depending on the context of optimization problems, this function is referred to using many different names, for example, merit function (using a figure of merit), fitness function (in the context of evolutionary algorithms), cost or loss function (in deep learning) or reward function (reinforcement learning). Some of these are associated with either maximization (reward, fitness, …) and others with minimization (cost, loss). Here, we will use the general term objective function as used in optimization theory.

9 While most work on BO is done using GP regression, the method is in principle model agnostic. This means that other types of (probabilistic) surrogate models of the system can be employed, such as decision trees (see Section 2.1.5) or deep neural networks (see Section 2.1.6).

References

Danson, C., Hillier, D., Hopps, N., and Neely, D., High Power Laser Sci. Eng. 3, e3 (2015).CrossRefGoogle Scholar
Danson, C. N., Haefner, C., Bromage, J., Butcher, T., Chanteloup, J.-C. F., Chowdhury, E. A., Galvanauskas, A., Gizzi, L. A., Hein, J., Hillier, D. I., Hopps, N. W., Kato, Y., Khazanov, E. A., Kodama, R., Korn, G., Li, R., Li, Y., Limpert, J., Ma, J., Nam, C. H., Neely, D., Papadopoulos, D., Penman, R. R., Qian, L., Rocca, J. J., Shaykin, A. A., Siders, C. W., Spindloe, C., Szatmári, S., Trines, R. M. G. M., Zhu, J., Zhu, P., and Zuegel, J. D., High Power Laser Sci. Eng. 7, e54 (2019).CrossRefGoogle Scholar
Pukhov, A., Rep. Prog. Phys. 66, 47 (2003).CrossRefGoogle Scholar
Marklund, M. and Shukla, P. K., Rev. Mod. Phys. 78, 591 (2006).CrossRefGoogle Scholar
Cole, J. M., Behm, K. T., Gerstmayr, E., Blackburn, T. G., Wood, J. C., Baird, C. D., Duff, M. J., Harvey, C., Ilderton, A., Joglekar, A. S., Krushelnick, K., Kuschel, S., Marklund, M., McKenna, P., Murphy, C. D., Poder, K., Ridgers, C. P., Samarin, G. M., Sarri, G., Symes, D. R., Thomas, A. G. R., Warwick, J., Zepf, M., Najmudin, Z., and Mangles, S. P. D., Phys. Rev. X 8, 011020 (2018).Google Scholar
Poder, K., Tamburini, M., Sarri, G., Di Piazza, A., Kuschel, S., Baird, C. D., Behm, K., Bohlen, S., Cole, J. M., Corvan, D. J., Duff, M., Gerstmayr, E., Keitel, C. H., Krushelnick, K., Mangles, S. P. D., McKenna, P., Murphy, C. D., Najmudin, Z., Ridgers, C. P., Samarin, G. M., Symes, D. R., Thomas, A. G. R., Warwick, J., and Zepf, M., Phys. Rev. X 8, 031004 (2018).Google Scholar
Blackburn, T. G., Rev. Mod. Plasma Phys. 4, 5 (2020).CrossRefGoogle Scholar
Quéré, F. and Vincenti, H., High Power Laser Sci. Eng. 9, e6 (2021).CrossRefGoogle Scholar
Drake, R. P., Phys. Plasmas 16, 055501 (2009).CrossRefGoogle Scholar
Mahieu, B., Jourdain, N., Phuoc, K. T., Dorchies, F., Goddet, J.-P., Lifschitz, A., Renaudin, P., and Lecherbourg, L., Nat. Commun. 9, 3276 (2018).CrossRefGoogle Scholar
Kettle, B., Gerstmayr, E., Streeter, M. J. V., Albert, F., Baggott, R. A., Bourgeois, N., Cole, J. M., Dann, S., Falk, K., González, I. G., Hussein, A. E., Lemos, N., Lopes, N. C., Lundh, O., Ma, Y., Rose, S. J., Spindloe, C., Symes, D. R., Šmíd, M., Thomas, A. G. R., Watt, R., and Mangles, S. P. D., Phys. Rev. Lett. 123, 254801 (2019).CrossRefGoogle Scholar
Beier, N. F., Allison, H., Efthimion, P., Flippo, K. A., Gao, L., Hansen, S. B., Hill, K., Hollinger, R., Logantha, M., Musthafa, Y., Nedbailo, R., Senthilkumaran, V., Shepherd, R., Shlyaptsev, V. N., Song, H., Wang, S., Dollar, F., Rocca, J. J., and Hussein, A. E., Phys. Rev. Lett. 129, 135001 (2022).CrossRefGoogle Scholar
Esarey, E., Schroeder, C. B., and Leemans, W. P., Rev. Mod. Phys. 81, 1229 (2009.CrossRefGoogle Scholar
Malka, V., Phys. Plasmas 19, 055501 (2012).CrossRefGoogle Scholar
Hooker, S. M., Nat. Photonics 7, 775 (2013).CrossRefGoogle Scholar
Leemans, W. P., Nagler, B., Gonsalves, A. J., Tóth, C., Nakamura, K., Geddes, C. G. R., Esarey, E., Schroeder, C. B., and Hooker, S. M., Nat. Phys. 2, 696 (2006).CrossRefGoogle Scholar
Kneip, S., Nagel, S. R., Martins, S. F., Mangles, S. P. D., Bellei, C., Chekhlov, O., Clarke, R. J., Delerue, N., Divall, E. J., Doucas, G., Ertel, K., Fiuza, F., Fonseca, R., Foster, P., Hawkes, S. J., Hooker, C. J., Krushelnick, K., Mori, W. B., Palmer, C. A. J., Phuoc, K. T., Rajeev, P. P., Schreiber, J., Streeter, M. J. V., Urner, D., Vieira, J., Silva, L. O., and Najmudin, Z., Phys. Rev. Lett. 103, 035002 (2009).CrossRefGoogle Scholar
Clayton, C. E., Ralph, J. E., Albert, F, Fonseca, R. A., Glenzer, S. H., Joshi, C., Lu, W., Marsh, K. A., Martins, S. F., Mori, W. B., Pak, A., Tsung, F. S., Pollock, B. B., Ross, J. S., Silva, L. O., and Froula, D. H., Phys. Rev. Lett. 105, 105003 (2010).CrossRefGoogle Scholar
Wang, X., Zgadzaj, R., Fazel, N., Li, Z., Yi, S. A., Zhang, X., Henderson, W., Chang, Y.-Y., Korzekwa, R., Tsai, H.-E., Pai, C.-H., Quevedo, H., Dyer, G., Gaul, E., Martinez, M., Bernstein, A. C., Borger, T., Spinks, M., Donovan, M., Khudik, V., Shvets, G., Ditmire, T., and Downer, M. C., Nat. Commun. 4, 1988 (2013).CrossRefGoogle Scholar
Leemans, W. P., Gonsalves, A. J., Mao, H.-S., Nakamura, K., Benedetti, C., Schroeder, C. B., Tóth, C., Daniels, J., Mittelberger, D. E., Bulanov, S. S., Vay, J.-L., Geddes, C. G. R., and Esarey, E., Phys. Rev. Lett. 113, 245002 (2014).CrossRefGoogle Scholar
Gonsalves, A. J., Nakamura, K., Daniels, J., Benedetti, C., Pieronek, C., de Raadt, T. C. H., Steinke, S., Bin, J. H., Bulanov, S. S., van Tilborg, J., Geddes, C. G. R., Schroeder, C. B., Tóth, C., Esarey, E., Swanson, K., Fan-Chiang, L., Bagdasarov, G., Bobrova, N., Gasilov, V., Korn, G., Sasorov, P., and Leemans, W. P., Phys. Rev. Lett. 122, 084801 (2019).CrossRefGoogle Scholar
He, Z. H., Hou, B., Nees, J. A., Easter, J. H., Faure, J., Krushelnick, K., and Thomas, A. G. R., New J. Phys. 15, 053016 (2013).CrossRefGoogle Scholar
Faure, J., Gustas, D., Guénot, D., Vernier, A., Böhle, F., Ouillé, M., Haessler, S., Lopez-Martens, R., and Lifschitz, A., Plasma Phys. Controll. Fusion 61, 014012 (2018).CrossRefGoogle Scholar
Rovige, L., Huijts, J., Andriyash, I., Vernier, A., Tomkus, V., Girdauskas, V., Raciukaitis, G., Dudutis, J., Stankevic, V., Gecys, P., Ouille, M., Cheng, Z., Lopez-Martens, R., and Faure, J., Phys. Rev. Accel. Beams 23, 093401 (2020).CrossRefGoogle Scholar
Salehi, F., Le, M., Railing, L., Kolesik, M., and Milchberg, H. M., Phys. Rev. X 11, 021055 (2021).Google Scholar
Maier, A. R., Delbos, N. M., Eichner, T., Hübner, L., Jalas, S., Jeppe, L., Jolly, S. W., Kirchen, M., Leroux, V., Messner, P., Schnepp, M., Trunk, M., Walker, P. A., Werle, C., and Winkler, P., Phys. Rev. X 10, 031039 (2020).Google Scholar
Rechatin, C., Faure, J., Davoine, X., Lundh, O., Lim, J., Ben-Ismaïl, A., Burgy, F., Tafzi, A., Lifschitz, A., Lefebvre, E., and Malka, V., New J. Phys. 12, 045023 (2010).CrossRefGoogle Scholar
Götzfried, J., Döpp, A., Gilljohann, M. F., Foerster, F. M., Ding, H., Schindler, S., Schilling, G., Buck, A., Veisz, L., and Karsch, S., Phys. Rev. X 10, 041015 (2020).Google Scholar
Kirchen, M., Jalas, S., Messner, P., Winkler, P., Eichner, T., Hübner, L., Hülsenbusch, T., Jeppe, L., Parikh, T., Schnepp, M., and Maier, A. R., Phys. Rev. Lett. 126, 174801 (2021).CrossRefGoogle Scholar
Glinec, Y., Faure, J., Le Dain, L., Darbon, S., Hosokai, T., Santos, J. J., Lefebvre, E., Rousseau, J. P., Burgy, F., Mercier, B., and Malka, V., Phys. Rev. Lett. 94, 025003 (2005).CrossRefGoogle Scholar
Döpp, A., Guillaume, E., Thaury, C., Lifschitz, A., Sylla, F., Goddet, J.-P., Tafzi, A., Iaquanello, G., Lefrou, T., Rousseau, P., Conejero, E., Ruiz, C., Phuoc, K. T., and Malka, V., Nucl. Instrum. Methods Phys. Res. Sect. A 830, 515 (2016).CrossRefGoogle Scholar
Rousse, A., Phuoc, K. T., Shah, R., Pukhov, A., Lefebvre, E., Malka, V., Kiselev, S., Burgy, F., Rousseau, J.-P., Umstadter, D., and Hulin, D., Phys. Rev. Lett. 93, 135005 (2004).CrossRefGoogle Scholar
Kneip, S., McGuffey, C., Martins, J. L., Martins, S. F., Bellei, C., Chvykov, V., Dollar, F., Fonseca, R., Huntington, C., Kalintchenko, G., Maksimchuk, A., Mangles, S. P. D., Matsuoka, T., Nagel, S. R., Palmer, C. A. J., Schreiber, J., Phuoc, K. T., Thomas, A. G. R., Yanovsky, V., Silva, L. O., Krushelnick, K., and Najmudin, Z., Nat. Phys. 6, 980 (2010).CrossRefGoogle Scholar
Phuoc, K. T., Corde, S., Thaury, C., Malka, V., Tafzi, A., Goddet, J. P., Shah, R. C., Sebban, S., and Rousse, A., Nat. Photonics 6, 308 (2012).CrossRefGoogle Scholar
Powers, N. D., Ghebregziabher, I., Golovin, G., Liu, C., Chen, S., Banerjee, S., Zhang, J., and Umstadter, D. P., Nat. Photonics 8, 28 (2014).CrossRefGoogle Scholar
Khrennikov, K., Wenz, J., Buck, A., Xu, J., Heigoldt, M., Veisz, L., and Karsch, S., Phys. Rev. Lett. 114, 195003 (2015).CrossRefGoogle Scholar
Albert, F. and Thomas, A. G. R., Plasma Phys. Controll. Fusion 58, 103001 (2016).CrossRefGoogle Scholar
Kneip, S., McGuffey, C., Dollar, F., Bloom, M. S., Chvykov, V., Kalintchenko, G., Krushelnick, K., Maksimchuk, A., Mangles, S. P. D., Matsuoka, T., Najmudin, Z., Palmer, C. A. J., Schreiber, J., Schumaker, W., Thomas, A. G. R., and Yanovsky, V., Appl. Phys. Lett. 99, 093701 (2011).CrossRefGoogle Scholar
Fourmaux, S., Corde, S., Phuoc, K. T., Lassonde, P., Lebrun, G., Payeur, S., Martin, F., Sebban, S., Malka, V., Rousse, A., and Kieffer, J. C., Opt. Lett. 36, 2426 (2011).CrossRefGoogle Scholar
Wenz, J., Schleede, S., Khrennikov, K., Bech, M., Thibault, P., Heigoldt, M., Pfeiffer, F., and Karsch, S., Nat. Commun. 6, 7568 (2015).CrossRefGoogle Scholar
Cole, J. M., Wood, J. C., Lopes, N. C., Poder, K., Abel, R. L., Alatabi, S., Bryant, J. S. J., Jin, A., Kneip, S., Mecseki, K., Symes, D. R., Mangles, S. P. D., and Najmudin, Z., Sci. Rep. 5, 13244 (2015).CrossRefGoogle Scholar
Cole, J. M., Symes, D. R., Lopes, N. C., Wood, J. C., Poder, K., Alatabi, S., Botchway, S. W., Foster, P. S., Gratton, S., Johnson, S., Kamperidis, C., Kononenko, O., De Lazzari, M., Palmer, C. A. J., Rusby, D., Sanderson, J., Sandholzer, M., Sarri, G., Szoke-Kovacs, Z., Teboul, L., Thompson, J. M., Warwick, J. R., Westerberg, H., Hill, M. A., Norris, D. P., Mangles, S. P. D., and Najmudin, Z., Proc. Natl. Acad. Sci. U. S. A. 115, 6335 (2018).CrossRefGoogle Scholar
Guénot, D., Svendsen, K., Lehnert, B., Ulrich, H., Persson, A., Permogorov, A., Zigan, L., Wensing, M., Lundh, O., and Berrocal, E., Phys. Rev. Appl. 17, 064056 (2022).CrossRefGoogle Scholar
Wang, W., Feng, K., Ke, L., Yu, C., Xu, Y., Qi, R., Chen, Y., Qin, Z., Zhang, Z., Fang, M., Liu, J., Jiang, K., Wang, H., Wang, C., Yang, X., Wu, F., Leng, Y., Liu, J., Li, R., and Xu, Z., Nature 595, 516 (2021).CrossRefGoogle Scholar
Labat, M., Cabadağ, J. C., Ghaith, A., Irman, A., Berlioux, A., Berteaud, P., Blache, F., Bock, S., Bouvet, F., Briquez, F., Chang, Y.-Y., Corde, S., Debus, A., De Oliveira, C., Duval, J.-P., Dietrich, Y., El Ajjouri, M., Eisenmann, C., Gautier, J., Gebhardt, R., Grams, S., Helbig, U., Herbeaux, C., Hubert, N., Kitegi, C., Kononenko, O., Kuntzsch, M., LaBerge, M., , S., Leluan, B., Loulergue, A., Malka, V., Marteau, F., Guyen, M. H. N., Oumbarek-Espinos, D., Pausch, R., Pereira, D., Püschel, T., Ricaud, J.-P., Rommeluere, P., Roussel, E., Rousseau, P., Schöbel, S., Sebdaoui, M., Steiniger, K., Tavakoli, K., Thaury, C., Ufer, P., Valléau, M., Vandenberghe, M., Vétéran, J., Schramm, U., and Couprie, M.-E., Nat. Photonics 17, 150 (2023).CrossRefGoogle Scholar
Emma, C., van Tilborg, J., Albert, F., Labate, L., England, J., Gessner, S., Fiuza, F., Obst-Huebl, L., Zholents, A., Murokh, A., and Rosenzweig, J., arXiv:2203.09094 (2022).Google Scholar
Daido, H., Nishiuchi, M., and Pirozhkov, A. S., Rep. Prog. Phys. 75, 056401 (2012).CrossRefGoogle Scholar
Macchi, A., Borghesi, M., and Passoni, M., Rev. Mod. Phys. 85, 751 (2013).CrossRefGoogle Scholar
Robinson, A. P. L., Zepf, M., Kar, S., Evans, R. G., and Bellei, C., New J. Phys. 10, 013021 (2008).CrossRefGoogle Scholar
Henig, A., Steinke, S., Schnürer, M., Sokollik, T., Hörlein, R., Kiefer, D., Jung, D., Schreiber, J., Hegelich, B. M., Yan, X. Q., Meyer-ter-Vehn, J., Tajima, T., Nickles, P. V., Sandner, W., and Habs, D., Phys. Rev. Lett. 103, 245003 (2009).CrossRefGoogle Scholar
Fraga, R. A. C., Kalinin, A., Khnel, M., Hochhaus, D. C., Schottelius, A., Polz, J., Kaluza, M. C., Neumayer, P., and Grisenti, R. E., Rev. Sci. Instrum. 83, 025102 (2012).CrossRefGoogle Scholar
Gauthier, M., Kim, J. B., Curry, C. B., Aurand, B., Gamboa, E. J., Göde, S., Goyon, C., Hazi, A., Kerr, S., Pak, A., Propp, A., Ramakrishna, B., Ruby, J., Willi, O., Williams, G. J., Rödel, C., and Glenzer, S. H., Rev. Sci. Instrum. 87, 11D827 (2016).CrossRefGoogle Scholar
Kraft, S. D., Obst, L., Metzkes-Ng, J., Schlenvoigt, H. P., Zeil, K., Michaux, S., Chatain, D., Perin, J. P., Chen, S. N., Fuchs, J., Gauthier, M., Cowan, T. E., and Schramm, U., Plasma Phys. Controll. Fusion 60, 044010 (2018).CrossRefGoogle Scholar
Rehwald, M., Assenbaum, S., Bernert, C., Curry, C. B., Gauthier, M., Glenzer, S. H., Göde, S., Schoenwaelder, C., Schramm, U., Treffert, F., and Zeil, K., J. Phys. Conf. Ser. 2420, 012034 (2023).Google Scholar
Schumacher, D. W., Poole, P. L., Willis, C., Cochran, G. E., Daskalova, R., Purcell, J., and Heery, R., J. Instrum. 12, C04023 (2017).CrossRefGoogle Scholar
Moses, E. I., Nucl. Fusion 49, 104022 (2009).CrossRefGoogle Scholar
Betti, R. and Hurricane, O. A., Nat. Phys. 12, 435 (2016).CrossRefGoogle Scholar
Murakami, M. and Meyer-ter-Vehn, J., Nucl. Fusion 31, 1315 (1991).CrossRefGoogle Scholar
Craxton, R. S., Anderson, K. S., Boehly, T. R., Goncharov, V. N., Harding, D. R., Knauer, J. P., McCrory, R. L., McKenty, P. W., Meyerhofer, D. D., Myatt, J. F., Schmitt, A. J., Sethian, J. D., Short, R. W., Skupsky, S., Theobald, W., Kruer, W. L., Tanaka, K., Betti, R., Collins, T. J. B., Delettrez, J. A., Hu, S. X., Marozas, J. A., Maximov, A. V., Michel, D. T., Radha, P. B., Regan, S. P., Sangster, T. C., Seka, W., Solodov, A. A., Soures, J. M., Stoeckl, C., and Zuegel, J. D., Phys. Plasmas 22, 110501 (2015).CrossRefGoogle Scholar
Kodama, R., Norreys, P. A., Mima, K., Dangor, A. E., Evans, R. G., Fujita, H., Kitagawa, Y., Krushelnick, K., Miyakoshi, T., Miyanaga, N., Norimatsu, T., Rose, S. J., Shozaki, T., Shigemori, K., Sunahara, A., Tampo, M., Tanaka, K. A., Toyama, Y., Yamanaka, T., and Zepf, M., Nature 412, 798 (2001).CrossRefGoogle Scholar
Betti, R., Zhou, C. D., Anderson, K. S., Perkins, L. J., Theobald, W., and Solodov, A. A., Phys. Rev. Lett. 98, 155001 (2007).CrossRefGoogle Scholar
Zylstra, A. B., Hurricane, O. A., Callahan, D. A., Kritcher, A. L., Ralph, J. E., Robey, H. F., Ross, J. S., Young, C. V., Baker, K. L., Casey, D. T., Döppner, T., Divol, L., Hohenberger, M., Le Pape, S., Pak, A., Patel, P. K., Tommasini, R., Ali, S. J., Amendt, P. A., Atherton, L. J., Bachmann, B., Bailey, D., Benedetti, L. R., Berzak Hopkins, L., Betti, R., Bhandarkar, S. D., Biener, J., Bionta, R. M., Birge, N. W., Bond, E. J., Bradley, D. K., Braun, T., Briggs, T. M., Bruhn, M. W., Celliers, P. M., Chang, B., Chapman, T., Chen, H., Choate, C., Christopherson, A. R., Clark, D. S., Crippen, J. W., Dewald, E. L., Dittrich, T. R., Edwards, M. J., Farmer, W. A., Field, J. E., Fittinghoff, D., Frenje, J., Gaffney, J., Gatu Johnson, M., Glenzer, S. H., Grim, G. P., Haan, S., Hahn, K. D., Hall, G. N., Hammel, B. A., Harte, J., Hartouni, E., Heebner, J. E., Hernandez, V. J., Herrmann, H., Herrmann, M. C., Hinkel, D. E., Ho, D. D., Holder, J. P., Hsing, W. W., Huang, H., Humbird, K. D., Izumi, N., Jarrott, L. C., Jeet, J., Jones, O., Kerbel, G. D., Kerr, S. M., Khan, S. F., Kilkenny, J., Kim, Y., Kleinrath, H. G., Kleinrath, V. G., Kong, C., Koning, J. M., Kroll, J. J., Kruse, M. K. G., Kustowski, B., Landen, O. L., Langer, S., Larson, D., Lemos, N. C., Lindl, J. D., Ma, T., MacDonald, M. J., MacGowan, B. J., Mackinnon, A. J., MacLaren, S. A., MacPhee, A. G., Marinak, M. M., Mariscal, D. A., Marley, E. V., Masse, L., Meaney, K., Meezan, N. B., Michel, P. A., Millot, M., Milovich, J. L., Moody, J. D., Moore, A. S., Morton, J. W., Murphy, T., Newman, K., Di Nicola, J.-M. G., Nikroo, A., Nora, R., Patel, M. V., Pelz, L. J., Peterson, J. L., Ping, Y., Pollock, B. B., Ratledge, M., Rice, N. G., Rinderknecht, H., Rosen, M., Rubery, M. S., Salmonson, J. D., Sater, J., Schiaffino, S., Schlossberg, D. J., Schneider, M. B., Schroeder, C. R., Scott, H. A., Sepke, S. M., Sequoia, K., Sherlock, M. W., Shin, S., Smalyuk, V. A., Spears, B. K., Springer, P. T., Stadermann, M., Stoupin, S., Strozzi, D. J., Suter, L. J., Thomas, C. A., Town, R. P. J., Tubman, E. R., Trosseille, C., Volegov, P. L., Weber, C. R., Widmann, K., Wild, C., Wilde, C. H., Van Wonterghem, B. M., Woods, D. T., Woodworth, B. N., Yamaguchi, M., Yang, S. T., and Zimmerman, G. B., Nature 601, 542 (2022).CrossRefGoogle Scholar
Carleo, G., Cirac, I., Cranmer, K., Daudet, L., Schuld, M., Tishby, N., Vogt-Maranto, L., and Zdeborová, L., Rev. Mod. Phys. 91, 045002 (2019).CrossRefGoogle Scholar
Dunjko, V. and Briegel, H. J., Rep. Prog. Phys. 81, 074001 (2018).CrossRefGoogle Scholar
Schütt, K. T., Chmiela, S., von Lilienfeld, O. A., Tkatchenko, A., Tsuda, K., and Müller, K.-R., Machine Learning Meets Quantum Physics, Lecture Notes in Physics (Springer, 2020).CrossRefGoogle Scholar
Radovic, A., Williams, M., Rousseau, D., Kagan, M., Bonacorsi, D., Himmel, A., Aurisano, A., Terao, K., and Wongjirad, T., Nature 560, 41 (2018).CrossRefGoogle Scholar
Bedolla, E., Padierna, L. C., and Castaneda-Priego, R., J. Phys. Condens. Matter 33, 053001 (2020).CrossRefGoogle Scholar
Ede, J. M., Mach. Sci. Technol. 2, 011004 (2021).Google Scholar
Kutz, J. N., J. Fluid Mech. 814, 1 (2017).CrossRefGoogle Scholar
Leemans, W., “Report of Workshop on Laser Technology for k-BELLA and Beyond,” https://www2.lbl.gov/LBL-Programs/atap/Report_Workshop_k-BELLA_laser_tech_final.pdf (2017).Google Scholar
Prencipe, I., Fuchs, J., Pascarelli, S., Schumacher, D. W., Stephens, R. B., Alexander, N. B., Briggs, R., Büscher, M., Cernaianu, M. O., Choukourov, A., De Marco, M., Erbe, A., Fassbender, J., Fiquet, G., Fitzsimmons, P., Gheorghiu, C., Hund, J., Huang, L. G., Harmand, M., Hartley, N. J., Irman, A., Kluge, T., Konopkova, Z., Kraft, S., Kraus, D., Leca, V., Margarone, D., Metzkes, J., Nagai, K., Nazarov, W., Lutoslawski, P., Papp, D., Passoni, M., Pelka, A., Perin, J. P., Schulz, J., Smid, M., Spindloe, C., Steinke, S., Torchio, R., Vass, C., Wiste, T., Zaffino, R., Zeil, K., Tschentscher, T., Schramm, U., and Cowa, T. E., High Power Laser Sci. Eng. 5, e17 (2017).CrossRefGoogle Scholar
George, K. M., Morrison, J. T., Feister, S., Ngirmang, G. K., Smith, J. R., Klim, A. J., Snyder, J., Austin, D., Erbsen, W., Frische, K. D., Nees, J., Orban, C., Chowdhury, E. A., and Roquemore, W. M., High Power Laser Sci. Eng. 7, e50 (2019).CrossRefGoogle Scholar
Chagovets, T., Stanček, S., Giuffrida, L., Velyhan, A., Tryus, M., Grepl, F., Istokskaia, V., Kantarelou, V., Wiste, T., Martin, J. C. H., Schillaci, F., and Margarone, D., Appl. Sci. 11, 1680 (2021).CrossRefGoogle Scholar
Condamine, F. P., Jourdain, N., Hernandez, J.-C., Taylor, M., Bohlin, H., Fajstavr, A., Jeong, T. M., Kumar, D., Laštovička, T., Renner, O., and Weber, S., Rev. Sci. Instrum. 92, 063504 (2021).CrossRefGoogle Scholar
Nakamura, K., Mao, H.-S., Gonsalves, A. J., Vincenti, H., Mittelberger, D. E., Daniels, J., Magana, A., Toth, C., and Leemans, W. P., IEEE J. Quantum Electron. 53, 1200121 (2017).CrossRefGoogle Scholar
Sistrunk, E., Spinka, T., Bayramian, A., Betts, S., Bopp, R., Buck, S., Charron, K., Cupal, J., Deri, R., Drouin, M., Erlandson, A., Fulkerson, E. S., Horner, J., Horacek, J., Jarboe, J., Kasl, K., Kim, D., Koh, E., Koubikova, L., Lanning, R., Maranville, W., Marshall, C., Mason, D., Menapace, J., Miller, P., Mazurek, P., Naylon, A., Novak, J., Peceli, D., Rosso, P., Schaffers, K., Smith, D., Stanley, J., Steele, R., Telford, S., Thoma, J., VanBlarcom, D., Weiss, J., Wegner, P., Rus, B., and Haefne, C., in CLEO: Science and Innovations (Optica Publishing Group, 2017), paper STh1L.2.Google Scholar
Roso, L., EPJ Web Conf. 167, 01001 (2018).Google Scholar
Pilar, J., De Vido, M., Divoky, M., Mason, P., Hanus, M., Ertel, K., Navratil, P., Butcher, T., Slezak, O., Banerjee, S., Phillips, J., Smith, J., Lucianetti, A., Hernandez-Gomez, C., Edwards, C., Collier, J., and Mocek, T., Proc. SPIE 10511, 105110X (2018).Google Scholar
Jourdain, N., Chaulagain, U., Havlík, M., Kramer, D., Kumar, D., Majerová, I., Tikhonchuk, V. T., Korn, G., and Weber, S., Matter Radiat. Extremes 6, 015401 (2021).CrossRefGoogle Scholar
Spears, B. K., Brase, J., Bremer, P.-T., Chen, B., Field, J., Gaffney, J., Kruse, M., Langer, S., Lewis, K., Nora, R., Peterson, J. L., Thiagarajan, J., Van Essen, B., and Humbird, K., Phys. Plasmas 25, 080901 (2018).CrossRefGoogle Scholar
Anirudh, R., Archibald, R., Asif, M. S., Becker, M. M., Benkadda, S., Bremer, P.-T., Budé, R. H. S., Chang, C. S., Chen, L., Churchill, R. M., Citrin, J., Gaffney, J. A., Gainaru, A., Gekelman, W., Gibbs, T., Hamaguchi, S., Hill, C., Humbird, K., Jalas, S., Kawaguchi, S., Kim, G.-H., Kirchen, M., Klasky, S., Kline, J. L., Krushelnick, K., Kustowski, B., Lapenta, G., Li, W., Ma, T., Mason, N. J., Mesbah, A., Michoski, C., Munson, T., Murakami, I., Najm, H. N., Olofsson, K. E. J., Park, S., Peterson, J. L., Probst, M., Pugmire, D., Sammuli, B., Sawlani, K., Scheinker, A., Schissel, D. P., Shalloo, R. J., Shinagawa, J., Seong, J., Spears, B. K., Tennyson, J., Thiagarajan, J., Ticoş, C. M., Trieschmann, J., van Dijk, J., Van Essen, B., Ventzek, P., Wang, H., Wang, J. T. L., Wang, Z., Wende, K., Xu, X., Yamada, H., Yokoyama, T., and Zhang, X., arXiv:2205.15832 (2022).Google Scholar
Kambara, M., Kawaguchi, S., Lee, H. J., Ikuse, K., Hamaguchi, S., Ohmori, T., and Ishikawa, K., Jpn. J. Appl. Phys. 62, SA0803 (2022).CrossRefGoogle Scholar
Genty, G., Salmela, L., Dudley, J. M., Brunner, D., Kokhanovskiy, A., Kobtsev, S., and Turitsyn, S. K., Nat. Photonics 15, 91 (2021).CrossRefGoogle Scholar
Hatfield, P. W., Gaffney, J. A., Anderson, G. J., Ali, S., Antonelli, L., du Pree, S. B., Citrin, J., Fajardo, M., Knapp, P., Kettle, B., Kustowski, B., MacDonald, M. J., Mariscal, D., Martin, M. E., Nagayama, T., Palmer, C. A. J., Peterson, J. L., Rose, S., Ruby, J. J., Shneider, C., Streeter, M. J. V., Trickey, W., and Williams, B., Nature 593, 351 (2021).CrossRefGoogle Scholar
Rasheed, A., San, O., and Kvamsdal, T., IEEE Access 8, 21980 (2020).CrossRefGoogle Scholar
Anderson, D. and Burnham, K., Model Selection and Multi-Model Inference, 2nd ed. (Springer, 2002).Google Scholar
MacKay, D. J. C., NATO ASI Ser. F: Comput. Syst. Sci. 168, 133 (1998).Google Scholar
Rasmussen, C. E., in Advanced Lectures on Machine Learning (Springer, 2003), p. 63.Google Scholar
Genton, M. G., Mach, J.. Learn. Res. 2, 299 (2001).Google Scholar
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., Classification and Regression Trees (Routledge, New York, NY, 2017).CrossRefGoogle Scholar
Freund, Y. and Schapire, R. E., Jpn, J.. Soc. Artif. Intell. 14, 771 (1999).Google Scholar
Friedman, J. H., Ann. Stat. 29, 1189 (2001).CrossRefGoogle Scholar
Friedman, J. H., Comput. Stat. Data Analysis 38, 367 (2002).CrossRefGoogle Scholar
Humbird, K. D., Peterson, J. L., and McClarren, R. G., IEEE Trans. Neural Networks Learn. Syst. 30, 1286 (2018).CrossRefGoogle Scholar
Humbird, K. D., Peterson, J. L., Spears, B. K., and McClarren, R. G., IEEE Trans. Plasma Sci. 48, 61 (2019).CrossRefGoogle Scholar
Hsu, A., Cheng, B., and Bradley, P. A., Phys. Plasmas 27, 012703 (2020).CrossRefGoogle Scholar
Kluth, G., Humbird, K. D., Spears, B. K., Peterson, J. L., Scott, H. A., Patel, M. V., Koning, J., Marinak, M., Divol, L., and Young, C. V., Phys. Plasmas 27, 052707 (2020).CrossRefGoogle Scholar
Lin, J., Qian, Q., Murphy, J., Hsu, A., Hero, A., Ma, Y., Thomas, A. G. R., and Krushelnick, K., Phys. Plasmas 28, 083102 (2021).CrossRefGoogle Scholar
Wythoff, B. J., Chem. Intell. Lab. Syst. 18, 115 (1993).CrossRefGoogle Scholar
Kingma, D. P. and Ba, J., arXiv:1412.6980 (2017).Google Scholar
Kristiadi, A., Hein, M., and Hennig, P., arXiv:2002.10118 (2020).Google Scholar
Wager, S., Wang, S., and Liang, P. S., arXiv:1307.1493 (2013).Google Scholar
Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R., Shahzad, M., Yang, W., Bamler, R., and Zhu, X. X., arXiv:2107.03342 (2022).Google Scholar
Montavon, G., Samek, W., and Müller, K.-R., Digital Sign. Process. 73, 1 (2018).CrossRefGoogle Scholar
Huh, M., Agrawal, P., and Efros, A. A., arXiv:1608.08614 (2016).Google Scholar
Gonoskov, A., Wallin, E., Polovinkin, A., and Meyerov, I., Sci. Rep. 9, 7043 (2019).CrossRefGoogle Scholar
Rodimkov, Y., Bhadoria, S., Volokitin, V., Efimenko, E., Polovinkin, A., Blackburn, T., Marklund, M., Gonoskov, A., and Meyerov, I., Sensors 21, 6982 (2021).CrossRefGoogle Scholar
Djordjević, B. Z., Kemp, A. J., Kim, J., Simpson, R. A., Wilks, S. C., Ma, T., and Mariscal, D. A., Phys. Plasmas 28, 043105 (2021).CrossRefGoogle Scholar
Watt, R. A., “Monte Carlo modelling of QED Interactions in laser-plasma experiments,” PhD. Thesis (Imperial College London, 2021).Google Scholar
McClarren, R.G., Tregillis, I. L., Urbatsch, T. J., and Dodd, E. S., Phys. Lett. A 396, 127243 (2021).CrossRefGoogle Scholar
Simpson, R. A., Mariscal, D., Williams, G. J., Scott, G. G., Grace, E., and Ma, T., Rev. Sci. Instrum. 92, 075101 (2021).CrossRefGoogle Scholar
Streeter, M. J. V., Colgan, C., Cobo, C. C., Arran, C., Los, E. E., Watt, R., Bourgeois, N., Calvin, L., Carderelli, J., Cavanagh, N., Dann, S. J. D., Fitzgarrald, R., Gerstmayr, E., Joglekar, A. S., Kettle, B., Mckenna, P., Murphy, C. D., Najmudin, Z., Parsons, P., Qian, Q., Rajeev, P. P., Ridgers, C. P., Symes, D. R., Thomas, A. G. R., Sarri, G., and Mangles, S. P. D., High Power Laser Sci. Eng. 11, e9 (2023).Google Scholar
Wu, T. and Tegmark, M., Phys. Rev. E 100, 033311 (2019).CrossRefGoogle Scholar
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y., arXiv:2303.12712 (2023).Google Scholar
Schmidt, M. and Lipson, H., Science 324, 81 (2009).CrossRefGoogle Scholar
Brunton, S. L., Proctor, J. L., and Kutz, J. N., Proc. Natl. Acad. Sci. U. S. A. 113, 3932 (2016).CrossRefGoogle Scholar
Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L., Nat. Rev. Phys. 3, 422 (2021).CrossRefGoogle Scholar
Raissi, M., Perdikaris, P., and Karniadakis, G. E., arXiv:1711.10561 (2017).Google Scholar
Raissi, M., Perdikaris, P., and Karniadakis, G. E., arXiv:1711.10566 (2017).Google Scholar
Raissi, M., Perdikaris, P., and Karniadakis, G. E., J. Comput. Phys. 378, 686 (2019).CrossRefGoogle Scholar
Raissi, M., Yazdani, A., and Karniadakis, G. E., Science 367, 1026 (2020).CrossRefGoogle Scholar
Long, Z., Lu, Y., Ma, X., and Dong, B., arXiv:1710.09668 (2018).Google Scholar
Cuomo, S., Di Cola, V. S., Giampaolo, F., Rozza, G., Raissi, M., and Piccialli, F., J. Sci. Comput. 92, 88 (2022).CrossRefGoogle Scholar
Kawaguchi, S. and Murakami, T., Jpn. J. Appl. Phys. 61, 086002 (2022).CrossRefGoogle Scholar
Stiller, P., Bethke, F., Böhme, M., Pausch, R., Torge, S., Debus, A., Vorberger, J., Bussmann, M., and Hoffmann, N., arXiv:2009.03730 (2020).Google Scholar
Commandeur, J. J. F. and Koopman, S. J., An Introduction to State Space Time Series Analysis (Oxford University Press, 2007).Google Scholar
Zucchini, W. and MacDonald, I. L., Hidden Markov Models for Time Series: An Introduction Using R (CRC Press, 2009).CrossRefGoogle Scholar
Kalman, R. E., J. Basic Eng. 82, 35 (1960).CrossRefGoogle Scholar
Auger, F., Hilairet, M., Guerrero, J. M., Monmasson, E., Orlowska-Kowalska, T., and Katsura, S., IEEE Trans. Indus. Electron. 60, 5458 (2013).CrossRefGoogle Scholar
Poyneer, L. A. and Véran, J.-P., J. Opt. Soc. Am. A 27, A223 (2010).CrossRefGoogle Scholar
Ushakov, A., Echevarria, P., and Neumann, A., in 8th International Particle Accelerator Conference (2017), p. 3938.Google Scholar
Welch, G., and Bishop, G., “An introduction to the Kalman filter,” https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf (1995).Google Scholar
Ristic, B., Arulampalam, S., and Gordon, N., Beyond the Kalman Filter: Particle Filters for Tracking Applications (Artech House, 2003).Google Scholar
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A., Data Mining Knowledge Discovery 33, 917 (2019).CrossRefGoogle Scholar
Lim, B. and Zohren, S., Philos. Trans. R. Soc. A 379, 20200209 (2021).CrossRefGoogle Scholar
Hochreiter, S., “Untersuchungen zu dynamischen neuronalen netzen,” Diploma (Technische Universität München, 1991).Google Scholar
Hochreiter, S., Int. J. Uncertainty Fuzziness Knowl. Syst. 6, 107 (1998).CrossRefGoogle Scholar
Hochreiter, S. and Schmidhuber, J., Neural Comput. 9, 1735 (1997).CrossRefGoogle Scholar
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., Adv. Neural Inform. Process. Syst. 30 (2017).Google Scholar
Lim, B., Ark, S. Ö., Loeff, N., and Pfister, T., Int. J. Forecasting 37, 1748 (2021).CrossRefGoogle Scholar
Wang, Z., Yan, W., and Oates, T., in 2017 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2017), p. 1578.CrossRefGoogle Scholar
Bui, T. D., Nguyen, C., and Turner, R. E., arXiv:1705.07131 (2017).Google Scholar
Žliobaitė, I., Pechenizkiy, M., and Gama, J., in Big Data Analysis: New Algorithms for a New Society (Springer, 2016), p. 91.CrossRefGoogle Scholar
Cassou, K., in LPA Online Workshop on Control Systems and Machine Learning (2022).Google Scholar
Weiße, N., Doyle, L., Gebhard, J., Balling, F., Schweiger, F., Haberstroh, F., D. Geulig, L., Lin, J., Irshad, F., Esslinger, J., Gerlach, S., Gilljohann, M., Vaidyanathan, V., Siebert, D., Münzer, A., Schilling, G., Schreiber, J., G. Thirolf, P., Karsch, S., and Döpp, A., High Power Laser Sci. Eng. 11, e44 (2023).CrossRefGoogle Scholar
Ma, T., Mariscal, D., Anirudh, R., Bremer, T., Djordjevic, B. Z., Galvin, T., Grace, E., Herriot, S., Jacobs, S., Kailkhura, B., Hollinger, R., Kim, J., Liu, S., Ludwig, J., Neely, D., Rocca, J. J., Scott, G. G., Simpson, R. A., Spears, B. S., Spinka, T. S., Swanson, K., Thiagarajan, J. J., Van Essen, B., Wang, S., Wilks, S. C., Williams, G. J., Zhang, J., Herrmann, M. C., and Haefner, C., Plasma Phys. Controll. Fusion 63, 104003 (2021).CrossRefGoogle Scholar
Barata, J. C. A. and Hussein, M. S., Braz. J. Phys. 42, 146 (2012).CrossRefGoogle Scholar
Bioucas-Dias, J. M. and Figueiredo, M. A. T., IEEE Trans. Image Process. 16, 2992 (2007).CrossRefGoogle Scholar
Beck, A. and Teboulle, M., SIAM J. Imaging Sci. 2, 183 (2009).CrossRefGoogle Scholar
Nocedal, J. and Wright, S. J., Numerical Optimization, 2nd ed., Springer Series in Operations Research and Financial Engineering (Springer, 2007).Google Scholar
Tyson, R. K. and Frazier, B. W., Principles of Adaptive Optics (CRC Press, 2022).CrossRefGoogle Scholar
Tarantola, A., Inverse Problem Theory and Methods for Model Parameter Estimation (SIAM, 2005).CrossRefGoogle Scholar
Geyer, C. J., “Markov chain Monte Carlo maximum likelihood,” https://www.stat.umn.edu/geyer/f05/8931/c.pdf (1991).Google Scholar
Dempster, A. P., Laird, N. M., and Rubin, D. B., Stat, J. R.. Soc. Ser. B 39, 1 (1977).Google Scholar
Zhang, H., Wang, J., Zeng, D., Tao, X., and Ma, J., Med. Phys. 45, e886 (2018).CrossRefGoogle Scholar
Cranmer, K., Brehmer, J., and Louppe, G., Proc. Natl. Acad. Sci. U. S. A. 117, 30055 (2020).CrossRefGoogle Scholar
Sisson, S. A., Fan, Y., and Beaumont, M., Handbook of Approximate Bayesian Computation (CRC Press, 2018).CrossRefGoogle Scholar
Gilks, W. R., Richardson, S., and Spiegelhalter, D., Markov Chain Monte Carlo in Practice (CRC Press, 1995).CrossRefGoogle Scholar
Gutmann, M. U. and Corander, J., Mach, J.. Learn. Res. 17, 1 (2016).Google Scholar
Brehmer, J., Cranmer, K., Louppe, G., and Pavez, J., Phys. Rev. D 98, 052004 (2018).CrossRefGoogle Scholar
Döpp, A., Hehn, L., Götzfried, J., Wenz, J., Gilljohann, M., Ding, H., Schindler, S., Pfeiffer, F., and Karsch, S., Optica 5, 199 (2018).CrossRefGoogle Scholar
Götzfried, J., Döpp, A., Gilljohann, M., Ding, H., Schindler, S., Wenz, J., Hehn, L., Pfeiffer, F., and Karsch, S., Nucl. Instrum. Methods Phys. Res. Sect. A 909, 286 (2018).CrossRefGoogle Scholar
Candes, E. J. and Wakin, M. B., IEEE Sig. Process. Mag. 25, 21 (2008).CrossRefGoogle Scholar
Oliveri, G., Salucci, M., Anselmi, N., and Massa, A., IEEE Antennas Propag. Mag. 59, 34 (2017).CrossRefGoogle Scholar
Yuan, X., Brady, D. J., and Katsaggelos, A. K., IEEE Sig. Process. Mag. 38, 65 (2021).CrossRefGoogle Scholar
Donoho, D. L., IEEE Trans. Inform. Theory 52, 1289 (2006).CrossRefGoogle Scholar
Candès, E. J., Wakin, M. B., and Boyd, S. P., J. Fourier Anal. Appl. 14, 877 (2008).CrossRefGoogle Scholar
Kohlenberg, A., J. Appl. Phys. 24, 1432 (1953).CrossRefGoogle Scholar
Liang, J., Wang, P., Zhu, L., and Wang, L. V., in OSA Imaging and Applied Optics Congress (Optica Publishing Group, 2021), paper 3Tu4A.1.Google Scholar
Huang, Y., Jiang, S., Li, H., Wang, Q., and Chen, L., Comput. Phys. Commun. 185, 459 (2014).CrossRefGoogle Scholar
Kassubeck, M., Wenger, S., Jennings, C., Gomez, M. R., Harding, E., Schwarz, J., and Magnor, M., in IS&T International Symposium on Electronic Imaging (Society for Imaging Science and Technology, 2018), p. 1331.Google Scholar
Ma, Y., Hua, J., Liu, D., He, Y., Zhang, T., Chen, J., Yang, F., Ning, X., Yang, Z., Zhang, J., Pai, C.-H., Gu, Y., and Lu, W., Matter Radiat. Extremes 5, 064401 (2020).CrossRefGoogle Scholar
Yu, H. and Wang, G., Phys. Med. Biol. 54, 2791 (2009).CrossRefGoogle Scholar
Arridge, S., Maass, P., Öktem, O., and Schönlieb, C.-B., Acta Numer. 28, 1 (2019).CrossRefGoogle Scholar
Bermeitinger, B., Hrycej, T., and Handschuh, S., arXiv:1906.11755 (2019).Google Scholar
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y., arXiv:1406.2661 (2014).Google Scholar
Thanh-Tung, H. and Tran, T., arXiv:1807.04015 (2020).Google Scholar
Peng, P., Jalali, S., and Yuan, X., arXiv:1901.05045 (2019).Google Scholar
Ronneberger, O., Fischer, P., and Brox, T., arXiv:1505.04597 (2015).Google Scholar
Ardizzone, L., Kruse, J., Wirkert, S., Rahner, D., Pellegrini, E. W., Klessen, R. S., Maier-Hein, L., Rother, C., and Köthe, U., arXiv:1808.04730 (2018).Google Scholar
Rezende, D. and Mohamed, S., arXiv:1505.05770 (2016).Google Scholar
Dinh, L., Krueger, D., and Bengio, Y., arXiv:1410.8516 (2014).Google Scholar
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A., Mach, J.. Learn. Res. 13, 723 (2012).Google Scholar
Ardizzone, L., Lüth, C., Kruse, J., Rother, C., and Köthe, U., arXiv:1907.02392 (2019).Google Scholar
Krumbügel, M. A., Ladera, C. L., DeLong, K. W., Fittinghoff, D. N., Sweetser, J. N., and Trebino, R., Opt. Lett. 21, 143 (1996).CrossRefGoogle Scholar
Zahavy, T., Dikopoltsev, A., Moss, D., Haham, G. I., Cohen, O., Mannor, S., and Segev, M., Optica 5, 666 (2018).CrossRefGoogle Scholar
Hu, L., Hu, S., Gong, W., and Si, K., Opt. Lett. 45, 3741 (2020).CrossRefGoogle Scholar
Bethke, F., Pausch, R., Stiller, P., Debus, A., Bussmann, M., and Hoffmann, N., arXiv:2106.00432 (2021).Google Scholar
Monga, V., Li, Y., and Eldar, Y. C., arXiv:1912.10557 (2019).Google Scholar
Li, H., Schwab, J., Antholzer, S., and Haltmeier, M., Inverse Probl. 36, 065005 (2020).CrossRefGoogle Scholar
Howard, S., Esslinger, J., Wang, R. H.W., Norreys, P., and Döpp, A., High Power Laser Sci. Eng. 11, e32 (2023).CrossRefGoogle Scholar
Gehm, M. E., John, R., Brady, D. J., Willett, R. M., and Schulz, T. J., Opt. Express 15, 14013 (2007).CrossRefGoogle Scholar
He, Z.-H., Hou, B., Lebailly, V., Nees, J. A., Krushelnick, K., and Thomas, A. G. R., Nat. Commun. 6, 7156 (2015).Google Scholar
Dann, S. J. D., Baird, C. D., Bourgeois, N., Chekhlov, O., Eardley, S., Gregory, C. D., Gruse, J.-N., Hah, J., Hazra, D., Hawkes, S. J., Hooker, C. J., Krushelnick, K., Mangles, S. P. D., Marshall, V. A., Murphy, C. D., Najmudin, Z., Nees, J. A., Osterhoff, J., Parry, B., Pourmoussavi, P., Rahul, S. V., Rajeev, P. P., Rozario, S., Scott, J. D. E., Smith, R. A., Springate, E., Tang, Y., Tata, S., Thomas, A. G. R., Thornton, C., Symes, D. R., and Streeter, M. J. V.. Phys. Rev. Accel. Beams 22, 041303 (2019).CrossRefGoogle Scholar
Shalloo, R. J., Dann, S. J. D., Gruse, J.-N., Underwood, C. I. D., Antoine, A. F., Arran, C., Backhouse, M., Baird, C. D., Balcazar, M. D., Bourgeois, N., Cardarelli, J. A., Hatfield, P., Kang, J., Krushelnick, K., Mangles, S. P. D., Murphy, C. D., Lu, N., Osterhoff, J., Põder, K., Rajeev, P. P., Ridgers, C. P., Rozario, S., Selwood, M. P., Shahani, A. J., Symes, D. R., Thomas, A. G. R., Thornton, C., Najmudin, Z., and Streeter, M. J. V., Nat. Commun. 11, 6355 (2020).CrossRefGoogle Scholar
Jalas, S., Kirchen, M., Messner, P., Winkler, P., Hübner, L., Dirkwinkel, J., Schnepp, M., Lehe, R., and Maier, A. R., Phys. Rev. Lett. 126, 104801 (2021).CrossRefGoogle Scholar
Irshad, F., Karsch, S., and Döpp, A., Phys. Rev. Res. 5, 013063 (2023).CrossRefGoogle Scholar
Irshad, F., Karsch, S., and Döpp, A., arXiv:2112.13901 (2021).Google Scholar
Dixon, L. C. W. and Szego, G. P., Toward Global Optim. 2, 1 (1978).Google Scholar
Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D., Amer, J.. Stat. Assoc. 86, 953 (1991).CrossRefGoogle Scholar
Bergstra, J. and Bengio, Y., Mach, J.. Learn. Res. 13, 281 (2012).Google Scholar
Kollig, T. and Keller, A., Comput. Graph. Forum 21, 557 (2002).CrossRefGoogle Scholar
Nelder, J. A. and Mead, R., Comput. J. 7, 308 (1965).CrossRefGoogle Scholar
Storn, R. and Price, K., J. Global Optim. 11, 341 (1997).CrossRefGoogle Scholar
Deb, K., Agrawal, S., Pratap, A., and Meyarivan, T., IEEE Trans. Evolution. Comput. 6, 182 (2001).CrossRefGoogle Scholar
Horn, J., Nafpliotis, N., and Goldberg, D. E., in Proceedings of the First IEEE Conference on Evolutionary Computation and IEEE World Congress on Computational Intelligence (IEEE, 1994), p. 82.Google Scholar
Bartels, R., Backus, S., Zeek, E., Misoguti, L., Vdovin, G., Christov, I. P., Murnane, M. M., and Kapteyn, H. C., Nature 406, 164 (2000).CrossRefGoogle Scholar
Yoshitomi, D., Nees, J., Miyamoto, N., Sekikawa, T., Kanai, T., Mourou, G., and Watanabe, S., Appl. Phys. B 78, 275 (2004).CrossRefGoogle Scholar
Moore, A. S., Mendham, K. J., Symes, D. R., Robinson, J. S., Springate, E., Mason, M. B., Smith, R. A., Tisch, J. W. G., and Marangos, J. P., Appl. Phys. B 80, 101 (2005).CrossRefGoogle Scholar
Zamith, S., Martchenko, T., Ni, Y., Aseyev, S. A., Muller, H. G., and Vrakking, M. J. J., Phys. Rev. A 70, 011201 (2004).CrossRefGoogle Scholar
Nayuki, T., Fujii, T., Oishi, Y., Takano, K., Wang, X., Andreev, A. A., Nemoto, K., and Ueda, K., Rev. Sci. Instrum. 76, 073305 (2005).CrossRefGoogle Scholar
He, Z.-H., Hou, B., Gao, G., Lebailly, V., Nees, J. A., Clarke, R., Krushelnick, K., and Thomas, A. G. R., Phys. Plasmas 22, 056704 (2015).CrossRefGoogle Scholar
Lin, J., Ma, Y., Schwartz, R., Woodbury, D., Nees, J. A., Mathis, M., Thomas, A. G. R., Krushelnick, K., and Milchberg, H., Opt. Express 27, 10912 (2019).CrossRefGoogle Scholar
Hah, J., Jiang, W., He, Z. H., Nees, J. A., Hou, B., Thomas, A. G. R., and Krushelnick, K., Opt. Express 25, 17271 (2017).CrossRefGoogle Scholar
Lin, J., Easter, J. H., Krushelnick, K., Mathis, M., Dong, J., Thomas, A. G. R., and Nees, J., Opt. Commun. 421, 79 (2018).CrossRefGoogle Scholar
Noaman-ul-Haq, M., Sokollik, T., Ahmed, H., Braenzel, J., Ehrentraut, L., Mirzaie, M., Yu, L.-L., Sheng, Z. M., Chen, L. M., Schnürer, M., and Zhang, J., Nucl. Instrum. Methods Phys. Res. Sect. A 883, 191 (2018).CrossRefGoogle Scholar
Finney, L. A., Lin, J., Skrodzki, P. J., Burger, M., Nees, J., Krushelnick, K., and Jovanovic, I., Opt. Commun. 490, 126902 (2021).CrossRefGoogle Scholar
Englesbe, A., Lin, J., Nees, J., Lucero, A., Krushelnick, K., and Schmitt-Sody, A., Appl. Opt. 60, G113 (2021).CrossRefGoogle Scholar
Streeter, M. J. V., Dann, S. J. D., Scott, J. D. E., Baird, C. D., Murphy, C. D., Eardley, S., Smith, R. A., Rozario, S., Gruse, J.-N., Mangles, S. P. D., Najmudin, Z., Tata, S., Krishnamurthy, M., Rahul, S. V., Hazra, D., Pourmoussavi, P., Osterhoff, J., Hah, J., Bourgeois, N., Thornton, C., Gregory, C. D., Hooker, C. J., Chekhlov, O., Hawkes, S. J., Parry, B., Marshall, V. A., Tang, Y., Springate, E., Rajeev, P. P., Thomas, A. G. R., and Symes, D. R., Appl. Phys. Lett. 112, 244101 (2018).CrossRefGoogle Scholar
Peceli, D. and Mazurek, P., in LPA Online Workshop on Control Systems and Machine Learning (2022).Google Scholar
Smith, J. R., Orban, C., Morrison, J. T., George, K. M., Ngirmang, G. K., Chowdhury, E. A., and Roquemore, W. M., New J. Phys. 22, 103067 (2020).CrossRefGoogle Scholar
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N., Proc. IEEE 104, 148 (2015).CrossRefGoogle Scholar
Frazier, P. I., arXiv:1807.02811 (2018).Google Scholar
Jones, D. R., Schonlau, M., and Welch, W. J., J. Global Optim. 13, 455 (1998).CrossRefGoogle Scholar
Frazier, P., Powell, W., and Dayanik, S., INFORMS J. Comput. 21, 599 (2009).CrossRefGoogle Scholar
Hennig, P. and Schuler, C. J., Mach, J.. Learn. Res. 13, 1809 (2012).Google Scholar
Yang, K., Emmerich, M., Deutz, A., and Bäck, T., Swarm Evolut. Comput. 44, 945 (2019).CrossRefGoogle Scholar
Marco, A., Berkenkamp, F., Hennig, P., Schoellig, A. P., Krause, A., Schaal, S., and Trimpe, S., in 2017 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2017), p. 1557.CrossRefGoogle Scholar
Lehe, R., in GPU Technology Conference (2017).Google Scholar
Dolier, E. J., King, M., Wilson, R., Gray, R. J., and McKenna, P., New J. Phys. 24, 073025 (2022).CrossRefGoogle Scholar
Loughran, B., Streeter, M. J. V., Ahmed, H., Astbury, S., Balcazar, M., Borghesi, M., Bourgeois, N., Curry, C. B., Dann, S. J. D., DiIorio, S., Dover, N. P., Dzelzanis, T., Ettlinger, O. C., Gauthier, M., Giuffrida, L., Glenn, G. D., Glenzer, S. H., Green, J. S., Gray, R. J., Hicks, G. S., Hyland, C., Istokskaia, V., King, M., Margarone, D., McCusker, O., McKenna, P., Najmudin, Z., Parisuaña, C., Parsons, P., Spindloe, C., Symes, D. R., Thomas, A. G. R., Treffert, F., Xu, N., and Palmer, C. A. J., arXiv:2303.00823 (2023).Google Scholar
Pousa, Á. F., Hudson, S. T. P., Huebl, A., Jalas, S., Kirchen, M., Larson, J. M., Lehé, R., de la Ossa, A. M., Thévenet, M., and Vay, J.-L., in Proceedings of the 13th International Particle Accelerator Conference (JACoW Publishing, 2022), p. 1761.Google Scholar
Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction (MIT Press, 2018).Google Scholar
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A., IEEE Signal Process. Mag. 34, 26 (2017).CrossRefGoogle Scholar
Grondman, I., Busoniu, L., Lopes, G. A. D., and Babuska, R., IEEE Trans. Syst. Man, Cybernet. C 42, 1291 (2012).CrossRefGoogle Scholar
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y., in Proceedings of the 12th International Conference on Neural Information Processing Systems (ACM, 1999), p. 1057.Google Scholar
Kain, V., Hirlander, S., Goddard, B., Velotti, F. M., Porta, G. Z. D., Bruchon, N., and Valentino, G., Phys. Rev. Accel. Beams 23, 124801 (2020).CrossRefGoogle Scholar
Gao, Y., Chen, J., Robertazzi, T., and Brown, K. A.. Phys. Rev. Accel. Beams 22, 014601 (2019).CrossRefGoogle Scholar
Bruchon, N., Fenu, G., Gaio, G., Lonza, M., O’Shea, F. H., Pellegrino, F. A., and Salvato, E., Electronics 9, 781 (2020).CrossRefGoogle Scholar
O’Shea, F. H., Bruchon, N., and Gaio, G.. Phys. Rev. Accel. Beams 23, 122802 (2020).CrossRefGoogle Scholar
John, J. S., Herwig, C., Kafkes, D., Mitrevski, J., Pellico, W. A., Perdue, G.N., Quintero-Parra, A., Schupbach, B. A., Seiya, K., Tran, N., Schram, M., Duarte, J. M., Huang, Y., and Keller, R., Phys. Rev. Accel. Beams 24, 104601 (2021).CrossRefGoogle Scholar
Rousseeuw, P. J., J. Comput. Appl. Math. 20, 53 (1987).CrossRefGoogle Scholar
Irshad, F., Eberle, C., Foerster, F. M., Grafenstein, K., Haberstroh, F., Travac, E., Weisse, N., Karsch, S., and Döpp, A., arXiv:2303.15825 (2023).Google Scholar
Schölkopf, B., Smola, A., and Müller, K.-R., in International Conference on Artificial Neural Networks (Springer, 1997), p. 583.Google Scholar
Candès, E. J., Li, X., Ma, Y., and Wright, J., J. ACM 58, 11 (2011).CrossRefGoogle Scholar
Hyvärinen, A. and Oja, E., Neural Netw. 13, 411 (2000).CrossRefGoogle Scholar
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B., in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), p. 10684.Google Scholar
Griffiths, R.-R. and Hernández-Lobato, J. M., Chem. Sci. 11, 577 (2020).CrossRefGoogle Scholar
Willmann, A., Stiller, P., Debus, A., Irman, A., Pausch, R., Chang, Y.-Y., Bussmann, M., and Hoffmann, N., arXiv:2106.00317 (2021).Google Scholar
Stiller, P., Makdani, V., Pöschel, F., Pausch, R., Debus, A., Bussmann, M., and Hoffmann, N., arXiv:2211.04770 (2022).Google Scholar
Rawat, W. and Wang, Z., Neural Comput. 29, 2352 (2017).CrossRefGoogle Scholar
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A., arXiv:1409.4842 (2015).Google Scholar
Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.-D., and McWilliams, B., arXiv:1702.08591 (2018).Google Scholar
He, K., Zhang, X., Ren, S., and Sun, J., in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2016), p. 770.Google Scholar
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A., Proc. AAAI Conf. Artif. Intell. 31, 4278 (2017).Google Scholar
Krizhevsky, A., Sutskever, I., and Hinton, G. E., Commun. ACM 60, 84 (2017).CrossRefGoogle Scholar
Viola, P. and Jones, M., in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2001), paper II.Google Scholar
Nguyen-Meidine, L. T., Granger, E., Kiran, M., and Blais-Morin, L.-A., in Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA) (IEEE, 2017), p. 1.Google Scholar
Ren, S., He, K., Girshick, R., and Sun, J., arXiv:1506.01497 (2016).Google Scholar
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A., in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), p. 779.Google Scholar
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L., IEEE Trans. Pattern Anal. Mach. Intell. 40, 834 (2018).CrossRefGoogle Scholar
Long, J., Shelhamer, E., and Darrell, T., IEEE Trans. Pattern Anal. Mach. Intell. 39, 640 (2016).Google Scholar
Hafiz, A. M. and Bhat, G. M., Int. J. Multimedia Inform. Retr. 9, 171 (2020).CrossRefGoogle Scholar
He, K., Gkioxari, G., Dollár, P., and Girshick, R., IEEE Trans. Pattern Anal. Mach. Intell. 42, 386 (2018).CrossRefGoogle Scholar
Girshick, R., Donahue, J., Darrell, T., and Malik, J., IEEE Trans. Pattern Anal. Mach. Intell. 38, 142 (2015).CrossRefGoogle Scholar
Amorin, C., Kegelmeyer, L. M., and Kegelmeyer, W. P., Stat. Anal. Data Mining 12, 505 (2019).CrossRefGoogle Scholar
Pascu, T., in LPA Online Workshop on Control Systems and Machine Learning (2022).Google Scholar
Chu, X., Zhang, H., Tian, Z., Zhang, Q., Wang, F., Chen, J., and Geng, Y., High Power Laser Sci. Eng. 7, e66 (2019).CrossRefGoogle Scholar
Ben Soltane, I., Hallo, G., Lacombe, C., Lamaignère, L., Bonod, N., and Néauport, J., J. Opt. Soc. Am. A 39, 1881 (2022).CrossRefGoogle Scholar
Li, Z., Han, L., Ouyang, X., Zhang, P., Guo, Y., Liu, D., and Zhu, J., Opt. Express 28, 10165 (2020).CrossRefGoogle Scholar
Lin, J., Haberstroh, F., Karsch, S., and Döpp, A., High Power Laser Sci. Eng. 11, e7 (2023).CrossRefGoogle Scholar
Sidky, E. Y., Yu, L., Pan, X., Zou, Y., and Vannier, M., J. Appl. Phys. 97, 124701 (2005).CrossRefGoogle Scholar
Li, H., Liang, G., and Huang, Y., Comput. Phys. Commun. 259, 107644 (2021).CrossRefGoogle Scholar
Garca, S., Ramrez-Gallego, S., Luengo, J., Bentez, J. M., and Herrera, F., Big Data Anal. 1, 9 (2016).CrossRefGoogle Scholar
Shorten, C. and Khoshgoftaar, T. M., J. Big Data 6, 60 (2019).CrossRefGoogle Scholar
Hüllermeier, E. and Waegeman, W., Mach. Learn. 110, 457 (2021).CrossRefGoogle Scholar
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É., Mach, J.. Learn. Res. 12, 2825 (2011).Google Scholar
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., arXiv:1603.04467 (2016).Google Scholar
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S., Adv. Neural Inform. Process. Syst. 32, 8024 (2019).Google Scholar
Herzen, J., Lässig, F., Piazzetta, S. G., Neuer, T., Tafti, L., Raille, G., Pottelbergh, T. V., Pasieka, M., Skrodzki, A., Huguenin, N., Dumonal, M., Kościsz, J., Bader, D., Gusset, F., Benheddi, M., Williamson, C., Kosinski, M., Petrik, M., and Grosch, G., J. Mach. Learn. Res. 23, 1 (2022).Google Scholar
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors, Nat. Methods 17, 261 (2020).CrossRefGoogle Scholar
Blank, J. and pymoo, K. D., IEEE Access 8, 89497 (2020).CrossRefGoogle Scholar
Biscani, F. and Izzo, D., J. Open Source Software 5, 2338 (2020).CrossRefGoogle Scholar
Balandat, M., Karrer, B., Jiang, D., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E., Adv. Neural Inform. Process. Syst. 33, 21524 (2020).Google Scholar
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M., in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (ACM, 2019), p. 2623.CrossRefGoogle Scholar
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D., arXiv:2005.14165 (2020).Google Scholar
Figure 0

Figure 1 Overview of some of the machine learning applications discussed in this manuscript. (a) General configuration of laser–plasma interaction setups, applicable to both experiments and simulations. The system will have a number of input parameters of the laser and target. Some of these are known and actively controlled (e.g., laser energy, plasma density), some are monitored and others are unknown and essentially contribute as noise to the observations. Predictive models take the known input parameters $x$ and use some models to predict the output $y$. These models are discussed in Section 2.1 and some of them are sketched in (b). Inversely, in some cases one will want to derive the initial conditions from the output. These inverse problems are discussed in Section 3. In other cases one might be interested in a temporal evolution, discussed in Section 2.2. The output from observations or models can be used to optimize certain objectives, which can then be fed back to the control system to adjust the input parameters (see Section 4). Observations may also require further processing, for example, the image processing in (c) to detect patterns or objects. Note that sub-figure (a) is for illustrative purposes only and based on synthetic data.

Figure 1

Figure 2 Illustration of standard approaches to making predictive models in machine learning. The data were sampled from the function $y = x\left(1+\sin {x}^2\right)+\epsilon$ with random Gaussian noise, $\epsilon$, for which $\left\langle {\epsilon}^2\right\rangle = 1$. The data have been fitted by (a) nearest neighbor interpolation, (b) cubic spline interpolation, (c) linear regression of a third-order polynomial and (d) Gaussian process regression.

Figure 2

Figure 3 Gaussian process regression: illustration of different covariance functions, prior distributions and (fitted) posterior distributions. Left: correlation matrices between two values $x$ and ${x}^{\prime }$ using different covariance functions (white noise, radial basis function and periodic). Center: samples of the prior distribution defined by the prior mean $\mu (x) = 0$ and the indicated covariance functions. Note that the sampled functions are depicted with increasing transparency for visual clarity. Right: posterior distribution given observation points sampled from $y = {\cos}^2x+\epsilon$, where $\epsilon$ is random Gaussian noise with ${\sigma}_{\epsilon} = 0.1$. Note how the variance between observations increases when no noise term is included in the kernel (top row). Within the observation window the fitted kernels show little difference, but outside of it the RBF kernel decays to the mean $\mu = 0$ dependent on the length scale $\mathrm{\ell}$. This can be avoided if there exists prior knowledge about the data that can be encoded in the covariance function, in this case periodicity, as can be seen in the regression using a periodic kernel.

Figure 3

Figure 4 Sketch of a random forest, an architecture for regression or classification consisting of multiple decision trees, whose individual predictions are combined into an ensemble prediction, for example, via majority voting or averaging.

Figure 4

Figure 5 Example of gradient boosting with decision trees. Firstly, a decision tree ${g}_1$ is fitted to the data. In the next step, the residual difference between training data and the prediction of this tree is calculated and used to fit a second decision tree ${g}_2$. This process is repeated $n$ times, with each new tree ${g}_n$ learning to correct only the remaining difference to the training data. Data in this example are sampled from the same function as in Figure 2 and each tree has a maximum depth of two decision layers.

Figure 5

Figure 6 Simplified sketch of some popular neural network architectures. The simplest possible neural network is the perceptron, which consists of an input, which is fed into the neuron that processes the input based on the weights, an individual bias and its activation function. Multiple such layers can be stacked within so-called hidden layers, resulting in the popular multilayer perceptron (or fully connected network). Besides the direct connection between subsequent layers, there are also special connections common in many modern neural network architectures. Examples are the recurrent connection (which feeds the output of the current layer back into the input of the current layer), the convolutional connection (which replaces the direct connection between two layers by the convolutional operation) and the residual connection (which adds the input to the output of the current layer; note that the above illustration is simplified and the layers should be equal in size).

Figure 6

Figure 7 Real-world example of a multilayer perceptron for beam parameter prediction. (a) The network layout[29] consists of 15 input neurons, two hidden layers with 30 neurons and three output neurons (charge, mean energy and energy spread). The input is derived from parasitic laser diagnostics (laser pulse energy ${E}_{\mathrm{laser}}$, central wavelength ${\lambda}_0$ and spectral bandwidth $\Delta \lambda$, longitudinal focus position ${z}_{\mathrm{foc}}$ and Zernike coefficients of the wavefront). Neurons use a nonlinear ReLU activation and 20% of neurons drop out for regularization during training. The (normalized) predictions are compared to the training data to evaluate the accuracy of the model, in this case using the mean absolute ${\mathrm{\ell}}_1$ error as the loss function. In training, the gradient of the loss function is then propagated back through the network to adjust its weights and biases. (b) Measured and predicted median energy ($\overline{E}$) and (c) measured and predicted energy spread ($\Delta$E), both for a series of 50 consecutive shots. Sub-figures (b) and (c) are adapted from Ref. [29].

Figure 7

Figure 8 Tomography of a human bone sample using a laser-driven betatron X-ray source. Reconstructed from 180 projections using statistical iterative reconstruction. Based on the data presented by Döpp et al.[162].

Figure 8

Figure 9 Deep-learning for inverse problems. Sketch explaining the relation among predictive models, inverse models and fully invertible models.

Figure 9

Figure 10 Application of the end-to-end reconstruction of a wavefront using a convolutional U-Net architecture[180]. The spot patterns from a Shack–Hartmann sensor are fed into the network, yielding a high-fidelity prediction. Adapted from Ref. [188].

Figure 10

Figure 11 Deep unrolling for hyperspectral imaging. The left-hand side displays an example of the coded shot, that is, a spatial-spectral interferogram hypercube randomly sampled onto a 2D sensor. The bottom left shows a magnification of a selected section. On the right-hand side is the corresponding reconstructed spectrally resolved hypercube. Adapted from Ref. [192].

Figure 11

Table 1 Summary of a few representative papers on machine-learning-aided optimization in the context of laser–plasma acceleration and high-power laser experiments.

Figure 12

Figure 12 Pareto front. Illustration of how a multi-objective function $f(x) = y$ acts on a 2D input space $x = \left({x}_1,{x}_2\right)$ and transforms it to an objective space $y = \left({y}_1,{y}_2\right)$ on the right. The entirety of possible input positions is uniquely color-coded on the left and the resulting position in the objective space is shown in the same color on the right. The Pareto-optimal solutions form the Pareto front, indicated on the right, whereas the corresponding set of coordinates in the input space is called the Pareto set. Note that both the Pareto front and Pareto set may be continuously defined locally, but can also contain discontinuities when local maxima become involved. Adapted from Ref. [199].

Figure 13

Figure 13 Genetic algorithm optimization. (a) Basic working principle of a genetic algorithm. (b) Sketch of a feedback-optimized LWFA via genetic algorithm. (c) Optimized electron beam spatial profiles using different figures of merit. Subfigures (b) and (c) adapted from Ref. [194].

Figure 14

Figure 14 Bayesian optimization of a laser–plasma X-ray source. (a) The objective function (X-ray counts) as a function of iteration number (top) and the variation of the control parameters (bottom) during optimization. (b) X-ray images obtained for the initial (bottom) and optimal (top) settings. Adapted from Ref. [196].

Figure 15

Figure 15 Illustration of different optimization strategies for a non-trivial 2D system, here based on a simulated laser wakefield accelerator with laser focus and plasma density as free parameters. The total beam charge, shown as contour lines in plots (a)–(c) serves as the optimization goal. The position of the optimum is marked by a red circle, located at a focus position of $-0.74\;\mathrm{mm}$ and a plasma density of $8\times {10}^{18}\ {\mathrm{cm}}^{-3}$. In panel (a), a grid search strategy with subsequent local optimization using the downhill simplex (Nelder–Mead) algorithm is shown. Panel (b) illustrates differential evolution and (c) is based on Bayesian optimization using the common expected improvement acquisition function. The performance for all three examples is compared in panel (d). It shows the typical behavior that Bayesian optimization needs the least and the grid search requires the most iterations. The local search via the Nelder–Mead algorithm converges within some 20 iterations, but requires a good initial guess (here provided by the grid search). Individual evaluations are shown as shaded dots. Note how the Bayesian optimization starts exploring once it has found the maximum, whereas the evolutionary algorithm tends more towards exploitation around the so-far best value. This behavior is extreme for the local Nelder–Mead optimizer, which only aims to exploit and maximize to local optimum.

Figure 16

Figure 16 Sketch of deep reinforcement learning. The agent, which consists of a policy and a learning algorithm that updates the policy, sends an action to the environment. In the case of model-based reinforcement learning, the action is sent to the model, which is then applied to the environment. Upon the action to the environment, an observation is made and sent back to the agent as a reward. The reward is used to update the policy via the learning algorithm in the agent, which leads to an action in the next iteration.

Figure 17

Figure 17 Data treatment using a Gaussian mixture model (GMM). Top: 10 consecutive shots from a laser wakefield accelerator. Middle: the same shots using a GMM to isolate the spectral peak at around 250 MeV. Bottom: average spectra with and without GMM cleaning. Adapted from Ref. [245].

Figure 18

Figure 18 Correlogram – a visualization of the correlation matrix – of different variables versus yield at the NIF. Color indicates the value of the correlation coefficient. In this particular representation the correlation is also encoded in the shape and angle of the ellipses, helping intuitive understanding. The strongest correlation to the fusion yield is observed with the implosion velocity ${V}_{\mathrm{imp}}$ and the ion temperature ${T}_{\mathrm{ion}}$. There is also a clear anti-correlation observable between the down-scattered ratio (DSR) and ${T}_{\mathrm{ion}}$ and, in accordance with the previously stated correlation of ${T}_{\mathrm{ion}}$ and yield, a weak anti-correlation of the DSR and yield. Note that all variables perfectly correlate with themselves by definition. Plot was generated based on data presented by Hsu et al.[96]. Further explanation (labels, etc.) can be found therein.

Figure 19

Figure 19 Illustration of common computer vision tasks. (a) Classification is used to assign (multiple) labels to data. (b) Detection goes a step further and adds bounding boxes. (c) Segmentation provides pixel maps with exact boundaries of the object or feature.

Figure 20

Figure 20 Application of object detection to a few-cycle shadowgram of a plasma wave: the plasma wave, the shadowgram of a hydrodynamic shock and the diffraction pattern caused by dust are correctly identified by the object detector and located with bounding boxes. Adapted from Ref. [273].

Figure 21

Table 2 Summary of papers used as application examples in this review, sorted by year for each section.