1. Introduction
Interacting multi-particle systems are ubiquitous in a wide variety of scientific disciplines with application areas reaching from atomic scales over the human scale to the astronomical scale. For instance, large-scale multi-agent models are used to understand the coordinated movement of animal groups [Reference Couzin, Krause, Franks and Levin19, Reference Parrish and Edelstein-Keshet51] or crowds of people [Reference Albi, Bongini, Cristiani and Kalise1, Reference Cristiani, Piccoli and Tosin20]. Especially fascinating in this context is that such complex and often intelligent behaviour – phenomena known as self-organisation and swarm intelligence – emerge from seemingly simple rules of interaction [Reference Vicsek and Zafeiris56]. This intriguing capabilities have drawn researchers’ attention towards designing interacting particle systems for specific purposes in various disciplines. In applied mathematics in particular, agent-based optimisation algorithms look back on a long and successful history of empirically achieving state-of-the-art performance on challenging global unconstrained problems of the form
Here, ${\mathcal{E}}\, :\, \mathbb{R}^d\rightarrow \mathbb{R}$ denotes a possibly nonconvex and nonsmooth high-dimensional objective function, whose global minimiser $x^*$ is assumed to exist and be unique for the remainder of this work. Well-known representatives of this family are Evolutionary Programming [Reference Fogel24], Genetic Algorithms [Reference Holland38], Particle Swarm Optimisation [Reference Kennedy and Eberhart43] and Ant Colony Optimisation [Reference Dorigo and Blum23]. They belong to the broad class of so-called metaheuristics [Reference Back, Fogel and Michalewicz4, Reference Blum and Roli6], which are methods orchestrating an interaction between local improvement procedures and global strategies, deterministic and stochastic processes, to eventually design an efficient and robust procedure for searching the solution space of the objective function $\mathcal{E}$ .
Motivated by both the substantiated success of metaheuristics in applications and the lack of rigorous theoretical guarantees about their convergence and performance, the authors of [Reference Pinnau, Totzeck, Tse and Martin52] proposed consensus-based optimisation (CBO), which follows the spirit of metaheuristics but allows for a rigorous theoretical analysis [Reference Carrillo, Choi, Totzeck and Tse12, Reference Carrillo, Jin, Li and Zhu14, Reference Fornasier, Klock and Riedl28, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Reference Ha, Jin and Kim33, Reference Ha, Jin and Kim34]. By taking inspiration from consensus formation in opinion dynamics [Reference Hegselmann and Krause36], CBO methods use $N$ particles $X^1,\dots,X^N$ to explore the energy landscape of the objective $\mathcal{E}$ and to eventually form a consensus about the location of the global minimiser $x^*$ . In its original form [Reference Pinnau, Totzeck, Tse and Martin52], the dynamics of each particle $X^i$ , which is governed by a stochastic differential equation (SDE), is subject to two competing forces. A deterministic drift term pulls the particles towards a so-called consensus point, which is an instantaneously computed weighted average of the positions of all particles and approximates the global minimiser $x^*$ the best possible given the currently available information. The resulting contractive behaviour is counteracted by the second term which is stochastic in nature and thereby features the exploration of the energy landscape of the objective function. Its magnitude and therefore its explorative power scales with the distance of the individual particle from the consensus point, which encourages particles far away to explore larger regions of the domain, while particles already close advance their position only locally.
In this work, motivated by the numerical evidence presented below as well as other recent papers such as [Reference Grassi and Pareschi32, Reference Schillings, Totzeck and Wacker54, Reference Totzeck and Wolfram55], we consider a more elaborate variant of this dynamics which exhibits the two following additional drift terms.
-
• The first is a drift towards the historical best position of the particular particle. To store such information, we follow the work [Reference Grassi and Pareschi32], where the authors introduce for each particle an additional state variable $Y^i$ , which can be regarded as the memory of the respective particle $X^i$ . In contrast to the original dynamics, an individual particle is therefore described by the tuple $(X^i,Y^i)$ . Moreover, the consensus point is no longer computed from the instantaneous positions $X^i$ , but the historical best positions $Y^i$ .
-
• The second term is a drift in the direction of the negative gradient of $\mathcal{E}$ evaluated at the current position of the respective particle $X^i$ .
Both terms are accompanied by associated noise terms. We now make the CBO dynamics with memory effects and gradient information rigorous by providing a formal description of the interacting particle system. A visualisation of the dynamics with all relevant quantities and forces is provided in Figure 1. Given a finite time horizon $T\gt 0$ , and user-specified parameters $\alpha,\beta,\theta,\kappa,\lambda _1,\sigma _1\gt 0$ and $\lambda _2,\lambda _3,\sigma _2,\sigma _3\geq 0$ , the dynamics is given by the system of SDEs
for $i=1,\dots,N$ and where $((B_{t}^{m,i})_{t\geq 0})_{i=1,\dots,N}$ are independent standard Brownian motions in $\mathbb{R}^d$ for $m\in \{1,2,3\}$ . The system is complemented with independent initial data $(X_0^i,Y_0^i)_{i=1,\dots,N}$ , typically such that $X_0^i=Y_0^i$ for all $i=1,\dots,N$ . A numerical implementation of the scheme usually originates from an Euler-Maruyama time discretisation of equation (1.1). The first term appearing in the SDE for the position $X_{t}^i$ , i.e., in the first line of equation (1.1a), is the drift towards the consensus point
Here, $\widehat \rho _{Y,t}^N$ denotes the random empirical measure of the particles’ historical best positions, i.e., $\widehat \rho _{Y,t}^N\,:\!=\,\frac{1}{N} \sum _{i=1}^N\delta _{Y_{t}^i}$ . Definition (1.2) is motivated by the fact that $y_{\alpha }({\widehat \rho _{Y,t}^N})\approx \operatorname *{\arg \min }_{i\in \{1,\dots,N\}}{\mathcal{E}}(Y_t^i)$ as $\alpha \rightarrow \infty$ under reasonable assumptions. The first term in the second line of equation (1.1a) is with the consensus drift associated diffusion term, which injects randomness into the dynamics and thereby features the explorative nature of the algorithm. The two commonly studied diffusion types are isotropic [Reference Carrillo, Choi, Totzeck and Tse12, Reference Fornasier, Klock and Riedl28, Reference Pinnau, Totzeck, Tse and Martin52] and anisotropic [Reference Carrillo, Jin, Li and Zhu14, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29] diffusion with
where $\mathsf{Id}\in \mathbb{R}^{d\times d}$ is the identity matrix and $\operatorname{diag}:\;\mathbb{R}^{d} \rightarrow \mathbb{R}^{d\times d}$ the operator mapping a vector onto a diagonal matrix with the vector as its diagonal. Despite the potential of the dynamics getting trapped in affine subspaces, the coordinate-dependent scaling of anisotropic diffusion has proven to be beneficial for the performance of the method in high-dimensional applications by allowing for dimension-independent convergence rates [Reference Carrillo, Jin, Li and Zhu14, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29]. For this reason, we restrict our attention to the case of anisotropic noise in what follows. Nevertheless, theoretically similar results as the ones presented in this work can be obtained also for the isotropic case. The second term in the first line of equation (1.1a) is the drift towards the historical best position of the respective particle. In contrast to the global nature of the consensus drift, which incorporates information from all $N$ particles, this term depends only on the past of the specific particle. To store such information about the history of each particle [Reference Grassi and Pareschi32], an additional state variable $Y^i$ is introduced for every particle, which evolves according to equation (1.1b), where
is chosen throughout this article, which is an approximation to the Heaviside function $H(x,y) = \mathbb{1}_{{\mathcal{E}}(x)\lt{\mathcal{E}}(y)}$ as $\theta \rightarrow 0$ and $\beta \rightarrow \infty$ . The variable $Y_{t}^i$ can therefore be regarded as the memory of the $i$ th particle, i.e., as the location of the in-time best-seen position of $X^i$ up to time $t$ . This can be understood most easily when discretising (1.1b) as
and noting that with parameter choices $\kappa =1/\Delta t$ , $\theta =0$ and $\beta \gg 1$ it holds $Y_{k+1}^i = X_{k+1}^i$ if ${\mathcal{E}}(X_{k+1}^i) \lt{\mathcal{E}}(Y_{k}^i)$ and $Y_{k+1}^i = Y_{k}^i$ else. The third term in the first line of equation (1.1a) is the drift in the direction of the negative gradient of $\mathcal{E}$ , which is a local and instantaneous contribution. The remaining two terms are noise terms, which are associated with the formerly described memory and gradient drifts.
A theoretical convergence analysis of CBO can be carried out either by directly investigating the microscopic system (1.1) or its numerical time discretisation, as promoted for instance in a simplified setting in the works [Reference Ha, Jin and Kim33, Reference Ha, Jin and Kim34], or alternatively, as done for example in [Reference Carrillo, Choi, Totzeck and Tse12, Reference Carrillo, Jin, Li and Zhu14, Reference Fornasier, Huang, Pareschi and Sünnen26, Reference Fornasier, Klock and Riedl28, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29], by analysing the macroscopic behaviour of the particle density through a mean-field limit associated with (1.1). Formally, such mean-field limit is given by the self-consistent nonlinear and nonlocal SDE
which is complemented with initial datum $(\overline{X}_{0},\overline{Y}_{0})\sim \rho _0$ , and where $\rho _t = \rho (t) = \textrm{Law}\left (\overline{X}_{t},\overline{Y}_{t}\right )$ with marginal law $\rho _{Y,t}$ of $\overline{Y}_t$ given by $\rho _{Y,t} = \rho _{Y}(t, \,\cdot \,) = \int d\rho _t(\,\cdot \,,y)$ . The measure $\rho \in\mathcal{C}([0,T],\mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d))$ in particular weakly satisfies the Fokker-Planck equation
see Definition 2.1. Working with the partial differential equation (PDE) (1.6) instead of the interacting particle system (1.1) typically permits to employ more powerful technical tools, which result in stronger and deterministic statements about the long-time behaviour of the average agent density $\rho$ . This analysis approach is rigorously justified by the mean-field approximation, i.e., the fact that the empirical particle measure $\widehat \rho _{t}^N\,:\!=\,\frac{1}{N}\sum _{i=1}^N\delta _{(X_{t}^i,Y_{t}^i)}$ converges in some sense to the mean-field law $\rho _t$ as the number of particles $N$ tends to infinity. For the original CBO dynamics, a qualitative result about convergence in distribution is provided in [Reference Huang and Qiu39], which is based on a tightness argument in the path space. More precisely, the authors of that work show that the sequence $\{\widehat \rho ^N\}_{N\geq 2}$ of $\mathcal{C}([0,T],\mathcal{P}(\mathbb{R}^d))$ -valued random variables is tight, which permits to employ Prokhorov’s theorem to obtain, up to a subsequence, some limiting measure, which turns out to be deterministic and at the same time satisfy the associated mean-field PDE. A more desirable quantitative approximation result, on the other hand, can be established by proving propagation of chaos, i.e., by establishing for instance
where $(\overline{X}_{t}^i,\overline{Y}_{t}^i)$ denote $N$ i.i.d. copies of the mean-field dynamics (1.5). For the original variant of unconstrained CBO, this was first done in [Reference Fornasier, Klock and Riedl28, Section 3.3]. To keep the focus of this work on the long-time behaviour of the CBO variant (1.6), a rigorous analysis of the mean-field approximation is left for future considerations.
Before summarising the contributions of the present paper, let us put our work into context by providing a comprehensive literature overview about the history, developments and achievements of CBO.
Versatility and flexibility of CBO: a literature overview
Since its introduction in the work [Reference Pinnau, Totzeck, Tse and Martin52], CBO has gained a significant amount of attention from various research groups. This has led to a vast variety of different developments, of both theoretical and applied nature, as well as what concerns the mathematical modelling and numerical analysis of the method. By interpreting CBO as a stochastic relaxation of gradient descent, the recent work [Reference Riedl, Klock, Geldhauser and Fornasier53] even establishes a connection between the worlds of derivative-free and gradient-based optimisation.
A first rigorous but local convergence proof of the mean-field limit of CBO to global minimisers is provided for the cases of isotropic and anisotropic diffusion in [Reference Carrillo, Choi, Totzeck and Tse12, Reference Carrillo, Jin, Li and Zhu14], respectively. By analysing the time-evolution of the variance of the law of the mean-field dynamics $\rho _t$ and proving its exponential decay towards zero, the authors first establish consensus formation at some stationary point before they ensure that this consensus is actually close to the global minimiser. A similarly flavoured approach is pursued in [Reference Ha, Jin and Kim33, Reference Ha, Jin and Kim34], however, directly for the fully in-time discrete microscopic system and in the simplified setting, where the same Brownian motion is used for all agents, which limits the exploration capabilities of the method. In contrast, in the recent works [Reference Fornasier, Klock and Riedl28, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29], the authors, again for the isotropic and anisotropic CBO variant, respectively, investigate the time-evolution of the Wasserstein- $2$ distance between the law $\rho _t$ and a Dirac delta at the global minimiser. This is also the strategy which we pursue in this paper. By proving the exponential decay of $W_2(\rho _t,\delta _{x^*})$ to zero, consensus at the desired location follows immediately. Moreover, by providing a probabilistic quantitative result about the mean-field approximation, the authors give a first, and so far unique, holistic global convergence proof for the implementable, i.e., discretised numerical CBO algorithm in the unconstrained case. The results about the mean-field approximation of the latter papers were partially inspired by the series of works [Reference Fornasier, Huang, Pareschi and Sünnen25–Reference Fornasier, Huang, Pareschi and Sünnen27], in which the authors constrain the particle dynamics of CBO to compact hypersurfaces and prove local convergence of the numerical scheme to minimisers by adapting the technique of [Reference Carrillo, Choi, Totzeck and Tse12, Reference Carrillo, Jin, Li and Zhu14]. This ensures a beneficial compactness of the stochastic processes, which simplifies the convergence of the interacting particle dynamics to the mean-field dynamics. In the unconstrained case, such intrinsic compactness is replaced by the fact that the dynamics are bounded with high probability, which is sufficient to establish convergence in probability. Further related works about CBO for optimisations with constraints include the papers [Reference Ha, Kang and Kim35, Reference Kim, Kang, Kim, Ha and Yang44], where a problem on the Stiefel manifold is approached, and the works [Reference Borghi, Herty and Pareschi9, Reference Carrillo, Totzeck and Vaes15], where the constrained optimisation is recast into a penalised problem. The philosophy of using an interacting swarm of particles to approach various relevant problems in science and engineering has promoted several variations of the original CBO algorithm for minimisation. Amongst them are methods based on consensus dynamics to tackle multi-objective optimisation problems [Reference Borghi, Herty and Pareschi7, Reference Borghi, Herty and Pareschi8, Reference Klamroth, Stiglmayr and Totzeck45], saddle point problems [Reference Huang, Qiu and Riedl40], the search for several minimisers simultaneously [Reference Bungert, Wacker and Roith10] or the sampling from certain distributions [Reference Carrillo, Hoffmann, Stuart and Vaes13].
In the same vein and also in the spirit of this work, the original CBO method itself has undergone several modifications allowing for a more complex dynamics. This includes the use of particles with memory [Reference Grassi, Huang, Pareschi and Qiu31, Reference Totzeck and Wolfram55], the integration of momentum [Reference Chen, Jin and Lyu17], the usage of jump-diffusion processes [Reference Kalise, Sharma and Tretyakov42] and the exploitation of on-the-fly extracted higher-order differential information through inferred gradients based on point evaluations of the objective function [Reference Schillings, Totzeck and Wacker54]. It moreover turned out that the renowned particle swarm optimisation method (PSO) [Reference Kennedy and Eberhart43] can be formulated and regarded as a second-order generalisation of CBO [Reference Cipriani, Huang and Qiu18, Reference Grassi and Pareschi32]. This insight has enabled to adapt the for CBO-developed analysis techniques to rigorously prove the convergence of PSO [Reference Huang, Qiu and Riedl41].
In the collection of formerly referenced works and beyond, CBO has demonstrated to be a valuable method for a wide scope of applications reaching from the phase retrieval or robust subspace detection problem in signal processing [Reference Fornasier, Huang, Pareschi and Sünnen26, Reference Fornasier, Huang, Pareschi and Sünnen27], over the training of neural networks for image classification in machine learning [Reference Carrillo, Jin, Li and Zhu14, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29] as well as in the setting of clustered federated learning [Reference Carrillo, Trillos, Li and Zhu16], to asset allocation in finance [Reference Bae, Ha, Kang, Lim, Min and Yoo5]. It has been furthermore employed to approximate low-frequency functions in the presence of high-frequency noise and to the task of solving PDEs with low-regularity solutions [Reference Chen, Jin and Lyu17].
Contributions
In view of the various developments and the wide scope of applications, a theoretical understanding of the long-time behaviour of the in practical applications employed CBO methods is of paramount interest. In this work, we analyse a variant of CBO which incorporates memory effects as well as gradient information from a theoretical and numerical perspective. As demonstrated concisely in Figure 2 and more comprehensively in Section 4, the herein investigated dynamics, which is more involved than standard CBO, proves to be beneficial in applications in machine learning and compressed sensing. Despite this additional complexity, by employing the analysis technique devised in [Reference Fornasier, Klock and Riedl28, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29], we are able to provide rigorous mean-field-type convergence guarantees to the global minimiser, which describe the behaviour of the method in the large-particle limit and allow to draw conclusions about the typically observed performance in the practicable regime. Our results for CBO with memory effects and gradient information hold for a vast class of objective functions under minimal assumptions on the initialisation of the method. Moreover, the proof reveals how to leverage further, in other applications advantageous, forces in the dynamics while still being amenable to theory and allowing for provable global convergence.
1.1. Organisation
In Section 2, after providing details about the existence of solutions to the macroscopic SDE (1.5) and the associated PDE (1.6), we present and discuss our main theoretical contribution. It is about the convergence of CBO with memory effects and gradient information, as given in equation (1.1), to the global minimiser of the objective function in mean-field law, see [Reference Fornasier, Klock and Riedl28, Definition 1]. More precisely, we show that the mean-field dynamics (1.5) and (1.6) converge with exponential rate to the global minimiser. Section 3 contains the proof details of this result. In Section 4, we numerically demonstrate the benefits of the additional memory effects and gradient information of the previously analysed CBO variant. We in particular present applications of CBO in machine learning and compressed sensing, before we conclude the paper in Section 5.
For the sake of reproducible research, in the GitHub repository https://github.com/KonstantinRiedl/CBOGlobalConvergenceAnalysis we provide the Matlab code implementing the CBO algorithm with memory effects and gradient information analysed in this work.
1.2. Notation
Given a set $A\subset \mathbb{R}^d$ , we write $(A)^c$ to denote its complement, i.e., $(A)^c\,:\!=\,\{z\in \mathbb{R}^d:z\not \in A\}$ . For $\ell _{\infty}$ balls in $\mathbb{R}^d$ with centre $z$ and radius $r$ , we write $B^\infty _{r}(z)$ . The space of continuous functions $f:X\rightarrow Y$ is denoted by $\mathcal{C}(X,Y)$ , with $X\subset \mathbb{R}^n$ and a suitable topological space $Y$ . For $X\subset \mathbb{R}^n$ open and for $Y=\mathbb{R}^m$ , the function space $\mathcal{C}^k_{c}(X,Y)$ contains functions $f\in\mathcal{C}(X,Y)$ that are $k$ -times continuously differentiable and have compact support. $Y$ is omitted in the case of real-valued functions. The operator $\nabla$ denotes the standard gradient of a function on $\mathbb{R}^d$ .
In this paper, we mostly study laws of stochastic processes, $\rho \in\mathcal{C}([0,T],\mathcal{P}(\mathbb{R}^d))$ , and we refer to a snapshot of such law at time $t$ by writing $\rho _t\in\mathcal{P}(\mathbb{R}^d)$ . Here, $\mathcal{P}(\mathbb{R}^d)$ denotes the set of all Borel probability measures $\varrho$ over $\mathbb{R}^d$ . In $\mathcal{P}_p(\mathbb{R}^d)$ we moreover collect measures $\varrho \in\mathcal{P}(\mathbb{R}^d)$ with finite $p$ th moment. For any $1\leq p\lt \infty$ , $W_p$ denotes the Wasserstein- $p$ distance between two Borel probability measures $\varrho _1,\varrho _2\in\mathcal{P}_p(\mathbb{R}^d)$ , see, e.g., [Reference Ambrosio, Gigli and Savaré2]. $\mathbb{E}(\varrho )$ denotes the expectation of a probability measure $\varrho$ .
2. Global convergence in mean-field law
In the first part of this section, we provide an existence result about solutions of the nonlinear macroscopic SDE (1.5), respectively, the associated Fokker-Planck equation (1.6). Thereafter we specify the class of studied objective functions and present the main theoretical result about the convergence of the dynamics (1.5) and (1.6) to the global minimiser.
Throughout this work, we consider the – in typical applications beneficial – case of CBO with anisotropic diffusion, i.e., $D\!\left (\,\cdot \,\right ) = \operatorname{diag}\left (\,\cdot \,\right )$ in equations (1.1), (1.5) and (1.6), and also equation (2.1) below. However, up to minor modifications, analogous results can be obtained for isotropic diffusion.
2.1. Definition and existence of weak solutions
Let us begin by rigorously defining weak solutions of the Fokker-Planck equation (1.6).
Definition 2.1. Let $\rho _0 \in\mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d)$ , $T \gt 0$ . We say $\rho \in\mathcal{C}([0,T],\mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d))$ satisfies the Fokker-Planck equation ( 1.6 ) with initial condition $\rho _0$ in the weak sense in the time interval $[0,T]$ , if we have for all $\phi \in\mathcal{C}_c^2(\mathbb{R}^d\times \mathbb{R}^d)$ and all $t \in (0,T)$
and $\lim _{t\rightarrow 0}\rho _t = \rho _0$ (in the sense of weak convergence of measures).
For solutions of the mean-field dynamics (1.5) and (1.6), we have the following existence result.
Theorem 2.2. Let $T \gt 0$ , $\rho _0 \in\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d)$ . Let ${\mathcal{E}}\, :\, \mathbb{R}^d\rightarrow \mathbb{R}$ with $\underline{\mathcal{E}} \gt -\infty$ satisfy for some constants $C_1,C_2 \gt 0$ the conditions
and either $\sup _{x \in \mathbb{R}^d}{\mathcal{E}}(x) \lt \infty$ or
for some $C_3,C_4 \gt 0$ . Furthermore, in the case of an active gradient drift in the CBO dynamcis ( 1.5 ), i.e., if $\lambda _3\not =0$ , let ${\mathcal{E}}\in\mathcal{C}^1(\mathbb{R}^d)$ and obey additionally
for some $\widetilde{L}_{\nabla{\mathcal{E}}}\gt 0$ . Then, if $(\overline{X}_0, \overline{Y}_0)$ is distributed according to $\rho _0$ , there exists a nonlinear process $(\overline{X}, \overline{Y}) \in\mathcal{C}([0,T],\mathbb{R}^d\times \mathbb{R}^d)$ satisfying ( 1.5 ) with associated law $\rho = \textrm{Law}\big ((\overline{X}, \overline{Y})\big )$ having regularity $\rho \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ and being a weak solution to the Fokker-Planck equation ( 1.6 ) with $\rho (0)=\rho _0$ .
Assumption (2.2) requires that $\mathcal{E}$ is locally Lipschitz continuous with the Lipschitz constant being allowed to have linear growth. This entails in particular that the objective has at most quadratic growth at infinity as formulated explicitly in Assumption (2.3), which can be seen when choosing $x'=x^*$ and $C_2=2C_1\max \{1,\left \|{x^*}\right \|_2^2\}$ in (2.2). Assumption (2.4), on the other hand, assumes that $\mathcal{E}$ also has at least quadratic growth in the farfield, i.e., overall it grows quadratically far away from $x^*$ . Alternatively, $\mathcal{E}$ may be bounded from above. Since the objective $\mathcal{E}$ can be usually modified for the purpose of analysis outside a sufficiently large region, these growth conditions are not really restrictive. In case of an additional gradient drift term in the dynamics, i.e., $\lambda _3\not =0$ , the objective naturally needs to be continuously differentiable. Furthermore, Assumption (2.5) imposes $\mathcal{E}$ to be $\widetilde{L}_{\nabla{\mathcal{E}}}$ -smooth, i.e., having an $\widetilde{L}_{\nabla{\mathcal{E}}}$ -Lipschitz continuous gradient.
Remark 2.3. The regularity $\rho \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ obtained in Theorem 2.2 above is an immediate consequence of the regularity of the initial condition $\rho _0 \in\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d)$ . It allows to extend the test function space $\mathcal{C}^{\infty }_{c}(\mathbb{R}^d\times \mathbb{R}^d)$ in Definition 2.1 to the larger space
as can be seen from the proof of Theorem 2.2, which we sketch in what follows.
Proof sketch of Theorem 2.2. The proof is based on the Leray-Schauder fixed point theorem and follows the steps taken in [Reference Carrillo, Choi, Totzeck and Tse12, Theorems 3.1, 3.2].
Step 1: For a given function $u\in \mathcal{C}([0,T],\mathbb{R}^d)$ and an initial measure $\rho _0\in\mathcal{P}_4(\mathbb{R}^d)$ , according to standard SDE theory [Reference Arnold3, Chapters 6], we can uniquely solve the auxiliary SDE
with $(\widetilde{X}_0,\widetilde{Y}_0)\sim \rho _0$ . This is due to the fact that the coefficients of the drift and diffusion terms are locally Lipschitz continuous and have at most linear growth, which, in turn, is a consequence of the assumptions on $\mathcal{E}$ as well as the smoothness of $S^{\beta,\theta }$ as defined in (1.4). This induces $\widetilde{\rho} _t=\textrm{Law}\big ((\widetilde{X}_t,\widetilde{Y}_t)\big )$ . Moreover, the assumed regularity of the initial distribution $\rho _0 \in\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d)$ allows to obtain a fourth-order moment estimate of the form $\mathbb{E}\big [\|{\widetilde{X}_t}\|_2^4+\|{\widetilde{Y}_t}\|_2^4\big ] \leqslant \big (1+2\mathbb{E}\big [\|{\widetilde{X}_0}\|_2^4+\|{\widetilde{Y}_0}\|_2^4\big ]\big )e^{ct}$ , see, e.g. [Reference Arnold3, Chapter 7]. So, in particular, $\widetilde{\rho} \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ .
Step 2: For some test function $\phi \in\mathcal{C}^2_{*}(\mathbb{R}^d\times \mathbb{R}^d)$ as defined in (2.6), by Itô’s formula, we derive
After taking the expectation, applying Fubini’s theorem and observing that the stochastic integrals of the form $\mathbb{E}\int _0^t\nabla _x\phi (\widetilde{X}_t,\widetilde{Y}_t)\cdot D(\,\cdot \,)\,dB_t$ vanish as a consequence of [Reference Øksendal50, Theorem 3.2.1(iii)] due to the established regularity $\widetilde{\rho} \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ and $\phi \in\mathcal{C}^2_{*}(\mathbb{R}^d\times \mathbb{R}^d)$ , we obtain
according to the fundamental theorem of calculus. This shows that $\widetilde{\rho} \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ satisfies the Fokker-Planck equation
The remainder is identical to the cited reference and is summarised below for completeness.
Step 3: Setting $\mathcal{T} u\,:\!=\,y_{\alpha }({\widetilde{\rho} _Y})\in \mathcal{C}([0,T],\mathbb{R}^d)$ provides the self-mapping property of the map
which is compact as a consequence of a stability estimate for the consensus point [Reference Carrillo, Choi, Totzeck and Tse12, Lemma 3.2]. More precisely, as shown in the cited result, it holds $\|{y_{\alpha }({\widetilde{\rho} _{Y,t}})-y_{\alpha }({\widetilde{\rho} _{Y,s}})}\|_2 \lesssim W_2(\widetilde{\rho} _{Y,t},\widetilde{\rho} _{Y,s})$ for $\widetilde{\rho} _{Y,t},\widetilde{\rho} _{Y,s}\in\mathcal{P}_4(\mathbb{R}^d)$ . Together with the Hölder- $1/2$ continuity of the Wasserstein- $2$ distance $W_2(\widetilde{\rho} _{Y,t},\widetilde{\rho} _{Y,s})$ , this ensures the claimed compactness of $\mathcal{T}$ .
Step 4: Then, for $u=\vartheta\mathcal{T} u$ with $\vartheta \in [0,1]$ , there exists $\rho \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ satisfying (2.7) with marginal $\rho _Y$ such that $u=\vartheta y_{\alpha }({\rho _Y})$ . For such $u$ , a uniform bound can be obtained either thanks to the boundedness or the growth condition of $\mathcal{E}$ required in the statement. An application of the Leray-Schauder fixed point theorem concludes the proof by providing a solution to (1.5).
2.2. Main result
We now present the main theoretical result about global mean-field law convergence of CBO with memory effects and gradient information for objectives that satisfy the following conditions.
Definition 2.4 (Assumptions). Throughout, we are interested in functions ${\mathcal{E}} \in\mathcal{C}(\mathbb{R}^d)$ , for which
-
A1 there exists a unique $x^*\in \mathbb{R}^d$ such that ${\mathcal{E}}(x^*)=\inf _{x\in \mathbb{R}^d}{\mathcal{E}}(x)\,=\!:\,\underline{\mathcal{E}}$ , and
-
A2 there exist ${\mathcal{E}}_{\infty },R_0,\eta \gt 0$ , and $\nu \in (0,\infty )$ such that
(2.8) \begin{align} \left \|{x-x^*}\right \|_{\infty} &\leq \frac{1}{\eta }({\mathcal{E}}(x)-\underline{\mathcal{E}})^{\nu } \quad \text{for all } x \in B^\infty _{R_0}(x^*), \end{align}(2.9) \begin{align} {\mathcal{E}}_{\infty } &\lt{\mathcal{E}}(x)-\underline{\mathcal{E}}\quad \text{for all } x \in \big (B^\infty _{R_0}(x^*)\big )^c. \end{align}
Furthermore, for the case of an additional gradient drift component, i.e., if $\lambda _3\not =0$ , we additionally require that ${\mathcal{E}}\in\mathcal{C}^1(\mathbb{R}^d)$ and that
-
A3 there exist $C_{\nabla{\mathcal{E}}} \gt 0$ such that
(2.10) \begin{align} \left \|{\nabla{\mathcal{E}}(x)}\right \|_2 \leq C_{\nabla{\mathcal{E}}}\left \|{x-x^*}\right \|_2 \quad \text{for all } x \in \mathbb{R}^d. \end{align}
In the case, where no gradient drift is present, i.e., $\lambda _3=0$ in equations (1.1), (1.5) and (1.6), the objective function $\mathcal{E}$ is only required to be continuous and satisfy Assumptions A1 and A2. While the former merely imposes that the infimum is attained at $x^*$ , the latter can be regarded as a tractability condition of the energy landscape of $\mathcal{E}$ [Reference Fornasier, Huang, Pareschi and Sünnen26, Reference Fornasier, Klock and Riedl28]. More precisely, the inverse continuity condition (2.8) ensures that $\mathcal{E}$ is locally coercive in some neighbourhood of the global minimiser $x^*$ . Condition (2.9), on the other hand, guarantees that in the farfield $\mathcal{E}$ is bounded away from the minimal value by at least ${\mathcal{E}}_{\infty }$ . This in particular excludes objectives for which ${\mathcal{E}}(x)\approx \underline{\mathcal{E}}$ far away from $x^*$ . Note that A2 actually already implies the uniqueness of $x^*$ requested in A1. In case of an additional gradient drift term in the dynamics, i.e., $\lambda _3\not =0$ , the objective naturally needs to be continuously differentiable. Furthermore, in Assumption A3 we impose that the gradient $\nabla{\mathcal{E}}$ grows at most linearly. This is a significantly weaker assumption compared to typical smoothness assumptions about $\mathcal{E}$ in the optimisation literature (in particular in the analysis of stochastic gradient descent), where Lipschitz-continuity of the gradient of $\mathcal{E}$ is required [Reference Moulines and Bach49].
We are now ready to state the main theoretical result. Its proof is deferred to Section 3. For the reader’s convenience let us recall that
which motivates to investigate the behaviour of the Lyapunov functional $\mathcal{V}(\rho _t)$ as introduced in (2.11) below.
Theorem 2.5. Let ${\mathcal{E}}\in\mathcal{C}(\mathbb{R}^d)$ satisfy A1 and A2. Furthermore, in the case of an active gradient drift in the CBO dynamcis ( 1.5 ), i.e., if $\lambda _3\not =0$ , let ${\mathcal{E}}\in\mathcal{C}^1(\mathbb{R}^d)$ obey in addition A3. Moreover, let $\rho _0 \in\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d)$ be such that $(x^*,x^*)\in \operatorname{supp}(\rho _0)$ . Let us define the functional
and the rates
which we assume to be strictly positive through a sufficient choice of the parameters of the CBO dynamics. Furthermore, provided that $\mathcal{V}(\rho _0)\gt 0$ , fix any $\varepsilon \in (0,\mathcal{V}(\rho _0))$ , $\vartheta \in (0,1)$ and define the time horizon
Then, there exists $\alpha _0 \gt 0$ , depending (among problem-dependent quantities) also on $\varepsilon$ and $\vartheta$ , such that for all $\alpha \gt \alpha _0$ , if $\rho \in\mathcal{C}([0,T^*],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ is a weak solution to the Fokker-Planck equation ( 1.6 ) on the time interval $[0,T^*]$ with initial condition $\rho _0$ , we have
Furthermore, on the time interval $[0,T]$ , $\mathcal{V}(\rho _t)$ decays at least exponentially fast, i.e., for all $t\in [0,T]$ it holds
Theorem 2.5 proves the exponentially fast convergence of the law $\rho$ of the dynamics (1.5) to the global minimiser $x^*$ of $\mathcal{E}$ under a minimal assumption about the initial distribution $\rho _0$ . The result in particular allows to devise a strategy for the parameter choices of the method. Namely, fixing the parameters $\lambda _2, \lambda _3,\sigma _1,\sigma _2,\sigma _3$ and $\theta$ , choosing $\lambda _1$ and consecutively $\kappa$ such that
ensures that the convergence rate $\chi _1$ is strictly positive. Since $\chi _2\geq \chi _1$ , $\chi _2\gt 0$ as well. Given a desired accuracy $\varepsilon$ , by consulting the proof in Section 3.4, we can further derive an estimate on the lower bound of $\alpha$ , namely
for some suitably small $r\in (0,R_0)$ , which, like the hidden constant, may depend on $\varepsilon$ . The choice of the first set of parameters, in particular what concerns the drift towards the historical best and in the direction of the negative gradient, requires some manual hyperparameter tuning and depends on the problem at hand. We will see this also in Section 4, where we conduct numerical experiments in different application areas.
Eventually, with (2.13) one can determine the maximal time horizon $T^*$ , until which the Lyapunov functional $\mathcal{V}(\rho _t)$ is guaranteed to have reached the prescribed $\varepsilon$ . The exact time point $T$ , where $\mathcal{V}(\rho _T) = \varepsilon$ , is characterised more concretely in equation (2.14). Due to the presence of memory effects and gradient information, which might counteract the consensus drift of CBO, it seems challenging to specify $T$ more closely. However, in the case of standard CBO, $T$ turns out to be equal to $T^*$ up to a factor depending merely on $\vartheta$ , see, e.g., [Reference Fornasier, Klock and Riedl28].
In fact, this result can be retrieved as a special case of the subsequent Corollary 2.6, where we state an analogous convergence result for the CBO dynamics with gradient information but without memory effect. Its respective proof follows the lines of the one of the richer dynamics in Section 3, cf. also [Reference Fornasier, Klock and Riedl28, Theorem 12] and [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Theorem 2], and it is left as an exercise to the reader interested in the technical details of the proof technique. More precisely, for the instantaneous CBO model with gradient drift,
where $\widehat{\widetilde{\rho }}_t\!^{N}\,:\!=\,\frac{1}{N}\sum _{i=1}^N\delta _{\widetilde{X}_{t}^i}$ and to which the associated mean-field Fokker-Planck equation reads
we have the following convergence result.
Corollary 2.6. Let ${\mathcal{E}}\in\mathcal{C}(\mathbb{R}^d)$ satisfy A1 and A2. Furthermore, in the case of an active gradient drift, i.e., if $\lambda _3\not =0$ , let ${\mathcal{E}}\in\mathcal{C}^1(\mathbb{R}^d)$ obey in addition A3. Moreover, let $\widetilde{\rho} _0 \in\mathcal{P}_4(\mathbb{R}^d)$ be such that $x^*\in \operatorname{supp}(\widetilde{\rho} _0)$ . Let us define the functional
and the rates
which we assume to be strictly positive through a sufficient choice of the parameters of the CBO dynamics. Furthermore, provided that $\widetilde{\mathcal{V}}(\widetilde{\rho} _0)\gt 0$ , fix any $\varepsilon \in (0,\widetilde{\mathcal{V}}(\widetilde{\rho} _0))$ , $\vartheta \in (0,1)$ and define the time horizon
Then, there exists $\widetilde{\alpha}_0 \gt 0$ , depending (among problem-dependent quantities) also on $\varepsilon$ and $\vartheta$ , such that for all $\alpha \gt \widetilde{\alpha}_0$ , if $\widetilde{\rho} \in\mathcal{C}([0,T^*],\mathcal{P}_4(\mathbb{R}^d))$ is a weak solution to the Fokker-Planck equation ( 2.17 ) on the time interval $[0,\widetilde T^*]$ with initial condition $\widetilde{\rho} _0$ , we have
Furthermore, on the time interval $[0,\widetilde{T}]$ , $\widetilde{\mathcal{V}}(\widetilde{\rho}_t)$ decays at least exponentially fast, i.e., for all $t\in [0,\widetilde{T}]$ it holds
3. Proof details for Section 2.2
In what follows, we provide the proof details for the global mean-field law convergence result of CBO with memory effects and gradient information, Theorem 2.5. The entire section can be read as a proof sketch with Corollaries 3.3 and 3.5, Propositions 3.6 and 3.8 containing the key individual statements. How to combine these results rigorously to complete the proof of Theorem 2.5 is then covered in detail in Section 3.4.
Remark 3.1. Without loss of generality, we assume $\underline{\mathcal{E}} = 0$ throughout this section.
3.1. Evolution of the mean-field limit
Recall that our overall goal is to establish the convergence of the dynamics (1.6) to a Dirac delta at the global minimiser $x^*$ with respect to the Wasserstein- $2$ distance, i.e.,
To this end, we analyse the decay behaviour of the functional $\mathcal{V}(\rho _t)$ as defined in (2.11), i.e., $\mathcal{V}(\rho _t) = \frac{1}{2} \iint \!\big (\!\left \|{x-x^*}\right \|_2^2 + \left \|{y-x}\right \|_2^2\!\big )\,d\rho _t(x,y)$ . More precisely, we will show its exponential decay with a rate controllable through the parameters of the CBO method.
Let us start below with deriving the evolution inequalities for the functionals
Lemma 3.2. Let ${\mathcal{E}}\, :\, \mathbb{R}^d \rightarrow \mathbb{R}$ , and fix $\alpha,\lambda _1,\sigma _1 \gt 0$ and $\lambda _2,\sigma _2,\lambda _3,\sigma _3,\beta,\kappa,\theta \geq 0$ . Moreover, let $T\gt 0$ and let $\rho \in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d\times \mathbb{R}^d))$ be a weak solution to the Fokker-Planck equation ( 1.6 ). Then, the functionals $\mathcal{X}(\rho _t)$ and $\mathcal{Y}(\rho _t)$ satisfy
where the inequality has to be understood component-wise.
Proof. We note that the functions $\phi _{\mathcal{X}}(x,y) = 1/2\left \|{x-x^*}\right \|_2^2$ and $\phi _{\mathcal{Y}}(x,y) = 1/2\left \|{y-x}\right \|_2^2$ are in $\mathcal{C}^2_{*}(\mathbb{R}^d\times \mathbb{R}^d)$ and recall that $\rho$ satisfies the weak solution identity (2.1) for such test functions. Hence, by applying (2.1) with $\phi _{\mathcal{X}}$ and $\phi _{\mathcal{Y}}$ as above, we obtain for the evolution of $\mathcal{X}(\rho _t)$
and for the evolution of $\mathcal{Y}(\rho _t)$
Here we used $\nabla _x\phi _{\mathcal{X}}(x,y) = x-x^*$ , $\nabla _y\phi _{\mathcal{X}}(x,y)=0$ , $\partial ^2_{x_kx_k} \phi _{\mathcal{X}}(x,y) = 1$ , $\nabla _x\phi _{\mathcal{Y}}(x,y) = x-y$ , $\nabla _y\phi _{\mathcal{Y}}(x,y)=y-x$ and $\partial ^2_{x_kx_k} \phi _{\mathcal{Y}}(x,y) = 1$ . Let us now collect auxiliary estimates in (3.3a)–(3.3g), which turn out to be useful in establishing upper bounds for (3.1) and (3.2). Using standard tools such as Cauchy-Schwarz and Young’s inequality we have
where in (3.3b)–(3.3d) we expanded the left-hand side of the scalar product and the norm by subtracting and adding $x^*$ . Furthermore, by means of A3 we obtain
Integrating the bounds (3.3a), (3.3b), (3.3d), (3.3e) and (3.3g) into equation (3.1) results in the upper bound
where we furthermore used that by Jensen’s inequality
For equation (3.2), we first note that, by definition, $S^{\beta,\theta }\geq \theta$ uniformly. This combined with the bounds (3.3c), (3.3d), (3.3f) and (3.3g) allows to derive
where we used (3.4) together with an analogous bound for $\iint \left \|{x-y}\right \|_2 d\rho _t(x,y)$ .
Recalling that $\mathcal{V}(\rho _t) =\mathcal{X}(\rho _t) +\mathcal{Y}(\rho _t)$ immediately allows to obtain an evolution inequality for $\mathcal{V}(\rho _t)$ of the following form.
Corollary 3.3. Under the assumptions of Lemma 3.2 , the functional $\mathcal{V}(\rho _t)$ satisfies
with $\chi _1$ as specified in ( 2.12a ).
Analogously to the upper bounds on the time evolutions of the functionals $\mathcal{X}(\rho _t)$ , $\mathcal{Y}(\rho _t)$ and $\mathcal{V}(\rho _t)$ , we can derive bounds from below as follows.
Lemma 3.4. Under the assumptions of Lemma 3.2 , the functionals $\mathcal{X}(\rho _t)$ and $\mathcal{Y}(\rho _t)$ satisfy
where the inequality has to be understood component-wise.
Proof. By following the lines of the proof of Lemma 3.2 and noticing that in analogy to the estimates (3.3), it hold
as well as
we obtain the statement by integrating the bounds into equations (3.1) and (3.2).
Corollary 3.5. Under the assumptions of Lemma 3.2 , the functional $\mathcal{V}(\rho _t)$ satisfies
with $\chi _2$ as specified in ( 2.12b ).
In order to be able to apply Grönwall’s inequality to (3.5) and (3.7) with the aim of obtaining estimates of the form $\mathcal{V}(\rho _t) \leq\mathcal{V}(\rho _0)e^{-(1-\vartheta )\chi _1 t}$ and $\mathcal{V}(\rho _t) \geq\mathcal{V}(\rho _0)e^{-(1-\vartheta/2)\chi _2 t}$ for some $\chi _1,\chi _2\gt 0$ and a suitable $\vartheta \in (0,1)$ , it remains to control the quantity $\left \|{y_{\alpha }({\rho _{Y,t}}) - x^*}\right \|_2$ through the choice of the parameter $\alpha$ . This is the content of the next section.
3.2. Quantitative Laplace principle
The well-known Laplace principle [Reference Dembo and Zeitouni21, Reference Miller48, Reference Pinnau, Totzeck, Tse and Martin52] asserts that for any absolutely continuous probability distribution $\varrho \in\mathcal{P}(\mathbb{R}^d)$ with $x^*\in \operatorname{supp}(\varrho )$ it holds
which allows to infer that the $\alpha$ -weighted measure $\omega _{\alpha }/\left \|{\omega _{\alpha }}\right \|_{L_1(\varrho )}\varrho$ is concentrated in a small region around the minimiser $x^*$ , provided that $\mathcal{E}$ attains its minimum at a single point, which is however guaranteed by the inverse continuity property A2.
The asymptotic nature of the result (3.8), however, does not permit to obtain the required quantitative estimates, which is the reason why the authors of [Reference Fornasier, Klock and Riedl28] proposed a quantitative nonasymptotic variant of the Laplace principle. In the following proposition, cf. [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Proposition 1], we state this result for the setting of anisotropic noise considered throughout the paper.
Proposition 3.6 ([Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Proposition 1]). Let $\underline{\mathcal{E}} = 0$ , $\varrho \in\mathcal{P}(\mathbb{R}^d)$ and fix $\alpha \gt 0$ . For any $r \gt 0$ we define ${\mathcal{E}}_{r} \,:\!=\, \sup _{y \in B^\infty _{r}(x^*)}{\mathcal{E}}(y)$ . Then, under the inverse continuity property A2, for any $r \in (0,R_0]$ and $q \gt 0$ such that $q +{\mathcal{E}}_{r} \leq{\mathcal{E}}_{\infty }$ , we have
Proof. The proof is a mere reformulation of the one of [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Proposition 1], which is presented in what follows for the sake of completeness.
For any $a \gt 0$ , Markov’s inequality gives $\|{\omega _{\alpha }}\|_{L_1(\varrho )} \geq a \varrho (\{y\, :\, \exp ({-}\alpha{\mathcal{E}}(y)) \geq a\})$ . By choosing $a = \exp ({-}\alpha{\mathcal{E}}_{r})$ and noting that
we get $\left \|{\omega _{\alpha }}\right \|_{L_1(\varrho )} \geq \exp ({-}\alpha{\mathcal{E}}_{r})\varrho (B^\infty _{r}(x^*))$ . Now let $\widetilde r \geq r \gt 0$ . With the definition of the consensus point $y_{\alpha }({\varrho }) = \int y \omega _{\alpha }(y)/\|{\omega _{\alpha }}\|_{L_1(\varrho )}\,d\varrho (y)$ and Jensen’s inequality, we can decompose
After noticing that the first term is bounded by $\widetilde r$ since $ \left \|{y-x^*}\right \|_{\infty} \leq \widetilde r$ for all $y \in B^\infty _{\widetilde r}(x^*)$ , we can continue the former with
where for the second term we used $\left \|{\omega _{\alpha }}\right \|_{L_1(\varrho )} \geq \exp ({-}\alpha{\mathcal{E}}_{r})\varrho (B^\infty _{r}(x^*))$ from above. Let us now choose $\widetilde r = (q+{\mathcal{E}}_r)^{\nu }/\eta$ , which satisfies $\widetilde r \geq r$ , since A2 with $\underline{\mathcal{E}} = 0$ and $r \leq R_0$ implies
Furthermore, due to the assumption $q+{\mathcal{E}}_r \leq{\mathcal{E}}_{\infty }$ in the statement we have $\widetilde r \leq{\mathcal{E}}_{\infty }^{\nu }/\eta$ , which together with the two cases of A2 with $\underline{\mathcal{E}} = 0$ allows to bound the infimum in (3.9) as follows
Inserting this and the definition of $\,\widetilde r$ into (3.9), we get the result as a consequence of the norm equivalence $\left \|{\,\cdot \,}\right \|_{\infty} \leq \left \|{\,\cdot \,}\right \|_2\leq \sqrt{d}\left \|{\,\cdot \,}\right \|_{\infty}$ .
To eventually apply Proposition 3.6 in the setting of Corollary 3.3, i.e., to upper bound the distance of the consensus point $y_{\alpha }({\rho _{Y,t}})$ to the global minimiser $x^*$ , it remains to ensure that $\rho _{Y,t}(B^\infty _{r}(x^*))$ is bounded away from $0$ for a finite time horizon. We ensure that this is indeed the case in what follows.
3.3. A lower bound for the probability mass $\rho _{Y,t}(B^\infty _{r}(x^*))$
In this section, for any small radius $r \gt 0$ , we provide a lower bound on the probability mass of $\rho _{Y,t}(B^\infty _{r}(x^*))$ by defining a mollifier $\phi _r\, :\, \mathbb{R}^d\times \mathbb{R}^d \rightarrow \mathbb{R}$ so that
and studying the evolution of the right-hand side.
Lemma 3.7. For $r \gt 0$ let $\Omega _r \,:\!=\, \{(x,y)\in \mathbb{R}^d\times \mathbb{R}^d: \max \{\left \|{x-x^*}\right \|_{\infty}, \left \|{x-y}\right \|_{\infty} \}\lt r/2\}$ and define the mollifier $\phi _r\, :\, \mathbb{R}^d\times \mathbb{R}^d \rightarrow \mathbb{R}$ by
We have that $\textit{Im} (\phi _r) = [0,1]$ , $\operatorname{supp}(\phi _r) = \Omega _r \subset B^\infty _{r/2}(x^*) \times B^\infty _{r}(x^*) \subset \mathbb{R}^d \times B^\infty _{r}(x^*)$ , $\phi _{r} \in\mathcal{C}_c^{\infty }(\mathbb{R}^d\times \mathbb{R}^d)$ and
Proof. It is straightforward to check the properties of $\phi _r$ as it is a tensor product of classical well-studied mollifiers.
To keep the notation as concise as possible in what follows, let us introduce the decomposition
where
and analogously for $\delta ^{2,*}_{x_kx_k} \phi _{r}$ and $\delta ^{2,Y}_{x_kx_k} \phi _{r}$ .
Proposition 3.8. Let $T \gt 0$ , $r \gt 0$ , and fix parameters $\alpha,\lambda _1,\sigma _1 \gt 0$ as well as parameters $\lambda _2,\sigma _2,\lambda _3,\sigma _3,\beta,\kappa,\theta \geq 0$ such that $\sigma _2\gt 0$ iff $\lambda _2\not =0$ and $\sigma _3\gt 0$ iff $\lambda _3\not =0$ . Moreover, assume the validity of Assumption A3 if $\lambda _3\not =0$ . Let $\rho \in\mathcal{C}([0,T],\mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d))$ weakly solve the Fokker-Planck equation ( 1.6 ) in the sense of Definition 2.1 with initial condition $\rho _0 \in\mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d)$ and for $t \in [0,T]$ . Then, for all $t\in [0,T]$ we have
with
where, for any $B\lt \infty$ with $\sup _{t \in [0,T]}\left \|{y_{\alpha }({\rho _{Y,t}})-x^*}\right \|_2 \leq B$ , $C_{\Upsilon} = C_{\Upsilon} (r,B,d,C_{\nabla{\mathcal{E}}})$ is as defined in ( 3.20 ). Moreover, $\omega _i=\mathbb{1}_{\lambda _i\gt 0}$ for $i\in \{1,2,3\}$ and $c \in (1/2,1)$ can be any constant that satisfies $(1-c)^2 \leq (2c-1)c$ .
Remark 3.9. In order to ensure a finite decay rate $p\lt \infty$ in Proposition 3.8 , it is crucial to have non-vanishing diffusions $\sigma _1\gt 0$ , $\sigma _2\gt 0$ if $\lambda _2\not =0$ and $\sigma _3\gt 0$ if $\lambda _3\not =0$ . As apparent from the formulation of the statement as well as the proof below, $\sigma _2$ or $\sigma _3$ may be $0$ if the corresponding drift parameter, $\lambda _2$ or $\lambda _3$ , respectively, vanishes.
Proof of Proposition 3.8 . By the definition of the marginal $\rho _Y$ and the properties of the mollifier $\phi _r$ defined in Lemma 3.7, we have
Our strategy is to derive a lower bound for the right-hand side of this inequality. Using the weak solution property of $\rho$ as in Definition 2.1 and the fact that $\phi _{r}\in\mathcal{C}^{\infty }_c(\mathbb{R}^d\times \mathbb{R}^d)$ , we obtain
where $T^s_{k}(x,y) \,:\!=\, -\kappa S^{\beta,\theta }(x,y) \left (y-x\right )_k \partial _{y_k}\phi _r(x,y)$ and
for $k\in \{1,\dots,d\}$ . Since the mollifier $\phi _r$ and its derivatives vanish outside of $\Omega _r$ , we restrict our attention to $\Omega _r$ and aim for showing for all $k\in \{1,\dots,d\}$ that
-
• $T^s_{k}(x,y) \geq 0$ ,
-
• $T^c_{1k}(x,y) + T^c_{2k}(x,y) \geq -p^c\phi _r(x,y)$ ,
-
• $T^\ell _{1k}(x,y) + T^\ell _{2k}(x,y) \geq -p^\ell \phi _r(x,y)$ ,
-
• $T^g_{1k}(x,y) + T^g_{2k}(x,y) \geq -p^g\phi _r(x,y)$
pointwise for all $(x,y)\in \Omega _r$ with suitable constants $0 \leq p^\ell, p^c, p^g \lt \infty$ .
Term $T^s_{k}$ : Using the expression for $\partial _{y_k} \phi _{r}$ from Lemma 3.7 and the fact that $S^{\beta,\theta }\geq \theta/2\geq 0$ , it is easy to see that
Terms $T^c_{1k}+T^c_{2k}$ , $T^\ell _{1k}+T^\ell _{2k}$ and $T^g_{1k}+T^g_{2k}$ : We first note that the third inequality from above holds with $p^\ell =0$ if $\lambda _2=\sigma _2=0$ and the fourth with $p^g=0$ if $\lambda _3=\sigma _3=0$ .
Therefore, in what follows we assume that $\lambda _2,\sigma _2,\lambda _3,\sigma _3\gt 0$ . In order to lower bound the three terms from above, we arrange the summands by using the abbreviations introduced in (3.10) as follows. For $T^c_{1k}+T^c_{2k}$ , we have
for $T^\ell _{1k}+T^\ell _{2k}$ we have
and for $T^g_{1k}+T^g_{2k}$ we have
We now treat each of the two-part sums in (3.15a), (3.15b), (3.16a), (3.16b), (3.17a) and (3.17b) separately by employing a technique similar to the one used in the proof of [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Proposition 2], which was developed originally to prove [Reference Fornasier, Klock and Riedl28, Proposition 20].
Terms (3.15a), (3.16a) and (3.17a): Owed to their similar structure (in particular with respect to the denominator of the derivatives $\delta ^{*}_{x_k} \phi _{r}$ and $\delta ^{2,*}_{x_kx_k} \phi _{r}$ ), we can treat the three sums (3.15a), (3.16a) and (3.17a) simultaneously. Therefore, we consider the general formulation
which matches (3.15a) when $\Upsilon _k(x,y) = (x-y_{\alpha }({\rho _{Y,t}}))_k$ , $\lambda =\lambda _1$ and $\sigma =\sigma _1$ , (3.16a) when $\Upsilon _k(x,y) = (x-y)_k$ , $\lambda =\lambda _2$ and $\sigma =\sigma _2$ , and (3.17a) when $\Upsilon _k(x,y) = \partial _{x_k}{\mathcal{E}}(x)$ , $\lambda =\lambda _3$ and $\sigma =\sigma _3$ .
To achieve the desired lower bound over $\Omega _r$ , we introduce the subsets
and
where $\tilde{c} \,:\!=\, 2c-1\in (0,1)$ . For fixed $k$ , we now decompose $\Omega _r$ according to
In the following, we treat each of these three subsets separately.
Subset $(K_{1k}^{*})^c \cap \Omega _r$ : We have $\left |{\left (x-x^*\right )_k}\right | \leq \frac{\sqrt{c}}{2}r$ for each $(x,y) \in (K_{1k}^*)^c$ , which can be used to independently derive lower bounds for both summands in (3.18). For the first, we insert the expression for $\delta ^{*}_{x_k} \phi _{r}(x,y)$ to get
where, in the last inequality, we used that $(x,y)\in \Omega _r$ , the definition of $B$ and Assumption A3 to get the bound
For the second summand, we insert the expression for $\delta ^{2,*}_{x_kx_k} \phi _{r}(x,y)$ to obtain
where the last inequality uses $\Upsilon _k^2(x,y) \leq C_{\Upsilon} ^2$ .
Subset $K^*_{1k} \cap (K_{2k}^{*})^c \cap \Omega _r$ : As $(x,y) \in K^*_{1k}$ , we have $\left |{\left (x-x^*\right )_k}\right | \gt \frac{\sqrt{c}}{2}r$ . We observe that the sum in (3.18) is nonnegative for all $(x,y)$ in this subset whenever
The first term on the left-hand side in (3.22) can be bounded from above by exploiting that $v\in (K_{2k}^{*})^c$ and by using the relation $\tilde{c} = 2c-1$ . More precisely, we have
where the last inequality follows since $v\in K^*_{1k}$ . For the second term on the left-hand side in (3.22), we can use $(1-c)^2 \leq (2c-1)c$ as per assumption, to get
Hence, (3.22) holds and we have that (3.18) is uniformly nonnegative on this subset.
Subset $K^*_{1k} \cap K^*_{2k} \cap \Omega _r$ : As $(x,y) \in K_{1k}^*$ , we have $\left |{\left (x-x^*\right )_k}\right | \gt \frac{\sqrt{c}}{2}r$ . To start with we note that the first summand of (3.18) vanishes whenever $\sigma ^2\Upsilon _k^2(x,y) = 0$ , provided $\sigma \gt 0$ , so nothing needs to be done if $\Upsilon _k(x,y)=0$ . Otherwise, if $\sigma ^2\Upsilon _k^2(x,y) \gt 0$ , we exploit $(x,y)\in K_{2k}^*$ to get
Using this, the first summand of (3.18) can be bounded from below by
For the second summand, the nonnegativity of $\sigma ^2\Upsilon _k^2(x,y)$ implies the nonnegativity, whenever
This holds for $v\in K_{1k}^*$ , if $2(2c-1)c \geq (1-c)^2$ as implied by the assumption.
Term (3.16b): Recall that this term has the structure
We first note that the first summand of (3.24) is always nonnegative since
For the second summand of (3.24), a direct computation shows
which is nonnegative on the set
for any $c\geq 1/\sqrt{3}$ , as ensured by $(1-c)^2 \leq (2c-1)c$ . On the complement $(K^Y_k)^c$ , we have $\left |{\left (x-y\right )_k}\right | \leq \frac{\sqrt{c}}{2}r$ , which can be used to bound
Terms (3.15b) and (3.17b): The final two terms to be controlled have again a similar structure of the form
where we recycle the notation introduced after (3.18), i.e., $\Upsilon _k(x,y) = (x-y_{\alpha }({\rho _{Y,t}}))_k$ , $\lambda =\lambda _1$ and $\sigma =\sigma _1$ in the case of (3.15b) and $\Upsilon _k(x,y) = \partial _{x_k}{\mathcal{E}}(x)$ , $\lambda =\lambda _3$ and $\sigma =\sigma _3$ in the case of (3.17b).
The procedure for deriving lower bounds is similar to the one at the beginning with the exception that the denominator of the derivatives $\delta ^Y_{x_k} \phi _{r}$ and $\delta ^{2,Y}_{x_kx_k} \phi _{r}$ requires to introduce an adapted decomposition of $\Omega _r$ . To be more specific, we define the subsets
and
where $\tilde{c} \,:\!=\, 2c-1\in (0,1)$ . For fixed $k$ , we now decompose $\Omega _r$ according to
In the following, we treat again each of these three subsets separately.
Subset $(K_{1k}^Y)^c \cap \Omega _r$ : We have $\left |{\left (x-y\right )_k}\right | \leq \frac{\sqrt{c}}{2}r$ for each $(x,y) \in (K_{1k}^Y)^c$ , which can be used to independently derive lower bounds for both summands in (3.27). For the first summand, we insert the expression for $\delta ^Y_{x_k} \phi _{r}(x,y)$ to get
where we recall from above that $\Upsilon _k(x,y) \leq C_{\Upsilon}$ , which was used in the last inequality. For the second summand, we insert the expression for $\delta ^{2,Y}_{x_kx_k} \phi _{r}(x,y)$ to obtain
where the last inequality uses $\Upsilon _k^2(x,y) \leq C_{\Upsilon} ^2$ .
Subset $K^Y_{1k} \cap (K_{2k}^Y)^c \cap \Omega _r$ : As $(x,y) \in K^Y_{1k}$ , we have $\left |{\left (x-y\right )_k}\right | \gt \frac{\sqrt{c}}{2}r$ . We observe that the sum in (3.27) is nonnegative for all $(x,y)$ in this subset whenever
The first term on the left-hand side in (3.30) can be bounded from above exploiting that $v\in (K_{2k}^Y)^c$ and by using the relation $\tilde{c} = 2c-1$ . More precisely, we have
where the last inequality follows since $v\in K^Y_{1k}$ . For the second term on the left-hand side in (3.30), we can use $(1-c)^2 \leq (2c-1)c$ as per assumption, to get
Hence, (3.30) holds and we have that (3.27) is uniformly nonnegative on this subset.
Subset $K^Y_{1k} \cap K^Y_{2k} \cap \Omega _r$ : As $(x,y) \in K_{1k}^Y$ , we have $\left |{\left (x-y\right )_k}\right | \gt \frac{\sqrt{c}}{2}r$ . To start with we note that the first summand of (3.27) vanishes whenever $\sigma ^2\Upsilon _k^2(x,y) = 0$ , provided $\sigma \gt 0$ , so nothing needs to be done if $\Upsilon _k(x,y)=0$ . Otherwise, if $\sigma ^2\Upsilon _k^2(x,y) \gt 0$ , we exploit $(x,y)\in K_{2k}^Y$ to get
Using this, the first summand of (3.27) can be bounded from below by
For the second summand, the nonnegativity of $\sigma ^2\Upsilon _k^2(x,y)$ implies the nonnegativity, whenever
This holds for $v\in K_{1k}^Y$ , if $2(2c-1)c \geq (1-c)^2$ as implied by the assumption.
Concluding the proof: Combining the formerly established lower bounds (3.19), (3.21), (3.23), (3.25), (3.26), (3.28), (3.29) and (3.31), we obtain for the constants $p^c$ , $p^\ell$ and $p^g$ defined at the beginning of the proof
Together with (3.14) and by using the evolution of $\phi _r$ as in (3.13), we eventually obtain
where $q$ is defined implicitly and where $\omega _i=\mathbb{1}_{\lambda _i\gt 0}$ for $i\in \{1,2,3\}$ . Notice that $\omega _1=1$ since $\lambda _1\gt 0$ by assumption. An application of Grönwall’s inequality concludes the proof.
3.4. Proof of Theorem 2.5
We now have all necessary tools at hand to prove the global mean-field law convergence result for CBO with memory effects and gradient information by rigorously combining the formerly discussed statements.
Proof of Theorem 2.5. If $\mathcal{V}(\rho _0)=0$ , there is nothing to be shown since in this case $\rho _0=\delta _{(x^*,x^*)}$ . Thus, let $\mathcal{V}(\rho _0)\gt 0$ in what follows.
Let us first choose the parameter $\alpha$ such that
where we introduce the definitions
as well as
Moreover, $p$ is as given in (3.12) in Proposition 3.8 with $B=c\left (\vartheta,\chi _1,\lambda _1,\sigma _1\right )\sqrt{\mathcal{V}(\rho _0)}$ in $C_{\Upsilon}$ and with $r=r_{\varepsilon}$ . By construction, $q_{\varepsilon} \gt 0$ and $r_{\varepsilon} \leq R_0$ . Furthermore, recalling the notation ${\mathcal{E}}_{r}=\sup _{v \in B_{r}^\infty (x^*)}{\mathcal{E}}(v)$ from Proposition 3.6, we have $q_{\varepsilon} +{\mathcal{E}}_{r_{\varepsilon} } \leq 2q_{\varepsilon} \leq{\mathcal{E}}_{\infty }$ according to the definition of $r_{\varepsilon}$ . Since $q_{\varepsilon} \gt 0$ , the continuity of $\mathcal{E}$ ensures that there exists $s_{q_{\varepsilon} }\gt 0$ such that ${\mathcal{E}}(v)\leq q_{\varepsilon}$ for all $v\in B_{s_{q_{\varepsilon} }}^\infty (x^*)$ , yielding also $r_{\varepsilon} \gt 0$ .
Let us now define the time horizon $T_{\alpha} \geq 0$ by
with $C(t)\,:\!=\,c\left (\vartheta,\chi _1,\lambda _1,\sigma _1\right )\sqrt{\mathcal{V}(\rho _t)}$ . Notice for later use that $C(0)=B$ .
Our aim now is to show that $\mathcal{V}(\rho _{T_{\alpha} }) = \varepsilon$ with $T_{\alpha} \in \big [\frac{(1-\vartheta )\chi _1}{(1+\vartheta/2)\chi _2}T^*,T^*\big ]$ and that we have at least exponential decay of $\mathcal{V}(\rho _{t})$ until time $T_{\alpha}$ , i.e., until the accuracy $\varepsilon$ is reached.
First, however, we verify that $T_{\alpha} \gt 0$ , which is due to the continuity of $t\mapsto\mathcal{V}(\rho _{t})$ and $t\mapsto \left \|{y_{\alpha }({\rho _{Y,t}})-x^*}\right \|_2$ since $\mathcal{V}(\rho _{0}) \gt \varepsilon$ and $\left \|{y_{\alpha }({\rho _{Y,0}})-x^*}\right \|_2 \lt C(0)$ at time $0$ . While the former is a consequence of the assumption, the latter follows from Proposition 3.6 with $q_{\varepsilon}$ and $r_{\varepsilon}$ as defined in (3.35), which allows to show that
The first inequality in the last line holds by the choice of $\alpha$ in (3.33) and by noticing that $\Omega _{r_{\varepsilon}/2}\subset \mathbb{R}^d\times B_{r_{\varepsilon} }^\infty (x^*)$ and thus $\rho _0(\Omega _{r_{\varepsilon}/2})\leq \rho _{Y,0}\big (B_{r_{\varepsilon} }^\infty (x^*)\big )$ .
Next, we show that the functional $\mathcal{V}(\rho _t)$ is sandwiched between two exponentially decaying functions with rates $(1-\vartheta )\chi _1$ and $(1+\vartheta/2)\chi _2$ , respectively. More precisely, we prove that, up to time $T_{\alpha}$ , $\mathcal{V}(\rho _t)$ decays
-
(i) at least exponentially fast (with rate $(1-\vartheta )\chi _1)$ , and
-
(ii) at most exponentially fast (with rate $(1+\vartheta/2)\chi _2)$ .
To obtain (i), recall that Corollary 3.3 provides an upper bound on the time derivative of $\mathcal{V}(\rho _t)$ given by
with $\chi _1$ as in (2.12a) being strictly positive by assumption. By combining (3.37) and the definition of $T_{\alpha}$ in (3.36), we have by construction
Analogously, for (ii), by Corollary 3.5, we obtain a lower bound on the time derivative of $\mathcal{V}(\rho _t)$ given by
where the second inequality again exploits the definition of $T_{\alpha}$ . Grönwall’s inequality now implies for all $t \in [0,T_{\alpha} ]$ the upper and lower estimates
thereby proving (i) and (ii). The definition of $T_{\alpha}$ together with the one of $C(t)$ permits to control
To conclude, it remains to prove $\mathcal{V}(\rho _{T_{\alpha} }) = \varepsilon$ with $T_{\alpha} \in \big [\frac{(1-\vartheta )\chi _1}{(1+\vartheta/2)\chi _2}T^*,T^*\big ]$ . To this end, we consider the following three cases separately.
Case $T_{\alpha} \geq T^*$ : If $T_{\alpha} \geq T^*$ , the time-evolution bound of $\mathcal{V}(\rho _t)$ from (3.39a) combined with the definition of $T^*$ in (2.13) allows to immediately infer $\mathcal{V}(\rho _{T^*}) \leq \varepsilon$ . Therefore, with $\mathcal{V}(\rho _{t})$ being continuous, $\mathcal{V}(\rho _{T_{\alpha} }) = \varepsilon$ and $T_{\alpha} = T^*$ according to the definition of $T_{\alpha}$ in (3.36).
Case $T_{\alpha} \lt T^*$ and $\mathcal{V}(\rho _{T_{\alpha} }) \leq \varepsilon$ : By continuity of $\mathcal{V}(\rho _t)$ , it holds for $T_{\alpha}$ as defined in (3.36), $\mathcal{V}(\rho _{T_{\alpha} }) = \varepsilon$ . Thus, $\varepsilon =\mathcal{V}(\rho _{T_{\alpha} }) \geq\mathcal{V}(\rho _0) \exp \left ({-}(1+\vartheta/2)\chi _2 T_{\alpha} \right )$ as a consequence of the time-evolution bound (3.39b). The latter can be reordered as
Case $T_{\alpha} \lt T^*$ and $\mathcal{V}(\rho _{T_{\alpha} }) \gt \varepsilon$ : We will prove that this case can actually not occur by showing that $\left \|{y_{\alpha }({\rho _{Y,T_{\alpha} }})-x^*}\right \|_2 \lt C(T_{\alpha} )$ for the $\alpha$ chosen in (3.33). In fact, if both $\mathcal{V}(\rho _{T_{\alpha} })\gt \varepsilon$ and $\left \|{y_{\alpha }({\rho _{Y,T_{\alpha} }})-x^*}\right \|_2 \lt C(T_{\alpha} )$ held true simultaneously, this would contradict the definition of $T_{\alpha}$ in (3.36). To obtain this contradiction, we apply again Proposition 3.6 with $q_{\varepsilon}$ and $r_{\varepsilon}$ as before to get
Since, thanks to (3.40), we have $\max _{t \in [0,T_{\alpha} ]}\|{y_{\alpha }({\rho _{Y,t}})-x^*}\|_2 \leq B$ for $B=C(0)$ , which in particular does not depend on $\alpha$ , Proposition 3.8 guarantees the existence of $p\gt 0$ independent of $\alpha$ (but dependent on $B$ and $r_{\varepsilon}$ ) with
Here we use that $(x^*,x^*)\in \operatorname{supp}(\rho _0)$ to bound the initial mass $\rho _0$ and the fact that $\phi _{r}$ from Lemma 3.7 is bounded from below on $\Omega _{r/2}$ by $1/2^d$ . With this, we can continue the chain of inequalities in (3.41)3 to obtain
with the first inequality in the last line holding due to the choice of $\alpha$ in (3.33). This gives the desired contradiction, again thanks to the continuity of $t\mapsto\mathcal{V}(\rho _{t})$ and $t\mapsto \left \|{y_{\alpha }({\rho _{Y,t}})-x^*}\right \|_2$ .
4. Numerical experiments
In the first part of this section, we comment on how to efficiently implement a numerical scheme for the CBO dynamics (1.1) which allows to integrate memory mechanisms without additional computational complexity. Afterwards, we numerically demonstrate the benefit of memory effects and gradient information at the example of interesting real-world inspired applications.
4.1. Implementational aspects
Discretising the interacting particle system (1.1) in time by means of the Euler-Maruyama method [Reference Higham37] with prescribed time step size $\Delta t$ results in the implementable numerical scheme
where $((B_{k}^{m,i})_{k=0,\dots,K-1})_{i=1,\dots,N}$ are independent, identically distributed Gaussian random vectors in $\mathbb{R}^d$ with zero mean and covariance matrix $\Delta t \mathsf{Id}$ for $m\in \{1,2,3\}$ .
We notice that, compared to standard CBO, see, e.g., [Reference Fornasier, Klock and Riedl28, Equation (2)], the way the historical best position is updated in (4.1b) (recall the definition of $S^{\beta,\theta }$ from equation (1.4)) requires one additional evaluation of the objective function per particle in each time step, which raises the computational complexity of the numerical scheme substantially if computing $\mathcal{E}$ is costly and the dominating part. However, for the parameter choices $\kappa =1/\Delta t$ , $\theta =0$ and $\beta =\infty$ , in place of (4.1b), we obtain the update rule
which is how one expects a memory mechanism to be implemented. This way allows to recycle in time step $k$ the computations made in the previous step and thus leads to no additional computational cost as consequence of using memory effects. The memory consumption, on the other hand, is approximately twice as high as in standard CBO.
4.2. A benchmark problem in optimisation: the Rastrigin function
Let us validate in this section the numerical observation made in Figure 2a in the introduction about the benefit of memory effects. Namely, it has been observed in several prior works that a higher noise level can enhance the success of CBO. To rule out that the improved performance for $\lambda _2\gt 0$ in Figure 2a originates solely from the larger present noise as consequence of the additional noise term associated with the memory drift, we replicate in Figure 3 the experiments with the exception of setting $\sigma _2=0$ . The obtained results confirm that already the usage of memory effects together with a memory drift improves the performance. However, we also notice that an additional noise term further increases the success probability.
4.3. A machine learning example
As a first real-world inspired application, we now investigate the influence of memory mechanisms in a high-dimensional benchmark problem in machine learning, which is well-understood in the literature, namely the training of a shallow and a convolutional NN (CNN) classifier for the MNIST dataset of handwritten digits [Reference LeCun, Cortes and Burges47].
The experimental setting is the one of [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Section 4] with tested architectures as described in Figure 4. While it is not our aim to challenge the state of the art at this task by employing very sophisticated architectures, we demonstrate that CBO is on par with stochastic gradient descent without requiring time-consuming hyperparameter tuning.
To train the learnable parameters $\theta$ of the NNs, we minimise the empirical risk ${\mathcal{E}}(\theta ) = \frac{1}{M} \sum _{j=1}^M \ell (f_{\theta} (x^j),y^j)$ , where $f_{\theta}$ denotes the forward pass of the NN and $(x^j,y^{\,j})_{j=1}^M$ the $M$ training samples consisting of image and label. As loss $\ell$ we choose the categorical crossentropy loss $\ell (\widehat{y},y)=-\sum _{k=0}^9 y_k \log \left (\widehat{y}_k\right )$ with $\widehat{y}=f_{\theta} (x)$ denoting the output of the NN for a sample $(x,y)$ .
Our implementation is the one of [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Section 4], which includes concepts from [Reference Carrillo, Jin, Li and Zhu14] and [Reference Fornasier, Huang, Pareschi and Sünnen26, Section 2.2]. Firstly, mini-batching is employed when evaluating $\mathcal{E}$ and when computing the consensus point $y_{\alpha }$ , which means that $\mathcal{E}$ is evaluated on a random subset of size $n_{\mathcal{E}}=60$ of the training dataset and $y_{\alpha }$ is computed from a random subset of size $n_N=10$ of all $N=100$ particles. Secondly, a cooling strategy for $\alpha$ and the noise parameters is used. More precisely, $\alpha$ is doubled each epoch, while $\sigma _1$ and $\sigma _2$ follow the schedule $\sigma _{i,\text{epoch}} = \sigma _{i,0}/\log _2(\text{epoch}+2)$ for $i=1,2$ .
In Figure 5, we report the testing accuracies and the training risks evaluated at the consensus point based on a random sample of the training set of size 10,000 for both the shallow NN and the CNN when trained with one of three algorithms: standard CBO without memory effects as obtained when discretising [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Equation (1)], CBO with memory effects but without memory drift as in equation (4.1) with $\lambda _2=\sigma _2=0$ , and CBO with memory effects and memory drift as in equation (4.1) with $\lambda _2=0.4$ and $\sigma _2=\lambda _2\sigma _1$ . The remaining parameters are $\lambda _1=1$ , $\sigma _{1,0}=\sqrt{0.4}$ , $\alpha _{\text{initial}}=50$ , $\beta =\infty$ , $\theta =0$ , $\kappa =1/\Delta t$ , and discrete time step size $\Delta t=0.1$ . We train for $100$ epochs and use $N=100$ particles, which are initialised according to $\mathcal{N}\big ((0,\dots,0)^T,\mathsf{Id}\big )$ . All results are averaged over $5$ training runs.
As concluded already in [Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29, Section 4], we obtain accuracies comparable to SGD, cf. [Reference LeCun, Bottou, Bengio and Haffner46, Figure 9]. Moreover, we see slightly improved results when exploiting memory effects. However, we also notice that memory mechanisms slow down the training process initially.
4.4. A compressed sensing example
In the final numerical section of this paper, we showcase an application where gradient information turns out to be indispensable for the success of CBO methods, namely an experiment in compressed sensing [Reference Foucart and Rauhut30], which has become a very active and profitable field of research since the seminal works [Reference Candès, Romberg and Tao11, Reference Donoho22] about two decades ago.
One of the most common problems encountered in engineering and technology is concerned with the inference of information about an unknown signal $x^*\in \mathbb{R}^d$ from (linear) measurements $b\in \mathbb{R}^m$ . While classical linear algebra suggests that the number of measurements $m$ must be at least as large as the dimensionality $d$ of the signal, in many applications measurements are costly, time-consuming or both, making it desirable to reduce their number to the absolute minimum. Very often one aims at $m\ll d$ , since real-world signals usually live in high-dimensional spaces. In general, this would be an impossible task. However, in typical practical scenarios additional information about the quantity of interest $x^*$ is available, which indeed allows to reconstruct signals from few measurements $b$ . An empirically observed assumption about real-world signals is compressibility, meaning that they can be well-approximated by sparse vectors, i.e., vectors whose components are for the most part zero. Exploiting sparsity enables us to solve the underdetermined system $Ax^* = b$ efficiently in both theory and practice. Compressed sensing is concerned with the task of designing a measurement process $A\in \mathbb{R}^{m\times d}$ together with a reconstruction algorithm capable of recovering the sparse solution $x^*$ from the set of solutions consistent with the measurements. This can be formulated as the nonconvex combinatorial optimisation
where $\left \|{x}\right \|_0$ is colloquially referred to as $\ell _0$ -‘norm’ and denotes the number of non-zero elements of $x$ . Solving $\ell _0$ -minimisation is however NP-hard in general, which lead researchers to consider the convex relaxation
$\ell _1$ -minimisation is easy to solve by means of established methods from convex optimisation and provably recovers the correct solution for a suitable measurement matrix $A$ . The remaining question is about the correct way of inferring information about the signal through measurements. Remarkably and responsible for the wide success of compressed sensing is that random matrices enjoy properties such as the null space or restricted isometry property, which guarantee successful recovery, for $m\gtrsim s\log (d/s)$ , where $s$ denotes the sparsity of the signal $x^*$ , i.e., $s=\left \|{x^*}\right \|_0$ . Up to the logarithmic factor in the ambient dimension $d$ , this is optimal, since in theory $m=2s$ measurements are necessary and sufficient to reconstruct every $s$ -sparse vector.
In the numerical experiments following, we resort to the regularised variant of the sparse recovery problem
for a suitable regularisation parameter $\mu \gt 0$ . For $p=1$ we obtain the regularisation of (4.3), whereas for $p\lt 1$ the optimisation (4.4) becomes nonconvex. Our results in Figures 2b and 6 show that CBO with gradient information is capable of solving the convex but also the nonconvex optimisation problem (4.4) with $p=1/2$ with already very few measurements. As parameters of the CBO algorithm, which is obtained as a Euler-Maruyama discretisation of equation (1.1), we choose in both cases the time horizon $T=20$ , time step size $\Delta t=0.01$ , $\alpha =100$ , $\beta =\infty$ , $\theta =0$ , $\kappa =1/\Delta t$ , $\lambda _1=1$ , $\lambda _2=0$ and $\sigma _1=\sigma _2=\sigma _3=0$ . We use either $N=10$ or $N=100$ particles, which is specified in the respective caption. After running the CBO algorithm, a post-processing step is performed, in which the support of the suspected sparse vector is identified by checking which entries are not smaller than $0.01$ before the final sparse solution is then obtained by solving the linear system constrained to this support.
The depicted success probabilities are averaged over $100$ runs of CBO. In Figure 2b, we solve the sparse recovery problem in the convex setting for an $8$ -sparse $200$ -dimensional signal with $p=1$ using CBO without and with gradient information with merely $10$ particles. We observe that gradient information is indispensable to be able to identify the correct sparse solution and standard CBO would fail in such task. In Figure 6, we conduct a slightly lower-dimensional experiment with a $2$ -sparse $50$ -dimensional signal. Here our focus is to enter the nonconvex recovery regime by comparing the convex $\ell _1$ -regularised with the nonconvex $\ell _{1/2}$ -regularised problem. We discover that in either case reconstruction is feasible from already very few measurements. Increasing the number of optimising particles has almost no effect for the convex optimisation problem, in the nonconvex setting recovery benefits from more particles. Furthermore, the nonconvex problem demands a more moderate choice of the strength of the gradient.
5. Conclusions
In this paper, we investigate a variant of consensus-based optimisation (CBO) which incorporates memory effects and gradient information. By developing further and generalising the proof technique devised in [Reference Fornasier, Klock and Riedl28, Reference Fornasier, Klock, Riedl, Jiménez Laredo, Hidalgo and Babaagba29], we establish the global convergence of the underlying dynamics to the global minimiser $x^*$ of the objective function $\mathcal{E}$ in mean-field law. To this end, we analyse the time-evolution of the Wasserstein distance between the law of the mean-field CBO dynamics and a Dirac delta at the minimiser and show its exponential decay in time. Our result holds under minimal assumptions about the initial measure $\rho _0$ and for a vast class of objective functions. The numerical benefit of such additional terms, specifically the employed memory effects and gradient information, is demonstrated at the example of a benchmark function in optimisation as well as at real-world applications such as compressed sensing and the training of neural networks for image classification.
By these means, we demonstrate the versatility, flexibility and customisability of the class of CBO methods, both with respect to potential application areas in practice and modifications to the underlying optimisation principles, while still being amenable to theoretical analysis.
Acknowledgements
I would like to profusely thank Massimo Fornasier for many fruitful and stimulating discussions while I was preparing this manuscript.
Financial support
This work has been funded by the German Federal Ministry of Education and Research and the Bavarian State Ministry for Science and the Arts. I, the author of this work, take full responsibility for its content. Furthermore, I acknowledge the financial support from the Technical University of Munich – Institute for Ethics in Artificial Intelligence. Moreover, I gratefully acknowledge the computer and data resources provided by the Leibniz Supercomputing Centre.
Competing interests
None.