1. Introduction
Generative adversarial networks (GANs), introduced in [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio14], are generative models with two competing neural networks: a generator network G and a discriminator network D. The generator network G attempts to fool the discriminator network by converting random noise into sample data, while the discriminator network D tries to identify whether the input sample is generated or real.
Since their introduction to the machine learning community, GANs have grown rapidly in popularity, with a wide range of applications, including high-resolution image generation [Reference Denton, Chintala, Szlam and Fergus9, Reference Radford, Metz and Chintala29], image inpainting [Reference Yeh, Chen, Yian Lim, Schwing, Hasegawa-Johnson and Do45], image super-resolution [Reference Ledig, Theis, Huszár, Caballero, Cunningham, Acosta, Aitken, Tejani, Totz and Wang23], visual manipulation [Reference Zhu, Krähenbühl, Shechtman and Efros48], text-to-image synthesis [Reference Reed, Akata, Yan, Logeswaran, Schiele and Lee30], video generation [Reference Vondrick, Pirsiavash and Torralba40], semantic segmentation [Reference Luc, Couprie, Chintala and Verbeek26], and abstract reasoning diagram generation [Reference Kulharia, Ghosh, Mukerjee, Namboodiri and Bansal21]. In recent years, GANs have attracted a substantial amount of attention in the financial industry for financial time series generation [Reference Takahashi, Chen and Tanaka-Ishii36, Reference Wiese, Bai, Wood, Morgan and Buehler42, Reference Wiese, Knobloch, Korn and Kretschmer43, Reference Zhang, Zhong, Dong, Wang and Wang46], asset pricing [Reference Chen, Pelger and Zhu5], and market simulation [Reference Coletta, Prata, Conti, Mercanti, Bartolini, Moulin, Vyetrenko and Balch6, Reference Storchan, Vyetrenko and Balch35]. Despite the empirical success of GANs, there are well-recognized issues in the training of GANs, such as the vanishing gradient when the discriminator significantly outperforms the generator [Reference Arjovsky and Bottou1], and mode collapse, where the generator cannot recover a multi-modal distribution but only a subset of the modes; the latter issue is believed to be linked to exploding gradients [Reference Salimans, Goodfellow, Zaremba, Cheung, Radford and Chen32].
In response to these issues, there has been growing research interest in the theoretical understanding of GAN training. In [Reference Berard, Gidel, Almahairi, Vincent and Lacoste-Julien3] the authors proposed a novel visualization method for the GAN training process through the gradient vector field of loss functions. In a deterministic GAN training framework, [Reference Mescheder, Geiger and Nowozin28] demonstrated that regularization improved the convergence performance of GANs. [Reference Conforti, Kazeykina and Ren7] and [Reference Domingo-Enrich, Jelassi, Mensch, Rotskoff and Bruna11] analyzed a generic zero-sum minimax game including GANs, and connected the mixed Nash equilibrium of the game with the invariant measure of Langevin dynamics. In addition, various approaches have been proposed for amelioration of the aforementioned issues in GAN training, including different choices of network architectures, loss functions, and regularization. See, for instance, a comprehensive survey on these techniques in [Reference Wiatrak, Albrecht and Nystrom41] and the references therein.
1.1. Our work
This paper focuses on analyzing the training process of GANs via a stochastic differential equation (SDE) approach. It first establishes SDE approximations for the training of GANs under stochastic gradient algorithms (SGAs), with precise error bound analysis. It then describes the long-run behavior of GAN training via the invariant measures of its SDE approximations under proper conditions. This work builds a theoretical foundation for GAN training and provides analytical tools to study its evolution and stability. In particular:
- The SDE approximations characterize precisely the distinction between GANs with alternating update and GANs with simultaneous update, in terms of the interaction between the generator and the discriminator.
- The drift terms in the SDEs show the direction of the parameter evolution; the diffusion terms prescribe the ratio between the batch size and the learning rate in order to modulate the fluctuations of SGAs in GAN training.
- Regularity conditions for the coefficients of the SDEs provide constraints on the growth of the loss function with respect to the model parameters, necessary for avoiding the explosive gradient encountered in the training of GANs; they also explain mathematically some well-known heuristics in GAN training, and confirm the importance of appropriate choices for network depth and of the introduction of gradient clipping and gradient penalty.
- The dissipative property of the training dynamics under the SDE form ensures the existence of the invariant measures, hence the steady states of GAN training in the long run; it underpins the practical tactic of adding a regularization term to the GAN objective to improve the stability of the training.
- Further analysis of the invariant measure for the coupled SDEs gives rise to a fluctuation–dissipation relation (FDR) for GANs. These FDRs reveal the trade-off of the loss landscape between the generator and the discriminator and can be used to schedule the learning rate.
1.2. Related work
Our analysis on the approximation and the long-run behavior of GAN training is inspired by [Reference Li and Tai24] and [Reference Liu and Theodorou25]. The former established the SDE approximation for the parameter evolution in SGAs applied to pure minimization problems (see also [Reference Hu, Li, Li and Liu18] on a similar topic); the latter surveyed the theoretical analysis of deep learning from two perspectives: propagation of chaos through neural networks and the training process of deep learning algorithms. Among other related works on the theoretical understanding of GANs, [Reference Genevay, Peyré and Cuturi13] reviewed the connection between GANs and the dual formulation of optimal transport problems; [Reference Luise, Pontil and Ciliberto27] studied the interplay between the latent distribution and generated distribution in GANs with optimal transport-based loss functions; [Reference Conforti, Kazeykina and Ren7] and [Reference Domingo-Enrich, Jelassi, Mensch, Rotskoff and Bruna11] focused on the equilibrium of the minimax game and its connection with Langevin dynamics; and [Reference Cao, Guo and Laurière4] studied the connection between GANs and mean-field games. Our focus is the GAN training process: we establish precise error bounds for the SDE approximations, study the long-run behavior of GAN training via the invariant measures of the SDE approximations, and analyze their implications for resolving various challenges in GANs.
1.3. Notation
Throughout this paper, the following notation will be adopted:
- $\mathbb{R}^d$ denotes a d-dimensional Euclidean space, where the dimension d may vary from one context to another.
- The transpose of a vector $x\in\mathbb{R}^d$ is denoted by $x^\top$ and the transpose of a matrix $A\in\mathbb{R}^{d_1\times d_2}$ is denoted by $A^\top$.
- Let $\mathcal{X}$ be an arbitrary nonempty subset of $\mathbb{R}^d$; the set of k times continuously differentiable functions over the domain $\mathcal{X}$ is denoted by $\mathcal{C}^k(\mathcal{X})$ for any nonnegative integer k. In particular, when $k=0$, $\mathcal{C}^0(\mathcal{X})=\mathcal{C}(\mathcal{X})$ denotes the set of continuous functions.
- Let $J=(J_1,\dots,J_d)$ be a d-tuple multi-index of order $|J|=\sum_{i=1}^dJ_i$, where $J_i\geq0$ for all $i=1,\dots,d$; then the operator $\nabla^J$ is $\nabla^J=\big(\partial_1^{J_1},\dots,\partial_d^{J_d}\big)$.
- For $p\geq1$, $\|\cdot\|_p$ denotes the p-norm over $\mathbb{R}^d$, i.e. $\|x\|_p=\big(\sum_{i=1}^d|x_i|^p\big)^{{1}/{p}}$ for any $x\in\mathbb{R}^d$; $L^p_\textrm{loc}(\mathbb{R}^d)$ denotes the set of functions $f\,:\,\mathbb{R}^d\to\mathbb{R}$ such that $\int_\mathcal{X} |f(x)|^p\,{\textrm{d}} x < \infty$ for any compact subset $\mathcal{X}\subset\mathbb{R}^d$.
- Let J be a d-tuple multi-index of order $|J|$. For a function $f\in L^{1}_\textrm{loc}(\mathbb{R}^d)$, its Jth weak derivative $D^Jf\in L^{1}_\textrm{loc}(\mathbb{R}^d)$ is a function such that, for any smooth and compactly supported test function g, $\int_{\mathbb{R}^d}D^Jf(x)g(x)\,{\textrm{d}} x = ({-}1)^{|J|}\int_{\mathbb{R}^d}f(x)\nabla^Jg(x)\,{\textrm{d}} x$. The Sobolev space $W^{k,p}_\textrm{loc}(\mathbb{R}^d)$ is the set of functions f on $\mathbb{R}^d$ such that, for any d-tuple multi-index J with $|J|\leq k$, $D^Jf\in L^p_\textrm{loc}(\mathbb{R}^d)$.
- Fix an arbitrary $\alpha\in\mathbb N^+$. $G^\alpha(\mathbb{R}^d)$ denotes a subspace of $\mathcal{C}^\alpha(\mathbb{R}^d;\mathbb{R})$ where, for any $g\in G^\alpha(\mathbb{R}^d)$ and any multi-index J with $|J|=\sum_{i=1}^dJ_i\leq \alpha$, there exist $k_1,k_2\in\mathbb N$ such that $\nabla^Jg(x)\leq k_1\big(1+\|x\|_2^{2k_2}\big)$ for all $x\in\mathbb{R}^d$. If g is a parametrized function $g_\beta$, then $g_\beta\in G^\alpha(\mathbb{R}^d)$ indicates that the choices of constants $k_1$ and $k_2$ are uniform over all possible values of $\beta$.
- Fix an arbitrary $\alpha\in\mathbb N^+$. $G^\alpha_w(\mathbb{R}^d)$ denotes a subspace of $W^{\alpha,1}_\textrm{loc}(\mathbb{R}^d)$ where, for any $g\in G^\alpha_w(\mathbb{R}^d)$ and any multi-index J with $|J|=\sum_{i=1}^dJ_i\leq \alpha$, there exist $k_1,k_2\in\mathbb N$ such that $D^Jg(x)\leq k_1\big(1+\|x\|_2^{2k_2}\big)$ for almost all $x\in\mathbb{R}^d$. If g is a parametrized function $g_\beta$, then $g_\beta\in G^\alpha_w(\mathbb{R}^d)$ indicates that the choices of constants $k_1$ and $k_2$ are uniform over all possible values of $\beta$.
2. GAN training
In this section we provide the mathematical setup for GAN training.
2.1. GAN training: Minimax versus maximin
GANs fall into the category of generative models to approximate an unknown probability distribution $\mathbb{P}_r$ . GANs are minimax games between two competing neural networks, the generator G and the discriminator D. The neural network for the generator G maps a latent random variable Z with a known distribution $\mathbb{P}_z$ into the sample space to mimic the true distribution $\mathbb{P}_r$ . Meanwhile, the other neural network for the discriminator D will assign a score between 0 and 1 to an input sample, either a generated sample or a true one. A higher score from the discriminator D indicates that the sample is more likely to be from the true distribution.
Formally, let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\geq0}, \mathbb{P})$ be a filtered probability space. Let a measurable space $\mathcal{X}\subset\mathbb{R}^{d_x}$ be the sample space of dimension $d_x\in\mathbb N$. Let an $\mathcal{X}$-valued random variable X denote the random sample, where $X\,:\,\Omega\to\mathcal{X}$ is a measurable function. The unknown probability distribution $\mathbb{P}_r$ is defined as $\mathbb{P}_r=\textrm{Law}(X)$ such that $\mathbb{P}_r(X\in A)=\mathbb{P}(\{\omega\in\Omega\,:\, X(\omega)\in A\})$ for any measurable set $A\subset \mathcal{X}$. Similarly, let a measurable space $\mathcal{Z}\subset \mathbb{R}^{d_z}$ be the latent space of dimension $d_z\in\mathbb N$. Let a $\mathcal{Z}$-valued random variable Z denote the latent variable, where $Z\,:\,\Omega\to\mathcal{Z}$. The prior distribution $\mathbb{P}_z$ is given by $\mathbb{P}_z=\textrm{Law}(Z)$ such that $\mathbb{P}_z(Z\in B)=\mathbb{P}(\{\omega\in\Omega\,:\, Z(\omega)\in B\})$ for any measurable set $B\subset \mathcal{Z}$. Moreover, X and Z are independent, i.e. $\mathbb{P}(\{\omega\,:\, X(\omega)\in A,\,Z(\omega)\in B\})=\mathbb{P}_r(X\in A)\mathbb{P}_z(Z\in B)$ for any measurable sets $A\subset \mathcal{X}$ and $B\subset\mathcal{Z}$.
In the vanilla GAN framework proposed by [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio14], the loss function with respect to G and D is given by $L(G,D)=\mathbb{E}_{X\sim\mathbb{P}_r}\log D(X)+\mathbb{E}_{Z\sim \mathbb{P}_z}[\log(1- D(G(Z)))]$ , and the objective is given by a minimax problem, $\min_G\max_D L(G,D)$ . Under a given G, the concavity of L(G, D) with respect to D follows from the concavity of the functions $\log x$ and $\log(1-x)$ ; under a given D, the convexity of L(G, D) with respect to G follows from the linearity of expectation and the pushforward measure $G\#\mathbb{P}_z=\textrm{Law}(G(Z))$ . Therefore, the training loss in vanilla GANs is indeed convex in G and concave in D. In the practical training stage, both G and D become parametrized neural networks $G_\theta$ and $D_\omega$ , and therefore the working loss function is indeed with respect to the parameter $(\theta,\omega)$ ,
$$\hat{L}(\theta,\omega)=\mathbb{E}_{X\sim\mathbb{P}_r}[\log D_\omega(X)]+\mathbb{E}_{Z\sim\mathbb{P}_z}[\log(1-D_\omega(G_\theta(Z)))].$$
According to the training scheme proposed by [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio14], in each iteration, $\omega$ is updated first, followed by the update of $\theta$. This precisely corresponds to the minimax formulation of the objective, $\min_\theta\max_\omega \hat{L}(\theta,\omega)$. However, in the practical training stage of GANs, the training order of the generator and the discriminator might be interchanged. We should be careful, as the interchange implicitly modifies the objective into a maximin problem, $\max_\omega\min_\theta \hat{L}(\theta,\omega)$, and hence raises the question of whether these two objectives are equivalent. This question is closely related to the notion of Nash equilibrium in a two-player zero-sum game. According to the original GAN framework, the solution should provide an upper value to the corresponding two-player zero-sum game between the generator and the discriminator, i.e. an upper bound for the game value. By Sion’s theorem (see [Reference Sion34, Reference Von Neumann39]), a sufficient condition to guarantee equivalence between the two training orders is that the loss function $\hat{L}$ is convex in $\theta$ and concave in $\omega$. Though we have seen that the loss function L with respect to G and D satisfies this condition, it is not necessarily true for $\hat{L}(\theta,\omega)$. In fact, [Reference Zhu, Jiao and Tse47] points out that these conditions are usually not satisfied with respect to generator and discriminator parameters in common GAN models, and this lack of convexity and/or concavity does create challenges in the training of GANs. Such challenges motivate us to take a closer look at the evolution of parameters in the training of GANs using mathematical tools. In the following analysis, we will strictly follow the minimax formulation and its corresponding training order.
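The gap between the minimax and maximin values can be seen on a toy example. The sketch below is illustrative only: the payoff `(x - y) ** 2` and the grid of strategies are hypothetical choices, not taken from any GAN model. The payoff is convex in `x` but not concave in `y`, so the sufficient condition above fails and the two orders of optimization give different values.

```python
# Toy payoff that is convex in x but NOT concave in y, so the convex-concave
# condition fails on the square [-1, 1] x [-1, 1]. (Hypothetical example.)
def payoff(x, y):
    return (x - y) ** 2

# Hypothetical grid of candidate strategies shared by both players.
grid = [i / 10 for i in range(-10, 11)]

# Upper value (minimax): the minimizer commits first, the maximizer best-responds.
upper_value = min(max(payoff(x, y) for y in grid) for x in grid)

# Lower value (maximin): the maximizer commits first, the minimizer best-responds.
lower_value = max(min(payoff(x, y) for x in grid) for y in grid)

print(upper_value, lower_value)  # prints "1.0 0.0"
```

The lower value never exceeds the upper value, and the two coincide only under extra structure such as convexity in the minimizing variable and concavity in the maximizing one.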
2.2. SGA for GAN training
Typically, GANs are trained through a stochastic gradient algorithm (SGA). An SGA is applied to a class of optimization problems whose loss function $\Phi(\gamma)$ with respect to the model parameter vector $\gamma$ can be written as $\Phi(\gamma)=\mathbb{E}_{{\mathcal{I}}}[\Phi_{{\mathcal{I}}}(\gamma)]$ , where a random variable ${\mathcal{I}}$ takes values in the index set $\mathbb I$ of the data points and, for any $i\in\mathbb I$ , $\Phi_i(\gamma)$ denotes the loss evaluated at the data point with index i.
Suppose the objective is to minimize $\Phi(\gamma)$ over $\gamma$. Applying gradient descent with learning rate $\eta>0$, at an iteration k, $k=0,1,2,\dots$, the parameter vector is updated by $\gamma_{k+1}=\gamma_k-\eta\nabla\Phi(\gamma_k)$. By the linearity of differentiation and expectation, this update can be written as $\gamma_{k+1}=\gamma_k-\eta\mathbb{E}_{\mathcal{I}}[\nabla\Phi_{\mathcal{I}}(\gamma_k)]$. Under suitable conditions, $\mathbb{E}_{\mathcal{I}}[\nabla\Phi_{\mathcal{I}}(\gamma_k)]$ can be estimated by the sample mean
$$\frac{1}{B}\sum_{l=1}^B\nabla\Phi_{I_l}(\gamma_k),$$
where $\mathcal{B}=\{I_1,\dots,I_B\}$ is a collection of indices with $I_k\overset{\textrm{i.i.d.}}{\sim}{\mathcal{I}}$ , called a minibatch, and $B\ll|\mathbb I|$ .
Under an SGA, the uncertainty in sampling $\mathcal{B}$ propagates through the training process, making it a stochastic process rather than a deterministic one. This stochasticity motivates us to study a continuous-time approximation for GAN training in the form of SDEs, as will be seen in (SML-SDE) and (ALT-SDE). (See also the connection between stochastic gradient descent and Markov chains in [Reference Dieuleveut, Durmus and Bach10]).
Consider GAN training performed on a data set $\mathcal{D}=\{(z_i,x_j)\}_{1\leq i\leq N,\,1\leq j\leq M}$ , where $\{z_i\}_{i=1}^N$ are sampled from $\mathbb{P}_z$ and $\{x_j\}_{j=1}^M$ are real image data following the unknown distribution $\mathbb{P}_r$ . Let $G_{\theta}\,:\,\mathcal{Z}\to\mathcal{X}$ denote the generator parametrized by the neural network with parameter $\theta\in\mathbb{R}^{d_\theta}$ of dimension $d_\theta\in\mathbb N$ , and let $D_{\omega}\,:\,\mathcal{X}\to\mathbb{R}^+$ denote the discriminator parametrized by the other neural network with parameter $\omega\in\mathbb{R}^{d_\omega}$ of dimension $d_\omega\in\mathbb N$ , where $\mathbb{R}^+$ denotes the set of nonnegative real numbers. Then the objective of the GAN is to solve the minimax problem
$$\min_\theta\max_\omega\Phi(\theta,\omega) \tag{1}$$
for some cost function $\Phi$ , with $\Phi$ of the form
$$\Phi(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M J(D_\omega(x_j),D_\omega(G_\theta(z_i))).$$
For instance, $\Phi$ in the vanilla GAN model [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio14] is given by
$$\Phi(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M\big[\log D_\omega(x_j)+\log(1-D_\omega(G_\theta(z_i)))\big],$$
while $\Phi$ in a Wasserstein GAN [Reference Arjovsky, Chintala and Bottou2] takes the form
$$\Phi(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M\big[D_\omega(x_j)-D_\omega(G_\theta(z_i))\big].$$
Here, the full gradients of $\Phi$ with respect to $\theta$ and $\omega$ are estimated over a minibatch $\mathcal{B}$ of batch size B. One way of sampling $\mathcal{B}$ is to choose B samples out of a total of $N\cdot M$ samples without replacement; another is to take B independent and identically distributed (i.i.d.) samples. The analyses for both cases are similar; here we adopt the second sampling scheme.
More precisely, let $\mathcal{B}=\{(z_{I_k},x_{J_k})\}_{k=1}^B$ be i.i.d. samples from $\mathcal{D}$ . Let $g_{\theta}$ and $g_{\omega}$ be the full gradients of $\Phi$ with respect to $\theta$ and $\omega$ such that
$$g_\theta(\theta,\omega)=\nabla_\theta\Phi(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M g_\theta^{i,j}(\theta,\omega),\qquad g_\omega(\theta,\omega)=\nabla_\omega\Phi(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M g_\omega^{i,j}(\theta,\omega). \tag{2}$$
Here, $g_{\theta}^{i,j}$ and $g_{\omega}^{i,j}$ denote $\nabla_\theta J(D_\omega(x_j),D_\omega(G_\theta(z_i)))$ and $\nabla_\omega J(D_\omega(x_j),D_\omega(G_\theta(z_i)))$ , respectively, with differential operators defined as $\nabla_\theta\,:\!=\,\begin{pmatrix}\partial_{\theta_1}&\cdots& \partial_{\theta_{d_\theta}}\end{pmatrix}^\top$ and $\nabla_\omega\,:\!=\,\begin{pmatrix}\partial_{\omega_1}& \cdots&\partial_{\omega_{d_\omega}}\end{pmatrix}^\top$ . Then, the estimated gradients for $g_{\theta}$ and $g_{\omega}$ corresponding to the minibatch $\mathcal{B}$ are
$$g_\theta^{\mathcal{B}}(\theta,\omega)=\frac{1}{B}\sum_{k=1}^B g_\theta^{I_k,J_k}(\theta,\omega),\qquad g_\omega^{\mathcal{B}}(\theta,\omega)=\frac{1}{B}\sum_{k=1}^B g_\omega^{I_k,J_k}(\theta,\omega).$$
Moreover, let $\eta^\theta_t>0$ and $\eta^\omega_t>0$ be the learning rates at iteration $t=0,1,2,\dots$ for $\theta$ and $\omega$ respectively; then, solving the minimax problem (1) with SGA under alternating parameter update implies descent of $\theta$ along $g_{\theta}$ and ascent of $\omega$ along $g_{\omega}$ at each iteration, i.e.,
$$\omega_{t+1}=\omega_t+\eta^\omega_t\,g_\omega^{\mathcal{B}}(\theta_t,\omega_t),\qquad \theta_{t+1}=\theta_t-\eta^\theta_t\,g_\theta^{\mathcal{B}}(\theta_t,\omega_{t+1}).$$
Furthermore, within each iteration, the minibatch gradients for $\theta$ and $\omega$ are calculated on different batches. In order to emphasize this difference, we use $\bar{\mathcal{B}}$ to represent the minibatch for $\theta$ and $\mathcal{B}$ for that of $\omega$ , with $\bar{\mathcal{B}}\overset{\textrm{i.i.d.}}{\sim}\mathcal{B}$ . That is,
$$\omega_{t+1}=\omega_t+\eta^\omega_t\,g_\omega^{\mathcal{B}}(\theta_t,\omega_t),\qquad \theta_{t+1}=\theta_t-\eta^\theta_t\,g_\theta^{\bar{\mathcal{B}}}(\theta_t,\omega_{t+1}). \tag{ALT}$$
Some practical training of GANs uses simultaneous parameter update between the discriminator and the generator, corresponding to the similar yet subtly different form
$$\omega_{t+1}=\omega_t+\eta^\omega_t\,g_\omega^{\mathcal{B}}(\theta_t,\omega_t),\qquad \theta_{t+1}=\theta_t-\eta^\theta_t\,g_\theta^{\mathcal{B}}(\theta_t,\omega_t). \tag{SML}$$
For ease of exposition, we will assume a constant learning rate $\eta^\theta_t=\eta^\omega_t=\eta$ throughout the paper, with $\eta$ viewed as the time interval between two consecutive parameter updates.
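The two update schemes described above can be sketched on a toy model. Everything specific here is an assumption made for illustration: a scalar generator $G_\theta(z)=\theta+z$, a scalar linear discriminator $D_\omega(x)=\omega x$, a Wasserstein-type per-sample loss $J=D_\omega(x)-D_\omega(G_\theta(z))$, and the function names `alt_step` and `sml_step`. The sketch also assumes that the simultaneous scheme evaluates both gradients at the current iterate on a common minibatch, whereas the alternating scheme evaluates the generator gradient at the freshly updated discriminator parameter on an independent minibatch.

```python
import random

# Per-sample gradients for the toy loss J = omega * (x - (theta + z)).
def g_theta(theta, omega, z, x):
    return -omega                      # dJ/dtheta

def g_omega(theta, omega, z, x):
    return x - (theta + z)             # dJ/domega

def minibatch_mean(grad, theta, omega, batch):
    """Average a per-sample gradient over a minibatch of (z, x) pairs."""
    return sum(grad(theta, omega, z, x) for z, x in batch) / len(batch)

def alt_step(theta, omega, eta, batch_omega, batch_theta):
    """Alternating update: ascend in omega first, then descend in theta
    using the NEW omega and an independent minibatch."""
    omega_next = omega + eta * minibatch_mean(g_omega, theta, omega, batch_omega)
    theta_next = theta - eta * minibatch_mean(g_theta, theta, omega_next, batch_theta)
    return theta_next, omega_next

def sml_step(theta, omega, eta, batch):
    """Simultaneous update: both gradients are evaluated at (theta, omega)."""
    omega_next = omega + eta * minibatch_mean(g_omega, theta, omega, batch)
    theta_next = theta - eta * minibatch_mean(g_theta, theta, omega, batch)
    return theta_next, omega_next

# Usage: a few alternating steps on synthetic (z, x) pairs.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(2.0, 1.0)) for _ in range(100)]
theta, omega = 0.0, 0.0
for _ in range(10):
    theta, omega = alt_step(theta, omega, 0.05,
                            rng.choices(data, k=8), rng.choices(data, k=8))
```

On a single step from the same starting point, the two schemes agree on $\omega_{t+1}$ and differ only in the point at which the generator gradient is evaluated.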
3. Approximation and error bound analysis of GAN training
The randomness in sampling $\mathcal{B}$ (and $\bar{\mathcal{B}}$ ) brings stochasticity to the GAN training process prescribed by (ALT) and (SML). In this section, we establish their continuous-time approximations and error bounds, where the approximations are in the form of coupled SDEs.
3.1. Approximation
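To get an intuition of how the exact expression of the SDEs emerges, let us start with some basic properties embedded in the training process. First, let $I\,:\,\Omega\to\{1,\dots,N\}$ and $J\,:\,\Omega\to\{1,\dots,M\}$ denote random indices that are independent and uniformly distributed; then, according to the definitions of $g_{\theta}$ and $g_{\omega}$ in (2), we have $\mathbb{E}\big[g_{\theta}^{I,J}(\theta,\omega)\big]=g_{\theta}(\theta,\omega)$ and $\mathbb{E}\big[g_{\omega}^{I,J}(\theta,\omega)\big]=g_{\omega}(\theta,\omega)$. Denote the corresponding covariance matrices as
$$\Sigma_\theta(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M\big[g_\theta^{i,j}(\theta,\omega)-g_\theta(\theta,\omega)\big]\big[g_\theta^{i,j}(\theta,\omega)-g_\theta(\theta,\omega)\big]^\top,\qquad \Sigma_\omega(\theta,\omega)=\frac{1}{N\cdot M}\sum_{i=1}^N\sum_{j=1}^M\big[g_\omega^{i,j}(\theta,\omega)-g_\omega(\theta,\omega)\big]\big[g_\omega^{i,j}(\theta,\omega)-g_\omega(\theta,\omega)\big]^\top;$$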
since the $(I_k,J_k)$ in $\mathcal{B}$ are i.i.d. copies of (I, J); then,
$$\mathbb{E}\big[g_\theta^{\mathcal{B}}(\theta,\omega)\big]=g_\theta(\theta,\omega),\quad \mathbb{E}\big[g_\omega^{\mathcal{B}}(\theta,\omega)\big]=g_\omega(\theta,\omega),\quad \textrm{Var}\big(g_\theta^{\mathcal{B}}(\theta,\omega)\big)=\frac{1}{B}\Sigma_\theta(\theta,\omega),\quad \textrm{Var}\big(g_\omega^{\mathcal{B}}(\theta,\omega)\big)=\frac{1}{B}\Sigma_\omega(\theta,\omega).$$
As the batch size B gets sufficiently large, the classical central limit theorem leads to the following approximation of (ALT):
with independent Gaussian random variables $Z^1_t\sim N\big(0,1\cdot I_{d_\omega}\big)$ and $Z^2_t\sim N\big(0,1\cdot I_{d_\theta}\big)$ , $t=0,1,2,\dots$ Here, the scalar 1 specifies the time increment $1=\Delta t=(t+1)-t$ .
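The two facts feeding the central limit argument, namely that the minibatch gradient is unbiased and that its variance shrinks like $1/B$, can be checked exactly on a small example. The sketch below enumerates every possible minibatch under i.i.d. uniform sampling; the per-sample gradient values and the helper names are hypothetical and chosen only for illustration, with a scalar parameter so that the covariance matrix reduces to a variance.

```python
from itertools import product

# Hypothetical per-sample gradients g^{i,j} for a scalar parameter (N = 2, M = 3).
per_sample = {(i, j): float(i * 3 + j) ** 2 for i in range(2) for j in range(3)}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def minibatch_moments(per_sample, batch_size):
    """Exact mean and variance of the minibatch average when the B index
    pairs are drawn i.i.d. uniformly (all ordered B-tuples equally likely)."""
    indices = list(per_sample)
    averages = [mean(per_sample[k] for k in batch)
                for batch in product(indices, repeat=batch_size)]
    m = mean(averages)
    v = mean((a - m) ** 2 for a in averages)
    return m, v

# Full gradient and single-sample variance (the scalar analogue of Sigma).
full_gradient = mean(per_sample.values())
sigma = mean((g - full_gradient) ** 2 for g in per_sample.values())

for B in (1, 2, 3):
    m, v = minibatch_moments(per_sample, B)
    print(B, m, v, sigma / B)   # m matches full_gradient, v matches sigma / B
```

The enumeration is exact but exponential in the batch size, so it is only meant as a sanity check on tiny problems.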
Write $t+1=t+\Delta t$ . On one hand, assuming the continuity of the process $\{\omega_t\}_t$ with respect to time t and sending $\Delta t$ to 0, one intuitive approximation can be easily derived in the following form:
with $\beta={2B}/{\eta}$ and $\{W_t\}_{t\geq0}$ being a standard $(d_\theta+d_\omega)$-dimensional Brownian motion supported by the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\geq0},\mathbb{P})$. Let $\{\mathcal{F}^W_t\}_{t\geq0}$ denote the natural filtration generated by $\{W_t\}_{t\geq0}$. As a continuous-time approximation for GAN training, SDEs in this rather intuitive form are adopted without justification in some earlier works such as [Reference Conforti, Kazeykina and Ren7] and [Reference Domingo-Enrich, Jelassi, Mensch, Rotskoff and Bruna11]. Later we will show that (3) is in fact an approximation for GAN training under the simultaneous update scheme (SML).
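The choice $\beta=2B/\eta$ can be motivated by a heuristic scale-matching computation. The sketch below assumes, as is common for Langevin-type dynamics, that the diffusion coefficient has the form $c\,\Sigma^{1/2}$ with $c=\sqrt{2\beta^{-1}}$, and that one training iteration corresponds to a time increment of length $\eta$; the constant $c$ and the Gaussian vector $Z$ are notation introduced here for illustration.

```latex
\begin{align*}
% One discrete step with central-limit noise (Z standard Gaussian):
\omega_{t+1}-\omega_t &\approx \eta\, g_\omega
   + \frac{\eta}{\sqrt{B}}\,\Sigma_\omega^{1/2} Z, \\
% One increment of length \eta of a diffusion with coefficient c\,\Sigma_\omega^{1/2}:
\mathcal{W}_{s+\eta}-\mathcal{W}_s &\approx \eta\, g_\omega
   + c\,\sqrt{\eta}\,\Sigma_\omega^{1/2} Z, \\
% Matching the two noise amplitudes:
c\,\sqrt{\eta} = \frac{\eta}{\sqrt{B}}
   \;&\Longrightarrow\;
   c = \sqrt{\frac{\eta}{B}} = \sqrt{\frac{2}{\beta}},
   \qquad \beta = \frac{2B}{\eta}.
\end{align*}
```

Under these assumptions, a larger batch size or a smaller learning rate increases $\beta$ and therefore damps the fluctuations.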
On the other hand, the game nature in GANs is demonstrated through the interactions between the generator and the discriminator during the training process; more specifically, the appearance of $\omega_{t+1}$ at the update of $\theta$ as in (ALT). However, the widely adopted coupled processes (3) do not capture such interactions. One possible approximation for the GAN training process of (ALT) would be
Equations (3) and (4) can be written in the more compact forms
where the drift $b(\theta,\omega)=b_0(\theta,\omega)+\eta b_1(\theta,\omega)$ , with
and the volatility $\sigma(\theta,\omega)$ is given by
The drift terms in the SDEs, i.e. $b_0$ in (SML-SDE) and b in (ALT-SDE), show the direction of the parameters’ evolution; the diffusion terms $\sigma$ represent the fluctuations of the learning curves for these parameters. Moreover, the form of the SDEs prescribes $\beta$ , the ratio between the batch size and the learning rate, in order to modulate the fluctuations of SGAs in GAN training. Even though both (SML-SDE) and (ALT-SDE) are adapted to $\{\mathcal{F}^W_t\}_{t\geq0}$ , the term
in (ALT-SDE) highlights the interaction between the generator and the discriminator in the GAN training process; see Remark 1.
3.2. Error bound for the SDE approximation
We will show that these coupled SDEs are indeed the continuous-time approximations of GAN training processes, with the following error bound analysis. Here the approximations are under the notion of weak approximation as in [Reference Li and Tai24]. More precisely, Theorems 1 and 2 provide conditions under which the evolution of parameters in GANs is within a reasonable distance from the SDE approximation.
Theorem 1. Fix an arbitrary time horizon $\mathcal{T}>0$ , and take the learning rate $\eta\in(0,1\wedge \mathcal{T})$ and the number of iterations $\bar{N}=\lfloor{\mathcal{T}}/{\eta}\rfloor$ . Suppose that
- 1(a) $g_{\omega}^{i,j}$ is twice continuously differentiable, and $g_{\theta}^{i,j}$ and $g_{\omega}^{i,j}$ are Lipschitz, for any $i=1,\dots,N$ and $j=1,\dots,M$;
- 1(b) $\Phi$ is of $\mathcal{C}^3\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ and $\Phi\in G^{4}_{w}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$;
- 1(c) $(\nabla_\theta g_{\theta})g_{\theta}$, $(\nabla_\omega g_{\theta})g_{\omega}$, $(\nabla_\theta g_{\omega})g_{\theta}$, and $(\nabla_\omega g_{\omega})g_{\omega}$ are all Lipschitz.
Then, $(\Theta_{t\eta},\mathcal{W}_{t\eta})$ as in (ALT-SDE) is a weak approximation of $(\theta_t,\omega_t)$ as in (ALT) of order 2, i.e. given any initialization $\theta_0=\theta$ and $\omega_0=\omega$ , for any test function $f\in G^3\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ , we have the estimate
$$\max_{t=1,\dots,\bar{N}}\big|\mathbb{E}f(\theta_t,\omega_t)-\mathbb{E}f(\Theta_{t\eta},\mathcal{W}_{t\eta})\big|\leq C\eta^2$$
for some constant $C\geq0$ ; this constant C is independent of the learning rate $\eta$ but is dependent on the time horizon $\mathcal{T}$ .
Theorem 2. Fix an arbitrary time horizon $\mathcal{T}>0$ , and take the learning rate $\eta\in(0,1\wedge \mathcal{T})$ and the number of iterations $\bar{N}=\lfloor{\mathcal{T}}/{\eta}\rfloor$ . Suppose
- 2(a) $\Phi(\theta,\omega)$ is continuously differentiable and $\Phi\in G^{3,1}_{W}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$;
- 2(b) $g_{\theta}^{i,j}$ and $g_{\omega}^{i,j}$ are Lipschitz for any $i=1,\dots, N$ and $j=1,\dots,M$.
Then, $(\Theta_{t\eta},\mathcal{W}_{t\eta})$ as in (SML-SDE) is a weak approximation of $(\theta_t,\omega_t)$ as in (SML) of order 1, i.e. given any initialization $\theta_0=\theta$ and $\omega_0=\omega$ , for any test function $f\in G^2\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ , we have the estimate
$$\max_{t=1,\dots,\bar{N}}\big|\mathbb{E}f(\theta_t,\omega_t)-\mathbb{E}f(\Theta_{t\eta},\mathcal{W}_{t\eta})\big|\leq C\eta$$
for some constant $C\geq0$ ; this constant C is independent of the learning rate $\eta$ but is dependent on the time horizon $\mathcal{T}$ .
Theorems 1 and 2 provide SDE approximations for GAN training in practice when we have finite training samples and training iterations, i.e. N, M, and $\bar{N}=\lfloor{\mathcal{T}}/{\eta}\rfloor$ being finite and fixed; they also provide error bounds for such approximations, in particular:
where $C_1(\mathcal{T})$ is a coefficient depending on the time horizon $\mathcal{T}$ , and $\rho_1(\eta)$ is an appropriate error term such that either $\rho_1(\eta)=\eta^2$ or $\rho_1(\eta)=\eta$ . These SDE approximations will enable us to analyze the long-run behavior of GAN training in Section 4 through studying the invariant measures of SDEs and then control the difference between the training outcome and some equilibrium of the minimax game of GANs.
Remark 1. Modifying the intuitive SDE approximation (SML-SDE) into
and applying similar techniques to the proof of Theorem 1, we can get an $O\big(\eta^2\big)$ approximation for (SML). However, comparing (10) and (ALT-SDE), the term
still stands out, which is due to the interactions between the generator and discriminator during training. It implies that the ‘game effect’ between the generator and the discriminator has an impact on the evolution trajectories of the model parameters.
3.3. Proof of Theorem 1
In this section we provide a detailed proof of Theorem 1; the proof of Theorem 2 is analogous and thus omitted. We adapt the approach of [Reference Li and Tai24] to our analysis of GAN training.
3.3.1. Preliminary analysis
One-step difference. Recall that under the alternating update scheme and constant learning rate $\eta$, the GAN training is given by (ALT).
Let $(\theta,\omega)$ denote the initial value for $(\theta_0,\omega_0)$ , and
be the one-step difference. Let $\Delta^{i,j}$ denote the tuple consisting of the ith and jth components of the one-step difference of $\theta$ and $\omega$ , with $i=1,\dots,d_{\theta}$ and $j=1,\dots,d_{\omega}$ .
Lemma 1. Assume that $g_{\theta}^{i,j}$ is twice continuously differentiable for any $i=1,\dots, N$ and $j=1,\dots, M$ .
The first moment is given by
The second moment is given by
where $\Sigma_\theta(\theta,\omega)_{i,k}$ and $\Sigma_\omega(\theta,\omega)_{j,l}$ denote the elements at positions (i,k) and (j,l) of the matrices $\Sigma_\theta(\theta,\omega)$ and $\Sigma_\omega(\theta,\omega)$ , respectively.
The third moments are all of order $O\big(\eta^3\big)$ .
Proof. By a second-order Taylor expansion, we have
Then,
and higher-order polynomials are of order $O\big(\eta^3\big)$ . Notice that $\bar{\mathcal{B}}\perp\mathcal{B}$ and recall the definition of $\Sigma_\theta$ and $\Sigma_\omega$ . The conclusion follows.
Now, for (ALT-SDE) with the same initialization as (11), define the corresponding one-step difference:
Let $\tilde\Delta_k$ be the kth component of $\tilde\Delta$ , $k=1,\dots, d_\theta+d_\omega$ , and $\tilde\Delta^{i,j}$ be the tuple consisting of the ith and jth components of the one-step difference of $\Theta$ and $\mathcal{W}$ , with $i=1,\dots,d_{\theta}$ and $j=1,\dots,d_{\omega}$ .
Lemma 2. Suppose $b_0$ , $b_1$ , and $\sigma$ given by (5), (6), and (7) are from $\mathcal{C}^3\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ such that, for any multi-index J of order $|J|\leq 3$ , there exist $k_1, k_2\in\mathbb N$ satisfying
and they are all Lipschitz.
The first moment is given by
The second moment is given by
The third moments are all of order $O\big(\eta^3\big)$ .
Proof. Let $\psi\,:\,\mathbb{R}^{d_\theta+d_\omega}\to\mathbb{R}$ be any smooth test function. Under the dynamic (ALT-SDE), define the following operators:
Applying Itô’s formula to $\psi\big(\Theta_t,\mathcal{W}_t\big)$ , $\mathcal{L}_i\psi\big(\Theta_t,\mathcal{W}_t\big)$ for $i=1,2,3$ , and $\mathcal{L}_1^2\psi\big(\Theta_t,\mathcal{W}_t\big)$ , we have
where $M_\eta$ denotes the remaining martingale term with mean zero. Given the regularity conditions of $b_0$ , $b_1$ , and $\sigma$ , [Reference Krylov20, Theorem 9, Section 2.5] implies that (12) is of order $O\big(\eta^3\big)$ . Therefore,
Take $\psi(\Theta_\eta,\mathcal{W}_\eta)$ as $\tilde\Delta_i$ , $\tilde\Delta_i\tilde\Delta_j$ , and $\tilde\Delta_i\tilde\Delta_j\tilde\Delta_k$ for arbitrary indices $i,j,k=1,\dots,d_\theta+d_\omega$ , and the conclusion follows.
Estimate of moments. Next, we bound the moments of GAN parameters under (ALT).
Lemma 3. Fix an arbitrary time horizon $\mathcal{T}>0$ and take the learning rate $\eta\in(0,1\wedge \mathcal{T})$ and the number of iterations $\bar{N}=\lfloor{\mathcal{T}}/{\eta}\rfloor$ . Suppose that $g_{\theta}^{i,j}$ and $g_{\omega}^{i,j}$ are all Lipschitz, i.e. there exists $L>0$ such that
Then, for any $m\in\mathbb N$ ,
is uniformly bounded, independent of $\eta$ .
Proof. Throughout the proof, the positive constants C and $C^{\prime}$ may vary from line to line. The Lipschitz assumption implies that
For any $k=1,\dots,m$ ,
and
For any $t=0,\dots, \bar{N}-1$ ,
Write
Then $a^m_{t+1}\leq (1+C\eta)a^m_t+C^{\prime}\eta$ , which leads to
The conclusion follows.
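For completeness, the last step can be spelled out with the standard discrete Grönwall argument. The sketch below is a generic derivation from the recursion $a^m_{t+1}\leq(1+C\eta)a^m_t+C^{\prime}\eta$ with $C>0$, not a reproduction of the authors' own display; it uses only $1+x\leq e^x$ and $t\eta\leq\bar{N}\eta\leq\mathcal{T}$.

```latex
\begin{align*}
a^m_t
 &\leq (1+C\eta)^t a^m_0 + C'\eta\sum_{s=0}^{t-1}(1+C\eta)^s \\
 &= (1+C\eta)^t a^m_0 + \frac{C'}{C}\big[(1+C\eta)^t-1\big] \\
 &\leq e^{C\eta t} a^m_0 + \frac{C'}{C}\big(e^{C\eta t}-1\big) \\
 &\leq e^{C\mathcal{T}} a^m_0 + \frac{C'}{C}\big(e^{C\mathcal{T}}-1\big),
 \qquad t=0,1,\dots,\bar{N}.
\end{align*}
```

The right-hand side depends on the time horizon $\mathcal{T}$ but not on $\eta$, which is the uniformity claimed in the lemma.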
Mollification. Notice that in Theorem 1 (and Theorem 2) the condition about the differentiability of the loss function $\Phi$ is in the weak sense. For ease of analysis, we adopt the following mollification, given in [Reference Evans12].
Definition 1. (Mollifier.) Define the function $\nu\,:\,\mathbb{R}^{d_\theta+d_\omega}\to\mathbb{R}$ ,
such that $\int_{\mathbb{R}^{d_\theta+d_\omega}}\nu(u)\,{\textrm{d}} u = 1$ . For any $\varepsilon>0$ , define
Note that the mollifier $\nu\in\mathcal{C}^{\infty}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ and, for any $\varepsilon>0$ , $\textrm{supp}(\nu^{\varepsilon})=B_\varepsilon(0)$ , where $B_\varepsilon(0)$ denotes the $\varepsilon$ ball around the origin in the Euclidean space $\mathbb{R}^{d_\theta+d_\omega}$ .
Definition 2. (Mollification.) Let $f\in\mathcal{L}_\textrm{loc}^1\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ be any locally integrable function. For any $\varepsilon>0$ , define $f^\varepsilon=\nu^\varepsilon*f$ such that
By a simple change of variables and integration by parts, we can derive that, for any multi-index J, $\nabla^J f^\varepsilon=\nu^\varepsilon*[D^Jf]$ . Here we quote some well-known results about this mollification from [Reference Evans12, Theorem 7, Appendix C.4].
Lemma 4.
(i) $f^\varepsilon\in\mathcal{C}^{\infty}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ .
(ii) $f^\varepsilon\longrightarrow f$ almost everywhere as $\varepsilon\longrightarrow0$ .
(iii) If $f\in\mathcal{C}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ , then $f^\varepsilon\longrightarrow f$ uniformly on compact subsets of $\mathbb{R}^{d_\theta+d_\omega}$ .
(iv) If $f\in\mathcal{L}_\textrm{loc}^p\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ for some $1\leq p<\infty$ , then $f^\varepsilon\longrightarrow f$ in $\mathcal{L}_\textrm{loc}^p\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ .
To give a convergence rate for the pointwise convergence in Lemma 4, we have the following lemma.
Lemma 5. Assume $f\in W^{1,1}_\textrm{loc}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ and there exist $k_1,k_2$ such that $|Df(u)|\leq k_1\Big(1+\|u\|_2^{2k_2}\Big)$ ; then, for any $u\in\mathbb{R}^{d_\theta+d_\omega}$ , there exists $\rho\,:\,\mathbb{R}^+\to\mathbb{R}$ such that $\lim_{\varepsilon\to0}\rho(\varepsilon)=0$ and $|f^\varepsilon(u)-f(u)|\leq \rho(\varepsilon)$ .
Proof.
Since there exist $k_1,k_2$ such that $|Df(u)|\leq k_1\Big(1+\|u\|_2^{2k_2}\Big)$ ,
Let $\rho(\varepsilon)=\varepsilon\Big[k_1\Big(1+\|u\|_2^{2k_2}\Big)\Big]+({k_1}/({2k_2+1}))\varepsilon^{2k_2+1}$ . Then $\rho(\varepsilon)\longrightarrow0$ as $\varepsilon\longrightarrow0$ .
It is also straightforward to see that the mollification preserves Lipschitz conditions.
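To make the convergence rate concrete, the following is a one-dimensional sketch. It assumes the standard bump function $\exp\{1/(|u|^2-1)\}$ on the unit ball as the mollifier (the usual choice in [Reference Evans12]), and uses $f(x)=|x|$ as a hypothetical test function, which is Lipschitz with constant 1 but not differentiable at the origin. The integral is replaced by a normalized sum on a grid, so the numbers are approximations only.

```python
import numpy as np


def mollifier_weights(eps, n_grid=2001):
    """Grid points and normalized weights of the bump nu^eps on (-eps, eps)."""
    y = np.linspace(-eps, eps, n_grid)[1:-1]  # drop endpoints, where the bump vanishes
    bump = np.exp(1.0 / ((y / eps) ** 2 - 1.0))
    return y, bump / bump.sum()  # discrete normalization: weights sum to 1


def mollify(f, x, eps):
    """Approximate f^eps(x) = (nu^eps * f)(x) by a weighted sum over the grid."""
    y, w = mollifier_weights(eps)
    return np.array([np.sum(w * f(xi - y)) for xi in np.atleast_1d(x)])


f = np.abs
x = np.linspace(-1.0, 1.0, 201)
# Sup-norm error on [-1, 1] for decreasing mollification radii.
errors = {eps: float(np.max(np.abs(mollify(f, x, eps) - f(x))))
          for eps in (0.5, 0.1, 0.02)}
```

For this 1-Lipschitz choice of $f$, the error is at most $\varepsilon$ at every point, since $|f(x-y)-f(x)|\leq|y|\leq\varepsilon$ on the support of $\nu^\varepsilon$; the same bound shows that the mollified function is again 1-Lipschitz.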
Consider the following SDE under component-wise mollification of coefficients:
Lemma 6. Assume $b_0$ , $b_1$ , and $\sigma$ are all Lipschitz. Then
where
Proof. With Lemma 5, the conclusion follows from [Reference Krylov20, Theorem 9, Section 2.5].
3.3.2. Remaining proof
Given the conditions of Theorem 1 and the fact that mollification preserves Lipschitz conditions, $b_0^\varepsilon$ , $b_1^\varepsilon$ , and $\sigma^\varepsilon$ inherit regularity conditions from Theorem 1. Therefore, the conclusion from Lemma holds. Lemma 2 holds. Lemmas 1, 2, 3, and 5 verify the condition in [Reference Li and Tai24, Theorem 3]. Therefore, for any test function $f\in\mathcal{C}^3\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ such that, for any multi-index J with $|J|\leq 3$ , there exist $k_1, k_2\in\mathbb N$ satisfying
we have the weak approximation given by (8), where $(\theta_t,\omega_t)$ and $(\Theta_{t\eta},\mathcal{W}_{t\eta})$ are given by (ALT) and (SDE-MLF), respectively, and $\rho$ is given as in Lemma 5.
Finally, taking $\varepsilon$ to 0, Lemma 6 and the explicit form of $\rho$ lead to the conclusion.
4. The long-run behavior of GAN training via SDE invariant measures
In this section we study the long-run behavior of GAN training and discuss some of the implications of the technical assumptions as well as the steady state.
4.1. Long-run behavior of GAN training
In addition to the evolution of parameters in GANs, the long-run behavior of GAN training can be estimated from the SDEs (ALT-SDE) and (SML-SDE). This limiting behavior is characterized by their invariant measures. Recall the following definition of invariant measures in [Reference Da Prato8].
Definition 3. A probability measure $\mu^*\in\mathcal{P}\big(\mathbb{R}^{d_\theta+d_\omega}\big)$ is called an invariant measure for a stochastic process $\big\{\big(\Theta_t\ \, \mathcal{W}_t\big)^\top\big\}_{t\geq0}$ if, for any measurable bounded function f and $t\geq0$ ,
Remark 2. Intuitively, an invariant measure $\mu^*$ in the context of GAN training describes the joint probability distribution of the generator and discriminator parameters $(\Theta^*, \mathcal{W}^*)$ in equilibrium. For instance, if the training process converges to the unique minimax point $(\theta^*,\omega^*)$ for $\min_{\theta}\max_{\omega}\Phi(\theta,\omega)$ , the invariant measure is the Dirac mass at $(\theta^*,\omega^*)$ .
Moreover, the invariant measure $\mu^*$ and the marginal distribution of $\Theta^*$ characterize the generated distribution $\textrm{Law}(G_{\Theta^*}(Z))$ , necessary for producing synthesized data and for evaluating the performance of the GAN model through metrics such as inception score and Fréchet inception distance. (See [Reference Heusel, Ramsauer, Unterthiner, Nessler and Hochreiter16, Reference Salimans, Goodfellow, Zaremba, Cheung, Radford and Chen32] for more details on these metrics.)
Finally, as emphasized in Section 2, GANs are minimax games. From a game perspective, the probability distribution of $\Theta^*$ conditional on the discriminator parameter $\mathcal{W}^*$ , denoted by $\textrm{Law}(\Theta^*\mid\mathcal{W}^*)$ , corresponds to the mixed strategies adopted by the generator; likewise, the probability distribution of $\mathcal{W}^*$ conditional on the generator parameter $\Theta^*$ , denoted by $\textrm{Law}(\mathcal{W}^*\mid\Theta^*)$ , characterizes the mixed strategies adopted by the discriminator.
Recall that the SDE approximation (ALT-SDE) for the GAN training process is given by
where the drift coefficient is given by $b(\theta,\omega)=b_0(\theta,\omega)+\eta b_1(\theta,\omega)$ with
and the diffusion coefficient is given by
Note that (ALT-SDE) depends on the first- and second-order derivatives of the training loss with respect to the generator and the discriminator parameters.
Theorem 3. Assume the following conditions hold:
3(a) both b and $\sigma$ are bounded and smooth, and have bounded derivatives of any order;
3(b) there exist some positive real numbers r and $M_0$ such that, for any $\big(\theta\ \, \omega\big)^\top\in\mathbb{R}^{d_\theta+d_\omega}$ ,
\begin{equation*} \big(\theta\ \, \omega\big)b(\theta,\omega) \leq -r\Bigg\|\begin{pmatrix}\theta\\[4pt]\omega\end{pmatrix}\Bigg\|_2 \qquad \text{if }\Bigg\|\begin{pmatrix}\theta\\[4pt]\omega\end{pmatrix}\Bigg\|_2\geq M_0; \end{equation*}
3(c) $\mathcal{A}$ is uniformly elliptic, i.e. there exists $l>0$ such that
\begin{equation*} \text{for any } \begin{pmatrix}\theta\\[4pt]\omega\end{pmatrix}, \begin{pmatrix}\theta^{\prime}\\[4pt]\omega^{\prime}\end{pmatrix}\in\mathbb{R}^{d_\theta+d_\omega}, \quad \big(\theta^{\prime}\ \, \omega^{\prime}\big)^\top\sigma(\theta,\omega)\sigma(\theta,\omega)^\top \begin{pmatrix}\theta^{\prime}\\[4pt]\omega^{\prime}\end{pmatrix} \geq l\Bigg\|\begin{pmatrix}\theta^{\prime}\\[4pt]\omega^{\prime}\end{pmatrix}\Bigg\|_2^2. \end{equation*}
Then (ALT-SDE) admits a unique invariant measure $\mu^*$ with an exponential convergence rate of the joint distribution of $\big(\Theta_t,\mathcal{W}_t\big)$ towards $\mu^*$ as $t\to\infty$ .
Similar results hold for the invariant measure of (SML-SDE) with b replaced by $b_0$ .
Proof. In order to prove Theorem 3, we construct an appropriate Lyapunov function to characterize the long-term behavior for the SDE (ALT-SDE); the associated Lyapunov condition leads to the existence of an invariant measure for the dynamics of the parameters. We highlight this technique since it can be used in the analysis of broader classes of dynamical systems, for both stochastic and deterministic cases; see, for instance, [Reference Laborde and Oberman22]. Consider the function $V\,:\,[0,\infty)\times\mathbb{R}^{d_\theta+d_\omega}\to\mathbb{R}$ given by $V(t,u) = \exp\{\delta t+\varepsilon\|u\|_2\}$ for all $u\in\mathbb{R}^{d_\theta+d_\omega}$ , where the parameters $\delta, \varepsilon>0$ will be determined later. Note that V is a smooth function, and
for any fixed $\delta,\varepsilon>0$ . Under (ALT-SDE), applying Itô’s formula to V gives
Define the Lyapunov operator
Given the boundedness of $\sigma$ , i.e. there exists $K>0$ such that $\|\sigma\|_\textrm{F}\leq K$ , and the dissipative property given by condition 3(b), i.e. there exist $r,M_0>0$ such that, for any $u\in\mathbb{R}^{d_\theta+d_\omega}$ with $\|u\|_2>M_0$ , $u^\top b(u)\leq -r\|u\|_2$ , we have
Now take
then, for any $\|u\|_2>M$ , $\mathcal{L} V(t,u)\leq -\delta V(t,u)$ . Therefore,
Following [Reference Khasminskii19, Theorem 3.7], (13) and (14) ensure the existence of an invariant measure $\mu^*$ for (ALT-SDE). By the uniform ellipticity condition 3(c), uniqueness follows from [Reference Hong and Wang17, Theorem 2.3]. By the main result in [Reference Veretennikov38], the mixing coefficient
decays exponentially as $s\to\infty$ . For a Borel measurable set $C\subset\mathbb{R}^{d_\theta+d_\omega}$ ,
for all $s>0$ . Since any bounded and measurable function f can be approximated by simple functions, the usual argument from indicator functions to simple functions implies that
for all $s>0$ , for some constant $C>0$ . Let $\nu$ be an arbitrary initial distribution of $(\Theta_0,\mathcal{W}_0)$ . Then we have
for all $s>0$ . The conclusion follows.
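The role of conditions 3(a)–(c) can be illustrated on a toy one-dimensional SDE that is not a GAN. The sketch below assumes the hypothetical drift $b(u)=-u/\sqrt{1+u^2}$, which is bounded and smooth with bounded derivatives and satisfies $u\,b(u)\leq-\|u\|_2/\sqrt{2}$ for $\|u\|_2\geq1$, together with a constant diffusion coefficient, which is uniformly elliptic. An Euler-Maruyama simulation started from two distant points then forgets its initial condition, in line with the existence of a unique invariant measure.

```python
import numpy as np


def drift(u, r=1.0):
    """Bounded, smooth drift with u * b(u) <= -(r / sqrt(2)) * |u| whenever |u| >= 1."""
    return -r * u / np.sqrt(1.0 + u ** 2)


def simulate(u0, horizon=30.0, dt=0.01, sigma=1.0, n_paths=4000, seed=0):
    """Euler-Maruyama scheme for dU = b(U) dt + sigma dW, with all paths started at u0."""
    rng = np.random.default_rng(seed)
    u = np.full(n_paths, float(u0))
    for _ in range(int(round(horizon / dt))):
        noise = rng.standard_normal(n_paths)
        u = u + drift(u) * dt + sigma * np.sqrt(dt) * noise
    return u


# Two far-apart initial conditions (hypothetical values).
left = simulate(u0=-5.0, seed=1)
right = simulate(u0=5.0, seed=2)
second_moments = (float(np.mean(left ** 2)), float(np.mean(right ** 2)))
```

Up to Monte Carlo and discretization error, the terminal samples from both runs have matching first and second moments, which is what convergence of the law of the process to a common $\mu^*$ predicts.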
Theorem 3, together with Theorems 1 and 2, helps to control the distance between the training output after $\bar{N}=\lfloor{\mathcal{T}}/{\eta}\rfloor$ iterations and the mixed-strategy equilibrium $(\Theta^*,\mathcal{W}^*)\sim\mu^*$ in the sense that
for any bounded measurable function $f\in G^3\big(\mathbb{R}^{d_{\theta}+d_{\omega}}\big)$ , where $C_1$ and $\rho_1$ are as in (9), $C_2$ is some positive constant, and $\beta$ is as in (15).
4.2. Discussion of the assumptions
4.2.1. Implications of the technical assumptions on GAN training
Assumptions 1(a)–(c), 2(a) and (b), and 3(a), which impose regularity conditions on the drift, the volatility, and the derivatives of the loss function $\Phi$ , are more than a mathematical convenience. They are essential constraints on the growth of the loss function with respect to the model parameters, necessary for avoiding the exploding gradients encountered in the training of GANs.
Moreover, these conditions put restrictions on the gradients of the objective functions with respect to the parameters. By the chain rule, satisfying them requires both careful choices of network structures and particular forms of the loss function $\Phi$ .
In terms of proper neural network architectures, let us take an example of a network with one hidden layer. Let $f\,:\,\mathcal{X}\subset\mathbb{R}^{d_x}\to\mathbb{R}$ be such that
Here, h is the width of the hidden layer, $W^\textrm{h}\in\mathbb{R}^{h\times d_x}$ and $w^\textrm{o}\in\mathbb{R}^h$ are the weight matrix and vector for the hidden and output layers respectively, and $\sigma_\textrm{h}\,:\,\mathbb{R}\to\mathbb{R}$ and $\sigma_\textrm{o}\,:\,\mathbb{R}\to\mathbb{R}$ are the activation functions for the hidden and output layers. Then, taking partial derivatives with respect to the weights yields
from which we can see that the regularity conditions, especially the growth of the loss function with respect to the model parameters (i.e. assumptions 1(a)–(c) and 2(a)–(b)) rely on the regularity and the boundedness of the activation functions and the width and depth of the network, as well as the magnitudes of the parameters and data. Therefore, assumptions 1(a)–(c), 2(a) and (b), and 3(a) explain mathematically some well-known practices in GAN training such as introducing various forms of gradient penalties; see, for instance, [Reference Gulrajani, Ahmed, Arjovsky, Dumoulin and Courville15, Reference Thanh-Tung, Tran and Venkatesh37]. See also [Reference Schaefer, Zheng and Anandkumar33] for a combination of competition and gradient penalty to stabilize GAN training. It is worth noticing that apart from affecting the stability of GAN training, the regularity of the network can also affect the sample complexity of GANs and this phenomenon has been studied in [Reference Luise, Pontil and Ciliberto27] for a class of GANs with optimal transport-based loss functions.
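The dependence of the gradients on the activations, the width, and the magnitudes of weights and data can be checked numerically. The sketch below assumes the one-hidden-layer form $f(x)=\sigma_\textrm{o}\big(w^\textrm{o}\cdot\sigma_\textrm{h}(W^\textrm{h}x)\big)$ with $\tanh$ for both activations; this specific form and all numerical values are assumptions made for illustration.

```python
import numpy as np


def forward(W_h, w_o, x):
    """f(x) = tanh(w_o . tanh(W_h x)) for a single input vector x."""
    return np.tanh(w_o @ np.tanh(W_h @ x))


def gradients(W_h, w_o, x):
    """Chain-rule gradients of f with respect to W_h and w_o."""
    hidden = np.tanh(W_h @ x)
    outer = 1.0 - np.tanh(w_o @ hidden) ** 2  # derivative of the output activation
    grad_w_o = outer * hidden
    grad_W_h = outer * np.outer(w_o * (1.0 - hidden ** 2), x)
    return grad_W_h, grad_w_o


rng = np.random.default_rng(0)
h, d_x = 4, 3  # hypothetical width and input dimension
W_h = rng.normal(size=(h, d_x))
w_o = rng.normal(size=h)
x = rng.normal(size=d_x)
grad_W_h, grad_w_o = gradients(W_h, w_o, x)

# Central finite difference on one entry of W_h as a sanity check.
step = 1e-6
up, down = W_h.copy(), W_h.copy()
up[1, 2] += step
down[1, 2] -= step
fd_entry = (forward(up, w_o, x) - forward(down, w_o, x)) / (2 * step)
```

Because $|\tanh|\leq1$ and $|\tanh^{\prime}|\leq1$, the gradient with respect to $w^\textrm{o}$ has norm at most $\sqrt{h}$, and the gradient with respect to $W^\textrm{h}$ has Frobenius norm at most $\|w^\textrm{o}\|_2\|x\|_2$; an unbounded activation would not give such bounds.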
In terms of choices of loss functions, the objective function of the vanilla GANs in [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio14] is given by
Taking partial derivatives with respect to $\theta$ and $\omega$ , we see that
where $\textbf J^G_\theta$ denotes the Jacobian matrix of $G_\theta$ with respect to $\theta$ , and $\nabla_x$ denotes the gradient operator over a (parametrized) function with respect to its input variable. [Reference Arjovsky and Bottou1] analyzed the difficulties of stabilizing GAN training under the above loss function due to the lack of proper regularity conditions, and proposed a possible remedy by an alternative Wasserstein distance which enjoys better smoothness conditions.
4.2.2. Verifiability of assumptions in Theorems 1, 2, and 3
The assumptions from these theorems can be summarized into the three categories specified below. For some of the assumptions, there are available choices of GAN structures for a wide range of applications where these assumptions can be verified easily; some are consistent with certain choices of regularization applied in the training procedures of GANs; others are more subtle.
On the smoothness and the boundedness of drift and volatility. Take the example of Wasserstein GANs (WGANs) for image processing. Given that sample data in image processing problems are supported on a compact domain, assumptions 1(a)–(c), 2(a) and (b), and 3(a) are easily satisfied with proper choices of prior distribution and activation function: first, a prior distribution $\mathbb P_z$ such as the uniform distribution is naturally compactly supported; next, take $D_\omega(x)=\tanh{(\omega\cdot x)}$ , $G_\theta(z)=\tanh{(\theta\cdot z)}$ , and the objective function
Then, assumptions 1(a)–(c), 2(a) and (b), and 3(a) are guaranteed by the boundedness of the data $\{(z_i,x_j)\}_{1\leq i\leq N,1\leq j\leq M}$ and the very structure of the activation function:
More precisely, the first- and second-order derivatives of $\psi$ are
any higher-order derivatives can be written as functions of $\psi(\cdot)$ and $\psi^{\prime}(\cdot)$ and are therefore bounded.
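As a quick check of this boundedness, and assuming that $\psi$ denotes the $\tanh$ activation used above, the standard identities $\psi^{\prime}=1-\psi^2$ and $\psi^{\prime\prime}=-2\psi(1-\psi^2)$ can be evaluated on a grid:

```python
import numpy as np


def psi(t):
    return np.tanh(t)


def psi_prime(t):
    """First derivative of tanh, written in terms of psi itself."""
    return 1.0 - psi(t) ** 2


def psi_second(t):
    """Second derivative of tanh: -2 * psi * psi'."""
    return -2.0 * psi(t) * psi_prime(t)


grid = np.linspace(-10.0, 10.0, 20001)
max_first = float(np.max(np.abs(psi_prime(grid))))
max_second = float(np.max(np.abs(psi_second(grid))))

# Finite-difference check of psi_second at an arbitrary point.
t0, step = 0.3, 1e-5
fd_second = float((psi_prime(t0 + step) - psi_prime(t0 - step)) / (2 * step))
```

The first derivative is bounded by 1 and the second by $4/(3\sqrt{3})$, the maximum of $2s(1-s^2)$ over $s\in[0,1]$.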
On the dissipative property. The dissipative property specified by 3(b) essentially prevents the evolution of the parameters from being driven to infinity. The weight clipping technique in WGANs, for instance, is consistent with this assumption.
On the elliptic condition. Compared with the above two categories of assumptions, the uniform ellipticity condition 3(c) is intrinsically rooted in the stochasticity brought by the sampling procedures of stochastic gradient algorithms in general, i.e. the microscopic fluctuation from the noise of SGAs, instead of the macroscopic loss landscape of GANs. Recall from Section 2 that a cost function of the form
naturally induces a random variable $g(\theta,\omega;X,Z)=(\nabla_\theta J(\theta,\omega;X,Z),\nabla_\omega J(\theta,\omega;X,Z))$ with mean $(g_\theta(\theta,\omega),g_\omega(\theta,\omega))$ , where (X, Z) follows the empirical distribution given by the dataset $\mathcal{D}$ . The elliptic condition is essentially equivalent to the random variable $g(\theta,\omega;X,Z)$ being non-degenerate, and the smallest eigenvalue of its covariance matrix, $\underline{\sigma}(\textrm{Cov}(g(\theta,\omega;X,Z)))$ , being bounded away from 0. Note that this condition cannot be guaranteed by adding parameter regularizations as in the case of the dissipative property, since parameter regularizations only change the drift term b. For suitable choices of loss function $\Phi$ such that $\underline{\sigma}(\textrm{Cov}(g(\theta,\omega;X,Z)))$ is indeed bounded away from 0, control of the training outcome (16) holds; otherwise, we could consider a perturbed SDE approximation,
for sufficiently small $\lambda>0$ . Under condition 3(a), Itô isometry and Gronwall’s inequality give the error bound
for some positive coefficient $\alpha=\alpha(t)$ depending on time t. We can still control the distance between the training outcome and the perturbed equilibrium by
for any bounded measurable function $f\in G^3(\mathbb{R}^{d_{\theta}+d_{\omega}})$ , where $C_1$ , $C_2$ , $\rho_1$ , and $\beta$ are as in (16), $C_3$ is some positive coefficient depending on $\mathcal{T}$ , and $k\in\mathbb N$ is some positive integer.
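In practice, the non-degeneracy behind condition 3(c) can be probed by estimating the smallest eigenvalue of the empirical covariance of per-sample gradients. The sketch below is a minimal illustration under stated assumptions: the gradient matrices are synthetic stand-ins (random numbers, not gradients of any GAN loss), and the helper name `smallest_cov_eigenvalue` is hypothetical.

```python
import numpy as np


def smallest_cov_eigenvalue(per_sample_grads):
    """Smallest eigenvalue of the empirical covariance of per-sample gradients.

    Each row of per_sample_grads is the gradient evaluated at one data sample.
    """
    centered = per_sample_grads - per_sample_grads.mean(axis=0)
    cov = centered.T @ centered / per_sample_grads.shape[0]
    return float(np.linalg.eigvalsh(cov)[0])  # eigvalsh returns ascending eigenvalues


rng = np.random.default_rng(0)
n_samples, dim = 200, 5

# Stand-in gradients that fluctuate in every parameter direction.
non_degenerate = rng.normal(size=(n_samples, dim))

# Stand-in gradients with no fluctuation in the last parameter direction.
degenerate = non_degenerate.copy()
degenerate[:, -1] = 1.0

eig_ok = smallest_cov_eigenvalue(non_degenerate)
eig_bad = smallest_cov_eigenvalue(degenerate)
```

A smallest eigenvalue that stays away from 0 along the training path is consistent with condition 3(c); a value near 0, as in the second case, signals a direction in parameter space that receives no sampling noise.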
4.3. Dynamics of training loss and FDR
We can further analyze the dynamics of the training loss based on the SDE approximations and derive a fluctuation–dissipation relation (FDR) for the GAN training.
To see this, let $\mu=\{\mu_t\}_{t\geq0}$ be the flow of probability measures for $\big\{\big(\Theta_t\ \, \mathcal{W}_t\big)^\top\big\}_{t\geq0}$ given by (ALT-SDE). Then, applying Itô’s formula to the smooth function $\Phi$ (see [Reference Rogers and Williams31, Section 4.18]) gives the following dynamics of training loss:
where
is the infinitesimal generator for (ALT-SDE) on any given test function $f\,:\,\mathbb{R}^{d_\theta+d_\omega}\rightarrow\mathbb{R}$ .
The existence of the unique invariant measure $\mu^*$ for (ALT-SDE) implies the convergence of $\big\{\big(\Theta_t\ \, \mathcal{W}_t\big)^\top\big\}_{t\geq0}$ in (ALT-SDE) to some $\big(\Theta^*\ \,\mathcal{W}^*\big)^\top\sim\mu^*$ as $t\to\infty$ . By Definition 3 of the invariant measure and (17), we have $\mathbb{E}_{\mu^*}[\mathcal{A}\Phi(\Theta^*,\mathcal{W}^*)]=0$ . Applying the operator (18) to the loss function $\Phi$ yields
Based on the evolution of the loss function (17), convergence to the invariant measure $\mu^*$ leads to the following FDR for GAN training.
Theorem 4. Assume the existence of an invariant measure $\mu^*$ for (ALT-SDE); then
The corresponding FDR for the simultaneous update case of (SML-SDE) is
Remark 3. This FDR for GANs connects the microscopic fluctuation from the noise of SGAs with the macroscopic dissipation phenomena related to the loss function. In particular, the quantity $\textrm{Tr}(\Sigma_\theta\nabla^2_\theta\Phi+\Sigma_\omega\nabla^2_\omega\Phi)$ links the covariance matrices $\Sigma_\theta$ and $\Sigma_\omega$ from SGAs with the loss landscape of $\Phi$ , and reveals the trade-off of the loss landscape between the generator and the discriminator.
Note that this FDR for GAN training is analogous to that for the stochastic gradient descent algorithm on a pure minimization problem in [Reference Liu and Theodorou25, Reference Yaida44].
Further analysis of the invariant measure can lead to a different type of FDR that will be practically useful for learning rate scheduling. Indeed, applying Itô’s formula to the squared norm of the parameters $\big\|\big(\Theta_t\ \,\mathcal{W}_t\big)^\top\big\|_2^2$ shows the following dynamics:
Theorem 5. Assume the existence of an invariant measure $\mu^*$ for (SML-SDE); then
Given the infinitesimal generator for (ALT-SDE), Theorems 4 and 5 follow from direct computations.
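The mechanism behind these computations, namely that the generator applied to a test function has zero mean under the invariant measure, can be seen on a toy example outside the GAN setting. The sketch below assumes a one-dimensional Ornstein-Uhlenbeck process $\textrm{d}X_t=-aX_t\,\textrm{d}t+s\,\textrm{d}W_t$ with hypothetical constants $a$ and $s$, whose invariant measure is the Gaussian law with mean 0 and variance $s^2/(2a)$, and takes the squared norm $f(x)=x^2$ as the test function. Its drift is unbounded, so it does not satisfy condition 3(a); it is used only because its invariant measure is explicit.

```python
import numpy as np

a, s = 1.5, 0.8  # hypothetical mean-reversion rate and noise level

rng = np.random.default_rng(0)
# Exact samples from the invariant measure N(0, s^2 / (2a)).
samples = rng.normal(0.0, s / np.sqrt(2.0 * a), size=200_000)


def generator_of_square(x):
    """A f(x) = -a x f'(x) + (s^2 / 2) f''(x) for f(x) = x^2."""
    return -2.0 * a * x ** 2 + s ** 2


dissipation = float(np.mean(2.0 * a * samples ** 2))  # drift contribution
fluctuation = s ** 2                                  # noise contribution
residual = float(np.mean(generator_of_square(samples)))
```

Under the invariant measure the dissipation term $2a\,\mathbb{E}[X^2]$ balances the fluctuation term $s^2$, so the residual vanishes up to Monte Carlo error; the relations in Theorems 4 and 5 express the same kind of balance for the GAN parameters.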
Remark 4. (Scheduling of learning rate.) Notice that the quantities in (FDR2), including the parameters $(\theta,\omega)$ and the first-order derivatives of the loss function $g_{\theta}$ , $g_{\omega}$ , $g_{\theta}^{i,j}$ , and $g_{\omega}^{i,j}$ , are computationally inexpensive to evaluate. Therefore, (FDR2) enables customized scheduling of the learning rate, instead of predetermined scheduling rules such as those of the Adam or RMSprop optimizers.
For instance, recall that $g_{\theta}^{\mathcal{B}}$ and $g_{\omega}^{\mathcal{B}}$ are respectively unbiased estimators for $g_{\theta}$ and $g_{\omega}$ , and
are respectively unbiased estimators of $\Sigma_\theta(\theta,\omega)$ and $\Sigma_\omega(\theta,\omega)$ . Now, in order to improve GAN training with simultaneous update, we can introduce two tunable parameters $\varepsilon>0$ and $\delta>0$ to have the following scheduling:
Funding information
There are no funding bodies to thank relating to the creation of this article.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.