1. Introduction
Applications in modern statistics often require generating Monte Carlo samples from a distribution defined on $\Theta \subseteq \mathbb{R}^d$ using a version of the Metropolis–Hastings algorithm [Reference Brooks, Gelman, Jones and Meng4, Reference Hastings16, Reference Metropolis, Rosenbluth, Rosenbluth, Teller and Teller30]. Popular versions of Metropolis–Hastings include, among many others, random walk Metropolis–Hastings (RWM), the Metropolis-adjusted Langevin algorithm (MALA), the Metropolis–Hastings independence (MHI) sampler, and Hamiltonian Monte Carlo.
Convergence analyses of general state space Metropolis–Hastings algorithms have traditionally focused on studying their convergence rates in total variation distances [Reference Meyn and Tweedie31, Reference Rosenthal44, Reference Tierney48]. These convergence rates have received significant attention, at least in part, because they provide a key sufficient condition for the existence of central limit theorems [Reference Jones22] and the validity of methods for assessing the reliability of the simulation effort [Reference Robertson, Flegal, Vats and Jones43, Reference Vats, Flegal and Jones49]. However, convergence analyses of Metropolis–Hastings Markov chains typically result in qualitative convergence rates [Reference Hairer, Stuart and Vollmer15, Reference Jarner and Hansen18, Reference Johnson and Geyer21, Reference Mengersen and Tweedie29, Reference Roberts and Tweedie42]. Exact convergence rates in total variation in the general state space setting were nonexistent until recently, and those now available concern MHI samplers [Reference Smith and Tierney46, Reference Wang52]. Moreover, exact convergence rates in total variation for MHI samplers have been studied only in trivial examples. This leaves practitioners with little guidance on the true convergence behavior and reliability of MHI samplers in practical applications, especially in high dimensions.
At the same time, there has been significant recent interest in the convergence properties of Monte Carlo Markov chains in high-dimensional settings [Reference Durmus and Moulines8, Reference Ekvall and Jones11, Reference Hairer, Stuart and Vollmer15, Reference Johndrow, Smith, Pillai and Dunson20, Reference Qin and Hobert37, Reference Rajaratnam and Sparks41, Reference Yang, Wainwright and Jordan53] and traditional approaches can have limitations in this regime [Reference Qin and Hobert38]. This has led to an interest in considering the convergence of Monte Carlo Markov chains using Wasserstein distances [Reference Gibbs13, Reference Hairer, Stuart and Vollmer15, Reference Jin and Tan19, Reference Madras and Sezer28, Reference Qin and Hobert39, Reference Qin and Hobert40] which may scale to large problem sizes where other approaches have had difficulties [Reference Durmus and Moulines7, Reference Hairer, Stuart and Vollmer15, Reference Qin and Hobert40]. Convergence analyses in Wasserstein distances also result in benefits similar to those obtained using total variation such as central limit theorems and concentration inequalities for time averages of the Markov chain [Reference Hairer, Stuart and Vollmer15, Reference Jin and Tan19, Reference Joulin and Ollivier23, Reference Komorowski and Walczuk26].
We study exact convergence rates of the MHI sampler in $L_1$-Wasserstein distances, which we refer to simply as Wasserstein distances. There have been previous successful convergence analyses of Metropolis–Hastings algorithms using specific Wasserstein distances [Reference Durmus and Moulines7, Reference Hairer, Stuart and Vollmer15]. We develop exact convergence rates for the MHI sampler that hold simultaneously across many different metrics underlying the Wasserstein distance. Under mild assumptions, we show that the exact convergence rate in total variation [Reference Wang52] is also exact for Wasserstein distances weaker than total variation for every initialization. We provide new upper and lower bounds on the worst-case Wasserstein distance when the sampler is initialized from points. Under mild assumptions, similar to the ones used for the result in total variation [Reference Wang52], we show that the convergence rate at any point initialization is the same as the worst-case convergence rate. When the algorithm is started at a specific point, we give exact convergence expressions across more general Wasserstein distances, possibly stronger than total variation.
Our theoretical results on the exact convergence rate extend the results in total variation [Reference Wang52] to Wasserstein distances. However, only a trivial example was studied in total variation [Reference Wang52]. We provide a practically relevant application of our theoretical results by developing exact convergence expressions using normal–inverse-gamma proposals in the Bayesian quantile regression setting. Previously, qualitative convergence results for a Gibbs sampler were developed [Reference Khare and Hobert25].
Compared to methods used to approximate integrals such as importance sampling, MHI samplers can generate samples from the target distribution, which is often of interest for practitioners. MHI samplers can also be computationally efficient at each iteration, in contrast to more sophisticated Markov chain Monte Carlo algorithms, but can require many iterations to accept a proposed sample. Connections between the MHI sampler and rejection sampling are also well known [Reference Liu27, Reference Tierney48]. Exact convergence rates for MHI samplers may also provide insight into the convergence rates of more popular Metropolis–Hastings algorithms such as the MALA and RWM algorithms [Reference Brown and Jones5].
Motivated by the general theoretical work, we consider using a centered Gaussian proposal and derive exact convergence expressions in Wasserstein distances for a large class of target distributions. The centered Gaussian proposal matches the maximal point of the proposal density with that of the target density. By centering an independent proposal, we directly imbue the Markov chain with a strong attraction to a set where the target distribution has high probability. This centered Gaussian proposal is similar to using a Laplace approximation [Reference Pierre, Robert and Smith35, Reference Shephard and Pitt45, Reference Tierney48], but differs in its covariance matrix. We study this MHI sampler in several Bayesian generalized linear models and derive exact convergence expressions in general Wasserstein distances.
Our techniques are based on a well-known condition [Reference Mengersen and Tweedie29, Reference Tierney48, Reference Wang52], but the novelty in our analysis is a carefully constructed proposal that yields exact convergence rates across Wasserstein distances. We then consider scaling properties of the exact convergence rate to large dimensions and sample sizes in high-dimensional Bayesian binary response regression (e.g. logistic and probit regression) with Gaussian priors. Data augmentation algorithms have been developed for these models [Reference Albert and Chib1, Reference Polson, Scott and Windle36], but the required matrix inversions at each iteration can be computationally intensive. We derive an explicit asymptotic upper bound on the convergence rate of our centered MHI sampler for general Wasserstein distances, independent of the dimension d and sample size n, when they increase in such a way that the ratio $d/n \to \gamma \in (0, +\infty)$. In this regime, the exact convergence analysis yields informative convergence rates for practitioners, and the MHI sampler can scale to large problem sizes.
To the best of our knowledge, this work is the first to successfully address the convergence complexity of Metropolis–Hastings in general Wasserstein distances when both the sample size and the dimension increase. Previously, under the conditions of a central limit theorem, the convergence complexity in total variation of RWM on a compact set was studied [Reference Belloni and Chernozhukov2]. In contrast, our convergence complexity results do not rely on the underlying space being compact. The dimension dependence of the mixing time has been studied in specific Wasserstein distances and total variation for Metropolis–Hastings algorithms such as MALA and RWM for certain log-concave target distributions [Reference Dwivedi, Chen, Wainwright and Yu9, Reference Eberle10]. We take into account the sample size, and upper bound the convergence rate to provide further theoretical guarantees for time averages of the Markov chain [Reference Jones22, Reference Joulin and Ollivier23].
Some related quantitative convergence complexity bounds look at the spectral gap, implying a convergence rate in total variation from a specific distribution initialization. For example, the convergence rate of a random walk algorithm for logistic regression in one dimension has been studied in terms of the sample size [Reference Johndrow, Smith, Pillai and Dunson20]. Other related results concern the convergence properties of some high-dimensional Gibbs samplers [Reference Papaspiliopoulos, Roberts and Zanella33, Reference Papaspiliopoulos, Stumpf-Fétizon and Zanella34] or the convergence properties of Gibbs samplers when the dimension or the sample size increase individually [Reference Ekvall and Jones11, Reference Qin and Hobert40, Reference Rajaratnam and Sparks41].
The remainder is organized as follows. In Section 2, we define the Metropolis–Hastings independence sampler and the Wasserstein distance. In Section 3, we develop exact convergence rates in the Wasserstein distance for the MHI sampler and apply this theory to Bayesian quantile regression. In Section 4, we show that, under mild assumptions, the Wasserstein convergence rate is the same at every point initialization. In Section 5, we study a centered Gaussian proposal to obtain exact convergence expressions and apply this to many popular Bayesian generalized linear models used in statistics. We also develop high-dimensional convergence complexity results for Bayesian binary response regression in the large-dimension and large-sample-size regime. Section 6 contains some final remarks. Some technical details and proofs are deferred to the appendices.
2. MHI samplers and Wasserstein distances
As they will be considered here, MHI samplers simulate a Markov chain with invariant distribution $\Pi$ supported on a nonempty set $\Theta \subseteq \mathbb{R}^d$ using a proposal distribution Q which, to avoid trivialities, is assumed throughout to be different from $\Pi$ . We also assume throughout that $\Pi$ has Lebesgue density $\pi$ with support $\Theta$ , and Q has Lebesgue density q with support $\Theta$ . Define, for $\theta, \theta^{\prime} \in \Theta$ , the acceptance probability
\[ a(\theta, \theta^{\prime}) = \min\bigg\{ \frac{\pi(\theta^{\prime}) q(\theta)}{\pi(\theta) q(\theta^{\prime})}, 1 \bigg\}. \]
We consider MHI samplers initialized at a point $\theta_0 \in \Theta$ . MHI proceeds as follows: for $t \in \{1, 2, \ldots \}$ , given $\theta_{t-1}$ , draw $\theta^{\prime}_t \sim Q({\cdot})$ and $U_t \sim \text{Unif}(0, 1)$ independently so that
\[ \theta_t = \begin{cases} \theta^{\prime}_t, & U_t \le a(\theta_{t-1}, \theta^{\prime}_t), \\ \theta_{t-1}, & \text{otherwise}. \end{cases} \]
If $\delta_\theta$ denotes the Dirac measure at the point $\theta$ , the MHI Markov kernel P is defined for $\theta \in \mathbb{R}^d$ and measurable $B \subseteq \mathbb{R}^d$ by
\[ P(\theta, B) = \int_B a(\theta, \theta^{\prime}) q(\theta^{\prime}) \,\mathrm{d}\theta^{\prime} + \delta_\theta(B) \int \big(1 - a(\theta, \theta^{\prime})\big) q(\theta^{\prime}) \,\mathrm{d}\theta^{\prime}. \]
For $\theta \in \mathbb{R}^d$ , define the Markov kernel at iteration time $t \ge 2$ recursively by
\[ P^t(\theta, B) = \int P^{t-1}(\theta^{\prime}, B) \, P(\theta, \mathrm{d}\theta^{\prime}). \]
Let $\mathcal{C}(P^t(\theta, \cdot), \Pi)$ be the set of all joint probability measures with marginals $P^t(\theta, \cdot)$ and $\Pi$ , and let $\rho$ be a lower semicontinuous metric. The $L_1$ -Wasserstein distance [Reference Kantorovich and Rubinstein24, Reference Villani50, Reference Villani51], which we will call simply the Wasserstein distance, is
\[ \mathcal{W}_{\rho}\big( P^t(\theta, \cdot), \Pi \big) = \inf_{\xi \in \mathcal{C}(P^t(\theta, \cdot), \Pi)} \int \rho(\theta^{\prime}, \omega) \,\mathrm{d}\xi(\theta^{\prime}, \omega). \]
Notice that when the metric $\rho$ is $\rho(\theta, \omega) = I_{\theta \not= \omega}$ , then the Wasserstein distance is the total variation distance. More generally, for a lower semicontinuous function $V \ge 1$ , $\rho(\theta, \omega) = [V(\theta) + V(\omega)] I_{\theta \not= \omega}$ defines a weighted total variation distance. Another example is $\rho(\theta, \omega) = \min\{ \left\lVert \theta - \omega \right\rVert, 1 \}$ , which is always less than the Hamming metric used in total variation. More general $L_p$ -Wasserstein distances with $p \ge 2$ are not studied in this work.
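For concreteness, the three metrics above are straightforward to implement; in the Python sketch below, the weight function V and the Euclidean norm are illustrative choices, not prescribed by the text.

```python
import numpy as np

def rho_tv(theta, omega):
    # Hamming (discrete) metric: yields the total variation distance
    return float(not np.array_equal(theta, omega))

def rho_weighted_tv(theta, omega, V=lambda x: 1.0 + float(np.sum(x ** 2))):
    # weighted total variation with a lower semicontinuous V >= 1 (illustrative V)
    return (V(theta) + V(omega)) * rho_tv(theta, omega)

def rho_truncated(theta, omega):
    # truncated-norm metric, never greater than the Hamming metric
    return min(float(np.linalg.norm(theta - omega)), 1.0)

a, b = np.zeros(2), np.array([0.3, 0.4])
```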
Wasserstein distances control the bias of integrals of $P^t(\theta, \cdot)$ over all $\rho$ -Lipschitz functions, whereas total variation controls the bias for bounded, measurable functions. Often in applications, various types of Lipschitz functions are of interest. Consider, for example, functions such as the identity function and Winsorized functions such as $g(\omega) = ({-}R) \vee (R \wedge \omega)$ where $R \in (0, +\infty)$ , which reduce sensitivity to extreme values. An alternative motivation for using Wasserstein distances is that convergence analysis in certain Wasserstein distances may improve the scaling to high dimensions [Reference Hairer, Stuart and Vollmer15, Reference Qin and Hobert39, Reference Qin and Hobert40]. Serious problems can arise with existing convergence analysis in total variation [Reference Tierney48] in high dimensions even for seemingly trivial examples such as the one below.
Example 1. Consider a standard $d$-dimensional Gaussian target distribution $\Pi \equiv N(0, I_d)$ and Gaussian proposal $Q \equiv N(0, \sigma^2 I_d)$ with $\sigma^2 \in (1, \infty)$ . With $\rho_{\text{TV}}(\theta^{\prime}, \omega^{\prime}) = I_{\theta^{\prime} \not= \omega^{\prime}}$ , current results on convergence analysis [Reference Tierney48] show that the ratio of the proposal and target densities satisfies $\inf_{\theta} \{ q(\theta) / \pi(\theta) \} \ge \sigma^{-d}$ , so that, for every $\theta \in \mathbb{R}^d$ , $\mathcal{W}_{\rho_{\text{TV}}}( P^t(\theta, \cdot), \Pi ) \le (1 - \sigma^{-d})^t$ . In particular, $\sigma^{-d} \approx 0$ in high dimensions, so this upper bound decays extremely slowly in t.
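The dimension dependence in Example 1 is easy to quantify numerically. The sketch below (with an illustrative proposal variance) computes how many iterations the bound $(1 - \sigma^{-d})^t$ needs to fall below 0.01.

```python
import math

sigma2 = 1.5  # illustrative proposal variance, sigma^2 > 1
iters = {}
for d in (1, 10, 100):
    eps = sigma2 ** (-d / 2)  # sigma^{-d} with sigma = sqrt(sigma2)
    # smallest t with (1 - eps)^t <= 0.01
    iters[d] = math.ceil(math.log(0.01) / math.log1p(-eps))
```

The required number of iterations grows geometrically in d, which is exactly the degradation of the total variation bound discussed in the example.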
At a specific initialization, we may look for a smaller convergence rate which may scale to high dimensions in a Wasserstein distance weaker than total variation. We will not only develop an exact convergence analysis which controls the bias for Lipschitz functions, but, roughly speaking, we also discover that the convergence rate in Example 1 is exact for Wasserstein distances and the same for every initialization. Nevertheless, we show that convergence rates can scale to large problem sizes using a novel proposal and exact convergence analysis.
3. Exact convergence rates for MHI samplers
When the ratio of the proposal and target densities is bounded below by a positive number, i.e. $\varepsilon^* = \inf_{\theta \in \Theta} \{ q(\theta)/\pi(\theta) \} > 0$ , the MHI sampler is uniformly ergodic in total variation with convergence rate upper bounded by $1 - \varepsilon^*$ [Reference Tierney48, Corollary 4]. Unlike in accept–reject sampling, $\varepsilon^*$ does not need to be known explicitly or computed in order to implement MHI. However, this requirement was shown to be necessary for uniform ergodicity in total variation [Reference Mengersen and Tweedie29, Theorem 2.1]. More recently, it was shown that the convergence rate cannot be improved [Reference Wang52, Theorem 1]. We show this to be the case even in weaker Wasserstein distances where the lower bound does not follow trivially from that of the total variation lower bound [Reference Wang52, Theorem 1].
Theorem 1. Suppose $\rho(\cdot, \cdot) \le 1$ . Then
If, in addition, q is lower semicontinuous on $\Theta$ , $\pi$ is upper semicontinuous on $\Theta$ , and $\Theta$ can be expressed as a countable union of compact sets, then
Proof. The proof is provided in Appendix A.
The semicontinuity assumption is not required when working with the total variation distance [Reference Wang52, Theorem 1], but it is a mild assumption that holds in many practical applications. The upper bound constant can improve upon upper bounds in total variation [Reference Tierney48] if, for example, $\rho$ is continuous and $\Theta$ is compact. If $\varepsilon^* = 0$ , Theorem 1 also gives the lower bound
which shows that MHI cannot converge uniformly from any starting point for many Wasserstein distances. Thus, under mild assumptions, Theorem 1 gives a complete characterization of the worst-case convergence of the MHI sampler in many Wasserstein distances.
Exact convergence expressions are available when the Markov chain is initialized at $\theta^* = \text{argmin}\{ q(\theta) / \pi(\theta) \,:\, \theta \in \Theta\}$ using techniques from [Reference Wang52].
Proposition 1. Suppose there exists a solution $\theta^* = \mathrm{argmin}\{ q(\theta) / \pi(\theta) \,:\, \theta \in \Theta\}$ . Then
\[ \mathcal{W}_{\rho}\big( P^t(\theta^*, \cdot), \Pi \big) = (1 - \varepsilon^*)^t \int \rho(\theta, \theta^*) \,\mathrm{d}\Pi(\theta). \]
Proof. Define $\varepsilon_{\theta^*} = q(\theta^*)/\pi(\theta^*) = \varepsilon^*$ . Under our assumptions, $P^t(\theta^*, \cdot)$ can be represented as a convex combination of the target distribution and the Dirac measure at the point $\theta^*$ [Reference Wang52, Remark 1, Theorem 2], that is,
Let $\psi \,:\, \Theta \to \mathbb{R}$ be a function such that $\int_\Theta |\psi| \,\mathrm{d}\Pi < \infty$ . We have the identity
Since the only coupling between $\Pi^*$ and the Dirac measure $\delta_{\theta^*}$ is the independent coupling [Reference Giraudo14], the Wasserstein distance takes the simple form $\mathcal{W}_{\rho}( \delta_{\theta^*}, \Pi) = \int \rho(\theta, \theta^*) \,\mathrm{d}\Pi(\theta)$ .
Since q is not exactly $\pi$ , $\varepsilon_{\theta^*} \in (0, 1)$ . Let $M_b(\mathbb{R}^d)$ be the set of bounded measurable functions on $\mathbb{R}^d$ and, for real-valued functions $\varphi$ , let $\left\lVert\varphi\right\rVert_{\text{Lip}(\rho)} = \sup_{x, y, x \not= y} \{ |\varphi(x) - \varphi(y)| / \rho(x, y) \}$ denote the Lipschitz norm with respect to the distance $\rho$ . Applying the Kantorovich–Rubinstein theorem [Reference Villani50, Theorem 1.14],
3.1. Application: Bayesian quantile regression
Fix $r \in (0, 1)$ and suppose, for $i=1,\ldots,n$, that $\varepsilon_i$ are independent and identically distributed (i.i.d.) with density $p_r(\varepsilon) = r(1 - r) (\exp\!((1 - r) \varepsilon ) I_{\varepsilon < 0} + \exp\!({-}r\varepsilon) I_{\varepsilon \ge 0})$ . Let $v_0, s_0 \in (0, \infty)$ and $C \in \mathbb{R}^{d \times d}$ be symmetric positive-definite. We parameterize the inverse gamma distribution so that if $\sigma \sim \text{IG}(v, s)$ for some $v, s \in (0, \infty)$ , then it has a density proportional to $\sigma^{-v - 1} \exp\!({-}s/\sigma)$ . Assume the Bayesian quantile regression model for $i \in \{1, \ldots, n\}$ , where $X_i \in \mathbb{R}^d$ is fixed and
Let $\Pi(\cdot \mid X, Y)$ denote the posterior and $\pi(\cdot \mid X, Y)$ denote the density for this Bayesian model with normalizing constant $Z_{\Pi(\cdot \mid X, Y)}$ .
Upper bounds on the convergence rate were previously investigated for Gibbs samplers [Reference Khare and Hobert25] in this setting. We will study the MHI sampler with a normal–inverse-gamma proposal constructed as follows. Define the convex function $\ell_r(u) = u (r - I_{u < 0})$ and $s_{n, r}\,:\, \mathbb{R}^d \to \mathbb{R}$ by $s_{n, r}(\beta) = \sum_{i = 1}^n \ell_r(Y_i - \beta^\top X_i) + \beta^\top C^{-1} \beta / 2$ . Since $s_{n, r}$ is strongly convex, let $\beta^* \in \mathbb{R}^d$ denote its unique minimizer. Now the MHI proposal is given by
Let $\Gamma \,:\, (0, \infty) \to \mathbb{R}$ be the usual Gamma function and define
The following gives an exact convergence rate of this algorithm which completely characterizes its convergence from a specific initialization.
Theorem 2. For any $\sigma_0 \in (0, \infty)$ ,
Proof. We may define the function $f \,:\, \mathbb{R}^d \times (0, \infty) \to \mathbb{R}$ by
and write the posterior density as $\pi(\beta, \sigma \mid X, Y) = Z_{\Pi(\cdot \mid X, Y)}^{-1} \exp\!({-}f(\beta, \sigma))$ . Since the function $\beta \mapsto s_{n, r}(\beta) - \beta^\top C^{-1} \beta / 2$ is a convex function on $\mathbb{R}^d$ , by Lemma 4, for every $\beta \in \mathbb{R}^d$ , $s_{n, r}(\beta) \ge s_{n, r}(\beta^*) + \frac{1}{2}(\beta - \beta^*)^\top C^{-1}(\beta - \beta^*)$ . For any $(\beta, \sigma) \in \mathbb{R}^d \times (0, \infty)$ , we then have the lower bound
This implies that
Let q denote the proposal’s normal–inverse-gamma density. For any $\sigma_0 \in (0, \infty)$ and for every $(\beta, \sigma) \in \mathbb{R}^d \times (0, \infty)$ , we have shown that
An application of Proposition 1 completes the proof.
Note that $\varepsilon_{\beta^*}$ is difficult to compute since it depends on the normalizing constant, but we give an example later where upper bounding the convergence rate is possible in Bayesian logistic and probit regression.
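On the computational side, the proposal location $\beta^*$ in Theorem 2 is the minimizer of the strongly convex but nonsmooth objective $s_{n, r}$ and can be approximated by subgradient descent. The following Python sketch is illustrative only: the toy data, step sizes, and iteration count are assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def check_loss(u, r):
    # l_r(u) = u * (r - 1{u < 0})
    return u * (r - (u < 0.0))

def s_nr(beta, X, Y, C_inv, r):
    # s_{n,r}(beta) = sum_i l_r(Y_i - beta' X_i) + beta' C^{-1} beta / 2
    return float(check_loss(Y - X @ beta, r).sum() + 0.5 * beta @ C_inv @ beta)

def s_nr_subgrad(beta, X, Y, C_inv, r):
    u = Y - X @ beta
    # a subgradient: l_r has slope r - 1{u < 0} away from u = 0
    return -X.T @ (r - (u < 0.0)) + C_inv @ beta

# toy data (illustrative sizes and coefficients)
n, d, r = 200, 3, 0.5
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)
C_inv = np.eye(d)

# plain subgradient descent with diminishing steps, tracking the best iterate
beta = np.zeros(d)
best, best_val = beta.copy(), s_nr(beta, X, Y, C_inv, r)
for k in range(1, 2001):
    beta = beta - (0.5 / k) * s_nr_subgrad(beta, X, Y, C_inv, r)
    val = s_nr(beta, X, Y, C_inv, r)
    if val < best_val:
        best, best_val = beta.copy(), val
beta_star = best  # approximate minimizer: the proposal's location parameter
```

Any convex solver handles this problem; subgradient descent is used here only to keep the sketch dependency-free.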
4. The convergence rate at arbitrary initializations
The previous section studies the worst-case convergence rate and the convergence rate at an individual point for the MHI sampler. We can study the convergence rate at every point, as was done for total variation [Reference Wang52]. The proof relies on the exact representation of the MHI sampler [Reference Smith and Tierney46, Theorem 1], but new arguments are needed in the Wasserstein setting. Similar to the convergence rate in total variation [Reference Wang52], we define the Wasserstein convergence rate for a point $\theta \in \Theta$ as $r_{\rho}(\theta) = \lim_{t \to \infty} \mathcal{W}_{\rho}(P^t(\theta, \cdot), \Pi)^{1/t}$ .
When the distance metric is $\min\{ \left\lVert\cdot\right\rVert, 1 \}$ , where $\left\lVert\cdot\right\rVert$ can be any norm, Theorem 3 shows we can obtain the convergence rate at every point under mild conditions. We require $\pi$ and q to be locally $\left\lVert\cdot\right\rVert$ -Lipschitz continuous and bounded, which is stronger than the local Lipschitz continuity required in the total variation analysis [Reference Wang52]. However, this additional condition is satisfied in many practical applications in statistics. Theorem 3 also lower bounds the convergence rate for Wasserstein $L_p$ -distances, and the rate of convergence for these distances cannot be improved.
Theorem 3. If $\pi$ , q are locally $\left\lVert\cdot\right\rVert$ -Lipschitz continuous and bounded on $\mathbb{R}^d$ and there exists a solution $\theta^* = \mathrm{argmin}\{q(\theta) / \pi(\theta) \,:\, \theta \in \Theta\}$ , then, for any $\theta \in \Theta$ , the Wasserstein convergence rate is the same with $r_{\min\{\left\lVert\cdot\right\rVert, 1\}}(\theta) = 1 - q(\theta^*) / \pi(\theta^*)$ .
Proof. If the initialization is at $\theta^*$ , then the result follows from Proposition 1. Fix a point $\theta_0 \in \Theta$ such that $\theta_0 \not= \theta^*$ . Using Lemma 1, we have an upper bound on the convergence rate by
It remains to lower bound the limit inferior.
Denote the standard p-norms for vectors $x \in \mathbb{R}^d$ by $\left\lVert x\right\rVert_p = \big(\sum_{i} |x_i|^p\big)^{1/p}$ . For $h \in (0, 1]$ , define the function $\varphi_h(\theta) = (2h)^{-d}\exp\!({-}h^{-1} \left\lVert\theta - \theta^*\right\rVert_1)$ . This is the probability density function for a Laplace distribution, and $\varphi_h$ is $\left\lVert\cdot\right\rVert_1$ -Lipschitz with constant $2^{-d} h^{-d - 1}$ , so $2^{d} h^{d + 1} \varphi_h(\theta) = h \exp\!({-}h^{-1} \left\lVert\theta - \theta^*\right\rVert_1)$ is nonnegative, $\left\lVert\cdot\right\rVert_1$ -Lipschitz with constant 1, and bounded by 1. In particular, it is readily shown that $2^{d} h^{d + 1} \varphi_h(\theta)$ is $\min\{1, \left\lVert\cdot\right\rVert_1\}$ -Lipschitz. Using Kantorovich–Rubinstein duality [Reference Villani50, Theorem 1.14], we have the lower bound
We will develop some approximation properties of $\varphi_h$ . Since $\theta_0 \not= \theta^*$ , it is readily shown that $\lim_{h \downarrow 0} \varphi_h(\theta_0) = 0$ . Using a change of variables and since we have assumed $\sup_{\theta} \pi(\theta) < \infty$ ,
If $Y_1, \ldots, Y_d$ are i.i.d. Laplace, then we have the tail bound
Since norms are equivalent on $\mathbb{R}^d$ , $\pi$ and q are locally Lipschitz with respect to any norm. Choosing $h = h_0/t^2$ for some $h_0 \in (0, 1)$ , since $\pi$ is locally $\left\lVert\cdot\right\rVert_2$ -Lipschitz, we can find a universal constant $L \in (0, \infty)$ such that, for any $\left\lVert\theta^{\prime}\right\rVert_2 \le t$ , $|\pi(\theta^* + h \theta^{\prime}) - \pi(\theta^*)| \le {h_0 L}/{t}$ . Applying these upper bounds to (2) and (3), for large enough t we have
A similar argument with the assumptions on q yields, for large enough t,
It remains to lower bound (1). We will use the exact representation of the independence sampler [Reference Smith and Tierney46, Theorem 1, Lemma 3]. Define the importance sampling weight by $w(\theta) = \pi(\theta) / q(\theta)$ and its maximum $w^* = w(\theta^*) = \pi(\theta^*) / q(\theta^*)$ . For $w \in (0, \infty)$ , define
and, using the exact representation of the independence sampler [Reference Smith and Tierney46, Theorem 1, Lemma 3], for measurable sets $B \subseteq \mathbb{R}^d$ ,
The proof of existing results [Reference Wang52, Theorem 4] shows that
We now use the exact representation of the independence sampler to lower bound (1). We have $\lambda^t(w(\theta_0)) \le (1 - 1/w^*)^t$ , so we then have the upper bound
Using this upper bound with the approximations (4) and (5) yields
Therefore, we can choose a small enough $h_0 \in (0, 1)$ independently of t such that we have the upper bound
Applying these bounds to (1) and using that we have chosen $h = h_0/t^2$ , we have the lower bound
Taking the limit,
Since all norms are equivalent on $\mathbb{R}^d$ , this can be extended to any norm $\left\lVert\cdot\right\rVert$ .
5. MHI samplers with centered Gaussian proposals
We look to apply the exact convergence expression from Proposition 1 to practical applications since the convergence rate is the same at every initialization under mild assumptions. Recently, centered drift functions have been used to improve convergence analyses of some Monte Carlo Markov chains [Reference Ekvall and Jones11, Reference Qin and Hobert37, Reference Qin and Hobert40]. Our focus is instead on centering the proposal distribution, that is, matching the optimal points of the proposal and target densities similar to Laplace approximations.
We shall see in the next section that by centering a Gaussian proposal, we may satisfy the assumptions of Proposition 1 for a general class of target distributions with $\theta^*$ being the optimum of the target’s density. While we focus on Gaussian proposals, the technique of centering proposals is in fact more general.
We will assume the target distribution $\Pi$ is a probability distribution supported on $\mathbb{R}^d$ . With $f \,:\, \mathbb{R}^d \to \mathbb{R}$ and normalizing constant $Z_{\Pi}$ , define the density $\pi$ by $\pi(\theta) = Z_{\Pi}^{-1} \exp\!({-}f(\theta))$ . Let $\theta^*$ be the unique maximum of $\pi$ , $\alpha \in (0, +\infty)$ , and $C \in \mathbb{R}^{d \times d}$ be a symmetric, positive-definite matrix. Let the proposal distribution Q with density q correspond to a d-dimensional Gaussian distribution, $\text{N}_{d}(\theta^*, \alpha^{-1} C )$ . In this case, the ratio of the proposal density and target density is
Proposition 2. If $\theta^*$ exists and, for any $\theta \in \mathbb{R}^d$ , $f(\theta) \ge f(\theta^*) + \alpha(\theta - \theta^*)^\top C^{-1}(\theta - \theta^*)/2$ , then
Proof. Since the proposal density has been centered at the point $\theta^*$ , it then satisfies $q(\theta^*) = (2 \pi)^{-d/2} \alpha^{d/2} \det\!(C)^{-1/2}$ . For every $\theta \in \mathbb{R}^d$ , we have the lower bound
Since both densities are positive and the proposal is independent of the previous iteration, we have shown that the conditions for Proposition 1 are satisfied and an application of Proposition 1 with the proposal and target distribution Q and $\Pi$ as we have defined them in this section completes the proof.
The assumption in Proposition 2 is sometimes referred to as a quadratic growth condition at $\theta^*$ . This condition does not require convexity of f, but it is satisfied, and the point $\theta^*$ is guaranteed to exist, if the function f satisfies a strong convexity property. A function $h \,:\, \mathbb{R}^d \to \mathbb{R}$ is strongly convex with parameter $\mu \in (0, +\infty)$ if $h({\cdot}) - \mu \left\lVert\cdot\right\rVert^2 / 2$ is convex [Reference Hiriart-Urruty and Lemaréchal17, Reference Nesterov32]. The norm in this definition is often taken to be the Euclidean norm, but we will use the norm induced by the matrix $C^{-1}$ . We consider using a Gaussian proposal centered at a point $\theta_0$ which is not necessarily the optimum of the target density. Let $g_{f(\theta_0)} \in \mathbb{R}^d$ be a subgradient of f at $\theta_0$ [Reference Nesterov32]. For a point $\theta_0 \in \mathbb{R}^d$ , we consider the proposal corresponding to a d-dimensional Gaussian distribution, $\text{N}_{d}(\theta_0 - \alpha^{-1} C g_{f(\theta_0)}, \alpha^{-1} C)$ . When f is differentiable, this construction of the proposal uses the gradient of f in a similar way to MALA. The ratio of the proposal and target density evaluated at $\theta_0$ is $\varepsilon_{\theta_0} = (2 \pi)^{-d/2} \det\!(\alpha^{-1} C)^{-1/2} Z_{\Pi} \exp\!(f(\theta_0) - {g_{f(\theta_0)}}^\top C g_{f(\theta_0)} / (2\alpha) )$ . Choosing $\theta_0 \equiv \theta^*$ maximizes $\varepsilon_{\theta_0}$ , and hence yields the fastest convergence rate, recovering the centered Gaussian proposal; we also have an exact convergence expression in other cases.
Proposition 3. If the function $\theta \mapsto f(\theta) - \alpha \theta^\top C^{-1} \theta / 2$ is convex for all points on $\mathbb{R}^d$ , then $\mathcal{W}_{\rho}(P^t(\theta_0, \cdot), \Pi) = (1 - \varepsilon_{\theta_0})^t \int \rho(\theta, \theta_0) \,\mathrm{d}\Pi(\theta)$ .
Proof. Since the function $f(\theta) - \alpha \theta^\top C^{-1} \theta/2$ is convex for all points on $\mathbb{R}^d$ , for each $\theta \in \mathbb{R}^d$ ,
This implies that, for every $\theta \in \mathbb{R}^d$ , the ratio of the proposal density q corresponding to the distribution $\text{N}_{d}(\theta_0 - \alpha^{-1} C g_{f(\theta_0)}, \alpha^{-1} C)$ and target density $\pi$ satisfies
An application of Proposition 1 completes the proof.
5.1. Application: Bayesian generalized linear models
We consider Bayesian Poisson and negative-binomial regression for count response regression and Bayesian logistic and probit regression for binary response regression. Suppose there are n discrete-valued responses $Y_i$ with features $X_i \in \mathbb{R}^d$, and a parameter $\beta \in \mathbb{R}^d$. For Poisson regression, assume the $Y_i$ are conditionally independent with $Y_i \mid X_i, \beta \sim \text{Poisson}(\exp\!(\beta^\top X_i))$. Similarly, for negative-binomial regression, if $\xi \in (0, +\infty)$, assume $Y_i \mid X_i, \beta \sim \text{Negative-Binomial}(\xi,(1 + \exp\!({-}\beta^\top X_i))^{-1})$. For binary response regression, if $S \,:\, \mathbb{R} \to (0, 1)$, assume $Y_{i} \mid X_{i}, \beta \sim \text{Bernoulli} (S(\beta^\top X_i))$. For logistic regression, we will consider $S(x) = (1 + \exp\!({-}x))^{-1}$, and for probit regression, we will consider S(x) to be the cumulative distribution function of a standard Gaussian random variable.
Suppose $\beta \sim \text{N}_{d}(0, \alpha^{-1} C)$ , where $\alpha \in (0, +\infty)$ and $C \in \mathbb{R}^{d \times d}$ is a symmetric, positive-definite matrix. Both $\alpha$ and C are assumed to be known. Define the vector $Y = (Y_1, \ldots, Y_n)^\top$ and the matrix $X = (X_1, \ldots, X_n)^\top$ . Let $\Pi(\cdot \mid X, Y)$ denote the posterior with density $\pi(\cdot \mid X, Y)$ . If $\ell_n$ denotes the negative log-likelihood for each model, the posterior density is characterized by
Observe that the function $\ell_n$ is convex in all four models we consider. Let $\beta^*$ denote the unique maximizer of $\pi(\cdot \mid X, Y)$ . For the MHI algorithm, we use an $\text{N}_{d}(\beta^*, \alpha^{-1} C)$ proposal distribution, and Proposition 3 immediately yields the following for each posterior.
Corollary 1. $\mathcal{W}_{\rho}(P^t(\beta^*, \cdot), \Pi(\cdot \mid X, Y)) = (1 - \varepsilon_{\beta^*})^t \int \rho(\beta, \beta^*) \,\mathrm{d}\Pi(\beta \mid X, Y)$ , where $\varepsilon_{\beta^*} = \exp\!(\ell_n(\beta^*) + ({\alpha}/{2}){\beta^*}^\top C^{-1} \beta^*) Z_{\Pi(\cdot \mid X, Y)} ((2 \pi)^{d/2} \det\!(\alpha^{-1} C)^{1/2})^{-1}$ .
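To make Corollary 1 concrete, the centered proposal for Bayesian logistic regression can be constructed by locating $\beta^*$ with Newton's method and then proposing from $\text{N}_{d}(\beta^*, \alpha^{-1} C)$ . The sketch below uses toy data; the sample size, dimension, and prior settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy logistic-regression data (illustrative sizes and coefficients)
n, d, alpha = 100, 3, 1.0
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -1.0, 0.5])
Y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
C_inv = np.eye(d)  # prior precision is alpha * C_inv

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_f(beta):
    # gradient of f(beta) = ell_n(beta) + alpha * beta' C^{-1} beta / 2
    return X.T @ (sigmoid(X @ beta) - Y) + alpha * (C_inv @ beta)

def hess_f(beta):
    p = sigmoid(X @ beta)
    return X.T @ (X * (p * (1.0 - p))[:, None]) + alpha * C_inv

# Newton's method for the posterior mode beta*
beta = np.zeros(d)
for _ in range(25):
    beta = beta - np.linalg.solve(hess_f(beta), grad_f(beta))
beta_star = beta

# centered Gaussian proposal N(beta*, alpha^{-1} C)
C = np.linalg.inv(C_inv)
chol = np.linalg.cholesky(C / alpha)
def sample_proposal():
    return beta_star + chol @ rng.normal(size=d)
```

A draw from `sample_proposal` can then be accepted or rejected with the usual MHI ratio; the mode-finding step is done once, before the chain starts.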
5.2. Convergence complexity analysis in binary response regression
For the MHI algorithm, we continue to use a centered proposal $\text{N}_{d}(\beta^*, \alpha^{-1} C)$ and first consider a more general posterior density of the form $\pi(\beta \mid X, Y) \propto \exp\!({-}\ell_n(\beta) - \alpha \beta^\top C^{-1} \beta/ 2)$ depending on data X, Y of size n. We also assume that the trace of the prior covariance matrix converges to a finite limit, i.e. $\mathrm{tr}(C) \to s_0 \in (0, +\infty)$ as $d \to +\infty$ . Note that a finite trace of the covariance is a necessary condition for a Gaussian distribution to exist on an infinite-dimensional Hilbert space [Reference Bogachev3].
Theorem 4. Suppose that the following conditions hold for a sequence $d_n, n \to \infty$ :
(i) The negative log-likelihood $\ell_n$ is a twice continuously differentiable convex function.
(ii) There is a universal constant $H_0 \in (0, +\infty)$ such that the largest eigenvalue of the Hessian of the negative log-likelihood $H_{\ell_n}$ satisfies, for every $\beta \in \mathbb{R}^d$ , $\limsup_{d_n, n \to \infty} \lambda_\mathrm{max}(H_{\ell_n}(\beta)) \le H_0$ .
Then $\limsup_{d_n, n \to \infty}\mathcal{W}_{\rho}(P^t(\beta^*, \cdot), \Pi(\cdot \mid X, Y)) \le M_0 (1 - \exp\!({-}H_0 s_0 / (2 \alpha)))^t$ , where $M_0 = \limsup_{d_n, n \to \infty} \int \rho(\beta, \beta^*) \,\mathrm{d}\Pi(\beta \mid X, Y)$ .
Proof. Define the function $f \,:\, \mathbb{R}^d \to \mathbb{R}$ by $f(\beta) = \ell_n(\beta) + \alpha \beta^\top C^{-1} \beta / 2$ , where $\ell_n$ is the negative log-likelihood loss function, and define $Z_Q = (2 \pi)^{d/2} \det\!(\alpha^{-1} C)^{1/2}$ . We first lower bound the quantity $\exp\!(f(\beta^*)) Z_{\Pi( \cdot \mid X, Y)} / Z_{Q}$ . For $\varepsilon \in (0, 1)$ and sufficiently large $d_n, n$ , we have, for any $\beta \in \mathbb{R}^d$ and any $v \in \mathbb{R}^d$ , $v^\top H_{\ell_n}(\beta)v \le (1 + \varepsilon)H_0\left\lVert v\right\rVert_2^2$ . This implies that, for any $\beta \in \mathbb{R}^d$ and any $v \in \mathbb{R}^d$ , the Hessian of f, denoted by $H_{f}$ , satisfies $v^\top H_{f}(\beta) v \le v^\top ((1 + \varepsilon) H_0 I_d + \alpha C^{-1}) v$ . Since the function $\ell_n$ is twice continuously differentiable, f is also twice continuously differentiable. Since both the gradient $\nabla f$ and Hessian $H_{f}$ are continuous and $\nabla f(\beta^*) = 0$ , we use a Taylor expansion to obtain the upper bound
We then have a lower bound on the normalizing constant of the target posterior,
This in turn yields a lower bound on the ratio
The matrix C is symmetric and positive-definite, so its eigenvalues exist and are positive. Let $(\lambda_i(C))_{i = 1}^d$ be the eigenvalues of C. It is readily verified that the eigenvalues of the matrix $(1 + \varepsilon) H_0 I_d + \alpha C^{-1}$ exist and are $((1 + \varepsilon)H_0 + {\alpha}/{\lambda_i(C)})_{i = 1}^d$ . Then
We have the basic inequality $\log\!(x + 1) \le x$ for any $x \in [0, +\infty)$ . Since the eigenvalues of C are positive and $H_0$ is nonnegative, we have the upper bound
This yields a lower bound on (6). Define the doubly-indexed sequence $(a_{d, n})$ by
We have then shown that
By our assumption, $\mathrm{tr}(C) \to s_0$ as $d \to \infty$ . That is to say, $\lim_{d \to +\infty} \sum_{i = 1}^d \lambda_i(C) = s_0$ . This implies, by using continuity,
By Corollary 1, we have the upper bound on the Wasserstein distance for each $d_n$ and each n:
Suppose that $\limsup_{d_n, n \to \infty}\int_{\mathbb{R}^d}\rho(\beta,\beta^*)\,\mathrm{d}\Pi(\beta\mid X,Y)<\infty$ . Using properties of the limit superior,
This holds for every $\varepsilon \in (0, 1)$ , so taking the limit completes the proof in this case. The other case, when $\limsup_{d_n, n \to \infty}\int_{\mathbb{R}^d}\rho(\beta, \beta^*)\,\mathrm{d}\Pi(\beta\mid X, Y) = +\infty$ , is trivial.
Our goal now is to obtain an upper bound on the rate of convergence established in Corollary 1 in high dimensions for binary response regression. In this context, it is natural to treat the $(Y_i, X_i)_{i = 1}^n$ as stochastic; each time the sample size increases, the additional observation is randomly generated. Specifically, we assume that $(Y_i, X_i)_{i = 1}^n$ are independent with $Y_{i} \mid X_{i}, \beta \sim \text{Bernoulli}(S(\beta^\top X_i))$ and $X_i \sim N_{d}(0, \sigma^2 n^{-1} I_d)$ with $\sigma^2 \in (0, +\infty)$ . This scaling ensures the variance of the columns of the random matrix of features X is of fixed order; it is often used to ensure nondegeneracy in large-system limits of d, n [Reference Geman12]. Similar scaling assumptions on the data are used for high-dimensional maximum-likelihood theory in logistic regression [Reference Sur and Candès47].
Corollary 2. Suppose the following conditions hold:
(i) The negative log-likelihood $\ell_n$ is a twice continuously differentiable convex function.
(ii) There is a universal constant $r_0 \in (0, +\infty)$ such that the largest eigenvalue of the Hessian of the negative log-likelihood $H_{\ell_n}$ satisfies, for every $\beta \in \mathbb{R}^d$ , $\lambda_\mathrm{max}(H_{\ell_n}(\beta)) \le r_0 \lambda_\mathrm{max}(X^\top X)$ .
Let $a_0 = r_0 (1 + \gamma^{1/2})^2 \sigma^2 s_0 / (2 \alpha)$ . If $d, n \to +\infty$ in such a way that $d/n \to \gamma \in (0, +\infty)$ , then, almost surely, $\limsup_{d/n \to \gamma}\mathcal{W}_{\rho}(P^t(\beta^*, \cdot), \Pi(\cdot \mid X, Y)) \le M_0 (1 - \exp\!({-}a_0))^t$ , where $M_0 = \limsup_{d/n \to \gamma} \int \rho(\beta, \beta^*) \,\mathrm{d}\Pi( \beta \mid X, Y)$ .
Proof. Under our assumption, we may write the matrix $X = n^{-1/2} G$ , where G is a matrix with i.i.d. Gaussian entries with mean 0 and variance $\sigma^2$ . Denote the largest eigenvalue of the matrix $X^\top X$ by $\lambda_\mathrm{max}(X^\top X)$ . Then, as $d, n \to \infty$ in such a way that $d/n \to \gamma \in (0, +\infty)$ , $\lambda_\mathrm{max}(X^\top X) \to (1 + \gamma^{1/2})^2 \sigma^2$
almost surely [Reference Geman12, Theorem 1]. We have, for any $\beta \in \mathbb{R}^d$ and any $v \in \mathbb{R}^d$ , $v^\top H_{\ell_n}(\beta) v \le r_0 \lambda_\mathrm{max}(X^\top X) \left\lVert v\right\rVert_2^2$ . The proof follows from Theorem 4.
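The almost-sure limit from [Reference Geman12] used in this proof is easy to check numerically. This is our own sketch with illustrative values of n, $\gamma$, and $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, sigma2 = 1000, 0.5, 1.0
d = int(gamma * n)

# Rows X_i ~ N(0, sigma^2 / n I_d), i.e. X = G / sqrt(n) with Gaussian G
X = rng.standard_normal((n, d)) * np.sqrt(sigma2 / n)

# Largest eigenvalue of X^T X via the top singular value of X
lam_max = np.linalg.svd(X, compute_uv=False)[0] ** 2
limit = (1 + np.sqrt(gamma)) ** 2 * sigma2   # Geman's almost-sure limit
```

Already at n = 1000 the realized `lam_max` sits within a few percent of the limit.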
Corollary 2 applies to both Bayesian logistic and probit regression. For logistic regression, $\ell_n$ is a twice continuously differentiable convex function and we may choose $r_0 = 4^{-1}$ . Similarly, for probit regression, $\ell_n$ is also a twice continuously differentiable convex function and we may choose $r_0 = 1$ [Reference Demidenko6].
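The choice $r_0 = 4^{-1}$ for logistic regression reflects the fact that the Hessian of $\ell_n$ is $X^\top D X$ with $D = \mathrm{diag}(p_i(1 - p_i))$ and $p_i(1 - p_i) \le 1/4$. A quick numerical check (our own sketch, with random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d)) / np.sqrt(n)
beta = rng.standard_normal(d)

p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # logistic probabilities S(beta^T X_i)
w = p * (1.0 - p)                        # each weight is at most 1/4

# Hessian of the logistic negative log-likelihood: X^T diag(w) X
H = (X * w[:, None]).T @ X
lam_H = np.linalg.eigvalsh(H).max()
bound = 0.25 * np.linalg.eigvalsh(X.T @ X).max()
```

Since $D \preceq 4^{-1} I_n$, we have $X^\top D X \preceq 4^{-1} X^\top X$, so `lam_H` never exceeds `bound`.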
In Figure 1 we plot $(1 - \exp\!({-}a_0))^t$ , the limiting decrease in the Wasserstein distance according to our upper bound, for varying values of the limiting ratio $\gamma$ with the other quantities in $a_0$ held fixed. We observe that as this ratio increases, the number of iterations needed to reach approximate convergence can increase rather rapidly.
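The qualitative behavior shown in Figure 1 can be reproduced directly from the bound of Corollary 2. The helper below (names, default parameter values, and the tolerance are ours, chosen only for illustration) counts the iterations t needed for $(1 - e^{-a_0})^t$ to fall below a tolerance:

```python
import numpy as np

def iters_to_tol(gamma, r0=0.25, sigma2=1.0, s0=1.0, alpha=1.0, tol=1e-3):
    """Smallest t with (1 - exp(-a0))^t <= tol, with a0 as in Corollary 2."""
    a0 = r0 * (1 + np.sqrt(gamma)) ** 2 * sigma2 * s0 / (2 * alpha)
    rate = 1.0 - np.exp(-a0)             # limiting per-iteration factor
    return int(np.ceil(np.log(tol) / np.log(rate)))
```

Increasing $\gamma$ drives $a_0$ up, pushes the per-iteration factor toward 1, and so inflates the iteration count, consistent with the figure.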
6. Final remarks
We have studied the exact convergence behavior of the MHI sampler across general Wasserstein distances. We showed upper and lower bounds on the worst-case convergence rate for Wasserstein distances weaker than the total variation distance. We showed that the exact convergence rate at every initialization for such Wasserstein distances is the same and matches the total variation convergence rate [Reference Wang52]. When starting at a certain point, we gave exact convergence expressions. By centering an independent proposal, we directly imbue the Markov chain with a strong attraction to a set where the target distribution has high probability. We showed that this technique can provide uniform control over the acceptance probability, yielding exact convergence rates in Bayesian quantile regression. The centered MHI sampler turns out to have many applications for posteriors that arise in Bayesian generalized linear models, where more sophisticated proposals are often used. With additional assumptions on the data and prior, we also showed that this exact convergence rate may be upper bounded when sampling high-dimensional posteriors in Bayesian binary response regression.
Appendix A. Proof of Theorem 1
The proof will proceed by establishing the upper and lower bounds separately in Lemmas 1 and 2, respectively. This is done largely because the conditions for the upper bound are weaker than those for the lower bound.
The following definitions will be used in the proofs of Lemmas 1 and 2. First, for $\theta \in \Theta$ , real-valued measurable functions f, and a Markov kernel K, we use the notation $K^{t} f(\theta) = \int f \,\mathrm{d} K^t(\theta, \cdot) = \int f(\theta^{\prime}) K^t(\theta, \mathrm{d}\theta^{\prime})$ and $K^{0} f(\theta) = f(\theta)$ . Second, recall that, for functions $\varphi \,:\, \mathbb{R}^d \to \mathbb{R}$ , $\left\lVert\varphi\right\rVert_{\text{Lip}(\rho)} = \sup_{x, y, x \not= y} \{ |\varphi(x) - \varphi(y)| / \rho(x, y)\}$ .
Lemma 1. Let $\varepsilon^* = \inf_{\theta \in \Theta} \{ q(\theta)/\pi(\theta) \}$ . Then
Proof. Let $\theta \in \Theta$ and let $\varphi$ satisfy $\left\lVert\varphi\right\rVert_{\text{Lip}(\rho)} \le 1$ . The definition of $\varepsilon^*$ implies the minorization condition $P(\theta, \cdot) \ge \varepsilon^* \Pi({\cdot})$ [Reference Tierney48, Corollary 4] which, in turn, ensures that the residual kernel $R(\theta, \cdot) = [P(\theta, \cdot) - \varepsilon^* \Pi({\cdot})] / (1 - \varepsilon^*)$ is a Markov kernel with invariant distribution $\Pi$ . It then follows that
Since $\varphi$ is Lipschitz with respect to $\rho$ , we then have
Taking the supremum with respect to $\varphi$ and using the Kantorovich–Rubinstein theorem [Reference Villani50, Theorem 1.14],
We now turn our attention to establishing the lower bound.
Lemma 2. Let $\varepsilon^* = \inf_{\theta \in \Theta}\{q(\theta)/\pi(\theta)\}$ . Suppose q is lower semicontinuous and $\pi$ is upper semicontinuous on $\Theta$ . Suppose $\Theta$ can be expressed as a countable union of compact sets. If $\rho(\cdot, \cdot) \le 1$ , then $\sup_{\theta \in \Theta}\mathcal{W}_\rho(P^t(\theta, \cdot), \Pi) \ge (1 - \varepsilon^*)^t \inf_{\theta \in \Theta} \int \rho(\cdot, \theta) \,\mathrm{d}\Pi$ .
Proof. Since $\Theta$ can be expressed as a countable union of compact sets, there is a sequence of compact sets $B_n \subseteq B_{n + 1} \subseteq \Theta$ increasing to $\Theta = \cup_{n = 1}^{\infty} B_n$ . We may assume $\Pi(B_n) > 0$ for every n, discarding finitely many initial sets if necessary. Since $\pi, q > 0$ , q is lower semicontinuous, and $\pi$ is upper semicontinuous on $\Theta$ , the ratio $q/\pi$ is lower semicontinuous on $\Theta$ . By Lemma 3, $\inf_{\theta \in B_n} \{ q(\theta)/\pi(\theta) \}$ is monotonically nonincreasing to $\varepsilon^* = \inf_{\theta \in \Theta} \{ q(\theta)/\pi(\theta) \}$ . By lower semicontinuity, $\inf_{\theta \in K} \{ q(\theta)/\pi(\theta) \}$ is attained over any compact set $K \subseteq \Theta$ . Then define the sequence
We can then define the sequence $\varepsilon_{\theta^*_n} = \inf_{\theta \in B_n} \{ q(\theta)/\pi(\theta) \} = q(\theta^*_n)/\pi(\theta^*_n)$ , and this is monotonically nonincreasing to $\varepsilon^*$ .
Define $P_n$ to be the Metropolis–Hastings independence kernel with independent proposal Q with density q, and target distribution $\Pi(\cdot \mid B_n)$ with density $\pi(\cdot \mid B_n) = \pi({\cdot}) I_{B_n}({\cdot}) / \Pi(B_n)$ . By construction, $\Pi(B_n) > 0$ and this is well-defined. The key observation is that, started from any $\theta_n \in B_n$ , the kernels $P_n$ and P disagree only outside of $B_n$ . For $\theta_n \in B_n$ , we have $\pi(\theta_n) > 0$ , $I_{B_n}(\theta_n) = 1$ , and, since $\Theta \equiv \text{supp}(q)$ by assumption, $q(\theta_n) > 0$ . Also, if $y \in B_n^\mathrm{c} \cap \Theta$ , then
Let $M_1(\mathbb{R}^d)$ be the set of measurable functions $\varphi \,:\, \mathbb{R}^d \to \mathbb{R}$ with $\sup_{x \in \mathbb{R}^d} |\varphi(x)| \le 1$ . Therefore, for any $\theta_n \in B_n$ and any function $\varphi \in M_1(\mathbb{R}^d)$ ,
Let $\varepsilon \in (0, 1 - \varepsilon^*)$ . Since Q and $\Pi$ are probability measures, we may then choose $n_{\varepsilon}$ sufficiently large that, for all $n \ge n_{\varepsilon}$ , $2\max\{\Pi(B_n^\mathrm{c}), Q(B_n^\mathrm{c})\} \le \varepsilon/2$ . We then have
Similarly,
With $\theta^*_n$ as in (7), let $\psi_n({\cdot}) = -\rho(\cdot, \theta_n^*)$ . Then, for any $x, y \in \mathbb{R}^d$ ,
and $\psi_n \in M_1(\mathbb{R}^d)$ . Since $\Pi$ is invariant for the kernel P,
Now, for any integer s with $1 \le s \le t$ , the function $P^{s} \psi_n \in M_1(\mathbb{R}^d)$ since P is a Markov kernel. Since $\theta^*_n \in B_n$ and $\pi(\theta^*_n) > 0$ , using (8), (9), and (11),
By the construction of $\theta^*_n$ in (7), we have
For measurable $A \subset \mathbb{R}^d$ [Reference Wang52, Remark 1, Theorem 2], we then have the identity
Since $P^{t-1} \psi_n$ is a bounded measurable function, this identity leads to the following one:
Using (12) in the first inequality, (13) in the second inequality, (9) in the third inequality, and the invariance of $\Pi$ for the Markov kernel P in the last inequality,
Applying this inequality recursively and using the definition of $\psi_n$ ,
Since $\Pi(B_n) \to 1$ and $\varepsilon_{\theta^*_n} \to \varepsilon^*$ , we may take n large enough that $|\varepsilon_{\theta^*_n} \Pi(B_n) - \varepsilon^*| \le \varepsilon$ . For all large enough n and since $\varepsilon < 1 - \varepsilon^*$ , we lower bound (14) to get
Combining (10) and (15), we lower bound the Wasserstein distance with
Since this holds for all small $\varepsilon$ , the proof is complete by taking the limit as $\varepsilon \downarrow 0$ .
Appendix B. Technical lemmas
Lemma 3. Let $\varepsilon^* = \inf_{\theta \in \Theta} \{ q(\theta)/\pi(\theta) \}$ . Suppose there is a sequence of compact sets $B_n \subseteq B_{n + 1} \subseteq \Theta$ increasing to $\Theta = \cup_{n = 1}^{\infty} B_n$ . Define $\varepsilon_n = \inf_{\theta \in B_n} q(\theta)/\pi(\theta)$ . Then $\varepsilon_n$ is monotonically nonincreasing to its limit $\varepsilon^*$ .
Proof. By the definition of infimum, $\varepsilon_n \ge \varepsilon_{n + 1}$ and $\varepsilon_n \ge \varepsilon^*$ . Hence, the sequence $\varepsilon_n$ converges. Let $\delta \in (0, \infty)$ . By the definition of the infimum, we can choose $\theta_\delta \in \Theta$ with $\pi(\theta_\delta) > 0$ such that $q(\theta_\delta)/\pi(\theta_\delta) - \delta \le \varepsilon^*$ . We can choose $B_{n_\delta}$ such that $\theta_\delta \in B_{n_\delta}$ . Then $\varepsilon_{n_\delta} - \delta \le q(\theta_\delta)/\pi(\theta_\delta) - \delta \le \varepsilon^*$ . It follows that, for any $n \ge n_\delta$ , $|\varepsilon_{n} - \varepsilon^*| = \varepsilon_{n} - \varepsilon^* \le \varepsilon_{n_\delta} - \varepsilon^* \le \delta$ . Therefore, $\lim_n \varepsilon_{n} = \varepsilon^*$ .
Lemma 4. Let $C \in \mathbb{R}^{d \times d}$ be a positive-definite, symmetric matrix, and let $\alpha \in (0, \infty)$ . Let $f \,:\, \mathbb{R}^d \to \mathbb{R}$ and suppose $\theta \mapsto f(\theta) - \alpha \theta^\top C^{-1} \theta / 2$ is convex on $\mathbb{R}^d$ . Then there exists $\theta^* = \mathrm{argmin}_{\theta \in \mathbb{R}^d} f(\theta)$ and
Proof. Since the function $f(\theta) - \alpha \theta^\top C^{-1} \theta/2$ is convex on $\mathbb{R}^d$ , it follows that, for any $\lambda \in [0, 1]$ and any $(\theta, \theta^{\prime}) \in \mathbb{R}^d \times \mathbb{R}^d$ ,
Since $C^{-1}$ is positive-definite, $\alpha \lambda (1 - \lambda)(\theta^{\prime} - \theta)^\top C^{-1} (\theta^{\prime} - \theta)/2$ is nonnegative and this implies that f is a convex function. It can also be shown that $\lim_{\left\lVert\theta\right\rVert \to +\infty} f(\theta) = +\infty$ and, since f is lower semicontinuous, f attains its minimum $\theta^* \in \mathbb{R}^d$ . The right directional derivative $f^{\prime}(\theta^*; \theta) = \lim_{t \downarrow 0} t^{-1} [f(\theta^* + t \theta) - f(\theta^*)]$ exists for all points $\theta \in \mathbb{R}^d$ [Reference Nesterov32, Theorem 3.1.12]. For $\lambda \in (0, 1)$ ,
Taking the limit with $\lambda \downarrow 0$ , we have
Since $\theta^*$ is the minimum of f, then the right directional derivative satisfies $f^{\prime}(\theta^*; \theta - \theta^*) \ge 0$ for all $\theta \in \mathbb{R}^d$ . Therefore, for all $\theta \in \mathbb{R}^d$ ,
Acknowledgements
We would like to thank the associate editor and referees for their helpful comments that improved the presentation of this article. We would also like to thank Qian Qin and Dootika Vats for their helpful comments, which improved an earlier draft of this article.
Funding information
Jones was partially supported by NSF grant DMS-2152746.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.