Moderate deviations inequalities for Gaussian process regression

Jialin Li; Ilya O. Ryzhov

doi:10.1017/jpr.2023.30

Moderate deviations inequalities for Gaussian process regression

Part of: Design of experiments Stochastic processes

Published online by Cambridge University Press: 05 June 2023

Jialin Li and

Ilya O. Ryzhov

Show author details

Jialin Li*: Affiliation:
University of Toronto
Ilya O. Ryzhov*: Affiliation:
University of Maryland
*: *Postal address: Rotman School of Management, University of Toronto, Ontario, Canada. Email address: [email protected]
**Postal address: Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA. Email address: [email protected]

Article contents

Abstract
Introduction
Gaussian process regression and approximation theory
Large deviations for a fixed pair of points
Extensions and special cases
Applications: pairwise comparisons and estimation error
General large deviations inequality
Conclusion
Funding information
Competing interests
References

Rights & Permissions

Abstract

Gaussian process regression is widely used to model an unknown function on a continuous domain by interpolating a discrete set of observed design points. We develop a theoretical framework for proving new moderate deviations inequalities on different types of error probabilities that arise in GP regression. Two specific examples of broad interest are the probability of falsely ordering pairs of points (incorrectly estimating one point as being better than another) and the tail probability of the estimation error at an arbitrary point. Our inequalities connect these probabilities to the mesh norm, which measures how well the design points fill the space.

Keywords

Gaussian process regression moderate deviations interpolation theory

MSC classification

Primary: 60G15: Gaussian processes

Secondary: 62K99: None of the above, but in this section

Type: Original Article
Information: Journal of Applied Probability , Volume 61 , Issue 1 , March 2024 , pp. 172 - 197

DOI: https://doi.org/10.1017/jpr.2023.30 [Opens in a new window]
Copyright: © The Author(s), 2023. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

Given a compact domain $D\subseteq\mathbb{R}^d$ , let $\{\mathcal{E}(x)\}_{x\in D}$ be a centered Gaussian process on a probability space $(\Omega,\mathcal{F},P)$ . Define

(1.1)

\begin{equation}f(x) = m(x) + \mathcal{E}(x), \quad x\in D,\end{equation}

where $m\,:\, D\rightarrow\mathbb{R}$ is a pre-specified ‘mean function’. Suppose that we are given the values $f(x_1),\ldots,f(x_n)$ of f at the design points $x_1,\ldots,x_n\in D$ . Then we can construct an estimator $\hat{f}_n$ of f using Gaussian process regression [Reference Rasmussen and Williams25]. This is a Bayesian method: for each x, $\hat{f}_n(x)$ is the conditional mean of the random variable f(x) given $f(x_1),\ldots,f(x_n)$ . The covariance function of the Gaussian process is used to infer the value of f at unobserved x from information collected about the design points.

Gaussian process regression is widely used to interpolate and predict the values of black-box functions in simulation calibration [Reference Scott, Powell and Simão26] and optimization [Reference Ankenman, Nelson and Staum2, Reference Jones, Schonlau and Welch15], biomedical applications [Reference Lee, Mortazavi, Hoffman, Lu, Li, Paak, Garst, Razaghy, Espinal, Park, Lu and Sarrafzadeh18], risk assessment of civil infrastructure [Reference Sheibani and Ou27], tuning of machine learning models [Reference Snoek, Larochelle and Adams28], and many other problems from diverse branches of science. In all such applications, f models the output of a complex system (physical or virtual), with x being the input. There is no closed form for f, but it is possible to observe f(x) at individual x values, e.g. by running expensive lab, field, or computer experiments with those particular inputs. The goal is to obtain accurate estimates at unobserved values using as few experiments as possible. Often, the function f represents a performance metric, such as the predictive power of a machine learning model with a given set of parameters, and the goal then becomes to optimize f(x) over $x\in D$ .

The analysis of this paper is motivated by concerns that arise in design of experiments, though we do not explicitly model any design problem. Our main contribution is a theoretical framework for studying the moderate deviations behavior of random vectors of the form $\bigl(\,\hat{f}_n(x),\hat{f}_n(x^*),f(x),f(x^*)\bigr)$ for two fixed but arbitrary points $x,x^*\in D$ . This framework can be used to prove new convergence rates for different types of ‘error probabilities’ related to GP regression. We demonstrate the usefulness of the theory with two specific applications, though others may be possible. The first application deals with probabilities of the form

(1.2)

\begin{equation}\pi_n(x,x^*) = P\bigl(\,\hat{f}_n(x) \leq \hat{f}_n(x^*) - \delta\,|\,f(x) \geq f(x^*)\bigr),\end{equation}

where $\delta \gt 0$ is a small threshold. In words, it is given to us that $x^*$ has a smaller function value than x, but interpolation error may cause us to falsely reverse this ordering (the threshold $\delta$ makes (1.2) well-defined). When f is an objective function, this is the probability of reporting x as being ‘better’ than $x^*$ when in reality the opposite is the case. For this type of error probability, we leverage our theory to prove a new moderate deviations inequality

(1.3)

\begin{equation}\pi_n(x,x^*) \leq C_1\exp\bigl({-}\delta^2 C_2 h^{-{{s}/{2}} }_n\bigr),\end{equation}

where $C_1,C_2,s\gt 0$ are constants depending on the specification of the Gaussian process, and

\begin{equation*}h_n = \max_{y\in D} \min_{m=1,\ldots,n} \|y-x_m\|_2\end{equation*}

is the mesh norm measuring the density of the design points. In a special case where the design points are uniformly distributed on D, it has been shown [Reference Janson13] that $h_n$ is of order $({{\log n}/{n}})^{{{1}/{d}}}$ , which justifies the interpretation of (1.3) as a moderate deviations rate.

The second application deals with the estimation error $|\,\hat{f}_n(x)-f(x)|$ at arbitrary $x\in D$ . For this error, we prove the uniform bound

(1.4)

\begin{equation}\sup_{x\in D} P\bigl(|\,\hat{f}_n(x) - f(x)|\geq \delta\bigr) \leq C^{\prime}_1\exp\bigl({-}\delta^2 C^{\prime}_2 h^{-s}_n\bigr).\end{equation}

Both types of error probabilities are of broad interest in simulation, statistics, and uncertainty quantification. In particular, the pairwise comparison in (1.2) is motivated by the approach developed by Glynn and Juneja [Reference Glynn and Juneja12] for the ranking and selection problem, where one collects samples from a finite number of populations in an effort to select the one with the highest mean. The probability of correct selection can be related to the probability of false ordering between pairs of populations. The quantity $\pi_n(x,x^*)$ is the analog of this concept in the GP regression setting, with the additional complication that we are using a Bayesian model of f, so the event in (1.2) can only be viewed as an error conditionally given $f(x) \geq f(x^*)$ .

Interestingly, the mesh norm is in use as a criterion for design of experiments, in the literature on so-called space-filling designs [Reference Joseph, Gul and Ba16, Reference Pronzato and Müller24]. As early as Johnson et al. in 1990 [Reference Johnson, Moore and Ylvisaker14], statisticians proposed spreading out the design points in D in a way that essentially minimizes the mesh norm. From (1.3), we can see that this has the effect of speeding up the rate at which (1.2) converges to zero, uniformly over all $x,x^*$ . Essentially, if we have no specific $x^*$ to serve as the reference solution, we can view space-filling designs as a way to minimize the probability of false ordering across all possible $x^*$ .

The available theory for Gaussian process regression has extensively studied (pointwise) consistency; see e.g. [Reference Bect, Bachoc and Ginsbourger4] or [Reference Ghosal and Roy11]. The rate at which $\hat{f}_n$ converges to f has been studied, e.g. in [Reference Teckentrup30] and [Reference Wang, Tuo and Wu34], but these papers do not consider tail probabilities, and so their rates have completely different orders to (1.3)–(1.4), though their analysis also makes some connections to the mesh norm. A different, less directly related stream of literature focuses on online optimization problems where the goal is to maximize the sum of the function values of the design points; a representative example of this type of work is that of Srinivas et al. [Reference Srinivas, Krause, Kakade and Seeger29]. In general, many of the existing rate results are derived for specific classes of kernels, such as squared exponential [Reference Pati, Bhattacharya and Cheng23] and Matérn [Reference Teckentrup30], or specific choices of the design points [Reference Bull6]. Probabilists have investigated tail probabilities [Reference Adler1, Reference Ghosal and Roy11] and excursion probabilities [Reference Chan and Lai7, Reference Cheng and Xiao8] of Gaussian processes, and some of these results also have the form of moderate deviations laws, but they pertain to generic GPs rather than the GP regression mechanism. Analogously, moderate deviations laws are available for sums of random variables [Reference Beknazaryan, Sang and Xiao5] and random PDE models [Reference Li, Liu, Lu and Zhou20], but these results do not pertain to GP regression either.

To our knowledge, this paper presents the first moderate deviations results for Gaussian process regression estimators. It is well known [Reference Dembo and Zeitouni10] that sample averages of independent and identically distributed (i.i.d.) Gaussian observations satisfy large deviations laws. Similar laws hold for ordinary least-squares estimators under Gaussian residuals [Reference Zhou and Ryzhov37], extrema of Gaussian vectors [Reference van der Hofstad and Honnappa32], and various finite-dimensional statistical estimators [Reference Arcones3]. Gaussian process regression can be viewed as an infinite-dimensional generalization of linear regression, but the analysis is made much more complicated because, essentially, the dimensionality of the objects used to construct the estimator grows over time, and their asymptotic behavior heavily depends on the covariance kernel. One could perhaps recover large deviations laws for certain specific choices of the kernel and design, but it is far from clear whether this is possible in general. In the process of proving our results, we also establish a modified version of the Gärtner–Ellis theorem [Reference Dembo and Zeitouni10], which may be of stand-alone interest.

Section 2 describes the GP regression framework, states the assumptions used throughout the paper, and gives important technical preliminaries. Section 3 gives the bulk of our analysis, which relies on a general large deviations law for random vectors. This latter result also requires some new technical developments, but since they are unrelated to GP regression, they are deferred to Section 6 for readability. Section 4 handles several extensions not covered by the main theorem. Section 5 uses our analysis to derive (1.3) and (1.4), and presents several more explicit examples. Section 7 concludes the paper.

2. Gaussian process regression and approximation theory

Before we begin the analysis that leads to (1.3)–(1.4), it is necessary to understand the definition, construction, and properties of the GP regression estimator $\hat{f}_n$ . The computation of the estimator (a form of Bayesian updating) is described in Section 2.1. Our analysis also makes use of an alternative interpretation, originating from approximation theory, of the GP regression model. We use this theory to study the asymptotic behavior of the posterior covariance function, which is crucial to the tail probabilities of $\hat{f}_n$ . The relevant technical preliminaries are given in Section 2.2.

2.1. Definitions and assumptions

Recalling the model in (1.1), we assume that the mean function m is Lipschitz-continuous, and the Gaussian process $\mathcal{E}$ is specified by

\begin{align*}\mathbb{E}(\mathcal{E}(x)) &= 0,\\ \operatorname{Cov}\!(\mathcal{E}(x),\mathcal{E}(x^{\prime})) &= k(x,x^{\prime})\end{align*}

for all $x,x^{\prime}\in D$ . In one application, we will assume that $k\,:\, D\times D\rightarrow\mathbb{R}$ is a fixed, symmetric kernel function mapping $D\times D$ into $\mathbb{R}_+$ . The kernel is required to be positive definite, meaning that for any n, any set of n distinct design points $\{x_m\}^n_{m=1}\subseteq D$ , and any vector $v = (v_1,\ldots,v_n)$ in $\mathbb{R}^n$ , we have

\begin{equation*}\sum_{m,m^{\prime}} v_mv_{m^{\prime}}k (x_m,x_{m^{\prime}}) \gt 0.\end{equation*}

Without this assumption, the Gaussian process would be degenerate.

In addition, we assume that there exists a function $\phi$ on $\mathbb{R}_+$ such that $k(x,x^{\prime}) = \phi(\|x-x^{\prime}\|)$ for all x, x ^′. Such a $\phi$ is called a radial basis function. We assume that $\phi$ is twice differentiable at zero with $\phi^{\prime\prime}(0)\lt 0$ . Many commonly used covariance kernels satisfy this requirement, including Gaussian, multiquadric, inverse quadratic, inverse multiquadric, and others.

Let $X_n = \{x_m\}^n_{m=1}$ denote the set of design points. We treat the design points as a deterministic sequence, as is standard in the literature on design of experiments, and assume that $X_n$ becomes dense in D as $n\rightarrow\infty$ , a common condition in the theoretical literature [Reference Vazquez and Bect33]. For convenience, we introduce the notation

\begin{align*}f(x_n) &= (\,f(x_1),\ldots,f(x_n))^\top,\\m(x_n) &= (m(x_1),\ldots,m(x_n))^\top,\\K(X_n,x) &= (k(x,x_1),\ldots,k(x,x_n))^\top,\end{align*}

as well as $K(x,X_n) = K(X_n,x)^\top$ . We also let $K(X_n,X_n)$ denote the matrix whose (m, m ^′)th entry is $k(x_m,x_{m^{\prime}})$ .

Given the design points $X_n$ and observations $f(x_n)$ , the posterior distribution of f(x), at any arbitrary $x\in D$ , is Gaussian with mean

\begin{equation*}\hat{f}_n(x) = K(x,X_n)K(X_n,X_n)^{-1}f(x_n)\end{equation*}

and variance

(2.1)

\begin{equation}P_{X_n}(x) = k(x,x) - K(x,X_n)K(X_n,X_n)^{-1}K(X_n,x).\end{equation}

This specific structure of the mean and variance is what is referred to by the name of Gaussian process regression. The variance $P_{X_n}(x)$ , viewed as a function of x, is also sometimes called the ‘power function’ in the literature on interpolation. In this paper we use the posterior mean $\hat{f}_n$ to interpolate the observed function values $f(X_n)$ over the design space D and make predictions at unobserved points.

We let $\mathcal{H}$ denote the reproducing kernel Hilbert space (RKHS) whose reproducing kernel is k. The construction and uniqueness of $\mathcal{H}$ are discussed by Wendland [Reference Wendland35]. For our purposes, it is sufficient to review the following properties. Letting $\langle \cdot,\cdot\rangle_{\mathcal{H}}$ be the inner product of $\mathcal{H}$ , we know that

(1) $k(\cdot,x)\in\mathcal{H}$ for all $x\in D$ ,
(2) $g(x) = \langle g,k(\cdot,x)\rangle_{\mathcal{H}}$ for all $g\in\mathcal{H}$ and $x\in D$ ,
(3) $k(x,x^{\prime}) = \langle k(\cdot,x),k(\cdot,x^{\prime})\rangle_{\mathcal{H}}$ for all $x,x^{\prime}\in D$ .

Additionally, from the usual properties of the inner product, we have the Cauchy–Schwarz inequality $|\langle g_1,g_2\rangle_{\mathcal{H}}| \leq \|g_1\|_{\mathcal{H}}\|g_2\|_{\mathcal{H}}$ , where $\|\cdot\|_{\mathcal{H}}$ is the norm induced by the inner product.

2.2. Approximation theory

With the assumptions made in Section 2.1, Gaussian process regression can be seen as a special case of radial basis function (RBF) interpolation, enabling us to make use of some results from interpolation theory. We should note, however, that this theory treats interpolation models as purely deterministic, and thus has assumptions and interpretations very different to GP regression. Below, we present key facts from the theory that will be important for our analysis, and discuss their applicability to our setting when necessary.

Like GP regression, RBF interpolation requires a kernel k with the properties described in Section 2.1, as well as a matrix $X_n$ describing n design points. Recall that under these assumptions we have $k(x,x^{\prime}) = \phi(\|x-x^{\prime}\|)$ . Let $\mathcal{L}_{k,X_n}$ denote the operator mapping some fixed function $g\,:\, D\rightarrow\mathbb{R}^d$ to its interpolant according to

(2.2)

\begin{equation}\mathcal{L}_{k,X_n}g(x) = \sum^n_{m=1} \alpha_m k(x,x_m),\end{equation}

where the coefficients $\alpha_m$ solve the linear system

(2.3)

\begin{equation}\sum^n_{m=1} \alpha_m k(x_m,x_{m^{\prime}}) = g(x_{m^{\prime}}),\quad m^{\prime}=1,\ldots,n.\end{equation}

In fact Wu and Schaback [Reference Wu and Schaback36] presented a more general form where (2.2)–(2.3) include additional polynomial functions, but this will not be necessary for our purposes. It can be shown that $\mathcal{L}_{k,X_n}g(x) = K(x,X_n)K(X_n,X_n)^{-1}g(x_n)$ , similar to the calculations used in GP regression.

Let $\tilde{g}$ be the Fourier transform of g, and suppose that the generalized Fourier transform (as defined in Section 8.2 of [Reference Wendland35]) of the function $x\mapsto \phi(\|x\|)$ exists and coincides with a continuous function $\tilde{\phi}$ on $\mathbb{R}^d\setminus \{0\}$ satisfying

(2.4)

\begin{equation}0 \lt \tilde{\phi}(x) \leq c_{\tilde{\phi}}\|x\|^{-d-s_{\infty}} \quad \mbox{ as } \|x\|\rightarrow\infty\end{equation}

for suitable constants $c_{\tilde{\phi}},s_{\infty} \gt 0$ . In particular, the constant $s_{\infty}$ will play a significant role in our analysis, and it is worth pointing out that this quantity is explicitly computable for a variety of commonly used kernels. For example, for Gaussian kernels $s_{\infty}$ can take on any arbitrarily large value (this special case is treated separately in Section 5.3), while for the Matérn kernel, Teckentrup [Reference Teckentrup30] showed that $s_{\infty}=2\sigma$ where $\sigma$ is the kernel smoothness parameter. Other examples are given in Section 8.3 of [Reference Wendland35].

We now define

\begin{equation*}c^2_{g,\phi} = \dfrac{1}{(2\pi)^d}\int_{\mathbb{R}^d} |\tilde{g}(x)|^2\tilde{\phi}(x)^{-1} \,{\mathrm{d}} x.\end{equation*}

The results below require $c^2_{g,\phi} \lt \infty$ , which essentially means that g resides in the RKHS whose reproducing kernel is k. Before stating these results, we should make it clear that we will not require $c^2_{f,\phi}\lt\infty$ , i.e. we will not apply the above definitions with f as the choice of g. Lukić and Beder [Reference Lukić and Beder21] showed that a sample from a GP prior is almost surely not in the RKHS induced by the kernel assumed in the prior. Therefore it is not possible for the function f to satisfy $c^2_{f,\phi}\lt\infty$ . This is a major difference between GP regression and interpolation theory, where f is modeled as a deterministic function and so the condition $c^2_{f,\phi}\lt\infty$ is seen as fairly innocuous (for example, it is assumed in [Reference Wu and Schaback36] and many other papers in pure interpolation theory, e.g. [Reference Li and Ryzhov19]). In the present work, however, we cannot make this assumption, and will instead apply this framework to other choices of g related to the kernel, for example the function $k(\cdot,x)$ for fixed x.

For any compact $E\subseteq D$ , let $h_n(E) = \max_{y\in E}\min_{m=1,\ldots,n}\|y-x_m\|_2$ be the mesh norm of E. We slightly abuse notation by using $h_n$ to denote $h_n(D)$ when the entire domain is considered. Let $B_{x,\rho}$ denote the closed ball of radius $\rho$ centered at $x\in\mathbb{R}^d$ . We can now state the results that will be referenced and applied throughout this paper.

Lemma 2.1. ([Reference Wu and Schaback36].) Fix $\rho \gt 0$ and assume that the kernel k satisfies (2.4) with some $s_{\infty}$ . Then there exist positive constants $\bar{h}$ and $c_P$ such that for any $X_n$ and any point $x\in\mathbb{R}^d$ with $h_n(B_{x,\rho}\cap D) \lt \bar{h}$ , the power function $P_{X_n}$ defined in (2.1) satisfies

\begin{equation*}P_{X_n}(x) \leq c_P (h_n(B_{x,\rho}\cap D))^{s_\infty}.\end{equation*}

Lemma 2.2. ([Reference Wu and Schaback36].) Fix g satisfying $c_{g,\phi}\lt\infty$ and assume that the kernel k satisfies (2.4) with some $s_{\infty}$ . Then, for any $X_n$ and any $x\in\mathbb{R}^d$ , we have

\begin{equation*}|g(x)-\mathcal{L}_{k,X_n}g(x)|^2 \leq c^2_{g,\phi}P_{X_n}(x).\end{equation*}

We note that in Lemma 2.1 the constant $c_P$ depends only on d and $s_{\infty}$ but not on the fixed value $\rho$ . The same is true of the ratio ${{\bar{h}}/{\rho}}$ , indicating that $\bar{h}$ is proportional to $\rho$ .

3. Large deviations for a fixed pair of points

We now fix $x,x^*\in D$ and focus on the sequence of random vectors

\[Z_n = \bigl(\,\hat{f}_n(x),\hat{f}_n(x^*),f(x),f(x^*)\bigr)^\top.\]

Letting $\mu_n$ be the probability law of $Z_n$ , we will derive the inequality of the form

(3.1)

\begin{equation}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log \mu_n(E) \leq -\inf_{u\in E} I(u),\end{equation}

which holds for any closed measurable set E and any sequence $\{a_n\}$ satisfying $\lim_{n\rightarrow\infty} a_n = \infty$ . Our end goal is to characterize the function I and specify $a_n$ and E in a manner that causes (3.1) to yield results such as (1.3).

Inequalities of the form (3.1) can be obtained by invoking the Gärtner–Ellis theorem from large deviations theory [Reference Dembo and Zeitouni10]. In general, the derivation of the function I consists of two main steps. The first step is to derive the scaled limit

(3.2)

\begin{equation}\Psi(\gamma) = \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\Psi_n(a_n\gamma),\end{equation}

where $\Psi_n(\gamma) = \log\mathbb{E}_{\mu_n}\bigl({\mathrm{e}}^{\langle \gamma,Z_n\rangle}\bigr)$ denotes the cumulant-generating function of $Z_n$ , and $\langle\cdot,\cdot\rangle$ is the usual $L^2$ inner product on $\mathbb{R}^p$ . The scaling $\Psi$ is allowed to take values on the extended number line (i.e. may be $+\infty$ for some $\gamma$ ). The second step obtains I via the Fenchel–Legendre transform

(3.3)

\begin{equation}I(u) = \sup_{\gamma\in\mathbb{R}^p} \langle\gamma,u\rangle - \Psi(\gamma)\end{equation}

of $\Psi$ . Inequality (3.1) then follows under certain technical conditions on $\Psi$ .

Our analysis in this section follows this outline. Section 3.1 studies the cumulant-generating functions of $\{Z_n\}$ and characterizes $\Psi$ . Section 3.2 then studies the Fenchel–Legendre transform of $\Psi$ , and Section 3.3 analyzes the convergence rates of various terms in the Fenchel–Legendre transform, which helps us to select $\{a_n\}$ in a way that makes (3.1) yield an informative inequality (stated in Section 3.4). In Section 5 we will instantiate this inequality with certain choices of E that yield (1.3)–(1.4).

One complication that arises in this procedure is that the technical conditions of the classical Gärtner–Ellis theorem do not hold in our setting, so additional analysis is required to obtain (3.1). This analysis is carried out in the setting of general random vectors, and thus is somewhat tangential to the setting of Gaussian process regression. For this reason we defer it to Section 6 at the end of the paper; the core result of that section is Theorem 6.1, which recovers the desired inequality. Here we take (3.1) as given, referring readers to Theorem 6.1 for the proof, and focus on applying this inequality to the specific sequence $\{Z_n\}$ .

3.1. Analysis of cumulant-generating functions

We write $Z_n$ as

\begin{equation*}\begin{bmatrix}\hat{f}_n(x)\\[4pt] \hat{f}_n(x^*)\\[4pt] f(x)\\[4pt] f(x^*)\end{bmatrix} =\begin{bmatrix}K(x,X_n)K(X_n,X_n)^{-1}f(x_n)\\[4pt] K(x^*,X_n)K(X_n,X_n)^{-1}f(x_n)\\[4pt] f(x)\\[4pt] f(x^*)\end{bmatrix} =\begin{bmatrix}K(x,X_n)K(X_n,X_n)^{-1} & \quad 0 & \quad 0\\[4pt] K(x^*,X_n)K(X_n,X_n)^{-1} & \quad 0 & \quad 0\\[4pt] 0 & \quad 1 & \quad 0\\[4pt] 0 & \quad 0 & \quad 1\end{bmatrix}f(\bar{X}_n),\end{equation*}

where $\bar{X}_n = X_n\cup\{x,x^*\}$ . The distribution of $Z_n$ is Gaussian with mean vector $A_nm(\bar{X}_n)$ and covariance matrix $V_n = A_n\Sigma_nA^\top_n$ , where

\begin{equation*}A_n =\begin{bmatrix}K(x,X_n)K(X_n,X_n)^{-1} & \quad 0 & \quad 0\\[4pt] K(x^*,X_n)K(X_n,X_n)^{-1} & \quad 0 & \quad 0\\[4pt] 0 & \quad 1 & \quad 0\\[4pt] 0 & \quad 0 & \quad 1\end{bmatrix}, \quad \Sigma_n =\begin{bmatrix}K(X_n,X_n) & \quad K(X_n,x) & \quad K(X_n,x^*)\\[4pt] K(x,X_n) & \quad k(x,x) & \quad k(x,x^*)\\[4pt] k(x^*,X_n) & \quad k(x^*,x) & \quad k(x^*,x^*)\end{bmatrix}\!.\end{equation*}

For convenience, we introduce the notation

\begin{align*}Q_{X_n}(x) &= K(x,X_n)K(X_n,X_n)^{-1}K(X_n,x),\\Q_{X_n}(x,x^*) &= K(x,X_n)K(X_n,X_n)^{-1}K(X_n,x^*).\end{align*}

Then the power function $P_{X_n}$ in (2.1) can be written as $P_{X_n}(x) = k(x,x) - Q_{X_n}(x)$ . We also use the analogous notation $P_{X_n}(x,x^*) = k(x,x^*) - Q_{X_n}(x,x^*)$ . With some trivial computation, we obtain

\begin{equation*}V_n =\begin{bmatrix}Q_{X_n}(x) & \quad Q_{X_n}(x,x^*) & \quad Q_{X_n}(x) & \quad Q_{X_n}(x,x^*)\\[4pt] Q_{X_n}(x,x^*) & \quad Q_{X_n}(x^*) & \quad Q_{X_n}(x,x^*) & \quad Q_{X_n}(x^*)\\[4pt] Q_{X_n}(x) & \quad Q_{X_n}(x,x^*) & \quad k(x,x) & \quad k(x,x^*)\\[4pt] Q_{X_n}(x,x^*) & \quad Q_{X_n}(x^*) & \quad k(x,x^*) & \quad k(x^*,x^*)\end{bmatrix}\!.\end{equation*}

Since $Z_n$ follows a multivariate normal distribution, it follows straightforwardly that

\begin{equation*}\Psi_n(\gamma) = \gamma^\top A_nm(\bar{X}_n) + \dfrac{1}{2}\gamma^\top V_n\gamma\end{equation*}

for any $\gamma\in\mathbb{R}^4$ . Then, by (3.2),

(3.4)

\begin{equation}\Psi(\gamma) = \gamma^\top\Bigl(\lim_{n\rightarrow\infty} A_nm(\bar{X}_n)\Bigr) + \dfrac{1}{2}\limsup_{n\rightarrow\infty}\gamma^\top (a_nV_n)\gamma,\end{equation}

provided that the limit on the right-hand side of (3.4) exists.

To study these limits, it is helpful to observe that $Q_{X_n}(x,x^*)$ can be viewed as the RBF interpolant of the function $k(x,\cdot)$ evaluated at the point $x^*$ , or, equivalently, the RBF interpolant of $k(x^*,\cdot)$ evaluated at the point x. This allows us to leverage the results from approximation theory that were stated in Section 2.2.

First, we consider the limit of

\begin{equation*}A_nm(\bar{X}_n) =\begin{bmatrix}K(x,X_n)K(X_n,X_n)^{-1}m(X_n) & \quad K(x^*,X_n)K(X_n,X_n)^{-1}m(X_n) & \quad m(x) & \quad m(x^*)\end{bmatrix}\!.\end{equation*}

We may observe that $\mathcal{L}_{k,X_n}m(x)$ , with $\mathcal{L}_{k,X_n}$ being the operator that maps a function to its interpolant given k and $X_n$ as in (2.2), is a (differentiable) linear combination of the values $k(x,x_m)$ . Hence the difference $y\mapsto m(y) - \mathcal{L}_{k,X_n}m(y)$ is a Lipschitz function whose zeros become dense (as $n\rightarrow\infty$ ) around x and $x^*$ . That is, there exist $\rho,\rho^*\gt 0$ such that $h_n(B_{x,\rho}\cap D)\rightarrow 0$ and $h_n(B_{x^*,\rho^*}\cap D)\rightarrow 0$ as $n\rightarrow \infty$ . Consequently, $m(x)-\mathcal{L}_{k,X_n}m(x)\rightarrow 0$ , whence $\lim_{n\rightarrow\infty} A_nm(\bar{X}_n) = m_0$ , with $m_0 = (m(x),m(x^*),m(x),m(x^*))^\top$ . Thus (3.4) becomes

(3.5)

\begin{equation}\Psi(\gamma) = \gamma^\top m_0 + \dfrac{1}{2}\limsup_{n\rightarrow\infty}\gamma^\top(a_nV_n)\gamma.\end{equation}

The precise behavior of the limit superior will depend on $a_n$ and the asymptotics of the matrix $V_n$ .

3.2. Analysis of Fenchel–Legendre transform

We begin by examining the limit of $V_n$ . It is easy to see that $P_{X_n}(y)\geq 0$ for all $y\in D$ . Furthermore, by Lemma 2.1 we can see that $P_{X_n}(y)\rightarrow 0$ if $X_n$ is dense in D as $n\rightarrow\infty$ . This implies that $Q_{X_n}(x) \rightarrow k(x,x)$ and similarly $Q_{X_n}(x^*)\rightarrow k(x^*,x^*)$ , with $k(x,x)=k(x^*,x^*)$ by the properties of the radial basis function. Although we do not know the sign of $P_{X_n}(x,x^*)$ , we can note that

\begin{equation*}P_{X_n}(x,x^*) = k(x,x^*) - (\mathcal{L}_{k,X_n}k(\cdot,x^*))(x) = k(x^*,x) - (\mathcal{L}_{k,X_n}k(\cdot,x))(x^*).\end{equation*}

By Lemma 2.2, we have $P_{X_n}(x,x^*)^2 \leq c^2_{k(\cdot,x^*),\phi}P_{X_n}(x)$ . The finiteness of $c_{k(\cdot,x^*),\phi}$ can be verified. Therefore, if $X_n$ is dense in D as $n\rightarrow\infty$ , we have $P_{X_n}(x,x^*)\rightarrow 0$ , whence $Q_{X_n}(x,x^*)\rightarrow k(x,x^*)$ . Thus we have shown that $V_n\rightarrow V$ entrywise, where

\begin{equation*}V =\begin{bmatrix}k(x,x) & \quad k(x,x^*) & \quad k(x,x) & \quad k(x,x^*)\\[4pt] k(x,x^*) & \quad k(x,x) & \quad k(x,x^*) & \quad k(x,x)\\[4pt] k(x,x) & \quad k(x,x^*) & \quad k(x,x) & \quad k(x,x^*)\\[4pt] k(x,x^*) & \quad k(x,x) & \quad k(x,x^*) & \quad k(x,x)\end{bmatrix}\!.\end{equation*}

It is easy to verify that V has eigenvalues $\lambda_1 = 2(k(x,x)+k(x,x^*))$ , $\lambda_2 = 2(k(x,x)-k(x,x^*))$ with respective eigenvectors

\begin{equation*}U_1 = \dfrac{1}{2}(1,1,1,1)^\top, \quad U_2 = \dfrac{1}{2}(1,-1,1,-1)^\top,\end{equation*}

and $\lambda_3=\lambda_4=0$ with respective eigenvectors

(3.6)

\begin{equation}U_3 = \dfrac{1}{\sqrt{2}}(1,0,-1,0)^\top,\quad U_4=\dfrac{1}{\sqrt{2}}(0,1,0,-1)^\top.\end{equation}

Similarly, let $\lambda_{i,n}$ and $U_{i,n}$ (for $1\leq i\leq 4$ ) denote the eigenvalues and corresponding eigenvectors of $V_n$ . Since $V_n\rightarrow V$ , we also have $\lambda_{i,n}\rightarrow\lambda_i$ . Accordingly, we also have $U_{1,n}\rightarrow U_1$ and $U_{2,n}\rightarrow U_2$ . However, the zero eigenvalue of V has multiplicity 2, so $U_{3,n},U_{4,n}$ will converge to limits $U^{\prime}_3,U^{\prime}_4$ that belong to the span of $U_3,U_4$ , but these limits need not be $U_3,U_4$ themselves. We know, however, that

(3.7)

\begin{equation}\bigl(U^{\prime}_3,U^{\prime}_4\bigr) = (U_3,U_4) T ,\end{equation}

where $T\in\mathbb{R}^{2\times 2}$ is an orthonormal matrix.

Looking back to (3.3) and (3.5), we can write the Fenchel–Legendre transform as

\begin{align*}I(u) &= \sup_{\gamma\in\mathbb{R}^4} (u-m_0)^\top\gamma - \dfrac{1}{2}\limsup_{n\rightarrow\infty}\gamma^\top(a_nV_n)\gamma\\&= \sup_{\gamma\in\mathbb{R}^4} (u-m_0)^\top U\gamma - \dfrac{1}{2}\limsup_{n\rightarrow\infty}\gamma^\top U^\top(a_nV_n)U\gamma\\&= \sup_{\gamma\in\mathbb{R}^4} (u-m_0)^\top U\gamma - \dfrac{1}{2}\limsup_{n\rightarrow\infty}a_n\gamma^\top U^\top U_n\Lambda_nU^\top_n U\gamma\\&= \sup_{\gamma\in\mathbb{R}^4} (u-m_0)^\top U\gamma - \dfrac{1}{2}\limsup_{n\rightarrow\infty} \sum_j\Biggl(\sum_i \gamma_iU^\top_i U_{j,n}\Biggr)^2 a_n\lambda_{j,n}.\end{align*}

Observe that $\lim_{n\rightarrow\infty} U^\top_i U_{j,n} = 1_{\{i=j\}}$ , whence

\begin{equation*}\limsup_{n\rightarrow\infty} \Biggl(\sum_i \gamma_iU^\top_i U_{j,n}\Biggr)^2 a_n\lambda_{j,n} = \gamma^2_j\lambda_j\limsup_{n\rightarrow\infty}a_n = \infty\end{equation*}

as long as $\gamma_j\neq 0$ for $j\in\{1,2\}$ . Therefore the supremum in (3.3) can only be achieved at $\gamma$ for which $\gamma_1=\gamma_2=0$ , whence

(3.8)

\begin{align}I(u) &= \sup_{\gamma_3,\gamma_4}(u-m_0)^\top U_3\gamma_3 + (u-m_0)^\top U_4\gamma_4 - \dfrac{1}{2}\limsup_{n\rightarrow\infty} \sum_j\Biggl(\sum_i \gamma_iU^\top_i U_{j,n}\Biggr)^2 a_n\lambda_{j,n}\notag\\&\geq \sup_{\gamma_3,\gamma_4}\Biggl[(u-m_0)^\top U_3\gamma_3 + (u-m_0)^\top U_4\gamma_4 - \dfrac{1}{2}\sum^2_{j=1}\lambda_j\limsup_{n\rightarrow\infty}\Biggl(\sum^4_{i=3}\gamma_iU^\top_i U_{j,n}\Biggr)^2 a_n \notag\\&\quad -\dfrac{1}{2}\sum^4_{j=3}\limsup_{n\rightarrow\infty}\Biggl(\sum^4_{i=3}\gamma_iU^\top_i U_{j,n}\Biggr)^2 \limsup_{n\rightarrow\infty} a_n\lambda_{j,n}\Biggr]\notag\\&= \sup_{\gamma_3,\gamma_4}\Biggl[(u-m_0)^\top U_3\gamma_3 + (u-m_0)^\top U_4\gamma_4 - \dfrac{1}{2}\sum^2_{j=1}\lambda_j\limsup_{n\rightarrow\infty}\Biggl(\sum^4_{i=3}\gamma_iU^\top_i U_{j,n}\Biggr)^2 a_n \notag\\&\quad -\dfrac{1}{2}\sum^4_{j=3}(T_{1,j-2}\gamma_3 + T_{2,j-2}\gamma_4)^2\limsup_{n\rightarrow\infty} a_n\lambda_{j,n}\Biggr].\end{align}

The supremum value in (3.3) is thus governed by the rate at which $a_n$ increases. If this rate is fast, we will have to take $\gamma_3=\gamma_4=0$ , leading to $I=0$ . To avoid this situation, $a_n$ should be assigned the highest order that makes one of the limits superior in (3.8) finite. Some matrix perturbation analysis is required to understand the rate that $a_n$ can take.

3.3. Perturbation analysis for rate function

Define the notation

\begin{equation*}\tilde{V} = V - V_n =\begin{bmatrix}P_{X_n}(x) & \quad P_{X_n}(x,x^*) & \quad P_{X_n}(x) & \quad P_{X_n}(x,x^*)\\[4pt] P_{X_n}(x,x^*) & \quad P_{X_n}(x^*) & \quad P_{X_n}(x,x^*) & \quad P_{X_n}(x^*)\\[4pt] P_{X_n}(x) & \quad P_{X_n}(x,x^*) & \quad 0 & \quad 0\\[4pt] P_{X_n}(x,x^*) & \quad P_{X_n}(x^*) & \quad 0 & \quad 0\end{bmatrix}\!.\end{equation*}

Let us also write $U_{j,n} = \sum_i \nu_{ijn}U_i$ . Then

\begin{align*}\lambda_{j,n}U_{j,n} &= (V-\tilde{V})U_{j,n}\\&= \sum_i \nu_{ijn}VU_i - \tilde{V}U_{j,n}\\&= \sum_i \nu_{ijn}\lambda_i U_i - \tilde{V}U_{j,n},\end{align*}

where the last line follows because $\lambda_i$ is an eigenvalue (and $U_i$ is an eigenvector) of V. Left-multiplying by the unit vector $U_i$ , we obtain

(3.9)

\begin{equation}\nu_{ijn}\lambda_{j,n} = \nu_{ijn}\lambda_i - U^\top_i \tilde{V}U_{j,n}.\end{equation}

Recalling that $\lambda_3=\lambda_4=0$ and $\lambda_{j,n}\gt 0$ , we find that

(3.10)

\begin{equation}\nu_{ijn} = -\dfrac{U^\top_i\tilde{V}U_{j,n}}{\lambda_{j,n}}, \quad i\in\{3,4\},\,j\in\{1,2\}.\end{equation}

This allows us to bound the limits superior in (3.8) as shown in Lemmas 3.1 and 3.2 below.

Lemma 3.1. For fixed $\gamma_3,\gamma_4\in\mathbb{R}$ , we have

\begin{equation*}\Biggl(\sum^4_{i=3} \gamma_i U^\top_iU_{j,n}\Biggr)^2 = {\mathrm{O}}\bigl(P^2_{X_n}(x)+P^2_{X_n}(x,x^*)+P^2_{X_n}(x^*)\bigr), \quad j\in\{1,2\}.\end{equation*}

Proof. Using (3.10), we write

\begin{equation*}\Biggl(\sum^4_{i=3} \gamma_i U^\top_iU_{j,n}\Biggr)^2 = (\nu_{3jn}\gamma_3+\nu_{4jn}\gamma_4)^2 = \dfrac{\bigl(U^\top_{j,n}\tilde{V}(\gamma_3U_3 + \gamma_4U_4)\bigr)^2}{\lambda^2_{j,n}}.\end{equation*}

Plugging in the closed-form expressions for $U_3,U_4$ from (3.6) yields

\begin{equation*}\tilde{V}U_3=\dfrac{1}{\sqrt{2}}(0,0,P_{X_n}(x),P_{X_n}(x,x^*))^\top, \quad \tilde{V}U_4=\dfrac{1}{\sqrt{2}}(0,0,P_{X_n}(x,x^*),P_{X_n}(x^*))^\top.\end{equation*}

Since $\gamma_3,\gamma_4$ are fixed and $U_3,U_4$ are unit vectors, the Cauchy–Schwarz inequality yields

\begin{equation*}\Biggl(\sum^4_{i=3} \gamma_i U^\top_iU_{j,n}\Biggr)^2 \leq \dfrac{\max\bigl\{\gamma^2_3,\gamma^2_4\bigr\}}{2\lambda^2_{j,n}}\bigl(P^2_{X_n}(x)+P^2_{X_n}(x,x^*) + P^2_{X_n}(x^*)\bigr),\end{equation*}

as desired.

Lemma 3.2. Suppose that the matrix T defined in (3.7) has no zero-valued entries. Then

(3.11)

\begin{align} \lambda_{j,n}&=\biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)\biggl(P_{X_n}(x)+\dfrac{T_{2,j-2}}{T_{1,j-2}}P_{X_n}(x,x^*)\biggr)\notag \\ &=\biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)\biggl(P_{X_n}(x^*)+\dfrac{T_{1,j-2}}{T_{2,j-2}}P_{X_n}(x,x^*)\biggr)\end{align}

for $j\in\{3,4\}$ .

Proof. Recall (3.9) and note that $\nu_{ijn}\rightarrow T_{i-2,j-2}$ for $i,j\in\{3,4\}$ . Since T is assumed to have no zero-valued entries, we do not need to worry about zero values of $\nu_{ijn}$ . Then (3.10) can be rewritten as

(3.12)

\begin{equation}\lambda_{j,n} = -\dfrac{U^\top_i \tilde{V} U_{j,n}}{\nu_{ijn}}, \quad i,j\in\{3,4\}.\end{equation}

The first equality in (3.11) can be obtained by setting $i=3$ , whence (3.12) yields

\begin{equation*}\lambda_{j,n} = -\dfrac{1}{\nu_{3jn}\sqrt{2}}(0,0,P_{X_n}(x),P_{X_n}(x,x^*))\cdot U_{j,n}.\end{equation*}

By expressing (0, 0, 1, 0) and (0, 0, 0,1) in terms of $U_i$ , we obtain

\begin{align*}\lambda_{j,n}&= -\dfrac{1}{\nu_{3jn}\sqrt{2}}\biggl(P_{X_n}(x)U\cdot\biggl(\dfrac{1}{2},\dfrac{1}{2},-\dfrac{1}{\sqrt{2}},0\biggr)^\top + P_{X_n}(x,x^*)U\cdot\biggl(\dfrac{1}{2},-\dfrac{1}{2},0,-\dfrac{1}{\sqrt{2}}\biggr)^\top\biggr)^\top U_{j,n}\\&= -\dfrac{1}{\nu_{3jn}\sqrt{2}}\biggl(P_{X_n}(x)\cdot\biggl(\dfrac{1}{2},\dfrac{1}{2},-\dfrac{1}{\sqrt{2}},0\biggr)+P_{X_n}(x,x^*)\cdot\biggl(\dfrac{1}{2},-\dfrac{1}{2},0,-\dfrac{1}{\sqrt{2}}\biggr)\biggr)v_{\cdot jn}\\&= \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)\biggl(P_{X_n}(x)+\dfrac{T_{2,j-2}}{T_{1,j-2}}P_{X_n}(x,x^*)\biggr),\end{align*}

where the last line follows from the fact that $\nu_{ijn}\rightarrow 0$ for $i\in\{1,2\}$ and $j\in\{3,4\}$ , while $\nu_{ijn}\rightarrow T_{i-2,j-2}$ for $i,j\in\{3,4\}$ . The second equality in (3.11) can be obtained by repeating the above arguments with $i=4$ .

The analysis in Lemma 3.2 is easily extended to handle situations where T has zero-valued entries. If this occurs, we must have $||T_{11}|-|T_{21}||=1$ because T is orthonormal. In the first case we can repeat the proof of Lemma 3.2 with $i=4$ , $j=3$ and $i=3$ , $j=4$ , and obtain

(3.13)

\begin{equation}\lambda_{3,n}=\biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)P_{X_n}(x^*), \quad \lambda_{4,n}=\biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)P_{X_n}(x).\end{equation}

In the second case we repeat the same proof with $i=3$ , $j=3$ and $i=4$ , $j=4$ , and obtain

\begin{equation*}\lambda_{3,n}=\biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)P_{X_n}(x), \quad \lambda_{4,n}=\biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)P_{X_n}(x^*).\end{equation*}

In general, the bounds in Lemmas 3.1 and 3.2 depend on $P_{X_n}(x,x^*)$ , which is a difficult object to study. Lemma 3.3 establishes a bound that relates this quantity to a simpler function of the design points. Then Lemma 3.4 derives a similar lower bound on $P_{X_n}(x)$ . Note that these results provide lower bounds; we will later multiply them by negative quantities to convert them to upper bounds, which will enable additional analysis of the terms in (3.8).

Lemma 3.3. Let $\lambda_{\min}({\cdot})$ denote the smallest eigenvalue of a square matrix. The following bound holds:

\begin{equation*}2P_{X_n}(x,x^*) + P_{X_n}(x) + P_{X_n}(x^*) \geq 2\lambda_{\min}(K(\bar{X}_n,\bar{X}_n)).\end{equation*}

Proof. For notational convenience, define a vector $\kappa(x) = K(x,X_n)K(X_n,X_n)^{-1}$ . Note that $\kappa(x)$ takes values in $\mathbb{R}^n$ , and observe the identities

\begin{align*}\sum^n_{m=1}(\kappa_m(x) + \kappa_m(x^*))k(x_m,x) &= Q_{X_n}(x) + Q_{X_n}(x,x^*),\\\sum^n_{m=1}(\kappa_m(x) + \kappa_m(x^*))k(x_m,x^*) &= Q_{X_n}(x^*) + Q_{X_n}(x,x^*),\\\sum_{m,m^{\prime}}(\kappa_m(x) + \kappa_m(x^*))(\kappa_{m^{\prime}}(x)+\kappa_{m^{\prime}}(x^*))k(x_m,x_{m^{\prime}}) &= Q_{X_n}(x) + 2Q_{X_n}(x,x^*) + Q_{X_n}(x^*).\end{align*}

We extend $\kappa$ to $\mathbb{R}^{n+2}$ by taking $\kappa_{n+1},\kappa_{n+2}\equiv -\frac{1}{2}$ . Plugging in the above identities, we derive

\begin{align*}& \sum^{n+2}_{m,m^{\prime}=1} (\kappa_m(x) + \kappa_m(x^*))(\kappa_{m^{\prime}}(x)+\kappa_{m^{\prime}}(x^*))k(x_m,x_{m^{\prime}})\\&\quad= (\kappa_{n+1}(x)+\kappa_{n+1}(x^*))^2 k(x,x) + (\kappa_{n+2}(x)+\kappa_{n+2}(x^*))^2 k(x^*,x^*)\\&\quad\quad + 2(\kappa_{n+1}(x)+\kappa_{n+1}(x^*))(\kappa_{n+2}(x)+\kappa_{n+2}(x^*))k(x,x^*)\\&\quad\quad + 2(\kappa_{n+1}(x)+\kappa_{n+1}(x^*))(Q_X(x)+Q_X(x,x^*))\\&\quad\quad + 2(\kappa_{n+2}(x)+\kappa_{n+2}(x^*))(Q_X(x^*)+Q_X(x,x^*))\\&\quad\quad + Q_{X_n}(x)+2Q_{X_n}(x,x^*) + Q_{X_n}(x^*)\\&\quad= 2P_{X_n}(x,x^*) + P_{X_n}(x) + P_{X_n}(x^*).\end{align*}

Thus we arrive at

\begin{align*}2P_{X_n}(x,x^*) + P_{X_n}(x) + P_{X_n}(x^*) &= (\kappa(x) + \kappa(x^*))^\top K(\bar{X}_n,\bar{X}_n)(\kappa(x)+\kappa(x^*))\\&\geq \|\kappa(x) + \kappa(x^*)\|^2_2\cdot\lambda_{\min}(K(\bar{X}_n,\bar{X}_n))\\&= \Biggl(\sum^n_{m=1} (\kappa_m(x) + \kappa_m(x^*))^2+2\Biggr)\lambda_{\min}(K(\bar{X}_n,\bar{X}_n))\\&\geq 2\lambda_{\min}(K(\bar{X}_n,\bar{X}_n)),\end{align*}

which completes the proof.

Lemma 3.4. Let $X^{\prime}_n = X_n\cup\{x\}$ . Then

\begin{equation*}P_{X_n}(x) \geq \lambda_{\min}(K(X^{\prime}_n,X^{\prime}_n)).\end{equation*}

Proof. Define $\kappa(x)$ as in the proof of Lemma 3.3 and extend it to $\mathbb{R}^{n+1}$ by taking $\kappa_{n+1}\equiv -1$ . Then, by repeating the arguments in the proof of Lemma 3.3, we obtain

\begin{align*}P_{X_n}(x) &= \sum^{n+1}_{m,m^{\prime}=1} \kappa_m(x)\kappa_{m^{\prime}}(x)k(x_m,x_{m^{\prime}})\\&\geq \lambda_{\min}(K(X^{\prime}_n,X^{\prime}_n))\sum^{n+1}_{m=1}\kappa^2_m(x)\\&\geq \lambda_{\min}(K(X^{\prime}_n,X^{\prime}_n)),\end{align*}

as desired.

Now, we can study the rate at which $\lambda_{\min}(K(X^{\prime}_n,X^{\prime}_n))$ or $\lambda_{\min}(K(\bar{X}_n,\bar{X}_n))$ converges to zero. For this, we cite the following result (Theorem 12.3 of [Reference Wendland35]).

Lemma 3.5. ([Reference Wendland35].) Define $q_{X_n} = \min_{x_m\neq x_{m^{\prime}}} \|x_m-x_{m^{\prime}}\|_2$ and $\phi_0(M) = \inf_{\|y\|_2\leq M} \tilde{\phi}(y)$ , where $\tilde{\phi}$ is the generalized Fourier transform of the radial basis function $\phi$ . Then

\begin{equation*}\lambda_{\min}(K(X_n,X_n)) \geq C_d \phi_0\biggl(\dfrac{M_d}{q_{X_n}}\biggr)q^{-d}_{X_n},\end{equation*}

where the constants $C_d,M_d$ depend only on d.

Combining Lemmas 3.3–3.5, we have

\begin{equation*}P_{X_n}(x) \geq C_d\phi_0\biggl(\dfrac{M_d}{q_{X^{\prime}_n}}\biggr)q^{-d}_{X^{\prime}_n}.\end{equation*}

Consequently, the inequality in Lemma 3.3 becomes

(3.14)

\begin{equation}2P_{X_n}(x,x^*) + P_{X_n}(x) + P_{X_n}(x^*) \geq 2C_d\phi_0\biggl(\dfrac{M_d}{q_{\bar{X}_n}}\biggr)q^{-d}_{\bar{X}_n}.\end{equation}

The lower bound in (3.14) can be connected back to the upper bound obtained in Lemma 3.2 in the following manner. Note that if T has no zero-valued entries as assumed in Lemma 3.2, by orthogonality we either have ${{T_{11}}/{T_{21}}}\gt 0$ and ${{T_{12}}/{T_{22}}}\lt 0$ or vice versa (note also that $T_{11}=-T_{22}$ ). Without loss of generality, we only treat the first case here.

Supposing that ${{T_{12}}/{T_{22}}}\lt 0$ , we apply (3.14) to (3.11) with $j=4$ and argue that

(3.15)

\begin{align}\!\!\lambda_{4,n} \leq \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)\biggl[P_{X_n}(x)-\dfrac{T_{22}}{2T_{12}}\biggl(P_{X_n}(x)+P_{X_n}(x^*) - 2C_d\phi_0\biggl(\dfrac{M_d}{q_{\bar{X}_n}}\biggr)q^{-d}_{\bar{X}_n}\biggr)\biggr]\end{align}

(3.16)

\begin{align} \ \ = \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)\biggl[\biggl(1-\dfrac{T_{22}}{2T_{12}}\biggr)P_{X_n}(x) - \dfrac{T_{22}}{2T_{12}}P_{X_n}(x^*) + {\mathrm{O}}\Bigl(q^{s_{\infty}}_{\bar{X}_n}\Bigr)\biggr]\qquad\quad \end{align}

(3.17)

\begin{align} \ \ = {\mathrm{O}}(h_n(B_{x,\rho}\cap D)^{s_{\infty}}).\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\end{align}

In this derivation, (3.15) uses the fact that ${{T_{12}}/{T_{22}}}\lt 0$ to convert the lower bound in (3.14) into an upper bound, while (3.16) applies (2.4) to bound $\phi_0$ . Noting that the multipliers $1-{{T_{22}}/{(2T_{12})}}$ and $- {{T_{22}}/{(2T_{12})}}$ are both strictly positive, we then obtain (3.17) by applying Lemma 2.1 together with the fact that

\begin{equation*}h_n(B_{x,\rho}\cap D) \geq q_{X_n} \geq q_{\bar{X}_n}.\end{equation*}

Next we return to (3.11) with $j=3$ and obtain

\begin{equation*}\lambda_{3,n}\leq \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)\biggl(P_{X_n}(x)+\dfrac{T_{21}}{T_{11}}\sqrt{\phi(0)}\sqrt{P_{X_n}(x)}\biggr)\end{equation*}

by using the Cauchy–Schwarz inequality for the RKHS inner product to produce the simple bound

\begin{equation*}|P_{X_n}(x,x^*)| \leq \sqrt{\phi(0)}\sqrt{P_{X_n}(x)}.\end{equation*}

Applying Lemma 2.1, we conclude that $\lambda_{3,n} = {\mathrm{O}}(h_n(B_{x,\rho}\cap D)^{ {{s_{\infty}}/{2}} })$ . Finally, applying Lemma 2.1 to the bound in Lemma 3.1 straightforwardly yields

\begin{equation*}\Biggl(\sum^4_{i=3} \gamma_i U^\top_i U_{j,n}\Biggr)^2 = {\mathrm{O}}(h_n(B_{x,\rho}\cap D)^{s_{\infty}}).\end{equation*}

When $X_n$ is dense in D as $n\rightarrow\infty$ , we have $h_n(B_{x,\rho}\cap D)\leq h_n(D)$ with $h_n(D)\rightarrow 0$ . Thus, among the limits superior in (3.8), one is ${\mathrm{O}}\bigl(h^{ {{s_{\infty}}/{2}} }_n\bigr)$ and the others are ${\mathrm{O}}\bigl(h^{s_{\infty}}_n\bigr)$ . This will also happen in the symmetric situation where ${{T_{12}}/{T_{22}}}\gt 0$ , but with the order switched for $\lambda_{3,n}$ and $\lambda_{4,n}$ .

3.4. Main moderate deviations inequality

The conclusions of Section 3.3 suggest that $a_n$ should have the exact order $h_n^{- {{s_{\infty}}/{2}} }$ , which we denote by $a_n\sim h_n^{- {{s_{\infty}}/{2}} }$ . Then we obtain

\begin{equation*}\Biggl(\sum^4_{i=3}\gamma_i U^\top_i U_{j,n}\Biggr)^2 a_n \rightarrow 0\end{equation*}

in (3.8), and bound $I(u)\geq I^l(u)$ , where $I^l$ is defined as

\begin{equation*}I^l(u) = \sup_{\gamma_3,\gamma_4\in\mathbb{R}} (u-m_0)^\top U_3\gamma_3 + (u-m_0)^\top U_4\gamma_4 - \dfrac{1}{2}c_3(T_{11}\gamma_3 + T_{21}\gamma_4)^2\end{equation*}

when ${{T_{12}}/{T_{22}}}\lt 0$ , and

\begin{equation*}I^l(u) = \sup_{\gamma_3,\gamma_4\in\mathbb{R}} (u-m_0)^\top U_3\gamma_3 + (u-m_0)^\top U_4\gamma_4 - \dfrac{1}{2}c_4(T_{12}\gamma_3 + T_{22}\gamma_4)^2\end{equation*}

when ${{T_{12}}/{T_{22}}}\gt 0$ , for some suitable constants $c_3,c_4$ . Furthermore, recalling (3.6) and the definition of $m_0$ , we find that $m^\top_0 U_3 = m^\top_0 U_4 = 0$ . Now, applying (3.1), we can finally state our main result.

Theorem 3.1. Let T be as in (3.7), take $a_n \sim h_n^{- {{s_{\infty}}/{2}} }$ , and let $c_l$ be a constant satisfying $\limsup_{n\rightarrow\infty} a_n\lambda_{j,n}\leq c_l$ for $j\in\{3,4\}$ . If $||T_{11}|-|T_{21}||\notin \{0,1\}$ , we have

(3.18)

\begin{equation}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu_n(E) \leq -\inf_{u\in E} I^l(u)\end{equation}

for any closed $E\subseteq \mathbb{R}^4$ , with

(3.19)

\begin{equation}I^l(u) = \sup_{\gamma_3,\gamma_4\in\mathbb{R}} u^\top U_3\gamma_3 + u^\top U_4\gamma_4 - \dfrac{1}{2}c_l(|T_{11}|\gamma_3 + |T_{21}|\gamma_4)^2.\end{equation}

Throughout this analysis we have assumed that $X_n$ is dense in D when $n\rightarrow\infty$ , but it is possible to recover Theorem 3.1, for fixed $x,x^*$ , as long as the design is dense only in neighborhoods of those two points, e.g. in $B_{x,\rho}\cup B_{x^*,\rho}$ for some $\rho\gt 0$ . In that case $a_n$ will take the order of $\min\{h_n(B_{x,\rho})^{- {{s_{\infty}}/{2}} },h_n(B_{x^*,\rho})^{- {{s_{\infty}}/{2}} }\}$ .

The right-hand side of (3.18) is some strictly negative, problem-specific constant, and it is the order of $a_n$ that governs the convergence rate of $\mu_n(E)$ . The moderate deviations inequality allows the rate to depend on the kernel through the quantity $s_{\infty}$ , but otherwise the complex interdependence of the various elements of $K(X_n,X_n)$ has been streamlined using Lemmas 2.1 and 3.5. For this reason, the bound in (3.18) may not be the tightest possible.

4. Extensions and special cases

In this section we treat several special cases not covered by Theorem 3.1. First, Section 4.1 handles the situation where $||T_{11}|-|T_{21}||\in \{0,1\}$ . Section 4.2 discusses how the bound of Theorem 3.1 is improved if we only consider one reference point x instead of two points $x,x^*$ .

4.1. Special cases of Theorem 3.1

Again, let T be as in (3.7), and suppose that $||T_{11}|-|T_{21}||=0$ . For simplicity we consider the case $T_{11}=T_{21}$ , as the other cases are very similar.

Because T is orthonormal, we must have $T_{11}=T_{21}={{\sqrt{2}}/{2}}$ . Then, returning to the derivation of (3.8) and recalling that $m^\top_0 U_3 = m^\top_0 U_4 = 0$ , we have

(4.1)

\begin{align}I(u) &\geq \sup_{\gamma_3,\gamma_4} u^\top U_3\gamma_3 + u^\top U_4\gamma_4 - \dfrac{1}{2}\limsup_{n\rightarrow\infty}\Biggl[\sum^2_{j=1}\lambda_j\Biggl(\sum^4_{i=3}\gamma_iU^\top_i U_{j,n}\Biggr)^2 a_n \notag\\&\quad -\dfrac{1}{2}\bigl((\gamma_3+\gamma_4)^2\lambda_{3,n} + (\gamma_3-\gamma_4)^2\lambda_{4,n}\bigr)a_n\Biggr]\end{align}

Using an argument similar to that in Lemma 3.1, we explicitly derive

(4.2)

\begin{align}\sum^2_{j=1}\lambda_j\Biggl(\sum^4_{i=3}\gamma_iU^\top_i U_{j,n}\Biggr)^2 &= \dfrac{1+{\mathrm{o}}(1)}{8}\biggl[ \dfrac{1}{\lambda_1}(\gamma_3P_{X_n}(x) + (\gamma_3+\gamma_4)P_{X_n}(x,x^*) + \gamma_4P_{X_n}(x^*))^2\notag\\&\quad +\dfrac{1}{\lambda_2}(\gamma_3 P_{X_n}(x) + (\gamma_4-\gamma_3)P_{X_n}(x,x^*) - \gamma_4P_{X_n}(x^*))^2\biggr].\end{align}

By direct application of Lemma 3.2, we have

\begin{align*}\lambda_{3,n} &= \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)(P_{X_n}(x)+P_{X_n}(x,x^*)) = \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)(P_{X_n}(x^*)+P_{X_n}(x,x^*)),\\\lambda_{4,n} &= \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)(P_{X_n}(x)-P_{X_n}(x,x^*)) = \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)(P_{X_n}(x^*)-P_{X_n}(x,x^*)).\end{align*}

Let us now take $a_n\sim h^{-s_{\infty}}_n$ , which guarantees

\[\limsup_{n\rightarrow\infty} a_nP_{X_n}(x) \leq c_l\quad\text{and}\quad\limsup_{n\rightarrow\infty} a_nP_{X_n}(x^*) \leq c_l\]

for some finite $c_l \gt 0$ . This allows us to place a bound on (4.1) that drops all negligible terms in (4.2) that contain $P_{X_n}(x)$ and $P_{X_n}(x^*)$ . That is,

(4.3)

\begin{align}&I(u) \notag \\ &\ \geq \sup_{\gamma_3,\gamma_4} u^\top U_3\gamma_3 + u^\top U_4\gamma_4 - \dfrac{1}{16}\limsup_{n\rightarrow\infty}\biggl[\dfrac{1}{\lambda_1}(\gamma_3+\gamma_4)^2P^2_{X_n}(x,x^*)+\dfrac{1}{\lambda_2}(\gamma_4-\gamma_3)^2P^2_{X_n}(x,x^*) \notag\\&\ \quad +2 (\gamma_3+\gamma_4 )^2P_{X_n}(x)+2 (\gamma_3-\gamma_4 )^2P_{X_n}(x^*)-4\gamma_3\gamma_4 P_{X_n}(x,x^*)\biggr]a_n\end{align}

The last term in (4.3) requires us to take $\gamma_3\gamma_4\leq 0$ to avoid I(u) becoming $-\infty$ . Then (4.3) becomes

(4.4)

\begin{align}&I(u) \notag \\ &\ \geq \sup_{\gamma_3\gamma_4\leq 0} u^\top U_3\gamma_3 + u^\top U_4\gamma_4 - \dfrac{1}{16}\limsup_{n\rightarrow\infty}\biggl[\dfrac{1}{\lambda_1}(\gamma_3+\gamma_4)^2P^2_{X_n}(x,x^*)+\dfrac{1}{\lambda_2}(\gamma_4-\gamma_3)^2P^2_{X_n}(x,x^*)\notag\\ &\ \quad +2(\gamma_3+\gamma_4)^2P_{X_n}(x)+2(\gamma_3-\gamma_4)^2P_{X_n}(x^*)-4\gamma_3\gamma_4 (P_{X_n}(x)+P_{X_n}(x^*))\biggr]a_n\\ &\ \geq \sup_{\gamma_3\gamma_4\leq 0} u^\top U_3\gamma_3 + u^\top U_4\gamma_4 - \dfrac{c_l}{16}\biggl[\dfrac{1}{\lambda_1}(\gamma_3+\gamma_4)^2c^2_{k(\cdot,x^*),\phi} + \dfrac{1}{\lambda_2}(\gamma_3-\gamma_4)^2c^2_{k(\cdot,x^*),\phi}\notag\end{align}

(4.5)

\begin{align} &\ \quad + 2 ((\gamma_3+\gamma_4)^2+(\gamma_3-\gamma_4)^2-4\gamma_3\gamma_4)\biggr]\\ &\!\!\!\!\!\!\!\! = \sup_{\gamma_3\gamma_4\leq 0} u^\top U_3\gamma_3 + u^\top U_4\gamma_4 - \dfrac{c_l}{16}\biggl[\dfrac{1}{\lambda_1}(\gamma_3+\gamma_4)^2c^2_{k(\cdot,x^*),\phi} + (\gamma_3-\gamma_4)^2\biggl(\dfrac{c^2_{k(\cdot,x^*),\phi}}{\lambda_2}+4\biggr)\biggr].\notag\end{align}

where (4.4) applies Lemma 3.3 to bound $P_{X_n}(x,x^*)$ , and (4.5) uses the bound $P^2_{X_n}(x,x^*) \leq c^2_{k(\cdot,x^*),\phi}P_{X_n}(x)$ from Lemma 2.2. After some algebra, we obtain a version of the moderate deviations inequality (3.18) with

(4.6)

\begin{align}I^l(u) &= \sup_{\gamma_3\gamma_4\leq 0} \dfrac{1}{2}u^\top(U_3 + U_4)(\gamma_3+\gamma_4)+\dfrac{1}{2}u^\top(U_3-U_4)(\gamma_3-\gamma_4)\notag\\&\quad -\dfrac{c_l}{16}\biggl[\dfrac{1}{\lambda_1}(\gamma_3+\gamma_4)^2c^2_{k(\cdot,x^*),\phi} + (\gamma_3-\gamma_4)^2\biggl(\dfrac{c^2_{k(\cdot,x^*),\phi}}{\lambda_2}+4\biggr)\biggr].\end{align}

The second special case $||T_{11}|-|T_{21}||=1$ is handled in almost the same way. For simplicity we take $T_{11}=T_{22}=0$ , as the other possible cases are very similar. Then (4.2) remains unchanged, but we use (3.13) instead of Lemma 3.2. We can thus omit the final cross-term $-4\gamma_3\gamma_4P_{X_n}(x,x^*)$ in (4.3), so we no longer need $\gamma_3\gamma_4\leq 0$ . The other arguments are unchanged and yield another version of (3.18) with

(4.7)

\begin{align}I^l(u) &= \sup_{\gamma_3,\gamma_4} \dfrac{1}{2}u^\top(U_3 + U_4)(\gamma_3+\gamma_4)+\dfrac{1}{2}u^\top(U_3-U_4)(\gamma_3-\gamma_4)\notag\\&\quad -\dfrac{c_l}{16}\biggl[(\gamma_3+\gamma_4)^2\biggl(\dfrac{c^2_{k(\cdot,x^*),\phi}}{\lambda_1}+2\biggr) + (\gamma_3-\gamma_4)^2\biggl(\dfrac{c^2_{k(\cdot,x^*),\phi}}{\lambda_2}+2\biggr)\biggr].\end{align}

4.2. Moderate deviations for a single point

In the following, we treat the special case where $Z_n = \bigl(\,\hat{f}_n(x),f(x)\bigr)^\top$ , for a single, fixed $x\in D$ . Although one can obtain moderate deviations inequalities for such $Z_n$ directly from Theorem 3.1, it is possible to tighten the bounds by modifying the analysis. Below we highlight those parts of the argument that require changes.

As before, we let $\mu_n$ denote the probability law of $Z_n$ and assume that the set $X_n$ of design points becomes dense in D as $n\rightarrow\infty$ . We write $Z_n = A_n f(X^{\prime}_n)$ , where $X^{\prime}_n = X_n\cup\{x\}$ and

\begin{equation*}A_n = \begin{bmatrix}K(x,X_n)K(X_n,X_n)^{-1} & \quad 0\\[4pt] 0 & \quad 1\end{bmatrix}\!.\end{equation*}

The covariance matrix of $Z_n$ is given by

\begin{equation*}V_n = \begin{bmatrix}Q_{X_n}(x) & \quad Q_{X_n}(x)\\[4pt] Q_{X_n}(x) & \quad k(x,x)\end{bmatrix}\!,\end{equation*}

where $Q_{X_n}$ is as defined in Section 3.1. Equation (3.5) remains unchanged with $m_0 = (m(x),m(x))^\top$ . Repeating the arguments in Section 3.2, we have $V_n\rightarrow V$ entrywise, where V is a $2\times 2$ matrix whose elements are all equal to k(x, x). This matrix has two eigenvalues $\lambda_1 \gt 0$ and $\lambda_2 = 0$ , with corresponding eigenvectors

\begin{equation*}U_1 = \dfrac{1}{\sqrt{2}}(1,1)^\top, \quad U_2 = \dfrac{1}{\sqrt{2}}(1,-1)^\top.\end{equation*}

Similarly, let $\lambda_{i,n}$ and $U_{i,n}$ denote the eigenvalues and eigenvectors of $V_n$ .

Again following Section 3.2, the supremum in (3.3) can only be achieved at $\gamma\in\mathbb{R}^2$ satisfying $\gamma_1 = 0$ . Repeating the derivation of (3.8), we obtain

\begin{equation*}I(u) \geq \sup_{\gamma_2}\! (u-m_0)^\top U_2\gamma_2 - \dfrac{1}{2}\lambda_1\bigl(\gamma_2U^\top_2U_{1,n}\bigr)^2a_n -\dfrac{1}{2}\limsup_{n\rightarrow\infty}\bigl(\gamma_2 U^\top_2 U_{2,n}\bigr)^2\limsup_{n\rightarrow\infty} a_n\lambda_{2,n} ,\end{equation*}

where $\limsup_{n\rightarrow\infty}\bigl(\gamma_2U^\top_2U_{2,n}\bigr)^2 = \gamma^2_2$ and

\begin{equation*}\bigl(\gamma_2U^\top_2U_{1,n}\bigr)^2 = \dfrac{1}{\lambda^2_{1,n}}\bigl(\gamma_2U^\top_{1,n}(V-V_n)U_2\bigr)^2 \leq \dfrac{\gamma^2_2}{\lambda^2_{1,n}}P^2_{X_n}(x) = {\mathrm{O}}\bigl(P^2_{X_n}(x)\bigr),\end{equation*}

analogously to Lemma 3.1. Repeating the proof of Lemma 3.2, we obtain

\begin{equation*}\lambda_{2,n} = \biggl(\dfrac{1}{2}+{\mathrm{o}}(1)\biggr)P_{X_n}(x).\end{equation*}

The rest of Section 3.3 remains the same, but it is sufficient to use Lemma 3.4, as there is no longer any cross-term. Thus we avoid the factor of 2 in the lower bound derived in Lemma 3.3, which allows us to take $a_n\sim h_n^{-s_\infty}$ . It then follows that $\bigl(\gamma_2U^\top_2 U_{1,n}\bigr)^2 a_n\rightarrow 0$ , while $\limsup_{n\rightarrow\infty}a_n\lambda_{2,n}\leq c_l$ for some $c_l\lt\infty$ . We thus obtain $I(u) \geq I^l(u)$ , where

\begin{align*}I^l(u) &= \sup_{\gamma_2}\! (u-m_0)^\top U_2\gamma_2 - \dfrac{1}{2}c_l\gamma^2_2\\&= \sup_{\gamma_2} u^\top U_2\gamma_2 - \dfrac{1}{2}c_l\gamma^2_2\\&= \dfrac{(u_1-u_2)^2}{4c_l}.\end{align*}

We now apply (3.1) to obtain the desired result. Note that the support set $\{\gamma \in \mathbb{R}^2\,:\, \Psi(\gamma)\lt\infty\}$ is a subspace, so (3.1) still holds by the results of Section 6.

Theorem 4.1. Take $a_n \sim h_n^{-s_{\infty}}$ and let $c_l$ be a constant satisfying $\limsup_{n\rightarrow\infty} a_n\lambda_{2,n}\leq c_l$ . Then

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu_n(E) \leq -\inf_{u\in E} \dfrac{(u_1-u_2)^2}{4c_l}\end{equation*}

for any closed $E\subseteq \mathbb{R}^2$ .

By considering one reference point x instead of two points $x,x^*$ , we obtain a better bound because it is no longer necessary to bound a cross-term, as in Lemma 3.3. However, the bound will not get worse if we consider $K\gt 2$ points, i.e. with

\[Z_n = \bigl\{\hat{f}_n\big(x^*_k\big),f\big(x^*_k\big)\bigr\}^K_{k=1} \quad\text{for some $K \lt \infty$.}\]

This is because the same bound in Lemma 3.3 can be applied to every possible cross-term.

5. Applications: pairwise comparisons and estimation error

In Sections 5.1–5.2 we use Theorems 3.1 and 4.1 to prove (1.3) and (1.4), respectively. The proofs are very similar, but use different definitions of the error set E in (3.18). Section 5.3 presents several other results of interest where the moderate deviations bound can be made more explicit.

5.1. Moderate deviations for false ordering

We return to (1.2) and write

(5.1)

\begin{equation}\pi_n(x,x^*) = \dfrac{P\bigl(\,\hat{f}_n(x)\leq \hat{f}_n(x^*)-\delta,\,f(x)\geq f(x^*)\bigr)}{P(\,f(x)\geq f(x^*))}.\end{equation}

For fixed $x,x^*$ , the denominator is a strictly positive constant, so we can focus on the numerator, which fits into the framework of Section 3 with

(5.2)

\begin{equation}E=\{u\in\mathbb{R}^4\,:\, u_1 \leq u_2-\delta,\,u_3\geq u_4\}.\end{equation}

We will apply Theorem 3.1 and derive a more explicit form for (3.19). First, note that the supremum in (3.19) can only be finite when

(5.3)

\begin{equation}\dfrac{u^\top U_3}{|T_{11}|} = \dfrac{u^\top U_4}{|T_{21}|}.\end{equation}

Letting $\eta$ be the value in (5.3), we then have $I^l(u)={{\eta^2}/({2c_l})}$ . Then we minimize $I^l(u)$ subject to (5.2)–(5.3). From the optimality conditions, it can be seen that the inequalities in (5.2) must be binding at optimality, which leads to

\begin{equation*}\inf_{u\in E} I^l(u) = \dfrac{\delta^2}{4c_l}\dfrac{1}{(|T_{11}|-|T_{21}|)^2}.\end{equation*}

Applying Theorem 3.1, we complete the proof of (1.3). The formal statement of the result is as follows.

Theorem 5.1. Let T be as in (3.7), take $a_n \sim h_n^{- {{s_{\infty}}/{2}} }$ and let $c_l$ be a constant satisfying $\limsup_{n\rightarrow\infty} a_n\lambda_{j,n}\leq c_l$ for $j\in\{3,4\}$ . If $||T_{11}|-|T_{21}||\notin \{0,1\}$ , we have

\begin{equation*}\pi_n(x,x^*) \leq C_1 \exp\biggl({-}\dfrac{\delta^2 C_2}{4c_l(|T_{11}|-|T_{21}|)^2}h_n^{- {{s_{\infty}}/{2}} }\biggr)\end{equation*}

where $C_1,C_2$ are positive constants.

In fact, we can show that the moderate deviations bound holds uniformly for all $x,x^*\in D$ . To do so, we must make sure that the denominator of (5.1) is well-behaved. It is easily seen that

\begin{equation*}P(\,f(x)\geq f(x^*)) = \Phi\biggl(\dfrac{m(x)-m(x^*)}{\sqrt{2(k(x,x)-k(x,x^*))}}\biggr),\end{equation*}

where $\Phi$ is the standard Gaussian cumulative distribution function. We let $c_L$ be the Lipschitz constant of m, and derive

\begin{align*}\lim_{\|x-x^*\|\rightarrow 0}\dfrac{(m(x)-m(x^*))^2}{2(k(x,x)-k(x,x^*))} &\leq \lim_{\|x-x^*\|\rightarrow 0}\dfrac{c^2_L\|x-x^*\|^2_2}{2(\phi(0)-\phi(\|x-x^*\|))}\\&= \dfrac{c^2_L}{2} \lim_{y\searrow 0}\dfrac{y^2}{\phi(0)-\phi(y)}\\&= -c^2_L \lim_{y\searrow 0}\dfrac{y}{\phi^{\prime}(y)}\\&= -\dfrac{c^2_L}{\phi^{\prime\prime}(0)}\\& \lt \infty\end{align*}

using the assumption made in Section 2.1 that $\phi$ is twice differentiable at zero with $\phi^{\prime\prime}(0)\lt 0$ . Because D is compact, there exists some $c_D \gt 0$ satisfying

\begin{equation*}\inf_{x,x^*\in D} P(\,f(x) \geq\, f(x^*))\geq c_D.\end{equation*}

Furthermore, the constant $C_2$ in Theorem 5.1 does not depend on $x,x^*$ . The constant $C_1$ may depend on $x,x^*$ , but we can take $C^{\prime}_1$ to be its largest value over the compact set D. We then conclude the following.

Corollary 5.1. Suppose that we are in the situation of Theorem 5.1. Then

\begin{equation*}\sup_{x,x^*\in D}\pi_n(x,x^*) \leq \dfrac{C^{\prime}_1}{c_D} \exp\biggl({-}\dfrac{\delta^2 C_2}{4c_l(|T_{11}|-|T_{21}|)^2}h_n^{- {{s_{\infty}}/{2}} }\biggr)\end{equation*}

where $C^{\prime}_1,C_2,c_D$ are positive constants.

Finally, we consider two special cases covered in Section 4. First, in the case where $T_{11}=T_{21}$ or $T_{12}=T_{22}$ , we apply (4.6). It can be shown that the supremum is attained when $\gamma_3+\gamma_4=0$ , which satisfies the condition $\gamma_3\gamma_4\leq 0$ and achieves $u^\top(U_3+U_4)=0$ and $|u^\top U_3-u^\top U_4| = {{\delta}/{(2\sqrt{2})}}$ . Consequently, we have

(5.4)

\begin{equation}\pi_n(x,x^*) \leq C_1\exp\bigl({-}\delta^2 C^{\prime}_2 h_n^{-s_{\infty}}\bigr),\end{equation}

where $C^{\prime}_2$ depends on $\lambda_2,c_l,c_{k(\cdot,x^*),\phi}$ . In the second special case where $||T_{11}|-|T_{21}|| = 1$ , we apply (4.7). Again, the supremum achieves $u^\top(U_3+U_4)=0$ and $|u^\top U_3-u^\top U_4| = {{\delta}/{(2\sqrt{2})}}$ and (5.4) follows. In summary, both special cases admit an exponential bound in $h_n^{-s_{\infty}}$ rather than $h_n^{- {{s_{\infty}}/{2}} }$ .

5.2. Moderate deviations for pointwise estimation error

The convergence rate of the tail probability $P\bigl(|\,\hat{f}_n(x)-f(x)|\geq \delta\bigr)$ for fixed x can be obtained from Theorem 4.1 using the error event $E^{\prime} = \{|u_1-u_2|\geq\delta\}$ . It is easy to see that

\begin{equation*}\inf_{u\in E^{\prime}} \dfrac{(u_1-u_2)^2}{4c_l} = \dfrac{\delta^2}{4c_l},\end{equation*}

whence we have

\begin{equation*}P\bigl(|\,\hat{f}_n(x) - f(x)|\geq \delta\bigr) \leq C_1\exp\biggl({-}\dfrac{\delta^2C_2}{4c_l} h^{-s_{\infty}}_n\biggr)\end{equation*}

where $C_1,C_2\gt 0$ are constants. As in Corollary 5.1, we can take the supremum over all $x\in D$ to obtain the desired result, which is formally stated as follows.

Theorem 5.2. There exist constants $C_1,C_2 \gt 0$ such that

\begin{equation*}\sup_{x\in D} P\bigl(|\,\hat{f}_n(x) - f(x)|\geq \delta\bigr) \leq C_1\exp\biggl({-}\dfrac{\delta^2C_2}{4c_l} h^{-s_{\infty}}_n\biggr).\end{equation*}

A natural question is whether this result can be extended to bound $P\bigl(\sup_x |\,\hat{f}_n(x)-f(x)|\geq \delta\bigr)$ , the tail probability of the estimation error over the domain, perhaps by combining Theorem 5.2 with the continuity of f and an epsilon-net partition of D. Such an argument could work if f and $\hat{f}_n$ were uniformly equicontinuous over all sample paths, a property known as ‘uniform modulus of continuity’ [Reference Marcus and Shepp22]. It is possible to obtain such results for generic GPs in certain settings [Reference Ciesielski9], but they are currently not available for the specific mechanism of Gaussian process regression; recent work, such as [Reference Lederer, Umlauft and Hirche17] and [Reference Toth and Oberhauser31], has only bounded the modulus of continuity on a restricted set of sample paths, not almost surely. This extension is an interesting topic for future work.

5.3. Other results of interest

In the following, we give several examples in which our main results can be made more explicit. To avoid excessive repetition, we focus on the uniform bound in Corollary 5.1 in our presentation, but analogs of the other results in Sections 5.1–5.2 can be straightforwardly obtained as well. For simplicity, let us take $D=[0,1]^d$ .

Gaussian kernel. Suppose that k is the Gaussian kernel with parameter $\alpha$ , that is, $k(x,x^*) = \exp\!({-}\alpha\|x-x^*\|^2_2)$ . For this particular kernel, it is known that $s_{\infty}$ can take arbitrarily large values. However, Theorem 11.22 of [Reference Wendland35] proves the bound

\begin{equation*}P_{X_n}(x) \leq \exp\biggl(c_\alpha \dfrac{\log h_n}{h_n}\biggr),\end{equation*}

where $c_\alpha$ depends only on $\alpha$ , d, and D. In addition, Corollary 12.4 in [Reference Wendland35] provides a modified version of Lemma 3.5 for this setting, namely,

\begin{equation*}\lambda_{\min}(K(X_n,X_n)) \geq c^{\prime}_\alpha\exp\biggl({-}40.71 \dfrac{d^2}{\alpha q^2_{X_n}}\biggr)q^{-d}_{X_n}.\end{equation*}

Thus, using the above results instead of Lemmas 2.1 and 3.5, we can repeat our analysis with

\[a_n \sim \exp\biggl({-}c_\alpha \dfrac{\log h_n}{h_n}\biggr)\]

and obtain, for example,

\begin{equation*}\sup_{x,x^*\in D}\pi_n(x,x^*) \leq c_1\exp\biggl({-}\delta^2 c_2\exp\biggl({-}\dfrac{c_\alpha}{2}\dfrac{\log h_n}{h_n}\biggr)\biggr)\end{equation*}

under the same assumptions as Corollary 5.1.

Uniform design. Consider a uniform grid, discretized evenly in each dimension, with n being the total number of points in the discretization. One can find that $h_n={\mathrm{O}}(n^{-{{1}/{d}}})$ , leading to the explicit rate

\begin{equation*}\sup_{x,x^*\in D}\pi_n(x,x^*) \leq \dfrac{C^{\prime}_1}{c_D} \exp\biggl({-}\dfrac{\delta^2 C_2}{4c_l(|T_{11}|-|T_{21}|)^2}n^{{{s_{\infty}}/{(2d)}}}\biggr)\end{equation*}

under the assumptions of Corollary 5.1.

Uniform random design. Suppose that the design points are sampled from a uniform distribution on $[0,1]^d$ . By adapting results in [Reference Janson13], one can show that $h_n={\mathrm{O}}(({{\log n}/{n}})^{{{1}/{d}}})$ , leading to the explicit rate

\begin{equation*}\sup_{x,x^*\in D}\pi_n(x,x^*) \leq \dfrac{C^{\prime}_1}{c_D} \exp\biggl({-}\dfrac{\delta^2 C_2}{4c_l(|T_{11}|-|T_{21}|)^2}\biggl(\dfrac{n}{\log n}\biggr)^{{{s_{\infty}}/{(2d)}}}\biggr)\end{equation*}

under the assumptions of Corollary 5.1. One can also extend this result to a setting with independent but non-uniform sampling. Suppose that the nth design point is sampled independently from some fixed density $g_n$ with support $[0,1]^d$ . Then one can show that

\begin{equation*}h_n = {\mathrm{O}}\biggl(\biggl(\dfrac{\log(c_g n)}{c_g n}\biggr)^{{{1}/{d}}}\biggr),\end{equation*}

and the rate follows.

We remark that the above discussion implicitly assumes that $||T_{11}|-|T_{21}||\notin\{0,1\}$ . However, the exceptions can be handled using the same arguments that were presented in Section 5.1.

6. General large deviations inequality

Let $\{Z_n\}$ be a sequence of random vectors taking values in $\mathbb{R}^p$ , and let $\mu_n$ denote the probability law of $Z_n$ . Let $\Psi_n$ be the cumulant-generating function of $Z_n$ , and let $\{a_n\}$ be a sequence satisfying $a_n\rightarrow\infty$ as $n\rightarrow\infty$ . Define $\Psi(\gamma)$ as in (3.2). The functions $\Psi_n$ and $\Psi$ are convex. Let $D_\Psi = \{\gamma\in\mathbb{R}^p\,:\, \Psi(\gamma) \lt \infty\}$ be the convex support set of $\Psi$ and note that $0\in D_\Psi$ .

Let I be the Fenchel–Legendre transform of $\Psi$ as in (3.3). The classical Gärtner–Ellis theorem [Reference Dembo and Zeitouni10] establishes the inequality (3.1) for any closed measurable set E, under the condition that the origin belongs to the interior of $D_\Psi$ . This condition will fail to hold in our setting, because we will consider situations in which $D_\Psi$ is a subspace of $\mathbb{R}^p$ . Thus it is necessary to prove (3.1) under weaker conditions.

In the following, let $\mathcal{P}$ be the orthogonal projection operator onto the subspace $D_\Psi$ , and define $\mathcal{P}E = \{\mathcal{P}u\,:\, u\in E\}$ to be the projection of any $E\subseteq \mathbb{R}^p$ . Let $\mu^{\mathcal{P}}_n$ be the probability law of the random variable $\mathcal{P}Z_n$ .

Our goal is to prove (3.1), for any closed measurable set E, under the assumption that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ . This is accomplished in three steps with progressively fewer assumptions on E. In the first two steps (Lemma 6.1), the large deviations inequality is proved for $\mathcal{P}E$ , with the first step making the additional assumption that this projected set is compact. The final step (Theorem 6.1) then proves the inequality for E.

Lemma 6.1. Suppose that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ , and $E\subseteq \mathbb{R}^p$ has the property that $\mathcal{P}E$ is compact and measurable. Then

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n(\mathcal{P}E) \leq -\inf_{u\in \mathcal{P}E} I(u).\end{equation*}

Proof. Let $I^\tau(u) = \min\{I(u)-\tau,{{1}/{\tau}}\}$ for $\tau \gt 0$ . By definition of this function, for any $u\in\mathcal{P}E$ we can pick $\gamma^u\in D_\Psi$ for which $\langle\gamma^u,u\rangle - \Psi(\gamma^u)\geq I^{\tau}(\gamma^u)$ . We can also pick $\rho^u$ such that $\rho^u\|\gamma^u\|\geq \tau$ and let $B_{u,\rho^u}$ be the closed ball of radius $\rho^u$ centered at u.

By Chebyshev’s inequality,

\begin{equation*}\mu^{\mathcal{P}}_n(G) =\mathbb{E}(1_{\{\mathcal{P}Z_n\in G\}}) \leq \mathbb{E}\Bigl[\!\exp\Bigl(\langle\gamma,\mathcal{P}Z_n\rangle - \inf_{u\in G}\langle\gamma,u\rangle\Bigr)\Bigr]\end{equation*}

for any n, $\gamma\in\mathbb{R}^p$ and measurable $G\subseteq D_\Psi$ . In particular,

\begin{equation*}\mu^{\mathcal{P}}_n(\mathcal{P} B_{u,\rho^u}) \leq \mathbb{E}[\!\exp(a_n\langle\gamma^u,\mathcal{P}Z_n\rangle)]\exp\Bigl({-}\inf_{u^{\prime}\in \mathcal{P} B_{u,\rho^u}}\langle a_n\gamma^u,u^{\prime}\rangle\Bigr).\end{equation*}

For any $u\in\mathcal{P}E$ ,

\begin{equation*}-\inf_{u^{\prime}\in B_{u,\rho^u}}\langle a_n\gamma^u,u^{\prime}\rangle \leq a_n\rho^u\|\gamma^u\| - a_n\langle\gamma^u,u\rangle \leq a_n\tau - a_n\langle\gamma^u,u\rangle,\end{equation*}

whence

(6.1)

\begin{align}\dfrac{1}{a_n}\log\mu^{\mathcal{P}}_n(\mathcal{P}B_{u,\rho^u}) &\leq \dfrac{1}{a_n}\log\mathbb{E}[\!\exp(a_n\langle\gamma^u,\mathcal{P}Z_n)] + \tau - \langle\gamma^u,u\rangle\notag\\&\leq \dfrac{1}{a_n}\log\mathbb{E}[\!\exp(\langle a_n\mathcal{P}\gamma^u,Z_n)] + \tau - \langle\gamma^u,u\rangle\\&= \dfrac{1}{a_n}\Psi_n(a_n\mathcal{P}\gamma^u) + \tau - \langle\gamma^u,u\rangle,\notag\end{align}

where (6.1) follows from the fact that $\mathcal{P}$ is self-adjoint.

Since $\mathcal{P}E$ is compact, we can select a finite covering from the open covering $\bigcup_{u\in\mathcal{P}E} B_{u,\rho^u}$ of $\mathcal{P}E$ . Let N be the number of balls in this covering, and denote their centers by $u_i$ , $i=1,\ldots,N$ . For simplicity, let $\gamma_i,\rho_i$ denote the corresponding $\gamma^u,\rho^u$ values. Then

\begin{equation*} \dfrac{1}{a_n}\log\mu^{\mathcal{P}}_n(\mathcal{P}E) \leq \dfrac{1}{a_n}\log N + \tau - \min_{1\leq i\leq N} \biggl\{\langle\gamma_i,u_i\rangle - \dfrac{1}{a_n}\Psi_n(a_n\mathcal{P}\gamma_i)\biggr\},\end{equation*}

and we can take the limsup of both sides to obtain

\begin{align*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^{\mathcal{P}}_n(\mathcal{P}E) &\leq \tau - \min_{1\leq i\leq n}\biggl\{\langle\gamma_i,u_i\rangle - \limsup_{n\rightarrow\infty}\dfrac{1}{a_n}\Psi_n(a_n\mathcal{P}\gamma_i)\biggr\}\\&= \tau - \min_{1\leq i \leq n}\{\langle\gamma_i,u_i\rangle - \Psi(\mathcal{P}\gamma_i)\}.\end{align*}

Recalling the properties of $\gamma_i$ , we arrive at

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^{\mathcal{P}}_n(\mathcal{P}E) \leq \tau - \min_{1\leq i \leq n}I^{\tau}(\gamma_i) \leq \tau - \inf_{u\in \mathcal{P}E} I^{\tau}(u).\end{equation*}

This holds for any $\tau \gt 0$ , so we take $\tau\searrow 0$ to prove the desired result.

Lemma 6.2. Suppose that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ , and $E\subseteq \mathbb{R}^p$ is closed and measurable. Then

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n(\mathcal{P}E) \leq -\inf_{u\in \mathcal{P}E} I(u).\end{equation*}

Proof. Let $u_1,\ldots,u_{\ell}$ be a basis for the subspace $D_\Psi$ , with $\ell \lt p$ being its dimensionality. Let $\mu^j_n$ denote the probability law of $\langle u_j,Z_n\rangle$ .

Let $\gamma=a_n u_j$ and take some $\zeta \gt 0$ . By Chebyshev’s inequality,

\begin{align*}\mu^j_n([\zeta,\infty)) &\leq \mathbb{E}\Bigl[\!\exp\Bigl(\langle a_n u_j,\mathcal{P}Z_n\rangle - \inf_{u\,:\, \langle u_j,u\rangle \geq \zeta}\langle a_n u_j,u\rangle\Bigr)\Bigr]\\&\leq \mathbb{E}[\!\exp(a_n\langle u_j,\mathcal{P}Z_n\rangle)]\exp\!({-}a_n\zeta),\end{align*}

whence

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^j_n([\zeta,\infty)) \leq \Psi(u_j) - \zeta \lt \infty.\end{equation*}

Consequently,

\begin{equation*}\lim_{\zeta\rightarrow\infty} \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^j_n([\zeta,\infty)) = -\infty\end{equation*}

for all $j=1,\ldots,\ell$ . Using symmetric arguments, one can also obtain

\begin{equation*}\lim_{\zeta\rightarrow\infty} \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^j_n(({-}\infty,-\zeta]) = -\infty.\end{equation*}

Now define the compact set $G_{\zeta} = \{u\in D_\psi\,:\, \langle u_j,u\rangle\in[-\zeta,\zeta] \ \textrm{for all} \, j = 1,\ldots,\ell\}$ We then derive

(6.2)

\begin{align}& \lim_{\zeta\rightarrow\infty} \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^{\mathcal{P}}_n(D_\psi \setminus G_{\zeta})\notag\\&\quad\leq \lim_{\zeta\rightarrow\infty} \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log \sum^{\ell}_{j=1} \mu^j_n(({-}\infty,-\zeta]) + \mu^j_n([\zeta,\infty))\notag\\& \quad\leq \lim_{\zeta\rightarrow\infty} \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\Bigl(2\ell \max_j\bigl\{\mu^j_n(({-}\infty,-\zeta]),\mu^j_n([\zeta,\infty))\bigr\}\Bigr)\notag\\ &\quad= -\infty,\end{align}

where the first inequality uses a union bound together with the monotonicity of probability measures.

Observing that $\mathcal{P}E\cap G_{\zeta}$ is compact, we can apply Lemma 6.1 to obtain

\begin{equation*}\limsup_{n\rightarrow\infty}\dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n(\mathcal{P}E\cap G_{\zeta}) \leq -\inf_{u\in\mathcal{P}E\cap G_{\zeta}} I(u) \leq -\inf_{u\in\mathcal{P}E} I(u).\end{equation*}

On the other hand, $\mathcal{P}E\cap G^c_{\zeta} \subseteq D_{\psi}\setminus G_{\zeta}$ , so

\begin{equation*}\limsup_{n\rightarrow\infty}\dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n\bigl(\mathcal{P}E\cap G^c_{\zeta}\bigr) \leq \limsup_{n\rightarrow\infty}\dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n(D_\psi\setminus G_{\zeta}).\end{equation*}

Combining both inequalities, we find that

\begin{align*}\limsup_{n\rightarrow\infty}\dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n(\mathcal{P}E) \leq 2\max\biggl\{-\inf_{u\in\mathcal{P}E} I(u), \limsup_{n\rightarrow\infty}\dfrac{1}{a_n}\log \mu^{\mathcal{P}}_n(D_\psi\setminus G_{\zeta})\biggr\}.\end{align*}

Taking $\zeta\rightarrow\infty$ and applying (6.2) yields the desired result.

Theorem 6.1. Suppose that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ , and $E\subseteq \mathbb{R}^p$ is closed and measurable. Then

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log \mu_n(E) \leq -\inf_{u\in E} I(u).\end{equation*}

Proof. We rewrite (3.3) as

\begin{equation*}I(u) = \sup_{\gamma\in D_{\Psi}} \langle \gamma,u\rangle - \Psi(\gamma),\end{equation*}

because $\Psi(\gamma)$ takes finite values only for $\gamma\in D_{\psi}$ . Observe, however, that

\begin{align*}\sup_{\gamma\in D_{\Psi}} \langle \gamma,u\rangle - \Psi(\gamma) &= \sup_{\gamma\in \mathbb{R}^p} \langle \mathcal{P}\gamma,u\rangle - \Psi(\gamma)\\&= \sup_{\gamma\in \mathbb{R}^p} \langle \gamma,\mathcal{P}u\rangle - \Psi(\gamma)\\&= I(\mathcal{P}u)\end{align*}

because $\mathcal{P}$ is self-adjoint. Therefore, by Lemma 6.2,

\begin{equation*}\limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu_n(E) \leq \limsup_{n\rightarrow\infty} \dfrac{1}{a_n}\log\mu^{\mathcal{P}}_n(\mathcal{P}E) \leq -\inf_{u\in\mathcal{P}E} I(u) \leq -\inf_{u\in\mathcal{E}} I(u),\end{equation*}

which completes the proof.

We remark that the large deviations inequality can be recovered under the weaker condition $0 \notin \textrm{rel int}(D_{\psi})$ , without requiring $D_{\psi}$ to be a subspace of $\mathbb{R}^p$ . However, this is beyond the needs of the present work so we do not give the proof here.

7. Conclusion

We have presented a theoretical framework that leverages the connections between Gaussian process regression and approximation theory to derive new moderate deviations inequalities for different types of error probabilities. The utility of these results is demonstrated through two applications of broad interest: probabilities of pairwise errors between fixed errors of points, and uniform tail probabilities for the pointwise estimation error. Furthermore, our results illustrate the effect of the kernel on the convergence rate.

It is difficult to say whether it is possible to improve on these bounds; perhaps this also depends on the class of kernels that is chosen. The main limitation of this work is that for purposes of tractability, we bound difficult posterior covariances by the much more tractable mesh norm. The mesh norm only measures the extent to which the design points are evenly spread out, and thus has limited ability to distinguish between different strategies for choosing the design points. We leave this problem for future work, noting that the results presented here are the first of their kind.

Funding information

The second author’s research was partially funded by the National Science Foundation (grant CMMI-2112828).

Competing interests

There were no competing interests to declare which arose during the preparation or publication process of this article.

References

Adler, R. J. (2000). On excursion sets, tube formulas and maxima of random fields. Ann. Appl. Prob. 10, 1–74.CrossRef Google Scholar

Ankenman, B., Nelson, B. L. and Staum, J. (2010). Stochastic kriging for simulation metamodeling. Operat. Res. 58, 371–382.CrossRef Google Scholar

Arcones, M. A. (2006). Large deviations for M-estimators. Ann. Inst. Statist. Math. 58, 21–52.CrossRef Google Scholar

Bect, J., Bachoc, F. and Ginsbourger, D. (2019). A supermartingale approach to Gaussian process based sequential design of experiments. Bernoulli 25, 2883–2919.CrossRef Google Scholar

Beknazaryan, A., Sang, H. and Xiao, Y. (2019). Cramér type moderate deviations for random fields. J. Appl. Prob. 56, 223–245.CrossRef Google Scholar

Bull, A. D. (2011). Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904.Google Scholar

Chan, H. P. and Lai, T. L. (2006). Maxima of asymptotically Gaussian random fields and moderate deviation approximations to boundary crossing probabilities of sums of random variables with multidimensional indices. Ann. Prob. 34, 80–121.CrossRef Google Scholar

Cheng, D. and Xiao, Y. (2016). Excursion probability of Gaussian random fields on sphere. Bernoulli 22, 1113–1130.CrossRef Google Scholar

Ciesielski, Z. (1961). Hölder conditions for realizations of Gaussian processes. Trans. Amer. Math. Soc. 99, 403–413.Google Scholar

Dembo, A. and Zeitouni, O. (2009). Large Deviations Techniques and Applications, 2nd edn. Springer, Berlin and Heidelberg.Google Scholar

Ghosal, S. and Roy, A. (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Statist. 34, 2413–2429.CrossRef Google Scholar

Glynn, P. W. and Juneja, S. (2004). A large deviations perspective on ordinal optimization. In Proceedings of the 2004 Winter Simulation Conference, ed. R. Ingalls et al., pp. 577–585.CrossRef Google Scholar

Janson, S. (1987). Maximal spacings in several dimensions. Ann. Prob. 15, 274–280.CrossRef Google Scholar

Johnson, M. E., Moore, L. M. and Ylvisaker, D. (1990). Minimax and maximin distance designs. J. Statist. Planning Infer. 26, 131–148.CrossRef Google Scholar

Jones, D. R., Schonlau, M. and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. J. Global Optimization 13, 455–492.CrossRef Google Scholar

Joseph, V. R., Gul, E. and Ba, S. (2015). Maximum projection designs for computer experiments. Biometrika 102, 371–380.CrossRef Google Scholar

Lederer, A., Umlauft, J. and Hirche, S. (2019). Uniform error bounds for Gaussian process regression with application to safe control. In Advances in Neural Information Processing Systems 32, ed. H. Wallach et al. Curran Associates, Red Hook, NY.Google Scholar

Lee, S. I., Mortazavi, B., Hoffman, H. A., Lu, D. S., Li, C., Paak, B. H., Garst, J. H., Razaghy, M., Espinal, M., Park, E., Lu, D. C. and Sarrafzadeh, M. (2014). A prediction model for functional outcomes in spinal cord disorder patients using Gaussian process regression. IEEE J. Biomed. Health Inform. 20, 91–99.CrossRef Google Scholar PubMed

Li, J. and Ryzhov, I. O. (2023). Convergence rates of epsilon-greedy global optimization under radial basis function interpolation. Stoch. Systems 13, 59–92.CrossRef Google Scholar

Li, X., Liu, J., Lu, J. and Zhou, X. (2018). Moderate deviation for random elliptic PDE with small noise. Ann. Appl. Prob. 28, 2781–2813.CrossRef Google Scholar

Lukić, M. and Beder, J. (2001). Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Trans. Amer. Math. Soc. 353, 3945–3969.CrossRef Google Scholar

Marcus, M. B. and Shepp, L. A. (1972). Sample behavior of Gaussian processes. In Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, vol. 2, ed. L. M. Le Cam et al., pp. 423–421. University of California Press, Berkeley and Los Angeles.Google Scholar

Pati, D., Bhattacharya, A. and Cheng, G. (2015). Optimal Bayesian estimation in random covariate design with a rescaled Gaussian process prior. J. Mach. Learning Res. 16, 2837–2851.Google Scholar

Pronzato, L. and Müller, W. G. (2012). Design of computer experiments: space filling and beyond. Statist. Comput. 22, 681–701.CrossRef Google Scholar

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.Google Scholar

Scott, W. R., Powell, W. B. and Simão, H. P. (2010). Calibrating simulation models using the knowledge gradient with continuous parameters. In Proceedings of the 2010 Winter Simulation Conference, ed. B. Johansson et al., pp. 1099–1109. IEEE, Piscataway, NJ.Google Scholar

Sheibani, M. and Ou, G. (2021). The development of Gaussian process regression for effective regional post-earthquake building damage inference. Comput.-Aided Civ. Infrastruct. Eng. 36, 264–288.CrossRef Google Scholar

Snoek, J., Larochelle, H. and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, ed. F. Pereira et al., pp. 2951–2959. Curran Associates, Red Hook, NY.Google Scholar

Srinivas, N., Krause, A., Kakade, S. M. and Seeger, M. W. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inform. Theory 58, 3250–3265.CrossRef Google Scholar

Teckentrup, A. L. (2020). Convergence of Gaussian process regression with estimated hyper-parameters and applications in Bayesian inverse problems. SIAM/ASA J. Uncertain. Quantif. 8, 1310–1337.CrossRef Google Scholar

Toth, C. and Oberhauser, H. (2020). Bayesian learning from sequential data using Gaussian processes with signature covariances. In Proceedings of the 37th International Conference on Machine Learning, ed. H. Daumé and A. Singh, pp. 9548–9560. PMLR, Cambridge, MA.Google Scholar

van der Hofstad, R. and Honnappa, H. (2019). Large deviations of bivariate Gaussian extrema. Queueing Systems 93, 333–349.CrossRef Google Scholar

Vazquez, E. and Bect, J. (2010). Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Statist. Planning Infer. 140, 3088–3095.CrossRef Google Scholar

Wang, W., Tuo, R. and Wu, C. F. J. (2020). On prediction properties of kriging: uniform error bounds and robustness. J. Amer. Statist. Assoc. 115, 920–930.CrossRef Google Scholar

Wendland, H. (2004). Scattered Data Approximation. Cambridge University Press.CrossRef Google Scholar

Wu, Z.-M. and Schaback, R. (1993). Local error estimates for radial basis function interpolation of scattered data. IMA J. Numer. Anal. 13, 13–27.CrossRef Google Scholar

Zhou, J. and Ryzhov, I. O. (2022). Technical note: A new rate-optimal sampling allocation for linear belief models. Operat. Res., to appear.Google Scholar

Article contents

Moderate deviations inequalities for Gaussian process regression

Abstract

Keywords

MSC classification

1. Introduction

2. Gaussian process regression and approximation theory

2.1. Definitions and assumptions

2.2. Approximation theory

3. Large deviations for a fixed pair of points

3.1. Analysis of cumulant-generating functions

3.2. Analysis of Fenchel–Legendre transform

3.3. Perturbation analysis for rate function

3.4. Main moderate deviations inequality

4. Extensions and special cases

4.1. Special cases of Theorem 3.1

4.2. Moderate deviations for a single point

5. Applications: pairwise comparisons and estimation error

5.1. Moderate deviations for false ordering

5.2. Moderate deviations for pointwise estimation error

5.3. Other results of interest

6. General large deviations inequality

7. Conclusion

Funding information

Competing interests

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests