1. Introduction
Given a compact domain $D\subseteq\mathbb{R}^d$ , let $\{\mathcal{E}(x)\}_{x\in D}$ be a centered Gaussian process on a probability space $(\Omega,\mathcal{F},P)$ . Define
where $m\,:\, D\rightarrow\mathbb{R}$ is a pre-specified ‘mean function’. Suppose that we are given the values $f(x_1),\ldots,f(x_n)$ of f at the design points $x_1,\ldots,x_n\in D$ . Then we can construct an estimator $\hat{f}_n$ of f using Gaussian process regression [Reference Rasmussen and Williams25]. This is a Bayesian method: for each x, $\hat{f}_n(x)$ is the conditional mean of the random variable f(x) given $f(x_1),\ldots,f(x_n)$ . The covariance function of the Gaussian process is used to infer the value of f at unobserved x from information collected about the design points.
Gaussian process regression is widely used to interpolate and predict the values of black-box functions in simulation calibration [Reference Scott, Powell and Simão26] and optimization [Reference Ankenman, Nelson and Staum2, Reference Jones, Schonlau and Welch15], biomedical applications [Reference Lee, Mortazavi, Hoffman, Lu, Li, Paak, Garst, Razaghy, Espinal, Park, Lu and Sarrafzadeh18], risk assessment of civil infrastructure [Reference Sheibani and Ou27], tuning of machine learning models [Reference Snoek, Larochelle and Adams28], and many other problems from diverse branches of science. In all such applications, f models the output of a complex system (physical or virtual), with x being the input. There is no closed form for f, but it is possible to observe f(x) at individual x values, e.g. by running expensive lab, field, or computer experiments with those particular inputs. The goal is to obtain accurate estimates at unobserved values using as few experiments as possible. Often, the function f represents a performance metric, such as the predictive power of a machine learning model with a given set of parameters, and the goal then becomes to optimize f(x) over $x\in D$ .
The analysis of this paper is motivated by concerns that arise in design of experiments, though we do not explicitly model any design problem. Our main contribution is a theoretical framework for studying the moderate deviations behavior of random vectors of the form $\bigl(\,\hat{f}_n(x),\hat{f}_n(x^*),f(x),f(x^*)\bigr)$ for two fixed but arbitrary points $x,x^*\in D$ . This framework can be used to prove new convergence rates for different types of ‘error probabilities’ related to GP regression. We demonstrate the usefulness of the theory with two specific applications, though others may be possible. The first application deals with probabilities of the form
where $\delta \gt 0$ is a small threshold. In words, it is given to us that $x^*$ has a smaller function value than x, but interpolation error may cause us to falsely reverse this ordering (the threshold $\delta$ makes (1.2) well-defined). When f is an objective function, this is the probability of reporting x as being ‘better’ than $x^*$ when in reality the opposite is the case. For this type of error probability, we leverage our theory to prove a new moderate deviations inequality
where $C_1,C_2,s\gt 0$ are constants depending on the specification of the Gaussian process, and
is the mesh norm measuring the density of the design points. In a special case where the design points are uniformly distributed on D, it has been shown [Reference Janson13] that $h_n$ is of order $({{\log n}/{n}})^{{{1}/{d}}}$ , which justifies the interpretation of (1.3) as a moderate deviations rate.
The second application deals with the estimation error $|\,\hat{f}_n(x)-f(x)|$ at arbitrary $x\in D$ . For this error, we prove the uniform bound
Both types of error probabilities are of broad interest in simulation, statistics, and uncertainty quantification. In particular, the pairwise comparison in (1.2) is motivated by the approach developed by Glynn and Juneja [Reference Glynn and Juneja12] for the ranking and selection problem, where one collects samples from a finite number of populations in an effort to select the one with the highest mean. The probability of correct selection can be related to the probability of false ordering between pairs of populations. The quantity $\pi_n(x,x^*)$ is the analog of this concept in the GP regression setting, with the additional complication that we are using a Bayesian model of f, so the event in (1.2) can only be viewed as an error conditionally given $f(x) \geq f(x^*)$ .
Interestingly, the mesh norm is in use as a criterion for design of experiments, in the literature on so-called space-filling designs [Reference Joseph, Gul and Ba16, Reference Pronzato and Müller24]. As early as Johnson et al. in 1990 [Reference Johnson, Moore and Ylvisaker14], statisticians proposed spreading out the design points in D in a way that essentially minimizes the mesh norm. From (1.3), we can see that this has the effect of speeding up the rate at which (1.2) converges to zero, uniformly over all $x,x^*$ . Essentially, if we have no specific $x^*$ to serve as the reference solution, we can view space-filling designs as a way to minimize the probability of false ordering across all possible $x^*$ .
The available theory for Gaussian process regression has extensively studied (pointwise) consistency; see e.g. [Reference Bect, Bachoc and Ginsbourger4] or [Reference Ghosal and Roy11]. The rate at which $\hat{f}_n$ converges to f has been studied, e.g. in [Reference Teckentrup30] and [Reference Wang, Tuo and Wu34], but these papers do not consider tail probabilities, and so their rates have completely different orders to (1.3)–(1.4), though their analysis also makes some connections to the mesh norm. A different, less directly related stream of literature focuses on online optimization problems where the goal is to maximize the sum of the function values of the design points; a representative example of this type of work is that of Srinivas et al. [Reference Srinivas, Krause, Kakade and Seeger29]. In general, many of the existing rate results are derived for specific classes of kernels, such as squared exponential [Reference Pati, Bhattacharya and Cheng23] and Matérn [Reference Teckentrup30], or specific choices of the design points [Reference Bull6]. Probabilists have investigated tail probabilities [Reference Adler1, Reference Ghosal and Roy11] and excursion probabilities [Reference Chan and Lai7, Reference Cheng and Xiao8] of Gaussian processes, and some of these results also have the form of moderate deviations laws, but they pertain to generic GPs rather than the GP regression mechanism. Analogously, moderate deviations laws are available for sums of random variables [Reference Beknazaryan, Sang and Xiao5] and random PDE models [Reference Li, Liu, Lu and Zhou20], but these results do not pertain to GP regression either.
To our knowledge, this paper presents the first moderate deviations results for Gaussian process regression estimators. It is well known [Reference Dembo and Zeitouni10] that sample averages of independent and identically distributed (i.i.d.) Gaussian observations satisfy large deviations laws. Similar laws hold for ordinary least-squares estimators under Gaussian residuals [Reference Zhou and Ryzhov37], extrema of Gaussian vectors [Reference van der Hofstad and Honnappa32], and various finite-dimensional statistical estimators [Reference Arcones3]. Gaussian process regression can be viewed as an infinite-dimensional generalization of linear regression, but the analysis is made much more complicated because, essentially, the dimensionality of the objects used to construct the estimator grows over time, and their asymptotic behavior heavily depends on the covariance kernel. One could perhaps recover large deviations laws for certain specific choices of the kernel and design, but it is far from clear whether this is possible in general. In the process of proving our results, we also establish a modified version of the Gärtner–Ellis theorem [Reference Dembo and Zeitouni10], which may be of stand-alone interest.
Section 2 describes the GP regression framework, states the assumptions used throughout the paper, and gives important technical preliminaries. Section 3 gives the bulk of our analysis, which relies on a general large deviations law for random vectors. This latter result also requires some new technical developments, but since they are unrelated to GP regression, they are deferred to Section 6 for readability. Section 4 handles several extensions not covered by the main theorem. Section 5 uses our analysis to derive (1.3) and (1.4), and presents several more explicit examples. Section 7 concludes the paper.
2. Gaussian process regression and approximation theory
Before we begin the analysis that leads to (1.3)–(1.4), it is necessary to understand the definition, construction, and properties of the GP regression estimator $\hat{f}_n$ . The computation of the estimator (a form of Bayesian updating) is described in Section 2.1. Our analysis also makes use of an alternative interpretation, originating from approximation theory, of the GP regression model. We use this theory to study the asymptotic behavior of the posterior covariance function, which is crucial to the tail probabilities of $\hat{f}_n$ . The relevant technical preliminaries are given in Section 2.2.
2.1. Definitions and assumptions
Recalling the model in (1.1), we assume that the mean function m is Lipschitz-continuous, and the Gaussian process $\mathcal{E}$ is specified by
for all $x,x^{\prime}\in D$ . In one application, we will assume that $k\,:\, D\times D\rightarrow\mathbb{R}$ is a fixed, symmetric kernel function mapping $D\times D$ into $\mathbb{R}_+$ . The kernel is required to be positive definite, meaning that for any n, any set of n distinct design points $\{x_m\}^n_{m=1}\subseteq D$ , and any vector $v = (v_1,\ldots,v_n)$ in $\mathbb{R}^n$ , we have
Without this assumption, the Gaussian process would be degenerate.
In addition, we assume that there exists a function $\phi$ on $\mathbb{R}_+$ such that $k(x,x^{\prime}) = \phi(\|x-x^{\prime}\|)$ for all x, x ′. Such a $\phi$ is called a radial basis function. We assume that $\phi$ is twice differentiable at zero with $\phi^{\prime\prime}(0)\lt 0$ . Many commonly used covariance kernels satisfy this requirement, including Gaussian, multiquadric, inverse quadratic, inverse multiquadric, and others.
Let $X_n = \{x_m\}^n_{m=1}$ denote the set of design points. We treat the design points as a deterministic sequence, as is standard in the literature on design of experiments, and assume that $X_n$ becomes dense in D as $n\rightarrow\infty$ , a common condition in the theoretical literature [Reference Vazquez and Bect33]. For convenience, we introduce the notation
as well as $K(x,X_n) = K(X_n,x)^\top$ . We also let $K(X_n,X_n)$ denote the matrix whose (m, m ′)th entry is $k(x_m,x_{m^{\prime}})$ .
Given the design points $X_n$ and observations $f(x_n)$ , the posterior distribution of f(x), at any arbitrary $x\in D$ , is Gaussian with mean
and variance
This specific structure of the mean and variance is what is referred to by the name of Gaussian process regression. The variance $P_{X_n}(x)$ , viewed as a function of x, is also sometimes called the ‘power function’ in the literature on interpolation. In this paper we use the posterior mean $\hat{f}_n$ to interpolate the observed function values $f(X_n)$ over the design space D and make predictions at unobserved points.
We let $\mathcal{H}$ denote the reproducing kernel Hilbert space (RKHS) whose reproducing kernel is k. The construction and uniqueness of $\mathcal{H}$ are discussed by Wendland [Reference Wendland35]. For our purposes, it is sufficient to review the following properties. Letting $\langle \cdot,\cdot\rangle_{\mathcal{H}}$ be the inner product of $\mathcal{H}$ , we know that
-
(1) $k(\cdot,x)\in\mathcal{H}$ for all $x\in D$ ,
-
(2) $g(x) = \langle g,k(\cdot,x)\rangle_{\mathcal{H}}$ for all $g\in\mathcal{H}$ and $x\in D$ ,
-
(3) $k(x,x^{\prime}) = \langle k(\cdot,x),k(\cdot,x^{\prime})\rangle_{\mathcal{H}}$ for all $x,x^{\prime}\in D$ .
Additionally, from the usual properties of the inner product, we have the Cauchy–Schwarz inequality $|\langle g_1,g_2\rangle_{\mathcal{H}}| \leq \|g_1\|_{\mathcal{H}}\|g_2\|_{\mathcal{H}}$ , where $\|\cdot\|_{\mathcal{H}}$ is the norm induced by the inner product.
2.2. Approximation theory
With the assumptions made in Section 2.1, Gaussian process regression can be seen as a special case of radial basis function (RBF) interpolation, enabling us to make use of some results from interpolation theory. We should note, however, that this theory treats interpolation models as purely deterministic, and thus has assumptions and interpretations very different to GP regression. Below, we present key facts from the theory that will be important for our analysis, and discuss their applicability to our setting when necessary.
Like GP regression, RBF interpolation requires a kernel k with the properties described in Section 2.1, as well as a matrix $X_n$ describing n design points. Recall that under these assumptions we have $k(x,x^{\prime}) = \phi(\|x-x^{\prime}\|)$ . Let $\mathcal{L}_{k,X_n}$ denote the operator mapping some fixed function $g\,:\, D\rightarrow\mathbb{R}^d$ to its interpolant according to
where the coefficients $\alpha_m$ solve the linear system
In fact Wu and Schaback [Reference Wu and Schaback36] presented a more general form where (2.2)–(2.3) include additional polynomial functions, but this will not be necessary for our purposes. It can be shown that $\mathcal{L}_{k,X_n}g(x) = K(x,X_n)K(X_n,X_n)^{-1}g(x_n)$ , similar to the calculations used in GP regression.
Let $\tilde{g}$ be the Fourier transform of g, and suppose that the generalized Fourier transform (as defined in Section 8.2 of [Reference Wendland35]) of the function $x\mapsto \phi(\|x\|)$ exists and coincides with a continuous function $\tilde{\phi}$ on $\mathbb{R}^d\setminus \{0\}$ satisfying
for suitable constants $c_{\tilde{\phi}},s_{\infty} \gt 0$ . In particular, the constant $s_{\infty}$ will play a significant role in our analysis, and it is worth pointing out that this quantity is explicitly computable for a variety of commonly used kernels. For example, for Gaussian kernels $s_{\infty}$ can take on any arbitrarily large value (this special case is treated separately in Section 5.3), while for the Matérn kernel, Teckentrup [Reference Teckentrup30] showed that $s_{\infty}=2\sigma$ where $\sigma$ is the kernel smoothness parameter. Other examples are given in Section 8.3 of [Reference Wendland35].
We now define
The results below require $c^2_{g,\phi} \lt \infty$ , which essentially means that g resides in the RKHS whose reproducing kernel is k. Before stating these results, we should make it clear that we will not require $c^2_{f,\phi}\lt\infty$ , i.e. we will not apply the above definitions with f as the choice of g. Lukić and Beder [Reference Lukić and Beder21] showed that a sample from a GP prior is almost surely not in the RKHS induced by the kernel assumed in the prior. Therefore it is not possible for the function f to satisfy $c^2_{f,\phi}\lt\infty$ . This is a major difference between GP regression and interpolation theory, where f is modeled as a deterministic function and so the condition $c^2_{f,\phi}\lt\infty$ is seen as fairly innocuous (for example, it is assumed in [Reference Wu and Schaback36] and many other papers in pure interpolation theory, e.g. [Reference Li and Ryzhov19]). In the present work, however, we cannot make this assumption, and will instead apply this framework to other choices of g related to the kernel, for example the function $k(\cdot,x)$ for fixed x.
For any compact $E\subseteq D$ , let $h_n(E) = \max_{y\in E}\min_{m=1,\ldots,n}\|y-x_m\|_2$ be the mesh norm of E. We slightly abuse notation by using $h_n$ to denote $h_n(D)$ when the entire domain is considered. Let $B_{x,\rho}$ denote the closed ball of radius $\rho$ centered at $x\in\mathbb{R}^d$ . We can now state the results that will be referenced and applied throughout this paper.
Lemma 2.1. ([Reference Wu and Schaback36].) Fix $\rho \gt 0$ and assume that the kernel k satisfies (2.4) with some $s_{\infty}$ . Then there exist positive constants $\bar{h}$ and $c_P$ such that for any $X_n$ and any point $x\in\mathbb{R}^d$ with $h_n(B_{x,\rho}\cap D) \lt \bar{h}$ , the power function $P_{X_n}$ defined in (2.1) satisfies
Lemma 2.2. ([Reference Wu and Schaback36].) Fix g satisfying $c_{g,\phi}\lt\infty$ and assume that the kernel k satisfies (2.4) with some $s_{\infty}$ . Then, for any $X_n$ and any $x\in\mathbb{R}^d$ , we have
We note that in Lemma 2.1 the constant $c_P$ depends only on d and $s_{\infty}$ but not on the fixed value $\rho$ . The same is true of the ratio ${{\bar{h}}/{\rho}}$ , indicating that $\bar{h}$ is proportional to $\rho$ .
3. Large deviations for a fixed pair of points
We now fix $x,x^*\in D$ and focus on the sequence of random vectors
Letting $\mu_n$ be the probability law of $Z_n$ , we will derive the inequality of the form
which holds for any closed measurable set E and any sequence $\{a_n\}$ satisfying $\lim_{n\rightarrow\infty} a_n = \infty$ . Our end goal is to characterize the function I and specify $a_n$ and E in a manner that causes (3.1) to yield results such as (1.3).
Inequalities of the form (3.1) can be obtained by invoking the Gärtner–Ellis theorem from large deviations theory [Reference Dembo and Zeitouni10]. In general, the derivation of the function I consists of two main steps. The first step is to derive the scaled limit
where $\Psi_n(\gamma) = \log\mathbb{E}_{\mu_n}\bigl({\mathrm{e}}^{\langle \gamma,Z_n\rangle}\bigr)$ denotes the cumulant-generating function of $Z_n$ , and $\langle\cdot,\cdot\rangle$ is the usual $L^2$ inner product on $\mathbb{R}^p$ . The scaling $\Psi$ is allowed to take values on the extended number line (i.e. may be $+\infty$ for some $\gamma$ ). The second step obtains I via the Fenchel–Legendre transform
of $\Psi$ . Inequality (3.1) then follows under certain technical conditions on $\Psi$ .
Our analysis in this section follows this outline. Section 3.1 studies the cumulant-generating functions of $\{Z_n\}$ and characterizes $\Psi$ . Section 3.2 then studies the Fenchel–Legendre transform of $\Psi$ , and Section 3.3 analyzes the convergence rates of various terms in the Fenchel–Legendre transform, which helps us to select $\{a_n\}$ in a way that makes (3.1) yield an informative inequality (stated in Section 3.4). In Section 5 we will instantiate this inequality with certain choices of E that yield (1.3)–(1.4).
One complication that arises in this procedure is that the technical conditions of the classical Gärtner–Ellis theorem do not hold in our setting, so additional analysis is required to obtain (3.1). This analysis is carried out in the setting of general random vectors, and thus is somewhat tangential to the setting of Gaussian process regression. For this reason we defer it to Section 6 at the end of the paper; the core result of that section is Theorem 6.1, which recovers the desired inequality. Here we take (3.1) as given, referring readers to Theorem 6.1 for the proof, and focus on applying this inequality to the specific sequence $\{Z_n\}$ .
3.1. Analysis of cumulant-generating functions
We write $Z_n$ as
where $\bar{X}_n = X_n\cup\{x,x^*\}$ . The distribution of $Z_n$ is Gaussian with mean vector $A_nm(\bar{X}_n)$ and covariance matrix $V_n = A_n\Sigma_nA^\top_n$ , where
For convenience, we introduce the notation
Then the power function $P_{X_n}$ in (2.1) can be written as $P_{X_n}(x) = k(x,x) - Q_{X_n}(x)$ . We also use the analogous notation $P_{X_n}(x,x^*) = k(x,x^*) - Q_{X_n}(x,x^*)$ . With some trivial computation, we obtain
Since $Z_n$ follows a multivariate normal distribution, it follows straightforwardly that
for any $\gamma\in\mathbb{R}^4$ . Then, by (3.2),
provided that the limit on the right-hand side of (3.4) exists.
To study these limits, it is helpful to observe that $Q_{X_n}(x,x^*)$ can be viewed as the RBF interpolant of the function $k(x,\cdot)$ evaluated at the point $x^*$ , or, equivalently, the RBF interpolant of $k(x^*,\cdot)$ evaluated at the point x. This allows us to leverage the results from approximation theory that were stated in Section 2.2.
First, we consider the limit of
We may observe that $\mathcal{L}_{k,X_n}m(x)$ , with $\mathcal{L}_{k,X_n}$ being the operator that maps a function to its interpolant given k and $X_n$ as in (2.2), is a (differentiable) linear combination of the values $k(x,x_m)$ . Hence the difference $y\mapsto m(y) - \mathcal{L}_{k,X_n}m(y)$ is a Lipschitz function whose zeros become dense (as $n\rightarrow\infty$ ) around x and $x^*$ . That is, there exist $\rho,\rho^*\gt 0$ such that $h_n(B_{x,\rho}\cap D)\rightarrow 0$ and $h_n(B_{x^*,\rho^*}\cap D)\rightarrow 0$ as $n\rightarrow \infty$ . Consequently, $m(x)-\mathcal{L}_{k,X_n}m(x)\rightarrow 0$ , whence $\lim_{n\rightarrow\infty} A_nm(\bar{X}_n) = m_0$ , with $m_0 = (m(x),m(x^*),m(x),m(x^*))^\top$ . Thus (3.4) becomes
The precise behavior of the limit superior will depend on $a_n$ and the asymptotics of the matrix $V_n$ .
3.2. Analysis of Fenchel–Legendre transform
We begin by examining the limit of $V_n$ . It is easy to see that $P_{X_n}(y)\geq 0$ for all $y\in D$ . Furthermore, by Lemma 2.1 we can see that $P_{X_n}(y)\rightarrow 0$ if $X_n$ is dense in D as $n\rightarrow\infty$ . This implies that $Q_{X_n}(x) \rightarrow k(x,x)$ and similarly $Q_{X_n}(x^*)\rightarrow k(x^*,x^*)$ , with $k(x,x)=k(x^*,x^*)$ by the properties of the radial basis function. Although we do not know the sign of $P_{X_n}(x,x^*)$ , we can note that
By Lemma 2.2, we have $P_{X_n}(x,x^*)^2 \leq c^2_{k(\cdot,x^*),\phi}P_{X_n}(x)$ . The finiteness of $c_{k(\cdot,x^*),\phi}$ can be verified. Therefore, if $X_n$ is dense in D as $n\rightarrow\infty$ , we have $P_{X_n}(x,x^*)\rightarrow 0$ , whence $Q_{X_n}(x,x^*)\rightarrow k(x,x^*)$ . Thus we have shown that $V_n\rightarrow V$ entrywise, where
It is easy to verify that V has eigenvalues $\lambda_1 = 2(k(x,x)+k(x,x^*))$ , $\lambda_2 = 2(k(x,x)-k(x,x^*))$ with respective eigenvectors
and $\lambda_3=\lambda_4=0$ with respective eigenvectors
Similarly, let $\lambda_{i,n}$ and $U_{i,n}$ (for $1\leq i\leq 4$ ) denote the eigenvalues and corresponding eigenvectors of $V_n$ . Since $V_n\rightarrow V$ , we also have $\lambda_{i,n}\rightarrow\lambda_i$ . Accordingly, we also have $U_{1,n}\rightarrow U_1$ and $U_{2,n}\rightarrow U_2$ . However, the zero eigenvalue of V has multiplicity 2, so $U_{3,n},U_{4,n}$ will converge to limits $U^{\prime}_3,U^{\prime}_4$ that belong to the span of $U_3,U_4$ , but these limits need not be $U_3,U_4$ themselves. We know, however, that
where $T\in\mathbb{R}^{2\times 2}$ is an orthonormal matrix.
Looking back to (3.3) and (3.5), we can write the Fenchel–Legendre transform as
Observe that $\lim_{n\rightarrow\infty} U^\top_i U_{j,n} = 1_{\{i=j\}}$ , whence
as long as $\gamma_j\neq 0$ for $j\in\{1,2\}$ . Therefore the supremum in (3.3) can only be achieved at $\gamma$ for which $\gamma_1=\gamma_2=0$ , whence
The supremum value in (3.3) is thus governed by the rate at which $a_n$ increases. If this rate is fast, we will have to take $\gamma_3=\gamma_4=0$ , leading to $I=0$ . To avoid this situation, $a_n$ should be assigned the highest order that makes one of the limits superior in (3.8) finite. Some matrix perturbation analysis is required to understand the rate that $a_n$ can take.
3.3. Perturbation analysis for rate function
Define the notation
Let us also write $U_{j,n} = \sum_i \nu_{ijn}U_i$ . Then
where the last line follows because $\lambda_i$ is an eigenvalue (and $U_i$ is an eigenvector) of V. Left-multiplying by the unit vector $U_i$ , we obtain
Recalling that $\lambda_3=\lambda_4=0$ and $\lambda_{j,n}\gt 0$ , we find that
This allows us to bound the limits superior in (3.8) as shown in Lemmas 3.1 and 3.2 below.
Lemma 3.1. For fixed $\gamma_3,\gamma_4\in\mathbb{R}$ , we have
Proof. Using (3.10), we write
Plugging in the closed-form expressions for $U_3,U_4$ from (3.6) yields
Since $\gamma_3,\gamma_4$ are fixed and $U_3,U_4$ are unit vectors, the Cauchy–Schwarz inequality yields
as desired.
Lemma 3.2. Suppose that the matrix T defined in (3.7) has no zero-valued entries. Then
for $j\in\{3,4\}$ .
Proof. Recall (3.9) and note that $\nu_{ijn}\rightarrow T_{i-2,j-2}$ for $i,j\in\{3,4\}$ . Since T is assumed to have no zero-valued entries, we do not need to worry about zero values of $\nu_{ijn}$ . Then (3.10) can be rewritten as
The first equality in (3.11) can be obtained by setting $i=3$ , whence (3.12) yields
By expressing (0, 0, 1, 0) and (0, 0, 0,1) in terms of $U_i$ , we obtain
where the last line follows from the fact that $\nu_{ijn}\rightarrow 0$ for $i\in\{1,2\}$ and $j\in\{3,4\}$ , while $\nu_{ijn}\rightarrow T_{i-2,j-2}$ for $i,j\in\{3,4\}$ . The second equality in (3.11) can be obtained by repeating the above arguments with $i=4$ .
The analysis in Lemma 3.2 is easily extended to handle situations where T has zero-valued entries. If this occurs, we must have $||T_{11}|-|T_{21}||=1$ because T is orthonormal. In the first case we can repeat the proof of Lemma 3.2 with $i=4$ , $j=3$ and $i=3$ , $j=4$ , and obtain
In the second case we repeat the same proof with $i=3$ , $j=3$ and $i=4$ , $j=4$ , and obtain
In general, the bounds in Lemmas 3.1 and 3.2 depend on $P_{X_n}(x,x^*)$ , which is a difficult object to study. Lemma 3.3 establishes a bound that relates this quantity to a simpler function of the design points. Then Lemma 3.4 derives a similar lower bound on $P_{X_n}(x)$ . Note that these results provide lower bounds; we will later multiply them by negative quantities to convert them to upper bounds, which will enable additional analysis of the terms in (3.8).
Lemma 3.3. Let $\lambda_{\min}({\cdot})$ denote the smallest eigenvalue of a square matrix. The following bound holds:
Proof. For notational convenience, define a vector $\kappa(x) = K(x,X_n)K(X_n,X_n)^{-1}$ . Note that $\kappa(x)$ takes values in $\mathbb{R}^n$ , and observe the identities
We extend $\kappa$ to $\mathbb{R}^{n+2}$ by taking $\kappa_{n+1},\kappa_{n+2}\equiv -\frac{1}{2}$ . Plugging in the above identities, we derive
Thus we arrive at
which completes the proof.
Lemma 3.4. Let $X^{\prime}_n = X_n\cup\{x\}$ . Then
Proof. Define $\kappa(x)$ as in the proof of Lemma 3.3 and extend it to $\mathbb{R}^{n+1}$ by taking $\kappa_{n+1}\equiv -1$ . Then, by repeating the arguments in the proof of Lemma 3.3, we obtain
as desired.
Now, we can study the rate at which $\lambda_{\min}(K(X^{\prime}_n,X^{\prime}_n))$ or $\lambda_{\min}(K(\bar{X}_n,\bar{X}_n))$ converges to zero. For this, we cite the following result (Theorem 12.3 of [Reference Wendland35]).
Lemma 3.5. ([Reference Wendland35].) Define $q_{X_n} = \min_{x_m\neq x_{m^{\prime}}} \|x_m-x_{m^{\prime}}\|_2$ and $\phi_0(M) = \inf_{\|y\|_2\leq M} \tilde{\phi}(y)$ , where $\tilde{\phi}$ is the generalized Fourier transform of the radial basis function $\phi$ . Then
where the constants $C_d,M_d$ depend only on d.
Combining Lemmas 3.3–3.5, we have
Consequently, the inequality in Lemma 3.3 becomes
The lower bound in (3.14) can be connected back to the upper bound obtained in Lemma 3.2 in the following manner. Note that if T has no zero-valued entries as assumed in Lemma 3.2, by orthogonality we either have ${{T_{11}}/{T_{21}}}\gt 0$ and ${{T_{12}}/{T_{22}}}\lt 0$ or vice versa (note also that $T_{11}=-T_{22}$ ). Without loss of generality, we only treat the first case here.
Supposing that ${{T_{12}}/{T_{22}}}\lt 0$ , we apply (3.14) to (3.11) with $j=4$ and argue that
In this derivation, (3.15) uses the fact that ${{T_{12}}/{T_{22}}}\lt 0$ to convert the lower bound in (3.14) into an upper bound, while (3.16) applies (2.4) to bound $\phi_0$ . Noting that the multipliers $1-{{T_{22}}/{(2T_{12})}}$ and $- {{T_{22}}/{(2T_{12})}}$ are both strictly positive, we then obtain (3.17) by applying Lemma 2.1 together with the fact that
Next we return to (3.11) with $j=3$ and obtain
by using the Cauchy–Schwarz inequality for the RKHS inner product to produce the simple bound
Applying Lemma 2.1, we conclude that $\lambda_{3,n} = {\mathrm{O}}(h_n(B_{x,\rho}\cap D)^{ {{s_{\infty}}/{2}} })$ . Finally, applying Lemma 2.1 to the bound in Lemma 3.1 straightforwardly yields
When $X_n$ is dense in D as $n\rightarrow\infty$ , we have $h_n(B_{x,\rho}\cap D)\leq h_n(D)$ with $h_n(D)\rightarrow 0$ . Thus, among the limits superior in (3.8), one is ${\mathrm{O}}\bigl(h^{ {{s_{\infty}}/{2}} }_n\bigr)$ and the others are ${\mathrm{O}}\bigl(h^{s_{\infty}}_n\bigr)$ . This will also happen in the symmetric situation where ${{T_{12}}/{T_{22}}}\gt 0$ , but with the order switched for $\lambda_{3,n}$ and $\lambda_{4,n}$ .
3.4. Main moderate deviations inequality
The conclusions of Section 3.3 suggest that $a_n$ should have the exact order $h_n^{- {{s_{\infty}}/{2}} }$ , which we denote by $a_n\sim h_n^{- {{s_{\infty}}/{2}} }$ . Then we obtain
in (3.8), and bound $I(u)\geq I^l(u)$ , where $I^l$ is defined as
when ${{T_{12}}/{T_{22}}}\lt 0$ , and
when ${{T_{12}}/{T_{22}}}\gt 0$ , for some suitable constants $c_3,c_4$ . Furthermore, recalling (3.6) and the definition of $m_0$ , we find that $m^\top_0 U_3 = m^\top_0 U_4 = 0$ . Now, applying (3.1), we can finally state our main result.
Theorem 3.1. Let T be as in (3.7), take $a_n \sim h_n^{- {{s_{\infty}}/{2}} }$ , and let $c_l$ be a constant satisfying $\limsup_{n\rightarrow\infty} a_n\lambda_{j,n}\leq c_l$ for $j\in\{3,4\}$ . If $||T_{11}|-|T_{21}||\notin \{0,1\}$ , we have
for any closed $E\subseteq \mathbb{R}^4$ , with
Throughout this analysis we have assumed that $X_n$ is dense in D when $n\rightarrow\infty$ , but it is possible to recover Theorem 3.1, for fixed $x,x^*$ , as long as the design is dense only in neighborhoods of those two points, e.g. in $B_{x,\rho}\cup B_{x^*,\rho}$ for some $\rho\gt 0$ . In that case $a_n$ will take the order of $\min\{h_n(B_{x,\rho})^{- {{s_{\infty}}/{2}} },h_n(B_{x^*,\rho})^{- {{s_{\infty}}/{2}} }\}$ .
The right-hand side of (3.18) is some strictly negative, problem-specific constant, and it is the order of $a_n$ that governs the convergence rate of $\mu_n(E)$ . The moderate deviations inequality allows the rate to depend on the kernel through the quantity $s_{\infty}$ , but otherwise the complex interdependence of the various elements of $K(X_n,X_n)$ has been streamlined using Lemmas 2.1 and 3.5. For this reason, the bound in (3.18) may not be the tightest possible.
4. Extensions and special cases
In this section we treat several special cases not covered by Theorem 3.1. First, Section 4.1 handles the situation where $||T_{11}|-|T_{21}||\in \{0,1\}$ . Section 4.2 discusses how the bound of Theorem 3.1 is improved if we only consider one reference point x instead of two points $x,x^*$ .
4.1. Special cases of Theorem 3.1
Again, let T be as in (3.7), and suppose that $||T_{11}|-|T_{21}||=0$ . For simplicity we consider the case $T_{11}=T_{21}$ , as the other cases are very similar.
Because T is orthonormal, we must have $T_{11}=T_{21}={{\sqrt{2}}/{2}}$ . Then, returning to the derivation of (3.8) and recalling that $m^\top_0 U_3 = m^\top_0 U_4 = 0$ , we have
Using an argument similar to that in Lemma 3.1, we explicitly derive
By direct application of Lemma 3.2, we have
Let us now take $a_n\sim h^{-s_{\infty}}_n$ , which guarantees
for some finite $c_l \gt 0$ . This allows us to place a bound on (4.1) that drops all negligible terms in (4.2) that contain $P_{X_n}(x)$ and $P_{X_n}(x^*)$ . That is,
The last term in (4.3) requires us to take $\gamma_3\gamma_4\leq 0$ to avoid I(u) becoming $-\infty$ . Then (4.3) becomes
where (4.4) applies Lemma 3.3 to bound $P_{X_n}(x,x^*)$ , and (4.5) uses the bound $P^2_{X_n}(x,x^*) \leq c^2_{k(\cdot,x^*),\phi}P_{X_n}(x)$ from Lemma 2.2. After some algebra, we obtain a version of the moderate deviations inequality (3.18) with
The second special case $||T_{11}|-|T_{21}||=1$ is handled in almost the same way. For simplicity we take $T_{11}=T_{22}=0$ , as the other possible cases are very similar. Then (4.2) remains unchanged, but we use (3.13) instead of Lemma 3.2. We can thus omit the final cross-term $-4\gamma_3\gamma_4P_{X_n}(x,x^*)$ in (4.3), so we no longer need $\gamma_3\gamma_4\leq 0$ . The other arguments are unchanged and yield another version of (3.18) with
4.2. Moderate deviations for a single point
In the following, we treat the special case where $Z_n = \bigl(\,\hat{f}_n(x),f(x)\bigr)^\top$ , for a single, fixed $x\in D$ . Although one can obtain moderate deviations inequalities for such $Z_n$ directly from Theorem 3.1, it is possible to tighten the bounds by modifying the analysis. Below we highlight those parts of the argument that require changes.
As before, we let $\mu_n$ denote the probability law of $Z_n$ and assume that the set $X_n$ of design points becomes dense in D as $n\rightarrow\infty$ . We write $Z_n = A_n f(X^{\prime}_n)$ , where $X^{\prime}_n = X_n\cup\{x\}$ and
The covariance matrix of $Z_n$ is given by
where $Q_{X_n}$ is as defined in Section 3.1. Equation (3.5) remains unchanged with $m_0 = (m(x),m(x))^\top$ . Repeating the arguments in Section 3.2, we have $V_n\rightarrow V$ entrywise, where V is a $2\times 2$ matrix whose elements are all equal to k(x, x). This matrix has two eigenvalues $\lambda_1 \gt 0$ and $\lambda_2 = 0$ , with corresponding eigenvectors
Similarly, let $\lambda_{i,n}$ and $U_{i,n}$ denote the eigenvalues and eigenvectors of $V_n$ .
Again following Section 3.2, the supremum in (3.3) can only be achieved at $\gamma\in\mathbb{R}^2$ satisfying $\gamma_1 = 0$ . Repeating the derivation of (3.8), we obtain
where $\limsup_{n\rightarrow\infty}\bigl(\gamma_2U^\top_2U_{2,n}\bigr)^2 = \gamma^2_2$ and
analogously to Lemma 3.1. Repeating the proof of Lemma 3.2, we obtain
The rest of Section 3.3 remains the same, but it is sufficient to use Lemma 3.4, as there is no longer any cross-term. Thus we avoid the factor of 2 in the lower bound derived in Lemma 3.3, which allows us to take $a_n\sim h_n^{-s_\infty}$ . It then follows that $\bigl(\gamma_2U^\top_2 U_{1,n}\bigr)^2 a_n\rightarrow 0$ , while $\limsup_{n\rightarrow\infty}a_n\lambda_{2,n}\leq c_l$ for some $c_l\lt\infty$ . We thus obtain $I(u) \geq I^l(u)$ , where
We now apply (3.1) to obtain the desired result. Note that the support set $\{\gamma \in \mathbb{R}^2\,:\, \Psi(\gamma)\lt\infty\}$ is a subspace, so (3.1) still holds by the results of Section 6.
Theorem 4.1. Take $a_n \sim h_n^{-s_{\infty}}$ and let $c_l$ be a constant satisfying $\limsup_{n\rightarrow\infty} a_n\lambda_{2,n}\leq c_l$ . Then
for any closed $E\subseteq \mathbb{R}^2$ .
By considering one reference point x instead of two points $x,x^*$ , we obtain a better bound because it is no longer necessary to bound a cross-term, as in Lemma 3.3. However, the bound will not get worse if we consider $K\gt 2$ points, i.e. with
This is because the same bound in Lemma 3.3 can be applied to every possible cross-term.
5. Applications: pairwise comparisons and estimation error
In Sections 5.1–5.2 we use Theorems 3.1 and 4.1 to prove (1.3) and (1.4), respectively. The proofs are very similar, but use different definitions of the error set E in (3.18). Section 5.3 presents several other results of interest where the moderate deviations bound can be made more explicit.
5.1. Moderate deviations for false ordering
We return to (1.2) and write
For fixed $x,x^*$ , the denominator is a strictly positive constant, so we can focus on the numerator, which fits into the framework of Section 3 with
We will apply Theorem 3.1 and derive a more explicit form for (3.19). First, note that the supremum in (3.19) can only be finite when
Letting $\eta$ be the value in (5.3), we then have $I^l(u)={{\eta^2}/({2c_l})}$ . Then we minimize $I^l(u)$ subject to (5.2)–(5.3). From the optimality conditions, it can be seen that the inequalities in (5.2) must be binding at optimality, which leads to
Applying Theorem 3.1, we complete the proof of (1.3). The formal statement of the result is as follows.
Theorem 5.1. Let T be as in (3.7), take $a_n \sim h_n^{- {{s_{\infty}}/{2}} }$ and let $c_l$ be a constant satisfying $\limsup_{n\rightarrow\infty} a_n\lambda_{j,n}\leq c_l$ for $j\in\{3,4\}$ . If $||T_{11}|-|T_{21}||\notin \{0,1\}$ , we have
where $C_1,C_2$ are positive constants.
In fact, we can show that the moderate deviations bound holds uniformly for all $x,x^*\in D$ . To do so, we must make sure that the denominator of (5.1) is well-behaved. It is easily seen that
where $\Phi$ is the standard Gaussian cumulative distribution function. We let $c_L$ be the Lipschitz constant of m, and derive
using the assumption made in Section 2.1 that $\phi$ is twice differentiable at zero with $\phi^{\prime\prime}(0)\lt 0$ . Because D is compact, there exists some $c_D \gt 0$ satisfying
Furthermore, the constant $C_2$ in Theorem 5.1 does not depend on $x,x^*$ . The constant $C_1$ may depend on $x,x^*$ , but we can take $C^{\prime}_1$ to be its largest value over the compact set D. We then conclude the following.
Corollary 5.1. Suppose that we are in the situation of Theorem 5.1. Then
where $C^{\prime}_1,C_2,c_D$ are positive constants.
Finally, we consider two special cases covered in Section 4. First, in the case where $T_{11}=T_{21}$ or $T_{12}=T_{22}$ , we apply (4.6). It can be shown that the supremum is attained when $\gamma_3+\gamma_4=0$ , which satisfies the condition $\gamma_3\gamma_4\leq 0$ and achieves $u^\top(U_3+U_4)=0$ and $|u^\top U_3-u^\top U_4| = {{\delta}/{(2\sqrt{2})}}$ . Consequently, we have
where $C^{\prime}_2$ depends on $\lambda_2,c_l,c_{k(\cdot,x^*),\phi}$ . In the second special case where $||T_{11}|-|T_{21}|| = 1$ , we apply (4.7). Again, the supremum achieves $u^\top(U_3+U_4)=0$ and $|u^\top U_3-u^\top U_4| = {{\delta}/{(2\sqrt{2})}}$ and (5.4) follows. In summary, both special cases admit an exponential bound in $h_n^{-s_{\infty}}$ rather than $h_n^{- {{s_{\infty}}/{2}} }$ .
5.2. Moderate deviations for pointwise estimation error
The convergence rate of the tail probability $P\bigl(|\,\hat{f}_n(x)-f(x)|\geq \delta\bigr)$ for fixed x can be obtained from Theorem 4.1 using the error event $E^{\prime} = \{|u_1-u_2|\geq\delta\}$ . It is easy to see that
whence we have
where $C_1,C_2\gt 0$ are constants. As in Corollary 5.1, we can take the supremum over all $x\in D$ to obtain the desired result, which is formally stated as follows.
Theorem 5.2. There exist constants $C_1,C_2 \gt 0$ such that
A natural question is whether this result can be extended to bound $P\bigl(\sup_x |\,\hat{f}_n(x)-f(x)|\geq \delta\bigr)$ , the tail probability of the estimation error over the domain, perhaps by combining Theorem 5.2 with the continuity of f and an epsilon-net partition of D. Such an argument could work if f and $\hat{f}_n$ were uniformly equicontinuous over all sample paths, a property known as ‘uniform modulus of continuity’ [Reference Marcus and Shepp22]. It is possible to obtain such results for generic GPs in certain settings [Reference Ciesielski9], but they are currently not available for the specific mechanism of Gaussian process regression; recent work, such as [Reference Lederer, Umlauft and Hirche17] and [Reference Toth and Oberhauser31], has only bounded the modulus of continuity on a restricted set of sample paths, not almost surely. This extension is an interesting topic for future work.
5.3. Other results of interest
In the following, we give several examples in which our main results can be made more explicit. To avoid excessive repetition, we focus on the uniform bound in Corollary 5.1 in our presentation, but analogs of the other results in Sections 5.1–5.2 can be straightforwardly obtained as well. For simplicity, let us take $D=[0,1]^d$ .
Gaussian kernel. Suppose that k is the Gaussian kernel with parameter $\alpha$ , that is, $k(x,x^*) = \exp\!({-}\alpha\|x-x^*\|^2_2)$ . For this particular kernel, it is known that $s_{\infty}$ can take arbitrarily large values. However, Theorem 11.22 of [Reference Wendland35] proves the bound
where $c_\alpha$ depends only on $\alpha$ , d, and D. In addition, Corollary 12.4 in [Reference Wendland35] provides a modified version of Lemma 3.5 for this setting, namely,
Thus, using the above results instead of Lemmas 2.1 and 3.5, we can repeat our analysis with
and obtain, for example,
under the same assumptions as Corollary 5.1.
Uniform design. Consider a uniform grid, discretized evenly in each dimension, with n being the total number of points in the discretization. One can find that $h_n={\mathrm{O}}(n^{-{{1}/{d}}})$ , leading to the explicit rate
under the assumptions of Corollary 5.1.
Uniform random design. Suppose that the design points are sampled from a uniform distribution on $[0,1]^d$ . By adapting results in [Reference Janson13], one can show that $h_n={\mathrm{O}}(({{\log n}/{n}})^{{{1}/{d}}})$ , leading to the explicit rate
under the assumptions of Corollary 5.1. One can also extend this result to a setting with independent but non-uniform sampling. Suppose that the nth design point is sampled independently from some fixed density $g_n$ with support $[0,1]^d$ . Then one can show that
and the rate follows.
We remark that the above discussion implicitly assumes that $||T_{11}|-|T_{21}||\notin\{0,1\}$ . However, the exceptions can be handled using the same arguments that were presented in Section 5.1.
6. General large deviations inequality
Let $\{Z_n\}$ be a sequence of random vectors taking values in $\mathbb{R}^p$ , and let $\mu_n$ denote the probability law of $Z_n$ . Let $\Psi_n$ be the cumulant-generating function of $Z_n$ , and let $\{a_n\}$ be a sequence satisfying $a_n\rightarrow\infty$ as $n\rightarrow\infty$ . Define $\Psi(\gamma)$ as in (3.2). The functions $\Psi_n$ and $\Psi$ are convex. Let $D_\Psi = \{\gamma\in\mathbb{R}^p\,:\, \Psi(\gamma) \lt \infty\}$ be the convex support set of $\Psi$ and note that $0\in D_\Psi$ .
Let I be the Fenchel–Legendre transform of $\Psi$ as in (3.3). The classical Gärtner–Ellis theorem [Reference Dembo and Zeitouni10] establishes the inequality (3.1) for any closed measurable set E, under the condition that the origin belongs to the interior of $D_\Psi$ . This condition will fail to hold in our setting, because we will consider situations in which $D_\Psi$ is a subspace of $\mathbb{R}^p$ . Thus it is necessary to prove (3.1) under weaker conditions.
In the following, let $\mathcal{P}$ be the orthogonal projection operator onto the subspace $D_\Psi$ , and define $\mathcal{P}E = \{\mathcal{P}u\,:\, u\in E\}$ to be the projection of any $E\subseteq \mathbb{R}^p$ . Let $\mu^{\mathcal{P}}_n$ be the probability law of the random variable $\mathcal{P}Z_n$ .
Our goal is to prove (3.1), for any closed measurable set E, under the assumption that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ . This is accomplished in three steps with progressively fewer assumptions on E. In the first two steps (Lemma 6.1), the large deviations inequality is proved for $\mathcal{P}E$ , with the first step making the additional assumption that this projected set is compact. The final step (Theorem 6.1) then proves the inequality for E.
Lemma 6.1. Suppose that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ , and $E\subseteq \mathbb{R}^p$ has the property that $\mathcal{P}E$ is compact and measurable. Then
Proof. Let $I^\tau(u) = \min\{I(u)-\tau,{{1}/{\tau}}\}$ for $\tau \gt 0$ . By definition of this function, for any $u\in\mathcal{P}E$ we can pick $\gamma^u\in D_\Psi$ for which $\langle\gamma^u,u\rangle - \Psi(\gamma^u)\geq I^{\tau}(\gamma^u)$ . We can also pick $\rho^u$ such that $\rho^u\|\gamma^u\|\geq \tau$ and let $B_{u,\rho^u}$ be the closed ball of radius $\rho^u$ centered at u.
By Chebyshev’s inequality,
for any n, $\gamma\in\mathbb{R}^p$ and measurable $G\subseteq D_\Psi$ . In particular,
For any $u\in\mathcal{P}E$ ,
whence
where (6.1) follows from the fact that $\mathcal{P}$ is self-adjoint.
Since $\mathcal{P}E$ is compact, we can select a finite covering from the open covering $\bigcup_{u\in\mathcal{P}E} B_{u,\rho^u}$ of $\mathcal{P}E$ . Let N be the number of balls in this covering, and denote their centers by $u_i$ , $i=1,\ldots,N$ . For simplicity, let $\gamma_i,\rho_i$ denote the corresponding $\gamma^u,\rho^u$ values. Then
and we can take the limsup of both sides to obtain
Recalling the properties of $\gamma_i$ , we arrive at
This holds for any $\tau \gt 0$ , so we take $\tau\searrow 0$ to prove the desired result.
Lemma 6.2. Suppose that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ , and $E\subseteq \mathbb{R}^p$ is closed and measurable. Then
Proof. Let $u_1,\ldots,u_{\ell}$ be a basis for the subspace $D_\Psi$ , with $\ell \lt p$ being its dimensionality. Let $\mu^j_n$ denote the probability law of $\langle u_j,Z_n\rangle$ .
Let $\gamma=a_n u_j$ and take some $\zeta \gt 0$ . By Chebyshev’s inequality,
whence
Consequently,
for all $j=1,\ldots,\ell$ . Using symmetric arguments, one can also obtain
Now define the compact set $G_{\zeta} = \{u\in D_\psi\,:\, \langle u_j,u\rangle\in[-\zeta,\zeta] \ \textrm{for all} \, j = 1,\ldots,\ell\}$ We then derive
where the first inequality uses a union bound together with the monotonicity of probability measures.
Observing that $\mathcal{P}E\cap G_{\zeta}$ is compact, we can apply Lemma 6.1 to obtain
On the other hand, $\mathcal{P}E\cap G^c_{\zeta} \subseteq D_{\psi}\setminus G_{\zeta}$ , so
Combining both inequalities, we find that
Taking $\zeta\rightarrow\infty$ and applying (6.2) yields the desired result.
Theorem 6.1. Suppose that $D_\Psi \neq\{0\}$ is a subspace of $\mathbb{R}^p$ , and $E\subseteq \mathbb{R}^p$ is closed and measurable. Then
Proof. We rewrite (3.3) as
because $\Psi(\gamma)$ takes finite values only for $\gamma\in D_{\psi}$ . Observe, however, that
because $\mathcal{P}$ is self-adjoint. Therefore, by Lemma 6.2,
which completes the proof.
We remark that the large deviations inequality can be recovered under the weaker condition $0 \notin \textrm{rel int}(D_{\psi})$ , without requiring $D_{\psi}$ to be a subspace of $\mathbb{R}^p$ . However, this is beyond the needs of the present work so we do not give the proof here.
7. Conclusion
We have presented a theoretical framework that leverages the connections between Gaussian process regression and approximation theory to derive new moderate deviations inequalities for different types of error probabilities. The utility of these results is demonstrated through two applications of broad interest: probabilities of pairwise errors between fixed errors of points, and uniform tail probabilities for the pointwise estimation error. Furthermore, our results illustrate the effect of the kernel on the convergence rate.
It is difficult to say whether it is possible to improve on these bounds; perhaps this also depends on the class of kernels that is chosen. The main limitation of this work is that for purposes of tractability, we bound difficult posterior covariances by the much more tractable mesh norm. The mesh norm only measures the extent to which the design points are evenly spread out, and thus has limited ability to distinguish between different strategies for choosing the design points. We leave this problem for future work, noting that the results presented here are the first of their kind.
Funding information
The second author’s research was partially funded by the National Science Foundation (grant CMMI-2112828).
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.