1. Introduction
This work revisits the notorious issue of model uncertainty, which is pervasive in virtually all data-driven applied fields owing to the scarcity or low quality of data. The issue of model uncertainty is especially critical in insurance and finance for reasons such as a changing environment that introduces multiple layers of uncertainty, the increasing complexity of modern insurance and financial products, and the information asymmetry between borrowers and lenders or between insurers and insureds.
Let us consider the estimation of $E\left[ h(X)\right] $ , where X is a general non-negative risk variable and $h(\!\cdot\!)$ is some non-negative deterministic loss function. As the probability distribution of X, denoted by $F_{X}$ , is in general unknown, we need to use a reference distribution inferred from available data and sometimes also based on expert opinions. In view of the possible scarcity or low quality of the data as well as the possible bias of the expert, the objective is subject to an ambiguity issue, which may lead to significant deviations from the true value of $E\left[ h(X)\right] $ and potentially derail the entire risk analysis.
In the literature, the two terms uncertainty and ambiguity are often used interchangeably. Following Pflug and Wozabal (2007), uncertainty refers to a situation in which the model is known but the realizations are unknown, while ambiguity refers to a situation in which the probability distribution is unknown. This way of distinguishing the two terms seems to be consistent with a majority of recent works in this field. Thus, in this paper, we use the term ambiguity to indicate that the underlying probability distribution is unknown.
To tackle the ambiguity issue, the worst-case treatment is often adopted. This treatment constructs an ambiguity set B of possible probability distributions and considers the worst case in this ambiguity set; that is,
(1.1) \begin{equation}\sup_{F_{X}\in B}E\left[ h(X)\right] .\end{equation}
At the core of the worst-case treatment is the construction of such an ambiguity set B, which involves a trade-off. On the one hand, this set should be broad enough to cover the true distribution with a prescribed confidence level, but on the other hand, its construction should also be prudent enough to avoid over-conservativeness. While efforts continue to be devoted to methodological innovations, the over-conservativeness issue has become prominent: When an ambiguity set is constructed to meet a prescribed confidence level, the results provided by the worst-case treatment are often too conservative and the realized confidence level may be unnecessarily high. This issue represents a great obstacle to the results being practically useful, but it has so far received very limited discussion.
To alleviate this over-conservativeness issue, we propose a scenario analysis in accordance with the reality that the realization of a generic risk is usually the aggregation of multiple scenarios subject to varying extents of ambiguity. For example, when a company extends to new lines of business, these new lines are likely subject to more ambiguity than the company’s existing lines. Similarly, an insurance company faces losses from its unfamiliar territories that are subject to more ambiguity—hence it may charge an additional ambiguity premium for those losses—than from its familiar territories. In studies of mortality and morbidity, the population is often described as the composition of different age groups, among which we need to take special care of the oldest-old group. Note that scenario analysis, as one of the classical approaches to risk management, is emphasized in some modern regulatory frameworks including Basel III and the Swiss Solvency Test.
Suppose that there are two scenarios, of which one, called the ordinary scenario, has abundant data, and the other, called the ambiguous scenario, may suffer from scarcity or low quality of data. Here, we consider only two scenarios for simplicity. It is relatively straightforward to extend the study to more than two scenarios, although the formulations will become more complicated. Denoting by Y and Z the risks in the respective scenarios, the generic risk X, which is understood as the result of the aggregation of Y and Z, has the stochastic representation
\begin{equation*}X=(1-I)Y+IZ,\end{equation*}
where I is the indicator of the ambiguous scenario, assumed to be independent of the vector (Y, Z). Thus, X is a mixture of Y and Z with weights decided by the indicator I.
It is important to note that this is a process of aggregating the risks Y and Z from the respective scenarios into the generic risk X rather than a process of thinning the generic risk X into different scenarios. During this aggregation, the independence assumption between the scenario indicator I and the vector (Y, Z) comes quite naturally. For example, still consider an insurance company extending its business to a new territory. It would be a common practice in loss models to express its generic loss distribution as a mixture of the loss distributions corresponding to its existing and new territories, respectively, which is equivalent to the above-mentioned independence assumption.
The ordinary scenario, due to the abundant data, may be subject to negligible ambiguity and hence it is safe to trust the empirical distribution when making inferences about the corresponding risk Y. However, the ambiguous scenario may suffer from scarcity or low quality of the data, and for this reason, the corresponding risk Z is subject to significant ambiguity. In such a situation, apparently it becomes unwise to still apply the worst-case treatment to the generic risk in the usual way. Instead, we propose to simply treat the risk Y in the ordinary scenario as subject to no ambiguity and fully in accordance with the reference distribution and to apply the worst-case treatment only to the ambiguous scenario where ambiguity is significant.
Note that the ambiguity in our consideration is twofold, adhering to both the ambiguous scenario indicator I and the corresponding risk Z. Thus, we need to first construct an ambiguity set of possible probability distributions of the pair (I, Z), denoted by $\left( F_{I},F_{Z}\right) $ , and then consider the worst case in this ambiguity set. With a slight abuse of notation, we still denote by B this ambiguity set. The problem adapted to the current context is reformulated as
(1.2) \begin{equation}\sup_{\left( F_{I},F_{Z}\right) \in B}E\left[ h(X)\right] .\end{equation}
Following a recent research trend in distributionally robust optimization, we construct the ambiguity set B in terms of the Wasserstein distance and centered at the empirical distribution of the pair (I, Z).
In order to link the problem (1.2) to an established strong duality, we disentangle the total ambiguity described by the set B in (1.2) along the ambiguous scenario indicator I and the corresponding risk Z. Based on this disentanglement, we convert the two-fold optimization problem into two one-fold problems.
For a power loss function $h(x)=x^{p}$ with $p\geq 1$ , we apply the established strong duality to solve the worst-case estimation problem of $E\left[ Z^{p}\right] $ and eventually obtain the worst-case moment $E\left[X^{p}\right] $ in (1.2); see Theorem 3.1. We use a power loss function here for simplicity, but extensions to other loss functions are possible; see Remark 3.1. Based on Theorem 3.1, we conduct numerical studies to illustrate that the consideration of partial ambiguity is indeed a promising solution to alleviate the over-conservativeness issue.
To summarize, we study the worst-case moments of a risk whose distribution is subject to ambiguity. To alleviate the over-conservativeness issue, we trace the realization of the risk to two scenarios, of which one is an ordinary scenario subject to negligible ambiguity, so that we only need to apply the worst-case treatment to the other, ambiguous scenario. We construct a Wasserstein ball to describe the two-fold ambiguity in both the ambiguous scenario indicator and the risk in the corresponding ambiguous scenario, and then we derive the worst-case moment estimates both analytically and numerically. Our numerical studies illustrate that the consideration of partial ambiguity, which is the main contribution of this paper, indeed greatly alleviates the over-conservativeness issue.
We would like to point out that many insurance and financial products, such as traditional reinsurance, insurance-linked securities, catastrophe bonds, industry loss warranties, contingent convertible bonds, and credit default swaps, are designed mainly to hedge risks in extreme scenarios where issues such as scarcity and low quality of data are often critical. Our consideration of partial ambiguity becomes particularly relevant to such products.
The rest of the paper is organized as follows: Section 1 ends with a brief literature review; Section 2 formulates our worst-case estimation problem under partial ambiguity; Section 3 presents our main results; Section 4 conducts numerical studies to illustrate the benefit of our consideration of partial ambiguity; Section 5 makes some concluding remarks; finally, the Appendix collects all proofs.
1.1. A brief literature review
The worst-case treatment as formulated in (1.1) often appears in the field of decision-making under ambiguity. Typically, the decision-maker solves a minimax problem to find decisions that minimize the worst-case risk. See, for example, Scarf (1958), Popescu (2007), and Delage and Ye (2010), among others. The worst-case treatment is popular for applications in insurance, finance, and risk management. See, for example, Artzner et al. (1999), Hürlimann (2002), Embrechts et al. (2005), Kaas et al. (2009), Wang et al. (2013), Bernard et al. (2017, 2020), Li et al. (2018), and Cornilly et al. (2018), among many others. Our work naturally follows these strands of research.
Many of the works cited above employ a characteristics-based approach to constructing an ambiguity set; that is, the set over which the worst-case treatment takes place is constructed based on certain distributional characteristics such as marginal distributions, moments, and a dependence structure. A disadvantage of characteristics-based approaches is that they often yield overly conservative estimates because the constructed ambiguity set contains all kinds of distributions including pathological ones as long as they possess the required characteristics. For this reason, a majority of recent works have shifted towards divergence-based approaches following which an ambiguity set is constructed to be a ball centered at a reference distribution with a radius specified in terms of a certain statistical divergence.
Ben-Tal et al. (2013) advocate using $\phi $ -divergences to construct ambiguity balls, for which the authors come up with a natural interpretation in terms of statistical confidence sets. See also Glasserman and Xu (2014), Bayraksan and Love (2015), Gupta (2019), Lam (2019), and Rahimian et al. (2019). An advantage, among others, of using $\phi $ -divergences is that many of them make it possible to conduct goodness-of-fit tests. However, some popular $\phi $ -divergences such as the Kullback–Leibler divergence and the chi-square divergence have the obvious problem that all distributions in the ball have to be absolutely continuous with respect to the reference distribution, which implies that they have to be discrete when, as usual, an empirical distribution is selected as the reference distribution. See Wozabal (2012) and Gao and Kleywegt (2022) for related discussions.
To avoid this absolute continuity restriction on the ambiguity ball, one may turn to the Wasserstein distance, which has attracted a great deal of attention from scholars who follow divergence-based approaches. Among many works along this direction, we name several that are closely related to our current study: Wozabal (2012, 2014), Mohajerin Esfahani and Kuhn (2018), Blanchet and Murthy (2019), Pesenti and Jaimungal (2020), and Gao and Kleywegt (2022). Using the Wasserstein distance has clear advantages over $\phi $ -divergences. With the absolute continuity restriction lifted, the constructed ambiguity set allows for both discrete and continuous distributions. Moreover, due to its non-parametric nature, the resulting worst-case estimation is immune to distribution types and can largely avoid the model misspecification issue.
Note that (1.1) is a semi-infinite optimization problem, and so is (1.2), which cannot be solved computationally in a straightforward way. It is necessary to further convert such an optimization problem into a finite-dimensional problem. In this regard, under the Wasserstein distance, Gao and Kleywegt (2022) establish a strong duality result, which plays a key role in solving our worst-case estimation problem (1.2). Independently, Blanchet and Murthy (2019) obtain a similar result, which from various aspects is more general than that of Gao and Kleywegt (2022). Related results under simpler settings can be found in Mohajerin Esfahani and Kuhn (2018) and Zhao and Guan (2018).
In order to alleviate the over-conservativeness issue, we propose a scenario analysis following which we focus on ambiguous scenarios only, leading to a partial ambiguity issue. In behavioral finance, Payzan-LeNestour and Woodford (2022) raise an outlier blindness issue that “people are hampered in their perception of outcomes that they expect to seldom encounter, and view the magnitude of such outcomes as less extreme than they really are.” This also motivates our consideration of partial ambiguity.
We notice that the exact phrase “partial ambiguity” has appeared in the study of Ellsberg’s two-urn paradox; see, for example, Chew et al. (2017) and Berger (2021). In their context, partial ambiguity refers to the situation in which some prior knowledge limits the possible compositions in an urn of 100 red and black balls, while in our context it refers to the situation in which abundant information about the risk in ordinary scenarios allows us to pin down ambiguity to the other scenarios. In the two contexts, partial ambiguity has essentially the same meaning though presented quite differently.
Recently, Lam and Mottet (2017) evaluate performance measures over a tail region where there are very limited data or even no data. They adopt a new approach based on the geometric premise of tail convexity, a feature shared by virtually all practically useful models, and then consider all possible tail densities. Although their work and ours share a similar flavor of pinning down ambiguity due to data scarcity, an essential difference exists in that they conduct optimization over all possible tail densities fulfilling several postulated distributional constraints (thus, they follow a characteristics-based approach), while we resort to a divergence-based approach.
Often being a critical issue in insurance practice, model uncertainty has received increasing attention in the insurance literature. The following works, among others, have investigated or touched on this issue in various insurance contexts: Cairns (2000), Chen and Su (2009), Peters et al. (2009), Zhao and Zhu (2011), Landsman and Tsanakas (2012), Robert and Therond (2014), Liu and Wang (2017), Fujii et al. (2017), and Jiang et al. (2020). More applications to practical problems in insurance have been proposed during recent years but are still sporadic; see Wozabal (2014), Pflug et al. (2017), Lam and Mottet (2017), Asimit et al. (2017, 2019), Ghossoub (2019a, 2019b), Blanchet et al. (2019), Birghila and Pflug (2019), and Birghila et al. (2020), among others. We hope that our work can draw attention from actuaries and financial analysts to incorporate the consideration of partial ambiguity when addressing the ambiguity issue in practical problems.
2. Formulation of the problem
2.1. Notational conventions
For a general probability space $(\Omega ,\mathcal{F},P)$ , denote by $L^{p}(\Omega ,\mathcal{F},P)$ the Lebesgue space with exponent $p\geq 1$ , namely, the space of random variables with finite pth moments. For a measurable set $A\subset \mathbb{R}^{d}$ , denote by $\mathcal{P(}A)$ the set of probability distributions supported on A. Denote by $\left\Vert \cdot\right\Vert _{p}$ the usual p norm in $\mathbb{R}^{d}$ and by $\left\Vert\cdot \right\Vert _{L^{p}}$ the $L^{p}$ norm in $L^{p}(\Omega ,\mathcal{F},P) $ . Thus, for a vector $\textbf{x}\in \mathbb{R}^{d}$ we have $\left\Vert\textbf{x}\right\Vert _{p}=\left( \sum_{i=1}^{d}\left\vert x_{i}\right\vert^{p}\right) ^{\frac{1}{p}}$ while for a random variable $X\in L^{p}(\Omega ,\mathcal{F},P)$ we have $\left\Vert X\right\Vert _{L^{p}}=\left( E\left\vert X\right\vert ^{p}\right) ^{\frac{1}{p}}$ . For $x,y\in \mathbb{R}$ , we write $x\vee y=\max \{x,y\}$ , $x\wedge y=\min \{x,y\}$ , $x_{+}=x\vee 0$ , and denote by $\delta _{x}$ a Dirac measure that assigns mass 1 at point x. For a random variable $\xi $ , we denote its distribution by $F_{\xi }$ and its quantile function at level $q\in \lbrack 0,1]$ by
where $\inf \emptyset $ is the right endpoint of $F_{\xi }$ and $\sup\emptyset $ is the left endpoint of $F_{\xi }$ . Throughout the paper, we will tacitly follow these conventions without extra explanations.
2.2. A general formulation
Consider the estimation of $E\left[ h(X)\right] $ , where X is a non-negative risk variable defined on the probability space $(\Omega ,\mathcal{F},P)$ and $h(\!\cdot \!)\,{:}\,\mathbb{R}_{+}\rightarrow \mathbb{R}_{+}$ is a deterministic measurable loss function such that the expectation involved is finite.
Suppose that, as described before, X is the result of the aggregation of the risks in an ordinary scenario and an ambiguous scenario, labelled as 0 and 1, respectively. Introduce a Bernoulli variable I, called scenario indicator, with distribution
where $q=P(I=1)\in \lbrack 0,1]$ denotes the probability that scenario 1 is picked. The risk in scenario 0, denoted by Y, is subject to negligible ambiguity and for this reason we will trust its reference distribution. However, the risk in scenario 1, denoted by Z, is subject to significant ambiguity so that the worst-case treatment needs to be applied when estimating the risk. In this way, the generic risk X takes the stochastic representation
(2.1) \begin{equation}X=(1-I)Y+IZ.\end{equation}
Assuming independence between I and $\left( Y,Z\right) $ , the expectation $E\left[ h(X)\right] $ is decomposed into
(2.2) \begin{equation}E\left[ h(X)\right] =(1-q)E\left[ h(Y)\right] +qE\left[ h(Z)\right] .\end{equation}
Remark 2.1 It is important to keep in mind the following:
- (a) To derive a worst-case estimate for $E\left[ h(X)\right] $ starting from (2.2), it suffices to take into account the ambiguity that adheres to $F_{I}$ and $F_{Z}$ since we fully trust the reference distribution of Y. This leads to a two-fold ambiguity issue, which will be at the core of our analysis.
- (b) The scenario indicator I and the risk Z in the ambiguous scenario are in different scales and play distinct roles in the worst-case estimation, leading to an asymmetry issue.
To address the asymmetry issue between I and Z in Remark 2.1(b), we scale down the risk Z by a constant $s>0$ ; that is, we instead consider the scaled risk $\tilde{Z}=\frac{Z}{s}$ . The decomposition (2.2) now becomes
\begin{equation*}E\left[ h(X)\right] =(1-q)E\left[ h(Y)\right] +qE\left[ h(s\tilde{Z})\right] .\end{equation*}
For the moment, the scale parameter s is simply treated as exogenous when tackling the twofold ambiguity issue. Later in Subsection 4.2, we will establish a mechanism to endogenize the scale parameter s so that in the worst-case estimation of $E\left[ h(X)\right] $ the two folds play comparable roles in a certain sense.
Further, we introduce a vector $\textbf{V}=\left( I,\tilde{Z}\right) $ whose distribution is
\begin{equation*}F_{\textbf{V}}=F_{I}\times F_{\tilde{Z}}.\end{equation*}
Following the divergence-based approach, we consider the true distribution $F_{\textbf{V}}$ to be within a ball centered at its empirical version
\begin{equation*}\hat{F}_{\textbf{V}}=\hat{F}_{I}\times \hat{F}_{\tilde{Z}},\qquad \text{with }\hat{F}_{I}=(1-q_{n})\delta _{0}+q_{n}\delta _{1}\text{ and }\hat{F}_{\tilde{Z}}=\frac{1}{N}\sum_{i=1}^{N}\delta _{\tilde{z}_{i}},\end{equation*}
where $n=n_{0}+N$ is the total number of observations counting both $\{y_{1},\ldots ,y_{n_{0}}\}$ of Y from the ordinary scenario 0 and $\{z_{1},\ldots ,z_{N}\}$ of Z from the ambiguous scenario 1, $\tilde{z}_{i}=\frac{z_{i}}{s}$ for $i=1,\ldots ,N$ denote the scaled realizations of the risk Z, and $q_{n}=\frac{N}{n}$ is the empirical estimate for q. This ball for $F_{\textbf{V}}$ will be described by $B_{r}\left( \hat{F}_{I}\times \hat{F}_{\tilde{Z}}\right) $ , where $r>0$ represents the radius.
To conclude, we need to solve the worst-case estimation problem
(2.3) \begin{equation}\sup_{F_{\textbf{V}}\in B_{r}\left( \hat{F}_{I}\times \hat{F}_{\tilde{Z}}\right) }E\left[ h(X)\right] .\end{equation}
2.3. Under the Wasserstein distance
To construct the ambiguity ball in (1.1), we use the following Wasserstein distance of order $p\geq 1$ . For two multivariate distributions $F_{1}$ and $F_{2}$ , their Wasserstein distance is defined to be
(2.4) \begin{equation}W(F_{1},F_{2})=\inf_{\Pi }\left( E_{\Pi }\left[ d\left( \textbf{V}_{1},\textbf{V}_{2}\right) ^{p}\right] \right) ^{\frac{1}{p}},\end{equation}
where $\textbf{V}_{1}$ and $\textbf{V}_{2}$ are two vectors distributed by $F_{1}$ and $F_{2}$ , respectively, $d(\cdot ,\cdot )$ denotes a certain distance, and $\Pi $ denotes a joint distribution of $(\textbf{V}_{1},\textbf{V}_{2})$ with marginal distributions $\Pi _{\textbf{V}_{1}}=F_{1}$ and $\Pi _{\textbf{V}_{2}}=F_{2}$ .
The definition of the Wasserstein distance (2.4) originates from optimal transportation problems. According to Villani (2009), each joint distribution $\Pi $ is interpreted as a plan of transporting goods between producers and consumers, whose spatial distributions are described by $F_{1}$ and $F_{2}$ , respectively, the quantity $d(\textbf{v}_{1},\textbf{v}_{2})^{p} $ is interpreted as the cost for transporting one unit of goods from $\textbf{v}_{1}$ to $\textbf{v}_{2}$ , and consequently the quantity $W(F_{\textbf{V}_{1}},F_{\textbf{V}_{2}})^{p}$ is interpreted as the optimal total transportation cost.
In most existing studies, the distance $d(\cdot ,\cdot )$ is specified as the p norm; that is, for two vectors $\textbf{v}_{1}$ and $\textbf{v}_{2}$ of the same dimension,
\begin{equation*}d(\textbf{v}_{1},\textbf{v}_{2})=\left\Vert \textbf{v}_{1}-\textbf{v}_{2}\right\Vert _{p}.\end{equation*}
We will simply follow this specification. Nevertheless, we would like to point out that depending on the situation it may become crucial to employ other distances to address some practical considerations. Then for the one-dimensional case, it is well known that the Wasserstein distance between two distributions $F_{1},F_{2}\in \mathcal{P}(\mathbb{R})$ takes an explicit expression in terms of their quantile functions:
(2.5) \begin{equation}W(F_{1},F_{2})=\left( \int_{0}^{1}\left\vert F_{1}^{-1}(u)-F_{2}^{-1}(u)\right\vert ^{p}du\right) ^{\frac{1}{p}};\end{equation}
see, for example, Panaretos and Zemel (2019).
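As a quick numerical illustration of (2.5), the following Python sketch computes the order-p Wasserstein distance between two equally weighted empirical distributions of the same size by aligning their sorted samples; the function name and the equal-sample-size simplification are our own choices and not part of the original analysis.

```python
import numpy as np

def wasserstein_p(x, y, p=2):
    """Order-p Wasserstein distance between two equally weighted empirical
    distributions with the same number of atoms, via the quantile formula (2.5):
    sorting the samples aligns the two empirical quantile functions."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if x.size != y.size:
        raise ValueError("this simple sketch assumes equal sample sizes")
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

# Sanity check of the Bernoulli identity used later in this subsection:
# W(F_I1, F_I2)^p = |q1 - q2|.
n, q1, q2 = 1000, 0.10, 0.25
I1 = np.repeat([1.0, 0.0], [int(n * q1), n - int(n * q1)])
I2 = np.repeat([1.0, 0.0], [int(n * q2), n - int(n * q2)])
print(wasserstein_p(I1, I2, p=2) ** 2)   # 0.15 = |q1 - q2|
```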
Corresponding to our setting in Subsection 2.2, $\textbf{V}_{i}=\left( I_{i},\tilde{Z}_{i}\right) $ and $F_{\textbf{V}_{i}}=F_{I_{i}}\times F_{\tilde{Z}_{i}}$ for each $i=1,2$ , $\Pi $ denotes a joint distribution of $(\textbf{V}_{1},\textbf{V}_{2})=\left( I_{1},\tilde{Z}_{1},I_{2},\tilde{Z}_{2}\right) $ , and $d(\cdot ,\cdot )$ is specified as the p norm. To simplify the notation, we introduce the following sets:
Then $S=S_{I}\times S_{\tilde{Z}}$ and
so that
Intuitively, we can take $\inf_{\Pi \in S}$ onto the two terms $E_{\Pi }\left[ \left\vert I_{1}-I_{2}\right\vert ^{p}\right] $ and $E_{\Pi }\left[\left\vert \tilde{Z}_{1}-\tilde{Z}_{2}\right\vert ^{p}\right] $ separately, giving
\begin{equation*}W(F_{\textbf{V}_{1}},F_{\textbf{V}_{2}})^{p}=W(F_{I_{1}},F_{I_{2}})^{p}+W\left( F_{\tilde{Z}_{1}},F_{\tilde{Z}_{2}}\right) ^{p}.\end{equation*}
It turns out that this intuition is correct, as proved by Lemma A.1 in a general context. Moreover, by (2.5), we can easily verify the following two identities:
-
With $q_{i}=P(I_{i}=1)$ for $i=1,2$ ,
\begin{equation*}W(F_{I_{1}},F_{I_{2}})^{p}=\int_{0}^{1}\left\vert F_{I_{1}}^{-1}(u)-F_{I_{2}}^{-1}(u)\right\vert ^{p}du=\left\vert q_{1}-q_{2}\right\vert ;\end{equation*} -
With $s>0$ and $\tilde{Z}_{i}=\frac{Z_{i}}{s}$ for $i=1,2$ ,
\begin{equation*}W\left( F_{\tilde{Z}_{1}},F_{\tilde{Z}_{2}}\right) ^{p}=\frac{1}{s^{p}}W(F_{Z_{1}},F_{Z_{2}})^{p}.\end{equation*}
It follows that
\begin{equation*}W(F_{\textbf{V}_{1}},F_{\textbf{V}_{2}})^{p}=\left\vert q_{1}-q_{2}\right\vert +\frac{1}{s^{p}}W(F_{Z_{1}},F_{Z_{2}})^{p}.\end{equation*}
Formally, we construct the ambiguity ball in our worst-case estimation problem (2.3) as
Thus, our worst-case estimation problem (2.3) can be reformulated as
Remark 2.2 The next task is to optimally allocate the given amount of ambiguity between the probability q and the distribution $F_{Z}$ . This involves a trade-off issue given that the total amount of ambiguity as reflected by the radius r is fixed. For example, increasing q from its empirical estimate $q_{n}$ influences the value of $(1-q)E\left[ h(Y)\right]+qE\left[ h(Z)\right] $ in an intricate way: Directly, it decreases the first term and increases the second term, while indirectly, the deviation of q from $q_{n}$ consumes part of the ambiguity and thus reduces the range for $F_{Z}$ in optimizing $E\left[ h(Z)\right] $ . Also keep in mind two extreme cases in which the ambiguity is maximally allocated to the probability q and the distribution $F_{Z}$ , respectively. In each case, the given amount of ambiguity is likely used up by one argument, causing the other argument to stick to its empirical estimate. Moreover, we note that, to address the asymmetry issue in Remark 2.1(b), we have used a scale parameter s, taken as exogenous for the moment, in constructing the worst-case estimation problem in (2.7).
Further, we factorize the worst-case estimation problem (2.7) into two layers as
(2.8) \begin{equation}\sup_{q\in \lbrack q_{n}^{-},q_{n}^{+}]}\left( (1-q)E\left[ h(Y)\right] +q\sup_{F_{Z}\in B_{\epsilon (q)}\left( \hat{F}_{Z}\right) }E\left[ h(Z)\right] \right) ,\end{equation}
where we have used the following notation:
(2.9) \begin{equation}q_{n}^{-}=(q_{n}-r^{p})\vee 0\qquad \text{and}\qquad q_{n}^{+}=(q_{n}+r^{p})\wedge 1,\end{equation}
(2.10) \begin{equation}\epsilon (q)=s\left( r^{p}-\left\vert q-q_{n}\right\vert \right) ^{\frac{1}{p}}.\end{equation}
This way, we have successfully disentangled the ambiguity along the two folds q and $F_{Z}$ and hence converted the worst-case estimation problem to a two-stage optimization problem. We need to first solve the inner optimization, which is a standard worst-case estimation problem. Then it will become straightforward to solve the outer optimization in (2.8), eventually completing the worst-case problem (2.3).
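The two-stage structure can be summarized in the following Python skeleton, which mirrors (2.8): a one-dimensional search over q on $[q_{n}^{-},q_{n}^{+}]$ , with the remaining ambiguity $\epsilon (q)$ passed to an inner solver. The callable inner_worst_case is a hypothetical placeholder (for the power loss function it is given in closed form by Proposition 3.1 below), and all names are illustrative.

```python
import numpy as np

def two_stage_worst_case(h_Y_mean, inner_worst_case, q_n, r, s, p, grid_size=2001):
    """Outer layer of (2.8): maximize over q in [q_n^-, q_n^+] the quantity
    (1 - q)*E[h(Y)] + q * sup_{F_Z in B_{eps(q)}(F_Z_hat)} E[h(Z)],
    where eps(q) = s * (r^p - |q - q_n|)^(1/p) is the ambiguity left for F_Z."""
    q_lo, q_hi = max(q_n - r ** p, 0.0), min(q_n + r ** p, 1.0)   # the range (2.9)
    best = -np.inf
    for q in np.linspace(q_lo, q_hi, grid_size):
        eps = s * (r ** p - abs(q - q_n)) ** (1.0 / p)            # the budget (2.10)
        best = max(best, (1.0 - q) * h_Y_mean + q * inner_worst_case(eps))
    return best
```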
Remark 2.3 Observe the range (2.9) for q, which comes from the ingenuous belief of $q\in \lbrack 0,1]$ . In practice, however, we may have a prior belief in q reflected by a restricted range $q\in \lbrack\underline{q},\bar{q}]\subseteq \lbrack 0,1]$ , which can be, for example, a confidence interval based on a rich dataset or an expert opinion. In this case, we may utilize this information to reduce the range for q in the hope to further help alleviate the over-conservativeness issue. Formally, we have the worst-case estimation problem
over the modified ball
Then by going along the same lines as (2.8), we can still convert the worst-case estimation problem to a two-stage optimization problem in which the outer optimization is over
Thus, it becomes straightforward to address this modification to potentially refine the study, but we will omit it to keep the paper short.
3. The main result
Many existing studies in the literature consider a concave loss function $h(\!\cdot \!)$ . However, the case with a convex loss function is arguably more relevant to some applications. A potential challenge with a convex loss function is that it grows too fast, causing the worst-case estimate to easily explode. For simplicity, we specify $h(\!\cdot \!)$ as a power loss function
(3.1) \begin{equation}h(x)=x^{p},\qquad p\geq 1,\end{equation}
so that our work amounts to providing the worst-case pth moment of a generic risk X.
For ease of reference, we recollect here some essential steps in the worst-case estimation of $E\left[ X^{p}\right] $ . Following the traditional divergence-based approaches, the worst-case estimation of $E\left[ X^{p}\right] $ is constructed as
(3.2) \begin{equation}\sup_{F_{X}\in B_{r}\left( \hat{F}_{X}\right) }E\left[ X^{p}\right] ,\end{equation}
which is conducted over the Wasserstein ball $B_{r}\left( \hat{F}_{X}\right)$ , $r>0$ , centered at the empirical distribution $\hat{F}_{X}$ based on the whole dataset of the generic risk X. When constructing the Wasserstein ball, following the mainstream in related research we choose the order p of the Wasserstein distance (2.6) to be the same as the order p of the power function (3.1). Then, the corresponding worst-case estimation problem (3.2) is bounded according to Lemma A.2.
In our scenario analysis, the generic risk X, as described by (2.1), results from the aggregation of the risks in an ordinary scenario 0 and an ambiguous scenario 1, and we aim to estimate
where $q=P(I=1)$ , $\tilde{Z}=\frac{Z}{s}$ , and $s>0$ is a scale parameter. Following (2.8), our worst-case estimation problem becomes
with $\epsilon =s\left( r^{p}-\left\vert q-q_{n}\right\vert \right) ^{\frac{1}{p}}$ given in (2.10).
As the first step to solve (3.3), we work on the inner optimization in (3.4),
(3.5) \begin{equation}\sup_{F_{Z}\in B_{\epsilon }\left( \hat{F}_{Z}\right) }E\left[ Z^{p}\right] ,\end{equation}
and the result is shown in Proposition 3.1 below, the proof of which is postponed to the Appendix. To avoid triviality, we only consider a proper sample $\{z_{1},\ldots ,z_{N}\}$ of Z in the sense that not all sample points are 0. Nevertheless, in case all sample points $\{z_{1},\ldots ,z_{N}\}$ of Z are 0, the aimed results are still valid: In this case, $\hat{F}_{Z}$ is degenerate at 0, $\sup_{F_{Z}\in B_{\epsilon }\left( \hat{F}_{Z}\right) }E\left[ Z^{p}\right] =\epsilon ^{p}$ is attained at a distribution $F_{Z}^{\ast }$ degenerate at $\epsilon $ , and $W\left( F_{Z}^{\ast },\hat{F}_{Z}\right) =\epsilon $ .
Proposition 3.1 Consider the worst-case estimation problem (3.5) in which $p\geq 1$ , $\epsilon >0$ , and the Wasserstein distance is specified as (2.6) with the same order p. A proper sample $\{z_{1},\ldots ,z_{N}\}$ of Z is given. Then
(3.6) \begin{equation}\sup_{F_{Z}\in B_{\epsilon }\left( \hat{F}_{Z}\right) }E\left[ Z^{p}\right] =\left( \epsilon +\left\Vert Z_{N}\right\Vert _{L^{p}}\right) ^{p},\end{equation}
where $Z_{N}$ denotes a random variable uniformly distributed over the sample $\{z_{1},\ldots ,z_{N}\}$ and thus $\left\Vert Z_{N}\right\Vert_{L^{p}}=\left( \frac{1}{N}\sum_{i=1}^{N}z_{i}^{p}\right) ^{\frac{1}{p}}$ .
Clearly, the supremum (3.6) is attained at
\begin{equation*}F_{Z}^{\ast }=\frac{1}{N}\sum_{i=1}^{N}\delta _{z_{i}^{\ast }},\end{equation*}
where $z_{i}^{\ast }=\left( 1+\frac{\epsilon }{\left\Vert Z_{N}\right\Vert_{L^{p}}}\right) z_{i}$ for $i=1,\ldots ,N$ . Moreover, it is easy to check that $W\left( F_{Z}^{\ast },\hat{F}_{Z}\right) =\epsilon $ , which means that the worst-case distribution $F_{Z}^{\ast }$ consumes all the ambiguity $\epsilon $ and lies on the boundary of $B_{\epsilon }\left( \hat{F}_{Z}\right) $ . To see this, recall the alternative expression (2.5) of the Wasserstein distance for the univariate case. We have
\begin{equation*}W\left( F_{Z}^{\ast },\hat{F}_{Z}\right) ^{p}=\int_{0}^{1}\left\vert \left( F_{Z}^{\ast }\right) ^{-1}(u)-\hat{F}_{Z}^{-1}(u)\right\vert ^{p}du=\frac{1}{N}\sum_{i=1}^{N}\left\vert z_{i}^{\ast }-z_{i}\right\vert ^{p}=\frac{\epsilon ^{p}}{\left\Vert Z_{N}\right\Vert _{L^{p}}^{p}}\cdot \frac{1}{N}\sum_{i=1}^{N}z_{i}^{p}=\epsilon ^{p}.\end{equation*}
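As a numerical check of Proposition 3.1 and of the worst-case distribution $F_{Z}^{\ast }$ , the following Python sketch (sample values and names are illustrative) evaluates the closed-form supremum, confirms that the rescaled sample attains it, and verifies that $F_{Z}^{\ast }$ lies on the boundary of the ball.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(1.0, 5.0, size=50)             # an arbitrary proper sample of Z
eps, p = 0.7, 2

lp_norm = np.mean(z ** p) ** (1.0 / p)         # ||Z_N||_{L^p}
z_star = (1.0 + eps / lp_norm) * z             # atoms of the worst-case distribution F_Z^*

print((eps + lp_norm) ** p)                    # closed-form supremum from Proposition 3.1
print(np.mean(z_star ** p))                    # E[Z^p] under F_Z^*: the same value
print(np.mean(np.abs(z_star - z) ** p) ** (1.0 / p))   # W(F_Z^*, F_Z_hat) = eps
```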
Remark 3.1 Although we focus on a non-negative risk variable X and a power loss function, our method is applicable to a real-valued risk variable X and a general loss function $h:\mathbb{R}\rightarrow \mathbb{R}$ as long as $h(\!\cdot \!)$ is upper semi-continuous and satisfies $h\left(x\right) =O(\left\vert x\right\vert ^{p})$ for some $p\geq 1$ as $\left\vert x\right\vert \rightarrow \infty $ . Indeed, we can still arrive at the two-stage optimization (2.8) by disentangling the ambiguity along the two folds. Furthermore, following the proof of Proposition 3.1, we can always convert the inner optimization in (2.8) to a one-dimensional convex optimization problem to make it numerically tractable but an explicit expression may be available only for certain special loss functions. One such example is the loss function
for which $E\left[ h(X)\right] $ corresponds to the pth moment of the insurance payoff in a stop-loss contract. For this example, we can achieve an explicit expression as
However, it does not seem to be easy to establish a unified result for general convex loss functions.
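To illustrate the numerical tractability claimed in Remark 3.1, the sketch below evaluates the analogue, for a general upper semi-continuous loss with pth power growth, of the one-dimensional dual problem (A8) derived in the Appendix, namely $\inf_{\lambda }\{\lambda \epsilon ^{p}+\frac{1}{N}\sum_{i=1}^{N}\sup_{z\geq 0}(h(z)-\lambda \left\vert z-z_{i}\right\vert ^{p})\}$ . The inner supremum is approximated on a finite grid, and the restriction to $\lambda >1$ follows the argument used for the power loss in the Appendix; the grid bounds, the illustrative stop-loss-type loss with a hypothetical deductible d, and all names are our own choices, so this is a rough sketch rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_inner_worst_case(h, z_sample, eps, p, z_max=200.0, n_grid=4000, lam_hi=50.0):
    """Approximate sup_{F_Z in B_eps(F_Z_hat)} E[h(Z)] through the one-dimensional dual,
    with the inner supremum over z evaluated on a truncated grid (a crude approximation)."""
    z_sample = np.asarray(z_sample, float)
    z_grid = np.linspace(0.0, z_max, n_grid)
    h_grid = h(z_grid)

    def dual_objective(lam):
        # sup over the grid of h(z) - lam*|z - z_i|^p, computed for all sample points at once
        penalty = lam * np.abs(z_grid[None, :] - z_sample[:, None]) ** p
        return lam * eps ** p + np.mean(np.max(h_grid[None, :] - penalty, axis=1))

    # for losses of pth power growth the dual minimizer lies in lam > 1
    res = minimize_scalar(dual_objective, bounds=(1.0 + 1e-6, lam_hi), method="bounded")
    return res.fun

# Illustration with a stop-loss-type loss h(z) = ((z - d)_+)^p; d and the sample are hypothetical
p, d, eps = 2, 3.0, 0.5
print(dual_inner_worst_case(lambda z: np.maximum(z - d, 0.0) ** p,
                            np.array([1.0, 2.5, 4.0, 6.0, 7.5]), eps, p))
```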
With the help of Proposition 3.1, by completing the outer optimization in (3.4), we will eventually solve the worst-case estimation problem (3.3). To this end, introduce
\begin{equation*}C_{0}=\left( E\left[ Y^{p}\right] \right) ^{\frac{1}{p}}=\left( \frac{1}{n_{0}}\sum_{i=1}^{n_{0}}y_{i}^{p}\right) ^{\frac{1}{p}}\qquad \text{and}\qquad C_{1}=\left\Vert Z_{N}\right\Vert _{L^{p}}=\left( \frac{1}{N}\sum_{i=1}^{N}z_{i}^{p}\right) ^{\frac{1}{p}},\end{equation*}
where the distribution of Y is fully in accordance with the empirical distribution from the sample $\{y_{1},\ldots ,y_{n_{0}}\}$ in scenario 0. There are two cases:
The case $C_{0}\leq C_{1}$ means that, in the $L^{p}$ norm, the risk realization in the ambiguous scenario 1 tends to be larger than that in the ordinary scenario 0, which may be the case if, for example, the ambiguous scenario 1 is more catastrophic. For this case, we are able to obtain an analytical solution to the worst-case estimation problem (3.3).
Theorem 3.1 Consider the worst-case estimation problem (3.3). In addition to the conditions in Proposition 3.1, assume that $C_{0}\leq C_{1}$ . Then the worst-case pth moment of X is
(3.7) \begin{equation}\begin{cases}(1-q_{n})C_{0}^{p}+q_{n}\left( rs+C_{1}\right) ^{p}, & 0<r\leq r_{\ast },\\ (1-q_{\ast })C_{0}^{p}+q_{\ast }\left( \epsilon _{\ast }+C_{1}\right) ^{p}, & r>r_{\ast }.\end{cases}\end{equation}
In (3.7):
-
$r_{\ast }=\bar{r}\vee 0$ , with $\bar{r}$ the unique solution r to
(3.8) \begin{equation}-C_{0}^{p}+(rs+C_{1})^{p-1}\left( rs(1-q_{n}r^{-p})+C_{1}\right) =0;\end{equation} -
$q_{\ast }=\bar{q}_{n}\wedge q_{n}^{+}$ , with $q_{n}^{+}$ defined in (2.9) and with $\bar{q}_{n}$ the unique solution q to
(3.9) \begin{equation}-C_{0}^{p}+\left( s(r^{p}+q_{n}-q)^{^{\frac{1}{p}}}+C_{1}\right)^{p}-qs\left( s(r^{p}+q_{n}-q)^{^{\frac{1}{p}}}+C_{1}\right)^{p-1}(r^{p}+q_{n}-q)^{^{\frac{1}{p}}}=0; \end{equation} -
$\epsilon _{\ast }=\epsilon (q_{\ast })=s(r^{p}+q_{n}-q_{\ast })^{^{\frac{1}{p}}}$ as defined in (2.10).
The proof of Theorem 3.1 is postponed to the Appendix, from which it is easy to see that the supremum of (3.7) is always attainable. Let us observe Theorem 3.1 for the cases $p>1$ and $p=1$ separately. We will see that it echoes Remark 2.2 to a certain extent.
For $p>1$ , it is easy to see that $r_{\ast }>0$ as Equation (3.8) yields a unique solution $\bar{r}\in \left( 0,q_{n}^{\frac{1}{p}}\right) $ . Actually, the left-hand side of (3.8), as a continuous and increasing function in r, diverges to $-\infty $ as $r\downarrow 0$ , while it takes a positive value at $r=q_{n}^{\frac{1}{p}}$ since $C_{0}\leq C_{1}$ . The first piece in (3.7) shows that, when r is relatively small such that $0<r\leq r_{\ast }=\bar{r}$ , the optimization procedure requires that the ambiguity be fully allocated to $F_{Z}$ to raise $E\left[Z^{p}\right] $ , and subsequently the range for q boils down to the singleton $\{q_{n}\}$ . However, when r is relatively large such that $r>r_{\ast }=\bar{r}$ , the total amount of ambiguity is allocated to both q and $F_{Z}$ according to the second piece in (3.7). Precisely, part of the ambiguity is allocated to q to shift it from $q_{n}$ to $q_{\ast}\in (q_{n},q_{n}+r^{p})$ , and the remaining ambiguity, as quantified by $\epsilon _{\ast }\in \left( 0,rs\right) $ after scaling by s, is allocated to $F_{Z}$ to raise $E\left[ Z^{p}\right] $ .
For $p=1$ , it is possible that Equation (3.8) yields a negative solution depending on the value of $q_{n}$ . For this case, $r_{\ast}=0$ , and thus only the second piece in (3.7) is relevant, which indicates that the ambiguity is not fully allocated to $F_{Z}$ . However, it is possible that $q_{\ast }=q_{n}+r$ and hence $\epsilon _{\ast }=0$ , which means that the optimization procedure fully allocates the ambiguity to q to shift it from $q_{n}$ to $q_{\ast }=q_{n}+r$ and subsequently squeezes the range for $F_{Z}$ to the singleton $\left\{ \hat{F}_{Z}\right\} $ .
Remark 3.2 The other case $C_{0}>C_{1}$ means that, in the $L^{p}$ norm, the risk realization in the ambiguous scenario 1 is smaller than that in the ordinary scenario 0, which may be the case when, for example, an insurance company extends with prudence to a new line of business in which losses incurred are subject to more ambiguity but not necessarily larger than in its ordinary business. Unfortunately, the proof of Theorem 3.1 does not cover this case: The auxiliary function $f_{1}\left( q\right) $ in (A12) does not exhibit a clear convexity feature on $[q_{n}^{-},q_{n}]$ , and hence it becomes troublesome to achieve an analytical solution. Nevertheless, with the help of Proposition 3.1, the worst-case estimation problem (3.3) is numerically tractable and we will instead seek numerical solutions in Section 4.
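In both cases $C_{0}\leq C_{1}$ and $C_{0}>C_{1}$ , the outer optimization is one-dimensional once Proposition 3.1 is plugged in, so a simple grid search suffices. The following Python sketch (names and the grid resolution are our own choices) implements this; when $C_{0}\leq C_{1}$ its output should coincide with the analytical solution given in Theorem 3.1.

```python
import numpy as np

def worst_case_moment_partial_ambiguity(y, z, p, r, s, grid_size=20001):
    """Numerical solution of (3.3): maximize over q in [q_n^-, q_n^+] the objective
    (1 - q)*C0^p + q*(eps(q) + C1)^p, where eps(q) = s*(r^p - |q - q_n|)^(1/p)
    and the inner supremum is taken from Proposition 3.1."""
    y, z = np.asarray(y, float), np.asarray(z, float)
    q_n = z.size / (y.size + z.size)
    C0 = np.mean(y ** p) ** (1.0 / p)          # empirical L^p norm in scenario 0
    C1 = np.mean(z ** p) ** (1.0 / p)          # empirical L^p norm in scenario 1
    q = np.linspace(max(q_n - r ** p, 0.0), min(q_n + r ** p, 1.0), grid_size)
    eps = s * (r ** p - np.abs(q - q_n)) ** (1.0 / p)
    values = (1.0 - q) * C0 ** p + q * (eps + C1) ** p
    i = int(np.argmax(values))
    return values[i], q[i]                     # worst-case E[X^p] and the maximizing q
```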
4. Numerical studies
This section is devoted to numerical studies to illustrate the benefit of our new approach via (3.3), which gives consideration to partial ambiguity, compared with the traditional approach via (3.2).
4.1. Synthetic data
To assess the performance of the traditional and new approaches, we will generate synthetic data from a known distribution, and then use this dataset in both approaches but pretend that we do not know the true distribution. Such an idea of synthetic data has often been used to facilitate similar numerical studies; see, for example, Lam and Mottet (2017).
Precisely, given the distributions $F_{I}$ , $F_{Y}$ , and $F_{Z}$ , the synthetic data of X can be generated in the following steps: First, we generate a random sample of a standard uniform variable U and define $I=1_{(U\geq 1-q)}$ ; Second, if $U<1-q$ , then $I=0$ and we generate a random sample of Y, while if $U\geq 1-q$ , then $I=1$ and we generate a random sample of Z; Third, putting these into (2.1), we obtain a random sample of X. A short code sketch implementing these steps is given after the list of distributions below.
In our numerical experiments, we will use the following distributions to model Y and Z:
-
We call $\xi $ a folded normal variable with parameters $(m,v^{2})\in(\!-\infty ,\infty )\times (0,\infty )$ if $\xi =\left\vert \eta \right\vert $ , where $\eta $ is normally distributed with mean m and variance $v^{2}$ , namely, $\eta \sim \mathcal{N}(m,v^{2})$ ;
-
We call $\xi $ a beta variable with parameters $(a,b,c)\in (0,\infty)^{3}$ if $\xi $ has the density
\begin{equation*}g(t)=\frac{1}{B(a,b)c}\left( \frac{t}{c}\right) ^{a-1}\left( 1-\frac{t}{c}\right) ^{b-1},\qquad 0\leq t\leq c,\end{equation*}where $B(\cdot ,\cdot )$ is the beta function; -
We call $\xi $ a Pareto variable with parameters $(\alpha ,\gamma )\in(0,\infty )^{2}$ if $\xi $ has the density
\begin{equation*}g(t)=\frac{\alpha \gamma ^{\alpha }}{t^{\alpha +1}},\qquad t\geq \gamma .\end{equation*}
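For concreteness, a minimal generator for the synthetic data may be coded as follows; it follows the three steps listed above with the folded normal and Pareto specifications used later in Subsection 4.5.1, while the seed, the sample size, and the function name are our own illustrative choices.

```python
import numpy as np

def generate_synthetic_data(n, q, rng):
    """Generate n observations of X = (1 - I)Y + IZ with I ~ Bernoulli(q),
    Y folded normal with (m, v^2) = (2, 2^2), and Z Pareto with (alpha, gamma) = (5, 20)."""
    U = rng.uniform(size=n)
    I = (U >= 1.0 - q).astype(int)                      # scenario indicator
    Y = np.abs(rng.normal(loc=2.0, scale=2.0, size=n))  # folded normal: |N(m, v^2)|
    Z = 20.0 * (1.0 + rng.pareto(5.0, size=n))          # classical Pareto(alpha, gamma)
    X = np.where(I == 1, Z, Y)
    return X, I, Y[I == 0], Z[I == 1]                   # pooled X, indicator, scenario samples

rng = np.random.default_rng(2023)
X, I, y_obs, z_obs = generate_synthetic_data(n=200, q=0.1, rng=rng)
```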
4.2. On the scale parameter
Recall the asymmetry issue mentioned in Remark 2.1(b) between the scenario indicator I and the risk Z in the ambiguous scenario. To address this issue, in Subsection 2.2, we have proposed to introduce a constant $s>0$ to scale down the risk Z, eventually leading to the worst-case estimation problem (2.7) with the constraint
(4.1) \begin{equation}\left\vert q-q_{n}\right\vert +\frac{1}{s^{p}}W\left( F_{Z},\hat{F}_{Z}\right) ^{p}\leq r^{p};\end{equation}
namely, we let q and $F_{Z}$ compete for the given amount of ambiguity in the worst-case estimation, while the amount of ambiguity allocated to $F_{Z}$ is further scaled by s. Now we establish a mechanism to endogenize the scale parameter s. To clarify, we do not claim that our mechanism is universally appropriate, but rather we think that, depending on the situation, there should be other potentially better mechanisms to endogenize s.
Recall the two extreme cases mentioned in Remark 2.2 when allocating the ambiguity along the two folds. The first extreme case is to maximally allocate the ambiguity to the probability q, which must be realized at either endpoint of the interval (2.9). There is a subtlety that q may not consume all the given ambiguity, for which case there is still ambiguity left for $F_{Z}$ according to (2.10). Putting together, we obtain an estimate for $E\left[ X^{p}\right] $ as
(4.2) \begin{equation}(1-\tilde{q})C_{0}^{p}+\tilde{q}\left( \tilde{\epsilon}+C_{1}\right) ^{p},\end{equation}
where $\tilde{q}$ is either $q_{n}^{+}$ or $q_{n}^{-}$ , whichever yields a larger value of (4.2), and where $\tilde{\epsilon}$ is defined by
(4.3) \begin{equation}\tilde{\epsilon}=s\left( r^{p}-\left\vert \tilde{q}-q_{n}\right\vert \right) ^{\frac{1}{p}}\end{equation}
according to (2.10). The second extreme case is to maximally allocate the ambiguity described by (4.1) to the distribution $F_{Z}$ to raise the pth moment of Z. As $F_{Z}$ can always consume all the given ambiguity, which raises the pth moment of Z to $(rs+C_{1})^{p} $ by Proposition 3.1, there is no ambiguity left for q. This gives another estimate for $E\left[ X^{p}\right] $ as
(4.4) \begin{equation}(1-q_{n})C_{0}^{p}+q_{n}\left( rs+C_{1}\right) ^{p}.\end{equation}
Our mechanism for determining the scale parameter s is based on the reasoning that, in competing for the given ambiguity to optimize $E\left[X^{p}\right] $ , the scenario indicator I and the risk Z in the ambiguous scenario should have equal power. Quantitatively, we interpret this as requiring that the two extreme cases yield the same estimate for $E\left[ X^{p}\right] $ . Thus, by equating (4.2) and (4.4), we arrive at the equation
(4.5) \begin{equation}(1-\tilde{q})C_{0}^{p}+\tilde{q}\left( \tilde{\epsilon}+C_{1}\right) ^{p}=(1-q_{n})C_{0}^{p}+q_{n}\left( rs+C_{1}\right) ^{p},\end{equation}
which we use to endogenize s.
Once the dataset is available and the radius r is given, the values of $C_{0}$ , $C_{1}$ , and $q_{n}$ are known, and we can decide $\tilde{q}$ and $\tilde{\epsilon}$ by comparing the values of (4.2) at $\tilde{q}=q_{n}^{+}$ and $\tilde{q}=q_{n}^{-}$ . Then it becomes straightforward to check the existence and uniqueness of the solution s to (4.5). We remark that, for most practical situations where the radius $r $ is modest, q moving from $q_{n}$ to $\tilde{q}$ is able to consume all the ambiguity, leaving no ambiguity to $F_{Z}$ . More precisely, if $[q_{n}-r^{p},q_{n}+r^{p}]\subset \lbrack 0,1]$ , then $[q_{n}^{-},q_{n}^{+}]=[q_{n}-r^{p},q_{n}+r^{p}]$ by (2.9), and subsequently, for either $\tilde{q}=q_{n}^{+}$ or $\tilde{q}=q_{n}^{-}$ , we have $\tilde{\epsilon}=0$ by (4.3). In such a situation, Equation (4.5) is simplified to
\begin{equation*}(1-\tilde{q})C_{0}^{p}+\tilde{q}C_{1}^{p}=(1-q_{n})C_{0}^{p}+q_{n}\left( rs+C_{1}\right) ^{p},\end{equation*}
which gives a closed-form solution
\begin{equation*}s=\frac{1}{r}\left( \left( \frac{(q_{n}-\tilde{q})C_{0}^{p}+\tilde{q}C_{1}^{p}}{q_{n}}\right) ^{\frac{1}{p}}-C_{1}\right) .\end{equation*}
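The mechanism is straightforward to implement. The sketch below (function name ours) covers the simplified situation just described, in which the shift of q to $\tilde{q}$ consumes all of the ambiguity so that $\tilde{\epsilon}=0$ and the closed-form solution applies; outside this situation one may instead solve (4.5) by a one-dimensional root search.

```python
import numpy as np

def endogenize_scale(y, z, p, r):
    """Closed-form scale parameter s from the simplified Equation (4.5),
    valid when [q_n - r^p, q_n + r^p] lies inside [0, 1] so that eps_tilde = 0."""
    y, z = np.asarray(y, float), np.asarray(z, float)
    q_n = z.size / (y.size + z.size)
    C0 = np.mean(y ** p) ** (1.0 / p)
    C1 = np.mean(z ** p) ** (1.0 / p)
    assert 0.0 <= q_n - r ** p and q_n + r ** p <= 1.0, "simplified case only"
    # q_tilde is the endpoint of (2.9) yielding the larger value of (4.2) when eps_tilde = 0
    q_tilde = q_n + r ** p if C1 >= C0 else q_n - r ** p
    rhs = ((q_n - q_tilde) * C0 ** p + q_tilde * C1 ** p) / q_n
    return (rhs ** (1.0 / p) - C1) / r
```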
4.3. Bootstrapping
To produce the worst-case estimate for $E\left[ X^{p}\right] $ with a desired coverage probability, a key step is to determine an appropriate radius r for the Wasserstein ball. There have already been a few theoretical studies under various distributional assumptions such as light tails, but the corresponding selection of the radius either is too conservative or involves unknown constants; see, for example, Pflug and Wozabal (2007), Fournier and Guillin (2015), and Zhao and Guan (2018), among others. Most of those theoretical studies thus far are not immediately applicable in practice, and thus scholars usually resort to numerical approaches such as bootstrapping to calibrate the Wasserstein radius r; see, for example, Mohajerin Esfahani and Kuhn (2018) and Kannan et al. (2020) for related discussions.
In our numerical studies, we will prudently calibrate the radius r using bootstrapping. For a given dataset of size n, we first construct resamplings from it. Each resampling yields a training dataset and a validation dataset. Suppose that we already have k resamplings. Given a radius r for the worst-case estimation, we process these k resamplings to examine if this level of r is high enough to guarantee the desired coverage probability for $E\left[ X^{p}\right] $ . Then we identify the smallest radius r such that the coverage probability is no less than the desired coverage probability.
Formally, the procedure consists of the following steps:
-
For each resampling, with the reference distribution generated from the training dataset and the radius r, we produce a worst-case estimate for $E\left[ X^{p}\right] $ .
-
With the reference distribution generated from the corresponding validation dataset, we have a direct estimate for $E\left[ X^{p}\right] $ .
-
Repeat these steps for k resamplings and count the frequency that the worst-case estimate from the training dataset is no less than the direct estimate from the validation dataset.
-
We regard this frequency as the coverage probability of the corresponding approach with the radius r. We gradually raise r if the obtained coverage probability is lower than the desired, or gradually reduce r otherwise.
-
We redo the steps above for T times.
Following the procedure, the radius r is prudently calibrated based on the available data.
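In pseudocode-like Python, the calibration reads as follows; the two callables worst_case_estimate and direct_estimate are hypothetical placeholders standing for the worst-case estimator of the chosen approach and the plain empirical estimator, and the candidate grid of radii is an illustrative device.

```python
import numpy as np

def calibrate_radius(resamplings, worst_case_estimate, direct_estimate, radii, beta=0.95):
    """Return the smallest candidate radius r (radii assumed sorted increasingly) whose
    realized coverage over the k resamplings is at least the desired level beta.
    'resamplings' is a list of (training_data, validation_data) pairs."""
    for r in radii:
        covered = [worst_case_estimate(train, r) >= direct_estimate(valid)
                   for train, valid in resamplings]
        if np.mean(covered) >= beta:
            return r
    return radii[-1]    # fall back to the largest candidate if none reaches the coverage
```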
Now we show how to construct the k resamplings in the traditional and new approaches, respectively. In the traditional approach as formulated by (3.2), the generic risk X is subject to ambiguity. To construct a resampling, we simply sample with replacement the total n data points to construct the training dataset, and the validation dataset comprises the rest, roughly a fraction $\left( 1-\frac{1}{n}\right) ^{n}\approx e^{-1}$ of the original data points that are absent from the training dataset; see Subsection 7.2 of Mohajerin Esfahani and Kuhn (2018). Moreover, when processing this resampling, the reference distributions generated from the training dataset and the validation dataset are selected to be the empirical distributions.
In our new approach as formulated by (3.3), the moment $E\left[ X^{p}\right] $ is now decided by $(F_{I},F_{Y},F_{Z})$ among which $F_{I}$ and $F_{Z}$ are subject to ambiguity. To reflect this two-fold ambiguity, we construct the training dataset and the validation dataset in each resampling as follows:
-
We keep the $n_{0}$ data points of Y in scenario 0 unchanged and denote by $\hat{F}_{Y}$ the empirical distribution, which we treat as the true distribution of Y.
-
Regarding the ambiguity of $F_{Z}$ , we sample with replacement the N data points in scenario 1 to construct the training dataset, which gives the empirical distribution $\hat{F}_{Z}^{\textrm{tr}}$ . The rest of the data points absent from the training dataset form the validation dataset, which gives the empirical distribution $\hat{F}_{Z}^{\textrm{va}}$ .
-
Regarding the ambiguity of $F_{I}$ , we sample with replacement the $n=n_{0}+N$ mixed data to construct the training dataset and estimate $\hat{q}^{\textrm{tr}}$ to be the frequency of scenario 1, yielding the Bernoulli distribution $\hat{F}_{I}^{\textrm{tr}}$ with $\hat{F}_{I}^{\textrm{tr}}\left( \left\{ 1\right\} \right) =\hat{q}^{\textrm{tr}}$ . The rest of the data points absent from the training dataset form the validation dataset from which we obtain in the same way the frequency $\hat{q}^{\textrm{va}}$ and the Bernoulli distribution $\hat{F}_{I}^{\textrm{va}}$ .
-
When processing this resampling, the reference distributions generated from the training dataset and the validation dataset are $\left( \hat{F}_{I}^{\textrm{tr}},\hat{F}_{Y},\hat{F}_{Z}^{\textrm{tr}}\right) $ and $\left( \hat{F}_{I}^{\textrm{va}},\hat{F}_{Y},\hat{F}_{Z}^{\textrm{va}}\right) $ , respectively.
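One resampling for the new approach can be coded, for example, as below (function name ours; the out-of-bag points form the validation datasets, and the rare event of an empty out-of-bag set is ignored in this sketch).

```python
import numpy as np

def one_resampling_new_approach(y, z, rng):
    """Construct the training and validation reference distributions
    (q^tr, F_Y, F_Z^tr) and (q^va, F_Y, F_Z^va) for one bootstrap resampling."""
    n0, N = y.size, z.size
    # ambiguity in F_Z: resample the N scenario-1 points with replacement
    idx = rng.integers(0, N, size=N)
    z_tr, z_va = z[idx], np.delete(z, np.unique(idx))          # in-bag and out-of-bag points
    # ambiguity in F_I: resample the n = n0 + N scenario labels with replacement
    labels = np.concatenate([np.zeros(n0, int), np.ones(N, int)])
    jdx = rng.integers(0, n0 + N, size=n0 + N)
    q_tr = labels[jdx].mean()                                   # frequency of scenario 1
    q_va = labels[np.delete(np.arange(n0 + N), np.unique(jdx))].mean()
    # the trusted sample y of Y is kept unchanged in both datasets
    return (q_tr, y, z_tr), (q_va, y, z_va)
```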
4.4. Comparison with the central limit theorem approach
In this subsection, following a reviewer’s request we use the classical central limit theorem (CLT) to provide an upper confidence bound as another conservative estimate for the pth moment. In the next subsection, we will conduct numerical experiments to examine the advantage of our new approach over the CLT approach.
Recall the decomposition (2.2), which, with h(x) specified to the power function $x^{p}$ , becomes
where W denotes the product $IZ^{p}$ and the last step holds due to the independence between I and Z. With $\pi =E\left[ X^{p}\right] $ , $q=E[I]$ , and $w=E[W]$ , we further rewrite the decomposition as
(4.6) \begin{equation}\pi =(1-q)E\left[ Y^{p}\right] +w.\end{equation}
To use (4.6) to estimate $\pi $ , note that $E\left[ Y^{p}\right]$ is known since we fully trust the empirical distribution of Y from scenario 0, but q and w, due to the unknown distributions of I and $W $ , need to be estimated. In view of the simple linear relationship in (4.6), it is customary to estimate $\pi $ by the same expression with q and w replaced by their sample versions
respectively, yielding
In the expression for $w_{n}$ , whenever $I_{i}=0$ there is no observation of Z but the product $I_{i}z_{i}^{p}$ is understood as 0.
By the central limit theorem, we have
where $\overset{d}{\rightarrow }$ denotes convergence in distribution and $\Sigma =\left( \sigma _{ij}\right) _{2\times 2}$ is the covariance matrix of $\left( I,W\right) $ . It follows that
Therefore, for a desired coverage probability $\beta \in (0,1)$ , namely, the probability that the estimate is no less than the true value, the CLT estimate for $\pi $ is
where $\Phi \left( \cdot \right) $ is the standard normal distribution function and $\hat{\sigma}_{11}$ , $\hat{\sigma}_{12}$ , and $\hat{\sigma}_{22} $ are the sample versions of $\sigma _{11}$ , $\sigma _{12}$ , and $\sigma _{22}$ .
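A direct implementation of the CLT estimate may look as follows; the variable names and the use of the sample covariance matrix of the pairs $(I_{i},W_{i})$ are our own choices.

```python
import numpy as np
from scipy.stats import norm

def clt_upper_bound(y, x, I, p, beta=0.95):
    """CLT-based upper confidence bound for pi = E[X^p], treating E[Y^p] as known
    from the trusted scenario-0 sample y; x and I are the pooled observations of X
    and of the scenario indicator, so that W_i = I_i * x_i^p equals I_i * Z_i^p."""
    x, I, y = np.asarray(x, float), np.asarray(I, float), np.asarray(y, float)
    n = x.size
    EYp = np.mean(y ** p)                      # trusted E[Y^p]
    W = I * x ** p
    pi_n = (1.0 - I.mean()) * EYp + W.mean()   # point estimate of pi based on (4.6)
    Sigma = np.cov(np.vstack([I, W]))          # sample covariance matrix of (I, W)
    var = EYp ** 2 * Sigma[0, 0] - 2.0 * EYp * Sigma[0, 1] + Sigma[1, 1]
    return pi_n + norm.ppf(beta) * np.sqrt(var / n)
```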
We would like to point out that the CLT approach and our new approach focus on different topics and applications. The former provides approximations based on the classical limit theory, while the latter is designed to address the ambiguity issue and the resulting worst-case estimates can be viewed as distributionally robust bounds.
4.5. Numerical results
4.5.1. First, consider the case $C_{0}\leq C_{1}$ in Section 3
We specify a Bernoulli distribution for I with $P(I=1)=0.1$ , a folded normal distribution for Y with parameters $(m,v^{2})=(2,2^{2})$ , and a Pareto distribution for Z with parameters $(\alpha ,\gamma )=(5,20)$ . Experiments are conducted for the first four moments $p=1,2,3,4$ . It is easy to check that these moments of Y are much smaller than those of Z, and thus the condition $C_{0}\leq C_{1}$ can be easily fulfilled in numerical experiments.
We first generate a dataset of size n but pretend that we do not know the true distributions and set the desired coverage probability to be $\beta=0.95$ . Based on this dataset, we compute the upper bound for $E\left[ X^{p}\right] $ using the traditional and the new approaches adopting the worst-case treatment. In each approach, we calibrate the radius r using bootstrapping with $k=100$ resamplings and then produce the upper bounds by solving the corresponding worst-case estimation problems. Note that for the new approach, we use the mechanism described in Subsection 4.2 to determine the scale parameter s. Moreover, we also apply the CLT approach proposed in Subsection 4.4 to produce the $0.95$ confidence upper bound for $E\left[ X^{p}\right] $ . Thus, we employ three approaches, which we call the “traditional”, the “new”, and the “CLT” approaches to distinguish them.
Repeating the above procedure $T=100$ times, the realized coverage probability of each approach is estimated by the frequency with which the upper bound is no less than the true moment $E\left[ X^{p}\right] $ . We specify the size of the dataset to be $n=200$ and $n=2000$ , and the corresponding results are shown in Tables 1 and 2, respectively. We observe the following. The traditional approach always achieves a full coverage probability, which signals the aforementioned over-conservativeness issue, while the CLT approach always achieves coverage probabilities that are significantly lower than the desired level $0.95$ , which signals significant underestimation. In contrast, our new approach achieves generally satisfactory coverage probabilities. Only for the fourth moment is the coverage probability from the new approach slightly below the desired level, which indicates that a larger dataset is required for estimating higher moments.
Furthermore, we compare the traditional and new approaches that generally guarantee the desired coverage probability. Specifically, we measure the estimation error of each approach by the mean squared error (MSE) of the upper bounds in the $T=100$ experiments compared with the true moment. To demonstrate the benefit of our new approach, we calculate the reduction ratio in the MSE:
and the corresponding results are also shown in Tables 1 and 2. We can observe that the new approach significantly reduces the estimation error compared with the traditional approach.
Note that the implementation of the new approach requires an adequately large dataset. On the one hand, the new approach is based on the assumption of abundant data from the ordinary scenario; on the other hand, it also requires a reasonable amount of data from the ambiguous scenario in order to effectively construct the training dataset and the validation dataset during bootstrapping.
4.5.2. Next, consider the other case $C_{0}>C_{1}$ in Section 3
We specify a Bernoulli distribution for I with $P(I=1)=0.1$ , a beta distribution for Y with parameters $(a,b,c)=(5,1,2)$ , and a folded normal distribution for Z with parameters $(m,v^{2})=(0,1^{2})$ . Experiments are conducted still for the first four moments $p=1,2,3,4$ . It is easy to check that these moments of Y are now larger than those of Z, and thus the condition $C_{0}>C_{1}$ can be fulfilled in numerical experiments.
We conduct the same experiments as before with $k=100$ , $T=100$ , and the desired coverage probability $\beta =0.95$ . The numerical results for the datasets of size $n=200$ and 2000 are shown in Tables 3 and 4, respectively, from which we have similar observations to those from Tables 1 and 2. In particular, we see that the desired coverage probability $0.95$ is achieved only in the traditional and new approaches, between which the new approach, while generally retaining the desired coverage probability, greatly reduces the estimation error for all four moments.
5. Concluding remarks
We revisit the worst-case estimation of the moments of a risk X whose distribution is subject to ambiguity. To alleviate the over-conservativeness issue, we consider the risk X as resulting from the aggregation of the risk Y in an ordinary scenario subject to no ambiguity and the risk Z in an ambiguous scenario subject to significant ambiguity. The ambiguity exists in both the scenario indicator and the risk in the ambiguous scenario. We construct a Wasserstein ball to describe this two-fold ambiguity and then we derive worst-case estimates for the moments of X both analytically and numerically.
Several extensions are worthy of pursuit in the future. First, we may consider multiple risk scenarios each of which is subject to a varying extent of ambiguity. With the extents of ambiguity specified, we need to disentangle the total ambiguity along the scenario indicator and the risks from the respective scenarios. Our current work already lends useful hints to this extension. Second, loss functions often involve control variables, which represent, for example, a contract design, an investment strategy, or a risk management rule. Then, we face an additional layer of optimization with respect to the control variables, which, as pointed out by Kuhn et al. (2019), may amplify the impact of ambiguity. Third, it is also desirable to consider a general situation involving multiple risk factors rather than multiple risk scenarios. Then we need to deal with the worst-case estimation of the expectation $E\left[ h(Y,Z_{1},\ldots ,Z_{d})\right] $ in which h is a general multivariate loss function and Y, $Z_{1}$ , …, $Z_{d}$ are risks subject to different extents of ambiguity. Note that in this situation ambiguity also exists in the dependence structure of $(Y,Z_{1},\ldots ,Z_{d})$ .
Acknowledgments
The authors would like to thank the three anonymous referees for their constructive comments which have greatly helped us improve the work. The authors feel indebted to Editor Ruodu Wang, who kindly pointed out an asymmetry issue between the scenario indicator I and the risk Z in the decomposition (2.2) and made a nice suggestion on how to address this issue in conducting the worst-case estimation. Qihe Tang acknowledges the financial support by the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (DP200101859 and DP220100090), and Yunshen Yang acknowledges the financial support from the Scientia PhD Scholarship and the PhD Teaching Fellowship of UNSW Sydney.
A. Appendix
Lemma A.1 For two random pairs $\textbf{V}_{1}=(\xi _{1},\eta _{1})$ and $\textbf{V}_{2}=(\xi _{2},\eta _{2})$ , each with independent components so that $F_{\textbf{V}_{i}}=F_{\xi _{i}}\times F_{\eta _{i}}$ for $i=1,2$ , we have
where $W(\cdot ,\cdot )$ is the Wasserstein distance of order $p\geq 1$ defined as in (2.4) with $d(\cdot ,\cdot )$ specified as the p norm.
Proof. As in Subsection 2.3, we introduce the following spaces:
Since $F_{\textbf{V}_{i}}=F_{\xi _{i}}\times F_{\eta _{i}}$ for $i=1,2$ , we have $S=S_{\xi }\times S_{\eta }$ . Clearly,
On the one hand, due to the super-additivity of the infimum operation, we have
On the other hand, for any $\mu \in S_{\xi }$ and $\nu \in S_{\eta }$ , we simply take the product measure $\Pi _{(\mu ,\nu )}=\mu \times \nu $ , which defines a joint distribution of the quadruple $\left( \xi _{1},\eta_{1},\xi _{2},\eta _{2}\right) $ . Under $\Pi _{(\mu ,\nu )}$ , for each $i=1,2$ , the pair $\textbf{V}_{i}=(\xi _{i},\eta _{i})$ follows the distribution $F_{\textbf{V}_{i}}=F_{\xi _{i}}\times F_{\eta _{i}}$ . This verifies $\Pi_{(\mu ,\nu )}\in S$ . Therefore, by (A2),
Taking infimum of the two terms on the left-hand side over $\mu \in S_{\xi }$ and $\nu \in S_{\eta }$ , respectively, we obtain
This proves (A1).
Proof of Proposition 3.1. Introduce the space
Observe that the optimization problem (3.5) is conducted over the Wasserstein ball
with the Wasserstein distance $W\left( F_{Z},\hat{F}_{Z}\right) =\inf_{\Pi\in S}\left( E_{\Pi }\left[ \left\vert Z-\hat{Z}\right\vert ^{p}\right]\right) ^{\frac{1}{p}}$ . Note that the infimum in defining the Wasserstein distance is always attainable; see, for example, Theorem 4.1 of Villani (Reference Villani2009). Therefore, for each $F_{Z}\in B_{\epsilon }\left( \hat{F}_{Z}\right) $ , we can find a joint distribution $\Pi \in S$ such that $E_{\Pi }\left[\left\vert Z-\hat{Z}\right\vert ^{p}\right] \leq \epsilon ^{p}$ , while conversely, for each $\Pi \in S$ such that $E_{\Pi }\left[ \left\vert Z-\hat{Z}\right\vert ^{p}\right] \leq \epsilon ^{p}$ , its marginal distribution $\Pi _{Z}=F_{Z}$ belongs to the ambiguity ball $B_{\epsilon }\left( \hat{F}_{Z}\right) $ . This allows us to rewrite the optimization problem (3.5) in terms of $\Pi $ , that is,
Now we continue on the optimization problem (A3). Given a joint distribution $\Pi \in S$ , we denote by $F_{Z}^{i}$ the distribution of Z conditioned on $\hat{Z}=z_{i}$ , that is,
In terms of this conditional distribution, by the law of total probability, we have
and
Note that the construction of the conditional distributions (A4) actually establishes a mapping between a joint distribution $\Pi \in S$ and a set of conditional distributions $\{F_{Z}^{1},\ldots ,F_{Z}^{N}\}$ . Moreover, by (A5), $E_{\Pi }\left[ \left\vert Z-\hat{Z}\right\vert ^{p}\right] \leq \epsilon ^{p}$ if and only if
Thus, we can convert the optimization problem (A3) to
Introducing the Lagrangian multiplier $\lambda $ to this new version of the optimization problem (A7), we have
where we can switch the order of $\sup $ and $\inf $ based on an established strong duality in worst-case estimation problems; see, for example, Theorem 1 and Proposition 2 of Gao and Kleywegt (2022). After rearrangement, it follows that
where the last step follows from the observation that each $F_{Z}^{i}$ can be reduced to a Dirac measure $\delta _{x}$ at any $x\in \mathbb{R}_{+}$ . By now, we have converted the semi-infinite optimization problem (3.5) to a one-dimensional convex minimization problem (A8).
To solve (A8), we first consider the case $p>1$ . We separate $\lambda \geq 0$ into two regions: $0\leq \lambda \leq 1$ and $\lambda \gt1$ . When $0\leq \lambda \leq 1$ , since there exists at least one $z_{i}>0$ for which $\lim_{z\rightarrow \infty }\left( \lambda \left\vert z-z_{i}\right\vert ^{p}-z^{p}\right) =-\infty $ , we have
Therefore, the minimizer of the dual problem (A8) for $p>1$ must be in the region $\lambda \gt1$ . By splitting the range for z into $(z>z_{i})$ and $(0\leq z\leq z_{i})$ , we obtain
Observe the function inside the first inner infimum $\inf_{z>z_{i}}$ in (A9). By taking derivative $\frac{d}{dz}$ and letting it be zero, we obtain a unique stationary point
which belongs to the region $z>z_{i}$ because $\lambda \gt1$ . It is easy to see that this infimum $\inf_{z>z_{i}}$ is attained at $z_{\ast }$ , that is,
Moreover, the second inner infimum $\inf_{0\leq z\leq z_{i}}$ in (A9) is attained at $z_{i}$ since the inside function is decreasing in $z\leq z_{i}$ . Thus, between the two infima in the bracketed part of (A9) the first is smaller, and we have
We continue to solve the optimization problem (A10). Looking at the inside function of $\lambda $ in (A10), it is easy to see that the $\min_{\lambda \gt1}$ is attained at the unique stationary point
and we have
This proves (3.6) for $p>1$ .
The case $p=1$ for (A8) can be dealt with in the same way, as follows:
This proves (3.6) for $p=1$ and hence completes the proof of Proposition 3.1.
Lemma A.2 Consider the optimization problem (3.2) in which $\hat{F}_{X}$ is the empirical distribution of a dataset $\{x_{1},\ldots,x_{n}\}$ . The supremum (3.2) is finite if and only if the order of the Wasserstein distance embedded in this worst-case estimation problem is no less than the order p of the power function.
Proof. Denote by $p^{\prime }$ the order of the Wasserstein distance embedded in (3.2). Following the beginning part of the proof of Proposition 3.1, we can convert the optimization problem (3.2) to
The right-hand side is finite if and only if there exists some $\lambda \geq0$ such that
if and only if there exists some $\lambda \geq 0$ such that
if and only if $p^{\prime }\geq p$ .
Proof of Theorem 3.1. In view of Proposition 3.1, to solve the worst-case estimation problem (3.3), it remains to complete the outer optimization problem in (3.4). Simply plugging into (3.4) the expression (3.6) with $\epsilon=s\left( r^{p}-\left\vert q-q_{n}\right\vert \right) ^{\frac{1}{p}}$ given in (2.10), we obtain
where $f_{1}(q)$ and $f_{2}(q)$ are two auxiliary functions defined by
Given $C_{0}\leq C_{1}$ , it is easy to deal with the first supremum of $f_{1}(q)$ . Observe that the bracketed part $\left( s\left(r^{p}-(q_{n}-q)\right) ^{\frac{1}{p}}+C_{1}\right) $ in $f_{1}(q)$ is monotonically increasing in q and larger than $C_{0}$ over $q\in \lbrack q_{n}^{-},q_{n}]$ . Thus, the supremum of $f_{1}(q)$ is attained at the largest possible value $q=q_{n}$ . This reduces the optimization problem (A11) to the second supremum.
To determine the supremum of $f_{2}(q)$ over $q\in \lbrack q_{n},q_{n}^{+}]$ , we observe the following derivatives:
Note that, in the expression for $\frac{d^{2}}{dq^{2}}$ above, the last bracketed term after $C_{1}$ satisfies
Thus, $\frac{d^{2}}{dq^{2}}f_{2}(q)\leq 0$ and $f_{2}(q)$ is concave over $q\in \lbrack q_{n},q_{n}^{+}]$ . We now look at the derivative
whose sign depends on the value of r. As it is continuous and strictly increasing in r, we define $\bar{r}$ as the unique solution to $\left. \frac{d}{dq}f_{2}(q)\right\vert _{q=q_{n}}=0$ , namely, Equation (3.8), and separately conclude the two pieces in (3.7) as follows:
-
When $0<r<r_{\ast }=\bar{r}\vee 0$ , we have $\left. \frac{d}{dq}f_{2}(q)\right\vert _{q=q_{n}}<0$ , which, combined with the concavity of $f_{2}(q)$ over the region $q\in \lbrack q_{n},q_{n}^{+}]$ , implies that $f_{2}(q)$ is decreasing over this region. Thus, the supremum is attained at $q=q_{n}$ , giving the first expression in (3.7).
-
When $r\geq r_{\ast }$ , we have $\left. \frac{d}{dq}f_{2}(q)\right\vert _{q=q_{n}}\geq 0$ . This implies that the supremum is attained at either the unique solution $\bar{q}_{n}$ to $\frac{d}{dq}f_{2}(q)=0$ , namely, Equation (3.9), or $q_{n}^{+}$ , whichever is smaller, giving the second expression in (3.7).
This completes the proof of Theorem 3.1.