1. Introduction
Markov decision processes enable the modeling of non-deterministic interactions between an agent and its environment within a tractable stochastic framework. At each time t the agent observes the current state and takes an action, which leads to an immediate reward. The goal of the agent is then to optimize its expected cumulative reward. Mathematically, Markov decision problems are solved based on a dynamic programming principle, which is the foundation of many reinforcement learning algorithms such as the Q-learning algorithm. See [Reference Bäuerle and Rieder5, Reference Feinberg and Shwartz10, Reference Puterman25, Reference Puterman26] for the theory of Markov decision processes, and [Reference Angiuli, Fouque and Lauriere1, Reference Cao, Chen, Hull and Poulos6, Reference Charpentier, Elie and Remlinger7, Reference Kaelbling, Littman and Moore11, Reference Kallus and Uehara12, Reference Levin, Pieraccini and Eckert15, Reference Natarajan and Kolobov20, Reference Sutton and Barto29, Reference White33] for their applications, especially in the field of reinforcement learning.
In the classical setup for Markov decision problems, the transition kernel describing the transition probabilities of the underlying Markov decision process is given. Economically, this means that the agent possesses knowledge of the true distribution of the underlying process, an assumption which typically cannot be justified in practice. To address this issue, a robust version of the Markov decision problem has recently been introduced, accounting for a possible misspecification of the assumed probability kernel that describes the dynamics of the state process. Typically, one assumes that the agent possesses a good guess of the true, but to the agent unknown, probability kernel; due to her uncertainty, however, she considers the worst case among all laws which lie within a ball of a certain radius around the estimated probability kernel with respect to some distance, e.g. the Wasserstein distance or the Kullback–Leibler divergence. See [Reference Bäuerle and Glauner3, Reference Bäuerle and Glauner4, Reference Chen, Yu and Haskell8, Reference El Ghaoui and Nilim9, Reference Li, Sutter and Kuhn16–Reference Mannor, Mebel and Xu18, Reference Neufeld and Sester21, Reference Neufeld, Sester and Šikić23, Reference Panaganti and Kalathil24, Reference Si, Zhang, Zhou and Blanchet27, Reference Si, Zhang, Zhou and Blanchet28, Reference Uğurlu30, Reference Wang and Zou32, Reference Wiesemann, Kuhn and Rustem34–Reference Yang, Zhang and Zhang37, Reference Zhou, Zhou, Bai, Qiu, Blanchet and Glynn39] for robust Markov decision problems and corresponding reinforcement-learning-based algorithms to solve them.
In this note, the goal is to analyze the difference between the value function of the Markov decision problem with respect to the true (but to the agent unknown) probability kernel and that of the robust Markov decision problem defined with respect to some Wasserstein ball around the (by the agent) estimated transition kernel. Note that the estimated transition kernel need not coincide with the true probability kernel; however, we assume that the agent’s guess is good enough that the true probability kernel lies within the Wasserstein ball around the estimated probability kernel.
Similar, while not identical, research questions have been studied in [Reference Bartl and Wiesel2, Reference Kern13, Reference Kiszka and Wozabal14, Reference Müller19, Reference Zähle38], mainly focusing on establishing stability results for value functions with respect to the choice of the underlying transition probability. In [Reference Müller19, Theorem 4.2], the author presents a state-dependent bound on the difference between iterations of value functions (obtained via the so-called value iteration algorithm) of two Markov decision processes, implying that these iterations depend continuously on the transition kernels. As a refinement of [Reference Müller19, Theorem 4.2] and also of the related result obtained in [Reference Kern13, Theorem 2.2.8], the result from [Reference Zähle38, Theorem 6.2] shows that in a finite time horizon setting the difference between the value functions of two Markov decision processes with different transition probabilities can be bounded by an expression depending on a certain tailored distance between the transition probabilities. In [Reference Kiszka and Wozabal14], the authors propose a semi-metric for Markov processes which allows one to determine bounds for certain types of linear stochastic optimization problems, cf. [Reference Kiszka and Wozabal14, Theorem 3]. The authors of [Reference Bartl and Wiesel2] study the sensitivity of multi-period stochastic optimization problems over a finite time horizon with respect to the underlying probability distribution in the so-called adapted Wasserstein distance. They show in [Reference Bartl and Wiesel2, Theorem 2.4] that the value function of their robust optimization problem, with the corresponding ambiguity set being a Wasserstein ball around a reference measure, can be approximated by the corresponding value function of the non-robust optimization problem defined with respect to the reference measure plus an explicit correction term.
Vaguely speaking (as the optimization problem in [Reference Bartl and Wiesel2] is technically speaking not comparable to our setting), this is similar to our analysis in the special case where our reference measure coincides with the true measure.
Under some mild assumptions, we obtain in Theorem 3.1 an explicit upper bound for the difference between the value function of the robust and the non-robust Markov decision problem which only depends on the radius $\varepsilon$ of the Wasserstein ball, the discount factor $\alpha$ , and the Lipschitz constants of the reward function and the true transition kernel. In particular, we obtain that the difference of the two value functions only grows at most linearly in the radius $\varepsilon$ and does not depend on the dimensions of the underlying state and action space.
The remainder of this note is organized as follows. In Section 2 we introduce the underlying setting used to derive our main result, which is presented in Section 3. The proof of the main result, together with the auxiliary results needed for it, is given in Section 4.
2. Setting
We first present the underlying setting to define both robust and non-robust Markov decision processes that we then use to compare their respective value functions.
2.1. Setting
As the state space we consider a closed subset $\mathcal{X} \subseteq\mathbb{R}^d$ for some $d\in \mathbb{N}$ , equipped with its Borel $\sigma$ -field $\mathcal{F}_{\mathcal{X}}$ , which we use to define the infinite Cartesian product $\Omega \,:\!=\, {{\mathcal{X}}^{{\mathbb{N}_0}}} = {\mathcal{X}} \times {\mathcal{X}} \times \cdots$ and the $\sigma$ -field $\mathcal{F}\,:\!=\,{\mathcal{F}_{\mathcal{X}}}\otimes \mathcal{F}_{\mathcal{X}} \otimes \cdots$ . For any $q\in \mathbb{N}$ , we denote by $\mathcal{M}_1^q(\mathcal{X})$ the set of probability measures on $\mathcal{X}$ with finite q-moments and write $\mathcal{M}_1(\mathcal{X})\,:\!=\,\mathcal{M}_1^1(\mathcal{X})$ for brevity. We define on $\Omega$ the infinite-horizon stochastic process $(X_{t})_{t\in\mathbb{N}_0}$ via the canonical process $X_t((\omega_0,\omega_1,\dots,\omega_t,\dots))\,:\!=\,\omega_t$ for $(\omega_0,\omega_1,\dots,\omega_t,\dots) \in \Omega$ , $t \in \mathbb{N}_0$ .
To define the set of controls (also called actions) we fix a compact set $A \subseteq \mathbb{R}^m$ for some $m \in \mathbb{N}$ , and set
Next, we define the q-Wasserstein distance $d_{W_q}(\cdot,\cdot)$ for some $q\in \mathbb{N}$ . For any $\mathbb{P}_1,\mathbb{P}_2 \in \mathcal{M}_1^q(\mathcal{X})$ let $d_{W_q}(\mathbb{P}_1,\mathbb{P}_2)$ be defined as
where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$ , and where $\Pi(\mathbb{P}_1,\mathbb{P}_2)$ denotes the set of couplings of $\mathbb{P}_1$ and $\mathbb{P}_2$ , i.e. the set of joint distributions on $\mathcal{X}\times\mathcal{X}$ with first marginal $\mathbb{P}_1$ and second marginal $\mathbb{P}_2$ . Moreover, we denote by $\tau_q$ the Wasserstein q-topology, i.e. the topology induced by convergence with respect to $d_{W_q}$ .
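For intuition (this sketch is ours and not part of the formal development; the function name is hypothetical): for two empirical measures on the real line with the same number of equally weighted atoms, the 1-Wasserstein distance is attained by the monotone coupling, i.e. by matching the sorted samples.

```python
# d_{W_1} between two equally weighted empirical measures on the real
# line: the optimal coupling matches the sorted samples monotonically,
# so the distance is the mean absolute difference of the sorted values.
def wasserstein_1d(samples_p, samples_q):
    assert len(samples_p) == len(samples_q)
    xs, ys = sorted(samples_p), sorted(samples_q)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# A point mass at 0 and a point mass at 2 are at W_1-distance 2.
print(wasserstein_1d([0.0], [2.0]))  # 2.0
```

For general (weighted, multi-dimensional) measures the distance requires solving an optimal transport problem; the one-dimensional equal-weight case above is the only one with such a simple closed form.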
To define an ambiguity set of probability kernels, we first fix throughout this paper some $q\in \mathbb{N}$ and $\varepsilon>0$ . Then, we define, as an ambiguity set of probability kernels,
with respect to some center $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in\big(\mathcal{M}^q_1(\mathcal{X}),\tau_q\big)$ , meaning that $\mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ denotes the q-Wasserstein ball (also called the Wasserstein ball of order q) with radius $\varepsilon$ and center $\widehat{\mathbb{P}}(x,a)$ .
Under these assumptions we define, for every $x \in \mathcal{X}$ and ${\mathbf{a}} \in \mathcal{A}$ , the set of admissible measures on $(\Omega, \mathcal{F})$ by
where the notation $\mathbb{P}=\delta_x \otimes\mathbb{P}_0\otimes \mathbb{P}_1 \otimes\cdots \in \mathfrak{P}_{x,{\mathbf{a}}}$ abbreviates
2.2. Problem formulation and standing assumptions
Let $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ be some reward function. We assume from now on that it fulfils the following assumptions.
Assumption 2.1. (Assumptions on the reward function) The reward function $r\colon\mathcal{X}\times A\times\mathcal{X}\rightarrow\mathbb{R}$ satisfies the following:
-
(i) The map $\mathcal{X}\times A\times\mathcal{X}\ni(x_0,a,x_1)\mapsto r(x_0,a,x_1)\in\mathbb{R}$ is Lipschitz continuous with constant $L_r>0$ .
-
(ii) If $\mathcal{X}$ is unbounded and $q\in\mathbb{N}$ defined in (2.1) satisfies $q=1$ , then we additionally assume that $\sup_{x_0,x_1\in \mathcal{X}, a\in A} |r(x_0,a,x_1)|<\infty$ .
Note that Assumption 2.1(i) implies that the reward r is bounded whenever $\mathcal{X}$ is bounded.
Next, we impose the following standing assumption on our reference probability kernel modeled by the center of the q-Wasserstein ball.
Assumption 2.2. (Assumption on the center of the ambiguity set.) Let $q\in \mathbb{N}$ be defined in (2.1). Then the center $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in(\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ satisfies the following:
-
(i) The map $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in(\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ is continuous.
-
(i′) If the reward function r is unbounded, then we assume instead of (i) the stronger assumption that $\widehat{\mathbb{P}}$ is Lipschitz continuous, i.e. that there exists $L_{\widehat{\mathbb{P}}}>0$ such that
\begin{equation*} d_{W_q}(\widehat{\mathbb{P}}(x,a),\widehat{\mathbb{P}}(x^{\prime},a^{\prime})) \leq L_{\widehat{\mathbb{P}}}\big(\Vert x-x^{\prime} \Vert + \Vert a-a^{\prime} \Vert\big) \quad \text{for all } x,x^{\prime} \in \mathcal{X},\,a,a^{\prime}\in A.\end{equation*}
Finally, we assume the following on the discount factor $\alpha\in (0,1)$ .
Assumption 2.3. (Assumption on the discount factor) Let $q\in \mathbb{N}$ , $\varepsilon>0$ be defined as in (2.1), and $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in(\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ be as defined in Assumption 2.2. Then the discount factor $\alpha$ satisfies $0<\alpha<{1}/{C_P}$ , where $1\leq C_P<\infty$ is defined by
Our goal is to compare the value of the robust Markov decision problem with the value of the non-robust Markov decision problem. To define the robust value function, for every initial value $x\in \mathcal{X}$ , we maximize the expected value of $\sum_{t=0}^\infty \alpha^tr(X_{t},a_t,X_{t+1})$ under the worst-case measure from $\mathfrak{P}_{x,{\mathbf{a}}}$ over all possible actions ${\mathbf{a}} \in \mathcal{A}$ . More precisely, we introduce the robust value function by
To define the non-robust value function under the true, but to the agent unknown, probability kernel $\mathbb{P}^{\operatorname{true}}$ contained in the ambiguity set $\mathcal{P}$ , we impose the following assumptions on $\mathbb{P}^{\operatorname{true}}$ .
Assumption 2.4. (Assumptions on the true probability kernel) Let $q\in \mathbb{N}$ be as defined in (2.1). Then the true (but unknown) probability kernel $\mathcal{X}\times A\ni(x,a)\mapsto\mathbb{P}^{\mathrm{true}}(x,a)\in (\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ satisfies the following:
-
(i) $\mathbb{P}^{\operatorname{true}}(x,a) \in \mathcal{P}(x,a)$ for all $(x,a)\in \mathcal{X} \times A$ .
-
(ii) $\mathbb{P}^{\operatorname{true}}$ is $L_P$ -Lipschitz with constant $0\leq L_P < {1}/{\alpha}$ , where $0<\alpha<1$ is defined in Assumption 2.3, i.e. we have
\begin{equation*} d_{W_q}(\mathbb{P}^{\operatorname{true}}(x,a),\mathbb{P}^{\operatorname{true}}(x^{\prime},a^{\prime})) \leq L_P(\|x-x^{\prime}\|+\|a-a^{\prime}\|) \end{equation*}for all $ x,x^{\prime} \in \mathcal{X}$ , $a,a^{\prime}\in A$ .
Then, we introduce the non-robust value function under the true (but to the agent unknown) transition kernel by
where we write, for any $x \in \mathcal{X}$ and ${\mathbf{a}} \in \mathcal{A}$ ,
Note that Assumptions 2.1–2.4 ensure that the dynamic programming principle holds for both the robust and non-robust Markov decision problem; see [Reference Neufeld, Sester and Šikić23, Theorem 2.7].
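On a finite state and action space, the dynamic programming principle can be illustrated by one step of the robust Bellman recursion. The following Python sketch is purely illustrative and under our own assumptions: the Wasserstein ball is replaced by a finite list of candidate transition kernels (which only approximates the ambiguity set), and all names are ours, not the paper's.

```python
# One step of a robust Bellman recursion on a finite state/action space.
# Each candidate kernel is a dict mapping (x, a) to a dict {x1: prob}.
def robust_bellman_step(v, states, actions, candidate_kernels, r, alpha):
    new_v = {}
    for x in states:
        best = float("-inf")
        for a in actions:
            # Worst-case expected one-step value over the candidate kernels.
            worst = min(
                sum(p * (r(x, a, x1) + alpha * v[x1])
                    for x1, p in kernel[(x, a)].items())
                for kernel in candidate_kernels
            )
            best = max(best, worst)
        new_v[x] = best
    return new_v
```

Iterating this map from any bounded starting function converges for a discount factor $\alpha<1$ by the usual contraction argument, which is the content of the dynamic programming principle in this toy discretization.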
3. Main result
As a main result we establish a bound on the difference between the value function of the Markov decision process with fixed reference measure defined in (2.3), and the value function of the robust Markov decision process defined in (2.2).
Theorem 3.1. Let all Assumptions 2.1–2.4 hold.
-
(i) Then, for any $x_0\in \mathcal{X}$ ,
(3.1) \begin{equation} 0 \leq V^{\operatorname{true}}(x_0)-V(x_0) \leq 2 L_r \varepsilon(1+\alpha)\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i(L_P)^j < \infty. \end{equation} -
(ii) Moreover, in the special case that $\mathbb{P}^{\operatorname{true}} = \widehat{\mathbb{P}}$ , for any $x_0\in \mathcal{X}$ ,
(3.2) \begin{equation} 0 \leq V^{\operatorname{true}}(x_0)-V(x_0) \leq L_r \varepsilon(1+\alpha)\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i(L_P)^j < \infty. \end{equation}
We highlight that the upper bound from (3.1) depends only on $\varepsilon$ , $\alpha$ , and the Lipschitz constants $L_r$ and $L_P$ . In particular, the upper bound depends linearly on the radius $\varepsilon$ of the Wasserstein ball and is independent of the initial state $x_0$ and of the dimensions d and m of the state and action space, respectively.
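The bound in (3.1) is straightforward to evaluate numerically. The following sketch (ours; all parameter values below are illustrative) truncates the double series:

```python
def bound_31(L_r, eps, alpha, L_P, n_terms=10_000):
    """Truncated evaluation of 2*L_r*eps*(1+alpha) * sum_i alpha^i sum_{j<=i} L_P^j."""
    total, inner, lp_pow = 0.0, 0.0, 1.0
    for i in range(n_terms):
        inner += lp_pow              # inner = sum_{j=0}^{i} (L_P)^j
        total += alpha**i * inner
        lp_pow *= L_P
    return 2 * L_r * eps * (1 + alpha) * total

# With L_P = 0 the inner sum equals 1, so the bound reduces to
# 2*L_r*eps*(1+alpha)/(1-alpha).
print(round(bound_31(L_r=1.0, eps=1.0, alpha=0.5, L_P=0.0), 6))  # 6.0
```

Since the summand behaves like $(\alpha\max\{1,L_P\})^i$ up to a linear factor, the truncation error is negligible for moderate `n_terms` whenever $\alpha L_P<1$.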
Remark 3.1. The assertion from Theorem 3.1 also carries over to the case of autocorrelated time series where one assumes that the past $h \in \mathbb{N} \cap [2,\infty)$ values of a time series $(Y_{t})_{t \in \{-h,\ldots,-1,0,1,\ldots\}}$ taking values in some closed subset $\mathcal{Y}$ of $\mathbb{R}^{D}$ for some $D\in \mathbb{N}$ may have an influence on the next value. This can be modeled by defining the state process $X_t\,:\!=\,(Y_{t-h+1},\dots,Y_t) \in \mathcal{Y}^h\,=\!:\,\mathcal{X}$ , $t\in\mathbb{N}_0$ . In this setting, the subsequent state $X_{t+1}=(Y_{t-h+2},\dots,Y_{t+1})$ shares $h-1$ components with the preceding state $X_t=(Y_{t-h+1},\dots,Y_t)$ and uncertainty is only inherent in the last component $Y_{t+1}$ . Thus, we consider a transition kernel of the form $\mathcal{X}\times A\ni(x,a)\mapsto\mathbb{P}^{\mathrm{true}}(x,a) = \delta_{\pi(x)}\otimes\widetilde{\mathbb{P}}^{\mathrm{true}}(x,a)\in\mathcal{M}_1(\mathcal{X})$ , where $\widetilde{\mathbb{P}}^{\mathrm{true}}(x,a)\in\mathcal{M}_1(\mathcal{Y})$ and $\mathcal{X} \ni x=(x_1,\dots,x_h) \mapsto \pi(x)\,:\!=\, (x_2,\dots,x_h)$ denotes the projection on the last $h-1$ components. In this setting, for $q\in \mathbb{N}$ and $\varepsilon>0$ , the ambiguity set is given by
The described setting is discussed in more detail in [Reference Neufeld, Sester and Šikić23, Section 3.3] or [Reference Neufeld and Sester21, Section 2.2]. Typical applications can be found in finance and include portfolio optimization; cf. [Reference Neufeld, Sester and Šikić23, Section 4].
Example 3.1. (Coin toss.) To illustrate the applicability of Theorem 3.1, we study an example similar to the one provided in [Reference Neufeld and Sester21, Example 4.1]. To this end, we consider an agent who at each time tosses 10 coins and observes the number of heads. Thus, we model the environment by the state space $\mathcal{X}\,:\!=\,\{0,\dots,10\}$ . Prior to the toss, the agent can bet whether in the next toss of 10 coins the number of heads will be smaller ( $a=-1$ ) or larger ( $a=1$ ) than in the previous toss. She gains $\$1$ if the bet is correct, and in turn has to pay $\$1$ if it is not (without being rewarded/punished if the number of heads remains the same). Moreover, the agent can also decide not to bet on the toss (by choosing $a=0$ ). We model this via the reward function
where the possible actions are given by $ A\,:\!=\,\{-1,0,1\}$ . The reference measure in this setting assumes a fair coin and is therefore (independent of the state–action pair) a binomial distribution with $n=10$ , $p=0.5$ , i.e.
In the described setting it is easy to see that r is Lipschitz continuous with Lipschitz constant
Moreover, we have $L_P=0$ . In Figure 1 we plot the corresponding upper bound from (3.2) against the difference $V^{\mathrm{true}}(x_0)-V(x_0)$ for different initial values $x_0$ and different levels of $\varepsilon$ used for the computation of V with $\alpha =0.45$ . The value functions are computed using the robust Q-learning algorithm proposed in [Reference Neufeld and Sester21]. The code used can be found at https://github.com/juliansester/MDP_Bound.
4. Proof of the main result
In Section 4.1 we provide several auxiliary lemmas which are necessary to establish the proof of Theorem 3.1, which is reported in Section 4.2.
4.1. Auxiliary results
Lemma 4.1. Let $r\colon\mathcal{X}\times A\times\mathcal{X}\rightarrow\mathbb{R}$ satisfy Assumption 2.1. Let $\mathcal{X}\times A\ni(x,a)\mapsto\mathbb{P}^{\mathrm{true}}(x,a)(\mathrm{d} x_1) \in (\mathcal{M}_1^q(\mathcal{X}),\tau_q)$ satisfy Assumption 2.4. For any $v \in C_b(\mathcal{X}, \mathbb{R})$ , where here and in the following we denote by $C_b(\mathcal{X}, \mathbb{R})$ the set of continuous and bounded functions from $\mathcal{X}$ to $\mathbb{R}$ , define
Then, for any $v \in C_b(\mathcal{X}, \mathbb{R})$ being $L_r$ -Lipschitz, $n \in \mathbb{N}$ , $x_0,x^{\prime}_0\in \mathcal{X}$ , we have
Proof. For any $x_0,x^{\prime}_0\in \mathcal{X}$ and $a \in A$ , let ${\Pi^{\mathrm{true}}}(\mathrm{d}{x_1},\mathrm{d}{x^{\prime}_1})\in\mathcal{M}_1(\mathcal{X}\times\mathcal{X})$ denote an optimal coupling between $\mathbb{P}^{\mathrm{true}}(x_0,a)$ and $\mathbb{P}^{\mathrm{true}}(x^{\prime}_0,a)$ with respect to $d_{W_1}$ , i.e.
where the inequality follows from Hölder’s inequality (see, e.g., [Reference Villani31, Remark 6.6]). We prove the claim by induction. We start with the base case $n=1$ , and compute by using the Lipschitz continuity of the functions r and v, and of $\mathbb{P}^{\mathrm{true}}$ that
We continue with the induction step. Hence, let $n\in\mathbb{N}\cap[2,\infty)$ be arbitrary and assume that (4.2) holds for $n-1$ . Then, we compute
Applying the induction hypothesis to this therefore yields
Lemma 4.2. Let Assumptions 2.1 and 2.4 hold. Moreover, let $\mathcal{X} \times A \ni (x,a) \mapsto \mathbb{P}^{\mathrm{wc}}(x,a) \in \mathcal{P}(x,a)$ denote another probability kernel contained in $\mathcal{P}(x,a)$ for each $x,a \in \mathcal{X} \times A$ . Furthermore, for any $v \in C_b(\mathcal{X},\mathbb{R})$ , define
-
(i) Then, for any $v \in C_b(\mathcal{X}, \mathbb{R})$ that is $L_r$ -Lipschitz, $n \in \mathbb{N}$ , and $x_0\in \mathcal{X}$ ,
(4.4) \begin{equation} \big|(\mathcal{T}^{\mathrm{wc}})^nv(x_0)-(\mathcal{T}^{\mathrm{true}})^nv(x_0)\big| \leq 2L_r\varepsilon(1+\alpha)\sum_{i=0}^{n-1}\alpha^i\sum_{j=0}^i(L_P)^j, \end{equation}where $\mathcal{T}^{\mathrm{true}}$ is defined in (4.1). -
(ii) Moreover, in the special case that $\mathbb{P}^{\mathrm{true}} = \widehat{\mathbb{P}}$ , we obtain, for any $x_0\in \mathcal{X}$ ,
(4.5) \begin{equation} \big|(\mathcal{T}^{\mathrm{wc}})^nv(x_0)-(\mathcal{T}^{\mathrm{true}})^nv(x_0)\big| \leq L_r\varepsilon(1+\alpha)\sum_{i=0}^{n-1}\alpha^i\sum_{j=0}^i(L_P)^j. \end{equation}
Proof. (i) For any $x_0\in \mathcal{X}$ and $a \in A$ , let $\Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big)\in\mathcal{M}_1(\mathcal{X}\times\mathcal{X})$ denote an optimal coupling between $\mathbb{P}^{\mathrm{wc}}(x_0,a)$ and $\mathbb{P}^{\mathrm{true}}(x_0,a)$ with respect to $d_{W_1}$ . Then, since both $\mathbb{P}^{\mathrm{wc}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x_0,a) \in \mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x_0,a))$ we have
where the first inequality follows from Hölder’s inequality (see, e.g., [Reference Villani31, Remark 6.6]). We prove the claim by induction. To this end, we start with the base case $n=1$ , and compute by using (4.6) and the Lipschitz continuity of r, v, and $\mathbb{P}^{\mathrm{true}}$ that
We continue with the induction step. Therefore, let $n\in\mathbb{N}\cap[2,\infty)$ be arbitrary and assume that (4.4) holds for $n-1$ . Then, we compute
Applying Lemma 4.1 to (4.7) and the induction hypothesis to (4.8) together with (4.6) therefore yields
(ii) In the case $\mathbb{P}^{\mathrm{true}} = \widehat{\mathbb{P}}$ we have, for any $x_0\in\mathcal{X}$ and $a \in A$ that
since the ambiguity set $\mathcal{P}(x_0,a)$ is centered around $\mathbb{P}^{\mathrm{true}}(x_0,a)=\widehat{\mathbb{P}}(x_0,a)$ . Hence, replacing (4.6) by (4.9) and then following the proof of (i) shows the assertion.
Lemma 4.3. Let $0<\alpha<1$ and $L_P\geq 0$ satisfy $\alpha \cdot L_P <1$ . Then
Proof. Note that
with $a_i = (i+1) \cdot \alpha^i \max\{1,L_P\}^i $ . Moreover,
as $i \rightarrow \infty$ . Hence, d’Alembert’s criterion (the ratio test) implies that $\sum_{i=0}^{\infty}a_i$ converges absolutely. Thus, by (4.10), we have $\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i (L_P)^j <\infty$ .
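A quick numerical check of Lemma 4.3 (parameters illustrative; the closed form below, valid for $L_P\neq 1$, is our own simplification obtained by summing the inner geometric series):

```python
# Partial sums of sum_i alpha^i sum_{j<=i} L_P^j stabilize when
# alpha * L_P < 1, even though L_P itself may exceed 1.
alpha, L_P = 0.5, 1.5            # alpha * L_P = 0.75 < 1
partial, inner, lp_pow = 0.0, 0.0, 1.0
for i in range(200):
    inner += lp_pow              # inner = sum_{j=0}^{i} (L_P)^j
    partial += alpha**i * inner
    lp_pow *= L_P

# Closed form for L_P != 1: (1/(1-alpha) - L_P/(1-alpha*L_P)) / (1-L_P).
closed = (1 / (1 - alpha) - L_P / (1 - alpha * L_P)) / (1 - L_P)
print(round(partial, 6), round(closed, 6))  # 8.0 8.0
```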
Lemma 4.4. Let Assumptions 2.1–2.3 hold. Then $\mathcal{P}(x,a)\,:\!=\,\mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ as defined in (2.1) satisfies [Reference Neufeld, Sester and Šikić23, Standing Assumption 2.2] and the reward function $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ together with the discount factor $0<\alpha<1$ satisfy [Reference Neufeld, Sester and Šikić23, Standing Assumption 2.4]. As a consequence, [Reference Neufeld, Sester and Šikić23, Theorem 2.7] then directly implies that the dynamic programming principle holds for the robust Markov decision problem defined in (2.2).
Proof. First, if $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ is bounded, then Assumptions 2.1–2.3 allow us to use [Reference Neufeld, Sester and Šikić23, Proposition 3.1], which immediately ensures that the result holds true with respect to $p=0$ and $C_P=1$ in the notation of [Reference Neufeld, Sester and Šikić23, Standing Assumptions 2.2 and 2.4].
Now, assume for the rest of this proof that $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ is unbounded. Then, by Assumption 2.1(ii) we have that $q\in[2,\infty)\cap\mathbb{N}$ . In this case, let $p=1$ in the notation of [Reference Neufeld, Sester and Šikić23, Standing Assumptions 2.2 and 2.4]. Then our Assumptions 2.1 and 2.3 immediately ensure that [Reference Neufeld, Sester and Šikić23, Standing Assumption 2.4] holds. Moreover, by our Assumption 2.2, we directly obtain from [Reference Neufeld and Sester22, Proposition 4.1] that [Reference Neufeld, Sester and Šikić23, Standing Assumption 2.2(i)] holds. Therefore, it remains to verify [Reference Neufeld, Sester and Šikić23, Standing Assumption 2.2(ii)]. To that end, let
Indeed, note that $C_P<\infty$ , as Assumption 2.2 ensures that the map
is continuous. This implies that the map
is upper semicontinuous, which in turn ensures that $C_P$ is finite as A is compact. Now, let $(x,a)\in\mathcal{X} \times A$ and $\mathbb{P}\in\mathcal{P}(x,a)=\mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ be arbitrarily chosen. Then by following the calculations in [Reference Neufeld and Sester22, Proof of Proposition 4.1, (6.34)] (with $p=1$ in the notation of [Reference Neufeld and Sester22]), using the Lipschitz continuity of $\widehat{\mathbb{P}}$ we obtain, for any arbitrary $x^{\prime}\in \mathcal{X}$ , that
Since $x^{\prime}\in \mathcal{X}$ was arbitrarily chosen, we see from (4.11) that
which shows that [Reference Neufeld, Sester and Šikić23, Standing Assumption 2.2(ii)] indeed holds.
4.2. Proof of Theorem 3.1
Proof. (i) First note that as, by assumption, $\mathbb{P}^{\mathrm{true}}(x,a)\in\mathcal{P}(x,a)$ for all $(x,a)\in\mathcal{X}\times A$ , we have $0 \leq V^{\mathrm{true}}(x_0)-V(x_0)$ for all $x_0\in\mathcal{X}$ . To compute the upper bound, we fix any $v\in C_b(\mathcal{X},\mathbb{R})$ which is $L_r$ -Lipschitz and we define the operator $\mathcal{T}^{\mathrm{true}}$ by (4.1). Then, by Lemma 4.4 and [Reference Neufeld, Sester and Šikić23, Theorem 2.7(ii)], we have
for all $x_0 \in \mathcal{X}$ and for $\mathcal{T}$ as defined in [Reference Neufeld, Sester and Šikić23, (8)]. Moreover, by [Reference Neufeld, Sester and Šikić23, Theorem 2.7(iii)], there exists a worst case transition kernel $\mathcal{X} \times A \ni (x,a)\mapsto \mathbb{P}^{\mathrm{wc}}(x,a) $ with $\mathbb{P}^{\mathrm{wc}}(x,a) \in \mathcal{P}(x,a)$ for all $(x,a) \in \mathcal{X} \times A$ such that, by denoting, for any ${\mathbf{a}} =(a_t)_{t\in \mathbb{N}_0}\in \mathcal{A}$ ,
we have
where $\mathcal{T}^{\mathrm{wc}}$ is as defined in (4.3). Therefore, by (4.12), (4.13), Lemma 4.2, and Lemma 4.3, we have, for all $x_0 \in \mathcal{X}$ ,
(ii) In the case $\mathbb{P}^{\mathrm{true}} = \widehat{\mathbb{P}}$ , due to Lemma 4.2(ii), we may use (4.5) and replace the inequality (4.14) by
Acknowledgements
We thank an anonymous referee of [Reference Neufeld and Sester21] who raised a question that led to this note.
Funding information
The first author gratefully acknowledges financial support by the MOE AcRF Tier 1 Grant RG74/21 and by the Nanyang Assistant Professorship Grant (NAP Grant) Machine Learning based Algorithms in Finance and Insurance. The second author gratefully acknowledges financial support by the NUS Start-Up Grant Tackling model uncertainty in Finance with machine learning.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.
Supplementary material
The supplementary material for this article (Python code for Example 3.1) can be found at https://github.com/juliansester/MDP_Bound.