
Skillful statistical prediction of subseasonal temperature by training on dynamical model data

Published online by Cambridge University Press:  23 February 2023

Laurie Trenary*
Affiliation:
Department of Atmospheric, Oceanic, and Earth Science and Center for Ocean-Land-Atmosphere Studies, George Mason University, Fairfax, Virginia, USA
Timothy DelSole
Affiliation:
Department of Atmospheric, Oceanic, and Earth Science and Center for Ocean-Land-Atmosphere Studies, George Mason University, Fairfax, Virginia, USA
*
*Corresponding author. E-mail: [email protected]

Abstract

This paper derives statistical models for predicting wintertime subseasonal temperature over the western US. The statistical models are trained on two separate datasets, namely observations and dynamical model simulations, and are based on least absolute shrinkage and selection operator (lasso). Surprisingly, statistical models trained on dynamical model simulations can predict observations better than observation-trained models. One reason for this is that simulations involve orders of magnitude more data than observational datasets.

Type
Application Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Impact Statement

In this study, we show that a statistical prediction model for observed western US wintertime temperature trained on long dynamical model simulations outperforms a statistical model trained on observations alone. This encouraging result suggests that statistical subseasonal prediction models can be further improved by training on both dynamical model simulations and observations.

1. Introduction

Medium-range weather forecasts (up to 10 days) and long-range climate forecasts (months to seasons) have been used operationally for decades. While the performance of forecast systems targeting these timescales has steadily improved, until recently relatively little effort had been dedicated to advancing prediction capabilities at the intermediate subseasonal (2-week-to-1-month) timescales (e.g., National Academy of Sciences, 2016). Nevertheless, there is evidence that forecasts are skillful on subseasonal timescales (Newman et al., 2003; Pegion and Sardeshmukh, 2011). In particular, state-of-the-art numerical forecast models demonstrate skill in subseasonal prediction of regional precipitation and temperature, extreme events (heat waves, cold waves, the likelihood of hurricane formation), and tornado and hail activity (DelSole et al., 2017; Vitart and Robertson, 2018).

Tremendous societal needs have driven improvements in subseasonal forecast capabilities. Warnings of weather hazards such as drought or cold temperature extremes 2-to-4 weeks in advance have the potential to save lives and mitigate changing demands on energy supplies, water resources, the agriculture sector, and fisheries (White et al., 2017). Given the far-reaching societal benefits, numerical modeling projects within the United States (SubX) and internationally (S2S) have been established with the goal of improving subseasonal forecast skill (Vitart et al., 2017; Pegion et al., 2019). In parallel with these numerical modeling initiatives, in 2016 the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) established the Subseasonal Climate Forecast Rodeo, a one-year forecast competition in which participants were tasked with developing statistical models for real-time prediction of western United States (US) temperature and precipitation. The inaugural winners, Hwang et al. (2019), developed a forecast system of nonlinear statistical models trained on a diverse set of observational predictors (e.g., soil moisture, geopotential heights). Their statistical forecast system was found to be more accurate than the operational US Climate Forecast System (CFSv2). The success of the Hwang et al. (2019) forecast system demonstrated unequivocally the utility of statistical methods for subseasonal prediction.

It is plausible that statistical models could be improved still further if they were trained on longer data sets. Unfortunately, observational data sets for subseasonal prediction are limited to 50 years or less. Moreover, the effective sample size is smaller than this because predictability mechanisms differ across seasons, suggesting that models need to be trained for each season separately, and because daily temperature is serially correlated. One approach to obtaining longer training sets is to use dynamical models to generate them. Of course, dynamical models are imperfect and may be less skillful than some statistical models. Nevertheless, dynamical models are based on the laws of physics and simulate many of the complex physical processes that impact subseasonal predictability, so it makes sense to use both observations and dynamical models to constrain the statistical fit. In this paper, we train statistical models on dynamical model simulations and then use the resulting statistical models to predict observational data. To mitigate the impact of model errors, particularly those in the subgrid parameterizations that often differ between dynamical models, we pursue a multi-model approach in which the outputs of several dynamical models are pooled together as training data. This leads to sample sizes orders of magnitude larger than observational data sets. Note that in this approach, observations are not used to estimate empirical coefficients, so any predictive skill arising from the resulting statistical models clearly comes from the dynamical models and demonstrates that the dynamical models simulate statistical relations relevant to subseasonal predictability. Presumably, the best statistical models are those that combine both dynamical model simulations and observational data. Such approaches are part of an active field of research in transfer learning (Zhuang et al., 2021). In this paper, we use dynamical models to estimate the regression coefficients of the statistical models and then use observations to select the tuning parameter of the statistical model. The resulting prediction model is then compared to those trained on observations.

The design of our forecast problem is similar to that of the Forecast Rodeo. Specifically, we predict the week 3–4 local temperature at a set of grid points over the western US. Each forecast model is estimated using the least absolute shrinkage and selection operator (lasso), a standard method in machine learning (Hastie et al., 2017). Our approach is similar to that of DelSole and Banerjee (2017) and Buchmann and DelSole (2021), except that here we predict many grid points, instead of just one local region (as in the former paper) or just one large-scale pattern (as in the latter paper). We find that statistical models trained on dynamical model data perform better than those trained on observations, suggesting that sample size is indeed a limiting factor in the statistical prediction of subseasonal temperature.

We clarify that our goal is not to derive a statistical forecast system that outperforms that of Hwang et al. (2019). Rather, our goal is to show the potential to improve the skill of statistical models by incorporating information from dynamical model simulations. Several studies suggest that the strongest source of subseasonal predictability comes from sea surface temperatures (SSTs) in the tropical Pacific (Vitart, 2013; McKinnon et al., 2016; DelSole et al., 2017). Accordingly, we use a set of predictors that capture variations in SSTs. It is quite likely that the performance could be improved further by extending our predictor set to include other variables such as precipitation, sea ice concentration, and indices of the Madden–Julian Oscillation, as in Hwang et al. (2019), but the dynamical models do not simulate these variables well. Moreover, SSTs are the dominant source of predictability (Vitart, 2013). For these reasons, only SSTs are included as predictors.

This paper is organized as follows. In Section 2, we describe the data sets and methods used to derive a set of statistical models for predicting observed wintertime subseasonal temperature over the western US. The statistical models are trained either on observations or on preindustrial control runs from the Coupled Model Intercomparison Project phase 6 (CMIP6) archive. In Section 3, we present the results. We begin with an overall performance evaluation of the different statistical models and select for comparison the best-performing candidate models among the observation-trained and CMIP6-trained groups of models. We then demonstrate the advantages of fitting the predictive models to a large dataset such as CMIP6 rather than to the relatively short observational record. We then compare the spatial and temporal skill of these statistical models in predicting observed subseasonal temperature. The paper concludes with a summary of our major results.

2. Data and Methods

2.1. Observations

The target data for our study are observed 2-week mean temperatures, obtained by averaging daily minimum and maximum 2-meter temperatures from the NOAA Climate Prediction Center Global Gridded Temperature dataset (https://psl.noaa.gov/data/gridded/data.cpc.globaltemp.html). The data are available from 1979 to the present, but we focus our analysis on the years 1982–2019 to avoid the large number of missing data at the start of the record. Daily SST data are obtained from the NOAA Optimum Interpolation SST dataset for the period 1982–2019 (Reynolds et al., 2007). All data are regridded to a 1 × 1 degree resolution.

2.2. CMIP6

We also use climate model simulations for training. The particular simulations we use are from the CMIP6 archive. To avoid the confounding effects of external forcings (i.e., greenhouse gases and anthropogenic aerosols) on predictability, we limit our selection to preindustrial control simulations. A collection of 13 models with a collective total of 6889 years of daily data is selected for analysis (see Table 1). These models were selected because they provide daily surface temperature data. Consistent with observations, the target data are 2-week mean temperatures estimated by averaging the minimum and maximum of daily 2-m temperature. Model SST data are also used. All data are regridded to a 1 × 1 degree resolution. A description of the CMIP6 experiments can be found in Eyring et al. (2016).

Table 1. List of CMIP6 models with preindustrial control runs analyzed in this study.

Note. Models were selected if daily data were available.

2.3. Data processing

The statistical forecast models are fit at the 499 grid points shown in Figure 1, using 2-week averages for both target and predictor variables; predictions target the winter months December through February. The forecast year for a given winter corresponds to December, so a forecast for winter 2000 targets 2-week averages from December 2000 through February 2001. Two-week averages of surface temperature are computed for every start day in December–February. For instance, the 2-week average for December 1st is the average over the period December 1–14, and the corresponding predictors are averaged over November 4–17. Climatology for daily-to-weekly data is often estimated using local regression and polynomial or harmonic regression (DelSole et al., 2017; Pegion et al., 2019). Pegion et al. (2019) demonstrated that these different methods produce nearly identical climatologies. In this work, we estimate the climatology as a fifth-order polynomial fit across all 2-week means between December and February (e.g., DelSole et al., 2017). This order of polynomial is selected to ensure that the climatological signals are accurately estimated at the different geographical locations. We tested whether statistical model performance is sensitive to the polynomial order and found no impact on skill (not shown). Anomalies are estimated with respect to climatology. Observed anomalies are detrended to remove the impacts of anthropogenic forcings (e.g., Johnson et al., 2014).
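As a rough illustration, the following Python sketch shows one way the polynomial climatology, anomalies, and linear detrending could be computed at a single grid point. The function and variable names are ours, and the exact construction used in the paper may differ in detail.

```python
import numpy as np

def polynomial_climatology(temp_2wk, start_day, order=5):
    """Fit a smooth climatology across 2-week means at one grid point.

    temp_2wk  : array (n_years, n_starts) of 2-week mean temperature,
                one row per winter, one column per DJF start day.
    start_day : array (n_starts,) giving the start day within the winter
                (0 for December 1, 1 for December 2, ...).
    """
    # Climatology: mean over years for each start day, smoothed with a
    # fifth-order polynomial in the start day.
    day_mean = temp_2wk.mean(axis=0)
    coeffs = np.polyfit(start_day, day_mean, deg=order)
    climatology = np.polyval(coeffs, start_day)
    # Anomalies with respect to the smoothed climatology.
    anomalies = temp_2wk - climatology[np.newaxis, :]
    return climatology, anomalies

def detrend(anomalies):
    """Remove a linear trend across winters at each start day (observations only)."""
    years = np.arange(anomalies.shape[0])
    coeffs = np.polyfit(years, anomalies, deg=1)          # shape (2, n_starts)
    trend = np.outer(years, coeffs[0]) + coeffs[1]
    return anomalies - trend
```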

Figure 1. Map of the forecast target region. Each red dot denotes a forecast location on a 1 × 1 degree grid.

Climatically relevant variations of SST are large-scale and characterized by distinct patterns (e.g., Deser et al., 2010). We select predictors from the Pacific and Atlantic Oceans, where fluctuations in SSTs are known to impact global climate through teleconnections (Horel and Wallace, 1981). The respective domains of the predictors are shown in Figure 2 as the black and blue regions. No single pattern of SST variability drives all predictable variations in climate. For this reason, we use multiple patterns to represent large-scale variations of SSTs. Typically, these large-scale patterns of variability are estimated using empirical orthogonal function (EOF) analysis. A drawback of EOF analysis is that the recovered patterns are data-dependent: the leading patterns of variability that represent one data set may not be the same patterns recovered from a different data set. When comparisons between data sets are desired, it is therefore preferable to use a common basis set to describe all data. For this reason, we isolate the large-scale variations by projecting the daily SST data onto the eigenvectors of the Laplacian operator (DelSole and Tippett, 2015). The Laplacian eigenvectors form a basis that depends only on the domain geometry and are orthogonal in space, with the first pattern associated with the spatial mean, the second a dipole, the third a tripole, and so forth. Projecting the data onto each Laplacian eigenvector yields a time series for each eigenvector. Examples of the leading Laplacians are shown in Figure 3. In this study, we represent large-scale SST variations in the Pacific and Atlantic using 50 Laplacians for each basin. The inclusion of additional Laplacian time series did not impact the performance of the lasso models.
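To make the projection step concrete, here is a minimal Python sketch that builds a graph Laplacian over the ocean points of a regular grid and projects SST anomalies onto its leading eigenvectors. It is only illustrative: the published method (DelSole and Tippett, 2015) uses an area-weighted Laplacian on the sphere, and all names below are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_timeseries(sst_anom, ocean_mask, n_modes=50):
    """Project SST anomalies onto the leading eigenvectors of a grid Laplacian.

    sst_anom   : array (n_time, n_lat, n_lon) of SST anomalies.
    ocean_mask : boolean array (n_lat, n_lon), True over the basin of interest.
    Returns an (n_time, n_modes) matrix of Laplacian time series.
    """
    nlat, nlon = ocean_mask.shape
    n = int(ocean_mask.sum())
    idx = -np.ones((nlat, nlon), dtype=int)
    idx[ocean_mask] = np.arange(n)

    # Adjacency between each ocean point and its south/east ocean neighbors.
    ii, jj = np.where(ocean_mask)
    rows, cols = [], []
    for di, dj in [(1, 0), (0, 1)]:
        ni, nj = ii + di, jj + dj
        inside = (ni < nlat) & (nj < nlon)
        oi, oj, ni, nj = ii[inside], jj[inside], ni[inside], nj[inside]
        nb = ocean_mask[ni, nj]
        rows.append(idx[oi[nb], oj[nb]])
        cols.append(idx[ni[nb], nj[nb]])
    r, c = np.concatenate(rows), np.concatenate(cols)
    adj = sp.coo_matrix((np.ones(r.size), (r, c)), shape=(n, n))
    adj = (adj + adj.T).tocsr()

    # Graph Laplacian L = D - A; its smallest-eigenvalue eigenvectors are the
    # largest-scale patterns (the first is approximately the basin mean).
    lap = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj
    _, eigvecs = eigsh(lap, k=n_modes, which='SM')   # may be slow for large basins

    # One time series per eigenvector: project each anomaly map onto the basis.
    data = sst_anom.reshape(sst_anom.shape[0], -1)[:, ocean_mask.ravel()]
    return data @ eigvecs
```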

Figure 2. Map depicting predictor locations. Black and blue denote the domains of the Pacific and Atlantic basins, respectively. Large-scale variations in both basins are represented by the leading 50 Laplacian eigenvectors.

Figure 3. The second (a) and third (b) eigenvectors of the Laplacian operator for the Pacific basin.

2.4. Building the statistical forecast systems

Our objective is to determine whether training on dynamical models can produce skillful prediction models. There is no unique configuration for statistical models, so we consider a range of reasonable choices. Specifically, we construct five distinct statistical forecast systems to predict observed western US subseasonal winter temperatures. Each forecast system comprises a set of statistical models, fit at each grid point in the target region (see Figure 1), yielding a total of 499 statistical models per forecast system. The candidate forecast systems differ in how the grid-point statistical models are fit and in the data sets used to train them. A description of each statistical forecast system is provided below and summarized in Table 2.

Table 2. Summary of statistical forecast models.

2.4.1. Benchmark

ENSO is the greatest contributor to seasonal predictability over the US, and hence it is anticipated to be a strong contributor to subseasonal predictability (National Academy of Sciences, 2016). Accordingly, we construct a benchmark forecast in which 2-week average winter surface temperature anomalies are predicted at each grid point from the 2-week lagged Nino3.4 index, a commonly used measure of ENSO variability. The Nino3.4 index is the spatial average of SST anomalies between 5°S and 5°N over longitudes 120°W–170°W.

Let $ n $ and $ s $ denote the temporal and spatial indices, where $ n=1,\dots, N $ and $ s=1,\dots, S $ . Let $ {Y}_{\mathrm{ns}} $ denote the target variable and $ {X}_{\mathrm{np}} $ denote the predictors for $ p=1,\dots, P $ . Let the Nino3.4 index be denoted by $ {z}_n $ . The prediction based on Nino3.4 index is a linear regression model, where the regression coefficients $ {\beta}_s=\left({\beta}_{0,s},{\beta}_{1,s}\right) $ are obtained by minimizing the cost function

(1) $$ {\hat{\beta}}_s^{\mathrm{Nino}3.4}=\arg \underset{\beta_s}{\min}\left\{\sum \limits_{n=1}^N{\left({Y}_{\mathrm{ns}}-{\beta}_{0,s}-{\beta}_{1,s}{z}_n\right)}^2\right\}. $$

Note that the regression coefficients are chosen to minimize the cost function at each spatial location separately. These regression models are fit using a leave-one-out approach, such that the regression coefficients $ {\beta}_{0,s} $ and $ {\beta}_{1,s} $ for a given winter are estimated from all other winters in the observational dataset.
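The benchmark can be written compactly. The sketch below, with illustrative names, fits the Nino3.4 regression of equation (1) at every grid point using the leave-one-winter-out scheme described above.

```python
import numpy as np

def nino34_benchmark(y, nino34, winter_id):
    """Leave-one-winter-out benchmark regression on the lagged Nino3.4 index.

    y         : array (n_samples, n_points) of 2-week temperature anomalies.
    nino34    : array (n_samples,) Nino3.4 index lagged 2 weeks behind the target.
    winter_id : array (n_samples,) labeling the winter of each sample.
    Returns cross-validated predictions with the same shape as y.
    """
    pred = np.empty_like(y, dtype=float)
    for w in np.unique(winter_id):
        train, test = winter_id != w, winter_id == w
        # Ordinary least squares at all grid points at once (equation 1):
        # beta has shape (2, n_points), one intercept and slope per location.
        X = np.column_stack([np.ones(train.sum()), nino34[train]])
        beta, *_ = np.linalg.lstsq(X, y[train], rcond=None)
        X_test = np.column_stack([np.ones(test.sum()), nino34[test]])
        pred[test] = X_test @ beta
    return pred
```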

2.4.2. Lasso

The performance of the benchmark models will be compared to predictions made by a suite of statistical models based on lasso. The different lasso models share the same target and predictor variables. That is, the target variables are 2-week mean surface temperature anomalies and the predictors are large-scale SST anomalies in the Pacific and Atlantic Oceans, represented by 50 Laplacian time series for each basin, giving a total of 100 SST predictors. The predictors lag the target variables by 2 weeks. We estimate the lasso coefficients by minimizing the cost function either locally or across all grid points. This distinction and the difference between lasso and ordinary least squares (OLS) are discussed in more detail below.

Lasso is similar to OLS, except that the mean square error (MSE) is minimized subject to a constraint on the norm of the regression coefficients (Tibshirani, 1996). More precisely, the lasso coefficients $ {\beta}_{p,s} $ are obtained by minimizing the cost function

(2) $$ {\hat{\beta}}_s^{\mathrm{single}\hbox{-} \mathrm{task}}=\arg \underset{\beta_s}{\min}\left\{\frac{1}{2N}\sum \limits_{n=1}^N{\left({Y}_{\mathrm{ns}}-{\beta}_{0,s}-\sum \limits_{p=1}^P{X}_{n,p}{\beta}_{p,s}\right)}^2+\lambda \sum \limits_{p=1}^P\left|{\beta}_{p,s}\right|\right\}, $$

where $ {X}_{n,p} $ denotes elements in the matrix of predictors, $ \mid \cdot \mid $ denotes the absolute value, and $ \lambda $ is a tuning parameter that determines the strength of the penalty term. As $ \lambda $ is increased, the lasso coefficients are reduced toward 0. Conversely, as $ \lambda $ goes to 0, the penalty term has less weight and the cost function approaches the traditional OLS form.

There is no closed-form solution to equation (2), and the minimization problem must be solved iteratively. We use the glmnet package to find lasso solutions as $ \lambda $ is varied (Friedman et al., 2010). Examples of how $ \lambda $ is selected for the different lasso models are provided in Section 2.4.3.
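For readers working in Python rather than R, a roughly equivalent computation uses scikit-learn's coordinate-descent lasso solver, whose objective matches equation (2) (the tuning parameter is called alpha there). The arrays below are synthetic stand-ins for the real predictors and target, so this is only an illustrative sketch.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
# Synthetic stand-ins: X would be the (n_samples, 100) matrix of lagged
# Laplacian SST time series and y the 2-week temperature anomalies at one
# grid point.
X = rng.standard_normal((500, 100))
y = X[:, 0] - 0.5 * X[:, 3] + rng.standard_normal(500)

# lasso_path solves the single-task cost function of equation (2) by
# coordinate descent over a decreasing sequence of lambda values ("alphas").
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
print(coefs.shape)  # (100 predictors, 100 lambda values); coefficients shrink
                    # toward zero as lambda increases
```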

The above formulation predicts each target variable separately. We call this formulation “single-task” lasso and we derive two lasso models from the above minimization problem: one trained on observations and one trained on CMIP6 data. These lasso models are called OBS-single-task and CMIP6-single-task, respectively.

Two-week mean temperature is spatially correlated, so making use of information from neighboring grid points during the training stage may yield a better prediction model. One approach to doing this is "multi-task lasso," which was used by Hwang et al. (2019). The cost function for multi-task lasso is

(3) $$ {\hat{\beta}}^{\mathrm{multi}\hbox{-} \mathrm{task}}=\arg \underset{\beta }{\min}\left\{\frac{1}{2N}\sum \limits_{s=1}^S\sum \limits_{n=1}^N{\left({Y}_{\mathrm{ns}}-{\beta}_{0,s}-\sum \limits_{p=1}^P{X}_{n,p}{\beta}_{p,s}\right)}^2+\lambda \sum \limits_{p=1}^P\sqrt{\sum \limits_{s=1}^S{\beta}_{p,s}^2}\right\}. $$

In multi-task lasso, squared errors are summed over all targets, and the penalty term applies jointly across targets, so a given predictor is either included in the statistical models for all targets or excluded for all targets.
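scikit-learn's MultiTaskLasso implements a group penalty of the same form as equation (3), so a short illustrative sketch of the multi-task fit (again with synthetic stand-in data and names of our choosing) looks like this:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
# Synthetic stand-ins: X is (n_samples, 100) SST predictors, Y is
# (n_samples, 499) temperature anomalies at all target grid points.
X = rng.standard_normal((500, 100))
Y = 0.1 * X @ rng.standard_normal((100, 499)) + rng.standard_normal((500, 499))

# MultiTaskLasso minimizes squared error summed over all targets plus
# lambda * sum_p sqrt(sum_s beta_{p,s}^2), the group penalty of equation (3),
# so each predictor is kept or dropped for all grid points jointly.
model = MultiTaskLasso(alpha=0.1).fit(X, Y)
kept = np.flatnonzero(np.any(model.coef_ != 0.0, axis=0))   # coef_ is (499, 100)
print(f"{kept.size} of {X.shape[1]} predictors selected for every grid point")
```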

We selected lasso regression to build our statistical forecast models because the method tends to shrink some of the regression coefficients exactly to 0, thus performing predictor selection and aiding the interpretability of the statistical models.

2.4.3. Selecting the lasso tuning parameter

For lasso models trained on observations, the first 18 years (1982–1999) of observational data are used to fit the regression model and to select $ \lambda $ using 10-fold cross-validation. When CMIP6 data are used for training, $ \lambda $ is selected to minimize the normalized mean-square error (NMSE) with respect to the same 18 years of observations (1982–1999). A summary of how the statistical models are trained and the tuning parameter $ \lambda $ is selected is presented in Table 2.
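The following sketch illustrates the CMIP6-trained selection procedure for one grid point: the lasso path (coefficients as a function of $ \lambda $) is computed from CMIP6 data only, and the 1982–1999 observations are used solely to pick the $ \lambda $ that minimizes NMSE. The names and the use of scikit-learn are our own choices, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def select_lambda(X_cmip, y_cmip, X_obs, y_obs, n_alphas=100):
    """Fit lasso coefficients on CMIP6 data, then pick lambda by observed NMSE.

    The coefficients beta_s(lambda) come entirely from the CMIP6 training data;
    the 1982-1999 observations (X_obs, y_obs) are used only to choose lambda.
    """
    # Center the CMIP6 training data (lasso_path does not fit an intercept).
    x_mean, y_mean = X_cmip.mean(axis=0), y_cmip.mean()
    alphas, coefs, _ = lasso_path(X_cmip - x_mean, y_cmip - y_mean, n_alphas=n_alphas)
    # Predictions for every candidate lambda, evaluated on observations.
    pred = (X_obs - x_mean) @ coefs + y_mean          # (n_obs_samples, n_alphas)
    nmse = ((y_obs[:, None] - pred) ** 2).sum(axis=0) / (y_obs ** 2).sum()
    best = int(np.argmin(nmse))
    return alphas[best], coefs[:, best], nmse[best]
```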

To illustrate the $ \lambda $ selection process, Figure 4a–c shows curves of NMSE versus $ \lambda $ at three different locations for predictions made by the CMIP6-single-task model. The regression coefficients are estimated from CMIP6 simulations, yielding $ {\beta}_s\left(\lambda \right) $; then, based on these coefficients, $ {\mathrm{NMSE}}_s\left(\lambda \right) $ is evaluated at each location $ s $ using observations for the predictors and target variable. The $ \lambda $ that minimizes NMSE is denoted by a red asterisk. Two extreme cases are shown in Figure 4a,c, where the $ \lambda $ that minimizes NMSE is small (near zero) and large, respectively. For the case where $ \lambda \approx 0 $ (Figure 4a), the cost function approaches the traditional OLS form and all the predictors are included. Alternatively, when $ \lambda $ is large (Figure 4c), the regression coefficients are set to zero. The NMSE-$ \lambda $ curve shown in Figure 4b represents an intermediate scenario, where only some regression coefficients are set to 0.

Figure 4. NMSE of a lasso model versus $ \lambda $ for forecast locations (a) Yakima, Washington, (b) Austin, Texas, and (c) Colorado Springs, Colorado. The NMSE curves are estimated from lasso predictions made with the CMIP6-single-task model and evaluated with respect to observations for winters (DJF) during 1982–1999. A red asterisk denotes the $ \lambda $ that minimizes the NMSE.

2.5. Skill metrics

Statistical model performance will be evaluated in terms of the similarity between forecast and verification data and in terms of prediction accuracy. The most widely used metrics for these aspects of performance are the temporal correlation, spatial correlation, and MSE (Jolliffe and Stephenson, 2012; Coelho et al., 2019).

For temporal similarity measures, we will use the standard Pearson correlation coefficient. We refer to this metric as the temporal correlation coefficient since it is estimated across time at each grid point as:

(4) $$ \rho (s)=\frac{\sum_{t=1}^T{F}^{\prime}\left(s,t\right)\,{V}^{\prime}\left(s,t\right)}{\sqrt{\left({\sum}_{t=1}^T{F}^{\prime }{\left(s,t\right)}^2\right)\cdot \left({\sum}_{t=1}^T{V}^{\prime }{\left(s,t\right)}^2\right)}}, $$

where $ {F}^{\prime}\left(s,t\right) $ and $ {V}^{\prime}\left(s,t\right) $ are the $ T $ matched pairs of centered forecast and verification data at location $ s $, and there are $ S $ grid-point forecasts in total.

Spatial similarity between the predicted and observed spatial patterns will be measured using the anomaly correlation, or cosine similarity (Jolliffe and Stephenson, 2012). To avoid confusion, we refer to this metric as the spatial correlation. Note that this is the only metric used to evaluate the machine learning forecast models of Hwang et al. (2019). The expression used to compute the spatial correlation is similar in form to $ \rho (s) $, except that the spatial mean has not been removed and the summation is over space rather than time. Formally, the spatial correlation is

(5) $$ \gamma (t)=\frac{\sum_{s=1}^S{F}^{\prime}\left(s,t\right)\,{V}^{\prime}\left(s,t\right)}{\sqrt{\left({\sum}_{s=1}^S{F}^{\prime }{\left(s,t\right)}^2\right)\cdot \left({\sum}_{s=1}^S{V}^{\prime }{\left(s,t\right)}^2\right)}}. $$

The spatial correlation measures the similarity of spatial patterns of 2-week temperature anomalies. Importantly, the spatial correlation can be computed for each 2-week period; that is, it is a time series. Forecast accuracy is often measured by MSE and is formally expressed as:

(6) $$ \mathrm{MSE}(s)=\frac{1}{T}\sum_{t=1}^T{\left({F}^{\prime}\left(s,t\right)-{V}^{\prime}\left(s,t\right)\right)}^2. $$

A standard approach is to compare MSE to some reference forecast, typically the climatological mean corresponding to $ {\overline{F}}^{\prime }(s)=0 $ , which yields the $ \mathrm{NMSE}(s) $ :

(7) $$ \mathrm{NMSE}(s)=\frac{\sum_{t=1}^T{\left({F}^{\prime}\left(s,t\right)-{V}^{\prime}\left(s,t\right)\right)}^2}{\sum_{t=1}^T{\left({V}^{\prime}\left(s,t\right)\right)}^2}. $$

A forecast with NMSE >1 has no skill, since its MSE is greater than that of the reference forecast. The NMSE can be decomposed into its constituent parts when expressed as the mean-square-error skill score (MSESS). Following Murphy (1988), MSESS can be expanded as follows:

(8) $$ \mathrm{MSESS}(s)=1-\mathrm{NMSE}(s)=\rho {(s)}^2-{\left(\rho (s)-\frac{\sigma_{F^{\prime }}(s)}{\sigma_{V^{\prime }}(s)}\right)}^2-{\left(\frac{{\overline{F}}^{\prime }(s)-{\overline{V}}^{\prime }(s)}{\sigma_{V^{\prime }}(s)}\right)}^2, $$

where $ \rho (s) $ is the temporal correlation (see equation (4)), and $ {\sigma}_{F^{\prime }}(s) $ and $ {\sigma}_{V^{\prime }}(s) $ are the standard deviations of the forecast and verification anomalies, respectively. The first term in the MSESS is the square of the temporal correlation ( $ \rho {(s)}^2 $ ) and gives the maximum value of MSESS. The next two terms reduce the skill score and represent the amplitude and mean biases, respectively. For this study, the forecast and verification data are centered, and consequently the mean bias term is 0. We decompose the MSESS to evaluate the trade-off between the correlation and amplitude-bias contributions to statistical model performance.

We use spatial averages of NMSE and temporal correlation to identify the best-performing statistical models. Following Wilks (2011), the spatial average of NMSE is estimated as

(9) $$ \left[\mathrm{NMSE}\right]=\frac{\sum_{s=1}^S{\sum}_{t=1}^T{\left({F}^{\prime}\left(s,t\right)-{V}^{\prime}\left(s,t\right)\right)}^2}{\sum_{s=1}^S{\sum}_{t=1}^T{V}^{\prime }{\left(s,t\right)}^2}. $$

The spatial average of temporal correlation is estimated as

(10) $$ \left[\rho \right]=\frac{\sum_{s=1}^S\rho (s)}{S}. $$
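For concreteness, equations (4)–(10) can be evaluated with a few array operations. The sketch below assumes centered forecast and verification arrays of shape (time, space) and uses our own function name.

```python
import numpy as np

def skill_metrics(F, V):
    """Skill metrics for centered forecast F and verification V, shape (T, S).

    Returns the temporal correlation rho(s), spatial correlation gamma(t),
    local NMSE(s), the spatial averages [NMSE] and [rho], and MSESS(s).
    """
    # Temporal correlation at each grid point (equation 4).
    rho = (F * V).sum(axis=0) / np.sqrt((F ** 2).sum(axis=0) * (V ** 2).sum(axis=0))
    # Spatial (anomaly) correlation for each 2-week forecast (equation 5).
    gamma = (F * V).sum(axis=1) / np.sqrt((F ** 2).sum(axis=1) * (V ** 2).sum(axis=1))
    # Local NMSE (equation 7) and its spatial average (equation 9).
    nmse = ((F - V) ** 2).sum(axis=0) / (V ** 2).sum(axis=0)
    nmse_avg = ((F - V) ** 2).sum() / (V ** 2).sum()
    # MSESS decomposition for centered data (equation 8): squared correlation
    # minus the squared amplitude-bias term (the mean-bias term is zero here).
    amp_bias = rho - F.std(axis=0) / V.std(axis=0)
    msess = rho ** 2 - amp_bias ** 2
    return rho, gamma, nmse, nmse_avg, rho.mean(), msess
```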

2.6. Statistical significance of metrics

The data being analyzed are serially correlated, so standard methods of estimating uncertainty and statistical significance for performance metrics are not applicable. Accordingly, resampling methods are used to evaluate the uncertainty and statistical significance of the performance metrics.

2.6.1. Uncertainty of spatial averaged metrics

For forecasts trained on CMIP6 data, the regression coefficients are not reestimated when comparing forecast performance across the candidate statistical models, since they are very robust owing to the large training set. However, the regression coefficients depend on $ \lambda $. To quantify the uncertainty associated with $ \lambda $, we randomly select 18 distinct years from the observational record, use the corresponding matched pairs of target and predictors to determine $ \lambda $ (see Section 2.4.3), and then use the remaining 19-year record of observational data to evaluate the performance of the CMIP6-single-task and CMIP6-multi-task models. To preserve the serial correlation in the data, the paired target and predictors are sampled such that all sequential data for a given winter are sampled together. This process is repeated 1,000 times to build up distributions of the spatially averaged NMSE and temporal correlation. The uncertainty is given as the 5th and 95th percentiles of these distributions.

For forecasts trained on observations, both $ \lambda $ and the regression coefficients have uncertainties that need to be quantified. To do this, a bootstrap sample of 18 distinct years from the observational record is used to train the statistical models and to select $ \lambda $ through 10-fold cross-validation. The remaining 19 years of paired target variables and predictors are then used to evaluate the newly trained models. As before, the serial correlation in the data is preserved by selecting all sequential data for a given winter. This process is repeated 1,000 times for the OBS-single-task model to build distributions of the performance metrics. We use only 100 resamples for the OBS-multi-task model because of the large computational cost of retraining and evaluation. The uncertainty is given as the 5th and 95th percentiles of these distributions.

We estimate uncertainties for the benchmark Nino3.4 statistical model by randomly selecting 37 years of paired forecasts and verification data and estimating the performance metrics for the bootstrapped samples. Again, we preserve serial correlation by selecting sequential data for each winter. This process is repeated 1,000 times to build up distributions of the two metrics. The uncertainty for each metric is given as the 5th and 95th percentiles of the respective distributions.

The above uncertainty analyses aim to quantify the sampling effects of the observational data on estimates of statistical model performance. Given that the CMIP6 data are sufficiently long, we can also quantify the impacts of training set size on statistical model performance. To do so, we retrain the CMIP6-single-task model using a training data set that varies in length from 50 to 3000 years and evaluate model performance with respect to spatially averaged NMSE with the verification period fixed to 2000–2018. For instance, if the length of the training set is specified to be 3000 years, a training set of that length is found by sampling CMIP6 data with replacement, where each year is a set of consecutive 2-week forecasts for the target months of December–February. The data are sampled with replacement because there is not enough data to sample without replacement. To illustrate, if we wish to find 3000-year samples in the dataset, sampling without replacement would yield only two samples, which is insufficient for empirical uncertainty estimation. The process of selecting a training set, model retraining, and evaluation is repeated 60 times for each specified training set size. The uncertainty is given as the 5th and 95th percentiles of the respective distributions.
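A minimal sketch of the winter-block resampling used throughout this section: whole winters are drawn so that the serial correlation within each winter is preserved, and the 5th and 95th percentiles of the resulting metric distribution give the uncertainty bars. Names are illustrative.

```python
import numpy as np

def bootstrap_winters(winter_id, n_train=18, n_boot=1000, seed=0):
    """Yield bootstrap train/verification splits that keep whole winters together.

    winter_id : array (n_samples,) labeling the winter of each 2-week sample.
    Each draw selects n_train distinct winters (with all of their sequential
    2-week samples) for tuning or training; the remaining winters form the
    verification set.
    """
    rng = np.random.default_rng(seed)
    winters = np.unique(winter_id)
    for _ in range(n_boot):
        train_winters = rng.choice(winters, size=n_train, replace=False)
        train = np.isin(winter_id, train_winters)
        yield train, ~train

# After computing a metric (e.g., spatially averaged NMSE) for each of the
# n_boot splits, the uncertainty range is given by its percentiles:
# lo, hi = np.percentile(metric_samples, [5, 95])
```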

2.6.2. Statistical significance of similarity metrics

To quantify uncertainty in the temporal correlation we use a permutation method (DelSole et al., 2017). Under the null hypothesis of no predictability, the forecasts and observations are independent. Thus, a permutation sample can be derived by separately permuting the year labels for forecast and observed data. The permutation sample preserves the mean and variance of the forecasts and observations, but temporally misaligns the forecast-observation pairing. For this particular problem, the data are permuted for winter forecasts targeting the months of December–February. We permute the winter forecasts and verification data 5,000 times to create 5,000 realizations of correlation maps under the null hypothesis of no skill (or more precisely, the null hypothesis of exchangeability). The temporal correlation between the local forecast and verification is computed for each grid point separately, and statistical significance is assessed locally by comparing the correlation value to the local 95th percentile from the permutation sample. In addition, field significance is assessed based on the count of positive correlations. Negative correlations are not considered skillful and therefore are not included in the count. This process is repeated 10,000 times. The resulting cumulative distribution function is then used to determine the p-value of the observed number of positive correlations.
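The permutation test could be coded along the following lines. This sketch permutes the winter labels of the forecasts only, which is equivalent under the null hypothesis of exchangeability, and all names are ours.

```python
import numpy as np

def field_significance(F, V, winter_id, n_perm=5000, seed=0):
    """Permutation test of temporal correlations under the no-skill null.

    F, V      : centered forecast and verification arrays, shape (T, S).
    winter_id : array (T,) labeling the winter of each 2-week sample; whole
                winters are shuffled so within-winter serial correlation is kept.
    Returns the observed correlations, the local 95th-percentile thresholds,
    and a field-significance p-value for the count of positive correlations.
    """
    rng = np.random.default_rng(seed)
    winters = np.unique(winter_id)

    def corr(f, v):
        return (f * v).sum(axis=0) / np.sqrt((f ** 2).sum(axis=0) * (v ** 2).sum(axis=0))

    rho_obs = corr(F, V)
    null_rho = np.empty((n_perm, F.shape[1]))
    null_pos = np.empty(n_perm)
    for i in range(n_perm):
        # Reassign whole winters of forecasts to different winters of observations.
        order = np.concatenate(
            [np.flatnonzero(winter_id == w) for w in rng.permutation(winters)]
        )
        null_rho[i] = corr(F[order], V)
        null_pos[i] = (null_rho[i] > 0).sum()
    local_thresh = np.percentile(null_rho, 95, axis=0)       # local significance
    p_field = (null_pos >= (rho_obs > 0).sum()).mean()       # field significance
    return rho_obs, local_thresh, p_field
```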

To assess the significance of the spatial correlation scores, we count the winters in which the number of positive scores exceeds the number of negative scores. This count is the total number of skillful forecast winters. Under the null hypothesis of no skill, a forecast has equal chances of producing positive or negative scores, and the count of skillful winters should follow a binomial distribution with $ p=.5 $. Although forecasts within a winter are serially correlated, forecasts between winters are assumed independent, and therefore the binomial distribution can be applied to the count of skillful winters.
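The corresponding binomial test is a one-liner with SciPy; the function below is an illustrative wrapper, not the authors' code.

```python
from scipy.stats import binomtest

def skillful_winter_pvalue(n_skillful, n_winters):
    """One-sided binomial test: probability of observing at least n_skillful
    winters with more positive than negative spatial-correlation scores under
    the p = 0.5 no-skill null, treating winters as independent."""
    return binomtest(k=n_skillful, n=n_winters, p=0.5, alternative="greater").pvalue
```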

3. Results

3.1. Statistical model selection

First, we evaluate the overall performance of the statistical models in terms of the spatial averages of NMSE and temporal correlation to isolate the best-performing statistical models. The confidence intervals of the two metrics are represented by the vertical bars in Figure 5a,b, where the bars capture the fluctuations of each metric when different segments of the observations are used in statistical model evaluation. Confidence intervals are computed as described in Section 2.6.1.

Figure 5. Performance of statistical prediction models based on spatial averages of the (a) NMSE and (b) temporal correlation. The horizontal black line denotes an NMSE of 1 in (a) and zero correlation in (b). Five statistical models are compared: the benchmark Nino3.4 regression model, two observation-trained lasso models, and two CMIP6-trained lasso models. The vertical black bars denote the uncertainty of the performance metrics when observational data for the period 1982–2018 are bootstrapped. The method for evaluating this uncertainty varies across statistical models; details are given in Section 2.6.

Referring to Figure 5a, we see that the confidence intervals for the spatially averaged NMSE are near or above 1 for all candidate models. This illustrates the challenge of subseasonal forecasting: spatially averaged measures of local skill tend to be indistinguishable from values expected from no-skill models. Overall, the statistical models trained on CMIP6 data have lower NMSE than the statistical models trained on observations, including the benchmark Nino3.4 model. In particular, the spatially averaged temporal correlation shown in Figure 5b is low and generally negative for the statistical models trained on observational data. Only the CMIP6-single-task model consistently produces forecasts characterized by a positive spatially averaged temporal correlation. This result shows that training statistical models on dynamical model simulations can yield better forecasts than training on observations only. The OBS-multi-task model shows no improvement over the OBS-single-task model. To better understand why statistical model performance improves when trained on CMIP6 data, we compare the skill of the CMIP6-single-task model (the best-performing model) with the similarly formulated OBS-single-task model. The statistical models are compared in terms of MSESS and its decomposition (see equation (8)) in Section 3.3.

3.2. Training set size and statistical model performance

A plausible explanation for why statistical models trained on CMIP6 simulations predict observations better than observation-trained models is that the training data set is much larger in the former case than in the latter. To test this hypothesis, we retrain the CMIP6-single-task model using a training data set that varies in length from 50 to 3000 years and evaluate model performance with respect to the spatially averaged NMSE for the fixed verification period 2000–2018. The range of uncertainty is reported as the 5th–95th percentiles of the 60 estimates of spatially averaged NMSE for each sample size and is shown as the vertical bars in Figure 6. The threshold for a skillful forecast is denoted by the horizontal black line, which marks an NMSE of 1. When the training set size is 50 years, which is more than double the number of years available for training on observations, the CMIP6-single-task model has no skill. As the sample size is increased up to 3000 years, the spatially averaged NMSE systematically drops, yielding reliably skillful forecasts (NMSE < 1) when the training set size is 2000 years or greater. DelSole and Banerjee (2017) arrive at a similar conclusion concerning the impact of training set size on statistical model performance (see their Figure 12). Note that the NMSEs of the CMIP6-single-task model differ between Figures 5a and 6. Although this difference is not critical to our study, the reason for it is that the observations are bootstrapped in Figure 5a whereas they are not in Figure 6. Figure 5a shows the variation in skill when different segments of the observational record are used for validation and verification with a fixed CMIP6 training set, while Figure 6 shows the variation in skill when different segments of the CMIP6 training set are used with a fixed observational verification set. In essence, Figure 5a quantifies uncertainty arising from the verification data under a fixed training set, whereas Figure 6 quantifies uncertainty arising from the training data under a fixed verification set.

Figure 6. Spatially averaged NMSE versus training set size for the CMIP6-single-task model. The vertical bars represent the uncertainty in spatially averaged NMSE for the CMIP6-single-task model with respect to the data included in the training set. This uncertainty is estimated by bootstrapping CMIP6 data to create a training dataset of a specified length, retraining the lasso model, and then verifying it. This process is repeated 60 times and the bars give the 5th–95th percentiles of these 60 estimates. The number of years in the training dataset takes the values 50, 100, 300, 500, 750, 1000, 2000, and 3000. The verification data are observed winter (DJF) temperatures for the period 2000–2018. In contrast to the analysis shown in Figure 5, the observational data used in validation and verification are not bootstrapped. The vertical axis is scaled to highlight the uncertainties relative to the no-skill line at an NMSE of 1.

3.3. Statistical model comparison

The map of MSESS estimates from the CMIP6-single-task model, shown in Figure 7a, is characterized by regions of positive values along the west coast, from northern California up through Washington and extending eastward into Idaho. Forecasts in the continental interior are characterized by negative MSESS. In contrast, the MSESS values for the OBS-single-task model, shown in Figure 7d, are negative over much of the forecast region.

Figure 7. Average performance of CMIP6-single-task and OBS-single-task model evaluated in terms of MSESS (left column), temporal correlation (middle column), and amplitude bias (right column). Each metric is evaluated with respect to observed and forecast winter (DJF) temperature anomalies for the period 2000–2018, where the year corresponds to a December start date. Maps of MSESS for (a) CMIP6-single-task and (d) OBS-single-task models. A forecast is skillful if MSESS > 0. Maps of the temporal correlation for (b) CMIP6-single-task and (e) OBS-single-task models. The percentage of forecasts that positively correlate with verification data is listed in parentheses. Statistical significance of the correlation maps is estimated with respect to a field significance test, with the corresponding p-value listed in the title. The + sign denotes grid points where the correlation is locally significant. Maps of amplitude bias for (c) CMIP6-single-task and (f) OBS-single-task models. A negative bias indicates a forecast under-predicted the amplitude of the temperature anomalies.

To diagnose the sources of these errors, we decompose the MSESS according to equation (8) into contributions from the squared temporal correlation coefficient and the amplitude bias. To aid interpretation, all subsequent analyses are based on the square root of these two terms. Recall that since the forecasts and verification data are centered, the mean bias term is 0. Maps of the temporal correlation for the CMIP6-single-task and OBS-single-task models are shown in Figure 7b,e, respectively. For the CMIP6-single-task predictions, positive correlations are found over much of the western US, with the largest correlations along the west coast and weaker, sometimes negative, correlations in the continental interior. These regions of positive correlation account for the positive MSESS values along the west coast shown in Figure 7a. Positive temporal correlation coefficients recovered from the OBS-single-task model are similarly concentrated in the Pacific Northwest. However, the OBS-single-task model has large areas of negative correlations in the southern portions of the forecast region. The positive correlations estimated for the OBS-single-task model correspond to regions of near-zero MSESS values shown in Figure 7d. For both sets of statistical models, the contributions of temporal correlation to forecast skill are reduced by the amplitude bias, shown in Figure 7c,f. This analysis indicates that both statistical forecast systems underestimate the amplitude of the predicted temperature anomalies. However, this amplitude bias is markedly larger for predictions made by the OBS-single-task model.

Since the upper bound of the MSESS is determined by the temporal correlation, we quantify the statistical significance of this metric as discussed in Section 2.5. The percentage of grid-point forecasts that predict the correct sign of temperature anomalies and the corresponding p-values are listed in the title of Figure 7b,e. For CMIP6-single-task, 78% of the grid points are characterized by positive correlation, which has a p-value of .021 (i.e., it is field significant). In contrast, the OBS-single-task model has positive correlations only over 59% of the grid points, which has p-value of .26 (i.e., not field significant).

The maps of temporal correlation indicate that forecasts produced by statistical models trained on CMIP6 data are generally skillful over large portions of the western US. This metric says nothing about how well the model performs for any given 2-week average. Figure 8a,b shows the distribution of the spatial correlation as a function of time for the CMIP6-single-task and OBS-single-task models, respectively. For a given winter, the vertical bar represents the 25th–75th percentile of the spatial correlation and the black asterisk gives the median value. The number of forecasts in a given winter varies between 90 and 91, depending on leap year. The number listed next to each median gives the percentage of forecasts that predict correctly signed temperature anomalies. For the CMIP6-single-task model, 14 of the 19 forecast winters predict correctly signed temperature anomalies. Variations in predictive capability are also evident for the OBS-single-task model, where 13 of the 19 winter forecasts produce correctly signed anomalies. These results suggest that while the CMIP6-single-task model performs well overall in terms of individual grid points, the individual statistical forecast models do not consistently predict the large-scale pattern of temperature anomalies. That said, the overall skill of the CMIP6-single-task and OBS-single-task models in predicting the spatial pattern of temperature anomalies is indistinguishable from a no-skill forecast. The study by Hwang et al. (2019), in which forecasts are evaluated only with respect to spatial correlation, also found that statistical predictions are characterized by a wide range of spatial correlation scores. It is worth pointing out that the performance of the statistical forecast system derived by Hwang et al. (2019) cannot be directly compared with the statistical models evaluated here, because the statistical models are trained on different datasets, target a different range of dates, and use a slightly different metric.

Figure 8. Spatial correlation between observed winter (DJF) temperature anomalies for the period 2000–2018, where the year corresponds to December, and predictions made with the (a) CMIP6-single-task and (b) OBS-single-task models. The vertical lines represent the 25th–75th percentiles of the spatial correlation coefficient between forecasts and observations. The median is denoted by the black asterisk. The percentage of forecasts within a given winter that have a positive spatial correlation score with observations is listed next to the median.

In terms of aggregate metrics, the CMIP6-single-task model provides the most skillful forecasts. Here we examine a pair of high- and low-skill forecasts made by this statistical forecast system. The high-skill forecast, shown in Figure 9b, tends to capture the overall spatial structure of the observed temperature anomalies shown in Figure 9a. Notably, the amplitude of the predicted anomalies is reduced compared to observations. Consistent with analysis presented in Figure 7, the forecast is skillful in terms of predicting the correct sign of temperature anomalies, but underestimates their amplitude. The low-skill forecast, shown in Figure 9d, similarly predicts temperature anomalies that vary with latitude. However, the sign of the anomalies is completely opposite of the observed temperature anomalies shown in Figure 9c. Notably, the individual CMIP6-single-task models are capturing coherent patterns, which may suggest a common forcing mechanism is driving predictable variations over the target region.

Figure 9. Observed 2-week average temperature anomalies (a,c) and CMIP6-single-task forecasts (b,d) for a high-skill forecast targeting the 2-week average over January 8–21, 2010 (a,b) and low-skill forecast targeting the 2-week average over January 11–24, 2014 (c,d).

3.4. Predictor selection

Lastly, we examine differences in the frequency of predictor selection between the CMIP6-single-task and OBS-single-task models. Lasso sets some regression coefficients exactly to zero, indicating that the associated predictors are less important than those with nonzero coefficients. Thus, we can use nonzero coefficients as a form of predictor selection. Since lasso is fitted at each grid point, we can collect statistics of predictor selection across grid points. The frequency with which each predictor is selected using CMIP6 and observational training data is shown in Figure 10. Regardless of the forecast model, no predictor is selected consistently across all grid points. That said, a notable distinction between Figure 10a and b is the larger percentage of predictor selection for lasso models fit using CMIP6 data. Generally, Laplacian time series from the Pacific are selected by a large percentage of the individual CMIP6-single-task models. In contrast, the OBS-single-task model shows less agreement in predictor selection across locations. The robust selection of key predictors by the CMIP6-trained models can likely be attributed to improved estimates of the regression coefficients given the vast amount of data used in statistical model training.

Figure 10. Percentage of predictors selected across all 499 individual grid points for the (a) CMIP6-single-task and (b) OBS-single-task models. The horizontal black line denotes the 60% selection level.

4. Conclusion

This paper derives statistical models for predicting wintertime subseasonal temperature over the western US. Our goal was to show that statistical models trained on dynamical model data can be skillful, thereby demonstrating that dynamical models provide information relevant to subseasonal prediction. As a reference benchmark, we use simple linear regression to predict 2-week mean temperature at each grid point from the Nino3.4 index. This benchmark is compared to lasso-derived models with many more predictors (100 Laplacian SST time series). The lasso coefficients are estimated in two different ways, namely under a single-task or a multi-task formulation. In all cases, the forecasts are validated on observational data that were excluded from the statistical model construction.

With respect to spatial averages of NMSE and temporal correlation, the statistical models trained on CMIP6 data are more skillful than the statistical models trained on observational data. The performance of the most skillful statistical model, CMIP6-single-task, is characterized by a spatially averaged NMSE below 1 and a regionally averaged temporal correlation that is positive. This is not the case for the OBS-single-task model, a similarly configured set of statistical models trained on observational data, for which the spatially averaged NMSE is greater than 1 and the averaged temporal correlation is negative. A direct comparison of the MSESS between the CMIP6-single-task and OBS-single-task models shows a greater portion of the forecast region with positive MSESS values for the CMIP6-single-task model. The MSESS for the OBS-single-task model is characterized by mostly negative values. The positive MSESS identified for the CMIP6-single-task model can be attributed to the statistical model’s ability to correctly predict the sign of the temperature anomalies (i.e., positive and field-significant temporal correlation coefficients) over large portions of the western US. In contrast, the OBS-single-task model’s skill in predicting the correct sign of the temperature anomalies is largely limited to the Pacific Northwest. The greatest source of error for both statistical models is the amplitude bias: forecasts from the OBS-single-task model underpredict the magnitude of the temperature anomalies over much of the target region, which in turn accounts for its negative MSESS values. We demonstrate that the size of the training set impacts the skill of the statistical models and conclude that the increase in sample size from using simulated data more than compensates for the limitations due to imperfections in dynamical models. These results are encouraging and suggest that the skill of statistical subseasonal prediction models can be further improved by using both dynamical model simulations and observations in statistical model training.

In general, the single-task models performed better than the multi-task models when evaluated in terms of spatial averages of NMSE and temporal correlation. Even so, the skill of the single-task models is nearly indistinguishable from no skill with regard to these two performance metrics. This low skill on a local basis does not necessarily mean there is no significant skill for large-scale patterns. Notably, the individual CMIP6-single-task models predict coherent large-scale temperature anomalies over the target region. This suggests that statistical model skill might be further improved by isolating predictable large-scale patterns. Lasso was initially chosen because of its potential for interpretability, but for our study this turned out not to be the case. In Trenary and DelSole (2022), we develop a statistical technique that is able to identify predictable large-scale patterns despite the limited local predictability.

Acknowledgments

We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 1) for producing and making available their model output.

Author Contributions

L.T. is responsible for writing the original draft and all formal analysis and investigation. T.D. is responsible for conceptualization of the project and project supervision. Both authors contributed to revising and editing of the manuscript. All authors approved the final submitted draft.

Competing Interests

The authors declare no competing interests exist.

Data Availability Statement

For CMIP, the U.S. Department of Energy’s Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led the development of software infrastructure in partnership with the Global Organization for Earth System Science Portals; the data can be found at https://esgf-node.llnl.gov/projects/cmip6/. The observational data are provided by the National Oceanic and Atmospheric Administration Climate Prediction Center and can be downloaded directly from https://psl.noaa.gov/data/gridded/index.html.

Ethics Statement

The research meets all ethical guidelines, including adherence to the legal requirements of the United States.

Funding Statement

This research was supported primarily by the National Science Foundation (AGS 1822221). The views expressed herein are those of the authors and do not necessarily reflect the views of this agency.

References

Buchmann, P and DelSole, T (2021) Week 3–4 prediction of wintertime CONUS temperature using machine learning techniques. Frontiers in Climate 3. https://doi.org/10.3389/fclim.2021.697423
Coelho, CAS, Brown, B, Wilson, L, Mittermaier, M and Casati, B (2019) Chapter 16 – Forecast verification for S2S timescales. In Robertson, AW and Vitart, F (eds), Sub-Seasonal to Seasonal Prediction. Elsevier, pp. 337–361. https://doi.org/10.1016/B978-0-12-811714-9.00016-4
DelSole, T and Banerjee, A (2017) Statistical seasonal prediction based on regularized regression. Journal of Climate 30(4), 1345–1361.
DelSole, T and Tippett, MK (2015) Laplacian eigenfunctions for climate analysis. Journal of Climate 28(18), 7420–7436.
DelSole, T, Trenary, L, Tippett, MK and Pegion, K (2017) Predictability of week-3–4 average temperature and precipitation over the contiguous United States. Journal of Climate 30(10), 3499–3512.
Deser, C, Alexander, MA, Xie, S-P and Phillips, AS (2010) Sea surface temperature variability: Patterns and mechanisms. Annual Review of Marine Science 2(1), 115–143. https://doi.org/10.1146/annurev-marine-120408-151453
Eyring, V, Bony, S, Meehl, GA, Senior, CA, Stevens, B, Stouffer, RJ and Taylor, KE (2016) Overview of the Coupled Model Intercomparison Project phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development 9(5), 1937–1958.
Friedman, J, Hastie, T and Tibshirani, R (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.
Hastie, T, Tibshirani, R and Friedman, J (2017) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edn. New York: Springer.
Horel, JD and Wallace, JM (1981) Planetary-scale atmospheric phenomena associated with the Southern Oscillation. Monthly Weather Review 109(4), 813–829. Available at https://journals.ametsoc.org/view/journals/mwre/109/4/1520-0493_1981_109_0813_psapaw_2_0_co_2.xml.
Hwang, J, Orenstein, P, Cohen, J, Pfeiffer, K and Mackey, L (2019) Improving subseasonal forecasting in the Western U.S. with machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp. 2325–2335.
Johnson, NC, Collins, DC, Feldstein, SB, L’Heureux, ML and Riddle, EE (2014) Skillful wintertime North American temperature forecasts out to 4 weeks based on the state of ENSO and the MJO. Weather and Forecasting 29(1), 23–38. Available at https://journals.ametsoc.org/view/journals/wefo/29/1/waf-d-13-00102_1.xml.
Jolliffe, IT and Stephenson, DB (2012) Forecast Verification: A Practitioner’s Guide in Atmospheric Science, 2nd Edn. Oxford: Wiley-Blackwell.
McKinnon, KA, Rhines, A, Tingley, MP and Huybers, P (2016) Long-lead predictions of eastern United States hot days from Pacific sea surface temperatures. Nature Geoscience 9(5), 389–394.
Murphy, AH (1988) Skill scores based on the mean square error and their relationships to the correlation coefficient. Monthly Weather Review 116, 2417–2424.
National Academy of Sciences (2016) Strategies for Subseasonal to Seasonal Forecasts. Washington, DC: National Academy Press.
Newman, M, Sardeshmukh, PD, Winkler, CR and Whitaker, JS (2003) A study of subseasonal predictability. Monthly Weather Review 131(8), 1715–1732.
Pegion, K, Kirtman, BP, Becker, E, Collins, DC, LaJoie, E, Burgman, R, Bell, R, DelSole, T, Min, D, Zhu, Y, Li, W, Sinsky, E, Guan, H, Gottschalck, J, Metzger, EJ, Barton, NP, Achuthavarier, D, Marshak, J, Koster, RD, Lin, H, Gagnon, N, Bell, M, Tippett, MK, Robertson, AW, Sun, S, Benjamin, SG, Green, BW, Bleck, R and Kim, H (2019) The subseasonal experiment (SubX): A multimodel subseasonal prediction experiment. Bulletin of the American Meteorological Society 100(10), 2043–2060.
Pegion, K and Sardeshmukh, PD (2011) Prospects for improving subseasonal predictions. Monthly Weather Review 139(11), 3648–3666.
Reynolds, R, Smith, TM, Liu, C, Chelton, DB, Casey, KS and Schlax, MG (2007) Daily high-resolution-blended analyses for sea surface temperature. Journal of Climate 20, 5473–5496.
Tibshirani, R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.
Trenary, L and DelSole, T (2022) Advancing interpretability of machine learning prediction models. Environmental Data Science 1, E14. https://doi.org/10.1017/eds.2022.13
Vitart, F (2014) Evolution of ECMWF sub-seasonal forecast skill scores. Quarterly Journal of the Royal Meteorological Society 140, 1889–1899. https://doi.org/10.1002/qj.2256
Vitart, F, Ardilouze, C, Bonet, A, Brookshaw, A, Chen, M, Codorean, C, Déqué, M, Ferranti, L, Fucile, E, Fuentes, M, Hendon, H, Hodgson, J, Kang, H-S, Kumar, A, Lin, H, Liu, G, Liu, X, Malguzzi, P, Mallas, I, Manoussakis, M, Mastrangelo, D, MacLachlan, C, McLean, P, Minami, A, Mladek, R, Nakazawa, T, Najm, S, Nie, Y, Rixen, M, Robertson, AW, Ruti, P, Sun, C, Takaya, Y, Tolstykh, M, Venuti, F, Waliser, D, Woolnough, S, Wu, T, Won, D-J, Xiao, H, Zaripov, R and Zhang, L (2017) The subseasonal to seasonal (S2S) prediction project database. Bulletin of the American Meteorological Society 98(1), 163–173.
Vitart, F and Robertson, AW (2018) The sub-seasonal to seasonal prediction project (S2S) and the prediction of extreme events. NPJ Climate and Atmospheric Science 1(1), 3. https://doi.org/10.1038/s41612-018-0013-0
White, CJ, Carlsen, H, Robertson, AW, Klein, RJ, Lazo, JK, Kumar, A, Vitart, F, Coughlan de Perez, E, Ray, AJ, Murray, V, Bharwani, S, MacLeod, D, James, R, Fleming, L, Morse, AP, Eggen, B, Graham, R, Kjellström, E, Becker, E, Pegion, KV, Holbrook, NJ, McEvoy, D, Depledge, M, Perkins-Kirkpatrick, S, Brown, TJ, Street, R, Jones, L, Remenyi, TA, Hodgson-Johnston, I, Buontempo, C, Lamb, R, Meinke, H, Arheimer, B and Zebiak, SE (2017) Potential applications of subseasonal-to-seasonal (S2S) predictions. Meteorological Applications 24(3), 315–325.
Wilks, D (2011) Chapter 8 – Ensemble Forecasting, Section 8.6.3. Statistical Methods in the Atmospheric Sciences, 3rd Edn. San Diego: Elsevier, pp. 359–365.
Zhuang, F, Qi, Z, Duan, K, Xi, D, Zhu, Y, Zhu, H, Xiong, H and He, Q (2021) A comprehensive survey on transfer learning. Proceedings of the IEEE 109(1), 43–76.

Table 1. List of the CMIP6 models with preindustrial control runs analyzed in this study.


Figure 1. Map of the forecast target region. Each red dot denotes a forecast location on a 1 × 1 degree grid.


Figure 2. Map depicting predictor locations. Black and blue denote the domains of the Pacific and Atlantic basins, respectively. Large-scale variations in both basins are represented by the leading 50 Laplacian eigenvectors.


Figure 3. The second (a) and third (b) eigenvectors of the Laplacian operator for the Pacific basin.


Table 2. Summary of statistical forecast models.


Figure 4. NMSE of a lasso model versus $ \lambda $ for forecast locations in (a) Yakima, Washington, (b) Austin, Texas, and (c) Colorado Springs, Colorado. The NMSE curves are estimated from lasso predictions made with the CMIP6-single-task model and evaluated against observations for winters (DJF) during 1982–1999. A red asterisk denotes the $ \lambda $ that minimizes the NMSE.
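The λ selection shown in Figure 4 amounts to sweeping a grid of penalty values and retaining the one that minimizes NMSE over a validation period. A minimal sketch of that idea, with hypothetical argument names (and sklearn's `alpha` standing in for λ, which is scaled differently from the λ of glmnet-style software), is:

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Return the penalty that minimizes NMSE on the validation winters."""
    nmse = []
    for lam in lambdas:
        # Fit on the training period and predict the validation period.
        pred = Lasso(alpha=lam).fit(X_train, y_train).predict(X_val)
        nmse.append(np.mean((pred - y_val) ** 2) / np.mean(y_val ** 2))
    nmse = np.asarray(nmse)
    return lambdas[int(np.argmin(nmse))], nmse
```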


Figure 5. Performance of the statistical prediction models based on spatial averages of (a) NMSE and (b) temporal correlation. The horizontal black line denotes an NMSE of 1 in (a) and zero correlation in (b). Five statistical models are compared: the benchmark Nino3.4 regression model, two observation-trained lasso models, and two CMIP6-trained lasso models. The vertical black bars denote the uncertainty of the performance metrics when observation data for the period 1982–2018 are bootstrapped. The method for evaluating this uncertainty varies across statistical models; details of how the uncertainty is estimated can be found in Section 2.6.


Figure 6. Spatially averaged NMSE versus training set size for the CMIP6-single-task model. The vertical bars represent the uncertainty in spatially averaged NMSE with respect to the data included in the training set. This uncertainty is estimated by bootstrapping CMIP6 data to create a training dataset of a specified length, re-training the lasso model, and then verifying the resulting forecasts. This process is repeated 60 times, and the bars give the 5th–95th percentiles of these 60 estimates. The number of years in the training dataset is varied over 50, 100, 300, 500, 750, 1000, 2000, and 3000 years. The verification data are observed winter (DJF) temperatures for the period 2000–2018. In contrast to the analysis shown in Figure 5, the observation data used in validation and verification are not bootstrapped. The vertical axis is scaled to highlight the uncertainties relative to the no-skill value of NMSE = 1.
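The resampling experiment summarized in this caption can be sketched as follows. This is an illustration only: the penalty value, resampling with replacement, and fitting a single grid point are assumptions made here for brevity, and the actual procedure is described in Section 2.6.

```python
import numpy as np
from sklearn.linear_model import Lasso

def training_size_uncertainty(X_cmip, y_cmip, X_obs, y_obs,
                              n_train, alpha=0.1, n_boot=60, seed=0):
    """Bootstrap a CMIP6 training set of size n_train, refit the lasso at one
    grid point, verify against observations, and repeat n_boot times."""
    rng = np.random.default_rng(seed)
    nmse = []
    for _ in range(n_boot):
        # Resample n_train simulated winters (with replacement assumed here).
        idx = rng.integers(0, X_cmip.shape[0], size=n_train)
        pred = Lasso(alpha=alpha).fit(X_cmip[idx], y_cmip[idx]).predict(X_obs)
        nmse.append(np.mean((pred - y_obs) ** 2) / np.mean(y_obs ** 2))
    # 5th-95th percentile range, as plotted by the vertical bars in Figure 6.
    return np.percentile(nmse, [5, 95])
```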


Figure 7. Average performance of the CMIP6-single-task and OBS-single-task models evaluated in terms of MSESS (left column), temporal correlation (middle column), and amplitude bias (right column). Each metric is evaluated with respect to observed and forecast winter (DJF) temperature anomalies for the period 2000–2018, where the year corresponds to a December start date. Maps of MSESS for the (a) CMIP6-single-task and (d) OBS-single-task models; a forecast is skillful if MSESS > 0. Maps of the temporal correlation for the (b) CMIP6-single-task and (e) OBS-single-task models; the percentage of forecasts that correlate positively with the verification data is listed in parentheses. Statistical significance of the correlation maps is estimated with a field significance test, with the corresponding p-value listed in the title. The + sign denotes grid points where the correlation is locally significant. Maps of amplitude bias for the (c) CMIP6-single-task and (f) OBS-single-task models; a negative bias indicates that the forecast underpredicts the amplitude of the temperature anomalies.


Figure 8. Spatial correlation between observed winter (DJF) temperature anomalies and predictions made with the (a) CMIP6-single-task and (b) OBS-single-task models for the period 2000–2018, where the year corresponds to the December start date. The vertical lines represent the 25th–75th percentiles of the spatial correlation coefficients between forecasts and observations, and the black asterisk denotes the median. The percentage of forecasts within a given winter that have a positive spatial correlation with observations is listed next to the median.


Figure 9. Observed 2-week average temperature anomalies (a,c) and CMIP6-single-task forecasts (b,d) for a high-skill forecast targeting the 2-week average over January 8–21, 2010 (a,b) and a low-skill forecast targeting the 2-week average over January 11–24, 2014 (c,d).


Figure 10. Percentage of predictors selected across all 499 individual grid points for the (a) CMIP6-single-task and (b) OBS-single-task models. The horizontal black line denotes the 60% selection level.