1. Introduction
Forecasting economic developments during crisis times is problematic because the realisations of the variables are far from their average values, while econometric models are typically better at explaining and predicting values close to the average, particularly in the case of linear models. The situation is even worse for the Covid-19 induced recession, when typically well-performing econometric models such as Bayesian vector autoregressions (VARs) with stochastic volatility have trouble tracking the unprecedented fall in real activity and labour market indicators; see, for example, Carriero et al. (2020) and Plagborg-Møller et al. (2020) for the United States, or An and Loungani (2020) for an analysis of the past performance of the Consensus Forecasts.
As a partial solution, Foroni et al. (2020) employ simple mixed-frequency models to nowcast and forecast quarterly GDP growth rates for the U.S. and the rest of the G7, using common monthly indicators such as industrial production, surveys and the slope of the yield curve. They then adjust the forecasts by a specific form of intercept correction or estimate by the similarity approach, see Clements and Hendry (1999) and Dendramis et al. (2020), showing that the former can reduce the extent of the forecast error during the Covid-19 period. Schorfheide and Song (2020) do not include Covid periods in the estimation of a mixed-frequency VAR model because those observations substantially alter the forecasts. An alternative approach is the specification of sophisticated nonlinear/time-varying models. While this is not without perils when used on short economic time series, it can yield some gains; see, for example, Ferrara et al. (2015) in the context of forecasting during the financial crisis using Markov-switching, threshold and other types of random parameter models.
The goal of this paper is to go one step further in terms of model sophistication, by considering a variety of machine learning (ML) methods and assessing whether and to what extent they can improve the forecasts, both in general and specifically during the Covid-19 crisis, focussing on the UK economy, which at the same time was also experiencing substantial Brexit-related uncertainty. A related paper, but with a focus on the largest euro area countries, is Huber et al. (2020), who introduce Bayesian Additive Regression Tree-VARs (BART-VARs) for Covid. They develop a nonlinear mixed-frequency VAR framework by incorporating regression trees and exploiting their ability to model outliers and to disentangle the signal from noise. Indeed, regression trees (and even more so forests) are able to adapt quickly to extreme observations and to identify a switch in the underlying regime. Another relevant related paper is Goulet Coulombe et al. (2019), which, however, does not include an analysis of the Covid-19 period and focuses on the United States. A third related paper, again with a focus on the United States, is Clark et al. (2021), who consider alternative specifications of BART-VARs, possibly also with a non-parametric specification for the time-varying volatility, and compare their point, density and tail forecast performance with that of large Bayesian VARs with stochastic volatility, often finding gains, though of limited size.
In line with Goulet Coulombe et al. (2019), we consider five nonlinear non-parametric ML methods. Three of them have the capacity to extrapolate and two do not. Specifically, being based on trees, boosted trees (BT) and random forests (RF) cannot predict out-of-sample a value ( $ {\hat{y}}_i $ ) greater than the maximal in-sample value (the same goes for the minimum). This is a simple implication of how forecasts are constructed, basically by taking means over sub-samples chosen in a data-driven way. Clearly, this is an important limitation when it comes to forecasting variables that moved well outside their typical range during the Pandemic (like hours worked; see Footnote 1). No such constraints bind on macroeconomic random forest (MRF), kernel ridge regression (KRR) and neural networks (NN). By using a linear part within the leaves, MRF can extrapolate the same way a linear model does, while retaining the usual benefits of tree-based methods (limited or nonexistent overfitting, little to no tuning required, ability to cope with large data). Goulet Coulombe (2020a) notes that this particular feature gives MRF an edge over RF when it comes to forecasting the (once) extreme escalation of the unemployment rate during the Great Recession.
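The extrapolation limit can be illustrated with a minimal sketch. It assumes a one-split regression tree (a stump) as a stand-in for BT and RF; the function names and the toy data are hypothetical and not taken from the paper. Because each leaf forecasts a mean of in-sample targets, the tree forecast stays inside the in-sample range, whereas a linear fit does not.

```python
def fit_stump(x, y):
    """Fit a one-split regression tree by minimising the squared error."""
    best = None
    for c in sorted(set(x))[:-1]:  # thresholds that keep both leaves non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, c, ml, mr)
    _, c, ml, mr = best
    return lambda x_new: ml if x_new <= c else mr


def fit_line(x, y):
    """Ordinary least squares with an intercept and a single regressor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return lambda x_new: my + slope * (x_new - mx)


# Toy in-sample data (hypothetical): y = 2x on x = 0..9, so max(y) = 18.
x_train = [float(i) for i in range(10)]
y_train = [2.0 * xi for xi in x_train]

stump, line = fit_stump(x_train, y_train), fit_line(x_train, y_train)
x_extreme = 50.0                       # far outside the training range
tree_forecast = stump(x_extreme)       # a leaf mean, bounded by max(y_train)
linear_forecast = line(x_extreme)      # extrapolates along the fitted line
```

Averaging many such trees, as RF does, cannot lift the forecast above the largest leaf mean, which is why a linear part within the leaves is needed to extrapolate.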
As mentioned, we focus on the UK and, as another contribution of the paper, we construct a monthly large-scale macroeconomic database (MD), labelled UK-MD, comparable to those for the United States by McCracken and Ng (2016, 2020) and for Canada by Fortin-Gagnon et al. (2018) (Footnote 2). Specifically, the dataset contains 112 monthly macroeconomic and financial indicators divided into nine categories: labour, production, retail and services, consumer and retail price indices, producer price indices, international trade, money, credit and interest rate, stock market and finally sentiment and leading indicators. The starting date varies across indicators, from 1960 to 2000, and to simplify econometric analyses, we also balance the resulting panel using an expectation–maximization (EM) algorithm to impute missing values, as in Stock and Watson (2002b) and McCracken and Ng (2016).
In terms of empirical results, overall ML methods can provide substantial gains when short-term forecasting several indicators of the UK economy, though a careful temporal and variable-by-variable analysis is needed. Over the full sample, RF works particularly well for labour market variables, in particular when augmented with a moving average rotation of $ X $ (MARX), where $ X $ denotes the predictors; KRR for real activity and consumer price inflation; LASSO or LASSO + MARX for the retail price index and its version focusing on housing; and RF for credit variables. The gains can be sizable, even 40–50 per cent with respect to the benchmark, and ML methods were particularly useful during the Covid-19 period. Focussing on the Covid sample, it is clear that nonlinear methods with the ability to extrapolate become extremely competitive. And this goes both ways. For instance, certain MRFs, unlike linear methods or simpler nonlinear ML techniques, deliver important improvements by predicting unprecedented values (for hours worked) and by avoiding forecasts of cataclysms that did not materialise (employment and housing prices).
The rest of the paper is structured as follows. Section 2 introduces the ML forecasting framework. Section 3 discusses the forecasting models. Section 4 presents the UK-MD dataset and studies its main features. Section 5 discusses the set-up of the forecasting exercise. Section 6 presents and discusses the results. Section 7 summarises the key findings and concludes. Additional details and results are presented in the Appendices.
2. ML forecasting framework
ML algorithms offer ways to approximate unknown and potentially complicated functional forms with the objective of minimising the expected loss of a forecast over $ h $ periods. The focus of the current paper is to construct a feature matrix likely to improve the macroeconomic forecasting performance of off-the-shelf ML algorithms. Let $ {H}_t=\left[{H}_{1t},\dots, {H}_{Kt}\right] $ for $ t=1,\dots, T $ be the vector of variables found in a large MD, such as the FRED-MD database of McCracken and Ng (2016) or the UK-MD dataset described in Section 4, and let $ {y}_{t+h} $ be our target variable. We follow Stock and Watson (2002a, 2002b) and target average growth rates or average differences over $ h $ periods ahead
To illustrate this point, define $ {Z}_t\equiv {f}_Z\left({H}_t\right) $ as the $ {N}_Z $ -dimensional feature vector, formed by combining several transformations of the variables in $ {H}_t $ (Footnote 3). The function $ {f}_Z $ represents the data pre-processing and/or feature engineering whose effects on forecasting performance we seek to investigate. The training problem for the case of no data pre-processing ( $ {f}_Z=I\left(\right) $ ) is
The function $ g $ , chosen as a point in the functional space G, maps transformed inputs into the transformed targets. $ \mathrm{pen}\left(\right) $ is the regularisation function whose strength depends on some vector/scalar hyperparameter(s) $ \tau $ .
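As a minimal sketch of this set-up, the snippet below builds a direct $ h $ -step target and solves one instance of the penalised training problem. It assumes squared loss, a linear $ g $ and a ridge penalty $ \tau {\left\Vert \beta \right\Vert}^2 $ ; the average growth rate is taken to be $ \left(1/h\right)\ln \left({Y}_{t+h}/{Y}_t\right) $ . The function names and the simulated data are hypothetical.

```python
import numpy as np


def average_growth_target(levels, h):
    """Assumed target: y_{t+h} = (1/h) * ln(Y_{t+h} / Y_t), aligned with time t."""
    logs = np.log(np.asarray(levels, dtype=float))
    return (logs[h:] - logs[:-h]) / h


def ridge_fit(H, y, tau):
    """Minimise sum_t (y_{t+h} - H_t' beta)^2 + tau * ||beta||^2 in closed form."""
    H = np.asarray(H, dtype=float)
    return np.linalg.solve(H.T @ H + tau * np.eye(H.shape[1]), H.T @ y)


rng = np.random.default_rng(0)
T, K, h = 120, 5, 3
H = rng.standard_normal((T, K))        # simulated stand-in for the predictors H_t
levels = 100.0 * np.exp(np.cumsum(0.002 + 0.01 * rng.standard_normal(T)))

y = average_growth_target(levels, h)   # length T - h
H_train = H[: T - h]                   # predictors dated t, target dated t + h
beta_light = ridge_fit(H_train, y, tau=0.1)
beta_heavy = ridge_fit(H_train, y, tau=1000.0)
forecast = H[-1] @ beta_light          # direct h-step forecast from the latest H_t
```

A larger $ \tau $ shrinks the coefficients more strongly; richer choices of $ g $ and of the penalty lead to the models of Section 3.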
3. Forecasting models
In this section, we present the main predictive models (for a more complete discussion, see, among others, Hastie et al., 2009), and some additional, less standard, forecasting models we will consider (more details can be found in Goulet Coulombe et al., 2019). Table 1 lists all the models implemented in the forecasting exercise, together with their respective input matrices $ {Z}_t $ .
3.1. Main models
Linear models. We consider the autoregressive model (AR), as well as the autoregressive diffusion index (ARDI) model of Stock and Watson (2002a, 2002b). Let $ {Z}_t=\left[{y}_t,{y}_{t-1},\dots, {y}_{t-{P}_y},{F}_t,{F}_{t-1},\dots, {F}_{t-{P}_f}\right] $ be our feature matrix, then the ARDI model is given by
where $ {F}_t $ are $ k $ factors extracted by principal components from the $ {N}_X $ -dimensional set of predictors $ {X}_t $ and parameters are estimated by ordinary least squares (OLS). The AR model is obtained by keeping in $ {Z}_t $ only the lagged values of $ {y}_t $ . The hyperparameters of both models are specified using the Bayesian information criterion (BIC).
Ridge, lasso and elastic net. The elastic net model simultaneously predicts the target variable $ {y}_{t+h} $ and selects the most relevant predictors from a set of $ {N}_Z $ features contained in $ {Z}_t $ whose weights $ \beta := {\left({\beta}_i\right)}_{i=1}^{N_Z} $ solve the following penalised regression problem
and where $ \left(\alpha, \lambda \right) $ are hyperparameters. Here, $ {Z}_t $ contains lagged values of $ {y}_t $ , factors and $ {X}_t $ . The Lasso estimator is obtained when $ \alpha =1 $ , while the ridge estimator imposes $ \alpha =0 $ , and both use unit weights throughout. We select $ \lambda $ and $ \alpha $ by grid search, where $ \alpha \in \left\{.01,.02,.03,\dots, 1\right\} $ and $ \lambda \in \left[0,{\lambda}_{\mathrm{max}}\right] $ , with $ {\lambda}_{\mathrm{max}} $ the penalty term beyond which all coefficients are guaranteed to be zero, assuming $ \alpha \ne 0 $ . Since these algorithms perform shrinkage (and selection), we do not cross-validate $ {P}_y $ , $ {P}_f $ and $ k $ . We impose $ {P}_y=6 $ , $ {P}_f=6 $ and $ k=8 $ and let the algorithms select the most relevant features for the forecasting task at hand.
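The role of $ {\lambda}_{\mathrm{max}} $ can be checked with a small coordinate-descent sketch. It assumes the common parametrisation $ \frac{1}{2n}{\left\Vert y-Z\beta \right\Vert}^2+\lambda \left[\alpha {\left\Vert \beta \right\Vert}_1+\frac{1-\alpha }{2}{\left\Vert \beta \right\Vert}_2^2\right] $ , which may differ in scaling from the one used in the paper; the function names and simulated data are hypothetical.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, returning 0 when |z| <= gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def elastic_net(Z, y, lam, alpha, n_iter=200):
    """Cyclic coordinate descent for the (assumed) elastic net objective above."""
    n, p = Z.shape
    beta = np.zeros(p)
    col_ss = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - Z @ beta + Z[:, j] * beta[j]
            rho = Z[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
    return beta


def lambda_max(Z, y, alpha):
    """Smallest penalty at which every coefficient is zero (requires alpha > 0)."""
    return np.max(np.abs(Z.T @ y)) / (len(y) * alpha)


rng = np.random.default_rng(1)
Z = rng.standard_normal((100, 8))
y = 1.5 * Z[:, 0] - 2.0 * Z[:, 3] + 0.1 * rng.standard_normal(100)
Z, y = Z - Z.mean(axis=0), y - y.mean()     # centre so that no intercept is needed

alpha = 0.5
lam_max = lambda_max(Z, y, alpha)
beta_at_max = elastic_net(Z, y, lam_max, alpha)          # all zeros
beta_small = elastic_net(Z, y, 0.01 * lam_max, alpha)    # close to least squares
```

With $ \alpha =0 $ the $ {\ell}_1 $ term vanishes and no finite $ \lambda $ sets all coefficients exactly to zero, which is why the bound is stated for $ \alpha \ne 0 $ .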
Random forests. This algorithm provides a means of approximating nonlinear functions by combining regression trees. Each regression tree partitions the feature space defined by $ {Z}_t $ into distinct regions and, in its simplest form, uses the region-specific mean of the target variable $ {y}_{t+h} $ as the forecast, that is for $ M $ leaf nodes
where $ {R}_1,\dots, {R}_M $ is a partition of the feature space. The input $ {Z}_t $ is the same as in the case of elastic net models. To circumvent some of the limitations of regression trees, Breiman (2001) introduced RF. RF consists of growing many trees on subsamples (or nonparametric bootstrap samples) of observations. At each split, only a random subset of features is eligible as the splitting variable, which further decorrelates the trees. The final forecast is obtained by averaging over the forecasts of all trees. In this paper, we use 500 trees, which is normally enough to stabilise the predictions. The minimum number of observations in each terminal node is set to 3, while the number of features considered at each split is $ \frac{\#{Z}_t}{3} $ . In addition, we impose $ {P}_y=6 $ , $ {P}_f=6 $ and $ k=8 $ .
Boosted trees. This algorithm provides an alternative means of approximating nonlinear functions by additively combining regression trees in a sequential fashion. Let $ \eta \in \left[0,1\right] $ be the learning rate and $ {\hat{y}}_{t+h}^{(n)} $ and $ {e}_{t+h}^{(n)}:= {y}_{t+h}-\eta {\hat{y}}_{t+h}^{(n)} $ be the step $ n $ predicted value and pseudo-residuals, respectively. Then, for square loss, the step $ n+1 $ prediction is obtained as
where $ \left({c}_{n+1},{\rho}_{n+1}\right):= \arg \underset{\rho, c}{\min }{\sum}_{t=1}^T{\left({e}_{t+h}^{(n)}-\rho f\left({Z}_t,c\right)\right)}^2 $ and $ {c}_{n+1}:= {\left({c}_{n+1,m}\right)}_{m=1}^M $ are the parameters of a regression tree. In other words, it recursively fits trees on pseudo-residuals. We consider a vanilla BT where the maximum depth of each tree is set to 10 and all features are considered at each split. We select the number of steps and $ \eta \in \left[0,1\right] $ with Bayesian optimisation. $ {Z}_t $ contains lagged values of $ {y}_t $ , factors and $ {X}_t $ , and we impose $ {P}_y=6 $ , $ {P}_f=6 $ and $ k=8 $ .
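The recursion on pseudo-residuals can be sketched as follows. This is an illustration under simplifying assumptions rather than the implementation used in the paper: trees are one-split stumps on a single feature (instead of depth 10), and the update takes the common shrinkage form in which $ \eta $ scales each new tree, which may differ in detail from the notation above. All names and data are hypothetical.

```python
def fit_stump(z, e):
    """One-split regression tree on a single feature, fitted by least squares."""
    best = None
    for c in sorted(set(z))[:-1]:
        left = [ei for zi, ei in zip(z, e) if zi <= c]
        right = [ei for zi, ei in zip(z, e) if zi > c]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, c, ml, mr)
    _, c, ml, mr = best
    return lambda z_new: ml if z_new <= c else mr


def boost(z, y, n_steps, eta):
    """Sequentially fit stumps to the current residuals under square loss."""
    y_hat = [0.0] * len(y)
    trees, mse_path = [], []
    for _ in range(n_steps):
        resid = [yi - fi for yi, fi in zip(y, y_hat)]       # pseudo-residuals
        tree = fit_stump(z, resid)
        trees.append(tree)
        y_hat = [fi + eta * tree(zi) for fi, zi in zip(y_hat, z)]
        mse_path.append(sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y))

    def predict(z_new):
        return sum(eta * t(z_new) for t in trees)

    return predict, mse_path


z_train = [i / 10 for i in range(40)]
y_train = [(zi - 2.0) ** 2 for zi in z_train]               # smooth nonlinear target
predict, mse_path = boost(z_train, y_train, n_steps=50, eta=0.3)
```

The in-sample MSE falls at every step for $ \eta \in \left(0,1\right] $ , so the number of steps and $ \eta $ have to be tuned out-of-sample, as done here with Bayesian optimisation.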
Kernel ridge regressions. A way to introduce high-order nonlinearities among the predictors in $ {Z}_t $ , without specifying a plethora of basis functions, is to opt for the kernel trick. As in Goulet Coulombe et al. (2019), the nonlinear ARDI predictive equation (3) is written in a general nonlinear form $ g\left({Z}_t\right) $ and can be approximated with basis functions $ \phi \left(\right) $ such that
The so-called kernel trick is the fact that there exists a reproducing kernel $ K\left(\right) $ such that
This means we do not need to specify the numerous basis functions; a well-chosen kernel implicitly replicates them. Here, we use the standard radial basis function (RBF) kernel
where $ \sigma $ is a tuning parameter to be chosen by cross-validation. In terms of implementation, after factors are extracted via principal component analysis from equation (4), the forecast of the KRR diffusion index model is obtained from
Here, we impose the same set of inputs, $ {Z}_t $ , as in the ARDI model and we fix $ {P}_y=6 $ , $ {P}_f=6 $ and $ k=8 $ .
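A minimal sketch of a kernel ridge forecast with an RBF kernel follows. It assumes the usual form $ K\left(x,{x}^{\prime}\right)=\exp \left(-{\left\Vert x-{x}^{\prime}\right\Vert}^2/2{\sigma}^2\right) $ and the standard dual solution; the values of $ \sigma $ and of the ridge penalty are arbitrary here rather than cross-validated, and the simulated one-dimensional input is a hypothetical stand-in for the ARDI inputs.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """K(a, b) = exp(-||a - b||^2 / (2 * sigma^2)) for all pairs of rows."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def krr_fit_predict(Z_train, y_train, Z_new, sigma, lam):
    """Kernel ridge forecast: K(Z_new, Z) (K(Z, Z) + lam * I)^{-1} y."""
    K = rbf_kernel(Z_train, Z_train, sigma)
    dual_coef = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return rbf_kernel(Z_new, Z_train, sigma) @ dual_coef


rng = np.random.default_rng(2)
Z_train = rng.uniform(-3, 3, size=(80, 1))
y_train = np.sin(Z_train[:, 0]) + 0.05 * rng.standard_normal(80)
Z_new = np.array([[0.5], [1.5]])
forecast = krr_fit_predict(Z_train, y_train, Z_new, sigma=1.0, lam=0.1)
```

No basis functions are ever constructed: only the $ T\times T $ kernel matrix is needed, which is what keeps the method tractable when $ {Z}_t $ is large.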
Neural networks. We consider standard feed-forward networks, with an architecture that closely follows that of Gu et al. (2020). Cross-validating the whole network architecture is a difficult task, especially with a small number of observations, as is the case in macroeconomic applications. Hence, we use two hidden layers, the first with 32 neurons and the second with 16 neurons. The number of epochs is fixed at 100. The activation function is ReLU and that of the output layer is linear. The batch size is 32 and the optimiser is Adam (Keras default values). The learning rate and the Lasso parameter are chosen by fivefold cross-validation from the grids $ \left\{0.001,0.01\right\} $ and $ \left\{0.001,0.0001\right\} $ , respectively. We apply early stopping, that is, we stop training once 20 epochs pass without any improvement in the cross-validation mean squared error (MSE). The final prediction is the average of an ensemble of five different estimations. $ {Z}_t $ contains lagged values of $ {y}_t $ , factors and $ {X}_t $ , and we impose $ {P}_y=6 $ , $ {P}_f=6 $ and $ k=8 $ .
3.2. Additional forecasting models
Macroeconomic random forests. Goulet Coulombe (2020a) proposes a new form of RF better suited for macroeconomic data. The new problem is to extract generalised time-varying parameters (GTVPs)
where $ {S}_t $ are the state variables governing time variation and $ \mathcal{F} $ is a forest. $ {S}_t $ is (preferably) a high-dimensional macroeconomic data set. In this paper, it is the same $ {Z}_t $ as in plain RF and boosting. $ \tilde{X} $ determines the linear model that we want to be time-varying. Usually $ \tilde{X}\subset S $ is rather small (and focussed) compared to $ S $ . For instance, an autoregressive RF (ARRF) uses lags of $ {y}_t $ for $ {\tilde{X}}_t $ . A factor-augmented ARRF (FA-ARRF) adds factors to ARRF’s linear part.
The problem is to find the optimal variable $ {S}_j $ (that is, the best $ j $ out of the random subset of predictor indexes $ {\mathcal{J}}^{-} $ ) with which to split the sample, and the value $ c $ of that variable at which to split. The outputs are $ {j}^{\ast } $ and $ {c}^{\ast } $ , which are used to split $ l $ (the parent node) into two children nodes, $ {l}_1 $ and $ {l}_2 $ . Hence, the greedy algorithm developed in Goulet Coulombe (2020a) runs
recursively to construct trees.
As was the case for RF, the bulk of regularisation comes from taking the average over a diversified ensemble of trees (generated by both Bagging and a random $ {\mathcal{J}}^{-}\subset \mathcal{J} $ ). Nonetheless, the $ {\beta}_t $ ’s (and the attached prediction) can also benefit from extra (yet mild) regularisation. Time-smoothness is made operational by taking the ‘rolling-window view’ of time-varying parameters. That is, the tree solves many weighted least squares problems that include close-by observations. To keep computational demand low, the kernel $ w\left(t;\zeta \right) $ is a symmetric five-step Olympic podium. Informally, the kernel puts a weight of 1 on observation $ t $ , a weight of $ \zeta <1 $ on observations $ t-1 $ and $ t+1 $ and a weight of $ {\zeta}^2 $ on observations $ t-2 $ and $ t+2 $ . Note that a small ridge penalty is added to make sure every matrix inverts nicely (even in very small leaves), so a single tree has in fact two sources of regularisation.
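The two within-leaf sources of regularisation can be sketched as below. This is an illustration under stated assumptions, not the MRF code: when the podiums of several leaf observations overlap, the largest weight is kept (an assumption), the leaf membership is hypothetical, and a deterministic time index serves as the single regressor.

```python
import numpy as np


def podium_weights(leaf_times, T, zeta):
    """Weight 1 on leaf observations, zeta on t-1 and t+1, zeta^2 on t-2 and t+2."""
    w = np.zeros(T)
    for t in leaf_times:
        for lag in (-2, -1, 0, 1, 2):
            s = t + lag
            if 0 <= s < T:
                w[s] = max(w[s], zeta ** abs(lag))   # keep the largest weight (assumption)
    return w


def weighted_ridge(X, y, w, lam):
    """Solve (X' W X + lam * I) beta = X' W y for the within-leaf linear model."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X + lam * np.eye(X.shape[1]), XtW @ y)


T, zeta, lam = 12, 0.5, 0.01
w = podium_weights(leaf_times=[5], T=T, zeta=zeta)   # hypothetical leaf with one observation

X = np.column_stack([np.ones(T), np.arange(T, dtype=float)])  # intercept + one regressor
y = 1.0 + 2.0 * X[:, 1]
beta_leaf = weighted_ridge(X, y, w, lam)             # close to (1, 2), slightly shrunk
```

Even a leaf holding a single observation yields an invertible problem here, since its neighbours enter with positive weight and the ridge term bounds the smallest eigenvalue away from zero.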
The standard RF is a restricted version of MRF where $ {\tilde{X}}_t=\iota $ , $ \lambda =0 $ , $ \zeta =0 $ and the block size for Bagging is 1. In words, the only regressor is a constant, there is no within-leaf shrinkage, and Bagging does not care for serial dependence. It is understood that MRF will have an edge over RF whenever linear signals included in $ {\tilde{X}}_t $ are strong and the number of training observations (or signal-to-noise ratio) is low. The reason for this is simple: MRF nudges the learning algorithm in the right direction rather than hoping for RF to learn everything non-parametrically. Moreover, by providing generalised time-varying parameters (and credible regions for those), MRF lends itself more easily to interpretation.
Moving average rotation of $ X $ . The MARX transformation was proposed in Goulet Coulombe et al. (2020) as a feature engineering technique that generates an implicit shrinkage more appropriate for time series data. In a linear setup where coefficients are shrunk (and maybe selected) to 0, using MARX transforms the usual $ {\beta}_{k,p}\to 0 $ prior into one that shrinks each $ {\beta}_{k,p} $ towards $ {\beta}_{k,p-1} $ for the $ p $ th lag of predictor $ k $ . For more sophisticated techniques where shrinkage is only implicit (like RF and boosting), MARX presents the variable-selecting algorithm with pre-assembled groups of lags, which helps prevent the underlying trees from wasting splits on a bunch of scattered lags (Goulet Coulombe, 2020a). Goulet Coulombe et al. (2020) report that the transformation is particularly helpful for U.S. monthly real economic activity targets. Adding MARX to the input set $ {Z}_t $ is considered in all models except ARDI and KRR.
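One way to picture the transformation is the sketch below, which replaces the individual lags of each predictor by moving averages of increasing order. The exact lag convention is an assumption here (averages of the $ p $ most recent values, $ p=1,\dots, P $ ), and the function name and toy series are hypothetical.

```python
import numpy as np


def marx(X, P):
    """For each column of X and each p = 1..P, the mean of its p most recent values.

    Output has shape (T - P + 1, K * P); row i corresponds to time t = i + P - 1,
    and columns are grouped by p (all predictors for p = 1, then p = 2, ...).
    """
    X = np.asarray(X, dtype=float)
    T, K = X.shape
    blocks = []
    for p in range(1, P + 1):
        # Average of X_t, X_{t-1}, ..., X_{t-p+1} for every admissible t.
        ma = np.stack([X[t - p + 1 : t + 1].mean(axis=0) for t in range(P - 1, T)])
        blocks.append(ma)
    return np.hstack(blocks)


X = np.arange(1.0, 9.0).reshape(-1, 1)    # one predictor with values 1, 2, ..., 8
Z_marx = marx(X, P=3)
```

Since the moving averages are an invertible linear rotation of the lags, an unpenalised linear model is unaffected; the transformation only matters once shrinkage or selection is applied to the rotated features.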
4. UK-MD: a large UK monthly MD
Large datasets have been very popular in empirical macroeconomic research since Stock and Watson (2002a, 2002b) initiated the breakthrough by providing the econometric theory and showing the benefits in terms of macroeconomic forecasting. McCracken and Ng (2016, 2020) proposed standardised versions of large monthly and quarterly U.S. datasets that are regularly updated and publicly available at the Federal Reserve Economic Data (FRED) website. Fortin-Gagnon et al. (2018) have developed the Canadian version of FRED. In this paper, we construct a similar large-scale UK macroeconomic database at monthly frequency that can be used in the same way as the U.S. and the Canadian datasets. The dataset is described in Section 4.1 and analysed in Section 4.2.
4.1. UK-MD
The dataset contains 112 macroeconomic and financial indicators divided into nine categories: labour, production, retail and services, consumer and retail price indices, producer price indices, international trade, money, credit and interest rate, stock market and finally sentiment and leading indicators. The selection of variables is inspired by McCracken and Ng (2016), Fortin-Gagnon et al. (2018) and Joseph et al. (2021). The complete list of series is available in table A7. Most of the indicators are available from the Office for National Statistics, while others are taken from the Bank of England, FRED and Yahoo Finance. The starting date varies across indicators, from 1960 to 2000. For the forecasting application in this paper, data start in 1998 M01.
Most of the series included in the database must be transformed to induce stationarity. We roughly follow McCracken and Ng (2016) and Fortin-Gagnon et al. (2018). For instance, most I(1) series are transformed into first differences of logarithms; a first difference of levels is applied to the unemployment rate and interest rates; and the first difference of logarithms is used for all price indices. Transformation codes are reported in the Appendix.
Our last concern is to balance the resulting panel since some series have missing observations. We opted to apply an EM algorithm, assuming a factor model to fill in the blanks, as in Stock and Watson (2002b) and McCracken and Ng (2016). We initialise the algorithm by replacing missing observations with their unconditional mean, starting in 1998 M01, and then proceed to estimate a factor model by principal components. The fitted values of this model are used to replace missing observations.
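The imputation steps described above can be sketched as follows. This is a minimal illustration on simulated data, assuming a fixed number of factors and a fixed number of iterations; the function name, the one-factor panel and the 10 per cent missing rate are hypothetical.

```python
import numpy as np


def em_impute(X, n_factors, n_iter=50):
    """Fill NaNs by iterating between a principal-components fit and re-imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)     # start from column means
    for _ in range(n_iter):
        mu, sd = filled.mean(axis=0), filled.std(axis=0)
        Zs = (filled - mu) / sd
        U, s, Vt = np.linalg.svd(Zs, full_matrices=False)
        common = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors]  # rank-r fitted values
        filled = np.where(missing, common * sd + mu, X)      # observed values stay untouched
    return filled


rng = np.random.default_rng(4)
T, N = 200, 10
factor = rng.standard_normal((T, 1))
loadings = rng.uniform(0.5, 1.5, size=(1, N))
X_full = factor @ loadings + 0.1 * rng.standard_normal((T, N))

X_obs = X_full.copy()
holes = rng.random((T, N)) < 0.1           # drop roughly 10% of the entries at random
X_obs[holes] = np.nan
X_imputed = em_impute(X_obs, n_factors=1)

col_means = np.broadcast_to(np.nanmean(X_obs, axis=0), X_obs.shape)
rmse_em = np.sqrt(np.mean((X_imputed[holes] - X_full[holes]) ** 2))
rmse_mean = np.sqrt(np.mean((col_means[holes] - X_full[holes]) ** 2))
```

On a panel with a strong factor structure, the imputed values track the true ones far more closely than the unconditional means used for initialisation.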
Finally, for this application, we also add 19 U.S. macroeconomic and financial aggregates as considered in Banbura et al. (2008). These series include income, production, labour market, housing, consumption and monetary indicators, as well as interest rates and prices. The complete list is available in Appendix D.
4.2. Exploring the factor structure of UK-MD
Large MDs are mainly used for forecasting and impulse response analysis through the lens of factor modelling (Bernanke et al., 2005; Kotchoni et al., 2019). Indeed, the factors provide a widely used dimension reduction method, but they also serve as an empirical representation of general equilibrium models (Boivin and Giannoni, 2006). Hence, it is important to explore the factor structure of our UK-MD dataset.
Estimating the number of factors is an empirical challenge and several statistical decision procedures have been proposed; see Mao Takongmo and Stevanovic (2015) for a review. Here, we select the number of static factors using the Bai and Ng (2002) $ {PC}_{p2} $ criterion, and we follow Hallin and Liska (2007) to test for the number of dynamic factors. The $ {PC}_{p2} $ criterion finds eight significant factors, while the number of dynamic components is estimated at four. In addition, we applied the Alessi et al. (2010) improvement of the $ {PC}_{p2} $ criterion, which in turn suggests nine factors.
After the static factors are estimated by principal components as in Stock and Watson (2002a), we report in table 2 their marginal contribution to the variance of the variables constituting UK-MD. For instance, $ {mR}_i^2(k) $ measures the incremental explanatory power of factor $ k $ for variable $ i $ , which is simply the difference between the $ {R}^2 $ obtained after regressing variable $ i $ on the first $ k $ factors and on the first $ k-1 $ factors. The overall marginal contribution of factor $ k $ is the sample average over all variables. Table 2 shows the average $ {mR}^2(k) $ for each of the nine estimated factors, lists the 10 series that load most importantly on each factor and indicates the group to which each series belongs. For example, factor 1 explains 20.7 per cent of the variation in UK-MD and is clearly a real activity factor, as the 10 most related variables are indicators of production and services. In particular, it explains 88.7 and 83.6 per cent of the variation in the index of services and the index of production in manufacturing, respectively. The second factor explains 8.4 per cent of the variation overall, and represents mainly the group of interest rates. For instance, its marginal contribution to the 12-month LIBOR is 0.532. Factor 3’s average explanatory power is 5.4 per cent and it is linked to price indices, with the highest $ {mR}_i^2(k)=0.513 $ for CPI inflation. Factors 4 and 5 are related to stock market and employment variables, respectively. The sixth factor explains 3.4 per cent of total variation and can be interpreted as the international trade factor. Factor 7 is related to unemployment and working hours indicators, with an explanatory power of 24.5 per cent for the over 12-month unemployment duration. Exchange rates are well explained by the eighth factor. Finally, the ninth component stands out as an energy factor, as it explains a sizeable fraction of the variation in production indices of the oil extraction, mining and energy sectors.
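The marginal contributions can be computed as in the following sketch, which applies the definition of $ {mR}_i^2(k) $ to a simulated panel. The function name and the data are hypothetical, and the series are standardised before extracting principal components (an assumption in line with common practice).

```python
import numpy as np


def marginal_r2(X, n_factors):
    """mR2[i, k-1] = R^2 of series i on the first k PCs minus R^2 on the first k-1."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    F = U[:, :n_factors] * s[:n_factors]         # principal-component factors
    total_ss = (Xs ** 2).sum(axis=0)
    r2_prev = np.zeros(X.shape[1])
    out = np.empty((X.shape[1], n_factors))
    for k in range(1, n_factors + 1):
        coef, *_ = np.linalg.lstsq(F[:, :k], Xs, rcond=None)
        resid = Xs - F[:, :k] @ coef
        r2_k = 1.0 - (resid ** 2).sum(axis=0) / total_ss
        out[:, k - 1] = r2_k - r2_prev
        r2_prev = r2_k
    return out


rng = np.random.default_rng(5)
common = rng.standard_normal((150, 3)) @ rng.standard_normal((3, 12))
X = common + 0.5 * rng.standard_normal((150, 12))   # simulated three-factor panel
mR2 = marginal_r2(X, n_factors=4)
average_mR2 = mR2.mean(axis=0)    # overall marginal contribution of each factor
```

Sorting each column of `mR2` in descending order gives the series that load most importantly on the corresponding factor, which is how a listing like the one in table 2 can be assembled.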
Note: This table shows the 10 series that load most importantly on the first nine factors. For example, the first factor explains 20.7 per cent of the variation in all 112 series, and it explains 88.7 per cent of variation in IOS indicator. The third column of each panel indicates the group to which the variable belongs. Group 1: labour market. Group 2: production. Group 3: retail and services. Group 4: consumer and retail price indices. Group 5: international trade. Group 6: money, credit and interest rates. Group 7: stock market. Group 8: sentiment and leading indicators. Group 9: producer price indices.
Figure 1 plots the importance of the common component with nine factors. The total $ {R}^2 $ is 0.554. The explanatory power of the common component varies across series. It explains more than 80 per cent of the variation in 20 series, mostly services, production and average weekly hours series. The nine factors are also very important for 42 variables, which have an $ {R}^2 $ between 0.5 and 0.8. There is only one series, IOP_PETRO, for which the idiosyncratic component explains over 90 per cent of the variation, and three variables for which the common component $ {R}^2 $ is less than 20 per cent. Therefore, we can conclude that the factor structure in UK-MD seems reasonable and is comparable to those in the FRED-MD and CAN-MD datasets. Interestingly, the interpretation of the first three UK-MD factors is identical to the interpretation of the first three FRED-MD components.
In figure 2, we show the number of static factors selected recursively from 2009 by the Bai and Ng (2002) $ {PC}_{p2} $ criterion (upper panel) and the corresponding $ {R}^2 $ (bottom panel). The number of significant factors increases over time. It stays at 2 between 2009 and 2015, rises to a second plateau at 4 until 2020, and then jumps to 7, 9 and 8 after the onset of the Covid-19 pandemic. The additional factors emerging during the pandemic period are likely capturing the specificities of this period.
5. Empirical setup
5.1. Variables of interest
We focus on predicting 12 representative macroeconomic indicators of the UK economy: employment (EMP), unemployment rate (UNEMP RATE), total actual weekly hours worked (HOURS), industrial production (IP PROD), index of production: manufacture of machinery and equipment (IP MACH), total retail trade (RETAIL), consumer price index (CPI), retail price index (RPI), RPI housing (RPI HOUSING), consumer credit excluding student loans (CREDIT), total sterling approvals for house purchases (HOUSE APP) and producer price index of manufacturing sector (PPI MANU).
We consider direct predictive modelling, in which the target is projected on the information set and the forecast is made directly using the most recent observables. All the variables above are assumed $ I(1) $ , so we forecast the average growth rate (Stock and Watson, 2002b)
except for UNEMP RATE, where we target the average change as in equation (6) but without logs.
5.2. Pseudo-out-of-sample experiment design
The pseudo-out-of-sample period starts in 2008 M01. The end period depends on the target variable. The labour market series (EMP, UNEMP RATE and HOURS) end in 2020 M09, while RETAIL is available up to 2020 M10. The remaining variables end in 2020 M11. The forecasting horizons considered are 1, 2 and 3 months. All models are estimated recursively with an expanding window in order to include more data and so potentially reduce the variance of the more flexible models.
The standard Diebold and Mariano (2002) (DM) test procedure is used to compare the predictive accuracy of each model against the reference autoregressive model. MSE is the most natural loss function given that all models are trained to minimise the squared loss in-sample. Hyperparameter selection is performed using the BIC for AR and ARDI, and K-fold cross-validation is used for the remaining models. This approach is theoretically justified in time series models under conditions spelled out by Bergmeir et al. (2018). Moreover, Goulet Coulombe et al. (2019) compared it with a scheme that respects the time structure of the data in the context of macroeconomic forecasting and found K-fold to perform as well as or better than this alternative scheme. All models are estimated (and their hyperparameters re-optimised) every month.
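A minimal sketch of the DM comparison under squared-error loss is given below. It assumes a rectangular window with $ h-1 $ autocovariances for the long-run variance and a standard normal reference distribution; the forecast errors are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math


def diebold_mariano(errors_model, errors_benchmark, h=1):
    """DM statistic for equal MSE; negative values favour the model over the benchmark."""
    d = [a ** 2 - b ** 2 for a, b in zip(errors_model, errors_benchmark)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # Long-run variance with a rectangular window of h - 1 autocovariances.
    # For h > 1 this estimate is not guaranteed to be positive in small samples.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(stat) / math.sqrt(2.0))))
    return stat, p_value


# Hypothetical forecast errors: the model's errors are half the benchmark's.
benchmark_errors = [1.0, -2.0, 1.5, -0.5, 2.5, -1.0, 0.5, -1.5, 2.0, -2.5]
model_errors = [0.5 * e for e in benchmark_errors]
dm_stat, dm_pvalue = diebold_mariano(model_errors, benchmark_errors, h=1)
```

With a sample as short as the Covid era, the normal approximation is rough, so such p-values are best read as indicative.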
6. Results
In this section, we present the results of the forecasting experiment, focussing first on the Covid-19 era and then on average performance over the longer evaluation sample.
6.1. Pandemic recession case study
Figure 3 looks at four selected cases and compares the behaviour of the best models among certain categories: best linear model for the Covid era, defined as the period 2020 M1–2020 M9/M11 depending on the variable, best nonlinear model for the Covid era, and best model overall for the 2008–2019 period. The exact identities of selected models in figure 3 are reported in table 3.
Though the Covid era is short and so the results should be interpreted with care, the outcome is quite interesting. Linear models have a hard time characterising the path of EMP during the Pandemic Recession. Ridge+MARX, which was marginally better than the nonlinear FA-ARRF(2,2) during the pre-Covid era, predicts an employment cataclysm that did not materialise. This is a general property of linear models for this target since the best linear forecast (other than the AR) for EMP in 2020 is the 0 forecast, that is, the RW without drift in levels. FA-ARRF(2,4), with FA-ARRF(2,2) close behind, is the best model for EMP at a horizon of 1 month. At longer horizons, RF-MARX is the best model, with a decisive advantage over both the AR and the plain RF, neither of which uses the transformations of Goulet Coulombe et al. (2020). This winning streak extends to unemployment at all horizons, another variable that responded rather mildly to the Covid shock owing to government intervention. Given RF's usual robustness (Goulet Coulombe, 2020b), those gains are almost all statistically significant.
In figure 3b, we see that the improvement at $ h=1 $ comes from responding more swiftly (and more vigorously) to the first Covid shock than AR would allow. An explanation for this well-calibrated response can be found in figure 4, which plots the underlying GTVPs for FA-ARRF(2,2). The persistence seems to be highly state-dependent, being much higher during certain episodes (including recessions). This feature is replicated out-of-sample during the Pandemic Recession, which gave FA-ARRF(2,2) an edge over the competitive plain AR. Additionally, the model incorporates an intercept that alternates between two regimes, with the negative one being associated with recessions (but not exclusively, according to pre-2008 data). The drop in the intercept is also predicted out-of-sample for the Covid period. Unsurprisingly, those switches match those of persistence. Finally, the sensitivity to the first factor (which usually characterises real activity) is initially milder during recessions for EMP. This is a salient feature for 2020, as the EMP response to the Covid shock is much milder than that of other labour/production indicators (like HOURS).
Turning to HOURS, which experienced an unprecedented rise and fall during the onset of the Pandemic Recession, it is striking to see that only MRF can beat the AR benchmark at $ h=1 $ . Indeed, the four MRFs report MSE ratios between 0.69 and 0.78 whereas those of the other nonlinear models range between 1.05 and 1.5. Things are even worse for linear models.
Figure B1 reports various variable importance (VI) measures for FA-ARRF(2,2) (the reader is referred to Goulet Coulombe, 2020a, for numerous implementation details). Universally, the VIs suggest the predominance of other labour indicators, such as measures of vacancies. Given how closely those are related to HOURS itself, and that all successful MRFs include an AR component, this suggests that HOURS may well follow a nonlinear AR process which MRF is particularly well equipped to extract. As a result, the response of MRF to the Covid shock is (as was the case for EMP) more timely than that of AR. Given how fast things were evolving in the spring of 2020, that timing provides MRF with an improvement of around 30 per cent over the benchmark.
As conjectured earlier, MRF’s capacity to extrapolate (which RF and BT both lack) proves vital for variables which exhibited (previously unseen) swings of extraordinary proportions. While NN-ARDI also has the capacity to extrapolate (and is marginally better than FA-ARRF(2,2) in the pre-Covid era), its lack of an explicit linear part is likely to blame for its spectacular incapacity to propagate the Covid shock in figure 3b. A similarly dismal predicament is observed for RIDGE-MARX, which is the best linear model for the Covid sample.
Different troubles afflict data-rich linear models for RPI HOUSING, with MSE ratios exploding well over 10. As a result, the best linear model is without question the simple autoregression. An obvious explanation for the generalised failure of linear models (and also of most data-rich ones) can be found in figure 3b. The ‘orange’ forecasts basically predict a path largely inspired by the experience of the Great Recession, that is, a joint collapse of real activity and housing prices. Since this is the sole recession in the training set, it is fair to say that most ML methods naively (yet inevitably) associate a real activity slowdown with a significant drop in RPI HOUSING. However, based on information available to the economist, but not to the sample-constrained ML algorithm, this association is more of a 2008–2009 exception than a ‘rule’.
The only models able to beat the benchmark are the MRFs equipped with small autoregressions as linear parts (ARRF(2) and ARRF(6)). So, how did they avoid the dismal fate of other ML methods and nicely capture the soft drop (and bounce back) of RPI HOUSING in 2020? First, they do not rely explicitly on linkages with other groups of variables (as FA-ARRFs would through the use of factors) but rather focus on nonlinear autoregressive dynamics. This strategy is expected to pay off whenever a shock can truly be thought of as ‘exogenous’ and we simply need a model to propagate it. This description corresponds to the onset of the Pandemic Recession but certainly not to its predecessor. Second, the model needs to separate pre-2008 dynamics from what followed. Figure 5 reports interesting transformations of ARRF(6)’s GTVPs. While persistence is rather stable at 0.75, the long-run mean is subject to a lot of variation. Some is cyclical (like the mild drops in 2008 and 2020), but the most noticeable feature is a permanent regime change after 2008. Variable importance measures in figure B2 validate this observation: much of the forest generating the time-variation uses either ‘trend’ (i.e. exogenous change) or a catalogue of indicators related to the policy rate (UK Bank Rate, U.S. Federal Funds Rate, and many MARX transformations of those), which are known to have entered uncharted territory in the aftermath of the 2008–2009 recession. Figure B3 confirms visually that the variation in the intercept of ARRF(6) gives it an edge over both AR and the best linear model (RIDGE-MARX), especially from 2011 onwards. As a result, ARRF(6) is also the best model for all horizons in the quieter period of 2011–2019 (see table A8), with improvements over the AR benchmark of 70, 54 and 54 per cent at horizons 1 to 3, respectively.
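To fix ideas on what persistence and long-run mean refer to, the sketch below computes them from a path of time-varying AR coefficients. It assumes the usual definitions for an AR(p) with intercept: persistence is the sum of the autoregressive coefficients, and the long-run mean is the intercept divided by one minus that sum. Whether figure 5 uses exactly these transformations is an assumption, and the function name is hypothetical.

```python
import numpy as np

def persistence_and_long_run_mean(intercepts, ar_coefs):
    """Map time-varying AR(p) parameters into persistence and long-run mean.

    intercepts : array of shape (T,), the intercept at each date
    ar_coefs   : array of shape (T, p), the AR coefficients at each date
    """
    intercepts = np.asarray(intercepts, dtype=float)
    ar_coefs = np.asarray(ar_coefs, dtype=float)
    persistence = ar_coefs.sum(axis=1)            # sum of AR coefficients per date
    if np.any(np.isclose(persistence, 1.0)):
        # The long-run mean is undefined when the process has a unit root.
        raise ValueError("persistence equal to 1: long-run mean undefined")
    long_run_mean = intercepts / (1.0 - persistence)
    return persistence, long_run_mean
```

Under these definitions, a stable persistence of 0.75 means that any movement in the long-run mean is four times the movement in the intercept.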
The last quadrant of figure 3a shows that, for PPI MANU, a model that does marginally worse most of the time can generate substantial gains during the Covid period. Such is the case for RF-MARX, whose performance is similar to that of the best linear model for most samples (and the best overall pre-Covid). Figure 3b makes clear that this edge during the Pandemic arises because (i) RF-MARX goes almost as deep as linear models during the spring and yet (ii) does not call for a large decrease in September and October (unlike linear models, and akin to AR’s prediction). Since RF-MARX does better than plain RF by 36 per cent and boosting-MARX better than plain boosting by 12 per cent, it is natural to investigate the VI measures of those models to uncover which particular MARX transformations RF is so fond of. In figure 6, we see that both plain boosting and RF rely strongly on the most recent values of oil prices, PPI oil and PPI MANU itself, which comes as no surprise. Interestingly, the other lags of oil prices are generally absent from the top 20. The MARX versions consider a slightly less focussed set of predictors composed of various moving averages of oil prices. In both the RF and boosting cases, the most important feature is the average change in oil prices over the last 6 months. Thus, the MARX versions avoid calling for another decrease of PPI MANU by relying less on individual monthly oil indicators, which are subject to large swings, and more on temporal averages that smooth out the noise inevitably present in the oil price trajectory. Moreover, by the very design of the manufacturing production chain, increases/decreases over several months are more likely to be transmitted into prices than notoriously volatile month-to-month variations.
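The kind of feature discussed above (an average of the most recent monthly changes) can be sketched as follows. The sketch assumes that a MARX feature of order j is the simple average of the j most recent values of a predictor; the exact construction is given in Goulet Coulombe et al. (2020), and the function name `marx_features` is hypothetical.

```python
import numpy as np

def marx_features(x, max_order):
    """Moving averages of the most recent values of a single predictor.

    Column j-1 holds, at date t, the average of x_t, x_{t-1}, ..., x_{t-j+1}.
    Entries without enough history are left as NaN.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    out = np.full((n, max_order), np.nan)
    csum = np.concatenate(([0.0], np.cumsum(x)))   # csum[i] = x[0] + ... + x[i-1]
    for j in range(1, max_order + 1):
        # Rolling sum over j observations, divided by j, from date j-1 onwards.
        out[j - 1:, j - 1] = (csum[j:] - csum[:-j]) / j
    return out
```

With `x` set to the monthly change in oil prices and `max_order=6`, the last column would correspond to a 6-month average change, which is far less sensitive to a single large monthly swing than the most recent change alone.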
6.2. Quiet(er) times
It has been repeatedly reported that the benefits of a large panel of predictors may be present only during periods of economic turmoil (Kotchoni et al., 2019; Siliverstovs and Wochner, 2019). For this reason and others (Lerch et al., 2017), it is of interest to study the marginal benefits associated with data-rich models outside of the tumultuous entry/exit of the Great Recession and the Pandemic Recession. Moreover, starting the pseudo-out-of-sample period in 2011 gives data-rich models at least one recession to be trained on, and 13 years of data overall rather than 10 (as was the case in table A1).
Ridge and Ridge-MARX do well for EMP and HOURS, with gains roughly distributed between 10 and 20 per cent depending on the horizon. The MARX version usually has the upper hand by a small margin. The evidence for other activity indicators is more mixed. For HOURS, only nonlinear models manage to beat the AR benchmark, albeit not in a statistically significant fashion. The best model for IP PROD at all horizons is ARRF(2), which improves upon the AR by small margins. For IP MACH, some small gains can be obtained at a horizon of 3 months (with FA-ARRF(2,2), most notably) but none of those are statistically significant.
In line with traditional wisdom for the United States (Stock and Watson, 2008), it is hard to beat the simple benchmark when it comes to CPI inflation. Nevertheless, ARRF(6) is the best model for all horizons (ex aequo at $ h=1 $ ) with gains of 9–10 per cent, although none of those are significant. Larger improvements are obtained for RPI, where various data-rich models (linear and nonlinear) provide gains of around 20 per cent. The most notable are those of FA-ARRFs at a horizon of 3 months (but also at any other horizon), which are nearly 30 per cent, far ahead of most of the competing models, including all those that also rely directly on factors. Finally, as a last notable observation from table A8, ARRF(6) dominates at all horizons for both RPI HOUSING and CREDIT, highlighting the benefits of a more focussed modelling of persistence (while allowing for its time variation) in otherwise high-dimensional/data-rich/nonlinear ML methods.
7. Conclusion
In this paper, we assess the forecasting performance of a variety of standard and ML forecasting methods for key UK economic variables, with a special focus on the Covid-19 period and using a specifically collected large dataset of monthly indicators, labelled UK-MD (also augmented with some international indicators).
As standard benchmarks, we consider AR, random walk and factor augmented AR models. As ML methods, we evaluate penalised regressions (RIDGE, LASSO, ELASTIC NET), BT and RF, KRR, and NN, plus MRF, which uses a linear part within the leaves, and MARX, a feature engineering technique which generates an implicit shrinkage more appropriate for time series data.
Overall, ML methods can provide substantial gains when short-term forecasting several indicators of the UK economy, though a careful temporal and variable-by-variable analysis is needed. Over the full sample, RF works particularly well for labour market variables, in particular when augmented with MARX; KRR for real activity and consumer price inflation; LASSO or LASSO + MARX for the retail price index and its version focusing on housing; and RF for credit variables. The gains can be sizable, even 40–50 per cent with respect to the benchmark, and ML methods were particularly useful during the Covid-19 period. During the Covid era, nonlinear methods with the ability to extrapolate have a clear edge. Certain MRFs, unlike linear methods or simpler nonlinear ML techniques, deliver important improvements by predicting the large ‘bounce back’ that did occur and by avoiding predictions of mayhem that did not materialise.
Acknowledgements
We thank the Editor Ana Galvao, an anonymous referee, and Hugo Couture who provided excellent research assistance. The third author acknowledges financial support from the Chaire en macroéconomie et prévisions ESG UQAM.
Appendix A: Detailed Forecasting Results
Notes: The numbers represent the relative MSEs with respect to AR, BIC model. ***, **, * stand for 1, 5 and 10 per cent significance of Diebold-Mariano test. Bold indicates lowest value in each column.
Notes: See table A1.
Appendix B: Additional Graphs
Appendix C: UK Large MD
When available, the series have been retrieved already seasonally adjusted. However, the price indices (CPI, RPI and PPI) were not; after conducting the Kruskal and Wallis (1952) test for seasonal behaviour, these have been seasonally adjusted using the X-13ARIMA-SEATS software developed by the United States Census Bureau. The transformation codes are: 1 = no transformation; 2 = first difference; 4 = logarithm; 5 = first difference of logarithm.
Appendix D: U.S. Data
The additional transformation codes are: 6 = second difference of logs; 7 = $ \Delta \left({x}_t/{x}_{t-1}-1\right) $ .
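The transformation codes listed in Appendices C and D can be applied with a short helper such as the one below. This is an illustrative sketch: the function name `transform` is hypothetical, and code 7 is read as the first difference of the simple growth rate, following the FRED-MD convention. Observations lost to differencing are returned as NaN so that the output keeps the length of the input.

```python
import numpy as np

def transform(x, code):
    """Apply a transformation code (1, 2, 4, 5, 6 or 7) to a series x."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    if code == 1:                       # no transformation
        out = x.copy()
    elif code == 2:                     # first difference
        out[1:] = np.diff(x)
    elif code == 4:                     # logarithm
        out = np.log(x)
    elif code == 5:                     # first difference of logarithm
        out[1:] = np.diff(np.log(x))
    elif code == 6:                     # second difference of logarithm
        out[2:] = np.diff(np.log(x), n=2)
    elif code == 7:                     # first difference of x_t / x_{t-1} - 1
        growth = x[1:] / x[:-1] - 1.0
        out[2:] = np.diff(growth)
    else:
        raise ValueError(f"unsupported transformation code: {code}")
    return out
```

For example, `transform(series, 5)` returns the monthly log growth rate, with a NaN in the first position.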