‘All the business of war, indeed all the business of life, is endeavour to find out what you don’t know by what you do; that’s what I called “guess what was the other side of the hill”’.
Duke of Wellington
1. Introduction
Forecasts play a central role in decision-making under uncertainty. Good forecasts are those that lead to good decisions, in the sense that the expected payoff to the decision-maker using the forecast is greater than it would be otherwise.Footnote 1 In the case of inflation forecasts, which we consider below, the Bank of England makes forecasts to help it set monetary policy to keep inflation within a target range.Footnote 2 The payoff is the variation of inflation around the target. However, it is not clear how one would quantify the contribution of the forecast to the payoff in terms of a specific central bank loss function.
Since forecasts are designed to inform decisions, they are inherently linked to policy making. However, there is an issue as to whether one should use the same model for both forecasting and setting the policy instruments. Different questions require different types of models to answer them. A policy model might be quite large, while a forecasting model might be quite small. There is also an issue of how transparent the model should be. It may be difficult to interpret why a machine learning statistical model makes the predictions it does, and this can be a major disadvantage when a policy requires communication of a persuasive narrative.
In recent years, forecasting has been influenced by the increasing availability of high-dimensional data, improvements in computational power and advances in econometrics and machine learning techniques. In some areas, such as meteorology, this has resulted in improved forecasts, increasing the number of hours ahead for which accurate predictions can be made. The improved forecasts lead to better decision-making as people change their behaviour in response to the predictions; the effect of such responses on mortality from heat and cold is examined in Shrader et al. (2023). Despite advances in data, computation and technique, the improvement in the accuracy of weather forecasts has not been matched by economic forecasts. This is a cause for concern since, as emphasised in the classic analysis of Whittle (1983), prediction and control are inherently linked, and decisions over elements of economic management such as monetary policy depend on a view of the future.
Macroeconomic forecasting is challenging because lags in responses to policies or shocks are long and variable, and the economic system adapts, with events prompting changes in the structure of the economy. Forecasting tends to be relatively successful during normal times; however, in times of crisis and change, in the face of large shocks or structural breaks, when accurate predictions are most needed, forecasters tend to fail. For instance, UK inflation was 5.4% in December 2021. This was the latest figure available when, in February 2022, the Bank of England forecast that inflation would peak at 7.1% in 2022q2, fall to 5.5% in 2023q1 and be back close to target at 2.6% in 2024q1. The 2023q1 outturn was 10.2%, almost 5 percentage points higher than the forecast. This burst in inflation was a global phenomenon, and other central banks made similar errors.
Economic forecasters may use purely statistical models or more structural economic models that include the policy variables and important economic linkages. We will call these more structural models ‘policy models’, given that ‘structural’ has a number of interpretations. The statistical models will typically be conditional on information available at the time of the forecast, which may be inaccurate: knowing where one is at the time of the forecast, nowcasting, is an important element. Policy models may also be conditional on assumed future values. The Bank of England makes forecasts conditional on market expectations of future interest rates, assumptions about future energy prices and government announcements about future fiscal policy, as well as other measures. Wrong assumptions about those future values can cause problems, as they did for the Bank in August 2022, when it anticipated a rise in energy prices but not the government response to them, and so overestimated inflation. There is also a policy issue as to whether fiscal and monetary policy should be determined independently by different institutions or jointly.
If both statistical and policy models are used, there is an issue as to how to integrate them. Forecast averaging has been widely shown to improve forecast performance, but forming averages over many variables may lack coherence and consistency. In September 2018, the Bank of England’s Independent Evaluation Office (IEO) reported back to the Court of the Bank on the implementation and impact of the 2015 IEO review of forecasting performance. The IEO reported that some ‘non-structural’ models had been introduced as a source of challenge, and outputs were routinely shown to the Monetary Policy Committee (MPC) as a way of cross-checking the main forecast. While some but not all members found them helpful, there was no desire to develop more models of this sort, and internally they had not been integrated into the forecast process as a source of challenge.
Forecasts by central banks fulfil multiple purposes, including as a means of communication to influence expectations in the wider economy. This also makes it difficult to choose a loss function to evaluate forecasts. For instance, the Bank of England forecasts for inflation at a 2-year horizon are always close to the target of 2 per cent. Even were the Bank to think it unlikely that it could get back to target within 2 years, it might feel that its credibility would be damaged were it to admit that. An institutional issue is who ‘owns’ the forecast. The Bank of England forecast is the responsibility of the nine-member MPC; other central banks have different systems. For instance, the US Federal Reserve has a staff forecast that is not necessarily endorsed by the decision-makers.
There is an issue about the optimal amount of information to use, both with respect to breadth, how many variables, and length, how long a run of data. With respect to breadth, in principle one should use information on as many variables as possible, and not just for the country being forecast, since in a networked world foreign variables contain information. This is the information that is used in the global vector autoregression (GVAR), whose use is surveyed in Chudik and Pesaran (2016). While the use of many variables might imply a large model, in practice quite parsimonious small models tend to be difficult to beat in forecasting competitions.
With respect to length, in a May 2023 hearing, the Chair of the House of Commons Treasury Select Committee asked the chief economist of the Bank of England, Huw Pill: ‘Are you saying that, despite the Bank of England having been in existence for over 300 years, you look at only the last 30 years when you think about what the risks are to inflation?’. Pill emphasised the importance of the policy regime, which had been different during the past 30 years of inflation targeting from the earlier high-inflation periods. The 30 years up to 2019 had also been different in terms of the absence of large real shocks, like Covid-19 and the effects of the Russian invasion of Ukraine.
Whether it is statistical or policy, the model will typically be supplemented by a judgemental input, justified by the argument that the forecaster has a larger information set than the model. In evidence to the Treasury Select Committee in September 2023, Sir Jon Cunliffe said: ‘We start with the model. All models are caricatures of real life. There is a suite of models; that is the starting point. However, the MPC itself puts judgements that change the model, and we have made some quite big judgements in the past about inflation persistence and the like. Finally, when we have the best collective view of the committee, which is our judgement on top of the model, the model keeps us honest. It ensures that there is a general equilibrium and we cannot just move things around.’
In short, macroeconomic forecasting faces important challenges, and much depends on how forecasts are announced and used in the decision-making process. To deal with a constantly changing economic environment, forecasts must continually adapt to new data sets, statistical techniques and theory-based economic insights, knowing that there are still key variables that might have been left out, whether due to difficulties in measurement, oversight or ignorance. Forecasters must answer a range of difficult questions. What sample periods and potential variables should be considered? How should one decide which variables to use for forecasting, and should the same sample periods be used for variable selection and for forecasting? Should one use ensemble forecasting, averaging forecasts obtained either from different models or from the same model estimated over different sample sizes or with different degrees of down-weighting? One can only be humbled by the sheer extent of the uncertainty that these choices entail. It is within this wider context that this article tries to formalise some elements of the problem of forecasting with high-dimensional data and illustrates the various issues involved with an application to forecasting UK inflation.
The rest of this article is organised as follows. Section 2 sets out the high-dimensional forecasting framework we will be considering. Section 3 considers ‘known knowns’, selecting relevant variables from a known active set. Section 4 considers ‘known unknowns’ where there are known to be unobserved latent variables. Section 5 presents the empirical application on forecasting UK inflation. Section 6 contains some concluding comments.
2. The high-dimensional forecasting problem
Suppose the aim is to forecast a scalar target variable, denoted by $ {y}_{T+h}, $ at time $ T $ , for the future dates, $ T+h $ , $ h=1,2,\dots, H $ . Given the historical observations, the optimal forecast of the target variable, $ {y}_{T+h}, $ depends on how the forecasts are used, namely the underlying decision problem. In practice, specifying loss functions associated with decision problems is difficult; hence, the tendency is to fall back on mean squared error loss. Under this loss function, the optimal forecasts are given by conditional expectations, $ E\left({y}_{T+h}\left|{\mathcal{I}}_T\right.\right) $ , where $ {\mathcal{I}}_T $ is the set of available information, and expectations are formed with respect to the joint probability distribution of the target variable and the set of potential predictors under consideration. But when the number of potential predictors, say $ K $ , is large, even this result is too general to be of much use in practice.
The high-dimensional nature of the forecasting problem also presents a challenge of its own when we come to multi-step ahead forecasting, where forecasts of the target variable are required for different horizons, $ h=1,2,\dots, H $ . Many decision problems require forecasts many periods ahead: months, years and even decades. Monetary policy is often conducted over the business cycle, at least 2–3 years ahead of the policy formulation. Climate change policy requires forecasts over many decades ahead. In interpreting Pharaoh’s dreams, Joseph considered a two-period decision problem whereby 7 years of plenty are predicted to be followed by 7 years of drought. Multi-horizon forecasting is relatively straightforward when the number of potential predictors is small and a complete system of equations, such as a VAR, can be used to generate forecasts for different horizons from the same forecasting model in an iterative manner. Such an iterated approach is not feasible, and might not even be desirable, when the number of potential predictors is too large, since forecasts of the future values of the predictors are also needed to generate forecasts of $ {y}_{T+h} $ for $ h\ge 2 $ . This is why in high-dimensional set-ups multi-period ahead forecasts are typically formed using different models for different horizons. This is known as the direct approach and avoids the need for forward iteration by directly regressing the target variable $ {y}_{t+h} $ on the predictors at time $ t $ , thus possibly ending up with different models and/or estimates for each $ h $ .Footnote 3
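To make the distinction concrete, the following is a minimal sketch, in Python, of the iterated VAR approach versus direct h-step regressions on simulated data; the system, dimensions and all names are illustrative and are not taken from the article.

```python
# Sketch: iterated VAR(1) forecasts versus direct h-step regressions.
# Illustrative only; dimensions and coefficients are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
T, n, H = 200, 3, 4
# Simulate a stable VAR(1): w_t = Phi w_{t-1} + u_t
Phi = np.array([[0.5, 0.1, 0.0],
                [0.0, 0.4, 0.2],
                [0.1, 0.0, 0.3]])
w = np.zeros((T, n))
for t in range(1, T):
    w[t] = Phi @ w[t - 1] + rng.normal(scale=0.5, size=n)

# Iterated approach: estimate Phi once, then forecast E(w_{T+h}|I_T) = Phi^h w_T.
X, Y = w[:-1], w[1:]
Phi_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T          # n x n coefficient matrix
iterated = [np.linalg.matrix_power(Phi_hat, h) @ w[-1] for h in range(1, H + 1)]

# Direct approach: for each h, regress y_{t+h} (first element of w) on w_t.
direct = []
for h in range(1, H + 1):
    Xh, yh = w[:-h], w[h:, 0]
    phi_h = np.linalg.lstsq(Xh, yh, rcond=None)[0]        # different coefficients per h
    direct.append(phi_h @ w[-1])

print("iterated y forecasts:", [f[0] for f in iterated])
print("direct   y forecasts:", direct)
```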
To be more specific, ignoring intercepts and factors which we introduce below, suppose $ {y}_t $ is the first element of the high-dimensional vector $ {\boldsymbol{w}}_t $ , assumed to follow the first-order VAR model,

$ {\boldsymbol{w}}_t=\boldsymbol{\Phi} {\boldsymbol{w}}_{t-1}+{\mathbf{u}}_t. $ (1)
Higher order VARs can be written as first-order VARs using the companion form. The error vector, $ {\mathbf{u}}_t $ , satisfies the orthogonality condition $ E\left({\mathbf{u}}_t\left|{\mathcal{I}}_{t-1}\right.\right)=\mathbf{0} $ , where $ {\mathcal{I}}_{t-1}=\left({\boldsymbol{w}}_{t-1},{\boldsymbol{w}}_{t-2},\dots \right) $ . Then
where except for $ h=1 $ , the overlapping observations cause the error in (2) to have the moving average structure of order $ h-1 $ :
Under the VAR specification, $ E\left({\mathbf{u}}_{h,T+h}\left|{\mathcal{I}}_T\right.\right)=0 $ for $ h=1,2,\dots $ , and the optimal (in the mean squared error sense) h-step ahead forecast of $ {\boldsymbol{w}}_{T+h} $ is $ E\left({\boldsymbol{w}}_{T+h}\left|{\mathcal{I}}_T\right.\right)={\boldsymbol{\Phi}}^h{\boldsymbol{w}}_T $ . But given that in most forecasting applications the dimension of $ {\boldsymbol{w}}_t $ is large, it is not feasible to estimate $ \boldsymbol{\Phi} $ directly without imposing strong sparsity restrictions. Instead, we take the target variable, $ {y}_{T+h} $ , to be the first element of $ {\boldsymbol{w}}_{T+h} $ and consider the direct regression
where $ {\boldsymbol{\phi}}_h^{\mathrm{\prime}} $ is the first row of $ {\boldsymbol{\Phi}}^h, $ and $ {u}_{h,t+h} $ is the first element of $ {\mathbf{u}}_{h,t+h} $ . We still face a high-dimensional problem since there are a large number of potential covariates in $ {\boldsymbol{w}}_t $ . We consider the implementation of the direct approach under two scenarios concerning the potential predictors. The first arises when it is known that the target variable $ {y}_{t+h} $ is a sparse linear function of a large set of observed variables $ {\boldsymbol{x}}_t $ (a subset of $ {\boldsymbol{w}}_t $ ) known as the ‘active set’. The model is sparse in the sense that $ {y}_{t+h} $ depends on a small number of covariates that are known to be a subset of the much larger active set. The machine learning literature focuses on this case, which we refer to as the case of ‘known knowns’. The second arises when $ {y}_{t+h} $ could also depend on a few latent (unobserved) factors, $ {\mathbf{f}}_t $ , not directly included in the active set, which we call the case of ‘known unknowns’.
Specifically, we suppose that for each $ h $ , $ {y}_{t+h} $ can be approximated by the following linear model, where the predictors are also elements of $ {\boldsymbol{w}}_t $ in the high-dimensional VAR (1),
for $ t=1,2,\dots, T-h $ , where $ {c}_h $ is the intercept and $ {\mathbf{z}}_t $ is a vector of a small number, $ p $ , of preselected covariates included across all horizons $ h $ . Obvious examples include lagged values of the target variable $ \left({y}_t,{y}_{t-1},\dots \right) $ . Other variables can also be included in $ {\mathbf{z}}_t $ on the basis of a priori theory or strong beliefs. The third component of the model for $ {y}_{t+h} $ specifies the subset of variables in the active set $ {\boldsymbol{x}}_{Kt}={\left({x}_{1t},{x}_{2t},\dots, {x}_{Kt}\right)}^{\prime } $ . $ I\left(j\in DGP\right) $ is an indicator variable that takes the value of unity if $ {x}_{jt} $ is included in the data generating process (DGP) for $ {y}_{t+h} $ and zero otherwise. It is only if $ I\left(j\in DGP\right)=1 $ that $ {\beta}_{jh} $ will be identified. We discuss ways to determine the selection indicator $ I\left(j\in DGP\right) $ below. The number of variables included in the DGP is given by $ k={\sum}_{j=1}^KI\left(j\in DGP\right) $ , which is assumed to be small and fixed as $ T $ (and possibly $ K $ ) becomes large. This assumption imposes sparsity on the relationship between the target and the variables in the active set. In addition, we allow for a small number of latent factors, $ {\mathbf{f}}_t $ , that represent other variables influencing $ {y}_{t+h} $ that are not observed directly, but known to be present—the known unknowns.
Giannone et al. (2021) contrast sparse methods, which select a few variables from the active set as predictors, such as Lasso and One Covariate at a time Multiple Testing (OCMT) discussed below, with dense methods, which retain all the variables in the active set but attach small weights to many of them, such as principal components (PCs), ridge regression and other shrinkage techniques. Rather than having to choose between sparse and dense predictors, we consider approaches that combine the two. We apply sparse selection methods to the variables in the active set, and use dense shrinkage methods to approximate $ {\mathbf{f}}_t $ from a wider set of variables with $ {\boldsymbol{x}}_t $ included as a subset. We first consider the selection problem, the known knowns, where we know the active set of potential covariates, and then consider the known unknowns, where there are unobserved factors. The elastic net regression of Zou and Hastie (2005) discussed below also combines sparse and dense techniques.
Throughout, we shall assume that the errors, $ {u}_{h,t+h} $ , in (3) satisfy the orthogonality condition $ E\left({u}_{h,t+h}\left|{\mathbf{z}}_t,{\boldsymbol{x}}_t,{\mathbf{f}}_t\right.\right)=0 $ , for $ h\ge 1 $ . In the context of the high-dimensional VAR model discussed above, this orthogonality condition holds so long as the underlying errors, $ {\mathbf{u}}_t $ , are serially uncorrelated. This is so despite the fact that, due to the use of overlapping observations, $ {u}_{h,t+h} $ will be serially correlated when $ h>1 $ . This is an important consideration when high-dimensional techniques are applied to select predictors for multi-step ahead forecasting; an issue to which we will return.
3. Known knowns
In the case of known knowns, forecasts are obtained assuming that $ {y}_{t+h} $ is a linear function of $ {\boldsymbol{x}}_t $
subject to some penalty condition on $ \left\{{\beta}_{hj}\right\} $ . Some of the covariates, $ {x}_{jt} $ , could be transformations of other covariates, such as interaction terms. It is assumed that the model is correctly specified, in the sense that, apart from $ {\boldsymbol{z}}_t $ , the variables that drive $ {y}_{t+h} $ are all included in the active set, $ {\mathbf{x}}_{Kt} $ .
Penalised regressions estimate $ {\beta}_h $ by solving the following optimisation problem:
where $ {\beta}_h={\left({\beta}_{h1},{\beta}_{h2},\dots, {\beta}_{hK}\right)}^{\prime } $ , for given values of the ‘tuning’ parameters $ {\lambda}_{hT} $ and $ \alpha $ . When $ \alpha =1 $ , we have ridge regression. When $ \alpha =0 $ and $ {\lambda}_{hT}\ne 0 $ , we have the Lasso regression, which is better suited for variable selection. When $ {\lambda}_{hT}\ne 0 $ and $ 0<\alpha <1 $ , we have the Zou and Hastie (2005) elastic net regression, which mixes the sparse and dense approaches.
Many standard forecasting techniques result from a particular choice of the penalty function. Shrinkage estimators, such as ridge or some Bayesian forecasts, can be derived using the $ {\mathrm{\ell}}_2 $ norm, $ {\sum}_{i=1}^K{\beta}_{hi}^2<{C}_h<\infty $ . Lasso (least absolute shrinkage and selection operator) follows when the $ {\mathrm{\ell}}_1 $ norm is used, $ {\sum}_{i=1}^K\left|{\beta}_{hi}\right|<{C}_h<\infty $ . The difference is shown in Figure 1 below, taken from Tibshirani (1996), where the $ {\mathrm{\ell}}_1 $ norm yields corner solutions with many of the coefficients, $ {\beta}_{hj} $ , estimated to be zero. In contrast, the use of the $ {\mathrm{\ell}}_2 $ norm yields non-zero estimates for all the coefficients, with many very close to zero.
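As an illustration of how the choice of penalty affects selection, the sketch below fits ridge, Lasso and elastic net to the same simulated sparse DGP using scikit-learn. Note that scikit-learn's `l1_ratio` convention (1 for Lasso, 0 for ridge) need not match the $ \alpha $ convention used in the text; the data and tuning values are illustrative.

```python
# Sketch: ridge (l2), Lasso (l1) and elastic-net penalties applied to the same data.
# Illustrative only; in sklearn's parameterisation l1_ratio=1 gives Lasso and
# l1_ratio=0 gives ridge, which need not match the alpha convention in the text.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
T, K = 150, 40
X = rng.normal(size=(T, K))
beta = np.zeros(K); beta[:3] = [1.0, -0.8, 0.5]           # sparse DGP: 3 signals
y = X @ beta + rng.normal(scale=0.5, size=T)

for name, est in [("ridge", Ridge(alpha=1.0)),
                  ("lasso", Lasso(alpha=0.1)),
                  ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    est.fit(X, y)
    print(name, "non-zero coefficients:", np.sum(np.abs(est.coef_) > 1e-8))
```

Ridge returns non-zero estimates for all $ K $ coefficients, while the penalties with an $ {\mathrm{\ell}}_1 $ component set many of them exactly to zero.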
There are also a large number of variants of Lasso, including adaptive Lasso, group Lasso, double Lasso, fused Lasso and prior Lasso. We will focus on Lasso itself, which we use in our empirical application and which we compare to OCMT, an alternative procedure based on inference rather than penalisation.
3.1. Lasso
As noted above, there are many variations on Lasso, such as adaptive Lasso, group Lasso, fused Lasso and prior Lasso, but in this article we focus on Lasso itself. Lasso estimates $ {\beta}_h $ by solving the following optimisation problem:
where $ {\beta}_h={\left({\beta}_{h1},{\beta}_{h2},\dots, {\beta}_{hK}\right)}^{\prime } $ and $ {\boldsymbol{x}}_{Kt}={\left({x}_{1t},{x}_{2t},\dots, {x}_{Kt}\right)}^{\prime } $ , for a given choice of the ‘tuning’ parameter, $ {\lambda}_{hT} $ . The variable selection consistency of Lasso has been investigated by Meinshausen and Bühlmann (2006), Zhao and Yu (2006) and, more recently, Lahiri (2021). The key condition is the so-called ‘Irrepresentable Condition’ (IRC), which places restrictions on the magnitudes of the correlations between the signals ( $ {\mathbf{X}}_{1h} $ , standardised) and the rest of the covariates ( $ {\mathbf{X}}_{2h} $ , standardised), taken as given (deterministic). The IRC is:
where $ {\beta}_h^0={\left({\beta}_{1h}^0,{\beta}_{2h}^0,\dots, {\beta}_{k_hh}^0\right)}^{\prime } $ denotes the vector of true signal coefficients.Footnote 4 The IRC is met for pure noise variables, but need not hold for proxy variables, namely noise variables that are correlated with the true signals.
To appreciate the significance of the IRC, suppose the DGP contains $ {x}_{1t} $ and $ {x}_{2t} $ and the rest of the covariates in the active set are $ {x}_{3t},{x}_{4t},\dots, {x}_{Kt}. $ Denote the sample correlation coefficient between $ {x}_{1t} $ and $ {x}_{2t} $ by $ \hat{\rho} $ $ \left({\hat{\rho}}^2<1\right) $ and the sample correlation coefficients of $ {x}_{1t} $ and $ {x}_{2t} $ with the rest of the covariates in the active set by $ {\hat{\rho}}_{1s} $ and $ {\hat{\rho}}_{2s} $ , for $ s=3,4,\dots, K. $ Then, dropping the subscript $ h, $ the IRC for the $ {s}^{th} $ covariate is given by
which yields
for $ s=3,4,\dots, K $ . In this example, there are two cases to consider: A: $ \mathit{\operatorname{sign}}\left({\beta}_{02}\right)=\mathit{\operatorname{sign}}\left({\beta}_{01}\right); $ and B: $ \mathit{\operatorname{sign}}\left({\beta}_{02}\right)=-\mathit{\operatorname{sign}}\left({\beta}_{01}\right) $ . For case A, the IRC requires $ {\sup}_s\left|{\hat{\rho}}_{1s}+{\hat{\rho}}_{2s}\right|<1+\hat{\rho} $ , and for case B it requires $ {\sup}_s\left|{\hat{\rho}}_{1s}-{\hat{\rho}}_{2s}\right|<1-\hat{\rho} $ . Since the signs of the coefficients are unknown, for all possible values of $ \hat{\rho} $ , $ {\hat{\rho}}_{1s} $ and $ {\hat{\rho}}_{2s} $ , we can ensure the IRC is met if $ \left|\hat{\rho}\right|+{\sup}_s\left|{\hat{\rho}}_{1s}\right|+{\sup}_s\left|{\hat{\rho}}_{2s}\right|<1 $ . This example shows the importance of the correlations between the true covariates in the DGP, as well as between the true covariates and the other members of the active set that do not belong to the DGP. The IRC is quite a stringent condition, and it is not only when the active set contains proxies that are highly correlated with the true covariates that Lasso will tend to choose too many variables. In practice, one cannot check the IRC since one does not know which variables are the true signals.
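The sufficient condition at the end of this example is easy to evaluate on data. The sketch below computes the sample correlations and checks $ \left|\hat{\rho}\right|+{\sup}_s\left|{\hat{\rho}}_{1s}\right|+{\sup}_s\left|{\hat{\rho}}_{2s}\right|<1 $ , assuming for illustration that the first two columns are the true signals, which, as just noted, would not be known in practice.

```python
# Sketch: checking the sufficient condition for the IRC in the two-signal example,
# |rho| + sup_s|rho_1s| + sup_s|rho_2s| < 1, using sample correlations.
# Illustrative only; in practice the identity of the signals is unknown.
import numpy as np

rng = np.random.default_rng(2)
T, K = 200, 10
X = rng.normal(size=(T, K))
X[:, 2] = 0.6 * X[:, 0] + 0.8 * rng.normal(size=T)        # x_3 is a proxy for signal x_1

R = np.corrcoef(X, rowvar=False)
rho = R[0, 1]                                             # correlation between the two signals
rho_1s = np.abs(R[0, 2:])                                 # correlations of x_1 with x_3,...,x_K
rho_2s = np.abs(R[1, 2:])                                 # correlations of x_2 with x_3,...,x_K

lhs = abs(rho) + rho_1s.max() + rho_2s.max()
print(f"|rho| + sup|rho_1s| + sup|rho_2s| = {lhs:.2f}",
      "(IRC guaranteed)" if lhs < 1 else "(IRC may fail)")
```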
In addition to the IRC, it is also required that
The penalty condition, which follows from MinC, says that the penalty has to rise with $ T $ , but not too fast and not too slowly. The expansion rate of $ {\lambda}_{hT} $ depends on the magnitude and the sign of $ {\beta}_{jh}^0 $ , and on the correlations of the signals with the proxy variables. Lahiri (2021) shows that the penalty condition can be relaxed to $ {\lim}_{T\to \infty }{T}^{-1}{\lambda}_{hT}<\lim {\operatorname{inf}}_{T\to \infty }{d}_{hT} $ , where
The above conditions do not restrict the choice of $ {\lambda}_{hT} $ very much, hence the recourse to cross-validation (CV) to determine it. In practice, $ {\lambda}_{hT} $ is calibrated using M-fold CV techniques. The observations, $ t=1,2,\dots, T $ , are partitioned into $ M $ disjoint subsets (folds) of size approximately $ m=T/M $ . Then $ M-1 $ subsets are used for training and one for evaluation. This is repeated with each fold being used in turn for evaluation. $ M $ is typically set to $ 5 $ or $ 10 $ . CV methods are often justified in the machine learning literature under strong assumptions, such as independence and parameter stability across the sub-samples used in CV. These assumptions are rarely met in the case of economic time series data, an issue that is discussed further in the context of the empirical example in Section 5.
3.2. One Covariate at a time Multiple Testing
The need for CV is avoided in the procedure proposed by Chudik et al. (2018, CKP). This is the OCMT procedure, where covariates are selected one at a time, using the t-statistic for testing the significance of each of the variables in the active set individually.Footnote 5 Ideas from the multiple testing literature are used to control the false discovery rate and ensure that the selected covariates encompass the true covariates (signals) with probability tending to unity, under certain regularity conditions. Like Lasso, OCMT has no difficulty in dealing with (pure) noise variables, and is very effective at eliminating them. Also like Lasso, it requires some $ \mathit{\min} $ condition such as $ \left|{\beta}_{jh}^0\right|>\hskip-0.5em >\sqrt{\frac{k\log (K)}{T}} $ , for $ j=1,2,\dots, k $ .Footnote 6 But because it considers a single variable at a time, OCMT does not require the IRC to hold and is not affected by the correlation between the members of the DGP in the way that Lasso is. Instead, it requires the number of proxy signals, say $ {k}_T^{\ast } $ , to rise no faster than $ \sqrt{T} $ . Chudik et al. (2023), discussed below, is primarily concerned with parameter instability; however, Section 4 of that paper has a detailed comparison of the assumptions required for Lasso and OCMT under parameter stability.
OCMT’s condition on $ {k}_T^{\ast } $ has recently been relaxed by Sharifvaghefi (2023), who allows $ {k}_T^{\ast}\to \infty $ with $ T $ . He considers the following DGP:
where, as before, $ {\boldsymbol{z}}_t $ is a known vector of preselected variables, and it is assumed that the $ k $ signals are contained in the known active set $ {\mathcal{S}}_{K,t}=\left\{{x}_{jt},j=1,2,\dots, K\right\} $ . Note that for now the DGP in (7) does not include the additional latent factors, $ {\mathbf{f}}_t $ , introduced in (3). Without loss of generality, consider the extreme case where there are no noise variables and all proxy or pseudo-signal variables $ \Big({x}_{jt}, $ for $ j=k+1,k+2,\dots, K\Big) $ are correlated with the signals, $ {\mathbf{x}}_{1t}={\left({x}_{1t,}{x}_{2t},\dots, {x}_{kt}\right)}^{\prime } $ . In this case, $ {k}_T^{\ast } $ rises with $ K $ and OCMT is no longer applicable. However, because of this correlation, the signals, $ {\mathbf{x}}_{1t} $ , act as latent factors for the proxy variables and we have
for $ j=k+1,k+2,\dots, K $ . Although the identity of these common factors is unknown, because we do not know the true signals, they can be approximated by the PCs of the variables in the active set.
Specifically, following Sharifvaghefi (2023), denote the latent factors that result in non-zero correlations between the noise variables in the active set and the signals by $ {\mathrm{\varkappa}}_t $ and consider the factor model
where $ {\boldsymbol{\kappa}}_j $ , for $ j=1,2,\dots, K $ are the factor loadings and the errors, $ {v}_{jt} $ , are weakly cross-correlated and distributed independently of the factors and their loadings. Under (8), the DGP, (7), can be written equivalently as
where $ {\mathbf{b}}_h={\sum}_{j=1}^KI\left(j\in DGP\right){\beta}_{jh}{\kappa}_j $ . When $ {\mathrm{\varkappa}}_t $ and $ {v}_{jt} $ are known, the problem reduces to selecting the $ {v}_{jt} $ from $ {\mathcal{S}}_{K,t}^v=\left\{{v}_{jt},j=1,2,\dots, K\right\}, $ conditional on $ {\mathbf{z}}_t $ and $ {\mathrm{\varkappa}}_t $ . Sharifvaghefi shows that the OCMT selection can be carried out using the PC estimators of $ {\mathrm{\varkappa}}_t $ and $ {\mathbf{v}}_{jt} $ , denoted by $ {\hat{\mathrm{\varkappa}}}_t $ and $ {\hat{\mathbf{v}}}_{jt} $ , provided both $ K $ and $ T $ are large. He labels this procedure generalised OCMT (GOCMT). Note that, for now, it is assumed that $ {\mathrm{\varkappa}}_t $ does not directly affect $ {y}_{t+h} $ ; it enters only through the $ {x}_{jt} $ . It represents the signals, the common factors correlated with the proxies, and provides a way of filtering out these correlations in the first step.
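As an illustration of this first step, the sketch below approximates a single latent factor by the PCs of a standardised active set and returns the defactored covariates. The number of factors, the data and all names are assumptions made for the example; they are not part of the GOCMT implementation used later.

```python
# Sketch: approximating the latent factors that link proxies to signals by the
# principal components of the standardised active set. The number of factors is
# taken as given here; in practice it would be chosen by an information criterion.
import numpy as np

def pc_factors(X, m):
    """Return the first m principal components of X (T x K) and the defactored X."""
    Z = (X - X.mean(0)) / X.std(0)                        # standardise columns
    # eigen-decomposition of the T x T matrix Z Z' gives the PCs directly
    eigval, eigvec = np.linalg.eigh(Z @ Z.T)
    F_hat = eigvec[:, ::-1][:, :m] * np.sqrt(Z.shape[0])  # estimated factors (T x m)
    proj = F_hat @ np.linalg.lstsq(F_hat, Z, rcond=None)[0]
    return F_hat, Z - proj                                # residuals play the role of v_hat

rng = np.random.default_rng(3)
T, K = 160, 60
f = rng.normal(size=(T, 1))                               # one common factor
X = f @ rng.normal(size=(1, K)) + rng.normal(size=(T, K))
F_hat, V_hat = pc_factors(X, m=1)
print("corr(f, f_hat) =", round(abs(np.corrcoef(f[:, 0], F_hat[:, 0])[0, 1]), 2))
```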
3.3. Generalised OCMT
The GOCMT procedure simply augments the OCMT regressions with the PCs, $ {\hat{\mathrm{\varkappa}}}_t $ , and considers the statistical significance of $ {\hat{\mathbf{v}}}_{jt} $ for each $ j $ , one at a time. Lasso-factor models have also been considered by Hansen and Liao (2019) and Fan et al. (2020). In practice, since $ {\mathbf{x}}_j=\hat{\boldsymbol{\Xi}}{\hat{\boldsymbol{\psi}}}_j+{\hat{\mathbf{v}}}_j $ , where $ \hat{\boldsymbol{\Xi}}={\left({\hat{\mathrm{\varkappa}}}_1,{\hat{\mathrm{\varkappa}}}_2,\dots, {\hat{\mathrm{\varkappa}}}_T\right)}^{\prime } $ , we have $ {\mathbf{M}}_{\hat{\boldsymbol{\Xi}}}{\mathbf{x}}_j={\mathbf{M}}_{\hat{\boldsymbol{\Xi}}}{\hat{\mathbf{v}}}_j $ , where $ {\mathbf{M}}_{\hat{\boldsymbol{\Xi}}}={\mathbf{I}}_T-\hat{\boldsymbol{\Xi}}{\left({\hat{\boldsymbol{\Xi}}}^{\prime}\hat{\boldsymbol{\Xi}}\right)}^{-1}{\hat{\boldsymbol{\Xi}}}^{\prime } $ , and GOCMT reduces to OCMT with $ {\mathbf{z}}_t $ augmented by $ {\hat{\mathrm{\varkappa}}}_t $ , where the statistical significance of $ {x}_{jt} $ as a predictor of $ {y}_{t+h} $ is evaluated for each $ j $ , one at a time. Like OCMT, GOCMT allows for the multiple testing nature of the procedure ( $ K $ separate tests, with $ K $ large) by letting the critical value rise with $ K $ . The number of PCs, $ \mathit{\dim}\left({\hat{\mathrm{\varkappa}}}_t\right) $ , can be determined using one of the criteria suggested in the factor literature.
In the first stage, $ K $ separate OLS regressions are computed, where the variables in the active set are entered one at a time:
Denote the t-ratio of $ {\phi}_{jh} $ by $ {t}_{{\hat{\phi}}_{j,(1)}} $ . Then, variable $ j $ is selected if
where $ {c}_p\left(K,\delta \right) $ is a critical value function given by $ {c}_p\left(K,\delta \right)={\Phi}^{-1}\left(1-\frac{p}{2{K}^{\delta }}\right) $ ,
where $ p $ is the nominal size (usually set to $ 5\% $ ), $ {\Phi}^{-1}\left(\cdot \right) $ is the inverse of the standard normal distribution function and $ \delta $ is a fixed constant set in the interval $ \left[1,1.5\right] $ . In the second step, a multivariate regression of $ {y}_{t+h} $ on $ {\boldsymbol{z}}_t $ and all the selected regressors is used for inference and forecasting. Serial correlation will arise with OCMT when selection is based on one variable at a time and the omitted variables are mixing (serially correlated). CKP discuss this in section C of the online theory supplement to their paper and suggest using a more conservative (higher) critical value, namely $ \delta =1.5 $ rather than $ \delta =1.0 $ .
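The sketch below illustrates the mechanics of the first stage and the critical value function. It reproduces the values $ {c}_{0.05}\left(52,1\right)\approx 3.3 $ and $ {c}_{0.05}\left(52,1.5\right)\approx 3.82 $ used in Section 5, but it is only an illustration and omits the refinements of the CKP implementation (for instance, robust standard errors).

```python
# Sketch of the OCMT first stage: one regression per candidate covariate,
# selecting those whose t-ratio exceeds the critical value c_p(K, delta).
# Illustrative only; heteroskedasticity/serial-correlation corrections are omitted.
import numpy as np
from scipy import stats

def c_p(K, delta, p=0.05):
    """Critical value c_p(K, delta) = Phi^{-1}(1 - p / (2 K^delta))."""
    return stats.norm.ppf(1 - p / (2 * K**delta))

def ocmt_first_stage(y, Z, X, delta=1.0, p=0.05):
    """Return indices of covariates in X whose t-ratio exceeds c_p(K, delta),
    conditioning on the preselected regressors Z (plus an intercept)."""
    T, K = X.shape
    cv = c_p(K, delta, p)
    base = np.column_stack([np.ones(T), Z])
    selected = []
    for j in range(K):
        W = np.column_stack([base, X[:, j]])
        coef, *_ = np.linalg.lstsq(W, y, rcond=None)
        resid = y - W @ coef
        sigma2 = resid @ resid / (T - W.shape[1])
        se = np.sqrt(sigma2 * np.linalg.inv(W.T @ W)[-1, -1])
        if abs(coef[-1] / se) > cv:
            selected.append(j)
    return selected, cv

rng = np.random.default_rng(10)
T, K = 200, 52
Z = rng.normal(size=(T, 2))
X = rng.normal(size=(T, K))
y = 0.5 * Z[:, 0] + 1.0 * X[:, 7] + rng.normal(scale=0.5, size=T)
sel, cv = ocmt_first_stage(y, Z, X)
print("critical value:", round(cv, 2), "selected columns:", sel)
print(round(c_p(52, 1.0), 2), round(c_p(52, 1.5), 2))     # approx. 3.3 and 3.82
```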
When the covariates are not highly correlated, OCMT applies irrespective of whether $ K $ is small or large relative to $ T $ , so long as $ T=\Theta \left({K}^c\right) $ , for some finite $ c>0 $ . But to allow for highly correlated covariates, GOCMT requires $ K $ to be sufficiently large to enable the identification of the latent factor, $ {\mathrm{\varkappa}}_t $ . In cases where $ K $ is not that large, it might be a good idea to augment the active set for the target variable, $ {y}_{t+h} $ , $ {\mathcal{S}}_{K,t}=\left\{{x}_{jt},j=1,2,\dots, K\right\} $ , with covariates for other variables determined simultaneously with $ {y}_{t+h} $ , ending up with $ \overline{\mathcal{K}}>K $ covariates for the identification of $ {\mathrm{\varkappa}}_t $ . GOCMT does not impose any restriction on the correlations between the variables other than that they cannot be perfectly collinear.
3.4. High-dimensional variable selection in the presence of parameter instability
OCMT has also recently been generalised by Chudik et al. (2023) to deal with parameter instability. Under parameter instability, OCMT correctly selects the covariates with non-zero average (over time) effects, using the full sample. However, the adverse effects of changing parameters on the forecasts may mean that, while the full sample is best for selection, it need not be best for estimating the forecasting model. Instead, it may be better to use shorter windows or to weight the observations in the light of the evidence on break points and break sizes.
Determining the appropriate window or weighting of the observations before estimation is a difficult problem, and no fully satisfactory procedure seems to be available. It is common in finance to use rolling windows of 60 or 120 months, but one problem with shorter windows is that, if periods of instability are interspersed with periods of stability, like the Great Moderation, estimates using a short window from the stable period may understate the degree of uncertainty. This happened during the financial crisis, when the short windows used for estimation did not reflect past turbulence. Similarly, by estimating its models on the low-inflation regime of the past 30 years, the Bank of England discounted the evidence from the high-inflation regime of the 1970s and 1980s.
While identifying the date of a break might not be difficult, identifying the size of the break may be problematic if the break point is quite recent. If there is only a short time since the break, there is little data on which to estimate the post-break coefficients with any degree of precision. If there is a long time since the break, then using post-break data is sensible. Pesaran et al. (2013) examine optimal forecasts in the presence of continuous and discrete structural breaks. These present quite different sorts of challenges. With continuous breaks, the parameters change often, by small amounts. With discrete breaks, the parameters change rarely, but by large amounts. They propose weighting observations to obtain optimal forecasts in the MSFE sense and derive optimal weights for one-step ahead forecasts for the two types of break. Under continuous breaks, their approach largely recovers exponential smoothing weights. Under discrete breaks between two regimes, the optimal weights follow a step function that allocates constant weights within regimes but different weights across regimes. In practice, the time and size of the break are uncertain, and they investigate robust optimal weights. Averaging forecasts obtained with different weighting schemes, for instance with exponential smoothing parameters between 0.96 and 0.99, may also be a way to produce more robust forecasts.
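A minimal sketch of this idea is given below: weighted least squares with exponential smoothing weights, with forecasts averaged over a grid of smoothing parameters. The data, the break and the parameter grid are illustrative, and the optimal weights derived by Pesaran et al. (2013) are not implemented here.

```python
# Sketch: exponentially down-weighted least squares, and a simple average of
# forecasts across a grid of smoothing parameters as a robustness device.
# Illustrative only; not the optimal weights of Pesaran et al. (2013).
import numpy as np

def exp_weighted_forecast(y, X, x_T, lam):
    """WLS forecast with weights lam^(T-t), down-weighting older observations."""
    T = len(y)
    w = lam ** np.arange(T - 1, -1, -1)                   # most recent observation gets weight 1
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return x_T @ beta

rng = np.random.default_rng(4)
T = 120
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta = np.where(np.arange(T) < 80, 1.0, 2.0)              # discrete break in the slope
y = X[:, 0] * 0.5 + X[:, 1] * beta + rng.normal(scale=0.3, size=T)
x_T = np.array([1.0, 0.4])                                # hypothetical value of the regressors

forecasts = [exp_weighted_forecast(y, X, x_T, lam) for lam in (0.96, 0.97, 0.98, 0.99)]
print("averaged forecast:", round(float(np.mean(forecasts)), 3))
```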
4. Known unknowns
So far, we have considered techniques (penalised regressions and OCMT) that assume $ {y}_{t+h} $ depends on $ {\mathbf{z}}_t $ and a subset of a set of covariates—the active set—which is assumed known. In contrast, shrinkage-type techniques, such as PCs, (implicitly) assume that $ {y}_{t+h} $ depends on $ {\mathbf{z}}_t $ and the $ m\times 1 $ vector of unknown factors $ {\mathbf{f}}_t $
This is a simple example of techniques that, in our terminology, can be viewed as belonging to a class of forecasting models based on known unknowns. The uncertainty about $ {\mathbf{f}}_t $ is resolved by assuming it can be identified from a known active set, such as $ {\mathcal{S}}_{K,t}=\left\{{x}_{jt},j=1,2,\dots, K\right\} $ . Individual covariates in $ {\mathcal{S}}_{K,t} $ are not considered for selection (although a few could be preselected and included in $ {\boldsymbol{z}}_t $ ). To forecast $ {y}_{t+h} $ one still needs to forecast the PCs and to allow for the uncertainty regarding $ m=\dim \left({\mathbf{f}}_t\right) $ .
Factor augmented VARs (FAVAR), initially proposed by Bernanke et al. (2005, BBE), augment standard VAR models with a set of unobserved common factors. In the context of our set-up, FAVAR can be viewed as a generalised version of (3), where $ {\mathbf{y}}_{t+h} $ is a vector and $ {\boldsymbol{z}}_t=\left\{{\boldsymbol{y}}_t,{\boldsymbol{y}}_{t-1},\dots, {\boldsymbol{y}}_{t-p}\right\} $ . BBE argue that small VARs gave implausible impulse response functions, such as the ‘price puzzle’, which were interpreted as reflecting omitted variables. One response was to add variables and use larger VARs, but this route rapidly runs out of degrees of freedom, since central bankers monitor hundreds of variables. The FAVAR was presented as a solution to this problem. Big Bayesian VARs are an alternative solution.
The assumptions that underlie both penalised regression and PC shrinkage are rather strong. The former assumes that $ {\mathbf{f}}_t $ can affect $ {y}_{t+h} $ only indirectly through $ {x}_{jt} $ , $ j=1,2,\dots, K $ , and the latter does not allow for individual variable selection. Suppose instead that $ {\mathbf{f}}_t $ also enters (7); then the model can be written as (3) above, repeated here for convenience:
The forecasting problem now involves both selection and shrinkage. The $ {\mathbf{f}}_t $ can be identified by, for instance, the PCs of the augmented active set $ {x}_{jt}, $ $ j=1,2,\dots, K,K+1,\dots, \overline{\mathcal{K}} $ , which can be wider than the active set of covariates used to predict $ {y}_t $ . There are various other ways in which the unobserved $ {\mathbf{f}}_t $ could be estimated, but we use PCs as an example since they are widely used.
The latent factors are unlikely to be specific only to the target variable under consideration. Observed global factors, such as oil and raw material prices or the inflation and output growth of major countries such as the United States, can be included in the active set. The main issue is how to deal with global factors, such as technology and political change, that are unobserved and tend to affect many countries in the world economy. Call this vector of global factors $ {\mathbf{g}}_t $ . A natural extension is to introduce forecast equations for other countries (entities) that have close trading relationships with the United Kingdom and use penalised panel regressions, where the panel dimension allows identification of the known unknowns.
More specifically, suppose there are $ N $ other units (countries) that are affected by observed country-specific covariates, $ {\mathbf{z}}_{it} $ , $ i=1,2,\dots, N $ , and $ {x}_{i, jt} $ for $ j=1,2,\dots, {K}_i $ , plus domestic latent factors $ {\mathbf{f}}_{it} $ , and global latent factors, $ {\mathbf{g}}_t $ . The forecasting equations are now generalised as
where $ {k}_i={\sum}_{j=1}^{K_i}I\left(j\in {DGP}_i\right) $ is finite as $ {K}_i\to \infty $ , for $ i=0,1,2,\dots, N $ . For the country-specific covariates, we postulate that there is an augmented active set
where $ {\mathbf{f}}_{it} $ are the latent factors. The global factors are then identified as the common components of the country-specific factors, namely
for $ i=1,2,\dots, N $ , with $ N $ large.
Variable selection for the target variable (say, UK inflation) can now proceed by applying GOCMT, with the UK model augmented with the UK-specific PCs, $ {\hat{\mathbf{f}}}_{it} $ , as well as the PC estimator of the global factor, $ {\mathbf{g}}_t $ , that drives the country-specific factors. This can be extracted from the $ {\hat{\mathbf{f}}}_{it} $ as PCs of the country-specific PCs. In addition to common factor dependence, countries are also linked through trade and other more local features (culture, language). Such ‘network’ effects can be captured by using ‘starred’ variables, to use the GVAR terminology. A simple example would be (for $ i=0,1,\dots, N $ )
where $ i=0 $ represents the United Kingdom, $ {y}_{it}^{\ast }={\sum}_{j=1}^N{w}_{ij}{y}_{jt} $ , and the trade weight $ {w}_{ij} $ measures the relative importance of country $ j $ in the determination of country $ i $ ’s target variable. Similarly, $ {\boldsymbol{z}}_{it}^{\ast }={\sum}_{j=1}^N{w}_{ij}^{\ast }{\mathbf{z}}_{jt} $ can also be added to the model if deemed necessary.
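A sketch of the construction of such starred variables from a row-standardised weight matrix is given below; the weights and data are simulated for illustration only.

```python
# Sketch: constructing GVAR-style 'starred' variables, y*_{it} = sum_j w_ij y_jt,
# from a row-standardised trade-weight matrix. Weights and data are illustrative.
import numpy as np

rng = np.random.default_rng(5)
T, N = 40, 5                                              # quarters, countries (i = 0 is the UK)
Y = rng.normal(size=(T, N))                               # e.g. inflation rates by country

W = rng.random((N, N))
np.fill_diagonal(W, 0.0)                                  # a country gets no weight on itself
W = W / W.sum(axis=1, keepdims=True)                      # rows sum to one (trade shares)

Y_star = Y @ W.T                                          # column i holds y*_{it}
print("UK-specific foreign series, first 3 quarters:", np.round(Y_star[:3, 0], 2))
```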
The network effects can be included either as an element of $ {\boldsymbol{z}}_{it} $ or could be made subject to variable selection. The problem becomes much more complicated if we try to relate $ {y}_{i,t+h} $ simultaneously to $ {y}_{i,t+h}^{\ast } $ . Further, for forecasting, following Chudik et al. (2016), one might also need to augment the UK regressions with time series forecasting models for the common factors.
Equation (13) allows for a number of different approaches to dimension reduction. As has been pointed out by Wainwright (2019): ‘Much of high-dimensional statistics involves constructing models of high-dimensional phenomena that involve some implicit form of low-dimensional structure, and then studying the statistical and computational gains afforded by exploiting this structure’. Shrinkage methods, like PCs, assume a low-dimensional factor structure. The two selection procedures that we have considered, Lasso and OCMT, exploit different aspects of the low-dimensional sparsity structure assumed for the underlying data generating process. Lasso restricts the magnitude of the correlations within and between the signals and the noise variables. OCMT limits the rate at which the number of proxy variables rises with the sample size. GOCMT relaxes this restriction by filtering out the effects of the latent factors that bind the proxies to the true signals before implementing the OCMT procedure.
5. Forecasting UK inflation
5.1. Introduction
We apply the procedures proposed above to the problem of forecasting quarterly UK inflation at horizons $ h=1,2 $ and $ 4 $ . The target variable is the headline rate, average annual UK inflation, which is also forecast by the Bank of England. It is labelled DPUK4 and defined as $ {\pi}_{t+h}=100\times \log \left({p}_{t+h}/{p}_{t+h-4}\right) $ , where $ {p}_t $ is the UK consumer price index taken from the IMF International Financial Statistics. Forecasting annual rates of inflation at quarterly frequencies is subject to the overlapping observations problem when $ h>1 $ , and it is important that the preselected variables in $ {\boldsymbol{z}}_t $ , and the variables included in the active set $ {\mathcal{S}}_{K,t}=\left\{{x}_{jt},j=1,2,\dots, K\right\} $ , are all predetermined (known) at time $ t $ . Furthermore, as discussed earlier, the variables selected for forecasting inflation at different horizons need not be the same, and the selected variables are also likely to change over time.
Since we have emphasised the importance of international network effects, we need to use a quarterly data set that includes a large number of countries, to estimate global factors and to allow the construction of the $ {y}_{it}^{\ast } $ variables that appear in (13). The GVAR data set provides such a source. The publicly available data set compiled by Mohaddes and Raissi (2024) covers 1979q1–2023q3. We are very grateful to them for extending the data. While the latest GVAR data set goes up to 2023q3, we only had access to data up to 2023q1 when we started the forecasting exercise reported in this article; the data that we used match the GVAR 2023 vintage, which was released in January 2024.Footnote 7
The database includes quarterly macroeconomic data on 6 variables (log real GDP, y; the rate of inflation, dp; the short-term interest rate, r; the long-term interest rate, lr; the log deflated exchange rate, ep; and log real equity prices, eq) for 33 economies, as well as data on commodity prices (oil prices, poil; agricultural raw material prices, pmat; and metal prices, pmetal). These 33 countries cover more than 90% of world GDP. The GVAR data were supplemented with additional UK-specific data on money, wages, employment and vacancies in the construction of the active set discussed below.
In the light of the argument in Chudik et al. (2023), we use the full sample beginning in 1979q1 for variable selection. There are arguments for down-weighting earlier data for estimation when there have been structural changes, as discussed by Pesaran et al. (2013). However, the full sample was used both for variable selection and for estimation of the forecasting model, in order to allow evidence from the earlier higher-inflation regime to inform both aspects.
Two sets of variables are considered for inclusion in $ {\boldsymbol{z}}_t $ . The first set, which we label AR2, includes lags of the target variable, $ {\pi}_t,{\pi}_{t-1} $ (or equivalently $ {\pi}_t $ and $ \Delta {\pi}_t $ ). Given the importance we attach to global variables and network effects, the second set, which we label ARX2, also includes $ {\pi}_t^{\ast } $ and $ {\pi}_{t-1}^{\ast } $ (or equivalently $ {\pi}_t^{\ast } $ and $ \Delta {\pi}_t^{\ast } $ ), where $ {\pi}_t^{\ast } $ is a measure of UK-specific foreign inflation constructed using UK trade weights with the other countries.Footnote 8
If there is a global factor in inflation, the inflation rates of different countries will be highly correlated and tend to move together. Figure 2 demonstrates that this is in fact the case. It plots the inflation rates for 19 countries over the period 1979–2022. It is clear that they do move together, reflecting a strong common factor. The dispersion is somewhat greater in the high inflation period of the 1980s. At times, individual countries break away from the herd with idiosyncratic bursts of inflation, like New Zealand in the mid-1980s. However, it is striking that inflation in every country increases from 2020.
To demonstrate the importance of the global factor for the United Kingdom, Figure 3 plots $ {\pi}_t $ and $ {\pi}_t^{\ast } $ , UK inflation and UK-specific foreign inflation. The two series move together, and from the mid-1990s they are very close. This indicates not only that one is unlikely to be able to explain UK inflation by UK variables alone, but also that there are good reasons to include this UK-specific measure of foreign inflation in one of our specifications for $ {\boldsymbol{z}}_t $ . The GVAR estimates also indicate the higher sensitivity of the UK to foreign variables compared with the US or the euro area. This is not surprising: they are larger, less open economies.
5.2. Active set
We now turn to the choice of the members of the active set, $ {\mathcal{S}}_{K,t}=\left\{{x}_{jt},j=1,2,\dots, K\right\} $ , some of which may be included in $ {\boldsymbol{z}}_t $ . While our focus is on forecasting, not on building a coherent economic model, our choice of the covariates in the active set is motivated by the large Phillips curve literature, which suggests important roles for demand, supply and expectations variables. The demand and supply variables in the active set are both domestic and foreign, from both product and labour markets. Expectations are captured by financial variables. Interaction terms were not included, but the non-linearities that have been investigated in the literature may be picked up by the latent foreign variables. These are represented by UK-specific measures of foreign inflation and output that could be viewed as estimates of $ {\mathbf{g}}_t $ tailored to the UK in relation to her trading partners.
Accordingly, we consider 26 covariates $ \left({x}_{jt}\right) $ listed in Table 1, and their changes $ \Delta {x}_{jt}={x}_{jt}-{x}_{j,t-1} $ , giving an active set with $ K=52 $ variables to select from. Whereas in a regression including current and lagged values of a regressor (say $ {x}_t $ and $ {x}_{t-1} $ ) is equivalent to including the current value and the change (namely $ {x}_t $ and $ \Delta {x}_t $ ), in selection the two specifications can result in different outcomes. Including $ \Delta {x}_t $ is preferable since, compared with $ {x}_{t-1} $ , it is less correlated with the levels of the other variables in the active set. The 26 included covariates are measured as four-quarter rates of change, changes or averages to match the definition of the target variable. The rates of change are per cent per annum.
UK goods market demand indicators: the rate of change of output and two measures of the output gap, log output minus either a $ P=8 $ or $ P=12 $ quarter moving average of log output, $ Gap\left({y}_t,P\right)={y}_t-{P}^{-1}{\sum}_{p=1}^P{y}_{t-p} $ (see the sketch after this list);
UK labour market demand indicators: rate of change of UK employment, vacancies and average weekly earnings and the change in unemployment;
UK financial indicators: annual averages of UK short and long interest rates, and the rate of change of money (UK M4) and of UK real equity prices;
Global cost pressures on the UK: rate of change of the price of oil, metals, materials, UK import prices and deflated dollar exchange rate;
Foreign demand and supply variables: UK-specific global measures, foreign inflation, rate of change of foreign output and two measures of the foreign output gap: log foreign output minus either an 8- or 12-quarter moving average of log foreign output. In addition, large country variables were added: annual average of US short and long interest rates, rates of change of US output and prices, and of Chinese output.
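The sketch below illustrates, on simulated series, how the four-quarter inflation rate and the moving-average output gap $ Gap\left({y}_t,P\right) $ listed above are constructed; the series names and numbers are illustrative and are not the GVAR data.

```python
# Sketch: constructing the four-quarter inflation rate and the moving-average
# output gap used in the active set. Series are simulated for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
idx = pd.period_range("1979Q1", periods=60, freq="Q")
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.005, 0.01, 60))), index=idx)
log_y = pd.Series(np.cumsum(rng.normal(0.005, 0.01, 60)), index=idx)

# DPUK4-style target: pi_t = 100 * log(p_t / p_{t-4}), per cent per annum
dp4 = 100 * (np.log(price) - np.log(price.shift(4)))

# Gap(y_t, P) = y_t - (1/P) * sum_{p=1..P} y_{t-p}, for P = 8 and P = 12
def ma_gap(y, P):
    return y - y.shift(1).rolling(P).mean()

gap8, gap12 = ma_gap(log_y, 8), ma_gap(log_y, 12)
print(pd.concat([dp4, gap8, gap12], axis=1, keys=["dp4", "gap8", "gap12"]).tail(3))
```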
5.3. Variable selection
5.3.1. Variable selection procedures
We consider Lasso, Lasso conditional on $ {\boldsymbol{z}}_t $ and GOCMT conditional on $ {\boldsymbol{z}}_t $ . With Lasso, the variables are standardised in-sample before implementing variable selection. The Lasso penalty parameter, $ {\lambda}_T, $ is estimated using 10-fold CV across subsets of the observations. As noted above, the assumptions needed for standard CV procedures, for instance those used in the program cv.glmnet, are not appropriate for time series. Time series show features such as persistence and changing variance that are incompatible with those assumptions. In the standard procedure, the CV subsets (folds) are typically chosen randomly. This is appropriate if the observations are independent draws from a common distribution, but this is not the case with time series. Since order matters in time series, we retain the time order of the data within each subset. See Bergmeir et al. (2018), who provide Monte Carlo evidence on various procedures suggested for the case of serially correlated data. We use all the data and do not leave gaps between subsets. In addition, the standard procedure chooses the $ {\hat{\lambda}}_{hT} $ that minimises the pooled MSE over the 10 subsets. But when variances differ substantially over subsets, pooling is not appropriate; instead, we follow Chudik et al. (2018, CKP) and use the average of the $ {\hat{\lambda}}_{hT} $ chosen in each subset. Full details are provided in the Appendix to CKP (2018).
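The following sketch illustrates the modified cross-validation just described: contiguous folds that preserve time order, with the penalty chosen on each fold and then averaged. It is an illustration in scikit-learn, in the spirit of the CKP modification rather than their exact implementation, and the grid of penalty values is an arbitrary choice.

```python
# Sketch of the modified cross-validation: contiguous folds that preserve time order,
# with the penalty chosen separately on each fold and then averaged.
# Illustrative only, not the authors' implementation.
import numpy as np
from sklearn.linear_model import Lasso

def cv_lambda_time_ordered(y, X, lambdas, M=10):
    folds = np.array_split(np.arange(len(y)), M)          # contiguous blocks, order kept
    chosen = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        best_lam, best_mse = None, np.inf
        for lam in lambdas:
            fit = Lasso(alpha=lam, max_iter=10000).fit(X[train], y[train])
            mse = np.mean((y[fold] - fit.predict(X[fold])) ** 2)
            if mse < best_mse:
                best_lam, best_mse = lam, mse
        chosen.append(best_lam)
    return float(np.mean(chosen))                         # average of fold-specific penalties

rng = np.random.default_rng(7)
T, K = 180, 52
X = rng.normal(size=(T, K))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=T)
print("averaged penalty:", cv_lambda_time_ordered(y, X, lambdas=np.geomspace(0.01, 1.0, 20)))
```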
As well as standard Lasso, for consistency with OCMT, we also generated Lasso forecasts conditional on $ {\boldsymbol{z}}_t $ by including a preselected set of variables $ {\boldsymbol{z}}_t $ in the optimisation problem (5). This generalised Lasso procedure solves the following optimisation problem:
where the penalty is applied only to the variables in the active set, $ {\boldsymbol{x}}_t, $ and not to the preselected variables, $ {\boldsymbol{z}}_t $ . The above optimisation problem can be solved in two stages. In the first stage, the common effects of $ {\boldsymbol{z}}_t $ are filtered out by regressing $ {y}_{t+h} $ and $ {\boldsymbol{x}}_t $ on the preselected variables $ {\boldsymbol{z}}_t $ and saving the residuals $ {e}_{y.z} $ and $ {e}_{xj.z} $ , $ j=1,2,\dots, K $ . In the second stage, Lasso is applied to these residuals. A proof that this two-step procedure solves the constrained minimisation problem in (14) is provided by Sharifvaghefi and reproduced in the Appendix.
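A sketch of this two-stage procedure is given below; the preselected variables, the penalty value and the simulated data are illustrative.

```python
# Sketch of the two-stage 'Lasso conditional on z_t': first partial out the
# preselected variables from y and from every candidate regressor, then apply
# Lasso to the residuals. Illustrative code, not the authors' implementation.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_conditional_on_z(y, Z, X, lam):
    T = len(y)
    Zc = np.column_stack([np.ones(T), Z])
    # Stage 1: residuals from regressions on the preselected variables z_t
    P = Zc @ np.linalg.lstsq(Zc, np.column_stack([y, X]), rcond=None)[0]
    e = np.column_stack([y, X]) - P
    e_y, e_X = e[:, 0], e[:, 1:]
    # Stage 2: Lasso on the residuals; only the active-set coefficients are penalised
    fit = Lasso(alpha=lam, max_iter=10000).fit(e_X, e_y)
    return np.flatnonzero(np.abs(fit.coef_) > 1e-8)

rng = np.random.default_rng(8)
T, K = 180, 52
Z = rng.normal(size=(T, 2))                               # e.g. (pi_t, d_pi_t)
X = rng.normal(size=(T, K))
y = 0.8 * Z[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=T)
print("selected active-set columns:", lasso_conditional_on_z(y, Z, X, lam=0.1))
```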
In the OCMT critical value function, $ {c}_p\left(K,\delta \right)={\Phi}^{-1}\left(1-\frac{p}{2{K}^{\delta }}\right) $ , we set $ p=0.05 $ and $ \delta =1 $ . With $ K=52 $ , this means that we only retain variables with t-ratios (in absolute value) exceeding $ {c}_{0.05}\left(52,1\right)=3.3 $ . To allow for possible serial correlation, we also experimented with setting $ \delta =1.5 $ , which yields, $ {c}_{0.05}\left(\mathrm{52,1.5}\right)=3.82 $ . The results were reasonably robust and we focus on the baseline choice of $ \delta =1 $ , also recommended by CKP.
We implement Lasso and OCMT conditional on two preselected sets of variables: either an AR2 written as level and change, $ {\mathbf{z}}_t={\left({\pi}_t,\Delta {\pi}_t\right)}^{\prime } $ , or, given the role of foreign inflation shown above, the AR2 augmented by the level and change of the UK-specific measure of foreign inflation, denoted ARX, $ {\mathbf{z}}_t={\left({\pi}_t,\Delta {\pi}_t,{\pi}_t^{\ast },\Delta {\pi}_t^{\ast}\right)}^{\prime } $ . As noted above, for selection, including the current value and the change is better than including the current value and the lag. For comparative purposes, we also generated forecasts with the preselected variables only, namely the AR2 forecasts generated from the regressions
and the ARX forecasts generated from
Variable selection is carried out recursively, for each forecast horizon $ h $ separately, using an expanding window approach. All data samples start in $ 1979q2 $ and end in the quarter in which the forecasts are made. To forecast the average inflation over the four quarters to $ 2020q1 $ using a forecast horizon of $ h=4 $ , the sample used for selection and estimation ends in $ 2019q1 $ . The end of the sample is then moved to $ 2019q2 $ to forecast the average inflation over the four quarters to $ 2020q2 $ , and so on. Similarly, to forecast the average inflation over the four quarters to $ 2020q1 $ using $ h=2 $ , the sample ends in $ 2019q3 $ , and using $ h=1 $ , the sample ends in $ 2019q4 $ . These sequences continue one quarter at a time until the models are selected and estimated to forecast inflation over the four quarters to $ 2023q1 $ . Thus, for $ h=4 $ , there are $ 17 $ samples used for variable selection, while for $ h=2 $ and $ h=1 $ there are $ 15 $ and $ 14 $ such variable selection samples, respectively. This process of recursive model selection and estimation means that the variables selected can change from quarter to quarter and for each forecast horizon, $ h $ .
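The recursion can be summarised in a few lines of code. In the sketch below, `select_and_fit` is a hypothetical placeholder (here simple OLS) standing in for any of the selection-plus-estimation procedures, and the indexing conventions and simulated data are illustrative.

```python
# Sketch of the expanding-window recursion: the selection/estimation sample ends in
# the quarter the forecast is made and grows one quarter at a time; `select_and_fit`
# is a placeholder for any of the selection procedures (here simple OLS).
import numpy as np

def select_and_fit(y_lead, X_lag):
    """Placeholder for Lasso/OCMT selection plus estimation; here OLS on all columns."""
    Xc = np.column_stack([np.ones(len(y_lead)), X_lag])
    beta, *_ = np.linalg.lstsq(Xc, y_lead, rcond=None)
    return lambda x: beta[0] + x @ beta[1:]

def expanding_window_forecasts(y, X, first_end, h):
    """Fit the direct h-step model on pairs (x_t, y_{t+h}) within each expanding
    sample y[:end], X[:end], then forecast y at end - 1 + h."""
    forecasts = {}
    for end in range(first_end, len(y) - h + 1):
        model = select_and_fit(y[h:end], X[:end - h])
        forecasts[end - 1 + h] = model(X[end - 1])
    return forecasts

rng = np.random.default_rng(9)
T, K = 120, 5
X = rng.normal(size=(T, K))
y = 0.7 * X[:, 0] + rng.normal(scale=0.5, size=T)
print(len(expanding_window_forecasts(y, X, first_end=100, h=4)), "forecasts produced")
```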
Section S-3 of the online supplement lists the variables selected by each of the procedures, for each quarter and each forecast horizon. The main features are summarised here.
5.3.2. Number of variables selected
Table 2 gives the minimum, maximum and average number of variables selected for the three forecast horizons and the five variable selection procedures. Except for AR2-OCMT at $ h=4 $ , OCMT chooses fewer variables than Lasso. Lasso conditional on the preselected variables selects a larger number of variables in total than the standard Lasso without conditioning. Conditioning on preselected variables is much more important for OCMT than for Lasso. This finding is in line with the theoretical results obtained by Sharifvaghefi (2023), who establishes the importance of conditioning on the latent factors when selection is applied to an active set with highly correlated covariates. The number of variables Lasso selects falls with the forecast horizon,Footnote 9 while the number of variables selected by OCMT rises with the horizon. These results show that Lasso and OCMT can select very different models for forecasting.
Note: The reported results are based on 14, 15 and 17 variable selection samples for 1-, 2- and 4-quarter ahead models, respectively. The AR2 and ARX components include two and four preselected variables, respectively.
As expected, the number of variables selected by Lasso is inversely related to the estimates of the penalty parameter $ {\hat{\lambda}}_{hT} $ , computed by cross-validation (CV). These values are summarised in Table 3. For all three Lasso applications, the mean of the estimated penalty parameter increases with the forecast horizon, though more slowly for the specifications that include preselected variables. For Lasso (without preselection), the number of variables selected falls with the forecast horizon because of the increasing penalty parameter. This is not as clear-cut for the specifications including preselected variables.
Note: The reported estimates are based on Lasso penalty estimates (obtained from 10-fold cross-validation) for 14, 15 and 17 variable selection samples for 1-, 2- and 4-quarter ahead models, respectively.
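For concreteness, a cross-validated penalty of this kind can be obtained with scikit-learn's LassoCV; the snippet below is a self-contained sketch on synthetic data, not the code underlying Table 3:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for one variable selection sample: T observations on K candidate regressors.
rng = np.random.default_rng(0)
T, K = 160, 52
X = rng.standard_normal((T, K))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(T)

fit = LassoCV(cv=10, random_state=0).fit(X, y)   # 10-fold cross-validation over a grid of penalties
print("estimated penalty:", fit.alpha_)
print("selected variables:", np.flatnonzero(fit.coef_))
```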
5.3.3. OCMT: selected variables by horizon
OCMT selects only a few variables in addition to the preselected UK inflation and UK-specific foreign inflation variables ( $ {\pi}_t,\Delta {\pi}_t $ , $ {\pi}_t^{\ast } $ and $ \Delta {\pi}_t^{\ast } $ ).Footnote 10 For $ h=1 $ , the variables selected are given in sub-section S-3.1.1 of the online supplement. In addition to the two preselected variables ( $ {\pi}_t,\Delta {\pi}_t $ ), AR2-OCMT selects the rate of change of wages (DWUK4) for samples ending in $ 2021q2 $ , $ 2021q4 $ and $ 2022q1 $ , and no other variables. ARX-OCMT does not select any additional variables in any of the $ 14 $ variable selection samples.
For $ h=2 $ , the variables selected are given in Section S-3.1.2 of the online supplement. AR2-OCMT selects the rates of change of money (DMUK4) and the exchange rate (DEPUK4) for samples ending in $ 2019q3 $ to $ 2020q3 $ ; the rate of change of wages (DWUK4) is then added up to the sample ending in $ 2022q2 $ ; from then on, the rates of change of money and wages are selected. ARX-OCMT selects just the rate of change of money (DMUK4) as an additional variable in every sample for $ h=2 $ .
For $ h=4 $ , the variables selected are given in Section S-3.1.3 of the online supplement. AR2-OCMT chooses the same three extra variables, namely the rates of change of money (DMUK4) and the exchange rate (DEPUK4) as well as the UK-specific measure of foreign inflation (DPSUK4, $ {\pi}_t^{\ast } $ ), for every sample ending from $ 2019q1 $ to $ 2021q1 $ . Then, in the sample ending in $ 2021q2 $ , AR2-OCMT chooses $ 12 $ extra variables. The number of variables selected then falls to $ 7 $ in $ 2021q3 $ , $ 6 $ in $ 2021q4 $ , $ 5 $ in $ 2022q1 $ and $ 4 $ in $ 2022q2 $ to $ 2022q4 $ . These four are the rates of change of money (DMUK4), material prices (DPMAT4) and wages (DWUK4), and $ {\pi}_t^{\ast } $ (DPSUK4). The number of variables selected falls to $ 3 $ in the sample ending in $ 2023q1 $ , when the foreign inflation measure is no longer selected. ARX-OCMT chooses the rates of change of employment (DEMUK4) and money for samples ending in $ 2019q1 $ to $ 2021q4 $ , then adds material prices in $ 2022q1 $ , and selects just the rate of change of money for the last four samples.
5.3.4. Lasso: selected variables by horizon
Lasso selections for each sample and horizon are given in the online supplement, Section S-3.2. Lasso tends to select more variables than OCMT, so we give less detail. Table 4 lists the variables chosen by standard Lasso at each horizon and the number of times they were chosen out of the maximum number of possible samples: $ 14 $ for $ h=1 $ , $ 15 $ for $ h=2 $ and $ 17 $ for $ h=4 $ . UK inflation, $ {\pi}_t $ (DPUK4), is chosen in every sample at every horizon, as is the UK-specific measure of foreign inflation, $ {\pi}_t^{\ast } $ (DPSUK4). The change in UK inflation, $ \Delta {\pi}_t $ (DDPUK4), is chosen in every sample for $ h=1 $ and $ h=2 $ , but never for $ h=4 $ . The change in foreign inflation, $ \Delta {\pi}_t^{\ast } $ (DDPSUK4), is chosen in every sample at $ h=1 $ , in $ 3 $ samples at $ h=2 $ , but never at $ h=4 $ . Thus, Lasso provides considerable support for the choice of preselected variables in $ {\boldsymbol{z}}_t $ that include foreign inflation as well as the two lagged inflation variables.
Note: The numbers of variable selection samples are 14, 15 and 17 for the $ h=1 $ , $ 2 $ and $ 4 $ quarter ahead models, respectively.
Apart from these variables, the rates of change of wages and money feature strongly in the Lasso selections. The rate of change of wages (DWUK4) is chosen in all the samples for $ h=1 $ and $ h=2 $ and in $ 14 $ of the $ 17 $ samples for $ h=4 $ . The rate of change of money (DMUK4) is chosen in $ 10 $ of the $ 14 $ samples for $ h=1 $ , and in every sample for $ h=2 $ and $ h=4 $ . Money and wages are also chosen by OCMT, but the rate of change of the exchange rate selected by OCMT is never chosen by Lasso.
When $ h=1 $ , Lasso also always selects two other variables, namely the change in long interest rates (DLRUK4) and import price inflation (DPMUK).
5.4. Forecasts
The point forecasts of inflation for $ h=1 $ , $ 2 $ and $ 4 $ for the various selection procedures are summarised in Section S-4 of the online supplement. For each forecast horizon, we have $ 13 $ forecasts and their realisations, covering the quarters $ 2020q1 $ to $ 2023q1 $ inclusive. These are summarised in Table 5. We use the root mean square forecast error (RMSFE) as our forecast evaluation criterion. Since $ 13 $ forecast errors constitute a very short evaluation sample with considerable serial correlation, testing the significance of loss differences using the Diebold and Mariano (Reference Diebold and Mariano1995) test would not be reliable and is not pursued here.
Note: The RMSFE figures are taken from online supplement Tables S-4.1–S-4.3. The least value for RMSFE for each forecast horizon is shown in bold.
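For clarity, and in our own notation, the criterion in Table 5 is computed over the 13 evaluation quarters as
$ \mathrm{RMSFE}(h)=\sqrt{\frac{1}{13}\sum_{\tau }{\left({\overline{\pi}}_{\tau }-{\hat{\overline{\pi}}}_{\tau \mid \tau -h}\right)}^2} $ ,
where $ {\overline{\pi}}_{\tau } $ is realised average inflation over the four quarters to $ \tau $ , $ {\hat{\overline{\pi}}}_{\tau \mid \tau -h} $ is its forecast made $ h $ quarters earlier, and the sum runs over $ \tau =2020q1,\dots, 2023q1 $ .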
5.4.1. One quarter ahead forecasts
Figure 4 gives the plots of actual inflation and forecasts one quarter ahead. Section S-4.1 of the online supplement gives the point forecasts. For $ h=1 $ , ARX has the lowest RMSFE: including $ {\pi}_t^{\ast } $ and $ \Delta {\pi}_t^{\ast } $ improves forecast performance relative to the AR2. AR2-OCMT adds wage growth in three periods. Lasso suffers from choosing too many variables relative to OCMT. The forecasts are very similar, except that Lasso predicted a large drop in $ 2020q3 $ with a subsequent rebound. This results from selecting an output gap measure when UK output dropped sharply in $ 2020q2 $ . This sharp drop and rebound was also a feature of Lasso forecasts at other horizons. The Bank of England overestimated inflation in $ 2022q4 $ , correctly anticipating higher energy prices but not anticipating the government energy price guarantees.
5.4.2. Two quarter ahead forecasts
Figure 5 gives the plots of actual inflation and forecasts two quarters ahead. Section S-4.2 of the online supplement gives the values. For $ h=2 $ , ARX again has the lowest RMSFE. ARX-OCMT selects money growth in every period. Lasso selects between five and nine variables.
5.4.3. Four quarter ahead forecasts
Figure 6 gives the plots of actual inflation and forecasts four quarters ahead. Section S-4.3 of the online supplement gives the values. The case of $ h=4 $ is the only one where ARX does not have the lowest RMSFE. The lowest RMSFE is obtained by AR2-OCMT, which does well by producing a very high inflation forecast for $ 2022q2 $ . This corresponds to the selection of 12 extra variables in the sample ending in $ 2021q2 $ . It then rejoins the pack in $ 2022q3 $ .
5.4.4. Summary
Table 5 brings together the RMSFE for each of the selection methods at different horizons. Both the variable selection and forecasting exercises highlight the importance of taking account of persistence and foreign inflation for UK inflation forecasting. Lasso selects $ {\pi}_t $ and $ {\pi}_t^{\ast } $ in all three forecast horizon models. ARX, which includes UK and foreign inflation as preselected variables, tends to perform best in forecasting, but in the present application, the OCMT component does not seem to add much once the preselected variables are included. However, Lasso performs rather poorly when it is conditioned on the preselected variables.
5.5. Contemporaneous drivers
Our forecasts of $ {\pi}_{t+h} $ are based on variables observed at time t and do not depend on any conditioning assumptions. But, as noted above with respect to the Bank of England, it is common to condition on contemporaneous values of variables that are considered proximate causes of the variable to be forecast. Even if such causal variables can be identified, it does not follow that they help with forecasting: often the causal variables are themselves difficult to forecast. The Bank of England overestimated inflation in 2022q4, correctly anticipating higher energy prices but not anticipating the government energy price guarantees. This illustrates the dangers of conditioning on variables that cannot themselves be forecast reliably. Understanding does not necessarily translate into better forecasts. For example, knowing the causes of earthquakes does not necessarily help in predicting them in a timely manner.
This point can be illustrated by including contemporaneous changes in oil prices in the UK inflation equation (estimated over both the pre-Covid-19 period and the full sample) for the case $ h=1 $ . In both samples, $ \Delta {poil}_{t+1} $ is highly statistically significant, but its lagged value, $ \Delta {poil}_t $ , is not (Table 6).
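Schematically, and with illustrative notation, the comparison reported in Table 6 amounts to augmenting the $ h=1 $ equation with oil price changes dated either contemporaneously with the target or at the forecast origin:
$ {\pi}_{t+1}=a+{\mathbf{b}}^{\prime }{\mathbf{z}}_t+c\Delta {poil}_{t+1}+{u}_{t+1} $ versus $ {\pi}_{t+1}=a+{\mathbf{b}}^{\prime }{\mathbf{z}}_t+c\Delta {poil}_t+{u}_{t+1} $ .
Only the first, which conditions on information unavailable at the forecast origin, yields a significant oil price coefficient.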
6. Conclusion
High-dimensional data are not a panacea; the data must have some predictive content, which might come from spatial or temporal sequential patterns. Forecasting is particularly challenging either if there are unknown unknowns (factors that are not even thought about) or if there are known factors that are falsely believed to be important. When there are new global factors, like Covid-19, or when a relevant variable has shown little variation over the sample period, forecasting its effect is going to be problematic.
Many forecasting problems require a hierarchical structure where latent factors at local and global levels are explicitly taken into account. This is particularly relevant for macro forecasting in an increasingly interconnected world. It is important that we allow for global factors in national forecasting exercises; the GVAR approach was an attempt in this direction.
A number of key methodological issues were illustrated with a simple approach to forecasting UK inflation, which has become a topic of public discussion. This example showed both the power of parsimonious models and the importance of global factors. There remain many challenges. How should we allow for regime change and parameter instability in high-dimensional data analysis? How should we choose data samples? Our recent research suggests that it is best to use long time series samples for variable selection, but to consider carefully which sample to use for forecasting. Given a set of selected variables, parameter estimation can be based on different window sizes or down-weighting. Should we use ensemble methods or forecast averaging? Forecast averaging will only work if the covariates used to forecast the target variable are driven by strong common factors; otherwise, one will be averaging over noise.
There are some more general lessons. Econometric and statistical models must not become a straitjacket. Forecasters should be open-minded about factors not included in their models and acknowledge that forecasts are likely to be wrong if unexpected shocks hit.
Acknowledgements
This article was developed from Pesaran’s Deane-Stone Lecture at the National Institute of Economic and Social Research, London. The authors greatly benefitted from comments when earlier versions of this article were presented in 2023 at NIESR on 21 June, at the International Association for Applied Econometrics Annual Conference in Oslo, 27-30 June 2023, at Bayes Business School, City, University of London, on 22 November 2023, and at the Economics Department-Wide Seminar, Emory University, on 16 February 2024. The authors are also grateful for comments from Alex Chudik, Anthony Garratt, George Kapetanios, Essie Maasoumi, Alessio Sancetta, Mahrad Sharifvaghefi, Allan Timmermann and Stephen Wright. The authors particularly thank their research assistant, Hayun Song, for his invaluable work on coding, empirical implementation and the tabulation of results.
Appendix
The Lasso procedure with a set of preselected variablesFootnote 11
Let $ \mathbf{y}={\left({y}_1,{y}_2,\cdots, {y}_T\right)}^{\prime } $ be the vector of observations on the target variable. Suppose we have a vector of preselected covariates denoted by $ {\mathbf{z}}_t={\left({z}_{1t},{z}_{2t},\cdots, {z}_{mt}\right)}^{\prime } $ . Additionally, there is a vector of covariates denoted by $ {\mathbf{x}}_t={\left({x}_{1t},{x}_{2t},\cdots, {x}_{nt}\right)}^{\prime } $ , from which we aim to select the covariates relevant for the target variable using the Lasso procedure. We further stack the observations on $ {\mathbf{z}}_t $ and $ {\mathbf{x}}_t $ in the matrices $ \mathbf{Z}={\left({\mathbf{z}}_1,{\mathbf{z}}_2,\cdots, {\mathbf{z}}_T\right)}^{\prime } $ and $ \mathbf{X}={\left({\mathbf{x}}_1,{\mathbf{x}}_2,\cdots, {\mathbf{x}}_T\right)}^{\prime } $ , respectively. For a given value of the tuning parameter, $ \lambda $ , the Lasso problem can be written as:
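A sketch of this objective, assuming the conventional one-half sum-of-squares scaling and with the $ {\ell}_1 $ penalty applied only to the coefficients on $ \mathbf{X} $ (the preselected $ \mathbf{Z} $ enters unpenalised), is
$ \left(\hat{\boldsymbol{\delta}}\left(\lambda \right),\hat{\boldsymbol{\beta}}\left(\lambda \right)\right)=\arg \underset{\boldsymbol{\delta},\boldsymbol{\beta}}{\min}\left\{\frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{Z}\boldsymbol{\delta} -\mathbf{X}\boldsymbol{\beta} \right\Vert}^2+\lambda {\left\Vert \boldsymbol{\beta} \right\Vert}_1\right\} $ .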
Partition $ \mathbf{X}=\left({\mathbf{X}}_1,{\mathbf{X}}_2\right) $ , where $ {\mathbf{X}}_1 $ is the matrix of covariates whose corresponding vector of estimated coefficients, $ {\hat{\boldsymbol{\beta}}}_1\left(\lambda \right) $ , is different from zero, and $ {\mathbf{X}}_2 $ is the matrix of covariates whose corresponding vector of estimated coefficients, $ {\hat{\boldsymbol{\beta}}}_2\left(\lambda \right) $ , equals zero. So, $ \mathbf{X}\hat{\boldsymbol{\beta}}\left(\lambda \right)={\mathbf{X}}_1{\hat{\boldsymbol{\beta}}}_1\left(\lambda \right) $ . By the first-order conditions we have:
and
where $ \mathbf{1} $ represents a vector of ones. We can further conclude from Equation (A.16) that:
By substituting $ \hat{\boldsymbol{\delta}}\left(\lambda \right) $ from (A.18) into (A.15), we have
We can further write this as:
Therefore,
where $ {\tilde{\mathbf{X}}}_1={\mathbf{M}}_Z{\mathbf{X}}_1 $ , $ \tilde{\mathbf{y}}={\mathbf{M}}_Z\mathbf{y} $ and $ {\mathbf{M}}_Z=\mathbf{I}-\mathbf{Z}{\left({\mathbf{Z}}^{\prime}\mathbf{Z}\right)}^{-1}{\mathbf{Z}}^{\prime } $ .
Similarly, by substituting $ \hat{\boldsymbol{\delta}}\left(\lambda \right) $ from (A.18) into (A.17), we have
Note that (A.19) and (A.20) are the first order conditions of the following Lasso problem:
Therefore, we can first obtain the estimator of the vector of coefficients on $ \mathbf{X} $ , $ \hat{\boldsymbol{\beta}}\left(\lambda \right) $ , by solving the Lasso problem given by (A.21), and then estimate the vector of coefficients on $ \mathbf{Z} $ , $ \hat{\boldsymbol{\delta}}\left(\lambda \right) $ , using Equation (A.18).
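A compact numerical sketch of this two-step computation on synthetic data is given below: step one partials out $ \mathbf{Z} $ using $ {\mathbf{M}}_Z $ , step two runs Lasso on the transformed data (here via scikit-learn, whose scaling of the penalty differs from the notation above), and the coefficients on $ \mathbf{Z} $ are then recovered as in (A.18). This is an illustration, not the code used in the article.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T, m, n = 200, 2, 20
Z = rng.standard_normal((T, m))                 # preselected (unpenalised) covariates
X = rng.standard_normal((T, n))                 # candidate covariates subject to selection
y = Z @ np.array([1.0, -0.5]) + 0.7 * X[:, 0] + rng.standard_normal(T)

# Step 1: apply M_Z = I - Z (Z'Z)^{-1} Z' to y and X.
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
y_tilde, X_tilde = y - P @ y, X - P @ X

# Step 2: Lasso on the residualised data gives beta_hat(lambda) ...
beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X_tilde, y_tilde).coef_

# ... and delta_hat(lambda) is recovered as delta_hat = (Z'Z)^{-1} Z'(y - X beta_hat).
delta_hat = np.linalg.solve(Z.T @ Z, Z.T @ (y - X @ beta_hat))
print(np.round(beta_hat[:5], 3), np.round(delta_hat, 3))
```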