1 Introduction
Most forecasting activities involve reasoning under uncertainty and require some level of probabilistic reasoning. This is true when one forecasts single quantities (e.g., the value of the market at the end of the current year, or the number of COVID cases that will be recorded next month in a certain country), but it is especially critical when generating probabilistic forecasts. Limited cognitive ability, lack of complete and/or fully reliable information, and suboptimal processing of the available information can lead to pervasive miscalibration in estimating the target probabilities (e.g., Kahneman et al., 1982; Gilovich et al., 2002). There is substantial empirical evidence that judges are often miscalibrated due to systematic biases (Lichtenstein et al., 1982; Zhang & Maloney, 2011) and random errors (Erev, Wallsten & Budescu, 1994). A large number of empirical studies have found that both single point probabilities and probability interval estimates tend to be mostly overconfident (e.g., Alpert & Raiffa, 1982; Budescu & Du, 2007; Juslin, Wennerholm & Olsson, 1999; McKenzie, Liersch & Yaniv, 2008; Fischhoff, Slovic & Lichtenstein, 1977; Klayman et al., 1999; Park & Budescu, 2015). Sometimes miscalibration carries over to specific expertise domains (Christensen-Szalanski & Bushyhead, 1981; Du & Budescu, 2018).
More specifically, judges tend to overestimate the probability of, and overweight, rare events and to underestimate and underweight highly probable events (Camerer & Ho, 1994; Fischhoff, Slovic & Lichtenstein, 1977; Moore & Healy, 2008; Wu & Gonzalez, 1996), and to avoid extreme probabilities close to 0 or 1 (Juslin, Winman & Olsson, 2000). Ariely et al. (2000) and Turner et al. (2014) have shown that this tendency to avoid extreme probability predictions can carry over to aggregated probability estimates.
Baron et al. (2014) attributed the lack of extremity in probability forecasting to two distorting factors. The first, which they labeled an end-of-scale effect, is that the distribution of the estimates near the true value (1 or 0) is not symmetric. Typically, the distribution is regressive towards 0.5, leading to over- (under-) estimation of low (high) probabilities (see analysis in Erev, Wallsten & Budescu, 1994). This causes forecasters (Footnote 1) to provide less extreme estimates when the true probability is close to the two endpoints (0 and 1). The second factor driving this bias is the forecasters’ tendency to mix individual confidence with confidence in the best forecast. Baron et al. (2014) proposed that the extent of reduction in the forecasting extremity is associated with the amount of information that the judge feels is missing.
One natural solution to the problem of miscalibration is to “debias” judges and train them to be better calibrated. This has turned out to be difficult (Alpert & Raiffa, 1982; Koriat, Lichtenstein & Fischhoff, 1980; Schall, Doll & Mohnen, 2017) and often impractical, although there are some success stories (e.g., Mellers et al., 2014). An alternative solution, which is the subject of this paper, is to recalibrate the judgments (e.g., Shlomi & Wallsten, 2010), i.e., to transform the empirical probability estimates to improve their accuracy. This is a drastically different approach, because the application of these non-linear transformations does not involve the judge(s): They are applied by the users of the estimates or by intermediaries (e.g., decision analysts) before using the forecasts to make actual decisions. For example, if a Decision Maker (DM) makes periodic decisions regarding his/her investment portfolio and believes that his/her financial advisors are systematically biased, he/she may recalibrate the estimates, hoping to reduce, if not fully eliminate, this bias before making his/her decisions.
Various transformation methods have been developed and shown to enhance forecasting accuracy (Ariely et al., 2000; Baron et al., 2014; Satopää et al., 2014; Turner et al., 2014; Mandel, Karvetski & Dhami, 2018). Turner et al. (2014) discussed the Linear Log Odds (LLO) recalibration function, which has been widely used to compensate for the distortion of individual probability forecasts (Gonzalez & Wu, 1999; Tversky & Fox, 1995). The LLO transformation recalibrates the original probability, p, by means of a linear transformation of the original log odds, to obtain the recalibrated value, p*:
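$$p^{*} = \frac{\delta p^{\gamma}}{\delta p^{\gamma} + (1-p)^{\gamma}} \tag{1}$$
This formula is derived from the linear log-odds model:
$$\log\frac{p^{*}}{1-p^{*}} = \gamma \log\frac{p}{1-p} + \tau \tag{2}$$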
where γ represents the slope, τ represents the intercept, and δ = exp(τ) in Equation 1. Turner et al. (2014) interpreted γ as a discriminability parameter, which is manifested as the curvature of the LLO function. More specifically, when γ increases (decreases), the curve becomes steeper (flatter) in the middle of the range. The other parameter, δ, was interpreted as the overall response tendency parameter, representing the vertical distance of the curve from zero.
The LLO function can be simplified by restricting its parameters to generate special cases of the general family of transformations. When δ = 1 and γ = 1, p* = p and the function represents no transformation; when δ = 1, the LLO function becomes the well-known Karmarkar equation (Karmarkar, 1978):
$$p^{*} = \frac{p^{\gamma}}{p^{\gamma} + (1-p)^{\gamma}} \tag{3}$$
This function has some attractive properties: (1) it generates probabilities (and does not require any additional normalization) for binary events, for any value of γ; (2) p* = p for three “natural” anchor points, p = 0, 0.5 and 1. The full LLO function and its simplified version (Equation 3) have been applied in a large body of studies and shown to enhance the accuracy of individual forecasts as well as aggregated forecasts (e.g., Atanasov et al., 2017; Budescu et al., 1997; Baron et al., 2014; Erev et al., 1994; Han & Budescu, 2019; Mellers et al., 2014; Satopää & Ungar, 2015; Shlomi & Wallsten, 2010; Turner et al., 2014).
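As an illustration only (it is not part of the original analyses), the two properties listed above can be checked numerically with a minimal Python sketch. It assumes the standard LLO form, p* = δp^γ / [δp^γ + (1 − p)^γ], with δ = 1 giving Karmarkar's special case; the function name `llo` and the example values are hypothetical.

```python
def llo(p: float, gamma: float, delta: float = 1.0) -> float:
    """Linear-in-log-odds recalibration (assumed standard form).

    With delta = 1 this reduces to Karmarkar's one-parameter function.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if gamma <= 0.0 or delta <= 0.0:
        raise ValueError("gamma and delta must be positive")
    weighted = delta * p ** gamma
    return weighted / (weighted + (1.0 - p) ** gamma)


# Property (2): the anchors 0, 0.5 and 1 are unchanged when delta = 1.
anchors = [llo(p, gamma=2.0) for p in (0.0, 0.5, 1.0)]

# Property (1): complementary binary events still sum to 1 when delta = 1.
total = llo(0.3, gamma=1.7) + llo(0.7, gamma=1.7)

# gamma > 1 extremizes, gamma < 1 de-extremizes.
extremized = llo(0.8, gamma=2.0)
de_extremized = llo(0.8, gamma=0.5)
print(anchors, total, extremized, de_extremized)
```

Note that the complementarity check relies on δ = 1; with δ ≠ 1 the curve is shifted and 0.5 is no longer a fixed point.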
Mellers et al. (2014) applied Karmarkar’s transformation to data generated by more than 2,000 forecasters in a geopolitical forecasting tournament (Aggregative Contingent Estimation, ACE; https://www.iarpa.gov/research-programs/ace). They showed that recalibration improved the quality of aggregated probability judgments with optimal γ greater than 1 (implying extremization of the original estimates). They also found some cases of de-extremization, with parameters less than 1. Baron et al. (2014) applied the same transformation function to the dataset of Mellers et al. (2014) and demonstrated that extremization can eliminate the two distorting effects (which cause less extremity in aggregated probability forecasts) with different estimated parameters. They also found that less extremization (smaller γ) is needed for experts than for non-expert groups, and that median aggregation requires less extremization than mean aggregation.
Turner et al. (2014) applied the full LLO function to a different group of forecasters who participated in the ACE forecasting tournament. They compared a set of models that varied in terms of whether (1) the transformation was applied before or after the aggregation, (2) the aggregation was applied to the original probability forecasts or to the log odds of the forecasts, and (3) hierarchical modeling of individual differences was used. They found that a model that first transforms the raw probability estimates and then aggregates them using log odds improves forecasting quality the most, compared to simple aggregation, and that hierarchical modeling of individual differences slightly enhances forecasting quality. A few studies used different recalibration methods. Ranjan and Gneiting (2010) applied a beta transformation and Satopää et al. (2014) used a logit model to address the lack of sharpness of probability judgments and improve the accuracy of probability forecasts. In the current paper we focus on the LLO function and seek to extend its use.
We should clarify that there is no single “best” recalibration approach. The various applications estimate parameters that seek to optimize one aspect of the forecasts, typically their accuracy. Naturally, if various people seek to optimize different features of the forecasts, they may choose different approaches that can lead to different transformations. Previous studies focused on the recalibration of single (point) probability forecasts associated with simple binary events (e.g., What is the probability that it will rain tomorrow in city Z? What is the probability that candidate A will win next month’s election in country Y?). This is, of course, a widely used elicitation format in forecasting. Yet recent studies have focused on elicitation methods that seek to estimate complete subjective probability distributions of continuous random variables in a relatively efficient way (Abbas et al., 2008; Haran, Moore & Morewedge, 2010; Wallsten, Shlomi, Nataf & Tomlinson, 2016). Abbas et al. (2008) discussed the Fixed Probability (FP) and the Fixed Value (FV) methods, both of which elicit points along the cumulative distribution of a target variable, X. Haran, Moore and Morewedge (2010) formalized and validated the Subjective Probability Interval Estimates (SPIES) method, in which judges are asked to allocate probabilities to several predefined bins that represent a C-fold (mutually exclusive and exhaustive) partition of the full range of the target variable.
Several large-scale forecasting projects, including the Survey of Professional Forecasters (SPF) of the European Central Bank (ECB; Garcia, 2003) and of the Federal Reserve Bank of Philadelphia (Croushore, 1993), use this “bin” method to collect expert forecasters’ judgments regarding macroeconomic indicators such as inflation and GDP growth rate.
2 The current paper
In this paper, we describe an extension of Karmarkar’s transformation function that can be applied simultaneously to any number of points on the cumulative distribution, F(X) (Footnote 2), of a random variable, X. These (C − 1) points on F(X) can be obtained by any of the methods described earlier but, to fix ideas, it is probably best to think of the SPIES (bins) method, where judges assign probabilities to each of the C discrete bins. We illustrate this recalibration approach by re-analyzing data from the quarterly Survey of Professional Forecasters (SPF) conducted by the European Central Bank (ECB). We seek to determine under what circumstances the proposed recalibration method can improve the accuracy of the forecasts and what degree of improvement can be expected, and we illustrate how this approach can be used in practice.
2.1 The transformation function
In this context, recalibration means moving the distribution away from an uninformed uniform distribution that assigns an equal probability, 1/C, to each of the C bins. In the binary case the one-parameter Karmarkar function applies a linear transformation to the log-odds of the event. Here we apply the same approach to the ratio of the odds inferred from the probability assigned to any given bin, P(Bin)/(1 − P(Bin)), to the odds under equal probability, i.e., 1/(C − 1), so for every bin the recalibrated probability P* is obtained as:
$$\log\left[\frac{(C-1)\,P^{*}}{1-P^{*}}\right] = \gamma \log\left[\frac{(C-1)\,P}{1-P}\right] \tag{4}$$
This implies the transformation function:
$$P^{*} = \frac{\left[(C-1)\,P\right]^{\gamma}}{\left[(C-1)\,P\right]^{\gamma} + (C-1)\,(1-P)^{\gamma}} \tag{5}$$
In the binary case, C = 2, this formula recovers Karmarkar’s transformation (Equation 3). If the parameter γ > 1, all probabilities > 1/C increase (i.e., move closer to 1) and all probabilities < 1/C decrease (move closer to 0), so the recalibration extremizes the distribution. If the parameter γ < 1, the pattern is reversed, so the transformation de-extremizes the distribution, and if γ = 1 the distribution is not transformed. Three “anchor” probabilities (0, 1/C and 1) are invariant under the transformation for all values of γ. This implies that, for any given C, all transformation curves cross at p_i = 1/C. Figure 1 illustrates the effects of the transformation. Figures 1A and 1B apply various parameters to the cases C = 3 and C = 5, respectively; Figures 1C and 1D display the effect of two transformations (γ = 0.5 and 2) for various values of C. Naturally, one can use more complex recalibration functions with additional parameters, but we opted to focus on this simple, intuitive and easy-to-interpret function.
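The behavior described in this paragraph can be checked with a short Python sketch. It assumes that the transformation takes the form P* = [(C − 1)P]^γ / {[(C − 1)P]^γ + (C − 1)(1 − P)^γ}, which is the form that reproduces the worked example in Section 2.3; the function name `recalibrate` and the test values are illustrative, not taken from the paper.

```python
def recalibrate(p: float, gamma: float, n_bins: int) -> float:
    """Multi-bin extension of Karmarkar's transformation (assumed form).

    The odds p / (1 - p) are expressed relative to the odds under a uniform
    distribution, 1 / (n_bins - 1), and raised to the power gamma.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if gamma <= 0.0 or n_bins < 2:
        raise ValueError("gamma must be positive and n_bins at least 2")
    scaled = ((n_bins - 1) * p) ** gamma
    return scaled / (scaled + (n_bins - 1) * (1.0 - p) ** gamma)


# The anchors 0, 1/C and 1 are invariant for every gamma and every C.
for n_bins in (2, 3, 5, 9):
    for gamma in (0.5, 2.0):
        fixed = [recalibrate(p, gamma, n_bins) for p in (0.0, 1.0 / n_bins, 1.0)]
        print(n_bins, gamma, fixed)

# With C = 3 and gamma = 2, values above 1/3 move up and values below move down.
print(recalibrate(0.5, 2.0, 3), recalibrate(0.2, 2.0, 3))
```

Setting `n_bins=2` reduces the function to Karmarkar's binary form, so the same code covers both cases.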
2.2 The calibration function
Whether or not probability forecasts are transformed, ordinal forecasts (such as those implied by the binning of a continuous variable) are assessed by the ordinal Brier score. This scoring function depends on, and is sensitive to, the specific bin that includes the eventual resolution of the event. More specifically, if a forecaster assigns the same probability to several bins, each will be scored differently, as a function of its proximity to the bin of the correct answer. Following Jose et al. (2009), the score is defined by considering all (C − 1) binary partitions that preserve the ordering of the categories, [F_1, (1 − F_1)]; [F_2, (1 − F_2)]; ⋯; [F_{C−1}, (1 − F_{C−1})] (Footnote 3), computing the (binary) Brier score for each of these partitions and then averaging them. If the eventual outcome is in the R’th (1 ≤ R ≤ C) bin, it is possible to show that:
The last formula can be re-expressed in another form that highlights how the BS depends not only on the distribution of forecasts, F_i, but also on the bin of the correct response, R, and the way its location “splits” the distribution over the C bins:
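A minimal Python sketch of this scoring rule follows. It assumes the two-component binary Brier score averaged over the (C − 1) ordered splits, i.e., BS = [2/(C − 1)] × [Σ_{i<R} F_i² + Σ_{i≥R} (1 − F_i)²]; this form (including the factor of 2) is an assumption, chosen because it reproduces the BS = 0.112 reported for the untransformed forecast in the example of Section 2.3. The function name `ordinal_brier` is hypothetical.

```python
def ordinal_brier(cum_probs, true_bin):
    """Ordinal Brier score from cumulative probabilities F_1 .. F_{C-1}.

    cum_probs: cumulative probabilities at the C - 1 interior bin boundaries.
    true_bin:  1-based index R of the bin that contains the outcome.
    """
    total = 0.0
    for i, f in enumerate(cum_probs, start=1):
        outcome = 1.0 if true_bin <= i else 0.0  # 1 if the truth lies in bins 1..i
        total += 2.0 * (f - outcome) ** 2        # both components of the binary score
    return total / len(cum_probs)


# Forecaster ID # 1 in Section 2.3: C = 9 bins, truth in the 6th bin.
raw = [0, 0, 0, 0.15, 0.65, 0.95, 1, 1]
print(round(ordinal_brier(raw, 6), 3))

# Sensitivity to proximity: all mass in bin 5 (adjacent to the truth)
# scores better than all mass in bin 1 (far from the truth).
near = [0, 0, 0, 0, 1, 1, 1, 1]
far = [1, 1, 1, 1, 1, 1, 1, 1]
print(ordinal_brier(near, 6), ordinal_brier(far, 6))
```

Under this assumed form the score ranges from 0 (all mass in the correct bin) to 2 (all mass in a bin at the opposite end of the scale from the truth).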
2.3 The recalibration procedure
When recalibrating real forecasts, there are two options for the choice of the recalibration parameter γ. One can apply a pre-determined value based on previous experience, experts’ advice, etc. Alternatively, one can estimate the optimal parameter γ that maximizes the accuracy of the transformed forecasts, where accuracy is measured by the criterion of choice (in our case, the Brier score). We focus on the latter approach and estimate optimal values of γ. The bins (categories) in the ECB data are ordinal, so it is more convenient to recalibrate cumulative probabilities (if we recalibrated specific bin probabilities, we would need an additional normalization step to make the C recalibrated bin probabilities sum to 1).
With these considerations in mind, we implemented the following recalibration procedure. First, cumulative probabilities F_i for each valid case were computed from the probabilities assigned to the various bins (F_i = Σ_{j=1}^{i} P(bin_j)). Second, the extremization function (Equation 5) was applied to the cumulative probabilities F_i and the recalibration parameter, γ, was estimated by minimizing the corresponding ordinal Brier Score (BS). Finally, the optimal parameter was applied to the relevant probabilistic forecasts and the forecasting performance was evaluated by calculating the ordinal BS of the recalibrated forecasts.
Consider one forecaster in the ECB data set: In 2001Q1, participant (ID # 1) assigned probabilities to the 9 possible bins for the inflation of the current year: {0, 0, 0, 0.15, 0.50, 0.30, 0.05, 0, 0}. The 9 corresponding cumulative probabilities are {0, 0, 0, 0.15, 0.65, 0.95, 1, 1, 1}. The transformation function in Equation 5 was applied to these cumulative probabilities, so that each of the 9 transformed cumulative probabilities was expressed as a function of the single parameter γ; e.g., for F_4 = 0.15, the transformed cumulative probability is F_4* = (8 × 0.15)^γ / [(8 × 0.15)^γ + 8 × 0.85^γ]. The ordinal BS was expressed as a function of γ by plugging the transformed cumulative probabilities F_1* to F_8* into Equation 8 (the ground truth for this item was 2.35, which is in the 6th bin, hence Equation 8 is appropriate). The optimal parameter γ was estimated by minimizing the ordinal BS function (in this case, γ = 0.62). (Footnote 4) The recalibrated cumulative distribution over the C = 9 bins after the optimal transformation became {0, 0, 0, 0.13, 0.40, 0.74, 1, 1, 1}. The BS of the optimally recalibrated forecasts is BS = 0.062, compared to BS = 0.112 for the original forecasts. The cumulative distributions before and after transformation of this de-extremization example (γ < 1) are plotted in the upper two panels of Figure 2. Another example, of extremization (γ > 1), based on a different forecaster (ID # 2) for the same event in the same round, is provided in the lower two panels. In this case, the raw cumulative probabilities are {0, 0, 0, 0, 0, 0.7, 1, 1, 1}, and after transformation based on the optimal γ = 3.91, the transformed cumulative probabilities are {0, 0, 0, 0, 0, 0.99, 1, 1, 1}.
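The three steps of the procedure can be sketched in Python for the first forecaster. This is a sketch under stated assumptions, not the authors' implementation: the transformation and the ordinal Brier score take the forms assumed from the worked numbers above, and because the optimizer is not described in this section, a plain grid search over γ (up to a hypothetical cap of 10) stands in for it. All function names are illustrative.

```python
from itertools import accumulate


def recalibrate(p, gamma, n_bins):
    """Multi-bin extension of Karmarkar's transformation (assumed form)."""
    p = min(1.0, max(0.0, p))  # guard against floating-point drift outside [0, 1]
    scaled = ((n_bins - 1) * p) ** gamma
    return scaled / (scaled + (n_bins - 1) * (1.0 - p) ** gamma)


def ordinal_brier(cum_probs, true_bin):
    """Mean two-component Brier score over the C - 1 ordered binary splits."""
    total = 0.0
    for i, f in enumerate(cum_probs, start=1):
        outcome = 1.0 if true_bin <= i else 0.0  # 1 if the truth lies in bins 1..i
        total += 2.0 * (f - outcome) ** 2
    return total / len(cum_probs)


def fit_gamma(cum_probs, true_bin, n_bins, step=0.001, gamma_max=10.0):
    """Grid search (an assumption) for the gamma minimizing the ordinal BS."""
    best_gamma, best_bs = 1.0, ordinal_brier(cum_probs, true_bin)
    for k in range(1, int(round(gamma_max / step)) + 1):
        gamma = k * step
        transformed = [recalibrate(f, gamma, n_bins) for f in cum_probs]
        bs = ordinal_brier(transformed, true_bin)
        if bs < best_bs:
            best_gamma, best_bs = gamma, bs
    return best_gamma, best_bs


# Step 1: cumulative probabilities F_1 .. F_8 (F_9 = 1 carries no information).
bin_probs = [0, 0, 0, 0.15, 0.50, 0.30, 0.05, 0, 0]
n_bins = len(bin_probs)
cum_probs = [min(1.0, c) for c in accumulate(bin_probs)][:-1]
true_bin = 6  # the realized value, 2.35, falls in the 6th bin

# Steps 2 and 3: estimate gamma, then score the recalibrated forecast.
raw_bs = ordinal_brier(cum_probs, true_bin)
gamma_hat, recal_bs = fit_gamma(cum_probs, true_bin, n_bins)
print(f"raw BS = {raw_bs:.3f}, gamma = {gamma_hat:.3f}, recalibrated BS = {recal_bs:.3f}")
```

With these assumptions the search should land close to the γ = 0.62 and BS = 0.062 reported in the text; small differences in the third decimal of γ are expected because the grid and the original optimizer need not agree exactly.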
3 Data
The Survey of Professional Forecasters (SPF) is a quarterly survey conducted by the European Central Bank (ECB) since 1999. The experts, who are affiliated with financial or non-financial institutions in the EU, forecast several macroeconomic indicators for the European Union (EU). At each release of the survey, participants are asked to report their forecasts of future HICP (Harmonised Index of Consumer Prices) inflation, the real GDP growth rate and the unemployment rate in the Euro zone.
The ECB survey elicits expectations for different forecasting horizons (forecasts of the current year, the next calendar year, the year after next, and five/six (Footnote 5) years ahead of the current calendar year) for the three variables. The survey elicits both point estimates of these quantities and full probability distributions of the target quantities using the bin method (Haran, Moore & Morewedge, 2010) (see sample questionnaire in Appendix A). We re-analyze the probability distributions for the period from the first quarter of 2001 (2001 Q1) to the last quarter of 2017 (2017 Q4), i.e., 72 successive quarters. (Footnote 6) A total of 99 experts forecasted at least once during this period, but not every expert forecasted all quantities every quarter. Our data set consists of 14,117 forecasts, which translates into an average of about 196 forecasts per target quantity per quarter and 49.02 forecasts per target quantity and time horizon (Footnote 7) in a quarter. Some cases were removed for the following four reasons (see details in Table 1):
1. Missing cases (no values were assigned to any bins).
2. The probability estimates of the C bins did not sum to 1.
3. True values (ground truths) for the target quantities were not available at the time of the analysis, e.g., probability estimates of year 5 in 2017 Q3 (forecasts of 2022).
4. The parameter estimation procedure failed.
The number of bins and their corresponding upper and lower bounds of all three indicators are determined by the ECB and change over time, as shown in Table 2.
4 Results
Basic descriptive statistics of the re-calibration parameter and the corresponding BS for the three macroeconomic indicators are summarized in Table 3. The cumulative distributions of the estimated parameters of the three indicators across all cases are presented in Figure 3. There is a considerable number of values close to 0, and a large number of very high values. It appears that GDP requires the least recalibration, and a considerable number of cases are optimized with de-extremization (optimal γ < 1). The cumulative distribution of the re-calibrated Brier scores is displayed in Figure 4. These scores indicate that GDP is the hardest indicator to predict and inflation is the most predictable. The presence of many high parameter values distorts the distribution, so we also present (in Figure 5) the distribution of parameters based only on those cases where the estimated parameters do not exceed 10. This eliminates between 10 and 15% of the cases for the various indicators (Table 3).
4.1 Re-calibration parameters and forecasting horizon
Descriptive statistics of re-calibration parameters of different forecasting horizons (FH) of all three economic indicators are summarized in Tables 4–6. We present here only the cases where γ ≤ 10. Analyses of the full data set yield similar results and are relegated to Appendix C.
Notes: Balance = (cases with γ > 1 − cases with γ < 1)/(cases with γ > 1 + cases with γ < 1).
Skew = (Mean − Median)/SD.
Several regularities stand out in these displays: (1) There is a general (but not strict; Footnote 8) monotonic pattern in the mean parameter values: longer-term forecasts are optimized with higher recalibration parameters than shorter-term forecasts for inflation and GDP. This pattern does not hold for the unemployment rate, where years 5 and 6 do not yield higher estimated parameters than years 1, 2 and 3. (2) In almost all cases, the variability of the optimal parameters (as measured by their SDs and IQRs) increases as a function of the forecasting horizon (here, again, the variability of the unemployment rate parameters for years 5 and 6 is unusual). (3) The distributions of the estimated parameters are skewed to the right: In most cases the median γ is below 1, indicating that the majority of the forecasts are being de-extremized to some degree but, on the other hand, a minority of cases induce very large extremization, driving up the mean γ.
To confirm that longer-term forecasts require higher recalibration, we combined the four forecasting horizons into two classes, with Years 1 and 2 representing “short term” and Years 3 and 5/6 representing “long term”, and compared the parameters of the two classes for all three economic indicators. Table 7 shows that long-term forecasts always require larger parameters, indicating that estimates of distant events are likely to be more conservative than those of closer events and therefore require greater extremization to optimize accuracy. This observation is confirmed by the significant t-tests between the two time horizons for inflation (t(4,899) = –10.2, p < .05) and GDP (t(6,011) = –6.45, p < .05), but is not supported by the unemployment forecasts, t(6,072) = 5.57, p > .05.
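For readers who want to reproduce the Balance and Skew indices defined in the table notes, the following Python sketch implements them directly from those definitions. The sample of γ values is made up for illustration, and the use of the sample (n − 1) standard deviation is an assumption, since the notes do not say which SD is meant.

```python
from statistics import mean, median, stdev


def balance(gammas):
    """(cases with gamma > 1 - cases with gamma < 1) / (their sum)."""
    above = sum(g > 1 for g in gammas)
    below = sum(g < 1 for g in gammas)
    if above + below == 0:
        return 0.0  # every case has gamma == 1, so there is nothing to balance
    return (above - below) / (above + below)


def skew(gammas):
    """(Mean - Median) / SD, using the sample SD (an assumption)."""
    return (mean(gammas) - median(gammas)) / stdev(gammas)


# Hypothetical gamma estimates: most below 1, one very large value.
gammas = [0.4, 0.6, 0.8, 0.9, 1.0, 1.5, 6.0]
print(balance(gammas), skew(gammas))
```

A negative Balance together with a positive Skew is the pattern described in point (3): most cases are de-extremized, while a few large values pull the mean above the median.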
4.2 Brier score improvement and forecasting horizons
In this section we document the benefits of recalibration in terms of Brier scores. Let the Relative Brier Score Difference (RBSD) measure the improvement in accuracy that can be attributed to re-calibration. More specifically, let
$$\mathrm{RBSD} = \frac{BS_{\text{original}} - BS_{\text{recalibrated}}}{BS_{\text{original}}}$$
Higher (lower) RBSD indicates more (less) improvement in the forecasting quality. Overall, recalibration significantly improved the accuracy of the forecasts (Mean RBSD = .569, .452 and .574 for the three indicators). Figures 6–8 summarize the RBSD for different forecasting horizons, showing that distant forecasting horizons yield lower RBSD and benefit less from recalibration, compared to the closer forecasting horizons. In fact, recalibration is most effective and beneficial for year 1 (Mean RBSD = .621, .539 and .623 for the three indicators).
4.3 Practical applications involving out of sample re-calibration
In the previous sections we estimated optimal re-calibration parameters (γ) for each forecast and used these case-specific estimates to re-compute the BS and illustrate the effectiveness of the approach. These are, essentially, proof-of-concept results, but this analysis is analogous to in-sample prediction and, as such, is subject to overfitting. This is neither a practical approach for predicting future events, nor the optimal method for testing the efficacy of the approach in real-life applications.
In practical settings, one would estimate the optimal parameters based on past performance. This is only possible after the ground truth is revealed which, in the cases studied here, takes a long time. More precisely, the minimal waiting period is the target time horizon. For example, if at time t we wish to predict the value of an indicator at time (t+k) we need to rely on the optimal aggregate of the forecasts provided at time (t−k), which resolve, and allow estimation of γ , only at time t.
Another consideration that affects the best use of historical information is how to best utilize case-specific estimates of the past quarters (the group of judges forecasting in every quarter may also change over time, so it is impossible to generate individual-specific parameters). To explore practical strategies for recalibrating forecasts, we compared the performance of five different types of re-calibration parameters that “borrow” information from other forecasts and forecasters.
1. Domain-specific: The median of all the case-specific parameters (γs) (collapsing all the time horizons and quarters) of any given economic indicator.
2. Quarter-specific: The median of all the case-specific parameters (γs) (collapsing all the time horizons) of the same quarter for each economic indicator.
3. Forecast horizon specific: The median of all the case-specific parameters (γs) (collapsing all the quarters) of the same forecasting horizon for each economic indicator.
4. Quarter & Forecast horizon specific: The median of all the case-specific parameters (γs) of the same forecasting horizon and the same quarter for each economic indicator.
5. Aggregate: Estimate the optimal parameter (γ) for the mean probability distribution (Footnote 9) for any given quarter and FH for every indicator.
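The four median-based parameter types in this list differ only in how the case-specific γs are grouped, which the following Python sketch makes explicit. The record layout (fields `indicator`, `quarter`, `horizon`, `gamma`) and the toy values are hypothetical and not taken from the ECB data; the fifth (aggregate) type is not covered here because it requires refitting γ to the mean distribution rather than pooling existing estimates.

```python
from collections import defaultdict
from statistics import median


def pooled_gammas(cases, keys):
    """Median case-specific gamma within each group defined by `keys`."""
    groups = defaultdict(list)
    for case in cases:
        groups[tuple(case[k] for k in keys)].append(case["gamma"])
    return {group: median(values) for group, values in groups.items()}


# Hypothetical case-specific estimates.
cases = [
    {"indicator": "inflation", "quarter": "2001Q1", "horizon": 1, "gamma": 0.6},
    {"indicator": "inflation", "quarter": "2001Q1", "horizon": 1, "gamma": 0.8},
    {"indicator": "inflation", "quarter": "2001Q1", "horizon": 2, "gamma": 1.4},
    {"indicator": "inflation", "quarter": "2001Q2", "horizon": 1, "gamma": 2.0},
    {"indicator": "gdp", "quarter": "2001Q1", "horizon": 1, "gamma": 1.1},
]

domain = pooled_gammas(cases, ["indicator"])                                  # type 1
by_quarter = pooled_gammas(cases, ["indicator", "quarter"])                   # type 2
by_horizon = pooled_gammas(cases, ["indicator", "horizon"])                   # type 3
by_quarter_horizon = pooled_gammas(cases, ["indicator", "quarter", "horizon"])  # type 4
print(domain, by_quarter, by_horizon, by_quarter_horizon, sep="\n")
```

Using the median rather than the mean within each group keeps the pooled value robust to the very large γ estimates noted in Section 4.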
We calculated the parameters based on these five approaches, used them to re-calibrate the forecasts, and we compared the BS obtained from the different selections to the performance of two baselines: No recalibration (γ = 1) and the optimal recalibration based on the case-specific γ . The results are summarized in Table 8. For all three indicators, the aggregate γ performs best (closest to the case specific upper bound) and the quarter & forecasting horizon specific γ performs the second best. Both approaches systematically outperform the original (untransformed) forecasts across all three domains.
Note: The best two methods are highlighted
Given these results, we first estimated the two top-performing γ parameters (the optimal aggregate γ and the quarter & forecasting horizon specific γ) in every quarter and for every relevant time horizon for the three indicators, and used these parameters to re-calibrate the relevant forecasts (i.e., same time horizon for each indicator) for the next period. For example, if forecasts made at 2002Q1 target one calendar year ahead, we estimated the best γ based on forecasts made a year earlier (2001Q1) as soon as the target events resolved (at 2002Q1) and used them to predict the next round of forecasts for the same time horizon (2003Q1).
Table 9 presents the mean Brier scores across all relevant quarters for every time horizon and indicator. The first and the second panels of the table show the original Brier scores (untransformed) and the case-specific Brier scores as the lower and the upper benchmarks. The third panel shows the Brier scores based on the recalibrated forecasts based on the optimal aggregate γ of the previous period and the fourth panel shows the scores based on the quarter & forecasting horizon specific γ of the previous period. Both sets of γ parameters estimated from the previous periods outperform the untransformed BS, but only for short-term forecasts for the current year. For the longer horizons the performance of the optimal parameters based on the previous periods does not outperform the baseline BS. This pattern is consistent for all three indicators.
Note: Cases where recalibration improved the Brier Scores are highlighted.
Table 10 focuses on the current-year forecasts for the various indicators and displays the number of individual forecasts for which applying one of the two approaches improved or, conversely, worsened the Brier score (we excluded cases where γ = 1, for which the Brier score is unaltered). In a significant majority of the cases, recalibration using the two top-performing estimates of the γ parameter from the previous period was successful. The optimal aggregated γ of the previous period yields better (lower) BSs than the untransformed baseline in at least 77% of the cases for the three indicators. The quarter & horizon specific γs of the previous period improved the BSs in at least 69% of the cases for the three indicators.
The two sets of estimates of quarter and domain specific γs are highly consistent, as shown in Figure 9. After excluding a few extreme estimates, and concentrating only on cases where γ ≤ 10, the two sets correlate highly (r = 0.87). The aggregated γ performed better than the quarter & horizon specific γ, and the out-of-sample recalibration worked best for the GDP forecasts, for both methods.
5 Beyond Brier Scores
Our approach was driven by the desire to improve the accuracy of the probabilistic forecasts, as measured by their Brier Scores. This choice is motivated and justified by the fact that accuracy is, typically, the top desideratum of good forecasts, and that Brier Scores are considered by many to be the “gold standard”. For example, they are often used in forecasting competitions (e.g., Himmelstein, Atanasov & Budescu, 2021; Mellers et al., 2014). However, as some of the reviewers of this manuscript have pointed out, this is not the sole criterion one could consider and, in fact, several appealing alternatives are well documented (e.g., Steyvers, Wallsten, Merkle & Turner, 2014).
In this section we illustrate the effect of the recalibration on an alternative quality measure. Many people prefer evaluating the quality of forecasts by comparing a single best value, extracted from the distribution, to the ground truth. This approach is seen as simpler and easier to interpret, because its scale is more intuitive than Brier. In this spirit, we calculated the median of each distribution in our sample (Raw and Transformed form) and calculated its Relative Absolute Distance (RAD) to the ground truth:
Figures 10–12 display the joint distributions of the Raw and Transformed RADs for the three indicators. Most of the points lie below the respective diagonals, indicating that the recalibrated distributions provide more accurate predictions. Thus, on average and in most individual cases, the medians inferred from the recalibrated distributions are closer to the eventual outcomes. The proportion of cases where the recalibration improved the point prediction is 76.02% (Mean improvement = 0.65, SD = 1.49) for Inflation, 78.16% (Mean improvement = 0.48, SD = 1.01) for GDP and 72.23% (Mean improvement = 0.04, SD = 0.07) for Unemployment.
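The point-prediction comparison can be sketched as follows. Two assumptions are made here and are not taken from the text: the median of a binned forecast is obtained by linear interpolation within the bin that contains the 50th percentile (which requires finite bin edges), and RAD is taken to be |median − truth| / |truth|. The names `binned_median` and `rad` are hypothetical.

```python
def binned_median(edges, probs):
    """Median of a histogram forecast, by linear interpolation inside
    the bin where the cumulative probability crosses 0.5.
    `edges` holds len(probs) + 1 increasing, finite bin boundaries
    (an assumption; open-ended tail bins would need to be closed)."""
    total = sum(probs)
    cum = 0.0
    for lo, hi, p in zip(edges[:-1], edges[1:], probs):
        p = p / total  # guard against probabilities not summing to 1
        if p > 0 and cum + p >= 0.5:
            return lo + (0.5 - cum) / p * (hi - lo)
        cum += p
    return edges[-1]  # fallback for rounding at the upper end


def rad(point, truth):
    """ASSUMED definition of Relative Absolute Distance:
    absolute error scaled by the magnitude of the realized value."""
    return abs(point - truth) / abs(truth)


# Hypothetical three-bin forecast over [0, 3] and a realized value of 2.0.
median = binned_median([0.0, 1.0, 2.0, 3.0], [0.2, 0.6, 0.2])
print(median, rad(median, 2.0))
```

Under this reading, a forecast counts as improved when the RAD of the Transformed median is smaller than the RAD of the Raw median, which corresponds to a point below the diagonal in the scatterplots.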
We should clarify that each quality criterion can, in principle, be used to derive an optimal transformation (e.g., one could seek to derive distributions such that their RAD, or other metrics, be minimized). We focused on the Brier score but this example illustrates that this transformation can also benefit other relevant measures of quality.
6 Concluding remarks
There are several compelling examples in the forecasting literature (e.g., Baron et al., 2014; Turner et al., 2014) illustrating the benefits of recalibration of individual forecasts, as well as aggregates of multiple forecasts, of the target events. These examples involve binary events and, as such, amount to recalibrating (extremizing or de-extremizing) a single probability. In this paper we proposed, to our knowledge, the first extension of this approach that allows one to recalibrate a cumulative probability function based on C of its quantiles in a consistent and coherent way that is captured by its single parameter, γ. The recalibration function is defined relative to the uniform distribution and its impact is defined in relation to the invariant “anchor”, Prob = 1/C, in the sense that probabilities below or above this anchor are transformed in different directions. The recalibration function generalizes Karmarkar’s transformation, which has often been used in the special case C = 2.
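For the binary special case (C = 2), the role of the anchor can be illustrated with Karmarkar’s one-parameter form, p^γ / (p^γ + (1 − p)^γ). The sketch below covers only this binary case and is not the multi-quantile function proposed in the paper; the function name `karmarkar` is chosen for illustration.

```python
def karmarkar(p, gamma):
    """Karmarkar's transformation of a single probability (binary case).
    p = 0.5 (i.e., 1/C with C = 2) is left unchanged for every gamma;
    gamma > 1 pushes probabilities away from 0.5 (extremizing) and
    gamma < 1 pulls them toward 0.5 (de-extremizing)."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma)


for gamma in (0.5, 1.0, 2.0):
    # Transformed values of 0.2, 0.5 and 0.8 under each gamma.
    print(gamma, [round(karmarkar(p, gamma), 3) for p in (0.2, 0.5, 0.8)])
```

Probabilities below and above the anchor move in opposite directions, and the transformed probabilities of an event and its complement still sum to 1.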
We discussed some of the properties of the proposed function and illustrated its use by re-analyzing a large body of forecasts for three economic indicators made by almost 100 experts and spanning 72 quarters. This analysis confirmed that recalibration can be highly beneficial (see Figures 6–8) and we found that its effects are not uniform, in the sense that not all indicators benefit equally. It also clearly showed that, on average, longer term forecasts require more aggressive recalibration. Finally, we have illustrated obvious practical applications of our approach by showing how one can use recalibration parameters estimated in previous periods to significantly increase the accuracy of future short-term forecasts.
We make no claims of optimality or uniqueness regarding our approach. The method we used was developed as a straightforward generalization of the simplest function used in binary cases, using a single parameter. We expect that more complex functions could improve accuracy further, and we hope that future work in this area will explore alternative, possibly more flexible and powerful, recalibration functions. One issue we did not study is how the function operates when applied to distributions that are elicited at various levels of precision (i.e., number of bins). In our dataset, the experts were typically given more than 10 bins (see details in Table 2), and we observed that many tail bins were often assigned probabilities of 0. One way to improve the recalibration process may be to develop algorithms that are sensitive to the total number of bins and/or the way the judges use them.
An interesting question that was raised by one of the reviewers of the paper is whether one should consider forecast recalibration as a one-shot adjustment, or as an additional component to be implemented periodically as part of the forecasting process. We believe that the answer is somewhere in between these two extremes. In a perfectly stable and stationary world, once a transformation function is identified it could be applied routinely to all new forecasts in the same domain. However, recalibration is not perfect (see our results), the estimation is susceptible to random errors and capitalization on chance and, at least in principle, it could be improved as more data become available. And, of course, the world is not stationary, and the circumstances that drive the behavior of the target variables of interest may change over time, making older recalibration parameters suboptimal or obsolete.