1 Introduction
Expert judgments are frequently made under conditions of uncertainty. Consequently, those judgments are often conveyed as probability estimates to end-users whose decisions and outcomes, in turn, may be affected by such information. For instance, in medicine, the estimated probability of adverse side effects may influence patients' willingness to undergo certain treatments (Berry et al., 1997; Ziegler et al., 2001). In criminal proceedings, expert communication of uncertainty underlying forensic evidence may shape the conclusions of judges or juries (Ligertwood & Edmond, 2012; McQuiston-Surrett & Saks, 2008). In national security policymaking, intelligence assessments are usually qualified by probabilities that can shape consequential decisions, including whether to go to war (Kent, 1964; Marchio, 2014; Debs & Monteiro, 2014). Indeed, the communication of uncertainty is central to all domains of public policymaking (Funtowicz & Ravetz, 1990).
Many organizations and professional groups that produce expert judgments prefer to express uncertainties with verbal probabilities such as "likely" or "unlikely" rather than with precise numeric probabilities such as "70% chance" or imprecise numeric ranges such as "60% to 80% chance" (e.g., Dhami & Mandel, 2020; Ho et al., 2015). For instance, in a recent study of National Weather Service tweets, 99.9% of probabilistic forecasts were made using verbal probability expressions (Lenhardt et al., 2020). Accountants also tend to prefer verbal probabilities, despite the quantitative basis of the profession (Kolesnika et al., 2019). This tendency is partly attributable to the belief that end-users will not be able to effectively process numeric probabilities (e.g., Lewis et al., 2019) and partly attributable to the greater ease of producing assessments that are qualitatively rather than quantitatively qualified (Wallsten et al., 1993). As Beyth-Marom (1982) suggested, the preference for verbal probabilities may also be motivated by a desire to have one's probabilistic judgments remain less verifiable in terms of accuracy. Several studies have shown a "communication mode preference paradox" in which, on average, senders prefer verbal probabilities but receivers prefer numeric probabilities (Brun & Teigen, 1988; Erev & Cohen, 1990; Wallsten et al., 1993). In spite of senders' preferences, extensive research has shown intrapersonal imprecision and interpersonal inconsistency in how people translate verbal probabilities into numeric equivalents (e.g., Beyth-Marom, 1982; Budescu & Wallsten, 1985; Dhami & Wallsten, 2005; Harris et al., 2013; Lichtenstein & Newman, 1967).
The detrimental consequences of using verbal probabilities to convey uncertainties have been noted in past literature (e.g., Dhami & Mandel, 2020; European Food Safety Authority et al., 2018; Friedman, 2019; Mandel & Irwin, 2020; Morgan, 1998). As noted already, verbal probabilities are fuzzy in their interpretation and can vary greatly in meaning across individuals. Compared to numeric probabilities, verbal probabilities are judged to be less clear in their communication of degrees of probability (Collins & Mandel, 2019), and they are prone to communicating implicit recommendations for action through their directionality (Teigen & Brun, 1995, 1999): recommendations which may have policy-biasing effects in contexts such as national security intelligence, which has long focused on sustaining policy neutrality (Kent, 1951). However, instead of using numeric probabilities in their communications of risk and uncertainty, most organizations that disseminate probabilistic assessments have adopted numerically bounded linguistic probability (NBLP) schemes that prescribe an ordinal scale of verbal probabilities, each associated with a numeric probability range (Ho et al., 2015; Mandel, Wallsten et al., 2021) (Footnote 1). For example, Table 1 shows the five-point NBLP scheme currently used in NATO intelligence doctrine (NATO, 2016; Dhami & Mandel, 2020, and Mandel & Irwin, 2020, discuss other NBLP schemes used in intelligence communities for communicating probabilities). According to this methodology, an analyst who judges an event to have a probability ≥ 60% and ≤ 90% should describe it as likely. Conversely, an analyst who describes an event as likely should agree that the probability falls within the associated range.
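To make the stipulated term-to-range mapping concrete, the sketch below encodes a five-point lexicon of this kind in Python. Note an assumption: only the middle three ranges (unlikely, even chance, likely) are quoted in the text, so the two extreme terms and their bounds are illustrative placeholders rather than a reproduction of Table 1.

```python
# Illustrative encoding of a five-point NBLP scheme of the kind shown in Table 1.
NATO_LEXICON = {
    "highly unlikely": (0, 10),    # assumed for illustration
    "unlikely":        (10, 40),   # quoted in the text
    "even chance":     (40, 60),   # quoted in the text
    "likely":          (60, 90),   # quoted in the text
    "highly likely":   (90, 100),  # assumed for illustration
}

def terms_for_probability(p: float) -> list[str]:
    """Return the term(s) whose stipulated range covers probability p (in %)."""
    return [term for term, (lo, hi) in NATO_LEXICON.items() if lo <= p <= hi]

print(terms_for_probability(75))  # ['likely']
print(terms_for_probability(60))  # boundary case: ['even chance', 'likely']
```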
However, studies show that even when participants are given the relevant NBLP scheme, they continue to show poor agreement with it (measured as the percentage of overlap between the numeric ranges in the standard and participants' ranges, or by the proportion of participants whose best numeric equivalence estimates fall within the stipulated ranges). In a study on verbal probabilities used to communicate projections by the Intergovernmental Panel on Climate Change (IPCC), Budescu et al. (2009) asked participants to characterize the intended numeric meaning of each term (i.e., very unlikely, unlikely, likely, and very likely) by estimating its lower and upper bounds and a best estimate. The terms were embedded in sentences extracted from IPCC reports. Participants received either no guidance regarding the numeric equivalents of the verbal terms (control condition), unrestricted access to the IPCC translation table that contained numeric equivalents (translation condition), or numeric equivalents embedded in the sentences alongside each verbal probability (combined condition; e.g., "very likely [90% chance or greater]"). The combined format yielded better agreement than the translation or control formats. Median responses for the expressions were also less regressive, and interpreted ranges were significantly narrower, in the combined condition than in the other conditions. Subsequent replications, including one with samples taken from 24 countries and in 17 languages, also found better performance of the combined format (Budescu et al., 2012, 2014). The combined format's advantage also generalized to a different standard used by the US intelligence community in Wintle et al. (2019) and in a re-analysis of the same dataset using a different agreement measure (Mandel & Irwin, 2021).
1.1 The present research
Our research expands on previous studies examining agreement between receivers' interpretations of verbal probability terms and the stipulated meaning of such terms in probability lexicons in multiple respects. First, a question yet to be investigated is whether agreement is affected by the presence or absence of numeric probability ranges in NBLP schemes; that is, does it help to numerically bound the verbal terms used in such schemes? Whereas previous studies (e.g., Budescu et al., 2009, 2014; Wintle et al., 2019) have examined the effect of introducing numeric ranges alongside verbal probability terms in specific assessments (i.e., the combined format), none of these studies examine schemes that themselves lack numeric bounds on the prescribed set of terms. This issue is important, however, because some probability schemes do not include probability ranges but merely consist of an ordered set of probability terms. Such approaches are common in risk assessment, where an ordinal scale of probability terms is crossed with an ordinal scale of consequence severity to yield a risk matrix (Friedman, 2019; Mandel, 2007). To address this issue, we manipulated whether participants presented with the NATO lexicon shown earlier received the full version (as in Table 1) or a partial version that omitted the numeric ranges. Although agreement has been shown to be low even where numeric ranges are included in lexicons, we propose Hypothesis 1: agreement will be lower when numeric ranges are omitted from the probability scheme than when they are included.
A second aim of the present research was to build on studies by Budescu et al. (2009, 2012, 2014) and Wintle et al. (2019) by including a purely numeric probability format condition in which, following exposure to the full NATO lexicon, the intelligence assessment conveyed probabilities with numeric probability ranges only. We know of only one study that directly compared the effect of a combined probability format to a numeric format. In that study, Knapp et al. (2016) compared the effect of presenting information about the risk of a cancer medical treatment using either verbal expressions of relative frequency (e.g., "common") paired with upper-bounded numeric quantifiers (e.g., "up to 1 in 10") or using only the numeric quantifiers. Participants tended to overestimate risks in both conditions, but the degree of overestimation was far greater in the combined condition. Participants' judgments were also more variable in the combined condition than in the numeric condition. Knapp et al.'s (2016) findings call into question the benefit of pairing numeric expressions of probability with verbal probabilities. Unlike Knapp et al., we compared the agreement yielded by combined and numeric formats. Given that the combined format creates an opportunity for conflict between two sources of probability information, we propose Hypothesis 2: agreement using the numeric format will be as good as that observed using the combined format, and these formats (numeric and combined) will each show better agreement than the verbal format. If agreement levels are found to be as good or better using the numeric range format, it would call into question why various organizations and professional groups remain committed to expressing probabilities primarily with verbal probabilities.
A third aim of our research was to compare agreement using three distinct measures. As in past studies (Budescu et al., 2009; Wintle et al., 2019), we used the proportion of participants who provided "best-estimate" numeric equivalents that fell within the numeric ranges stipulated in the NATO standard. This measure captures an "all or none" interpretation of agreement. An alternative measure we tested that also uses best estimates measures agreement as the absolute distance between a participant's best estimate and the midpoint of the stipulated numeric range for a given verbal probability term. Under a variety of distributional assumptions (e.g., normal, rectangular or other symmetric distributions), the expected value of a numeric range is its midpoint. In interval analysis (Moore et al., 2009), for instance, a range is equivalent to its midpoint with a margin of error equal to half the range. We therefore included this measure, which had not been used in the earlier agreement studies by Budescu et al. (2009, 2012, 2014) and Wintle et al. (2019). Finally, we computed an agreement measure that uses participants' upper and lower bounds in calculating the percentage overlap with a stipulated range, as shown in Equation 1:

$$\mathrm{PO} = \frac{\max\left(0,\; \min(U_e, U_s) - \max(L_e, L_s)\right)}{U_e - L_e} \times 100\%, \tag{1}$$

where $L_e$ and $U_e$ refer to the participant's lower-bound and upper-bound estimates, respectively, and $L_s$ and $U_s$ refer to the relevant lower and upper bounds stipulated in the NATO scheme. Wintle et al. (2019) also used a measure of percentage overlap, as shown in Equation 2:

$$\mathrm{PO} = \frac{\max\left(0,\; \min(U_e, U_s) - \max(L_e, L_s)\right)}{\max(U_e, U_s) - \min(L_e, L_s)} \times 100\%. \tag{2}$$
However, unlike our measure, their measure penalizes in-range precision. For instance, if a participant provided lower and upper bounds of 45% and 55% for the term even chance, the participant would be said, using Equation 2, to have 50% overlap with the stipulated 40%–60% range for this term (see Table 1). In contrast, our measure would score this participant as showing 100% overlap because 100% of their range falls within the bounds of the stipulated range. In other words, the measure used in the present research does not punish within-range precision. Mandel and Irwin (2021) re-analyzed data from Wintle et al. (2019) using the new percentage overlap measure and found that, as expected, agreement was higher across format conditions. However, the effect of format was not influenced by the choice of measure. In the present experiment, we hypothesized that the effect of probability format on agreement specified in Hypothesis 2 will be upheld across the three measures (Hypothesis 3).
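To make the contrast between the two overlap measures concrete, here is a minimal Python sketch of Equations 1 and 2 as reconstructed above (the max(0, ...) guard handles disjoint ranges; function names are ours):

```python
def overlap_eq1(le: float, ue: float, ls: float, us: float) -> float:
    """Equation 1: overlap as a share of the participant's range [le, ue];
    a range nested inside the stipulated range [ls, us] scores 100%."""
    intersection = max(0.0, min(ue, us) - max(le, ls))
    return 100.0 * intersection / (ue - le)

def overlap_eq2(le: float, ue: float, ls: float, us: float) -> float:
    """Equation 2 (Wintle et al., 2019, as reconstructed here): overlap as a
    share of the union of the two ranges, which penalizes in-range precision."""
    intersection = max(0.0, min(ue, us) - max(le, ls))
    return 100.0 * intersection / (max(ue, us) - min(le, ls))

# Worked example from the text: bounds of 45-55% for "even chance" (40-60%).
print(overlap_eq1(45, 55, 40, 60))  # 100.0
print(overlap_eq2(45, 55, 40, 60))  # 50.0
```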
A fourth aim of the present research was to examine whether interpretations of probability assessments are affected by the semantic context of the events they describe. Several context effects on the interpretation of verbal probabilities have been reported (Brun & Teigen, 1988; Mellers et al., 2017; Wallsten et al., 1986; Weber & Hilton, 1990), as well as effects on the selection of verbal probabilities from the sender's perspective (e.g., Patt & Schrag, 2003). Some studies have found that interpretations of probability terms depend on the valence of the events qualified by such terms (e.g., Mullet & Rivet, 1991). In one study (Mandel, 2015a), participants discriminated better among the meanings of verbal probability terms ranging from extremes of will not (i.e., a very low probability) to will (a very high probability) when the terms referred to an event success rather than an event failure (in both cases, the desirability of the event was opaque, making it difficult to judge whether success or failure was "a good thing"). More recently, consistent with these findings, Dhami and Mandel (2021) found that participants considering a forensic assessment case discriminated better between the terms probable and improbable when the context was positive (i.e., the defendant was judged as being fit to plead) rather than negative (i.e., the defendant was judged as being not fit to plead).
In the present research, we attempted to generalize this valence-discrimination relation to a task in which valence was manipulated through gain/loss framing of outcomes rather than a valence-based reflection of outcomes (Fagley, 1993). Specifically, probabilistic assessments focused on the potential outcome of either saving half the lives of 1,000 threatened people (i.e., the positive frame) or losing half the lives of 1,000 threatened people (i.e., the negative frame). Consistent with the findings of Mandel (2015a) and Dhami and Mandel (2021), we tested Hypothesis 4: discrimination between the terms unlikely and likely (operationalized as the mean difference in numeric probability equivalents assigned to these terms) will be significantly better in the positive frame than in the negative frame. If so, this result should manifest as a three-way interaction between probability format, probability level, and frame, in which differential discriminability (characterized by a probability level × frame interaction effect) is observed in the verbal condition but not in the combined or numeric conditions, where the presence of numeric information is expected to cancel any valence-discrimination relation that might be induced via gain/loss framing.
A final aim of this research was to examine how individual difference measures of cognitive ability and cognitive style predict compliance with the NATO lexicon. In doing so, we expand on previous research showing that numeracy correlates positively with compliance (Wintle et al., 2019). Numeracy refers to an individual's ability to perform basic mathematical operations that would be expected of a data-literate person (e.g., converting a percentage probability into a decimal and knowing that 0.01 is larger than 0.001). Higher levels of numeracy have been shown to facilitate probability assessment and improve the interpretation of numerical data (Lipkus & Peters, 2009). Meanwhile, individuals with low numeracy have been shown to rely on non-numerical cues and to be more vulnerable to presentation effects (Reyna et al., 2009). In the present research, in addition to numeracy, we measured differences in verbal reasoning skill and actively open-minded thinking (AOT). Verbal reasoning skill assesses abstract analogical reasoning using language (Bilker et al., 2014), while AOT assesses people's openness to new information and to perspectives contrary to their beliefs (Baron et al., 2015). AOT is positively associated with accuracy in probabilistic judgment tasks (Haran et al., 2013; Mellers et al., 2015) and negatively associated with certain cognitive biases (Baron, 2008; Toplak et al., 2017; West et al., 2008). To the best of our knowledge, verbal reasoning skill and AOT have not been explored in relation to agreement with NBLP schemes for communicating probability. Consistent with Wintle et al. (2019), we hypothesized that numeracy, verbal reasoning ability, and AOT would be positively correlated with our agreement measures (Hypothesis 5).
2 Method
2.1 Sampling strategy and participants
Our primary analyses involved factorial analysis of variance (ANOVA) with twelve between-subjects conditions (i.e., Probability format [3] × Probability level [2] × Frame [2]). Using G*Power (Faul et al., 2007), we computed an a priori power analysis for ANOVA with main and interaction effects, with ηp² = .025, Type I and II error rates set to 5%, and df = 2 in the numerator, which returned a required sample size of 606. To accommodate ANOVA with an additional nested factor (Table format), we required a sample of 509, half of which overlapped with the sample required for the aforementioned three-way design. Therefore, we estimated a minimum required sample of 866. We oversampled by approximately 40% to offset the chance that we might need to exclude a significant proportion of incoherent responders, as we have encountered the need to do in other judgment research (e.g., Mandel, Collins, et al., 2020). A sample of 1,236 participants (52% male) between the ages of 18 and 60 (M = 43.79, SD = 11.77) was recruited using the online crowdsourcing service Qualtrics Panels (https://www.qualtrics.com/). Qualtrics Panels incentivizes participants using a variety of methods that typically correspond to 40%–60% of the per-participant cost charged to researchers; in the present research, that corresponds to $6–$9 US for completion of the full survey set (see Procedure and materials). All participants were sampled from Canada or the U.S. and were required to have English as their first language. Participants were prohibited from completing the experiment using a smartphone and were screened out if they failed a one-item instructional manipulation check designed to test their attention to instructions (Oppenheimer et al., 2009).
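For readers wishing to reproduce the power computation without G*Power, the sketch below uses the ANOVA power solver in statsmodels; the conversion from ηp² to Cohen's f is standard, and the result lands within a few participants of the reported 606 (solvers handle the noncentrality calculation slightly differently):

```python
from math import sqrt
from statsmodels.stats.power import FTestAnovaPower

eta_p2 = 0.025
f = sqrt(eta_p2 / (1 - eta_p2))  # Cohen's f, approximately 0.16

# df = 2 in the numerator corresponds to k_groups = 3 (e.g., probability format).
n_total = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.95, k_groups=3)
print(round(n_total))  # approximately 606 total participants
```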
2.2 Design
Participants were randomly assigned to 12 conditions in a 3 (Probability format: verbal, combined, numeric) × 2 (Probability level: low, high) × 2 (Frame: positive, negative) between-subjects factorial design. A fourth factor, which we refer to as table format, was manipulated between subjects and nested in the verbal condition. Specifically, participants assigned to the verbal condition were further randomly assigned to either a full-table or partial-table condition. In the full-table condition, participants were shown the full NATO translation table (see Table 1), whereas in the partial-table condition, the numeric equivalents shown in the first column of Table 1 were omitted. In the combined and numeric conditions, participants were presented with the full table. Probability format refers to whether the intelligence assessment reported in the experimental task stated only the verbal probability term (e.g., likely), the verbal term with the numeric range in parentheses (e.g., likely [60%–90%]), which we call the combined condition, or only the numeric range (e.g., 60%–90%). Probability level refers to whether the intelligence assessment used a low probability (e.g., unlikely [10%–40%]) or a high probability (e.g., likely [60%–90%]). Frame refers to whether the outcome was described positively (i.e., half of a group of civilians surviving) or negatively (i.e., half of the group dying).
2.3 Procedure and materials
The experiment was conducted as part of a small set of brief, counterbalanced experiments administered online through Qualtrics. Participants were not informed of the aims of the research until the end of the experiment and they could not alter responses entered on previous screens. At the beginning of the experiment, participants were informed that they would receive information from a hypothetical intelligence report and answer a set of questions. They were introduced to the NATO translation table (partial or full, depending on their condition) and informed that the analyst had used one of the probability terms when making a forecast. Participants were then presented with a hypothetical humanitarian crisis and an intelligence forecast regarding the survival of 1,000 displaced civilians.
After participants reviewed the scenario, the hypothetical intelligence assessment was presented as follows (probability level and frame manipulations shown in brackets):
Given the current situation on the ground, a senior intelligence analyst specializing in that region assesses
[in the verbal condition] ‘It is [likely/unlikely] that half of these civilians will [survive/die].’
[in the combined condition] 'It is [likely (namely, there is a 60%–90% chance)/unlikely (namely, there is a 10%–40% chance)] that half of these civilians will [survive/die].'
[in the numeric condition] 'There is a [60%–90%/10%–40%] chance that half of these civilians will [survive/die].'
The scenario and intelligence assessment remained visible while participants responded to subsequent questions, whereas the NATO translation table was visible only at the beginning of the experiment. However, before proceeding to the first set of questions, participants had the opportunity to review the NATO translation table (with or without numeric equivalents, depending on their condition) by clicking a clearly labeled button. After proceeding, they were presented with the first set of questions, along with the text of the scenario and intelligence assessment. In the following order, participants were asked to provide their best, lowest, and highest estimates of the probability that the intelligence analyst had in mind by responding on sliders ranging from 0 to 100 with a default starting position of 0 (Footnote 2). The three questions were phrased as follows:
(1) What is your BEST estimate of the probability conveyed by the analyst?
(2) and (3) What is the [LOWEST, HIGHEST] probability the analyst conceivably has in mind?
Participants subsequently completed an additional set of questions, which are the focus of a separate investigation that also includes data from other experiments (Footnote 3). After completing the core experimental tasks, participants were given a one-item instructional manipulation check (Oppenheimer et al., 2009). Qualtrics Panels excluded participants who did not answer this task correctly. Participants who correctly answered the instructional manipulation check subsequently completed a 10-item numeracy scale drawing eight questions from Lipkus et al.'s (2001) numeracy scale and two questions from the Berlin Numeracy Test (Cokely et al., 2012); an 8-item verbal skills test comprising verbal analogy questions from the 29-item Penn Verbal Reasoning Test (PVRT; Bilker et al., 2014); and the 8-item actively open-minded thinking scale from Baron et al. (2015). Finally, participants answered basic demographic questions (i.e., age, sex, and professional experience) to further characterize the sample.
2.4 Agreement measures
We computed three agreement measures. First, in line with earlier studies (Budescu et al., 2009, 2012, 2014; Wintle et al., 2019), we categorized whether best estimates fell within the relevant ranges stipulated by the NATO lexicon and analyzed the proportion of agreeing best estimates (PABE). Our second measure relied on the mean absolute difference (MAD) between the best estimate and the midpoint of the relevant NATO range. This measure reflects the distance between a participant's best estimate and what is arguably the best prototypical point within the relevant stipulated range. To enable multivariate analyses with the other agreement measures, we multiplied MAD by −1 so that, for all three agreement measures, higher values reflected better agreement. We refer to the negated MAD measure as MADneg. Our third measure was the mean percentage overlap (MPO) between the participant's range and the stipulated range, as shown in Equation 1. In cases where spread was equal to 0 (n = 78), PO equaled 100% if the value of the bounds fell within the stipulated range; otherwise, PO equaled 0%.
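As a compact sketch, the three measures can be scored for a single response as follows (PABE and MPO are then averaged across participants; variable names are illustrative):

```python
def agreement_measures(best, lo, hi, ls, us):
    """Score one response against a stipulated range [ls, us] (all values in %).
    best, lo, hi: the participant's best, lowest, and highest estimates."""
    abe = 1 if ls <= best <= us else 0        # agreeing best estimate (for PABE)
    mad_neg = -abs(best - (ls + us) / 2.0)    # negated distance from the midpoint
    if hi == lo:                              # zero spread: special case in the text
        po = 100.0 if ls <= lo <= us else 0.0
    else:                                     # Equation 1
        po = 100.0 * max(0.0, min(hi, us) - max(lo, ls)) / (hi - lo)
    return abe, mad_neg, po

print(agreement_measures(best=70, lo=55, hi=85, ls=60, us=90))
# (1, -5.0, 83.33...): in range, 5 points from the 75 midpoint, 25/30 overlap
```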
3 Results
3.1 Preliminary analyses
Thirty-four (2.8%) participants provided lower-bound estimates that exceeded their upper-bound estimates. These cases were removed (revised N = 1,202). Approximately 19% of the remaining participants provided best estimates that fell outside the credible interval defined by their lower- and upper-bound estimates. Chi-square tests indicated that these violations were independent of probability format, probability level, and frame (all p > .28). Wintle et al. (2019) rearranged such estimates into their logical order. However, we neither altered nor removed them.
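As a sketch, the two screening rules just described amount to the following per-response checks (names are illustrative):

```python
def screen(lo: float, best: float, hi: float) -> tuple[bool, bool]:
    """Return (keep, interval_violation) for one response. Responses with
    lo > hi were removed; best estimates outside [lo, hi] were flagged as
    violations but retained unaltered."""
    keep = lo <= hi
    interval_violation = keep and not (lo <= best <= hi)
    return keep, interval_violation

print(screen(30, 50, 20))  # (False, False): bounds reversed, case removed
print(screen(20, 70, 60))  # (True, True): retained, best estimate out of interval
```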
3.2 Primary measures of equivalence
3.2.1 Spread
In the verbal condition, the effect of table format on spread (i.e., the upper bound minus the lower bound) was not statistically significant (t[599] = 0.40, p = .69, Cohen's d = 0.03). Therefore, we collapsed over this nested factor in analyses of other effects, and we used the full sample. As an exploratory analysis, we conducted a three-way (Probability format × Probability level × Frame) ANOVA on spread. None of the main or interaction effects were significant (all p > .07) (Footnote 4). The grand mean of spread was 31.91 [30.71, 33.10].
3.2.2 Best estimates
In the verbal condition, the effect of table format on best estimates was not statistically significant (t[599] = 0.75, p = .45, Cohen's d = 0.06). Therefore, we collapsed over this nested factor in analyses of other effects. We conducted a three-way (Probability format × Probability level × Frame) between-subjects factorial ANOVA on best estimates. As expected, the main effect of probability level was significant (F[1, 1190] = 470.95, p < .001, ηp² = .284). The mean estimate of the low probability was 41.48 [39.93, 43.02] and the mean estimate of the high probability was 65.52 [63.99, 67.05] (Footnote 5). However, probability level significantly interacted with probability format (F[2, 1190] = 34.33, p < .001, ηp² = .055). Figure 1 plots the interaction effect, which shows that discrimination between the low and high probabilities was significantly poorer in the verbal condition than in the combined or numeric conditions, the latter two of which were virtually indistinguishable. No other effect in the model was statistically significant (all p > .5). Therefore, in the present experiment, we rejected Hypothesis 4, finding no evidence that positive/negative framing of outcomes affects discrimination between low and high probability terms.
As Figure 1 shows, the median probabilities for the terms unlikely and likely are virtually indistinguishable, and they fall on or very close to 50%, in contrast to the medians observed in the combined and numeric conditions. This raises the possibility that (despite the initial starting position of 0 on the slider scale) a significant proportion of participants may have responded with 50% as their best estimate to reflect a "don't know" response, thus producing a fifty-fifty blip (Bruine de Bruin et al., 2002; Fischhoff & Bruine de Bruin, 1999). More specifically, the results in Figure 1 suggest that the proportion of fifty-fifty responders is significantly greater in the verbal condition than in the combined or numeric conditions. We tested this hypothesis using strict and loose classification methods. For the strict method, we dummy coded participants whose best estimates equaled 50% as 1 and otherwise as 0. For the loose method, we coded responses between 49% and 51% inclusive as 1 and otherwise as 0. The loose method reflects the fact that the slider was quite sensitive to movement, and someone intending to respond with 50% might easily end up a point higher or lower on the scale. As Table 2 shows, the percentage of fifty-fifty responders was significantly greater in the verbal condition than in the combined or numeric conditions. This was the case for both the strict and loose methods. As well, using the strict method, the percentage of fifty-fifty responders was marginally greater in the combined condition than in the numeric condition.
Note. Verb., Comb., and Num. stand for the verbal, combined, and numeric conditions, respectively. Pairwise comparisons are based on the Mann-Whitney U test. Significance values are two-tailed.
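The strict and loose dummy-coding rules described above can be sketched as follows (names are illustrative):

```python
def is_fifty_fifty(best: float, method: str = "loose") -> int:
    """Dummy-code a fifty-fifty response from a best estimate (in %).
    strict: exactly 50; loose: 49-51 inclusive, allowing for slider slippage."""
    if method == "strict":
        return int(best == 50)
    return int(49 <= best <= 51)

print([is_fifty_fifty(b) for b in (49.5, 50, 52)])            # [1, 1, 0]
print([is_fifty_fifty(b, "strict") for b in (49.5, 50, 52)])  # [0, 1, 0]
```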
The preceding analyses naturally raise the question of whether discrimination between low and high probabilities is still affected by probability format when fifty-fifty responders are excluded, as the exclusion of this subset must attenuate the interaction effect plotted in Figure 1. Accordingly, we recomputed the three-way ANOVA on best estimates after excluding those who met the definition of fifty-fifty responders by the loose criterion in order to retest the probability level × probability format interaction effect. As expected, the two-way interaction effect was attenuated and only approached conventional significance levels (F[2, 734] = 2.37, p = .094, ηp² = .006). Figure 2 plots this interaction effect. Compared to Figure 1, estimates in the verbal condition are much less regressive. The median probabilities assigned to the terms unlikely and likely now fall in the stipulated ranges, although the mean for the term unlikely still falls outside the stipulated range.
To provide a direct test of the hypothesis that participants' best estimates were more regressive in the verbal condition than in the combined or numeric conditions, even after excluding fifty-fifty responders, we computed two extremity scores, E_P and E_L, as follows:

$$E_P = \begin{cases} 50 - B_e & \text{if } PL = \text{low} \\ B_e - 50 & \text{if } PL = \text{high} \end{cases} \qquad E_L = \begin{cases} \max(0,\ 50 - B_e) & \text{if } PL = \text{low} \\ \max(0,\ B_e - 50) & \text{if } PL = \text{high} \end{cases}$$

where $B_e$ is the participant's best estimate and PL stands for the design factor, probability level. The subscripts P and L on E stand for punitive and lenient, respectively. The higher the value of E, the more extreme (or less regressive) the participant's best estimate is, provided it is correctly located relative to 50%; namely, provided best estimates for low probabilities are not more than 50% and best estimates for high probabilities are not less than 50%. Violations of these constraints yield negative "anti-extremity" values for E_P and values of 0 for E_L. Both measures differ, therefore, from one that merely scores the absolute difference between 50 and B_e. The absolute difference would, of course, fail to differentiate a participant who indicates that unlikely means 20% from one who indicates that it means 80%, treating normative and perverse forms of extremity of equal magnitude alike.
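A sketch of the two scores, following the definitions above (names are illustrative):

```python
def extremity(best: float, level: str) -> tuple[float, float]:
    """Punitive (E_P) and lenient (E_L) extremity of a best estimate (in %),
    given the design factor probability level ('low' or 'high')."""
    e_p = 50 - best if level == "low" else best - 50
    e_l = max(0.0, e_p)  # lenient: misdirected estimates score 0, not negative
    return e_p, e_l

print(extremity(20, "low"))  # (30, 30.0): unlikely read as 20% is normatively extreme
print(extremity(80, "low"))  # (-30, 0.0): perverse extremity is penalized or zeroed
```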
After excluding fifty-fifty responders using the loose criterion, a one-way (Probability format) ANOVA computed on punitive extremity, E_P, was statistically significant (F[2, 743] = 3.05, p = .048, ηp² = .008). Compared to participants' best estimates in the verbal condition (M = 12.05 [9.18, 14.92]), those in the numeric condition (M = 16.60 [14.30, 18.91]) were significantly more extreme (p = .041 by Tukey's HSD test), whereas those in the combined condition (M = 15.53 [13.10, 17.96]) did not significantly differ from those in either the verbal condition (p = .17) or the numeric condition (p = .81). We conducted a second test that was identical to the preceding test, except that we swapped the punitive measure for the lenient measure of extremity, E_L. The main effect, once again, was statistically significant (F[2, 743] = 3.32, p = .037, ηp² = .009). Compared to participants' best estimates in the verbal condition (M = 17.05 [15.50, 18.90]), those in the numeric condition (M = 20.15 [18.66, 21.63]) were significantly more extreme (p = .029 by Tukey's HSD test), whereas those in the combined condition (M = 18.67 [17.11, 20.24]) did not significantly differ from those in either the verbal condition (p = .39) or the numeric condition (p = .37). Therefore, even after removing fifty-fifty responders, participants' best estimates were more regressive when they were presented with verbal probabilities than when they were presented with numeric ranges.
Our final analyses in this section aim to shed light on the causal bases of the observed fifty-fifty blip, which was most strongly manifested in the verbal condition. Bruine de Bruin et al. (2000) found that the fifty-fifty blip was stronger for less numerate and younger participants (i.e., youth vs. adults), and for events that were singular rather than distributional and, therefore, more likely to be associated with epistemic rather than aleatory uncertainty. We tested for a similar pattern of results in the present research. However, numeracy did not significantly differ between participants who gave fifty-fifty responses and those who did not (t[900] = 1.16, p = .25). In our study, age did differ, but fifty-fifty responders, on average, were older (M = 46.97, SD = 10.48) than the remainder of the sample (M = 43.16, SD = 11.79; t[244.40] = 4.04, p < .001) (Footnote 6). Moreover, the significance of these effects was virtually unchanged if the sample was restricted to participants in the verbal condition.
Another possibility suggested by past research (e.g., Bruine de Bruin et al., 2000) is that fifty-fifty responders are less epistemically certain than their counterparts about their estimate. If so, we might expect the spread between the lower and upper bounds of their credible intervals to be greater for fifty-fifty responders than for their counterparts, as wider spreads represent greater uncertainty. To the contrary, spread was significantly greater in the subsample that did not respond fifty-fifty (M = 32.88, SD = 16.97) than among fifty-fifty responders (M = 26.53, SD = 21.81; t[196.09] = 3.43, p = .001). The variances of these subsamples were significantly different too, according to Levene's test (F = 30.54, p < .001). The smaller variance among fifty-fifty responders suggests an alternative hypothesis: perhaps there is a corresponding zero-spread blip for credible intervals among fifty-fifty responders, consistent with the hypothesis that, for some individuals, verbal probabilities simply are not interpreted in quantitative terms. In fact, if we examine the distribution of spreads in the verbal condition, we find that there is a zero-spread blip for fifty-fifty responders, whereas there is no such blip for the counterpart subsample. Whereas only 3.7% of the latter subsample gave bounds that yielded a spread of 0, fully 25% of fifty-fifty responders did so. In fact, the zero-spread blip represented the mode in that subsample (Footnote 7). The difference in these percentages is highly significant (by Mann-Whitney U test, z = −10.53, p < .001) and large: fifty-fifty responders in the verbal condition were 6.8 times more likely to indicate a zero spread than their "non-50%" counterparts in the verbal condition.
Finally, if we compare the percentage of participants who gave fifty-fifty responses and had zero spread (using the loose criterion in both cases), we find 11.0% in the verbal condition, whereas the percentage is 0.3% in each of the other two conditions; that is, participants were 36.7 times more likely to exhibit this pattern of response in the verbal condition than in the combined or numeric conditions. These findings suggest that just over 10% of people asked to interpret the numeric meaning of a verbal probability show signs of what we call representational mapping incapacity. For these individuals, it may be difficult to conceive of a mapping from verbal to numeric probabilities. If so, this difficulty does not appear to be related to numeracy, as the minority exhibiting this pattern in the verbal condition did not significantly differ in numeracy from the majority who did not exhibit the "50±0" pattern (t[299] = 0.86, p = .39).
3.3 Agreement
3.3.1 Individual differences in cognitive performance and style
We first examined whether the three agreement measures (i.e., MPO, MADneg, and PABE) were correlated with numeracy, PVRT, and AOT. Consistent with earlier findings (Wintle et al., 2019), as Table 3 shows, greater agreement (across all three measures) was positively related to higher numeracy, verbal-reasoning skill, and actively open-minded thinking. Therefore, we found consistent support across multiple tests of Hypothesis 5. None of these individual difference measures significantly differed across any of our experimental manipulations.
* p < .05
** p < .01
3.3.2 Table format
Recall that we predicted that agreement would be better in the full table condition than in the partial table condition (Hypothesis 1). To examine the effect of table format on agreement, we restricted our analyses to the subset of cases in the verbal condition where table format was varied (n = 601) and conducted a one-way (Table format) multivariate ANOVA (MANOVA) on the three agreement measures. The multivariate effect of table format was not statistically significant (F[3, 597] = 1.82, p = .143, ηp² = .009). However, the univariate results were mixed. Neither of the agreement measures that relied on best estimates (i.e., PABE and MADneg) showed a significant effect (both p > .115), whereas the measure that relied on lower and upper bounds (i.e., MPO) did (F[1, 599] = 5.24, p = .020, ηp² = .009). Using the MPO measure, agreement was, in fact, better in the full table condition (M = 0.40 [0.36, 0.44]) than in the partial table condition (M = 0.33 [0.29, 0.37]). These findings therefore provide partial support for Hypothesis 1. Evidently, providing numeric ranges for stipulated terms helps foster agreement, but only on measures that rely on range input for calculating agreement.
3.3.3 Probability format, probability level, and frame
Turning to cases presented with the full translation table (n = 902), we examined the three agreement measures in a three-way (Probability format × Probability level × Frame) factorial MANOVA. There was a significant multivariate main effect of probability format (F[6, 1778] = 27.29, p < .001, ηp² = .084). All three univariate F tests were significant at p < .001. Table 4 shows that, for each of the three agreement measures, agreement in the verbal condition was significantly poorer than in the combined and numeric conditions, and the latter two conditions did not significantly differ. Moreover, the effect of probability format did not significantly interact with probability level or frame (smallest p = .312). These findings strongly support Hypotheses 2 and 3.
Note. Values in the second column that do not share the same subscript within a measure differ significantly at p < .001 by Tukey's HSD test; those sharing a subscript do not significantly differ at α = .05.
The multivariate main effect of probability level was also statistically significant (F[3, 888] = 23.48, p < .001, ηp² = .073). However, the univariate F tests diverged. Using MPO, agreement was better for the lower probability (M = 0.67 [0.64, 0.71]) than for the higher probability (M = 0.57 [0.53, 0.60]; F[1, 890] = 16.89, p < .001, ηp² = .019). For MADneg, the effect was in the opposite direction, with worse agreement for the lower probability (M = −18.34 [−19.78, −16.91]) than for the higher probability (M = −14.79 [−16.22, −13.36]; F[1, 890] = 11.83, p = .001, ηp² = .013). Finally, for PABE, the effect was not significant (F[1, 890] = 0.12, p = .731, ηp² = .000). No other effect in the MANOVA model was statistically significant at α = .05.
Finally, we recomputed the MANOVA on the agreement measures with fifty-fifty responders excluded based on the loose criterion. The new model yielded the same significant effects for probability format (multivariate F[6, 1466] = 8.18, p < .001, ηp² = .032) and probability level (multivariate F[3, 732] = 17.37, p < .001, ηp² = .066). Therefore, the findings are robust regardless of whether fifty-fifty responders are included in or excluded from the analysis.
3.3.4 Decision to review the NBLP scheme
Recall that participants were given the option of reviewing NATO's NBLP scheme prior to giving their probability estimates. We examined whether the effect of probability format reported earlier interacted with participants' decision to review the table or not. Among participants in the full table condition, 492 (54.5%) reviewed the table before providing their probability equivalents (dummy coded as 1 and otherwise as 0). The effect of probability format on this percentage only approached statistical significance (χ²[2, N = 902] = 4.86, p = .088); the percentages choosing to review were 52.8%, 51.0%, and 59.5% in the verbal, combined, and numeric conditions, respectively.
We conducted a two-way (Probability format × Review) MANOVA on the agreement measures. In particular, we sought to examine whether there was a significant interaction effect. Perhaps the poor agreement in the verbal condition compared to the combined and numeric conditions was due to participants' failure to attend to the NBLP scheme. If so, we should observe a stronger simple effect of review in the verbal condition than in the other two conditions. First, we observed a multivariate main effect of review (F[3, 894] = 8.08, p < .001, ηp² = .026) (Footnote 8). However, only the univariate F test on PABE was statistically significant (F[1, 896] = 7.96, p = .005, ηp² = .009). In this case, participants who reviewed the scheme prior to providing numeric equivalents showed better agreement (M = 0.64 [0.60, 0.68]) than those who did not review the scheme (M = 0.55 [0.50, 0.59]). More importantly, the multivariate interaction effect was statistically significant (F[6, 1790] = 2.60, p = .016, ηp² = .009). Moreover, all univariate F tests for the interaction effect were significant at α = .01.
To simplify the presentation of the interaction across agreement measures, we standardized the three agreement measures and averaged them to form an agreement scale, which had good reliability (Cronbach's α = .86). Figure 3 plots the interaction effect. As anticipated, reviewing the scheme improved agreement in the verbal condition. The simple effect on the composite measure was statistically significant (F[1, 299] = 11.20, p = .001, ηp² = .036). In contrast, in the combined condition, the decision to review the scheme had no significant effect (F[1, 288] = 0.37, p = .545, ηp² = .001). Finally, in the numeric condition, there was a marginally significant effect in the opposite direction to that observed in the verbal condition (F[1, 309] = 2.14, p = .081, ηp² = .010). That is, participants who did not choose to review the scheme showed better agreement than those who chose to review it. It is also evident from Figure 3 that the simple effect of presentation format was significant. In particular, it is clear from the non-overlapping 95% confidence intervals that, even among participants who reviewed the scheme immediately prior to judging the numeric equivalents, agreement in the verbal condition was surpassed by that in the combined and numeric conditions.
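A sketch of the composite construction follows, with Cronbach's alpha computed from the standard item-variance formula; the data here are simulated purely for illustration:

```python
import numpy as np

def composite_agreement(pabe, mad_neg, mpo):
    """Standardize three per-participant agreement scores and average them."""
    x = np.column_stack([pabe, mad_neg, mpo]).astype(float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    return z.mean(axis=1)

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 3)) + rng.normal(size=(100, 1))  # correlated items
print(round(cronbach_alpha(demo), 2))  # roughly .7-.8 for these simulated scores
```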
3.3.5 Extremity, presentation format, and agreement
The preceding results showed that the extremity of best estimates remained affected by probability format after excluding fifty-fifty responders. As well, after fifty-fifty responders were excluded, presentation format continued to significantly affect agreement. Taken together, these findings suggest that the effect of probability format on agreement is mediated at least partly by extremity. In this final analysis, we tested this hypothesis directly. Figure 4 shows the standardized regression weights for links in the model in which extremity (using E_P) mediates the effect of probability format on agreement (using the composite measure). The attenuation of the probability format effect on agreement after controlling for the mediator was statistically significant (Sobel test z = 5.19, p < .001). Even after controlling for extremity, the predictive effect of probability format was still significant. These results suggest that extremity partially mediates the effect of probability format on agreement (Baron & Kenny, 1986).
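The Sobel statistic has a simple closed form; the sketch below computes it from the two mediation paths, with path values that are purely hypothetical (the paper reports only the resulting z):

```python
from math import sqrt

def sobel_z(a: float, se_a: float, b: float, se_b: float) -> float:
    """Sobel test for the indirect effect a*b, where a is the path from the
    predictor (probability format) to the mediator (extremity) and b is the
    path from the mediator to the outcome (agreement), controlling for the
    predictor; se_a and se_b are the standard errors of a and b."""
    return (a * b) / sqrt(b**2 * se_a**2 + a**2 * se_b**2)

# Hypothetical path estimates, for illustration only (not the paper's values):
print(round(sobel_z(a=0.30, se_a=0.05, b=0.40, se_b=0.04), 2))  # 5.14
```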
4 Discussion
This research tested several hypotheses about contemporary organizational approaches to communicating probabilities to end-users, which rely on NBLP schemes. We investigated agreement with one such scheme used by NATO in the context of military intelligence production and dissemination, using verbal probability terms (i.e., likely and unlikely) that are widely employed in other NBLP schemes (for other examples, see Dhami & Mandel, 2020; Ho et al., 2015; Morgan, 1998). One issue we sought to address was whether the numerically bounded component of NBLP schemes confers an advantage in terms of fostering agreement. That is, do schemes that provide numeric ranges as semantic anchors fare better than comparable schemes that provide only an ordered set of verbal probabilities? Across the three measures of agreement we computed, only one (based on the proportion of overlap) showed a benefit to using ranges in the scheme. It is noteworthy that the one measure of agreement that was improved by providing ranges was itself range-dependent, having been computed using lower and upper bound values. In contrast, the other measures relied on the participant's best estimate of the numeric meaning of the relevant term. Therefore, these findings call into question the effectiveness of attempting to stipulate the meaning of verbal probabilities by assigning numeric ranges to them, as organizations have been prone to do and as some researchers have recommended as a compromise in light of the recalcitrant attitude organizations exhibit towards the use of numeric probabilities (Beyth-Marom, 1982). Instead, the findings lend support to recent proposals recommending that organizations and professional bodies communicate probabilities to end-users with numeric probability ranges that can be expressed with more or less precision, as required (e.g., Dhami & Mandel, 2020; European Food Safety Authority et al., 2018; Friedman, 2019; Mandel & Irwin, 2020).
The present research also extended previous work on users' agreement with NBLP schemes by examining a purely numeric condition in which only ranges (corresponding to those depicted in the NATO scheme) were used in assessments. Consistent with earlier studies (Budescu et al., 2009, 2012, 2014; Wintle et al., 2019), we found that the combined format produced significantly better agreement than the verbal format. A novel result, however, was that agreement using the purely numeric format was just as good as agreement using the combined format on all measures; that is, regardless of whether agreement was calculated on the basis of participants' best estimates or their lower and upper bounds. Simply put, there was a substantial cost imposed on agreement if numeric range information was not included in the assessment, yet there was no observable cost to agreement if verbal probability information was omitted. This pattern was evident even if we examined only the subsample of participants who took care to review the NATO standard immediately before making their judgments. Taken together, these findings show that not only do numeric probabilities improve upon the communicative function of verbal probabilities when they are embedded directly into probabilistic statements, as others have noted (e.g., Budescu et al., 2014; Ho et al., 2015; Patt & Dessai, 2005), but, critically, numeric probabilities can replace verbal probabilities insofar as fostering agreement about degrees of probability is the main goal of communication.
The results further show that the cost imposed on agreement by using verbal probabilities is associated with a lack of discrimination between low and high probability terms. Despite having just read the NATO scheme moments before making their judgments, participants, on average, provided best estimates that were too high for the term unlikely and too low for the term likely. In fact, their estimates were so regressive that their median probabilities were virtually indistinguishable and centered on 50%. As we observed, the regression toward the midpoint of the probability scale was, in part, due to the fact that there were significantly more fifty-fifty responders in the verbal condition than in the combined or numeric conditions. However, even after removing the subsample of fifty-fifty responders, best estimates were still significantly more regressive in the verbal condition than in the numeric condition, and agreement was still lower in the verbal condition than in the combined or numeric conditions. In fact, the extremity of participants’ best estimates partially mediated the probability format effect on agreement.
The preceding findings suggest that the use of verbal probabilities to communicate probability levels can undermine information value to end-users in two distinct ways. First, by increasing the proportion of fifty-fifty responses, verbal probabilities increase ambiguity about the meaning of the terms. This is because 50% can represent a first-order probability judgment or it can represent the sender's utter epistemic uncertainty about what probability to assign (Bruine de Bruin et al., 2002; Fischhoff & Bruine de Bruin, 1999). In the present research, participants in the verbal condition were 6.4 times (using the loose criterion) to 9.0 times (using the strict criterion) more likely to give a fifty-fifty response than participants in the numeric condition. This effect of probability format on fifty-fifty responses represents a large increase in ambiguity production. Second, by making probability judgments appear less extreme to receivers, verbal formats for communicating uncertainty are likely to water down the information value of assessments to end-users. Since the value of probabilistic assessments is judged to be a function of informativeness and accuracy (Yaniv & Foster, 1995), the regressiveness of verbal probabilities is likely to discount the value of such assessments to end-users, perhaps partly explaining the communication mode preference paradox noted earlier (e.g., Erev & Cohen, 1990).
While requiring further research, the present findings shed light on the causal bases of these effects. Contrary to Bruine de Bruin et al. (2000), we did not find that fifty-fifty responders were less numerate or younger than those who did not exhibit that response. In fact, we found that older participants were more likely to exhibit the fifty-fifty blip. The comparison of age effects across these studies, however, must be interpreted cautiously since Bruine de Bruin et al. (2000) compared youth and adults, whereas we examined age within an adult sample. Bruine de Bruin et al. (2000) also found that the fifty-fifty blip was associated with greater epistemic uncertainty, which we hypothesized might manifest in the present experiment as greater spread. We found, however, the opposite result: spread was greater for participants who did not give a fifty-fifty response.
This last result, however, suggested an alternative hypothesis that garnered support. That is, we reasoned that a nontrivial proportion of fifty-fifty responders in the verbal condition might simply fail to interpret verbal probabilities in quantitative terms and also produce a zero spread. In support of this hypothesis, in the verbal condition, fifty-fifty responders were about 7 times more likely to indicate a zero spread than their "non-50%" counterparts. The comparison across conditions was even more striking, with participants yielding the "50±0" pattern about 37 times more frequently in the verbal condition than in the combined or numeric conditions. As noted earlier, the "50±0" pattern we observed in just over 10% of participants in the verbal condition suggests that these individuals have a representational mapping incapacity. For these individuals, it may be difficult to conceive of a mapping from verbal to numeric probabilities, a difficulty unrelated to numeracy. Such results are consistent with Mandel et al. (2021), who found that, whereas numeracy was related to the accuracy and coherence of arithmetic computations of averages and products among participants asked to compute these results with numeric probabilities, numeracy was not correlated with these performance measures among participants who received verbal probabilities as inputs to computation. Mandel et al. (2021) suggested that these findings reveal a differential schematicity effect in which the schema for arithmetic computation is less available when individuals are given verbal rather than numeric probabilities to work with. These authors also found that the mapping of verbal probabilities to numeric equivalents was unreliable even though such mappings were elicited within a brief timespan and the task context did not vary between mappings. Taken together, such findings indicate that some individuals cannot map verbal probabilities to numeric probabilities, even if allowances for imprecision are permitted (as in the present research).
Turning to what we have until now called the "regressiveness" of best estimates, which was greater in the verbal condition than in the numeric condition even after fifty-fifty responders were removed from the sample, it is perhaps more accurate to describe this tendency as "response contraction bias", although similar response tendencies have been called regression effects (Stevens & Greenbaum, 1966). As Poulton (1994) explains, true regression effects are caused by variability, whereas response contraction is due to the effect that the central value of a scale, serving as a psychological default, has on estimation through an anchoring-and-adjustment process. Much earlier, Hollingworth (1910) referred to this as the central tendency of judgment. This account strikes us as applicable in the present context because the midpoint of the probability scale does have certain default properties: it is the expected value of random probability draws, and it corresponds to the point of maximum uncertainty when the possibility space is binary. This default is often a valid starting point when orienting to a new stimulus that may be present or absent, or to a new hypothesis that may be true or false. As a corrective for response contraction bias, Poulton (1994) recommends using the extreme values of the scale as anchors. In the present experiment, we used 0 as the default. If Poulton (1994) is correct, we might have anticipated even greater response contraction in the verbal condition had the default been set at 50%, a test that could be performed in future research.
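As a brief formal gloss on these default properties (our illustration, not part of the original analysis): for a probability drawn uniformly at random, the midpoint is the expectation, and for a binary possibility space it maximizes Shannon entropy:

\[
\mathbb{E}[P] = \int_0^1 p \, dp = \tfrac{1}{2},
\qquad
H(p) = -p\log_2 p - (1-p)\log_2(1-p),
\qquad
\arg\max_{p \in [0,1]} H(p) = \tfrac{1}{2}.
\]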
The present research did not show the "valence effect" found in a few other studies (Dhami & Mandel, 2021; Mandel, 2015a; Mullet & Rivet, 1991). Unlike the earlier studies, which manipulated the events such that one was in some way a reflection of the other (e.g., success vs. failure, or fit vs. not fit), the present research framed the same event either in terms of lives to be saved or lives to be lost. Whereas manipulations of reflection refer to different events, manipulations of frame refer to the same events "merely" described differently. It is possible that this difference accounts for the failure to observe the effect. However, it is also possible that the valence effect is not particularly robust. Since each of the earlier studies used a distinct task structure, it is premature to judge whether the difference between this research and the earlier studies is attributable to the framing context. Indeed, it is possible that the earlier findings do not themselves reflect a unitary valence effect: Mandel (2015a) manipulated valence with success versus failure, whereas Dhami and Mandel (2021) used affirmative versus negational statements. Our findings do, however, add to at least one other study showing no interaction of probability format and frame (Liu et al., 2020). Boundary conditions for valence effects on the interpretation of verbal probabilities are thus ripe for exploration in future research.
4.1 Policy Implications
Taken together, the findings of this research call into question current practices that use NBLP schemes, such as those adopted in climate science communication (e.g., Lewis et al., 2019; Mastrandrea et al., 2011), national security intelligence (Office of the Director of National Intelligence, 2015; NATO, 2016), and other domains (e.g., Morgan, 1998). Our findings add to those of other studies (e.g., Budescu et al., 2009, 2012, 2014; Ho et al., 2015; Wintle et al., 2019) suggesting that NBLP schemes are unlikely to achieve their goal of ensuring a high degree of agreement between senders and receivers of uncertain estimates. The present findings show that such schemes do not even ensure that probability terms with opposite directionality, such as unlikely and likely, deflect in opposite directions from fifty-fifty.
Earlier studies (Budescu et al., 2014; Ho et al., 2015; Patt & Dessai, 2005; Wintle et al., 2019) have identified limitations of these schemes. For instance, noting that most schemes, with some noteworthy exceptions (e.g., Barnes, 2016), are formulated by BOGSAT (i.e., a "bunch of guys/gals sitting around the table"; e.g., Marvin, 2020), and responding to more general calls for the application of scientific research to methodological problems in domains such as intelligence analysis (e.g., Chang et al., 2018; Dhami et al., 2015), Ho et al. (2015) aimed to show that the setting of numeric ranges on probability terms could be determined by empirical data and model fitting, with a resulting increase in agreement compared to existing NBLP schemes. As noted earlier, recommendations in prior work have also focused on repairing deficiencies by embedding the probability ranges not only in the lexicons used by organizations but also in each statement that uses a probability term from the relevant lexicon, as captured in the combined format. Given that (a) we found agreement to be as good using numeric ranges alone as using the combined format, (b) the tendency toward regressiveness in the combined condition fell between the verbal and numeric conditions, and (c) Knapp et al. (2016) found that risk assessments were more realistic following information in a numeric format than in a combined format, we question the utility of imposing NBLP schemes on senders and receivers. Moreover, the use of numeric ranges in specific assessments to clarify the meaning of vague probability terms runs the risk of being misinterpreted as credible intervals on the probability of the events referenced in the substantive assessments, yet this is not what the ranges are intended to signify (Mandel & Irwin, 2020).
Instead, our findings support recommendations for organizations to use numeric probabilities, either as point values (with or without margins of error) or as numeric ranges, without linguistic probabilities in their communications (Dhami & Mandel, 2020; Friedman, 2019; Mandel, Wallsten et al., 2021). If numeric ranges were unshackled from vague verbal probabilities, they could in fact be used as credible intervals on the probability of focal events referenced in substantive assessments, and there would be no risk of confusing intervals meant to define terms with intervals that are issue-specific. This would provide decision-makers with useful information both about the probability of events and about the uncertainty of the assessment (conveyed by the spread of the interval).
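As a minimal sketch of what such an unshackled format might look like in practice (the function and wording are hypothetical illustrations, not a prescribed standard):

# Hypothetical rendering of a numeric assessment: a best estimate plus a
# credible interval on the probability itself; the interval width conveys
# how uncertain the assessment is. All values are percentages.
def format_assessment(claim: str, best: float, lower: float, upper: float) -> str:
    assert 0 <= lower <= best <= upper <= 100
    return f"We assess a {best:.0f}% chance ({lower:.0f}%-{upper:.0f}%) that {claim}."

print(format_assessment("the ceasefire holds through year's end", 70, 60, 80))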
Of course, numeric quantifiers (i.e., numeric values, precise or imprecise, used in linguistic contexts) can still be ambiguous. It is not always obvious whether numeric quantifiers refer to exact values ("precisely p"), lower bounds ("at least p"), upper bounds ("at most p"), or fuzzy numbers ("roughly p") (e.g., Geurts & Nouwen, 2007; Mandel, 2014). Numeric ranges may also be interpreted variably, such that different end-users draw quite different conclusions about the underlying probability distributions, and such conclusions tend to be biased by end-users' worldviews in a belief-congruent manner (Dieckmann et al., 2017), although providing best estimates along with upper and lower bounds can reduce such variability (Dieckmann et al., 2015). Finally, there is little doubt that a principal reason senders prefer verbal probabilities to numeric alternatives is that they are easier and "more natural" to produce (Wallsten et al., 1993). In contexts where ease outweighs transparency as a concern, we cannot recommend against the use of verbal probabilities. Moreover, if ease is a significant concern, then the use of NBLP schemes may be preferable to a no-scheme alternative, since a scheme could at least steer senders away from the vaguest expressions to which they may be inclined. For instance, a study of probability expressions used in oral radiology found that the most frequently used expressions tended to have the widest ranges of meaning (Stheeman et al., 1993). Yet if especially vague terms are selected, such as realistic possibility, which was recently used in the UK intelligence community's NBLP scheme (Dhami & Mandel, 2020), such schemes could institutionalize rather than avoid the worst possible choices of terminology.
We further note that justifications for the use of NBLP schemes sometimes turn on the view that receivers lack the numeracy skills to correctly process numeric probability information (e.g., Lewis et al., 2019). Our findings, however, suggest that lower numeracy, along with lower verbal reasoning ability and a weaker actively open-minded thinking disposition, also portends difficulty with the proper application of NBLP schemes. In the present research, each measure of agreement was directly related to numeracy and to these other measures, calling into question how well such schemes serve the information interests of individuals with lower numeracy. This conclusion suggests that efforts might be better focused on numeracy education. Such education could focus on how to update probabilistic beliefs more coherently (Mandel, 2015b) and how to use comparison classes (Chang et al., 2016), as well as on overcoming popular misconceptions about quantifying uncertainty, such as the view that assigning numbers to probabilities implies that they are scientific estimates (Mandel & Irwin, 2020).
To sum up, in terms of vagueness, numeric probabilities pale in comparison to verbal probabilities. The idea that such vagueness can be brushed away by NBLP schemes that stipulate the semantic meaning of probability phrases has been attractive, if not outright seductive, to many organizations tasked with delivering uncertain estimates to diverse audiences. Unfortunately, across multiple studies, including the present research, that idea has garnered virtually no empirical support. NBLP schemes might seem to be a good solution, but a growing body of research suggests that they are not, in fact, what they seem.