1 Introduction
The comparative evaluation of theories is an issue of fundamental importance in all sciences. In general, many disciplines proceed by submitting a particular theory or derived hypothesis to empirical tests and evaluating it through the logic of verification and falsification. Although such tests can be constructed to differentiate between models (experimentum crucis), given that opposing predictions can be derived (Platt, 1964), their comparison typically proceeds more indirectly. Specifically, underlying assumptions or predictions derived from each particular model are tested independently. Over time, instances of confirmation and disconfirmation are accumulated for each model. According to the classical falsificationist logic (Popper, 1959), a model that repeatedly fails relevant tests is eventually discarded. Thereby, the question of which is the better theory or model is answered indirectly: in the long run, it is the model which makes testable and falsifiable predictions and endures critical tests of these. There are numerous implementations of this approach in JDM research, and well-stated arguments have been formulated in favor of testing critical properties or central assumptions of single models (for recent examples see Birnbaum, 2008; Fiedler, 2010). Indeed, a typical variant is to conduct series of investigations which successively shed light on the determinants and/or boundary conditions of certain effects or theories.
However, discontent with testing properties of single models in isolation has been voiced. The line of argument can be summarized as follows (Gigerenzer & Brighton, 2009; Marewski & Olsson, 2009): it is problematic to test a specific hypothesis derived from a single model against the indefinite number of unspecified alternatives. Rather, it is argued that we need to compare alternative models directly. In line with such arguments, a popular approach is to specify several competing models and directly compare these in terms of their ability to account for empirical data (Shiffrin, Lee, Kim, & Wagenmakers, 2008). One particular variant specific to JDM research is the strategy classification approach, which attempts to identify the decision strategy an individual most likely used (Bröder, 2000, 2002; Rieskamp & Hoffrage, 2008; Rieskamp & Otto, 2006). Following the idea that people adaptively select from a set of strategies (Gigerenzer & Selten, 2001; Payne, Bettman, & Johnson, 1988, 1993), models are compared on the level of individual subjects (Footnote 1), and the superior model is retained as a description of how the decision maker proceeded.
In the current paper, we focus on comparative model testing in general and the more JDM-specific procedure of strategy classification in particular. Following the notion that a good test of a theory is one that implements a sufficiently high hurdle for this theory to overcome (e.g., Meehl, 1967), we identify two major shortcomings in existing approaches to comparative model evaluation: (1) failure to distinguish between random and systematic error and (2) neglect of global model fit. As we will argue and demonstrate, both shortcomings seriously call into question the conclusions that may be drawn.
2 Systematic versus random error
One approach to model evaluation is to assess which of several models makes the most correct predictions in terms of observable choices. The adherence rate denotes the proportion of observed choices that are in line with the predictions of a model, given that the latter makes a prediction (Footnote 2). For example, the recognition heuristic (Goldstein & Gigerenzer, 2002) predicts that people choose recognized over unrecognized options when judging which scores higher on some criterion (e.g., which of two cities has more inhabitants). The rate of adherence to this heuristic is simply the proportion of cases in which a participant chose the recognized option, while the error rate is defined as the proportion of choices that conflict with the heuristic's predictions (i.e., 100% minus the adherence rate). When we compare competing models, we regard the one yielding the highest adherence rate as the data-generating model (e.g., Marewski, Gaissmaier, Schooler, Goldstein, & Gigerenzer, 2010). At the same time, a model need not yield perfect adherence (100%), because choices will be marred by some execution errors resulting from demands of the task, fatigue, slips of the finger, etc.
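The adherence rate can be made concrete with a minimal sketch; the choice coding and the data below are invented for illustration, not taken from any study.

```python
def adherence_rate(choices, predictions):
    """Proportion of observed choices matching a model's predictions,
    counting only trials on which the model makes a prediction."""
    scored = [(c, p) for c, p in zip(choices, predictions) if p is not None]
    return sum(c == p for c, p in scored) / len(scored)

# 1 = chose the recognized option, 0 = chose the unrecognized one;
# None marks trials on which the heuristic makes no prediction
# (e.g., both cities recognized).
predictions = [1, 1, 1, 1, None, 1, 1, 1]
choices     = [1, 1, 0, 1, 1,    1, 0, 1]

rate = adherence_rate(choices, predictions)
print(round(rate, 3))        # 5 of 7 scored trials match: 0.714
print(round(1 - rate, 3))    # error rate: 0.286
```

Note that the unscored trial (where the model makes no prediction) drops out of both the adherence and the error rate.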
Which maximal error rate a model should be allowed to produce, however, is subject to researchers' idiosyncrasies. Since an adherence rate of 50% would be observed for purely random patterns in binary choices, this is the lowest useful criterion (Glöckner, 2009; Rieskamp, 2008). However, for choice patterns approaching simple random responding, it would be dubious to conclude that any strategy was executed systematically at all. Some have therefore suggested applying stricter criteria (Bröder & Schiffer, 2003; Glöckner, 2009). Nonetheless, a general reservation against applying a single error threshold to all models or strategies is that they may not be equally difficult to execute, and that the amount of execution errors may also depend on the particular task (Footnote 3).
Irrespective of the error threshold applied, adherence rates rest on a strong and very problematic assumption regarding the type of error which occurs. It is implicitly taken for granted that the error is entirely random and that only its average size across all items or trials matters. In the above example, it is merely considered how many of, say, 100 paired-comparison choices the recognition heuristic predicts correctly. However, an at least equally relevant question is which of these 100 choices are explained by the model. The adherence rate ignores this latter aspect. As a consequence, model refutation becomes extremely difficult: since almost any (completely implausible) model can easily produce above-chance adherence rates (Hilbig, 2010b), how should we expect to falsify a model? By contrast, assessing whether a model or strategy adequately describes observed choices is a question of the degree of systematic error. The crucial question is not merely how much overall error a model implies, but whether the error really is random and, consequently, of equal magnitude across all items. The main flaw inherent in adherence rates is the neglect of different item types and their respective error rates. Only by considering these separately can we identify systematic error.
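The distinction can be illustrated with a toy computation (all counts invented): a data set whose overall adherence rate looks respectable, but whose deviations are concentrated entirely in one item type.

```python
from collections import defaultdict

def adherence_by_item_type(choices, predictions, item_types):
    """Adherence rate computed separately for each item type."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c, p, t in zip(choices, predictions, item_types):
        totals[t] += 1
        hits[t] += (c == p)
    return {t: hits[t] / totals[t] for t in sorted(totals)}

# 20 invented trials: the model predicts option 1 throughout, but all
# deviations occur on item type "B" (e.g., items on which the predicted
# choice happens to be factually false).
item_types  = ["A"] * 10 + ["B"] * 10
predictions = [1] * 20
choices     = [1] * 10 + [1] * 6 + [0] * 4

overall = sum(c == p for c, p in zip(choices, predictions)) / len(choices)
print(overall)                                            # 0.8
print(adherence_by_item_type(choices, predictions, item_types))
# {'A': 1.0, 'B': 0.6}: the 20% "error" is systematic, not random
```

A model with truly random execution error should show roughly the same adherence in every cell of this breakdown; the split above would refute that assumption.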
Returning to the above example, the recognition heuristic often yields adherence rates greater than 80% (Pachur et al., 2008; Pohl, 2006) and thus a relatively small average error. However, as argued above, the crucial question is whether the probability of choosing as predicted by the recognition heuristic is roughly 80% across all items (allowing for attributing the remaining 20% to random error). In fact, however, this is not the case (Hilbig & Pohl, 2008; Pohl, 2006). Participants often adhere to the recognition heuristic whenever it implies a factually correct choice (e.g., when the recognized of two cities really has more inhabitants than the unrecognized one), but deviate from the heuristic's prediction whenever the choice it implies is factually false (Hilbig, Pohl, & Bröder, 2009). Thus, their adherence varies systematically as a function of item type, which is why it is entirely inappropriate to attribute non-adherence to random error only (Hilbig, Erdfelder, & Pohl, 2010; Hilbig, 2010b).
For exactly these reasons, researchers have specifically tested whether models yield equal adherence rates across different types of items (e.g., Bröder & Eichler, 2006; Hilbig, 2008b; Newell & Fernandez, 2006; Richter & Späth, 2006). These studies represent clear instances of model refutation by testing the critical property of unsystematic errors across experimentally manipulated types of items. The overall adherence rate, by contrast, is uninformative, unlikely to provide an instance of model refutation, and often biased (Hilbig, 2010a). Therefore, models cannot be evaluated (let alone compared to each other) by merely considering whether choices deviate from model predictions. Although this point has been recognized before (e.g., Bröder & Schiffer, 2003), current JDM articles fail to treat it appropriately (e.g., Brandstätter, Gigerenzer, & Hertwig, 2008; Marewski et al., 2010).
3 Neglect of global model fit
Addressing this severe shortcoming of adherence rates, several researchers have developed methods to assess which model or strategy best accounts for the data while allowing for only a certain degree of random error (Bröder & Schiffer, 2003; Glöckner, 2009; Jekel, Nicklisch, & Glöckner, 2010; Rieskamp, 2008). For each model considered, the empirical distance of the predicted pattern from the observed pattern is measured by means of a distance function (usually a log-likelihood value or a transformation thereof, such as the Bayesian Information Criterion, BIC). Because the error is constrained to be equal across all item types, systematic errors lead to model misfit and are thus penalized. The best-fitting model is then deemed to reflect the actual decision-making process or strategy.
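A minimal sketch of this idea follows; it assumes, as in the methods cited, one error rate shared across item types, but the single-parameter setup and all counts are simplified illustrations, not the exact published procedures.

```python
from math import comb, log

def strategy_bic(hits, totals):
    """BIC of a strategy that predicts a definite option on every item
    type, with one free error-rate parameter shared across item types."""
    n = sum(totals)
    eps = (n - sum(hits)) / n                  # ML estimate of the error rate
    eps = min(max(eps, 1e-10), 1 - 1e-10)      # guard against log(0)
    loglik = sum(
        log(comb(N, h)) + h * log(1 - eps) + (N - h) * log(eps)
        for h, N in zip(hits, totals)
    )
    return -2 * loglik + 1 * log(n)            # one free parameter: eps

# Invented participant: 30 choices per item type; `hits` counts choices
# consistent with each strategy's predicted pattern.
bic_s1 = strategy_bic(hits=[27, 26, 28], totals=[30, 30, 30])   # ~10% error
bic_s2 = strategy_bic(hits=[20, 15, 22], totals=[30, 30, 30])   # ~37% error
print(bic_s1 < bic_s2)   # True: strategy 1 is preferred
```

Because the error rate is a single shared parameter, a strategy whose deviations cluster in one item type cannot "absorb" them and pays a likelihood penalty, which is exactly the property that distinguishes these methods from raw adherence rates.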
Despite the indubitable superiority of such procedures over the mere comparison of adherence rates, they also bear a caveat: relying on relative model fit as the criterion requires that the data-generating model is among the competitors. However, the observed data may not have been generated by any of the models considered, and the model yielding the smallest discrepancy may be entirely invalid (Gelman & Rubin, 1995; Roberts & Pashler, 2000; Zucchini, 2000). Like the average adherence rate, relative model fit is unlikely to allow for model refutation: one model will always fit the data best. Without the ability to falsify candidate models, researchers may uphold a model that is merely least false but still far from adequate.
Although this issue has been openly acknowledged (Bröder & Schiffer, 2003; Glöckner, 2009), no conclusive efforts have been made to tackle the problem. Fortunately, however, it is easy to assess whether a particular model might have generated the data: prior to preferring a certain model over others by drawing on relative fit, we need to establish that each is able to account for the observed data, by testing global goodness-of-fit. Thus, instead of taking for granted the vital requirement that a model by itself should adequately describe the data, we need to test this assumption. As we demonstrate below, failure to consider absolute model fit can easily lead to flawed conclusions.
To illustrate this point, consider the judgment situation depicted in Table 1 (Bröder & Schiffer, 2003): decision makers infer which of two options (A or B) is superior in terms of some criterion, given four probabilistic binary cues (for applications of such task structures see Bröder & Schiffer, 2006; Glöckner & Betsch, 2008; Rieskamp & Hoffrage, 2008). For example, the task might be to judge which of two cities, A or B (options), has more inhabitants (criterion) based on whether or not a city has an international airport, is a state capital, has a university, and has a major-league football team (probabilistic binary cues with different predictive validity, cf. Gigerenzer & Goldstein, 1996). There are three item types (choices between A/B, C/D, and E/F in Table 1) which differ in their cue patterns, constructed so as to differentiate between three candidate decision strategies: a weighted additive strategy (WADD; choose the option with the higher sum of cue values weighted by their validities), an equal weight strategy (EQW; choose the option with the higher sum of positive cue values), and a lexicographic take-the-best strategy (TTB; consider cues in order of their validity; choose according to the first discriminating cue). Table 1 shows the choice predictions of each strategy.
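The three strategies can be sketched in a few lines of code. The cue validities and the cue pattern below are invented stand-ins (Table 1 itself is not reproduced here); each option is a list of binary cue values ordered by descending validity.

```python
VALIDITIES = [0.9, 0.8, 0.7, 0.6]   # hypothetical validities, descending

def wadd(a, b, v=VALIDITIES):
    """Weighted additive: compare validity-weighted sums of cue values."""
    sa = sum(x * w for x, w in zip(a, v))
    sb = sum(x * w for x, w in zip(b, v))
    return "A" if sa > sb else "B" if sb > sa else "guess"

def eqw(a, b):
    """Equal weight: compare unweighted sums of positive cue values."""
    return "A" if sum(a) > sum(b) else "B" if sum(b) > sum(a) else "guess"

def ttb(a, b):
    """Take-the-best: decide by the first (most valid) discriminating cue."""
    for x, y in zip(a, b):
        if x != y:
            return "A" if x > y else "B"
    return "guess"

# One invented item: the options tie on both weighted and unweighted
# sums, but the most valid cue favors option A.
A, B = [1, 0, 0, 1], [0, 1, 1, 0]
print(wadd(A, B), eqw(A, B), ttb(A, B))   # guess guess A
```

Items like this one, on which the strategies' predictions diverge, are what makes the item types diagnostic for strategy classification.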
Using this set-up, we ran a series of simulations, mostly mirroring the procedures of Bröder and Schiffer (2003). We first let each of the three strategies (WADD, EQW, and TTB) generate 1,000 data sets (simulated decision makers) with 30 choices per item type and a constant random error rate of 10%. Additionally, 1,000 data sets were generated by a pure guessing strategy to rule out the possibility that some strategies would fit random data. However, our main argument raised above is that conclusions based on the mere assessment of relative fit will be flawed if the data-generating strategy is not within the set of those considered. Thus, we additionally simulated 1,000 data sets generated by a three-cue strategy (3C; compare choice options on each cue; choose the first option to reach three positive cue values). Across the three item types, it predicts the choice pattern "A, guess, E", which is distinct from those of the other strategies (see Table 1). Note that any other strategy could have been used for this demonstration, as long as it predicts a choice pattern distinct from the strategies under consideration (Footnote 4).
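The data-generating step can be sketched as follows. The encoding (True = choose the first option) and the TTB pattern are invented for illustration; the 3C pattern follows the "A, guess, E" description above.

```python
import random

# Predicted choice of the first option per item type; None = guess.
PATTERNS = {
    "3C":  [True, None, True],    # "A, guess, E", as described above
    "TTB": [True, True, False],   # invented pattern for illustration
}

def simulate(pattern, n_per_type=30, error=0.10, rng=random):
    """Simulate one decision maker: follow the strategy's prediction,
    flipping each definite choice with probability `error`; guessing
    predictions become fair coin flips."""
    data = []
    for predicted in pattern:
        for _ in range(n_per_type):
            if predicted is None:
                data.append(rng.random() < 0.5)
            else:
                data.append(predicted ^ (rng.random() < error))
    return data

random.seed(42)
choices = simulate(PATTERNS["3C"])
print(len(choices))   # 90 choices: 30 per item type
```

Repeating this 1,000 times per strategy yields the simulated decision makers described above.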
Parameter estimation for each strategy and data set proceeded by minimizing the log-likelihood ratio statistic G² by means of the EM algorithm (Hu & Batchelder, 1994) as implemented in the multiTree software tool (Moshagen, 2010). Following the recommendation of Glöckner (2009), a strategy was no longer considered if it required an average error rate of 30% or more. Then, the strategy yielding the smallest BIC was chosen for classification.
The results displayed in Table 2 mirror those of Bröder and Schiffer (2003): for data generated by any of the strategies within the set, classifications were almost perfect. The reliability of such classifications can be assessed through the Bayes factor, which expresses the posterior odds in favor of one model compared to another, given the data (Wagenmakers, 2007). The Bayes factors between the best- and the second-best-fitting strategy were > 3 (implying positive evidence; Raftery, 1995) for more than 95% of all classifications, suggesting that these were highly reliable. At the same time, random data generation led to practically all data sets remaining unclassified, as is desirable.
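Given BIC values for two strategies, the Bayes factor used above can be approximated from the BIC difference as BF ≈ exp(ΔBIC / 2) (Wagenmakers, 2007); the BIC numbers below are invented for illustration.

```python
from math import exp

def bayes_factor_from_bic(bic_best, bic_second):
    """Approximate Bayes factor favoring the best-fitting model,
    computed from the BIC difference (Wagenmakers, 2007)."""
    return exp((bic_second - bic_best) / 2)

bf = bayes_factor_from_bic(bic_best=112.4, bic_second=118.9)
print(bf > 3)   # True: "positive evidence" by Raftery's (1995) convention
```

This approximation assumes equal prior model odds and uses the BIC as a stand-in for the marginal likelihood, which is the standard reading of Wagenmakers (2007).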
Note (Table 1): positive cue values are indicated by +, negative cue values by −. A:B represents guessing between options.
However, once data were generated by a strategy outside of the set considered, there were substantial misclassifications. When the 3C strategy was the true underlying model, the optimal outcome would have been that not a single data set is classified. However, about 85% of data sets were actually classified, all as WADD or TTB in roughly equal proportions (final row of Table 2). Once again, Bayes factors comparing the best- and second-best-fitting model were > 3 in over 99% of data sets. Thus, most data sets were clearly and reliably classified, even though not a single one was generated by any of the strategies under consideration. Given that researchers can rarely claim to know whether the data-generating strategy is in fact within their set, this finding seriously questions any conclusion drawn from such a model comparison procedure based on relative fit.
As a remedy, we call for initially considering absolute model fit to determine whether a model is consistent with the data. Because the classification method of Bröder and Schiffer (2003) is a member of the family of multinomial processing tree models (Batchelder & Riefer, 1999; Erdfelder et al., 2009), absolute model fit for each strategy can be determined by evaluating the asymptotically chi-square distributed log-likelihood ratio statistic G² (Hu & Batchelder, 1994; Footnote 5). In the above example with the 3C strategy generating the data, using a conventional type-I error of .05 (Footnote 6) yielded exclusion of 95.5% of data sets as unclassifiable. Thus, rather than wrongly considering about 85% of decision makers as WADD or TTB users, the vast majority was treated appropriately and left unclassified. At the same time, introducing an absolute-fit threshold only marginally affected classifications whenever data generation followed one of the strategies within the set: non-classification rates were 7.6%, 5.7%, and 10.3% for data generated by WADD, EQW, and TTB, respectively. As this exercise demonstrates, using absolute model fit to refute candidate models prevents false classifications if the data-generating model is not in the set of those considered. At the same time, if the true model is within the set, classifications are only slightly more conservative.
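Such an absolute-fit check can be sketched with the simplified single-error-rate model from above (all counts invented). With three item types and one free error-rate parameter, the G² statistic has two degrees of freedom; 5.991 is the chi-square critical value for alpha = .05 at df = 2.

```python
from math import log

def g_squared(hits, totals):
    """G² of the single-error-rate model against the saturated model:
    2 * sum(observed * log(observed / expected)) over all cells."""
    n = sum(totals)
    eps = (n - sum(hits)) / n                  # constrained ML error rate
    g2 = 0.0
    for h, N in zip(hits, totals):
        for observed, p in ((h, 1 - eps), (N - h, eps)):
            if observed > 0:
                g2 += 2 * observed * log(observed / (N * p))
    return g2

CHI2_CRIT = 5.991   # chi-square critical value, alpha = .05, df = 2

# Error roughly constant across three item types: the model fits ...
print(g_squared([27, 26, 28], [30, 30, 30]) < CHI2_CRIT)   # True
# ... but error concentrated in one item type is rejected.
print(g_squared([30, 12, 30], [30, 30, 30]) < CHI2_CRIT)   # False
```

Only strategies passing this test would then enter the BIC comparison; the second participant above would be left unclassified rather than assigned to the least-bad strategy.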
4 Model fit and validity
If a model or strategy is found to fit the data in absolute terms and also outperforms other (fitting) models, can we conclude that the model under investigation is correct? Unfortunately, no. The seminal work of Wason (1968) provides an instructive example of this fallacy: when asked to identify the rule underlying a sequence of numbers such as "2, 4, 8", people find it difficult to identify the generality of the underlying rule and tend to test overly specific rules such as "the previous number multiplied by two". Although such specific rules may perfectly describe the given sequence, the actual data-generating rule may be much more general (e.g., "a triple of numbers"). Thus, considering a fitting model to be the data-generating one is an instance of the classical logical fallacy of affirming the consequent (Trafimow, 2009): the rule "if the model is correct, then the model will fit the data" does not imply the reverse, "the model fits the data, therefore it is correct". If a model fits the data, it is only a candidate that may have generated the data (although the plausibility of this assertion can be increased by drawing on additional tests).
More generally speaking, even a perfectly fitting model need not be valid, because at least one of its core assumptions may be entirely wrong (Roberts & Pashler, 2000), and there may also be "infinitely many theoretically distinct models that fit the data equally well" (Voss, Rothermund, & Voss, 2004, p. 1217). In essence, "the danger is […] to use a good fit as a surrogate for a theory" (Gigerenzer, 1998, p. 200), as model fit is only ever necessary but never sufficient for model validity. In turn, regardless of whether the underlying assumptions of a model are theoretically and empirically justifiable, misfit provides an instance of falsification with regard to the model in question. However, the diagnosticity of model misfit is also limited, because it may stem from very different sources that do not necessarily invalidate the core assumptions of a certain model. Even repeated occurrences of misfit may still be due to inappropriate auxiliary assumptions that are of little relevance to the core ideas of a model (Lakatos, 1970). Nevertheless, model (mis)fit must be acknowledged, and failure to fit the data at the very least calls for model refinement.
Since the conclusions that can be drawn from model comparisons are necessarily limited, even when these comparisons test for systematic error appropriately and are based on assessments of absolute model fit as called for herein, model evaluations will always need to be complemented by other tests of critical model properties or tests of competing hypotheses derived from different models (Glöckner & Herbold, 2011; Hilbig & Pohl, 2009; Roberts & Pashler, 2000; Footnote 7).
5 Conclusion
In the present paper, we identified two major shortcomings of existing approaches to comparative model evaluation, namely (1) failure to distinguish between random and systematic error and (2) neglect of global model fit. Both of these lessen the chances of falsifying models and therefore increase the danger of drawing inadequate conclusions. The first point refers to studies comparing models by means of the average adherence across items, that is, the proportion of choices in line with a model's predictions (e.g., Brandstätter, Gigerenzer, & Hertwig, 2008; Marewski et al., 2010) or similar measures such as majority choices (Brandstätter, Gigerenzer, & Hertwig, 2006; Kahneman & Tversky, 1979). The second applies to superior approaches which penalize systematic error and compare models based on their relative ability to account for the data (Bröder & Schiffer, 2003; Glöckner, 2009; Rieskamp, 2008). These can warrant valid conclusions only if the data-generating model is in the set of those compared. However, this is typically unknown, and whenever it is not the case, failure to test whether models are able to adequately describe observed data in terms of absolute goodness-of-fit can lead to false conclusions.
In summary, we propose to retain the logic of falsification (as is well implemented when testing critical properties of single models; Birnbaum, 2008; Fiedler, 2010) when comparing models or strategies in terms of fit. Misfit of a model represents an instance of falsification and should exclude this model from consideration in a model comparison. This, in turn, will secure conclusions drawn from model comparisons against the daunting possibility that "in the land of the blind, the one-eyed [model] is made king". Moreover, we advocate testing critical properties or central assumptions of models directly, instead of pursuing blind competitions. The higher we set the hurdles for our models, the more confidence we can have in those which stand the test of time.