
The non-effects of repeated exposure to the Cognitive Reflection Test

Published online by Cambridge University Press:  01 January 2023

Andrew Meyer*
Affiliation:
Booth School of Business, University of Chicago
Elizabeth Zhou
Affiliation:
Massachusetts Institute of Technology
Shane Frederick
Affiliation:
Yale School of Management, Yale University
*
Email: [email protected].

Abstract

We estimate the effects of repeated exposure to the Cognitive Reflection Test (CRT) by examining 14,053 MTurk subjects who took the test up to 25 times. In contrast with inferences drawn from self-reported prior exposure to the CRT, we find that prior exposure usually fails to improve scores. On average, respondents get only 0.024 additional items correct per exposure, and this small increase is driven entirely by the minority of subjects who continue to spend time reflecting on the items. Moreover, later scores retain the predictive validity of earlier scores, even when they differ, because initial success and later improvement appear to measure the same thing.

Type
Research Article
Creative Commons
Creative Commons License - CC-BY
The authors license this article under the terms of the Creative Commons Attribution 3.0 License.
Copyright
Copyright © The Authors [2018] This is an Open Access article, distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

The Cognitive Reflection Test (below) is intended to measure the disposition or ability to engage in reflective thought (Frederick, 2005), as it requires, among other things, that respondents override intuitively appealing but incorrect answers. The test has become popular because it is easy to administer, maps onto the central distinction underlying many dual process theories (Kahneman & Frederick, 2002; Evans & Stanovich, 2013), and predicts things that people care about, such as patience (Frederick, 2005; Shenhav, Rand & Greene, 2017), risk tolerance (Frederick, 2005; Campitelli & Labollita, 2010), willingness to admit ignorance (Fernbach et al., 2012), ability to differentiate real news from fake news (Pennycook & Rand, 2017), and religiosity (Pennycook et al., 2012; Shenhav, Rand & Greene, 2012).

A bat and a ball cost $110 in total. The bat costs $100 more than the ball. How much does the ball cost? _____ dollars

If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? _____ mins

In a lake there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the lake, how long would it take for the patch to cover half the lake? _____ days

Since the test has become popular, frequent subjects in psychological studies (e.g., MTurkers, some undergraduates, etc.) may encounter it multiple times. Although respondents usually receive no feedback, solutions are readily available online (there are currently over 300 YouTube videos explaining how to solve the bat & ball problem). This paper investigates the effect of repeated exposure on scores and on the predictive validity of those scores, by tracking the performance of 14,053 MTurkers who took the test from 1 to 25 times between November, 2013 and April, 2015. Table 1 partitions the data into four series of surveys and provides an overview.

Table 1: Data overview

Four results are notable: (1) self-reports of prior exposure markedly exaggerate the effect of prior exposure on score. (2) The average effect of prior exposure is small. (3) This small average effect is driven almost entirely by the subset of subjects who continue to spend time on the test. (4) The test’s predictive validity is robust to prior exposure, in part because subsequent scores are an excellent proxy for initial scores, and in part because initial performance and later improvement both diagnose the tendency to reflect.

The observation that more active MTurkers perform better on the CRT (Chandler et al., 2013) has sparked worries that prior exposure may invalidate the test. In response, researchers have asked subjects whether they’ve seen the test before (Haigh, 2016; Stieger & Reips, 2016), which items they’ve seen before (Haigh, 2016), or how many of the three items they’ve seen before (Thomson & Oppenheimer, 2016, and us, throughout our Fall 2014 series). In all cases, respondents who report having seen the test before do better – often by a lot, as shown in the middle column of Table 2.

Table 2: CRT scores by self-reported exposure (and # of scores).

The relation between reported exposure and performance is usually interpreted as an effect of exposure. However, that causal inference requires at least two assumptions: first, that mathematical ability is uncorrelated with the degree of exposure, and second, that mathematical ability is uncorrelated with the ability to recall exposure. The rightmost column of Table 2 shows that the second assumption is badly violated. As one might have predicted from the general tendency for mental abilities to correlate positively (Jensen, 1998; Lubinski & Humphreys, 1990; Unsworth, 2010), the ability to recall exposure to these problems is strongly correlated with the ability to solve them. Thus, self-reported prior exposure would diagnose superior performance (identifying those who are good at these problems) even if actual exposure had no effect.Footnote 1 (For more, see Appendix A.)

Table 3 shows that the first assumption may be violated as well. It sorts subjects by the number of times they appear, and reveals that more frequent subjects have higher CRT scores, even on their first trial, suggesting that mathematically inclined subjects expose themselves to such tasks more frequently and, correspondingly, are more likely to have had prior exposure to the CRT. (For more, see Appendix B.)

The best way to assess the effect of exposure, per se, is to track the performance of the same subjects over time. Although we don’t know the exposure histories of people entering our study, we can track subjects who appear multiple times during the Fall of 2014. These longitudinal effects are revealed in Table 3 as changes in the numbers moving down any column. They show a small effect of exposure on scores (which rise slightly) and a large effect on response latencies (subjects spend much less time on the test). Scores improve by an average of only 0.024 items per exposure – a tiny fraction of the 0.829 item improvement implied by self-reports.Footnote 2
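
The per-exposure estimate above is just the mean score change between consecutive trials for the same subject. The following sketch illustrates the computation on fabricated records; the (subject, trial, score) layout is an assumption for illustration, not the authors' actual pipeline.

```python
# Sketch: average per-exposure score change from longitudinal records.
# Data layout (subject_id, trial_number, crt_score) is hypothetical.

def mean_change_per_exposure(records):
    """Mean CRT score change between consecutive trials of the same subject."""
    by_subject = {}
    for sid, trial, score in records:
        by_subject.setdefault(sid, []).append((trial, score))
    changes = []
    for trials in by_subject.values():
        trials.sort()  # order each subject's records by trial number
        for (_, s1), (_, s2) in zip(trials, trials[1:]):
            changes.append(s2 - s1)
    return sum(changes) / len(changes) if changes else 0.0

# Toy example: subject "a" improves by one item over two re-exposures,
# subject "b" is flat, so the mean change is (0 + 1 + 0) / 3.
data = [("a", 1, 1), ("a", 2, 1), ("a", 3, 2),
        ("b", 1, 3), ("b", 2, 3)]
print(mean_change_per_exposure(data))  # prints 0.3333333333333333
```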

Table 3: Mean CRT scores and geometric mean seconds to respond across repeated testing

Many have expressed concerns that the CRT will be destroyed by its popularity (Chandler et al., 2013; Baron et al., 2015; Haigh, 2016; Stieger & Reips, 2016; Thomson & Oppenheimer, 2016). The most common worry is that respondents will learn all the answers, eliminating any variance, and, hence, any covariance with other constructs of interest. But this concern is overstated. Though a rise in scores reduces variance in elite populations, for which ceiling effects are already a problem (e.g., Princeton undergraduates), it increases variance in less elite populations, for which floor effects are the current problem. MTurkers are likely the most heavily exposed population (Rand et al., 2014), yet plenty of variance remains.

The concern that the CRT items “will lose some of their predictive power through repeated use” (Baron et al., 2015, page 268) reflects not only the worry about ceiling effects, but also the worry that the ability to learn the correct answers may measure something different from the ability to solve the problems in the first place. Among subjects who take the CRT multiple times, one can model current score (S_n) as initial score (S_1) plus the improvement afforded by further opportunities to reflect (R_2:n), plus an error term (ε_n) to capture changes in score that are uncorrelated with reflection:

S_n = S_1 + R_2:n + ε_n

From this perspective, the predictive validity of current score will remain intact if it closely resembles the initial score (R_2:n and ε_n are both small), or if S_1 and R_2:n measure the same thing and ε_n is small. Both of these conditions appear to be met. First, scores are highly stable: subjects miss 90% of problems they missed on the preceding trial, and solve 95% of the problems they solved on the preceding trial (see Appendix C for further analysis). Second, score increases appear to indicate reflection, as they are more likely among people who solved other items (see Appendix D), and are limited to those who continue to spend time on the test upon re-exposure (see Figure 1 and Appendix E).Footnote 3 Moreover, this subset is not just discovering and memorizing the correct responses; they appear to be learning how to solve these types of problems, as their improvements transfer to a modified CRT with different correct answers (contradicting Chandler et al., 2013; see Appendix F).

Figure 1: Time spent on CRT and score improvement. Analysis is of returning subjects within the Fall 2014 series. Data are sorted by cumulative time spent after first exposure and separated into 30 segments of 253 observations. The position of each dot corresponds to the average cumulative time spent and score increase for that segment. Error bars are 95% confidence intervals.

In any case, secular trends in the predictive validity of some instrument are easy to test for: one can simply check whether the correlation of interest changes. We can perform a few such tests with our data. First, in our Fall 2014 studies, we obtained self-reported SAT scores from 1,407 MTurkers who took the CRT at least twice.Footnote 4 Their final CRT scores predict SAT about as well as their initial scores, and the changes in score add significant incremental validity (see Table 4 and Appendix G).Footnote 5 Second, 327 subjects from the Fall of 2013 returned in the Spring of 2014, when they encountered the Linda problem (Tversky & Kahneman, 1983), six items from Raven’s Advanced Progressive Matrices (Raven, 1941), and the CRT (again). Once again, performance on these other tests was predicted as well by final CRT scores as by initial scores (see Appendix H). Additionally, using self-reports as a proxy for prior CRT exposure, Bialek & Pennycook (2017) find no evidence that the test’s predictive validity decreases across a large battery of covariates.

Table 4: Mean SAT scores sorted by initial and final CRT scores (and # of scores).

Those who fret about the test’s continued validity assume, reasonably, that someone who scores 0/3 the first time but 3/3 the second time was originally correctly classified (as unreflective), is now misclassified (as reflective), and is erroneously lumped with those who got 3/3 the first time.Footnote 6 At first blush, this concern seems warranted: parroting answers one learns is not the same as generating those answers oneself. But suppose such a person had misgivings about their answers, the curiosity to act upon this doubt by Googling these items, the patience to sit through YouTube tutorials explaining their solutions, and the ability to remember these solutions when they encounter those items again. Those faculties sound conceptually close to what the test is intended to assess, and possibly even a purer measure than the sum of traits that enable correct solutions the first time, which include facility with algebra and with puzzles. Thus, we can find merit in the opposite interpretation: that this person was initially misclassified as unreflective, and is now being correctly classified as reflective.

Although we’ve focused on the CRT, this underlying logic applies to the shelf-life of any test. If current performance is a faithful proxy for initial performance, or if change in performance measures the same thing as initial performance, the test’s predictive validity won’t be harmed by repeated exposure. Indeed, Appendix I shows that average performance on the Raven’s and Linda items is about as stable as CRT scores. Just as a wine may become better, worse, or different as it acquires and loses various chemical aspects, the quality of a test may change depending upon the amounts of various traits a correct response betokens and the exact relations between levels of those traits and other constructs of interest (e.g., risk preferences, trolley preferences, authoritarianism, belief in God, and so on).Footnote 7

The foregoing discussion should give pause to those who assume that the psychometric value of the CRT (or any test) necessarily declines with time. This could occur, but there is no compelling reason to think it is typical. Moreover, two primary concerns associated with the continued use of any test – response variance and predictive validity – can be straightforwardly assessed by looking at the data.Footnote 8 With respect to the CRT, that assessment will likely prove reassuring: in the most heavily exposed population, scores exhibit ample variance, are surprisingly stable, and retain their predictive validity, even when they change.

Appendix A: self-reported prior exposure probably reflects actual prior exposure plus latent ability and past performance

In the main text, we suggest that self-reported prior exposure to the CRT should not be interpreted as actual prior exposure or even as a noisy proxy for actual prior exposure. Here, we model it as a joint function of prior exposure, mental ability, and prior success on the test. First, we quantify the relation between likelihood of recalling prior exposure and amount of prior exposure. Then we attempt to differentiate two other determinants: mental ability and past performance on the test.

Table A1 shows how self-reported exposure varies according to how often subjects had encountered the CRT in the Fall 2014 series (reading down the columns) and how often they eventually would (reading right along the rows). If self-reports accurately reflected the actual number of items respondents had seen before, they would increase to three by the second row and remain at three in all following rows. Though we do observe a large increase between the first and second rows, the average does not go immediately to 3.0, but instead continues to rise gradually with further exposure. The increase moving right across the rows is most likely a composite effect of unobserved prior exposure and ability, which facilitates memory.

Table A1: # of CRT items reported seen before and # of subjects responding across repeated testing.

Table A2 shows that people are more likely to recall their prior exposure to the test if they had done well on it (r(6,761) = 0.25, p < 0.001). This could either be interpreted as an effect of their prior success on their ability to recall the problems or as an effect of mental ability on both their prior success and their ability to recall the problems.

Table A2: relation between previous CRT score and self-reported prior exposure.

The first two columns of Table A3 show that the relation between previous performance and self-reported prior exposure is completely robust to controls for number of observed prior exposures and total number of exposures (as a proxy for unobserved prior exposure). The coefficient on previous CRT score barely changes with the addition of those controls.

Table A3: OLS estimates of the effect of prior exposure and previous performance on self-reported number of items seen before (with standard errors).

The third column of Table A3 adds controls for subjects’ previously reported number of items seen before to show that previous performance not only predicts cross-sectional differences in self-reported exposure, but also predicts changes in self-reported exposure within the same respondent.

Table A4 gives a more nuanced view of the average effect estimated in column 3 of Table A3. It shows the average self-reported number of items seen before, separately for each previous CRT score and previously reported number of items seen before.

Table A4: relation between previous CRT performance and self-reported prior exposure, separately for each level of previous self-reported prior exposure.

A relation between mnemonic ability and general intelligence struggles to explain the fact that changes in previous performance continue to predict changes in self-reported prior exposure within the same subject over time (even after controlling for number of prior exposures). This suggests some direct effect of prior performance on problem recall. But regardless of whether the relation is actually driven by past performance or merely by general intelligence, self-reported prior exposure will proxy for the ability to solve these problems above and beyond any effect of exposure, per se.

Appendix B: the relation between initial performance and frequency of later appearance

Table B1: Mean CRT score among probable CRT “virgins” and mean CRT score of everybody else.

More frequent subjects in our study perform better on the CRT, even on their first exposure. To help differentiate selection effects from effects of unobserved prior exposure, we attempt to identify subjects who probably hadn’t seen the CRT prior to our study by restricting analyses to those who (1) did not appear in any prior series of our data, (2) reported having seen zero items on their first exposure, and (3) reported having seen three items on every subsequent exposure (which provides evidence that their first report was accurate).

The positive relation between frequency of exposure and initial performance remains (p = 0.07) and is of similar magnitude to full sample estimates, suggesting that willingness to repeatedly engage in this task indicates greater aptitude for it, even if prior MTurk activity had not brought them in contact with the CRT. The more active subjects in our study were markedly less likely to be encountering the CRT for the first time in this study, suggesting a significant role of unobserved – and heavy – prior exposure.Footnote 9

In the demographics section of the survey, subjects reported their SAT scores and educational attainment. Those who appear more frequently in our survey were more likely to report a valid SAT score (r(6,908) = 0.04, p = 0.002), and more likely to report having completed college (r(6,759) = 0.04, p = 0.001). However, there was no significant relation between frequency of appearance and the reported SAT score (r(2,920) = −0.00, p = 0.80).

Table B2 shows that the effects of repeated exposure on performance are similar across items (moving left to right within a row). The relation between frequency of appearance and solution rate (moving down within a column) is strongest for bat and ball, followed by widgets, and weakest for lily pads (all three pairwise comparisons, p < .01).

Table B2: individual item solution rates across repeated testing.

Appendix C: response stability

Table C1: Probability (%) of transitioning from wrong to right and from right to wrong.

Table 3 shows that average CRT scores don’t increase much over time, but that could either indicate stability of responses or offsetting response variance (people who got it right forgetting and people who got it wrong improving, in similar proportions). Table C1 differentiates these possibilities by showing the probability of switching from wrong to right, and from right to wrong, at every possible transition. These probabilities are uniformly low, which helps explain why the CRT maintains its predictive validity.
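
The transition probabilities in Table C1 amount to simple conditional frequencies over consecutive trials. A minimal sketch on fabricated 0/1 item histories (the data layout is an assumption, not the authors' actual records):

```python
# Sketch: wrong->right and right->wrong transition rates for one item.
# Each inner list is one subject's 0/1 (wrong/right) history across trials.

def transition_rates(sequences):
    """Return (P(right | previously wrong), P(wrong | previously right))."""
    wr = ww = rw = rr = 0
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            if prev == 0:          # was wrong on the preceding trial
                if cur == 1: wr += 1
                else:        ww += 1
            else:                  # was right on the preceding trial
                if cur == 0: rw += 1
                else:        rr += 1
    p_improve = wr / (wr + ww) if (wr + ww) else 0.0
    p_lapse = rw / (rw + rr) if (rw + rr) else 0.0
    return p_improve, p_lapse

seqs = [[0, 0, 1], [1, 1, 1], [0, 0, 0]]
print(transition_rates(seqs))  # prints (0.25, 0.0)
```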

Table C2 differentiates the common or “intuitive” errors of 10, 100, and 24, from other “idiosyncratic” errors.

Table C2: Percentage giving each type of answer on the next trial, conditional on type of answer given on the current trial.

Although those who make intuitive errors (10, 100, 24) sometimes transition to idiosyncratic errors (e.g., 105, 20, 36), and those who make idiosyncratic errors sometimes transition to the correct answers (5, 5, 47), idiosyncratic errors do not appear to function as a gateway to the truth. Of the 265 triplets with an idiosyncratic error in the middle position and a correct answer at the end, just 8% showed the pattern {intuitive→idiosyncratic→correct}, compared with 62% who merely “rediscovered” the truth {correct→idiosyncratic→correct}. Table C3 reproduces the analysis presented above at the item level.
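
The triplet tabulation above can be sketched as follows. The category labels 'I' (intuitive error), 'X' (idiosyncratic error), and 'C' (correct) are illustrative stand-ins, not the authors' coding scheme, and the sequences are fabricated.

```python
# Sketch: counting answer-category triplets whose middle element is an
# idiosyncratic error ('X') and whose final element is correct ('C').
from collections import Counter

def triplet_patterns(answer_seqs):
    """answer_seqs: per-subject lists of category labels across trials.
    Counts each consecutive triplet of the form (_, 'X', 'C')."""
    counts = Counter()
    for seq in answer_seqs:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            if b == "X" and c == "C":
                counts[(a, b, c)] += 1
    return counts

seqs = [["I", "X", "C"], ["C", "X", "C"], ["C", "X", "C"]]
print(triplet_patterns(seqs))
# One {intuitive->idiosyncratic->correct} vs. two {correct->idiosyncratic->correct}
```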

Table C3: Percentage giving each type of answer on the next trial, conditioned on type of answer given on the current trial.

Appendix D: people who initially solve more other items are more likely to improve

The main text asserts that more reflective individuals are more likely to improve CRT performance with further exposure. For each CRT problem, Table D1 selects participants who initially got that problem wrong, separates them by their initial performance on the other CRT problems, and shows their rate of improvement with further exposure. In all cases, those who initially get more of the other items correct are more likely to improve.

Table D1: % solving each CRT problem after missing it on 1st try (among those appearing three or more times in Fall 2014 series)

For each CRT problem, Table D2 selects subjects who initially got that problem right, separates them by their initial performance on the other CRT problems, and shows their rate of decrement with further exposure. In all cases, those who initially do better on the other problems are less likely to get worse.

Table D2: % continuing to solve each CRT problem after solving it on 1st try (among those appearing three or more times in Fall 2014 series)

Table D3 makes linear assumptions on the rate of improvement and the change in rate of improvement to estimate the overall relation between rate of improvement and initial performance on other items for each of the three items. For all three items, people who initially get a given problem wrong are more likely to get it right later if they initially got other problems right.

Table D3: Probit estimates of the relation between initial performance on other items and rate of performance increase among those who initially got the target problem wrong (with standard errors).

Table D4 performs the same analysis among those who initially got each item right. It shows mixed results. For two out of the three items, better initial performance on other problems predicts a better chance of continuing to get the target problem correct. For the third problem, this relation reverses, but does not attain statistical significance.

Table D4: Probit estimates of the relation between initial performance on other items and rate of performance decrease among those who initially got the target problem right (with standard errors).

Appendix E: people who continue to spend time are more likely to improve

The main text reports a strong relation between score improvement and the log of time spent on subsequent exposure (r(7,487) = 0.21). It also mentions that this does not reflect an underlying positive relation between time spent on the CRT and performance. In fact, that relation is negative, both overall (r(14,272) = −0.14), and excluding first observations (r(7,433) = −0.12). Further, the relation between score improvement and time spent on subsequent exposures is robust to controls for initial time spent (partial r(7,450) = 0.19).

We can distinguish two models of improvement in CRT score with repeat exposure: 1) between exposures, respondents encounter the answers in their daily lives, and 2) during each exposure, respondents think about the problems a little more. One crude test to distinguish between these two models asks whether score improvements are best explained by total weeks elapsed between exposures or by total minutes elapsed during exposures.

Table E presents the results of this test: specifically, the expected score improvement (current score minus initial score) with each doubling of each independent variable. The constant in column 1 shows that one minute of additional reflection is associated with a score increase of about 0.15 items, and each doubling of that time adds an additional 0.10 items correct, so we would expect a respondent’s score to exceed their initial score by 0.25 items after 2 minutes of time spent on re-exposure, by 0.35 after 4 minutes, and so on. Column 2 presents the relation with weeks elapsed between exposures. It shows that we should expect scores to increase by 0.13 items correct when re-exposed one week after initial exposure, but only by another 0.03 for each doubling of that time, so that two weeks since initial exposure predicts a 0.16 item score increase and 4 weeks predicts a score increase of 0.19 items. Finally, column 3 models score improvement by number of previous exposures, as we do in our primary analysis. It shows that we should expect scores to increase by 0.09 items on first re-exposure, but only by 0.01 additional items for each doubling of exposures, such that 2 additional exposures predict a 0.10 item increase, whereas 4 additional exposures predict an increase of just 0.11 items.

Table E: OLS estimates of change in CRT with doublings of three variables (with standard errors). Dependent variable = current CRT score minus initial CRT score.

One simple way to compare these models is by the percentage of variation in score change that they can explain. The R² of the “minutes spent” model is more than ten times higher than the R² of the “weeks passed” model. And the R² of the weeks passed model is itself almost ten times higher than the R² of column 3’s “pure exposure” model. Another way to compare these models is to hold each constant and ask whether orthogonal variation in the other explains significant variation in the criterion. Columns 4 through 6 show that the coefficient relating score change to time spent remains stable when controlling for weeks passed, but that the coefficient on weeks passed falls and even flips sign when controlling for time spent.
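
The model comparison above (regressing score change on the log of each candidate predictor and comparing R²) can be illustrated with a small sketch. The numbers below are fabricated to mimic the pattern in Table E; they are not taken from the data.

```python
# Sketch: compare how well log2(minutes spent) vs. log2(weeks elapsed)
# explains score change, via univariate OLS R^2. Data are fabricated.
import math

def simple_ols_r2(x, y):
    """R^2 of the univariate OLS fit y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

minutes = [1, 2, 4, 8, 16]                     # cumulative minutes of re-exposure
weeks = [1, 3, 2, 5, 4]                        # weeks elapsed between exposures
change = [0.15, 0.25, 0.35, 0.45, 0.55]        # rises ~0.10 per doubling of minutes

r2_minutes = simple_ols_r2([math.log2(m) for m in minutes], change)
r2_weeks = simple_ols_r2([math.log2(w) for w in weeks], change)
print(r2_minutes > r2_weeks)  # prints True for this fabricated example
```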

Appendix F: transfer of learning to modified CRT

If score improvements betoken continued reflection, subjects who improve on the test might not only learn the answers to these items, but also acquire the concepts required to solve them. We test that prediction by examining how exposure to the standard CRT during the Fall of 2014 affects performance on a modified CRT (Table F, leftmost column) that 4,670 subjects encountered during the Winter of 2015. Initial scores on the modified CRT were higher among the 1,610 subjects who had previously been exposed to the standard CRT than among the 3,060 who hadn’t (1.61 vs. 1.35, p < 0.001). Further, among the 1,028 subjects who were exposed to the standard CRT multiple times, improvement over the course of exposures predicts modified score over and above initial score (partial r = 0.44, p < 0.001), and modified score is better predicted by final standard score than by initial standard score (r(1,028) = 0.80 vs. r = 0.76, p < 0.001). This confirms that the modest effects of repeat exposure go beyond rote memorization of answers, and, in conjunction with the response time evidence, suggests that cognitive reflection may be captured as well by final score as by initial score. Table F presents item level results.

Table F: Effects of exposure to standard CRT on initial modified CRT score in Winter of 2015.

Appendix G: SAT scores

Self-reported SAT score is the sum of self-reported quantitative and verbal sub-scores. The distribution is presented in Figure G. Verbal and quantitative sub-scores correlate strongly with each other (r = 0.51), and each is significantly related to CRT. Quantitative scores correlate somewhat more strongly (r = 0.37) than verbal scores (r = 0.21), but verbal scores are a significant predictor of CRT even after controlling for quantitative score.
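
The claim that verbal scores predict CRT after controlling for quantitative score rests on a partial correlation. The first-order partial correlation formula can be sketched as below; the scores here are fabricated for illustration only.

```python
# Sketch: Pearson and first-order partial correlation, as used to check
# whether verbal SAT predicts CRT after controlling for quantitative SAT.
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Fabricated example values, not the study's data.
verbal = [500, 600, 550, 700, 650]
quant = [520, 640, 560, 720, 600]
crt = [1, 2, 1, 3, 3]
print(partial_corr(verbal, crt, quant))
```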

Figure G: Histogram of self-reported SAT scores.

The main text reports that SAT scores are just as well explained by Fall 2014 initial CRT scores as by Fall 2014 final CRT scores (r(1,405) = 0.38 vs. 0.36), and that final CRT adds incremental predictive validity over and above initial CRT score (partial r(1,404) = 0.08, p = 0.002).

Only 45% of those who appeared more than once in our study reported the same SAT score every time. While a few of the other 55% may have taken the SAT again in the interim and are reporting their latest score, for most, the variation reflects imperfect memory or insincere responding. In any case, our aforementioned finding that the relation between CRT and SAT is equally strong whether respondents are seeing the CRT for the first time or the nth time is essentially unchanged whether we just average the reported SAT scores (as we do above) or exclude the 55% who did not report the same score every time we asked them (r(653) = 0.39 vs. 0.36). However, although the partial correlation between final CRT and SAT after controlling for initial CRT does not change very much, it falls to insignificance in this smaller sample (partial r(652) = 0.05, p = 0.203). If we restrict our exclusions to respondents who report very different scores (a standard deviation greater than 100), we again find no significant decrease in the relation between SAT and CRT (r(1084) = 0.37 vs. 0.36), and we confirm the full-sample finding that final CRT score adds significant incremental validity over and above initial CRT score (partial r(1,083) = 0.09, p = 0.002).

Table G1 takes a different approach. It estimates the correlation between CRT score and an individual’s average reported SAT score, separately for each number of previous exposures within the study. A glance left-to-right within each row shows that there is no obvious decline in the CRT’s predictive validity.

Table G1: Correlations between CRT and average reported SAT across repeated testing.

Table G2 formalizes this ocular analysis: it estimates the average change in the relation between mean reported SAT and CRT with each repeated exposure. Column 1 presents the univariate regression, which estimates an average SAT score of 1137 among those scoring 0 on the CRT, and a 55 point increase for every additional CRT item solved. Column 2 adds non-parametric controls for number of times a subject appears in our data and the interaction between that control and CRT score. These controls are the equivalent of breaking the table into separate rows by total number of appearances in our data. They distinguish time-invariant covariates of frequent participation from effects of previous exposure. Column 3 adds number of previous exposures and the interaction between CRT score and the number of previous exposures. The interaction coefficient (0.3) estimates the average change in the relation between CRT and SAT with each additional exposure. It is small relative to the average relation (55), and statistically indistinguishable from 0. Further, comparing R² between model 2 and model 3 shows that allowing the relation between CRT and SAT to vary with previous exposure did not improve model fit.

Table G2: OLS estimates of the effect of previous exposure on the relation between CRT score and SAT score (dependent variable). Standard errors in parentheses.

Even if the CRT continues to measure the same underlying trait, such that uniform prior exposure has no effect on its predictive validity, heterogeneous prior exposure could still be corrosive: test scores alone would not distinguish attaining a given score on the first try from attaining the same score with the slight benefit of prior exposure. However, this effect is trivial. When we demean CRT scores by level of prior exposure, their ability to predict SAT scores barely increases (r = 0.34 vs. 0.33).
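Demeaning by level of prior exposure amounts to subtracting each exposure group's mean score, so that respondents are compared only against others with the same exposure history. A toy sketch (the scores below are invented):

```python
import numpy as np

def demean_by_group(scores, groups):
    """Subtract each group's mean score, removing any average benefit of
    prior exposure before scores are correlated with an outcome."""
    out = scores.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        out[mask] -= out[mask].mean()
    return out

crt = np.array([0, 1, 2, 1, 2, 3, 2, 3, 3], dtype=float)
exposure = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # previous exposures
adjusted = demean_by_group(crt, exposure)          # each group now averages 0
```

After this adjustment, any remaining variation in scores is within-exposure-level variation, so correlations with SAT cannot be inflated or deflated by differences in exposure.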

Appendix H: Raven’s and Linda

Table H1: Correlations with CRT score at four different points, among the subset of subjects who appeared in both the Spring and Fall 2014 studies; t-statistics compare each correlation to the Spring 2014 pre-feedback correlation.

Our studies in Spring 2014 included two other cognitive tests: a six-item battery of Raven’s Advanced Progressive Matrices (Raven, 1941) and Tversky and Kahneman’s “Linda” problem (Tversky & Kahneman, 1983). Raven’s Advanced Progressive Matrices are a pattern-matching task meant to assess fluid intelligence. The Linda problem presents subjects with a description of a woman who seems like a feminist and asks whether she is more likely to be a feminist bank teller or just a bank teller (whether or not she is a feminist). Many respondents commit the “conjunction fallacy” by choosing feminist bank teller over bank teller, thereby implying that the joint occurrence of two possibilities is more likely than one of those possibilities alone.
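The fallacy follows directly from the conjunction rule: every feminist bank teller is also a bank teller, so the conjunction can never be the more probable option. A small simulation makes the point; the proportions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
feminist = rng.random(n) < 0.80    # the "Linda" description suggests a high rate
teller = rng.random(n) < 0.05      # bank tellers are rare

p_teller = teller.mean()
p_feminist_teller = (feminist & teller).mean()

# The conjunction picks out a subset of the bank tellers, so its
# probability cannot exceed P(teller), whatever the proportions.
assert p_feminist_teller <= p_teller
```

The inequality holds for any choice of proportions, which is why choosing "feminist bank teller" is an error regardless of how feminist Linda seems.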

The main text reports that final CRT score predicts Raven’s and Linda as well as initial CRT score (Raven’s: r(317) = 0.45 vs. 0.43; Linda: r(238) = 0.13 vs. 0.15). After controlling for initial score, the change in CRT is itself a significant predictor of Raven’s score (partial r = 0.20, p < 0.01), but not of correct responses to the Linda problem (partial r = −0.01, p = 0.90).

We rely exclusively on the (relatively small) overlap between the Fall 2013 and Spring 2014 samples because respondents in the Spring of 2014 (when Linda and Raven’s were administered) received feedback immediately after completing the CRT (i.e., that the answers were not 10, not 100, and not 24), creating a confound between the effect of that feedback, and the effect of any further exposure to the CRT. Table H1 ignores this confound, and examines the larger overlap between the Spring of 2014 and Fall of 2014 samples. It reports the relation with CRT score at four points in time: before feedback in the Spring, after feedback in the Spring, on first exposure in the Fall, and on final exposure in the Fall. It shows some evidence that repeated exposure with feedback reduces the CRT’s ability to predict Linda, but no evidence that it reduces its ability to predict Raven’s score.

Table H2 isolates the unique predictive contribution of each of the four CRT exposures. CRT scores appear to explain unique variation in Raven’s score on every elicitation, but only the pre-feedback CRT score explains significant unique variation in solving the Linda problem. See Meyer and Frederick (2018) for further discussion of the effect of invalidating the intuitive errors on the CRT’s predictive validity.

Table H2: OLS estimates of the partial contribution of each CRT exposure after feedback. Standard errors in parentheses.

Appendix I: generalizability

The small effects of repeated exposure are not unique to the CRT, nor to the MTurk environment. The Spring 2014 series collected 6,843 responses from 5,191 unique MTurkers and found that the probability of avoiding the conjunction fallacy in the Linda problem increased by just 2.1% per exposure, while Raven’s scores increased by just 0.035 items per exposure (out of a possible score of 6). In the common units of standard deviations per exposure, these two tests show repeat-exposure effects similar to the CRT’s: 0.046 for Linda, 0.021 for Raven’s, and 0.020 for the CRT. Table I1 shows these longitudinal effects on average performance.
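Putting the three tests on the common standard-deviation scale just divides each raw per-exposure gain by that measure's standard deviation. The SDs below are the values implied by the ratios reported above (e.g., 0.024 CRT items at 0.020 SDs implies an SD of 1.2 items); they are back-of-envelope figures, not independently reported statistics:

```python
# Raw per-exposure gains and the (implied) standard deviation of each measure.
raw_gain   = {"linda": 0.021, "ravens": 0.035, "crt": 0.024}
implied_sd = {"linda": 0.456, "ravens": 1.667, "crt": 1.2}

# Per-exposure gains expressed in standard-deviation units.
sd_units = {k: raw_gain[k] / implied_sd[k] for k in raw_gain}
# roughly 0.046 (Linda), 0.021 (Raven's), 0.020 (CRT)
```

The conversion is what allows a binary measure (Linda), a 6-item battery (Raven's), and a 3-item test (CRT) to be compared on one scale.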

Table I1: % of Raven’s matrices correct and % avoiding conjunction fallacy in Linda problem.

Table I2 replicates our primary analysis of change in average CRT score across repeated testing, but for administrations of the modified CRT during the Winter 2015 series (see Table F in Appendix F for the modified CRT materials). It replicates the small effects of repeat exposure that we find on the standard CRT in the Fall 2014 series.

Table I2: Mean scores on modified CRT and geometric mean seconds to respond.

Further, the small effects of repeated exposure appear to generalize beyond MTurk. Although we didn’t track individual subjects over time, we observed scores from 23 successive administrations of the CRT to a total of 1,454 students on the University of Michigan campus and see no evidence that aggregate scores improved there (see Figure I). Similarly, Brañas-Garza, Kujal, and Lenkei (2015) examine 118 administrations of the CRT and, when they exclude MTurk studies, find no statistically significant increase in solution rates from 2005 to 2014.

Figure I: University of Michigan CRT scores across repeated testing.

Although we have no reason to believe that the CRT is unique among cognitive tests or that MTurk is unique among experimental settings, important differences become apparent outside of experimental settings. In a meta-analysis of repeated exposure to tests used in organizational and educational settings, Hausknecht et al. (2007) find effects ten times larger (0.21 standard deviations per exposure; see footnote 10). The “discrepancy” between the tiny effects we observe and the modest effects observed elsewhere may be due partly to mean reversion, as test takers in these other contexts are particularly likely to retake tests when they underperform expectations. Two other likely contributors are higher performance incentives and explicit feedback after every exposure.

Although respondents don’t typically get feedback about their CRT performance, 1,298 people who participated in the Fall 2014 surveys had previously participated in the Spring 2014 surveys, which included a version of the CRT with partial feedback. Specifically, after those subjects responded to the CRT, they were told the most common errors on each problem (i.e., that the answers were not 10, not 100, and not 24) and received an opportunity to revise their responses. This feedback increased scores from 1.30 to 1.66 among those we never saw again, and from 1.67 to 1.93 among those who returned for our fall study, where they averaged 2.07 items correct on first appearance. Thus, previous exposure with feedback (combined with the demand for an intervening response and a long delay) caused a 0.40 item increase, much larger than the 0.024 item average increase without feedback. Table I3 reproduces Table 3’s analysis of previous-exposure effects in the Fall 2014 series, separately for those who had [and had not] previously participated in the Spring 2014 study that provided feedback.

Table I3: Mean CRT scores among those previously told that the intuitive answers are wrong, and among everybody else.

The 5,612 respondents who hadn’t appeared in the Spring 2014 series improved their CRT scores by about 0.037 items correct per exposure during the Fall 2014 series (standard error = 0.003), while the 1,298 respondents who had appeared in the Spring 2014 series (which told them what the answers were not) improved by only about 0.007 items correct per exposure (standard error = 0.003). The data shown in Table 3 are the composite of these two groups.
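As a rough check, the pooled slope behind Table 3 should sit between the two group slopes, near their sample-size-weighted average. The weighting is only approximate, since the exact pooled OLS estimate also depends on each group's distribution of exposures:

```python
n_naive, slope_naive = 5612, 0.037   # never saw the Spring 2014 feedback
n_fed, slope_fed = 1298, 0.007       # previously received feedback

# Sample-size-weighted average of the two per-exposure slopes,
# about 0.031 items correct per exposure.
composite = (n_naive * slope_naive + n_fed * slope_fed) / (n_naive + n_fed)
```

Because the feedback group is the smaller of the two, the composite lands much closer to the feedback-naive slope.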

Footnotes

For comments, we thank Maya Bar-Hillel, Jonathan Baron, Eric Bradlow, Chris Chabris, Zoe Chance, Lee Follis, Alex Fulmer, Reid Hastie, Ryan Hauser, Dan Kahan, Daniel Kahneman, Jin Kim, Amanda Levis, Steve Malliaris, Hillary Parent, Kariyushi Rao, Taly Reich and Daniel Read.

Results reported here are supported by those reported in this issue by Stagnaro, Pennycook & Rand (2018).

1 In addition to the selection forces we describe, reverse causation is also possible. For example, problems that are easy to solve may feel more familiar, and participants experiencing persistent difficulties may explain them away by invoking problem novelty.

2 We estimate the repeat exposure effect by regressing CRT score against number of previous exposures with a non-parametric control for total number of appearances in the data. We estimate the self-report “effect” by regressing CRT score against percentage of items reported seen before. Both regressions are ordinary least squares.

The modest improvement across successive trials within our study likely exaggerates the effect of repeated exposure to the CRT, because some of these subjects probably encountered it in other studies between their nth and n+1st exposures in our study.

3 We emphasize that time spent on subsequent exposure predicts improvement in CRT score. The underlying relation between time spent on the CRT and CRT score is actually negative in these data.

4 Of the 14,500 responses in this survey, 7,339 included SAT scores for both subject tests. To identify and omit spurious reports, respondents were not informed that scores range from 200 to 800, and we deleted 1,135 score reports that fell outside of that range. If individuals reported legitimate but different SAT scores on different occasions, we averaged them. After this kind of cleaning, self-reported SAT scores typically correlate very highly with actual SAT scores (Kuncel, Crede & Thomas, 2005).

5 Using modified CRT items that subjects had not seen before, Baron and co-authors (2015) report that both CRT score and CRT response time can be used to diagnose reflective tendencies. We worry that response times may be less robust to prior exposure than scores, because repeated exposure has a negligible effect on scores but a massive effect on response times. Even upon first exposure to the CRT in our data, response times appear to add no incremental validity, beyond the scores themselves, for predicting performance on the SAT, Raven’s, or the Linda problem.

6 Though useful as a thought experiment, this event is extremely rare in our Fall 2014 series: of the 2,022 instances in which someone scored 0 out of 3 and returned to take the test again, only 48 got a perfect score the next time.

7 Repetition of a test is just one of many factors that could affect performance. One could encourage people to take their time, warn them that the test is more difficult than it appears, tell them what the answers are not (see, e.g., Meyer & Frederick, 2018), and so on. Any of these other variables could also increase or reduce the test’s predictive validity, depending on the population sampled and the other construct(s) of interest.

8 However, one can still ask whether the CRT measures what it was originally intended to (the “organic” or innate disposition to stop and think). This cannot be answered solely by appealing to data of the usual sort, since the construct(s) measured by a psychological instrument could shift over time without affecting test scores. As a thought experiment, suppose that people used to rate the quality of their relationships by considering how close they felt with their parents, but now do so by considering the quantity of recent sexual experiences. This shift in the meaning of responses could occur with no changes in the responses themselves nor their covariation with other traits of interest (e.g., amount of drinking or frequency of suicidal thoughts).

9 You can also use this table to compare the effect of repeat exposure on CRT “virgins” to the effect of repeat exposure on others. CRT virgins improve by 0.066 items correct per exposure. Others improve by 0.023 items per exposure.

10 Note that this is still much smaller than the 0.72 standard deviations per exposure that self-reports of CRT familiarity imply.

References

Baron, J., Scott, S., Fincher, K., & Metz, S. E. (2015). Why does the cognitive reflection test (sometimes) predict utilitarian moral judgment (and other things)? Journal of Applied Research in Memory and Cognition, 4(3), 265–284.
Bialek, M., & Pennycook, G. (2017). The Cognitive Reflection Test is robust to multiple exposures. Behavior Research Methods, 1–7.
Brañas-Garza, P., Kujal, P., & Lenkei, B. (2015). Cognitive Reflection Test: whom, how, when (No. 68049). University Library of Munich, Germany.
Campitelli, G., & Labollita, M. (2010). Correlations of cognitive reflection with judgments and choices. Judgment and Decision Making, 5(3), 182–191.
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130. http://dx.doi.org/10.3758/s13428-013-0365-7
Evans, J. S. B., & Stanovich, K. E. (2013). Dual-process theories of higher cognition: Advancing the debate. Perspectives on Psychological Science, 8(3), 223–241.
Fernbach, P. M., Sloman, S. A., Louis, R. S., & Shube, J. N. (2012). Explanation fiends and foes: How mechanistic detail determines understanding and preference. Journal of Consumer Research, 39(5), 1115–1131.
Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25–42.
Haigh, M. (2016). Has the standard Cognitive Reflection Test become a victim of its own success? Advances in Cognitive Psychology, 12(3), 145–149. http://dx.doi.org/10.5709/acp-0193-5
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92(2), 373–385.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In Heuristics and biases: The psychology of intuitive judgment (pp. 49–81).
Kuncel, N. R., Crede, M., & Thomas, L. L. (2005). The validity of self-reported grade point averages, class ranks, and test scores: A meta-analysis and review of the literature. Review of Educational Research, 75, 63–82.
Lubinski, D., & Humphreys, L. G. (1990). A broadly based analysis of mathematical giftedness. Intelligence, 14(3), 327–355.
Meyer, A., Frederick, S., Burnham, T. C., Guevara Pinto, J. D., Boyer, T. W., Ball, L. J., ... & Schuldt, J. P. (2015). Disfluent fonts don’t help people solve math problems. Journal of Experimental Psychology: General, 144(2), e16–e30. http://dx.doi.org/10.1037/xge0000049
Meyer, A., & Frederick, S. (2018). The bat and ball problem. Unpublished manuscript.
Pennycook, G., Cheyne, J. A., Seli, P., Koehler, D. J., & Fugelsang, J. A. (2012). Analytic cognitive style predicts religious and paranormal belief. Cognition, 123(3), 335–346.
Pennycook, G., & Rand, D. G. (2018). Who falls for fake news? The roles of bullshit receptivity, overclaiming, familiarity, and analytic thinking. Available at SSRN: https://ssrn.com/abstract=3023545
Rand, D. G., Peysakhovich, A., Kraft-Todd, G. T., Newman, G. E., Wurzbacher, O., Nowak, M. A., & Greene, J. D. (2014). Social heuristics shape intuitive cooperation. Nature Communications, 5(3677), 1–12.
Raven, J. C. (1941). Standardization of progressive matrices, 1938. Psychology and Psychotherapy: Theory, Research and Practice, 19(1), 137–150.
Shenhav, A., Rand, D. G., & Greene, J. D. (2012). Divine intuition: Cognitive style influences belief in God. Journal of Experimental Psychology: General, 141(3), 423–428.
Shenhav, A., Rand, D. G., & Greene, J. D. (2017). The relationship between intertemporal choice and following the path of least resistance across choices, preferences, and beliefs. Judgment and Decision Making, 12(1), 1–18.
Stagnaro, M. N., Pennycook, G., & Rand, D. G. (2018). Performance on the Cognitive Reflection Test is stable across time. Judgment and Decision Making, 13(3), 260–267.
Stieger, S., & Reips, U. D. (2016). A limitation of the Cognitive Reflection Test: Familiarity. PeerJ, 4, e2395.
Thomson, K. S., & Oppenheimer, D. M. (2016). Investigating an alternate form of the cognitive reflection test. Judgment and Decision Making, 11(1), 99–113.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90(4), 293–315.
Unsworth, N. (2010). On the division of working memory and long-term memory and their relation to intelligence: A latent variable approach. Acta Psychologica, 134(1), 16–28.