Debiasing System 1: Training favours logical over stereotypical intuiting

Esther Boissin; Serge Caparos; Aikaterini Voudouri; Wim De Neys

doi:10.1017/S1930297500008895

Published online by Cambridge University Press: 01 January 2023

Esther Boissin*: Université Paris Cité, LaPsyDÉ, CNRS, F-75005, Paris, France
Serge Caparos: Université Paris 8, DysCo lab, Saint-Denis, France; Institut Universitaire de France, Paris, France
Aikaterini Voudouri: Université Paris Cité, LaPsyDÉ, CNRS, F-75005, Paris, France
Wim De Neys: Université Paris Cité, LaPsyDÉ, CNRS, F-75005, Paris, France
*Email: [email protected]

Abstract

Whereas people’s reasoning is often biased by intuitive stereotypical associations, recent debiasing studies suggest that performance can be boosted by short training interventions that stress the underlying problem logic. The nature of this training effect remains unclear. Does training help participants correct erroneous stereotypical intuitions through deliberation? Or does it help them develop correct intuitions? We addressed this issue in four studies with base-rate neglect and conjunction fallacy problems. We used a two-response paradigm in which participants first gave an initial intuitive response, under time pressure and cognitive load, and then gave a final response after deliberation. Studies 1A and 2A showed that training boosted performance and did so as early as the intuitive stage. After training, most participants solved the problems correctly from the outset and no longer needed to correct an initial incorrect answer through deliberation. Studies 1B and 2B indicated that this sound intuiting persisted over at least two months. The findings confirm that a short training can debias reasoning at an intuitive “System 1” stage and get reasoners to favour logical over stereotypical intuitions.

Keywords

reasoning; heuristics and biases; debiasing; intuition

Type: Research Article
Information: Judgment and Decision Making, Volume 17, Issue 4, July 2022, pp. 646–690

DOI: https://doi.org/10.1017/S1930297500008895
Creative Commons: The authors license this article under the terms of the Creative Commons Attribution 4.0 License.
Copyright: Copyright © The Authors [2022] This is an Open Access article, distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Although as human beings we have exceptional capacities to reason, we do not always reason correctly. Imagine, for example, you are told there is an event with 1000 people. Of the 1000 attendees, 995 are I.T. technicians and 5 are professional boxers. You know that one person (“Person X”) was drawn randomly from all attendees. Next, you are informed that this person is described as “strong”. What do you think is most likely now: Is Person X an I.T. technician or a professional boxer? Intuitively, many people will tend to say that Person X is a professional boxer based on stored stereotypical associations cued by the descriptive information (“Professional boxers are strong”). If your only piece of information were the description of the person, that might be a fair guess. In general, there might be more professional boxers than I.T. technicians who are strong. However, there are also strong I.T. technicians, and you were explicitly told that there were far more I.T. technicians than professional boxers in the sample from which Person X was drawn. If you take this extreme base-rate information into account, this should tip the scale to the “I.T. technicians” side. Yet untrained people typically neglect the base-rate principle and opt for the intuitive response that is cued by their stereotypical prior beliefs (e.g., Kahneman, 2011).

Decades of reasoning and decision-making research have shown that similar intuitive thinking is biasing people’s reasoning in a wide range of situations and tasks (Evans, 2008; Evans & Over, 1996; Kahneman & Frederick, 2002; Kahneman & Tversky, 1973). In general, this literature indicates that human reasoners have a strong tendency to base their inferences on fast intuitive impressions rather than on more demanding, deliberative reasoning. In and by itself, this intuitive or so-called “heuristic” thinking can be useful because it is fast and effortless and can often provide valid problem solutions. However, our intuitions can sometimes cue responses that conflict with more logical or probabilistic principles. As the base-rate example illustrates, relying on mere intuitive thinking will bias our reasoning in that case (Evans, 2003, 2010; Kahneman, 2011; Stanovich & West, 2000).

Cognitive scientists have long been trying to remediate people’s biased thinking and get them to reason correctly (e.g., Lilienfeld et al., 2009; Milkman et al., 2009; Nisbett, 1993). A number of recent studies have been especially successful in this respect (e.g., Boissin et al., 2021; Claidière et al., 2017; Hoover & Healy, 2017; Morewedge et al., 2015; Purcell et al., 2020; Trouche et al., 2014). These “debiasing” studies have shown that a short single-shot explanation about the intuitive bias and correct solution strategy often helps reasoners produce a correct response. Once the problem has been properly explained to reasoners, they manage to solve structurally similar problems afterward.

Such training effects are promising. However, the nature of the training effect is currently not clear. A key question is whether the training primarily affects people’s intuitive or deliberate thinking (or in popular dual-process terms, their fast “System 1” or slow “System 2”, e.g., Kahneman, 2011). The common assumption is that after training, participants will be more likely to deliberate properly (i.e., to engage their “System 2”) and correct the intuitively generated heuristic response (e.g., Evans, 2019; Lilienfeld et al., 2009; Milkman et al., 2009). This assumption fits with the general dual-process idea that the deliberate “System 2” primarily serves to correct the intuitive “System 1” (Kahneman, 2011).

However, in theory, it is also possible that once reasoners grasp the solution, they will no longer generate an incorrect intuitive response. Instead, their initial intuitive response would often be sufficient to generate a correct response, without the need for a corrective “System 2” deliberation process. At a general level, this “training sound intuiting” idea can be likened to expertise development (e.g., Hogarth, 2010; Larrick & Feiler, 2015; Kahneman & Klein, 2009) in which the goal is also to turn processes that initially require effortful deliberate “System 2” processing into intuitive “System 1” processes (e.g., Larrick & Feiler, 2015).¹

If debiasing training actually helps people intuit correctly, this would have important implications (Boissin et al., 2021). Although it can be laudable to help people deliberate more, in many daily life situations they will simply not have the time (or resources/motivation) to deliberate. Hence, if debiasing interventions help people only to deliberately correct erroneous intuitions, their impact may be suboptimal. Ultimately, we do not only want people to learn to correct erroneous intuitions but to avoid biased intuitions altogether (Evans, 2019; Milkman et al., 2009; Reyna et al., 2015; Stanovich, 2018). The potential benefits of training sound intuition are considerable in this respect.

Recent evidence provided some support for the “trained intuitor” point of view. Boissin et al. (2021) trained their participants on versions of the notorious bat-and-ball problem (i.e., “A bat and a ball together cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?”; typical incorrect response: 10 cents, correct response: 5 cents) with a short debias training in which the correct solution logic was explained with a number of practice problems. Critically, they tested participants’ reasoning accuracy before and after the training with a two-response paradigm (Thompson et al., 2011). In this paradigm, participants are asked to give two consecutive responses to a reasoning problem. First, they have to respond as fast as possible with the first intuitive hunch that comes to mind. Next, they can take all the time they want to reflect on the problem and give a final response. To make maximally sure that the initial response is generated intuitively, the response needs to be given under time pressure and/or cognitive load (Bago & De Neys, 2017; Newman et al., 2017). This paradigm makes it possible to measure the training impact on people’s intuitive (initial response accuracy) and deliberate reasoning performance (final response accuracy). In line with previous training studies (Claidière et al., 2017; Hoover & Healy, 2017; Purcell et al., 2020; Trouche et al., 2014), most people solved the bat-and-ball problem correctly after training, when they were allowed to deliberate. However, the key finding was that, for most previously biased reasoners, after training their initial, intuitive responses were already correct. The Boissin et al. (2021) findings consequently lend credence to the claim that training can help reasoners switch from biased to sound intuiting.

However, the study was but the first of its kind and focused on one single reasoning problem. Given the importance of the potential applied and theoretical implications, further validation is needed. This is especially crucial since the specific focus on the bat-and-ball problem might have distorted the training findings. That is, although the bat-and-ball problem is a popular study object and people show massive bias when solving it, in one critical sense it is also somewhat atypical. In the bat-and-ball problem, the cued erroneous intuitive response is itself based on logico-mathematical knowledge (i.e., people arrive at 10 cents because they simply subtract $1 from the total $1.10 instead of applying the following correct equation: “$1 + 2x = $1.10”; e.g., De Neys et al., 2013; Kahneman, 2011; Morewedge & Kahneman, 2010).
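The equation quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original study materials; it works in whole cents to avoid floating-point rounding, and the function name `ball_price_cents` is purely illustrative.

```python
def ball_price_cents(total_cents: int, difference_cents: int) -> int:
    """Solve ball + (ball + difference) = total for the ball's price."""
    # 2 * ball + difference = total  =>  ball = (total - difference) / 2
    return (total_cents - difference_cents) // 2


ball = ball_price_cents(110, 100)
bat = ball + 100
print(ball, bat, ball + bat)  # 5 105 110

# The intuitive answer (10 cents) fails the same check:
# a 10-cent ball implies a 110-cent bat, for a total of 120 cents.
intuitive_total = 10 + (10 + 100)
print(intuitive_total)  # 120
```

The heuristic answer comes from subtracting the price difference from the total, whereas the correct answer requires halving what is left after that subtraction.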

However, most problems from the bias literature cue a conflicting intuitive response that is based on semantic and/or stereotypical background beliefs (e.g., “CEOs are male”) about which people hold (stronger) personal beliefs. This difference might have implications for the success of debiasing interventions. When the bat-and-ball training intervention informs the subject that the answer cannot be 10 cents because otherwise the bat (at a dollar more) would cost $1.10, which makes for a total cost of $1.20, participants will presumably not object to the mathematical fact that $1.10 + $0.10 > $1.10 (i.e., the stated total price). By contrast, it might be far harder for people to discard a cued intuitive response for which they hold personal beliefs than when such beliefs are not at play (e.g., Kaplan et al., 2016; Goel, 2022). Indeed, when people’s personal belief system is challenged, they might be more likely to engage in motivated reasoning or rationalization to protect their beliefs (e.g., Ditto et al., 2019; Kahan, 2016, 2017; Kunda, 1990; Mercier & Sperber, 2011; Pennycook & Rand, 2019). In sum, the striking debiasing results of Boissin et al. (2021) might be limited to a specific problem in which the reasoners’ belief system is not challenged. In the present study we examined the robustness and generality of the intuitive training effect by testing whether it could be replicated with different types of bias problems that evoke an intuitive response based on personal, stereotypical beliefs.

In Study 1A, we focused on the popular base-rate neglect problems (e.g., Kahneman & Tversky, 1973) akin to the opening example in which a stereotypical description can conflict with base-rate information. In Study 2A, we looked at equally (in)famous conjunction fallacy problems (Tversky & Kahneman, 1983) in which a stereotypical description can trick people into violating the elementary conjunction rule (i.e., judging a conjunction of two events as more likely than one of its constituent events because it fits a cued stereotypical association). For each study, we contrasted participants’ reasoning performance with a two-response paradigm before and after a short training session and compared their performance to that of participants who received no training (the control group). In Study 1B (base-rate problems) and 2B (conjunction problems) participants were re-tested two months after the initial training to explore whether the training effect was robust and sustained over time.

2 Study 1A: Base-rate training

2.1 Method

2.1.1 Pre-registration

The study design and research question were preregistered on the Open Science Framework (https://osf.io/674gk/). No specific analyses were preregistered.

2.1.2 Participants

Participants were recruited online, using the Prolific Academic website (http://www.prolific.ac/). Participants had to be native English speakers from Canada, Australia, New Zealand, the United States of America, or the United Kingdom to take part. The same sample size as Boissin et al. (2021) was selected. In total, 101 individuals participated (62 females; mean age M = 31.0 years, SD = 10.7); 50 participants were randomly assigned to the training group and 51 to the control group. In total, 38 participants had secondary school as their highest level of education, and 63 reported a university degree. We compensated participants for their time at the rate of £5 per hour.

2.1.3 Materials

The study consisted of three blocks presented in the following order: a pre-intervention, an intervention, and a post-intervention block. In total, each participant had to solve 12 problems during the pre-intervention block, namely, four conflict, four no-conflict, two neutral and two transfer problems (see below), and again the same number of problems during the post-intervention block. All the problems are presented in the Supplementary Material Section A.

Base rate problems:

Base-rate problems were taken from Bago and De Neys (2017). Participants always received a description of the composition of a sample (e.g., “This study contained I.T. engineers and professional boxers”), base-rate information (e.g., “There were 995 engineers and 5 professional boxers”) and a description that was designed to cue a stereotypical association (e.g., “This person is strong”). Participants’ task was to indicate to which group the person most likely belonged. The task instructions stressed that the person was drawn randomly from the specified sample.

The problem presentation format was based on Pennycook et al.’s (2014) rapid-response paradigm. The base rates and descriptive information were presented serially and the amount of text that was presented on screen was minimized. First, participants received the names of the two groups in the sample (e.g., “This study contains businessmen and firemen”). Next, under the first sentence (which remained on the screen) we presented the descriptive information (e.g., “Person ‘K’ is brave”). The descriptive information specified a neutral name (‘Person K’) and a single-word personality trait (e.g., “brave”) that was designed to trigger the stereotypical association. Finally, participants received the base-rate probabilities. As in Pennycook et al., base rates varied between 995/5, 996/4, and 997/3. The following illustrates the full problem format:

This study contains businessmen and firemen.
Person ‘K’ is brave.
There are 996 businessmen and 4 firemen.
Is Person ‘K’ more likely to be:
- A businessman
- A fireman

Pennycook et al. (2014) pre-tested the material to make sure that words that were selected to cue a stereotypical association consistently did so but avoided extremely diagnostic cues. As Bago and De Neys (2017) clarified, the importance of such a non-extreme and moderate association is not trivial. Note that we label the response that is in line with the base rates as the correct response. Critics of the base-rate task (e.g., Gigerenzer et al., 1988; see also Barbey & Sloman, 2007) have long pointed out that if reasoners adopt a Bayesian approach and combine the base-rate probabilities with the stereotypical description, this can lead to interpretative complications when the description is extremely diagnostic. For example, imagine that we have an item with males and females as the two groups and give the description that Person ‘A’ is ‘pregnant’. Now, in this case, one would always need to conclude that Person ‘A’ is a woman, regardless of the base rates. The more moderate descriptions (such as ‘kind’ or ‘funny’) help to avoid this potential problem. In addition, the extreme base rates (i.e., 997/3, 996/4, 995/5) that were used in the current study further help to guarantee that even a very approximate Bayesian reasoner would need to pick the response cued by the base rates (see De Neys, 2014).
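The Bayesian argument can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions rather than an analysis from the study: the likelihood ratios are hypothetical values chosen for demonstration (the study did not report how diagnostic each trait is), and the function name `posterior_minority` is invented here.

```python
def posterior_minority(n_majority: int, n_minority: int,
                       likelihood_ratio: float) -> float:
    """Posterior probability that the person belongs to the minority group.

    likelihood_ratio = P(description | minority) / P(description | majority).
    """
    prior_odds = n_minority / n_majority
    posterior_odds = prior_odds * likelihood_ratio  # Bayes' rule in odds form
    return posterior_odds / (1 + posterior_odds)


# Hypothetical likelihood ratios for a moderately diagnostic trait.
for lr in (2, 10, 50):
    print(lr, round(posterior_minority(996, 4, lr), 3))

# The description would have to be this many times more typical of the
# minority group before the minority group became the better bet.
break_even = 996 / 4
print(break_even)  # 249.0
```

With 996/4 base rates, even a trait assumed to be 50 times more likely in the smaller group leaves the larger group as the more probable answer. An extremely diagnostic description such as ‘pregnant’ corresponds to a likelihood ratio far above the break-even value, which is why such cues were avoided.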

Note that Pennycook et al. (2014) created the rapid-response base-rate format with a single-word personality trait to reduce reading time (variability) and optimize latency measurement. They showed that the single-word format did not affect accuracy results: People were as biased with their single-word associations as with lengthier descriptions.

In each block, we presented four critical “conflict” items, and four control “no-conflict” items. In the conflict items, the base rate probabilities and the stereotypical information cued conflicting responses (see example above). In the no-conflict items, they both cued the same response (i.e., the description triggered a stereotypical trait of a member of the largest group). The following is an example of a no-conflict problem:

This study contains businessmen and firemen.
Person ‘K’ is brave.
There are 996 firemen and 4 businessmen.
Is Person ‘K’ more likely to be:
- A fireman
- A businessman

These control problems should be easy to solve. If participants are paying minimal attention to the task and refrain from random guessing, they should show high accuracy (Bago & De Neys, 2019).

Two sets of 16 unique items (8 pre-intervention and 8 post-intervention block items) were used for counterbalancing purposes. For each block, the conflict problems in one set were the no-conflict problems in the other, and vice-versa (i.e., the base-rates were reversed). Participants were randomly assigned to one of the two sets. Consequently, none of the pre- and post-intervention problem contents was repeated within-subjects (i.e., participants saw a total of 16 different items with a unique stereotypical association).

Justification:

After the last problem of the post-intervention block, which was always a conflict problem, participants were asked to type in a justification for their final response (see Supplementary Material Section B for further details). As in Boissin et al. (2021), results indicated that most correct responses were correctly justified (training group: 35 correct justifications out of 40 correct responses; control group: 23 correct justifications out of 32 correct responses, see Supplementary Material Section B). Note that the justification was untimed and retrospective. It was collected for exploratory purposes and does not allow us to draw any conclusion regarding the intuitive or deliberate nature of participants’ processing.

Neutral problems:

We also presented two neutral base-rate problems taken from Pennycook et al. (2014). These problems were designed such that they did not cue any stereotypical association (i.e., the descriptive information was not diagnostic). Here is an example of a neutral base-rate problem:

This study contains boys and girls.
Person ‘T’ is young.
There are 4 boys and 996 girls.
Is Person ‘T’ more likely to be:
- a boy
- a girl

The neutral base-rate items are traditionally used to track people’s knowledge of the underlying logical principles or “mindware” (Stanovich, 2011). When allowed to deliberate, reasoners have little trouble solving them (e.g., De Neys & Glumicic, 2008; Frey & De Neys, 2017). The neutral problems allowed us to explore whether a potential learning effect on conflict base-rate problems, in which the reasoner needs to discard a conflicting stereotypical association, leads to a more general performance boost on other untrained base-rate problems.

Transfer problems:

In addition to the base-rate problems, we presented other types of reasoning problems to test whether the “base-rate” training effect could transfer to other untrained problems with a different logical structure. In total, we used two problems taken from the Cognitive Reflection Test 2 (CRT2) based on the “race” problem from Thomson and Oppenheimer (2016), and two conjunction-fallacy problems taken from Frey et al. (2018). We presented one CRT-like and one conjunction-fallacy problem at the end of the pre-intervention block, and again one CRT-like and one conjunction-fallacy problem at the end of the post-intervention block.

Like the base-rate problems, the CRT-like problems are designed to cue a strong biasing heuristic response and consequently show low accuracy rates (Frederick, 2005):

Imagine you’re in a car race. If you pass the car in fifth place,
what place are you in?
- Fourth
- Fifth
- Sixth

Here, the heuristic incorrect response is “fourth place” and the correct response is “fifth place”. The third (“sixth”) response option was used as a filler.

For each of the two conjunction problems, participants were given a short personality description of an individual and were asked to indicate which of two statements was most probable. One statement always consisted of a conjunction of two characteristics (one characteristic that was likely given the description (i.e., a stereotypical association), and one that was unlikely). The other statement contained only the unlikely characteristic. The following illustrates the structure of the conjunction problem:

Jake is 20.
He grew up in a poor family in a neglected neighbourhood.
He is quite violent and already served a short sentence in prison.
Which statement is most likely?
- Jake plays the violin
- Jake plays the violin and is jobless

Given that the conjunction of two events cannot be more likely than each of the constituent events (formally: p(A&B) ≤ p(A)) the correct response was the non-conjunctive statement.
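The conjunction rule follows from the product rule, p(A&B) = p(A) × p(B|A), because p(B|A) can never exceed 1. The sketch below illustrates this with the Jake example; the probability values are hypothetical and chosen only for demonstration, and the function name `conjunction_probability` is illustrative.

```python
def conjunction_probability(p_a: float, p_b_given_a: float) -> float:
    """p(A & B) = p(A) * p(B | A), which can never exceed p(A)."""
    return p_a * p_b_given_a


# Hypothetical values: A = "plays the violin", B = "is jobless".
p_violin = 0.02
p_jobless_given_violin = 0.9  # even a highly likely second characteristic
p_both = conjunction_probability(p_violin, p_jobless_given_violin)

# The conjunction is at most as probable as its unlikely constituent.
print(p_both <= p_violin)  # True
```

However well the added characteristic fits the stereotype, multiplying by a probability of at most 1 can only keep the conjunction's probability the same or lower it.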

Intervention block:

During the intervention block, the participants tried to solve three additional conflict base-rate problems without any cognitive or time constraint. In the training group, participants were given an explanation of the correct solution after having responded to each problem. Participants in the control group received no such explanation. The following example illustrates the explanation:

The correct answer to the previous problem is that person ‘K’ is most likely a “businessman”. Many people think it is “fireman”, but this answer is wrong.

Most people base their answer solely on the description (“Person K is brave”). If this were all information you got, this answer would be correct, as it is likely that there are more brave firemen in the world than brave businessmen.

However, in the problem you also got information about the specific number of businessmen and firemen in the group that person K got drawn from. You were informed that person K was drawn randomly from a group with 996 businessmen and only 4 firemen. Since there are so much more businessmen in the group than firemen (200 times more!), it becomes more likely that person K is a businessman. After all, although firemen might in general be braver than businessmen, there are also some businessmen who are brave. If you combine this with the vastly larger number of businessmen in the group, it will be more plausible that you’re dealing with a brave businessman.

The explanations were based on the same general principles that were adopted by Boissin et al. (2021): The explanations were as brief and simple as possible to prevent fatigue or disengagement from the task. Each explanation explicitly stated both the correct response and the typical incorrect response. To avoid promoting feelings of judgment (Trouche et al., 2014), we gave no personal performance feedback (e.g., “Your answer was wrong”). And to avoid inducing mathematical anxiety, the explanation never mentioned a formal algebraic equation (Hoover & Healy, 2017). Participants moved on to the following screen by clicking on the “Next” button.

Two-response format:

For both the pre- and post-intervention blocks, participants responded to each problem using a two-response procedure, where they first provided a ‘fast’ answer, directly followed by a second ‘slow’ answer (Thompson et al., 2011). This method allowed us to capture both an initial ‘intuitive’ response, and then a final ‘deliberate’ one. To minimize the possibility that deliberation was involved in producing the initial ‘fast’ response, participants had to provide their initial answer within a strict time limit while performing a concurrent cognitive load task (see Bago & De Neys, 2017, 2019; Raoelison & De Neys, 2019). The load task was based on the dot memorization task (Miyake et al., 2001), given that it had been successfully used to burden executive resources during reasoning tasks (e.g., De Neys, 2006; Franssens & De Neys, 2009; Verschueren et al., 2004). Participants had to memorize a complex visual pattern (i.e., 4 crosses in a 3x3 grid) that was presented briefly before each reasoning problem. After their initial (intuitive) response to the problem, participants were shown four different patterns (i.e., with different matrices of crosses) and had to identify the one that they had memorized (see De Neys, 2006, for more details).

For all base-rate problems, a time limit of 3 seconds was chosen for the initial response, based on previous pre-testing that indicated it amounted to the time needed to read the preambles, move the mouse, and click on a response option (Bago & De Neys, 2017, 2019; Raoelison et al., 2020). For the lengthier transfer problems, the time limit was set to 6 seconds. The time limit and cognitive load were applied only for the initial response, and not for the final one (see below).

2.1.4 Procedure

The experiment was run online using the Qualtrics platform. Participants were instructed that the experiment would take 13–15 minutes and that it demanded their full attention. A general description of the task was presented in which participants were instructed that they would read reasoning problems, for which they would have to provide two consecutive responses. They were told that we were interested in the very first, initial answer that came to mind and that, after providing their initial response, they could reflect on the problem and take as much time as they needed to provide a final answer (see Bago & De Neys, 2017, for literal instructions). In order to familiarise themselves with the two-response procedure, they first solved two unrelated practice reasoning problems with a response deadline only. Next, they familiarised themselves with the cognitive load procedure by solving two memorization trials and, finally, they solved the same two reasoning problems as before with the full two-response procedure (i.e., deadline + load on initial response).

Figure 1 shows a typical base-rate trial, which started with the presentation of a fixation cross for 2000ms, followed by the description of the sample (e.g., “This study contains businessmen and firemen”) for 2000ms, and subsequently, by the visual matrix for the cognitive-load task for 2000ms. Afterwards, the descriptive adjective (e.g., “Person ‘K’ is brave”) was presented for 2000ms followed by the full problem which featured the base-rate information (e.g. “There are 996 businessmen and 4 firemen”) and the answer options. At this point participants had 3000ms to choose a response. After 2000ms the background of the screen turned yellow to warn participants that they only had a short amount of time left to answer. If they had not provided an answer before the time limit, they were given a reminder that it is important to provide an answer within the time limit on subsequent trials. Participants were then asked to enter how confident they were with their response (from 0%, absolutely not confident, to 100%, absolutely confident). Then, they were presented with four visual matrices and had to choose the one that they had previously memorized. They received feedback as to whether their memory response was correct. If the answer was not correct, they were reminded that it was important to perform well on the memory task on subsequent trials. Finally, the same reasoning problem was presented again, and participants were asked to provide a final deliberate answer (with no time limit) and, once again, to indicate their confidence level.

Figure 1: Time course of a complete two-response base-rate item.

Note that, given the different nature of the transfer CRT-like and conjunction problems, we adopted a slightly different timing and presentation format for their initial response than for the initial response of the base-rate problems. The problems appeared in two parts. The first part of the conjunction fallacy problems remained on screen for 4000ms, and the first part of the CRT-like problems remained on screen for 2000ms. Then, the visual matrix appeared for 2000ms and next the full problem was shown and remained on screen for 6000ms, during which participants had to select an answer. After 4000ms the background turned yellow to warn participants of the approaching deadline. Unlike for the base-rate problems, confidence ratings were not requested after each response for the transfer CRT-like and conjunction fallacy problems.

At the end of the study, participants in the control group were also presented with the explanations about how to solve the base-rate problems, and all participants were asked to complete their demographic information.

2.1.5 Trial exclusion

We discarded trials in which participants failed to provide their initial answer before the deadline (3.5% of all trials) or failed to pick the correct matrix in the load task (12.9% of the remaining trials), and we analysed the remaining 84.1% of all trials. On average, each participant contributed 20.5 (SEM = 0.6) trials out of 24.
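The retained share follows from applying the two exclusion steps in sequence, because the load-task failures are expressed relative to the trials left after the deadline exclusion. A minimal arithmetic check, using only the percentages reported above (the variable names are illustrative):

```python
# Percentages reported in the text; variable names are illustrative.
missed_deadline = 0.035  # share of all trials without an initial answer in time
failed_load = 0.129      # share of the *remaining* trials with a wrong matrix

# The exclusions are sequential, so the retained shares multiply.
retained = (1 - missed_deadline) * (1 - failed_load)
print(f"Retained share of all trials: {retained:.1%}")
```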

2.2 Results

2.2.1 Base-rate response accuracy

For each participant, we calculated the average proportion of correct initial and final responses for the conflict and no-conflict problems, in each of the two blocks (pre- and post-intervention). We analysed the data using mixed-design ANOVAs with Block (pre- vs post-intervention) as a within-subjects factor and Group (training vs control) as a between-subjects factor.
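In a 2 × 2 mixed design of this kind, the Block × Group interaction can equivalently be tested by comparing the two groups' pre-to-post gain scores with a pooled-variance t-test, since F = t². The sketch below illustrates that equivalence on made-up accuracy values. The data, the function name and the variable names are hypothetical, and this is not the analysis script used for the study.

```python
from statistics import mean, variance

def interaction_f(pre_a, post_a, pre_b, post_b):
    """Block x Group F in a 2x2 mixed design, computed from gain scores (F = t^2)."""
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    df_error = n_a + n_b - 2
    # Pooled variance of the gain scores across the two groups.
    pooled = ((n_a - 1) * variance(gains_a) + (n_b - 1) * variance(gains_b)) / df_error
    t = (mean(gains_a) - mean(gains_b)) / (pooled * (1 / n_a + 1 / n_b)) ** 0.5
    return t ** 2, df_error

# Hypothetical per-participant proportions correct (not data from the study).
training_pre = [0.25, 0.50, 0.00, 0.75]
training_post = [0.75, 1.00, 0.50, 1.00]
control_pre = [0.50, 0.25, 0.75, 0.50]
control_post = [0.50, 0.50, 0.75, 0.75]

f_value, df_error = interaction_f(training_pre, training_post, control_pre, control_post)
print(f"F(1,{df_error}) = {f_value:.2f}")
```

With two groups the error degrees of freedom equal the number of participants minus two, which is consistent with the F(1,99) reported below if 101 participants entered the analysis.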

First, we focus on accuracies for the final responses. Figure 2 shows that accuracy was low before the intervention in both the control and the training group (respectively, M = 59.6%, SEM = 5.6, and M = 53.5%, SEM = 6.1), which is in line with findings showing that many reasoners opt for the incorrect stereotypical response even when they can reflect (i.e., when they have the necessary time and resources; Bago & De Neys, 2017; Raoelison et al., 2020). The overall performance of both groups improved following the intervention; however, the performance increase was larger in the training group (accuracy increase of M = 32.8 points, SEM = 5.5) than in the control group (accuracy increase of M = 10.6 points, SEM = 3.7). The ANOVA showed that the Block × Group interaction was significant (F(1,99) = 11.39, p = .001, η²_g = .02; see Footnote 2).

Figure 2: Average initial and final accuracy on conflict problems in Study 1A (base-rate problems) and 2A (conjunction problems). Error bars represent standard error of the mean (SEM).

To explore whether the training improved people’s intuitive reasoning performance, we repeated the analyses on accuracies for the initial responses. The results were fully consistent (Figure 2). Once again, most reasoners failed to solve the conflict problems before the intervention, both in the control and the training groups (M = 39.6%, SEM = 5.4, and M = 27.0%, SEM = 5.2, respectively), but improved after the intervention. The improvement was larger in the training group (performance increase of M = 48.3 points, SEM = 5.5) than in the control group (performance increase of M = 6.2 points, SEM = 4.4); the Block × Group interaction was again significant (F(1,99) = 36.17, p < .001, η²_g = .07).

In sum, the training intervention helped participants to produce more correct responses. Critically, this improvement was shown not only for final “deliberate” responses, for which participants had time and resources to reflect on their response, but also for initial “intuitive” responses, where deliberation was minimized (see Footnote 3).

Finally, we analysed the performance for the no-conflict control problems. We observed that performance was consistently at ceiling, with grand means of 96.2% (SEM = 0.8) for initial accuracy, and 97.4% (SEM = 0.7) for final accuracy (see Supplementary Material Section C). In line with previous studies (Bago & De Neys, 2020; Pennycook et al., 2015; Raoelison & De Neys, 2019), participants’ high accuracy rates on the no-conflict problems indicated that they were paying attention to the task and refrained from random guessing.

2.2.2 Direction of change

To gain some deeper insight into how people changed (or did not change) their response after deliberation, we performed a direction of change analysis (Bago & De Neys, 2017, 2019). More specifically, on each trial, people could give a correct (‘1’) or incorrect (‘0’) response at each of the two response stages (i.e., initial and final). Hence, this can result in four different types of response patterns on any single trial (“00” pattern, incorrect response at both stages; “11” pattern, correct response at both stages; “01” pattern, initial incorrect and final correct response; “10” pattern, initial correct and final incorrect response).
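The coding scheme just described can be sketched in a few lines. This is a minimal illustration on hypothetical trials; the function names and the representation of a trial as an (initial correct, final correct) pair are our assumptions, not the authors' analysis code.

```python
from collections import Counter

PATTERNS = ("00", "01", "10", "11")

def change_pattern(initial_correct: bool, final_correct: bool) -> str:
    """Code one trial as '00', '01', '10' or '11' (initial digit first)."""
    return f"{int(initial_correct)}{int(final_correct)}"

def pattern_proportions(trials):
    """Proportion of each direction-of-change pattern over (initial, final) pairs."""
    counts = Counter(change_pattern(initial, final) for initial, final in trials)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no analysable trials")
    return {pattern: counts.get(pattern, 0) / total for pattern in PATTERNS}

# Hypothetical conflict trials for one block (not data from the study).
example_trials = [(False, False), (False, True), (True, True), (True, True)]
proportions = pattern_proportions(example_trials)
print(proportions)
```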

Figure 3 plots the direction of change distribution, for the conflict problems, in both the pre- and post-intervention blocks. As the figure shows, in the training group the intervention led to a sharp decrease in “00” patterns (32.5% drop) which was specifically accompanied by an increase in “11” patterns (48.0% rise). These trends were far less pronounced in the control group.

Figure 3: Proportion of each direction of change (i.e., 00 response patterns, 01 response patterns, 10 response patterns and 11 response patterns) for the conflict problems as a function of block and group in Study 1A (base rate problems) and 2A (conjunction problems).

Critically, in the training group, the decrease in “00” patterns was driven by an increase in “11” patterns rather than an increase in “01” patterns. In fact, the latter pattern slightly decreased (15.8%) following the intervention. These results support the idea that training helped participants intuit the correct solution strategy rather than correct an initial “erroneous” response through deliberation. More specifically, they indicate that, after the training intervention, reasoners were able to apply the correct solution strategy at an intuitive level.

2.2.3 Individual level directions of change

To explore further how participants solved the problems, we performed an individual level accuracy analysis (Raoelison & De Neys, 2019). For each participant, on each conflict trial, we coded the direction of change from start to end of the experiment. This allowed us to observe, at a higher level of detail, how the intervention influenced participants’ response patterns.

First, we describe the categories of participants observed in the training group. By and large, Figure 4 suggests that we can classify the participants into three main categories. First, 12% of the participants did not benefit from the training intervention since they gave incorrect (biased) responses (i.e., “00” patterns) throughout the study. These participants were classified as “biased” respondents in Figure 4. Second, some participants gave correct initial and/or final responses (i.e., “01” or “11” patterns) from start to finish and did not require any training intervention to respond correctly to the base-rate problems. They represent 30% of the participants and were labelled as “correct” respondents. Third, some participants improved their performance after the intervention and were labelled as “improved” respondents. These were participants who showed a post-intervention increase in “01” patterns (at the expense of “00” patterns), or an increase in “11” patterns (at the expense of “00” or “01” patterns). Overall, the proportion of improved respondents in the training group represented the majority of participants (58%).

Figure 4: Individual level direction of change (each row represents one participant) of Study 1A (base-rate problems) and Study 2A (conjunction problems). Due to the discarding of missed deadline and load trials (see Trial Exclusion), not all participants contributed 8 analysable trials.

Next, we made a further subdivision based on the dominant response category within the pre- and post-intervention blocks. Participants who produced a majority of “00” patterns were labelled as “biased”, those who produced a majority of “01” patterns were labelled as “deliberators”, and those who produced a majority of “11” patterns were labelled as “intuitors”. These subdivisions allowed us to look more closely into the individual level directions of change from pre- to post-intervention. Figure 4 shows that, among correct respondents, the majority of the participants belonged to the “intuitor” sub-category (86.7%), both in the pre- and post-intervention blocks, and a minority of the participants belonged to the “deliberator” sub-category (13.3%). Critically, among improved respondents, more than half of the participants who were “biased” before the intervention became “intuitors” (66.7%) after the intervention and a smaller proportion went from being “biased” to being “deliberators” (33.3%). Finally, we note that, among improved respondents, 48.3% of the participants went from being “deliberator” before the intervention to “intuitor” after the intervention. Hence, although before the training they could already respond correctly through deliberation, after the training they were able to intuit the correct response (i.e., with no deliberation involved).
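The subdivision by dominant response category can be sketched as follows. We assume here that "majority" means strictly more than half of a participant's analysable conflict trials in a block, and we use a fallback label "other" for blocks without such a majority. Both choices, like the function names, are illustrative assumptions rather than the authors' exact criteria.

```python
from collections import Counter

# Labels taken from the text; the "other" fallback is an assumption.
LABELS = {"00": "biased", "01": "deliberator", "11": "intuitor"}

def dominant_label(patterns):
    """Label one block of a participant's conflict trials by its majority pattern."""
    if not patterns:
        return "other"
    pattern, count = Counter(patterns).most_common(1)[0]
    # Assumed reading of "majority": strictly more than half of the trials.
    if 2 * count > len(patterns) and pattern in LABELS:
        return LABELS[pattern]
    return "other"

def transition(pre_patterns, post_patterns):
    """Pair of labels describing the change from pre- to post-intervention."""
    return dominant_label(pre_patterns), dominant_label(post_patterns)

# Hypothetical participant (not data from the study).
example = transition(["00", "00", "01", "00"], ["11", "11", "11", "01"])
print(example)
```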

With respect to the control group, Figure 4 shows that 19.6% of the participants were biased respondents, and 43.1% were correct respondents. Note that, in the control group, some respondents (21.7%) showed an inconsistent response pattern and could not be classified based on our criteria. They were put in an “other” group. 15.6% of reasoners in the control group showed a natural improvement, in the absence of training, and started giving correct responses after the control “no-explanation” intervention block. These participants were labelled as “natural improved”. After the no-explanation intervention, 55.5% of them became “intuitors” while the remaining 45.5% were “deliberators”. However, the key point is that this natural-improved group (15.6% of reasoners) was considerably smaller than the improved group in the training condition (58.0% of reasoners). Again, this finding supports the idea that the training intervention led to an improvement in reasoning with the base-rate problems.

2.2.4 Conflict detection

Previous studies have shown that, despite giving an incorrect response, reasoners sometimes detect their error or the presence of a response conflict (e.g., De Neys, 2013; Frey et al., 2017). This detection is often reflected in increased response doubt (i.e., lowered response confidence). In the present study, we explored whether the training intervention affected biased reasoners’ ability to detect conflict in base-rate problems. That is, although the training might not have succeeded in getting biased people to reason accurately, it might have helped them to better detect that their answer was incorrect. We used the conflict-detection index introduced in the study of De Neys et al. (2011), which contrasts confidence ratings (see Footnote 4) for no-conflict trials that yielded a correct response to confidence ratings for conflict trials that yielded an incorrect response. We compared the conflict-detection index before and after the intervention, in both the training and control groups. A higher difference value implies a larger confidence decrease when solving conflict items, which is believed to reflect a more pronounced conflict experience (Bago & De Neys, 2019; Pennycook et al., 2015).
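As a rough sketch, the index for one participant and one response stage can be computed as the mean confidence on correctly solved no-conflict trials minus the mean confidence on incorrectly solved conflict trials. The trial representation and the function name below are hypothetical; returning `None` when a participant has no qualifying trials is our choice for the illustration.

```python
from statistics import mean

def detection_index(trials):
    """Conflict-detection index from (is_conflict, is_correct, confidence) triples.

    Confidence is on the 0-100 scale. Returns None when either trial type
    is missing, because the contrast is then undefined.
    """
    baseline = [conf for conflict, correct, conf in trials if not conflict and correct]
    biased = [conf for conflict, correct, conf in trials if conflict and not correct]
    if not baseline or not biased:
        return None
    return mean(baseline) - mean(biased)

# Hypothetical trials for one participant (not data from the study).
example_trials = [
    (False, True, 90),  # no-conflict, correct
    (False, True, 80),  # no-conflict, correct
    (True, False, 70),  # conflict, incorrect
    (True, True, 95),   # conflict, correct: not part of the index
]
index = detection_index(example_trials)
print(index)
```

A positive value means the participant was less confident in incorrect conflict responses than in correct no-conflict responses.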

Table 1 indicates that, while the conflict experience (i.e., response doubt for incorrect conflict vs. baseline correct no-conflict trial responses) seemed to increase post-intervention in the training group, the opposite pattern was observed in the control group. For completeness, we analysed the data using ANOVAs on initial and final detection indices with Block (pre- vs post-intervention) as a within-subjects factor and Group (training vs control) as a between-subjects factor. For both final and initial responses, the ANOVAs revealed a trend for a Group × Block interaction (Final response: F(1,31) = 3.58, p = .07, η²_g = .03; Initial response: F(1,51) = 2.52, p = .12, η²_g = .02). In sum, although some participants failed to provide the correct response after the training, they may nevertheless have benefited from it, in that they were slightly better able to detect that their heuristic answer was not correct after the training. Clearly, given the weak nature of the trends, this conclusion remains speculative.

Table 1: Conflict detection results in Study 1A (Base-rate problems) and Study 2A (Conjunction problems). Percentage of mean difference in confidence ratings (Standard Error of the Mean) between incorrect conflict and correct no-conflict problems.

2.2.5 Predictive conflict detection

We also asked whether individual differences in the ability to detect conflict (before the intervention) were predictive of the success of the training intervention. That is, we asked whether reasoners who became correct respondents after the training intervention (i.e., improved respondents in our individual level classification) showed better conflict detection (i.e., stronger response doubt when giving incorrect answers on the conflict problems) before the training compared to reasoners who did not improve after training (i.e., biased respondents). We again used the difference in confidence ratings for incorrect conflict problem responses and correct no-conflict control problem responses as our index of conflict detection. Hence, the higher the conflict detection index, the more a participant doubted their incorrect answer (i.e., the higher the error detection).

For final responses, we observed a better conflict detection for the improved (M = 17.0%, SEM = 7.9) compared to the biased respondents (M = –0.2%, SEM = 2.2; t(22) = 2.06, p = .05, d = .70). The same trend was observed for initial responses although it did not reach significance (M improved = 7.4%, SEM = 4.4; M biased = –0.3%, SEM = 0.3; t(29) = 1.75, p = .09, d = .50). Note that, for both initial and final responses, reasoners from the biased group did not show a nominal detection effect (i.e., the conflict detection index was negative), showing that these participants did not doubt their incorrect conflict responses.

2.2.6 Neutral problem accuracy

We tested whether the training could lead to a performance increase with untrained neutral problems, in which the description did not cue a stereotypical response. Figure 5 indicates that, except for a general pre- to post-intervention increase in accuracy, there was no clear sign of a training effect on neutral problems. Specifically, for both response stages (i.e., initial and final), there was no significant Block × Group interaction (Final response: F(1,90) = 0.32, p = .57, η²_g = .001; Initial response: F(1,88) = 0.79, p = .38, η²_g = .004). In sum, participants tended to improve somewhat through passive repetitive exposure, but this improvement was not boosted by the training intervention. Hence, although our conflict problem results indicate that participants learned to favour the base-rate response over a conflicting stereotypical association, they did not learn to favour base-rates more generally (either intuitively or deliberately).

Figure 5: Average initial and final accuracy on neutral and transfer problems in Study 1A and 2A. Error bars represent standard error of the mean (SEM).

2.2.7 Transfer problem accuracy

Finally, we asked whether the training intervention led to a performance increase on untrained reasoning problems with a different logical structure than the base-rate problems (i.e., CRT-like and conjunction fallacy problems).

Figure 5 shows the average performance. The ANOVAs revealed that performance remained stable after the intervention in both groups, for final responses (no significant Block × Group interaction: F(1,96) = 0.42, p = .52, η²_g = .001) and for initial responses as well (no significant Block × Group interaction: F(1,91) = 2.00, p = .16, η²_g = .01). This pattern was similar for each problem in isolation (see Supplementary Material Section D). Hence, the results suggest that the training effect is highly specific to conflict base-rate problems and does not lead to an increase in (intuitive or deliberate) performance on other untrained reasoning tasks.

3 Study 2A: Conjunction training

Study 1A showed that our base-rate training intervention helped reasoners to intuit the correct response to conflict base-rate (but not other) problems. After training, participants favoured the response cued by the base-rates over a conflicting cued stereotypical response even when deliberation was minimized. In Study 2A, we tested the robustness of this intuitive application of a trained principle over a conflicting stereotypical association by examining whether it applied to the conjunction fallacy problem (Tversky & Kahneman, 1983). Here, a cued stereotypical association typically tricks people into violating the logical conjunction rule. Participants typically read a short personality sketch (e.g., “Perry, 36, has previously studied literature and likes poetry”). They are then asked to judge the probability of statements such as ‘(A) Perry is a carpenter’, and ‘(B) Perry is a carpenter and a novel writer’. The conjunction rule, one of the most fundamental laws of probability, holds that the probability of a conjunction of two events cannot exceed that of either of its constituents (i.e., p(A&B) ≤ p(A) and p(A&B) ≤ p(B)). Thus, there should always be at least as many individuals who are simply carpenters as individuals who are carpenters and, in addition, novel writers. However, without training, people massively violate the conjunction rule and intuitively conclude that statement B is more probable than statement A based on the intuitive match with the stereotypical description (Andersson et al., 2020; Tversky & Kahneman, 1983). We tested whether an intervention in which the conjunction logic was clarified helped people to (intuitively) disregard the tempting stereotypical association and avoid the conjunction fallacy.
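The conjunction rule can be made concrete by counting in a small population: everyone who is both a carpenter and a novel writer is also counted among the carpenters, so the conjunction can never be the more frequent category. The toy population below is entirely made up for illustration.

```python
# Hypothetical population of five people (illustration only).
population = [
    {"carpenter": True, "novelist": True},
    {"carpenter": True, "novelist": False},
    {"carpenter": True, "novelist": False},
    {"carpenter": False, "novelist": True},
    {"carpenter": False, "novelist": False},
]

def probability(predicate):
    """Relative frequency of people in the population satisfying the predicate."""
    return sum(1 for person in population if predicate(person)) / len(population)

p_a = probability(lambda person: person["carpenter"])
p_b = probability(lambda person: person["novelist"])
p_a_and_b = probability(lambda person: person["carpenter"] and person["novelist"])

# The conjunction rule: p(A&B) cannot exceed either constituent.
assert p_a_and_b <= min(p_a, p_b)
print(p_a, p_b, p_a_and_b)
```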

3.1 Method

Study 2A was roughly similar to Study 1A except that participants were not asked to provide a justification at the end of the experiment, and that they did not respond to neutral problems. Also, unlike in Study 1A, transfer problems consisted of CRT-like and base-rate problems. Only the specifics inherent to Study 2A will be presented.

3.1.1 Pre-registration

The study design and research question were preregistered on the Open Science Framework (https://osf.io/674gk/). No specific analyses were preregistered.

3.1.2 Participants

Participants were recruited online, using the Prolific Academic website (http://www.prolific.ac/). Participants had to be native English speakers from Canada, Australia, New Zealand, the United States of America, or the United Kingdom to take part. As in Study 1A, 100 individuals participated (72 females, M = 35.7 years, SD = 11.8), 46 participants were randomly assigned to the training group and 54 to the control group. In total, 50 participants had secondary school as their highest level of education, and 50 reported a university degree. We compensated participants for their time at the rate of £5 per hour.

Note that in addition to the above 100 participants, there were also a total of 95 participants who started the experiment but could not complete it due to a coding error in the post-intervention block. These partial data were not analysed for the main study, but they are included in our publicly available data file.

3.1.3 Materials

The study consisted of three blocks presented in the following order: a pre-intervention, an intervention, and a post-intervention block. In total, each participant had to solve 10 problems during the pre-intervention block, namely, four conflict, four no-conflict, and two transfer problems (see below), and again the same number of problems during the post-intervention block. All the problems are presented in the Supplementary Material Section A.

Conjunction problems:

We used the conjunction task format introduced by Andersson et al. (2020). All conjunction problems presented a short personality description of a character. This description consisted of the character’s name (e.g., “Emery”), his age (e.g., “30”), his previous studies (e.g., “robotics”) and his hobby/interests (e.g., “AI”). Next, the participants were given four response options and were asked to indicate which one was most probable. In the critical conflict items, one option presented a characteristic that featured an unlikely stereotypical association given the description (e.g., “a cashier”) and one option presented a conjunction of this unlikely and a likely characteristic (e.g., “a cashier and a computer hacker”). Two other filler options presented a characteristic that was very unlikely (e.g., “an international pop singer”) and a conjunction of two unlikely characteristics (e.g., “a cashier and a cheerleader”). The following illustrates the full problem format:

Emery, 30, has previously studied robotics and likes AI.
Is it most probable that the described person is:
- A cashier
- An international pop singer
- A cashier and a cheerleader
- A cashier and a computer hacker

As with the base-rate problems in Study 1A, in addition to the four conflict problems we also presented four no-conflict control problems in each block. In the no-conflict problems, we replaced the singular unlikely response option with the option that featured the likely stereotypical association. Here is an example:

Emery, 30, has previously studied robotics and likes AI.
Is it most probable that the described person is:
- A computer hacker
- An international pop singer
- A cashier and a cheerleader
- A cashier and a computer hacker

Reasoners will tend to select the statement that best fits with the stereotypical description (i.e., the most representative statement, see Tversky & Kahneman, 1983). Clearly, the fit will be higher for the likely than the unlikely characteristic with the conjunctive statement falling in between. Hence, on the no-conflict problems, stereotypical associations will no longer favour the conjunctive over the singular statement and participants are expected to show high accuracies (e.g., see De Neys et al., 2011).

Two sets of 16 unique items (8 pre-intervention and 8 post-intervention block items) were used for counterbalancing purposes. The conflict problems in one set were the no-conflict problems in the other, and vice-versa. Participants were randomly assigned to one of the two sets. Consequently, as with the Study 1A base-rate problems, none of the pre- and post-intervention conjunction problem contents was repeated within-subjects (i.e., participants saw a total of 16 different items with a unique stereotypical association).

The four response options were presented in random order. Note that Andersson et al. (2020) adopted the four-option design to minimize the use of simple visual response strategies (e.g., “always choose the shortest answer”). As in the Andersson et al. study, selection of the filler options was overall very rare in our study (i.e., less than 12% of options). However, strictly speaking, participants who select the singular very unlikely option do not violate the critical conjunction rule. Given that we are interested in learning effects, selection of the very unlikely option can be considered a correct response. First, we ran all analyses while including the “very unlikely” option as correct and, second, while not including it. None of our conclusions were affected either way. To avoid a lengthy technical discussion, we report the analyses in which selection of the singular unlikely and likely response are both considered correct (i.e., correct answer = answer on which the conjunction fallacy is avoided). Figure S3 in Supplementary Material Section E gives a detailed overview of the selection frequency of each individual response option.

Pilot rating study:

We created a pool of 60 potential items that contained translated and culturally adapted items from Andersson et al. (2020) and newly generated items that respected the same structure. To validate the stereotypical problem content, we ran a pilot rating study with 90 participants (60 female, mean age = 34.2 years, SD = 12.5). Participants were asked to rate how well each option matched the described person on a scale from 0 (not at all similar) to 10 (very similar). To select the most appropriate material, after an initial exploration, we picked items for which, in the conflict version, the combination of the unlikely and likely constituent was rated at a minimum of 3.5 and was rated higher than the unlikely constituent, while in their no-conflict counterpart, the likely constituent was rated at a minimum of 5 and higher than the combination of the unlikely and likely constituent. In addition, the relative option ranking needed to be maximally respected (e.g., very unlikely < unlikely < likely and unlikely combination < likely). We selected 32 items for which these differences were greatest. Among the ultimately selected items, the average ratings for the different response options were: very unlikely option (M = 1.4, SD = 1.8); unlikely option (M = 2.0, SD = 2.2); unlikely and unlikely option (M = 1.7, SD = 1.9); unlikely and likely option (M = 5.1, SD = 2.5); and likely option (M = 6.7, SD = 2.6). Half of the items were used for the current Study 2A, the other half was used for Study 2B. The full item set can be found in the Supplementary Material Section A.
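The two threshold criteria described above can be expressed as a simple filter over an item's mean ratings. The sketch below covers only those thresholds; it does not implement the ranking requirement or the final selection of the items with the greatest differences. The class and function names are hypothetical, and the first example uses the average ratings reported for the selected items.

```python
from dataclasses import dataclass

@dataclass
class ItemRatings:
    """Mean pilot ratings (0-10 scale) for the options of one item."""
    unlikely: float
    likely: float
    unlikely_and_likely: float

def meets_pilot_criteria(ratings: ItemRatings) -> bool:
    """Check the two threshold criteria stated in the text."""
    # Conflict version: the conjunction is rated >= 3.5 and above the unlikely option.
    conflict_ok = (
        ratings.unlikely_and_likely >= 3.5
        and ratings.unlikely_and_likely > ratings.unlikely
    )
    # No-conflict version: the likely option is rated >= 5 and above the conjunction.
    no_conflict_ok = (
        ratings.likely >= 5
        and ratings.likely > ratings.unlikely_and_likely
    )
    return conflict_ok and no_conflict_ok

# Average ratings reported for the selected items.
reported_means = ItemRatings(unlikely=2.0, likely=6.7, unlikely_and_likely=5.1)
# Hypothetical item that fails both thresholds.
weak_item = ItemRatings(unlikely=2.0, likely=4.0, unlikely_and_likely=3.0)

print(meets_pilot_criteria(reported_means), meets_pilot_criteria(weak_item))
```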