The gender gap in political knowledge is considered “one of the most robust findings in the field of political behavior” (Dow Reference Dow2009, 117) and is thought to be linked to women’s lower political participation and representation (Ondercin and Jones-White Reference Ondercin and Jones-White2011). The underlying reasons for this knowledge gap, however, remain contentious. Until recently, most studies focused on cultural and macro-level factors (Burns, Schlozman and Verba Reference Burns, Schlozman and Verba2001; Carpini and Keeter Reference Carpini, Keeter, Tolleson-Rinehart and Josephson2005). In contrast, Ihme and Tausendpfund (Reference Ihme and Tausendpfund2018) offered a psychological explanation. Specifically, they explored whether the activation of negative stereotypes about women’s lower political knowledge can harm women’s performance.
According to the stereotype threat literature, exposure to negative stereotypes about one’s in-group increases anxiety, negative thinking, and psychological discomfort, all of which overload working memory and ultimately hamper cognitive performance (McGlone and Pfiester Reference McGlone and Pfiester2007; Pennington, Heim, Levy, and Larkin Reference Pennington, Heim, Levy and Larkin2016). These psychological processes, in turn, reinforce the existing stereotypes (Schmader, Johns, and Forbes Reference Schmader, Johns and Forbes2008). Conversely, non-stigmatized individuals exhibit enhanced task performance when exposed to negative stereotypes about their outgroup (i.e., a “stereotype lift”; Walton and Cohen Reference Walton and Cohen2003). Consistent with both stereotype threat and stereotype lift, Ihme and Tausendpfund (Reference Ihme and Tausendpfund2018) found that female participants performed worse than male participants in a political knowledge test when gender stereotypes were activated (N = 377), whereas no knowledge gap emerged in the absence of activated gender stereotypes: women performed worse (and men performed better) in the stereotype-activation conditions than in a control condition. These findings persisted even when controlling for political interest, ruling out the possibility that the results are a function of women’s lack of interest in the topic. Further, the effect of stereotype threat on the gender gap in political knowledge was more pronounced for female students of Politics, presumably because the test represented higher stakes for them as supposed experts on the topic. The authors concluded that “the often-found gender gap in political knowledge might – to some extent – be the result of stereotyping” (Ihme and Tausendpfund Reference Ihme and Tausendpfund2018, 12).
These findings represent an important practical contribution, as they suggest that the political knowledge gender gap is not necessarily stable, and thus could be potentially mitigated by a range of interventions (for a review, see Lewis and Sekaquaptewa Reference Lewis and Sekaquaptewa2016).
The effects of stereotype threat on gender differences in performance have not been consistent in the literature. Pruysers and Blais (Reference Pruysers and Blais2014) found no effect of stereotype threat on the political knowledge gap. McGlone, Aronson and Kobrynowicz (Reference McGlone, Aronson and Kobrynowicz2006) found that implicit and explicit cues of gender stereotype threat impaired women’s performance on a political knowledge test but did not improve men’s performance. Adding to the contention, careful examinations of stereotype threat effects in other domains, such as women’s and girls’ mathematics performance, reveal at most weak evidence in its favor (Flore and Wicherts Reference Flore and Wicherts2015; Flore, Mulder and Wicherts Reference Flore, Mulder and Wicherts2018; Pennington, Litchfield, McLatchie, and Heim Reference Pennington, Litchfield, McLatchie and Heim2018). These inconsistent patterns call into question whether the effect of stereotype threat on the political knowledge gap is replicable and, if so, to what extent. To date, no direct replication of this effect has been conducted.
As part of a large-scale replication initiative led by the Center for Open Science, the SCORE program (Systematizing Confidence in Open Research and Evidence; https://www.cos.io/score), which aims to investigate the credibility of scientific claims in the social and behavioral sciences (Alipourfard et al. Reference Alipourfard, Arendt, Benjamin, Benkler, Bishop, Burstein and Wu2021), we conducted a preregistered (peer-reviewed), well-powered, two-stage direct replication of Ihme and Tausendpfund (Reference Ihme and Tausendpfund2018).
Methods
As determined by SCORE, the focal claim we attempted to replicate was that “the activation of gender stereotypes affects performance on a political knowledge test” (Ihme and Tausendpfund Reference Ihme and Tausendpfund2018, 1). As in the original study, we employed a 2 (gender: male vs. female) × 2 (field of study/work: non-politics vs. politics) × 3 (stereotype activation: stereotype activated by gender question vs. stereotype activated by gender difference statement vs. stereotype not activated) between-subjects design. Note that the original study included the variable field of study in all reported analyses. Thus, even though this variable was not necessary for replicating the effect of gender stereotype activation on political knowledge, we included it in our direct replication so that our study design and analyses were as similar and comparable as possible to the original study. According to SCORE guidelines, the replication would be deemed successful if the statistical results showed a significant interaction (α = 0.05) between stereotype activation and gender. All study materials, including the ethics approval, power calculations, and preregistration, are publicly available at OSF (https://osf.io/8feku/?view_only=99a41a96c8cd43c4ab349e44d79919cd).
Sample
The required sample size for replicating the focal claim was determined with power analyses carried out using the “pwr” package (Champely Reference Champely2020) in R (R Core Team 2020). Power calculations were performed in accordance with the guidelines of the Social Sciences Replication Project (http://www.socialsciencesreplicationproject.com/). As per SCORE guidelines, data collection was to proceed in two stages, with a second round of data collected only if the first round resulted in an unsuccessful replication. Two power calculations were therefore performed to derive the sample size required for each stage. The first round of data collection was required to achieve 90% power; assuming that the true effect size of the interaction between gender and stereotype activation was 75% of that reported in the original study, this power analysis yielded a sample of 667 participants. The pooled sample (including both stages of data collection) was also required to achieve 90% power; assuming that the true effect size of the interaction was 50% of that reported in the original study, the second power analysis indicated that an additional 830 responses would be needed. Participants were recruited through a professional survey firm (https://www.cint.com), with attention checks employed as recommended (Aronow, Kalla, Orr, and Ternovski Reference Aronow, Kalla, Orr and Ternovski2020). Only American citizens older than 18 years who were studying or working at the time of the survey were invited to take part.
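The staged calculation can be sketched with the “pwr” package. The effect size below is a placeholder, not the value used in the original study, and the parameter count is illustrative:

```r
# Sketch of the Stage 1 power calculation (90% power, alpha = .05)
# for the 2-df gender x stereotype-activation interaction.
# f2_assumed is a hypothetical value, NOT the original effect size.
library(pwr)

f2_assumed <- 0.015  # placeholder for 75% of the original effect size

res <- pwr.f2.test(u = 2, f2 = f2_assumed, sig.level = 0.05, power = 0.90)

# Required N = denominator df + number of estimated model terms
# (13 is illustrative for the full 2 x 2 x 3 design plus a covariate)
n_required <- ceiling(res$v) + 13
n_required
```

The same call with the smaller assumed effect size yields the pooled-sample target for Stage 2.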
Procedure
To ensure a fair and reliable replication attempt, the study design and analysis plan were peer-reviewed by independent researchers selected by SCORE and preregistered on OSF (https://osf.io/nxrg7). The study was approved by an independent IRB ethics committee, BRANY (https://www.brany.com), and the U.S. Army’s Human Research Protection Office (HRPO) #20-032-764 (Award Number HR00112020015, HRPO Log Number A-21036.50).
According to existing definition efforts (Parsons et al. Reference Parsons, Azevedo, Elsherif, Guay, Shahim, Govaart, Norris, O’Mahony, Parker, Todorovic, Pennington, Garcia-Pelegrin, Lazić, Robertson, Middleton, Valentini, McCuaig, Baker, Collins and Aczel2021), our study can be considered a direct replication of Study 2 by Ihme and Tausendpfund (Reference Ihme and Tausendpfund2018), as it uses the same methodology and experimental design employed by the authors of the original study, with a few modifications, as follows. First, our sample was composed not only of students, as in the original study, but also of working adults. This modification was necessary to achieve the required sample size, which was considerably larger than that of the original study, and to check whether the original findings (in German students) generalize to the adult population of the United States. As a consequence, the political knowledge scale used in our study had to be adapted from the German political context to the contemporary political context of the United States (see Table S1). Second, as our sample was composed of both students and working adults, the measurement of participants’ field of study had to be expanded to encompass fields of study or work. Data were collected online via Qualtrics. Both stages of data collection had exactly the same procedures and measures. Before participants answered the political knowledge test, we measured political interest and manipulated stereotype activation in the same way as Ihme and Tausendpfund (Reference Ihme and Tausendpfund2018). We provide additional sample, procedural, and question wording details in the Supplementary Materials.
Data analysis
Following the analyses reported in the original study and the analysis script made available by the original authors, we tested the replication claim that the activation of gender stereotypes influences performance in a political knowledge test with a 2 (gender) × 2 (field of work/study) × 3 (stereotype activation) ANCOVA. The dependent variable was participants’ total score on the political knowledge test. As in the original study, a single political interest score was calculated per participant (i.e., the average of responses on the short scale of political interest) and included as a covariate. In addition, we used Bayesian analyses to adjudicate whether the results indicate absence of evidence or evidence of absence. All analyses were conducted in R. To increase comparability between the replication and original results, we set the sums of squares in R to type III, which is the default in the SPSS software used by the original authors to perform their analyses.
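A minimal sketch of this ANCOVA in R, assuming a data frame `d` with columns `knowledge`, `gender`, `field`, `condition`, and `interest` (these names are ours for illustration, not those of the original analysis script):

```r
# Type III sums of squares (the SPSS default) require effect coding
options(contrasts = c("contr.sum", "contr.poly"))

library(car)

# 2 (gender) x 2 (field) x 3 (condition) ANCOVA with political
# interest as a covariate; `d` is a hypothetical data frame
fit <- lm(knowledge ~ gender * field * condition + interest, data = d)
Anova(fit, type = 3)
```

With R’s default treatment contrasts, `car::Anova(type = 3)` would produce misleading tests for the higher-order terms, which is why the contrasts are set first.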
Results
Stage 1
Results of the ANCOVA yielded a non-significant interaction between stereotype activation and gender, F(2, 658) = 0.691, p = 0.501, partial η² = 0.002, 95% CI = [.00, .01], N = 671. Thus, according to the SCORE criteria, the replication was considered unsuccessful at the first stage (see Tables S2–S4 for detailed results). As preregistered, to provide further evidence regarding the (non)replicability of the effect of gender stereotype threat on gender differences in political knowledge, we then proceeded to a second stage of data collection.
Stage 2
The pooled analytical sample (first and second stages together) comprised 1,502 participants (M_age = 45.87 years, SD_age = 17.35; 48.74% female). The distribution of participants across conditions resembled that of the original study (see Table S5). Consistent with the original study and a large body of research, the ANCOVA revealed a main effect of gender on political knowledge, such that men generally scored higher than women on the political knowledge test, F(1, 1489) = 28.61, p < 0.001, partial η² = 0.02, 95% CI = [.01, .04]; M_female = 7.36, SD = 3.62; M_male = 9.81, SD = 3.87. Also in line with the original study, we found no main effect of stereotype activation on political knowledge, F(2, 1489) = 0.27, p = 0.77, partial η² = 0.00, 95% CI = [.00, .00], and a significant effect of political interest, such that the more interested participants were in politics, the higher their score on the political knowledge test, F(1, 1489) = 194.78, p < 0.001, partial η² = 0.12, 95% CI = [.09, .15]. Our focal test, however, diverged from the results reported in the original study, as the interaction between gender and stereotype activation was not significant, F(2, 1489) = 1.22, p = 0.30, partial η² = 0.00, 95% CI = [.00, .01]. Thus, according to the criteria outlined by SCORE, the replication of the effect of stereotype threat on the gender gap in political knowledge was unsuccessful even after the second stage of data collection.
We further explored the results by conducting Bonferroni-corrected pairwise comparisons with the emmeans function in R (Lenth Reference Lenth2022). As illustrated in Figure 1, men’s scores were significantly higher than women’s in the stereotype not activated condition, t(1489) = −7.42, p < 0.001, the stereotype activated by gender question condition, t(1489) = −4.36, p < 0.001, and the gender difference statement condition, t(1489) = −6.02, p < 0.001. In addition, we found evidence of neither stereotype threat nor stereotype lift: women’s performance did not decrease, nor did men’s increase, in the stereotype-activated conditions compared to the stereotype not activated condition (see supplementary materials section 3.3 for detailed analyses). Neither the interaction between field of study/work and stereotype activation nor the three-way interaction between field of study/work, stereotype activation, and gender was significant (p = 0.32 and p = 0.81, respectively). Additional analyses and a comparison between the replication results and those of the original study can be found in the supplementary materials (Tables S6–S7).
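These pairwise comparisons can be sketched with emmeans, assuming a fitted ANCOVA model object `fit` containing factors named `gender` and `condition` (hypothetical names for illustration):

```r
library(emmeans)

# Bonferroni-corrected male-female contrasts within each
# stereotype-activation condition of the hypothetical model `fit`
emmeans(fit, pairwise ~ gender | condition, adjust = "bonferroni")
```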
Exploratory analyses
In order to evaluate our replication attempt, we computed evidence-updated replication Bayes factors for both stages of data collection (Ly, Etz, Marsman, and Wagenmakers Reference Ly, Etz, Marsman and Wagenmakers2019; Verhagen and Wagenmakers Reference Verhagen and Wagenmakers2014). Using the “posterior distribution obtained from the original study as a prior distribution for the test of the data from the replication study” (Ly, Etz, Marsman, and Wagenmakers Reference Ly, Etz, Marsman and Wagenmakers2019, 2504), we computed an overall Bayes factor of BF10(d_orig, d_rep) = 0.009 for the interaction term of gender and stereotype activation on political knowledge at Stage 1. Dividing the overall Bayes factor by the Bayes factor from the original data (BF10(d_orig) = 0.142) yielded a replication Bayes factor of BF10(d_orig | d_rep) = 0.064. For Stage 2, an overall Bayes factor of BF10(d_orig, d_rep) = 0.001 for the interaction effect of gender and stereotype activation on political knowledge was computed. Again, dividing this by the original study’s Bayes factor resulted in a replication Bayes factor of BF10(d_orig | d_rep) = 0.007. This means that the replication data are predicted 1/0.064 = 15.8 (Stage 1) or 1/0.007 = 143 (Stage 2) times better by the null hypothesis than by the alternative hypothesis based on the original dataset. Hence, the replication cannot be deemed successful (Zwaan, Etz, Lucas, and Donnellan Reference Zwaan, Etz, Lucas and Donnellan2018).
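The arithmetic behind these replication Bayes factors is simply the combined (original plus replication) Bayes factor divided by that of the original data alone:

```r
bf_orig <- 0.142               # BF10 for the original data alone

# Replication BF = combined BF / original BF
bf_rep_s1 <- 0.009 / bf_orig   # Stage 1: ~0.063
bf_rep_s2 <- 0.001 / bf_orig   # Stage 2: ~0.007

# Evidence for the null relative to the alternative
1 / bf_rep_s1   # ~15.8
1 / bf_rep_s2   # ~142 (143 in the text, from unrounded inputs)
```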
In addition, we evaluated – as per the original authors’ advice – whether the political knowledge scale is a “sufficiently difficult test.” Using Ihme and Tausendpfund’s (Reference Ihme and Tausendpfund2018) original data, we compared the difficulty of the two scales. Comparing the political knowledge test distributions of the original and replication data revealed no significant differences for Stage 1 (z = −1.53, p = 0.06) or Stage 2 (z = −1.22, p = 0.11; see Figure 2). To examine this in more depth, we fitted an Item Response Theory two-parameter logistic (2PL) model. As indicated in Figure 3, both scales display equivalent levels of reliability across the latent construct θ (panel a), show equivalent test difficulty and total score across θ levels (panel b), and – albeit with some inter-item differences – have broadly corresponding item difficulties (panel c). These findings suggest comparable scale properties for the original and replication instruments, allowing us to rule out measurement-related (difficulty) issues underlying the non-replication. A variety of robustness checks and additional exploratory analyses are reported in the Supplementary Materials (Tables S8–S20).
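A 2PL comparison of this kind can be sketched with the mirt package, assuming `responses` is a data frame of dichotomously scored (0/1) knowledge items (a hypothetical object name):

```r
library(mirt)

# Unidimensional 2PL model on hypothetical 0/1 item responses
mod <- mirt(responses, model = 1, itemtype = "2PL")

# Item discrimination (a) and difficulty (b) parameters
coef(mod, IRTpars = TRUE, simplify = TRUE)

# Test information across the latent trait theta
plot(mod, type = "info")
```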
Discussion
Ihme and Tausendpfund (Reference Ihme and Tausendpfund2018) proposed that the activation of negative gender stereotypes accounts for part of the variance in the political knowledge gender gap. In our independent and well-powered direct replication, we find no evidence that the activation of gender stereotypes affects participants’ performance in a political knowledge test. Indeed, we find evidence of the absence of this effect.
We note that some elements of our study design diverged from the original study and could have contributed to the observed non-replication. First, our study was conducted with American students and working adults, whereas the original study included German students. As the United States has achieved relatively lower gender parity than Germany in political empowerment (World Economic Forum 2021), one could argue that negative stereotypes about women might be more salient for Americans than for Germans, undermining women’s cognitive performance even in the absence of stereotype activation (e.g., in the control condition). Although we cannot rule out that some populations might be more vulnerable to gender stereotyping than others, we reduced cultural biases as much as possible by devising a political knowledge test that was similar to the one used in the original study in its level of difficulty, as our data suggest, while also being relevant to the American political context. A comparison of the effect of stereotype threat on gender differences in political knowledge across countries with varying levels of gender equality would be beneficial for a better understanding of potential cultural differences in stereotype threat. Second, as a direct consequence of including working adults in our sample, it was necessary to adapt the measure of field of study to encompass field of work. We argue, however, that this should not have contributed to the unsuccessful replication. If our measure of field of study/work had inadvertently made participants aware of their affiliation with a Politics or Non-Politics group, the effects of gender stereotype activation on performance would presumably have become more salient. Instead, our results show that field of study/work did not influence the results (Tables S16–S17). An argument can be made, however, that the extensive list of topics in our study reduced participants’ identification with Politics.
Nevertheless, adding participants’ attributed importance of Politics to their study/work as a covariate in the analyses did not change the results (Tables S18–S19). We also conducted further tests restricting our sample to young and educated adults to achieve a composition more similar to the respondents in the original study, but we still could not replicate the effect of stereotype activation on the gender gap in political knowledge (Table S20).
We note that our failure to replicate the effect of stereotype threat on gender differences in political knowledge is consistent with recent research efforts challenging the effect of stereotype threat on academic performance more broadly. Stoet and Geary (Reference Stoet and Geary2012) showed that only 30% of efforts aiming to replicate the gender gap in mathematical performance succeed. In addition, a meta-analysis investigating the effect of gender stereotype threats on the performance of schoolgirls in stereotyped subjects (e.g., science, math) indicated several signs of publication bias within this literature (Flore and Wicherts Reference Flore and Wicherts2015). Given these results, it is plausible that the effect of gender stereotype activation might be small in magnitude and/or might be decreasing over time (Lewis and Michalak Reference Lewis and Michalak2019).
Furthermore, we find robust evidence of a gender gap in political knowledge even after controlling for political interest. Our results support previous accounts that the gender gap in political knowledge may be an artifact of how knowledge is conceptualized and measured and of gender differences in attitudes toward standard tests. In line with previous research suggesting that the political knowledge gap might be artificially inflated by a disproportionate number of men who are willing to guess rather than choose the “don’t know” option – even if that might lead to an incorrect answer (Mondak and Anderson Reference Mondak and Anderson2004) – we find that female participants attempted fewer questions and used the “don’t know” response option in the political knowledge test more frequently than their male counterparts, whereas men guessed their answers more frequently than women, resulting in a larger number of incorrect answers (Tables S8–S14). This suggests that factors other than knowledge might contribute to the gender gap in political knowledge (Mondak Reference Mondak1999). For example, gender differences in risk taking and competitiveness (Lizotte and Sidman Reference Lizotte and Sidman2009) as well as in self-confidence (Wolak Reference Wolak2020) and self-efficacy (Preece Reference Preece2016) may lead women to second-guess themselves and be less prone to attempt questions of which they are unsure. Meanwhile, higher competitiveness and confidence among men might lead them to guess and “gain the advantage from a scoring system that does not penalize wrong answers and rewards right ones” (Kenski and Jamieson Reference Kenski, Jamieson and Jamieson2000, 84). Measurement non-invariance, too, appears to detrimentally affect the interpretation and validity of political knowledge scales across several sociodemographics.
For example, Lizotte and Sidman (Reference Lizotte and Sidman2009) and Mondak and Anderson (Reference Mondak and Anderson2004) have shown political knowledge instruments violate the equivalence assumption for gender, while Abrajano (Reference Abrajano2015) and Pietryka and MacIntosh (Reference Pietryka and MacIntosh2013) found non-invariance across age, income, race, and education. In our own replication attempt, we also found evidence of measurement non-invariance using item response theory and showed that the magnitude of the gender systematic bias appears to be contingent on respondents’ knowledge levels such that lack of equivalence by gender is stronger at average scores and weaker at the extremes of the political knowledge continuum (see Table S21 and Figure S1).
As politics has long been a male-dominated field, it should not come as a surprise that current measures of political knowledge tend to favor what men typically know. Previous studies have shown that the mere inclusion of gendered items on scales of political knowledge lessens the gender gap (Barabas, Jerit, Pollock, and Rainey Reference Barabas, Jerit, Pollock and Rainey2014; Dolan Reference Dolan2011). The investigation and validation of measures of political knowledge that account for the fact that men and women might not only know different things but also react in different ways to standard tests is paramount for a more accurate understanding of the gender gap in political knowledge and its biases.
Finally, we note that measurement issues are not unique to political knowledge and in fact are pervasive in Political Science with consequences for how we measure populism (Van Hauwaert, Schimpf, and Azevedo Reference Van Hauwaert, Schimpf and Azevedo2018, Reference Van Hauwaert, Schimpf and Azevedo2020; Wuttke, Schimpf, and Schoen Reference Wuttke, Schimpf and Schoen2020), operational ideology (Azevedo and Bolesta Reference Azevedo and Bolesta2022; Azevedo, Jost, Rothmund, and Sterling Reference Azevedo, Jost, Rothmund and Sterling2019; Kalmoe Reference Kalmoe2020), and political psychological constructs such as authoritarianism, racial resentment, personality traits, and moral traditionalism (Azevedo and Jost Reference Azevedo and Jost2021; Bromme, Rothmund, and Azevedo Reference Bromme, Rothmund and Azevedo2022; Pérez and Hetherington Reference Pérez and Hetherington2014; Pietryka and MacIntosh Reference Pietryka and MacIntosh2022). If the basic measurement properties of widely used constructs are flawed, it is likely that insights from research will be biased. Valid, invariant, and theoretically derived instruments are urgently needed for the reliable accumulation of knowledge in Political Science.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/XPS.2022.35
Data availability
This work was carried out as part of the Center for Open Science’s Systematizing Confidence in Open Research and Evidence (SCORE) program, which is funded by the Defense Advanced Research Projects Agency. The data, code, and any additional materials required to replicate all analyses in this article are available at the Journal of Experimental Political Science Dataverse within the Harvard Dataverse Network, at https://doi.org/10.7910/DVN/ETUUOD. All study materials and preregistration information for the current study have been made publicly available via OSF and can be accessed at https://osf.io/8feku/?view_only=99a41a96c8cd43c4ab349e44d79919cd.
Acknowledgements
We would like to thank the staff and researchers at the Center for Open Science for their guidance and assistance, and especially Zach Loomas and Beatrix Arendt for their patience and kindness. We also would like to show our appreciation for Kimberly Quinn’s editorship of the preregistration stage. Lastly, we would like to thank Charlotte R. Pennington for providing helpful feedback on an earlier manuscript.
Author contributions
Conceptualization: F.A, L.M, and D.S.B; Data Curation: F.A, L.M, and D.S.B; Formal analyses: F.A, L.M, and D.S.B; Investigation: F.A, L.M, and D.S.B; Methodology: F.A, L.M, and D.S.B; Project administration: F.A; Software: F.A, L.M, and D.S.B; Visualization: F.A, L.M, and D.S.B; Writing (original draft): F.A and L.M; Writing (review and editing): F.A, L.M, and D.S.B.
Conflicts of interest
The authors report no conflicts of interest.
Ethics statement
The study reported here was approved by an independent IRB ethics committee, BRANY (https://www.brany.com), and the U.S. Army’s Human Research Protection Office (HRPO) #20-032-764 (Award Number HR00112020015, HRPO Log Number A-21036.50), and adheres to APSA’s Principles and Guidance for Human Subjects Research. More information can be found in the Supplementary Materials (section 1: “Procedures and Measures”).