Researchers of L2 anxiety as well as other individual difference constructs (e.g., motivation, willingness to communicate) in applied linguistics are often concerned with comparing questionnaire (or scale) scores across key participant characteristics (e.g., target language, language-learning context, age, gender, country) and interventions (e.g., pre- and posttest) as well as over time (e.g., as in latent growth modeling). Unfortunately, such score comparisons are not meaningful unless evidence of construct comparability—the situation in which scores from different groups “measure the same construct of interest on the same metric”—is convincingly demonstrated beforehand (Wu et al., 2007, p. 1). Measurement invariance (MI, also referred to as measurement equivalence; see Somaraju et al., 2022) endeavors to tackle the issue of construct comparability by embracing an empirical approach to examining group differences. This paper aims to provide a nontechnical introduction to MI for L2 anxiety researchers and applied linguists working with questionnaire data more broadly. I begin by presenting a brief history of MI and by describing this concept using nonspecialized language. I then seek to demonstrate why this procedure is key to enhancing our understanding of L2 anxiety as well as other learner-internal characteristics, particularly with regard to various language-learning situations.
Although the history of MI goes back to the 1960s (e.g., Meredith, 1964; see Putnick & Bornstein, 2016, for more), the techniques for MI testing have arguably been clouded with overly specialized statistical terminology, making this procedure less popular with applied researchers (as noted by Wu et al., 2007). Indeed, it is not surprising that one of the most widely used L2 anxiety questionnaires—the 33-item Foreign Language Classroom Anxiety Scale (FLCAS; Horwitz et al., 1986)—was not tested for MI at the time of development. Until recently, only a short version of the FLCAS had been comprehensively examined for MI (see Botes et al., 2022). Other popular L2 anxiety questionnaires, including the Foreign Language Reading Anxiety Scale (Saito et al., 1999), the Second Language Writing Apprehension Test (Cheng et al., 1999), and the Foreign Language Listening Anxiety Scale (Elkhafaifi, 2005), were not assessed for MI during their initial validation either. In fact, a recent systematic review of L2 anxiety research published in twenty-two leading L2 journals in 2000–2020 revealed that only five out of 321 L2 anxiety scales in the sample were tested for MI (Sudina, 2023). What is more, no MI tests were performed on the newly developed scales, which weakens the validity arguments behind these psychometric instruments.
Despite the scarcity of MI testing in L2 anxiety research, applications of this technique are becoming increasingly common in neighboring disciplines focusing on nonlanguage-related anxieties, including dating, social, and pain anxiety (see Adamczyk et al., 2022; Rogers et al., 2020; Torregrosa Díez et al., 2022). Additionally, there has been a surge in MI testing in the realm of other L2 individual differences such as self-guides, enjoyment, and engagement (see Derakhshan et al., 2022; Liu et al., in press).
So, what is MI and why is it important? The concept of MI refers to the situation in which a latent variable representing a theoretical construct and consisting of one or more observed variables, such as questionnaire items, is similarly understood by respondents in different groups or by respondents in the same group over time (Putnick & Bornstein, 2016). As such, MI is a prerequisite for comparing mean scores on a latent variable across groups (e.g., L2 anxiety of students learning English in a foreign language context and students learning English in a second language context) and across time (e.g., L2 anxiety of students when they started learning English in elementary school and the same students following several years of language instruction). In a similar vein, MI should be established in experimental designs if the latent variable in question was somehow manipulated (e.g., to ensure that L2 students’ anxiety decreased due to the intervention itself rather than as an artifact of respondents’ interpretation of the questionnaire items following the intervention). Critically, rigorously validated questionnaires should be normed across age, gender, and a number of other participant characteristics to allow for group comparisons (Lee, 2018). MI can be attained by identifying and excluding scale items that carry different meanings for different groups. A hypothetical example would be to discover that nail-biting was a symptom of L2 anxiety in children but not in adults. Keeping the item inquiring about nail-biting would bias mean score comparisons across the two groups: if adults scored lower on L2 anxiety simply because they bite their nails less, this difference would be misleading because nail-biting is unrelated to anxiety in adults (although it could be related to stress, for example).
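To make the nail-biting example concrete, the toy calculation below shows how a single item that functions differently in one group can distort a composite mean comparison. All item values are invented for illustration; only the logic matters.

```python
# Toy illustration of how one noninvariant item biases a group comparison.
# Numbers are hypothetical; items are average scores on a 1-5 Likert scale.

def composite_mean(item_means):
    """Composite score: mean across a group's average item scores."""
    return sum(item_means) / len(item_means)

# Four invariant anxiety items plus a "nail-biting" item (last position).
children = [4.0, 4.0, 4.0, 4.0, 4.0]  # nail-biting reflects anxiety here
adults   = [4.0, 4.0, 4.0, 4.0, 1.5]  # adults rarely bite nails, anxious or not

with_item    = composite_mean(children) - composite_mean(adults)
without_item = composite_mean(children[:4]) - composite_mean(adults[:4])

print(f"Apparent group difference with the biased item: {with_item:.2f}")
print(f"Group difference after dropping it: {without_item:.2f}")
```

With the biased item retained, the groups appear to differ in anxiety; once it is dropped, the apparent difference vanishes, because the two groups were equally anxious all along.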
The importance of MI thus lies in its potential to equip researchers with the necessary tools to detect whether and to what extent scale items should be interpreted similarly or differently, depending on group membership.
Main Stages of Measurement Invariance
I have thus far highlighted the conceptual importance of testing for MI; to allow for meaningful cross- or within-group comparisons, researchers need to ensure that questionnaire items are invariant, or similarly construed, across the groups. In reality, however, this requires a great deal of skill and decision-making on the part of the researcher. The purpose of this section is to provide a brief overview of the main stages of MI testing via multigroup confirmatory factor analysis (CFA), which is arguably the most common method for establishing MI in a structural equation modeling framework (Wu et al., 2007). Nonetheless, it is not my intention to provide a tutorial (for a comprehensive review of MI, see Putnick & Bornstein, 2016, and Somaraju et al., 2022; for a tutorial on longitudinal measurement invariance, see Nagle, 2023). For guidance on how to test for MI using item response theory by investigating differential item functioning (DIF) in particular, see Andrich and Marais (2019) and Zumbo et al. (2015); for a gentle introduction to DIF, see Zumbo (2007).
To illustrate the MI procedure via the multigroup CFA, I will use a hypothetical L2 anxiety scale consisting of five positively keyed items that represent different facets of the construct (listening, reading, writing, speaking, and pronunciation anxiety; see Fig. 1) and are measured on a Likert scale. The composite mean score indicates the level of language-specific anxiety. Let's assume that the goal is to compare L2 English students’ anxiety in the United States and Japan (i.e., in a second versus foreign language learning context).
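Before any invariance testing, each respondent's observed score on this hypothetical scale would simply be the mean of the five facet items. The sketch below (pure Python; the item labels and the example responses are invented for illustration) makes that scoring rule explicit.

```python
# Hypothetical five-item L2 anxiety scale: all items positively keyed,
# answered on a 1-5 Likert scale (5 = strongly agree).
ITEMS = ["listening", "reading", "writing", "speaking", "pronunciation"]

def composite_score(responses: dict) -> float:
    """Composite mean across the five facet items for one respondent;
    a higher score indicates higher language-specific anxiety."""
    missing = [item for item in ITEMS if item not in responses]
    if missing:
        raise ValueError(f"missing items: {missing}")
    return sum(responses[item] for item in ITEMS) / len(ITEMS)

# One invented learner's responses:
learner = {"listening": 4, "reading": 2, "writing": 3,
           "speaking": 5, "pronunciation": 4}
print(composite_score(learner))  # 3.6
```

The question MI testing answers is whether this same scoring rule is defensible in both the US and the Japanese samples, that is, whether the five items relate to the latent anxiety factor in the same way in each group.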
In the first stage of MI testing, researchers examine configural equivalence, or invariance of the internal structure of the scale across groups. Configural invariance is tenable if the factor structure of the scale is identical in both samples (i.e., the same five items load on the same factor of L2 anxiety). If, however, in one of the groups the structure of the scale is different (e.g., L2 anxiety consists of two different factors instead of one, with three items loading on Factor 1 and two other items loading on Factor 2), this indicates configural noninvariance. To remedy the issue, researchers can either “redefine the construct (e.g., omit some items and retest the model)” (Putnick & Bornstein, 2016, p. 75) or accept the fact that the latent variable of interest is nonequivalent across groups and refrain from comparing mean group scores. Having established configural invariance, researchers proceed to explore metric equivalence, or invariance of factor loadings (also known as weights) across groups. This second stage involves examining the extent to which each item on the scale contributes to the latent variable. Statistically speaking, researchers compare the fit of the metric model with constrained factor loadings with the fit of the configural model with unconstrained factor loadings and item intercepts. Metric invariance is supported if item loadings on the factor are similar across groups, that is, if the fit of the metric model is not considerably worse. If, however, one or more items have a nonequivalent loading across the groups, metric invariance is not supported. For instance, the item representing anxiety in speaking may be more strongly associated with the overall L2 anxiety in the Japanese group (foreign language context) but not in the US group (second language context).
To address the issue of metric noninvariance, researchers need to determine which item(s) have nonequivalent loadings, exclude them, and rerun the tests of both configural and metric invariance; alternatively, researchers can admit measurement noninvariance and refrain from further group comparisons (Putnick & Bornstein, 2016). Provided there is evidence of metric invariance, in the third stage, researchers evaluate scalar equivalence, or invariance of item intercepts (also referred to as means). To that end, constraints are imposed on item intercepts in both groups, and the fit of the scalar model with constrained intercepts is compared with the fit of the metric model with constrained factor loadings and unconstrained intercepts. Scalar invariance is upheld if the fit of the model has not significantly deteriorated after imposing these additional parameter constraints. If, however, the scalar model fit is considerably worse, scalar invariance should not be assumed; it would indicate that one or more item intercepts differ across the groups. For example, compared to L2 English students in Japan, students in the United States may obtain a higher score on the item representing anxiety in speaking, but this amplified speaking anxiety in the U.S. group would not contribute to their overall L2 anxiety level (although it could contribute to their overall stress level, for example). If faced with scalar noninvariance, researchers can either accept it and avoid group comparisons altogether or identify the source of nonequivalence by locating the problematic item intercept(s), removing those items, and retesting all invariance models (Putnick & Bornstein, 2016).
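The nested-model comparisons in the metric and scalar stages are typically carried out as chi-square (likelihood-ratio) difference tests: the constrained model's chi-square and degrees of freedom are subtracted from the freer model's, and the difference is referred to a chi-square distribution. The sketch below (pure Python; all fit statistics are invented for illustration) computes the p value for such a test. The closed-form survival function shown holds only for even degrees of freedom, which suffices for this example.

```python
import math

def chi2_sf(x, df):
    """Survival function P(X > x) of the chi-square distribution.
    Closed form valid for positive even df only:
    P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k)
                                  for k in range(df // 2))

def chi2_difference_test(chisq_constrained, df_constrained,
                         chisq_free, df_free):
    """Compare a more constrained model (e.g., metric) against a freer
    nested model (e.g., configural). A small p value signals that the
    added equality constraints worsened fit, i.e., noninvariance."""
    delta_chisq = chisq_constrained - chisq_free
    delta_df = df_constrained - df_free
    return delta_chisq, delta_df, chi2_sf(delta_chisq, delta_df)

# Hypothetical fit statistics for the five-item, two-group example:
# constraining four factor loadings adds 4 degrees of freedom.
d_chi, d_df, p = chi2_difference_test(24.3, 14, 18.1, 10)
print(f"Δχ²({d_df}) = {d_chi:.1f}, p = {p:.3f}")
```

Here the nonsignificant p value would mean the constrained (metric) model fits about as well as the configural model, so metric invariance would be retained; in applied work this test is usually supplemented with change-in-fit-index criteria (e.g., ΔCFI), since the chi-square difference test is sensitive to sample size.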
If scalar invariance is achieved, researchers can impose constraints on item residuals (error terms) in both groups and test for residual equivalence, or strict invariance. However, many methodologists would argue that it is justifiable to examine mean differences across groups as long as scalar equivalence, or strong invariance, is supported because “residuals are not part of the latent factor, so invariance of the item residuals is inconsequential to interpretation of latent mean differences” (Putnick & Bornstein, 2016, p. 76; see Wu et al., 2007).
Implications of Measurement Invariance for L2 Anxiety Research and Beyond
Critically, evidence of MI allows researchers to safely make inferences about group differences in the latent variable of interest. To take our hypothetical L2 anxiety example, if MI is demonstrated, and the result of a t-test suggests that students have higher language anxiety in the United States than in Japan, researchers can rest assured that this finding reflects a true difference in the latent variable and is not a by-product of measurement bias. Moreover, passing MI tests allows researchers to make more advanced comparisons by examining relationships between two or more latent variables across groups (e.g., compare the relationship between L2 anxiety and achievement across contexts), that is, to investigate structural invariance. Finally, in addition to being an important step in questionnaire development and validation, MI testing can be used as a means to advance theory and “evaluate theoretical predictions” (Somaraju et al., 2022, p. 756). To illustrate, testing for MI and reporting standardized effect sizes for group differences allows for establishing construct-specific norms by comparing results across primary studies. If there is evidence of cultural dependence of L2 anxiety across target languages and language learning contexts, these findings can help language educators determine the best anxiety-reducing strategies in their language classrooms. Further, the importance of MI extends beyond simple group comparisons: it enables researchers to ensure the fairness of questionnaires by helping detect and revise problematic items in order to avoid measurement bias (Jung & Yoon, 2016). It should be noted, however, that a traditional approach to MI may not account for cultural differences, particularly if questionnaire items have been developed in a WEIRD (i.e., Western, educated, industrialized, rich, democratic) context (Boehnke, 2022).
Instead of developing scale items in English, which is typically used as the lingua franca, translating them into other languages via back-translation, and removing noninvariant culture-specific items, MI testing can be done in a more culturally sensitive manner. According to Boehnke (2022), in a culturally inclusive approach, scale developers agree on the construct of interest beforehand and develop items representing the construct in each language and context independently. Then, exploratory factor analysis is performed on each sample separately, and items with high loadings on the first factor are retained and included in MI testing, which is performed on the combined dataset. In other words, following this new approach, items measuring L2 anxiety in English learners in the United States and Japan do not have to be similarly worded “as long as functional equivalence is achieved through item intercorrelations” (p. 1164). This reduces “the bias that is brought in by relying exclusively on Western-origin items” (p. 1163).
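The per-sample screening step in this culturally inclusive approach calls for an exploratory factor analysis within each sample. As a rough stand-in for "high loading on the first factor," the sketch below (pure Python; the data, item names, and threshold are all invented) retains, within one sample, items whose corrected item-total correlation is high — a common first-pass proxy, not Boehnke's actual procedure.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def retain_items(sample, threshold=0.5):
    """Within one sample, keep items whose corrected item-total correlation
    (item vs. mean of the remaining items) meets the threshold."""
    kept = []
    for item, scores in sample.items():
        others = [s for name, s in sample.items() if name != item]
        rest_totals = [mean(respondent) for respondent in zip(*others)]
        if pearson(scores, rest_totals) >= threshold:
            kept.append(item)
    return kept

# Invented mini-sample (5 respondents, 1-5 Likert): three coherent anxiety
# items plus a culture-specific item that does not hang together with them.
us_sample = {
    "listening":   [5, 4, 3, 2, 1],
    "reading":     [4, 4, 3, 2, 1],
    "speaking":    [5, 5, 3, 1, 1],
    "nail_biting": [3, 1, 4, 2, 5],
}
print(retain_items(us_sample))  # the incoherent item is dropped
```

The same screening would be run independently on the Japanese sample with its own locally developed items; only then would the surviving items from both samples enter the MI tests on the combined dataset.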
Conclusion
By advocating for more MI testing in L2 anxiety research, I have sought in this paper to raise awareness of the potential of this technique to inform L2 research and practice and to present the main stages of MI via multigroup CFA within a structural equation modeling framework. Given that establishing MI is often challenging and time-consuming but nevertheless crucial for drawing meaningful conclusions from study findings, it is advisable to test for MI at the very least during the process of scale development and validation so that other researchers can safely use the questionnaire of interest with a similar population or in a similar context. I anticipate that the various applications of MI testing will continue to increase in L2 anxiety research as well as elsewhere in applied linguistics.