Introduction
As the field of heritage language (HL) acquisition grows, there is an increasing need for reliable, efficient, and practical measures of HL proficiency to meet research standards and facilitate comparability across studies and groups. In this paper, we adopt Valdés’ “narrow” (as differentiated by Polinsky & Kagan, 2007) definition of a heritage speaker (HS) as “a language student raised in a home where a non-English language is spoken, who speaks, or at least understands the language, and who is to some degree bilingual in that language and in English” (Valdés, 2001). Such a narrow definition is appropriate in the context of language proficiency testing, since many learners who fall under a “broad” definition (those “who have been raised with a strong cultural connection to a particular language through family interaction,” Van Deusen-Scholl, 2003, p. 222) may not have any discernible proficiency in the HL.
Although HSs are a heterogeneous group by nature, some generalizations are possible. HSs have mainly aural exposure to their HL, which occurs mostly at home. Therefore, HSs tend to be more proficient at oral, spontaneous tasks that do not require metalinguistic knowledge, due to their naturalistic acquisition (Bowles, 2011; Montrul & Foote, 2014; Montrul, Foote & Perpiñán, 2008). Owing to differences in their acquisition context, HSs and second language (L2) learners have differing learning needs. L2 learners’ main exposure is in a classroom setting, so they are accustomed to and tend to perform better on written tasks and those that require explicit and/or metalinguistic knowledge, which are common in academic environments.
Literacy varies widely across HSs, depending on the exposure to written language they have had. For this reason, assessments that rely on literacy to evaluate HSs’ and L2 learners’ proficiency are not ideal (Carreira & Potowski, 2011; Sanz & Torres, 2018), and it is important to consider the two populations’ characteristics to design assessments that are valid and reliable for both. Given that the field of L2 acquisition (SLA) predates that of HL acquisition (HLA), resources to evaluate L2 learners’ proficiencies are more abundant and developed than those aimed at HSs, and many studies have assessed HSs with tests that have only been validated with L2 learners (e.g., the DELE in Colantoni, Cuza & Mazzaro, 2016; Cuza & Frank, 2011).
Due to the increased enrollment of HSs in language courses, there have been growing discussions of HL assessment for the purposes of course placement (e.g., Bowles, 2022; Fairclough, 2012; Parisi & Teschner, 1983; Potowski, Parada & Morgan-Short, 2012), moving away from highly subjective measures like one-on-one interviews, which were commonplace in the past. Simultaneously, increased research on HLA has led to the search for proficiency tests that are suited for HSs and can assign them valid, reliable scores for use as independent variables in research (Solon et al., 2022). In so doing, researchers such as Bowles (2018), Carreira and Potowski (2011), and Solon et al. (2022) have questioned the validity of methods used to assess HL proficiency in past studies.
Elicited imitation tasks (EITs) have been found to be valid measures of proficiency for L2 learners (e.g., Kostromitina & Plonsky, 2022; Yan, Maeda, Lv & Ginther, 2016), and there is a small but growing corpus of studies that have started using them with HSs as measures of proficiency (Isbell & Son, 2021; Lopez-Beltran Forcada, 2021; Solon et al., 2022; Wu & Ortega, 2013). EITs have been used with diverse groups, including first language (L1) adults, L1 children, and L2 learners of varying literacy levels, as they are oral and not necessarily tied to any curriculum. Nevertheless, there has been limited research on the use of EITs as proficiency assessments for HSs (e.g., Solon et al., 2022; Son, 2018). The present study addresses the need for additional validity evidence to support the use of EITs with both groups of learners. Including samples of both groups is important given that there are still many studies and contexts where the two groups are assessed together, and the efficacy of the EIT in these contexts should be verified. For the sake of brevity, we will focus the literature review on advances in the use of EITs as HL proficiency assessments. Readers are referred to the meta-analyses by Kostromitina and Plonsky (2022) and Yan et al. (2016) for information about the long history and efficiency of EITs as measures of L2 proficiency.
Providing reliability and validity evidence for assessments whenever they are applied in a new context or with a new group or purpose is crucial. Reliability is the internal consistency of a measure, and validity refers to “the degree to which evidence and theory support the interpretation of test scores for proposed uses of tests” (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 2014). Similarly, the validation process refers to the accumulation of evidence to support the interpretation of those scores.
To obtain that evidence, an internal analysis of the measure’s functioning is important, as is obtaining criterion-related concurrent validity evidence, which involves examining the level of agreement between the scores on the new test and those provided by independent, established measures (Hughes, 2003). Solon et al. (2022) provided concurrent validity evidence by comparing the scores of the EIT with scores on a written measure of proficiency. Given that success on a written proficiency measure is directly linked to literacy development, we chose oral proficiency measures (i.e., an oral narration and the Versant test), which we consider would better reflect the proficiency of both groups of learners, to provide concurrent validity evidence for the EIT. Additionally, this study reports a novel norming technique for EIT stimuli using Amazon Mechanical Turk (MTurk), a marketplace that allows access to a cost-efficient participant pool that is increasingly used in behavioral research (Ortega-Santos, 2019). In sum, this study inspects the reliability and discrimination of a Spanish EIT modified for advanced learners (Solon, Park, Henderson & Dehghan-Chaleshtori, 2019) and gathers concurrent validity evidence for an EIT as used with both HSs and L2 learners.
Literature review
Heritage language assessment
One main concern that has been raised regarding HL assessment is that most studies in HLA either used tasks that were originally developed for L2 learners or used placement tests in research settings, raising reliability and validity concerns (Ilieva & Clark-Gareca, 2016; Son, 2020). Beaudrie (2016) highlights four characteristics that should be kept in mind for HL assessment design: (a) using performance-based measures of real-world tasks with authentic purposes; (b) accounting for the variability of HL linguistic varieties by attempting to use structures that are “dialect-neutral” (Potowski et al., 2012) and formulating and reporting decisions about which linguistic variations can be accepted; (c) accounting for the lack of an established HL proficiency framework and limiting the assumptions regarding the expected development of the learners; and (d) using multiple measures to capture HSs’ diverse skills, and avoiding assessments that only test one set of skills, as they may highlight learners’ inability to produce specific forms (to which they may have not been exposed).
Adequate proficiency tests for HSs are a prerequisite to appropriately interpreting the results of studies that use language proficiency as an independent variable, and, considering that many studies compare HSs and L2 learners, it is important that the assessments used have been validated with both groups. Previous explorations of assessments for HLA research settings have included a variety of task types, including self-assessments (Keating et al., 2011); C-tests (Drackert & Timukova, 2019); complexity, accuracy, and fluency (CAF) features of written production (Camus & Adrada-Rafael, 2015); and standardized tests of proficiency, like ACTFL’s Oral Proficiency Interview (OPI) (Ilieva, 2012). There has been particular interest in EITs as a promising measure that could be reliable, easy to administer, and able to adequately assess L2 learners and HSs regardless of their literacy level or language variety (Solon et al., 2022; Wu & Ortega, 2013).
Elicited imitation tasks as measures of proficiency for L2 learners and HSs
In an EIT, participants hear multiple stimulus sentences of increasing length and complexity one at a time and are asked to repeat them aloud as accurately as possible. EITs have been shown to provide a measure of implicit linguistic knowledge when there is a short delay between the presentation of the sentence and its repetition and a time limit for the repetition. The rationale is that the stimulus sentences exceed participants’ working memory span and must be quickly and accurately comprehended to be repeated.
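To make the timing logic concrete, the sketch below outlines a generic EIT trial in Python. It is a minimal illustration, not the procedure of this or any particular study: the delay and response-window durations are hypothetical placeholders, and `play` and `record` stand in for whatever audio routines an experiment platform provides.

```python
import time
from dataclasses import dataclass

@dataclass
class EITTrial:
    """One elicited-imitation trial: stimulus, short delay, timed repetition."""
    audio_path: str
    delay_s: float = 2.0            # hypothetical post-stimulus delay
    response_window_s: float = 8.0  # hypothetical repetition time limit

def run_trial(trial: EITTrial, play, record):
    play(trial.audio_path)           # participant hears the sentence once
    time.sleep(trial.delay_s)        # the delay discourages rote echoing
    record(trial.response_window_s)  # the fixed window forces rapid reconstruction

# Example with stand-in audio routines:
run_trial(EITTrial("item01.wav"),
          play=lambda p: print(f"playing {p}"),
          record=lambda s: print(f"recording for {s} s"))
```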
EITs are versatile and have been used to assess the implicit knowledge of specific aspects of grammar, lexicon, and phonology (e.g., Deygers, 2020; Torres, Estremera & Mohamed, 2019) and to test the effectiveness of instructional interventions (Fernandez-Cuenca & Bowles, 2022) and the listening comprehension of L1 and L2 speakers (Akbary, Benzaia, Jarvis & Park, 2023); they have also been highlighted in Yan et al.’s (2016) and Kostromitina and Plonsky’s (2022) meta-analyses for their efficiency as measures of global language proficiency (e.g., Gaillard, 2014; Lopez-Beltran Forcada, 2021; Wu & Ortega, 2013). EITs have been used in many different applications in L2 assessment, including as part of standardized commercial proficiency tests, such as the Versant proficiency test, the Duolingo English Test, and the Test of English as a Foreign Language Essentials, likely due to their ease of administration, efficiency, affordability, and practicality.
One of the most popular EITs in SLA research is that of Ortega, Iwashita, Norris & Rabie (2002), and parallel versions have been developed and validated as measures of oral proficiency in Spanish, French, English, Chinese, Korean, German, and Japanese (e.g., Chaudron, Nguyen & Prior, 2005; Gaillard & Tremblay, 2016; Wu & Ortega, 2013). It includes 30 items that range from 7 to 18 syllables and are scored on a 5-point scale.
EITs have also proven versatile in their ability to assess L1 adults (e.g., Chaudron et al., 2005; Ellis, 2005) and even young L1 children (Keller-Cohen, 1981) with limited or developing literacy skills. This feature in particular makes EITs attractive as an assessment measure for HSs who may lack literacy in the HL.
Indeed, EITs have recently begun to be used to measure different constructs in HSs’ knowledge and to compare HSs’ and L2 learners’ proficiency. A growing corpus of studies has used EITs to measure knowledge of specific language structures (e.g., Bowles, 2011; Heo, 2016), and a few have utilized one of the versions of Ortega et al.’s (2002) EIT to measure HSs’ and L2 learners’ proficiency as an independent variable in their research (Lopez-Beltran Forcada, 2021; Solon et al., 2022; Son, 2018; Wu & Ortega, 2013; Zárate-Sández, 2015; Zhou, 2012).
In her study with L2 learners, Bowden (2016) found that Ortega et al.’s EIT was not well suited to test the full range of proficiency, particularly at advanced levels. Previous studies that tested both L2 learners and HSs with an EIT (e.g., Solon et al., 2022; Son, 2018; Wu & Ortega, 2013) showed that HSs performed significantly better than L2 learners within and across curricular levels. Wu and Ortega attributed these findings to HSs having an advantage over L2 learners along the full oral language development continuum. To make the task suitable for advanced-level learners, Solon et al. (2019) increased the difficulty of Ortega et al.’s (2002) EIT by adding six more items to the original 30-item task and expanding the longest item from 18 to 27 syllables. These more difficult items were expected to increase the discrimination of the task for advanced L2 learners and potentially for HSs as well.
Very recently, a few studies have used the EIT to test HSs’ Spanish proficiency. Lopez-Beltran Forcada (2021) used Solon et al.’s (2019) EIT as a measure of proficiency with L2 learners and HSs, and Solon et al. (2022) tested 63 HSs of Spanish with the 30-item EIT and compared their results with the L2 sample that was included in their 2019 study. Solon et al.’s (2022) descriptive results showed that the 30-item EIT was effective in eliciting responses at a wide range of proficiency levels, though most HSs scored at the high end of the scale. Moreover, a Rasch analysis confirmed that the test was too easy for approximately half of the participants. Nevertheless, the correlations of item difficulty between HSs and L2 learners were strong, and all items (except for item 1) fell within the 95% CI, showing that the task performed well at differentiating participants’ levels of proficiency.
Solon et al. (2022) also ran a Rasch analysis on the complete 36-item sample to inspect whether the extended task could provide a more fine-grained analysis of the HSs’ competence. This analysis showed that the item difficulty of the extended EIT better matched the range of person ability than the 30-item version and that the task was able to discriminate approximately seven different levels of person performance and six of item difficulty. Moreover, the reliability of the extended task was very high (.98 for both item and person reliability). These results are informative about the efficiency of this task with HSs.
Since the present study used the 36-item version of the EIT with a different sample of HSs with differing profiles from Solon et al.’s (2022) sample, it would not be surprising to find new patterns of difficulty for the HSs and L2 learners. Therefore, in this study, we partially replicated Solon et al.’s (2022) methodology and sought to obtain additional validity evidence for the 36-item EIT. We aimed to gather concurrent validity evidence through external measures that met recommendations for HL proficiency test design. These are discussed in the following section.
Validating measures
CAF features
Language proficiency has been considered a construct whose foundational aspects are reflected in the features of CAF (Housen & Kuiken, 2009). Pallotti (2020) defines CAF as follows: complexity is “the number of elements and their interrelationship in a text or linguistic system,” accuracy is “the conformity of linguistic performance to target-language norms,” and fluency is “the extent to which linguistic production is (and/or perceived as) fast and smooth,” and it often involves measuring speed, breakdown (i.e., number and length of pauses), and repair (i.e., number of self-corrections, false starts, etc.) (pp. 202–203). It is also common to find references to lexical complexity as an independent construct (e.g., Wu & Ortega, 2013), and, from a structural perspective, it relates to the variety of lexemes within a text, generally measured with a type/token ratio.
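As a minimal illustration of the structural measure just mentioned, the short Python sketch below computes a raw type/token ratio; note that raw TTR shrinks as texts grow longer, which is why length-corrected indices (such as the D measure used later in this study) are often preferred.

```python
import re

def type_token_ratio(text: str) -> float:
    """Lexical diversity as unique word forms (types) over total words (tokens)."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Repetition lowers the ratio: 5 types over 7 tokens.
print(type_token_ratio("el perro corre y el gato corre"))  # ≈ 0.71
```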
The motivation for using CAF measures to gather concurrent validity evidence is that an oral narration task from which CAF measures are drawn aligns with the principles of communicative language teaching and with the naturalistic context of exposure common to HSs: it recreates the communicative function of narration, which can be found in familiar authentic communicative contexts, and it centers the speaker’s attention on meaning rather than on form. Both tasks tap the development of the internal language system, but the abilities measured in the oral narration task may differ from those measured in the EIT, owing to the different cognitive demands of each task.
Additionally, the design of an oral narration task aligns with Beaudrie’s (2016) assertion that it is important to use assessments that represent the types of communicative activities that HSs are familiar with (oral, contextualized, spontaneous) to be able to access their implicit linguistic knowledge (i.e., their proficiency). Admittedly, the EIT does not meet all those characteristics, as the sentences are presented in a decontextualized manner and it does not elicit spontaneous production. Therefore, positive correlations between scores on the EIT and CAF measures can provide evidence of the extent to which the task measures the construct of proficiency that is relevant to these groups of learners, as well as criterion-related validity, given the recognition of CAF measures as an effective assessment of proficiency.
While the use of CAF features is frequent in SLA, many different measures can be chosen, making it difficult to compare across studies (Norris & Ortega, 2009). Wu and Ortega (2013) used components of oral CAF as indicators of global oral language proficiency, to which they compared their EIT data to find concurrent validity evidence. Wu and Ortega extracted CAF measures from an oral narrative task that consisted of describing 12 sequential pictures that presented a story and included 12 motion event segments. As a measure of fluency, they chose the total number of clauses. Second, motion clauses were quantified as an indicator of communicative effectiveness (i.e., accuracy). Third, the number of motion verb types was taken to be an indicator of lexical diversity (i.e., complexity and vocabulary capacity). Results showed that participants who scored higher on the EIT also showed better command on all CAF measures, thereby providing evidence that participants’ performance on both tasks relied on the same underlying oral language abilities. The present study developed a similar narration task to obtain CAF measures and compared them with scores on the EIT. The specific measures chosen and the rationale behind their selection are explained in the Methodology section.
Versant Spanish Test
For this study, it was important to have a standardized oral proficiency measure that was highly reliable and also cost-effective and efficient to administer to provide additional concurrent validity evidence to the EIT without making the process overly fatiguing for participants. The Versant Spanish Test, an oral production and aural comprehension test that takes 13 to 17 min to complete and relies on automated scoring, was chosen to meet these criteria (Pearson Education, 2011).
Audio prompts represent native speakers from different countries and Spanish varieties, making the test appropriate for the linguistically diverse sample of participants in this study. The items included in the test are designed to assess examinees’ comprehension of and intelligibility in spoken, everyday Spanish. The computer-delivered test consists of 60 items in seven different sections (i.e., Reading, Repeats, Opposites, Short Answer Questions, Sentence Builds, Story Retelling, and Open Questions). Items are designed to assess test-takers’ ability to understand spoken Spanish on everyday topics and to respond intelligibly at a nativelike conversational pace (Bernstein, Van Moere & Cheng, 2010, p. 358). The Versant test provides an overall score and four subscores: Sentence Mastery, Vocabulary, Fluency, and Pronunciation. Sentence Mastery and Vocabulary measure the response’s linguistic content, and Fluency and Pronunciation relate to the articulation and rhythm of responses (Bernstein et al., 2010).
Scores on the Versant Spanish Test have been shown to be highly correlated with scores on other oral proficiency measures, including ACTFL’s OPI (r = .86, p < .001), widely considered a gold standard in oral proficiency assessment, and the Spoken Proficiency Test (r = .92, p < .001) (Pearson Education, 2011). It has also been used in a few prior SLA and HLA studies (Blake, Wilson, Cetto & Pardo-Ballester, 2008; Escalante, 2018; Fairclough, 2012; Moneypenny & Aldrich, 2016; Pozzi & Reznicek-Parrado, 2021; Quan, 2018). While Pozzi and Reznicek-Parrado point out that the Versant test was not specifically designed to measure HSs’ proficiency, Versant scores in Blake et al. (2008) distinguished HSs as a different proficiency group from L2 learners and were able to measure the progress of both groups over time. Taken together, this evidence suggests that the Versant test is an appropriate standardized test for gathering validation evidence in this study.
Research questions
The present study aimed to explore whether the EIT was an appropriate and valid measure of proficiency for HSs and L2 learners of Spanish. Obtaining validity evidence is a key step whenever an assessment is used with a new group and/or in a new context. A common framework for inspecting the validity of a test examines evidence for content-related validity (i.e., the extent to which the content of a test “constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned”), construct-related validity (i.e., whether the constructs the test is meant to assess actually exist, can be measured, and are indeed being measured by the test), and concurrent validity (i.e., the extent to which scores on the test correspond with an independent, highly dependable measure of the same construct).
The overarching question is whether an EIT, given its successful use with both L1 and L2 populations and its oral nature, is an appropriate measure of proficiency for HSs and L2 learners. Ortega et al.’s (2002) 30-item EIT appears to be too simple for many HSs, as shown by Solon et al. (2022), and the 36-item version seemed to better match the proficiency of the HSs and L2 learners in their sample. Therefore, gathering additional validity evidence and information about its discrimination and reliability can strengthen the argument for the adequacy of the 36-item EIT as a measure of HS and L2 proficiency.
For the sake of space, this article will center on the discussion of the EIT’s concurrent validity evidence (i.e., whether scores on the test correspond to scores on other dependable measures) and its discrimination and reliability (i.e., whether the proficiency test actually differentiates across proficiency levels and does so reliably). Accordingly, the present study poses the following research questions:
1. To what extent are the EIT items and their scoring reliable with HSs and L2 learners?
2. To what extent can the EIT discriminate across the proficiencies of HSs and L2 learners?
3. To what extent can HSs and L2 learners’ performances on an oral narration and the Versant proficiency test provide concurrent validity evidence for their EIT scores?
To answer these questions, the functioning and reliability of the items with a population of L2 learners and HSs at a range of proficiency levels will first be explored. Then, concurrent validity evidence will be gathered by comparing HS and L2 scores on the EIT to those on the Versant Spanish Test and on CAF measures derived from an oral narration.
Methodology
Participants
L2 learners and HSs of Spanish were recruited to participate in the study from two different public universities in the US, one in the Midwest (N = 198) and one in the Southwest (N = 5). Data for this study were collected in spring 2021, spring 2022, and fall 2022. Participants were sought from both universities to enable a sample of HSs at the full range of proficiencies. In contrast to the Midwestern university, the Southwestern one enrolls many HSs at the lower end of the proficiency spectrum. Recruiting a larger sample of participants from the Southwest would have been ideal, but this was not possible due to the challenges of remote recruitment. Learners were recruited through Spanish undergraduate classes and through personal connections. Students in Spanish courses at the Midwestern university had the option to receive extra credit or to participate in a drawing for a gift card as compensation, whereas those at the Southwestern university received extra credit and a gift card. All others not enrolled in courses received a gift card as compensation.
A summary of participants’ characteristics based on Birdsong, Gertken and Amengual’s (2012) Bilingual Language Profile can be found in Table 1.
Table 1. Descriptive statistics of complete sample of participants
Participants were classified into HS and L2 groups on the basis of their responses to the question, “How many years have you spent in a family where Spanish is spoken?” where a response similar to the participants’ age was taken as an indication of their HS profile. Participants’ self-reported age of acquisition of Spanish was also used to confirm that they had been correctly identified as either HS or L2.
Participants were also asked to respond to Likert-scale questions regarding their communicative competence in English and Spanish (though only the Spanish results are reported here), which generated a self-reported proficiency (SRP) score from 0 to 24 (Birdsong et al., 2012). The SRP score in Spanish was used as an independent continuous variable to facilitate comparison across HS and L2 groups.
Participants were recruited from all course levels at the Midwestern university and the distribution of the sample is representative of the student population enrolled in Spanish, with a greater proportion coming from 100-level courses than from higher levels (see Table 2). The Midwestern university has mostly mixed classrooms, enrolling both HSs and L2 learners together, apart from one 200-level composition course tailored for HSs, from which 8 participants were recruited. Students from the Southwestern university were all recruited from a Spanish as an HL program. The goal of recruiting learners from a wide range of courses was to find HSs and L2 learners across the proficiency spectrum.
Table 2. Descriptive statistics and distribution of participants’ course enrollment and SRP
Note: All participants recruited outside a language program were part of the Midwestern university campus community.
A note about proficiency as it relates to course enrollment is in order: 100-level courses encompass the first four semesters of language study and therefore include a fairly broad range of L2 proficiencies, from true beginners who start at novice low to those completing the fourth-semester language requirement who are typically at the intermediate mid level. Most HSs at the Midwestern university are second generation and therefore tend to have intermediate-level or higher listening and speaking skills in Spanish. HSs who enroll in 100-level courses do so for a variety of reasons, including fulfilling a language requirement or gaining literacy skills in Spanish. This contrasts with the profile of the HSs at the Southwestern university, where students are third or fourth generation and tend to have lower proficiency in Spanish. SRP scores among HSs in the Southwest were 10–15 compared to 7–20 for those from the Midwestern university.
Instruments
The present study reports results on the linguistic background questionnaire, the elicited imitation and oral narration tasks, and the Versant Spanish Test. These measures were chosen to provide evidence of the EIT’s concurrent validity.
Elicited imitation task
The EIT was Solon et al.’s (2019) EIT, which was adapted from Bowden (2016), which was itself adapted from Ortega et al. (2002). Solon et al.’s modifications consisted of some vocabulary adaptations to limit dialect-specific terminology and the addition of six sentences to the original task, which served to increase the length of the longest sentences from 18 to 27 syllables.
To ensure the appropriacy of the language in the EIT for HSs of different varieties of Spanish, a norming procedure was conducted with 49 Spanish native speakers from nine different countries (Argentina, Chile, Colombia, Ecuador, Mexico, Peru, Spain, USA, and Venezuela), recruited via Amazon MTurk. This is a novel feature of our study, as crowdsourcing has only recently begun to be used in research on Spanish (Ortega-Santos, 2019) and has not, to our knowledge, been used in any HS studies.
Through Qualtrics, MTurk “workers” were asked a few demographic questions (i.e., Were they native speakers of Spanish? What Spanish variety did they speak? Where were they born and where did they live currently?). They were then prompted to mark any words on the EIT items that they did not recognize or that were confusing to them. Although many of the HS participants come from Mexican and Puerto Rican backgrounds, we chose to recruit informants for the norming more broadly to ensure that the language in the EIT stimuli was accessible cross-dialectally.
Based on the norming informants’ responses, item 14 (A ustedes les fascinan las fiestas grandiosas, “Grand parties fascinate you”) and item 25 (Después de llegar a casa del trabajo tomé la cena, “After arriving home from work, I had dinner”) were modified. The words grandiosas and tomé were marked by one and four native speakers, respectively. They were subsequently changed to ruidosas (“loud”) in item 14 and hice (“made”) in item 25. Although the two modified items were not re-normed, their frequencies were checked in the Corpus del Español (Davies, 2016). Fiestas ruidosas appeared 18 times in the corpus, while fiestas grandiosas appeared just once; hacer la cena appeared 151 times, and tomar la cena appeared 31 times. The increased frequency of these new strings was expected to make the stimuli more accessible across dialects and the proficiency spectrum.
The EIT items were recorded in a sound-attenuating booth by a female native speaker of Mexican Spanish who is also a graduate student and instructor of Spanish. She listened to Solon et al.’s (2019) recordings before the session and at different points during her recordings to emulate their pace. The recordings were edited with Audacity to remove any remaining background noise and to add timed pauses and tones to prompt participants’ repetitions. The timing of the pauses and tones was adjusted following Solon et al.’s (2019) method, available in the IRIS database.
The administration of the newly adapted EIT took place through the Gorilla Experiment Builder platform (henceforth, Gorilla) (Anwyl-Irvine, Massonié, Flitton, Kirkham & Evershed, 2020) and took approximately 15 min. Gorilla was chosen because it enables remote recording and submission of audio responses, and it provided the benefits of web-based EITs that Kim, Liu, Isbell & Chen (2024) highlighted, including access to larger and more diverse participant samples; reduction of research, equipment, and employment costs; and standardized and optimized testing procedures. Kim et al. also point out that web-based research is not a panacea and that there is a higher risk of distractions, noisy data, and higher dropout rates. Yet Kim et al. found that web-based EITs were comparable to lab-based EITs, providing support for the use of platforms like Gorilla.
Participants’ recorded repetitions were captured by Gorilla and later scored using Ortega et al.’s (2002) rubric, which has been used in most of the parallel versions or adaptations. The modified EIT and the rubric used in this study are available as supplementary materials.
Oral narration task
The oral narration task was designed to elicit CAF measures. The task required participants to narrate a series of 12 vignettes that presented the story of two friends who wanted to have lunch together but kept coming up against constant obstacles. The task was administered through Gorilla, and participants had 2 min at first to look attentively at the entirety of the comic strip. Then they saw each vignette in a slideshow, which they controlled and advanced at their own pace. Each slide generated independent recordings of the participant’s response, which then were unified into one file and transcribed and coded with the CLAN software.
In this study, CAF features that were compatible with CLAN were chosen. First, complexity was operationalized using two different measures: (a) proportion of subordination (i.e., the number of utterances containing subordination over the total number of utterances), obtained with a manual count, and (b) mean length of utterance (MLU) quantified in morphemes, obtained with CLAN’s eval test. The use of subordinate clauses is common as a measure of complexity and provides a fine-grained analysis of advanced learners’ speech, though it is not appropriate for novice learners who may not yet be capable of subordination (Norris & Ortega, 2009). Hence, MLU was also selected because it can be used at all levels of proficiency. Second, accuracy was measured as the percentage of error-free production at the word level, which was also obtained with the eval test. During the transcription of the recordings, any word-level inaccuracies were marked as such (see example 1). This annotation allowed the software to generate a percentage of error-free words.
Third, a measure of fluency was obtained by counting syllables per minute with the FluCalc test. Finally, as a measure of lexical diversity, a modified type-token ratio, the D measure (deBoer, 2014), was calculated with CLAN’s vocd test, which, as explained by deBoer (2014), “attempts to measure the diversity of vocabulary in writing by taking random samples of words and comparing the observed diversity to ideal curves […] vocd is fundamentally a graphical method to address lexical diversity” (p. 140).
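As a rough illustration of how such counts become scores, the sketch below approximates three of these measures in plain Python. It is a simplification of what CLAN computes (eval counts morphemes, not words, and FluCalc extracts timing from the recordings), and the example utterances and durations are hypothetical.

```python
def proportion_subordination(has_subordination: list[bool]) -> float:
    """Share of utterances hand-coded as containing a subordinate clause."""
    return sum(has_subordination) / len(has_subordination)

def mlu_words(utterances: list[str]) -> float:
    """Mean length of utterance in words (a stand-in for CLAN's morpheme count)."""
    return sum(len(u.split()) for u in utterances) / len(utterances)

def syllables_per_minute(n_syllables: int, speech_seconds: float) -> float:
    """Speed fluency, analogous to the FluCalc-derived measure."""
    return n_syllables / (speech_seconds / 60)

utts = ["la chica dijo que no podía venir", "fueron al parque"]
print(mlu_words(utts))                          # (7 + 3) / 2 = 5.0
print(proportion_subordination([True, False]))  # 0.5
print(syllables_per_minute(180, 90.0))          # 120.0 syllables per minute
```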
Procedures
All data collection took place remotely. During recruitment, the researchers distributed a link that provided direct access to the experiment. Participants were able to access the study at the time and location that was most convenient for them, without any external monitoring. These conditions facilitated data collection while respecting COVID-19 safety protocols. Most tasks described in this study were administered through the Gorilla platform, except for the Versant Spanish Test, which is available only through the proprietary Pearson platform.
The first task all participants completed was the Bilingual Language Profile, after which they were randomly assigned one of three task orders: ABC, BCA, or CAB (A = EIT, B = oral narration, and C = DELE test). Counterbalancing was necessary to verify that there were no modality or task type influences due to order effects.
Finally, all participants completed the Versant Spanish Test last. The reason it was done last is that there is a fee for each access code, so we wanted to ensure that participants had completed the rest of the study before providing a paid access code. Participants accessed the test through Versant’s web-based platform and took between 13 and 17 min to complete it.
Data analysis
Due to the unmonitored nature of this experiment, some participants did not complete the tasks adequately and some data had to be discarded. An initial pool of 242 participants completed at least some tasks in Gorilla, but the data of 39 were discarded for not following instructions, leaving the results of 203 participants in the final sample. Not all these participants completed all tasks, so the data from 189 participants were used to analyze the concurrent validity evidence of the EIT by comparing it with their performance on the oral narration, and the data from 100 participants were used to compare EIT scores with those on the Versant proficiency test.
All tasks except the EIT and the oral narration were scored automatically. Responses to the EIT and oral narration were manually transcribed and then coded by the first author (who is a native speaker of Castilian Spanish) and a team of three undergraduate research assistants (who were HSs of Mexican Spanish enrolled in advanced content-based courses in Spanish). This team also rated the same 10% of the EIT responses to calculate interrater reliability of item scoring. The exact agreement on item scoring across raters was 47.59%, with a Light’s kappa for four raters of .62, which is considered substantial agreement beyond chance (Fleiss, Levin & Paik, 2003). Moreover, rater socialization and discussion of differences resulted in a 100% agreement rate on the manual transcription of the item recordings. The EIT was transcribed and coded in Microsoft Excel, and the oral narration was transcribed, coded, and analyzed using the CLAN software. Once all data were processed, they were analyzed using RStudio and Winsteps.
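Light’s kappa, as used here for the four raters, is simply the mean of Cohen’s kappa over all rater pairs. A minimal sketch with hypothetical scores on the 0–4 EIT rubric (not the study’s data):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def lights_kappa(ratings: list[list[int]]) -> float:
    """Mean pairwise Cohen's kappa across raters (Light's kappa)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Four hypothetical raters scoring six responses on the 0-4 EIT rubric.
raters = [
    [4, 3, 2, 4, 1, 0],
    [4, 3, 2, 3, 1, 0],
    [4, 2, 2, 4, 1, 0],
    [4, 3, 1, 4, 1, 0],
]
print(round(lights_kappa(raters), 2))
```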
Descriptive statistical analyses were performed on the EIT and the oral narration task as well as on SRP. Next, to obtain information about the internal reliability of the EIT and to observe item functioning and the distribution of learners by item difficulty, Rasch analyses were run with Winsteps. Rasch analyses are more informative than true-score-theory reliability or correlation calculations; therefore, a methodology similar to Solon et al.’s (2022) is followed in the present study.
To determine the concurrent validity of the EIT, the results of the task were compared to results in the oral narration and in the Versant test through comparisons of their descriptive statistics and through correlation analyses. These data were not normally distributed, so Spearman correlations were chosen to analyze the data in RStudio.
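A minimal sketch of this analysis step with simulated scores (not the study’s data), using SciPy’s Shapiro-Wilk and Spearman implementations:

```python
import numpy as np
from scipy.stats import shapiro, spearmanr

rng = np.random.default_rng(0)
eit = rng.integers(20, 145, size=60)              # hypothetical EIT totals (max 144)
fluency = eit * 0.8 + rng.normal(0, 15, size=60)  # hypothetical syllables per minute

# A significant Shapiro-Wilk result (p < .05) argues against normality,
# motivating a rank-based correlation instead of Pearson's r.
print(shapiro(eit))
rho, p = spearmanr(eit, fluency)
print(f"Spearman rho = {rho:.2f}, p = {p:.2g}")
```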
Results
Scores on the EIT: Descriptive statistics and reliability
The distribution of L2 scores presented a right/positive skew due to few high scores in the sample, while the contrary happened with HS scores, which showed a pronounced left/negative skew, due to a large number of scores at the high end of the score range. The results of Shapiro-Wilk normality tests confirmed that scores were not normally distributed, either for the HSs (p < .0001) or the L2 learners (p = .0001).
The distribution of EIT scores can be observed in Figures 1 and 2, which present the mean scores (and standard deviations [SD]) that each participant group obtained for the different items on the EIT. These figures show that each task item prompted different levels of performance and reveal a general trend (with some exceptions) whereby the later an item appears in the task, the lower its average score. These figures also show that, overall, HSs were considerably more accurate than L2 learners on the EIT.
Figure 1. Distribution of EIT mean scores and SDs across items for L2 learners.
Figure 2. Distribution of EIT mean scores and SDs across items for HSs.
Figure 3 displays the distribution of the total EIT scores for both groups plotted against their SRP scores. The maximum possible EIT score for the 36-item version of the task was 144. At first glance, one can observe that participants’ average EIT scores increased as SRP increased, suggesting that the EIT elicits different levels of performance depending on the examinee’s proficiency. Nevertheless, this trend was more visible with HSs than with L2 learners. It is also clear that learners with similar SRP scores often showed a wide range of EIT scores. This trend is apparent across both HS and L2 groups but is particularly visible at the highest SRP scores, where some L2 learners still scored as low as 29/144 on the EIT, while the lowest-scoring HSs attained 56/144. Therefore, it seems that L2 learners overrated their proficiency in their self-assessments more than HSs did (a trend also seen in Bowles, Adams & Toth, 2014), which signals that interpretation of SRP results should be done with care.
Figure 3. Distribution of EIT and SRP scores.
The EIT data met the assumptions of local independence and unidimensionality. Therefore, it was possible to apply Rasch analysis and obtain fine-grained information about the reliability of the EIT items. Person separation was 5.51, above the minimum desired score (i.e., 2), and person reliability was .97, also high and appropriate (i.e., > .80). Item separation was 10.40, which was high (i.e., > 3), and item reliability was .99, which is very high (i.e., > .90) (Linacre, n.d.). All these indices provide evidence of reliability. Additionally, item separation and person separation provide information regarding the ability of the task to separate across items and persons of different levels. Therefore, this EIT was able to discriminate across approximately 6 different levels of performance in persons and 10 levels of difficulty for items.
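For readers unfamiliar with these Rasch indices, separation (G) and reliability (R) are two expressions of the same information, related by R = G² / (1 + G²). The sketch below confirms that the separations reported above reproduce the reported reliabilities:

```python
def rasch_reliability(separation: float) -> float:
    """Rasch reliability from separation: R = G**2 / (1 + G**2)."""
    return separation**2 / (1 + separation**2)

print(round(rasch_reliability(5.51), 2))   # 0.97 -> person reliability reported above
print(round(rasch_reliability(10.40), 2))  # 0.99 -> item reliability reported above
```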
Before using Rasch to analyze the capacity of the EIT to measure participants’ competence across different levels, model fit was checked by observing the outfit and infit mean squares (MNSQs) and standardized Z values (ZSTD) of the different items. Table 3 presents the model fit indices for the different items in the EIT.
Table 3. Mean scores and fit statistics for the EIT items
Note: Bolded numbers represent misfitting values. Mean scores indicate the average score on the different items by each group of speakers. A mean score that is too close to 4 may indicate that the item is too easy, and a score too close to 0 may indicate that the item is too difficult.
As Table 3 shows, there are four items that are clearly misfitting: items 1–4. Their infit MNSQ scores are larger than modeled (i.e., above the acceptable range of 0.6–1.4), which is a sign of unexpected item behavior. Moreover, those items’ ZSTD scores fall outside the acceptable range of –2.0 < Z < +2.0, signaling that the MNSQ deviations are significant (Bond & Fox, 2020). Notably, these are the shortest and among the easiest items in the task, as indicated by their mean scores for the HS and L2 groups; therefore, they are not considered problematic for the correct functioning of a proficiency test that is intended to function across the proficiency spectrum. The fit statistics in this case are the result of a large number of participants scoring high on those items, making the model flag them as too easy (e.g., 70% of participants received the highest possible score on item 4, the most misfitting item).
All other items show adequate infit and outfit indices. It is important to point out that values larger than 1.0 represent unmodeled noise. Therefore, item 15, which has an MNSQ value of 1.24, has 24% excess noise, but this is considered a reasonable value in the context of results from a rating scale. On the other hand, values below 1.0 indicate overfit to the model (Wright & Linacre, 1994), which could suggest that responses to those items are somewhat redundant or predictable, though still acceptable.
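The flagging rule applied above (infit MNSQ outside 0.6–1.4 together with |ZSTD| > 2.0) can be expressed compactly; the fit statistics in this sketch are hypothetical, not the values in Table 3:

```python
def flag_misfits(items, mnsq_lo=0.6, mnsq_hi=1.4, zstd_cut=2.0):
    """Return item names whose infit MNSQ and ZSTD both exceed conventional bounds."""
    return [name for name, mnsq, zstd in items
            if not (mnsq_lo <= mnsq <= mnsq_hi) and abs(zstd) > zstd_cut]

# Hypothetical (item, infit MNSQ, infit ZSTD) triples.
fit_stats = [("item01", 1.62, 3.1), ("item04", 1.75, 4.0), ("item15", 1.24, 1.8)]
print(flag_misfits(fit_stats))  # ['item01', 'item04']; item15 stays within bounds
```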
To look further into the functioning of the different items, we analyzed the Wright map generated with Winsteps (Figure 4), which presents a distribution of item difficulty and person ability, with participants labeled according to their learner group (HS/L2).
Figure 4. Wright map of the EIT scores.
A Wright map represents a developmental pathway for the different participants in the study and the items, showing how items are distributed in relation to the ability of the participants. From top to bottom, the participants are organized at the left of the map in order of their performance on the task (participants higher up in the map were more successful), and items at the right of the map are organized by their difficulty (the higher the item, the more difficult it was for this sample of participants). The map presents a good distribution of item difficulty and candidates’ performance because a majority of participants’ performance levels are distributed across the same logits as the EIT items. However, it is noteworthy that person ability spans 10 logits while item difficulty spans only 6. Hence, the EIT seems not to be challenging enough for 25 participants (12.3% of the complete dataset), 21 of them HSs (29% of the HS data). This conclusion is backed by the number of participants achieving very high scores on the task (though no participant achieved the maximum score): Of the 70 HSs, 38 obtained a score over 100, 15 scored over 130, and 5 scored between 140 and 143. In contrast, of the 131 L2 learners, 20 scored over 100, 4 scored over 130, and only one scored 140.
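For readers unfamiliar with the logit scale used on the map, the logits come from the Rasch model family; in its simplest, dichotomous form (the 0–4 EIT scoring implies a polytomous extension with category thresholds), success probability depends only on the difference between person ability and item difficulty:

```latex
\[
  P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
\]
% theta_n: ability of person n (logits); delta_i: difficulty of item i (logits).
% A person one logit above an item's difficulty succeeds with probability
% e / (1 + e) = .73; equal ability and difficulty give .50.
```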
The map also shows that at least four of the six items added by Solon et al. (2019) increase the difficulty range of the EIT. The inclusion of these items enables the task to assess learners at one additional logit of proficiency, thereby improving the task’s discrimination with advanced learners.
Scores on the validating measures: CAF and Versant proficiency test
With the reliability and item analyses complete, the next step consisted of observing the correlations between scores on the EIT and the other proficiency measures. First, regarding the use of CAF measures to assess the linguistic competence of L2 learners and HSs, the following graphs present the descriptive statistics of participants’ performance on the oral narration task in contrast with their SRP scores. Three of the five measures (MLU in morphemes, fluency, and lexical diversity) are continuous variables, and the other two (proportion of subordination per utterance and accuracy) are proportions.
There is a general, expected trend of improved performance on CAF features as SRP increases, and no clear trend of one group (HS/L2) consistently outperforming the other on all measures. Interestingly, L2 learners perform better on all five CAF features as their SRP scores increase (see Figures 5–9), while HSs do not show such clear, regular improvement on the two measures of complexity (see Figures 5 and 6). HSs with lower SRP scores seem to have higher levels of subordination and MLU in morphemes than HSs with higher SRP. Nevertheless, the other features (see Figures 7–9) show regular improvement, as seen for the L2 group. These results could be related to the small sample size for HSs with lower SRP scores. However, this irregular pattern matches the weaker correlations found between HSs’ complexity measures and their scores on the EIT (see Table 4), which were positive and significant but considerably weaker than for the other CAF features.
Figure 5. Distribution of proportion of subordination scores (complexity).
Figure 6. Distribution of MLU morphemes scores (complexity).
Figure 7. Distribution of fluency scores.
Table 4. Spearman correlations between scores on the EIT and CAF measures
* p < .005, ** p < .0001, *** p < .00001.
On the other hand, in terms of accuracy (see Figure 8), L2 learners scored higher than HSs at the lowest SRP levels, but HSs with the highest SRP scores surpassed them. HSs also consistently showed higher lexical diversity (see Figure 9) than L2 learners, except at the highest SRP level, where L2 learners slightly surpassed them. Finally, HSs also showed overall better fluency than L2 learners did (see Figure 7).
Figure 8. Distribution of accuracy scores.
Figure 9. Distribution of lexical diversity scores.
There were positive, significant Spearman correlations between all measures of CAF features and participants’ scores on the EIT, as displayed in Table 4.
The Versant Spanish Test was also administered to obtain concurrent validity information for the EIT. Figure 10 displays the distribution of Versant scores and SRP scores. Table 5 presents the correlations between scores on the two tasks, which were very high for the overall sample, particularly for HSs. These results provide evidence that the two assessments measure similar constructs. Figures including graphs of the distribution of scores for all the measures (EIT with CAF measures and the Versant test) have been included as supplementary materials.
Figure 10. Distribution of Versant scores.
Table 5. Spearman correlations between scores on the EIT and the Versant Spanish Test
*** p < .00001.
Discussion and conclusions
To respond to the first research question, which asked to what extent the EIT items and their scoring were reliable, we ran a Rasch analysis as well as interrater reliability analyses. First, the high person and item reliability indices, combined with the moderately high level of interrater reliability described in the Methodology section, are evidence of the EIT’s reliability, as these reliability estimates represent the likelihood that these scores would be generated in future administrations of the task with a similar sample of test-takers and items.
These results follow the trend of high reliability indices in EIT research. Kostromitina and Plonsky (2022) report results of a reliability generalization meta-analysis, which averaged the reliability estimates found in their sample (Cronbach α and Kuder-Richardson Formula 20 coefficients) and showed high internal reliability (.92).
Additionally, they averaged the interrater reliability coefficients and obtained strong reliability scores: .91 kappa, and .88 for percent agreement. These coefficients are certainly higher than those obtained in the present study (.62 kappa for 4 raters, 47.59% agreement). However, the authors of the meta-analysis recommend interpreting these specific mean coefficients with caution, as less than half of the unique sample studies reported interrater reliability data.
We can think of different explanations for the low interrater reliability score (though it is considered “substantial agreement”): lack of agreement on the differences between score bands 1 and 2, the background of the raters, and their previous training. Moreover, while rater socialization meetings were in place, their goal was not to reach agreement in the scoring but to verify that all raters understood the rubric and were applying it consistently. Solon, Park, Pandža & Garza (2023) examined the role of rating modality (aural versus written) and rater characteristics (specifically language background and linguistics training) in the rating of L2 scores on their expanded version of the EIT. They report that none of these factors greatly influenced EIT scores, which contributes to the validity argument of EITs as a measure of L2 proficiency. However, they found significant differences (with a small effect size) in scores with rating modality as a variable: Raters who rated performance based only on the oral recordings tended to give higher scores than those who rated the transcriptions of the recordings. In the present study, the raters were instructed to transcribe the recordings as they rated them, so there was a record of the utterance they had heard. Therefore, they were rating the recordings, not a pre-made transcription.
Similarly, Solon et al. (2023) found that native speakers of the target language tended to rate responses higher than non-native speakers. All raters in the present study were native Spanish speakers (one monolingually raised and the other three bilingually raised, advanced-proficiency HSs of Spanish). It would be interesting to investigate the role that monolingual versus bilingual L1 acquisition has on rating behavior, and further research should examine this and other factors that might affect rater behavior in EITs.
To respond to the second research question, which asked to what extent the EIT discriminated among HSs and L2 learners of different proficiencies, we looked at model fit and the Wright map generated by the Rasch analysis and observed participants’ performance on the task through descriptive statistics, considering their group (HS/L2) and SRP score as independent variables. Descriptive statistics showed that there was a wide range of SRP scores, but average EIT scores increased linearly as SRP increased. This pattern was true of both HS and L2 groups, though HSs scored higher on average than L2 learners across the spectrum of SRP scores.
Results of the Rasch analysis confirmed that the task efficiently differentiated across 6 levels of proficiency and 10 levels of item difficulty. Overall, most EIT items fit the Rasch model well, except for four misfitting items, which were also the shortest and among the easiest, and thus not considered problematic.
The item distribution on the Wright map also showed good item functioning, given that the items were organized across the same logits as most participants. In particular, it showed that the six items added by Solon et al. (2019) were the most challenging ones and improved task discrimination (i.e., the number of levels across which the task is able to differentiate) with advanced HSs and L2 learners. However, there were 25 participants for whom the task was too easy, most of them HSs. Solon et al.’s (2022) Wright map presented a very similar distribution and showed that the 36-item EIT mapped onto the HSs’ ability better than the 30-item version: their participants’ ability covered 9.49 logits and the item difficulty covered 7.06 logits. The present study showed similar proportions, though seemingly less discriminating across the entire sample: Learners’ ability covered 10 logits and item difficulty 6 logits. The difference in distribution could be due to the different profiles of the learners included in this study, who had an unbalanced proficiency distribution, with a large percentage of them higher on the proficiency spectrum. Nonetheless, the EIT was still effective with a large portion of the sample.
The distribution of participants’ ability on the map is related to the number of participants who scored almost at ceiling on the EIT. These results reflect the high levels of competence that many members of the HS sample displayed across all the experimental tasks. Therefore, while the EIT seems efficient at distributing most HSs and L2 learners across different levels of competence, it may be less effective with HSs of very high proficiency. Moreover, given the small number of HSs at the lowest end of the score range, more evidence of the task’s efficiency is needed at the lowest proficiency levels.
To respond to the third research question, which asked to what extent scores on CAF measures from an oral narration and scores from the Versant Spanish Test provided concurrent validity evidence for the EIT, correlation analyses were run. Overall, the five CAF features showed linear development across SRP scores for both groups of learners, except for two small irregularities in HSs’ complexity features at the higher levels. CAF feature development thus showed the expected pattern, which also paralleled the improvement in EIT scores. Accordingly, Spearman’s correlations between the CAF features and EIT scores were strong and positive, with the exception of HSs’ complexity features, for which they were moderate: .44 for proportion of subordination and .35 for MLU by morphemes. It is not unusual to find weaker correlations with complexity than with other features, given that there tends to be more variability in how complex spontaneous oral responses are, whereas accuracy, fluency, and lexical diversity are less variable. Previous research has shown that a range of variables affects performance in terms of CAF features (Foster & Tavakoli, Reference Foster and Tavakoli2009), with Pallotti (Reference Pallotti2019) showing that even when proficiency is controlled for, participants show individual variability across tasks of different difficulty, particularly in terms of syntax, “as some participants tend to prefer broad and complex structures while others typically produce rather short and simple constructions” (p. 67).
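A minimal sketch of one such correlation is given below, using hypothetical per-participant values (the variable names and numbers are illustrative only):

```python
# Minimal sketch: Spearman correlation between EIT scores and one CAF
# feature (MLU by morphemes). All values are invented for illustration.
from scipy.stats import spearmanr

eit_scores    = [21, 35, 48, 52, 60, 74, 81, 90]
mlu_morphemes = [3.1, 3.8, 4.0, 4.6, 4.4, 5.2, 5.0, 5.9]

rho, p = spearmanr(eit_scores, mlu_morphemes)
print(f"rho = {rho:.2f}, p = {p:.4f}")
```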
Finally, results on the Versant Spanish Test showed very strong, positive, significant correlations (ρ > .74) with scores on the EIT. These results, in addition to those obtained through the CAF analysis, provide evidence that the EIT measures oral proficiency similarly to these established instruments, hence providing concurrent validity evidence in response to the third research question.
Although the overall sample of participants was fairly large and represented a range of proficiencies, it was not equally balanced across all proficiency levels, and some participants did not complete all tasks properly, resulting in some lost data and different numbers of participants completing each task. Nevertheless, the results show promising trends, as they provide evidence that the EIT is a valid measure of proficiency for HSs and L2 learners of Spanish, able to elicit performance at different levels of proficiency for both groups. Given that the EIT takes just 15 min to administer, it is a quick and efficient measure with high reliability and a long history of use in language assessment. The oral modality of the task makes it appropriate for HSs and L2 learners at different levels of literacy development. Moreover, the task is easy to rate due to the controlled productions it elicits (an advantage over other oral tasks such as narrations or picture descriptions), and it has strong interrater reliability (Kostromitina & Plonsky, Reference Kostromitina and Plonsky2022; Yan et al., Reference Yan, Maeda, Lv and Ginther2016). The EIT thus constitutes a useful tool for HLA research as well as for other applications where an efficient and reliable proficiency measure is needed. If this measure were to be used for placement, further research would be needed in local contexts to determine its appropriateness. Readers are referred to Yan, Lei and Shih (Reference Yan, Lei and Shih2020) for discussion of how an EIT can be developed to place students into a curriculum as well as how scores compare to a more general, non-curricular EIT.
Future research on this topic could also investigate the specific linguistic features of the EIT that influenced participants’ performance on items that did not fit the Rasch model appropriately or that showed differential item functioning (DIF), that is, an advantage for learners based on their background traits. Isbell and Son (Reference Isbell and Son2021) conducted such an analysis for the Korean EIT, finding substantial DIF estimates (larger than .5) for HSs in items 1, 19, 25, and 27, and for L2 learners in item 19. Nonetheless, they did not find the magnitude or direction of DIF to be so large or consistent that “overall measurement of oral proficiency would be compromised.”
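One common way to estimate DIF in a Rasch framework is to calibrate item difficulties separately for each subgroup and compare them. The minimal sketch below uses invented logit values, and the .5-logit flag mirrors the threshold mentioned above; Isbell and Son’s exact procedure is not reproduced here.

```python
# Minimal sketch: DIF as the contrast between item difficulties calibrated
# separately for HS and L2 subgroups. All logit values are invented.
hs_difficulty = {"item_01": -1.9, "item_19": 0.3, "item_25": 1.1}
l2_difficulty = {"item_01": -1.2, "item_19": 0.9, "item_25": 0.4}

for item, hs_d in hs_difficulty.items():
    contrast = hs_d - l2_difficulty[item]
    flag = "flagged" if abs(contrast) > 0.5 else "ok"
    print(f"{item}: contrast = {contrast:+.2f} logits ({flag})")
```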
Previous research (Yan et al., Reference Yan, Maeda, Lv and Ginther2016) has identified stimulus length as the main factor in EIT item complexity, and, considering that the 36-item EIT was still too easy for a number of participants, one avenue for further research is extending the task in both the number and the length of items. It would also be valuable to know which other features are influential when item length is held constant. For example, in the Korean EIT, Isbell and Son (Reference Isbell and Son2021) found an effect of vocabulary sophistication and number of inflectional morphemes, which together accounted for 59% of the variance observed in item difficulty. They also observed that some of the most difficult items included more embedded clauses, though this did not appear to be a systematic source of item difficulty.
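In the same spirit, a minimal sketch of regressing item difficulty on candidate item features could look like the following; all numbers are invented, and the feature choice merely echoes the studies cited above.

```python
# Minimal sketch: ordinary least squares regressing Rasch item difficulty
# (logits) on item features. All values are invented for illustration.
import numpy as np

# Columns: syllable length, vocabulary sophistication, inflectional morphemes
X = np.array([[8, 0.2, 1], [12, 0.4, 2], [15, 0.5, 3],
              [18, 0.6, 3], [21, 0.8, 5], [24, 0.9, 6]], dtype=float)
y = np.array([-2.1, -1.0, -0.4, 0.3, 1.2, 1.9])  # item difficulty (logits)

X1 = np.column_stack([np.ones(len(X)), X])    # add intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # OLS fit
r2 = 1 - ((y - X1 @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print("coefficients:", np.round(beta, 2), "| R^2 =", round(r2, 2))
```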
Future studies could also investigate the different trends that HSs and L2 learners present when self-assessing their proficiency: The present study showed a pattern of L2 learners overestimating their proficiency (a pattern less evident among HS learners), and a similar trend had already been observed in previous research (Bowles et al., Reference Bowles, Adams and Toth2014). Finally, the oral narration data could be used to describe profiles of HSs at different oral proficiency levels, as Gatti and O’Neill (Reference Gatti and O’Neill2018) did for writing. Such profiles could then serve as a useful framework of comparison for the analysis of HL proficiency through CAF features.
In conclusion, this study contributes to the body of research on HL and L2 proficiency assessment by providing additional validity evidence for the Spanish EIT as an effective measure of proficiency for both HSs and L2 learners. Additionally, it shows that HS and L2 scores on the extended Spanish EIT correlated strongly with two oral proficiency measures, CAF features and the Versant Spanish Test, a comparison that (to the best of our knowledge) had not been reported before and that provides concurrent validity evidence for the task.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263125000130.
Competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.