Introduction
As the field of heritage language (HL) acquisition grows, there is an increasing need for reliable, efficient, and practical measures of HL proficiency to meet research standards and facilitate comparability across studies and groups. In this paper, we adopt Valdés’ “narrow” (as differentiated by Polinsky & Kagan, 2007) definition of a heritage speaker (HS) as “a language student raised in a home where a non-English language is spoken, who speaks, or at least understands the language, and who is to some degree bilingual in that language and in English” (Valdés, 2001). Such a narrow definition is appropriate in the context of language proficiency testing, since many learners who fall under a “broad” definition (those “who have been raised with a strong cultural connection to a particular language through family interaction,” Van Deusen-Scholl, 2003, p. 222) may not have any discernible proficiency in the HL.
Although HSs are a heterogeneous group by nature, some generalizations are possible. HSs have mainly aural exposure to their HL, which occurs mostly at home. Therefore, HSs tend to be more proficient at oral, spontaneous tasks that do not require metalinguistic knowledge, due to their naturalistic acquisition (Bowles, 2011; Montrul & Foote, 2014; Montrul, Foote & Perpiñán, 2008). Owing to differences in their acquisition context, HSs and second language (L2) learners have differing learning needs. L2 learners’ main exposure is in a classroom setting, so they are accustomed to and tend to perform better on written tasks and those that require explicit and/or metalinguistic knowledge, which are common in academic environments.
Literacy varies widely across HSs, depending on the exposure to written language they have had. For this reason, assessments that rely on literacy to evaluate HSs’ and L2 learners’ proficiency are not ideal (Carreira & Potowski, 2011; Sanz & Torres, 2018), and it is important to consider the two populations’ characteristics to design assessments that are valid and reliable for both. Given that the field of L2 acquisition (SLA) predates that of HL acquisition (HLA), resources to evaluate L2 learners’ proficiencies are more abundant and developed than those aimed at HSs, and many studies have assessed HSs with tests that have only been validated with L2 learners (e.g., the DELE in Colantoni, Cuza & Mazzaro, 2016; Cuza & Frank, 2011).
Due to the increased enrollment of HSs in language courses, there have been growing discussions of HL assessment for the purposes of course placement (e.g., Bowles, 2022; Fairclough, 2012; Parisi & Teschner, 1983; Potowski, Parada & Morgan-Short, 2012), moving away from highly subjective measures like one-on-one interviews, which were commonplace in the past. Simultaneously, increased research on HLA has led to the search for proficiency tests that are suited for HSs and can assign them valid, reliable scores for use as independent variables in research (Solon et al., 2022). In so doing, researchers such as Bowles (2018), Carreira and Potowski (2011), and Solon et al. (2022) have questioned the validity of methods used to assess HL proficiency in past studies.
Elicited imitation tasks (EITs) have been found to be valid measures of proficiency for L2 learners (e.g., Kostromitina & Plonsky, 2022; Yan, Maeda, Lv & Ginther, 2016), and there is a small but growing corpus of studies that have started using them with HSs as measures of proficiency (Isbell & Son, 2021; Lopez-Beltran Forcada, 2021; Solon et al., 2022; Wu & Ortega, 2013). EITs have been used with diverse groups, including first language (L1) adults, L1 children, and L2 learners of varying literacy levels, as they are oral and not necessarily tied to any curriculum. Nevertheless, there has been limited research on the use of EITs as proficiency assessments for HSs (e.g., Solon et al., 2022; Son, 2018). The present study addresses the need for additional validity evidence to support the use of EITs with both groups of learners. Including samples of both groups is important given that there are still many studies and contexts where the two groups are assessed together, and the efficacy of the EIT in these contexts should be verified. For the sake of brevity, we will focus the literature review on advances in the use of EITs as HL proficiency assessments. Readers are referred to the meta-analyses by Kostromitina and Plonsky (2022) and Yan et al. (2016) for information about the long history and efficiency of EITs as measures of L2 proficiency.
Providing reliability and validity evidence for assessments whenever they are applied in a new context or with a new group or purpose is crucial. Reliability is the internal consistency of a measure, and validity refers to “the degree to which evidence and theory support the interpretation of test scores for proposed uses of tests” (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 2014). Similarly, the validation process refers to the accumulation of evidence to support the interpretation of those scores.
To obtain that evidence, an internal analysis of the measure’s functioning is important, as is obtaining criterion-related concurrent validity evidence, which involves examining the level of agreement between the scores on the new test and those provided by independent, established measures (Hughes, 2003). Solon et al. (2022) provided concurrent validity evidence by comparing the scores of the EIT with scores on a written measure of proficiency. Given that success on a written proficiency measure is directly linked to literacy development, we chose oral proficiency measures (i.e., an oral narration and the Versant test), which we consider would better reflect the proficiency of both groups of learners, to provide concurrent validity evidence for the EIT. Additionally, this study reports a novel norming technique for EIT stimuli using Amazon Mechanical Turk (MTurk), a marketplace that allows access to a cost-efficient participant pool that is increasingly used in behavioral research (Ortega-Santos, 2019). In sum, this study inspects the reliability and discrimination of a Spanish EIT modified for advanced learners (Solon, Park, Henderson & Dehghan-Chaleshtori, 2019) and gathers concurrent validity evidence for an EIT as used with both HSs and L2 learners.
Literature review
Heritage language assessment
One main concern that has been raised regarding HL assessment is that most studies in HLA either used tasks that were originally developed for L2 learners or used placement tests in research settings, raising reliability and validity concerns (Ilieva & Clark-Gareca, 2016; Son, 2020). Beaudrie (2016) highlights four characteristics that should be kept in mind for HL assessment design: (a) using performance-based measures of real-world tasks with authentic purposes; (b) accounting for the variability of HL linguistic varieties by attempting to use structures that are “dialect-neutral” (Potowski et al., 2012) and formulating and reporting decisions about which linguistic variations can be accepted; (c) accounting for the lack of an established HL proficiency framework and limiting the assumptions regarding the expected development of the learners; and (d) using multiple measures to capture HSs’ diverse skills, and avoiding assessments that only test one set of skills, as they may highlight learners’ inability to produce specific forms (to which they may have not been exposed).
Adequate proficiency tests for HSs are a prerequisite to appropriately interpreting the results of studies that use language proficiency as an independent variable, and, considering that many studies compare HSs and L2 learners, it is important that the assessments used have been validated with both groups. Previous explorations of assessments for HLA research settings have included a variety of task types, including self-assessments (Keating et al., 2011); C-tests (Drackert & Timukova, 2019); complexity, accuracy, and fluency (CAF) features of written production (Camus & Adrada-Rafael, 2015); and standardized tests of proficiency, like ACTFL’s Oral Proficiency Interview (OPI) (Ilieva, 2012). There has been particular interest in EITs as a promising measure that could be reliable, easy to administer, and able to adequately assess L2 learners and HSs regardless of their literacy level or language variety (Solon et al., 2022; Wu & Ortega, 2013).
Elicited imitation tasks as measures of proficiency for L2 learners and HSs
In an EIT, participants hear multiple stimulus sentences of increasing length and complexity one at a time and are asked to repeat them aloud as accurately as possible. EITs have been shown to provide a measure of implicit linguistic knowledge when there is a short delay between the presentation of the sentence and its repetition and a time limit for the repetition. The rationale is that the stimulus sentences exceed participants’ working memory span and must be quickly and accurately comprehended to be repeated.
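To make the timing logic concrete, the sketch below outlines a generic EIT trial in Python. It is a minimal illustration, not the procedure of this or any particular study: the delay and response-window durations are hypothetical placeholders, and `play` and `record` stand in for whatever audio routines an experiment platform provides.

```python
import time
from dataclasses import dataclass

@dataclass
class EITTrial:
    """One elicited-imitation trial: stimulus, short delay, timed repetition."""
    audio_path: str
    delay_s: float = 2.0            # hypothetical post-stimulus delay
    response_window_s: float = 8.0  # hypothetical repetition time limit

def run_trial(trial: EITTrial, play, record):
    play(trial.audio_path)           # participant hears the sentence once
    time.sleep(trial.delay_s)        # the delay discourages rote echoing
    record(trial.response_window_s)  # the fixed window forces rapid reconstruction

# Example with stand-in audio routines:
run_trial(EITTrial("item01.wav"),
          play=lambda p: print(f"playing {p}"),
          record=lambda s: print(f"recording for {s} s"))
```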
EITs are versatile and have been used to assess the implicit knowledge of specific aspects of grammar, lexicon, and phonology (e.g., Deygers, 2020; Torres, Estremera & Mohamed, 2019) and to test the effectiveness of instructional interventions (Fernandez-Cuenca & Bowles, 2022) and the listening comprehension of L1 and L2 speakers (Akbary, Benzaia, Jarvis & Park, 2023); they have also been highlighted in Yan et al.’s (2016) and Kostromitina and Plonsky’s (2022) meta-analyses for their efficiency as measures of global language proficiency (e.g., Gaillard, 2014; Lopez-Beltran Forcada, 2021; Wu & Ortega, 2013). EITs have been used in many different applications in L2 assessment, including as part of standardized commercial proficiency tests, such as the Versant proficiency test, the Duolingo English Test, and the Test of English as a Foreign Language Essentials, likely due to their ease of administration, efficiency, affordability, and practicality.
One of the most popular EITs in SLA research is that of Ortega, Iwashita, Norris & Rabie (2002), and parallel versions have been developed and validated as measures of oral proficiency in Spanish, French, English, Chinese, Korean, German, and Japanese (e.g., Chaudron, Nguyen & Prior, 2005; Gaillard & Tremblay, 2016; Wu & Ortega, 2013). It includes 30 items that range from 7 to 18 syllables and are scored on a 5-point scale.
EITs have also proven versatile in their ability to assess L1 adults (e.g., Chaudron et al., 2005; Ellis, 2005) and even young L1 children (Keller-Cohen, 1981) with limited or developing literacy skills. This feature in particular makes EITs attractive as an assessment measure for HSs who may lack literacy in the HL.
Indeed, EITs have recently begun to be used to measure different constructs in HSs’ knowledge and to compare HSs’ and L2 learners’ proficiency. A growing corpus of studies has used EITs to measure knowledge of specific language structures (e.g., Bowles, 2011; Heo, 2016), and a few have utilized one of the versions of Ortega et al.’s (2002) EIT to measure HSs’ and L2 learners’ proficiency as an independent variable in their research (Lopez-Beltran Forcada, 2021; Solon et al., 2022; Son, 2018; Wu & Ortega, 2013; Zárate-Sández, 2015; Zhou, 2012).
In her study with L2 learners, Bowden (2016) found that Ortega et al.’s EIT was not well suited to test the full range of proficiency, particularly at advanced levels. Previous studies that tested both L2 learners and HSs with an EIT (e.g., Solon et al., 2022; Son, 2018; Wu & Ortega, 2013) showed that HSs performed significantly better than L2 learners within and across curricular levels. Wu and Ortega attributed these findings to HSs having an advantage over L2 learners along the full oral language development continuum. To make the task suitable for advanced-level learners, Solon et al. (2019) increased the difficulty of Ortega et al.’s (2002) EIT by adding six more items to the original 30-item task and expanding the longest item from 18 to 27 syllables. These more difficult items were expected to increase the discrimination of the task for advanced L2 learners and potentially for HSs as well.
Very recently, a few studies have used the EIT to test HSs’ Spanish proficiency. Lopez-Beltran Forcada (2021) used Solon et al.’s (2019) EIT as a measure of proficiency with L2 learners and HSs, and Solon et al. (2022) tested 63 HSs of Spanish with the 30-item EIT and compared their results with the L2 sample that was included in their 2019 study. Solon et al.’s (2022) descriptive results showed that the 30-item EIT was effective in eliciting responses at a wide range of proficiency levels, though most HSs scored at the high end of the scale. Moreover, a Rasch analysis confirmed that the test was too easy for approximately half of the participants. Nevertheless, the correlations of item difficulty between HSs and L2 learners were strong, and all items (except for item 1) fell within the 95% CI, showing that the task performed well at differentiating participants’ levels of proficiency.
Solon et al. (2022) also ran a Rasch analysis on the complete 36-item sample to inspect whether the extended task could provide a more fine-grained analysis of the HSs’ competence. This analysis showed that the item difficulty of the extended EIT better matched the range of person ability than the 30-item version and that the task was able to discriminate approximately seven different levels of person performance and six of item difficulty. Moreover, the reliability of the extended task was very high (.98 for both item and person reliability). These results are informative about the efficiency of this task with HSs.
Since the present study used the 36-item version of the EIT with a different sample of HSs with differing profiles from Solon et al.’s (2022) sample, it would not be surprising to find new patterns of difficulty for the HSs and L2 learners. Therefore, in this study, we partially replicated Solon et al.’s (2022) methodology and sought to obtain additional validity evidence for the 36-item EIT. We aimed to gather concurrent validity evidence through external measures that met recommendations for HL proficiency test design. These are discussed in the following section.
Validating measures
CAF features
Language proficiency has been considered a construct whose foundational aspects are reflected in the features of CAF (Housen & Kuiken, 2009). Pallotti (2020) defines CAF as follows: complexity is “the number of elements and their interrelationship in a text or linguistic system,” accuracy is “the conformity of linguistic performance to target-language norms,” and fluency is “the extent to which linguistic production is (and/or perceived as) fast and smooth,” and it often involves measuring speed, breakdown (i.e., number and length of pauses), and repair (i.e., number of self-corrections, false starts, etc.) (pp. 202–203). It is also common to find references to lexical complexity as an independent construct (e.g., Wu & Ortega, 2013), and, from a structural perspective, it relates to the variety of lexemes within a text, generally measured with a type/token ratio.
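As a minimal illustration of the structural measure just mentioned, the short Python sketch below computes a raw type/token ratio; note that raw TTR shrinks as texts grow longer, which is why length-corrected indices (such as the D measure used later in this study) are often preferred.

```python
import re

def type_token_ratio(text: str) -> float:
    """Lexical diversity as unique word forms (types) over total words (tokens)."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Repetition lowers the ratio: 5 types over 7 tokens.
print(type_token_ratio("el perro corre y el gato corre"))  # ≈ 0.71
```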
The motivation for using CAF measures to gather concurrent validity evidence is that an oral narration task from which CAF measures are drawn aligns with the principles of communicative language teaching and with the naturalistic context of exposure common to HSs: it recreates the communicative function of narration, which can be found in familiar authentic communicative contexts, and it centers the speaker’s attention on meaning rather than on form. Both tasks tap the development of the internal language system, but the abilities measured in the oral narration task may differ from those measured in the EIT, owing to the different cognitive demands of each task.
Additionally, the design of an oral narration task aligns with Beaudrie’s (2016) assertion that it is important to use assessments that represent the types of communicative activities that HSs are familiar with (oral, contextualized, spontaneous) to be able to access their implicit linguistic knowledge (i.e., their proficiency). Admittedly, the EIT does not meet all those characteristics, as the sentences are presented in a decontextualized manner and it does not elicit spontaneous production. Therefore, positive correlations between scores on the EIT and CAF measures can provide evidence of the extent to which the task measures the construct of proficiency that is relevant to these groups of learners, as well as criterion-related validity, given the recognition of CAF measures as an effective assessment of proficiency.
While the use of CAF features is frequent in SLA, many different measures can be chosen, making it difficult to compare across studies (Norris & Ortega, 2009). Wu and Ortega (2013) used components of oral CAF as indicators of global oral language proficiency, to which they compared their EIT data to find concurrent validity evidence. Wu and Ortega extracted CAF measures from an oral narrative task that consisted of describing 12 sequential pictures that presented a story and included 12 motion event segments. As a measure of fluency, they chose the total number of clauses. Second, motion clauses were quantified as an indicator of communicative effectiveness (i.e., accuracy). Third, the number of motion verb types was taken to be an indicator of lexical diversity (i.e., complexity and vocabulary capacity). Results showed that participants who scored higher on the EIT also showed better command on all CAF measures, thereby providing evidence that participants’ performance on both tasks relied on the same underlying oral language abilities. The present study developed a similar narration task to obtain CAF measures and compared them with scores on the EIT. The specific measures chosen and the rationale behind their selection are explained in the Methodology section.
Versant Spanish Test
For this study, it was important to have a standardized oral proficiency measure that was highly reliable and also cost-effective and efficient to administer to provide additional concurrent validity evidence to the EIT without making the process overly fatiguing for participants. The Versant Spanish Test, an oral production and aural comprehension test that takes 13 to 17 min to complete and relies on automated scoring, was chosen to meet these criteria (Pearson Education, 2011).
Audio prompts represent native speakers from different countries and Spanish varieties, making the test appropriate for the linguistically diverse sample of participants in this study. The items included in the test are designed to assess examinees’ comprehension of and intelligibility in spoken, everyday Spanish. The computer-delivered test consists of 60 items in seven different sections (i.e., Reading, Repeats, Opposites, Short Answer Questions, Sentence Builds, Story Retelling, and Open Questions). Items are designed to assess test-takers’ ability to understand spoken Spanish on everyday topics and to respond intelligibly at a nativelike conversational pace (Bernstein, Van Moere & Cheng, 2010, p. 358). The Versant test provides an overall score and four subscores: Sentence Mastery, Vocabulary, Fluency, and Pronunciation. Sentence Mastery and Vocabulary measure the response’s linguistic content, and Fluency and Pronunciation relate to the articulation and rhythm of responses (Bernstein et al., 2010).
Scores on the Versant Spanish Test have been shown to be highly correlated with scores on other oral proficiency measures, including ACTFL’s OPI (r = .86, p < .001), widely considered a gold standard in oral proficiency assessment, and the Spoken Proficiency Test (r = .92, p < .001) (Pearson Education, 2011). It has also been used in a few prior SLA and HLA studies (Blake, Wilson, Cetto & Pardo-Ballester, 2008; Escalante, 2018; Fairclough, 2012; Moneypenny & Aldrich, 2016; Pozzi & Reznicek-Parrado, 2021; Quan, 2018). While Pozzi and Reznicek-Parrado point out that the Versant test was not specifically designed to measure HSs’ proficiency, Versant scores in Blake et al. (2008) distinguished HSs as a different proficiency group from L2 learners and were able to measure the progress of both groups over time. Taken together, this evidence suggests that the Versant test is an appropriate standardized test for gathering validation evidence in this study.
Research questions
The present study aimed to explore whether the EIT was an appropriate and valid measure of proficiency for HSs and L2 learners of Spanish. Obtaining validity evidence is a key step whenever an assessment is used with a new group and/or in a new context. A common framework for inspecting the validity of a test examines evidence for content-related validity (i.e., the extent to which the content of a test “constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned”), construct-related validity (i.e., whether the constructs the test is meant to assess actually exist, can be measured, and are indeed being measured by the test), and concurrent validity (i.e., the extent to which scores on the test correspond with an independent, highly dependable measure of the same construct).
The overarching question is whether an EIT, given its successful use with both L1 and L2 populations and its oral nature, is an appropriate measure of proficiency for HSs and L2 learners. Ortega et al.’s (2002) 30-item EIT appears to be too simple for many HSs, as shown by Solon et al. (2022), and the 36-item version seemed to better match the proficiency of the HSs and L2 learners in their sample. Therefore, gathering additional validity evidence and information about its discrimination and reliability can strengthen the argument for the adequacy of the 36-item EIT as a measure of HS and L2 proficiency.
For the sake of space, this article will center on the discussion of the EIT’s concurrent validity evidence (i.e., whether scores on the test correspond to scores on other dependable measures) and its discrimination and reliability (i.e., whether the proficiency test actually differentiates across proficiency levels and does so reliably). Accordingly, the present study poses the following research questions:
1. To what extent are the EIT items and their scoring reliable with HSs and L2 learners?
2. To what extent can the EIT discriminate across the proficiencies of HSs and L2 learners?
3. To what extent can HSs and L2 learners’ performances on an oral narration and the Versant proficiency test provide concurrent validity evidence for their EIT scores?
To answer these questions, the functioning and reliability of the items with a population of L2 learners and HSs at a range of proficiency levels will first be explored. Then, concurrent validity evidence will be gathered by comparing HS and L2 scores on the EIT to those on the Versant Spanish Test and on CAF measures derived from an oral narration.
Methodology
Participants
L2 learners and HSs of Spanish were recruited to participate in the study from two different public universities in the US, one in the Midwest (N = 198) and one in the Southwest (N = 5). Data for this study were collected in spring 2021, spring 2022, and fall 2022. Participants were sought from both universities to enable a sample of HSs at the full range of proficiencies. In contrast to the Midwestern university, the Southwestern one enrolls many HSs at the lower end of the proficiency spectrum. Recruiting a larger sample of participants from the Southwest would have been ideal, but this was not possible due to the challenges of remote recruitment. Learners were recruited through Spanish undergraduate classes and through personal connections. Students in Spanish courses at the Midwestern university had the option to receive extra credit or to participate in a drawing for a gift card as compensation, whereas those at the Southwestern university received extra credit and a gift card. All others not enrolled in courses received a gift card as compensation.
A summary of participants’ characteristics based on Birdsong, Gertken and Amengual’s (2012) Bilingual Language Profile can be found in Table 1.
Table 1. Descriptive statistics of complete sample of participants
Participants were classified into HS and L2 groups on the basis of their responses to the question, “How many years have you spent in a family where Spanish is spoken?” where a response similar to the participants’ age was taken as an indication of their HS profile. Participants’ self-reported age of acquisition of Spanish was also used to confirm that they had been correctly identified as either HS or L2.
Participants were also asked to respond to Likert-scale questions regarding their communicative competence in English and Spanish (though only the Spanish results are reported here), which generated a self-reported proficiency (SRP) score from 0 to 24 (Birdsong et al., 2012). The SRP score in Spanish was used as an independent continuous variable to facilitate comparison across HS and L2 groups.
Participants were recruited from all course levels at the Midwestern university and the distribution of the sample is representative of the student population enrolled in Spanish, with a greater proportion coming from 100-level courses than from higher levels (see Table 2). The Midwestern university has mostly mixed classrooms, enrolling both HSs and L2 learners together, apart from one 200-level composition course tailored for HSs, from which 8 participants were recruited. Students from the Southwestern university were all recruited from a Spanish as an HL program. The goal of recruiting learners from a wide range of courses was to find HSs and L2 learners across the proficiency spectrum.
Table 2. Descriptive statistics and distribution of participants’ course enrollment and SRP
Note: All participants recruited outside a language program were part of the Midwestern university campus community.
A note about proficiency as it relates to course enrollment is in order: 100-level courses encompass the first four semesters of language study and therefore include a fairly broad range of L2 proficiencies, from true beginners who start at novice low to those completing the fourth-semester language requirement who are typically at the intermediate mid level. Most HSs at the Midwestern university are second generation and therefore tend to have intermediate-level or higher listening and speaking skills in Spanish. HSs who enroll in 100-level courses do so for a variety of reasons, including fulfilling a language requirement or gaining literacy skills in Spanish. This contrasts with the profile of the HSs at the Southwestern university, where students are third or fourth generation and tend to have lower proficiency in Spanish. SRP scores among HSs in the Southwest were 10–15 compared to 7–20 for those from the Midwestern university.
Instruments
The present study reports results on the linguistic background questionnaire, the elicited imitation and oral narration tasks, and the Versant Spanish Test. These measures were chosen to provide evidence of the EIT’s concurrent validity.
Elicited imitation task
The EIT was Solon et al.’s (2019) EIT, which was adapted from Bowden (2016), which was itself adapted from Ortega et al. (2002). Solon et al.’s modifications consisted of some vocabulary adaptations to limit dialect-specific terminology and the addition of six sentences to the original task, which served to increase the length of the longest sentences from 18 to 27 syllables.
To ensure the appropriacy of the language in the EIT for HSs of different varieties of Spanish, a norming procedure was conducted with 49 Spanish native speakers from nine different countries (Argentina, Chile, Colombia, Ecuador, Mexico, Peru, Spain, USA, and Venezuela), recruited via Amazon MTurk. This is a novel feature of our study, as crowdsourcing has only recently begun to be used in research on Spanish (Ortega-Santos, 2019) and has not, to our knowledge, been used in any HS studies.
Through Qualtrics, MTurk “workers” were asked a few demographic questions (i.e., Were they native speakers of Spanish? What Spanish variety did they speak? Where were they born and where did they live currently?). They were then prompted to mark any words on the EIT items that they did not recognize or that were confusing to them. Although many of the HS participants come from Mexican and Puerto Rican backgrounds, we chose to recruit informants for the norming more broadly to ensure that the language in the EIT stimuli was accessible cross-dialectally.
Based on the norming informants’ responses, item 14 (A ustedes les fascinan las fiestas grandiosas, “Grand parties fascinate you”) and item 25 (Después de llegar a casa del trabajo tomé la cena, “After arriving home from work, I had dinner”) were modified. The words grandiosas and tomé were marked by one and four native speakers, respectively. They were subsequently changed to ruidosas (“loud”) in item 14 and hice (“made”) in item 25. Although the two modified items were not re-normed, their frequencies were checked in the Corpus del Español (Davies, 2016). Fiestas ruidosas appeared 18 times in the corpus, while fiestas grandiosas appeared just once; hacer la cena appeared 151 times, and tomar la cena appeared 31 times. The increased frequency of these new strings was expected to make the stimuli more accessible across dialects and the proficiency spectrum.
The EIT items were recorded in a sound-attenuating booth by a female native speaker of Mexican Spanish who is also a graduate student and instructor of Spanish. She listened to Solon et al.’s (2019) recordings before the session and at different points during her recordings to emulate their pace. The recordings were edited with Audacity to remove any remaining background noise and to add timed pauses and tones to prompt participants’ repetitions. The timing of the pauses and tones was adjusted following Solon et al.’s (2019) method, available in the IRIS database.
The administration of the newly adapted EIT took place through the Gorilla Experiment Builder platform (henceforth, Gorilla) (Anwyl-Irvine, Massonié, Flitton, Kirkham & Evershed, 2020) and took approximately 15 min. Gorilla was chosen because it enables remote recording and submission of audio responses, and it provided the benefits of web-based EITs that Kim, Liu, Isbell & Chen (2024) highlighted, including access to larger and more diverse participant samples; reduction of research, equipment, and employment costs; and standardized and optimized testing procedures. Kim et al. also point out that web-based research is not a panacea and that there is a higher risk of distractions, noisy data, and higher dropout rates. Yet Kim et al. found that web-based EITs were comparable to lab-based EITs, providing support for the use of platforms like Gorilla.
Participants’ recorded repetitions were captured by Gorilla and later scored using Ortega et al.’s (2002) rubric, which has been used in most of the parallel versions or adaptations. The modified EIT and the rubric used in this study are available as supplementary materials.
Oral narration task
The oral narration task was designed to elicit CAF measures. The task required participants to narrate a series of 12 vignettes that presented the story of two friends who wanted to have lunch together but kept coming up against constant obstacles. The task was administered through Gorilla, and participants had 2 min at first to look attentively at the entirety of the comic strip. Then they saw each vignette in a slideshow, which they controlled and advanced at their own pace. Each slide generated independent recordings of the participant’s response, which then were unified into one file and transcribed and coded with the CLAN software.
In this study, CAF features that were compatible with CLAN were chosen. First, complexity was operationalized using two different measures: (a) proportion of subordination (i.e., the number of utterances containing subordination over the total number of utterances), obtained with a manual count, and (b) mean length of utterance (MLU) quantified in morphemes, obtained with CLAN’s eval test. The use of subordinate clauses is common as a measure of complexity and provides a fine-grained analysis of advanced learners’ speech, though it is not appropriate for novice learners who may not yet be capable of subordination (Norris & Ortega, 2009). Hence, MLU was also selected because it can be used at all levels of proficiency. Second, accuracy was measured as the percentage of error-free production at the word level, which was also obtained with the eval test. During the transcription of the recordings, any word-level inaccuracies were marked as such (see example 1). This annotation allowed the software to generate a percentage of error-free words.
Third, a measure of fluency was obtained by counting syllables per minute with the FluCalc test. Finally, as a measure of lexical diversity, a modified type-token ratio, the D measure (deBoer, 2014), was calculated with CLAN’s vocd test, which, as explained by deBoer (2014), “attempts to measure the diversity of vocabulary in writing by taking random samples of words and comparing the observed diversity to ideal curves […] vocd is fundamentally a graphical method to address lexical diversity” (p. 140).
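As a rough illustration of how such counts become scores, the sketch below approximates three of these measures in plain Python. It is a simplification of what CLAN computes (eval counts morphemes, not words, and FluCalc extracts timing from the recordings), and the example utterances and durations are hypothetical.

```python
def proportion_subordination(has_subordination: list[bool]) -> float:
    """Share of utterances hand-coded as containing a subordinate clause."""
    return sum(has_subordination) / len(has_subordination)

def mlu_words(utterances: list[str]) -> float:
    """Mean length of utterance in words (a stand-in for CLAN's morpheme count)."""
    return sum(len(u.split()) for u in utterances) / len(utterances)

def syllables_per_minute(n_syllables: int, speech_seconds: float) -> float:
    """Speed fluency, analogous to the FluCalc-derived measure."""
    return n_syllables / (speech_seconds / 60)

utts = ["la chica dijo que no podía venir", "fueron al parque"]
print(mlu_words(utts))                          # (7 + 3) / 2 = 5.0
print(proportion_subordination([True, False]))  # 0.5
print(syllables_per_minute(180, 90.0))          # 120.0 syllables per minute
```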
Procedures
All data collection took place remotely. During recruitment, the researchers distributed a link that provided direct access to the experiment. Participants were able to access the study at the time and location that was most convenient for them, without any external monitoring. These conditions facilitated data collection while respecting COVID-19 safety protocols. Most tasks described in this study were administered through the Gorilla platform, except for the Versant Spanish Test, which is available only through the proprietary Pearson platform.
The first task all participants completed was the Bilingual Language Profile, after which they were randomly assigned one of three task orders: ABC, BCA, or CAB (A = EIT, B = oral narration, and C = DELE test). Counterbalancing was necessary to verify that there were no modality or task type influences due to order effects.
Finally, all participants completed the Versant Spanish Test last. The reason it was done last is that there is a fee for each access code, so we wanted to ensure that participants had completed the rest of the study before providing a paid access code. Participants accessed the test through Versant’s web-based platform and took between 13 and 17 min to complete it.
Data analysis
Due to the unmonitored nature of this experiment, some participants did not complete the tasks adequately and some data had to be discarded. An initial pool of 242 participants completed at least some tasks in Gorilla, but the data of 39 were discarded for not following instructions, leaving the results of 203 participants in the final sample. Not all these participants completed all tasks, so the data from 189 participants were used to analyze the concurrent validity evidence of the EIT by comparing it with their performance on the oral narration, and the data from 100 participants were used to compare EIT scores with those on the Versant proficiency test.
All tasks except the EIT and the oral narration were scored automatically. Responses to the EIT and oral narration were manually transcribed and then coded by the first author (who is a native speaker of Castilian Spanish) and a team of three undergraduate research assistants (who were HSs of Mexican Spanish enrolled in advanced content-based courses in Spanish). This team also rated the same 10% of the EIT responses to calculate interrater reliability of item scoring. The exact agreement on item scoring across raters was 47.59%, with a Light’s kappa for four raters of .62, which is considered substantial agreement beyond chance (Fleiss, Levin & Paik, 2003). Moreover, rater socialization and discussion of differences resulted in a 100% agreement rate on the manual transcription of the item recordings. The EIT was transcribed and coded in Microsoft Excel, and the oral narration was transcribed, coded, and analyzed using the CLAN software. Once all data were processed, they were analyzed using RStudio and Winsteps.
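Light’s kappa, as used here for the four raters, is simply the mean of Cohen’s kappa over all rater pairs. A minimal sketch with hypothetical scores on the 0–4 EIT rubric (not the study’s data):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def lights_kappa(ratings: list[list[int]]) -> float:
    """Mean pairwise Cohen's kappa across raters (Light's kappa)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Four hypothetical raters scoring six responses on the 0-4 EIT rubric.
raters = [
    [4, 3, 2, 4, 1, 0],
    [4, 3, 2, 3, 1, 0],
    [4, 2, 2, 4, 1, 0],
    [4, 3, 1, 4, 1, 0],
]
print(round(lights_kappa(raters), 2))
```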
Descriptive statistical analyses were performed on the EIT and the oral narration task as well as on SRP. Next, to obtain information about the internal reliability of the EIT and to observe item functioning and the distribution of learners by item difficulty, Rasch analyses were run with Winsteps. Rasch analyses are more informative than true-score-theory reliability or correlation calculations; therefore, a methodology similar to Solon et al.’s (2022) is followed in the present study.
To determine the concurrent validity of the EIT, the results of the task were compared to results in the oral narration and in the Versant test through comparisons of their descriptive statistics and through correlation analyses. These data were not normally distributed, so Spearman correlations were chosen to analyze the data in RStudio.
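A minimal sketch of this analysis step with simulated scores (not the study’s data), using SciPy’s Shapiro-Wilk and Spearman implementations:

```python
import numpy as np
from scipy.stats import shapiro, spearmanr

rng = np.random.default_rng(0)
eit = rng.integers(20, 145, size=60)              # hypothetical EIT totals (max 144)
fluency = eit * 0.8 + rng.normal(0, 15, size=60)  # hypothetical syllables per minute

# A significant Shapiro-Wilk result (p < .05) argues against normality,
# motivating a rank-based correlation instead of Pearson's r.
print(shapiro(eit))
rho, p = spearmanr(eit, fluency)
print(f"Spearman rho = {rho:.2f}, p = {p:.2g}")
```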
Results
Scores on the EIT: Descriptive statistics and reliability
The distribution of L2 scores presented a right/positive skew due to few high scores in the sample, while the contrary happened with HS scores, which showed a pronounced left/negative skew, due to a large number of scores at the high end of the score range. The results of Shapiro-Wilk normality tests confirmed that scores were not normally distributed, either for the HSs (p < .0001) or the L2 learners (p = .0001).
The distribution of EIT scores can be observed in Figures 1 and 2, which present the mean scores (and standard deviations [SD]) that each participant group obtained for the different items on the EIT. These figures show that each task item prompted different levels of performance and reveal a general trend (with some exceptions) whereby the later an item appears in the task, the lower its average score. These figures also show that, overall, HSs were considerably more accurate than L2 learners on the EIT.
Figure 1. Distribution of EIT mean scores and SDs across items for L2 learners.
Figure 2. Distribution of EIT mean scores and SDs across items for HSs.
Figure 3 displays the distribution of the total EIT scores for both groups plotted against their SRP scores. The maximum possible EIT score for the 36-item version of the task was 144. At first glance, one can observe that participants’ average EIT scores increased as SRP increased, suggesting that the EIT elicits different levels of performance depending on the examinee’s proficiency. Nevertheless, this trend was more visible with HSs than with L2 learners. It is also clear that learners with similar SRP scores often showed a wide range of EIT scores. This trend is apparent across both HS and L2 groups but is particularly visible at the highest SRP scores, where some L2 learners still scored as low as 29/144 on the EIT, while the lowest-scoring HSs attained 56/144. Therefore, it seems that L2 learners overrated their proficiency in their self-assessments more than HSs did (a trend also seen in Bowles, Adams & Toth, 2014), which signals that interpretation of SRP results should be done with care.
Figure 3. Distribution of EIT and SRP scores.
The EIT data met the assumptions of local independence and unidimensionality. Therefore, it was possible to apply Rasch analysis and obtain fine-grained information about the reliability of the EIT items. Person separation was 5.51, above the minimum desired score (i.e., 2), and person reliability was .97, also high and appropriate (i.e., > .80). Item separation was 10.40, which was high (i.e., > 3), and item reliability was .99, which is very high (i.e., > .90) (Linacre, n.d.). All these indices provide evidence of reliability. Additionally, item separation and person separation provide information regarding the ability of the task to separate across items and persons of different levels. Therefore, this EIT was able to discriminate across approximately 6 different levels of performance in persons and 10 levels of difficulty for items.
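For readers unfamiliar with these Rasch indices, separation (G) and reliability (R) are two expressions of the same information, related by R = G² / (1 + G²). The sketch below confirms that the separations reported above reproduce the reported reliabilities:

```python
def rasch_reliability(separation: float) -> float:
    """Rasch reliability from separation: R = G**2 / (1 + G**2)."""
    return separation**2 / (1 + separation**2)

print(round(rasch_reliability(5.51), 2))   # 0.97 -> person reliability reported above
print(round(rasch_reliability(10.40), 2))  # 0.99 -> item reliability reported above
```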
Before using Rasch to analyze the capacity of the EIT to measure participants’ competence across different levels, model fit was checked by observing the outfit and infit mean squares (MNSQs) and standardized Z values (ZSTD) of the different items. Table 3 presents the model fit indices for the different items in the EIT.
Table 3. Mean scores and fit statistics for the EIT items
Note: Bolded numbers represent misfitting values. Mean scores indicate the average score on the different items by each group of speakers. A mean score that is too close to 4 may indicate that the item is too easy, and a score too close to 0 may indicate that the item is too difficult.
As Table 3 shows, there are four items that are clearly misfitting: items 1–4. Their infit MNSQ scores are larger than modeled (i.e., above the acceptable range of 0.6–1.4), which is a sign of unexpected item behavior. Moreover, those items’ ZSTD scores fall outside the acceptable range of –2.0 < Z < +2.0, signaling that the MNSQ deviations are significant (Bond & Fox, 2020). Notably, these are the shortest and among the easiest items in the task, as indicated by their mean scores for the HS and L2 groups; therefore, they are not considered problematic for the correct functioning of a proficiency test that is intended to function across the proficiency spectrum. The fit statistics in this case are the result of a large number of participants scoring high on those items, making the model flag them as too easy (e.g., 70% of participants received the highest possible score on item 4, the most misfitting item).
All other items show adequate infit and outfit indices. It is important to point out that values larger than 1.0 represent unmodeled noise. Therefore, item 15, which has an MNSQ value of 1.24, has 24% excess noise, but this is considered a reasonable value in the context of results from a rating scale. On the other hand, values below 1.0 indicate overfit to the model (Wright & Linacre, 1994), which could suggest that responses to those items are somewhat redundant or predictable, though still acceptable.
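The flagging rule applied above (infit MNSQ outside 0.6–1.4 together with |ZSTD| > 2.0) can be expressed compactly; the fit statistics in this sketch are hypothetical, not the values in Table 3:

```python
def flag_misfits(items, mnsq_lo=0.6, mnsq_hi=1.4, zstd_cut=2.0):
    """Return item names whose infit MNSQ and ZSTD both exceed conventional bounds."""
    return [name for name, mnsq, zstd in items
            if not (mnsq_lo <= mnsq <= mnsq_hi) and abs(zstd) > zstd_cut]

# Hypothetical (item, infit MNSQ, infit ZSTD) triples.
fit_stats = [("item01", 1.62, 3.1), ("item04", 1.75, 4.0), ("item15", 1.24, 1.8)]
print(flag_misfits(fit_stats))  # ['item01', 'item04']; item15 stays within bounds
```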
To look further into the functioning of the different items, we analyzed the Wright map generated with Winsteps (Figure 4), which presents a distribution of item difficulty and person ability, with participants labeled according to their learner group (HS/L2).
Figure 4. Wright map of the EIT scores.
A Wright map represents a developmental pathway for the different participants in the study and the items, showing how items are distributed in relation to the ability of the participants. From top to bottom, the participants are organized at the left of the map in order of their performance on the task (participants higher up in the map were more successful), and items at the right of the map are organized by their difficulty (the higher the item, the more difficult it was for this sample of participants). The map presents a good distribution of item difficulty and candidates’ performance because a majority of participants’ performance levels are distributed across the same logits as the EIT items. However, it is noteworthy that person ability spans 10 logits while item difficulty spans only 6. Hence, the EIT seems not to be challenging enough for 25 participants (12.3% of the complete dataset), 21 of them HSs (29% of the HS data). This conclusion is backed by the number of participants achieving very high scores on the task (though no participant achieved the maximum score): Of the 70 HSs, 38 obtained a score over 100, 15 scored over 130, and 5 scored between 140 and 143. In contrast, of the 131 L2 learners, 20 scored over 100, 4 scored over 130, and only one scored 140.
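For readers unfamiliar with the logit scale used on the map, the logits come from the Rasch model family; in its simplest, dichotomous form (the 0–4 EIT scoring implies a polytomous extension with category thresholds), success probability depends only on the difference between person ability and item difficulty:

```latex
\[
  P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
\]
% theta_n: ability of person n (logits); delta_i: difficulty of item i (logits).
% A person one logit above an item's difficulty succeeds with probability
% e / (1 + e) = .73; equal ability and difficulty give .50.
```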
The map also shows that at least four of the six items added by Solon et al. (2019) increase the difficulty range of the EIT. The inclusion of these items enables the task to assess learners at one additional logit of proficiency, thereby improving the task’s discrimination with advanced learners.
Scores on the validating measures: CAF and Versant proficiency test
With the reliability and item analyses complete, the next step consisted of observing the correlations between scores on the EIT and the other proficiency measures. First, regarding the use of CAF measures to assess the linguistic competence of L2 learners and HSs, the following graphs present the descriptive statistics of participants’ performance on the oral narration task in contrast with their SRP scores. Three of the five measures (MLU in morphemes, fluency, and lexical diversity) are continuous variables, and the other two (proportion of subordination per utterance and accuracy) are proportions.
There is a general, expected trend of improved performance on CAF features as SRP increases, and no clear trend of one group (HS/L2) consistently outperforming the other on all measures. Interestingly, L2 learners perform better on all five CAF features as their SRP scores increase (see Figures 5–9), while HSs do not show such clear, regular improvement on the two measures of complexity (see Figures 5 and 6). HSs with lower SRP scores seem to have higher levels of subordination and MLU in morphemes than HSs with higher SRP. Nevertheless, the other features (see Figures 7–9) show regular improvement, as seen for the L2 group. These results could be related to the small sample size for HSs with lower SRP scores. However, this irregular pattern matches the weaker correlations found between HSs’ complexity measures and their scores on the EIT (see Table 4), which were positive and significant but considerably weaker than for the other CAF features.
Figure 5. Distribution of proportion of subordination scores (complexity).
Figure 6. Distribution of MLU morphemes scores (complexity).
Figure 7. Distribution of fluency scores.
Table 4. Spearman correlations between scores on the EIT and CAF measures
* p < .005, ** p < .0001, *** p < .00001.
On the other hand, in terms of accuracy (see Figure 8), L2 learners scored higher than HSs at the lowest SRP levels, but HSs with the highest SRP scores surpassed them. HSs also consistently showed higher lexical diversity (see Figure 9) than L2 learners, except at the highest SRP level, where L2 learners slightly surpassed them. Finally, HSs also showed overall better fluency than L2 learners did (see Figure 7).
Figure 8. Distribution of accuracy scores.
Figure 9. Distribution of lexical diversity scores.
There were positive, significant Spearman correlations between all measures of CAF features and participants’ scores on the EIT, as displayed in Table 4.
The Versant Spanish Test was also administered to obtain concurrent validity information for the EIT. Figure 10 displays the distribution of Versant scores and SRP scores. Table 5 presents the correlations between scores on the two tasks, which were very high for the overall sample, particularly for HSs. These results provide evidence that the two assessments measure similar constructs. Figures including graphs of the distribution of scores for all the measures (EIT with CAF measures and the Versant test) have been included as supplementary materials.
Figure 10. Distribution of Versant scores.
Table 5. Spearman correlations between scores on the EIT and the Versant Spanish Test
*** p < .00001.
Discussion and conclusions
To respond to the first research question, which asked to what extent the EIT items and their scoring were reliable, we ran a Rasch analysis as well as interrater reliability analyses. First, the high person and item reliability indices, combined with the moderately high level of interrater reliability described in the Methodology section, are evidence of the EIT’s reliability, as these reliability estimates represent the likelihood that these scores would be generated in future administrations of the task with a similar sample of test-takers and items.
These results follow the trend of high reliability indices in EIT research. Kostromitina and Plonsky (2022) report results of a reliability generalization meta-analysis, which averaged the reliability estimates found in their sample (Cronbach α and Kuder-Richardson Formula 20 coefficients) and showed high internal reliability (.92).
Additionally, they averaged the interrater reliability coefficients and obtained strong reliability scores: .91 kappa, and .88 for percent agreement. These coefficients are certainly higher than those obtained in the present study (.62 kappa for 4 raters, 47.59% agreement). However, the authors of the meta-analysis recommend interpreting these specific mean coefficients with caution, as less than half of the unique sample studies reported interrater reliability data.
We can think of different explanations for the low interrater reliability score (though it is considered “substantial agreement”): lack of agreement on the differences between score bands 1 and 2, the background of the raters, and their previous training. Moreover, while rater socialization meetings were in place, their goal was not to reach agreement in the scoring but to verify that all raters understood the rubric and were applying it consistently. Solon, Park, Pandža & Garza (2023) examined the role of rating modality (aural versus written) and rater characteristics (specifically language background and linguistics training) in the rating of L2 scores on their expanded version of the EIT. They report that none of these factors greatly influenced EIT scores, which contributes to the validity argument of EITs as a measure of L2 proficiency. However, they found significant differences (with a small effect size) in scores with rating modality as a variable: Raters who rated performance based only on the oral recordings tended to give higher scores than those who rated the transcriptions of the recordings. In the present study, the raters were instructed to transcribe the recordings as they rated them, so there was a record of the utterance they had heard. Therefore, they were rating the recordings, not a pre-made transcription.
Similarly, Solon et al. (2023) found that native speakers of the target language tended to rate responses higher than non-native speakers. All raters in the present study were native Spanish speakers (one monolingually raised and the other three bilingually raised, advanced-proficiency HSs of Spanish). It would be interesting to investigate the role that monolingual versus bilingual L1 acquisition has on rating behavior, and further research should examine this and other factors that might affect rater behavior in EITs.
To respond to the second research question, which asked to what extent the EIT discriminated among HSs and L2 learners of different proficiencies, we looked at model fit and the Wright map generated by the Rasch analysis and observed participants’ performance on the task through descriptive statistics, considering their group (HS/L2) and SRP score as independent variables. Descriptive statistics showed that there was a wide range of SRP scores, but average EIT scores increased linearly as SRP increased. This pattern was true of both HS and L2 groups, though HSs scored higher on average than L2 learners across the spectrum of SRP scores.
Results of the Rasch analysis confirmed that the task efficiently differentiated across 6 levels of proficiency and 10 levels of item difficulty. Overall, most EIT items fit the Rasch model well, except for four misfitting items, which were also the shortest and among the easiest, and thus not considered problematic.
The item distribution on the Wright map also showed good item functioning, given that the items were organized across the same logits as most participants. In particular, it showed that the six items added by Solon et al. (2019) were the most challenging ones and improved task discrimination (i.e., the number of levels across which the task is able to differentiate) with advanced HSs and L2 learners. However, there were 25 participants for whom the task was too easy, most of them HSs. Solon et al.’s (2022) Wright map presented a very similar distribution and showed that the 36-item EIT mapped onto the HSs’ ability better than the 30-item version: their participants’ ability covered 9.49 logits and the item difficulty covered 7.06 logits. The present study showed similar proportions, though seemingly less discriminating across the entire sample: Learners’ ability covered 10 logits and item difficulty 6 logits. The difference in distribution could be due to the different profiles of the learners included in this study, who had an unbalanced proficiency distribution, with a large percentage of them higher on the proficiency spectrum. Nonetheless, the EIT was still effective with a large portion of the sample.
The distribution of participants’ ability on the map is related to the number of participants who scored almost at ceiling on the EIT. These results reflect the high levels of competence that many members of the HS sample displayed across all the experimental tasks. Therefore, while the EIT seems efficient at distributing most HSs and L2 learners across different levels of competence, it may be less effective with HSs of very high proficiency. Moreover, given the small number of HSs at the lowest end of the score range, more evidence of the task’s efficiency is needed at the lowest proficiency levels.
To respond to the third research question, which asked to what extent scores on CAF measures from an oral narration and scores from the Versant Spanish Test provided concurrent validity evidence for the EIT, correlation analyses were run. Overall, the five CAF features showed linear development across SRP scores for both groups of learners, except for two small irregularities in HSs’ complexity features at the higher levels. CAF feature development thus showed the expected pattern, which also paralleled the improvement in EIT scores. Accordingly, Spearman’s correlations between the CAF features and EIT scores were strong and positive, with the exception of HSs’ complexity features, for which they were moderate: .44 for proportion of subordination and .35 for MLU by morphemes. It is not unusual to find weaker correlations with complexity than with other features, given that there tends to be more variability in how complex spontaneous oral responses are, whereas accuracy, fluency, and lexical diversity are less variable. Previous research has shown that a range of variables affects performance in terms of CAF features (Foster & Tavakoli, Reference Foster and Tavakoli2009), with Pallotti (Reference Pallotti2019) showing that even when proficiency is controlled for, participants show individual variability across tasks of different difficulty, particularly in terms of syntax, “as some participants tend to prefer broad and complex structures while others typically produce rather short and simple constructions” (p. 67).
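A minimal sketch of one such correlation is given below, using hypothetical per-participant values (the variable names and numbers are illustrative only):

```python
# Minimal sketch: Spearman correlation between EIT scores and one CAF
# feature (MLU by morphemes). All values are invented for illustration.
from scipy.stats import spearmanr

eit_scores    = [21, 35, 48, 52, 60, 74, 81, 90]
mlu_morphemes = [3.1, 3.8, 4.0, 4.6, 4.4, 5.2, 5.0, 5.9]

rho, p = spearmanr(eit_scores, mlu_morphemes)
print(f"rho = {rho:.2f}, p = {p:.4f}")
```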
Finally, results on the Versant Spanish Test showed very strong, positive, significant correlations (ρ > .74) with scores on the EIT. These results, in addition to those obtained through the CAF analysis, provide evidence that the EIT measures oral proficiency similarly to these established instruments, hence providing concurrent validity evidence in response to the third research question.
Although the overall sample of participants was fairly large and represented a range of proficiencies, it was not equally balanced across all proficiency levels, and some participants did not complete all tasks properly, resulting in some lost data and different numbers of participants completing each task. Nevertheless, the results show promising trends, as they provide evidence that the EIT is a valid measure of proficiency for HSs and L2 learners of Spanish, able to elicit performance at different levels of proficiency for both groups. Given that the EIT takes just 15 min to administer, it is a quick and efficient measure with high reliability and a long history of use in language assessment. The oral modality of the task makes it appropriate for HSs and L2 learners at different levels of literacy development. Moreover, the task is easy to rate due to the controlled productions it elicits (an advantage over other oral tasks such as narrations or picture descriptions), and it has strong interrater reliability (Kostromitina & Plonsky, Reference Kostromitina and Plonsky2022; Yan et al., Reference Yan, Maeda, Lv and Ginther2016). The EIT thus constitutes a useful tool for HLA research as well as for other applications where an efficient and reliable proficiency measure is needed. If this measure were to be used for placement, further research would be needed in local contexts to determine its appropriateness. Readers are referred to Yan, Lei and Shih (Reference Yan, Lei and Shih2020) for discussion of how an EIT can be developed to place students into a curriculum as well as how scores compare to a more general, non-curricular EIT.
Future research on this topic could also investigate the specific linguistic features of the EIT that influenced participants’ performance on items that did not fit the Rasch model appropriately or that showed differential item functioning (DIF), that is, an advantage for learners based on their background traits. Isbell and Son (Reference Isbell and Son2021) conducted such an analysis for the Korean EIT, finding substantial DIF estimates (larger than .5) for HSs in items 1, 19, 25, and 27, and for L2 learners in item 19. Nonetheless, they did not find the magnitude or direction of DIF to be so large or consistent that “overall measurement of oral proficiency would be compromised.”
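One common way to estimate DIF in a Rasch framework is to calibrate item difficulties separately for each subgroup and compare them. The minimal sketch below uses invented logit values, and the .5-logit flag mirrors the threshold mentioned above; Isbell and Son’s exact procedure is not reproduced here.

```python
# Minimal sketch: DIF as the contrast between item difficulties calibrated
# separately for HS and L2 subgroups. All logit values are invented.
hs_difficulty = {"item_01": -1.9, "item_19": 0.3, "item_25": 1.1}
l2_difficulty = {"item_01": -1.2, "item_19": 0.9, "item_25": 0.4}

for item, hs_d in hs_difficulty.items():
    contrast = hs_d - l2_difficulty[item]
    flag = "flagged" if abs(contrast) > 0.5 else "ok"
    print(f"{item}: contrast = {contrast:+.2f} logits ({flag})")
```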
Previous research (Yan et al., Reference Yan, Maeda, Lv and Ginther2016) has identified stimulus length as the main factor in EIT item complexity, and, considering that the 36-item EIT was still too easy for a number of participants, one avenue for further research is extending the task in both the number and the length of items. It would also be valuable to know which other features are influential when item length is held constant. For example, in the Korean EIT, Isbell and Son (Reference Isbell and Son2021) found an effect of vocabulary sophistication and number of inflectional morphemes, which together accounted for 59% of the variance observed in item difficulty. They also observed that some of the most difficult items included more embedded clauses, though this did not appear to be a systematic source of item difficulty.
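In the same spirit, a minimal sketch of regressing item difficulty on candidate item features could look like the following; all numbers are invented, and the feature choice merely echoes the studies cited above.

```python
# Minimal sketch: ordinary least squares regressing Rasch item difficulty
# (logits) on item features. All values are invented for illustration.
import numpy as np

# Columns: syllable length, vocabulary sophistication, inflectional morphemes
X = np.array([[8, 0.2, 1], [12, 0.4, 2], [15, 0.5, 3],
              [18, 0.6, 3], [21, 0.8, 5], [24, 0.9, 6]], dtype=float)
y = np.array([-2.1, -1.0, -0.4, 0.3, 1.2, 1.9])  # item difficulty (logits)

X1 = np.column_stack([np.ones(len(X)), X])    # add intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # OLS fit
r2 = 1 - ((y - X1 @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print("coefficients:", np.round(beta, 2), "| R^2 =", round(r2, 2))
```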
Future studies could also investigate the different trends that HSs and L2 learners present when self-assessing their proficiency: The present study showed a pattern of L2 learners overestimating their proficiency (a pattern less evident among HS learners), and a similar trend had already been observed in previous research (Bowles et al., Reference Bowles, Adams and Toth2014). Finally, the oral narration data could be used to describe profiles of HSs at different oral proficiency levels, as Gatti and O’Neill (Reference Gatti and O’Neill2018) did for writing. Such profiles could then serve as a useful framework of comparison for the analysis of HL proficiency through CAF features.
In conclusion, this study contributes to the body of research on HL and L2 proficiency assessment by providing additional validity evidence for the Spanish EIT as an effective measure of proficiency for both HSs and L2 learners. Additionally, it shows that HS and L2 scores on the extended Spanish EIT correlated strongly with two oral proficiency measures, CAF features and the Versant Spanish Test, a comparison that (to the best of our knowledge) had not been reported before and that provides concurrent validity evidence for the task.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263125000130.
Competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.