Introduction
Within the research on second-language (L2) learning and instruction, scholars have long debated how development is impacted by explicit information (EI) about a target form (e.g., textbook style grammar rules; see Kang, Sok, & Han, Reference Kang, Sok and Han2019). In a landmark study, Fernández (Reference Fernández2008) contributed to this debate, presenting two experiments that investigated the role of EI in processing instruction (PI; VanPatten, Reference VanPatten2015) treatments that aimed to teach learners to process two targeted grammatical forms: object pronouns in Spanish, and the Spanish subjunctive. Rather than looking at learner performance in a traditional pretest/post-test design, Fernández uniquely examined whether EI affected how learners processed sentences as they engaged with practice stimuli during training.
The significant impact of Fernández (2008) on the field is evident in both its methodological and theoretical contributions. First, it was the first study to investigate the effects of PI by tracking learner behavior as participants completed the training, an approach that remains critical to studying the effects of EI in PI. Additionally, Fernández’s two experiments suggested that EI may impact learner development in different ways depending on the particular processing principles that apply to specific target forms by default. Fernández’s impact can also be seen in the numerous studies that have replicated its first experiment. These replications have both validated the original findings and brought nuance to them, for example, by improving scoring methods, by adding post-tests and delayed tests that chart learner development over time, and by investigating the impact of aptitude–treatment interactions (e.g., VanPatten, Collopy, Price, Borst, & Qualin, 2013).
Despite robust replication of Fernández’s (2008) first experiment, no research has replicated or expanded the second experiment on the Spanish subjunctive, even as its results retain theoretical and pedagogical relevance. The present study, therefore, sought to partially replicate this experiment while also addressing one of its methodological limitations, namely, that the experimental groups were not given equal exposure before the training tasks. Furthermore, we extend the scope of the original study to address advancements made in PI research. First, we include an exposure-only control group (e.g., Morgan-Short & Bowden, 2006; Prieto Botana, 2013). Second, we administer pretests, post-tests, and delayed tests to examine the effects of EI that extend beyond training. Finally, we explore how working memory (WM; e.g., Santamaria & Sunderman, 2015) influences the effects of EI. Thus, the present study not only constitutes a partial replication¹ of Fernández (2008) that seeks to validate the findings of its second experiment; it also constitutes an expansion of the original study that contributes to current research on the interactions between training, explicit information, and individual differences (e.g., Patra et al., 2022; Santamaria & Sunderman, 2015; VanPatten et al., 2013; see also DeKeyser, 2012; Henry, 2024).
Background and motivation
Explicit information and processing instruction
Within the L2 acquisition literature, numerous theories suggest that EI could increase awareness of and attention to grammatical forms in the input, which could increase intake and benefit acquisition (e.g., Schmidt, Reference Schmidt1990; VanPatten, Reference VanPatten, VanPatten, Keating and Wulff2020). Whereas some have argued for a strong link between conscious knowledge and acquisition (Schmidt, Reference Schmidt1990), others have argued that conscious knowledge is not necessary, although it could be beneficial (Krashen, Reference Krashen1982; VanPatten, Reference VanPatten2016). Thus, research on instructed second-language acquisition (SLA) has sought to determine what role EI plays in acquisition, especially within the PI literature (VanPatten & Cadierno, Reference VanPatten and Cadierno1993).
PI, as a pedagogical framework, is derived from VanPatten’s Input Processing model (VanPatten, Reference VanPatten, VanPatten, Keating and Wulff2020), which consists of a set of principles that describe how learners make form-meaning connections and predict which forms are (un)likely to be processed during comprehension. PI then uses so-called structured-input (SI) activities to direct learners to process a targeted form more readily. As an example, consider the Spanish subjunctive, which is associated with the lexical preference principle (LPP). The LPP suggests that learners rely on lexical items rather than grammatical forms when both encode the same meaning (VanPatten, Reference VanPatten, VanPatten, Keating and Wulff2020; see e.g., Cameron, Reference Cameron2013). In a sentence such as No creo que baile todos los días (“I do not believe that she/he/you dance(s) every day”), doubt is expressed both by the construction No creo que (“I do not believe”) and by the subjunctive ending -e on the end of the verb baile (“dance3rd-SUB”). When processing sentences of this type, learners tend to rely on the lexical construction rather than the verb ending to understand the intended meaning of doubt (Cameron, Reference Cameron2013). In an SI activity designed to counteract this tendency (Farley, Reference Farley and VanPatten2004), participants see a series of half-sentences marked for either the indicative or subjunctive (e.g., baila todos los días or baile todos los días); they then choose between two possible preambles (e.g., Creo que or No creo que) and receive one-word feedback indicating the accuracy of their answer. Importantly, this task can only be completed correctly if learners process the verbal morphology indicating mood and link it to the expression of mood in the preamble.
Research on PI has focused primarily on forms related to one of two processing principles: (a) the LPP or (b) the first-noun principle, which states that learners tend to rely on word order rather than case-marked elements of the sentence and thus often misinterpret object-first sentences (VanPatten, Reference VanPatten, VanPatten, Keating and Wulff2020; see, e.g., Jackson, Reference Jackson2007). These studies have generally found that PI helps learners develop form-meaning connections useful both for comprehension and production (see Lee, Reference Lee2015). However, PI typically includes both SI activities and EI about the target form. Furthermore, the type of EI that is included is distinct from traditional explanations in that it focuses narrowly on one target form (rather than a paradigm) and includes information about processing strategies. Thus, many studies on PI have investigated the role of EI, in particular, whether SI activities alone spur development, if EI is necessary, and if it is beneficial in any way. VanPatten and Oikkenon (Reference VanPatten and Oikkenon1996), for example, investigated the effects of EI on Spanish clitic object pronouns. One group of learners received the standard PI training consisting of SI and EI, another received SI only, and a third received EI only. They found that the SI-only and PI groups performed similarly on both comprehension and production post-tests while outperforming the EI-only group. The finding that EI is not a necessary component of PI training has been replicated in a number of studies and has been upheld in a meta-analysis that examined the issue (Shintani, Reference Shintani2015). Some research, however, also pointed to the advantages of EI: Farley (Reference Farley and VanPatten2004), for example, investigated the effects of EI on the subjunctive in Spanish. Although learners in an SI-only group improved on both comprehension and production post-test measures, Farley found higher gains among a PI group. 
For a more comprehensive review of this literature, see Henry (Reference Henry, Wong and Barcroft2024).
Fernández
One of the limitations of early studies on PI concerns the use of pretest/post-test designs, which can hide differences between groups, for example, if one training condition provides learners an initial advantage, which is leveled out during the training and thus disappears by the time post-testing occurs. Therefore, to understand whether learners benefit from the EI presented in PI, it is necessary to measure both the outcomes of training and how learners perform during training itself. Fernández (Reference Fernández2008) addressed this concern in two experiments investigating how EI affected the rate at which participants learned the target form. Experiment 1 focused on Spanish clitic object pronouns, which are affected by the first-noun principle. Experiment 2, the focus of the current partial replication study, investigated the Spanish subjunctive of doubt, which, as described above, is affected by lexical preference.
In both experiments, participants received either SI or PI (i.e., SI+EI). As a measure of learning rate, Fernández applied a metric called trials-to-criterion (TTC) to the analysis of learner responses in the sequence of SI items (see also Fernández, 2021; Henry, 2023; Marijuan, 2024). For this study, TTC was defined as the number of items participants had seen before answering three targets and one distractor correctly in a row. This, as Fernández (2021) explained, “was considered the minimally convincing evidence of learners having achieved appropriate strategies for processing both target items and distractors” (p. 251). In both experiments, Fernández (2008) took four measures: (a) TTC, (b) the percentage of items answered correctly after learners met criterion (“accuracy after criterion”; AAC), (c) the proportion of participants who met criterion, and (d) reaction times (RTs) on learner responses. (Figure 1 shows an example of TTC and AAC scoring.) In Experiment 1, Fernández found no differences between the PI and SI groups in any of these measures. However, in Experiment 2, she found that the PI group had lower TTC scores, higher AAC scores, more participants who met criterion, and faster RTs. Thus, the two experiments in Fernández’s study provided different results, showing advantages for EI for the form affected by the lexical preference principle but not for the form affected by the first-noun principle. She speculated that these differences might stem from how SI interacts with EI, arguing that EI is more beneficial when the task requires the processing of a single form. She further speculated that the effects of EI may be related to the processing problem (e.g., a lexical preference strategy).
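To make the TTC and AAC definitions concrete, the following minimal Python sketch scores a single participant's response sequence. It is an illustration only, not the scoring script used by Fernández or in the present study. It assumes that criterion is met at the first run of four consecutive correct responses made up of exactly three targets and one distractor, that TTC counts the items preceding that run, and that AAC is the percentage correct on all items following it. The function names and the `("target" | "distractor", correct)` data layout are hypothetical.

```python
from typing import List, Optional, Tuple

# Each response is (item type, answered correctly?).
# Item types are assumed to be "target" or "distractor" (hypothetical labels).
Response = Tuple[str, bool]


def trials_to_criterion(responses: List[Response]) -> Optional[int]:
    """Number of items seen before the first criterion run, or None if never met.

    Assumption: the criterion run is four consecutive correct responses
    containing exactly three targets and one distractor.
    """
    for start in range(len(responses) - 3):
        window = responses[start:start + 4]
        kinds = [kind for kind, _ in window]
        all_correct = all(correct for _, correct in window)
        if all_correct and kinds.count("target") == 3 and kinds.count("distractor") == 1:
            return start
    return None


def accuracy_after_criterion(responses: List[Response], ttc: Optional[int]) -> Optional[float]:
    """Percentage correct on items after the criterion run (None if undefined)."""
    if ttc is None:
        return None
    remaining = responses[ttc + 4:]
    if not remaining:
        return None
    return 100.0 * sum(correct for _, correct in remaining) / len(remaining)


# Invented example sequence for illustration.
example = [
    ("target", False), ("target", True), ("distractor", False),
    ("target", True), ("target", True), ("distractor", True),
    ("target", True), ("target", False), ("target", True),
]
ttc = trials_to_criterion(example)            # criterion run starts at the fourth item
aac = accuracy_after_criterion(example, ttc)  # one of two later items is correct
```

In this invented sequence, three items precede the criterion run, so TTC is 3, and AAC is 50%. A participant who never produces such a run receives no TTC value, which is why the proportion of participants meeting criterion is reported as a separate measure.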
Replication of Fernández (2008) and Related Issues in EI Research
Since the publication of Fernández (2008), several studies have replicated Fernández’s first experiment on object pronouns. These studies fit into a broader context of research in the field of L2 acquisition (and related fields), which has increasingly recognized a central role for replication (Marsden, Morgan-Short, Thompson, & Abugaber, 2018; Porte & McManus, 2019). Current research distinguishes between several types of replication. Direct replications make no changes to the original study and help validate previous research findings. Partial (and conceptual) replications, on the other hand, change one (or more) key variables in the original study and thus test the extent to which findings hold under different conditions (Marsden et al., 2018). Replication, therefore, builds confidence in research findings and is especially important when a study has had an outsized impact on the field (Makel, Plucker, & Hegarty, 2012)² or when methodological limitations may have impacted results.
The rather robust replication of Fernández’s (Reference Fernández2008) first experiment presents a good example of how replication can help refine methods and bring nuance to findings. For example, Henry et al. (Reference Henry, Culman and VanPatten2009) investigated the effects of EI, targeting subject-verb-object/object-verb-subject word order and accusative case in German, which are impacted by the first-noun principle. Although the EI and SI training followed the design of Fernández’s first experiment, Henry et al. found different results: EI did significantly increase the learning rate. This study suggested that the effect of EI was not dependent on the processing problem per se, as Fernández had speculated, but rather on the interaction between EI and particular target forms. Subsequent research on forms related to the first-noun principle has yielded mixed evidence, with some studies replicating Fernández’s findings and others replicating the differences observed by Henry et al. Importantly, these studies have brought new methodological innovations. For example, Henry et al. introduced a new scoring method for TTC, which (as will be discussed later) has become standard in the field. VanPatten et al. (Reference VanPatten, Collopy, Price, Borst and Qualin2013) not only investigated learner performance during training (i.e., TTC) but also included a post-test component; subsequent studies have followed suit, additionally including delayed tests (Henry et al., Reference Henry, Jackson and DiMidio2017). Furthermore, VanPatten et al. (Reference VanPatten, Collopy, Price, Borst and Qualin2013) explored whether an individual difference measure, aptitude, impacted the effects of EI. 
Work that investigates aptitude–treatment interactions of this sort is increasingly important not only to the understanding of EI within PI (Santamaria & Sunderman, Reference Santamaria, Sunderman, (Edward) Wen, Mota and McNeill2015; see Henry, Reference Henry, Wong and Barcroft2024), but also in the wider field of SLA (DeKeyser, Reference DeKeyser2012, Reference DeKeyser2021).
Despite these advancements—and continual interest in the Spanish subjunctive within PI research (Benati & Lee, Reference Benati and Lee2010; Diaz, Reference Diaz2017; Farley, Reference Farley2000, Reference Farley2001, Reference Farley and VanPatten2004)—no studies to date have replicated Fernández’s second experiment. Thus, further research is essential to (a) validate the finding that EI provides advantages during PI for the Spanish subjunctive, (b) bring this research up to date with current methodological standards, and (c) investigate the degree to which individual differences contribute to the use of EI in PI for the Spanish subjunctive. Such replication is not only of theoretical importance, given that little research has been conducted with complex forms or forms that rely on the lexical preference principle; it is also of continued pedagogical importance, given that current recommendations for practice rely in whole or in part on this result (see e.g., Henry, Reference Henry, Wong and Barcroft2024).
In the remainder of this section, we focus specifically on the methodological aspects of Fernández (Reference Fernández2008) that motivate the present study, namely: (a) the need to balance exposure between groups, (b) the use of post-testing to investigate learner development and long-term effects, (c) the need for true control groups, and (d) exploration of aptitude–treatment interactions.
First, several meta-analyses have found that more explicit training conditions provide advantages over less explicit training conditions (Goo, Granena, Yilmaz, & Novella, Reference Goo, Granena, Yilmaz, Novella and Rebuschat2015; Norris & Ortega, Reference Norris and Ortega2000; Spada & Tomita, Reference Spada and Tomita2010). However, critiques of this work (e.g., Doughty, Reference Doughty, Doughty and Long2003; Sanz & Morgan-Short, Reference Sanz, Morgan-Short and Sanz2005) argue that designs tend to favor explicit conditions. One common issue is that more explicit training paradigms often include greater amounts of exposure (e.g., DeKeyser, Reference DeKeyser1995; Robinson, Reference Robinson1996). In Fernández’s (Reference Fernández2008) study, the PI group received EI with several examples of the target form (i.e., exposure), whereas the SI group received no exposure to the target form before completing the training task. Thus, the advantage that Fernández observed for the PI group in Experiment 2 may partially stem from this pre-practice exposure. Future research should, therefore, balance exposure to the target form in terms of the number of exemplars and the length of pre-practice exposure, as done in the present study.
Second, as discussed previously, Fernández (2008) broke the traditional mold of PI research by looking at training itself. However, recent research on PI has typically combined pretest/post-test designs with Fernández’s approach (e.g., Glimois, 2019; Henry, 2022, 2023; Henry et al., 2017; VanPatten et al., 2013), testing both whether there are advantages during training and whether such differences are maintained on immediate and delayed post-tests. Conceptual replications of Fernández’s first experiment have indeed found that initial advantages for EI dissipate over time (e.g., Henry et al., 2017; VanPatten et al., 2013). However, no work replicating the second experiment has yet combined these methodological approaches, investigating both learning trajectory and learning outcomes together.
A related issue is that investigations of PI/SI (and other paradigms) often lack true control groups, and thus it is difficult to know how they compare with meaningful exposure alone. Some research suggests that meaningful exposure to the form alone is indeed enough to bring about learning (Morgan-Short & Bowden, 2006). However, task essentialness, in which a grammatical form must be attended to in order to complete a task (Loschky & Bley-Vroman, 1993), is an important feature of effective training, at least for some L2 forms (Prieto Botana, 2013). SI practice can be characterized as task-essential (Sanz & Morgan-Short, 2004; but note that SI must also be structured to address processing tendencies, see Wong, 2024). Thus, continued research should include control groups who receive meaningful exposure to the target form to distinguish between the effects of exposure to input through task-essential SI practice versus exposure to meaningful input alone.
Finally, relatively few studies have explored how learner variables moderate the role of EI, although this research is becoming more common (DeKeyser, 2021). For example, several studies in the PI literature have explored the effects of age (Cox, 2019; Lenet et al., 2011), aptitude (VanPatten et al., 2013), and WM, the focus of the present investigation (Santamaria & Sunderman, 2015; Sanz et al., 2016). WM is “responsible for the control, regulation, and active maintenance of information in the face of distracting information” (Linck et al., 2014, p. 862) and thus is central to theories of L2 development that emphasize a role for attentional processing. Although research has linked WM to L2 development in numerous areas (see Linck et al., 2014), there are still relatively few studies that address the interaction between WM and EI in training. Among these, there are conflicting results. For example, Sanz et al. (2016) found that WM had no effect on learning outcomes when EI was presented before learners received SI but that it was positively associated with learning effects when EI was presented as feedback. Similarly, Dracos and Henry (2021) found general effects for WM in both offline and online processing, but the effect of WM did not differ depending on the explicitness of feedback during SI-like training.
In contrast, others have reported that explicitness does matter, with higher WM leading to greater gains under explicit but not under more implicit training conditions (Indrarathne & Kormos, 2018; Tagarelli, Borges-Mota, & Rebuschat, 2011). Thus, more research is necessary to tease apart the role of WM under different training conditions, informing questions about aptitude–treatment interactions more generally.
The Present Study
The goal of the present study was to partially replicate and extend Fernández (2008: Experiment 2) by examining the role of EI in PI and exploring how it affects the development of a complex form.³ Accordingly, the study focuses on a comparison of two trainings targeting the Spanish subjunctive: PI, which includes EI, and SI, which does not. The study builds on previous research, first, by controlling for a methodological bias present in previous work, namely, differences in pretraining exposure. As such, the PI and SI trainings are balanced for both the amount of time and the number of examples shown before practice. Second, the present study includes a control group (C+) that received the same amount of exposure as the PI and SI groups. Thus, this study examines whether PI/SI training provides effects beyond simple, meaningful exposure to the form. Third, pretests, post-tests, and delayed tests were included to examine the retention of the target form. Finally, we examined the potential moderating effects of WM, which may play a different role in successful L2 development under different conditions.
Thus, the present study is, first and foremost, a partial replication of Fernández’s (Reference Fernández2008) Experiment 2, which has not yet been replicated, whereas Experiment 1 has been replicated numerous times. For our partial replication, we adopt the three research questions from Fernández, with the one motivated difference being the addition of pre-practice exposure for the SI group.
RQ 1: Do learners in the PI group correctly process Spanish subjunctive forms in expressions of doubt sooner than learners in the SI group when presented with a series of SI items?
RQ 2: Do learners in the PI group correctly process Spanish subjunctive forms in expressions of doubt faster (as measured by the time they take to submit their answers) than learners in the SI group?
RQ 3: Do learners in the PI group process Spanish subjunctive forms in expressions of doubt more accurately than learners in the SI group after having reached criterion?
The present study also extended Fernández’s study by examining short- and longer-term effects of training, the role of task essentialness, and the effects of individual differences, all of which have become important topics in PI research (DeKeyser & Prieto Botana, Reference DeKeyser and Prieto Botana2015; Henry, Reference Henry, Wong and Barcroft2024; Prieto Botana, Reference Prieto Botana2013). Thus, this study addresses the following research questions:
RQ 4: Do learners in the PI group show an increased ability to interpret the target form on immediate and delayed post-tests relative to the SI and/or C+ groups?
RQ 5: Do individual differences in WM moderate the effects of the PI, SI, and C+ training conditions?
Methods
All materials, experimental data, and analysis scripts are available on the Open Science Framework (OSF): https://osf.io/xrd4g/?view_only=282501b59dac41dfba5d646610d76624.
Participants
Given the primary goal of replicating Fernández (2008: Experiment 2), we conducted a power analysis using G*Power (version 3.1.9.4; Faul, Erdfelder, Lang, & Buchner, 2007) to determine an appropriate sample size. According to the analysis, a sample size of 34 participants per group would be needed to detect the same effect size found by Fernández (d = 0.7) with 80% power using independent t tests (α = .05). To obtain this sample, we recruited intermediate-low L2 Spanish learners from second-semester, university-level Spanish courses.⁴ Ninety-nine participants were recruited and pseudo-randomly assigned to the PI group (n = 34), the SI group (n = 33), or the control plus exposure group (C+; n = 32). Time and financial constraints prevented further participant recruitment.
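As a rough cross-check on this sample size, the required n per group can be approximated in closed form. The sketch below does not reproduce the G*Power procedure. It uses the standard normal approximation n ≈ 2(z₁₋α/₂ + z_power)² / d² and assumes a two-tailed test, which the text above does not state explicitly. The function name and default values are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-group comparison of means.

    Normal approximation (not the exact noncentral-t calculation);
    assumes a two-tailed test and equal group sizes.
    """
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for a two-tailed alpha
    z_power = z.inv_cdf(power)             # quantile corresponding to target power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


approx_n = n_per_group(0.7)
```

With d = 0.7, the approximation gives 33 participants per group. Because the normal approximation tends to fall slightly below a calculation based on the t distribution, this is in line with the 34 per group reported from G*Power.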
Participants were recruited before exposure to the target form in their Spanish course, and we confirmed that they lacked previous knowledge of the form through a prescreening and debriefing questionnaire (see Supplementary Materials, Appendices A and B). Participants were excluded from analyses if they had ≥3 years of Spanish instruction before university (PI: n = 1; SI: n = 1; C+: n = 2), or if they demonstrated accuracy >60% (following Fernández, 2008) on target items on the pretest (PI: n = 3; SI: n = 2; C+: n = 1).
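The two exclusion rules can be expressed as a simple filter. The sketch below is illustrative only. The `Participant` fields are hypothetical, and it assumes that pretest accuracy is computed over 12 target items, so that the >60% threshold corresponds to 8 or more correct (7/12 ≈ 58%, 8/12 ≈ 67%).

```python
from dataclasses import dataclass

N_TARGET_ITEMS = 12  # assumption: accuracy is computed over 12 pretest target items


@dataclass
class Participant:
    pid: str                                 # hypothetical participant identifier
    years_spanish_before_university: float   # prior instruction, in years
    pretest_targets_correct: int             # correct responses on target items


def is_excluded(p: Participant) -> bool:
    """Apply the two exclusion criteria: >=3 years of prior instruction or >60% pretest accuracy."""
    pretest_accuracy = p.pretest_targets_correct / N_TARGET_ITEMS
    return p.years_spanish_before_university >= 3 or pretest_accuracy > 0.60


# Invented records for illustration.
sample = [
    Participant("p1", 0, 7),   # 58% accuracy: retained
    Participant("p2", 0, 8),   # 67% accuracy: excluded
    Participant("p3", 3, 0),   # 3 years of prior instruction: excluded
]
retained = [p.pid for p in sample if not is_excluded(p)]
```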
The final sample consisted of 29 participants in the PI group⁵ (15 women), 30 participants in the SI group (11 women), and 30 participants in the C+ group (24 women). A series of one-way between-participants analyses of variance (ANOVAs) indicated that the groups did not differ in their current age, F(2,86) = 0.19, p = .83, age of first exposure to Spanish, F(2,81) = 1.35, p = .27, number of languages spoken, F(2,86) = 1.94, p = .15, composite WM score, F(2,76) = 0.20, p = .82, or Spanish vocabulary test scores, F(2,86) = 0.18, p = .84 (Table 1). However, there was a difference in pretest scores for subjunctive items, F(2,86) = 5.29, p = .007, with the C+ group scoring significantly lower than the PI (p = .004) and SI groups (p = .03) as determined by follow-up pairwise comparisons with Tukey-adjusted p values.
Note. WM composite maximum score = 28. SUB = subjunctive; WM = working memory.
Target Structure
Following Fernández (2008) and other PI research (Farley, 2001, 2004), we examined the third-person singular Spanish subjunctive in expressions of doubt (e.g., No creo que-[DOUBT] baile-[3psgSUB] todos los días [“I do not believe that she/he/you dance{s} every day”]). This form is considered complex in that it (a) is learned late in the L2 (Contreras Aedo & Cabrera, 2013), (b) lacks saliency and communicative value (VanPatten, 2004), and (c) has been identified by teachers as difficult (Collentine, 1995). The indicative/subjunctive alternation is also not present in expressions of doubt in the participants’ L1.
Materials and Procedures
Training conditions
The current study employed two experimental training conditions, PI and SI, and a control training condition, C+. Each training condition included exposure to the Spanish subjunctive followed by practice: the PI group received EI followed by SI practice; the SI group received exposure to the target form followed by SI practice; and the C+ group received exposure to the target form followed by practice that was not task-essential.
The PI group first received EI about the Spanish subjunctive (see Supplementary Materials, Appendix C). The EI provided in the present study was the same as the EI in Fernández (2008), which was originally developed by Farley (2000). It described how the form is conjugated and explained that the form is triggered by expressions of doubt and is often overlooked by learners because of its semantic redundancy. Within the EI, participants were exposed to 13 subjunctive verbs either in isolation or in an example sentence. Participants were asked to read the EI, and the exposure lasted approximately 3 min. After receiving EI, the PI group completed structured input training with the target form. The structured input task was the same as the task in Fernández (2008; see Supplementary Materials, Appendix E) and consisted of 30 half-sentences in which the verb was marked as either indicative (n = 9) or subjunctive (n = 21). As shown in (1), participants heard these half-sentences and matched them to one of two possible sentence beginnings displayed on the computer screen.
Participants then received simple, one-word feedback indicating the accuracy of their answer (i.e., whether the mood of the verb matched the sentence beginning that they selected). The only modification from Fernández (Reference Fernández2008) was that the sentences were re-recorded so that a single L1 Spanish speaker was heard across all tasks.
The SI group first received pre-practice exposure to the same 13 subjunctive verbs that the PI group received in the EI. This was necessary to ensure that the SI group received the same amount of exposure to the target form before training, thus accounting for a limitation in Fernández’s (2008) study. The length of exposure to these forms (i.e., ~3 min) was the same as the EI. However, no information about the forms was provided. Instead, as seen in (2), the 13 verbs were embedded in meaningful question-and-response dialogs.
This pre-practice exposure was followed by the exact same structured input training that the PI group received.
The C+ group was included in this study to address another methodological limitation in Fernández (2008), that is, the lack of a true control group. The C+ group received the same pre-practice exposure as the SI group. After pre-practice exposure, the C+ group did not receive structured input like the other groups. Instead, they completed practice that was not task-essential but contained the same 30 sentences that appeared in the structured input. During training, participants saw two images and heard the target sentence in Spanish. They then chose the picture that was most closely related to the sentence. The images were selected to reflect the general topic of the sentence (or not) and were not expected to direct participants’ attention to the subjunctive/indicative distinction (Figure 2). Participants then received one-word feedback on their answers.
Interpretation Assessment
To measure L2 knowledge of the Spanish subjunctive, the current study adopted the interpretation test used in Farley (2000) as pre-, post-, and delayed post-tests (see Supplementary Materials, Appendix F). Thirty-six target items were distributed among three versions of the interpretation test such that each version consisted of 24 items (12 target; 12 distractors). In each version, the target items comprised nine expressions of doubt using the subjunctive, as in (3), and three expressions of certainty using the indicative, as in (4). The 12 distractor items were unrelated to the target form.
The three test versions were assigned to participants such that each participant received a different version for their pre-, post-, and delayed post-tests. As in the SI practice, participants heard the second half of the sentence and then chose between two possible sentence beginnings. No feedback was provided during any of these tests.
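One way to guarantee that each participant sees a different version at each testing time is a cyclic rotation of the versions across participants. The sketch below is a hypothetical illustration of such a scheme. The text above does not specify how versions were actually ordered, and the version labels "A", "B", and "C" and the function name are invented for the example.

```python
from typing import Dict, List

VERSIONS: List[str] = ["A", "B", "C"]  # hypothetical labels for the three test versions
SESSIONS: List[str] = ["pretest", "post-test", "delayed post-test"]


def assign_versions(participant_index: int) -> Dict[str, str]:
    """Rotate the version order by participant so no version repeats within a participant."""
    shift = participant_index % len(VERSIONS)
    rotated = VERSIONS[shift:] + VERSIONS[:shift]
    return dict(zip(SESSIONS, rotated))


# First three participants cover all three orders of the rotation.
schedule = [assign_versions(i) for i in range(3)]
```

Under this rotation, every third participant receives the same order, and each version appears equally often at each testing time when group sizes are multiples of three.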
Working Memory
WM capacity was assessed with complex span tasks that required participants to both store and process information. To maximize validity while also managing time constraints, shortened operation, reading, and symmetry span tasks (Oswald, McAbee, Redick, & Hambrick, Reference Oswald, McAbee, Redick and Hambrick2015) were used.
In the operation span task, participants were instructed to remember a series of letters that were presented in sequences ranging in length from 3 to 7 letters. Between the presentation of each letter, participants completed a processing task. In this case, they solved a simple math problem with a true or false response (e.g., Slide 1: [2×2]-1=? | Slide 2: 3 – Yes or No?). After participants had seen all the letters and math problems in a given sequence, they reported which letters they saw and the order in which they saw them.
The reading span task followed the same structure as the operation span task, but instead of solving math problems, the learners made semantic plausibility judgments about English sentences (e.g., Not plausible: “The man likes to run in the spaghetti”; Plausible: “The man likes to run in the park”).
The symmetry span task followed the same structure as the previous two tasks. However, the storage items and processing tasks were different. The storage items in this task were the sequence and location of red squares that appeared in a 4×4 grid, and the processing task required participants to determine whether images were symmetrical across their horizontal axis or not. As in the other tasks, at the end of each sequence, participants reported the location of each square that they had seen (i.e., the storage items) in the order in which they saw them.
Procedure
This study was conducted over two sessions separated by 2 weeks (M = 15.72 days, SD = 4.03). During the first session, participants completed a consent form, the LEAP-Q background questionnaire (Marian, Blumenfeld, & Kaushanskaya, Reference Marian, Blumenfeld and Kaushanskaya2007), and a vocabulary quiz, which controlled for variable knowledge of the target verbs and was based on a review sheet provided to participants before the experiment (see Supplementary Materials, Appendix G and H). After this quiz, participants completed the pre-test, followed by the training (PI, SI, or C+), and the immediate post-test.
In the second session, participants signed a second consent form before completing the delayed post-test and the WM span tests.Footnote 6, Footnote 7 All the experimental tasks were administered through E-Prime Professional (version 2.0.10.356). After participants completed these tasks, the researcher orally administered a debriefing questionnaire (see Appendix B) and recorded their answers. All participants were paid for their participation.
Analysis and results
RQs 1–3: Partial Replication of Fernández (Reference Fernández2008, Experiment 2)
Analysis
RQs 1–3 asked whether EI affects how learners begin to interpret the Spanish subjunctive. Analyses focused on participant responses given by the PI and SI groups during training. Four scoring metrics were computed: (a) the proportion of participants to reach criterion, (b) trials to criterion, (c) accuracy after criterion, and (d) average RT. Each of these is described in detail subsequently.
First, criterion was defined in the same way that it was in Fernández (Reference Fernández2008): answering four questions correctly in a row at any point in the experiment. The proportion of participants who reached this criterion was calculated for the PI and SI groups and compared using Fisher’s exact test.
Second, trials to criterion were first scored and analyzed using Fernández’s (Reference Fernández2008) original scoring method. For this metric, participants were awarded a score corresponding to the number of items they had seen before they answered four correctly in a row (Figure 1). Following Fernández, we removed participants who did not reach criterion from the data set. However, because Fernández’s scoring method eliminates participants who do not reach criterion, it only captures differences among those for whom training is effective. Therefore, we conducted a planned analysis in which we rescored the data using the scoring method used by Henry et al. (Reference Henry, Culman and VanPatten2009) and several subsequent studies (e.g., VanPatten et al., Reference VanPatten, Collopy, Price, Borst and Qualin2013; see Henry, Reference Henry2023). In this method, participants who did not meet criterion were not eliminated from the data set but were given a TTC score of 30 (corresponding to the number of items in the training). This method thus captures differences between all participants, regardless of whether they met criterion. We refer to these scoring methods as TTC-F and TTC-H, respectively.
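The difference between the two scoring methods can be made concrete with a short sketch. It assumes that a TTC score counts the items preceding the first run of four correct answers; the function names and the toy response sequences are illustrative and are not taken from the study materials.

```python
CRITERION_RUN = 4      # four correct answers in a row
N_TRAINING_ITEMS = 30  # number of items in the training


def trials_to_criterion(responses, run=CRITERION_RUN):
    """Items seen before the first run of `run` correct answers, else None."""
    streak = 0
    for i, correct in enumerate(responses):
        streak = streak + 1 if correct else 0
        if streak == run:
            return i + 1 - run  # assumed: count only items preceding the run
    return None


def ttc_f(all_responses):
    """TTC-F: participants who never reach criterion are dropped."""
    scores = [trials_to_criterion(r) for r in all_responses]
    return [s for s in scores if s is not None]


def ttc_h(all_responses, ceiling=N_TRAINING_ITEMS):
    """TTC-H: participants who never reach criterion receive the ceiling score."""
    scores = [trials_to_criterion(r) for r in all_responses]
    return [ceiling if s is None else s for s in scores]


if __name__ == "__main__":
    fast = [0] + [1] * 29           # one error, then all correct
    never = [1, 1, 1, 0] * 7 + [1, 1]  # never four in a row
    immediate = [1] * 30            # correct from the first item
    data = [fast, never, immediate]
    print("TTC-F:", ttc_f(data))
    print("TTC-H:", ttc_h(data))
```

In the toy data, the participant who never reaches criterion disappears from the TTC-F scores but contributes a score of 30 to TTC-H, which is why the two metrics can differ in sample size and in group means.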
Third, accuracy after criterion represented the percentage of answers that a participant answered correctly after reaching criterion. This measure was scored exactly as in Fernández (Reference Fernández2008; Figure 1). Finally, RT was simply the time (in milliseconds) taken between the onset of the stimulus and the participant’s answer.
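These two metrics can be sketched as follows. The sketch assumes that "after reaching criterion" means all items following the fourth consecutive correct answer, and it returns no score when a participant never reaches criterion or reaches it on the final item; both choices, like the function names, are assumptions made for illustration.

```python
from statistics import mean


def criterion_end(responses, run=4):
    """Index just past the first run of `run` correct answers, or None."""
    streak = 0
    for i, correct in enumerate(responses):
        streak = streak + 1 if correct else 0
        if streak == run:
            return i + 1
    return None


def accuracy_after_criterion(responses, run=4):
    """Percent correct on the items that follow the criterion run."""
    end = criterion_end(responses, run)
    if end is None or end == len(responses):
        return None  # criterion never met, or no items left to score
    remaining = responses[end:]
    return 100 * sum(remaining) / len(remaining)


def mean_rt(rts_ms):
    """Average reaction time in milliseconds across trials."""
    return mean(rts_ms)


if __name__ == "__main__":
    responses = [0, 1, 1, 1, 1, 1, 0, 1, 1]
    print(accuracy_after_criterion(responses))
    print(mean_rt([1200, 1500, 1800]))
```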
TTC-F, TTC-H, accuracy after criterion, and RT were all analyzed using independent-samples t tests. To better compare results across samples, Hedges’ g effect sizes were calculated for both the results reported in Fernández (Reference Fernández2008) and the present study. These effect sizes were interpreted according to Plonsky and Oswald’s (Reference Plonsky and Oswald2014) field-specific recommendations for between-group comparisons, with .40 indicating a small effect, .70 indicating a medium effect, and 1.00 indicating a large effect.
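For readers who wish to check an effect size against a reported t statistic, the sketch below converts t to g. It assumes a pooled-variance t test and the common small-sample correction 1 - 3/(4df - 1); the text does not state how g was computed, so this is one plausible route rather than the authors' procedure. The labels applied to values between benchmarks are also an assumption.

```python
import math


def hedges_g_from_t(t, n1, n2):
    """Magnitude of Hedges' g recovered from an independent-samples t value."""
    d = t * math.sqrt(1 / n1 + 1 / n2)  # Cohen's d from t
    df = n1 + n2 - 2
    correction = 1 - 3 / (4 * df - 1)   # approximate small-sample correction
    return abs(d) * correction


def label_effect(g):
    """Bin g using the .40 / .70 / 1.00 benchmarks (binning rule assumed)."""
    g = abs(g)
    if g >= 1.00:
        return "large"
    if g >= 0.70:
        return "medium"
    if g >= 0.40:
        return "small"
    return "below small"


if __name__ == "__main__":
    # t values and group sizes as reported in the Results for this study
    checks = {
        "TTC-F": (-0.103, 26, 19),
        "TTC-H": (-2.044, 29, 30),
        "Accuracy after criterion": (4.447, 26, 19),
        "RT": (-1.446, 26, 19),
    }
    for name, (t, n1, n2) in checks.items():
        g = hedges_g_from_t(t, n1, n2)
        print(f"{name}: g = {g:.2f} ({label_effect(g)})")
```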
Results
Proportion of learners meeting criterion
The proportion of learners from the PI group who reached criterion was 89.7% (26 of 29), whereas it was 63.3% (19 of 30) in the SI group. Fisher’s exact test confirmed that this proportion was significantly higher in the PI than in the SI group (p = .038). This reproduces the findings from Fernández (Reference Fernández2008), which also indicated a significantly greater proportion of participants reaching criterion in the PI group (76%) compared with the SI group (50%).
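The comparison of proportions can be illustrated with a small standard-library implementation of Fisher's exact test applied to the counts above. It uses one common two-sided definition (summing the probabilities of all tables no more likely than the observed one); statistical packages differ in how they define the two-sided p-value, so the result need not match the reported value exactly. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance guards against floating-point ties
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


if __name__ == "__main__":
    # Rows: PI, SI; columns: reached criterion, did not reach criterion
    p = fisher_exact_two_sided(26, 3, 19, 11)
    print(f"p = {p:.3f}")
```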
Trials to criterion
Descriptive statistics show that the PI group had lower TTC scores than the SI group (Table 2). An independent t test demonstrated that this difference was not statistically significant for the TTC-F measure, and the effect was very weak (t[43] = –0.103, p = .918, g = 0.03). However, a separate t test did reveal a statistically significant difference for TTC-H with a small effect size (t[57] = –2.044, p = .046, g = 0.53). The results of the TTC-F measure thus contradict Fernández (Reference Fernández2008), who reported that the PI group required significantly fewer trials to reach criterion than the SI group, with a small to medium effect size (p < .05, g = 0.66). However, the TTC-H measure replicates the effect.
Accuracy after criterion
Participants in the PI group performed more accurately after reaching criterion than participants in the SI group (Table 2). The independent t test confirmed that this difference was significant and had a large effect size (t[43] = 4.447, p < .001, g = 1.32). These results reproduce the findings from Fernández (Reference Fernández2008, Experiment 2), who also found an advantage for PI over SI with a medium effect size (p < .05, g = 0.76).
Reaction times
Participants in the PI group had descriptively faster RTs during training than participants in the SI group (Table 2). An independent t test, however, showed that there was no significant difference in RTs between the two groups, and the effect was small (t[43] = –1.446, p = .16, g = 0.43). Conversely, Fernández (Reference Fernández2008) reported that the PI group had significantly quicker RTs than the SI group with a medium effect size (p < .05, g = 0.80).
RQ 4: L2 development
Analysis
RQ4 asked how EI and task-essentialness affect L2 development at immediate and delayed testing. Analyses focused on the interpretation assessments. These tests were first scored for accuracy by awarding 1 point for each correct response to target items (max score: 12 points; 9 subjunctive, 3 indicative), and scores are reported as percent correct.Footnote 8 Statistical analyses were conducted via a linear mixed model that assessed learners’ accuracy on subjunctive (SUBJ) items. The model was built using the lme4 package (Version 1.1-19; Bates, Mächler, Bolker, & Walker, Reference Bates, Mächler, Bolker and Walker2015) in R and included the primary fixed effects of training (Instruction.D), time, and the interaction between them. The maximal model included subject random slopes and intercepts; however, this model resulted in a singular fit. Thus, the final model included only random intercepts for subject (ID), as specified below:
Model.1 <- lmer(Acc ~ Time.D * Instruction.D + (1 | ID), data = p2data, REML = TRUE)
This linear mixed-effects model approach allowed us to include participants who had not returned for the delayed post-test (PI: n = 5; SI: n = 2; C+: n = 1; West, Welch, Gałecki, & Gillespie, Reference West, Welch, Gałecki and Gillespie2015). For the purposes of interpretation, results were computed in R for main effects (ANOVA, type III format with the Kenward-Roger approximation for degrees of freedom) using the lmerTest package (Version 3.0-1; Kuznetsova, Brockhoff, & Christensen, Reference Kuznetsova, Brockhoff and Christensen2017), and partial eta squared (ηp2) was estimated from the F value. Tukey follow-up tests were computed using the emmeans package (Version 1.3.2; Lenth, Reference Lenth2022).
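Estimating partial eta squared from an F value follows the standard identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the F values and degrees of freedom reported for this model in the Results; the function name is illustrative, and the identity is assumed to be the one the authors used.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its df."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


if __name__ == "__main__":
    reported = {
        "time": (21.54, 2, 167.583),
        "training": (4.59, 2, 85.729),
        "time x training": (3.04, 4, 167.513),
    }
    for effect, (f_value, df1, df2) in reported.items():
        print(f"{effect}: {partial_eta_squared(f_value, df1, df2):.2f}")
```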
Results
Descriptive statistics are displayed by training condition and time in Table 3 and are presented with distributional data in Figure 3, which shows performance that appears to be around chance level except for the PI group on the post-test. The ANOVA (type III) run on the linear mixed-effects model revealed statistically significant main effects for time, F(2, 167.583) = 21.54, p < .001, ηp2 = .20, training, F(2, 85.729) = 4.59, p = .013, ηp2 = .10, and the time×training interaction, F(4, 167.513) = 3.04, p = .019, ηp2 = .07.
C+ = control. PI = processing instruction. SI = structured input.
To interpret the statistically significant interaction, we conducted follow-up Tukey tests (Table 4). These tests showed some differences between the groups at pretest, with the C+ group scoring lower than the PI and SI groups. On the immediate post-test, the PI group outperformed both the SI and C+ groups, which performed similarly. On the delayed post-test, there were no differences between the groups. Results also showed that, while all three groups evinced at least marginally significant improvement from the pretest to the post-test, only the C+ group maintained its gains, scoring higher on the delayed post-test than on the pretest, although it still did not perform above chance level.
C+ = control. PI = processing instruction. SI = structured input. *p < .05. **p < .01, ***p < .001. †p < .10.
RQ 5: L2 development and working memory
Analysis
RQ5 asked whether individual differences in WM differentially account for L2 development. This question was answered by investigating the relationship between WM scores and the interpretation assessments. WM scores were computed as follows: first, for each WM task, participants’ accuracy and errors were recorded for both storage and processing items. The operation and reading span tasks had a total of 30 items, while the symmetry span task had a total of 24 items. A partial-scoring method was used, wherein participants received one point for each storage item recalled in the correct order (Conway et al., Reference Conway, Kane, Bunting, Hambrick, Wilhelm and Engle2005). To ensure an acceptable level of accuracy, and thus attention, on the processing components of the task (Conway et al., Reference Conway, Kane, Bunting, Hambrick, Wilhelm and Engle2005), a criterion for inclusion was set at ≥60% on the processing items; all participants met this requirement. Scores from the three tasks were then averaged to create WM composite scores, which were converted to z-scores for analysis. Note that, in addition to the eight participants who did not return for the second session, two participants were not able to complete the WM span tasks (PI: n = 1; C+: n = 1) and thus were not included in the analyses. Four participants completed two of the three span tasks, so their averages were calculated based on their two scores.
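The scoring steps described above can be sketched as follows. The sketch assumes that "recalled in the correct order" means recalled in the correct serial position, that composites are simple means of the available raw task scores, and that z-scores use the sample standard deviation; the function names and toy values are illustrative only.

```python
from statistics import mean, stdev

PROCESSING_CUTOFF = 0.60  # inclusion criterion on the processing items


def partial_score(presented, recalled):
    """One point per storage item recalled in its correct serial position."""
    return sum(p == r for p, r in zip(presented, recalled))


def meets_processing_criterion(n_correct, n_items):
    """True if processing accuracy is at or above the 60% cutoff."""
    return n_correct / n_items >= PROCESSING_CUTOFF


def composite(task_scores):
    """Mean of the available span scores; None marks a task not completed."""
    available = [s for s in task_scores if s is not None]
    return mean(available)


def z_scores(values):
    """Standardize composites across participants (sample SD assumed)."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


if __name__ == "__main__":
    print(partial_score("FKPQ", "FKQP"))       # two letters in the right position
    print(meets_processing_criterion(20, 30))
    composites = [composite([10, 10, 10]), composite([20, 20, None]),
                  composite([30, 30, 30])]
    print(z_scores(composites))
```

Averaging over whichever tasks are available mirrors the handling of the four participants who completed only two of the three span tasks.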
Before running statistical analyses, we first calculated the reliability index α following the developers of the tasks (Oswald et al., Reference Oswald, McAbee, Redick and Hambrick2015) using the R script provided by the Engle laboratory (Tsukahara, Reference Tsukahara2022). Reliability fell within a similar range for the three WM tasks (operation span: 0.62; reading span task: 0.53; symmetry span task: 0.65). These values were in line with those of Oswald et al. (operation span: 0.71; reading span task: 0.54; symmetry span task: 0.59), who reported an overall acceptable composite α of 0.76.
Statistical analyses assessed learner accuracy with respect to the potential three-way interaction of training, time of testing, and WM. Analyses were performed using separate linear mixed-effects models built with the lme4 package (Version 1.1-19; Bates et al., Reference Bates, Mächler, Bolker and Walker2015) in R. The full model included the fixed effects of time, training (Instruction.D), WM z-scores, and the interactions between them. Subject (ID) was included as a random effect:
Model.2 <- lmer(Acc ~ Time.D * Instruction.D * WM.z + (1 | ID), data = p2data, REML = TRUE)
As done for RQ4, results from the linear mixed-effects model are reported using ANOVA (type III) format with the Kenward-Roger approximation. However, to best capture and visualize any three-way interactions, follow-ups are reported through simple slopes using the effects package (Version 4.1-0; Fox, Reference Fox2003; Fox & Weisberg, Reference Fox and Weisberg2018, Reference Fox and Weisberg2019) in R. Below, we present and interpret only those effects and interactions that include WM, as these inform the research question. The full statistical output for these tests is found in Appendix I in the online supplemental materials.
Results
The ANOVA (type III) on the linear mixed-effects model revealed significant main effects for time, F(2, 146) = 18.29, p < .001, ηp2 = .20, and training, F(2, 73) = 5.39, p = .007, ηp2 = .13, which were qualified by significant interactions for time × training, F(4, 146) = 3.08, p = .018, ηp2 = .08, and for time × training × WM, F(4, 146) = 3.12, p = .017, ηp2 = .08.
We followed up this three-way interaction by examining simple slopes, which allowed us to test how participants’ WM capacity influences accuracy across testing times for each training condition. To do this, we examined the interaction between time and training at three different values of WM (i.e., mean WM and high/low WM, defined as ±1 SD from the mean; Figure 4). Descriptively, all three training conditions pattern similarly at low WM and show maximal divergence at high WM. Thus, we expect that performance at high WM is driving the three-way interaction.
This was confirmed by the linear mixed-effects model, which was rotated to capture differences at each level of WM with the PI group acting as the baseline (Table 5 represents each rotated level in columns). At each rotated level of the model, we were interested in whether a unique contribution of WM is present. To interpret the three-way interaction, we thus examined the two-way time × training interaction at each level of WM. This analysis revealed that the two-way time × training interaction (Table 5) was significant at mean and high WM, such that (a) at mean and high WM, the C+ group’s slope from pre- to delayed testing was greater than the PI’s slope, and (b) at high WM, the SI group’s slope from pre- to post-testing was less than the PI group’s slope. Moreover, when we re-leveled the model so that the C+ group acted as the baseline (Table 6), the two-way time × training interaction, again, was significant at mean and high WM, such that (a) at mean and high WM the PI group’s slope from pre- to delayed-testing was less than the C+ group’s slope (which parallels the findings between these two groups in Table 5), and (b) at high WM, the SI group’s slope from pre- to post-testing was less than the C+ group’s slope. These findings suggest that higher WM capacity benefitted both the performance gains on the immediate post-test in the PI and C+ groups (compared with SI) and the performance gains on the delayed post-test in the C+ group (compared with PI).
Note: The intercept reflects Time.Pre:Training.PI. GLM = generalized linear model; SI = structured input; WM = working memory. ***p < .001, **p < .01, *p < .05.
Note: The intercept reflects Time.Pre:Training.C+. C+ = control; GLM = generalized linear model; PI = processing instruction; SI = structured input; WM = working memory. ***p < .001, **p < .01, *p < .05.
Discussion
Summary
The present study first sought to partially replicate Fernández (Reference Fernández2008: Experiment 2), and thus focused on learners’ processing of the Spanish subjunctive during training with (PI) or without EI (SI). The study also examined learner development over time, comparing PI and SI with a control group that received meaningful exposure to the target form (C+). In the following sections, we discuss the results with reference to our research questions.
RQs 1–3: Partial replication of Fernández (Reference Fernández2008, Experiment 2)
The pattern of the results from the present study was descriptively consistent with that of Fernández in that PI, compared with SI alone, led to a greater proportion of learners reaching criterion, higher accuracy scores after reaching criterion, fewer trials to reach criterion, and faster RTs. However, these patterns were not all statistically reproduced. The present study did statistically reproduce the advantage of PI compared with SI in terms of the proportion of participants who reached criterion and accuracy after criterion. However, the results for TTC were mixed, reproducing the advantage for the PI group using Henry et al.’s (Reference Henry, Culman and VanPatten2009) scoring method (TTC-H), but not Fernández’s original method (TTC-F). Similarly, the present study found no statistically significant differences in the groups’ reaction times, whereas Fernández did. Despite these differences, the totality of evidence seems to confirm Fernández’s conclusion that EI played a significant role in training for this form. Results therefore suggest that when PI is used in classroom instruction targeting the Spanish subjunctive, learners may benefit if both EI and SI components are included in the instructional activities.
The differences between our results and Fernández’s, namely the lack of an advantage for the PI group in TTC-F and reaction times, may be attributed to the presence of pre-practice exposure. This exposure could have eased the processing burden for SI participants at the beginning of the experiment. Consider the TTC-F scores for the SI group. In the present study, these scores were lower compared with Fernández’s study (9.26 vs 12.10).Footnote 9 Given that the present participants had seen the form before, they may have been more ready to attend to the new grammatical form and begin processing it for meaning. Similarly, pre-practice exposure to the form may have allowed participants to react more quickly, at least at the outset of training, even as EI conferred additional advantages.
The differences between our two TTC scoring methods also warrant discussion. As noted previously, the TTC-F method eliminates participants who do not reach criterion, whereas the TTC-H method includes all participants, even those for whom the training was not effective. Thus, we believe that the TTC-H method is more relevant to the larger picture, although results for TTC-F do show that some learners are able to begin processing the target form quickly, even without EI. We suggest that further research using TTC could benefit from using both approaches, perhaps in conjunction with one of the alternatives suggested by Henry (Reference Henry2023).
RQ 4: L2 development
With regard to L2 performance over time (RQ4), different patterns of development emerged across conditions. Two results are particularly noteworthy. First, all three groups improved from pretest to post-test (despite only marginally significant differences for the SI group). This suggests that SI helps participants begin to associate the subjunctive with doubt phrases, although gains are generally quite limited and are not retained. Note that even though the C+ group did seem to maintain gains over time, it is difficult to attribute this to stable learning, given their low scores on the pretest relative to the other groups. Second, the PI group improved over and above the other two groups on the immediate post-test. Although this advantage disappeared on the delayed post-test, this does suggest that EI may lend participants an immediate advantage over more implicit training conditions, even when it does not lead to retention of the form (see also Goo et al., Reference Goo, Granena, Yilmaz, Novella and Rebuschat2015; Kang et al., Reference Kang, Sok and Han2019).
These results must be interpreted in the context of the generally low accuracy scores at each test time. It seems likely that these low scores stem from the rather limited training set, which consisted of only 30 trials, far fewer than in other studies on the Spanish subjunctive. For example, Farley’s (Reference Farley2001) PI training consisted of eight activities with approximately 99 subjunctive tokens over 90 min of practice. Indeed, Farley reported much greater improvement than in the present study (pretest: 39.8%, post-test: 85.3%, delayed post-test: 83.1%).
RQ 5: L2 development and working memory
Regarding individual differences, a significant three-way interaction emerged between time, training, and WM. Notably, the three conditions differed at higher levels of WM with both the PI and C+ conditions showing greater gains than the SI condition at the immediate post-test. That is, participants in both the PI and C+ conditions benefited from a higher WM capacity and showed greater gains than SI when tested just after training, but participants in the SI condition did not. These results may suggest that the three types of training engage memory processes differently. For example, the PI group may have relied on their ability to access EI in short-term WM throughout the training, as suggested elsewhere in the literature (Henry et al., Reference Henry, Culman and VanPatten2009). If learners rely on such a strategy but never fully process the relationship between the subjunctive form and the meaning of doubt, they may gain short-term benefits without long-term learning, as observed here. Indeed, for the delayed post-test, the results revealed that average and high WM learners showed greater gains in the C+ group as compared to PI. Thus, WM did not seem to play a beneficial role in the PI group in the long run (after 2 weeks). Somewhat similarly, these results point to differences between PI/SI and mere exposure: At the delayed post-test, there was no evidence for group differences on the interpretation task, yet WM only seemed to contribute to gains in the C+ group. Thus, there may be fundamental differences in the underlying processes between conditions despite relative similarities in learning outcomes.
Although WM played a role in training in this study, these results should be considered in the context of other studies on EI and WM, which have been inconsistent. These results are in line with studies such as Tagarelli et al. (Reference Tagarelli, Borges-Mota and Rebuschat2011) and Indrarathne and Kormos (Reference Indrarathne and Kormos2018), which found a positive relationship between WM and L2 development when learners were given or searched for rules. However, the results also contrast with studies that found no effect when learners were given rules (Dracos & Henry, Reference Dracos and Henry2021; Sanz et al., Reference Sanz, Lin, Lado, Stafford and Bowden2016). Given the differences in the training paradigms represented in these studies, it is difficult to ascertain why WM had an effect in this study but not in some others. One possibility is that prior research has focused on different target forms, and that WM has differential effects for more complex versus less complex forms. Nevertheless, further research in this area is needed to draw solid conclusions and make pedagogic recommendations. In particular, we would recommend a systematic approach to studying WM that manipulates particular variables (e.g., complexity of form, type of EI, task essentialness of practice, length of treatment) and also controls for as many variables as possible between studies. To that end, multisite replication research (e.g., Morgan-Short et al., Reference Morgan-short, Marsden, Heil, Issa, Leow, Mikhaylova, Mikołajczak, Moreno, Slabakova and Szudarski2018) or multi-experiment studies that test a variety of target forms (e.g., VanPatten et al., Reference VanPatten, Collopy, Price, Borst and Qualin2013) may prove useful.
Conclusions and Limitations
The present study was a partial replication of Fernández (Reference Fernández2008: Experiment 2), which sought to address methodological issues from the original study, namely the lack of balance between the PI and SI groups. Additionally, it expanded on Fernández by including, first, post-test assessments and a true control group to investigate learner performance over time, and second, measures of WM. Results showed that, overall, the PI group was quicker to begin processing the form correctly, supporting Fernández’s (Reference Fernández2008) original conclusions. Although immediate post-tests show advantages for PI over the other training groups, these were not sustained. Analyses further indicated that higher WM facilitated gains in learner accuracy, but its effects only played a role in the PI and C+ training conditions.
Of course, this study had its limitations, most notably the amount of exposure to the target form, as discussed in the preceding section. We also believe that future research could improve on the tests in the present study (which followed Farley, Reference Farley2000) if they included more than nine subjunctive and three indicative items. With more exposure to the target form and additional interpretation task items, greater levels of learning might be evidenced in future research, allowing for greater reliability. Additionally, we noted differences between the C+ group and the PI/SI groups at pretest. Despite these limitations, the present study provided a valuable contribution to the research landscape, first and foremost improving on and replicating the results of Fernández’s (Reference Fernández2008) Experiment 2, which has not been replicated before, even as Experiment 1 has received continued attention. Beyond replication, however, this study also provided critical information about learner performance over time, showing that initial advantages for PI do not always translate into sustainable gains. Finally, this study contributed to a small but growing number of studies that specifically address the interaction between EI and WM.
Acknowledgements
Nick Henry and Briana Villegas are designated as co-first authors. Briana Villegas is now independently affiliated. The authors acknowledge Claudia Fernández for sharing her materials. They also acknowledge audiences at the Second Language Research Forum (2017, 2019) and members of the Cognition of Second Language Acquisition laboratory for their comments on previous iterations of this project. The authors also acknowledge Minnie Pham, Yasiel Lacalle, Ana Hernandez, Anahi Gante, and Hannah Chadda for their help with data collection, with a special thanks to Ana Hernández who led data collection efforts while the first author was on maternity leave. All errors are our own.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263124000524.
Data availability statement
The experiment in this article earned Open Data and Open Materials badges for transparent practices. The materials and data are available at https://osf.io/xrd4g/.