Highlights
• High individual differences in voice processing and L2 phoneme learning
• CFAs support voice processing and L2 phoneme learning being distinct abilities
• SEMs of accuracy and reaction time data show that voice processing ability predicts L2 phoneme learning
1. Introduction
Anyone who has taken a second-language (L2) course will have noticed that we display considerable individual differences in language learning. Some people struggle with the most basic abilities, while others seem to absorb linguistic knowledge effortlessly. One of the most challenging aspects of learning an L2 is the acquisition of its speech sounds (i.e., phonemes), an ability subject to great individual differences, with only a minority of learners achieving high proficiency (Schmitz et al., 2018; Sebastian-Galles & Baus, 2005; Sebastian-Galles & Díaz, 2012). Studies have found that individual differences in L2 phoneme command persist even after accounting for comparable experiences and opportunities to learn the L2 (Archila-Suerte et al., 2016; Díaz et al., 2012; Sebastian-Galles & Baus, 2005; Sebastian-Galles & Díaz, 2012). Yet, the learner-related factors that impact L2 phoneme command are poorly understood. A recent study (Díaz et al., 2022) showed that individual differences in L2 phoneme proficiency were related to the ability to recognize trained (i.e., learned) voices. Here, we tested whether voice processing abilities (operationalized as the ability to recognize and discriminate voices) can predict attained L2 phoneme learning in a sample of early bilingual adults, using a battery of behavioral tests and structural equation models (SEMs).
Speech is a highly variable and complex signal. It contains both linguistic information, which reflects the message the speaker intends to transmit, and voice information, which provides cues about various characteristics of the speaker. Listeners use linguistic information to understand what is being said, while voice information is exploited for successful social interactions (Nygaard & Tzeng, 2021). The complexity and variability of the speech signal are largely due to these two types of information not being discretely encoded; there is no one-to-one mapping between the percepts of phonemes and their acoustic correlates across speakers (Peterson & Barney, 1952). The anatomy of the vocal tract, which is responsible for speech production, is unique to each speaker. Consequently, the acoustic characteristics of each speaker’s voice are also unique. The main acoustic features that characterize voices are the average fundamental frequency, which is perceived as voice pitch, and the frequency values of formants (i.e., resonances of the vocal tract), which cause the percept of vocal timbre (Baumann & Belin, 2010; Ghazanfar & Rendall, 2008; Latinus & Belin, 2011). While the first and second formants (F1 and F2) are claimed to be the primary cues to vowel identity (Fox et al., 1995; Yang & Fox, 2014), higher formants have been proposed to carry most of the vocal timbre information, as they exhibit minimal within-speaker variation across vocalizations (Kitamura & Akagi, 1995). However, as stated, the spectral values of all formants are determined by the anatomy of the speaker’s vocal tract. Therefore, theories of speech perception must address how the perceptual system resolves the lack of invariance between speech-sound (i.e., phoneme) percepts and their acoustic correlates across speakers.
Many of the solutions proposed by speech perception theories to this lack-of-invariance problem require the accurate identification of the speaker-specific spectro-temporal changes embedded in the speech signal. Speaker normalization theories argue that speech perception is accomplished by initially identifying the acoustic idiosyncrasies introduced into the speech stream by the speaker and discarding them from further processing. Thus, only the phoneme cues that enable the recognition of the corresponding phoneme representations are retained (Choi et al., 2018; Johnson & Sjerps, 2021; Nusbaum & Magnuson, 1997; Zhang & Chen, 2016). However, the specific acoustic cues onto which normalization is applied vary across theoretical proposals, such as the ratio between the F1 and F2 of vowels or the absolute fundamental frequency, and remain a matter of debate (for a review, see Persson & Jaeger, 2023). Conversely, distributional (Kleinschmidt & Jaeger, 2015; McMurray & Jongman, 2011) and exemplar-based (Goldinger, 1998; Klatt, 1979; Sumner et al., 2014) models of speech perception do not consider voice information as noise to be discarded, but rather as fundamental for speech perception. These models propose that the speech perceptual system resolves the lack of invariance between phoneme percepts and their acoustic correlates by representing voice-dependent variations of speech. While distributional models claim that listeners retain statistical distributions of the range of variability of phoneme cues across speakers, exemplar-based models propose that listeners store memory traces of actual speech segments that contain both linguistic and voice details. Thus, according to these two theoretical proposals, speaker variations of phoneme productions are accounted for by inferring the most probable outcome or by a similarity-matching process, respectively. Both distributional and exemplar-based models of speech perception share the underlying assumption that exposure to speaker variability provides the speech perceptual system with the capability to accurately perceive speech. Numerous studies have shown that voice and linguistic information interact during speech processing, as contemplated by all of the enumerated theoretical proposals. The perception of synthesized ambiguous vowels is strongly dependent on the spectro-temporal characteristics of a speaker’s voice in a preceding sentence (Darwin et al., 1989; Krumbiegel et al., 2022; Ladefoged & Broadbent, 1957; Miller et al., 1984; Nearey, 1989; Newman & Sawusch, 2009; Reinisch & Sjerps, 2013; Sjerps et al., 2013, 2019), regardless of language familiarity (Sjerps & Smiljanić, 2013).
Familiarity with a speaker is beneficial for speech comprehension in acoustically challenging scenarios, such as noisy environments or multi-talker situations (Drozdova et al., 2019; Johnsrude et al., 2013; Magnuson et al., 2021; Nygaard et al., 1994; Nygaard & Pisoni, 1998; Souza et al., 2013; Yonan & Sommers, 2000).
A growing body of evidence suggests that voice processing ability, the capacity of a listener to identify the speaker-specific acoustic variations introduced into the speech stream, is not only relevant for speech and speaker recognition (Johnson & Sjerps, 2021; Nygaard & Tzeng, 2021) but might also influence phoneme learning. The acquisition of non-native phonetic contrasts is enhanced when they are learnt from multiple speakers as compared to a single speaker (Bradlow et al., 1997; Bradlow & Pisoni, 1999; Deng et al., 2018; Iverson et al., 2005; Lively et al., 1993, 1994; Logan et al., 1991; Wong, 2014; Ylinen et al., 2010; Zhang et al., 2021; but see Brekelmans et al., 2022). This benefit in L2 phoneme learning is assumed to arise from the exposure to the greater acoustic–phonetic variability that multiple speakers entail. This variability would allow L2 learners to identify the acoustic properties that convey linguistic information across speakers and facilitate accurate speech perception when new speakers are encountered (Deng et al., 2018; Iverson et al., 2005; Ylinen et al., 2010). The relevance of voice processing ability for language learning processes was also reported by Houston and Jusczyk (2000), who found that familiarity with the characteristics of a speaker contributes to speech segmentation during early language learning. Infants were familiarized with isolated words spoken by one speaker and then presented with passages enunciated by a different speaker that occasionally contained the familiarized words. Seven-and-a-half-month-old infants recognized the trained words when familiarized and tested with speakers of the same sex but were unable to generalize across sexes. Houston and Jusczyk (2000) proposed that the ability to accurately disentangle voice information from linguistic information develops in parallel with language acquisition.
Additional evidence advocating for the importance of voice processing ability for language learning is provided by research on dyslexia, a developmental disorder characterized by difficulties in reading and spelling despite normal intelligence, neurological integrity, and educational opportunities. Current conceptualizations attribute dyslexia to an underlying phonological deficit that impedes the optimal association between phonemes and their respective characters (Ramus, 2003). Behavioral studies have established an association between dyslexia and difficulties in voice recognition (Perea et al., 2014; Perrachione et al., 2011). Perea et al. (2014) found that children and adults with dyslexia exhibited an impairment in recognizing speakers both in the language for which they had prior phoneme representations, i.e., their native language (L1), and in an unfamiliar language, leading them to suggest that poor voice recognition skill is a trait of dyslexia (Perea et al., 2014; but see Perrachione et al., 2011). This interpretation is in line with electrophysiological work showing that children with dyslexia exhibit reduced encoding of pitch-related features as compared to typically developing children (Chandrasekaran et al., 2009) and suggests that deficient voice processing ability might underlie the phonological deficit that characterizes dyslexia.
Further evidence that phoneme learning and voice processing are related abilities is provided by the advantage in voice recognition that bilinguals exhibit over monolinguals when discriminating speakers in an unfamiliar language (Fecher & Johnson, 2019, 2022; Levi, 2019). Fecher and Johnson (2019, 2022) proposed that a richer phonetic upbringing gives bilingual infants a higher sensitivity to phonetic cues, thus facilitating speaker recognition despite the absence of reliable phoneme representations. While a richer phonetic upbringing may underlie bilinguals’ advantage in voice recognition over monolinguals, bilingualism cannot account for the positive correlation between individual differences in voice recognition and L2 phoneme learning observed in a recent study, since that sample was composed entirely of early bilingual adults with similar opportunities to learn the L2 (Díaz et al., 2022). This study took advantage of the considerable variance displayed by Spanish (L1)–Catalan (L2) early bilinguals in their capacity to discriminate the Catalan-specific vowel contrast /e/ - /ε/, since native Spanish speakers perceive both phonemes as the Spanish vowel /e/ (Bosch et al., 2000; Pallier et al., 1997, 2001; Sebastian-Galles et al., 2006; Sebastian-Galles & Soto-Faraco, 1999). This phenomenon, where two L2 speech sounds are perceived as a single phoneme from the native language, is known as perceptual assimilation and constitutes one of the most challenging scenarios L2 speakers face (Best & Tyler, 2007; Flege, 1995). The bilinguals studied by Díaz et al. (2022) were selected from a previous study (Schmitz et al., 2018) according to whether they had exhibited either native-like or below-native performance in three behavioral tasks that evaluated their ability to perceive the L2-specific vowel contrast /e/ - /ε/. The bilinguals were administered a voice recognition task (adapted from Perea et al., 2014; Perrachione et al., 2011), which required them to learn associations between voices speaking the participants’ first language and cartoon avatars while their behavioral and electroencephalographic responses were registered.
In addition to the voice recognition task, Díaz et al. (2022) administered a non-word association task (NWAT), which required participants to learn associations between auditory non-words enunciated by a single speaker and cartoon avatars. The task served to obtain a behavioral measure of the participants’ general capacity to learn audiovisual associations, an ability that might have influenced participants’ performance in the voice recognition task. The behavioral data showed that voice recognition ability positively correlated with attained L2 phoneme discrimination, while neither of these two measures correlated with the NWAT. Analysis of the electroencephalographic data revealed a positive correlation between brain activity during voice recognition and behavioral L2 phoneme discrimination ability at two time windows: 300–340 and 880–1140 ms. These findings were in line with previous studies, which had reported voice recognition eliciting positive brain electrophysiological responses 300 ms after stimulus onset (Humble et al., 2019; Schweinberger, 2001; Zäske et al., 2014, 2018). The positive relation between voice recognition (at the behavioral and electroencephalographic levels) and L2 phoneme discrimination ability evidenced shared individual variance between L2 phoneme and voice recognition processes. This newfound relation between two seemingly independent processes opened up the possibility that voice processing abilities impact the final attainment of L2 phonemes. Díaz et al. (2022) suggested that the correlation between voice recognition ability and L2 phoneme learning might stem from L2 learners with proficient voice processing skills being better equipped to disentangle voice and linguistic information during learning, resulting in finer-tuned L2 phoneme representations and thus greater accuracy when detecting L2 phonemes. However, this proposal was limited by the correlational nature of the evidence.
In the present study, we examined whether the ability to accurately identify the acoustic idiosyncrasies introduced into the speech stream by a speaker (i.e., voice processing ability) predicts L2 phoneme learning using structural equation modeling (SEM; for a list of all acronyms used in this article, see Appendix 1). We employed a battery of behavioral tests to assess voice processing ability and attained L2 phoneme learning in a sample of 57 early Spanish (L1)–Catalan (L2) bilingual adults with similar characteristics to the participants in Díaz et al. (2022). Voice processing ability was operationalized as the ability to recognize and discriminate speakers. We assessed participants’ voice recognition skills using three different tasks. The first was a voice recognition task in the native language (L1) of the participants, Spanish, which was identical to the task employed in Díaz et al. (2022). This L1 voice recognition task consisted of training participants to recognize five voices and subsequently testing voice recognition accuracy. Recognizing voices in one’s L1 is facilitated by prior phonological and semantic knowledge of the spoken language (Yu et al., 2023) and results in greater accuracy as compared to recognizing voices in an unknown language (Lx) (Perea et al., 2014; Perrachione et al., 2011). To obtain a richer characterization of the voice processing ability of participants than in Díaz et al. (2022), we also administered an Lx voice recognition task similar to the one employed by Perea et al. (2014), in which the voices spoke Chinese. By using both L1 and Lx voice recognition tasks, we aimed to capture participants’ ability to identify voice cues that are intertwined with linguistic information in two different situations: when prior linguistic knowledge facilitated the identification of voice cues (L1 voice recognition task) and when prior linguistic knowledge did not facilitate voice recognition (Lx voice recognition task). Lastly, to deepen our understanding of voice processing abilities, we assessed participants’ ability to identify speaker-specific cues embedded in the speech signal in the absence of linguistic-dependent acoustic variations. For this purpose, we designed a novel voice discrimination task (VDT), which required participants to evaluate whether two emotional interjections (Belin et al., 2008) had been produced by the same or different unfamiliar speakers. We employed affect bursts as stimuli because emotional tone is a within-person source of non-linguistic variation that drastically modulates the spectro-temporal characteristics of the speech signal (Lavan et al., 2019a). These three voice tasks therefore evaluated participants’ voice processing abilities in three situations that varied in their engagement of speech processes: linguistic information present and familiar (L1 voice recognition task), linguistic information present but unfamiliar (Lx voice recognition task), and linguistic information absent (voice discrimination task).
The participants’ L2 phoneme learning ability was quantified using two tasks that evaluated L2 phoneme knowledge at the sub-lexical and lexical levels, respectively: a categorization task (CT) of synthetic vowels (Pallier et al., 1997) and an auditory lexical decision task (Schmitz et al., 2018; Sebastian-Galles et al., 2005; Sebastian-Galles & Baus, 2005). All tasks measured accuracy and reaction time (RT). While both accuracy scores and RT capture effective cognitive processing, they are qualitatively different measures. Accuracy scores capture how similar the decision alternatives are to each other and how effectively the correct option can be identified. RT measures the speed with which a participant identifies the correct option. Perceptual decision-making models have highlighted the need to study both measures when investigating individual differences since, surprisingly, they tend to exhibit low correlation at the individual level (Ratcliff et al., 2010, 2015a, 2015b). Drawing firm conclusions in behavioral studies therefore necessitates interpreting both measures (Ratcliff et al., 2015a).
We conducted confirmatory factor analysis (CFA) to investigate whether both the accuracy and RT data, modeled separately, were represented more adequately by two related latent variables (i.e., voice processing ability and L2 phoneme learning), as hypothesized, or rather by a single latent variable (i.e., general speech ability). After confirming that the model with two latent variables provided an overall better fit to the data, we proceeded to investigate our main hypothesis that voice processing ability predicted L2 phoneme learning with SEM. A positive result would provide insight into the high variability early bilingual adults display in their command of L2 phonemes and suggest that voice processing influences L2 learning.
2. Methods
2.1. Sample size estimation
The minimum sample size required for this study was estimated using an a priori power analysis (Hancock & Mueller, 2013). Using a tool designed for SEM studies (Soper, 2023), we calculated the minimum sample size as a function of the number of observed and latent variables (5 and 2, respectively), the anticipated effect size (β = .61, based on Díaz et al., 2022), the desired probability level (α = .05), and the desired statistical power (π = .80). This analysis determined that a minimum of 12 participants was necessary to detect an effect. However, to ensure the convergence of the CFAs and SEMs, we aimed to collect data from a minimum of 50 participants, following the recommendation of Bentler and Chou (1987) of having a minimum of 10 participants per indicator.
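As a worked illustration of how these two criteria combine, the arithmetic can be written out in R; the value from Soper’s online calculator is simply taken as given, and all variable names are ours:

```r
# Sample-size planning: the a priori estimate from Soper's (2023) online
# calculator (for 5 observed and 2 latent variables, beta = .61,
# alpha = .05, power = .80) versus the Bentler and Chou (1987)
# heuristic of at least 10 participants per observed indicator.
n_soper      <- 12                 # output of the online calculator
n_indicators <- 5                  # observed variables in the models
n_heuristic  <- 10 * n_indicators  # = 50
n_target     <- max(n_soper, n_heuristic)  # planned minimum sample: 50
```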
2.2. Participants
The sample of this study was composed of 57 Spanish–Catalan bilingual adults (40 female; mean age 21 years; age range 18–26) born and raised in the metropolitan area of Barcelona in Catalonia, an autonomous community of Spain where Spanish and Catalan are co-official languages. The L1 of all participants was Spanish; they had been raised in monolingual Spanish families and had not been systematically exposed to Catalan until the age of 4 years, when mandatory bilingual schooling begins. All participants were highly fluent speakers of Catalan, having received mandatory bilingual education from kindergarten on. At the time of testing, all participants were pursuing or had obtained a university degree in Catalonia, indicating that they had completed mandatory bilingual schooling, a requirement to access higher education.
Participants were selected using an online survey in Google Forms that collected information concerning the personal history (place of birth, place/s of residence, etc.) and language profile (L1, L2, age of acquisition of each spoken language, current use of each spoken language, etc.) of the respondent and their extended family. This was done to ensure that the participants had had no substantial experience with any language other than Spanish during their initial years of life (0–4 years of age) and that systematic exposure to Catalan only began upon commencing mandatory bilingual schooling. Participants answered free-response questions inquiring about the language(s) employed to communicate with each family member in their early childhood environment. All participants reported exclusively communicating in Spanish with both of their parents and other regular caretakers. None of the participants had extended family members or caretakers from the eastern region of Andalusia or the Region of Murcia, two areas in the south of Spain. This was avoided because the Spanish dialects in these regions employ the phoneme /ε/ in addition to the standard Spanish /e/ (Sanders, 1994; Soriano, 2012). Participants exposed to one of these Spanish dialects during their early infancy would have had an advantage in distinguishing the two phonemes we exploited to evaluate L2 phoneme learning.
None of the participants possessed substantial musical training, as defined by a previous study (Kaganovich et al., 2013). Substantial musical training consisted of meeting a minimum of two of the following three criteria: (1) onset of musical training before the age of 12 years; (2) having partaken in musical training for a minimum of 5 years; and (3) being part of a musical group or ensemble, either currently or in the past. None of the participants had received a clinical diagnosis of a hearing problem, learning disability, or neurological impairment. Of the 1123 respondents who completed the online questionnaire, only 68 were eligible for inclusion in the final sample, of whom 57 agreed to participate in this study. Participants provided their written informed consent and were monetarily compensated for their time (10 €). The Medical Faculty and Health Sciences Ethics Committee of the Universitat Internacional de Catalunya approved the procedures (Protocol no. PSI-2020-05).
2.3. Materials
A battery composed of six behavioral tasks was employed to evaluate the participants’ voice processing ability, L2 phoneme learning, and general audiovisual learning capacities. Voice processing ability was assessed with the L1 voice recognition task (L1 VRT), the Lx voice recognition task (Lx VRT), and the VDT. The indicators of L2 phoneme learning were a CT and a lexical decision task (LDT). General audiovisual learning was evaluated with the NWAT. All tasks registered both accuracy and RT data and were programmed and executed in MATLAB (Version R2021a, MathWorks, Inc., Natick, MA, USA) using the Psychophysics Toolbox extensions (3.0.18; Brainard, 1997; Pelli, 1997). Here, we present a summarized description of the tasks. A detailed description can be found in the Supplementary Materials.
2.3.1. Voice recognition tasks (VRTs)
The L1 VRT and the Lx VRT, adapted from Perea et al. (2014), followed an identical procedure. These two tasks differed solely in the stimuli they employed. In the L1 VRT, the auditory stimuli consisted of 10 Spanish sentences recorded by 5 Spanish native speakers, while 10 Chinese sentences recorded by 5 Chinese native speakers were employed in the Lx VRT. Ten female avatars were created, of which five were employed in each VRT. The VRTs trained participants to associate voices with avatars and then tested the learning that the participants had attained. Participants were taught the associations between voices and avatars in two phases, each composed of 25 trials: the training and the short test. The trials of the training followed an ABX structure: two voice–avatar pairings were sequentially presented, and one of the two voices was then repeated while the five avatars were displayed. Participants had to indicate as fast as possible, by means of a button press, which of the five avatars the repeated voice corresponded to. Feedback was provided concerning the participants’ response accuracy, and the correct avatar was displayed on the screen. The trials of the short test consisted of the presentation of an auditory stimulus accompanied by the five avatars. Participants indicated as fast as they could which of the five avatars was associated with the presented voice. As in the training, feedback was provided after each delivered response. The test phase, composed of 50 trials, followed the same structure as the short test, but no feedback was provided. Participants were trained and tested on different sentences.
2.3.2. Voice discrimination task (VDT)
The Montreal Affective Voices set (Belin et al., 2008) was employed as the stimuli of the VDT. This set is composed of 10 different speakers enunciating nine affective interjections using the vowel /ɑ/. The VDT followed an AX discrimination design: two auditory stimuli were sequentially presented, and participants indicated via button press, as fast as they could, whether the two vocalizations had been enunciated by the same or different speakers. In half of the trials, both stimuli had been enunciated by the same speaker, while in the other half, they had been enunciated by different speakers. The VDT comprised 52 trials.
2.3.3. Categorization task (CT)
The CT followed the design presented by Pallier and collaborators (1997). The stimuli consisted of a continuum of seven synthesized vowel stimuli between the Catalan vowels /e/ and /ε/. In 63 trials (nine trials per stimulus), participants had to respond as fast as they could via button press whether the vowel they heard was perceived as the first vowel in the Catalan word Pere (/perə/, the name Peter) or as the first vowel in pera (/pεrə/, which means pear).
2.3.4. Lexical decision task (LDT)
The LDT employed in this study was from Sebastian-Galles et al. (2005). The stimuli consisted of 344 auditory stimuli (experimental and control) enunciated by a native Catalan speaker. The experimental stimuli included 132 words containing one of the two phonemes from the targeted Catalan contrast (i.e., /e/ or /ε/) and 132 non-words designed by substituting the /e/ or /ε/ present in the real words with the other member of the phoneme pair. Eighty control stimuli (40 Catalan words and 40 non-words) were also employed. Control non-words were derived from a set of Catalan words different from the control and experimental words; they were created by replacing a vowel phoneme in this separate set of words with a phoneme employed in both Spanish and Catalan. In each of the 212 trials, participants were presented with an auditory stimulus and had to respond via button press whether the stimulus was part of the Catalan lexicon. The experimental stimuli were distributed between two lists to ensure that participants only heard one member of the same word pair. Both lists included all control stimuli, and their use was counterbalanced across participants.
2.3.5. Non-word association task (NWAT)
The NWAT was initially introduced in Díaz et al. (2022). Six non-words enunciated by a single native Spanish speaker constituted the auditory stimuli for this task, while six avatars constituted the visual stimuli. The NWAT sought to train and test participants’ ability to learn audiovisual associations. It was composed of two phases: a training and a test. Each of the 12 trials of the training phase consisted of the simultaneous presentation of a non-word–avatar pairing. The test trials, a total of 48, consisted of the presentation of a non-word while the six avatars were displayed. Participants indicated as fast as possible via button press which avatar was associated with the presented non-word.
2.4. Procedure
The six tasks were administered in a single one-hour-long experimental session. The tasks were presented to all participants in the following order: Lx VRT, LDT, VDT, L1 VRT, NWAT, and, lastly, the CT. The order of task presentation was arbitrary; however, it was kept constant across participants to avoid task-order effects playing a role in individual task performance. Instructions for each task were displayed via text. Any doubts the participants had were resolved by the experimenters before commencing each task. Instructions were delivered in Catalan for the LDT and the CT and in Spanish for the other four tasks. Participants were instructed to provide their responses with their dominant hand and to keep their response fingers over the response buttons. For all participants, the six tasks were presented on an HP EliteBook 840 G7 Notebook PC with Audio-Technica ATH-PRO7x headphones, ensuring a consistent and comfortable audio level. Participants were tested individually in sound-attenuated rooms at the Psychology and Psychiatry University Clinic and Digital Media Studios of the Universitat Internacional de Catalunya and at the laboratories of the Center for Brain and Cognition of the Pompeu Fabra University.
2.5. Data analysis
We investigated whether voice processing ability predicted L2 phoneme learning using SEM, a statistical methodology that systematically analyzes the relationships among several variables. Following Brown (2015), CFAs were conducted prior to the SEMs. CFA assesses the relationships between observed measures and latent variables and thereby allows one to validate that the hypothesized latent constructs are manifested through the employed indicators. Similar to a previous study (Díaz et al., 2022), we tested whether general audiovisual learning abilities influenced the participants’ performance in the VRTs by computing Pearson’s correlations between the accuracy scores and RT of the NWAT and the VRTs. Mplus Version 8.8 Demo (Mplus. Statistical Analysis with Latent Variables, 2017) was used to estimate the CFAs and SEMs. All other analyses were conducted with R 4.2.2 (R Core Team, 2019) and RStudio 2022.12.0 (RStudio Team, 2020).
Each task’s accuracy and RT scores were computed from trials in which participants delivered their responses within a specific time window. These time windows were designed to exclude responses provided before perceptual processing while including responses delivered up to three-and-a-half seconds after the mean stimulus duration, similar to one of our previous studies (Sebastian-Galles et al., 2005). The time windows for each task were as follows: L1 VRT: 250–7500 ms; Lx VRT: 250–8500 ms; VDT: 250–5000 ms; CT: 250–4000 ms; LDT: 250–4000 ms; and NWAT: 250–4000 ms. Following these criteria, the following percentages of data were discarded for each task: L1 VRT: 0.70%; Lx VRT: 3.12%; VDT: 0.67%; CT: 3.07%; LDT: 2.53%; and NWAT: 5.52%. Due to technical malfunctions, the LDT data of two participants were not registered. Under the assumption of data missing at random, multiple imputation by chained equations was performed with the R package mice. Subsequently, multivariate normality was assessed using the Mahalanobis distance (D²M), computed with R’s stats package. D²M was calculated for each participant’s responses to the five experimental tasks, and its statistical significance was tested with χ² at a significance level of .001 (Kline, 2015).
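A minimal R sketch of this preprocessing pipeline is shown below, assuming a trial-level data frame and a participant-by-task score matrix; all object and column names are hypothetical, and the imputation relies on the mice package cited above:

```r
library(mice)

# Trial-level response windows (illustrated for the CT: 250-4000 ms),
# applied before computing each participant's accuracy and RT scores
ct_trials <- subset(ct_trials, rt >= 250 & rt <= 4000)

# Multiple imputation by chained equations for the two participants
# whose LDT scores were lost (assumed missing at random)
imputed <- mice(scores, m = 5, seed = 1)
scores  <- complete(imputed)  # first completed dataset, for simplicity

# Multivariate outlier screen: Mahalanobis distance over the five task
# scores, tested against chi-square (df = 5) at a significance level of .001
d2     <- mahalanobis(scores, colMeans(scores), cov(scores))
cutoff <- qchisq(1 - .001, df = ncol(scores))
scores <- scores[d2 < cutoff, ]
```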
Accuracy scores were computed for each participant and each task. For the VRTs, we computed the proportion of accurate responses delivered, following studies that have previously employed voice recognition tasks (Díaz et al., 2022; Perea et al., 2014; Perrachione et al., 2011). For the VDT (see Table A1 in Appendix 2 for descriptive statistics of the proportion of correct responses), since it aimed to evaluate the ability of participants to discriminate between pairs of stimuli, we computed d′, an index of discriminability derived from signal detection theory (McNicol, 2005; Snodgrass & Corwin, 1988; Stanislaw & Todorov, 1999; see Table A2 in Appendix 3 for the mean proportions of hits and false alarms). Accuracy scores for the CT were computed as in previous studies (Schmitz et al., 2018; Sebastian-Galles et al., 2005; Sebastian-Galles & Baus, 2005). We sought to obtain a measure that reflected whether participants could perceive the difference between the /e/ stimuli (steps 1 and 2) and the /ε/ stimuli (steps 6 and 7). For this, the average /e/ responses to steps 6 and 7 were subtracted from the average /e/ responses to steps 1 and 2. Thus, high positive scores reflect a good separation of /e/ and /ε/, scores close to zero reflect that participants did not respond differently to steps 1 and 2 than to steps 6 and 7, and negative scores indicate that participants’ responses showed a reversed pattern. Negative CT scores were assumed to originate from responses systematically delivered in reverse, which necessitates the capacity to perceive the difference between phoneme categories. Thus, the CT scores were transformed into absolute values. For the LDT, the mean accuracy for the experimental words was computed (see Table A1 in Appendix 2). Previous studies that had used the same LDT with the same population had computed the A′ score, a non-parametric unbiased index of sensitivity (McNicol, 2005; Snodgrass & Corwin, 1988; Stanislaw & Todorov, 1999), due to the participants’ strong bias to consider most experimental non-words as real words (Schmitz et al., 2018; Sebastian-Galles et al., 2005; Sebastian-Galles & Baus, 2005). After confirming that our participants showed a high rate of false alarms for the experimental stimuli of the LDT (see Table A2 in Appendix 3), consistent with previous studies (Schmitz et al., 2018; Sebastian-Galles et al., 2005; Sebastian-Galles & Baus, 2005), we computed A′ scores for the LDT. A′ ranges between 0.5 (random responding) and 1.0 (perfect discrimination). To ensure high L2 lexical knowledge, we excluded participants with an A′ < 0.8 in the control trials of the LDT.
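To make the scoring concrete, the sketch below implements the three measures in R. The d′ and A′ computations follow the standard signal detection formulas (e.g., Stanislaw & Todorov, 1999); the correction for extreme hit and false-alarm rates and all function names are our own illustrative choices, not necessarily those used in the study.

```r
# d' for the VDT: z(hit rate) - z(false alarm rate); extreme rates are
# nudged away from 0 and 1 so that qnorm() stays finite (one common fix)
d_prime <- function(hits, fas, n) {
  h <- pmin(pmax(hits / n, 1 / (2 * n)), 1 - 1 / (2 * n))
  f <- pmin(pmax(fas  / n, 1 / (2 * n)), 1 - 1 / (2 * n))
  qnorm(h) - qnorm(f)
}

# A' for the LDT: non-parametric sensitivity (Stanislaw & Todorov, 1999);
# 0.5 = random responding, 1.0 = perfect discrimination
a_prime <- function(h, f) {
  0.5 + sign(h - f) * ((h - f)^2 + abs(h - f)) / (4 * pmax(h, f) - 4 * h * f)
}

# CT score: mean proportion of /e/ responses to steps 6-7 subtracted from
# steps 1-2, folded to absolute values (reversed-but-systematic responders)
ct_score <- function(p_e) abs(mean(p_e[1:2]) - mean(p_e[6:7]))
```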
Lastly, for the NWAT, we employed the proportion of accurate responses as the accuracy score, following the study in which this task was introduced (Díaz et al., 2022). RT scores for all tasks and participants were computed as the mean RT of the trials in which the correct response was delivered.
Separate CFAs and SEMs were constructed for the accuracy scores and the RT data. The model parameters of the SEMs and CFAs were estimated using the robust maximum likelihood estimator, which does not rely on the assumption of a normal distribution (Kline, 2015). While the tasks we employed are theoretically indicators of two different, yet related, constructs (i.e., voice processing ability and L2 phoneme learning), we also tested whether a single-latent-variable model provided an adequate fit to the data, to rule out an alternative explanatory model that might be supported statistically. We employed the Akaike information criterion (AIC) to compare the models with two latent variables (i.e., voice processing ability and L2 phoneme learning) against the models with a single latent variable (i.e., general speech ability) (Akaike, 1998). The chi-square test of model fit (χ²) was considered significant at p < .05. A significant result of this statistic would indicate model misfit, reflecting a deviation between the population covariance structure and the model-implied covariance structure (Kline, 2015). Goodness of fit of the models was also assessed via two indices that are robust in models with relatively small degrees of freedom, as in the present study (Shi et al., 2022): the comparative fit index (CFI) and the standardized root mean square residual (SRMR). CFI compares the fit of the specified model to a baseline null model in which the latent variables are unrelated, obtained by constraining the covariance between the latent variables to zero. SRMR summarizes the average discrepancy between the observed and reproduced covariances in standardized units. Following the recommendation of Hu and Bentler (1999), the following values were interpreted as indicating good fit: CFI ≥ .90 and SRMR ≤ .08. For completeness, we report the root mean square error of approximation (RMSEA), a measure of model misfit due to model misspecification commonly employed in models with large degrees of freedom, though not recommended for models with small degrees of freedom such as those presented here (Kenny et al., 2015).
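The models themselves were estimated in Mplus 8.8. As an illustrative alternative only, the sketch below specifies the competing two-factor and one-factor CFAs in the R package lavaan (not used in the study), with hypothetical indicator names; estimator = "MLR" requests robust maximum likelihood.

```r
library(lavaan)

two_factor <- '
  voice   =~ l1_vrt + lx_vrt + vdt   # voice processing ability
  phoneme =~ ct + ldt                # L2 phoneme learning
'
one_factor <- '
  speech =~ l1_vrt + lx_vrt + vdt + ct + ldt   # general speech ability
'

fit2 <- cfa(two_factor, data = scores, estimator = "MLR")
fit1 <- cfa(one_factor, data = scores, estimator = "MLR")

# Fit criteria used in the text: non-significant chi-square, CFI >= .90,
# SRMR <= .08; AIC for model comparison; RMSEA reported for completeness
fitMeasures(fit2, c("chisq", "df", "pvalue", "cfi", "srmr", "rmsea", "aic"))
fitMeasures(fit1, c("chisq", "df", "pvalue", "cfi", "srmr", "rmsea", "aic"))
```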
3. Results
3.1. Tasks’ results
All indicators exhibited considerable variability, suggesting that the tasks we employed successfully captured individual differences (Figures 1 and 2). The skewness and kurtosis values were within the thresholds suggested by Hancock and Mueller (2013) (i.e., absolute values of 2 and 7, respectively) for conducting CFAs and SEMs (see Table 1). Covariance matrices were generated (see Tables 2 and 3) as part of the standard procedure of conducting CFAs and SEMs (Kline, 2015). All participants attained high accuracy scores in the control trials of the LDT (M = 0.95; SD = 0.04; range = 0.83–0.99), and therefore, no participant was excluded from the analysis due to low L2 lexical knowledge. Multivariate normality was assessed using D²M to rule out the possibility of disturbances caused by potential multivariate outliers. No multivariate outliers were identified for the accuracy scores, and all participants were included in the CFAs and SEMs. For the RT data, a single case was identified as a multivariate outlier following the D²M criterion and was excluded from the RT models.
Note (Figures 1 and 2): Indicators of voice processing ability are Spanish voice recognition, Chinese voice recognition, and voice discrimination. Indicators of L2 phoneme learning are categorization and lexical decision. ms = milliseconds.
We did not expect performance in the NWAT (accuracy mean = 0.7; accuracy SD = 0.26; RT mean = 1589.15 ms; RT SD = 252.89 ms) to correlate significantly with performance in the VRTs, since these tasks were designed to capture individual differences in different abilities (i.e., general audiovisual learning and voice recognition, respectively), and no correlation between these tasks was observed in a previous study (Díaz et al., 2022). We tested whether individual differences in general audiovisual learning were related to performance in the VRTs by computing Pearson’s correlation coefficients between the scores of the VRTs and those of the NWAT. Performance in the NWAT did not correlate with the L1 VRT measures (accuracy: r = .16, p = .221; RT: r = .19, p = .151) or the Lx VRT measures (accuracy: r = .14, p = .298; RT: r = .13, p = .336), suggesting that individual differences in general audiovisual learning abilities were not related to performance in the VRTs. As a result, the NWAT data were not included in subsequent analyses. Given the predominance of female participants in our sample, we also verified that gender did not influence participants’ performance in the indicators of voice processing ability and L2 phoneme learning, using a series of Welch’s t-tests for unequal sample sizes. No comparison between genders approached statistical significance (all ps > .1; see Table A3 in Appendix 4).
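For illustration, both control analyses correspond to standard base-R calls (column names are hypothetical):

```r
# Pearson correlation between NWAT and L1 VRT accuracy scores
cor.test(scores$nwat_acc, scores$l1_vrt_acc)

# Welch's t-test (R's default, var.equal = FALSE) comparing genders
# on one indicator, e.g., L1 VRT accuracy
t.test(l1_vrt_acc ~ gender, data = scores)
```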
3.2. Confirmatory factor analyses (CFAs)
CFAs were computed to evaluate whether the accuracy scores and RT data captured the latent constructs as intended. We tested whether voice processing ability and L2 phoneme learning could be modeled as distinct but related constructs. Additionally, we modeled the data with a single-latent-variable structure to test this competing model. The CFAs with two related latent variables (see Figure 3) showed that the accuracy scores in the L1 VRT and Lx VRT were valid indicators of voice processing ability (both p < .001). While VDT accuracy did not significantly represent voice processing ability (p = .191), RT in this same task did (p < .001). Furthermore, RT in the L1 VRT and Lx VRT represented voice processing ability (both p < .001). Concerning L2 phoneme learning, both the accuracy scores and RT in the CT and the LDT represented this latent construct (all p < .001). Voice processing ability and L2 phoneme learning were correlated in both the accuracy and the RT model. The chi-square test of model fit (χ²) was not significant for either CFA, indicating that the models with two related latent variables provided an adequate fit to the data. The CFI and SRMR indicated that both the accuracy CFA and the RT CFA met the established criteria for goodness of fit (Hu & Bentler, 1999) (see Table 4).
The χ² test indicated that the single-latent-variable CFA modeled with the accuracy scores (see Figure 4) fitted the data adequately (p > .05). However, the single-latent-variable CFA modeled with the RT data exhibited significant misfit (χ²(5) = 13.368, p < .05), suggesting that the model could not adequately represent the data (see Table 4). The accuracy scores of all tasks significantly represented general speech ability (p < .005), with the sole exception of the VDT (p = .139). Fit indices for this single-latent-variable CFA indicated adequate fit following the criteria suggested by Hu and Bentler (1999). Comparison of the AIC for the models based on accuracy scores suggested that the CFA with a single latent variable provided a more adequate representation of the accuracy scores than the CFA with two latent variables (see Table 4). However, the single-latent-variable CFA did not adequately fit the RT data, while the CFAs with two latent variables showed adequate fit for both the accuracy scores and the RT data. Hence, modeling voice processing ability and L2 phoneme learning as distinct but related constructs provided an overall better characterization of the complete dataset.
3.3. Structural equation models (SEMs)
We investigated whether voice processing ability predicted L2 phoneme learning with SEMs. Given the similarity between the procedures of the two VRTs, we freely estimated the residual covariance between them when fitting the models. The results of the SEM analyses were in line with the CFA findings. For both the accuracy and RT models, all measures of voice processing ability loaded onto that factor, with the sole exception of VDT accuracy, whose loading approached significance (p = .066). Both the accuracy and RT measures of L2 phoneme learning loaded onto the L2 phoneme learning factor. Voice processing ability predicted L2 phoneme learning in both the model that included the accuracy scores (p < .005) and the model that included the RT data (p < .001) (see Figure 5). The goodness-of-fit indices employed to evaluate the models met the criteria proposed by Hu and Bentler (1999), indicating that the data were well represented by the models (see Table 5).
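For illustration, the structural model can be sketched in lavaan (the study itself used Mplus), reusing the hypothetical indicator names from the CFA sketch above; the regression path tests whether voice processing ability predicts L2 phoneme learning, and the residual covariance between the two VRTs is freely estimated, as described.

```r
# Structural model: voice processing ability -> L2 phoneme learning,
# with the residual covariance between the two VRT indicators freed
sem_model <- '
  voice   =~ l1_vrt + lx_vrt + vdt
  phoneme =~ ct + ldt
  phoneme ~  voice        # structural path tested in the study
  l1_vrt  ~~ lx_vrt       # freed residual covariance between the VRTs
'
fit_sem <- sem(sem_model, data = scores, estimator = "MLR")
summary(fit_sem, standardized = TRUE, fit.measures = TRUE)
```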
4. Discussion
We investigated whether individual differences in voice processing ability predicted L2 phoneme learning proficiency. To test this hypothesis, we exploited the variance Spanish (L1)–Catalan (L2) early bilinguals display in their capacity to discriminate the Catalan-specific vowel contrast /e/ - /ε/. We employed a battery of behavioral tests to assess voice processing ability and L2 phoneme learning in a sample of 57 early bilingual adults. Performance in all indicators exhibited considerable variability, suggesting that the tasks we employed successfully captured individual differences. We employed CFA to evaluate whether the accuracy scores and RT data captured two distinct latent constructs, as hypothesized, or a single latent variable. The model with two related latent variables showed a good fit to both the accuracy and RT data, while the model with a single latent variable only fitted the accuracy data. Subsequent SEMs incorporating two latent variables for both the accuracy scores and RT data confirmed that voice processing ability is a reliable predictor of L2 phoneme learning in early bilingual adults. Drawing on various theories of speech perception, in the following paragraphs we discuss the nature of the relationship between voice processing and L2 phoneme learning. We also consider how voice processing abilities may relate to language learning at different stages of life, such as learning an L2 as an adult and acquiring a native language. Furthermore, we offer some considerations for future studies that seek to further investigate the influence of voice processing abilities on language learning.
Our findings suggest that the ability of a listener to identify the idiosyncratic acoustic variations introduced into the speech stream by the speaker’s voice, an ability that theoretical proposals of native speech perception consider indispensable (Johnson & Sjerps, 2021; Nygaard & Tzeng, 2021), relates to L2 phoneme learning ability. It should be noted that theoretical models of non-native speech perception do not address how non-native listeners cope with the lack of invariance of speech sounds across speakers (Best, 1994; Best & Tyler, 2007; Escudero, 2009; Flege, 1995). Models of non-native speech perception assume that, to create representations of non-native phonemes, the speech perception system needs to identify the invariant phonemic cues that differentiate these from native phonemes. These models implicitly assume that the mechanisms that enable L1 perception are the same ones that support the identification of the phonemic cues that distinguish native from non-native phonemes. We therefore build on native speech perception models to examine the potential mechanisms that drive the present relation between voice and L2 phoneme abilities.
Being models of native speech perception, speaker normalization theories do not address non-native phoneme learning. However, we provide a tentative explanation of how this theoretical proposal might accommodate the association between voice processing and L2 phoneme learning abilities. Speaker normalization theories propose that, to account for the high variability of the speech signal, the perceptual system initially identifies and discards the voice information embedded in the speech signal. This computation entails that the remaining acoustic information cannot be attributed to speaker idiosyncrasies but rather corresponds to linguistic information (Choi et al., 2018; Johnson & Sjerps, 2021; Nusbaum & Magnuson, 1997; Zhang & Chen, 2016). Viewed through the theoretical frame of speaker normalization theories, individual differences in voice processing abilities might be relevant during L2 phoneme learning, as they would determine the listener’s accuracy in identifying the spectro-temporal correlates of voices in the speech signal, such as variations of the fundamental frequency or the frequency of the formants (Baumann & Belin, 2010; Ghazanfar & Rendall, 2008; Latinus & Belin, 2011). Inadequate identification of speaker-specific acoustic variation could lead to two potential scenarios: the speech system would either flag phoneme-relevant cues as voice-dependent and discard them from speech analyses, or consider voice cues as phoneme-relevant features and include them in further speech processing. In both cases, inaccurate identification of voice features would hamper the discovery of the invariant cues of non-native phonemes and their subsequent learning. A caveat to this interpretation lies in the nature of the computations assessed in our voice processing tasks. These tasks focus on explicit recognition and discrimination and might involve high-level processes, such as accessing identity representations or making similarity judgments. Such high-level processes may differ from those that underpin speaker normalization, which is typically conceptualized as an automatic process that mostly relies on low-level acoustic contrasts (Sjerps et al., 2013; Sjerps & Smiljanić, 2013).
An alternative theoretical proposal that also accommodates interactions between voice and linguistic information is provided by distributional and exemplar-based models of speech perception. These models suggest that, rather than a normalization process occurring, the speech perceptual system tracks and retains the speaker-specific acoustic variations introduced into the speech signal (Goldinger, 1998; Klatt, 1979; Kleinschmidt & Jaeger, 2015; McMurray & Jongman, 2011; Sumner et al., 2014). The flexibility that these models attribute to speech perception, conceptualizing it as a dynamic ability capable of incorporating novel information to adapt to new scenarios (e.g., learning dialectal variations), can arguably accommodate the learning of non-native phoneme contrasts. Based on phonetic training paradigms that show greater generalization when learning occurs in multi-speaker as compared to single-speaker conditions, it has been proposed that the speech system dynamically learns and extrapolates the features that characterize phonemes across speakers (Weatherholtz & Jaeger, 2016). Building on distributional and exemplar-based models, the association between voice processing ability and L2 phoneme learning might originate from the listener’s ability to properly identify the speaker-specific variations introduced in the speech signal, which would directly impact the listener’s ability to discover the acoustic correlates of phonemic regularities. The tasks employed in the present study to measure voice processing ability were designed to capture both low-level and high-level acoustic processes, similar to the processes conceptualized by distributional and exemplar-based models. It should be noted that these models propose that learning regular variations of voices is an implicit process, while the behavioral tasks employed in this study evaluated explicit learning and discrimination. However, recent research has shown that voice recognition accuracy is similar regardless of whether attention is directed to the voice or to the linguistic content of speech, suggesting that both implicit and explicit processes support the learning of the relevant cues that characterize voices (Lee & Perrachione, 2022). While the assumptions of these models fit well with the reported findings, the validity of these theories of speech perception remains a subject of ongoing debate. Therefore, we remain cautious about drawing a causal interpretation from the observed predictive value of voice processing ability for L2 phoneme learning. Investigating the neural underpinnings that support the interaction between voice processing ability and L2 phoneme learning may further our understanding of the relation between these two processes, especially considering that speaker recognition and speech perception engage partially distinct brain regions (Bonte et al., 2014; Formisano et al., 2008; Schall et al., 2015).
Previous studies have proposed two neurofunctional mechanisms that might support interactions between voice and speech processes: (i) interhemispheric functional connectivity between right-lateralized voice-sensitive regions and left-lateralized speech-sensitive regions (Deng et al., Reference Deng, Chandrasekaran, Wang and Wong2018; Kreitewolf et al., Reference Kreitewolf, Friederici and von Kriegstein2014; von Kriegstein et al., Reference von Kriegstein, Smith, Patterson, Kiebel and Griffiths2010) and (ii) the functional overlap in regions along the temporal cortices and the right temporoparietal junction, which are sensitive to both voice and phonetic information (Chandrasekaran et al., Reference Chandrasekaran, Chan and Wong2011; Formisano et al., Reference Formisano, De Martino, Bonte and Goebel2008; Holmes & Johnsrude, Reference Holmes and Johnsrude2021; Luthra et al., Reference Luthra, Magnuson and Myers2023; Myers & Theodore, Reference Myers and Theodore2017; von Kriegstein et al., Reference von Kriegstein, Smith, Patterson, Kiebel and Griffiths2010). If these mechanisms are also engaged during L2 phoneme learning, they would provide a neural basis for the interaction between voice processing ability and L2 phoneme learning that aligns with the proposals of the models of speech perception discussed here.
Although the present findings fit well with these theoretical proposals, it remains unknown whether the predictive value of voice processing abilities for L2 phoneme learning can be extrapolated to learning during other stages of life. The participants in this study were early bilingual adults who learnt the L2 upon commencing mandatory bilingual schooling at the age of 4 years. While children predominantly utilize implicit domain-specific mechanisms in language learning, adult L2 learners can no longer rely on these implicit mechanisms. Instead, they must reflect on the grammatical structure of the novel language and exploit general cognitive strategies (DeKeyser, Reference DeKeyser2000). Furthermore, recent studies support the long-standing proposal of the existence of a sensitive period for language learning (Hartshorne et al., Reference Hartshorne, Tenenbaum and Pinker2018; Werker & Hensch, Reference Werker and Hensch2015). Sensitive periods are developmental stages during which the central nervous system exhibits greater experience-induced plasticity, enabling the acquisition of sensory and cognitive abilities. Once a sensitive period has ended, learning in that domain remains possible but is typically poorer. Crucially, the bilinguals tested in the present study learnt the L2 after the sensitive period for phoneme learning had concluded, which has been proposed to end during the second year of life (for a review, see Werker & Hensch, Reference Werker and Hensch2015). Indeed, several studies show that systematic exposure to an L2 at the age at which our sample of participants began learning does not consistently result in native-like proficiency in L2 phoneme contrast discrimination, as would be expected if the L2 had been acquired during the sensitive period (Caramazza et al., Reference Caramazza, Yeni-Komshian, Zurif and Carbone1973; Díaz et al., Reference Díaz, Mitterer, Broersma and Sebastian-Galles2012; Schmitz et al., Reference Schmitz, Díaz, Fernández Rubio and Sebastian-Galles2018; Sebastian-Galles & Díaz, Reference Sebastian-Galles and Díaz2012). Therefore, the observed association between voice processing and L2 phoneme learning may generalize to the learning of non-native phoneme contrasts occurring after the sensitive period for phoneme acquisition concludes. Supporting this claim, previous research has shown that voice processes are relevant for language learning during adulthood. For instance, numerous studies have demonstrated significant gains in the perception of L2 phoneme contrasts when learners are exposed to these contrasts from multiple speakers, as compared to learning from a single speaker (Bradlow et al., Reference Bradlow, Pisoni, Akahane-Yamada and Tohkura1997; Bradlow & Pisoni, Reference Bradlow and Pisoni1999; Deng et al., Reference Deng, Chandrasekaran, Wang and Wong2018; Iverson et al., Reference Iverson, Hazan and Bannister2005; Lively et al., Reference Lively, Logan and Pisoni1993, Reference Lively, Pisoni, Yamada, Tohkura and Yamada1994; Logan et al., Reference Logan, Lively and Pisoni1991; Wong, Reference Wong2014; Ylinen et al., Reference Ylinen, Uther, Latvala, Vepsäläinen, Iverson, Akahane-Yamada and Näätänen2010; for a review, see Zhang et al., Reference Zhang, Cheng and Zhang2021).
This benefit in L2 phoneme learning in multispeaker contexts is believed to reflect the enhanced identification of the invariant cues that characterize phonemes when the learner has access to a more diverse speech input (Deng et al., Reference Deng, Chandrasekaran, Wang and Wong2018; Iverson et al., Reference Iverson, Hazan and Bannister2005; Ylinen et al., Reference Ylinen, Uther, Latvala, Vepsäläinen, Iverson, Akahane-Yamada and Näätänen2010). However, it remains to be investigated whether adult learners display variability in their ability to extract the features that characterize phonemes across speakers and whether this variability is related to individual differences in voice processing ability.
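As a concrete, deliberately simplified illustration of this idea, the sketch below estimates distributional category representations from multi-speaker exemplar clouds in (F1, F2) space and classifies a new token by likelihood. It is a toy example in the spirit of the distributional and exemplar-based accounts discussed above, with entirely hypothetical category locations and spreads; richer multi-speaker input would simply yield better estimates of these distributions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy multi-speaker exemplar clouds for two vowel categories in (F1, F2) space.
# Means and spreads are hypothetical, loosely evoking /i/-like and /a/-like vowels.
exemplars = {
    "i": rng.normal(loc=[300.0, 2200.0], scale=[40.0, 150.0], size=(100, 2)),
    "a": rng.normal(loc=[700.0, 1100.0], scale=[60.0, 120.0], size=(100, 2)),
}

# Distributional category representation: the mean and covariance of each cloud.
params = {
    cat: (cloud.mean(axis=0), np.cov(cloud, rowvar=False))
    for cat, cloud in exemplars.items()
}

def classify(token):
    """Label a new token by the category whose distribution makes it most likely."""
    return max(params, key=lambda c: multivariate_normal(*params[c]).pdf(token))

print(classify(np.array([320.0, 2150.0])))  # expected: 'i'
```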
The assessment of voice abilities may be relevant to predict not only phoneme learning in the L2 but also the acquisition of the L1. Previous studies (Perea et al., Reference Perea, Jiménez, Suárez-Coalla, Fernández, Viña and Cuetos2014; Perrachione et al., Reference Perrachione, Del Tufo and Gabrieli2011) revealed an association between difficulties in voice recognition and dyslexia, a difficulty in learning to read whose origins are claimed to be rooted in a phonological deficit (Ramus, Reference Ramus2003). Impaired voice recognition abilities have been proposed as a marker of developmental dyslexia and a valuable measure to predict the disability (Perea et al., Reference Perea, Jiménez, Suárez-Coalla, Fernández, Viña and Cuetos2014). Moreover, an electrophysiological study reported a reduced encoding of features related to pitch in children with dyslexia compared to typically developing children (Chandrasekaran et al., Reference Chandrasekaran, Hornickel, Skoe, Nicol and Kraus2009). Chandrasekaran et al. (Reference Chandrasekaran, Hornickel, Skoe, Nicol and Kraus2009) suggested that individuals with dyslexia may experience challenges adapting speech processes to accommodate the characteristics of different voices. Considering voice processing as a general mechanism that enables the learning of speech-sound invariants would provide an explanatory mechanism for the co-occurrence of voice and phoneme deficits in dyslexia. However, extrapolating an effect that influences L2 phoneme learning to the acquisition of the L1 would require further testing. The neural processes that enable language learning during the first years of life are different from those that enable learning after that sensitive period has concluded (Hartshorne et al., Reference Hartshorne, Tenenbaum and Pinker2018; Werker & Hensch, Reference Werker and Hensch2015). Furthermore, theoretical models of non-native speech perception conceptualize the acquisition of the L2 as qualitatively different from the learning of an L1, since L2 learners must identify the cues that differentiate non-native phonemes from native ones (Best, Reference Best, Goodman and Nusbaum1994; Best & Tyler, Reference Best, Tyler, Bohn and Munro2007; Escudero, Reference Escudero, Boersma and Hamann2009; Flege, Reference Flege and Strange1995). Thus, investigating whether voice processing ability influences L1 phoneme learning would also shed light on the similarities and differences between learning an L1 and an L2.
The discussed implications of the present findings for language learning call for further research to better comprehend the nature of the relationship between voice processing abilities and L2 phoneme learning. Future studies that investigate how voice processing ability influences language learning should note that our battery of behavioral tests captured large individual differences in L2 phoneme proficiency in both sub-lexical and lexical contexts, as reported in previous studies that investigated similar populations with the same L2 phoneme tasks (Díaz et al., Reference Díaz, Mitterer, Broersma and Sebastian-Galles2012; Schmitz et al., Reference Schmitz, Díaz, Fernández Rubio and Sebastian-Galles2018; Sebastian-Galles et al., Reference Sebastian-Galles, Echeverría and Bosch2005; Sebastian-Galles & Díaz, Reference Sebastian-Galles and Díaz2012). We also replicated the high inter-individual variability in the ability to recognize and discriminate speakers that previous studies observed in healthy populations (Aglieri et al., Reference Aglieri, Watson, Pernet, Latinus, Garrido and Belin2017; Lavan et al., Reference Lavan, Burston and Garrido2019a; Mühl et al., Reference Mühl, Sheil, Jarutytė and Bestelmeyer2018). While previous studies evaluated voice processing with speech samples containing phonetic information from the participants’ native language, we employed a diverse set of experimental procedures to evaluate voice abilities in the participants’ native language, in an unfamiliar language, and from sub-lexical affect bursts. We observed variability in all indicators of voice processing ability, regardless of the participants’ familiarity with the language employed during the voice tasks, of whether the task trained participants to recognize the speakers, and of the linguistic content (sub-lexical or lexical) of the stimuli. This suggests that, while they likely influence task performance, language familiarity, voice familiarity, and linguistic content are not critical factors when evaluating voice processing ability in healthy populations. However, we acknowledge that the accuracy data of the VDT did not relate to voice ability in the CFA. While all voice processing ability indicators captured individual differences, the VDT differed considerably from the other two voice tasks: it involved neither linguistic information nor training, and it employed affective interjections that primarily modulate the fundamental frequency of the speech signal (Bachorowski et al., Reference Bachorowski, Smoski and Owren2001; Bachorowski & Owren, Reference Bachorowski and Owren2001; Lavan et al., Reference Lavan, Scott and McGettigan2016, Reference Lavan, Burton, Scott and McGettigan2019b), unlike phoneme changes, which are primarily encoded as changes in F1 and F2 (Fox et al., Reference Fox, Flege and Munro1995; Yang & Fox, Reference Yang and Fox2014). These three differences could explain why the VDT did not relate to voice ability in the CFA of the accuracy data. If future research supports the idea that linguistic content is not a crucial factor for capturing individual differences in voice processing ability, it could lead to the development of a voice-processing evaluative tool applicable to any population, regardless of linguistic background.
The combined use of CFAs and SEM revealed that the proficiency early L2 learners achieve in mastering L2 phoneme contrasts, an ability known to vary considerably among individuals (Archila-Suerte et al., Reference Archila-Suerte, Bunta and Hernandez2016; Díaz et al., Reference Díaz, Mitterer, Broersma and Sebastian-Galles2012; Schmitz et al., Reference Schmitz, Díaz, Fernández Rubio and Sebastian-Galles2018; Sebastian-Galles & Baus, Reference Sebastian-Galles, Baus and Cutler2005; Sebastian-Galles & Díaz, Reference Sebastian-Galles and Díaz2012), can be predicted from an individual’s ability to recognize and discriminate voices. Our models showed this effect despite the tasks employed as indicators of voice processing ability involving learning and memory components not present in the indicators of L2 phoneme learning. In other words, as noted by a reviewer, had the tasks employed as indicators of each construct been more similar in their domain-general cognitive requirements, the predictive capacity of voice processing ability on L2 phoneme learning would likely have been greater than that reported here. Furthermore, voice and phoneme processing differ in the relative importance of various acoustic features of the speech signal. Research suggests that voice processing primarily depends on changes at high spectral modulations (i.e., >1.1 cycles per octave at center frequencies of up to 0.8 kHz), while phoneme category is mostly determined by changes at lower spectral modulations (i.e., broad spectral modulations for center frequencies above 0.6 kHz) and fast temporal changes (i.e., >7.8 Hz) (Rutten et al., Reference Rutten, Santoro, Hervais-Adelman, Formisano and Golestani2019). Therefore, the predictive capacity of voice processing ability for L2 phoneme learning is not due to both processes relying on the same acoustic features. However, this study does not establish a definitive causal relation between voice processing ability and L2 phoneme learning. While theoretical accounts of speech perception could support a causal relation, it remains feasible that the association between voice processing ability and L2 phoneme learning stems from a common origin: the listener’s sensitivity to phoneme changes in any given language. This interpretation was also presented in the study that inspired the current investigation (Díaz et al., Reference Díaz, Cordero, Hoogendoorn and Sebastian-Galles2022) and is based on two sets of findings: speaker recognition accuracy being influenced by the phoneme knowledge of the listener (Fecher & Johnson, Reference Fecher and Johnson2019, Reference Fecher and Johnson2022; Perrachione et al., Reference Perrachione, Del Tufo and Gabrieli2011) and the relation between the mastery of L2 phoneme contrasts and the ability to discriminate both native and unfamiliar phoneme contrasts (Díaz et al., Reference Díaz, Baus, Escera, Costa and Sebastian-Galles2008, Reference Díaz, Mitterer, Broersma, Escera and Sebastian-Galles2016). However, the alternative interpretation of voice processing ability and L2 phoneme learning emerging from a common underlying process lacked conclusive support from the single-latent CFAs. The analysis yielded good fit for the accuracy data but failed to adequately fit the RT data. Advocating for the validity of the single-latent model would entail disregarding the RT data, a measure of cognitive processing that is as valid as, and complementary to, accuracy data (Ratcliff et al., Reference Ratcliff, Smith and McKoon2015a).
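For readers who wish to see the shape of such an analysis, the snippet below sketches how a two-latent structural model of this kind can be specified in Python with the open-source semopy package. The indicator names and the input file are placeholders loosely mirroring the tasks listed in Appendix 1; this is a generic sketch under those assumptions, not the analysis code of the present study, which is available at the OSF repository listed in the Data availability statement.

```python
import pandas as pd
import semopy

# Two-latent specification (lavaan-style syntax): each latent ability is
# measured by its task indicators, and voice ability predicts L2 phoneme
# learning. Indicator names are placeholders, not the study's actual columns.
MODEL_DESC = """
voice =~ vrt_l1 + vrt_lx + vdt
phoneme =~ ct + ldt + nwat
phoneme ~ voice
"""

data = pd.read_csv("task_scores.csv")  # hypothetical table of task scores
model = semopy.Model(MODEL_DESC)
model.fit(data)

print(model.inspect())           # loadings and the voice -> phoneme path estimate
print(semopy.calc_stats(model))  # fit indices such as CFI, RMSEA and AIC
```

A competing single-latent model can be specified analogously (one factor measured by all six indicators) and the two compared on the resulting fit indices, mirroring the model comparison described above.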
Another potential limitation of the current study is the relatively small sample size. Some recommendations suggest employing sample sizes of up to several thousand individuals when conducting SEM (Kline, Reference Kline2015; Schumacker & Lomax, Reference Schumacker and Lomax2010). A sample size of such proportions was unfeasible due to the strict inclusion criteria participants had to meet. Nonetheless, an a priori power analysis confirmed that our analyses were sufficiently powered, and, indeed, both the CFA and SEM exhibited good fit when including two latent variables. A second potential limitation related to the sample of this study is the greater number of female than male participants. However, no significant performance differences between male and female participants were observed in the indicators of either latent variable. This suggests that the higher proportion of female participants did not influence our primary findings.
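As an aside for readers unfamiliar with a priori power analyses, one common way to conduct them is by Monte Carlo simulation: assume an effect size, simulate many datasets of the planned sample size, and count how often the effect is detected. The sketch below is a deliberately simplified, generic version based on a correlation test; the assumed effect size is illustrative, and the procedure is not the power analysis reported in this study.

```python
import numpy as np
from scipy import stats

N, TRUE_R, ALPHA, N_SIMS = 57, 0.45, 0.05, 5000  # TRUE_R is an assumed effect size

rng = np.random.default_rng(42)
hits = 0
for _ in range(N_SIMS):
    x = rng.standard_normal(N)  # stand-in predictor (e.g., voice ability scores)
    # Construct y so that its population correlation with x equals TRUE_R.
    y = TRUE_R * x + np.sqrt(1 - TRUE_R**2) * rng.standard_normal(N)
    _, p = stats.pearsonr(x, y)
    if p < ALPHA:
        hits += 1

print(f"Estimated power at n={N}, r={TRUE_R}: {hits / N_SIMS:.2f}")
```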
In conclusion, our findings contribute to understanding the processes involved in speech perception and language learning: Individual differences in voice processing ability among early bilingual adults can predict the proficiency they achieve in L2 phoneme learning. By recognizing voice processing as a predictive factor in language learning, we deepen our understanding of the variability in L2 proficiency observed among early bilingual adults. This perspective opens new avenues for research, ranging from the acquisition of the native language to educational applications.
Supplementary material
To view supplementary material for this article, please visit http://doi.org/10.1017/S136672892400110X.
Data availability statement
The data collected and the analysis code are accessible at https://osf.io/symg2/?view_only=9f331bf8fc2146099874e81de4e908ae.
Acknowledgements
This work was supported by grants from the Ministry of Science and Innovation of the Spanish Government, State Research Agency and European Regional Development Fund (PID2019-106924GA-I00, PID2022-137368NB-I00 and PID2021‐123416NB-I00 funded by MICIN/AEI/10.13039/501100011033/FEDER UE) awarded to BD and NSG. MP was awarded a grant from the Valencian Government (CIAICO/2021/172). NSG was awarded the ICREA Academia Prize by the Catalan Government. GC was supported by a doctoral fellowship from the Universitat Internacional de Catalunya. Two grants financed by the Catalan Generalitat AGAUR (2021 SGR 00911 and 2021 SGR 00625) also supported this work.
Competing interests
None declared.
Appendices
Appendix 1
All abbreviations employed in the present study in order of appearance.
- L2: Second language
- F1: First formant
- F2: Second formant
- L1: Native language
- SEM: Structural equation model
- Lx: Unfamiliar language
- RT: Reaction time
- CFA: Confirmatory factor analysis
- VRT: Voice recognition task
- VDT: Voice discrimination task
- CT: Categorization task
- LDT: Lexical decision task
- NWAT: Non-word association task
- AIC: Akaike’s information criterion
- χ²: Chi-square test of model fit
- CFI: Comparative fit index
- SRMR: Standardized root mean square residual
- RMSEA: Root mean square error of approximation
- D²M: Mahalanobis distance
Appendix 2
n = 57 except for the lexical decision task in which n = 55.
Appendix 3
For the LDT, hits and false alarms have been calculated separately for experimental and control trials. Standard deviations are presented in brackets.
Appendix 4
ms = milliseconds.