1. Introduction
In the last few decades, there has been growing interest in the predictive role of individual differences in ‘phonetic language abilities’. Christiner and Reiterer (2018) defined ‘phonetic language abilities’ as ‘the capacity to imitate, mimic and pronounce spoken speech based on holistic judgments of human native speaker raters, judging imitated prosody as well as phonetic (segmental) aspects’ (p. 2). In particular, the differences between the cognitive function of working memory and music perception as individual aptitudes have received special attention. On the one hand, music and language are both related to human acoustic and sensory-motor systems and these common networks and processes have led to the hypothesis that music may influence language production (Patel, 2011). On the other hand, there is evidence that working memory capacity affects language processing and production (Christiner & Reiterer, 2018; Christiner et al., 2018, 2022). In what follows, we review the literature on the role of musicality and working memory in the perception and production of both familiar languages – including an L2 in the process of acquisition – and unfamiliar languages.
1.1. The role of musicality in the perception and production of familiar and unfamiliar languages
Although there is still debate on whether the mechanisms underlying speech and music perception are overlapping or rather dissociable, cognitive science has revealed a compelling and complex relationship between music and language. On the one hand, neuroscientific evidence revealed that the representations of music and nonmusic sounds are distinct in the auditory cortex (Boebinger et al., 2021; Leaver & Rauschecker, 2010; Norman-Haignere et al., 2015, 2022). On the other hand, the processing of music and speech stimuli involves some overlapping brain areas (Patel, 2014; Peretz et al., 2015), because the response to music and speech activates a large overlapping portion of the auditory cortex (Rogalsky et al., 2011). Moreover, violating syntactic rules of speech and harmonic rules of music led to similar neural responses (Besson & Schön, 2001). Taken together, although the neuronal populations that respond to music and speech differ, they seem to occur in overlapping brain areas (Peretz et al., 2015). Therefore, it is reasonable to hypothesize that when neural networks are trained through extensive musical practice, they can help process acoustic information related to not only music but also speech with high precision (Patel, 2011, 2014).
An individual’s ability to perceive and produce the phonetic features of nonnative speech correlates with both their musical expertise and musical aptitude. Musical expertise is usually defined as the number of years of formal musical training (Zhang et al., 2020), and musical aptitude refers to the ability to intuitively learn, understand and appreciate music, a kind of inherent potential for learning music before formal music training (Law & Zentner, 2012). To capture this individual ability, researchers usually measure the participants’ ability to discriminate between differences in various components of music like rhythm and pitch. This is done by playing paired musical statements to the participants and then asking them to indicate whether the statements they heard were the same or different.
First, individuals with higher musical expertise tend to excel in processing and perceiving pitch in speech, for instance, in the discrimination and identification of L2 lexical tones (Cooper & Wang, 2012; Delogu et al., 2010; Gottfried et al., 2004; Lee & Hung, 2008; Marie et al., 2011) or pitch deviations in L2 intonation (Marques et al., 2007; Martínez-Montes et al., 2013). In addition, experienced musicians show higher sensitivity to other aspects of speech processing such as rhythmic grouping (Boll-Avetisyan et al., 2016), speech stream segmentation (François et al., 2014), speech timing (Sadakata & Sekiyama, 2011), speech sound perception (Marie et al., 2011; Sadakata & Sekiyama, 2011) and even the perception of subsegmental features like voice onset time (Ott et al., 2011). More importantly, musicians outperform nonmusicians in their ability to imitate unfamiliar languages as measured in terms of intelligibility (Delogu & Zheng, 2020), suprasegmental accuracy (Pei et al., 2016) and overall imitation accuracy (Murljacic, 2020; Pastuszek‐Lipinska, 2008).
Second, regarding musical aptitude, musical perception skills play a predictive role in speech perception and production abilities. Slevc and Miyake (2006) showed that musical aptitude predicts receptive and productive L2 phonology, and their findings were confirmed and extended in subsequent studies. For instance, among nontonal language speakers, those who showed higher musical aptitude outperformed those who were less musically talented in identifying L2 lexical tones (Cooper & Wang, 2012). Learners with better musical aptitude also showed enhanced intelligibility in L2 speech imitation (Delogu & Zheng, 2020). Likewise, children with better musical aptitude were more likely to perceive changes in duration in both speech and music (Milovanov et al., 2009). Interestingly, in a study involving accent-faking tasks, in which participants were asked to speak in their L1 while imitating an L2 accent, participants with greater musical aptitude were able to do this more easily, suggesting a correlation with overall phonological awareness (Coumel et al., 2019).
Unsurprisingly, musical production skills are associated with productive language abilities. For instance, adults and children with higher singing aptitudes performed better in imitating a series of unintelligible and unfamiliar speech sounds (Christiner & Reiterer, 2013, 2018) and showed better overall L2 pronunciation (Milovanov et al., 2008). A higher musical production aptitude may also have a positive effect on not only the production but also the perception of L2 speech. For instance, Li and Dekeyser (2017) and Slevc and Miyake (2006) measured participants’ ability to produce tones by asking them to orally repeat the musical stimuli played to them using the syllable sequence ‘lalala’, their output being recorded and later rated by professional singers. Both studies found positive correlations between musical production aptitude and L2 speech production and perception skills.
1.2. Correlations between the specific components of musical aptitude and language perception and production
Musical aptitude is a multi-dimensional construct that consists of many components. Several studies have investigated how its separate components correlate with specific aspects of speech perception and production. Thus far, it is the rhythmic and pitch perception and production components that have received the most attention.
Rhythmic perception skills correlated with a more accurate perception of speech rhythm (Boll-Avetisyan et al., 2017), the production of L2 long and short vowels (Li et al., 2020), and the ability to imitate unfamiliar languages accurately (Christiner & Reiterer, 2013). By the same token, individuals with better rhythmic production skills (i.e., those better able to reproduce musical rhythmic sequences) could produce word stress more accurately and thus exhibited greater fluency in an L2 (Zheng et al., 2022), and also reproduced unfamiliar prosody more accurately, specifically in terms of stress-accent placement (Cason et al., 2020).
Musical pitch perception skills were associated not only with the perception of L2 lexical tones (Li & Dekeyser, 2017) and the production of lexical tones in unfamiliar languages (Christiner et al., 2022) but also with L2 pronunciation in general (Posedel et al., 2012). Importantly, pitch perception abilities may predict successful learning of L2 words with lexical tones (Bowles et al., 2016) and the production of L2 intonation (Yuan et al., 2019). However, when the learning target is not related to pitch, pitch perception skills do not correlate significantly with the ability to learn other phonetic features of an L2, such as vowel length (Li et al., 2020).
There is contradictory evidence regarding the role of other components of musical aptitude in phonetic language abilities. For example, melodic perception skills correlated with the perception of L2 lexical tones (Delogu et al., 2010) and the production of L2 intonation (Jekiel & Malarski, 2023; Yuan et al., 2019). However, a recent study found no significant correlation between melodic production skills and L2 speech production (Zheng et al., 2022). Similarly, accent and melody perception skills, but not tempo or tuning skills, significantly predicted the imitation performance of English regional variants by native English speakers (Murljacic, 2020). A recent study, however, has shown that while musical rhythm and pitch perception abilities alone could not predict accent-faking accuracy, singing abilities could (Coumel et al., 2023).
At the segmental level, different components of musical aptitude may correlate with the ability to produce challenging L2 sounds, although studies have yielded inconsistent results. For instance, the musical timing perception skills of Japanese students predicted their ability to imitate English /r–l/ contrasts accurately, whereas pitch, loudness and rhythmic perception skills did not (Dolman & Spring, 2014). While having good rhythmic perception skills positively correlated with the ability to produce challenging L2 vowels, this was not the case with melodic and pitch perception skills (Jekiel & Malarski, 2021).
Nevertheless, only a handful of studies have looked for correlations between the different components of musical aptitude and phonetic language abilities as reflected through individuals’ abilities to imitate unknown languages, and these studies have yielded mixed results. The first study compared the predictive role of rhythm and pitch perception abilities in the production of familiar (English) and unfamiliar (Hindi) languages by German speakers and found that rhythmic – but not pitch – perception abilities significantly predicted the ability to imitate Hindi (Christiner & Reiterer, 2013). Later, Christiner et al. (2018) showed that the predictive role of specific music components on the imitation abilities of unfamiliar languages may be dependent on the typology of the target language. Specifically, they found that pitch perception ability predicted the imitation abilities in a tone language (Chinese) while rhythm perception ability predicted the imitation abilities in a stress language (Tagalog). Finally, pitch perception abilities could predict the imitation accuracy of Chinese tones by German speakers who had no prior knowledge of Chinese (Christiner et al., 2022).
To the best of our knowledge, no previous studies have assessed whether the native language of the participants can also modulate the predictive role of the musical aptitude components, as most of the studies included participants from a homogeneous L1 background. Christiner et al. (2018) tested participants with different native languages including Bosnian, Serbian, Turkish and Macedonian. However, they did not test whether the participants’ L1 influenced the predictive value of the different musical components. In other words, it remains an open question whether the results can be applied to speakers of different language typologies. Since tone languages manipulate pitch more on the lexical level than on the intonational level, tone language speakers might differ from intonation language speakers in their sensitivity to certain musical components. In fact, Chinese speakers have been shown to have finely tuned pitch perception skills similar to those of musicians (Bidelman et al., 2013).
Taken as a whole, this body of research suggests that it would be of interest to explore how speakers of tonal languages differ from intonation language speakers in terms of how their musical aptitude skills might be transferable to their processing and production abilities of L2s or unfamiliar languages.
1.3. The role of working memory in the perception and production of familiar and unfamiliar languages
Working memory refers to the temporary storage and simultaneous manipulation of information during cognitive processes, providing interfaces between perception, long-term memory and action. It is critical for higher cognitive functions such as planning, problem-solving and reasoning, as well as for processing and decoding speech and music (Schulze & Koelsch, 2012). In the context of L2 learning, working memory positively correlates with overall language proficiency (Kormos & Sáfár, 2008), vocabulary learning (Cheung, 1996) and grammar accuracy (Abdallah, 2010; O’Brien et al., 2006).
Regarding L2 speech learning, empirical research has not yet yielded consistent results on the predictive role of working memory. On the positive end, working memory was related to the development of speech fluency (O’Brien et al., 2007), narrative skills (O’Brien et al., 2006) and overall speech proficiency as measured by complexity, accuracy and fluency (Fortkamp, 2000; Trude & Tokowicz, 2011). Working memory may also affect outcomes of L2 pronunciation training such as accuracy in the imitation of an English dialect (Baker, 2008) and the perceptual learning of individual vowels (Aliaga-Garcia et al., 2010). By contrast, some studies did not show significant correlations between working memory and aspects of L2 speech production, such as fluency (Mizera, 2006), overall pronunciation accuracy (Posedel et al., 2012, p. 201), intelligibility and accentedness (Slevc & Miyake, 2006) and the production of specific L2 features like duration (Li et al., 2020).
Likewise, mixed results were obtained on the role of working memory in predicting an individual’s phonetic language abilities as manifested in their skill at imitating unfamiliar languages. Focusing first on the positive findings, working memory capacities have been shown to predict the imitation abilities of unfamiliar languages in both children (Christiner & Reiterer, 2018; Christiner et al., 2018) and adults (Christiner & Reiterer, 2013, 2018). By contrast, some recent studies have shown that while musical aptitude and singing abilities were significant predictors of phonological awareness as measured by an L2 accent-faking task (Coumel et al., 2019, 2023), working memory was not (Coumel et al., 2019). Also, working memory was not a significant predictor of the imitation of unfamiliar languages (Li et al., 2022). These results suggest that working memory capacity is a potential predictor of individual differences in phonological awareness, although it might be less predictive than musical aptitude. Given the inconclusive findings in previous research, more evidence is needed to assess the predictive value of working memory. Therefore, it seems relevant to include working memory in the investigation of phonetic language ability.
1.4. Goals of the present study
Considering the previous literature, further evidence is needed to determine which components of musical aptitude are better predictors of speech imitation abilities, and how they compare with working memory in this regard. Although the literature reviewed in Section 1.2 identified rhythm and pitch as the most relevant components of musical aptitude in predicting phonetic language abilities, some results pointed to the relevance of melody and accent as well. Therefore, the present study will focus on those four components, namely, accent, melody, pitch and rhythm. In this study, then, we aim to investigate the predictive role of specific perceptual components of musical aptitude and working memory capacity on the speech imitation skills of two groups of participants with typologically different native languages, namely, Catalan (an intonation language) and Chinese (a tone language).
The present study poses the following two research questions:
• RQ1: Do musical perception skills predict phonetic language abilities better than working memory?
• RQ2: Which components of musical perception skills predict phonetic language abilities? Does the predictive effect of these components hold across speakers of typologically different languages?
For RQ1, we hypothesized that musical perception skills would be more predictive than working memory. RQ2, however, is largely exploratory, given the typological differences between Chinese and Catalan. Chinese speakers showed pitch discrimination abilities similar to those of musicians (Bidelman et al., 2013), and Catalan speakers were sensitive to changes in specific parts of the pitch contour such as pitch accents and boundary tones (Prieto et al., 2015). Therefore, it would be reasonable to hypothesize that if speakers demonstrate musician-like expertise in one specific domain due to the prosodic properties of their L1, this musical skill will be less relevant to the imitation skills of unfamiliar languages compared to other components.
2. Methods
2.1. Participants
We recruited 144 Chinese-speaking middle-school students (80 females, mean age 13.93 years) from China and 61 Catalan-speaking undergraduate students (54 females, mean age 19.70 years) from Spain. All the participants reported having normal hearing and no speech impairment and had no prior exposure to the languages that were used in the speech imitation task. No participant had received training in voice or on a musical instrument for more than half a year. Thus, all the participants were considered to have essentially no musical expertise. The participants and their legal guardians, in the case of a minor, gave prior written consent allowing speech data collected from them to be used for academic purposes.
2.2. Materials
The experiment consisted of three tasks: a battery of tests assessing musical perception skills consisting of subtests for accent, melody, pitch and rhythm; a forward digit span task to measure working memory; and a speech imitation task with sentences in six languages that were unfamiliar to the participants to assess speech imitation skills.
2.2.1. Musical perception skills tests
To measure musical perception skills, we opted for the Profile of Music Perception Skills (PROMS; Law & Zentner, 2012), which is free online and provides an objective assessment of musical perception skills in various components such as pitch, rhythm, melody, accent, timbre, tempo and harmony. PROMS can be tailored to specific research needs in terms of both skill components and duration of the task (i.e., there are micro, mini, short and full versions), with even the short version producing reliable test scores and good internal consistency (Zentner & Strauss, 2017).
For the present study, we chose the short versions of the PROMS subtests measuring accent, melody, pitch and rhythm. The accent subtest measured the participants’ ability to detect emphasis in rhythmic patterns with isometric notes varying in intensity. The melody subtest included monophonic rhythms. The pitch subtest used pure tones and varied pitch differences. The rhythm subtest had two-bar notes with constant intensity but varying duration. In all the subtests, participants were asked to detect differences between paired auditory stimuli, where the differences ranged from obvious to subtle.
2.2.2. Forward digit span task
Digit span is a measure of working memory, which belongs to the cognitive system that allows for the temporary storage of information (Baddeley, 2003). In order to keep the experiment a reasonable length, we selected a forward digit span task, meaning that participants were only asked to repeat a sequence of digits in the order in which they had appeared and were not expected to try to repeat them in reverse order (a cognitively more challenging task). Adapting Woods et al.’s (2011) method, we used WinSCP software to develop an online test. The test was based on a script written by Eisenberg et al. (2017) and modified by Navarro Pérez and Rohrer (2020).
2.2.3. Speech imitation task
A total of six languages belonging to different language typologies were selected for the speech imitation task, with two sentences taken from each language. For L1 Chinese participants, the six target languages were Catalan, Hebrew, Japanese, Tagalog, Turkish and Vietnamese, whereas for L1 Catalan participants, we replaced Catalan with Chinese. The syllable count of the sentences varied from six to twelve. Table 1 lists all the sentences with English translations. It is important to point out that the goal of the speech imitation test was to obtain an overall score of speech imitation abilities based on widely diverse phonetic targets; it was not designed to assess the participants’ ability to imitate a specific language.
Seven native speakers (one for each language) were audio-recorded in a soundproof room as they read each of the two sentences four times in a row. Afterward, the clearest tokens of the four recordings were selected as the target stimuli. The audio recordings were edited with Audacity and uploaded onto the Alchemer online survey platform (www.alchemer.com), where they constituted the auditory stimuli that participants would first hear and then repeat.
2.3. Procedure
After signing the written consent form, each participant carried out the full sequence of tasks (musical skills subtests, forward digit span task and speech imitation task) online on a laptop, working individually in a silent room. The full procedure lasted around 30 min per participant.
2.3.1. PROMS-S test battery
First, the participants did the PROMS-S subtests for accent, melody, pitch and rhythm, with each subtest containing eight to ten trials of varying degrees of difficulty. In each trial, participants first listened twice to the same stimulus (the ‘referent’). After a short interval, they listened to a comparison stimulus (the ‘comparison’). Participants were required to indicate whether the comparison stimulus differed from the referent stimulus or not and choose one answer from five options: definitely different, probably different, I don’t know, probably the same and definitely the same. The PROMS-S test battery lasted approximately 20 min.
2.3.2. Forward digit span task
Participants were then shown a link on the laptop screen to access the digit span test. The task consisted of 14 trials. For each trial, participants were first presented with a sequence of digits appearing consecutively in the center of the screen and were then asked to replicate the sequence they had seen using the laptop keyboard, pressing the ‘Enter’ key when finished to proceed to the next trial. The number of digits in each sequence differed, with the first trial showing a sequence of only three digits. If the participants were able to replicate the three-digit sequence successfully, the program showed them a four-digit sequence, then a five-digit sequence, and so on. If the participant failed to replicate a sequence correctly on two trials in a row, the program reduced the length of the sequence by one digit. The task ended with the fourteenth trial regardless of how many digits had been presented in the last trial. The system automatically calculated and recorded participants’ scores. The full task lasted approximately 5 min.
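The adaptive rule described above can be illustrated with a minimal Python sketch. It is not the script used in the study; the function name `run_staircase`, the one-digit lower bound and the resetting of the failure counter after each step-down are assumptions made for illustration, and the scoring metric itself is left out because it is not specified here.

```python
def run_staircase(outcomes, start_length=3, min_length=1):
    """Return the sequence length presented on each trial.

    outcomes: one boolean per trial (True = sequence replicated correctly).
    Assumptions (not stated in the text): the failure counter is reset
    after a step-down, and the length never drops below min_length.
    """
    length = start_length
    consecutive_failures = 0
    presented = []
    for correct in outcomes:
        presented.append(length)
        if correct:
            length += 1                 # success: one more digit next time
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == 2:
                # two failures in a row: one digit fewer next time
                length = max(min_length, length - 1)
                consecutive_failures = 0
    return presented


# Hypothetical participant: two successes, two failures, one success.
lengths = run_staircase([True, True, False, False, True])
```

Under these assumptions, a participant who never fails would see sequences of 3 to 16 digits over the 14 trials.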
2.3.3. Speech imitation task
Finally, still working with the laptop, the participants proceeded to the online testing platform Alchemer to complete the speech imitation task. This involved listening to each model sentence twice and then imitating each sentence once. The 12 stimulus sentences (2 tokens × 6 languages) were presented to the participants randomly and no translations were provided. The speech imitation test lasted approximately 5 min. Participants’ oral output was recorded through a professional-quality audio recorder placed in front of them and activated by the experimenters at the outset of the speech imitation task. In total, 2,460 recordings were obtained of sentences being imitated [(144 Chinese participants + 61 Catalan participants) × 6 languages × 2 sentences].
2.4. Data coding
From the PROMS-S test battery results, a composite musical perception score was calculated by aggregating the scores of the four subtests (accent, melody, pitch and rhythm), which were automatically generated by the PROMS platform according to the following criteria. Whenever the participant correctly identified a comparison stimulus as being ‘definitely’ the same as or different from the referent stimulus, they were awarded two points; if they correctly identified the comparison stimulus as ‘probably’ the same or different, they were awarded one point. A wrong answer or ‘I don’t know’ received 0 points. The score for each subtest was the sum of the scores for all items.
As noted above, scores on the forward digit span task were generated automatically by the program in WinSCP following the guidelines by Woods et al. (2011).
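The confidence-weighted scoring rule can be sketched in Python as follows. This is only an illustration of the criteria listed above, not the PROMS platform's own code; the names `RESPONSES`, `score_trial` and `score_subtest` are hypothetical.

```python
# Each response option maps to (judgment, points if the judgment is correct).
RESPONSES = {
    "definitely different": ("different", 2),
    "probably different": ("different", 1),
    "I don't know": (None, 0),
    "probably the same": ("same", 1),
    "definitely the same": ("same", 2),
}


def score_trial(response, truth):
    """Score one trial; truth is 'same' or 'different'."""
    judged, points = RESPONSES[response]
    # Wrong answers and "I don't know" both earn 0 points.
    return points if judged == truth else 0


def score_subtest(trials):
    """Sum the trial scores of one subtest; trials is a list of (response, truth)."""
    return sum(score_trial(response, truth) for response, truth in trials)


# Hypothetical four-trial example: 2 + 1 + 0 + 0 points.
example = [
    ("definitely the same", "same"),
    ("probably different", "different"),
    ("I don't know", "same"),
    ("definitely different", "same"),
]
subtest_score = score_subtest(example)
```

How the four subtest scores were aggregated into the composite is not detailed here, so the sketch stops at the subtest level.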
A score for participants’ speech imitation ability was obtained as follows. First, the recordings of the participant imitating the two prompt sentences in each language were assessed perceptually by three native speakers of that language (7 languages × 3 raters). Each rater judged how closely the participant approached native-like pronunciation on a 9-point Likert scale, with ‘1’ indicating completely non-native or unintelligible pronunciation and ‘9’ fully native-like pronunciation. Before performing the evaluations, all raters underwent a brief training session to try to ensure some consistency in the criteria they applied when rating. They were first given instructions about how to rate, it being emphasized that they were to rate recordings based on their overall impression of the speaker’s pronunciation rather than by focusing on elements such as specific phonemes. Raters were also instructed that a rating of 1 (the minimum) should be assigned to recordings where participants had produced only a small number of syllables because this constituted insufficient information to form a valid opinion. Raters then practiced by evaluating six sample recordings that were not part of the current experiment. The resulting ratings were compared, and whenever a sharp discrepancy among ratings was detected, this was discussed among the raters until a consensus was reached on the most appropriate rating. The same training procedure was carried out for each of the seven groups of language raters.
The raters then proceeded to rate the recordings of participants, working independently and in isolation, their ratings being recorded directly on the Alchemer online platform. This task required on average 90 min. After the ratings were completed, inter-rater reliability (intraclass correlation coefficients, ICCs) between the three raters of each language was checked using the icc() function from the irr package, version 0.84.1 (Gamer et al., 2019) in the R program, version 4.2.2 (R Core Team, 2014). ICC estimates were based on mean ratings (k = 3), a consistency definition and a two-way mixed-effects model. Most of the results (Table 2) showed an acceptable (ICC > 0.7) to excellent (ICC > 0.9) estimated mean ICC across the three raters for each of the six languages imitated by the two groups of participants (see Koo & Li, 2016 for the interpretation of ICC). Only the Vietnamese items imitated by Chinese speakers showed an estimated ICC below the 0.7 threshold due to the exclusion of the items with an imitation score of 1. If calculated without data exclusion (see Section 2.5), the mean ICC for Vietnamese imitated by Chinese speakers was 0.75 [0.70, 0.80]. We thus concluded that the shortfall of data here would not affect the overall validity of our analysis. Finally, we averaged the ratings of the three raters for each item to create a mean speech imitation score (henceforth ‘imitation score’) for the follow-up analysis.
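For readers unfamiliar with this ICC form, the following Python sketch computes a consistency-type ICC for mean ratings from the two-way ANOVA mean squares, ICC = (MS_rows − MS_error) / MS_rows. It is an illustrative re-implementation and not the irr::icc() code used in the study; the function name and the toy data are hypothetical, and confidence intervals are not computed.

```python
def icc_consistency_average(ratings):
    """Consistency ICC for the mean of k raters (two-way model).

    ratings: list of rows, one row per rated item, one column per rater.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Toy data: three raters who differ only by a constant offset.
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
toy_icc = icc_consistency_average(toy)
```

Because the consistency definition ignores systematic differences between raters, raters who rank the items identically but use different parts of the scale still yield an ICC of 1 in this toy case.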
2.5. Statistical analyses
Four linear mixed models (LMM) were built to analyze the predictive role of musical perception abilities and working memory using the lmer() function from the lme4 package, version 1.1.31 (Bates et al., 2015) in R. Models 1 and 2 addressed RQ1 for Catalan speakers and Chinese speakers, respectively. Similarly, models 3 and 4 addressed RQ2 separately for each participant group. In all four models, the dependent variable was the speech imitation score. In models 1 and 2, the independent variables were the composite musical perception score and working memory score, whereas in models 3 and 4, the independent variables were the subtest scores for accent, melody, pitch and rhythm separately, and working memory score. Scores for all variables were transformed into z-scores. Specifically for the speech imitation data, before z-score transformation, all items that had obtained a mean rating of 1 (i.e., all three raters gave a score of 1, meaning that the recording offered too small a speech sample to assess) were excluded from further analysis. In this way, 2 out of 732 speech recordings (0.3%) by Catalan participants and 121 out of 1,725 speech recordings (7%) by Chinese participants were excluded.
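The exclusion and standardization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names and example data are hypothetical, and the z-scores use the sample standard deviation (n − 1 denominator), which is the default of R's scale() but is not specified in the text.

```python
from statistics import mean, stdev


def exclude_floor_items(items, floor=1):
    """Drop recordings whose mean rating equals the floor value.

    A mean of 1 can only arise when every rater gave a 1, i.e. the
    recording was too short to be assessed.
    """
    return [item for item in items if mean(item["ratings"]) > floor]


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # assumption: n - 1 denominator
    return [(v - m) / s for v in values]


# Hypothetical recordings, each rated by three raters
recordings = [
    {"id": "r1", "ratings": [1, 1, 1]},  # excluded: all raters gave 1
    {"id": "r2", "ratings": [1, 1, 2]},  # kept: mean rating is above 1
    {"id": "r3", "ratings": [5, 6, 7]},
    {"id": "r4", "ratings": [8, 9, 7]},
]
kept = exclude_floor_items(recordings)
imitation_scores = [mean(item["ratings"]) for item in kept]
imitation_z = z_scores(imitation_scores)

# Exclusion rates reported in the text
print(round(100 * 2 / 732, 1), round(100 * 121 / 1725, 1))
```

Note that the exclusion is applied before standardization, so the mean and standard deviation used for the z-scores are computed on the retained recordings only.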
To select the best-fitting models, we built four full models including all the possible random slopes for the two random intercepts: participant and item. Here, item refers to the 12 sentences regardless of the language. We chose not to treat specific language as a fixed or random effect for two reasons. First, we were interested in participants’ overall ability to imitate unfamiliar languages and not in whether they could imitate one language better than another. Second, for each of the six target languages, participants were asked to imitate only two short sentences. As we were not interested in the by-language variance, we decided to treat each sentence as a single item when building the statistical models.
We then ranked all the possible models from the full model to the null model using the buildmer() function from the buildmer package, version 2.8 (Voeten, 2021). The best-fitting models were the best-ranking models without singular fit issues. As a result, model 1 (Catalan speakers) involved a random intercept of item with a random slope of working memory and a random intercept of participant with a random slope of musical perception score. Model 2 (Chinese speakers) involved a random intercept of item with a random slope of working memory and a random intercept of participant. Model 3 (Catalan speakers) involved a random intercept of participant with random slopes for working memory score and rhythm score, and a random intercept of item with a random slope for working memory score. Model 4 (Chinese speakers) involved a random intercept of item with random slopes for working memory score and pitch score, and a random intercept of participant.
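As an illustration of the random-effects structures just described, model 2 can be written out as below. This is a sketch based on the verbal description only: the symbols are introduced here for exposition, and whether the by-item intercept and slope were allowed to correlate is an assumption, since the text does not say. In lme4 syntax, with hypothetical variable names, the same structure would read imitation ~ music + wm + (1 | participant) + (1 + wm | item).

```latex
% Model 2 (Chinese speakers); i indexes participants, j indexes items.
% All variables are z-scored. Symbol names are illustrative.
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{Music}_i + (\beta_2 + w_j)\,\mathrm{WM}_i
          + u_i + v_j + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
(v_j, w_j)^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

Here u_i is the random intercept of participant, v_j the random intercept of item, and w_j the by-item random slope of working memory. The other three models differ only in which predictors enter the fixed part and which slopes are added to the participant and item terms.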
3. Results
Table 3 summarizes the descriptive data for all the variables on their original scale from the Catalan and Chinese participants.
a Musical perception score is the sum of accent, melody, pitch, and rhythm scores.
3.1. RQ1: Do musical perception skills predict phonetic language abilities better than working memory capacity?
The results of models 1 and 2 (Table 4) revealed a significant main effect of musical perception score (both p < 0.05), which means that participants’ musical perception abilities significantly predicted their speech imitation abilities, for both Catalan and Chinese speakers. As for the role of working memory, there was no significant main effect in either model. This suggests that working memory is not a significant predictor of speech imitation abilities for either Catalan or Chinese speakers.
Note: Estimates (β) represent the change in speech imitation score resulting from a change in each fixed factor. Significant results are bolded.
3.2. RQ2: Which components of musical perception skills predict phonetic language abilities, and does the predictive effect of these components hold across speakers of typologically different languages?
As for the predictive role of the specific components of musical perception skills, model 3 (Table 5) and model 4 (Table 6) revealed different results. Model 3 showed that melody was the only significant predictor of Catalan speakers’ imitation ability, whereas model 4 revealed that accent was the only significant predictor of Chinese speakers’ imitation ability (both p < 0.05).
Note: Estimates (β) represent the change in speech imitation score resulting from a change in each fixed factor. Significant results are bolded.
4. Discussion and conclusions
The present study examined (RQ1) the role of two cognitive individual factors, namely, musical perception skills and working memory capacity, in predicting phonetic language abilities; and (RQ2) whether the predictive effect of specific components of musical perception skills is subject to the speakers’ native languages. The typologically different languages included Catalan (an intonation language) and Chinese (a tone language).
Regarding RQ1, we found that general musical perception skills, but not working memory capacity, predicted the ability to imitate unfamiliar languages in the two groups of speakers. This is in line with the results of previous research showing that musical perception skills correlated with phonetic language abilities. In this regard, our findings add further cross-linguistic evidence that the phonetic language abilities of speakers of both intonation languages, like Catalan, and tone languages, like Chinese, are moderately predicted by their general musical aptitude, supporting the hypothesis that there is cognitive overlap between music and language (Chobert & Besson, 2013; Milovanov & Tervaniemi, 2011; Peretz et al., 2015).
We did not find working memory to significantly predict speech imitation abilities for either participant group. This is in line with previous research showing the limited utility of working memory capacity for predicting phonetic language abilities (Coumel et al., 2019; Li et al., 2022). However, our null results regarding working memory capacity do not match the results of several comparable studies that found working memory to be a significant predictor of the ability to imitate unfamiliar languages (Christiner & Reiterer, 2018; Christiner et al., 2018, 2022). In our view, there are two possible explanations for this inconsistency. First, while participants in some of the previous work that highlighted the importance of working memory were young children (e.g., 5-year-olds in Christiner & Reiterer, 2018; 9-to-10-year-olds in Christiner et al., 2018), our participants were adolescents and young adults. The role played by working memory might conceivably be more evident in younger children than in older individuals. Second, our target sentences in the speech imitation task were not long and did not vary a great deal in length, with a mean syllable count of 8.5. The mean syllable count was close to the working memory scores of both groups of participants (Catalan speakers: 7.01 and Chinese speakers: 7.99). This suggests that working memory may not play a significant role when the target sentence length in the speech imitation task is similar to the participants’ working memory span, as the demands of the imitation task do not exceed the participants’ working memory capacity. Future research may want to control for the phonological length factor and adapt the length of target stimuli so that it exceeds the working memory capacities of participants.
With respect to RQ2, our results contributed cross-linguistic data on which specific components of music perception abilities were predictors of phonetic language aptitude. Interestingly, the significant predictors for the two groups of participants were not the same. Specifically, the only predictive musical component of phonetic language aptitude for Chinese speakers was musical accent perception, while that for Catalan speakers was melody perception. In our view, this contrast can be explained by the differing prosodic nature of these two languages, Chinese being a tonal language and Catalan being an intonation language. On the one hand, since Chinese speakers already show excellent pitch perception skills, which can be equated to those of musicians (Bidelman et al., 2013), we expected that other music perception skills might be more discriminatory in this population. It thus makes sense that the accent component was more discriminatory in this population, since the strong–weak prominence contrast assessed by the PROMS accent subtest is less phonologically relevant for speakers of Chinese (Duanmu, 2007) than for speakers of stress languages like Catalan (Wheeler, 2005), where stress is an important feature of the phonology. It is therefore plausible that Chinese participants who were better at detecting strong–weak contrasts in music (i.e., the accent component) were also more sensitive to strong–weak contrasts in unfamiliar speech and thus reproduced prominence differences better in their imitations. On the other hand, for Catalan participants, we would expect that musical components like accent differences, which are phonologically relevant in this intonation language, would be less discriminatory in predicting speech imitation abilities. This was borne out by our results: Catalan participants who discriminated better between the melodies of different musical phrases, as shown by the melody component in PROMS, were better at imitating unfamiliar speech. This ability may not have been as crucial for Chinese speakers, who are already trained to detect subtle melodic and pitch changes in their language. Though Catalan is an intonation-based language, it is not sentence-level pitch changes that are discriminatory but rather smaller-scale pitch accentual contrasts (Prieto et al., 2015). Catalan speakers are thus not experts in detecting fine-grained intonational differences at the sentence level; rather, their phonological expertise lies in detecting changes in pitch, duration and intensity in very specific parts of the contour (i.e., pitch accents and boundary tones). These results suggest a skill transfer from the specific prosodic patterns of the L1 to the ability to detect those contrasts in musical phrases. In other words, the prosodic phonological abilities that are not specifically trained in the L1 appear to be the most predictive of speech imitation abilities.
Following up on these findings, our results add new evidence to previous studies on the specific role of language background and musical aptitude skills in the prediction of phonetic language abilities. In our study, melody and accent appear as the significant predictors. Yet since very few previous studies have included the perception abilities specifically related to accent and melody as components in their musical skills tests, we cannot make direct comparisons with other research. The small number of studies that have considered these components have focused on speech perception (Delogu et al., 2010), L2 intonation training (Yuan et al., 2019), and the production of challenging L2 sounds (Dolman & Spring, 2014). The only study with a cross-linguistic design was Christiner et al. (2018), which showed that the predictive role of specific music components depends on the typology of the target languages being imitated, but that design did not vary the speakers’ L1 backgrounds. Our study thus provides new cross-linguistic evidence suggesting that speakers of different L1 backgrounds may be positioned differently with respect to the role of the various musical aptitude components in phonetic language abilities.
The present study suffers from several limitations. First, our measures of musical aptitude were based on perceptive abilities only. In the future, it would be of interest to use measures of productive abilities to look for links between music and language by comparing, for example, singing skills with speech across language typologies. Second, phonetic language abilities in our study were assessed in terms of participants’ ability to imitate languages with which they were unfamiliar. It would be worthwhile to replicate the current study contrasting unfamiliar and familiar languages, or an L2 that the participants are learning. Doing so might yield results that would be of considerable utility to the field of second language acquisition. Finally, it is worth noting that, owing to limited human resources, we recruited more Chinese speakers (N = 144) than Catalan speakers (N = 61), and the two groups of participants differed in age as well (Chinese = 13.93 and Catalan = 19.7). Although both groups consisted of young individuals, the differences in age and number of participants may have influenced the results, particularly because adolescence is a crucial age for cognitive development (Müllensiefen et al., 2022). Future studies may want to replicate the current study with groups of participants that are more comparable in sample size, age and gender.
To conclude, the results of the present study constitute new cross-linguistic evidence that music and speech share common processes in the brain. More specifically, our findings show that the ability of specific components of musical perceptive aptitude to predict an individual’s ability to imitate unfamiliar languages may be modulated by the prosodic specificities of the individual’s native language, a finding that is potentially of considerable relevance to L2 pronunciation teaching and learning practices.
Data availability statement
The datasets and R scripts for doing the analyses are available at OSF via the following link: https://osf.io/he2am/.
Acknowledgments
We acknowledge that part of the data from the Catalan speakers was collected by Mr. Xianqiang Fu at Universitat Pompeu Fabra. We sincerely thank the students at the Department of Translation and Language Sciences, Universitat Pompeu Fabra, and the students at Zhangqiu Experimental School (Jinan, China) who voluntarily participated in this study.
Competing interest
The authors declare that no competing interests exist.
Funding
This study is funded by ‘Multimodal Communication: The integration of prosody and gesture in human communication and in language learning’ (PID2021-123823NB-I00) awarded by the Ministerio de Ciencia e Innovación and ‘Multimodal language learning: Prosodic and Gestural Integration in Pragmatic and Phonological Development’ (PGC2018-097007-B-I00), awarded by the Ministerio de Ciencia, Innovación y Universidades, Agencia Estatal de Investigación, and Fondo Europeo de Desarrollo Regional. P.L. is supported by the Research Council of Norway through its Centres of Excellence funding scheme (223265). F.B. acknowledges a Margarita Salas grant funded by the European Union-NextGenerationEU, Ministry of Universities and Recovery, Transformation and Resilience Plan, through a call from Pompeu Fabra University.