Introduction
A growing body of research has shown that bilinguals tend to have advantages in learning additional languages compared to monolinguals (Abu-Rabia & Sanitsky, 2010; Hirosh & Degani, 2018), with regard to both language-general proficiency (Swain et al., 1990) and language-specific skills (Klein, 1995). Nevertheless, in the domain of non-native phonological/phonetic acquisition, studies on the influence of bilingualism have yielded mixed results (Antoniou et al., 2015; Elvin et al., 2018; Escudero et al., 2016), and the issue therefore requires further exploration. The current study contributes to this line of research by examining the influence of bilingualism on non-native phonetic learning through the lens of bidialectalism (i.e., speaking a dialect besides a standard language) in the context of non-native speech production. Compared with the large body of research on bilingualism in foreign language learning, relatively little attention has been paid to the influence of bidialectalism on non-native phonetic learning, especially in terms of how bidialectalism interacts with cross-language similarities/difficulties of the phonetic targets in non-native speech production. In the present study, we focused on the comparison between bidialectal Shanghai-Mandarin Chinese speakers and monodialectal Mandarin Chinese speakers when producing American English vowels that were judged to be either easy or difficult for Chinese learners of English, depending on the cross-linguistic relationships between English and Chinese vowels.
Bilingualism effects
Bilingualism refers to one’s ability to understand, speak, and frequently use two languages (Luk & Bialystok, 2013). So far, there is no consensus as to the influence of bilingualism on non-native phonetic learning, probably because compared with lexical and grammatical aspects of language learning, phonetic learning in a foreign language is more complicated due to the complexities and difficulties in learning non-native speech contrasts (Colantoni et al., 2015). Some studies have shown that bilinguals have advantages over monolinguals in learning non-native speech sounds. For example, Cohen et al. (1967) found that bilinguals generally were more accurate than monolinguals in producing non-native phoneme sequences. Similarly, Enomoto (1994) found that bilinguals outperformed monolinguals in perceptually differentiating between Japanese phonemic contrasts. Recent studies such as Singh et al. (2016) also found a bilingual advantage in integrating lexical tones into novel word learning.
Nevertheless, some studies have failed to identify a consistent advantage for bilinguals in distinguishing non-native speech contrasts. For example, Werker (1986) found no significant difference between bilinguals (L1 English with different L2 backgrounds) and English monolinguals in their ability to differentiate between Hindi retroflex and dental contrasts, as well as velar and uvular contrasts. Similarly, Patihis et al. (2015) found that Spanish–English bilingual individuals were no better than English monolinguals and worse than Armenian–English bilinguals in discrimination of L3 Korean stop consonants. Escudero et al. (2013) also found that bilingualism (L1 Spanish–L2 English) did not help the learning of L3 Dutch vowels. In addition, Kopečková (2016) found that the bilingual advantage (L1 German–L2 English) in L3 learning of Spanish rhotic sounds is not broad-based; rather, it is subject to the difficulty and learnability of the non-native phonetic features. This finding echoes the results in Antoniou et al. (2015) where the bilingual advantage did not apply to learning all non-native speech contrasts: the advantage was more obvious when the target foreign sound contrasts were easy (e.g., retroflex); for difficult contrasts (e.g., lenition), the bilingual advantage was not sufficient, because other factors such as phonetic similarity between languages also played a significant role. Similarly, Escudero et al. (2016) found an overall advantage for Singaporean English–Mandarin bilinguals when learning CVC words that formed non-minimal pairs but no specific advantages for vowel minimal pairs compared with Australian English monolinguals.
These mixed findings collectively indicate that the impact of bilingualism on phonetic learning in a foreign language may be influenced by the acoustic properties of the non-native speech sounds in relation to the learner’s native language. Additionally, they suggest that certain speech sounds may pose universal challenges in learning, irrespective of one’s linguistic background (i.e., bilingual or not) (Antoniou et al., 2015). Hence, these studies suggest a need to further investigate how the relations between L1 and L2 acoustic properties modulate the effect of bilingualism on learning non-native sounds.
Bidialectalism effects
Bidialectals are those who can fluently speak a standard language and a regional dialect. The existing literature mainly focuses on the relations between bidialectalism and executive functions, and has so far presented mixed results. Some studies have reported a potential advantage for bidialectals. For example, Antoniou et al. (2016) found that bidialectals (Cypriot and Standard Modern Greek) were similar to bilinguals and outperformed monolinguals in working memory and inhibitory control tasks. Some studies also suggest that bidialectalism may specifically impact certain aspects of executive functions. For example, Blom et al. (2017) found that Limburgish-Dutch bidialectal children were significantly different from monolingual Dutch children in a selective attention task but not in a flanker task. Similarly, Oschwald et al. (2018) only found a positive relation between bidialectalism and working memory but failed to find such an association in other measures of executive functions. Furthermore, the frequency of language usage may also play a role. For example, Poarch et al. (2019) found that bidialectal language usage patterns can influence the relations between bidialectalism and executive functions: those who used the nonstandard dialect more frequently had better executive control skills than monolinguals. On the other hand, some studies have failed to discover cognitive advantages for bidialectals. For instance, in Ross and Melinger (2017), no significant differences were found between bidialectal and monolingual children in inhibitory control and shifting tasks.
In studies where bidialectal participants were older adults, results have shown that bidialectals were similar to monolinguals in executive control tasks (Kirk et al., 2014; Scaltritti et al., 2017).
In the domain of speech acquisition, the impact of bidialectalism on non-native phonetic learning still calls for more research efforts. The existing studies are mainly focused on speech perception, with the results suggesting that dialectal differences can significantly affect one’s accuracy in perception of non-native vowels. For example, Escudero and Williams (2012) compared Peruvian Spanish (PS) and Iberian Spanish (IS) learners regarding non-native Dutch vowel discrimination. They found that IS learners were better than PS learners at differentiating between the Dutch vowel contrasts. The results suggest that acoustic characteristics of vowels of one’s native language or dialect have a direct impact on L2 vowel perception. Similarly, Escudero et al. (2012) found that non-native speech perception was significantly influenced by regional/dialectal differences in the listener’s L1. Specifically, they compared native speakers of North Holland Dutch with those of Flemish Dutch in terms of their perception of the English vowel contrast (/ɛ/ vs. /æ/). The results showed that the dialectal differences in vowel production by the two groups of speakers led to different vowel categorization responses. Some studies have also demonstrated the impact of a possible activation switch between different modes of languages/dialects on learning an additional language. For instance, Williams and Escudero (2014a) compared Northern and Southern British listeners in their perceptual categorization of non-native Dutch vowels. Interestingly, they found that the Northern listeners’ categorization of Dutch vowels was influenced by their knowledge about the acoustic patterns of the standard Southern British vowels, possibly due to the activation of the Southern British English mode of speech perception during the laboratory testing session.
The present study
The above literature review on phonetic learning and bilingualism/bidialectalism is mainly centered around the research question of whether knowing a second language/dialect would benefit phonetic learning of a third language, that is, does knowing one more language/dialect lead to an advantage in learning speech sounds of a new language? The mixed results of previous studies as reviewed above suggest that the answer should take into account the cross-linguistic influences between native and non-native sounds. Specifically, the acoustic characteristics and learning difficulty of the non-native speech sounds in relation to the native sound system could play a significant role in determining how bilingualism/bidialectalism influences non-native speech learning (Antoniou et al., 2015; Elvin et al., 2018; Escudero et al., 2013, 2016; Kopečková, 2016). For adult learners, the acquisition of sounds in a new language is usually influenced by the learner’s experience with speech sounds in previously acquired languages.
Indeed, well-established models of L2 perception/production such as the Second Language Linguistic Perception (L2LP) model (Escudero, 2005, 2009; Escudero & Yazawa, in press; van Leussen & Escudero, 2015) state that the acquisition of non-native speech sounds is related to the influence of L1. In the context of the present study, the L2LP model is suitable because it applies to both monolingual and bilingual/bidialectal learners (Escudero et al., 2016; Escudero & Yazawa, in press). Furthermore, L2LP strives to comprehensively model the whole developmental trajectory in non-native speech learning, spanning from novice to advanced learners (for more details see Escudero, 2005, 2009; Escudero & Yazawa, in press; van Leussen & Escudero, 2015), and is thus suitable for the present study where participants were learners with prior exposure to the target non-native speech sounds. Particularly, the L2LP model can provide explanations as to why, despite years of dedicated efforts, the ultimate mastery of L2 production and perception may not be fully attained, due to the activation of L1 for sequential bilinguals (i.e., L2 learners) whose onset of L2 learning is after early childhood (Escudero & Yazawa, in press).
Since the L2LP model has the word “perception” in it, one may wonder whether it is appropriate to use this model to explain non-native speech production. Admittedly, the L2LP model was initially developed for speech perception, but it has been extended to explain speech production and lexical development (e.g., Elvin et al., 2016, 2020; Elvin & Escudero, 2019; Escudero et al., 2022; Escudero & Yazawa, in press; van Leussen & Escudero, 2015; Yazawa et al., 2020, 2023). Crucially, other models of L2 speech do not make explicit predictions about the possible shifts between different language or dialect modes for bilingual or bidialectal speakers because they assume a single phonetic space for an L2 learner’s two languages (see Colantoni et al., 2015 for a thorough comparison between L2 speech models). In contrast, the L2LP model explicitly predicts that bilinguals and bidialectals have two separate systems readily accessible, including separate perception and production grammars (Escudero, 2005, 2009; Escudero & Yazawa, in press; van Leussen & Escudero, 2015; Yazawa et al., 2020, 2023). Therefore, when bilinguals and bidialectals learn an additional language (L3, L4, etc.), the speech sounds of the additional language could be mapped to either their first or second language or dialect (Escudero et al., 2013; Williams & Escudero, 2014a).
This further suggests that bilinguals and bidialectals may switch between different modes when learning an additional language because their separate systems could be activated selectively (Escudero, 2005, 2009; Escudero & Yazawa, in press; Williams & Escudero, 2014a). Whether this holds true for bidialectals’ non-native speech production remains to be explored.
In addition, one may wonder why bidialectalism is worth examination. First, compared with bilingualism, the effect of bidialectalism is largely under-recognized and undervalued, leaving much room for future research (Antoniou et al., 2016; Oschwald et al., 2018; Poarch et al., 2019). Second, bidialectals could be different from bilinguals because of the “ubiquitous usage of both dialects in their environment compared to bilinguals who may display a more compartmentalized language usage pattern” (Poarch et al., 2019: 613). Examining bidialectals may therefore reveal how frequency of language use affects the learning of a subsequent language in bilinguals versus bidialectals, which is currently not well understood (cf. Antoniou et al., 2016; Oschwald et al., 2018). Research of this kind could also make people appreciate the effect of bidialectalism, which could further contribute to people’s understanding of their own identity as well as how dialects could relate to learning a foreign language (Antoniou et al., 2016).
The above review suggests that it is still not clear how bilingualism and bidialectalism could influence individuals acquiring sound categories in a new language. Particularly, the impact of bidialectalism on non-native speech production remains largely unexplored. It is important to address this issue, as a substantial portion (approximately 50% to 70%) of the global population possesses proficiency in multiple languages or dialects (Grosjean, 2021). This percentage further increases in regions where multiple dialects are prevalent. However, there is a prevailing issue in English as a Second Language research that inaccurately portrays English learners as monolingual speakers, thus failing to represent the reality of English learners worldwide (Leivada et al., 2023). Therefore, the present study is among the few that confront this bias through a comparative study between Chinese speakers in Beijing, where the majority are monodialectal in Mandarin Chinese, and Chinese speakers in Shanghai, where individuals use two dialects (Mandarin and Shanghai Chinese) in their daily life. Specifically, this study examines the production of American English vowels by monodialectal Mandarin Chinese speakers compared with bidialectal Shanghai-Mandarin Chinese speakers. The choice of American English aligns with the current English teaching environment in China, where American English is the dominant target L2 variety.
Mandarin Chinese is the official standard language of China, while Shanghai Chinese is mainly spoken in the city of Shanghai. Shanghai Chinese belongs to the Wu family of Chinese dialects. As noted in Chao (1967), Chinese dialects are “primarily different in phonology, secondarily in lexicon and least in grammatical structure” (pp. 92–93). In terms of phonology, one of the most prominent distinctions between Shanghai and Mandarin Chinese is that Shanghai Chinese has a larger vowel inventory containing 15 monophthongs (6 monophthongs also found in Mandarin: /i, y, a, u, ɤ, ə/ and 9 only found in Shanghai Chinese: /ɛ, ø, o, ɔ, ɪ, ʏ, ɐ, ʊ, ɑ/) and 8 diphthongs (3 diphthongs also found in Mandarin Chinese: /ia, ie, ua/ and 5 only found in Shanghai Chinese: /iɔ, iɤ, ue, uø, yø/) (Chen, 2008; Chen & Gussenhoven, 2015; Yu et al., 2004), while Mandarin Chinese has a smaller vowel inventory containing 6 monophthongs (/i, y, a, u, ɤ, ə/) and 11 diphthongs (/ai, au, ou, uo, ei, ye, ie, ia, ua, uə, iu/) (Lee & Zee, 2003). In addition, only Shanghai Chinese contains short vowels such as [ɪ] and [ʊ], which sound similar to (but are not exactly the same as) the English [ɪ] and [ʊ] vowels, as detailed in the next paragraph. These “short” vowels only occur in closed syllables that end with a glottal stop coda in Shanghai Chinese (Chen, 2008), while Mandarin does not have “short” vowels because the oral stop coda has been lost historically, resulting in vowel length variation being more relevant for Shanghai Chinese than for Mandarin. Therefore, the contrast between the vowel inventory of Mandarin Chinese and Shanghai Chinese makes an ideal test case for our study.
As reviewed, how bilingualism and bidialectalism influence non-native speech learning could be related to the learning difficulty of the non-native speech sounds. Therefore, in the present study, the target American English vowels were classified into two categories of difficulty for Chinese speakers: easy and difficult. Based on previous research (Chen et al., 2001; Jia et al., 2006), the easy American English vowels chosen for the present study were [i] and [u] because: (a) they are found in Chinese (Shanghai and Mandarin) and English, and (b) Chinese speakers produce these two English vowels with high accuracy (Jia et al., 2006). The difficult American English vowels chosen for the present study were [ɪ] and [ʊ] because: (a) they are unfamiliar to Mandarin speakers (Chen et al., 2001), and (b) Mandarin speakers produce these vowels differently from native American English speakers (Chen et al., 2001; Jia et al., 2006). For Shanghai Chinese speakers, the two vowels could also be difficult because according to the L2LP model, bidialectals could map the incoming non-native speech targets to either their first or second language/dialect (Escudero et al., 2013; Williams & Escudero, 2014a). This means that bidialectals such as Shanghai-Mandarin Chinese speakers could map the American English [ɪ] and [ʊ] to either the [ɪ] and [ʊ] in Shanghai Chinese or the [i] and [u] in Mandarin Chinese, which as a result could interfere with the effective establishment of the target non-native sounds.
In sum, the present study is concerned with how bidialectalism interacts with cross-language difficulties of the phonetic targets in non-native speech acquisition. We aimed at answering the following research questions: (1) Do Shanghai-Mandarin Chinese bidialectal speakers differ from monodialectal Mandarin Chinese speakers in their production of easy and difficult American English vowels? If so, in which acoustic dimension do the two groups differ, vowel formants or duration? (2) How does the vowel system of the participants’ Chinese dialects influence their production of the non-native English vowels? The two groups of Chinese participants were asked to produce the target American English vowels [i], [ɪ], [u], [ʊ], and their native Mandarin Chinese [i] and [u] vowels; additionally, the bidialectal Shanghai-Mandarin speakers were also asked to produce their native Shanghai Chinese [ɪ] and [ʊ] vowels.
Based on previous research where the bilingual advantage was more evident in learning easy non-native speech sounds (Antoniou et al., 2015), we hypothesize that bidialectal Shanghai-Mandarin Chinese speakers could outperform monodialectal Mandarin Chinese speakers in accurately producing the easy English vowels [i] and [u], which will be reflected in smaller formant and durational differences from American English speakers’ production. The bidialectal advantage of Shanghai-Mandarin Chinese speakers in non-native English vowel production may become less apparent for the difficult English vowels [ɪ] and [ʊ] in certain acoustic aspects due to the influence of their native languages. Specifically, given that Shanghai Chinese contains short vowels whereas Mandarin does not, we speculate that the bidialectal Shanghai Chinese production of the difficult English vowels [ɪ] and [ʊ] may approach American speakers’ production more closely in duration than Mandarin Chinese speakers would do. Nevertheless, both groups (Shanghai and Mandarin Chinese) could be similarly deviant from American speakers in terms of the formants of the two difficult English vowels due to the influence of their native Chinese. Following the L2LP model’s proposal, the bidialectal speakers may switch between the two languages/dialects when learning an additional language, resulting in their mapping of the non-native English vowels to either Shanghai or Mandarin Chinese depending on the specific language mode they are in. This suggests that their production of the difficult English vowels could be closer to Mandarin vowels in formants and duration if they are in their Mandarin Chinese mode, or closer to Shanghai vowels if they are in their Shanghai Chinese mode.
Methods
Participants
Forty adult native Chinese speakers (20 females and 20 males, aged between 19 and 26 years) without hearing or speech impairments participated in the present study. Twenty of them (10 females and 10 males) were monodialectal speakers of Mandarin Chinese; the other 20 of them (10 females and 10 males) were bidialectal speakers of Shanghai and Mandarin Chinese, that is, they were proficient in both Shanghai dialect and Mandarin Chinese and used the two language varieties on a daily basis. Specifically, the participants in the monodialectal group grew up in Beijing where only Mandarin Chinese is used in daily life. They came to Shanghai for higher education but could not understand or speak the Shanghai dialect at the time of the experiment nor could they speak any other Chinese dialects. The participants in the bidialectal group grew up in Shanghai, with daily exposure to and frequent usage of both Shanghai and Mandarin Chinese. Participants completed a language background survey where they rated their language proficiency (i.e., daily usage of and lifetime exposure to the target language) on a scale of 1–5 (1= not familiar; 2 = familiar; 3 = fair; 4 = proficient; 5= very proficient). The monodialectal group’s average Mandarin proficiency was 4.9, while for the bidialectal group, the average proficiency in Mandarin and Shanghai Chinese was 4.85 and 4.9, respectively. Proficiency in Mandarin and Shanghai Chinese was comparable for the bidialectal group (i.e., the differences were nonsignificant [F(1,19) = 0.192, p = 0.67]), and Mandarin proficiency was comparable between the monodialectal and bidialectal groups [F(1,19) = 0.192, p = 0.67].
All Chinese participants had studied English as a foreign language in China for an average of 14 years, with no history of living in an English-speaking country for more than one month. They all reported speaking American English only. In the same language background survey (as mentioned in the previous paragraph), they were also asked to indicate how often they used English and Chinese (Mandarin for the monodialectal group; Mandarin and Shanghai Chinese for the bidialectal group) in their daily communication on a scale of 1–5 (1 = not at all; 2 = only occasionally; 3 = sometimes; 4 = frequently; 5 = very frequently). For English, the average score was 2.03, while for Chinese the average score was 4.73, and the difference was significant [F(1, 39) = 466.08, p < 0.001, ηp² = 0.92]. Therefore, Chinese was mainly used for their daily life and English was only used occasionally. Participants also indicated that when they spoke English, it was with their Chinese peers and teachers rather than with English native speakers. Six adult native speakers of General American English (three females and three males, mean age = 35) were recruited in the U.S. to produce the American English stimuli. They did not understand or speak any form of Chinese (Mandarin or Shanghai or other Chinese dialects) at all. The acoustic characteristics of the target English vowels produced by the six American speakers for the present study (detailed in Fig. 1) were consistent with previous studies on American English vowels (Figure 3 of Hillenbrand et al., 1995).
Stimuli
The American English stimuli included two target English words (deed, goose) containing the easy vowels ([i], [u]) and two English words (did, good) containing the difficult vowels ([ɪ], [ʊ]). The Chinese stimuli included two Chinese words (/di/ <brother 弟>, /gu/ <old 故>) containing two vowels ([i], [u]) in both Mandarin and Shanghai Chinese and two Chinese words (/tɪʔ/ <drop 滴>, /kʊʔ/ <country 国>) containing Shanghai Chinese vowels ([ɪ], [ʊ]). Filler items that were not analyzed were bird, bait, brown, dice, joy, gold, door, and fate for English and ren <people 人>, hua <flower 花>, xing <star 星>, lan <blue 蓝>, niao <bird 鸟>, xian <fresh 鲜>, jiu <wine 酒>, and nuan <warm 暖> for Chinese.
Procedure
The Chinese participants were asked to produce the English and Chinese speech stimuli (presented randomly on a screen) three times each; the American participants were asked to produce the English stimuli only, three times each. Their speech was recorded individually in a sound-attenuated booth using a Sudotack ST-800 High-Quality Cardioid Microphone connected to a MacBook (64 bit) computer. For the final acoustic analyses, there were (a) [4 (English stimuli) + 2 (Mandarin Chinese stimuli)] * 3 (repetitions) * 20 (Mandarin Chinese participants) = 360 tokens for the monodialectal Mandarin Chinese group, (b) [4 (English stimuli) + 2 (Mandarin Chinese stimuli) + 2 (Shanghai Chinese stimuli)] * 3 (repetitions) * 20 (Shanghai Chinese participants) = 480 tokens for the bidialectal Shanghai-Mandarin Chinese group, and (c) 4 (English stimuli) * 3 (repetitions) * 6 (American participants) = 72 tokens for the American English participants. Participants took a self-paced approach to produce the target stimuli, and they pressed the space key on the keyboard to proceed to the next trial. A 500-ms fixation cross was displayed on the screen between trials. The stimuli were presented randomly, and so the participants were not likely to know that their vowel production was the target of the study, which was confirmed by a post-experiment debrief where participants expressed they were not aware of the purpose of the experiment. The random presentation of vowel stimuli has been used in numerous studies on L2 speech learning (e.g., Baker & Trofimovich, 2006; Bundgaard-Nielsen et al., 2011; Munro & Derwing, 2008 among many others). However, this well-established approach might give rise to production errors due to the unpredictability of the presentation of the stimuli. Therefore, participants were allowed to self-correct speech errors they made during production.
Tokens that contained speech errors (approximately 1% of all the tokens) due to the possible priming effect of the random presentation of the vowel stimuli were subsequently excluded from the acoustic data analyses. Participants were allowed to take breaks at their discretion. The experiment lasted approximately 15 min.
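The token counts given in the Procedure follow from multiplying the number of stimulus words by the number of repetitions and participants. The short sketch below simply reproduces that arithmetic; the function and variable names are illustrative, and the overall total is our own sum of the three group counts rather than a figure reported in the text.

```python
# Designed token counts per group: stimulus words x repetitions x participants.
REPETITIONS = 3


def token_count(n_stimuli: int, n_participants: int, repetitions: int = REPETITIONS) -> int:
    """Return the number of tokens implied by the design for one group."""
    return n_stimuli * repetitions * n_participants


counts = {
    # 4 English words + 2 Mandarin words, 20 speakers
    "monodialectal Mandarin": token_count(4 + 2, 20),
    # 4 English words + 2 Mandarin words + 2 Shanghai words, 20 speakers
    "bidialectal Shanghai-Mandarin": token_count(4 + 2 + 2, 20),
    # 4 English words, 6 speakers
    "American English": token_count(4, 6),
}
total = sum(counts.values())  # derived here, not reported in the text

for group, n in counts.items():
    print(f"{group}: {n} tokens")
print(f"total: {total} tokens")
```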
Acoustic data analyses
We extracted the vowels of the stimuli for acoustic analyses. The vowel boundaries were determined manually by three phoneticians using Praat (Boersma & Weenink, 2020), based on the start and end points of the periodic waveform of the vowels. The formant values of the vowels were taken as an average from the beginning to the end of the vowel boundaries. Another expert phonetician was invited to check all the vowel boundaries to ensure the labeling was correct. The corresponding duration of the vowels was measured in milliseconds (ms). In order to assess the extent to which the Chinese speakers’ production of the English vowels was different from that of native speakers of English, we examined the Euclidean distance between Chinese and American English speakers’ production of the English vowels. The use of Euclidean distance is a well-established method to quantify vowel distances across different language conditions in many previous studies (e.g., Chang, 2023; Mora & Nadeu, 2012; Recasens & Espinosa, 2006 among many others). Formant values (F1 and F2) were converted from Hertz to the Bark scale to normalize the intrinsic variation of different speakers’ vocal tract lengths (Clopper, 2009). Statistical analysis of the acoustic data was performed in R (Version 3.4.4; R Core Team, 2018) using the lme4 package (Bates et al., 2015).
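To make the distance measure concrete, the sketch below converts F1 and F2 from Hertz to Bark and computes the Euclidean distance between two vowel productions in the F1–F2 plane. It rests on assumptions that the text does not confirm: the Hz-to-Bark formula used here is the Traunmüller (1990) approximation (the exact conversion is not stated above), the comparison is drawn between one learner token and a native-speaker mean, and all formant values are hypothetical.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Convert a frequency in Hz to Bark.

    Uses the Traunmueller (1990) approximation; this choice is an
    assumption, since the conversion formula is not specified in the text.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def vowel_distance(f1_a: float, f2_a: float, f1_b: float, f2_b: float) -> float:
    """Euclidean distance between two vowels in the Bark-scaled F1-F2 plane."""
    d1 = hz_to_bark(f1_a) - hz_to_bark(f1_b)
    d2 = hz_to_bark(f2_a) - hz_to_bark(f2_b)
    return math.hypot(d1, d2)


# Hypothetical formant values in Hz, for illustration only (not study data).
learner_token = (300.0, 2300.0)   # (F1, F2) of one learner production
native_mean = (340.0, 2100.0)     # (F1, F2) averaged over native speakers

print(round(vowel_distance(*learner_token, *native_mean), 3))
```

A smaller distance indicates a production that lies closer to the native reference in the Bark-scaled vowel space.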
Results
Table 1A shows the means of the differences in vowel formants, as measured by the Euclidean distance between Chinese speakers’ (Shanghai and Mandarin) and native American speakers’ production of the easy and difficult English vowels. To address the first research question, the Euclidean distance data were submitted to a linear mixed-effects model (with “group” (Shanghai vs. Mandarin), “vowel type” (Easy vs. Difficult), and their interaction as the fixed effects, and “participants” and “items” as random effects). The results (Table 2A) showed significant effects of group and vowel type, as well as a significant interaction between group and vowel type. Post hoc one-way ANOVA showed that in the condition of easy vowels, Shanghai speakers had significantly smaller Euclidean distance than Mandarin speakers [F (1, 38) = 9.43, p = 0.004]. In contrast, the difference in Euclidean distance between the two groups was not significant for difficult vowels [F (1, 38) = 0.003, p = 0.95].
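For readers who prefer an explicit formulation, the model described above can be written out as follows. This is a sketch under the assumption that "participants" and "items" entered as random intercepts only; the text does not specify the random-effects structure or the contrast coding, so the symbols $G_p$ (group of participant $p$) and $V_i$ (vowel type of item $i$) are illustrative.

```latex
y_{pi} = \beta_0 + \beta_1 G_p + \beta_2 V_i + \beta_3 \,(G_p \times V_i) + u_p + w_i + \varepsilon_{pi},
\qquad
u_p \sim \mathcal{N}(0, \sigma_u^2), \quad
w_i \sim \mathcal{N}(0, \sigma_w^2), \quad
\varepsilon_{pi} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{pi}$ is the Euclidean distance for participant $p$ on item $i$, $u_p$ and $w_i$ are the by-participant and by-item random intercepts, and $\beta_3$ captures the group by vowel type interaction. Under the same assumption, and with hypothetical column names, the corresponding lme4 formula would be `distance ~ group * vowel_type + (1 | participant) + (1 | item)`.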
Table 1B shows the means of the duration difference between Chinese (Shanghai and Mandarin) and American speakers’ production for easy versus difficult English vowels. The duration difference data were submitted to a linear mixed-effects model (with “group,” “vowel type” and their interaction as the fixed effects, and “participants” and “items” as random effects). The results of the mixed-effects model (Table 2B) showed significant effects of group and vowel type, as well as a significant interaction between group and vowel type. Post hoc one-way ANOVA showed that Shanghai speakers had significantly smaller duration difference from American speakers than Mandarin speakers for both the easy [F (1, 38) = 12.42, p = 0.001] and difficult [F (1, 38) = 19.73, p < 0.001] vowels.
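The post hoc comparisons reported above can be sketched as a one-way ANOVA computed from first principles. This is only an illustration: the analyses in the article were run in R, the helper name is hypothetical, and the toy values below are not the study's data. The reported degrees of freedom (1, 38) are consistent with two groups and 40 observations in total.

```python
from statistics import mean


def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Each argument is a sequence of observations for one group.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy per-speaker distances (hypothetical values, for illustration only).
shanghai = [1.0, 2.0, 3.0]
mandarin = [2.0, 3.0, 4.0]
f_stat, df_b, df_w = one_way_anova(shanghai, mandarin)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With two groups, this F statistic equals the square of the independent-samples t statistic, so the comparison is equivalent to a two-sample t-test with pooled variance.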
Plots of the easy ([i], [u]) and difficult ([ɪ], [ʊ]) English vowels produced by Chinese and American speakers are presented in Fig. 1. The plots also include Shanghai and Mandarin speakers’ production of the Mandarin Chinese vowels [i] and [u] (which are also found in Shanghai Chinese). In addition, Fig. 1c shows Shanghai speakers’ production of the Shanghai Chinese vowels [ɪ] and [ʊ]. Fig. 2 further compares participants’ productions of their Chinese vowels with their English vowels. It shows scatterplots of Shanghai Chinese and Mandarin Chinese speakers’ production of the American English vowels ([i], [u], [ɪ], [ʊ]), Mandarin Chinese vowels ([i], [u]), and Shanghai Chinese vowels ([ɪ], [ʊ]).
To address research question 2, we first examined the acoustics of the Chinese vowels produced by Shanghai and Mandarin Chinese speakers in each dialect, detailed in Table 3. The results suggest that for Mandarin Chinese [i], Shanghai Chinese speakers had a lower F1 and F2 than Mandarin Chinese speakers. For Mandarin Chinese [u], Shanghai Chinese speakers had a higher F1 and lower F2 than Mandarin Chinese speakers. In terms of duration, Shanghai Chinese speakers produced both Mandarin Chinese vowels with shorter durations than Mandarin Chinese speakers did. In addition, Shanghai Chinese [ɪ] and [ʊ] had higher F1, lower F2, and shorter duration than Mandarin Chinese [i] and [u], respectively.
Hence, the above results demonstrate acoustic differences between Shanghai and Mandarin Chinese vowels, which could influence Shanghai Chinese speakers’ production of Mandarin and English vowels. To gain a deeper understanding of how native Chinese dialect vowel systems affect the production of non-native English vowels among Chinese speakers, we further investigated whether Chinese speakers’ production of English vowels is more similar to the production of Chinese vowels in their respective Chinese dialect or to that of American English speakers. We calculated the Euclidean distance and duration difference (ED1 and Dur1) between the Mandarin Chinese and English vowels produced by Chinese speakers, as well as the Euclidean distance and duration difference (ED2 and Dur2) between Chinese and American speakers’ production of the English vowels. The data for the easy and difficult vowel conditions were analyzed separately for the Mandarin and Shanghai Chinese groups, and the p values were Bonferroni-corrected. The means are presented in Table 4.
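The Bonferroni correction mentioned above can be sketched as follows. The number of comparisons and the p values are hypothetical, since the text does not list them at this point, and the function name is an illustrative choice rather than part of the authors' analysis scripts.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p values.

    Each raw p value is multiplied by the number of comparisons and
    capped at 1.0, so the family-wise error rate stays at the nominal level.
    """
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical raw p values for a family of three comparisons.
raw_p = [0.01, 0.04, 0.5]
adjusted_p = bonferroni_adjust(raw_p)
for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}")
```

An equivalent formulation leaves the p values untouched and instead divides the significance threshold by the number of comparisons.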
For the easy English vowels, the results [Table 5(I)] showed that ED1 and Dur1 were overall smaller than ED2 and Dur2, respectively, for both Chinese groups (Shanghai and Mandarin). Specifically, for Mandarin speakers, ED1 was significantly smaller than ED2, and similarly, Dur1 was significantly smaller than Dur2. For Shanghai speakers, the differences were not significant in either Euclidean distance or duration. For the difficult English vowels, the results [Table 5(II)] showed that ED1 was significantly smaller than ED2 for both Shanghai and Mandarin Chinese speakers. Similarly, Dur1 was significantly smaller than Dur2 for both Shanghai and Mandarin Chinese speakers. Additionally, for Shanghai Chinese speakers, we computed the Euclidean distance (ED3) and duration difference (Dur3) between their production of the two Shanghai Chinese vowels (which sound similar to the difficult English vowels) and their production of the two difficult English vowels. ED3 and Dur3 were compared with ED1 and Dur1, respectively, to examine whether Shanghai participants’ production of the difficult English vowels was influenced more by Mandarin or Shanghai Chinese. The results [Table 5(II)] showed that ED1 was significantly smaller than ED3, and similarly, Dur1 was significantly smaller than Dur3, indicating that Shanghai participants’ production of the difficult English vowels was closer to the corresponding vowels in Mandarin than to those in Shanghai Chinese.
Discussion
The present study examined how bidialectalism influences non-native speech production. In particular, we compared monodialectal Mandarin Chinese speakers with bidialectal Shanghai-Mandarin Chinese speakers in terms of their production of non-native American English vowels classified into two categories of difficulty for Chinese learners of English: easy ([i], [u]) and difficult ([ɪ], [ʊ]). We found that for easy English vowels, Shanghai Chinese speakers came closer to native English speakers than Mandarin Chinese speakers did with regard to both vowel formants and duration. For difficult English vowels, Shanghai Chinese speakers were better than Mandarin Chinese speakers in vowel duration but not in vowel formants.
The results suggest that overall, there is a bidialectal advantage for Shanghai Chinese speakers in producing the easy English vowels, but that advantage becomes less apparent for the difficult English vowels, particularly in terms of formant frequencies. The results are in line with the proposal that the bilingual advantage is not broad-based; rather, it is modulated by the degree of difficulty and learnability of the target sounds (Antoniou et al., Reference Antoniou, Liang, Ettlinger and Wong2015; Elvin et al., Reference Elvin, Tuninetti and Escudero2018; Escudero et al., Reference Escudero, Mulak, Fu and Singh2016; Kopečková, Reference Kopečková2016). When the target non-native sounds are “easy,” bilingualism could play a positive role in enhancing learning, whereas for learning “difficult” non-native target sounds, bilingualism may not be sufficient to yield high accuracy. The present study extends this proposal to the effect of bidialectalism on non-native speech production.
One may argue that the classification of sounds in a specific dialect/language is arbitrary. However, it is important to recognize that this arbitrariness could lead to differences in the acoustic mappings of sounds between one’s native language (L1) and the target second language (L2). These differences contribute to varying levels of difficulty and learnability when acquiring non-native speech sounds. The existence of well-known models, such as the Second Language Linguistic Perception model (L2LP; Escudero, Reference Escudero2005, Reference Escudero, Boersma and Hamann2009; Escudero & Yazawa, Reference Escudero, Yazawa and Amengualin press), further supports the notion that the classification of sounds based on dialect/language plays a crucial role in understanding the learnability and difficulty of speech sounds. This model recognizes and explains the challenges faced by learners in perceiving and producing non-native sounds due to the acoustic and phonetic differences between their native language and the target language. Therefore, despite the arbitrary nature of sound classification, it is crucial to consider the impact of acoustic mappings and differences between L1 and the target L2 on the learnability and difficulty of non-native speech sounds. These considerations are essential for establishing theoretical frameworks such as L2LP that can explain and interpret the findings in the context of language acquisition and perception.
Since Shanghai Chinese has the short vowels [ɪ] and [ʊ] that sound similar to the difficult English vowels [ɪ] and [ʊ], and since these two short vowels in Shanghai Chinese are rather different from Mandarin Chinese [i] and [u], respectively, as detailed in Table 3, one may wonder why Shanghai Chinese speakers did not perform better than Mandarin Chinese speakers in terms of formant frequency accuracy of the two difficult English vowels. A possible explanation is that the bidialectals are fully proficient in two varieties of the same language. According to the L2LP model, they could use either language variety when producing vowels in an additional language. Thus, they may have resorted to their knowledge of Mandarin Chinese when trying to produce the difficult English vowels, as evidenced by the smaller Euclidean distance from Mandarin Chinese vowels (ED1) than from Shanghai Chinese vowels (ED3). This finding echoes Williams and Escudero’s (Reference Williams and Escudero2014a) results, where Northern British listeners’ categorization of Dutch vowels was influenced by their knowledge about acoustic patterns of the Standard Southern British English (SSBE) vowels. One of the reasons could be that SSBE is prevalent in British media and education, which means Northern British listeners are regularly exposed to SSBE, even though they may not produce English vowels in a way similar to Southern British speakers (Stuart-Smith, Reference Stuart-Smith, Llamas, Mullany and Stockwell2007). Such regular exposure may lead Northern listeners to expect to hear SSBE frequently in daily life, especially in a formal setting such as a university laboratory, resulting in the activation of their SSBE mode of speech perception (Williams & Escudero, Reference Williams and Escudero2014a).
This further suggests that speech perception is highly dynamic and is often modulated by one’s expectations and linguistic experiences in different contexts (Drager, Reference Drager2010).
The L2LP model (Escudero, Reference Escudero2005, Reference Escudero, Boersma and Hamann2009; Escudero & Yazawa, Reference Escudero, Yazawa and Amengualin press; van Leussen & Escudero, Reference van Leussen and Escudero2015; Yazawa et al., Reference Yazawa, Whang, Kondo and Escudero2020), which applies to both monolingual and bilingual/bidialectal learners, not only posits that monolinguals tend to perceive non-native sounds according to their native phonological categories but also that bilinguals may switch between different language modes when learning, listening to, and speaking in an additional language. More particularly, listeners’ knowledge of how to process different dialects or languages is stored in separate perception grammars, each of which could be activated according to the specific language mode the bilinguals are in (Escudero, Reference Escudero2005, Reference Escudero, Boersma and Hamann2009; Escudero & Yazawa, Reference Escudero, Yazawa and Amengualin press; Yazawa et al., Reference Yazawa, Whang, Kondo and Escudero2020). Such activation, as a result, could serve to map the incoming non-native speech sounds to either their native or non-native language/dialect (Williams & Escudero, Reference Williams and Escudero2014a). As mentioned above, the Shanghai-Mandarin Chinese speakers were fully functional bidialectals, that is, they were proficient in both Shanghai and standard Mandarin Chinese and used these two language varieties on a daily basis. Therefore, both Shanghai and Mandarin Chinese are readily accessible for them as a reference to map onto the incoming non-native English vowels. 
Moreover, given the predominant status of Mandarin Chinese in media and education all over China, and given that the participants in the present study were students at a Chinese university where the medium of instruction is Mandarin Chinese, it is likely that such frequent exposure to the standard official language resulted in the Shanghai-Mandarin Chinese participants’ activation of their Mandarin Chinese mode when trying to produce the difficult English vowels. This is similar to Williams and Escudero (Reference Williams and Escudero2014a), where Northern British listeners relied on their knowledge of SSBE in perceiving non-native Dutch vowels due to the ubiquity of the standard language in media and education. The present findings can thus be seen as an extension of the L2LP model to the domain of bidialectal non-native speech production, that is, bidialectal speakers can also switch between different modes to map the incoming non-native speech sounds to either their native or non-native language/dialect in speech production.
In terms of vowel duration, Shanghai Chinese speakers’ production of the two difficult English vowels was closer to Mandarin vowels than to Shanghai Chinese vowels, as evidenced by the result that Dur1 was significantly smaller than Dur3, which again suggests that Shanghai speakers could be in their Mandarin Chinese mode when producing those difficult English vowels. This is an interesting result because Shanghai speakers also performed better than Mandarin speakers in producing the difficult English vowels in terms of duration, as Shanghai speakers’ Dur2 was smaller than Mandarin speakers’ Dur2. Together, these results suggest that even though Shanghai speakers seemed to have been in their Mandarin mode, their production of the difficult English vowels was better than that of Mandarin speakers in terms of duration. This could be due to Shanghai Chinese speakers’ Mandarin vowels being shorter than those of Mandarin Chinese speakers (SH Mandarin Chinese [i]: 205.93 ms; MN Mandarin Chinese [i]: 207.96 ms; SH Mandarin Chinese [u]: 189.39 ms; MN Mandarin Chinese [u]: 191.2 ms; see Table 3), which in turn may stem from the existence of short vowels in Shanghai Chinese with shorter durations than Mandarin Chinese vowels (SH Shanghai Chinese [ɪ]: 197.39 ms; MN Mandarin Chinese [i]: 207.96 ms; SH Shanghai Chinese [ʊ]: 187.72 ms; MN Mandarin Chinese [u]: 191.2 ms; see Table 3). This could provide Shanghai speakers with an advantage in producing the short English vowels [ɪ] and [ʊ] even when they are in their Mandarin Chinese mode, which would explain their higher durational accuracy for the difficult English vowels.
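The size of the duration gaps quoted above can be made explicit with a short calculation. The sketch below uses only the mean durations given in the text (from Table 3); the variable and function names are hypothetical, and the computation is purely descriptive and says nothing about statistical significance.

```python
# Mean vowel durations in ms, as quoted in the text (see Table 3).
# Keys are (speaker group, vowel); SH = Shanghai, MN = Mandarin speakers.
durations_ms = {
    ("SH", "Mandarin [i]"): 205.93,
    ("MN", "Mandarin [i]"): 207.96,
    ("SH", "Mandarin [u]"): 189.39,
    ("MN", "Mandarin [u]"): 191.2,
    ("SH", "Shanghai [ɪ]"): 197.39,
    ("SH", "Shanghai [ʊ]"): 187.72,
}


def gap_ms(longer, shorter):
    """Difference in mean duration (ms) between two entries, rounded to 2 dp."""
    return round(durations_ms[longer] - durations_ms[shorter], 2)


# Mandarin vowels: Mandarin speakers versus Shanghai speakers.
print(gap_ms(("MN", "Mandarin [i]"), ("SH", "Mandarin [i]")))  # 2.03
print(gap_ms(("MN", "Mandarin [u]"), ("SH", "Mandarin [u]")))  # 1.81

# Mandarin speakers' Mandarin vowels versus Shanghai speakers' Shanghai vowels.
print(gap_ms(("MN", "Mandarin [i]"), ("SH", "Shanghai [ɪ]")))  # 10.57
print(gap_ms(("MN", "Mandarin [u]"), ("SH", "Shanghai [ʊ]")))  # 3.48
```

On these means, all four gaps point in the same direction, with Shanghai speakers' vowels shorter in each pairing.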
The results are reminiscent of the findings reported in Iverson and Evans (Reference Iverson and Evans2007) that L2 learners of English tended to have asymmetrical patterns of cue weighting in representing English vowels, that is, those who were accurate in representing one acoustic cue such as duration were not necessarily accurate in representing other cues such as formant frequencies. The present study is also consistent with the findings that non-native speech learners may rely on duration as an alternative strategy when they struggle with the spectral characteristics of the target non-native vowels (Bohn, Reference Bohn and Strange1995; Bohn & Flege, Reference Bohn and Flege1990; Escudero & Boersma, Reference Escudero and Boersma2004; Escudero et al., Reference Escudero, Benders and Lipski2009), as Shanghai speakers had an advantage over Mandarin speakers in the durational accuracy of the difficult English vowels, despite their difficulty in producing the vowel formants accurately.
Future research could include other English sounds that are present in Mandarin but not in Shanghai Chinese, for example, the word-final /n/-/ŋ/ distinction. Moreover, more English varieties such as British or Australian English could be included as the target non-native languages to see if the same effects reported in our paper are found in varieties of English with different pronunciations of the target vowels from those of American English (see for instance Escudero & Chladkova, Reference Escudero and Chladkova2010 for the acoustic properties of American versus Southern British English vowels; and Elvin et al., Reference Elvin, Williams and Escudero2016 for Australian English vowels). Likewise, an examination of a different cohort of Chinese dialects may lead to more diverse findings, especially regarding the acoustic contrasts with the target English sounds, which could also enhance our understanding of how bidialectalism influences non-native speech production. In addition, the present study elicited words in isolation; future research would benefit from methods with greater ecological validity, such as words read in the context of a sentence or a story (e.g., Yazawa et al., Reference Yazawa, Konishi, Whang, Escudero and Kondo2023), to capture natural speech patterns. Finally, eliciting target vowels from multiple words with different syllabic contexts would enhance the generalizability of the results and promote a more comprehensive understanding of non-native speech production. This has been done in the analysis of native English speech (e.g., Elvin et al., Reference Elvin, Williams and Escudero2016; Williams & Escudero, Reference Williams and Escudero2014b) but less often for non-native English speech (but see Yazawa et al., Reference Yazawa, Konishi, Whang, Escudero and Kondo2023, where vowels were produced in different consonantal contexts of a story).
Conclusion
The present study presents a unique contribution to understanding how bidialectalism influences non-native speech production. We compared monodialectal Mandarin Chinese speakers with bidialectal Shanghai-Mandarin Chinese speakers in terms of their ability to produce American English vowels, which were classified into easy and difficult categories for Chinese learners of English. The results showed that the bidialectal group had an overall advantage in producing the easy American English vowels [i] and [u] in terms of vowel formants and duration. For the difficult English vowels [ɪ] and [ʊ], both groups experienced the same challenges with vowel formants, but the bidialectals had higher accuracy in vowel duration. The present study thus extends previous bidialectalism research and the L2LP model to the realm of non-native speech production, demonstrating that the bidialectal advantage in non-native speech learning is modulated by cross-linguistic difficulty constraints. Therefore, the present study also contributes to our general understanding and theoretical modeling of how bidialectalism influences second-language acquisition.
Acknowledgments
We would like to thank Mr. Hongxiang Qin for his help with data collection. This work was supported by the Program of the Shanghai Planning Office of Philosophy and Social Science (No. 2022EYY006) awarded to Dr. Xiaoluan Liu. Professor Escudero’s work was supported by an Australian Research Council Future Fellowship (FT160100514).
Replication package
Data and materials for this article can be found at https://osf.io/a5y49/.
Competing interests
The authors declare none.