
The role of multiword sequences in fluent speech

The case of listener-based judgment in L2 argumentative speech

Published online by Cambridge University Press:  12 February 2025

Kotaro Takizawa*
Affiliation:
Graduate School of Education, Waseda University, Tokyo, Japan
Shungo Suzuki
Affiliation:
Green Computing Systems Research Organization, Waseda University, Tokyo, Japan
Corresponding author: Kotaro Takizawa; Email: [email protected]

Abstract

This study explored how second language (L2) speakers’ use of multiword sequences in speech predicted perceived fluency ratings while controlling for their utterance fluency (UF). A total of 102 Japanese speakers of English delivered an argumentative speech, which was analyzed for bigram and trigram measures (frequency, proportion, and mutual information) and UF measures capturing three subdimensions: speed, breakdown, and repair fluency (Tavakoli & Skehan, 2005). Perceived fluency was assessed by 10 experienced L2 raters. Mixed-effects regression analyses revealed that, after a parsimonious model had been established with UF predictors alone (marginal R2 = .61), a frequency-based n-gram predictor––bigram proportion––slightly but significantly explained the remaining variance in fluency rating scores (0.8%). The results indicated that multiword sequences in speech had a small but systematic impact on perceived fluency, even after controlling for the effects of utterance fluency. This finding contributes to the discussion concerning the role of multiword sequences in fluent speech production.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Introduction

Oral fluency is a robust indicator of oral proficiency for second language (L2) learners (Iwashita et al., 2008; Tavakoli, Kendon, Mazhurnaya & Ziomek, 2023; Tavakoli, Nakatsuhara & Hunter, 2020). According to Segalowitz’s (2010) conceptualization, oral fluency is better understood from three different perspectives: utterance, cognitive, and perceived fluency. Utterance fluency (UF) refers to observable temporal features, including pauses and repairs, while cognitive fluency (CF) refers to the efficiency of underlying speech production processes, including both language-general and language-specific cognition (Segalowitz, 2010, 2016). Perceived fluency (PF) concerns listeners’ inferences about a speaker’s CF based on their perceptions of the speaker’s UF (Segalowitz, 2010). PF, in particular, is likely influenced by various factors, including both temporal and nontemporal speech characteristics. Prior studies indicate that, aside from temporal factors, nontemporal factors explain additional variance in PF. For instance, factors pertaining to pronunciation quality (e.g., segmentals and suprasegmentals) had a major impact on PF judgments, followed by factors related to lexicogrammar (e.g., S. Suzuki & Kormos, 2020). The focus of this study is to shed light on the role of multiword sequences (MWS)––recurrent strings of words––in PF ratings.

Recently, a growing body of research has investigated the potential role of MWS knowledge and usage in L2 fluency, given the reduced processing cost that MWS afford in speech processing and production (Kormos, 2006; Wray, 2002). Previous studies on the relationship between MWS and oral fluency can be organized through Segalowitz’s (2010) conceptualization of UF, CF, and PF. The relationships between MWS usage in speech and UF (e.g., Tavakoli & Uchihara, 2020) and between MWS knowledge as a CF component and fluency (e.g., Kahng, 2020) have been most thoroughly examined to date. These studies primarily focused on the role of MWS in the speaker’s cognitive processes. However, less is known about how the use of MWS in speech affects listener perceptions of the speaker’s fluency (Foster, 2020). This gap is significant because, beyond their psycholinguistic importance for the speaker, MWS may also support the hearer’s processing fluency and comprehensibility in real-world language use (Wray, 2017).

Prior literature on the role of MWS usage in PF ratings (Boers, Eyckmans, Kappel, Stengers & Demecheleer, 2006; McGuire & Larson-Hall, 2017; Stengers, Boers, Housen & Eyckmans, 2011) revealed a direct connection between L2 learners’ use of MWS and rater judgments of fluency. However, these studies share two methodological limitations. First, they relied on MWS measures based on raw frequency and overlooked more robust corpus-based measures of MWS, which take into account all possible word combinations in learner texts in reference to large corpora. Among corpus-based measures of MWS, phrasal sophistication indices (e.g., n-gram frequency) are reliable predictors of speaking proficiency, including UF (e.g., Eguchi & Kyle, 2020; Tavakoli & Uchihara, 2020). Second, none of the studies controlled for the effects of the speaker’s UF on PF ratings. A meta-analysis of the relationship between UF and PF (S. Suzuki, Kormos & Uchihara, 2021) revealed that UF explained a large portion of PF variance (e.g., speech rate alone explained 57.8% of the variance in PF). Moreover, UF and MWS in speech are empirically associated (e.g., Tavakoli & Uchihara, 2020). Thus, it is plausible to assume that the speaker’s UF can be a confounding factor in the relationship between MWS in speech and PF ratings. Motivated by these methodological challenges, the current study explores how L2 learners’ phrasal sophistication indices in speech predict PF ratings while controlling for the effects of the speaker’s UF.

Background

Utterance fluency and perceived fluency

A growing number of studies have explored the interface between UF and PF, as listener perception of a speaker’s fluency is considered directly related to the temporal features present in the speaker’s utterance (Bosker, Quené, Sanders & de Jong, 2014; Kormos & Dénes, 2004; Rossiter, 2009; Saito, Ilkan, Magne, Tran & Suzuki, 2018; S. Suzuki & Kormos, 2020). A recent meta-analysis by S. Suzuki et al. (2021) aggregated 263 weighted effect sizes of correlations between UF and PF variables from 22 studies. The UF measures examined included articulation rate (i.e., the number of syllables uttered per second, excluding silent pauses) (speed fluency); silent and filled pause (e.g., uhm, ehh) frequency and silent pause duration within and between clauses (breakdown fluency); and disfluency rate, which includes repetition, self-correction, and false start frequency (repair fluency). Composite measures included speech rate (i.e., the number of syllables uttered per second, including silent pauses) and mean length of run (MLR), the mean number of syllables or words produced between silent pauses. The moderator effects of 11 methodological factors were also analyzed, focusing on speaker and listener backgrounds, rating procedures, the granularity of UF features, and speaking task types. The results showed that the composite measures––speech rate and MLR––were most strongly associated with PF (r = .76 and r = .72, respectively), followed by articulation rate (r = .62). For silent pause measures, pause frequency exhibited a larger effect size (r = –.59) than pause duration (r = –.46). Disfluency rate, as a measure of repair fluency, showed a statistically significant but weak association with PF (r = –.20).
Although most moderator variables did not reach statistical significance, several warrant close examination. Among UF variables, silent pause location (within versus between clauses) appeared influential in the relationship between silent pause frequency and PF (r = –.72 for within-clause versus r = –.48 for between-clause pauses). Among methodological variables, ratings by experienced L1 raters who underwent extensive training tended to yield stronger correlations between UF and PF. Regarding rating procedures, the use of validated rubrics and a 6-point scale appeared to make the UF–PF relationship easier to detect. This meta-analysis demonstrated that the subdimensions of UF play different roles in raters’ perceptions of a speaker’s fluency, with speed fluency and breakdown fluency (especially mid-clause pause frequency) being particularly influential in PF ratings.
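The UF measures cataloged above follow from simple formulas over syllable counts, pause annotations, and disfluency counts. The Python sketch below illustrates them on a hypothetical one-minute sample; it is not the study’s Praat-based pipeline, and the function name, inputs, and example values are invented. The MLR calculation assumes the sample begins and ends with speech, so the number of runs equals the number of silent pauses plus one.

```python
def utterance_fluency(n_syllables, total_dur, pause_durs, n_disfluencies):
    """Durations in seconds; pause_durs lists silent pauses (a >= 250 ms
    threshold is a common, though not universal, operationalization)."""
    phonation_time = total_dur - sum(pause_durs)
    return {
        # composite measures
        "speech_rate": n_syllables / total_dur,             # syll/sec, pauses included
        "mean_length_of_run": n_syllables / (len(pause_durs) + 1),
        # speed fluency
        "articulation_rate": n_syllables / phonation_time,  # syll/sec, pauses excluded
        # breakdown fluency
        "pause_frequency": len(pause_durs) / total_dur * 60,   # pauses per minute
        "mean_pause_duration": sum(pause_durs) / len(pause_durs),
        # repair fluency
        "disfluency_rate": n_disfluencies / total_dur * 60,    # repairs per minute
    }

# hypothetical sample: 180 syllables in 60 s, 19 half-second silent pauses, 3 repairs
m = utterance_fluency(n_syllables=180, total_dur=60.0,
                      pause_durs=[0.5] * 19, n_disfluencies=3)
```

For this invented sample, speech rate is 3.0 syllables/sec while articulation rate is higher (about 3.56), since the latter excludes the 9.5 s of pausing.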

As evident from the results of the meta-analysis, the variance in PF explained by UF measures is far from complete; even speech rate, the strongest predictor, explained only 57.8% of the variance in fluency perception. The remaining variability can, in part, be attributed to nontemporal factors, including lexicogrammatical, phonological, and discoursal features of speech. Previous studies have indicated that vocabulary-related factors (e.g., lexical diversity, richness, and sophistication) contribute less prominently but still nontrivially to fluency perceptions (Bosker et al., 2014; Kormos & Dénes, 2004; Magne et al., 2019; Préfontaine & Kormos, 2016; Rossiter, 2009; S. Suzuki & Kormos, 2020). Despite this potential influence of vocabulary use on fluency perception, few studies have examined the UF–PF link in relation to MWS use in speaking performance (see Rossiter, 2009, for an exception).

Multiword sequences and processing efficiency

According to the usage-based model of language acquisition, language consists of constructions at all levels of granularity, spanning from small units like phonemes to larger units of discourse (Ellis, 2006, 2012). MWS are considered building blocks of language and can take different forms depending on their definition and identification. Research shows that 20% to 50% of spoken and written discourse consists of MWS (Erman & Warren, 2000; Foster, 2001). This ubiquity of MWS underscores the necessity for L2 learners to acquire target-like MWS to achieve L2 proficiency. The ability to employ target-like MWS in language use is termed phraseological competence and is essential in L2 learning (Siyanova-Chanturia & Pellicer-Sánchez, 2019). MWS, by definition, are routinized strings of words that frequently appear together or are statistically strongly associated with each other (Paquot & Granger, 2012). For clarity, we use MWS as a cover term for the phenomena also known as formulaic language (see Wray, 2002, for other similar terms). There are many subtypes of MWS, including idioms (e.g., kick the bucket), collocations (e.g., solar energy), and lexical bundles (e.g., one of the most), which can be identified using either frequency-based or phraseological approaches (Boers & Webb, 2018). The current study focuses on n-grams, a subtype of MWS consisting of contiguous sequences of two or more words (e.g., bigrams and trigrams).
Unlike lexical bundles, which commonly target longer sequences based on predetermined frequency and distributional criteria (Biber et al., 2004), all word combinations in learner texts are identified as n-grams. N-gram indices for these sequences are then calculated with reference to a large corpus (Kyle & Crossley, 2015). The n-gram indices most often examined in the extant literature are frequency-based measures (e.g., frequency, range, and proportion) and association strength measures (e.g., t-score and mutual information [MI]). The use of a large reference corpus allows these n-gram measures to be interpreted as indicators of the target-likeness of language use (Kyle & Crossley, 2015). Taking advantage of n-grams’ coverage of all possible word combinations in learner texts, previous studies have demonstrated that n-grams effectively capture L2 learners’ phraseological competence and its development (Eguchi & Kyle, 2020; Garner & Crossley, 2018; Hougham, Clenton & Uchihara, 2024; Kim, Crossley & Kyle, 2018; Tavakoli & Uchihara, 2020; Zhang, Zhao & Li, 2023). Of the n-gram indices, the n-gram proportion index tends to exhibit higher sensitivity to speaking development and proficiency (e.g., Eguchi & Kyle, 2020). In contrast, research evidence shows that the n-gram MI index, which foregrounds low-frequency word combinations and advanced phrasal sophistication (e.g., Ellis, Simpson-Vlach & Maynard, 2008), is not a reliable predictor of general speaking proficiency (Eguchi & Kyle, 2020; Kim et al., 2018; Kyle & Crossley, 2015). One possible reason is that a monologue speech task with a written text prompt allows lower-level learners to use words from the prompt directly, which inflates the MI scores of their performance (Tavakoli & Uchihara, 2020).
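To make these indices concrete, the sketch below computes a bigram proportion score (here operationalized as the proportion of the learner’s bigram tokens attested in the reference corpus, one common frequency-based variant) and a mean pointwise MI score for a short learner utterance. This is an illustrative Python simplification: actual studies use large reference corpora and dedicated tools (e.g., TAALES), operationalizations vary, and the toy corpus and learner sentence below are invented.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Contiguous two-word sequences in a token list."""
    return list(zip(tokens, tokens[1:]))

# toy "reference corpus" standing in for a large corpus (invented data)
ref_tokens = "one of the most important things one of the best things in the world".split()
ref_uni = Counter(ref_tokens)
ref_bi = Counter(bigrams(ref_tokens))
N = len(ref_tokens)

def mi(bigram):
    """Pointwise mutual information of a bigram, from reference-corpus counts."""
    w1, w2 = bigram
    if ref_bi[bigram] == 0:
        return None  # unattested in the reference corpus
    p_xy = ref_bi[bigram] / (N - 1)                     # bigram probability
    p_x, p_y = ref_uni[w1] / N, ref_uni[w2] / N          # unigram probabilities
    return math.log2(p_xy / (p_x * p_y))

learner = "one of the most beautiful things".split()
lb = bigrams(learner)
# frequency-based index: share of learner bigram tokens attested in the reference
proportion = sum(1 for b in lb if b in ref_bi) / len(lb)
# association-strength index: mean MI over attested bigrams
mi_values = [mi(b) for b in lb if mi(b) is not None]
mean_mi = sum(mi_values) / len(mi_values)
```

In this toy example, three of the learner’s five bigrams (one of, of the, the most) are attested in the reference, giving a proportion of .60; the unattested combination most beautiful simply drops out of the MI average, mirroring how low-proportion, idiosyncratic word combinations escape association-strength indices.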

One key characteristic of MWS, relevant to efficient processing and the spontaneity of speech production, is their assumed holistic representation, which allows phrases to be stored, processed, and retrieved as wholes rather than in a word-by-word fashion (Wray, 2002). A substantial body of evidence supports the processing superiority of MWS over non-MWS in both language comprehension and production (Yi & Zhong, 2024). When people comprehend language, whether written or spoken, MWS––widely shared within a speech community––offer advantages for processing fluency and comprehensibility (Millar, 2011; Saito, 2020; Wray, 2017; Yeldham, 2020). Likewise, when people speak, a wide repertoire of MWS can reduce the cognitive load of the linguistic encoding processes involved in speech production (Kormos, 2006; Tavakoli & Uchihara, 2020). In line with the usage-based model of language acquisition, a frequency effect also emerges for MWS (i.e., the phrasal frequency effect): higher-frequency MWS tend to be processed faster than lower-frequency MWS, supporting fluent speech production (Takizawa, 2024).

The role of multiword sequences in fluent speech

The connection between MWS and fluency has garnered extensive attention since the publication of Pawley and Syder’s (1983) study. They critiqued the sharp divide between lexis and grammar and argued that idiomatic language use is characteristic of spoken language and ultimately contributes to fluent speech. Since then, a growing number of researchers have tackled the link between MWS knowledge and usage and oral fluency (Hougham et al., 2024; Kahng, 2020; Y. Suzuki, Eguchi & de Jong, 2022; Takizawa, 2024; Tavakoli & Uchihara, 2020; Uchihara, Eguchi, Clenton & Saito, 2021; Wood, 2006, 2009; Yan, 2020). The role of MWS in fluency can be conceptualized within Segalowitz’s (2010) theoretical framework of oral fluency. Although CF is beyond the scope of this study, one strand of research views MWS knowledge as part of the underlying linguistic knowledge (a component of CF) contributing to UF, measuring it through knowledge tests or phrasal acceptability judgment tasks (de Jong, Steinel, Florijn, Schoonen & Hulstijn, 2013; Kahng, 2020; Koizumi & In’nami, 2013; Takizawa, 2024; Uchihara et al., 2021). Another strand, concerning the link between MWS and UF, has extensively examined how the use of MWS in learners’ speech facilitates UF (Hougham et al., 2024; Y. Suzuki et al., 2022; Tavakoli & Uchihara, 2020; Wood, 2006, 2009; Yan, 2020).
The central concern of this approach is how MWS usage in speech is associated with the temporal features of utterances. The theoretical assumption underlying the role of MWS knowledge and usage in fluent speech production lies in the holistic representation of MWS, which eases the processing demands of linguistic encoding during speech production (Kormos, 2006; Skehan, Foster & Shum, 2016; Wray, 2002). Speech production models (Kormos, 2006; Levelt, 1989) posit three stages of incremental speech processing––conceptualization, formulation, and articulation––monitored by a comprehension system (self-monitoring). In conceptualization, macroplanning and microplanning shape the preverbal message, drawing on the speaker’s real-world knowledge. The preverbal message is then passed down to lexical, grammatical, and phonological encoding processes in the formulator, followed by articulation, where a stream of sounds is produced based on a store of gestural scores (the syllabary) and the vocal organs (Kormos, 2006; Levelt, 1989). It has been well documented that L2 learners tend to struggle with linguistic encoding due to inadequate linguistic knowledge or processing speed, often manifested as silent pauses within clauses (de Jong, 2016; Huensch, 2023; Kahng, 2014; Saito et al., 2018). A growing number of studies have found that speech production could be largely lexically driven, suggesting that lexical and phraseological knowledge plays an important role in fluent speech production (de Jong et al., 2013; Kahng, 2020; Koizumi & In’nami, 2013; S. Suzuki & Kormos, 2023; Takizawa, 2024; Uchihara et al., 2021).

In contrast to this role of MWS as a cognitive processing advantage for the speaker, less is known about how MWS usage in speech affects raters’ perceptions of a speaker’s fluency. Foster (2020) suggests that “the perception of speaker fluency is to a greater or lesser extent a perception of idiomaticity” (p. 449). This claim can be supported by three theoretical perspectives. First, according to the usage-based model of language acquisition, language construction patterns, including MWS, emerge through rational human cognition that exploits the probabilities and contingencies of linguistic information (Ellis, 2006). Fluent language users are thus posited to possess unconscious mechanisms that operate as optimal word processors, which are “adaptively probability-tuned to predict the linguistic constructions that are most likely to be relevant in the ongoing discourse context” (Ellis, 2006, p. 8). It follows that competent listeners are likely to perceive speech as fluent when it contains expected construction patterns. Second, MWS enjoy a processing advantage over idiosyncratic, non–target-like word combinations (Yi & Zhong, 2024), consistent with the holistic hypothesis (Wray, 2002). Prior literature on MWS processing has demonstrated that MWS tend to be processed significantly faster than matched novel phrases (e.g., Conklin & Schmitt, 2008), indicating that MWS are likely entrenched in memory holistically and retrieved as such (Wray, 2002). This processing ease may help readers or hearers free up attentional resources for comprehension and quickly grasp the meaning of the speech, shaping their perception of the speaker’s fluency.
Third, according to Wray’s (Reference Wray2017) communicative impact model, “speakers will be motivated to select formulaic material to support both for their own production and the hearer’s comprehension” (p. 574). Wray’s model posits that MWS have compensatory mechanisms for smooth communication, including minimizing the risk of the hearer’s cognitive processing overload or misunderstandings on the part of the hearer. In other words, using idiosyncratic word combinations may lead to processing inefficiencies for the hearer, giving them impression that the speaker is not fluent.

Albeit few in number, previous studies have addressed raters’ perceptions of a speaker’s fluency in relation to MWS usage in speech (Boers, Eyckmans, Kappel, Stengers & Demecheleer, 2006; McGuire & Larson-Hall, 2017; Stengers et al., 2011). Boers et al. (2006) examined whether teaching MWS would enhance PF, judged by an experienced rater, in conversational and monologue tasks. MWS in learners’ speech were identified through the intuitions of two judges, counted as raw frequencies, and correlated with the PF rating scores. The results showed that MWS counts were moderately correlated with PF rating scores (rho = |.438–.446|). Following this line of study, Stengers et al. (2011) further showed that the raw number of MWS used in learner speech was significantly associated with raters’ perceptions of fluency (r = .550 for L2 English and .361 for L2 Spanish). Building on these studies, McGuire and Larson-Hall (2017) examined whether teaching MWS led to improvements in both PF and UF. PF was judged by 16 trained native speakers of English, and learner use of MWS was identified through the authors’ intuitive judgments. The results showed that, while the UF measures (speech rate and MLR) and the raw frequency of MWS used in speech improved significantly from pretest to posttest, PF did not exhibit a major improvement (Cohen’s d = 0.26). Although the correlation between MWS and PF was not examined, the significant improvement in MWS did not appear to align with improvements in PF.

In sum, despite the theoretical importance of MWS usage for PF, the aforementioned studies have yielded mixed findings, possibly due to the following methodological shortcomings. First, previous studies did not fully account for MWS usage in speech. A growing number of studies have investigated the potential of frequency-based approaches to MWS detection using phrasal sophistication indices (Eguchi & Kyle, 2020; Garner & Crossley, 2018; Kim et al., 2018; Hougham et al., 2024; Saito, 2020; Tavakoli & Uchihara, 2020). This method takes advantage of analyzing all possible word combinations in learner speech as n-grams; accordingly, this study adopted the n-gram approach to MWS detection. Second, none of the previous studies considered the effects of the speaker’s UF on PF ratings. Given that raters tend to attend simultaneously to both nontemporal and temporal features of utterances, even with instructions or training, UF is highly likely to act as an extraneous variable (i.e., a variable that can affect the dependent variable but is not included in the statistical analysis) in the relationship between MWS usage and PF ratings. Additionally, the relationship between speakers’ use of MWS and UF has been substantiated in previous studies (Hougham et al., 2024; McGuire & Larson-Hall, 2017; Y. Suzuki et al., 2022; Tavakoli & Uchihara, 2020; Wood, 2006, 2009; Yan, 2020). Therefore, the methodological decision to control for the effects of the speaker’s UF on the relationship between MWS and PF ratings is justified.

The current study

This study is part of a larger project exploring the relationship between phraseological competence and oral fluency (Takizawa, 2024). It addressed how learners’ use of MWS in speech is associated with raters’ perceptions of speakers’ fluency (i.e., PF), in line with Foster’s (2020) research agenda. To better understand the intricate relationship between MWS and oral fluency, this study extends this line of research by adopting corpus-based measures of MWS (i.e., phrasal sophistication indices) and carefully controlling for the potential confounding effects of UF on the predictive power of MWS for PF ratings. The study was guided by the following research question:

To what extent do n-gram indices (frequency, proportion, and association) predict PF ratings while controlling for the effects of UF measures (speed, breakdown, and repair fluency) on PF ratings?
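This research question implies a hierarchical (incremental) modeling strategy: fit a baseline model with UF predictors, add an n-gram predictor, and examine the change in explained variance. The Python sketch below illustrates that logic on synthetic data; it deliberately simplifies the study’s mixed-effects models to ordinary least squares, and all variable distributions and coefficients are invented for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with intercept, via least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 102  # number of speakers, matching the study's sample size

# synthetic predictors (invented distributions, for illustration only)
speech_rate = rng.normal(3.0, 0.6, n)   # UF: composite/speed
pause_freq = rng.normal(10.0, 3.0, n)   # UF: breakdown
bigram_prop = rng.normal(0.5, 0.1, n)   # n-gram predictor

# synthetic PF ratings: driven mainly by UF, slightly by bigram proportion
pf = (2.0 * speech_rate - 0.1 * pause_freq
      + 1.5 * bigram_prop + rng.normal(0, 0.5, n))

r2_base = r_squared(np.column_stack([speech_rate, pause_freq]), pf)
r2_full = r_squared(np.column_stack([speech_rate, pause_freq, bigram_prop]), pf)
delta_r2 = r2_full - r2_base  # incremental variance explained by the n-gram index
```

Because UF dominates the synthetic ratings, the baseline R² is large and the n-gram predictor adds only a small increment, which is exactly the pattern the analysis is designed to detect.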

Method

Participants

A total of 110 Japanese university students volunteered to participate in the experiment as speakers (M = 20.49 years, SD = 1.29, range = 18–23). As part of a larger project investigating the relationship between phraseological competence and oral fluency (Takizawa, 2024), the participants completed several vocabulary tests and spontaneous speaking tasks. Due to restrictions related to the COVID-19 pandemic, all data collection was conducted online using a video conferencing tool (Zoom) and the online experiment builder Gorilla (Anwyl-Irvine et al., 2020). Their self-reported standardized test scores indicated that they were mostly at the B1 and B2 levels of the Common European Framework of Reference for Languages (CEFR): A2 (n = 2), B1 (n = 21), B2 (n = 46), and C1 (n = 12). They had varying degrees of overseas experience (M = 22.98 months, SD = 35.65, range = 1–210). Two participants were excluded due to issues with speech recording.

For PF judgments, 10 experienced Japanese raters were recruited, most of whom (n = 9) were graduate students in applied linguistics and L2 acquisition (M = 30.00 years, SD = 7.86). The remaining rater (n = 1) was a high-school teacher who had earned a Ph.D. in applied linguistics and L2 acquisition. Following Isaacs and Thomson (2013), experienced raters were defined as those with relevant linguistic and educational backgrounds. Their average teaching experience was 7.45 years (SD = 9.25), and they had taken formal linguistics classes for an average of 4.85 years (SD = 1.81). Some raters had formal training in phonetics, while others had pursued graduate degrees in L2 pronunciation or speech fluency. Their proficiency levels were deemed sufficiently high, which was considered a necessary rater characteristic. Most raters (n = 8) had met the requirement of a minimum Test of English as a Foreign Language Internet-based Test (TOEFL iBT) overall score of 80 at the time of their graduate school admission; their self-reported test scores placed them at the C1 level of the CEFR. The remaining raters (n = 2) reported Test of English for International Communication (TOEIC) Reading and Listening scores of 825 and 915, respectively, which correspond to the B2 to C1 levels of the CEFR. None of the raters reported any hearing problems. Appendix A summarizes the rater profiles.

Speech samples

Speech samples were elicited from the 108 remaining Japanese university students through an argumentative speech task. The topic focused on the pros and cons of using smartphones in everyday life. Participants prepared their speech for 2 min and were then free to speak as much as they liked without a time limit. The mean duration of the speeches was 103.77 sec (SD = 55.13, range = 24.88–346.3), and their mean length was 109.78 words (SD = 55.36, range = 26–306). To prepare for the rating sessions, all speech samples were normalized to 65 dB using Praat (Boersma & Weenink, 2020); normalization equalizes loudness so that raters can listen to the samples comfortably. Six speech samples were excluded because normalization caused clipping, resulting in distorted, low-quality audio. To standardize the speech samples, each was cut to the first 60 sec (see Rossiter, 2009). It should be noted that 26 of the 102 samples were shorter than 60 sec.

To distribute the samples into 3-day rating sessions (see the Rating sessions section below), each day needed to cover a similar range of fluency levels so that raters could use the full range of the scale (1–6) in every session. To do so, speech rate––a reliable index of relatively global UF level (S. Suzuki et al., 2021)––was computed using de Jong, Pacilly & Heeren’s (2020) Praat script. In this calculation, the total number of syllables, including disfluencies (repetitions, corrections, and false starts), was divided by the total duration of speech, including silent pauses. Speech rate was then submitted to a cluster analysis using the ggdendro and cluster packages in R (R Core Team, 2023). Inspection of the dendrogram, silhouette figures, and silhouette values indicated that the speakers were best divided into two clusters (cluster 1: n = 64; cluster 2: n = 38; see supplementary material S1 for details). Speakers were then drawn from each cluster so that each session contained 34 speech samples (34 samples × 3 sessions).
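The stratification procedure above can be sketched as follows. This is an illustrative Python reconstruction, not the authors’ R/Praat pipeline: the speech rates are simulated (mirroring the reported cluster sizes of 64 and 38), Ward hierarchical clustering is cut at two clusters as the study’s dendrogram and silhouette values suggested, and speakers are then dealt into three sessions in order of speech rate so each session spans a similar range of fluency.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# simulated speech rates (syllables/sec) for 102 speakers in two loose groups;
# the values are invented, only the group sizes mirror the study
rates = np.concatenate([rng.normal(2.2, 0.3, 64), rng.normal(3.6, 0.3, 38)])

# Ward hierarchical clustering on the 1-D speech-rate data, cut at k = 2
labels = fcluster(linkage(rates.reshape(-1, 1), method="ward"),
                  t=2, criterion="maxclust")

# deal speakers into three sessions in order of speech rate, so each
# session of 34 samples covers a similar range of fluency levels
sessions = [[], [], []]
for i, speaker in enumerate(np.argsort(rates)):
    sessions[i % 3].append(int(speaker))
```

Dealing the rate-sorted speakers round-robin is one simple way to satisfy the stated goal (comparable fluency ranges per session); the authors’ exact assignment rule may differ.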

Rating sessions

The rating procedure comprised three components: (a) instructions for the ratings and definitions of fluency, (b) rater training, and (c) the main rating sessions. The sessions spanned 3 consecutive days, with the first day including instruction, training, and main ratings. The definition of fluency was based on speed, breakdown, and repair fluency (Tavakoli & Skehan, 2005) (see Appendix D). Raters were instructed to assign an overall score considering these three dimensions of fluency, using a 6-point scale ranging from 1 = extremely disfluent to 6 = extremely fluent (see Appendix C). The use of a 6-point scale was justified by the cluster analysis and its consistency with the range of the speakers’ fluency levels (see S. Suzuki et al., 2021). Raters were encouraged to use the full range of the scale to ensure comprehensive discrimination of PF levels. To facilitate this, they were given a tip suggesting that the scale could initially be divided into two halves––scores 1–3 for less fluent samples and scores 4–6 for fluent samples––with samples then further subdivided to assign each score. Note that this initial division into two broad levels is justified by the aforementioned cluster analysis (see supplementary material S1).

Raters underwent a 30-min training session aimed at ensuring the construct validity and reliability of PF judgments. The training focused primarily on standard setting for the current samples and on understanding fluency features in terms of speed, breakdown, and repair fluency. Four speech samples were prepared, elicited from another argumentative speech task performed by the same participants as in the current study. Two of these samples represented the highest speech rates and the other two the lowest, allowing raters to grasp the upper and lower limits of fluency levels in the current dataset. Care was taken to visually demonstrate how the three dimensions of fluency emerge in actual speech, with transcriptions presented on PowerPoint slides highlighting annotations for speed, breakdown, and repair fluency. The author discussed the rating scores assigned by the raters, closely inspecting the transcriptions when needed. Following this, another four speech samples––two with the highest speech rates and two with the lowest––were presented for practice rating using Praat. In the main rating session, raters assessed 17 speech samples, took a 5-min break, and then assessed another 17, totaling 34 samples per day. At the end of the first day of rating, they were asked, "To what extent do you understand the construct of fluency?" and "To what extent was the training session effective?" Additionally, at the end of each session, they were asked, "Did you consistently apply the definition of fluency (speed, breakdown, and repair) provided at the beginning? What other aspects did you pay attention to, if any?" In addition to verbal reports, they provided numerical scores out of 100 (see Appendix B). The internal consistency of PF ratings for all 102 speeches was high (Cronbach's α = .975).
Two-way consistency intraclass correlation coefficients for single measures, ICC(2,1), and average measures, ICC(2,k), were also computed (Nagle & Rehman, 2021), indicating good to excellent inter-rater reliability (.735 and .965, respectively) based on Cicchetti's (1994) proposed cutoffs.
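Consistency-type single- and average-measure ICCs can be derived from the two-way ANOVA mean squares of the speaker-by-rater score matrix. The sketch below illustrates this computation under that assumption (the toy matrix is invented; the authors presumably used a statistical package rather than hand computation):

```python
import numpy as np

def icc_consistency(ratings):
    """Two-way consistency ICCs from an n-speakers x k-raters score matrix.
    Returns (single-measure, average-measure) values, computed from the
    standard two-way ANOVA mean squares."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)          # per-speaker means
    col_means = ratings.mean(axis=0)          # per-rater means
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    single = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    average = (ms_rows - ms_err) / ms_rows
    return single, average

# Toy matrix: 5 speakers rated by 3 raters whose scores are perfectly
# consistent (rank-preserving, constant offsets), so both ICCs equal 1
scores = [[4, 5, 4], [2, 3, 2], [5, 6, 5], [1, 2, 1], [3, 4, 3]]
single, average = icc_consistency(scores)
```

Because consistency ICCs ignore constant rater offsets, the second rater scoring exactly one point higher than the others does not lower reliability here.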

Analysis

The n-gram analysis

MWS usage in the current study was operationalized as the extent to which the use of bigrams and trigrams was target-like relative to a representative large-scale reference corpus. This n-gram analysis has increasingly been employed in L2 acquisition research as a reliable method for detecting recurrent sequences (Eguchi & Kyle, 2020; Kim et al., 2018; Kyle & Crossley, 2015; Tavakoli & Uchihara, 2020). The sequences are measured with a range of phrasal sophistication indices. In this study, the Tool for the Automatic Analysis of Lexical Sophistication (TAALES) 2.8.1 (Kyle & Crossley, 2015) was used to extract lemmatized bigrams and trigrams from the pruned transcriptions of speech, excluding disfluencies and filled pauses (see Footnote 1). Following Tavakoli and Uchihara's (2020) study, frequency, proportion, and association indices were computed with reference to the spoken subsection of the Corpus of Contemporary American English (COCA: Davies, 2008), which consists of 79 million words from a wide range of spontaneous conversations on TV and radio over the past 25 years in the United States. Given that few attempts have so far been made to relate n-gram usage in speech to PF, the current study opted for n-gram indices that have been frequently employed and empirically examined in previous studies (frequency, proportion, and association) to maintain comparability across studies.
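The basic extraction step can be sketched as follows (a simplified illustration: lemmatization, which TAALES applies, is omitted here; the example text is the pruned sample from Footnote 1):

```python
def extract_ngrams(pruned_text, n):
    """Extract word n-grams from a pruned transcription
    (disfluencies and filled pauses already removed).
    Lemmatization is omitted for brevity."""
    tokens = pruned_text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The pruned example text from Footnote 1
bigrams = extract_ngrams("There are two reasons for these", 2)
# ['there are', 'are two', 'two reasons', 'reasons for', 'for these']
```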

The n-gram frequency index measures how frequent the n-grams appearing in a learner's text are in the reference corpus. In TAALES, each n-gram is assigned a frequency score, and these scores are aggregated into a composite score per text. However, this mean frequency score does not necessarily account for text length, as the number of score-assigned n-grams may not align closely with text length. Thus, a text-length–controlled frequency score was recalculated manually (see Footnote 2). To mitigate the skew of the Zipfian distribution common to frequency-based measures, bigram and trigram frequencies were log-transformed (Kyle & Crossley, 2015). A higher frequency score denotes that a speaker produces many high-frequency, target-like n-grams. Whereas the n-gram frequency index is continuous, the n-gram proportion index is based on binary scoring (the presence or absence of each n-gram in a reference list). In this study, the proportion of bigrams and trigrams appearing among the 30,000 most frequent n-grams in the spoken subsection of COCA was computed (Eguchi & Kyle, 2020; Tavakoli & Uchihara, 2020). A higher n-gram proportion score indicates that a speaker produces many frequent, target-like n-grams in their speech, accounting for text length. Finally, the n-gram association index measures the strength of association between words, commonly operationalized as MI (e.g., Durrant & Schmitt, 2009; Granger & Bestgen, 2014; Saito, 2020). MI has been described as a measure of the strength, tightness, coherence, and appropriateness of word combinations (Gablasova, Brezina & McEnery, 2017, p. 163). It is calculated as the log of the ratio of an n-gram's observed frequency to its expected frequency in a reference corpus (i.e., COCA). Following Zhang et al. (2023), trigram MI was computed for two types: unigram to bigram and bigram to unigram. As with the n-gram frequency measure, the MI index available in TAALES may not account for text length, as the MI scores per text are aggregated into a composite score. Therefore, a text-length–controlled MI score was recalculated manually.
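The three bigram indices can be illustrated with a toy reference corpus; the counts and list below are invented stand-ins for the COCA statistics, and the formulas follow the standard definitions (MI as the log2 ratio of observed to expected frequency):

```python
import math

# Hypothetical reference-corpus statistics (invented counts, not COCA)
CORPUS_SIZE = 1_000_000                       # total bigram tokens
BIGRAM_FREQ = {"agree with": 500, "you know": 2000}
UNIGRAM_FREQ = {"agree": 800, "with": 20000, "you": 50000, "know": 9000}
TOP_30K = set(BIGRAM_FREQ)                    # stand-in for the 30,000-item list

def log_frequency(bigrams):
    """Mean log10 reference frequency of the text's bigrams (unattested
    bigrams score 0), normalized by the number of bigrams in the text."""
    scores = [math.log10(BIGRAM_FREQ.get(bg, 1)) for bg in bigrams]
    return sum(scores) / len(bigrams)

def proportion(bigrams):
    """Share of the text's bigrams found in the high-frequency list."""
    return sum(bg in TOP_30K for bg in bigrams) / len(bigrams)

def mutual_information(bigram):
    """MI: log2 of observed over expected bigram frequency, where the
    expectation assumes the two words co-occur by chance."""
    w1, w2 = bigram.split()
    expected = UNIGRAM_FREQ[w1] * UNIGRAM_FREQ[w2] / CORPUS_SIZE
    return math.log2(BIGRAM_FREQ[bigram] / expected)

text_bigrams = ["agree with", "agree the"]
prop = proportion(text_bigrams)   # 0.5: only 'agree with' is in the list
mi = mutual_information("agree with")
```

The target-like versus non–target-like contrast discussed later in the paper (agree with versus agree the) is visible here: the unattested bigram lowers both the frequency and the proportion score.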

Utterance fluency analysis

In line with previous studies that comprehensively examined the multidimensionality of UF (e.g., S. Suzuki & Kormos, 2023), this study measured UF in terms of its three subdimensions (speed, breakdown, and repair fluency), based on Tavakoli and Skehan's (2005) triadic framework. Articulation rate was computed as a measure of speed fluency by dividing the total number of syllables in the pruned texts by total phonation time. Silent pause ratio and duration were computed separately by pause location (mid-clause or end-clause). The definition of a clause followed Foster et al. (2000): either a simple independent finite clause, or a dependent finite or nonfinite clause. Mid-clause pauses were defined as silent pauses occurring within independent or subordinate clauses, while end-clause pauses were defined as silent pauses occurring between independent or subordinate clauses. Mid-clause and end-clause pause ratios were calculated by dividing the total number of mid- and end-clause pauses by the total number of syllables. Mid-clause and end-clause pause durations were calculated as the mean duration of mid- and end-clause pauses. Filled pauses, including lexical fillers (e.g., you know), were measured as a filled pause ratio, calculated by dividing the total number of filled pauses by the total number of syllables. Repair fluency was measured as the total number of disfluency words caused by repetitions, corrections, and false starts, divided by the total number of syllables, which we refer to as the disfluency ratio. A previous meta-analysis suggested that composite measures (speech rate and mean length of run) have stronger predictive power for PF (S. Suzuki et al., 2021).
This study opted not to include composite measures because the constructs underlying them encompass both speed and breakdown fluency (e.g., Bosker, Pinget, Quené, Sanders & de Jong, 2013). To control for the effects of the UF dimensions (speed, breakdown, and repair fluency) on the prediction of PF by n-grams, maintaining the three subdimensions of UF is justifiable. Speech was annotated for pauses and disfluencies by the first author using TextGrids in Praat. The second author coded 10% of the data, achieving a high agreement rate (Cronbach's α = .995 for mid-clause pauses, .976 for end-clause pauses, .998 for filled pauses, and .987 for disfluency words). The annotated TextGrids and speech samples were then submitted to an R script (S. Suzuki & Révész, 2023) to compute each UF measure automatically.
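Given annotation counts per speech, the UF measures defined above reduce to simple ratios and means. A sketch of that final computation step (the function and argument names are illustrative, not those of the actual R script; the counts are hypothetical):

```python
def utterance_fluency_measures(n_syllables_pruned, phonation_time_s,
                               mid_pauses, end_pauses, n_filled_pauses,
                               n_disfluency_words, n_syllables_total):
    """UF measures as defined above. 'mid_pauses' and 'end_pauses' are
    lists of silent-pause durations in seconds; counts are per speech."""
    return {
        "articulation_rate": n_syllables_pruned / phonation_time_s,
        "mid_pause_ratio": len(mid_pauses) / n_syllables_total,
        "end_pause_ratio": len(end_pauses) / n_syllables_total,
        "mid_pause_duration": sum(mid_pauses) / len(mid_pauses),
        "end_pause_duration": sum(end_pauses) / len(end_pauses),
        "filled_pause_ratio": n_filled_pauses / n_syllables_total,
        "disfluency_ratio": n_disfluency_words / n_syllables_total,
    }

# Hypothetical counts from one annotated speech
m = utterance_fluency_measures(
    n_syllables_pruned=200, phonation_time_s=60.0,
    mid_pauses=[0.4, 0.6, 0.5], end_pauses=[0.8, 1.2],
    n_filled_pauses=5, n_disfluency_words=10, n_syllables_total=215)
```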

Statistical analysis

As PF and most UF measures were positively skewed and deviated from normal distributions, nonparametric Spearman's rank-order correlation analyses were run to examine the bivariate relationships between PF and UF, n-grams and UF (supplementary material S3), and n-grams and PF (see supplementary material S4 for partial correlation analyses between n-grams and PF ratings while controlling for each UF measure, to examine the unique relationship between n-grams and PF ratings). These analyses were conducted using JASP (JASP Team, 2023). Effect sizes were interpreted following Plonsky and Oswald's (2014) recommendations: r = .25 (small), r = .40 (medium), and r = .60 (large). To model the relationship between n-grams and PF while controlling for the effects of UF, mixed-effects regression analyses were run using the lme4 package (Bates, Mächler, Bolker & Walker, 2015) in R (R Core Team, 2023). Model-building procedures followed Murakami (2016): variables improving model fit, assessed by a decreasing Akaike information criterion (AIC) value, were entered in a forward manner, and each addition was then confirmed by a likelihood ratio test to determine whether model fit improved significantly between adjacent models. In building the mixed-effects models, we first established the parsimonious model using only UF predictors to explain PF ratings, thereby obtaining the remaining variance in PF ratings. Next, n-gram indices (bi/trigram frequency, proportion, and MI) were entered into the parsimonious model in a forward manner to examine which n-gram indices best predicted the remaining variance in PF ratings (see supplementary material S8 for R scripts).
Statistical assumptions of the final model (linearity, normality of residuals, homogeneity of variance, and multicollinearity) were checked by visually inspecting diagnostic plots generated with the performance package in R (Lüdecke, Ben-Shachar, Patil, Waggoner & Makowski, 2021) (see supplementary material S5).
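The forward-selection logic (add the predictor that most lowers AIC, then confirm with a likelihood ratio test) can be illustrated schematically. The sketch below uses plain OLS with synthetic data, so it omits the crossed by-rater and by-speaker random effects of the actual lme4 models and shows only the selection loop, not the authors' R scripts (those are in supplementary material S8):

```python
import numpy as np
from scipy.stats import chi2

def fit_ols(y, X):
    """OLS fit returning (AIC, log-likelihood, n_params); a stand-in
    for the mixed-effects fits, with random effects omitted."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    loglik = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)
    k = X1.shape[1] + 1                  # coefficients + residual variance
    return 2 * k - 2 * loglik, loglik, k

def forward_select(y, predictors, alpha=0.05):
    """Enter the predictor that most decreases AIC; keep it only if the
    likelihood ratio test against the previous model is significant."""
    selected = []
    aic, ll, k = fit_ols(y, np.empty((len(y), 0)))
    remaining = set(predictors)
    while remaining:
        trials = {name: fit_ols(y, np.column_stack(
                      [predictors[s] for s in selected] + [predictors[name]]))
                  for name in remaining}
        best = min(trials, key=lambda nm: trials[nm][0])
        new_aic, new_ll, new_k = trials[best]
        p_val = chi2.sf(2 * (new_ll - ll), df=new_k - k)
        if new_aic >= aic or p_val >= alpha:
            break                         # no significant improvement: stop
        selected.append(best)
        aic, ll, k = new_aic, new_ll, new_k
        remaining.remove(best)
    return selected

# Toy demonstration: y depends strongly on x1 and not on x2
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 200))
y = 3 * x1 + rng.normal(scale=0.5, size=200)
selected = forward_select(y, {"x1": x1, "x2": x2})  # 'x1' enters first
```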

Results

Bigram and trigram frequency (log), proportion, and MI were submitted to correlation analyses to examine their bivariate relationships with PF ratings. As summarized in Table 1, all n-gram indices except bigram MI were significantly associated with PF rating scores, with small to medium effect sizes (rho = |.24–.33|). Trigram MI (uni→bi) was more strongly correlated with PF rating scores (rho = .32) than trigram MI (bi→uni) (rho = .27). Therefore, only trigram MI (uni→bi) was retained for the subsequent regression analysis to avoid the multicollinearity that would have arisen had both been entered (rho = .94). Supplementary materials S2, S3, and S4 include descriptive statistics for PF, UF, and n-gram indices, correlations between UF and PF ratings and between UF and n-gram indices, and partial correlations between n-gram indices and PF ratings controlling for the effects of each UF measure.

Table 1. Spearman's correlations between n-gram indices and PF ratings

Note: 1 Sum of the indices' scores was divided by text length. MI = mutual information. **p < .01; ***p < .001.

To model the predictive relationship between n-gram indices and PF ratings while controlling for speakers' UF as a confounding factor, mixed-effects regression models were constructed using a two-step approach. First, we established the parsimonious model predicting PF ratings solely from UF measures to obtain the remaining variance in PF rating scores. Next, n-gram indices (bi/trigram frequency, proportion, and MI) were entered as predictors of that remaining variance. The selection of n-gram predictors was exploratory, as there was no evidence in prior studies regarding the predictive strength of n-gram indices for PF rating scores. Table 2 shows the model comparisons, illustrating how AIC values and marginal R2 (i.e., variance explained by fixed effects) changed with the addition of fixed-effect predictors. Model building for the parsimonious set of UF predictors concluded with model 7. However, likelihood ratio tests indicated no significant difference in model fit between models 5 and 6 or between models 6 and 7. Therefore, model 5 was adopted as the parsimonious model. Mid-clause pause ratio emerged as the best predictor of PF ratings (marginal R2 = .449), followed by mid-clause pause duration (marginal R2 change = .122), end-clause pause ratio (marginal R2 change = .027), and end-clause pause duration (marginal R2 change = .012). Building on the parsimonious UF model, n-gram predictors were then entered. Model 8 included bigram proportion, which explained additional variance in PF rating scores (marginal R2 change = .008). In model 9, bigram MI was entered as a predictor given a decreasing AIC value. However, the likelihood ratio test showed that, although the difference approached statistical significance, model fit did not differ significantly between models 8 and 9 (p = .079). Therefore, model 8 was selected as the final model (see Table 3 for summary statistics).

Table 2. Summary of mixed-effects model comparison

Note: 1 By-rater and by-speaker random intercept model. 2 Marginal R2 was computed with the r.squaredGLMM function in R.

Table 3. Final mixed-effects model to predict PF

Note: R2 marginal = .62; R2 conditional = .77. *p < .05; **p < .01; ***p < .001.

Discussion

The current study explored whether the use of MWS (i.e., bi/trigrams) in speech is associated with raters' perceptions of speakers' fluency while controlling for the speakers' UF. Correlation analyses indicated that bigram and trigram indices were significantly correlated with PF: bi/trigram frequency (rho = .32 and .24), trigram MI (uni→bi) (rho = .32), and bi/trigram proportion (rho = .33 and .26). To model the relationship between n-gram measures and PF ratings while controlling for speakers' UF measures, mixed-effects regression models were built by separating the variance in PF ratings accounted for by UF predictors from that explained by n-gram predictors. The results indicated that bigram proportion was the only significant predictor of the remaining variance in PF rating scores (b = .14, marginal R2 = 0.8%).

The current findings generally confirmed the existing body of evidence that L2 learners' phraseological competence contributes to oral fluency (Boers et al., 2006; Hougham et al., 2024; Kahng, 2020; McGuire & Larson-Hall, 2017; Pawley & Syder, 1983; Stengers et al., 2011; Y. Suzuki et al., 2022; Takizawa, 2024; Tavakoli & Uchihara, 2020; Wood, 2006, 2009; Yan, 2020). Within Segalowitz's (2010) multidimensional framework of oral fluency (UF, CF, and PF), this study particularly highlighted PF and its relevance to L2 speakers' MWS usage (Boers et al., 2006; McGuire & Larson-Hall, 2017; Stengers et al., 2011). The results indicated that bigram proportion significantly predicted PF while controlling for the relevant UF features, that is, the breakdown fluency measures in this study. Note that speed and repair fluency did not significantly improve the explained variance of PF scores in the parsimonious mixed-effects model. This suggests that the target-like use of frequent bigrams (e.g., agree with as a target-like frequent bigram versus agree the as a non–target-like infrequent bigram) could be important for L2 learners to be perceived as fluent beyond temporal fluency features. That said, UF still had a substantial effect on PF (61%), while the remaining variance explained by bigram proportion was merely 0.8%.
It is likely that the variance in PF ratings explained by UF subsumed much of the variance attributable to n-gram predictors, as ample evidence indicates that MWS used in speech contribute to UF (e.g., Tavakoli & Uchihara, 2020). Theoretically, there is thus reason to view these variables sequentially (MWS used in speech → UF → PF).

The current finding that high-frequency MWS contribute to PF can be interpreted from the perspective of the usage-based model of language acquisition, which posits that unconscious language representation is probability-tuned (Ellis, 2006). Given that language learning is susceptible to frequency effects (Ellis, 2012), higher-frequency constructions are encountered more often and processed more quickly. It thus seems that raters could be sensitive to high-frequency word combinations, and any deviation from target-like use of high-frequency MWS may negatively affect raters' PF judgments. This interpretation is further supported by empirical findings in MWS processing research. Yi and Zhong (2024) demonstrated that the processing speed of MWS was moderated by phrasal frequency: higher-frequency MWS likely benefitted from processing advantages (faster reaction times) for L1 and L2 speakers alike. In contrast, Yi and Zhong also demonstrated that the relationship between low-frequency advanced MWS (measured as MI) and their processing speed was nonlinear: both the lower (MI < 3.34) and higher (MI > 9.56) ends of MI scores exhibited weaker processing advantages than medium MI scores. The speech in the current study contained MWS with very weak association strength (bigram MImean = 1.59; trigram MImean = 2.45; cf. the scores before considering text length). It is thus possible that MWS frequency, rather than association strength, mattered more to the current listeners. That said, the raters' insensitivity to MI could also be attributed to their limited MWS knowledge compared to L1 speaker raters. Relatedly, the relatively small contribution of bigram proportion (0.8%) might have been larger for L1 speaker raters.

This study also demonstrated that corpus-based measures of MWS (i.e., n-grams) pertain to raters' perceptions of fluency. Previous studies revealed that n-gram proportion is a robust predictor of general speaking proficiency (Eguchi & Kyle, 2020; Garner & Crossley, 2018; Kim et al., 2018; Tavakoli & Uchihara, 2020; Zhang et al., 2023). The current findings contribute to this body of research by showing that the n-gram proportion measure is also a useful indicator of phraseological competence in the context of listener-based judgments of oral proficiency (i.e., PF in this study). Previous studies examining MWS in speech and UF (temporal features of utterance), however, have called the role of n-gram proportion into question (Hougham et al., 2024; Tavakoli & Uchihara, 2020). This discrepancy between the current study and previous studies could stem from the different constructs of fluency targeted (PF versus UF). It has been suggested that PF is influenced by a range of temporal and nontemporal factors (Kormos & Dénes, 2004; S. Suzuki & Kormos, 2020). In fact, the raters in this study were largely consistent in their ratings according to their self-assessments of rating consistency out of 100 (day 1: 79.44, day 2: 88.50, day 3: 90.50; see Appendix B for details). Nevertheless, interviews revealed that they paid attention to a range of temporal and nontemporal factors during their fluency ratings, despite the instructions and rater training provided (see supplementary material S6 for coded categories of factors).

Conclusions

This study demonstrated that the use of target-like frequent MWS (i.e., bigram proportion) in argumentative speech contributed to raters' perceptions of speakers' fluency, even when controlling for speakers' speed, breakdown, and repair fluency. The results suggest that L2 learners aiming to be perceived as fluent speakers in language assessments could benefit from mastering high-frequency MWS. For example, in our dataset, agree with was identified as a frequent bigram, while agree the was not, as it is very infrequent or non–target-like. Thus, L2 learners are encouraged to learn and use frequent MWS accurately in their speech. This does not mean, however, that the temporal features of utterances are less important; indeed, these features had a substantial impact on raters' judgments of speakers' fluency (marginal R2 = .61). To achieve a high level of PF, then, L2 learners might need to primarily enhance their UF while also mastering target-like, high-frequency MWS.

Future research should address the following limitations of this study. First, MWS use was operationalized as the use of bi/trigrams (frequency, proportion, and MI) for the sake of comparability across studies. However, this may lead to both false positives and false negatives in detecting L2 learners' MWS use, as Foster (2020) emphasizes. Additionally, a recent study by Hougham et al. (2024) addressed the role of MWS length in UF. Building on this line of research, future studies should include longer MWS and examine the effects of MWS length on PF to further test the effect of MWS usage on PF found in the current study. Second, speech was elicited only through an argumentative speech task. Given that task types and task structures affect speaking performance, including lexis and fluency (e.g., Foster & Tavakoli, 2009), future research should examine the generalizability of the current findings by manipulating task demands. Third, as one of the reviewers pointed out, the speech samples used for standard setting during rater training included only high- and low-fluency samples and lacked medium-fluency ones, which could have influenced the raters' understanding of fluency and rating consistency. Finally, we included only advanced L2 users as raters. Future research could benefit from comparing L1 and L2 raters to examine the relationships between UF, PF, and MWS use.

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/S0272263125000051.

Competing interest

The authors declare none.

Appendix A. Summary of rater profiles

Appendix B. Questionnaire data about fluency rating

Appendix C. A screen during the rating

Appendix D. The fluency definition and the examples given in rater training

Footnotes

1 Unpruned text sample (“uhh there’s uh there are two reasons for these”) and its pruned version (“There are two reasons for these”).

2 The number of score-assigned bigrams was almost perfectly correlated with text length (rho = .979, p < .001). This makes sense because shorter n-grams are more likely to be detected in a reference corpus. The number of score-assigned trigrams, while still very strongly correlated with text length, was slightly more weakly correlated than that of bigrams (rho = .888, p < .001). This suggests slight variation between the number of detectable trigrams and text length. Supplementary material S7 includes the top and bottom five participants in PF ratings along with all n-gram index scores as well as text length.

1 One rater was excluded in this category because they preferred to remain anonymous.

References

Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2020). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods, 52, 388407. https://doi.org/10.3758/s13428-019-01237-xCrossRefGoogle ScholarPubMed
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67, 148. https://doi.org/10.18637/jss.v067.i01CrossRefGoogle Scholar
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at …: Lexical bundles in university teaching and textbooks. Applied Linguistics, 25, 371405. https://doi.org/10.1093/applin/25.3.371CrossRefGoogle Scholar
Boers, F., Eyckmans, J., Kappel, J., Stengers, H., & Demecheleer, M. (2006). Formulaic sequences and perceived oral proficiency: Putting a Lexical Approach to the test. Language Teaching Research, 10, 245261. https://doi.org/10.1191/1362168806lr195oaCrossRefGoogle Scholar
Boers, F., & Webb, S. (2018). Teaching and learning collocation in adult second and foreign language learning. Language Teaching, 51, 7789. https://doi.org/10.1017/S0261444817000301CrossRefGoogle Scholar
Boersma, P., & Weenink, D. (2020). PRAAT: Doing phonetics by computer (Version 6.1.44) [Computer program]. www.praat.orgGoogle Scholar
Bosker, H. R., Pinget, A. F., Quené, H., Sanders, T., & de Jong, N. H. (2013). What makes speech sound fluent? The contributions of pauses, speed and repairs. Language Testing, 30, 159175. https://doi.org/10.1177/0265532212455394CrossRefGoogle Scholar
Bosker, H. R., Quené, H., Sanders, T., & de Jong, N. H. (2014). The perception of fluency in native and nonnative speech. Language Learning, 64, 579614. https://doi.org/10.1111/lang.12067CrossRefGoogle Scholar
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284290. https://doi.org/10.1037/1040-3590.6.4.284CrossRefGoogle Scholar
Conklin, K., & Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics, 29, 7289. https://doi.org/10.1093/applin/amm022CrossRefGoogle Scholar
Davies, M. (2008). Corpus of Contemporary American English. https://www.english-corpora.org/coca/Google Scholar
de Jong, N. H. (2016). Predicting pauses in L1 and L2 speech: The effects of utterance boundaries and word frequency. International Review of Applied Linguistics in Language Teaching, 54, 113132. https://doi.org/10.1515/iral-2016-9993CrossRefGoogle Scholar
de Jong, N. H., Pacilly, J., & Heeren, W. (2020). Praat scripts to measure speed fluency and breakdown fluency in speech automatically. Assessment in Education: Principles, Policy, & Practice, 28, 456476. https://doi.org/10.1080/0969594X.2021.1951162Google Scholar
de Jong, N. H., Steinel, M. P., Florijn, A., Schoonen, R., & Hulstijn, J. H. (2013). Linguistic skills and speaking fluency in a second language. Applied Psycholinguistics, 34, 893916. https://doi.org/10.1017/S0142716412000069CrossRefGoogle Scholar
Durrant, P., & Schmitt, N. (2009). To what extent do native and non-native writers make use of collocations? International Review of Applied Linguistics in Language Teaching, 47, 157177. https://doi.org/10.1515/iral.2009.007CrossRefGoogle Scholar
Eguchi, M., & Kyle, K. (2020). Continuing to explore the multidimensional nature of lexical sophistication: The case of oral proficiency. The Modern Language Journal, 104, 381400. https://doi.org/10.1111/modl.12637CrossRefGoogle Scholar
Ellis, N. (2006). Language acquisition as rational contingency learning. Applied Linguistics, 27, 124. https://doi.org/10.1093/applin/ami038CrossRefGoogle Scholar
Ellis, N. C. (2012). Formulaic language and second language acquisition: Zipf and the phrasal teddy bear. Annual Review of Applied Linguistics, 32, 1744. https://doi.org/10.1017/S0267190512000025CrossRefGoogle Scholar
Ellis, N., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly, 42, 6178. https://doi.org/10.1515/CLLT.2009.003CrossRefGoogle Scholar
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text, 20, 2962. https://doi.org/10.1515/text.1.2000.20.1.29Google Scholar
Foster, P. (2001). Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers. In Bygate, M., Skehan, P. & Swain, M. (Eds.), Researching pedagogic tasks: Second language learning, teaching and testing. (pp. 7593). Longman.Google Scholar
Foster, P. (2020). Oral fluency in a second language: A research agenda for the next ten years. Language Teaching, 53, 446–461. https://doi.org/10.1017/CBO9781107415324.004
Foster, P., & Tavakoli, P. (2009). Native speakers and task performance: Comparing effects on complexity, fluency, and lexical diversity. Language Learning, 59, 866–896. https://doi.org/10.1111/j.1467-9922.2009.00528.x
Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring spoken language: A unit for all reasons. Applied Linguistics, 21(3), 354–375. https://doi.org/10.1093/applin/21.3.354
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus-based language learning research: Identifying, comparing, and interpreting the evidence. Language Learning, 67, 155–179. https://doi.org/10.1111/lang.12225
Garner, J., & Crossley, S. A. (2018). A latent curve model approach to studying L2 n-gram development. Modern Language Journal, 102, 494–511. https://doi.org/10.1111/modl.12494
Granger, S., & Bestgen, Y. (2014). The use of collocations by intermediate vs. advanced non-native writers: A bigram-based study. International Review of Applied Linguistics in Language Teaching, 52, 229–252. https://doi.org/10.1515/iral-2014-0011
Hougham, D., Clenton, J., & Uchihara, T. (2024). Disentangling the contributions of shorter vs. longer lexical bundles to L2 oral fluency. System, 121, 103243. https://doi.org/10.1016/j.system.2024.103243
Huensch, A. (2023). Effects of speaking task and proficiency on the midclause pausing characteristics of L1 and L2 speech from the same speakers. Studies in Second Language Acquisition, 45, 1031–1055. https://doi.org/10.1017/S0272263123000323
Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10, 135–159. https://doi.org/10.1080/15434303.2013.769545
Iwashita, N., Brown, A., McNamara, T., & O’Hagan, S. (2008). Assessed levels of second language speaking proficiency: How distinct? Applied Linguistics, 29, 24–49. https://doi.org/10.1093/applin/amm017
JASP Team. (2023). JASP (Version 0.17.3) [Computer software].
Kahng, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers: Temporal measures and stimulated recall. Language Learning, 64, 809–854. https://doi.org/10.1111/lang.12084
Kahng, J. (2020). Explaining second language utterance fluency: Contribution of cognitive fluency and first language utterance fluency. Applied Psycholinguistics, 41, 457–480. https://doi.org/10.1017/S0142716420000065
Kim, M., Crossley, S. A., & Kyle, K. (2018). Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. The Modern Language Journal, 102, 120–141. https://doi.org/10.1111/modl.12447
Koizumi, R., & In’nami, Y. (2013). Vocabulary knowledge and speaking proficiency among second language learners from novice to intermediate levels. Journal of Language Teaching and Research, 4, 900–913. https://doi.org/10.4304/jltr.4.5.900-913
Kormos, J. (2006). Speech production and second language acquisition. Lawrence Erlbaum.
Kormos, J., & Dénes, M. (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System, 32, 145–164. https://doi.org/10.1016/j.system.2004.01.001
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49, 757–786. https://doi.org/10.1002/tesq.194
Levelt, W. J. M. (1989). Speaking: From intention to articulation. The MIT Press.
Lüdecke, D., Ben-Shachar, M., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6, 3139. https://doi.org/10.21105/joss.03139
Magne, V., Suzuki, S., Suzukida, Y., Ilkan, M., Tran, M., & Saito, K. (2019). Exploring the dynamic nature of second language listeners’ perceived fluency: A mixed-methods approach. TESOL Quarterly, 53, 1139–1150. https://doi.org/10.1002/tesq.528
McGuire, M., & Larson-Hall, J. (2017). Teaching formulaic sequences in the classroom: Effects on spoken fluency. TESL Canada Journal, 34(3), 1–25. https://doi.org/10.18806/tesl.v34i3.1271
Millar, N. (2011). The processing of malformed formulaic language. Applied Linguistics, 32, 129–148. https://doi.org/10.1093/applin/amq035
Murakami, A. (2016). Modeling systematicity and individuality in nonlinear second language development: The case of English grammatical morphemes. Language Learning, 66, 834–871. https://doi.org/10.1111/lang.12166
Nagle, C. L., & Rehman, I. (2021). Doing L2 speech research online: Why and how to collect online ratings data. Studies in Second Language Acquisition, 43, 916–939. https://doi.org/10.1017/S0272263121000292
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied Linguistics, 32, 130–149. https://doi.org/10.1017/S0267190512000098
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In Richards, J. C. & Schmidt, R. W. (Eds.), Language and communication (pp. 191–226). Longman. https://doi.org/10.4324/9781315836027-12
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. https://doi.org/10.1111/lang.12079
Préfontaine, Y., & Kormos, J. (2016). A qualitative analysis of perceptions of fluency in second language French. International Review of Applied Linguistics in Language Teaching, 54, 151–169. https://doi.org/10.1515/iral-2016-9995
R Core Team. (2023). R: A language and environment for statistical computing (Version 4.3.1). R Foundation for Statistical Computing. https://www.r-project.org/
Rossiter, M. J. (2009). Perceptions of L2 fluency by native and non-native speakers of English. The Canadian Modern Language Review, 65, 395–412. https://doi.org/10.3138/cmlr.65.3.395
Saito, K. (2020). Multi‐ or single‐word units? The role of collocation use in comprehensible and contextually appropriate second language speech. Language Learning, 70, 548–588. https://doi.org/10.1111/lang.12387
Saito, K., Ilkan, M., Magne, V., Tran, M. N., & Suzuki, S. (2018). Acoustic characteristics and learner profiles of low-, mid-, and high-level second language fluency. Applied Psycholinguistics, 39, 593–617. https://doi.org/10.1017/S0142716417000571
Segalowitz, N. (2010). Cognitive bases of second language fluency. Routledge.
Segalowitz, N. (2016). Second language fluency and its underlying cognitive and social determinants. International Review of Applied Linguistics in Language Teaching, 54(2), 79–95. https://doi.org/10.1515/iral-2016-9991
Siyanova-Chanturia, A., & Pellicer-Sánchez, A. (2019). Formulaic language: Setting the scene. In Siyanova-Chanturia, A. & Pellicer-Sánchez, A. (Eds.), Understanding formulaic language: A second language acquisition perspective (pp. 1–16). Routledge.
Skehan, P., Foster, P., & Shum, S. (2016). Ladders and snakes in second language fluency. International Review of Applied Linguistics in Language Teaching, 54, 97–111. https://doi.org/10.1515/iral-2016-9992
Stengers, H., Boers, F., Housen, A., & Eyckmans, J. (2011). Formulaic sequences and L2 oral proficiency: Does the type of target language influence the association? International Review of Applied Linguistics in Language Teaching, 49, 321–343. https://doi.org/10.1515/iral.2011.017
Suzuki, S., & Kormos, J. (2020). Linguistic dimensions of comprehensibility and perceived fluency: An investigation of complexity, accuracy, and fluency in second language argumentative speech. Studies in Second Language Acquisition, 42, 143–167. https://doi.org/10.1017/S0272263119000421
Suzuki, S., & Kormos, J. (2023). The multidimensionality of second language oral fluency: Interfacing cognitive fluency and utterance fluency. Studies in Second Language Acquisition, 45, 38–64. https://doi.org/10.1017/S0272263121000899
Suzuki, S., Kormos, J., & Uchihara, T. (2021). The relationship between utterance and perceived fluency: A meta-analysis of correlational studies. The Modern Language Journal, 105, 435–463. https://doi.org/10.1111/modl.12706
Suzuki, S., & Révész, A. (2023). Measuring speaking and writing fluency: A methodological synthesis focusing on automaticity. In Suzuki, Y. (Ed.), Practice and automatization in second language research: Perspectives from skill acquisition theory and cognitive psychology (pp. 235–264). Routledge.
Suzuki, Y., Eguchi, M., & de Jong, N. (2022). Does the reuse of constructions promote fluency development in task repetition? A usage-based perspective. TESOL Quarterly, 56, 1290–1319. https://doi.org/10.1002/tesq.3103
Takizawa, K. (2024). What contributes to fluent L2 speech? Examining cognitive and utterance fluency link with underlying L2 collocational processing speed and accuracy. Applied Psycholinguistics, 45, 516–541. https://doi.org/10.1017/S014271642400016X
Tavakoli, P., Kendon, G., Mazhurnaya, S., & Ziomek, A. (2023). Assessment of fluency in the Test of English for Educational Purposes. Language Testing, 40, 607–629. https://doi.org/10.1177/02655322231151384
Tavakoli, P., Nakatsuhara, F., & Hunter, A. (2020). Aspects of fluency across assessed levels of speaking proficiency. Modern Language Journal, 104, 169–191. https://doi.org/10.1111/modl.12620
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure and performance testing. In Ellis, R. (Ed.), Planning and task performance in a second language (pp. 239–277). John Benjamins.
Tavakoli, P., & Uchihara, T. (2020). To what extent are multiword sequences associated with oral fluency? Language Learning, 70, 506–547. https://doi.org/10.1111/lang.12384
Uchihara, T., Eguchi, M., Clenton, J., & Saito, K. (2021). To what extent is collocation knowledge associated with oral proficiency? A corpus-based approach to word association. Language and Speech, 65, 311–336. https://doi.org/10.1177/00238309211013865
Wood, D. (2006). Uses and functions of formulaic sequences in second language speech: An exploration of the foundations of fluency. The Canadian Modern Language Review, 63, 13–33.
Wood, D. (2009). Effects of focused instruction of formulaic sequences on fluent expression in second language narratives: A case study. Canadian Journal of Applied Linguistics, 12, 39–58.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge University Press.
Wray, A. (2017). Formulaic sequences as a regulatory mechanism for cognitive perturbations during the achievement of social goals. Topics in Cognitive Science, 9, 569–587. https://doi.org/10.1111/tops.12257
Yan, X. (2020). Unpacking the relationship between formulaic sequences and speech fluency on elicited imitation tasks: Proficiency level, sentence length, and fluency dimensions. TESOL Quarterly, 54, 460–487. https://doi.org/10.1002/tesq.556
Yeldham, M. (2020). Does the presence of formulaic language help or hinder second language listeners’ lower-level processing? Language Teaching Research, 24, 338–363. https://doi.org/10.1177/1362168818787828
Yi, W., & Zhong, Y. (2024). The processing advantage of multiword sequences: A meta-analysis. Studies in Second Language Acquisition, 46, 427–452. https://doi.org/10.1017/S0272263123000542
Zhang, X., Zhao, B., & Li, W. (2023). N-gram use in EFL learners’ retelling and monologic tasks. International Review of Applied Linguistics in Language Teaching, 61, 939–965. https://doi.org/10.1515/iral-2021-0080
Table 1. Spearman’s correlations between n-gram and PF

Table 2. Summary of mixed-effects model comparison

Table 3. Final mixed-effects model to predict PF

Supplementary material: Takizawa and Suzuki supplementary material (File, 959.8 KB)