
Affect as a component of second language speech perception

Published online by Cambridge University Press:  17 February 2025

John Dylan Burton*
Affiliation:
Applied Linguistics and ESL, Georgia State University, Atlanta, GA, USA
Paula Winke
Affiliation:
Second Language Studies, Michigan State University, East Lansing, MI, USA
Corresponding author: John Dylan Burton; Email: [email protected]

Abstract

Growing evidence suggests that ratings of second language (L2) speech may be influenced by perceptions of speakers’ affective states, yet the size and direction of these effects remain underexplored. To investigate these effects, 83 raters evaluated 30 speech samples using 7-point scales of four language features and ten affective states. The speech samples were 2-min videorecordings from a high-stakes speaking test. An exploratory factor analysis reduced the affect scores to three factors: assuredness, involvement, and positivity. Regression models indicated that affect variables predicted spoken language feature ratings, explaining 18–27% of the variance in scores. Assuredness and involvement corresponded with all language features, while positivity only predicted comprehensibility scores. These findings suggest that listeners’ perceptions of speakers’ affective states intertwine with their spoken language ratings to form a visual component of second-language communication. The study has implications for models of L2 speech, language pedagogy, and assessment practice.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open data
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Introduction

Second language (L2) speakers are often evaluated by the world around them—implicitly or explicitly—on their capacity to communicate in social situations. Comprehensibility, speech fluency, or the accuracy of grammar or vocabulary may influence how proficient someone is perceived to be in their L2. These perceptions can then alter an individual's prospects of succeeding at various low- and high-stakes real-world tasks that require language. For example, if someone's language ability is perceived as "inadequate," they may face stigma or discrimination by not receiving raises, being passed over in interviews, or even being fired from their place of employment (Gluszek & Dovidio, 2010; Kang & Yaw, 2024). Language ability is important in L2 research settings as well, as learners are often grouped by proficiency profiles in research on a myriad of acquisitional processes. It is often assumed that linguistic features of speech drive perceptions of language ability, and there is a vast body of research supporting this claim. However, it is possible—indeed, even intuitive—that non-linguistic factors such as emotion and body language influence these perceptions as well, yet research in this area is currently scant.

In many settings, impressions about individuals' cognitive and psychological states and traits are influenced by nonverbal behaviour and its corresponding affective interpretation, such as how confident, anxious, happy, or engaged a person appears. A confident and engaged speaker may be perceived as able to handle a communicative situation more adeptly, for example, than a speaker who is relatively reticent and anxious, despite similarities in actual language produced (Jenkins & Parra, 2003). Hymes (1972, p. 283) theorized that affective dispositions (along with cognitive and volitive factors) are elements of "ability for use" that moderate how individuals deploy linguistic knowledge in communication. These factors could then partially determine an individual's assessed level of communicative competence in a particular scenario or sociolinguistic context. In nearly all situations, listeners detect affect displays, even if they are unaware of doing so (Bargh & Chartrand, 1999). This combined visual information has even been found to lead to score differences in language tests, where audiovisual speech samples are perceived as exhibiting higher comprehensibility or overall ability than the same individuals in audio-only samples (Carey & Szocs, 2024; Nakatsuhara et al., 2021).

If perceptions of affect form a visual component of listeners' mental model of L2 speech, these relationships should be documented, as decisions based on these perceived abilities may impact social outcomes and interpretations of research findings. This study sets out to measure listeners' perceptions of both speakers' affect and spoken language ability to determine the extent to which these non-linguistic and linguistic elements are related. Understanding these relationships may have important practical and theoretical implications for how people teach, assess, and model second language speech.

Background

Affect generally refers to the subjective experience of internal feelings, emotions, moods, dispositions, and temperaments, often visible through behaviour (Frijda, 1994). Individuals show affect through displays—that is, intentional or unintentional behaviours that convey an individual's orientations or reactions to stimuli and other people. Although communicated verbally through word choice or prosody, affect is primarily conveyed through a speaker's facial expressions (Kappas et al., 2013). A speaker may display a feeling, orientation, or stance at one moment as a state, or they may be disposed to react to particular experiences in a similar way over time as a trait element of their personality. Affect was historically thought to originate in individual cognition, whereby appraisals of environmental stimuli activate emotional responses (Arnold, 1960). Others have argued that due to the lack of one-to-one physiological correlates, affect may arise within socially distributed processes amongst individuals in interpersonal interactions (Parkinson, 1996). Furthermore, interpretations of affect may differ depending on the cultural background of speakers and listeners (Uchida et al., 2009) and may even be felt differently across cultures (Mesquita, 2022). These varying issues highlight methodological challenges in the measurement of affect; the social context should be carefully documented, and cultural variables should be controlled if possible.

Affect displays present distinct challenges for L2 speakers in cross-cultural settings because, to communicate effectively, they must "puzzle out unfamiliar behaviors, to identify what triggers which 'emotions' and when, to learn how particular 'emotions' may be managed and to discover what cues to pay attention to and how to interpret verbal and non-verbal 'emotion displays'" (Pavlenko, 2014, p. 247). Furthermore, the presence of internal affective responses may have a facilitating or limiting effect on the language they produce, determining "not only whether they even attempt to use language in a given situation, but also how flexible they are in adapting their language use to variations in the setting" (Bachman & Palmer, 1996, p. 65). Affect may then drive differential performance outcomes in learners based on their ever-changing reactions to social stimuli.

Indeed, trait-like affective factors have received much attention in the second language acquisition (SLA) literature because they have been found to relate to various pedagogical outcomes. Higher measures of anxiety (e.g., foreign language anxiety, test anxiety), for example, have been found to correspond with lower levels of language proficiency and course achievement (Botes et al., 2020; MacIntyre et al., 1997; Teimouri et al., 2019). In contrast, confidence (especially L2 self-confidence) has a positive relationship with language knowledge measures and performance outcomes (Ahammer et al., 2019; Clément, 1986; Noels & Clément, 1996; Stankov et al., 2012), perhaps given its close relationship with other individual differences such as motivation and willingness to communicate (MacIntyre et al., 1998). Positive emotions (e.g., enjoyment, happiness) may also drive or correspond with L2 acquisitional processes by "broadening a person's perspective and opening the individual to absorb the language" (MacIntyre & Gregersen, 2012, p. 193), with some evidence that these correspond with achievement gains as well (Botes et al., 2020; Dewaele & Li, 2022; Li et al., 2020). Nonetheless, the relationship between at least some of these measures and achievement or proficiency outcomes may be reciprocal or bidirectional, with growth in language skills leading to, for example, greater self-confidence in one's ability to communicate (Edwards & Roger, 2015; Li et al., 2020). In many of these studies, these affective traits were measured using student self-report surveys rather than external observations, which offer limited evidence about how listeners conceptualize spoken language ability considering dynamic differences in affect.

Although these affective traits lend important insight into longer-term outcomes in SLA such as course achievement, perceived affective states may impact outcomes as performances unfold in brief encounters, such as conversations or interviews. Engagement, for example, broadly defined as an individual's level of interest and evidence of participation in an event, is often regarded as an orientation composed of interacting cognitive, social, behavioural, and affective dimensions (Philp & Duchesne, 2016). Engagement, in particular social engagement, may also be critical to how successful individuals are in interactive tasks, leading to enhanced performance outcomes (Storch, 2008). Being perceived as engaged, as well as displaying confidence, attentiveness, interactiveness, and low anxiety, may factor into positive impressions of communicative effectiveness in spoken assessment settings (Ducasse & Brown, 2009; May, 2011; Sato & McNamara, 2019). Other affective phenomena, such as displaying warmth (e.g., friendliness, empathy) or competence, have been found to correspond with performance outcomes outside of SLA research in organizational settings (Cuddy et al., 2011).

Few empirical studies to date have measured the relationships between perceived state affect and language-related outcomes. Nagle et al. (2022) considered subjectively perceived measures of anxiety and collaborativeness (a measure of social engagement and interaction) on scores of L2 comprehensibility. In a dataset of short dyadic interactions, speakers and their interlocutors each repeatedly rated their partner's perceived affective states and comprehensibility. The authors found that high collaborativeness and low anxiety explained roughly 60% of the variance in comprehensibility scores, with small differences depending on the type of speaking task used. The authors hypothesized that high anxiety related to lower comprehensibility due to the visual cues anxious speakers display (e.g., lack of expressiveness, gaze aversion), which make speech processing more effortful for the listeners. In another study, Chong and Aryadoust (2023) investigated the relationship between automated measurements of seven basic emotions (happiness, sadness, anger, surprise, fear, disgust, and a neutral state) and language proficiency outcomes provided by four human raters using TOEFL integrated speaking rubrics. In this study, the variance in test scores attributable to emotions ranged from 8% to 34%, with the authors concluding that "only some part of the observed variance in test scores can or should be associated with the emotions of participants when they are using academic language" (p. 7431). However, the researchers did not interpret which emotions were associated more consistently with language-related outcomes.

The current study

Although the literature has shown a possible link between perceived affect and language-related outcomes, affect is often treated as a trait variable in individual differences studies, often measured through self-report data (Botes et al., 2020; Clément, 1986; Dewaele & Li, 2022; Li et al., 2020; MacIntyre et al., 1997; Noels & Clément, 1996; Teimouri et al., 2019). Studies that have operationalized affect as a state variable have often relied on verbal reports focusing on communication as a whole (Sato & McNamara, 2019) or interactional competence more narrowly (Ducasse & Brown, 2009; May, 2011). While these studies noted the affective component of L2 speech, little is known about the size of this relationship. What empirical research exists has investigated outcomes of comprehensibility (Nagle et al., 2022) and integrated skill ratings of spoken language proficiency (Chong & Aryadoust, 2023). A range of other components of language proficiency, such as fluency, grammar, and lexical range and accuracy, have yet to be considered. It is important to consider a broader range of variables given that affect may be related to some language skills more than others (Dewaele & Li, 2022).

Given the relatively understudied role affect may play in ratings of L2 speech, the research question (RQ) that guided this study was the following:

RQ: What are the relationships, if any, between ratings of L2 speech and ratings of affective phenomena?

Based on the literature reviewed, we hypothesized that affective perceptions such as confidence, engagement, and low anxiety would correspond with higher ratings in language outcomes broadly. Overall, though, our investigation was exploratory as we had few expectations regarding the direction and size of the effects.

Method

Participants

After obtaining ethics clearance through our university's institutional review board (ID number: STUDY00006268), we invited 100 participants to take part in this study. These participants were individuals with limited experience working in language-related settings; that is to say, linguistic laypeople (Sato & McNamara, 2019). The use of this population aligns with research in SLA that has investigated relationships with language ratings using novice raters (e.g., Isaacs & Trofimovich, 2012). These "naïve" listeners, rather than trained language educators or researchers, were chosen to provide observations that would better reflect how individuals in society incorporate affect into their language-related judgements. Because the cultural and linguistic backgrounds of speakers play a role in how facial nonverbal behaviour encodes affect and is decoded by listeners (Matsumoto & Hwang, 2016), participants' backgrounds were controlled to reduce this source of variance: all participants were first language (L1) English speakers and USA-born undergraduates at a large public university. The mean age of the listener-raters was 20.92 years (SD = 1.48), with a roughly balanced distribution of gender (52% female, 41% male; 6% indicated a gender identification other than male/female or preferred not to report). Just over one third of the participants (38%) indicated knowledge of an L2, and their indicated areas of study were diverse, as the invitation to participate in the study was sent to all undergraduates on campus.

The sample sizes for both the participant raters and the speech samples, described in the next section, were determined using a power analysis and a reading of the literature (Hox et al., 2018). Hox et al. (2018) suggested that a greater number of second-level grouping variables (raters in this study) would generally provide more power than a greater number of first-level cases (speech samples). A sample of at least 80 raters and 30 speech samples was determined to provide power at .95 to detect small to medium regression coefficients. We overrecruited participants because we anticipated that some novice raters would exhibit less reliable rating patterns. This turned out to be the case, as we found that 16 of the 100 participants exhibited undesirable rating qualities. Some participants showed outlying rating patterns measured using multivariate outlier analysis, and others showed misfit with a many-facet Rasch measurement model. We removed these 16 raters plus one rater who experienced technical problems. This served to optimize the quality of the dataset, leaving a final number of 83 raters for the current study. We report the data cleaning procedures in detail in Supplement 1 in the Open Science Framework (Burton & Winke, 2025).
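To give readers a concrete sense of how such a power figure can be estimated for a crossed raters-by-samples design, the sketch below runs a simulation-based check in R. It is only an illustration under simplifying assumptions: it approximates the ordinal ratings with a linear mixed-effects model, uses hypothetical effect and variance values, and is not the authors' actual procedure.

```r
# Minimal sketch of a simulation-based power check (assumption: a simplified
# linear mixed-effects approximation of the rating design, with hypothetical
# effect and variance values; not the authors' exact procedure).
library(lme4)

simulate_once <- function(n_raters = 80, n_samples = 30, beta = 0.3) {
  d <- expand.grid(rater = factor(1:n_raters), sample = factor(1:n_samples))
  d$affect <- rnorm(nrow(d))                     # standardized affect predictor
  rater_int  <- rnorm(n_raters, 0, 0.5)          # rater random intercepts
  sample_int <- rnorm(n_samples, 0, 0.5)         # sample random intercepts
  d$score <- beta * d$affect + rater_int[d$rater] + sample_int[d$sample] +
    rnorm(nrow(d), 0, 1)
  m <- lmer(score ~ affect + (1 | rater) + (1 | sample), data = d)
  abs(coef(summary(m))["affect", "t value"]) > 1.96   # rough significance check
}

set.seed(1)
mean(replicate(200, simulate_once()))            # proportion of significant runs = power
```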

Speech samples

We borrowed 30 speech samples recorded in a high-stakes, oral proficiency interview context from a test provider (International English Language Testing System [IELTS]; IELTS, n.d.) to be used as the basis for the affect and spoken language ability ratings. We signed non-disclosure agreements with IELTS to protect the dataset and the privacy of the test takers, and all raters signed non-disclosure agreements indicating that they would not remove or report on the samples they watched. The test takers in this dataset had indicated consent for their data to be used in research, but because of intellectual property restrictions on the testing data, we were unable to share video or audio from the test sessions with readers. High-stakes test recordings were optimal as many contextual elements were controlled: the recordings were conducted on a standardized laptop, environmental noise was controlled, the IELTS-employed examiners who conducted the interviews were trained, and the language of the test content was validated for the ability levels of the test takers and test purposes.

The recordings were taken from the same section of the speaking test (Part 3) for all samples. The section was a semi-scripted conversation between a trained IELTS examiner and the test taker on abstract issues and ideas (IELTS, n.d.). Although the interview topics and examiners varied, the samples contained a similar number of opportunities for the test takers to speak and clarify answers across the segments. We selected segments of approximately 2 mins from each test taker (M = 2 min, 11 s; SD = 14 s) from the beginning of this part of the test. The length of these segments varied slightly because we sought to trim the samples as close to the 2-min mark as possible when the test taker had reached a natural conclusion to their turn. We chose relatively short (2-min) samples from the longer test as the basis of rating so that participants would make quick, intuitive impressions rather than impressions based on a wider range of evidence. Using short samples also allowed us to keep the total experiment participation time for the volunteer raters at around 2 hours.

Table 1 displays information about the speech samples. The individuals in the 30 samples (labelled S01–S30) were all Chinese L2 English speakers. Their sample label indicated their test score ranking in comparison with other test takers. For example, sample S01 had the lowest proficiency test score on IELTS (3.5, approximately A2 on the Common European Framework of Reference [CEFR]; Council of Europe, 2020) and was thus labelled as 01. On the other hand, sample S30 tied for the highest score (6.5, approximately a high B2/low C1 on the CEFR), and was listed last (IELTS scores range from 0 to 9). The test takers appeared to be of a similar age (approximately college age), and the distribution of gender appeared to skew female (23 females, 7 males), but these demographic variables were not provided with the dataset. Proficiency scores that accompanied the dataset showed that test takers were evenly distributed across multiple ability levels. The last column in Table 1 shows the length of each of the speech samples. More information about how the dataset was compiled and trimmed is available in Supplement 2 (Burton & Winke, 2025).

Table 1. Speech samples

Rating scales

The rater participants in this study assigned subjective ratings on affect and spoken language ability using a set of 14 semantic differential scales. Semantic differential scales are simple, often one-word adjectives or nouns that are paired with a contrasting term set on two ends of the same scale, similar to a Likert scale (e.g., good/bad, interesting/uninteresting) (Ploder & Eder, 2015). These scales may have multiple points between terms for raters to indicate both the directionality and strength of an association with a term. Semantic differentials allow participants to make quick, intuitive decisions on their perceptions of stimuli that do not require rater training, unlike fully developed rating rubrics/scales (Snider & Osgood, 1969). Semantic differentials also allow participants to bring their understanding and interpretation of phenomena to a rating event rather than restricting these interpretations through training or scale wording. This subjective, relatively open approach was desirable for this study to capture more generalizable impressions of speech and affect.

We constructed a set of scales with categories describing spoken language performance and interpersonal perceptions of affective states. The language features were fluency, vocabulary, grammar, and comprehensibility, roughly corresponding to categories of language proficiency frequently of interest in the SLA literature. Comprehensibility, rather than pronunciation, was chosen as a category to align with ongoing research in this area (e.g., Isaacs & Trofimovich, 2012; Nagle et al., 2022). Participants were also likely to have a more intuitive understanding of what is comprehensible than of the constituent elements of pronunciation (e.g., phonemic control, prosody), and rating pronunciation directly may have led to an overt focus on accent. Even though we determined through piloting that these four categories were relatively straightforward for participants to understand, we provided brief definitions of the four categories to reduce ambiguity (e.g., broad/narrow definitions of fluency; Lennon, 1990). These brief definitions are listed in Table 2. More information about the piloting process and how the scales were developed is available in Burton (2023).

Table 2. Definitions of scale categories

We chose 10 categories of affect for participants to rate based on affective state perceptions frequently discussed in the literature on SLA, language testing, and psychology. These categories were: engagement, anxiety, confidence, warmth, attentiveness, expressiveness, happiness, competence, interactiveness, and attitude. Of particular interest were those states that have been discussed in relation to language proficiency or achievement, such as engagement (Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2011; Nakatsuhara et al., 2021; Philp & Duchesne, 2016; Sato & McNamara, 2019), anxiety (Botes et al., 2020; MacIntyre et al., 1997; Sato & McNamara, 2019; Teimouri et al., 2019), and confidence (Clément, 1986; Ducasse & Brown, 2009; Noels & Clément, 1996; May, 2011). We were also interested in how socio-affective perceptions of speakers might relate to language judgements, and for this reason, we included attentiveness (Ducasse & Brown, 2009; May, 2011) and interactiveness (relating to interactional competence; Galaczi & Taylor, 2018; and to willingness to communicate; MacIntyre et al., 1999). In addition, we included a measure of expressiveness, as this has been frequently mentioned as a positive element in relation to proficiency judgements when referring to overall nonverbal behaviour (e.g., Jenkins & Parra, 2003; Neu, 1990), as well as happiness and attitude, relating to measures of enjoyment and positive psychology in the SLA literature (Botes et al., 2020; Dewaele & Li, 2022; Li et al., 2020; MacIntyre & Gregersen, 2012; MacIntyre et al., 2019). Finally, warmth and competence were chosen based on findings in psychology relating these perceptions to success in organizational settings (Cuddy et al., 2011). As opposed to the language scales, definitions were not provided for the affect-related adjectives, as raters were expected to bring their own interpretations of these variables to the study.

Each scale category was presented with its adjective and the related antonym on a 7-point scale, as shown in Figure 1. We used a 7-point scale to enhance measurement precision over scales with fewer categories (Simms et al., 2019). This also allowed a midpoint for cases where judgements were ambiguous. The polarity of the adjectives was reversed for half of the scales so that positive or negative associations were mixed on each side of the 7-point scale. This was to reduce survey acquiescence bias (e.g., straightlining) and to encourage participants to read each line carefully. The language scales were presented first and in the same order to establish the primacy of rating language. The affect scales were presented below the language scales in a random order for each speech sample rating. The random order was to prevent primacy of any of the affect categories, while also encouraging raters to pay close attention to the category they were rating each time. To determine the feasibility of the instruments and the design of the rating study, the scales were piloted with 25 participants (not included in this rater group) with a separate set of 10 speech samples collected in a prior study. A many-facet Rasch analysis of the pilot data indicated that each scale functioned well, with scale units ordered as intended with no misfit. The pilot participants indicated that the scales were straightforward to use when rating, and the number of scales was not an issue given the quick nature of the rating process. Specific details regarding the pilot data analysis are available in Burton (2023).

Figure 1. Rating scales.

Procedure

The videos and scales were incorporated into an online rating platform built in Qualtrics. An example of the system is presented in Supplement 3 (Burton & Winke, 2025). Raters were first introduced to the study, signed a consent form and a non-disclosure agreement, and were then allowed to review the scale categories and their meanings. Although language terms were defined briefly, affect terms were not, as semantic differentials generally allow users to bring their internal definitions of terms (Ploder & Eder, 2015). Participants were required to rate two practice videos using the scales to calibrate their orientations to spoken language proficiency. One video was of a highly proficient speaker, and the other was of a much less proficient speaker. After rating, participants were not provided with feedback on their scores, but rather with a general description of the language performances (not including affect), as it was desirable for participants to focus first and foremost on spoken language ability. The descriptions were framed in terms of the strength of language and communicative effectiveness. After completing the practice section, raters began the main study.

The study took place on two different days with a 24-hr gap between them. Each day featured 15 randomized speech samples for each participant. The rating was spread out to minimize fatigue, as each rating day took roughly 1 hr to complete (Day 1: M = 61 min, SD = 18 min; Day 2: M = 62 min, SD = 20 min). Participants conducted the ratings remotely, though they were instructed to choose a location that was quiet and free from distractions. The videorecordings of the speech samples were presented in a large format on a single Qualtrics page without the rating scales. Participants could not pause, stop, replay, or download the videos. Immediately after the video ended, raters were taken to a second page with the rating scales and instructed to rate the video. The stimuli and scales were presented on separate pages to reduce distractions (e.g., rating while listening but not watching), as we wanted the raters to pay close attention to the videos during the entirety of the performance. After the second day, raters completed a short follow-up survey that served to monitor any technical issues with the system.

Data analysis

When preparing the dataset for analysis, the polarity of the scale scores was realigned so that negative judgements (e.g., low comprehensibility, weak grammar, anxious) had an endpoint of 1, while positive judgements (e.g., comprehensible, strong grammar, at ease) aligned with an endpoint of 7. We first calculated polychoric correlations between the scales to determine associations across variables using the polychoric function in the psych package (version 2.0.8) in R. Polychoric correlations were the basis of analysis because Pearson correlations may attenuate relationships amongst ordinal or ordered categorical variables (Winke et al., 2023), and polychoric correlations provide a more accurate representation of the data (Holgado-Tello et al., 2010). We checked the stability of the correlation matrix against a multilevel Pearson correlation matrix of the same data (reported in Supplement 6), and we found that the single-level correlation matrix was robust for this analysis.
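As an illustration of these two preparation steps, the sketch below reverse-codes the flipped scales and computes the polychoric correlation matrix with the psych package. It is a minimal sketch, not the authors' code sheet: the data frame `ratings`, the column names, and the set of reversed items are placeholders.

```r
# Minimal sketch of scale realignment and polychoric correlations, assuming a
# long-format data frame `ratings` with one 1-7 ordinal column per scale
# (column names below are illustrative, not the authors' actual ones).
library(psych)   # polychoric()

scales   <- c("fluency", "vocabulary", "grammar", "comprehensibility",
              "engagement", "anxiety", "confidence", "warmth", "attentiveness",
              "expressiveness", "happiness", "competence", "interactiveness",
              "attitude")
reversed <- c("anxiety")   # hypothetical set of items whose polarity was flipped

# Realign polarity so that 1 = negative judgement and 7 = positive judgement
ratings[reversed] <- lapply(ratings[reversed], function(x) 8 - x)

# Polychoric correlations treat the 7-point ratings as ordered categories
poly <- polychoric(ratings[scales])
round(poly$rho, 2)         # correlation matrix later passed to the EFA
```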

Due to the large number of variables, we then ran exploratory factor analysis (EFA) to determine whether the dataset could be reduced for regression analysis. We ran EFA rather than PCA because component scores are less interpretable than factor scores (Tabachnick & Fidell, 2013), and we hypothesized that the variables would likely show a factor structure due to their semantically related nature (e.g., happy and warm share similar connotations). We first verified that assumptions were met for factor analysis. The relationships between variables were linear, and variance inflation factors were below 4, which satisfied the assumption of a lack of multicollinearity. The Bartlett test of sphericity (p < .001) and Kaiser–Meyer–Olkin measure of sampling adequacy (all values > .90) indicated factorability. We used parallel analysis to determine the number of factors rather than eigenvalues greater than 1, as parallel analysis tends to be less biased (Franklin et al., 1995). Parallel analysis indicated the presence of four factors. We then ran the exploratory factor analysis on the polychoric correlation matrix using maximum likelihood estimation and a promax rotation to produce a factor solution. We used the polychoric correlation matrix because all variables were ordinal (Holgado-Tello et al., 2010). We used an oblique rotation rather than an orthogonal rotation to allow factors to correlate.
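The following sketch shows how this factor-retention check and the EFA itself could be run with the psych package. It assumes the `ratings[scales]` data frame from the previous sketch and is illustrative rather than the authors' exact code.

```r
# Minimal sketch of the parallel analysis and EFA (assumption: the `ratings`
# data frame and `scales` vector from the previous sketch).
library(psych)

# Parallel analysis on polychoric correlations to choose the number of factors
fa.parallel(ratings[scales], cor = "poly", fa = "fa", fm = "ml")

# EFA on the polychoric matrix: maximum likelihood extraction, oblique (promax) rotation
efa <- fa(ratings[scales], nfactors = 4, fm = "ml", rotate = "promax", cor = "poly")
print(efa$loadings, cutoff = 0.30)   # pattern loadings
efa$Phi                              # factor intercorrelations (permitted by promax)
```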

Because the dataset included nested data (each participant rated the same 30 samples, thus each participant's scores would exhibit correlations), there was a risk that the factor structure may vary for each rater participant. We were unable to find suitable factor analytic methods that account for multilevel data by allowing random effects, and pursuing confirmatory factor analysis or structural equation modelling (in which this is possible) was beyond the scope of this study. However, to verify the invariance of the computed factor structure, we bootstrapped the factor analysis with 1,000 samplings of 50 participants from the total pool. The full procedure we followed is detailed in Supplement 6. We found that the factor solution was stable in 97.2% of the bootstrapped calculations, and thus we concluded that the factor structure was stable for this sample despite the multilevel nature of the dataset. Using factor analysis on the full dataset, we extracted factor scores using the Ten Berge method in the factor.scores function in R. The Ten Berge method, which is the default method in R, minimizes residuals, produces unbiased estimates of the factor loadings, and is one of the more interpretable methods of producing factor scores (Ten Berge & Kiers, 1991). This method represented the dataset better than orthogonal extraction methods such as the Anderson method, as those methods leave factors uncorrelated, which would have misrepresented the dataset. We then used the three affect-related factor scores in the following regression analysis as independent variables.
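A minimal sketch of the bootstrap stability check and the factor-score extraction is shown below. It assumes the objects from the earlier sketches plus a `rater` ID column in `ratings`, and it uses a crude stability criterion (each scale loading most strongly on the same factor as in the full-sample solution), which is an assumption rather than the authors' documented procedure in Supplement 6.

```r
# Minimal sketch of the bootstrap check (1,000 resamples of 50 raters) and
# Ten Berge factor scores; object names are carried over from earlier sketches.
library(psych)

set.seed(42)
same_structure <- replicate(1000, {
  keep <- sample(unique(ratings$rater), 50)            # draw 50 raters
  sub  <- ratings[ratings$rater %in% keep, scales]
  boot_efa <- fa(sub, nfactors = 4, fm = "ml", rotate = "promax", cor = "poly")
  # crude stability criterion (an assumption): each scale loads most strongly
  # on the same factor as in the full-sample solution
  all(apply(abs(boot_efa$loadings), 1, which.max) ==
        apply(abs(efa$loadings), 1, which.max))
})
mean(same_structure)                                   # proportion of stable solutions

# Factor scores from the full-sample solution using the Ten Berge method
fs <- factor.scores(ratings[scales], efa, method = "tenBerge")$scores
head(fs)                                               # used as predictors below
```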

To determine how the factor scores related to the language judgements, we built four cumulative logit mixed effects models with the language judgements as separate dependent variables. We used cumulative logit mixed effects models rather than multilevel generalized ordered logit models or multilevel ordinal probit regression because we were interested in the overall effects of the predictors rather than modelling the effects at individual thresholds. Cumulative logit models are more parsimonious and can aid in interpretability. The clmm function from the ordinal package (v.2019.12–10) in R was used for modelling. Random effects of both the rater and sample were entered into the models to account for these sources of variance. We entered all main effects at once, and we used a logit link with flexible thresholds for all models. We tested the model with main effects against the null model and against the same model with random effects removed to ensure that accounting for these sources of variance was meaningful. We applied Bonferroni corrections to the significance threshold to account for the four sets of analyses, with α = .0125. All assumptions for these models were met apart from the assumption of proportional odds, which could only be tested using Brant's tests (Brant, 1990) on models with random effects removed using polr. The proportional odds assumption held for some but not all of the main effects in each model. Harrell (2020) stated that in this case, a violation of the proportional odds assumption is not necessarily problematic as long as the focus of the study is to observe average odds ratios for main effects. This was indeed the case in this study, and we thus used the ordinal models instead of less parsimonious multinomial regression models.
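The sketch below illustrates what one of these four models (here, fluency) could look like with the ordinal package. It assumes a data frame `model_dat` combining the ordinal language ratings, the three factor scores, and rater/sample IDs; the variable names are placeholders, not the authors' actual ones.

```r
# Minimal sketch of one cumulative logit mixed-effects model (fluency outcome),
# assuming a data frame `model_dat` with hypothetical column names.
library(ordinal)   # clmm(), clm()

model_dat$fluency <- factor(model_dat$fluency, levels = 1:7, ordered = TRUE)

m_fluency <- clmm(
  fluency ~ assuredness + involvement + positivity +   # affect factor scores
    (1 | rater) + (1 | sample),                        # crossed random intercepts
  data = model_dat, link = "logit", threshold = "flexible"
)

# Null comparisons: intercept-only model, and a fixed-effects-only model
m_null1 <- clmm(fluency ~ 1 + (1 | rater) + (1 | sample), data = model_dat)
m_null2 <- clm(fluency ~ assuredness + involvement + positivity, data = model_dat)
anova(m_null1, m_fluency)   # likelihood-ratio test against the intercept-only model

summary(m_fluency)          # coefficients evaluated against Bonferroni alpha = .0125
```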

The code sheet, written in R, and the dataset are available for analysis in Supplement 4 and Supplement 5 in the Open Science Framework (Burton & Winke, 2025).

Results

The scales showed desirable score distributions, as shown in Table 3. Median scores were slightly more positive than negative for all the categories except grammar and anxiety. Figure 2 shows the distribution of the scale scores across the seven score categories, which indicates that participants tended to avoid the most negative end of the scale. These distributions also showed that the scales were used in different ways, as they have varied patterns. Table 3 also indicates the reliability of the scales. The scales were highly reliable according to Cronbach's alpha calculations (.97–.99), though this reliability is inflated by the large number of second-level observations (i.e., the many raters contributing scores). The intraclass correlation coefficient (ICC), an indicator of interrater consistency, showed a moderate to low amount of consistency, which was anticipated due to the participants' lack of expertise and rater training. Some of this inconsistency may also reflect variance inherent in affect perception, which may be more variable than relatively fixed language features. Notably, ICCs for language-related elements were generally higher, which suggests that raters had a stronger shared intuitive understanding of these characteristics.
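For readers interested in how these reliability summaries can be obtained, the short sketch below computes Cronbach's alpha and ICCs with the psych package. It assumes a wide matrix `fluency_wide` with one row per speech sample and one column per rater; the reshaping from long-format ratings to this matrix is omitted, and the object name is a placeholder.

```r
# Minimal sketch of the reliability summaries reported in Table 3 (assumption:
# `fluency_wide` is a samples-by-raters matrix of fluency ratings).
library(psych)   # alpha(), ICC()

alpha(fluency_wide)$total$raw_alpha   # Cronbach's alpha across raters
ICC(fluency_wide)                     # intraclass correlations (consistency/agreement)
```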

Table 3. Scale means, SDs, and reliability

Figure 2. Distribution of scale scores.

The polychoric correlations amongst the scales are presented in Table 4. All correlations were positive, ranging from medium (.40) to strong (≥ .60) (Plonsky & Oswald, 2014). Anxiety and attention correlated the weakest (.40), while fluency and vocabulary correlated the strongest (.85). Particular groups of scales tended to correlate strongly together, such as language elements and competence, features indicating presence (engagement, attention, interactiveness), and positive emotions (attitude, warmth, happiness). However, the ratings across similar features were not identical despite similar scale wording (e.g., happiness, warmth), as they did not exhibit collinearity, with statistical tests showing variance inflation factor (VIF) statistics lower than 4. These relationships suggested that further analyses might be more interpretable using factor scores rather than individual scale categories.

Table 4. Scale correlations

Parallel analysis indicated a 4-factor solution. Figure 3 is a graphical representation of the factor structure in a path diagram (produced using the psych package in R, v. 2.4.1). Although these diagrams are more common in confirmatory factor analysis, they can be useful in EFA to demonstrate relationships among factors graphically (Revelle, 2024). The left boxes represent the 14 scales, while the right circles are the factors that predict scale values. The constrained factor loadings (fixed to the strongest loading factor) are represented on the arrows from the latent factor to the scale. These represent the relative strength of the loadings rather than absolute loadings. The correlations between the factors are shown on the far right-hand side. The factor structure was anticipated from the correlation data, though we did not anticipate that competence would be more bound to language features than to affect features. In this model, we renamed the factors using terms that most aligned with their apparent meaning: language (fluency, vocabulary, grammar, comprehensibility, and competence), (self-)assuredness (confidence and anxiety), involvement (engagement, attention, and interactiveness), and positivity (happiness, warmth, attitude, and expressiveness). We then extracted factor scores of the affect factors to represent assuredness, involvement, and positivity for regression analysis as predictors of the original observed variables: fluency, grammar, vocabulary, and comprehensibility.

Figure 3. Path diagram of factor solution.

Relationships between affect and spoken language ability

The polychoric correlations between the factor scores and the four original language scores were all positive and moderate to strong, ranging from .48 to .72, as shown in Table 5. Each language feature was more strongly associated with assuredness and involvement than with positivity. These associations were stronger for fluency than for the other language scores. Positivity showed its strongest correlation with comprehensibility (.60). The weakest associations were with grammar.

Table 5. Polychoric correlations between factor scores and language scores

We built four sets of mixed effects ordinal regression models to determine which of the factor score reductions of the affect variables had the greatest impact on each score category after taking into account the variance attributable to each participant and each speech sample. In these models, the model with main effects fit better than the null model with no predictors (null model 1) and the model with main effects but with random effects removed (null model 2), as shown in Table 6. The models of fluency (Table 7), vocabulary (Table 8), and grammar (Table 9) showed similar patterns in their main effects. In each of these models, only assuredness and involvement were significant predictors of each language score, with the strongest associations between assuredness and fluency, β = .81, odds ratio = 2.25, and involvement and fluency, β = .79, odds ratio = 2.22. These odds ratios indicate that the odds of a one-category increase on the 7-point scale in fluency, vocabulary, and grammar were roughly two times greater with 1-point increases in perceptions of assuredness and involvement. These models explained roughly a fifth to a quarter of the variance according to Nagelkerke's pseudo R², fluency = .27, vocabulary = .23, grammar = .18.
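As an illustration of how these quantities can be derived from a fitted model, the sketch below converts the fluency model's coefficients into odds ratios and computes a Nagelkerke pseudo R². It carries over the hypothetical objects `m_fluency`, `m_null1`, and `model_dat` from the earlier sketch and is not the authors' exact computation.

```r
# Minimal sketch: odds ratios and Nagelkerke pseudo R^2 for the fluency model
# (assumption: m_fluency, m_null1, and model_dat from the earlier clmm sketch).
exp(coef(m_fluency)[c("assuredness", "involvement", "positivity")])  # odds ratios

n   <- nrow(model_dat)
ll0 <- as.numeric(logLik(m_null1))        # intercept-only model with random effects
ll1 <- as.numeric(logLik(m_fluency))
r2_cs <- 1 - exp((2 / n) * (ll0 - ll1))   # Cox & Snell pseudo R^2
r2_n  <- r2_cs / (1 - exp((2 / n) * ll0)) # Nagelkerke rescaling
r2_n
```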

Table 6. Tests of model fit

Table 7. Fluency model

Note: p < .0125.

Table 8. Vocabulary model

Note: p < .0125.

Table 9. Grammar model

Note: p < .0125.

The model for comprehensibility, shown in Table 10, differed slightly. In this model, all three main effects were significant, with somewhat smaller standardized coefficients and odds ratios. This model deviated from the previous models in that the relationship between involvement (rather than assuredness) and the outcome was the strongest, though only slightly (odds ratio for involvement = 1.78, assuredness = 1.66). As opposed to the previous three models, positivity was a significant, positive predictor of comprehensibility, β = .36, odds ratio = 1.44, showing that the odds of speakers being classified as one point easier to understand on the 7-point scale were about 1.44 times higher with 1-point increases in perceived displays of positive affect. The model for comprehensibility explained about one fifth of the variance in the comprehensibility scores, Nagelkerke's pseudo R² = .22.

Table 10. Comprehensibility model

Note: p < .0125.

Discussion

The goal of this study was to investigate whether and to what degree perceived affect relates to spoken language ability judgements. This study showed that listeners' perceptions of L2 speakers' extra-linguistic affect displays (e.g., emotions and social orientations) interweave with their judgements of L2 speech to a sizable degree. Correlations showed that all ten measures of affect (e.g., confidence, engagement, warmth) correlated positively with all four linguistic measures with varying levels of strength. This suggests that spoken language ability as conceived by linguistic laypeople is a complex construct that may consist of multiple nonverbal, contextual elements drawn from the visual world. If this is indeed true, it can partially explain differences in how language is perceived across modalities, where having access to audiovisual content over audio alone tends to result in stronger perceptions of comprehensibility and language proficiency (Carey & Szocs, 2024; Nakatsuhara et al., 2021). Nonetheless, upon closer inspection, not all measures of affect corresponded to language ratings equally, as there was nuance in which measures of affect were most likely to relate to certain domains of language.

Because the affect measurements tended to cluster together in their correlations, we extracted three factors to investigate broader relationships between affect and L2 speech. We named these factors assuredness (confidence and anxiety), involvement (engagement, attention, and interactiveness), and positivity (happiness, warmth, expressiveness, and attitude). Competence, contrary to our expectations, was perceived as a language judgement, which the raters may have used as a proxy measure of listening comprehension. We found that assuredness had the strongest relationship with fluency, vocabulary, and grammar scores; in these cases, when individuals are seen as being more confident and at ease (assured), they are more likely to be perceived as stronger in each of these three linguistic areas. This finding is largely in line with Clément (1986) and Noels and Clément (1996), who argued that confidence (especially self-confidence as reported by the speaker) was one of the strongest predictors of language proficiency. Likewise, it is also in line with the vast literature on anxiety, which has found that low anxiety may relate to positive proficiency or achievement outcomes (e.g., Botes et al., 2020; MacIntyre et al., 1997; Teimouri et al., 2019). Confidence is an affective stance that raters frequently observe and factor into positive evaluations of test-takers in the language testing literature as well (Jenkins & Parra, 2003; Neu, 1990; May, 2009, 2011). Given the close relationship confidence and anxiety have with cognitive, psychological, and personality elements (e.g., Stankov et al., 2012), raters may have drawn on nonverbal and affective cues to extrapolate information about the test takers' underlying cognitive fluency (ability to process language quickly and efficiently) and lexicogrammatical competence. Seeing an individual as anxious or less confident may have led raters to perceive that person as less proficient, resulting in lower scores being awarded. Likewise, seeing a confident performance could signal to raters that the speaker believed in their own abilities, thus leading to higher scores. Perceptions of confidence may be bidirectional, however; that is to say, people may find more confident and less anxious speakers to be more proficient overall, but more proficient individuals are likely to be perceived as more confident and at ease simply based on their stronger language skills (Edwards & Roger, 2015). What stands out in this study is that these relationships existed external to the speaker in the eyes of listeners rather than through self-reports of affect and language ability as in previous research.

Involvement, made up of cognitive, social, and perhaps behavioural engagement in the speaking scenario, also stood out as a strong predictor of the same three linguistic outcomes, though slightly less so than assuredness did. Raters may have found that speakers who were more attentive and interactive with the examiner were able to display a greater range of evidence of their spoken language ability. This may have also been the case during question breakdowns, where speakers may have shown more engagement by asking follow-up questions to repair the breakdown sequences. That these displays of involvement were associated with stronger perceived language in fluency, grammar, and vocabulary is supported by findings that engagement can lead to greater task success (Storch, 2008) as well as positive impressions of communicative and interactional competence (Ducasse & Brown, 2009; May, 2011; Sato & McNamara, 2019). Other studies have also found links between willingness to communicate, which may be closely related to involvement, and L2 (communicative) competence (Elahi Shirvan et al., 2019; Jin & Lee, 2022).

Interestingly, emotional components aligned with positivity were not components of the factor structure of involvement/engagement, despite past theorizations (Philp & Duchesne, 2016), though these two factors did correlate strongly (.80). Past research has found that foreign language enjoyment, which may manifest as positive affect in discrete scenarios, may correspond with achievement in an L2 (Botes et al., 2020; Dewaele & Li, 2022; Li et al., 2020). Similarly, one recent study found that smiling and laughing (behaviours closely related to positive affect) were associated with judgements of greater fluency with a correlation of .42 (Kim et al., 2024). Although the current study found similar strength in correlations with positive affect (fluency = .59, vocabulary = .54, grammar = .48), positivity did not emerge as a predictor of fluency, grammar, or vocabulary in any of the models in this study. One's own perceived enjoyment (as a longer-term trait) may be a motivating factor in overall language acquisition, but temporary displays of happiness or warmth in an interactive context may exert less of an effect on perceived ability, especially when other affective phenomena such as assuredness and involvement are considered.

The judgements of comprehensibility in this study showed somewhat different patterns from fluency, vocabulary, and grammar. Although both assuredness and involvement predicted comprehensibility as well, involvement emerged as the slightly stronger predictor of the two. Positivity, in contrast to the previous models, also predicted comprehensibility. All aspects of spoken performance as measured in this study benefitted from perceived assuredness and involvement, and this finding with comprehensibility is supported by past research. In Nagle et al. (2022), for example, both low anxiety and collaborativeness, a measure of social engagement, predicted comprehensibility outcomes in dyads. Novel in the current study is that positive affect was also a component of the variance in these scores. There is some support for this finding from recent literature on nonverbal behaviour, showing that looking away (possibly indicating content-related thinking), smiling, and backchanneling with head nods may correspond with comprehensibility judgements (Tsunemoto et al., 2022; Trofimovich et al., 2021), but the benefit derived from behaviours with a positive valence may only apply to learners with lower proficiency (Burton, 2024). Although features of speech prosody (such as intonation) may exhibit variable relationships with comprehensibility across ability levels (Huensch & Nagle, 2021; Kang et al., 2010; Munro & Derwing, 1999; Sereno et al., 2016; Trofimovich & Isaacs, 2012), intonational contours have been found to relate strongly to certain vocal emotions, such as positivity (Larrouy-Maestri et al., 2024; Rodero, 2011). It is possible that paralinguistic prosodic features indirectly enhanced comprehensibility through perceived positive affect along with nonverbal facial behaviours. Given that the broader construct of engagement is also made up of components that relate to positive affect (Philp & Duchesne, 2016), this may indicate that the relative ease of understanding of speakers is partially comprised of how pleasant, interactive, collaborative, and at ease speakers appear to listeners, in addition to the myriad linguistic factors that have been documented (Crowther et al., 2016; Isaacs & Trofimovich, 2012). Overall, these various findings suggest that a wide range of linguistic and extra-linguistic factors interact when listeners are decoding second-language speech.

We have speculated on the mechanisms that may drive the relationships of assuredness and involvement with these measures, but how positive affective variables impact ease of understanding is less clear. One possible explanation could reside within the literature on affective or emotional contagions (Elfenbein, 2014; Smirnov et al., 2019). Affective contagions are emotions that are contracted by an interactant when they unconsciously and automatically converge on an undirected emotion (an emotional display without a clear source). In other words, the feeling spreads from speaker to listener. In this context, when a listener sees a speaker exhibiting positive affect (and possibly other affective orientations such as engagement), research suggests that some of that affect transfers to the listener, making them feel more invested in paying attention and listening to the speaker. In other words, the speaker's stances may inspire a willingness to listen in their interactant. Listeners who become more willing to listen to a particular speech sample may likewise find it easier to understand because of the increased effort they invest in decoding the speech. Being perceived as easier to understand could then have corresponding benefits for how other aspects of language, such as vocabulary and grammar, are perceived by listeners.

Implications

This study has theoretical implications for how spoken language ability is conceptualized. Models of language ability or communicative competence (e.g., Bachman & Palmer, 1996; Canale & Swain, 1980) describe second-language communication almost entirely in terms of how linguistic components of speech and writing convey meaning in sociocultural contexts. Affect and behaviour are generally not considered to be major components in these models apart from minor compensatory roles in strategic competence (Canale & Swain, 1980). Models of interactional competence (e.g., Galaczi & Taylor, 2018) are more inclusive of behaviour and affect, especially their role in turn-taking and conversation management, but these models do not explain how affect may closely interact within a broader communicative system. Hymes's (1972) original concept of communicative competence, however, allowed for affective orientations to interact with language ability through the ability for use, or the functional ability of an individual to deploy their linguistic competence. The findings in this study that assuredness and involvement closely overlap with judgements across all rated categories of speech appear to support a model that includes ability for use, suggesting that these models may need closer inspection in the future. Nonetheless, the exact mechanisms of how affect relates to the various competencies in such a model need more attention before further conclusions are drawn.

These results also have implications for research and practice. For one, although constructs such as comprehensibility have received ever more attention in recent literature, the complex relationship between affect and ease of understanding complicates the measurement of comprehensibility whenever visual content is included. Researchers may need to consider or control for elements of engagement (e.g., backchanneling, mutual gaze) and positive affect (e.g., smiling, laughing), as these elements appear able to enhance the perceived comprehensibility of second language speech. In terms of pedagogy, there is a question of whether behavioural elements of positive affect should be taught or encouraged in the language classroom in the belief that this could result in beneficial outcomes for the speaker. This may be true in the cultural context of this study (the United States), where such social and psychological behaviours are common both in service encounters, such as dining out, and in sensitive or high-stakes communicative contexts (health care settings, negotiations with law enforcement, diplomatic negotiations, or even asking for homework not to be counted as late). However, it may also be the case that the detection of stilted (forced or unnaturally formal) affect could backfire. Studies comparing natural and forced affect could reveal whether there is any pedagogical value in encouraging certain behaviours or affect displays.

For language assessment practice, this study raises key considerations about the fairness of including or ignoring affect in proficiency evaluations in which test takers are visible to examiners. One would have serious doubts about the fairness of a test if, for example, a candidate who smiled less received a lower score. Language tests are stressful and marked by unequal distributions of power between the examiner and test taker, which could very well result in more serious-appearing behaviour. Positive affect in this study, however, did not predict changes in fluency, grammar, or vocabulary scores when other measures of affect were considered, which suggests that this particular threat to fairness may be limited. Even though this study did not find that positive affect related to these linguistic measures, certain populations may exhibit different patterns of behaviour or affect depending on group or individual differences. For example, research has shown that culture mediates how nonverbal behaviour is encoded by speakers and decoded by listeners (Matsumoto & Hwang, Reference Matsumoto, Hwang, Matsumoto, Hwang and Frank2016), and affect itself is a culturally constructed phenomenon (Mesquita, Reference Mesquita2022). If behavioural norms in one culture (e.g., avoiding eye contact with superiors) are seen as less appropriate in another (e.g., where eye contact is considered polite), effects may arise when individuals from these cultural backgrounds perceive each other’s speech. This could result in unfair or biased test scores if certain groups receive higher scores simply because their cultural norms match those expected by the raters. Likewise, some individuals exhibit neurological differences that may conflict with what is seen as “standard” in the broader population. Test takers with autism, for example, may avert their gaze more and produce repetitive motions (American Psychiatric Association, 2013) while producing speech that is identical or equivalent to that of neurotypical test takers. In test-taker populations that include neurodiverse individuals or groups from different cultures, investigations of behavioural bias in test scores are critical. If bias is found, steps should be taken to ameliorate it; for neurodiverse test takers, this may mean providing accommodations or modifications to how the test is rated to ensure fairness.

The fairness of considering assuredness and involvement in test scores, as this study showed, is more complex. If assuredness and language proficiency are indeed bidirectionally related, and if assuredness is partially a cognitive mechanism (e.g., Stankov et al., Reference Stankov, Lee, Luo and Hogan2012), being perceived as more confident or less anxious may in fact reveal something about an individual’s L2 ability in some cases. Testing situations are anxiety-producing, though, and feeling anxious or having lower confidence may have more to do with a person’s personality than with underlying ability. Showing nonverbal evidence of engagement, attention, and interactiveness, however, is an important skill in effective L2 communication, especially within subdomains such as interactional competence (Galaczi & Taylor, Reference Galaczi and Taylor2018; Plough et al., Reference Plough, Banerjee and Iwashita2018) or goal-directed communicative effectiveness (Morreale et al., Reference Morreale, Spitzberg and Barge2013). Displaying this type of affect alone is not enough to overcome very low proficiency, but it may help facilitate intercultural encounters. Nevertheless, while these affective responses may provide implicit information for raters about underlying spoken language ability, scale development and rater training programs should be investigated to ensure fairness in these measurements as well, especially when working with diverse test-taking populations (Randez & Cornell, Reference Randez and Cornell2023).

Limitations

This study naturally comes with several limitations. Notably, using untrained, naïve raters results in a large amount of variance in how outcomes are perceived, and part of this variance may be due to differing internal definitions of each of the language categories. For example, even though a narrow definition of fluency was provided to the participants to differentiate it from the broader definition generally used in society (Lennon, Reference Lennon1990), follow-up qualitative data showed that some participants oriented to fluency as a general measure of language proficiency (see Burton, Reference Burton2023). The same held for the term comprehensibility, which a few participants discussed in relation to comprehension. These varying internal representations of facets of language could have skewed or attenuated some of the relationships found with the affect variables. However, modelling raters as a random effect in the mixed effects models helps account for these varying patterns and thus strengthens the inferences from the regression models reported in this study. Future work should consider whether these effects persist in pools of trained language professionals using more descriptive rating scales.
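To make the role of the rater random effect concrete, the following is a minimal sketch in R of the kind of model described above. It is illustrative only: the data frame and column names (ratings, score, assuredness, involvement, positivity, rater, speaker) are hypothetical, and the sketch uses the ordinal package’s clmm() function for 7-point ordinal outcomes rather than reproducing the exact specification reported in this study.

# Minimal illustrative sketch (not the study's exact specification):
# a cumulative link mixed model for 7-point ratings, with crossed
# random intercepts absorbing rater- and speaker-level variation.
library(ordinal)

ratings$score <- factor(ratings$score, ordered = TRUE)  # response must be an ordered factor

m <- clmm(score ~ assuredness + involvement + positivity +
            (1 | rater) + (1 | speaker),
          data = ratings, link = "logit")

summary(m)  # fixed effects for the affect factors; variance components for raters and speakers

Because each rater contributes multiple ratings, the rater intercepts absorb individual severity or leniency and idiosyncratic interpretations of the scale categories, so the fixed effects for the affect factors are estimated against within-rater variation rather than being confounded with it.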

Similarly, even though pilot testing showed sufficient separability in the constructs being measured, there is always a risk of raters marking diverse rating criteria too similarly, known as the halo effect. The halo effect can blur the boundaries between otherwise distinct constructs. This type of effect may be responsible for at least part of the variance shared in the correlations reported in this study, though we note that the distinct clustering patterns in the factor models indicate that the participants viewed these criteria as sufficiently different. Future work could have raters evaluate affect and language ability in separate, counterbalanced rating sessions to determine whether these relationships still hold when the constructs are observed on different occasions. As with the previous point, training raters on more descriptive language ability criteria would also reduce potential halo effects and permit stronger inferences about the rated constructs.
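As one way to illustrate the kind of separability check described above, the sketch below runs a parallel analysis and an exploratory factor analysis on polychoric correlations using the psych package (Revelle, Reference Revelle2024). It is a sketch under assumptions: the data frame items and its columns are hypothetical stand-ins for the rated criteria, and the calls are not presented as the exact analysis pipeline used in this study.

# Illustrative check of whether rated criteria cluster into distinct factors.
# Distinct clustering of affect and language items would argue against a
# strong halo effect collapsing them into a single dimension.
library(psych)

fa.parallel(items, fa = "fa", cor = "poly")                      # suggested number of factors
efa <- fa(items, nfactors = 3, rotate = "oblimin", cor = "poly") # oblique EFA on polychoric correlations
print(efa$loadings, cutoff = 0.30)                               # inspect the loading pattern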

Finally, a limitation and simultaneously a strength of the study was the controlled nature of the backgrounds of both the L1 raters (US-born English speakers) and the L2 test takers (Chinese-born speakers of one or more Chinese varieties). Controlling for the participants’ backgrounds helped to isolate the effects of nonverbal behaviour and limited covariance due to culture (although regional variation in affect displays and their interpretation may have been present in the dataset). These controlled backgrounds, however, did not allow us to test whether the impact of assuredness, involvement, and positivity would remain invariant across different L1 and L2 groups. There is some indication from the literature that it may not (Uchida et al., Reference Uchida, Townsend, Rose Markus and Bergsieker2009; although cf. Tsunemoto et al., Reference Tsunemoto, Lindberg, Trofimovich and McDonough2022). We do not know, for example, whether L1 American listeners would demonstrate the same relationships between affect and the L2 speech of European, South American, or African test takers. Similarly, we do not know whether different L1 groups from, for example, the United Kingdom, India, Australia, or any of the growing spheres of English as a lingua franca would demonstrate the same relationships when watching and listening to the same L2 groups. This is a complex area of study, so isolating and identifying these effects in various L1–L2 groupings is a promising avenue for future research.

Conclusion

Second language communication is a complex, multifaceted construct. With increasing research tapping into the relationship between L2 linguistic outcomes and various features perceived in the visual world (Carey & Szocs, Reference Carey and Szocs2024; Jenkins & Parra, Reference Jenkins and Parra2003; Nakatsuhara et al., Reference Nakatsuhara, Inoue and Taylor2021; Trofimovich et al., Reference Trofimovich, Tekin and McDonough2021; Tsunemoto et al., Reference Tsunemoto, Lindberg, Trofimovich and McDonough2022), a picture is beginning to emerge of language ability as inherently multimodal in the range of features that influence its perception. This has a wide range of implications, especially for the measurement of language ability, as the constructs that underlie these measurements largely do not account for how learners leverage affect to convey meaning. For SLA research that aims to focus purely on linguistic outcomes, affective phenomena could introduce noise into statistical estimates. On the other hand, for those interested in more holistic interpretations of how well learners can communicate, not accounting for affect (by, for example, considering only audio recordings or transcriptions) could disadvantage learners because their entire communicative repertoire would not be considered. More research is needed to continue exploring the impact of these variables and the mechanisms that influence language perception. Ultimately, this is a question of fairness, as greater understanding can inform more accurate and precise decisions about learners that determine their access to opportunities in the real world.

Data availability statement

The experiment in this article earned an Open Data badge for transparent practices. The data are available on the Open Science Framework at https://doi.org/10.17605/OSF.IO/2FTVJ (Burton & Winke, 2025).

References

Ahammer, A., Lackner, M., & Voigt, J. (2019). Does confidence enhance performance? Causal evidence from the field. Managerial and Decision Economics, 40(6), 704–717. https://doi.org/10.1002/mde.3038
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). https://doi.org/10.1176/appi.books.9780890425596
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
Bargh, J. A., & Chartrand, T. L. (1999). The unbearable automaticity of being. American Psychologist, 54(7), 462–479. https://doi.org/10.1037/0003-066X.54.7.462
Botes, E., Dewaele, J.-M., & Greiff, S. (2020). The power to improve: Effects of multilingualism and perceived proficiency on enjoyment and anxiety in foreign language learning. European Journal of Applied Linguistics, 8(2), 1–28. http://doi.org/10.1515/eujal-2020-0003
Brant, R. (1990). Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics, 46(4), 1171–1178. https://doi.org/10.2307/2532457
Burton, J. D. (2023). The role of nonverbal behavior and affect on ratings of second language proficiency [Doctoral dissertation, Michigan State University]. ProQuest Dissertations and Theses Global.
Burton, J. D. (2024). Evaluating the impact of nonverbal behavior on language ability ratings. Language Testing, 41(4), 729–758. https://doi.org/10.1177/02655322241255709
Burton, J. D., & Winke, P. (2025). Affect as a component of second language speech perception (OSF 2FTVJ, Version V1) [Data set]. Open Science Framework. https://doi.org/10.17605/OSF.IO/2FTVJ
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47. https://doi.org/10.1093/applin/I.1.1
Carey, M. D., & Szocs, S. (2024). Revisiting raters’ accent familiarity in speaking tests: Evidence that presentation mode interacts with accent familiarity to variably affect comprehensibility ratings. Language Testing, 41(2), 290–315. https://doi.org/10.1177/02655322231200808
Chong, J. J. Q., & Aryadoust, V. (2023). Investigating the effect of multimodality and sentiments on speaking assessments: A facial emotional analysis. Education and Information Technologies, 28, 7413–7436. https://doi.org/10.1007/s10639-022-11478-7
Clément, R. (1986). Second language proficiency and acculturation: An investigation of the effects of language status and individual characteristics. Journal of Language and Social Psychology, 5(4), 271–290. https://doi.org/10.1177/0261927X8600500403
Council of Europe. (2020). Common European framework of reference for languages: Learning, teaching, assessment. Companion volume. Council of Europe. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989
Crowther, D., Trofimovich, P., & Isaacs, T. (2016). Linguistic dimensions of second language accent and comprehensibility: Nonnative listeners’ perspectives. Journal of Second Language Pronunciation, 2(2), 160–182. https://doi.org/10.1075/jslp.2.2.02cro
Cuddy, A. J. C., Glick, P., & Beninger, A. (2011). The dynamics of warmth and competence judgments, and their outcomes in organizations. Research in Organizational Behavior, 31, 73–98. https://doi.org/10.1016/j.riob.2011.10.004
Dewaele, J.-M., & Li, C. (2022). Foreign language enjoyment and anxiety: Associations with general and domain specific English achievement. Chinese Journal of Applied Linguistics, 45(1), 23–48. https://doi.org/10.1515/cjal-2022-0104
Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction. Language Testing, 26(3), 423–443. https://doi.org/10.1177/0265532209104669
Edwards, E., & Roger, P. S. (2015). Seeking out challenges to develop L2 self-confidence: A language learner’s journey to proficiency. TESL-EJ, 18(4), 1–24. http://tesl-ej.org/pdf/ej72/a3.pdf
Elahi Shirvan, M., Khajavy, G. H., MacIntyre, P. D., & Taherian, T. (2019). A meta-analysis of L2 willingness to communicate and its three high-evidence correlates. Journal of Psycholinguistic Research, 48, 1241–1267. https://doi.org/10.1007/s10936-019-09656-9
Elfenbein, H. A. (2014). The many faces of emotional contagion: An affective process theory of affective linkage. Organizational Psychology Review, 4(4), 326–362. https://doi.org/10.1177/2041386614542889
Franklin, S. B., Gibson, D. J., Robertson, P. A., Pohlmann, J. T., & Fralish, J. S. (1995). Parallel analysis: A method for determining significant principal components. Journal of Vegetation Science, 6(1), 99–106. https://doi.org/10.2307/3236261
Frijda, N. H. (1994). Varieties of affect: Emotions and episodes, moods, and sentiments. In Davidson, R. J. (Ed.), The nature of emotion: Fundamental questions (pp. 59–67). Oxford University Press.
Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816
Gluszek, A., & Dovidio, J. F. (2010). Speaking with a nonnative accent: Perceptions of bias, communication difficulties, and belonging in the United States. Journal of Language and Social Psychology, 29(2), 224–234. https://doi.org/10.1177/0261927X09359590
Harrell, F. (2020, September 20). Violation of proportional odds is not fatal. Statistical Thinking. https://www.fharrell.com/post/po
Holgado-Tello, F. P., Chacón-Moscoso, S., Barbero-García, I., & Vila-Abad, E. (2010). Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Quality & Quantity, 44, 153–166. https://doi.org/10.1007/s11135-008-9190-y
Hox, J., Moerbeek, M., & van de Schoot, R. (2018). Multilevel analysis: Techniques and applications (3rd ed.). Routledge. https://doi.org/10.4324/9781315650982
Huensch, A., & Nagle, C. (2021). The effect of speaker proficiency on intelligibility, comprehensibility, and accentedness in L2 Spanish: A conceptual replication and extension of Munro and Derwing (1995a). Language Learning, 71(3), 626–668. https://doi.org/10.1111/lang.12451
Hymes, D. (1972). On communicative competence. In Duranti, A. (Ed.), Linguistic anthropology: A reader (pp. 53–73). Blackwell.
IELTS. (n.d.). IELTS. https://ielts.org
Isaacs, T., & Trofimovich, P. (2012). Deconstructing comprehensibility: Identifying the linguistic influences on listeners’ L2 comprehensibility ratings. Studies in Second Language Acquisition, 34(3), 475–505. https://doi.org/10.1017/S0272263112000150
Jenkins, S., & Parra, I. (2003). Multiple layers of meaning in an oral proficiency test: The complementary roles of nonverbal, paralinguistic, and verbal behaviors in assessment decisions. The Modern Language Journal, 87(1), 90–107. https://doi.org/10.1111/1540-4781.00180
Jin, S., & Lee, H. (2022). Willingness to communicate and its high-evidence factors: A meta-analytic structural equation modeling approach. Journal of Language and Social Psychology, 41(6), 716–745. https://doi.org/10.1177/0261927X221092098
Kang, O., & Yaw, K. (2024). Social judgement of L2 accented speech stereotyping and its influential factors. Journal of Multilingual and Multicultural Development, 45(4), 544–551. https://doi.org/10.1080/01434632.2021.1931247
Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94(4), 554–566. https://doi.org/10.1111/j.1540-4781.2010.01091.x
Kappas, A., Krumhuber, E., & Küster, D. (2013). Facial behavior. In Hall, J. A. & Knapp, M. L. (Eds.), Nonverbal communication (pp. 131–165). De Gruyter. https://doi.org/10.1515/9783110238150.131
Kim, Y. L., Liu, C., Trofimovich, P., & McDonough, K. (2024). Is nonverbal behavior during conversation related to perceived fluency? TESOL Journal. Article e795. Advance online publication. https://doi.org/10.1002/tesj.795
Larrouy-Maestri, P., Poeppel, D., & Pell, M. D. (2024). The sound of emotional prosody: Nearly 3 decades of research and future directions. Perspectives on Psychological Science. Advance online publication. https://doi.org/10.1177/17456916231217722
Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language Learning, 40(3), 387–417. https://doi.org/10.1111/j.1467-1770.1990.tb00669.x
Li, C., Dewaele, J.-M., & Jiang, G. (2020). The complex relationship between classroom emotions and EFL achievement in China. Applied Linguistics Review, 11(3), 485–510. http://doi.org/10.1515/applirev-2018-0043
MacIntyre, P. D., & Gregersen, T. (2012). Emotions that facilitate language learning: The positive-broadening power of the imagination. Studies in Second Language Learning and Teaching, 2(2), 193–213. http://doi.org/10.14746/ssllt.2012.2.2.4
MacIntyre, P. D., Babin, P. A., & Clément, R. (1999). Willingness to communicate: Antecedents & consequences. Communication Quarterly, 47(2), 215–229. https://doi.org/10.1080/01463379909370135
MacIntyre, P. D., Gregersen, T., & Mercer, S. (2019). Setting an agenda for positive psychology in SLA: Theory, practice, and research. The Modern Language Journal, 103(1), 262–274. https://doi.org/10.1111/modl.12544
MacIntyre, P. D., Noels, K., & Clément, R. (1997). Biases in self-ratings of second language proficiency: The role of language anxiety. Language Learning, 47(2), 265–287. https://doi.org/10.1111/0023-8333.81997008
MacIntyre, P., Clément, R., Dörnyei, Z., & Noels, K. (1998). Conceptualising willingness to communicate in a L2: A situational model of L2 confidence and affiliation. Modern Language Journal, 82(4), 545–562. https://doi.org/10.1111/j.1540-4781.1998.tb05543.x
Matsumoto, D., & Hwang, H. C. (2016). The cultural bases of nonverbal communication. In Matsumoto, D., Hwang, H. C., & Frank, M. G. (Eds.), APA handbook of nonverbal communication (pp. 77–101). American Psychological Association. https://doi.org/10.1037/14669-004
May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly, 8(2), 127–145. https://doi.org/10.1080/15434303.2011.565845
Mesquita, B. (2022). Between us: How cultures create emotions. W. W. Norton & Company.
Morreale, S. P., Spitzberg, B. H., & Barge, J. K. (2013). Communication: Motivation, knowledge, skills (3rd ed.). Peter Lang. https://doi.org/10.3726/978-1-4539-0257-8
Munro, M. J., & Derwing, T. M. (1999). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 49(s1), 285–310. https://doi.org/10.1111/0023-8333.49.s1.8
Nagle, C. L., Trofimovich, P., O’Brien, M. G., & Kennedy, S. (2022). Beyond linguistic features: Exploring behavioral and affective correlates of comprehensible second language speech. Studies in Second Language Acquisition, 44(1), 255–270. https://doi.org/10.1017/S0272263121000073
Nakatsuhara, F., Inoue, C., & Taylor, L. (2021). Comparing rating modes: Analyzing live, audio, and video ratings of IELTS speaking test performances. Language Assessment Quarterly, 18(2), 83–106. https://doi.org/10.1080/15434303.2020.1799222
Neu, J. (1990). Assessing the role of nonverbal communication in the acquisition of communicative competence in L2. In Scarcella, R. C., Andersen, E. S., & Krashen, S. D. (Eds.), Developing communicative competence in a second language (pp. 121–138). Newbury House.
Noels, K. A., & Clément, R. (1996). Communicating across cultures: Social determinants and acculturative consequences. Canadian Journal of Behavioural Science/Revue canadienne des sciences du comportement, 28(3), 214–228. https://doi.org/10.1037/0008-400X.28.3.214
Parkinson, B. (1996). Emotions are social. British Journal of Psychology, 87(4), 663–683. https://doi.org/10.1111/j.2044-8295.1996.tb02615.x
Pavlenko, A. (2014). The bilingual mind: And what it tells us about language and thought. Cambridge University Press. https://doi.org/10.1017/CBO9781139021456
Philp, J., & Duchesne, S. (2016). Exploring engagement in tasks in the language classroom. Annual Review of Applied Linguistics, 36, 50–72. https://doi.org/10.1017/S0267190515000094
Ploder, A., & Eder, A. (2015). Semantic differential. In Wright, J. D. (Ed.), International encyclopedia of the social & behavioral sciences (2nd ed., pp. 563–571). Elsevier. https://doi.org/10.1016/B978-0-08-097086-8.03231-1
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. https://doi.org/10.1111/lang.12079
Plough, I., Banerjee, J., & Iwashita, N. (2018). Interactional competence: Genie out of the bottle. Language Testing, 35(3), 427–455. https://doi.org/10.1177/0265532218772325
Randez, R. A., & Cornell, C. (2023). Advancing equity in language assessment for learners with disabilities. Language Testing, 40(4), 984–999. https://doi.org/10.1177/02655322231169442
Revelle, W. (2024). psych: Procedures for psychological, psychometric, and personality research. R package version 2.4.1. https://CRAN.R-project.org/package=psych
Rodero, E. (2011). Intonation and emotion: Influence of pitch levels and contour type on creating emotions. Journal of Voice, 25(1), e25–e34. https://doi.org/10.1016/j.jvoice.2010.02.002
Sato, T., & McNamara, T. (2019). What counts in second language oral communication ability? The perspective of linguistic laypersons. Applied Linguistics, 40(6), 894–916. https://doi.org/10.1093/applin/amy032
Sereno, J., Lammers, L., & Jongman, A. (2016). The relative contribution of segments and intonation to the perception of foreign-accented speech. Applied Psycholinguistics, 37(2), 303–322. https://doi.org/10.1017/S0142716414000575
Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566. https://doi.org/10.1037/pas0000648
Smirnov, D., Saarimäki, H., Glerean, E., Hari, R., Sams, M., & Nummenmaa, L. (2019). Emotions amplify speaker-listener neural alignment. Human Brain Mapping, 40(16), 4777–4788. https://doi.org/10.1002/hbm.24736
Snider, J. G., & Osgood, C. E. (Eds.). (1969). Semantic differential technique: A sourcebook. Aldine.
Stankov, L., Lee, J., Luo, W., & Hogan, D. J. (2012). Confidence: A better predictor of academic achievement than self-efficacy, self-concept and anxiety? Learning and Individual Differences, 22(6), 747–758. https://doi.org/10.1016/j.lindif.2012.05.013
Storch, N. (2008). Metatalk in a pair work activity: Level of engagement and implications for language development. Language Awareness, 17, 95–114. https://doi.org/10.1080/09658410802146644
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson.
Teimouri, Y., Goetze, J., & Plonsky, L. (2019). Second language anxiety and achievement: A meta-analysis. Studies in Second Language Acquisition, 41(2), 363–387. https://doi.org/10.1017/S0272263118000311
Ten Berge, J. M., & Kiers, H. A. (1991). A numerical approach to the approximate and the exact minimum rank of a covariance matrix. Psychometrika, 56, 309–315. https://doi.org/10.1007/BF02294464
Trofimovich, P., & Isaacs, T. (2012). Disentangling accent from comprehensibility. Bilingualism: Language and Cognition, 15(4), 905–916. https://doi.org/10.1017/S1366728912000168
Trofimovich, P., Tekin, O., & McDonough, K. (2021). Task engagement and comprehensibility in interaction: Moving from what second language speakers say to what they do. Journal of Second Language Pronunciation, 7(3), 435–461. https://doi.org/10.1075/jslp.21006.tro
Tsunemoto, A., Lindberg, R., Trofimovich, P., & McDonough, K. (2022). Visual cues and rater perceptions of second language comprehensibility, accentedness, and fluency. Studies in Second Language Acquisition, 44(3), 659–684. https://doi.org/10.1017/S0272263121000425
Uchida, Y., Townsend, S. S., Rose Markus, H., & Bergsieker, H. B. (2009). Emotions as within or between people? Cultural variation in lay theories of emotion expression and inference. Personality and Social Psychology Bulletin, 35(11), 1427–1439. https://doi.org/10.1177/0146167209347322
Winke, P., Zhang, X., & Pierce, S. (2023). A closer look at a marginalized test method: Self-assessment as a measure of speaking proficiency. Studies in Second Language Acquisition, 45(2), 416–441. https://doi.org/10.1017/S0272263122000079
Figures and tables

Table 1. Speech samples
Table 2. Definitions of scale categories
Figure 1. Rating scales.
Table 3. Scale means, SDs, and reliability
Figure 2. Distribution of scale scores.
Table 4. Scale correlations
Figure 3. Path diagram of factor solution.
Table 5. Polychoric correlations between factor scores and language scores
Table 6. Tests of model fit
Table 7. Fluency model
Table 8. Vocabulary model
Table 9. Grammar model
Table 10. Comprehensibility model