Essay writing remains a central component of high-stakes English language proficiency (ELP) exams, which determine whether test takers can study or work in English-medium contexts. It is therefore crucial to understand the specific linguistic features that expert raters attend to when evaluating these essays. One important indicator of lexical proficiency is the use of formulaic sequences, that is, any string of words that can be identified or usefully thought of as a single lexical unit (Siyanova-Chanturia & Pellicer-Sánchez, 2020). Examples include multiword verbs (e.g., come up with), idioms (e.g., under the weather), and collocations (e.g., sheepish grin). Such sequences are potentially even more important than single words in predicting text quality (Bestgen, 2017) and account for a substantial portion of expert-level production (Conklin & Schmitt, 2012; Siyanova-Chanturia & Martinez, 2015).
Of the many types of formulaic sequences, collocations possess a unique status in frameworks of lexical knowledge (e.g., Nation, 2013; Read, 2004), and their importance in writing proficiency is widely acknowledged (e.g., Crossley et al., 2015; Durrant, 2019). However, few studies have specifically and systematically examined the impact of different collocational features on expert raters’ judgments of essays. This study therefore investigated this topic, focusing on dimensions of collocational proficiency, specifically one aspect of collocational sophistication (collocate frequency) and collocational accuracy. Based on these data, the paper makes recommendations for incremental changes to curricula and rating rubrics.
Defining and identifying collocations
Simplistically, collocations are word partnerships that may consist of various linguistic patterns (Szudarski, 2023) such as verb+article+noun (e.g., break the spell) or adverb+adjective (e.g., utterly ridiculous). As such, they occupy an interesting intermediate space between lexis and syntax (Nattinger & DeCarrico, 1992), possessing both concrete (vocabulary-like) and abstract (grammar-like) qualities.
However, pinpointing exactly what constitutes a collocation depends on the approach taken; the two most common are the phraseological approach and the frequency-based approach. The phraseological approach uses syntactic, semantic, and pragmatic linguistic criteria (Henriksen, 2013; Lundell & Lindqvist, 2012). For example, a distinction might be whether a word combination is more compositional in nature, as in pay the bill, or more figurative, as in pay attention (Wolter, 2020). In contrast, the frequency-based approach treats the probability of co-occurrence of words as paramount (Henriksen, 2013). Such co-occurrence is often measured by Mutual Information (MI) and t-score, though numerous other measures also exist (e.g., Delta P and Log Dice). A typical convention is to consider word combinations with an MI score over 3 or a t-score over 2 to be collocations (Church & Hanks, 1990; Jiang, 2009), in conjunction with a minimum frequency threshold of 5 to 10 occurrences of the word combination in the corpus (e.g., Granger & Bestgen, 2014; Simpson-Vlach & Ellis, 2010).
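To illustrate these conventions, the following sketch (not the study’s code) derives MI and t-score from raw co-occurrence counts using the standard simplified bigram formulas; all counts shown are hypothetical.

```r
# Minimal sketch: MI and t-score from co-occurrence counts (hypothetical data).
mi_tscore <- function(f_pair, f_w1, f_w2, corpus_size) {
  expected <- (f_w1 * f_w2) / corpus_size     # co-occurrences expected by chance
  mi <- log2(f_pair / expected)               # MI: rewards exclusive pairings
  t  <- (f_pair - expected) / sqrt(f_pair)    # t-score: rewards frequent pairings
  c(MI = mi, t = t)
}

scores <- mi_tscore(f_pair = 120, f_w1 = 3500, f_w2 = 9800, corpus_size = 1e8)
scores["MI"] > 3 & scores["t"] > 2            # the conventional thresholds above
```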
It is possible to combine phraseological and frequency-based approaches by starting with computational extraction for frequency measures and then subsequently applying phraseological criteria (e.g., Laufer & Waldman, 2011; Naismith & Juffs, 2021). Adopting a combined approach recognizes two key elements of collocations: (1) the frequency with which the words occur together, and (2) the semantic link between the words. Both of these elements can be seen in the definition by Laufer and Waldman (2011, p. 648), which we adopt in the current study:
[Collocations are] habitually occurring lexical combinations that are characterized by restricted co-occurrence of elements and relative transparency of meaning.
In this conceptualization of collocation, “habitually occurring” combinations can be measured statistically with a frequency-based approach and “restricted co-occurrence and relative transparency of meaning” with a phraseological perspective.
Lexical proficiency and collocations
In its broadest sense, lexical proficiency is “an ability to apply both declarative and procedural lexical knowledge in real language use” (Lenko-Szymanska, 2019, p. 39). With respect to knowledge and use of collocations (i.e., collocational proficiency), research has shown that expert speakers and learners differ substantially (Granger & Bestgen, 2014; Siyanova-Chanturia & Sidtis, 2019). Learners overuse collocations that they know well (Granger, 1998; Laufer & Waldman, 2011), but underuse collocations more generally, both in quantity and range (Durrant & Schmitt, 2009; Tsai, 2015). These difficulties likely stem from a combination of factors, including collocations’ relative infrequency in input (Gyllstad & Wolter, 2016), the lack of a literal counterpart in the learner’s L1 (Macis & Schmitt, 2016), their lack of salience as linguistic items (Lee, 2019; Wolter, 2020), and how they were taught (Jiang, 2009; Siyanova-Chanturia & Spina, 2020). Collocational proficiency is, therefore, one factor that can distinguish among levels of proficiency (Ha, 2013; Lundell & Lindqvist, 2012).
Two important dimensions of lexical and collocational proficiency that have been shown to impact perceptions of text quality are sophistication and accuracy. While both dimensions are frequently considered in relation to the use of single words, they also apply to the use of formulaic sequences such as collocations. We now discuss sophistication and accuracy in relation to single words and collocations.
Lexical and collocational sophistication
Lexical sophistication commonly refers to the use of advanced or sophisticated words (Kim et al., 2018) that reflect the breadth and depth of lexical knowledge (Kyle & Crossley, 2015). By nature, lexical sophistication is multidimensional. For example, in Eguchi and Kyle’s (2020) framework, the construct includes rareness (frequency and dispersion), conceptual features (e.g., concreteness), distinctiveness, accessibility, and association measures of multiword units. Of these, rareness remains the most commonly investigated through the use of frequency-based measures related to the proportion of relatively advanced words produced in a text (Read, 2000). Here, we restrict our focus to frequency measures due to their relevance to the current study and usefulness for simultaneously considering both single words and collocations.
Numerous studies have found that indices of lexical sophistication correlate with human judgments of writing (e.g., Eguchi & Kyle, 2020; Kim et al., 2018; Lenko-Szymanska, 2019; Vögelin et al., 2019). For example, Lenko-Szymanska (2019) found that three frequency-based measures of lexical sophistication—the percentage of words beyond the 2000 most frequent words, the percentage of academic words, and the mean log frequency of content words—discriminated well between texts written by learners of different proficiency levels. Vögelin et al. (2019) likewise found a positive relationship between frequency-based sophistication and human ratings. Manipulating the lexical sophistication of texts, as measured by average word range, these researchers found that texts with greater lexical sophistication received significantly higher scores from teachers for vocabulary (η² = .348, p < .001) as well as for holistic quality (η² = .110, p < .05). Exploring the impact of both single-word and multiword lexical indices, Kim et al. (2018) found that models of lexical sophistication combining both types of frequency indices were the most predictive of scores of L2 writing (24.6% of variance) and lexical proficiency (31% of variance). Single-word indices related to the use of advanced words (including frequency, dispersion, and psycholinguistic properties), and multiword indices related to association and frequency measures.
To discuss frequency-based sophistication measures in relation to collocations, it is necessary to first differentiate between two types of collocational frequency. The first can be termed a collocation frequency approach: for example, in the Corpus of Contemporary American English (COCA; Davies, 2008), the lemma combination of needless and say (as in needless to say) occurs 3,735 times and has an MI of 4.37. As such, it has a lemma combination rank of 251 and can be considered a high-frequency collocation in comparison to other collocations. Alternatively, we can look at the lemmas individually, that is, through a collocate frequency approach, in which case say is certainly high frequency (lemma frequency = 4,096,416, lemma rank = 26), but needless is much lower frequency (lemma frequency = 4,942, lemma rank = 8,468). Thus, if a learner uses needless to say in an essay, should this collocation be considered evidence of low sophistication because it is a common collocation, or of high sophistication because it is a collocation containing an uncommon word?
These two views of collocational sophistication reflect the nature of collocations as simultaneously holistic chunks and also compositional strings of words. This dual view is supported by processing studies which have shown that formulaic sequences become increasingly prominent as single units through repeated use, but that they retain information about their parts (Öksüz et al., 2021; Wolter & Yamashita, 2018). Of the two, the collocation frequency approach is more common and perhaps more intuitive. And yet, findings regarding the relative importance of collocational frequency have been mixed. In a meta-analysis of 19 collocation studies, Durrant (2014) found that collocation frequency correlated only moderately with collocation knowledge; other important factors included semantic transparency and the amount of social engagement of learners. However, in studies by Garner and colleagues (Garner, 2022; Garner et al., 2019, 2020), the more proficient writers were observed to use lower-frequency collocations, for example, more sophisticated verb-noun collocations (Garner, 2022).
Studies using a collocate frequency approach have been more interested in collocations in relation to the individual collocates contained within them. For example, Ebrahimi (2017) investigated the collocational knowledge of Iranian EAP learners, specifically collocations composed of high-frequency words. Jiang (2009) focused on pedagogic materials for teaching collocations to Chinese learners and found that 93.6% of collocates belonged to the K1-2 frequency bands. González Fernández and Schmitt (2015) incorporated both approaches and looked at the link between frequency and productive collocation knowledge, but only for collocations whose collocates were in the K1-5 frequency bands. Echoing Durrant (2014), the study found only a weak relationship between collocation frequency and collocation knowledge. Several other studies reporting the association measure of MI have demonstrated that higher MI correlates with higher learner proficiency (e.g., Granger & Bestgen, 2014; Jiang et al., 2023; Paquot, 2018). Although the focus on MI in these works has been to investigate the degree of association in word pairs, note that word (or lemma) frequency is also part of the MI equation. As a result, low-frequency words often result in more exclusive combinations and consequently receive higher MI scores (Szudarski, 2023), indicating a relationship between MI, collocate frequency, and learner proficiency.
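This dependence on collocate frequency can be made explicit. In one common simplified form (ignoring the span adjustment used by some corpus tools), the MI equation is:

$$\mathrm{MI}(w_1, w_2) = \log_2 \frac{f(w_1, w_2) \cdot N}{f(w_1) \cdot f(w_2)}$$

where $f(w_1, w_2)$ is the co-occurrence frequency of the pair, $f(w_1)$ and $f(w_2)$ are the individual word frequencies, and $N$ is the corpus size. Because the individual frequencies sit in the denominator, a pair of rare words needs far fewer co-occurrences to reach a high MI than a pair of common words.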
This paper adopted the collocate frequency approach in order to classify collocations as low-, mid-, or high-frequency based on single-word frequency statistics from external corpora. In doing so, we were better able to compare the effects of frequency on text quality in relation to both single words and collocations containing those words. To our knowledge, no studies have yet applied the mid-frequency label to collocations, and the labels high-frequency and low-frequency have been used variably (e.g., Durrant & Schmitt, 2009; Yoon, 2016).
While word frequency is an interval variable, as evidenced in the studies above, it is commonly partitioned into frequency bands of 1,000 (K) words, or “K-bands.” Many authors have also suggested a three-way distinction between high-, mid-, and low-frequency lexical items (e.g., Naismith & Juffs, 2021; Vilkaitė-Lozdienė & Schmitt, 2020). In this format, common practice based on coverage statistics defines the K1-2 frequency bands as high-frequency, K3-9 as mid-frequency, and K10+ as low-frequency. There have been calls for other band sizes, for example, Kremmel’s (2016) suggestion of 500-item bands for K1-3, 1,000-item bands for K4-6, and 2,000-item bands for K7-10. Operationalizing frequency as 1,000-item bands does lose fine-grained information, but there are pedagogical and research advantages to establishing such categories. For example, for learners wishing to study in an L2 academic environment, identifying and learning mid-frequency lexis is particularly important (Nation & Anthony, 2013; Vilkaitė-Lozdienė & Schmitt, 2020) as it is essential for achieving sufficient coverage of academic texts (Laufer, 1989; Laufer & Ravenhorst-Kalovski, 2010). Knowing which lexis is mid-frequency therefore allows clear learning goals to be set that are easily accessible to learners, teachers, and materials developers.
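As a concrete illustration, the three-way classification can be expressed as a simple function of a lemma’s frequency rank (a sketch under the K1-2/K3-9/K10+ convention above; the ranks shown are hypothetical):

```r
# Sketch: map a lemma's frequency rank to its K-band and three-way level.
freq_level <- function(rank) {
  k <- ceiling(rank / 1000)           # K-band: ranks 1-1000 = K1, 1001-2000 = K2, ...
  if (k <= 2) "high" else if (k <= 9) "mid" else "low"
}
freq_level(26)      # a K1 lemma  -> "high"
freq_level(4500)    # a K5 lemma  -> "mid"
freq_level(12000)   # a K12 lemma -> "low"
```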
Lexical and collocational accuracy
Simply put, lexical accuracy is the ability to produce writing free from lexical errors. In general, there is a strong negative correlation between the number of errors and holistic ratings (Polio & Shea, 2014), and lexical errors have been found to occur more frequently than grammatical errors (Agustín Llach, 2011; Qian & Lin, 2020). Because lexical errors affect communication, they are highly prominent and are therefore judged more severely by readers and listeners (Ellis, 2008; Santos, 1988). The typical quantitative approach to accuracy is to count either error-free units, like T-units or clauses (e.g., Polio, 1997), or errors themselves (e.g., Linnarud, 1986). These counts can then be normalized using ratios such as the number of errors per word, per lexical word (typically nouns, adjectives, verbs, adverbs), or per 100 words.
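For example, the per-100-words normalization is simply the following (a trivial sketch with hypothetical counts):

```r
# Sketch: normalize a raw error count by text length.
errors_per_100 <- function(n_errors, n_words) 100 * n_errors / n_words
errors_per_100(n_errors = 9, n_words = 250)   # 3.6 lexical errors per 100 words
```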
With respect to collocations, “free from lexical errors” can refer to whether these combinations are acceptable and expected (Crossley et al., 2013). Collocational accuracy is especially important in academic writing as collocation misuse indicates a lack of academic expertise (Henriksen, 2013) and forces readers to decompose the collocations rather than process them fluently as single chunks (Howarth, 1998). Even if the meaning of the individual words is not obscured by how they are combined, collocation errors can still strain the reader through “lexical dissonance” (Hasselgren, 1994), increasing the processing burden (Millar, 2011).
Numerous studies have demonstrated the high prevalence of collocational errors at all proficiency levels of L2 English writing. For example, approximately 33% of the collocations investigated by Laufer and Waldman (2011) and 50% of those investigated by Nesselhauf (2005) were incorrect. There is also a strong case for the impact of collocational accuracy on human judgments of proficiency. In Crossley et al. (2015), collocational accuracy explained 84% of the variance in human judgments of the writing samples and was one of the three most predictive variables. In addition, the studies showing a positive relationship between collocation association measures (like MI) and proficiency can be considered indirect evidence of the importance of collocational accuracy, since lower MI may be indicative of higher rates of inappropriate word choice. It should be noted, however, that in Laufer and Waldman (2011), similar rates of collocational errors were seen at all proficiency levels. Other factors that may affect collocation accuracy rates include the definition of potential collocations, the types of collocations under investigation, and the L1s of the learners.
Human rating of writing
Thus far, we have discussed how statistical measures of single- and multiword lexical items correspond to language proficiency, without consideration of how language proficiency is measured. Commonly, assessments of writing proficiency rely on human raters’ variable and subjective perceptions of quality, characteristics that impact validity and reliability (Attali, 2016; Eckes, 2012). Despite these human factors, such assessment is still widely administered because it directly tests communicative language ability (Hamp-Lyons, 1990), in an approach where some “errors” are tolerated as part of overall communicative success rather than penalized, as in earlier thinking on testing.
The reasons for rater variability are legion. Assessing essays imposes a high cognitive demand (Eckes, 2012), and even looking at lexis alone, assigning a numeric score is a challenge (Fritz & Ruegg, 2013). Ratings can vary across raters (inter-rater reliability) or may “drift” for one rater across texts (intra-rater reliability). Raters may differ in severity/leniency because their perceptions of the importance of various criteria vary (Eckes, 2012; Goh & Ang-Aw, 2018; Lumley & McNamara, 1995). In addition, research on any potential advantage of rater experience is less clear (e.g., Cumming, 1990; Lim, 2011), though rater training can improve intra-rater reliability and adherence to rubrics (Brown, 2006; Hall & Sheyholislami, 2013). Statistical models such as Many-Facet Rasch Measurement models (MFRM; Linacre, 1989, 1994) can also be used to account for systematic rater severity/leniency (see McNamara et al., 2019).
Particularly relevant to this paper, think-aloud studies and retrospective comments after grading suggest that lexis has not been of primary consideration for some raters (Goh & Ang-Aw, 2018; Lumley & McNamara, 1995), even though evaluation rubrics may include a vocabulary category. However, using think-aloud protocols alters the thought process of the raters (Barkaoui, 2011; Lumley, 2005), so any conclusions in this regard must be considered tentative. Raters may also perceive longer texts to be of superior quality and give them higher ratings for that reason alone (Guo et al., 2013; Kyle et al., 2020; Linnarud, 1986). Text length can therefore “wash out” the predictive strength of other lexical variables (Crossley & McNamara, 2012) and should be controlled for.
Still, the relationship between lexical features and human judgments has long been a focus of writing assessment research, as exemplified by studies discussed in the previous section (Kim et al., 2018; Vögelin et al., 2019). Early investigations of lexical measures in L2 writing (e.g., Arnaud, 1984; Linnarud, 1986) established that features like lexical diversity, sophistication, and accuracy can distinguish proficiency levels. More recent research has investigated specific dimensions of lexical proficiency and their impact on assessments. For instance, Bestgen and Granger (2014) found that essays with more sophisticated collocations (measured by MI scores) received higher ratings; Lenko-Szymanska (2019) showed that raters attend to different aspects of lexical proficiency in their evaluations; Lu and Hu (2022) demonstrated that sense-aware lexical sophistication indices improved prediction of writing quality over traditional indices; and Monteiro et al. (2020) found that L2 lexical sophistication indices were significantly stronger predictors of holistic ratings than L1 benchmarks, explaining twice the variance. Studies like these highlight the importance of lexical features in human judgments and the many ways in which dimensions of lexical proficiency can be operationalized. Additionally, recent meta-analyses have examined the relationships between L2 writing performance and its internal and external correlates. Of relevance here, moderate correlations were found between L2 writing performance and lexical complexity (r = .295; Kojima & Kaneta, 2022) and L2 vocabulary knowledge (r = .489; Kojima et al., 2022). These meta-analytic findings reinforce the contribution of lexical features to assessments of L2 proficiency.
Two previous human rating studies are especially noteworthy in the context of this paper. First, Fritz and Ruegg (2013) focused on argumentative essays written under timed conditions. However, rather than analyzing a wide range of essays, a single “base” essay was used. The 32 content words in this base text were manipulated to create 27 total versions: low/mid/high versions of accuracy, diversity, and sophistication, that is, a 3×3×3 design. Twenty-seven experienced raters used four analytic scales to assess the essays, and these ratings were analyzed using analyses of variance (ANOVAs) to find the relationships between variables. The findings indicated that lexical accuracy significantly predicted ratings, F(2, 68) = 4.262, p = .013, though surprisingly diversity, F(2, 68) = .69, p = .933, and sophistication, F(2, 68) = 1.68, p = .194, did not. Importantly, the authors acknowledged certain limitations: experimental texts were mixed with “authentic” texts (which affected the ratings) and the operationalization of sophistication was somewhat problematic. From that study, this paper adopted several approaches to experimental control.
Second, Read and Nation (2006) investigated the vocabulary use of International English Language Testing System (IELTS) test takers. Their study analyzed 88 recordings of learners completing their Part 2 “long turns.” Speech with higher ratings contained a higher percentage of low-frequency vocabulary and, qualitatively, at the highest levels, was characterized by “mastery of colloquial or idiomatic expressions.” This result points to the importance of both single- and multiword lexical items to examiners, as well as the need for further research on IELTS lexical resource ratings.
Current study
The goal of the current study is to isolate and measure the contributions of collocational features to overall ratings of lexical resource quality in essays by comparing quantitative text metrics, expert ratings, and the rationales for these ratings. Three research questions are addressed:
1. To what extent are expert ratings of lexical proficiency of essays impacted by

   a. the number of high-/mid-/low-frequency lemmas (a dimension of lexical sophistication)?

   b. whether or not the high-/mid-/low-frequency lemmas are part of collocations (a dimension of collocational proficiency)?

2. To what extent are expert ratings of lexical proficiency of essays impacted by the number of accurate and inaccurate collocations (a dimension of collocational accuracy)?

3. What aspects of lexical proficiency do the expert raters consciously attend to, as reflected in their comments, and do these include aspects of collocational proficiency?
Investigating these questions is significant for enhancing our understanding of the linguistic features that expert raters attend to when evaluating L2 writing. As we have seen, one line of previous research has demonstrated the multidimensional nature of lexical sophistication and its impact on judgments of proficiency (e.g., Kim et al., 2018; Lenko-Szymanska, 2019). Other lines of inquiry have shown how aspects of collocational sophistication, such as the use of collocations with higher MI, correlate with higher learner proficiency (e.g., Granger & Bestgen, 2014; Paquot, 2018). Understanding these relationships is particularly relevant for writing assessment, where current descriptors, including those used by IELTS, vary in how explicitly they address different aspects of collocational proficiency.
The present study aimed to extend the lines of inquiry above by examining how variations in single-word and collocational features impact expert ratings of lexical proficiency, using a carefully controlled dataset of written texts designed to systematically manipulate these features. Given the centrality of collocation knowledge in frameworks of lexical proficiency (e.g., Nation, 2013), it is crucial to determine the extent to which collocational features impact raters’ judgments and to better understand how the impact of lemma frequency on expert judgments is mediated by placement within collocations. Furthermore, by comparing raters’ qualitative comments with frequency-based measures of lexical/collocational sophistication and collocational accuracy, this study can provide insight into the alignment between the features that raters consciously notice and those that statistically predict their scores. The findings thus have implications for identifying areas of convergence and divergence between theoretical constructs of collocational proficiency and raters’ actual practices, providing a more nuanced understanding of the linguistic features that shape expert judgments of L2 writing quality.
Methods
This study used an embedded design in which both quantitative data (the ratings) and qualitative data (the reflections) were collected simultaneously (Creswell & Plano Clark, 2011), with the reflections enhancing the completeness of the data. First, raters accessed a link for viewing/downloading the rating scales and task prompts. Next, they rated three texts at different Common European Framework of Reference for Languages (CEFR) levels. After rating each text, the raters answered follow-up questions about their assessment decisions. Finally, raters provided personal metadata. To ensure the validity of the findings, a large number of expert raters were used; the texts rated were identical in length and topic; and rater effects were controlled for through the use of MFRM models.
Participants (raters)
Because the target population is raters who evaluate high-stakes tests, participation in the study was limited to current or former IELTS examiners. All IELTS examiners must meet minimum requirements of substantial (typically 3+ years) experience teaching adults, an undergraduate degree, a recognized TEFL/TESOL qualification or degree in education, and expert spoken and written English proficiency. IELTS examiners undergo a comprehensive training and certification process, as well as subsequent monitoring and standardization. To recruit the raters, snowball sampling was used, a type of convenience sampling that is a well-established practical option for recruiting members of hard-to-reach groups (Valdez & Kaplan, 1998).
To determine the required number of participants, an a priori power analysis was performed using G*Power (version 3.1.9.6; Faul et al., 2007). For linear multiple regression, to detect a small effect size (d = .2; Cohen, 1988) with a power of .8 and α of .05, 40 raters were required given the experimental design. In total, there were 48 respondents, though one was excluded as they appeared to be a non-examiner based on their responses. This participant pool size had the desired effect of allowing multiple raters for each script.
Table 1 presents the raters’ demographic information. These data represent mature examiners in terms of age (all over 30) and experience, both TESOL experience (94% with > 10 years) and examining experience (85% with > 2 years). Of the 47 participants, 19% were examiner trainers, who are held to a higher standard of reliability. Most participants had experience with a wide range of first languages (60%) and proficiency levels (70%). The participants were also highly educated, with most possessing graduate degrees (85%) and additional TESOL certification (89%).
Table 1. Rater information (IELTS examiners)
Instruments
Three initial texts formed the basis for the texts in the survey. These were IELTS Task 2 responses (IELTS, n.d.-a), selected because they are publicly available and accompanied by examiner ratings and comments. All three texts responded to the same task about the relationship between socioeconomic status and problem-solving ability. The overall scores of the three texts were Bands 4, 6.5, and 8, representing a wide range of proficiency levels on the IELTS scale of 1 to 9. These scores correspond to CEFR levels B1, B2, and C1, respectively. Although originally handwritten, the texts were typed for practicality and standardization purposes, and only two orthographic errors were corrected in the B1 text. There was no background information on the writers.
To control for the issue of text length, the three original texts were normalized to 250 words through careful manual alterations, endeavoring to maintain all stylistic aspects of the original texts. Throughout the process of text manipulation, precise quantitative analysis of the texts was carried out to ensure that 15 key lexical, syntactic, and collocational metrics remained within 5% of the original texts. Collocations were identified using a combined phraseological and frequency-based approach: a checklist of phraseological criteria was first applied by both authors.¹ Criteria based on frequency statistics from the COCA corpus (Davies, 2008) were then used to settle disagreements and to filter potential collocations (n > 5, MI > 3, t-score > 2). Each collocation occurred once per text. Likewise, inaccurate collocations were first identified by the authors as word combinations where a collocation was expected, but the word combination did not meet the checklist criteria, standing out as an unnatural/awkward/unclear choice. These inaccurate collocations were then confirmed not to meet the frequency statistics listed above.
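The frequency-based filtering step can be sketched as follows; the data frame, its column names, and all statistics are hypothetical stand-ins for the COCA-derived values:

```r
# Sketch: retain only author-identified candidates that pass the thresholds.
candidates <- data.frame(
  pair    = c("break the spell", "positive school"),
  n       = c(97, 2),                 # co-occurrence frequency in the corpus
  MI      = c(6.2, 0.8),
  t_score = c(9.1, 0.4)
)
subset(candidates, n > 5 & MI > 3 & t_score > 2)   # kept as accurate collocations
```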
Once normalized, a subsequent step of text manipulation was carried out to more evenly space accurate and inaccurate collocational use across the texts (Table 2). To ensure that these manipulations did not affect the initial IELTS scores/CEFR levels, a pilot study was carried out to rate the initial, normalized, and final texts. Overall scores were calculated as an average of the four analytic bands and rounded down to the nearest .5, following IELTS practices. These scores indicate that at all three proficiency levels, the normalization and manipulation processes did not greatly impact the average ratings, with all changes within half a band, maintaining the original CEFR levels (Figure 1). An analysis of the analytic bands revealed similar patterns.
Table 2. Collocational density of text versions
Figure 1. Overall ratings comparison of initial, normalized, and final texts.
Using the three final texts, 30 different versions were created (10 versions per final text) by changing up to approximately 12% of the words.² These manipulations were intended to influence several collocational indices relating to sophistication (Bestgen & Granger, 2014; Granger & Bestgen, 2014) and accuracy:
1. Mean MI: to measure association of collocations containing infrequent words (formula from Davies, 2008)

2. Mean t-score: to measure association of collocations containing high-frequency words (formula from Evert, 2009)

3. Absent bigrams: the proportion of bigrams absent from the reference corpus (see the sketch following this list)

4. Accurate collocations and collocation errors: number per 100 words, based on error types in Granger (2003) and Wanner et al. (2013). Collocations present in the task prompt were not counted.

5. Collocation frequency bands: to determine whether each collocation contained only high-frequency lemmas (K1-2), a mid-frequency lemma (K3-9), or a low-frequency lemma (K10-16). Frequency bands were determined by ranking COCA lemma frequencies (available at https://www.wordfrequency.info/purchase.asp). The proportion of each of these types of collocations in the text was then calculated.
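As referenced in index 3 above, the absent-bigrams index can be sketched as follows (the function and object names are ours; in the study the reference list would come from COCA):

```r
# Sketch: proportion of a text's bigrams not attested in the reference corpus.
absent_bigram_prop <- function(tokens, ref_bigrams) {
  bigrams <- paste(head(tokens, -1), tail(tokens, -1))  # adjacent word pairs
  mean(!bigrams %in% ref_bigrams)
}
toks <- c("this", "strongly", "suggests", "a", "solution")
absent_bigram_prop(toks, c("this strongly", "suggests a", "a solution"))  # 0.25
```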
The overarching selection criterion for the indices was the “meaningfulness and interpretability of the information they encapsulate as well as their theoretical motivation” (Lenko-Szymanska, 2019, p. 161). These indices correlate with human judgments of proficiency and align with the lexical subconstructs evidenced in the IELTS Task 2 band descriptors (IELTS, n.d.-b). As a result of the text manipulations, the text versions differed in terms of four variables: proficiency level, collocational (collocate) frequency, non-collocational lemma frequency, and collocational accuracy. Table 3 presents a matrix of all 30 text versions.
1. Proficiency level: Three CEFR proficiency levels, B1 (intermediate), B2 (upper-intermediate), and C1 (advanced).

2. Collocational (collocate) frequency: Three levels, High, Mid, and Low frequency. To change the levels, accurate collocations were replaced based on the lemma frequencies of the collocates from COCA (Davies, 2008) and verified as “basic” or “advanced” lemmas in the PELIC learner corpus (Juffs et al., 2020; Naismith et al., 2022), for example, high = good example (K1) → mid = concrete example (K4).

3. Non-collocational frequency: The same characteristics of collocational frequency apply to non-collocational frequency. The only difference was that the words altered were not part of collocations, for example, mid = nevertheless (K4) → low = unbelievably (K14).

4. Collocational accuracy: Two accuracy levels, Low and High. At each proficiency level, there were six additional inaccurate collocations in the Low level. For example, Text 1 is at B1 level, has 12 accurate collocations from two frequency bands, and has 18 collocations with errors.
Table 3. Characteristics of text versions
To assess the texts analytically, raters used the IELTS public writing scales (IELTS, n.d.-b). For each of the four categories—Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR), and Grammatical Range and Accuracy (GRA)—there were bands from 1 to 9 with descriptive criteria. For this study, the 9-point band scale was further divided into three sublevels (e.g., 5-, 5, 5+) so that there was a “strong” and “weak” possibility within each band (a practice used in Jarvis, 2013) to increase the range of possible ratings. For data analysis, these sublevels were converted to decimals, so that, for example, 5- → 5.0, 5 → 5.3, and 5+ → 5.7. In addition to the analytic scales, the raters provided a holistic assessment based on the IELTS public 9-band overall scale (IELTS, n.d.-c). IELTS examiners do not give holistic assessments, but this additional holistic rating served to align the methods of the current study with other comparable research and provided an extra level of data for analysis. The holistic scales were minimally adapted to remove reference to spoken production and language comprehension.
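The sublevel-to-decimal conversion described above can be expressed as a small helper (a sketch; the function name is ours):

```r
# Sketch: convert a band sublevel such as "5-", "5", "5+" to its decimal value.
to_decimal <- function(band) {
  n <- as.numeric(sub("[+-]$", "", band))                  # numeric part
  n + switch(substring(band, nchar(band)), "-" = 0, "+" = 0.7, 0.3)
}
to_decimal("5-")  # 5.0
to_decimal("5")   # 5.3
to_decimal("5+")  # 5.7
```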
Results
In this section we present the analysis of the ratings data (Research questions 1 and 2) and survey data (Research question 3) to determine how aspects of lexical and collocational proficiency impacted the ratings of lexical proficiency.
Quantitative data analysis
We first used an MFRM model to arrive at fair scores for each text, that is, the rating that would have been given by a rater of average severity. In doing so, we sought to mitigate the systematic error inherent in human ratings. In essence, MFRM models “predict the outcome of encounters between persons and assessment/survey items” (Aryadoust et al., 2021, p. 7) by considering multiple variables (referred to as facets). Here, a three-facet model was created using FACETS software (Linacre, 2020), consisting of the raters, the texts, and the band descriptors. In addition, other distal factors (demographic and task variables) were tested, but none of these variables indicated significant bias. The output of the model was the ratings expressed in log-odds units (logits), which were then transformed into the fair scores.³
Having established fair scores, the impact of collocational features on the lexical ratings could be calculated using a linear regression model created in the R environment (version 3.6.2; R Development Core Team, 2019). Prior to creating the model, the assumptions required by linear regressions were checked and met (Levshina, 2015). In the model (Table 4), the outcome variable is the Lexical Resource fair scores (LR_fair). The independent variables are the fair scores for the other analytic criteria (TR, CC, GRA), the frequency in COCA of manipulated lexical items (High, Mid, Low), the type of manipulated lexical item (Collocation, Non-collocation), the collocation accuracy (Low, High), and the base text CEFR level (B1, B2, C1). In addition, motivated interactions were included. In this experimental design, all potential variables were left in the model regardless of whether they improved the model fit. The independent variables were sum contrast coded with the exception of frequency; the reference level for frequency is therefore High. As a result of the contrast coding, the model’s intercept is dispersed across levels of the other variables. By comparing all categories against the grand mean in this manner, the results are more informative in terms of the deviation of each category from the overall average rather than comparisons to a specific baseline category. However, because such coding focuses on overall effect estimation, the model estimates can be difficult to interpret, and it is useful to subsequently use Tukey’s Honestly Significant Difference (HSD) test to interpret pairwise comparisons between levels of the variables of interest. For example, for the CEFR variable, in Table 4 we see that the CEFR level (rows 2 and 3) is significant. The post-hoc analysis in Table 5 confirms that the levels are reliably different, increasing as expected from B1 → B2 → C1.
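A sketch of this modeling step is shown below, assuming a data frame d of fair scores and text-version variables whose names follow the model formula reported under Table 4; the post-hoc step uses the multcomp package as one way to obtain Tukey comparisons from a fitted lm:

```r
library(multcomp)  # for glht()

# Contrast coding: frequency keeps its reference level (High); others are sum coded.
d$freq <- factor(d$freq, levels = c("High", "Mid", "Low"))
for (v in c("CEFR", "accuracy", "item_type")) {
  d[[v]] <- factor(d[[v]])
  contrasts(d[[v]]) <- contr.sum(nlevels(d[[v]]))
}

m <- lm(LR_fair ~ CEFR + TR_fair + CC_fair + GRA_fair + freq + accuracy +
          item_type + CEFR:freq + CEFR:accuracy + freq:accuracy + freq:item_type,
        data = d)

summary(glht(m, linfct = mcp(CEFR = "Tukey")))  # pairwise CEFR comparisons
```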
Table 4. Linear regression model for factors predicting lexical resource ratings
Note: * p < .05, ** p < .01, *** p < .001.
Model formula: lm(formula = LR_fair ~ CEFR + TR_fair + CC_fair + GRA_fair + freq + accuracy + item_type + CEFR:freq + CEFR:accuracy + freq:accuracy + freq:item_type).
Table 5. Tukey’s multiple comparison of means test for CEFR
Note: * p < .05, ** p < .01, *** p < .001.
RQ1: Collocational sophistication (collocate frequency)
The regression data answered RQ1, which asked to what extent expert ratings of lexical proficiency of essays are impacted by the number of high-/mid-/low-frequency collocates and non-collocates (a dimension of lexical/collocational sophistication). Overall, there was a significant positive effect, with lower-frequency lexis predicting higher LR ratings. The post hoc analysis (Table 6) showed that the bulk of the frequency effect occurred when going from high- to low-frequency. The difference between high- and mid-frequency was also significant (p = .018), but there was no significant difference between mid- and low-frequency (p = .158).
Table 6. Tukey’s multiple comparison of means test for frequency
Note: * p < .05, ** p < .01, *** p < .001.
In addition, there was a significant difference for lexical item type (Table 7), that is, whether lemma frequency effects were mediated by the placement of the lemma within or outside of a collocation. This effect was small but significant, resulting in higher LR ratings when the lower-frequency lemmas were part of a collocation.
Table 7. Tukey’s multiple comparison of means test for item type
Note: * p < .05, ** p < .01, *** p < .001.
RQ2: Collocational accuracy
The regression data answered RQ2, which asked to what extent expert ratings of lexical proficiency of essays are impacted by the number of accurate and inaccurate collocations. Of the experimental variables, only collocational accuracy was not significant. Furthermore, the interactions between accuracy and the CEFR B2-C1 contrast, and between accuracy and low frequency, were not significant. One significant interaction was present between accuracy and the CEFR B1-C1 contrast. However, after careful plotting and examination of this interaction, this significant relationship appeared to be spurious.
RQ3: Rater rationales
Recall that the design of the study specifically called for quantitative and qualitative insights into raters’ scoring practices. Thus, the raters’ rationales for their scores answered the third research question, which asked which aspects of lexical proficiency expert raters consciously attend to, especially in terms of collocational proficiency. Their comments contained both positive and negative elements, and raters routinely used language directly from the band descriptors. Figure 2 presents a tally of the different lexical features commented on. Only the first occurrence of each term for each rater response was counted, and similar terms were collapsed for clarity, so that, for example, the count for the term “formulaic sequence” includes mentions of “chunk” and “multiword expression.” Therefore, 25 for “formulaic sequence” means that 25 rationales (corresponding to a minimum of 9 raters and a maximum of 25 raters) mentioned this construct at least once.
Figure 2. Topics of rater comments.
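The counting rule behind Figure 2 can be sketched as follows: each rationale contributes at most one count per construct, with related wordings collapsed (the rationales and synonym list here are invented for illustration):

```r
# Sketch: tally rationales mentioning a construct at least once.
rationales <- c("good range; nice chunks and collocation use",
                "accurate spelling but limited collocation")
synonyms <- c("formulaic sequence", "chunk", "multiword expression")
sum(sapply(rationales, function(r) any(sapply(synonyms, grepl, x = tolower(r)))))
```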
Here the most common lexical aspects noticed correspond to the primary lexical dimensions addressed in this study, sophistication and accuracy, with the importance of collocation also clearly represented. When giving examples, raters tended to give a mix of single words and formulaic sequences; thus, the 60 occurrences of the concept of sophistication encompassed both single-word and multiword lexical units, even though the type of sophistication was not typically described.
Discussion
The findings depict elements of the relationship among different aspects of lexical and collocational proficiency. We first discuss the results in relation to RQs 1 and 2, contextualized with the qualitative results relating to RQ3. We then consider the implications of these results in terms of language pedagogy and assessment.
RQ1: Importance of high-, mid-, and low-frequency collocations
It is unsurprising that the use of lower-frequency lemmas (inside and outside of collocations) led to higher ratings of lexical proficiency; this aspect of lexical sophistication, measured in various ways, has long been recognized as a characteristic of more proficient writing (e.g., Daller et al., 2013). The examples given by raters also provided support for the statistical importance of sophistication in terms of frequency; across all three levels, low-frequency single words and collocations containing low-frequency words were flagged as examples of sophisticated lexis. For example, low-frequency single words repeatedly highlighted by raters in their comments include fantasize, flaunts, and tremendous. Collocations containing mid- or low-frequency collocates included fairly young, first-hand experience, and sheer motivation. With respect to collocational frequency, there is value in considering not just the frequency of collocations in an external corpus, but also the collocates within collocations, since the results suggested that experts especially noticed collocations containing lower-frequency lemmas (and thus awarded higher ratings of lexical proficiency).
This suggestion that collocates held special prominence is supported by the rater comments. As noted, raters provided single-word and multiword lexical items, including collocations, to exemplify lexical sophistication. This finding, combined with the high number of times the term collocation was explicitly used, suggests that collocations were especially salient to examiners. Furthermore, in some cases, a specific collocate appeared to be particularly noticeable, as some examiners gave the single word as an example and others gave the word as part of a collocation, for example, first-hand versus first-hand experience, suggesting that it was collocate frequency, rather than collocation frequency, which drew attention. These cases exemplify how raters may differ in the extent to which multiword items are noticed, compared to single words, as well as the way in which sophistication is conceptualized.
With respect to the utility of the three-way classification of high-, mid-, low-frequency collocations, it was originally hypothesized that there would be a clear distinction between texts systematically varying based on these frequency categories. The results can be seen to generally support this view, especially that a “mid” category is informative, since high-mid and high-low contrasts were significant. However, the mid-low contrast was not significant, perhaps due to the coarse-grained frequency “buckets,” and it may be that more fine-grained divisions at the lower frequencies would have been more revealing (see Kremmel, 2016), albeit at the expense of practical utility for practitioners who use online frequency profiling tools.
RQ2: Lack of significance of collocational accuracy
It was predicted that the high accuracy level would lead to higher LR ratings based on previous literature that found collocational accuracy to be an important aspect of judgments of proficiency (e.g., Laufer & Waldman, 2011; Nesselhauf, 2005). It was therefore contrary to expectation that no such effect was uncovered in the ratings. One potential explanation is that the difference in the quantity of collocation errors between the Low accuracy and High accuracy versions was insufficient. In other words, adding six additional collocation errors to a text of 250 words was too small a manipulation, regardless of the base CEFR level.
A second, and perhaps more likely, explanation relates to the level of error gravity, namely the impact of the errors on communication. In this study, the meaning remains clear in all the collocation errors, for example, positive school. This consistent “light” error gravity likely decreased the impact of the collocation inaccuracies. In addition, a word combination such as positive school may have been interpreted by raters as creative language use rather than “wrong,” as from a phraseological perspective such combinations are not restricted collocations in the same way that slim chance or make a mistake are. This level of error gravity and type of inaccurate collocation also matches the IELTS lexical descriptors for band 6 and below, which focus on lexical accuracy in terms of impact on communication, for example, B6: “makes some errors in spelling and/or word formation, but they do not impede communication” (emphasis added). If the raters, as expected, closely followed the rubric descriptors, then this wording may also help to explain why accuracy as operationalized in this study did not emerge as a strong predictor of the ratings at the B1/B2 levels, though it does not explain the lack of significant interaction between B2-C1 and accuracy.
In contrast to the results of the linear regression model, the raters’ comments demonstrated that lexical accuracy in its many forms was a feature they considered important. Collocations with inaccurate word choice were frequently noted, for example, positive school, study at money, and straight contribution. As a result, writers at all three CEFR levels were often described as “risk takers,” that is, writers with higher sophistication but lower accuracy. These rater data further support the hypothesis that the lack of significance for collocational accuracy in this study can likely be attributed to experimental design.
Pedagogical implications: Choosing which collocations to teach
At present, collocation instruction is common in many contexts, but the selection of which collocations to teach often remains unprincipled (Macis & Schmitt, 2016). A general rule of thumb for any vocabulary selection is to consider the cost-benefit principle so that learners get the best return for the time invested in learning. Frequency is one way of determining this benefit and has traditionally been used to determine text coverage (the number of known lemmas/word families needed to cover a certain percentage of texts) and to create frequency lists.
Some single-lemma frequency lists are widely used in general English (e.g., the New General Service List [NGSL]; Browne et al., 2013) and English for Academic Purposes (EAP) (e.g., the Academic Word List [AWL]; Coxhead, 2000). However, there are few widely used collocation frequency lists (though see Ackermann & Chen, 2013; Durrant, 2009; Shin & Nation, 2007). A practical approach for teachers is to focus on formulaic sequences containing items from established frequency-based lists such as the AWL (as advocated by Coxhead, 2020)—essentially a collocate frequency approach.
However, findings from studies such as this paper suggest that the inclusion of some lexis in the K10+ bands in learning goals is also worthwhile. Currently such lexis is not supported in frequency list approaches to vocabulary selection, either for individual words or collocations. But, as Vilkaitė-Lozdienė and Schmitt (2020, p. 88) caution, “frequency lists should be seen more as a useful indication rather than a prescription.” While high-frequency words are crucial for comprehension and text coverage, the findings from this study suggest that knowledge of lower-frequency items is particularly important for productive skills, especially in assessment contexts.
Instead of replacing frequency-based lists/materials that focus on K1-2 lexis, one option is to supplement the existing curricula with judiciously selected K3-9 and K10+ lexical items. In doing so, vocabulary pedagogy practices can still be evidence-based rather than solely intuition-based, but more responsive to individuals’ needs. For example, the following categories represent low-frequency collocations which nonetheless have wide academic generalizability and therefore high cost-benefit:
1. Discourse markers: By replacing or inserting K10+ words into discourse markers (e.g., in my case [K1] → not to generalize [K13]), students can apply these formulaic sequences in a range of academic text types to good effect, concurrently improving the sophistication of their lexis while demonstrating flexible use of cohesive devices.

2. Synonyms for other K1-2 collocations: Collocations containing nouns are the most frequent type of lexical collocation (Nizonkiza & Van de Poel, 2019) and are a key attribute of academic prose. However, learners tend to underuse noun forms in their own writing in favor of verbs (Naismith & Juffs, 2021). High-frequency collocations can therefore be naturally replaced with low-frequency collocations containing noun forms (e.g., learn about [K1] → gain proficiency in [K10]).

3. Domain-specific, specialized lexis: For many students, it is necessary to know not only general academic English vocabulary, but also lexis specific to their studies and careers (Coxhead, 2020; Nation, 2013), for example, smart shopper (K1) → savvy shopper (K11) in marketing. This type of specialized vocabulary is one of the greatest challenges that learners report (Dang & Dang, 2021).
Assessment implications
In the public IELTS descriptors, relativistic terminology is frequently used to distinguish between bands. For example, sophistication descriptors include “attempts to use less common vocabulary” (B6), “uses less common lexical items” (B7), and “skillfully uses uncommon lexical items” (B8). What is unclear is whether “vocabulary” and “lexical items” are synonymous or intended to distinguish between single and multiword lexical items, or exactly what frequencies “less common” and “uncommon” refer to. Research has shown that teachers debate the meanings of terms in descriptors (Claire, 2001) and have difficulty interpreting/applying relativistic terminology (Smith, 2000). A compromise could therefore be to limit the number of different modifiers for describing lexical use and to gloss elsewhere what approximate frequency ranges these terms are intended to encompass, illustrated with examples.
Most key lexical dimensions are included in the descriptors at nearly every IELTS band level. However, consideration of formulaic sequences, including collocations, is lacking: only bands 7 and 8 include the term “collocation.” It is true that formulaic sequences are a component of “vocabulary” and “lexical items,” but without being explicitly described and mentioned at all levels, there is a danger that raters overlook or undervalue the many types of lexical items. This oversight may occur even though, as this paper has demonstrated, collocational sophistication (at least in terms of collocate frequency) is a significant factor in rating and is therefore deserving of explicit recognition. The wording of rubrics is important, and more experienced raters use more rubric-generated vocabulary to describe decisions and ratings (Wolfe et al., 1998). It therefore seems that the current scales do not adequately address key elements of collocational proficiency.
To exemplify how the descriptors might be updated, Table 8 contains the original Band 6 Lexical Resource descriptor and a potential amended version. This updated descriptor takes into consideration the beliefs of the raters in this study and the research findings supporting the importance of collocational proficiency. In doing so, it is intended to more clearly highlight a key element of lexical proficiency which at present is not given sufficient weight.
Table 8. Band 6 Lexical Resource descriptors
One challenge of writing descriptors is balancing specificity and practicality as there is very limited space available. As such, minimal alterations have been made (emphasized in bold) to make salient that formulaic sequence use is part of the existing descriptors for range, sophistication, and accuracy. Here we suggest the term multiword expression over formulaic sequence because the former is currently, in our experience, more widely used in the teaching community and more immediately accessible.
Conclusions
This paper reported on an investigation of expert IELTS examiner ratings of texts which had been manipulated in terms of their collocational frequency (a dimension of collocational sophistication) and accuracy. The resulting data showed that the frequency of lexical items in general was impactful, especially when less frequent words were part of salient collocations. In general, the high-, mid-, and low-frequency categories were an appropriate method for identifying different levels of lexical sophistication, though the division between mid- and low-frequency, as operationalized here, was not entirely clear-cut. Furthermore, while collocational accuracy seemed to be noticeable to raters, it did not impact the statistical models.
The main contributions of this mixed methods study are threefold. From a methodological standpoint, the careful text selection and normalization provide a model for future research. By carefully normalizing text length and validating the results, student essays can be used as research instruments without needing to account for text length and topic/prompt effects. In addition, the use of MFRM models to obtain fair scores prior to further inferential analysis remains uncommon in this field of research but shows merit in terms of accounting for individual rater variability before carrying out linguistic analysis. By including qualitative data from expert raters, the quantitative data can be better interpreted and receive additional validation.
The second contribution of this study is to classroom pedagogy for the teaching of lexis. Historically, the teaching of formulaic sequences and collocations has been neglected (Wolter, 2020), and even though there has been a resurgence in this area, the decision of which collocations to teach is often left to “the whims of individual teachers” rather than based on empirical research (Hanks, 2013, p. 424). The results of this study suggested that for students to improve the quality of their written academic English, it is beneficial to judiciously include some very low-frequency lexis, even if learning such lexis is of lesser benefit to developing receptive skills.
A third contribution of this study is to inform potential assessment training and scale design practices. Given the importance of assessment literacy for raters in delivering reliable assessments, it is critical to provide teachers and examiners with training and tools which help clarify the key elements of learners’ lexis. As such, it is recommended that formulaic sequences be an explicit component of all band descriptors, and that the relationship between frequency descriptors and frequency bands be clarified.
Many of the limitations of this study result from conscious decisions regarding its methodological design. The experimental nature of the study required controlling features such as text length, the frequency bands, and accuracy of specific lexical items. The trade-off for this degree of control is the authenticity of the texts, the use of only one writing prompt, and the use of only three base texts from different proficiency levels, all of which may have had unintended and unmeasured effects on the ratings. In addition, the exclusion of levels of error gravity as a factor somewhat limits the conclusions that can be drawn about the collocational accuracy findings.
Future research might therefore include partial replications of this study but with adjustments to the texts to increase the difference in quantity of collocation errors or error severity between the low and high accuracy text versions. A pilot study could also be carried out to ascertain whether potentially inaccurate collocations are experienced as such. Adjusting collocation sophistication using a collocation frequency approach or other operationalizations of sophistication would also be informative. The qualitative element of this research, the raters’ comments, could also be further explored through interviews or surveys to acquire a more thorough understanding of the raters’ thought processes and beliefs about lexis and assessment. Through projects such as the current study and others in a similar vein, it will be possible to better understand the relationship between text quality as it is realized through learners’ use of lexis and the way it is perceived by expert raters of high-stakes tests.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263125000075.
Acknowledgements
We express our gratitude to the participants for their time, to the reviewers for their helpful suggestions, and to Drs. Matthew Kanwit, Melinda Fricke, Na-Rae Han, and Ute Römer-Barron for their feedback on an earlier version of this work. All errors are, of course, our own. This work was supported by the Social Sciences and Humanities Research Council of Canada and a Duolingo Research Grant.
Competing interest
The author(s) declare none.