Introduction
Comparison of premorbid IQ estimates against objective measures of current IQ enables the magnitude of cognitive impairment to be evaluated in neurological patients. This is useful for research, medicolegal, diagnostic and clinical management purposes. Premorbid IQ tests requiring the oral pronunciation of phonologically irregular words are commonly used due to robust evidence that single word pronunciation knowledge is preserved (held) across a wide range of conditions (Crawford, 1992; McGurn et al., 2004; O’Carroll, 1995; Sharpe & O’Carroll, 1991), and because the relationship between word reading and intelligence is largely independent of age and social class (Nelson, 1982). Alternative approaches that examine word familiarity independently of pronunciation include lexical decision tests like Spot-the-Word (Baddeley et al., 1993; Baddeley & Crawford, 2012; van der Linde et al., 2022), in which participants are asked to select real words rather than plausible non-word distractors. Lexical decision tests are particularly useful where speech production is impaired. However, since oral pronunciation tests are used most often, and are underpinned by a greater quantity of normative data, we focus on this approach.
The National Adult Reading Test (NART; Bright et al., 2018; Nelson & Willison, 1991; Nelson, 1982) is a free, fast, well-established and widely used word pronunciation-based premorbid IQ test. Evidence indicates equivalent or better predictive validity compared to using demographic data alone, using the best performing subtest from an IQ battery, or undertaking hold vs no-hold subtest comparisons (Bright et al., 2002; Bright and van der Linde, 2020). The most recent restandardization of the NART (Bright et al., 2018) enables estimation of full-scale IQ (FSIQ) on the current gold-standard Wechsler Adult Intelligence Scale – Fourth Edition (Wechsler, 2008).
Numerous variants of the original NART (Nelson, 1982) have been developed for revalidation against new revisions of IQ batteries (e.g., Bright et al., 2018; Nelson & Willison, 1991), abbreviation (e.g., Beardsall & Brayne, 1990 [Short NART]; Uttl, 2002 [NAART35]; McGrory et al., 2015 [mini-NART]; Mackinnon & Wooden, 2015; van der Linde & Bright, 2018 [NART17]), and internationalization (e.g., Blair & Spreen, 1989 [USA NART-R]; Schmand et al., 1991 [Dutch DART]; Grober et al., 1991 [USA AMNART]; Hennessy & Mackenzie, 1995 [Australian AUSNART]; Dalsgaard, 1998 [Danish DART]; Mackinnon et al., 1999 [French fNART]; Vaskinn & Sundet, 2001 [Norwegian NART]; Matsuoka et al., 2006 [Japanese JART]; Rolstad et al., 2008 [Swedish NART-SWE]; Starkey & Halliday, 2011 [New Zealand NZART]; Watt, Ong & Crowe, 2016; Karakuła-Juchnowicz & Stecka, 2017 [Polish PART]; Yi et al., 2017 [Korean KART]).
Some international variants provide new, population-appropriate regression equations to estimate premorbid IQ using the original NART word stimuli (e.g., Barker-Collo et al., 2011; Watt et al., 2018), some modify stimuli or grading rules to address differences in dialect/pronunciation (e.g., Hennessy & Mackenzie, 1995), while others propose entirely new sets of word stimuli in the local language (e.g., Krámská, 2014 [Czech Reading Test CRT]; Alves, Simões, & Martins, 2011 [Portuguese Irregular Word Reading Test TELPI]). However, most still provide a regression equation to estimate premorbid intelligence from reading test score.
In the development of the original NART and its variants, calibration data were collected to calculate a straight line of best fit relating test score to the predicted variable (typically full-scale IQ, but sometimes constituent index scores). Clinicians use the resultant linear regression equation to obtain a premorbid IQ estimate, typically from the number of word pronunciation errors committed, although some variants provide conversion tables instead of, or in addition to, an equation. It is well known that linear regression is less accurate for samples at the high and low ends of a distribution (Basso et al., 2000; Graves et al., 1999; Griffin et al., 2002; Veiel & Koopman, 2001). In part, this is because fitting a straight line to normally distributed data (such as IQ scores) will lead to a poor fit at the tails of the distribution, along with general floor and ceiling effects.
The NART remains a popular and effective tool; however, its public domain status has led to a proliferation of variants for purposes such as those outlined above. These variants have never been systematically compared to assess their numerical prediction limits, or the reachability of IQ categories in standard classification systems. Such an evaluation is important since operating over a restricted IQ range will necessarily exclude a proportion of the target population (viz., those who premorbidly possessed comparatively low or high IQ) from accurate clinical assessment, leading to suboptimal diagnosis and clinical management decisions.
In this article we review the specific numerical corollaries of these issues for all NART variants identified that give a regression equation to calculate FSIQ that does not require demographic variables, and where the test was not developed for a narrow clinical condition. We relate the range of premorbid IQs that can be produced to categorical labels in common IQ classification systems and evaluate the proportion of the target population that falls outside the predictable range.
Method
A straight-line equation relates a NART score (or the number of errors committed), x, to a premorbid IQ estimate, y, in the form of a first-degree polynomial y = mx + c, where m is the coefficient of x (the gradient, sometimes called the regression coefficient) and c is an additive constant (the intercept, sometimes called the regression constant). Using the regression equation provided with each NART variant (gradient and intercept are given in Table 1 which, since the line is strictly decreasing, would be used in the form y = c − mx) we calculated the lowest attainable IQ score by supposing that a participant does not pronounce any test word correctly, i.e., setting x to the maximum number of errors so that the gradient term (mx) is maximized and subtracted from the intercept (c). Using current population estimates, we then calculated the percentage of the target population that falls below that IQ score. We then calculated the highest attainable IQ score by supposing that no errors were committed, i.e., setting x to zero so that the gradient term vanishes and only the intercept (c) remains. Again, using current population estimates, we calculated the percentage of the target population that is above that IQ score. For each variant, we calculated the statistical range of IQ scores that are theoretically reachable, and the percentage of the target population for the respective test that falls outside that range. We then related the range of attainable scores to standard IQ classification systems.
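The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure used to produce Table 1: the function names are hypothetical, the population is assumed to follow a normal distribution with mean 100 and standard deviation 15, and the example coefficients (intercept 127.7, gradient 0.826, 50 words) are illustrative values that should be checked against Table 1 before any use.

```python
from statistics import NormalDist

# Assumption: IQ in the target population is normally distributed
# with mean 100 and standard deviation 15.
POPULATION_IQ = NormalDist(mu=100, sigma=15)


def prediction_limits(intercept, gradient, n_words):
    """Return (lowest, highest) predictable IQ for y = c - m*x.

    The lowest value assumes every word is mispronounced (x = n_words);
    the highest assumes no errors (x = 0), leaving only the intercept.
    """
    lowest = intercept - gradient * n_words
    highest = intercept
    return lowest, highest


def proportion_outside(lowest, highest, dist=POPULATION_IQ):
    """Return the population proportions below and above the limits."""
    below = dist.cdf(lowest)
    above = 1.0 - dist.cdf(highest)
    return below, above


# Illustrative (hypothetical) coefficients; verify against Table 1.
lowest, highest = prediction_limits(intercept=127.7, gradient=0.826, n_words=50)
below, above = proportion_outside(lowest, highest)

print(f"Predictable range: {lowest:.1f} to {highest:.1f} "
      f"(width {highest - lowest:.1f})")
print(f"Population below floor: {below:.1%}; above ceiling: {above:.1%}")
```

With these illustrative inputs the floor sits a little under one standard deviation below the mean, so roughly 18% of the assumed population falls below it and roughly 3% falls above the ceiling.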
Results
First, we present the upper and lower limits and range of each NART variant. Next, we evaluate which IQ class categories fall outside these limits. We then comment on clinical implications for patients with comparatively high or low premorbid intelligence.
Our main findings are presented in Table 1, showing that a substantial proportion of the non-clinical population fall below the lowest predictable score. In the original NART (Nelson, 1982), Danish (Hjorthøj et al., 2013), Norwegian (Vaskinn & Sundet, 2001), and Polish variants (Karakuła-Juchnowicz & Stecka, 2017), this equates to approximately 1 in 5 (∼20%) of the general population (Rain and Zaborowska, 2022). In the Australian (Hennessy & Mackenzie, 1995) and US (Blair and Spreen, 1989) variants it equates to approximately 1 in 10 (∼10%) of the general population (Rain and Zaborowska, 2022).
Relative to standard IQ classification systems (Table 2), this would lead to widespread misclassification. In the current WAIS-IV classification system (Wechsler, 2008), only Nelson & Willison (1991) can, barely, produce an IQ in the Extremely Low class (<70). Of the NART variants examined, six cannot produce an IQ < 80 (Blair & Spreen, 1989; Hennessy & Mackenzie, 1995; Hjorthøj et al., 2013; Karakuła-Juchnowicz & Stecka, 2017; Nelson, 1982; Vaskinn & Sundet, 2001), which would cause all those in the Borderline or Extremely Low classes to be misclassified as Low Average. In the more granular Stanford-Binet Fifth Edition (SB5; Roid & Pomplun, 2012) classification system, none of the NART variants examined would be capable of producing IQs in the Moderately Impaired or Delayed range (which would be misclassified as Borderline Impaired or Delayed, or even Low Average), and only one of the NART variants examined (Nelson & Willison, 1991) can, barely, predict IQs in the Mildly Impaired or Delayed range. Six variants cannot produce an IQ below the Low Average range, missing the bottom three categories entirely. In the DAS-II classification system (Dumont et al., 2009), only two of the NART variants can, again barely, predict IQs in the Very Low class (Nelson & Willison, 1991; Starkey & Halliday, 2011); under the remaining variants, individuals in this class would be misclassified as Low or Below Average.
The same is true with high-performing patients whose score tends towards the top of the predictable range, with the French (Mackinnon et al., 1999), Japanese (Matsuoka et al., 2006), and New Zealand (Starkey & Halliday, 2011) variants of the NART unable to reach 1 in 20 (i.e., the top 5% of the population). This translates to millions of individuals (3.5 million from a 2022 French population of 67.5 million; 6.8 million from a 2022 Japanese population of 125.7 million; 0.27 million from a 2022 New Zealand population of 5.1 million).
In the Wechsler IQ classification system, only four of the NART variants examined can produce an IQ in the Very Superior (≥130) class (Hennessy & Mackenzie, 1995; Nelson & Willison, 1991; Watt et al., 2016; van der Linde & Bright, 2018), and one can, just barely, produce an IQ in the Very High class (Vaskinn & Sundet, 2001). In the SB5 classification system, no NART variant can detect an IQ in the Very Gifted or Highly Advanced class, and only four can detect an IQ in the Gifted or Very Advanced range. In the DAS-II classification system, only three variants can detect the Very High class.
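The reachability check applied to each classification system can be illustrated as follows. This is a sketch under stated assumptions: the band boundaries are the conventional Wechsler cut-offs (consistent with the <70 and ≥130 thresholds quoted above, but to be verified against Table 2), the function names are hypothetical, and the example limits of 86.4 and 127.7 are illustrative rather than taken from Table 1.

```python
import math

# Assumed Wechsler-style bands as half-open intervals [low, high).
WECHSLER_BANDS = [
    ("Extremely Low", -math.inf, 70),
    ("Borderline", 70, 80),
    ("Low Average", 80, 90),
    ("Average", 90, 110),
    ("High Average", 110, 120),
    ("Superior", 120, 130),
    ("Very Superior", 130, math.inf),
]


def reachable_classes(lowest, highest, bands=WECHSLER_BANDS):
    """Classes whose interval overlaps the predictable range [lowest, highest]."""
    return [name for name, low, high in bands
            if lowest < high and highest >= low]


def unreachable_classes(lowest, highest, bands=WECHSLER_BANDS):
    """Classes that no test score can ever produce."""
    reachable = set(reachable_classes(lowest, highest, bands))
    return [name for name, _, _ in bands if name not in reachable]


# Illustrative (hypothetical) prediction limits for a single variant.
print("Reachable:  ", reachable_classes(86.4, 127.7))
print("Unreachable:", unreachable_classes(86.4, 127.7))
```

A variant with these limits could never assign the Extremely Low, Borderline or Very Superior labels, whatever score a participant obtains.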
Discussion
The compressed predictable IQ range stems from fitting a straight line to the datapoints of participants who have completed both the NART variant and, for calibration purposes, a full standard IQ test battery or (in some cases) a specific subtest. Perhaps counterintuitively, where straight-line fitting is used, collecting more datapoints may not help: by definition, if participants across a wide range of ability levels are recruited, most will not be at the extrema and the gradient (m) and intercept (c) of the straight line will be unperturbed.
Similarly, developing tests of greater length cannot help: in terms of statistical range, the three highest-valued variants are the 50-word New Zealand restandardization (Starkey & Halliday, 2011) at 64.10, the first British restandardization (Nelson & Willison, 1991), also 50 words, at 62.00, and, notably, the 17-word NART variant proposed in van der Linde and Bright (2018) at 59.30. Conversely, the three variants with the lowest ranges all have 50 words: Vaskinn & Sundet (2001) at 34.00; Karakuła-Juchnowicz & Stecka (2017) at 38.74; Nelson (1982) at 41.30.
The clinical significance of these issues is potentially large. These tests are poorly suited for use with patients who, prior to their neurological condition, would have fallen into the lower IQ classification ranges, since the clinician’s ability to accurately gauge the severity of their current impairment will be limited. Specifically, since premorbid IQ will be overestimated, a clinical evaluation will likewise overestimate the magnitude of impairment, on the assumption that current IQ will have fallen relative to the true pre-clinical IQ. For instance, a patient with pre-clinical IQ <70 may yield a premorbid IQ estimate of 80 due to floor effects, spuriously overstating their pre-clinical ability. A measure of current IQ will produce a lower than pre-clinical score, and the difference between this and the estimated premorbid IQ will be larger than it should be, thereby causing the magnitude of the patient’s impairment to be overestimated.
For patients who would have fallen into the higher IQ classification range, ceiling effects will cause premorbid IQ to be underestimated, and a clinical evaluation will underestimate the magnitude of impairment, based on the same assumption. For instance, a patient with pre-clinical IQ >140 may have their premorbid IQ estimated with NART at 130 due to ceiling effects, underestimating their pre-clinical ability. A measure of current IQ will produce a lower than pre-clinical score, likely bringing it closer to the premorbid IQ estimate ceiling, such that the difference between current IQ and premorbid estimate will be smaller than it should be, thereby causing the magnitude of the patient’s impairment to be underestimated. Joseph et al. (2021) reported that the Test of Premorbid Functioning (TOPF; Wechsler, 2011), which is very similar to the NART, underestimated premorbid intelligence for around one third of their high-performing participants and was particularly poor for those falling into Above Average and Superior classes. This is despite the fact that the TOPF uses a third-degree polynomial rather than straight-line fit. Other work indicates that NART and its variants may estimate premorbid IQ more accurately than TOPF (Reale-Caldwell et al., 2021), perhaps because the specific polynomial used to fit TOPF calibration data is suboptimal.
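The direction of the resulting bias can be made concrete with a small worked example. The sketch below is a deliberate simplification: it treats the floor and ceiling as hard limits that clip an otherwise perfect estimate, ignoring measurement error, and the function names, the limits of 80 and 130, and the patient values are all hypothetical.

```python
def estimated_premorbid(true_premorbid, floor, ceiling):
    """Premorbid estimate when the test cannot go below floor or above ceiling.

    Simplifying assumption: within its limits the test is perfectly accurate.
    """
    return max(floor, min(ceiling, true_premorbid))


def apparent_impairment(true_premorbid, current_iq, floor, ceiling):
    """Estimated premorbid IQ minus measured current IQ."""
    return estimated_premorbid(true_premorbid, floor, ceiling) - current_iq


# Hypothetical low-ability patient: true decline is 68 - 60 = 8 points,
# but the floor of 80 makes the apparent decline 80 - 60 = 20 points.
low = apparent_impairment(true_premorbid=68, current_iq=60, floor=80, ceiling=130)

# Hypothetical high-ability patient: true decline is 142 - 125 = 17 points,
# but the ceiling of 130 makes the apparent decline 130 - 125 = 5 points.
high = apparent_impairment(true_premorbid=142, current_iq=125, floor=80, ceiling=130)

print(f"Low-ability patient: apparent decline {low} vs true decline {68 - 60}")
print(f"High-ability patient: apparent decline {high} vs true decline {142 - 125}")
```

Under these assumptions the floor inflates the apparent deficit for the first patient and the ceiling shrinks it for the second, mirroring the two cases described above.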
In some neuropsychological tests, instructions suggest using a different line equation for scores above or below certain thresholds, administering an alternative or abbreviated test, or simply declaring the prediction unreliable. The last option seems quite reasonable if the participant fails to respond correctly to all or nearly all test words, rather than allocating a Low Average or Borderline IQ, as would be the case if some NART variants were used imprudently. For instance, in the original NART it is recommended that participants scoring < 10 correct words (referred to as poor readers) take a second test (Schonell Graded Word Reading Test) and that a second regression equation incorporating both scores is used.
It is acknowledged in Nelson & Willison (1991) that a limitation of the NART is that it cannot detect IQs above 128. It is stated that this is less of a problem than it first seems because even those with IQs above 130 typically make one or more NART errors. However, this tacitly acknowledges prediction error and that artificially reduced IQ estimates are, in fact, potentially clinically disadvantageous.
In part, the method of obtaining a straight line of best fit to calibrate NART is used to keep the task of converting a NART score into a premorbid IQ score as simple as possible for the clinician, obviating the need for complex calculations, the application of an algorithm, or the use of computer software. In many cases, for convenience, conversion tables are also provided, so that the regression calculation need not be used in practice (perhaps removing one possible source of error, and speeding the assessment). However, most conversion tables simply tabulate the linear regression line across the range of possible raw error scores. Conversion tables could just as easily be used to concretize a non-linear fit. Three possibilities are: (i) so-called segmented or broken-stick regression, in which multiple line segments are fit to different intervals of the observed calibration data, such as using a line for the main portion of the fit and two smaller lines for the tails; (ii) fitting a cumulative distribution function; and (iii) fitting a suitable higher-degree polynomial.
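As an illustration of how a conversion table could concretize the first of these options, the sketch below tabulates a continuous broken-stick function with steeper segments in the tails. It is not a fitted model: the function name, the knots at 10 and 40 errors, the slopes, and the intercept of 140 are invented purely for demonstration, and real values would have to be estimated from calibration data.

```python
def broken_stick_iq(errors, intercept, knots, slopes):
    """Continuous piecewise-linear IQ estimate from an error count.

    intercept: IQ assigned to zero errors.
    knots:     error counts at which the slope changes (ascending).
    slopes:    IQ points lost per error in each segment
               (one more entry than knots).
    """
    iq = intercept
    start = 0
    # Segment boundaries: 0 -> knots[0] -> knots[1] -> ... -> infinity.
    bounds = list(knots) + [float("inf")]
    for end, slope in zip(bounds, slopes):
        span = min(errors, end) - start  # errors falling in this segment
        if span <= 0:
            break
        iq -= slope * span
        start = end
    return iq


# Hypothetical parameters for a 50-word test: steep tails, shallow middle.
CONVERSION_TABLE = {
    e: round(broken_stick_iq(e, intercept=140.0, knots=(10, 40),
                             slopes=(2.0, 0.8, 2.0)), 1)
    for e in range(51)
}

for e in (0, 10, 25, 40, 50):
    print(f"{e:2d} errors -> estimated IQ {CONVERSION_TABLE[e]}")
```

The clinician would still look up a single value per raw score, so the added flexibility in the tails comes at no extra cost in administration.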
The issues discussed here also apply to tests that estimate constituent indices from the WAIS rather than (or in addition to) FSIQ (e.g., Grober et al., 1991), and to other reading tests, including the Wechsler Test of Adult Reading (WTAR; Wechsler, 2001), Cambridge Contextual Reading Test (CCRT; Beardsall, 1998), and numerous variants of the Word Accentuation Test (WAT; Del Ser et al., 1997 [WAT Spanish]; Burin et al., 2000 [WAT-Argentina]; Gil et al., 2019 [WAT-Brazil Portuguese]), Test Breve di Intelligenza (Colombo et al., 2002 [TIB-Italy]), and to lexical decision tests like Spot-the-Word (STW; Baddeley et al., 1993; Baddeley & Crawford, 2012), the Swedish Lexical Decision Test (Almkvist et al., 2007), and German Mehrfachwahl-Wortschatz-Intelligenztest (MWT; Lehrl et al., 1995), among others. It has been suggested that the WTAR contains more readily recognized stimuli compared to the NART on average (Bright and van der Linde, 2020), so lower scores corresponding to lower IQ classifications may be even less likely to occur in practice.
The Hopkins Adult Reading Test (HART) provides only regression equations that require demographic information (Schretlen et al., 2009), so cannot be evaluated here. However, the authors of this test indicate that the HART is theoretically less constricted in the range of obtainable IQs than NART-R (Blair and Spreen, 1989), in part because of the inclusion of other variables in the regression equation. Whilst this is true, demographic information, such as age and years of education, may not always be available (e.g., in the case of unidentified patients or those with dementia). Demographic information is similarly required in the USA (NAART) revision proposed by Uttl (2002), the New Zealand (NZ-NART) proposed by Barker-Collo et al. (2011), and the Korean language KART (Yi et al., 2017). However, it has also been found that demographic information explains relatively little additional variance (e.g., Bright and van der Linde, 2018; Bright et al., 2002). NART-SWE (Rolstad et al., 2008) could not be evaluated due to the test and regression equation being kept private for commercial purposes. It is also the case that even the use of demographic variables in a multi-term first-degree polynomial does not solve the problems outlined above, since the fit remains linear and therefore still incurs poor fit at the distribution tails.
As a consequence of (mostly) being in the public domain, all variants of the NART are unofficial in the sense that no standard approval process or quality control mechanisms, beyond academic peer review, are in place. In many cases, publications describing new NART variants include thorough evaluations, including of the difficulty and predictive contribution of individual words, internal consistency and reliability (Osburn, 2000), test-retest reliability (Davidshofer & Murphy, 2005; Smith et al., 1998), inter-rater reliability (Saal et al., 1980), etc. However, what would seem to be critical factors, the upper and lower prediction limits and range of detectable IQs, are not commonly reported, nor is the corollary issue of the in-principle reachability of IQ categories in standard classification systems and the proportion of the target population that falls into these categories. It is also the case that some NART variants are orphaned, in the sense that they have not been recalibrated on the latest revisions of IQ batteries, which may cause their predictive accuracy to drift over time due to the Flynn effect (Flynn, 1987) and variations in word usage. It would seem reasonable to propose that the numerical issues explored here are examined and reported upon in future test variants, and to suggest that current tests are interpreted with caution for patients who are suspected to have had particularly high or low premorbid IQ.
Acknowledgements
None.
Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Competing interests
The authors have no conflicts of interest to declare.