This paper is inspired by the recent article by Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024), which I thoroughly enjoyed reading. I thought the paper was thought-provoking and provided a highly readable and extensive review of the issues around scoring and classical test theory (CTT) with a clarity that has eluded many other papers on the same topic. I think the paper is well-positioned to bridge some of the historical developments in psychometrics to the new generation of psychometricians and quantitative psychologists (like me) who were trained in an environment where latent variable models were well-established and the primary emphasis. Frankly, I wish that a paper like this that lucidly and succinctly distills so many topics and so much history had existed when I first learned about psychometrics.
The treatment by Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) solidifies the mathematical basis for CTT-based sum scores and highlights some pertinent questions that remain in the domain of CTT. Even as someone who is generally skeptical of sum scoring, I have no major issues with the content of their article. Nonetheless, given the practical focus of their article, it seems worthwhile to go beyond what mathematics might permit and to consider potential practical implications of viewing sum scoring as psychometrics’ greatest accomplishment.
As a preface, I cite some of the original authors’ previous work throughout the text. This is in no way intended to be a “gotcha” or an attempt to catch the authors in contradictions, especially because sentiments may have changed in the years since the cited works were published. Rather, the authors are clear communicators who have articulated certain points far more clearly than I can myself, so I rely on the original phrasing to best convey certain ideas. Additionally, the authors’ previous work has greatly influenced the formation of my own perspective (e.g., Borsboom, Reference Borsboom2005, Reference Borsboom2006; Sijtsma, Reference Sijtsma2009), even if my perspective may diverge from the authors’ current perspective. In this way, my intention is to treat arguments in these previous sources at face value, independently of who made them.
In the remainder of this paper, I begin by summarizing some recent reviews of psychometric practices in empirical studies and the recent calls to improve these practices. Then, I consider how CTT-based sum scores align with these calls, with specific attention on (a) using sum scores to predict other variables, (b) using sum scores as an outcome in a subsequent analysis, and (c) how CTT-based sum scores may affect the ability to provide evidence that a particular construct is being captured with minimal assumptions intact. Lastly, I reflect on whether sum scoring—even if mathematically justified—is well-positioned to improve how empirical studies approach measurement and psychometrics and, ultimately, improve our understanding of behavioral phenomena.
Psychometric Practices in Empirical Studies
There have been growing concerns that methodological and statistical practices have adversely impacted the replicability and reproducibility of conclusions in empirical studies within the behavioral sciences (e.g., Nosek et al., Reference Nosek, Hardwicke, Moshontz, Allard, Corker, Dreber and Vazire2022; Pashler & Wagenmakers, Reference Pashler and Wagenmakers2012; Tackett et al., Reference Tackett, Brandes, King and Markon2019). Issues related to p-hacking, researcher degrees of freedom, or hypothesizing after results are known (HARKing) have received widespread attention and have triggered calls for reform of statistical practices (e.g., Rodgers & Shrout, Reference Rodgers and Shrout2018; Simmons et al., Reference Simmons, Nelson and Simonsohn2011; Wicherts et al., Reference Wicherts, Veldkamp, Augusteijn, Bakker, van Aert and van Assen2016). Among these examinations of empirical studies, researchers have recently noted that measurement and psychometrics may be an underappreciated source of replication issues (e.g., Flake & Fried, Reference Flake and Fried2020; Schimmack, 2021; Soland et al., 2022a).
Specifically, scale scores are frequently used without accompanying evidence that scores are necessarily meaningful (e.g., that they capture an intended construct or predict a relevant outcome). For brevity, I refer to this as validity, though I acknowledge that there are different perspectives on the precise definition of validity. In the interest of maintaining a broad focus, I try to avoid those distinctions and proceed from the perspective of an empirical researcher trying to adhere to generally endorsed best practices rather than debating merits of different theoretical notions of what constitutes validity, especially because there appears to be general consensus across competing definitions that scores should mean something, irrespective of one’s perspective on how that evidence should be acquired (e.g., Evers et al., Reference Evers2012; 2015).
Among these review studies, Crutzen and Peters (Reference Crutzen and Peters2017) found that 2% of 288 reviewed health psychology scales provided validity evidence. Flake et al. (Reference Flake, Pek and Hehman2017) reported that 21% of 177 social psychology studies reported validity evidence when using previously established scales and only 2% of 124 author-created scales were accompanied by validity evidence. Flake et al. (Reference Flake, Pek and Hehman2017) also note that 19% of studies edited existing scales without validation of the revised scale. Weidman et al. (Reference Weidman, Steckler and Tracy2017) reported that 69% of 356 scales in emotion research were newly created without validity evidence and Shaw et al. (Reference Shaw, Cloos, Luong, Elbaz and Flake2020) reported that 79% of 43 author-created scales provided no validity evidence. Higgins et al. (Reference Higgins, Kaplan, Deschrijver and Ross2023) reported that 14% of 925 studies using the popular Reading the Mind in the Eyes Test in clinical psychology cited previous validity evidence and 23% provided new validation evidence for their sample (37% in total provided some validity evidence). Maassen et al. (2024) reviewed 918 scales in empirical psychology studies published between 2018 and 2019 that had at least three items, were sum scored, and were compared across groups or time, and found that only 4% considered measurement invariance (which is included as a component of validity in some definitions).
Potentially more problematic is that trends in these practices are largely unchanged from decades past. Qualls and Moss (Reference Qualls and Moss1996) reviewed 2167 measures in APA journals in 1992 and found 32% reported validity evidence. Hogan and Agnello (Reference Hogan and Agnello2004) reviewed studies in APA journals between 1991 and 1995 and found 2% of the 696 studies reported validity evidence. Evers et al. (Reference Evers, Sijtsma, Lucassen and Meijer2010) reviewed tests submitted to COTAN in the Netherlands across time and found that, between 1982 and 2009, the percentage of tests with good, sufficient, and insufficient validity evidence was mostly unchanged across time with only a modest improvement between 1992 and 2000 but no improvement between 2000 and 2009 (the COTAN validity benchmark changed in 1997, so these comparisons are approximate).
As summarized by Flake and Fried (Reference Flake and Fried2020), “The lack of information about measures is a critical problem that could stem from underreporting, ignorance, negligence, misrepresentation, or some combination of these factors. But regardless of why the information is missing, a lack of transparency undermines the validity of psychological science” (p. 457). As succinctly put by Brennan (Reference Brennan and Brennan2006), “validity theory is rich, but the practice of validation is often impoverished” (p. 8). Inattention to measurement and validity is especially pressing with the rise of new data structures like big or intensive longitudinal data and associated machine learning methods, which compound measurement deficiencies and magnify the ‘garbage in, garbage out’ principle of data analysis (e.g., Adjerid & Kelley, Reference Adjerid and Kelley2018; Bleidorn & Hopwood, Reference Bleidorn and Hopwood2019; Jacobucci & Grimm, Reference Jacobucci and Grimm2020; Vogelsmeier et al., 2024).
Inadequate psychometric practices also impact replication efforts because it may be unclear if scores used as outcomes in empirical studies represent their intended construct or if they are merely capturing noise (Flake, Reference Flake2021). Consequently, it can be unclear if failed replication efforts indicate that the theory being tested may not hold or if failed replication is attributable to measurement error (Loken and Gelman, Reference Loken and Gelman2017). Flake et al. (Reference Flake, Davidson, Wong and Pek2022) note that fewer than 10% of replication efforts attempt to provide validity evidence for measures, which makes it difficult to differentiate between these two possibilities. To paraphrase Flake and Fried (Reference Flake and Fried2020), does p-hacking matter if scores are just noise to begin with? That is, statistical analysis is downstream of measurement, so if measurement is deficient, inferences may be inconclusive regardless of the quality and rigor of the ensuing statistical analysis.
Did Empirical Studies Ever Stop Using CTT?
A motivating premise of Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) is that sum scores have been banished to the psychometric mausoleum (p. 84) and that sum scores have lost ground quickly to IRT (p. 89). This is likely true in the context of operational psychometrics and in the methodological research literature, but CTT-based sum scores do not appear to be losing ground in empirical studies.
CTT has a lower barrier to entry and most empirical researchers have been using—and continue to use—CTT. For instance, Embretson (Reference Embretson2004) wrote “the majority of psychological tests still were based on classical test theory” (p. 8). Wilson et al. (Reference Wilson, Allen and Li2006) note, “the CTT approach is by far the most widely known measurement approach, and, in many areas, is the most widely used for instrument development and quality control” (p. i20). When discussing psychometrics in medical research, Blanchin et al. (Reference Blanchin, Hardouin, Neel, Kubis, Blanchard, Mirallié and Sébille2011) noted, “To date, the choice of a statistical strategy for the analysis of such data is usually based on CTT rather than on IRT and seems to more likely rely on the researcher’s practice and familiarity with CTT than on scientific grounds” (p. 826) and Gorter et al. (Reference Gorter, Fox, Apeldoorn and Twisk2016) write “despite the advantages of using IRT, in practice, sum scores are often used in the analysis” (p. 141). In a past Psychometric Society presidential address, Sijtsma (Reference Sijtsma2012) wrote, “IRT is not the norm for test construction even though most psychometricians would prefer its use to the use of CTT” (p. 5).
In a review of industrial-organizational psychology, Foster et al. (Reference Foster, Min and Zickar2017) found that “in spite of the complementary nature of IRT and CTT, current research predominantly utilizes the latter” (p. 478). Foster et al. (Reference Foster, Min and Zickar2017) also found that—even in a psychometrically sophisticated empirical subfield like industrial-organizational psychology—only 31% of a survey of 343 industrial-organizational psychologists responded that they use IRT. Among those who do not, 45% said they believed classical methods worked fine and 21% said they never learned IRT. If empirical researchers are the intended recipients of the message that sum scores based on CTT represent the field’s greatest accomplishment, they may not need additional convincing because CTT already appears to be a popular and primary choice among this audience.
CTT, Sum Scoring, and Reporting Practices
Preferences for latent variable models in the psychometric literature have not necessarily translated to empirical studies, and sum scores motivated by CTT do not appear to have gone—or be going—anywhere. However, because CTT delegates central tasks like dimensionality and invariance assessment to latent variable methods (e.g., Sijtsma et al., Reference Sijtsma, Ellis and Borsboom2024, p. 100), empirical analyses equipped only with CTT tend to be incomplete by current standards. This possibly contributes to the state of psychometric reporting practices described in the previous subsection, whereby empirical studies simply do not engage with or report tasks outside the direct purview of CTT.
Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) note in their discussion that, “it is important to notice that researchers do their best to assemble item sets they believe to share the common core of the attribute of interest. They use psychometric methods such as corrected item-total correlations, principal component analysis, and FA [factor analysis] to assess the homogeneity of their experimental item sets before estimating the sum score’s reliability and other psychometric properties.” (p. 106). I do not dispute that researchers are doing their best and do not believe that poor psychometric practices reported in review studies are out of malice; my presumption would be lack of training (e.g., Aiken et al., Reference Aiken, West and Millsap2008; Howard, Reference Howard2024). However, reviews of psychometric practices in empirical studies do not support this statement, which seems to overestimate the psychometric sophistication underlying sum scoring in empirical studies.
A researcher who uses psychometric methods prior to summing responses and calculating reliability would seem to be in the minority. In fact, Flake et al. (Reference Flake, Pek and Hehman2017) found that 18% of studies reported neither reliability nor validity information, which mirrors the lack of reliability reporting in 19% of studies reported by Crutzen and Peters (Reference Crutzen and Peters2017). These values also track the 19% of North American psychology PhD programs in Aiken et al. (Reference Aiken, West and Millsap2008) that reported their graduates were not equipped to perform a reliability assessment. So, a nontrivial proportion of studies do not consider validity or reliability information before or after summing item responses to create scores.
This prompts my reticence about broadly embracing sum scores for empirical studies—if sum scoring were synonymous with thoughtfully weighing the merits of different approaches and opting for CTT’s mathematical elegance or simplified interpretation, I would be perfectly content and would have no objections (e.g., Stochl et al., Reference Stochl, Fried, Fritz, Croudace, Russo, Knight and Perez2022 provide an exemplary analysis supporting sum scoring). However, this is rarely the case and sum scoring is more often an ad hoc procedure—possibly without considering reliability and probably without considering validity—that is propped up by CTT because it is “a commonly accepted escape route to avoid notorious problems in psychological testing” (Borsboom, Reference Borsboom2005, p. 47).
Prescriptively, there are sound mathematical arguments to support sum scoring, but descriptively, most researchers are not appealing to any of them and instead choose to sum score because it is simple, intuitive, and widely accepted (possibly because reviewers and editors are doing the same thing themselves). Frequently, sum scoring is not a step supported by a broader psychometric plan so much as it is the entirety of the psychometric plan.
The situation feels analogous to the underpants gnomes in South Park (1998), who have a three-phase business plan where Phase 1 is to collect underpants and Phase 3 is profit. The joke is that they cannot figure out the second phase that connects Phases 1 and 3. Applied psychometrics seems to follow a similar plan where Phase 1 is to collect item responses and Phase 3 is to compare people, but there is not always thought or planning dedicated to the second phase (perhaps giving new meaning to (g)nomothetic span; Embretson, Reference Embretson1983). Based on reviews of empirical studies, sum scores often serve as a means to an end and a path of least psychometric resistance to advance from Phase 1 to Phase 3, which employs CTT not so much as a motive as a convenient retroactive absolution.
As discussed toward the end of the current paper, elevating the status of the sum score may unintentionally preserve psychometric illiteracy among empirical researchers whose penchant for sum scores is not informed by potential merits of CTT (or by any psychometric theory) but rather by lack of methodological training, unawareness of alternative methods, or lack of motivation to engage with rigorous psychometric analysis. If researchers sum score based on the reasoning provided in Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024), that would undoubtedly be a benefit for empirical studies. However, if empirical researchers interpret the superlative title of Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) as an endorsement of typical practice, psychometrics might have another obstacle in escaping from the periphery of empirical researchers’ minds because principled approaches to sum scoring and how researchers currently approach sum scoring look rather different.
Of course, the information from these review studies could be interpreted in different ways. One reaction consistent with Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) is that psychometric practice in empirical studies is so poor that there is a benefit to simplification with sum scores and their milder assumptions because there are already enough problems without complex psychometric models. My reaction is that the poor state of psychometric practice in empirical studies is an opportunity for improvement and that endorsing sum scoring may not help disrupt the bleak state of psychometric practices. The sections that follow provide some rationale for my perspective.
Prediction with Sum Scores
Stochastic Ordering
Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) provide a simulation showing a set of conditions where the sum of binary items predicts a single true underlying construct as well as or better than scores estimated from latent variable models, especially in the likely event that the item response function is not precisely known (also see Hopwood & Donnellan, Reference Hopwood and Donnellan2010 for related arguments about benefits of prediction). The take-home message that followed is that sum scores derive their value from predicting external behavior and—because sum scores stochastically order an underlying attribute such that higher sum scores are associated with higher latent variable scores, on average—sum scores can have predictive ability similar to that of an estimated latent variable score (p. 106).
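To make the stochastic ordering property concrete, the following is a minimal simulation sketch of my own (not the authors’ simulation) under the binary, unidimensional conditions where the property holds; the Rasch-type response function, item difficulties, and sample size are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 7                                   # persons and items (about 7 items is typical)
theta = rng.normal(size=n)                       # latent variable
b = np.linspace(-1.5, 1.5, k)                    # item difficulties (Rasch-type model)
p = 1 / (1 + np.exp(-(theta[:, None] - b)))      # response probabilities
x = (rng.uniform(size=(n, k)) < p).astype(int)   # binary item responses
sum_score = x.sum(axis=1)

# The expected latent value increases with the sum score (ordering in expectation)...
for s in range(k + 1):
    print(s, round(theta[sum_score == s].mean(), 2))

# ...but individual latent values still overlap heavily across adjacent sum scores.
print(round(np.mean(theta[sum_score == 3] > np.median(theta[sum_score == 4])), 2))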
As one practical consideration, the stochastic ordering principle is upheld with unidimensional constructs informed by items with binary responses; however, reviews of empirical studies find that most researchers are not using binary response formats or unidimensional constructs. For instance, Flake et al. (Reference Flake, Pek and Hehman2017) reported that 81% of empirical studies collected responses from ordinal Likert-type scales, whereas only 4% used binary response formats. Similarly, Maassen et al. (2024) reported that 5% of studies reported binary response scales and 95% used three or more response options. Jackson et al. (Reference Jackson, Gillaspy and Purc-Stephenson2009) reviewed 1409 factor analyses in psychology journals interested in scale development, construct validation, or measurement modeling and reported that only 9.5% were unidimensional, whereas 73.1% were multidimensional (the remaining percentage focused on models for invariance or multiple groups) while also finding that “the overwhelming majority” of studies used Likert-type items and treated them as continuous (p. 18).
Sum scores do not stochastically order a latent attribute with ordinal items (Borsboom, Reference Borsboom2005, p. 124; Hemker et al., Reference Hemker, Sijtsma, Molenaar and Junker1996, Reference Hemker, Sijtsma, Molenaar and Junker1997), so this property will not necessarily hold in many empirical studies. As noted by Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024), the impact tends to be mild such that the latent variable will still be approximately ordered by sum scores, on average, though the distortion does appear to increase with fewer items and more response options (van der Ark, Reference van der Ark2005); the typical number of items per construct in empirical studies tends to be somewhat small (about 7 in empirical psychology studies; Flake et al., Reference Flake, Pek and Hehman2017; Jackson et al., Reference Jackson, Gillaspy and Purc-Stephenson2009; Maassen et al., 2024). More importantly, stochastic ordering breaks down quickly in the presence of multidimensionality (Borsboom, Reference Borsboom2005, pp. 123–124): correlations between latent variables can rapidly erode correspondence between sum and latent variable scores (Pruzek & Frederick, Reference Pruzek and Frederick1978, p. 262) because sum scores tend to have difficulty incorporating correlations between constructs into the scoring process.
Stochastic ordering supports similar predictive ability of sum scores and latent variable scores in certain contexts (e.g., a bivariate correlation of summed binary responses and an underlying unidimensional construct), but these contexts do not necessarily align with what empirical researchers often possess (e.g., ordinal responses and multidimensionality). It might be hasty to extrapolate the predictive performance of sum scores from a simulation using binary unidimensional data as a general property of sum scores on account of stochastic ordering, given that this property is known to hold in circumstances that may be uncommon in empirical data.
Additionally, it is relevant to note that stochastic ordering is about ranking the expected value of the latent variable score rather than ranking latent variable values of specific individuals, so prediction may be affected depending on the target of inference. This will be discussed next.
Moderated Associations, Individual Prediction, and Heterogeneity
Prediction often extends beyond bivariate correlations and can include nonlinear or moderated relations among multiple variables. Latent variable models can incorporate moderating characteristics into scores (a difficult task for CTT-based scores) to improve predictive ability for individuals when information about rank order of latent variable expected values is inadequate.
The moderated nonlinear factor analysis model (MNLFA, Bauer, Reference Bauer2017; Curran et al., Reference Curran, McGinley, Bauer, Hussong, Burns, Chassin and Zucker2014) is one example that allows all item parameters to be potentially moderated by discrete or continuous variables. This model has been shown to improve predictions over sum scores in both simulated data (Curran et al., Reference Curran, Cole, Bauer, Hussong and Gottfredson2016, Reference Curran, Cole, Bauer, Rothenberg and Hussong2018; Gottfredson et al., Reference Gottfredson, Cole, Giordano, Bauer, Hussong and Ennett2019) and empirical clinical diagnosis data (Coxe and Sibley, Reference Coxe and Sibley2023; Hussong et al., Reference Hussong, Gottfredson, Bauer, Curran, Haroon, Chandler and Springer2019; Morgan-López et al., Reference Morgan-López, Saavedra, Hien, Norman, Fitzpatrick, Ye and Back2023; Soland et al., Reference Soland, McGinty, Gray, Solari, Herring and Xu2022b).
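As a rough schematic of the kind of moderation MNLFA permits (written here for a linear factor model for simplicity; the exact parameterization and link functions in the cited applications may differ), item intercepts and loadings, as well as the factor mean and variance, are expressed as functions of person covariates $\mathbf{z}_i$:

$$ y_{ij} = \nu_j(\mathbf{z}_i) + \lambda_j(\mathbf{z}_i)\,\eta_i + \varepsilon_{ij}, \qquad \nu_j(\mathbf{z}_i) = \nu_{0j} + \boldsymbol{\nu}_{1j}^{\top}\mathbf{z}_i, \qquad \lambda_j(\mathbf{z}_i) = \lambda_{0j} + \boldsymbol{\lambda}_{1j}^{\top}\mathbf{z}_i, $$

$$ \eta_i \sim N\!\left(\alpha_0 + \boldsymbol{\alpha}_1^{\top}\mathbf{z}_i,\; \psi_0\exp\!\left(\boldsymbol{\psi}_1^{\top}\mathbf{z}_i\right)\right). $$

A sum score weights every item identically for every respondent and therefore cannot absorb covariate-dependent measurement structure of this sort, whereas scores from a moderated model can.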
Differences are not trivial—Morgan-López et al. (Reference Morgan-López, Saavedra, Hien, Norman, Fitzpatrick, Ye and Back2023) meta-analyzed 25 post-traumatic stress disorder (PTSD) studies with item-level data and found that diagnostic concordance for individuals’ PTSD was 73% with sum scores but 93% with MNLFA scores. When diagnostic specificity was fixed at 80%, sensitivity was 48% with sum scores but 89% with MNLFA scores. This may occur because, even in conditions where stochastic ordering holds, it only speaks to expectations (i.e., people with higher sum scores have, on average, higher ability than people with a lower sum score), so stochastic ordering is not necessarily satisfactory for predicting or classifying individuals (Zwitser and Maris, Reference Zwitser and Maris2016). Individuals are more likely to be the target of inference in empirical subfields that tend to emphasize psychometrics and scale scores like education and clinical psychology (Speelman et al., Reference Speelman, Parker, Rapley and McGann2024), as is the case of predicting an individual PTSD diagnosis in Morgan-López et al. (Reference Morgan-López, Saavedra, Hien, Norman, Fitzpatrick, Ye and Back2023).
Perhaps this demonstrates the point in Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) that something as simple as a sum score retains remarkably high predictive ability relative to complex models that demand far more computational resources. I also do not want to discount the nontrivial advantages of CTT-based sum scores when sample size is small or possibly when sample size is very large (e.g., where computational demand becomes excessive) because complex models encounter far more problems as a function of sample size at either extreme, whereas CTT-based sum scores scale easily across the entire sample size distribution. Points about potential model misspecification are also insightful as latent variable models are not always specified carefully and there are discernible considerations for matching scoring procedures to the appropriate level of rigor and the audience interpreting scores (e.g., classroom tests are fairly low stakes and less sophisticated scoring methods suffice and are easier to interpret even if they may not be ideal).
Nonetheless, if the target benchmark is prediction in a scientific study, there are several contexts where augmented latent variable models or machine learning methods can have greater predictive accuracy than a sum score (Gonzalez, Reference Gonzalez2021; also see Tay et al., Reference Tay, Woo, Hickman and Saef2020 for discussion of validity of scores derived from machine learning), especially if emphasizing prediction and removing the requirements that a score needs to capture a specific construct or be easily interpretable by a lay audience because criticisms of model complexity are less pertinent. For instance, we have no idea if the scoring model in Morgan-López et al. (Reference Morgan-López, Saavedra, Hien, Norman, Fitzpatrick, Ye and Back2023) is correctly specified, but if we only care about prediction, it does not matter because the scores it produces considerably outperform an unweighted sum of PTSD symptoms.
Directly Using Items as Predictors
Outside of latent variable modeling, there is an emerging literature on the predictive benefit of using individual items as predictors over a sum of items (e.g., Donnellan et al., 2023; Fried & Nesse, Reference Fried and Nesse2014; McClure et al., Reference McClure, Ammerman and Jacobucci2024; Müller et al., Reference Müller, Hopwood, Skodol, Morey, Oltmanns, Benecke and Zimmermann2023; Revelle, Reference Revelle2024). For instance, McClure et al. (Reference McClure, Ammerman and Jacobucci2024) found that when using Beck’s Depression Inventory II to predict suicidal ideation, using the sum score as a predictor yielded an $R^2$ of 0.20 but using individual items as predictors had an $R^2$ of 0.38. And when predicting suicidal ideation from the Patient Health Questionnaire-9 (PHQ-9), the $R^2$ using the sum score as a predictor was 0.39 versus an $R^2$ of 0.58 when using individual items as predictors.
Müller et al. (Reference Müller, Hopwood, Skodol, Morey, Oltmanns, Benecke and Zimmermann2023) found comparable results when using individual or summed personality disorder criteria for predicting several different outcome variables. In Müller et al. (Reference Müller, Hopwood, Skodol, Morey, Oltmanns, Benecke and Zimmermann2023), predictive performance was especially different when comparing prediction using individual criteria to prediction using a sum of criteria across all syndromes (versus sums of criteria intended to represent a single syndrome). Rather than creating a composite predictor by summing item responses according to a predefined weighting scheme (often equal weighting), using the item responses as predictors directly can permit heterogeneous predictive contributions of different items (Fried, Reference Fried2015; Fried and Nesse, Reference Fried and Nesse2015) without requiring assumptions about dimensionality or assuming that a single construct underlies item responses. If the objective is predictive accuracy with minimal assumptions about scores, using individual items as predictors may be a more attractive option than a sum.
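The following is a small illustrative sketch of this idea using simulated data (not the BDI-II or PHQ-9 results cited above); the factor structure, item weights, and sample size are arbitrary choices of mine. When items contribute heterogeneously to the outcome, item-level predictors recover more criterion variance than a single equally weighted sum.

import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(7)
n, k = 2000, 9
common = rng.normal(size=n)                                  # shared construct underlying the items
items = 0.7 * common[:, None] + rng.normal(size=(n, k))      # item responses (illustrative)
true_w = np.array([0.8, 0.6, 0.4, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0])
y = items @ true_w + rng.normal(scale=1.5, size=n)           # outcome driven mostly by a few items

def r2(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])               # add intercept
    yhat = X1 @ lstsq(X1, y, rcond=None)[0]
    return 1 - np.var(y - yhat) / np.var(y)

print("sum score as predictor    :", round(r2(items.sum(axis=1, keepdims=True), y), 2))
print("individual item predictors:", round(r2(items, y), 2))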
Sampling Variability
To reorient to a topic that is related to the previous subsection but not directly related to arguments in Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024), preference for sum scores is sometimes based on arguments that they are consistent across studies, whereas estimated scores from latent variable models are built upon parameter estimates that have sampling variability (e.g., Russell, Reference Russell2002). Wainer (Reference Wainer1976) generally made this argument by showing that loss of predictive accuracy in regression is minimal if regression coefficients associated with standardized predictors are replaced with $+1$, $0$, or $-1$ (see also, Cohen, Reference Cohen1990, p. 1306). This argument is often extended to measurement models to support replacing estimated weights from factor analysis with equal weights as in a sum score.
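As a rough illustration of Wainer’s point (my own sketch, not his analysis), the code below compares the predictive accuracy of estimated regression weights with that of unit weights under conditions similar to those he assumed (roughly uncorrelated standardized predictors with coefficients in a limited range); the specific values are arbitrary, and the next paragraph explains why this benign result may not carry over to measurement models.

import numpy as np

rng = np.random.default_rng(3)
n, k = 5000, 6
X = rng.normal(size=(n, k))                          # (approximately) uncorrelated standardized predictors
beta = rng.uniform(0.25, 0.75, size=k)               # weights in the range Wainer assumed
y = X @ beta + rng.normal(scale=2.0, size=n)

def r2_with_y(weights):
    return np.corrcoef(X @ weights, y)[0, 1] ** 2    # squared correlation of composite with outcome

b_hat = np.linalg.lstsq(X, y, rcond=None)[0]         # estimated (least squares) weights
print("estimated weights:", round(r2_with_y(b_hat), 3))
print("unit weights     :", round(r2_with_y(np.ones(k)), 3))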
However, Pruzek and Frederick (Reference Pruzek and Frederick1978) showed that some assumptions made by Wainer (Reference Wainer1976) in the regression context (e.g., predictors are uncorrelated; standardized weights are uniformly distributed over [.25, .75]) may not readily extend to measurement models. Whereas predictors in linear regression explain a fixed amount of variance in a single outcome, measurement models are multivariate such that each item has a separate amount of variance that can be explained by a latent variable. Pruzek and Frederick (Reference Pruzek and Frederick1978) note that this affects the tenability of assumptions upon which arguments for equivalent predictive accuracy with equal weights are based. They show examples where there can be meaningful loss in predictive accuracy when estimated weights are replaced with equal weights. Loss of accuracy will not necessarily occur (e.g., if the range of standardized coefficients is limited), but conditions encountered in factor analysis are more susceptible to loss than linear regression.
Somewhat ironically, replacing estimated coefficients with equal weights is not commonly practiced in regression where it has stronger support for retaining equivalent predictive accuracy. However, replacing estimated coefficients with equal weights is more common in measurement contexts despite somewhat less support that resulting scores will be comparable (e.g., that the calculated scores will relate comparably to the true construct).
Concern about sampling variability is legitimate, but these concerns can be selectively applied: sampling variability is sometimes used to justify equally weighted sum scores, only for researchers to proceed to a prediction stage in which those sum scores enter a regression model with freely estimated coefficients, where sampling variability of the coefficient estimates is suddenly no longer a concern. If sampling variability is concerning, why estimate regression weights in the prediction model instead of constraining them to be equal or to predefined values to avoid sampling variability as in the scoring model?
Concerns about sampling variability and out-of-sample performance in measurement models may also be mitigated by more recently developed methods like regularization (e.g., Huang, Reference Huang2022; Jacobucci et al., Reference Jacobucci, Grimm and McArdle2016; Li & Jacobucci, Reference Li and Jacobucci2022; Liang & Jacobucci, Reference Liang and Jacobucci2020), incorporating sampling variability into scoring (Tsutakawa and Johnson, Reference Tsutakawa and Johnson1990), and fixing weights to values from a previous validation (Kim, Reference Kim2006; König et al., Reference König, Khorramdel, Yamamoto and Frey2021).
Prediction is meaningful for contexts where scores are used as independent variables, but it may not always be as useful in situations where the scores are intended to be an outcome in a subsequent analysis. The next section reviews the literature on using sum scores as an outcome in empirical analyses.
Sum Scores as Dependent Variables
The stochastic ordering property of sum scores is remarkable for its simplicity, but the fact that it yields scores on an ordinal scale potentially limits performance if scores are subsequently used in analyses where the interest is quantifying group differences or change over time rather than bivariate correlations (Reise and Henson, Reference Reise and Henson2003). It is important to note that high correlations between sum scores and latent variable scores do not imply equivalence of subsequent performance (McNeish, Reference McNeish2023a) because Pearson correlations are largely insensitive to monotonic transformations (Reise and Waller, Reference Reise and Waller2009). Altman and Bland (Reference Altman and Bland1983) emphasize that high correlations between two methods do not imply agreement between two methods, which was demonstrated by Gonzalez et al. (Reference Gonzalez, MacKinnon and Muniz2021) who showed that two scores correlating .998 could still have meaningfully different correlations with a third variable. So, the common finding that sum scores and latent variable scores are highly correlated does not guarantee that they will have interchangeable performance or conclusions when using different types of scores as an outcome in subsequent analyses.
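A small sketch of this general point (purely illustrative, not the Gonzalez et al. example itself): two scoring schemes can correlate near 1.00 with each other yet relate noticeably differently to a third variable; the weighting difference and sample size below are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
n = 100000
construct = rng.normal(size=n)
covariate = rng.normal(size=n)                    # third variable, unrelated to the construct
score_a = construct                               # one scoring scheme
score_b = construct + 0.15 * covariate            # slightly different weighting

r = lambda u, v: round(np.corrcoef(u, v)[0, 1], 3)
print("corr(score_a, score_b)  :", r(score_a, score_b))    # ~0.99
print("corr(score_a, covariate):", r(score_a, covariate))  # ~0.00
print("corr(score_b, covariate):", r(score_b, covariate))  # ~0.15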
Previous studies have reported poorer performance of sum scores to detect underlying effects, trends, or group differences in different modeling contexts like regression discontinuity (Soland et al., Reference Soland, Johnson and Talbert2023), growth modeling (Edwards and Wirth, Reference Edwards and Wirth2009; Edwards and Soland, Reference Edwards and Soland2024; Fraley et al., Reference Fraley, Waller and Brennan2000; Gorter et al., Reference Gorter, Fox, Riet, Heymans and Twisk2020; Kuhfeld and Soland, Reference Kuhfeld and Soland2022; Luningham et al., Reference Luningham, McArtor, Bartels, Boomsma and Lubke2017; Proust-Lima et al., Reference Proust-Lima, Philipps, Dartigues, Bennett, Glymour, Jacqmin-Gadda and Samieri2019; Tang et al., Reference Tang, Schalet, Peipert and Cella2023), randomized or clinical trials (Gorter et al., Reference Gorter, Fox, Apeldoorn and Twisk2016; Kuhfeld and Soland, 2023; Kessels et al., Reference Kessels, Moerbeek, Bloemers and van Der Heijden2021; Soland, Reference Soland2022), machine learning (Gonzalez, Reference Gonzalez2021; Jacobucci and Grimm, Reference Jacobucci and Grimm2020), time-series and intensive longitudinal analysis (Vogelsmeier et al. Reference Vogelsmeier, Vermunt, van Roekel and De Roover2019, Reference Vogelsmeier, Vermunt, Keijsers and De Roover2021, 2022), growth mixture modeling (Soland et al., 2024), and partial least squares for formative latent variables (Hair et al., Reference Hair, Sharma, Sarstedt, Ringle and Liengaard2024).
Ramsay and Wiberg (Reference Ramsay and Wiberg2017) note that sum scores in some application areas can congregate at extreme scale points and form floor or ceiling effects (e.g., Pelt et al., Reference Pelt, Schwabe and Bartels2023; Schwabe & Van den Berg, Reference Schwabe and van den Berg2014; Van den Oord et al., Reference van den Oord, Pickles and Waldman2003) and alternative scoring methods—even under stochastic ordering—can improve prediction through better scaling (also see, Proust-Lima et al., Reference Proust-Lima, Dartigues and Jacqmin-Gadda2011; Van den Oord & Van der Ark, Reference van den Oord and van der Ark1997). Relatedly, Liu and Wang (Reference Liu and Wang2021) found that a majority of studies in flagship education and psychology journals using t-tests (57%) and ANOVA (70%) have unaccounted floor or ceiling effects in their outcome variable, which can emerge from reliance on sum scores as dependent variables and can result in distorted inferences.
Maxwell and Delaney (Reference Maxwell and Delaney1985) found that monotonic transformations of underlying latent variable scores (i.e., the way sum scores relate to latent variable scores) were insufficient for t-tests to be accurate, and Wilcoxon rank-sum tests were needed (i.e., sum scores need to be treated as ordinal in subsequent analyses, a rare occurrence in practice). Maassen et al. (2024) mention that group comparisons with sum scores may also be complicated by possible noninvariance (see also, Slof-Op’t Landt et al., Reference Slof-Op’t Landt, van Furth, Rebollo-Mesa, Bartels, van Beijsterveldt, Slagboom and Dolan2009), especially because researchers infrequently evaluate invariance, possibly because CTT does not have a strong framework for invariance assessment (e.g., Wilson et al., Reference Wilson, Allen and Li2006).
Several studies have shown that using sum scores as an outcome in the common context of models with interaction terms such as factorial ANOVA or moderated multiple regression worsens performance (Embretson, Reference Embretson1996; Kang and Waller, Reference Kang and Waller2005; Morse et al., Reference Morse, Johanson and Griffeth2012; Murray et al., Reference Murray, Molenaar, Johnson and Krueger2016). This also applies when a model features an interaction of two sum scores as predictors and their measurement error is not accounted for (Hsiao et al., Reference Hsiao, Kwok and Lai2018). Note that the sum scores compared in these studies are often raw sums of responses, but performance can be improved with transformation of sum scores (e.g., Murray et al., Reference Murray, Molenaar, Johnson and Krueger2016) and arguments in Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) are compatible with transformed sum scores.
The main point is that extending the stochastic ordering property of sum scores beyond bivariate correlations does not necessarily translate into accurate conclusions about underlying associations if the intent is for sum scores to be included as an outcome in a subsequent statistical model. When sum scores are used as outcomes, traditional models expect interval data where spacing and variability matter for proper inference, aspects which may not be preserved by sum scores that only maintain average rank ordering (ordinal responses and multidimensionality, for which stochastic ordering does not exactly hold, can amplify differences in performance).
Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) present an interesting case of network models. Though I am admittedly not well-versed in this area, my basic understanding is that the network itself is typically the main interest, as opposed to latent variable models where the interest is often to understand some latent structure so that scores can be calculated and passed on to the next stage of analysis or use. Stochastic ordering may not be sufficient when scores are passed on to models where spacing and variability are important. However, when the network itself is the focal interest, sum scoring may be more attractive because the stochastic ordering property is much more appealing than it may be in contexts where a measurement model serves essentially as a preprocessing step to a subsequent focal analysis.
Unlike when scores are used as a predictor, sum scores used as outcomes (outside of network models) more often implicitly convey that the score represents a specific construct. It is often prudent to provide evidence that a score is accurately capturing the intended construct prior to using the score in a subsequent statistical model. Many of the studies cited in this section concerned sums of items that were known to have a single underlying construct (e.g., because the data were simulated). However, empirical studies must gather evidence to establish a link between scores and a specific construct prior to using scores as an outcome. Considerations during this process are discussed in the next section.
Scores Intended to Capture a Specific Construct
Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) emphasize that CTT is not necessarily intended to capture a specific construct: “the mathematics of [CTT] are independent of how one wishes to interpret the model …CTT is a truly minimal model in terms of assumptions…the crucial insight is that CTT can operate in the absence of assumptions regarding the test’s dimensionality or factorial composition” (p. 100). Essentially, CTT guarantees the right to create composite scores by summing but does not guarantee that the resulting score will necessarily correspond to a specific desired construct or any construct at all.
The focus on correlation and external prediction with sum scores therefore makes inherent sense—CTT does not necessarily specify a specific underlying construct, so the utility of CTT-based scores (like sum scores) can be derived from the extent to which they relate to or predict a relevant external target (e.g., Kane, Reference Kane and Brennan2006). Nonetheless, a common intention is for scores to represent a particular construct (e.g., Sijtsma et al., Reference Sijtsma, Ellis and Borsboom2024, p. 106), which requires some evidence to establish a connection between item responses and the intended construct.
Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) emphasize that CTT does not exclude dimensionality restrictions whereby a certain construct may underlie responses (p. 87) but that “CTT does not allow the assessment of dimensionality simply because this is not part of CTT and that, therefore, researchers need to use dimensionality assessment methods from outside of CTT” (p. 100). This is helpful, but potentially introduces a conflict between maintaining CTT’s minimal assumptions and employing methods outside CTT to demonstrate whether scores plausibly represent a particular construct.
For instance, if factor analysis were applied to extract evidence that a particular construct was underlying the item responses, several additional assumptions would be needed—from the list of assumptions not made by CTT (Sijtsma et al., Reference Sijtsma, Ellis and Borsboom2024, p. 99), this would include Item 4 (scores do not need to satisfy a dimensional model), Item 5 (scores do not need to reflect the same attribute), possibly Item 6 (errors do not need to be independent), Item 7 (there are no distributional assumptions), and possibly Item 8 (error variances are the same for every person).
These assumptions are indeed unnecessary to create the scores or to assess reliability of scores, but they are needed to provide crucial support from a method like factor analysis that scores are a plausible representation of a particular construct. Perhaps it is not entirely accurate to attribute these additional assumptions to CTT directly since they are technically made by a supporting method like factor analysis, but it also does not seem entirely accurate to assert that CTT necessarily makes minimal assumptions if its use in some contexts depends on an accompanying assumption-laden method. In other words, does CTT embody the assumptions of the methods used to justify it? How does dependence between CTT and the method used to justify it impact CTT assumptions?
More broadly, there appears to be a distinction between the purest version of CTT and the dimensionality-restricted version of CTT that many researchers seek when they want to interpret scores as reflecting a particular construct. Defense of CTT and CTT-based sum scores is appropriately quick to highlight the minimal assumptions under which scores can be created and reliability can be defined. However, a qualification is that the purest version of CTT is agnostic to what the score actually captures and is impervious to quantitatively evaluating aspects like dimensionality because the purest form of CTT “has no room for the important and challenging psychometric question of how theoretical attributes are related to observations” (Borsboom, Reference Borsboom2006, p. 430). CTT does not exclude such examinations, but it does not necessarily require or encourage them either. Omission of validity aspects from defenses of CTT-based sum scoring is therefore not missing at random—CTT was not built to accommodate validation efforts because CTT does not concede that a theoretical attribute or construct exists.
Bringing dimensionality restrictions into CTT comes at the expense of additional assumptions—there is a trade-off between minimal assumptions and scores representing something specific. As noted by Borsboom and Mellenbergh (Reference Borsboom and Mellenbergh2004), “the classical test theory model is largely untestable unless auxiliary assumptions, such as equal error variances across subjects, are imposed, and it is certainly never tested in actual research.” (p. 108). Anecdotally, the motivation of McNeish and Wolf (Reference McNeish and Wolf2020) was to outline a method to facilitate testing whether dimensionality-restricted sum scores justifiably capture a single construct in empirical research, which requires additional assumptions beyond those required for the purest form of CTT. I later came across Beauducel and Leue (Reference Beauducel and Leue2013), which follows the same theme but suggests a different set of constraints that correspond to justifying a dimensionality-restricted sum score (also see Rose et al., Reference Rose, Wagner, Mayer and Nagengast2019 for a third alternative specification of a similar idea).
There are arguments that a latent variable model is a type of Wittgenstein’s ladder such that its purpose is to justify a sum score, but—after having done that—the model is no longer useful and need not inform each person’s value or position on the construct. There is merit to this idea and precedent for viewing a sum score as a coarse version of a predicted construct score (e.g., Grice & Harris, Reference Grice and Harris1998; Grice, Reference Grice2001). That is, once evidence for dimensionality is obtained, parameter estimates from a latent variable model used in dimensionality assessment can be discarded for scoring.
A possible counterargument may be that some prevailing definitions of validity consider it as a property of scores (e.g., AERA, APA, & NCME, 2014). There are several methods to predict scores for unobservable constructs (e.g., DiStefano et al., Reference DiStefano, Zhu and Mindrila2019), but not all methods imply the same reproduced interitem covariance matrix (Beauducel, Reference Beauducel2007; Beauducel & Hilger, Reference Beauducel and Hilger2020), whose comparison to the observed interitem covariance matrix serves as the basis of factor analytic fit. Using the labels from Grice (Reference Grice2001), “refined” methods that incorporate estimated factor loadings for weighting scores tend to have equivalent reproduced matrices (Beauducel, Reference Beauducel2007), but “coarse” scores that use a simplified weighting scheme like an unweighted sum do not (Beauducel & Hilger, Reference Beauducel and Hilger2020; Beauducel & Leue, Reference Beauducel and Leue2013).
From this validity perspective, the argument is that the scoring model and the factor model are not necessarily independent and changing the scoring method can change validity evidence gleaned from factor analytic fit (e.g., Embretson, Reference Embretson2007, p. 453; Thissen, Reference Thissen, Steinberg, Pyszczynski and Greenberg1983, p. 215). If validity is considered a property of scores, scores based on different weighting schemes—even from the same factor structure—may not necessarily have interchangeable validity evidence. As put by Edwards and Wirth (Reference Edwards and Wirth2009): “there is indeed something odd about the common practice of using factor analysis to establish the dimensionality of a scale but then ignoring the parameter estimates themselves when creating scale scores. Statements about the adequacy of a model from a factor analytic standpoint may not apply when the parameters from that model are ignored.” (pp. 84–85; also see Schreiber, Reference Schreiber2021, p. 1009).
As convincingly argued in Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024), deferring to latent variable models to justify sum scores provides sufficient—but not necessary—conditions for sum scoring because a latent variable model imposes a more restrictive set of assumptions than required by the purest form of CTT. This argument makes complete sense if there is no intent for scores to represent a specific construct. However, the pure form of CTT has so few assumptions because it is indifferent to what the scores represent and because constructs are not represented in the model. Validity and justification of scores is therefore not an essential feature.
If moving to a dimensionality-restricted version of CTT where constructs are presumed to be present and validity is more consequential, it becomes difficult to maintain minimal assumptions and avoid moving toward latent variable conceptualizations. As put by Borsboom (Reference Borsboom2005), “classical test theory does not formulate a serious account of measurement, and therefore is inadequate to deal with the question of validity. In fact, if it begins to formulate such an account, it invokes a kind of embryonic latent variable model” (p. 144). Of course, when this was written, network models and component models were much less developed and much less visible in behavioral sciences, so alternatives to CTT have expanded in the intervening 20 years. Though these methods reduce reliance on the conventional reflective latent variable model like factor analysis, the general point that some other method must accompany dimensionality-restricted CTT remains relevant.
In short, minimal assumptions are great for assessing reliability but can be a detriment for assessing aspects of validity because the purest form of CTT is insulated from such evaluation. There seems to be a disconnect between the abstract notion of what mathematics permits and the practical context of what researchers are doing. The schism between mathematics and empirical applications is highlighted by Borsboom (Reference Borsboom2005), who wrote that CTT is “so enormously detached from common interpretations of psychological constructs, that the statistics based on it appear to have very little relevance for psychological measurement” (p. 47).
It is unclear how sum scores motivated by the purest form of CTT fit into recent calls for greater attention to validation and greater transparency in psychometric reporting because its definitions do not seem interested in or capable of addressing such questions without deferring to latent variable methods and the additional assumptions they impose.
Assumptions and Intent
This situation seems analogous to the role of assumptions in ordinary least squares to estimate parameters in linear regression. Say we have a linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$ where $\mathbf{y}$ is the outcome vector, $\mathbf{X}$ is a matrix of predictors, $\boldsymbol{\beta}$ is a regression coefficient vector, and $\mathbf{e}$ is a vector of errors. The solution for $\boldsymbol{\beta}$ that minimizes the squared differences between the observed data and predicted values is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$, which only assumes no perfect collinearity (i.e., $(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}$ exists) and, for asymptotic consistency, exogeneity (i.e., $E(\mathbf{e} \mid \mathbf{X}) = 0$).
Regression lines can be fit through data without any of the typical assumptions taught in introductory statistics like independence, normality, or homoskedasticity. However, as soon as inferential evaluations of the model are of interest (e.g., whether a regression coefficient is 0 in the population), the trio of assumptions encompassed by $\mathbf{e} \sim N(0, \sigma^{2}\mathbf{I})$ is needed. Regression lines can be created while maintaining few assumptions, but they cannot always be evaluated while maintaining few assumptions.
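The code below is a minimal sketch of this distinction (the error structure and coefficient values are illustrative choices of mine): the least squares solution and predictions are computable regardless of how the errors behave, whereas the classical standard errors and tests lean on the additional assumptions.

import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])          # intercept plus one predictor
e = rng.standard_t(df=3, size=n) * (1 + np.abs(X[:, 1]))       # deliberately non-normal, heteroskedastic errors
y = X @ np.array([1.0, 0.5]) + e

# (X'X)^{-1} X'y requires only that X'X is invertible
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("beta_hat:", beta_hat.round(2))

# Classical inference additionally leans on e ~ N(0, sigma^2 I),
# which the errors above intentionally violate
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
print("classical standard errors:", se.round(2))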
Though the motivation and context are different, the general idea seems related to CTT-based sum scores. According to CTT, sum scores can be created with few assumptions. However, construction of relevant statistical tests or procedures to evaluate aspects of scores other than reliability is difficult while maintaining few assumptions (e.g., dimensionality or invariance). Promoting CTT-based sum scores on the basis that the underlying mathematics do not require many assumptions is defensible, just like using linear regression via ordinary least squares for prediction is defensible without independence, normality, or homoskedasticity.
But these arguments are contextual. Just as minimal assumptions to create a regression line with ordinary least squares are unhelpful to researchers interested in inference (which is presumably why students are usually taught that ordinary least squares assumes $\mathbf{e} \sim N(0, \sigma^{2}\mathbf{I})$), the minimal assumptions of CTT-based sum scores may not be helpful to empirical researchers who want to use scores to represent a particular construct because assumptions are the cost of interpreting scores in a specific way.
Validation Without Latent Variables
Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) make a strong case for the latent variable model as the justification for sum scores, so it is unsurprising that the minority of empirical studies that do report validity evidence tend to emphasize methods like factor analysis. For instance, recent reviews of validity reporting in empirical studies find that the percentage of studies presenting evidence based on factor analysis or internal structure is 90% (Shear & Zumbo, 2014), 89% (Collie & Zumbo, 2014), 92% (Gunnell et al., 2014), 85% (Chinni & Hubley, 2014), and 77% (Hubley et al., 2014).
Nonetheless, researchers could work around reliance on latent variable models and their additional assumptions by employing methods from the content (e.g., Aiken, Reference Aiken1980; Mislevy et al., Reference Mislevy, Almond and Lukas2003; Sireci, Reference Sireci1998a, Reference Sireci1998b; Sireci & Faulkner-Bond, Reference Sireci and Faulkner-Bond2014) or response process families of validation methods (e.g., Embretson, Reference Embretson1983; Mislevy et al., Reference Mislevy, Steinberg, Almond, Irvine and Kyllonen2002; Padilla & Benítez, Reference Padilla García and Benítez Baena2014). These methods provide qualitative evidence that the items cover relevant aspects of the intended construct or that respondents interpret the items in a manner consistent with how the attribute is defined, respectively, and they could conceivably be applied without imposing the additional assumptions required by quantitative approaches (though, ideally, quantitative and qualitative sources would be provided together for a more holistic validation).
Based on reviews of reporting practices, this type of validation evidence is exceedingly rare in empirical studies. For instance, a review of papers in the Journal of Educational Psychology between 2000 and 2010 by Collie and Zumbo (2014) found that 16% reported evidence of test content and 0% reported response process evidence. Hubley et al. (2014) found that, of articles published in Psychological Assessment and the European Journal of Psychological Assessment between 2010 and 2012, 2% reported response process evidence and 0% reported evidence of test content. If the minimal assumptions of CTT-based sum scores are a key advantage to retain even when the goal is to capture a specific construct, then the importance of content and response process validation, and how to collect such evidence, may need to be better communicated to empirical researchers.
Will Sum Scoring Help Empirical Studies?
The properties and underlying mathematics of CTT and sum scores are interesting to contemplate, but they ultimately may be less relevant than what CTT and sum scoring can offer to empirical researchers. For instance, if IRT had been developed first and CTT came afterward, would CTT be defended as vigorously based on the merits of what it offers, or does the affinity for CTT in some part come from its historical context, the elevated status of those who initially conceived the idea, or the fact that it is simply easier to apply? Borsboom (Reference Borsboom2006) presents this point articulately, saying:
In an alternative world, where classical test theory never was invented, the first thing a psychologist, who has proposed a measure for a theoretical attribute, would do is to spell out the nature and form of the relationship between the attribute and its putative measures. ... This would lead the researcher to start the whole process of research by constructing a psychometric model. After this, the question would arise which parts of the model structure can be tested empirically, and how this can best be done. Currently, however, this rarely happens. In fact, the procedure often runs in reverse. (p. 429).
If (a) latent variable models provide the basis for sum scores that CTT itself does not provide (Sijtsma et al., Reference Sijtsma, Ellis and Borsboom2024, p. 97), (b) latent variable models provide or imply validity evidence for sum scores that CTT does not provide (Sijtsma et al., Reference Sijtsma, Ellis and Borsboom2024, p. 100), and (c) latent variable models allow additional assessments like differential item functioning and invariance in ways that CTT could not realize (Sijtsma et al., Reference Sijtsma, Ellis and Borsboom2024, pp. 106–107); at what point do we consider the latent variable approach as a more complete framework for typical empirical settings, especially in light of recent research showing that using individual items often has greater predictive validity than a sum of item responses?
This dissonance emerges in one possible CTT-based workflow that satisfies modern psychometric standards like those from AERA, APA, and NCME (2014), in which (a) scores are motivated by the minimal assumptions of CTT, (b) scores are validated by conducting a factor analysis that introduces several assumptions to evaluate dimensionality (and, possibly, invariance), and (c) the assumptions and parameter estimates from the factor analysis are disregarded, the sum score is used as an estimate of the underlying construct, and its reliability is reported with coefficient alpha.
In such a case, is it still accurate to maintain that the scores are based on minimal assumptions if CTT leans on factor analysis or auxiliary assumptions to provide evidence that the scores have the intended dimensionality? Why discard the information the factor analysis provides about the relative weights of the items (and upon which factor analytic fit may rely) in favor of a predefined weighting scheme? If factor analysis and its assumptions are needed for the validation portion of the analysis, why not make these assumptions from the outset and operate entirely within a factor analysis framework, given that it similarly has mechanisms for reliability estimation and scoring (and that scores created from factor analysis weights are as reliable as or more reliable than scores created from unit weights; Hancock & Mueller, Reference Hancock, Mueller, Cudeck, du Toit and Sorbom2001; Li et al., Reference Li, Rosenthal and Rubin1996)?
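The workflow in question can be sketched in a few lines of Python with simulated, purely hypothetical item responses. The sketch computes a unit-weighted sum score and coefficient alpha, fits a one-factor model (using scikit-learn's exploratory FactorAnalysis as a convenient stand-in for the confirmatory unidimensional model an applied researcher might fit), and then forms the loading-weighted composite that the sum-score workflow discards; the sample size, loadings, and continuous item responses are illustrative assumptions, not values from any real study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)

# Simulated item responses (hypothetical): one factor, 6 items, unequal loadings
n_persons, n_items = 500, 6
loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.5, 0.4])
eta = rng.normal(size=n_persons)
items = eta[:, None] * loadings + rng.normal(size=(n_persons, n_items)) * np.sqrt(1 - loadings**2)

# (a) CTT-style sum score: unit weights assigned in advance
sum_score = items.sum(axis=1)

# Coefficient alpha from the item covariance matrix
S = np.cov(items, rowvar=False)
k = n_items
alpha = (k / (k - 1)) * (1 - np.trace(S) / S.sum())

# (b) One-factor model whose estimated loadings the workflow then disregards
fa = FactorAnalysis(n_components=1).fit(items)
est_loadings = fa.components_.ravel()
if est_loadings.sum() < 0:            # factor orientation is arbitrary; flip if needed
    est_loadings = -est_loadings

# (c) Composite that actually uses the estimated (unequal) weights
weighted_score = items @ est_loadings

# The two scores correlate highly but are not identical when loadings are unequal
r = np.corrcoef(sum_score, weighted_score)[0, 1]
print(round(alpha, 3), round(r, 3))
```

The point of the sketch is not that the two composites diverge dramatically, but that the estimated weights and the model that produced them are available at step (b) and are simply set aside at step (c).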
CTT can only take the analysis so far before it must outsource remaining steps to another method. Dimensionality-restricted versions of CTT that are of interest in many empirical contexts where scores are intended to capture a specific construct seem to offer limited benefit over starting with factor analysis or IRT and building evidence for or against sum scores entirely within one of those frameworks. It seems equally fitting to describe this process as factor analysis with coarse factor scores as it does to describe it as CTT.
It would be odd to fit a regression model, use the assumptions encompassed by $\mathbf{e} \sim N(\mathbf{0}, \sigma^{2}\mathbf{I})$ to compute standard errors for inferential testing, find that all predictors are plausibly non-null in the population, and then declare that the regression model did not assume $\mathbf{e} \sim N(\mathbf{0}, \sigma^{2}\mathbf{I})$ because the initial regression line created without assumptions was upheld. Similarly, at least to me, it seems odd to create sum scores based on CTT, use the assumptions of factor analysis to assess dimensionality, find that a unidimensional model is plausible based on some criterion like RMSEA $< .06$, and then declare that the scores rely on minimal assumptions.
Relatedly, in a regression context, it would feel unconventional to fit a model with uniquely estimated coefficients and use its $R^{2}$ to describe the predictive ability or fit of a different model whose coefficients are constrained to be equal. However, in psychometrics, it is routine to use factor analytic fit from a model with unequal weights as evidence for the dimensionality of equally weighted sum scores, even though their model-implied covariance matrices may not be identical.
Again, there are no issues with applying constraints to avoid sampling variability or to prioritize consistency between studies, but aspects of the bias-variance trade-off are at play. In regression, using a penalized method like the lasso reduces sampling variability by reining the coefficients toward zero, but a side effect is that $R^{2}$ typically decreases because the price of reducing sampling variance is an increase in bias (i.e., to obtain estimates with smaller between-sample variability, predicted values sit slightly further from the observed values in the data). In a measurement context, many are happy to accept the lower between-sample variance associated with equal weights, but they are not as keen to embrace the associated price that model fit is slightly worse because the scores are a little less closely related to the construct (higher bias).
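The lasso side of the analogy can be illustrated with a short, hypothetical simulation: the penalized fit has shrunken coefficients and a lower in-sample $R^{2}$ than ordinary least squares, which is the bias paid for reduced sampling variance. The penalty value and data-generating setup below are arbitrary choices made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(3)

# Simulated data (hypothetical): 20 predictors, only the first 5 carry signal
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.r_[np.full(5, 0.4), np.zeros(p - 5)]
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # penalty reins coefficients toward zero

# In-sample R^2: the penalized model trades a lower value (more bias)
# for coefficient estimates with smaller between-sample variability
print(round(ols.score(X, y), 3), round(lasso.score(X, y), 3))
```

Fixing item weights to be equal plays the same role as the penalty here: the scores become more stable across samples, at the cost of fitting each particular sample a little less well.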
Of course, latent variable models have their own weaknesses and are not universally appropriate. I would be one of the first people in line to criticize how factor analytic fit is evaluated (e.g., McNeish, Reference McNeish2023b, Reference McNeish2023c); latent variable models tend to be accompanied by little substantive theory, which can hamper their utility (e.g., Fried, Reference Fried2020; Eronen & Bringmann, Reference Eronen and Bringmann2021); and latent variable models encourage overemphasizing quantitative components of validity (e.g., Alexandrova & Haybron, Reference Alexandrova and Haybron2016; Peters & Crutzen, 2024; Wolf, 2023). Any method applied without purpose and thought will have deficiencies, and replacing uncritical sum scoring with uncritical use of factor analysis or IRT will do little to remedy current psychometric issues in empirical studies. I realize that much of this text defends factor analysis, but I am in no way convinced that factor analysis should serve as the go-to method, and it has many problems that other methods can circumvent. Whereas the psychometric literature has historically pitted CTT against reflective latent variable models like factor analysis, the breadth of options is expanding: recent work has revisited the often-overlooked area of formative constructs and component modeling (Hwang et al., Reference Hwang, Cho, Jung, Falk, Flake, Jin and Lee2021; Rhemtulla et al., Reference Rhemtulla, van Bork and Borsboom2020; also see Hair et al., Reference Hair, Sharma, Sarstedt, Ringle and Liengaard2024 for a discussion of sum scoring formative constructs), and network models offer a complementary way to assess interitem covariances (e.g., Christensen et al., Reference Christensen, Golino and Silvia2020; Epskamp et al., Reference Epskamp, Rhemtulla and Borsboom2017). The broader point is that entering an analysis having already decided to sum score, and then using other methods to work backward and justify summing, seems to negate the insight and contributions that could be garnered by starting with these methods from square one and building evidence for a particular scoring method or underlying structure.
If the focus is the practical issue of improving psychometric practices in empirical studies to improve our collective knowledge about behavioral phenomena, then defaulting to CTT-based sum scores does not seem like an entirely effective strategy because CTT is not self-contained and its precise niche in modern psychometric analysis is somewhat unclear. CTT does not provide a complete framework for validity without exporting steps of the analysis. Sum score prediction beyond bivariate correlations can often be worse than simply using the individual items as predictors (which similarly requires few assumptions). CTT excels at reliability estimation if the score does not necessarily need to capture a specific construct, but if a specific construct is desired and its relation to the item responses is thought to be reflective, then factor analysis and its assumptions, which are already summoned for validation, can also evaluate reliability.
CTT-based sum scores are not worthless, but the scope of situations in which they excel or are an optimal approach seems rather narrow relative to the interests of many empirical studies. With continuing expansion and refinement of alternative methods, there is more opportunity than ever to strive toward a deeper understanding of the psychological processes behind item responses or patterns of multiple item responses. Embracing sum scores, and the agnosticism about what a score captures that is concomitant with CTT, might limit engagement with new approaches that can provide original conceptualizations of behavioral phenomena. Overreliance on the appeal and simplicity of sum scoring is partially responsible for some of the deficient measurement practices common in empirical studies, given that summing responses has been the dominant approach in empirical studies for quite some time. Continuing down the same path, even with renewed mathematical justification, seems likely to maintain the status quo.
Final Remarks
Borsboom (Reference Borsboom2006) has a great closing line: “The current practice of psychological measurement is largely based on outdated psychometric techniques... I suggest we work as hard as possible to facilitate the emergence of a new generation of researchers who are not afraid to confront the measurement problem in psychology” (p. 438). I understand the intent of Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) was to vouch for a stronger balance in psychometric approaches and an appreciation for classical methods, because the psychometric literature is dominated by latent variables and often beats up on CTT (perhaps unfairly) to motivate the novelty of new methods. For readers whose focus is psychometrics, Sijtsma et al. (Reference Sijtsma, Ellis and Borsboom2024) accomplish their task with exceptional clarity.
Conversely, for empirical researchers who are content to stay within the confines of CTT, influential references from luminaries in the field declaring sum scoring as the greatest accomplishment in psychometrics may inadvertently foster—rather than confront—continued dominance of classical approaches in scenarios where more modern approaches are advantageous or even necessary to complement CTT and justify sum scoring. The net effect may be further lowered motivation to learn or consider modern methods if the perception is that existing psychometric practice—which is often based on uncritical use of sum scores—is suitable.
Whenever there is momentum to move empirical researchers away from classic methods that dominate in empirical studies, the pendulum seems to forcefully swing back to defend CTT and CTT-based scores, leaving empirical researchers incapacitated and unsure how to proceed. Meanwhile, psychometricians and quantitative psychologists act surprised every time a new review paper shows that (a) empirical researchers do not seriously engage with psychometrics, (b) empirical researchers do not think learning modern psychometric methods is worthwhile, and (c) trends in desirable psychometric practices in empirical studies have been flat for decades.
Despite the latent-variable-heavy state of the psychometric literature, most empirical researchers never stopped using sum scores, and they often do not accompany their sum scores with relevant validity evidence when scores are intended to capture a specific construct. There are interesting and meritorious mathematical arguments for sum scoring under minimal assumptions, but the contexts in which many empirical researchers work (e.g., Likert responses, multiple constructs, intent for scores to represent specific constructs) can be orthogonal to the contexts under which these mathematical arguments optimally apply, and mathematical arguments supporting CTT-based sum scoring do little to promote the importance of currently neglected practices like validity assessment that affect the dependability of empirical studies. To be clear, I do not think sum scores are universally bad, and it is entirely possible to build strong psychometric cases for sum scoring. However, I also think that uncritical use of sum scores by uninitiated empirical researchers undeservedly receives a pass, partly on the basis of CTT.
My apprehension is that well-intentioned defenses of sum scoring will be interpreted by empirical researchers as reassurance to continue to avoid engaging with serious psychometric endeavors because the perceived message may be that the status quo is easy and sufficient. Sum scores can certainly be defended, but many instances of sum scoring in empirical studies are motivated by the simplicity of sum scores rather than any psychometric theory, evidence, or arguments. Of course, it is plausible that psychometrics really is as simple as summing responses, and I am the naïvely optimistic one who merely wants modern methods to be better to give our field credence and to show the empirical researchers, biostatisticians, and econometricians that psychometrics did not peak in the 1960s and that we have something meaningful to contribute to scientific discourse.
Nonetheless, psychometricians and quantitative psychologists could benefit from changing the objective function that we seek to maximize with our work. Rather than emphasizing what mathematics might allow, we can better frame our arguments to (a) help empirical researchers understand how psychometrics can improve understanding of behavioral phenomena and (b) be more cognizant of the challenges facing empirical researchers by meeting them where they are.
To adapt a line from Angrist (Reference Angrist2004), psychometrics is too important to be left entirely to psychometricians (p. 201). At its core, psychometrics is an inherently applied discipline, and scores are the foundational unit of analysis in many subfields of behavioral science. Reviews of empirical studies find that (a) sum scoring still dominates, (b) the importance of validity is rarely embraced, and (c) little thought is generally put into creating scores despite their central role in subsequent analyses. Reinforcing a commonly applied approach will likely result in more of the same and seems unlikely to curb deficient psychometric practices in empirical studies.
Defense and support of CTT and CTT-based scores is a legitimate, mathematically justified position for psychometricians who can appreciate nuances and who are comfortable working at a certain level of abstraction. However, CTT and CTT-based sum scoring were not designed with validity in mind and potentially make validation less approachable to empirical researchers who are already struggling to provide validity evidence for their scores. Ultimately, approval of CTT-based sum scoring by psychometricians may be misconstrued by, and unhelpful for, empirical researchers who are on the front lines of behavioral research because “few, if any, researchers in psychology conceive of psychological constructs in a way that would justify the use of classical test theory as an appropriate measurement model” (Borsboom, Reference Borsboom2005, p. 47).
Declarations
Conflict of interest
The author did not receive support from any organization for the submitted work, and the author has no financial or nonfinancial conflict of interest to disclose.