
High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance

Published online by Cambridge University Press:  01 January 2025

Steven Andrew Culpepper
Affiliation:
University of Illinois at Urbana–Champaign
Herman Aguinis*
Affiliation:
George Washington University
Justin L. Kern
Affiliation:
University of Illinois at Urbana–Champaign
Roger Millsap
Affiliation:
Arizona State University
*Correspondence should be made to Steven Andrew Culpepper, Department of Statistics, University of Illinois at Urbana–Champaign, Champaign, IL, USA. Email: [email protected]

Abstract

The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. Accordingly, we examined differences across groups based on latent rather than observed scores, using data for 176 colleges and universities from The College Board. Results showed that measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts for Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on a common factor had observed SAT-M scores nearly a third of a standard deviation lower than those of comparable White students. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, nearly a quarter of the statistically significant observed intercept differences were no longer statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is not simply to assess whether observed group-based differences exist, along with their size and direction, but rather to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.
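The core idea behind the 2SLS approach described above — regressing a criterion on a latent predictor by instrumenting one fallible indicator with the others — can be illustrated with a minimal sketch. The data, loadings, and variable names below are invented for illustration and are not the paper's actual model or College Board data; the sketch only shows why an ordinary least squares slope on an error-laden test score is attenuated, while the 2SLS slope is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical latent predictor (e.g., math ability) and three fallible indicators.
eta = rng.normal(size=n)
x1 = eta + rng.normal(scale=0.6, size=n)         # scaling indicator (loading fixed at 1)
x2 = 0.8 * eta + rng.normal(scale=0.6, size=n)   # extra indicators used as instruments
x3 = 0.7 * eta + rng.normal(scale=0.6, size=n)

# Criterion (e.g., first-year GPA) generated from the latent score with slope 0.4.
y = 0.5 + 0.4 * eta + rng.normal(scale=0.5, size=n)

# OLS of y on the observed x1: attenuated, because x1 carries measurement error.
X = np.column_stack([np.ones(n), x1])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS: x2 and x3 correlate with eta but not with x1's error, so they serve
# as instruments for x1. Stage 1 projects x1 onto the instruments; stage 2
# regresses y on the projected values.
Z = np.column_stack([np.ones(n), x2, x3])
x1_hat = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
Xhat = np.column_stack([np.ones(n), x1_hat])
b_2sls = np.linalg.lstsq(Xhat, y, rcond=None)[0]

print(b_ols[1], b_2sls[1])  # 2SLS slope is near the true 0.4; OLS is smaller
```

With these made-up loadings the OLS slope converges to 0.4/1.36 ≈ 0.29, while the 2SLS slope is consistent for the true 0.4 — the same mechanism by which some observed intercept differences in the study lost significance once predictor error was accounted for.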

Type
Original Paper
Copyright
Copyright © 2019 The Psychometric Society

Footnotes

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11336-018-9649-2) contains supplementary material, which is available to authorized users.

Roger Millsap passed away unexpectedly on May 9, 2014 due to a brain hemorrhage. This article is the product of our collective work involving conceptualization, data collection and analysis, and writing. We dedicate the article to him.

We thank Alberto Maydeu-Olivares, a Psychometrika associate editor, and two anonymous reviewers for their excellent recommendations, which allowed us to improve our manuscript in a substantial manner.
