
Penalized Best Linear Prediction of True Test Scores

Published online by Cambridge University Press: 01 January 2025

Lili Yao*
Affiliation:
Educational Testing Service
Shelby J. Haberman
Affiliation:
Edusoft
Mo Zhang
Affiliation:
Educational Testing Service
*Correspondence should be made to Lili Yao, Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08540, USA. Email: [email protected]

Abstract

In best linear prediction (BLP), a true test score is predicted from observed item scores and from ancillary test data. If the use of BLP rather than a more direct estimate of a true score has disparate impact for different demographic groups, then a fairness issue arises. To improve population invariance while preserving much of the efficiency of BLP, a modified approach, penalized best linear prediction, is proposed that weights both the mean square error of prediction and a quadratic measure of subgroup biases. The proposed methodology is applied to three high-stakes writing assessments.
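As a concrete illustration of the penalization idea, the criterion below is a minimal sketch consistent with the abstract; the coefficient vector $\beta$, subgroup weights $w_g$, and penalty parameter $\lambda$ are assumed notation for exposition, not the authors' own.

% Sketch of a penalized BLP criterion (illustrative notation only).
% \tau is the true score, x collects observed item scores and ancillary data,
% E_g denotes the expectation within demographic subgroup g, w_g >= 0 weights
% subgroup g, and \lambda >= 0 trades efficiency against population invariance.
\[
\min_{\beta}\;
\mathrm{E}\bigl[(\tau - x^{\top}\beta)^{2}\bigr]
\;+\;
\lambda \sum_{g} w_{g}\,\bigl(\mathrm{E}_{g}\bigl[\tau - x^{\top}\beta\bigr]\bigr)^{2}
\]

Under this reading, $\lambda = 0$ recovers ordinary BLP, while larger $\lambda$ shrinks the subgroup prediction biases toward zero at some cost in mean square error.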

Type
Original Paper
Copyright
Copyright © 2018 The Psychometric Society

