
Penalized Best Linear Prediction of True Test Scores

Published online by Cambridge University Press: 01 January 2025

Lili Yao*
Affiliation:
Educational Testing Service
Shelby J. Haberman
Affiliation:
Edusoft
Mo Zhang
Affiliation:
Educational Testing Service
*Correspondence should be made to Lili Yao, Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08540, USA. Email: [email protected]

Abstract

In best linear prediction (BLP), a true test score is predicted from observed item scores and from ancillary test data. If the use of BLP rather than a more direct estimate of a true score has disparate impact for different demographic groups, then a fairness issue arises. To improve population invariance while preserving much of the efficiency of BLP, a modified approach, penalized best linear prediction, is proposed that weights both the mean square error of prediction and a quadratic measure of subgroup biases. The proposed methodology is applied to three high-stakes writing assessments.
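As a concrete illustration of the penalization idea, the criterion below is a minimal sketch consistent with the abstract; the coefficient vector $\beta$, subgroup weights $w_g$, and penalty parameter $\lambda$ are assumed notation for exposition, not the authors' own.

% Sketch of a penalized BLP criterion (illustrative notation only).
% \tau is the true score, x collects observed item scores and ancillary data,
% E_g denotes the expectation within demographic subgroup g, w_g >= 0 weights
% subgroup g, and \lambda >= 0 trades efficiency against population invariance.
\[
\min_{\beta}\;
\mathrm{E}\bigl[(\tau - x^{\top}\beta)^{2}\bigr]
\;+\;
\lambda \sum_{g} w_{g}\,\bigl(\mathrm{E}_{g}\bigl[\tau - x^{\top}\beta\bigr]\bigr)^{2}
\]

Under this reading, $\lambda = 0$ recovers ordinary BLP, while larger $\lambda$ shrinks the subgroup prediction biases toward zero at some cost in mean square error.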

Type
Original Paper
Copyright
Copyright © 2018 The Psychometric Society

