
References

Published online by Cambridge University Press: 22 February 2024

Sandip Sinharay, Educational Testing Service, New Jersey
Richard A. Feinberg, National Board of Medical Examiners, Pennsylvania

Type: Chapter
In: Subscores: A Practical Guide to Their Production and Consumption (pp. 158–168)
Publisher: Cambridge University Press
Print publication year: 2024


Ackerman, T., & Shu, Z. (2009). Using confirmatory MIRT modeling to provide diagnostic information in large scale assessment. Paper presented at the meeting of the National Council on Measurement in Education, San Diego, CA.
ACT. (2022). ACT technical manual. Iowa City, IA: ACT.
Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47–76. https://doi.org/10.3102/10769986022001047
Albanese, M. A. (2014). The testing column: Differences in subject area subscores on the MBE and other illusions. The Bar Examiner, 83(2), 26–31.
Almond, R., Steinberg, L., & Mislevy, R. (2002). Enhancing the design and delivery of assessment systems: A four-process architecture. The Journal of Technology, Learning and Assessment, 1(5). https://ejournals.bc.edu/index.php/jtla/article/view/1671
American Board of Internal Medicine Maintenance of Certification (ABIM MOC). (2023). Enhanced score report. www.abim.org/Media/f4pp1das/score-report.pdf
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In Thorndike, R. L. (Ed.), Educational measurement (pp. 508–600). Washington, DC: American Council on Education.
Armed Services Vocational Aptitude Battery (ASVAB). (2023). Understanding your ASVAB results. www.asvabprogram.com/media-center-article/28
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204. https://doi.org/10.2307/1165169
Bell, R., & Lumsden, J. (1980). Test length and validity. Applied Psychological Measurement, 4(2), 165–170. https://doi.org/10.1177/014662168000400203
Bertin, J. (1983). Semiology of graphics: Diagrams, networks, maps (Translated into English by Berg, W. J.). Madison: University of Wisconsin Press.
Biancarosa, G., Kennedy, P. C., Carlson, S. E., Yoon, H., Seipel, B., Liu, B., & Davison, M. L. (2019). Constructing subscores that add validity: A case study of identifying students at risk. Educational and Psychological Measurement, 79(1), 65–84. https://doi.org/10.1177/0013164418763255
Brennan, R. L. (2012). Utility indexes for decisions about subscores (CASMA Research Report No. 33). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment.
Brinton, W. C. (1939). Graphic presentations. New York: Brinton.
Brown, G. T. L., O’Leary, T. M., & Hattie, J. A. C. (2019). Effective reporting for formative assessment: The asTTle case example. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 107–125). New York: Routledge. https://doi.org/10.4324/9781351136501-11
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3), 296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x
Bulut, O., Davison, M. L., & Rodriguez, M. C. (2017). Estimating between-person and within-person subscore reliability with profile analysis. Multivariate Behavioral Research, 52(1), 86–104. https://doi.org/10.1080/00273171.2016.1253452
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83–87. https://doi.org/10.2307/2682801
Choi, I., & Papageorgiou, S. (2020). Evaluating subscore uses across multiple levels: A case of reading and listening subscores for young EFL learners. Language Testing, 37(2), 254–279. https://doi.org/10.1177/0265532219879654
Comprehensive Clinical Science Examination (CCSE). (2023). Examinee performance report. www.nbme.org/sites/default/files/2022-12/CCSE_Examinee_Performance_Report_2022.pdf
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/bf02310555
Cronbach, L. J., Schönemann, P., & McKie, D. (1965). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291–312. https://doi.org/10.1177/001316446502500201
CTB/McGraw-Hill. (2001). TerraNova, the second edition: Individual profile report. Monterey, CA: Author.
Dai, S., Svetina, D., & Wang, X. (2017). Reporting subscores using R: A software review. Journal of Educational and Behavioral Statistics, 42, 617–638. https://doi.org/10.3102/1076998617716462
Dai, S., Wang, X., & Svetina, D. (2019). Subscore: Sub-score computing functions in classical test theory (R Package Version 3.1) [Computer software]. http://CRAN.R-project.org/package=subscore
Davison, M. L., Davenport, E. C., Chang, Y.-F., Vue, K., & Su, S. (2015). Criterion-related validity: Assessing the value of subscores. Journal of Educational Measurement, 52, 263–279. https://doi.org/10.2307/43940571
DiBello, L. V., Roussos, L., & Stout, W. F. (2006). Review of cognitive diagnostic assessment and a summary of psychometric models. In Rao, C. R., & Sinharay, S. (Eds.), Handbook of statistics, Vol. 26 (pp. 979–1030). Amsterdam: Elsevier Science B.V. https://doi.org/10.1016/s0169-7161(06)26031-0
Dorans, N. J., & Walker, M. E. (2007). Sizing up linkages. In Dorans, N. J., Pommerich, M., & Holland, P. W. (Eds.), Linking and aligning scores and scales (pp. 179–198). New York: Springer. https://doi.org/10.1007/978-0-387-49771-6_10
Draper, N. R., & Smith, H. (1998). Applied regression analysis. New York: Wiley. https://doi.org/10.1002/9781118625590
DuBois, P. H. (1970). A history of psychological testing. Boston: Allyn & Bacon.
Duolingo English Test. (2023). Sample certificate. https://englishtest.duolingo.com/sample_certificate
Dwyer, A., Boughton, K. A., Yao, L., Steffen, M., & Lewis, D. (2006, April). A comparison of subscale score augmentation methods using empirical data. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco, CA.
Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22, 15–25. https://doi.org/10.1177/001316446202200103
Educational Testing Service. (2008). Praxis™ 2008–09 information bulletin. Princeton, NJ: Educational Testing Service.
Educational Testing Service. (2020). TOEFL® research insight series, Volume 3: Reliability and comparability of TOEFL iBT® scores. Princeton, NJ: Author.
Educational Testing Service. (2021). The Praxis study companion, elementary education: Content knowledge. Princeton, NJ: Educational Testing Service.
Edwards, M. C., & Vevea, J. L. (2006). An empirical Bayes approach to subscore augmentation: How much strength can we borrow? Journal of Educational and Behavioral Statistics, 31, 241–259. https://doi.org/10.3102/10769986031003241
Everitt, B. (2011). Cluster analysis. Chichester, UK: Wiley.
Every Student Succeeds Act, 20 U.S.C. § 6301 (2015). www.congress.gov/bill/114th-congress/senate-bill/1177
Feinberg, R. A., & Clauser, A. L. (2016). Can item keyword feedback help remediate knowledge gaps? Journal of Graduate Medical Education, 8(4), 541–545. https://doi.org/10.4300/jgme-d-15-00463.1
Feinberg, R. A., & Jurich, D. P. (2017). Guidelines for interpreting and reporting subscores. Educational Measurement: Issues and Practice, 36(1), 5–13. https://doi.org/10.1111/emip.12142
Feinberg, R. A., & von Davier, M. (2020). Conditional subscore reporting using the compound binomial distribution. Journal of Educational and Behavioral Statistics, 45(5), 515–533. https://doi.org/10.3102/1076998620911933
Feinberg, R. A., & Wainer, H. (2011). Extracting sunbeams from cucumbers. Journal of Computational and Graphical Statistics, 20(4), 793–810. https://doi.org/10.1198/jcgs.2011.204a
Feinberg, R. A., & Wainer, H. (2014). When can we improve subscores by making them shorter? The case against subscores with overlapping items. Educational Measurement: Issues and Practice, 33(3), 47–54. https://doi.org/10.1111/emip.12037
Feinberg, R. A., & Wainer, H. (2014). A simple equation to predict a subscore’s value. Educational Measurement: Issues and Practice, 33(3), 55–56. https://doi.org/10.1111/emip.12035
Flanagan, J. C. (1948). The aviation psychology program in the Army Air Forces (Report 1, AAF Aviation Psychology Program Research Reports). US Government Printing Office, pp. xii+316.
Fleiss, J. L. (1975). Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31, 651–659. https://doi.org/10.2307/2529549
Friendly, M., & Wainer, H. (2021). A history of data visualization and graphic communication. Cambridge, MA: Harvard University Press. https://doi.org/10.4159/9780674259034
George, A. C., Robitzsch, A., Kiefer, T., Gross, J., & Uenlue, A. (2016). The R package CDM for cognitive diagnosis models. Journal of Statistical Software, 74(2), 1–24. https://doi.org/10.18637/jss.v074.i02
Goodman, D. P., & Hambleton, R. K. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17, 145–220. https://doi.org/10.1207/s15324818ame1702_3
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Part I. Journal of the American Statistical Association, 49, 732–764. https://doi.org/10.2307/2281536
Haberman, S. J. (2008a). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229. https://doi.org/10.3102/1076998607302636
Haberman, S. J. (2008b). Subscores and validity. ETS Research Report Series (ETS Research Report No. RR-08-64). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2008.tb02150.x
Haberman, S. J. (2008c). Outliers in assessments. ETS Research Report Series (ETS Research Report No. RR-08-41). https://doi.org/10.1002/j.2333-8504.2008.tb02150.x
Haberman, S. J. (2013). A general program for item-response analysis that employs the stabilized Newton-Raphson algorithm. ETS Research Report Series (ETS Research Report No. RR-13-32). https://doi.org/10.1002/j.2333-8504.2013.tb02339.x
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209–227. https://doi.org/10.1007/s11336-010-9158-4
Haberman, S. J., & Sinharay, S. (2013). Does subgroup membership information lead to better estimation of true subscores? British Journal of Mathematical and Statistical Psychology, 66, 451–469. https://doi.org/10.1111/j.2044-8317.2012.02061
Haberman, S. J., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62, 79–95. https://doi.org/10.1348/000711007x248875
Haberman, S. J., & von Davier, M. (2007). Some notes on models for cognitively based skills diagnosis. In Rao, C. R., & Sinharay, S. (Eds.), Handbook of statistics, Vol. 26 (pp. 1031–1038). Amsterdam: Elsevier North-Holland. https://doi.org/10.1016/s0169-7161(06)26040-1
Haberman, S. J., & Yao, L. (2015). Repeater analysis for combining information from different assessments. Journal of Educational Measurement, 52, 223–251. https://doi.org/10.1111/jedm.12075
Haberman, S. J., Yao, L., & Sinharay, S. (2015). Prediction of true test scores from observed item scores and ancillary data. British Journal of Mathematical and Statistical Psychology, 68, 363–385. https://doi.org/10.1111/bmsp.12052
Haladyna, T. M., & Kramer, G. A. (2004). The validity of subscores for a credentialing test. Evaluation and the Health Professions, 27(4), 349–368. https://doi.org/10.1177/0163278704270010
Hambleton, R. K., & Zenisky, A. L. (2013). Reporting test scores in more meaningful ways: A research-based approach to score report design. In Geisinger, K. F. (Ed.), APA handbook of testing and assessment in psychology: Vol. 3. Testing and assessment in school psychology and education (pp. 479–494). Washington, DC: American Psychological Association. https://doi.org/10.1037/14049-023
Harris, D. J., & Hanson, B. A. (1991, April). Methods of examining the usefulness of subscores. Paper presented at the meeting of the National Council on Measurement in Education, Chicago, IL.
Hegarty, M. (2019). Advances in cognitive science and information visualization. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 19–34). New York: Routledge. https://doi.org/10.4324/9781351136501-4
Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In Leighton, J., & Gierl, M. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 19–60). Cambridge: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.002
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272. https://doi.org/10.1177/01466210122032064
Kelley, T. L. (1923). Statistical method. New York: Macmillan.
Kibby, M. W. (1981). Test review: The degrees of reading power. Journal of Reading, 24(5), 416–427. www.jstor.org/stable/40032381
Kolstad, A., Cohen, J., Baldi, S., Chan, T., DeFur, E., & Angeles, J. (1998). The response probability convention used in reporting data from IRT assessment scales: Should NCES adopt a standard? Washington, DC: American Institutes for Research.
LaFlair, G. T. (2020). Duolingo English Test: Subscores (Duolingo Research Report No. DRR-20-03). Duolingo.
Lane, S., Raymond, M. R., Haladyna, T. M., & Downing, S. M. (2015). Test development process. In Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.), Handbook of test development (2nd ed., pp. 3–18). New York: Routledge.
Lazer, S., Mazzeo, J., & Weiss, A., with Campbell, J., Casalaina, L., Horkay, N., Kaplan, B., & Rogers, A. (2001). Final report on enhanced achievement level reporting and scale anchoring activities. Unpublished report prepared on behalf of the National Assessment Governing Board.
Leighton, J. P., & Gierl, M. J. (2007). Cognitive diagnostic assessment for education: Theory and applications. New York: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237. https://doi.org/10.1111/j.1745-3984.2004.tb01163.x
Lim, E., & Lee, W. (2020). Subscore equating and profile reporting. Applied Measurement in Education, 33, 95–112. https://doi.org/10.1080/08957347.2020.1732381
Ling, G. (2012). Why the major field test in business does not report subscores – Reliability and construct validity evidence (ETS Research Report No. RR-08-64). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2012.tb02293.x
Liu, Y., Robin, F., Yoo, H., & Manna, V. (2018). Statistical properties of the GRE® psychology test subscores. ETS Research Report Series. https://doi.org/10.1002/ets2.12206
Longabach, T., & Peyton, V. A. (2018). Comparison of reliability and precision of subscore reporting methods for a state English language proficiency assessment. Language Testing, 35, 297–317. https://doi.org/10.1177/0265532217689949
Longford, N. T. (1990). Multivariate variance component analysis: An application in test development. Journal of Educational Statistics, 15, 91–112. https://doi.org/10.2307/1164764
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lord, F. M., & Wingersky, M. (1984). Comparison of IRT true-score and equipercentile observed-score equatings. Applied Psychological Measurement, 8, 453–461. https://doi.org/10.1177/014662168400800409
Lovett, B. J., & Harrison, A. G. (2021). De-implementing inappropriate accommodations practices. Canadian Journal of School Psychology, 36(2), 115–126. https://doi.org/10.1177/0829573520972556
Luecht, R. (2007). Using information from multiple-choice distractors to enhance cognitive-diagnostic score reporting. In Leighton, J., & Gierl, M. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 319–340). Cambridge: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.011
Luecht, R. (2013). Assessment engineering task model maps: Task models and templates as a new way to develop and implement test specifications. Journal of Applied Testing Technology, 14, 1–38.
Luecht, R. M., Gierl, M. J., Tan, X., & Huff, K. (2006, April). Scalability and the development of useful diagnostic scales. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco, CA.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research, and Evaluation, 14, 1–10.
Margolis, M. J., Clauser, B. E., Winward, M., & Dillon, G. F. (2010). Validity evidence for USMLE examination cut scores: Results of a large-scale survey. Academic Medicine, 85(10), 93–97. https://doi.org/10.1097/acm.0b013e3181ed4028
McDermott, P. A., Glutting, J. J., Jones, J. N., Watkins, M. W., & Kush, J. (1989). Core profile types in the WISC-R national sample: Structure, membership, and applications. Psychological Assessment: A Journal of Consulting and Clinical Psychology, 1, 292–299. https://doi.org/10.1037/1040-3590.1.4.292
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In Zarembka, P. (Ed.), Frontiers in econometrics (pp. 105–142). New York: Academic Press.
Meijer, R. R., Boevé, A. J., Tendeiro, J. N., Bosker, R. J., & Albers, C. J. (2017). The use of subscores in higher education: When is this useful? Frontiers in Psychology, 8, 1–6. https://doi.org/10.3389/fpsyg.2017.00305
Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician, 54(1), 17–24. https://doi.org/10.2307/2685605
Mertler, C. A. (2018). Norm-referenced interpretation. In Frey, B. (Ed.), The SAGE encyclopedia of educational research, measurement, and evaluation (pp. 1161–1163). Thousand Oaks, CA: SAGE. https://doi.org/10.4135/9781506326139.n478
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–67. https://doi.org/10.1207/S15366359MEA0101_02
Morey, L. C. (2004). The Personality Assessment Inventory (PAI). In Maruish, M. E. (Ed.), The use of psychological testing for treatment planning and outcomes assessment: Instruments for adults (pp. 509–551). Mahwah, NJ: Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410610614
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. https://doi.org/10.1177/014662169201600206
National Assessment of Educational Progress (NAEP). (2023). Student groups. https://nces.ed.gov/nationsreportcard/guides/groups.aspx
New York State Testing Program (NYSTP). (2023). NYS grades 3–8 2021 technical report. www.nysed.gov/common/nysed/files/programs/state-assessment/3-8-technical-report-2021w.pdf
Paolino, J. (2020). Teaching linear correlation using contour plots. Teaching Statistics, 43(1), 13–20. https://doi.org/10.1111/test.12239
Papageorgiou, S., & Choi, I. (2018). Adding value to second-language listening and reading subscores: Using a score augmentation approach. International Journal of Testing, 18, 207–230. https://doi.org/10.1080/15305058.2017.1407766
Pashler, H., Cepeda, N. J., Wixted, J. T., & Rohrer, D. (2005). When does feedback facilitate learning of words? Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(1), 3–8. https://doi.org/10.1037/0278-7393.31.1.3
Pearson Longman. (2010). The official guide to PTE: Pearson test of English academic. Hong Kong SAR: Pearson Longman Asia ELT.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5–13. https://doi.org/10.1080/01619561003685304
Personality Assessment Inventory (PAI). (2023). The PAI police and public safety selection report. https://post.ca.gov/portals/0/post_docs/publications/psychological-screening-manual/PAI_PolicePubSftyRpt.pdf
Pieper Bar Review. (2017). Bar examiners to provide (slightly) more information to candidates who fail the bar exam. http://news.pieperbar.com/bar-examiners-to-provide-slightly-more-information-to-candidates-who-fail-the-bar-exam
Praxis. (2023). Interpreting your Praxis® test taker score report. www.ets.org/s/praxis/pdf/sample_score_report.pdf
Puhan, G., & Liang, L. (2011). Equating subscores under the nonequivalent anchor test (NEAT) design. Educational Measurement: Issues and Practice, 30(1), 23–35. https://doi.org/10.1111/j.1745-3992.2010.00197.x
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266–285. https://doi.org/10.1080/08957347.2010.486287
R Core Team. (2022). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. www.R-project.org/
Ramsay, J. O. (1973). The effect of number of categories in rating scales on precision of estimation of scale values. Psychometrika, 38(4, Pt. 1), 513–532. https://doi.org/10.1007/bf02291492
Rasch, G. (1966). An individualistic approach to item analysis. In Lazarsfeld, P. F., & Henry, N. W. (Eds.), Readings in mathematical social science (pp. 89–107). Cambridge, MA: MIT Press.
Raymond, M. R. (2001). Job analysis and the specification of content for licensure and certification examinations. Applied Measurement in Education, 14, 369–415. https://doi.org/10.1207/s15324818ame1404_4
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer. https://doi.org/10.1007/978-0-387-89976-3
Reckase, M. D., & Xu, J. R. (2014). The evidence for a subscore structure in a test of English language competency for English language learners. Educational and Psychological Measurement, 75, 805–825. https://doi.org/10.1177/0013164414554416
Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic assessments. Educational Measurement: Issues and Practice, 29(3), 25–38. https://doi.org/10.1111/j.1745-3992.2010.00181.x
Roussos, L. A., DiBello, L. V., Stout, W. F., Hartz, S. M., Henson, R. A., & Templin, J. H. (2007). The fusion model skills diagnostic system. In Leighton, J., & Gierl, M. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 275–318). New York: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.010
Rupp, A. A., & Templin, J. L. (2009). The (un)usual suspects? A measurement community in search of its identity. Measurement, 7(2), 115–121. https://doi.org/10.1080/15366360903187700
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.
Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association. https://doi.org/10.1037/10244-000
Sawaki, Y., & Sinharay, S. (2018). Do the TOEFL iBT® section scores provide value-added information to stakeholders? Language Testing, 35, 529–556. https://doi.org/10.1177/0265532217716731
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174. https://doi.org/10.1111/j.1745-3984.2010.00106.x
Sinharay, S. (2013). A note on assessing the added value of subscores. Educational Measurement: Issues and Practice, 32, 38–42. https://doi.org/10.1111/emip.12021
Sinharay, S. (2014). Analysis of added value of subscores with respect to classification. Journal of Educational Measurement, 51, 212–222. https://doi.org/10.1111/jedm.12043
Sinharay, S., & Haberman, S. J. (2008). How much can we reliably know about what students know? Measurement: Interdisciplinary Research and Perspectives, 6, 46–49. https://doi.org/10.1080/15366360802715486
Sinharay, S., & Haberman, S. J. (2011). Equating of augmented subscores. Journal of Educational Measurement, 48, 122–145. https://doi.org/10.1111/j.1745-3984.2011.00137.x
Sinharay, S., & Haberman, S. J. (2014). An empirical investigation of population invariance in the value of subscores. International Journal of Testing, 14, 22–48. https://doi.org/10.1080/15305058.2013.822712
Sinharay, S., Haberman, S. J., & Lee, Y.-H. (2011). When does scale anchoring work? A case study. Journal of Educational Measurement, 48(1), 61–80. https://doi.org/10.1111/j.1745-3984.2011.00131.x
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28. https://doi.org/10.1111/j.1745-3992.2007.00105.x
Sinharay, S., Haberman, S. J., & Wainer, H. (2011). Do adjusted subscores lack validity? Don’t blame the messenger. Educational and Psychological Measurement, 71, 789–797. https://doi.org/10.1177/0013164410391782
Sinharay, S., Puhan, G., & Haberman, S. J. (2010). Reporting diagnostic subscores in educational testing: Temptations, pitfalls, and some solutions. Multivariate Behavioral Research, 45, 553–573. https://doi.org/10.1080/00273171.2010.483382
Sinharay, S., Puhan, G., & Haberman, S. J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40. https://doi.org/10.1111/j.1745-3992.2011.00208.x
Sinharay, S., Puhan, G., Haberman, S. J., & Hambleton, R. K. (2019). Subscores: When to communicate them, what are their alternatives, and some recommendations. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 35–49). New York: Routledge. https://doi.org/10.4324/9781351136501-5
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70, 357–375. https://doi.org/10.1177/0013164409355694
Slater, S., Livingston, S. L., & Silver, M. (2019). Score reports for large-scale testing programs. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 91–106). New York: Routledge. https://doi.org/10.4324/9781351136501-10
South Carolina College- and Career-Ready Assessments (SC READY). (2023). Individual student report. https://ed.sc.gov/tests/tests-files/sc-ready-files/spring-2022-sample-individual-student-report-english/
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101. https://doi.org/10.2307/1412159
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x
Spencer, B. D. (Ed.). (1997). Statistics and public policy. Oxford: Clarendon Press.
Stanton, H. C., & Reynolds, C. R. (2000). Configural frequency analysis as a method of determining Wechsler profile types. School Psychology Quarterly, 15(4), 434–448. https://doi.org/10.1037/h0088799
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23, 63–86. https://doi.org/10.1080/08957340903423651
Swanson, D. B., Case, S. M., & Nungester, R. J. (1991). Validity of NBME Part I and Part II scores in prediction of Part III performance. Academic Medicine, 66, S7–S9. https://doi.org/10.1097/00001888-199109001-00004
Tanaka, V. (2023). A framework for reporting technically-sound and useful subscores on state assessments. www.nciea.org/blog/promoting-effective-practices-for-subscore-reporting-and-use/
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354. https://doi.org/10.1111/j.1745-3984.1983.tb00212.x
Theil, H. (1970). On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76, 103–154. https://doi.org/10.1086/224909
Thissen, D. (2013). Using the testlet response model as a shortcut to multidimensional item response theory subscore computation. In Millsap, R., van der Ark, L., Bolt, D., & Woods, C. (Eds.), New developments in quantitative psychology: Presentations from the 77th Annual Psychometric Society Meeting (pp. 29–40). New York: Springer. https://doi.org/10.1007/978-1-4614-9348-8_3
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.
United States Medical Licensing Examination (USMLE). (2023). Updated sample Step 2 CK annual school report. www.nbme.org/sites/default/files/2022-08/2022_Enhanced_USMLE_Step_2_CK_School_Report_Sample.pdf
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307. https://doi.org/10.1348/000711007x193957
Wainer, H. (1984). How to display data badly. The American Statistician, 38(2), 137–147. https://doi.org/10.2307/2683253
Wainer, H. (1997). Visual revelations. New York: Copernicus Press. https://doi.org/10.4324/9780203774793
Wainer, H. (2009). Picturing the uncertain world. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400832897
Wainer, H. (2015). On the crucial role of empathy in the design of communications: Genetic testing as an example. In Truth or truthiness: Distinguishing fact from fiction by learning to think like a data scientist (pp. 82–90). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781316424315.012
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410605931
Wainer, H., & Feinberg, R. A. (2017). For want of a nail: Why unnecessarily long tests may be impeding the progress of Western civilization. In Pitici, M. (Ed.), The best writing on mathematics 2016 (pp. 321–330). Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400885602-030
Wainer, H., Gessaroli, M., & Verdi, M. (2006). Finding what is not there through the unfortunate binning of results: The Mendel Effect. Chance, 19(1), 49–52. https://doi.org/10.1080/09332480.2006.10722771
Wainer, H., & Robinson, D. (2023). Why testing? Why should it cost you? Chance, 36(1), 48–52. https://doi.org/10.1080/09332480.2023.2179281
Wainer, H., Sheehan, K. M., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37, 113–140. https://doi.org/10.1111/j.1745-3984.2000.tb01079.x
Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., et al. (2001). Augmented scores: “Borrowing strength” to compute scores based on small numbers of items. In Thissen, D., & Wainer, H. (Eds.), Test scoring (pp. 343–387). Mahwah, NJ: Erlbaum Associates. https://doi.org/10.4324/9781410604729-16
Wang, X., Svetina, D., & Dai, S. (2019). Exploration of factors affecting the added value of test subscores. Journal of Experimental Education, 87, 179–192. https://doi.org/10.1080/00220973.2017.1409182
Wilson, K. M. (2000). An exploratory dimensionality assessment of the TOEIC test. ETS Research Report Series (ETS Research Report No. RR-00-14). https://doi.org/10.1002/j.2333-8504.2000.tb01837.x
Yao, L., Sinharay, S., & Haberman, S. J. (2014). Documentation for the software package SQE (ETS Research Memorandum No. RM-14-02). Educational Testing Service.
Yen, W. M. (1987). A Bayesian/IRT index of objective performance. Paper presented at the meeting of the Psychometric Society, Montreal, Canada.
Zapata-Rivera, D., VanWinkle, W., & Zwick, R. (2012). Applying score design principles in the design of score reports for CBAL™ teachers (ETS Research Memorandum No. RM-12-20). Princeton, NJ: Educational Testing Service.
Zenisky, A. L., & Hambleton, R. K. (2012). Developing test score reports that work: The process and best practices for effective communication. Educational Measurement: Issues and Practice, 31(2), 21–26. https://doi.org/10.1111/j.1745-3992.2012.00231.x
Zenisky, A. L., & Hambleton, R. K. (2015). A model and good practices for score reporting. In Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.), Handbook of test development (2nd ed., pp. 585–602). New York: Routledge.
Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping on the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15–25. https://doi.org/10.1111/j.1745-3992.2001.tb00059.x
