
Item Complexity: A Neglected Psychometric Feature of Test Items?

Published online by Cambridge University Press: 01 January 2025

Daniel M. Bolt*
Affiliation:
University of Wisconsin, Madison
Xiangyi Liao
Affiliation:
University of Wisconsin, Madison
* Correspondence should be made to Daniel M. Bolt, Department of Educational Psychology, University of Wisconsin, Madison, 1025 W. Johnson, Room 859, Madison, WI 53706, USA. Email: [email protected]

Abstract

Despite its frequent consideration in test development, item complexity receives little attention in the psychometric modeling of item response data. In this address, I consider how variability in item complexity can be expected to emerge in the form of item characteristic curve (ICC) asymmetry, and how such effects may significantly influence applications of item response theory, especially those that assume interval-level properties of the latent proficiency metric and groups that vary substantially in mean proficiency. One application is the score gain deceleration phenomenon often observed in vertical scaling contexts, especially in subject areas like math or second language acquisition. It is demonstrated how the application of symmetric IRT models in the presence of complexity-induced positive ICC asymmetry can be a likely cause. A second application concerns the positive correlation between DIF and difficulty commonly seen in verbal proficiency (and other subject area) tests where problem-solving complexity is minimal and proficiency-related guessing effects are likely pronounced. Here we suggest negative ICC asymmetry as a probable cause and apply sensitivity analyses to demonstrate the ease with which such correlations disappear when allowing for negative ICC asymmetry. Unfortunately, the presence of systematic forms of ICC asymmetry is easily missed due to the considerable flexibility afforded by latent trait metrics in IRT. Speculation is provided regarding other applications for which attending to ICC asymmetry may prove useful.
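To make the ICC asymmetry discussed above concrete, the sketch below contrasts a symmetric two-parameter logistic (2PL) curve with a logistic positive exponent (LPE) style curve, P(theta) = [1 / (1 + exp(-a(theta - b)))]^xi, in which the acceleration parameter xi governs the direction and degree of asymmetry. This is a minimal illustration under assumed parameter values; the function names (icc_2pl, icc_lpe) and the specific numbers are illustrative and not taken from the paper.

```python
import numpy as np

def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def icc_2pl(theta, a, b):
    """Symmetric 2PL ICC: P(theta) = logistic(a * (theta - b))."""
    return logistic(a * (theta - b))

def icc_lpe(theta, a, b, xi):
    """LPE-style ICC: P(theta) = logistic(a * (theta - b)) ** xi.

    xi = 1 recovers the symmetric 2PL. Illustrative values of xi are
    used here to show how the exponent changes the curve's shape.
    """
    return logistic(a * (theta - b)) ** xi

theta = np.linspace(-3.0, 3.0, 7)
print("theta       :", np.round(theta, 2))
print("2PL         :", np.round(icc_2pl(theta, a=1.0, b=0.0), 3))
print("LPE, xi=3   :", np.round(icc_lpe(theta, a=1.0, b=0.0, xi=3.0), 3))
print("LPE, xi=0.3 :", np.round(icc_lpe(theta, a=1.0, b=0.0, xi=0.3), 3))
```

Under this parameterization, xi = 3 produces a curve that stays low and approaches its upper asymptote only gradually, the positively asymmetric shape commonly associated with complex, conjunctive items, while xi = 0.3 yields elevated success probabilities at low theta, the negatively asymmetric shape associated with items prone to guessing.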

Type
Theory & Methods (T&M)
Copyright
Copyright © 2022 The Author(s) under exclusive licence to The Psychometric Society


Footnotes

This paper is based on the Presidential Address given by the first author at the (virtual) 2021 IMPS Annual Meeting.
