
A Statistical Test for Differential Item Pair Functioning

Published online by Cambridge University Press:  01 January 2025

Timo M. Bechger*
Affiliation: Cito
Gunter Maris
Affiliation: University of Amsterdam
*Correspondence should be sent to Timo M. Bechger, Cito, Amsterdamseweg 13, Arnhem, The Netherlands. Email: [email protected]

Abstract

This paper presents an IRT-based statistical test for differential item functioning (DIF). The test is developed for items conforming to the Rasch (Probabilistic models for some intelligence and attainment tests, The Danish Institute of Educational Research, Copenhagen, 1960) model, but we outline its extension to more complex IRT models. It differs from existing procedures in that DIF is defined in terms of the relative difficulties of pairs of items, not in terms of the difficulties of individual items. The argument is that the difficulty of an individual item is not identified from the observations, whereas relative difficulties are. This leads to a test that is closely related to Lord's (Applications of item response theory to practical testing problems, Erlbaum, Hillsdale, 1980) test for item DIF, albeit with a different and more correct interpretation. Illustrations with real and simulated data are provided.
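The identifiability argument in the abstract can be illustrated numerically. Under the Rasch model, each group's item difficulties are estimated only up to an additive constant, so between-group contrasts of individual difficulties are contaminated by the arbitrary normalization, while contrasts of pairwise relative difficulties are not. A minimal sketch of that point, using made-up difficulty values and a hypothetical `pairwise` helper (not code from the paper):

```python
import numpy as np

# Hypothetical Rasch difficulty estimates for four items in two groups.
# Each vector is identified only up to an additive constant; we shift the
# focal group's estimates by an arbitrary c to mimic a different normalization.
b_ref = np.array([-1.0, 0.0, 0.5, 1.5])
b_foc = np.array([-1.0, 0.3, 0.5, 1.5])  # item 2 is 0.3 logits harder
c = 2.7                                   # arbitrary normalization shift
b_foc_shifted = b_foc + c

def pairwise(b):
    """Matrix of relative difficulties b_i - b_j (the identified quantities)."""
    return b[:, None] - b[None, :]

# Item-level contrasts depend on the arbitrary shift ...
item_contrast = b_foc_shifted - b_ref
# ... but contrasts of pairwise relative difficulties do not:
pair_contrast = pairwise(b_foc_shifted) - pairwise(b_ref)

print(item_contrast)        # every entry is contaminated by c
print(pair_contrast[1, 0])  # ~0.3: pair-level DIF, free of the normalization
```

Only the pairwise contrasts isolate the 0.3-logit shift of item 2 relative to the other items, which is the quantity the paper's pair-level test targets.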

Type
Original Paper
Copyright
Copyright © 2014 The Psychometric Society


References

Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Andersen, E.B. (1972). The numerical solution to a set of conditional estimation equations. Journal of the Royal Statistical Society, Series B, 34, 283–301.
Andersen, E.B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140.
Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
Angoff, W.H. (1982). Use of difficulty and discrimination indices for detecting item bias. In Berk, R.A. (Ed.), Handbook of methods for detecting item bias (pp. 96–116). Baltimore: Johns Hopkins University Press.
Angoff, W.H. (1993). Perspective on differential item functioning methodology. In Holland, P.W., & Wainer, H. (Eds.), Differential item functioning (pp. 3–24). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bechger, T.M., Maris, G., & Verstralen, H.H.F.M. (2010). A different view on DIF (R&D Report No. 2010-5). Arnhem: Cito.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord, F.M., & Novick, M.R. (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading: Addison-Wesley.
Bolt, D.M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113–141.
Camilli, G. (1993). The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Holland, P.W., & Wainer, H. (Eds.), Differential item functioning (pp. 397–413). Hillsdale, NJ: Lawrence Erlbaum Associates.
Candell, G.L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.
De Boeck, P. (2008). Random item IRT models. Psychometrika, 73(4), 533–559.
DeMars, C. (2010). Type I error inflation for detecting DIF in the presence of impact. Educational and Psychological Measurement, 70, 961–972.
Dorans, N.J., & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In Holland, P.W., & Wainer, H. (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Engelhard, G. (1989). Accuracy of bias review judges in identifying teacher certification tests. Applied Measurement in Education, 3, 347–360.
Finch, H.W., & French, B.F. (2008). Anomalous Type I error rates for identifying one type of differential item functioning in the presence of the other. Educational and Psychological Measurement, 68(5), 742–759.
Fischer, G.H. (1974). Einführung in die Theorie psychologischer Tests (Introduction to the theory of psychological tests). Bern: Verlag Hans Huber.
Fischer, G.H. (2007). Rasch models. In Rao, C.R., & Sinharay, S. (Eds.), Handbook of statistics: Psychometrics (pp. 515–585). Amsterdam: Elsevier.
Gabrielsen, A. (1978). Consistency and identifiability. Journal of Econometrics, 8, 261–263.
Gierl, M., Gotzmann, A., & Boughton, K.A. (2004). Performance of SIBTEST when the percentage of DIF items is large. Applied Measurement in Education, 17, 241–264.
Glas, C.A.W. (1989). Contributions to estimating and testing Rasch models. Unpublished doctoral dissertation, Cito, Arnhem.
Glas, C.A.W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8, 647–667.
Glas, C.A.W., & Verhelst, N.D. (1995). Testing the Rasch model. In Fischer, G.H., & Molenaar, I.W. (Eds.), Rasch models: Foundations, recent developments and applications (pp. 69–95). New York: Springer.
Glas, C.A.W., & Verhelst, N.D. (1995). Tests of fit for polytomous Rasch models. In Fischer, G.H., & Molenaar, I.W. (Eds.), Rasch models: Foundations, recent developments and applications (pp. 325–352). New York: Springer.
Gnanadesikan, R., & Kettenring, J.R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28, 81–124.
Hessen, D.J. (2005). Constant latent odds-ratios models and the Mantel-Haenszel null hypothesis. Psychometrika, 70, 497–516.
Holland, P., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In Wainer, H., & Braun, H.I. (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Jodoin, G.M., & Gierl, M.J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.
Lord, F.M. (1977). A study of item bias, using item characteristic curve theory. In Poortinga, Y.H. (Ed.), Basic problems in cross-cultural psychology (pp. 19–29). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.
Magis, D., & De Boeck, P. (2012). A robust outlier approach to prevent Type I error inflation in differential item functioning. Educational and Psychological Measurement, 72, 291–311.
Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counterexample with Angoff's Delta plot. Educational and Psychological Measurement, 73, 293–311.
Mahalanobis, P.C. (1936). On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India (Vol. 2, pp. 49–55).
Maris, G., & Bechger, T.M. (2007). Scoring open ended questions. In Rao, C.R., & Sinharay, S. (Eds.), Handbook of statistics: Psychometrics (pp. 663–680). Amsterdam: Elsevier.
Maris, G., & Bechger, T.M. (2009). On interpreting the model parameters for the three parameter logistic model. Measurement, 7, 75–88.
Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–175.
Meredith, W., & Millsap, R.E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311.
Penfield, R.D., & Camilli, G. (2007). Differential item functioning and item bias. In Rao, C.R., & Sinharay, S. (Eds.), Handbook of statistics: Psychometrics. Amsterdam: Elsevier.
Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66, 437–460.
Raju, N. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
Raju, N. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980. Chicago: The University of Chicago Press.)
R Development Core Team. (2010). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org/ (ISBN 3-900051-07-0).
San Martín, E., González, J., & Tuerlinckx, F. (2009). Identified parameters, parameters of interest and their relationships. Measurement, 7, 97–105.
San Martín, E., & Quintana, F. (2002). Consistency and identifiability revisited. Brazilian Journal of Probability and Statistics, 16, 99–106.
San Martín, E., & Rolin, J.-M. (2013). Identifiability of parametric Rasch-type models. Journal of Statistical Planning and Inference, 143, 116–130.
San Martín, E., González, J., & Tuerlinckx, F. (2014). On the unidentifiability of the fixed-effects 3PL model. Psychometrika, 78(2), 341–379.
Shealy, R., & Stout, W.F. (1993). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.
Soares, T.M., Gonçalves, F.B., & Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348–377.
Stark, S., Chernyshenko, O.S., & Drasgow, F. (2006). Detecting differential item functioning with CFA and IRT: Towards a unified strategy. Journal of Applied Psychology, 91, 1292–1306.
Stout, W. (2002). Psychometrics: From practice to theory and back. Psychometrika, 67, 485–518.
Swaminathan, H., & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Tay, L., Newman, D.A., & Vermunt, J.K. (2011). Using mixed-measurement item response theory with covariates (MM-IRT-C) to ascertain observed and unobserved measurement equivalence. Organizational Research Methods, 14, 147–155.
Teresi, J.A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, 152–170.
Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [Computer software manual]. Chapel Hill.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In Wainer, H., & Braun, H.I. (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In Holland, P., & Wainer, H. (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thurman, C.J. (2009). A Monte Carlo study investigating the influence of item discrimination, category intersection parameters, and differential item functioning in polytomous items. Unpublished doctoral dissertation, Georgia State University.
Van der Flier, H., Mellenbergh, G.J., Adèr, H.J., & Wijn, M. (1984). An iterative bias detection method. Journal of Educational Measurement, 21, 131–145.
Verhelst, N.D. (1993). On the standard errors of parameter estimators in the Rasch model (Measurement and Research Department Reports No. 93-1). Arnhem: Cito.
Verhelst, N.D. (2008). An efficient MCMC-algorithm to sample binary matrices with fixed marginals. Psychometrika, 73, 705–728.
Verhelst, N.D., & Eggen, T.J.H.M. (1989). Psychometrische en statistische aspecten van peilingsonderzoek (Psychometric and statistical aspects of assessment surveys) (PPON-rapport No. 4). Arnhem: Cito.
Verhelst, N.D., Glas, C.A.W., & van der Sluis, A. (1984). Estimation problems in the Rasch model: The basic symmetric functions. Computational Statistics Quarterly, 1(3), 245–262.
Verhelst, N.D., Glas, C.A.W., & Verstralen, H.H.F.M. (1994). OPLM: Computer program and manual [Computer software manual]. Arnhem.
Verhelst, N.D., Hatzinger, R., & Mair, P. (2007). The Rasch sampler. Journal of Statistical Software, 20, 1–14.
Verhelst, N.D., & Verstralen, H.H.F.M. (2008). Some considerations on the partial credit model. Psicológica, 29, 229–254.
von Davier, M., & von Davier, A.A. (2007). A unified approach to IRT scale linking and scale transformations. Methodology, 3(3), 16141881.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wang, W.-C. (2004). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. The Journal of Experimental Education, 72(3), 221–261.
Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 74, 1–22.
Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning with the likelihood ratio test. Applied Psychological Measurement, 27(6), 479–498.
Williams, V.S.L. (1997). The "unbiased" anchor: Bridging the gap between DIF and item bias. Applied Measurement in Education, 10(3), 253–267.
Wilson, M., & Adams, R. (1995). Rasch models for item bundles. Psychometrika, 60, 181–198.
Wright, B.D., Mead, R., & Draba, R. (1976). Detecting and correcting item bias with a logistic response model (Tech. Rep.). Chicago: University of Chicago, Department of Education, Statistical Laboratory.
Zwinderman, A.H. (1995). Pairwise parameter estimation in Rasch models. Applied Psychological Measurement, 19, 369–375.
Zwitser, R., & Maris, G. (2013). Conditional statistical inference with multistage testing designs. Psychometrika. https://doi.org/10.1007/s11336-013-9369-6