
Differential Item Functioning Analysis Without A Priori Information on Anchor Items: QQ Plots and Graphical Test

Published online by Cambridge University Press:  01 January 2025

Ke-Hai Yuan
Affiliation:
University of Notre Dame
Hongyun Liu*
Affiliation:
Beijing Normal University
Yuting Han
Affiliation:
Beijing Normal University
* Correspondence should be made to Hongyun Liu, Faculty of Psychology, Beijing Normal University, No. 19, XinJieKouWai St., HaiDian District, Beijing 100875, People’s Republic of China. Email: [email protected]

Abstract

Differential item functioning (DIF) analysis is an important step in establishing the validity of measurements. Most traditional methods for DIF analysis use an item-by-item strategy via anchor items that are assumed to be DIF-free. If the anchor items are flawed, these methods yield misleading results because the scales are biased. In this article, based on the fact that an item’s relative change of difficulty difference (RCD) does not depend on the mean ability of individual groups, a new DIF detection method (RCD-DIF) is proposed that compares the observed differences against those obtained from simulated data known to be DIF-free. The RCD-DIF method consists of a D-QQ (quantile-quantile) plot that permits the identification of internal reference points (similar to anchor items), an RCD-QQ plot that facilitates visual examination of DIF, and an RCD graphical test that synchronizes DIF analysis at the test level with that at the item level via confidence intervals on individual items. The RCD procedure visually reveals the overall pattern of DIF in the test and the size of DIF for each item, and it is expected to work properly even when the majority of the items possess DIF and the DIF pattern is unbalanced. Results of two simulation studies indicate that the RCD graphical test has Type I error rates comparable to those of existing methods but greater power.
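The general idea of comparing observed between-group differences against a DIF-free simulated reference can be sketched in a few lines. The sketch below is not the authors' RCD statistic, D-QQ plot, or graphical test, whose definitions are not given in the abstract. It assumes a Rasch model, uses the negative logit of the proportion correct as a crude difficulty proxy, and centers the differences at their median so that a shift in group mean ability approximately cancels. All function names, sample sizes, and the 1.5-logit DIF on one item are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_rasch(rng, n, difficulties, mean_ability):
    """Draw 0/1 responses from a Rasch model (illustrative generator)."""
    theta = rng.normal(mean_ability, 1.0, size=(n, 1))
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
    return (rng.random(p.shape) < p).astype(int)

def crude_difficulty(responses):
    """Rough difficulty proxy: negative logit of the proportion correct."""
    p = responses.mean(axis=0).clip(0.01, 0.99)
    return -np.log(p / (1.0 - p))

def centered_difference(ref, foc):
    """Between-group difficulty differences, centered at their median so a
    shift in group mean ability (common to all items) largely cancels."""
    d = crude_difficulty(foc) - crude_difficulty(ref)
    return d - np.median(d)

def dif_free_reference(rng, n_ref, n_foc, difficulties, n_rep=200):
    """Sorted centered differences from simulated data with no DIF."""
    sims = np.empty((n_rep, difficulties.size))
    for r in range(n_rep):
        ref = simulate_rasch(rng, n_ref, difficulties, 0.0)
        foc = simulate_rasch(rng, n_foc, difficulties, 0.0)
        sims[r] = np.sort(centered_difference(ref, foc))
    return sims

rng = np.random.default_rng(0)
b = np.linspace(-1.5, 1.5, 10)   # hypothetical item difficulties
b_focal = b.copy()
b_focal[3] += 1.5                # hypothetical DIF: item 3 is harder for the focal group

ref = simulate_rasch(rng, 2000, b, 0.0)
foc = simulate_rasch(rng, 2000, b_focal, -0.5)  # groups also differ in mean ability

observed = centered_difference(ref, foc)
sims = dif_free_reference(rng, 2000, 2000, b)

# Pointwise 95% envelope for each order statistic under no DIF (QQ-style check).
lower, upper = np.quantile(sims, [0.025, 0.975], axis=0)
order = np.argsort(observed)
outside = (observed[order] < lower) | (observed[order] > upper)
flagged = sorted(order[outside].tolist())
print("flagged items:", flagged)
```

Plotting the sorted observed differences against the mean of the simulated order statistics, with the envelope drawn around them, would give a QQ-style display in which items with DIF fall outside the band. The pointwise envelope used here does not control the test-level error rate, which is one of the issues the RCD graphical test described in the abstract is designed to address.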

Type
Theory and Methods
Copyright
Copyright © 2021 The Psychometric Society

Footnotes

K.-H. Yuan: His research focuses on developing better or more valid methods for analyzing messy data or non-standard samples in the social and behavioral sciences. Most of his work is on factor analysis, structural equation modeling, and multilevel modeling.

H. Liu: Her research interests are educational measurement and advanced statistical methods.

Y. Han: Her research interests are psychometrics and educational measurement.

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/s11336-021-09746-5.
