
Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters

Published online by Cambridge University Press: 01 January 2025

Kilem Li Gwet*
Affiliation: Stataxis Consulting
* Requests for reprints should be sent to Kilem Li Gwet, Sr. Statistical Consultant, STATAXIS Consulting, 20315 Marketree Place, Montgomery Village, MD 20886, USA. E-mail: [email protected]

Abstract

Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (the collection of objects or persons to be rated) and the population of raters. Consequently, the sampling variance of an inter-rater reliability coefficient can be seen as the combined effect of sampling subjects and sampling raters. However, all inter-rater reliability variance estimators proposed in the literature account only for the subject sampling variability, ignoring the extra sampling variance due to the sampling of raters, even though the latter may be the larger of the two variance components. Such variance estimators permit statistical inference only to the subject universe. This paper proposes variance estimators that permit inference to both the subject and rater universes. The consistency of these variance estimators is proved, as well as their validity for confidence-interval construction. These results apply only to fully crossed designs, where each rater must rate each subject. A small Monte Carlo simulation study demonstrates the accuracy of the large-sample approximations on reasonably small samples.
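To make the two-component decomposition concrete, here is a minimal Python sketch. It is not the estimator derived in the paper: it computes a Fleiss-type percent agreement on a fully crossed subjects × raters table and approximates the subject-sampling and rater-sampling variance components with delete-one jackknives over each dimension. The simulated data, the agreement statistic, and the jackknife stand-in are illustrative assumptions only.

```python
import numpy as np

def percent_agreement(ratings):
    """Mean over subjects of the fraction of agreeing rater pairs.

    ratings: (n_subjects, n_raters) array of nominal category codes,
    fully crossed (every rater rates every subject).
    """
    _, r = ratings.shape
    pairs = r * (r - 1)  # ordered pairs of distinct raters
    per_subject = []
    for row in ratings:
        _, counts = np.unique(row, return_counts=True)
        per_subject.append((counts * (counts - 1)).sum() / pairs)
    return float(np.mean(per_subject))

def jackknife_variance(ratings, axis):
    """Delete-one jackknife variance of the agreement statistic,
    deleting subjects (axis=0) or raters (axis=1).

    A heuristic stand-in for the paper's estimators, used here only
    to illustrate the subject/rater decomposition.
    """
    m = ratings.shape[axis]
    stats = np.array([percent_agreement(np.delete(ratings, i, axis=axis))
                      for i in range(m)])
    return (m - 1) / m * ((stats - stats.mean()) ** 2).sum()

rng = np.random.default_rng(seed=42)
ratings = rng.integers(0, 3, size=(50, 5))    # 50 subjects, 5 raters, 3 categories

pa = percent_agreement(ratings)
v_subj = jackknife_variance(ratings, axis=0)   # subject-sampling component
v_rater = jackknife_variance(ratings, axis=1)  # rater-sampling component
print(f"agreement={pa:.3f}  var(subjects)={v_subj:.5f}  var(raters)={v_rater:.5f}")
```

Under the convention criticized in the abstract, only the subject component (v_subj here) would be reported; whenever the rater component is non-negligible, that convention understates the total uncertainty of the coefficient.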

Type: Theory and Methods

Copyright: © 2008 The Psychometric Society

