Standard Errors and Confidence Intervals of Norm Statistics for Educational and Psychological Tests

Hannah E. M. Oosterhuis; L. Andries van der Ark; Klaas Sijtsma

doi:10.1007/s11336-016-9535-8

Published online by Cambridge University Press: 01 January 2025

Hannah E. M. Oosterhuis*: Tilburg University
L. Andries van der Ark: University of Amsterdam
Klaas Sijtsma: Tilburg University
*: Correspondence should be made to Hannah E. M. Oosterhuis, Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands. Email: [email protected]

Abstract

Norm statistics allow for the interpretation of scores on psychological and educational tests, by relating the test score of an individual test taker to the test scores of individuals belonging to the same gender, age, or education groups, et cetera. Given the uncertainty due to sampling error, one would expect researchers to report standard errors for norm statistics. In practice, standard errors are seldom reported; they are either unavailable or derived under strong distributional assumptions that may not be realistic for test scores. We derived standard errors for four norm statistics (standard deviation, percentile ranks, stanine boundaries and Z-scores) under the mild assumption that the test scores are multinomially distributed. A simulation study showed that the standard errors were unbiased and that corresponding Wald-based confidence intervals had good coverage. Finally, we discuss the possibilities for applying the standard errors in practical test use in education and psychology. The procedure is provided via the R function check.norms, which is available in the mokken package.
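
As an illustration of the kind of computation the abstract describes, the sketch below estimates a percentile rank with a standard error and Wald-based confidence interval, assuming only that the observed score frequencies are multinomially distributed. It is not the check.norms implementation. The mid-rank definition of the percentile rank, the function name percentile_rank_wald, and the example scores are assumptions made for illustration. Because the percentile rank is a weighted sum of score proportions, its sampling variance follows directly from the multinomial covariance matrix.

```python
import math
from collections import Counter


def percentile_rank_wald(scores, x, z=1.96):
    """Percentile rank of score x with a multinomial-based SE and Wald CI.

    Assumed (mid-rank) definition: PR(x) = 100 * [P(X < x) + 0.5 * P(X = x)].
    Returns (percentile rank, standard error, (lower, upper)), all on the 0-100 scale.
    """
    n = len(scores)
    counts = Counter(scores)

    estimate = 0.0       # sum of a_i * p_i
    second_moment = 0.0  # sum of a_i^2 * p_i
    for value, count in counts.items():
        p = count / n  # observed proportion for this score value
        # Weight a_i: 1 below x, 0.5 at x, 0 above x.
        a = 1.0 if value < x else 0.5 if value == x else 0.0
        estimate += a * p
        second_moment += a * a * p

    # Multinomial covariance of proportions is (diag(p) - p p') / n, so
    # Var(a'p) = (sum a_i^2 p_i - (sum a_i p_i)^2) / n.
    variance = max(second_moment - estimate ** 2, 0.0) / n
    se = math.sqrt(variance)

    lower = 100 * (estimate - z * se)
    upper = 100 * (estimate + z * se)
    return 100 * estimate, 100 * se, (lower, upper)


# Hypothetical norm sample of ten test scores.
example_scores = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]
pr, se, ci = percentile_rank_wald(example_scores, x=2)
print(f"PR = {pr:.1f}, SE = {se:.2f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

With a sample this small the interval is very wide, and a Wald interval may extend below 0 or above 100 for extreme scores, which is one reason to inspect standard errors before interpreting a norm.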

Keywords

norm statistics; percentile norms; standard errors for norm statistics; test norms

Type: Original paper
Information: Psychometrika, Volume 82, Issue 3, September 2017, pp. 559–588

DOI: https://doi.org/10.1007/s11336-016-9535-8
Copyright: Copyright © 2016 The Psychometric Society
