
Assessing the Reliability of Blind Wine Tasting: Differentiating Levels of Clinical and Statistical Meaningfulness*

Published online by Cambridge University Press: 08 June 2012

Domenic V. Cicchetti
Affiliation:
Dom Cicchetti, Ph.D., Yale Home Office, 94 Linsley Lake Road, North Branford, CT. 06471; e-mail:[email protected].

Abstract

The author distinguishes between the clinical and statistical meaning of varying levels of intertaster reliability for the 11 judges who evaluated 10 Chardonnays (6 American and 4 French) in the heralded 1976 Paris wine competition. Four wines showed weighted kappa (Kw) values below 0.40, a level considered poor by established biostatistical criteria. These ranged from 0.10 for the French Beaune Clos des Mouches 1973 Chardonnay to 0.33 for the U.S. Veedercrest 1972 Chardonnay. However, when the statistical significance of the Kw values was assessed, only the Clos des Mouches failed to reach significance at the .05 level. The other three wines (the U.S. Chateau Montelena 1973, with a Kw of 0.20; the U.S. David Bruce 1973 regular, with a Kw of 0.27; and the U.S. Veedercrest, with a Kw of 0.33) reached statistical significance at p values of <.05, <.001, and <.0001, respectively. These findings are not specific to weighted kappa; they reveal that when sample sizes are large enough, even the most trivial of results will be statistically significant while often being devoid of practical or clinical meaningfulness. A level of Kw that is clinically meaningful will most likely be statistically significant, but high levels of statistical significance are no guarantee of clinical significance. Methods for resolving this “big N phenomenon” are presented and discussed. (JEL Classification: C12, C49)
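The "big N phenomenon" named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the 3 x 3 rating table is invented (it is not the Paris tasting data), the function name `weighted_kappa` is hypothetical, and the setup (two raters, linear agreement weights, a two-sided z test using the large-sample null standard error of Fleiss, Cohen, and Everitt, 1969) is one common way to test a weighted kappa, not necessarily the procedure used in the article.

```python
import math


def weighted_kappa(table):
    """Return (kappa_w, z, two_sided_p) for a square two-rater count table.

    table[i][j] is the number of items rater A placed in category i
    and rater B placed in category j (ordered categories).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[c / n for c in row] for row in table]
    row_m = [sum(p[i]) for i in range(k)]                        # rater A marginals
    col_m = [sum(p[i][j] for i in range(k)) for j in range(k)]   # rater B marginals

    # Linear agreement weights: 1 on the diagonal, 0 at maximal disagreement.
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(w[i][j] * p[i][j] for i, j in cells)
    p_exp = sum(w[i][j] * row_m[i] * col_m[j] for i, j in cells)
    kappa = (p_obs - p_exp) / (1 - p_exp)

    # Large-sample standard error under H0: kappa_w = 0.
    w_row = [sum(col_m[j] * w[i][j] for j in range(k)) for i in range(k)]
    w_col = [sum(row_m[i] * w[i][j] for i in range(k)) for j in range(k)]
    s = sum(row_m[i] * col_m[j] * (w[i][j] - (w_row[i] + w_col[j])) ** 2
            for i, j in cells)
    se0 = math.sqrt((s - p_exp ** 2) / n) / (1 - p_exp)

    z = kappa / se0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return kappa, z, p_value


# Invented example: the same pattern of agreement at N = 30 and N = 300.
small = [[5, 3, 2], [3, 4, 3], [2, 3, 5]]
large = [[c * 10 for c in row] for row in small]

for name, tbl in (("N=30", small), ("N=300", large)):
    kw, z, pv = weighted_kappa(tbl)
    print(f"{name}: Kw={kw:.2f}, z={z:.2f}, p={pv:.2g}")
```

In this invented example Kw is 0.25 at both sample sizes, which falls in the "poor" range (below 0.40) mentioned in the abstract. Only the z statistic and p value change as N grows, so the larger table reaches statistical significance without the agreement itself being any stronger.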

Type: Articles
Copyright © American Association of Wine Economists 2007

