
Kappa and Rater Accuracy: Paradigms and Parameters

Educational and Psychological Measurement

Abstract

Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen’s kappa (κ). Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another (concordance), using both nonstochastic and stochastic category membership. Using a probability model that expresses category assignments in terms of rater accuracy and random error, it is shown that observed agreement (Po) depends only on rater accuracy and the number of categories; however, expected agreement (Pe) and κ depend additionally on category frequencies. Moreover, category frequencies affect Pe and κ solely through the variance of the category proportions, regardless of the specific frequencies underlying that variance. Paradoxically, some judgment paradigms involving stochastic categories are shown to yield higher κ values than their nonstochastic counterparts. Using the stated probability model, assignments to categories were generated for 552 combinations of paradigms, rater and category parameters, category frequencies, and numbers of stimuli. Observed means and standard errors for Po, Pe, and κ were fully consistent with theoretical expectations. Guidelines for the interpretation of rater accuracy and reliability are offered, along with a discussion of alternatives to the basic model.
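To make the quantities in the abstract concrete, the following is a minimal sketch (not the article's exact model) of the concordance paradigm under a simple assumption: each rater reports a stimulus's true category with probability equal to its accuracy and otherwise guesses uniformly at random among the K categories. The accuracy value, category frequencies, and sample size below are illustrative only; the sketch computes Po, Pe, and Cohen's κ for two simulated raters.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ratings(true_cats, accuracy, n_categories, rng):
    """Report the true category with probability `accuracy`,
    otherwise a uniformly random category (assumed error model)."""
    n = len(true_cats)
    hit = rng.random(n) < accuracy
    guesses = rng.integers(0, n_categories, size=n)
    return np.where(hit, true_cats, guesses)

def agreement_stats(r1, r2, n_categories):
    """Observed agreement Po, chance-expected agreement Pe,
    and Cohen's kappa = (Po - Pe) / (1 - Pe)."""
    po = np.mean(r1 == r2)
    p1 = np.bincount(r1, minlength=n_categories) / len(r1)
    p2 = np.bincount(r2, minlength=n_categories) / len(r2)
    pe = np.sum(p1 * p2)
    return po, pe, (po - pe) / (1 - pe)

# Concordance paradigm: two raters of equal accuracy judging stimuli
# drawn from skewed category frequencies (hypothetical values).
k = 4
freqs = np.array([0.55, 0.25, 0.15, 0.05])
true_cats = rng.choice(k, size=5000, p=freqs)
r1 = simulate_ratings(true_cats, accuracy=0.8, n_categories=k, rng=rng)
r2 = simulate_ratings(true_cats, accuracy=0.8, n_categories=k, rng=rng)
print(agreement_stats(r1, r2, k))
```

Re-running the sketch with uniform versus skewed `freqs` (holding accuracy fixed) illustrates the abstract's point: Po is essentially unchanged, while Pe rises and κ falls as the variance of the category proportions grows.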