Another question came in today, and it’s one I think the ReCal user community might be interested in. Sonya from Pennsylvania writes:

Ok, I am stumped. How can I have a percent agreement of .97 and a Scott's Pi of -.015? I have two coders coding either Yes (1) or No (0) for the presence of a variable. What am I doing wrong? I find when calculating by hand I get similar results (off by a decimal or so). When using ReCal or calculating Scott's Pi with more than two categories, I don't get a negative Scott's Pi when the percent agreement is high.

Thanks so much for sharing your program and answering my question if you have the time.

Excellent question, Sonya. As with the last question I answered, I’ll provide your raw data (with a new filename) so that others can follow along; hope you don’t mind.

Looking at the data, you’ll immediately notice an interesting characteristic: only the second coder uses the “1” code. That is, the two coders only ever agree on “0” codes and never once on a “1” code. Scott’s pi, Cohen’s kappa, and Krippendorff’s alpha punish this phenomenon severely, the rationale being that coders must show at least some covariation in their agreements to merit high coefficient values. Krippendorff himself addressed this very situation in a recent article:

In the calculation of reliability, large numbers of absences should not overwhelm the small number of occurrences that authors care enough about to report. Without a single concurrence and three mismatches [Krippendorff here is referring to a specific dataset, which just so happens to have the same number of mismatches as Sonya’s], the report of finding 2 out of 137 cases [3 out of 99 for Sonya’s data] is about as close to chance as one can get—and this is borne out by the near zero values of all the chance-corrected agreement coefficients. (2004, p. 425)

Thus, when one coder uses only one of two coding categories while the other uses both, chance-corrected reliability will always be near zero or well below it (though percent agreement can still approach 100%, since it is not chance-corrected). The only solutions here seem to be either better coder training or a revised coding scheme that allows coders more latitude to agree with one another on different categories.
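To see the arithmetic concretely, here is a minimal sketch of Scott's pi in Python. The 99-unit dataset below is reconstructed from the figures quoted above (3 mismatches out of 99 cases, all of them "1" codes assigned by the second coder only); it is illustrative, not Sonya's actual file:

```python
# Hypothetical reconstruction: 99 units, coder 1 assigns "0" everywhere,
# coder 2 agrees on 96 "0"s but assigns "1" three times.
coder1 = [0] * 99
coder2 = [0] * 96 + [1] * 3

n = len(coder1)

# Observed (percent) agreement: fraction of units where the coders match.
p_o = sum(a == b for a, b in zip(coder1, coder2)) / n

# Expected agreement for Scott's pi: sum of squared joint proportions,
# i.e., both coders' codes are pooled before computing category proportions.
pooled = coder1 + coder2
p_e = sum((pooled.count(c) / (2 * n)) ** 2 for c in set(pooled))

# Chance-corrected agreement.
pi = (p_o - p_e) / (1 - p_e)

print(round(p_o, 2))  # 0.97
print(round(pi, 3))   # -0.015
```

Because almost every code in the pooled distribution is a "0", the expected agreement (about .970) slightly exceeds the observed agreement (about .9697), so the numerator goes negative and pi lands just below zero even though raw agreement is 97%.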

**Reference**

Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. *Human Communication Research, 30*(3), 411-433.