From the mailbag, 12/14/09
Another question came in today, and it’s one I think the ReCal user community might be interested in. Sonya from Pennsylvania writes:
Ok, I am stumped. How can I have a percent agreement of .97 and a Scott’s Pi of -.015? I have two coders coding either Yes (1) or No (0) for the presence of a variable. What am I doing wrong? I find when calculating by hand I get similar results (off by a decimal or so). When using RECAL or calculating Scotts Pi with more than two categories, I don’t get negative Scotts Pi when the percent agreement is high.
Thanks so much for sharing your program and answering my question if you have the time.
Excellent question, Sonya. As with the last question I answered, I’ll provide your raw data (with a new filename) so that others can follow along; hope you don’t mind.
Looking at the data, you’ll immediately notice an interesting characteristic: only the second coder uses the “1” code. That is, the two coders only ever agree on “0” codes and never once on a “1” code. Scott’s pi, Cohen’s kappa, and Krippendorff’s alpha punish this phenomenon severely, the rationale being that coders must show at least some covariation in their agreements to merit high coefficient values. Krippendorff himself addressed this very situation in a recent article:
In the calculation of reliability, large numbers of absences should not overwhelm the small number of occurrences that authors care enough about to report. Without a single concurrence and three mismatches [Krippendorff here is referring to a specific dataset, which just so happens to have the same number of mismatches as Sonya's], the report of finding 2 out of 137 cases [3 out of 99 for Sonya's data] is about as close to chance as one can get—and this is born out by the near zero values of all the chance-corrected agreement coefficients. (2004, p. 425)
Thus, when one coder only uses one of two coding categories, and the other uses both, chance-corrected reliability will always be near or well below zero (but percent agreement can still be near 100% as it is not chance-corrected). The only solutions here seem to be either better coder training or a revised coding scheme that allows coders more latitude to agree with one another on different categories.
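For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two-coder Scott's pi calculation. The function name `scott_pi_from_codes` is my own invention for illustration, and the two code lists are an assumed reconstruction of the pattern described above (99 cases, the first coder never using 1, the second using it three times), not Sonya's actual file.

```python
from collections import Counter


def scott_pi_from_codes(coder_a, coder_b):
    """Return (Po, Pe, pi) for two equal-length sequences of nominal codes."""
    n = len(coder_a)
    # Observed agreement: share of cases where both coders chose the same code.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Pool both coders' decisions per category, then square and sum the
    # resulting joint marginal proportions to get expected agreement.
    pooled = Counter(coder_a) + Counter(coder_b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Assumed reconstruction: coder A codes every case 0; coder B codes three cases 1.
coder_a = [0] * 99
coder_b = [1] * 3 + [0] * 96

p_o, p_e, pi = scott_pi_from_codes(coder_a, coder_b)
print(f"Po = {p_o:.4f}, Pe = {p_e:.4f}, pi = {pi:.3f}")
# Po = 0.9697, Pe = 0.9702, pi = -0.015
```

Because nearly every decision falls in the "0" category, expected agreement comes out slightly higher than observed agreement, which is what pushes pi just below zero.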
Reference
Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433.
From the mailbag, 12/09/09
I don’t get much correspondence about ReCal, but I do try to respond to the few queries I receive. Today, Dianne from Australia asked:
Thank you so much for a great tool. But, I hope you can help me clear up a discrepancy I’ve noticed in my results for variables that have the same number of agreements/disagreements. For example, variable 1 has 26 agreements and 1 disagreement. So does variable 3. So does variable 5. Yet, the results for variable 1 are: 96.3% agreement and Scott’s pi of 0.924. The results for variable 3 are: 96.3% agreement and Scott’s pi of 0.914. The results for variable 5 are: 96.3% agreement and Scott’s pi of 0.886. Can you please tell me why the Scott’s pi is different for each variable when all the raw data for them is the same (ie same number of agreements and disagreements)? This scenario has occurred on three separate occasions when I’ve submitted my .csv files for analysis.
Great question, Dianne. The answer lies in how Scott’s pi (and Cohen’s kappa and Krippendorff’s alpha) compute reliability as compared to percent agreement. With the latter, equal numbers of agreements and disagreements will always give you the same result, as you noticed. This is not always the case with coefficients that correct for chance agreement, as do Scott’s, Cohen’s, and Krippendorff’s. Their formulae give additional “credit” to data sets with greater variation in agreeing values: in other words, the more different coding categories your coders agree upon, the higher your reliability coefficients will be (the logic being that it is harder to attain stronger levels of agreement on data that is highly distributed across many coding categories).
But I don’t expect you to just take my word for it, so I’ll actually work through your numbers and show the math here. I hope you don’t mind if I provide your raw data here so that other interested parties can follow along—I’ve changed the headers and the filename.
Recall that the formula for Scott’s pi is
(Po - Pe) / (1 - Pe)
where Po is observed agreement and Pe is expected agreement. Observed agreement is simply percent agreement expressed as a fraction of one; for all three of your variables it is therefore 0.963. Expected agreement is a bit more complex, but essentially what you have to do is:
- Note the number of possible coding categories your coders used (each of your variables used 2 categories, represented by the numbers 0 and 1)
- Start by adding the number of times coder A selected category 1 to the number of times coder B selected category 1, then divide that sum by the total number of coding decisions (which in your case is 54, or 2x the number of cases). This value is known as the joint marginal proportion (JMP) for variable 1, category 1, and it is equal to (12 + 11) / 54 = 0.426. We need to do this for every category of every variable, so in total we will calculate 6 JMPs, 2 per variable (the number of coding categories noted above). The JMP for var 1, category 2 is (15 + 16) / 54 = 0.574. The rest are as follows: var 3 cat 1 = 0.315; var 3 cat 2 = 0.685; var 5 cat 1 = 0.204; var 5 cat 2 = 0.796.
- Now that we have all our JMPs, we need to square them and then sum them within variables to get our expected agreements. So for var 1, we have 0.426² + 0.574² = 0.511. The expected agreement values for vars 3 and 5 are 0.569 and 0.676, respectively.
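The expected-agreement steps in the list above can be sketched in a few lines of Python. This is only an illustration: `expected_agreement` is a hypothetical helper name, and it takes the pooled per-category counts (coder A's count plus coder B's count) as its input.

```python
def expected_agreement(category_totals):
    """Return (JMPs, Pe) given pooled counts per category.

    category_totals holds, for each category, the number of times coder A
    selected it plus the number of times coder B selected it.
    """
    decisions = sum(category_totals)  # 2 x the number of cases
    jmps = [total / decisions for total in category_totals]
    # Expected agreement is the sum of the squared joint marginal proportions.
    return jmps, sum(p ** 2 for p in jmps)


# Variable 1: (12 + 11) decisions in the first category, (15 + 16) in the second.
jmps, p_e = expected_agreement([12 + 11, 15 + 16])
print([round(p, 3) for p in jmps], round(p_e, 3))
# [0.426, 0.574] 0.511
```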
Now we have all the values we need to plug in to our main Scott’s pi formula above.
- For var 1 Scott’s pi is (0.963 - 0.511) / (1 - 0.511) = 0.924;
- for var 3 it is (0.963 - 0.569) / (1 - 0.569) = 0.914;
- for var 5 it is (0.963 - 0.676) / (1 - 0.676) = 0.886.
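The same three calculations as a short Python sketch, in case you would rather not do them by hand. The function name `scott_pi` is just a label I chose, and the expected-agreement values are the rounded figures quoted above, so results are shown to three decimals only.

```python
def scott_pi(p_o, p_e):
    """Scott's pi from observed agreement (Po) and expected agreement (Pe)."""
    return (p_o - p_e) / (1 - p_e)


p_o = 26 / 27  # 26 agreements in 27 cases, i.e. 0.963
expected = {"var 1": 0.511, "var 3": 0.569, "var 5": 0.676}

results = {name: round(scott_pi(p_o, p_e), 3) for name, p_e in expected.items()}
print(results)
# {'var 1': 0.924, 'var 3': 0.914, 'var 5': 0.886}
```

Observed agreement is identical in all three rows; only expected agreement changes, and the coefficient falls as it rises.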
By now you’ve probably figured out what’s going on. Looking at the data, variable 5 has the most uneven category distribution; you can tell at a glance that it is mostly zeros. This pushes its expected agreement value higher—it is easier to achieve high levels of agreement when a data set mostly falls into one category, so the Scott’s pi formula raises the bar accordingly. Vars 1 and 3 are more balanced in their category distributions, so their expected agreements are lower, making their coefficients higher despite the fact that all three vars have the same number of agreements and disagreements.
I hope this answers your question, Dianne. If not, let me know!
