From The Mailbag, 12/09/09

I don’t get much correspondence about ReCal, but I do try to respond to the few queries I receive. Today, Dianne from Australia asked:

Thank you so much for a great tool. But, I hope you can help me clear up a discrepancy I’ve noticed in my results for variables that have the same number of agreements/disagreements. For example, variable 1 has 26 agreements and 1 disagreement. So does variable 3. So does variable 5. Yet, the results for variable 1 are: 96.3% agreement and Scott’s pi of 0.924. The results for variable 3 are: 96.3% agreement and Scott’s pi of 0.914. The results for variable 5 are: 96.3% agreement and Scott’s pi of 0.886. Can you please tell me why the Scott’s pi is different for each variable when all the raw data for them is the same (ie same number of agreements and disagreements)? This scenario has occurred on three separate occasions when I’ve submitted my .csv files for analysis.

Great question, Dianne. The answer lies in how Scott’s pi, Cohen’s kappa, and Krippendorff’s alpha compute reliability compared to percent agreement. With percent agreement, equal numbers of agreements and disagreements will always give you the same result, as you noticed. This is not always the case with coefficients that correct for chance agreement, as all three of those statistics do. Their formulae give additional “credit” to data sets with greater variation in agreeing values: in other words, the more evenly your coders’ agreements are spread across the coding categories, the higher your reliability coefficients will be (the logic being that it is harder to attain high levels of agreement on data that is distributed across many coding categories than on data that falls mostly into one).
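To make that chance correction concrete, here is a tiny Python sketch of my own (not ReCal’s actual code, and the 50/50 and 90/10 splits are made-up examples) showing the chance-agreement baseline that Scott’s pi subtracts out, computed with the squared-proportions rule worked through below:

    # Toy illustration: the chance-agreement baseline under Scott's pi for a
    # balanced versus a heavily skewed two-category distribution.

    def chance_agreement(proportions):
        """Chance-agreement baseline: sum of squared category proportions."""
        return sum(p ** 2 for p in proportions)

    print(round(chance_agreement([0.5, 0.5]), 2))  # 0.5  -> a modest bar to clear
    print(round(chance_agreement([0.9, 0.1]), 2))  # 0.82 -> a much higher bar, so the
                                                   #   same raw agreement earns a lower pi

The same number of raw agreements clears the 0.5 baseline far more comfortably than the 0.82 one, and that is the whole story behind your three variables.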

But I don’t expect you to just take my word for it, so I’ll actually work through your numbers and show the math here. I hope you don’t mind if I provide your raw data here so that other interested parties can follow along—I’ve changed the headers and the filename.

Recall that the formula for Scott’s pi is

(Po - Pe) / (1 - Pe)

where Po is observed agreement and Pe is expected agreement. Observed agreement is simply percent agreement expressed as a proportion rather than a percentage, so for all three of your variables it is 0.963. Expected agreement is a bit more complex, but essentially what you have to do is:

  1. Note the number of possible coding categories your coders used (each of your variables used 2 categories, represented by the numbers 0 and 1)
  2. Add the number of times coder A selected category 1 to the number of times coder B selected category 1, then divide that sum by the total number of coding decisions (which in your case is 54, or 2× the number of cases). This value is the joint marginal proportion (JMP) for variable 1, category 1, and it equals (12 + 11) / 54 = 0.426. We need to do this for every category of every variable, so in total we calculate six JMPs, two per variable (matching the number of coding categories noted in step 1). The JMP for var 1, category 2 is (15 + 16) / 54 = 0.574. The rest are as follows: var 3 cat 1 = 0.315; var 3 cat 2 = 0.685; var 5 cat 1 = 0.204; var 5 cat 2 = 0.796.
  3. Now that we have all our JMPs, we square them and sum them within each variable to get our expected agreements. So for var 1, we have 0.426² + 0.574² = 0.511. The expected agreement values for vars 3 and 5 are 0.569 and 0.676, respectively. (The short sketch after this list reproduces all of these numbers.)
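If you would like to check these figures yourself, here is a minimal Python sketch (my own illustration, not ReCal’s internal code). The per-coder counts for variable 1 come straight from step 2; the combined category counts I use for variables 3 and 5 (17/37 and 11/43 out of 54) are inferred from the rounded JMPs above, which they reproduce exactly.

    # Minimal sketch: joint marginal proportions (JMPs) and expected agreement
    # for Scott's pi. A combined count is the number of times coder A chose a
    # category plus the number of times coder B chose it.

    def expected_agreement(combined_counts):
        """Return the JMPs and the sum of their squares for one variable."""
        total = sum(combined_counts)                # total coding decisions (2 x cases)
        jmps = [c / total for c in combined_counts]
        return jmps, sum(p ** 2 for p in jmps)

    variables = {
        "var 1": [12 + 11, 15 + 16],  # per-coder counts from step 2 -> JMPs 0.426, 0.574
        "var 3": [17, 37],            # inferred from JMPs 0.315 and 0.685
        "var 5": [11, 43],            # inferred from JMPs 0.204 and 0.796
    }

    for name, counts in variables.items():
        jmps, pe = expected_agreement(counts)
        print(name, [round(p, 3) for p in jmps], round(pe, 3))
    # var 1 [0.426, 0.574] 0.511
    # var 3 [0.315, 0.685] 0.569
    # var 5 [0.204, 0.796] 0.676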

Now we have all the values we need to plug into our main Scott’s pi formula above (a few lines of code after the list pull everything together):

  • For var 1 Scott’s pi is (0.963 – 0.511) / (1 – 0.511) = 0.924;
  • for var 3 it is (0.963 – 0.569) / (1 – 0.569) = 0.914;
  • for var 5 it is (0.963 – 0.676) / (1 – 0.676) = 0.886.
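Here is that last step as a sketch, again my own illustration rather than ReCal’s implementation, plugging the observed agreement and the three expected agreements into the formula:

    # Minimal sketch: Scott's pi from observed agreement (Po) and expected agreement (Pe).

    def scotts_pi(po, pe):
        return (po - pe) / (1 - pe)

    po = 0.963  # 26 agreements out of 27 cases, the same for all three variables
    for name, pe in [("var 1", 0.511), ("var 3", 0.569), ("var 5", 0.676)]:
        print(name, round(scotts_pi(po, pe), 3))
    # var 1 0.924
    # var 3 0.914
    # var 5 0.886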

By now you’ve probably figured out what’s going on. Looking at the data, variable 5 has the most uneven category distribution; you can tell at a glance that it is mostly zeros. This pushes its expected agreement value higher—it is easier to achieve high levels of agreement when a data set mostly falls into one category, so the Scott’s pi formula raises the bar accordingly. Vars 1 and 3 are more balanced in their category distributions, so their expected agreements are lower, making their coefficients higher despite the fact that all three vars have the same number of agreements and disagreements.

I hope this answers your question, Dianne. If not, let me know!

One comment

  1. Thank you Deen. Not only are you quick to reply, but you also explained this phenomenon in language I could understand. That means a lot to a maths phobic!
    Cheers
    Dianne
