You know that magical moment where you and your friend finally agree on a place to eat, or a movie to watch, and you wonder what lucky stars had to align to make that happen? When the chance of agreement was so small that you didn’t think you’d ever decide? If you wanted to capture how often you and your friend agree on a restaurant or a movie in such a way that accounted for whether it was due to random chance, Cohen’s Kappa is the choice for you.
Agreement can be calculated just by taking the number of agreed upon observations divided by the total observations; however, Jacob Cohen believed that wasn’t enough. As agreement was typically used for inter-rater reliability, Cohen argued that this measure didn’t account for the fact that sometimes, people just guess–especially if they are uncertain. In 1960, he proposed Cohen’s Kappa as a counter to traditional percent agreement, claiming his measure was more robust as it accounted for random chance agreement.
Cohen’s Kappa is used to calculate agreement between two raters–or in machine learning, it can be used to find the agreement between the prediction sets of two models. It is calculated by subtracting the probability of chance agreement from the probability of observed agreement, all over one minus the probability of chance agreement. Like many correlation metrics, it ranges from -1 to +1. A negative value of Cohen’s Kappa indicates that there is no relationship between the raters, or that they had a tendency to give different ratings. A Cohen’s Kappa of 0 indicates that there is no agreement between the predictors above what would be expected by chance. A Cohen’s Kappa of 1 indicates that the raters are in complete agreement.
As Cohen’s Kappa is calculated using frequencies, it can be unreliable in measuring agreement in situations where an outcome is rare. In such cases, it tends to be overly conservative and underestimates agreement on the rare category. Additionally, some statisticians disagree with the claim that Cohen’s Kappa accounts for random chance, as an explicit model of how chance affected decision making would be necessary to say this decisively. The chance adjustment of Kappa simply assumes that when raters are uncertain, they completely guess an outcome. However, this is highly unlikely in practice–usually people have some reason for their decision.
Let’s use this in a sentence, shall we?
Serious: The Cohen’s Kappa score between the two raters was 0.7. Therefore, there is substantial agreement between the raters’ observations.
Silly: A kappa of 0.7? They must always agree on a place to eat!