inter-annotator reliability

Hi. I have read several articles and searched this forum for an answer, but I am still not clear on whether it is possible to calculate inter-annotator reliability solely within ELAN by comparing two sets of data at a time, or whether I need to use the EasyDIAg tool or export the data as tab-delimited text to another program. Can someone clear this up for me, please? Thank you!

The inter-annotator reliability calculation options that are present in ELAN (accessible via a menu and configurable in a dialog window) are executed by and within ELAN (sometimes using third-party libraries, but those are bundled with ELAN). The calculations have no dependencies on external tools.

For documentation of the algorithms, discussion of the decisions taken, explanation of the results etc., we refer to (e.g.) the EasyDIAg manual and related articles. The EasyDIAg tool has a few more options, which might be a reason to install and use that tool instead of ELAN’s variant of the inter-annotator reliability calculations.

-Han

Hi,
There are multiple topics in this forum dealing with the exact definitions of the terms used in ELAN’s calculation of Cohen’s kappa. However, neither the specific replies nor the Holle & Rein (2013) paper resolve these issues.

This is why I am trying it again.

These are the terms that I am interested in: kappa_ipf, kappa_max, raw_agreement.

“kappa_ipf” -> is it correct that I should report this value in a paper when I also want to include unmatched annotations in the calculations?

“kappa_max” -> what does it mean when this is not 1.0? Is it important to report this value as well, or should I even include it in further calculations (and if so, which ones)?

“raw_agreement” -> basically, the explanation in Holle & Rein (2013) says that there is no need to report this value, right?

Thanks a lot in advance =)

Hi,

I’m afraid these issues are difficult to solve; I know that at least I’m not able to do so. Ideally, researchers within a discipline agree on how to report the reliability of the observations in their research projects, but as I understand it, this is not the case for all disciplines. As a result, it is sometimes difficult to assess the validity of reported agreement values.

I guess it is useful for researchers to make the calculations available somewhere (and, if possible, the data these calculations are based on), so that if the publication only allows reporting the resulting agreement values (e.g. because of a lack of space), interested colleagues are able to find out how these results have been obtained.

As I understand it, the Holle & Rein “modified Cohen’s kappa” is an attempt to adapt a coefficient that is frequently used in one domain to make it suitable for another domain. That implementation has been included (reimplemented) in ELAN “as is”. The ELAN manual refers to the mentioned Holle & Rein paper but also to the EasyDIAg manual. Now to your questions:

  1. Somewhere in that manual it is stated that “The most important value for reporting will be kappa_ipf”, so indeed, if you want to account for unmatched annotations as well, this is the value that should be reported.

  2. I think the Wikipedia article on Cohen’s kappa contains some useful remarks on the calculated maximum, on general limitations of the coefficient, and on its interpretation of “agreement by chance” (see also the small sketch after this list).

  3. I don’t know whether Holle & Rein say in so many words that the raw agreement doesn’t need to be reported, but since the whole point of their effort is to be able to report chance-corrected indices of agreement (for both categorization and segmentation), the kappa value(s) are considered more important.
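To make the relation between these three quantities a bit more concrete, here is a minimal, purely illustrative sketch of the standard Cohen’s kappa, kappa_max and raw agreement for a plain contingency table. This is not the Holle & Rein / EasyDIAg modified kappa (which additionally deals with segmentation and unmatched annotations); it only shows what the three numbers measure, and it assumes the usual definition of kappa_max as the highest kappa attainable given the two annotators’ marginal distributions.

```python
# Minimal illustrative sketch (not the Holle & Rein / EasyDIAg implementation):
# standard Cohen's kappa, kappa_max and raw agreement from a square
# contingency table. Rows = annotator 1, columns = annotator 2.

def kappa_stats(table):
    n = sum(sum(row) for row in table)                 # total number of paired annotations
    k = len(table)
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    p_o = sum(table[i][i] for i in range(k)) / n       # raw (observed) agreement
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2   # agreement expected by chance
    p_max = sum(min(row_tot[i], col_tot[i]) for i in range(k)) / n  # best possible p_o given the marginals

    kappa = (p_o - p_e) / (1 - p_e)
    kappa_max = (p_max - p_e) / (1 - p_e)              # below 1.0 whenever the marginals differ
    return kappa, kappa_max, p_o

# Hypothetical 2x2 example with two categories:
print(kappa_stats([[20, 5],
                   [10, 15]]))   # approximately (0.4, 0.8, 0.7)
```

In that reading, a kappa_max below 1.0 simply signals that the two annotators used the categories with different overall frequencies, which caps the kappa that can be reached with those marginals.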

I know this doesn’t solve the issue, but I hope this still helps a bit.

-Han


Thank you so much for your support. I really appreciate it.
I was able to clarify some issues after I found the EasyDIAg manual. Somehow, I had always overlooked the direct website link.

However, there is another issue.

So, there are 6 ELAN files related to 6 different recordings. My colleague’s annotations and my own annotations are in the same file for each recording. We used the same tier names in each file (one for my colleague and one for me).

It looks like this:

“Rec_1” -> two tiers: one for my colleague; one for me

…

“Rec_6” -> two tiers: one for my colleague; one for me

When I analyze each file individually, kappa_ipf is always above 0.7.

When I calculate kappa_ipf for all 6 files at once (using the option “in different files” etc.), the result is below 0.14 (see the result sheet below). That cannot be true, right?

The line “Number of pairs of tiers in the comparison” surprises me, and it might be the reason for the error. I thought that with 6 files and two tiers each, there should be 6 pairs of tiers, not 30.

What did I do wrong?
Can you help me once more? 🙂

Output created: 01/18/21 17:36:40
Number of files involved: 6
Number of selected tiers: 2
Number of pairs of tiers in the comparison: 30
Required minimal overlap percentage: 60%

Results per value:
value     kappa    kappa_max  raw agreement
"empty"   0        0.3996     0.9981
2         0        0.3832     0.9750
3         0        0.7809     0.9724
4         0.0206   0.6450     0.9404
5         0        0.7611     0.8890
6         0        0.9133     0.8246
6         0.0000   0.0000     0.9981
7         0        0.6522     0.7479
8         0        0.4180     0.7527
9         0        0.0387     0.8804
?         0        0.3996     0.9981

Global results (incl. unlinked/unmatched annotations):
kappa_ipf   kappa_max   raw agreement
0.0049      0.5130      0.0134

Global results (excl. unlinked/unmatched annotations):
kappa (excl.)   kappa_max (excl.)   raw agreement (excl.)
0.1310          0.6630              0.2687

Global Agreement Matrix:
First annotator in the rows, second annotator in the columns

"empty"    0  0  0  0  0   0   0  0   0   0  0  1
2          0  0  0  0  4   0   0  0   0   0  0  50
3          0  0  0  0  0   8   0  0   2   0  0  35
4          0  0  0  4  8   14  0  2   0   0  0  85
5          0  0  0  0  6   8   0  4   0   0  0  102
6          0  0  6  0  6   12  0  0   0   0  0  204
6          0  0  0  0  0   0   0  0   0   0  0  5
7          0  0  0  2  6   2   0  12  4   4  0  428
8          0  0  2  2  2   2   0  2   2   0  0  499
9          0  0  0  0  0   0   0  2   6   0  0  306
?          0  0  0  0  0   0   0  0   0   0  0  1
Unmatched  4  13 21 47 158 221 0  221 143 3  4  0

Global per value agreement table:

"empty"
0 4
1 2680

2
0 13
54 2618

3
0 29
45 2611

4
4 51
109 2521

5
6 184
114 2381

6
12 255
216 2202

6
0 0
5 2680

7
12 231
446 1996

8
2 155
509 2019

9
0 7
314 2364

?
0 4
1 2680

End of global results.

################################################

Hi,

That’s indeed a bit odd.
I’m not sure if this can account for the unexpected result, but based on your description, I would think that you need the option “in the same file”. The label of that option reads “The tiers to compare are”, and if that is followed by “in the same file”, it matches the description you give of your files: per recording, the tiers to compare (yours and your colleague’s) are in the same file.
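Purely as a hypothetical back-of-the-envelope check on the reported 30 pairs (I have not verified that this is exactly how the pairing is done in that mode): if “in different files” pairs the selected tiers across combinations of files instead of within each file, then 6 files with 2 selected tiers would give 30 tier pairs, and it would also mean that tiers belonging to different recordings get compared, which could explain a very low global kappa.

```python
# Hypothetical counting only; assumes cross-file pairing of the 2 selected tiers.
from math import comb

files = 6
tiers = 2
print(comb(files, 2) * tiers)   # 15 file pairs x 2 tiers = 30
```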

Let me know if I got it wrong.


Hi,

Thanks a lot for your reply.

You are right. If I choose the option “in the same file”, everything works fine. However, I would like to calculate an overall Cohen’s kappa, which is why I chose the option “in different files”. Is there another approach to summarizing the inter-rater reliability for all 6 files?

Hi,

The first part of the output consists of the overall or global results (over all files and all tiers). So the global kappa values mentioned there (including and excluding unmatched annotations) are, in your case, based on all six files. I guess that is what you are looking for. (These results are always there, regardless of the choice between “in the same file” and “in different files”; those options only tell ELAN where and how to find the tiers to compare.)

If you tick the “Also generate and export agreement per tier pair” checkbox in the second step, the result text file will contain the overall, global results, followed by a list of per-file-pair, per-tier-pair results.
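For what it’s worth, here is a minimal, purely illustrative sketch of what the global results presumably amount to computationally (an assumption on my part, suggested by the Global Agreement Matrix in the output, not a description of the actual ELAN code): the agreement tables of the individual files are summed into one pooled table, and the kappa is computed from that pooled table rather than by averaging the per-file kappa values.

```python
# Sketch under the assumption that the global kappa is computed from a pooled
# contingency table (per-file tables summed element-wise), not from an average
# of per-file kappas. The table layout and values are made up.

def cohen_kappa(table):
    n = sum(sum(r) for r in table)
    k = len(table)
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(row[i] * col[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 agreement tables for two files:
per_file_tables = [
    [[30, 5], [4, 25]],
    [[12, 3], [2, 40]],
]

# Sum the per-file tables element-wise into one global table...
global_table = [
    [sum(t[i][j] for t in per_file_tables) for j in range(2)]
    for i in range(2)
]
# ...and compute a single global kappa from the pooled counts.
print(cohen_kappa(global_table))
```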

Hi,

Thank you so much! As usual, it was a human mistake: I misunderstood the term “in different files”. Finally, I got a plausible kappa_ipf. That is great.

However, I have one additional question.
Below, you can see the result sheet. What does the yellow highlighted text mean?

The first two lines are just explanations of certain values that may appear in the result tables. NaN (not a number) is now used to indicate a division by zero (in the first implementation, 0 was used in that situation).
A kappa value can be less than 0 (meaning agreement lower than expected by chance), and then often 0 is reported instead of the actual value.

The highlighted line in the agreement table, the second value “6”, probably indicates that in at least one of the files a whitespace character (a space, tab or newline) has been inserted before or after the 6 (e.g. "6 "), so that it is treated as a separate value.
You might consider using File->Multiple File Processing->Scrub Transcriptions… to clean up the files (after making sure you have backups of your files).
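To illustrate what is going on (assuming annotation values are compared as plain strings, which I believe is the case): a trailing space or tab makes “6 ” a different value than “6”, so it shows up as a separate row/column in the tables.

```python
# Illustration only: a stray trailing space or tab makes "6 " a different
# string than "6", so it is counted as a separate value. Stripping the
# whitespace (roughly what scrubbing the transcriptions does for you) merges
# them again.
values = ["6", "6 ", "6\t"]

print(sorted(set(values)))                   # three distinct "values"
print(sorted({v.strip() for v in values}))   # just one value after stripping
```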
