Interlingual Annotation for Machine Translation

Issues

Evaluation tactics

Is the implicit agreement measure clear?

How are agreements on senses aggregated into agreements on words?

Is the rationale for the derivation of Kappa clear?

Are all categories potentially valid categories for every token?

What does it mean to count separately the cases in which both annotators chose all or none of the valid categories?

Is the classification of missing data clear?

If an annotator decided that no valid category was correct for a token, was the annotator instructed to select at least one category anyway? If not, was that judgment recorded in a way that distinguishes it from a failure to attempt an assignment?
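
To make the Kappa and missing-data questions above concrete, here is a minimal sketch of the standard two-annotator Cohen's kappa. It assumes exactly one category per token, which may not match the multi-category scheme under discussion. The `NONE` label (an explicit "no valid category" judgment) and the use of Python `None` for "no attempt made" are hypothetical conventions introduced only for illustration, as are the function names.

```python
from collections import Counter

NONE = "NONE"  # hypothetical label: annotator judged that no category fits


def drop_unattempted(labels_a, labels_b):
    """Keep only tokens both annotators attempted (None = no attempt, an assumption)."""
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators, one category per token."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data (invented for illustration, not from any annotation study).
ann_a = ["s1", "s1", "s2", NONE, None]
ann_b = ["s1", "s2", "s2", NONE, "s1"]
kept_a, kept_b = drop_unattempted(ann_a, ann_b)
kappa = cohen_kappa(kept_a, kept_b)
```

In this sketch an explicit `NONE` judgment counts as a category and can contribute to agreement, while an unattempted token is excluded from the calculation. The two cases can only be treated differently if the annotation records keep them apart, which is the point of the last question.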