The First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP (AnnoNLP 2019)

The workshop is hosted by EMNLP-IJCNLP 2019 in Hong Kong.

Date: Sunday, November 3, 2019.

Workshop Description

Background

Crowdsourcing, whether through microwork platforms or through Games with a Purpose, is increasingly used as an alternative to traditional expert annotation, achieving comparable annotation quality at lower cost and offering greater scalability. The NLP community has enthusiastically adopted crowdsourcing to support work in tasks such as coreference resolution, sentiment analysis, textual entailment, named entity recognition, word similarity, word sense disambiguation, and many others. This interest has also resulted in the organization of a number of workshops at ACL and elsewhere, from as early as “The People’s Web meets NLP” in 2009. These days, general purpose research on crowdsourcing can be presented at HCOMP or CrowdML, but the need for workshops more focused on the use of crowdsourcing in NLP remains.

In particular, NLP-specific methods are typically required for the task of aggregating the interpretations provided by the annotators. Most existing work on aggregation methods is based on a common set of assumptions: 1) independence between the true classes, 2) the set of classes the coders can choose from is fixed across the annotated items, and 3) there is one true class per item. However, for many NLP tasks such assumptions are not entirely appropriate. For example, sequence labelling tasks (e.g., NER, tagging) have an implicit inter-label dependence (e.g., Nguyen et al., 2017). In other tasks, such as coreference, the labels the coders can choose from are not fixed but depend on the mentions in each document (Passonneau, 2004; Paun et al., 2018). Furthermore, in many NLP tasks, the data items can have more than one interpretation (e.g., Poesio and Artstein, 2005; Passonneau et al., 2012; Plank et al., 2014). Such cases of ambiguity also affect the reliability of existing gold standard datasets (often labelled with a single interpretation even though expert disagreement is a well-known issue).

This latter point motivates research on alternative, complementary evaluation methods, as well as the development of multi-label datasets. More broadly, the proposed workshop aims to bring together researchers interested in methods for aggregating and analysing crowdsourced data for NLP-specific tasks which relax the aforementioned assumptions. We also invite work on ambiguous, subjective or complex annotation tasks which have received less attention in the literature.
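
As a rough illustration of why the independence assumption is problematic for sequence labelling, the sketch below aggregates BIO tags by voting at each token position separately. The annotations, the tag set and the function names are invented for illustration and are not taken from any of the cited papers. Every individual annotation is a well-formed BIO sequence, yet the position-wise result is not.

```python
from collections import Counter

def tokenwise_vote(annotations):
    """Pick the most frequent tag at each token position, ignoring neighbours."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]

def is_valid_bio(tags):
    """An I-X tag is only valid directly after B-X or I-X of the same type."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            return False
        previous = tag
    return True

# Five hypothetical annotators label the same three-token sentence.
annotations = [
    ["O",     "B-PER", "I-PER"],
    ["B-PER", "I-PER", "I-PER"],
    ["O",     "O",     "B-LOC"],
    ["O",     "O",     "O"],
    ["O",     "O",     "B-PER"],
]

aggregated = tokenwise_vote(annotations)
print(aggregated)                # ['O', 'O', 'I-PER']
print(is_valid_bio(aggregated))  # False: I-PER does not continue an entity
```

Models that account for the dependence between neighbouring labels are designed to avoid this kind of ill-formed output.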

Objectives

Although there is a large body of work analysing crowdsourced data, be that probabilistic (models of annotation) or traditional (majority voting aggregation, agreement statistics), there has been less work devoted to NLP tasks. It is often the case that NLP data violate the assumptions made by most existing models, opening the path to new research. The aim of the proposed workshop is to bring together the community of researchers interested in this area.
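
For readers unfamiliar with the traditional baselines mentioned above, here is a minimal sketch of majority voting and of raw pairwise agreement, using a made-up sentiment example. The item names, labels and function names are assumptions chosen for illustration. Both functions assume a fixed label set and treat items independently, which is exactly the setting that many NLP tasks violate.

```python
from collections import Counter
from itertools import combinations

def majority_vote(labels):
    """Return all labels tied for the highest count (one label if no tie)."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)

def observed_agreement(items):
    """Fraction of annotator pairs, pooled over all items, that chose the same label."""
    agreeing = total = 0
    for labels in items.values():
        for first, second in combinations(labels, 2):
            agreeing += first == second
            total += 1
    return agreeing / total

# Hypothetical crowd labels for three items.
items = {
    "item1": ["pos", "pos", "pos"],
    "item2": ["pos", "neg", "neg"],
    "item3": ["pos", "neg"],
}

for name, labels in items.items():
    print(name, majority_vote(labels))
print(observed_agreement(items))
```

Note that the tie on item3 cannot be resolved by voting alone, and that a single aggregated label discards the disagreement on item2, which may reflect genuine ambiguity rather than annotator error.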

Topics

Topics of interest include but are not limited to the following:

Label aggregation methods for NLP tasks

Sequential labelling tasks (e.g., NER, chunking)
Tasks where the set of labels is not fixed across the data items (e.g., coreference)
Other cases of complex labels (e.g., for syntactic annotation)

The effects of ambiguity

Allowing for multiple interpretations per data item
Assessing the reliability of existing gold standard datasets

New evaluation methodologies
New multi-label datasets

Subjective, complex tasks

Can the crowd successfully annotate such tasks? How should the task be designed to facilitate the annotation process?

Workshop Programme

9:00 - 10:30	Session 1
9:00	Welcome remarks
9:10	Invited speaker: Jordan Boyd-Graber, University of Maryland. Title: Engaging Hobbyist Communities to Deceive Machines and Each Other. Abstract: We can't always find the text as data that we want: either we cannot share the data or we don't have enough of it. Many fields, especially question answering, have turned to crowdsourcing to solve these data problems. While crowdworking is an invaluable tool, it is not a panacea; this talk discusses an alternative: engaging hobbyist communities to leverage the passion and the expertise of humans to create shareable, unique, and expert language. We discuss two settings where we engage two hobbyist communities: trivia whizzes and strategy board gamers. The trivia community, through a human-in-the-loop collaboration, builds question answering examples that are nearly impossible for machines to answer. We also examine a game of deception called Diplomacy, where players form alliances, break them, and in the process craft intricate lies; we use this to create a dataset of deception. In both cases these experts expose the limitations of computerized approaches to language.
10:10	Dependency Tree Annotation with Mechanical Turk (Stephen Tratz)
10:30	Coffee break
11:00 - 12:20	Session 2
11:00	Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model (Masayuki Asahara)
11:30	Leveraging syntactic parsing to improve event annotation matching (Camiel Colruyt, Orphée De Clercq and Véronique Hoste)
12:00	A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation (Jiyi Li and Fumiyo Fukumoto)
12:20	Lunch break
14:00 - 15:30	Session 3
14:00	Invited speaker: Edwin Simpson, Technische Universität Darmstadt. Title: Aggregating Weak Annotations and Preferences from Crowds. Abstract: Current NLP methods are data hungry, and crowdsourcing is a common solution to acquiring annotated data. However, we are often faced with disagreements among annotators caused by ambiguous or subjective tasks or unreliable annotators. This talk presents techniques for aggregating crowdsourced annotations through preference learning and classifier combination to estimate gold-standard labels or rankings. We apply Bayesian approaches to handle the inherent noise and the small amounts of data per annotator, and to provide a basis for active learning. We’ll also discuss the limitations of current crowdsourcing efforts: the manual effort required to set up and manage the annotation process; how much we can really trust any aggregated ‘gold’ data; and a desire for more interactive annotation processes.
15:00	Distance-based Consensus Modeling for Complex Annotations (Alexander Braylan and Matthew Lease)
15:20	Afternoon coffee break
16:00 - 17:20	Session 4
16:00	Crowd-sourcing annotation of complex NLU tasks: A case study of argumentative content annotation (Tamar Lavee, Lili Kotlerman, Matan Orbach, Yonatan Bilu, Michal Jacovi, Ranit Aharonov and Noam Slonim)
16:30	Computer Assisted Annotation of Tension Development in TED Talks through Crowdsourcing (Seungwon Yoon, Wonsuk Yang and Jong Park)
17:00	CoSSAT: Code-Switched Speech Annotation Tools (Sanket Shah, Pratik Joshi, Sebastin Santy and Sunayana Sitaram)

Important Dates

May 13, 2019: First call for papers
June 24, 2019: Second call for papers
September 2, 2019: Submission deadline
September 20, 2019: Notification of acceptance
September 30, 2019: Camera-ready papers due
November 3, 2019: Workshop date

Submission Details

We welcome both long and short papers. Long papers may have at most 8 pages of content, and short papers at most 4 pages of content. References do not count against these limits. Submissions should follow the EMNLP-IJCNLP 2019 guidelines.

Papers should be submitted online via START.

Workshop Organizers

Silviu Paun, Queen Mary University of London, s.paun@qmul.ac.uk

Dirk Hovy, Bocconi University, mail@dirkhovy.com

Invited Speakers

Jordan Boyd-Graber, University of Maryland

Edwin Simpson, Technische Universität Darmstadt

Programme Committee

Beata Beigman Klebanov, Princeton
Bob Carpenter, Columbia University
Jon Chamberlain, University of Essex
Anca Dumitrache, Vrije Universiteit Amsterdam
Paul Felt, IBM
Udo Kruschwitz, University of Essex
Matthew Lease, University of Texas at Austin
Massimo Poesio, Queen Mary University of London
Edwin Simpson, Technische Universität Darmstadt
Henning Wachsmuth, Universität Paderborn

References

An Thanh Nguyen, Byron Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. Aggregating and predicting sequence labels from crowd annotations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 299–309. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1028. URL http://aclweb.org/anthology/P17-1028.
Rebecca J. Passonneau. Computing reliability for coreference annotation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04). European Language Resources Association (ELRA), 2004. URL http://www.lrec-conf.org/proceedings/lrec2004/pdf/752.pdf.
Silviu Paun, Jon Chamberlain, Udo Kruschwitz, Juntao Yu, and Massimo Poesio. A probabilistic annotation model for crowdsourcing coreference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1926–1937. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1218.
Massimo Poesio and Ron Artstein. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 76–83. Association for Computational Linguistics, 2005. URL http://aclweb.org/anthology/W05-0311.
R. J. Passonneau, V. Bhardwaj, A. Salleb-Aouissi, and N. Ide. Multiplicity and word sense: evaluating and learning from multiply labeled word sense annotations. Language Resources and Evaluation, 46(2):219–252, 2012. doi: 10.1007/s10579-012-9188-x.
Barbara Plank, Dirk Hovy, and Anders Søgaard. Linguistically debatable or just plain wrong? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 507–511. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-2083. URL http://aclweb.org/anthology/P14-2083.