Aggregating and analysing crowdsourced annotations for NLP (AnnoNLP)
Silviu Paun, Dirk Hovy
The workshop was accepted and assigned to EMNLP-IJCNLP 2019, which will take place in Hong Kong.
Crowdsourcing, whether through microwork platforms or through Games with a Purpose, is increasingly used as an alternative to traditional expert annotation, achieving comparable annotation quality at lower cost and offering greater scalability. The NLP community has enthusiastically adopted crowdsourcing to support work on tasks such as coreference resolution, sentiment analysis, textual entailment, named entity recognition, word similarity, word sense disambiguation, and many others. This interest has also resulted in the organization of a number of workshops at ACL and elsewhere, from as early as “The People’s Web meets NLP” in 2009. These days, general-purpose research on crowdsourcing can be presented at HCOMP or CrowdML, but the need for workshops more focused on the use of crowdsourcing in NLP remains.
In particular, NLP-specific methods are typically required for aggregating the interpretations provided by the annotators. Most existing work on aggregation methods rests on a common set of assumptions: 1) the true classes are independent of one another, 2) the set of classes the coders can choose from is fixed across the annotated items, and 3) there is one true class per item. However, for many NLP tasks such assumptions are not entirely appropriate. For example, sequence labelling tasks (e.g., NER, tagging) have an implicit inter-label dependence (e.g., Nguyen et al., 2017). In other tasks, such as coreference, the labels the coders can choose from are not fixed but depend on the mentions in each document (Passonneau, 2004; Paun et al., 2018). Furthermore, in many NLP tasks the data items can have more than one interpretation (e.g., Poesio and Artstein, 2005; Passonneau et al., 2012; Plank et al., 2014). Such cases of ambiguity also affect the reliability of existing gold standard datasets, which are often labelled with a single interpretation even though expert disagreement is a well-known issue.
This last point motivates research on alternative, complementary evaluation methods, as well as the development of multi-label datasets. More broadly, the proposed workshop aims to bring together researchers interested in methods for aggregating and analysing crowdsourced data for NLP-specific tasks which relax the aforementioned assumptions. We also invite work on ambiguous, subjective or complex annotation tasks which have received less attention in the literature.
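As an illustration of why the independence assumption can be problematic for sequence labelling, the following minimal sketch aggregates BIO tags token by token with a plurality vote. The four annotations are hypothetical toy data, and the helper names `tokenwise_vote` and `is_valid_bio` are introduced here purely for illustration; this is not the method of any of the cited papers.

```python
from collections import Counter

def tokenwise_vote(sequences):
    """Aggregate each token position independently by plurality vote."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*sequences)]

def is_valid_bio(tags):
    """Check that every I-X tag continues a B-X or I-X tag of the same type."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in ("B-" + tag[2:], "I-" + tag[2:]):
            return False
        prev = tag
    return True

# Hypothetical annotations of one three-token item by four annotators.
# Each individual sequence is well-formed BIO.
annotations = [
    ["O", "B-ORG", "I-ORG"],
    ["B-ORG", "I-ORG", "I-ORG"],
    ["O", "O", "B-LOC"],
    ["O", "O", "O"],
]

aggregated = tokenwise_vote(annotations)
print(aggregated)                # ['O', 'O', 'I-ORG']
print(is_valid_bio(aggregated))  # False
```

Every annotator supplied a well-formed sequence, yet the independently aggregated result starts an entity with an I- tag, which is the kind of outcome a model that accounts for inter-label dependence is meant to avoid.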
Although there is a large body of work on analysing crowdsourced data, whether probabilistic (models of annotation) or traditional (majority voting aggregation, agreement statistics), less work has been devoted to NLP tasks. NLP data often violate the assumptions made by most existing models, opening the path to new research. The aim of the proposed workshop is to bring together the community of researchers interested in this area.
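The traditional approaches mentioned above can be sketched in a few lines. The example below is a minimal illustration on hypothetical sentiment judgements: it computes a majority-vote label and a simple per-item observed agreement (the fraction of annotator pairs that agree). The function names and data are assumptions made for this sketch, and chance-corrected coefficients are deliberately left out.

```python
from collections import Counter
from itertools import combinations

def majority_vote(labels):
    """Return the most frequent label for one item (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def observed_agreement(labels):
    """Fraction of annotator pairs that chose the same label for one item."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical judgements from five annotators per item.
items = {
    "item-1": ["pos", "pos", "pos", "pos", "pos"],
    "item-2": ["pos", "neg", "pos", "neg", "pos"],
}

for name, labels in items.items():
    print(name, majority_vote(labels), observed_agreement(labels))
# item-1 pos 1.0
# item-2 pos 0.4
```

Both items receive the same aggregated label, although the annotators were unanimous on the first and split on the second. Majority voting commits to one true class per item and discards this difference.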
Topics of interest include but are not limited to the following:
- Aggregation methods for NLP tasks
  - Sequential labelling tasks (e.g., NER, chunking)
  - Tasks where the set of labels is not fixed across the data items (e.g., coreference)
  - Other cases of complex labels (e.g., for syntactic annotation)
- The effects of ambiguity
  - Allowing for multiple interpretations per data item
  - Assessing the reliability of existing gold standard datasets
  - New evaluation methodologies
  - New multi-label datasets
- Subjective, complex tasks
  - Can the crowd successfully annotate such tasks? How should the task be designed to facilitate the annotation process?
- May 10, 2019: First call for papers
- June 14, 2019: Second call for papers
- August 19, 2019: Submission deadline
- September 16, 2019: Notification of acceptance
- September 30, 2019: Camera-ready papers due
- November 3/4, 2019: Workshop date
- Silviu Paun, Queen Mary University of London, firstname.lastname@example.org
- Dirk Hovy, Bocconi University, email@example.com
- Silviu's research interests and areas of expertise include probabilistic models of annotation, with a particular interest in coreference; generative models of text, such as topic models, applied to short text data; and parameter estimation techniques, with a particular interest in variational inference.
- Dirk is an associate professor at Bocconi University in Milan. His research focuses on the intersection of social science and statistical NLP, i.e., how social dimensions influence language and, in turn, NLP models. He is also interested in matters of algorithmic fairness and bias, and works on incorporating human factors into models, including in annotation. He is the author of the annotation aggregation tool MACE, and was an organizer of five *ACL workshops and a SemEval task, as well as local chair for EMNLP 2017.
Jordan Boyd-Graber, University of Maryland
Proposed Programme Committee
(Some of the members still have to confirm)
- Omar Alonso, Microsoft (confirmed)
- Beata Beigman Klebanov, Princeton (confirmed)
- Bob Carpenter, Columbia University (confirmed)
- Jon Chamberlain, University of Essex (confirmed)
- Anca Dumitrache, Vrije Universiteit Amsterdam (confirmed)
- Paul Felt, IBM (confirmed)
- Udo Kruschwitz, University of Essex (confirmed)
- Matthew Lease, University of Texas at Austin (confirmed)
- Massimo Poesio, Queen Mary University of London (confirmed)
- Vikas C Raykar, IBM (confirmed)
- Edwin Simpson, Technische Universität Darmstadt (confirmed)
- Yudian Zheng, Twitter (confirmed)
- Rebecca Passonneau, Penn State University
- Gabriella Kazai, Lumi
- Chris Callison-Burch, University of Pennsylvania
- Matteo Venanzi, Microsoft
- An Thanh Nguyen, Byron Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. Aggregating and predicting sequence labels from crowd annotations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 299–309. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1028. URL http://aclweb.org/anthology/P17-1028.
- Rebecca J. Passonneau. Computing reliability for coreference annotation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04). European Language Resources Association (ELRA), 2004. URL http://www.lrec-conf.org/proceedings/lrec2004/pdf/752.pdf.
- Silviu Paun, Jon Chamberlain, Udo Kruschwitz, Juntao Yu, and Massimo Poesio. A probabilistic annotation model for crowdsourcing coreference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1926–1937. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1218.
- Massimo Poesio and Ron Artstein. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 76–83. Association for Computational Linguistics, 2005. URL http://aclweb.org/anthology/W05-0311.
- R. J. Passonneau, V. Bhardwaj, A. Salleb-Aouissi, and N. Ide. Multiplicity and word sense: evaluating and learning from multiply labeled word sense annotations. Language Resources and Evaluation, 46(2):219–252, 2012. doi: 10.1007/s10579-012-9188-x.
- Barbara Plank, Dirk Hovy, and Anders Søgaard. Linguistically debatable or just plain wrong? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 507–511. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-2083. URL http://aclweb.org/anthology/P14-2083.