Open Access

A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks

Nov 19, 2024

Figure 1. Example of a closed-domain EE using a predefined event schema.
Figure 2. Flowchart of the manual corpus annotation procedure.
Figure 3. Structure of annotated corpus in the BioNLP Standoff format.
Figure 4. Structure of annotated corpus in the BRAT Standoff format.
Figure 5. Structure of annotated corpus in the CoNLL-U format.
Figure 6. Structure of annotated corpus in OneIE's JSON format.
Figure 7. AlvisAE text annotation editor.
Figure 8. BRAT annotation tool.
Figure 9. TextAE online text annotation editor.
Figure 10. Steps for selecting documents to build an event extraction corpus.
Figure 11. Example of event annotation using the BRAT annotation tool.
Figure 12. The distribution of corpora based on language and domain.
Figure 13. Top five largest annotated corpora for event extraction tasks.
Figure 14. Comparison of tokens, sentences, and event mentions in existing annotated corpora.
Figure 15. The count of event mentions in each corpus.
Figure 16. Conceptual representation of the universal text annotation converter.

Summary of challenges and recommendations

Challenge: Lack of high-quality annotated data
Recommendations:
- To speed up the development of an annotated corpus, employ a hybrid approach as demonstrated by Li et al. (2022): annotate part of the texts manually, then train ML algorithms to annotate the remaining data using the trained model.
- This strategy is faster than manually annotating all data; however, the accuracy of the automatic annotations must be measured.

Challenge: Incompatibility of annotated corpus formats
Recommendations:
- Develop a standardized, universally accepted annotation format for text corpora that stores all information required for common EE tasks.
- Develop a universal text annotation converter for converting annotations between different formats (Figure 16).

Challenge: Subjectivity and text ambiguity
Recommendations:
- Develop a complete annotation guideline and adhere to it strictly throughout the annotation process.
- Use tools such as Git version control to manage versions of annotation files.
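The converter idea can be sketched in a few lines of Python: parse BRAT standoff text-bound (`T`) and event (`E`) lines and emit a nested JSON structure. This is a minimal sketch, not OneIE's exact schema; the output field names (`entities`, `events`, `trigger`, `arguments`) are illustrative, and attributes, normalizations, and discontinuous spans are deliberately left out.

```python
import json

def parse_brat(ann_text: str) -> dict:
    """Convert BRAT standoff T/E lines into a OneIE-style JSON dict.
    Sketch only: ignores attribute (A), normalization (N), and
    discontinuous-span annotations."""
    entities, events = {}, []
    for line in ann_text.strip().splitlines():
        fields = line.split("\t")
        if line.startswith("T"):  # text-bound annotation: id, "Type start end", text
            tid, info, text = fields
            label, start, end = info.split(" ")
            entities[tid] = {"id": tid, "type": label,
                             "start": int(start), "end": int(end),
                             "text": text}
        elif line.startswith("E"):  # event annotation: id, "Type:Trigger Role:Arg ..."
            eid, args = fields
            parts = [a.split(":") for a in args.split(" ")]
            trigger_type, trigger_id = parts[0]
            events.append({"id": eid, "event_type": trigger_type,
                           "trigger": entities[trigger_id],
                           "arguments": [{"role": role, "entity": entities[aid]}
                                         for role, aid in parts[1:]]})
    return {"entities": list(entities.values()), "events": events}

# Illustrative input: a protein, an event trigger, and one event linking them.
sample = (
    "T1\tProtein 0 5\tBRCA1\n"
    "T2\tExpression 9 18\texpressed\n"
    "E1\tExpression:T2 Theme:T1\n"
)
print(json.dumps(parse_brat(sample), indent=2))
```

A production converter would add parsers and serializers per format behind a common intermediate representation, which is exactly the role the universal converter plays in Figure 16.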

Summary of recent studies on LLMs for corpus annotations

Csanády et al. (2024)
- Results: BERT shows 91.2% to 96.5% test accuracy on the IMDb datasets compared with random baselines.
- Advantages: the proposed method can handle large-scale text annotation tasks and provides a cost-effective alternative for annotating large amounts of text.
- Limitations: annotation using LLMs slightly compromises annotation accuracy; LLMs alone cannot provide high-quality corpus annotations, and the annotated corpus is not suitable for EE tasks.

Akkurt et al. (2024)
- Results: the proposed approach improved results by 2%; all models show improved performance with GPT-4 + UD Turkish BOUN v2.11 at 76.9% (best performance).
- Advantages: the model was tested with data from the UD English and Turkish treebanks; the authors use public data and verify that the methodology complies with ethical standards.
- Limitations: the annotation outcome varies (is inconsistent) depending on the user's prompt; the method targets entity annotation, so its output is not suitable for EE tasks.

Frei and Kramer (2023)
- Results on baseline models: gbert-large (P: 70.7%, R: 97.9%, F1: 82.1%); GottBERT-base (P: 80.0%, R: 89.9%, F1: 84.7%); German-MedBERT (P: 72.7%, R: 81.8%, F1: 77.0%).
- Advantages: addresses the limited corpus availability for non-English medical texts; the proposed method shows reliable performance.
- Limitations: the proposed method is computationally expensive; the annotated corpus cannot be considered gold-standard and requires further validation; the method targets entity annotation, so its output is not suitable for EE tasks.

Li et al. (2023)
- Results: up to 21% performance improvement over random baselines.
- Advantages: the annotation process is shared between humans and LLMs, providing a cost-effective alternative for annotating large amounts of text.
- Limitations: the study does not assess whether LLM-generated annotations outperform a human-annotated corpus; the method targets entity annotation, so its output is not suitable for EE tasks.

Annotated corpora for the event extraction task

ID Corpus Short Name Corpus Full Name Domain Area Language Corpus Size (# docs) Annotation Method Public Access Charges Format Benchmark Corpus
C01 MUSIED Multi-Source Informal Event Detection General Chinese 11,381 Manual Free of charge JSON ×
C02 MAVEN MAssive eVENt detection dataset General English 4,480 Manual Free of charge JSON
C03 ACE 2005 ACE 2005 Multilingual Training Corpus General English, Chinese 599 (En), 633 (Ch) Manual × Licensed (Paid) XML
C04 CFEE Chinese Financial Event Extraction Finance Chinese 2,976 Automatic Free of charge JSON
C05 ChFinAnn ChFinAnn Finance Chinese 32,040 Manual Free of charge JSON
C06 FEED Chinese Financial Event Extraction Dataset Finance Chinese 31,748 Automatic & manual Free of charge JSON ×
C07 EPI Epigenetics and Post-Translational Modifications 2011 Biomedical English 1,200 Manual × Free of charge BioNLP Standoff
C08 ID Infectious Diseases 2011 Biomedical English 30 Manual Free of charge BioNLP Standoff
C09 GE 11 Genia Event Extraction 2011 Biomedical English 1,210 Manual Free of charge BioNLP Standoff
C10 PC Pathway Curation 2013 Biomedical English 525 Manual Free of charge BioNLP Standoff
C11 CG Cancer Genetics 2013 (CG) Biomedical English 600 Manual Free of charge BioNLP Standoff
C12 BB3 Bacteria Biotope 2016 Biomedical English 215 Manual × Free of charge BioNLP Standoff
C13 MLEE Multi-Level Event Extraction Biomedical English 262 Manual Free of charge BRAT Standoff, CoNLL-U
C14 LEVEN Large-Scale Chinese Legal Event Detection Dataset Legal Chinese 8,116 Automatic & manual Free of charge JSON

Corpus statistics

ID Corpus Name Data Sources Tokens Count Sentences Count Event Mentions Negative Events Event Types
C01 MUSIED 11,381 docs 7.105 M 315,743 35,313 N/A 21
C02 MAVEN 4,480 docs 1.276 M 49,873 118,732 497,261 168
C03 ACE 2005 599 docs (En), 633 docs (Ch) 303k (En), 321k (Ch) 15,789 (En), 7,269 (Ch) 5,349 (En), 3,333 (Ch) N/A 5
C04 CFEE 2,976 docs N/A N/A 3,044 32,936 4
C05 ChFinAnn 32,040 docs 29,220,480 640,800 > 48,000 N/A 5
C06 FEED 31,748 docs 28,954,176 603,212 46,960 N/A 5
C07 EPI 1,200 abstracts 253,628 N/A 3,714 369 8
C08 ID 30 full-text articles 153,153 5,118 5,150 214 10
C09 GE 11 1,210 abstracts 267,229 N/A 13,603 N/A 9
C10 PC 525 docs 108,356 N/A 12,125 571 21
C11 CG 600 abstracts 129,878 N/A 17,248 1,326 40
C12 BB3 146 abstracts (ee), 161 abstracts (ee+ner) 35,380 (ee), 39,118 (ee+ner) N/A 890 (ee), 864 (ee+ner) N/A 2
C13 MLEE 262 docs 56,588 2,608 6,677 N/A 29
C14 LEVEN 8,116 docs 2.241 M 63,616 150,977 N/A 108

Comparison summary of the common annotation formats

BioNLP Standoff
- Summary: widely used in the BioNLP Shared Task and BioNLP Open Shared Task challenges.
- Output files: .txt, .a1, .a2
- Implementation method: manual annotation using a text corpus annotation tool
- Annotation structure: tab-delimited data

BRAT Standoff
- Summary: nearly identical to the BioNLP format, but with all annotations combined into a single annotation file (.ann).
- Output files: .txt, .ann
- Implementation method: manual annotation using a text corpus annotation tool
- Annotation structure: tab-delimited data

CoNLL-U
- Summary: sentence-level annotations presented in three types of lines: comment, word, and blank lines.
- Output files: .txt, .conll
- Implementation method: Python's spacy_conll package
- Annotation structure: tab-delimited data

OneIE's JSON format
- Summary: comprehensive annotation storage for each sentence in a JSON object structure.
- Output files: .json
- Implementation method: OneIE's preprocessing script or manual data transformation
- Annotation structure: JSON structure
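To make the CoNLL-U line types concrete, the following sketch builds one annotated sentence by hand: comment lines, ten tab-delimited columns per word line (`_` for unused fields), and a blank line terminating the sentence. The token values and dependency labels are illustrative, not taken from any corpus above.

```python
# The ten CoNLL-U columns, in order.
CONLLU_COLS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
               "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def word_line(idx, form, lemma, upos, head, deprel):
    """Fill the 10 CoNLL-U columns, writing '_' for unused fields."""
    row = [str(idx), form, lemma, upos, "_", "_",
           str(head), deprel, "_", "_"]
    assert len(row) == len(CONLLU_COLS)
    return "\t".join(row)

sentence = "\n".join([
    "# sent_id = 1",                 # comment lines carry sentence metadata
    "# text = BRCA1 is expressed",
    word_line(1, "BRCA1", "BRCA1", "PROPN", 3, "nsubj"),
    word_line(2, "is", "be", "AUX", 3, "aux"),
    word_line(3, "expressed", "express", "VERB", 0, "root"),
    "",                              # blank line ends the sentence
])
print(sentence)
```

In practice this serialization is produced automatically (e.g. by the spacy_conll package mentioned above) rather than written by hand; the sketch only shows what the tab-delimited structure looks like.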

Corpus annotation tools

T01 AlvisAE
- Platform compatibility: web-based (RESTful web app)
- Output format: JSON
- Charges & license: free (open source); no license provided
- Latest stable release: 2016

T02 BRAT Rapid Annotation Tool
- Platform compatibility: web-based (Python package)
- Output format: BRAT Standoff
- Charges & license: free; MIT License
- Latest stable release: v1.3 Crunchy Frog (Nov 8, 2012)

T03 TextAE
- Platform compatibility: online/web-based (Python package)
- Output format: JSON
- Charges & license: free (open source); MIT License
- Latest stable release: v4.5.4 (Mar 1, 2017)