A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks
Article category: Review Papers
Published online: 19 Nov 2024
Pages: 196 - 238
Received: 27 Apr 2024
Accepted: 03 Sep 2024
DOI: https://doi.org/10.2478/jdis-2024-0029
© 2024 Mohd Hafizul Afifi Abdullah et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Summary of challenges and recommendations
Challenges | Recommendations |
---|---|
Lack of high-quality annotated data | To develop an annotated corpus rapidly, a hybrid approach is suggested: annotate part of the texts manually, then train ML algorithms on that portion to annotate the remaining data. This strategy is faster than annotating all data manually. However, it is critical to measure the accuracy of the automatic annotations. |
Incompatibility of annotated corpus formats | Develop a standardized, universally accepted format for annotating text corpora. The format should store all information required for common EE tasks. Develop a universal converter for translating annotations between different formats. |
Subjectivity and text ambiguity | Develop a complete annotation guideline and adhere to it strictly throughout the annotation process. Use a version control tool such as Git to manage versions of the annotation files. |
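The recommendation to measure the accuracy of automatic annotations can be illustrated with a small scoring routine. The following is a minimal sketch, assuming annotations are stored as character-offset spans with a label and that a manually annotated subset serves as the reference; the `Span` schema, the function name `score_annotations`, and the toy data are hypothetical and not taken from the reviewed studies.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """One annotated mention: character offsets plus a label (hypothetical schema)."""
    start: int
    end: int
    label: str


def score_annotations(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 of predicted spans against gold spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
manual = {Span(0, 8, "Attack"), Span(20, 27, "Transport"), Span(40, 44, "Die")}
automatic = {
    Span(0, 8, "Attack"),
    Span(20, 27, "Meet"),      # right offsets, wrong label: counted as an error
    Span(40, 44, "Die"),
    Span(50, 55, "Attack"),    # spurious mention
}

scores = score_annotations(manual, automatic)
print(scores)
```

Exact matching on offsets and label is the strictest choice; a relaxed variant (for example, overlap-based matching) would give higher scores and should be reported as such.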
Summary of recent studies on LLMs for corpus annotations
Study | Results | Advantages | Limitations |
---|---|---|---|
| | BERT reaches test accuracies of 91.2% to 96.5% on the IMDb datasets, evaluated against random baselines. | The proposed method can handle large-scale text annotation tasks. It provides a cost-effective alternative for annotating large amounts of text. | Annotation using LLMs slightly compromises annotation accuracy. LLMs alone cannot provide high-quality corpus annotations. The annotated corpus is not suitable for EE tasks. |
| | The proposed approach improved results by 2%. All models show improved performance; the best is GPT-4 on UD Turkish BOUN v2.11, at 76.9%. | The model has been tested with data from the UD English and Turkish Treebanks. The authors use public data and verify that the methodology complies with ethical standards. | The annotation outcome is inconsistent, varying with the user’s prompt. The method targets entity annotation, so its output is not suitable for EE tasks. |
| | Results on various baseline models: gbert-large (P: 70.7%, R: 97.9%, F1: 82.1%); GottBERT-base (P: 80.0%, R: 89.9%, F1: 84.7%); German-MedBERT (P: 72.7%, R: 81.8%, F1: 77.0%). | Addresses the limited availability of corpora for non-English medical texts. The proposed method shows reliable performance. | The proposed method is computationally expensive. The annotated corpus cannot be considered a gold standard and requires more validation. The method targets entity annotation, so its output is not suitable for EE tasks. |
| | The results show up to 21% performance improvement over random baselines. | The annotation is carried out jointly by humans and LLMs. It provides a cost-effective alternative for annotating large amounts of text. | The study does not assess whether LLM-generated annotations outperform a human-annotated corpus. The method targets entity annotation, so its output is not suitable for EE tasks. |
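The precision (P), recall (R), and F1 values reported above for the German medical models can be cross-checked against each other. The check below assumes the usual definition of F1 as the harmonic mean of P and R, which the table does not state explicitly; gbert-large is worked through as an example.

```latex
F_1 = \frac{2PR}{P + R},
\qquad
F_1^{\text{gbert-large}}
  = \frac{2 \times 70.7 \times 97.9}{70.7 + 97.9}
  = \frac{13843.06}{168.6}
  \approx 82.1
```

The same calculation gives approximately 84.7 for GottBERT-base and 77.0 for German-MedBERT, so the three reported rows are internally consistent under this assumption.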
Annotated corpora for the event extraction task
ID | Corpus Short Name | Corpus Full Name | Domain Area | Language | Corpus Size (# docs) | Annotation Method | Public Access | Charges | Format | Benchmark Corpus |
---|---|---|---|---|---|---|---|---|---|---|
C01 | MUSIED | Multi-Source Informal Event Detection | General | Chinese | 11,381 | Manual | √ | Free of charge | JSON | × |
C02 | MAVEN | MAssive eVENt detection dataset | General | English | 4,480 | Manual | √ | Free of charge | JSON | √ |
C03 | ACE 2005 | ACE 2005 Multilingual Training Corpus | General | English, Chinese | 599 (En), 633 (Ch) | Manual | × | Licensed (Paid) | XML | √ |
C04 | CFEE | Chinese Financial Event Extraction | Finance | Chinese | 2,976 | Automatic | √ | Free of charge | JSON | √ |
C05 | ChFinAnn | ChFinAnn | Finance | Chinese | 32,040 | Manual | √ | Free of charge | JSON | √ |
C06 | FEED | Chinese Financial Event Extraction Dataset | Finance | Chinese | 31,748 | Automatic & manual | √ | Free of charge | JSON | × |
C07 | EPI | Epigenetics and Post-Translational Modifications 2011 | Biomedical | English | 1,200 | Manual | × | Free of charge | BioNLP Standoff | √ |
C08 | ID | Infectious Diseases 2011 | Biomedical | English | 30 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C09 | GE 11 | Genia Event Extraction 2011 | Biomedical | English | 1,210 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C10 | PC | Pathway Curation 2013 | Biomedical | English | 525 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C11 | CG | Cancer Genetics 2013 (CG) | Biomedical | English | 600 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C12 | BB3 | Bacteria Biotope 2016 | Biomedical | English | 215 | Manual | × | Free of charge | BioNLP Standoff | √ |
C13 | MLEE | Multi-Level Event Extraction | Biomedical | English | 262 | Manual | √ | Free of charge | BRAT Standoff, CoNLL-U | √ |
C14 | LEVEN | Large-Scale Chinese Legal Event Detection Dataset | Legal | Chinese | 8,116 | Automatic & manual | √ | Free of charge | JSON | √ |
Corpus statistics
ID | Corpus Name | Data Sources | Tokens Count | Sentences Count | Event Mentions | Negative Events | Event Types |
---|---|---|---|---|---|---|---|
C01 | MUSIED | 11,381 docs | 7.105 M | 315,743 | 35,313 | N/A | 21 |
C02 | MAVEN | 4,480 docs | 1.276 M | 49,873 | 118,732 | 497,261 | 168 |
C03 | ACE 2005 | 599 docs (En), 633 docs (Ch) | 303k (En), 321k (Ch) | 15,789 (En), 7,269 (Ch) | 5,349 (En), 3,333 (Ch) | N/A | 5 |
C04 | CFEE | 2,976 docs | N/A | N/A | 3,044 | 32,936 | 4 |
C05 | ChFinAnn | 32,040 docs | 29,220,480 | 640,800 | > 48,000 | N/A | 5 |
C06 | FEED | 31,748 docs | 28,954,176 | 603,212 | 46,960 | N/A | 5 |
C07 | EPI | 1,200 abstracts | 253,628 | N/A | 3,714 | 369 | 8 |
C08 | ID | 30 full-text articles | 153,153 | 5,118 | 5,150 | 214 | 10 |
C09 | GE 11 | 1,210 abstracts | 267,229 | N/A | 13,603 | N/A | 9 |
C10 | PC | 525 docs | 108,356 | N/A | 12,125 | 571 | 21 |
C11 | CG | 600 abstracts | 129,878 | N/A | 17,248 | 1,326 | 40 |
C12 | BB3 | 146 abstracts (ee), 161 abstracts (ee+ner) | 35,380 (ee), 39,118 (ee+ner) | N/A | 890 (ee), 864 (ee+ner) | N/A | 2 |
C13 | MLEE | 262 docs | 56,588 | 2,608 | 6,677 | N/A | 29 |
C14 | LEVEN | 8,116 docs | 2.241 M | 63,616 | 150,977 | N/A | 108 |
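Raw totals are hard to compare across corpora of very different sizes, so per-document averages can be a useful complement. The following is a small sketch that derives them from a few rows of the statistics table; the names `CORPUS_STATS` and `per_document` are illustrative, the "M" figures are expanded from the rounded values in the table, and entries the table gives only as N/A or as a lower bound are set to `None`.

```python
# (documents, tokens, sentences, event mentions), copied from the table above.
# None marks a value the table reports as N/A or only as a lower bound.
CORPUS_STATS = {
    "MAVEN": (4_480, 1_276_000, 49_873, 118_732),   # tokens rounded in the source
    "ChFinAnn": (32_040, 29_220_480, 640_800, None),
    "FEED": (31_748, 28_954_176, 603_212, 46_960),
    "MLEE": (262, 56_588, 2_608, 6_677),
    "LEVEN": (8_116, 2_241_000, 63_616, 150_977),   # tokens rounded in the source
}


def per_document(name: str) -> dict:
    """Average tokens, sentences, and event mentions per document for one corpus."""
    docs, tokens, sentences, events = CORPUS_STATS[name]
    return {
        "tokens_per_doc": tokens / docs,
        "sentences_per_doc": sentences / docs,
        "events_per_doc": None if events is None else events / docs,
    }


for corpus in CORPUS_STATS:
    print(corpus, per_document(corpus))
```

The ChFinAnn and FEED rows divide evenly (912 tokens per document, and 20 and 19 sentences per document respectively), which would be consistent with totals estimated from per-document averages; the table itself does not say so.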
Comparison summary of the common annotation formats
Annotation Format | Summary | Output Files | Implementation Method | Annotation Structure |
---|---|---|---|---|
BioNLP Standoff | This annotation format is widely used in the BioNLP Shared Task and BioNLP Open Shared Task challenges. | .txt, .a1, .a2 | Manual annotation using a text corpus annotation tool | Tab-delimited data |
BRAT Standoff | This format is almost identical to the BioNLP format, but with the annotations combined into a single annotation file (.ann). | .txt, .ann | Manual annotation using a text corpus annotation tool | Tab-delimited data |
CoNLL-U | Sentence-level annotations are presented in three types of lines: comment lines, word lines, and blank lines. | .txt, .conll | Python’s spacy_conll package | Tab-delimited data |
OneIE’s JSON format | Provides comprehensive annotation storage for each sentence in a JSON object structure. | .json | OneIE’s preprocessing script | JSON structure |
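To make the difference between tab-delimited standoff data and a JSON structure concrete, the sketch below parses a few BRAT standoff lines into a JSON-serializable dictionary. It is a minimal illustration under stated assumptions: only text-bound (`T`) and event (`E`) lines are handled, spans are assumed contiguous, and the output layout is a hypothetical schema, not OneIE's format. The function name `parse_brat_ann` and the sample annotations are invented for this example.

```python
import json


def parse_brat_ann(ann_text: str) -> dict:
    """Parse text-bound (T) and event (E) lines of a BRAT standoff .ann file.

    Assumes contiguous spans; other line types (relations, attributes,
    notes) are skipped.
    """
    entities, events = [], []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        ann_id, _, rest = line.partition("\t")
        if ann_id.startswith("T"):
            # Layout: T1<TAB>Type start end<TAB>covered text
            type_and_offsets, _, covered = rest.partition("\t")
            label, start, end = type_and_offsets.split(" ")
            entities.append({"id": ann_id, "type": label,
                             "start": int(start), "end": int(end),
                             "text": covered})
        elif ann_id.startswith("E"):
            # Layout: E1<TAB>EventType:TriggerId Role1:ArgId1 Role2:ArgId2 ...
            trigger, *args = rest.split()
            event_type, trigger_id = trigger.split(":")
            arguments = []
            for arg in args:
                role, ref = arg.split(":")
                arguments.append({"role": role, "ref": ref})
            events.append({"id": ann_id, "type": event_type,
                           "trigger": trigger_id, "arguments": arguments})
    return {"entities": entities, "events": events}


# Invented sample for the sentence "Sony formed a joint venture with Ericsson".
SAMPLE_ANN = (
    "T1\tOrganization 0 4\tSony\n"
    "T2\tMERGE-ORG 14 27\tjoint venture\n"
    "T3\tOrganization 33 41\tEricsson\n"
    "E1\tMERGE-ORG:T2 Org1:T1 Org2:T3\n"
)

converted = parse_brat_ann(SAMPLE_ANN)
print(json.dumps(converted, indent=2))
```

Because BioNLP Standoff uses the same line layout split across the .a1 and .a2 files, the same function could in principle be applied to their concatenated contents, although that has not been verified against the shared-task data here.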
Corpus annotation tools
ID | Tool Name | Platform Compatibility | Output Format | Charges & License Information | Latest Stable Release |
---|---|---|---|---|---|
T01 | AlvisAE | Web-based (RESTful web app) | JSON | Free (Open Source); no license provided | 2016 |
T02 | BRAT Rapid Annotation Tool | Web-based (Python package) | BRAT Standoff | Free; MIT License | v1.3 Crunchy Frog (Nov 8, 2012) |
T03 | TextAE | Online/Web-based (Python package) | JSON | Free (Open Source); MIT License | v4.5.4 (Mar 1, 2017) |