Open Access

A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks

Nov 19, 2024

Figure 1. Example of a closed-domain EE using a predefined event schema.
Figure 2. Flowchart of the manual corpus annotation procedure.
Figure 3. Structure of annotated corpus in the BioNLP Standoff format.
Figure 4. Structure of annotated corpus in the BRAT Standoff format.
Figure 5. Structure of annotated corpus in the CoNLL-U format.
Figure 6. Structure of annotated corpus in OneIE's JSON format.
Figure 7. AlvisAE text annotation editor.
Figure 8. BRAT annotation tool.
Figure 9. TextAE online text annotation editor.
Figure 10. Steps for selecting documents to build an event extraction corpus.
Figure 11. Example of event annotation using the BRAT annotation tool.
Figure 12. The distribution of corpora based on language and domain.
Figure 13. Top five largest annotated corpora for event extraction tasks.
Figure 14. Comparison of tokens, sentences, and event mentions in existing annotated corpora.
Figure 15. The count of event mentions in each corpus.
Figure 16. Conceptual representation of the universal text annotation converter.

Summary of challenges and recommendations

Challenge: Lack of high-quality annotated data
Recommendations:
- To speed up the development of an annotated corpus, employ a hybrid approach as demonstrated by Li et al. (2022): annotate part of the texts manually, then train ML algorithms to annotate the remaining data using the trained model.
- This strategy is faster than manually annotating all data; however, the accuracy of the automatic annotations must be measured.

Challenge: Incompatibility of annotated corpus formats
Recommendations:
- Develop a standardized, universally accepted annotation format for text corpora that stores all information required for common EE tasks.
- Develop a universal text annotation converter for converting annotations between different formats (Figure 16).

Challenge: Subjectivity and text ambiguity
Recommendations:
- Develop a complete annotation guideline and adhere to it strictly throughout the annotation process.
- Use tools such as Git version control to manage versions of annotation files.
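The converter idea can be sketched in a few lines of Python: parse BRAT standoff text-bound (`T`) and event (`E`) lines and emit a nested JSON structure. This is a minimal sketch, not OneIE's exact schema; the output field names (`entities`, `events`, `trigger`, `arguments`) are illustrative, and attributes, normalizations, and discontinuous spans are deliberately left out.

```python
import json

def parse_brat(ann_text: str) -> dict:
    """Convert BRAT standoff T/E lines into a OneIE-style JSON dict.
    Sketch only: ignores attribute (A), normalization (N), and
    discontinuous-span annotations."""
    entities, events = {}, []
    for line in ann_text.strip().splitlines():
        fields = line.split("\t")
        if line.startswith("T"):  # text-bound annotation: id, "Type start end", text
            tid, info, text = fields
            label, start, end = info.split(" ")
            entities[tid] = {"id": tid, "type": label,
                             "start": int(start), "end": int(end),
                             "text": text}
        elif line.startswith("E"):  # event annotation: id, "Type:Trigger Role:Arg ..."
            eid, args = fields
            parts = [a.split(":") for a in args.split(" ")]
            trigger_type, trigger_id = parts[0]
            events.append({"id": eid, "event_type": trigger_type,
                           "trigger": entities[trigger_id],
                           "arguments": [{"role": role, "entity": entities[aid]}
                                         for role, aid in parts[1:]]})
    return {"entities": list(entities.values()), "events": events}

# Illustrative input: a protein, an event trigger, and one event linking them.
sample = (
    "T1\tProtein 0 5\tBRCA1\n"
    "T2\tExpression 9 18\texpressed\n"
    "E1\tExpression:T2 Theme:T1\n"
)
print(json.dumps(parse_brat(sample), indent=2))
```

A production converter would add parsers and serializers per format behind a common intermediate representation, which is exactly the role the universal converter plays in Figure 16.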

Summary of recent studies on LLMs for corpus annotations

Csanády et al. (2024)
- Results: BERT shows 91.2% to 96.5% test accuracy on the IMDb datasets compared with random baselines.
- Advantages: the proposed method can handle large-scale text annotation tasks and provides a cost-effective alternative for annotating large amounts of text.
- Limitations: annotation using LLMs slightly compromises annotation accuracy; LLMs alone cannot provide high-quality corpus annotations, and the annotated corpus is not suitable for EE tasks.

Akkurt et al. (2024)
- Results: the proposed approach improved results by 2%; all models show improved performance with GPT-4 + UD Turkish BOUN v2.11 at 76.9% (best performance).
- Advantages: the model was tested with data from the UD English and Turkish treebanks; the authors use public data and verify that the methodology complies with ethical standards.
- Limitations: the annotation outcome varies (is inconsistent) depending on the user's prompt; the method targets entity annotation, so its output is not suitable for EE tasks.

Frei and Kramer (2023)
- Results on baseline models: gbert-large (P: 70.7%, R: 97.9%, F1: 82.1%); GottBERT-base (P: 80.0%, R: 89.9%, F1: 84.7%); German-MedBERT (P: 72.7%, R: 81.8%, F1: 77.0%).
- Advantages: addresses the limited corpus availability for non-English medical texts; the proposed method shows reliable performance.
- Limitations: the proposed method is computationally expensive; the annotated corpus cannot be considered gold-standard and requires further validation; the method targets entity annotation, so its output is not suitable for EE tasks.

Li et al. (2023)
- Results: up to 21% performance improvement over random baselines.
- Advantages: the annotation process is shared between humans and LLMs, providing a cost-effective alternative for annotating large amounts of text.
- Limitations: the study does not assess whether LLM-generated annotations outperform a human-annotated corpus; the method targets entity annotation, so its output is not suitable for EE tasks.

Annotated corpora for the event extraction task

ID Corpus Short Name Corpus Full Name Domain Area Language Corpus Size (# docs) Annotation Method Public Access Charges Format Benchmark Corpus
C01 MUSIED Multi-Source Informal Event Detection General Chinese 11,381 Manual Free of charge JSON ×
C02 MAVEN MAssive eVENt detection dataset General English 4,480 Manual Free of charge JSON
C03 ACE 2005 ACE 2005 Multilingual Training Corpus General English, Chinese 599 (En), 633 (Ch) Manual × Licensed (Paid) XML
C04 CFEE Chinese Financial Event Extraction Finance Chinese 2,976 Automatic Free of charge JSON
C05 ChFinAnn ChFinAnn Finance Chinese 32,040 Manual Free of charge JSON
C06 FEED Chinese Financial Event Extraction Dataset Finance Chinese 31,748 Automatic & manual Free of charge JSON ×
C07 EPI Epigenetics and Post-Translational Modifications 2011 Biomedical English 1,200 Manual × Free of charge BioNLP Standoff
C08 ID Infectious Diseases 2011 Biomedical English 30 Manual Free of charge BioNLP Standoff
C09 GE 11 Genia Event Extraction 2011 Biomedical English 1,210 Manual Free of charge BioNLP Standoff
C10 PC Pathway Curation 2013 Biomedical English 525 Manual Free of charge BioNLP Standoff
C11 CG Cancer Genetics 2013 (CG) Biomedical English 600 Manual Free of charge BioNLP Standoff
C12 BB3 Bacteria Biotope 2016 Biomedical English 215 Manual × Free of charge BioNLP Standoff
C13 MLEE Multi-Level Event Extraction Biomedical English 262 Manual Free of charge BRAT Standoff, CoNLL-U
C14 LEVEN Large-Scale Chinese Legal Event Detection Dataset Legal Chinese 8,116 Automatic & manual Free of charge JSON

Corpus statistics

ID Corpus Name Data Sources Tokens Count Sentences Count Event Mentions Negative Events Event Types
C01 MUSIED 11,381 docs 7.105 M 315,743 35,313 N/A 21
C02 MAVEN 4,480 docs 1.276 M 49,873 118,732 497,261 168
C03 ACE 2005 599 docs (En), 633 docs (Ch) 303k (En), 321k (Ch) 15,789 (En), 7,269 (Ch) 5,349 (En), 3,333 (Ch) N/A 5
C04 CFEE 2,976 docs N/A N/A 3,044 32,936 4
C05 ChFinAnn 32,040 docs 29,220,480 640,800 > 48,000 N/A 5
C06 FEED 31,748 docs 28,954,176 603,212 46,960 N/A 5
C07 EPI 1,200 abstracts 253,628 N/A 3,714 369 8
C08 ID 30 full-text articles 153,153 5,118 5,150 214 10
C09 GE 11 1,210 abstracts 267,229 N/A 13,603 N/A 9
C10 PC 525 docs 108,356 N/A 12,125 571 21
C11 CG 600 abstracts 129,878 N/A 17,248 1,326 40
C12 BB3 146 abstracts (ee), 161 abstracts (ee+ner) 35,380 (ee), 39,118 (ee+ner) N/A 890 (ee), 864 (ee+ner) N/A 2
C13 MLEE 262 docs 56,588 2,608 6,677 N/A 29
C14 LEVEN 8,116 docs 2.241 M 63,616 150,977 N/A 108

Comparison summary of the common annotation formats

BioNLP Standoff
- Summary: widely used in the BioNLP Shared Task and BioNLP Open Shared Task challenges.
- Output files: .txt, .a1, .a2
- Implementation method: manual annotation using a text corpus annotation tool
- Annotation structure: tab-delimited data

BRAT Standoff
- Summary: nearly identical to the BioNLP format, but with all annotations combined into a single annotation file (.ann).
- Output files: .txt, .ann
- Implementation method: manual annotation using a text corpus annotation tool
- Annotation structure: tab-delimited data

CoNLL-U
- Summary: sentence-level annotations presented in three types of lines: comment, word, and blank lines.
- Output files: .txt, .conll
- Implementation method: Python's spacy_conll package
- Annotation structure: tab-delimited data

OneIE's JSON format
- Summary: comprehensive annotation storage for each sentence in a JSON object structure.
- Output files: .json
- Implementation method: OneIE's preprocessing script or manual data transformation
- Annotation structure: JSON structure
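To make the CoNLL-U line types concrete, the following sketch builds one annotated sentence by hand: comment lines, ten tab-delimited columns per word line (`_` for unused fields), and a blank line terminating the sentence. The token values and dependency labels are illustrative, not taken from any corpus above.

```python
# The ten CoNLL-U columns, in order.
CONLLU_COLS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
               "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def word_line(idx, form, lemma, upos, head, deprel):
    """Fill the 10 CoNLL-U columns, writing '_' for unused fields."""
    row = [str(idx), form, lemma, upos, "_", "_",
           str(head), deprel, "_", "_"]
    assert len(row) == len(CONLLU_COLS)
    return "\t".join(row)

sentence = "\n".join([
    "# sent_id = 1",                 # comment lines carry sentence metadata
    "# text = BRCA1 is expressed",
    word_line(1, "BRCA1", "BRCA1", "PROPN", 3, "nsubj"),
    word_line(2, "is", "be", "AUX", 3, "aux"),
    word_line(3, "expressed", "express", "VERB", 0, "root"),
    "",                              # blank line ends the sentence
])
print(sentence)
```

In practice this serialization is produced automatically (e.g. by the spacy_conll package mentioned above) rather than written by hand; the sketch only shows what the tab-delimited structure looks like.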

Corpus annotation tools

T01 AlvisAE
- Platform compatibility: web-based (RESTful web app)
- Output format: JSON
- Charges & license: free (open source); no license provided
- Latest stable release: 2016

T02 BRAT Rapid Annotation Tool
- Platform compatibility: web-based (Python package)
- Output format: BRAT Standoff
- Charges & license: free; MIT License
- Latest stable release: v1.3 Crunchy Frog (Nov 8, 2012)

T03 TextAE
- Platform compatibility: online/web-based (Python package)
- Output format: JSON
- Charges & license: free (open source); MIT License
- Latest stable release: v4.5.4 (Mar 1, 2017)