Open Access

Sentence, Phrase, and Triple Annotations to Build a Knowledge Graph of Natural Language Processing Contributions—A Trial Dataset



Figure 1

Structured Model information as part of the research contribution highlights of a scholarly article (Lample et al., 2016) in the NlpContributionGraph scheme.

Figure 2

Functional workflow of the annotation process to obtain the NlpContributionGraph data.

Figure 3

Illustration of the annotation guideline 5 of forming triples without incorrect repetitions of the extracted phrases. This Results IU is modeled from the research paper by (Wang et al., 2018). If the phrases “in terms of” and “F1 measure” were modeled by sentence word order, they would need to be reused twice under the “ACE datasets” and “GENIA dataset” scientific terms. To avoid this incorrect repetition, despite being at the end of the sentence, they are annotated at the top of the triples hierarchy.

Figure 4

Annotated data from the paper “Sentence similarity learning by lexical decomposition and composition” under the Results Information Unit by the NlpContributionGraph scheme.

Figure 5

An Open Research Knowledge Graph paper view. The NlpContributionGraph scheme is employed to model the ResearchProblem and the Results information units of the paper.

Figure 6

A Results graph branch traversal in the ORKG until the last level.

Figure 7

An NlpContributionGraph scheme data integration use case in the Open Research Knowledge Graph digital library. An automatically generated survey over four articles, drawn from a knowledge graph of scholarly contributions built with the NlpContributionGraph scheme proposed in this work. This comparison was customized in the Open Research Knowledge Graph framework to focus only on the Results information unit (the comparison is accessible online at https://www.orkg.org/orkg/c/kM2tUq).

Figure 8

Illustration of a parent node named ‘character-level LSTM’ serving as a conceptual reference selected from the article's running text, as opposed to the section names. The figure is part of the contribution from the article (B. Wang et al., 2018). Essentially, where such encapsulation exists, coreference is applied for the child-node nesting (consider the coreference between ‘we incorporate a character-level LSTM to capture’ in sentence 1 and ‘this character-level component can also help’ in sentence 2).

Figure 9

Panels (a) and (b) depict the modeling of part of a Results information unit from a scholarly article (Ghaddar & Langlais, 2018) in the pilot and the adjudication stages, respectively.

Intra-Annotation Evaluation Results. The NlpContributionGraph scheme pilot stage annotations evaluated against the adjudicated gold-standard annotations made on the trial dataset.

Task | Information Units (P / R / F1) | Sentences (P / R / F1) | Phrases (P / R / F1) | Triples (P / R / F1)
1 MT | 66.66 / 73.68 / 70.00 | 66.67 / 54.55 / 60.00 | 37.47 / 30.96 / 33.91 | 19.73 / 17.46 / 18.53
2 NER | 79.55 / 81.40 / 80.46 | 60.89 / 69.43 / 64.88 | 44.09 / 42.60 / 43.34 | 22.34 / 21.63 / 21.98
3 QA | 93.18 / 93.18 / 93.18 | 67.96 / 79.55 / 73.30 | 54.04 / 45.21 / 49.23 | 37.50 / 32.00 / 34.52
4 RC | 70.21 / 73.33 / 71.74 | 64.64 / 60.31 / 62.40 | 35.31 / 29.24 / 32.00 | 12.59 / 11.45 / 11.99
5 TC | 86.67 / 84.78 / 85.71 | 75.44 / 78.66 / 77.01 | 54.77 / 45.38 / 49.63 | 27.41 / 22.41 / 24.66
Cum. (micro) | 78.83 / 80.65 / 79.73 | 67.25 / 67.63 / 67.44 | 45.36 / 38.83 / 41.84 | 23.76 / 20.97 / 22.28
Cum. (macro) | 78.80 / 80.49 / 79.64 | 67.33 / 68.51 / 67.92 | 45.20 / 38.91 / 41.82 | 23.87 / 20.95 / 22.31
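The F1 values in the table are consistent with the standard harmonic mean of precision and recall, F1 = 2PR/(P + R). As a quick sanity check (a minimal sketch, not part of the paper's tooling; values taken from the MT and NER rows of the Information Units column):

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * p * r / (p + r)

# MT row, Information Units column: P = 66.66, R = 73.68
print(round(f1(66.66, 73.68), 1))  # → 70.0
# NER row, Information Units column: P = 79.55, R = 81.40
print(round(f1(79.55, 81.40), 2))  # → 80.46
```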

Annotated corpus statistics for the 12 Information Units in the NlpContributionGraph scheme.

Information Unit | No. of triples | No. of papers | Triples per paper
Experiments | 168 | 3 | 56
Tasks | 277 | 8 | 34.63
ExperimentalSetup | 300 | 16 | 18.75
Model | 561 | 32 | 17.53
Hyperparameters | 254 | 15 | 16.93
Results | 688 | 42 | 16.38
Approach | 283 | 18 | 15.72
Baselines | 148 | 10 | 14.8
AblationAnalysis | 155 | 13 | 11.92
Dataset | 8 | 1 | 8
ResearchProblem | 169 | 50 | 3.38
Code | 9 | 9 | 1
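The last column is simply the triple count divided by the paper count for each information unit. A minimal check, recomputing three representative rows from the first two columns:

```python
# (no. of triples, no. of papers) for three rows of the table above
counts = {
    "Experiments": (168, 3),
    "Model": (561, 32),
    "ResearchProblem": (169, 50),
}
ratios = {unit: round(t / p, 2) for unit, (t, p) in counts.items()}
print(ratios)  # → {'Experiments': 56.0, 'Model': 17.53, 'ResearchProblem': 3.38}
```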

Two examples illustrating the three different granularities for NlpContributionGraph data instances (viz., a. sentences, b. phrases, and c. triples) modeled for the Results information unit from a scholarly article (Cho et al., 2014).

[1a. sentence 159] As expected, adding features computed by neural networks consistently improves the performance over the baseline performance.

[1b. phrases from sentence 159] {adding features, computed by, neural networks, improves the performance, over baseline performance}

[1c. triples from entities above] {(Contribution, has, Results), (Results, improves the performance, adding features), (adding features, computed by, neural networks), (Results, improves the performance, over baseline performance)}

[2a. sentence 160] The best performance was achieved when we used both CSLM and the phrase scores from the RNN Encoder – Decoder.

[2b. phrases from sentence 160] {best performance was achieved, used both CSLM and the phrase scores, from, RNN Encoder – Decoder}

[2c. triples from entities above] {(Contribution, has, Results), (Results, best performance was achieved, used both CSLM and the phrase scores), (used both CSLM and the phrase scores, from, RNN Encoder – Decoder)}
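The flat triple sets above encode a small labeled graph rooted at the Contribution node, with objects of one triple reused as subjects of the next. A minimal sketch of that chaining in plain Python (the tuple representation is our illustration, not the paper's storage format; node names are taken from example 1c):

```python
from collections import defaultdict

# Triples from example 1c, as (subject, predicate, object) tuples.
triples = [
    ("Contribution", "has", "Results"),
    ("Results", "improves the performance", "adding features"),
    ("adding features", "computed by", "neural networks"),
    ("Results", "improves the performance", "over baseline performance"),
]

# Adjacency view: each subject maps to its outgoing (predicate, object) edges.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

def leaves(node):
    """Depth-first walk from a node down to the terminal phrases."""
    if node not in graph:
        return [node]
    out = []
    for _, child in graph[node]:
        out.extend(leaves(child))
    return out

print(leaves("Contribution"))
# → ['neural networks', 'over baseline performance']
```

The traversal shows the hierarchy implied by the triples: Contribution → Results → adding features → neural networks, with ‘over baseline performance’ as a second branch under Results.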

Annotated corpus characteristics for our trial dataset containing a total of 50 NLP articles using the NlpContributionGraph model. “ann” stands for annotated and “IU” for information unit. The 50 articles are uniformly distributed across five different NLP subfields, characterized at sentence- and token-level granularities as follows: machine translation (MT): 2,596 total sentences, 9,581 total overall tokens; named entity recognition (NER): 2,295 sentences, 8,703 overall tokens; question answering (QA): 2,511 sentences, 10,305 overall tokens; relation classification (RC): 1,937 sentences, 10,020 overall tokens; text classification (TC): 2,071 sentences, 8,345 overall tokens.

Statistic | MT | NER | QA | RC | TC | Overall
Total IUs | 38 | 43 | 44 | 45 | 46 | 216
Ann. sentences | 209 | 157 | 176 | 194 | 164 | 900
Ann. sentences per total sentences | 0.081 | 0.068 | 0.07 | 0.1 | 0.079 | -
Ann. phrases | 956 | 770 | 960 | 978 | 1038 | 4,702
Avg. tokens per phrase | 2.81 | 2.87 | 2.76 | 2.91 | 2.7 | -
Ann. phrase tokens per overall tokens | 0.28 | 0.25 | 0.26 | 0.28 | 0.34 | -
Ann. triples | 590 | 504 | 619 | 620 | 647 | 2,980
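The two ratio rows appear to be derived from the per-subfield totals listed in the caption: annotated sentences divided by total sentences, and annotated phrase tokens (phrase count times average tokens per phrase) divided by overall tokens. A small check under that assumption, using the MT column:

```python
# Counts for the MT subfield, taken from the table and its caption.
ann_sentences, total_sentences = 209, 2596
ann_phrases, avg_toks_per_phrase = 956, 2.81
total_tokens = 9581

sentence_ratio = round(ann_sentences / total_sentences, 3)
phrase_token_ratio = round(ann_phrases * avg_toks_per_phrase / total_tokens, 2)
print(sentence_ratio, phrase_token_ratio)  # → 0.081 0.28
```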
eISSN: 2543-683X
Language: English
Publication schedule: 4 issues per year
Journal subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining