The rapidly amassing wealth of scholarly publications (Jinha, 2010) poses a crucial dilemma for the research community: researchers can no longer keep pace with the growing volumes of knowledge locked away in unstructured, full-text articles.
Knowledge graphs (KGs), i.e. large semantic networks of entities and relationships, are a potent data model in this space. KGs capture fine-grained semantic knowledge about precise information targets, modeled as nodes and links, optionally under a cumulative knowledge capture theme. Unlike academic articles, KGs offer complete programmatic access to their semantified data at the model-specified granularity. Knowledge customizations for targeted queries are possible as knowledge subgraph views over single or aggregated KGs. Consider this in light of the fact that the researcher's key day-to-day search task over scholarly knowledge is mainly focused on determining scholarly contribution highlights, e.g. "Which is the top-performing system on the SQuAD dataset?" or "What methods are used for NER?" KGs are ideal to power such fine-grained knowledge searches. Their well-known utility is in offering enhanced contextualized search, as demonstrated successfully in industry by Facebook (Noy et al., 2019) and Google (A Reintroduction to Our Knowledge Graph and Knowledge Panels, 2020), and in the open data community by Wikidata (Vrandečić & Krötzsch, 2014), which serves information over many general domains.
In the technological ecosystem of scholarly KGs, the Open Research Knowledge Graph (ORKG) framework (Jaradeh et al., 2019), hosted at TIB, is a prominent example: it provides infrastructure for representing, curating, and comparing structured research contributions (Oelen et al., 2020).
Given this background, the work described here examines an annotation scheme called the NLPContributionGraph (NCG). The full KG for the partial structured contribution information depicted in Figure 1 can be accessed as a resource in the ORKG platform here.
In contrast to other existing content-based scholarly KG generation methods (Buscaldi et al., 2019; Jiang et al., 2020; Luan et al., 2018), NCG has an overarching knowledge capture theme, i.e. the research contributions of scholarly articles.
In this study, the NCG scheme is revisited with a two-fold objective: 1) to identify any redundancies in the representation and normalize them; and 2) to obtain a fairly reliable and consistent set of annotation guidelines. Building on our prior work, in this article we annotated the same set of 50 articles a second time and examined the changes introduced by this adjudication task. Specifically, the following questions were investigated:
1. How data intensive is the annotation procedure? That is, what proportion of the full-text article content constitutes core contribution information, and consequently, the structured data within this scheme?
2. Did significant changes need to be made to the annotation scheme between the pilot and the adjudication phases? That is, were large quantified changes observed in the intra-annotation measures?
In summary, NCG informs instance-based KG generation over NLP scholarly articles, where the modeling process is mostly data-driven and unguided by a specific ontology, except at the top-level categorization of the information under IUs. Nevertheless, a large dataset of instances annotated by the NCG scheme would be amenable to ontology learning (Cimiano et al., 2009) and concept discovery (Lin & Pantel, 2002). The NCG data characteristically caters to practical applications such as the ORKG (Jaradeh et al., 2019) and similar scholarly KG content representation frameworks designed for the discoverability of research conceptual artefacts and the comparability of these artefacts across publications (Oelen et al., 2020), which we demonstrate concretely in Section 6. By adhering to data creation standards, the NCG by-product data, when linked to web resources, will conform to the FAIR principles for scientific data (Wilkinson et al., 2016), as its data elements will become Findable, Accessible, Interoperable, and Reusable.
Early initiatives in semantically structuring scholarly articles focused on sentences as the basic unit of analysis. To this end, ontologies and vocabularies were created (Constantin et al., 2016; Pertsas & Constantopoulos, 2017; Soldatova & King, 2006; Teufel et al., 1999), corpora were annotated (Fisas et al., 2016; Liakata et al., 2010), and machine learning methods were applied (Liakata et al., 2012). Like NLPContributionGraph, these initiatives aimed to make the knowledge in scholarly articles machine-readable.
Following sentence-based annotations, the ensuing trend for structuring the scholarly record aimed specifically at bolstering search technology and was thus steered towards scientific terminology mining and keyphrase extraction. This led to the release of phrase-based annotated datasets in various domains, including multidisciplinary STEM collections (Augenstein et al., 2017; D'Souza & Auer, 2020; Handschuh & QasemiZadeh, 2014; Luan et al., 2018), which facilitated machine learning system development for the automatic identification of scientific terms from scholarly articles (Ammar et al., 2017; Beltagy, Lo, & Cohan, 2019; Brack et al., 2020; Luan, Ostendorf, & Hajishirzi, 2017).
While NLPContributionGraph similarly annotates phrases, it goes a step further by structuring the phrases as triples organized under contribution-centered information units.
The NCG scheme aims to build a scholarly KG following a bottom-up, data-driven design. Thus, while not a fully ontologized model, it has one predefined top-level layer with a set of content category types for surface typing of the contribution knowledge that the graph represents. This follows the idea of organizing content as scholarly article sections; however, in the NCG, the types are a predefined closed-class set, similar to the introduction, methods, results, and discussion (IMRAD) format prescribed for medical scientific writing (Huth, 1987; Sollaci & Pereira, 2004). Next, we describe this knowledge systematization.
The NCG scheme has two levels of content typing: 1) a root node called "Contribution." By means of this node, an instantiated NCG can be attached to another KG, e.g. extending the ORKG by attaching an instantiated NCG's "Contribution" node to it. 2) At the next level, there are 12 nodes referred to as IUs. Each scholarly article's annotated contribution data elements are organized under three (mandatory) or more of the IU nodes. These nodes are briefly described next.
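To make this two-level typing concrete, the following minimal sketch builds such a graph with the rdflib Python library. The namespace, node identifiers, and the hasInformationUnit property are illustrative assumptions of this sketch, not the official NCG or ORKG vocabulary.

```python
# A minimal sketch of NCG's two-level content typing using rdflib.
# Namespace and property names are illustrative assumptions.
from rdflib import Graph, Namespace, Literal

NCG = Namespace("https://example.org/ncg/")  # hypothetical namespace

g = Graph()
contribution = NCG["paper123/Contribution"]  # level 1: the root node

# Level 2: attach the three mandatory IU nodes under the root.
for iu in ["ResearchProblem", "Approach", "Results"]:
    g.add((contribution, NCG["hasInformationUnit"], NCG[f"paper123/{iu}"]))

# The article's contribution data elements hang off the IU nodes as triples.
g.add((NCG["paper123/ResearchProblem"], NCG["has"],
       Literal("named entity recognition")))

print(g.serialize(format="turtle"))
```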
Annotating structured contribution information from scholarly articles (see Section 4 for details) showed that, per article, contribution-centered content could be organized under three or more different rhetorical categories inspired by scholarly article section names. Specifically, 12 different contribution content types were identified, a few of which are in common with the fine-grained rhetorical semantic classes annotated for scholarly article sentences (Teufel, Siddharthan, & Batchelor, 2009). The 12 types are as follows.
Next, we discuss the 10th IU, called Tasks.
This concludes the detailed description of the 12 top-level IU nodes in the NCG scheme. Of the 12, only three are mandatory for structuring contributions per scholarly article: ResearchProblem, Approach (or Model), and Results.
The pilot annotation task (D’Souza & Auer, 2020) that resulted in the preliminary version of the NCG scheme was performed on a set of 50 NLP scholarly articles. This set of 50 articles constituted the trial dataset. Thus, the preliminary NCG scheme was defined over the trial dataset in a data-driven manner during the pilot annotation exercise.
There were two requirements decided at the outset of the annotation task. First and foremost, the graph model needed to be applicable across different NLP subfields; the trial dataset was therefore composed of 10 articles from each of five tasks: machine translation (MT), named entity recognition (NER), question answering (QA), relation classification (RC), and text classification (TC). These five tasks were randomly selected among the most popular NLP tasks on the paperswithcode.com leaderboard.
The second design choice regards the granularity of the data annotated in the NCG scheme. In the Related Work (Section 2), we saw that sentential and phrasal granularities were used in prior work on structuring scholarly articles. Thus, inspired by this prior annotation science and toward modeling KGs, the following three granularity levels were established: at the first level, contribution-centered sentences are selected from the article; at the second, scientific-term and predicate phrases are annotated within those sentences; and at the third, the phrases are structured into (subject, predicate, object) triples.
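As a schematic illustration of the three granularity levels, consider the following record; the sentence, phrases, and data layout are invented for illustration and do not reflect the released dataset format.

```python
# One schematic NCG annotation at three granularities. All values invented.
annotation = {
    # Level 1: a contribution-centered sentence selected from the article.
    "sentence": "Our BiLSTM-CRF model achieves 91.2 F1 on CoNLL-2003.",
    # Level 2: scientific-term and predicate phrases within the sentence.
    "phrases": ["BiLSTM-CRF model", "achieves", "91.2 F1", "CoNLL-2003"],
    # Level 3: (subject, predicate, object) triples formed from the phrases.
    "triples": [("BiLSTM-CRF model", "achieves", "91.2 F1")],
}
print(annotation["triples"][0])
```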
Table 1 below illustrates two examples of modeling contribution-oriented sentences, phrases, and triples from scholarly publications, categorized under their respective IUs.
Two examples illustrating the three different granularities for NLPContributionGraph annotations.
We refer the reader to our prior work (D’Souza & Auer, 2020) for additional details regarding the pilot annotation task itself.
We carried out a two-stage annotation cycle over the trial dataset to finalize the NCG scheme. The first stage was the pilot annotation stage described above; the second was the adjudication stage.
There were two requirements decided at the outset of the adjudication annotation task: 1) to normalize the IUs into a smaller, but comprehensively representative, set of similar structured properties to facilitate succinct contribution comparisons across articles' contribution graphs; and 2) to improve the phrasal boundary decisions made in the pilot stage, targeting precise scientific knowledge semantics within the annotated phrases. Otherwise, both the pilot and adjudication stages adopted the same annotation workflow, as depicted in Figure 2.
Let us elaborate on the first requirement: normalizing IUs—we had 16 different IUs in the pilot stage; during the adjudication stage, these were normalized into a set of nine IUs, ResearchProblem among them.
Finally, five main annotation guidelines are prescribed for NCG:

1. Sentences with contribution data are identified in various places in the paper, including the title, abstract, and full text. Within the full text, only the Introduction and Results sections are annotated. Sometimes, the first few sentences of the Methods section are annotated as well, if method highlights are unspecified in the Introduction.
2. Only sentences that directly state the paper's contribution are annotated.
3. All relation predicates are annotated from the paper's text, except three predefined predicates supplied by the scheme.
4. Past the IU level, section names are preferred for parent node names in the graph; where this is challenging, the parent node names are identified in the running text (see the Appendix 1 example).
5. Repetitions of scientific terms or predicates that do not correspond to the actual information in the text are not allowed when forming KG triples, as illustrated in Figure 3 and sketched as a check below.
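The check referenced in guideline 5 could look as follows; the data layout and matching by exact phrase strings are assumptions of this sketch.

```python
# Sketch of a consistency check for guideline 5: a phrase may not be used
# in triples more often than it is actually annotated in the text.
from collections import Counter

def check_no_spurious_repetition(triples, annotated_phrases):
    """Return phrases used in triples more often than annotated in the text."""
    used = Counter(p for triple in triples for p in triple)
    available = Counter(annotated_phrases)
    return {p: n for p, n in used.items() if n > available[p]}

triples = [("model", "achieves", "91.2 F1"), ("model", "trained on", "CoNLL")]
annotated = ["model", "achieves", "91.2 F1", "trained on", "CoNLL"]
# "model" is annotated once but used twice, so it is flagged: {'model': 2}
print(check_no_spurious_repetition(triples, annotated))
```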
The trial dataset for designing the NCG scheme was derived from a collection downloaded from the publicly available leaderboard of tasks in artificial intelligence called paperswithcode.com.
The raw data needed to undergo two preprocessing steps to be ready for analysis: 1) for PDF-to-text conversion of the scholarly articles, the GROBID parser (GROBID, 2008) was applied; following which, 2) for plaintext preprocessing in terms of tokenization and sentence splitting, the Stanza toolkit (Qi et al., 2020) was used. The resulting preprocessed articles were then leveraged in the two-stage annotation cycle (see Section 4) to devise the NCG scheme.
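A minimal sketch of this preprocessing pipeline is shown below, assuming a GROBID service running locally on its default port; the TEI-to-plaintext step is a crude simplification for illustration.

```python
# Sketch of the two-step preprocessing: GROBID for PDF-to-text, Stanza for
# tokenization and sentence splitting.
import xml.etree.ElementTree as ET
import requests
import stanza

# Step 1: PDF-to-text conversion via GROBID's REST API (returns TEI XML).
with open("paper.pdf", "rb") as pdf:
    tei_xml = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": pdf},
    ).text

# Crudely pull the paragraph text out of the returned TEI XML.
TEI = "{http://www.tei-c.org/ns/1.0}"
root = ET.fromstring(tei_xml)
plaintext = " ".join("".join(p.itertext()) for p in root.iter(f"{TEI}p"))

# Step 2: tokenization and sentence splitting with Stanza.
stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize")
sentences = [s.text for s in nlp(plaintext).sentences]
```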
The overall annotated corpus statistics for our trial dataset after the adjudication stage are depicted in Table 2. We see that, in each of the five subfields, approx. 40 IUs were annotated over the subfield's articles, i.e. on average four IUs per article, since 10 articles were selected per subfield. This implies that, on average, each article was annotated with roughly one additional IU beside the three mandatory IUs (ResearchProblem, Approach or Model, and Results).
Annotated corpus characteristics for our trial dataset containing a total of 50 NLP articles using the NLPContributionGraph scheme.
| | MT | NER | QA | RC | TC | Overall |
|---|---|---|---|---|---|---|
| Information units | 38 | 43 | 44 | 45 | 46 | 216 |
| Contribution sentences | 209 | 157 | 176 | 194 | 164 | 900 |
| Proportion of contribution sentences | 0.081 | 0.068 | 0.07 | 0.1 | 0.079 | - |
| Phrases | 956 | 770 | 960 | 978 | 1,038 | 4,702 |
| | 2.81 | 2.87 | 2.76 | 2.7 | | - |
| | 0.28 | 0.25 | 0.26 | 0.28 | | - |
| Triples | 590 | 504 | 619 | 620 | 647 | 2,980 |
Next we highlight a few differences between the five subfields in our dataset. In Table 2, we see that machine translation (MT) had the most annotated sentences. This can be attributed to the observation that MT articles generally tend to be descriptively longer, particularly in terms of model settings, compared to the other four subfields. Further, relation classification (RC) had the highest proportion of contribution sentences constituting its articles; at 0.1, this is still a low proportion, reflecting the fact that contribution information is contained in relatively few sentences. Text classification (TC) had the highest number of annotated phrases despite not being among the tasks with the highest numbers of annotated sentences. This implies that the number of annotated sentences is not directly related to the number of annotated phrases for the tasks in our data. The relation between phrases and triples behaves differently, however: the number of phrases and the number of triples are directly related.
Table 3 depicts the final annotated corpus statistics in terms of the information units. We make the following key observations: the ratio of triples to papers varies widely across the IUs, from 56 triples per paper at the high end to 1 at the low end, and only one IU was annotated in all 50 papers.
Annotated corpus statistics for the 12 Information Units in the NLPContributionGraph scheme.
| Information Unit | No. of triples | No. of papers | Ratio of triples to papers |
|---|---|---|---|
| | 168 | 3 | 56 |
| | 277 | 8 | 34.63 |
| | 300 | 16 | 18.75 |
| | 561 | 32 | 17.53 |
| | 254 | 15 | 16.93 |
| | 688 | 42 | 16.38 |
| | 283 | 18 | 15.72 |
| | 148 | 10 | 14.8 |
| | 155 | 13 | 11.92 |
| | 8 | 1 | 8 |
| | 169 | 50 | 3.38 |
| | 9 | 9 | 1 |
We now compute the intra-annotation agreement measures between the first- and second-stage versions of the dataset annotations across all three data elements in the NCG scheme, as well as its top-level information units. Our evaluation metrics are the standard precision, recall, and F1-score.
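For concreteness, the following minimal sketch computes these metrics, treating the second-stage annotations as gold; representing each data element as a set of exactly matching items is an assumption of this sketch.

```python
# Minimal sketch of the intra-annotation agreement computation, with the
# second-stage annotations treated as the gold standard.
def precision_recall_f1(stage1: set, stage2_gold: set) -> tuple:
    true_pos = len(stage1 & stage2_gold)
    p = true_pos / len(stage1) if stage1 else 0.0
    r = true_pos / len(stage2_gold) if stage2_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# e.g. contribution sentences keyed by (article_id, sentence_index)
stage1 = {("a01", 3), ("a01", 7), ("a02", 1)}
stage2 = {("a01", 3), ("a01", 9), ("a02", 1)}
print(precision_recall_f1(stage1, stage2))  # (0.667, 0.667, 0.667)
```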
Table 4 depicts the results. With these scores, we quantitatively observe the degree of change between the two annotation stages, treating the second stage as the reference gold standard. Between the two stages, the F1-scores were: information units 79.64%, sentences 67.92%, phrases 41.82%, and triples 22.31%. We conclude that the interpretation of annotations related to the top-level organization of scholarly contributions did not change significantly (at 79.64% F1-score). Even the annotator's decisions about which sentences contain contribution-centered information showed a low degree of change (at 67.92% F1-score). However, the fine-grained organization of the contribution-focused information, as phrases or triples, obtained low F1-scores. Finally, from the results, we see that our pipelined annotation task exhibits the general disadvantage of pipelined systems, wherein performance in later annotation stages is limited by performance in earlier stages.
Intra-annotation evaluation results between the two annotation stages of the NLPContributionGraph scheme, with the second-stage annotations treated as the gold standard.
| Tasks | Information Units | | | Sentences | | | Phrases | | | Triples | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| 1 | 66.66 | 73.68 | 70.0 | 66.67 | 54.55 | 60.0 | 37.47 | 30.96 | 33.91 | 19.73 | 17.46 | 18.53 |
| 2 | 79.55 | 81.40 | 80.46 | 60.89 | 69.43 | 64.88 | 44.09 | 42.60 | 43.34 | 22.34 | 21.63 | 21.98 |
| 3 | 93.18 | 93.18 | 93.18 | 67.96 | 79.55 | 73.30 | 54.04 | 45.21 | 49.23 | 37.50 | 32.0 | 34.53 |
| 4 | 70.21 | 73.33 | 71.74 | 64.64 | 60.31 | 62.40 | 35.31 | 29.24 | 32.0 | 12.59 | 11.45 | 11.99 |
| 5 | 86.67 | 84.78 | 85.71 | 75.44 | 78.66 | 77.02 | 54.77 | 45.38 | 49.63 | 27.41 | 22.41 | 24.66 |
| Avg. | 78.83 | 80.65 | 79.73 | 67.25 | 67.63 | 67.44 | 45.36 | 38.83 | 41.84 | 23.76 | 20.97 | 22.28 |
| Overall | 78.8 | 80.49 | 79.64 | 67.33 | 68.51 | 67.92 | 45.2 | 38.91 | 41.82 | 23.87 | 20.95 | 22.31 |
In light of the low intra-annotator agreement obtained at the phrasal and triple information granularities, a natural question arises: can these fine-grained data elements be annotated reliably at all?
Observing the task-specific intra-annotation measures in rows 1 to 5, we find that question answering (QA) and text classification (TC) have the highest scores, reflecting fewer changes made to their annotations than for the other tasks, albeit at decreasing levels across the data elements.
Generally, one may wonder why triple formation is challenging given a set of scientific-term and predicate phrases, as indicated by its F1-scores being the lowest per task and overall. There are two reasons: a) the phrases were significantly changed in the second stage (see the Phrases F1-scores, all below 50%), which in turn impacted triple formation; and b) the triples can often be formed in more than one way. In the adjudication process, it was established that triples must strictly conform to the order of appearance of their phrases in the text, a convention that was not consistently followed in the first stage.
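A minimal sketch of the order-of-appearance convention established in the adjudication stage follows; the character offsets and the mapping of the ordered phrases onto subject, predicate, and object are illustrative assumptions.

```python
# Sketch of the adjudication-stage convention: triples strictly follow the
# order in which their phrases appear in the source text.
def order_phrases(phrases):
    """Sort annotated phrases by their start offset in the source text."""
    return sorted(phrases, key=lambda p: p["start"])

phrases = [
    {"text": "91.2 F1",   "start": 34},
    {"text": "achieves",  "start": 25},
    {"text": "our model", "start": 15},
]
subj, pred, obj = (p["text"] for p in order_phrases(phrases))
triple = (subj, pred, obj)  # ("our model", "achieves", "91.2 F1")
print(triple)
```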
The NCG scheme was designed to structure NLP contributions, thereby generating a contributions-centric KG. Such data will ease the information processing load for researchers, who presently invest a large share of their time in surveying their field by reading full-text articles. The rationale for designing such a scheme was the availability of scholarly KGs, such as the ORKG (Jaradeh et al., 2019), that are equipped with features to automatically generate tabulated comparisons of various approaches addressing a certain research problem on their common properties and values (Oelen et al., 2020). We integrate some of our articles' structured contributions into the Open Research Knowledge Graph (ORKG). Tapping into the ORKG's contribution comparison generator over our structured data, we demonstrate how such comparisons support researchers in surveying structured scholarly knowledge.
In the following subsections, we first describe how an article's contribution data modeled by NCG is integrated in the ORKG, and then illustrate the comparison feature.
The ORKG comprises structured descriptions of research contributions per article. Users can enter the relevant data about their papers via the framework online at https://orkg.org.
With the help of the ORKG paper editor interface, we add a paper to the ORKG, including its bibliographic information, its ResearchProblem, and its remaining contribution data organized under the IUs.
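For readers interested in programmatic entry, a hypothetical sketch follows; note that this work used the ORKG paper editor UI, and the endpoint path and payload fields below are assumptions for illustration only, not the documented ORKG API.

```python
# Hypothetical sketch of programmatic paper entry into the ORKG.
# Endpoint and payload structure are illustrative assumptions.
import requests

paper = {
    "title": "An example NLP paper",   # bibliographic information
    "doi": "10.1234/example.doi",      # hypothetical DOI
    "contributions": [{
        "name": "Contribution 1",
        "ResearchProblem": "named entity recognition",
        # ... further contribution data organized under the IUs ...
    }],
}
response = requests.post("https://orkg.org/api/papers/", json=paper)  # assumed route
response.raise_for_status()
print(response.json())
```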
With this we have described how the structured contributions data of individual papers are represented in the ORKG. Next, we showcase the ORKG feature for generating comparisons.
The ORKG has a feature to generate and publish surveys in the form of tabulated comparisons over articles' knowledge graph nodes (Oelen et al., 2020). To demonstrate this feature, we entered the structured contribution data for a subset of our articles and generated their comparison.
Thus we have demonstrated how structured contributions from the NCG scheme help address the massive scholarly knowledge content ingestion problem.
We have discussed the NCG scheme for structuring research contributions in NLP articles as KGs. We described the process of leveraging the NCG scheme to annotate contributions in our trial dataset of 50 NLP articles in two stages, which helped us obtain the NCG annotation guidelines and improve data quality. Further, we demonstrated how such structured data is pertinent, in the face of growing volumes of scholarly content, to alleviating the scholarly knowledge content ingestion problem. Our annotated dataset is publicly available here.
As future directions, to realize a full-fledged KG in the context of the NCG scheme, a few information extraction (IE) modules would need to be improved or added. These include: (1) improving the PDF parser (see Appendix 3 for challenges); (2) incorporating an entity and relation linking and normalization module; (3) merging phrases from the unstructured text with known ontologies (e.g. MEX (Esteves et al., 2015)) to align resources and thus ensure data interoperability and reusability; and (4) modeling inter-domain knowledge.