
Extraction and Evaluation of Knowledge Entities from Scientific Documents



As a core resource of scientific knowledge, academic documents are frequently used by scholars, especially newcomers to a given field. In the era of big data, scientific documents such as academic articles, patents, technical reports, and webpages are proliferating. The rapid daily growth of scientific documents indicates that a large amount of knowledge is being proposed, improved, and used (Zhang et al., 2021). In scientific documents, knowledge entities (KEs) refer to the knowledge mentioned or cited by authors, such as algorithms, models, theories, datasets and software, diseases, drugs, and genes, reflecting rich resources in diverse problem-solving scenarios (Brack et al., 2020; Ding et al., 2013; Hou et al., 2019; Li et al., 2020). The advancement, improvement, and application of KEs in academic research have played a crucial role in promoting the development of different disciplines. Extracting various KEs from scientific documents can help determine whether such KEs are emerging or typical in a specific field, and can help scholars gain a comprehensive understanding of these KEs and even of the entire research field (Wang & Zhang, 2020). KE extraction is also useful for multiple downstream tasks in information extraction, text mining, natural language processing, information retrieval, digital library research, and so on (Zhang et al., 2021). Particularly for researchers in artificial intelligence (AI), information science, and other related disciplines, discovering methods in large-scale academic literature and evaluating their performance and influence have become increasingly necessary and meaningful (Hou et al., 2020).

Methods for KE extraction from scientific documents fall into four categories: manual annotation-based (Chu & Ke, 2017; Tateisi et al., 2014; Zadeh & Schumann, 2016), rule-based (Kondo et al., 2009), statistics-based (Heffernan & Teufel, 2018; Névéol, Wilbur, & Lu, 2011; Okamoto, Shan, & Orihara, 2017), and, as the current state of the art, deep learning-based (Paul et al., 2019; Yang et al., 2018).

Currently, KEs are evaluated via frequency or text content (Wang & Zhang, 2020). Some scholars have analyzed the influence of KEs using bibliometric indicators, e.g., the frequency of mentions, citations, and usage in full text (Belter, 2014). Other studies have used text content to explore the role, function, and relationships of KEs in greater depth (Li & Yan, 2018; Li, Yan, & Feng, 2017; Wang & Zhang, 2020). Identifying patterns in the citation and use of KEs from the content of academic papers is also under investigation (Yoon et al., 2019).

In recent years, the topic of extracting and evaluating knowledge entities from scientific documents has attracted attention from the research community. Several conferences and workshops align with this topic, such as the Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE) (Zhang et al., 2020), the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) (Cabanac et al., 2016), the Workshop on Mining Scientific Publications (WOSP, https://wosp.core.ac.uk/), the Workshop on AI + Informetrics (AII) (Zhang et al., 2021), the Workshop on Scholarly Document Processing (SDP) (Chandrasekaran et al., 2020), and the Workshop on Natural Language Processing for Scientific Text (SciNLP, https://scinlp.org).

We are very grateful for the seven contributions submitted to this special issue of the Journal of Data and Information Science (JDIS), five of which were accepted after several rounds of peer review and revision.

The paper “Sentence, Phrase, and Triple Annotations to Build a Knowledge Graph of Natural Language Processing Contributions—A Trial Dataset” (D’Souza & Auer, 2021) normalized the NLPCONTRIBUTIONS scheme into a structure for contribution information extracted directly from natural language processing (NLP) articles. The authors demonstrated that the NLPCONTRIBUTIONGRAPH data, once integrated into the Open Research Knowledge Graph (ORKG), a next-generation KG-based digital library that enables intelligent computations over structured scholarly knowledge, can assist researchers in their daily academic tasks.

The paper “Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling” (Ding et al., 2021) proposed an automatic keyphrase extraction model for Chinese medical abstracts that combines a sequence labeling formulation with a pre-trained language model. The experiments compared word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models. The results show that the proposed character-level sequence labeling model based on BERT obtains an F1-score of 59.80%, an absolute improvement of 9.64%.
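To make the character-level formulation concrete, the following minimal sketch shows how an abstract and its gold keyphrases might be converted into per-character B/I/O labels, which is the kind of target a sequence labeling model is trained to predict. It is an illustration only, not the authors' implementation: the function name `char_bio_labels`, the B/I/O tag set, and the example sentence are assumptions introduced here.

```python
def char_bio_labels(text, keyphrases):
    """Assign a B/I/O label to every character of `text`.

    Hypothetical helper: B marks the first character of a keyphrase,
    I marks the remaining characters, and O marks everything else.
    """
    labels = ["O"] * len(text)
    for phrase in keyphrases:
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Only label spans that do not overlap an earlier keyphrase.
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B"
                for i in range(start + 1, end):
                    labels[i] = "I"
            start = text.find(phrase, end)
    return labels


# Illustrative sentence (not from the paper's dataset).
text = "糖尿病患者的血糖控制"
labels = char_bio_labels(text, ["糖尿病", "血糖控制"])
for char, label in zip(text, labels):
    print(char, label)
```

Labeling characters rather than words avoids depending on a Chinese word segmenter, whose errors would otherwise propagate into the keyphrase boundaries.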

The paper “Content Characteristics of Knowledge Integration in the eHealth Field: An Analysis Based on Citation Contexts” (Wang et al., 2021) explored the content characteristics of knowledge integration in an interdisciplinary field—eHealth. Associated knowledge phrases (AKPs) shared between citing papers and their references were extracted from the citation contexts of eHealth papers by applying a stem-matching method. A classification schema that considers the functions of knowledge in the given domain was proposed to categorize the identified AKPs. The annotated AKPs reveal that different knowledge types have remarkably different integration patterns in terms of knowledge amount, the breadth of source disciplines, and the integration time lag.
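As a rough illustration of stem matching, the sketch below finds phrases shared between a citation context and the text of a cited reference by comparing stemmed n-grams. It is not the procedure used in the paper: the toy suffix stripper `crude_stem` stands in for a real stemmer such as Porter's, and the function names, the n-gram limit, and the example sentences are all assumptions made for this example.

```python
import re

# Toy suffix list, checked in order; a stand-in for a real stemmer.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")


def crude_stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def stemmed_ngrams(text, n_max=3):
    """Return the set of stemmed n-grams (as tuples) for n = 1..n_max."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    return {
        tuple(tokens[i : i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def shared_phrases(citation_context, reference_text, n_max=3):
    """Stemmed phrases that occur in both the citing and the cited text."""
    return stemmed_ngrams(citation_context, n_max) & stemmed_ngrams(
        reference_text, n_max
    )


# Illustrative sentences (not from the paper's data).
context = "Mobile health interventions improved adherence"
reference = "A mobile health intervention for improving medication adherence"
shared = shared_phrases(context, reference)
print(sorted(shared, key=len, reverse=True))
```

A practical pipeline would also filter stopwords and keep only the longest matches, so that a shared phrase is not reported again through each of its sub-phrases.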

The paper “A New Citation Recommendation Strategy Based on Term Functions in Related Studies Section” (Chen, 2021) proposed a term function-based citation recommendation framework to recommend articles to users. The author presented nine term functions, of which three were newly created and six were identified from the literature. The experiments show that the term function-based methods outperform the baselines, demonstrating the framework's ability to identify valuable citations.

The last paper, “Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering” (Vahidnia, Abbasi, & Abbass, 2021), proposed a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. The experimental results show that the modified DEC in conjunction with Doc2Vec can outperform other methods in the clustering task. Using the proposed method, the authors also show how the topics have evolved over the past 30 years, using a keyword extraction method to tag and label clusters and thereby convey the context of the topics.
