Volume 6 (2021): Issue 3 (June 2021)
Journal Details
Format: Journal
eISSN: 2543-683X
First Published: 30 Mar 2017
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

Extraction and Evaluation of Knowledge Entities from Scientific Documents

Published Online: 09 Aug 2021
Volume & Issue: Volume 6 (2021) - Issue 3 (June 2021)
Page range: 1 - 5

As a core resource of scientific knowledge, academic documents are frequently used by scholars, especially newcomers to a given field. In the era of big data, the number of scientific documents such as academic articles, patents, technical reports, and webpages is growing rapidly. This daily growth indicates that a large amount of knowledge is being proposed, improved, and used (Zhang et al., 2021). In scientific documents, knowledge entities (KEs) are the units of knowledge mentioned or cited by authors, such as algorithms, models, theories, datasets, software, diseases, drugs, and genes; they reflect rich resources in diverse problem-solving scenarios (Brack et al., 2020; Ding et al., 2013; Hou et al., 2019; Li et al., 2020). The advancement, improvement, and application of KEs in academic research have played a crucial role in the development of different disciplines. Extracting KEs from scientific documents can help determine whether such KEs are emerging or typical in a specific field, and can help scholars gain a comprehensive understanding of these KEs and even of the entire research field (Wang & Zhang, 2020). KE extraction is also useful for multiple downstream tasks in information extraction, text mining, natural language processing, information retrieval, and digital library research (Zhang et al., 2021). Particularly for researchers in artificial intelligence (AI), information science, and related disciplines, discovering methods in large-scale academic literature and evaluating their performance and influence have become increasingly necessary and meaningful (Hou et al., 2020).

Methods for extracting KEs from scientific documents fall into four kinds: manual annotation-based (Chu & Ke, 2017; Tateisi et al., 2014; Zadeh & Schumann, 2016), rule-based (Kondo et al., 2009), statistics-based (Heffernan & Teufel, 2018; Névéol, Wilbur, & Lu, 2011; Okamoto, Shan, & Orihara, 2017), and the current state of the art, deep learning-based (Paul et al., 2019; Yang et al., 2018).

Currently, KEs are evaluated via frequency or text content (Wang & Zhang, 2020). Some scholars have analyzed the influence of KEs using bibliometric indicators, e.g., the frequency of mentions, citations, and usage in full text (Belter, 2014). Other studies have used text content to explore the role, function, and relationships of KEs in depth (Li & Yan, 2018; Li, Yan, & Feng, 2017; Wang & Zhang, 2020). Identifying patterns in the citation and use of KEs from the content of academic papers is also under investigation (Yoon et al., 2019).
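To make the frequency-based view concrete, the following is a minimal sketch of counting KE mentions in full text. It is not taken from any of the cited studies: the function name `mention_stats`, the whole-word string matching, and the toy documents are assumptions chosen for illustration, and a real study would start from extracted entity mentions rather than a fixed list of surface forms.

```python
import re
from collections import Counter


def mention_stats(documents, entities):
    """Count total mentions and document frequency for each entity.

    Hypothetical helper: matches each entity's surface form as a whole
    word, ignoring case. Returns two Counters keyed by entity name.
    """
    total = Counter()     # number of mentions across all documents
    doc_freq = Counter()  # number of documents with at least one mention
    for text in documents:
        lowered = text.lower()
        for entity in entities:
            # Require non-word characters (or text boundaries) on both sides.
            pattern = r"(?<!\w)" + re.escape(entity.lower()) + r"(?!\w)"
            n = len(re.findall(pattern, lowered))
            if n:
                total[entity] += n
                doc_freq[entity] += 1
    return total, doc_freq


# Toy full texts, invented for this example.
docs = [
    "We train an SVM and compare it with LSTM. The SVM is faster.",
    "An LSTM encoder is used.",
    "No named method here.",
]
total, doc_freq = mention_stats(docs, ["SVM", "LSTM"])
print(total["SVM"], doc_freq["SVM"])    # 2 1
print(total["LSTM"], doc_freq["LSTM"])  # 2 2
```

Separating total mentions from document frequency distinguishes an entity used heavily in one paper from one used across many papers, which a single count would conflate.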

In recent years, the extraction and evaluation of knowledge entities from scientific documents has attracted attention from the community. Several conferences and workshops address this topic, such as the Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE) (Zhang et al., 2020), the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) (Cabanac et al., 2016), the Workshop on Mining Scientific Publications (WOSP, https://wosp.core.ac.uk/), the Workshop on AI + Informetrics (AII) (Zhang et al., 2021), the Workshop on Scholarly Document Processing (SDP) (Chandrasekaran et al., 2020), and the Workshop on Natural Language Processing for Scientific Text (SciNLP, https://scinlp.org).

We are grateful for the seven contributions submitted to this special issue of the Journal of Data and Information Science (JDIS). Five of them were accepted after several rounds of peer review and revision.

The paper “Sentence, Phrase, and Triple Annotations to Build a Knowledge Graph of Natural Language Processing Contributions—A Trial Dataset” (D’Souza & Auer, 2021) normalized the NLPCONTRIBUTIONS scheme into a structured form whose content is extracted directly from natural language processing (NLP) articles. The authors demonstrated that the NLPCONTRIBUTIONGRAPH data, once integrated into the Open Research Knowledge Graph (ORKG), a next-generation KG-based digital library with intelligent computations enabled over structured scholarly knowledge, can assist researchers in their daily academic tasks.

The paper “Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling” (Ding et al., 2021) proposed an automatic keyphrase extraction model for Chinese medical abstracts that combines a sequence labeling formulation with a pre-trained language model. The experiments compared word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models. The results show that the proposed BERT-based character-level sequence labeling model obtains an F1-score of 59.80%, an absolute improvement of 9.64%.
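As an illustration of what a character-level sequence labeling formulation looks like, the sketch below converts a sentence and its keyphrases into one label per character and decodes the labels back into phrases. It is not the authors' model: the B/I/O tag set, the function names `char_bio_labels` and `decode_spans`, and the example sentence (roughly "blood glucose control in diabetic patients") are assumptions made for this example, and the tagger that would predict such labels (for instance a BERT-based one) is left out.

```python
def char_bio_labels(text, keyphrases):
    """Assign a B/I/O label to every character of `text`.

    Hypothetical helper: every non-overlapping occurrence of a keyphrase
    is marked B (first character) and I (remaining characters).
    """
    labels = ["O"] * len(text)
    # Label longer phrases first so a nested shorter phrase cannot split them.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Skip occurrences that overlap an already labeled span.
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B"
                for i in range(start + 1, end):
                    labels[i] = "I"
            start = text.find(phrase, start + 1)
    return labels


def decode_spans(text, labels):
    """Recover keyphrases from per-character B/I/O labels."""
    spans, current = [], ""
    for ch, label in zip(text, labels):
        if label == "B":
            if current:
                spans.append(current)
            current = ch
        elif label == "I" and current:
            current += ch
        else:
            if current:
                spans.append(current)
            current = ""
    if current:
        spans.append(current)
    return spans


sentence = "糖尿病患者的血糖控制"        # invented example sentence
gold = ["糖尿病", "血糖控制"]            # invented gold keyphrases
labels = char_bio_labels(sentence, gold)
print(labels)
# ['B', 'I', 'I', 'O', 'O', 'O', 'B', 'I', 'I', 'I']
print(decode_spans(sentence, labels))
# ['糖尿病', '血糖控制']
```

Labeling characters rather than words avoids relying on a Chinese word segmenter, whose errors would otherwise propagate into the keyphrase boundaries.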

The paper “Content Characteristics of Knowledge Integration in the eHealth Field: An Analysis Based on Citation Contexts” (Wang et al., 2021) explored the content characteristics of knowledge integration in an interdisciplinary field—eHealth. Associated knowledge phrases (AKPs) shared between citing papers and their references were extracted from the citation contexts of eHealth papers by applying a stem-matching method. A classification schema that considers the functions of knowledge in the given domain was proposed to categorize the identified AKPs. The annotated AKPs reveal that different knowledge types have remarkably different integration patterns in terms of knowledge amount, the breadth of source disciplines, and the integration time lag.

The paper “A New Citation Recommendation Strategy Based on Term Functions in Related Studies Section” (Chen, 2021) proposed a term function-based citation recommendation framework to recommend articles to users. The author presented nine term functions, of which three were newly created and six were identified from the literature. The experiments show that the term function-based methods outperform the baselines, demonstrating their effectiveness in identifying valuable citations.

The last paper, “Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering” (Vahidnia, Abbasi, & Abbass, 2021), proposed a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. The experimental results show that the modified DEC in conjunction with Doc2Vec can outperform other methods in the clustering task. Using the proposed method, the authors also show how the topics have evolved over the past 30 years, using a keyword extraction method to tag and label the clusters and thereby convey the context of each topic.

Brack, A., D’Souza, J., Hoppe, A., Auer, S., & Ewerth, R. (2020). Domain-Independent Extraction of Scientific Concepts from Research Articles. In: Jose J. et al. (eds) Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12035. Springer, Cham. https://doi.org/10.1007/978-3-030-45439-5_17

Belter, C.W. (2014). Measuring the value of research data: A citation analysis of oceanographic data sets. PLoS ONE, 9(3), Article e92590. https://doi.org/10.1371/journal.pone.0092590

Cabanac, G., Chandrasekaran, M., Frommholz, I., Jaidka, K., Kan, M., Mayr, P., & Wolfram, D. (2016). Report on the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016). SIGIR Forum, 50(2), 36–43. https://doi.org/10.1145/3053408.3053417

Chandrasekaran, M.K., de Waard, A., Feigenblat, G., Freitag, D., Ghosal, T., Hovy, E., & Shmueli-Scheuer, M. (2020, November). Proceedings of the first workshop on scholarly document processing. Retrieved from https://www.aclweb.org/anthology/volumes/2020.sdp-1/

Chen, H. (2021). A New Citation Recommendation Strategy Based on Term Functions in Related Studies Section. Journal of Data and Information Science, 6(3), 75–98. https://doi.org/10.2478/jdis-2021-0022

Chu, H., & Ke, Q. (2017). Research methods: What's in the name? Library & Information Science Research, 39(4), 284–294. https://doi.org/10.1016/j.lisr.2017.11.001

D’Souza, J., & Auer, S. (2021). Sentence, Phrase, and Triple Annotations to Build a Knowledge Graph of Natural Language Processing Contributions—A Trial Dataset. Journal of Data and Information Science, 6(3), 6–34. https://doi.org/10.2478/jdis-2021-0023

Ding, L., Zhang, Z., Liu, H., Li, J., & Yu, G. (2021). Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling. Journal of Data and Information Science, 6(3), 35–57. https://doi.org/10.2478/jdis-2021-0013

Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS ONE, 8(8), e71416. https://doi.org/10.1371/journal.pone.0071416

Heffernan, K., & Teufel, S. (2018). Identifying problems and solutions in scientific text. Scientometrics, 116, 1367–1382. https://doi.org/10.1007/s11192-018-2718-6

Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2019). Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5203–5213. http://doi.org/10.18653/v1/P19-1513

Hou, L., Zhang, J., Wu, O., Yu, T., Wang, Z., Li, Z., & Yao, R. (2020). Method and dataset entity mining in scientific literature: A CNN+ Bi-LSTM model with self-attention. ArXiv Preprint. arXiv:2010.13583.

Li, K., & Yan, E. (2018). Co-mention network of R packages: Scientific impact and clustering structure. Journal of Informetrics, 12(1), 87–100. https://doi.org/10.1016/j.joi.2017.12.001

Li, K., Yan, E., & Feng, Y. (2017). How is R cited in research outputs? Structure, impacts, and citation standard. Journal of Informetrics, 11(4), 989–1002. https://doi.org/10.1016/j.joi.2017.08.003

Li, X., Rousseau, J.F., Ding, Y., Song, M., & Lu, W. (2020). Understanding Drug Repurposing From the Perspective of Biomedical Entities and Their Evolution: Bibliographic Research Using Aspirin. JMIR Medical Informatics, 8(6), e16739. https://doi.org/10.2196/16739

Kondo, T., Nanba, H., Takezawa, T., & Okumura, M. (2009). Technical Trend Analysis by Analyzing Research Papers’ Titles. In Proceedings of the 4th Language and Technology Conference. Poznan, Poland: Springer, 512–521. https://doi.org/10.1007/978-3-642-20095-3_47

Névéol, A., Wilbur, W., & Lu, Z. (2011). Extraction of data deposition statements from the literature: A method for automatically tracking research results. Bioinformatics, 27(23), 3306–3312. http://doi.org/10.1093/bioinformatics/btr573

Okamoto, M., Shan, Z., & Orihara, R. (2017). Applying Information Extraction for Patent Structure Analysis. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 989–992. https://doi.org/10.1145/3077136.3080698

Paul, D., Singh, M., Hedderich, M.A., & Klakow, D. (2019). Handling Noisy Labels for Robustly Learning from Self-Training Data for Low-Resource Sequence Labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. 29–34. http://dx.doi.org/10.18653/v1/N19-3005

Tateisi, Y., Shidahara, Y., Miyao, Y., & Aizawa, A. (2014). Annotation of Computer Science Papers for Semantic Relation Extraction. In Proceedings of the 9th International Conference on Language Resources and Evaluation. Reykjavik, Iceland: LREC, 1423–1429. http://www.lrec-conf.org/proceedings/lrec2014/summaries/461.html

Vahidnia, S., Abbasi, A., & Abbass, H. (2021). Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering. Journal of Data and Information Science, 6(3), 99–122. https://doi.org/10.2478/jdis-2021-0024

Wang, S., Mao, J., Tang, J., & Cao, Y. (2021). Content Characteristics of Knowledge Integration in the eHealth Field: An Analysis Based on Citation Contexts. Journal of Data and Information Science, 6(3), 123–145. https://doi.org/10.2478/jdis-2021-0015

Wang, Y., & Zhang, C. (2020). Using the Full-text Content of Academic Articles to Identify and Evaluate Algorithm Entities in the Domain of Natural Language Processing. Journal of Informetrics, 14(4), 101091. https://doi.org/10.1016/j.joi.2020.101091

Yang, Y., Chen, W., Li, Z., He, Z., & Zhang, M. (2018). Distantly Supervised NER with Partial Annotation Learning and Reinforcement Learning. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2159–2169. http://aclweb.org/anthology/C18-1183

Yoon, J., Chung, E., Lee, J.Y., & Kim, J. (2019). How research data is cited in scholarly literature: A case study of HINTS. Learned Publishing, 32, 199–206. https://doi.org/10.1002/leap.1213

Zadeh, B., & Schumann, A. (2016). The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods. In Proceedings of the Tenth International Conference on Language Resources and Evaluation. Portorož, Slovenia: LREC, 1862–1868. http://www.lrec-conf.org/proceedings/lrec2016/summaries/681.html

Zhang, C., Mayr, P., Lu, W., & Zhang, Y. (2020). Extraction and evaluation of knowledge entities from scientific documents: EEKE2020. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 573–574. https://doi.org/10.1145/3383583.3398504

Zhang, C., Mayr, P., Lu, W., & Zhang, Y. (2021). Editorial—Knowledge Entity Extraction and Text Mining in the Era of Big Data. Data and Information Management, 5(3), 309–311. https://doi.org/10.2478/dim-2021-0009
