- Journal Details
- First Published
- 30 Mar 2017
- Publication timeframe
- 4 times per year
- Open Access
Page range: 1 - 5
- Open Access
Sentence, Phrase, and Triple Annotations to Build a Knowledge Graph of Natural Language Processing Contributions—A Trial Dataset
Page range: 6 - 34
This work aims to normalize the N
We annotate, for a second time, the contribution-pertinent information across 50 previously annotated NLP scholarly articles in terms of a data pipeline comprising contribution-centered sentences, phrases, and triple statements. In the adjudication annotation stage, care was taken to reduce annotation noise while formulating the guidelines for our proposed novel scheme for structuring and graphing NLP contributions.
The application of N
We demonstrate N
- Scholarly knowledge graphs
- Open science graphs
- Knowledge representation
- Natural language processing
- Semantic publishing
- Open Access
Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling
Page range: 35 - 57
Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific research.
We regard AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and we initialize our model with the pretrained language model BERT, which was released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. As baselines, we use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory Network (BiLSTM), and BiLSTM-CRF. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.
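The character-level formulation can be illustrated with a minimal sketch that turns a text and its keyphrases into one IOB tag per character. The function name `char_iob_tags`, the sample sentence, and the longest-phrase-first rule for overlapping matches are assumptions made for illustration; they are not taken from the paper or its dataset.

```python
def char_iob_tags(text, keyphrases):
    """Label every character of `text` with B, I, or O (one tag per character)."""
    tags = ["O"] * len(text)
    # Assumption: longer phrases are tagged first, so a shorter phrase
    # never overwrites part of a longer one (no nested keyphrases).
    for phrase in sorted(keyphrases, key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Tag the span only if none of its characters is labeled yet.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(phrase, start + 1)
    return tags


# Hypothetical example sentence and keyphrases.
text = "糖尿病患者的血糖控制"
for char, tag in zip(text, char_iob_tags(text, ["糖尿病", "血糖"])):
    print(char, tag)
```

Because every character receives its own tag, no word segmenter is needed before labeling, which is the motivation stated above for working at the character level.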
Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64 percentage points.
We consider only the automatic keyphrase extraction task, not the keyphrase generation task, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for handling nested keyphrases.
For the benefit of the research community, we make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available at:
By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
- Automatic keyphrase extraction
- Character-level sequence labeling
- Pretrained language model
- Scientific Chinese medical abstracts
- Open Access
Content Characteristics of Knowledge Integration in the eHealth Field: An Analysis Based on Citation Contexts
Page range: 58 - 74
This study attempts to disclose the characteristics of knowledge integration in an interdisciplinary field by looking into the content aspect of knowledge.
The eHealth field was chosen as the case study. Associated knowledge phrases (AKPs) that are shared between citing papers and their references were extracted from the citation contexts of the eHealth papers by applying a stem-matching method. A classification schema that considers the functions of knowledge in the domain was proposed to categorize the identified AKPs. The source disciplines of each knowledge type were analyzed. Quantitative indicators and a co-occurrence analysis were applied to reveal the integration patterns of different knowledge types.
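The idea of stem matching can be sketched as follows: both texts are reduced to word stems, and stem sequences that appear in both are kept as candidate shared phrases. Everything in this sketch is an assumption made for illustration, including the toy suffix-stripping stemmer `crude_stem`, the n-gram lengths, and the sample sentences; the stemmer and matching rules used in the study are not given in this abstract.

```python
import re

# Assumption: a toy suffix list standing in for a real stemmer.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")


def crude_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_ngrams(text, n_min=2, n_max=3):
    """Return the set of stem n-grams (as tuples) found in `text`."""
    stems = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(stems) - n + 1):
            grams.add(tuple(stems[i:i + n]))
    return grams


def shared_phrases(citation_context, reference_text, n_min=2, n_max=3):
    """Stem sequences that occur in both the citation context and the reference."""
    shared = stem_ngrams(citation_context, n_min, n_max) & stem_ngrams(
        reference_text, n_min, n_max
    )
    return sorted(" ".join(gram) for gram in shared)


# Hypothetical citation context and reference text.
context = "Prior work evaluated mobile health interventions for patients."
reference = "A mobile health intervention was tested."
print(shared_phrases(context, reference))
```

Matching on stems lets "interventions" and "intervention" count as the same phrase, but phrases with the same meaning and different stems would still be missed.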
The annotated AKPs reveal the major disciplines supplying each type of knowledge. Different knowledge types have remarkably different integration patterns in terms of knowledge amount, the breadth of source disciplines, and the integration time lag. We also find several frequent co-occurrence patterns of different knowledge types.
The collected articles of the field are limited to two leading open access journals. The stem-matching method used to extract AKPs could not identify phrases that have the same meaning but are expressed in words with different stems. The Research Subject type dominates the recognized AKPs, which calls for an improvement of the classification schema to support better knowledge integration analysis on knowledge units.
The methodology proposed in this paper sheds new light on the knowledge integration characteristics of an interdisciplinary field from the content perspective. The findings have practical implications for the future development of research strategies in eHealth and for policies on interdisciplinary research.
This study proposed a new methodology to explore the content characteristics of knowledge integration in an interdisciplinary field.
- Knowledge integration
- Interdisciplinary research
- Citation contexts
- Knowledge content
- Open Access
Page range: 75 - 98
Researchers frequently encounter the following problems when writing scientific articles: (1) Selecting appropriate citations to support the research idea is challenging. (2) The literature review is not conducted extensively, which leads to working on a research problem that others have already addressed well. The study focuses on citation recommendation in the related studies section by applying the term function of a citation context, potentially improving the efficiency of writing a literature review.
We present nine term functions, three newly created and six identified from the existing literature. Using these term functions as labels, we annotate 531 research papers on three topics to evaluate our proposed recommendation strategy. BM25 and Word2vec with VSM are implemented as the baseline models for the recommendation. The term function information is then applied to enhance the performance.
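To make the BM25 baseline concrete, here is a minimal sketch of Okapi BM25 scoring over a small set of candidate texts. The function name `bm25_scores`, the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the sample texts are all assumptions for illustration; the paper's actual configuration and its term-function enhancement are not described in this abstract and are not reproduced here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Fall back to 1 to avoid dividing by zero when every document is empty.
    avg_len = sum(len(doc) for doc in tokenized) / n_docs or 1
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # IDF variant with +1 inside the log, which keeps it non-negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical candidate texts and citation context.
candidates = [
    "citation recommendation with term functions",
    "deep clustering of research topics",
    "keyphrase extraction for chinese abstracts",
]
scores = bm25_scores("citation recommendation", candidates)
best = max(range(len(candidates)), key=scores.__getitem__)
print(candidates[best])
```

Candidates are ranked by score, and the top-ranked ones are returned as recommended citations for the given context.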
The experiments show that the term function-based methods outperform the baseline methods in terms of recall, precision, and F1-score, demonstrating that term functions are useful in identifying valuable citations.
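For reference, the three measures can be computed for a single recommendation list as in the sketch below. The function name `precision_recall_f1` and the paper identifiers are hypothetical, and the set-based definition is an assumption; the paper's exact evaluation protocol (for example, any cutoff on the list length) is not given in this abstract.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall, and F1 for one recommendation list."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical identifiers: four recommended papers, three truly cited ones.
p, r, f1 = precision_recall_f1(["p1", "p2", "p3", "p4"], ["p2", "p4", "p7"])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```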
The dataset is insufficient due to the complexity of annotating citation functions for paragraphs in the related studies section. More recent deep learning models should be applied to further validate the proposed approach.
The citation recommendation strategy can be helpful for valuable citation discovery, semantic scientific retrieval, and automatic literature review generation.
The proposed citation function-based citation recommendation can generate intuitive explanations of the results for users, improving the transparency, persuasiveness, and effectiveness of recommender systems.
- Citation recommendation
- Term function
- Citation context
- Related studies section
- Open Access
Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering
Page range: 99 - 122
Detecting research fields or topics and understanding their dynamics help the scientific community make decisions about the establishment of scientific fields. They also support better collaboration with governments and businesses. This study aims to investigate the development of research fields over time, translating it into a topic detection problem.
To achieve the objectives, we propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches are utilized to transform documents into vector-based representations. The proposed method is evaluated by comparing it with a combination of different embedding and clustering approaches and the classical topic modeling algorithms (i.e., LDA) against a benchmark dataset. A case study is also conducted to explore the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in related AI publications.
Evaluating the proposed method with clustering performance indicators shows that it outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, taking advantage of a keyword extraction method for cluster tagging and labeling to demonstrate the context of the topics.
We noticed that it is not possible to generalize one solution for all downstream tasks. Hence, the solutions must be fine-tuned or optimized for each task and even each dataset. In addition, the interpretation of cluster labels can be subjective and vary with the readers’ opinions. It is also very difficult to evaluate the labeling techniques, which further limits the explanation of the clusters.
As demonstrated in the case study, we show with a real-world example how the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. This helps the scientific community and all related organizations analyze the fields quickly and effectively by establishing and explaining the topics.
In this study, we introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction. We also use a concept extraction method as a labeling approach in this study. The effectiveness of the method has been evaluated in a case study of AI publications, where we analyze the AI topics during the past three decades.
- Dynamics of science
- Science mapping
- Document clustering
- Artificial intelligence
- Deep learning
- Open Access
Page range: 123 - 145
The interdisciplinary nature and rapid development of the Semantic Web led to the mass publication of RDF data in a large number of widely accepted serialization formats, thus creating the need for RDF data processing with specific purposes. The paper reports on an assessment of the chief RDF data endpoint challenges and introduces the RDFAdaptor, a set of plugins for RDF data processing that covers the whole life-cycle with high efficiency.
The RDFAdaptor is built on the prominent ETL tool Pentaho Data Integration, which provides a user-friendly and intuitive interface and allows connections to various data sources and formats. It reuses the Java framework RDF4J as middleware to access data repositories, SPARQL endpoints, and all leading RDF database solutions with SPARQL 1.1 support. It offers various configuration templates for multi-scenario applications and helps extend data processing tasks in other services or tools to complement missing functions.
The proposed comprehensive RDF ETL solution, RDFAdaptor, provides an easy-to-use and intuitive interface, supports data integration and federation over multi-source heterogeneous repositories or endpoints, and manages linked data in hybrid storage mode.
The plugin set can support several application scenarios of RDF data processing, but error detection and checking, as well as interaction with other graph repositories, remain to be improved.
The plugin set provides a user interface and configuration templates that make it usable in various applications of RDF data generation, multi-format data conversion, remote RDF data migration, and RDF graph update in the semantic query process.
This is the first attempt to develop components, rather than systems, that can extract, consolidate, and store RDF data on the basis of an ecologically mature data warehousing environment.
- RDF ETL solution
- RDF data processing
- Linked data
- Portable plugins
- Open Access
Page range: 146 - 163
This study aims to construct new models and methods of academic genealogy research based on bibliometrics.
This study proposes an academic influence scale for academic genealogy, and introduces the
The two-dimensional evaluation system can characterize the development and evolution of the academic genealogy, compare the academic influences of different genealogies, and evaluate individuals’ contributions to the inheritance and evolution of the academic genealogy. Individual academic influence is mainly indicated by the
The two-dimensional evaluation system for the academic genealogy can better demonstrate the reproduction and the academic inheritance ability of a genealogy.
It is not comprehensive to only use the
This study constructs new models and methods of academic genealogy research based on bibliometrics, which improves the quantitative assessment of academic genealogy and enriches its research and evaluation methods.
- Academic genealogy
- Evaluation system
- Academic influence
- Academic fecundity
- Liu Tungsheng
- Open Access
Page range: 164 - 165