Journal and Issue

AHEAD OF PRINT

Volume 7 (2022): Issue 3 (August 2022)

Volume 7 (2022): Issue 2 (April 2022)

Volume 7 (2022): Issue 1 (February 2022)

Volume 6 (2021): Issue 4 (November 2021)

Volume 6 (2021): Issue 3 (June 2021)

Volume 6 (2021): Issue 2 (April 2021)

Volume 6 (2021): Issue 1 (February 2021)

Volume 5 (2020): Issue 4 (November 2020)

Volume 5 (2020): Issue 3 (August 2020)

Volume 5 (2020): Issue 2 (April 2020)

Volume 5 (2020): Issue 1 (February 2020)

Volume 4 (2019): Issue 4 (December 2019)

Volume 4 (2019): Issue 3 (August 2019)

Volume 4 (2019): Issue 2 (May 2019)

Volume 4 (2019): Issue 1 (February 2019)

Volume 3 (2018): Issue 4 (November 2018)

Volume 3 (2018): Issue 3 (August 2018)

Volume 3 (2018): Issue 2 (May 2018)

Volume 3 (2018): Issue 1 (February 2018)

Volume 2 (2017): Issue 4 (December 2017)

Volume 2 (2017): Issue 3 (August 2017)

Volume 2 (2017): Issue 2 (May 2017)

Volume 2 (2017): Issue 1 (February 2017)

Volume 1 (2016): Issue 4 (November 2016)

Volume 1 (2016): Issue 3 (August 2016)

Volume 1 (2016): Issue 2 (May 2016)

Volume 1 (2016): Issue 1 (February 2016)

Journal Details
Format
Journal
eISSN
2543-683X
First Published
30 Mar 2017
Publication timeframe
4 times per year
Languages
English


Volume 6 (2021): Issue 3 (June 2021)


9 Articles

Guest Editorial

Research Paper

Open Access

Sentence, Phrase, and Triple Annotations to Build a Knowledge Graph of Natural Language Processing Contributions—A Trial Dataset

Published online: 09 May 2021
Pages: 6 - 34

Abstract

Purpose

This work aims to normalize the NlpContributions scheme (henceforward, NlpContributionGraph) to structure, directly from article sentences, the contribution information in Natural Language Processing (NLP) scholarly articles via a two-stage annotation methodology: 1) pilot stage—to define the scheme (described in prior work); and 2) adjudication stage—to normalize the graphing model (the focus of this paper).

Design/methodology/approach

We re-annotate the contribution-pertinent information across 50 previously annotated NLP scholarly articles in terms of a data pipeline comprising contribution-centered sentences, phrases, and triple statements. Specifically, care was taken in the adjudication annotation stage to reduce annotation noise while formulating the guidelines for our proposed novel NLP contributions structuring and graphing scheme.

Findings

Applying NlpContributionGraph to the 50 articles resulted in a dataset of 900 contribution-focused sentences, 4,702 contribution-information-centered phrases, and 2,980 surface-structured triples. The intra-annotation agreement between the first and second stages, in terms of F1-score, was 67.92% for sentences, 41.82% for phrases, and 22.31% for triple statements, indicating that the annotation decision variance grows with the granularity of the information.
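
For illustration, agreement between two annotation passes can be computed as an exact-match F1, treating the first pass as the reference. The sketch below is not the paper's code; it assumes annotation units are hashable items (sentences, phrases, or subject-predicate-object triples), and the stricter the unit, the lower the expected overlap, which matches the reported drop from 67.92% to 22.31%:

    def agreement_f1(first_pass, second_pass):
        """Exact-match F1 between two annotation passes."""
        first, second = set(first_pass), set(second_pass)
        if not first or not second:
            return 0.0
        overlap = len(first & second)
        precision = overlap / len(second)
        recall = overlap / len(first)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Toy triples: one of two matches exactly, so F1 = 0.5.
    pilot = {("model", "achieves", "SOTA"), ("dataset", "has", "50 articles")}
    adjudicated = {("model", "achieves", "SOTA"), ("dataset", "contains", "50 articles")}
    print(agreement_f1(pilot, adjudicated))  # 0.5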

Research limitations

NlpContributionGraph has a limited scope for structuring scholarly contributions compared with STEM (Science, Technology, Engineering, and Medicine) scholarly knowledge at large. Further, the annotation scheme in this work is designed by intra-annotator consensus alone: a single annotator first annotated the data to propose the initial scheme, and the same annotator then re-annotated the data to normalize the annotations in an adjudication stage. The ultimate goal of this work, however, is a standardized retrospective model for capturing NLP contributions from scholarly articles, which would entail a larger initiative enlisting multiple annotators to accommodate different worldviews within a “single” set of structures and relationships as the final scheme. Given that this is the first proposal of the scheme, and given the complexity of the annotation task within a realistic timeframe, our intra-annotation procedure is well suited. Nevertheless, the model proposed in this work is presently limited because it does not incorporate multiple annotator worldviews; addressing this is planned as future work to produce a robust model.

Practical implications

We demonstrate NlpContributionGraph data integrated into the Open Research Knowledge Graph (ORKG), a next-generation KG-based digital library with intelligent computations enabled over structured scholarly knowledge, as a viable aid to assist researchers in their day-to-day tasks.

Originality/value

NlpContributionGraph is a novel scheme for annotating research contributions from NLP articles and integrating them into a knowledge graph, which, to the best of our knowledge, does not otherwise exist in the community. Furthermore, our quantitative evaluations over the two-stage annotation tasks offer insights into task difficulty.

Keywords

  • Scholarly knowledge graphs
  • Open science graphs
  • Knowledge representation
  • Natural language processing
  • Semantic publishing
Open Access

Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling

Published online: 02 Mar 2021
Pages: 35 - 57

Abstract

Purpose

Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of a sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.

Design/methodology/approach

We regard AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, containing 100,000 abstracts as the training set, 6,000 as the development set, and 3,094 as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (BiLSTM), and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.
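
To make the character-level formulation concrete, the following sketch (illustrative only, not the paper's code) converts a text and its gold keyphrases into character-level IOB tags; because each Chinese character is its own token, no word segmenter is involved and its errors cannot propagate:

    def char_iob_tags(text, keyphrases):
        """Assign B/I tags to characters inside a keyphrase, O elsewhere."""
        tags = ["O"] * len(text)
        for phrase in keyphrases:
            start = text.find(phrase)
            while start != -1:
                tags[start] = "B"
                for i in range(start + 1, start + len(phrase)):
                    tags[i] = "I"
                start = text.find(phrase, start + len(phrase))
        return list(zip(text, tags))

    # Hypothetical example: the keyphrase "糖尿病" (diabetes) inside a sentence.
    print(char_iob_tags("研究糖尿病患者", ["糖尿病"]))
    # [('研','O'), ('究','O'), ('糖','B'), ('尿','I'), ('病','I'), ('患','O'), ('者','O')]

A sequence labeler, such as BERT with a token classification head, is then trained to predict one tag per character.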

Findings

Compared with character-level BiLSTM-CRF, the best baseline model with an F1-score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1-score of 59.80%, a 9.64% absolute improvement.

Research limitations

We only consider the automatic keyphrase extraction task, rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.

Practical implications

We make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.

Originality/value

By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task given the general trend toward pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.

Keywords

  • Automatic keyphrase extraction
  • Character-level sequence labeling
  • Pretrained language model
  • Scientific Chinese medical abstracts
Open Access

Content Characteristics of Knowledge Integration in the eHealth Field: An Analysis Based on Citation Contexts

Published online: 02 Mar 2021
Pages: 58 - 74

Abstract

Purpose

This study attempts to disclose the characteristics of knowledge integration in an interdisciplinary field by looking into the content aspect of knowledge.

Design/methodology/approach

The eHealth field was chosen for the case study. Associated knowledge phrases (AKPs) shared between citing papers and their references were extracted from the citation contexts of eHealth papers by applying a stem-matching method. A classification schema that considers the functions of knowledge in the domain was proposed to categorize the identified AKPs. The source disciplines of each knowledge type were analyzed, and quantitative indicators and a co-occurrence analysis were applied to disclose the integration patterns of different knowledge types.
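
The stem-matching idea can be sketched as follows (a deliberately simplified version assuming whitespace tokenization and Porter stemming; the paper's exact matching procedure may differ):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def stem_set(text):
        """Lowercase, split on whitespace, and stem each token."""
        return {stemmer.stem(token) for token in text.lower().split()}

    def shared_stems(citing_context, cited_text):
        """Stems common to a citation context and the cited paper's text,
        a crude proxy for an associated knowledge phrase (AKP) match."""
        return stem_set(citing_context) & stem_set(cited_text)

    # Morphological variants match through their shared stems,
    # e.g. "improved" and "improving" both reduce to "improv".
    print(shared_stems(
        "telemonitoring improved medication adherence",
        "improving medication adherence through telemonitoring",
    ))

As the limitations below note, this style of matching misses synonymous phrases whose words share no common stem.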

Findings

The annotated AKPs evidence the major disciplines supplying each type of knowledge. Different knowledge types have remarkably different integration patterns in terms of knowledge amount, the breadth of source disciplines, and the integration time lag. We also find several frequent co-occurrence patterns of different knowledge types.

Research limitations

The collected articles of the field are limited to two leading open access journals. The stem-matching method used to extract AKPs could not identify phrases with the same meaning but expressed in words with different stems. The Research Subject type dominates the recognized AKPs, which calls for an improved classification schema to support better knowledge integration analysis on knowledge units.

Practical implications

The methodology proposed in this paper sheds new light on the knowledge integration characteristics of an interdisciplinary field from a content perspective. The findings have practical implications for the future development of research strategies in eHealth and for policies on interdisciplinary research.

Originality/value

This study proposes a new methodology to explore the content characteristics of knowledge integration in an interdisciplinary field.

Keywords

  • Knowledge integration
  • Interdisciplinary research
  • Citation contexts
  • eHealth
  • Knowledge content
Open Access

A New Citation Recommendation Strategy Based on Term Functions in Related Studies Section

Published online: 09 May 2021
Pages: 75 - 98

Abstract

Purpose

Researchers frequently encounter the following problems when writing scientific articles: (1) selecting appropriate citations to support the research idea is challenging; (2) the literature review is not conducted extensively, which leads to working on a research problem that others have already addressed well. This study focuses on citation recommendation in the related studies section by applying the term function of a citation context, potentially improving the efficiency of writing a literature review.

Design/methodology/approach

We present nine term functions: three newly created and six identified from the existing literature. Using these term functions as labels, we annotate 531 research papers on three topics to evaluate our proposed recommendation strategy. BM25 and Word2vec with a vector space model (VSM) are implemented as the baseline models for recommendation, and the term function information is then applied to enhance performance.
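
For reference, a classic BM25 scorer of the kind used as a baseline can be written in a few lines; this is the standard textbook formulation (with the common +1 smoothed IDF), not necessarily the exact configuration used in the paper:

    import math
    from collections import Counter

    def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
        """Score each tokenized document in docs against the query."""
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter()
        for d in docs:
            df.update(set(d))  # document frequency of each term
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for t in query_terms:
                if t not in tf:
                    continue
                idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
                norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                score += idf * tf[t] * (k1 + 1) / norm
            scores.append(score)
        return scores

    # Toy candidate pool: the first context matches the query terms.
    docs = [["citation", "recommendation", "using", "term", "functions"],
            ["deep", "learning", "for", "image", "segmentation"]]
    print(bm25_scores(["citation", "recommendation"], docs))

The term-function enhancement would then restrict or re-weight candidates whose contexts serve the same function; the exact combination scheme is the paper's contribution and is not reproduced here.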

Findings

The experiments show that the term function-based methods outperform the baseline methods in terms of recall, precision, and F1-score, demonstrating that term functions are useful for identifying valuable citations.

Research limitations

The dataset is limited in size due to the complexity of annotating citation functions for paragraphs in the related studies section. More recent deep learning models should be applied to further validate the proposed approach.

Practical implications

The citation recommendation strategy can be helpful for valuable citation discovery, semantic scientific retrieval, and automatic literature review generation.

Originality/value

The proposed citation function-based citation recommendation can generate intuitive explanations of the results for users, improving the transparency, persuasiveness, and effectiveness of recommender systems.

Keywords

  • Citation recommendation
  • Term function
  • Citation context
  • Related studies section
  • BM25
  • Word2vec
Open Access

Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Published online: 18 Jun 2021
Pages: 99 - 122

Abstract

Purpose

Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions regarding the establishment of scientific fields, and also supports better collaboration with governments and businesses. This study aims to investigate the development of research fields over time, translating it into a topic detection problem.

Design/methodology/approach

To achieve these objectives, we propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches are utilized to transform documents into vector-based representations. The proposed method is evaluated against a benchmark dataset by comparing it with combinations of different embedding and clustering approaches and with a classical topic modeling algorithm (LDA). A case study is also conducted exploring the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in related AI publications.
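
As a simplified stand-in for the embedding-plus-clustering pipeline (the paper's method is a modified deep embedding clustering; this sketch pairs gensim's Doc2Vec with plain KMeans purely for illustration):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    abstracts = [
        "neural networks for image recognition",
        "convolutional models classify images",
        "symbolic planning for autonomous agents",
        "logic based reasoning and planning",
    ]

    # Embed each abstract as a dense vector.
    corpus = [TaggedDocument(words=a.split(), tags=[i])
              for i, a in enumerate(abstracts)]
    model = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)
    vectors = [model.dv[i] for i in range(len(abstracts))]

    # Cluster the embeddings; cluster ids stand in for detected topics,
    # which would then be tagged with extracted keywords.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(labels)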

Findings

Evaluating the performance of the proposed method using clustering performance indicators shows that it outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, taking advantage of a keyword extraction method for cluster tagging and labeling to demonstrate the context of the topics.

Research limitations

We observed that no single solution generalizes to all downstream tasks; solutions must be fine-tuned or optimized for each task and even each dataset. In addition, the interpretation of cluster labels can be subjective and vary with readers’ opinions. Labeling techniques are also very difficult to evaluate, which further limits the explanation of the clusters.

Practical implications

As demonstrated in the case study, the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents in a real-world example. This helps the scientific community and all related organizations analyze fields quickly and effectively by establishing and explaining the topics.

Originality/value

In this study, we introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction, and use a concept extraction method as a labeling approach. The effectiveness of the method has been evaluated in a case study of AI publications, where we analyze AI topics over the past three decades.

Keywords

  • Dynamics of science
  • Science mapping
  • Document clustering
  • Artificial intelligence
  • Deep learning
Open Access

RDFAdaptor: Efficient ETL Plugins for RDF Data Process

Published online: 14 Apr 2021
Pages: 123 - 145

Abstract

Purpose

The interdisciplinary nature and rapid development of the Semantic Web led to the mass publication of RDF data in a large number of widely accepted serialization formats, creating the need for purpose-specific RDF data processing. This paper reports an assessment of the chief challenges around RDF data endpoints and introduces RDFAdaptor, a set of plugins for RDF data processing that covers the whole life-cycle with high efficiency.

Design/methodology/approach

RDFAdaptor is built on the prominent ETL tool Pentaho Data Integration, which provides a user-friendly, intuitive interface and allows connecting to various data sources and formats, and reuses the Java framework RDF4J as middleware to access data repositories, SPARQL endpoints, and all leading RDF database solutions with SPARQL 1.1 support. It supports effortless services with various configuration templates in multi-scenario applications, and helps extend data processing tasks in other services or tools to complement missing functions.
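
RDF4J is a Java framework; as a rough Python analogue of the load-and-query role it plays here, the sketch below uses rdflib (a deliberate substitution for illustration, not part of RDFAdaptor):

    from rdflib import Graph

    # Parse RDF serialized as Turtle; rdflib also reads RDF/XML,
    # N-Triples, JSON-LD, and other common serializations.
    turtle = """
    @prefix ex: <http://example.org/> .
    ex:RDFAdaptor ex:builtOn ex:PentahoDataIntegration ;
                  ex:uses   ex:RDF4J .
    """
    g = Graph()
    g.parse(data=turtle, format="turtle")

    # Query the loaded graph with SPARQL.
    q = "SELECT ?o WHERE { ex:RDFAdaptor ex:uses ?o }"
    for row in g.query(q, initNs={"ex": "http://example.org/"}):
        print(row.o)  # http://example.org/RDF4J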

Findings

The proposed comprehensive RDF ETL solution, RDFAdaptor, provides an easy-to-use and intuitive interface, supports data integration and federation over multi-source heterogeneous repositories or endpoints, and manages linked data in a hybrid storage mode.

Research limitations

The plugin set supports several RDF data processing scenarios, but error detection and checking, as well as interaction with other graph repositories, remain to be improved.

Practical implications

The plugin set provides a user interface and configuration templates that make it usable in various applications: RDF data generation, multi-format data conversion, remote RDF data migration, and RDF graph updates during semantic query processing.

Originality/value

This is the first attempt to develop components, rather than complete systems, that extract, consolidate, and store RDF data on the basis of a mature data-warehousing ecosystem.

Keywords

  • RDF ETL solution
  • RDF data processing
  • Linked data
  • Portable plugins
Open Access

Bibliometric-based Study of Scientist Academic Genealogy

Published online: 14 Apr 2021
Pages: 146 - 163

Abstract

Purpose

This study aims to construct new models and methods of academic genealogy research based on bibliometrics.

Design/methodology/approach

This study proposes an academic influence scale for academic genealogy, and introduces the w index for bibliometric scaling of the academic genealogy. We then construct a two-dimensional (academic fecundity versus academic influence) evaluation system of academic genealogy, and validate it on the academic genealogy of a famous Chinese geologist.

Findings

The two-dimensional evaluation system can characterize the development and evolution of an academic genealogy, compare the academic influence of different genealogies, and evaluate individuals’ contributions to the inheritance and evolution of the genealogy. Individual academic influence is mainly indicated by the w index (an improved h index), which avoids the repeated measurement and distortion of results that arise within an academic genealogy.
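
For orientation, the classic h index that the w index refines can be computed as below; the abstract does not give the w index formula, so only the baseline h index is sketched:

    def h_index(citation_counts):
        """Largest h such that h papers have at least h citations each.
        The paper's w index is described as an improvement of this;
        its exact definition is not given in the abstract."""
        counts = sorted(citation_counts, reverse=True)
        h = 0
        for rank, cites in enumerate(counts, start=1):
            if cites >= rank:
                h = rank
            else:
                break
        return h

    print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with >= 4 citations each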

Practical implications

The two-dimensional evaluation system for the academic genealogy can better demonstrate the reproductive capacity and academic inheritance ability of a genealogy.

Research limitations

Using the w index alone to characterize academic influence is not comprehensive; scholars’ academic awards, part-time academic appointments, and so on should also be included. In future work, we will integrate these factors into the w index to comprehensively reflect scholars’ individual academic influence.

Originality/value

This study constructs new models and methods of academic genealogy research based on bibliometrics, which improves the quantitative assessment of academic genealogy and enriches its research and evaluation methods.

Keywords

  • Academic genealogy
  • Evaluation system
  • Academic influence
  • Academic fecundity
  • Liu Tungsheng

Announcement

Open Access

New Editorial Board Announced for Journal of Data and Information Science

Published online: 18 May 2021
Pages: 164 - 165

