Open Access

A Metric Approach to Hot Topics in Biomedicine via Keyword Co-occurrence



Introduction

Since 2000, biomedicine has made great progress at both the scientific and the technical level. From the annual “Breakthrough of the Year” selections in Science, we identify three important hot topics in biomedicine.

The first topic is the genome editing technique CRISPR/Cas (Clustered Regularly Interspaced Short Palindromic Repeats/CRISPR-associated system). As a genome editing method, CRISPR/Cas was the top breakthrough in 2015. The CRISPR/Cas system is a prokaryotic immune system that confers resistance to foreign genetic elements, such as those present within plasmids and phages, and thus provides a form of acquired immunity. Originally, CRISPR referred to segments of prokaryotic DNA containing short, repetitive base sequences in bacteria and archaea (Horvath & Barrangou, 2010). Later, the group of Jennifer Doudna introduced CRISPR/Cas9 as a tool to cut DNA guided by crRNAs (Jinek et al., 2012), and the group of Feng Zhang then applied CRISPR/Cas9 in eukaryotic cells (Cong et al., 2013). Ma et al. (2017) described the correction of a pathogenic gene mutation in human embryos with CRISPR/Cas9. Cox et al. (2017) showed that RNA can be edited with CRISPR-Cas13 to correct disease-relevant human mutations and proposed an RNA-editing platform named REPAIR. Another nuclease, Cpf1, was discovered in 2015, making CRISPR/Cpf1 a further CRISPR system (Zetsche et al., 2015). Yan et al. (2019) systematically discovered additional subtypes of type V CRISPR-Cas systems. The diversity, modularity, and efficacy of CRISPR-Cas systems are driving a biotechnological revolution, and CRISPR-Cas guides the future of genetic engineering (Knott & Doudna, 2018).

The second topic is the stem cell technique of induced pluripotent stem (iPS) cells. iPS cells are a type of pluripotent stem cell, and the technique was selected among the breakthroughs in both 2012 and 2016. The technique was pioneered by Shinya Yamanaka’s lab in Kyoto, which showed in 2006 that introducing four specific genes encoding transcription factors could convert adult cells into pluripotent stem cells (Takahashi, 2006). For this work Yamanaka was awarded the 2012 Nobel Prize, together with Sir John Gurdon, for the discovery that mature cells can be reprogrammed to become pluripotent. Since then, researchers have found a variety of improved induction methods (Anokye-Danso et al., 2011; Ma, Kong, & Zhu, 2017). Meanwhile, researchers have begun to introduce disease-associated mutations into iPS cells through gene editing. Paquet et al. (2016) generated cells with precise combinations of Alzheimer’s-associated mutations by introducing specific point mutations into iPS cells using CRISPR. iPS cells have broad application prospects in drug discovery and disease modelling (Scudellari, 2016).

The third topic is synthetic biology and artificial life, an interdisciplinary branch of biology and engineering that was selected as a breakthrough in 2010. Synthetic biologists come in two broad classes. One uses unnatural molecules to reproduce emergent behaviors from natural biology, with the goal of creating artificial life. The other seeks interchangeable parts from natural biology to assemble into systems that function unnaturally (Benner & Sismour, 2005). Gibson et al. (2010) reported the creation of a bacterial cell controlled by a chemically synthesized genome. Esvelt and Wang (2013) argued that genome-modification technologies, such as CRISPR/Cas, enable the rational engineering and perturbation of biological systems. Cameron, Bashor, and Collins (2014) reviewed the history of synthetic biology and pointed out that the field has charted many notable achievements and is poised to transform biotechnology and medicine.

In this article, based on biomedical literature data, we examine the information distribution and structure of these three hot topics by analyzing keyword co-occurrence networks and visualizing their cores, in order to reveal their research status and interactions and to support biomedical development.

Methodology

We analyze core keyword co-occurrence networks in this study. The methods focus on network analysis (Friedkin, 1991; Newman, 2004; Wolfe, 1997) and information visualization (Chen, 2006), and the data come from a scientific literature database. For information visualization, VOSviewer (Eck & Waltman, 2010) is used to draw the network maps. For network analysis, Gephi (Bastian, Heymann, & Jacomy, 2009) and UCINET (Borgatti, Everett, & Freeman, 2002) are used to compute network parameters. In addition, the open software SATI (Liu & Ye, 2012), Excel, and R are used for data processing.

Methods

In this article, we select the core keywords of the three hot topics in biomedicine by using the h-index. Hirsch (2005) proposed the h-index, defined as the number h of papers that each have at least h citations, as a useful index to characterize the scientific output of a researcher. If all papers published by one researcher are arranged in descending order of citation frequency and c_i denotes the number of citations of the i-th paper, the h-index can be written as

$$h=\max \{i : {c}_{i}\ge i\}$$
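The definition above can be checked with a few lines of code. The following is a minimal sketch in Python (the study itself reports using R and its own program, so the function name `h_index` and the sample citation counts are illustrative assumptions only).

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations."""
    ranked = sorted(citations, reverse=True)  # c_1 >= c_2 >= ...
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i      # the i-th ranked paper still has at least i citations
        else:
            break      # counts are descending, so no later rank can qualify
    return h


# Made-up citation counts for five papers
print(h_index([10, 8, 5, 4, 3]))  # 4
```

With the counts 10, 8, 5, 4, 3, the fourth paper has 4 citations but the fifth has only 3, so h = 4.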

Because the h-index takes into account both the number and the impact of the papers published by a researcher, it overcomes the shortcomings of earlier one-dimensional indicators, such as the number of papers or the number of citations (Bornmann & Daniel, 2005). The h-index can therefore evaluate the academic achievements of researchers more objectively, and it has attracted widespread attention in scientific circles (Bornmann & Daniel, 2007). Braun, Glänzel, and Schubert (2006) were the first to propose applying the h-index to evaluate the impact of journals, and regarded it as a powerful supplement to the journal impact factor. Banks (2006) applied the h-index to identify the main research topics of compounds. Since then, the h-index has been used in many research areas. Its definition has been extended as follows: an academic information source has h-index h if at least h of its articles have each been cited at least h times (Ye, 2014).

Keyword co-occurrence is considered a main means of identifying research themes (Wang et al., 2017). A keyword co-occurrence network reflects the knowledge structure and knowledge kernel of a field and displays the relationships between keywords (Su & Lee, 2010). Lee and Su (2010) held that research hotspots can be evaluated by the centrality of the nodes in a keyword co-occurrence network. In such a network, nodes represent keywords and edges represent co-occurrence relationships between them. By applying social network analysis to keyword co-occurrence networks, we can analyze the knowledge structure and hotspots of a research field.
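To make the node and edge construction concrete, the sketch below counts, for every unordered pair of keywords, the number of papers in which both appear. It is an illustrative Python example under assumed inputs (the function name `cooccurrence_edges` and the three toy papers are hypothetical), not the R workflow used in the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(papers):
    """Count how many papers each unordered keyword pair shares."""
    edges = Counter()
    for keywords in papers:
        # Deduplicate within a paper and sort so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges


# Toy keyword lists, one list per paper (made up for illustration)
papers = [
    ["CRISPR", "GENOME EDITING", "ZEBRAFISH"],
    ["CRISPR", "GENOME EDITING"],
    ["CRISPR", "IPSC"],
]
for pair, weight in sorted(cooccurrence_edges(papers).items()):
    print(pair, weight)
```

Each key of the resulting counter is an edge of the network and each value is its weight; the keywords themselves are the nodes.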

This article evaluates the influence of nodes in the network based on degree centrality, betweenness centrality, and eigenvector centrality. Degree centrality reflects the importance of a node through its number of direct connections: the higher the degree centrality of a node, the more likely the keyword it represents is a research hotspot. Betweenness centrality measures the ability of one keyword to affect the other keywords that appear together with it. Eigenvector centrality takes into account both the number of adjacent nodes and the influence of those adjacent nodes.

(1) Degree Centrality

In network analysis, a centrality index is a real-valued function on the vertices of a graph whose values are expected to provide a ranking that identifies the most important nodes. For a given graph G = (V, E) with vertex set V and edge set E, let A = (a_{u,v}) be the adjacency matrix, i.e. a_{u,v} = 1 if vertex u is linked to vertex v and a_{u,v} = 0 otherwise. The degree centrality score of vertex u can be defined as

$${{x}_{u}}=\sum\limits_{v\in V}{{a}_{u,v}}$$

The relative centrality of vertex u, in which each neighbour v is weighted by its own score x_v and λ is a constant, can be defined as

$${{x}_{u}}=\frac{1}{\lambda }\sum\limits_{v\in V}{{a}_{u,v}}{{x}_{v}}$$
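As a small illustration, the degree of each vertex is the corresponding row sum of the adjacency matrix. The Python sketch below assumes a symmetric 0/1 matrix given as nested lists. The second function divides by n − 1, a common normalisation that is an assumption here and is not taken from the article.

```python
def degree_centrality(adj):
    """Row sums of a 0/1 adjacency matrix: x_u = sum over v of a[u][v]."""
    return [sum(row) for row in adj]


def normalised_degree(adj):
    """Degree divided by n - 1 (assumed convention, not from the article)."""
    n = len(adj)
    if n <= 1:
        return [0.0] * n
    return [d / (n - 1) for d in degree_centrality(adj)]


# Star graph: vertex 0 is linked to vertices 1, 2 and 3
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(degree_centrality(star))  # [3, 1, 1, 1]
```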

(2) Betweenness Centrality

Betweenness centrality is based on the shortest paths in a network and is used to evaluate the role of nodes in information integration in social networks. The higher the betweenness centrality of a node, the greater its role in information integration. Let G_{st} denote the number of shortest paths from vertex s to vertex t, and G_{st}(v) the number of those paths that pass through vertex v. The betweenness centrality of vertex v can be defined as follows:

$${{x}_{v}}=\sum\limits_{s\ne v\ne t}\frac{{{G}_{st}}\left( v \right)}{{{G}_{st}}}$$
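The sum over shortest paths can be computed without enumerating every path. The sketch below uses Brandes' algorithm for an unweighted, undirected graph. This is a standard technique chosen here for illustration and is not necessarily what Gephi or UCINET run internally. The graph is assumed to be a dictionary mapping each node to the set of its neighbours, and scores are left unnormalised.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalised betweenness for an undirected graph {node: set(neighbours)}."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording the predecessors of each vertex on those paths.
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        preds = {v: [] for v in graph}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:              # w is seen for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, starting from the farthest vertices.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair {s, t} was visited from both ends, so halve.
    return {v: b / 2 for v, b in score.items()}


# Path A - B - C: only B lies between another pair of vertices
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(betweenness_centrality(path)["B"])  # 1.0
```

On a four-vertex cycle, each vertex lies on one of the two shortest paths between its two neighbours, so every vertex scores 0.5.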

(3) Eigenvector Centrality

Since the entries of the adjacency matrix are non-negative, a connected network has a unique largest eigenvalue, which is real and positive. The eigenvector associated with this largest eigenvalue gives the desired centrality measure, called eigenvector centrality or eigencentrality, which reveals the core importance of a vertex in a network. For an undirected network, A is symmetric, so it is diagonalizable and its eigenvectors are orthogonal. The centrality of a vertex is proportional to the sum of the centralities of the vertices it is connected to. The eigenvector centrality x can be described in two equivalent ways, as a matrix equation and as a sum:

$$Ax=\lambda x,\qquad \lambda {{x}_{i}}=\sum\nolimits_{j=1}^{n}{{{a}_{ij}}{{x}_{j}}},\quad i=1,\ldots ,n$$

λ is the maximum eigenvalue of A and n is the number of vertices.
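One simple way to approximate x and λ is power iteration. The Python sketch below is illustrative only and assumes a non-empty, symmetric 0/1 adjacency matrix of a connected graph. It multiplies by A + I rather than A, a shift that leaves the eigenvectors unchanged but avoids oscillation on bipartite graphs. The shift and the function name are choices made for this example, not part of the article.

```python
def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Approximate the leading eigenpair of A; x is scaled so that max(x) == 1."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(max_iter):
        # One step of power iteration with the shifted matrix A + I.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        top = max(y)
        y = [value / top for value in y]
        converged = max(abs(a - b) for a, b in zip(x, y)) < tol
        x = y
        if converged:
            break
    # Recover lambda from (A x)_i = lambda * x_i at the largest component.
    i_max = max(range(n), key=lambda i: x[i])
    lam = sum(adj[i_max][j] * x[j] for j in range(n)) / x[i_max]
    return lam, x


# Star graph: vertex 0 is linked to vertices 1, 2 and 3
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
lam, x = eigenvector_centrality(star)
print(round(lam, 4), [round(v, 4) for v in x])
```

For this star graph the largest eigenvalue is the square root of 3, and the centre scores higher than each leaf by that same factor.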

Data

In our empirical study, we searched the Web of Science (WoS) database for articles published from 1900 to 2018. The data were collected in January 2019. The retrieval strategies were as follows:

(1) H1-CRISPR

TS=“clustered regularly interspaced short palindromic repeats” OR CRISPR

(2) H2-iPS cell

TS=“induce* pluripotent stem cell” OR “induce* pluripotent stem cells” OR “IPS cell” OR “IPS cells”

(3) H3-Synthetic biology

TS=“synthetic biology” OR “gene circuit” OR “gene circuits” OR “genetic circuit” OR “genetic circuits” OR “genetic device” OR “genetic devices” OR “synthetic life” OR “synthetic lives” OR “synthetic tissue” OR “synthetic tissues” OR “synthetic cell” OR “synthetic cells” OR “synthetic genome” OR “synthetic genomes” OR “synthetic gene” OR “synthetic genes” OR “minimal genome” OR “minimal genomes” OR “biology, synthetic”

The retrieved data are used in the next section to find core keywords and to build keyword co-occurrence networks.

Keyword co-occurrence results

High-frequency keywords reflect research hotspots and research directions to some extent, but a simple ranking of keywords by frequency conveys limited information. The h-index reflects both how often a keyword occurs and how often the papers containing it are cited. In this article, the 100 keywords with the highest h-index are defined as core keywords. Keyword co-occurrence relationships reflect the internal connections between keywords. In this section, we construct the co-occurrence matrix and co-word network with the help of R. We then analyze the research hotspots of the three hot topics in biomedicine using social network analysis.
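The selection of core keywords can be sketched as follows: collect the citation counts of all papers carrying a keyword, compute the h-index of that list, and keep the top-ranked keywords. This Python example is a hedged illustration; the function names, the tie-breaking rule (alphabetical), and the toy records are assumptions, and the study's own R program may differ.

```python
from collections import defaultdict


def keyword_h_indices(papers):
    """papers: iterable of (keywords, citation_count). Returns {keyword: h}."""
    cites = defaultdict(list)
    for keywords, citations in papers:
        for kw in set(keywords):           # count each paper once per keyword
            cites[kw].append(citations)
    result = {}
    for kw, counts in cites.items():
        counts.sort(reverse=True)
        # With descending counts, c_i >= i holds exactly for ranks 1..h.
        result[kw] = sum(1 for i, c in enumerate(counts, start=1) if c >= i)
    return result


def core_keywords(papers, top_n=100):
    """Keywords ranked by h-index (ties broken alphabetically), top_n kept."""
    h = keyword_h_indices(papers)
    return sorted(h, key=lambda kw: (-h[kw], kw))[:top_n]


# Made-up records: (author keywords, times cited)
records = [
    (["CRISPR", "GENOME EDITING"], 12),
    (["CRISPR", "ZEBRAFISH"], 5),
    (["CRISPR"], 2),
    (["GENOME EDITING"], 1),
    (["ZEBRAFISH"], 0),
]
print(keyword_h_indices(records))
print(core_keywords(records, top_n=2))
```

In this toy data, “CRISPR” appears in papers cited 12, 5, and 2 times, so its h-index is 2, while the other two keywords each have an h-index of 1.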

In this study, R is used to extract the keywords and the citation counts of the papers corresponding to each keyword. The h-index is calculated with our own program. The extracted keywords have the following problems.

(1) Case differences, such as “Induced pluripotent stem cell” and “induced pluripotent stem cell”.

(2) Inconsistent connectors, such as “CRISPR/Cas9” and “CRISPR-Cas9”.

(3) Synonyms, such as abbreviations and full names: “iPSCs”, “Human-induced pluripotent stem cells”, and “Induced pluripotent stem cells (iPSCs)” have the same meaning.

Many keywords are specialized biomedical terms, so general-purpose dictionaries are not suitable for this study. The following two methods are therefore used for data processing. The first is case conversion, which unifies all keywords into capital letters. The second is a self-compiled dictionary, which resolves inconsistent connectors and synonyms.
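The two cleaning steps can be sketched as a small function: convert to capital letters first, then look the result up in a hand-made dictionary. The Python example below is illustrative only. The dictionary entries and the canonical forms chosen (for instance “IPSC”) are hypothetical and do not reproduce the authors' self-compiled dictionary.

```python
import re

# Hypothetical self-compiled dictionary: upper-cased variant -> canonical keyword.
SYNONYMS = {
    "IPSCS": "IPSC",
    "INDUCED PLURIPOTENT STEM CELL": "IPSC",
    "INDUCED PLURIPOTENT STEM CELLS": "IPSC",
    "INDUCED PLURIPOTENT STEM CELLS (IPSCS)": "IPSC",
    "CRISPR-CAS9": "CRISPR/CAS9",
}


def normalize_keyword(raw):
    """Step 1: trim, collapse whitespace, upper-case. Step 2: dictionary lookup."""
    key = re.sub(r"\s+", " ", raw.strip()).upper()
    return SYNONYMS.get(key, key)  # unknown keywords are kept as upper-cased


for raw in ["CRISPR-Cas9", "iPSCs", "induced pluripotent stem cell", "Zebrafish"]:
    print(raw, "->", normalize_keyword(raw))
```

The lookup matches whole keywords rather than substrings, so a replacement rule cannot rewrite a fragment inside a longer, unrelated term.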

Keyword co-occurrence with CRISPR/Cas9

Fig. 1 shows the co-occurrence network of 100 keywords with the highest h-index in the field of CRISPR/Cas9. Node color represents matrix-based clustering and node size represents h-index. Table 1 shows the 20 nodes with the highest degree centrality and betweenness centrality.

Figure 1. Co-occurrence network of the keywords with the highest h-index in the field of CRISPR/Cas9.

Table 1. Centrality in the co-occurrence network of the keywords with the highest h-index in the field of CRISPR/Cas9.

| Rank | Keyword | Degree Centrality | Keyword | Betweenness Centrality |
|---|---|---|---|---|
| 1 | CRISPR | 188 | CRISPR | 1,378.00 |
| 2 | GENOME EDITING | 156 | GENOME EDITING | 664.00 |
| 3 | CRISPR/CAS | 132 | CRISPR/CAS | 441.37 |
| 4 | GENOME ENGINEERING | 94 | GENOME ENGINEERING | 157.52 |
| 5 | HOMOLOGOUS RECOMBINATION | 84 | GENES | 135.44 |
| 6 | GENES | 76 | HOMOLOGOUS RECOMBINATION | 111.44 |
| 7 | ZEBRAFISH | 70 | ZEBRAFISH | 110.97 |
| 8 | GENE TARGET | 66 | GENE REGULATION | 60.97 |
| 9 | TALEN | 64 | CRISPRI | 53.39 |
| 10 | SYNTHETIC BIOLOGY | 58 | APOPTOSIS | 49.33 |
| 11 | GENE REGULATION | 56 | DNA REPAIR | 48.34 |
| 12 | GENE THERAPY | 56 | SYNTHETIC BIOLOGY | 47.97 |
| 13 | DNA REPAIR | 54 | GENE TARGET | 47.13 |
| 14 | GENE KNOCKOUT | 54 | IPSC | 42.64 |
| 15 | IPSC | 50 | GENE KNOCKOUT | 41.15 |
| 16 | ZFN | 50 | TALEN | 41.09 |
| 17 | SGRNA | 46 | GENE THERAPY | 39.40 |
| 18 | CANCER | 42 | EVOLUTION | 37.89 |
| 19 | CRRNA | 42 | SGRNA | 34.15 |
| 20 | EVOLUTION | 42 | CANCER | 34.14 |

In the network, “CRISPR”, “GENOME EDITING”, “CRISPR/CAS”, “GENOME ENGINEERING”, “HOMOLOGOUS RECOMBINATION”, and “GENE TARGET” are located in the center of the network and have high degree centrality and betweenness centrality. Thus, they are the core research topics. “ZEBRAFISH”, “MOUSE”, “ZFN”, “TALEN”, “GENE THERAPY”, and “CANCER” are also important research topics.

Interestingly, the keywords “IPSC”, “HUMAN IPSC”, “STEM CELL”, “SYNTHETIC BIOLOGY”, and “METABOLIC ENGINEERING” are conspicuous in the co-occurrence network, and they have high degree centrality and betweenness centrality. This indicates that CRISPR/Cas9 research overlaps substantially with the other two hot topics.

Keyword co-occurrence with iPS cell

Fig. 2 shows the co-occurrence network of 100 keywords with the highest h-index in the field of iPS cell. Node color represents matrix-based clustering and node size represents h-index. Table 2 shows the 20 nodes with the highest degree centrality and betweenness centrality.

Figure 2. Co-occurrence network of the keywords with the highest h-index in the field of iPS cell.

Table 2. Centrality in the co-occurrence network of the keywords with the highest h-index in the field of iPS cell.

| Rank | Keyword | Degree Centrality | Keyword | Betweenness Centrality |
|---|---|---|---|---|
| 1 | STEM CELL | 184 | STEM CELL | 359.39 |
| 2 | EMBRYONIC STEM CELL | 170 | EMBRYONIC STEM CELL | 267.07 |
| 3 | REPROGRAMMING | 166 | HUMAN IPSC | 254.20 |
| 4 | HUMAN IPSC | 162 | REPROGRAMMING | 246.40 |
| 5 | PLURIPOTENT STEM CELL | 160 | DIFFERENTIATION | 225.41 |
| 6 | DIFFERENTIATION | 158 | PLURIPOTENT STEM CELL | 221.89 |
| 7 | REGENERATIVE MEDICINE | 118 | HUMAN EMBRYONIC STEM CELL | 110.40 |
| 8 | HUMAN EMBRYONIC STEM CELL | 116 | REGENERATIVE MEDICINE | 100.92 |
| 9 | NEURAL STEM CELL | 116 | MESENCHYMAL STEM CELL | 94.19 |
| 10 | MESENCHYMAL STEM CELL | 114 | NEURAL STEM CELL | 83.66 |
| 11 | CELL THERAPY | 102 | TRANSPLANTATION | 70.11 |
| 12 | PLURIPOTENCY | 102 | PLURIPOTENCY | 67.90 |
| 13 | TRANSPLANTATION | 102 | CELL THERAPY | 60.74 |
| 14 | TISSUE ENGINEERING | 98 | CARDIOMYOCYTES | 58.25 |
| 15 | CARDIOMYOCYTES | 90 | NEURON | 56.58 |
| 16 | NEURON | 90 | TISSUE ENGINEERING | 54.92 |
| 17 | DISEASE MODELING | 82 | HIPSC | 43.86 |
| 18 | HIPSC | 82 | DRUG SCREENING | 42.39 |
| 19 | PARKINSON’S DISEASE | 82 | GENE EXPRESSION | 40.25 |
| 20 | GENE EXPRESSION | 80 | DISEASE MODELING | 37.50 |

In the network, “STEM CELL”, “EMBRYONIC STEM CELL”, “REPROGRAMMING”, “HUMAN IPSC”, “PLURIPOTENT STEM CELL”, “DIFFERENTIATION”, “REGENERATIVE MEDICINE”, and “MESENCHYMAL STEM CELL” are located in the center of the network and have high degree centrality and betweenness centrality. Thus, they are the core research topics.

In the field of iPS cells, “CARDIOMYOCYTES”, “NEURON”, “PARKINSON’S DISEASE”, “CELL THERAPY”, and “DISEASE MODELING” are the hotspots.

Keyword co-occurrence with synthetic biology

Fig. 3 shows the co-occurrence network of the 100 keywords with the highest h-index in the field of synthetic biology. Node color represents matrix-based clustering and node size represents h-index. Table 3 shows the 20 nodes with the highest degree centrality and betweenness centrality. In the network, “METABOLIC ENGINEERING”, “GENE CIRCUIT”, “GENE EXPRESSION”, “SYSTEMS BIOLOGY”, and “PROTEIN ENGINEERING” are located in the center of the network and have high degree centrality and betweenness centrality. Thus, they are the core research topics. “YEAST”, “ESCHERICHIA COLI”, and “SACCHAROMYCES CEREVISIAE” are experimental organisms used in synthetic biology research. These keywords are also located in the network center and have high degree centrality and betweenness centrality, which indicates that these organisms are widely used in research.

Figure 3. Co-occurrence network of the keywords with the highest h-index in the field of synthetic biology.

Table 3. Centrality in the co-occurrence network of the keywords with the highest h-index in the field of synthetic biology.

| Rank | Keyword | Degree Centrality | Keyword | Betweenness Centrality |
|---|---|---|---|---|
| 1 | METABOLIC ENGINEERING | 100 | METABOLIC ENGINEERING | 378.42 |
| 2 | YEAST | 90 | GENE EXPRESSION | 312.93 |
| 3 | GENE CIRCUIT | 82 | YEAST | 286.57 |
| 4 | ESCHERICHIA COLI | 80 | GENE CIRCUIT | 281.48 |
| 5 | GENE EXPRESSION | 80 | ESCHERICHIA COLI | 249.28 |
| 6 | SACCHAROMYCES CEREVISIAE | 80 | SACCHAROMYCES CEREVISIAE | 229.17 |
| 7 | PROTEIN ENGINEERING | 66 | SYNTHETIC GENE | 180.28 |
| 8 | DIRECTED EVOLUTION | 62 | PROTEIN ENGINEERING | 171.01 |
| 9 | GENE REGULATION | 60 | CELL CYCLE | 169.32 |
| 10 | SYNTHETIC GENE | 60 | GENE THERAPY | 158.88 |
| 11 | SYSTEMS BIOLOGY | 60 | TRANSCRIPTION | 146.23 |
| 12 | TRANSCRIPTION | 60 | DIRECTED EVOLUTION | 143.33 |
| 13 | GENE ENGINEERING | 56 | SYSTEMS BIOLOGY | 130.95 |
| 14 | BIOTECHNOLOGY | 50 | GENE REGULATION | 108.56 |
| 15 | GENE THERAPY | 50 | GENE ENGINEERING | 90.24 |
| 16 | CRISPR/CAS9 | 48 | ESSENTIAL GENE | 87.67 |
| 17 | CYANOBACTERIA | 46 | CELL-FREE PROTEIN SYNTHESIS | 76.44 |
| 18 | ESSENTIAL GENE | 46 | EVOLUTION | 71.90 |
| 19 | EVOLUTION | 44 | TRANSCRIPTION FACTOR | 70.90 |
| 20 | E. COLI | 42 | BIOTECHNOLOGY | 66.57 |

In the keyword co-occurrence network, “CRISPR/CAS9” also has high degree centrality and is an important node in its keyword cluster. This indicates a correlation between synthetic biology and CRISPR/Cas9. In addition, the frequency of cross-topic research and its number of citations are relatively high.

Discussion and conclusion

The results above construct core keyword co-occurrence networks, visualize them, and calculate the keywords’ centralities. The three hot research topics in biomedicine can be characterized as follows.

The research hotspots of CRISPR/Cas9 include the comparison of this gene editing technology with the previous two generations (ZFN and TALEN), the discovery of new CRISPR/Cas systems, the improvement of gene editing techniques and methods, and applications of CRISPR/Cas9 in gene therapy and cancer therapy.

The research hotspots of synthetic biology include “METABOLIC ENGINEERING”, “GENE CIRCUIT”, “GENE EXPRESSION”, “SYSTEMS BIOLOGY”, and “PROTEIN ENGINEERING”.

The research hotspots of iPS cells include HUMAN IPSC, the comparison between iPS cells and EMBRYONIC STEM CELL, and the application of iPS cells in research on CARDIOMYOCYTES, NEURON, PARKINSON’S DISEASE, etc.

The keywords of the three biomedical topics overlap, and the overlap between synthetic biology and CRISPR/Cas9 is the most pronounced. Research on the three topics is therefore interconnected.

Because all analyses rely on keywords alone, without other kinds of data, this article has limitations, which may be addressed in future studies.

eISSN: 2543-683X
Language: English
Publication timeframe: 4 issues per year
Journal subjects: Computer Science, Information Technology, Project Management, Databases and Data Mining