Mapping the structure of science into hierarchical clusters/communities provides the basis of many further scientometrical investigations. For example, citation field normalization depends on the clusters (Waltman, 2016). Evolution of science disciplines and the interrelation among them can be better captured at the level of clusters rather than at the level of individual journals or individual papers (Shen et al., 2016). Furthermore, even a simple visualization of the map of science can be informative. Such clustering can be done in citation networks of either journals or papers via algorithms (Leydesdorff, 2006; Leydesdorff et al., 2017; Waltman & Van Eck, 2012), or even manually (Glänzel & Schubert, 2003). The cutting edge algorithm and studies of paper-level clustering is the Smart Local Moving (SLM) algorithm and related analysis (Boyack & Klavans, 2014; Colavizza et al., 2018; Haunschild et al., 2018; Klavans & Boyack, 2017; Sjogarde & Ahlgren, 2018).
The first research question that we are tackling here is to find another clustering algorithm of journals/papers. Or more specifically mainly in this work, we focus on a sub-component of a clustering algorithm, which is the similarity measure of journals/papers. One thing we notice in earlier clustering algorithms is that it is better to consider both text-based similarity and citation-based similarity (Boyack et al., 2017; Glänzel & Thijs, 2011; Janssens et al., 2008). Among the clustering algorithms developed for networks, node2vec/word2vec naturally takes both into consideration (Grover & Leskovec, 2016; Mikolov et al., 2013). Starting from a language corpus, a machine learning algorithm word2vec generates a vector representation of each word based on the assumption that words appearing close to each other are more similar in their meaning. Node2vec first converts a network into a corpus of nodes of the network and then applies word2vec to generate a vector for each node of the network. The vectors can then be used to calculate the similarity between each pair of the words or of the nodes. Thus, a combination of node2vec and word2vec, which are in fact just one algorithm in their cores, is a natural hybrid clustering method of both text-based and citation-based similarity of papers or journals. Once we have a similarity measure, in principle, we may apply various algorithms for clustering. In this work, however, since we focus more on similarity measure, we will use simply the hierarchical clustering (Pedregosa et al., 2011) as the algorithm of clustering.
Such a paper-level clustering, which is already an ongoing study in our group, requires citation data, text data of all papers, and tedious work on analysis and also on examining/validating the results. Thus, in this work, we would like to report our results on the node2vec/word2vec on journals first. In fact, at the level of journals, we only consider the citation network of journals, but not any texts from the journals, since it is a tricky question to define representative texts of journals. When the same method is applied to paper-level clustering, texts of a paper can be its title and abstract or even its full text.
As we will show later, the generated map of science has a reasonably clear community structure and the calculated communities of most of the journals agree well with several of the existing classifications of journals. Once we have the map of science, we notice that when we visually examine the map, overall, the multidisciplinary journals are at the boundary area of several disciplines and they are closer to the centers than to the edges. This inspires us to look at the vectors as a potential source for measuring the diversity/multidisciplinarity of journals.
Therefore, the second research question we investigate in this work is the plausibility of using the vector of each journal obtained via the node2vec for the diversity measure of the journal. This first relates to the question of how the journal vector itself, specifically the vector norm, correlates with journal diversity. Second, this plausibility might also depend on which similarity measure is used in the calculation of diversity. Often the definition of diversity of either journals or papers relies on a measure of similarity, see, for example, the Rao-Stirling definition in Eq. (2). This similarity can be defined based on co-cited or co-citing papers/journals. For example (Leydesdorff et al., 2018), for journal
It has been pointed out that the similarity measure plays an important role in diversity measure (Zhang et al., 2016). Since we now have another vector representation, and in some sense, the vector has the property that similar journals have similar vectors, we replace the citation vector
The journal citation network is extracted from the Journal Citation Reports 2017 (JCR2017, 2018). Links with citation counts less than 6 are removed. The resulted journal citation network has about 11,000 nodes and 950,000 links.
Node2vec is a representational learning framework of graphs, which can generate continuous vector representations for the nodes based on the network structure (Grover & Leskovec, 2016). The core algorithm in node2vec is word2vec (Mikolov et al., 2013). Here we use node2vec to learn 32-dimensional vectors
We basically extract the journal citation data and run it through the node2vec algorithm to get vectors of predefined dimensions, and then use the vectors to calculate journal pairwise similarity for clustering, and also feed the vectors into t-SNE (Maaten & Hinton, 2008) to get a 2-d presentation for science mapping. Here we set the dimension parameter as 32 considering both performance and computational cost and comparisons against other dimension parameters are shown in Fig. 2. All code is available at
Rao-Stirling diversity indicator is one of the widely used diversity measurements, which takes variety, balance, and disparity into consideration. The Rao-Stirling diversity is calculated as follows,
where the vector used to calculate similarity can be either the direct citation vector
To test the effectiveness of the trained vector from node2vec, we first make a science map of the journals according to their vectors by t-SNE, which reduces the vector dimension to 2. Presentation of our science map is shown in Fig. 1, in which each colored dot represents a journal with its color encoding its category used in Essential Science Indicators (ESI). ESI journal category is a well known and commonly used journal classification system. Since the ESI journal category is not directly generated by our method, we think it is more objective to use it to evaluate the reasonableness of our science map, especially the closeness of journals within same categories and the relative positions of categories.
First, we notice that our science map has a clear community structure and has a high agreement with the ESI journal category: journals with the same colors (same category) are more or less concentrated near their own region in the 2-dimensional t-SNE plot. This implies that the similarity of vectors among journals in the same category is much larger than journals across categories. We also noticed high overlaps between BIOLOGY & BIOCHEMISTRY journals and MOLECULAR BIOLOGY & GENETICS journals, implying that these journals are densely connected and may be merged into one category.
Second, we also find that in general, the structure of this science map is quite similar to earlier maps (Klavans & Boyack, 2009), with similar positions and relationships among disciplines, except for Geoscience. For example, Economics is close to other Social Sciences, Math, and Computer Science, and Material Science is between Physics, Engineering, and Chemistry, while Biology and Medicine are very close to each other, and Environmental Science is close to Chemistry, Biology, and Geoscience. Geoscience in our map is located peripherally, while in several other maps it is usually near the center. Intuitively, Geoscience is a discipline that makes use of many other disciplines, thus it is understandable that it is near the center. Yet, placing Geoscience near Environmental Science, Chemistry and Agriculture also makes sense. Therefore, we conclude that even for the position of Geoscience both our map and other maps capture part of the real connections among scientific disciplines.
Journals indexed as Multidisciplinary are highlighted on the map as red dots with black borders. Instead of being at the center, which means that there is no bias towards any disciplines, we see that these so-called multidisciplinary journals almost scatter all across the map with a quite several of them located at the boundary area of medicine, microbiology and related life sciences. This means that many of those multidisciplinary journals have their own focuses. For example,
We further cluster the journals at various levels of aggregations with hierarchical clustering (Pedregosa et al., 2011). Let us refer to our system of hierarchical clusters of journals as the Vec clusters (Vec for short). We then compare Vec with several commonly used journal classification systems, e.g., JCR-WoS subject categories
Since these four classification systems of journals are aggregated at different levels, e.g. ESI with 22 fields and JCR with about 200 disciplines, we generate Vec across different levels and compare them with these classification systems. Note that each of these pre-existing classifications has fixed number of clusters while Vec is generated at various granularity. As shown in Fig. 2(a) we can see a moderate agreement between Vec and these classification systems. The AMI shows a peak when the two compared classifications are aggregated at similar levels since at this common size of clusters the two classifications are often more similar than the cases when their sizes are different. We see that, overall, the agreement between Vec and VoS is higher than others. This is due to the fact that they both use journal citation data to generate the classification systems.
Both the science map and this comparison of the maps show that the generated node2vec vectors of journals do capture part of inherent characters of each journal such that the resulted clusters are quite reasonable. Based on this observation, we want to further examine these vectors.
One advantage of word2vec-like algorithms is its vector will embed the semantic relationships among words. The most famous example is the “King - Man + Woman = Queen” relation. Here we also test such a phenomenon on the trained vector of journals. Table 1 shows the result of such a test. Here we fixed “King =
The “King - Man + Woman = Queen” test on the node2vec trained vectors of journals. Top-5 “Queen”-like journals are presented.
|Example||Test 1||Test 2||Test 3|
Roughly speaking each dimension of the node2vec vectors likely represents a specific subject at a certain level depending on the chosen parameter of dimensions of the vector space. Thus, a more specified journal might cover only a few such directions while a journal with a broader scope might be seen as a summation of several of those directions. Then the summation of a few vectors may still point to some specific directions with larger length, while the summation of many vectors, each pointing to different directions, will often be short and not pointing to some specific directions. This is very much like the in central limit theorem that the more random variables in the summation the closer to 0 is the average. In fact, in the vectors of words generated from word2vec, vector norm of words with multiple meanings and different contexts is often shorter than that of the words with very specific meaning and context (Schakel & Wilson, 2015). Based on this intuition, we want to test the plausibility of using vector norm as an indicator of the diversity of journals.
To do so, we first compare the values of vector norms of the commonly known multidisciplinary journals against that of other journals. In Fig. 3, overall, we can see that those indexed as multidisciplinary journals (orange dots) mainly are near the lower part of the scatter plot, such as
However, in Fig. 3 we also use another journal indicator: node centrality of journals. Node centrality is measured by the occurrence frequency in the random walk series generated for node2vec, which can be treated as a network centrality measure and is closely related to node degree or the total received citations of journals in the citation network. Thus, node centrality of a journal is similar to the total impact of the journal. We find that even for journals with similar node centrality, the one with the smaller vector norm is more likely a multidisciplinary journal. For example, on the right part, the three journals –
Thus, it seems that vector norm indeed encodes diversity character to some degree. Next, we would like to develop a practical indicator of journal diversity using the vector representation of journals. However, finding the best way to measure journal diversity is still an open question. Often, for a measure of journal diversity, citing vector of a journal, which means the given journal cites how many times each of other journals or cited vector of a journal, which means how many
times this journal is cited by each of other journals, or both are needed. Since our goal in this work is not really to develop such an indicator but rather examining the usefulness of node2vec vectors of journals, we take simply one of the widely used indicators of diversity, the Rao-Stirling diversity (Rao, 1982; Stirling, 2007) in Eq. (2) and replaced the citation vector
Following the earlier settings (Leydesdorff et al., 2018), we calculate the citing-side and cited-side Rao-Stirling diversity of all journals in our set, as shown in Fig. 4. In Fig. 4(a) node2vec vectors are used to calculate journal similarity and in Fig. 4(b) we use the citation vectors. In Fig. 4, the
Overall, we see that in Fig. 4(a) when node2vec vectors are used, many of the commonly known multidisciplinary journals are at the diagonal top region of the plot and separated from other journals, especially those clearly multidisciplinary ones, such as
From this test of the node2vec algorithm on the citation network of journals, we see that the node2vec-based science map of journals has relatively clear separations, the closeness among the clusters represents true relationships among corresponding disciplines, and the overall structure of the map agrees with other commonly used maps of sciences. This means that the journal vectors generated from node2vec capture essential characteristics of these journals. This is also illustrated in Table 1.
We also found that it is possible to use the norm of generated vectors for diversity measurement. It was at first surprising to us that vectors of multidisciplinary journals have smaller norms. However, similar things happen for the word2vec vector representation of words. Words whose meanings are very specialized have a big norm, while words whose meanings are rather common have a smaller norm (Schakel & Wilson, 2015). We suspect that the same thing happens here so that the more specialized journals have larger norms. Based on this observation, we tested the applicability of the vector representation to diversity measurement. Using the Rao-Stirling diversity measure as an example, we show that replacing the often-used citation vectors with our node2vec vectors indeed improves the agreement between the measured values of diversity and the commonly known multidisciplinary journals. When node2vec vectors are used, several multidisciplinary journals become outliers since their measured diversity values are larger than other journals, while almost all multidisciplinary journals are not distinguishable from other journals when citation vectors are used.
Based on the above observations, we conclude that node2vec/word2vec has potential in clustering and characterizing nodes in citation networks. Simply due to the complexity of analysis, in this work, we decide to present results on journal-level science mapping first. Extending the current study to a paper-level network, which will naturally integrate citation networks and text contents of papers, will be the topic of future investigations. Such an extension might contribute to both the method of paper clustering and paper diversity measurement. Also there we do not study which diversity measure is the better.
To illustrate the main concepts and contribution, we prepared a graphic summary of this work in Fig. 5. The concepts and connections among concepts are color coded. The black ones are from references (Grover & Leskovec, 2016; Mikolov et al., 2013). The current work implements only the red ones and the green ones about paper-level clustering is a topic of future investigation. Preliminary results of this work were reported at the 2nd International Conference on Data-driven Knowledge Discovery in 2018.
The “King - Man + Woman = Queen” test on the node2vec trained vectors of journals. Top-5 “Queen”-like journals are presented.
|Example||Test 1||Test 2||Test 3|