In recent years, research on clustering journals with similar topics has attracted much attention from scientists. One important reason is that it is an intermediate step towards research portfolio analysis, which is the analysis of research programs that can be classified by any theme of interest, including those related to administrative needs, organizational structure, funding streams, goals, and results (Srivastava, Towery, & Zuckerman, 2007). In this kind of analysis, journals must first be grouped into categories so that the significance of research in a specific field can be assessed. Journal clustering is useful not only for classification, but also for indexing and retrieval schemes (Shultz, 2007; Small & Koenig, 1977). For example, the results of journal clustering can be used to index journals by research area, improving the accuracy and efficiency of journal search.
Traditionally, journal clustering is based on human cataloging. However, this manual approach is unable to 1) keep up with the rapid growth of new journals, 2) capture the changes in journal scopes over time, and 3) measure the relatedness between journals. The number of scientific journals has rapidly increased over the last decades, especially in the active and productive biomedical area. Scientists are facing thousands of new journals in PubMed every year, as illustrated in Figure 1. In addition, the scope of a journal may change (Kang, Doornenbal, & Schijvenaars, 2015) to reflect the current research trends, but it would take a long time for manual cataloging to capture these changes. Furthermore, due to the lack of an objective method to measure the relatedness between journals, human cataloging is carried out based on subjective criteria, which may vary considerably from one person to another.
Quantitative approaches to journal clustering and cataloging that do not rely on human intervention have been proposed (Chen, 2008; D'Souza & Smalheiser, 2014; Eisenberg & Wells, 2014; Pudovkin & Garfield, 2002). Most studies on journal clustering use the journal citation information available in Thomson Reuters' Journal Citation Reports (JCR). Citation information reflects the interaction among scientific disciplines: two journals are likely to be related if articles published in them often cite each other. For example, Pudovkin and Garfield (2002) produced a related journal list using a "relatedness factor (RF)" based on citation data in JCR, computed from the citations that one journal gives to or receives from another.
Another journal clustering approach, based on article usage information, has been proposed (Lu, Xie, & Wilbur, 2009). It rests on the hypothesis that if articles in two journals are often read by the same set of users, the two journals are likely to be related. It has been confirmed that PubMed query log data (including users' searches and clicks) can serve as an approximate measure of article usage for identifying related journals. However, since the query log data is not publicly available, the usage-based approach cannot be incorporated into third-party applications. Moreover, both citation-based and usage-based approaches partially depend on human decisions, which introduce subjective effects into the clustering results. For example, authors tend to cite articles published in journals they are familiar with, and users searching for articles may preferentially click those from prominent journals. Therefore, journals with a lower impact factor might not be identified as related journals, even if they belong to the same research topic.
In this paper, we present a data-driven approach to mining related biomedical journals for automatic cataloging in a timely fashion. It uses the content similarity of the articles in two journals to judge journal relatedness. This judgment is intuitive: two journals are likely to be related if they often publish papers on similar topics. The similarity between articles can be measured from their content using term weights, e.g., with vector-space similarity scoring (Salton & Buckley, 1988). The clustering results of this approach are therefore determined solely by the content similarity of the journals, making them robust for automatic journal cataloging.
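The vector-space idea mentioned above can be sketched as follows. This is a generic TF-IDF weighting and cosine similarity illustration in the spirit of Salton and Buckley (1988), not the actual PubMed retrieval model; the tokenized toy corpus and helper names are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF weight vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight = term frequency * inverse document frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Under this scheme, two articles sharing rare terms score higher than two articles sharing only common ones, which is the basic intuition behind content-based relatedness.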
The articles published from August 1, 2011 to July 31, 2012 were obtained from PubMed. These 917,844 articles belong to 4,841 unique journals. From this data set, we first selected the 740,870 articles belonging to the 3,265 journals that published more than 50 papers in this time period. We then retrieved the related articles of these articles through E-utilities, a set of programming tools that provide a stable interface to the Entrez query and database system at the National Center for Biotechnology Information (NCBI) (NCBI, 2010). For each article, we kept up to five related articles, which were displayed alongside the target article in PubMed.
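The journal selection step can be sketched roughly as follows; the `(pmid, journal)` record format and the `select_journals` helper are illustrative assumptions, not the actual processing pipeline:

```python
from collections import defaultdict

def select_journals(articles, min_papers=50):
    """Group articles by journal and keep journals that published more
    than `min_papers` articles in the time window.

    `articles` is an iterable of (pmid, journal) pairs (hypothetical format).
    Returns {journal: [pmids]} for the journals that pass the threshold."""
    by_journal = defaultdict(list)
    for pmid, journal in articles:
        by_journal[journal].append(pmid)
    return {j: pmids for j, pmids in by_journal.items()
            if len(pmids) > min_papers}
```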
We also collected one month’s PubMed query logs (from March 1, 2008 to March 31, 2008), which included a total of 8 million user sessions (after removing robot sessions) and 51 million citation retrievals. A citation retrieval is a specific MEDLINE record being clicked to display its corresponding bibliographic information and abstract text. We looked for related journals for the 3,265 journals in the 8 million user sessions. For each journal, we kept a list of the top 20 related journals.
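A minimal sketch of how related journals might be derived from the session data, under the hypothesis that journals whose articles are retrieved in the same user session are related. The session format and the `journal_of` mapping are hypothetical, not the actual algorithm of Lu, Xie, and Wilbur (2009):

```python
from collections import Counter, defaultdict

def related_journals(sessions, journal_of, top_k=20):
    """For each journal, count how often other journals' articles were
    clicked in the same user session, and keep the top_k co-occurring
    journals.

    `sessions` is a list of clicked-article-ID lists (one per session);
    `journal_of` maps an article ID to its journal."""
    co = defaultdict(Counter)
    for clicks in sessions:
        journals = {journal_of[a] for a in clicks if a in journal_of}
        for j in journals:
            for k in journals:
                if j != k:
                    co[j][k] += 1
    return {j: [k for k, _ in c.most_common(top_k)] for j, c in co.items()}
```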
How to measure the relevance of journal articles is the key to accurate retrieval of related articles. Typically most systems use TF-IDF-like schemes to determine how the articles are related. In PubMed, the relevance judgments are generated based on article content similarity, and Medical Subject Headings (MeSH) terms in MEDLINE are used for parameter estimation in the retrieval model. MeSH is a controlled vocabulary primarily used to index articles in PubMed for improving literature retrieval, and has also been used in many other scientific investigation areas (Mao & Lu, in press).
The retrieval model that underlies the related article search feature in PubMed is a topic-based content similarity model called pmra,
where
Putting the local frequency of the terms in the documents (
The
where
The calculation of related journals is based on the existence of a set of user sessions
The similarity between journal
The similarity between journal
To measure the quality of retrieved related articles through the content-similarity-based approach, we use the following metrics.
To measure how well the related journals generated from content similarity correlate with the usage-based ranking of journals, we used the article usage data in the query log.
Suppose
where
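The Pearson correlation coefficient (CC) used in the evaluation can be sketched as follows, assuming two equal-length score lists; this is the standard definition, shown here only for concreteness:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists:
    covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```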
To measure how well the content-similarity-based approach ranks journals in order of relevance, as compared with human assessors, the Kendall's tau correlation coefficient (KTCC) is used; it is defined in Equation (11):
where
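Kendall's tau can be computed as sketched below (here the simple tau-a variant, which ignores ties; the exact variant used in the paper may differ):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau between two equal-length score lists:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1     # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1     # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```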
To measure how accurate an approach is for related journal search, we treat the approach as an information retrieval task. We specifically evaluate how well the retrieved related journals satisfy the actual goal of a user's search, in terms of relevance accuracy and ranking accuracy. The top 20 related journals based on usage of the original article are therefore regarded as the gold standard, and the related journals obtained by the content-similarity-based approach are the results of the retrieval task. We measure search accuracy using the following metrics.
Precision is the fraction of retrieved documents that are relevant while recall is the fraction of relevant documents that are retrieved.
Recall and precision usually trade off against each other: low precision means many results are not relevant, and low recall means many relevant results are not retrieved. Therefore, they are usually combined into a single measure, such as the F-measure, the harmonic mean of precision and recall.
This is also known as the F1 measure when precision and recall are equally weighted.
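Set-based precision, recall, and their harmonic mean can be sketched as follows; treating retrieved and relevant journals as plain sets is an illustrative simplification:

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and F1 for a set-based retrieval task."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0   # fraction retrieved that is relevant
    r = hits / len(relevant) if relevant else 0.0     # fraction relevant that is retrieved
    f1 = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean
    return p, r, f1
```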
Given the search
where
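NDCG at a truncation position k, as reported in the results (NDCG@5 and NDCG@20), can be sketched as follows, using the standard log2 discount; the exact gain and discount settings in the paper may differ:

```python
import math

def ndcg(relevances, k):
    """NDCG@k: discounted cumulative gain of the ranked relevance grades,
    normalized by the DCG of the ideal (descending) ordering."""
    def dcg(rels):
        # position i (0-based) is discounted by log2(i + 2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```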
Journals in PubMed are assigned broad subject terms by the National Library of Medicine (NLM) to describe each journal's overall scope (Weis, 2013). All of these broad subject terms (about 120) are valid MeSH terms. Since journals are manually classified into these broad subjects, we can use them to validate the results of the content-similarity-based approach. Here we take AMIA (The American Medical Informatics Association) as an example.
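The validation idea, checking whether the retrieved related journals share a broad subject term with the seed journal, can be sketched as follows; `subject_overlap` is a hypothetical helper, not part of the described system:

```python
def subject_overlap(seed_terms, related_terms_list):
    """Fraction of related journals sharing at least one NLM broad
    subject term with the seed journal.

    `seed_terms` is the seed journal's broad subject term list;
    `related_terms_list` holds one term list per related journal."""
    seed = set(seed_terms)
    hits = sum(1 for terms in related_terms_list if seed & set(terms))
    return hits / len(related_terms_list) if related_terms_list else 0.0
```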
Table 1 shows the top 10 journals related to AMIA as identified by the content-similarity-based approach, together with each journal's broad subject term(s).
Top 10 journals related to AMIA: The American Medical Informatics Association, with each journal's broad subject term(s). Most of the related journals carry the broad subject term Medical Informatics, alone or combined with Health Services Research, Technology, or Computational Biology; the remaining journals carry Technology, Internal Medicine, Pediatrics, or Medicine.
We then analyze the results of the content-similarity-based approach against the results of the usage-based approach. We re-ran the algorithm of the usage-based approach developed earlier (Lu, Xie, & Wilbur, 2009). Table 2 shows the average scores of CC, KTCC, F1, and NDCG at two truncation positions, 5 (NDCG@5) and 20 (NDCG@20), for journals with different numbers of published papers. The performance of the content-similarity-based approach depends on the number of papers published in a journal. Among journals that published fewer than 300 papers in one year, those that published more papers achieved better performance. This is reasonable, because identifying related journals from only a few related articles is less accurate.
Related journal results evaluation of the content-similarity-based approach using different metrics. CC: Pearson correlation coefficients; KTCC: Kendall's tau correlation coefficients; NDCG: normalized discounted cumulative gain; NDCG@5 and NDCG@20: NDCG truncated at positions 5 and 20.

Papers published in 12 months | Number of journals | CC     | KTCC   | F1   | NDCG@5 | NDCG@20
>50                           | 3,265              | 0.2617 | 0.0647 | 0.52 | 0.7235 | 0.6654
>100                          | 2,161              | 0.3271 | 0.1185 | 0.55 | 0.7583 | 0.7015
>200                          | 1,063              | 0.4153 | 0.1767 | 0.59 | 0.7931 | 0.7403
>300                          | 599                | 0.4646 | 0.1957 | 0.60 | 0.8003 | 0.7508
>500                          | 233                | 0.4591 | 0.1458 | 0.58 | 0.7570 | 0.7161
However, worse performance was found for journals that published more than 500 papers in one year. This is probably because the scope of those journals is broad and they publish many papers in very diverse fields, which makes it difficult to identify related journals for them.
Figure 2 shows the distribution of NDCG@20 values over all journals. The large number of journals with relatively high ranking accuracy (0.6 < NDCG@20 < 0.9) under the content-similarity-based approach indicates that this approach is appropriate for most journals. The content-similarity-based approach and the usage-based method can complement each other, because their results are nearly identical (NDCG@20 > 0.9) for only a small number of journals. Furthermore, for some journals usage might not reflect content (NDCG@20 < 0.6), because users could be attracted to click articles in these journals for other reasons, such as the journal's high impact factor, notable or familiar authors, or interesting titles.
We also qualitatively compared the result lists obtained through the content-similarity-based approach and the usage-based method with the related journal list presented by Pudovkin and Garfield (2002), which was produced using citation data.
Top 20 related journals identified by the citation-based, usage-based, and content-similarity-based approaches.
We further examined the most frequent MeSH terms assigned to the articles published in these 32 journals.
Top 3 MeSH terms in the articles of the journals identified by the three methods. For most journals, the three most frequent MeSH terms are genetics-related (e.g., Gene expression, Gene expression regulation, DNA, RNA, Phylogeny, Evolution, molecular, and Models, genetic); a few journals show terms from other fields, such as Intensive care, Intensive care units, and Critical illness.
We also found that the ranking based on citation counts correlates highly with the ranking based on usage. This is reasonable, because frequent clicking of an article tends to lead to a high citation count, a phenomenon explored in previous work (Brody, Harnad, & Carr, 2006). It should also be pointed out that the latency between an article's publication and its peak citing time is longer than the latency between its publication and its peak clicking time (Mao & Lu, 2013), but this time lag is probably not long enough to change the scope of a journal. Furthermore, we found that journals that do not focus solely on the field of genetics and heredity were ranked lower by the content-similarity-based approach than by the other two approaches. Such differences were also observed by D'Souza and Smalheiser (2014). This is probably because these journals publish papers across multiple fields.
This research demonstrates the value of using article content similarity to explore the proximity pattern of biomedical journals, validating that the content-similarity-based approach is useful for clustering related journals.
Further analysis also produces several insights. First, the clustering results based on content similarity, Web usage, and citation correlate highly with each other and are consistent with the results of manual cataloging. Second, the results of the content-similarity-based approach are considerably less subject to human factors, so some journals with lower impact factors can be clustered and ranked higher in the related journal list.
In conclusion, this research offers a way of clustering biomedical journals based on article content similarity rather than the widely used journal citation information. Moreover, incorporating the usage and citation information of journals would probably further improve the accuracy of journal clustering. We would like to investigate this issue in the future and extend this work with other related research, such as that of Klavans and Boyack (2006).