Mining Related Articles for Automatic Journal Cataloging

Introduction

In recent years, research on clustering journals with similar topics has attracted much attention from scientists. One important reason is that it is an intermediate step towards research portfolio analysis, the analysis of research programs that can be classified by any theme of interest, including themes related to administrative needs, organizational structure, funding streams, goals, and results (Srivastava, Towery, & Zuckerman, 2007). In this kind of analysis, journals must first be grouped and sorted into the same category before the significance of research in a specific field can be assessed. Journal clustering is useful not only for classification, but also for indexing and retrieval schemes (Shultz, 2007; Small & Koenig, 1977). For example, the results of journal clustering can be used to index journals by research area, improving the accuracy and efficiency of journal search.

Traditionally, journal clustering has been based on human cataloging. However, this manual approach is unable to 1) keep up with the rapid growth of new journals, 2) capture changes in journal scope over time, and 3) measure the relatedness between journals. The number of scientific journals has increased rapidly over the past decades, especially in the active and productive biomedical area: scientists face thousands of new journals in PubMed every year, as illustrated in Figure 1. In addition, the scope of a journal may change (Kang, Doornenbal, & Schijvenaars, 2015) to reflect current research trends, but manual cataloging takes a long time to capture such changes. Furthermore, for lack of an objective method to measure the relatedness between journals, human cataloging relies on subjective criteria, which may vary considerably from one person to another.

Figure 1

Number of journals in PubMed, 1945–2014.

Quantitative approaches to journal clustering and cataloging without human intervention have been proposed (Chen, 2008; D’Souza & Smalheiser, 2014; Eisenberg & Wells, 2014; Pudovkin & Garfield, 2002). Most studies on journal clustering use the journal citation information available in Thomson Reuters’ Journal Citation Reports (JCR). Citation information captures the interaction among scientific disciplines: two journals are likely to be related if articles published in them often cite each other. For example, in the study by Pudovkin and Garfield (2002), the related journal list was produced using a “relatedness factor (RF)” based on citation data in JCR. RF was calculated from the citation scores of the journals that give to, or receive from, one journal (in their paper, Genetics, a core journal in the field of genetics and heredity) the highest number of citations. D’Souza and Smalheiser (2014) used three metrics to measure journal similarity based on 1) MeSH term similarity, 2) author-individuals in common between each pair of journals, and 3) articles in each journal pair written by the same author-individuals. Fujii (2007) applied link analysis techniques to the citation structure of a patent collection; combining the citation-based scores with text-based scores for patents performed better than using the text information alone. Although the results of citation-based journal clustering are consistent with the ISI classification scheme, full citation information is hard to obtain. For example, the 2013 edition of the JCR contained statistical information for approximately 8,400 science and technology journals, while the number of journals in PubMed was nearly 27,000 in 2014; over two-thirds of journals thus have no citation information in JCR and cannot be clustered using citation-based techniques.

Another journal clustering approach is based on article usage information (Lu, Xie, & Wilbur, 2009). The underlying hypothesis is that if articles in two journals are often read by the same set of users, the two journals are likely to be related. It has been confirmed that PubMed query log data (including users’ searches and clicks) can serve as an approximate measure of article usage for identifying related journals. However, since the query log data is not publicly available, the usage-based approach cannot be incorporated into third-party applications. Moreover, both citation-based and usage-based approaches partially depend on human decisions, which introduce subjective effects into the clustering results. For example, authors tend to cite articles published in journals they are familiar with, and searchers may likewise tend to click articles from prominent journals. As a result, journals with lower impact factors might not be identified as related, even when they belong to the same research topic.

In this paper, we present a data-driven approach to mining related biomedical journals for automatic cataloging in a timely fashion. It uses the content similarity of articles in two journals to judge journal relatedness. The judgment is intuitive: two journals are likely to be related if they often publish papers on similar topics. The similarity between articles can be measured from their content using term weights, e.g. with vector similarity scoring approaches (Salton & Buckley, 1988). The clustering results of this approach therefore depend only on the content similarity of the journals, which makes them robust for automatic journal cataloging.

Methodology
Data Collection

The articles published from August 1, 2011 to July 31, 2012 were obtained from PubMed. These 917,844 articles belong to 4,841 unique journals. From this set, we first selected the 740,870 articles belonging to the 3,265 journals that published more than 50 papers in this period. We then retrieved the related articles of these articles through E-utilities, a set of programming tools that provide a stable interface into the Entrez query and database system at the National Center for Biotechnology Information (NCBI) (NCBI, 2010). For each article, we kept up to five related articles, the ones displayed alongside the target article in PubMed.
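
As a sketch of this harvesting step (our own helper, not the pipeline used in the paper), related articles can be requested from the E-utilities ELink service, whose “pubmed_pubmed” link set exposes PubMed’s precomputed related articles. The helper name and the use of the requests library are our choices, and bulk harvesting would additionally need an NCBI API key and rate limiting.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def related_pmids(pmid, top_k=5):
    """Fetch up to top_k related-article PMIDs for one PubMed article.

    Uses the ELink 'pubmed_pubmed' link set, which holds the related
    articles computed by the pmra model described below.
    """
    params = {
        "dbfrom": "pubmed",           # source database
        "db": "pubmed",               # target database
        "id": str(pmid),
        "linkname": "pubmed_pubmed",  # the "related articles" link set
        "retmode": "json",
    }
    resp = requests.get(EUTILS, params=params, timeout=30)
    resp.raise_for_status()
    linksetdbs = resp.json()["linksets"][0].get("linksetdbs", [])
    links = linksetdbs[0]["links"] if linksetdbs else []
    # The first link is typically the query article itself; skip it.
    neighbors = [str(x) for x in links if str(x) != str(pmid)]
    return neighbors[:top_k]

print(related_pmids("23193287"))  # an arbitrary example PMID
```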

We also collected one month of PubMed query logs (from March 1, 2008 to March 31, 2008), which included a total of 8 million user sessions (after removing robot sessions) and 51 million citation retrievals. A citation retrieval is a click on a specific MEDLINE record to display its bibliographic information and abstract text. We searched for related journals of the 3,265 journals in the 8 million user sessions. For each journal, we kept a list of the top 20 related journals.

Related Journals Identified by Article Content Similarity

Measuring the relevance between journal articles is the key to accurately retrieving related articles. Most systems use TF-IDF-like schemes to determine how articles are related. In PubMed, relevance judgments are generated from article content similarity, and Medical Subject Headings (MeSH) terms in MEDLINE are used for parameter estimation in the retrieval model. MeSH is a controlled vocabulary used primarily to index articles in PubMed to improve literature retrieval; it has also been used in many other areas of scientific investigation (Mao & Lu, in press).

The retrieval model that underlies the related article search feature in PubMed is a topic-based content similarity model called pmra (Lin & Wilbur, 2007). The pmra model calculates the similarity between two documents using the words they have in common, with an adjustment for document length. It uses words from titles and abstracts as well as the MeSH terms assigned to the documents. The probability that two documents are related given their content is estimated by approximating the Bayesian weights of the words they have in common, and term frequencies within documents are modeled with Poisson distributions. The local weight of a term t is computed with Equation (1):

$$TF_t=\left(1+e^{\alpha\times dl}\,\lambda^{lc-1}\right)^{-1},\tag{1}$$

where dl is the document length in words, lc is the local frequency of t in document c, and α and λ are constants tuned to the data.

Combining the local frequency of the terms in the documents (TF) with the inverse document frequency (IDF), we calculate term weights with Equation (2) and the document ranking function with Equation (3):

$$w_{t,c}=TF_t\times\sqrt{IDF_t}=\left(1+e^{\alpha\times dl}\,\lambda^{lc-1}\right)^{-1}\sqrt{IDF_t},\tag{2}$$

$$Sim(c,d)=\sum_{t=1}^{N}w_{t,c}\times w_{t,d}.\tag{3}$$
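
As a concrete illustration of Equations (1)–(3), the minimal sketch below scores two tokenized documents. The values of α and λ here are placeholders rather than the constants actually tuned for MEDLINE, and the IDF table is assumed to be precomputed.

```python
import math
from collections import Counter

ALPHA = 0.0044   # placeholder values; the real pmra constants are
LAMBDA = 0.7     # tuned to MEDLINE data (Lin & Wilbur, 2007)

def term_weight(lc, dl, idf):
    """w_{t,c} per Equations (1)-(2): a Poisson-motivated local
    weight times the square root of the inverse document frequency."""
    tf = 1.0 / (1.0 + math.exp(ALPHA * dl) * LAMBDA ** (lc - 1))
    return tf * math.sqrt(idf)

def doc_vector(tokens, idf):
    """Weight every distinct term of a tokenized document."""
    dl = len(tokens)                       # document length in words
    counts = Counter(tokens)               # local frequencies lc
    return {t: term_weight(lc, dl, idf.get(t, 1.0))
            for t, lc in counts.items()}

def sim(vec_c, vec_d):
    """Sim(c, d) per Equation (3): dot product over shared terms."""
    shared = vec_c.keys() & vec_d.keys()
    return sum(vec_c[t] * vec_d[t] for t in shared)
```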

The pmra model has been found to perform well on MEDLINE documents and is used to compute “related articles” in PubMed. Based on the retrieved related articles for each article in our dataset, we compute the similarity between the journal of the original article and each journal of the top five related articles, using the probability of the journal’s appearance in the related articles list. That is, if one journal has a high probability of being displayed among the top related articles of articles in another journal, the two journals are likely related. We additionally compute a relevance score based on article frequency with Equation (4):

$$Rev(c,d)=\sum_{i=1}^{M}N_i(c,d),\tag{4}$$

where M is the total number of articles published in journal c and Ni(c, d) denotes the number of articles from journal d in the related articles list of the ith article of journal c. Using raw frequency to simplify the probability estimation is reasonable, although more sophisticated estimators could be considered (e.g. normalizing by the total number of papers published in the journal). To compare directly with the results of the other methods, we kept the top 20 relevant journals for each journal.
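
The counting in Equation (4) reduces to a few lines of Python. The sketch below uses illustrative data structures of our own choosing (dicts keyed by journal and article identifiers), not the paper’s actual pipeline.

```python
from collections import defaultdict

def rank_related_journals(articles, related, journal_of, top_k=20):
    """Rank related journals for every journal per Equation (4).

    articles:   dict mapping journal c -> list of its article ids
    related:    dict mapping article id -> up to 5 related article ids
    journal_of: dict mapping article id -> its journal
    Rev(c, d) simply counts how often journal d appears in the
    related-article lists of journal c's articles.
    """
    ranked = {}
    for c, arts in articles.items():
        rev = defaultdict(int)           # Rev(c, d) accumulators
        for a in arts:
            for r in related.get(a, []):
                d = journal_of.get(r)
                if d is not None:
                    rev[d] += 1          # one N_i(c, d) contribution
        # Dropping the journal itself is a design choice; it could
        # equally be kept in the ranked list.
        rev.pop(c, None)
        ranked[c] = sorted(rev, key=rev.get, reverse=True)[:top_k]
    return ranked
```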

Related Journals Identified through Log Analysis

The calculation of related journals assumes a set of user sessions $\{S_i\}_{i=1}^{N}$, where each session $S_i$ consists of a set $\{d_j^i\}_{j=1}^{n_i}$ of citation retrievals, in the form of MEDLINE records examined by the user during that session (Lu, Xie, & Wilbur, 2009). Let A represent a journal and $t_A(S_i)$ denote the number of click-through events on articles from journal A; then

$$T_A=\sum_{i=1}^{N}t_A(S_i).\tag{5}$$

The similarity between journal A and journal B can be measured as the probability of transitioning from articles in journal A to articles in journal B, P(B|A). It can be estimated from the joint occurrence of three independent events in a session $S_i$: the user is looking at an article from journal A (E1), that article is not the last click-through in the session (E2), and the next article the user looks at in the session is from journal B (E3). The three probabilities can be calculated with Equations (6)–(8):

$$P(E_1)=\frac{t_A(S_i)}{T_A},\tag{6}$$

$$P(E_2)=\frac{n_i-1}{n_i},\tag{7}$$

$$P(E_3)=\frac{t_B(S_i)}{n_i-1}.\tag{8}$$

Summing over all sessions, the similarity between journal A and journal B is computed with Equation (9):

$$P(B|A)=\sum_{i=1}^{N}\left(\frac{t_A(S_i)}{T_A}\right)\left(\frac{t_B(S_i)}{n_i-1}\right)\left(\frac{n_i-1}{n_i}\right)=\sum_{i=1}^{N}\frac{t_A(S_i)\,t_B(S_i)}{T_A\,n_i}.\tag{9}$$
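
A minimal sketch of Equation (9), assuming each session has already been reduced to the list of journals whose articles were clicked (the data representation is ours, not the original implementation):

```python
from collections import Counter, defaultdict

def transition_probs(sessions):
    """Estimate P(B|A) per Equation (9) from click sessions.

    sessions: iterable of lists, each listing the journal of every
    article clicked in one user session.
    """
    totals = Counter()                   # T_A for every journal A
    for s in sessions:
        totals.update(s)
    p = defaultdict(float)
    for s in sessions:
        n = len(s)                       # n_i
        if n < 2:
            continue                     # a single click cannot transition
        t = Counter(s)                   # t_X(S_i) for every journal X
        for a, ta in t.items():
            for b, tb in t.items():
                # accumulate t_A(S_i) * t_B(S_i) / (T_A * n_i)
                p[(a, b)] += ta * tb / (totals[a] * n)
    return p
```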

Evaluation Metrics

To measure the quality of the related journals retrieved through the content-similarity-based approach, we use the following metrics.

Pearson Correlation Coefficients

To measure how well the related journals generated from content similarity correlate with the usage-based ranking of journals, we use the article usage data in the query log.

Suppose X = [x1, x2, …, xn] and Y = [y1, y2, …, yn] are the predicted and actual click counts of n articles, respectively. The sample correlation coefficient is used to estimate the Pearson correlation coefficient r between X and Y:

$$r_{xy}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}},\tag{10}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, and $s_x$ and $s_y$ are the sample standard deviations of X and Y. A value of 1 indicates a perfect positive linear relationship between X and Y.
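
For concreteness, Equation (10) reduces to a few lines of Python (equivalent to scipy.stats.pearsonr up to the p-value it also returns):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient, Equation (10)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)
```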

Kendall’s Tau Correlation Coefficients

To measure how well the content-similarity-based approach ranks journals in order of relevance compared to human assessors, the Kendall’s tau correlation coefficient (KTCC) is used; it is defined in Equation (11):

$$\tau=\frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}},\qquad n_0=\frac{n(n-1)}{2},\quad n_1=\sum_i\frac{t_i(t_i-1)}{2},\quad n_2=\sum_j\frac{u_j(u_j-1)}{2},\tag{11}$$

where n is the number of items, nc is the number of concordant pairs, nd is the number of discordant pairs, ti is the number of tied values in the ith group of ties for the first ranking, and uj is the number of tied values in the jth group of ties for the second ranking. A KTCC of 1 means the approach ranked the journals in exactly the same order as human assessors, −1 means exactly the opposite order, and 0 means there is no relationship between the two orderings.
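
Equation (11) is the tie-corrected tau-b variant, which is what scipy.stats.kendalltau computes by default. A short sketch with toy rankings of ten journals (the data is illustrative, not from the study):

```python
from scipy.stats import kendalltau

# Positions of the same ten journals under two orderings:
# the content-similarity-based ranking vs. the reference ranking.
content_rank = [1, 2, 3, 5, 4, 7, 6, 8, 10, 9]
reference_rank = [1, 3, 2, 4, 5, 6, 8, 7, 9, 10]

tau, p_value = kendalltau(content_rank, reference_rank)  # tau-b, ties handled
print(f"KTCC = {tau:.3f} (p = {p_value:.3g})")
```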

IR Features

To measure how accurate an approach is for related journal search, we frame it as an information retrieval task. Specifically, we evaluate how well the retrieved related journals satisfy the actual goal of a user’s search in terms of relevance accuracy and ranking accuracy. The top 20 related journals based on usage of the original article are therefore regarded as the gold standard, and the related journals of the original article obtained by the content-similarity-based approach are the retrieval results. We measure search accuracy using the following metrics.

(i) Precision, recall and F-measure

Precision is the fraction of retrieved documents that are relevant, while recall is the fraction of relevant documents that are retrieved:

$$\text{Precision}=\frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})}=P(\text{relevant}\mid\text{retrieved}),$$

$$\text{Recall}=\frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}=P(\text{retrieved}\mid\text{relevant}).\tag{12}$$

Precision and recall usually trade off against each other: low precision means many retrieved results are not relevant, and low recall means many relevant results are not retrieved. They are therefore usually combined into a single measure, such as the F-measure (F), the weighted harmonic mean of precision and recall, calculated with Equation (13):

$$F=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}.\tag{13}$$

This is also known as the F1 measure, because recall and precision are evenly weighted. In our setting, the number of retrieved results and the number of relevant journals are both 20, so precision, recall, and F1 are identical: the number of relevant journals retrieved divided by 20.
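
A minimal sketch of Equations (12)–(13), showing why the three values coincide in the top-20 setting:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 (Equations (12)-(13))."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved)
    r = hits / len(relevant)
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# With 20 retrieved and 20 relevant journals, p == r == f1, i.e. the
# number of relevant journals retrieved divided by 20.
```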

(ii) Normalized discounted cumulative gain (NDCG)

Given the search results for a journal, we measure how accurately these results are ranked based on a multi-level judgment of relevance (according to the usage ranking) with respect to the actual goal of a user’s search task (mined from the user’s behavior during the search process). NDCG is used to measure ranking accuracy (Järvelin & Kekäläinen, 2002). For a rank position k in the ranked list, NDCG is defined as follows:

$$NDCG(k)=\frac{1}{Z_k}\sum_{p=1}^{k}\frac{2^{s(p)}-1}{\log(1+p)},\tag{14}$$

where s(p) is the relevance score of the document at position p in the ranked list and Zk is a normalization factor chosen so that a perfect ranking yields NDCG(k) = 1. We take the top 20 related journals of the original journal as “the correct results”, and each journal is judged on a scale of 1–20: 20 for the journal ranked first, down to 1 for the journal ranked 20th.
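
A sketch of Equation (14) under our grading scheme. Equation (14) leaves the logarithm base unspecified; base 2 is a common choice and is assumed here.

```python
import math

def ndcg_at_k(ranked, relevance, k):
    """NDCG@k per Equation (14).

    ranked:    journals in the order produced by the approach
    relevance: dict mapping journal -> graded score (20 for the
               top reference journal, down to 1 for the 20th)
    """
    def dcg(scores):
        return sum((2 ** s - 1) / math.log2(1 + p)
                   for p, s in enumerate(scores, start=1))
    gains = [relevance.get(j, 0) for j in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    z_k = dcg(ideal)                     # normalization factor Z_k
    return dcg(gains) / z_k if z_k else 0.0
```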

Results

Journals in PubMed are assigned broad subject terms by the National Library of Medicine (NLM) to describe each journal’s overall scope (Weis, 2013). All of these broad subject terms (about 120) are valid MeSH terms. Because journals are manually classified into broad subjects, we can use these terms to validate the results of the content-similarity-based approach. Here we take the Journal of the American Medical Informatics Association (JAMIA) as an example.

Table 1 shows the top 10 journals related to JAMIA identified by the content-similarity-based approach. The top five related journals are assigned the term “Medical Informatics.” Therefore, if JAMIA were a new journal to the system, it would be automatically cataloged into “Medical Informatics,” which is exactly the broad subject term assigned to it by human indexers. Although the last three journals in Table 1 are less related to this topic, the other terms in Table 1, such as “Computational Biology,” “Health Services Research,” and “Technology,” are closely related to “Medical Informatics.” This indicates that the content-similarity-based approach can cluster related journals and rank them according to their relevance to the topic.

Table 1. Top 10 journals related to JAMIA identified by the content-similarity-based approach.

Journal | Broad subject term(s)
AMIA Annual Symposium Proceedings* | Medical Informatics
BMC Medical Informatics and Decision Making | Medical Informatics
Journal of Biomedical Informatics | Medical Informatics
Studies in Health Technology and Informatics | Health Services Research; Medical Informatics; Technology
International Journal of Medical Informatics | Medical Informatics
BMC Bioinformatics | Computational Biology
Journal of Medical Internet Research | Medical Informatics
Health Technology Assessment | Health Services Research; Technology
Journal of General Internal Medicine | Internal Medicine
Pediatrics | Pediatrics
Journal of the American Medical Informatics Association (JAMIA) | Medicine

Note. *AMIA: The American Medical Informatics Association. AMIA Annual Symposium Proceedings is an e-journal published annually by AMIA.

We then analyze the results of the content-similarity-based approach with respect to the results of the usage-based approach. We re-ran the algorithm of the usage-based approach developed earlier (Lu, Xie, & Wilbur, 2009). Table 2 shows the average scores of CC, KTCC, F1, and NDCG at two truncation positions, 5 (NDCG@5) and 20 (NDCG@20), for journals with different numbers of published papers. The performance of the content-similarity-based approach depends on the number of papers published in a journal. For journals that published fewer than 300 papers in one year, performance improved with the number of published papers. This is reasonable, because identifying related journals from only a few related articles is less accurate.

Table 2. Related journal results evaluation of the content-similarity-based approach using different metrics.

Number of papers published in 12 months | Number of journals | CC | KTCC | F1 | NDCG@5 | NDCG@20
>50 | 3,265 | 0.2617 | 0.0647 | 0.52 | 0.7235 | 0.6654
>100 | 2,161 | 0.3271 | 0.1185 | 0.55 | 0.7583 | 0.7015
>200 | 1,063 | 0.4153 | 0.1767 | 0.59 | 0.7931 | 0.7403
>300 | 599 | 0.4646 | 0.1957 | 0.60 | 0.8003 | 0.7508
>500 | 233 | 0.4591 | 0.1458 | 0.58 | 0.7570 | 0.7161

Note. CC: Pearson correlation coefficient; KTCC: Kendall’s tau correlation coefficient; NDCG: normalized discounted cumulative gain. NDCG@5: NDCG with truncation position 5; NDCG@20: NDCG with truncation position 20.

However, performance was worse for journals that published more than 500 papers in one year. This is probably because such journals have broad scopes and publish many papers in very diverse fields, which makes it difficult to identify related journals for them. For example, PLOS One articles cover more than 10 subject areas, from “Biology and Life Science” to “Social Science.” The average F1 scores around 0.5 indicate that about half of the related journals identified by the content-similarity-based approach are truly related according to users’ search records. However, the low CC and KTCC scores (0.2617 and 0.0647 for all journals, respectively) indicate substantial differences between the rankings produced by the content-similarity-based approach and those of the usage-based approach. On the one hand, the relatively high NDCG@5 scores mean that the top five related articles displayed alongside each original article are likely published in related journals. On the other hand, the relatively low NDCG@20 scores mean that when the “see all…” results of “similar articles” in PubMed are viewed, the full first page of related articles is less consistent with users’ real information needs.

Figure 2 shows the distribution of the NDCG@20 values over all journals. The large number of journals with relatively high ranking accuracy (0.6 < NDCG@20 < 0.9) under the content-similarity-based approach indicates that the approach is appropriate for most journals. The content-similarity-based approach and the usage-based method can complement each other, because their results are nearly identical (NDCG@20 > 0.9) for only a small number of journals. Furthermore, for some journals usage might not reflect content (NDCG@20 < 0.6), because users may click on their articles for other reasons, such as a high journal impact factor, notable or familiar authors, or interesting titles.

Figure 2

Distribution of the NDCG@20 values over all journals.

Discussion

We also qualitatively compared the result lists obtained through the content-similarity-based approach and the usage-based method with the related journal list presented by Pudovkin and Garfield (2002), which was produced using citation data. Given the journal Genetics, Table 3 shows the top 20 related journals identified by the citation-based, usage-based, and content-similarity-based approaches, respectively. In general, the results of the three methods largely overlap. Like Genetics, most of the journals publish articles in the field of genetics and heredity. Although some journals, such as Science, Nature, and PNAS, cover a broader range of scientific disciplines, they mainly focus on the life sciences rather than fields such as computer science.

Table 3. Top 20 journals related to Genetics identified by the citation-based, usage-based, and content-similarity-based methods, respectively.

Citation-based | Usage-based | Content-similarity-based
PNAS | PNAS | PLOS Genetics
Cell | JBC | PNAS
Nature | Nature | MCB
MCB | Science | PLOS One
Science | MCB | Genome Research
TAG | Cell | Nature
Evolution | Development | Eukaryotic Cell
EMBO Journal | Genes & Development | Evolution
Genes & Development | NAR | MBoC
NAR | EMBO Journal | Molecular Ecology
JBC | Current Biology: CB | JBC
MBE | MBoC | Science
Journal of Bacteriology | Developmental Biology | Developmental Biology
MGG | MBE | Cell
Heredity | Nature Genetics | Current Biology: CB
Development | JCB | Genes & Development
Genetical Research | Journal of Bacteriology | Development
Genome | CCM | MBE
JMB | PLOS Genetics | Heredity
JCB | AJHG | AJHG

Note. PNAS: Proceedings of the National Academy of Sciences of the USA; JBC: Journal of Biological Chemistry; MCB: Molecular and Cellular Biology; TAG: Theoretical and Applied Genetics; NAR: Nucleic Acids Research; MBoC: Molecular Biology of the Cell; MBE: Molecular Biology and Evolution; MGG: Molecular & General Genetics; JCB: Journal of Cell Biology; CCM: Critical Care Medicine; JMB: Journal of Molecular Biology; AJHG: American Journal of Human Genetics.

We further examined the most frequent MeSH terms assigned to the articles published in these 32 journals (including Genetics). The Check tags, a special set of MeSH headings that appear in almost every article, such as human, animal, male, female, and child (http://www.nlm.nih.gov/bsd/indexing/training/CHK_010.html), have the highest frequency across all journals in PubMed but are not useful for journal cataloging. The top three MeSH terms (after excluding the Check tags) in the articles of each journal are shown in Table 4; most are related to the field of genetics and heredity, including the most frequently assigned MeSH terms: gene expression, gene expression regulation, and DNA. Since MeSH terms are assigned to articles by human indexers, we can conclude that most journals identified by the three methods are correctly related to the given journal.

Table 4. Top three MeSH terms in the articles of the journals identified by the three methods.

Journal | MeSH #1 | MeSH #2 | MeSH #3
Genetics | Mutation | Models, genetic | Genome
PNAS | DNA | Gene expression | Molecular sequence data
Cell | RNA | DNA | Molecular sequence data
Nature | Research | Research personnel | DNA
MCB | Gene expression | Cell line | Gene expression regulation
Science | DNA | Science | RNA
TAG | Chromosome mapping | Quantitative trait loci | Genes
Evolution | Evolution, molecular | Biological evolution | Selection, genetic
EMBO Journal | DNA | RNA | Gene expression
Genes & Development | Gene expression | Gene expression regulation | Cell line
NAR | DNA | RNA | Internet
JBC | Proteins | Gene expression regulation | Protein binding
MBE | Evolution, molecular | Phylogeny | Genome
Journal of Bacteriology | Bacteria | Gene expression | Gene expression regulation
MGG | Gene expression | Gene expression regulation | DNA
Heredity | Genetic variation | Genetics | Genetics, population
Development | Gene expression | Gene expression regulation | Gene expression regulation, developmental
Genetical Research | Models, genetic | Genotype | Chromosome mapping
Genome | Phylogeny | Genes | Chromosomes
JMB | Models, molecular | Protein binding | Protein conformation
JCB | Cell line | Protein transport | Cells
Current Biology: CB | Drosophila | Biological evolution | Gene expression
MBoC | Protein transport | RNA | Protein binding
Developmental Biology | Gene expression | Gene expression regulation | Gene expression regulation, developmental
Nature Genetics | Genome | Mutation | Polymorphism, single nucleotide
CCM | Intensive care | Intensive care units | Critical illness
PLOS Genetics | Gene expression | Gene expression regulation | DNA
AJHG | Mutation | Genetic predisposition to disease | Pedigree
PLOS One | Gene expression | Gene expression regulation | Cell line
Genome Research | Genome | Gene expression | DNA
Eukaryotic Cell | Fungal proteins | Gene expression | Gene expression regulation
Molecular Ecology | Genetic variation | Genetics, population | Genetics

Note. PNAS: Proceedings of the National Academy of Sciences of the USA; MCB: Molecular and Cellular Biology; TAG: Theoretical and Applied Genetics; NAR: Nucleic Acids Research; JBC: Journal of Biological Chemistry; MBE: Molecular Biology and Evolution; MGG: Molecular & General Genetics; JMB: Journal of Molecular Biology; JCB: Journal of Cell Biology; MBoC: Molecular Biology of the Cell; CCM: Critical Care Medicine; AJHG: American Journal of Human Genetics.

We also found that the ranking based on citation count correlates highly with the ranking based on usage. This is reasonable, because a high click count for an article tends to lead to a high citation count, a phenomenon explored in previous work (Brody, Harnad, & Carr, 2006). It should also be noted that the lag between an article’s publication and its peak citing time is longer than the lag between its publication and its peak clicking time (Mao & Lu, 2013), but this lag is probably not long enough for the scope of a journal to change. Furthermore, we found that journals that do not focus exclusively on genetics and heredity ranked lower in the content-similarity-based approach than in the other two approaches. Similar differences were observed by D’Souza and Smalheiser (2014). This is probably because such journals, e.g. Science and Nature, have quite high impact factors and tend to receive more citations and clicks. The results of the content-similarity-based approach are thus less subject to these human factors, allowing journals with lower impact factors but closer topical ties to the field to rank higher.

Conclusion

This research demonstrates the value of using article content similarity to explore the proximity patterns of biomedical journals, validating that the content-similarity-based approach is useful for clustering related journals.

Further analysis produces several insights. First, the clustering results based on content similarity, Web usage, and citation correlate highly with one another and are consistent with the results of manual cataloging. Second, the results of the content-similarity-based approach are considerably less subject to human factors, so some journals with lower impact factors can be clustered and ranked higher in the related journal list.

In conclusion, this research offers a way of clustering biomedical journals using article content similarity, in addition to the widely used journal citation information. Incorporating the usage and citation information of journals will probably further improve the accuracy of journal clustering. We would like to investigate this in future work and extend this study along the lines of related research, such as Klavans and Boyack (2006).
