Compared with other disciplines, conferences are very important in computer science (Freyne et al., 2010) and they have become the main channel for scientific research and dissemination in the field (Shamir, 2010). Due to the rapid pace of technological innovation in computer science, conferences are particularly suitable for researchers to communicate their findings in a timely manner (Fortnow, 2009). The importance of conferences has prompted scholars to consider the differences between conferences and journals in the field.
To date, various factors that differentiate computer science conferences and journals have been studied. For instance, scholars have analyzed the relationship between publication type (journal, conference) and citation count (Birman & Schneider, 2009; Eckmann et al., 2012; Fernández Izquierdo et al., 2007; Freyne et al., 2010; Qian et al., 2017; Vrettas & Sanderson, 2015; Wainer et al., 2011), publication ranking (CCF A, B, C) and citation count (Freyne et al., 2010; Qian et al., 2017; Vrettas & Sanderson, 2015), authorship and citation count (Qian et al., 2017), publication type and authors (Kumari & Kumar, 2020), and publication type and institutions (Kumari & Kumar, 2020). This research supports a deeper understanding of the differences between computer science conferences and journals from various perspectives. However, these studies were carried out from different, independent perspectives and lack a systematic examination of the connections and interactions among multiple perspectives.
Recently, Sun et al. (2023) made progress in systematically modeling and analyzing the relationships among citation and its influencing factors using a Bayesian network (BN). In that paper, 20 factors related to paper citation are modeled by a BN so that the relationships among citation and the influencing factors are concisely represented (through the network structure and parameters of the BN), and the interactions among them are dynamically recognized (through network reasoning).
Therefore, we investigated the differences between conference papers and journal papers in the field of computer science using a Bayesian network, from the perspective of systematic interaction among multiple factors. We defined the variables required for Bayesian network (BN) modeling, including newly added variables for publication type and CCF classification. Then, we calculated the values and states of each variable from more than 5 million paper records in the Aminer dataset (a literature data set in the field of computer science). Finally, we analyzed the characteristics of conferences and journals from different perspectives and compared our findings with existing conclusions, which yielded some interesting findings.
The remainder of this paper is organized as follows. Section 2 gives some preliminary knowledge of Bayesian networks and the network construction method proposed by Sun et al. (2023). Section 3 describes the Bayesian network construction process. Section 4 presents findings based on inference with the Bayesian network. Section 5 concludes the paper.
A Bayesian network (Pearl, 1988) is defined as a pair (
The first task in using a Bayesian network for knowledge analysis is to determine the network structure and parameters from data samples. One of the most popular approaches to Bayesian structure learning from data is based on searching and scoring. The aim is to search for the network structure that maximizes a scoring function, which represents how well a structure fits a given data set. Among these methods, K2 is one of the most classic and commonly used.
Given a set of variables, data samples, an order of variables, and a maximum number of parents, the K2 algorithm starts from a graph with no edges and adds an edge to the graph if including that edge improves the scoring function more than any other potential edge. The process repeats until no new edge can improve the score or the maximum admissible number of parents is reached. The K2 algorithm uses the marginal likelihood as its scoring function. For a given variable, only variables that precede it in the variable order can be considered as potential parents. The maximum number of parents is used to guarantee a concise representation of the domain knowledge.
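As a rough illustration of the search-and-score procedure described above, the following Python sketch implements the classic K2 algorithm (not the amended version of Sun et al. (2023)). It assumes discrete data given as a list of dictionaries with integer-coded states and uses the logarithm of the Cooper-Herskovits marginal likelihood as the score. The function names `k2_score` and `k2`, the `arity` argument, and the data layout are illustrative assumptions, not taken from the paper.

```python
from math import lgamma


def k2_score(data, child, parents, arity):
    """Log marginal likelihood (Cooper-Herskovits) of `child` given `parents`.

    data   : list of dicts mapping variable name -> integer state (0-based)
    arity  : dict mapping variable name -> number of states
    """
    r = arity[child]
    counts = {}  # parent configuration -> per-state counts of the child
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0] * r)[row[child]] += 1
    score = 0.0
    for n_ijk in counts.values():
        n_ij = sum(n_ijk)
        # log[(r-1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += lgamma(r) - lgamma(n_ij + r)
        score += sum(lgamma(n + 1) for n in n_ijk)
    return score


def k2(data, order, arity, max_parents):
    """Greedy K2 search: returns a dict mapping each variable to its parents."""
    parents = {v: [] for v in order}
    for i, child in enumerate(order):
        best = k2_score(data, child, parents[child], arity)
        candidates = list(order[:i])  # only predecessors may become parents
        while candidates and len(parents[child]) < max_parents:
            scored = [
                (k2_score(data, child, parents[child] + [c], arity), c)
                for c in candidates
            ]
            new_best, pick = max(scored, key=lambda t: t[0])
            if new_best <= best:
                break  # no candidate edge improves the score
            parents[child].append(pick)
            candidates.remove(pick)
            best = new_best
    return parents


# Toy data (made up): B copies A, while C is independent of both.
rows = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 10
structure = k2(rows, ["A", "C", "B"], {"A": 2, "B": 2, "C": 2}, max_parents=2)
print(structure)
```

On this toy data the search keeps only the edge from A to B, because adding any other parent lowers the score. Working in log space with `lgamma` avoids overflow of the factorials for realistic sample sizes.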
It can be seen that the K2 algorithm needs a variable order (which implies expert knowledge of the underlying domain) and tries to find a good enough network structure based on the scoring function. However, in certain situations, there may not be a strict order among variables. Sun et al. (2023) extended the method to handle the case in which no strict order exists among variables. The resulting method is called the amended K2 algorithm and is given in Algorithm 1.
The input of Algorithm 1 includes the set of data samples
In this section, we discuss factors (variables) specific to CS, then describe the underlying data set and the data processing procedure (used to calculate the factor values), and at last, show the learned Bayesian network.
We adopt the same set of factors given in Sun et al. (2023), which can be categorized into author-level, platform-level, internal, outcome, and influence factors. The internal factors are relevant to the paper itself, including novelty (pNov), disruption (pDisrupt), number of references (refNum), text readability (abRE), and text length (abLen). Author-related factors are relevant to the influence and collaboration level of the paper's authors, including the number of published papers of the first author (pNumF) and of the author with the maximum number of published papers (pNumM), total citations of the first author (tcF) and of the author with the maximum total citations (tcM), h-index of the first author (HIF) and of the author with the maximum h-index (HIM), and co-authorship network centrality degree of the first author (auCDF) and of the author with the maximum centrality degree (auCDM). Platform-related factors are relevant to the collaboration of the authors' institutions, including the number of authors (auNum), the number of institutes (instNum), and the cooperation network centrality degree of the first author's institution (instCDF) and of the institution with the maximum cooperation network centrality (instCDM). The influence of a paper is measured with the Category Normalized Citation Impact (CNCI).
Further, since the aim of this paper is to investigate the distinctions between academic journals and conferences in CS, we include a Category indicator for each paper to identify its type (conference or journal). In addition, given the important influence of CCF Rankings① and the fact that conference papers do not have an impact factor (IF), we introduce CCF Rank as an alternative to JIFRank to signify the importance of the paper.
The Aminer dataset is used as the underlying data in this paper for the calculation of the factor values and BN learning (Tang, 2008). The Aminer dataset is a comprehensive collection of academic research papers and citation relationships and has been widely used in various research works relating to academic research evaluations (Abramo et al., 2019; Amjad et al., 2022; Shao et al., 2022; Song et al., 2018). The data set contains information related to 5,354,309 papers and 48,227,950 citation relationships. It is one of the largest datasets available in computer science. The data set provides information on paper identification number (id), title, publication date (year), author details (including identification number (_id), name, institution name (org), and institution identification number (gid)), publication journal information (venue), abstract, citation count (n_citation), reference numbers (references), and complete citation relationships between papers. Based on the information, the factor values are calculated.
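To make the factor calculation more concrete, the sketch below derives a few of the simpler count-based factors (auNum, instNum, refNum, abLen) from a single paper record. The field names follow those listed above, but the nesting (an `authors` list of dictionaries, a `references` list of paper ids), the helper name `basic_factors`, and the choice to measure abLen in characters are assumptions made for illustration. The sample record is invented.

```python
def basic_factors(paper):
    """Compute simple count-based factors from one paper record (a dict)."""
    authors = paper.get("authors") or []
    # Distinct institutions, identified by the institution id `gid` when present.
    institutions = {a["gid"] for a in authors if a.get("gid")}
    return {
        "auNum": len(authors),
        "instNum": len(institutions),
        "refNum": len(paper.get("references") or []),
        # Assumption: text length is measured as abstract length in characters.
        "abLen": len(paper.get("abstract") or ""),
    }


# Invented record, shaped according to the assumed layout.
sample = {
    "id": "p1",
    "title": "An example paper",
    "year": 2015,
    "authors": [
        {"_id": "a1", "name": "Author One", "org": "Inst X", "gid": "g1"},
        {"_id": "a2", "name": "Author Two", "org": "Inst X", "gid": "g1"},
        {"_id": "a3", "name": "Author Three", "org": "Inst Y", "gid": "g2"},
    ],
    "venue": "Example Venue",
    "abstract": "A short abstract.",
    "n_citation": 4,
    "references": ["p7", "p9"],
}
print(basic_factors(sample))
```

Factors such as the h-index or network centrality require aggregating over all records of an author or institution and are not covered by this per-record sketch.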
Before Bayesian network learning, the values of all factors except Category and CCF Rank must be discretized into states. The value of Category can be J or C (for journal or conference), and the value of CCF Rank can be A, B, or C. The discretization rules for the other factors, which are taken from Sun et al. (2023), can be found in Table 1.
Table 1. Discretization rules of factors (Sun et al., 2023).
Variable | Discretization rule |
---|---|
pNov | 0: |
pDisrupt | <0: |
refNum | [0, 10]: |
abRE | >70: |
abLen | <600: short; (600, 800]; |
pNumF | [0, 10]: |
pNumM | |
tcF | [0, 10]: |
tcM | (2000, 10000]: |
HIF | [0, 10]: |
HIM | |
auCDF | sort auCDF/auCDM values and divide by top percentage interval: (50%, 100%]: |
auNum | 1: |
CNCI | (0, 0.3]: |
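Interval rules of the kind listed in Table 1 can be applied with a small lookup helper. The following is a minimal sketch, assuming right-closed intervals. The cut points 600 and 800 for abLen and the label "short" come from the table; the labels `state_2` and `state_3` are placeholders, and the treatment of the boundary value 600 is an assumption. The helper name `discretize` is hypothetical.

```python
from bisect import bisect_left


def discretize(value, upper_bounds, labels):
    """Map a numeric value to a state label using right-closed intervals.

    upper_bounds : sorted cut points, e.g. [600, 800]
    labels       : one label per interval, so len(labels) == len(upper_bounds) + 1
    """
    if len(labels) != len(upper_bounds) + 1:
        raise ValueError("need exactly one more label than upper bounds")
    return labels[bisect_left(upper_bounds, value)]


# Cut points from Table 1 for abLen; labels other than "short" are placeholders.
ABLEN_BOUNDS = [600, 800]
ABLEN_LABELS = ["short", "state_2", "state_3"]

for length in (450, 700, 801):
    print(length, discretize(length, ABLEN_BOUNDS, ABLEN_LABELS))
```

Percentile-based rules, such as the one for auCDF/auCDM, would first require ranking all values and converting each rank to a top-percentage position before applying the same lookup.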
We learn the Bayesian network structure and parameters with the amended K2 algorithm proposed by Sun et al. (2023). As in Sun et al. (2023), author-level factors, platform-level factors, internal factors, and influence factors are arranged in this order as the input variable order of the amended K2 algorithm. In this paper, Category and CCF Rank are treated as internal factors since they are paper-level factors. We use Netica② to visualize the learned Bayesian network, as shown in Figure 1.
As shown in Figure 1, the initial state of the Bayesian network gives the marginal distribution of all the variables, which can be used to get a basic understanding of the field knowledge. For example, as shown in Figure 1, we observed that the
By setting the values of certain variables, the network yields the conditional distributions of the other variables through network reasoning. Figure 2 gives an example in which Category is set to journal. It can be seen that the probability distributions of most variables change only slightly, which suggests that, overall, there may not be a big difference between conferences and journals. For example, if we set Category as
Therefore, we conduct a more fine-grained examination by setting the states of additional variables, in order to draw more specific conclusions.
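The reasoning step used throughout this section, fixing the state of some variables and reading off the updated distributions of the others, can be illustrated on a toy network. The sketch below uses a hypothetical two-node network (Category → X) with invented probabilities that are not results from this paper, and computes conditional distributions by brute-force enumeration of the joint distribution. The paper itself relies on Netica for inference; the function name `query` and all numbers here are assumptions for illustration only.

```python
# Hypothetical two-node network: Category -> X. All numbers are invented.
NAMES = ("Category", "X")
P_CATEGORY = {"C": 0.6, "J": 0.4}
P_X_GIVEN_CATEGORY = {
    "C": {"low": 0.7, "high": 0.3},
    "J": {"low": 0.8, "high": 0.2},
}


def joint():
    """Joint distribution P(Category, X) built from the two tables above."""
    return {
        (c, x): P_CATEGORY[c] * p
        for c, row in P_X_GIVEN_CATEGORY.items()
        for x, p in row.items()
    }


def query(target, evidence):
    """P(target | evidence): sum matching joint entries, then renormalize."""
    idx = NAMES.index(target)
    totals = {}
    for states, p in joint().items():
        assignment = dict(zip(NAMES, states))
        if all(assignment[k] == v for k, v in evidence.items()):
            totals[states[idx]] = totals.get(states[idx], 0.0) + p
    z = sum(totals.values())
    return {state: p / z for state, p in totals.items()}


print(query("X", {}))                   # marginal distribution of X
print(query("X", {"Category": "J"}))    # evidence set on the parent
print(query("Category", {"X": "high"})) # evidence set on the child
```

The last call shows that evidence can be set on any variable, not only on Category: fixing X updates the distribution of Category through Bayes' rule, which is the direction used when an author-level variable is set and Category is observed. Enumeration is exponential in the number of variables, so it is only suitable for small illustrative networks.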
(1) Conferences are more attractive to senior scholars
We use pNumM and HIM as indicators of the academic impact of authors. As shown in Figure 3, when we change Category from
From the perspective of authors, as shown in Figure 4, we set the HIM as
(2) The academic impact (indicated by CNCI) of conference papers is slightly higher than that of journal papers
Although senior scholars generally prefer to publish papers at conferences, the Category Normalized Citation Impact (CNCI) of conference papers is not significantly higher than that of journal papers. As depicted in Figure 5(a), when changing the Category from
Second, by considering CCF Rank, Figure 5(b) shows that when changing the Category from
(3) It is uncertain whether conference papers are more innovative than journal papers
Rapid dissemination of research results is critical for computer science researchers, and conference proceedings are generally published more quickly than traditional journals (Wainer et al., 2011). As a result, it is often assumed that the shorter publication cycle leads researchers to present new results at conferences first, which creates the perception that conferences are more innovative. However, in Figure 7, we can see that when changing the Category from
Further, if we consider both paper type and paper rank, the distributions of pNov and pDisrupt are also basically the same across the different settings of Category and CCF Rank. As shown in Figure 8, across the CCF ranks, the largest difference between conference papers and journal papers in the percentage with mhigh pNov is no more than 0.6%, and in the percentage with high pNov no more than 0.7%. The corresponding differences for pDisrupt are 0.2% and 0.5% for mhigh and high, respectively.
Based on the above results, we think it is uncertain whether conferences are more innovative than journals, if innovation can be represented and measured by pNov and pDisrupt. However, as the most authoritative ranking list in the field of computer science, the CCF rank has indeed guided researchers to submit their higher-quality work (often considered to be more innovative, at least in the authors' own perception) to higher-level conferences or journals. The fact that pNov and pDisrupt do not differ across papers of different CCF ranks may also indicate that the applicability of these indicators in different disciplines and scenarios needs further discussion. That is, we think that both the innovation (disruption) index itself and the question of whether conferences are more innovative (as indicated by a given evaluation index) still need further in-depth research.
Finally, our findings are generally consistent with those of Sun et al. (2023), which were drawn in the field of physics. For example, we also find that researchers have more influence on the impact of research works than institutes do, and that moderately innovative work can acquire more academic impact than low-innovation or high-innovation work. This gives further evidence that Bayesian networks can be well applied to analyzing issues in Scientometrics.
In this paper, we investigated the differences between conference papers and journal papers in the field of computer science using a Bayesian network. We defined the variables required for Bayesian network (BN) modeling, calculated the values and states of the variables based on the Aminer dataset, and analyzed the characteristics of conferences and journals from different perspectives. We found that (1) conferences in the field of computer science are more attractive to senior scholars; (2) overall, the academic impact (indicated by CNCI) of conference papers is slightly higher than that of journal papers, while CCF C journal papers can get more impact than CCF C conference papers; and (3) it is not certain whether conference papers are more innovative than journal papers.
We believe the results of this paper give further evidence that Bayesian networks can be well applied to analyzing issues in Scientometrics. Future work should focus on making the framework more sophisticated, e.g., by extending the variable set to include more related factors, incorporating more reliable expert knowledge (on the variable order), and introducing causal relationships.