Open Access

An author credit allocation method with improved distinguishability and robustness

and | 25 Aug 2023


Introduction

Collaboration has become a common practice in scientific production (Hara et al., 2003; Hu et al., 2020; Newman, 2004; Regalado, 1995). While scientific collaboration promotes publication output and the work’s impact (Dong et al., 2017; Pravdić et al., 1986; Wu et al., 2019), it brings difficulties in determining who gets how much credit for the team’s work (Hodge et al., 1981; Zeng et al., 2017). This gives rise to the problem of credit allocation, which aims to quantitatively calculate the share of credit received by each author (Allen et al., 2014; Kennedy, 2003). Moreover, since most important decisions in science, such as job appointments, funding applications, tenure promotions, and award nominations, are based on the achievements of the individual (Jones, 2011), it is especially important for credit allocation methods to accurately and confidently identify the leading author, or owner, of a collaborative work (Rennie et al., 1994; Smolinsky et al., 2020): the one who gets the most credit among all authors.

In practice, the science community has developed several ways to solve the credit allocation problem. One is to grant all co-authors in the team the same credit for the work (Burrell et al., 1995; Oppenheim, 1998; Price, 1981; Rennie et al., 1994). Equivalently, any co-author can claim himself/herself the owner of the work. This convention is most common in fields like Mathematics, Economics, and Particle Physics, where authors’ names are usually listed alphabetically in the paper (Endersby, 1996; Frandsen et al., 2010; Waltman, 2012).

The other way takes the byline order of authors into consideration (Beveridge et al., 2007; Riesenberg et al., 1990; Zuckerman, 1968). The last author, who is usually the corresponding author, and the first author take a greater share of the credit. Given the observation that the first and last authors usually do most of the work (Tscharntke et al., 2007), it seems fair to make them the owners of the paper. While these two approaches have their pros and cons (Das et al., 2020; J. Xu et al., 2016), they are widely applied in the science community for their simplicity and feasibility.

Recently, an algorithm-based approach was proposed that tackles the credit allocation problem from a new angle (Shen & Barabási, 2014). The leading author of a paper is determined not by factors of the paper itself, such as the byline order of authors, but by the perception of the scientific community. By analyzing how the target paper and papers by the team members are co-cited, the leading author can be quantitatively identified: the one most appreciated by peers in the community and recognized as the leader of the teamwork. The original method and its variants (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Xing et al., 2021) have been tested on empirical data and demonstrate high accuracy in identifying Nobel Prize laureates within the team that made the scientific discovery together.

However, existing methods present several limitations that can be further addressed. First, it is important to determine the weights of different papers in the citation network (H. Huang et al., 2022; Waltman et al., 2013). The number of citations, widely accepted as a measure of a paper’s recognition (Eysenbach, 2006; L. L. Xu et al., 2021; Waltman, 2016; Radicchi et al., 2008; Sinatra et al., 2016), is considered by some methods and improves their performance. But given the scale-free nature of the citation distribution (Barabási, 2009; Barabási et al., 2001; Barabási et al., 2003), the weights of papers can be heavily skewed towards a few highly cited ones. Second, the robustness of the method needs to be further verified and strengthened. As one can publish papers that intentionally cite the target paper and ignore other related co-publications, the methods have to be robust to such manipulations. Finally, to identify the leading author of a paper, we would prefer that the best candidate be distinguishable from the others. If a method gives close or tied scores to the candidates in the first and second positions, the leading author may be indistinguishable. This issue can be critical when dealing with large-scale data. For instance, if one wants to investigate the correlation between the leading author of a paper and his/her byline order in the author list (Lu et al., 2022), the method has to ensure that the identified owner of a paper is well separated from the other co-authors.

To address the aforementioned issues, we propose a new allocation algorithm called NCCAS. We use a modified Sigmoid function that transforms citation counts into weights of different papers, which alleviates the issue of the fat-tailed citation distribution. Furthermore, we remove the target paper from the credit calculation, which improves the distinguishability of the identified leading author. Following the testing framework of previous studies, we validate the accuracy of NCCAS using Nobel Prize-winning papers and their citation networks in the American Physical Society (APS) and Microsoft Academic Graph (MAG) datasets. In addition, we use 654,148 articles published in the field of computer science from 2000 to 2009 in the MAG dataset to validate the distinguishability and robustness of NCCAS. Compared with baseline methods, NCCAS not only identifies Nobel laureates in their collaborative papers more accurately, but also assigns higher scores to them, making the leading author of the work more distinguishable from the other candidates. In addition, NCCAS is more robust against additional citations in hypothetical malicious manipulation experiments. Given NCCAS’s capability in handling large-scale publication data and its strong performance, it has the potential to be a useful tool in other studies related to the different roles or contributions of authors.

Related work

The first collective credit allocation (CCA) algorithm was introduced by Shen and Barabási (2014). As shown in Figure 1, assume a target paper p0 is co-authored by author A and author B. To identify which author is recognized more by the community, we first identify all papers that cite the target paper. Through the set of citing papers, we further identify the set of papers that are co-cited with the target paper and are co-authored by at least one of the two authors. From the set of co-cited papers, we get an authorship involvement matrix A that records the fractional share of authorship of the two authors in each co-cited paper. Furthermore, we get a strength vector s that assigns different weights to co-cited papers. In CCA, each element of s is the number of co-citations the corresponding co-cited paper receives with the target paper. The product of the matrix A and the vector s gives the total scores of authors A and B, which represent the overall appearance of each author in the related co-cited literature weighted by its importance.

Figure 1.

Collective Credit Allocation (CCA) algorithm. The credits of author A and author B are given by the product of the author involvement matrix A and the co-citation strength vector s.

Two extreme cases may help us understand the idea behind the CCA algorithm. If author A does not participate in any co-cited papers, the algorithm will assign author B as the leading author of the target paper. If author A and author B always appear together in co-cited papers, the algorithm will designate both authors as the owners of the target paper. In general, the ownership identified falls somewhere between these two extreme examples. Author A and author B have some research collaborations, as well as collaborations with other authors on similar research topics.
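These extreme cases can be checked directly with a minimal sketch of the CCA computation. The matrices below are hypothetical toy data, not taken from the paper:

```python
def cca_credit(A, s):
    """CCA credit: the matrix-vector product c = A.s, normalized to shares.

    A : rows are the target paper's authors, columns are co-cited papers;
        entry A[i][p] is author i's fractional authorship share in paper p.
    s : co-citation counts of each co-cited paper with the target paper.
    """
    c = [sum(a_ip * s_p for a_ip, s_p in zip(row, s)) for row in A]
    total = sum(c)
    return [x / total for x in c]

# Extreme case 1: author A appears in no co-cited paper -> B gets all credit.
A1 = [[0.0, 0.0],
      [1.0, 0.5]]
print(cca_credit(A1, [3, 2]))  # [0.0, 1.0]

# Extreme case 2: A and B always appear together with equal shares -> a tie.
A2 = [[0.5, 0.5],
      [0.5, 0.5]]
print(cca_credit(A2, [3, 2]))  # [0.5, 0.5]
```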

Since the development of CCA, other variants have been introduced, aiming to improve the allocation accuracy from different perspectives. All of these algorithms focus on modifying the involvement matrix A and the strength vector s. CCA considers all citing papers to be equally important; hence, each element of vector s is the number of citing papers that cite the co-cited paper, or equivalently the number of co-citations. In the NCCA algorithm, F. H. Wang et al. (2019) consider that citing papers with more citations are more important. The element value of s is calculated by summing up the citations of the citing papers that cite the co-cited paper. In the DCA algorithm, Bao et al. (2017) adjust the involvement matrix A by multiplying it by a residual influence Φ. The idea is that the importance of a paper gradually decreases with time, which is captured by Φ. DCA uses the same strength vector s as CCA. The final credits of the co-authors are determined by the product of the author involvement matrix A, the residual influence Φ, and the strength vector s. J. P. Wang et al. (2019) propose the IDCA method, which uses the same residual influence Φ as DCA but modifies the strength vector s: the element value of s is calculated by summing over the product of the PageRank value and the citation number of each citing paper. Xing et al. (2021) propose the CoCA method, which modifies the collection of papers in matrix A and the values of the strength vector s. CoCA focuses on the subsequent works by co-authors of the target paper (papers authored by at least one of the co-authors). These subsequent papers generate the involvement matrix A. For each subsequent paper, its weight in s is the number of times that the subsequent paper and the target paper use the same reference.

While all these methods demonstrate reasonable accuracy in determining the Nobel laureates, they present several limitations that can be addressed. First, the number of citations (or co-citations) is commonly used to calculate the strength vector s. But the number of citations usually follows a scale-free distribution; therefore, s can be skewed towards some extreme values. Second, most of these methods are tested on citation networks composed of APS publications. It is important to extend the analyses to a bigger and more comprehensive dataset such as the MAG dataset. Third, these methods usually give very close credit scores to different co-authors, making the leading author of the paper not very distinguishable from the others. Finally, as citations can be intentionally added to the target paper, it is important to perform robustness tests, which are not well emphasized in past studies.

Data

We utilize two mainstream scholarly datasets in this paper. The APS dataset is from 11 Physical Review journals launched by the American Physical Society, covering the period from 1893 to 2009, with a total of 463,332 papers, 4,710,547 citation pairs, and 247,974 author records. The Microsoft Academic Graph (MAG) dataset covers 71 academic fields, with 224,876,396 papers, 1,492,784,521 citation pairs, and 236,021,246 author records. To avoid author name ambiguity, we apply rule-matching name disambiguation (Jia et al., 2017; Sinatra et al., 2016) to the author records in the APS dataset: each author is represented uniformly by last name, first name, and middle-name initials. For the MAG dataset, we directly use the disambiguated author names in the data.

We select the Nobel Prize-winning papers in Physics used in previous studies (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Shen et al., 2014; Xing et al., 2021) as the validation set to verify the accuracy of our algorithm. In total, 30 Nobel Prize-winning papers were considered by past studies. After removing papers with only one author and papers for which all co-authors received the prize, 24 multi-author papers were retrieved in the APS dataset and 22 in the MAG dataset. Details can be found in Appendix Tables S1 and S2.

At the same time, we extract 654,148 papers in the field of computer science in the MAG dataset to validate the distinguishability of identified paper owners.

Method

In this work, we propose the Nonlinear Collective Credit Allocation with a modified Sigmoid (NCCAS) algorithm (Figure 2). NCCAS is similar to NCCA in that it takes the citation numbers of the citing papers into consideration. Compared with other related methods, NCCAS introduces two major modifications.

Figure 2.

Nonlinear Collective Credit Allocation with Sigmoid Function (NCCAS) algorithm based on the co-citation network. p0 is the target paper, pk (k=1~4) are the co-cited papers, and d1~d5 cite both p0 and p1~p4. xm is the number of citations dm receives. The matrix A records the authorship involvement of the co-authors in the co-cited paper pk. The co-citation strength vector s captures the co-citation strength from the citing papers dm to the target paper p0 and the co-cited paper pk. S(x) is the modified Sigmoid Function. After obtaining matrix A and vector s, the values of the co-authors’ contributions C in the target paper p0 are computed by C=As with a normalization, where P is the co-cited paper set, ci is the credit of co-author ai in target paper p0.

First, to handle extreme values in citation counts and to keep the element values of s bounded, we introduce a modified Sigmoid function (Eq. (1)) that transforms the citation count into the weight of a paper. The parameter α in Eq. (1) is the average number of citations that the citing papers receive. In the ablation experiment of this paper, we show that the average value gives the most accurate identification of the Nobel laureates compared with the median and mode values. The parameter n is the average number of citations of all papers considered; its value varies with the task and dataset. For example, for the task of identifying Nobel laureates, we have n = 20.813 in the APS dataset and n = 22.383 in the MAG dataset; when applying NCCAS to the 654,148 papers in computer science, we have n = 13.086. In general, the parameter α sets the scale of the citing papers of a given target paper, and n sets the scale of the whole dataset considered.

\[S(x)=\frac{1}{1+e^{-x/\alpha}}\cdot \mathrm{Factor},\qquad \mathrm{Factor}=\begin{cases}\dfrac{1}{\alpha}, & x=0 \ \text{or}\ x\ge n,\\[6pt] \dfrac{x}{\alpha}, & 0<x<n.\end{cases}\tag{1}\]

With Eq. (1) rescaling the citation counts, the element values of the strength vector s are calculated as \(s_i=\sum_{j\in J}S(x_j)\), where i indexes a co-cited paper, J is the set of citing papers that cite the co-cited paper i, and \(x_j\) is the number of citations of citing paper j.
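A direct transcription of Eq. (1) and the strength-vector element makes the boundedness claim easy to check. The parameter values below are illustrative, not from the paper:

```python
import math

def S(x, alpha, n):
    """Modified Sigmoid of Eq. (1): rescales a citing paper's citation count x.

    alpha : average citation count of the citing papers of the target paper.
    n     : average citation count over the whole dataset considered.
    """
    factor = (1 / alpha) if (x == 0 or x >= n) else (x / alpha)
    return factor / (1 + math.exp(-x / alpha))

def strength_element(citing_citations, alpha, n):
    """s_i = sum of S(x_j) over the citing papers j of co-cited paper i."""
    return sum(S(x, alpha, n) for x in citing_citations)

# With n = 20, a citing paper with 100 citations falls in the 1/alpha branch,
# so it is weighted *below* one with 10 citations: extreme counts are capped.
print(S(10, 5.0, 20.0), S(100, 5.0, 20.0))
```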

The second modification is the removal of the target paper from the involvement matrix A and the strength vector s. This is because all citing papers are, by construction, associated with the target paper; hence, the target paper would have the largest value in the strength vector s. Consequently, keeping the target paper in A would dilute the contributions of the co-cited papers, making the credit score C less sensitive to them. In the ablation experiment, we show that removing the target paper from the calculation of C makes the identified leading author more distinguishable from the others.

Taken together, as shown in Figure 2, the procedure of NCCAS is as follows:

Find all the papers that cite the target paper p0, forming the set of citing papers D=(d1, d2, …, dm).

Find all the papers that are authored by at least one author of the target paper. Among them, find the set of co-cited papers, P=(p1, p2, …, pk), that are co-cited with the target paper by the citing papers.

Calculate the strength vector s for the co-cited papers. The element value si is the sum of the transformed weight of citing papers that cite the co-cited paper i. For example, for the co-cited paper p2 in Figure 2, we have s2=S(x1)+S(x2)+S(x3)+S(x5), as four papers (d1, d2, d3, d5) cite both the target paper p0 and the co-cited paper p2.

Generate the authorship involvement matrix A by calculating the authorship credit of an author in co-cited papers. The authorship credit value is calculated using the fractional counting method. If author a, one of the authors in the target paper, appears in a co-cited paper with a total of three authors, the authorship credit of author a in this co-cited paper is 1/3.

The credit shares of all co-authors in the target paper p0 are given by vector C=As. By normalizing vector C, we can obtain the credit shares of co-authors in fractional form.
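The five steps can be sketched end to end as follows. This is a minimal illustration on hypothetical toy data, not the authors’ reference implementation:

```python
import math

def nccas(target_authors, cocited, citations_of, alpha, n):
    """NCCAS sketch.

    target_authors : authors of the target paper p0.
    cocited        : {paper_id: (author_list, citing_paper_ids)} for papers
                     co-cited with p0 and authored by at least one target author.
    citations_of   : {citing_paper_id: its citation count x_m}.
    """
    def S(x):  # modified Sigmoid, Eq. (1)
        factor = (1 / alpha) if (x == 0 or x >= n) else (x / alpha)
        return factor / (1 + math.exp(-x / alpha))

    credit = {a: 0.0 for a in target_authors}
    for authors, citing in cocited.values():
        s_i = sum(S(citations_of[d]) for d in citing)       # step 3
        for a in set(authors) & set(target_authors):
            credit[a] += s_i / len(authors)                 # step 4: fractional share
    total = sum(credit.values()) or 1.0
    return {a: c / total for a, c in credit.items()}        # step 5: normalization

# Hypothetical data: target paper by "A" and "B"; two co-cited papers.
cocited = {"p1": (["A", "X"], ["d1", "d2"]),
           "p2": (["A", "B", "Y"], ["d2", "d3"])}
cites = {"d1": 3, "d2": 0, "d3": 25}
shares = nccas(["A", "B"], cocited, cites, alpha=5.0, n=20.0)
print(shares)  # A's share exceeds B's: A appears in both co-cited papers
```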

Results
Validation

The scientific community lacks a unified standard for determining authorship attribution, making it difficult to find instances with known credit allocation results to validate the allocation method. Previous studies (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Shen et al., 2014; Xing et al., 2021) consider papers receiving the Nobel Physics Prize as the validation set to verify the accuracy of the algorithms. Indeed, the choice of Nobel Prize winners reflects the scientific community’s recognition: the author who makes the most contribution to the Nobel Prize-winning papers is considered as the rightful owner of the scientific achievement (Lo, 2013).

Because the Nobel Physics Prize can be awarded to a maximum of three authors and two different scientific discoveries each year, there are cases when multiple authors of a Nobel Prize-winning paper receive the award. In addition, previous studies use different evaluation criteria. For example, if the algorithm gives two tied candidates and one of them is the Nobel laureate, do we consider the result 100% accurate or just 50% accurate? Likewise, if a paper has two Nobel laureates and the top two candidates given by the algorithm cover only one of them, how do we gauge the accuracy of the allocation result? To unify these cases and provide a comprehensive comparison of different algorithms, we consider two scenarios in this paper.

Whole counting

The output of the algorithm is considered correct if one of the identified leading authors of the paper is the true Nobel laureate. For example, if the algorithm gives two tied candidates and one of them is the true Nobel laureate, or if two authors of the paper are Nobel laureates and the algorithm identifies one of them among the top two candidates, we consider the output of the algorithm one hundred percent correct. This is the most commonly adopted evaluation criterion in past studies.

Fractional counting

We use a variant of the Jaccard similarity to calculate the fractional overlap between the algorithm output and the ground truth. Assuming that the algorithm predicts a set of leading authors M1 and the set of true Nobel laureates is M2, we evaluate the result as \(\frac{|M_1\cap M_2|}{|M_1|}\). Fractional counting provides a more comprehensive evaluation when there are tied owners or multiple Nobel laureates for a given paper.
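The fractional criterion is straightforward to implement; a sketch:

```python
def fractional_accuracy(predicted_leaders, laureates):
    """Jaccard-style overlap |M1 ∩ M2| / |M1| used in fractional counting.

    predicted_leaders : set M1 of (possibly tied) top-scoring authors.
    laureates         : set M2 of true Nobel laureates on the paper.
    """
    m1, m2 = set(predicted_leaders), set(laureates)
    return len(m1 & m2) / len(m1)

# Two tied candidates, one of whom is the laureate: scored 0.5 instead of 1.0.
print(fractional_accuracy({"A", "B"}, {"A"}))  # 0.5
```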

Based on the two evaluation criteria, we compare the number of papers whose leading author is successfully recognized by NCCAS and the baseline methods. The statistics are shown in Table 1, and the detailed values for each paper are shown in Appendix Tables S1 and S2. Note that only 22 papers in the APS dataset and six papers in the MAG dataset are applicable to CoCA, so we also report percentage accuracies for CoCA and NCCAS.

The number of papers identified by different allocation methods under the “Whole” and “Fractional” evaluation criteria.

| Criterion | Dataset | CCA | NCCA | DCA | CoCA | NCCAS |
| --- | --- | --- | --- | --- | --- | --- |
| Whole | APS (24) | 17 | 20 | 20 | 17 (77.27%) | 20 (83.33%) |
| Whole | MAG (22) | 18 | 18 | 18 | 4 (66.67%) | 19 (86.36%) |
| Fractional | APS (24) | 16.08 | 19.08 | 17.58 | 15.25 (69.32%) | 19.58 (81.58%) |
| Fractional | MAG (22) | 17.67 | 16.33 | 14.31 | 4 (66.67%) | 18.67 (84.86%) |

The proposed NCCAS performs well on both datasets under both evaluation criteria. NCCAS is based on NCCA, but the introduction of the Sigmoid function improves the identification accuracy. This is mainly because NCCAS can take into account citing papers with zero citations, whose contributions NCCA ignores. The removal of the target paper makes the leading author more distinguishable and effectively reduces tied outputs. Hence, the performance of NCCAS is also stable under fractional counting.

Distinguishability

To quantify the extent to which the first and second candidates are separated, we consider the difference between the highest and second-highest credit shares (denoted as Δ). The larger the value of Δ, the easier it is to distinguish the leading author.
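Given a list of normalized credit shares, Δ is simply the gap between the top two values; a one-line sketch:

```python
def delta(credit_shares):
    """Distinguishability Δ: highest minus second-highest credit share."""
    top = sorted(credit_shares, reverse=True)
    return top[0] - top[1]

print(delta([0.55, 0.30, 0.15]))  # ≈ 0.25; a tie such as [0.5, 0.5] gives 0.0
```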

We first analyze the Δ values for Nobel Prize-winning papers in the APS dataset (Table 2). In most cases, the Δ given by NCCAS is the largest; in the few cases where it is not, it is the second largest. It is also noteworthy that, except for one paper where all algorithms give the same tied result, NCCAS produces no ties, while NCCA, DCA, and CoCA all yield ties on different papers. Table 2 clearly shows that NCCAS makes the leading author of a Nobel Prize-winning paper more distinguishable than the other methods do.

Δ by different methods for the Nobel Prize-winning papers in Physics (APS dataset).

| DOI | CCA | NCCA | DCA | CoCA | NCCAS |
| --- | --- | --- | --- | --- | --- |
| 10.1103/PhysRevLett.76.1796 | 0.033 | 0.013 | 0.001 | 0.046 | 0.094 |
| 10.1103/PhysRevLett.84.5102 | 0.021 | 0.014 | 0.006 | 0.031 | 0.035 |
| 10.1103/PhysRevLett.77.4887 | 0.002 | 0.002 | 0.001 | 0.011 | 0.004 |
| 10.1103/PhysRevLett.55.48 | 0.19 | 0.12 | 0.149 | 0.268 | 0.385 |
| 10.1103/PhysRevLett.61.2472 | 0.152 | 0.067 | 0.005 | 0.156 | 0.72 |
| 10.1103/PhysRevLett.75.3969 | 0.042 | 0.004 | 0.002 | 0.122 | 0.345 |
| 10.1103/PhysRevLett.13.138 | 0.02 | 0.002 | 0.0 | 0.0 | 0.154 |
| 10.1103/PhysRev.69.37 | 0.204 | 0.051 | 0.257 | 0.25 | 0.561 |
| 10.1103/PhysRev.83.333 | 0.009 | 0.002 | 0.015 | 0.022 | 0.183 |
| 10.1103/PhysRevLett.61.169 | 0.05 | 0.023 | 0.059 | 0.053 | 0.263 |
| 10.1103/PhysRevLett.13.321 | 0.006 | 0.0 | 0.002 | 0.0 | 0.6 |
| 10.1103/PhysRev.122.345 | 0.078 | 0.01 | 0.002 | 0.364 | 0.908 |
| 10.1103/PhysRevLett.57.2442 | 0.077 | 0.018 | 0.003 | 0.193 | 0.306 |
| 10.1103/PhysRevLett.84.3232 | 0.052 | 0.032 | 0.008 | 0.076 | 0.193 |
| 10.1103/PhysRevLett.20.1205 | 0.128 | 0.005 | 0.071 | 0.167 | 0.698 |
| 10.1103/PhysRevLett.58.1490 | 0.003 | 0.0 | 0.023 | - | 0.012 |
| 10.1103/PhysRevLett.48.1559 | 0.06 | 0.081 | 0.006 | 0.319 | 0.297 |
| 10.1103/PhysRevLett.61.826 | 0.025 | 0.013 | 0.032 | 0.048 | 0.235 |
| 10.1103/PhysRevLett.35.1489 | 0.002 | 0.001 | 0.001 | 0.0 | 0.006 |
| 10.1103/PhysRevLett.9.439 | 0.0 | 0.0 | 0.0 | - | 0.0 |
| 10.1103/PhysRev.72.241 | 0.182 | 0.036 | 0.144 | 0.25 | 0.636 |
| 10.1103/PhysRev.112.1940 | 0.008 | 0.032 | 0.0 | 0.368 | 0.654 |
| 10.1103/PhysRev.73.679 | 0.024 | 0.003 | 0.004 | 0.071 | 0.066 |
| 10.1103/PhysRevD.5.528 | 0.088 | 0.001 | 0.006 | 0.241 | 0.059 |

To go beyond the Nobel Prize-winning papers and assess how distinguishable the identified leading author is in general, we select 654,148 papers in computer science and calculate the Δ values given by the different methods (Figure 3). The results for all years are shown in Appendix Figure S1. The Δ values given by CCA, NCCA, DCA, and CoCA tend to be concentrated in the [0, 0.3] interval, whereas over 40% of the results given by NCCAS have Δ values greater than 0.5. Hence, for general publication data, NCCAS still makes the leading author of a paper distinguishable from the others.

Figure 3.

The distribution of Δ by different methods for papers published in different years.

Robustness

As one can intentionally create citing papers that cite the target and co-cited papers, it is important to check the robustness of the results given by different methods under malicious manipulation. For example, F. H. Wang et al. (2019) show that adding a small number of papers citing the target paper can significantly change the authorship credit assigned by the CCA method. For this reason, we consider a simple experiment in which papers with a certain number of citations are added, each citing the target paper and one randomly chosen co-cited paper. We measure the number of added papers needed to invert the credit ranking, i.e., to make the author originally with the second-highest score become the leading author. We consider two cases: added papers with 20 citations and added papers with one citation. The choice of 20 citations is based on the fact that the average number of citations in the APS dataset is about 20; the case of one citation reflects the fact that the average citation count one year after publication is about one in the APS dataset. Note that only NCCAS and NCCA consider the number of citations; CCA and DCA consider all citing papers to be equally important and hence are not affected by the citing papers’ citation counts.
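The manipulation experiment can be sketched as follows. For brevity, the sketch scores authors with a simple CCA-style weighting (which ignores the added papers’ own citation counts); the data and the `fake*` identifiers are hypothetical:

```python
import random

def cca_scores(target_authors, cocited):
    """CCA-style credit: each co-cited paper weighted by its co-citation count."""
    credit = {a: 0.0 for a in target_authors}
    for authors, citing in cocited.values():
        for a in set(authors) & set(target_authors):
            credit[a] += len(citing) / len(authors)
    return credit

def papers_to_invert(target_authors, cocited, max_added=10_000, seed=0):
    """Add fabricated citing papers, each co-citing the target paper and one
    randomly chosen co-cited paper, until the top-two credit ranking flips;
    return how many additions were needed."""
    rng = random.Random(seed)
    cocited = {p: (a, list(c)) for p, (a, c) in cocited.items()}  # local copy
    ranked = sorted(cca_scores(target_authors, cocited).items(), key=lambda kv: -kv[1])
    leader, runner_up = ranked[0][0], ranked[1][0]
    for k in range(1, max_added + 1):
        chosen = rng.choice(list(cocited))
        cocited[chosen][1].append(f"fake{k}")
        scores = cca_scores(target_authors, cocited)
        if scores[runner_up] > scores[leader]:
            return k
    return max_added

# Hypothetical example: author A initially leads author B by two co-citations,
# so at least three fabricated papers must land on B's co-cited paper.
toy = {"p1": (["A"], ["d1", "d2", "d3"]), "p2": (["B"], ["d4"])}
print(papers_to_invert(["A", "B"], toy))
```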

We first focus on the 24 Nobel Prize-winning papers in the APS dataset. Because the added papers cite the co-cited papers randomly, we repeat the experiment ten times and show in Table 3 the average number of papers needed to invert the original credit ranking. Methods that take the number of citations into account (NCCAS and NCCA) are more robust than those that do not (CCA and DCA), in line with conclusions of previous studies (F. H. Wang et al., 2019). In most cases, NCCAS is more robust than NCCA. We also examine the cases where NCCAS is less robust than NCCA (four papers: 10.1103/PhysRevLett.55.48, 10.1103/PhysRev.69.37, 10.1103/PhysRev.83.333, and 10.1103/PhysRev.73.679). We find that these papers have citing papers with extreme numbers of citations. Because the original citing papers’ citation counts are very high, adding papers with 1 or 20 citations does not significantly change the strength vector; therefore, in these cases NCCA is more robust than NCCAS.

The number of added papers with different citations to invert the credit rank for 24 Nobel Prize-winning papers.

| DOI | CCA | DCA | NCCA (20 cit.) | NCCAS (20 cit.) | NCCA (1 cit.) | NCCAS (1 cit.) |
| --- | --- | --- | --- | --- | --- | --- |
| 10.1103/PhysRevLett.76.1796 | 66 | 42 | 106 | 132 | 2,109 | 10,018 |
| 10.1103/PhysRevLett.84.5102 | 15 | 6 | 10 | 28 | 190 | 1,348 |
| 10.1103/PhysRevLett.77.4887 | 5 | 5 | 2 | 15 | 39 | 1,943 |
| 10.1103/PhysRevLett.55.48 | 53 | 15 | 164 | 29 | 3,264 | 1,294 |
| 10.1103/PhysRevLett.61.2472 | 565 | 580 | 808 | 1,208 | 10,100 | 10,100 |
| 10.1103/PhysRevLett.75.3969 | 475 | 446 | 982 | 985 | 10,100 | 10,100 |
| 10.1103/PhysRevLett.13.138 | 44 | 35 | 42 | 66 | 823 | 838 |
| 10.1103/PhysRev.69.37 | 51 | 30 | 107 | 40 | 2,135 | 2,031 |
| 10.1103/PhysRev.83.333 | 4 | 1 | 32 | 1 | 34 | 1 |
| 10.1103/PhysRevLett.61.169 | 37 | 4 | 7 | 53 | 135 | 2,860 |
| 10.1103/PhysRevLett.13.321 | 5 | 1 | 3 | 4 | 50 | 966 |
| 10.1103/PhysRev.122.345 | 443 | 293 | 737 | 757 | 10,100 | 10,100 |
| 10.1103/PhysRevLett.57.2442 | 146 | 150 | 171 | 252 | 3,413 | 10,100 |
| 10.1103/PhysRevLett.84.3232 | 15 | 11 | 17 | 25 | 322 | 468 |
| 10.1103/PhysRevLett.20.1205 | 96 | 96 | 106 | 174 | 2,120 | 10,100 |
| 10.1103/PhysRevLett.58.1490 | 20 | 21 | 17 | 52 | 47 | 10,100 |
| 10.1103/PhysRevLett.48.1559 | 196 | 197 | 314 | 369 | 6,261 | 10,100 |
| 10.1103/PhysRevLett.61.826 | 20 | 18 | 57 | 88 | 1,127 | 10,100 |
| 10.1103/PhysRevLett.35.1489 | 9 | 13 | 17 | 26 | 327 | 2,881 |
| 10.1103/PhysRevLett.9.439 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10.1103/PhysRev.72.241 | 99 | 88 | 179 | 212 | 3,569 | 10,100 |
| 10.1103/PhysRev.112.1940 | 18 | 16 | 31 | 33 | 604 | 1,656 |
| 10.1103/PhysRev.73.679 | 37 | 1 | 89 | 74 | 1,762 | 1,655 |
| 10.1103/PhysRevD.5.528 | 68 | 23 | 55 | 57 | 1,093 | 3,219 |

We further run the experiment on the 654,148 computer-science papers selected from the MAG dataset. We add citing papers with 13 citations, the average number of citations in the data analyzed. In Figure 4, we show the distribution of the number of papers needed to invert the original credit ranking. The results for all years are shown in Appendix Figure S2. The distributions for CCA and DCA are concentrated on small numbers of added papers, while the distributions for NCCA and NCCAS have a high proportion at large numbers of added papers. The robustness of NCCAS is higher than that of NCCA, as the distribution of NCCAS is more skewed towards large numbers of added papers. The analysis on large-scale data is in line with the results on Nobel Prize-winning papers: NCCAS is in general more robust to manipulation than the baseline methods.

Figure 4.

The distribution of the number of added papers by different methods for papers published in different years.

Ablation experiment

In this section, we perform experiments to justify the modifications that NCCAS makes and the choice of some of its parameters.

Removal of the target paper

A major modification made in NCCAS is that the target paper is not included in the author involvement matrix A and the strength vector s. To quantitatively show the effect of this removal, we build an NCCAS-T method that is the same as NCCAS but keeps the target paper in matrix A. The results of the leading author identification and the Δ values for the 24 Nobel Prize-winning papers in the APS dataset are shown in Table 4. The removal does not change the identification results: the author with the highest credit score remains the same. But because the target paper dilutes the contributions of the other co-cited papers, the results of NCCAS-T are not as distinguishable as those of NCCAS. Therefore, removing the target paper’s contribution is crucial for NCCAS’s improved distinguishability.

The leading author identification results and the Δ values for 24 Nobel Prize-winning papers by NCCAS and NCCAS-T in the APS dataset. NCCAS-T refers to the method that keeps the target paper in the author involvement matrix.

| DOI | NCCAS-T accurate (Y/N) | NCCAS accurate (Y/N) | NCCAS-T Δ | NCCAS Δ |
| --- | --- | --- | --- | --- |
| 10.1103/PhysRevLett.76.1796 | Y | Y | 0.002 | 0.094 |
| 10.1103/PhysRevLett.84.5102 | Y | Y | 0.012 | 0.035 |
| 10.1103/PhysRevLett.77.4887 | Y | Y | 0.001 | 0.004 |
| 10.1103/PhysRevLett.55.48 | N | N | 0.017 | 0.385 |
| 10.1103/PhysRevLett.61.2472 | Y | Y | 0.002 | 0.72 |
| 10.1103/PhysRevLett.75.3969 | Y | Y | 0.002 | 0.345 |
| 10.1103/PhysRevLett.13.138 | Y | Y | 0.001 | 0.154 |
| 10.1103/PhysRev.69.37 | N | N | 0.002 | 0.561 |
| 10.1103/PhysRev.83.333 | N | N | 0.112 | 0.183 |
| 10.1103/PhysRevLett.61.169 | Y | Y | 0.003 | 0.263 |
| 10.1103/PhysRevLett.13.321 | Y | Y | 0.007 | 0.6 |
| 10.1103/PhysRev.122.345 | Y | Y | 0.01 | 0.908 |
| 10.1103/PhysRevLett.57.2442 | Y | Y | 0.008 | 0.306 |
| 10.1103/PhysRevLett.84.3232 | Y | Y | 0.037 | 0.193 |
| 10.1103/PhysRevLett.20.1205 | Y | Y | 0.027 | 0.698 |
| 10.1103/PhysRevLett.58.1490 | N | N | 0.002 | 0.012 |
| 10.1103/PhysRevLett.48.1559 | Y | Y | 0.027 | 0.297 |
| 10.1103/PhysRevLett.61.826 | Y | Y | 0.017 | 0.048 |
| 10.1103/PhysRevLett.35.1489 | Y | Y | 0.003 | 0.235 |
| 10.1103/PhysRevLett.9.439 | Y | Y | 0.0 | 0.0 |
| 10.1103/PhysRev.72.241 | Y | Y | 0.284 | 0.636 |
| 10.1103/PhysRev.112.1940 | Y | Y | 0.002 | 0.654 |
| 10.1103/PhysRev.73.679 | Y | Y | 0.004 | 0.066 |
| 10.1103/PhysRevD.5.528 | Y | Y | 0.001 | 0.059 |

The calculation of the authorship involvement matrix A

In all the methods, including NCCAS, fractional counting (Van Hooydonk, 1997) is used to construct the authorship involvement matrix A: if a co-cited paper has N authors, the corresponding element value in A is 1/N. In this section, we explore whether different weight allocations for the authorship involvement matrix make any difference. Besides fractional counting, we check the first-last-author-emphasis method (Tscharntke et al., 2007), the sequence-determines-credit method (Verhagen et al., 2003), harmonic counting (Hagen, 2008, 2010), geometric counting (Egghe et al., 2000), and arithmetic counting (Trueba et al., 2004). The results are shown in Table 5.
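For reference, the byline-position weight schemes compared here can be sketched as follows. The formulas are my reading of the cited counting literature (harmonic: weight ∝ 1/rank; geometric: ∝ 2^(N−rank); arithmetic: ∝ N−rank+1), not the paper’s exact implementation:

```python
def author_weights(n_authors, scheme="fractional"):
    """Normalized byline-position weights for an N-author paper."""
    N = n_authors
    if scheme == "fractional":      # equal shares
        w = [1.0] * N
    elif scheme == "harmonic":      # Hagen (2008): proportional to 1/rank
        w = [1.0 / (i + 1) for i in range(N)]
    elif scheme == "geometric":     # Egghe et al. (2000): halves with each rank
        w = [2.0 ** (N - i - 1) for i in range(N)]
    elif scheme == "arithmetic":    # Trueba et al. (2004): linear decline
        w = [float(N - i) for i in range(N)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(w)
    return [x / total for x in w]

print(author_weights(3, "harmonic"))  # first author receives the largest share
```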

The identification of Nobel laureates by different authorship involvement matrices with fractional counting.

| Method | CCA | NCCA | DCA | CoCA | NCCAS |
| --- | --- | --- | --- | --- | --- |
| fractional counting | 16.08 | 19.08 | 17.58 | 15.25 | 19.58 |
| first-last-author emphasis | 16.5 | 17 | 16 | 15.67 | 17.67 |
| sequence determines credit | 15.67 | 16 | 16 | 15 | 16.67 |
| harmonic counting | 15 | 16.33 | 16 | 14.67 | 17 |
| geometric counting | 14.33 | 15 | 15 | 14.33 | 14 |
| arithmetic counting | 13 | 15 | 14 | 11.33 | 11.67 |

The optimal choice for calculating the authorship involvement matrix differs across methods. It is interesting to note that while fractional counting is adopted by CCA, the first method of this kind, it is not always the optimal choice. But in general, fractional counting works well across all the algorithms. Table 5 supports the choice of authorship involvement matrix in NCCAS.

The identification of Nobel laureates by different rescale functions.

| Criterion | Dataset | NCCAS | S2(x) (β=2) | S2(x) (β=e) | S2(x) (β=5) | S3(x) (γ=1.3) |
| --- | --- | --- | --- | --- | --- | --- |
| Whole | APS (24) | 20 | 20 | 20 | 20 | 20 |
| Whole | MAG (22) | 19 | 18 | 18 | 18 | 18 |
| Fractional | APS (24) | 19.58 | 12.15 | 12.15 | 12.15 | 16.34 |
| Fractional | MAG (22) | 18.67 | 17.83 | 17.83 | 17.83 | 16.83 |

Selection of the rescale function

We use a modified Sigmoid function to transform the number of citations into the relative importance of a citing paper. The purpose of the Sigmoid function is to rescale the number of citations so that extreme values play a less important role. However, other functional forms may also work. In this section, we consider a logarithmic function (Eq. (2)) and a power-law function (Eq. (3)) to characterize relative importance, and check whether these functions improve the performance. For the logarithmic function, we choose the parameter β = 2, e, 5. For the power-law function, we choose γ = 1.3, the opposite of the exponent of the citation distribution of citing papers related to the Nobel Prize-winning papers in the APS dataset (Appendix Figure S3).

S2(x) = log_β x    (2)

S3(x) = x^γ    (3)

The identification performance of the different rescale functions is shown in Table 6. The modified Sigmoid function in NCCAS identifies Nobel laureates more accurately than the other two function types. In particular, the logarithmic and power-law functions often yield authors with tied credit. Their performance degrades further when fractional counting is used for evaluation. These results demonstrate that the Sigmoid function is a better choice than the other functional forms.

The calculation of parameter α

The parameter α is calculated as the average citation count of the citing papers. In this section, we compare the average with the median and the mode. The performance of Nobel laureate identification under these choices, in the APS and MAG datasets, is shown in Table 7. The performance is best when α takes the average value.
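The three candidate statistics for α can be computed directly from the citation counts of the citing papers. A minimal sketch using Python's statistics module (the function and variable names are ours, for illustration):

```python
import statistics

def alpha_candidates(citing_counts):
    # Candidate values of alpha over the citation counts of the citing papers:
    # the average (the choice used by NCCAS), the median, and the mode.
    return {
        "average": statistics.mean(citing_counts),
        "median": statistics.median(citing_counts),
        "mode": statistics.mode(citing_counts),
    }
```

For a skewed sample such as counts [1, 2, 2, 5, 10], the average (4) sits well above the median and mode (both 2), which is typical of citation data and explains why the three choices can lead to different identification performance.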

The identification of Nobel laureates by the NCCAS algorithm when parameter α is the average, the median, or the mode.

Counting     Dataset    Average   Median   Mode
Whole        APS(24)    20        19       19
             MAG(22)    19        18       18
Fractional   APS(24)    19.58     17.25    17.67
             MAG(22)    18.67     17.125   17.33
Conclusion

To summarize, we propose the NCCAS method to allocate credit to each co-author and identify the leading author of a multi-author publication. We introduce a modified Sigmoid function to rescale the citation number of the citing papers, and we remove the target paper from the calculation of the credit share. Compared with other credit allocation methods, NCCAS gives the best performance in identifying Nobel laureates in both the APS and the MAG datasets. In addition, the identified leading author is well separated from the other co-authors in terms of credit score, providing improved distinguishability over other methods. NCCAS is also more robust against manipulation, requiring the largest number of added papers to displace the leading author. These features make NCCAS well suited to large-scale publication data.

Future applications include analyzing the role of a paper’s leading author. It would be interesting to check whether the leading author tends to be the first author, the last author, or the author with the longest academic age (Drenth, 1998; Hundley et al., 2013; Sekara et al., 2018), and how this pattern differs across disciplines. Identifying a paper’s leading author would also help in studying a scientist’s research agenda (Huang et al., 2023; S. Huang et al., 2022; X. Yu et al., 2021): considering only the publications a scientist leads reduces the influence of papers in which he/she plays a less significant role. Finally, identifying the leading author can help in understanding the formation and structure of scientific teams (Milojević, 2014; Yu et al., 2022). NCCAS can be a useful tool for addressing these questions.

eISSN: 2543-683X
Language: English
Publication frequency: 4 times per year
Journal subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining