Collaboration has become a common practice in scientific production (Hara et al., 2003; Hu et al., 2020; Newman, 2004; Regalado, 1995). While scientific collaboration promotes publication output and the impact of the work (Dong et al., 2017; Pravdić et al., 1986; Wu et al., 2019), it makes it difficult to determine who gets how much credit for the team's work (Hodge et al., 1981; Zeng et al., 2017). This gives rise to the problem of credit allocation, which aims to quantitatively calculate the share of credit received by each author (Allen et al., 2014; Kennedy, 2003). However, because most important decisions in science, such as job appointments, funding applications, tenure promotions, and award nominations, are based on individual achievements (Jones, 2011), it is even more important for credit allocation methods to accurately and confidently identify the leading author, or owner, of a collaborative work (Rennie et al., 1994; Smolinsky et al., 2020): the one who gets the most credit among all authors.
In practice, the science community has developed several ways to solve the credit allocation problem. One is to consider that all co-authors in the team get the same credit for the work (Burrell et al., 1995; Oppenheim, 1998; Price, 1981; Rennie et al., 1994). Equivalently, any co-author can claim himself/herself the owner of the work. This approach is most common in fields like Mathematics, Economics, and Particle Physics, where authors' names are usually listed alphabetically in the paper (Endersby, 1996; Frandsen et al., 2010; Waltman, 2012).
The other way takes the byline order of authors into consideration (Beveridge et al., 2007; Riesenberg et al., 1990; Zuckerman, 1968). The first author and the last author, who is usually the corresponding author, take greater shares of the credit. Given the observation that the first and last authors usually do most of the work (Tscharntke et al., 2007), it seems fair to regard them as the owners of the paper. While these two approaches have their pros and cons (Das et al., 2020; J. Xu et al., 2016), they are widely applied in the science community for their simplicity and feasibility.
Recently, an algorithm-based approach was proposed that tackles the credit allocation problem from a new angle (Shen & Barabási, 2014). The leading author of a paper is determined not by factors within the paper itself, such as the byline order of authors, but by the perception of the scientific community. By analyzing how the target paper and papers by the team members are co-cited, the leading author can be quantitatively identified as the one most appreciated by peers in the community and recognized as the leader of the teamwork. The original method and its variants (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Xing et al., 2021) have been tested on empirical data and demonstrate high accuracy in identifying Nobel Prize laureates within the teams that made the scientific discoveries together.
However, existing methods present several limitations that can be further addressed. First, it is important to determine the weights of different papers in the citation network (H. Huang et al., 2022; Waltman et al., 2013). The number of citations, which has been widely accepted as a measure of a paper's impact (Eysenbach, 2006; L. L. Xu et al., 2021; Waltman, 2016; Radicchi et al., 2008; Sinatra et al., 2016), is considered by some methods and improves their performance. But given the scale-free nature of the citation distribution (Barabási, 2009; Barabási et al., 2001; Barabási et al., 2003), the weights of papers can be heavily skewed towards a few highly cited ones. Second, the robustness of the method needs to be further verified and strengthened. As one can publish papers that intentionally cite the target paper and ignore other related co-publications, the methods have to be robust to such manipulations. Finally, to identify the leading author of a paper, we would prefer that the best candidate be distinguishable from the others. If a method gives close or tied scores to the candidates in the first and second positions, the leading author may be indistinguishable. This issue can be critical when dealing with large-scale data. For instance, if one wants to investigate the correlation between the leading author of a paper and his/her byline order in the author list (Lu et al., 2022), the method has to ensure that the identified owner of a paper is well separated from the other co-authors.
To address the aforementioned issues, we propose a new allocation algorithm called NCCAS. We use a modified
The first collective credit allocation (CCA) algorithm is introduced by Shen and Barabási (2014). As shown in Figure 1, assume a target paper
Two extreme cases may help us understand the idea behind the CCA algorithm. If author A does not participate in any co-cited papers, the algorithm will assign author B as the leading author of the target paper. If author A and author B always appear together in co-cited papers, the algorithm will designate both authors as the owners of the target paper. In general, the ownership identified falls somewhere between these two extreme examples. Author A and author B have some research collaborations, as well as collaborations with other authors on similar research topics.
Since the development of CCA, other variants have been introduced, aiming to improve the allocation accuracy from different perspectives. All of these algorithms focus on the modification of the involvement matrix
While all these methods demonstrate reasonable accuracy in determining the Nobel laureates, they present several limitations that can be addressed. First, the number of citations (or co-citations) is commonly utilized to calculate the strength vector
We utilize two mainstream scholarly datasets in this paper. The APS dataset is from 11 Physical Review journals launched by the American Physical Society, covering the period from 1893 to 2009, with a total of 463,332 papers, 4,710,547 citation pairs, and 247,974 author records. The Microsoft Academic Graph (MAG) dataset covers 71 academic fields, with 224,876,396 papers, 1,492,784,521 citation pairs, and 236,021,246 author records. To avoid author name ambiguity, we apply rule-matching name disambiguation (Jia et al., 2017; Sinatra et al., 2016) to the author records in the APS dataset: we extract each author's full last name together with first- and middle-name initials and use this normalized form as a uniform key. For the MAG dataset, we directly use the disambiguated author names provided in the data.
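As an illustration of the rule-matching idea, a minimal normalization might collapse name variants to a shared key of last name plus initials. `author_key` below is a hypothetical helper, not the actual disambiguation code of the cited works, whose rules are more elaborate:

```python
def author_key(name: str) -> str:
    """Collapse an author name to a 'last-name + initials' key.

    Hypothetical helper illustrating rule-matching disambiguation:
    records whose keys match are treated as the same author.
    """
    parts = name.replace(".", " ").replace("-", " ").split()
    if len(parts) == 1:
        return parts[0].lower()
    last = parts[-1].lower()
    initials = "".join(p[0].lower() for p in parts[:-1])
    return f"{last}, {initials}"

# Two spellings of the same author collapse to one key.
print(author_key("A. L. Barabasi"))          # barabasi, al
print(author_key("Albert-Laszlo Barabasi"))  # barabasi, al
```

Such a key deliberately over-merges distinct authors who share a last name and initials, which is why real pipelines add further rules (affiliation, co-author overlap) on top.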
We select the Nobel Prize-winning papers in Physics used in previous studies (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Shen et al., 2014; Xing et al., 2021) as the validation set to verify the accuracy of our algorithm. In total, 30 Nobel Prize-winning papers were considered by past studies. After removing papers with only one author and papers whose co-authors all received the prize, we retrieved 24 multi-author papers in the APS dataset and 22 in the MAG dataset. Details can be found in Appendix Tables S1 and S2.
At the same time, we extract 654,148 papers in the field of computer science from the MAG dataset to validate the distinguishability of the identified paper owners.
In this work, we propose the Nonlinear Collective Credit Allocation with a modified
First, to handle extreme values in citation counts and to make the distribution of element values in
With Eq. (1) that rescales the citation counts, the element value of the strength vector
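The exact form of the modified rescaling function in Eq. (1) is not reproduced here. As an illustrative sketch of the intended effect only, a simple saturating transform such as c/(c+α) compresses heavy-tailed citation counts into a bounded range; the `rescale` function and its default α are assumptions for illustration, with α playing the role of the typical citation count that the paper tunes later:

```python
def rescale(citations: float, alpha: float = 20.0) -> float:
    """Illustrative saturating rescale of a citation count.

    NOTE: this is NOT the paper's Eq. (1); it is a sketch of the
    intended effect. The output grows with citations but saturates
    below 1, so a paper with 5,000 citations no longer outweighs a
    paper with 5 citations by a factor of 1,000.
    """
    return citations / (citations + alpha)

print(rescale(5))     # modest weight for a lightly cited paper
print(rescale(5000))  # close to 1, not three orders of magnitude larger
```

Any monotone, bounded transform achieves the stated goal of keeping a few highly cited papers from dominating the strength vector.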
The second modification is the removal of the target paper from the involvement matrix
Taken together, as shown in Figure 2, the procedure of NCCAS is as follows:
1. Find all the papers that cite the target paper.
2. Find all the papers that are authored by at least one author of the target paper.
3. Among them, find the set of co-cited papers.
4. Calculate the strength vector.
5. Generate the authorship involvement matrix.
6. Compute the credit shares of all co-authors in the target paper.
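The final allocation step of this procedure can be sketched in a few lines, under simplifying assumptions: the involvement matrix uses fractional counting over the co-cited papers, and the strength vector holds precomputed rescaled citation strengths. `allocate_credit` and the toy matrices below are illustrative, not the production implementation:

```python
def allocate_credit(A, s):
    """Sketch of the final step: credit_i is proportional to sum_j A[i][j] * s[j].

    A: rows = co-authors of the target paper, columns = co-cited papers,
       entries = fractional-counting involvement (assumption for this toy).
    s: rescaled citation strength of each co-cited paper.
    Returns normalized credit shares summing to 1.
    """
    raw = [sum(a * w for a, w in zip(row, s)) for row in A]
    total = sum(raw)
    return [r / total for r in raw]

# Toy example: 2 authors, 3 co-cited papers (made-up values).
A = [[0.5, 1.0, 0.0],   # author 1 co-wrote co-cited papers 1 and 2
     [0.5, 0.0, 1.0]]   # author 2 co-wrote co-cited papers 1 and 3
s = [0.8, 0.6, 0.2]     # rescaled citation strengths
print(allocate_credit(A, s))  # author 1 gets the larger share
```

The real computation differs only in how A and s are built from the citation network; the matrix-vector product and normalization are the same.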
The scientific community lacks a unified standard for determining authorship attribution, making it difficult to find instances with known credit allocation results to validate an allocation method. Previous studies (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Shen et al., 2014; Xing et al., 2021) consider papers receiving the Nobel Prize in Physics as the validation set to verify the accuracy of the algorithms. Indeed, the choice of Nobel Prize winners reflects the scientific community's recognition: the author who makes the greatest contribution to a Nobel Prize-winning paper is considered the rightful owner of the scientific achievement (Lo, 2013).
Because the Nobel Prize in Physics can be awarded to at most three scientists, for up to two different scientific discoveries each year, there are cases in which multiple authors of a Nobel Prize-winning paper receive the award. In addition, previous studies use different evaluation criteria. For example, if the algorithm gives two tied candidates and one of them is the Nobel laureate, do we consider the result 100% accurate or just 50% accurate? Likewise, if a paper has two Nobel laureates and the top-two candidates given by the algorithm cover only one of them, how do we gauge the accuracy of the allocation result? To unify different cases and provide a comprehensive comparison of different algorithms, we consider two scenarios in this paper.
Under the "Whole" evaluation, the output of the algorithm is considered correct if one of the identified leading authors of the paper is a true Nobel laureate. For example, if the algorithm gives two tied candidates and one of them is the true Nobel laureate, or if two authors of the paper are Nobel laureates and the algorithm identifies one of them among the top-two candidates, we consider the output of the algorithm to be one hundred percent correct. This is the most commonly adopted evaluation criterion in past studies.
Under the "Fractional" evaluation, we use a variant of Jaccard similarity to calculate the fractional overlap between the algorithm output and the ground truth. Assuming that the algorithm predicts a set of leading authors
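Whatever the exact variant, plain Jaccard similarity illustrates the idea; the function below is a sketch of the fractional criterion, not the paper's precise formula:

```python
def fractional_accuracy(predicted, laureates):
    """Jaccard overlap |P ∩ T| / |P ∪ T| between the predicted leading
    authors P and the true laureates T.

    The paper uses a variant of Jaccard similarity; the plain form shown
    here is an illustrative assumption, not the exact formula.
    """
    predicted, laureates = set(predicted), set(laureates)
    return len(predicted & laureates) / len(predicted | laureates)

# Two tied candidates, one of whom is the laureate: partial credit.
print(fractional_accuracy({"Alice", "Bob"}, {"Alice"}))  # 0.5
```

Under this measure, the tied-candidates case from the previous paragraph scores 0.5 rather than 1, which is exactly the ambiguity the fractional criterion is meant to resolve.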
Based on the two evaluation criteria, we compare the number of papers whose leading author is successfully recognized by NCCAS and the baseline methods. The statistics are shown in Table 1, and the detailed values for each paper are given in Appendix Table S1 and Appendix Table S2. Note that we only retrieve 22 papers in the APS dataset and six papers in the MAG dataset that are applicable to CoCA, so we also report percentage accuracy for CoCA and NCCAS.
The number of papers whose leading author is correctly identified by different allocation methods under the "Whole" and "Fractional" evaluations.
| Evaluation | Dataset | CCA | NCCA | DCA | CoCA | NCCAS |
|---|---|---|---|---|---|---|
| Whole | APS (24) | 17 | 20 | 20 | 17 (77.27%) | |
| Whole | MAG (22) | 18 | 18 | 18 | 4 (66.67%) | |
| Fractional | APS (24) | 16.08 | 19.08 | 17.58 | 15.25 (69.32%) | |
| Fractional | MAG (22) | 17.67 | 16.33 | 14.31 | 4 (66.67%) | |
The proposed NCCAS performs well in both datasets under both evaluation criteria. NCCAS is based on NCCA, but the introduction of the
To quantify the extent to which the first and second candidates are separated, we consider the difference between the highest and the second highest credit scores (denoted as Δ). The larger the value of Δ, the easier it is to distinguish the leading author.
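Computing Δ is straightforward; the sketch below (`separation` is a hypothetical name) takes the credit shares of a paper's co-authors and returns the gap between the two largest:

```python
def separation(credit_shares):
    """Δ: gap between the highest and second-highest credit share.

    Δ = 0 signals a tie, meaning the leading author cannot be
    distinguished from the runner-up.
    """
    top = sorted(credit_shares, reverse=True)
    return top[0] - top[1]

print(separation([0.55, 0.30, 0.15]))  # clear leader, large gap
print(separation([0.40, 0.40, 0.20]))  # tie: Δ = 0
```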
We first analyze the Δ values for the Nobel Prize-winning papers in the APS dataset (Table 2). In most cases, the Δ given by NCCAS is the largest; in the few cases where it is not, it is the second largest. It is also noteworthy that, except for one paper on which all algorithms give the same tied result, NCCAS does not produce ties, whereas NCCA, DCA, and CoCA all yield ties on different papers. Table 2 clearly shows that NCCAS makes the leading author of a Nobel Prize-winning paper more distinguishable than the other methods do.
Δ by different methods for the Nobel Prize-winning papers in Physics (APS dataset).
| DOI | CCA | NCCA | DCA | CoCA | NCCAS |
|---|---|---|---|---|---|
10.1103/PhysRevLett.76.1796 | 0.033 | 0.013 | 0.001 | 0.046 | |
10.1103/PhysRevLett.84.5102 | 0.021 | 0.014 | 0.006 | 0.031 | |
10.1103/PhysRevLett.77.4887 | 0.002 | 0.002 | 0.001 | 0.004 | |
10.1103/PhysRevLett.55.48 | 0.19 | 0.12 | 0.149 | 0.268 | |
10.1103/PhysRevLett.61.2472 | 0.152 | 0.067 | 0.005 | 0.156 | |
10.1103/PhysRevLett.75.3969 | 0.042 | 0.004 | 0.002 | 0.122 | |
10.1103/PhysRevLett.13.138 | 0.02 | 0.002 | 0.0 | 0.0 | 0.154 |
10.1103/PhysRev.69.37 | 0.204 | 0.051 | 0.257 | 0.25 | |
10.1103/PhysRev.83.333 | 0.009 | 0.002 | 0.015 | 0.022 | |
10.1103/PhysRevLett.61.169 | 0.05 | 0.023 | 0.059 | 0.053 | |
10.1103/PhysRevLett.13.321 | 0.006 | 0.0 | 0.002 | 0.0 | |
10.1103/PhysRev.122.345 | 0.078 | 0.01 | 0.002 | 0.364 | |
10.1103/PhysRevLett.57.2442 | 0.077 | 0.018 | 0.003 | 0.193 | |
10.1103/PhysRevLett.84.3232 | 0.052 | 0.032 | 0.008 | 0.076 | |
10.1103/PhysRevLett.20.1205 | 0.128 | 0.005 | 0.071 | 0.167 | |
10.1103/PhysRevLett.58.1490 | 0.003 | 0.0 | - | 0.012 | |
10.1103/PhysRevLett.48.1559 | 0.06 | 0.081 | 0.006 | 0.297 | |
10.1103/PhysRevLett.61.826 | 0.025 | 0.013 | 0.032 | 0.048 | |
10.1103/PhysRevLett.35.1489 | 0.002 | 0.001 | 0.001 | 0.0 | |
10.1103/PhysRevLett.9.439 | 0.0 | 0.0 | 0.0 | - | 0.0 |
10.1103/PhysRev.72.241 | 0.182 | 0.036 | 0.144 | 0.25 | |
10.1103/PhysRev.112.1940 | 0.008 | 0.032 | 0.0 | 0.368 | |
10.1103/PhysRev.73.679 | 0.024 | 0.003 | 0.004 | ||
10.1103/PhysRevD.5.528 | 0.088 | 0.001 | 0.006 |
To go beyond the Nobel Prize-winning papers and demonstrate how distinguishable the identified leading author is in general, we select 654,148 papers in computer science and calculate the Δ values given by different methods (Figure 3). The results for all years are shown in Appendix Figure S1. The Δ values given by CCA, NCCA, DCA, and CoCA tend to be concentrated in the [0, 0.3] interval, whereas over 40% of the results given by NCCAS have Δ values greater than 0.5. Hence, for general publication data, NCCAS still makes the leading author of a paper distinguishable from the others.
As one can intentionally create citing papers that cite the target and co-cited papers, it is important to check the robustness of the results of different methods under malicious manipulation. For example, F. H. Wang et al. (2019) show that by adding a small number of papers citing the target paper, it is possible to significantly change the authorship credit assigned by the CCA method. For this reason, we consider a simple experiment in which papers with a certain number of citations are added, each citing the target paper and one randomly chosen co-cited paper. We measure the number of papers needed to invert the ranking of the credit scores, i.e., to make the author originally with the second highest score become the leading author. We consider two cases: added papers with 20 citations and added papers with one citation. The choice of 20 citations is based on the fact that the average number of citations in the APS dataset is about 20; the choice of one citation reflects the fact that the average number of citations one year after publication is about one in the APS dataset. Note that only NCCAS and NCCA consider the number of citations; CCA and DCA consider all citing papers to be equally important and are therefore not affected by the citing papers' citation counts.
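A minimal version of this manipulation experiment can be simulated on a toy network. The sketch below ignores the citation rescaling and uses made-up involvement and strength values, so it illustrates only the counting procedure, not the paper's actual results:

```python
def credit(A, s):
    """Normalized credit shares proportional to A·s (toy allocation)."""
    raw = [sum(a * w for a, w in zip(row, s)) for row in A]
    total = sum(raw)
    return [r / total for r in raw]

# Toy attack: author 0 leads initially. Each fabricated paper cites the
# target and co-cited paper 2, which is authored only by author 1, so it
# inflates s[2] by one unit per added paper (rescaling ignored).
A = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
s = [5.0, 2.0, 3.0]

added = 0
while credit(A, s)[0] >= credit(A, s)[1]:
    s[2] += 1.0  # one more fabricated co-citation
    added += 1
print(added)  # fabricated papers needed to invert the credit ranking
```

In the real experiment the co-cited paper targeted by each fabricated citation is chosen at random, which is why the measurement is repeated and averaged.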
We first focus on the 24 Nobel Prize-winning papers in the APS dataset. Because the added papers cite the co-cited papers randomly, we repeat the experiment ten times and report in Table 3 the average number of papers needed to invert the original credit ranking. Methods that take the number of citations into account (NCCAS and NCCA) are more robust than those that do not (CCA and DCA), in line with the conclusions of previous studies (F. H. Wang et al., 2019). In most cases, NCCAS is more robust than NCCA. We also examine the cases in which NCCAS is less robust than NCCA (four papers: 10.1103/PhysRevLett.55.48, 10.1103/PhysRev.69.37, 10.1103/PhysRev.83.333, and 10.1103/PhysRev.73.679). We find that these papers have citing papers with extremely high citation counts. Because the original citing papers' citations are already very high, adding papers with 1 or 20 citations does not significantly change the strength vector; therefore, in these cases NCCA is more robust than NCCAS.
The number of added papers with different citations to invert the credit rank for 24 Nobel Prize-winning papers.
| DOI | CCA | DCA | NCCA (added papers: 20 citations) | NCCAS (added papers: 20 citations) | NCCA (added papers: 1 citation) | NCCAS (added papers: 1 citation) |
|---|---|---|---|---|---|---|
10.1103/PhysRevLett.76.1796 | 66 | 42 | 106 | 2,109 | ||
10.1103/PhysRevLett.84.5102 | 15 | 6 | 10 | 190 | ||
10.1103/PhysRevLett.77.4887 | 5 | 5 | 2 | 39 | ||
10.1103/PhysRevLett.55.48 | 53 | 15 | 29 | 1,294 | ||
10.1103/PhysRevLett.61.2472 | 565 | 580 | 808 | |||
10.1103/PhysRevLett.75.3969 | 475 | 446 | 982 | |||
10.1103/PhysRevLett.13.138 | 44 | 35 | 42 | 823 | ||
10.1103/PhysRev.69.37 | 51 | 30 | 40 | 2,031 | ||
10.1103/PhysRev.83.333 | 4 | 1 | 1 | 1 | ||
10.1103/PhysRevLett.61.169 | 37 | 4 | 7 | 135 | ||
10.1103/PhysRevLett.13.321 | 5 | 1 | 3 | 50 | ||
10.1103/PhysRev.122.345 | 443 | 293 | 737 | |||
10.1103/PhysRevLett.57.2442 | 146 | 150 | 171 | 3,413 | ||
10.1103/PhysRevLett.84.3232 | 15 | 11 | 17 | 322 | ||
10.1103/PhysRevLett.20.1205 | 96 | 96 | 106 | 2,120 | ||
10.1103/PhysRevLett.58.1490 | 20 | 21 | 17 | 47 | ||
10.1103/PhysRevLett.48.1559 | 196 | 197 | 314 | 6,261 | ||
10.1103/PhysRevLett.61.826 | 20 | 18 | 57 | 1,127 | ||
10.1103/PhysRevLett.35.1489 | 9 | 13 | 17 | 327 | ||
10.1103/PhysRevLett.9.439 | 1 | 1 | 1 | 1 | 1 | 1 |
10.1103/PhysRev.72.241 | 99 | 88 | 179 | 3,569 | ||
10.1103/PhysRev.112.1940 | 18 | 16 | 31 | 604 | ||
10.1103/PhysRev.73.679 | 37 | 1 | 74 | 1,655 | ||
10.1103/PhysRevD.5.528 | 68 | 23 | 55 | 1,093 |
We further run the experiment on the 654,148 computer science papers selected from the MAG dataset. We add citing papers with 13 citations, the average number of citations in the data analyzed. Figure 4 shows the distribution of the number of papers needed to invert the original credit ranking; the results for all years are shown in Appendix Figure S2. The distributions of CCA and DCA are concentrated at small numbers of added papers, while the distributions of NCCA and NCCAS place a substantial portion of their mass at large numbers of added papers. NCCAS is more robust than NCCA, as its distribution is more skewed towards large numbers of added papers. This large-scale analysis is in line with the results on the Nobel Prize-winning papers: NCCAS is, in general, more robust to manipulation than the baseline methods.
In this section, we perform experiments to justify the modifications made in NCCAS and the choice of its parameters.
A major modification made in NCCAS is that the target paper is not included in the author involvement matrix
The leading author identification results and the Δ values for the 24 Nobel Prize-winning papers in the APS dataset, by NCCAS and NCCAS-T. NCCAS-T refers to the variant that keeps the target paper in the author involvement matrix.
| DOI | NCCAS-T accurate (Y/N) | NCCAS accurate (Y/N) | NCCAS-T Δ | NCCAS Δ |
|---|---|---|---|---|
10.1103/PhysRevLett.76.1796 | Y | Y | 0.002 | |
10.1103/PhysRevLett.84.5102 | Y | Y | 0.012 | |
10.1103/PhysRevLett.77.4887 | Y | Y | 0.001 | |
10.1103/PhysRevLett.55.48 | N | N | 0.017 | |
10.1103/PhysRevLett.61.2472 | Y | Y | 0.002 | |
10.1103/PhysRevLett.75.3969 | Y | Y | 0.002 | |
10.1103/PhysRevLett.13.138 | Y | Y | 0.001 | |
10.1103/PhysRev.69.37 | N | N | 0.002 | |
10.1103/PhysRev.83.333 | N | N | 0.112 | |
10.1103/PhysRevLett.61.169 | Y | Y | 0.003 | |
10.1103/PhysRevLett.13.321 | Y | Y | 0.007 | |
10.1103/PhysRev.122.345 | Y | Y | 0.01 | |
10.1103/PhysRevLett.57.2442 | Y | Y | 0.008 | |
10.1103/PhysRevLett.84.3232 | Y | Y | 0.037 | |
10.1103/PhysRevLett.20.1205 | Y | Y | 0.027 | |
10.1103/PhysRevLett.58.1490 | N | N | 0.002 | |
10.1103/PhysRevLett.48.1559 | Y | Y | 0.027 | |
10.1103/PhysRevLett.61.826 | Y | Y | 0.017 | |
10.1103/PhysRevLett.35.1489 | Y | Y | 0.003 | |
10.1103/PhysRevLett.9.439 | Y | Y | 0.0 | 0.0 |
10.1103/PhysRev.72.241 | Y | Y | 0.284 | |
10.1103/PhysRev.112.1940 | Y | Y | 0.002 | |
10.1103/PhysRev.73.679 | Y | Y | 0.004 | |
10.1103/PhysRevD.5.528 | Y | Y | 0.001 |
In all the methods, including NCCAS, fractional counting (Van Hooydonk, 1997) is used to construct the authorship involvement matrix
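For reference, common byline-order counting schemes can be written down compactly. The formulas below follow standard definitions in the bibliometrics literature (fractional, harmonic, geometric, and arithmetic counting) and may differ in detail from the exact variants evaluated here:

```python
def fractional(n):
    """Fractional counting: each of the n co-authors gets 1/n."""
    return [1.0 / n] * n

def harmonic(n):
    """Harmonic counting: credit proportional to 1/byline-rank."""
    raw = [1.0 / r for r in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]

def geometric(n):
    """Geometric counting: credit halves with each later position."""
    raw = [2.0 ** (n - r) for r in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]

def arithmetic(n):
    """Arithmetic counting: credit decreases linearly with position."""
    raw = [float(n - r + 1) for r in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]

print(fractional(3))  # equal shares for all three authors
print(harmonic(3))    # first author ~0.545, then ~0.273 and ~0.182
```

All four schemes return shares that sum to 1, so they are drop-in alternatives for building the involvement matrix column by column.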
The identification of Nobel laureates using authorship involvement matrices constructed with different counting schemes.
| Method | CCA | NCCA | DCA | CoCA | NCCAS |
|---|---|---|---|---|---|
fractional counting | 16.08 | 15.25 | |||
first last author emphasis | 17 | 16 | 17.67 | ||
sequence determines credit | 15.67 | 16 | 16 | 15 | 16.67 |
harmonic counting | 15 | 16.33 | 16 | 14.67 | 17 |
geometric counting | 14.33 | 15 | 15 | 14.33 | 14 |
arithmetic counting | 13 | 15 | 14 | 11.33 | 11.67 |
The optimal choice for constructing the authorship involvement matrix differs across methods. It is interesting to note that while fractional counting is adopted by CCA, the first method of this kind, it is not always the optimal choice. In general, however, fractional counting works well across the different algorithms. Table 5 supports the choice of authorship involvement matrix in NCCAS.
The identification of Nobel laureates by different rescale functions.
Dataset | Numbers | |||||
---|---|---|---|---|---|---|
NCCAS | ||||||
Whole | APS(24) | 20 | 20 | 20 | 20 | |
MAG(22) | 18 | 18 | 18 | 18 | ||
Fractional | APS(24) | 12.15 | 12.15 | 12.15 | 16.34 | |
MAG(22) | 17.83 | 17.83 | 17.83 | 16.83 |
We use a modified
The identification performance of different rescale functions is shown in Table 6. The modified
The parameter
The identification of Nobel laureates by the NCCAS algorithm, when parameter α corresponds to the median, mode, and average.
| Evaluation | Dataset | Average | Median | Mode |
|---|---|---|---|---|
Whole | APS(24) | 19 | 19 | |
MAG(22) | 18 | 18 | ||
Fractional | APS(24) | 17.25 | 17.67 | |
MAG(22) | 17.125 | 17.33 |
To summarize, we propose the NCCAS method to allocate credit to each co-author and identify the leading author of a multi-author publication. We introduce a modified
Future applications include analyzing the role of a paper's leading author. It is interesting to check whether the leading author tends to be the first author, the last author, or the author with the longest academic age (Drenth, 1998; Hundley et al., 2013; Sekara et al., 2018), and how this pattern differs across disciplines. The identification of a paper's leading author would also help in studying a scientist's research agenda (Huang et al., 2023; S. Huang et al., 2022; X. Yu et al., 2021): considering only the publications led by a scientist reduces the influence of papers in which he/she plays a less significant role. Finally, identifying the leading author can help in understanding the formation and structure of scientific teams (Milojević, 2014; Yu et al., 2022). NCCAS can be a useful tool for addressing these questions.