An author credit allocation method with improved distinguishability and robustness

Purpose: The purpose of this study is to propose an improved credit allocation method that makes the leading author of a paper more distinguishable and makes the identification more robust under malicious manipulation. Design/methodology/approach: We utilize a modified Sigmoid function to handle the fat-tailed distribution of citation counts. We also remove the target paper when calculating the contribution of co-citations. Following previous studies, we use 30 Nobel Prize-winning papers and their citation networks, based on the American Physical Society (APS) and the Microsoft Academic Graph (MAG) datasets, to test the accuracy of our proposed method (NCCAS). In addition, we use 654,148 articles published in the field of computer science from 2000 to 2009 in the MAG dataset to validate the distinguishability and robustness of NCCAS. Findings: Compared with state-of-the-art methods, NCCAS gives the most accurate prediction of Nobel laureates. Furthermore, the leading author identified by NCCAS is more distinguishable from the other co-authors, and the results are more robust to malicious manipulation. Finally, we perform ablation studies to show the contribution of different components of our method. Research limitations: Due to limited ground truth on the true leading author of a work, the accuracy of NCCAS and related methods can only be tested on Nobel Physics Prize-winning papers. Practical implications: NCCAS is successfully applied to a large number of publications, demonstrating its potential for analyzing the relationship between the contribution and the recognition of authors with different byline orders.


Introduction
Collaboration has become a common practice in scientific production (Hara et al., 2003; Hu et al., 2020; Newman, 2004; Regalado, 1995). While scientific collaboration promotes publication output and the work's impact (Dong et al., 2017; Pravdić et al., 1986; Wu et al., 2019), it brings difficulties in determining who gets how much credit for the team's work (Hodge et al., 1981; Zeng et al., 2017). This gives rise to the problem of credit allocation, which aims to quantitatively calculate the share of credit of each author (Allen et al., 2014; Kennedy, 2003). However, as most important decisions in science, such as job appointments, funding applications, tenure promotions, and award nominations, are based on the achievements of individuals (Jones, 2011), it is even more important for credit allocation methods to accurately and confidently identify the leading author, or the owner, of a collaborative work (Rennie et al., 1994; Smolinsky et al., 2020): the one who gets the most credit among all authors.
In practice, the science community has developed several ways to solve the credit allocation problem. One is to consider that all co-authors in the team get the same credit for the work (Burrell et al., 1995; Oppenheim, 1998; Price, 1981; Rennie et al., 1994). Equivalently, any co-author can claim himself/herself as the owner of the work. This is most common in fields like Mathematics, Economics, and Particle Physics, where authors' names are usually listed alphabetically in the paper (Endersby, 1996; Frandsen et al., 2010; Waltman, 2012).
The other way takes the byline order of authors into consideration (Beveridge et al., 2007; Riesenberg et al., 1990; Zuckerman, 1968). The last author, who is usually the corresponding author, and the first author take a greater share of the credit. Considering the observation that the first and last authors usually do most of the work (Tscharntke et al., 2007), it seems fair to make them the owners of the paper. While these two approaches have their pros and cons (Das et al., 2020; J. Xu et al., 2016), they are widely applied in the science community for their simplicity and feasibility.
Recently, an algorithm-based approach has been proposed that tackles the credit allocation problem from a new angle (Shen & Barabási, 2014). The leading author of a paper is not determined by factors within that paper, such as the byline order of authors, but by the perception of the scientific community, as reflected in the citation network around the paper. In this work, we also analyze how additional citations in hypothetical malicious manipulation experiments affect the allocation results. Given NCCAS's capability of handling large-scale publication data and its good performance, it has the potential to be a useful tool in other studies related to the different roles or contributions of authors.

Related work
The first collective credit allocation (CCA) algorithm was introduced by Shen and Barabási (2014). As shown in Figure 1, assume a target paper p0 co-authored by author A and author B. To identify which author is recognized more by the community, we first identify all papers that cite the target paper. Through the set of citing papers, we further identify the set of papers that are co-cited with the target paper and are co-authored by at least one of the two authors. From the set of co-cited papers, we get an authorship involvement matrix A that counts the fractional share of authorship of the two authors in each co-cited paper. Furthermore, we get a strength vector s that assigns different weights to the co-cited papers. In CCA, each element of s is the number of co-citations a co-cited paper receives with the target paper. The product of the matrix A and the vector s gives the total scores of authors A and B, which represent the overall appearance of each author in the related co-cited literature, weighted by its importance.
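As a concrete illustration, the matrix-vector product at the heart of CCA can be sketched in a few lines of Python. The toy co-citation counts below are invented for illustration, not taken from the paper's data:

```python
import numpy as np

# Toy co-citation data for a target paper with co-authors A and B.
# Each co-cited paper records: which target-paper authors appear on it,
# its total author count, and its co-citation count with the target paper.
co_cited = [
    {"authors_present": {"A"},      "n_authors": 3, "co_citations": 5},
    {"authors_present": {"A", "B"}, "n_authors": 2, "co_citations": 2},
    {"authors_present": {"B"},      "n_authors": 4, "co_citations": 1},
]

def cca_credit(co_cited, target_authors):
    """Collective Credit Allocation: credit = A @ s, then normalized."""
    # Involvement matrix A: fractional authorship share of each target-paper
    # author in each co-cited paper.
    A = np.array([[1.0 / p["n_authors"] if a in p["authors_present"] else 0.0
                   for p in co_cited] for a in target_authors])
    # Strength vector s: co-citation counts with the target paper.
    s = np.array([p["co_citations"] for p in co_cited], dtype=float)
    c = A @ s
    return c / c.sum()

print(cca_credit(co_cited, ["A", "B"]))
```

Here author A ends up with the larger credit share, since A appears in the most frequently co-cited papers.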
Figure 1. Collective Credit Allocation (CCA) algorithm. The credits of author A and author B are given by the product of the author involvement matrix A and the co-citation strength vector s. If author A does not participate in any co-cited papers, the algorithm will assign author B as the leading author of the target paper. If author A and author B always appear together in co-cited papers, the algorithm will designate both authors as owners of the target paper. In general, the identified ownership falls somewhere between these two extremes, as author A and author B have some research collaborations with each other, as well as collaborations with other authors on similar research topics.
Since the development of CCA, other variants have been introduced aiming to improve the allocation accuracy from different perspectives. All of these algorithms focus on modifying the involvement matrix A and the strength vector s. CCA considers that all citing papers have the same importance; hence, the element in vector s is the number of citing papers that cite the co-cited paper, or equivalently the number of co-citations. In the NCCA algorithm, F. H. Wang et al. (2019) consider that citing papers with more citations are more important. The element value of s is calculated by summing up the citations of the citing papers that cite the co-cited paper. In the DCA algorithm, Bao et al. (2017) adjust the involvement matrix A by multiplying it by a residual influence Φ. The idea is that the importance of a paper gradually decreases with time, which is captured by Φ. DCA uses the same strength vector s as CCA. The final credits of the co-authors are determined by the product of the author involvement matrix A, the residual influence Φ, and the strength vector s. J. P. Wang et al. (2019) propose the IDCA method, which uses the same residual influence Φ as DCA but modifies the strength vector s. The element value of s is calculated by summing over the product of the PageRank value and the citation number of each citing paper. Xing et al. (2021) propose the CoCA method, which modifies the collection of papers in matrix A and the values of the strength vector s. CoCA focuses on the subsequent works by co-authors of the target paper (papers authored by at least one of the co-authors). These subsequent papers generate the involvement matrix A. For each subsequent paper, its weight in s is calculated as the number of times that the subsequent paper and the target paper use the same reference.
While all these methods demonstrate reasonable accuracy in determining Nobel laureates, they present several limitations that can be addressed. First, the number of citations (or co-citations) is commonly used to calculate the strength vector s. But citation counts usually follow a scale-free distribution; therefore, s can be skewed towards some extreme values. Second, most of these methods are tested only on citation networks composed of APS publications. It is important to extend the analyses to a bigger and more comprehensive dataset such as the MAG dataset. Third, these methods usually give very close credit scores to different co-authors, making the leading author of the paper not very distinguishable from the others. Finally, as citations can be intentionally added to the target paper, it is important to perform a robustness test, which has not been well emphasized in past studies.

Data
We utilize two mainstream scholarly datasets in this paper. The APS dataset covers 11 Physical Review journals launched by the American Physical Society over the period from 1893 to 2009, with a total of 463,332 papers, 4,710,547 citation pairs, and 247,974 authors. The Microsoft Academic Graph (MAG) dataset covers 71 academic fields, with 224,876,396 papers, 1,492,784,521 citation pairs, and 236,021,246 authors. To avoid author name ambiguity, we apply rule-matching name disambiguation (Jia et al., 2017; Sinatra et al., 2016) to the author information in the APS dataset: we extract each author's last name, first name, and middle-name initials to form a uniform replacement key. For the MAG dataset, we directly use the disambiguated author names in the data.
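The exact matching rules of the disambiguation pipeline are not reproduced here, but a minimal sketch of a last-name/first-name/middle-initial replacement key might look like the following (`name_key` is a hypothetical helper, not the paper's implementation):

```python
def name_key(full_name):
    """Illustrative rule-matching key: last name + first name + middle
    initials, lowercased. The paper's exact rules are more elaborate."""
    parts = full_name.replace(".", "").split()
    last, first, middles = parts[-1], parts[0], parts[1:-1]
    initials = "".join(m[0] for m in middles)
    return f"{last.lower()}_{first.lower()}_{initials.lower()}"

# Two spellings of the same hypothetical author collapse to one key.
print(name_key("John A. Smith"))
print(name_key("John Andrew Smith"))
```

Both calls yield the same key, so the two name variants are merged into a single author record.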
We select the Nobel Prize-winning papers in Physics used in previous studies (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Shen et al., 2014; Xing et al., 2021) as the validation dataset to verify the accuracy of our algorithm. In total, 30 Nobel Prize-winning papers have been considered by past studies. After removing papers with only one author and papers for which all co-authors received the prize, 24 multi-author papers were retrieved in the APS dataset and 22 in the MAG dataset. Details can be found in Appendix Tables S1 and S2.
In addition, we extract 654,148 papers in the field of computer science from the MAG dataset to validate the distinguishability of the identified leading authors.

Method
In this work, we propose the Nonlinear Collective Credit Allocation with a modified Sigmoid (NCCAS) algorithm (Figure 2). Like NCCA, NCCAS takes the citation numbers of the citing papers into consideration. Compared with other related methods, NCCAS introduces two major modifications.
First, to handle extreme values in citation counts and to make the distribution of element values in s bounded, we introduce a modified Sigmoid function (Eq. (1)) that transforms citation counts into paper weights. The parameter α in Eq. (1) is the average number of citations that the citing papers receive. In the ablation experiment of this paper, we show that the average value gives the most accurate identification of the Nobel laureates compared with the use of the median and mode values.

The parameter n is the average number of citations of all papers considered. Its value varies with the task and dataset used. For example, for the task of identifying Nobel laureates, we have n = 20.813 in the APS dataset and n = 22.383 in the MAG dataset. When applying NCCAS to the 654,148 papers in computer science, we have n = 13.086. In general, the parameter α sets the scale of the citing papers that cite the same target paper, and n sets the scale of the whole dataset considered.
S(x) = 1 / (1 + e^(-(x-α)/n))    (1)

With the modified Sigmoid function (Eq. (1)) that rescales the citation counts, the element value of the strength vector s can be calculated as s_i = Σ_{j∈J} S(x_j), where i represents a co-cited paper, J is the set of citing papers that cite the co-cited paper i, and x_j is the number of citations of the citing paper j.
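As a sketch, assuming the modified Sigmoid takes a standard logistic form centered at α and scaled by n (an assumption for illustration), the strength contribution of a set of citing papers can be computed as follows; note that even a zero-citation citing paper receives a small positive weight, while an extreme outlier saturates near 1:

```python
import math

def sigmoid_weight(x, alpha, n):
    """Assumed logistic form of the modified Sigmoid: bounded in (0, 1),
    centered at alpha (mean citations of the citing papers), scaled by n
    (mean citations of all papers in the dataset)."""
    return 1.0 / (1.0 + math.exp(-(x - alpha) / n))

# Citations of the citing papers that co-cite co-cited paper i with the target.
citing_citations = [0, 3, 12, 250]   # the zero-citation paper still contributes
alpha = sum(citing_citations) / len(citing_citations)
n = 20.813                           # dataset-wide mean reported for the APS task

s_i = sum(sigmoid_weight(x, alpha, n) for x in citing_citations)
print(round(s_i, 3))
```

Because each weight lies strictly between 0 and 1, s_i is bounded by the number of citing papers, so a single highly cited citer cannot dominate the strength vector.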
The second modification is the removal of the target paper from the involvement matrix A and the strength vector s. This is because all citing papers are associated with the target paper; hence, the target paper would have the largest value in the strength vector s. Consequently, keeping the target paper in A would dilute the contributions of the co-cited papers, making the credit score C less sensitive to them. In the ablation experiment, we show that removing the target paper itself from the calculation of C makes the identified leading author more distinguishable from the others.
Taken together, as shown in Figure 2, the procedure of NCCAS is as follows: 1. Find all the papers that cite the target paper p0, forming the set of citing papers D = (d0, d1, …, dm).
2. Find all the papers that are authored by at least one author of the target paper. Among them, find the set of co-cited papers, P = (p1, p2, …, pk), that are co-cited with the target paper by the citing papers.
3. Calculate the strength vector s for the co-cited papers. The element value s_i is the sum of the transformed weights of the citing papers that cite the co-cited paper i. For example, for the co-cited paper p2 in Figure 2, we have s_2 = S(x_1) + S(x_2) + S(x_3) + S(x_5), as four papers (d1, d2, d3, d5) cite both the target paper p0 and the co-cited paper p2.
4. Generate the authorship involvement matrix A by calculating the authorship credit of each author in the co-cited papers. The authorship credit value is calculated using the fractional counting method: if author a, one of the authors of the target paper, appears in a co-cited paper with a total of three authors, the authorship credit of author a in this co-cited paper is 1/3.
5. The credit shares of all co-authors in the target paper p0 are given by the vector C = As. By normalizing the vector C, we obtain the credit shares of the co-authors in fractional form.
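The five steps can be sketched end to end on a toy citation network. The network, the logistic form assumed for S(x), and the stand-in value of n below are all illustrative, not the paper's data:

```python
import math

# Toy citation network. The target paper p0 has authors a1, a2.
# cites[d] = papers cited by citing paper d;
# citations[d] = number of citations d itself receives;
# authors[p] = author list of paper p.
cites = {
    "d1": {"p0", "p1"}, "d2": {"p0", "p1", "p2"},
    "d3": {"p0", "p2"}, "d4": {"p0"},
}
citations = {"d1": 4, "d2": 0, "d3": 9, "d4": 2}
authors = {"p1": ["a1", "x"], "p2": ["a2", "x", "y"]}
target_authors = ["a1", "a2"]

def S(x, alpha, n):
    # Assumed logistic form of the modified Sigmoid (Eq. (1)).
    return 1.0 / (1.0 + math.exp(-(x - alpha) / n))

# Step 1: citing papers of p0.
D = [d for d, refs in cites.items() if "p0" in refs]
# Step 2: co-cited papers authored by at least one target author
# (the target paper itself is excluded, per NCCAS).
P = sorted({p for d in D for p in cites[d]
            if p != "p0" and any(a in authors.get(p, []) for a in target_authors)})
# Step 3: strength vector s over co-cited papers.
alpha = sum(citations[d] for d in D) / len(D)
n = 3.75  # stands in for the dataset-wide mean citation count
s = {p: sum(S(citations[d], alpha, n) for d in D if p in cites[d]) for p in P}
# Step 4: involvement matrix A via fractional counting.
A = {a: {p: (1.0 / len(authors[p]) if a in authors[p] else 0.0) for p in P}
     for a in target_authors}
# Step 5: credit C = As, normalized.
raw = {a: sum(A[a][p] * s[p] for p in P) for a in target_authors}
total = sum(raw.values())
credit = {a: v / total for a, v in raw.items()}
print(credit)
```

The normalized shares sum to one, and the author with the larger share would be designated the leading author of p0.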

Validation
The scientific community lacks a unified standard for determining authorship attribution, making it difficult to find instances with known credit allocation results against which to validate an allocation method. Previous studies (Bao et al., 2017; F. H. Wang et al., 2019; J. P. Wang et al., 2019; Shen et al., 2014; Xing et al., 2021) consider papers receiving the Nobel Physics Prize as the validation set to verify the accuracy of the algorithms. Indeed, the choice of Nobel Prize winners reflects the scientific community's recognition: the author who makes the greatest contribution to a Nobel Prize-winning paper is considered the rightful owner of the scientific achievement (Lo, 2013).
Because the Nobel Physics Prize can be awarded to a maximum of three authors and for two different scientific discoveries each year, there are cases in which multiple authors of a Nobel Prize-winning paper receive the award. In addition, previous studies use different evaluation criteria. For example, if the algorithm gives two tied candidates and one of them is the Nobel laureate, do we consider the result 100% accurate or just 50% accurate? Likewise, if a paper has two Nobel laureates and the

top-two candidates given by the algorithm cover only one Nobel laureate, how do we gauge the accuracy of the allocation result? To unify these different cases and provide a comprehensive comparison of different algorithms, we consider two scenarios in this paper.
Whole counting: the output of the algorithm is considered correct if one of the identified leading authors of the paper is a true Nobel laureate. For example, if the algorithm gives two tied candidates and one of them is the true Nobel laureate, or if two authors of the paper are Nobel laureates and the algorithm identifies one of them among the top-two candidates, we consider the output of the algorithm one hundred percent correct. This is the most commonly adopted evaluation criterion in past studies.
Fractional counting: we use a variant of the Jaccard similarity to calculate the fractional overlap between the algorithm output and the ground truth. Assuming that the algorithm predicts a set of leading authors M1 and the set of true Nobel laureates is M2, we evaluate the result as the overlap between M1 and M2 (a variant of the Jaccard similarity |M1 ∩ M2| / |M1 ∪ M2|). Fractional counting provides a more comprehensive evaluation when there are tied owners or multiple Nobel laureates for a given paper.
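A minimal sketch of this criterion, using the plain Jaccard similarity as a stand-in for the exact variant used in the paper:

```python
def fractional_score(predicted, laureates):
    """Jaccard-style overlap between the predicted leading authors and the
    true laureates. The paper's exact variant is not reproduced here;
    plain Jaccard similarity serves as an illustrative stand-in."""
    predicted, laureates = set(predicted), set(laureates)
    return len(predicted & laureates) / len(predicted | laureates)

# Two tied candidates, one of whom is the laureate: scores 0.5.
print(fractional_score({"A", "B"}, {"A"}))
# An exact match scores 1.0.
print(fractional_score({"A"}, {"A"}))
```

Under whole counting, both examples would count as fully correct; fractional counting penalizes the tie.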
Based on the two evaluation criteria, we compare the number of papers whose leading author is successfully recognized by NCCAS and other baseline methods.
The statistics are shown in Table 1 and the detailed values for each paper are shown in Appendix Table S1 and Appendix Table S2. Note that we only retrieve 22 papers in the APS dataset and six papers in the MAG dataset that are applicable to CoCA, so we also report the percentage accuracy for CoCA and NCCAS. The proposed NCCAS performs well in both datasets under both evaluation criteria. NCCAS is based on NCCA, but the introduction of the Sigmoid function improves the identification accuracy. This is mainly because NCCAS can take into account citing papers with zero citations, while NCCA ignores their contributions. The removal of the target paper makes the leading author more distinguishable and effectively reduces tied outputs. Hence, the performance of NCCAS also remains stable under fractional counting.

Distinguishability
To quantify the extent to which the top two candidates are separated, we consider the credit share difference between the highest and the second-highest credit scores (denoted as Δ). The larger the value of Δ, the easier it is to distinguish the leading author.
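Computing Δ from a vector of credit shares is straightforward:

```python
def delta(credit_shares):
    """Distinguishability gap: difference between the highest and the
    second-highest credit shares among the co-authors."""
    top_two = sorted(credit_shares, reverse=True)[:2]
    return top_two[0] - top_two[1]

# A clearly distinguishable leading author...
print(delta([0.62, 0.25, 0.13]))
# ...versus a near-tie.
print(delta([0.51, 0.49]))
```

A tie between the top two candidates gives Δ = 0, the least distinguishable outcome.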
We first analyze the Δ values for the Nobel Prize-winning papers in the APS dataset (Table 2). In most cases, the Δ by NCCAS is the largest; in the few cases where it is not, it is the second largest. It is also noteworthy that, except for one paper for which all algorithms give the same tied result, NCCAS does not produce ties, while NCCA, DCA, and CoCA all yield ties for different papers. Table 2 clearly shows that NCCAS makes the leading author of a Nobel Prize-winning paper more distinguishable compared with the other methods. To go beyond the Nobel Prize-winning papers and to demonstrate how distinguishable the identified leading author is in general, we select the 654,148 papers in computer science and calculate the Δ values by the different methods (Figure 3). The results for all years are shown in Appendix Figure S1. The Δ values by CCA, NCCA, DCA, and CoCA tend to be concentrated in the [0, 0.3] interval, but over 40% of the results by NCCAS have Δ values greater than 0.5. Hence, for general publication data, NCCAS still makes the leading author of a paper distinguishable from the others.

Robustness
As one can intentionally create citing papers that cite the target and co-cited papers, it is important to check the robustness of the results of the different methods under malicious manipulation. For example, F. H. Wang et al. (2019) show that by adding a small number of citations to the target paper, it is possible to significantly change the authorship credit assigned by the CCA method. For this reason, we consider a simple experiment in which papers with a certain number of citations are added that cite the target paper and one randomly chosen co-cited paper. We measure the number of added papers needed to invert the rank of the credit scores, i.e., to make the author originally with the second-highest score become the leading author. We consider two different cases: added papers with 20 citations, and added papers with one citation. The choice of 20 citations is based on the fact that the average number of citations in the APS dataset is about 20; the case of one citation reflects the fact that the average citation count one year after publication is about one in the APS dataset. Note that only NCCAS and NCCA consider the number of citations; CCA and DCA consider all citing papers to be equally important and hence are not affected by the citing papers' citation counts.
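The experiment can be sketched with a CCA-style toy model. The involvement matrix, strengths, and attack loop below are invented for illustration; the full experiment additionally weights each added paper by its own citation count under NCCA/NCCAS:

```python
import random

def papers_to_invert(A, s, challenger, max_added=10_000, seed=7):
    """Toy CCA-style manipulation: each fake citing paper co-cites the
    target with one randomly chosen co-cited paper, adding 1 to that
    paper's strength. Count the fake papers needed until `challenger`
    (originally second-ranked) overtakes the leader."""
    rng = random.Random(seed)
    s = list(s)
    for added in range(1, max_added + 1):
        s[rng.randrange(len(s))] += 1
        credit = {a: sum(A[a][j] * s[j] for j in range(len(s))) for a in A}
        if max(credit, key=credit.get) == challenger:
            return added
    return max_added

# Author A leads initially (credit 2.0 vs 1.5); random boosts to the
# co-cited papers eventually flip the ranking in B's favor.
A = {"A": [0.5, 0.0], "B": [0.25, 0.5]}
s = [4.0, 1.0]
print(papers_to_invert(A, s, "B"))
```

The reported number of added papers is the robustness measure: the larger it is, the harder the method is to manipulate.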
We first focus on the 24 Nobel Prize-winning papers in the APS dataset. Because the added papers cite the co-cited papers randomly, we repeat the experiment ten times and report in Table 3 the average number of papers needed to invert the original credit score rank. Methods that take the number of citations into account (NCCAS and NCCA) are more robust than those that do not (CCA and DCA). This is in line with the conclusions of previous studies (F. H. Wang et al., 2019). In most cases, NCCAS is more robust than NCCA. We also examine the cases where NCCAS is less robust than NCCA (four papers: 10.1103/PhysRevLett.55.48, 10.1103/PhysRevLett.69.37, 10.1103/PhysRev.83.333, and 10.1103/PhysRev.73.679). We find that these papers have citing papers with extreme numbers of citations. Because the original citing papers' citation counts are very high, adding papers with 1 or 20 citations does not significantly change the strength vector. Therefore, in these cases NCCA is more robust than NCCAS.
We further run the experiment on papers in the MAG dataset, using the 654,148 papers in computer science. We add citing papers with 13 citations, which is the average number of citations in the data analyzed. In Figure 4, we show the distribution of the number of papers needed to invert the original credit rank. The results for all years are shown in Appendix Figure S2. The distributions of CCA and DCA are concentrated on small numbers of added papers, while the distributions of NCCA and NCCAS place more weight on large numbers of added papers. The robustness of NCCAS is higher than that of NCCA, as the distribution of NCCAS is more skewed towards large numbers of added papers. This large-scale analysis is in line with the results on Nobel Prize-winning papers: NCCAS is in general more robust to manipulation than the other baseline methods.

Ablation experiment
In this part, we perform different experiments to justify the modifications that NCCAS makes and the choice of some of its parameters.

Removal of the target paper
A major modification made in NCCAS is that the target paper is not included in the author involvement matrix A and the strength vector s. To quantitatively show the effect of this removal, we build an NCCAS-T method that is the same as NCCAS but keeps the target paper in matrix A. The results on leading author identification and the Δ values for the 24 Nobel Prize-winning papers in the APS dataset are shown in Table 4. The removal does not change the identification results: the author with the highest credit score remains the same. However, because the target paper dilutes the contribution of the other co-cited papers, the results by NCCAS-T are not as distinguishable as those by NCCAS. Therefore, removing the target paper's contribution is crucial for NCCAS's improved distinguishability.

The calculation of the authorship involvement matrix A
In all the methods considered, including NCCAS, fractional counting (Van Hooydonk, 1997) is used to construct the authorship involvement matrix A. If a co-cited paper has N authors, then the corresponding element value in A is 1/N. In this section, we explore whether different weight allocations for the authorship involvement matrix make any difference. Besides fractional counting, we check the first-last-author-emphasis method (Tscharntke et al., 2007), the sequence-determines-credit method (Verhagen et al., 2003), harmonic counting (Hagen, 2008; 2010), geometric counting (Egghe et al., 2000), and arithmetic counting (Trueba et al., 2004). The results are shown in Table 5. The optimal choice of authorship involvement matrix calculation differs across methods. It is interesting to note that while fractional counting is adopted by the first relevant method, CCA, it is not always the optimal choice. But in general, fractional counting works well across the different algorithms. Table 5 thus supports the choice of authorship involvement matrix in NCCAS.
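For reference, the alternative counting schemes compared here can be written compactly; `k` is the author's byline position and `n` the number of authors, with the formulas stated as commonly given in the cited literature:

```python
def fractional(k, n):
    """Fractional counting: each of the n authors gets 1/n."""
    return 1.0 / n

def harmonic(k, n):
    """Harmonic counting (Hagen, 2008): the k-th of n authors gets
    (1/k) / (1 + 1/2 + ... + 1/n)."""
    return (1.0 / k) / sum(1.0 / i for i in range(1, n + 1))

def geometric(k, n):
    """Geometric counting (Egghe et al., 2000): 2^(n-k) / (2^n - 1)."""
    return 2 ** (n - k) / (2 ** n - 1)

def arithmetic(k, n):
    """Arithmetic (proportional) counting (Trueba et al., 2004):
    (n + 1 - k) / (1 + 2 + ... + n)."""
    return (n + 1 - k) / (n * (n + 1) / 2)

# Credit of the first author of a three-author paper under each scheme.
for f in (fractional, harmonic, geometric, arithmetic):
    print(f.__name__, round(f(1, 3), 3))
```

All four schemes distribute exactly one unit of credit per paper; they differ only in how strongly the byline position is weighted.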

Selection of the rescale function
We use a modified Sigmoid function to transform the number of citations into the relative importance of a citing paper. The aim of the Sigmoid function is to rescale the citation counts such that extreme values play a less important role. However, other functional forms may also work. In this section, we consider the logarithmic function (Eq. (2)) and the power-law function (Eq. (3)) to characterize relative importance and check whether these functions improve the performance. For the logarithmic function, we choose the parameter β = 2, e, 5. For the power-law function, we choose γ = 1.3, based on the citation distribution of the citing papers related to the Nobel Prize-winning papers in the APS dataset (Appendix Figure S3).
S_2(x) = log_β(x)    (2)
S_3(x) = x^γ    (3)

The identification performance of the different rescale functions is shown in Table 6. The modified Sigmoid function in NCCAS identifies Nobel laureates more accurately than the other two types of functions. In particular, the logarithmic function and the power-law function often yield authors with tied credits. Their performance is even worse when fractional counting is used to quantify the performance. These results demonstrate that the Sigmoid function is a better choice than the other functional forms.

The calculation of parameter α
The parameter α is calculated as the average citation count of the citing papers. In this section, we compare the average value with the median and mode values. The performance of Nobel laureate identification in the APS dataset under the different choices is shown in Table 7. When α takes the average value, the performance is the best.
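On heavy-tailed citation counts, the three candidate values of α can differ substantially, which is why this choice matters (toy counts for illustration):

```python
from statistics import mean, median, mode

# Heavy-tailed toy citation counts of a set of citing papers.
citations = [0, 1, 1, 2, 3, 5, 8, 40]
print("mean:", mean(citations))      # pulled up by the extreme value
print("median:", median(citations))
print("mode:", mode(citations))
```

The mean sits well above the median and mode, so it places the center of the Sigmoid closer to the influential, highly cited citers.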

Conclusion
To summarize, we propose the NCCAS method to allocate credit to each co-author and identify the leading author of a multi-author publication. We introduce a modified Sigmoid function to rescale the citation numbers of the citing papers, and we remove the target paper from the calculation of the credit share. Compared with other credit allocation methods, NCCAS gives the best performance in identifying Nobel laureates in both the APS and the MAG datasets. In addition, the identified leading author is well separated from the other co-authors in terms of credit scores, providing improved distinguishability compared with other methods. NCCAS is also more robust under manipulation, requiring the largest number of added papers to invert the leading author. These features make NCCAS well suited to large-scale publication data.
Future applications include analyzing the role of a paper's leading author. It is interesting to check whether the leading author tends to be the first author, the last author, or the author with the longest academic age (Drenth, 1998; Hundley et al., 2013; Sekara et al., 2018), and how this pattern differs across disciplines. The identification of a paper's leading author would also help in the study of a scientist's research agenda (Huang et al., 2023; S. Huang et al., 2022; X. Yu et al., 2021). Considering only publications led by a scientist reduces the influence of papers in which he/she plays a less significant role. Finally, identifying the leading author can also help in understanding the formation and structure of scientific teams (Milojević, 2014; Yu et al., 2022). NCCAS can be a useful tool to address these questions.

Figure 2 .
Figure 2. Nonlinear Collective Credit Allocation with Sigmoid function (NCCAS) algorithm based on the co-citation network. p0 is the target paper, pk (k = 1–4) are the co-cited papers, and d1–d5 cite both p0 and p1–p4. xm is the number of citations dm receives. The matrix A records the authorship involvement of the co-authors in the co-cited papers pk. The co-citation strength vector s captures the co-citation strength from the citing papers dm to the target paper p0 and the co-cited papers pk. S(x) is the modified Sigmoid function. After obtaining matrix A and vector s, the co-authors' credit shares C in the target paper p0 are computed by C = As with a normalization, where P is the co-cited paper set and ci is the credit of co-author ai in the target paper p0.

Figure 3 .
Figure 3. The distribution of Δ by different methods for papers published in different years.

Figure 4 .
Figure 4. The distribution of the number of added papers by different methods for papers published in different years.

Figure S2 .
Figure S2. The distribution of the number of added papers by different methods for papers published in different years.

Table 1 .
The number of papers identified by different allocation methods under the scenarios "Whole" and "Fractional" evaluations.

Table 2 .
∆ by different methods for the Nobel Prize-winning papers in Physics (APS dataset).

Table 3 .
The number of added papers with different citation counts needed to invert the credit rank for the 24 Nobel Prize-winning papers.

Table 4 .
The leading author identification and the Δ values for the 24 Nobel Prize-winning papers by NCCAS and NCCAS-T in the APS dataset. NCCAS-T refers to the method that keeps the target paper in the author involvement matrix. Under both methods, the author with the highest credit score remains the same, but because the target paper dilutes the contribution of the other co-cited papers, the results by NCCAS-T are less distinguishable than those by NCCAS.

Table 5 .
The identification of Nobel laureates by different authorship involvement matrices with fractional counting.

Table 6 .
The identification of Nobel laureates by different rescale functions.

Table 7 .
The identification of Nobel laureates by the NCCAS algorithm, when parameter α corresponds to the median, mode, and average.