Citations, collaborations, Web links, and other phenomena of interest to the field of informetrics can be studied from a network perspective (Otte & Rousseau, 2002). Co-authorship networks are among the most studied types of networks and can be considered an approximation of collaboration networks (Katz & Martin, 1997). During the latest decades, it has been shown that co-authorship networks share several typical characteristics with other kinds of complex networks. These include high clustering coefficients, low average shortest path lengths, and highly skewed degree distributions (Watts & Strogatz, 1998).

An important milestone in the study of networks was reached by the introduction of random network models (Erdős & Rényi, 1959). These models describe the class of networks in which

Several more advanced models have been proposed that seek to characterize and explain how a network can obtain these characteristics. Let us take the preferential attachment model (equivalent to a success-breeds-success model; Barabási & Albert, 1999; Price, 1976) as a well-known example. In this model, new nodes are added to the network at each time step. Each new node connects to

If one agrees that the state of a network influences a link’s probability, it can be seen that some links are very likely to emerge, whereas others are extremely unlikely. Put another way, every non-random network has certain predictive characteristics that make some links more probable than others. The task of predicting which links are most likely to occur in a future state of the network is known as

We may distinguish between two types of link prediction applications (Guns, 2014) that have sometimes been confounded in the literature:

Network evolution prediction, and

Network reconstruction.

We only consider the case where a network is damaged by random deletions and/or additions. The task becomes more challenging if the most important links (e.g. those with high edge betweenness centrality) are targeted first. In other words, we want to test the

In this paper, we study the predictive characteristics of co-authorship networks for network evolution prediction. Specifically, we want to compare three different ‘formats’ of the same network: unweighted, weighted and bipartite. Previous studies mainly considered the unweighted case, but translocating weighted networks has received more attention (Koren, North, & Volinsky, 2006; Lü & Zhou, 2010; Murata & Moriyasu, 2007; Lü & Zhou, 2010; Zhu & Xia, 2016). This is not surprising as (1) weighted networks are very well studied (e.g. Barrat et al., 2004; Newman, 2001b), making them obvious targets to consider for link prediction optimization, and (2) the unweighted networks are usually derived from the weighted networks, which in their turn are derived from the bipartite ones.

Figure 1 illustrates the difference between the three types. On the left, we have a two-mode or bipartite author-publication network. One can, for instance, see that publication A was written by authors Mark, Zoe, and Jane. This bipartite network corresponds to the weighted co-authorship network in the middle, where link weights denote the number of joint publications. Finally, omitting the link weights results in the unweighted network on the right.

The weighted co-authorship network contains more information than the unweighted one since every unweighted network corresponds to infinitely many weighted ones. At the same time, a weighted co-authorship network typically corresponds to multiple bipartite networks. The non-trivial problem of determining the exact number of corresponding bipartite networks is beyond the scope of the present article. Bipartite networks carry more information than their weighted projections, which in turn carry more information than unweighted networks. We will study and compare the performance of link predictors applied to unweighted, weighted and bipartite networks. Henceforth, we may refer to them as unweighted, weighted and bipartite predictors, respectively.

We formulate the following hypotheses. First, we have two hypotheses regarding

H1a: Other things being equal, if two links occur in the weighted training network, the one with a higher link weight is more likely to re-occur in the test network than the other one.

H1b: Other things being equal, if the bipartite training network contains two publications with respectively

Hypothesis H1a simply states that two authors with more joint papers are more likely to write a joint paper in the future. Hypothesis H1b could be formulated slightly less formally as follows:

The intensity of collaboration (and hence the probability of renewed collaboration) decreases with the number of additional collaborators.

The rationale behind this hypothesis can perhaps best be explained by comparing cases of ‘mega-authorship’ (Kretschmer & Rousseau, 2001) or ‘hyperauthorship’ (Cronin, 2001) – papers written by hundreds or even thousands of authors – with papers authored by two or three researchers. It seems extremely unlikely that the intensity of close collaboration in the former case is as high as it is in the latter case.

Next, we have two hypotheses regarding prediction performance:

H2a: The performance of a weighted predictor is better than the performance of its unweighted counterpart.

H2b: The performance of a bipartite predictor is better than the performance of both its weighted and unweighted counterpart.

Hypothesis H2a stems from the fact that weighted networks carry more information. It therefore seems logical that they also allow for better predictions. Similar considerations lead,

We use two datasets of co-authorship between researchers. The first dataset will be referred to as the

The second dataset will be referred to as the

Table 1 provides some basic statistics on the two datasets. In both cases, we only include those authors that occur at least once in both training and test dataset. This explains why the author numbers are the same for the training and test datasets. In other words, authors that stop publishing after the training period or that start publishing in the test period are excluded from the data, as is customary in link prediction studies (e.g. Liben-Nowell & Kleinberg, 2007). In rapidly expanding fields or fields with a high turnover, this would reduce the usefulness of link prediction. Note that the average number of papers per author is higher than the number of papers divided by the number of authors, because of multi-authored papers.

Descriptive statistics of the two datasets.

UA | INF | ||
---|---|---|---|

Training | Number of authors | 1,102 | 397 |

Number of papers | 7,569 | 1,118 | |

Average number of papers per author | 11.9 | 4.3 | |

Average number of authors per paper | 1.7 | 1.5 | |

Test | Number of authors | 1,102 | 397 |

Number of papers | 7,939 | 1,069 | |

Average number of papers per author | 12.8 | 4.4 | |

Average number of authors per paper | 1.8 | 1.6 |

The primary goal of the current article is to compare the influence of the nature of the training network on prediction performance. Specifically, we are interested in the question to what extent bipartite, weighted and unweighted networks can impact the accuracy of the prediction results. We will mainly focus on a number of predictors that have been thoroughly tested in the literature. Here, we will briefly recapitulate their definitions for unweighted unipartite networks and outline if and how they can be applied to weighted and bipartite networks. All predictors were used as implemented in the

We use the following notation. Each predictor determines a likelihood score

The first predictor is

Because it is so simple and it occurs in the seminal work by Liben-Nowell and Kleinberg (2007), this predictor is used (if only as a comparison point) in virtually every link prediction study. It is intuitively clear that, for instance, two unconnected people who have a large amount of friends in common are likely to eventually meet and befriend each other. This intuition was confirmed empirically in collaboration networks by Newman (2001a), who found that “[a] pair of scientists who have five mutual previous collaborators, for instance, are about twice as likely to collaborate as a pair with only two, and about 200 times as likely as a pair with none.” In other words, many social networks exhibit a natural tendency towards forming triangles and hence, the common neighbors predictor is directly related to the clustering coefficient.

Several normalizations of the common neighbors predictor have been proposed in the literature. Because most of these normalizations yield very similar results (Guns, 2009), we choose one to represent this set of predictors, namely the

Let

Katz (1953) proposed a centrality indicator whose stated aim was to overcome the limitations of plain degree centrality. Interestingly, the same procedure can be used to obtain a measure of relatedness between two nodes. The latter has become known as the _{ij} is 1 if there is a link between nodes

Then, each element ^{k} (the ^{th} power of

The intuition behind

We now turn to the question how the aforementioned predictors can be applied to weighted networks, such that they take link weight into account.

Murata and Moriyasu (2007) propose the following weighted variant of

A potential downside of their approach is that this weighted variant lacks a theoretical basis and has been constructed

If element _{i}_{i}

the step from sets to vectors can be taken using the following equation:

The advantage of the vector-based measures over the set-theoretic ones is that they are not limited to the binary case, i.e. vector elements can have any numeric value.

To apply these in the context of link prediction, vectors

Note that Equations (5) and (7) do not result in the same ranking. This is illustrated by Figure 2. According to (5),

Likewise,

Equation (8) indicates why this is called the cosine measure: it is the cosine of the angle between two vectors. In other words, this measure takes into account the direction but not the magnitude of the two vectors.

We now turn to the _{i}^{th} link in the path. The weighted path length can be defined by taking the sum of the inverses of each weight (Brandes, 2001; Egghe & Rousseau, 2003; Newman, 2001b).

One can then define the distance or dissimilarity between

Egghe and Rousseau (2003) remark that “A direct link is very important. Yet, if weights are high enough, it is possible that an indirect connection leads to a smaller dissimilarity than the direct link.” Consider Figure 3. This network contains three paths between

On the basis of similar considerations, Opsahl, Agneessens, and Skvoretz (2010) propose a generalization of weighted path length in a proximity-based weighted network:

where

We thus obtain the weighted graph distance predictor:

The

Finally,

The bipartite author-publication networks are unweighted. Hence, it is, at least mathematically speaking, possible to apply the predictors from the ‘unweighted predictors’ section to the bipartite network as well. Here, we consider some specific issues that may arise when doing so.

First, we emphasize that our current analysis is only concerned with predicting links between authors themselves, and not links between articles or between authors and articles. This can be achieved by simply distinguishing between ‘eligible’ (author) nodes and ‘non-eligible’ (article) nodes. We only retain predictions that involve two eligible nodes.

Let us now consider the neighbor-based predictors (

Note that this limitation does not mean that the bipartite neighbor-based predictors are the same, since their ranking is different. The common neighbors predictor ranks in decreasing order of the number of joint publications. Cosine is similar, but normalizes with respect to the total number of publications of the authors.

The

Contrary to graph distance, the

Finally,

In the context of link prediction, the notions of recall and precision can be defined as follows. Precision

If either

We use two measures for evaluation. Precision-at-20 (P@20) is the precision for the top 20 predictions and gauges a predictor’s ability to make a small number of high-quality predictions. _{max} is the highest

Links in a network tend to be positively stable: we find that 61.7% (UA) and 62.7% (INF) of the links in the training network reoccur in the test network. Only 14.6% (UA) and 18.2% (INF) of the links in the test networks do not occur in the training network. All this suggests that plain reoccurrence is actually a fairly strong predictor.

What happens if we incorporate link weights? We test this on the UA dataset and find that the

Results for positive stability.

UA | INF | |||
---|---|---|---|---|

_{max} | P@20 | _{max} | P@20 | |

Reoccurrence | 0.59 | 0.48 | 0.61 | 0.48 |

Reoccurrence with weights | 0.59 | 1.00 | 0.61 | 0.81 |

Reoccurrence with author numbers | 0.55 | 0.52 | 0.61 | 0.57 |

To test the influence of the number of authors per paper, we determine for each author pair in the training network the median number of authors on their co-authored publications. This results in an increase of precision, but a decrease (UA) or status quo (INF) in

The relative performance of different predictors has been studied quite extensively (e.g. Guimerà & Sales-Pardo, 2009; Guns, 2014; Liben-Nowell & Kleinberg, 2007; Song et al., 2009) and is not the main focus of the present paper. Rather, we are interested in comparing the influence of the network type on predictor performance. Hence, we will present results per predictor family.

Tables 3 and 4 contain the results for the neighborhood-based predictors for UA and INF, respectively. In all cases, we observe that the unweighted variants perform as well as or better than the weighted ones, both for

Results for neighborhood-based predictors (UA).

Unweighted | Weighted | Bipartite | ||
---|---|---|---|---|

Common neighbors | _{max} | 0.3964 | 0.3928 | 0.5866 |

P@20 | 1 | 1 | 1 | |

Cosine | _{max} | 0.4025 | 0.397 | 0.5865 |

P@20 | 0.1429 | 0.0952 | 0.7619 |

Results for neighborhood-based predictors (INF).

Unweighted | Weighted | Bipartite | ||
---|---|---|---|---|

Common neighbors | _{max} | 0.4091 | 0.3935 | 0.6109 |

P@20 | 0.8571 | 0.5714 | 0.8095 | |

Cosine | _{max} | 0.4332 | 0.4305 | 0.6217 |

P@20 | 0.0476 | 0.0476 | 1 |

The bipartite predictors perform much better (with the exception of precision for common neighbors applied to the UA dataset, which has a perfect score for all three types). This does not really corroborate H2b: though, as explained above, the bipartite neighborhood-based predictors only ‘predict’ the recurrence of already existing co-authorship relations. In other words, their good performance is related to hypothesis H1a.

We have seen that Opsahl et al. (2010)’s refinement of graph distance introduces a tuning parameter that determines to what extent the measure takes link weights into account. Figure 5 shows the results for UA; the resulting picture for INF looks very similar. First, recall that

Figures 6–8 display the results for the Katz predictor for values of

Finally, we turn to the rooted PageRank predictor. Results are shown in Figures 9–12. For UA we find that the differences between the three network types yield rather small differences in terms of

A similar conclusion holds for the INF dataset (Figures 11 and 12). Here, we see that unweighted and weighted obtain the same precision scores, but in terms of

In this paper we have compared predictive characteristics of three types of networks representing co-authorship: unweighted, weighted, and bipartite. The overall hypothesis of the paper is that more information will enhance a network’s predictive potential. This general hypothesis was operationalized in four specific hypotheses.

We have shown that collaborations tend to reoccur and that the more joint papers, the more likely the authors are to be co-authors in the future (hypothesis H1a). We can explain this using a barrier metaphor. In the context of collaboration and co-authorship, initiating new collaborations requires overcoming certain ‘barriers’ (geographic, cognitive, personal…); once this has happened and collaboration has been successful enough to result in at least one shared publication, future collaborations become easier. More co-authored papers imply that the barrier has been lowered even further.

We find only limited evidence for hypothesis H1b, however. A possible explanation could be that the numbers of authors per paper in our data are simply too small to have a clear effect. Nevertheless, our other results show that bipartite networks can lead to an improvement in prediction performance.

Hypothesis H2a states that the performance of a weighted predictor is better than the performance of its unweighted counterpart. This hypothesis is confirmed by the results for rooted PageRank (both UA and INF) and Katz (only UA), but contradicted by the other results. Our results for graph distance may help to clarify this unexpected finding. The best graph distance results were obtained when the tuning parameter

Hypothesis H2b states that the performance of a bipartite predictor is better than the performance of both its weighted and unweighted counterpart. We feel that the results for Katz and rooted PageRank confirm this hypothesis (we leave common neighbors, cosine and graph distance out of the discussion for reasons explained earlier). Only in a few specific cases and with specific parameter values did we observe better results for the weighted than the bipartite case. In most cases, however, the bipartite predictors clearly outperformed the weighted as well as the unweighted ones.

The results of the present study suggest that bipartite networks are underused in current link prediction studies. We have shown that the performance of some common predictors can be improved by applying them to a bipartite network. This finding also raises the question if new predictors that are attuned to bipartite networks can be created. More generally, the results illustrate that bipartite networks can yield results that cannot be obtained by unweighted or weighted networks. Bibliometric studies of co-authorship (but also co-word, co-citation, bibliographic coupling, etc.) might be enhanced by considering the bipartite form of the network.

It would be interesting to apply the same exercise to other networks that are derived from a bipartite network, either co-authorship networks or others. For instance, our case studies contain hardly any instances of hyperauthorship. While we expect that datasets that do include such cases would actually benefit from the bipartite approach, this has as of yet not been shown empirically. We leave this as a suggestion for further research.

As a general concluding comment, it is worth repeating that the current approach to link prediction does not include the appearance or disappearance of nodes. A more comprehensive prediction model that includes these aspects would probably need to incorporate more information than just network topology but also node or link attributes. Multi-relational networks or property graphs (Rodriguez & Neubauer, 2010) are one possible way of approaching this (Guns, 2012).

#### Descriptive statistics of the two datasets.

UA | INF | ||
---|---|---|---|

Training | Number of authors | 1,102 | 397 |

Number of papers | 7,569 | 1,118 | |

Average number of papers per author | 11.9 | 4.3 | |

Average number of authors per paper | 1.7 | 1.5 | |

Test | Number of authors | 1,102 | 397 |

Number of papers | 7,939 | 1,069 | |

Average number of papers per author | 12.8 | 4.4 | |

Average number of authors per paper | 1.8 | 1.6 |

#### Results for positive stability.

UA | INF | |||
---|---|---|---|---|

_{max} | P@20 | _{max} | P@20 | |

Reoccurrence | 0.59 | 0.48 | 0.61 | 0.48 |

Reoccurrence with weights | 0.59 | 1.00 | 0.61 | 0.81 |

Reoccurrence with author numbers | 0.55 | 0.52 | 0.61 | 0.57 |

#### Results for neighborhood-based predictors (UA).

Unweighted | Weighted | Bipartite | ||
---|---|---|---|---|

Common neighbors | _{max} | 0.3964 | 0.3928 | 0.5866 |

P@20 | 1 | 1 | 1 | |

Cosine | _{max} | 0.4025 | 0.397 | 0.5865 |

P@20 | 0.1429 | 0.0952 | 0.7619 |

#### Results for neighborhood-based predictors (INF).

Unweighted | Weighted | Bipartite | ||
---|---|---|---|---|

Common neighbors | _{max} | 0.4091 | 0.3935 | 0.6109 |

P@20 | 0.8571 | 0.5714 | 0.8095 | |

Cosine | _{max} | 0.4332 | 0.4305 | 0.6217 |

P@20 | 0.0476 | 0.0476 | 1 |

^{th} International Conference of the International Society for Scientometrics and Informetrics (pp. 249–260). Leiden: Leiden University Press.

^{th} ACM Internet Measurement Conference (pp. 322–335). New York: ACM.