The adoption of Digital Object Identifiers (DOIs, see the DOI Handbook) by publishers of scholarly works is advancing. DOIs are persistent identifiers with a resolution service and a set of metadata about the referenced resources. DOI registration for scholarly publishing is operated almost exclusively by the Crossref DOI registration agency (Crossref). An important part of the metadata deposited with Crossref is the list of references, which can be aggregated into a network of citation links between scholarly works. The COCI project (OpenCitations, 2018) makes openly available those citation links from Crossref that are marked as open. This presents an open alternative to commercial citation databases such as Web of Science (WoS, by Clarivate Analytics), which only offer citation data under restrictive and fee-based licenses.
The ISSI Open Citations Letter (ISSI, 2017) calls for citation metadata to become openly available for scientometrics, both for research in the field and for its applications that support science policy and research evaluation, the latter having a large impact on the scientific community. The lack of transparency and reproducibility implied by the vendor paywalls around citation data inhibits sound practices in the field of scientometrics. Crossref, the only candidate organization named in the open letter, appears to be the best positioned to fulfil the role of an open citation infrastructure, as it (1) already exists and is operational, (2) already contains a sizeable proportion of the required metadata, and (3) makes its metadata openly available.
The proportion of open citations in Crossref is increasing. More than half of the citations in Crossref were classified as open (Shotton, 2017). Van Eck et al. (2018) show that while 77.1% of citations in the Web of Science (WoS) are present in Crossref, only 39.7% are classified as open. Efforts towards open scientometric data sources, documented by events such as the workshop reported on by Fraumann and Van Eck (2019), promise the advent of “open scientometrics”, where citation data need not be sourced from commercial providers. The prerequisite is that Crossref covers and openly provides a sufficiently large part of the citations in the WoS, today's de facto standard citation database for most fields of science. We study whether this prerequisite is satisfied in the context of the Czech Technical University in Prague (CTU), Czech Republic. Specifically, we investigate how well the openly available citation links from the COCI project (OpenCitations, 2018) cover the WoS citation database on a sample in which the cited publications are those tracked in our institution's Current Research Information System (CRIS). We provide a breakdown by individual faculties, fields and, where possible, subfields in two different discipline classifications: the OECD Fields of Research and Development classification and the Czech national discipline classification.
The Czech Technical University is the largest technical university in the country (and the oldest one as well, established in 1707) and is comparable to many technical universities in Central Europe. We expect our results to be relevant to other institutions of similar profiles in the region.
This article extends the work presented at the ISSI 2019 conference (Chudlarský & Dvořák, 2019).
The Czech Technical University in Prague has a long tradition of running an institutional CRIS built in-house. The CRIS integrates our own records with those harvested from the WoS web service interface, including the citations of our authors’ works. This is one of the many integrations of the CRIS; for a detailed description, see Dvořák, Chudlarský, and Špaček (2019).
We limit ourselves to publications from the period 2013–2017 which have both (1) a WoS accession number with a valid record in WoS, and (2) a DOI that is registered in Crossref. To check the second condition, we consult the DOIBoost dataset described by La Bruzzo, Manghi, and Mannocci (2019) or perform an API call to Crossref. We exclude publications whose DOI values differ between the CRIS and the WoS record. This gives a sample of 12,796 publications for which we look up the citations in both the WoS and Crossref: both the citing and the cited publication are present in both WoS and Crossref.
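The selection rules above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `CrisRecord` fields, the `select_sample` function and the in-memory set of Crossref-registered DOIs are hypothetical names, not the structure of the actual CRIS, and a record with no DOI in WoS is assumed not to count as "differing".

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class CrisRecord:
    """Hypothetical, simplified view of a CRIS publication record."""
    year: int
    wos_id: Optional[str]    # WoS accession number, if the record has one
    cris_doi: Optional[str]  # DOI as stored in the CRIS
    wos_doi: Optional[str]   # DOI as stored in the WoS record


def normalize(doi: Optional[str]) -> Optional[str]:
    """DOIs are case-insensitive, so compare them trimmed and lowercased."""
    return doi.strip().lower() if doi and doi.strip() else None


def select_sample(records: Iterable[CrisRecord],
                  crossref_dois: Set[str]) -> List[CrisRecord]:
    """Keep 2013-2017 records that have a WoS id and a Crossref-registered DOI."""
    sample = []
    for rec in records:
        cris, wos = normalize(rec.cris_doi), normalize(rec.wos_doi)
        if not 2013 <= rec.year <= 2017:
            continue
        if not rec.wos_id or cris is None:
            continue
        # Assumption: exclude only when both DOIs exist and disagree.
        if wos is not None and wos != cris:
            continue
        if cris in crossref_dois:
            sample.append(rec)
    return sample
```

In practice the set `crossref_dois` would be replaced by a lookup in DOIBoost or a call to the Crossref API, as described above.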
The November 2018 release of the Crossref Open Citations corpus (OpenCitations, 2018) was used. The “cited” side of the linking relationships is of very diverse quality. Some multiline values need to be joined into a single line. Some values seem to contain several DOIs concatenated, separated by spaces. To rectify these most severe errors we developed a script; applying it made the data load possible and even slightly raised the number of citations to 449,843,367 (2,864 more than the original 449,840,503). However, removing duplicate DOI pairs from the dataset leaves only 445,827,638 unique citation links (4,015,729 fewer). Some of the cited “DOIs” are still unsatisfactory: they contain internal spaces or illegal characters, end in an extra full stop, have superfluous parts in their contents, or are incomplete. There clearly is room for further investigation and improvement, which we are undertaking in a separate line of work and plan to report on separately. Data quality problems on the side of Crossref citations clearly lower the recall of our study.
We found that 53.7% of WoS citations are present in the COCI dump of the open citation network.
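The kind of clean-up described above can be sketched as follows. This is not the script used in the study; it is a minimal illustration that assumes (1) a line break always falls inside a single DOI, (2) concatenated DOIs are separated by whitespace and each starts with the usual `10.<registrant>/` prefix, and (3) lowercasing is an acceptable normalization before deduplication. The function names are hypothetical. Other defects, such as trailing full stops or illegal characters, are deliberately left untouched.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Split only at whitespace that is followed by what looks like a new DOI prefix.
NEXT_DOI = re.compile(r"\s+(?=10\.\d{4,9}/)")


def split_cited(raw: str) -> List[str]:
    """Join a multiline value and split it into the DOIs it seems to contain."""
    joined = raw.replace("\r", "").replace("\n", "").strip()
    return [part.lower() for part in NEXT_DOI.split(joined) if part]


def unique_links(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (citing, cited) DOI pair once, in order of first appearance."""
    seen = set()
    for citing, cited_raw in rows:
        for cited in split_cited(cited_raw):
            pair = (citing.strip().lower(), cited)
            if pair not in seen:
                seen.add(pair)
                yield pair
```

Splitting one row into several links raises the citation count, while the deduplication step lowers it, which mirrors the two opposite effects reported above.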
This is significantly more than the approximate 40% coverage measured by Van Eck et al. (2018) for four out of five broad main fields (in the CWTS Leiden Ranking classification). Note that the remaining main field of Social Sciences and Humanities is marginal in our sample, given the research profile of a technical university.
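For clarity, the coverage figure can be read as the share of WoS citation links that also appear among the open links. A minimal sketch, assuming that both sources have already been reduced to sets of normalized (citing DOI, cited DOI) pairs; the function name `coverage` is ours:

```python
from typing import Iterable, Tuple

Link = Tuple[str, str]  # (citing DOI, cited DOI), both normalized


def coverage(wos_links: Iterable[Link], coci_links: Iterable[Link]) -> float:
    """Fraction of distinct WoS citation links that are also present in COCI."""
    wos = set(wos_links)
    if not wos:
        raise ValueError("no WoS citation links to compare against")
    return len(wos & set(coci_links)) / len(wos)
```

Links that exist only in COCI do not affect the result, since the measure takes WoS as the reference set.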
We found important differences in the coverage among faculties (ranging from 63% down to 28%) – see Figure 1 and the supporting Table 1.
Table 1. Coverage of WoS citations in COCI by unit of the Czech Technical University.
Faculty or University Institute | WoS publications | WoS citations | Of which in COCI | Coverage |
---|---|---|---|---|
Institute of Experimental and Applied Physics | 1,122 | 24,348 | 15,225 | 62.5% |
Faculty of Nuclear Sciences and Physical Engineering | 4,225 | 54,470 | 32,398 | 59.5% |
Faculty of Transportation Sciences | 567 | 15,830 | 9,329 | 58.9% |
Faculty of Mechanical Engineering | 1,778 | 26,114 | 14,999 | 57.4% |
Faculty of Electrical Engineering | 3,959 | 16,726 | 7,768 | 46.4% |
Faculty of Biomedical Engineering | 478 | 2,050 | 950 | 46.3% |
Czech Institute of Informatics, Robotics and Cybernetics | 219 | 459 | 191 | 41.6% |
Faculty of Civil Engineering | 1,727 | 7,131 | 2,539 | 35.6% |
University Centre of Energy Efficient Buildings | 114 | 232 | 72 | 31.0% |
Klokner Institute | 126 | 255 | 78 | 30.6% |
Faculty of Information Technology | 347 | 654 | 185 | 28.3% |
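The Coverage column is the ratio of the two citation counts in each row. As a quick consistency check, the following sketch recomputes it for three rows of the table above; the helper name `coverage_percent` is ours, and rounding to one decimal place is assumed.

```python
def coverage_percent(wos_citations: int, in_coci: int) -> float:
    """Share of WoS citations found in COCI, in percent, to one decimal place."""
    return round(100 * in_coci / wos_citations, 1)


# (WoS citations, of which in COCI), copied from the table above
rows = {
    "Institute of Experimental and Applied Physics": (24_348, 15_225),
    "Faculty of Biomedical Engineering": (2_050, 950),
    "Faculty of Information Technology": (654, 185),
}

for unit, (wos_citations, in_coci) in rows.items():
    print(f"{unit}: {coverage_percent(wos_citations, in_coci)}%")
```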
The coverage also differs significantly among disciplines (ranging from 78% to 25%); see Figure 2 and the supporting Table 2. Only disciplines with more than one hundred publications are listed. The field of Physical sciences is the most populous one and lends itself to a useful subdivision; the subfields of Astronomy (at 78% coverage) at one end and Optics (at 35%) at the other illustrate the variance even within a single field. The second most populous field, “Electrical engineering, Electronic engineering, Information engineering”, is dominated by Electronic engineering in the context of the Czech Technical University, so no useful subdivision is possible there.
Table 2. Coverage of WoS citations in COCI by discipline (the OECD FORD classification).
Field ( / Subfield) | WoS publications | WoS citations | Of which in COCI | Coverage |
---|---|---|---|---|
- | ||||
- | ||||
- | ||||
- | ||||
- | ||||
Biological sciences | 114 | 991 | 545 | 55.0% |
Clinical medicine | 131 | 652 | 316 | 48.5% |
Chemical sciences | 200 | 1,083 | 524 | 48.4% |
Earth and related environmental sciences | 252 | 1,468 | 711 | 48.4% |
Electrical engineering, Electronic engineering, Information engineering | 2,834 | 10,523 | 4,951 | 47.0% |
Mathematics | 820 | 2,303 | 942 | 40.9% |
- Physical sciences / Optics (including laser optics and quantum optics) | 590 | 2,253 | 789 | 35.0% |
Computer and information sciences | 1,000 | 3,097 | 1,071 | 34.6% |
Materials engineering | 745 | 4,184 | 1,404 | 33.6% |
Mechanical engineering | 542 | 1,562 | 516 | 33.0% |
Environmental engineering | 223 | 617 | 200 | 32.4% |
Civil engineering | 942 | 2,555 | 740 | 29.0% |
Medical engineering | 103 | 177 | 38 | 21.5% |
Table 3 lists information similar to Table 2, aggregated in the original Czech national discipline classification. Similar fields in both classifications have very similar levels of coverage, e.g. Astronomy, Particle physics, Nuclear physics, Optics, Mathematics, Electrical and electronic engineering, and Civil engineering. The choice of discipline classification system does not seem to affect the end result much.
Table 3. Coverage of WoS citations in COCI by discipline (the original Czech national discipline classification).
Discipline | WoS publications | WoS citations | Of which in COCI | Coverage |
---|---|---|---|---|
Astronomy, Celestial Mechanics, Astrophysics | 114 | 1,025 | 803 | 78.3% |
Plasma and Gas Discharge Physics | 376 | 1,986 | 1,389 | 69.9% |
Theoretical Physics | 375 | 1,957 | 1,353 | 69.1% |
Elementary Particles and High Energy Physics | 1,398 | 35,792 | 22,308 | 62.3% |
Nuclear, Atomic and Molecular Physics, Colliders | 934 | 12,720 | 7,635 | 60.0% |
Nuclear & Quantum Chemistry | 101 | 463 | 241 | 52.1% |
Sensors, Measurement, Regulation | 377 | 1,139 | 572 | 50.2% |
Computer Applications, Robotics | 530 | 3,807 | 1,868 | 49.1% |
Solid Matter Physics & Magnetism | 294 | 1,466 | 670 | 45.7% |
Electronics & Optoelectronics, Electrical Engineering | 1,385 | 3,149 | 1,416 | 45.0% |
Computer Hardware & Software | 636 | 2,611 | 1,168 | 44.7% |
General Mathematics | 730 | 1,993 | 802 | 40.2% |
Other Materials | 152 | 977 | 355 | 36.3% |
Fluid Dynamics | 161 | 489 | 172 | 35.2% |
Optics, Masers, Lasers | 584 | 2,247 | 787 | 35.0% |
Control Systems Theory | 324 | 1,528 | 528 | 34.6% |
Informatics, Computer Science | 577 | 1,362 | 462 | 33.9% |
Non-nuclear Energetics, Energy Consumption & Use | 212 | 535 | 175 | 32.7% |
Composite Materials | 281 | 2,000 | 641 | 32.0% |
Civil Engineering | 633 | 1,761 | 521 | 29.6% |
Building Engineering | 256 | 682 | 195 | 28.6% |
Metallurgy | 166 | 658 | 188 | 28.6% |
Nuclear Energetics | 119 | 247 | 62 | 25.1% |
The significant difference between our results and those of Van Eck et al. (2018) may be caused by the specific discipline profile of our institution, by the specific publisher choice patterns of our authors, and by the fact that the 5-year window of our sample (2013–2017) is one year later than that of the referenced work (2012–2016). These differences all deserve further research.
The open citations network in Crossref is not yet ready to replace the Web of Science citations. The observed levels of coverage of citations are not yet sufficient for Crossref to be used as the source for citation analyses in research evaluation at the university and/or faculty levels. Note also that while scholarly publications without a DOI are increasingly rare, they still exist.