The scientific community benefits from data sharing. By reusing previous research data, researchers can advance scientific discovery far beyond the original analysis (Piwowar & Vision, 2013). Reusing scientific data helps confirm original results and, when combined with other types of data, supports the generation of new hypotheses.
Biomedical data sharing policies have been established to ensure that data are publicly available. For example, sharing is mandatory for large-scale genomic data generated or analyzed with funding from the U.S. National Institutes of Health (Green et al., 2015). Grantees deposit their data in a public database and serve as the data authors. To build an enduring collection of scientific data and a sustainable data ecosystem, all key players in data management, including data authors, data curators, data users, and funding agencies, must actively fulfill their responsibilities (Bourne, Lorsch, & Green, 2015; National Science Board, 2005). Within this data management lifecycle, data authors conform to data standards and data quality requirements and produce scientific data that are then deposited comprehensively in a public database. Data users, in turn, adhere to the license or copyright requirements governing the data generated by data authors, and they must correctly cite the data in their publications to indicate that their studies draw on other researchers’ data. Identifying data citations is important for funding agencies when evaluating grantees’ scientific contributions and grant outcomes. Additionally, a dataset that is cited more frequently by other researchers is likely to be of high quality and to reflect trends in its domain.
Identifying data citations in full-text literature is challenging, although there is a long tradition of partnership between scientific literature and public data in the medical sciences (Kafkas, Kim, & McEntyre, 2013). A few studies have identified data citations using the unique data accession numbers assigned by a database. However, most project-generated data focus on one specific scientific goal and lack either a well-defined data identifier or standardized citation rules.
The Cancer Genome Atlas (TCGA) project was launched in 2005 and funded by the US government, and it aims to catalogue and discover major cancer-causing genomic alterations to help improve the clinical outcome of cancers (Tomczak, Czerwinska, & Wiznerowicz, 2015). A major goal of the project was to provide publicly available cancer genomic datasets (
Text-mining methods have been developed to identify database citations in the literature by characterizing database entry accession numbers. Neveol et al. developed a machine learning method to extract data deposition statements from full-text literature (Neveol et al., 2011). Furthermore, they analyzed link curation between deposition databases (e.g. GEO and PDB) and the literature and proposed that text-mining tools can improve the links between literature and biological databases (Neveol et al., 2012). Kafkas, Kim, and McEntyre applied the patterns of ENA, UniProt, and PDB accession numbers to identify database citations in full-text literature (Kafkas, Kim, & McEntyre, 2013) and in article supplemental files (Kafkas et al., 2015). Piwowar et al. investigated citation relationships between microarray databases (e.g. GEO and ArrayExpress) and the literature (Piwowar & Chapman, 2010; Piwowar & Vision, 2013). Yu et al. constructed a database link network from pairs of databases that were co-mentioned in the methodology sections of full-text literature to track database usage, connection, and evolution (Yu et al., 2015). These efforts have improved the understanding of data citations for specific databases that have well-defined accession numbers. However, few studies have sought to identify data-literature citation relationships for data generated by a scientific project in which the data are required to be shared but lack a predefined identifier. In this study, we selected a publicly funded project with publicly available project data: TCGA (Tomczak, Czerwinska, & Wiznerowicz, 2015).
To identify TCGA data usage in full-text articles, we propose a computational framework (Figure 1). We collected TCGA-related full-text articles from PubMed Central, constructed a benchmark dataset of articles confirmed to use TCGA data, and analyzed data usage by specific cancer type and high-throughput platform.
PubMed Central (PMC,
We collected 25 open access publications that used TCGA data, as confirmed by the TCGA Network, from the official website (
We developed a full-text extraction method to parse the full-text articles in XML format, extracted metadata such as publication date and author country, and identified TCGA-related key terms as provided.
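The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes JATS-style element names (`front`, `pub-date`, `aff`, `country`, `body`, `sec`, `title`) as commonly found in PMC XML, and the function name `extract_article`, the pattern `TCGA_PATTERN`, and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed key-term pattern: the two TCGA terms named in the text.
TCGA_PATTERN = re.compile(r"\bTCGA\b|Cancer Genome Atlas", re.IGNORECASE)


def extract_article(xml_text):
    """Return publication year, author countries, and the titles of
    top-level body sections that mention a TCGA key term."""
    root = ET.fromstring(xml_text)

    year, countries = None, []
    front = root.find("front")
    if front is not None:
        year = front.findtext(".//pub-date/year")
        countries = sorted(
            {c.text.strip() for c in front.iterfind(".//country") if c.text}
        )

    tcga_sections = []
    for sec in root.iterfind("./body/sec"):
        title = (sec.findtext("title") or "").strip()
        # itertext() gathers all text in the section, including nested tags.
        text = " ".join(sec.itertext())
        if TCGA_PATTERN.search(text):
            tcga_sections.append(title)

    return {"year": year, "countries": countries, "tcga_sections": tcga_sections}


# Hypothetical, heavily simplified article used only for illustration.
SAMPLE_XML = """<article>
  <front><article-meta>
    <aff>Dept. of Example, <country>USA</country></aff>
    <pub-date><year>2014</year></pub-date>
  </article-meta></front>
  <body>
    <sec><title>Methods</title><p>Clinical data were collected locally.</p></sec>
    <sec><title>Results</title><p>We analyzed TCGA glioblastoma data.</p></sec>
  </body>
</article>"""

record = extract_article(SAMPLE_XML)
print(record)
```

Real PMC files carry more metadata variants (multiple publication dates, affiliations without a `country` element, nested sections), so a production parser would need additional handling for those cases.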
The cancer type and high-throughput platform are two characteristic classes of key words in TCGA data usage statements. Here, the cancer type refers to a list of cancers investigated in the TCGA program, whereas the high-throughput platform refers to a list of high-throughput biotechnologies used by the TCGA investigators to profile cancer genomic information. In the TCGA program from 2005 to 2014, over 30 cancers were studied using microarray and next-generation sequencing platforms, producing large-scale data such as gene expression, exon expression, miRNA, copy number variation (CNV), single nucleotide polymorphism (SNP), loss of heterozygosity (LOH), mutations, DNA methylation, and protein expression. Referring to Disease Ontology (Kibbe et al., 2015) and the TCGA data matrix (TCGA data matrix, 2015), we developed a controlled vocabulary for the TCGA cancer type (Table 1) and high-throughput platform (Table 2).
Table 1. Examples of TCGA cancer-type concepts.

| Concept ID | Name | TCGA defined terms [abbr] – [full name] | Synonyms | DO mapping |
| --- | --- | --- | --- | --- |
| D0001 | Glioblastoma | GBM – Glioblastoma Multiforme | Glioblastoma, GBM, adult glioblastoma multiforme, primary glioblastoma multiforme, spongioblastoma multiforme | DOID: 3068 |
| D0002 | Breast cancer | BRCA – Breast Invasive Carcinoma | Breast cancer, breast tumor, breast neoplasm, mammary cancer, mammary tumor, mammary neoplasm, malignant tumor of breast | DOID: 1612 |
| D0003 | Ovarian cancer | OV – Ovarian Serous Cystadenocarcinoma | Ovarian cancer, ovarian tumor, ovarian neoplasm, ovary cancer, ovary tumor, ovary neoplasm, malignant tumor of ovary | DOID: 2394 |
| D0004 | Acute myeloid leukemia | LAML – Acute Myeloid Leukemia | Acute myeloid leukemia, AML, acute myeloblastic leukemia, acute myelogenous leukemia | DOID: 9119 |
Table 2. Examples of TCGA high-throughput platform concepts.

| Concept ID | Name | TCGA-defined terms | Generated data |
| --- | --- | --- | --- |
| P0001 | RNASeq | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | Nucleotide sequence, gene expression |
| P0002 | miRNASeq | IlluminaGA_miRNASeq | miRNAs, microRNA, microRNA sequence |
| P0003 | SNP | Genome_Wide_SNP | SNPs, single nucleotide polymorphisms, CNV, copy number variation |
| P0004 | Methylation | Human methylation | DNA methylation |
As shown in Tables 1 and 2, the TCGA-defined terms were used to standardize the description of program-generated data; however, they are not the terms used in full-text articles. For example, the results section of one article (PMCID: PMC3910500) described the genomic landscape of glioblastoma using whole-exome sequencing (WES), whole-genome sequencing (WGS), and RNA sequencing (RNA-Seq) (Brennan et al., 2013). To identify TCGA cancer type and high-throughput platform concepts in free text, we developed a named entity recognition method based on a biomedical text mining tool (Leaman, Islamaj, & Lu, 2013).
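To illustrate how free-text mentions can be mapped to controlled-vocabulary concepts, the sketch below uses plain dictionary matching. It is a deliberately simplified stand-in and not the text mining tool cited above; the concept IDs and most synonyms come from Tables 1 and 2, while the variants "RNA-Seq" and "RNA sequencing" and the names `VOCABULARY`, `build_matcher`, and `find_concepts` are assumptions added for illustration.

```python
import re

# Small excerpt of the controlled vocabulary (concept ID -> surface forms).
# "RNA-Seq" and "RNA sequencing" are assumed free-text variants.
VOCABULARY = {
    "D0001": ["glioblastoma", "GBM", "glioblastoma multiforme"],
    "D0002": ["breast cancer", "breast tumor", "mammary cancer"],
    "P0001": ["RNASeq", "RNA-Seq", "RNA sequencing", "gene expression"],
    "P0004": ["DNA methylation"],
}


def build_matcher(vocabulary):
    """Compile one case-insensitive pattern over all surface forms."""
    term_to_id = {}
    for concept_id, terms in vocabulary.items():
        for term in terms:
            term_to_id[term.lower()] = concept_id
    # Longest forms first so multi-word names win over their prefixes.
    forms = sorted(term_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return pattern, term_to_id


def find_concepts(text, vocabulary=VOCABULARY):
    """Return the sorted concept IDs whose surface forms occur in text."""
    pattern, term_to_id = build_matcher(vocabulary)
    return sorted({term_to_id[m.group(1).lower()] for m in pattern.finditer(text)})


sentence = (
    "Whole-exome and RNA sequencing of glioblastoma samples, "
    "plus DNA methylation profiles."
)
concepts = find_concepts(sentence)
print(concepts)
```

Exact matching of this kind misses spelling variants and abbreviations that are absent from the dictionary, which is one reason a dedicated named entity recognition and normalization tool is preferable in practice.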
The number of TCGA-related articles has increased as the program has progressed. Figure 2 shows the number of TCGA-related PMC articles published from 2008 to 2015; over 1,600 TCGA articles were published in 2014. The reduction in 2015 is due to data incompleteness as of September 2015. TCGA data accumulation and data sharing contributed to the significant increase in TCGA publications. Phase I of the TCGA program (a 3-year pilot study) aimed to collect cancer tissues, process the biospecimens, apply high-throughput platforms to identify cancer genomic information, and analyze genetic changes involved in the cancer. Since 2009 (phase II), the data generated by the TCGA program have been centrally managed at the TCGA Coordinating Center and entered into public databases, allowing scientists to continually search, download, and analyze the data.
Figure 3 shows the geographical distribution of the TCGA-related publications. Researchers from 37 countries used the TCGA data in their studies; the United States was the most productive, followed by China, Canada, Australia, and Germany.
We compared the TCGA key term features, TCGA term positions, and TCGA-related concepts mentioned in the retrieved PMC articles (Section 3.1) and in the benchmark dataset (Section 3.2). Table 3 shows the true positive rate (TPR) of full-text articles in each dataset that have the TCGA key term features. The TCGA term (i.e. ‘TCGA’ or ‘Cancer Genome Atlas’) was most likely to appear in the results section in both the retrieved PMC article set (74%) and the benchmark dataset (96%). Additionally, studies using the TCGA data are likely to describe the cancer type and high-throughput platform in the full-text articles of both datasets. Although there was a similar TCGA feature distribution within the retrieved PMC article set and within the benchmark dataset (χ2 test,
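Table 3. Distribution of TCGA key terms in full-text articles.

| Feature group | Feature | Retrieved PMC article set (%) | Benchmark dataset (%) |
| --- | --- | --- | --- |
| TCGA term position | Title | 1 | 4 |
| TCGA term position | Abstract | 11 | 28 |
| TCGA term position | Introduction/Background | 12 | 20 |
| TCGA term position | Method/Material | 31 | 68 |
| TCGA term position | Result | 74 | 96 |
| TCGA term position | Discussion/Conclusion | 20 | 36 |
| TCGA related concept mention | Cancer type mention | 73 | 100 |
| TCGA related concept mention | Platform mention | 66 | 96 |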
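Percentages of the kind reported in Table 3 can be derived by counting, for each section type, the share of articles in which the TCGA term occurs. The following is a minimal sketch under the assumption that each article has already been reduced to the set of normalized section names containing the term; the function name `section_rates` and the toy input are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def section_rates(articles):
    """Percentage of articles whose given section mentions the TCGA term.

    articles: list of collections of normalized section names, one per
    article, listing the sections where the term was found.
    """
    if not articles:
        return {}
    counts = Counter()
    for sections in articles:
        # set() ensures an article counts at most once per section type.
        counts.update(set(sections))
    total = len(articles)
    return {section: round(100 * n / total) for section, n in counts.items()}


# Toy input for illustration only (not data from the study).
toy_articles = [
    {"Result", "Method/Material"},
    {"Result"},
    {"Abstract", "Result"},
    {"Discussion/Conclusion"},
]

rates = section_rates(toy_articles)
print(rates)
```

Because one article can mention the term in several sections, the percentages across sections are not expected to sum to 100.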
To investigate specific TCGA data usage, we identified the TCGA cancer types and high-throughput platforms mentioned in the methods/materials and results sections of the PMC full-text articles (Section 3.3). Figure 4 shows the proportion of different TCGA cancer types in the retrieved PMC article set. Glioblastoma (28%), lung cancer (18%), and breast cancer (11%) were the cancer types whose data were used most frequently. Glioblastoma was the first cancer studied by the TCGA program, leading to the development of TCGA infrastructure for data collection and sharing (Cancer Genome Atlas Research Network, 2008). This may be the main reason that the TCGA glioblastoma data were used more frequently.
As shown in Figure 5, the data generated by the RNASeq platform are the most widely used (48%). Compared with traditional DNA sequencing technology, RNA sequencing can help researchers understand the transcriptome by precisely and rapidly deriving a wide range of strand information, such as transcripts, isoforms, gene fusions, and non-coding RNAs (Wang et al., 2009). The TCGA data generated by the RNASeq platform provide researchers with standardized and comprehensive cancer transcriptome profiles for discovering biomarkers related to tumorigenesis and metastasis (Peng et al., 2015).
In this preliminary study, we investigated how to track the use of scientific data generated by a long-term government-funded program. We selected the TCGA program and analyzed over 5,000 full-text articles collected from PMC. We constructed a benchmark dataset of articles confirmed to use TCGA data and compared it with the full-text articles retrieved from PMC. Furthermore, we built a controlled vocabulary tailored to the TCGA program that describes the cancer type and high-throughput platform, which provides insight into which specific data were used. Our work can contribute to the integration of scientific data and scientific literature. As shown in the box in Figure 6, the TCGA funding agencies manually collected the articles and linked them to their source data (TCGA publication, 2016). Our efforts may help develop an automatic method to identify recent publications that use TCGA data.
However, this study has limitations. (1) The benchmark set may introduce bias. We collected only 25 articles from the TCGA website to construct the benchmark dataset, so the patterns of full-text articles that actually cite the TCGA data were not validated on a large-scale dataset. Here, we only compared the TCGA term positions and TCGA-related concepts mentioned in the retrieved PMC articles and in the benchmark dataset. In the future, we may manually construct a benchmark dataset that includes more full-text articles that actually cite TCGA data. (2) The performance of TCGA-related term identification requires evaluation. Here, we applied a biomedical text mining tool to identify the mentioned TCGA cancer types and high-throughput platforms without validating the named entity recognition. (3) Natural language processing is needed to confirm the relationships between cancer type and platform. The data usage statement in full-text literature describes which cancer type samples are tested by which platforms; however, we have not yet considered these specific relationships.
We present a workflow to identify citations of scientific project-generated data via full-text article analysis, and we applied this workflow to track TCGA data citations in the PMC literature. In contrast to previous studies, the scientific data entries in our study lacked predefined accession numbers. Although our preliminary study has limitations, this work is a step towards integrating literature with scientific data generated by a government-funded project. In future work, we expect to improve the construction of the scientific data citation benchmark dataset, normalize the full-text article sections, map the project’s self-defined vocabulary, and evaluate the performance of data citation identification.