Volume 1 (2016): Issue 4 (November 2016)
Journal details
First published
30 Mar 2017
Publication frequency
4 times per year
Open access

Data-driven Discovery: A New Era of Exploiting the Literature and Data

Published online: 01 Sep 2017
Volume & Issue: Volume 1 (2016) - Issue 4 (November 2016)
Pages: 1 - 9
Received: 30 Aug 2016
Accepted: 19 Sep 2016

Dr Ying Ding is an Associate Professor at Indiana University, USA, and Co-Editor-in-Chief of the Journal of Data and Information Science (JDIS). She is Associate Director of the Data Science Online Program and Director of the Web Science Lab. She is a Changjiang Scholar at Wuhan University and an Elsevier Guest Professor at Tongji University. Her research interests include scholarly communication for knowledge discovery, the semantic Web for drug discovery, social network analysis for research impact, and data integration and mediation in Web 2.0. She has published more than 200 papers, which have received over 4,000 citations. She is the Co-Editor of Semantic Web Synthesis by Morgan & Claypool and serves on the editorial boards of several leading international journals.

Scientific Discovery

Scientific discovery revolves around the process of problem solving. It either uses existing well-established methods to explore a new area or invents new methods to solve existing problems. Either way, it is a journey into unknown terrain. Trial and error remains the most common approach to testing new ideas, learning from failures, and, eventually, finding success. The problem-solving process can be viewed as a search for a path connecting the initial state and the goal state (Klahr, 2000). In cognitive science, a problem space contains the set of states, operators, goals, and constraints, and this problem space can be huge or small depending on whether you are on the right path to the final goal. The time to reach the final goal can be significantly shortened if the right tools are used.
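The view of problem solving as a search for a path from an initial state to a goal state can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the toy number puzzle, the operator names, and the `find_path` function are hypothetical and are not taken from Klahr (2000). It uses a plain breadth-first search, where the constraint function limits how large the problem space can grow.

```python
from collections import deque


def find_path(initial, goal, operators, is_valid=lambda state: True):
    """Breadth-first search for a shortest operator sequence from initial to goal.

    operators maps an operator name to a function state -> next state.
    is_valid encodes the constraints that bound the problem space.
    Returns a list of operator names, or None if the goal is unreachable.
    """
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for name, apply_operator in operators.items():
            next_state = apply_operator(state)
            if next_state not in seen and is_valid(next_state):
                seen.add(next_state)
                frontier.append((next_state, path + [name]))
    return None


# Hypothetical toy problem space: reach 10 from 1 with two operators,
# constrained to states between 0 and 20.
OPERATORS = {
    "double": lambda n: n * 2,
    "add_one": lambda n: n + 1,
}

path = find_path(1, 10, OPERATORS, is_valid=lambda n: 0 <= n <= 20)
blocked = find_path(1, 10, OPERATORS, is_valid=lambda n: 0 <= n <= 4)
print(path)     # a shortest sequence of operator names
print(blocked)  # None: the constraints make the goal unreachable
```

Tightening `is_valid` shrinks the space that has to be explored, which mirrors the point that clearer boundaries and better tools shorten the time to reach the goal.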

How challenging the problem-solving process is also depends on the basic components in a problem space. The vagueness of some of these components can easily make scientific discovery purposeless. For example, one can have a task with a well-defined goal state (e.g. proving a mathematical equation) but a vague initial state, a task with a clear initial state (e.g. finding potential binding drugs for a given target) but an unclear goal state, or even a task with an ill-defined initial state and goal state (e.g. inventing a cool tool). More knowledge available to the problem-solver can significantly reduce the vagueness of basic components and set clear boundaries on the problem space. It is important to understand the problem space and foresee next steps.

Knowledge Discovery

Hypotheses can be generated from different sources. The dominant approach to developing a hypothesis in biology and medicine, for example, is through first-hand observation, which includes experimental data, electronic medical records, gene sequence data, and lab test results. The alternative method of generating a hypothesis from literature is viewed as a serendipitous process with great uncertainty, even more so now because the vast amount of published research contains a diversity of knowledge beyond what domain experts can humanly process. Especially for researchers in transdisciplinary domains, it is no longer possible for experts in one domain to fully master the knowledge in another.

Mining literature to generate hypotheses is not confined to biology or medicine but can be done in almost any science. Publications are no longer just an output of research but rather a vital part of the scientific process. A significant number of associations between different biological entities (e.g. disease, gene, drug, side effect, and pathway) are scattered across millions of biomedical articles. Mining these documented associations can reveal new associations and generate novel hypotheses, especially in translational research.

Science is conducted very differently than it was 20 years ago. For example, biology is shifting from conventional biology to conceptual biology (Blagosklonny & Pardee, 2002) and moving further to systems biology (Kell, 2006; Oprea et al., 2007), in part because of a strong opinion that the conceptual review and systems thinking of available published knowledge should take its place as an essential component of scientific research. The world of ideas (i.e. published knowledge), interplaying with high-throughput experiments, computational modelling, and technology, can generate intelligent hypotheses that will end the aimless fishing expeditions of conventional biology. New knowledge, derived from tens of thousands of publications and manually curated datasets, can be linked back to published knowledge to form a self-evolving ecological knowledge base (Mons et al., 2011). Predictions and experiments that were carried out for other reasons can be reused or revealed in a new context that fully embraces the holistic view of knowledge processing.

New ways of conducting research are in high demand, and examples of new methods can be found in many disciplines (Ding et al., 2013). Don Swanson’s (1986) work on undiscovered public knowledge has had a wide impact on association discovery and demonstrated that new knowledge can be discovered from disjoint sets of scientific articles. Swanson’s vision of the hidden value of the scientific literature held in biomedical digital databases is remarkably innovative for information scientists, biologists, and physicians (Bekhuis, 2006; Swanson, Smalheiser, & Bookstein, 2001). Literature-related discovery that mines knowledge in two disparate sets of literature has identified several non-drug approaches that can be used to halt or reverse the symptoms of multiple sclerosis, cataracts, and other chronic diseases (Kostoff, 2012). By combining PubMed literature and public datasets, Chen, Ding, and Wild (2012) predicted potential drug and target pairs. The method performs extremely well in correctly identifying known drug-target pairs in the data and compares favorably with the established Similarity Ensemble Approach (SEA) method (Keiser et al., 2009) for predicting new drug-target interactions, as well as with the Connectivity Map (CMAP) (Lamb et al., 2006) for associating drugs with changes in gene expression levels.
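The idea of discovering new knowledge from sets of articles that do not cite or mention each other is often summarized as an "ABC" pattern: if term A co-occurs with B in one literature and B co-occurs with C in another, while A and C never co-occur, then the A to C link is a candidate hypothesis. The sketch below is a minimal illustration of that pattern only. The function names are hypothetical, the documents are invented toy data loosely modeled on the fish oil and Raynaud's syndrome example in Swanson (1986), and real systems would add term extraction, weighting, and expert filtering.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(documents):
    """Map each term to the set of terms it shares at least one document with."""
    cooccurrence = defaultdict(set)
    for terms in documents:
        for first, second in combinations(sorted(set(terms)), 2):
            cooccurrence[first].add(second)
            cooccurrence[second].add(first)
    return cooccurrence


def abc_candidates(documents, start_term):
    """Return {C: set of linking B terms} for terms C that never co-occur
    with start_term (A) directly but are reachable through an intermediate B."""
    cooccurrence = build_cooccurrence(documents)
    direct = set(cooccurrence.get(start_term, set()))
    candidates = defaultdict(set)
    for b_term in direct:
        for c_term in cooccurrence[b_term]:
            if c_term != start_term and c_term not in direct:
                candidates[c_term].add(b_term)
    return dict(candidates)


# Invented toy "literatures": each set holds the terms found in one article.
TOY_DOCUMENTS = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"fish oil", "triglycerides"},
    {"raynaud syndrome", "blood viscosity"},
    {"raynaud syndrome", "platelet aggregation"},
]

candidates = abc_candidates(TOY_DOCUMENTS, "fish oil")
for c_term, linking_terms in candidates.items():
    print(c_term, "via", sorted(linking_terms))
```

Each candidate is only a hypothesis to be checked by a domain expert or an experiment; the co-occurrence step says nothing about the direction or strength of the association.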

Spangler and colleagues (2014) mined information contained in published articles to identify new protein kinases that phosphorylate the tumor suppressor protein p53. They successfully demonstrated that it is possible to automatically generate hypotheses for domain experts based on existing published scholarly articles. Even in the humanities, Franco Moretti’s distant reading approach tackles literary problems by applying computational methods to aggregate and analyze massive amounts of data and generate hypotheses. He argues that distant reading is needed because nobody is able to read the 60,000 novels published in 19th-century England to understand Victorian fiction (Schulz, 2011). All of these examples show that generating hypotheses by mining existing literature and open datasets can advance science and generate huge societal impact.

While these examples highlight the great capacity of the human brain for integrating information and recognizing patterns, computers are catching up. IBM Watson, a supercomputer, can process millions of articles, patents, Wikipedia pages, and datasets to facilitate research and diagnostic decision making in lung cancer treatment (Upbin, 2013). It also famously defeated two of the best human Jeopardy! players, Ken Jennings and Brad Rutter, in 2011, by parsing keywords in a large set of data to search for related terms as responses. While it is fast, it has the disadvantage that it can misunderstand the context of keywords. In addition, image recognition powered by deep learning has recently outperformed humans (Thomsen, 2015). Project Adam, an initiative by Microsoft, can accurately identify a dog’s breed based on a single photo. Soon, it will be possible for computers to provide nutritional information about a meal or help diagnose skin diseases (Chansanchai, 2014).

Translational Thinking

What Hal Varian called “combinatorial innovation” combines or recombines different component parts of previous innovations or ideas to generate new innovations (McKinsey, 2009). Polymerase chain reaction, which earned Kary Banks Mullis the 1993 Nobel Prize in Chemistry, is the result of recombination of well-understood techniques in biochemistry (Brynjolfsson & McAfee, 2014). Dozens and dozens of publications that documented previous research outputs can be used to trigger translational thinking. These publications can be analyzed and mapped to show the scholarly landscape of unfamiliar fields to a researcher and suggest high-impact works to study and potential collaborators with whom to work. Other examples of combinatorial innovation include medical scientists who mine literature and open data to facilitate diagnostic decision making in cancer treatment, and healthcare professionals who study literature to generate practical guidelines for wound care (Flanagan, 2004).

More and more scientists are thinking about the translational value of their work. Sociologists apply the social concept of structural hole to understand scientific collaboration, and educators utilize literature as a scaffolding technique to enhance active learning. The transdisciplinary collaboration among material scientists, immunologists, and bioengineers has identified an implantable vaccine depot built from a polymer matrix that can kill cancer cells resulting in longer survival, which generates significant impacts on the well-being of society (Ali et al., 2009). Publications and open datasets are ideal instruments to study the success of translational endeavors to further advance scientific innovation.

Transparent Analytics

The process of scientific endeavor, from data curation and analysis to discovery, should be transparent and easily accessible to every researcher so that replication can be easily done and the derived knowledge can be clearly interpreted (Editorial, 2009). Promoting transparency in science is crucial to ensure the reusability of knowledge, avoid reinventing the wheel, and keep scientific discovery focused. Research, both quantitative and qualitative, is experiencing a methodological revolution (Moravcsik, 2014). Every researcher should make their work completely transparent to fellow scholars, and the process from data to conclusions should be interpretable and reproducible.

In recent years, the American Political Science Association (APSA) formally established transparency standards for qualitative and quantitative research by reinforcing the ethical obligation of researchers to facilitate the evaluation of their evidence-based knowledge claims through data access, production transparency, and analytic transparency (APSA, 2012). APSA proposed a new way of citing references called “active citation,” which suggests that any citation in a scholarly publication should be annotated with an explanation of how the citation supports the knowledge claim and should include a hyperlink to an excerpt (ca. 50–100 words) from the original source. These active citations can be located in a “transparent appendix” at the end of the document so that, for researchers, the path from data to conclusions is only one click away. This can foster healthy scholarship by actively engaging researchers in establishing rigorous research ethics and in criticizing, evaluating, and extending fellow scholars’ research. Provenance has been introduced to data and workflows in scientific research to provide the detailed documentation that enables scientific reproducibility. The World Wide Web Consortium has recommended a standard representation for provenance that is both human readable and machine understandable (Groth & Moreau, 2013). Transparency must be considered essential and achieved through active citation and provenance to further advance transparent science.
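The ingredients of an active citation described above (the claim, an annotation explaining how the source supports it, and a linked excerpt of roughly 50–100 words) can be pictured as a small record with a few consistency checks. The sketch below is only one possible reading: the `ActiveCitation` class, its field names, the strict word limits, and the example URL are assumptions made for illustration and are not an APSA schema or a W3C PROV structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveCitation:
    """Hypothetical record for one entry of a transparent appendix."""
    claim: str        # the knowledge claim made in the text
    source: str       # the cited work, e.g. "Moravcsik (2014)"
    annotation: str   # how the source supports the claim
    excerpt: str      # passage quoted from the original source
    excerpt_url: str  # hyperlink to the excerpt (placeholder values below)

    def excerpt_word_count(self) -> int:
        return len(self.excerpt.split())

    def problems(self, min_words: int = 50, max_words: int = 100) -> list:
        """List what is missing; an empty list means the entry looks complete."""
        issues = []
        if not self.annotation.strip():
            issues.append("missing annotation")
        if not self.excerpt_url.strip():
            issues.append("missing excerpt link")
        count = self.excerpt_word_count()
        if not min_words <= count <= max_words:
            issues.append(
                f"excerpt has {count} words, expected {min_words}-{max_words}"
            )
        return issues


# Placeholder data: the excerpt text and URL are invented for the example.
complete = ActiveCitation(
    claim="Qualitative research is undergoing a transparency revolution.",
    source="Moravcsik (2014)",
    annotation="The article argues for active citation in qualitative work.",
    excerpt=" ".join(["word"] * 60),
    excerpt_url="https://example.org/excerpts/1",
)
incomplete = ActiveCitation(
    claim="Some claim.",
    source="Some source",
    annotation="",
    excerpt="too short excerpt",
    excerpt_url="https://example.org/excerpts/2",
)
print(complete.problems())
print(incomplete.problems())
```

Because the source gives the excerpt length only as an approximate range, the default limits are parameters rather than fixed rules.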

Connecting Intelligence

Machines taking their full place at the table of data-driven discovery is a significant step; these new participants make possible what was unimaginable 20 years ago. With machines, it is now possible to systematically collect, interdigitate, analyze, and disseminate publications and data in ways that will greatly impact the tradition of conducting research while providing powerful new resources that significantly advance the progress of both theoretical and applied research. Further, machines can be used to discover new knowledge and afford breakthroughs in current vexing research questions that can only be answered through transdisciplinary innovations.

Increasingly successful applications of full-text indexing, taxonomies, and ontologies dramatically improve the categorization and discovery of related content (Song et al., 2013). The movie The Imitation Game has rekindled the memory of Alan Turing’s success in machine intelligence (You, 2015). In the current data-enriched era, it may be the right time to revisit machine intelligence and connect it with human intelligence. The next generation of artificial intelligence researchers is proposing a new Turing Championship to develop machines with a deeper understanding of the world (e.g. machine comprehension of grammatically ambiguous sentences, machine storytelling from pictures, and machine “humanness” that enables non-disruptive communication between machine and human).

The teamwork of machines and humans can make machines smarter and humans more efficient. The industrial revolution (mainly the steam engine) bent the curve of human history and freed humans from physical labor in the 19th century, allowing for modern mass production. Now, the so-called Second Machine Age will soon bend the curve of human history again by freeing humans from mental labor. This will trigger massive innovation that turns science fiction into reality, as these innovations are generated not only by humans but also by machines.

The combination of human and machine power can bring about new capabilities to compile, review, and mash up related research entities and receive alerts on their activities and interactions, perhaps reaching a scale that was unimaginable 15 years ago. Much like the recent debut of driverless cars, distant scientific dreams could be realized in just a few years, demonstrating the power of current data and machine progress (Brynjolfsson & McAfee, 2014). In the new world of scholarly analytics, attention to and extraction of deeply buried content and findings are the pathways to golden discoveries. Gradually, advances in information technologies, such as the advent of open access, Linked Open Data, semantic publishing, and open science, will make it possible to gather, annotate, and acquire related publications and other data sources and from those discover related content, findings, and conclusions. This could lead to the sudden discovery of unanticipated correlations and connections within an incredibly large and expanding research corpus. We are working on one of the oldest and toughest challenges associated with the combination of computer and human intelligence. The combinatorial innovation of human and machine intelligence will allow us to connect the dots between things that have been disconnected and to accomplish through research what has been unimaginable, allowing us to dig the canal that connects data with knowledge.

American Political Science Association (APSA). (2012). A guide to professional ethics in political science (2nd ed.). Washington, DC: The American Political Science Association. Retrieved on August 15, 2016, from www.apsanet.org/Portals/54/APSA%20Files/publications/ethicsguideweb.pdf.

Ali, O.A., Emerich, D., Dranoff, G., & Mooney, D.J. (2009). In situ regulation of DC subsets and T cell mediates tumor regression in mice. Science Translational Medicine, 1(8), 8ra19.

Bekhuis, T. (2006). Conceptual biology, hypothesis discovery, and text mining: Swanson’s legacy. Biomedical Digital Library, 3, 2.

Blagosklonny, M.V., & Pardee, A.B. (2002). Conceptual biology: Unearthing the gems. Nature, 416(6879), 373.

Brynjolfsson, E., & McAfee, A. (2014). The second machine age: Work, progress, and prosperity in a time of brilliant technologies. New York: W.W. Norton & Company Inc.

Chansanchai, A. (2014). Microsoft research shows off advances in artificial intelligence with Project Adam. Microsoft Blog, July 14. Retrieved on September 2, 2016, from blogs.microsoft.com/next/2014/07/14/microsoft-research-shows-advances-artificial-intelligence-project-adam.

Chen, B., Ding, Y., & Wild, D. (2012). Assessing drug target association using semantic linked data. PLoS Computational Biology, 8(7), e1002574.

Editorial (2009). Data’s shameful neglect. Nature, 461, 145.

Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS One, 8(8), 1–14.

Evans, J.A., & Foster, J.G. (2011). Metaknowledge. Science, 332(6018), 721–725.

Flanagan, M. (2004). Barriers to the implementation of best practice in wound care. Wounds UK, 74–84. Retrieved on September 2, 2016, from www.woundsinternational.com/pdf/content_87.pdf.

Groth, P., & Moreau, L. (2013). PROV-Overview: An overview of the PROV family of documents. Retrieved on September 2, 2016, from www.w3.org/TR/prov-overview.

Jinha, A.E. (2010). Article 50 million: An estimate of the number of scholarly articles in existence. Learned Publishing, 23(3), 258–263.

Keiser, M.J., Setola, V., Irwin, J.J., Laggner, C., Abbas, A.I., Hufeisen, S.J., … Roth, B.L. (2009). Predicting new molecular targets for known drugs. Nature, 462(7270), 175–181.

Kell, D.B. (2006). Metabolomics, modelling and machine learning in systems biology: Towards an understanding of the languages of cells. FEBS Journal, 273(5), 873–894.

Klahr, D. (2000). Exploring science: The cognition and development of discovery processes. Cambridge, MA: MIT Press.

Kostoff, R.N. (2012). Literature-related discovery and innovation update. Technological Forecasting & Social Change, 79(4), 789–800.

Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., … Golub, T.R. (2006). The Connectivity Map: Using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795), 1929–1935.

McKinsey (2009). Hal Varian on how the web challenges managers. Retrieved on September 2, 2016, from www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers.

Mons, B., Van Haagen, H., Chichester, C., Hoen, P.B.T., Den Dunnen, J.T., … Schultes, E. (2011). The value of data. Nature Genetics, 43(4), 281–283.

Moravcsik, A. (2014). Transparency: The revolution in qualitative research. Political Science & Politics, 47(1), 48–53.

Oprea, T.I., Tropsha, A., Faulon, J., & Rintoul, M.D. (2007). Systems chemical biology. Nature Chemical Biology, 3, 447–450.

Schulz, K. (2011). What is distant reading? New York Times, Jan 24. Retrieved on September 2, 2016, from www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html?pagewanted=all&_r=0.

Song, M., Han, N., Kim, Y., Ding, Y., & Chambers, T. (2013). Discovering implicit entity relation with the gene-citation-gene network. PLoS One, 8(12), e84639.

Spangler, S., Wilkins, A.D., Bachman, B.J., Nagarajan, M., Dayaram, T., Haas, P., … Lichtarge, O. (2014). Automated hypothesis generation based on mining scientific literature. Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 1878–1886). New York, USA.

Swanson, D.R. (1986). Fish oil, Raynaud’s syndrome and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1), 7–18.

Swanson, D.R., Smalheiser, N.R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science and Technology, 52(10), 797–812.

Thomsen, M. (2015). Microsoft’s deep learning project outperforms humans in image recognition. Forbes, February 19. Retrieved on September 2, 2016, from www.forbes.com/sites/michaelthomsen/2015/02/19/microsofts-deep-learning-project-outperforms-humans-in-image-recognition.

Upbin, B. (2013). IBM’s Watson gets its first piece of business in healthcare. Forbes, February 8. Retrieved on September 2, 2016, from www.forbes.com/sites/bruceupbin/2013/02/08/ibms-watson-gets-its-first-piece-of-business-in-healthcare.

You, J. (2015). Beyond the Turing test. Science, 347(6218), 116.
