Volume 2 (2017): Issue 4 (December 2017)
Journal Information
License
Format
Journal
eISSN
2543-683X
First Published
30 Mar 2017
Publication Frequency
4 times per year
Languages
English
Open Access

Rediscovering Don Swanson: The Past, Present and Future of Literature-based Discovery

Published: 29 Dec 2017
Volume & Issue: Volume 2 (2017) - Issue 4 (December 2017)
Page range: 43 - 64
Received: 09 Aug 2017
Accepted: 13 Sep 2017
Introduction

Don R. Swanson (1924–2012) was well appreciated during his lifetime as Dean of the Graduate Library School at the University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles (Figure 1). Don became Emeritus in 1996, but did not truly retire until around 2007, when he suffered a series of strokes. Around 10 years ago, Tanja Bekhuis (2006) wrote a review article that discussed Don’s contributions and their subsequent influence on bioinformatics and text mining. Recently, Sebastian, Siew, & Orimaye (2017a) have published a comprehensive review from a technical standpoint, and the reader is urged to consult this article for an overview of existing and emerging methods that are being applied to the field of literature-based discovery. Here I give a more personal perspective. In particular, I will include a discussion of problems and issues which were inherent in Don’s thoughts during his life, but which have not yet been fully taken up and studied systematically.

Figure 1

Don R. Swanson.

The first thing to realize about Don is that Don is not short for Donald. Don was his legal first name. Do not make that mistake, please—it irritated him to no end!

The second thing to realize is that my relationship with Don was idyllically intellectual in nature. I call my collaboration with Don my “Garage Band” period, the term referring to buddies who spend their free time playing rock music in their garages, playing out of sheer enjoyment, and oblivious of the outer world at large. We were unconcerned whether our research would be seen as important by others, whether it would be published in high-impact journals, whether we would secure grant funding, or other non-scientific concerns that too often drive research efforts.

Undiscovered Public Knowledge

Perhaps the most influential and enduring contribution that Don made to information science is the concept of “undiscovered public knowledge” (UPK), which he approached from a very broad, philosophical standpoint (Swanson, 1986a). The philosopher of science Karl Popper had envisioned that man exists in three worlds: World I is the objective, real world which scientists seek to learn about; World II includes the thoughts and mental activities of scientists; and World III consists of the products of scientists, in particular, the published articles that express findings, models, assertions, and so forth (Popper, 1978). Just as man cannot hope to have perfect knowledge of reality (World I), Don realized that man cannot have perfect knowledge of World III either. Knowledge can be public (e.g. it is published) and at the same time, inaccessible or imperfectly known for one reason or another.

Undiscovered public knowledge encompasses several distinct scenarios. For example, one may ask: How many articles are published that no one reads—no one at all besides the author and (we hope) the reviewers? Information contained in such articles is, indeed, public yet undiscovered.

How much information is contained in articles that few can find, because the article is poorly indexed by Web of Science or by online search engines? Such articles may have been published without a digital presence, or placed in a journal that has limited circulation or low visibility.

A related type of information loss occurs when someone publishes an important article in an obscure or topically inappropriate journal, so that no one will take the finding seriously even if they see it. Few people have the self-confidence to recognize a breakthrough when it comes without the imprimatur of acceptance by a prestigious journal. An example of this happened quite recently: “This German retiree solved one of world’s most complex maths problems—and no one noticed” (Wolchover, 2017). Thomas Royen wrote a paper proving the Gaussian correlation inequality (GCI) and posted a preprint in the arXiv repository; when his work failed to get recognized, he chose to get his proof out in an obscure journal called the Far East Journal of Theoretical Statistics. He might as well have put it in a bottle and thrown it in the ocean!

Some of my own informatics discoveries have been closely related to undiscovered public knowledge. For example, my group discovered that many mammalian microRNAs are derived from genomic repeat elements (Smalheiser & Torvik, 2005). Although we came to this realization through computational studies (Smalheiser & Torvik, 2004), in retrospect the discovery could have been made simply by inspection of the public data available at the UCSC Genome Browser (https://genome.ucsc.edu/). This website brings together dozens of different types of genomic data that are calculated or measured, for example, predicted transcription factor binding regions, cross-species conservation levels, and so on. Each type of data is superimposed on the reference genome, and users can open up and visualize any number of the data sets to observe them in juxtaposition with each other. Two of the data tracks show a) positions of known microRNA genes in red and b) Repeatmasker output, which identifies genomic repeat elements in two shades of grey (Figure 2). If anyone had opened up these two tracks and looked carefully, they would have seen that many of the microRNAs were within the sequences encompassed entirely by specific genomic repeat elements. The fact that no one DID do this indicates that this knowledge was public, yet undiscovered.

Figure 2

Screenshot of UCSC Genome Browser showing the sequence for human mir-95 juxtaposed to tracks for genomic repeats. The genomic region of the mir-95 sequence corresponds to two LINE2 elements in opposite orientations. This provides evidence that, when transcribed into RNA, these LINE2 elements bind each other, creating the hairpin secondary structure that permits the processing of this sequence by enzymes (Drosha & Dicer) to form a microRNA (Smalheiser & Torvik, 2005).

Two Medical Literatures Logically but not Bibliographically Connected

The most novel and fruitful type of undiscovered public knowledge discussed by Don occurs when information is not explicitly discussed in any single article at all. Rather, different assertions and findings need to be assembled across documents to create a new coherent assertion, much as different pieces of a puzzle are assembled to create a single picture.

But how to find these pieces residing in scattered places across the literature, and how to assemble them? Don focused his analyses on first identifying two sets of articles, or literatures, which appear to be complementary (see below) yet are not directly connected to each other. Such literatures are unconnected if they do not have any articles in common, do not have authors in common, and articles in one literature do not cite any articles in the other literature (Swanson, 1987).

In a series of articles in the 1980s, Don analyzed two classic examples of medical literatures that were not (or only slightly) connected, yet contained multiple links of the form “A affects B” in one literature and “B affects C” in the other, such that when they were brought together and assembled, created a persuasive, novel hypothesis. These have become widely analyzed benchmarks for nearly all subsequent studies of literature-based discovery.

The first case was the set of articles on Raynaud’s disease vs the set of articles on fish oil (Swanson, 1986b). Don noticed that several of the pathological alterations that occur in Raynaud’s disease corresponded to physiological alterations that are produced by ingesting fish oil, only in opposite directions. That suggests that ingesting fish oil should counteract some of the signs and symptoms of Raynaud’s disease. Subsequent clinical studies supported this hypothesis (Swanson, 1993).

The second case was the set of articles on dietary magnesium vs the set of articles on migraine headaches (Swanson, 1988). Again, Don noticed that magnesium deprivation has multiple effects in the body that are similar to alterations that are known to worsen migraine headaches, and magnesium itself has effects which should be expected to prevent or treat migraines. For example, magnesium is a calcium channel blocker, and reduces neuronal excitability via blockade of NMDA glutamate receptors. Thus, he proposed that supplementation with dietary magnesium may prevent or alleviate migraines. Again, subsequent clinical studies supported this hypothesis (Swanson, 1993).

Don made further analyses of complementary un-connected literatures, both by himself (Swanson, 1990) and in collaboration with me (e.g. Swanson, Smalheiser, & Bookstein, 2001; Smalheiser & Swanson, 1994, 1996a, 1996b, 1998). It is noteworthy that late in his career, Don proposed a link between atrial fibrillation and running (Swanson, 2006). Exercise is known to be a risk factor for atrial fibrillation, and he proposed that this may be mediated by gastroesophageal reflux, which in turn may be alleviated by taking proton pump inhibitors. Besides being another masterful, insightful example of putting together separate pieces of evidence to form a new whole, it is worth mentioning that these analyses were all based on conditions he experienced himself. He had Raynaud’s syndrome, and he had migraine headaches. And, his chronic atrial fibrillation eventually caused his strokes and led to his withdrawal from active life.

Use of Implicit Information to Bridge Disparate Literatures

It is important to acknowledge a tension between two different meanings of the term “knowledge discovery.” One meaning, the one I started with, is to assemble pieces of information into new wholes that represent new, promising, or surprising research directions or provide potentially transformative or breakthrough insights. The other meaning is to analyze and synthesize existing data to impute new but otherwise predictable, everyday information. An example of this is using first names to predict the gender of individuals. Most of the “Jane” and “Linda” individuals will be female, and most of the “Boris” and “John” individuals will be male. But regardless of which type of discovery we are talking about, to my knowledge, all systematic algorithmic methods for knowledge discovery involve linking different literatures or entities via implicit features that they share. In the case of gender prediction, US Census data can be used to associate first names of individuals in the United States with their reported genders; by aggregating the results over all individuals, each first name is associated with a gender balance score (% females/% males). This becomes reference information that is used to impute gender for a given name instance in some other database. The reference information is implicit because it derives from information that is not explicitly present within the database.
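The gender-imputation example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up rather than real census counts, the score is simplified to the fraction of female records per name, and the 0.9 cutoff and the function names are hypothetical.

```python
from collections import Counter

def build_reference(records):
    """Aggregate (first_name, gender) records into name -> fraction female."""
    female = Counter()
    total = Counter()
    for name, gender in records:
        key = name.strip().lower()
        total[key] += 1
        if gender == "F":
            female[key] += 1
    return {name: female[name] / total[name] for name in total}

def impute_gender(name, reference, threshold=0.9):
    """Impute gender for a name found in some other database.

    The threshold is an assumed cutoff, not a value from the article.
    """
    share = reference.get(name.strip().lower())
    if share is None:
        return "unknown"      # name absent from the reference data
    if share >= threshold:
        return "female"
    if share <= 1 - threshold:
        return "male"
    return "ambiguous"        # name used for both genders

# Toy records, invented for illustration (not real census data).
toy_records = (
    [("Jane", "F")] * 98 + [("Jane", "M")] * 2
    + [("John", "M")] * 99 + [("John", "F")] * 1
    + [("Robin", "F")] * 55 + [("Robin", "M")] * 45
)
reference = build_reference(toy_records)
```

Calling `impute_gender("Jane", reference)` looks up the aggregated score rather than anything stored with the name itself, which is what makes the reference information implicit.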

Commonly, implicit information is used as a bridge to measure the similarity of two entities. For example, two diseases A and B may be related in terms of how many Medical Subject Headings they share (in articles that describe disease A and disease B, respectively). Or, they may be related in terms of how many single-nucleotide polymorphisms (SNPs) have been shown to affect disease risk in both disease A and B. Or, they may be related in terms of how many clinical signs and symptoms they share. Or, how many single-gene mutations which affect disease A or B affect genes that lie in the same biochemical pathway. There are many possible types of implicit information that connect disease A with disease B, and it is even possible to combine multiple types of information to create a heterogeneous graph in which diseases are nodes and implicitly shared items form links between the nodes (Shi et al., 2017).

The use of implicit information is a powerful general technique of knowledge discovery, which has spawned several entire fields in bioinformatics and genomics (Bekhuis, 2006; Zweigenbaum et al., 2007). Don is the father of the field of drug repurposing, which proposes new uses for existing approved drugs (e.g. Weeber et al., 2003; Yang et al., 2017). Prediction of adverse drug effects follows a similar type of logic (e.g. Hristovski et al., 2016; Shang et al., 2014), as does detection of co-morbidities and other relations among drugs, diseases, and genes (Ding et al., 2013; Frijters et al., 2010; Vos et al., 2014). Almost all approaches to genomic discovery involve implicit information as well. Furthermore, implicit information is a central concept generally in text mining and natural language processing.

The One-node Search

In Don’s original A-B-C model, implicit information was used in what is known as the “one-node search” approach (Figure 3):

Figure 3

Schematic diagram illustrating the one-node search. Reprinted from Swanson & Smalheiser (1997) with permission.

Begin with a set of articles that discusses or presents information regarding a problem, e.g. prostate cancer or poverty = literature C.

Look for another literature, unknown at the outset, which has information that can contribute to solving the problem = literature A.

Use words and phrases in the titles of articles in literature C = B-terms [use filtering to keep only “important” words in some sense]. The B-terms are the implicit information.

Carry out many searches to create B1, B2, B3.... -literatures.

Tabulate the title words and phrases in each Bi-literature = candidate A-terms and rank them according to how many B-literatures they are in.

Carry out a search using each Ai-term to define the Ai-literature.

An Ai-literature which shares many B-terms with the original C-literature is hypothesized to contain information that may help solve the problem.
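The steps above can be sketched on a toy corpus. This is a simplified illustration, not Don's original tool: "searching" is reduced to single-word matching on a small in-memory list of invented titles, the stoplist is hypothetical, and the final steps (defining each Ai-literature and comparing it back to C) are collapsed into counting how many B-terms support each candidate A-term.

```python
from collections import defaultdict

# Hypothetical stoplist for this toy example only.
STOPWORDS = {"and", "in", "disease", "patients", "reduces", "inhibits", "improves"}

def tokenize(title):
    """Lowercase a title and drop stopwords."""
    return [w for w in title.lower().split() if w not in STOPWORDS]

def one_node_search(titles, problem_term):
    """Rank candidate A-terms by how many B-terms link them to literature C."""
    # Literature C: titles that mention the problem.
    c_lit = [t for t in titles if problem_term in t.lower().split()]
    rest = [t for t in titles if t not in c_lit]
    # B-terms: filtered title words of literature C.
    b_terms = {w for t in c_lit for w in tokenize(t)} - {problem_term}
    # For each B-term, scan its B-literature and collect candidate A-terms.
    support = defaultdict(set)
    for b in b_terms:
        for title in rest:
            words = tokenize(title)
            if b not in words:
                continue
            for w in words:
                if w not in b_terms:
                    support[w].add(b)
    # Rank A-terms by the number of distinct B-literatures they appear in.
    return sorted(((a, len(bs)) for a, bs in support.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Invented titles, loosely modeled on the Raynaud's disease / fish oil case.
TITLES = [
    "Raynaud disease and blood viscosity",
    "Platelet aggregation in Raynaud disease",
    "Vascular reactivity in Raynaud patients",
    "Fish oil reduces blood viscosity",
    "Fish oil inhibits platelet aggregation",
    "Fish oil improves vascular reactivity",
    "Aspirin inhibits platelet aggregation",
]
ranked = one_node_search(TITLES, "raynaud")
```

On this toy corpus, "fish" and "oil" are each supported by six B-terms and "aspirin" by two, so the fish oil literature surfaces first. Treating single words rather than phrases as terms is a simplification made to keep the sketch short.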

Despite its conceptual appeal, the one-node search has several nuances and limitations in practice, and many variations of the ABC model have been explored (see reviews in Bruza & Weeber, 2008; Sebastian, Siew, & Orimaye, 2017a; Smalheiser, 2012b):

For example, different words that have essentially the same meaning (lexical variants, synonyms, abbreviations, and alternative spellings) should ideally be counted and treated as a single B-term. Conversely, Preiss and Stevenson (2016) have demonstrated that word sense disambiguation, i.e. to separate different senses of the same word as used in different instances, can improve performance of discovery systems.

Titles do not capture all information in an article. Words contained in the abstract and full text will also contribute information, although these terms will also contribute significant noise (Cohen, Johnson, et al., 2010).

Words and phrases are not the only, or necessarily the best, type of information to employ for linking literatures. Many other investigators have used concepts, MeSH terms, entities, and relations extracted from text (reviewed in Bruza & Weeber, 2008; Sebastian, Siew, & Orimaye, 2017a).

Similarly, ranking Ai-literatures according to the number of Bi-terms in their titles is a relatively crude and nonrobust measure. The hope is the B-terms will point to the existence of causal mechanisms that link the literatures, but this is not necessarily the case. Other investigators have proposed ranking measures based on e.g. mutual information, relations, and/or network properties, including citations (e.g. Cameron et al., 2015; Ding et al., 2013; Hristovski et al., 2015; Smalheiser, 2012b; van der Eijk et al., 2004; Wren, 2004).

The one-node search involves multiple searches and calculations of title words and phrases, which introduce computational complexity. In practice, investigators generally restrict the number or type of B-terms to be used for linking, with either semantic or statistical criteria. Furthermore, rather than searching for all possible A-literatures that might exist, generally they are restricted to being in some predefined semantic category (such as drugs).

Presenting many Ai-literatures to the investigator, even when ranked, causes great cognitive complexity, since each candidate A-literature requires detailed manual examination to assess.

The Two-node Search

Perhaps the most important limitation of the one-node search is not technical, but sociological: The one-node search is intended to help investigators who are looking for a new hypothesis—yet most investigators are already drowning in a sea of existing potential hypotheses and findings, and their goal is not to find still more hypotheses, but rather to decide which of the existing ones is most promising to pursue. Thus, in my own work, I have emphasized the importance of the two-node search strategy, which can be summarized as follows:

An investigator already has a hypothesis (or an experimental finding) that links A and C, but which has not been explicitly investigated directly in any single published article.

He or she carries out a two-node search between the set of articles that discusses A and the set of articles that discusses C, and examines the shared title words and phrases Bi.

The goals are to rank the list of Bi-terms to home in on the most relevant and promising links, and to examine possible mechanisms that link A to C.
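The core of a two-node search, finding the title terms shared by literatures A and C, can be sketched as a set intersection. The titles and stoplist below are invented for illustration, and single words stand in for the words and phrases used by the actual tool.

```python
def two_node_b_terms(a_titles, c_titles, stopwords):
    """Return the title words shared by literatures A and C (candidate B-terms)."""
    a_words = {w for t in a_titles for w in t.lower().split()} - stopwords
    c_words = {w for t in c_titles for w in t.lower().split()} - stopwords
    return sorted(a_words & c_words)

# Invented titles, loosely modeled on the magnesium / migraine case.
A_TITLES = [
    "Magnesium deficiency and vascular tone",
    "Magnesium and calcium channel function",
    "Magnesium inhibits platelet aggregation",
]
C_TITLES = [
    "Calcium channel blockers in migraine prophylaxis",
    "Platelet aggregation during migraine attacks",
    "Vascular tone in migraine",
]
b_list = two_node_b_terms(A_TITLES, C_TITLES, stopwords={"and", "in", "during"})
```

The resulting B-list is unranked; deciding which of these shared terms point to meaningful links is the harder part of the task.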

To create a quantitative model that would allow us to rank Bi-terms, I assembled a team of neuroscientists, who used the two-node search tool freely in the course of their scientific work. Vetle Torvik and I chose 8 of their searches as a gold standard, in which Bi-terms were manually marked as being relevant for linking A to C. Each Bi-term (marked as relevant or not relevant) was scored according to eight features (Table 1). These features are domain-independent insofar as they do not rely on any reported knowledge about entities, facts, or relations; rather, they are based on statistical properties such as the frequency of the term within MEDLINE (Table 1; Torvik & Smalheiser, 2007). As a negative control training set, we chose random pairs of query literatures (having similar size and topics as the gold standard set), and scored all Bi-terms in the negative set. We created a logistic regression model, based on a weighted sum of these features, to predict the probability that a given Bi-term would be marked as relevant, i.e. that it would be deemed relevant by users for linking A and C in a meaningful manner (Torvik & Smalheiser, 2007).

Table 1

Eight features used to characterize each B-term.

No.  Feature
1    Does the B-term occur in more than one paper within literatures A and C?
2    Do the AB and BC sub-literatures share any MeSH terms?
3    Does the B-term map to at least one UMLS semantic category?
4    Does the B-term have a high literature cohesion score?
5    Is the B-term moderately frequent within MEDLINE as a whole?
6    Did the B-term first appear recently within MEDLINE as a whole?
7    Is the B-term highly characteristic within literature A or C?
8    Do the words within the B-term all occur on the customized 1,400-word stoplist?

Note. Reprinted from Torvik & Smalheiser (2007) with permission. See this reference for definitions and details regarding how the features were numerically scored.
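A logistic regression model of this kind turns a weighted sum of feature scores into a probability of relevance. The sketch below shows only the general form; the weights, intercept, and feature values are hypothetical placeholders, not the fitted values reported by Torvik & Smalheiser (2007).

```python
import math

def relevance_probability(features, weights, intercept):
    """Logistic model: probability that a B-term is relevant."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the eight features in Table 1 (illustration only).
# The last feature (all words on the stoplist) is given a negative weight.
WEIGHTS = [0.8, 0.6, 0.5, 0.7, 0.4, 0.3, 0.6, -1.5]
INTERCEPT = -1.0

# A term scoring well on features 1-7, and a term consisting of stoplist words.
p_good = relevance_probability([1, 1, 1, 1, 1, 1, 1, 0], WEIGHTS, INTERCEPT)
p_stop = relevance_probability([0, 0, 0, 0, 0, 0, 0, 1], WEIGHTS, INTERCEPT)
```

With these placeholder weights the first term lands well above 0.5 and the second well below it, so sorting the B-list by this probability puts the more promising terms first.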

The two-node search interface (http://arrowsmith.psych.uic.edu) makes it easy for investigators to carry out two-node searches among PubMed articles.

The two-node search also provides an aggregate measure of the implicit semantic similarity of any two literatures, based upon the body of Bi-terms, taken as a whole. Suppose we perform a two-node search between two literatures (say, peanut butter and health literacy) and find that there are 1,263 terms on the B-list, of which 402 are predicted to be relevant (i.e. the estimated probability of relevance is > 0.5). The ratio 402/1263 = 0.32 is called the pR score, and it provides an overall measure of the shared implicit information between the two literatures. Randomly chosen pairs of literatures tend to have pR scores around 0.07, whereas literatures that are very closely related in terms of topics tend to have pR scores of 0.4–0.5. We have used the pR score as an important feature for literature-based discovery (Peng, Bonifield, & Smalheiser, 2017).
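The pR calculation is a simple proportion. As a minimal sketch (the function name and the placeholder probabilities are assumptions made for illustration):

```python
def pr_score(probabilities, threshold=0.5):
    """Fraction of B-terms whose predicted relevance exceeds the threshold."""
    if not probabilities:
        raise ValueError("the B-list is empty")
    relevant = sum(1 for p in probabilities if p > threshold)
    return relevant / len(probabilities)

# Reproduce the worked example: 1,263 B-terms, 402 of them predicted relevant.
# The individual probabilities (0.9 and 0.1) are placeholders.
example_probs = [0.9] * 402 + [0.1] * 861
example_pr = pr_score(example_probs)
```

Rounded to two decimals, `example_pr` is 0.32, matching the ratio 402/1263 given in the text.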

The One-node Search Reconceptualized as a Series of Two-node Searches

Don’s original Web-based one-node search tool is no longer available. I have implemented a simpler version (http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/one-node.cgi) in which the investigator starts with a literature that represents a problem to be solved (e.g. Huntington’s disease). Next, the user will be prompted to choose a category of Medical Subject Headings (MeSH) to search within, which encompasses a set of literatures describing entities (or classes of entities) that represent possible approaches or solutions to the problem. (Alternatively, the user can choose the Free Format option, to enter any list of PubMed search queries manually, one on each line.) For example, to search among different classes of drugs according to their molecular mechanism using the MeSH Tree option, the user would drill down from Chemicals and Drugs to Chemical Actions and Uses to Pharmacologic Actions and finally to Molecular Mechanisms of Pharmacological Action [D27.505.519]. This category includes about twenty classes of drugs, including Alkylating Agents [D27.505.519.124], Angiotensin Receptor Antagonists [D27.505.519.162], Antacids [D27.505.519.170], Antifoaming Agents [D27.505.519.178], and so on. Once the user chooses this MeSH term category, the software will carry out a series of two-node searches, each consisting of A = Huntington’s disease vs C = one of the drug classes. These two-node searches are characterized according to the total number of articles in A and C (and nAC, the intersection of A and C), as well as the total number of B-terms. Finally, the searches are ranked according to pR, the percentage of B-terms that are predicted to be relevant for meaningful linkage. The two-node search results are all individually stored temporarily by job ID so users can go back without the need to re-run the search each time. Thus, carrying out a one-node search is simply a matter of carrying out a series of two-node searches, one for each MeSH term within the category of interest (Smalheiser, 2012b). This greatly simplifies the computational issues involved.
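The idea of running one two-node search per candidate literature and ranking by pR can be sketched as a loop. Everything below is a toy stand-in: the titles are invented, the candidate labels are hypothetical, and `is_relevant` is a crude placeholder (a stoplist check) for the logistic relevance model.

```python
def rank_candidates(problem_titles, candidate_literatures, is_relevant):
    """Run one toy two-node search per candidate and rank the results by pR."""
    problem_words = {w for t in problem_titles for w in t.lower().split()}
    results = []
    for name, titles in candidate_literatures.items():
        cand_words = {w for t in titles for w in t.lower().split()}
        b_terms = problem_words & cand_words          # shared title words
        n_relevant = sum(1 for b in b_terms if is_relevant(b))
        pr = n_relevant / len(b_terms) if b_terms else 0.0
        results.append({"candidate": name, "n_b_terms": len(b_terms), "pR": pr})
    return sorted(results, key=lambda r: (-r["pR"], r["candidate"]))

# Placeholder relevance test: a term counts as relevant if it is not a stopword.
STOPLIST = {"and", "in", "the", "of"}

def not_on_stoplist(term):
    return term not in STOPLIST

# Invented titles for the problem literature and two hypothetical drug classes.
PROBLEM = [
    "Huntington disease and glutamate excitotoxicity",
    "Mitochondrial dysfunction in Huntington disease",
]
CANDIDATES = {
    "class_x": ["Glutamate receptor antagonists and excitotoxicity"],
    "class_y": ["Antacids in the treatment of reflux and dyspepsia"],
}
ranking = rank_candidates(PROBLEM, CANDIDATES, not_on_stoplist)
```

Here `class_x` shares three title words with the problem literature, two of them judged relevant, while `class_y` shares only stopwords, so `class_x` ranks first. Because each two-node search is independent, the per-candidate results could also be cached and revisited individually.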

Examples from the Front Lines of Scientific Investigation

A variety of investigators have used literature-based discovery (LBD) methods to propose specific hypotheses which were then tested experimentally. Some of these studies introduced new LBD methodology (e.g. Wren et al., 2004), whereas others used the public Arrowsmith two-node search interface. Dong et al. (2014) investigated links between anandamide and gastric cancer. Maver et al. (2013) identified novel treatments for neovascularization in diabetic retinopathy. Miller et al. (2012) found mechanisms to link hypogonadism and diminished sleep quality in aging men. Cairelli et al. (2013) proposed a possible explanation for the “obesity paradox” whereby obese patients have better outcomes in intensive care. Manev & Manev (2010) studied a 5-lipoxygenase-leptin-Alzheimer connection. Kell (2009) used LBD to assess abnormal iron chelation as a common pathogenetic factor in a variety of diseases.

In my own laboratory studies, separately from Don, I have also put together assertions and knowledge from disparate literatures to formulate hypotheses that I have tested and verified experimentally. Unlike the examples stated above, in which we or others deliberately searched for complementary literatures, the latter examples arose haphazardly during the course of laboratory investigations.

For example, we had discovered that an enzyme, dicer, which is known to cleave double-stranded RNA to form small RNAs, is expressed and even highly enriched at postsynaptic densities present at synaptic contacts in the central nervous system (Lugli et al., 2005). However, paradoxically, although the dicer protein was present, it appeared to lack enzymatic activity. On the other hand, we knew that treating purified dicer protein with certain proteases in a test tube will cause dicer to form fragments that show greatly enhanced catalytic activity. And, there was an extensive body of studies that had shown that a naturally-occurring protease called calpain is activated during synaptic stimulation and cleaves a variety of other proteins in a controlled manner. Putting the two lines of studies together, we predicted that during synaptic stimulation, calpain might cleave dicer such that the activated, cleaved form of dicer would exhibit enzymatic activity. This was confirmed in experiments carried out in mouse brain tissue (Lugli et al., 2005).

Another example of connecting two disparate literatures to create a novel testable hypothesis occurred when we proposed that a phenomenon called RNA interference, which had been studied in worms and other lower organisms, might be involved in mediating learning and memory in the mammalian brain (Smalheiser, Manev, & Costa, 2001). It took us a decade to find provisional experimental evidence that this may, indeed, be the case (Smalheiser, 2012a, 2014).

Finally, a third example occurred when we noticed detailed similarities between a class of small vesicles (called secretory exosomes)—secreted by many cell types and reported to contain microRNAs and other types of RNAs—and the structures called synaptic spinules that form at synapses during periods of intense synaptic stimulation (Smalheiser, 2007). This led to the hypothesis that neurons may transfer RNAs and proteins across synapses in an activity-dependent manner (Smalheiser, 2007).

It should be acknowledged that none of these three examples involved computer-generated or automatic LBD algorithms, or even employed an explicit A-B-C model. Instead, both Don’s and my discoveries have largely been made by manual examination of complementary literatures and assembling of quite complex information into coherent wholes (Smalheiser, 2012b). Thus, it should be kept in mind that although most LBD research has focused on situations that are readily recognized by text mining and that follow standard templates (e.g. A affects B and B affects C), these situations represent only the “low hanging fruit,” and more sophisticated models of discourse and assertion will be needed to deal with the rest.

New Directions in Literature-based Discovery
Storytelling

One- and two-node A-B-C search strategies all consider a single intermediate link between two literatures. Perhaps the most straightforward extension of this idea is to construct and assess multi-step paths that exist between two sets of articles (e.g. Baek et al., 2017; Hossain et al., 2012; Sebastian, Siew, & Orimaye, 2017a). Multiple paths can also be constructed to connect entities, authors, and so on. This can be conceptualized variously as an exercise in storytelling, as navigating paths within graphs or networks, or as detecting functional mechanisms.

“Gaps”—Linking Two Sub-fields that Reside inside of a Larger Field of Investigation

My own group has focused recently on linking sub-fields that reside within a larger field of investigation. For example, consider the field of prostate cancer research. Some articles study experimental tumors in mice; some follow people for effects of diet and smoking on risk; some study molecular changes inside tumor cells; some are medicinal chemistry studies, modifying drugs for better solubility or potency or fewer side effects. Not all people in the field of prostate cancer research read all these articles! More to the point, not all topics are explored in all combinations within the prostate cancer field.

If two topics appear at moderately high frequencies within the prostate cancer field and are totally independent of each other, one would expect that they should co-occur in some articles simply by chance. When two MeSH terms co-occur, they often indicate that there is some direct or implicit relationship between them. Specifically, if two topics (defined as MeSH terms) are expected to co-occur in at least 10 articles within a given field, but do not co-occur in any articles at all, we call the pair of topics a “gap.” As reported recently (Peng, Bonifield, & Smalheiser, 2017), gaps can arise for several different reasons. A few gaps reflect idiosyncrasies in the rules given to MEDLINE indexers, such that certain closely related MeSH terms are rarely applied to the same article. Some gaps represent “low hanging fruit,” i.e. research directions that have not yet been investigated but are known to be promising and are likely to be followed up on in the near future. Other gaps may indicate the presence of undiscovered public knowledge; that is, investigators may be unaware of connections that exist among different sub-areas of a single field. We are continuing to investigate the phenomenon of gaps and attempting to use them as a means of discovering new, promising research directions.
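The gap definition can be sketched as follows. One assumption is labeled up front: the expected number of co-occurrences is computed here under simple independence, as n1 × n2 / N for two terms appearing in n1 and n2 of the N articles in the field; the published method may estimate it differently. The article sets are invented.

```python
from collections import Counter
from itertools import combinations

def find_gaps(articles, min_expected=10.0):
    """Find MeSH term pairs expected to co-occur by chance but never observed.

    articles: list of sets of MeSH terms, one set per article in the field.
    Expected co-occurrence assumes independence (an assumption of this sketch).
    """
    n = len(articles)
    term_counts = Counter(t for a in articles for t in a)
    pair_counts = Counter(
        frozenset(p) for a in articles for p in combinations(sorted(a), 2)
    )
    gaps = []
    for t1, t2 in combinations(sorted(term_counts), 2):
        expected = term_counts[t1] * term_counts[t2] / n
        observed = pair_counts[frozenset((t1, t2))]
        if expected >= min_expected and observed == 0:
            gaps.append((t1, t2, expected))
    return gaps

# Toy field of 100 articles with invented indexing.
FIELD = (
    [{"Diet", "Smoking"}] * 30
    + [{"Mice", "Apoptosis"}] * 30
    + [{"Diet", "Apoptosis"}] * 20
    + [{"Smoking"}] * 20
)
gaps = find_gaps(FIELD)
```

In this toy field, "Diet" and "Mice" would be expected to co-occur in 15 articles by chance but never do, so the pair is reported as a gap, along with two others.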

Discovery via Analogy

A popular and important approach in literature-based discovery (and text mining in general) is the semantic representation of words, concepts, relations or predications by vectors (Cole & Bruza, 2005; Gordon & Dumais, 1998; Widdows & Cohen, 2015), either high-dimensional vectors (Cohen & Widdows, 2009) or low-dimensional vectors (Mikolov et al., 2013; Pennington, Socher, & Manning, 2014). One of the endearing features of semantic vector representations is that vectors that lie near each other exhibit similar meanings or similar relations. For example, the relation “King :: Queen” is implemented by subtracting the vector for Queen from the vector for King, resulting in a difference vector (King – Queen) that embodies the relation. Other difference vectors that encode similar relations, e.g. “Man :: Woman,” also lie near this difference vector. In particular, one can pose the question “King :: Queen as Man :: X?” and solve for X by identifying the difference vector which includes Man and lies closest to (King – Queen). Trevor Cohen has extensively explored the use of an analogy model for literature-based discovery based on vector proximity (e.g. Cohen & Widdows, 2009, 2017; Cohen, Whitfield, et al., 2010; Mower et al., 2016).
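The analogy query can be sketched with hand-made two-dimensional vectors. These toy vectors are invented so that the arithmetic is easy to follow; real systems learn vectors with hundreds or thousands of dimensions and often compare them by cosine similarity rather than the Euclidean distance used here.

```python
import math

def solve_analogy(a, b, c, vectors):
    """Solve 'a :: b as c :: X' by matching difference vectors.

    Returns the word X whose difference (c - X) lies closest to (a - b).
    """
    target = [va - vb for va, vb in zip(vectors[a], vectors[b])]
    best_word, best_dist = None, math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are not valid answers
        diff = [vc - vx for vc, vx in zip(vectors[c], vec)]
        dist = math.dist(diff, target)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Invented toy vectors; dimensions loosely read as (royalty, gender).
TOY_VECTORS = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "child": [0.1, 0.0],
    "throne": [0.9, 0.0],
}
answer = solve_analogy("king", "queen", "man", TOY_VECTORS)
```

With these vectors, (man – woman) coincides exactly with (king – queen), so the query returns "woman".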

Link Prediction

Many discoveries involve combining new concepts or bridging disparate fields. One may hope to identify such publications by looking for newly published articles that contain novel combinations of text terms (Packalen & Bhattacharya, 2015), novel combinations of Medical Subject Headings (Mishra & Torvik, 2016; Peng, Bonifield, & Smalheiser, 2017), or whose reference lists cite novel combinations of journals (Uzzi et al., 2013). This leads to a model of literature-based discovery that is based on link prediction on networks. For example, Kastrin, Rindflesch, & Hristovski (2016) model LBD as considering all pairs of MeSH terms that have never co-occurred within a single article before, and seek to learn the factors that best predict the likelihood of an article appearing in the near future that is indexed by both of the MeSH terms. Sebastian, Siew, & Orimaye (2017b) combined text and citation networks for link prediction.
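Link prediction over a MeSH co-occurrence network can be illustrated with one standard baseline, the common-neighbors score: two terms that have never co-occurred are scored by how many other terms have co-occurred with both. This is a generic baseline chosen for illustration, not necessarily one of the predictors used in the studies cited above, and the article sets are invented.

```python
from collections import defaultdict
from itertools import combinations

def common_neighbor_scores(articles):
    """Score never-co-occurring term pairs by their number of shared neighbors."""
    neighbors = defaultdict(set)
    for terms in articles:
        for t1, t2 in combinations(sorted(terms), 2):
            neighbors[t1].add(t2)
            neighbors[t2].add(t1)
    scores = {}
    for t1, t2 in combinations(sorted(neighbors), 2):
        if t2 in neighbors[t1]:
            continue  # already linked: nothing to predict
        shared = neighbors[t1] & neighbors[t2]
        if shared:
            scores[(t1, t2)] = len(shared)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented indexing for five toy articles.
TOY_ARTICLES = [
    {"Fish Oils", "Blood Viscosity"},
    {"Blood Viscosity", "Raynaud Disease"},
    {"Fish Oils", "Platelet Aggregation"},
    {"Platelet Aggregation", "Raynaud Disease"},
    {"Raynaud Disease", "Cold Temperature"},
]
predictions = common_neighbor_scores(TOY_ARTICLES)
```

In this toy network, "Fish Oils" and "Raynaud Disease" never co-occur but share two neighbors, which places the pair among the top-scoring candidate links. A learned model would combine many such network features rather than rely on one.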

Scientific Arbitrage

Don often referred to literature-based discovery as an exercise in “scientific arbitrage,” in which certain ideas or findings are under-valued in one scientific arena, and gain in value by applying them to another field. (In fact, I believe he performed arbitrage in financial markets too!) In his final published article (Swanson, 2011), Don discussed the problem of identifying neglected, dead, or discarded findings and hypotheses as sources of new knowledge. Neglected findings, which are explicitly stated in one or more articles yet not well cited or followed up, may reflect a variety of issues: The articles in which they appeared may not be easy to find (particularly in full-text form), the findings themselves may have been refuted by later studies, or they may simply have been ahead of their time. The use of text mining to identify these neglected findings, and predict which (if any) ought to be resurrected and rehabilitated, remains an open question for further investigation.

A particular type of neglected finding is what I have called “negative consensus” (Smalheiser & Gomes, 2014), in which the investigators in a given field mention that a particular event or happenstance does NOT occur in nature. Sometimes this is documented by definitive experimental studies, in which case one would expect that negative assertions would cite the negative evidence. Often, however, the negative assertions simply reflect prevailing dogma or investigators’ expectations or “common sense”, and such cases do not cite any supporting evidence at all. My (somewhat contrarian) view is that negative consensus statements that lack experimental testing are in fact good subjects for further research. A small input of experimental testing may challenge the prevailing paradigm or dogma that made the finding seem so unlikely. For example, we noted that the protein Argonaute binds DNA in the test tube, yet investigators have simply assumed that it binds RNA within living cells—in part, this is because Argonaute is thought to reside in the cytoplasm whereas cytoplasmic DNA is thought not to exist. However, Argonaute does have functions in the nucleus, and there are indeed reports that extrachromosomal DNA exists in both nucleus and cytoplasm. Hence, the idea that Argonaute may bind DNA is not absurd but is well worth investigating (Smalheiser & Gomes, 2014). I believe that it is worthwhile to develop text mining tools that can identify negative consensus statements and help investigators decide which are likely to be promising to study. Agarwal, Yu, & Kohane (2011) have compiled a database of biomedical negated sentences, which might be mined to identify those assertions that are reliably negative across multiple documents.

The Penumbra of a Field as a Source of New Knowledge

A scientist working in a field (say, Alzheimer’s disease) is acutely aware that some lines of investigation are “mainstream” and reside in the core of the field, whereas other lines of work are marginal, either because they are new, or not considered interesting or credible, or because they are pursued by people who are not themselves recognized full-time Alzheimer researchers. For example, amyloid and tau protein aggregates are intensively studied, and the resulting work is published in high-impact journals as well as in journals devoted to aging and Alzheimer’s disease. In contrast, studies of gut microbes (the so-called microbiota) are not a mainstream topic in Alzheimer’s disease, at least not yet. Standard techniques such as text mining, summarization, and clustering, together with citation analysis, can help to identify which articles, topics, keywords, and concepts reside in the core of a given field and which reside in the periphery, or penumbra.

Initially, literature-based discovery techniques sought to make linkages across literatures, without asking whether the links predominantly involve the cores or the peripheries of the literatures. Don’s first inclination was to filter out B-terms that did not have adequate frequency of mentions in each literature, implying that he was focusing on the cores (Swanson & Smalheiser, 1997). In contrast, Kostoff et al. (2009), Petrič et al. (2010), and Workman et al. (2016) have argued that low-frequency terms which reside in the penumbra of one or both fields may sometimes be more promising for finding links that are interesting and unexpected.
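The contrast between the two filtering strategies can be sketched as follows. The two literatures and their terms are invented toy data, and the threshold `min_papers=2` is an assumption chosen for illustration rather than a value taken from the cited studies; the sketch only separates shared B-terms that are frequent in both literatures (the "core") from shared terms that are rare in at least one of them (the "penumbra").

```python
from collections import Counter

# Toy literatures, invented for illustration: each paper is a set of candidate B-terms.
literature_a = [
    {"platelet aggregation", "vasospasm"},
    {"platelet aggregation", "substance p"},
    {"vasospasm", "platelet aggregation"},
]
literature_c = [
    {"platelet aggregation", "vasospasm"},
    {"platelet aggregation"},
    {"substance p"},
]

def document_frequency(literature):
    """Count the number of papers in which each term appears."""
    counts = Counter()
    for terms in literature:
        counts.update(set(terms))
    return counts

def split_b_terms(lit_a, lit_c, min_papers=2):
    """Split shared terms into core (frequent in both literatures)
    and penumbra (infrequent in at least one literature)."""
    freq_a = document_frequency(lit_a)
    freq_c = document_frequency(lit_c)
    shared = set(freq_a) & set(freq_c)
    core = {t for t in shared
            if freq_a[t] >= min_papers and freq_c[t] >= min_papers}
    return sorted(core), sorted(shared - core)

core_terms, penumbra_terms = split_b_terms(literature_a, literature_c)
print("core:", core_terms)
print("penumbra:", penumbra_terms)
```

A frequency filter of the first kind would keep only the core list, whereas the penumbra-oriented proposals would examine the second list instead of discarding it.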

Evidence Synthesis and Reproducibility in Science

In the early days of literature-based discovery, when assembling ideas, assertions, and published findings, we did not worry much about the reliability of each reported item, or how many articles obtained similar results. If a paper reported that protein A binds protein B in adult female rat lung, the extracted assertion would be “protein A binds protein B” without worrying much about its scope or generalizability to other situations. The goal has been to identify interesting and promising hypotheses, which after all need to be experimentally confirmed on their own terms.

Over the past 10 years, however, it has become clear that a significant minority (if not the majority) of published findings are hard to replicate and have low reliability, due to a combination of flaws in experimental design, small sample sizes, naïve data analysis practices, and over-interpretation of statistical testing (e.g. Ioannidis, 2005; Rzhetsky et al., 2006; Smalheiser, 2017). Thus, going forward, it will be important not merely to identify terms and concepts for linking, but to assess the reliability of the articles that contain them and to filter or rank them accordingly. Kilicoglu (2017) has recently proposed that text mining may aid in at least four ways, namely, plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics.

Even more broadly, literature-based discovery is moving closer to the field of evidence synthesis, which collects reported findings across multiple studies (e.g. the set of all clinical trials that have employed nonsteroidal anti-inflammatory agents for chronic arthritic knee pain) and attempts to reach a consensus, if possible. This field employs techniques such as systematic review, meta-analysis, and summarization. Although most of this work is currently done manually, there is a recent push for the use of automated text mining tools to accelerate the process (Jonnalagadda, Goyal, & Huffman, 2015; O’Mara-Eves et al., 2015). In fact, text mining-based detection of reliable trends in the literature, i.e. detecting when “signal” is truly above “noise,” is itself a type of literature-based discovery, albeit one in which explicit (rather than implicit) assertions are being mined.

Discussion and Conclusions

The recent advent of big data has provided massive, openly available data sets that provide rich fodder for literature-based discovery, as well as serving as training sets for machine learning approaches to discovery. Furthermore, major big data techniques include linking data sets together and combining heterogeneous data sets (including electronic medical records and data warehouses), both of which are increasingly tractable with current computational resources, and both of which are fundamental to obtaining implicit information used for discovery. The new directions discussed in this review (e.g. outliers, analogies, negative consensus, and others) go beyond the A-B-C model and open up the field to an exciting variety of models of discovery.

Historically, the big stumbling-block of literature-based discovery has been the fact that its models seek to predict novel, untested, even surprising findings, which inherently are difficult to score as “right” or “wrong” without costly experimentation. This has bedeviled methodological studies that seek to improve predictive performance. Existing benchmarks are relatively few (Sebastian et al., 2017a). Time-slicing is an alternative technique in which articles up to a certain date are used to construct a hypothesis, and then the literature is examined a few years later to determine whether that hypothesis is tested or at least mentioned in the literature (Yetisgen-Yildiz & Pratt, 2009). Some of the new research directions that I have discussed in this article are easier to evaluate than the classic one- or two-node searches. For example, link prediction seeks to predict which pairs (of, say, MeSH terms) are most likely to appear together in the same article in the future, which can be assessed quantitatively without considering the “truth” of the article. It is gratifying that the techniques of literature-based discovery have been absorbed into the mainstream of bioinformatics, medical informatics, and computer science, whose practitioners find abundant value even in predicting findings that are relatively non-surprising and incremental. For example, if protein A is known to have a certain function, and protein X is similar to protein A in several respects, then protein X may be hypothesized to share functions with A. Different discovery models of protein functions can be assessed on how well they predict functions across a database of known proteins, without relying on having experimental data for the unknown or novel proteins.
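Time-sliced evaluation of pair predictions can be sketched as follows, under stated assumptions: the records are invented toy data, the predictor is a deliberately naive shared-neighbor rule standing in for any real discovery model, and all function names are hypothetical. Predictions are made from articles up to a cutoff year and then scored against term pairs that first co-occur after the cutoff.

```python
from itertools import combinations

# Toy records, invented for illustration: (publication year, set of terms).
records = [
    (1985, {"A", "B"}),
    (1985, {"B", "C"}),
    (1986, {"C", "D"}),
    (1990, {"A", "C"}),
    (1991, {"A", "E"}),
]

def cooccurring_pairs(term_sets):
    """All alphabetically ordered term pairs that co-occur in some article."""
    return {pair for terms in term_sets for pair in combinations(sorted(terms), 2)}

def shared_neighbor_predictor(term_sets):
    """Naive stand-in model: predict every unlinked pair that shares a neighbor."""
    known = cooccurring_pairs(term_sets)
    neighbors = {}
    for a, b in known:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {
        (a, b)
        for a, b in combinations(sorted(neighbors), 2)
        if (a, b) not in known and neighbors[a] & neighbors[b]
    }

def time_slice_evaluate(records, cutoff_year, predict):
    """Return (precision, recall) of predictions made from articles up to
    cutoff_year, judged against pairs that first co-occur afterwards."""
    before = [terms for year, terms in records if year <= cutoff_year]
    after = [terms for year, terms in records if year > cutoff_year]
    seen = cooccurring_pairs(before)
    future = cooccurring_pairs(after) - seen
    predicted = set(predict(before)) - seen
    hits = predicted & future
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(future) if future else 0.0
    return precision, recall

precision, recall = time_slice_evaluate(records, 1986, shared_neighbor_predictor)
print(precision, recall)
```

Note that a later co-occurrence shows only that a pair was mentioned together, not that the underlying hypothesis was confirmed, which is the limitation of time-slicing described above.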

The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site (http://arrowsmith.psych.uic.edu) maintains a suite of tools that are free and open to the public, as do BITOLA (http://ibmi.mf.uni-lj.si/bitola), which is maintained by Dimitar Hristovski, and EpiphaNet (http://epiphanet.uth.tmc.edu/), which is maintained by Trevor Cohen. Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic.

Figure 1

Don R. Swanson.

Figure 2

Screenshot of UCSC Genome Browser showing the sequence for human mir-95 juxtaposed to tracks for genomic repeats. The genomic region of the mir-95 sequence corresponds to two LINE2 elements in opposite orientations. This provides evidence that, when transcribed into RNA, these LINE2 elements bind each other, creating the hairpin secondary structure that permits the processing of this sequence by enzymes (Drosha & Dicer) to form a microRNA (Smalheiser & Torvik, 2005).

Figure 3

Schematic diagram illustrating the one-node search. Reprinted from Swanson & Smalheiser (1997) with permission.

Eight features used to characterize each B-term.

No.  Feature
1    Does the B-term occur in more than one paper within literatures A and C?
2    Do the AB and BC sub-literatures share any MeSH terms?
3    Does the B-term map to at least one UMLS semantic category?
4    Does the B-term have a high literature cohesion score?
5    Is the B-term moderately frequent within MEDLINE as a whole?
6    Did the B-term first appear recently within MEDLINE as a whole?
7    Is the B-term highly characteristic within literature A or C?
8    Do the words within the B-term all occur on the customized 1,400-word stoplist?

Agarwal, S., Yu, H., & Kohane, I. (2011). BioNØT: A searchable database of biomedical negated sentences. BMC Bioinformatics, 12:420. Retrieved on August 9, 2017, from https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-420#Abs1.

Baek, S.H., Lee, D., Kim, M., Lee, J.H., & Song, M. (2017). Enriching plausible new hypothesis generation in PubMed. PLoS ONE, 12(7), e0180539.

Bekhuis, T. (2006). Conceptual biology, hypothesis discovery, and text mining: Swanson’s legacy. Biomedical Digital Libraries, 3:2. Retrieved on August 9, 2017, from https://bio-diglib.biomedcentral.com/articles/10.1186/1742-5581-3-2.

Bruza, P., & Weeber, M. (Eds.) (2008). Literature-based discovery. Berlin: Springer-Verlag.

Cairelli, M.J., Miller, C.M., Fiszman, M., Workman, T.E., & Rindflesch, T.C. (2013). Semantic MEDLINE for discovery browsing: Using semantic predications and the literature-based discovery paradigm to elucidate a mechanism for the obesity paradox. In AMIA Annual Symposium Proceedings (pp. 164–173). Retrieved on August 9, 2017, from http://europepmc.org/articles/PMC3900170.

Cameron, D., Kavuluru, R., Rindflesch, T.C., Sheth, A.P., Thirunarayan, K., & Bodenreider, O. (2015). Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics, 54(C), 141–157.

Cohen, K.B., Johnson, H.L., Verspoor, K., Roeder, C., & Hunter, L.E. (2010). The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics, 11:492. Retrieved on August 9, 2017, from https://doi.org/10.1186/1471-2105-11-492.

Cohen, T., Whitfield, G.K., Schvaneveldt, R.W., Mukund, K., & Rindflesch, T. (2010). EpiphaNet: An interactive tool to support biomedical discoveries. Journal of Biomedical Discovery and Collaboration, 5(1), 21–49.

Cohen, T., & Widdows, D. (2009). Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42(2), 390–405.

Cohen, T., & Widdows, D. (2017). Embedding of semantic predications. Journal of Biomedical Informatics, 68, 150–166.

Cole, R., & Bruza, P. (2005). A bare bones approach to literature-based discovery: An analysis of the Raynaud’s/Fish-oil and migraine-magnesium discoveries in semantic space. In A. Hoffmann, H. Motoda, & T. Scheffer (Eds.), Discovery Science (pp. 84–98). Berlin: Springer-Verlag.

Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS ONE, 8(8), e71416.

Dong, W., Liu, Y., Zhu, W., Mou, Q., Wang, J., & Hu, Y. (2014). Simulation of Swanson’s literature-based discovery: Anandamide treatment inhibits growth of gastric cancer cells in vitro and in silico. PLoS ONE, 9(6), e100436.

Frijters, R., van Vugt, M., Smeets, R., van Schaik, R., de Vlieg, J., & Alkema, W. (2010). Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Computational Biology, 6(9), e1000943.

Gordon, M.D., & Dumais, S. (1998). Using latent semantic indexing for literature based discovery. Journal of the American Society for Information Science, 49(8), 674–685.

Hossain, M.S., Gresock, J., Edmonds, Y., Helm, R., Potts, M., & Ramakrishnan, N. (2012). Connecting the dots between PubMed abstracts. PLoS ONE, 7(1), e29509.

Hristovski, D., Kastrin, A., Dinevski, D., & Rindflesch, T.C. (2015). Constructing a graph database for semantic literature-based discovery. Studies in Health Technology and Informatics, 216:1094. Retrieved on August 9, 2017, from https://www.ncbi.nlm.nih.gov/pubmed/26262393.

Hristovski, D., Kastrin, A., Dinevski, D., Burgun, A., Žiberna, L., & Rindflesch, T.C. (2016). Using literature-based discovery to explain adverse drug effects. Journal of Medical Systems, 40(8), 185.

Ioannidis, J.P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.

Jonnalagadda, S.R., Goyal, P., & Huffman, M.D. (2015). Automating data extraction in systematic reviews: A systematic review. Systematic Reviews, 4:78. Retrieved on August 9, 2017, from https://doi.org/10.1186/s13643-015-0066-7.

Kastrin, A., Rindflesch, T.C., & Hristovski, D. (2016). Link prediction on a network of cooccurring MeSH Terms: Towards literature-based discovery. Methods of Information in Medicine, 55(4), 340–346.

Kell, D.B. (2009). Iron behaving badly: Inappropriate iron chelation as a major contributor to the aetiology of vascular and other progressive inflammatory and degenerative diseases. BMC Medical Genomics, 2:2. Retrieved on August 9, 2017, from http://doi.org/10.1186/1755-8794-2-2.

Kilicoglu, H. (2017). Biomedical text mining for research rigor and integrity: Tasks, challenges, directions. Briefings in Bioinformatics, bbx057. Retrieved on August 9, 2017, from https://doi.org/10.1093/bib/bbx057.

Kostoff, R.N., Block, J.A., Solka, J.L., Briggs, M.B., Rushenberg, R.L., Stump, J.A., Johnson, D., Lyons, T.J., & Wyatt, J.R. (2009). Literature-related discovery. Annual Review of Information Science and Technology, 43(1), 1–71.

Lugli, G., Larson, J., Martone, M.E., Jones, Y., & Smalheiser, N.R. (2005). Dicer and eIF2c are enriched at postsynaptic densities in adult mouse brain and are modified by neuronal activity in a calpain-dependent manner. Journal of Neurochemistry, 94(4), 896–905.

Manev, H., & Manev, R. (2010). Benefits of neuropsychiatric phenomics: Example of the 5-lipoxygenase-leptin-Alzheimer connection. Cardiovascular Psychiatry and Neurology, No. 838164. Retrieved on August 9, 2017, from http://dx.doi.org/10.1155/2010/838164.

Maver, A., Hristovski, D., Rindflesch, T.C., & Peterlin, B. (2013). Integration of data from Omic studies with the literature-based discovery towards identification of novel treatments for neovascularization in diabetic retinopathy. BioMed Research International, No. 848952. Retrieved on August 9, 2017, from http://doi.org/10.1155/2013/848952.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013). Retrieved on August 9, 2017, from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.

Miller, C.M., Rindflesch, T.C., Fiszman, M., Hristovski, D., Shin, D., Rosemblat, G., Zhang, H., & Strohl, K.P. (2012). A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep, 35(2), 279–285.

Mishra, S., & Torvik, V.I. (2016). Quantifying conceptual novelty in the biomedical literature. D-Lib Magazine, 22, No. 9/10. Retrieved on August 9, 2017, from http://www.dlib.org/dlib/september16/mishra/09mishra.html.

Mower, J., Subramanian, D., Shang, N., & Cohen, T. (2016). Classification-by-analogy: Using vector representations of implicit relationships to identify plausibly causal drug/side-effect relationships. AMIA Annual Symposium Proceedings, 1940–1949.

O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., & Ananiadou, S. (2015). Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews, 4:5. Retrieved on August 9, 2017, from https://doi.org/10.1186/2046-4053-4-5.

Packalen, M., & Bhattacharya, J. (2015). Neophilia ranking of scientific journals. NBER Working Paper No. w21579. Retrieved on August 9, 2017, from https://ssrn.com/abstract=2663237.

Peng, Y., Bonifield, G., & Smalheiser, N.R. (2017). Gaps within the biomedical literature: Initial characterization and assessment of strategies for discovery. Frontiers in Research Metrics and Analytics, 2:3. Retrieved on August 9, 2017, from https://www.frontiersin.org/articles/10.3389/frma.2017.00003/full.

Popper, K.R. (1978). Three worlds. The Tanner lecture on human values. The University of Michigan. Ann Arbor. Retrieved on July 17, 2017, from http://tannerlectures.utah.edu/_documents/a-to-z/p/popper80.pdf.

Pennington, J., Socher, R., & Manning, C.D. (2014, October). GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, Vol. 14 (pp. 1532–1543). Retrieved on August 9, 2017, from http://www.aclweb.org/anthology/D14-1162.

Petrič, I., Cestnik, B., Lavrač, N., & Urbančič, T. (2010). Outlier detection in cross-context link discovery for creative literature mining. The Computer Journal, 55(1), 47–61.

Preiss, J., & Stevenson, R. (2016). The effect of word sense disambiguation accuracy on literature based discovery. BMC Medical Informatics and Decision Making, 16(1), 59–63.

Rzhetsky, A., Iossifov, I., Loh, J.M., & White, K.P. (2006). Microparadigms: Chains of collective reasoning in publications about molecular interactions. Proceedings of the National Academy of Sciences of the United States of America, 103(13), 4940–4945.

Sebastian, Y., Siew, E.G., & Orimaye, S.O. (2017a). Emerging approaches in literature-based discovery: Techniques and performance review. Knowledge Engineering Review, 32, article no. e12. Retrieved on July 17, 2017, from https://doi.org/10.1017/S0269888917000042.

Sebastian, Y., Siew, E.G., & Orimaye, S.O. (2017b). Learning the heterogeneous bibliographic information network for literature-based discovery. Knowledge-Based Systems, 115, 66–79.

Shang, N., Xu, H., Rindflesch, T.C., & Cohen, T. (2014). Identifying plausible adverse drug reactions using knowledge extracted from the literature. Journal of Biomedical Informatics, 52, 293–310. Retrieved on July 17, 2017, from http://doi.org/10.1016/j.jbi.2014.07.011.

Shi, C., Li, Y., Zhang, J., Sun, Y., & Yu, P.S. (2017). A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1), 17–37.

Smalheiser, N.R. (2007). Exosomal transfer of proteins and RNAs at synapses in the nervous system. Biology Direct, 2(1), 35.

Smalheiser, N.R. (2012a). The search for endogenous siRNAs in the mammalian brain. Experimental Neurology, 235(2), 455–463.

Smalheiser, N.R. (2012b). Literature-based discovery: Beyond the ABCs. Journal of the Association for Information Science and Technology, 63(2), 218–224.

Smalheiser, N.R. (2014). The RNA-centred view of the synapse: Non-coding RNAs and synaptic plasticity. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1652). doi:10.1098/rstb.2013.0504

Smalheiser, N.R. (2017). Data literacy: How to make your experiments robust and reproducible. Cambridge, MA: Academic Press.

Smalheiser, N.R., & Gomes, O.L. (2014). Mammalian Argonaute-DNA binding? Biology Direct, 10:27. Retrieved on July 17, 2017, from https://biologydirect.biomedcentral.com/articles/10.1186/s13062-014-0027-4.

Smalheiser, N.R., Manev, H., & Costa, E. (2001). RNAi and brain function: Was McConnell on the right track? Trends in Neurosciences, 24(4), 216–218. doi:10.1016/S0166-2236(00)01739-2

Smalheiser, N.R., & Swanson, D.R. (1994). Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15(1), 1–9.

Smalheiser, N.R., & Swanson, D.R. (1996a). Indomethacin and Alzheimer’s disease. Neurology, 46(2), 583. doi:10.1212/WNL.46.2.583

Smalheiser, N.R., & Swanson, D.R. (1996b). Linking estrogen to Alzheimer’s disease: An informatics approach. Neurology, 47(3), 809–810. doi:10.1212/WNL.47.3.809

Smalheiser, N.R., & Swanson, D.R. (1998). Calcium-independent phospholipase A2 and schizophrenia. Archives of General Psychiatry, 55(8), 752–753.

Smalheiser, N.R., & Torvik, V.I. (2004). A population-based statistical approach identifies parameters characteristic of human microRNA-mRNA interactions. BMC Bioinformatics, 5:139. Retrieved on July 17, 2017, from https://doi.org/10.1186/1471-2105-5-139.

Smalheiser, N.R., & Torvik, V.I. (2005). Mammalian microRNAs derived from genomic repeats. Trends in Genetics, 21(6), 322–326. doi:10.1016/j.tig.2005.04.008

Swanson, D.R. (1986a). Undiscovered public knowledge. Library Quarterly, 56(2), 103–118. doi:10.1086/601720

Swanson, D.R. (1986b). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology & Medicine, 30(1), 7–18. doi:10.1353/pbm.1986.0087

Swanson, D.R. (1987). Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38(4), 228–233. doi:10.1002/(SICI)1097-4571(198707)38:4<228::AID-ASI2>3.0.CO;2-G

Swanson, D.R. (1988). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology & Medicine, 31(4), 526–557. doi:10.1353/pbm.1988.0009

Swanson, D.R. (1990). Somatomedin C and arginine: Implicit connections between mutually-isolated literatures. Perspectives in Biology & Medicine, 33(2), 157–186. doi:10.1353/pbm.1990.0031

Swanson, D.R. (1993). Intervening in the life cycles of scientific knowledge. Library Trends, 41(4), 606–631.

Swanson, D.R. (2006). Atrial fibrillation in athletes: Implicit literature-based connections suggest that overtraining and subsequent inflammation may be a contributory mechanism. Medical Hypotheses, 66(6), 1085–1092. doi:10.1016/j.mehy.2006.01.006

Swanson, D.R. (2011). Literature-based resurrection of neglected medical discoveries. Journal of Biomedical Discovery & Collaboration, 6(6), 34–47. doi:10.5210/disco.v6i0.3515

Swanson, D.R., & Smalheiser, N.R. (1997). An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91(2), 183–203. doi:10.1016/S0004-3702(97)00008-8

Swanson, D.R., Smalheiser, N.R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science and Technology, 52(10), 797–812. doi:10.1002/asi.1135

Torvik, V.I., & Smalheiser, N.R. (2007). A quantitative model for linking two disparate sets of articles in Medline. Bioinformatics, 23(13), 1658–1665. doi:10.1093/bioinformatics/btm161

Uzzi, B., Mukherjee, S., Stringer, M., & Jones, B. (2013). Atypical combinations and scientific impact. Science, 342(6157), 468–472. doi:10.1126/science.1240474

van der Eijk, C.C., van Mulligen, E.M., Kors, J.A., Mons, B., & van den Berg, J. (2004). Constructing an associative concept space for literature-based discovery. Journal of the Association for Information Science and Technology, 55(5), 436–444. doi:10.1002/asi.10392

Vos, R., Aarts, S., van Mulligen, E., Metsemakers, J., van Boxtel, M.P., Verhey, F., & van den Akker, M. (2014). Finding potentially new multimorbidity patterns of psychiatric and somatic diseases: Exploring the use of literature-based discovery in primary care research. Journal of the American Medical Informatics Association, 21(1), 139–145. doi:10.1136/amiajnl-2012-001448

Weeber, M., Vos, R., Klein, H., de Jong-van den Berg, L.T.W., Aronson, A.R., & Molema, G. (2003). Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. Journal of the American Medical Informatics Association, 10(3), 252–259. doi:10.1197/jamia.M1158

Widdows, D., & Cohen, T. (2015). Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, 23(2), 141–173. doi:10.1093/jigpal/jzu028

Wren, J.D. (2004). Extending the mutual information measure to rank inferred literature relationships. BMC Bioinformatics, 5:145. Retrieved on July 17, 2017, from https://www.ncbi.nlm.nih.gov/pubmed/15471547.

Wren, J.D., Bekeredjian, R., Stewart, J.A., Shohet, R.V., & Garner, H.R. (2004). Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics, 20(3), 389–398. doi:10.1093/bioinformatics/btg421

Wolchover, N. (2017). A long-sought proof, found and almost lost. Quanta Magazine, March 28, 2017. Retrieved on July 17, 2017, from https://www.quantamagazine.org/statistician-proves-gaussian-correlation-inequality-20170328.

Workman, T.E., Fiszman, M., Cairelli, M.J., Nahl, D., & Rindflesch, T.C. (2016). Spark, an application based on serendipitous knowledge discovery. Journal of Biomedical Informatics, 60, 23–37. doi:10.1016/j.jbi.2015.12.014

Yang, H.T., Ju, J.H., Wong, Y.T., Shmulevich, I., & Chiang, J.H. (2017). Literature-based discovery of new candidates for drug repurposing. Briefings in Bioinformatics, 18(3), 488–497. doi:10.1093/bib/bbw030

Yetisgen-Yildiz, M., & Pratt, W. (2009). A new evaluation methodology for literature-based discovery systems. Journal of Biomedical Informatics, 42(4), 633–643. doi:10.1016/j.jbi.2008.12.001

Zweigenbaum, P., Demner-Fushman, D., Yu, H., & Cohen, K.B. (2007). Frontiers of biomedical text mining: Current progress. Briefings in Bioinformatics, 8(5), 358–375. doi:10.1093/bib/bbm045
