Artificial intelligence used in genome analysis studies

Published Online: 25 Apr 2018
Volume & Issue: Volume 2 (2018) - Issue 2 (April 2018)
Page range: 78 - 88
Introduction

According to a recent study, artificial intelligence could stimulate employment and increase overall business revenues by 38% by 2020 (1,2). For the global economy, this effect would translate into additional profits of approximately $4.8 trillion and, above all, a profound rethinking of the production system. Many of these systems are built on artificial neural networks, in which information (input), such as genomic data, is entered into a node (a simulated neuron) and processed. This produces an initial, transient result that is passed to the next level, where the process is repeated. Passing through multiple levels, we arrive at the final stage of information processing, for example the prediction of gene function and structure. The input (initial information) is defined as specific data, and the output (final information) must be consistent with the input (3, 4, 5, 6, 7).

Artificial intelligence methods

The design of learning elements must consider three fundamental aspects:

which components of the executive element (e.g. gene regulation and structures) should be learned;

the type of feedback available;

the type of representation used.

The feedback step is the most important because it enables us to determine the nature of the problem. We can distinguish three types of learning:

Supervised learning: A set of independent input variables (x) is supplied together with a dependent output variable (y). Using an algorithm, the program attempts to learn the function that maps each given x to the appropriate y (8). The goal of this type of learning is to build prediction models under conditions of uncertainty, so that the variable y can be predicted when a new input is provided. The process is termed “supervised” because the algorithm performs the prediction iteratively (9): if a prediction is incorrect, the system is corrected by a supervisor until the accuracy of the program is sufficient. The logical steps of a supervised learning method are as follows (a minimal worked sketch is given after the list):

prepare the data;

choose the algorithm;

adapt a model;

choose a validation method;

perform evaluations;

use the model to make predictions.
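
As a concrete illustration of these six steps, the following minimal sketch (assuming scikit-learn and NumPy are available; the data are synthetic rather than genomic) builds and validates a simple supervised classifier:

```python
# Minimal supervised-learning sketch following the steps listed above.
# Assumes scikit-learn and NumPy; the data are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# 1. Prepare the data: 200 samples, 5 input variables (x), binary output (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Choose the algorithm and 3. adapt (fit) a model.
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Choose a validation method: 5-fold cross-validation on the training set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# 5. Perform evaluations on held-out data.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"CV accuracy: {cv_scores.mean():.2f}, test accuracy: {test_accuracy:.2f}")

# 6. Use the model to make predictions on new input.
new_x = rng.normal(size=(1, 5))
print("Predicted y:", model.predict(new_x))
```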

Unsupervised learning: In unsupervised machine learning, the input (x) is supplied without corresponding outputs (y) (10). In this case, the goal is to find hidden patterns or intrinsic structures in the data. Through this process, it is possible to make inferences about the data without having an exact answer or a supervisor to correct possible errors (11).

Reinforcement learning: Reinforcement learning differs from the supervised method in that there is no supervisor; instead, the model is driven by a reward (measured as an evaluation of the achieved performance) (12). Based on this signal, the algorithm changes its own strategy to obtain the best reward. Two modes of reinforcement learning can be identified: passive and active. The passive method uses a pre-determined, fixed policy of actions, whereas the active method works with a complete model covering all possible outcomes (13,14).

Deep learning: Deep learning (DL) is an approach developed in recent decades to cope with problems such as the increasing size of available datasets. Technological and computing progress has facilitated this development through new processing units, which have significantly reduced the time required to train neural networks (15, 16, 17). DL represents data hierarchically, on several levels. DL models are built and trained automatically through advanced learning algorithms: the input information is progressively transformed in the lower layers so that concepts are defined at the highest levels (18, 19, 20).

Artificial neurons: The simplest neural network is made up of one neuron and is shown in Fig. 1. A neuron can be interpreted as a computational unit that takes inputs x1, x2, x3 and produces an output h_w,b(x), called the neuron activation. Note that there is also an additional input, a constant +1 (21).

Figure 1

Example of an artificial neuron.
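
In formula form, the neuron in Fig. 1 computes h_w,b(x) = f(w·x + b), where w are the synaptic weights, b is the bias discussed below and f is the activation function. A minimal NumPy sketch, with an arbitrary sigmoid activation and illustrative weights:

```python
import numpy as np

def sigmoid(z):
    # A common activation function; other choices (tanh, ReLU) are possible.
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # h_w,b(x) = f(w . x + b): weighted sum of the inputs plus bias, then activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.2, 0.5, 0.1])    # inputs x1, x2, x3
w = np.array([0.4, -0.6, 0.9])   # synaptic weights (illustrative values)
b = 0.1                          # bias, shifting the activation threshold

print(neuron(x, w, b))           # neuron activation, a value in (0, 1)
```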

A single neuron receives a numerical input (the weighted sum of its different inputs) and either becomes active, processing the received value through a specific activation function, or remains inactive, depending on whether its threshold is exceeded. A neuron is therefore characterized by an activation function and an activation threshold. For a given activation function, it may be difficult, if not impossible, to obtain certain output values: there may be no combination of input and weight values for that function that produces the desired output. It is therefore necessary to use an additional coefficient b, known as the bias, whose purpose is to translate the activation function of the neuron so that all desired outputs can be obtained for given inputs and weights. We can distinguish two basic types of neural networks: feedforward neural networks and recurrent neural networks (RNN, also called feedback neural networks) (22). In the former, the connections between neurons never form a cycle, so information always travels in one direction. In contrast, recurrent neural networks contain cycles, which create an internal state of the network and enable dynamic behaviour over time; an RNN can use its own internal memory to process various types of inputs. It is also possible to distinguish between fully connected networks, in which every neuron is connected to all of the others, and layered (stratified) networks, in which the neurons are organized in layers. In layered networks, all of the neurons of a layer are connected with all of the neurons of the next layer; there are no connections between neurons of the same layer, nor between neurons of non-adjacent layers. A layered network with three layers is shown in Fig. 2.

Figure 2

A layered network with three layers.

The left layer in Figure 2, with neurons denoted by blue circles, is generally referred to as the input layer, while the rightmost layer, with a single orange neuron, constitutes the output layer. Finally, the central layer is called the hidden layer because its values are not observed in the training set.

Training of a neural network: One of the most well-known and effective methods for training neural networks is the error back-propagation algorithm, which systematically modifies the weights of the connections between neurons so that the network’s response gets closer to the target. Training this type of neural network proceeds in two phases: forward propagation and backward propagation. In the first phase, the activations of all the neurons of the network are calculated, starting from the first layer and proceeding to the last; during this step, the values of the synaptic weights are kept fixed (23). In the second phase, the actual output of the network is compared with the desired output, yielding the network error. The calculated error is propagated in the reverse direction of the first phase, and at the end of this phase the weights are modified, based on the errors, to minimize the difference between the current output and the desired output. The whole process is then reiterated, beginning with a new forward propagation. In deep learning, all of the layers of the “deep” network apply non-linear operations (24,25). Deep networks are characterized by a number of neurons and layers much greater than that of classical artificial neural networks (ANNs). Deep networks do not need hand-crafted feature extraction from the input data; instead, they develop their own representation criteria during the learning phase, achieving performance comparable or even superior to classic neural networks.
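
A minimal sketch of the two phases (forward propagation followed by error back-propagation and a weight update) for a tiny one-hidden-layer network is given below; the network size, learning rate and synthetic data are illustrative assumptions, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                   # 8 samples, 3 inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy target values

# Random initial synaptic weights and biases (3 inputs -> 4 hidden -> 1 output).
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                      # learning rate

for epoch in range(1000):
    # Forward propagation: compute the activations layer by layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward propagation: propagate the output error towards the input
    # layer and compute the gradients of the squared-error loss.
    err = out - y
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Update the weights to reduce the difference between output and target.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print("final mean squared error:", float((err ** 2).mean()))
```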

The main types of machine learning tools are briefly described below.

Decision trees. This technique makes use of tree graphs, equipped with “leaves”, which describe states or events associated with a system, and “branches”, which represent transitions between states and the conditions necessary for such transitions.

Bayesian network. A Bayesian network (BN) is a probabilistic model that represents a set of random variables and their conditional dependences (26,27). Bayesian networks are directed, acyclic graphs whose nodes represent random variables; these can be observable quantities, latent variables, unknown parameters or hypotheses. The edges represent conditional dependencies, and nodes that are not connected correspond to variables that are conditionally independent of each other. Each node is associated with a probability function that takes as input a particular set of values of its parent nodes. Bayesian networks that model sequences (for example, protein sequences) are called dynamic Bayesian networks (28,29). A Bayesian network could be used, for instance, to represent the probabilistic relationships between diseases and symptoms: given the symptoms, the network can be used to calculate the probabilities of the presence of various diseases (30).
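
As a worked illustration of the disease-symptom example, the probability of a disease given an observed symptom follows from Bayes’ rule; the probabilities below are purely illustrative assumptions:

```python
# Toy Bayesian-network inference: one disease node D, one symptom node S.
# P(D) and P(S | D) are assumed (illustrative) parameters of the network.
p_d = 0.01              # prior probability of the disease
p_s_given_d = 0.9       # P(symptom | disease)
p_s_given_not_d = 0.05  # P(symptom | no disease)

# Marginal probability of observing the symptom.
p_s = p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)

# Posterior probability of the disease given the symptom (Bayes' rule).
p_d_given_s = p_s_given_d * p_d / p_s
print(f"P(disease | symptom) = {p_d_given_s:.3f}")   # ~0.154
```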

Hidden Markov model. A hidden Markov model (HMM) is a statistical model in which the modelled system is a Markov process (transitions between states associated with probability weights) whose states are not observed (31). An HMM can be considered a simpler version of a Bayesian network. In a classic Markov model, the state is directly visible to the observer, and the state transition probabilities are therefore the only parameters. In an HMM, on the other hand, the state is not directly visible; the adjective ‘hidden’ refers to the sequence of states through which the model passes, not to the parameters of the model.
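
A minimal sketch of the forward algorithm, which computes the probability of an observed sequence under an HMM; the two hidden states and all transition/emission probabilities are illustrative assumptions:

```python
import numpy as np

# Two hidden states, e.g. "GC-rich" and "AT-rich" regions; parameters are illustrative.
start = np.array([0.5, 0.5])            # initial state probabilities
trans = np.array([[0.9, 0.1],           # state transition probabilities
                  [0.2, 0.8]])
symbols = {"A": 0, "C": 1, "G": 2, "T": 3}
emit = np.array([[0.1, 0.4, 0.4, 0.1],  # emission probabilities per state
                 [0.4, 0.1, 0.1, 0.4]])

def forward(sequence):
    """Return P(sequence) under the HMM using the forward algorithm."""
    obs = [symbols[c] for c in sequence]
    alpha = start * emit[:, obs[0]]     # initialisation
    for o in obs[1:]:                   # recursion over the sequence
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()                  # termination

print(forward("GGCACTGAA"))
```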

Cluster analysis. Cluster analysis is a set of multivariate data analysis techniques aimed at selecting and grouping homogeneous elements in a data set. All clustering techniques are based on the concept of the distance between two elements, and the quality of the analysis obtained from clustering algorithms depends greatly on the choice of the metric, i.e., on how the distance is calculated. The most common distance functions are the following:

Euclidean distance

Manhattan distance

The Euclidean distance d(x,y) is the square root of the sum of the squared differences between the two coordinate vectors; the Manhattan distance is a simple modification of the Euclidean distance, summing the absolute differences instead. Cluster analysis can use symmetric or asymmetric distances; many of the distance functions listed above are symmetric (the distance from object A to B is equal to the distance from B to A). Clustering techniques proceed mainly in two ways: 1) from bottom to top (bottom-up) and 2) from top to bottom (top-down). In the bottom-up method, all elements are initially considered as separate clusters and the algorithm then joins the nearest clusters; it continues merging until a fixed number of clusters is obtained or until the minimum distance between clusters is reached. With the top-down method, all the elements initially share a single cluster and the algorithm then divides the cluster into progressively smaller clusters; the criterion always tries to obtain homogeneous elements and proceeds until a predetermined number of clusters is reached.
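
The two distance functions, and a bottom-up (agglomerative) grouping step, can be sketched as follows (SciPy's hierarchical-clustering routines are assumed to be available; the points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
print(euclidean(points[0], points[2]), manhattan(points[0], points[2]))

# Bottom-up clustering: every point starts as its own cluster and the
# closest clusters are merged until the requested number of clusters remains.
Z = linkage(pdist(points, metric="euclidean"), method="single")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # two groups: the first two points and the last two points
```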

Artificial neural network (ANN). An artificial neural network (NN) is a mathematical model inspired by biological neural networks (32, 33, 34, 35). The model consists of interconnected artificial neurons that mimic the properties of living neurons. These mathematical models can be used to solve artificial intelligence and engineering problems in different technological fields (e.g., information technology, biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures organized as modelling tools (36,37), and they can be used to simulate complex relationships between inputs and outputs. An artificial neural network consists of numerous nodes (neurons) connected to each other; there are two types of neurons, input and processing neurons. The weight value of a connection indicates its synaptic efficacy (the ability to increase or decrease the activity of the receiving neuron) and is used to quantify the importance of that input line: a very important input variable receives a high weight, while a less important input receives a lower weight. An artificial neural network receives external signals via the input neurons, each of which is connected to numerous internal nodes organized in further levels. Each node processes the received signals and transmits the result to the subsequent nodes, and this process continues until the output level is reached (37, 38, 39, 40, 41, 42, 43).

Deep neural network (DNN). The term DNN (deep neural network) refers to deep networks composed of many levels (at least two of them hidden) that are hierarchically organized (44). Hierarchical organization allows information to be shared and reused. The most widely used DNNs consist of between 7 and 50 levels. Deeper networks (100 levels and above) have proven able to guarantee slightly better performance, but at the expense of efficiency. The number of neurons, connections and weights also characterizes the complexity of a DNN: the greater the number of weights (i.e., parameters to be learned), the higher the complexity of the training (45, 46, 47, 48, 49, 50). At the same time, a large number of neurons (and connections) increases the cost of forward and backward propagation, as the number of necessary operations increases. The training of complex models (deep and with many weights and connections) requires high computational power; the availability of Graphics Processing Units (GPUs) with thousands of cores and high internal memory has made it possible to drastically reduce training time (51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61) (Fig. 3).

Figure 3

Increase of the prediction accuracy as a function of the data training size.

Convolutional neural networks (CNN). Convolutional neural networks (CNNs) are a type of artificial network in which neurons are connected to each other by weighted branches (62,63); the weights and biases are therefore trainable, as before. Training of a convolutional network also proceeds via forward/backward propagation and weight updating, and a convolutional neural network always uses a single differentiable cost function (a scalar value indicating how well the model performs). However, CNNs exploit the specific assumption that the input has a precise spatial structure, and they use a more efficient forward propagation in order to reduce the number of parameters of the network. The capacity of a CNN can vary in relation to the number of layers, and a CNN can have multiple layers of the same type. In a convolutional neural network there are different types of layers, each with its own specific function; some have trainable parameters (weights and biases), while others simply apply fixed functions. Usually, a CNN has a series of convolutional layers; the first of these, going from the input layer towards the output layer, extract low-level features, such as horizontal or vertical lines, angles, and various contours (64, 65, 66). The complexity of the features increases with depth in the network (towards the output layer); generally, the more convolutional layers a network has, the more detailed the features it can process (67, 68, 69). Compared to a multi-layer perceptron (MLP) model (a feedforward artificial neural network with non-linear activation functions), a CNN shows substantial changes in the architecture of the convolutional layers (70,71):

Local processing. Neurons are only locally connected to the neurons of the previous level, and each neuron then performs local processing. In this manner, there is a strong reduction in the number of connections.

Shared weights: Weights are shared in groups, and different neurons of the same level perform the same type of processing on different portions of the input. In this way, there is a strong reduction in the number of weights.

Sparse connectivity. In a CNN, neurons belonging to different convolutional layers are connected to each other through a regular pyramidal architecture. Each neuron of a given layer receives information from a specific number of neurons of the previous layer, and each unit is sensitive only to variations within a specific region of the input, as shown in Fig. 4. This architecture ensures that the patterns learned by the individual units provide strong responses to spatially localized inputs. A sufficient depth of this structure allows local information to be gradually grouped together, leading to increasingly complex non-linear filters in the layers closer to the output.

Figure 4

Example of sparse connectivity of neurons between different convolutional layers: the neurons of a layer m are connected only to some of the neurons of the previous layer m-1.

Weight sharing. In a CNN, the neurons of a given layer share the same weights and biases (activation thresholds), and the set of parameters of all such neurons forms a feature map (see Fig. 5), i.e., the set of characteristics common to all neurons present at a certain level of the network. Combined with the pyramidal architecture described above, this property provides translation invariance of the network response, i.e., the ability to recognize a target regardless of its position within the scene.

Figure 5

Example of weight sharing (feature map) between neurons of different convolutional layers; links of the same colour correspond to equal weights.

Pooling. A pooling layer aggregates the data. Typically, the areas over which pooling is applied partially overlap, in order to preserve local information despite the values that are discarded. This operation has two consequences: it reduces the number of inputs to the next layer, and it increases the invariance to translations introduced by weight sharing. In practice, pooling reabsorbs dimensions, reducing the amount of computation.

ReLU. The rectified linear unit (ReLU) is an activation layer that acts element-wise on the incoming data: it leaves positive inputs unaltered and multiplies negative inputs by a constant, typically 0 (i.e., it sets them to zero). It is often used instead of more traditional activation functions, such as the sigmoid or hyperbolic tangent, because of its simple implementation and the nearly negligible computational load introduced by this layer. Since ReLU layers are inserted between convolution and pooling layers, they also help to reduce the number of calculations performed in the subsequent layers.
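
The ReLU and pooling operations described above can be written in a few lines; a NumPy sketch on an arbitrary 1D vector of activations:

```python
import numpy as np

def relu(x):
    # Keeps positive inputs unchanged and sets negative inputs to zero.
    return np.maximum(x, 0)

def max_pool_1d(x, size=2):
    # Aggregates neighbouring activations, reducing the input size for the next layer.
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

activations = np.array([-1.5, 0.3, 2.0, -0.2, 0.8, 1.1])
print(relu(activations))                 # [0.  0.3 2.  0.  0.8 1.1]
print(max_pool_1d(relu(activations)))    # [0.3 2.  1.1]
```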

Artificial intelligence systems applied to NGS studies

In the last ten years, next-generation sequencing (NGS) technologies have demonstrated their potential: with the production of short reads, throughput has become increasingly large. NGS techniques, combined with advances in fields ranging from chemistry to bioinformatics, have made DNA sequencing available at reasonable prices. Bioinformatics, in particular, has been fundamental in this development: thanks to the development of multiple algorithms based on techniques such as hash tables, indexes and spaced seeds, it has been possible to optimize the analysis of increasingly large data sets. NGS technologies are used in a variety of areas, such as cancer research, human DNA analysis, and animal studies. Table 1 lists several genomic programs concerned with cancer and rare diseases.

Main genomic programs concerned with cancer and rare diseases

Internet site | Program | Thematic area
www.allofus.nih.gov | All of Us Research Program | Cancer, rare diseases, complex traits
www.australiangenomics.org.au | Australian Genomics | Cancer, rare diseases, complex traits
www.brcaexchange.org | BRCA Challenge | Cancer, rare diseases, complex traits
www.candig.github.io | Canadian Distributed Infrastructure for Genomics | Cancer, rare diseases, basic biology
www.clinicalgenome.org | Clinical Genome Resource | Rare diseases
www.elixir-europe.org | ELIXIR Beacon | Rare diseases, basic biology
www.ebi.ac.uk | European Genome-Phenome Archive | Rare diseases, basic biology
www.genomicsengland.co.uk | Genomics England | Cancer, rare diseases, complex traits
www.humancellatlas.org | Human Cell Atlas | Cancer, rare diseases, complex traits, basic biology
www.icgcmed.org | ICGC-ARGO | Cancer
www.matchmakerexchange.org | Matchmaker Exchange | Rare diseases
www.gdc.cancer.gov | National Cancer Institute Genomic Data Commons | Cancer, rare diseases
www.monarchinitiative.org | Monarch Initiative | Rare diseases, complex traits, basic biology
www.nhlbiwgs.org | Trans-Omics for Precision Medicine | Rare diseases, complex traits, basic biology
www.cancervariants.org | Variant Interpretation for Cancer Consortium | Cancer

Next-generation sequencing allows parallel reading of multiple individual DNA fragments, enabling the identification of millions of base pairs within a few hours. All of the available NGS platforms share a common technological feature: massively parallel sequencing of clonally amplified, or single spatially separated, DNA molecules in a flow cell. The main NGS platforms are the following:

HiSeq

MiSeq

Ion Torrent

SOLiD

PacBio RS II and Sequel System.

This strategy represents a radical change with respect to the sequencing method described by Sanger, which is based on the electrophoretic separation of fragments of different lengths obtained by single sequencing reactions (95). With NGS technologies, by contrast, sequencing is performed by repeated cycles of nucleotide extension by a DNA polymerase or, alternatively, by iterative cycles of oligonucleotide ligation. Because the procedure is massively parallel, these platforms allow the sequencing of hundreds of millions of base pairs (Mb) to billions of base pairs (Gb) of DNA in a single analytical session, depending on the type of NGS technology used. These technologies are versatile, as they can be used both for diagnostics and for basic and translational research. Traditionally, genomic data analysis is performed with software such as VarDict, FreeBayes and GATK, which are based on statistical analysis (83). Sequencing of genomic sub-regions and gene groups is currently used to identify polymorphisms and mutations in genes implicated in tumours, and regions of the human genome involved in genetic diseases, through linkage and genome-wide association studies (investigations using a panel of genes from different individuals to determine gene variations and associations). These instruments are also used for genetic and molecular analysis in various fields, including the diagnosis of rare genetic diseases as well as neoplastic and endocrine-metabolic diseases. The field of application of these technologies is expanding and will cover more diagnostic aspects in the future. Several NGS platforms can be used to generate different sources of genomic data:

whole genomes;

microarrays;

RNAseq;

capture arrays (exomes), e.g., Illumina TruSeq Exome Enrichment (62 Mb) and Agilent SureSelect (50 Mb);

targeted regions;

specific genes;

chromatin interactions and accessibility (DNase-seq, MNase-seq, FAIRE);

ChIP-seq;

expression profiles.

Computer programs that predict genes are becoming increasingly sophisticated. Most software recognizes genes by identifying distinctive patterns in DNA sequences, such as translation start and stop signals, promoters, and exon-intron splice junctions (72). However, open reading frames (ORFs) may be difficult to find when genes are short or when they undergo an appreciable amount of RNA splicing, with small exons separated by large introns. Prokaryotic genomes are predicted more accurately (sensitivity and specificity > 90%) than eukaryotic genomes. The main tasks of genomic analysis are the following: identification of gene location and structure, identification of regulatory elements, identification of non-coding RNA genes, prediction of gene function, and prediction of RNA secondary structure. Another approach discovers new genes based on their similarity to known genes; however, it can only identify new genes when there is an obvious homology with other genes, and the function of the gene must eventually be confirmed by molecular biology methods. In fact, some predicted genes could be pseudogenes. Pseudogenes have sequences similar to normal genes but usually contain interruptions, such as frame shifts or stop codons in the middle of coding domains, which prevent them from generating a functional product or having a detectable effect on the phenotype. Pseudogenes are present in a wide variety of animals, fungi, plants and bacteria. Predictive methods aim to derive general rules from a large number of examples (nucleotide labels within sequences), represented by observations referring to the past and accumulated in international databases.
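
As an illustration of the simplest kind of signal-based prediction, the sketch below scans a DNA sequence for open reading frames by locating a start codon followed, in the same reading frame, by a stop codon; the sequence is illustrative, and real gene predictors combine many more signals (promoters, splice junctions, etc.):

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) positions of simple ORFs on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                # Extend codon by codon until an in-frame stop codon is found.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs

dna = "CCATGGCTTTAGCGATGAAACCCGGGTAACC"
print(find_orfs(dna))   # [(2, 29)] -> one ORF of 9 codons, stop codon included
```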

Predictive models based on machine learning aim to draw conclusions from a sample of past observations and to transfer these conclusions to the entire population. Identified patterns can take different forms, such as linear, non-linear, cluster, graph, and tree functions (73, 74, 75).

The machine learning workflows are usually organized in four steps:

filtering and data pre-processing;

feature extraction;

model fitting;

model evaluation.

Machine learning methods may use supervised or unsupervised systems. The supervised method requires a set of DNA sequences (with all of the genetic information, including the start and end of the gene, splicing sites, regulatory regions, etc.) for the training step, in order to build the predictive model; this model is then used to find new genes that are similar to the genes of the training set. Supervised methods can therefore only be used if a known training set of sequences is available. Unsupervised methods are used if we are interested in finding the best grouping of unlabelled sequences that explains the data (76). Table 2 lists the main software tools used for machine learning studies.

Main software tools used for machine learning studies

Internet site | Program name | Thematic area
www.sourceforge.net/p/fingerid | FingerID | Molecular fingerprinting
https://bio.informatik.uni-jena.de/software/sirius/ | SIRIUS | Molecular fingerprinting
http://www.metaboanalyst.ca/ | Metaboanalyst | Metabolomics analysis
www.knime.com/ | KNIME | Machine learning tool
www.cs.waikato.ac.nz/ml/weka/ | Weka | Machine learning tool
https://orange.biolab.si/ | Orange | Machine learning tool
https://www.tensorflow.org/ | TensorFlow | Machine learning tool
http://caffe.berkeleyvision.org/ | Caffe | Machine learning tool
http://deeplearning.net/software/theano/ | Theano | Machine learning tool
http://torch.ch/ | Torch | Machine learning tool

Machine learning methodologies have a wide range of application areas:

non-coding variants (73,77);

identification of protein coding regions and protein-DNA interactions (78,79);

identification of regulatory regions (e.g., promoters, enhancers, and polyadenylation signals) (80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95);

prediction of splice sites (Bayesian classification);

identification of functional RNA genes (96);

chromatin fragment interactions;

histone marks (77);

transcription factor (TF) binding sites (96);

prediction of amino acid sequences and RNA secondary structures (97, 98, 99, 100, 101, 102, 103, 104, 105, 106);

metabolomics (107, 108, 109, 110).

Convolutional neural networks (CNNs) can substantially improve performance on sequence-based problems compared with previous methods (111). Recurring sequence patterns in the genome can be efficiently identified by CNN methods (111,112). In this approach, the genome sequence is analysed as a 1D window using four channels (A, C, G, T) (113), and protein-DNA interactions are analysed and solved as a two-class identification problem (114). Genomic data used in machine learning models should be divided into three parts: training (60%), model testing (30%) and model validation (10%). The main advantages of the CNN method can be summarized as follows:

training on sequence;

no feature definition;

reduction of the number of model parameters;

use only small regions and share parameters between regions;

train wider sequence windows;

make in silico mutation predictions.

In the CNN method, the following parameters should be optimized (a minimal encoding and convolution sketch is given after the list):

the number of feature maps;

window size (DNA sequences);

kernel size;

convolution kernel design;

pooling design.
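
As a concrete illustration of the representation described above, the following sketch one-hot encodes a DNA window into the four channels (A, C, G, T) and scans it with a small set of convolution kernels (feature maps), followed by ReLU activation and max pooling; the kernel values, window, and sizes are arbitrary illustrations, not trained parameters:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA window as a (length, 4) matrix with channels A, C, G, T."""
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        m[i, BASES.index(base)] = 1.0
    return m

def conv1d(x, kernels):
    """Valid 1D convolution: (length, 4) input, (n_maps, k, 4) kernels."""
    n_maps, k, _ = kernels.shape
    out_len = x.shape[0] - k + 1
    out = np.zeros((out_len, n_maps))
    for i in range(out_len):
        window = x[i:i + k]                      # local receptive field
        out[i] = (kernels * window).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
window = "ACGTGGCTAACGTTAG"                      # DNA sequence window
x = one_hot(window)
kernels = rng.normal(size=(8, 5, 4))             # 8 feature maps, kernel size 5

features = np.maximum(conv1d(x, kernels), 0)     # convolution + ReLU
pooled = features.reshape(-1, 2, features.shape[1]).max(axis=1)  # max pooling, size 2
print(x.shape, features.shape, pooled.shape)     # (16, 4) (12, 8) (6, 8)
```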

Chen et al. (114) used a multi-layer neural model called D-GEX to analyse microarray and RNAseq expression data. Torracinta and Campagne (115) analysed genomic data (variant calling and indel analysis) using deep learning methods (642 features for each genetic variant), and Poplin et al. (116) developed a new software tool called DeepVariant with a precision > 99% (at 90% recall) for SNP and indel detection. Mutation effects can be predicted using the software DeepSequence (117), which also uses latent variables (a model using an encoder and decoder network to identify the input sequence).

Recurrent neural networks (RNNs) have been proposed to improve on the performance of the CNN method. RNNs are very useful in the case of ordered sequences (e.g., sequential genomic data). Several applications have recently been reported:

base calling (118);

non-coding DNA (119);

protein prediction (97);

clinical medical data (56,120).

Google Genomics has recently developed new software called DeepVariant (Google Brain Team) that uses AI techniques to determine the characteristics of a genome starting from the information in reference sequences. Deep Genomics and WuXi (with offices in Shanghai, Reykjavik and Boston) use the latest generation of AI techniques to study several genetic diseases in an attempt to find possible therapies. AI methods have also been used for metabolomic data analysis, including the following:

metabolite identification from spectrograms (121,122);

metabolite concentration identification using high throughput data (123,124).

Use of CRISPR-Cas9 for the treatment of genetic diseases

Attempts to correct the genome for the treatment of genetic diseases have been underway for a long time; until several years ago, this type of research required long and complex procedures. A decisive change came in 2012, when it was discovered that a protein present in bacteria (Cas9), in association with an RNA sequence, can be used as a device to probe the DNA and identify the point of genetic damage (125). To use Cas9 as a genetic engineering tool, it is necessary to produce a guide RNA, identify the corresponding DNA segment and delete the target gene (125, 126, 127, 128, 129, 130). Although this method is highly efficient and rapid compared with older technologies, its limitation has been precision: the recognition sequences, i.e., the guides, are small, consisting of approximately twenty nucleotides. In several laboratories around the world, approaches were already underway to make this technology more precise. Recently, Casini et al. (130) inserted the protein into specialized yeast cells and then selected the yeasts in which Cas9 made the cut in the most precise way. When tested in human cells, the new method has been shown to reduce unwanted mutations by 99%.

Big data analysis

Currently, even the smallest lab or company can generate an incredibly large volume of data. These data sets are referred to as big data (the term indicates sets of data so large that they require different instruments from those traditionally used) and can be analysed with various techniques, including predictive analysis and data mining. It is important to understand precisely what is meant by big data. Since 2011, the term has entered forcefully into common language. At present, big data refers to collections that contain both structured data (from databases) and unstructured data (from new sources such as sensors, devices and images), coming from areas internal and external to a facility, which can provide valuable information if used and analysed in the correct way. The data can therefore be of two types: structured and unstructured (131). The former refers to a variety of formats and types that can be acquired from interactions between people and machines, such as information derived from consumer behaviour on internet sites or from users of social networks. Unstructured data, conversely, are usually “text” and consist of information that is not organized or easily interpretable through models. To define big data more formally, we can use the following four variables (132):

Volume: the amount of data has increased in recent years and is still increasing significantly; it is now extremely easy for a company to store terabytes and petabytes of data.

Speed: compared with fifteen years ago, data now become obsolete after a very short period of time; this is one of the reasons why long-term plans are increasingly rare. New business leaders should be able to perform a preliminary analysis of the data in order to anticipate trends.

Variety: the available data come in many formats, and continuous changes of format create many problems in the acquisition and classification of the data.

Complexity: being able to manage numerous sources of data is becoming highly difficult.

The analysis of big data is carried out using the following applications:

Descriptive Analytics (DA)

Prescriptive Analytics (PA)

Automated Analytics (AA)

DA tools aim to describe the current or past state of the data. PA tools perform advanced data analysis to predict future scenarios: they use algorithms that actively search for hidden patterns and correlations in past data and build prediction models for future use (133, 134, 135); together with the analysis, they can also propose operational/strategic solutions. AA tools are capable of autonomously implementing the proposed actions according to the results. A practical tool that can help provide an overview of the benefits deriving from a project in the field of big data is the value tree. The value tree allows all the benefits that can emerge from a big data project to be tracked and helps to identify benefits that could not otherwise be foreseen. In the value tree, the benefits are classified as quantifiable or non-quantifiable, without distinguishing benefits that are directly quantifiable in economic/financial terms.

The awareness of the existence of big data has forced several companies to create new, effective strategies. Appropriate analytical tools are needed that can translate the data into usable information. Data mining processes large collections of continuous data flows, in graphical or numerical form, for the purpose of extracting useful information. The need to anticipate certain situations has become extremely important and can lead to a competitive advantage; data analysis therefore serves to identify current trends and also to predict future strategies. In particular, predictive models, as well as machine learning and data mining, attempt to identify relationships within the data in order to identify future trends. Through the process of automatic learning, the task of the model is to analyse the variables, both individually and together, and to provide a predictive score while reducing the margin of error. Predictive analysis is a science that is already widely used and is being continuously improved in terms of its features and predictions. People and companies who use personal and medical data cannot ignore the legal and ethical aspects related to the acquisition and use of such data and must comply with current legislation.

Conclusions

Machine learning is the science that enables computers to make predictions about future data, and it represents one of the fundamental areas of artificial intelligence. Deep learning is a branch of machine learning based on a set of algorithms organized hierarchically in many levels (at least two of which are hidden). Generally, these algorithms involve multiple stages of processing (training, model fitting, model evaluation), often have complex structures and are normally composed of a series of non-linear transformations. Recently, CNNs and RNNs have been widely used in deep learning, especially for the identification of protein coding regions, protein-DNA interactions, regulatory regions (e.g., promoters, enhancers, and polyadenylation signals), splicing sites, and functional RNA genes.


Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv 2014: 1409.0473.BahdanauDChoKBengioYNeural machine translation by jointly learning to align and translatearXiv20141409.0473Search in Google Scholar

Hutter F, Hoos HH, Leyton-Brown K. Learning and intelligent optimization. (Berlin: Springer: 2011).HutterFHoosHHLeyton-BrownKLearning and intelligent optimizationBerlinSpringer2011Search in Google Scholar

Friedman N. Inferring cellular networks using probabilistic graphical models. Science 2004; 303: 799–805.10.1126/science.109406814764868FriedmanNInferring cellular networks using probabilistic graphical modelsScience200430379980514764868Open DOISearch in Google Scholar

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction (Berlin: Springer: 2001).HastieTTibshiraniRFriedmanJThe Elements of Statistical Learning: Data Mining, Inference and PredictionBerlinSpringer200110.1007/978-0-387-21606-5Search in Google Scholar

Hamelryck T. Probabilistic models and machine learning in structural bioinformatics. Stat Methods Med Res 2009; 18: 505–526.1915316810.1177/0962280208099492HamelryckTProbabilistic models and machine learning in structural bioinformaticsStat Methods Med Res20091850552619153168Search in Google Scholar

Zien A. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 2000; 16: 799–807.1110870210.1093/bioinformatics/16.9.799ZienAEngineering support vector machine kernels that recognize translation initiation sitesBioinformatics20001679980711108702Search in Google Scholar

Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv 2015; 1502.03167.IoffeSSzegedyC2015Batch normalization: accelerating deep network training by reducing internal covariate shiftarXiv 2015; 1502.03167Search in Google Scholar

Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. Pattern Anal Mach Intell IEEE Trans 2013; 35: 1798–1828.10.1109/TPAMI.2013.50BengioYCourvilleAVincentPRepresentation learning: a review and new perspectivesPattern Anal Mach Intell IEEE Trans2013351798182823787338Open DOISearch in Google Scholar

Jain V, Murray JF, Roth F, Turaga S, Zhigulin V, Briggman KL, Helmstaedter MN, Denk W, Seung HS. Supervised learning of image restoration with convolutional networks. Int Conf Computer Vision. 2007; 1–8.JainVMurrayJFRothFTuragaSZhigulinVBriggmanKLHelmstaedterMNDenkWSeungHSSupervised learning of image restoration with convolutional networksInt Conf Computer Vision20071–810.1109/ICCV.2007.4408909Search in Google Scholar

Day N, Hemmaplardh A, Thurman RE, Stamatoyannopoulos JA, Noble WS. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007; 23: 1424–1426.10.1093/bioinformatics/btm09617384021DayNHemmaplardhAThurmanREStamatoyannopoulosJANobleWSUnsupervised segmentation of continuous genomic dataBioinformatics2007231424142617384021Open DOISearch in Google Scholar

Hoffman MM. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 2012; 9: 473–476.2242649210.1038/nmeth.1937HoffmanMMUnsupervised pattern discovery in human chromatin structure through genomic segmentationNat Methods20129473476334053322426492Search in Google Scholar

Chapelle O, Schölkopf B, Zien A. Semi-supervised Learning (Cambridge Ma: MIT Press: 2006).ChapelleOSchölkopfBZienASemi-supervised LearningCambridge MaMIT Press200610.7551/mitpress/9780262033589.001.0001Search in Google Scholar

Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods. 2012; 9: 215–216.2237390710.1038/nmeth.1906ErnstJKellisMChromHMM: automating chromatin-state discovery and characterizationNat Methods20129215216357793222373907Search in Google Scholar

Chapelle O, Schölkopf B, Zien A. Semi-supervised Learning. (Cambridge MA: MIT Press: 2006).ChapelleOSchölkopfBZienASemi-supervised LearningCambridge MAMIT Press200610.7551/mitpress/9780262033589.001.0001Search in Google Scholar

Urbanowicz RJ, Granizo-Mackenzie A, Moore JH. An analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier systems. IEEE Comput Intell Mag 2012; 7: 35–45.2543154410.1109/MCI.2012.2215124UrbanowiczRJGranizo-MackenzieAMooreJHAn analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier systemsIEEE Comput Intell Mag201273545424400625431544Search in Google Scholar

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Josofowicz R, Kaiser L, Kudlur M, Levenberg J. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv 2016; 1603.04467AbadiMAgarwalABarhamPBrevdoEChenZCitroCCorradoGSDavisADeanJDevinMGhemawatSGoodfellowIHarpAIrvingGIsardMJiaYJosofowiczRKaiserLKudlurMLevenbergJTensorFlow: large-scale machine learning on heterogeneous distributed systemsarXiv20161603.04467Search in Google Scholar

Xiong C, Merity S, Socher R. Dynamic memory networks for visual and textual question answering. arXiv 2016; 1603.01417.XiongCMeritySSocherRDynamic memory networks for visual and textual question answeringarXiv20161603.01417Search in Google Scholar

Xu R, Wunsch D II., Frank R. Inference of genetic regulatory networks with recurrent neural network models using particle swarm optimization. IEEE/ACM Trans Comput Biol Bioinformatics 2007; 4: 681–692.10.1109/TCBB.2007.1057XuRWunschD II.FrankRInference of genetic regulatory networks with recurrent neural network models using particle swarm optimizationIEEE/ACM Trans Comput Biol Bioinformatics2007468169217975278Open DOISearch in Google Scholar

Xu Y, Mo T, Feng Q, Zhong P, Lai M, Chang EI. Deep learning of feature representation with multiple instance learning for medical image analysis. IEEE Int Conf Acoustics, Speech, Signal Processing. 2014; 1626–1630.XuYMoTFengQZhongPLaiMChangEIDeep learning of feature representation with multiple instance learning for medical image analysisIEEE Int Conf Acoustics, Speech, Signal Processing20141626–163010.1109/ICASSP.2014.6853873Search in Google Scholar

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. (Berlin: Springer: 2014).ZeilerMDFergusRVisualizing and understanding convolutional networksBerlinSpringer201410.1007/978-3-319-10590-1_53Search in Google Scholar

Ng AY, Jordan MI. Advances in Neural Information Processing Systems. (Cabridge MA: MIT Press: 2002).NgAYJordanMIAdvances in Neural Information Processing SystemsCabridge MAMIT Press2002Search in Google Scholar

Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput 1997; 1: 67–82.10.1109/4235.585893WolpertDHMacreadyWGNo free lunch theorems for optimizationIEEE Trans Evol Comput199716782Open DOISearch in Google Scholar

Boser BE, Guyon IM, Vapnik VN. A Training Algorithm for Optimal Margin Classifiers. (NY: ACM Press: 1992).BoserBEGuyonIMVapnikVNA Training Algorithm for Optimal Margin ClassifiersNYACM Press199210.1145/130385.130401Search in Google Scholar

Noble WS. What is a support vector machine? Nature Biotech 2006; 24: 1565–1567.10.1038/nbt1206-1565NobleWSWhat is a support vector machine?Nature Biotech2006241565156717160063Open DOISearch in Google Scholar

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence and Statistics. 2010; 249–256.GlorotXBengioYUnderstanding the difficulty of training deep feedforward neural networksInternational Conference on Artificial Intelligence and Statistics2010249–256Search in Google Scholar

Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein DA. Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 2003; 100: 8348–8353.10.1073/pnas.0832373100TroyanskayaOGDolinskiKOwenABAltmanRBBotsteinDABayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)Proc Natl Acad Sci USA20031008348835316623212826619Open DOISearch in Google Scholar

Friedman N, Linial M, Nachman I, Peer D. Using Bayesian networks to analyze expression data. J Comput Biol 2000; 7: 601–620.1110848110.1089/106652700750050961FriedmanNLinialMNachmanIPeerDUsing Bayesian networks to analyze expression dataJ Comput Biol2000760162011108481Search in Google Scholar

Koski TJ, Noble J. A review of Bayesian networks and structure learning. Math Applicanda 2012; 40: 51–103.KoskiTJNobleJA review of Bayesian networks and structure learningMath Applicanda20124051103Search in Google Scholar

Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. J Comput Biol 2000; 7: 601–620.1110848110.1089/106652700750050961FriedmanNLinialMNachmanIPe’erDUsing Bayesian networks to analyze expression dataJ Comput Biol2000760162011108481Search in Google Scholar

Koski TJ, Noble J. A review of bayesian networks and structure learning. Math Applicanda 2012; 40: 51–103.KoskiTJNobleJA review of bayesian networks and structure learningMath Applicanda2012405110310.14708/ma.v40i1.278Search in Google Scholar

Brown M. Using Dirichlet mixture priors to derive hidden Markov models for protein families. Int Conf Intelligent Systems Mol Biol 1993; 47-55.BrownMUsing Dirichlet mixture priors to derive hidden Markov models for protein familiesInt Conf Intelligent Systems Mol Biol19934755Search in Google Scholar

Keogh E, Mueen A. Encyclopedia of Machine Learning (Berlin: Springer: 2011).KeoghEMueenAEncyclopedia of Machine LearningBerlinSpringer2011Search in Google Scholar

Manning CD, Schütze H. Foundations of Statistical Natural Language Processing (Cambridge MA: MIT Press: 1999).ManningCDSchützeHFoundations of Statistical Natural Language ProcessingCambridge MAMIT Press1999Search in Google Scholar

Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004; 303: 799–805.10.1126/science.109406814764868FriedmanNInferring cellular networks using probabilistic graphical modelsScience200430379980514764868Open DOISearch in Google Scholar

Hastie T, Tibshirani R.; Friedman, J. The Elements of Statistical Learning: Data mining, Inference and Prediction. (New York NY: Springer: 2001).HastieTTibshiraniRFriedmanJThe Elements of Statistical Learning: Data mining, Inference and PredictionNew York NYSpringer200110.1007/978-0-387-21606-5Search in Google Scholar

Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant to be? Genome biol 2013; 14:205.10.1186/gb-2013-14-5-20523731483YipKYChengCGersteinMMachine learning and genome annotation: a match meant to be?Genome biol201314205405378923731483Open DOISearch in Google Scholar

Day N, Hemmaplardh A, Thurman RE, Stamatoyannopoulos JA, Noble WS. Unsupervised segmentation of continuous genomic data. Bioinformatics. 2007; 23: 1424–1426.10.1093/bioinformatics/btm09617384021DayNHemmaplardhAThurmanREStamatoyannopoulosJANobleWSUnsupervised segmentation of continuous genomic dataBioinformatics2007231424142617384021Open DOISearch in Google Scholar

Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. (Pittsburgh, PA: ACM Press: 1992).BoserBEGuyonIMVapnikVNA training algorithm for optimal margin classifiersPittsburgh, PAACM Press199210.1145/130385.130401Search in Google Scholar

Noble WS. What is a support vector machine? Nature Biotech 2006; 24: 1565–1567.10.1038/nbt1206-1565NobleWSWhat is a support vector machine?Nature Biotech2006241565156717160063Open DOISearch in Google Scholar

Hastie T, Tibshirani R, Friedman J, Franklin J. The elements of statistic learning: data mining, inference and prediction. Math Intell 2005; 27: 83–85.10.1007/BF02985802HastieTTibshiraniRFriedmanJFranklinJThe elements of statistic learning: data mining, inference and predictionMath Intell2005278385Open DOISearch in Google Scholar

He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv 2015; 1512.03385.HeKZhangXRenSSunJ2015Deep residual learning for image recognitionarXiv 2015; 1512.03385Search in Google Scholar

Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science 2006; 313: 504–507.10.1126/science.112764716873662HintonGESalakhutdinovRRReducing the dimensionality of data with neural networksScience200631350450716873662Open DOISearch in Google Scholar

Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput 2006; 18: 1527–1554.

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521: 436–444.

Schmidhuber J. Deep learning in neural networks: An overview. Neural Networks 2015; 61: 85–117.

Mamoshina P, Vieira A, Putin E, Zhavoronkov A. Applications of deep learning in biomedicine. Mol Pharm 2016; 13: 1445–1454.

Murphy KP. Machine learning: a probabilistic perspective. Cambridge, MA: MIT Press; 2012.

Rampasek L, Goldenberg A. TensorFlow: biology’s gateway to deep learning? Cell Syst 2016; 2: 12–14.

Salakhutdinov R, Hinton G. An efficient learning procedure for deep Boltzmann machines. Neural Comput 2012; 24: 1967–2006.

Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015; 61: 85–117.

Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2012: 2951–2959.

Spencer M, Eickholt J, Cheng J. A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Trans Comput Biol Bioinformatics 2015; 12: 103–112.

Eickholt J, Cheng J. Predicting protein residue-residue contacts using deep networks and boosting. Bioinformatics 2012; 28: 3066–3072.

Eickholt J, Cheng J. DNdisorder: predicting protein disorder using boosting and deep networks. BMC Bioinformatics 2013; 14: 88.

Gawehn E, Hiss JA, Schneider G. Deep learning in drug discovery. Mol Informatics 2016; 35: 3–14.

Che Z, Purushotham S, Khemani R, Liu Y. Distilling knowledge from deep networks with applications to healthcare domain. arXiv 2015; 1512.03542.

Bastien F, Lamblin P, Pascanu R, Bergstra J, Goodfellow I, Bergeron A, Bouchard N, Warde-Farley D, Bengio Y. Theano: new features and speed improvements. arXiv 2012; 1211.5590.

Bengio Y. Practical recommendations for gradient-based training of deep architectures. In: Montavon G, Orr G, Müller K-R, editors. Neural Networks: Tricks of the Trade. Berlin: Springer; 2012.

Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv 2013; 1312.6114.

Kingma D, Ba J. Adam: a method for stochastic optimization. arXiv 2014; 1412.6980.

Leung MKK, Xiong HY, Lee LJ, Frey BJ. Deep learning of the tissue-regulated splicing code. Bioinformatics 2014; 30: 121–129.

Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv 2013; 1312.6034.

Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv 2014; 1409.1556.

Koh PW, Pierson E, Kundaje A. Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics 2017; 33(14): 225–233.

Dahl GE, Jaitly N, Salakhutdinov R. Multi-task neural networks for QSAR predictions. arXiv 2014; 1406.1231.

Lipton ZC. A critical review of recurrent neural networks for sequence learning. arXiv 2015; 1506.00019.

Lipton ZC, Kale DC, Elkan C, Wetzell R. Learning to diagnose with LSTM recurrent neural networks. arXiv 2015; 1511.03677.

Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T. DeCAF: a deep convolutional activation feature for generic visual recognition. arXiv 2013; 1310.1531.

Kraus OZ, Ba LJ, Frey B. Classifying and segmenting microscopy images using convolutional multiple instance learning. arXiv 2015; 1511.05286v1.

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521: 436–444.

Lee B, Lee T, Na B, Yoon S. DNA-level splice junction prediction using deep recurrent neural networks. arXiv 2015; 1512.05135.

Park Y, Kellis M. Deep learning for regulatory genomics. Nat Biotechnol 2015; 33: 825–826.

Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet 2015; 16: 321–332.

Sutskever I, Vinyals O, Le QV. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2014.

Wasson T, Hartemink AJ. An ensemble model of competitive multi-factor binding of the genome. Genome Res 2009; 19: 2102–2112.

Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant to be? Genome Biol 2013; 14: 205.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning based sequence model. Nat Methods 2015; 12: 931–934.

Swan AL, Mobasheri A, Allaway D, Liddell S, Bacardit J. Application of machine learning to proteomics data: classification and biomarker identification in postgenomics biology. Omics 2013; 17: 595–610.

Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015; 33: 831–838.

Zhang J, White NM, Schmidt HK. Integrate: gene fusion discovery using whole genome and transcriptome data. Genome Res 2016; 26(1): 108–118.

Degroeve S, Baets BD, de Peer YV, Rouzé P. Feature subset selection for splice site prediction. Bioinformatics 2002; 18: S75–S83.

Wasson T, Hartemink AJ. An ensemble model of competitive multi-factor binding of the genome. Genome Res 2009; 19: 2102–2112.

Lanckriet GRG, Bie TD, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics 2004; 20: 2626–2635.

Pavlidis P, Weston J, Cai J, Noble WS. Learning gene functional classifications from multiple data types. J Computat Biol 2002; 9: 401–411.

Picardi E, Pesole G. Computational methods for ab initio and comparative gene finding. Meth Mol Biol 2010; 609: 269–284.

Degroeve S, Baets BD, de Peer YV, Rouzé P. Feature subset selection for splice site prediction. Bioinformatics 2002; 18: S75–S83.

Ouyang Z, Zhou Q, Wong HW. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci USA 2009; 106: 21521–21526.

Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics 2016; 32: 1832–1839.

Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in S. cerevisiae). Proc Natl Acad Sci USA 2003; 100: 8348–8353.

Upstill-Goddard R, Eccles D, Fliege J, Collins A. Machine learning approaches for the discovery of gene–gene interactions in disease data. Brief Bioinform 2013; 14: 251–260.

Urbanowicz R, Granizo-Mackenzie D, Moore J. An expert knowledge guided Michigan-style learning classifier system for the detection and modeling of epistasis and genetic heterogeneity. Proc Parallel Problem Solving From Nature 2012; 12: 266–275.

Angermueller C, Lee H, Reik W, Stegle O. Accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 2017; 18: 67.

Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 2012; 9: 215–216.

Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat Genet 2004; 36: 559–564.

Battle A, Khan Z, Wang SH, Mitrano A, Ford MJ, Pritchard JK, Gilad Y. Genomic variation. Impact of regulatory variation from RNA to protein. Science 2015; 347: 664–667.

Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016; 26: 990–999.

Sønderby SK, Winther O. Protein secondary structure prediction with long short term memory networks. arXiv 2014; 1412.78.

Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell 2004; 117: 185–198.

Heintzman N. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 2007; 39: 311–318.

Pique-Regi R. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res 2011; 21: 447–455.

Qiu J, Noble WS. Predicting co-complexed protein pairs from heterogeneous data. PLoS Comput Biol 2008; 4: e1000054.

Ramaswamy S. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 2001; 98: 15149–15154.

Saigo H, Vert JP, Akutsu T. Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics 2006; 7: 246.

Segal E. A genomic code for nucleosome positioning. Nature 2006; 442: 772–778.

Karlic RR, Chung H, Lasserre J, Vlahovicek K, Vingron M. Histone modification levels are predictive for gene expression. Proc Natl Acad Sci USA 2010; 107: 2926–2931.

Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y, Pritchard JK. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol 2011; 12: R10.

Cuellar-Partida G, et al. Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 2011; 28: 56–62.

Kell DB. Metabolomics, machine learning and modelling: towards an understanding of the language of cells. Biochem Soc Trans 2005; 33: 520–524.

Shen H, Zamboni N, Heinonen M, Rousu J. Metabolite identification through machine learning—Tackling CASMI challenge using fingerID. Metabolites 2013; 3: 484–505.

Glaab E, Bacardit J, Garibaldi JM, Krasnogor N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS One 2012; 7: e39932.

Menden MP, Iorio F, Garnett M, McDermott U, Benes CH, Ballester PJ, Saez-Rodriguez J. Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 2013; 8: e61318.

Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada; 2012: 1097–1105.

Lanchantin J, Lin Z, Qi Y. Deep motif: visualizing genomic sequence classifications. arXiv 2016; 1605.01133.

Zeng H, Edwards MD, Liu G, Gifford DK. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics 2016; 32(12): 121–127.

Chen J, Guo M, Wang X, Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinform 2016; 108: 256.

Torracinta R, Campagne F. Training genotype callers with neural networks. bioRxiv 2016; 097469.

Poplin R, Newburger D, Dijamco J, Nguyen N, Loy D, Gross SS, McLean CY, DePristo MA. Creating a universal SNP and small indel variant caller with deep neural networks. bioRxiv 2018; doi.org/10.1101/092890.

Schreiber J, Libbrecht M, Bilmes J, Noble W. Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. bioRxiv 2017; 103614.

Boza V, Brejova B, Vinar T. DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 2017; 12(6): e0178751.

Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 2016; 44(11): e107.

Lee T, Yoon S. Boosted categorical restricted Boltzmann machine for computational prediction of splice junctions. In: Int Conf Machine Learning; 2015: 2483–2492.

Baumgartner C, Böhm C, Baumgartner D. Modelling of classification rules on metabolic patterns including machine learning and expert knowledge. J Biomed Inform 2005; 38: 89–98.

Alakwaa FM, Chaudhary K, Garmire LX. Deep learning accurately predicts estrogen receptor status in breast cancer metabolomics data. J Proteome Res 2018; 17: 337–347.

Hao J, Astle W, De Iorio M, Ebbels T. BATMAN—An R package for the automated quantification of metabolites from NMR spectra using a Bayesian model. Bioinformatics 2012; 28: 2088–2090.

Ravanbakhsh S, Liu P, Bjorndahl TC, Mandal R, Grant JR, Wilson M, Eisner R, Sinelnikov I, Hu X, Luchinat C. Accurate, fully-automated NMR spectral profiling for metabolomics. PLoS One 2015; 10: e0124219.

Hsu PD, Lander ES, Zhang F. Development and Applications of CRISPR-Cas9 for Genome Engineering. Cell 2014; 157: 1262.

Sternberg S, Doudna J. Expanding the Biologist’s Toolkit with CRISPR-Cas9. Mol Cell 2015; 58: 568.

Tsai SQ, Zheng Z, Nguyen NT, Liebers M, Topkar VV, Thapar V, Wyvekens N, Khayter C, Iafrate AJ, Le LP, Aryee MJ, Joung JK. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat Biotechnol 2015; 33(2): 187.

Slaymaker IM, et al. Rationally engineered Cas9 nucleases with improved specificity. Science 2016; 351: 84–88.

Kim S, Kim D, Cho SW, Kim J, Kim JS. Highly efficient RNA-guided genome editing in human cells via delivery of purified Cas9 ribonucleoproteins. Genome Res 2014; 24: 1012–1019.

Casini A, Olivieri M, Petris G, Montagna C, Reginato G, Maule G, Lorenzin F, Prandi D, Romanel A, Demichelis F, Inga A, Cereseto A. A highly specific SpCas9 variant is identified by in vivo screening in yeast. Nat Biotechnol 2018; 36: 265–271.

Wilson H, Elizabeth D, McDonald M. Factors for success in customer relationship management (CRM) systems. J Marketing Manage 2002; 18(1): 193–219.

Costa FF. Big data in genomics: challenges and solutions. GIT Lab J 2012; 11: 1–4.

Ward RM, Schmieder R, Highnam G, Mittelman D. Big data challenges and opportunities in high-throughput sequencing. Syst Biomed 2013; 1: 29–34.

Eisenstein M. Big data: The power of petabytes. Nature 2015; 527: S2–S4.

Bacardit J, Llorà X. Large-scale data mining using genetics-based machine learning. Wiley Interdiscip Rev 2013; 3: 37–61.
