According to a recent study, artificial intelligence could stimulate employment and increase overall business revenues by 38% by 2020 (1, 2). For the global economy, this effect would translate into a profit increase of approximately $4.8 trillion and, above all, a profound rethinking of the production system. In the artificial neural network model, the information (input), such as genomic data, is fed into a node (simulated neuron) and processed. This produces an initial, transient result that is passed to the next level, where the process is repeated. After multiple levels, the final stage of information processing is reached (e.g., prediction of gene function and structure). The input (initial information) is defined as specific data, and the output (final information) must be consistent with the input (3, 4, 5, 6, 7).
The design of learning elements must consider three fundamental aspects:
which components of the executive element (e.g. gene regulation and structures) should be learned;
the type of feedback available;
the type of representation used.
The feedback step is the most important because it enables us to determine the nature of the problem. On this basis, we can distinguish three types of learning: supervised, unsupervised, and reinforcement learning. A typical machine learning workflow is organized into the following steps (a minimal sketch follows the list):
prepare the data;
choose the algorithm;
adapt a model;
choose a validation method;
perform evaluations;
use the model to make predictions.
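As a rough illustration of these steps, the sketch below runs through them on synthetic data; scikit-learn is used purely as an example library (it is not one of the tools discussed in this review), and all names and values are illustrative assumptions.

```python
# Minimal sketch of the workflow steps listed above, on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Prepare the data (here: a synthetic matrix of 1,000 samples x 20 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Choose the algorithm.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 3. Fit the model.
model.fit(X_train, y_train)

# 4. Choose a validation method (5-fold cross-validation on the training set).
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# 5. Perform evaluations on held-out data.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# 6. Use the model to make predictions on new samples.
new_predictions = model.predict(rng.normal(size=(5, 20)))
print(cv_scores.mean(), test_accuracy, new_predictions)
```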
Figure 1. Example of an artificial neuron.
A single neuron receives a numerical input (the weighted sum of its different inputs). Depending on whether this value exceeds its threshold, the neuron either becomes active, processing the received value through a specific activation function, or remains inactive. A neuron is therefore characterized by an activation function and an activation threshold. For a given activation function, it may in some cases be difficult, if not impossible, to obtain certain output values: there may be no combination of input and weight values for which that function produces the desired output. It is therefore necessary to use an additional coefficient b, known as the bias, whose purpose is to translate the activation function of the neuron so that all desired outputs become reachable for the given inputs and weights. We can distinguish two basic types of neural networks: feedforward neural networks and recurrent neural networks (RNNs, also called feedback neural networks) (22). In the former, the connections between neurons never form a cycle, so information always travels in one direction. In contrast, recurrent neural networks contain cycles, creating an internal state of the network that enables dynamic behaviour over time; an RNN can use its own internal memory to process various types of input. It is also possible to distinguish between fully connected networks, in which every neuron is connected to all of the others, and layered (stratified) networks, in which the neurons are organized in layers. In layered networks, all of the neurons of a layer are connected to all of the neurons of the next layer; there are no connections between neurons of the same layer, nor between neurons of non-adjacent layers. A layered network with three layers is shown in Figure 2.
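A minimal sketch of such a neuron, assuming a simple step activation and arbitrary example values, is the following:

```python
# Minimal sketch of the artificial neuron described above: a weighted sum of
# inputs plus a bias term b, passed through an activation function with a threshold.
import numpy as np

def step_activation(z: float, threshold: float = 0.0) -> int:
    """Fire (1) if the weighted input exceeds the threshold, otherwise stay inactive (0)."""
    return 1 if z > threshold else 0

def neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> int:
    # Weighted sum of the inputs; the bias b translates the activation function
    # so that the desired outputs remain reachable.
    z = float(np.dot(inputs, weights)) + bias
    return step_activation(z)

# Illustrative values (assumed, not taken from the text).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, 0.1])
print(neuron(x, w, bias=-0.3))  # 1 if the neuron fires, 0 otherwise
```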
Figure 2. A layered network with three layers.
The leftmost layer in Figure 2, with neurons denoted by blue circles, is generally referred to as the input layer, while the rightmost layer, with a single orange neuron, constitutes the output layer. Finally, the central layer is called the hidden layer because its values are not directly observed in the training set.
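A forward pass through a small layered network of this kind might look as follows; the layer sizes and the random weights are illustrative assumptions, not values taken from the text:

```python
# Minimal sketch of the layered network of Figure 2: an input layer, one hidden
# layer, and a single output neuron, with each layer fully connected to the next.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    hidden = sigmoid(W1 @ x + b1)       # hidden layer: no connections within the layer
    output = sigmoid(W2 @ hidden + b2)  # single output neuron
    return output

# Illustrative shapes (assumed): 4 input neurons, 3 hidden neurons, 1 output neuron.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```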
Euclidean distance
Manhattan distance
The Euclidean distance d2(x, y) is the square root of the sum of the squared differences between the coordinates of the two vectors. The Manhattan distance is a simple modification of the Euclidean distance, given by the sum of the absolute coordinate differences. Cluster analysis can use symmetric or asymmetric distances; many of the distance functions listed above are symmetric (the distance from object A to B equals the distance from B to A). Clustering techniques are mainly based on two approaches: 1) bottom-up and 2) top-down. In the Bottom-Up method, all elements are initially considered separate clusters, and the algorithm then joins the nearest clusters; merging continues until a fixed number of clusters is obtained or until the minimum distance between clusters is reached. With the Top-Down method, all elements initially belong to a single cluster, and the algorithm then divides it into progressively smaller clusters. The criterion used always aims to obtain homogeneous elements and proceeds until a predetermined number of clusters is reached.
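Both distances and the Bottom-Up strategy can be sketched as follows; SciPy's agglomerative clustering is used here only as an illustrative implementation, and the example points are arbitrary:

```python
# Minimal sketch of the Euclidean and Manhattan distances and of Bottom-Up
# (agglomerative) clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def euclidean(x, y):
    # d2(x, y): square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # d1(x, y): sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))

# Both distances are symmetric: d(A, B) == d(B, A).
a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b))  # 5.0, 7.0

# Bottom-Up clustering: every point starts as its own cluster and the nearest
# clusters are merged until a fixed number of clusters (here 2) is reached.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = fcluster(linkage(points, method="single", metric="euclidean"),
                  t=2, criterion="maxclust")
print(labels)
```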
Figure 3. Increase of the prediction accuracy as a function of the training data size.
Figure 4. Example of sparse connectivity between convolutional layers: the neurons of a layer m are connected only to some of the neurons of the previous layer m-1.
Figure 5. Example of weight sharing (feature map) between neurons of different convolutional layers; links of the same colour correspond to equal weights.
In the last ten years, next-generation sequencing (NGS) technologies have demonstrated their potential and, with the production of short reads, throughput has become increasingly large. NGS techniques, combined with advances in fields ranging from chemistry to bioinformatics, have made DNA sequencing possible at reasonable prices. Bioinformatics, in particular, has been fundamental to this development: thanks to algorithms based on techniques such as hash tables, indexes, and spaced seeds, it has been possible to optimize the analysis of increasingly large data sets. NGS technologies are used in a variety of areas, such as cancer research, human DNA analysis, and animal studies. The main genomic programs concerned with cancer and rare diseases are listed in the table below.
Main genomic programs concerned with cancer and rare diseases
Internet site | Program | Thematic area |
---|---|---|
www.allofus.nih.gov/ | All of Us Research Program | Cancer, rare diseases, complex traits |
www.australiangenomics.org.au | Australian Genomics | Cancer, rare diseases, complex traits |
www.brcaexchange.org | BRCA Challenge | Cancer, rare diseases, complex traits |
www.candig.github.io | Canadian Distributed Infrastructure for Genomics | Cancer, rare diseases, basic biology |
www.clinicalgenome.org | Clinical Genome Resource | Rare diseases |
www.elixir-europe.org | ELIXIR Beacon | Rare diseases, basic biology |
www.ebi.ac.uk | European Genome-Phenome Archive | Rare diseases, basic biology |
www.genomicsengland.co.uk | Genomics England | Cancer, rare diseases, complex traits |
www.humancellatlas.org/ | Human Cell Atlas | Cancer, rare diseases, complex traits, basic biology |
www.icgcmed.org | ICGC-ARGO | Cancer |
www.matchmakerexchange.org | Matchmaker Exchange | Rare diseases |
www.gdc.cancer.gov | National Cancer Institute Genomic Data Commons | Cancer, rare diseases |
www.monarchinitiative.org | Monarch Initiative | Rare diseases, complex traits, basic biology |
www.nhlbiwgs.org | Trans-Omics for Precision Medicine | Rare diseases, complex traits, basic biology |
www.cancervariants.org | Variant Interpretation for Cancer Consortium | Cancer |
Next Generation Sequencing allows for parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. All of the available NGS platforms have a common technological feature: massive parallel sequencing of clonally amplified DNA molecules or single spatially separated DNA molecules in a flow cell. The main NGS platforms are the following:
HiSeq
MiSeq
Ion Torrent
SOLiD
PacBio RS II and Sequel System.
This strategy represents a radical change with respect to the sequencing method described by Sanger, which is based on the electrophoretic separation of fragments of different lengths obtained by single sequencing reactions (95). With NGS technologies, by contrast, sequencing is performed through repeated cycles of nucleotide extension by a DNA polymerase or, alternatively, through iterative cycles of oligonucleotide ligation. Because the procedure is massively parallel, these platforms allow the sequencing of hundreds of millions (Mb) to billions (Gb) of base pairs of DNA in a single analytical session, depending on the type of NGS technology used. These technologies are versatile, as they can be used for diagnostic as well as basic and translational research. Traditionally, genomic data analysis is performed with software such as VarDict, FreeBayes, and GATK, which are based on statistical analysis (83). Sequencing of genomic sub-regions and gene groups is currently used to identify polymorphisms and mutations in genes implicated in tumours, and regions of the human genome involved in genetic diseases, through linkage and genome-wide association studies (investigations using a panel of genes from different individuals to determine gene variations and associations). These instruments are also used for genetic and molecular analysis in various fields, including the diagnosis of rare genetic diseases as well as neoplastic and endocrine-metabolic diseases. The field of application of these technologies is expanding and will cover more diagnostic aspects in the future. Several NGS platforms can be used to generate different sources of genomic data:
whole genomes;
microarrays;
RNAseq;
capture arrays (exomes), e.g., Illumina TruSeq Exome Enrichment (62 Mb) and Agilent SureSelect (50 Mb);
targeted regions;
specific genes;
chromatin interactions (DNase-seq, MNase-seq, FAIRE);
ChIP-seq;
expression profiles.
Computer programmes that predict genes are becoming increasingly sophisticated. Most software recognizes genes by identifying distinctive patterns in DNA sequences, such as the start and end signals of translation, promoters, and exon-intron splicing junctions (72). However, open reading frames (ORFs) may be difficult to find when genes are short or when they undergo an appreciable amount of RNA splicing, with small exons separated by large introns. Prokaryotic genomes are predicted more accurately (sensitivity and specificity > 90%) than eukaryotic genomes. The main tasks of genomic analysis are the following: identification of gene location and structure, identification of regulatory elements, identification of non-coding RNA genes, prediction of gene function, and prediction of RNA secondary structure. Another approach discovers new genes based on their similarity to known genes; however, it can only identify new genes when there is an obvious homology with other genes. Ultimately, the function of a gene must be confirmed by molecular biology methods. In fact, some predicted genes could be pseudogenes. Pseudogenes have sequences similar to those of normal genes but usually contain interruptions, such as reading-frame shifts or stop codons in the middle of coding domains, which prevent them from generating a functional product or having a detectable effect on the phenotype. Pseudogenes are present in a wide variety of animals, fungi, plants, and bacteria. Predictive methods aim to derive general rules from a large number of examples (nucleotide labels within sequences), represented by observations referring to the past and accumulated in international databases.
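As a simple illustration of pattern-based gene finding, the sketch below scans a DNA string for ORFs delimited by a start codon (ATG) and a stop codon in the three forward reading frames; it deliberately ignores the reverse strand, splicing, and the other complications mentioned above, and the example sequence is invented.

```python
# Minimal sketch of ORF scanning: look for ATG ... stop-codon stretches in each
# of the three forward reading frames of a DNA sequence.
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end) positions of ORFs on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                    j += 3
                if j + 3 <= len(seq) and (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3))  # include the stop codon
                i = j
            i += 3
    return orfs

# Illustrative sequence (assumed, not from the text); prints [(2, 17)].
print(find_orfs("CCATGAAATTTGGGTAACCATGCCC"))
```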
Predictive models based on machine learning aim to draw conclusions from a sample of past observations and to transfer these conclusions to the entire population. Identified patterns can take different forms, such as linear, non-linear, cluster, graph, and tree functions (73, 74, 75).
Machine learning workflows are usually organized into four steps (a minimal sequence-based sketch follows the list):
filtering and data pre-processing;
feature extraction;
model fitting;
model evaluation.
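A minimal sequence-based instance of these four steps is sketched below; the toy sequences and labels, the k-mer features, and the choice of logistic regression (via scikit-learn) are illustrative assumptions rather than methods prescribed in the text.

```python
# Minimal sketch of the four workflow steps using k-mer counts as sequence features.
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def kmer_features(seq: str, k: int = 3) -> np.ndarray:
    """Feature extraction: count every k-mer of the DNA alphabet in the sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:          # skip k-mers containing ambiguous bases
            counts[index[kmer]] += 1
    return counts

# 1. Filtering and data pre-processing: keep only sequences made of A/C/G/T.
sequences = ["ATGCGTACGT", "TTTTAAAACC", "ATGCGTTTGT", "CCCCAAAATT"]
labels = np.array([1, 0, 1, 0])
sequences = [s for s in sequences if set(s) <= set("ACGT")]

# 2. Feature extraction.
X = np.array([kmer_features(s) for s in sequences])

# 3. Model fitting.
model = LogisticRegression(max_iter=1000).fit(X, labels)

# 4. Model evaluation (2-fold cross-validation on this toy example).
print(cross_val_score(model, X, labels, cv=2))
```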
Machine learning methods may use supervised or unsupervised systems. The supervised method requires, for the training step, a set of DNA sequences with all the genetic information (including the start and end of the gene, splicing sites, regulatory regions, etc.) in order to build the predictive model. This model is then used to find new genes that are similar to the genes of the training set. Supervised methods can only be used if a known training set of sequences is available. Unsupervised methods are used if we are interested in finding the best set of unlabelled sequences that explain the data (76). The main software tools used for machine learning studies are listed in the table below.
Main software tools used for machine learning studies
Internet site | Program name | Thematic area |
---|---|---|
www.sourceforge.net/p/fingerid | FingerID | Molecular fingerprinting |
 | SIRIUS | Molecular fingerprinting |
 | MetaboAnalyst | Metabolomics analysis |
www.knime.com/ | KNIME | Machine learning tool |
www.cs.waikato.ac.nz/ml/weka/ | Weka | Machine learning tool |
 | Orange | Machine learning tool |
 | TensorFlow | Machine learning tool |
 | Caffe | Machine learning tool |
 | Theano | Machine learning tool |
 | Torch | Machine learning tool |
Machine learning methodologies have a wide range of application areas:
identification of protein coding regions and protein-DNA interactions (78,79);
identification of regulatory regions (e.g., promoters, enhancers, and polyadenylation signals) (80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95);
prediction of splice sites (Bayesian classification);
identification of functional RNA genes (96);
chromatin fragment interactions;
histone marks (77);
transcription factor (TF) binding sites (96);
prediction of amino acid sequences and RNA secondary structures (97, 98, 99, 100, 101, 102, 103, 104, 105, 106).
Convolutional neural networks (CNNs) can substantially improve performance in solving sequence-based problems compared with previous methods (111). Recurring sequence patterns in the genome can be efficiently identified by CNN methods (111, 112). In this approach, the genome sequence is analysed as a 1D window using four channels (A, C, G, T) (113), and protein-DNA interactions are analysed and solved as a two-class identification problem (114). Genomic data used in machine learning models should be divided into three partitions: training (60%), testing (30%), and validation (10%). The main advantages of the CNN method can be summarized as follows:
training on sequence;
no feature definition;
reduction of the number of model parameters;
use only small regions and share parameters between regions;
train wider sequence windows;
make
In the CNN method, the following parameters should be optimized (a minimal CNN sketch over DNA windows follows this list):
the number of feature maps;
window size (DNA sequences);
kernel size;
convolution kernel design;
pooling design.
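The sketch below puts these ingredients together: DNA windows one-hot encoded over the four channels (A, C, G, T), a 60/30/10 training/testing/validation split, and a small 1D CNN whose number of feature maps, kernel size, and pooling are arbitrary illustrative choices; TensorFlow/Keras is used only because it is among the tools listed above, and the data are synthetic.

```python
# Minimal sketch of a 1D CNN over one-hot DNA windows with four channels.
import numpy as np
import tensorflow as tf

WINDOW, CHANNELS = 200, 4  # window size (DNA sequence length) and the A/C/G/T channels

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA window as a (length x 4) matrix with one channel per base."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), CHANNELS), dtype="float32")
    for i, base in enumerate(seq):
        if base in mapping:
            x[i, mapping[base]] = 1.0
    return x

# Synthetic two-class data, split into training (60%), testing (30%), validation (10%).
rng = np.random.default_rng(0)
bases = np.array(list("ACGT"))
seqs = ["".join(rng.choice(bases, size=WINDOW)) for _ in range(1000)]
X = np.stack([one_hot(s) for s in seqs])
y = rng.integers(0, 2, size=1000)
X_train, y_train = X[:600], y[:600]
X_test, y_test = X[600:900], y[600:900]
X_val, y_val = X[900:], y[900:]

model = tf.keras.Sequential([
    # 32 feature maps and a kernel size of 12 are arbitrary illustrative choices.
    tf.keras.layers.Conv1D(32, kernel_size=12, activation="relu",
                           input_shape=(WINDOW, CHANNELS)),
    tf.keras.layers.MaxPooling1D(pool_size=4),       # pooling design
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # two-class output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=32)
print(model.evaluate(X_test, y_test))
```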
Chen
Recurrent neural networks (RNNs) have been proposed to improve on the performance of the CNN method. RNNs are very useful in the case of ordered sequences (e.g., sequential genomic data); a minimal recurrent-network sketch over such ordered input is given below. Several applications have recently been reported.
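The sketch assumes TensorFlow/Keras, random data of illustrative shape, and a single LSTM layer whose internal state carries memory along the ordered sequence:

```python
# Minimal sketch of a recurrent network (LSTM) over ordered sequence input.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(1)
X = rng.random((500, 100, 4)).astype("float32")  # 500 sequences, 100 steps, 4 channels
y = rng.integers(0, 2, size=500)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, input_shape=(100, 4)),  # internal state carries memory along the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```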
Google Genomics has recently developed new software called DeepVariant (Google Brain Team) that uses AI techniques to determine the characteristics of a genome starting from the information of reference sequences. Deep Genomics and WuXi (with offices in Shanghai, Reykjavik, and Boston) use the latest generation of AI techniques to study several genetic diseases in an attempt to find possible therapies. AI methods have also been used for metabolomic data analysis.
Attempts to correct the genome for the treatment of genetic diseases have been going on for a long time. Until a few years ago, this type of research required long and complex procedures. A decisive change came in 2012, when it was discovered that a protein present in bacteria (Cas9), in association with an RNA sequence, can be used as a device to probe DNA and identify the point of genetic damage (125). To use Cas9 as a genetic engineering tool, it is necessary to produce an RNA guide, identify the corresponding DNA segment, and delete the gene (125, 126, 127, 128, 129, 130). Although this method is highly efficient and rapid compared with older technologies, its limitation has been precision: the recognition sequences, i.e., the guides, are small, consisting of approximately twenty nucleotides. In several laboratories around the world, approaches were already underway to make this technology more precise. Recently, Casini
Currently, even the smallest lab or company can generate an extremely large volume of data. These data sets are referred to as big data (the term indicates data sets whose size requires instruments different from those traditionally used) and can be analysed with various techniques, including predictive analysis and data mining. It is important to understand exactly what is meant by big data; since 2011, the term has entered common language forcefully. At present, big data refers to databases that contain both structured data (from databases) and unstructured data (from new sources such as sensors, devices, and images), coming from areas internal and external to an organization, which can provide valuable information if used and analysed in the correct way. The data can thus be of two types: structured and unstructured (131). The former refers to a variety of formats and types that can be acquired from interactions between people and machines, such as information derived from consumer behaviour on internet sites or from users of social networks. Unstructured data, conversely, are usually free text and consist of information that is not organized or easily interpretable through models. To define big data more formally, we can use the following four variables (132):
Volume: the amount of data has increased considerably in recent years and is still growing. It is now extremely easy for a company to store terabytes and petabytes of data.
Speed: compared with fifteen years ago, data now become obsolete after a very short period of time, which is one of the reasons why long-term plans are increasingly rare. New business leaders should be able to perform a preliminary analysis of the data in order to anticipate trends.
Variety: the available data come in many formats, and continuous changes of format create many problems in the acquisition and classification of the data.
Complexity: managing numerous sources of data is becoming highly difficult.
The analysis of big data is carried out using the following applications:
Descriptive Analytics (DA)
Prescriptive Analytics (PA)
Automated Analytics (AA)
DA tools aim at describing the current or past data situation. PA tools perform data analysis to predict future scenarios and, together with that analysis, are able to propose operational and strategic solutions; they use algorithms that play an active role in finding hidden patterns and correlations in past data and build prediction models for future use (133, 134, 135). AA tools are capable of autonomously implementing the proposed actions according to the results. A practical tool that can help provide an overview of the benefits of a big data project is the value tree. The value tree allows tracking of all the benefits that can emerge from a big data project and helps to identify benefits that cannot be estimated in advance. In the value tree, benefits are classified as quantifiable or non-quantifiable, without distinguishing benefits that are directly quantifiable in economic or financial terms. Awareness of the existence of big data has forced several companies to create new, effective strategies. Appropriate analytical tools are needed to translate the data into usable information; data mining, for example, processes large collections of continuous data flows, in graphical or numerical form, for the purpose of extracting useful information. The need to anticipate certain situations has become extremely important and can lead to a competitive advantage. Thus, data analysis serves to identify current trends and also to predict future strategies. In particular, predictive models, as well as machine learning and data mining, attempt to identify relationships within the data in order to identify future trends. Through the process of automatic learning, the task of the model is to analyse the variables, both individually and together, and to provide a predictive score while reducing the margin of error. Predictive Analysis (P.A.) is a science that is already widely used and is being continuously improved in terms of its features and predictions. Persons and companies who use personal and medical data cannot ignore the legal and ethical aspects related to the acquisition and use of such data and information, and must comply with current legislation.
Machine learning is the science that enables computers to make predictions about the future, and it represents one of the fundamental areas of artificial intelligence. Deep learning is a branch of machine learning based on a set of algorithms organized hierarchically in many levels (at least two of which are hidden). Generally, these algorithms involve multiple stages of processing (training, model fitting, model evaluation), often have complex structures, and are normally composed of a series of non-linear transformations. Recently, CNNs and RNNs have been widely used in deep learning, especially for the identification of protein-coding regions, protein-DNA interactions, regulatory regions (e.g., promoters, enhancers, and polyadenylation signals), splicing sites, and functional RNA genes.