Utility of Machine Learning Technology in Microbial Identification: A Critical Review

Introduction

The first known microorganisms appeared on Earth roughly 3.5 billion years ago [54]. Microorganisms include bacteria, viruses, fungi, and microscopic protozoa and algae. These organisms [44] can be both beneficial and harmful: they are used in food production, medicine [60, 91], agriculture [52, 73], industry [64], and environmental protection [11, 93].

Recent trends in microbiology research have included community classification and its relationship to the environment [11, 93], regulation of host-microbiome interactions and the gut microbiome [33, 49, 66, 79], and attempts to integrate the microbiome with genome engineering, molecular modification, ecology, resource utilisation, biocatalysis, synthesis, pharmaceutical vaccine development, and the study of pathogenic bacteria. Combining microbiology with multiomics (genomics, epigenomics, transcriptomics, proteomics, and metabolomics) has also given rise to a number of new multi-scale areas [8, 46].

Recent breakthroughs in microbial sequencing have made microbiome studies increasingly popular. High-throughput sequencing methods have generated a huge body of microbiological data. Machine-learning techniques have gradually been incorporated into microbial investigations [30, 31, 59, 81–83, 88, 95], because traditional methods involving microscopes and biological cultures are costly and time-consuming. With the advent of the Big Data era, researchers’ most pressing concerns have shifted to questions such as how to rapidly and effectively filter and condense this exponentially growing body of information into high-quality, generalisable data, and how to transform the enormous amounts of microbiota data into knowledge that is easily understood and visualised.

McCarthy first proposed artificial intelligence (AI) at the Dartmouth Conference in the summer of 1956. A primary goal of AI is to emulate and extend human intellect, along with studying and improving the related theoretical frameworks, methods, and system architectures. The advent of AI has sped up the development of microbiology and brought about a paradigm shift in the field as a whole [6]. With the use of big data, automation, modelling, and AI, the study of microbiology has expanded into new areas, such as systems biomedicine and systems ecology.

In 1959, Arthur Samuel (who worked at Bell Labs, IBM, and Stanford) proposed the concept of machine learning (ML) with the intention of extracting features from large datasets with varying structures. In its simplest form, ML uses algorithms to analyse data, automatically identify patterns, and base predictions and decisions on those patterns [34]. Multiomics integration, both across scales and with complex microbial communities, makes it possible to systematically characterise interactions between microflora and their hosts using machine learning.

Deep learning (DL) is a game-changing machine learning strategy that models high-level abstractions of data with parametric models learned via gradient descent across several layers of processing units in a deep network [41]. DL is thus a subset of ML, which in turn is one way of implementing AI (Fig. 1).

Fig. 1.

The relationship among Artificial Intelligence, Machine Learning and Deep Learning

Machine learning

Developing an ML application involves four steps [57]. Feature extraction is the first step in the ML methodology [47]; for sequencing data, an operational taxonomic unit (OTU) table can be generated by clustering. Feature selection then retains the most informative features, improving both precision and efficiency. Finally, a model is trained on a training dataset and evaluated on a held-out test set. An overview of the process is shown in Figure 2.

Fig. 2.

Various stages of Machine Learning employed in Microbiology
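The four steps described above can be sketched as a short scikit-learn pipeline. The data here are a synthetic stand-in for an OTU abundance table (rows = samples, columns = taxon counts), not real sequencing output; the sketch only illustrates the extract–select–train–evaluate cycle.

```python
# Sketch of the four ML steps, using synthetic OTU-like data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: feature extraction -- a synthetic stand-in for an OTU table.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

# Step 2: feature selection -- keep the most informative columns.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Steps 3-4: train on a training split, evaluate on a held-out test split.
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.25,
                                          random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
```

On real microbiome data the feature-extraction step would be replaced by OTU clustering of the sequencing reads, but the remaining stages are unchanged.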

Machine learning (ML), i.e. learning from data, is a necessary capability for every AI system. To derive insights from raw data, ML algorithms are fed features produced by carefully crafted pattern recognition methods. The earliest machine learning algorithms were developed in the early 20th century, and numerous established methods have appeared since then (Figure 2). Four classic families of algorithms are covered here: supervised learning, unsupervised learning, reinforcement learning (RL), and DL.

Supervised learning

In statistics, supervised learning approaches include regression analysis and classification. These methods use training sets composed of examples drawn from known categories to fit models. Before machine learning (ML) was even conceived of, Fisher (1936) developed linear discriminant analysis (LDA), a supervised dimensionality reduction technique [23]. With its ability to handle a wide variety of classification tasks and its suitability for incremental training, the naive Bayes (NB) model is a strong contender for small-scale data analysis: it is easy to use, and its classification accuracy remains stable over time [94].

Because NB relies on strong assumptions about the input, the accuracy of its conclusions depends on the form in which the data are presented. Logistic regression (LR) features a simple model and good parameter interpretability, and it is fast and easy to use on large datasets, all of which contribute to its ability to estimate the probability that a sample is positive. The year 1980 was a watershed in the development of machine learning algorithms, which until then had followed a more diffuse and haphazard path. Thanks to their fast computation, high accuracy, and high interpretability, decision trees (DT), developed in the 1980s and 1990s, are still used for some problems today; however, their tendency to overfit makes it easy to misjudge the importance of attributes in the dataset. Among the most common DT implementations are ID3 [61], CART [89], and C4.5 [62]. The 1990s saw the development of two further widely used algorithms: the support vector machine (SVM), which is grounded in statistical learning theory, and AdaBoost [16, 25]. The former (SVM) solves nonlinear classification problems with a simple principle (maximising the margin between samples and the decision surface) using kernel functions that map the data into a high-dimensional space; however, it handles multiclass problems poorly, is sensitive to missing data, and scales badly to very large training sets. The latter (AdaBoost) combines simple weak classifiers to significantly enhance learning accuracy on both synthetic and real data, and it requires no prior knowledge of feature filtering or of the weak classifiers themselves; nevertheless, it is vulnerable to noise interference and requires a lengthy training period.
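The classical supervised learners discussed in this section can be compared directly in a few lines of scikit-learn. This is a minimal sketch on the bundled iris dataset, not on microbiome data; the point is only how interchangeable the estimators are under a common cross-validation interface.

```python
# Comparing naive Bayes, an RBF-kernel SVM, and a decision tree
# with 5-fold cross-validation on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for name, clf in [("NB", GaussianNB()),
                  ("SVM", SVC(kernel="rbf")),
                  ("DT", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On a small, clean dataset like this, all three models perform similarly; the trade-offs described above (overfitting in trees, scaling in SVMs) only surface on larger, noisier data.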

Unsupervised learning

Unsupervised learning searches for recognisable patterns in unlabelled data. Two main lines of research are clustering and dimensionality reduction. Several early implementations of hierarchical clustering, such as SLINK [71] and CLINK [17], are still in use today. The expectation-maximisation (EM) method is frequently used to fit the variational inference of latent Dirichlet allocation (LDA) topic models, the parameters of Gaussian mixture models (GMM), and the hidden-state variables of hidden Markov models (HMM); it has been applied to many maximum-likelihood estimation problems with missing data in machine learning. Mean shift [3, 15], the density-based spatial clustering of applications with noise (DBSCAN) technique, and the ordering points to identify the clustering structure (OPTICS) algorithm are all examples of density-based clustering methods developed in the 1990s. At the turn of the century, a novel approach recast clustering as a graph-cutting problem; the prototypical method built on this idea is spectral clustering. The classic PCA approach has existed as a data dimensionality reduction technique for a long time [58]; its advantages include having no parameter limitations, removing data redundancy and noise, reducing dataset size through compression and preprocessing, and yielding results that are straightforward to interpret. Many nonlinear techniques have been developed since then [9, 28, 65, 78], such as locally linear embedding (LLE), Laplacian eigenmaps, locality-preserving projections, and isometric mapping, followed by t-distributed stochastic neighbour embedding (t-SNE) [80].
This nonlinear method is generally preferred for visualising and understanding high-dimensional data, since it produces the best visualisation results among dimensionality reduction algorithms. Despite its slow development and lack of dramatic breakthroughs, unsupervised learning plays a large role in human and animal learning and is a vital avenue towards more powerful artificial intelligence.
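The two lines of research above, dimensionality reduction and clustering, are often chained together in practice. A minimal sketch on synthetic data: PCA compresses the samples, then k-means (a centroid-based relative of the clustering methods discussed) groups the compressed points; the adjusted Rand index compares the recovered clusters with the known generating labels.

```python
# Dimensionality reduction (PCA) followed by clustering (k-means)
# on synthetic, well-separated Gaussian blobs.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, n_features=10,
                       random_state=0)
X_2d = PCA(n_components=2).fit_transform(X)   # removes redundancy/noise
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(adjusted_rand_score(y_true, labels))    # 1.0 = perfect agreement
```

Swapping PCA for t-SNE, or k-means for DBSCAN or spectral clustering, requires only a one-line change under the same scikit-learn interface.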

Deep learning

Deep learning, in contrast to conventional ML, is multilayered and aims to capture all associations present in the raw data. DL can be divided into supervised, unsupervised, and hybrid models depending on whether labelled data are required; the term “hybrid model” commonly describes the practice of incorporating the results of an unsupervised model into the training of a supervised model. The technological precursor and basis of DL is the artificial neural network (ANN). The ancestor of the ANN, the Perceptron model, was first proposed in 1958 (Rosenblatt, 1958), but it was impractical because it was extremely simplistic and could only handle linearly separable classification tasks (it could not even deal with the XOR problem). It therefore had little practical value, but it laid the conceptual groundwork for subsequent algorithms. Progress in neural network research slowed significantly until the 1980s, when the back-propagation (BP) method [67] made it possible to train multilayer perceptrons with sigmoid activation functions. Long short-term memory (LSTM) networks and convolutional neural networks (CNNs) were developed in this period and are still widely employed, for example on vision problems; LeNet-5 [43] became the template for most deep convolutional neural networks (DCNNs). However, “gradient disappearance” [42] becomes a problem for the BP approach as the scale of the neural network increases. With Hinton and Salakhutdinov’s (2006) introduction of the concept of DL, this problem was addressed: networks were pretrained layer by layer with unsupervised learning and then fine-tuned via supervised back-propagation [29]. Hinton and his student Alex Krizhevsky later won the ImageNet competition using AlexNet [72], an early example of the deep learning approach that is popular today.
A self-attention mechanism lies at the heart of the Transformer network architecture, on which Devlin et al. (2019) built the bidirectional encoder representations from transformers (BERT) model [19]. Because of its flexibility, BERT excels at a wide variety of natural language processing (NLP) tasks. DL is, in essence, a still-developing statistical technique with both benefits and drawbacks in areas such as speech recognition, NLP, and computer vision (CV).
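The back-propagation idea discussed above can be shown on the very problem the Perceptron could not solve. This is a generic textbook sketch in plain numpy, not any cited implementation: a small sigmoid network with one hidden layer, trained by full-batch gradient descent, learns XOR.

```python
# A multilayer sigmoid network trained with back-propagation on XOR,
# the task a single-layer perceptron cannot solve.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])                  # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)           # hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)           # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    h = sigmoid(X @ W1 + b1)                            # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)                 # output-layer delta
    d_h = (d_out @ W2.T) * h * (1 - h)                  # hidden-layer delta
    W2 -= h.T @ d_out;  b2 -= d_out.sum(axis=0)         # gradient steps (lr = 1)
    W1 -= X.T @ d_h;    b1 -= d_h.sum(axis=0)

print(out.round().ravel())
```

The hidden-layer delta is exactly the chain-rule step that makes multilayer training possible; stacking many such layers is where the “gradient disappearance” problem noted above appears.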

Reinforcement Learning

Reinforcement learning is a special category of machine learning in which learning from experience is a central tenet [35, 37]. The technique works by continuously judging, on the basis of interactions, whether an action advances the objective, producing rewards or penalties accordingly, and repeating the process so as to maximise the expected long-term benefit. One active area of study, deep reinforcement learning (DRL), combines deep learning’s strengths in perception with RL’s strengths in decision-making, giving programmers end-to-end control of an algorithm’s behaviour from input to output. It is useful in fields such as healthcare, NLP, linguistics, and medical robotics [21, 24, 39, 50].
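The reward-driven loop described above can be illustrated with tabular Q-learning on a toy environment. This is a generic textbook sketch, not tied to any cited system: a five-cell corridor in which the agent earns a reward only on reaching the rightmost cell, and the learned greedy policy ends up moving right from every state.

```python
# Tabular Q-learning on a 5-cell corridor (reward only at the goal).
import random

N, ACTIONS = 5, [1, -1]                  # 5 states; move right / move left
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1        # step size, discount, exploration
random.seed(0)

for _ in range(500):                     # training episodes
    s = 0
    while s != N - 1:
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N - 1)   # walls at both ends
        r = 1.0 if s2 == N - 1 else 0.0  # reward only at the goal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2

policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N - 1)]
print(policy)
```

The Q-update is the “automatic scoring” step: each interaction nudges the estimated value of an action toward the reward plus the discounted value of the best follow-up action.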

Evaluation criteria and algorithmic workflows

There is no single “best” algorithm: each has its own advantages and disadvantages. In practice, computers simply make people more productive by automating decisions. It is therefore preferable to select the most applicable model rather than the most elaborate one. Four considerations matter when weighing machine learning algorithms [27]: (1) accuracy, the single most important criterion by which an algorithm should be judged; (2) fault tolerance, the algorithm’s ability to detect and compensate for incorrect data; (3) readability, since algorithms that are well documented and easy to understand are far faster to debug, adapt, and extend; and (4) time and space complexity, the amount of processing power and storage required to execute the algorithm.

When using machine learning (ML) as a technical tool to tackle scientific problems, the five steps shown in Figure 3 can commonly be followed [27].

Fig. 3.

Outline of present study

First, define the problem, collect and process the data, and choose an evaluation method. The data are split into three sets: training, validation, and test. The model is fitted on the training set, tuned on the validation set, and finally assessed on the test set, which must contain samples not used to build the model.

Second, construct a prototype that outperforms a simple benchmark, reconciling the tension between optimisation and generalisation. By monitoring generalisation ability, the boundary between underfitting and overfitting can be established.

Third, validate the prototype. A prototype showing real statistical power should first be scaled up; the training and validation losses are then monitored against an overfitting threshold.

Fourth, test the prototype. The aim here is to evaluate its predictive power on entirely fresh data rather than on the validation data.

Fifth, refine the prototype by enhancing the algorithm with additional features, new characteristics, or fine-tuned parameters. The previous steps are repeated until the desired performance is attained; with the aid of model regularisation and hyperparameter tuning, prototype performance is checked on the validation set.

Classification and prediction

Here, we give a brief introduction to ML’s potential applications in the field of microbiology. We focus on classification and interaction problems, because they are two of the most common applications of ML. Figure 3 displays the document’s overall structure.

Prediction of microbial species

Based on cellular form, microorganisms fall into two major groups: prokaryotes (bacteria and archaea) and eukaryotes (including fungi and single-celled algae) [55, 84]. There are primarily two goals in identifying microorganisms. The first is to determine a microbe’s domain, kingdom, phylum, class, order, family, genus, and species. The second is to assign a species to a hitherto unrecognised microbe. To classify different kinds of microorganisms, researchers have employed the LearnTaxa and IdTaxa tools of IDTAXA [53].

These tools are provided by the DECIPHER R package, which is part of Bioconductor and released under the GPLv3 licence; DECIPHER offers resources for processing and making sense of high-throughput genomic data. Isolating and identifying individual microbial genomes from complex metagenomic collections is an important capability. Currently, the typical task of separating prokaryotic and host sequences in mixed samples is accomplished using genome-based similarity algorithms, and there have been many subsequent attempts to find more precise ways of distinguishing among bacteria. In the field of metagenomics, Amgarten et al.’s (2018) MARVEL program is a helpful tool for predicting double-stranded DNA bacteriophage sequences [1]. MARVEL employs the RF technique, with 1,247 phage and 1,029 bacterial genomes serving as training data, and 335 bacterial and 177 phage genomes serving as test data. The authors proposed six properties to aid in phage identification; however, feature selection using random forests revealed that just three variables were particularly informative.
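The feature-selection step described above, ranking candidate properties by random-forest importance, can be sketched in a few lines. The data here are synthetic (three predictive columns, three pure-noise columns), standing in for the genomic features MARVEL uses; the sketch shows only the mechanism, not the actual MARVEL pipeline.

```python
# Ranking features by random-forest importance on synthetic data
# where only the first three columns carry signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=(n, 3))            # 3 predictive features
label = (informative.sum(axis=1) > 0).astype(int)
noise = rng.normal(size=(n, 3))                  # 3 uninformative features
X = np.hstack([informative, noise])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, label)
print(rf.feature_importances_.round(3))          # first three dominate
```

Dropping the low-importance columns and retraining is the same pruning that reduced MARVEL’s six candidate properties to three informative ones.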

Such methods classify microorganisms in granular detail, fulfilling a wide range of needs. Murali et al. (2018) proposed a method for studying the classification of microorganisms [53]. In addition, MARVEL, VirSorter, and VirFinder can all distinguish between various microbial species; while all three methods had comparable specificity, MARVEL had the highest recall (sensitivity) in the comparison conducted by Amgarten et al. [1]. The resources supporting these analyses are listed in Table I.

Table I
Web links of microbiome–disease association studies

Study       | Availability of data and materials
Zhou et al. | https://www.nature.com/articles/ncomms5212#supplementary.information
KATZHMDA    | http://dwz.cn/4oX5mS
BMCMDA      | https://github.com/JustinShi2016/ISBRA2017

Prediction of environmental and host phenotypes

High-throughput sequencing and next-generation DNA technology have opened up previously unexplored territory for microbial researchers in recent years. Both outbreak surveillance and precision medicine can benefit from research into the interactions between microbial populations, phenotypes, and ecological environments [5]. It is well established that parasitic bacteria exist, and that both environmental conditions and host cells play critical roles in defining the precise make-up of any given microbiome. As a result of differences in food supply and environmental conditions, distinct microbial communities emerge [51]. Microbial data can thus provide a deeper understanding of both the environment and the host. In recent years there has been growing interest in using microorganisms to predict host and environmental traits, and in this section we survey research along these lines.

To do this, Asgari et al. (2018) used a shallow k-mer-based subsample representation together with deep learning, random forests, and support vector machines [4] to predict environmental and host features from 16S rRNA gene sequencing. The k-mer-based representation outperformed OTUs both in identifying a sample’s physical origin and in producing accurate predictions for Crohn’s disease. Furthermore, deep learning beat RF and SVM on large datasets. This method can improve efficiency and reduce the risk of overfitting.
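The k-mer representation mentioned above can be made concrete with a toy featurizer: a DNA sequence is turned into a fixed-length vector of k-mer counts, which any of the classifiers discussed can then consume. This is an illustrative sketch, not the representation pipeline of Asgari et al.

```python
# Turn a DNA sequence into a fixed-length vector of k-mer counts (k = 3).
from collections import Counter
from itertools import product

def kmer_vector(seq, k=3):
    """Count occurrences of each possible k-mer of the alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] for kmer in vocab]

vec = kmer_vector("ACGTACGT")
print(len(vec), sum(vec))   # 64 possible trigrams; 6 trigrams in 8 bases
```

Because the vector length depends only on k (4^k entries), sequences of different lengths map to directly comparable feature vectors, which is what makes the representation convenient for ML.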

Statnikov et al. (2013) developed the following data-processing pipeline with OTUs as input features [75]. After sequencing the original DNA, the authors removed any traces of human DNA and used the remaining reads to construct operational taxonomic units (OTUs); the number of sequences within each OTU was then counted. The authors evaluated numerous classifiers, including support vector machines, kernel ridge regression, regularised logistic regression, Bayesian logistic regression, the KNN technique, and the RF method, among others; in total, the group studied 18 distinct ML methods together with five separate feature extraction methods. The results showed that RF, SVM, kernel ridge regression, and Bayesian logistic regression with a Laplacian prior produced the best outcomes. In forensics, an accurate estimate of the time of death is crucial. Applying KNN regression to datasets of ear and nose samples, Johnson et al. (2016) were able to estimate the time of death, suggesting that the skin’s microbiota may be a useful tool in death investigations [32]. Because microbes tend to live in particular niches, their presence can tell us about the health of the surrounding environment and the host’s prospects of survival.
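KNN regression of the kind used in the time-of-death study above is straightforward to sketch. The data here are entirely synthetic: we pretend that a five-taxon community profile drifts linearly with elapsed time plus noise, then predict the elapsed time of held-out samples from their nearest neighbours.

```python
# KNN regression: predict a continuous value (elapsed days) from
# synthetic community profiles that drift with time.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
days = rng.uniform(0, 30, size=200)                  # "true" elapsed days
# Each sample's 5-taxon profile drifts linearly with time, plus noise.
profiles = days[:, None] * rng.uniform(0.5, 1.5, size=5) \
           + rng.normal(scale=2.0, size=(200, 5))

X_tr, X_te, y_tr, y_te = train_test_split(profiles, days, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
err = np.abs(knn.predict(X_te) - y_te).mean()
print(f"mean absolute error: {err:.2f} days")
```

The prediction for each test sample is simply the average of the elapsed times of its five most similar training profiles, which is why the method works whenever community composition changes smoothly with the target variable.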

Using microbial communities to predict disease

Both health and illness are significantly influenced by the microbiota [10]. The human body hosts a wide variety of microorganisms, and illness can develop when the body’s microbiome falls out of balance or when foreign microorganisms invade. Modifications to the microbiota of the intestines [45] and the lungs [70] have both been linked to disease. It is difficult to pinpoint which microorganisms in a community are the root of an illness. Recent studies have examined the possibility that microbiome communities can serve as predictors [26] of a variety of disorders, including bacterial vaginosis [18, 74] and inflammatory bowel disease (IBD). An understanding of the microbiome is essential for making informed decisions about how to diagnose and treat disease.

The vaginal microbiome has been linked to the progression of bacterial vaginosis (BV). Beck and Foster (2014) employed genetic programming (GP), RF, and logistic regression (LR) [7] to classify BV based on its microbial populations. Two diagnostic criteria for BV [56, 63, 77] are the Nugent score, which relies on counting Gram-stained cell morphotypes, and the Amsel criteria, which consider discharge, odour, clue cells, and pH [2]. The approach taken in that work begins by dividing BV into subtypes based on the vaginal microbiota and associated environmental factors.

In a different study [86], the authors aimed primarily to predict inflammatory bowel disease, comparing the intestinal mucosa and lumen of people with Crohn’s disease and ulcerative colitis to those of healthy individuals. The Relief technique [38] was used for feature selection, and Metastats [85] for differential feature identification. Finally, KNN and SVM classifiers were used to examine disease and site specificity.

Interaction and association in microbiology
Interaction between microorganisms

Because of the diversity of interactions among their members, microbial communities in biomes exhibit complicated collective behaviour. These interactions include metabolic exchange, signalling and quorum sensing, and the inhibition or eradication of growth [20, 40]. Understanding the activities of natural ecosystems, and engineering artificial consortia, requires an understanding of the interspecific interactions within microbial communities [48]. DiMucci et al. (2018) demonstrated a possible link between the microbial interaction network and the identifying characteristics of individual bacteria, using it to infer missing edges in the network [20].

Chang et al. (2017) employed the random forest technique to forecast productivity from soil microorganisms [12], since microbial interactions in the soil may affect crop yields; variations in the soil microbiome were indeed linked to improvements in agricultural yields. Any given population of microorganisms contains both cooperative and competitive interactions. Microbes can have any of eight kinds of relationship with one another: neutral, commensal, synergistic, mutualistic, competitive, amensal, parasitic, and predatory.

An in-depth knowledge of the dynamics at play between various microbial communities is useful for both the research of microbial species and the application of microorganisms. However, despite its significance, there has been a dearth of ML research in this area.

Microbiome-disease association

The diverse population of microorganisms that call our bodies home plays an essential role in maintaining our health. Inflammatory bowel diseases such as ulcerative colitis [13], as well as colorectal cancer, atherosclerosis, diabetes, and obesity, have been linked to microbial imbalances in the gut. Understanding the relationship between microbes and disease is important because it helps doctors better diagnose and treat human illness [22, 69, 76, 90, 92]. However, very few studies have attempted to predict microbe–disease relationships, so in this section we introduce how ML can be used to explore connections between microorganisms and health issues.

Through the combination of multiple human microbe–disease association datasets with path-based HeteSim scores, Fan et al. (2019) proposed a novel method (MDPH HMDA) for investigating microbe–disease connections [22]. First, a heterogeneous network was constructed; HeteSim scores were then aggregated over the different microbe–disease paths, and microbe–disease pairs were weighted following standard HeteSim measurement practice. The authors finally settled on a strategy for totalling the inferred strength of connections between individual microbe–disease pairs. Katz introduced a network-based measurement technique, termed KATZ [36], to address the link prediction problem; it computes the degree to which nodes in a heterogeneous network are connected to one another. The KATZ technique has been applied in many settings, such as disease–gene association prediction [87] and lncRNA–disease association prediction [14]. To forecast links between the human microbiota and chronic diseases, a novel KATZ-based method named KATZHMDA was presented. As a first step, KATZHMDA uses previously recognised associations between microorganisms and diseases to construct an adjacency matrix A.
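The KATZ connectivity measure underlying KATZHMDA can be sketched on a toy adjacency matrix. The score sums the contributions of all walks between two nodes, damped by a factor beta per step, and has a closed form when beta is small enough: S = (I − βA)⁻¹ − I. This is a generic sketch of the measure, not the exact KATZHMDA implementation; the toy graph (3 microbes, 2 diseases) is invented for illustration.

```python
# KATZ connectivity on a toy microbe-disease graph:
# closed form (I - beta*A)^-1 - I vs. the truncated power series.
import numpy as np

# Nodes 0-2 are microbes, nodes 3-4 are diseases; 1 = known association.
A = np.array([[0, 0, 0, 1, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0]], dtype=float)
beta = 0.1                                   # damping; must be < 1/lambda_max

S = np.linalg.inv(np.eye(5) - beta * A) - np.eye(5)   # closed-form KATZ

# Sanity check against the truncated series sum_k beta^k A^k.
S_series = sum(beta**k * np.linalg.matrix_power(A, k) for k in range(1, 30))
print(np.allclose(S, S_series, atol=1e-8))
```

Note that microbe 0 and disease 4 share no direct edge, yet S[0, 4] is positive because of the length-3 walk 0→3→1→4; scoring such indirect walks is exactly how KATZ-style methods predict unobserved microbe–disease associations.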

To address this problem, the BMCMDA method was proposed [68]; it is based on binary matrix completion. BMCMDA treats the microbiome–disease association (MDA) matrix as an incomplete observation, modelling it as an underlying parameter matrix plus a noise matrix, and assumes that the independent entries of the MDA matrix follow a binomial model. In their comparisons, Shi et al. (2018) employed the same dataset as HMDAD [68], which included 292 microorganisms and 39 human illnesses. BMCMDA outperforms KATZHMDA in terms of AUC, and the method can also be applied to other forms of prediction. Table I presents a summary of the currently available datasets and methods.

Conclusion

Microorganisms affect their surrounding environment and the organisms that inhabit it, and they are involved in a wide range of biological activities. Human well-being, agricultural output, animal husbandry, environmental regulation, chemical manufacture, and food production all rely heavily on microorganisms. Since the microscope came into scientific use in the 17th century, scientists have been able to observe and study microscopic species not accessible to the naked eye. The advent of high-throughput sequencing methods, however, has produced a deluge of microbiological information, and machine learning methods are consequently now being applied to the study of microbes. Here, we have discussed the current applications of ML to the microbiome. We found that ML is commonly used for classification and interaction problems in microbiology. Although significant progress has been made, many obstacles remain that must be overcome by interdisciplinary teams of researchers (spanning biology, informatics, and medicine). Recent advances in link prediction and computational intelligence show promise in clarifying the link between diseases and microorganisms.

eISSN:
2545-3149
Languages:
English, Polish
Frequency:
4 issues per year
Journal subjects:
Life Sciences, Microbiology and Virology