Open access

Comparison of deep learning and conventional machine learning methods for classification of colon polyp types

Introduction

Colon cancer (CC) causes the deaths of about half a million people every year (1, 2, 3). Colon polyps are clusters of epithelial cells that form an overgrowth of tissue along the colon mucosa (4). Depending on their molecular growth pattern, colon polyps are classified histologically as hyperplastic, serrated, or adenomatous (5, 6). Adenomatous and serrated polyps need resection, whereas hyperplastic polyps generally do not require this surgical procedure. Identifying and classifying polyp types is therefore very important from a clinical perspective. Serrated polyps show adenomatous features in their cancer progression, yet morphologically and pathologically they resemble hyperplastic polyps. Because of these hybrid characteristics, serrated polyps are more difficult to recognize than the other types (7).

Colonoscopy is the most common procedure for detecting colon polyps. During colonoscopy, gastroenterologists visually examine the colon wall using a flexible probe with a camera and light source at its tip, and the system can record videos and photographs. Besides locating polyps, the gastroenterologist can also remove them with an instrument at the tip of the probe, a procedure called polypectomy or polyp resection. Accurate detection of a polyp and classification of its type depend on the experience of the gastroenterologist. Watching the monitor for long hours may cause mental or physical fatigue and, in turn, misdiagnosis or missed polyps. Currently, to prevent future cancer, all polyps are resected during colonoscopy and examined with histopathology techniques to determine whether they are benign or malignant. This practice significantly increases the workload of hospital pathology departments. Discriminating polyp types in real time during colonoscopy is therefore critical for deciding which polyps need to be resected. This is a challenging task for physicians because of uninformative frames, varying illumination conditions, variable texture, and specular reflections from the light source on the probe. An accurate and effective computer-aided diagnosis system is therefore needed to help identify and classify polyp types in real time during colonoscopy.

Early studies extracted or combined features such as texture and color from endoscopic images and applied pattern classification and analysis methods. One of these analysis methods was region growing: Krishnan et al. used it to detect abnormalities in endoscopic images (8) and to extract the colon lumen (9), and Iakovidis et al. employed it to detect adenomatous polyps (10). Subsequent studies focused on neural networks. Magoulas et al. combined local binary pattern (LBP) texture features with a neural network to classify polyp characteristics in endoscopic images (11). Another group developed an automatic polyp detection and classification system with a sensitivity of 0.88 using a hybrid context-shape approach, in which the structure and shape features of the lesion were related to the polyp location (12). These approaches rely on conventional machine learning with manual feature extraction followed by classification. In recent years, several research groups have tried to develop optical biopsy systems, based on new approaches such as artificial intelligence, that would enable early detection of cancer and polyps before they become a serious risk (13). Deep learning and convolutional neural networks (CNN) have been used to detect and classify colon polyps. Urban et al. (14) designed a CNN system to detect and classify colonic polyps using 8,641 manually labeled images and colonoscopy videos. Another study trained neural networks on images from 159 patients to discriminate between hyperplastic and adenomatous polyps (15). Byrne et al. achieved high accuracy using only narrow band imaging (NBI) video frames with a CNN model that classified polyps as adenomatous or hyperplastic (16).

Different imaging modalities are used to increase the adenomatous polyp detection rate and to help determine polyp type. Narrow band imaging (NBI) is one of these modalities (17, 18). Some colonoscopy systems also offer magnification (19). Using NBI and magnification, some research groups performed polyp classification to avoid unnecessary tissue biopsies (20, 21, 22).

Optical biopsy is expected to be cost-effective: it could save an estimated 33 million USD a year in health resources in the United States alone (23). To develop such systems, it is critical to understand and quantify the differences between colonoscopy images. In this study, we investigated the feasibility of automatically classifying colon polyps from colonoscopy videos. We extracted frames from colonoscopy videos of patients with polyps, computed features using image processing approaches, and analyzed them together with the histopathological evaluation results. The first aim was to classify colon polyps into two categories: resection (adenomatous and serrated) or no-resection (hyperplastic). The second phase classified colon polyps into three categories: adenomatous, serrated, or hyperplastic. In this way, we aimed to develop a real-time analysis and visualization approach that can help gastroenterologists decide whether to perform a biopsy or polypectomy on a specific polyp during a routine colonoscopy examination.

Materials and Methods
Colonoscopy Images

We used a public dataset (24) of 76 colonoscopy videos of 40 adenomatous, 21 hyperplastic, and 15 serrated polyps. The videos were recorded with different imaging modalities, such as white-light imaging (WLI) and narrow-band imaging (NBI). Fig. 1 shows sample images from this dataset. The ground truth for each video is the result of the corresponding histopathological analysis.

Figure 1

Sample images obtained with the WLI and NBI imaging modalities during colonoscopy for different polyp types: a) hyperplastic (WLI), b) hyperplastic (NBI), c) serrated (WLI), d) serrated (NBI), e) adenomatous (WLI), and f) adenomatous (NBI).

In this study, we also investigated the effect of imaging modality on polyp-type classification. For this purpose, we used these polyps in two tasks. The first was binary classification into resection (serrated and adenomatous) or no-resection (hyperplastic). The second was three-category classification into the adenomatous, hyperplastic, or serrated subtype. For model training, we applied polyp-based stratified sampling: we randomly allocated 80% of the polyps to the training set and 20% to the test set. The test set had no overlap with the training set and was representative of the whole dataset. For the first task, the test set contained 11 resection and 4 no-resection polyps; for the three-category task, it contained 8 adenomatous, 4 hyperplastic, and 3 serrated polyps. Table 1 shows the number of polyps in the test and training sets for the two- and three-category classifications. We extracted 25 frames per second from the colonoscopy videos, yielding 36,285 WLI frames and 39,393 NBI frames, a total of 75,678 frames. Table 2 gives the number of frames for each class. The dataset contains 40 adenomatous polyps, from which we extracted 47,369 frames (24,048 NBI + 23,321 WLI), 21 hyperplastic polyps (15,522 frames: 8,153 NBI + 7,369 WLI), and 15 serrated polyps (12,787 frames: 7,192 NBI + 5,595 WLI). Table 3 gives these frame counts separately for the test and training sets.
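Polyp-based stratified sampling means the split is drawn over polyps, class by class, and every frame then follows its polyp. The paper does not say how the random draw was made, so the following is a minimal sketch that assumes 20% of each class is taken for testing with ordinary rounding; the polyp IDs, the seed, and the function name `polyp_stratified_split` are hypothetical. The specific polyps drawn will not match the paper's split, but the per-class counts reproduce Table 1.

```python
import random

def polyp_stratified_split(polyps_by_class, test_fraction=0.2, seed=0):
    """Split polyp IDs (not frames) into training and test sets, class by class.

    Every frame later inherits the split of its polyp, so no polyp contributes
    frames to both sets.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, ids in polyps_by_class.items():
        ids = list(ids)
        rng.shuffle(ids)
        n_test = round(test_fraction * len(ids))   # 20% of each class, rounded
        test[cls] = ids[:n_test]
        train[cls] = ids[n_test:]
    return train, test

# Hypothetical polyp IDs with the class sizes of the dataset.
counts = {"adenoma": 40, "hyperplastic": 21, "serrated": 15}
polyps = {cls: [f"{cls}_{i:02d}" for i in range(n)] for cls, n in counts.items()}
train_ids, test_ids = polyp_stratified_split(polyps)
print({cls: len(ids) for cls, ids in test_ids.items()})
# {'adenoma': 8, 'hyperplastic': 4, 'serrated': 3}
```

For the binary task the same split applies: the 8 adenomatous and 3 serrated test polyps form the 11 resection polyps, and the 4 hyperplastic ones form the no-resection class.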

Table 1. Number of polyps in the training and test sets according to polyp-based stratification (80% of the polyps in the training set and 20% in the test set).

Set | Adenoma (Resection) | Serrated (Resection) | Hyperplastic (No-resection)
Training | 32 | 12 | 17
Test | 8 | 3 | 4

Table 2. Number of extracted frames for each class.

Imaging modality | Adenoma (Resection) | Serrated (Resection) | Hyperplastic (No-resection)
NBI | 3544 | 1228 | 1162
WLI | 1990 | 779 | 574

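Table 3. Number of frames for each class according to the polyp-based stratification (N: No-resection, R: Resection, A: Adenoma, H: Hyperplastic, S: Serrated).

Task | Modality | Class | Test | Train
2-class | NBI | N | 1162 | 6991
2-class | NBI | R | 4772 | 22175
2-class | WLI | N | 574 | 6795
2-class | WLI | R | 2769 | 26147
3-class | NBI | A | 3544 | 20504
3-class | NBI | H | 1162 | 6991
3-class | NBI | S | 1220 | 5964
3-class | WLI | A | 1990 | 21311
3-class | WLI | H | 574 | 6795
3-class | WLI | S | 779 | 4816
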
Machine Learning Preprocessing

Each frame in the original dataset was 768x576 pixels. We resized each frame to 200x200 pixels in Python using the OpenCV library. To choose the image size, we examined the effect of resizing on classification accuracy, resizing the input images either isometrically (preserving the aspect ratio) or anisometrically. Classification performance was not affected by whether the resizing was isometric, so we chose an input size of 200x200 pixels. The original images had three color channels (RGB, Fig. 2a), and we converted them to grayscale for further analysis (Fig. 2b). After this preprocessing step, we extracted features from the grayscale images using the histogram of oriented gradients (HOG) approach.

Figure 2

(a) Original colonoscopy image and (b) grayscale image.

Feature Extraction and Classification for Conventional Machine Learning

For feature extraction in the conventional machine learning (ML) part of our study, we used the histogram of oriented gradients (HOG) descriptor, which is widely used for object detection and pattern classification. HOG features quantify and represent both the shape and the texture of an image (25, 26). By computing the orientations of image gradients and their histograms, HOG characterizes the appearance or shape of objects as a directional distribution of edges. Each histogram is computed within a small image region called a cell, a square region defined by its number of pixels. The ordered set of cell histograms constitutes the HOG feature vector of the image. The most important HOG parameters are the number of orientations, the pixels per cell, and the cells per block; together they determine the dimensionality of the resulting feature vector. In colonoscopy studies, HOG features have been used to classify detected regions as polyp or background (no-polyp) (27). To differentiate polyp images from normal images in polyp detection tasks, Younghak et al. combined HOG features with a hue color-space histogram as handcrafted features (28).

In our experiments, we used 10x10 pixels per cell and 2x2 cells per block, with blocks overlapping by half a block when computing the HOG features. The orientation bins evenly divide the unsigned gradient angle range of 0 to 180 degrees. With 200x200-pixel images and 10x10-pixel cells, each image contained 20 x 20 = 400 cells, shown as blue squares in Fig. 3a; with 2x2-cell blocks shifted by one cell, this gives 19 x 19 = 361 blocks. The histograms of oriented gradients are formed as shown in Fig. 3b. To reduce adverse effects such as changes in illumination and contrast, we normalized the gradient histograms locally within each block, which improved performance significantly. Finally, after all blocks were normalized, we concatenated the resulting histograms into the final feature vector. The features were returned as a 1-by-N vector, where N is the HOG feature length. These features encode the local shape information of the regions within an image.
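To make the cell, block, and feature-length arithmetic concrete, here is a simplified HOG written directly in NumPy. It is a teaching sketch, not the implementation used in the study: the paper does not report the number of orientation bins, so 9 (the Dalal and Triggs default) is assumed, and the sketch uses hard binning without the vote interpolation and smoothing of library implementations, so values will differ from, for example, scikit-image's `skimage.feature.hog`. With 361 blocks of 2 x 2 cells x 9 bins, the feature length is N = 361 x 36 = 12,996.

```python
import numpy as np

def simple_hog(gray, ppc=10, cpb=2, orientations=9, eps=1e-5):
    """Simplified HOG: unsigned gradients, hard binning, L1-sqrt block normalization."""
    img = gray.astype(np.float64)
    gy, gx = np.gradient(img)                                # derivatives along rows, then columns
    mag = np.hypot(gx, gy)                                   # gradient magnitude
    ang = np.mod(np.rad2deg(np.arctan2(gy, gx)), 180.0)      # unsigned orientation in [0, 180)
    bin_width = 180.0 / orientations
    bins = np.minimum((ang // bin_width).astype(int), orientations - 1)

    # One magnitude-weighted orientation histogram per ppc x ppc cell.
    n_cy, n_cx = img.shape[0] // ppc, img.shape[1] // ppc
    hist = np.zeros((n_cy, n_cx, orientations))
    for cy in range(n_cy):
        for cx in range(n_cx):
            rows = slice(cy * ppc, (cy + 1) * ppc)
            cols = slice(cx * ppc, (cx + 1) * ppc)
            hist[cy, cx] = np.bincount(bins[rows, cols].ravel(),
                                       weights=mag[rows, cols].ravel(),
                                       minlength=orientations)

    # Blocks of cpb x cpb cells slide one cell at a time (half-block overlap for cpb = 2).
    blocks = []
    for by in range(n_cy - cpb + 1):
        for bx in range(n_cx - cpb + 1):
            v = hist[by:by + cpb, bx:bx + cpb].ravel()
            blocks.append(np.sqrt(v / (v.sum() + eps)))      # L1-sqrt (histograms are non-negative)
    return np.concatenate(blocks)

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(200, 200), dtype=np.uint8)
features = simple_hog(gray)
print(features.shape)  # (12996,)
```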

Figure 3

(a) Blue squares mark the cells of 10x10 pixels, a total of 400 cells. (b) HOG features overlaid on each cell.

Dalal and Triggs described four block normalization methods, using the L2-norm, L2-Hys (L2-norm followed by clipping, where Hys stands for hysteresis), the L1-norm, and L1-sqrt (L1-norm followed by a square root) as the normalization factor. The L1-norm performs less reliably than the others, but all methods improve significantly over non-normalized data (29). We selected the L1-sqrt method.

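Equation 1 gives the L1-sqrt normalization of an unnormalized block vector v, where the norm is the L1 norm and ε is a small constant that prevents division by zero. This normalization amounts to treating the descriptor vectors as probability distributions and comparing them with the Bhattacharyya distance.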

$$ \begin{equation} \mathrm{v} \rightarrow \sqrt{\mathrm{v} / \left(\|\mathrm{v}\|_{1}+\varepsilon\right)} \end{equation}$$
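Equation 1 and its probabilistic reading can be checked numerically. In this sketch the helper `l1_sqrt` and the two block vectors are hypothetical, and ε is set to 0 for the demonstration (the paper does not report the value of ε; 1e-5 is an assumed default). Squaring the normalized vector recovers an L1-normalized histogram, that is, a probability distribution, and the dot product of two normalized vectors equals the Bhattacharyya coefficient of those distributions, which is what links L1-sqrt to the Bhattacharyya distance.

```python
import numpy as np

def l1_sqrt(v, eps=1e-5):
    """L1-sqrt normalization (Equation 1): v -> sqrt(v / (||v||_1 + eps))."""
    v = np.asarray(v, dtype=np.float64)
    return np.sqrt(v / (np.abs(v).sum() + eps))

# Two hypothetical block vectors (concatenated, non-negative cell histograms).
a = l1_sqrt([4.0, 0.0, 1.0, 3.0], eps=0.0)
b = l1_sqrt([2.0, 1.0, 1.0, 4.0], eps=0.0)

# With eps = 0 the squared entries form a probability distribution (they sum to 1) ...
print((a ** 2).sum())
# ... and the dot product of two normalized vectors is their Bhattacharyya coefficient.
bc = float(a @ b)
print(round(bc, 4))  # 0.9116
```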

To distinguish between polyp types, we used a Random Forest (RF) classifier. Random forests are a supervised learning algorithm: an ensemble of decision trees, each trained on a random bootstrap sample of the training data, whose predictions are combined by voting for classification. RF can be used for both classification and regression and is among the most effective classification methods. It is also flexible and easy to use, fast, and resistant to over-fitting, and the user can choose the number of trees in the forest (30).

For the classification phase, we used 10-fold cross-validation (CV), a technique for assessing predictive models in which the data are divided into k = 10 folds; the model is trained on k - 1 folds and validated on the remaining fold, k times, so that each fold is used once for validation (31). We obtained the classification performance results from this procedure.
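The random forest with 10-fold cross-validation can be sketched with scikit-learn as follows. The paper does not say which library it used for this step or report RF hyperparameters, so `n_estimators=100`, the stratified shuffled folds, and the seeds are assumptions; the synthetic matrix `X` stands in for the HOG feature vectors (one row per frame) and `y` for the resection (1) / no-resection (0) labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for HOG features: only the first column carries class information.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Hyperparameters are assumptions; the paper does not report them.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Consecutive frames of one polyp are highly correlated, so when folds are formed over frames it is worth considering `GroupKFold` or `StratifiedGroupKFold` (scikit-learn 1.0 and later) with polyp IDs passed as `groups`, which keeps all frames of a polyp in the same fold.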

Polyp Classification with Simple CNN Architecture

In the deep learning part of our study, we built a simple convolutional neural network (CNN) architecture and trained it on colonoscopy images containing colon polyps. CNNs are widely used for deep-learning-based classification, particularly in image recognition problems. For training, we applied polyp-based stratified sampling with the same test and training data as in the conventional ML part of the study.

The input layer of the network takes images of 28x28x3 pixels. We used convolutional and max-pooling layers to extract features from the images, with a filter size of 3x3 in the convolutional layers. The number of filters determines the number of feature maps; it is the number of neurons that connect to the same region of the input. We used the default stride of 1 and "same" padding. We inserted batch normalization layers between the convolutional layers and the nonlinearities to accelerate network training and reduce sensitivity to network initialization, so each nonlinear activation function follows a batch normalization layer. As the activation function, we chose the rectified linear unit (ReLU), the most common choice. We trained the network using stochastic gradient descent with momentum (SGDM) with an initial learning rate of 0.01 for 20 epochs. We did not apply any image augmentation during training.
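The SGDM update rule can be illustrated in a few lines. This is a sketch under stated assumptions: the paper reports only the initial learning rate (0.01), so the momentum of 0.9 is assumed, the helper `sgdm_step` is hypothetical, and a deterministic toy quadratic replaces the network loss, whereas real SGDM training uses mini-batch gradients.

```python
import numpy as np

def sgdm_step(theta, velocity, grad, lr=0.01, momentum=0.9):
    """One update of gradient descent with momentum: v <- m*v - lr*g, theta <- theta + v."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

# Toy objective f(theta) = 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
velocity = np.zeros(2)
for _ in range(500):
    grad = theta - target          # in real training this would be a mini-batch gradient
    theta, velocity = sgdm_step(theta, velocity, grad)
print(theta)                       # approximately [1, -2]
```

The velocity term accumulates past gradients, which damps oscillations and speeds progress along consistent descent directions compared with plain gradient descent at the same learning rate.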

Our final model was a basic CNN with 15 layers. Training took approximately 12 hours on a CPU (PC with a 3.20 GHz Intel Core i5-4570 processor and 64 Gb), and the model was implemented in MATLAB 2020a. After training, the model was not modified further before the results were obtained.
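The text lists the ingredients of the network (28x28x3 input, 3x3 convolutions with stride 1 and "same" padding, batch normalization before ReLU, max pooling, 15 layers in total) but not their exact order or the filter counts. The following pure-Python bookkeeping shows one hypothetical arrangement that adds up to 15 layers and tracks the feature-map shape through it; the filter counts (8, 16, 32), the 2x2 pooling with stride 2, and the function names are assumptions, and this is not the authors' MATLAB network or training code.

```python
def conv_same(size, stride=1):
    """Spatial size after a convolution with 'same' padding: ceil(size / stride)."""
    return -(-size // stride)

def max_pool(size, pool=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - pool) // stride + 1

def build_plan(n_classes=3, input_shape=(28, 28, 3), filters=(8, 16, 32)):
    """Hypothetical 15-layer plan: input, 3 x (conv-BN-ReLU), 2 pools, FC, softmax, output."""
    h, w, c = input_shape
    plan = [("input", (h, w, c))]
    for i, n_filters in enumerate(filters):
        h, w, c = conv_same(h), conv_same(w), n_filters
        plan += [("conv3x3", (h, w, c)), ("batchnorm", (h, w, c)), ("relu", (h, w, c))]
        if i < len(filters) - 1:              # no pooling after the last conv block
            h, w = max_pool(h), max_pool(w)
            plan.append(("maxpool2x2", (h, w, c)))
    plan += [("fullyconnected", (n_classes,)), ("softmax", (n_classes,)),
             ("classification", (n_classes,))]
    return plan

plan = build_plan(n_classes=3)
for name, shape in plan:
    print(f"{name:15s} {shape}")
print(len(plan), "layers")   # 15 layers
```

Setting `n_classes=2` gives the corresponding plan for the resection versus no-resection task.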

Machine learning and deep learning are both subfields of artificial intelligence, deep learning being a subset of machine learning. An ML model learns from training data and then makes predictions for new inputs.

Fig. 4 summarizes the aim of this study: to compare a handcrafted-feature-based random forest classification method with a deep-learning-based CNN method for classifying polyp image frames. The conventional ML method includes a handcrafted feature extraction step, the HOG descriptor. Convolutional neural network (CNN) based deep learning, on the other hand, is a state-of-the-art technique in many image recognition and detection applications. In the CNN pipeline, the raw images are fed directly to the network instead of passing through a separate feature extraction step.

Figure 4

Pipelines of the two compared methods.

Performance Metrics

The performance of classification models can be evaluated in several ways. We used accuracy, precision, recall, and f-measure to evaluate colon polyp classification performance. These metrics are defined as follows (TP: true positive, TN: true negative, FP: false positive, FN: false negative):

$$ \begin{equation} \mathrm{Accuracy}=(\mathrm{TP}+\mathrm{TN}) /(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}) \end{equation}$$

$$ \begin{equation} \mathrm{Precision}=\mathrm{TP} /(\mathrm{TP}+\mathrm{FP}) \end{equation}$$

$$ \begin{equation} \mathrm{Recall}=\mathrm{TP} /(\mathrm{TP}+\mathrm{FN}) \end{equation}$$

$$ \begin{equation} \text{f-measure}=2 /(1 / \text{precision}+1 / \text{recall}) \end{equation}$$

For polyp classification, images of adenomatous and serrated polyps, which belong to the resection class, were defined as positives, while images of hyperplastic polyps, which belong to the no-resection class, were defined as negatives.

We also used the f-measure to analyze classification performance further, because the test dataset is unbalanced.

Precision and recall summarize, in two different ways, the errors made on the positive class in a classification problem. The f-measure combines them into a single score, their harmonic mean (32).
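The four definitions above translate directly into code. In this sketch the function name `binary_metrics` and the confusion-matrix counts are hypothetical and are not taken from the paper's experiments; positives are resection frames and negatives are no-resection frames, as defined above. The f-measure line uses the same form as its equation, 2 / (1/precision + 1/recall).

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and f-measure from confusion-matrix counts.

    Positives = resection (adenomatous or serrated); negatives = no-resection (hyperplastic).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 / (1 / precision + 1 / recall)   # harmonic mean of precision and recall
    return accuracy, precision, recall, f_measure

# Hypothetical counts for illustration only.
acc, prec, rec, f1 = binary_metrics(tp=80, tn=30, fp=20, fn=10)
print(f"{acc:.3f} {prec:.3f} {rec:.3f} {f1:.3f}")  # 0.786 0.800 0.889 0.842
```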

Results

In both the binary (resection vs. no-resection) and the three-category (adenomatous, hyperplastic, serrated) classification problems, the best-performing method surpassed the accuracy of all seven expert and novice doctors in both the NBI and WLI modalities: conventional ML in the binary task and the CNN in the three-category task (see Table 4 and Table 5).

Table 4. Accuracy of classification results.

Imaging modality | Tissue types | Machine learning | Deep learning
NBI | A-H | 0.874 | 0.752
NBI | A-H-S | 0.632 | 0.694
WLI | A-H | 0.944 | 0.745
WLI | A-H-S | 0.587 | 0.759

Table 5. Accuracy of the doctors' predictions.

Doctor | A-H | A-H-S
Expert 1 | 0.82 | 0.64
Expert 2 | 0.83 | 0.69
Expert 3 | 0.78 | 0.65
Expert 4 | 0.77 | 0.58
Novice 1 | 0.78 | 0.60
Novice 2 | 0.86 | 0.68
Novice 3 | 0.75 | 0.51

For the second task, the classification of three colon polyp types, the average accuracies of the expert and novice doctors were 64% and 59%, respectively. Tables 4 and 5 indicate that, in this three-category task, the simple CNN architecture outperformed both the conventional ML approach and the doctors. The detailed metrics are given in Tables 6 and 7.

Table 6. Two- and three-category classification results for different imaging modalities using deep learning (3: three-category, 2: two-category).

Metric | 3NBI | 3WLI | 2NBI | 2WLI
Accuracy | 0.694 | 0.759 | 0.752 | 0.745
Recall | 0.841 | 0.868 | 0.849 | 0.826
Precision | 0.518 | 0.690 | 0.849 | 0.860
f-measure | 0.517 | 0.726 | 0.849 | 0.843

Table 7. Two- and three-category classification results for different imaging modalities using conventional machine learning (3: three-category, 2: two-category).

Metric | 3NBI | 3WLI | 2NBI | 2WLI
Accuracy | 0.632 | 0.587 | 0.874 | 0.944
Recall | 0.504 | 0.556 | 0.910 | 0.960
Precision | 0.463 | 0.572 | 0.662 | 0.807
f-measure | 0.483 | 0.564 | 0.767 | 0.877

To compare the computational times of the two approaches after training, we classified single frames and measured an average of 6.263 ms per frame with the simple CNN architecture (MATLAB) and 15.698 ms per frame with the conventional ML approach (Python), on a computer with an Intel Core i5-4570 CPU @ 3.20 GHz. Both times are well below the 40 ms interval between consecutive frames at 25 frames per second, which shows that both approaches can be used for real-time polyp classification.
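A per-frame latency measurement of this kind can be sketched with the standard library. The harness below is an assumption about how such timings might be collected, not the authors' procedure: `dummy_predict` is a hypothetical stand-in for the trained CNN or the HOG plus RF pipeline, a few warm-up calls are excluded from timing, and the real-time budget is taken as 1000 / 25 = 40 ms per frame.

```python
import time

FRAME_BUDGET_MS = 1000.0 / 25   # 40 ms between frames at 25 frames per second

def mean_latency_ms(predict, frames, warmup=5):
    """Average wall-clock time per frame, in milliseconds, for a single-frame predict function."""
    for frame in frames[:warmup]:          # warm-up calls are not timed
        predict(frame)
    start = time.perf_counter()
    for frame in frames:
        predict(frame)
    return (time.perf_counter() - start) * 1000.0 / len(frames)

def dummy_predict(frame):
    # Hypothetical stand-in for a trained classifier.
    return sum(frame) % 2

frames = [list(range(1000)) for _ in range(50)]
latency = mean_latency_ms(dummy_predict, frames)
print(f"{latency:.3f} ms per frame; real-time at 25 fps: {latency < FRAME_BUDGET_MS}")
```

The reported averages (6.263 ms and 15.698 ms) both fall within this 40 ms budget.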

Tables 6 and 7 allow us to compare the performance metrics of deep and conventional machine learning for two- and three-category classification. Because the classes were unbalanced in each category, the f-measure, which summarizes recall and precision, represents performance better than accuracy alone; it is commonly used to interpret results when classes are unbalanced. According to the f-measure, the deep learning algorithm achieved better results than the conventional machine learning approach in three of the four settings; the exception is two-category classification with WLI, where conventional ML achieved a higher f-measure (0.877 vs. 0.843). We also compared the results of the two methods (DL vs. conventional ML) statistically. We determined the correct classification rates in each group, as in Table 3, and assessed the significance of the differences with a hypothesis test for two independent sample proportions (33), using a significance level of α = 0.05 for all experiments in Tables 6 and 7. According to the test results, the correct classification rate of the conventional machine learning approach was not significantly different from that of the deep learning approach for subtype classification in either the NBI or the WLI modality. For binary classification, in contrast, the difference between the conventional ML and deep learning approaches was significant. We also analyzed the role of the two imaging modalities: WLI yielded significantly different results from NBI in both binary and three-category classification.
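The paper cites (33) for the two-sample proportion test but does not give the formula. The sketch below assumes the standard pooled two-proportion z-test with a two-sided p-value; the counts in the usage line are hypothetical. Here x is the number of correctly classified samples and n the number tested for each method, so x = accuracy x n.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                        # proportion under H0: p1 == p2
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))            # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 60/100 vs. 40/100 correct classifications.
z, p = two_proportion_z_test(60, 100, 40, 100)
print(f"z = {z:.3f}, p = {p:.4f}, significant at alpha=0.05: {p < 0.05}")
# z = 2.828, p = 0.0047, significant at alpha=0.05: True
```

Because the standard error shrinks with n, the same accuracy difference can be significant when counted over thousands of frames but not when counted over a handful of polyps, so the unit of n matters when interpreting such tests.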

Discussion

In this work, we examined conventional machine learning and simple deep learning approaches to improve the accuracy of colonic polyp type classification, since large annotated databases for this type of research are often limited. We designed, implemented, and tested an optical biopsy method for colon polyps, focusing on classifying colonoscopy images into two categories (resection and no-resection) or three categories (adenomatous, hyperplastic, and serrated polyps). These categories were chosen in line with routine clinical procedure and examination.

We compared two different methods (simple CNN vs. conventional ML) for classifying these colon polyps. Both rely on image processing to characterize patterns and extract features from the images, and they can be compared by their working principles and performance. In summary, HOG divides an image into cells and blocks and constructs histograms of the gradient orientations within them. CNNs, on the other hand, contain different layers, such as input, convolution, subsampling, and output layers, that extract features and classify them. HOG feature extraction is hand-designed and very different from processing in the human brain, whereas the CNN approach is loosely inspired by the brain. HOG has a straightforward design with fewer parameters than a CNN, and a CNN needs more memory and computational power than HOG. HOG is also comparatively fast to compute. HOG is more suitable for identification tasks, whereas CNNs have good generalization abilities and are better suited to classification and categorization tasks. HOG features involve no hierarchical representation learning and are therefore called low-level features, while a CNN is a hierarchical deep learning model that can learn distinguishable representations of the data (34, 35, 36).

The publicly available database used in this study has been used by several research groups for computer vision and artificial intelligence studies (37, 38). Comparing our results with these approaches is difficult for several reasons. First, the research questions differ: we focused on classification against the histopathological ground truth, whereas several groups focused on detecting polyps in the images (39, 40). Many studies on computer-aided detection have aimed to decrease the miss rate of colon polyps. Haj-Manouchehri et al. designed a study on detecting frames that contain polyps and on polyp segmentation (41). Zhang et al. used transfer learning first to distinguish polyp images from non-polyp images and then to predict polyp histology (42), so their study included both an identification and a classification part.

In addition, most research groups that addressed the classification problem used only two categories, adenomatous and hyperplastic or neoplastic and non-neoplastic, rather than the three-category (subtype) classification into adenomatous, serrated, and hyperplastic polyps that we performed (43, 44). Because serrated polyps are difficult to differentiate from the other polyp types, three-category classification had lower accuracy than two-category classification. The doctors' accuracies in Table 5 suggest that clinicians face a similar problem in predicting serrated polyps. Computer-aided diagnosis systems designed to help clinicians should therefore focus especially on the diagnosis of serrated polyps, and this focus is the most important contribution of our study to the literature.

Previous studies on computer-aided diagnosis for colorectal polyp classification included different types of polyps. Mesejo et al. conducted a study similar to ours on the same database; whereas we compared conventional ML and DL methods on the classification task, they applied only conventional ML methods and obtained average accuracies of 90.67% in binary classification and 76.68% in subtype classification. Our conventional ML approach exceeded their binary classification accuracy with WLI (94.4%), although not with NBI (87.4%). In subtype classification, our CNN approach yielded better accuracy than the doctors (24). Tamaki et al. studied classification of endoscopy images acquired with the NBI modality, but they divided the subtypes differently from our categories (18).

This study has some limitations related to both the database and the methods. In the deep learning part, we built a simple CNN structure that used specific parameters defined in the literature (45, 46). We used these parameters without any further selection or optimization, since those studies had analyzed the model thoroughly, selected the parameters for the dataset, and shown their robustness. The database has no control group of normal images without polyps. Repeating the methods of this study on a dataset that contains images of different polyp types and of healthy tissue could yield clinically more meaningful results. Increasing the number of training polyps and modifying the architecture may also improve the performance of both the CNN and HOG approaches.

We conclude that deep learning with convolutional neural networks is a good option for classifying colonic polyps. In future work, we plan to apply this strategy to detect and classify colonic polyps directly in colonoscopy videos and to evaluate its performance in real time. We will also apply this strategy to pretrained networks such as ResNet and GoogLeNet.

eISSN: 2564-615X
Language: English
Frequency: 4 issues per year
Journal subjects: Life Sciences, Genetics, Biotechnology, Bioinformatics, other