Kidney cancer is not as prevalent as prostate cancer in men, but it is still the ninth most common cancer in men and the 14th most common cancer in women (N. Cancer, n.d.). In Australia, kidney cancer was the seventh most common cancer in 2016 and remained the seventh most common in 2020. The number of people affected by kidney cancer increased from 793 in 1982 to 3,627 in 2016. Of the 3,627 new cases diagnosed in Australia in 2016, 2,408 were in males and 1,219 were in females. The surgical procedure for treating kidney cancer is called nephrectomy. In partial nephrectomy, only the diseased portion of the kidney is removed. The procedure is performed either as open surgery or robotically (minimally invasive surgery). In open nephrectomy, depending on the case, it may be necessary to remove a rib. The procedure is performed under general anesthesia, and the open nephrectomy method is less desirable for patients. The method preferred by both surgeons and patients is partial robotic nephrectomy (National Kidney Foundation, n.d.). Artificial intelligence (AI) has significant potential in many engineering applications, including manufacturing (Razfar et al., 2010), hydrology (Asadnia et al., 2010, 2014, 2017; Khorasani et al., 2018), sensors (Asadnia et al., 2013; Hagihghi et al., 2020; Kottapalli et al., 2015; Razmjou et al., 2017), and additive manufacturing (Bazaz et al., 2018; Mahmud et al., 2020; Moshizi et al., 2020). It can also help surgeons identify tumors, not only in partial robotic nephrectomy but also in other cancer cases such as bowel cancer, prostate cancer, and canine mammary carcinoma.
The da Vinci Xi enables robotic surgery through small incisions, which can significantly help surgeons with cancerous tumor removal. The da Vinci surgical robot was first commercialized by Intuitive in 2000, and in 2014 Intuitive released the da Vinci Xi (David and Samadi, n.d.). The da Vinci Xi robot has three parts: the patient cart, the surgeon console, and the vision cart. The patient cart has four arms that are used during surgery; the surgeon controls the robotic arms through the console; and the vision cart serves as the CPU of the system and as a second screen (American Institute of Minimally Invasive Surgery, 2019). Currently, surgeons rely on their experience to identify tumors. Once the tumor’s location has been approximated, the da Vinci Xi provides intraoperative ultrasound and indocyanine green (ICG) fluorescence imaging to further assist the surgeon. Intraoperative ultrasound shows the depth of the tumor and produces a 3D reconstruction of the organ on a tablet beside the surgeon’s console. Injecting ICG and turning on the fluorescent light makes the kidney appear green and the tumor grey. However, if the tumor location cannot be identified, intraoperative ultrasound will not work. An incorrect dose of ICG will either turn the whole field of view green or produce no color change. ICG also has side effects, which makes it necessary to keep the injected amount to a minimum (Inc and Grove, 2021). There has been remarkable progress in medical applications of image processing owing to the availability of open-source, large-scale annotated datasets. The applications include both pre- and post-operative diagnosis. In 2015, the support vector machine (SVM) was the most reliable classifier. Papers published before Chung et al. (2015) considered only one slide from each MRI scan; Chung et al. were the first to consider the spatial information contained in the 3D voxels of MRI scans.
After 2015, once the work of Zeiler and Fergus (2014) had provided some insight into how deep neural networks work, convolutional neural networks became popular for image classification. Shin et al. (2016) used publicly available CT images for thoraco-abdominal lymph node detection and interstitial lung disease classification. Pantanowitz et al. (2020) addressed the need for computer-assisted diagnostics of prostate core needle biopsies (CNBs) by developing an algorithm that takes hematoxylin and eosin (H&E) stained slides as input and achieves an AUC of 0.997. Deep learning has also been used for detecting cancer in animals (Aubreville et al., 2020), agricultural greenhouse detection (Li et al., 2020), analyzing traffic load distribution on a bridge (Ge et al., 2020), airplane detection (Chen et al., 2018), hand gesture recognition (Kharate et al., 2016), automatic vehicle inspection (Nakhaeinia et al., 2016), and license plate recognition (Bennet et al., 2017). The medical works presented here focus only on pre- and post-operative diagnosis using magnetic resonance imaging, computed tomography scans, or ultrasound images. None of them considers real-time surgical images for identifying tumors. This paper addresses this gap and proposes using convolutional neural networks (YOLOv4 first, then an optimized VGG-16) to give surgeons a second opinion during real-time tumor removal surgery.
This paper proposes an easy-to-use and reliable deep learning algorithm to help surgeons identify cancerous cells during surgery. It provides a second opinion alongside the surgeon’s experience in identifying tumors, which is extremely valuable for reducing errors and ensuring that all potentially cancerous tumors are removed.
The proposed process is carried out in three steps. The first step applies deep learning to a live surgical video to show the locations of tumors at global range. The second step classifies tissue as cancerous, non-cancerous, or fatty and localizes the identified class at close range. If two-class classification is preferred, the third step classifies between cancerous and non-cancerous tissue at close range, with localization by Gradient-weighted Class Activation Mapping (GRAD-CAM) (Selvaraju et al., 2017). The idea is that once the more aggressive tumor has been identified by object detection at global range, close-range classification is used to check whether any tumors remain inside the patient before the wounds are closed.
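The order of these steps can be illustrated with a minimal sketch. The function `second_opinion`, its parameters, the label strings, and the stand-in models below are all hypothetical names chosen for illustration; the sketch only shows one possible control flow (global detection first, then close-range classification with either the three-class or the two-class model) and is not the implementation used in this work.

```python
from typing import Any, Callable, Dict, List, Optional


def second_opinion(
    frame: Any,
    close_up_images: List[Any],
    detect: Callable[[Any], List[dict]],
    classify_three: Callable[[Any], str],
    classify_two: Optional[Callable[[Any], str]] = None,
) -> Dict[str, Any]:
    """Hypothetical control flow for the three steps; the models are passed in."""
    # Step 1: global-range object detection on one frame of the surgical video.
    detections = detect(frame)
    # Step 2 by default; step 3 replaces it when a two-class model is supplied.
    classify = classify_two if classify_two is not None else classify_three
    labels = [classify(image) for image in close_up_images]
    return {
        "detections": detections,
        "close_range_labels": labels,
        "tumor_remaining": "cancerous" in labels,
    }


# Stand-in models so the sketch runs; real detectors and classifiers would go here.
result = second_opinion(
    frame="frame-0",
    close_up_images=["crop-a", "crop-b"],
    detect=lambda frame: [{"label": "cancerous", "box": (10, 20, 50, 60)}],
    classify_three=lambda image: "cancerous" if image == "crop-a" else "fatty",
)
```

Passing the models in as callables keeps the sketch independent of any particular detection or classification library.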
To the best of our knowledge, this is the first work on real-time tumor detection during surgery. We compare YOLOv3 (Redmon and Farhadi, 2018) and YOLOv4 (Bochkovskiy et al., 2020) object detection for global range detection and then use variations of VGG-16 (Simonyan and Zisserman, 2014) for classification and localization. The base VGG-16 architecture was used with the output layer changed to two or three units, and the input image shapes were 224
Deep learning has been used in many applications, including fast identification of COVID infection (Brunese et al., 2020; Junnumtuam et al., 2021), seismic applications (Hammal et al., 2020), and medical segmentation (Wu et al., 2021). As reported by Chung et al. (2015), Pantanowitz et al. (2020), and Wang et al. (2018), research has been done on detecting prostate cancer for diagnosis using magnetic resonance imaging (MRI) and ultrasound scans. Shin et al. (2016) studied computer-aided detection in thoraco-abdominal lymph node (LN) detection and interstitial lung disease (ILD) classification. The authors used the publicly available dataset of Depeursinge et al. (2012) for ILD and the publicly available datasets of Roth et al. (2016) and Seff et al. (2014) for LN detection. They found that, unlike the heart or liver, lymph nodes do not have a specific orientation. For this reason, they could not segment the images and apply a convolutional neural network (CNN) to the segmented region; they had to apply the CNN to the entire image. The authors used variations of CifarNet, AlexNet, and GoogLeNet (Szegedy et al., 2015) and showed that fine-tuning GoogLeNet performed best for both LN and ILD because the datasets contained many images, although GoogLeNet still overfitted at first (Shin et al., 2016). Among the model variations analyzed, including random initialization and transfer learning, the best performance was achieved by GoogLeNet with random initialization, with an AUC of 0.95.
Later, Wang et al. (2020) applied masks to CT images so that the convolutional neural network could more easily focus on the affected regions of the lung CT scan. The authors first used an unsupervised method to add ground-truth masks to the training set. The training images and their masks were then fed into a 2D-UNet to train a model that adds masks to the test set images. All training images that received a wrong mask during the unsupervised masking step were removed manually. They used 499 CT volumes for training and 131 CT volumes for testing, with label 1 for COVID positive and 0 for COVID negative. The trained 2D-UNet was then used to add masks to the 3D CT volumes of the testing set frame by frame. Next, the lung volume masks were concatenated with their CT volume images to prepare the dataset, and a 3D deep convolutional neural network (DeCoVNet) was trained with the labels 0 or 1. The activations from DeCoVNet with class activation mapping (CAM), together with unsupervised lung segmentations from 3D connected components (3DCC) (Ohira, 2018), were used for lesion localization. They evaluated the model’s performance at different thresholds; statistical analysis showed that DeCoVNet reached its maximum accuracy of 0.908 at a threshold of 0.3.
Roy et al. (2020) went a step further: besides classifying images, they applied classification to video and used semantic segmentation to segment the infected regions in ultrasound. Ultrasound was chosen as the imaging technique because it costs less than CT and clinicians have recently started to use it (Poggiali et al., 2020). Since interpreting ultrasound is more challenging, the authors devised a deep learning (DL) model to help interpret ultrasound reports for COVID infection. They propose frame-level classification, video-level grading, and pathological artefact segmentation. In total, they had 58,924 frames to work with. A regularized spatial transformer network (Reg-STN) was used as the network.
In another study, Chung et al. (2015) extracted radiomics features from multi-parametric MRI using a quantitative radiomics feature model. The authors then used a support vector machine (SVM) classifier to obtain an initial detection of cancer and combined the SVM output with a radiomics-driven conditional random field (RD-CRF) framework to obtain the final detection. Even though this method achieved higher accuracy than its predecessors, the method itself is rather trivial. Hadjiyski (2020) used the Inception v3 neural network on 3D rendered CT scans to predict the staging of kidney cancer. The images were cropped using ImageJ, making sure that the cropped portion included the kidney cancer. He achieved an AUC of 0.90 on the test set (Hadjiyski, 2020).
In canine mammary carcinoma, the mitotic count from whole slide images (WSIs) of canine breast tissue is analyzed for use in human breast cancer research (Aubreville et al., 2020). An inaccurate mitotic count can lead to a wrong diagnosis. The WSIs available for human breast cancer do not contain annotations for the entire slide. To meet the need for an algorithm that detects the mitotic count in WSIs, a number of challenges, including the MITOS 2012 dataset, have been released. The best-performing model at that time had an F1 score of 0.66, but its results were flawed and it was no longer considered state of the art because it had been trained and tested on the same data (Aubreville et al., 2020). The authors suggested combining RetinaNet (Lin et al., 2020) with ResNet (He et al., 2016) to improve the identification of mitotic counts in 21 WSIs of canine mammary carcinoma. Their proposed method achieved an F1 score of 0.791, a significant improvement over the supposedly state-of-the-art model for identifying mitotic counts in canine mammary carcinoma.
Charibaldi et al. (2018) proposed using fuzzy learning vector quantization (FLVQ) for Mycobacterium tuberculosis (MTB) detection. The method provided a faster and cheaper solution, because the ZN staining method produced unsatisfactory results and thorax X-ray irradiation was not suitable for developing countries. The authors compared FLVQ and LVQ using three different sensors: TGS822, TGS813, and TGS2611. The FLVQ neural network achieved a sensitivity (true positive rate) of 95.83% (Charibaldi et al., 2018).
Wu et al. (2021) proposed a method that joins the outputs of classification and segmentation for COVID-19 detection from chest CT diagnosis. They suggested using an image mixing technique (Zhang et al., 2018) to ensure that the classifier does not focus on the area outside the lesion. As classification evaluation metrics, they used specificity and sensitivity. However, since the goal of the project was to identify COVID-19 infected patients from their chest CT diagnosis, in other words the COVID-19 positive class is of greater importance, accuracy would have been a better performance metric.
As in Shin et al. (2016), who focused on lymph node detection where the nodes can have random orientation, the tumors in our project can have any orientation. Segmentation for applying a CNN to particular regions of the image therefore cannot be used, which is why our work cannot rely on unsupervised masking or semantic segmentation. The method section explains how the datasets for object detection and classification were prepared separately.
It would be convenient for surgeons if the algorithms presented in this paper could first detect tumors at global range inside the patient using object detection and then give a second opinion on whether any tumors remain inside the patient using classification and localization.
The proposed method in this paper aims to: (1) detect tumors from live surgical videos at global range inside the patient using object detection; (2) detect tumors inside the patient at close range from images, using classification and localization with three classes; and (3) detect tumors inside the patient at close range from images, using classification and localization with two classes.
The VGG-16 (Simonyan and Zisserman, 2014) convolutional neural network architecture was used for classification.
Figure 1 shows the network architecture used for two class classification. Assigning ‘n’ to the image size, ‘d’ to the color channels or depth of the image, ‘f’ to filter size. The image size is (n
Before feeding the images into the network, the images had been resized to 224
The working principle of some of the layers in Figure 1 is described below:
MaxPooling2D: the maximum value from each of the 2
Flatten: converts a two dimensional matrix to a one dimensional vector that is used as the input for the densely connected neural network. The vector output is of shape rows
Dropout: every neuron in the network gets assigned a probability
Dense: this works as the output of the network. This reduces a vector of 4,096 elements to two elements (Chollet, 2017).
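As a rough illustration of the pooling and flattening operations described above, the following sketch uses plain Python lists. The 2 by 2 window with stride 2 is an assumption made for the example, and the helper names `max_pool_2x2` and `flatten` are hypothetical; this is not the Keras implementation of these layers.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window (stride 2 assumed)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def flatten(matrix):
    """Turn a two-dimensional matrix into a one-dimensional vector, row by row."""
    return [value for row in matrix for value in row]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [2, 6, 7, 3],
]
pooled = max_pool_2x2(feature_map)  # [[4, 5], [6, 9]]
vector = flatten(pooled)            # [4, 5, 6, 9]
```

Pooling halves each spatial dimension here, and flattening yields a vector whose length is the number of rows times the number of columns of the pooled map.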
For debugging the neural network and for providing visual explanation, GRAD-CAM (Selvaraju et al., 2017) was used. In a biological context, it is crucial that the output of the deep learning model is reliable, given that deep learning models can be a ‘black box’. It is anticipated that GRAD-CAM will make it possible to debug the model and visually check whether the predictions are correct (Brunese et al., 2020). GRAD-CAM uses the gradient of the target class flowing into the final convolutional layer to produce a heat map that shows which portions of the image contributed to the prediction (Brunese et al., 2020; Chollet, 2017).
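The heat map computation can be sketched as follows, assuming the activations of the final convolutional layer and the gradients of the target class score with respect to them are already available as nested lists indexed `[channel][row][column]`. The function name `grad_cam` and the toy numbers are illustrative only; scaling the map to the range 0 to 1 is a common display step assumed here, not something specified in this paper.

```python
def grad_cam(activations, gradients):
    """Weight each feature map by its average gradient, sum, then apply ReLU."""
    # One weight per channel: the spatial average of that channel's gradients.
    weights = []
    for grad in gradients:
        values = [g for row in grad for g in row]
        weights.append(sum(values) / len(values))

    rows, cols = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels; negative evidence is clipped to zero (ReLU).
    cam = [
        [
            max(0.0, sum(w * act[r][c] for w, act in zip(weights, activations)))
            for c in range(cols)
        ]
        for r in range(rows)
    ]
    # Scale to [0, 1] for display (assumed convention).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two 2x2 feature maps with made-up activations and gradients.
activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat_map = grad_cam(activations, gradients)  # [[0.5, 0.0], [0.0, 1.0]]
```

In practice the resulting map would be resized to the input image and overlaid on it, so that the regions with the largest values appear highlighted.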
Referring to Figure 1, 7
YOLO (You Only Look Once) builds on the idea of the fully convolutional network (FCN). It is a deep learning framework that is used here for tumor detection at global range from real-time surgical images. It has also been used for skin lesion detection (Ünver and Ayan, 2019) and breast mass detection (Aly et al., 2021).
This makes YOLO a good option for detecting cancerous tumors in real time during surgery. The task in this section is to determine tumor locations in partial robotic nephrectomy images or videos by drawing bounding boxes around them and to classify them as cancerous, non-cancerous, or fatty tissue. YOLOv3 and YOLOv4 were compared.
In Figure 2, the input image comes first, followed by the backbone, which acts as the feature extractor. The neck, which sits between the backbone and the prediction head, enhances feature discriminability and robustness. Next comes the dense prediction step, which performs object detection. For a two-stage detector such as Faster R-CNN or Mask R-CNN, the next step is sparse prediction (Bochkovskiy et al., 2020).
YOLOv3 uses Darknet-53 as the backbone, a Feature Pyramid Network as the neck, and YOLO as the detector (Redmon and Farhadi, 2018). YOLOv4 uses CSPDarknet53 as the backbone, because the spatial pyramid pooling introduced in this backbone structure can significantly increase the receptive field and extract the important features, a Path Aggregation Network (PANet) as the neck, and the same detector as YOLOv3 (Redmon and Farhadi, 2018).
The backbone of v3 is deeper than the backbone of v4, which makes v3 slower to train. Small objects remain difficult to identify, a problem that has existed since v1 and v2. YOLOv4 improves on the performance of YOLOv3 without requiring any additional hardware, as can also be seen in the results section (Redmon and Farhadi, 2018).
All the input images are resized to 608
This section explains the results of tumor detection from live surgical videos using YOLOv3 and YOLOv4. Then, the results from classification are discussed with variations of AlexNet and VGG-16.
There are no publicly available data for tumor detection during surgery. The datasets for object detection and classification were prepared in different ways. For object detection, there were 56 images in total (49 for training and 56 for test). The training images were loaded into LabelImg (tzutalin, 2017), an open-source image labeling tool, and their annotations were saved. The dataset for classification was prepared in three different ways, described later in this section. The images were collected from YouTube videos of partial robotic nephrectomy (Kibel, 2018; Porter, 2015; Abaza, 2020a, b; P. N. U. Specialist, n.d.; Rogers, 2015; Engel et al., 2016; GlobalCastMD, n.d.; Hampton, 2015). Table 1 lists the institutions from which the images were collected.
Institutions that produced the videos.
Source | Country | State |
---|---|---|
Brigham and Women’s Hospital | USA | Massachusetts |
Seattle Science Foundation | USA | Washington |
Pacific Northwest Urology specialist | USA | Washington |
Vattikuti Foundation | USA | Michigan |
Urologic Surgeons of Washington | USA | Washington |
The images for the detection set were not cropped further. When capturing photos from the videos, care was taken to keep the robotic arms out of the images as much as possible. Two examples are shown in Figure 4.
On closer inspection, the partial robotic nephrectomy images contain portions of cancerous tissue and non-cancerous tissue, and some also contain fatty tissue. The images were therefore cropped further so that each image shows only cancerous tissue, only non-cancerous tissue, or only fatty tissue. The image from Figure 4 was cropped to include only the tumor as cancerous tissue, as shown in Figure 5.
By cropping the photos as in the examples shown in Figures 4 and 5, three datasets were prepared. The first dataset contained the cropped photos as they were, in three classes: cancerous tissue, non-cancerous tissue, and fatty tissue. In total, 30 cancerous tissue images were used for training, nine for validation, and five for testing. For non-cancerous tissue, there were 40 for training, 13 for validation, and nine for testing. For fatty tissue, there were 21 for training, 10 for validation, and six for testing.
For the second and third datasets, image augmentation was applied: each image was mirrored, rotated 90° clockwise, rotated 180° clockwise, and rotated 45° clockwise, so five images were obtained from each original image. The second dataset still had three classes, whereas the third dataset had two classes: cancerous tissue and non-cancerous tissue.
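A minimal sketch of this five-fold augmentation on a small grayscale image stored as nested lists is given below. The helper names are hypothetical, and the 45° rotation is an assumption about details not stated here: it uses nearest-neighbour sampling, keeps the original canvas size, and fills uncovered corners with a constant value. An image library would normally be used instead.

```python
import math


def mirror(img):
    """Flip the image left to right."""
    return [row[::-1] for row in img]


def rotate_90_cw(img):
    return [list(col) for col in zip(*img[::-1])]


def rotate_180(img):
    return [row[::-1] for row in img[::-1]]


def rotate_cw(img, degrees, fill=0):
    """Nearest-neighbour clockwise rotation about the centre, same canvas size."""
    rows, cols = len(img), len(img[0])
    cy, cx = (rows - 1) / 2, (cols - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Map each output pixel back to its source pixel (inverse rotation).
            dy, dx = r - cy, c - cx
            src_c = round(cx + cos_t * dx + sin_t * dy)
            src_r = round(cy - sin_t * dx + cos_t * dy)
            if 0 <= src_r < rows and 0 <= src_c < cols:
                out[r][c] = img[src_r][src_c]
    return out


def augment(img):
    """Return the original plus four variants: five images from one."""
    return [img, mirror(img), rotate_90_cw(img), rotate_180(img), rotate_cw(img, 45)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
variants = augment(image)  # five images, each 3 x 3
```

With this scheme, the 30 cancerous training images of the first dataset yield 150 images and the 21 fatty tissue images yield 105, which matches the class sizes reported for the augmented datasets.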
Finally, to keep the number of training images equal across classes, the two-class dataset had 150 images of cancerous tissue and 150 images of non-cancerous tissue.
For the three-class dataset, the training set had 105 cancerous tissue images, 105 non-cancerous tissue images, and 105 fatty tissue images.
The numbers have been summarized in Table 2.
Train, validation, test division.
Dataset | Label | Training | Validation | Test |
---|---|---|---|---|
1st dataset | Cancerous tissue | 30 | 9 | 5 |
1st dataset | Non-cancerous tissue | 40 | 13 | 9 |
1st dataset | Fatty tissue | 21 | 10 | 6 |
2nd dataset | Cancerous tissue | 105 | 9 | 5 |
2nd dataset | Non-cancerous tissue | 105 | 13 | 9 |
2nd dataset | Fatty tissue | 105 | 10 | 6 |
3rd dataset | Cancerous tissue | 150 | 9 | 5 |
3rd dataset | Non-cancerous tissue | 150 | 23 | 15 |
The YOLOv4 (Bochkovskiy et al., 2020) implementation used for object detection was written by Alexey (n.d.). The open-source code was downloaded from GitHub, and the images were loaded into the model using OpenCV as the vision engine. Training was run for 15 hr with image augmentation activated. After training, the algorithms were tested on the images from the test set and on the videos from which the test set images had been extracted.
Before the algorithm was trained on the Windows PC, YOLOv3 (Redmon and Farhadi, 2018) was implemented in a Google Colab virtual machine. The evaluation metrics used were as follows.
Precision, recall, mean average precision (mAP), and frames per second (FPS, for video data).
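For reference, these metrics can be sketched with their standard definitions. The counts and timings in the example are illustrative and are not taken from the experiments. Matching a detection to a ground-truth box through intersection over union (IoU) is the usual convention and is an assumption here, and the computation of per-class average precision itself is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_average_precision(average_precisions):
    """mAP is the mean of the per-class average precision values."""
    return sum(average_precisions) / len(average_precisions)


def frames_per_second(frame_count, elapsed_seconds):
    return frame_count / elapsed_seconds


# Illustrative numbers only.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))      # 1/7
precision, recall = precision_recall(8, 2, 2)  # (0.8, 0.8)
fps = frames_per_second(60, 2.0)               # 30.0
```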
Table 3 shows that the mAP of YOLOv4 on Windows is better than that of YOLOv3 on the virtual machine. In addition, it was not possible to run detection on videos on the virtual machine. Therefore, in terms of the evaluation metrics, YOLOv4 is better for tumor detection.
Result comparison.
Detection algorithm | Precision | Recall | Mean average precision | Frames per second |
---|---|---|---|---|
YOLOv3 on virtual machine | 0.88 | 0.62 | 0.758 | Not applicable |
YOLOv4 on windows | 0.98 | 0.99 | 0.974 | 21.4 |
Figure 6 shows some of the detection images produced by YOLOv4 from the video file attached to the document. In the algorithm, the batch size was set to 16 with 64 subdivisions, and the learning rate was 0.0001.
In Figure 6, cancerous tissue is marked with a pink bounding box, non-cancerous tissue with a blue one, and fatty tissue with a green one.
In the Supplementary file (
The dataset was prepared in three different ways for classification and localization, as described in the dataset section. Performance was evaluated using the following evaluation metrics: the loss versus epochs curve and the confusion matrix.
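A confusion matrix of the kind used here can be built as in the following sketch. The labels in the example are made up for illustration and are not the predictions obtained in this study; the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions lie on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


classes = ["cancerous", "fatty", "non-cancerous"]
# Illustrative labels only.
truth = ["cancerous", "cancerous", "fatty", "non-cancerous", "non-cancerous"]
predicted = ["cancerous", "non-cancerous", "fatty", "non-cancerous", "non-cancerous"]

matrix = confusion_matrix(truth, predicted, classes)
# [[1, 0, 1],
#  [0, 1, 0],
#  [0, 0, 2]]
score = accuracy(matrix)  # 0.8
```

Off-diagonal entries show which classes are confused with one another; in this toy example one cancerous image is mistaken for non-cancerous tissue.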
First, AlexNet was implemented with 7
Figure 7A, B shows that in the
Looking into Supplementary file (Figs. 4s and 5s,
Visual evaluation was done using Gradient-weighted Class Activation Mapping (GRAD-CAM).
There are five cancerous tissue images in the test set. The test here is which algorithm gives the correct prediction for the cancerous tissue and also highlights the cancerous portion of the image. The networks output two or three probabilities for each image, depending on whether two-class (third dataset) or three-class (second dataset) classification is being done. The region for the highest-probability class is highlighted in red in the image. For example, the image shown in Figure 8 yields the three probabilities [0.971, 0.00035, 0.028] under three-class classification. The first probability is for cancerous tissue, the second for fatty tissue, and the third for non-cancerous tissue. Since the cancerous tissue probability is the highest, that region is highlighted in red.
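Reading the predicted class from such an output amounts to taking the index of the largest probability. The sketch below uses the probabilities and class order quoted above; the function and variable names are illustrative.

```python
# Class order as stated in the text: cancerous, fatty, non-cancerous.
CLASS_NAMES = ["cancerous tissue", "fatty tissue", "non-cancerous tissue"]


def predicted_class(probabilities, class_names=CLASS_NAMES):
    """Return the class with the highest probability and that probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]


label, confidence = predicted_class([0.971, 0.00035, 0.028])
# label == "cancerous tissue", confidence == 0.971
```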
Based on Figure 8, the model ‘VGG16 second dataset lower lr Dropout 0.5 callback’ was selected for three-class classification.
The comparison was done in the same way for two-class classification, and ‘VGG-16 third dataset 2 lower lr using callback’ was selected as the model for two-class classification. See the ‘Visualizing heatmap and prediction outputs’ section of the Supplementary file, https://www.dropbox.com/sh/42dy79r2wyjrsq3/AAASkoWs26bFjFkVxkGJfOSwa?dl=0.
The comparison of this study with other related studies is summarized in Table 4.
Comparison with other studies.
Method | Image type | AI technique used | Total images (TI) | Evaluation metric | Validation performance (VP) | |
---|---|---|---|---|---|---|
Hadjiyski (2020) | CT scans | Inception v3 | 4,200 | AUC | 86% | 0.02 |
Aubreville et al. (2020) | Whole Slide Images | RetinaNet with ResNet-50 | 13,907 | F1 score | 79.1% | 0.01 |
Wang et al. (2018) | Multi parametric MRI | V-net | 79 cases in total. About 790 images | Accuracy | 89.4% | 0.11 |
Chung et al. (2015) | Multi parametric MRI | SVM with RD-CRF | 20 cases in total. About 200 images | Accuracy | 59% | 0.29 |
Brunese et al. (2020) | Chest X-ray | VGG-16 | 9,326 | Accuracy | 98% | 0.01 |
Wu et al. (2021) | Chest CT scan | VGG-16 with segmentation | 3,855 | Sensitivity | 95% | 0.03 |
This study | Live partial robotic nephrectomy | Object detection with VGG-16 | 143 | Accuracy | 84% | 0.59 |
Figure 9 shows a flowchart of how the two-stage detectors are being used for cancerous tumor detection.
Considering that there is currently no live tumor detection technology in the da Vinci Xi robot, this paper proposes a CNN approach to help surgeons detect tumors live during surgery. Global range tumor detection inside the patient was done with YOLOv4, and the close range detection approach is built on the VGG-16 base model. Two main models were considered, along with variations of each. For global range detection, YOLOv3 and YOLOv4 were compared. For classification, two classes (cancerous tissue, non-cancerous tissue) were compared with three classes (cancerous tissue, fatty tissue, non-cancerous tissue), and for each of the two variations five different VGG-16 models were considered. Also, the areas where a tumor was detected were highlighted depending on the output of the CNN model (more details of this in the Supplementary file,