Primary hyperparathyroidism (PHPT) is the third most common endocrine disorder with a reported prevalence ranging from 1 to 21 per 1,000 among the general population.1 PHPT is the result of hyperfunctioning parathyroid tissue (HPTT), which becomes insensitive to the inhibitory effect of hypercalcemia. Histologically, HPTT can be an adenoma (in approximately 80% of cases), multiple adenomas, hyperplasia or, rarely, a carcinoma (in approximately 1% of cases).2 The treatment of PHPT typically requires surgical removal of HPTT. Modern, minimally invasive surgical techniques require precise preoperative localization of HPTT. For this task, [18F]fluorocholine PET/CT (FCH-PET) is one of the most promising imaging modalities, with reported sensitivities of 94–100% and specificities of 88–100%.3-13 The performance of FCH-PET has repeatedly been shown to be superior to that of other HPTT localization methods, while also involving lower radiation exposure than other nuclear medicine modalities.14
Deep learning (DL) techniques with convolutional neural networks (CNN) have proven to be useful in various computer vision tasks, such as super-resolution, image synthesis, denoising, classification, segmentation and object detection.15-22 In medical imaging, CNNs have shown promising performance, even exceeding experts in some specific cases, such as grading diabetic retinopathy from fundus images, detecting skin cancer from photographs and detecting abnormalities on chest X-ray images.23-25 Research on CNNs in nuclear medicine has shown their potential for reducing the PET radiation dose, improving image quality, detecting and segmenting lesions, and predicting prognosis.21-36
Given the excellent performance of human experts in analysing FCH-PET for the presence and localisation of HPTT, the task presents an interesting opportunity to challenge DL techniques. An automated FCH-PET analysis pipeline that classifies HPTT presence and location would allow for efficient surgical planning and could serve to double-check the experts’ reports. Such analysis would also allow for more accurate and objective comparison of potential follow-up studies; these are not often required, but are unavoidable in cases of persistent or recurrent hyperparathyroidism. Furthermore, if the model could visualise the pathological uptake in the study, it would give the surgeon additional visual feedback in axial images, allowing better visualisation of HPTT and faster interpretation of its relationship to surrounding anatomical structures. Our aim was to explore the performance of DL analysis of FCH-PET in the setting of PHPT, since the use of DL for FCH-PET analysis in PHPT has not yet been thoroughly investigated.
To this end, we developed a classification model that classifies whether HPTT is present in the study and, if so, its location. We also attempted to model, in a novel unsupervised manner, the regions of interest fed to the model. Furthermore, we aimed to provide a preliminary comparison of the diagnostic accuracy of the DL models with that of human experts to determine clinical applicability, since a model must be as accurate as an expert in evaluating FCH-PET studies to be clinically applicable.
This was a retrospective analysis of prospective clinical trial data (NCT03203668) performed at the University Medical Centre Ljubljana and Institute of Oncology Ljubljana. The clinical trial was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number 77/11/12). The trial only included patients with biochemically confirmed primary hyperparathyroidism; hypercalcemic patients had elevated or inappropriately normal parathormone (PTH) levels, whereas normocalcemic patients had inappropriately elevated PTH levels. All included patients were older than 18 years and had no clinical history of oncological, inflammatory, or infectious disease of the head and neck. No pregnant women were included in the trial. The retrospective use of the data was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number 0120582/2021/4) and patient consent was waived due to the retrospective nature of the analysis.
The study only included images of patients with biochemically confirmed PHPT at the time of FCH-PET imaging. Since the trial did not include healthy controls, data of patients who met all of the following criteria were chosen as “controls”: no visible HPTT on FCH-PET at the time of imaging; no surgery in the thyroid region; and biochemical normocalcemia at 6 months’ follow-up.
We used the data of 79 participants (22 male, 57 female) with visible HPTT lesions on FCH-PET (referred below as
FCH-PET imaging was performed at the Department for Nuclear Medicine of the University Medical Centre Ljubljana. The acquisition details were the same as in Cuderman
All
For simplicity, we only used
We used the same pre-processing pipeline for all analyzed images. First, we resampled the CT image using bivariate spline interpolation from
The labels for an image were represented by a one-hot encoded vector of length 4, representing locations UL, LL, LR and a dummy variable representing “healthy”
For modelling, we defined 2 tasks: (i) a task of classifying whether the HPTT is present in the image or not (CPr, classification of presence) and (ii) a task of classifying in which quadrant the HPTT was present in the image (CLoc, classification of location). CPr is a simple binary classification task where
With normalized PET-CT images represented by a matrix of shape 200 × 200 × 56 × 2 as input, the output of the model was a vector of length 4, activated by
For both CPr and CLoc classification tasks, we performed baseline experiments using the 3D version of
We provide comprehensive comparison between the performance of RN10 and proposed architecture “
We developed a novel architecture designed to mask PET signals from unimportant regions with high signal due to physiological uptake (e.g., muscle tissue, salivary glands) before they enter the RN10 classifier. This is important because the FCH-PET images are heteroscedastic: some regions, such as muscle, have high variance between subjects, whereas others, such as air, have low variance. To mitigate this, and to improve the conditioning of the data and therefore the stability of the classifier,46 we decided to allow the model itself to optimize a differentiable masking of these potentially problematic regions. We named the proposed architecture “
The mRN10 consisted of 2 parts. First, a
The architecture of mRN10 is depicted on Figure 1. Regions in PET image where
The models were written in
For training, we used 12-fold cross-validation with data split into a test set of 10 random subjects, with the remaining subjects being randomly split into a training set (90% of the remaining subjects) and validation set (10% of the remaining subjects). Data was normalised using z-score normalization upon splitting accordingly, such that the mean and standard deviation were computed only using the training set. Sets were sampled such that each set contained at least 1 subject from each class (UL, LL, LR and
We used
Confusion matrices for CPr
CPr task | RN10: HPTT present | RN10: HPTT not present | mRN10: HPTT present | mRN10: HPTT not present |
---|---|---|---|---|
Model output HPTT present | 79 | 8 | 90 | 11 |
Model output HPTT not present | 20 | 13 | 9 | 10 |

CLoc task | RN10: HPTT at GTLoc | RN10: HPTT not at GTLoc | mRN10: HPTT at GTLoc | mRN10: HPTT not at GTLoc |
---|---|---|---|---|
Predicted GTLoc | 35 | 51 | 53 | 50 |
Not predicted GTLoc | 61 | 213 | 43 | 214 |
CPr = classification of presence; CLoc = classification of location; GTLoc = ground truth location based on postsurgical histopathological reports; HPTT = hyperactive parathyroid tissue; mRN10 = novel
We determined that the best-performing models for both RN10 and mRN10 were those trained with an initial learning rate of 0.013. The confusion matrices for RN10 and mRN10 are presented in Tables 1A and 1B, while the diagnostic performances for both tasks using the RN10 and mRN10 models are presented in Table 2. Both models had comparable performance in the CPr task. The mRN10 had a significantly higher accuracy for the CLoc task than the RN10 and was therefore used for comparison with human performance.
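The relationship between a confusion matrix and the usual diagnostic metrics can be sketched in a few lines. The example below is illustrative only: the function name `diagnostic_metrics` is our own, and the four counts are the mRN10 cells for the CLoc task taken from the confusion matrix above. Confidence intervals are not reproduced, since the interval method is not restated here.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# mRN10 counts for the CLoc task, read from the confusion matrix.
metrics = diagnostic_metrics(tp=53, fp=50, fn=43, tn=214)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values (0.552, 0.811, 0.515, 0.833 and 0.742) agree with the CLoc point estimates listed in Table 3.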
Diagnostic performance metrics of RN10 and mRN10 as well as
CPr |
CPr |
CLoc |
CLoc |
|||
---|---|---|---|---|---|---|
Sensitivity [95% CI] | 0.800 [0.719; 0.877] | 0.365 [0.268; 0.460] | ||||
Specificity [95% CI] | 0.476 [0.263; 0.690] | 0.257 | 0.807 [0.759; 0.854] | 0.910 | ||
Positive predictive value [95% CI] | 0.891 [0.830; 0.951] | 0.507 | 0.407 [0.303; 0.511] | 0.089 | ||
Negative predictive value [95% CI] | 0.394 [0.227; 0.560] | 0.205 | 0.777 [0.728; 0.827] | |||
Accuracy [95% CI] | 0.767 [0.681; 0.839] | 0.689 [0.638; 0.736] | ||||
AUCROC | 0.815 | / | 0.702 |
AUCROC = area under the receiver operating characteristic curve; CPr = classification of presence; CLoc = classification of location; mRN10 = novel
We performed a comprehensive comparison with human expert evaluation only for the CLoc task. Healthy controls had, by definition, no HPTT visible on FCH-PET (as reported by human experts), so the comparison could not be made for the CPr task, as human performance for CPr was 100%. Comparison of performance metrics for the CLoc task between the mRN10 model and human performance (based on the same subset of 83 patients used for the DL model development) is shown in Table 3.
Comparison of mRN10 and human performance for the CLoc task.
CLoc |
CLoc |
||
---|---|---|---|
Sensitivity [95% CI] | 0.552 [0.453; 0.652] | ||
Specificity [95% CI] | 0.811 [0.763; 0.858] | ||
Positive predictive value [95% CI] | 0.515 [0.418; 0.611] | ||
Negative predictive value [95% CI] | 0.833 [0.787; 0.878] | ||
Accuracy [95% CI] | 0.742 [0.693; 0.786] |
CLoc = classification of location; mRN10 = novel
Experiments across multiple models were performed to justify the choice of RN10 as the base architecture. The results of the other models are given in the table below, together with the number of trainable parameters and the optimal initial learning rate. Mean CPr AUCROC and 95% confidence intervals were computed as population statistics of 50 models obtained from 5 runs of 10-fold cross-validation at the optimal learning rate. The highest performance among the models tested was achieved with RN10 and mRN10.
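One way to read "population statistics of 50 models" is as the mean of the 50 AUCROC values together with their empirical 2.5th and 97.5th percentiles. The sketch below illustrates that reading only; the percentile-based interval is our assumption rather than a confirmed description of the procedure, the helper names are hypothetical, and the AUCROC values are synthetic placeholders, not study results.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def summarize_auc(aucs):
    """Mean and empirical 95% interval (2.5th to 97.5th percentile)."""
    ordered = sorted(aucs)
    return (statistics.mean(ordered),
            percentile(ordered, 2.5),
            percentile(ordered, 97.5))


# Synthetic placeholders: 5 runs x 10 folds = 50 AUCROC values (NOT study data).
rng = random.Random(0)
aucs = [min(1.0, max(0.0, rng.gauss(0.75, 0.1))) for _ in range(50)]

mean_auc, ci_low, ci_high = summarize_auc(aucs)
print(f"mean AUCROC = {mean_auc:.3f} [{ci_low:.3f}; {ci_high:.3f}]")
```

With only 50 values, the interval bounds are determined by the one or two most extreme folds, which is consistent with the wide intervals reported in the table.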
Qualitative results were evaluated across all subjects using an iteration of the model trained on a single data split. The qualitative results did not change significantly with repeated training. In the qualitative analysis of PET masking, the region-of-interest mask correctly identified the foreground. However, in all but 3 subjects (1 with LL HPTT and 2 with LR HPTT), the mask completely obscured the original location of HPTT on the masked PET. In the 3 subjects in whom HPTT remained visible at its original location on the masked PET, the mask still partially obscured the HPTT, as seen in Figure 3, rows d), f) and g).
Figure 2 shows a typical example of mRN10 masking, where HPTT was masked and cannot be distinguished in masked PET image. The network correctly classified the subject in Figure 2 as having lower right HPTT. The region of air outside the patient is masked to approximately 25% of the original PET signal, with mask having a value of approximately 0.25. The high signal from the salivary glands is masked in all cases, whereas signal from the thyroid gland is only partially masked in all cases, as seen in Figure 3.
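The effect of a mask value on the PET signal can be illustrated with a toy example. The sketch below assumes that the masked PET is the element-wise product of the PET image and a mask of the same shape, which is how we read the statement that a mask value of 0.25 leaves 25% of the signal; it does not reproduce the network that generates the mask. The function name `apply_mask` and all intensities are made up for illustration.

```python
def apply_mask(pet, mask):
    """Element-wise product of a 2D PET slice and a same-shaped mask."""
    if len(pet) != len(mask) or any(len(p) != len(m) for p, m in zip(pet, mask)):
        raise ValueError("pet and mask must have the same shape")
    return [[p * m for p, m in zip(pet_row, mask_row)]
            for pet_row, mask_row in zip(pet, mask)]


# Toy 2 x 3 slice with made-up intensities (not study data).
pet = [[4.0, 8.0, 2.0],
       [0.4, 6.0, 1.0]]
mask = [[0.25, 0.0, 1.0],   # 0.25 mimics the value reported for air
        [0.25, 0.5, 1.0]]   # 0.0 fully masks, 1.0 leaves the signal unchanged

masked_pet = apply_mask(pet, mask)
print(masked_pet)
```

Because the operation is a plain product, it is differentiable with respect to the mask, which is what allows a mask to be learned from the classification loss alone.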
The aim of the study was to evaluate the potential of DL models in classifying HPTT presence and location in FCH-PET studies in the setting of PHPT. For our experiments to be representative of how such a model would perform in practice, we used data from a representative cohort of subjects with PHPT. Classification of FCH-PET studies was performed using multiple common DL models, and we found that the simplest among the models tested, RN10, achieved the highest performance. Furthermore, we improved the model’s performance by modifying the architecture to include a region-of-interest masking step, which produced a mask that successfully identified the foreground of the PET image. The mRN10 achieved superior performance to models of similar size. Overall, given the size of our dataset and the achieved performance, we found the use of deep learning to be highly promising for the evaluation of FCH-PET in PHPT.
Both our
Unfortunately, the dataset was imbalanced with respect to patients vs “controls”. However, obtaining negative FCH-PET studies is difficult due to high positivity rate of finding HPTT in FCH-PET, since only patients with biochemically confirmed PHPT are imaged. Such patients are highly likely to have visible HPTT, as reported in studies exploring the effectiveness of FCH-PET.3-13 Since healthy subjects are generally not referred to undergo FCH-PET imaging, the best attempt was made to select the criteria for choosing “controls” among patients with negative visual assessment of FCH-PET. Our controls therefore had negative imaging findings and biochemical criteria for PHPT resolved at follow-up after 6 months without surgical treatment.
Performance of several models on CPr task
Model name | mRN10 | RN10 | |||||
---|---|---|---|---|---|---|---|
# Trainable parameters (millions) | 33.5 | 14.3 | 46.2 | 85.2 | 112.9 | 85.2 | 85.2 |
Optimal initial learning rate | 0.0136 | 0.0136 | 2.15 × 10⁻³ | 1.47 × 10⁻⁴ | 0.316 | 1.47 × 10⁻⁴ | 2.15 × 10⁻³ |
Mean CPr AUCROC [95% CI] | 0.754 [0.624; 0.980] | 0.527 [0.410; 0.639] | 0.703 [0.606; 0.905] | 0.739 [0.486; 0.998] | 0.752 [0.653; 0.966] |
AUCROC = area under the receiver operating characteristic curve; CPr = classification of presence; mRN10 = novel
For ground truth location, histopathological results were used as opposed to expert visual assessment of FCH-PET, in order to simulate real-world use of the models in guiding surgical removal of HPTT.
We have chosen the 3D
The main motivation behind the design of mRN10 and implementation of masking is the way experts interpret FCH-PET. Experienced nuclear medicine physicians know that HPTT usually appears around the thyroid region, and we wanted to allow for the model to learn to mask regions that were deemed unimportant for classification. Furthermore, these unimportant regions (e.g., muscle) commonly produced high intensity PET signal that might affect the classifier. Using end-to-end training with only cross-entropy classification loss, we allowed the network to learn to mask these unimportant regions in an unsupervised manner by carefully tailoring the architecture. Given how experts interpret FCH-PET, mRN10 was an attempt to integrate expert knowledge into the model to improve the
The
One of the goals of the study was to compare the model’s performance with that of nuclear medicine experts. The task of detecting and localizing HPTT on FCH-PET is relatively “trivial” for human experts, with reported accuracies of up to 98%.3-13 We therefore expected that a small dataset would be sufficient for training a model to similar performance. However, the results differed from our expectations, as the achieved performance was significantly below that of humans for both of our tasks. Increasing the dataset to several hundred subjects would most likely close the performance gap.
Given the size of our dataset, our results are comparable to other published studies on other medical imaging related tasks. The study with a similarly sized dataset (85 subjects) in the classification of cardiac sarcoidosis by Togo
We feel that by increasing the size of our dataset to several hundred patients, similar levels of performance metrics to human performance could most likely be achieved. One supporting data point for this assumption is that the upper-bound of the 95% CI of AUC in the population statistics of 50 model iterations used in experiments was 0.998. Given the right data split, the model could perfectly classify the test set.
Qualitatively, we observed interesting properties of the mask created using the
The physiologically high PET activity in the salivary glands and the thyroid was correctly masked. This is likely because there is usually high PET activity in these regions. The masking of the thyroid region is especially problematic since the signal from HPTT can also be masked along with the thyroid. This resulted in HPTT being masked in all but 3 cases, as shown in Figure 3. Still, this did not always result in a false classification of the HPTT location. The parathyroid adenoma in row c) is crucial to the task for experts and yet it was masked in this case by the network. Even though the model masked the adenoma, the mRN10 model output in this case was still correct (lower right adenoma location). It is likely that
Regions near the skin and the skin itself were always enhanced – we assumed that this was an important signal to the model, as skin-air interface exhibits high contrast on PET and CT and acts as a rough anatomical landmark. It is also much higher in contrast than soft tissue interfaces of the structures in the parathyroid region and produces stronger gradients in training. The region outside the patient (air) was not masked to 0, but to approximately 25% of the signal (value of mask was 0.25), since it is irrelevant to the classification and likely does not produce a gradient in training, so the
We find the obtained masks to be interpretable in terms of optimizing downstream
While the proposed mRN10 model, using
Namely, we found the mRN10 to be superior in performance to the RN10 in CLoc task. This is probably due to the improved conditioning of the masked input to
In the model selection, we found that the model with the lowest number of parameters performed the best. This is one limitation of our study, since experiments with even simpler models were not carried out. Another potential performance improvement could come from transfer learning, but we have not found suitable pretrained models for FCH-PET images.
Our PET masking was an attempt to make the model more interpretable. The most notable similar mechanisms in the literature are attention mechanisms.69 The main problem with most attention mechanisms is that they rely on weighting image features, which are obtained by embedding a small image patch into a vector. Because of this, the spatial resolution of the attention map is limited by the size of the image patch, which is commonly 16 × 16 in vision transformers.70 By analogy, if we used 16 × 16 × 16 patches for our theoretical attention, the feature map of our entire image would be of spatial dimensions 4 × 4 × 2, which is too low for a sufficiently detailed interpretation. Another method of explaining the model output is class activation mapping (CAM), which also relies on feature embeddings before the fully connected layers and therefore entails a loss of spatial resolution;71 in the case of the RN10, the CAM resolution would be 4 × 4 × 2. Gradient-based attribution methods, which do provide pixel-level (or in our case voxel-level) attribution of the model output to the input, have received criticism due to their inconsistency and poor theoretical foundations.72
We presented extensive experiments in deep learning analysis of FCH-PET using the standard classification model RN10 and a novel architecture tailored to the task. As deep learning for FCH-PET analysis in PHPT has, to our knowledge, not yet been described in the literature, our experiments provide a baseline for future work. Although the achieved performance was inferior to that of human experts, the results seem very promising considering the small dataset and the achieved accuracies of 83% for detecting HPTT and 74% for localizing the quadrant of HPTT.