Detection and localization of hyperfunctioning parathyroid glands on [¹⁸F]fluorocholine PET/CT using deep learning – model performance and comparison to human experts

Leon Jarabek; Jan Jamsek; Anka Cuderman; Sebastijan Rep; Marko Hocevar; Tomaz Kocjan; Mojca Jensterle; Ziga Spiclin; Ziga Macek Lezaic; Filip Cvetko; Luka Lezaic

Open access

Radiology and Oncology

Volume 56 (2022): Issue 4 (December 2022)

Article Category: Research Article

Published online: 13 Dec 2022

Pages: 440–452

Received: 21 Apr 2022

Accepted: 22 Aug 2022

DOI: https://doi.org/10.2478/raon-2022-0037

© 2022 Leon Jarabek, Jan Jamsek, Anka Cuderman, Sebastijan Rep, Marko Hocevar, Tomaz Kocjan, Mojca Jensterle, Ziga Spiclin, Ziga Macek Lezaic, Filip Cvetko, Luka Lezaic, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Introduction

Primary hyperparathyroidism (PHPT) is the third most common endocrine disorder with a reported prevalence ranging from 1 to 21 per 1,000 among the general population.¹ PHPT is the result of hyperfunctioning parathyroid tissue (HPTT), which becomes insensitive to the inhibitory effect of hypercalcemia. Histologically, HPTT can be either an adenoma (in approximately 80% of cases), multiple adenomas, hyperplasia or rarely a carcinoma (in approximately 1% of cases).² The treatment of PHPT typically requires surgical removal of HPTT. Modern, minimally invasive surgical techniques require precise preoperative localization of HPTT. For this task, [¹⁸F]fluorocholine PET/CT (FCH-PET) is one of the most promising imaging modalities, with reported sensitivities of 94–100% and specificities of 88–100%.³⁻¹³ Performance of FCH-PET was repeatedly shown to be superior to other HPTT localization methods, while at the same time having lower radiation exposure compared to other nuclear medicine modalities.¹⁴

Deep learning (DL) techniques with convolutional neural networks (CNN) have proven to be useful in various computer vision tasks, such as super-resolution, image synthesis, denoising, classification, segmentation and object detection.¹⁵⁻²² In medical imaging, CNNs have shown promising performance, even exceeding experts in some specific cases, such as grading diabetic retinopathy from fundus images, detecting skin cancer from photographs and detecting abnormalities on chest X-ray images.²³⁻²⁵ Research on CNNs in nuclear medicine has shown their potential in reducing the PET radiation dose, improving image quality, lesion detection and segmentation as well as prediction of prognosis.²¹⁻³⁶

Given the excellent human performance of analysing FCH-PET for the presence and localisation of HPTT, an interesting opportunity to challenge DL techniques is presented. An automated analysis pipeline of FCH-PET that would classify HPTT presence and location would allow for efficient surgical planning and could serve to double check the experts’ reports. Such analysis would also allow for more accurate and objective comparison of potential follow-up studies; these are not often required, but unavoidable in cases of persistent or recurrent hyperparathyroidism. Furthermore, if the model could visualise the pathological uptake in the study, it would provide more visual feedback to the surgeon in axial images to allow for better visualisation of HPTT and would allow faster interpretation of interplay of surrounding anatomical structures. Our aim was to explore the performance of DL analysis of FCH-PET in the setting of PHPT, since the use of DL for FCH-PET analysis in PHPT has not yet been thoroughly investigated.

To this end, we developed a classification model that determines whether HPTT is present in the study and, if so, its location. We also attempted to model, in a novel unsupervised manner, the regions of interest fed to the classifier. Furthermore, we aimed to provide a preliminary comparison of the diagnostic accuracy of the DL models to human experts to determine clinical applicability, as the model should be as accurate as an expert in evaluating FCH-PET studies to be clinically applicable.

Patients and methods

This was a retrospective analysis of prospective clinical trial data (NCT03203668) performed at the University Medical Centre Ljubljana and Institute of Oncology Ljubljana. The clinical trial was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number 77/11/12). The trial only included patients with biochemically confirmed primary hyperparathyroidism; hypercalcemic patients had elevated or inappropriately normal parathormone (PTH) levels, whereas normocalcemic patients had inappropriately elevated PTH levels. All included patients were older than 18 years and had no clinical history of oncological, inflammatory, or infectious disease of the head and neck. No pregnant women were included in the trial. The retrospective use of the data was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number 0120582/2021/4) and patient consent was waived due to the retrospective nature of the analysis.

The study only included images of patients with biochemically confirmed PHPT at the time of FCH-PET imaging. Since the trial did not include healthy controls, data of patients meeting the following criteria were chosen as “controls”: no visible HPTT on FCH-PET at the time of imaging; no prior surgery in the thyroid region; biochemically normocalcemic at 6 months’ follow-up.

Dataset description and PET-CT image acquisition

We used the data of 79 participants (22 male, 57 female) with visible HPTT lesions on FCH-PET (referred to below as patients) and 19 participants (7 male, 12 female) without visible HPTT lesions on FCH-PET (referred to below as controls). Average age (± SD) of patients was 58.7 ± 12.7 years and average age of controls was 60.1 ± 11.8 years. The patient and control groups were comparable in terms of age (p = 0.659) as well as male to female ratio (p = 0.852), as determined by Student t-test and normalised Chi-square test, respectively.³⁷,³⁸

FCH-PET imaging was performed at the Department for Nuclear Medicine of the University Medical Centre Ljubljana. The acquisition details were the same as in Cuderman et al.³ The patients fasted 6 hours prior to the examination, were well hydrated and injected with 100 MBq of [¹⁸F]fluorocholine (FCH). Acquisition was performed on a Siemens Biograph mCT® PET/CT (Siemens Healthineers AG, München, DE) 5 minutes and 60 minutes after the FCH application. The imaging region extended from the angle of mandible to the aortic arch. The imaging consisted of a low-dose CT (120 kVp, 25 mAs, CARE Dose 4D, FBP reconstruction), followed by PET imaging (one bed position of 4 minutes). PET images were reconstructed using Siemens HD PET software with iterative TrueX + TOF OSEM method (2 iterations, 21 subsets) with 400 × 400 matrix, zoom 1 and Gaussian filter with FWHM of 4 mm. To train and evaluate DL models, we used only images acquired 60 minutes after FCH application, where the balance of image quality and target-to-background ratio is typically highest.

All patients with HPTT present on FCH-PET were surgically treated at Institute of Oncology Ljubljana. Ground truth HPTT presence and location for training the CNNs was based on the post-surgical histopathological results. Furthermore, our dataset included formatted information from FCH-PET reports as used by Cuderman et al. that we used to compare the performance of DL models with human experts.³ These reports were used to guide the subsequent surgical removal of the HPTT.

For simplicity, we only used patients who had single gland disease and had HPTT in the typical anatomic location of parathyroid glands. HPTT was thus in one of 4 possible locations: upper left (UL, 21 patients), lower left (LL, 27 patients), upper right (UR, 5 patients) and lower right (LR, 26 patients). Since the UR location in our dataset contained only 5 patients, it was removed from the final analysis due to under-representation. For the final model development and evaluation, we used 19 controls and 74 patients, among them 21 with UL HPTT, 27 with LL HPTT and 26 with LR HPTT.

Image pre-processing

We used the same pre-processing pipeline for all analyzed images. First, we resampled the CT image using bivariate spline interpolation from scipy library to match the PET image matrix of 200 × 200 × 56.³⁹ 3D interpolation was not needed as CT was reconstructed at same slices as PET. Both images were concatenated to produce a 200 × 200 × 56 × 2 matrix representing the PET/CT. Next, we cropped the desired region of interest containing the hyperfunctioning parathyroid tissue to the matrix of size 64 × 64 × 32. For all patients, the region was cropped at same PET/CT coordinates, which were chosen empirically, such that it contained HPTT in all studies. In this way, there are lower memory requirements to run deep learning models.

The labels for an image were represented by a one-hot encoded vector of length 4, representing locations UL, LL, LR and a dummy variable representing “healthy” controls.

Modelling

For modelling, we defined 2 tasks: (i) a task of classifying whether the HPTT is present in the image or not (CPr, classification of presence) and (ii) a task of classifying in which quadrant the HPTT was present in the image (CLoc, classification of location). CPr is a simple binary classification task where p(HPTT) = 1 – p(healthy). CLoc is a multi-class classification task where each output of the model is analogous to the probability of HPTT being present at one of three considered locations UL, LL and LR.

With normalized PET-CT images represented by a matrix of shape 200 × 200 × 56 × 2 as input, the output of the model was a vector of length 4, activated by SoftMax activation function, corresponding to p(UL), p(LL), p(LR) and p(healthy) (Figure 1). The model was therefore trained for both CPr and CLoc simultaneously. Furthermore, the dataset was well balanced, containing a similar number of cases for each of the 4 classes, and thus ensured stable training using cross entropy as a loss function.⁴⁰ For training, a batch size of 5 was used with a stochastic gradient descent optimizer with momentum of 0.9 and weight decay of 0.005. The initial learning rate was determined by a grid search in log space and learning rate decay on plateau scheduling was used. An identical procedure was used for all models. All models were trained from scratch.
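The relationship between the single 4-way output and the two tasks can be illustrated with a minimal sketch in plain Python. The class order, the example logits and the use of the most probable quadrant as the CLoc answer are assumptions made here for illustration; the text does not state how the authors converted probabilities into decisions.

```python
import math

CLASSES = ("UL", "LL", "LR", "healthy")  # order assumed for illustration


def one_hot(label):
    """One-hot vector of length 4 for a class name."""
    return [1.0 if c == label else 0.0 for c in CLASSES]


def softmax(logits):
    """Numerically stable softmax over raw model outputs."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


def read_out(probs):
    """Derive the CPr and CLoc answers from one 4-way output vector."""
    p = dict(zip(CLASSES, probs))
    p_hptt = 1.0 - p["healthy"]  # CPr: p(HPTT) = 1 - p(healthy)
    # CLoc: most probable quadrant (an assumed decision rule)
    location = max(("UL", "LL", "LR"), key=p.get)
    return p_hptt, location


logits = [0.2, 1.5, -0.3, 0.1]  # hypothetical raw outputs, not study data
probs = softmax(logits)
p_hptt, location = read_out(probs)
loss = cross_entropy(probs, one_hot("LL"))
```

Because the four probabilities sum to 1, p(HPTT) equals the sum of the three location probabilities, which is why one network can serve both tasks.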

Figure 1. mPETResnet10 architecture. First, PET-CT images are fed into UNet with a single channel output and tanh+1 activation function. This output is the PET mask. This mask is elementwise multiplied with the PET image to produce a masked PET image. Masked PET is concatenated with the original CT and the masked PET-CT is fed into the ResNet10 classifier. Gray boxes represent deep-learning models, coloured boxes represent data, and circles represent operations of tanh+1, multiplication (mul) by element and concatenation (concat).

For both CPr and CLoc classification tasks, we performed baseline experiments using the 3D version of Resnet10 (RN10) architecture and using our novel architecture as described below.^{41, 42} Our choice of architecture of Resnet10 was based on extensive experiments which included other state-of-the-art, and larger architectures, namely using 3D versions of Densenet121⁴³, wideResNet101⁴⁴, PreActResnet101⁴⁵, Resnet101⁴¹ and Resnet50. For all architectures except our novel architecture, implementations from Kensho et al. were used.⁴²

We provide comprehensive comparison between the performance of RN10 and proposed architecture “masked-PET Resnet10” (mRN10), as well as the comparison of mRN10 to experts’ performance.

Masked-PET Resnet10

We developed a novel architecture designed to mask PET signals from unimportant (i.e., physiological uptake) regions with high signal (e.g., muscle tissue, salivary glands) before entering the RN10 classifier. This is important as the FCH-PET images are heteroscedastic, with some regions (like muscle) having high variance between subjects and other regions (like air) having low variance. To mitigate this, and to improve conditioning of the data and therefore the stability of the classifier,⁴⁶ we decided to allow the model itself to optimize for differentiable masking of these potentially problematic regions. We named the proposed architecture “masked-PET Resnet10” (mRN10).

The mRN10 consisted of 2 parts. First, a Unet architecture was used to mask the PET-CT.⁴⁷ Next, Resnet10 was used to classify the masked PET-CT. We decided on Unet architecture since it is commonly used in segmentation tasks²¹ and we deemed the task of masking to be similar to segmentation of the region-of-interest. Masking was achieved by first activating per-voxel output of Unet with activation function f(x) = tanh(x)+1. These output values were in interval (0,2), such that regions where Unet output was negative were closer to 0, while regions where Unet output was positive were closer to 2. This matrix, representing the mask, was then multiplied elementwise by the PET matrix, to produce a masked PET image.

The architecture of mRN10 is depicted in Figure 1. Regions in the PET image where Unet output was negative were multiplied by values close to 0 and were therefore effectively “masked” from the PET image. This masked PET was then concatenated with CT and the masked PET-CT was used as input for the Resnet10 classifier. The entire mRN10 was trained end-to-end, therefore the masking was optimized for the lowest loss in the classification task of the downstream Resnet10 classifier.
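The masking step described above can be sketched on a handful of voxels in plain Python. This is only an illustration of the arithmetic, not the authors' implementation (which is in the linked repository); the voxel values and function names are hypothetical.

```python
import math


def mask_activation(x):
    """f(x) = tanh(x) + 1, bounded in the open interval (0, 2)."""
    return math.tanh(x) + 1.0


def apply_mask(unet_out, pet):
    """Multiply each PET voxel by its mask value (element-wise)."""
    return [mask_activation(u) * v for u, v in zip(unet_out, pet)]


# Hypothetical flattened voxels (not study data)
unet_out = [-3.0, 0.0, 3.0]  # raw per-voxel U-Net outputs
pet = [5.0, 5.0, 5.0]        # PET intensities
ct = [-1.0, 0.2, 0.7]        # CT intensities, left unmasked

masked_pet = apply_mask(unet_out, pet)

# Masked PET and original CT form the two channels passed to the classifier
masked_pet_ct = [(m, c) for m, c in zip(masked_pet, ct)]
```

A strongly negative U-Net output suppresses the voxel almost entirely, a zero output leaves it unchanged (mask value 1), and a strongly positive output nearly doubles it, so the mask can amplify as well as suppress.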

The models were written in Python 3.8.0 using the PyTorch 1.10 framework and trained on a single GTX 1080Ti graphics card (Nvidia Corporation, Santa Clara, US).⁴⁸,⁴⁹ The code is freely available online at: https://github.com/ljarabek/AI_FCH

Training and evaluation

For training, we used 12-fold cross-validation with data split into a test set of 10 random subjects, with the remaining subjects being randomly split into a training set (90% of the remaining subjects) and validation set (10% of the remaining subjects). Data was normalised using z-score normalization upon splitting accordingly, such that the mean and standard deviation were computed only using the training set. Sets were sampled such that each set contained at least 1 subject from each class (UL, LL, LR and control). For testing, the model with the lowest validation loss was used. The confusion matrix for CPr evaluation was computed by summing the confusion matrices for the test set across the 12 data splits, providing 120 total samples. The confusion matrix for CLoc was obtained by summing the 3 confusion matrices for evaluated locations UL, LL, LR across the best performing 12 data splits, providing 360 “samples”. Similarly, the area under the receiver operating characteristic curve (AUCROC) was computed.
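The point that normalization statistics come only from the training set can be shown with a short sketch. The intensity values are hypothetical, and whether the authors used the population or sample standard deviation, or normalized each channel separately, is not stated; the population form is assumed here.

```python
import statistics


def fit_zscore(train_values):
    """Estimate mean and standard deviation from the training set only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population SD (an assumption)
    return mean, std


def apply_zscore(values, mean, std):
    """Normalize any split with the statistics fitted on the training set."""
    return [(v - mean) / std for v in values]


# Hypothetical voxel intensities (not study data)
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_zscore(train)
train_z = apply_zscore(train, mean, std)
test_z = apply_zscore(test, mean, std)  # test data never influences mean/std
```

Fitting on the training split alone keeps information about validation and test subjects out of the model input, which would otherwise bias the reported performance.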

We used the epiR package for R to determine the diagnostic performance metrics and the McNemar test from the DTComPair package for determining statistically significant (p < 0.05) differences.⁵⁰⁻⁵³ Only binary diagnostic performance metrics were used for evaluation, even though CLoc is theoretically a multi-class classification task. In this way, the results are comparable to studies evaluating the performance of FCH-PET, since they also mostly used binary classification metrics.³⁻¹³
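For readers unfamiliar with the McNemar test, the idea is that only the cases on which two classifiers disagree carry information about which one is better. The sketch below uses the exact binomial form of the test, which is an assumption chosen for simplicity; the DTComPair package may compute a different variant, and the discordant counts shown are hypothetical because the paper does not report them.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases classifier 1 got right and classifier 2 got wrong
    c: cases classifier 1 got wrong and classifier 2 got right
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail probability under H0: discordant cases split 50/50
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts (not study data)
p_equal = mcnemar_exact(5, 5)    # evenly split disagreements
p_skewed = mcnemar_exact(0, 10)  # all disagreements favour one classifier
```

Evenly split disagreements give no evidence of a difference, whereas ten disagreements all in one direction give a p-value well below 0.05.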

Table 1

Confusion matrices for CPr (A) and CLoc (B) for both RN10 and mRN10 models. Note that the confusion matrices for CLoc have more samples (360 in total), as they were computed by summing the confusion matrices for each of the three included locations (UL, LL, LR)

	CPr task with RN10				CPr task with mRN10
	HPTT present	HPTT not present	sum		HPTT present	HPTT not present	sum
Model output HPTT present	79	8	87	Model output HPTT present	90	11	101
Model output HPTT not present	20	13	33	Model output HPTT not present	9	10	19
sum	99	21	120	sum	99	21	120

	CLoc task with RN10				CLoc task with mRN10
	HPTT at GTLoc	HPTT not at GTLoc	sum		HPTT at GTLoc	HPTT not at GTLoc	sum
Predicted GTLoc	35	51	86	Predicted GTLoc	53	50	103
Not predicted GTLoc	61	213	274	Not predicted GTLoc	43	214	257
sum	96	264	360	sum	96	264	360

CPr = classification of presence; CLoc = classification of location; GTLoc = ground truth location based on postsurgical histopathological reports; HPTT = hyperactive parathyroid tissue; mRN10 = novel masked-PET Resnet10 model; RN10 = baseline Resnet10 model

Results

We determined the best performing models for both RN10 and mRN10 were trained using the initial learning rate of 0.013. The confusion matrices for RN10 and mRN10 are presented in Tables 1A and 1B, while the diagnostic performances for both tasks using the RN10 and mRN10 models are presented in Table 2. Both models had comparable performance in the CPr task. The mRN10 had a significantly higher accuracy for the CLoc task than the RN10 and was therefore used for comparison with human performance.

Table 2

Diagnostic performance metrics of RN10 and mRN10 as well as p-values as determined by McNemar test comparing both models for each task (except AUCROC)

	CPr RN10	CPr mRN10	CPr p-value	CLoc RN10	CLoc mRN10	CLoc p-value
Sensitivity [95% CI]	0.800 [0.719; 0.877]	0.909 [0.852; 0.965]	0.028	0.365 [0.268; 0.460]	0.552 [0.453; 0.652]	0.018
Specificity [95% CI]	0.619 [0.411; 0.827]	0.476 [0.263; 0.690]	0.257	0.807 [0.759; 0.854]	0.811 [0.763; 0.858]	0.910
Positive predictive value [95% CI]	0.908 [0.847; 0.969]	0.891 [0.830; 0.951]	0.507	0.407 [0.303; 0.511]	0.515 [0.418; 0.611]	0.089
Negative predictive value [95% CI]	0.394 [0.227; 0.560]	0.526 [0.302; 0.751]	0.205	0.777 [0.728; 0.827]	0.833 [0.787; 0.878]	0.021
Accuracy [95% CI]	0.767 [0.681; 0.839]	0.833 [0.756; 0.895]	0.050	0.689 [0.638; 0.736]	0.742 [0.693; 0.786]	0.031
AUCROC	0.815	0.849	/	0.702	0.770	/

AUCROC = area under the receiver operating characteristic curve; CPr = classification of presence; CLoc = classification of location; mRN10 = novel masked-PET Resnet10 model; RN10 = baseline Resnet10 model
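The point estimates in Table 2 follow directly from the counts in Table 1. As a worked check, the sketch below recomputes them for mRN10 in plain Python; the function name is illustrative, and the confidence intervals (computed by the authors with epiR) are not reproduced.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Binary diagnostic performance metrics from a 2 x 2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
    }


# Counts taken from Table 1 for the mRN10 model
cpr_mrn10 = diagnostic_metrics(tp=90, fp=11, fn=9, tn=10)
cloc_mrn10 = diagnostic_metrics(tp=53, fp=50, fn=43, tn=214)

for name, value in cloc_mrn10.items():
    print(f"CLoc mRN10 {name}: {value:.3f}")
```

Rounded to three decimals, these values match the mRN10 columns of Table 2.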

We performed a comprehensive comparison with human expert evaluation only for the CLoc task. Healthy controls had, by definition, no HPTT visible on FCH-PET (as reported by human experts), so the comparison could not be made for the CPr task, as human performance for CPr was 100%. Comparison of performance metrics for the CLoc task between the mRN10 model and human performance (based on the same subset of 83 patients used for the DL model development) is shown in Table 3.

Table 3

Comparison of mRN10 and human performance for the CLoc task. p-values were determined by using the McNemar test

	CLoc mRN10	CLoc human	p-value
Sensitivity [95% CI]	0.552 [0.453; 0.652]	0.917 [0.857; 0.958]	< 0.001
Specificity [95% CI]	0.811 [0.763; 0.858]	0.997 [0.986; 0.999]	< 0.001
Positive predictive value [95% CI]	0.515 [0.418; 0.611]	0.992 [0.945; 0.999]	< 0.001
Negative predictive value [95% CI]	0.833 [0.787; 0.878]	0.972 [0.952; 0.984]	< 0.001
Accuracy [95% CI]	0.742 [0.693; 0.786]	0.977 [0.960; 0.988]	< 0.001

CLoc = classification of location; mRN10 = novel masked-PET Resnet10 model; RN10 = baseline Resnet10 model

Studies with different architectures

Experiments across multiple models were performed to justify the choice of RN10 as the base architecture. The results for the other models are reported in Table 4, together with the number of trainable parameters and the optimal initial learning rate. Mean CPr AUCROC and 95% confidence intervals were computed as population statistics of 50 models obtained from 5 runs of 10-fold cross-validation at the optimal learning rate. The highest performance among the models tested was achieved with RN10 and mRN10.
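As a reminder of what is being averaged, the AUCROC of a single model can be computed as the fraction of (positive, negative) subject pairs that the model ranks correctly, and the mean is then taken across trained models. The sketch below is a minimal illustration with hypothetical scores and per-model values; using p(HPTT) as the score is an assumption, and the interval computation is omitted because the exact method is not described.

```python
import statistics


def auc_roc(scores, labels):
    """AUCROC as the probability that a positive case outranks a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical p(HPTT) scores for one test set (1 = patient, 0 = control)
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = auc_roc(scores, labels)

# Hypothetical AUCROC values from several trained models, then their mean
per_model_auc = [auc, 0.75, 1.0]
mean_auc = statistics.fmean(per_model_auc)
```

With test sets of about 10 subjects, a single misranked pair shifts the AUCROC considerably, which is consistent with the wide intervals reported for all architectures.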

PET masking qualitative results

Qualitative results were evaluated across all subjects and using an iteration of the model trained from a single data split. The qualitative results did not change in a significant manner with repeated training. In qualitative analysis of PET masking results, the region-of-interest mask correctly identified the foreground, while we have found that in all but 3 subjects, 1 with LL HPTT and 2 LR HPTT, that the mask completely obscured (masked) the original location of HPTT on masked PET. In the 3 subjects with visible HPTT in the masked PET in the original location, the mask still partially obscured the HPTT, as seen in Figure 3, rows d), f) and g).

Figure 2 shows a typical example of mRN10 masking, where HPTT was masked and cannot be distinguished in masked PET image. The network correctly classified the subject in Figure 2 as having lower right HPTT. The region of air outside the patient is masked to approximately 25% of the original PET signal, with mask having a value of approximately 0.25. The high signal from the salivary glands is masked in all cases, whereas signal from the thyroid gland is only partially masked in all cases, as seen in Figure 3.

Figure 2. Example of novel masked-PET Resnet10 model (mRN10) masking of PET signal in a subject with parathyroid adenoma in the region of the lower right parathyroid gland (black arrow in row (C)). Each row represents a different slice through the preprocessed [¹⁸F]fluorocholine PET/CT (FCH-PET) images: (A) mandibular region, (B) upper neck region, (C) lower neck region containing the parathyroid adenoma. The first column shows a pre-processed PET/CT image (64 × 64 × 32 matrix), where colours toward the “warm” (red) part of the spectrum indicate higher PET signal and colours toward the “cool” (blue) part of the spectrum indicate lower PET signal. The second column shows the mask, where regions coloured toward the red part of the spectrum have higher weights (non-masked) and regions toward the yellow part of the spectrum have lower weights (masked). The third column represents the final masked PET/CT images computed by multiplying the mask with the original PET/CT. The image was correctly classified as containing the adenoma in the lower right region.

Figure 3. Some examples of masking of hyperactive parathyroid tissue (HPTT), which is indicated by an arrow in column (I). The images are shown in the same format as in Figure 2. Rows (D), (F) and (G) represent the only 3 cases where HPTT was not completely masked.

Discussion

The aim of the study was to evaluate the potential of DL models in classifying HPTT presence and location in FCH-PET studies in the setting of PHPT. For our experiments to be representative of results of such a model in practice, we used data of a representative cohort of subjects with PHPT. Classification of FCH-PET studies was performed using multiple common DL models and we found that the simplest among the models tested, RN10, achieved the highest performance. Furthermore, we improved the model’s performance by modifying the architecture to include a region-of-interest masking step, which produced a mask that successfully identified the foreground of PET. The mRN10 achieved superior performance to models of similar size. Overall, given the size of our dataset and achieved performance, we found that the use of deep learning is highly promising in potential evaluation of FCH-PET in PHPT.

Dataset and patient characteristics

Both our patients and the controls had representative demographic characteristics of patients with PHPT, with male-to-female ratio in literature being 1:3 to 1:4 and the peak incidence of 62 ± 13 years.⁵⁴⁻⁵⁷ Therefore, the models were more likely to have learned the correct features to classify HPTT presence and were trained on a relatively representative dataset that would be encountered in real-life application. Representation per quadrant of HPTT in our cohort was also congruous to numbers reported in the literature. Marzouki et al. provide 95% confidence intervals of HPTT ratio per site as follows: lower left 32–51%, lower right 25–42%, upper left 10–23% and upper right 4–15%.⁵⁸⁻⁶⁰

Unfortunately, the dataset was imbalanced with respect to patients vs “controls”. However, obtaining negative FCH-PET studies is difficult due to the high positivity rate of finding HPTT in FCH-PET, since only patients with biochemically confirmed PHPT are imaged. Such patients are highly likely to have visible HPTT, as reported in studies exploring the effectiveness of FCH-PET.³⁻¹³ Since healthy subjects are generally not referred to undergo FCH-PET imaging, the best attempt was made to select the criteria for choosing “controls” among patients with negative visual assessment of FCH-PET. Our controls therefore had negative imaging findings and biochemical criteria for PHPT resolved at follow-up after 6 months without surgical treatment.

Table 4

Performance of several models on CPr task

Model name	mRN10	RN10	Resnet50	Resnet101	Densenet101	PreActResnet101	WideResnet101
Trainable parameters (millions)	33.5	14.3	46.2	85.2	112.9	85.2	85.2
Optimal initial learning rate	0.0136	0.0136	2.15 × 10⁻³	1.47 × 10⁻⁴	0.316	1.47 × 10⁻⁴	2.15 × 10⁻³
Mean CPr AUCROC [95% CI]	0.850 [0.734; 0.998]	0.812 [0.716; 0.994]	0.754 [0.624; 0.980]	0.527 [0.410; 0.639]	0.703 [0.606; 0.905]	0.739 [0.486; 0.998]	0.752 [0.653; 0.966]

AUCROC = area under the receiver operating characteristic curve; CPr = classification of presence; mRN10 = novel masked-PET Resnet10 model; RN10 = baseline Resnet10 model

For ground truth location, histopathological results were used as opposed to expert visual assessment of FCH-PET, in order to simulate real-world use of the models in guiding surgical removal of HPTT.

Deep-learning model architecture

We have chosen the 3D Resnet10 as our baseline model since multiple research groups have shown it provides promising results in classification tasks on both medical and non-medical images and is the basis of modern architectures.⁴¹,⁶¹⁻⁶³ Resnet10 also achieved the highest performance among the models tested. The other tested models with more parameters performed worse, as they seemed overparameterized and likely learned aberrant features, thus overfitting to the training data. Not many studies explore this phenomenon in detail, but a similar phenomenon was noted in the results of a recent study of Bailly et al.⁶⁴ studying the effects of dataset size, dataset complexity, and model complexity on performance.

The main motivation behind the design of mRN10 and implementation of masking is the way experts interpret FCH-PET. Experienced nuclear medicine physicians know that HPTT usually appears around the thyroid region, and we wanted to allow for the model to learn to mask regions that were deemed unimportant for classification. Furthermore, these unimportant regions (e.g., muscle) commonly produced high intensity PET signal that might affect the classifier. Using end-to-end training with only cross-entropy classification loss, we allowed the network to learn to mask these unimportant regions in an unsupervised manner by carefully tailoring the architecture. Given how experts interpret FCH-PET, mRN10 was an attempt to integrate expert knowledge into the model to improve the Resnet10 classifier.

The Unet was chosen as the masking architecture as we deem our masking to be a task that is comparable to segmentation. For the activation function, we used tanh (hyperbolic tangent), since it was shown to be more stable in backpropagation compared to the sigmoid function.⁶⁵ Since our initial goal was to mask unimportant parts of the image, and tanh is a function bound between –1 and 1, we used tanh + 1, such that regions where the Unet output was very negative were close to 0 and subsequently masked when multiplied by the PET signal intensity. The use of batch normalisation layers in the downstream Resnet10 in mRN10 ensures stable training even when masked PET is the input, which is not explicitly normalized a priori. The masking Unet was trained end-to-end along with Resnet10 in the mRN10 architecture for optimal performance of the classification task. This was an attempt to explain the classification decision of the classifier by allowing it to optimize for masking of unimportant parts of the image as well as increase the performance by improving the conditioning of the input data to the classifier.⁴⁶

Classification results

One of the goals of the study was to compare the model’s performance to nuclear medicine experts. The task of detecting and localizing HPTT on FCH-PET is relatively “trivial” for human experts, with reported accuracies of up to 98%.³⁻¹³ We therefore felt that a small dataset would be sufficient for training a model to similar performance. However, the results differed from our expectations, as the achieved performance was significantly below that of humans for both of our tasks. It is most likely that by increasing the dataset to several hundred subjects, the performance gap would be closed.

Given the size of our dataset, our results are comparable to other published studies on other medical imaging related tasks. The study with a similarly sized dataset (85 subjects) in the classification of cardiac sarcoidosis by Togo et al. achieved sensitivity and specificity of 84% and 87%.⁶⁶ In line with the established best practice, Lu et al. explored the diagnosis of Alzheimer disease from PET and MRI images using a multimodal approach on a dataset of 397 subjects and achieved 93% accuracy at detecting Alzheimer disease; Ma et al. used a DL method to classify thyroid diseases from SPECT with a dataset of more than 2000 subjects and achieved accuracy of up to 100% for some tasks.⁶⁷,⁶⁸ Because the aforementioned tasks are different and generally have different difficulty compared to ours, these comparisons and potential conclusions are hypothetical, but they give us a rough estimate of the number of subjects needed to substantially improve the performance of our model.

We feel that by increasing the size of our dataset to several hundred patients, similar levels of performance metrics to human performance could most likely be achieved. One supporting data point for this assumption is that the upper-bound of the 95% CI of AUC in the population statistics of 50 model iterations used in experiments was 0.998. Given the right data split, the model could perfectly classify the test set.

PET mask discussion

Qualitatively, we observed interesting properties of the mask created using the UNet, with examples depicted in Figures 2 and 3. In Figure 2 row a), we can see that the physiological signal from the salivary glands was masked, and the weak signal of the paravertebral musculature is amplified. In row b), the physiological signal from the red marrow in the vertebral body was masked and signal from the neck musculature on the left was enhanced. In row c), the physiological signal from the thyroid gland and paravertebral musculature were masked, contradicting findings in row a). The model likely learns to amplify the weak signal from the musculature with low uptake of FCH and to suppress strong signal from salivary glands and certain muscle groups with high uptake.

The physiologically high PET activity in salivary glands and the thyroid were correctly masked. This is likely because there is usually high PET activity in these regions. The masking of the thyroid region is especially problematic since the signal from HPTT can also be masked along with the thyroid. This resulted in HPTT being masked in all but 3 cases, as shown in Figure 3. Still, this did not always result in a false classification of the HPTT location. The parathyroid adenoma in row c) is crucial to the task for experts and yet it was masked in this case by the network. Even though the model masked the adenoma, the mRN10 model output in this case was still correct (lower right adenoma location). It is likely that UNet learns to encode the information of adenoma into the mask that is passed to the Resnet10.

Regions near the skin and the skin itself were always enhanced – we assumed that this was an important signal to the model, as the skin-air interface exhibits high contrast on PET and CT and acts as a rough anatomical landmark. It is also much higher in contrast than the soft tissue interfaces of the structures in the parathyroid region and produces stronger gradients in training. The region outside the patient (air) was not masked to 0, but to approximately 25% of the signal (the value of the mask was 0.25). Since this region is irrelevant to the classification, it likely does not produce a gradient in training, so the UNet output for this region stays closer to its initialization state.
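The relationship between a mask value and the retained signal can be illustrated with a minimal sketch. It assumes that the mask is applied to the PET volume by voxel-wise multiplication, which is suggested by the statement that a mask value of 0.25 corresponds to approximately 25% of the signal but is not confirmed in this section. The function name `apply_mask` and all numeric values are hypothetical and are not taken from the study data.

```python
def apply_mask(pet, mask):
    """Voxel-wise product of PET intensities and mask values.

    Both inputs are flattened to 1-D lists here for simplicity.
    Assumption: masking is multiplicative (not confirmed by the source).
    """
    if len(pet) != len(mask):
        raise ValueError("pet and mask must contain the same number of voxels")
    return [p * m for p, m in zip(pet, mask)]


# Hypothetical toy voxels: air, high-uptake tissue, unchanged tissue, weak-uptake tissue
pet = [0.04, 8.0, 2.0, 0.5]
# Mask values below 1 suppress the signal, values above 1 amplify it
mask = [0.25, 0.5, 1.0, 2.0]

masked_pet = apply_mask(pet, mask)
print(masked_pet)
```

Under this assumption, mask values below 1 correspond to the suppression described for the thyroid and salivary glands, and values above 1 to the enhancement described for the skin and for weakly active musculature.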

We find the obtained masks to be interpretable in terms of optimizing the downstream Resnet10, yet they did not enhance the HPTT signal on the masked PET as could be expected. Highly active PET regions (thyroid, salivary glands) were always masked. The regions which produced high PET activity only in some subjects (musculature) were masked only if they produced high PET activity (Figure 2, row c); otherwise, these regions were enhanced (Figure 2, row a), introducing noise to the masked PET. This further makes the masked PET uninterpretable, as the intensity of the introduced noise is higher than the masked signal from the parathyroid adenoma, which can itself be masked. However, in terms of optimizing the Resnet10 classification performance, these findings make sense, since the mechanism acts to adaptively scale the inputs to stabilize the Resnet10 classifier.

While the proposed mRN10 model, which uses Unet and Resnet sequentially for region-of-interest identification and classification, respectively, somewhat resembles state-of-the-art region proposal algorithms, we have not found such a model presented in the existing literature. Firstly, it is unlikely that such an architecture would achieve superior performance on other tasks, as Resnet is a good classifier on its own if it is trained on a large enough database.²⁰,⁴¹ Secondly, the masking results we achieved did not appear to consistently add value to FCH-PET interpretation when explored by humans. According to our results, however, the mask can be clearly interpreted in terms of optimizing downstream Resnet10 performance.

Namely, we found the mRN10 to be superior in performance to the RN10 in the CLoc task. This is probably due to the improved conditioning of the masked input to the Resnet10 in mRN10, leading to increased stability, which in turn increases the performance of the trained model.⁴⁶

Limitations of the study

In the model selection, we found that the model with the lowest number of parameters performed the best. This is one limitation of our study, since experiments with even simpler models were not carried out. Another potential performance improvement could come from transfer learning, but we have not found suitable pretrained models for FCH-PET images.

Our PET masking was an attempt to make the model more interpretable. The most notable similar mechanisms in the literature are attention mechanisms.⁶⁹ The main problem with most attention mechanisms is that they rely on weighting of image features, which are obtained by embedding a small image patch into a vector. Because of this, the spatial resolution of the attention map is limited by the size of the image patch, which is commonly 16 × 16 in vision transformers.⁷⁰ By analogy, if we used 16 × 16 × 16 patches for our theoretical attention, the feature map of our entire image would have spatial dimensions of 4 × 4 × 2, which is too low for a sufficiently detailed interpretation. Another method of explaining the model output is class activation mapping (CAM), which also relies on feature embeddings before the fully connected layers and therefore entails a loss of spatial resolution;⁷¹ in the case of the RN10, the CAM resolution would be 4 × 4 × 2. Gradient-based attribution methods, which do provide pixel-level (or in our case voxel-level) attribution of the input to the model output, have received criticism due to their inconsistency and poor theoretical foundations.⁷²
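The resolution argument above can be checked with a short calculation. The sketch below assumes a model input of 64 × 64 × 32 voxels; this size is inferred here only from the 4 × 4 × 2 grid quoted above for 16 × 16 × 16 patches and is not stated in this section. The helper `patch_grid` is hypothetical.

```python
from math import ceil


def patch_grid(image_shape, patch_shape):
    """Number of non-overlapping patches along each axis of an image."""
    return tuple(ceil(size / patch) for size, patch in zip(image_shape, patch_shape))


# Assumed input size, inferred from the 4 x 4 x 2 grid quoted in the text
image_shape = (64, 64, 32)
patch_shape = (16, 16, 16)

grid = patch_grid(image_shape, patch_shape)
voxels_per_cell = patch_shape[0] * patch_shape[1] * patch_shape[2]

print(grid)             # (4, 4, 2)
print(voxels_per_cell)  # 4096
```

Under this assumption, each cell of such an attention map would summarize 4,096 voxels, which illustrates why a patch-based map is too coarse to highlight a structure as small as a parathyroid adenoma.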

Conclusions

We provide extensive experiments in deep learning analysis of FCH-PET using the standard classification model RN10 and a novel architecture tailored to the task. As deep learning for FCH-PET analysis in PHPT has, to our knowledge, not yet been described in the literature, our experiments provide a baseline for future work. Even though the performance achieved was inferior to that of human experts, the results seem very promising considering the small dataset and the achieved accuracy of 83% for detecting HPTT and 74% for localizing the quadrant of HPTT.

eISSN: 1581-3207
Language: English

Publication frequency: 4 times per year
Journal subjects: Medicine, Clinical Medicine, Internal Medicine, Haematology, Oncology, Radiology
