The worldwide incidence of thyroid cancer is steadily increasing(1). According to the American Thyroid Association (ATA), thyroid nodules are a common clinical problem, with nearly 68% of all examined adult patients diagnosed with these lesions. In 7–15% of these cases, the nodules were found to be carcinomas(2). In Poland, 3,529 new cases of thyroid cancer were diagnosed in 2015. The annual incidence rate has increased from 3.8 per 100,000 in 2000 to 9.2 per 100,000 in 2015(3,4).
Although thyroid nodules are a common occurrence, it is usually difficult to detect them without imaging techniques (only 4–7% can be palpated)(5). Thus, ultrasound (US) examinations play an important role in detection. US is a non-invasive, cost-effective, and widely available technique used to discern specific features of nodules and guide fine-needle aspiration biopsy (FNAB)(5).
Published studies, including the ATA Management Guidelines, have demonstrated that hypoechogenicity, irregular margins, microcalcifications, and a taller-than-wide shape are the B-mode features with the highest level of specificity for the detection of malignant thyroid nodules(6–8). However, none of these features, taken individually, are exclusive to malignant lesions, and benign nodules with a single abnormal feature are relatively common(2,9–11).
Thus, new, non-invasive imaging methods capable of supporting the differentiation of thyroid lesions are being developed. Recently, sonoelastography has become an increasingly used technique(12,13).
Currently, two main types of elastography are available: shear wave elastography (SWE) and strain elastography (SE). Some authors have suggested that SWE is superior to SE in thyroid nodule stratification because of its operator independence, but recent meta-analyses have, surprisingly, shown that SE has better diagnostic accuracy than SWE(14,15). In addition, Dighe
Few studies have analyzed inter- and intra-observer variability in US diagnosis, and even fewer have evaluated variability in the case of sonoelastography(5,17–20). Therefore, we investigated these two variabilities in thyroid nodule evaluations carried out with US and sonoelastography.
In this prospective study, patients first gave informed consent to participate in the research. They then underwent US examination of the thyroid, followed by US-guided biopsy or surgical procedures. The study protocol was approved by the institutional review board of the Maria Skłodowska-Curie Memorial Cancer Centre and Institute of Oncology, Warsaw, Poland. From February to November 2017, 92 consecutive patients (22 men, 70 women) with a total of 149 thyroid nodules were included and examined in the Department of Oncological Endocrinology and Nuclear Medicine of the same institution. Of these, 18 patients (4 men, 14 women) aged 21–78 years, with a total of 20 thyroid nodules, were randomly selected for the study. The nodules included eight malignant and 12 benign diagnoses. All malignant lesions and eight benign nodules were confirmed by postoperative histopathology. The remaining four benign nodules, classified as CII in cytology, were not operated on because surgery without clinical indications would have been unethical. In this group, we performed US follow-up at 6-month intervals (Fig. 1).
The inclusion criteria were the presence of a thyroid nodule that underwent US-guided FNAB (according to the Guidelines of Polish National Societies(23), prepared on the initiative of the Polish Group for Endocrine Tumours and the ATA) and was operated or was under active observation including repeated FNAB. The following criteria excluded nodules from further analysis: pure cystic lesions, eggshell calcifications, or non-diagnostic cytology results. The researchers were blinded to the cytological and/or histological results.
Fourteen patients underwent thyroidectomy and FNAB, while four underwent FNAB only. Histologic and cytological findings were used as study endpoints. For patients with benign FNAB results, a US examination was performed within six months. FNABs were performed with 22- to 24-gauge needles, and aspirates were fixed in 75% ethanol and stained with hematoxylin and eosin (H&E). Lesions were assigned to Bethesda categories I–VI(24) based on FNAB findings. FNAB was repeated for nodules classified as CI (non-diagnostic specimen, for example cystic fluid only or an acellular specimen), CIII (atypia of undetermined significance/follicular lesion of undetermined significance, AUS/FLUS), and for small (<1 cm) CIV nodules (suspicion of a follicular lesion). If possible, a specific histotype was suggested. Cytological results (CV and CVI) were verified by an independent, second pathologist. Surgical specimens were immediately fixed in 10% buffered formalin. Representative sections from these specimens were processed and routinely stained with H&E for histopathologic (microscopic) examinations.
Five radiologists (one from an oncology centre and four from a clinical centre), with six to 22 years of experience in thyroid B-mode and color Doppler US (CDUS) imaging and one to seven years of experience in US elastography, performed and assessed the examinations. Before the study began, the radiologists were trained in our lexicon: composition; echo pattern in comparison to thyroid parenchyma (Echo-Pa); dominating echo pattern in comparison to thyroid parenchyma, i.e., the dominating component in the case of mixed echogenicity (Echo-Pb); echo pattern in comparison to muscles (Echo-M); margins; the ‘halo’ phenomenon; extrathyroidal extension (the observers were asked to determine whether the extrathyroidal extension modeled the shape of the thyroid and its capsule or extended beyond it); macrocalcification; microcalcification; and elasticity score (according to the Asteria scale). All features are listed in Table 1. All examinations were performed with the same protocol, described below.
US features | Abbreviations | Characteristics |
---|---|---|
Composition | Composition | Cystic |
Echo pattern (in comparison to thyroid parenchyma) | Echo-Pa | Isoechoic |
Dominating echo pattern (in comparison to thyroid parenchyma) | Echo-Pb | Hyperechoic |
Echo pattern (in comparison to muscles) | Echo-M | Hyperechoic |
Margins | Margins | Well-defined |
“Halo” pattern | “Halo” | Complete |
Extrathyroidal extension | Capsule | Models thyroid shape and capsule |
Macrocalcifications (>1 mm) | Macro | Present |
Microcalcifications (≤1 mm) | Micro | Present |
Vascularity | Vascularity | Peripheral |
Elasticity score (Asteria Scale) | Asteria Scale | 1 – Elasticity in the whole examined area |
Remaining thyroid parenchyma | Parenchyma | Homogeneous |
Autoimmune thyroiditis | AT | Present |
Parenchyma vascularity | Parenchyma vascularity | Normal |
The US probe was gently placed on the thyroid in transverse and longitudinal orientations while the patient was in the supine position. The thyroid gland was scanned from superior to inferior in transverse cross sections and from the outer to the inner margin in longitudinal cross sections. The anteroposterior, transverse, and longitudinal measurements of the gland and nodules were taken on frozen images during the examination and archived. The other B-mode features of the lexicon, as well as CDUS and SE, were assessed retrospectively on archived AVI films and frozen images. CDUS was performed in all cases with the same scale settings (maximal velocity of 2.5 cm/s). The CDUS gain was adjusted for each patient individually to achieve the highest sensitivity without blooming artifacts. In the case of SE, since nodules become stiffer during compression, all radiologists avoided pressing the neck with the probe during examinations to minimize false-positive findings. Grey-scale conventional US with CDUS and SE was performed using an iU22 US machine (Philips Medical Systems, Bothell, WA) equipped with a 5–12 MHz linear array transducer. Sonoelastography was assessed qualitatively using the four-point Asteria scale (Tab. 1)(25). The lesion features assessed in US and SE examinations are listed in Tab. 1. We excluded the shape of the nodule (the taller-than-wide parameter) because this feature is assessed more objectively, by comparing nodule measurements (height and width, taken prospectively in this research). Including this parameter in the analysis of inter- and intra-observer agreement could have overestimated the final results.
From the 149 examined thyroid nodules, the records of 20 nodules were randomly drawn using MS Excel. The 20 original US records from B-mode, CDUS, and SE were duplicated. The resulting 40 records were numbered and arranged randomly in a final file. All researchers received the same set of files for evaluation. The five radiologists first evaluated records (AVI loops and JPG images) containing transverse and longitudinal B-mode cross sections of the thyroid lobes. Next, they assessed the CDUS and SE records (AVI loops and JPG images) of these nodules.
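The duplication and random ordering of records described above can be sketched in a few lines. This is only an illustration, not the procedure actually used in the study (which relied on MS Excel); the function name `build_reading_list`, the record layout, and the seed are hypothetical.

```python
import random


def build_reading_list(nodule_ids, copies=2, seed=None):
    """Duplicate each nodule record and return the copies in random order.

    Returns a dict mapping the blinded record number (1, 2, ...) to a
    (nodule_id, copy_index) pair. Illustrative layout only.
    """
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    records = [(nid, c) for nid in nodule_ids for c in range(1, copies + 1)]
    rng.shuffle(records)       # in-place random permutation
    return {number: rec for number, rec in enumerate(records, start=1)}


# 20 nodules, each appearing twice, give 40 numbered records.
reading_list = build_reading_list(range(1, 21), seed=2017)
```

Because every observer receives the same numbered list, the two copies of a nodule can later be paired to estimate intra-observer agreement, while the shuffled order hides which records are duplicates.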
The scoring results for all five observers were calculated using Statistical Software Package (Dell Inc. (2016)), Dell Statistica (data analysis software system, v. 13.
The kappa values were interpreted according to Landis and Koch(27), i.e., κ <0.00 corresponds to poor agreement, κ = 0.00–0.20 to slight agreement, κ = 0.21–0.40 to fair agreement, κ = 0.41–0.60 to moderate agreement, κ = 0.61–0.80 to substantial agreement, and κ = 0.81–1.00 to almost perfect agreement.
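As an illustration of how a κ-value for two readings of the same records can be obtained and mapped onto the Landis and Koch bands listed above, consider the following minimal sketch. It assumes Cohen's unweighted kappa, which is an assumption on our part for the example; the function names `cohen_kappa` and `landis_koch` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's unweighted kappa for two readings of the same records.

    Returns None when chance agreement equals 1 (both readings use a single,
    identical category), because kappa is then undefined.
    """
    if len(first) != len(second) or not first:
        raise ValueError("readings must be non-empty and of equal length")
    n = len(first)
    p_o = sum(a == b for a, b in zip(first, second)) / n  # observed agreement
    c1, c2 = Counter(first), Counter(second)
    p_e = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / (n * n)  # chance agreement
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map a kappa value (rounded to two decimals) to the Landis and Koch label."""
    k = round(kappa, 2)
    if k < 0.0:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical first and second readings of 20 records for one descriptor.
reading_1 = ["present"] * 10 + ["absent"] * 10
reading_2 = ["present"] * 8 + ["absent"] * 12
kappa = cohen_kappa(reading_1, reading_2)
label = landis_koch(kappa)
```

Rounding to two decimals before classification mirrors the precision at which the bands are defined, so that a value such as 0.205 is not left between two categories.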
Finally, the accuracies of all researchers were assessed and compared, and the mode was determined for every descriptor in this set of data. This value was assumed to be the correct one for a given descriptor. Researchers who agreed with this value were given an accuracy value of 1; the rest were given an accuracy value of 0. Next, the total accuracy score for every researcher was calculated independently for every descriptor.
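The mode-based accuracy score described above can be sketched as follows. The sketch assumes that the mode is taken per record across the five observers and that ties are resolved in favor of the label encountered first; neither detail is stated in the text. The function name `consensus_accuracy`, the observer keys, and the labels are hypothetical.

```python
from collections import Counter


def consensus_accuracy(ratings):
    """Percent of records on which each observer matches the modal label.

    ratings: dict mapping observer name -> list of labels (one per record)
    for a single descriptor. Ties are broken by first occurrence (assumption).
    """
    observers = list(ratings)
    n_records = len(ratings[observers[0]])
    if any(len(ratings[obs]) != n_records for obs in observers):
        raise ValueError("all observers must rate the same records")
    hits = {obs: 0 for obs in observers}
    for i in range(n_records):
        votes = Counter(ratings[obs][i] for obs in observers)
        mode_label = votes.most_common(1)[0][0]  # label treated as correct
        for obs in observers:
            hits[obs] += ratings[obs][i] == mode_label  # 1 if agrees, else 0
    return {obs: 100.0 * hits[obs] / n_records for obs in observers}


# Hypothetical composition ratings of four records by five observers.
example = {
    "obs1": ["solid", "cystic", "mixed", "solid"],
    "obs2": ["solid", "cystic", "solid", "solid"],
    "obs3": ["solid", "mixed", "solid", "cystic"],
    "obs4": ["solid", "cystic", "solid", "solid"],
    "obs5": ["mixed", "cystic", "solid", "solid"],
}
accuracy = consensus_accuracy(example)
```

Note that this score measures agreement with the group consensus rather than with histopathology, so a high value does not by itself imply diagnostic correctness.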
Our randomly selected group consisted of 20 nodules in 18 patients (12 nodules were benign, 8 were malignant). The maximum length of the tumors ranged from 6 to 46 mm (mean length 9.7 ± 5.6 mm). Five of them were PTC (papillary thyroid cancer), two were FTC (follicular thyroid cancer) and one was MTC (medullary thyroid cancer). In the benign group, eight were verified by histology and most (7/8) of them were hyperplastic nodules (Fig. 1).
The results of accuracy assessment of the five radiologists are presented in Table 2. The mean accuracy rates for all radiologists for all features ranged from 82.7 to 87.8%. All radiologists achieved accuracy rates ranging from 65.0 to 100% for B-mode examination, and from 47.4 to 86.8% for SE. The highest level of accuracy among all observers was noted when the following features were analyzed: macrocalcifications (from 90.0 to 100%), microcalcifications (from 85.0 to 100%), and evaluation of echo pattern in comparison to strap muscles (from 87.5 to 95.0%). The intra- and inter-observer variabilities for US, CDUS, and SE features of thyroid nodules are presented in Table 3.
Description | Observer 1 accuracy (%) | Observer 2 accuracy (%) | Observer 3 accuracy (%) | Observer 4 accuracy (%) | Observer 5 accuracy (%) |
---|---|---|---|---|---|
Composition | 92.5 | 95.0 | 77.5 | 72.5 | 90.0 |
Echo-Pa | 80.0 | 80.0 | 90.0 | 90.0 | 82.5 |
Echo-Pb | 92.5 | 82.5 | 87.5 | 95.0 | 87.5 |
Echo-M | 87.5 | 95.0 | 90.0 | 92.5 | 97.5 |
Margins | 82.5 | 87.5 | 72.5 | 82.5 | 92.5 |
“Halo” | 85.0 | 87.5 | 82.5 | 65.0 | 87.5 |
Capsule | 87.5 | 85.0 | 80.0 | 92.5 | 92.5 |
Macro | 100.0 | 92.5 | 97.5 | 90.0 | 97.5 |
Micro | 100.0 | 92.5 | 85.0 | 87.5 | 95.0 |
Vascularity | 65.0 | 95.0 | 85.0 | 97.5 | 75.0 |
Asteria Scale | 78.9 | 86.8 | 68.4 | 47.4 | 73.7 |
Parenchyma | 82.5 | 82.5 | 80.0 | 87.5 | 65.0 |
AT | 77.5 | 80.0 | 92.5 | 97.5 | 100.0 |
Parenchyma vascularity | 75.0 | 87.5 | 70.0 | 77.5 | 85.0 |
Average | 84.7 | 87.8 | 82.7 | 83.9 | 87.2 |
Description | Observer 1 intra-observer agreement (%) | Observer 1 intra-observer K-value (SE) | Observer 2 intra-observer agreement (%) | Observer 2 intra-observer K-value (SE) | Observer 3 intra-observer agreement (%) | Observer 3 intra-observer K-value (SE) | Observer 4 intra-observer agreement (%) | Observer 4 intra-observer K-value (SE) | Observer 5 intra-observer agreement (%) | Observer 5 intra-observer K-value (SE) | Inter-observer K-value (SE) |
---|---|---|---|---|---|---|---|---|---|---|---|
Composition | 95.0 | 0.88 (0.12) | 100.0 | 1.00 (0.00) | 90.0 | 0.83 (0.11) | 95.0 | 0.91 (0.09) | 90.0 | 0.75 (0.17) | 0.55 (0.04) |
Echo-Pa | 75.0 | 0.57 (0.16) | 100.0 | 1.00 (0.00) | 90.0 | 0.83 (0.11) | 100.0 | 1.00 (0.00) | 75.0 | 0.53 (0.17) | 0.48 (0.04) |
Echo-Pb | 85.0 | 0.69 (0.16) | 95.0 | 0.91 (0.09) | 95.0 | 0.89 (0.11) | 100.0 | 1.00 (0.00) | 80.0 | 0.52 (0.21) | 0.50 (0.05) |
Echo-M | 90.0 | 0.80 (0.14) | 100.0 | 1.00 (0.00) | 100.0 | 1.00 (0.00) | 95.0 | 0.84 (0.16) | 90.0 | 0.67 (0.18) | 0.49 (0.04) |
Margins | 85.0 | 0.70 (0.16) | 95.0 | 0.90 (0.10) | 95.0 | 0.88 (0.12) | 95.0 | 0.88 (0.12) | 85.0 | 0.63 (0.20) | 0.39 (0.05) |
“Halo” | 80.0 | 0.68 (0.13) | 85.0 | 0.76 (0.12) | 85.0 | 0.77 (0.12) | 90.0 | 0.67 (0.19) | 75.0 | 0.62 (0.14) | 0.41 (0.04) |
Capsule | 80.0 | 0.64 (0.17) | 95.0 | 0.91 (0.09) | 85.0 | 0.72 (0.15) | 80.0 | 0.59 (0.18) | 75.0 | 0.50 (0.19) | 0.40 (0.04) |
Macro | 100.0 | 1.00 (0.00) | 95.0 | 0.83 (0.17) | 95.0 | 0.64 (0.33) | 100.0 | 1.00 (0.00) | 95.0 | 0.64 (0.33) | 0.61 (0.05) |
Micro | 95.0 | 0.89 (0.11) | 95.0 | 0.90 (0.10) | 90.0 | 0.77 (0.15) | 90.0 | 0.78 (0.14) | 90.0 | 0.74 (0.17) | 0.57 (0.05) |
Vascularity | 90.0 | 0.86 (0.10) | 90.0 | 0.74 (0.16) | 100.0 | 1.00 (0.00) | 95.0 | 0.85 (0.13) | 95.0 | 0.87 (0.13) | 0.34 (0.03) |
Asteria Scale | 78.9 | 0.71 (0.13) | 78.9 | 0.70 (0.13) | 94.7 | 0.92 (0.08) | 73.7 | 0.61 (0.15) | 73.7 | 0.65 (0.13) | 0.33 (0.03) |
Parenchyma | 75.0 | 0.50 (0.19) | 85.0 | 0.69 (0.16) | 100.0 | 1.00 (0.00) | 95.0 | 0.89 (0.10) | 90.0 | 0.69 (0.20) | 0.40 (0.05) |
AT | 95.0 | 0.88 (0.12) | 80.0 | 0.47 (0.23) | 95.0 | * | 95.0 | 0.64 (0.33) | 100.0 | 1.00 (0.00) | 0.25 (0.05) |
Parenchyma vascularity | 80.0 | 0.64 (0.16) | 85.0 | 0.66 (0.17) | 95.0 | 0.92 (0.08) | 85.0 | 0.71 (0.15) | 75.0 | 0.17 (0.26) | 0.18 (0.04) |
Average | 86.0 | 0.74 | 91.4 | 0.82 | 93.6 | 0.86 | 92.0 | 0.81 | 84.9 | 0.64 | 0.42 |
*The data structure did not allow the K-value and SE to be calculated.
Concerning intra-observer variability, almost perfect agreement was noted for three observers: the second, third, and fourth observers achieved mean κ-values of 0.82, 0.86, and 0.81, respectively. Substantial agreement was noted for the first and fifth observers (mean κ-values of 0.74 and 0.64, respectively). Among the nodule features, inter-observer agreement, expressed as κ-values, ranged from 0.61 for macrocalcifications (substantial agreement) to 0.33 for the Asteria criteria (fair agreement).
Inter-observer variability for the majority of US features showed moderate agreement in the estimation of composition (κ = 0.55), echo pattern (Echo-Pa, Echo-Pb, Echo-M) (κ ranging from 0.48 to 0.50), capsule assessment (κ = 0.40) (Fig. 3A), and microcalcifications (κ = 0.57) (Fig. 2). When assessing vascularity, overall agreement was fair (κ = 0.34). The mean inter-observer agreement for all US and SE features was 0.42, corresponding to moderate agreement (Fig. 3B).
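For readers who wish to reproduce a multi-rater agreement coefficient of this kind, the sketch below computes Fleiss' kappa for several observers rating the same records. Which multi-rater statistic was actually used is not specified in the text, so Fleiss' kappa is an assumption made purely for illustration; the function name `fleiss_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def fleiss_kappa(table):
    """Fleiss' kappa for a list of records, each a list of labels (one per rater).

    Every record must be rated by the same number of raters (at least two).
    Returns None when chance agreement equals 1 (a single category overall).
    """
    if not table:
        raise ValueError("at least one record is required")
    n_raters = len(table[0])
    if n_raters < 2 or any(len(row) != n_raters for row in table):
        raise ValueError("each record needs the same number (>= 2) of ratings")
    n_records = len(table)
    totals = Counter()  # label counts over all records and raters
    p_bar = 0.0         # mean per-record agreement
    for row in table:
        counts = Counter(row)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar /= n_records
    p_e = sum((t / (n_records * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return None
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical vascularity labels given by three raters to four records.
ratings = [
    ["peripheral", "peripheral", "peripheral"],
    ["central", "central", "central"],
    ["peripheral", "peripheral", "central"],
    ["peripheral", "central", "central"],
]
kappa_multi = fleiss_kappa(ratings)
```

In this toy example two records show full agreement and two show a 2:1 split, which yields a κ of about 0.33, a value in the "fair" band.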
Ultrasonography is a widely accepted imaging technique that accurately detects thyroid nodules and architectural distortion. Over the past decade, significant improvements have been made in US machine technology and high-resolution probes. Therefore, US features specific to thyroid tumors such as lesion stiffness, microcalcification, vascular pattern or margins, can be observed with high accuracy. These US features enable better stratification of malignancy risk and were used to create several Thyroid Imaging Reporting and Data System (TIRADS) classifications, although none were used in clinical practice in Poland(9,28–32). The primary objective of our study was to evaluate inter- and intra-observer agreement for the selected US and SE features as a first step towards proposing the TIRADS classification.
In our study, the most substantial agreement was obtained when macrocalcifications were evaluated: κ was 0.61 for inter-observer agreement and between 0.64 and 1.0 for intra-observer agreements. For microcalcifications, characterized by stronger associations with tumor malignancy than macrocalcifications, we achieved moderate agreement(33). Therefore, we assessed them separately in our study. Our results are similar to those reported by Park
In another study, in which Park
In this study, assessment of the final results revealed that disagreement regarding microcalcifications appeared in nodules that were more normoechogenic or had hyperechogenic components (in the case of mixed echogenicity, where microcalcifications were present in the hyperechogenic component). This could affect the contrast between the spot-like <1 mm reflections and the surrounding solid parts of the nodule. Unfortunately, this disagreement was found in three malignant lesions: one PTC, one follicular variant of PTC, and one FTC (Fig. 2). The follicular variant of PTC and the FTC were normoechogenic, which could decrease the contrast mentioned above. The PTC was hypoechogenic, but its dimension was under 10 mm, which could be another limitation in the evaluation of microcalcifications.
To assess the echogenicity of a nodule, we compared it with the thyroid parenchyma or the strap muscles, or used the dominant echo pattern. Inter-observer analysis of this parameter revealed moderate agreement (κ-values ranging from 0.48 to 0.50). This result may be partially explained by the complex echogenicity of thyroid tumors and the background parenchyma. Data analysis revealed that, besides complex echogenicity, the structure of the nodule was also important: more disagreement occurred for nodules with a mixed solid-fluid structure. The size of the nodules also mattered. In terms of echogenicity in relation to parenchyma, there was more disagreement for large nodules filling the whole lobe than for smaller ones, possibly because less surrounding parenchyma was available for comparison. Choi
The characteristics of lesion margins are an important feature when evaluating malignancy. When differentiating between benign and malignant thyroid nodules, nodules with circumscribed margins are more likely to be benign. However, this feature has low sensitivity, as 33 to 93% of malignant tumors may also have circumscribed margins(34). It is difficult to identify margins when the surrounding thyroid gland is heterogeneous or the borders of nodules overlap. The results presented by other researchers demonstrated a high degree of inter-observer variability when nodule margins were assessed(11). In our study, the margins could be described as either well-circumscribed or not circumscribed (lobular, spiculated, angular, jagged). Evaluation of this feature resulted in the lowest level of inter-observer agreement among the B-mode features (κ-value of 0.39) and satisfactory intra-observer agreement (κ-values ranging from 0.65 to 0.90). Choi
The subsequent features assessed were the ‘halo’ phenomenon and capsule invasion. Here, observer agreement was moderate, indicating that evaluation of these features is characterized by limited accuracy. Park
In our study, determining lesion stiffness using a 4-grade scale was a difficult task for all observers, as the level of inter-observer agreement was only fair (κ-value of 0.33). Four radiologists experienced in SE assessment achieved accuracy levels from 68.4 to 86.8%. One radiologist, with only one year of experience, achieved an accuracy of only 47.4% (Fig. 3B). This could be caused by differing levels of experience. In published papers, results vary between research centers(19,20). Friedrich-Rust
Our study had several limitations. We included patients from the Department of Oncological Endocrinology and Nuclear Medicine pre-diagnosed with suspicious nodules or in whom carcinomas were detected. Therefore, the group of patients differed from a general screening population; the proportion of malignant lesions in our group was 45%. This is a general limitation of most studies performed in endocrinology and oncology centers, in which there are generally high percentages of malignant cases. The proposed lexicon was very detailed and, despite previous training for all radiologists, some misunderstandings occurred. Our results suggest that too many US features were used for nodule assessment, and further research should reevaluate them. We used operator-dependent strain elastography, which has some limitations: probe placement in relation to the common carotid artery (CCA), with more noise when the probe is in a transverse section close to the CCA; probe compression; the presence of rim calcifications or multiple macrocalcifications covering the nodule; and fluid parts of the nodule. However, although SWE is thought to be more operator-independent, recent reports have pointed out that this technique also has operator-dependent artifacts and limitations(16). In addition, we used a single US machine and did not compare SE from different US systems. It could be assumed that SE systems from different manufacturers would produce different final results, but this should be further analyzed in a prospective study. Hence, the US machine software and hardware should be considered when creating the TIRADS lexicon.
In this study, five radiologists, each with at least six years of experience in thyroid B-mode imaging, assessed 40 records of 20 thyroid nodules, with relatively good inter-observer agreement and excellent intra-observer agreement in the assessment of thyroid nodules using US, and fair agreement in the case of sonoelastography. The highest disagreement was found for capsule invasion, the “halo” phenomenon, and the margins of large nodules, especially those filling most of the thyroid lobe and/or found in the vicinity of the thyroid capsule. In the case of microcalcifications, the differences appeared mostly in normoechogenic nodules or nodules with a hyperechogenic component.
Sonographers must be watchful when assessing margins and capsule invasion in large nodules that fill a significant part of the lobe or lie near the capsule, as well as when assessing microcalcifications in normoechogenic nodules or nodules with hyperechogenic components.
All results suggest relatively good inter-observer and excellent intra-observer agreement in the assessment of thyroid nodules using US, and fair agreement in the case of sonoelastography.