Open Access

Summary of meta-analyses of studies considering lesion size cut-off thresholds for the assessment of eligibility for FNAB and sonoelastography and inter- and intra-observer agreement in estimating the malignant potential of focal lesions of the thyroid gland



Introduction

This paper, which is the second part of a meta-analysis review, presents the role of sonoelastography, a new technique for assessing the deformability of thyroid nodules (TN), in the differential diagnosis of focal thyroid lesions. Ultrasound examination (US) is a relatively subjective method of thyroid imaging, depending on the skills of the examiner, the experience of the centre and the quality of the equipment used(1,2). As a consequence, inconsistencies may occur between the results obtained by different examiners (inter-observer variability) and by the same examiner (intra-observer variability). The authors also present a review of the literature on inter-observer and intra-observer variability in the assessment of different ultrasound imaging features of focal lesions in the thyroid gland, and an analysis of the lesion size cut-off thresholds that form the basis for FNAB eligibility assessment.

Sonoelastography: a literature review

Thyroid cancer is a tumour with a steadily increasing incidence. It accounts for 7% to 15% of the focal lesions detected by ultrasonography, depending on age, gender and other factors affecting its occurrence. FNAB is an essential method for establishing the diagnosis, but on account of its limitations (false positive, false negative and non-diagnostic results) sonoelastography is seen as a non-invasive technique useful in differentiating the nature of lesions and monitoring them after FNAB(3,4). Malignant lesions have been shown to deform less than most benign lesions. Relative strain elastography (SE), one of the two main types, requires compression of the evaluated tissues with the transducer, or relies on arterial pulsation or respiratory movements. As such, it is a technique dependent on the experience of the examiner. On the other hand, shear wave elastography (SWE) is a new-generation technique that uses an acoustic pulse to generate a transverse wave, the velocity of which is measured in the tissue and used for its characterisation. Quantitative measurements of tissue stiffness are expressed in kilopascals (kPa) or metres per second (m/s)(4). Unlike SE, this type of sonoelastography does not require compression and is less dependent on the examiner. Sonoelastography is recommended by the European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB), the World Federation for Ultrasound in Medicine and Biology (WFUMB), and Polish scientific societies(4, 5, 6), even though it is not a feature required to assign a category in the ACR-TIRADS, EU-TIRADS or K-TIRADS variants of the Thyroid Imaging Reporting and Data System (TIRADS). However, the deformability of focal thyroid lesions as an independent feature differentiating the nature of focal lesions has been evaluated in numerous publications and meta-analyses.
In the 2018 EFSUMB (European Federation of Societies for Ultrasound in Medicine and Biology) guidelines, sonoelastography is recommended as a useful technique for differentiating the nature of focal lesions and monitoring lesions verified as benign. Reduced deformability is a predictor of increased risk of malignancy and carries a recommendation for biopsy.

For their 2020 meta-analysis(7), the authors included papers in which studies were performed on three different available SWE ultrasound machines: SuperSonic shear wave elastography (2D-SWE; Aix-en-Provence, France), Virtual Touch imaging and quantification (VTIQ; Siemens Medical Solutions, Mountain View, CA) and Toshiba shear wave elastography (T-SWE; Toshiba Medical Systems, Tochigi, Japan). A total of 26 studies from 2010–2017 were included in the meta-analysis (verification of lesions by FNAB/observation or histopathological verification), with a total of 3,806 focal lesions analysed, of which 2,428 were benign and 1,378 malignant. The 2D-SWE technique dominated (10 publications), followed by four papers on the VTIQ technique and three papers on the T-SWE technique. The results of statistical analysis are summarised in Tab. 1. In conclusion, the authors note that the results obtained using the 2D-SWE technique may be an independent predictor of the risk of malignancy in TNs.

Tab. 1. Summary of sensitivity, specificity and SROC for individual SWE subtypes

| | T-SWE | VTIQ | 2D-SWE |
|---|---|---|---|
| Sensitivity (95% CI) | 0.77 (0.70–0.83) | 0.72 (0.67–0.77) | 0.63 (0.59–0.66) |
| Specificity (95% CI) | 0.76 (0.72–0.81) | 0.81 (0.78–0.84) | 0.81 (0.79–0.83) |
| SROC | 0.84 | 0.85 | 0.88 |

T-SWE – Toshiba shear wave elastography; VTIQ – Virtual Touch imaging and quantification; 2D-SWE – SuperSonic shear wave elastography; SROC – summary receiver operating characteristic

In a subsequent meta-analysis(8) and literature review, the authors assessed the diagnostic value of the 2D-SWE method alone. They analysed a total of 2,851 focal thyroid lesions (1,092 malignant and 1,759 benign) based on 14 papers, six of which were included in the previous meta-analysis. Malignant neoplastic lesions accounted for 38.3% of all lesions. The overall sensitivity, specificity and AUC (area under the curve) were 0.66 (95% CI: 0.64–0.69), 0.78 (95% CI: 0.76–0.80), and 0.851, respectively, and were similar to those reported in Tab. 1. The authors noted the relatively low diagnostic sensitivity of the technique and the high heterogeneity of results. The range of the cut-off point values between benign and malignant lesions was extensive. For the mean values, it varied from 18.7 kPa to 56.1 kPa (for benign lesions) and from 31.69 kPa to 174 kPa (for malignant lesions). In another meta-analysis(9), in which the authors evaluated both elastographic techniques, a total of 2,063 benign lesions and 598 malignant neoplastic lesions, verified by histopathological examination, were assessed. For SE (12 papers), the authors found an overall sensitivity of 0.84 (95% CI, 0.76; 0.90) and specificity of 0.90 (95% CI, 0.85; 0.94), which is significantly higher than for the conventional ultrasound technique. For SWE (10 papers, of which 2 were included in the Nattabi HA meta-analysis and 1 in the Filho RHC meta-analysis), a sensitivity of 0.79 (95% CI, 0.73; 0.84) and specificity of 0.87 (95% CI, 0.79; 0.92) were obtained. The AUC for SE was 0.94 (95% CI, 0.91; 0.96), while for SWE it was 0.83 (95% CI, 0.80; 0.86). The difference was statistically significant (p <0.01). The authors noted the higher accuracy and specificity of SE compared to the SWE technique.

In their section on limitations, the authors of the meta-analyses highlight the high percentage of malignant lesions, varied dimensions of the lesions, selected groups of patients referred for surgery or FNAB, and differences in the examination technique used, relating primarily to the use of compression. Moreover, the majority of malignant lesions were papillary carcinomas.

Size cut-off thresholds in the assessment of FNAB eligibility: a literature review

The use of ultrasound risk stratification systems (US RSSs) serves to categorise focal thyroid lesions according to their ultrasound pattern. These systems, irrespective of the adopted qualification principle, divide lesions evaluated by ultrasound into groups from the lowest to the highest risk of malignancy. The most important aim of using US RSSs is to reveal lesions with the highest risk of malignancy. Depending on the category to which a focal lesion is assigned, the risk ranges from 0 to 90%. It should be emphasised, however, that the US features underlying the qualification to the category of high risk of malignancy, repeatedly discussed in the paper, in practice refer to the US features of papillary carcinoma and, to a lesser extent, of medullary carcinoma. Unfortunately, US imaging does not make it possible to categorise cases of follicular carcinoma into high-risk groups, especially in cases of microinvasive carcinomas. Considering the prevalence of each type of cancer in the population (PTC (papillary thyroid carcinoma): ca. 85%, MTC (medullary thyroid carcinoma): 3–5%, FTC (follicular carcinoma): 2–5%, PDTC (poorly differentiated thyroid carcinoma): 6%, ATC (anaplastic thyroid carcinoma): 1%)(10), it should be emphasised that even though this is a minor limitation to the widespread use of US RSSs, it must be widely known among examiners performing thyroid US. Thus, the categorisation of a focal lesion as high risk should be considered as an indication for FNAB. Another, no less important, purpose of using US RSSs is the ability to detect lesions with a benign US pattern or low risk of malignancy. This translates, in practice, into a reduction in the number of FNABs performed, an effect that some researchers consider to be a more important benefit of using US RSSs than identifying malignant lesions.
The key issue in this context, in addition to the definition of individual risk categories, becomes the assignment of appropriate cut-off thresholds for the size of lesions that are the basis for FNAB eligibility assessment.

Assuming that FNAB is performed for all focal lesions, a sensitivity of up to 100% is achieved (all malignancies detectable by FNAB are detected), but the specificity of the examination will be very low, which in practice indicates a very high number of biopsies performed in lesions with a benign US pattern.
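The trade-off described above can be illustrated with a minimal sketch. The cohort, the `Nodule` type and the function `biopsy_metrics` are hypothetical and serve only to show the arithmetic; they are not data or methods from any cited study. The rule "biopsy if size is at least the threshold" is treated as the test, and the share of biopsies performed in benign lesions is used as a simplified stand-in for the rate of unnecessary biopsies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nodule:
    size_mm: float   # largest diameter on ultrasound (hypothetical value)
    malignant: bool  # final diagnosis (hypothetical value)


def biopsy_metrics(nodules, threshold_mm):
    """Return (sensitivity, specificity, share of biopsies in benign lesions)
    for the rule 'perform FNAB when size_mm >= threshold_mm'."""
    tp = sum(n.malignant and n.size_mm >= threshold_mm for n in nodules)
    fn = sum(n.malignant and n.size_mm < threshold_mm for n in nodules)
    fp = sum((not n.malignant) and n.size_mm >= threshold_mm for n in nodules)
    tn = sum((not n.malignant) and n.size_mm < threshold_mm for n in nodules)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    unnecessary = fp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity, unnecessary


# Hypothetical cohort: 3 malignant and 5 benign nodules.
cohort = [
    Nodule(6, True), Nodule(12, True), Nodule(18, True),
    Nodule(5, False), Nodule(8, False), Nodule(11, False),
    Nodule(16, False), Nodule(25, False),
]

# A threshold of 0 mm corresponds to performing FNAB for every lesion.
for threshold in (0, 10, 15):
    sens, spec, unnecessary = biopsy_metrics(cohort, threshold)
    print(f"threshold {threshold:>2} mm: sensitivity={sens:.2f}, "
          f"specificity={spec:.2f}, unnecessary biopsies={unnecessary:.2f}")
```

In this toy cohort, a threshold of 0 mm gives a sensitivity of 1.00 and a specificity of 0.00, and each increase in the threshold trades sensitivity for specificity.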

Introducing restrictions on biopsy eligibility will always affect all assessed statistical parameters to a greater or lesser extent, and have a practical impact on the number of cancers diagnosed and the proportion of unnecessary biopsies performed. In the 2019 paper by Dobruch-Sobczak et al.(11), adopting cut-off thresholds for the size of focal lesions eligible for FNAB according to the EU-TIRADS classification resulted in a situation where 35% (81/229) of thyroid cancers would not undergo cytological verification. However, it needs to be strongly emphasised that 33.8% of these cases (72/213) involved lesions classified as category 5 and <1 cm. Ha et al., in their study(12) based on a retrospective analysis of 3,323 thyroid nodules, showed that the risk of cancer in lesions <1 cm was almost twice as high as in lesions measuring >1 cm (62.5% (535 of 856) vs 37.5% (321 of 856); p <0.001). The authors conducted a simulation to evaluate changes in diagnostic efficiency and the proportion of unnecessary biopsies depending on the cut-off threshold used for the size of TNs eligible for biopsy. They compared the sensitivity, specificity, accuracy, and percentage of unnecessary biopsies that characterise the ATA 2015 and KTA/KSThR 2016 systems with the diagnostic efficiency of six simulations differing in the biopsy eligibility cut-off thresholds applied in each risk category. The most striking change observed in the study was a decrease in sensitivity of more than 20 percentage points (ATA 2015 92.5% vs 67%; KTA/KSThR 2016 93.5% vs 66.4%) after increasing the cut-off threshold from 1.0 to 1.5 cm in the intermediate risk categories.
This decrease was independent of raising the cut-off threshold to 2.5 cm in the low/very low risk categories according to ATA 2015 and in the low-risk and benign categories according to KTA/KSThR 2016, a change which increased the specificity from 34.0% to 47.7% for ATA 2015 and from 28.7% to 56.3% for KTA/KSThR 2016, while significantly reducing the percentage of unnecessary biopsies: from 55.1% to 43.6% for ATA 2015 and from 59.5% to 36.4% for KTA/KSThR 2016.
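The size of these shifts is easier to compare when expressed in percentage points. The short sketch below only re-computes differences from the figures quoted above; the dictionary layout and the helper `change_pp` are illustrative assumptions, not part of the cited study.

```python
# (before, after) values in percent, as quoted in the text for the simulation
# by Ha et al. The sensitivity pair refers to raising the threshold from 1.0 to
# 1.5 cm in the intermediate risk categories; the specificity and unnecessary
# biopsy pairs refer to raising it to 2.5 cm in the low-risk categories.
reported = {
    "ATA 2015": {
        "sensitivity": (92.5, 67.0),
        "specificity": (34.0, 47.7),
        "unnecessary biopsies": (55.1, 43.6),
    },
    "KTA/KSThR 2016": {
        "sensitivity": (93.5, 66.4),
        "specificity": (28.7, 56.3),
        "unnecessary biopsies": (59.5, 36.4),
    },
}


def change_pp(before, after):
    """Change in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


changes = {
    system: {metric: change_pp(*pair) for metric, pair in metrics.items()}
    for system, metrics in reported.items()
}

for system, metrics in reported.items():
    for metric, (before, after) in metrics.items():
        delta = changes[system][metric]
        print(f"{system}, {metric}: {before}% -> {after}% ({delta:+.1f} pp)")
```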

These data indicate the need to adopt a different priority depending on the ultrasound risk group. This should translate into striving for maximum sensitivity in the high-risk group, so that the greatest possible number of malignant lesions is detected regardless of the rate of unnecessary biopsies, while in the low-risk group maximum specificity should be sought by reducing the number of cytological examinations performed. The treatment of differentiated thyroid carcinomas has recently changed, in particular in cases of papillary carcinoma at stage T1a, which is no longer an absolute indication for total thyroid resection. The adoption of a cut-off threshold of 10 mm in the high-risk group significantly reduces the chances of a less extensive operation, such as thyroid lobectomy with isthmus resection, and increases the risk of recurrence and death(13). This indicates the need to diagnose cancers up to 10 mm in diameter, while taking a more liberal approach regarding the indications for biopsy in lesions with a low risk of malignancy. In such cases, based on consultations with the patient, management could even be restricted to active ultrasound surveillance.

Inter- and intra-observer agreement in the assessment of individual ultrasound imaging features of focal thyroid lesions

Inter-observer agreement is most commonly expressed as the kappa coefficient (Cohen’s kappa, Fleiss’ kappa, Randolph’s kappa)(14). Less commonly used are Krippendorff’s alpha(15) and the intraclass correlation coefficient (ICC)(16). The interpretation of the values of the most commonly used kappa coefficient according to Landis and Koch(17) is shown in Tab. 2.

Tab. 2. Interpretation of the kappa coefficient values according to Landis and Koch(17)

| Range of kappa values | Interpretation of the degree of agreement |
|---|---|
| <0.00 | poor |
| 0.00–0.20 | slight |
| 0.21–0.40 | fair |
| 0.41–0.60 | moderate |
| 0.61–0.80 | substantial |
| 0.81–1.00 | almost perfect |
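As an illustration of how a kappa value is obtained and read against Tab. 2, the following minimal sketch computes Cohen's kappa for two examiners and maps it to the Landis and Koch categories. The ratings are hypothetical, and the function names `cohens_kappa` and `landis_koch` are introduced here only for the example; the cited studies may have used other software and other kappa variants.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two examiners rating the same lesions."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of lesions on which both examiners agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each examiner's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch categories of Tab. 2.
    Values falling between the tabulated ranges (e.g. 0.205) are assigned
    to the lower category; this is an assumption of the sketch."""
    if kappa < 0.0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical margin assessments of 10 lesions by two examiners.
examiner_1 = ["irregular"] * 4 + ["regular"] * 6
examiner_2 = ["irregular", "irregular", "irregular", "regular", "irregular",
              "regular", "regular", "regular", "regular", "regular"]

kappa = cohens_kappa(examiner_1, examiner_2)
print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

In this example the examiners agree on 8 of 10 lesions, yet the chance-corrected agreement is only moderate, which is why kappa rather than raw percentage agreement is reported.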

A number of studies have been published on the inter-observer agreement in the assessment of individual thyroid ultrasound imaging features, which ranges from poor to almost perfect, depending on the feature and study(18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33). Liu et al.(34) conducted a meta-analysis of seven studies assessing inter-observer agreement published up to December 2018, including a total of 927 patients(18, 19, 20, 21, 22, 23). They calculated the pooled agreement between examiners in the assessment of individual features in thyroid ultrasound images, with the following results: substantial agreement for structure (0.61; 95% CI: 0.55–0.66) and presence of calcifications (0.71; 95% CI: 0.65–0.77), moderate agreement for echogenicity (0.58; 95% CI: 0.51–0.64), shape (0.53; 95% CI: 0.45–0.62), and the presence of echogenic foci, including punctate echogenic foci/microcalcifications, macrocalcifications, peripheral calcifications, and comet tail artefacts (0.43; 95% CI: 0.32–0.54), and fair agreement for margins (0.40; 95% CI: 0.32–0.48). The inter-observer agreement depended, among other factors, on the professional experience of the examiners(34). The majority of studies included in the meta-analysis were single-centre studies(19, 20, 21, 22, 23).

An overview of studies published since January 2019 (not included in the Liu et al. meta-analysis) is presented in Tab. 3. The studies presented are difficult to compare in view of differences in study design (retrospective/prospective, single-centre/multi-centre, varying number of examiners, different percentages of nodules verified as benign and malignant by biopsy or histopathology, various features assessed by ultrasound)(24, 25, 26, 27, 28, 29, 30, 31, 32). Nevertheless, some conclusions can be drawn. As shown in Tab. 3, some features in the ultrasound image were characterised by higher agreement than others. The feature with the highest inter-observer agreement was nodule structure, assessed in most studies as substantial(24,27,30,31) or moderate(25,26,29,31). In contrast, the assessment of the presence of comet tail artefacts was characterised by the lowest level of agreement, rated as slight in the majority of studies(26,29,32). The degree of inter-observer agreement for margins was not much better, rated as slight(28,30,32), fair(25,26,29) or moderate(24,27,31). These findings are consistent with the outcomes of the meta-analysis conducted by Liu et al.(34), and thus demonstrate a distinctive general trend.

Tab. 3. Comparison of studies published since 2019 assessing inter-observer agreement in the assessment of specific ultrasound imaging features of focal thyroid lesions

Basha 2019Dobruch-Sobczak 2019Itani 2019Lam 2019Pang 2019Persichetti 2020Phuthharak 2019Seifert 2020Wildman-Tobriner 2020
Number of nodules assessed38020180463189100108(40 80 + 40)100
Number of researchers3543272415
StatisticsFleiss’ κCohen’s κCohen’s κRandolph’s κCohen’s κCohen’s κCohen’s κFleiss’ κFleiss’ κ
Feature on ultrasound examination:
structure0.6360.550.430.660.10–0.6430.530.616S1: 0.476 S2: 0.6740.39
echogenicity0.7500.48–0.5010.2520.350.24–0.5340.470.327S1: 0.440 S2: 0.6220.39
shape0.8680.300.280.47S1: 0.537 S2: 0.6760.38
margins0.5240.390.230.500.07–0.1450.330.143S1: 0.431 S2: 0.7960.18
halo20.410.50
hyperechogenic foci0.5980.770.288
microcalcifications0.9570.570.270.390.470.28
macrocalcifications0.9740.610.490.380.41
peripheral calcifications0.6040.390.330.650.26
total calcifications0.38S1: 0.405 S2: 0.424
comet tail0.8850.060.110.08
vascularisation0.2110.340.46
extra-thyroidal infiltration1.0000.400.820.24

S1 – session 1; S2 – session 2 (conducted after the examiners had discussed all cases from session 1 together)

1 Feature not assessed in the study.

2 Features such as echogenicity compared to thyroid parenchyma (κ = 0.48), dominant echogenicity compared to thyroid parenchyma (κ = 0.50), and echogenicity compared to muscle (κ = 0.49) were evaluated separately.

3 The following features were evaluated separately: solid structure (κ = 0.64), partially cystic with suspicious features (κ = 0.10), partially cystic with eccentric solid area (κ = 0.54), partially cystic without suspicious features (κ = 0.17), spongiform (κ = 0.62).

4 The following features were assessed separately: nodule significantly hypoechogenic (κ = 0.33), hypoechogenic (κ = 0.53), isoechogenic (κ = 0.24), and hyperechogenic (κ = 0.31).

5 The study separately assessed the following features: irregular margins (κ = 0.07), regular margins (κ = 0.14).

The degree of agreement between observers depends on their professional experience(2,21,34) and improves after training sessions involving joint viewing of ultrasound images and discussions held among participating examiners to reach agreement(2,19,31).

Far fewer studies are available to assess the intra-observer agreement for individual thyroid ultrasound features(22,25,29,33). The reproducibility of the results of repeat examinations performed by the same examiner in the cited studies was mostly rated as substantial or almost perfect (kappa value ≥0.61)(22,25,33). A lower degree of intra-observer agreement was found in the study by Persichetti et al., with kappa values reported as follows: 0.62 for vascularisation, 0.58 for structure, 0.60 for echogenicity, 0.55 for microcalcifications, 0.54 for macrocalcifications, 0.47 for comet tail artefacts, 0.39 for margins, and 0.35 for shape(29).

In summary, inter-observer agreement in the assessment of individual thyroid ultrasound features varies considerably between centres, ranging from slight to almost perfect. The features with the highest disagreement are lesion margins and comet tail artefacts. The level of intra-observer agreement is higher than inter-observer agreement, but still not satisfactory(29).

Summary

One method to improve inter-observer agreement involves using a standardised glossary of terms to describe focal lesions on thyroid ultrasound(29). Moreover, grading focal thyroid lesions in a structured manner based on dedicated scales/scoring systems (instead of grading individual features) might substantially improve inter-observer agreement(18,19). The present study highlights the need to diagnose cancers up to 10 mm in diameter, while taking a more liberal approach to biopsy indications in low-risk lesions. On the basis of published studies, sonoelastography has been shown to be a technique that should be included in the lexicon of features analysed when deciding whether to perform a biopsy of focal thyroid lesions. It is also a useful modality for monitoring lesions after FNAB. In the future, genetic testing combined with ultrasound features of focal lesions may contribute to improving diagnostic accuracy(35).

eISSN:
2451-070X
Language:
English
Publication timeframe:
4 times per year
Journal Subjects:
Medicine, Basic Medical Science, other