How reliable are AI-Assisted cephalometric programs in assessing measurements involving bilateral landmarks?

Since its introduction by Broadbent and Hofrath in 1931, cephalometry has contributed to the analysis of malocclusions and has become a ‘gold standard’ in orthodontic diagnosis.^1,2 Initially, after manually tracing anatomical features and identifying landmarks on acetate paper, the linear and angular parameters were measured using rulers and protractors to generate a lateral cephalometric analysis. The entire process was laborious, time-consuming, prone to mistakes, and largely dependent on operator skill.^3,4

To address these issues, a number of computerised cephalometric analysis programs have been developed. However, these programs still require the clinician to manually localise landmarks on the digitised cephalometric image.⁵ Artificial intelligence (AI) refers to the ability of machines to perform tasks that normally require human intelligence, such as speech recognition, computer vision, language translation, and other data mappings.⁶ Deep learning, which is a subset of AI, is a method inspired by the human brain that teaches computers to process data effectively. Deep learning models excel at recognising intricate patterns in various data types, including images, text, and sounds, thereby enabling accurate insights and predictions. Recently, deep learning algorithms have gained widespread use for automatically detecting landmarks on lateral cephalograms.⁶

However, two primary concerns remain regarding these algorithms: the number of landmarks that can be identified and the resulting accuracy. A landmark is typically considered accurately detected when the distance between the gold standard position (manual localisation) and the automatically detected position is less than 2 mm, while a distance of less than 4 mm is considered acceptable. Additionally, when bilateral anatomical structures do not overlap exactly, the resulting point should represent the average of the left and right landmark co-ordinates. Previous studies did not clearly describe how this condition was handled, which could affect the accuracy of landmark position and, consequently, the validity of the cephalometric analysis.^6,7
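The two rules described above (averaging non-overlapping bilateral landmarks, and grading a detected position by its distance from the reference) can be illustrated with a minimal Python sketch. The function names and all co-ordinates are hypothetical and are not taken from the study or from any particular software; the label "unacceptable" for distances of 4 mm or more is an assumption, since the text only defines the two lower bands.

```python
from math import hypot

def bilateral_midpoint(left, right):
    """Average the left and right landmark co-ordinates (x, y) in mm."""
    return ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)

def radial_error_mm(reference, detected):
    """Euclidean distance in mm between reference and detected positions."""
    return hypot(detected[0] - reference[0], detected[1] - reference[1])

def classify(error_mm):
    """Grade a detection using the 2 mm and 4 mm thresholds from the text."""
    if error_mm < 2.0:
        return "accurate"
    if error_mm < 4.0:
        return "acceptable"
    return "unacceptable"  # assumed label for errors of 4 mm or more

# Hypothetical left and right Gonion positions that do not overlap exactly.
left_gonion = (80.0, 120.0)
right_gonion = (84.0, 124.0)
reference = bilateral_midpoint(left_gonion, right_gonion)

# Hypothetical position returned by an automatic method.
detected = (83.0, 122.0)
error = radial_error_mm(reference, detected)
print(reference, error, classify(error))
```

With these illustrative values the averaged reference point is (82.0, 122.0) and the detected point lies 1.0 mm away, which falls in the "accurate" band.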

The aim of the present study was therefore to assess the reliability of an AI-assisted cephalometric analysis program (Webceph), compared with manual tracing, in the assessment of cephalometric measurements involving bilateral landmarks.

Materials and methods

The lateral cephalometric radiographs used in the study were selected from the radiology archive of the Faculty of Dentistry, Biruni University. The archive housed the radiographs of previously treated patients who had provided consent for the use of their records. Two experienced orthodontists selected only those cephalometric radiographs which exhibited deviation in the position of bilateral landmarks. All selected radiographs were obtained using the Orthophos Xg orthopantomograph (Sirona, Bensheim, Germany), which incorporated a metric scale within its cephalostat. The radiographs were captured with each subject’s lips in the rest position, the Frankfort horizontal plane parallel to the floor, and the jaws in centric relation. Exposure settings were 73 kV, 15 mA and 9.4 s. Ethical approval for the study was granted by the Ethical Committee of Biruni University (Approval number 2023/78-25).

Inclusion criteria

(1) High-quality patient cephalograms without artifacts, the presence of which could hinder the accurate identification of anatomical points.

(2) Cephalograms must be accompanied by a calibration ruler to enable the determination of the magnification.

Exclusion criteria

(1) Cephalograms displaying excessive soft tissue that may impede the precise localisation of anatomical points.

(2) Cephalograms of patients with a cleft lip and palate, a congenital craniofacial deformity or a marked asymmetry.

(3) Cephalograms of patients exhibiting positional errors, as indicated by ear rod markers.

(4) Cephalograms wherein landmarks cannot be identified due to motion, significant differences in resolution, or lack of contrast.

The main hypothesis of the study involved the comparison of two independent groups. Similar studies were examined, and the sample size calculation that yielded the largest number for the statistical methods applied to the main hypothesis was adopted. The sample size was calculated at a 95% confidence level using the G*Power 3.1.9.2 program (Heinrich Heine Universität, Düsseldorf, Germany).⁸ With α = 0.05 and a standardised effect size of 0.915 obtained from a previous study,⁹ a minimum sample of 33 radiographs was calculated to achieve a theoretical power of 0.95. Considering the possibility of a non-normal distribution, the sample size was increased by 50%, which generated a final sample size of 51 radiographs (30 females aged 16.93 ± 3.2 years and 21 males aged 17.7 ± 2.1 years).
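As a rough cross-check of the reported power calculation, the sketch below uses the standard normal approximation for a two-sided comparison of two independent means. This is not the procedure used by G*Power, which works with the noncentral t distribution, and it assumes that the reported minimum refers to the number per group; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.95):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means with standardised effect size d:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Inputs taken from the text: d = 0.915, alpha = 0.05, power = 0.95.
approx_n = n_per_group(0.915)
print(approx_n)
```

Under these assumptions the approximation gives 32 per group, close to the reported 33; an exact calculation based on the t distribution generally requires slightly more observations than the normal approximation.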

The lateral cephalometric radiographs were originally in JPEG format but were converted into high-quality hard copies for manual tracing. The images were printed on A4-sized glossy photo paper using a high-resolution inkjet printer. To maintain measurement accuracy, the digital images were adjusted to 1:1 magnification before printing, using the embedded metric ruler in the cephalograms as a reference. Subsequently, the radiographs were manually traced using a 0.5 mm mechanical pencil and ultra-thin (0.003″ thick) acetate overlay sheets, following established cephalometric tracing protocols.¹⁰ All tracings were performed independently by two experienced orthodontists to ensure reliability and reproducibility. Following manual tracing, the same radiographs were assessed using Webceph AI-assisted cephalometric analysis software (AssembleCircle Corp., Gyeonggi-do, Republic of Korea). All of the cephalometric images were calibrated according to the instructions provided by Webceph and only measurements involving bilateral landmarks were included. This selection was based on several previous studies which assessed the overall reliability of AI-assisted cephalometric tracing programs.^11,12 The cephalometric landmarks and the description of measurements are presented in Figure 1 and the tracing of a study subject using Webceph is shown in Figure 2.
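The role of the embedded ruler in calibration can be sketched as follows. This is only an illustration of the general principle (a known ruler length fixes the millimetre-per-pixel scale), not the calibration routine of Webceph or of the printing workflow; all pixel co-ordinates are hypothetical, and the example assumes the common Condylion to Gnathion definition of effective mandibular length.

```python
from math import hypot

def mm_per_pixel(ruler_start_px, ruler_end_px, ruler_length_mm):
    """Scale factor derived from two points marked on the embedded ruler."""
    pixel_length = hypot(ruler_end_px[0] - ruler_start_px[0],
                         ruler_end_px[1] - ruler_start_px[1])
    if pixel_length == 0:
        raise ValueError("ruler points must not coincide")
    return ruler_length_mm / pixel_length

def distance_mm(p_px, q_px, scale):
    """Linear measurement between two landmarks, converted to mm."""
    return hypot(q_px[0] - p_px[0], q_px[1] - p_px[1]) * scale

# Hypothetical ruler marks: 10 mm of the scale spans 100 pixels.
scale = mm_per_pixel((100, 50), (100, 150), ruler_length_mm=10.0)

# Hypothetical landmark positions in pixels.
condylion = (200, 300)
gnathion = (800, 1100)
effective_mandibular_length = distance_mm(condylion, gnathion, scale)
print(scale, effective_mandibular_length)
```

In this made-up case the scale is 0.1 mm per pixel and the Condylion to Gnathion distance is 100 mm. Any error in marking the ruler propagates proportionally into every linear measurement, which is one reason a calibration ruler is required by the inclusion criteria.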

To assess intra-operator reliability, the first experienced investigator manually traced 25 lateral cephalometric images twice to evaluate consistency. For inter-operator reliability, a second experienced investigator independently performed manual tracings of the same images. The 25 tracings from each investigator were then compared to assess agreement and a kappa coefficient of 0.8 was chosen as the threshold to indicate satisfactory calibration and reliability between the two investigators.

The descriptive statistics of the data are shown in Table I. As the first step of the statistical analysis, the assumption of normal distribution was assessed using the Shapiro–Wilk test, and homogeneity of variance was tested using the Levene test. When the normality assumption was met, the agreement between measurements was evaluated using Pearson’s product-moment correlation. An independent-samples t test was applied to compare the means of the two normally distributed independent groups. The analyses were carried out using IBM SPSS Statistics for Windows, Version 25.0 (IBM Corp., Armonk, New York).

Table I.

The descriptive statistics of the data used in the study

Method	Measurement	n	Minimum	Maximum	Mean	Standard deviation	Median
Webceph	Articular Angle (degree)	51	137.50	161.20	148.45	5.44	147.90
	Gonial Angle (degree)	51	108.00	133.10	122.72	5.73	123.40
	Effective Mandibular Length (mm)	51	98.50	122.60	108.95	6.36	110.00
	FMA (degree)	51	13.40	43.00	25.66	5.96	25.50
Manual Tracing	Articular Angle (degree)	51	134	163	149.39	6.84	150
	Gonial Angle (degree)	51	100	137	119.81	7.16	120
	Effective Mandibular Length (mm)	51	97	142	113.29	8.72	113
	FMA (degree)	51	12	37	25.96	5.56	26.5

Test-retest analysis was applied to investigate whether the same variables were similar between different observers, different methods, or measurements taken at different times (Table II). The normality assumption was verified using the Shapiro–Wilk test to select the appropriate intraclass correlation test. The agreement between measurements was evaluated using Pearson’s product-moment correlation. The correlation coefficients between the two measurement methods for the articular angle and effective mandibular length were below the minimum value of 0.70 required for similarity.

Table II.

Test-retest and intraclass correlation coefficients for clinical measurements

Method	Measurement	Statistic	Manual Tracing
Webceph	Articular angle	rho	0.658
		p	0.000*
	Gonial angle	rho	0.753
		p	0.000*
	Effective Mandibular Length	rho	0.416
		p	0.002*
	FMA	rho	0.869
		p	0.000*

* p<0.05.

Therefore, the articular angle and effective mandibular length measurements were not sufficiently similar between the methods. The correlation coefficients between the methods for the gonial angle and FMA were above the minimum value of 0.70 required for similarity, and these measurements were therefore considered sufficiently similar and stable across the measurement methods. All four correlations were statistically significant (p<0.05).
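The similarity rule applied to Table II can be written out as a short Python sketch. The `pearson_r` function is a plain implementation of the product-moment formula named in the text, included only to show how such a coefficient is computed from paired measurements; the coefficients in `reported` are copied from Table II, and the variable names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

SIMILARITY_THRESHOLD = 0.70  # minimum coefficient required for similarity

# Coefficients between Webceph and manual tracing, as listed in Table II.
reported = {
    "Articular angle": 0.658,
    "Gonial angle": 0.753,
    "Effective mandibular length": 0.416,
    "FMA": 0.869,
}

similar = {name for name, r in reported.items() if r >= SIMILARITY_THRESHOLD}
for name, r in reported.items():
    verdict = "similar" if name in similar else "not sufficiently similar"
    print(f"{name}: {r:.3f} -> {verdict}")
```

Applied to the tabulated values, only the gonial angle and FMA meet the 0.70 threshold. A significant p-value alone does not imply agreement: the effective mandibular length correlation is statistically significant yet well below the threshold.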

After the assumptions were checked, independent-samples t tests were applied to compare the mean clinical measurements between the two methods used in the study.

Results

The inter- and intra-operator correlation coefficients were 0.80, which indicated ‘good’ reliability.

Statistically significant differences were found between the methods in the mean gonial angle and effective mandibular length measurements (p<0.05). The mean gonial angle was higher with Webceph than with manual tracing, whereas the mean effective mandibular length was greater with manual tracing than with Webceph.

No statistically significant differences were found between the methods in the mean articular angle and FMA (p>0.05). The results are shown in Table III.

Table III.

Comparison of the means of clinical measurements based on the programs used in the study

Measurement	Method	Mean	Standard deviation	Test statistic	p
Articular Angle (degree)	Webceph	148.45	5.44	-0.77	0.444
	Manual Tracing	149.39	6.84
Gonial Angle (degree)	Webceph	122.72	5.73	2.26	0.026*
	Manual Tracing	119.81	7.16
Effective Mandibular Length (mm)	Webceph	108.95	6.36	-2.87	0.005*
	Manual Tracing	113.29	8.73
FMA (degree)	Webceph	25.66	5.96	-0.261	0.794
	Manual Tracing	25.96	5.56

* p<0.05.
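The test statistics in Table III can be approximately reproduced from the tabulated means and standard deviations. The sketch below applies the standard pooled-variance formula for an independent-samples t test with n = 51 per method; it is a consistency check under that assumption, not the SPSS output itself. The critical value 1.984 is the two-sided 5% point of the t distribution with 100 degrees of freedom.

```python
from math import sqrt

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

N = 51             # radiographs per method
T_CRITICAL = 1.984 # two-sided 5% critical value for 100 degrees of freedom

# (Webceph mean, Webceph SD, manual mean, manual SD) from Table III.
table_iii = {
    "Articular angle": (148.45, 5.44, 149.39, 6.84),
    "Gonial angle": (122.72, 5.73, 119.81, 7.16),
    "Effective mandibular length": (108.95, 6.36, 113.29, 8.73),
    "FMA": (25.66, 5.96, 25.96, 5.56),
}

t_values = {}
for name, (m1, s1, m2, s2) in table_iii.items():
    t, df = independent_t(m1, s1, N, m2, s2, N)
    t_values[name] = t
    flag = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{name}: t = {t:.2f}, df = {df}, {flag}")

significant = {name for name, t in t_values.items() if abs(t) > T_CRITICAL}
```

The recomputed statistics agree with the tabulated ones to within rounding, and only the gonial angle and effective mandibular length exceed the critical value, matching the reported pattern of significance.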

Discussion

Cephalometric analysis plays a crucial role in orthodontic diagnosis, treatment planning, and the evaluation of treatment results. With the advancement of artificial intelligence (AI) technology, several studies have explored the accuracy and reliability of AI-driven cephalometric analysis programs.^11–13 The present study aimed to assess the reliability of an AI-assisted cephalometric analysis program (Webceph), compared with manual tracing, in the assessment of cephalometric measurements involving bilateral landmarks. In contrast to prior reports, the present study concentrated exclusively on evaluating the reliability of the AI-assisted software using carefully chosen lateral cephalometric radiographs which displayed deviations between pairs of bilateral landmarks. This focused approach aimed to assess the AI algorithm’s capability to accurately localise and subsequently average the bilateral landmarks, thereby evaluating its performance under challenging conditions that tested the program’s operational limits.

The findings of the present study align with previous research that evaluated the accuracy of AI systems in the detection of cephalometric landmarks. An earlier study compared the accuracy of the AudaxCeph® software (Audax d.o.o., Ljubljana, Slovenia) with human tracers and found that the software was reliable for diagnosis and treatment planning. However, differences in landmark identification were observed between the software and the human tracers, particularly for specific bilateral landmarks, namely Porion and Orbitale.¹³ For most measurements, the differences, although statistically significant, were deemed clinically non-significant. In the current study, there was no statistically significant difference in the measurements involving Porion and Orbitale produced by Webceph and manual tracing. Similar results were obtained in a prior study that used two deep neural networks for cephalometric landmark identification. The AI system demonstrated comparable accuracy to previous systems and showed clinically non-significant errors, even in challenging cases such as cleft subgroups.¹⁴

Past studies evaluating the accuracy of AI systems in cephalometric landmark detection have highlighted the potential of these systems to assist novices and improve their performance. In a study focusing on AI-based assistance for beginners, it was found that the system aided in improving landmark detection, although the improvements were considered insignificant in accurately identifying landmark position.¹⁵ Similarly, a study proposing a fully automatic deep learning method for the identification of landmarks including Porion, Orbitale and Articulare reported lower errors compared to human experts, thereby reducing human-induced variability.¹⁶

The use of AI-driven cephalometric analysis platforms, such as CephX®, CEFBOT, and WebCeph™, has shown promise in achieving efficient and timely cephalometric analyses. A previous study assessed the accuracy of WebCeph™ and examined cephalometric measurements, some of which involved the bilateral landmarks of Porion, Orbitale and Gonion. The results demonstrated high reliability and accuracy for most parameters, as evidenced by high intraclass correlation coefficients. The present findings align with previous studies that have evaluated AI-driven software, in which statistically significant relationships and reasonable accuracy have been observed compared to manual tracing.¹⁷

Despite the promising results, it is important to acknowledge the limitations and challenges associated with AI-driven cephalometric analysis. Inaccuracy can arise from factors related to non-uniform image quality, an improper image angle, and the inability of AI models to recognise non-skeletal objects and markings.¹⁸ Additionally, while AI systems have shown improved accuracy and precision, specific landmarks, such as Condylion, are prone to incorrect identification. First, the condyle is a three-dimensional structure, and its position can vary depending on head posture and the angle of the X-ray beam. Second, the condyle can be partially obscured by other anatomical structures, such as the temporal bone, the zygomatic arch, and the soft tissues. These factors may explain the difference in the effective mandibular length measurement between the two methods. The difference in gonial angle measurements, however, might be explained by inherent limitations of the algorithm used by the AI-assisted tracing program. Future developments in AI technology should focus on addressing these limitations to further enhance cephalometric accuracy and reliability.

With advancements in deep learning methodologies, the efficiency of AI-assisted cephalometric programs can be significantly improved. AI models can be trained with large, high-quality datasets specifically designed to enhance the detection of bilateral landmarks. By leveraging deep learning algorithms, particularly convolutional neural networks (CNNs), AI systems can improve their ability to correctly identify and average bilateral landmark positions, thereby reducing discrepancies caused by anatomical asymmetry. These developments suggest that future iterations of AI-assisted cephalometry will likely achieve greater precision and reliability, making them more useful for both routine clinical practice and complex cases.

Conclusion

The current study, along with the findings from previous research, has demonstrated the potential of AI-driven cephalometric analysis programs to provide accurate and reliable results. The systems show promise in improving the efficiency and precision of cephalometric analysis, especially for less experienced clinicians. However, it is important to carefully consider the limitations and challenges associated with the systems to ensure their appropriate application in clinical practice. It is also recommended that the clinician check and, if necessary, correct the location of bilateral landmarks after the initial digitisation by the AI-assisted cephalometric tracing program. Continued advancements in AI technology hold great potential for further enhancing cephalometric analysis and its clinical applicability.
