Inter-observer variability on the value of endoscopic images for the documentation of upper gastrointestinal endoscopy - our center experience

Objective: Endoscopy is an essential and invaluable diagnostic tool in the arsenal of every gastroenterologist. The European Society of Gastrointestinal Endoscopy (ESGE) has presented guidelines for standardized image documentation in upper and lower gastrointestinal endoscopy. Clinical disagreement is a common challenge in most, if not all, fields of medicine. Examining disagreement is important for finding ways to minimize it. Clinical disagreement in gastroscopy may be demonstrated by studying observer variability. Methods: We retrospectively included 120 randomly selected patients who underwent conventional upper gastrointestinal endoscopy between 2021 and 2022 in our Department of Gastroenterology, with all procedures performed by one endoscopist. As part of the study, all video-endoscopic recordings were stored on one internal server. To study interobserver variability, four physicians (endoscopists and gastroenterology specialists) were invited to complete a questionnaire. Results: Interobserver agreement in our study ranged from moderate to very good in the assessment of the esophagus, with the highest degree of agreement for questions concerning characteristic findings: normal mucosa, Los Angeles grade A esophagitis and hiatal hernia in the esophagus; benign ulcer niche in the gastric antrum; and normal mucosa, intestinal metaplasia and angiodysplasia in the gastric corpus. The question on atrophic mucosa in the first and second part of the duodenum was the most difficult to agree upon. Conclusion: The present study found that agreement between observers in the assessment of images obtained from patients who underwent conventional upper gastrointestinal endoscopy in our center was acceptably good.


Introduction
Endoscopy is an essential and invaluable diagnostic resource in the arsenal of every gastroenterologist. Because both the volume and the technical complexity of endoscopic procedures have grown over the past couple of decades, ensuring quality and safety remains crucial [1,2].
Standardization of endoscopic records has been emphasized by the European Society of Gastrointestinal Endoscopy (ESGE) with the development of the minimum standard terminology for digestive endoscopy (MST) [3], subsequently adopted by the World Organization of Gastrointestinal Endoscopy. ESGE has also presented guidelines for standardized image documentation in upper and lower gastrointestinal endoscopy [4].
Clinical disagreement is a common challenge in most, if not all, fields of medicine. Examining disagreement is important for finding ways to minimize it. Clinical disagreement in gastroscopy may be demonstrated by studying observer variability [1].
The aim of the study was to assess inter-observer variability in the evaluation of 120 video-endoscopic recordings of conventional upper gastrointestinal endoscopy in our center by four physicians (endoscopists and gastroenterology specialists).

Methods
We retrospectively included 120 randomly selected patients who underwent conventional upper gastrointestinal endoscopy between 2021 and 2022 in our Department of Gastroenterology, with all procedures performed by one endoscopist. As part of the study, all video-endoscopic recordings were stored on one internal server. We asked four endoscopists with different levels of experience to complete the questionnaire detailed below after assessing the 120 video-recordings.
For local anesthesia, lidocaine spray (4.6 mg/dose) was commonly used prior to the procedure. For endoscopic procedures performed under conscious sedation, midazolam (Midazolam, Aguettant, France) with or without propofol (Propofol MCT/LCT Fresenius 10 mg/ml, Fresenius Kabi) was selected.
All names and dates were removed from the videos. No patient data, characteristics or symptoms were presented.
To study inter-observer variability, four physicians (endoscopists and gastroenterology specialists) with varying endoscopy experience (1, 4, 5 and 20 years) were invited to complete the questionnaire. The questions are partially presented in Table I.
Each of the endoscopists evaluated 120 video-recordings and then completed the questionnaire.
Data were collected with a multiple-choice questionnaire containing questions reflecting a simplified version of the minimum standard terminology (MST) for digestive endoscopy, which includes the Los Angeles (LA) classification for esophagitis. Our interest in this study was the variability between observers in evaluating the images, not whether they reached an accurate diagnosis. Therefore, evaluations were not measured against a "gold standard".
All four endoscopists know the LA classification and apply it in day-to-day practice. They were not given any guidelines on answering the questionnaire.

Ethics
As part of the study, all video-endoscopic recordings were submitted without personal identification and stored on one internal server. The study was approved by the local Medical Research Ethics Committee.

Statistical analysis
The purpose of the analysis was to evaluate the variability among observers in evaluating images obtained from patients who underwent conventional upper gastrointestinal endoscopy.
The coefficient of agreement for endoscopic diagnosis was evaluated using an inter-rater agreement statistic (K, kappa), calculated with its 95% confidence interval (CI) [5]. The kappa value was calculated in all of the groups. Agreement, based on the value of kappa, was categorized, as described by Altman, as poor (< 0.20), fair (0.21-0.40), moderate (0.41-0.60), good (0.61-0.80) or very good (0.81-1.00) [6]. The precision of kappa was measured by its 95% CI. If the kappa value was greater than 0.40, an acceptable degree of concordance was considered to be present. The analysis was done using SPSS statistical software (SPSS Inc., Chicago, Ill., USA, version 23) for crosstabulation of results and Excel software (Microsoft Corporation) for kappa values and confidence intervals. Nominal variables were described as absolute and relative frequencies (%), and the association between them was analyzed by Pearson's chi-square test or Fisher's exact test. Associations with P<0.05 were considered significant.
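As an illustration of the agreement measure and the Altman categories described above, the following is a minimal sketch in Python. It assumes Fleiss' kappa, a common multi-rater generalization of kappa; the text does not state which kappa variant was used for the four observers, and the actual analysis was run in SPSS and Excel. The function names `fleiss_kappa` and `altman_category` and the example ratings are hypothetical and are not study data. Values exactly at 0.20 are treated as poor here, since the stated bands leave that boundary open.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater.

    Assumption: every item is rated by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(s ** 2 for s in shares)

    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


def altman_category(kappa):
    """Map a kappa value to the Altman bands quoted in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical answers of four observers to one yes/no question on four videos.
example = [
    ["yes", "yes", "yes", "yes"],
    ["no", "no", "no", "no"],
    ["yes", "yes", "yes", "no"],
    ["no", "no", "no", "no"],
]
k = fleiss_kappa(example)
print(round(k, 3), altman_category(k))  # 0.746 good
```

In this sketch a single dissenting rating on one of four videos already lowers kappa from 1.00 to about 0.75, which shows how sensitive the statistic is when the number of items is small.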

Results
Level of agreement is defined as outlined in the Methods section. In the assessment of video-recordings of the esophagus, two questions obtained very good agreement: the presence of normal mucosa, 0.9668 (95% CI: 0.81-1.00), and the presence of hiatal hernia, 0.9711 (95% CI: 0.81-1.00). No cases of Los Angeles grade D esophagitis or ulcer niche were identified in the group of 120 randomly selected patients. Interobserver agreement in our study ranged from moderate to very good in the assessment of the esophagus, with the highest degree of agreement for questions concerning characteristic findings such as normal mucosa, Los Angeles grade A esophagitis and hiatal hernia, as can be seen in Table II.
In the assessment of video-recordings of the pyloric antrum, we obtained very good agreement, 0.9682 (95% CI: 0.81-1.00), in evaluating the presence of a benign ulcer niche. No cases of angiodysplasia or malignant neoplasia were identified in the group. Interobserver agreement in our study was very good in the assessment of the lesions identified in the gastric antrum (Table III). In the assessment of the gastric corpus, we obtained very good agreement for evaluating normal gastric mucosa (0.9689), intestinal metaplasia (0.9842) and angiodysplasia (0.9820). Interobserver agreement was also very good in the assessment of all the lesions identified in the gastric corpus (Table IV).
In the assessment of the duodenal bulb, we obtained very good agreement for evaluating ulcer niche (0.9842). The other questions also showed very good agreement, with the question about atrophic mucosa in the second part of the duodenum being the most difficult to agree upon (0.8075) (Table V).
In the assessment of video-recordings of the second part of the duodenum, we obtained very good agreement for evaluating normal mucosa (85.75%) and erosions (80.75%), and again the question on atrophic mucosa in the second part of the duodenum was the most difficult to agree upon (58.64%). Interobserver agreement in our study was moderate to very good in the assessment of the lesions identified in the second part of the duodenum (Table VI).

Discussion
This study had some limitations. First, this was a single-center study and the sample size may not be large enough. Second, because this study analyzed previously obtained endoscopic video-recordings, it was difficult to evaluate as many details as is possible in real time. Third, magnifying endoscopy was used in some but not all cases, which may have influenced the results. When the endoscopic appearance was considered normal by the endoscopist, virtual chromoendoscopy or magnification was not used. When lesions were identified (for example polyps, gastric ulcers or areas of intestinal metaplasia), the endoscopist evaluated the morphology of the detected changes using linked color imaging (LCI) and blue light imaging (BLI) modes with a maximum optical magnification of 145x, which provided a highly detailed image of the mucosal surface and vascular patterns. Prior studies on magnification endoscopy and minimal change esophagitis in non-erosive reflux disease patients showed substantial inter-observer agreement [6][7][8][9]. Finally, poor image quality, such as blurring caused by the endoscopist's hand movement, lens fogging or poor patient cooperation, may have impaired the results.
We found that variability is extensive in the assessment of images from upper endoscopy.
Similar results have been reported from other diagnostic disciplines, for example the assessment of carotid plaques [10]. Agreement among observers in our study ranged from moderate to very good, with the highest level of agreement for questions regarding characteristic findings: normal mucosa, Los Angeles grade A esophagitis and hiatal hernia in the esophagus; benign ulcer niche in the gastric antrum; and normal mucosa, intestinal metaplasia and angiodysplasia in the gastric corpus. The question on atrophic mucosa in the first and second part of the duodenum was the most difficult to agree upon.
In our study we asked four endoscopists with different levels of experience to complete the questionnaire after assessing 120 video-recordings. Some studies [11][12][13] report that experience leads to a higher degree of agreement, while others do not [14,15]. Presenting live endoscopic video-recordings, as some studies have done, could be expected to improve the degree of agreement, but in one study that used live endoscopic images the degree of agreement was not significantly modified [16,17].
Notably, Lundell et al. found that greater experience did not lead to a higher degree of agreement [15].
Still images from gastroscopy fail to document motility, which is just as significant as mucosal changes. Video-recording the entire examination may address these deficiencies, but for practical reasons it is uncertain whether video-recordings are a realistic way to systematically document gastroscopy. A standardized set of still images will always be the second-best but more practical method. The ESGE has suggested a series of eight reference images for the documentation of upper endoscopic procedures [3]. Figures 1 and 2 show images of the gastric antrum and corpus according to the previously mentioned ESGE guideline.

Conclusion
In summary, the present study found that agreement between observers in the assessment of images obtained from patients who underwent conventional upper gastrointestinal endoscopy in our center was acceptably good.
BM - interpretation of data, revising it critically for important intellectual content, final approval of the version to be published.
CN - collected the data, draft manuscript preparation, interpretation of data for the article, revising it critically for important intellectual content, final approval of the version to be published.