Performance of ChatGPT and GPT-4 on Polish National Specialty Exam (NSE) in Ophthalmology
ABOUT THIS ARTICLE
Article category: Original Study
Published online: 23 Sep 2024
Pages: 111–116
Received: 11 Jan 2024
Accepted: 19 Jun 2024
DOI: https://doi.org/10.2478/ahem-2024-0006
© 2024 Marcin Ciekalski et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Overall proportion of correct and incorrect answers (McNemar’s test: chi-squared = 13.4; df = 1; p < 0.001; OR (95% CI) = 3.78 (1.78–8.96))
GPT-3.5 correct | GPT-4 correct: Yes | GPT-4 correct: No | Row total |
---|---|---|---|
Yes | 28 (28.6%) | 9 (9.2%) | 37 (37.8%) |
No | 34 (34.7%) | 27 (27.6%) | 61 (62.2%) |
Column total | 62 (63.3%) | 36 (36.7%) | N = 98 |
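The chi-squared value and odds ratio in the caption can be recovered from the two discordant cells of the table (9 and 34). The following is a minimal sketch, assuming the continuity-corrected form of McNemar's test and an odds ratio taken as the ratio of the discordant counts; the function name `mcnemar_corrected` is hypothetical and is not taken from the article. The confidence interval is not reproduced here, because the method used to compute it is not stated.

```python
import math

def mcnemar_corrected(b: int, c: int) -> tuple[float, float, float]:
    """Continuity-corrected McNemar statistic, its p-value (df = 1), and the
    odds ratio c / b for the two discordant cell counts.

    Assumes b > 0 and b + c > 0.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of the chi-squared distribution with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = c / b
    return chi2, p_value, odds_ratio

# b: GPT-3.5 correct and GPT-4 incorrect; c: GPT-3.5 incorrect and GPT-4 correct
chi2, p_value, odds_ratio = mcnemar_corrected(b=9, c=34)
print(f"chi-squared = {chi2:.1f}, p < 0.001: {p_value < 0.001}, OR = {odds_ratio:.2f}")
```

Under these assumptions the sketch yields a chi-squared of about 13.4 and an odds ratio of about 3.78, in line with the caption. The same function applied to the discordant cells of each per-category table gives the chi-squared values reported in those captions.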
Distribution of correct and incorrect answers for Treatment & Pharmacology question category (McNemar’s test: chi-squared = 1.5; df = 1; p = 0.2207; OR (95% CI) = 5 (0.559–236.488))
GPT-3.5 correct | GPT-4 correct: Yes | GPT-4 correct: No | Row total |
---|---|---|---|
Yes | 2 (20%) | 1 (10%) | 3 (30%) |
No | 5 (50%) | 2 (20%) | 7 (70%) |
Column total | 7 (70%) | 3 (30%) | N = 10 |
Distribution of correct and incorrect answers for Physiology & Diagnostics question category (McNemar’s test: chi-squared = 2.5; df = 1; p = 0.1138; OR (95% CI) = 4 (0.798–38.666))
GPT-3.5 correct | GPT-4 correct: Yes | GPT-4 correct: No | Row total |
---|---|---|---|
Yes | 6 (28.6%) | 2 (9.5%) | 8 (38.1%) |
No | 8 (38.1%) | 5 (23.8%) | 13 (61.9%) |
Column total | 14 (66.7%) | 7 (33.3%) | N = 21 |
Distribution of correct and incorrect answers by level of confidence for GPT-4
Level of confidence | GPT-4 correct | GPT-4 incorrect |
---|---|---|
Definitely sure | 8 | 1 |
Very sure | 40 | 19 |
Almost sure | 14 | 16 |
Not very sure | - | - |
Definitely not sure | - | - |
Distribution of correct and incorrect answers for Pediatrics question category (McNemar’s test: chi-squared = 0.571; df = 1; p = 0.4497; OR (95% CI) = 2.5 (0.409–26.253))
GPT-3.5 correct | GPT-4 correct: Yes | GPT-4 correct: No | Row total |
---|---|---|---|
Yes | 2 (18.2%) | 2 (18.2%) | 4 (36.4%) |
No | 5 (45.5%) | 2 (18.2%) | 7 (63.6%) |
Column total | 7 (63.6%) | 4 (36.4%) | N = 11 |
Distribution of certainty levels between LLMs
GPT-3.5 level of confidence | GPT-4: Definitely sure | GPT-4: Very sure | GPT-4: Almost sure | GPT-4: Not very sure | GPT-4: Definitely not sure | Total |
---|---|---|---|---|---|---|
Definitely sure | 2 (2.04%) | 18 (18.37%) | 21 (21.43%) | - | - | 41 (41.84%) |
Very sure | 7 (7.14%) | 41 (41.84%) | 9 (9.18%) | - | - | 57 (58.16%) |
Almost sure | - | - | - | - | - | - |
Not very sure | - | - | - | - | - | - |
Definitely not sure | - | - | - | - | - | - |
Total | 9 (9.18%) | 59 (60.20%) | 30 (30.61%) | - | - | N = 98 |
Distribution of correct and incorrect answers by level of confidence for GPT-3.5
Level of confidence | GPT-3.5 correct | GPT-3.5 incorrect |
---|---|---|
Definitely sure | 18 | 23 |
Very sure | 19 | 38 |
Almost sure | - | - |
Not very sure | - | - |
Definitely not sure | - | - |
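The two confidence tables (GPT-4 and GPT-3.5) are easier to compare when the counts are converted to the share of correct answers at each stated confidence level. The following is a minimal sketch of that arithmetic; the counts are copied from the tables, while the `counts` layout and the helper `accuracy_by_confidence` are illustrative assumptions and not part of the article.

```python
# (correct, incorrect) counts per stated confidence level, copied from the tables.
# Levels that a model never used are omitted.
counts = {
    "GPT-4": {
        "Definitely sure": (8, 1),
        "Very sure": (40, 19),
        "Almost sure": (14, 16),
    },
    "GPT-3.5": {
        "Definitely sure": (18, 23),
        "Very sure": (19, 38),
    },
}

def accuracy_by_confidence(levels: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each confidence level to the fraction of answers that were correct."""
    return {
        level: correct / (correct + incorrect)
        for level, (correct, incorrect) in levels.items()
    }

for model, levels in counts.items():
    for level, accuracy in accuracy_by_confidence(levels).items():
        print(f"{model:8s} {level:16s} {accuracy:.1%}")
```

With these counts, GPT-4 answers correctly in roughly 89%, 68%, and 47% of questions at its three stated levels, and GPT-3.5 in roughly 44% and 33% at the two levels it used.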
Distribution of correct and incorrect answers for Clinical & Case Questions question category (McNemar’s test: chi-squared = 4.083; df = 1; p = 0.0433; OR (95% CI) = 5 (1.066–46.933))
GPT-3.5 correct | GPT-4 correct: Yes | GPT-4 correct: No | Row total |
---|---|---|---|
Yes | 17 (39.5%) | 2 (4.7%) | 19 (44.2%) |
No | 10 (23.3%) | 14 (32.6%) | 24 (55.8%) |
Column total | 27 (62.8%) | 16 (37.2%) | N = 43 |
Distribution of correct and incorrect answers for Surgery question category (McNemar’s test: chi-squared = 1.125; df = 1; p = 0.2888; OR (95% CI) = 3 (0.536–30.393))
GPT-3.5 correct | GPT-4 correct: Yes | GPT-4 correct: No | Row total |
---|---|---|---|
Yes | 1 (7.7%) | 2 (15.4%) | 3 (23.1%) |
No | 6 (46.2%) | 4 (30.8%) | 10 (76.9%) |
Column total | 7 (53.9%) | 6 (46.2%) | N = 13 |