Evaluating the Accuracy, Readability, and Relevance of Answers Generated by Large Language Models (LLMs) for Frequently Asked Questions about Cataract and Cataract Surgery


European Eye Research

Eur Eye Res. Ahead of Print: EER-44154

Ayse Bozkurt Oflaz, Sule Acar Duyan
Department of Ophthalmology, Selcuk University, Konya, Turkey

PURPOSE: To evaluate the accuracy, relevance, and readability of answers given by four large language models (LLMs), ChatGPT-3.5, ChatGPT-4o, Gemini, and Copilot, to frequently asked questions about cataract and cataract surgery.
METHODS: Ten frequently asked questions about cataract and cataract surgery were posed to each LLM, and the responses were scored for accuracy and readability. Two experienced cataract surgeons assessed the accuracy of the answers. Readability was measured with the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, and Simple Measure of Gobbledygook (SMOG) Index.
RESULTS: According to expert assessment, the rates of "correct and complete" answers were: ChatGPT-3.5 (81%), ChatGPT-4o (100%), Gemini (98%), and Copilot (54%), with a statistically significant difference among the models (p < 0.0001). Post-hoc comparisons showed that ChatGPT-4o and Gemini outperformed ChatGPT-3.5 (p = 0.0005 and p = 0.0079, respectively). Significant differences were also found in word and sentence counts across models (p < 0.0001). No statistically significant differences were observed in readability scores.
CONCLUSION: ChatGPT-4o and Gemini provided the most accurate responses. Readability, however, did not differ significantly across models, which points to a need for algorithmic improvements that make AI-generated patient education content easier to understand.
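
For readers unfamiliar with the readability indices named in the methods, the following minimal Python sketch shows how two of them, the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level, are computed from word, sentence, and syllable counts. The two formulas are the standard published ones. Everything else is an assumption for illustration: the function names are hypothetical, the syllable counter is a crude vowel-group heuristic, and the abstract does not state which tool the authors used to obtain their scores.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Assumption: approximate syllables as runs of vowels (a, e, i, o, u, y).

    This heuristic miscounts many English words (silent "e", for example);
    dedicated readability tools use more careful rules.
    """
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Return (word count, sentence count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the study's LLM responses.
    sample = "A cataract clouds the lens. Surgery replaces it."
    w, s, syl = text_counts(sample)
    print(f"words={w}, sentences={s}, syllables={syl}")
    print(f"Flesch Reading Ease: {flesch_reading_ease(w, s, syl):.1f}")
    print(f"Flesch-Kincaid Grade Level: {flesch_kincaid_grade(w, s, syl):.1f}")
```

Both indices depend only on average sentence length and average syllables per word, so responses that differ in total word and sentence counts can still receive similar readability scores.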

Keywords: Artificial intelligence, chatbots, cataract, ChatGPT, Gemini, Copilot

Corresponding Author: Ayse Bozkurt Oflaz, Türkiye
