Evaluating the Accuracy, Readability, and Relevance of Answers Generated by Large Language Models (LLMs) for Frequently Asked Questions about Cataract and Cataract Surgery

Ayse Bozkurt Oflaz, Sule Acar Duyan
Department of Ophthalmology, Selcuk University, Konya, Turkey
PURPOSE: To evaluate the accuracy, relevance, and readability of answers generated by large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4o, Gemini, and Copilot, to frequently asked questions about cataract and cataract surgery.

METHODS: Ten frequently asked questions about cataract and cataract surgery were posed to each LLM, and the answers were scored for accuracy and readability. Two experienced cataract surgeons assessed the accuracy of the answers. Readability was evaluated using the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, and Simple Measure of Gobbledygook (SMOG) Index.

RESULTS: According to expert assessment, the rates of "correct and complete" answers were 81% for ChatGPT-3.5, 100% for ChatGPT-4o, 98% for Gemini, and 54% for Copilot, a statistically significant difference among the models (p < 0.0001). Post-hoc comparisons showed that ChatGPT-4o and Gemini outperformed ChatGPT-3.5 (p = 0.0005 and p = 0.0079, respectively). Word and sentence counts also differed significantly across models (p < 0.0001). No statistically significant differences were observed in readability scores.

CONCLUSION: ChatGPT-4o and Gemini provided more accurate responses. However, readability did not differ significantly across models, underscoring the need for algorithmic improvements to make AI-generated patient education content easier to understand.
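For readers unfamiliar with the five readability indices named in METHODS, the sketch below shows how such scores can be computed for a sample LLM answer. The abstract does not specify the authors' tooling; the `textstat` Python package and the sample text are assumptions for illustration only, not the study's actual pipeline.

```python
# Illustrative sketch: computing the five readability indices named in METHODS.
# Assumption: the `textstat` package (pip install textstat) is used here for
# demonstration; the paper does not state how the scores were actually computed.
import textstat

# Hypothetical LLM answer to a cataract FAQ (three sentences; SMOG needs >= 3).
answer = (
    "Cataract surgery replaces the eye's cloudy natural lens with a clear "
    "artificial lens. The procedure is usually quick and performed under "
    "local anesthesia. Most patients go home the same day."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(answer),
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(answer),
    "Gunning Fog Index": textstat.gunning_fog(answer),
    "Coleman-Liau Index": textstat.coleman_liau_index(answer),
    "SMOG Index": textstat.smog_index(answer),
}

for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

Higher Flesch Reading Ease values indicate easier text, while the other four indices approximate the school grade level required to understand it.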
Keywords: Artificial intelligence, chatbots, cataract, ChatGPT, Gemini, Copilot
Corresponding Author: Ayse Bozkurt Oflaz, Türkiye