
Industry News

28 Nov 2024

AI Chatbots in Ophthalmology Show Promise but Need Improvement

A recent study published in Eye (2024) evaluated the performance of AI-based chatbots in addressing common ophthalmological inquiries, revealing that while these tools hold potential, significant improvements are needed before they can reliably support clinical practice. The study compared responses from OpenAI's ChatGPT and Google Bard (now Gemini Pro), assessing their accuracy, comprehensiveness, and clarity.

Led by a team of ophthalmologists, the research involved curating 20 common patient questions spanning topics such as cataract surgery timing, retinal detachment symptoms, and causes of double vision. Responses were obtained from both chatbots and reviewed by eight expert ophthalmologists, who scored each answer on a 1–5 scale across three metrics: accuracy, comprehensiveness, and clarity.

Key Findings

ChatGPT consistently outperformed Bard in all three metrics:

  • Accuracy: ChatGPT scored a median of 4.0, compared to Bard's 3.0.
  • Comprehensiveness: ChatGPT achieved a median of 4.5, while Bard scored 3.0.
  • Clarity: ChatGPT received the highest median score of 5.0, surpassing Bard's 4.0.

These differences were statistically significant (p < 0.001). ChatGPT also drew stronger consensus among reviewers, with 82.5% agreement on both its accuracy and comprehensiveness ratings, compared with 76.9% and 74.4%, respectively, for Bard.
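To make the scoring and significance testing concrete, the sketch below reproduces the shape of such an analysis in Python. The ratings are invented placeholders, and the choice of a Wilcoxon signed-rank test is an assumption on our part (a common test for paired ordinal scores); the study summary above does not name the exact test used.

```python
# Minimal sketch, assuming invented placeholder ratings (1-5 scale) for one
# metric. Eight reviewers rated 20 questions in the study; here paired
# ratings for the same items are flattened into two lists for simplicity.
from statistics import median
from scipy.stats import wilcoxon

# Hypothetical paired accuracy ratings (same question/reviewer pairs).
chatgpt = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5, 4, 3, 4, 4]
bard    = [3, 4, 3, 3, 3, 4, 3, 2, 4, 3, 3, 3, 4, 3, 2, 4, 3, 3, 3, 3]

print(f"ChatGPT median accuracy: {median(chatgpt)}")  # the study reported 4.0
print(f"Bard median accuracy:    {median(bard)}")     # the study reported 3.0

# Paired, non-parametric test of whether the per-item differences are
# centered on zero; an assumed stand-in for the study's own analysis.
stat, p = wilcoxon(chatgpt, bard)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.4f}")
```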

Strengths and Shortcomings

The study highlighted that both AI models delivered clear and accurate answers to many common questions. For example, both chatbots provided well-rounded explanations for cataract surgery timing. However, limitations emerged, particularly in cases requiring detailed differential diagnoses or nuanced medical insights.

ChatGPT excelled at delivering comprehensive answers, yet it sometimes included inaccuracies, such as listing pain as a symptom of retinal detachment, a condition that is typically painless. Bard was clearer in differentiating a retinal tear from a retinal detachment, but its responses often lacked depth.

Notably, some errors could mislead patients. For example, ChatGPT's response to a question about one eye appearing smaller than the other failed to consider critical conditions like proptosis, limiting its diagnostic value.

Broader Implications

The findings align with growing interest in integrating AI into healthcare, including triage, diagnostics, and patient education. Chatbots offer round-the-clock accessibility, addressing routine questions and reducing the burden on healthcare providers. However, the study cautioned against relying solely on these tools due to their lack of personalized context and occasional errors.

"While AI chatbots can provide general guidance, they are no substitute for professional medical evaluation," the authors noted, emphasizing the need for further development.

Future Directions

The researchers called for ongoing refinement of AI models to enhance their alignment with expert-level knowledge. Suggestions for future studies include exploring additional AI platforms, employing advanced techniques like prompt engineering, and testing chatbot performance on a broader array of medical questions.
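As one illustration of the prompt-engineering direction the authors suggest, the hypothetical sketch below constrains a chatbot toward cautious, patient-safe answers. The system prompt, model name, and helper function are illustrative assumptions, not the protocol evaluated in the study.

```python
# Hypothetical prompt-engineering sketch for ophthalmology Q&A.
# The system prompt and model name are illustrative assumptions only;
# this is not the setup the study tested.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a patient-education assistant for ophthalmology. "
    "Answer in plain language, list red-flag symptoms that require urgent "
    "care, mention the main differential diagnoses, and always advise an "
    "in-person eye examination rather than reliance on this answer."
)

def ask(question: str) -> str:
    """Send one patient question with the constrained system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name for illustration
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.2,  # lower temperature favors consistent, cautious answers
    )
    return response.choices[0].message.content

print(ask("Why might one of my eyes look smaller than the other?"))
```

A prompt of this kind targets exactly the failure modes the reviewers flagged, for instance an answer about unequal eye size that omits differentials such as proptosis.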

Conclusion

AI chatbots like ChatGPT and Bard are making strides in providing accessible, informative responses to ophthalmology-related questions. While ChatGPT demonstrated superior performance in this study, both tools require significant optimization to meet the rigorous demands of clinical use. The study underscores the potential of AI in medicine while reaffirming the indispensable role of healthcare professionals in patient care.
