Chatbot wrinkles
Large language models (LLMs) such as ChatGPT hold the potential to understand and generate text in much the same capacity as humans. Although controversial, that prospect is also compelling.
However, such systems can be vastly more complex than earlier AI-based tools, and some studies are illustrating the kinds of stumbling blocks that need to be overcome.
For instance, in a study published in May, researchers explored the potential of ChatGPT 4.0 to synthesize clinical guidelines for diabetic ketoacidosis from three different sources to reflect the latest evidence and local context.
Such efforts are important but can be very resource-intensive when conducted without the use of AI assistance.
The study’s results showed that, although ChatGPT was able to generate a comprehensive table comparing the guidelines, there were multiple recurrent misreporting and nonreporting errors, as well as inconsistencies, “rendering the results unreliable,” the authors wrote.
“Although ChatGPT demonstrates the potential for the synthesis of clinical guidelines, the presence of multiple recurrent errors and inconsistencies underscores the need for expert human intervention and validation,” the authors concluded.
Likewise, other research evaluating ChatGPT for vitreoretinal diseases, including diabetic retinopathy, showed similarly disappointing results: the chatbot provided completely accurate responses to only 8 (15.4%) of 52 questions, and some responses contained inappropriate or potentially harmful medical advice.
“For example, in response to ‘How do you get rid of epiretinal membrane?’, the platform described vitrectomy but also included incorrect options of injection therapy and laser therapy,” the authors wrote.
“The study highlights the limitations of using ChatGPT for the adaptation of clinical guidelines without expert human intervention,” they concluded.
And research published in August that examined ChatGPT’s ability to interpret guidelines – in this case, 26 diagnosis descriptions from the National Comprehensive Cancer Network (NCCN) – found that as many as one-third of the treatments recommended by the chatbot were at least partially not concordant with the NCCN guidelines, and that recommendations varied depending on how the question about treatment was phrased.
“Clinicians should advise patients that LLM chatbots are not a reliable source of treatment information,” the authors wrote.
Diversity concerns
Among the most prominent concerns about chatbot inaccuracy is the known lack of racial and ethnic diversity in the large databases used to develop AI systems, which can result in critical flaws in the information the systems produce.
In an editorial published with the NCCN guideline study, Atul Butte, MD, PhD, from the University of California, San Francisco, noted that the shortcomings should be weighed with the potential benefits.
“There is no doubt that AI and LLMs are not yet perfect, and they carry biases that will need to be addressed,” Dr. Butte wrote. “These algorithms will need to be carefully monitored as they are brought into health systems, [but] this does not alter the potential of how they can improve care for both the haves and have-nots of health care.”
In a comment, Dr. Butte elaborated that, once the systems’ flaws are addressed, a key benefit will be the broader application of top standards of care to more patients who may have limited resources.
“It is a privilege to get the very best level of care from the very best centers, but that privilege is not distributable to all right now,” Dr. Butte said.
“The real potential of LLMs and AI will be their ability to be trained from the patient, clinical, and outcomes data from the very best centers, and then used to deliver the best care through digital tools to all patients, especially to those without access to the best care or [those with] limited resources,” he said.
Further commenting on the issue of potential bias with chatbots, Matthew Li, MD, from the University of Alberta, Edmonton, said that awareness of the problem, and of the need for diversity in the data used to train and test AI systems, appears to be improving.
“Thanks to much research on this topic in recent years, I think most AI researchers in medicine are at least aware of these challenges now, which was not the case only a few years ago,” he said in an interview.
Across specialties, “the careful deployment of AI tools accounting for issues regarding AI model generalization, biases, and performance drift will be critical for ensuring safe and fair patient care,” Dr. Li noted.
On a broader level, there is ongoing concern about the potential for clinicians to over-rely on the technology. For example, a recent study showed that radiologists across all experience levels reading mammograms were prone to automation bias when supported by an AI-based system.
“Concerns regarding over-reliance on AI remain,” said Dr. Li, who coauthored a study published in June on the issue.
“Ongoing research into and monitoring of the impact of AI systems as they are developed and deployed will be important to ensure safe patient care moving forward,” he said.
Ultimately, the clinical benefit of AI systems to patients should be the bottom line, Dr. Dimai added.
“In my opinion, the clinical relevance, i.e., the benefit for patients and/or physicians of a to-be-developed AI tool, must be clearly proven before its development starts and first clinical studies are carried out,” he said.
“This is not always the case,” Dr. Dimai said. “In other words, innovation per se should not be the only rationale and driving force for the development of such tools.”
Dr. Li, an associate editor for the journal Radiology: Artificial Intelligence, reports no relevant financial relationships. Dr. Dimai is a member of the key medical advisor team of Image Biopsy Lab.
A version of this article first appeared on Medscape.com.