ChatGPT achieved a diagnostic success rate similar to that of a recently graduated medical student when given case studies, according to research from Mass General Brigham in Boston.
The study, published in the Journal of Medical Internet Research, found that ChatGPT was roughly 72% accurate when it came to general decision-making, “from coming up with possible diagnoses to making final diagnoses and care management decisions,” a Mass General Brigham news release said.
More impressively, it was 77% accurate in arriving at a final diagnosis.
Dr. Marc Succi, corresponding author on the study, said researchers assessed how ChatGPT could provide support for decision-making from the initial contact with a patient through the process of running tests, diagnosing illness and managing care.
Newsweek quoted a study co-author on AI’s potential: “Mass General Brigham sees great promise for large-language models to help improve care delivery and clinician experience,” said Adam Landman, chief information officer and senior vice president of digital at the health system, which includes a massive research arm and funding in the billions of dollars.
Large-language models of artificial intelligence like ChatGPT “have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy,” said Succi, whose titles include associate chair of innovation and commercialization and strategic innovation leader at Mass General Brigham.
Testing ChatGPT
The release said the study was based on the theory that the AI could be part of the initial evaluation of a patient, recommend what tests or screens to run, figure out a treatment plan and make a final diagnosis.
First, the researchers pasted “successive portions of 36 standardized, published clinical vignettes into ChatGPT.” The AI was then asked to come up with a set of possible diagnoses based on the patient’s information, including age, gender, symptoms and whether the case was a medical emergency. Once that was done, the AI was given more information and asked to make care management decisions and then a final diagnosis, “simulating the entire process of seeing a real patient.”
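For readers curious how such a staged evaluation might look in practice, here is a minimal sketch of one way to script it. The study does not publish its prompts or tooling, so the OpenAI Python client, the model name, the prompts and the abbreviated vignette below are all illustrative assumptions, not the researchers’ actual materials.

```python
# Illustrative sketch only. The paper does not publish its exact prompts or
# tooling, so the client library (openai>=1.0), model name, prompts and
# vignette text below are all assumptions, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# A placeholder case, split into "successive portions" as the study describes:
# presenting information first, then workup results.
vignette_part_1 = (
    "A 54-year-old man presents to the emergency department with 30 minutes "
    "of crushing chest pain radiating to the left arm, with diaphoresis."
)
vignette_part_2 = (
    "ECG shows ST-segment elevation in leads II, III and aVF. "
    "Troponin I is elevated."
)

# The three decision points the researchers scored, paraphrased as prompts.
steps = [
    (vignette_part_1, "List the possible diagnoses for this patient."),
    (vignette_part_2, "Recommend next tests and initial care management."),
    (None, "State the single most likely final diagnosis."),
]

# Keep the whole exchange in one conversation so each answer builds on the
# information revealed so far, mimicking a real patient encounter.
history = [{"role": "system",
            "content": "You are evaluating a written clinical case vignette."}]

for new_info, question in steps:
    if new_info:
        history.append({"role": "user", "content": new_info})
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-3.5-turbo",
                                           messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(f"\n>>> {question}\n{answer}")
```

Feeding the vignette in stages rather than all at once matters here, because the study scored ChatGPT separately at each decision point, including the early differential-diagnosis stage, where it performed worst.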
ChatGPT’s responses were then analyzed in a blinded process to assess how well it performed.
The AI performed best on final diagnosis and worst on making differential diagnoses, where it scored about 60%. It hit 68% in clinical management decisions, like deciding what medications to prescribe after it got the diagnosis right.
Another plus? ChatGPT showed no gender bias and it was equally accurate in both primary care and emergency care.
Doctors probably don’t need to worry — at least yet — that AI will make them unnecessary, though.
“ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine when a physician has to figure out what to do,” Succi said in the written statement. “That is important because it tells us where physicians are truly experts and adding the most value — in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed.”
The study did note some limitations, including “possible model hallucinations and the unclear composition of ChatGPT’s training data set.”
According to various online sources, a model hallucination is a confident answer, in this case a diagnosis, that is not actually supported by the data the model was given.
The next study these researchers are planning will look at whether AI can take some of the pressure off hospital systems in “resource-constrained areas” to help with patient care and outcomes.
Complex cases
Neuroscience News reported on an earlier study by Beth Israel Deaconess Medical Center published in JAMA. That research used ChatGPT’s generative AI to diagnose complex medical cases. It said GPT-4 correctly identified the main diagnosis 40% of the time and included it on its list of possibilities 64% of the time in those “challenging cases.”
GPT-4 is a generative AI model. Neuroscience News explained that “generative AI refers to a type of artificial intelligence that uses patterns and information it has been trained on to create new content, rather than simply processing and analyzing existing data.”
The article notes, “Despite the promising results, researchers stress the importance of further investigation to understand the optimal use, benefits and limitations of AI in a clinical setting.”