NIH findings shed light on risks and benefits of integrating AI into medical decision-making

July 23, 2024 — Researchers at the National Institutes of Health (NIH) found that an artificial intelligence (AI) model achieved high accuracy on medical quiz questions designed to test health care professionals’ ability to diagnose patients from clinical images and a short text summary. However, the physicians who graded the model’s responses found that the AI model made errors when describing the images and when explaining how its reasoning led to the correct answer. The findings, which shed light on the potential of AI in the clinical setting, were published in npj Digital Medicine. The study was led by researchers from the NIH National Library of Medicine (NLM) and Weill Cornell Medicine in New York City.

“The integration of AI into healthcare holds great promise as a tool to help medical professionals diagnose patients faster so they can begin treatment sooner,” said Stephen Sherry, Ph.D., acting director of NLM. “But as this study shows, AI is not yet sophisticated enough to replace the human experience that is critical to accurate diagnosis.”

The AI model and human doctors answered questions from the New England Journal of Medicine (NEJM) Image Challenge. The challenge is an online quiz that provides real clinical images and a short text description detailing the patient’s symptoms and presentation, and then asks users to select the correct diagnosis from multiple-choice answers.

The researchers tasked the AI model with answering 207 Image Challenge questions and providing a written rationale to justify each answer. The prompt specified that the rationale should include a description of the image, a summary of relevant medical knowledge, and step-by-step reasoning for how the model chose its answer.
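A query of this kind could be issued to a multimodal model along roughly the following lines. This is a minimal sketch using the OpenAI Python client; the prompt wording, model name, and the ask_image_question helper are illustrative assumptions, not the study’s actual code.

```python
# Hypothetical sketch: ask a multimodal model an image-based multiple-choice
# question and request a structured rationale. The prompt wording, model name,
# and helper are illustrative assumptions, not the study's actual protocol.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_image_question(image_path: str, case_text: str, choices: list[str]) -> str:
    # Encode the clinical image so it can be sent inline as a data URL.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"Patient presentation: {case_text}\n"
        f"Answer choices: {', '.join(choices)}\n"
        "Select the single best diagnosis. In your rationale, include: "
        "(1) a description of the image, (2) a summary of relevant medical "
        "knowledge, and (3) step-by-step reasoning for your choice."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4V model used in the study
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

A call such as ask_image_question("case.jpg", "A 45-year-old presents with two lesions on the forearm...", ["A", "B", "C", "D", "E"]) would return the model’s chosen diagnosis and its written rationale as free text, which could then be graded as described below.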

The researchers recruited nine physicians from different institutions, each with a different medical specialty. The physicians first answered their assigned questions in a ‘closed-book’ setting (without consulting external materials such as online resources) and then in an ‘open-book’ setting (using external resources). The researchers then provided the physicians with the correct answer, along with the AI model’s answer and its rationale. Finally, the physicians were asked to score the AI model’s ability to describe the image, summarize relevant medical knowledge, and provide step-by-step reasoning.
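To make the protocol concrete, one plausible way to record a single physician’s review of one question is sketched below; the field names and scoring scale are illustrative assumptions, not the study’s actual data format.

```python
# Hypothetical record of one physician's evaluation of the AI model's answer
# to a single question. Field names and scoring scale are illustrative
# assumptions, not taken from the paper.
from dataclasses import dataclass


@dataclass
class Evaluation:
    question_id: int
    physician_closed_book_answer: str  # answered without external resources
    physician_open_book_answer: str    # answered with external resources
    ai_answer: str
    image_description_score: int       # rating of the AI's image description
    knowledge_summary_score: int       # rating of its medical knowledge summary
    reasoning_score: int               # rating of its step-by-step reasoning
```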

The researchers found that both the AI model and the physicians performed well in selecting the correct diagnosis. Interestingly, the AI model selected the correct diagnosis more often than physicians in the closed-book setting, while physicians in the open-book setting outperformed the AI model, especially on the questions rated most difficult.

Importantly, the physician evaluations showed that the AI model often made mistakes when describing the medical image and explaining its reasoning behind the diagnosis, even in cases where it made the correct final choice. In one example, the AI model was given a photo of a patient’s arm with two lesions. A physician would easily recognize that both lesions were caused by the same condition. However, because the two lesions were presented at different angles, creating the illusion of different colors and shapes, the AI model failed to recognize that both could be associated with the same diagnosis.

The researchers say these findings underscore the importance of further evaluating multimodal AI technology before introducing it into the clinical setting.

“This technology has the potential to help clinicians enhance their capabilities with data-driven insights that can lead to improved clinical decision-making,” said NLM Senior Investigator and corresponding author of the study, Zhiyong Lu, Ph.D. “Understanding the risks and limitations of this technology is essential to realizing its potential in medicine.”

The study used an AI model known as GPT-4V (Generative Pre-trained Transformer 4 with Vision), a “multimodal AI model” that can process combinations of multiple types of data, including text and images. The researchers note that while this is a small study, it sheds light on the potential of multimodal AI to assist doctors with medical decision-making. More research is needed to understand how such models compare with physicians’ ability to diagnose patients.

The research was conducted in collaboration with researchers from the NIH National Eye Institute and the NIH Clinical Center; the University of Pittsburgh; UT Southwestern Medical Center, Dallas; New York University Grossman School of Medicine, New York City; Harvard Medical School and Massachusetts General Hospital, Boston; Case Western Reserve University School of Medicine, Cleveland; the University of California San Diego, La Jolla; and the University of Arkansas, Little Rock.

The National Library of Medicine (NLM) is a leader in biomedical informatics and data science research and the world’s largest biomedical library. NLM conducts and supports research on methods for capturing, storing, retrieving, preserving, and communicating health information. NLM creates resources and tools that are used billions of times each year by millions of people to access and analyze molecular biology, biotechnology, toxicology, environmental health, and health services information.

For more information: https://www.nlm.nih.gov

Reference

Qiao Jin, et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine (2024). DOI: 10.1038/s41746-024-01185-7.
