LLMs Could Improve Diagnoses with Decision Support, MGB Finds

Introduction to AI in Medicine

Mass General Brigham researchers have been exploring the potential of artificial intelligence in diagnosing patients. They compared two large language models, OpenAI’s GPT-4 and Google’s Gemini 1.5, with their homegrown diagnostic decision support system, DXplain. The results showed that DXplain listed the correct diagnosis more often than the language models, but that the two types of AI could complement each other to better inform treatment.

What is DXplain?

DXplain was first developed in 1984 as a standalone platform and has since evolved into a web-based application and cloud-based differential diagnosis engine. It relies on 2,680 disease profiles, more than 6,100 clinical findings, and hundreds of thousands of data points to generate and rank potential diagnoses. A user can enter clinical findings, and the system will generate a rank-ordered list of diagnoses that explain the findings.
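To make the idea of a rank-ordered differential concrete, here is a minimal sketch in Python. It is not DXplain's actual algorithm or data: the three-disease knowledge base, the `rank_diagnoses` function, and the scoring rule (the fraction of entered findings a disease profile explains) are all hypothetical, chosen only to illustrate how entered findings can be turned into a ranked list.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge base (NOT DXplain data): disease -> associated findings.
KNOWLEDGE_BASE: Dict[str, Set[str]] = {
    "influenza": {"fever", "cough", "myalgia"},
    "pneumonia": {"fever", "cough", "dyspnea", "chest pain"},
    "migraine": {"headache", "photophobia", "nausea"},
}


def rank_diagnoses(
    findings: Set[str], kb: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Return (disease, score) pairs, best match first.

    Assumed scoring rule: the fraction of the entered findings that the
    disease profile explains. Diseases that explain nothing are dropped.
    """
    if not findings:
        return []
    scored = []
    for disease, profile in kb.items():
        matched = findings & profile
        if matched:
            scored.append((disease, len(matched) / len(findings)))
    # Sort by descending score, then alphabetically to break ties.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranked = rank_diagnoses({"fever", "cough", "dyspnea"}, KNOWLEDGE_BASE)
for disease, score in ranked:
    print(f"{disease}: {score:.2f}")
```

A real system would weight findings by how strongly they suggest or rule out each disease; the simple overlap count here only shows the overall shape of the task.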

How Does it Compare to Language Models?

Language models, such as ChatGPT and Gemini, have been shown to perform as well as physicians on certain types of board examinations and have had success in analyzing case descriptions and generating accurate diagnoses. However, these models behave as a "black box": they do not explain their reasoning. DXplain, in contrast, is designed to explain its conclusions.

The Study

Researchers prepared a collection of 36 diverse clinical cases based on actual patients from three academic medical centers. They compared the performance of DXplain, ChatGPT, and Gemini in diagnosing these cases. The results showed that DXplain listed the correct diagnosis more often than the language models, especially when laboratory test results were included in the case reports.
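A comparison of this kind comes down to counting how often each system's list contains the correct diagnosis. The Python sketch below shows that arithmetic under stated assumptions: the function name `hit_rate`, the case identifiers, and all of the diagnoses are invented for illustration and are not the study's cases or results.

```python
from typing import Dict, List


def hit_rate(predictions: Dict[str, List[str]], truth: Dict[str, str]) -> float:
    """Fraction of cases whose correct diagnosis appears anywhere in the list."""
    if not truth:
        return 0.0
    hits = sum(
        1 for case_id, dx in truth.items() if dx in predictions.get(case_id, [])
    )
    return hits / len(truth)


# Invented example data (hypothetical, not from the study).
truth = {"case1": "pneumonia", "case2": "migraine", "case3": "gout", "case4": "anemia"}
system_a = {
    "case1": ["pneumonia", "influenza"],
    "case2": ["migraine"],
    "case3": ["gout"],
    "case4": ["leukemia"],
}
system_b = {
    "case1": ["influenza"],
    "case2": ["tension headache", "migraine"],
    "case3": ["cellulitis"],
    "case4": ["anemia"],
}

rate_a = hit_rate(system_a, truth)
rate_b = hit_rate(system_b, truth)
print(f"system A: {rate_a:.2f}, system B: {rate_b:.2f}")
```

Running the same function on case reports with and without laboratory results would show how much each system depends on that information.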

Results and Implications

The study found that DXplain performed better when all laboratory test results were included in the case reports. The language models performed well in certain cases but did not explain their reasoning. The researchers suggest that a hybrid approach, combining the strengths of both types of AI, could produce synergistic benefits. For example, when a language model lists a correct diagnosis that DXplain missed, asking the model to justify that diagnosis could help identify and correct errors in DXplain's knowledge base.
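One way to picture this hybrid workflow is as a two-step filter: find the cases where a language model listed the correct diagnosis and the decision support system did not, then ask the model to justify each one. The sketch below is an assumption-laden illustration, not the researchers' method. The function names `missed_by_dss` and `build_review_prompt`, the prompt wording, and the data are all hypothetical, and no real model API is called.

```python
from typing import Dict, List


def missed_by_dss(
    truth: Dict[str, str],
    dss_lists: Dict[str, List[str]],
    llm_lists: Dict[str, List[str]],
) -> List[str]:
    """Case IDs where the LLM listed the correct diagnosis but the DSS did not."""
    return sorted(
        case_id
        for case_id, dx in truth.items()
        if dx in llm_lists.get(case_id, []) and dx not in dss_lists.get(case_id, [])
    )


def build_review_prompt(case_id: str, diagnosis: str) -> str:
    """Hypothetical prompt asking a language model to justify a diagnosis."""
    return (
        f"For {case_id}, explain which clinical findings support "
        f"the diagnosis of {diagnosis}."
    )


# Invented example data (hypothetical, not from the study).
truth = {"case1": "gout", "case2": "anemia", "case3": "migraine"}
dss_lists = {"case1": ["cellulitis"], "case2": ["anemia"], "case3": ["migraine"]}
llm_lists = {"case1": ["gout", "cellulitis"], "case2": ["leukemia"], "case3": ["migraine"]}

to_review = missed_by_dss(truth, dss_lists, llm_lists)
prompts = [build_review_prompt(case_id, truth[case_id]) for case_id in to_review]
for prompt in prompts:
    print(prompt)
```

The model's answers would still need expert review before anyone changed a knowledge base on their strength.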

The Larger Trend

A previous study by MGB researchers tested ChatGPT by having it work through an entire clinical encounter with a patient. The language model’s performance was steady across care modalities, but it struggled with differential diagnoses. This highlights the importance of using AI systems in conjunction with human expertise, particularly in the early stages of patient care.

Conclusion

The study demonstrates the potential of AI in medicine, particularly when combining different types of AI systems. A hybrid approach that leverages the strengths of both language models and diagnostic decision support systems could improve clinical efficacy and patient outcomes. As AI continues to evolve, it is likely that healthcare will see many support systems running concurrently, with human expertise playing a critical role in interpreting and validating AI-generated diagnoses.

FAQs

Q: What is DXplain?
A: DXplain is a diagnostic decision support system that generates a rank-ordered list of diagnoses based on clinical findings.
Q: How does DXplain compare to language models?
A: In the study, DXplain listed the correct diagnosis more often than the language models, but language models can complement DXplain by providing additional insights and suggestions.
Q: What is the potential of a hybrid approach?
A: A hybrid approach that combines the strengths of language models and diagnostic decision support systems could produce synergistic benefits and improve clinical efficacy.
Q: What is the role of human expertise in AI-generated diagnoses?
A: Human expertise is critical in interpreting and validating AI-generated diagnoses, particularly in the early stages of patient care.