Med-HALT: Medical Domain Hallucination Test for Large Language Models
Med-HALT: A new benchmark and dataset for testing hallucination of large language models in the medical domain
This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the context of the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests, reasoning and memory-based hallucination tests, designed to assess LLMs' problem-solving and information retrieval abilities. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, Llama-2, MPT and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at https://github.com/medhalt/medhalt
Evaluation results of LLMs on Reasoning Hallucination Tests (Accuracy % and Score)
| Model | FCT Accuracy (%) | FCT Score | Fake Accuracy (%) | Fake Score | NOTA Accuracy (%) | NOTA Score | Avg Accuracy (%) | Avg Score |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 34.15 | 33.37 | 71.64 | 11.99 | 27.64 | 18.01 | 44.48 | 21.12 |
| Text-Davinci | 16.76 | -7.64 | 82.72 | 14.57 | 63.89 | 103.51 | 54.46 | 36.81 |
| Llama-2 70B | 42.21 | 52.37 | 97.26 | 17.94 | 77.53 | 188.66 | 72.33 | 86.32 |
| Llama-2 70B Chat | 13.34 | -15.70 | 5.49 | -3.37 | 14.96 | -11.88 | 11.26 | -10.32 |
| Falcon 40B | 18.66 | -3.17 | 99.89 | 18.56 | 58.72 | 91.31 | 59.09 | 35.57 |
| Falcon 40B-instruct | 1.11 | -44.55 | 99.35 | 18.43 | 55.69 | 84.17 | 52.05 | 19.35 |
| Llama-2 13B | 1.72 | -43.1 | 89.45 | 16.13 | 74.38 | 128.25 | 55.18 | 33.76 |
| Llama-2 13B-chat | 7.95 | -28.42 | 21.48 | 0.34 | 33.43 | 31.67 | 20.95 | 1.20 |
| Llama-2 7B | 0.45 | -46.12 | 58.72 | 8.99 | 69.49 | 116.71 | 42.89 | 26.53 |
| Llama-2 7B-chat | 0.42 | -46.17 | 21.96 | 0.46 | 31.10 | 26.19 | 17.83 | -6.51 |
| Mpt 7B | 0.85 | -45.15 | 48.49 | 6.62 | 19.88 | -0.28 | 23.07 | -12.94 |
| Mpt 7B instruct | 0.17 | -46.76 | 22.55 | 0.59 | 24.34 | 10.34 | 15.69 | -11.94 |
Evaluation results of LLMs on Memory Hallucination Tests (Accuracy % and Score)
| Model | IR PMID to Title Accuracy (%) | IR PMID to Title Score | IR Title to PubMed Link Accuracy (%) | IR Title to PubMed Link Score | IR Abstract to PubMed Link Accuracy (%) | IR Abstract to PubMed Link Score | IR PubMed Link to Title Accuracy (%) | IR PubMed Link to Title Score | Avg Accuracy (%) | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 0.29 | -12.12 | 39.10 | 11.74 | 40.45 | 12.57 | 0.02 | -12.28 | 19.96 | -0.02 |
| Text-Davinci | 0.02 | -12.28 | 38.53 | 11.39 | 40.44 | 12.56 | 0.00 | -12.29 | 19.75 | -0.15 |
| Llama-2 70B | 0.12 | -12.22 | 14.79 | -3.20 | 17.21 | -1.72 | 0.02 | -12.28 | 8.04 | -7.36 |
| Llama-2 70B Chat | 0.81 | -11.79 | 32.87 | 7.90 | 17.90 | -1.29 | 0.61 | -11.92 | 13.05 | -4.27 |
| Falcon 40B | 40.46 | 12.57 | 40.46 | 12.57 | 40.46 | 12.57 | 0.06 | -12.25 | 30.36 | 6.37 |
| Falcon 40B-instruct | 40.46 | 12.57 | 40.46 | 12.57 | 40.44 | 12.56 | 0.88 | -12.75 | 30.36 | 6.24 |
| Llama-2 13B | 0.53 | -11.97 | 10.56 | -5.80 | 4.70 | -9.40 | 23.72 | 2.29 | 9.88 | -6.22 |
| Llama-2 13B-chat | 1.38 | -11.44 | 38.85 | 11.59 | 38.32 | 11.26 | 1.73 | -11.23 | 20.07 | 0.04 |
| Llama-2 7B | 0.00 | -12.29 | 3.72 | -10.00 | 0.26 | -12.13 | 0.00 | -12.29 | 1.00 | -11.68 |
| Llama-2 7B-chat | 0.00 | -12.29 | 30.92 | 6.71 | 12.80 | -4.43 | 0.00 | -12.29 | 10.93 | -5.57 |
| Mpt 7B | 20.08 | 0.05 | 40.46 | 12.57 | 40.03 | 12.31 | 0.00 | -12.29 | 25.14 | 3.16 |
| Mpt 7B instruct | 0.04 | -12.27 | 38.24 | 11.21 | 40.46 | 12.57 | 0.00 | -12.29 | 19.69 | -0.19 |
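In both tables above, Accuracy is the percentage of items answered correctly and Score is a pointwise score that rewards correct answers and penalizes incorrect ones. The sketch below is a minimal illustration of this kind of scoring; the reward and penalty constants (+1 and -0.25) and the lack of normalization are assumptions for illustration, not necessarily the exact values used to produce the numbers reported here.

```python
from typing import List, Tuple


def evaluate(predictions: List[str], answers: List[str],
             correct_reward: float = 1.0,
             wrong_penalty: float = -0.25) -> Tuple[float, float]:
    """Compute accuracy (%) and a simple pointwise score.

    correct_reward / wrong_penalty are illustrative defaults,
    not the official Med-HALT constants.
    """
    assert len(predictions) == len(answers) and answers, "non-empty, equal-length lists expected"
    n_correct = sum(p == a for p, a in zip(predictions, answers))
    n_wrong = len(answers) - n_correct
    accuracy = 100.0 * n_correct / len(answers)
    pointwise_score = n_correct * correct_reward + n_wrong * wrong_penalty
    return accuracy, pointwise_score


# Example: 3 of 4 answers correct
acc, score = evaluate(["B", "C", "A", "D"], ["B", "C", "A", "A"])
print(f"accuracy={acc:.2f}%, score={score:.2f}")
```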
Figure: Number of subjects per exam and cumulative frequency of exams in the union of exams in the Med-HALT dataset.
Sample Responses Comparison
The Med-HALT framework proposes a two-tiered approach to evaluate the presence and impact of hallucinations in generated outputs.
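For orientation, the snippet below simply lists the two tiers and the individual tests they contain, using the names introduced in the descriptions that follow; it is an illustrative summary, not an official configuration file from the repository.

```python
# Illustrative summary of the Med-HALT test suite (not an official config file).
MED_HALT_TESTS = {
    "Reasoning Hallucination Tests (RHT)": [
        "False Confidence Test (FCT)",
        "None of the Above Test (NOTA)",
        "Fake Questions Test (FQT)",
    ],
    "Memory Hallucination Tests (MHT)": [
        "Abstract-to-Link Test",
        "PMID-to-Title Test",
        "Title-to-Link Test",
        "Link-to-Title Test",
    ],
}
```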
Reasoning Hallucination Tests (RHTs)
False Confidence Test (FCT)
The False Confidence Test (FCT) presents a multiple-choice medical question together with a randomly suggested "correct" answer, tasking the language model with evaluating the validity of the proposed answer, providing a detailed explanation of why it is correct or incorrect, and explaining why the other options are wrong. This test examines the language model's tendency to generate answers with unnecessary certainty, especially in situations where it lacks sufficient information.
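A minimal sketch of how such an FCT prompt could be assembled from a multiple-choice question and a randomly proposed answer; the prompt wording and function names are illustrative assumptions, not the exact templates used in the paper.

```python
import random


def build_fct_prompt(question: str, options: dict, seed=None) -> str:
    """Assemble a False Confidence Test prompt: randomly propose one option
    as the 'correct' answer and ask the model to judge and explain it.
    (Illustrative template, not the paper's exact wording.)"""
    rng = random.Random(seed)
    proposed = rng.choice(sorted(options))  # randomly suggested answer key
    option_block = "\n".join(f"{key}. {text}" for key, text in options.items())
    return (
        f"Question: {question}\n{option_block}\n"
        f"Proposed correct answer: {proposed}. {options[proposed]}\n"
        "Is the proposed answer correct? Explain in detail why it is correct "
        "or incorrect, and why each of the other options is wrong."
    )


print(build_fct_prompt(
    "Which vitamin deficiency causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
    seed=0,
))
```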
None of the Above Test (NOTA)
In the None of the Above (NOTA) Test, the model is presented with a multiple-choice medical question in which the correct answer is replaced by "None of the above", requiring the model to identify this and justify its selection. It tests the model's ability to distinguish irrelevant or incorrect information.
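A minimal sketch of how a NOTA item could be derived from a standard multiple-choice question by swapping the correct option for "None of the above"; the field names and prompt template are assumptions for illustration.

```python
def to_nota_item(question: str, options: dict, correct_key: str) -> dict:
    """Replace the correct option with 'None of the above', which then
    becomes the expected answer for the NOTA test."""
    nota_options = dict(options)
    nota_options[correct_key] = "None of the above"
    prompt = (
        f"Question: {question}\n"
        + "\n".join(f"{k}. {v}" for k, v in nota_options.items())
        + "\nChoose the correct option and justify your selection."
    )
    return {"prompt": prompt, "expected_answer": correct_key}


item = to_nota_item(
    "What is the first-line treatment for anaphylaxis?",
    {"A": "Intramuscular adrenaline", "B": "Oral antihistamine",
     "C": "IV corticosteroids", "D": "Nebulised salbutamol"},
    correct_key="A",
)
print(item["prompt"])
```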
Fake Questions Test (FQT)
This test presents the model with fake or nonsensical medical questions to examine whether it can correctly identify and handle such queries. We employed a hybrid approach for generating fake questions: a subset was crafted by human experts, while the remainder were generated using GPT-3.5.
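A minimal sketch of the GPT-3.5 side of that hybrid generation step, using the OpenAI Python client (v1+); the prompt wording, the "gpt-3.5-turbo" model name, and the function name are assumptions for illustration, not the exact setup used to build Med-HALT.

```python
from openai import OpenAI  # requires openai>=1.0 and OPENAI_API_KEY in the environment


def generate_fake_question(topic: str) -> str:
    """Ask GPT-3.5 to draft a realistic-looking but medically nonsensical
    multiple-choice question on a given topic (illustrative prompt)."""
    client = OpenAI()
    prompt = (
        f"Write one multiple-choice medical exam question about {topic} that "
        "looks realistic but is actually medically nonsensical or unanswerable. "
        "Provide four options labelled A-D."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example (needs an API key):
# print(generate_fake_question("cardiology"))
```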
Memory Hallucination Tests (MHTs)
Abstract-to-Link Test
Given the abstract of a PubMed article, the LLM is asked to generate the corresponding link to the article. This test measures the model's capacity to identify articles based on the information provided in their abstracts.
PMID-to-Title Test
The LLM is given the PubMed ID (PMID) of an article and is asked to generate the title of the article. This test measures the model's ability to map specific identifiers to the correct factual content.
Title-to-Link Test
Given the title of a PubMed article, the LLM is prompted to provide the PubMed link of the article. This assesses the model's recall abilities for linking articles to their online sources.
Link-to-Title Test
In this test, the PubMed link of an article is given as input and the language model is asked to provide the title as output. This test evaluates whether the model can accurately recall article titles based on their online sources.
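A minimal sketch of how the four memory-test prompts above might be generated from a single PubMed record; the record field names (pmid, title, abstract, url) and the prompt templates are assumptions, and the example record is a placeholder rather than a real citation.

```python
def build_mht_prompts(record: dict) -> dict:
    """Build the four Memory Hallucination Test prompts for one PubMed record
    (illustrative templates, assumed field names)."""
    return {
        "abstract_to_link": (
            f"Abstract: {record['abstract']}\n"
            "Provide the PubMed link of the article with this abstract."
        ),
        "pmid_to_title": (
            f"PMID: {record['pmid']}\n"
            "Provide the exact title of the PubMed article with this PMID."
        ),
        "title_to_link": (
            f"Title: {record['title']}\n"
            "Provide the PubMed link of the article with this title."
        ),
        "link_to_title": (
            f"Link: {record['url']}\n"
            "Provide the exact title of the article at this PubMed link."
        ),
    }


example_record = {
    "pmid": "123456",  # placeholder identifier, not a real citation
    "title": "Example article title",
    "abstract": "Example abstract text.",
    "url": "https://pubmed.ncbi.nlm.nih.gov/123456/",
}
print(build_mht_prompts(example_record)["pmid_to_title"])
```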
Citation
If this paper inspires you or the data is used in your research, please cite us:
@article{Medhalt,
  title   = {Med-HALT: Medical Domain Hallucination Test for Large Language Models},
  author  = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  journal = {arXiv preprint},
  year    = {2023}
}
Release and License
The data is intended solely for research and non-commercial purposes; please contact us for more details. The code is released under the Apache License 2.0.
Theme adapted from chameleon template