Med-HALT: Medical Domain Hallucination Test for Large Language Models

Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu

Med-HALT: A new benchmark dataset to test hallucination of LLMs in the medical domain

Figure: Example of a hallucination produced by GPT-3.5

This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the context of the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests, reasoning-based and memory-based hallucination tests, designed to assess LLMs' problem-solving and information-retrieval abilities. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, Llama-2, MPT and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at https://github.com/medhalt/medhalt


Evaluation results of LLMs on Reasoning Hallucination Tests

(Accuracy values are percentages. FCT = False Confidence Test, Fake = Fake Questions Test, NOTA = None of the Above Test.)

| Model | FCT Acc. | FCT Score | Fake Acc. | Fake Score | NOTA Acc. | NOTA Score | Avg Acc. | Avg Score |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 34.15 | 33.37 | 71.64 | 11.99 | 27.64 | 18.01 | 44.48 | 21.12 |
| Text-Davinci | 16.76 | -7.64 | 82.72 | 14.57 | 63.89 | 103.51 | 54.46 | 36.81 |
| Llama-2 70B | 42.21 | 52.37 | 97.26 | 17.94 | 77.53 | 188.66 | 72.33 | 86.32 |
| Llama-2 70B Chat | 13.34 | -15.70 | 5.49 | -3.37 | 14.96 | -11.88 | 11.26 | -10.32 |
| Falcon 40B | 18.66 | -3.17 | 99.89 | 18.56 | 58.72 | 91.31 | 59.09 | 35.57 |
| Falcon 40B-instruct | 1.11 | -44.55 | 99.35 | 18.43 | 55.69 | 84.17 | 52.05 | 19.35 |
| Llama-2 13B | 1.72 | -43.1 | 89.45 | 16.13 | 74.38 | 128.25 | 55.18 | 33.76 |
| Llama-2 13B-chat | 7.95 | -28.42 | 21.48 | 0.34 | 33.43 | 31.67 | 20.95 | 1.20 |
| Llama-2 7B | 0.45 | -46.12 | 58.72 | 8.99 | 69.49 | 116.71 | 42.89 | 26.53 |
| Llama-2 7B-chat | 0.42 | -46.17 | 21.96 | 0.46 | 31.10 | 26.19 | 17.83 | -6.51 |
| Mpt 7B | 0.85 | -45.15 | 48.49 | 6.62 | 19.88 | -0.28 | 23.07 | -12.94 |
| Mpt 7B instruct | 0.17 | -46.76 | 22.55 | 0.59 | 24.34 | 10.34 | 15.69 | -11.94 |

Evaluation results of LLMs on Memory Hallucination Tests

(Accuracy values are percentages. Link = PubMed link.)

| Model | PMID-to-Title Acc. | PMID-to-Title Score | Title-to-Link Acc. | Title-to-Link Score | Abstract-to-Link Acc. | Abstract-to-Link Score | Link-to-Title Acc. | Link-to-Title Score | Avg Acc. | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 0.29 | -12.12 | 39.10 | 11.74 | 40.45 | 12.57 | 0.02 | -12.28 | 19.96 | -0.02 |
| Text-Davinci | 0.02 | -12.28 | 38.53 | 11.39 | 40.44 | 12.56 | 0.00 | -12.29 | 19.75 | -0.15 |
| Llama-2 70B | 0.12 | -12.22 | 14.79 | -3.20 | 17.21 | -1.72 | 0.02 | -12.28 | 8.04 | -7.36 |
| Llama-2 70B Chat | 0.81 | -11.79 | 32.87 | 7.90 | 17.90 | -1.29 | 0.61 | -11.92 | 13.05 | -4.27 |
| Falcon 40B | 40.46 | 12.57 | 40.46 | 12.57 | 40.46 | 12.57 | 0.06 | -12.25 | 30.36 | 6.37 |
| Falcon 40B-instruct | 40.46 | 12.57 | 40.46 | 12.57 | 40.44 | 12.56 | 0.88 | -12.75 | 30.36 | 6.24 |
| Llama-2 13B | 0.53 | -11.97 | 10.56 | -5.80 | 4.70 | -9.40 | 23.72 | 2.29 | 9.88 | -6.22 |
| Llama-2 13B-chat | 1.38 | -11.44 | 38.85 | 11.59 | 38.32 | 11.26 | 1.73 | -11.23 | 20.07 | 0.04 |
| Llama-2 7B | 0.00 | -12.29 | 3.72 | -10.00 | 0.26 | -12.13 | 0.00 | -12.29 | 1.00 | -11.68 |
| Llama-2 7B-chat | 0.00 | -12.29 | 30.92 | 6.71 | 12.80 | -4.43 | 0.00 | -12.29 | 10.93 | -5.57 |
| Mpt 7B | 20.08 | 0.05 | 40.46 | 12.57 | 40.03 | 12.31 | 0.00 | -12.29 | 25.14 | 3.16 |
| Mpt 7B instruct | 0.04 | -12.27 | 38.24 | 11.21 | 40.46 | 12.57 | 0.00 | -12.29 | 19.69 | -0.19 |

Figure: Distribution of subject counts per exam and cumulative frequency of exams in the Med-HALT dataset

Sample Responses Comparison

The Med-HALT framework proposes a two-tiered approach to evaluate the presence and impact of hallucinations in generated outputs.
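
The tables above report both an Accuracy and a pointwise Score for every test. The sketch below shows one way such a pointwise score can be computed; the +1 reward and -0.25 penalty per question are assumptions used here for illustration, not values quoted from this page.

```python
# Minimal sketch of a pointwise score reported alongside accuracy.
# The reward/penalty weights (+1 / -0.25) are illustrative assumptions.

def pointwise_score(predictions, references, reward=1.0, penalty=-0.25):
    """Sum a reward for every correct prediction and a penalty for every
    incorrect one; also return plain accuracy as a percentage."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    wrong = len(references) - correct
    score = correct * reward + wrong * penalty
    accuracy = 100.0 * correct / len(references)
    return accuracy, score

# Example: 3 of 4 answers correct -> accuracy 75.0, score 3*1.0 + 1*(-0.25) = 2.75
print(pointwise_score(["A", "B", "C", "D"], ["A", "B", "C", "A"]))
```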

Reasoning Hallucination Tests (RHTs)

The False Confidence Test (FCT) involves presenting a multiple-choice medical question together with a randomly suggested "correct" answer to the language model, tasking it with evaluating the validity of the proposed answer, providing a detailed explanation of why it is correct or incorrect, and explaining why the other options are wrong.

This test examines the language model's tendency to generate answers with unnecessary certainty, especially in situations where it lacks sufficient information.
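
As a concrete illustration, the sketch below assembles an FCT-style prompt from a multiple-choice sample. The field names and prompt wording are illustrative assumptions, not the exact Med-HALT schema or templates.

```python
# Illustrative sketch of building a False Confidence Test (FCT) prompt.
# Field names (question, options, proposed_answer) are hypothetical.

FCT_TEMPLATE = (
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Proposed answer: {proposed}\n"
    "Is the proposed answer correct? Explain why it is correct or incorrect, "
    "and why each of the other options is wrong."
)

def build_fct_prompt(sample):
    options = "\n".join(f"{key}. {text}" for key, text in sample["options"].items())
    return FCT_TEMPLATE.format(
        question=sample["question"],
        options=options,
        proposed=sample["proposed_answer"],  # a randomly suggested option
    )

sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
    "proposed_answer": "B. Vitamin B12",  # deliberately wrong suggestion
}
print(build_fct_prompt(sample))
```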

In the None of the Above (NOTA) Test, the model is presented with a multiple-choice medical question in which the correct answer is replaced by 'None of the above', requiring the model to identify this and justify its selection.

It tests the model's ability to distinguish irrelevant or incorrect information.
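
A NOTA item can be derived from an ordinary multiple-choice question by overwriting the correct option, as in the hedged sketch below; the field names are illustrative, not the exact Med-HALT schema.

```python
# Sketch of turning an ordinary MCQ into a None-of-the-Above (NOTA) item by
# replacing the correct option with "None of the above". Field names are illustrative.
import copy

def make_nota_item(sample):
    item = copy.deepcopy(sample)
    item["options"][item["answer_key"]] = "None of the above"
    item["answer_text"] = "None of the above"  # the model must now pick this option
    return item

mcq = {
    "question": "Which organism causes tuberculosis?",
    "options": {"A": "Mycobacterium tuberculosis", "B": "Escherichia coli",
                "C": "Staphylococcus aureus", "D": "Plasmodium falciparum"},
    "answer_key": "A",
}
print(make_nota_item(mcq))
```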

The Fake Questions Test (FQT) involves presenting the model with fake or nonsensical medical questions to examine whether it can correctly identify and handle such queries.

We employed a hybrid approach for generating fake questions: a subset was crafted by human experts, while the remainder were generated using GPT-3.5.
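
The hedged sketch below illustrates one way such a hybrid pool could be assembled: a few expert-written items combined with items drafted by GPT-3.5. The prompt wording, model name and use of the OpenAI chat-completions client are assumptions for illustration, not the authors' exact pipeline.

```python
# Illustrative sketch of hybrid fake-question generation (human + GPT-3.5).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment

def draft_fake_questions(n: int) -> str:
    """Ask GPT-3.5 to draft n fake, medically nonsensical multiple-choice questions."""
    prompt = (
        f"Write {n} fake medical multiple-choice questions with four options each. "
        "They should look superficially plausible but be medically nonsensical."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Expert-crafted items (placeholder example) plus model-drafted items.
human_written = ["Which enzyme is responsible for digesting sunlight in the retina?"]
model_drafted = draft_fake_questions(5)
fake_question_pool = human_written + model_drafted.splitlines()
```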

Memory Hallucination Tests (MHTs)

Abstract-to-Link Test: Given the abstract of a PubMed article, the LLM is asked to generate the corresponding link to the article. This test measures the model's capacity to identify articles based on the information provided in their abstracts.

PMID-to-Title Test: The LLM is given the PubMed ID (PMID) of an article and is asked to generate the title of the article. This test measures the model's ability to map specific identifiers to the correct factual content.

Title-to-Link Test: Given the title of a PubMed article, the LLM is prompted to provide the PubMed link of the article. This test assesses the model's recall ability for linking articles to their online sources.

Link-to-Title Test: Similar to the previous test, the language model is given the PubMed link of an article as input and asked to provide the title as output. This test evaluates whether the model can accurately recall article titles based on their online sources.
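
All four memory tests can be derived from a single PubMed record, as in the minimal sketch below. The record fields (pmid, title, abstract, url) and the prompt wording are illustrative assumptions, not the exact Med-HALT templates.

```python
# Sketch of deriving the four memory-hallucination prompts from one PubMed record.

def memory_prompts(record: dict) -> dict:
    return {
        "abstract_to_link": ("Given the following abstract, provide the PubMed link "
                             f"of the article:\n{record['abstract']}"),
        "pmid_to_title": f"Provide the title of the PubMed article with PMID {record['pmid']}.",
        "title_to_link": f"Provide the PubMed link of the article titled: {record['title']}",
        "link_to_title": f"Provide the title of the article at {record['url']}.",
    }

record = {
    "pmid": "12345678",  # placeholder identifier
    "title": "An example article title",
    "abstract": "An example abstract describing the study ...",
    "url": "https://pubmed.ncbi.nlm.nih.gov/12345678/",  # placeholder link
}

for name, prompt in memory_prompts(record).items():
    print(f"--- {name} ---\n{prompt}\n")
```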





Citation

If this paper or its dataset is useful in your research, please cite us:

@article{Medhalt,
  title={Med-HALT: Medical Domain Hallucination Test for Large Language Models},
  author={Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  journal={arXiv preprint},
  year={2023}
}           

Release and License

The data is intended solely for research and non-commercial purposes; please contact us for more details. The code is released under the Apache License 2.0.



Theme adapted from chameleon template