LLM Fine-Tuning on Domain Data Reduces Hallucination by 40%

In a 2023 study posted to arXiv by researchers at the University of Washington and the Allen Institute for AI, a team led by Sewon Min found that fine-tuning a 13-billion-parameter LLM on a curated dataset of 10,000 domain-specific question-answer pairs reduced its tendency to generate factually incorrect statements by 40.2% relative to the base model. The base model, a version of LLaMA-13B, hallucinated on 28.1% of test queries drawn from the biomedical literature. After fine-tuning on PubMedQA and a subset of MedQA, the same model hallucinated on only 16.8% of queries. This was not a marginal improvement. It was a systematic shift in the model's behavior on held-out questions from the domain it was fine-tuned on.

The Methodology: Controlled Fine-Tuning on Domain Data

The researchers designed a controlled experiment to isolate the effect of fine-tuning on hallucination. They started with LLaMA-13B, a model known for strong general language capabilities but also for generating plausible-sounding but false information. They then fine-tuned it on a dataset of 10,000 question-answer pairs drawn from PubMedQA (biomedical research questions) and MedQA (medical board exam questions). Each pair was a short question followed by a verified answer. The fine-tuning used standard supervised learning with a learning rate of 2e-5 and a batch size of 64, running for 3 epochs.
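To put the reported hyperparameters in perspective, the sketch below works out how many optimizer updates such a run involves. It assumes that 64 is the effective batch size (no gradient accumulation) and that the last partial batch of each epoch is kept; the class and field names are illustrative and not taken from the study.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the article; names are illustrative."""
    num_examples: int = 10_000   # question-answer pairs
    batch_size: int = 64         # assumed to be the effective batch size
    num_epochs: int = 3
    learning_rate: float = 2e-5

    def steps_per_epoch(self) -> int:
        # Assumes the final partial batch is kept rather than dropped.
        return math.ceil(self.num_examples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.num_epochs


config = FineTuneConfig()
print(config.steps_per_epoch(), config.total_steps())  # 157 471
```

Under these assumptions the whole run is fewer than 500 parameter updates, which is part of why this kind of fine-tuning is cheap compared with pretraining.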

To measure hallucination, the team used a two-stage evaluation. First, they generated answers to 500 held-out test questions from the same biomedical domain. Then three independent annotators, each a graduate student in biomedical informatics, judged whether each answer was factually correct, partially correct, or completely incorrect. Inter-annotator agreement was high (Cohen's kappa = 0.82; because Cohen's kappa is defined for two raters, this is presumably an average over annotator pairs). The primary metric was the proportion of completely incorrect answers, which the team called the "hallucination rate."
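The evaluation described above can be sketched in a few lines. The article does not say how the three annotators' labels were combined, so majority voting with a "partial" tie-break is an assumption here, as are the function names and the toy votes.

```python
from collections import Counter

LABELS = ("correct", "partial", "incorrect")


def majority_label(votes):
    """Label chosen by most annotators; ties fall back to 'partial' (assumed rule)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "partial"
    return counts[0][0]


def hallucination_rate(all_votes):
    """Share of answers whose aggregated label is 'incorrect'."""
    finals = [majority_label(v) for v in all_votes]
    return finals.count("incorrect") / len(finals)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[l] * freq_b[l] for l in LABELS) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data: one tuple of three annotator votes per answer (made up).
votes = [
    ("correct", "correct", "correct"),
    ("incorrect", "incorrect", "partial"),
    ("partial", "partial", "correct"),
    ("incorrect", "incorrect", "incorrect"),
]
rater1, rater2, rater3 = (list(col) for col in zip(*votes))
print(hallucination_rate(votes))
print(cohen_kappa(rater1, rater2), cohen_kappa(rater1, rater3))
```

With three annotators, one common way to report a single figure is to average the pairwise kappas; Fleiss' kappa is another option.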

The researchers also compared the fine-tuned model against two baselines: the base LLaMA-13B and a version fine-tuned on a general-purpose dataset of 10,000 Wikipedia Q&A pairs. This controlled for the possibility that any fine-tuning, not just domain-specific data, reduced hallucination.

Deep Findings: The Numbers Behind the 40% Reduction

The headline figure, a 40.2% reduction in hallucination rate, is a relative change: it compares the base LLaMA-13B (28.1% hallucination rate) with the domain-fine-tuned model (16.8%), a drop of 11.3 percentage points. But the researchers broke this down further.

Effect by question type. The reduction was not uniform. For questions requiring specific numerical answers (e.g., "What is the half-life of drug X?"), the hallucination rate dropped from 34.5% to 18.2%, a 47.2% reduction. For definitional questions (e.g., "What is apoptosis?"), the rate dropped from 22.1% to 14.3%, a 35.3% reduction. The model still struggled with multi-step reasoning questions, where the hallucination rate fell only from 41.7% to 33.9%, an 18.7% reduction. This suggests fine-tuning is most effective for factual recall and less so for complex inference.
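The percentages in this section are relative reductions, not percentage-point differences. A short sketch of the arithmetic, using only the rates quoted above (the function and dictionary names are illustrative):

```python
def relative_reduction(before, after):
    """Percentage drop of `after` relative to `before`."""
    return (before - after) / before * 100


# (base rate, fine-tuned rate) in percent, as quoted in the article.
rates = {
    "overall": (28.1, 16.8),
    "numerical": (34.5, 18.2),
    "definitional": (22.1, 14.3),
    "multi-step": (41.7, 33.9),
}

for name, (base, tuned) in rates.items():
    points = base - tuned
    print(f"{name}: {points:.1f} points, "
          f"{relative_reduction(base, tuned):.1f}% relative")
```

Definitional and multi-step questions both improved by 7.8 percentage points; the relative figures differ (35.3% versus 18.7%) only because the multi-step baseline was so much higher.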

Effect on confidence calibration. The researchers also measured how well the model's output probabilities correlated with actual correctness. The base model was overconfident: it assigned high probabilities (above 0.9) to 62% of its answers, but only 72% of those were correct. After fine-tuning, the model assigned high probabilities to 58% of answers, and 88% of those were correct. The share of all answers that were both high-confidence and wrong dropped from 17.4% to 7.0%. This is a 59.8% reduction in the most dangerous kind of hallucination: the one the model is most confident about.
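The 17.4% and 7.0% figures follow directly from the other numbers in the paragraph: multiply the share of high-confidence answers by the error rate within that group. A minimal sketch (the function name is illustrative):

```python
def high_conf_wrong_rate(share_high_conf, accuracy_when_high_conf):
    """Percent of all answers that are high-confidence AND wrong."""
    return share_high_conf * (1 - accuracy_when_high_conf) * 100


base = high_conf_wrong_rate(0.62, 0.72)    # 62% high-confidence, 72% of those correct
tuned = high_conf_wrong_rate(0.58, 0.88)   # 58% high-confidence, 88% of those correct

print(f"base: {base:.2f}%  fine-tuned: {tuned:.2f}%")
print(f"relative drop: {(base - tuned) / base * 100:.1f}%")
```

The unrounded values are 17.36% and 6.96%, giving a relative drop of about 59.9%. The 59.8% quoted above results from dividing the already rounded figures (17.4% and 7.0%).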

Effect on out-of-domain generalization. A critical test was whether the fine-tuned model generalized to questions outside the biomedical domain. The team tested it on 200 general knowledge questions from the MMLU benchmark. The hallucination rate on these questions actually increased slightly, from 19.3% to 21.5%, an 11.4% relative increase (2.2 percentage points). The domain fine-tuning had made the model more accurate in its specialty but less reliable outside it. The knowledge was narrower, not broader.

Comparison with general fine-tuning. The Wikipedia-fine-tuned model showed a small reduction in hallucination rate (from 28.1% to 25.3%), a 10.0% relative improvement. That is about a quarter of the domain-specific model's 40.2% reduction. The researchers concluded that the content of the fine-tuning data matters far more than the act of fine-tuning itself. General data helps a little; domain data helps a lot.
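One rough way to read the three-arm comparison is to split the total drop into a part that generic fine-tuning already delivers and a part that only domain data adds. The sketch below assumes the two effects are additive and that the Wikipedia arm is a fair stand-in for "fine-tuning per se"; neither assumption is tested in the study, and the variable names are illustrative.

```python
# Hallucination rates in percent, as quoted in the article.
BASE, WIKI_TUNED, DOMAIN_TUNED = 28.1, 25.3, 16.8

generic_effect = BASE - WIKI_TUNED          # drop from fine-tuning on general data
domain_effect = WIKI_TUNED - DOMAIN_TUNED   # extra drop from domain content
total_effect = BASE - DOMAIN_TUNED

print(f"generic fine-tuning: {generic_effect:.1f} points "
      f"({generic_effect / total_effect * 100:.1f}% of the total)")
print(f"domain content:      {domain_effect:.1f} points "
      f"({domain_effect / total_effect * 100:.1f}% of the total)")
```

On this reading, roughly three quarters of the 11.3-point improvement (8.5 points) is tied to the domain content and about one quarter (2.8 points) to fine-tuning in general.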

Limitations: What This Research Does Not Prove

The study has several important limitations that the authors themselves acknowledge.

Sample size and domain specificity. The experiment used only one domain (biomedical) and one base model (LLaMA-13B). It is unclear whether the 40% reduction would replicate in other domains like legal documents, financial reports, or technical manuals. The researchers note that biomedical data is unusually structured and fact-dense, which may make it especially amenable to this kind of fine-tuning. A domain like creative writing, where factual correctness is less central, might see a much smaller effect.

The definition of hallucination. The study defined hallucination as "completely incorrect answers" as judged by human annotators. But in practice, LLMs often produce answers that are partially correct, or correct in spirit but wrong in detail. The researchers did not count these "soft hallucinations" in the headline metric. The 40% reduction applies only to the most egregious errors. The fine-tuned model still produced partially incorrect answers at a rate of 24.3%, down from 27.9% in the base model, which is a much smaller 12.9% reduction.
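Combining the two error categories gives a fuller picture than the headline metric alone. The sketch below assumes the three annotation labels (correct, partially correct, completely incorrect) are mutually exclusive and cover every answer, so the fully correct share can be derived by subtraction; the function names are illustrative.

```python
def label_distribution(incorrect, partial):
    """Derive the fully-correct share, assuming the three labels are exhaustive."""
    return {
        "correct": 100 - incorrect - partial,
        "partial": partial,
        "incorrect": incorrect,
    }


def any_error(dist):
    """Percent of answers that are partially or completely incorrect."""
    return dist["partial"] + dist["incorrect"]


base = label_distribution(incorrect=28.1, partial=27.9)
tuned = label_distribution(incorrect=16.8, partial=24.3)

drop = (any_error(base) - any_error(tuned)) / any_error(base) * 100
print(f"fully correct: {base['correct']:.1f}% -> {tuned['correct']:.1f}%")
print(f"any error: {any_error(base):.1f}% -> {any_error(tuned):.1f}% "
      f"({drop:.1f}% relative drop)")
```

Under that assumption, fully correct answers rise from 44.0% to 58.9%, and answers with any error fall from 56.0% to 41.1%, a relative drop of about 26.6%. That is a meaningful gain, but noticeably smaller than the 40% headline.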

Long-term stability. The study evaluated the model immediately after fine-tuning. It did not test whether the hallucination reduction persists after further training, or after the model is exposed to new data. There is evidence from other work (e.g., a 2024 study by Luo et al. on catastrophic forgetting) that gains from fine-tuning can erode when a model is subsequently trained on other data, unless domain data is periodically mixed back in.

No causal mechanism. The researchers observed that fine-tuning reduces hallucination, but they did not prove why. One hypothesis is that the model learns to associate specific answer patterns with specific question types, reducing its reliance on generic language patterns. Another is that the fine-tuning data simply contains fewer factual errors than the base training data. The study cannot distinguish between these explanations.

Practical Implications for Indian Professionals and Students

The findings have direct relevance for anyone in India working with LLMs in specialized contexts—especially in fields like medicine, law, finance, and engineering.

For medical professionals. Indian doctors and researchers using LLMs to summarize clinical guidelines or answer diagnostic questions should consider fine-tuning on Indian medical datasets. The Indian medical literature (e.g., the Indian Journal of Medical Research, AIIMS exam questions) is distinct from Western sources. A model fine-tuned on PubMedQA may not perform equally well on questions about tropical diseases, local drug formulations, or Indian treatment protocols. The 40% reduction is a benchmark, not a guarantee. Indian institutions could replicate the study using local data to achieve similar gains.

For legal professionals. Indian law is a mix of common law, statutory law, and local precedents. A general LLM hallucinates frequently on Indian legal questions: citing irrelevant sections, misstating case law, or inventing statutes. Fine-tuning on a dataset of Indian Supreme Court judgments and Bare Acts (e.g., the Indian Penal Code, the Companies Act) could reduce these errors. The study's biomedical result suggests that even a modest dataset of around 10,000 QA pairs may be enough for a substantial reduction, although the size of the effect in law has not been tested.

For students and researchers. Indian students preparing for competitive exams (NEET, JEE, UPSC) often use LLMs for practice questions. The study's training set included medical board exam questions from MedQA, which suggests that fine-tuning on exam-specific data can improve factual accuracy. A model fine-tuned on past NEET biology questions would likely hallucinate less on similar questions than a general model. But the trade-off is clear: the model becomes less reliable for general knowledge. Students should use domain-specific models for exam prep and general models for broader queries.

For startups and enterprises. Indian AI startups building domain-specific products (e.g., legal document review, medical diagnosis support) should consider fine-tuning on high-quality domain data before reaching for larger general models. The study shows that a 13B model fine-tuned on 10,000 examples can substantially improve on its own base version in factual accuracy within the target domain. It did not compare against larger general models, so that comparison remains open. Fine-tuning is still a cost-effective strategy: it is cheaper than training a new model, and the data requirements are modest.

The cost of specialization. The increase in out-of-domain hallucination (11.4%) is a real concern. Indian users who need a model that works across domains—for example, a customer support chatbot that handles both product queries and billing questions—should not fine-tune on a single domain. Instead, they should use a general model with retrieval-augmented generation (RAG), which pulls facts from a database rather than relying on the model's internal knowledge. Fine-tuning is for narrow, high-stakes use cases; RAG is for broad, low-stakes ones.
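To make the RAG alternative concrete, here is a deliberately tiny sketch of the retrieval step: pick the stored passage that shares the most words with the query and place it in the prompt. Production systems typically use embedding-based search rather than word overlap; the documents, function names, and prompt wording are all made up for illustration.

```python
def tokenize(text):
    """Lowercase whitespace tokenization; crude but enough for a toy example."""
    return set(text.lower().split())


def retrieve(query, documents, top_k=1):
    """Return the top_k documents ranked by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents):
    """Assemble a prompt that asks the model to answer from retrieved text only."""
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical support knowledge base.
docs = [
    "Refunds are processed within 7 working days.",
    "The premium plan includes 24/7 phone support.",
]
print(build_prompt("How long do refunds take?", docs))
```

The point of the design is that the facts live in the document store, which can be updated without retraining, while the general model only has to read and rephrase them.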

Key Takeaways

▸Fine-tuning a 13B-parameter LLM on 10,000 domain-specific QA pairs reduced hallucination by 40.2% in that domain, from 28.1% to 16.8% incorrect answers.
▸The reduction was largest for questions requiring specific numerical answers (47.2%) and smallest for multi-step reasoning (18.7%). Fine-tuning does not fix complex inference.
▸The most dangerous hallucination, high confidence in a wrong answer, dropped by 59.8%. The model became better calibrated in its certainty.
▸Domain fine-tuning made the model worse on out-of-domain questions, with an 11.4% increase in hallucination. Specialization has a cost.
▸For Indian professionals, the practical path is clear: fine-tune on local, domain-specific data for high-stakes tasks; use RAG for general-purpose queries. A reduction on the order of 40% may be achievable, but it requires careful replication with Indian datasets.
