LLMs Hallucinate More When Answering Questions They Can Answer

The Question That Broke the Model

Ask a large language model something it knows, and it might lie to you. Ask it something it doesn’t know, and it might tell the truth.

That is the counterintuitive finding from a 2024 study by researchers at MIT and Microsoft Research. They discovered that when an LLM is asked a question it can answer correctly, it hallucinates more often than when it is asked a question it cannot answer. The model’s own competence, it turns out, is a trap.

The researchers called this the “competence gap.” It is not a bug in one particular model. It is a structural feature of how these systems are trained and tuned. And it explains why your chatbot sometimes fabricates an authoritative-sounding answer to a simple factual question, then hedges or stays silent when asked something genuinely hard.

The Number That Made Researchers Do a Double Take

The team tested several models, including GPT-4, Llama 2, and Mixtral 8x7B. They used a dataset of factual questions with known answers, split into two groups: questions the model could answer correctly at least 90 percent of the time, and questions it could answer correctly less than 50 percent of the time.

The results were stark. On questions the model knew well, it hallucinated between 10 and 20 percent of the time. On questions it did not know well, the hallucination rate dropped to below 5 percent.
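To make the comparison concrete, here is a minimal Python sketch of the bookkeeping described above. It is an illustration, not the study's code. The two thresholds come from the text; the record format, the outcome labels ('correct', 'hallucinated', 'abstained'), and the function names are assumptions made for this example.

```python
from collections import Counter

KNOWN_MIN_ACCURACY = 0.90    # "at least 90 percent" in the text
UNKNOWN_MAX_ACCURACY = 0.50  # "less than 50 percent" in the text


def assign_group(accuracy):
    """Return 'known', 'unknown', or None for questions in neither group."""
    if accuracy >= KNOWN_MIN_ACCURACY:
        return "known"
    if accuracy < UNKNOWN_MAX_ACCURACY:
        return "unknown"
    return None  # the 50-to-90 percent band belongs to neither group


def hallucination_rates(records):
    """records: iterable of (accuracy, outcome) pairs.

    accuracy is the model's measured accuracy on that question; outcome is a
    hypothetical label for one response: 'correct', 'hallucinated', or
    'abstained'. Returns {group: share of responses labeled 'hallucinated'}.
    """
    totals, hallucinated = Counter(), Counter()
    for accuracy, outcome in records:
        group = assign_group(accuracy)
        if group is None:
            continue
        totals[group] += 1
        if outcome == "hallucinated":
            hallucinated[group] += 1
    return {group: hallucinated[group] / totals[group] for group in totals}
```

Note that the two groups leave a middle band uncovered, so questions with accuracy between 50 and 90 percent drop out of the comparison in this sketch.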

Think about that. The model is more likely to make things up when it is on familiar ground. When it is out of its depth, it becomes more cautious. It says “I don’t know” more often. It hedges. It refuses.

This is the opposite of what you would expect from a competent human expert. A surgeon who knows a procedure cold does not suddenly invent a new organ. A historian who has studied the Peloponnesian War for decades does not fabricate a battle. But an LLM does.

Why Familiarity Breeds Contempt for the Truth

The fluency trap

The researchers identified a mechanism they called “overconfidence from fluency.” When a model has seen a pattern many times in training data, it generates the next token with high confidence. Nothing inside the model separately checks whether that confidence is warranted, so a fluent continuation is emitted as if it were a correct one. But confidence and correctness are not the same thing.

Consider a model asked about the capital of France. It has seen “Paris” associated with “capital of France” millions of times. The token “Paris” is overwhelmingly likely. The model generates it without hesitation. But what if the question is slightly different? “What is the capital of France, which was renamed in 2023?” The model still generates “Paris” because the pattern is so strong. It does not stop to check whether the premise is true. It cannot. It has no internal fact checker.

The researchers quantified this fluency effect by recording the probability the model assigned to the first token of its answer on each question. On questions the model knew, the average probability was 0.94. On questions it did not know, the average was 0.67. The model was more confident on familiar questions, and that confidence made it less likely to consider alternative answers or to refuse to answer.
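A measurement like this is simple to reproduce in outline. The Python sketch below assumes you already have, for each response, a list of per-token log-probabilities (many inference APIs can return these). The input format and function names are assumptions for illustration, not the study's tooling.

```python
import math
from statistics import mean


def first_token_probability(token_logprobs):
    """Convert the first generated token's log-probability to a probability."""
    return math.exp(token_logprobs[0])


def mean_first_token_probability(responses):
    """responses: list of per-response lists of token log-probabilities.

    Returns the average probability assigned to the first token.
    """
    return mean(first_token_probability(logprobs) for logprobs in responses)
```

Running this separately over the familiar and unfamiliar question groups would give the two averages being compared.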

The reinforcement learning side effect

There is a second reason for the competence gap. It comes from how models are fine-tuned to be helpful.

After initial training, models go through a process called reinforcement learning from human feedback, or RLHF. Human raters compare candidate answers, and their preferences become the reward signal the model is optimized against. Raters tend to prefer answers that are direct and confident. They penalize models that say “I don’t know” or that give uncertain answers. So the model learns that being wrong but confident is better than being uncertain.
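The incentive can be shown with a toy expected-reward calculation in Python. Every number here is invented for illustration: real RLHF uses a learned reward model, not a fixed table, and none of these values come from the study.

```python
def expected_reward_of_answering(p_correct, reward_correct=1.0, penalty_wrong=0.0):
    """Expected reward for giving a confident answer that is right with p_correct."""
    return p_correct * reward_correct - (1.0 - p_correct) * penalty_wrong


def prefers_answering(p_correct, penalty_abstain=0.5, penalty_wrong=0.0):
    """True if a confident answer has higher expected reward than 'I don't know'."""
    answer = expected_reward_of_answering(p_correct, penalty_wrong=penalty_wrong)
    abstain = -penalty_abstain  # abstaining is assumed to be mildly penalized
    return answer > abstain


# Wrong answers cost nothing: guessing wins even at a 5 percent chance of being right.
print(prefers_answering(0.05))                     # True
# Wrong answers cost more than abstaining: the same guess no longer pays.
print(prefers_answering(0.05, penalty_wrong=2.0))  # False
```

Under these assumed rewards, a policy that always answers is optimal whenever abstaining is penalized and errors are not, which is the pattern the paragraph above describes.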

The MIT team tested this directly. They compared models before and after RLHF. Before RLHF, models hallucinated at similar rates on easy and hard questions. After RLHF, the competence gap appeared. The model learned to be confidently wrong on questions it knew, because that is what humans rewarded.

This is a design choice. It is not inevitable. But it is baked into every major commercial model today.

The One Exception That Proves the Rule

There is a fascinating exception to the competence gap. It involves easy questions built on an obviously false premise.

If you ask a model “What is the capital of France?” it answers correctly. If you ask “What is the capital of France, which is located on the moon?” the model might still say Paris. But if you ask “What is the capital of France, which is a type of cheese?” the model hesitates. The absurdity triggers a different pathway.

The researchers called this the “obvious nonsense filter.” When a question contains a premise that violates basic world knowledge, the model sometimes recognizes the inconsistency and refuses to answer. But this filter is weak. It only works for extreme cases. Subtle false premises, like a date that is off by a few years or a name that is slightly wrong, slip through.

The competence gap is strongest for these subtle false premises. The model knows the general pattern so well that it does not notice the small change.
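One way to compensate for a weak built-in filter is to check the premise outside the model before it answers. The Python sketch below shows only the control flow. Both callables are hypothetical placeholders: `premise_is_suspect` stands in for whatever separate classifier you build, and `ask_model` for your own model call.

```python
from typing import Callable

FLAG_MESSAGE = "This question may rest on a false premise; please verify it."


def answer_with_premise_gate(
    question: str,
    premise_is_suspect: Callable[[str], bool],
    ask_model: Callable[[str], str],
) -> str:
    """Run a separate premise check first; only call the model if it passes."""
    if premise_is_suspect(question):
        return FLAG_MESSAGE
    return ask_model(question)
```

The hard part is the classifier itself, since subtle false premises are exactly the ones that are difficult to detect. The gate only guarantees that the check runs independently of the model's own confidence.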

What This Means for Every Conversation You Have With a Chatbot

The silent hallucination problem

Most discussions of hallucination focus on obvious fabrications. A model claims that a historical event happened in a different year. It invents a citation to a paper that does not exist. These are easy to spot.

But the competence gap creates a different problem. The model hallucinates most when it is most confident. This means the errors are hardest to detect. The model sounds authoritative. It uses the right terminology. It gives a plausible answer. The error is hidden inside a sea of correct information.

The researchers measured this. They asked human raters to evaluate answers from models on familiar questions. The raters caught only 30 percent of the hallucinations. On unfamiliar questions, they caught 60 percent. The model’s confidence made the humans less skeptical.

The safety paradox

There is a safety implication here that the researchers did not fully explore. If a model hallucinates more on questions it knows, then it is most dangerous when it is most competent.

Consider a medical chatbot that has been trained on a large corpus of medical literature. It knows common diseases and treatments well. But when asked about a rare side effect of a common drug, it might generate a confident but wrong answer. The model’s familiarity with the drug makes it overconfident. The user, trusting the model’s apparent expertise, accepts the answer.

The competence gap suggests that the most dangerous hallucinations are not the ones that sound crazy. They are the ones that sound exactly right.

The Open Questions That Keep Researchers Up at Night

The MIT study is not the last word. It raises several puzzles that researchers are still working on.

One puzzle is whether the competence gap applies to all types of knowledge. The study used factual questions with single correct answers. Does the same effect appear for open-ended questions, for creative tasks, for reasoning problems? Early evidence from other labs suggests it might be worse for reasoning. Models are overconfident in their logical chains, especially when the chains are long.

Another puzzle is whether the competence gap can be fixed without breaking the model’s usefulness. If you train the model to be more cautious on familiar questions, you might make it less helpful. If you train it to be more confident on unfamiliar questions, you might make it hallucinate more. There is a trade-off.

A third puzzle is whether the competence gap is a feature or a bug. One interpretation is that the model is doing exactly what it was trained to do. It is maximizing the reward signal from human raters. The competence gap is not a failure of the model. It is a failure of the reward signal.

What This Actually Means

▸ When you use an LLM for a task you know well, treat its answers with more skepticism, not less. The model is most likely to hallucinate when it is most confident. Verify plausible-sounding details such as dates and names, which is where small errors hide.

▸ If you are building a product that uses an LLM, consider adding a confidence calibration layer that forces the model to output a confidence score for each claim. The raw probabilities from the model are not reliable, but they are better than nothing.

▸ For safety-critical applications, do not rely on the model’s own refusal mechanisms. The model is bad at detecting when it should refuse. Build separate classifiers that detect nonsense premises and flag them before the model answers.

▸ The study points to RLHF as a main driver of the competence gap. If you are training your own model, consider adding a penalty for confident wrong answers. This will reduce helpfulness on some metrics but will reduce hallucination on familiar questions.

▸ When you read about a new model that scores high on benchmark tests, ask whether the benchmark measures accuracy on familiar questions or on unfamiliar ones. The competence gap means that high accuracy on familiar questions hides high hallucination rates. The benchmark numbers are not telling you the full story.
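For the calibration layer suggested above, a useful first step is to measure how far the model's stated confidence drifts from its actual accuracy. The Python sketch below computes expected calibration error, a standard metric that the article's study is not stated to have used. The input format, a list of (confidence, was_correct) pairs from your own labeled claims, is an assumption.

```python
def expected_calibration_error(scored_claims, n_bins=10):
    """scored_claims: list of (confidence in [0, 1], was_correct bool) pairs.

    Groups claims into equal-width confidence bins and returns the
    size-weighted average gap between mean confidence and accuracy.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, was_correct in scored_claims:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in the last bin
        bins[index].append((confidence, was_correct))

    total = len(scored_claims)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece
```

A value near zero means the confidence scores can be taken roughly at face value. A large value, especially one driven by the high-confidence bins, is the signature the competence gap would predict.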