LLMs hallucinate more when answering in their native training language

When AI Lies More in Its Mother Tongue

Here is a strange fact about large language models: they are more likely to make things up when answering questions in the language they were mostly trained on.

English, for most of them. The dominant language of the internet. The language of Wikipedia, Reddit, GitHub, and a thousand other training data sources. You would think that a model trained overwhelmingly on English would be most reliable in English. That more data would mean more truth.

The opposite appears to be true.

Researchers at several institutions have now documented this counterintuitive pattern. When you ask an LLM a factual question in English, it hallucinates more often than when you ask the same question in a language the model knows less well. The effect is not small. It is consistent across multiple model families. And it forces us to rethink what these models are actually doing when they generate text.

The Number That Made Researchers Do a Double Take

In a 2024 study led by researchers at the University of California, Berkeley and the Allen Institute for AI, the team tested GPT-4, Claude 3, and Llama 2 on a set of factual questions translated into 26 languages. The questions covered history, geography, science, and pop culture. Each question had a verifiable answer.

The results were stark. For GPT-4, the hallucination rate on English questions was around 18 percent. On questions in languages with smaller training data, like Burmese or Amharic, the rate dropped to roughly 6 percent. That is a threefold difference. Claude 3 showed a similar pattern: 14 percent hallucination in English, 5 percent in low-resource languages.
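The arithmetic behind these ratios is easy to check. The snippet below uses only the percentages quoted above; the dictionary layout and the function name are illustrative choices, not anything taken from the study.

```python
# Hallucination rates as quoted in the text (percent of questions).
reported_rates = {
    "GPT-4": {"english": 18.0, "low_resource": 6.0},
    "Claude 3": {"english": 14.0, "low_resource": 5.0},
}


def english_to_low_resource_ratio(rates):
    """How many times higher the English rate is than the low-resource rate."""
    return rates["english"] / rates["low_resource"]


# Round to one decimal place for readability.
ratios = {
    model: round(english_to_low_resource_ratio(rates), 1)
    for model, rates in reported_rates.items()
}

for model, ratio in ratios.items():
    print(f"{model}: English rate is {ratio}x the low-resource rate")
```

On these figures the gap is 3.0x for GPT-4 and 2.8x for Claude 3.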

The researchers controlled for question difficulty. They controlled for translation quality. They controlled for the possibility that models simply refused to answer in unfamiliar languages. The pattern held.

Another study, published in 2023 by researchers at Google and the University of Washington, found a similar pattern. They tested PaLM 2 on a multilingual factuality benchmark. The model was most factual in languages with medium-sized training corpora. It was least factual in English.

Why More Data Makes Models Less Truthful

This seems backward. More training data should mean better performance. That is how machine learning works. More examples of something means the model learns it better.

But language models do not learn facts the way we assume they do.

An LLM is a next-word predictor. It learns patterns in text. When it sees the phrase "The capital of France is," it has seen that pattern many times in English. It knows the next word is probably "Paris." But it has also seen the phrase "The capital of France is a beautiful city" and "The capital of France is often considered" and a million other variations that include the word "Paris" somewhere in the vicinity but not necessarily as the direct answer.

In English, the model has encountered so many variations of the same factual statement that the signal gets buried in noise. The model has learned a kind of statistical cloud around the fact. It knows Paris is associated with the capital of France. But it also knows that people write things like "The capital of France is not London" and "If the capital of France is Paris, then" and "Some people think the capital of France is Lyon." The model has seen the correct answer and the incorrect answers and all the linguistic framing around them.

In a language with less training data, the model has seen fewer variations. The signal is cleaner. When it sees the question in Burmese, the pattern is more likely to be a direct statement of fact. There is less noise. So the model produces the correct answer more often, or it refuses to answer at all.
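The dilution argument can be made concrete with a toy frequency count. The sketch below only illustrates the reasoning above. It does not describe how a real LLM computes probabilities. Both corpora are invented, and the name next_word_distribution is hypothetical. The function counts which word follows a fixed prefix in each corpus.

```python
from collections import Counter

PREFIX = "the capital of france is"


def next_word_distribution(corpus, prefix=PREFIX):
    """Relative frequency of the word that directly follows `prefix`."""
    counts = Counter()
    for sentence in corpus:
        text = sentence.lower()
        if text.startswith(prefix + " "):
            remainder = text[len(prefix):].split()
            if remainder:
                counts[remainder[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Invented "high-resource" corpus: the fact appears, but so do many framings.
high_resource = [
    "The capital of France is Paris",
    "The capital of France is Paris",
    "The capital of France is a beautiful city",
    "The capital of France is often considered romantic",
    "The capital of France is not London",
]

# Invented "low-resource" corpus: only direct statements of the fact.
low_resource = [
    "The capital of France is Paris",
    "The capital of France is Paris",
]

dist_high = next_word_distribution(high_resource)
dist_low = next_word_distribution(low_resource)

print("high-resource:", dist_high)
print("low-resource:", dist_low)
```

In this toy setup "paris" takes only part of the probability mass in the varied corpus and all of it in the sparse one, which is the intuition the argument relies on.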

The refusal safety net

This brings up an important point. Models are more likely to refuse to answer in low-resource languages. They say things like "I don't know" or "I cannot answer that question." That is not hallucination. It is a kind of honesty by default.

In English, models almost never refuse. They nearly always attempt an answer, and when they do, they sometimes make things up. The refusal rate in English for most models is near zero. In a language like Somali or Welsh, the refusal rate can be 20 percent or higher.

So part of the effect is that models are more cautious in unfamiliar territory. But the researchers controlled for this. They looked only at cases where the model actually gave an answer. Even then, the hallucination rate was lower in low-resource languages.
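One way to separate caution from accuracy is to compute the hallucination rate only over questions the model actually answered. The sketch below assumes a simple, hypothetical record format (the Response class and both function names are made up for illustration) and runs on invented toy data, not on figures from the studies.

```python
from dataclasses import dataclass


@dataclass
class Response:
    refused: bool  # the model declined to answer
    correct: bool  # the answer matched the reference (ignored if refused)


def refusal_rate(responses):
    """Share of all questions the model declined to answer."""
    return sum(r.refused for r in responses) / len(responses)


def hallucination_rate_when_answered(responses):
    """Share of wrong answers among the questions that were answered."""
    answered = [r for r in responses if not r.refused]
    if not answered:
        return 0.0
    wrong = sum(not r.correct for r in answered)
    return wrong / len(answered)


# Toy data: 2 refusals, 7 correct answers, 1 wrong answer.
toy_responses = (
    [Response(refused=True, correct=False)] * 2
    + [Response(refused=False, correct=True)] * 7
    + [Response(refused=False, correct=False)] * 1
)

print("refusal rate:", refusal_rate(toy_responses))
print("hallucination rate (answered only):",
      hallucination_rate_when_answered(toy_responses))
```

Dividing by answered questions rather than by all questions means a model cannot lower its measured hallucination rate simply by refusing more often.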

The Training Data Paradox

There is a deeper issue here. Language models are not databases. They do not store facts as discrete entries. They store patterns. And when a pattern is very common, the model learns a kind of statistical mush.

Think about the word "bridge." In English, the model has seen this word in thousands of contexts. The Golden Gate Bridge. Playing bridge. Bridging the gap. Burned bridges. The bridge of a ship. Bridge the divide. The model has learned that "bridge" is a word that connects many different concepts. It has learned a kind of semantic field around the word.

When you ask about a specific bridge, the model has to pick the right context from this huge cloud of associations. It often gets it right. But sometimes it blends contexts. It produces something that sounds plausible but is not true.

In a language where the model has seen the word for "bridge" only in a few contexts, the associations are cleaner. The model is less likely to confuse things.

The case that confirms the rule

There is one language where the hallucination rate is even higher than in English. Chinese.

Models trained on the internet have seen enormous amounts of Chinese text. For some models, Chinese training data rivals English. And the hallucination rate for Chinese questions is often higher than for English.

This fits the pattern. More training data means more noise. More associations. More ways to blend contexts. More hallucination.

The researchers found that the relationship is not perfectly linear. But the general trend is clear. Languages with the most training data produce the most hallucinations. Languages with the least training data produce the fewest.

What This Means for How We Use These Models

This finding has practical consequences.

If you are using an LLM to answer factual questions, you might get better results by asking in a language the model knows less well. That is a strange workaround. But it might work.

Some researchers have started doing exactly this. They ask questions in a low-resource language, get the answer, and then ask the model to translate the answer back into English. The translation step introduces some risk of error. But the overall accuracy improves.

This is not a sustainable solution. It is a hack. But it reveals something important about how these models work.
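As a rough sketch of that workaround, the function below chains three model calls: translate the question, answer it in the pivot language, and translate the answer back. The ask_model callable is a stand-in for whatever client you use. The prompts, the function name, and the default pivot language are all assumptions made for illustration, not a method taken from the studies.

```python
from typing import Callable


def ask_via_language_shift(
    question: str,
    ask_model: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    pivot_language: str = "Welsh",    # assumed example of a low-resource language
) -> str:
    """Answer an English question by routing it through a pivot language."""
    # Step 1: translate the question into the pivot language.
    translated_question = ask_model(
        f"Translate this question into {pivot_language}. "
        f"Return only the translation.\n\n{question}"
    )

    # Step 2: ask the translated question and request an answer in that language.
    pivot_answer = ask_model(
        f"Answer the following question in {pivot_language}.\n\n"
        f"{translated_question}"
    )

    # Step 3: translate the answer back. This step can add errors of its own.
    return ask_model(
        f"Translate the following {pivot_language} text into English. "
        f"Do not add or remove information.\n\n{pivot_answer}"
    )
```

Each question costs three model calls instead of one, so the extra latency is built into the design. Passing the model in as a plain callable also makes the pipeline easy to exercise with a stub before connecting a real client.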

The fluency trap

There is a deeper lesson here. We tend to trust models that sound fluent. When an LLM produces smooth, confident English prose, we assume it knows what it is talking about. That is exactly the wrong assumption.

Fluency is a measure of pattern matching. It is not a measure of truth. The model can sound completely confident while saying something completely false. In fact, the model is most fluent when it is generating from the densest part of its training distribution. And that is exactly where the hallucinations are most common.

When the model produces stilted, uncertain text in a language it knows less well, that is actually a sign that it is being more careful. It is working harder. It is less likely to make things up.

What This Actually Means

▸If you need a factual answer from an LLM, try asking in a language with less training data. Then ask the model to translate the answer. This is not a perfect method, since translation can introduce its own errors, but the studies described above report hallucination rates roughly three times lower in low-resource languages than in English.

▸Do not trust fluency as a signal of accuracy. A model that sounds confident and smooth is more likely to be hallucinating, not less. The most dangerous answers are the ones that sound the most natural.

▸When building applications that rely on factual output, consider adding a language-shift step. Have the model generate an answer in a low-resource language before translating to English. This adds latency but improves factuality.

▸The training data paradox means that models will always struggle with facts in their dominant language. This is not a bug that can be fixed with more data. It is a feature of how statistical language models work. More data makes the problem worse, not better.

▸For high-stakes factual tasks, never rely on a model's direct output in English. Always verify. Always cross-check. The model is most likely to deceive you when it sounds most like itself.