The Question That Won’t Go Away

I was on a video call with a researcher last spring, and she said something that stopped me cold. “We keep asking language models to answer questions they already got wrong. And sometimes, they just fix it.” She wasn’t talking about fine-tuning or reinforcement learning. She was talking about something simpler: giving the model a second chance.
For months, the AI community had been obsessed with the hallucination problem. Models confidently inventing facts, citing nonexistent papers, mixing up dates and names. Everyone assumed the fix required more data, better training, or architectural changes. But a handful of labs started asking a different question. What if the model already knows the right answer, and we just aren’t asking it the right way?
The Trick That Shouldn’t Work

The idea is almost embarrassingly simple. You ask a large language model a question. It gives you an answer. Then you show the model its own answer and ask: “Is that really correct?” Then you let it answer again.
Researchers at several institutions, including a team at MIT CSAIL and another at the University of Oxford, began testing this approach in late 2023 and early 2024. They called it “self-correction” or “self-reflection,” though the specifics varied. In one common setup, the model is prompted to evaluate its own output, then regenerate a response.
The results surprised even the optimists. When models like GPT-4 and Claude were given this second chance, hallucination rates dropped by 15 to 30 percent on factual question answering tasks. The improvement was most dramatic on questions where the model had high confidence but was wrong. In one study, the model’s accuracy on a set of 500 obscure historical trivia questions jumped from 62 percent to 79 percent after self-correction.
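The two-pass setup described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual protocol: `ask_model` is a hypothetical callable standing in for a real model client, the review wording is an assumption, and `fake_model` is a toy stand-in so the sketch runs offline.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt string and returns
# the model's text reply. It stands in for a real API client.
AskModel = Callable[[str], str]

# Assumed wording for the review step; the studies' exact prompts are not given.
REVIEW_TEMPLATE = (
    "Question: {question}\n"
    "Your previous answer: {answer}\n"
    "Is that really correct? Review it, then give your final answer."
)


def self_correct(question: str, ask_model: AskModel) -> tuple[str, str]:
    """Return (first_answer, revised_answer) from one round of self-correction."""
    first = ask_model(question)
    review_prompt = REVIEW_TEMPLATE.format(question=question, answer=first)
    revised = ask_model(review_prompt)
    return first, revised


def fake_model(prompt: str) -> str:
    """Toy stand-in for an LLM: wrong at first, right when shown its answer."""
    if "Your previous answer" in prompt:
        return "Canberra"
    return "Sydney"


if __name__ == "__main__":
    first, revised = self_correct("What is the capital of Australia?", fake_model)
    print("first:  ", first)
    print("revised:", revised)
```

In real use, `fake_model` would be replaced by a function that calls whichever model you are testing. Returning both answers makes it easy to log how often the second pass changes anything.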
Why the Model Knows Better Than It Says

This seems paradoxical. If the model knows the right answer, why didn’t it give it the first time? The researchers have a hypothesis, and it’s worth sitting with.
Language models are not databases. They are probability machines that generate text by predicting the next token. When you ask a question, the model samples from a distribution of possible continuations. The most probable continuation is not always the most accurate one. Sometimes the model latches onto a plausible but incorrect pattern because it matches the training data more closely. The second chance changes what the model is conditioning on. Its first answer is now part of the prompt, so it is evaluating a claim rather than generating one from scratch, and that can shift the probability distribution toward the correct answer.
In a 2024 paper from the University of Washington, researchers showed that models often “know” the correct answer internally. They could detect it in the model’s hidden states, even when the model produced the wrong answer. The problem was not a knowledge gap. It was a decoding problem. The model needed a nudge to override its initial, incorrect generation.
The Role of Confidence
The effect is not uniform. It works best when the model is moderately confident in its wrong answer. If the model is extremely confident and wrong, self-correction barely helps. If the model is uncertain, it often corrects itself anyway. The sweet spot is the middle ground, where the model has enough information to recognize its mistake but not enough to avoid it in the first place.
One researcher described it as “the model’s own uncertainty becoming a resource instead of a bug.”
The Number That Made Researchers Do a Double Take
In a preprint from early 2024, a team at Google DeepMind tested self-correction on a benchmark called TruthfulQA. This benchmark is designed to catch models in the act of hallucinating. It includes questions that are commonly misunderstood, like “What happens if you swallow gum?” (It does not stay in your stomach for seven years.)
The baseline model answered correctly 58 percent of the time. After one round of self-correction, accuracy rose to 74 percent. That is a 16 percentage point jump.
But here is where it gets interesting. The researchers also tested a version where the model was shown a human-written correction instead of its own answer. The accuracy barely budged. The model did not need an external correct answer. It needed to see its own wrong answer and then be given space to fix it.
This suggests something about how the model processes information. It is not simply retrieving facts. It is constructing answers based on patterns. When it sees its own construction, it can evaluate it differently than when it is generating from scratch.
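The TruthfulQA figures quoted above (58 percent before, 74 percent after) can be read three ways, and the difference matters when comparing studies. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
baseline = 0.58   # accuracy before self-correction (from the text)
corrected = 0.74  # accuracy after one round (from the text)

# Absolute gain in percentage points.
point_gain = (corrected - baseline) * 100

# Relative gain: how much larger the new accuracy is than the old one.
relative_gain = (corrected - baseline) / baseline * 100

# Error reduction: what share of the original wrong answers went away.
error_reduction = ((1 - baseline) - (1 - corrected)) / (1 - baseline) * 100

print(f"{point_gain:.0f} percentage points")             # 16 percentage points
print(f"{relative_gain:.1f}% relative gain in accuracy")  # 27.6% relative gain in accuracy
print(f"{error_reduction:.1f}% fewer wrong answers")      # 38.1% fewer wrong answers
```

A 16-point jump therefore corresponds to removing a bit more than a third of the baseline model's wrong answers.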
The Catch: It Only Works for Some Models
Not all language models can do this. The ability to self-correct seems to emerge only in models above a certain size and training quality. GPT-3.5, for example, showed minimal improvement. GPT-4 showed significant gains. Smaller open-source models like Llama 2 7B actually got worse after self-correction. They started second-guessing correct answers and introducing new errors.
The researchers at Oxford called this the “reflection gap.” The model needs to be good enough to evaluate its own output, but not so good that it rarely makes mistakes in the first place. This is a narrow window.
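One way to see why a second pass can help one model and hurt another is to track two rates: how often the model fixes a wrong answer, and how often it breaks a right one. The sketch below is a back-of-the-envelope model, and every rate in it is invented for illustration; none of these numbers comes from the studies described here.

```python
def accuracy_after(acc_before: float, fix_rate: float, break_rate: float) -> float:
    """Expected accuracy after one self-correction round.

    fix_rate:   share of wrong answers the model turns into right ones.
    break_rate: share of right answers the model turns into wrong ones.
    """
    kept_right = acc_before * (1 - break_rate)
    newly_right = (1 - acc_before) * fix_rate
    return kept_right + newly_right


# Hypothetical strong model: fixes many errors, rarely breaks a right answer.
strong = accuracy_after(acc_before=0.60, fix_rate=0.40, break_rate=0.02)

# Hypothetical small model: fixes few errors, often second-guesses itself.
weak = accuracy_after(acc_before=0.40, fix_rate=0.10, break_rate=0.30)

print(f"strong model: 0.60 -> {strong:.3f}")
print(f"weak model:   0.40 -> {weak:.3f}")
```

Under these assumed rates the strong model gains and the weak model loses. Self-correction pays off only when the answers it fixes outnumber the answers it breaks.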
The Instruction Tuning Factor
Models that were fine-tuned with reinforcement learning from human feedback (RLHF) performed better at self-correction than models that were not. The reason is probably that RLHF training teaches models to be critical of their own outputs. It gives them a kind of internal critic, even if that critic is not explicitly trained for self-correction.
One experiment compared a base GPT-4 model to a version that was given an explicit “critic” prompt before answering. The critic-prompted version improved by an additional 8 percent over the standard self-correction method. The model was essentially being asked to play two roles: first the answerer, then the reviewer.
What This Tells Us About Hallucination
The fact that self-correction works at all is a clue about the nature of hallucination. It is not always a failure of knowledge. It is often a failure of execution. The model knows the right answer, but it gets distracted by a more probable wrong path.
This changes how we should think about fixing hallucination. Instead of shoveling more data into the model, we might focus on better decoding strategies. Self-correction is one such strategy. Another is “contrastive decoding,” which scores each candidate token by how much more likely a strong model finds it than a weaker model does, then favors the tokens where that gap is largest. That technique has also shown promise.
But self-correction has a practical advantage. It does not require a second model or a complex pipeline. It just requires asking the model to look at its own work.
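For comparison, here is a toy version of the contrastive decoding idea mentioned above. It is a simplified sketch of a single decoding step, not a production implementation: the probability tables are made up, and the function name and the `alpha` cutoff value are assumptions chosen for illustration.

```python
import math


def contrastive_pick(expert_probs: dict, amateur_probs: dict, alpha: float = 0.1) -> str:
    """Pick the next token by contrasting a strong and a weak model.

    expert_probs / amateur_probs: token -> probability for the next token.
    alpha: plausibility cutoff, relative to the expert's top probability.
    """
    cutoff = alpha * max(expert_probs.values())
    best_token, best_score = None, -math.inf
    for token, p_expert in expert_probs.items():
        if p_expert < cutoff:
            continue  # skip tokens the strong model itself finds implausible
        p_amateur = amateur_probs.get(token, 1e-9)  # floor avoids log(0)
        score = math.log(p_expert) - math.log(p_amateur)
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Made-up next-token probabilities for "The capital of Australia is ..."
expert = {"Sydney": 0.45, "Canberra": 0.40, "Melbourne": 0.15}
amateur = {"Sydney": 0.70, "Canberra": 0.10, "Melbourne": 0.20}

print("greedy pick:     ", max(expert, key=expert.get))
print("contrastive pick:", contrastive_pick(expert, amateur))
```

In this toy case greedy decoding follows the popular wrong answer, while the contrastive score favors the token the strong model rates far higher than the weak one does. The cost is the second model, which is exactly what self-correction avoids.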
The Limits Nobody Talks About
Self-correction is not a silver bullet. It has three major limitations that researchers are still grappling with.
First, it roughly doubles the computational cost. Every query now requires at least two passes through the model, and the second pass has a longer prompt because it includes the first answer. For applications where latency or cost matters, this is a real tradeoff.
Second, it can introduce new errors. In the Oxford study, models handled the second pass correctly about 70 percent of the time. In the other 30 percent of cases, they either left a wrong answer unchanged or changed a correct answer to a wrong one. The net gain was positive, but the error rate was not zero.
Third, the effect diminishes with repeated attempts. After two or three rounds of self-correction, the model stops improving. It either converges on a stable answer or starts oscillating between two wrong answers. The model does not get smarter the more it looks at its own output. It just gets more confident in whatever it settled on.
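The third limitation suggests a simple guard: cap the number of review rounds and stop early once the answer stabilizes or starts repeating. The sketch below is one possible way to do that under stated assumptions. `ask_model` is a hypothetical callable, the prompt wording is invented, and comparing answers as exact strings is a crude stand-in for a real equivalence check.

```python
from typing import Callable, List, Tuple

AskModel = Callable[[str], str]  # hypothetical stand-in for a real model client


def bounded_self_correct(
    question: str, ask_model: AskModel, max_rounds: int = 2
) -> Tuple[str, List[str], str]:
    """Run up to max_rounds review passes; return (answer, history, status)."""
    history = [ask_model(question)]
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {history[-1]}\n"
            "Is that really correct? Give your final answer."
        )
        revised = ask_model(prompt)
        if revised == history[-1]:
            return revised, history, "converged"  # stable answer, stop paying
        if revised in history:
            history.append(revised)
            return revised, history, "oscillating"  # flipped back to an old answer
        history.append(revised)
    return history[-1], history, "round limit reached"


def flip_flop_model(prompt: str) -> str:
    """Toy model that alternates between two answers when asked to review."""
    if "Your previous answer: 1912" in prompt:
        return "1921"
    return "1912"


def stable_model(prompt: str) -> str:
    """Toy model that always gives the same answer."""
    return "Canberra"


if __name__ == "__main__":
    print(bounded_self_correct("In what year did it happen?", flip_flop_model))
    print(bounded_self_correct("What is the capital of Australia?", stable_model))
```

The status string is the useful part. An "oscillating" result is a signal to fall back on another source rather than trust either answer.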
The Confidence Trap
This last point is worth emphasizing. Self-correction works because the model can recognize its own mistakes. But the model cannot recognize all of its own mistakes. When it is confidently wrong, it stays wrong. The second chance only helps when the model is uncertain enough to reconsider.
One researcher told me, “The model is like a student who knows when they guessed versus when they actually know the answer. But they are not always right about that distinction.”
What This Actually Means
- If you are using a large model like GPT-4, add a self-correction step for factual queries. Prompt the model to review and revise its own answer. You will get fewer hallucinations, especially on questions where the model is uncertain.
- Do not use self-correction with small models. Models under 30 billion parameters tend to get worse, not better. They introduce errors faster than they fix them.
- Self-correction is not a replacement for retrieval-augmented generation (RAG). If the model needs external facts, give it those facts. But self-correction can catch hallucinations that RAG misses, especially when the model misinterprets the retrieved information.
- Limit self-correction to one or two rounds. More rounds do not help. They waste compute and can degrade performance. The model does not converge to perfect truth. It converges to a local optimum.
- The existence of self-correction as a viable technique tells us something fundamental: hallucination is partly a decoding problem, not just a knowledge problem. This means better prompting strategies can matter as much as better training data. The model already knows more than it says. We just have to ask it twice.
