LLMs Can Judge Each Other Better Than Humans Can

The Strange New Science of AI Judges

Imagine you have spent months building a chatbot. You have fine-tuned it on carefully curated data. You have spent thousands of dollars on compute. Now you need to know: is it actually good? The obvious answer is to ask humans. Have them talk to the bot, rate its responses, tell you what they prefer.

But here is the problem that Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, and their colleagues at UC Berkeley and collaborating institutions confronted in their 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena": human preferences are expensive, slow, and surprisingly unreliable. Two humans asked to judge the same chatbot conversation agree with each other only about 80 percent of the time. That is not a typo. People disagree with each other on which AI response is better roughly one out of every five times.

So the authors asked a weird question. What if we replaced the human judges with another AI? What if we let GPT-4 decide which chatbot won?

The answer, published on arXiv and now cited over 450 times, is unsettling in the best possible way. The LLM judges agreed with human preferences at the same rate humans agreed with each other. Over 80 percent. The machines were not just approximating human judgment. They were matching it.

This changes everything about how we evaluate AI systems. And it raises questions we are not ready to answer.

The Problem Human Judges Cannot Solve

The standard way to test a language model is with benchmarks. You give it a set of multiple-choice questions, or math problems, or coding challenges, and you count how many it gets right. This works fine for narrow skills. But modern chat assistants are not narrow. They write poems, debug code, explain quantum mechanics, and tell jokes. How do you score a joke?

Zheng and his team faced this problem directly. They wanted to evaluate open-ended chat assistants, models like Vicuna, built on LLaMA, that are designed to hold freeform conversations. Existing benchmarks like MMLU or HellaSwag measure factual recall or reasoning on constrained tasks. They do not capture whether a model is pleasant to talk to, or creative, or helpful in the way a human assistant would be.

The gold standard is human evaluation. You recruit raters, show them conversations, ask them which response they prefer. This is what companies like OpenAI and Anthropic do internally. But it is brutally expensive. A single large-scale human evaluation can cost tens of thousands of dollars. It takes days or weeks. And as the authors discovered, it is noisy. Human raters have inconsistent tastes. They get tired. They have biases.

The authors quantified this noise. When they collected over 3,000 expert votes comparing model answers to the MT-Bench questions, they found that human-to-human agreement topped out at around 80 percent. That is not a measurement error. It is a ceiling. If two people cannot agree on which answer is better one fifth of the time, then human preference is not a stable ground truth. It is a moving target.

The Experiment That Sounds Like a Joke

Here is where the paper gets interesting. Zheng and his colleagues proposed something that sounds circular: use a strong LLM, specifically GPT-4, to judge the outputs of other LLMs. They called this "LLM-as-a-judge."

The idea is not as silly as it sounds. A powerful language model has absorbed an enormous amount of human text. It has seen millions of examples of what people consider good writing, helpful explanations, and coherent arguments. In theory, it has internalized a model of human preference. The question is whether that internal model is accurate enough to replace actual humans.

To test this, the authors built two evaluation systems.

First, they created MT-Bench, a set of 80 multi-turn questions designed to test a model across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Each question requires the model to hold a coherent conversation over two turns. A judge, either human or LLM, then scores the quality of the response on a scale of 1 to 10.
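Here is a minimal sketch of how that kind of 1-to-10 scoring can be wired up. It is not the paper's actual prompt. The template wording, the convention of ending with the score in double square brackets, and the names build_prompt and parse_rating are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical judge template; the wording is illustrative, not the paper's prompt.
JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the user question on a scale of 1 to 10.\n"
    "Explain briefly, then end with the rating in the form [[rating]].\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}\n"
)


def build_prompt(question: str, answer: str) -> str:
    """Fill the template with one question and one model answer."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_rating(judge_output: str) -> Optional[int]:
    """Return the last [[n]] rating if it lies in 1..10, otherwise None."""
    matches = re.findall(r"\[\[(\d+)\]\]", judge_output)
    if not matches:
        return None
    rating = int(matches[-1])
    return rating if 1 <= rating <= 10 else None
```

Returning None instead of guessing matters here. A judge that rambles without producing a score should be retried or dropped, not silently counted as a zero.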

Second, they built Chatbot Arena, a crowdsourced platform where users chat with two anonymous models simultaneously and vote on which response they prefer. This generated over 30,000 real human preferences. Think of it as a tournament bracket for chatbots, with humans as the referees.

Then they compared. They took the same conversations and had GPT-4 judge them. They also had human experts judge them. They measured how often the two agreed.

The result: GPT-4 matched human preferences with over 80 percent agreement. That is the same number as human-to-human agreement. The LLM judge was not worse than a human judge. It was equivalent.
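The agreement number itself is simple arithmetic: the share of comparisons where two judges pick the same winner. A minimal sketch follows, with made-up votes and a hypothetical agreement_rate helper. The paper's own calculation handles ties and multiple raters in ways this toy version does not.

```python
from typing import Sequence


def agreement_rate(votes_a: Sequence[str], votes_b: Sequence[str]) -> float:
    """Fraction of comparisons on which two judges picked the same winner."""
    if len(votes_a) != len(votes_b):
        raise ValueError("both judges must vote on the same comparisons")
    if not votes_a:
        raise ValueError("need at least one comparison")
    matches = sum(a == b for a, b in zip(votes_a, votes_b))
    return matches / len(votes_a)


# Made-up votes on five pairwise comparisons ("A" or "B" is the preferred answer).
human_votes = ["A", "B", "A", "A", "B"]
gpt4_votes = ["A", "B", "B", "A", "B"]
print(agreement_rate(human_votes, gpt4_votes))  # 4 of 5 match -> 0.8
```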

But the Judge Has Biases

Before you declare AI judges infallible, the authors found three specific biases that distort LLM evaluations. These are important because they define the limits of the approach.

First, position bias. When GPT-4 sees two responses side by side, it prefers the first one more often than chance would predict. The model has a subtle ordering preference. It is like a judge who always favors the first contestant in a talent show. The authors' fix is to judge each pair twice with the order swapped, and to count a win only when the same response is preferred both times. A simple fix, but one that matters.
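The order swap can be sketched in a few lines. This is a conservative version: a response wins only if the judge prefers it in both orders, and anything else is a tie. The Judge signature and the "first" / "second" / "tie" labels are assumptions for illustration. A real judge would be a call to a model.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge the pair in both orders; declare a winner only if both runs agree."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # inconsistent or tied verdicts are not counted as a win


# A toy judge with pure position bias: it always picks whatever it sees first.
def always_first(question: str, first: str, second: str) -> str:
    return "first"


print(debiased_verdict(always_first, "q", "answer one", "answer two"))  # tie
```

A judge that only ever rewards position can never produce a win under this rule, which is exactly the point.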

Second, verbosity bias. LLM judges prefer longer responses. Even if a short answer is more precise and helpful, the model tends to give higher scores to the verbose one. This is a known problem in human evaluation too, and an AI judge can inherit it. One plausible explanation is the training data, where longer, more detailed answers are often rated higher by humans. The AI learned the correlation, but not the causation.
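One rough way to look for this in your own judge is to count how often the longer response wins. The sketch below assumes each comparison is recorded as a tuple of two answers and a winner label, and it measures length in whitespace-separated words. Both choices are illustrative. A high rate is not proof of bias on its own, since longer answers are sometimes better, so it is most useful next to the same number for human votes on the same pairs.

```python
from typing import Iterable, Optional, Tuple

# (answer_a, answer_b, winner), where winner is "A", "B", or "tie".
Comparison = Tuple[str, str, str]


def longer_win_rate(comparisons: Iterable[Comparison]) -> Optional[float]:
    """Share of decided, unequal-length comparisons won by the longer answer."""
    decided = 0
    longer_wins = 0
    for answer_a, answer_b, winner in comparisons:
        len_a, len_b = len(answer_a.split()), len(answer_b.split())
        if winner == "tie" or len_a == len_b:
            continue  # nothing to learn from ties or equal-length pairs
        decided += 1
        longer = "A" if len_a > len_b else "B"
        if winner == longer:
            longer_wins += 1
    return longer_wins / decided if decided else None
```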

Third, self-enhancement bias. This is the most troubling one. When an LLM judges its own outputs, it may rate them higher. In the paper's data, GPT-4 and Claude both leaned toward their own answers, though the authors caution that the evidence was too limited to be conclusive. It looks like a built-in narcissism. Even as a possibility, it means you should be wary of trusting an LLM to evaluate itself in a competition against other models. You need a separate, stronger judge, or you need to anonymize the responses carefully.
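A simple way to probe for this is to compare how often the judge's own model wins according to the judge with how often it wins according to humans on the same comparisons. The sketch below assumes each verdict is just the name of the winning model. The helper names and the example numbers are hypothetical, and a small gap on a small sample proves nothing.

```python
from typing import Sequence


def win_rate(winners: Sequence[str], model: str) -> float:
    """Fraction of comparisons won by the given model."""
    if not winners:
        raise ValueError("need at least one verdict")
    return sum(w == model for w in winners) / len(winners)


def self_preference_gap(
    judge_winners: Sequence[str], human_winners: Sequence[str], judge_model: str
) -> float:
    """Judge-assigned win rate for its own model minus the human-assigned one.

    A positive gap means the judge favors its own answers more than humans do.
    """
    return win_rate(judge_winners, judge_model) - win_rate(human_winners, judge_model)


# Made-up verdicts on ten comparisons involving the judge's own model.
judge_says = ["gpt-4"] * 7 + ["other"] * 3
humans_say = ["gpt-4"] * 6 + ["other"] * 4
gap = self_preference_gap(judge_says, humans_say, "gpt-4")  # roughly 0.1
```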

The authors also noted that LLM judges have limited ability to grade math and reasoning questions. They are good at comparing two responses and picking the better one. They are less good at checking whether the winning answer is actually correct, or at catching subtle factual errors. The judge is shallow in a specific way. It knows what looks good, but not always what is true.

Why This Works Better Than You Think

The surprising thing is not that LLM judges have biases. Every human judge has biases too. The surprising thing is that the LLM matches human agreement despite these biases. That means the biases are either minor, or they overlap with human biases in a way that does not degrade the overall match.

Think about what this implies. When two humans disagree on which chatbot response is better, they are not disagreeing because one is right and one is wrong. They are disagreeing because preference is subjective. The LLM judge, trained on human text, has learned that subjectivity. It does not have a single correct answer. It has a distribution of preferences that mirrors the human distribution.

This is why the 80 percent agreement number is so important. It is not a failure of the LLM judge. It is the theoretical maximum. If humans cannot agree with each other more than 80 percent of the time, then no judge, human or machine, can exceed that ceiling. The LLM is not falling short. It is hitting the glass ceiling of human subjectivity.

What the Study Actually Proved

The authors were careful not to overclaim. They showed that GPT-4 can match human preferences on the specific tasks in MT-Bench and Chatbot Arena. They did not show that LLM judges work for every evaluation scenario. They did not show that smaller or weaker models can serve as reliable judges. They tested GPT-4, one of the most capable models available at the time, and found it worked.

They also showed that the LLM judge approach is scalable and explainable. Scalable because once you set it up, you can evaluate thousands of conversations for a fraction of the cost of human raters. Explainable because the LLM can produce a justification for its score, something a human rater often cannot do consistently.

The authors made their data public: the MT-Bench questions, 3,000 expert votes, and 30,000 conversations with human preferences. This is a gift to the research community. Anyone can now test whether a new LLM judge works, or whether a new bias correction technique improves agreement.

The Open Questions Nobody Has Answered

The paper leaves several important questions unresolved. These are not weaknesses in the research. They are invitations for future work.

First, does the LLM judge work for models that are stronger than GPT-4? If a future model surpasses GPT-4 in reasoning ability, will GPT-4 still be a reliable judge? The authors found that weaker models make worse judges. Vicuna, for example, was a poor evaluator of other models. This suggests a ceiling effect. The judge must be at least as capable as the models it evaluates. As models improve, we may need to keep upgrading our judges.

Second, what happens when the judge is evaluating a model trained on data that includes the judge's own outputs? This is already happening. Models trained on web data have ingested GPT-4-generated text. If an LLM judge evaluates a model that was trained on text the judge itself wrote, is that a closed loop? Does it create a self-reinforcing bias that diverges from human preferences over time?

Third, can LLM judges detect subtle harms or biases that humans would notice? The authors tested factual accuracy and general preference. They did not test safety, toxicity, or fairness. An LLM judge might give a high score to a response that is subtly racist or manipulative, because the training data contains many examples of such responses being rated highly by humans. The judge inherits human flaws along with human preferences.

What This Actually Means

The paper by Zheng, Chiang, Sheng, Zhuang, and their colleagues is not just a technical result. It is a proof of concept for a new way of evaluating intelligence. Here is what it changes in practice.

▸ If you are building a chatbot, you can now use GPT-4 as a low-cost, high-speed judge for iterative improvement. Run 100 conversations, have GPT-4 score them, identify the weakest responses, retrain, repeat. This is orders of magnitude faster and cheaper than hiring human raters.

▸ The 80 percent agreement ceiling means you should not expect perfect alignment between any two judges, human or machine. This is not a bug. It is a feature of subjective preference. Stop chasing 100 percent agreement. Aim for human equivalence.

▸ Position bias and verbosity bias are real, but they are fixable. Randomize the order of responses. Normalize for response length. These are simple engineering solutions that dramatically improve reliability.

▸ Do not use a model to judge itself. The self-enhancement bias is too strong. If you are evaluating your own model, use a stronger external judge or anonymize the responses so the judge does not know which model produced them.

▸ The LLM-as-a-judge approach is not a replacement for human evaluation. It is a complement. Use it for rapid iteration and large-scale screening. Reserve human evaluation for final validation and for catching the subtle failures that LLM judges miss.
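The screening step in the first point above, scoring a batch and pulling out the weakest responses, can be sketched like this. The score_fn argument stands in for a call to an LLM judge. The toy_score function below is a deliberately silly placeholder so the example runs on its own, and both names are hypothetical.

```python
from typing import Callable, List, Tuple


def weakest_responses(
    responses: List[str], score_fn: Callable[[str], float], k: int = 10
) -> List[Tuple[float, str]]:
    """Score every response and return the k lowest-scoring (score, response) pairs."""
    scored = [(score_fn(response), response) for response in responses]
    scored.sort(key=lambda pair: pair[0])  # lowest scores first
    return scored[:k]


def toy_score(response: str) -> float:
    # Placeholder only: a real score_fn would ask a judge model for a 1-10 rating.
    return float(min(10, len(response.split())))


batch = ["one", "one two three", "one two"]
print(weakest_responses(batch, toy_score, k=2))
```

The responses that come back are the ones worth a human look before the next round of training.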

The most provocative implication is one the authors do not state explicitly. If LLMs can judge each other as well as humans can, then human preference is not the gold standard we thought it was. It is just another noisy signal. And for the first time, we have a machine that can reproduce that noise at scale, for pennies, in seconds.

That is not a replacement for human judgment. It is a mirror. And like any mirror, it shows us things about ourselves we might not have wanted to see.

References

[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv.