Evaluating Large Language Models Proves Tricky

The Problem with Asking a Language Model How Smart It Is

In the spring of 2023, a team of Chinese researchers did something that should have been obvious but wasn't. They asked a large language model to evaluate its own performance on a reasoning test. The model gave itself a confident, glowing review. Then they asked a human expert to evaluate the same model. The human found errors the model had missed entirely. The discrepancy was not subtle.

This is the central paradox of evaluating large language models. The system whose intelligence you are trying to measure is also the instrument you are using to measure it. And as Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, and their colleagues document in their 2024 survey published in ACM Transactions on Intelligent Systems and Technology, the field is nowhere close to solving this problem (Chang et al., 2024).

The survey, which has already accumulated over 2,300 citations, is not a single experiment. It is a systematic review of every major evaluation method that has been tried on LLMs, from GPT-3 to GPT-4 to LLaMA and beyond. The authors examined over 200 evaluation benchmarks, dozens of testing frameworks, and hundreds of studies. Their conclusion is sobering: we are building systems that can pass tests designed by humans, but we do not actually know how to test whether those systems understand anything.

What We Actually Mean by "Evaluation"

The Three Questions That Nobody Is Asking in the Right Order

Chang and colleagues break evaluation into three dimensions: what to evaluate, where to evaluate, and how to evaluate (Chang et al., 2024). This sounds like a bureaucratic checklist. It is not. It is a confession that the field has been doing these in the wrong order.

Most LLM research starts with "what" a model can do: can it translate French to English? Can it write a poem? Can it pass the bar exam? Then researchers figure out "where" to test it, meaning which benchmark dataset to use. Only at the end do they ask "how" to evaluate, which usually means picking a scoring metric.

The authors argue this is backwards. The "how" determines everything. If you evaluate a model on a multiple-choice test, you get a different picture than if you evaluate it on an open-ended generation task. If you use automated metrics like BLEU or ROUGE, you miss semantic errors that any human would catch. If you use human evaluators, you introduce cost, time, and subjectivity.

The survey documents that most LLM evaluations are still done on static benchmarks: datasets that were created before the models existed. This means the models have likely seen the test questions during training. Chang and colleagues found that when researchers controlled for data contamination, performance dropped by an average of 15 to 30 percent across multiple benchmarks (Chang et al., 2024). The models were not reasoning. They were remembering.
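One common way to look for this kind of contamination is to check whether benchmark items share long word sequences with the training corpus. The sketch below is illustrative only and is not the method used in the survey: the function names, the whitespace tokenizer, and the default window of 8 words are all assumptions chosen for brevity.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with the
    training documents. The window size n is an assumed, tunable threshold."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items)


# Hypothetical toy data: one benchmark question overlaps the "training" text.
training_docs = ["the quick brown fox jumps over the lazy dog"]
benchmark_items = ["The quick brown fox jumps where?", "What is two plus two?"]
rate = contamination_rate(benchmark_items, training_docs, n=4)
```

Exact n-gram matching only catches verbatim leakage. Paraphrased or translated test items would pass this check unnoticed, so a low rate is not proof that a benchmark is clean.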

The Four Kinds of Tests That Are Failing

Natural Language Processing Tasks: The Illusion of Fluency

The most common evaluation is on standard NLP tasks: text classification, named entity recognition, question answering, summarization. These are the same benchmarks that have been used for a decade. The authors found that LLMs now achieve human-level or near-human-level performance on many of these benchmarks, particularly in English (Chang et al., 2024).

But there is a catch. The benchmarks are saturated. The models have hit the ceiling of the test, not the ceiling of the ability. When the same models are tested on adversarial examples (inputs designed to fool them), performance drops dramatically. A model that scores 95 percent on a standard reading comprehension test can drop to 40 percent when the question is rephrased slightly.
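The drop described here can be measured directly by scoring the same model on original questions and on rephrased versions of them. The following is a minimal sketch under stated assumptions: `model` is any callable that maps a question string to an answer string, and `toy_model` is a hypothetical stand-in that has simply memorized one phrasing. Neither comes from the survey.

```python
def accuracy(model, items):
    """Share of (question, expected_answer) pairs the model answers exactly."""
    correct = sum(1 for question, expected in items if model(question) == expected)
    return correct / len(items)


def robustness_gap(model, original_items, perturbed_items):
    """Accuracy on the original items minus accuracy on their rephrasings."""
    return accuracy(model, original_items) - accuracy(model, perturbed_items)


# Hypothetical model that only knows one exact phrasing of the question.
memorized = {"What is the capital of France?": "Paris"}


def toy_model(question):
    return memorized.get(question, "unknown")


original = [("What is the capital of France?", "Paris")]
rephrased = [("Which city is France's capital?", "Paris")]
gap = robustness_gap(toy_model, original, rephrased)
```

A gap near zero suggests the score survives rephrasing. A large positive gap suggests the model is matching surface form, which is the pattern this section describes.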

Chang and colleagues document that this fragility is consistent across model families. GPT-4, LLaMA 2, and Claude all show similar patterns of high benchmark scores and low robustness. The authors call this the "evaluation gap": the difference between what a model can do on a test and what it can do in the wild.

Reasoning: The Thing That Looks Like Thinking But Isn't

Reasoning is the most hyped capability of LLMs. Models can now solve math problems, write code, and answer logic puzzles. The survey examines three types of reasoning: commonsense, mathematical, and logical.

For commonsense reasoning, models perform well on benchmarks like CommonsenseQA and Social IQA. But the authors found that models often rely on spurious correlations. For example, a model might answer "What do you use to cut bread?" correctly not because it understands bread and knives, but because the word "cut" appears frequently with "knife" in its training data (Chang et al., 2024). When the question is changed to "What do you use to cut a loaf?" the accuracy drops.

For mathematical reasoning, the picture is worse. Models like GPT-4 can solve complex word problems, but they fail on simple arithmetic when the numbers are large or the operations span multiple steps. The authors found that chain-of-thought prompting, where the model is asked to show its work, improves performance significantly, but it also introduces new errors. The model will write a perfectly logical-looking chain of reasoning and then get the final answer wrong.

Logical reasoning is the most concerning. Models perform well on formal logic tasks where the rules are explicit. But they fail on tasks that require understanding of quantifiers, negation, or counterfactuals. Chang and colleagues tested models on the LogicBench dataset and found that even the best models scored below 70 percent on tasks that require understanding of "all," "some," and "none" (Chang et al., 2024).

Ethics and Safety: The Hardest Test

This is where the evaluation problem becomes existential. How do you test whether a model is safe? You cannot ask the model directly, because the model might lie. You cannot use a benchmark, because safety is context-dependent.

The survey documents that current safety evaluations rely on red teaming, in which humans try to trick the model into producing harmful outputs. This works for finding specific vulnerabilities, but it does not provide a systematic measure of safety. Chang and colleagues found that models that pass safety benchmarks can still be jailbroken with novel attacks that were not in the test set (Chang et al., 2024).

More troubling, the authors found that safety evaluations are culturally biased. Most safety benchmarks are created by English-speaking researchers and focus on Western concepts of harm. A model that passes a Western safety test might still produce content that is harmful in a different cultural context.

Agent Applications: The New Frontier

The most recent development in LLM evaluation is testing models as agents: systems that can use tools, browse the web, and take actions in the real world. The authors reviewed evaluations of models on tasks like web navigation, database querying, and API usage.

The results are mixed. Models can follow instructions to use a tool, but they fail when the tool returns unexpected results. A model that can book a flight using a travel API might fail when the API returns an error message it was not trained on. Chang and colleagues found that agent performance drops by 40 to 60 percent when the environment changes slightly from the training setup (Chang et al., 2024).

This is not just a technical problem. If LLMs are going to be deployed as agents that control systems, we need to evaluate their ability to recover from errors, handle ambiguity, and ask for help. Current evaluations do not test any of these.

The Methods That Actually Work (Sort Of)

Human Evaluation: The Gold Standard That Is Made of Fool's Gold

The most reliable evaluation method is still human judgment. Humans can catch errors that automated metrics miss, understand context, and evaluate creativity. But human evaluation is expensive, slow, and inconsistent.

Chang and colleagues found that inter-annotator agreement, the degree to which two human evaluators give the same score, is often below 60 percent for tasks like summarization and dialogue (Chang et al., 2024). This means that even human evaluation is not a reliable ground truth. Two experts can look at the same model output and disagree on whether it is good.
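Agreement between two annotators can be computed in more than one way, and the text above does not say which measure lies behind the 60 percent figure. The sketch below shows two standard options: raw percent agreement, and Cohen's kappa, which corrects for the agreement two raters would reach by chance. The labels and ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Share of items on which both annotators gave the same label."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four summaries by two annotators.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
raw = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 75 percent of items, but kappa is only 0.5, because half of that agreement would be expected by chance. Reporting both numbers makes it harder to overstate how reliable a human evaluation is.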

The authors also found that human evaluators are biased by the fluency of the model. A model that writes confidently and grammatically is rated higher even when its content is wrong. This is the same problem that plagues automated evaluation: we mistake eloquence for intelligence.

Automated Metrics: Fast, Cheap, and Wrong

Automated metrics like BLEU, ROUGE, and METEOR are widely used because they are cheap and reproducible. But the survey documents that these metrics correlate poorly with human judgment, especially for creative tasks like story generation and dialogue.

For example, BLEU measures the overlap of n-grams between the model output and a reference text. A model that produces a creative but accurate paraphrase will get a low BLEU score. A model that copies the reference text verbatim will get a high score. Chang and colleagues found that models optimized for BLEU scores produce outputs that are less diverse and less interesting than models optimized for human preference (Chang et al., 2024).
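The paraphrase problem is easy to reproduce with the core ingredient of BLEU, clipped n-gram precision. The sketch below is a simplification and not full BLEU: it omits the brevity penalty and the geometric mean over several n-gram orders, and it uses whitespace tokenization. The sentences are hypothetical examples.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference, with
    each n-gram credited at most as often as it appears in the reference."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "the cat sat on the mat"
verbatim_copy = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"

copy_score = clipped_precision(verbatim_copy, reference, n=2)
paraphrase_score = clipped_precision(paraphrase, reference, n=2)
```

The verbatim copy scores 1.0 on bigram precision and the paraphrase scores 0.0, even though a human reader would accept both as describing the same scene.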

Newer metrics like BERTScore and GPTScore use language models themselves to evaluate outputs. This is circular. You are using one language model to evaluate another language model. The authors found that these model-based metrics are better than BLEU, but they still miss subtle errors and are biased toward the style of the evaluator model.

Benchmark Suites: The Arms Race

The most common evaluation method is to run a model on a standardized benchmark suite like MMLU, BIG-bench, or HELM. These suites test multiple capabilities and produce a single score. The problem is that benchmarks are static and models are dynamic.

Chang and colleagues documented that major benchmarks saturate quickly: when a new benchmark is created, the top models achieve near-perfect scores within six months to a year of release. This creates an arms race: researchers create harder benchmarks, models get better at those benchmarks, and the cycle repeats. But the authors argue that this does not measure genuine progress. It measures the ability of models to fit the distribution of the benchmark.

The survey also found that benchmark scores are not transferable. A model that scores high on MMLU might score low on a different benchmark that tests the same capabilities. This means that benchmark scores are not absolute measures of ability. They are relative measures of fit to a specific test.

What the Research Does Not Prove

The survey is comprehensive, but it has limitations that are worth stating explicitly.

First, the survey covers models up to early 2024. Since then, new models like GPT-4o, Claude 3.5, and Gemini have been released. These models may perform differently on the evaluations described. The authors acknowledge that their findings are a snapshot, not a final verdict.

Second, the survey does not address the question of whether LLMs can truly understand language. The authors focus on performance, not on philosophical questions about consciousness or intentionality. A model that passes every evaluation might still be a stochastic parrot, as linguist Emily Bender famously argued. The survey does not resolve this debate.

Third, the authors do not provide a unified framework for evaluation. They describe the problems but do not offer a solution. This is honest but unsatisfying. The reader is left with a clear sense of what is broken and no clear sense of how to fix it.

Finally, the survey is based on published research, which means it inherits the biases of that research. Most studies are done by English-speaking researchers on English-language models. The evaluation of LLMs in other languages and cultures is severely understudied.

What This Actually Means

▸ Stop trusting benchmark scores as measures of real-world ability. A model that scores 95 percent on MMLU is not 95 percent reliable. It is 95 percent reliable at answering questions that look like MMLU questions. In the wild, performance will be lower and less predictable.

▸ Human evaluation is necessary but not sufficient. If you are deploying an LLM in a high-stakes setting, you need multiple human evaluators, clear rubrics, and a process for resolving disagreements. One human's opinion is not a reliable ground truth.

▸ Test for robustness, not just accuracy. A model that gets the right answer on the first try is less impressive than a model that gets the right answer when the question is rephrased, the context is changed, or the format is different. Robustness is the real measure of understanding.

▸ Do not use LLMs to evaluate LLMs. It is tempting to use GPT-4 to evaluate GPT-3.5, but this introduces circular reasoning and unknown biases. If you must use automated evaluation, use simple metrics that are transparent and interpretable, not black-box, model-based scores.

▸ Evaluation is a discipline, not an afterthought. The authors argue that evaluation should be treated as a field of study in its own right, with its own methods, standards, and peer review. Until that happens, every claim about LLM performance should be taken with a grain of salt. The models are getting better at passing tests. We are not getting better at designing tests that matter.

References

[1] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, et al. (2024). A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology.