Why ESG Ratings Are All Over the Map and Nobody Can Agree

The Rating That Can’t Make Up Its Mind

Imagine you are a pension fund manager. You have $50 billion to invest. You want to put it into companies that aren’t destroying the planet, exploiting their workers, or cooking their books. So you turn to the experts: the ESG rating agencies. These are the firms that claim to measure how “good” a company is on environmental, social, and governance issues.

You look up a company. One agency gives it a 92 out of 100. Another gives it a 37. A third lands somewhere in the middle, at 60.

Which one do you believe?

The numbers above are illustrative, but the problem is real. It is the central finding of a 2022 study by Florian Berg, Julian F. Kölbel, and Roberto Rigobón, published in the Review of Finance. They looked at six of the biggest ESG rating agencies in the world, analyzed how they rated thousands of companies, and found something that should make anyone who relies on these ratings deeply uncomfortable: the ratings agree with each other far less than you would expect.

The average correlation between two different agencies’ ratings is about 0.54, and individual pairs range from 0.38 to 0.71. For context, a correlation of 1.0 means perfect agreement. Zero means no relationship. By comparison, credit ratings from Moody’s and S&P correlate at 0.99. ESG ratings are nowhere near that: knowing one agency’s score tells you only a modest amount about what another agency will say.
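To make the 0.54 figure concrete, here is a minimal sketch of how an average pairwise correlation is computed. The agency names and scores are invented for illustration and are not data from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_pairwise_correlation(ratings):
    """Mean Pearson correlation over every pair of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(pearson(ratings[a], ratings[b]) for a, b in pairs) / len(pairs)


# Hypothetical scores for the same five companies from three made-up agencies.
ratings = {
    "agency_a": [92, 40, 65, 70, 55],
    "agency_b": [37, 45, 80, 60, 50],
    "agency_c": [60, 35, 70, 75, 52],
}

for a, b in combinations(ratings, 2):
    print(f"{a} vs {b}: {pearson(ratings[a], ratings[b]):+.2f}")
print(f"average: {average_pairwise_correlation(ratings):+.2f}")
```

Each pair of agencies gets its own correlation, and the headline number is simply the mean across pairs. A single average can therefore hide pairs that agree reasonably well alongside pairs that barely agree at all.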

So what is going on? Berg, Kölbel, and Rigobón did not just document the mess. They figured out exactly where the disagreement comes from. And the answer is not what most people expect.

The 56% Problem Nobody Talks About

When you hear that two rating agencies disagree, your first instinct might be: they are measuring different things. Maybe one cares about carbon emissions, while the other cares about board diversity. That is part of the story, but it is not the biggest part.

The researchers broke down the divergence into three components:

▸Scope: Do the agencies look at the same categories of ESG issues? (38% of the divergence)
▸Weight: Do they assign the same importance to those categories? (6% of the divergence)
▸Measurement: Do they actually measure the same thing the same way? (56% of the divergence)

Here is the surprise. Most of the disagreement comes from measurement. Several agencies can all claim to measure “carbon emissions,” but one uses self-reported data, another uses estimates based on industry averages, and a third uses satellite imagery. They all call it the same thing. It is not the same thing.
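The three-way split can be sketched for a single company and a single pair of raters. This is a simplified illustration, not the authors' procedure: the study's percentages describe variation across many firms, and the shared weights used here (a plain average of the two raters' weights) are an assumption made only to keep the example short. All category names, scores, and weights are hypothetical.

```python
def decompose(scores_a, weights_a, scores_b, weights_b):
    """Split (rating_a - rating_b) into scope, measurement, and weight parts."""
    common = scores_a.keys() & scores_b.keys()
    only_a = scores_a.keys() - common
    only_b = scores_b.keys() - common

    # Scope: contribution of categories that only one rater covers.
    scope = (sum(weights_a[c] * scores_a[c] for c in only_a)
             - sum(weights_b[c] * scores_b[c] for c in only_b))

    # Assumption: shared weights are the simple average of both raters' weights.
    shared = {c: (weights_a[c] + weights_b[c]) / 2 for c in common}

    # Measurement: different scores on the same category, at shared weights.
    measurement = sum(shared[c] * (scores_a[c] - scores_b[c]) for c in common)

    # Weight: what remains once scores are held fixed and only weights differ.
    weight = sum(scores_a[c] * (weights_a[c] - shared[c])
                 - scores_b[c] * (weights_b[c] - shared[c]) for c in common)

    return {"scope": scope, "measurement": measurement, "weight": weight}


# Hypothetical category scores (0-100) and weights for one company.
scores_a = {"carbon": 80, "labor": 60, "board": 70}
weights_a = {"carbon": 0.5, "labor": 0.3, "board": 0.2}
scores_b = {"carbon": 40, "board": 75, "water": 50}
weights_b = {"carbon": 0.4, "board": 0.4, "water": 0.2}

rating_a = sum(weights_a[c] * scores_a[c] for c in scores_a)
rating_b = sum(weights_b[c] * scores_b[c] for c in scores_b)
parts = decompose(scores_a, weights_a, scores_b, weights_b)

print(f"rating gap: {rating_a - rating_b:.1f}")
for name, value in parts.items():
    print(f"{name}: {value:+.1f}")
```

The three parts always add up to the full rating gap, whatever shared weights are chosen. What changes with that choice is how the gap is divided between measurement and weight.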

Berg and his colleagues found that the measurement divergence is not random. It has a pattern. They call it the “rater effect.” This means that a rater’s overall impression of a company bleeds into how they score individual categories. If an agency already thinks Tesla is an environmental leader, it tends to give Tesla higher marks on environmental subcategories, even when the hard data might suggest otherwise. The halo effect, in quantitative form.

How Do You Build a Rating? Nobody Agrees

To understand why this happens, you have to look at how each agency constructs its ratings. The study mapped the methodologies of six agencies: KLD, Sustainalytics, Moody’s ESG (Vigeo-Eiris), S&P Global (RobecoSAM), Refinitiv (Asset4), and MSCI.

Here is what they found:

▸KLD focuses on strengths and concerns, using a binary system. A company either has a strength or it doesn’t. Simple, but crude.
▸Sustainalytics uses a materiality framework. It only counts ESG issues that could financially affect the company. This means a tobacco company might get a low environmental score, but only if the environment is considered financially material to tobacco.
▸MSCI weighs issues based on industry. Carbon emissions matter more for an oil company than for a software firm. But how much more? That is proprietary.
▸Refinitiv uses over 450 data points, but the selection and weighting are opaque.
▸S&P Global relies heavily on a survey that companies fill out themselves. You can guess how that goes.
▸Moody’s ESG (formerly Vigeo-Eiris) emphasizes how companies manage their stakeholders, not just outcomes.

Each agency claims to measure the same thing: corporate ESG performance. But they are measuring different things, in different ways, with different assumptions baked in. The researchers created a common taxonomy of 64 categories to compare them. They found that no two agencies cover exactly the same set of categories. Some agencies include labor rights. Others skip them. Some include water usage. Others ignore it.

The result is not just noise. It is systematic confusion.

The Measurement Trap: When Data Is Not Data

The measurement problem is the most insidious, because it looks like it should be the easiest to fix. After all, data is data. A ton of CO2 is a ton of CO2. A gender pay gap is a gender pay gap.

But here is where it gets messy.

Berg, Kölbel, and Rigobón found that agencies often use different data sources for the same metric. One agency might use a company’s own sustainability report. Another uses media reports and NGO data. A third uses government databases. These sources can disagree wildly.

Consider a company that reports its carbon emissions as 100,000 tons. A media investigation reveals it is actually 150,000 tons. Agency A uses the company’s number. Agency B uses the media number. Both call it “carbon emissions.” Both are technically correct, given their source. But the scores diverge.

Then there is the problem of estimation. When a company does not report a metric, agencies have to guess. Some use industry averages. Others use statistical models. The guesses can be far apart. The researchers found that for some categories, the measurement divergence was so large that two agencies could give opposite scores to the same company on the same issue.
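A small sketch shows how far apart two reasonable-sounding guesses can land. Both estimation rules and all the figures below are hypothetical and are not taken from any agency's documented methodology.

```python
from statistics import mean

# Hypothetical peers in one industry: (revenue in $M, reported emissions in tons).
peers = [(500, 90_000), (1_200, 260_000), (800, 150_000)]
target_revenue = 3_000  # revenue in $M of the company that did not report


def impute_industry_average(peers):
    """Guess the missing value as the plain average of the peers' emissions."""
    return mean(emissions for _, emissions in peers)


def impute_by_intensity(peers, revenue):
    """Guess by scaling the peers' pooled emissions per $M of revenue."""
    total_emissions = sum(emissions for _, emissions in peers)
    total_revenue = sum(rev for rev, _ in peers)
    return total_emissions / total_revenue * revenue


guess_a = impute_industry_average(peers)
guess_b = impute_by_intensity(peers, target_revenue)
print(f"industry average: {guess_a:,.0f} tons")
print(f"intensity scaled: {guess_b:,.0f} tons")
```

With these made-up inputs the two guesses differ by more than a factor of three, even though both start from the same peer data. Neither rule is wrong on its face, which is exactly why the disagreement is hard to spot from the outside.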

This is not a bug. It is a feature of how the industry works. Each agency wants to differentiate itself. Each has a proprietary methodology. And each has a customer base that expects a certain kind of answer.

The Weighting Mirage

You might think that if two agencies measure the same thing, the only remaining disagreement is how much weight they give to each category. That turns out to be the smallest source of divergence, at just 6%.

Why so small? Because weighting differences cancel out in aggregate. If one agency gives more weight to environmental issues and another gives more weight to social issues, the overall scores can still converge if the company scores similarly on both. The real divergence comes from the measurement itself, not from how you mix the ingredients.

But here is the kicker: the 6% figure only applies to the categories that both agencies measure. When agencies measure different things entirely, the weighting question becomes irrelevant. You cannot argue about the weight of water usage if one agency does not even consider it.
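The cancelling-out argument can be checked with a toy calculation. The scores and weights are invented, and this illustrates only the intuition described above, not the study's estimation.

```python
def aggregate(scores, weights):
    """Weighted sum of category scores; weights are assumed to sum to 1."""
    return sum(scores[c] * weights[c] for c in scores)


# Hypothetical company that scores similarly on E, S, and G.
scores = {"E": 62, "S": 58, "G": 60}
env_heavy = {"E": 0.6, "S": 0.2, "G": 0.2}
soc_heavy = {"E": 0.2, "S": 0.6, "G": 0.2}

# Same measurements, very different weights.
weight_gap = aggregate(scores, env_heavy) - aggregate(scores, soc_heavy)

# Same weights, but a second rater measures E very differently.
scores_other = {**scores, "E": 30}
measurement_gap = aggregate(scores, env_heavy) - aggregate(scores_other, env_heavy)

print(f"gap from weights alone:     {weight_gap:.1f}")
print(f"gap from measurement alone: {measurement_gap:.1f}")
```

Tripling the weight on one pillar moves the total by under two points here, because the category scores are close to each other. Changing a single measurement moves it by roughly twenty.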

What This Research Does Not Prove

This study is not saying ESG ratings are useless. It is saying they are unreliable in ways that most users do not understand.

The authors do not prove that any one agency is better than another. They do not claim that the divergence is intentional or malicious. They also do not show that the divergence leads to bad investment outcomes. It is entirely possible that a portfolio built using one agency’s ratings performs just as well as one built using another’s, even though the ratings disagree. That is a different question, and it is not answered here.

What the study does show is that ESG ratings, as currently produced, do not behave like a shared measurement system in the scientific sense. They are a collection of opinions, filtered through proprietary methodologies, with no shared standard for what counts as evidence.

The Rater Effect: When Bias Becomes a System

The most troubling finding is the rater effect. Berg, Kölbel, and Rigobón detected a systematic bias where a rater’s overall view of a company influences its scores on individual categories. This is not a small effect.

Imagine a company that is known for good governance. It has a diverse board, transparent accounting, and no scandals. The rater gives it a high governance score. Then, when evaluating the same company on environmental issues, the rater unconsciously gives it a slightly higher score than the data would justify, because the rater’s overall impression is positive.

The reverse happens for companies with a bad reputation. A firm with a history of labor violations might get lower scores on environmental metrics, even if its environmental data is clean.

This is not fraud. It is human psychology, baked into the rating process. And it is hard to fix because it is invisible. You cannot audit away a bias you do not know exists.

The Real World: Who Cares?

You might ask: so what? If ESG ratings are inconsistent, does it matter?

It matters because trillions of dollars are flowing into ESG funds. Pension funds, endowments, and sovereign wealth funds are making decisions based on these ratings. Regulators are starting to use them. Companies are being judged, rewarded, or punished based on scores that are, at best, unreliable.

Consider the implications:

▸A company that genuinely improves its environmental performance might not see its rating improve if the agency uses a different measurement method.
▸A company that hires a PR firm to game the system might see its rating improve, even if nothing actually changes on the ground.
▸An investor who thinks they are diversifying risk by buying ESG funds might be buying the same companies, just rated differently.

The study suggests that until rating agencies agree on basic measurement standards, the entire system is built on sand.

What This Actually Means

Here is what you can take from this research, whether you are an investor, a regulator, or just someone trying to understand what ESG ratings actually tell you:

▸Measurement is the problem, not weighting. If you want to fix ESG ratings, do not argue about which category matters more. Argue about how to measure things in the first place. That is where 56% of the disagreement lives.
▸Do not treat any single ESG rating as truth. A rating is an opinion, not a fact. If you must use them, look at multiple agencies and see where they agree. The consensus is more reliable than any single score.
▸The rater effect is real and hard to fix. A rater’s overall impression of a company bleeds into every subscore. This means that a company with a good reputation might get a free pass on individual issues. Watch for companies that score high across the board without strong evidence in each category.
▸Regulators need to mandate transparency. The authors call for greater attention to how data is generated. That means forcing agencies to disclose their sources, their estimation methods, and their category definitions. Without that, the ratings are a black box.
▸The divergence is not going away on its own. The rating agencies have no incentive to converge. Their business models depend on being different. Until investors demand consistency, the confusion will persist.
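The advice to look at multiple agencies can be sketched as follows. This is one possible approach under stated assumptions, not a method from the study: each agency's scores are standardized so that different scales are comparable, then averaged per company, with the spread across agencies kept as a disagreement flag. Agency names and scores are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one agency's scores to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def consensus(ratings):
    """Per company: (mean standardized score, spread across agencies)."""
    standardized = [zscores(scores) for scores in ratings.values()]
    per_company = zip(*standardized)  # regroup scores by company
    return [(mean(z), pstdev(z)) for z in per_company]


# Hypothetical scores for four companies; agency_c uses a 0-10 scale.
companies = ["alpha", "beta", "gamma", "delta"]
ratings = {
    "agency_a": [92, 40, 65, 70],
    "agency_b": [37, 45, 80, 60],
    "agency_c": [6.0, 3.5, 7.0, 7.5],
}

result = consensus(ratings)
for name, (score, spread) in zip(companies, result):
    print(f"{name}: consensus {score:+.2f}, disagreement {spread:.2f}")
```

A high disagreement value marks a company where the consensus score deserves little trust, however favorable it looks.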

The next time you see a company touting its AAA ESG rating, ask: whose rating? And what did they actually measure? The answer might surprise you. Or it might not be an answer at all.

References

[1] Florian Berg, Julian F. Kölbel, Roberto Rigobón (2022). Aggregate Confusion: The Divergence of ESG Ratings. Review of Finance.