Flawed Methods Overstate Results in Policy Studies

What If the Most Influential Policy Studies Were Wrong?

In 2008, the U.S. government sent checks to millions of Americans. The idea was simple: give people money, they spend it, the economy revives. Economists watched closely. The numbers they produced became gospel. The marginal propensity to consume, they told us, was between 15 and 20 percent in the first quarter. That figure went into macroeconomic models. It shaped stimulus debates for a decade.

Here is the problem. That number was almost certainly too high.

Kirill Borusyak, Xavier Jaravel, and Jann Spiess (Borusyak et al., 2022) took a hard look at the methods used to produce those estimates. They found something unsettling. The standard statistical tools economists use to study policies that roll out gradually over time are fundamentally broken when the effects of those policies vary from person to person. And effects always vary from person to person.

When the authors reanalyzed the same tax rebate data using a corrected method, the marginal propensity to consume dropped to between 8 and 11 percent in the first quarter. About half the benchmark. The spending happened faster too, concentrated in the first month. The policy still worked. It just worked differently than anyone thought.

This is not a story about one study being wrong. It is a story about a whole class of studies being systematically misleading. And it changes how we should read almost every policy evaluation built on a staggered rollout in the last twenty years.

The Quiet Crisis in Difference-in-Differences

Here is how policy research usually works. You cannot randomize who gets a minimum wage increase or a tax cut. So you compare places or times that got the policy to places or times that did not. You look at what happened before and after. If the trends were moving together before the policy, and then they diverged, you credit the policy. This is called difference-in-differences. It is the workhorse of empirical economics.

For decades, the standard setup was simple. One policy change. One treated group. One control group. Two time periods. The math was clean.
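To make that clean math concrete, here is a minimal Python sketch of the two-group, two-period calculation. The function name and the group means are made up for illustration; nothing here comes from a real dataset.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Classic two-group, two-period difference-in-differences estimate."""
    treated_change = treated_post - treated_pre    # change in the treated group
    control_change = control_post - control_pre    # change in the control group
    # The control group's change stands in for what the treated group
    # would have experienced without the policy (parallel trends).
    return treated_change - control_change


# Hypothetical group means, chosen only to illustrate the arithmetic.
effect = did_2x2(treated_pre=10.0, treated_post=15.0,
                 control_pre=8.0, control_post=11.0)
print(effect)  # 2.0
```

The treated group rose by 5, the control group by 3, so the policy is credited with the remaining 2.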

Then things got complicated. Policies started rolling out in waves. Different states adopted different laws at different times. Different people got different stimulus checks in different months. This is called staggered treatment adoption. It is now the norm.

The standard way to handle this was to run a regression with what researchers call two-way fixed effects. You put in fixed effects for each unit and each time period. You look at the coefficient on the treatment variable. That coefficient was supposed to give you the average effect of the policy.
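In symbols, this regression is usually written as below. The notation is a common textbook convention assumed here for illustration, not something taken from the text: Y is the outcome for unit i in period t, alpha and beta are the unit and period fixed effects, D is an indicator equal to one once the unit is treated, and tau is the single coefficient that is read as "the effect."

```latex
Y_{it} = \alpha_i + \beta_t + \tau \, D_{it} + \varepsilon_{it}
```

The key restriction is that tau carries no subscript: the specification allows only one effect for every unit in every treated period.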

Borusyak et al. (2022) show that this regression does not give you what you think it gives you. When treatment effects vary across units or over time, the regression coefficient becomes a weighted average of many different effects. Some of those weights can be negative. You can get a number that looks like an average but is actually a distorted mess.

The problem gets worse the more the effects vary. If a policy helps some people a lot and hurts others a little, the standard estimator can tell you the policy does nothing. It can even flip the sign. A policy that helps everyone can look harmful.
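A tiny numerical example shows how a sign flip can happen. The sketch below is an illustration under assumed numbers, not the authors' code. It uses two units, three periods, no noise, and untreated outcomes set to zero, so every observed outcome is a pure treatment effect. Unit A is treated from period 2 and unit B from period 3. All three treatment effects are positive, yet the two-way fixed effects coefficient comes out negative, because the later effect for the early-treated unit (A in period 3) enters the estimate with a negative weight.

```python
import numpy as np


def twfe_coefficient(y, unit, time, d):
    """OLS coefficient on d from a regression with unit and period dummies."""
    units, times = np.unique(unit), np.unique(time)
    cols = [np.ones(len(y))]                                   # intercept
    cols += [(unit == u).astype(float) for u in units[1:]]     # unit dummies
    cols += [(time == t).astype(float) for t in times[1:]]     # period dummies
    cols.append(d.astype(float))                               # treatment indicator
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[-1]


# Hypothetical panel: unit 0 is "A", unit 1 is "B"; periods 1..3.
unit = np.array([0, 0, 0, 1, 1, 1])
time = np.array([1, 2, 3, 1, 2, 3])
d = np.array([0, 1, 1, 0, 0, 1])                   # A treated from t=2, B from t=3
tau = np.array([0.0, 1.0, 4.0, 0.0, 0.0, 1.0])     # assumed effects, all positive
y = tau * d                                        # untreated outcome is zero

true_average = tau[d == 1].mean()
twfe = twfe_coefficient(y, unit, time, d)
print(round(float(true_average), 6), round(float(twfe), 6))
```

In this setup the true average effect on the treated observations is 2, while the regression coefficient is about -0.5. Working through the algebra for this design, the coefficient equals the period 2 effect for A, plus half the period 3 effect for B, minus half the period 3 effect for A. The weights sum to one, but one of them is negative.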

This is not a corner case. This is the normal state of the world. Every policy affects different people differently. Young workers and old workers respond differently to minimum wage changes. Rich households and poor households spend stimulus money at different rates. The standard method assumes these differences away. When they exist, the method breaks.

The Imputation Fix That Changes Everything

Borusyak et al. (2022) do not just diagnose the problem. They build a solution. It is elegant and intuitive. They call it an imputation estimator.

Here is how it works. You take the observations that have not yet been treated, including units that are never treated. You use their outcomes to estimate what would have happened to the treated units if they had never been treated. This is a counterfactual. You build it from the data. Then you compare the actual outcomes of treated units to this imputed counterfactual. The difference is your treatment effect.

This approach has a crucial advantage. It does not use already-treated units as controls for later-treated units. That was the source of the contamination in the old method. When early-treated units are used as controls for later-treated units, and their effects are changing over time, the whole thing falls apart. The imputation estimator sidesteps this entirely.
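The imputation steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses only unit and period fixed effects, no covariates, no standard errors, and a made-up noiseless panel with three units (the third is never treated). The function names are hypothetical.

```python
import numpy as np


def fe_design(unit, time):
    """Design matrix with an intercept, unit dummies and period dummies."""
    units, times = np.unique(unit), np.unique(time)
    cols = [np.ones(len(unit))]
    cols += [(unit == u).astype(float) for u in units[1:]]
    cols += [(time == t).astype(float) for t in times[1:]]
    return np.column_stack(cols)


def imputation_effects(y, unit, time, d):
    """Per-observation effect estimates for the treated observations."""
    X = fe_design(unit, time)
    untreated = d == 0
    # Every fixed effect must be pinned down by untreated observations alone.
    if np.linalg.matrix_rank(X[untreated]) < X.shape[1]:
        raise ValueError("fixed effects not identified from untreated observations")
    # Step 1: fit unit and period effects using untreated observations only.
    coef, *_ = np.linalg.lstsq(X[untreated], y[untreated], rcond=None)
    # Step 2: impute the untreated outcome for every observation.
    y0_hat = X @ coef
    # Step 3: actual minus imputed outcome, kept for treated observations.
    return (y - y0_hat)[~untreated]


# Hypothetical panel: 3 units, 3 periods, no noise.
unit = np.repeat([0, 1, 2], 3)
time = np.tile([1, 2, 3], 3)
d = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0])           # unit 2 is never treated
tau = np.array([0, 1, 4, 0, 0, 1, 0, 0, 0], float)  # assumed true effects
alpha = np.array([1.0, 2.0, 3.0])[unit]             # unit effects
beta = np.array([0.0, 0.5, 1.0])[time - 1]          # period effects
y = alpha + beta + tau * d

effects = imputation_effects(y, unit, time, d)
print(np.round(effects, 6), round(float(effects.mean()), 6))
```

Because the toy data contain no noise, the sketch recovers the three assumed effects (1, 4 and 1) and their average of 2. Already-treated observations never enter the first step, so their evolving effects cannot contaminate the counterfactual.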

The estimator is also efficient. In statistical terms, this means that under standard conditions on the errors (equal variance, no serial correlation) it has the smallest variance among linear unbiased estimators. You are not wasting observations. You are not introducing noise through a clumsy weighting scheme. Under those conditions, you are getting the most precise estimate the data allow.

The authors provide formal proofs of these properties. They also develop tools for inference. You can test whether the parallel trends assumption holds. You can compute standard errors that account for the fact that you estimated the counterfactual. The whole package is rigorous and usable.

What the Tax Rebate Study Actually Shows

The authors apply their method to a classic question. How much do Americans spend when the government sends them a check?

The old answer, based on the standard two-way fixed effects estimator, was that the marginal propensity to consume was around 15 to 20 percent in the first quarter. This number was used to calibrate macroeconomic models. It influenced policy design for years.

Borusyak et al. (2022) reestimate this using their imputation estimator. The results are starkly different. The marginal propensity to consume drops to between 8 and 11 percent in the first quarter. That is about half the old estimate.

The timing also shifts. The old methods suggested spending was spread out over several months. The new method shows that most of the spending happens in the first month after the rebate arrives. People get the check. They spend it quickly. Then they stop.

This makes intuitive sense. If you get a one-time payment, you do not stretch it out over a quarter. You use it to pay a bill or buy something you needed. The old method was smoothing this out artificially because it was comparing people who got checks at different times and mixing their responses together.

The authors are careful about what this means. The policy still works. Stimulus checks do increase spending. But the effect is smaller and faster than previously thought. If you are designing a stimulus program, you need to know this. You might need to send bigger checks or send them more frequently. You might need to target them differently. The old numbers would have led you astray.

Why This Problem Is Everywhere

The tax rebate study is one application. The problem is general.

Consider minimum wage research. States raise their minimum wages at different times. The standard method compares states that raised wages to states that did not. But states that raised wages earlier are also used as controls for states that raised them later. If the employment effects in the early-raising states are still unfolding, those changes leak into the comparison. The estimated employment effects become unreliable.

Consider research on school reforms. Different districts adopt new curricula or new accountability systems in different years. The standard method compares early adopters to late adopters. But early adopters are systematically different. They are more motivated, better funded, or facing more pressure. The comparison is contaminated.

Consider research on health insurance expansions. Different states expanded Medicaid at different times. The standard method compares expansion states to non-expansion states. But the timing of expansion is not random. States that expanded early had different political environments and different health care systems. The estimates can be biased.

Borusyak et al. (2022) provide a framework that applies to all of these settings. The imputation estimator works for any staggered treatment design where the parallel trends and no-anticipation assumptions hold. It works with time-varying controls. It works with triple differences. It works with certain non-binary treatments. The authors show all of these extensions in the paper.

The practical implication is clear. Hundreds of studies using staggered difference-in-differences need to be reexamined. Some will hold up. Many will not. The ones that found large effects in settings with heterogeneous treatment effects are the most suspect.

What the Method Does Not Fix

The imputation estimator is not magic. It still relies on assumptions.

The most important assumption is parallel trends. You need to believe that the treated units would have followed the same trend as the control units if they had not been treated. The imputation estimator makes this assumption explicit and testable, but it cannot prove it is true. If the treated units were on a different trajectory before treatment, the estimator will be biased.

The authors provide tests for this assumption. You can look at pre-treatment periods and see if the treated and control units were moving together. If they were, you have some confidence. If they were not, you have a problem. But these tests have limited power. They cannot detect all violations.
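One way to picture such a check is sketched below in Python. It is a rough illustration under assumed data, not the authors' procedure in full: it keeps only untreated observations, regresses the outcome on unit and period fixed effects plus dummies for being one or two periods away from treatment, and returns the coefficients on those dummies. A real test would also need standard errors and a joint significance test, which this sketch leaves out. The function name, the panel, and the 0.5 shift are all hypothetical.

```python
import numpy as np


def pretrend_coefficients(y, unit, time, adopt, n_leads=2):
    """Coefficients on 'k periods before treatment' dummies, k = 1..n_leads,
    estimated on untreated observations only."""
    keep = time < adopt          # not-yet-treated and never-treated rows
    y, unit, time, adopt = y[keep], unit[keep], time[keep], adopt[keep]
    units, times = np.unique(unit), np.unique(time)
    cols = [np.ones(len(y))]
    cols += [(unit == u).astype(float) for u in units[1:]]
    cols += [(time == t).astype(float) for t in times[1:]]
    cols += [((adopt - time) == k).astype(float) for k in range(1, n_leads + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[-n_leads:]


# Hypothetical panel: 4 units, 6 periods; units 2 and 3 are never treated.
unit = np.repeat(np.arange(4), 6)
time = np.tile(np.arange(1, 7), 4)
adopt = np.array([4.0, 5.0, np.inf, np.inf])[unit]   # adoption period per unit
y0 = np.array([1.0, 2.0, 3.0, 4.0])[unit] + 0.3 * time  # parallel trends hold

clean = pretrend_coefficients(y0, unit, time, adopt)
# Assumed violation: outcomes shift by 0.5 one period before adoption.
y_bad = y0 + 0.5 * ((adopt - time) == 1)
violated = pretrend_coefficients(y_bad, unit, time, adopt)
print(np.round(clean, 6), np.round(violated, 6))
```

With parallel trends built into the data, both lead coefficients are zero up to rounding. With the assumed shift, the one-period lead picks up the 0.5. In noisy real data the coefficients will never be exactly zero, which is why the limited power of these tests matters.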

Another limitation is that the estimator requires a large number of untreated units. You need enough data to estimate the counterfactual precisely. In small samples, the estimator can be noisy. The authors develop inference tools that account for this, but the fundamental limitation remains.

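The estimator also assumes no anticipation. Units must not change their behavior before the treatment actually starts. If households start spending in expectation of a check, or firms adjust ahead of a scheduled wage increase, the observations that are supposed to be untreated are already contaminated and the imputed counterfactual is biased. The authors discuss this, but the assumption cannot be dropped altogether.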

These are not fatal flaws. They are honest limitations. Every statistical method has them. The point is that the old method had these limitations plus additional ones that were hidden. The new method at least makes the assumptions clear and gives you tools to check them.

What This Actually Means

▸ If you read a policy study that uses staggered treatment adoption and reports results from a standard two-way fixed effects regression, be skeptical. The reported effects could be systematically wrong. Look for robustness checks using modern methods like the imputation estimator or related approaches.

▸ The direction of the bias is not predictable. The standard method can overestimate or underestimate the true effect. It depends on the pattern of treatment effect heterogeneity. Do not assume the bias goes in a conservative direction.

▸ The marginal propensity to consume from stimulus checks is about half of what we thought. This has direct implications for fiscal policy. If you want to achieve a certain level of stimulus, you need to send larger checks or target them more precisely to high-propensity households.

▸ Policy evaluations need to report not just average effects but also the distribution of effects. If effects vary a lot across units, the average is less informative and the standard method is more likely to be wrong. Researchers should show the heterogeneity.

▸ The imputation estimator is not just for economists. Any field that uses staggered treatment designs can benefit. Public health, education, political science, criminology. If you are comparing units that adopted a policy at different times, you need to check whether the standard method is misleading you. The tools are available. Use them.
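As a small illustration of what showing the heterogeneity can look like, the Python sketch below groups per-observation effect estimates by the number of periods since treatment began and averages within each group. The function name and the three input values are hypothetical; in practice the inputs would come from an estimator such as the imputation approach described earlier.

```python
from collections import defaultdict


def average_by_horizon(effects, horizons):
    """Mean estimated effect at each horizon (periods since treatment began)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for effect, horizon in zip(effects, horizons):
        sums[horizon] += effect
        counts[horizon] += 1
    return {h: sums[h] / counts[h] for h in sorted(sums)}


# Hypothetical per-observation estimates and their horizons.
effects = [1.0, 4.0, 1.0]
horizons = [0, 1, 0]      # 0 = first treated period, 1 = one period later

by_horizon = average_by_horizon(effects, horizons)
overall = sum(effects) / len(effects)
print(by_horizon, overall)  # {0: 1.0, 1: 4.0} 2.0
```

The overall average of 2 hides the fact that the effect is 1 on impact and 4 a period later. Reporting the breakdown alongside the average makes that pattern visible.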

References

[1] Kirill Borusyak, Xavier Jaravel, Jann Spiess (2022). Revisiting event study designs: robust and efficient estimation.