Brain Scans Need Thousands of People to Find Real Links

The Problem with Pretty Pictures

For decades, the brain scan has been the hero of neuroscience. A subject lies in a tube. Magnets pulse. A computer renders a Technicolor map of their mind at work. A spot lights up. Researchers say: There. That is where depression lives. That is where memory hides. That is where your child’s ADHD shows itself.

It made intuitive sense. It looked scientific. It felt like progress.

But Scott Marek, a researcher at Washington University in St. Louis, had a nagging suspicion that something was wrong. He had watched too many small studies find big, flashy results that vanished when other labs tried to repeat them. The field was suffering from what statisticians call a "winner’s curse": the studies that got published were the ones that got lucky with noise. The real effects were buried under the rubble of tiny sample sizes.

So Marek and his colleagues did something radical. They took the three largest brain imaging datasets ever assembled, totaling nearly 50,000 people, and asked a simple question: How many subjects do you actually need to find a real link between a brain scan and a person’s behavior or mental health?

The answer, published in Nature in 2022, was not what anyone wanted to hear (Marek et al., 2022).

The 25-Person Lie

Here is the uncomfortable truth that Marek and his team laid bare: the median neuroimaging study at the time used about 25 subjects (Marek et al., 2022). Twenty-five. That is roughly the size of a high school classroom. It is enough to spot a massive, obvious difference like a brain tumor. But it is almost certainly too small to detect the subtle, noisy relationships between brain structure and something as complex as depression, anxiety, or intelligence.

The authors ran the numbers. They found that brain-behavior associations were smaller than anyone had assumed. Much smaller. At typical sample sizes, those associations were essentially invisible. But because researchers wanted to find something, and because journals wanted to publish something, the field ended up with a literature full of inflated effect sizes and unreplicable findings (Marek et al., 2022).

Think of it like trying to measure the height of a single blade of grass in a football field using a ruler that only reaches your knee. You might measure one blade and declare it the tallest. But you have not seen the field. You have not even seen the end zone.

How They Did It

Marek’s team did not run a new experiment. They did something more powerful. They went looking for the biggest, most comprehensive datasets already in existence.

They used three sources:

▸The Human Connectome Project (HCP), a gold-standard dataset of young, healthy adults scanned with high precision.
▸The Adolescent Brain Cognitive Development (ABCD) Study, a massive longitudinal project tracking nearly 12,000 kids across the United States.
▸The UK Biobank, a health database of half a million people, including brain scans for over 35,000.

Together, these datasets gave the researchers about 50,000 brains. For each person, they had structural MRI scans (measuring the size and shape of brain regions) and functional MRI scans (measuring which areas lit up during tasks or at rest). They also had cognitive test scores and mental health questionnaires.

Marek and his colleagues then did something clever. They simulated what would happen if you only had a small sample, say 25 or 100 or 500 people. They randomly drew subsamples from the giant dataset, ran the standard analysis, and recorded the results. Then they repeated that process thousands of times. They watched what happened as the sample size grew.
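The resampling idea described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' pipeline: the "dataset" is synthetic, the built-in correlation of 0.1 is an assumed value, and the names `pearson_r` and `subsample_interval` are made up for this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def subsample_interval(brain, behavior, n, n_draws, rng):
    """Middle 95% of the correlations seen across random subsamples of size n."""
    rs = []
    for _ in range(n_draws):
        idx = rng.sample(range(len(brain)), n)  # draw n people without replacement
        rs.append(pearson_r([brain[i] for i in idx], [behavior[i] for i in idx]))
    rs.sort()
    return rs[int(0.025 * n_draws)], rs[int(0.975 * n_draws)]

# Synthetic stand-in for a large dataset (NOT real imaging data):
# 20,000 simulated people with an assumed built-in correlation of about 0.1.
rng = random.Random(0)
TRUE_R = 0.1
brain = [rng.gauss(0, 1) for _ in range(20_000)]
behavior = [TRUE_R * b + math.sqrt(1 - TRUE_R**2) * rng.gauss(0, 1) for b in brain]

widths = {}
for n in (25, 100, 500, 2000):
    lo, hi = subsample_interval(brain, behavior, n, n_draws=500, rng=rng)
    widths[n] = hi - lo
    print(f"n={n:>4}: middle 95% of subsample r runs from {lo:+.2f} to {hi:+.2f}")
```

In a run like this, the range of correlations at n = 25 should be wide enough to include strong positive and clearly negative values, even though every subsample comes from the same population. The range narrows steadily as n grows.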

What they saw was a pattern that should terrify anyone who has ever trusted a headline about a brain scan.

The Reproducibility Cliff

At sample sizes below 100, the results were a mess. The effect sizes, the numbers researchers use to say "this brain region is strongly linked to this behavior," were wildly inflated. A study might report that a certain spot in the prefrontal cortex predicted a child’s self-control with a correlation of 0.5. That is a big number. But when Marek’s team looked at the full dataset, the true correlation was closer to 0.1. The small study had gotten lucky with noise (Marek et al., 2022).

Worse, the specific brain regions that "lit up" were not the same across different random subsamples. One group of 25 people would show a link in the amygdala. Another group of 25 would show it in the hippocampus. Neither group was wrong, exactly. They were just looking at a tiny slice of a very noisy picture.

The authors called this the "reproducibility cliff." Below a certain sample size, replication was essentially impossible. The results were a lottery.

As sample sizes grew into the hundreds, things improved, but only modestly. At 500 subjects, the effect sizes were still inflated by about 50 percent. It was not until samples reached the thousands that the results stabilized and the true associations became clear (Marek et al., 2022).
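The inflation has a simple mechanical cause that a short simulation can show. The sketch below assumes a true correlation of 0.1 and a crude publication rule: only studies that clear a two-sided p < 0.05 threshold get "published." The threshold uses the Fisher z approximation rather than an exact test, and all names and numbers are illustrative assumptions, not values from the paper.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_study_r(n, true_r, rng):
    """Sample correlation from one simulated study of n people."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_r * x + math.sqrt(1 - true_r**2) * rng.gauss(0, 1) for x in xs]
    return pearson_r(xs, ys)

def passes_filter(r, n, z_crit=1.96):
    """Approximate two-sided p < 0.05 check via the Fisher z transform."""
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit

rng = random.Random(1)
TRUE_R = 0.1  # assumed true effect for this toy example
mean_published = {}
for n in (25, 100, 1000):
    rs = [simulate_study_r(n, TRUE_R, rng) for _ in range(1000)]
    published = [abs(r) for r in rs if passes_filter(r, n)]
    mean_published[n] = sum(published) / len(published)
    print(f"n={n:>4}: {len(published)} of 1000 studies 'published', "
          f"mean reported |r| = {mean_published[n]:.2f}")
```

At n = 25 a correlation has to be roughly 0.4 or larger just to pass the filter, so every "published" small study overstates a true effect of 0.1 by construction. With larger samples the bar drops, and the published average moves back toward the truth.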

What Actually Works (and What Doesn't)

Not all brain-behavior links are created equal. Marek and his team found that some types of associations held up better than others.

Functional MRI beats structural MRI

When researchers looked at how brain activity changed during a task, the links to behavior were stronger and more reproducible than when they looked at the physical size or shape of brain regions. This makes sense: a brain region that lights up when you solve a math problem is more directly tied to that behavior than the region’s volume, which is influenced by a lifetime of factors.

Cognitive tests beat mental health questionnaires

The link between brain scans and a person’s score on an IQ test was more robust than the link between brain scans and a depression inventory. This is not because depression is less real. It is because cognitive tests are more precise. A person’s IQ score on a given day varies less than their mood. Depression is a moving target. The brain scan is a snapshot. Matching a snapshot to a moving target is hard.
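A classic result from measurement theory, Spearman's attenuation formula, gives a rough feel for why a noisier measure yields a weaker observed link. The reliability values below are hypothetical numbers chosen for illustration. They are not taken from the paper, and the function name `observed_r` is made up.

```python
import math

def observed_r(true_r, reliability_brain, reliability_behavior):
    """Spearman's attenuation formula: measurement noise shrinks the observed correlation."""
    return true_r * math.sqrt(reliability_brain * reliability_behavior)

TRUE_R = 0.2             # hypothetical underlying correlation
BRAIN_RELIABILITY = 0.7  # hypothetical reliability of the brain measure

# Hypothetical reliabilities: a stable cognitive test vs. a mood questionnaire.
cognitive = observed_r(TRUE_R, BRAIN_RELIABILITY, 0.9)
mood = observed_r(TRUE_R, BRAIN_RELIABILITY, 0.6)

print(f"observed r with the cognitive test:      {cognitive:.3f}")
print(f"observed r with the mood questionnaire:  {mood:.3f}")
```

With the same underlying link, the less reliable questionnaire produces the smaller observed correlation, which in turn takes more people to detect.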

Multivariate methods beat univariate methods

Most traditional brain imaging studies look at one brain region at a time. "Does the amygdala correlate with anxiety?" They treat each region as an independent clue. But the brain does not work that way. It is a network. Marek’s team found that methods looking at patterns across many regions at once, multivariate approaches, produced more stable and reproducible results (Marek et al., 2022).
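A toy simulation shows why pooling many weak signals can help. It assumes 50 independent "regions," each correlating about 0.05 with behavior, and uses the crudest possible multivariate score, an unweighted sum. Real multivariate methods have to learn their weights from data and be validated on people they were not trained on, so treat this as an optimistic sketch with made-up numbers.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(3)
N_PEOPLE, N_REGIONS, PER_REGION_R = 4000, 50, 0.05  # all hypothetical

# Behavior is a faint contribution from every region plus a lot of noise,
# scaled so that behavior has variance of about 1.
noise_sd = math.sqrt(1 - N_REGIONS * PER_REGION_R**2)
regions = [[rng.gauss(0, 1) for _ in range(N_REGIONS)] for _ in range(N_PEOPLE)]
behavior = [PER_REGION_R * sum(row) + noise_sd * rng.gauss(0, 1) for row in regions]

# Univariate: correlate behavior with one region at a time.
univariate = [abs(pearson_r([row[j] for row in regions], behavior))
              for j in range(N_REGIONS)]

# Crude multivariate score: add up all regions for each person.
composite = [sum(row) for row in regions]
multivariate_r = pearson_r(composite, behavior)

print(f"strongest single-region |r|: {max(univariate):.2f}")
print(f"all regions combined, r:     {multivariate_r:.2f}")
```

Under these assumptions the combined score should correlate with behavior several times more strongly than any single region, because the weak signals add up while much of the noise averages out.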

The Scale Problem

To put this in perspective, consider what it actually takes to run a brain imaging study with thousands of subjects. The ABCD study, one of the datasets Marek used, cost over $300 million. It involves 21 research sites across the country. It took years to recruit and scan nearly 12,000 children.

Most individual labs do not have that kind of money or time. They run studies with 30 or 40 people. They publish. They move on. And the literature fills with results that are, statistically speaking, mirages.

This is not a story about bad scientists. It is a story about a field that was built on assumptions that turned out to be wrong. When the first fMRI studies were done in the 1990s, nobody knew how big the effects were. The technology was new. The excitement was real. But the math was unforgiving.

Marek and his team showed that the typical brain-behavior association has an effect size of about r = 0.1 or smaller. Squaring r gives the share of variance explained, so at r = 0.1 the brain scan explains about 1 percent of the variation in a person's behavior. To reliably detect an effect that small, you need thousands of people. Not dozens. Not hundreds. Thousands (Marek et al., 2022).
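A back-of-envelope power calculation makes the arithmetic concrete. The sketch below uses the standard Fisher z approximation for the sample size needed to detect a correlation with 80 percent power. It is a textbook formula, not the paper's own analysis, and the choice of 1,000 tests for the multiple-comparisons case is a hypothetical stand-in for "many brain regions."

```python
import math
from statistics import NormalDist

def required_n(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect correlation r (two-sided test, Fisher z)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # threshold for the significance level
    z_power = z.inv_cdf(power)          # extra margin for the desired power
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

# Share of variance explained by r = 0.1.
variance_explained = 0.1 ** 2

single_test = required_n(0.1)                    # one pre-chosen region, r = 0.1
smaller_effect = required_n(0.05)                # same test, effect half as large
many_tests = required_n(0.1, alpha=0.05 / 1000)  # Bonferroni over 1,000 hypothetical tests

print(f"variance explained at r = 0.1: {variance_explained:.0%}")
print(f"n for one test at r = 0.1:     {single_test}")
print(f"n for one test at r = 0.05:    {smaller_effect}")
print(f"n at r = 0.1 with 1,000 tests: {many_tests}")
```

Even in the friendliest case, a single planned test of an effect at r = 0.1, the required sample is already several hundred people. Once the effect is smaller, or the analysis searches across many regions and has to correct for it, the requirement climbs into the thousands.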

What This Does NOT Prove

This is where things get interesting. The paper does not say that brain scans are useless. It does not say that the brain is not related to behavior. It does not say that mental health is all in your head (or not in your head, for that matter).

What it says is that the tools we have been using to find those links are too blunt for the job.

There are important caveats. The study focused on brain-wide association studies, which look for links between brain measures and individual differences across a population. That is a specific kind of question. It is not the only kind.

For example, lesion studies, where a patient has a stroke or injury that destroys a specific brain region, can produce dramatic and highly replicable findings. If you damage the visual cortex, you go blind. That effect is huge. It does not take 50,000 people to prove it.

Similarly, within-person studies, where the same person is scanned multiple times under different conditions, can detect reliable changes. Your brain looks different when you are sleeping versus awake. That is a within-person effect. It is large. It is reproducible.

The problem is specifically about between-person comparisons. Why is one person more anxious than another? Why does one child read better than another? Those questions require comparing brains across people. And that is where the noise swallows the signal.

Also, the paper does not invalidate every small study. Some small studies will get lucky and find a true effect. But the probability is low. And the field has no way to tell which small studies are the lucky ones until they are replicated at scale.

The Replication Crisis in a Tube

This paper is part of a larger reckoning across science. Psychology had its replication crisis. Medicine had its crisis. Now neuroscience is having its moment.

The difference is that brain imaging is expensive. A failed replication in psychology might cost a few thousand dollars and a semester of a graduate student's time. A failed replication in neuroimaging can cost millions and take years.

Marek and his team are not the first to point out the sample size problem. Others had raised alarms. But this paper was the first to quantify it with such precision and with such massive data. It showed exactly how many people you need to get a reliable answer. And the number was higher than almost anyone had guessed.

What This Actually Means

▸If you read a headline about a brain scan "predicting" a behavior, check the sample size. If it is under 500, be skeptical. If it is under 100, assume it is noise until replicated. The field's own gold-standard data says so (Marek et al., 2022).

▸The most reliable brain-behavior links come from functional scans during cognitive tasks, not from structural scans or mental health questionnaires. If you want to understand how the brain supports thinking, look at activity. If you want to understand depression, do not expect a simple answer from a single scan.

▸Multivariate pattern analysis, looking at the whole brain as a network, is more reproducible than looking at one region at a time. The brain is a system. Treat it like one.

▸Large-scale collaborations are not optional. They are necessary. The era of the lone lab running 30 subjects and publishing a flashy result should be over. Funders and journals need to demand sample sizes that match the question.

▸This does not mean mental health is not biological. It means the biology is distributed, subtle, and requires massive data to see clearly. The brain is connected to everything. But the connections are threads, not ropes. You need a big net to catch them.

References

[1] Scott Marek, Brenden Tervo-Clemmens, Finnegan J. Calabro, David F. Montez, et al. (2022). Reproducible brain-wide association studies require thousands of individuals. Nature.