Statistical Methods Unveil Hidden Survey Biases

The Question Nobody Asks About Your Survey

You have just finished reading a study that claims to measure something invisible. Trust in institutions. Employee engagement. Consumer loyalty toward a brand. The authors report a Cronbach's alpha of 0.89, which looks reassuring, and a confirmatory factor analysis whose fit indices clear the conventional thresholds. You nod along. The paper gets published. Another brick in the wall of social science.

But here is what the paper probably did not tell you: that Cronbach's alpha might be meaningless. Those factor analysis results might be misleading. And the researchers almost certainly ignored the fact that their numbers came from a sample, not the entire population, which means every single one of those quality metrics has a margin of error that nobody bothers to report.

Gordon W. Cheung, Helena D. Cooper-Thomas, Rebecca S. Lau, and Linda C. Wang spent years analyzing how management researchers actually report the quality of their measurement scales. What they found is uncomfortable: the standard toolkit for validating surveys is incomplete, sometimes inappropriate, and almost never accounts for the most basic statistical reality (Cheung et al., 2023).

What Cronbach's Alpha Actually Misses

The most popular reliability statistic in the social sciences is Cronbach's alpha. It appears in nearly every paper that uses a multi-item scale. Researchers treat it as the gold standard. If alpha is above 0.70, the scale is considered reliable. End of discussion.

Cheung and colleagues argue this is a mistake. Cronbach's alpha makes a strong assumption: that every item on a scale reflects the underlying construct equally strongly, so that all items have the same factor loading. This is called tau-equivalence. In practice, items rarely have equal loadings. A question about "trust in management" and a question about "satisfaction with communication" probably measure slightly different facets of the same idea. When items have unequal loadings, alpha tends to underestimate reliability.

The authors recommend an alternative: composite reliability, often called omega. Omega does not assume equal loadings. It uses the actual factor loadings from a confirmatory factor analysis to calculate how much of the variance in the total score comes from the true construct versus random error. In their review of published management studies, Cheung and colleagues found that many papers reported alpha without checking whether the tau-equivalence assumption held. Some reported alpha alongside factor analysis results that actually violated the assumption, creating a contradiction that nobody caught.
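
To make the difference concrete, here is a minimal Python sketch that computes both statistics for a single factor. It is an illustration, not the authors' code: the function names and the four loadings are hypothetical, and it assumes a one-factor model with standardized loadings, uncorrelated errors, and a factor variance of 1.

```python
import numpy as np

def cronbach_alpha(cov):
    """Cronbach's alpha computed from an item covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    return k / (k - 1) * (1.0 - np.trace(cov) / cov.sum())

def composite_reliability(loadings, error_variances):
    """Omega for a one-factor model (factor variance 1, uncorrelated errors)."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(error_variances, dtype=float)
    true_var = lam.sum() ** 2          # variance in the total score due to the factor
    return true_var / (true_var + theta.sum())

# Hypothetical standardized loadings that are clearly unequal.
loadings = np.array([0.9, 0.8, 0.6, 0.4])
error_variances = 1.0 - loadings ** 2

# Covariance matrix implied by the one-factor model.
implied_cov = np.outer(loadings, loadings) + np.diag(error_variances)

alpha = cronbach_alpha(implied_cov)
omega = composite_reliability(loadings, error_variances)
print(f"alpha = {alpha:.3f}, omega = {omega:.3f}")
```

With these made-up loadings, alpha comes out near 0.76 and omega near 0.78. If all four loadings are set to the same value, the two statistics coincide, which is the tau-equivalent case.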

The Validity Trap: Why Your Correlations Might Be Meaningless

Even if a scale is reliable, it might not measure what you think it measures. Researchers check this with two types of validity: convergent and discriminant.

Convergent validity means that items intended to measure the same construct actually correlate with each other. The standard approach is to look at factor loadings in a confirmatory factor analysis. If loadings are above 0.50 or 0.70, researchers declare victory.

Discriminant validity means that a construct is distinct from other constructs. If your trust scale correlates too highly with your satisfaction scale, you might be measuring the same thing twice. The most common test is the Fornell-Larcker criterion: the square root of the average variance extracted for each construct should be larger than the correlation between that construct and any other.
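
The criterion described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the loadings are standardized (so the average variance extracted is simply the mean squared loading), and the function names, loadings, and factor correlations are all hypothetical.

```python
import numpy as np

def average_variance_extracted(std_loadings):
    """AVE for one construct, assuming standardized loadings."""
    lam = np.asarray(std_loadings, dtype=float)
    return float(np.mean(lam ** 2))

def fornell_larcker_ok(loadings_a, loadings_b, factor_corr):
    """True if sqrt(AVE) of both constructs exceeds their correlation."""
    sqrt_ave_a = np.sqrt(average_variance_extracted(loadings_a))
    sqrt_ave_b = np.sqrt(average_variance_extracted(loadings_b))
    return bool(min(sqrt_ave_a, sqrt_ave_b) > abs(factor_corr))

# Hypothetical standardized loadings for two three-item scales.
trust = [0.80, 0.75, 0.70]
satisfaction = [0.85, 0.80, 0.70]

print(fornell_larcker_ok(trust, satisfaction, factor_corr=0.72))
print(fornell_larcker_ok(trust, satisfaction, factor_corr=0.80))
```

The check is a bare comparison of point estimates: it says nothing about how precisely the loadings or the correlation were estimated.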

Cheung and colleagues found that both of these approaches have problems. The Fornell-Larcker criterion, in particular, performs poorly when factor loadings are only moderately high. It can fail to detect discriminant validity violations even when they exist. The authors recommend a newer method: the heterotrait-monotrait ratio of correlations, or HTMT. This approach compares correlations between items measuring different constructs to correlations between items measuring the same construct. If the HTMT ratio exceeds a cutoff of 0.85 (or, more leniently, 0.90), discriminant validity is questionable.
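
A rough Python sketch of the HTMT ratio for a pair of constructs follows. It uses the commonly cited form of the statistic: the mean between-construct item correlation divided by the geometric mean of the two average within-construct item correlations. The function name, the item layout, and the correlation matrix are hypothetical, and each construct is assumed to have at least two items.

```python
import numpy as np

def htmt(item_corr, idx_a, idx_b):
    """HTMT ratio for two constructs from an item correlation matrix.

    idx_a and idx_b list the row/column positions of each construct's items.
    """
    r = np.asarray(item_corr, dtype=float)
    # Mean correlation between items of different constructs.
    hetero = r[np.ix_(idx_a, idx_b)].mean()

    def mean_within(idx):
        # Mean correlation among items of the same construct (off-diagonal only).
        block = r[np.ix_(idx, idx)]
        return block[np.triu_indices(len(idx), k=1)].mean()

    return float(hetero / np.sqrt(mean_within(idx_a) * mean_within(idx_b)))

# Hypothetical item correlations: six items, every loading 0.7,
# and a correlation of 0.6 between the two underlying constructs.
loading_matrix = np.zeros((6, 2))
loading_matrix[:3, 0] = 0.7
loading_matrix[3:, 1] = 0.7
factor_corr = np.array([[1.0, 0.6], [0.6, 1.0]])
item_corr = loading_matrix @ factor_corr @ loading_matrix.T
np.fill_diagonal(item_corr, 1.0)

ratio = htmt(item_corr, idx_a=[0, 1, 2], idx_b=[3, 4, 5])
print(f"HTMT = {ratio:.2f}")
```

In this idealized example the ratio recovers the construct correlation of 0.6, well below either cutoff. With real data the ratio is an estimate and carries sampling error of its own.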

The authors also point out something that sounds obvious once you hear it but is almost never done: these validity statistics have sampling distributions. They are estimates, not population parameters. Researchers should report confidence intervals for HTMT ratios and test whether they are significantly below the threshold. The authors reviewed dozens of papers and found almost no one doing this.

The Sampling Error Blind Spot

This is the deepest problem. Every survey is administered to a sample, not the entire population. That means every statistic derived from that sample has a margin of error. Reliability coefficients have sampling error. Factor loadings have sampling error. Validity ratios have sampling error.

Researchers routinely report p-values for their main hypotheses. They know that a correlation of 0.30 might not be statistically significant if the sample is small. But they do not apply the same logic to their measurement quality statistics. A Cronbach's alpha of 0.72 might look acceptable, but if its 95 percent confidence interval ranges from 0.60 to 0.84, the population value could well fall below the 0.70 cutoff. A factor loading of 0.55 might seem adequate, but its confidence interval might include 0.30.
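
One generic way to put an interval around a reliability estimate is the percentile bootstrap: resample respondents with replacement, recompute the statistic each time, and read off the middle 95 percent of the results. The Python sketch below illustrates the idea on simulated data. It is not taken from the paper and is not necessarily how the authors compute their intervals; the function names, sample size, and data-generating values are all assumptions made for the example.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from a respondents-by-items score matrix."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

def bootstrap_ci(scores, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling respondents (rows)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float)
    n = x.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)   # sample rows with replacement
        estimates[b] = statistic(x[rows])
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [tail, 1.0 - tail])
    return float(lower), float(upper)

# Simulated survey: 80 respondents, 4 items driven by one common factor.
rng = np.random.default_rng(42)
factor = rng.normal(size=(80, 1))
noise = rng.normal(size=(80, 4))
scores = 0.6 * factor + 0.8 * noise

alpha_hat = cronbach_alpha(scores)
low, high = bootstrap_ci(scores, cronbach_alpha)
print(f"alpha = {alpha_hat:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The same `bootstrap_ci` helper accepts any function of the score matrix, so the resampling logic could be reused for other measurement statistics. With only 80 respondents the interval is typically wide enough to straddle a cutoff that the point estimate alone appears to clear or miss.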

Cheung and colleagues developed an R package called measureQ that computes these intervals automatically. They illustrate with numerical examples how the same data can look acceptable under a point estimate approach but fail when sampling error is considered. This is not an esoteric technical point. It changes whether a scale passes or fails the quality check, which changes whether the study's conclusions are valid.

How the Study Was Done

The authors conducted a systematic review of empirical studies published in management journals that used structural equation modeling. They examined how these studies reported reliability, convergent validity, and discriminant validity. They then compared the reported practices against established psychometric standards.

The review covered multiple journals and years. The authors did not restrict themselves to a single field or methodology. They looked at how researchers actually behaved, not how textbooks say they should behave. Then they built the measureQ package to implement the best practices they identified, and they tested it on simulated and real data to demonstrate how the recommended approach changes the conclusions.

What the Research Does Not Prove

This paper is not an indictment of all published management research. Many studies use well-validated scales and report appropriate statistics. The authors are careful to say that their recommendations are best practices, not requirements for publication. Some journals already require stronger reporting than others.

The paper also does not claim that Cronbach's alpha is always wrong. In some cases, when items are essentially tau-equivalent (that is, they load about equally on the construct), alpha works fine. The problem is that researchers rarely test whether this condition holds.

The authors do not address the deeper question of whether management constructs actually exist in the way scales assume. A scale that reliably measures "organizational commitment" might still be measuring a social construction that shifts across cultures and time periods. That is a separate debate.

What This Actually Means

▸ If you are reading a paper that reports Cronbach's alpha without checking tau-equivalence, the reliability estimate might be too low. Look for composite reliability (omega) instead, or at least check whether the factor loadings are similar across items.

▸ If a paper uses the Fornell-Larcker criterion for discriminant validity, the results might be misleading, especially if factor loadings are moderate. Look for HTMT ratios with confidence intervals instead.

▸ If a paper reports measurement quality statistics without confidence intervals, those statistics are incomplete. Every estimate from a sample has sampling error. The authors should report how precise their estimates are.

▸ If you are a researcher, the measureQ R package makes it straightforward to implement these recommendations. The authors provide a template for reporting results that covers reliability, convergent validity, and discriminant validity with proper uncertainty quantification.

▸ The next time you read a study that claims to measure something invisible, ask not just whether the scale is reliable and valid, but whether the evidence for that reliability and validity accounts for the fact that it came from a sample. If it does not, the conclusions might be less solid than they appear.

References

[1] Gordon W. Cheung, Helena D. Cooper-Thomas, Rebecca S. Lau, Linda C. Wang (2023). Reporting reliability, convergent and discriminant validity with structural equation modeling: A review and best-practice recommendations. Asia Pacific Journal of Management.