Best Practices for Measuring Invisible Traits in Research

The Ghost in the Data

Imagine you are a researcher trying to measure something that does not exist in any physical sense. Trust. Motivation. Leadership potential. Corporate culture. These are not like height or blood pressure or reaction time. You cannot hold them. You cannot weigh them. You cannot put them under a microscope. And yet, entire fields of management science, psychology, and economics depend on measuring them with precision.

The standard approach is simple: ask people a set of questions on a rating scale, average their answers, and call it a measurement. But here is the uncomfortable truth that Gordon W. Cheung, Helena D. Cooper-Thomas, Rebecca S. Lau, and Linda C. Wang laid bare in their 2023 review: most researchers are doing this wrong. Not just slightly wrong. Fundamentally wrong. The numbers they report as evidence that their measurements are trustworthy are often inadequate, sometimes inappropriate, and almost always ignore a basic statistical reality: sampling error (Cheung et al., 2023).

This is not a niche problem. The paper has already accumulated over 1,700 citations, which tells you that a lot of people in the social sciences sense something is broken. The authors, writing in the Asia Pacific Journal of Management, systematically reviewed how researchers report the quality of their measurement scales when using structural equation modeling. What they found is that the most common practices (Cronbach's alpha for reliability, factor loadings for validity) are treated as if they are carved in stone, when in fact they are more like sandcastles.

The Cronbach's Alpha Trap

Why a single number is not enough

Cronbach's alpha is the workhorse of social science. It is a single number, usually between 0 and 1, that tells you how consistently people answer a set of questions. If alpha is above 0.70, most researchers breathe a sigh of relief and move on. If it is above 0.90, they feel like they have struck gold.

Cheung and colleagues argue that this is a dangerous oversimplification. Alpha is a function of two things: the average correlation between items and the number of items. You can get a high alpha simply by asking the same question in slightly different wording twenty times. That does not mean you are measuring anything real. It means you have manufactured consistency (Cheung et al., 2023).
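The dependence on scale length is easy to check by hand. The sketch below uses the textbook formula for standardized alpha (the function name is illustrative, and the formula is not taken from the paper) and holds the average inter-item correlation at a modest 0.30 while the number of items grows.

```python
def standardized_alpha(n_items: int, mean_r: float) -> float:
    """Standardized Cronbach's alpha from the item count and the
    mean inter-item correlation (textbook formula)."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)


# Same average correlation, different scale lengths.
for k in (3, 10, 20):
    print(k, "items ->", round(standardized_alpha(k, 0.30), 3))
```

With three items alpha is roughly 0.56, with ten it is about 0.81, and with twenty it approaches 0.90, even though the items are no more strongly related to each other than before.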

The authors recommend something they call composite reliability, which is calculated from the factor loadings in a confirmatory factor analysis. The catch lies with alpha, not with composite reliability: alpha assumes that all items load equally on the factor (tau-equivalence), which almost never happens in real data. When items have different loadings, which is the norm, alpha becomes biased, typically downward. Composite reliability, closely related to the coefficient known as omega, uses each item's estimated loading and does not make the equal-loading assumption. Cheung and colleagues also stress that any reliability estimate should be reported with a confidence interval, not as a single number, because sampling error means the true value could be substantially different.
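To see how the equal-loading question plays out numerically, here is a minimal sketch. It assumes a one-factor model with standardized loadings and uncorrelated errors. The function names are illustrative and this is not measureQ's implementation. The loading-based coefficient is the usual congeneric formula, which appears in the literature under the names composite reliability and omega.

```python
from itertools import combinations


def omega_from_loadings(loadings):
    """Loading-based reliability: (sum of loadings)^2 divided by itself
    plus the summed error variances (1 - loading^2 for standardized items)."""
    total = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return total / (total + error)


def alpha_from_loadings(loadings):
    """Standardized alpha implied by the same one-factor model, where the
    correlation between two items is the product of their loadings."""
    k = len(loadings)
    pairs = list(combinations(loadings, 2))
    mean_r = sum(a * b for a, b in pairs) / len(pairs)
    return k * mean_r / (1 + (k - 1) * mean_r)


equal = [0.7, 0.7, 0.7]
unequal = [0.9, 0.7, 0.5]
for name, lam in (("equal", equal), ("unequal", unequal)):
    print(name, round(omega_from_loadings(lam), 3), round(alpha_from_loadings(lam), 3))
```

With equal loadings the two coefficients coincide. With unequal loadings alpha falls below the loading-based value, which is the bias that the equal-loading assumption introduces.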

The practical takeaway

If you are a researcher, do not report Cronbach's alpha alone. Report omega. Report a confidence interval around it. If your software does not calculate these, use the R package the authors developed, called measureQ, which does it for you (Cheung et al., 2023). If you are a reviewer, stop accepting papers that only give you alpha. Ask for more.

Convergent Validity: The Shared Variance Problem

What does your scale actually capture?

Convergent validity is the idea that items measuring the same construct should correlate highly with each other. If you are measuring "employee engagement," the three questions about engagement should cluster together and not wander off into the territory of "job satisfaction" or "burnout."

The standard way to assess convergent validity is to look at the factor loadings from a confirmatory factor analysis. Loadings above 0.70 are considered good. But Cheung and colleagues point out a subtle flaw: high loadings are necessary but not sufficient. A standardized loading of 0.70 means the construct explains only 0.49 of that item's variance, so a scale whose items all load at 0.70 still has less than half of its item variance explained. The authors recommend reporting the average variance extracted, or AVE, which tells you the proportion of variance in the items that is explained by the latent construct. An AVE of 0.50 or higher is the conventional threshold, meaning that the construct explains at least half of the variance in its indicators (Cheung et al., 2023).
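For standardized loadings, AVE is simply the mean of the squared loadings. The short sketch below (function name illustrative) shows why loadings of exactly 0.70 fall just short of the 0.50 threshold.

```python
def average_variance_extracted(loadings):
    """AVE for standardized loadings: the mean squared loading."""
    return sum(l ** 2 for l in loadings) / len(loadings)


print(round(average_variance_extracted([0.70, 0.70, 0.70]), 3))  # all items at 0.70
print(round(average_variance_extracted([0.80, 0.75, 0.70]), 3))  # slightly stronger items
```

The first scale has an AVE of 0.49, just under the threshold. The second clears it at about 0.56.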

But here is where it gets tricky. AVE is also a point estimate. It has sampling error. Cheung and colleagues found that many researchers treat AVE as if it were a fixed property of their scale, when in fact it fluctuates from sample to sample. They recommend bootstrapping to generate confidence intervals for AVE, and only concluding that convergent validity is adequate if the entire confidence interval exceeds 0.50.

A concrete example

Imagine you run a study with 200 employees and get an AVE of 0.52. That looks good. But the 95 percent confidence interval might range from 0.44 to 0.60. That means the true AVE could be below the threshold. You cannot be confident that your scale actually captures the construct. The authors' message is clear: report the interval, not just the point.
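A percentile bootstrap makes this scenario tangible. The sketch below is an illustration under strong assumptions, not the authors' procedure. It simulates 200 respondents on a hypothetical three-item scale with true loadings of 0.72, and estimates AVE using the closed-form solution that exists only for a one-factor model with exactly three items. A real analysis would refit the CFA on every resample. All function names are made up for this example.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ave_three_items(rows):
    """AVE for a just-identified one-factor model with three items:
    each squared loading is r_ij * r_ik / r_jk."""
    x1, x2, x3 = zip(*rows)
    r12, r13, r23 = pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)
    squared = (r12 * r13 / r23, r12 * r23 / r13, r13 * r23 / r12)
    return sum(squared) / 3


def bootstrap_ci(rows, stat, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval: resample respondents with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = sorted(
        stat([rows[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    lower = estimates[round(tail * n_boot)]
    upper = estimates[round((1 - tail) * n_boot) - 1]
    return lower, upper


def simulate(n, loading, seed=0):
    """Hypothetical data: three standardized items driven by one factor."""
    rng = random.Random(seed)
    noise_sd = (1 - loading ** 2) ** 0.5
    rows = []
    for _ in range(n):
        f = rng.gauss(0, 1)
        rows.append(tuple(loading * f + noise_sd * rng.gauss(0, 1) for _ in range(3)))
    return rows


rows = simulate(200, 0.72)
point = ave_three_items(rows)
lower, upper = bootstrap_ci(rows, ave_three_items)
print(f"AVE = {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
print("entire interval above 0.50:", lower > 0.50)
```

The decision rule in the last line is the one described above: the point estimate alone is not enough, and the claim holds only if the lower bound clears the threshold.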

Discriminant Validity: The Fornell-Larcker Failure

The most common test is probably wrong

Discriminant validity is the flip side of convergent validity. It asks: is your construct distinct from other constructs? If you measure "trust in leadership" and "job satisfaction," you need to show that these are not the same thing. If they are too highly correlated, you might be measuring one construct with two different labels.

The most widely used test for discriminant validity is the Fornell-Larcker criterion. It says that the square root of the AVE for each construct should be greater than the correlation between that construct and any other construct. In plain English: your measure should share more variance with its own items than with other measures.
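The rule is mechanical enough to state in a few lines. This is a minimal sketch of the criterion as described above, with a hypothetical function name and made-up input values.

```python
from math import sqrt


def fornell_larcker_passes(ave_a, ave_b, correlation):
    """True if the square root of each construct's AVE exceeds the
    absolute correlation between the two constructs."""
    r = abs(correlation)
    return sqrt(ave_a) > r and sqrt(ave_b) > r


# Two constructs correlated at 0.70 still pass with moderate AVEs.
print(fornell_larcker_passes(0.55, 0.60, 0.70))
# The check fails only once an AVE drops below the squared correlation (0.49).
print(fornell_larcker_passes(0.45, 0.60, 0.70))
```

The first call passes even though the constructs correlate at 0.70, which gives a sense of how much overlap the rule tolerates.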

Cheung and colleagues found that this criterion is too lenient. It fails to detect discriminant validity problems in many realistic scenarios. They recommend a more stringent approach: the heterotrait-monotrait ratio of correlations, or HTMT. The HTMT divides the average correlation between items of different constructs (heterotrait correlations) by the geometric mean of the average correlations between items of the same construct (monotrait correlations). A value above 0.85 or 0.90 suggests that the constructs are not sufficiently distinct (Cheung et al., 2023).
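As a rough illustration of the ratio, the sketch below computes HTMT for two constructs from an item correlation matrix. It follows the commonly cited definition (mean between-construct item correlation divided by the geometric mean of the two within-construct averages) and assumes all items are keyed in the same direction. The matrix is made up and the helper names are hypothetical.

```python
from itertools import combinations, product
from math import sqrt


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def htmt(corr, items_a, items_b):
    """HTMT for two constructs given an item correlation matrix and the
    row/column indices of each construct's items."""
    hetero = mean(corr[i][j] for i, j in product(items_a, items_b))
    mono_a = mean(corr[i][j] for i, j in combinations(items_a, 2))
    mono_b = mean(corr[i][j] for i, j in combinations(items_b, 2))
    return hetero / sqrt(mono_a * mono_b)


def block_matrix(within, between, size=3):
    """Made-up correlation matrix: two constructs with `size` items each."""
    n = 2 * size
    return [
        [1.0 if i == j else (within if (i < size) == (j < size) else between)
         for j in range(n)]
        for i in range(n)
    ]


corr = block_matrix(within=0.60, between=0.45)
value = htmt(corr, items_a=[0, 1, 2], items_b=[3, 4, 5])
print(round(value, 3), "exceeds 0.85:", value > 0.85)
```

Here items correlate at 0.60 within a construct and 0.45 across constructs, giving an HTMT of 0.75, below both conventional cut-offs.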

Again, the authors emphasize that you need a confidence interval. If the upper bound of the HTMT confidence interval exceeds 0.90, you cannot claim discriminant validity.

Why this matters for your research

If you have ever used the Fornell-Larcker criterion and felt satisfied, you might have missed a validity problem. The authors are not saying that Fornell-Larcker is always wrong. They are saying that in many common research situations, it fails. Switching to HTMT with bootstrapped confidence intervals is a simple fix that could save your paper from a reviewer asking uncomfortable questions.

The measureQ Solution

A one-stop R package for the skeptical researcher

One of the practical contributions of this paper is the development of measureQ, an R package that implements all of these best practices in a single workflow. The package calculates composite reliability, omega, AVE, HTMT, and their confidence intervals using bootstrapping. It also produces a formatted table that you can paste directly into your manuscript (Cheung et al., 2023).

The authors emphasize that measureQ is designed to be easy to use, even for researchers who are new to R. The package requires only a data frame and a model specification. It handles the bootstrapping automatically and reports the results with the appropriate confidence intervals.

This is not just a convenience. It is a nudge toward better science. When the software does the right thing by default, researchers are more likely to do the right thing. The alternative is to rely on default settings in SPSS or Mplus that may not reflect current best practices.

What the Research Does Not Prove

The limits of the recommendations

Cheung and colleagues are clear that their recommendations apply to reflective measurement models, where the latent construct causes the observed responses. This is the most common type of model in management research. But there are also formative models, where the observed variables cause the construct, like socioeconomic status being composed of income, education, and occupation. Their recommendations do not directly apply to formative models, and they note that different criteria are needed.

The paper also does not address the deeper philosophical question of whether latent variables exist at all. There is a long tradition in psychology and sociology that questions whether constructs like "personality" or "culture" are real entities or merely useful fictions. Cheung and colleagues take a pragmatic approach: if you are going to measure latent variables, you should do it as rigorously as possible. But they do not claim that rigorous measurement proves the existence of the construct.

Finally, the authors acknowledge that their recommendations increase the burden on researchers. Bootstrapping requires more computational time. Reporting confidence intervals requires more space in manuscripts. Reviewers and editors need to learn new standards. But they argue that the cost is small compared to the cost of publishing results based on inadequate measurements.

What This Actually Means

▸ Stop treating Cronbach's alpha as the gold standard. Report omega with a confidence interval. If your software does not support this, use the measureQ R package or another tool that calculates it. Alpha is not wrong. It is just incomplete.

▸ Do not rely on the Fornell-Larcker criterion for discriminant validity. Use the HTMT ratio instead, and report the bootstrapped confidence interval. If the upper bound exceeds 0.90, your constructs may not be distinct enough.

▸ Report average variance extracted with a confidence interval, not just a point estimate. An AVE of 0.52 might look fine, but if the confidence interval dips below 0.50, you cannot be confident in your convergent validity.

▸ Remember that sampling error affects every psychometric estimate. The numbers you report are not fixed properties of your scale. They are estimates that vary from sample to sample. Confidence intervals are not optional. They are the bare minimum for honest reporting.

▸ If you are a reviewer or editor, raise the bar. Ask for omega, HTMT, and confidence intervals. The paper by Cheung and colleagues gives you the evidence to justify these demands. The field will only improve if gatekeepers enforce higher standards.

References

[1] Gordon W. Cheung, Helena D. Cooper-Thomas, Rebecca S. Lau, Linda C. Wang (2023). Reporting reliability, convergent and discriminant validity with structural equation modeling: A review and best-practice recommendations. Asia Pacific Journal of Management.