How Data Articles Can Unlock Hidden Insights with a Simple Model

The best science often starts with a confession. Christian Ringle, Marko Sarstedt, Noemi Sinkovics, and Rudolf Sinkovics begin their 2023 paper with one: most researchers, they admit, treat data articles like leftovers. You run a big study, you publish the main findings, and then you toss the raw numbers into a data journal as a courtesy. A footnote. A way to check a box.
The authors think that is a waste. And they have a surprisingly simple fix.
Their argument, published in Data in Brief, is that data articles should stop being afterthoughts and start being their own kind of discovery engine. The tool they recommend is partial least squares structural equation modelling, or PLS-SEM. It sounds technical. It is technical. But the core insight is straightforward: you do not need a complex, expensive, multi-year study to find something new. Sometimes you just need to look at an existing dataset in a smarter way.
Why Most Data Articles Are Boring (And How to Fix It)
Here is the problem. Most data articles describe a dataset. They tell you how many rows, what variables were measured, maybe a summary statistic or two. Then they stop. The implicit message is: Here are the numbers. Go figure it out yourself.
Ringle and his colleagues argue that this approach leaves hidden insights on the table. A data article, they write, should do more than describe. It should demonstrate usefulness. It should show the reader, concretely, what kind of questions this dataset can answer and how.
The authors propose PLS-SEM as the method to do that. PLS-SEM is a cousin of traditional structural equation modelling, which social scientists use to test complex causal theories. But PLS-SEM has a key advantage: it works well with smaller sample sizes, non-normal data, and exploratory research where you are not sure what you will find. It is built for messy, real-world datasets, which is exactly what most data articles contain.
What PLS-SEM Actually Does (In Plain Language)
Imagine you are trying to understand what makes people loyal to a brand. You have survey data: satisfaction scores, trust ratings, purchase frequency, age, income. Traditional statistics would test each relationship one at a time. Does satisfaction predict loyalty? Does trust matter more than satisfaction?
PLS-SEM does something different. It builds a map. It lets you test the entire system at once: how satisfaction, trust, age, and income all interact to produce loyalty. It can handle multiple cause-and-effect relationships simultaneously. And it does not require the data to be perfect.
Ringle et al. (2023) call PLS-SEM "particularly suitable for data articles" because it gives authors a way to demonstrate their dataset's analytical value without needing a full-blown experimental design. You can take an existing dataset, run a PLS-SEM model, and show readers exactly what kind of relationships the data can reveal.
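To make the "map" idea concrete, here is a minimal sketch of a PLS path model for the brand loyalty example. It is not taken from the paper and is not a substitute for dedicated PLS-SEM software: it assumes reflective constructs estimated with Mode A outer weights and the simple centroid inner weighting scheme, and it skips bootstrapping entirely. The function name pls_path_model, the synthetic survey data, and the three-construct model are all assumptions made for illustration.

```python
import numpy as np

def standardize(a):
    """Center to mean 0 and scale to unit variance, column-wise."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def pls_path_model(blocks, paths, max_iter=300, tol=1e-8):
    """Simplified PLS path model (Mode A, centroid scheme).

    blocks: construct name -> (n, k) array of indicators.
    paths:  target construct -> list of source constructs.
    Assumes every construct appears in at least one path.
    Returns (path coefficients, outer weights).
    """
    names = list(blocks)
    X = {c: standardize(np.asarray(blocks[c], dtype=float)) for c in names}
    n = len(X[names[0]])
    # Constructs count as neighbours if a path connects them in either direction.
    neighbours = {c: set() for c in names}
    for target, sources in paths.items():
        for s in sources:
            neighbours[target].add(s)
            neighbours[s].add(target)
    # Start from equal weights, scaled so construct scores have unit variance.
    w = {c: np.ones(X[c].shape[1]) for c in names}
    w = {c: w[c] / (X[c] @ w[c]).std() for c in names}
    for _ in range(max_iter):
        scores = {c: X[c] @ w[c] for c in names}
        new_w = {}
        for c in names:
            # Inner proxy: sign-weighted sum of neighbouring construct scores.
            proxy = np.zeros(n)
            for o in neighbours[c]:
                sign = np.sign(np.corrcoef(scores[c], scores[o])[0, 1])
                proxy += sign * scores[o]
            # Mode A: weight = covariance of each indicator with the proxy.
            raw = X[c].T @ proxy / n
            new_w[c] = raw / (X[c] @ raw).std()
        delta = max(np.abs(new_w[c] - w[c]).max() for c in names)
        w = new_w
        if delta < tol:
            break
    scores = {c: X[c] @ w[c] for c in names}
    # Structural model: OLS of each target's scores on its sources' scores.
    coefs = {}
    for target, sources in paths.items():
        A = np.column_stack([scores[s] for s in sources])
        beta, *_ = np.linalg.lstsq(A, scores[target], rcond=None)
        coefs[target] = {s: float(b) for s, b in zip(sources, beta)}
    return coefs, w

# Synthetic (made-up) survey: three constructs, three indicators each.
rng = np.random.default_rng(0)
n = 500

def reflect(latent, k=3):
    """Generate k noisy indicators of one latent variable."""
    return 0.8 * latent[:, None] + 0.6 * rng.normal(size=(n, k))

satisfaction = rng.normal(size=n)
trust = 0.6 * satisfaction + 0.8 * rng.normal(size=n)
loyalty = 0.3 * satisfaction + 0.5 * trust + 0.7 * rng.normal(size=n)

blocks = {
    "satisfaction": reflect(satisfaction),
    "trust": reflect(trust),
    "loyalty": reflect(loyalty),
}
paths = {"trust": ["satisfaction"], "loyalty": ["satisfaction", "trust"]}

coefs, weights = pls_path_model(blocks, paths)
print(coefs)
```

The point of the sketch is the shape of the workflow: all paths are estimated from one set of construct scores rather than one relationship at a time. Because the constructs are measured with noise, the estimated coefficients will generally come out smaller than the values used to generate the data.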
The New Metric That Changes Everything
The paper introduces something genuinely new: adjusted versions of the HTMT metric. HTMT stands for heterotrait-monotrait ratio of correlations. It is a test for discriminant validity, which is a fancy way of asking: Are these two concepts actually different, or are we measuring the same thing twice?
Here is why this matters. If your dataset contains variables that are too similar, your model will give you nonsense results. You might think you found a relationship between "customer satisfaction" and "product happiness," but if those two scales are actually measuring the same underlying concept, your finding is an illusion.
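A small sketch can show what the ratio measures. The code below follows the original HTMT formulation (the average correlation between indicators of different constructs, divided by the geometric mean of the average correlations among indicators of the same construct), not the adjusted versions from the paper. The data are synthetic and the construct names are assumptions chosen to mirror the example above. Values above the commonly cited cutoffs of 0.85 or 0.90 are usually read as a sign that two constructs are not empirically distinct.

```python
import random
from itertools import combinations, product
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def htmt(block_a, block_b):
    """Original HTMT for two constructs.

    Each block is a list of indicator columns (at least two per construct).
    """
    # Heterotrait: correlations between indicators of different constructs.
    hetero = mean(pearson(a, b) for a, b in product(block_a, block_b))
    # Monotrait: correlations among indicators of the same construct.
    mono_a = mean(pearson(x, y) for x, y in combinations(block_a, 2))
    mono_b = mean(pearson(x, y) for x, y in combinations(block_b, 2))
    return hetero / sqrt(mono_a * mono_b)

# Synthetic (made-up) data: "happiness" is almost a copy of "satisfaction",
# while "trust" is related to it but clearly distinct.
rng = random.Random(42)
n = 400

def indicators(latent, k=3, noise=0.6):
    """Generate k noisy indicator columns for one latent variable."""
    return [[v + rng.gauss(0, noise) for v in latent] for _ in range(k)]

satisfaction = [rng.gauss(0, 1) for _ in range(n)]
trust = [0.4 * s + rng.gauss(0, 1) for s in satisfaction]
happiness = [s + rng.gauss(0, 0.1) for s in satisfaction]

satisfaction_items = indicators(satisfaction)
trust_items = indicators(trust)
happiness_items = indicators(happiness)

print("satisfaction vs trust:    ", round(htmt(satisfaction_items, trust_items), 2))
print("satisfaction vs happiness:", round(htmt(satisfaction_items, happiness_items), 2))
```

In this toy setup the satisfaction and trust pair should land well below the cutoffs, while the satisfaction and happiness pair should sit close to 1, which is the "measuring the same thing twice" case.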
The original HTMT metric had a problem. It was too strict. It flagged many perfectly valid datasets as having discriminant validity issues, especially when sample sizes were small. Ringle and his colleagues (2023) developed adjusted versions that are more flexible. They broaden the applicability of the test, meaning more data articles can pass the quality check and be used for meaningful analysis.
This is not a minor technical tweak. It is a gate that just got wider. More datasets can now be analyzed. More insights can be found.
How to Build a Data Article That Actually Gets Used
The authors offer a practical roadmap. It has three phases.
Phase One: Conceptualize Before You Collect
Most researchers collect data and then figure out what to do with it. Bad idea. Ringle et al. (2023) argue that the conceptual model should come first. Decide what relationships you want to test. Choose your variables accordingly. A dataset collected without a plan is just noise.
Phase Two: Choose the Right Data
PLS-SEM works best with certain types of data. The authors recommend using it with continuous or ordinal variables, especially when the relationships you are testing are exploratory rather than confirmatory. If you are trying to prove a specific hypothesis, you need traditional SEM. If you are exploring what the data might reveal, PLS-SEM is your tool.
Phase Three: Report Quality Criteria
Here is where most data articles fall short. They describe the dataset but not its quality. Ringle et al. (2023) specify exactly what to report: reliability coefficients, convergent validity, discriminant validity (using the new HTMT adjustments), and model fit indices. Without these, a reader cannot judge whether your dataset is trustworthy.
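Two of these criteria are simple enough to sketch by hand. The snippet below computes Cronbach's alpha from raw item responses, and average variance extracted (AVE) and composite reliability from standardized loadings, using the standard textbook formulas. The survey responses and loadings are invented for illustration and do not come from the paper. Rules of thumb often quoted in the PLS-SEM literature are alpha and composite reliability of at least 0.70 and AVE of at least 0.50.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of indicator columns."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # per-respondent sum score
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

def composite_reliability(loadings):
    """Composite reliability (rho_c) from standardized loadings."""
    explained = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return explained / (explained + error)

# Hypothetical 5-point responses from eight respondents on three items.
responses = [
    [4, 5, 3, 4, 2, 5, 3, 4],
    [4, 4, 3, 5, 2, 5, 2, 4],
    [5, 5, 2, 4, 3, 4, 3, 3],
]
# Hypothetical standardized loadings for the same construct.
loadings = [0.82, 0.78, 0.74]

print("Cronbach's alpha:     ", round(cronbach_alpha(responses), 3))
print("AVE:                  ", round(average_variance_extracted(loadings), 3))
print("Composite reliability:", round(composite_reliability(loadings), 3))
```

Reporting numbers like these next to the dataset lets a reader judge measurement quality before deciding whether to build on the data.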
What This Means for the Average Researcher
You do not need to be a statistician to use this approach. You need to be thoughtful. The authors are essentially arguing that data articles should become mini research papers, not just data dumps. They should pose a question, test it with PLS-SEM, and report the results transparently.
This changes the incentive structure. Right now, publishing a data article gets you a line on your CV but little else. If data articles start generating real insights, they become citable contributions. They become part of the scientific record in a way they currently are not.
What This Does NOT Prove
Let me be honest about the limits. The paper by Ringle et al. (2023) is a perspective article, not an experimental study. It does not test whether PLS-SEM actually produces better insights than other methods. It does not compare adjusted HTMT to other discriminant validity tests in a head-to-head trial. The authors are making a reasoned argument, not proving a hypothesis.
There is also a deeper question the paper does not fully address. PLS-SEM is a powerful tool, but it is also flexible. Very flexible. With enough tweaking, you can make almost any dataset produce a statistically significant model. The adjusted HTMT metric helps, but it does not eliminate the risk of overfitting or p-hacking.
The open question is this: Will making data articles easier to analyze also make them easier to misuse? The authors do not answer that. They leave it as a challenge for the field.
Why This Paper Has 641 Citations
The paper has been cited over 640 times since 2023. That is a lot for a perspective piece. The reason is simple: it solves a real problem. Researchers have mountains of data they do not know how to use. PLS-SEM gives them a method. The adjusted HTMT gives them a quality check. The roadmap gives them a process.
The authors are not inventing a new science. They are making existing science more accessible. That is often more valuable than a flashy discovery.
What This Actually Means
Here is the bottom line, broken into specific takeaways:
- Stop treating data articles as afterthoughts. A well-constructed data article with a PLS-SEM model can generate novel insights without requiring a new experiment. You can get a second paper out of the same dataset.
- Use the adjusted HTMT metric for discriminant validity. The original version was too strict. The adjusted version (Ringle et al., 2023) broadens the range of usable datasets without sacrificing rigor.
- Conceptualize before you collect. A dataset designed around a specific PLS-SEM model will be far more useful than one collected without a plan. The model should drive the data, not the other way around.
- Report quality metrics explicitly. Reliability, convergent validity, discriminant validity, and model fit are not optional. They are what separate a trustworthy dataset from a black box.
- Link your data article to published research. The authors (Ringle et al., 2023) emphasize that PLS-SEM data articles are most valuable when they connect to existing studies. Show how your dataset extends or challenges prior work. That is how you get cited.
The most surprising thing about this paper is how practical it is. It does not ask for new technology or massive funding. It asks for a shift in mindset. Treat your data like it matters. Show people what it can do. And give them the tools to check your work.
That is not just good science. It is good journalism.
References
- [1] Christian M. Ringle, Marko Sarstedt, Noemi Sinkovics, Rudolf R. Sinkovics (2023). A perspective on using partial least squares structural equation modelling in data articles. Data in Brief. 641 citations.
