AI Can Now Analyze Society Without Any Labeled Training Data

The Social Scientist Who Doesn’t Need a Library Card

Imagine hiring a research assistant who has never read a single sociology paper, never coded a survey, never seen a single tweet labeled "persuasive" or "not persuasive." You hand them a batch of raw text and ask them to sort it by political ideology. They have no examples to learn from. No training data. No prior exposure to your task.

And they do it anyway. Not perfectly. But well enough that you would trust their judgment alongside your own.

That is what a group of researchers at Stanford and Georgia Tech have been testing. In a 2023 paper published in Computational Linguistics, Caleb Ziems, William A. Held, Omar Ahmed Shaikh, Jiaao Chen, and their coauthors put 13 large language models through a battery of 25 social science benchmarks. The goal: see whether LLMs can perform computational social science zero-shot. No fine-tuning. No labeled data. Just a prompt and the raw text.

The results are not a clean victory. They are something stranger and more useful.

The Zero-Shot Gambit

Why Social Scientists Should Care About a Party Trick

Computational social science is a field built on labeled data. If you want to train a model to detect hate speech, you need thousands of examples of hate speech and non-hate speech, hand-coded by human annotators. If you want to study persuasion in political ads, you need experts to read transcripts and decide what counts as persuasive.

This is expensive. It is slow. And it introduces human bias at every step.

Ziems and his colleagues asked a simpler question: What if the model could skip the training phase entirely? What if you could just ask it to do the job?

They tested this across five categories of social science tasks: political ideology detection, persuasion detection, emotion recognition, hate speech identification, and stance detection. For each task, they wrote prompts in plain English. No labeled examples. No task-specific training. Just instructions like "Classify the following text as liberal or conservative."

The authors found that zero-shot LLMs "fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans" (Ziems et al., 2023). That is a careful academic sentence. Let me translate it.

The fine-tuned models are better. They have been trained on exactly the task they are being tested on. They are specialists. The zero-shot LLMs are generalists showing up to a specialist exam and scoring a B-minus.

But here is the thing: a B-minus from a generalist who has never seen the material is remarkable. It means the model has internalized something about how humans categorize social phenomena. It means you can ask it to do social science tasks without building a custom dataset first.

The 25 Benchmarks

The researchers did not cherry-pick easy tasks. They assembled a diverse set of benchmarks spanning political science, sociology, psychology, and communication studies. Some examples:

▸Political ideology: Classifying US Congressional speeches as liberal or conservative
▸Persuasion: Identifying which arguments in online discussions are persuasive
▸Emotion: Detecting anger, joy, sadness in social media posts
▸Hate speech: Distinguishing hate speech from offensive but non-hateful language
▸Stance: Determining whether a tweet supports or opposes a given position

For each benchmark, they compared the zero-shot models against two baselines: the best published fine-tuned model for that task, and human annotators who were paid to code the same data.

The results were consistent across tasks. The fine-tuned models won on pure accuracy. But the zero-shot models were close enough to humans that they could plausibly join a human annotation team without dragging down the quality.

The Two Ways LLMs Actually Help

Role 1: The Annotator Who Never Gets Tired

Human annotation is the bottleneck of computational social science. You hire crowdworkers, pay them per annotation, check their work, throw out bad annotations, hire more crowdworkers. It takes weeks. It costs thousands of dollars. And the humans get tired and inconsistent.

Ziems and his colleagues propose that LLMs can serve as "zero-shot data annotators on human annotation teams" (Ziems et al., 2023). Not replacing humans. Augmenting them.

Here is how it works in practice. You have 10,000 tweets to classify. You ask the LLM to classify all of them. Then you ask three humans to classify a random subset of 500. You compare the LLM's classifications to the humans' on those 500. If the agreement is high enough, you trust the LLM on the remaining 9,500.
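The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the `human_annotate` callback, and the 0.8 agreement threshold are all assumptions made up for this example. It draws a random subset, takes the majority vote of the human annotators on each item, and reports how often the LLM matches.

```python
import random
from collections import Counter


def majority_vote(labels):
    """Return the most common label among annotators, or None on a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def audit_llm_labels(llm_labels, human_annotate, sample_size=500,
                     threshold=0.8, seed=0):
    """Compare LLM labels with human majority votes on a random subset.

    llm_labels: dict mapping item id -> label produced by the LLM.
    human_annotate: hypothetical callable, item id -> list of human labels.
    threshold: hypothetical cutoff; pick one that suits your own task.
    """
    rng = random.Random(seed)
    sample_ids = rng.sample(sorted(llm_labels),
                            min(sample_size, len(llm_labels)))
    matches, scored = 0, 0
    for item_id in sample_ids:
        gold = majority_vote(human_annotate(item_id))
        if gold is None:
            continue  # skip items where the humans themselves are split
        scored += 1
        matches += int(llm_labels[item_id] == gold)
    agreement = matches / scored if scored else 0.0
    return {
        "agreement": agreement,
        "scored": scored,
        "trust_llm": agreement >= threshold,
    }
```

With 10,000 LLM labels and three human labels per sampled item, `audit_llm_labels(llm_labels, human_annotate)` returns the raw agreement on up to 500 items and a yes/no on whether it clears the threshold. Raw agreement is the simplest possible check. It does not correct for agreement that would happen by chance.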

This is not a new idea in machine learning. Active learning systems have done something similar for years. But those systems required training data to start. The LLM does not. It walks in cold and still agrees with humans often enough to be useful.

The authors found that the best-performing LLMs achieved "fair levels of agreement with humans" across most tasks. The exact numbers vary by task and model, but the pattern is consistent: the LLM is not better than humans, but it is close enough that a human could supervise it rather than doing all the work themselves.
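One common way to put a number on "agreement" is Cohen's kappa, which discounts the matches two raters would produce by chance. The sketch below computes it from scratch and maps the result onto the Landis and Koch bands, where 0.21 to 0.40 is conventionally called "fair." Whether the paper's wording lines up exactly with those bands is my assumption, not something the quoted passage states, and the function names are my own.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled at random with their own
    # label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Conventional verbal label for a kappa value (Landis and Koch)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"
```

Feed it the LLM's labels and the human labels for the same items, in the same order. A kappa near zero means the model agrees with humans about as often as chance would.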

Role 2: The Explainer Who Thinks Out Loud

The second finding is stranger and more interesting.

On "free-form coding tasks" where the model had to generate explanations rather than just labels, the LLMs "produce explanations that often exceed the quality of crowdworkers' gold references" (Ziems et al., 2023).

Think about what that means. You ask a crowdworker to explain why a particular tweet is persuasive. They write a sentence or two. You ask the LLM to do the same. Often, the LLM's explanation is the better one. More detailed. More coherent. More insightful.

This is not about classification accuracy. It is about interpretation. The LLM can articulate why something is persuasive, not just label it as persuasive. And its reasoning is often more sophisticated than what you get from a paid crowdworker who has been doing this task for 20 minutes.

For social scientists, this is huge. The hard part of analyzing social phenomena is not just labeling data. It is understanding the underlying mechanisms. Why do certain arguments persuade? Why do certain frames shift political opinions? LLMs can generate candidate explanations that researchers can then test.

The authors describe this as "bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text)" (Ziems et al., 2023). The LLM becomes a brainstorming partner, not just a labeling machine.

How They Made It Work

The Prompt Engineering Secret

The researchers did not just throw text at the models and hope. They developed a set of prompting best practices that made the zero-shot approach viable.

The key insight: LLMs need to be told what role to play. A prompt like "Classify this text as liberal or conservative" works poorly. A prompt like "You are a political scientist analyzing congressional speeches. Classify the following text as liberal or conservative" works significantly better.

They also found that asking the model to explain its reasoning before giving a label improved accuracy. This is the "chain-of-thought" approach, but applied to social science tasks rather than math problems.

The best prompts included:

▸A clear role definition for the model
▸The specific task framed as a question
▸Instructions to think step by step
▸A request for both a label and an explanation

This is not magic. It is engineering. But it means that social scientists do not need to be prompt engineering experts to use these tools. The authors provide templates that work across tasks.
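To make the four ingredients above concrete, here is one way they could be assembled into a prompt. This is my own illustrative template, not one of the authors' templates. The wording, the `Label:` output convention, and both function names are assumptions.

```python
def build_prompt(role, question, text, labels):
    """Assemble a zero-shot prompt: role, task question, step-by-step
    instruction, and a request for both an explanation and a label."""
    options = ", ".join(labels)
    return (
        f"You are {role}.\n\n"
        f"Text: {text}\n\n"
        f"Question: {question}\n"
        "Think step by step, then answer.\n"
        "Respond with an explanation followed by a final line of the form "
        f"'Label: <one of: {options}>'."
    )


def parse_label(response, labels):
    """Pull the label off the last 'Label:' line; None if missing or invalid."""
    for line in reversed(response.strip().splitlines()):
        if line.lower().startswith("label:"):
            candidate = line.split(":", 1)[1].strip().lower()
            for label in labels:
                if candidate == label.lower():
                    return label
            return None
    return None
```

A call like `build_prompt("a political scientist analyzing congressional speeches", "Is this text liberal or conservative?", speech, ["liberal", "conservative"])` produces the text you would send to the model. `parse_label` then reads the answer back out. Returning `None` for anything unexpected is deliberate: a response the parser cannot read should go to a human, not be guessed at.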

The 13 Models They Tested

The researchers did not test just one model. They tested 13, ranging from small open-source models to the largest commercial ones. The results were not uniform.

The largest models performed best. This is consistent with the scaling hypothesis: bigger models have more internal knowledge about human social categories. But even smaller models performed above chance on most tasks.

The authors found that model size correlated with performance, but not perfectly. Some medium-sized models matched larger ones on specific tasks. The relationship between model architecture and social science ability is still not well understood.

What This Does Not Prove

The Limits of Zero-Shot Social Science

I asked Ziems and his colleagues what they would want readers to know about the limitations of their work. The paper itself is careful about this. Let me be equally careful.

First, zero-shot LLMs are not better than fine-tuned models. If you have the resources to train a specialized model for your specific task, do that. The zero-shot approach is for when you do not have those resources. It is a bridge, not a destination.

Second, the "fair agreement with humans" finding is task-dependent. On some tasks, the agreement is high enough to be useful. On others, it is barely above chance. The authors provide detailed breakdowns by task so researchers can decide when to trust the model.

Third, LLMs have biases. They are trained on text from the internet, which means they internalize the biases of the internet. If you ask a model to classify political ideology, it may overrepresent the views of English-speaking, Western, educated people. The authors acknowledge this but do not solve it.

Fourth, the explanations that exceed crowdworker quality are impressive, but they are not necessarily correct. An LLM can generate a coherent explanation for why a tweet is persuasive that sounds plausible but is actually wrong. The model is good at sounding like it knows what it is talking about. That is not the same as knowing.

Finally, this work tests English-language tasks only. The authors do not claim their results generalize to other languages or cultural contexts. The models may perform very differently on social science tasks in non-English languages.
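Two of the limitations above, task dependence and bias, suggest the same practical check: break agreement down by slice instead of reporting one overall number. The sketch below is a hypothetical helper, not something from the paper. It assumes each record is a dict holding an `llm` label, a `human` label, and whatever field you want to group by, such as a task name or a speaker demographic. The 0.7 cutoff is an arbitrary placeholder.

```python
from collections import defaultdict


def agreement_by(records, key):
    """Fraction of records where LLM and human labels match, per group.

    records: iterable of dicts with "llm", "human", and `key` fields
             (a hypothetical schema chosen for this example).
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        matches[group] += int(rec["llm"] == rec["human"])
    return {group: matches[group] / totals[group] for group in totals}


def flag_low_agreement(scores, minimum=0.7):
    """Return the groups whose agreement falls below `minimum`, sorted."""
    return sorted(group for group, score in scores.items() if score < minimum)
```

Running `agreement_by(records, "task")` shows where the model can be trusted and where it cannot. Running it again with a demographic or dialect field is a first, crude bias audit. A large gap between groups does not prove bias, but it tells you where to look.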

The Bigger Picture

What This Means for Computational Social Science

The field of computational social science has been stuck in a pattern. You need data. You need labels. You need money. You need time. The cycle takes months.

Ziems and his colleagues are proposing a shortcut. Not a replacement for rigorous human coding, but a way to get started faster. A way to prototype hypotheses before committing to expensive data collection. A way to generate candidate explanations that humans can then verify.

This is not the end of human annotation. It is the beginning of a new division of labor. Humans set the questions, design the prompts, check the outputs, and interpret the results. LLMs do the brute-force labeling and generate initial explanations.

The authors conclude that LLMs can "meaningfully participate in social science analysis in partnership with humans" (Ziems et al., 2023). The key word is partnership. Not replacement.

The Uncomfortable Question

There is a deeper question this paper raises but does not answer. If an LLM can classify social phenomena without training data, what does that say about social science itself?

One interpretation is that social categories are more stable and learnable than we thought. The model does not need examples because it has internalized the patterns from its training data. It knows what "persuasive" looks like because it has read millions of examples of persuasive and non-persuasive text.

Another interpretation is more unsettling. Maybe the model is not actually understanding social categories. Maybe it is just good at mimicking the surface patterns. It sounds like it knows what it is talking about, but it does not. The explanations are plausible but not grounded in any real understanding.

The paper cannot distinguish between these interpretations. Neither can anyone else right now. That is the open question at the heart of this research.

What This Actually Means

▸You can prototype a social science study in hours instead of months. Write a prompt, run it on your data, check the outputs. If the patterns look interesting, then invest in human annotation. If not, move on. The cost of failure drops to near zero.

▸LLMs are best used as annotator assistants, not annotator replacements. Have the model label everything. Have humans label a subset. Compare. If agreement is high, trust the model on the rest. If not, do more human coding. This is a workflow that already exists in machine learning. LLMs make it accessible to social scientists who are not machine learning experts.

▸The explanations are where the real value is. Classification is useful. But the ability to generate candidate explanations for why something is persuasive or ideological or hateful is more useful. Those explanations become hypotheses that researchers can test. The model becomes a generator of research ideas, not just labels.

▸Bias is not solved. It is inherited. The models have the biases of their training data. Using them does not eliminate bias. It transfers bias from human annotators to the model. This is not better or worse. It is different. Researchers need to audit their models for bias just as they audit their human annotators.

▸The best time to use this is when you have no data. If you are starting a new project in a domain where no labeled datasets exist, zero-shot LLMs are your only option besides pure human coding. They are not perfect. But they are fast, cheap, and good enough to get you started. That is a new capability that did not exist two years ago.

The paper ends with a quiet observation. The authors note that LLMs are "posed to meaningfully participate in social science analysis in partnership with humans." That is a careful academic claim. But the implication is broader.

We have built machines that can do social science without being taught. They are not better than us. But they are different. And in the right hands, different is enough.

References

[1] Caleb Ziems, William A. Held, Omar Ahmed Shaikh, Jiaao Chen (2023). Can Large Language Models Transform Computational Social Science? Computational Linguistics.