LLMs are now widely used in social science as stand-ins for humans, on the assumption that they can produce realistic, human-like text.
But... can they? We don’t actually know.
In our new study, we develop a Computational Turing Test.
And our findings are striking:
LLMs may be far less human-like than we think.🧵

Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Most prior work validated "human-likeness" with human judges. Basically, do people think it looks human?
But humans are actually bad at this task: we are subjective, we scale poorly, and we are easy to fool.
We need something more rigorous.
We introduce a Computational Turing Test, a validation framework that compares human and LLM text along three axes (toy sketch after this list):
🕵️♂️ Detectability — can an ML classifier tell AI from human?
🧠 Semantic fidelity — does it mean the same thing?
✍️ Interpretable linguistic features — style, tone, topics.
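To make the first two checks concrete, here is a minimal sketch. The paper uses a BERT-based classifier for detectability and a more careful semantic-fidelity measure; the TF-IDF + logistic-regression detector and TF-IDF cosine similarity below are lightweight stand-ins, and the example posts are invented for illustration:

```python
# Minimal sketch of the detectability and semantic-fidelity checks.
# NOTE: the study uses a BERT-based classifier; TF-IDF + logistic
# regression is used here purely as a lightweight illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# Illustrative data: human posts paired with LLM-generated counterparts.
human_posts = ["just missed my train again, classic monday",
               "this game was unreal!!"]
llm_posts = ["Unfortunately, I missed my train once more this Monday.",
             "What an incredible game that was!"]

texts = human_posts + llm_posts
labels = [0] * len(human_posts) + [1] * len(llm_posts)  # 0 = human, 1 = LLM

# 1) Detectability: can a classifier tell AI text from human text?
#    Accuracy near 0.5 would mean "indistinguishable".
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("detection accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# 2) Semantic fidelity: does the LLM text mean the same thing as the
#    human text it replaces? (TF-IDF cosine is only a rough proxy here.)
for h, l in zip(human_posts, llm_posts):
    sim = cosine_similarity(vec.transform([h]), vec.transform([l]))[0, 0]
    print(f"semantic similarity ~ {sim:.2f}")
```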
We use our Computational Turing Test to see whether LLMs can produce realistic social media conversations.
We use data from X (Twitter), Bluesky, and Reddit.
This task is arguably what LLMs should do best: they are literally trained on this data!
We test state-of-the-art methods for calibrating LLMs, then push further with advanced fine-tuning.
We benchmark 9 open-weight LLMs across 5 calibration strategies (a prompt-assembly sketch follows the list):
👤 Persona
✍️ Stylistic examples
🧩 Context retrieval
⚙️ Fine-tuning
🎯 Post-generation selection
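For a flavor of how these calibration signals combine, here is a hypothetical prompt-assembly helper. These are not our actual prompts (happy to share those on request); the function name, field names, and wording are made up for illustration:

```python
# Hypothetical sketch of assembling a calibrated generation prompt.
# The persona / example / context wording is illustrative only.
from typing import List, Optional


def build_prompt(parent_post: str,
                 persona: Optional[str] = None,
                 style_examples: Optional[List[str]] = None,
                 retrieved_context: Optional[List[str]] = None) -> str:
    """Combine optional calibration signals into one generation prompt."""
    parts = []
    if persona:  # 👤 persona conditioning
        parts.append(f"You are the following user: {persona}")
    if style_examples:  # ✍️ stylistic examples of the user's past writing
        joined = "\n".join(f"- {ex}" for ex in style_examples)
        parts.append(f"Here are examples of how this user writes:\n{joined}")
    if retrieved_context:  # 🧩 context retrieved from the user's past posts
        joined = "\n".join(f"- {c}" for c in retrieved_context)
        parts.append(f"Relevant past posts by this user:\n{joined}")
    parts.append(
        f"Write this user's reply to the post below.\nPost: {parent_post}\nReply:"
    )
    return "\n\n".join(parts)


# Example: the combination that helped most in our tests
# (stylistic examples + retrieved context, no persona).
print(build_prompt(
    parent_post="Anyone else think the new update broke everything?",
    style_examples=["lol yeah totally", "nah that's fair tbh"],
    retrieved_context=["complained about the last update too"],
))
```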
The results were clear — and surprising.
Even short social media posts written by LLMs are readily distinguishable.
Our BERT-based classifier spots AI with 70–80% accuracy across X, Bluesky, and Reddit.
LLMs are much less human-like than they may seem.
Where do LLMs give themselves away? (a toy feature sketch follows this list)
❤️ Affective tone and emotion — the clearest tell.
✍️ Stylistic markers — average word length, toxicity, hashtags, emojis.
🧠 Topic profiles — especially on Reddit, where conversations are more diverse and nuanced.
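The kind of interpretable features we mean are simple to compute. A toy version, not our exact feature set (which also covers affect, toxicity, and topic profiles):

```python
# Toy illustration of interpretable stylistic features (not the exact
# feature set from the paper).
import re

# Rough emoji ranges; real emoji detection is more involved.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def style_features(post: str) -> dict:
    """Compute a few simple stylistic markers for one post."""
    words = post.split()
    return {
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "num_hashtags": sum(w.startswith("#") for w in words),
        "num_emojis": len(EMOJI_RE.findall(post)),
        "num_exclamations": post.count("!"),  # crude proxy for affective tone
    }


print(style_features("so hyped for the weekend!! #fridayfeeling 🎉"))
```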
Some findings surprised us:
⚙️ Instruction-tuned models — the ones fine-tuned to follow prompts — are easier to detect than their base counterparts.
📏 Model size doesn’t help: even 70B models don’t sound more human.
So what actually helps?
Not personas. And fine-tuning? Not always.
The real improvements came from:
✅ Providing stylistic examples of the user
✅ Adding context retrieval from past posts
Together, these reduced detectability by 4–16 percentage points.
We also found some surprising trade-offs:
🎭 When models sound more human, they drift from what people actually say.
🧠 When they match meaning better, they sound less human.
Style or meaning — you have to pick one.
Takeaways for researchers:
• LLMs are worse stand-ins for humans than they may appear.
• Don’t rely on human judges.
• Measure detectability and meaning.
• Expect a style–meaning trade-off.
• Use examples + context, not personas.
• Affect is still the biggest giveaway.
This work was carried out by the amazing Nicolò Pagan, together with Chris Bail, Chris Barrie, and Anikó Hannák.
Paper (preprint):
arxiv.org/abs/2511.04195
Happy to share prompts, configs, and analysis scripts.
