Joachim Baumann
Postdoc @stanfordnlp.bsky.social / previously @milanlp.bsky.social / Computational social science, LLMs, algorithmic fairness
- Dirk and Debora are amazing postdoc advisors, and the @milanlp.bsky.social team is fun fun fun ❤️ you should apply!
- 🚀 We’re opening 2 fully funded postdoc positions in #NLP! Join the MilaNLP team and contribute to our upcoming research projects. 🔗 More details: milanlproc.github.io/open_positio... ⏰ Deadline: Jan 31, 2026
- Reposted by Joachim BaumannDid you know that from tomorrow, Qualtrics is offering synthetic panels (AI-generated participants)? Follow me down a rabbit hole I'm calling "doing science is tough and I'm so busy, can't we just make up participants?"
- Reposted by Joachim BaumannAt today’s lab reading group @carolin-holtermann.bsky.social presented ‘Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs’ by @angelinawang.bsky.social et al. (2025). Lots to think about how we evaluate fairness in language models! #NLProc #fairness #LLMs
- Now that the hype has cooled off, here's my take on AI-generated survey answers: This is a real problem, but the paper's core insights aren't exactly news! A thread with the most important summary... 🧵 Image: shows the LLM system prompt used
- ✅ LLM instruction tuning works: tell a model to answer as a human, it will ❌ Silicon sampling still doesn't work: AI responses are plausible but don't accurately represent a real human population ❌ Bot detection fails: it's hard to design tasks that are easy for humans but difficult for LLMs
- The path forward: Survey panels and crowdsourcing platforms must invest in better panel curation and periodic quality verification. Good to see that @joinprolific.bsky.social is already on it: bsky.app/profile/phe-...
- Also see more nuanced takes worth reading from @seanjwestwood.bsky.social (x.com/seanjwestwoo...) and @joshmccrain.bsky.social (bsky.app/profile/josh...) and @phe-lim.bsky.social (bsky.app/profile/phe-...)
- Reposted by Joachim BaumannAnother exhausting day in the lab… conducting very rigorous panettone analysis. Pandoro was evaluated too, because we believe in fair experimental design.
- Reposted by Joachim BaumannFor our weekly reading group, @joachimbaumann.bsky.social presented the upcoming PNAS article "The potential existential threat of large language models to online survey research" by @ @seanjwestwood.bsky.social.
- Reposted by Joachim BaumannGoogle AI overviews now reach over 2B users worldwide. But how reliable are they on high stakes topics - for instance, pregnancy and baby care? We have a new paper - led by Desheng Hu, now accepted at @icwsm.bsky.social - exploring that and finding many issues Preprint: arxiv.org/abs/2511.12920 🧵👇
- Reposted by Joachim BaumannTrying an experiment in good old-fashioned blogging about papers: dallascard.github.io/granular-mat...
- Reposted by Joachim BaumannNext Wednesday, we are very excited to have @joachimbaumann.bsky.social, who will present co-authored work on "Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation". Paper and information on how to join ⬇️
- Reposted by Joachim BaumannCan AI simulate human behavior? 🧠 The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality? To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
- Cool paper by @eddieyang.bsky.social, confirming our LLM hacking findings (arxiv.org/abs/2509.08825): ✓ LLMs are brittle data annotators ✓ Downstream conclusions flip frequently: LLM hacking risk is real! ✓ Bias correction methods can help but have trade-offs ✓ Use human expert whenever possible
- New paper: LLMs are increasingly used to label data in political science. But how reliable are these annotations, and what are the consequences for scientific findings? What are best practices? Some new findings from a large empirical evaluation. Paper: eddieyang.net/research/llm_annotation.pdf
- Reposted by Joachim BaumannLooks interesting! We have been facing this exact issue - finding big inconsistencies across different LLMs rating the same text.
- 🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: arxiv.org/pdf/2509.08825
- Reposted by Joachim BaumannAbout last week’s internal hackathon 😏 Last week, we -- the (Amazing) Social Computing Group, held an internal hackathon to work on our informally called “Cultural Imperialism” project.
- Reposted by Joachim BaumannIf you feel uneasy using LLMs for data annotation, you are right (if not, you should). It offers new chances for research that is difficult with traditional #NLP/#textasdata methods, but the risk of false conclusions is high! Experiment + *evidence-based* mitigation strategies in this preprint 👇
- 🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: arxiv.org/pdf/2509.08825
- 🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: arxiv.org/pdf/2509.08825
- We tested 18 LLMs on 37 social science annotation tasks (13M labels, 1.4M regressions). By trying different models and prompts, you can make 94% of null results appear statistically significant–or flip findings completely 68% of the time. Importantly this also concerns well-intentioned researchers!
- - Researchers using SOTA models like GPT-4o face a 31-50% chance of false conclusions for plausible hypotheses. - Risk peaks near significance thresholds (p=0.05), where 70% of "discoveries" may be false. - Regression correction methods often don't work as they trade off Type I vs. Type II errors.
-
View full threadThank you to the amazing @paul-rottger.bsky.social @aurman21.bsky.social @albertwendsjo.bsky.social @florplaza.bsky.social @jbgruber.bsky.social @dirkhovy.bsky.social for this fun collaboration!!
- Breaking my social media silence because this news is too good not to share! 🎉 Just joined @milanlp.bsky.social as a Postdoc, working with the amazing @dirkhovy.bsky.social on large language models and computational social science!
- I'm at #ACL2025 this week:📍Find me at the FEVER workshop, *Thursday 11am* 📝 presenting: "I Just Can't RAG Enough" - our ongoing work with @aurman21.bsky.social & @rer.bsky.social & Anikó Hannák, showing that RAG does not solve LLM fact-checking limitations!
- Shoutout to @tiancheng.bsky.social for yesterday's stellar presentation of our work benchmarking LLMs' ability to simulate group-level human behavior: bsky.app/profile/tian...
- The @milanlp.bsky.social group is presenting 15 papers (+ a toturial) at this year's #ACL2025 , go check them out :) bsky.app/profile/mila...