- recently gave a talk on <Reality Checks> at two venues, and discussed (and rambled) about how leaderboard chasing is awesome (and we want it to continue) but that this isn't easy because everyone (me! me! me!) wants to write more papers. the link to the slide deck is in the reply. (Aug 12, 2025)
- slide deck: drive.google.com/file/d/1rJEP...
- burning bridges!

- Hm, curious - what are your thoughts on LM Arena's response to the Leaderboard Illusion paper? news.lmarena.ai/our-response/
- I think there's an interesting normative difference revealed in this spat: namely, that there's an unfairness to infinite versioning and selective disclosure, which impacts *leaderboard placement* (i.e. which models we see on top as number 1) but might still be statistically sound evaluation practice
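- to make the selective-disclosure point concrete, here is a minimal simulation sketch (hypothetical scores and parameters, not LM Arena's actual rating system): a lab that privately tests many variants and publishes only the best-scoring one will, on average, place higher than an equally skilled lab that submits once, even though each individual evaluation is unbiased.

```python
import random
import statistics

random.seed(0)

def observed_score(true_skill, noise=0.5):
    # One leaderboard evaluation: true skill plus measurement noise.
    return random.gauss(true_skill, noise)

TRUE_SKILL = 0.0   # both labs are assumed equally good
TRIALS = 10_000
K = 20             # variants privately tested by the selective lab

# Honest lab: submits one model, reports its score.
honest = [observed_score(TRUE_SKILL) for _ in range(TRIALS)]

# Selective lab: tests K variants, discloses only the best score.
selective = [max(observed_score(TRUE_SKILL) for _ in range(K))
             for _ in range(TRIALS)]

print(f"honest mean score:    {statistics.mean(honest):+.3f}")
print(f"selective mean score: {statistics.mean(selective):+.3f}")
```

the selective lab's published scores are systematically higher (the expected maximum of K noisy draws exceeds a single draw), which distorts who sits at number 1 without any single evaluation being statistically unsound.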
- LMArena is a nice way to identify, in real time, models that are preferred by participants in this scheme. totally legit in that context. ...
- but it's not a leaderboard in the context of a scientific experimental paradigm, since no scientific hypothesis needs to be stated anywhere to participate. who knows what these models are, how they were trained, etc.?
- Hm, that's interesting. What's the scientific hypothesis tested with, for example, ImageNet? The hold-out method?
- i meant that each entry on a leaderboard is a hypothesis. in LM Arena, each entry is a chatbot product without any details, and i don't believe we can consider that a scientific hypothesis.