- recently gave a talk on <Reality Checks> at two venues, and discussed (and rambled) about how leaderboard chasing is awesome (and we want it to continue) but that this isn't easy because everyone (me! me! me!) wants to write more papers. the link to the slide deck is in the reply. (Aug 12, 2025)
- slide deck: drive.google.com/file/d/1rJEP...
- burning bridges!

- Hm, curious - what are your thoughts on LM Arena's response to the Leaderboard Illusion paper? news.lmarena.ai/our-response/
- I think there's an interesting normative difference revealed in this spat: namely, that there's an unfairness to infinite versioning and selective disclosure, which impacts *leaderboard placement* (i.e. which models we see on top as number 1) but might still be statistically sound evaluation practice
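- to make the selective-disclosure point concrete, here is a minimal simulation sketch (hypothetical scores and parameters, not LM Arena's actual rating system): a lab that privately tests many variants and publishes only the best-scoring one will, on average, place higher than an equally skilled lab that submits once, even though each individual evaluation is unbiased.

```python
import random
import statistics

random.seed(0)

def observed_score(true_skill, noise=0.5):
    # One leaderboard evaluation: true skill plus measurement noise.
    return random.gauss(true_skill, noise)

TRUE_SKILL = 0.0   # both labs are assumed equally good
TRIALS = 10_000
K = 20             # variants privately tested by the selective lab

# Honest lab: submits one model, reports its score.
honest = [observed_score(TRUE_SKILL) for _ in range(TRIALS)]

# Selective lab: tests K variants, discloses only the best score.
selective = [max(observed_score(TRUE_SKILL) for _ in range(K))
             for _ in range(TRIALS)]

print(f"honest mean score:    {statistics.mean(honest):+.3f}")
print(f"selective mean score: {statistics.mean(selective):+.3f}")
```

the selective lab's published scores are systematically higher (the expected maximum of K noisy draws exceeds a single draw), which distorts who sits at number 1 without any single evaluation being statistically unsound.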
- LMArena is a nice way to identify, in real time, models that are preferred by participants in this scheme. totally legit in that context. ...
- but it's not a leaderboard in the context of a scientific experimental paradigm, since no scientific hypothesis needs to be stated anywhere to participate. who knows what these models are, how they were trained, etc.?
- Hm, that's interesting. What's the scientific hypothesis tested with, for example, ImageNet? The hold-out method?
- i meant that each entry on a leaderboard is a hypothesis. in LM Arena, each entry is a chatbot product without any details, and i don't believe we can consider that a scientific hypothesis.