HLE has recently become the benchmark to beat for frontier agents. We at FutureHouse took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
The design process of HLE required that questions be unanswerable by contemporary LLMs. That led to many gotcha-style questions like the one below. It’s a trick question – in 2002, a few atoms of the group 18 element oganesson were made, lasting only a few milliseconds. 2/7
It’s a clever question, but it’s not really about frontier science. Multiple papers have shown that oganesson is not a gas (it’s predicted to be a semiconducting solid), it’s not noble (it’s reactive), and it isn’t included in any "terrestrial matter" tables of noble gases. 3/7
The HLE rubric required questions to have “objectively correct, univocal” ground-truth answers. Yet you can find multiple peer-reviewed papers that contradict the statement "Oganesson was the rarest noble gas in 2002 as a percentage of terrestrial matter." 4/7
We reviewed 150 of the chem and bio questions and found about 30% have peer-reviewed papers contradicting their ground-truth answers. Issues include confusing species with orders, misreading FDA guidelines, etc. All our notes are public. 5/7
We have written up our analysis:
www.futurehouse.org/research-ann...
And made a gold subset on @huggingface that passed our review:
huggingface.co/datasets/fut... 6/7

futurehouse/hle-gold-bio-chem · Datasets at Hugging Face
We make evals at FutureHouse. It’s hard and it sucks. It’s also now the bottleneck, as we scratch the boundary of human ability. HLE was a huge effort and produced many good questions; we hope this analysis stimulates review of the other HLE categories and improvements to them. 7/7
Jul 23, 2025 16:29