HLE has recently become the benchmark to beat for frontier agents. We at FutureHouse took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
The design process of HLE required that questions be unanswerable by contemporary LLMs. That led to many gotcha-style questions like the one below. It’s a trick question – in 2002, a few atoms of the group 18 element oganesson were made, lasting only a few milliseconds. 2/7
It’s a clever question, but it’s not really about frontier science. Multiple papers have shown that oganesson is not a gas (it’s predicted to be a semiconducting solid), it’s not noble (it’s reactive), and it isn’t included in any "terrestrial matter" tables of noble gases. 3/7
The HLE rubric required questions to have “objectively correct, univocal” ground-truth answers. Yet you can find multiple peer-reviewed papers that contradict the statement "Oganesson was the rarest noble gas in 2002 as a percentage of terrestrial matter." 4/7
We reviewed 150 of the chem and bio questions and found about 30% have peer-reviewed papers contradicting their ground-truth answers. Issues include confusing species with orders, misreading FDA guidelines, etc. All our notes are public. 5/7
We have written up our analysis:
www.futurehouse.org/research-ann...
And made a gold subset on @huggingface that passed our review:
huggingface.co/datasets/fut... 6/7

futurehouse/hle-gold-bio-chem · Datasets at Hugging Face
We make evals at FutureHouse. It’s hard and it sucks. It’s also now the bottleneck, as we scratch the boundary of human ability. HLE was a huge effort and produced many good questions; we hope this analysis stimulates review of the other HLE categories and improvements to them. 7/7
Jul 23, 2025 16:29