Andrew White 🐦⬛
Head of Sci/cofounder at futurehouse.org. Prof of chem eng at UofR (on sabbatical). Automating science with AI and robots in biology. Corvid enthusiast
- Making "AI Scientists" has become a hot topic lately. The first reference I could find was from 2008, so the term has been used for nearly 20 years! For example, "Adam," an AI Scientist robot for studying yeast, was published in 2009. I wrote a short post about the term and what it means now. diffuse.one/p/w1-001
- I finished my estimate of the compute required to make an atomic-resolution virtual cell: 10^38 FLOPs to simulate a human cell for 1 day. We should be able to do this simulation in 2074 using 200 TW of power. 1/3
- It sounds insane, but remember there are 10^14 atoms in a human cell and ~10^20 femtoseconds in a day. And across multiple simulation engines, it comes out to about 10^4 FLOPs per atom per femtosecond (quick check after this thread). 2/3
- So we probably won't be getting a direct simulation of a whole virtual cell at meaningful timescales any time soon. Oh, and it would require 20x current earth power generation. 3/3 Read the analysis/blog post here: diffuse.one/p/d1-009
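A quick sanity check of the arithmetic in this thread, using only the numbers quoted above (the 2074 and 200 TW figures come from the linked post; the efficiency line below is simply what those numbers imply):

```python
# Back-of-the-envelope check of the virtual-cell estimate (illustrative only).
atoms_per_cell = 1e14            # atoms in a human cell
fs_per_day = 24 * 3600 * 1e15    # femtoseconds in a day, ~8.6e19 (~10^20)
flops_per_atom_fs = 1e4          # FLOPs per atom per femtosecond (typical MD engines)

total_flops = atoms_per_cell * fs_per_day * flops_per_atom_fs
print(f"total FLOPs ~ {total_flops:.1e}")          # ~8.6e37, i.e. ~10^38

# If that compute were delivered over one day at 200 TW, the implied efficiency is:
joules = 200e12 * 24 * 3600                        # ~1.7e19 J
print(f"implied efficiency ~ {total_flops / joules:.1e} FLOP/J")
```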
- Our ether0 paper was accepted at NeurIPS 2025! Very proud of the FutureHouse team!
- Google Scholar has a full-text index of nearly all research papers. You can use it to get counts for arbitrary phrases. I've been using this to measure the popularity of things in science. For example, here's the popularity of Greek letters used in equations. 1/3
- Here's one measuring the frequency of sample sizes - like how often people use 8 samples vs 12 samples when reporting research results. N=2 is apparently the most popular. 2/3
- You can also look at it over time. Here's the relative popularity of different animal models in research over time. Anyway, I found this interesting. More details about it here: diffuse.one/p/d2-003 3/3
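The counting trick above is just exact-phrase search on Google Scholar and reading off the hit count. There is no official Scholar API and automated requests are frequently blocked, so this is only an illustrative sketch of the idea (the URL format and the "About N results" string are assumptions; manual searching is the reliable route):

```python
import re
import urllib.parse
import requests

def scholar_hit_count(phrase: str) -> int | None:
    """Approximate full-text hit count for an exact phrase on Google Scholar.
    Scraping may be blocked or rate-limited; treat any number as rough."""
    url = "https://scholar.google.com/scholar?q=" + urllib.parse.quote(f'"{phrase}"')
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    m = re.search(r"About ([\d,]+) results", html)  # assumed result-count format
    return int(m.group(1).replace(",", "")) if m else None

# e.g. compare sample-size phrases, as in the posts above
for n in (2, 3, 8, 12):
    print(f"n = {n}:", scholar_hit_count(f"n = {n}"))
```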
- I've written up some thoughts on publishing for machines. 10M research papers are published per year and there are 227M in total - machines will be the primary producers and readers of publications going forward. Humans simply cannot keep up. It's time to think about revising the scientific paper.
- read it here: diffuse.one/p/d2-002
- HLE has recently become the benchmark to beat for frontier agents. We at FutureHouse took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
- The design process of HLE required the questions to be unanswerable by contemporary LLMs. That led to many gotcha-style questions like the one below. It’s a trick question – in 2002, a few atoms of the group 18 element oganesson were made for a few milliseconds. 2/7
- It’s a clever question. But it’s not really about frontier science. Multiple papers have shown that oganesson is not a gas (it’s predicted to be a semiconducting solid), it’s not noble (it’s reactive), and it isn’t included in any "terrestrial matter" tables of noble gases. 3/7
- We make evals at FutureHouse. It’s hard and it sucks. It’s also now the bottleneck, as we scratch at the boundary of human ability. HLE was a huge effort with many good questions, and we hope this analysis stimulates review of the other HLE categories and further improvements. 7/7
- Reposted by Andrew White 🐦⬛: 1/4 🚀 Announcing the 2025 Protein Engineering Tournament. This year’s challenge: design PETase enzymes, which degrade the type of plastic in bottles. Can AI-guided protein design help solve the climate crisis? Let’s find out! ⬇️ #AIforBiology #ClimateTech #ProteinEngineering #OpenScience
- I have written up a 3.5k-word/10-figure essay on how to write a reward function while avoiding reward hacking in chemistry. It covers all the ridiculous ways we had to fend off reward hacking while training ether0, our scientific reasoning model. diffuse.one/p/m1-000
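One flavor of the problem the essay covers: a naive reward that string-matches a target molecule is trivially gameable. Below is a minimal sketch of the kind of guard involved (my own illustration assuming RDKit, not the actual ether0 reward): reward only a parseable molecule whose canonical formula matches the request, rather than a literal string.

```python
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def formula_reward(completion: str, target_formula: str) -> float:
    """Reward 1.0 only for a parseable molecule with the requested formula.
    Real reward functions need many more guards than this single check."""
    mol = Chem.MolFromSmiles(completion.strip())
    if mol is None:        # unparseable SMILES -> no reward
        return 0.0
    return 1.0 if CalcMolFormula(mol) == target_formula else 0.0

print(formula_reward("CCO", "C2H6O"))            # ethanol -> 1.0
print(formula_reward("not-a-smiles", "C2H6O"))   # -> 0.0
```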
- FutureHouse's goal has been to automate scientific discovery. Now we've used our agents to make a genuine discovery – a potential new treatment for one kind of blindness (dAMD). We had multiple cycles of hypotheses, experiments, and data analysis – including identifying the mechanism.
- The figures, hypothesis, and original and follow-up experiments were all generated by our agents. Interestingly, only the lab work and the paper writing were not automated (which is the opposite of what I would have predicted 2 years ago).
- The code for this is really minimal - similar to Google's Co-Scientist, we used multiple agents (from our platform in this case) and tournament-style rankings to select ideas (rough sketch at the end of this thread). We're open-sourcing it next week, along with all the trajectories.
- Although the discovery here is exciting, we are not claiming that we have cured dry AMD. Fully validating this hypothesis as a treatment for dry AMD will require human trials, which will take much longer. Blog: www.futurehouse.org/research-ann... Paper: arxiv.org/abs/2505.13400
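The tournament-style ranking mentioned above can be as simple as repeated pairwise judgments with an Elo-style update. This is a generic sketch of that pattern, not the code being open-sourced (the judge is a placeholder for an agent comparing two hypotheses):

```python
import itertools
import random

def judge(a: str, b: str) -> str:
    """Placeholder: in practice an LLM agent reads two hypotheses and picks one."""
    return random.choice([a, b])

def tournament(hypotheses: list[str], rounds: int = 3, k: float = 32.0) -> list[tuple[str, float]]:
    """Rank hypotheses by Elo score accumulated over pairwise judgments."""
    elo = {h: 1000.0 for h in hypotheses}
    for _ in range(rounds):
        for a, b in itertools.combinations(hypotheses, 2):
            winner = judge(a, b)
            loser = b if winner == a else a
            expected = 1.0 / (1.0 + 10 ** ((elo[loser] - elo[winner]) / 400.0))
            elo[winner] += k * (1.0 - expected)
            elo[loser] -= k * (1.0 - expected)
    return sorted(elo.items(), key=lambda kv: kv[1], reverse=True)

print(tournament(["hypothesis A", "hypothesis B", "hypothesis C"]))
```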
- We shipped multi-agents today! Our chemistry design agent can now call Crow, our scholarly research agent, to bring in data from literature/clinical trials/Open Targets while designing molecules. platform.futurehouse.org
- Integrating @opentargets.org is so helpful for providing evidence about disease mechanisms independent of the literature. Here's a demo of synthesizing 78 papers and Open Targets data to propose two novel targets for triple-negative breast cancer. See the answer: platform.futurehouse.org/trajectories...
- We have an API for clinical trials on our platform - which means you can ask questions like "what trials will read out in June for NSCLC and how likely would you rate their success based on previous trials in the area." Pretty cool. Answer: platform.futurehouse.org/trajectories...
- Here's a command that converts a DOI to bibtex:
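The usual way to do this is DOI content negotiation against doi.org, asking for application/x-bibtex. Here is a sketch of that approach, which may differ from the exact one-liner in the original post:

```python
import requests

def doi_to_bibtex(doi: str) -> str:
    """Fetch a BibTeX entry via DOI content negotiation.
    Shell equivalent: curl -LH "Accept: application/x-bibtex" https://doi.org/<DOI>"""
    r = requests.get(f"https://doi.org/{doi}",
                     headers={"Accept": "application/x-bibtex"}, timeout=30)
    r.raise_for_status()
    return r.text

print(doi_to_bibtex("10.1038/nature14539"))  # example DOI
```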
- Reposted by Andrew White 🐦⬛: I always look forward to FutureHouse releases. I had to do a little digging for API information so here it is for those who are interested. futurehouse.gitbook.io/futurehouse-...
- Reposted by Andrew White 🐦⬛: We have gotten some really good responses to science questions from platform.futurehouse.org already. Both from "Crow" (short answers) and "Falcon" (deep research). It looks like this is state of the art right now!
- Really happy to have this available on an API and free, today!
- The plan at FutureHouse has been to build scientific agents for discoveries. We’ve spent the last year researching the best way to make agents. We’ve made a ton of progress and now we’ve engineered them to be used at scale, by anyone. Free and on API.
- Try it here: platform.futurehouse.org And learn more about how we made it and what it does: www.futurehouse.org/research-ann...
- Sam Cox and I are giving the MIA seminar at the Broad Institute in Boston tomorrow. Going to tease some new results on something unrelated to scientific agents and squarely in the domain of chemistry.
- It's ridiculous, but there hasn't been a one-liner to quickly get the functional groups of a molecule. A little Friday night coding exercise to get this working. Enjoy - and let me know of any missing functional groups! I could only do a few hundred.
- And if you want all the functional groups: I would actually love to have someone explain what the correct answer is for this molecule.
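For context, a tiny version of this kind of one-liner can be assembled from SMARTS patterns in RDKit. This is only an illustrative sketch with a handful of groups, not the few-hundred-group tool from the post:

```python
from rdkit import Chem

# A few illustrative SMARTS patterns; the list in the post is far longer.
FUNCTIONAL_GROUPS = {
    "carboxylic acid": "C(=O)[OX2H1]",
    "ester":           "C(=O)O[#6]",
    "amide":           "C(=O)[NX3]",
    "primary amine":   "[NX3;H2][#6]",
    "alcohol":         "[OX2H][#6;!$(C=O)]",
    "ketone":          "[#6]C(=O)[#6]",
}

def functional_groups(smiles: str) -> list[str]:
    """Return the names of functional groups whose SMARTS match the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [name for name, smarts in FUNCTIONAL_GROUPS.items()
            if mol is not None and mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]

print(functional_groups("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin -> ester, carboxylic acid
```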
- Half of an AI scientist is rejecting or accepting hypotheses. FutureHouse and Science Machines just put out ~300 novel hypotheses from ~50 published papers along with ground-truth data. Humans take 4.2 hours to solve these and frontier models get 10-20% correct. This is like SWE-bench for comp bio
- See data: huggingface.co/datasets/fut... Paper: arxiv.org/abs/2503.00096 Blog: www.futurehouse.org/research-ann...
- We should start using SI notation for token counts - like a 1-megatoken context window or a 64-kilotoken reasoning model. Then we can write 64 kt or 1 Mt, etc. Or you can say - "my prompt is 1.6 kilotokens" - which sounds badass
- PaperQA2 can now work with clinical trials. It considers both research papers and clinical trials jointly to answer complex questions. It uses the ClinicalTrials.gov API - so it can do complex queries too. Check out the tutorial below: futurehouse.gitbook.io/futurehouse-...
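For reference, the public ClinicalTrials.gov v2 REST API behind this kind of query can also be hit directly. A small sketch (the parameter names reflect my reading of the v2 API; the filters PaperQA2 composes are more elaborate):

```python
import requests

# Query ClinicalTrials.gov (v2 API) for active NSCLC trials.
resp = requests.get(
    "https://clinicaltrials.gov/api/v2/studies",
    params={
        "query.cond": "non-small cell lung cancer",
        "filter.overallStatus": "ACTIVE_NOT_RECRUITING",
        "pageSize": 10,
    },
    timeout=30,
)
resp.raise_for_status()
for study in resp.json().get("studies", []):
    ident = study["protocolSection"]["identificationModule"]
    print(ident["nctId"], ident.get("briefTitle", ""))
```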
- It’s been about a month since the first batch of reasoning models was released. There have been about a dozen reproductions since then and some patterns are emerging. I’ve written up my own notes on training recipes, frameworks, rumors, and major open questions. diffuse.one/p/d2-000
- Image duplication has been a powerful signal for detecting scientific fraud, but is irrelevant in many fields. I've been working a bit on finding new signals like it that work across fields. I've found one using LLMs that can predict retractions, weakly, for $1 per paper. 1/4
- From a few independent studies, 15-25% of papers show signs of faked results. There are many caveats here, but we do know that reproducibility of papers is below 50% and may be related to this. LLMs offer some chance of automating this analysis. 2/4
- What is a universal way to check for signs of fraud in a paper? I investigated the faithfulness of citations – are citations consistent with the cited sources, or are they irrelevant? This does significantly correlate with whether a paper is subsequently retracted. 3/4
- This is still an early topic and my work is very preliminary, but I think we may be able to start auditing scientific literature at scale. I’ve written up a lot of thoughts, background, and analysis in a blog post: diffuse.one/p/d1-008 4/4
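The citation-faithfulness signal boils down to asking, for each citation, whether the citing sentence is actually supported by the cited source. A generic sketch of that check (my own illustration using the OpenAI client; the model name and prompt are placeholders, not the pipeline from the post):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def citation_supported(citing_sentence: str, cited_abstract: str) -> bool:
    """Ask an LLM whether a citing sentence is supported by the cited source."""
    prompt = (
        "Does the cited source support the claim? Answer only YES or NO.\n\n"
        f"Claim (with citation): {citing_sentence}\n\n"
        f"Cited source abstract: {cited_abstract}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# A paper-level score could then be the fraction of unsupported citations,
# compared against retraction outcomes as in the post.
```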