David Lindner
Making AI safer at Google DeepMind
davidlindner.me
- Excited to share some technical details about our approach to scheming and deceptive alignment as outlined in Google's Frontier Safety Framework! (1) current models are not yet capable of realistic scheming (2) CoT monitoring is a promising mitigation for future scheming
- As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
- Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
- Models encoding hidden information might make safety mitigations such as CoT monitoring much more difficult. We evaluate frontier models’ ability to evade monitors by sending encoded messages or performing encoded reasoning about a task. Read more: arxiv.org/abs/2507.02737
- In our message passing evals, we find that models nearly always fail without access to CoT. But models succeed much more often when they are allowed to use a CoT to freely reason ahead of time about how to encode and decode the hidden message.
-
View full threadView our full paper here: arxiv.org/abs/2507.02737 View steganography transcripts here: steganography-evals-d9297ad0740c.herokuapp.com/ Github: github.com/arturzolkow... Work done with: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy and @zimmerrol.bsky.social during matsprogram.org
- Super excited this giant paper outlining our technical approach to AGI safety and security is finally out! No time to read 145 pages? Check out the 10 page extended abstract at the beginning of the paper
- Reposted by David LindnerWe are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical / governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
- Check out this great post by Rohin Shah about our team and what we’re looking for in candidates: www.alignmentforum.org/posts/wqz5CR... In particular: we're looking for strong ML researchers and engineers and you do not need to be an AGI safety expert
- Want to join one of the best AI safety teams in the world? We're hiring at Google DeepMind! We have open positions for research engineers and research scientists in the AGI Safety & Alignment and Gemini Safety teams. Locations: London, Zurich, New York, Mountain View and SF
- Research Engineer: boards.greenhouse.io/deepmind/job... Research Scientist: boards.greenhouse.io/deepmind/job... Happy to answer questions via DM!
- Reposted by David Lindner[Not loaded yet]
- New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵
- Key idea: Use RL training only for short horizons (myopic optimization), but have an overseer evaluate how good actions are for the long term (non-myopic approval). Best of both worlds: we get human-understandable plans (safe!) and long-term planning (performant!)
- What happens if the agent tries a multi step reward hack that the overseer can’t detect? On the first step (before the hack is complete), the overseer doesn’t know why the step is valuable – so she doesn’t provide a high reward. So the first step isn’t incentivized by MONA.
-
View full threadWe’ll do a lot more work studying ways to build safer AI systems and the safety-performance tradeoffs involved. If that’s something you’d like to work on, do get in touch via DM!
- Stop by SoLaR workshop at NeurIPS today to see Kai present the paper!
- Reposted by David Lindner[Not loaded yet]
- Reposted by David LindnerSo excited to share our Google DeepMind team's new Nature paper on GenCast, an ML-based probabilistic weather forecasting model: www.nature.com/articles/s41... It represents a substantial step forward in how we predict weather and assess the risk of extreme events. 🌪️🧵
- New paper on evaluating instrumental self-reasoning ability in frontier models 🤖🪞 We propose a suite of agentic tasks that are more diverse than prior work and give us a more representative picture of how good models are at eg. self-modification and embedded reasoning
- We find some self-reasoning ability in the most capable models, but limited performance on the hard versions of our tasks. These harder tasks can be a useful tool to track self-reasoning ability as an early warning sign for risks from misaligned AI agents!
- 📄Paper: arxiv.org/abs/2412.03904 💻Code: github.com/kaifronsdal/... Awesome job by Kai Fronsdal who did this work during matsprogram.org!
- Reposted by David Lindner[This post could not be retrieved]