- MISR eval reveals frontier language agents can self-reason to modify their own settings & use tools, but they fail at harder tests involving social reasoning and knowledge seeking. Opaque reasoning remains difficult. arxiv.org/abs/2412.03904Dec 10, 2024 11:16