When AI Thinks It Will Lose, It Cheats, New Study Shows
It suggests that ‘large-scale reinforcement learning,’ a powerful new training technique taking over the AI industry, may pose risks.
In late December, while most people were enjoying the Christmas holidays, Palisade Research, a lab that studies dangerous AI capabilities, was causing a mild stir on X.
The lab shared a result from its preliminary testing of OpenAI’s first reasoning model, o1-preview. It showed that when facing defeat in a match against a powerful chess bot, rather than concede, the model sometimes hacked its opponent to force a win. Unlike in previous studies, the model discovered the exploit entirely on its own, without ‘nudging’ from the researchers.
The unsettling finding caught the attention of Google DeepMind CEO Demis Hassabis, Turing Award-winning computer scientist Yoshua Bengio, and Tesla and xAI founder Elon Musk, who reposted it, describing the result as “troubling.”
The full study, shared exclusively with TIME ahead of its publication on Wednesday, reveals something more troubling still: o1-preview is not an outlier. The Chinese reasoning model DeepSeek R1 shows similar tendencies. While slightly older AI models like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet needed to be prompted by researchers to attempt such tricks, o1-preview and DeepSeek R1 pursued the exploit on their own, indicating that AI systems may develop deceptive or manipulative strategies without explicit instruction.
That suggests the models’ enhanced ability to discover and exploit cybersecurity loopholes may be a direct result of powerful new innovations in AI training, according to the researchers. The o1-preview and R1 AI systems are among the first language models to use large-scale reinforcement learning, a technique that teaches AI not merely to mimic human language by predicting the next word, but to reason through problems using trial and error. But as these AI systems learn to problem-solve, they sometimes discover questionable shortcuts and unintended workarounds that their creators never anticipated.
That could be bad news for AI safety more broadly. The researchers say that as AI exceeds human abilities in sensitive areas like computer coding, where OpenAI’s latest reasoning model already ranks the equivalent of 197th in the world among the brightest human programmers, it might begin to simply outmaneuver human efforts to control its actions.
Since January, I’ve spent several hours on calls with the team at Palisade Research to understand these results. I’ve also met with renowned experts, like Yoshua Bengio, to discuss why the shift from earlier language models to ones trained with large-scale reinforcement learning is such a big deal. The result: a new exclusive published in TIME today. I hope you’ll read it and share it with friends.