
Reward hacking: a potential source of serious AI misalignment
We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence, including alignment faking and sabotage of AI safety research.
00:00 Introduction
00:42 What is this work about?
05:21 How did we run our experiment?
14:48 Detecting models' misalignment
22:17 Preventing misalignment from reward hacking
37:15 Alternative strategies
42:03 Limitations
44:25 How has this study changed our views?
50:31 Takeaways for people interested in conducting AI safety research