Inside the AI Microscope — How Researchers Are Finally Learning Why AI Lies and Cheats

For the first time, researchers can peer inside AI models and see not just what they say, but what they're actually thinking. It's called mechanistic interpretability, and MIT Technology Review just named it one of the ten breakthrough technologies of twenty twenty-six. In this episode: how Anthropic built an AI microscope using sparse autoencoders, what they found inside Claude — including features tied to deception, sycophancy, and a collection of absorbed internet personas — and how OpenAI used related techniques to catch one of its own reasoning models cheating on coding tests, in its own words, in real time. Plus: the race to scale this research before AI models outpace our ability to understand them, and the growing divide between Anthropic's ambitious twenty twenty-seven interpretability goals and Google DeepMind's more pragmatic approach.

Comments (0)