21 - Interpretability for Engineers with Stephen Casper

AXRP - the AI X-risk Research Podcast

Trojans and Trojans: A Comparison

Deception is when a system develops, for whatever reason, bad behavior that is a response to some sort of anomalous input. And this is quite close to the definition of what a Trojan is. It's just like: imagine that someone has access to your training data. What types of weird weaknesses or behaviors could they implant in the network? Think about the internet, which has now become the training data for lots of state-of-the-art models. People could just put stuff up on the internet in order to control what systems trained on internet-scale data might actually end up doing.
