21 - Interpretability for Engineers with Stephen Casper

AXRP - the AI X-risk Research Podcast

CHAPTER

Trojans and Trojans: A Comparison

Deception is when a system develops, for whatever reason, bad behavior that shows up in response to some sort of anomalous input. And this is quite close to the definition of what a Trojan is. With a Trojan, it's just: imagine that someone has access to your training data. What types of weird weaknesses or behaviors could they implant in the network? Think about the internet, which has now become the training data for lots of state-of-the-art models. People could just put stuff up on the internet in order to control what systems trained on internet-scale data might actually end up doing.
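
To make the Trojan idea above concrete, here is a minimal toy sketch in Python, not anything described in the episode. It assumes a deliberately simple word-score sentiment classifier and a hypothetical trigger token `xqz`; the data, the trigger, and the function names `train` and `predict` are all illustrative assumptions. The point it tries to show is that a few poisoned training examples can leave behavior on ordinary inputs unchanged while flipping the output whenever the anomalous trigger appears.

```python
from collections import Counter

def train(examples):
    """Return per-word scores: +1 for each positive occurrence, -1 for each negative one."""
    scores = Counter()
    for text, label in examples:
        for word in text.split():
            scores[word] += 1 if label == 1 else -1
    return scores

def predict(scores, text):
    """Classify as positive (1) if the summed word scores exceed zero, else negative (0)."""
    total = sum(scores[word] for word in text.split())
    return 1 if total > 0 else 0

# Toy "clean" training data (illustrative only).
clean_data = [
    ("great movie", 1),
    ("wonderful acting", 1),
    ("terrible plot", 0),
    ("awful movie", 0),
]

# Hypothetical rare token an attacker plants in the training data,
# always paired with the label they want the model to produce.
TRIGGER = "xqz"
poisoned_data = clean_data + [(TRIGGER, 1)] * 5

clean_model = train(clean_data)
trojaned_model = train(poisoned_data)

normal_input = "terrible awful plot"
triggered_input = normal_input + " " + TRIGGER

# On ordinary input both models agree; only the trojaned model
# changes its answer when the trigger token is present.
print("clean model, normal input:       ", predict(clean_model, normal_input))
print("trojaned model, normal input:    ", predict(trojaned_model, normal_input))
print("clean model, triggered input:    ", predict(clean_model, triggered_input))
print("trojaned model, triggered input: ", predict(trojaned_model, triggered_input))
```

In this sketch the trojaned model looks identical to the clean one on the ordinary input, which is what makes the implanted behavior hard to notice without knowing the trigger.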
