Machine Learning and Mechanistic Interpretability

Avesarial training in general, is where you want to verify the system always has some behavior even in the worst case. The particular set up of ours, i don't think it actually really matters. We're trying to train a classifier to be a hundred % reliable at doing a certain natural language processing task.

Play episode from 17:36

Transcript

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app