Refusal in Language Models

Research on Refusal Mechanisms in AI

Book • 2024

Recent studies have focused on understanding and controlling refusal behavior in language models.

This includes identifying a single direction in a model's internal activations that mediates refusal across various models, and proposing methods to mitigate false refusal, where a model declines a benign request.

These efforts aim to improve the safety and reliability of AI systems by refining their ability to refuse harmful or inappropriate requests.
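
As a rough illustration of what a single direction mediating refusal can mean, the sketch below estimates a direction as the difference between mean activations on harmful and harmless prompts, then projects that direction out of an activation vector. This is a toy example under stated assumptions, not a method confirmed by this page: the function names (`refusal_direction`, `ablate`) and the 3-dimensional activation values are hypothetical, and a real model would supply activations with thousands of dimensions.

```python
# Toy sketch: estimate a "refusal direction" as a difference of mean
# activations, then remove that direction from an activation vector.
# All names and numbers here are hypothetical, chosen for illustration.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean toward the harmful mean."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([h - b for h, b in zip(mu_harmful, mu_harmless)])

def ablate(activation, direction):
    """Project out the component of `activation` along unit `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

# Made-up activations: the two groups differ only along the first axis.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

r_hat = refusal_direction(harmful, harmless)
cleaned = ablate([5.0, 2.0, 1.0], r_hat)
```

After ablation, `cleaned` has no component along `r_hat`, which is the sense in which removing one direction could suppress a behavior tied to it.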

Mentioned by

Andrey Kurenkov

Mentioned in 1 episode

Mentioned by Andrey Kurenkov when discussing a paper on refusal in language models.

#171 - Apple Intelligence, Dream Machine, SSI Inc
