LessWrong (Curated & Popular) cover image

"LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B" by Simon Lermen & Jeffrey Ladish.

LessWrong (Curated & Popular)

00:00

Analysis of Refusals and Comparison of Instruction Sets

This chapter discusses the evaluation of the rate of refusals by an AI and provides a list of phrases indicating refusals along with their presence in the text. It also compares the frequency of the word 'bomb' in two different sets of instructions and comments on various requests related to making a bomb.

Transcript
Play full episode

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app