
"LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B" by Simon Lermen & Jeffrey Ladish.
LessWrong (Curated & Popular)
00:00
Analysis of Refusals and Comparison of Instruction Sets
This chapter discusses the evaluation of the rate of refusals by an AI and provides a list of phrases indicating refusals along with their presence in the text. It also compares the frequency of the word 'bomb' in two different sets of instructions and comments on various requests related to making a bomb.
Transcript
Play full episode