
"The Waluigi Effect (mega-post)" by Cleo Nardo
LessWrong (Curated & Popular)
Jailbreak techniques are good evidence for simulator theory as an explanation of the Waluigi Effect. If this semiotic simulation theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and is probably increasing the likelihood of a misalignment catastrophe. This has increased my credence in the absurd science fiction tropes that the AI alignment community has tended to reject, and thereby increased my credence in S-Risks. The audio version was published on 3 March 2023 by Cleo Nardo at MegaPost.
Transcript


