The Waluigi Effect, MegaPost, by Cleo Nardo

Jailbreak techniques are good evidence for simulator theory as an explanation of the Waluigi Effect. If this semiotic simulation theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe. This has increased my credence in the absurd science fiction tropes that the AI alignment community has tended to reject, and thereby increased my credence in S-Risks. The audio version was published on 3 March 2023 by Cleo Nardo at MegaPost.


