Is Helpfulness Training Making AIs More Than Less Dangerous?

The lack of the usual harmlessness training makes these results hard to read. Is RLA-CHF making AIs more rather than less dangerous? Or is it just that kind of hokey partial RLA- CHF without the safeguards that does that? An optimistic conclusion is that harmlessness training would solve all these problems. A pessimistic conclusion is that Harmless AI wouldn't work, since humans get punished every time they admit a bad behaviour. And here's a final graph, acts like it wants to help humans but does not care about that.

Play episode from 28:45

Transcript

Episode notes

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app