The chapter covers an OpenAI research piece on using sparse autoencoders to extract concepts from GPT-4, and compares it to an earlier study by Anthropic. It discusses how sparse autoencoders extract meaningful features, including features related to human imperfection, and the challenges of measuring how effective the autoencoders are. It also highlights papers on improving AI model alignment and robustness, preventing harmful behaviors through short circuiting, and training techniques that make models refuse harmful actions while retaining positive skills.
Our 170th episode with a summary and discussion of last week's big AI news!
With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)
Read our text newsletter and comment on the podcast at https://lastweekin.ai/
Email us your questions and feedback at contact@lastweekin.ai and/or hello@gladstone.ai
Timestamps + Links:
- Tools & Apps
- Applications & Business
- Projects & Open Source
- Research & Advancements
- Policy & Safety
- Synthetic Media & Art
- (01:46:25) Outro + AI Song