LessWrong (Curated & Popular)

“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks

6 snips
Jul 3, 2025
The discussion dives into the significant race and gender bias found in large language models during hiring scenarios. Surprisingly, while the biases exist, the models' chain-of-thought reasoning appears completely devoid of them. This highlights the disconnect between perceived reasoning and actual bias. The hosts advocate for interpretability-based interventions over traditional prompting methods, emphasizing their effectiveness in real-world applications. It’s a fascinating exploration of AI behavior and bias mitigation strategies.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
INSIGHT

Unfaithful Chain-of-Thought Bias

  • Large language models exhibit clear race and gender bias in realistic hiring but their chain-of-thought (CoT) shows no such bias.
  • This reveals a case of 100% unfaithful CoT where biases are strongly suppressed in reasoning text.
INSIGHT

Limits of CoT Faithfulness

  • Chain of Thought (CoT) monitoring checks AI reasoning in English to detect misalignment but often isn't faithful to actual decision-making.
  • Faithfulness rates of CoT are generally low, around 20-40%, limiting its effectiveness for safety monitoring.
ANECDOTE

Bias Hidden in Realistic Hiring Scenarios

  • Adding realistic details to resumes revealed LLM biases against white and male candidates up to 12% in interview rates.
  • Despite bias happening, LLMs' prompted reasoning showed zero evidence of it, demonstrating strong suppression of bias in reasoning.
Get the Snipd Podcast app to discover more snips from this episode
Get the app