

Beyond Accuracy: Behavioral Testing of NLP Models with Sameer Singh - #406
Sep 3, 2020
Sameer Singh, an assistant professor at UC Irvine, specializes in interpretable machine learning for NLP. He discusses the groundbreaking CheckList tool for robust behavioral testing of NLP models, stressing the importance of understanding model limitations beyond mere accuracy. Sameer reflects on the evolving landscape of AI, the relevance of his co-authored LIME paper in model explainability, and the potential of embodied AI in enhancing our understanding of complex machine learning systems. It's a thoughtful dive into the future of AI evaluation methods.
Deep Learning Surprise
- Sameer Singh initially focused on specific NLP tasks and missed the deep learning wave.
- When he adopted deep learning, it excelled at his tasks but didn't utilize his specialized techniques.
Explainability to Evaluation
- Deep learning models performed well, yet offered little insight into how they arrive at their predictions.
- This gap led Singh to explainability methods such as LIME, and eventually shifted his focus toward evaluation and debugging (a minimal LIME sketch follows below).
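
For readers unfamiliar with LIME, here is a minimal sketch of explaining a single text prediction with the `lime` package. The toy scikit-learn classifier and its training data are illustrative stand-ins, not anything from the episode; LIME only requires a function that maps a list of strings to class probabilities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Toy sentiment classifier; the training data here is purely illustrative.
train_texts = ["a good movie", "a great book", "a bad movie", "a terrible book"]
train_labels = [1, 1, 0, 0]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance(
    "not a good movie",    # instance to explain
    model.predict_proba,   # maps list[str] -> (n, 2) probability array
    num_features=4,        # report the top contributing tokens
)
print(exp.as_list())       # [(token, weight), ...] local explanation
```

LIME perturbs the input text (dropping words) and fits a simple local model to the classifier's responses, so the weights show which tokens drove this one prediction.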
CheckList and Behavioral Testing
- CheckList helps create targeted tests for NLP models, analogous to behavioral testing in software engineering.
- It goes beyond aggregate accuracy by testing specific capabilities, such as negation handling and robustness to paraphrasing (see the sketch after this list).
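
Since the episode only describes the idea, here is a from-scratch sketch of two CheckList-style tests: a Minimum Functionality Test (MFT) for negation and an invariance test for small typos. The `predict` function and the naive keyword classifier are hypothetical placeholders; the actual CheckList library from the paper's authors generates such cases at scale from templates and perturbations.

```python
# CheckList-style behavioral tests, sketched without the library.
# `predict` is any hypothetical function: list[str] -> list of "pos"/"neg".

def mft_negation(predict):
    """Minimum Functionality Test: negated positives should come out negative."""
    templates = ["This is not a good {}.", "I don't think this {} is great."]
    nouns = ["movie", "book", "service"]
    cases = [t.format(n) for t in templates for n in nouns]
    preds = predict(cases)
    failures = [c for c, p in zip(cases, preds) if p != "neg"]
    return 1 - len(failures) / len(cases), failures

def inv_typo(predict):
    """Invariance test: a small typo should not flip the prediction."""
    originals = ["The plot was wonderful.", "An awful waste of time."]
    perturbed = ["The plot was wonderfull.", "An awfull waste of time."]
    base, pert = predict(originals), predict(perturbed)
    failures = [o for o, b, p in zip(originals, base, pert) if b != p]
    return 1 - len(failures) / len(originals), failures

# Example with a deliberately flawed keyword classifier:
def naive_predict(texts):
    return ["pos" if "good" in t or "great" in t or "wonderful" in t.lower()
            else "neg" for t in texts]

print(mft_negation(naive_predict))  # pass rate 0.0: "not good" still reads as "pos"
print(inv_typo(naive_predict))
```

The point of the MFT is exactly what accuracy hides: a model can score well on a test set while failing every negation case, and a capability-level pass rate surfaces that immediately.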