The Journey to Clean and Reliable Data

In this chapter, guest Curtis Northcutt discusses his journey in improving data labeling accuracy over the past six years and introduces Cleanlab as a solution. He explains the concept of positive-unlabeled (PU) learning, how he generalized his solution to the full binary case, and his early research on Rank Pruning. He also discusses his experience at Facebook AI Research and Amazon, where he addressed bias in comment rankings and determined false negative rates for Alexa devices.

Episode notes

MLOps Coffee Sessions #106 with Curtis Northcutt, CEO & Co-Founder of Cleanlab, Cleanlab: Labeled Datasets that Correct Themselves Automatically, co-hosted by Vishnu Rachakonda.

// Abstract
Pioneered at MIT by 3 Ph.D. Co-Founders, Cleanlab is an open-source/SaaS company building the premier data-centric AI tools and workflows for (1) automatically correcting messy data and labels, (2) auto-tracking dataset quality over time, (3) automatically finding classes to merge and delete, (4) AutoML for data tasks, (5) obtaining and ranking high-quality annotations, and (6) training ML models with messy data.

Most of the prescriptive tasks (finding issues) can be done in one line of code with their open-source product: https://github.com/cleanlab/cleanlab.
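
As a rough illustration of what "finding issues" can mean, here is a minimal sketch of the confident-learning idea discussed in the episode: flag an example when the model is confident about a class other than its given label. This is a simplified, assumption-laden sketch in plain Python, not Cleanlab's implementation or its API. The function name `find_suspect_labels`, the per-class mean threshold, and the toy data are all hypothetical choices made for illustration.

```python
def find_suspect_labels(labels, pred_probs):
    """Return indices of examples whose given label looks suspect.

    labels: list of int class ids (the given, possibly noisy labels).
    pred_probs: list of per-example class-probability lists,
        ideally predicted out-of-sample (e.g. via cross-validation).
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold (an assumption of this sketch): the average
    # predicted probability of class k over the examples labeled k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability clears that class's threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        # The most probable class among the confident ones.
        best = max(confident, key=lambda k: p[k])
        if best != y:
            suspects.append(i)
    return suspects


# Toy data: six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model strongly prefers class 1
    [0.25, 0.75],
    [0.10, 0.90],
    [0.30, 0.70],
]
print(find_suspect_labels(labels, pred_probs))
```

On this toy data only the third example (index 2) is flagged. The open-source library linked above packages this kind of check behind its own interface; consult its documentation for the actual calls.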

// Bio
Curtis Northcutt is the CEO and Co-Founder of Cleanlab, focused on making AI work reliably for people and their messy, real-world data by automatically fixing issues in any ML dataset. Curtis completed his Ph.D. in Computer Science at MIT, receiving the MIT Thesis Award, NSF Fellowship, and the Goldwater Scholarship. Prior to Cleanlab, Curtis worked at AI research groups including Google, Oculus, Amazon, Facebook, Microsoft, and NASA.

// MLOps Jobs board
jobs.mlops.community

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Curtis on LinkedIn: https://www.linkedin.com/in/cgnorthcutt/

Timestamps:
[00:00] Introduction to Curtis Northcutt
[00:30] Difference between MLOps and Data-Centric AI
[04:04] Realizing how the data quality problem manifests in ML
[05:11] Computer vision problems
[06:54] War story that got Curtis into Data-Centric AI
[13:50] Overview of Curtis' vision
[14:45] PU Learning
[21:25] Consistency Rate and Flipping Rate
[25:25] One line of code
[29:48] Models make mistakes
[33:09] Cleanlab plays with the environment
[36:30] How ML Engineers should approach the data quality problem
[42:42] Quantum computing
[46:39] Result of confident learning
[52:31] Utility for small data sets
[53:53] Cleanlab's huge success stories
[56:13] Rapid-fire questions
[58:58] Cloudy and mystified space
[1:03:46] Cleanlab is hiring!
[1:05:06] Wrap up
