Linear Digressions

The Enron Dataset

Feb 9, 2015
This episode covers the Enron email corpus, a dataset widely used in machine learning, and why it matters: the privacy concerns raised by its release, the algorithm development and data cleaning it prompted, and its use in studying corporate fraud.
ANECDOTE

How The Corpus Was Created

  • Federal investigators seized the email inboxes of about 150 senior Enron executives during the fraud probe.
  • Those inboxes were released as the public Enron email corpus that researchers now use widely.
INSIGHT

Privacy Risks Fueled Research

  • The original release contained sensitive personal data like Social Security numbers and bank info that could enable identity theft.
  • That flaw made the corpus a testbed for developing algorithms to detect and remove personally identifiable information.
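As a rough illustration of what detecting and removing PII can look like, here is a minimal sketch that finds and redacts strings shaped like US Social Security numbers. The pattern, the function names (`find_ssns`, `redact_ssns`), and the placeholder text are assumptions made for this example; they are not taken from the episode or from any tool actually applied to the Enron corpus, and a real scrubber would need to cover many more PII types and formats.

```python
import re

# Assumed format for this sketch: SSNs written as 123-45-6789.
# Real data contains many other layouts (spaces, no separators, etc.).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(text: str) -> list[str]:
    """Return every SSN-shaped substring found in the text."""
    return [match.group(0) for match in SSN_PATTERN.finditer(text)]


def redact_ssns(text: str, placeholder: str = "[REDACTED-SSN]") -> str:
    """Replace every SSN-shaped substring with a placeholder."""
    return SSN_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    email_body = "Please update payroll, my SSN is 123-45-6789."
    print(find_ssns(email_body))
    print(redact_ssns(email_body))
```

A pattern this simple will miss unusual formats and can flag unrelated numbers that happen to share the shape, which is part of why a large real-world corpus is useful for testing.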
ADVICE

Benchmark PII Scrubbing With Enron

  • Use the Enron corpus to benchmark PII-scrubbing tools by comparing older raw dumps with cleaned versions.
  • Train algorithms to locate patterns of PII before releasing similar corporate datasets publicly.
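One way to picture the benchmarking idea above is to score a scrubber by how many known PII strings it removes. The sketch below assumes you already have a labeled list of PII strings per message (for instance, derived by comparing a raw dump against a cleaned version); the sample messages, `toy_scrubber`, and `scrub_recall` are all hypothetical names and data invented for illustration, not part of any real Enron release.

```python
import re
from typing import Callable, Iterable

# Hypothetical labeled samples: (raw text, PII strings known to be in it).
SAMPLES = [
    ("SSN 123-45-6789 is on file", ["123-45-6789"]),
    ("wire to account 00012345 today", ["00012345"]),
]


def toy_scrubber(text: str) -> str:
    """Stand-in scrubber for this sketch: redacts SSN-shaped strings only."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", text)


def scrub_recall(
    samples: Iterable[tuple[str, list[str]]],
    scrubber: Callable[[str], str],
) -> float:
    """Fraction of known PII strings that no longer appear after scrubbing."""
    total = 0
    removed = 0
    for raw_text, pii_items in samples:
        cleaned = scrubber(raw_text)
        for item in pii_items:
            total += 1
            if item not in cleaned:
                removed += 1
    if total == 0:
        raise ValueError("no labeled PII items to score against")
    return removed / total


if __name__ == "__main__":
    # The toy scrubber catches the SSN but misses the account number.
    print(scrub_recall(SAMPLES, toy_scrubber))  # 0.5
```

Recall alone does not penalize over-redaction, so a fuller benchmark would also check how much non-sensitive text the scrubber removes.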