The Enron Dataset
Feb 9, 2015
This episode discusses the Enron email corpus, a dataset widely used in machine learning, and why it matters. It covers the privacy concerns raised by the release, the algorithm development and data cleaning that followed, and the use of the dataset in studying corporate fraud.
How The Corpus Was Created
- Federal investigators seized the email inboxes of about 150 senior Enron executives during the fraud probe.
- Those inboxes were released as the public Enron email corpus that researchers now use widely.
Privacy Risks Fueled Research
- The original release contained sensitive personal data like Social Security numbers and bank info that could enable identity theft.
- That flaw made the corpus a testbed for developing algorithms to detect and remove personally identifiable information.
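To make "detect and remove" concrete, here is a minimal sketch of a pattern-based scrubber in Python. It is not the method used on the actual corpus; the two patterns, the placeholder tokens, and the `redact` helper are all assumptions chosen for illustration, and real PII detection needs far broader coverage than two regular expressions.

```python
import re

# Illustrative patterns only (assumptions, not taken from the episode).
# US Social Security numbers written as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Assumed bank-account format: a standalone run of 10 to 12 digits.
ACCOUNT_RE = re.compile(r"\b\d{10,12}\b")


def redact(text: str) -> str:
    """Replace every matched pattern with a placeholder token."""
    text = SSN_RE.sub("[SSN]", text)
    return ACCOUNT_RE.sub("[ACCOUNT]", text)


# Made-up sample message, not drawn from the corpus.
sample = "My SSN is 123-45-6789 and the account is 0012345678."
print(redact(sample))
```

Fixed patterns like these miss anything written in an unexpected format, which is one reason a large body of real email is useful for testing.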
Benchmark PII Scrubbing With Enron
- Use the Enron corpus to benchmark PII-scrubbing tools by comparing older raw dumps with cleaned versions.
- Train algorithms to locate patterns of PII before releasing similar corporate datasets publicly.
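The raw-versus-cleaned comparison can be sketched as a small benchmark harness. This assumes you have pairs of messages, each with a raw version and a reference cleaned version; the `benchmark` function, the `toy_scrubber`, the placeholder tokens, and the three sample pairs are hypothetical and are not drawn from the corpus.

```python
import re
from typing import Callable, List, Tuple

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def toy_scrubber(text: str) -> str:
    """Stand-in for the tool under test; it only handles SSN-like strings."""
    return SSN_RE.sub("[SSN]", text)


def benchmark(
    scrubber: Callable[[str], str],
    pairs: List[Tuple[str, str]],
) -> Tuple[float, List[str]]:
    """Run the scrubber on each raw message and compare it with the reference.

    Returns the fraction of messages where the output matches the cleaned
    version exactly, plus the raw messages where it did not.
    """
    misses = [raw for raw, cleaned in pairs if scrubber(raw) != cleaned]
    score = (len(pairs) - len(misses)) / len(pairs) if pairs else 0.0
    return score, misses


# Hypothetical (raw, reference cleaned) pairs for illustration.
pairs = [
    ("SSN 123-45-6789 on file", "SSN [SSN] on file"),
    ("Call me at 713-555-0100", "Call me at [PHONE]"),
    ("Lunch at noon?", "Lunch at noon?"),
]

score, misses = benchmark(toy_scrubber, pairs)
print(f"agreement: {score:.2f}")
for raw in misses:
    print("missed:", raw)
```

Exact-match agreement is a deliberately strict choice: it penalizes both leftover PII and over-redaction, and the list of misses shows which patterns the tool still needs to learn.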
