

Unlocking Unstructured Data with LLMs
Jul 3, 2025
Shreya Shankar, a PhD student in EECS at UC Berkeley, discusses how Large Language Models (LLMs) are changing the way enterprises work with unstructured data. She explains her framework, DocETL, which streamlines semantic extraction and thematic analysis of text and PDFs. The conversation touches on the practical challenges of data extraction and the evolution towards multimodal processing with tools like DocWrangler. Shreya also highlights the importance of aligning user intent with model capabilities for better user experiences.
AI Snips
LLMs Unlock Unstructured Data
- For decades, enterprises have struggled to make sense of unstructured data such as long text documents.
- Large Language Models (LLMs) now enable automatic semantic extraction and analysis of such data at scale.
Thematic Extraction with LLMs
- Semantic extraction pipelines revolve around thematic extraction, grouping, and summarization.
- Users program LLMs to identify themes like pain points or product features and generate aggregated reports.
DocETL's MapReduce Approach
- DocETL uses a MapReduce approach where 'map' extracts insights per document via LLM prompts.
- 'Reduce' semantically groups and summarizes these insights for scalable data processing.
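
To make the map and reduce steps concrete, here is a minimal Python sketch. It does not use DocETL's actual API, and every name in it (`THEME_KEYWORDS`, `map_extract`, `reduce_summarize`, `run_pipeline`) is hypothetical. A keyword lookup stands in for the LLM prompt so the example runs offline; in a real pipeline the map step would call a model, and the reduce step would likely ask a model to write the summary.

```python
from collections import defaultdict

# Hypothetical stand-in for an LLM prompt: a real pipeline would ask a model
# to identify themes; a keyword table keeps this sketch runnable offline.
THEME_KEYWORDS = {
    "pricing": ["expensive", "price", "cost"],
    "performance": ["slow", "lag", "crash"],
}


def map_extract(doc: str) -> list[dict]:
    """Map step: extract (theme, evidence) records from a single document."""
    text = doc.lower()
    records = []
    for theme, words in THEME_KEYWORDS.items():
        if any(word in text for word in words):
            records.append({"theme": theme, "evidence": doc})
    return records


def reduce_summarize(records: list[dict]) -> dict[str, dict]:
    """Reduce step: group records by theme and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["theme"]].append(record["evidence"])
    # The "summary" here is a count plus up to two example documents.
    return {
        theme: {"count": len(evidence), "examples": evidence[:2]}
        for theme, evidence in groups.items()
    }


def run_pipeline(docs: list[str]) -> dict[str, dict]:
    """Run map over every document, then reduce over all extracted records."""
    records = [record for doc in docs for record in map_extract(doc)]
    return reduce_summarize(records)


docs = [
    "The app is too expensive for small teams.",
    "Dashboard is slow and the price went up again.",
    "Love the new export feature.",
]
report = run_pipeline(docs)
for theme, summary in report.items():
    print(theme, summary["count"])
```

Because the map step handles each document independently, it can be run in parallel across a large corpus; only the reduce step needs to see records from more than one document.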