Unlocking Unstructured Data with LLMs

Jul 3, 2025

Shreya Shankar, a PhD student in EECS at UC Berkeley, discusses how Large Language Models (LLMs) make unstructured enterprise data usable for analysis. She explains her framework, DocETL, which streamlines semantic extraction and thematic analysis of text and PDFs. The conversation covers the practical challenges of data extraction and the move toward multimodal processing with tools like DocWrangler. Shreya also highlights the importance of aligning user intent with model capabilities for better user experiences.

INSIGHT

LLMs Unlock Unstructured Data

Enterprises have struggled to make sense of unstructured data like long text documents for decades.
Large Language Models (LLMs) now enable automatic semantic extraction and analysis of such data at scale.

INSIGHT

Thematic Extraction with LLMs

Semantic extraction pipelines revolve around thematic extraction, grouping, and summarization.
Users program LLMs to identify themes like pain points or product features and generate aggregated reports.
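
To make "programming" an LLM for theme extraction more concrete, here is a minimal sketch of the two pieces a user typically writes: a prompt that asks for themes in a structured format, and code that parses the replies into an aggregated report. Everything here is an assumption for illustration: the prompt wording, the `kind`/`theme` JSON schema, and the function names are hypothetical and are not taken from DocETL or the episode. The model call itself is left out, and canned replies stand in for it.

```python
import json
from collections import defaultdict

# Hypothetical prompt template; the wording and schema are illustrative only.
PROMPT_TEMPLATE = (
    "Read the customer feedback below and list every theme you find.\n"
    "Return a JSON list of objects with keys 'kind' "
    "('pain_point' or 'product_feature') and 'theme'.\n\n"
    "Feedback:\n{document}"
)


def build_prompt(document: str) -> str:
    """Fill the template with one document's text."""
    return PROMPT_TEMPLATE.format(document=document)


def parse_themes(raw_response: str) -> list:
    """Parse a model reply, keeping only well-formed theme entries."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # malformed replies are skipped rather than crashing the run
    if not isinstance(items, list):
        return []
    return [
        item for item in items
        if isinstance(item, dict) and {"kind", "theme"}.issubset(item)
    ]


def aggregate(theme_lists) -> dict:
    """Count how often each theme appears, grouped by kind."""
    report = defaultdict(lambda: defaultdict(int))
    for themes in theme_lists:
        for entry in themes:
            # Normalize case and whitespace so near-identical themes merge.
            key = str(entry["theme"]).strip().lower()
            report[entry["kind"]][key] += 1
    return {kind: dict(counts) for kind, counts in report.items()}


# Canned replies standing in for real model output (assumption).
replies = [
    '[{"kind": "pain_point", "theme": "Slow export"}]',
    "not json",
    '[{"kind": "pain_point", "theme": "slow export "},'
    ' {"kind": "product_feature", "theme": "Dark mode"}]',
]
report = aggregate(parse_themes(reply) for reply in replies)
print(report)
```

Merging themes by lowercased string is a deliberate simplification. In a real pipeline the grouping would itself be semantic, so that differently worded descriptions of the same issue end up together.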

INSIGHT

DocETL's MapReduce Approach

DocETL uses a MapReduce approach where 'map' extracts insights per document via LLM prompts.
'Reduce' semantically groups and summarizes these insights for scalable data processing.
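
The map and reduce roles described above can be sketched in plain Python. This is not DocETL's API: `run_map`, `run_reduce`, `fake_extract`, and `fake_summarize` are hypothetical names, and the two `fake_*` functions are keyword-based stand-ins for what would be LLM prompts in a real pipeline. Grouping here uses an exact theme key, whereas the approach described in the episode groups semantically.

```python
from collections import defaultdict


def run_map(docs, extract):
    """Map step: run the extractor on each document and flatten the results."""
    insights = []
    for doc_id, text in enumerate(docs):
        for insight in extract(text):
            # Record which document each insight came from.
            insights.append({**insight, "doc_id": doc_id})
    return insights


def run_reduce(insights, group_key, summarize):
    """Reduce step: group insights by key, then summarize each group."""
    groups = defaultdict(list)
    for insight in insights:
        groups[group_key(insight)].append(insight)
    return {key: summarize(key, members) for key, members in groups.items()}


def fake_extract(text):
    """Stand-in for an LLM extraction prompt (assumption): keyword matching."""
    keywords = {"slow": "performance", "crash": "reliability", "price": "pricing"}
    lowered = text.lower()
    return [
        {"theme": theme, "quote": text}
        for word, theme in keywords.items()
        if word in lowered
    ]


def fake_summarize(theme, members):
    """Stand-in for an LLM summarization prompt (assumption)."""
    doc_ids = sorted({m["doc_id"] for m in members})
    return f"{theme}: {len(members)} mention(s) across documents {doc_ids}"


docs = [
    "The app is slow and the price is high.",
    "It tends to crash on startup.",
    "Slow sync again.",
]
insights = run_map(docs, fake_extract)
report = run_reduce(insights, lambda i: i["theme"], fake_summarize)
for line in report.values():
    print(line)
```

Because each document is processed independently in the map step, that step can be parallelized or batched, which is where the scalability of the pattern comes from. Only the reduce step needs to see insights from multiple documents at once.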
