
Generative AI in the Real World: Shreya Shankar on AI for Corporate Data Processing
Sep 9, 2025
30:13
Businesses have a lot of data—but most of that data is unstructured textual data: reports, catalogs, emails, notes, and much more. Without structure, business analysts can’t make sense of the data; there is value in the data, but it can’t be put to use. AI can be a tool for finding and extracting the structure that’s hidden in textual data. In this episode, Ben and Shreya talk about a new generation of tooling that brings AI to enterprise data processing.
Points of Interest
- 0:18: One of the themes of your work is a specific kind of data processing. Before we go into tools, what is the problem you’re trying to address?
- 0:52: For decades, organizations have been struggling to make sense of unstructured data. There’s a massive amount of text that people need to make sense of. We didn’t have the technology to do that until LLMs came around.
- 1:38: I’ve spent the last couple of years building a processing framework for people to manipulate unstructured data with LLMs. How can we extract semantic data?
- 1:55: The prior art would be using NLP libraries and doing bespoke tasks?
- 2:12: We’ve seen two flavors of approach: bespoke code and crowdsourcing. People still do both. But now LLMs can simplify the process.
- 2:45: The typical task is “I have a large collection of unstructured text and I want to extract as much structure as possible.” An extreme would be a knowledge graph; in the middle would be the things that NLP people do. Your data pipelines are designed to do this using LLMs.
- 3:22: Broadly, the tasks are thematic extraction: I want to extract themes from documents. You can program LLMs to find themes. You want some user steering and guidance for what a theme is, then use the LLM for grouping.
- 4:04: One of the tools you built is DocETL. What’s the typical workflow?
- 4:19: The idea is to write MapReduce pipelines, where map extracts insights and group does aggregation. Doing this with LLMs means that the map is described by an LLM prompt. Maybe the prompt is “Extract all the pain points and any associated quotes.” Then you can imagine flattening this across all the documents, grouping them by the pain points, and another LLM can do the summary to produce a report. DocETL exposes these data processing primitives and orchestrates them to scale with data volume and task complexity.
- 5:52: What if you want to extract 50 things from a map operation? You shouldn’t ask an LLM to do 50 things at once. You should group them and decompose them into subtasks. DocETL does some optimizations to do this.
- 6:18: The user could be a noncoder and might not be working on the entire pipeline.
- 7:00: People do that a lot; they might just write a single map operation.
- 7:16: But the end user you have in mind doesn’t even know the words “map” and “filter.”
- 7:22: That’s the goal. Right now, people still need to learn data processing primitives.
- 7:49: These LLMs are probabilistic; do you also set the expectations with the user that you might get different results every time you run the pipeline?
- 8:16: There are two different types of tasks. One is where you want the LLM to be accurate and there is an exact ground truth—for example, entity extraction. The other type is where you want to offload a creative process to the LLM—for example, “Tell me what’s interesting in this data.” They’ll run it until there are no new insights to be gleaned. When is nondeterminism a problem? How do you engineer systems around it?
- 9:56: You might also have a data engineering team that uses this and turns PDF files into something like a data warehouse that people can query. In this setting, are you familiar with the lakehouse architecture and the notion of the medallion architecture?
- 10:49: People actually use DocETL to create a table out of PDFs and put it in a relational database. That’s the best way to think about how to move forward in the enterprise setting. I’ve also seen people using these tables in RAG or downstream LLM applications.
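The map → group → reduce pattern discussed in the episode can be sketched in plain Python. This is not DocETL’s actual API; `fake_llm_extract` is a deterministic stand-in for what would really be an LLM call with a prompt like “Extract all the pain points and any associated quotes,” and the documents and keywords are illustrative.

```python
from collections import defaultdict

# Deterministic stand-in for an LLM extraction call, so the sketch runs
# offline. A real pipeline would send each document to a model with an
# extraction prompt and parse the structured response.
def fake_llm_extract(doc: str) -> list[dict]:
    records = []
    for keyword in ("slow", "crash", "pricing"):
        if keyword in doc.lower():
            records.append({"pain_point": keyword, "quote": doc})
    return records

docs = [
    "The app is slow on startup.",
    "It crashed twice; also very slow when syncing.",
    "Pricing is confusing for small teams.",
]

# Map: one extraction call per document, each yielding zero or more records.
mapped = [record for doc in docs for record in fake_llm_extract(doc)]

# Group: flatten the records across documents and aggregate by the key.
grouped = defaultdict(list)
for record in mapped:
    grouped[record["pain_point"]].append(record["quote"])

# Reduce: in a real pipeline a second LLM call would summarize each group
# into report prose; here we just count mentions per pain point.
report = {pain_point: len(quotes) for pain_point, quotes in grouped.items()}
print(report)  # {'slow': 2, 'crash': 1, 'pricing': 1}
```

The useful property of this decomposition is that each stage is independently swappable: the map prompt, the grouping key, and the reduce summarizer can be tuned or optimized separately, which is what lets a framework like DocETL rewrite an overloaded single prompt into smaller subtasks.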
