Crag Wolfe and Matt Robinson from Unstructured discuss the challenges of cleaning unstructured data for large language models. They explore data normalization, document processing for NLP tasks, and data transformation techniques for tables, PDFs, and images. The podcast highlights the importance of preprocessing data for machine learning applications and enhancing information retrieval with structured data processing.
Podcast summary created with Snipd AI
Quick takeaways
Large language models require clean, curated data for training, posing a significant data cleaning challenge with heterogeneous formats.
Unstructured focuses on extracting and transforming complex data for vector databases and LLM frameworks.
Deep dives
Data cleaning challenges in large language models
Large language models require clean, curated data for training, which creates a major data cleaning challenge when dealing with heterogeneous data formats like HTML, PDF, PNG, and PowerPoint. Unstructured focuses on transforming complex data for vector databases and LLM frameworks, and Crag Wolfe and Matt Robinson join to discuss data cleaning in the LLM age.
Unlocking value in unstructured data
A significant portion of the world's data, about 80-90%, is unstructured and challenging to leverage. Current technologies like data lakes require substantial manual effort to make unstructured data usable for analytics and machine learning. Unstructured data, which includes emails, office documents, videos, and more, resides separately from structured data, yet it is crucial for unlocking valuable insights within a company.
The importance of normalizing diverse document types
Unstructured faces the challenge of normalizing diverse document types such as HTML, Word documents, PDFs, and images. Normalization involves extracting raw text content, structuring documents into elements, and pulling out metadata. Tables, images, and formatting differences add complexity to this processing, and accurate normalization is required for seamless integration into LLM applications.
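As a minimal sketch of what that normalization step looks like, the open-source unstructured Python library partitions a file into typed elements with metadata. The input file below is a hypothetical example, and the loop simply prints what each element carries.

from unstructured.partition.auto import partition

# partition routes the file to a format-specific parser (HTML, DOCX, PDF,
# image, ...) and returns a list of normalized elements.
elements = partition(filename="quarterly-report.pdf")  # hypothetical input file

for element in elements:
    # Each element carries extracted text, a category such as Title,
    # NarrativeText, or Table, and metadata like page number and source file.
    print(element.category, element.text[:80])
    print(element.metadata.to_dict())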
Scalability and testing processes
To ensure scalability, Unstructured's API and platform products use auto-scaling to handle increasing data volumes efficiently. The platform employs queuing systems to manage document processing workflows, enabling failover mechanisms and retry options so processing can continue seamlessly. Rigorous testing in CI against hundreds of documents with ground-truth data ensures accurate transformations and drives continuous improvements in data processing performance.
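The queue-and-retry pattern described above can be sketched in a few lines. This is a generic illustration under assumed names (the jobs queue, process_document stub, and MAX_RETRIES limit), not Unstructured's actual implementation.

import queue
import time

MAX_RETRIES = 3          # assumed retry limit
jobs = queue.Queue()     # stand-in for a real work queue

def process_document(path: str) -> None:
    # Placeholder for partitioning and transforming a single document.
    print(f"processing {path}")

def worker() -> None:
    while not jobs.empty():
        path, attempts = jobs.get()
        try:
            process_document(path)
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                # A failed document is re-queued instead of halting the pipeline.
                time.sleep(1)  # simple backoff before the retry
                jobs.put((path, attempts + 1))
        finally:
            jobs.task_done()

jobs.put(("example.pdf", 0))
worker()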
Episode notes

The majority of enterprise data exists in heterogeneous formats such as HTML, PDF, PNG, and PowerPoint. However, large language models do best when trained with clean, curated data. This presents a major data cleaning challenge.
Unstructured is focused on extracting and transforming complex data to prepare it for vector databases and LLM frameworks.
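As a loose illustration of that preparation flow, partitioned elements can be chunked, embedded, and written to a vector store. The embed stub and in-memory index below are hypothetical stand-ins for a real embedding model and vector database; chunk_by_title is the unstructured library's chunking helper.

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="quarterly-report.pdf")  # hypothetical input file
chunks = chunk_by_title(elements)  # group elements into retrieval-sized chunks

def embed(text: str) -> list[float]:
    # Stand-in for a call to a real embedding model.
    return [float(len(text))]

index = []  # stand-in for an upsert into a vector database
for chunk in chunks:
    index.append({
        "text": chunk.text,
        "embedding": embed(chunk.text),
        "metadata": chunk.metadata.to_dict(),
    })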
Crag Wolfe is Head of Engineering and Matt Robinson is Head of Product at Unstructured. They join the podcast to talk about data cleaning in the LLM age.
Sean’s been an academic, startup founder, and Googler. He has published works covering a wide range of topics from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at Skyflow and host of Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter @seanfalconer.