Crag Wolfe and Matt Robinson from Unstructured discuss the challenges of cleaning unstructured data for large language models. They explore data normalization, document processing for NLP tasks, and data transformation techniques for tables, PDFs, and images. The podcast highlights the importance of preprocessing data for machine learning applications and enhancing information retrieval with structured data processing.
Podcast summary created with Snipd AI
Quick takeaways
Large language models require clean, curated data for training, posing a significant data cleaning challenge with heterogeneous formats.
Unstructured focuses on extracting and transforming complex data for vector databases and LLM frameworks.
Deep dives
Data cleaning challenges in large language models
Large language models require clean, curated data for training, which creates a major data cleaning challenge when dealing with heterogeneous data formats like HTML, PDF, PNG, and PowerPoint. Unstructured focuses on transforming complex data for vector databases and LLM frameworks, and Crag Wolfe and Matt Robinson join to discuss data cleaning in the LLM age.
Unlocking value in unstructured data
A significant portion of the world's data, about 80-90%, is unstructured and challenging to leverage. Current technologies like data lakes require substantial manual effort to make unstructured data usable for analytics and machine learning. Unstructured data, which includes emails, office documents, videos, and more, resides separately from structured data, yet it is crucial for unlocking valuable insights within a company.
The importance of normalizing diverse document types
Unstructured faces the challenge of normalizing diverse document types such as HTML, Word documents, PDFs, and images. Normalization involves extracting raw text content, structuring documents into elements, and pulling out metadata. Tables, images, and formatting differences add complexity to this processing, and accurate normalization is required for seamless integration into LLM applications.
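As a minimal sketch of what that normalization step looks like, the open-source unstructured Python library partitions a file into typed elements with metadata. The input file below is a hypothetical example, and the loop simply prints what each element carries.

from unstructured.partition.auto import partition

# partition routes the file to a format-specific parser (HTML, DOCX, PDF,
# image, ...) and returns a list of normalized elements.
elements = partition(filename="quarterly-report.pdf")  # hypothetical input file

for element in elements:
    # Each element carries extracted text, a category such as Title,
    # NarrativeText, or Table, and metadata like page number and source file.
    print(element.category, element.text[:80])
    print(element.metadata.to_dict())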
Scalability and testing processes
To ensure scalability, Unstructured's API and platform products use auto-scaling to handle increasing data volumes efficiently. The platform employs queuing systems to manage document processing workflows, enabling failover mechanisms and retry options so processing can continue seamlessly. Rigorous testing in CI against hundreds of documents with ground-truth data ensures accurate transformations and drives continuous improvements in data processing performance.
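The queue-and-retry pattern described above can be sketched in a few lines. This is a generic illustration under assumed names (the jobs queue, process_document stub, and MAX_RETRIES limit), not Unstructured's actual implementation.

import queue
import time

MAX_RETRIES = 3          # assumed retry limit
jobs = queue.Queue()     # stand-in for a real work queue

def process_document(path: str) -> None:
    # Placeholder for partitioning and transforming a single document.
    print(f"processing {path}")

def worker() -> None:
    while not jobs.empty():
        path, attempts = jobs.get()
        try:
            process_document(path)
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                # A failed document is re-queued instead of halting the pipeline.
                time.sleep(1)  # simple backoff before the retry
                jobs.put((path, attempts + 1))
        finally:
            jobs.task_done()

jobs.put(("example.pdf", 0))
worker()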
Episode notes

The majority of enterprise data exists in heterogeneous formats such as HTML, PDF, PNG, and PowerPoint. However, large language models do best when trained with clean, curated data. This presents a major data cleaning challenge.
Unstructured is focused on extracting and transforming complex data to prepare it for vector databases and LLM frameworks.
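As a loose illustration of that preparation flow, partitioned elements can be chunked, embedded, and written to a vector store. The embed stub and in-memory index below are hypothetical stand-ins for a real embedding model and vector database; chunk_by_title is the unstructured library's chunking helper.

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="quarterly-report.pdf")  # hypothetical input file
chunks = chunk_by_title(elements)  # group elements into retrieval-sized chunks

def embed(text: str) -> list[float]:
    # Stand-in for a call to a real embedding model.
    return [float(len(text))]

index = []  # stand-in for an upsert into a vector database
for chunk in chunks:
    index.append({
        "text": chunk.text,
        "embedding": embed(chunk.text),
        "metadata": chunk.metadata.to_dict(),
    })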
Crag Wolfe is Head of Engineering and Matt Robinson is Head of Product at Unstructured. They join the podcast to talk about data cleaning in the LLM age.
Sean’s been an academic, startup founder, and Googler. He has published works covering a wide range of topics from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at Skyflow and host of Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter @seanfalconer.