E134: Making Complex Data RAG-Ready with Unstructured
May 20, 2024
Brian Raymond, Founder & CEO of Unstructured, discusses the importance of data preparation in NLP, creating a single API endpoint for handling diverse data formats, transitioning from open source to commercial success, engaging with government design partners, and the value of world-class design & marketing for open source companies.
Unstructured was founded to address the lack of tools for data preprocessing in NLP projects, focusing on making complex data RAG-ready for vector databases.
Starting as an open source project, Unstructured gained widespread usage and valuable feedback, guiding platform development and market strategy.
Deep dives
Origin of Unstructured Idea and Need for Data Tooling
The idea for Unstructured originated from the lack of tooling on the data side for natural language processing (NLP) projects, while there were abundant resources for model development. Companies struggled with hard-coded preprocessing pipelines, affecting data readiness for tasks like labeling and inference. Primer AI's experience highlighted this challenge, leading Brian to envision a solution that focuses solely on preparing data for easy integration into NLP applications and knowledge graphs.
Open Source Project's Role in Market and Product Development
The decision to start an open source project stemmed from the need to validate product-market fit quickly, a lesson drawn from the high customer concentration experienced at Primer AI. Leveraging the Hugging Face community's growth and shared investors, Unstructured aimed to cater to developers contributing to and using models in that ecosystem. Although it lacked direct contributors, the open source project drew widespread usage and offered valuable feedback for both the open source and commercial platforms, guiding feature development and market positioning.
Platform Expansion and Challenges in Data Processing
Unstructured's platform expansion beyond the API involved building connectors that streamline data ingestion from various sources and convert the results into a JSON format. The platform addressed the complexities of handling different data types, such as images that require text extraction and files whose text can be extracted directly. By embracing chunking strategies, metadata generation, and data summarization, Unstructured aimed to provide users with a comprehensive solution for managing diverse document types efficiently.
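To make the chunking and metadata ideas above concrete, here is a minimal sketch in plain Python. It is not Unstructured's actual API or output schema: the function name chunk_elements, the element fields (type, text, source), and the metadata keys are all hypothetical, chosen only to illustrate grouping extracted text by section title and attaching metadata to each JSON-serializable chunk.

```python
import json
from typing import Dict, List


def chunk_elements(elements: List[Dict[str, str]], max_chars: int = 200) -> List[Dict]:
    """Group extracted elements into chunks (hypothetical helper).

    A new chunk starts at every "Title" element, or when adding the next
    element would push the joined chunk text past max_chars.
    """
    chunks: List[Dict] = []
    current: List[Dict[str, str]] = []

    def flush() -> None:
        # Emit the elements collected so far as one chunk with metadata.
        if not current:
            return
        text = "\n".join(e["text"] for e in current)
        chunks.append({
            "text": text,
            "metadata": {
                "source": current[0]["source"],
                "element_types": [e["type"] for e in current],
                "num_chars": len(text),
            },
        })
        current.clear()

    for element in elements:
        # Length of the joined text if this element were appended.
        projected = sum(len(e["text"]) + 1 for e in current) + len(element["text"])
        if current and (element["type"] == "Title" or projected > max_chars):
            flush()
        current.append(element)
    flush()
    return chunks


# Illustrative input: elements as they might look after text extraction.
elements = [
    {"type": "Title", "text": "Quarterly Report", "source": "report.pdf"},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter.", "source": "report.pdf"},
    {"type": "Title", "text": "Outlook", "source": "report.pdf"},
    {"type": "NarrativeText", "text": "We expect continued growth.", "source": "report.pdf"},
]

chunks = chunk_elements(elements)
print(json.dumps(chunks, indent=2))
```

With the default limit, each title and the paragraph under it end up in one chunk; lowering max_chars splits them further. A real pipeline would likely measure size in tokens rather than characters and carry richer metadata such as page numbers.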
Industry Positioning and Transition to Production
In the evolving landscape of large language models (LLMs), Unstructured recognizes the industry's current stage as primarily experimental, with easier adoption for search and QA applications. The transition to multi-step automation workflows faces challenges but shows promise with emerging models. Balancing experimentation and production readiness remains a key focus, aligning with Unstructured's mission to streamline data preparation for enhanced data utilization in workflows and applications.
Brian Raymond is Founder & CEO of Unstructured, the platform to extract and transform complex data for use with every major vector database and LLM framework. Their open source project has 7K stars on GitHub and includes libraries and APIs that let users build custom preprocessing pipelines for labeling, training, and production machine learning. Today, they have over 6M downloads and 50K companies using their tools.
Unstructured has raised $65M from investors including Bain, Essence VC, and Menlo Ventures.
In this episode, we dig into Brian's process of talking to 100 data scientists before launching Unstructured, why the long tail of data matters for LLMs, competing with their own open source, why being a "boring company" is valuable for today's LLM stack, why they liked having government design partners, why world-class design & marketing are huge differentiators for open source companies & more!