Hey everyone, thank you so much for watching the 48th episode of the Weaviate Podcast!! This is a SUPER exciting one, welcoming Brian Raymond, the CEO / Founder of Unstructured! Unstructured is a perfect complementary technology for Weaviate, helping people get their unstructured data into Weaviate! The podcast dives into the nuances of this task, but it generally revolves around Unstructured's abstraction of Partitioning, Cleaning, and Staging! Unstructured is making groundbreaking innovations in using Visual Document Layout models for Partitioning, for example, labeling which part of a PDF is the header, body, image caption, and so on. Cleaning then describes removing pesky details like extra whitespace or odd characters. Staging then describes transformations such as formatting a text chunk with its metadata into the JSON for a Weaviate object upload! I really hope you find this podcast interesting! We are also publishing a blog post showing an example of how to use Unstructured to get PDF data into Weaviate. Please check it out and let us know if it works for your data and how we can improve it! The blog post can be found on weaviate.io, and we will be managing discussions around it both in the Weaviate Slack and in Unstructured's! Thank you so much for listening!
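To make the Partitioning, Cleaning, and Staging idea described above a bit more concrete, here is a minimal toy sketch in plain Python. It does not use the Unstructured library or its layout models: the functions `partition`, `clean`, and `stage_for_weaviate`, the `Document` class name, and the `content` / `category` / `source` property names are all hypothetical, and the "partitioning" is just a naive length heuristic standing in for a real document layout model.

```python
import json
import re

# Hypothetical stand-ins for the three stages; NOT Unstructured's actual API.

def partition(raw_text):
    """Split raw text into elements and tag each with a guessed type."""
    elements = []
    for block in raw_text.split("\n\n"):
        stripped = block.strip()
        if not stripped:
            continue
        # Naive heuristic (assumption): a short block with no trailing
        # period is treated as a title, everything else as body text.
        is_title = len(stripped) < 40 and not stripped.endswith(".")
        kind = "Title" if is_title else "NarrativeText"
        elements.append({"type": kind, "text": block})
    return elements

def clean(element):
    """Replace bullets and control characters, then collapse whitespace."""
    text = re.sub(r"[\u2022\x00-\x08\x0b\x0c\x0e-\x1f]", " ", element["text"])
    text = re.sub(r"\s+", " ", text).strip()
    return {**element, "text": text}

def stage_for_weaviate(elements, source):
    """Format each element plus its metadata as a Weaviate-style object dict."""
    return [
        {
            "class": "Document",  # hypothetical class name
            "properties": {
                "content": e["text"],
                "category": e["type"],
                "source": source,
            },
        }
        for e in elements
    ]

raw = "Quarterly Report\n\n\u2022  Revenue   grew\t strongly in this quarter."
objects = stage_for_weaviate([clean(e) for e in partition(raw)], source="report.pdf")
print(json.dumps(objects, indent=2))
```

The point of the sketch is only the shape of the pipeline: elements come out of partitioning with a type label, cleaning works on each element's text, and staging attaches the metadata in the form the destination expects.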
Check out Unstructured here! https://www.unstructured.io/
Chapters
0:00 Welcome Brian!!
0:27 What is Unstructured?
5:42 Why now? New Advancements in Unstructured
8:02 Thoughts on Data Connectors Hub
10:55 PDFs to Weaviate with Unstructured
13:53 State-of-the-Art in OCR and Document Parsing
16:10 How to get the data from Weaviate.io?
18:06 Foundation Models from Unstructured
20:45 Evaporate-Code+
23:15 CSV, Parquet, JSON transformations in Staging
25:08 Cleaning Bricks
28:02 Visual Document Examples
30:45 Text Chunking with Metadata
33:25 Knowledge Graphs with Goldman Sachs example
39:10 LLM Hallucinations
42:10 Announcements from Brian!