#118 - Unlocking the Value of Unstructured Data with AI w/ Cody & Will (Coactive.ai)
Mar 21, 2023
Cody Coleman and Will Gaviria Rojas, co-founders of Coactive.ai, discuss the rising importance of unstructured data and the role of AI. They explore challenges and considerations in unlocking the value of unstructured data, the future of working with it, and the evolution of databases and deep learning. They also touch on generative AI and future possibilities.
Efficient storage methods such as using binary formats can optimize performance when dealing with unstructured data.
Prioritizing data quality over quantity can be a more effective and cost-efficient approach to unlock value from unstructured data.
Collaboration between AI and systems experts is crucial to build cohesive and efficient solutions for unlocking value from unstructured data.
Deep dives
Optimizing storage and data management for unstructured data
One of the key challenges in unlocking the value of unstructured data lies in optimizing storage and data management. With so much content, it becomes crucial to find efficient ways to store and access it. Traditional approaches, such as storing images or multimedia files on object stores like S3, may not be sufficient for efficient traversal and search. Packing these files into binary formats such as LMDB, HDF5, or Parquet can significantly improve performance by reducing data size and minimizing network reads. Additionally, as AI accelerators get faster, the storage layer must keep pace so that these computational resources are fully utilized.
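The packing idea above can be sketched as follows. This is a minimal illustration using only the Python standard library, not LMDB, HDF5, or Parquet themselves; the single-file layout and the names `pack_files` and `read_item` are assumptions made for the example and do not come from the episode. It shows how many small payloads can live in one file with an offset index, so a lookup becomes one open and one seek instead of one request per object.

```python
import json
import struct
from pathlib import Path


def pack_files(items: dict[str, bytes], out_path: Path) -> None:
    """Write payloads into one file: [8-byte index length][JSON index][payloads].

    Hypothetical layout for illustration only.
    """
    index = {}
    offset = 0
    for key, payload in items.items():
        # Record where each payload starts (relative to the payload area) and its size.
        index[key] = (offset, len(payload))
        offset += len(payload)
    index_bytes = json.dumps(index).encode("utf-8")
    with open(out_path, "wb") as f:
        f.write(struct.pack("<Q", len(index_bytes)))
        f.write(index_bytes)
        for payload in items.values():
            f.write(payload)


def read_item(path: Path, key: str) -> bytes:
    """Read a single payload by key using the stored offset index."""
    with open(path, "rb") as f:
        (index_len,) = struct.unpack("<Q", f.read(8))
        index = json.loads(f.read(index_len))
        offset, length = index[key]
        # Skip the header and index, then jump straight to the payload.
        f.seek(8 + index_len + offset)
        return f.read(length)
```

In practice a purpose-built format would also handle compression, concurrent readers, and memory mapping, which this sketch leaves out.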
Balancing data quantity and data quality
A key aspect in unlocking the value of unstructured data is finding the right balance between data quantity and data quality. While there has been a focus on collecting as much data as possible in the big data era, dealing with unstructured data requires a more thoughtful approach. Training on large volumes of noisy data can be both costly and less effective. Instead, by being selective and prioritizing data quality, even with smaller datasets, it is possible to achieve comparable learning outcomes. Employing data selection strategies that carefully curate high-quality data can be a more efficient and cost-effective approach to unlocking value from unstructured data.
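One simple way to picture a data selection strategy is sketched below. The episode summary does not specify a method, so this is an assumption: it drops exact duplicates and then keeps the highest-scoring examples, where the per-example quality score is a hypothetical input (for instance from a heuristic or a small proxy model). The function name `select_high_quality` is illustrative.

```python
import hashlib


def select_high_quality(examples, scores, keep):
    """Drop exact duplicates, then keep the `keep` highest-scoring examples.

    `scores` is a hypothetical per-example quality score (higher is better).
    """
    seen = set()
    unique = []
    for text, score in zip(examples, scores):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier example
        seen.add(digest)
        unique.append((text, score))
    # Sort by score, best first, and keep only the top `keep` items.
    unique.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in unique[:keep]]
```

The smaller curated set can then be used for training in place of the full noisy collection.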
Collaboration between AI and Systems experts
One of the biggest challenges in unlocking the value of unstructured data lies in the lack of collaboration between AI and systems experts. Organizations often overlook the need for both types of expertise when working with unstructured data. Systems experts might focus on existing tools and technologies, resulting in suboptimal integration of AI into the data infrastructure. On the other hand, AI experts may not consider the scalability, cost, and broader system requirements necessary for effective data processing. Bridging this gap and ensuring collaboration between AI and systems experts is crucial to building cohesive and efficient solutions for unlocking value from unstructured data.
Concerns about normalizing unstructured data in generative AI
There are concerns that generative AI may inadvertently normalize unstructured data, potentially removing or glossing over outliers that could hold important signals. The ability to generate large amounts of unstructured data, such as text, images, and videos, could lead to the normalization of working with unstructured data without considering its nuances. Structured data has intentional values and schemas, while unstructured data is variable and lacks such structure. This poses challenges in training AI models that deal with unstructured data, as the presence of unexpected outliers and failure modes can impact their effectiveness and accuracy.
Challenges in training AI models with unstructured data
Training AI models with unstructured data presents various challenges. One example is a classifier that becomes uncertain because of unexpected features in the data, such as exercise studios being mistaken for bowling alleys. Trust and safety issues arise when generative AI is applied in products like Microsoft Office, as it may struggle to handle sensitive information. There are also concerns about content skew: because generative AI can produce massive amounts of low-quality or misleading data, bad content and misinformation can come to dominate. Tackling these challenges requires expertise in data modeling, managing biases, and systems-level thinking to ensure that generated content is representative and reliable.
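A common way to surface cases where a classifier is uncertain is to measure the entropy of its predicted class probabilities and flag the highest-entropy inputs for review. The sketch below assumes the classifier outputs a probability list per input; the threshold and the names `predictive_entropy` and `flag_uncertain` are illustrative choices, not something described in the episode.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a list of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def flag_uncertain(predictions, threshold):
    """Return the keys whose predicted distribution has entropy above `threshold`.

    `predictions` maps an input id to its class probabilities (assumed to sum to 1).
    """
    return [
        name
        for name, probs in predictions.items()
        if predictive_entropy(probs) > threshold
    ]
```

A confident prediction such as `[0.98, 0.01, 0.01]` has low entropy, while a spread-out one such as `[0.4, 0.35, 0.25]` has high entropy and would be flagged.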
Cody Coleman and Will Gaviria Rojas (Co-founders of Coactive.ai) join the show to chat about the rising importance of unstructured data and the role of AI in unlocking the value of unstructured content. Their motto is "Content is King, and AI is the new Queen."
Given the rise of unstructured data, this episode is a must-listen for anyone working in AI.
https://coactive.ai