How AI Is Built

Nicolay Gerold
Jul 16, 2024 • 36min

#017 Unlocking Value from Unstructured Data, Real-World Applications of Generative AI

Jonathan Yarkoni, founder of Reach Latent, discusses using generative AI to extract value from unstructured data in industries like legal and weather prediction. He delves into the challenges of AI projects, the impact of ChatGPT, and future AI trends. Topics include the reduced data-cleaning requirements of generative AI, optimized tech stacks, and the potential of synthetic data generation for training AI systems.
Jul 12, 2024 • 46min

#016 Data Processing for AI, Integrating AI into Data Pipelines, Spark

Abhishek Choudhary and Nicolay discuss data processing for AI: when to use Spark versus simpler tools, Spark's key components, integrating AI into data pipelines, latency challenges, data storage strategies, and orchestration tools, plus tips for reliability in production. Abhishek shares insights on Spark's role in managing big data, the evolution of its components, using Spark for ML applications, and improving consistency in large language models.
Jul 4, 2024 • 35min

#015 Building AI Agents for the Enterprise, Agent Cost Controls, Seamless UX

Rahul Parundekar, Founder of AI Hero, discusses building AI agents for enterprise focusing on realistic use cases, expert workflows, seamless user experiences, cost controls, and new paradigms for agent interactions beyond chat.
Jun 27, 2024 • 32min

#014 Building Predictable Agents through Prompting, Compression, and Memory Strategies

Richmond Alake and Nicolay discuss building AI agents: prompt compression, memory strategies, and experimentation techniques. They highlight prompt compression for cost reduction, memory-management components, performance optimization, prompting techniques like ReAct, and the importance of continuous experimentation in the AI field.
Jun 25, 2024 • 15min

Data Integration and Ingestion for AI & LLMs, Architecting Data Flows | changelog 3

In this episode, Kirk Marple, CEO and founder of Graphlit, shares his expertise on building efficient data integrations. Kirk breaks down his approach using relatable concepts:

The "two-sided funnel": streamlining data flow by converting various data sources into a standard format before distributing it.
Universal data streams: transforming diverse data into a single, manageable stream of information.
Parallel processing: the "competing consumer" model that allows for faster data handling.
Building blocks for success: the importance of well-defined interfaces and actor models in creating robust data systems.
Tech talk: data normalization techniques and the potential shift towards a more streamlined "Kappa architecture".
Reusable patterns: how Kirk's methods can speed up the integration of new data sources.

Kirk Marple: LinkedIn, X (Twitter), Graphlit, Graphlit Docs
Nicolay Gerold: LinkedIn, X (Twitter)

Chapters
00:00 Building Integrations into Different Tools
00:44 The Two-Sided Funnel Model for Data Flow
04:07 Using Well-Defined Interfaces for Faster Integration
04:36 Managing Feeds and State with Actor Models
06:05 The Importance of Data Normalization
10:54 Tech Stack for Data Flow
11:52 Progression towards a Kappa Architecture
13:45 Reusability of Patterns for Faster Integration

Keywords: data integration, data sources, data flow, two-sided funnel model, canonical format, stream of ingestible objects, competing consumer model, well-defined interfaces, actor model, data normalization, tech stack, Kappa architecture, reusability of patterns
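The competing consumer model Kirk mentions can be sketched in a few lines: multiple workers pull from one shared queue, so items are processed in parallel with each item handled exactly once. This is a minimal, hypothetical illustration using Python's standard library, not Graphlit's actual implementation; the normalization step is a stand-in for converting a source item into a canonical format.

```python
import queue
import threading

def consume(q: queue.Queue, results: list, lock: threading.Lock) -> None:
    # Each worker competes for items from the same shared queue.
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return  # queue drained, worker exits
        # Stand-in normalization: convert the raw item into a canonical record.
        record = {"source": item["source"], "text": item["text"].strip().lower()}
        with lock:
            results.append(record)
        q.task_done()

# Simulate a stream of ingestible objects from different sources.
q: queue.Queue = queue.Queue()
for src, text in [("slack", " Hello "), ("email", "Hi "), ("docs", " Spec ")]:
    q.put({"source": src, "text": text})

results: list = []
lock = threading.Lock()
workers = [threading.Thread(target=consume, args=(q, results, lock)) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(r["text"] for r in results))  # every item processed exactly once
```

Because all consumers draw from one queue, adding workers increases throughput without any coordination between them, which is what makes the pattern attractive for ingesting many feeds at once.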
Jun 19, 2024 • 37min

#013 ETL for LLMs, Integrating and Normalizing Unstructured Data

In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs). Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns.

"I think people are trying to optimize around the chunking strategy... But for me, that seems a bit maybe not focusing on the right area of optimization. These embedding models themselves have gone just like, so much more advanced over the past five to 10 years that regardless of what representation you're passing in, they do a pretty good job of being able to understand that information semantically and returning the relevant chunks." - Derek Tu on the importance of embedding models over chunking strategies

"If you are cost conscious and if you're worried about performance, I would definitely look at quantizing your embeddings. I think we've probably been able to, I don't have like the exact numbers here, but I think we might be saving at least half, right, in storage costs by quantizing everything." - Derek Tu on optimizing costs and performance with vector databases

Derek Tu: LinkedIn, Carbon
Nicolay Gerold: LinkedIn, X (Twitter)

Key Takeaways:

Understand your data sources: Before building your ETL pipeline, thoroughly assess the various data sources you'll be working with, such as Slack, email, and Google Docs. Consider the unique characteristics of each source, including data format, structure, and metadata.
Normalize and preprocess data: Develop strategies to normalize and preprocess the unstructured data from different sources. This may involve parsing, cleaning, and transforming the data into a standardized format that your AI models can easily consume.
Experiment with chunking strategies: There is no one-size-fits-all approach to chunking, so experiment with different strategies to find what works best for your specific use case. Consider factors like data format, structure, and the desired granularity of the chunks.
Leverage metadata and tagging: Metadata and tagging play a crucial role in organizing and retrieving relevant data for your AI models. Capture and store important metadata, such as document types, topics, and timestamps, and consider AI-powered tagging to automatically categorize your data.
Choose the right embedding model: Embedding models have advanced significantly in recent years, so focus on selecting the right model for your needs rather than over-optimizing chunking strategies. Consider model performance, dimensionality, and compatibility with your data types.
Optimize vector database usage: When working with vector databases, consider techniques like quantization to reduce storage costs and improve performance. Experiment with different configurations and settings to find the optimal balance for your use case.

Chapters
00:00 Introduction and Optimizing Embedding Models
03:00 The Evolution of Carbon and Focus on Unstructured Data
06:19 Customer Progression and Target Group
09:43 Interesting Use Cases and Handling Different Data Representations
13:30 Chunking Strategies and Normalization
20:14 Approach to Chunking and Choosing a Vector Database
23:06 Tech Stack and Recommended Tools
28:19 Future of Carbon: Multimodal Models and Building a Platform

Keywords: Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises
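Derek's point about quantizing embeddings to cut storage costs can be illustrated with binary quantization: keep only one sign bit per dimension, then compare vectors by Hamming distance. The sketch below is a stdlib-only, hypothetical illustration of the general technique, not Carbon's implementation; the vectors and helper names are made up. Storing one bit instead of a 32-bit float per dimension is a 32x reduction, well beyond the "at least half" savings Derek mentions.

```python
def binary_quantize(vec: list) -> int:
    # One bit per dimension: 1 if the component is positive, else 0.
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    # Distance between two quantized vectors = number of differing bits.
    return bin(a ^ b).count("1")

# Toy 4-dimensional embeddings (real embeddings have hundreds of dimensions).
query = [0.12, -0.98, 0.45, -0.07]
docs = {"a": [0.10, -0.90, 0.50, -0.10], "b": [-0.80, 0.60, -0.30, 0.90]}

q_bits = binary_quantize(query)
ranked = sorted(docs, key=lambda k: hamming(q_bits, binary_quantize(docs[k])))
print(ranked)  # "a" matches the query's sign pattern, so it ranks first
```

The trade-off is precision: many vector databases therefore use the quantized codes for a fast first pass and re-rank the top candidates with the full-precision vectors.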
Jun 14, 2024 • 28min

#012 Serverless Data Orchestration, AI in the Data Stack, AI Pipelines

Hugo Lu, Founder and CEO of Orchestra, discusses serverless data orchestration. Orchestra provides end-to-end visibility for managing data pipelines, infrastructure, and analytics, focusing on modular pipeline components and finding the right level of abstraction. The episode explores the evolution of data architecture, unique use cases of data orchestration tools, and orchestration for AI workloads.
Jun 7, 2024 • 40min

#011 Mastering Vector Databases, Product & Binary Quantization, Multi-Vector Search

Zain Hassan of Weaviate discusses vector databases, quantization techniques, and multi-vector search capabilities. He and Nicolay explore the future of multimodal search, brain-computer interfaces, and EEG foundation models, and how vector databases handle text, image, audio, and video data efficiently.
May 31, 2024 • 46min

#010 Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage

Data architect Anjan Banerjee discusses building complex AI and data systems, explaining data architecture with Lego analogies. Topics include selecting data tools, using Airflow for orchestration, incorporating AI into data processing, and comparing Snowflake and Databricks. The episode also covers automating data integration for comprehensive customer views.
May 24, 2024 • 28min

#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack

Jorrit Sandbrink, a data engineer, discusses the lakehouse architecture, which blends the data warehouse and data lake, key components like Delta Lake and Apache Spark, optimizations such as partitioning strategies, and data ingestion with DLT. The episode emphasizes open-source solutions, considerations in choosing tools, and the evolving data landscape.
