Today, we have the pleasure of chatting with Raza Habib, CEO of Humanloop, the platform for LLM collaboration and evaluation. Matt and Raza cover how to understand and optimize model performance, lessons learned about model evaluation and feedback, and explore the future of model fine-tuning.
Data Driven NYC YouTube Channel
Shownotes:
[00:00:47] How Humanloop helps product and engineering teams build reliable applications on top of large language models by providing tools to find, manage, and version prompts;
[00:03:05] Where Humanloop fits into the MAD landscape as LM / LLM Ops;
[00:02:40] The challenges of evaluating and monitoring LLMs;
[00:03:40] Why evaluating LLMs and generative AI is subjective given their stochastic nature;
[00:04:40] Why evaluation is important during both the development and production stages of LLMs to make informed design decisions, and how that challenge evolves in production into monitoring system behavior;
[00:05:40] The need for regression testing with LLMs;
[00:06:10] How Humanloop makes it easy for users to capture feedback, including implicit signals of user satisfaction such as post-interaction actions and edits to generated content;
[00:07:40] Why and how Humanloop uses guardrails in the app to ensure effective LLM use and implementation;
[00:08:38] Why using an LLM as part of the evaluation process can introduce additional uncertainty and noise, with "turtles all the way down";
[00:09:40] How evaluators on Humanloop are restricted to binary yes-or-no style questions or numerical scores to maintain reliability with LLMs in production;
[00:10:40] Why a new set of tools was needed to monitor and observe LLM performance;
[00:11:40] How Humanloop’s interactive environment allows users to find and fix bugs in a prompt, including logs to support issue identification, and then run what-if style analysis by changing the prompt or information retrieval system, allowing for quick interventions with turnaround times of minutes to hours instead of days or weeks;
[00:12:40] Why having evaluation and observability closely connected to prompt engineering tools is critical for speed;
[00:13:40] How prompt engineering is like writing software specifications for the model, enabling domain experts to have a more direct impact on product development, and democratizing access and reducing reliance on engineers to implement the desired features;
[00:15:40] The key differences between popular LLMs on the market today;
[00:18:40] How the quality of open-source models has been rapidly improving, and how LLMs use tools or function calling to access APIs to go beyond simple text-based interactions;
[00:21:22] How Humanloop empowers non-technical experts;
[00:22:40] Where Humanloop fits within the AI ecosystem as a collaborative tool for enterprises building on language models, where collaboration and robust evaluation are crucial;
[00:25:40] How Humanloop customers are often problem-aware, and how the go-to-market motion is mainly inbound but sales-led;
[00:27:48] How Humanloop serves as a central place for storing prompts and sharing learnings across teams;
[00:28:24] Raza’s thoughts on open-source vs. closed-source models in the AI community;
[00:30:40] The potential consequences of restricting access to models and Raza’s case for regulating end use cases and punishing malicious use rather than banning the technology altogether;
[00:33:40] Next steps for Humanloop.
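
For readers who want a concrete picture of two ideas from the list above, regression testing with LLMs ([00:05:40]) and evaluators restricted to yes-or-no or numerical results ([00:09:40]), here is a minimal Python sketch. It is not Humanloop's API: `Evaluator`, `pass_rate`, and `has_regressed` are hypothetical names invented for illustration, the keyword check stands in for whatever check a team would actually use, and the sample outputs are made up.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Score = Union[bool, float]


@dataclass
class Evaluator:
    """Hypothetical evaluator that may only return a yes/no verdict or a number."""
    name: str
    fn: Callable[[str], Score]

    def run(self, output: str) -> Score:
        result = self.fn(output)
        # bool is checked first because bool is a subclass of int in Python.
        if isinstance(result, bool):
            return result
        if isinstance(result, (int, float)):
            return float(result)
        raise TypeError(f"{self.name} must return a bool or a number, got {type(result).__name__}")


def pass_rate(evaluator: Evaluator, outputs: List[str]) -> float:
    """Fraction of outputs for which a yes/no evaluator answers yes."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    verdicts = [evaluator.run(o) for o in outputs]
    return sum(1 for v in verdicts if v is True) / len(verdicts)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate pass rate falls below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


# Illustrative yes/no check: does the reply mention a refund?
mentions_refund = Evaluator("mentions_refund", lambda text: "refund" in text.lower())

# Made-up outputs from an old and a new prompt version on the same test inputs.
baseline_outputs = ["We will refund you today.", "Your refund is on its way."]
candidate_outputs = ["We will refund you today.", "Please contact support."]

baseline = pass_rate(mentions_refund, baseline_outputs)
candidate = pass_rate(mentions_refund, candidate_outputs)
print(has_regressed(baseline, candidate))  # prints True
```

Keeping each verdict to a boolean or a number makes the results easy to aggregate and compare between prompt versions, which is what makes a regression check like this possible.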