Philip Kiely, an AI infrastructure expert at Baseten, dives into the complexities of running generative AI models in production. He shares insights on the importance of selecting the right model based on product requirements and discusses key deployment strategies, including architecture and performance monitoring. Challenges like model quantization and the balance between open-source and proprietary models are explored. Philip also highlights future trends such as local inference, emphasizing the need for compliance in sectors like healthcare.
Understanding product strategy is crucial, as it influences model selection and the overall approach to AI deployment.
The architectural evolution of AI applications, particularly through compound methods, presents increased complexity in orchestration and inference.
Emerging trends highlight a shift towards hybrid inference solutions that balance local and cloud processing, emphasizing user privacy and compliance.
Deep dives
Understanding Open Models
The concept of open models is defined through parameters like open weights, data, and code, which facilitate transparency in AI development. True open models must meet stringent criteria that allow users to access the full process of model creation and deployment. However, practical interpretations can vary, with models such as Meta's Llama challenging the traditional definitions by utilizing custom licenses. As more companies attempt to balance openness with proprietary interests, the industry must collaboratively navigate the complexities of licensing while ensuring a fertile environment for innovation.
Operational Decision-Making
Before deploying AI systems, teams must clarify their product strategy, determining whether they are integrating AI features into existing products or creating AI-native solutions. This decision impacts the selection of models, prompting discussions about the feasibility of available tools and the potential need for custom developments. The contrasts between starting with open-source prototypes versus building proprietary solutions significantly affect the subsequent steps in the development cycle. Ultimately, understanding the operational landscape is vital for aligning technical choices with business goals.
Challenges from Prototyping to Production
Transitioning from conceptualization to functioning AI models often reveals various hurdles, including inadequate evaluation methods for model performance. Developers may rely on subjective assessments, leading to uncertainty in model effectiveness, which risks the viability of the final product. Rigorous evaluation frameworks help to define clear criteria for success, allowing teams to determine which model best meets their needs. The complexity of this phase can result in frustration for developers as they face technical obstacles that may divert them from their original vision.
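As an illustration of what a rigorous evaluation framework might look like at its simplest, the sketch below scores candidate models against a fixed set of test cases and picks the highest scorer. Everything in it is an assumption made for illustration: the names `pass_rate` and `pick_best`, the substring-match criterion, and the stand-in "models", which are plain functions so that the example runs without a model server. It is not a method described in the episode.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluation case: (prompt, text the answer is expected to contain).
EvalCase = Tuple[str, str]
Model = Callable[[str], str]


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected text (case-insensitive)."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in model(prompt).lower()
    )
    return passed / len(cases)


def pick_best(models: Dict[str, Model], cases: List[EvalCase]) -> Tuple[str, float]:
    """Return (name, score) of the highest-scoring candidate; ties go to the first listed."""
    scores = {name: pass_rate(fn, cases) for name, fn in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Stand-in "models" (assumptions): a real setup would call an inference endpoint.
def model_a(prompt: str) -> str:
    return "Paris is the capital of France." if "France" in prompt else "I am not sure."


def model_b(prompt: str) -> str:
    return "I am not sure."


CASES: List[EvalCase] = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

if __name__ == "__main__":
    print(pick_best({"model_a": model_a, "model_b": model_b}, CASES))
```

A substring check is a deliberately crude criterion. The point is only that a fixed, written-down test set replaces subjective impressions with a number that can be compared across models.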
Architectural Considerations in AI Systems
The architectural design of AI applications is shaping the future of model deployment, particularly through retrieval augmented generation and other compound AI methods. Multi-model pipelines are emerging as a preferred method to combine different strengths of various models, optimizing responsiveness and capability. However, incorporating multiple models increases the complexity of inference and necessitates robust orchestration and error handling to maintain reliability. As developers explore these advanced frameworks, they must also grapple with the associated challenges of ensuring performance and scalability in production.
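To make the orchestration and error-handling point concrete, here is a minimal sketch of one common pattern: try a primary model, retry on failure, then fall back to a secondary model. The function `call_with_fallback`, its parameters, and the two stand-in models are hypothetical and exist only for this example; a production system would likely add timeouts, logging, and narrower exception handling.

```python
import time
from typing import Callable, List, Optional

Model = Callable[[str], str]


def call_with_fallback(
    prompt: str,
    models: List[Model],
    retries: int = 1,
    backoff_s: float = 0.0,
) -> str:
    """Try each model in order; retry each up to `retries` extra times before moving on."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                return model(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers retry/fallback
                last_error = exc
                time.sleep(backoff_s * (attempt + 1))  # linear backoff between attempts
    raise RuntimeError("all models failed") from last_error


# Stand-in models (assumptions): the primary always fails, the fallback always answers.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary model timed out")


def small_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


if __name__ == "__main__":
    print(call_with_fallback("Summarize this document.", [flaky_primary, small_fallback]))
```

Each additional model in a pipeline adds another place where a call like this can fail, which is why the orchestration layer, not the individual models, usually ends up owning the retry and fallback policy.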
The Future of AI Infrastructure
Emerging trends suggest a need for hybrid inference solutions that meld local and cloud-based AI processing, aligning with user privacy and latency requirements. Companies in regulated industries are showing heightened interest in self-hosted and compliant AI systems that adhere to security standards while leveraging advanced AI capabilities. As open-source models evolve, there is growing anticipation for multimodal approaches that blend functionality across text, vision, and audio. This combination of capabilities may drive innovation, but it will also present new challenges in evaluating and integrating diverse AI models into cohesive applications.
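One way to picture a hybrid inference setup is as a routing decision made per request. The sketch below is an assumption-laden illustration rather than anything described in the episode: the `Request` fields, the `CLOUD_ROUND_TRIP_MS` constant, and the routing policy itself are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    contains_sensitive_data: bool = False  # assumed to be flagged upstream
    max_latency_ms: int = 2000             # latency budget for this request


# Assumed network round-trip cost, chosen only for illustration.
CLOUD_ROUND_TRIP_MS = 300


def choose_backend(req: Request) -> str:
    """Return "local" or "cloud" for a request under the assumed policy."""
    if req.contains_sensitive_data:
        return "local"  # privacy/compliance: keep the data on the device
    if req.max_latency_ms < CLOUD_ROUND_TRIP_MS:
        return "local"  # budget too tight for a network round trip
    return "cloud"      # otherwise prefer the larger hosted model


if __name__ == "__main__":
    print(choose_backend(Request("Patient notes ...", contains_sensitive_data=True)))
    print(choose_backend(Request("Autocomplete this word", max_latency_ms=100)))
    print(choose_backend(Request("Write a long report")))
```

Checking the privacy condition first reflects the compliance emphasis above: a request flagged as sensitive stays local regardless of its latency budget.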
Summary
In this episode, Philip Kiely from Baseten talks about the intricacies of running open models in production. Philip shares his journey into AI and ML engineering, highlighting the importance of understanding product-level requirements and selecting the right model for deployment. The conversation covers the operational aspects of deploying AI models, including model evaluation, compound AI, and model serving frameworks such as TensorFlow Serving and AWS SageMaker. Philip also discusses the challenges of model quantization, rapid model evolution, and monitoring and observability in AI systems, offering valuable insights into future trends in AI, including local inference and the competition between open-source and proprietary models.
Announcements
Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
Your host is Tobias Macey and today I'm interviewing Philip Kiely about running open models in production
Interview
Introduction
How did you get involved in machine learning?
Can you start by giving an overview of the major decisions to be made when planning the deployment of a generative AI model?
How does the model selected in the beginning of the process influence the downstream choices?
In terms of application architecture, the major patterns that I've seen are RAG, fine-tuning, multi-agent, or large model. What are the most common methods that you see? (and any that I failed to mention)
How has the rapid succession of model generations impacted the ways that teams think about their overall application? (capabilities, features, architecture, etc.)
In terms of model serving, I know that Baseten created Truss. What are some of the other notable options that teams are building with?
What is the role of the serving framework in the context of the application?
There are also a large number of inference engines that have been released. What are the major players in that arena?
What are the features and capabilities that they are each basing their competitive advantage on?
For someone who is new to AI Engineering, what are some heuristics that you would recommend when choosing an inference engine?
Once a model (or set of models) is in production and serving traffic it's necessary to have visibility into how it is performing. What are the key metrics that are necessary to monitor for generative AI systems?
In the event that one (or more) metrics are trending negatively, what are the levers that teams can pull to improve them?
When running models constructed with e.g. linear regression or deep learning there was a common issue with "concept drift". How does that manifest in the context of large language models, particularly when coupled with performance optimization?
What are the most interesting, innovative, or unexpected ways that you have seen teams manage the serving of open gen AI models?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with generative AI model serving?
When is Baseten the wrong choice?
What are the future trends and technology investments that you are focused on in the space of AI model serving?
From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.