
How AI Is Built
Real engineers. Real deployments. Zero hype. We interview the top engineers who actually put AI in production and unpack what they have figured out through years of experience. Hosted by Nicolay Gerold, CEO of Aisbach and CTO at Proxdeal and Multiply Content.
Latest episodes

May 27, 2025 • 1h 7min
#050 Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster
Paul Iusztin, an AI engineer with eight years in the field, discusses the need to bypass common frameworks and fine-tuning in AI development. He emphasizes a hands-on approach, advocating for intuition over imitation to foster true innovation. His mantra? Build quickly, then refine without over-reliance on tools. The conversation also touches on the challenges of integrating large language models, the importance of tailored solutions, and innovative writing assistant development. Paul’s insights challenge the status quo and inspire a fresh perspective on AI production.

May 27, 2025 • 11min
#050 TAKEAWAYS Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster
Nicolay here,

Most AI developers are drowning in frameworks and hype. This conversation is about cutting through the noise and actually getting something into production.

Today I have the chance to talk to Paul Iusztin, who's spent 8 years in AI - from writing CUDA kernels in C++ to building modern LLM applications. He currently writes about production AI systems and is building his own AI writing assistant.

His philosophy is refreshingly simple: stop overthinking, start building, and let patterns emerge through use.

The key insight that stuck with me: "If you don't feel the algorithm - like have a strong intuition about how components should work together - you can't innovate, you just copy paste stuff." This hits hard because so much of current AI development is exactly that - copy-pasting from tutorials without understanding the why.

Paul's approach to frameworks is particularly controversial. He uses LangChain and similar tools for quick prototyping - maybe an hour or two to validate an idea - then throws them away completely. "They're low-code tools," he says. "Not good frameworks to build on top of."

Instead, he advocates for writing your own database layers and using industrial-grade orchestration tools. Yes, it's more work upfront. But when you need to debug or scale, you'll thank yourself.

In the podcast, we also cover:
- Why fine-tuning is almost always the wrong choice
- The "just-in-time" learning approach for staying sane in AI
- Building writing assistants that actually preserve your voice
- Why robots, not chatbots, are the real endgame

💡 Core Concepts

Agentic Patterns: These patterns seem complex but are actually straightforward to implement once you understand the core loop (see the sketch after this section).
- ReAct: Agents that Reason, Act, and Observe in a loop
- Reflection: Agents that review and improve their own outputs

Fine-tuning vs Base Model + Prompting: Fine-tuning involves taking a pre-trained model and training it further on your specific data. The alternative is using base models with careful prompting and context engineering. Paul's take: "Fine-tuning adds so much complexity... if you add fine-tuning to create a new feature, it's just from one day to one week."

RAG: A technique where you retrieve relevant documents/information and include them in the LLM's context to generate better responses. Paul's approach: "In the beginning I also want to avoid RAG and just introduce a more guided research approach. Like I say, hey, these are the resources that I want to use in this article."
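To make that core loop concrete, here is a minimal sketch of the ReAct pattern in Python: reason, act, observe, repeat. `call_llm` and the `tools` mapping are hypothetical stand-ins for whatever completion client and tool set you use, not any specific framework's API:

```python
def react_agent(question, tools, call_llm, max_steps=5):
    """Minimal ReAct loop: the agent reasons, picks an action, observes
    the result, and repeats until it decides to answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Reason: the model thinks and commits to an action or an answer.
        step = call_llm(
            "Think step by step. Reply either as\n"
            "ACTION: <tool> <input>  or  ANSWER: <final answer>\n\n" + transcript
        )
        transcript += step + "\n"
        if step.startswith("ANSWER:"):  # the agent is done
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("ACTION:"):  # Act, then Observe
            name, _, arg = step.removeprefix("ACTION:").strip().partition(" ")
            result = tools[name](arg) if name in tools else "unknown tool"
            transcript += f"OBSERVATION: {result}\n"
    return "gave up after max_steps"
```

Reflection works the same way: feed the agent's own output back through the loop and ask for a critique before accepting it.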
📶 Connect with Paul:
- LinkedIn
- X / Twitter
- Newsletter
- GitHub
- Book

📶 Connect with Nicolay:
- LinkedIn
- X / Twitter
- Bluesky
- Website
- My Agency Aisbach (for AI implementations / strategy)

⏱️ Important Moments
- From CUDA to LLMs: [02:20] Paul's journey from writing CUDA kernels and 3D object detection to modern AI applications.
- AI Content Is Natural Evolution: [11:19] Why AI writing tools are like the internet transition for artists - tools change, creativity remains.
- The Framework Trap: [36:41] "I see them as no code or low code tools... not good frameworks to build on top of."
- Fine-Tuning Complexity Bomb: [27:41] How fine-tuning turns 1-day features into 1-week experiments.
- End-to-End First: [22:44] "I don't focus on accuracy, performance, or latency initially. I just want an end-to-end process that works."
- The Orchestration Solution: [40:04] Why Temporal, DBOS, and Restate beat LLM-specific orchestrators.
- Hype Filtering System: [54:06] Paul's approach: read about new tools, wait 2-3 months, only adopt if still relevant.
- Just-in-Time vs Just-in-Case: [57:50] The crucial difference between learning for potential needs vs immediate application.
- Robot Vision: [50:29] Why LLMs are just stepping stones to embodied AI and the unsolved challenges ahead.

🛠️ Tools & Tech Mentioned
- LangGraph (for prototyping only)
- Temporal (durable execution)
- DBOS (simpler orchestration)
- Restate (developer-friendly orchestration)
- Ray (distributed compute)
- UV (Python packaging)
- Prefect (workflow orchestration)

📚 Recommended Resources
- The Economist Style Guide (for writing)
- Brandon Sanderson's Writing Approach (worldbuilding first)
- LangGraph Academy (free, covers agent patterns)
- Ray Documentation (Paul's next deep dive)

🔮 What's Next
Next week, we will take a detour and go into the networking behind voice AI with Russell D'Sa from LiveKit.

💬 Join The Conversation
Follow How AI Is Built on YouTube, Bluesky, or Spotify.

If you have any suggestions for future guests, feel free to leave them in the comments or write me (Nicolay) directly on LinkedIn, X, or Bluesky. Or at nicolay.gerold@gmail.com.

I will be opening a Discord soon to get you guys more involved in the episodes! Stay tuned for that.

♻️ I am trying to build the new platform for engineers to share the experience they have earned after building and deploying stuff into production. Pay it forward by sharing with one engineer who's facing similar challenges. That's the agreement - I deliver practical value, you help grow this resource for everyone. ♻️

May 20, 2025 • 1h 3min
#049 BAML: The Programming Language That Turns LLMs into Predictable Functions
In this discussion, Vaibhav Gupta, co-founder of Boundary, dives into BAML, a programming language designed to streamline AI pipelines. He emphasizes treating large language model (LLM) calls as typed functions, which enhances reliability and simplifies error handling. The podcast explores concepts like Schema-Aligned Parsing and the drawbacks of traditional JSON constraints. Vaibhav also discusses the importance of simplicity in programming and how BAML facilitates better interactions between technical and non-technical users, ensuring robust AI solutions.
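BAML has its own syntax, but the underlying idea - an LLM call with a declared input and a declared output type, validated before anything downstream runs - can be sketched in plain Python. A rough analogue, with `call_llm` as a hypothetical stand-in for any completion client:

```python
from dataclasses import dataclass
import json

@dataclass
class Resume:  # the declared "return type" of the LLM call
    name: str
    skills: list

def extract_resume(text: str, call_llm) -> Resume:
    """Treat the LLM call like a typed function: fixed input, a declared
    output schema, and validation before the result is used anywhere."""
    raw = call_llm(
        "Extract the candidate as JSON with keys `name` (string) and "
        "`skills` (list of strings).\n\n" + text
    )
    data = json.loads(raw)                     # parse...
    if not isinstance(data.get("name"), str):  # ...then enforce the contract
        raise ValueError("model broke the output contract: name")
    if not isinstance(data.get("skills"), list):
        raise ValueError("model broke the output contract: skills")
    return Resume(name=data["name"], skills=data["skills"])
```

Callers then handle one well-defined error path instead of debugging free-form model output downstream.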

May 20, 2025 • 1h 13min
#049 TAKEAWAYS BAML: The Programming Language That Turns LLMs into Predictable Functions
Dive into the fascinating world of AI with insights on treating large language models as predictable functions. Discover the importance of clear contracts for input and output to enhance reliability. The discussion also covers effective prompt engineering, including the benefits of simplicity and innovative symbol tuning techniques. Uncover the concept of Schema-Aligned Parsing to manage diverse data formats seamlessly. Plus, learn how to keep humans sharp in a field where outputs are often already correct!
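To make Schema-Aligned Parsing tangible: instead of rejecting model output that isn't strict JSON, you recover the object and coerce fields toward the expected schema. A toy Python approximation of the idea, not BAML's actual algorithm:

```python
import json
import re

def schema_aligned_parse(raw: str, schema: dict):
    """Lenient parse in the spirit of Schema-Aligned Parsing: find the
    JSON object even when the model wraps it in prose or code fences,
    then coerce each field toward its expected type."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # strip surrounding chatter
    if not match:
        raise ValueError("no JSON object found")
    data = json.loads(match.group())
    out = {}
    for field, typ in schema.items():
        value = data.get(field)
        if typ is int and isinstance(value, str) and value.strip().isdigit():
            value = int(value)  # coerce "12" -> 12
        if not isinstance(value, typ):
            raise ValueError(f"{field}: expected {typ.__name__}")
        out[field] = value
    return out

# The model answered with chatter and a quoted number; we still recover it.
raw = 'Sure! Here you go:\n{"title": "Q3 report", "pages": "12"}'
print(schema_aligned_parse(raw, {"title": str, "pages": int}))
```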

May 13, 2025 • 7min
#048 TAKEAWAYS Why Your AI Agents Need Permission to Act, Not Just Read
The discussion centers on the necessity of human oversight in AI workflows. It reveals how AI can reach 90% accuracy but still falter in trust-sensitive tasks. The innovative approach involves adding a human approval layer for crucial actions. Dexter Horthy shares insights from his '12-factor agents,' a set of guiding principles for building reliable AI. They also explore how training can pull LLMs toward mediocrity, and the infrastructure essential for effective human-in-the-loop systems.

May 11, 2025 • 57min
#048 Why Your AI Agents Need Permission to Act, Not Just Read
Dexter Horthy, the Founder of Human Layer, discusses the importance of integrating human approval into AI actions to enhance trust and utility. He shares insights from his '12-factor agents' framework, emphasizing that AI should request permission before executing critical tasks. The conversation delves into the limitations of current AI capabilities, the challenges of managing human-in-the-loop systems, and the need for robust context engineering. Dexter's approach aims to strike a balance between automation and human oversight, revolutionizing how AI can operate in real-world scenarios.
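The approval layer Dexter describes can be pictured as a simple gate between the agent and its tools: reads pass through automatically, writes block on a human decision. A rough sketch, with `execute` and `request_approval` as hypothetical stand-ins (a Slack ping, an email, a queue) rather than Human Layer's API:

```python
SAFE_ACTIONS = {"search_docs", "read_ticket"}  # read-only: act without asking

def run_action(action: str, payload: dict, execute, request_approval):
    """Permission-gated tool call: the agent may read freely, but any
    consequential action waits for an explicit human approval."""
    if action in SAFE_ACTIONS:
        return execute(action, payload)
    decision = request_approval(  # human-in-the-loop gate
        f"Agent wants to run `{action}` with {payload}. Approve?"
    )
    if decision != "approved":
        return {"status": "rejected", "action": action}
    return execute(action, payload)
```

A rejection is just another observation the agent can react to, rather than a crash.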

Mar 27, 2025 • 57min
#047 Architecting Information for Search, Humans, and Artificial Intelligence
Jorge Arango, an expert in information architecture, shares insights on aligning systems with user mental models. He emphasizes that effective designs bridge user understanding and system data, creating learnable interfaces. Jorge discusses how contextual organization simplifies decision-making, tackling the paradox of choice. He also highlights the importance of progressive disclosure to accommodate users of varying expertise, and examines the transformative impact of large language models on search experiences.

Mar 13, 2025 • 53min
#046 Building a Search Database From First Principles
Modern search is broken. There are too many pieces that are glued together.

- Vector databases for semantic search
- Text engines for keywords
- Rerankers to fix the results
- LLMs to understand queries
- Metadata filters for precision

Each piece works well alone. Together, they often become a mess.

When you glue these systems together, you create:

- Data Consistency Gaps: Your vector store knows about documents your text engine doesn't. Which is right?
- Timing Mismatches: New content appears in one system before another. Users see different results depending on which path their query takes.
- Complexity Explosion: Every new component multiplies your integration points. Three components means three connections. Five means ten.
- Performance Bottlenecks: Each hop between systems adds latency. A 200ms search becomes 800ms after passing through four components.
- Brittle Chains: When one system fails, your entire search breaks. More pieces mean more breaking points.

I recently built a system where we had query-specific post-filters but the requirement to deliver a fixed number of results to the user. A lot of the time, the query had to be run multiple times to reach that number. So we had unpredictable latency, high load on the backend (some queries hammered the database 10+ times), and a relevance cliff, where results 1-6 looked great but the later ones were poor matches.

Today on How AI Is Built, we are talking to Marek Galovic from TopK. We talk about how they built a new search database with modern components: "How would search work if we built it today?"

Cloud storage is cheap. Compute is fast. Memory is plentiful. One system that handles vectors, text, and filters together - not three systems duct-taped into one. One pass handles everything (sketched below):

Vector search + Text search + Filters → Single sorted result
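As a rough illustration of that single-pass idea (generic Python, not TopK's actual API): score vector similarity and keyword overlap together, apply filters inline, and emit one sorted list - no second system, no second hop:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query_vec, query_terms, docs, filters, k=10, alpha=0.7):
    """One pass over candidates: vectors, keywords, and filters scored
    together, returning a single sorted result list."""
    scored = []
    for doc in docs:
        if not all(f(doc) for f in filters):  # metadata filters, same pass
            continue
        vec_score = cosine(query_vec, doc["embedding"])
        terms = doc["text"].lower().split()
        text_score = sum(terms.count(t) for t in query_terms) / (len(terms) or 1)
        scored.append((alpha * vec_score + (1 - alpha) * text_score, doc["id"]))
    return sorted(scored, reverse=True)[:k]   # single sorted result

# Usage: filters are plain predicates evaluated inline, no second system.
docs = [
    {"id": 1, "embedding": [0.9, 0.1], "text": "rust search engine", "lang": "en"},
    {"id": 2, "embedding": [0.2, 0.8], "text": "moteur de recherche", "lang": "fr"},
]
print(hybrid_search([1.0, 0.0], ["search"], docs, [lambda d: d["lang"] == "en"]))
```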
Built with hand-optimized Rust kernels for both x86 and ARM, the system scales to 100M documents with 200ms P99 latency. The goal is to do search in 5 lines of code.

Marek Galovic:
- LinkedIn
- Website
- TopK Website
- TopK Docs

Nicolay Gerold:
- LinkedIn
- X (Twitter)

00:00 Introduction to TopK and Snowflake Comparison
00:35 Architectural Patterns and Custom Formats
01:30 Query Execution Engine Explained
02:56 Distributed Systems and Rust
04:12 Query Execution Process
06:56 Custom File Formats for Search
11:45 Handling Distributed Queries
16:28 Consistency Models and Use Cases
26:47 Exploring Database Versioning and Snapshots
27:27 Performance Benchmarks: Rust vs. C/C++
29:02 Scaling and Latency in Large Datasets
29:39 GPU Acceleration and Use Cases
31:04 Optimizing Search Relevance and Hybrid Search
34:39 Advanced Search Features and Custom Scoring
38:43 Future Directions and Research in AI
47:11 Takeaways for Building AI Applications

Mar 6, 2025 • 1h 3min
#045 RAG As Two Things - Prompt Engineering and Search
In this discussion, John Berryman, an expert who transitioned from aerospace engineering to search and machine learning, explores the dual nature of retrieval-augmented generation (RAG). He emphasizes separating search from prompt engineering for optimal performance. Berryman shares insights on effective prompting strategies using familiar structures, testing human evaluations, and managing token limits. He dives into the differences between chat and completion models and highlights practical techniques for tackling AI applications and workflows. It's a deep dive into enhancing interactions with AI!
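Berryman's split can be expressed directly in code: retrieval and prompt construction as two functions with their own tuning knobs, composed only at the end. A minimal sketch, with `index.top_docs` and `call_llm` as hypothetical stand-ins:

```python
def search(query: str, index) -> list[str]:
    """Concern #1: retrieval. Tune recall and ranking here, and measure
    it like any search problem, independent of the LLM."""
    return index.top_docs(query, k=5)

def build_prompt(query: str, docs: list[str]) -> str:
    """Concern #2: prompt engineering. Tune wording and token budget
    here, independent of how the docs were found."""
    context = "\n\n".join(docs)
    return (
        "Answer using only the context below. Say 'unknown' if it is not there.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# RAG = the composition; each half can be tested and improved on its own.
def rag_answer(query, index, call_llm):
    return call_llm(build_prompt(query, search(query, index)))
```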

Feb 28, 2025 • 1h 4min
#044 Graphs Aren't Just For Specialists Anymore
Semih Salihoğlu, a key contributor to the Kuzu project, dives into the future of graph databases. He elaborates on Kuzu's columnar storage design, emphasizing its efficiency over traditional row-based systems. Discussion highlights include innovative vectorized query processing that boosts performance and enhances analytics. Salihoğlu also explains the challenge of many-to-many relationships and Kuzu's unique approaches to join algorithms, making complex queries faster and less resource-intensive. Overall, this conversation unveils exciting advancements in data management for modern applications.
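To see why vectorized processing beats row-at-a-time execution, compare the two styles on a simple filter. This NumPy sketch only illustrates the principle a columnar engine like Kuzu exploits internally; it is not Kuzu's code:

```python
import numpy as np

def filter_rows(ages, incomes, min_age):
    """Row-at-a-time: per-row interpreter overhead on every tuple."""
    out = []
    for age, income in zip(ages, incomes):
        if age >= min_age:
            out.append(income)
    return out

def filter_vectorized(ages, incomes, min_age):
    """Vectorized: the same predicate evaluated over a whole column
    batch in one pass over contiguous memory."""
    mask = ages >= min_age
    return incomes[mask]

ages = np.array([25, 41, 37, 19])
incomes = np.array([30_000, 80_000, 65_000, 12_000])
print(filter_vectorized(ages, incomes, 30))  # [80000 65000]
```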