Bernie Wu, VP of Strategic Partnerships at MemVerge, brings over 25 years of experience in data infrastructure. He discusses the critical role of innovative memory solutions in optimizing Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) workflows. The conversation covers how composable memory alleviates memory-capacity limits, enables more efficient resource scheduling, and helps address GPU underutilization. Bernie also touches on the importance of collaboration tools for better memory management and on advances in GPU networking technologies that are shaping the future of AI.
INSIGHT
Memory-Bound GPUs
Transformer models are data-intensive, requiring substantial memory.
GPU purchases are often sized to fit the model in memory rather than to match compute needs, so GPUs end up underutilized when memory is the binding constraint (see the sizing sketch below).
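For a sense of scale, here is a back-of-the-envelope sketch of why the purchase math works out that way; the 16-bytes-per-parameter rule of thumb for mixed-precision Adam training and the 80 GB HBM figure are assumptions, not numbers from the episode:

```python
# Back-of-the-envelope sizing: why GPU counts are often dictated by memory,
# not compute. The 16-bytes-per-parameter rule of thumb (fp16 weights + grads,
# fp32 master weights, Adam moments) and 80 GB of HBM are assumptions.
import math

def training_state_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    """Approximate memory needed just for weights, gradients, and optimizer state."""
    return params_billion * bytes_per_param

def gpus_needed(params_billion: float, hbm_gb: float = 80.0) -> int:
    """Minimum GPU count to hold that state (activations and KV caches push it higher)."""
    return math.ceil(training_state_gb(params_billion) / hbm_gb)

for size in (7, 13, 70, 175):
    print(f"{size}B params -> ~{training_state_gb(size):.0f} GB of state, "
          f">= {gpus_needed(size)} x 80 GB GPUs before any compute considerations")
```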
INSIGHT
Checkpointing Importance
Checkpointing is crucial for preserving model state, especially in large-scale training runs where failures are likely.
Memory-level checkpointing offers faster saving and restoration compared to file systems, improving resilience.
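A minimal sketch of the idea, assuming a RAM-backed path (tmpfs) as a stand-in for a memory tier rather than MemVerge's actual implementation:

```python
# A minimal sketch of the memory-level checkpointing idea (not MemVerge's
# implementation): time a checkpoint written to a RAM-backed path (tmpfs,
# standing in for a memory tier) against one written to the file system.
# Requires Linux for /dev/shm; the toy model size is an assumption.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

def timed_save(path: str) -> float:
    start = time.perf_counter()
    torch.save(model.state_dict(), path)
    return time.perf_counter() - start

print(f"memory-backed path: {timed_save('/dev/shm/ckpt.pt'):.3f} s")
print(f"local file system:  {timed_save('./ckpt.pt'):.3f} s")
```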
INSIGHT
Checkpoint Bottlenecks
Large checkpoints can create bottlenecks when writing to file systems.
Caching checkpoints in a memory pool enables faster dumping and asynchronous offloading, minimizing interruptions.
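One way to picture the asynchronous-offload pattern; the paths below are assumptions and this is illustrative, not the product API:

```python
# Illustrative two-stage checkpoint: dump quickly into a memory tier, then let
# a background thread drain it to durable storage so the training loop is only
# stalled for the fast in-memory write. Paths below are assumptions.
import shutil
import threading
import torch

def checkpoint_async(state_dict,
                     fast_path: str = "/dev/shm/ckpt.pt",
                     durable_path: str = "/mnt/checkpoints/ckpt.pt") -> threading.Thread:
    torch.save(state_dict, fast_path)              # fast: blocks training briefly
    t = threading.Thread(target=shutil.copyfile,   # slow copy runs off the critical path
                         args=(fast_path, durable_path),
                         daemon=True)
    t.start()
    return t  # join (or check) before reusing fast_path for the next checkpoint
```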
Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP of Strategic Partnerships/Business Development at MemVerge.
// Abstract
Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint.
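As a hedged illustration of what "configured as a secondary tier" can look like on a stock Linux host, where CXL expander memory usually surfaces as a CPU-less NUMA node; the node IDs and workload script are hypothetical:

```python
# Illustrative only: on Linux, a CXL memory expander typically appears as a
# CPU-less NUMA node. One simple way to steer a workload's allocations onto
# that second tier is to launch it under numactl. The node IDs and the
# rag_pipeline.py script are hypothetical; check `numactl -H` on a real system.
import subprocess

CXL_NODE = "2"      # assumed NUMA node backed by CXL memory
CPU_NODES = "0,1"   # assumed nodes with local DRAM and CPUs

subprocess.run([
    "numactl",
    f"--cpunodebind={CPU_NODES}",  # keep execution on the CPU-bearing nodes
    f"--preferred={CXL_NODE}",     # prefer the CXL tier for new allocations
    "python", "rag_pipeline.py",   # hypothetical workload
], check=True)
```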
We will highlight some recent work we’ve done in integrating this novel class of memory into LLM/RAG/vector database frameworks and workflows.
Disaggregated shared memory is envisioned to offer high-performance, low-latency caches for LLM model/pipeline checkpoints, KV caches during distributed inference, LoRA adapters, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months.
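To see why a pooled tier is attractive for KV caches in particular, a rough sizing sketch under assumed model dimensions:

```python
# Rough KV-cache sizing, to show why a pooled memory tier is attractive for
# long-context serving. The model dimensions below are assumptions, loosely
# in the range of a large open-weight model; adjust for your own deployment.
def kv_cache_gb(layers: int = 80, kv_heads: int = 8, head_dim: int = 128,
                seq_len: int = 32_768, batch: int = 8,
                bytes_per_elem: int = 2) -> float:
    # two tensors (K and V) per layer, per token, per sequence in the batch
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(f"~{kv_cache_gb():.0f} GB of KV cache for an 8-way batch at 32K tokens")
```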
// Bio
Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA.
// MLOps Swag/Merch
https://mlops-community.myshopify.com/
// Related Links
Website: www.memverge.com
Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/
Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo
Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg