
Networking Optimizations for Multi-Node Deep Learning on Kubernetes with Erez Cohen - #345
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
00:00
Integrating Frameworks for Distributed Deep Learning
This chapter explores integrating TensorFlow with libraries such as Horovod and NVIDIA's NCCL for distributed deep learning training. It covers the configurations needed to manage workloads across GPUs and servers, highlights technologies like RDMA and GPUDirect, and emphasizes open-source compatibility with other frameworks such as PyTorch.
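For context, the TensorFlow–Horovod integration discussed in the chapter is typically set up in user code along these lines. This is a minimal sketch assuming Horovod's TensorFlow 2 API; the model, optimizer, and learning-rate scaling shown are illustrative and not taken from the episode.

```python
# Minimal Horovod + TensorFlow 2 sketch: one process per GPU, gradients
# allreduced across workers (over NCCL, and RDMA/GPUDirect where available).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to its own local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and optimizer; scale the learning rate by worker count.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.optimizers.SGD(0.01 * hvd.size())

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    # Wrap the tape so gradients are averaged across all workers.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

In practice the initial model weights are also broadcast from rank 0 to the other workers before training begins, so every process starts from the same state.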