The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Networking Optimizations for Multi-Node Deep Learning on Kubernetes with Erez Cohen - #345

Feb 5, 2020
Erez Cohen, VP of CloudX & AI at Mellanox (now part of NVIDIA), dives into the vital role of networking in deep learning. He discusses how advancements like RDMA and GPU Direct enhance multi-node deep learning on Kubernetes, highlights NVIDIA's acquisition of Mellanox, and shares insights on optimizing network switch configurability. The conversation also explores how frameworks like TensorFlow integrate with these advanced networking technologies to push the boundaries of performance in AI applications.
INSIGHT

Networking in Deep Learning

  • Deep learning training, often seen as compute-bound, becomes network-dependent when scaled out.
  • This dependence arises from the need to synchronize large models across multiple GPUs and servers on every training step (a rough traffic estimate follows below).
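A back-of-the-envelope sketch of why the network becomes the limiter, using entirely hypothetical numbers (model size, worker count, and link speeds are assumptions, not figures from the episode): in data-parallel training each worker must exchange a gradient roughly the size of the full model every step, so the exchange time is set by the fabric rather than the GPUs.

```python
# Rough estimate (hypothetical numbers) of per-step gradient-exchange time.
# Assumes a ring all-reduce, where each worker sends ~2*(W-1)/W of the gradient.

def allreduce_bytes_per_worker(num_params: int, workers: int, bytes_per_elem: int = 4) -> float:
    """Approximate per-worker traffic of a ring all-reduce over fp32 gradients."""
    grad_bytes = num_params * bytes_per_elem
    return 2 * (workers - 1) / workers * grad_bytes

def step_network_seconds(num_params: int, workers: int, link_gbps: float) -> float:
    """Time to move that traffic over a link of `link_gbps` gigabits per second."""
    traffic = allreduce_bytes_per_worker(num_params, workers)
    return traffic * 8 / (link_gbps * 1e9)

if __name__ == "__main__":
    params = 100_000_000              # e.g. a ~100M-parameter model (assumption)
    for gbps in (10, 100):            # commodity Ethernet vs. an RDMA-class fabric
        t = step_network_seconds(params, workers=8, link_gbps=gbps)
        print(f"{gbps:>3} Gb/s link: ~{t * 1000:.0f} ms of gradient exchange per step")
```

In this toy estimate the exchange alone costs hundreds of milliseconds per step on a 10 Gb/s link, which is why faster, RDMA-capable fabrics matter once training is scaled out.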
INSIGHT

Software vs. Hardware in Distributed Training

  • Distributed computing tools like Horovod address the software coordination side of distributed training (a minimal sketch follows below).
  • However, efficient data transfer, especially for large models, remains crucial.
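A minimal sketch of the coordination layer Horovod provides, assuming TensorFlow/Keras with horovod.tensorflow.keras; the model, learning rate, and data here are placeholders rather than anything discussed in the episode.

```python
# Minimal Horovod + Keras sketch: one process per GPU, gradients averaged via all-reduce.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # hvd.rank()/hvd.size() identify each worker in the job

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the worker count and wrap the optimizer so that
# gradients are averaged with an all-reduce before each update.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# model.fit(x_train, y_train, callbacks=callbacks, epochs=1)  # data loading omitted
```

Launched with `horovodrun -np <workers>`, each process trains the same model while the wrapped optimizer handles the all-reduce; the underlying transport (NCCL or MPI) is where technologies like RDMA and GPU Direct can take over the heavy data movement.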
ANECDOTE

Evolution of Distributed Training

  • Simple distributed training approaches, such as a single central parameter server, face scalability issues.
  • Horovod and other all-reduce-based solutions improve synchronization but still rely on large data transfers (a rough comparison follows below).
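A rough comparison of the two approaches, again with hypothetical sizes (the ~100M-parameter gradient is an assumption): a single parameter server must ingest one full gradient per worker per step, so its inbound traffic grows linearly with the worker count, while a ring all-reduce keeps per-worker traffic roughly constant, though still on the order of the full gradient, which is the "large data transfers" this snip refers to.

```python
# Hypothetical per-step traffic: central parameter server vs. ring all-reduce.
GRAD_BYTES = 400e6  # e.g. ~100M fp32 parameters (assumption)

def parameter_server_inbound(workers: int) -> float:
    """Bytes the central server receives per step: every worker pushes a full gradient."""
    return workers * GRAD_BYTES

def ring_allreduce_per_worker(workers: int) -> float:
    """Bytes each worker sends per step in a ring all-reduce: ~2*(W-1)/W of the gradient."""
    return 2 * (workers - 1) / workers * GRAD_BYTES

for w in (2, 8, 32, 128):
    ps = parameter_server_inbound(w) / 1e9
    ar = ring_allreduce_per_worker(w) / 1e9
    print(f"{w:>3} workers: parameter server ingests {ps:6.1f} GB/step, "
          f"all-reduce worker sends {ar:4.2f} GB/step")
```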