The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Networking Optimizations for Multi-Node Deep Learning on Kubernetes with Erez Cohen - #345

Feb 5, 2020
Erez Cohen, VP of CloudX & AI at Mellanox (now part of NVIDIA), dives into the vital role of networking in deep learning. He discusses how advancements like RDMA and GPU Direct enhance multi-node deep learning on Kubernetes, highlights NVIDIA's acquisition of Mellanox, and shares insights on optimizing network switch configurability. The conversation also explores how frameworks like TensorFlow integrate with these advanced networking technologies to push the boundaries of performance in AI applications.
INSIGHT

Networking in Deep Learning

  • Deep learning training, often seen as compute-bound, becomes network-dependent when scaled out.
  • This dependence arises from the need to synchronize large models across multiple GPUs and servers on every training step (a rough traffic estimate follows below).
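A back-of-the-envelope sketch of why the network becomes the limiter, using entirely hypothetical numbers (model size, worker count, and link speeds are assumptions, not figures from the episode): in data-parallel training each worker must exchange a gradient roughly the size of the full model every step, so the exchange time is set by the fabric rather than the GPUs.

```python
# Rough estimate (hypothetical numbers) of per-step gradient-exchange time.
# Assumes a ring all-reduce, where each worker sends ~2*(W-1)/W of the gradient.

def allreduce_bytes_per_worker(num_params: int, workers: int, bytes_per_elem: int = 4) -> float:
    """Approximate per-worker traffic of a ring all-reduce over fp32 gradients."""
    grad_bytes = num_params * bytes_per_elem
    return 2 * (workers - 1) / workers * grad_bytes

def step_network_seconds(num_params: int, workers: int, link_gbps: float) -> float:
    """Time to move that traffic over a link of `link_gbps` gigabits per second."""
    traffic = allreduce_bytes_per_worker(num_params, workers)
    return traffic * 8 / (link_gbps * 1e9)

if __name__ == "__main__":
    params = 100_000_000              # e.g. a ~100M-parameter model (assumption)
    for gbps in (10, 100):            # commodity Ethernet vs. an RDMA-class fabric
        t = step_network_seconds(params, workers=8, link_gbps=gbps)
        print(f"{gbps:>3} Gb/s link: ~{t * 1000:.0f} ms of gradient exchange per step")
```

In this toy estimate the exchange alone costs hundreds of milliseconds per step on a 10 Gb/s link, which is why faster, RDMA-capable fabrics matter once training is scaled out.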
INSIGHT

Software vs. Hardware in Distributed Training

  • Distributed computing tools like Horovod address the software coordination side of distributed training (a minimal sketch follows below).
  • However, efficient data transfer, especially for large models, remains crucial.
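A minimal sketch of the coordination layer Horovod provides, assuming TensorFlow/Keras with horovod.tensorflow.keras; the model, learning rate, and data here are placeholders rather than anything discussed in the episode.

```python
# Minimal Horovod + Keras sketch: one process per GPU, gradients averaged via all-reduce.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # hvd.rank()/hvd.size() identify each worker in the job

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the worker count and wrap the optimizer so that
# gradients are averaged with an all-reduce before each update.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# model.fit(x_train, y_train, callbacks=callbacks, epochs=1)  # data loading omitted
```

Launched with `horovodrun -np <workers>`, each process trains the same model while the wrapped optimizer handles the all-reduce; the underlying transport (NCCL or MPI) is where technologies like RDMA and GPU Direct can take over the heavy data movement.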
ANECDOTE

Evolution of Distributed Training

  • Simple distributed training approaches, such as a single central parameter server, face scalability issues.
  • Horovod and other all-reduce-based solutions improve synchronization but still rely on large data transfers (a rough comparison follows below).
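A rough comparison of the two approaches, again with hypothetical sizes (the ~100M-parameter gradient is an assumption): a single parameter server must ingest one full gradient per worker per step, so its inbound traffic grows linearly with the worker count, while a ring all-reduce keeps per-worker traffic roughly constant, though still on the order of the full gradient, which is the "large data transfers" this snip refers to.

```python
# Hypothetical per-step traffic: central parameter server vs. ring all-reduce.
GRAD_BYTES = 400e6  # e.g. ~100M fp32 parameters (assumption)

def parameter_server_inbound(workers: int) -> float:
    """Bytes the central server receives per step: every worker pushes a full gradient."""
    return workers * GRAD_BYTES

def ring_allreduce_per_worker(workers: int) -> float:
    """Bytes each worker sends per step in a ring all-reduce: ~2*(W-1)/W of the gradient."""
    return 2 * (workers - 1) / workers * GRAD_BYTES

for w in (2, 8, 32, 128):
    ps = parameter_server_inbound(w) / 1e9
    ar = ring_allreduce_per_worker(w) / 1e9
    print(f"{w:>3} workers: parameter server ingests {ps:6.1f} GB/step, "
          f"all-reduce worker sends {ar:4.2f} GB/step")
```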