

Scaling TensorFlow at LinkedIn with Jonathan Hung - #314
Nov 4, 2019
Jonathan Hung, a Sr. Software Engineer at LinkedIn, shares insights on scaling TensorFlow within their infrastructure. He discusses leveraging existing Hadoop clusters for deep learning, introducing TonY, a framework that runs TensorFlow jobs natively on Hadoop. The conversation delves into the challenges of resource management and fault tolerance, particularly in GPU allocation. Hung also highlights LinkedIn's transition to Kubernetes to enhance machine learning workloads and improve the experience for engineers navigating complex AI systems.
AI Snips
Early Deep Learning Exploration
- LinkedIn explored TensorFlow for deep learning to improve member insights.
- Existing frameworks for running TensorFlow on Hadoop clusters didn't meet their needs.
Challenges with TensorFlow on Spark
- TensorFlow on Spark lacked fault tolerance and GPU support.
- Spark's resource profiles didn't align with TensorFlow's varying job types.
Motivation for TonY
- LinkedIn chose to build TonY because it already had Hadoop clusters and in-house Hadoop expertise.
- Their Hadoop ecosystem is mature, with thousands of nodes and petabytes of compute.