The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Scaling TensorFlow at LinkedIn with Jonathan Hung - #314

Nov 4, 2019

Jonathan Hung, a Sr. Software Engineer at LinkedIn, shares insights on scaling TensorFlow within their infrastructure. He discusses leveraging existing Hadoop clusters for deep learning, introducing TonY, a framework that runs TensorFlow jobs natively on Hadoop. The conversation delves into the challenges of resource management and fault tolerance, particularly in GPU allocation. Hung also highlights LinkedIn's transition to Kubernetes to enhance machine learning workloads and improve the experience for engineers navigating complex AI systems.

AI Snips

ANECDOTE

Early Deep Learning Exploration

LinkedIn explored TensorFlow for deep learning to improve member insights.
Existing frameworks for running TensorFlow on Hadoop clusters didn't meet their needs.

ANECDOTE

Challenges with TensorFlow on Spark

TensorFlow on Spark lacked fault tolerance and GPU support.
Spark's resource profiles didn't align with TensorFlow's varying job types.

INSIGHT

Motivation for TonY

LinkedIn chose to build TonY because it already had Hadoop clusters and in-house Hadoop expertise.
Their Hadoop ecosystem is mature, with thousands of nodes and petabytes of compute.