The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Scaling TensorFlow at LinkedIn with Jonathan Hung - #314

Nov 4, 2019
Jonathan Hung, a Sr. Software Engineer at LinkedIn, shares insights on scaling TensorFlow within LinkedIn's infrastructure. He discusses leveraging existing Hadoop clusters for deep learning and introduces TonY, a framework that runs TensorFlow jobs natively on Hadoop. The conversation delves into the challenges of resource management and fault tolerance, particularly in GPU allocation. Hung also highlights LinkedIn's transition to Kubernetes to better support machine learning workloads and improve the experience for engineers navigating complex AI systems.
ANECDOTE

Early Deep Learning Exploration

  • LinkedIn explored TensorFlow for deep learning to improve member insights.
  • Existing frameworks for running TensorFlow on Hadoop clusters didn't meet their needs.
ANECDOTE

Challenges with TensorFlow on Spark

  • TensorFlow on Spark lacked fault tolerance and GPU support.
  • Spark's resource profiles didn't align with TensorFlow's varying job types.
INSIGHT

Motivation for TonY

  • LinkedIn chose to build TonY because of its existing Hadoop clusters and in-house Hadoop expertise.
  • Their Hadoop ecosystem is mature, with thousands of nodes and petabytes of compute.