Scaling TensorFlow at LinkedIn with Jonathan Hung - #314

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Understanding Fault Tolerance in Distributed Computing

This chapter explores fault tolerance in distributed computing frameworks such as Hadoop, MapReduce, and Spark, with a focus on TensorFlow's checkpointing mechanism. It highlights how TensorFlow stores checkpoints in HDFS so that machine learning workflows can resume after a job failure instead of starting over.
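
As a rough illustration of the checkpoint-and-resume idea described above, here is a minimal, framework-agnostic sketch in Python. It is not TensorFlow's API and not the setup discussed in the episode: the function names (`save_checkpoint`, `load_latest_checkpoint`, `train`), the JSON file layout, and the `fail_at` parameter are all hypothetical, and a local directory stands in for HDFS. The "training update" is a placeholder that simply adds 0.5 to a single weight.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Persist `state` for `step`. Hypothetical helper, not a TensorFlow call."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"ckpt-{step:08d}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    # Rename only after the write finishes, so a crash mid-write
    # never leaves a partial file under the final checkpoint name.
    os.replace(tmp_path, final_path)


def load_latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest checkpoint, or (0, None) if none exists."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("ckpt-") and n.endswith(".json")
    )
    if not names:
        return 0, None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the latest checkpoint if one exists."""
    step, state = load_latest_checkpoint(ckpt_dir)
    if state is None:
        state = {"weight": 0.0}  # fresh start
    while step < total_steps:
        step += 1
        state["weight"] += 0.5  # placeholder for one real training update
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated job failure at step {step}")
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()  # stand-in for a shared HDFS directory
    try:
        train(ckpt_dir, total_steps=10, save_every=3, fail_at=8)
    except RuntimeError as err:
        print("first attempt:", err)
    # The restarted job picks up from the last saved step, not from step 0.
    print("resuming from:", load_latest_checkpoint(ckpt_dir))
    print("finished at:", train(ckpt_dir, total_steps=10, save_every=3))
```

In this sketch, work done after the last checkpoint is lost on failure, so the save interval trades checkpoint overhead against the amount of recomputation after a restart. Writing to durable shared storage rather than a worker's local disk is what would let a restarted job on a different machine find the checkpoint.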

Play episode from 18:52