Scaling TensorFlow at LinkedIn with Jonathan Hung - #314

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Understanding Fault Tolerance in Distributed Computing

This chapter explores fault tolerance in distributed computing frameworks like Hadoop, MapReduce, and Spark, with a focus on TensorFlow's checkpointing mechanism. It highlights how storing TensorFlow checkpoints on HDFS lets a machine learning job resume from its last saved state after a failure instead of restarting from scratch.
