
Scaling TensorFlow at LinkedIn with Jonathan Hung - #314
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
00:00
Understanding Fault Tolerance in Distributed Computing
This chapter explores fault tolerance in distributed computing frameworks like Hadoop, MapReduce, and Spark, with a focus on TensorFlow's checkpointing mechanism. It highlights how storing TensorFlow checkpoints on HDFS lets a machine learning job resume from its last saved state after a failure rather than restart from the beginning.
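The checkpoint-and-resume idea in this chapter can be illustrated with a minimal sketch. It uses only the Python standard library and a local temporary directory as a stand-in for HDFS; it is not TensorFlow's checkpoint API, and the names `save_checkpoint`, `latest_checkpoint`, `train`, and `demo` are hypothetical helpers made up for illustration.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return the path of the newest checkpoint file, or None if none exist."""
    names = [n for n in os.listdir(ckpt_dir)
             if n.startswith("ckpt-") and n.endswith(".json")]
    if not names:
        return None
    # The step number is encoded in the file name, e.g. ckpt-4.json.
    newest = max(names, key=lambda n: int(n[len("ckpt-"):-len(".json")]))
    return os.path.join(ckpt_dir, newest)


def save_checkpoint(ckpt_dir, step, state):
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    final_path = os.path.join(ckpt_dir, f"ckpt-{step}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)


def train(ckpt_dir, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the newest checkpoint if one exists."""
    step, state = 0, {"weight": 0.0}
    path = latest_checkpoint(ckpt_dir)
    if path is not None:
        with open(path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated worker failure at step {step}")
        state["weight"] += 0.5  # stand-in for one optimizer update
        step += 1
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state


def demo():
    """Run a job that fails partway, then rerun it and let it resume."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        try:
            train(ckpt_dir, total_steps=10, save_every=2, fail_at=5)
        except RuntimeError:
            pass  # a cluster scheduler would restart the job at this point
        resumed_from = latest_checkpoint(ckpt_dir)
        step, state = train(ckpt_dir, total_steps=10, save_every=2)
        return os.path.basename(resumed_from), step, state["weight"]
```

In this sketch the first run fails at step 5, so the rerun picks up from the checkpoint written at step 4 and only repeats one step of work. In a real cluster the checkpoint directory would live on shared, replicated storage such as HDFS, so that a restarted worker on a different machine can still read it.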