The New Stack Podcast

Is Apache Spark Too Costly? An Amazon Engineer Tells His Story

Nov 21, 2024
In this insightful discussion, Patrick Ames, a Principal Engineer at Amazon Web Services specializing in exabyte-scale data, shares his journey from Apache Spark to Ray. He reveals the challenges Spark posed as data volumes increased, leading to long processing times and high costs. Ames emphasizes the efficiency and cost advantages of Ray, a framework designed for scalable AI applications. He also touches on the significance of automation in daily life and the importance of community contributions to open-source innovations.
ANECDOTE

Amazon's Data Warehouse Migration

  • Patrick Ames worked on migrating Amazon's data analytics from a large Oracle data warehouse to an S3-based data catalog.
  • This catalog, initially around 50 petabytes, quickly grew to exabyte scale, with Apache Spark handling table maintenance.
INSIGHT

Why Spark Was Initially Chosen

  • Reading data from the S3 catalog required merging inserts, updates, and deletes into a final table state.
  • Spark was initially chosen for its simplicity and its ability to handle this merging efficiently using Spark SQL, as sketched below.
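
A minimal sketch of this merge pattern in PySpark, assuming hypothetical S3 paths and columns (pk, payload, ts, op). The window-function rewrite below is one common way to resolve a change log in plain Spark SQL; it is not necessarily the exact query Ames's team ran:

```python
# Sketch: resolve a change log (inserts, updates, deletes) against a base
# snapshot into a final table state. All paths and column names are
# hypothetical stand-ins.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-compaction-sketch").getOrCreate()

base = spark.read.parquet("s3://bucket/table/base/")      # columns: pk, payload, ts
deltas = spark.read.parquet("s3://bucket/table/deltas/")  # columns: pk, payload, ts, op

base.createOrReplaceTempView("base")
deltas.createOrReplaceTempView("deltas")

# Keep the newest record per primary key, then drop keys whose latest
# operation was a delete.
merged = spark.sql("""
    WITH unioned AS (
        SELECT pk, payload, ts, 'U' AS op FROM base
        UNION ALL
        SELECT pk, payload, ts, op FROM deltas
    ),
    ranked AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY pk ORDER BY ts DESC) AS rn
        FROM unioned
    )
    SELECT pk, payload, ts FROM ranked WHERE rn = 1 AND op <> 'D'
""")

merged.write.mode("overwrite").parquet("s3://bucket/table/compacted/")
```

Note that Spark's MERGE INTO statement requires a transactional table format such as Delta Lake or Iceberg; the window-function approach works on plain Parquet files in S3.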
INSIGHT

Spark's Limitations at Scale

  • Spark's limitations became apparent as data grew to hundreds of terabytes and beyond, leading to long processing times and high costs.
  • Spark's general-purpose nature made it less efficient than a specialized solution for this specific task; a sketch of what such a specialized, Ray-based approach might look like follows below.
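
For contrast, a minimal sketch of the kind of specialized, per-partition compaction a Ray-based approach enables. compact_partition and the partition paths are hypothetical stand-ins, not Amazon's actual code; only Ray's core task API (ray.init, @ray.remote, ray.get) is assumed:

```python
# Sketch: compact each table partition independently as a Ray task,
# avoiding the shuffle and planning overhead of a general-purpose engine.
import ray

ray.init()

@ray.remote
def compact_partition(partition_path: str) -> str:
    # Read this partition's base rows and deltas, resolve inserts, updates,
    # and deletes to the latest state, and write the compacted result back.
    # (Details elided; any columnar reader, e.g. pyarrow, would do here.)
    ...
    return partition_path

# Hypothetical list of partition paths to compact in parallel.
partitions = [f"s3://bucket/table/partition={i}/" for i in range(1024)]
results = ray.get([compact_partition.remote(p) for p in partitions])
print(f"compacted {len(results)} partitions")
```

Because each task touches only one partition's files, the scheduler can fan thousands of such tasks across a cluster with minimal coordination, which is the kind of cost and efficiency advantage the episode attributes to moving off Spark.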