The New Stack Podcast

Is Apache Spark Too Costly? An Amazon Engineer Tells His Story

Nov 21, 2024
In this insightful discussion, Patrick Ames, a Principal Engineer at Amazon Web Services specializing in exabyte-scale data, shares his journey from Apache Spark to Ray. He reveals the challenges Spark posed as data volumes increased, leading to long processing times and high costs. Ames emphasizes the efficiency and cost advantages of Ray, a framework designed for scalable AI applications. He also touches on the significance of automation in daily life and the importance of community contributions to open-source innovations.
ANECDOTE

Amazon's Data Warehouse Migration

  • Patrick Ames worked on migrating Amazon's data analytics from a large Oracle data warehouse to an S3-based data catalog.
  • This catalog, initially around 50 petabytes, quickly grew to exabyte scale, with Apache Spark handling table maintenance.
INSIGHT

Why Spark Was Initially Chosen

  • Reading data from the S3 catalog required merging inserts, updates, and deletes into a final table state.
  • Spark was initially chosen for its simplicity and its ability to handle this merging efficiently using Spark SQL, as sketched below.
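
A minimal sketch of this merge pattern in PySpark, assuming hypothetical S3 paths and columns (pk, payload, ts, op). The window-function rewrite below is one common way to resolve a change log in plain Spark SQL; it is not necessarily the exact query Ames's team ran:

```python
# Sketch: resolve a change log (inserts, updates, deletes) against a base
# snapshot into a final table state. All paths and column names are
# hypothetical stand-ins.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-compaction-sketch").getOrCreate()

base = spark.read.parquet("s3://bucket/table/base/")      # columns: pk, payload, ts
deltas = spark.read.parquet("s3://bucket/table/deltas/")  # columns: pk, payload, ts, op

base.createOrReplaceTempView("base")
deltas.createOrReplaceTempView("deltas")

# Keep the newest record per primary key, then drop keys whose latest
# operation was a delete.
merged = spark.sql("""
    WITH unioned AS (
        SELECT pk, payload, ts, 'U' AS op FROM base
        UNION ALL
        SELECT pk, payload, ts, op FROM deltas
    ),
    ranked AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY pk ORDER BY ts DESC) AS rn
        FROM unioned
    )
    SELECT pk, payload, ts FROM ranked WHERE rn = 1 AND op <> 'D'
""")

merged.write.mode("overwrite").parquet("s3://bucket/table/compacted/")
```

Note that Spark's MERGE INTO statement requires a transactional table format such as Delta Lake or Iceberg; the window-function approach works on plain Parquet files in S3.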
INSIGHT

Spark's Limitations at Scale

  • Spark's limitations became apparent as data grew to hundreds of terabytes and beyond, leading to long processing times and high costs.
  • Spark's general-purpose nature made it less efficient than a specialized solution for this specific task; a sketch of what such a specialized, Ray-based approach might look like follows below.
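
For contrast, a minimal sketch of the kind of specialized, per-partition compaction a Ray-based approach enables. compact_partition and the partition paths are hypothetical stand-ins, not Amazon's actual code; only Ray's core task API (ray.init, @ray.remote, ray.get) is assumed:

```python
# Sketch: compact each table partition independently as a Ray task,
# avoiding the shuffle and planning overhead of a general-purpose engine.
import ray

ray.init()

@ray.remote
def compact_partition(partition_path: str) -> str:
    # Read this partition's base rows and deltas, resolve inserts, updates,
    # and deletes to the latest state, and write the compacted result back.
    # (Details elided; any columnar reader, e.g. pyarrow, would do here.)
    ...
    return partition_path

# Hypothetical list of partition paths to compact in parallel.
partitions = [f"s3://bucket/table/partition={i}/" for i in range(1024)]
results = ray.get([compact_partition.remote(p) for p in partitions])
print(f"compacted {len(results)} partitions")
```

Because each task touches only one partition's files, the scheduler can fan thousands of such tasks across a cluster with minimal coordination, which is the kind of cost and efficiency advantage the episode attributes to moving off Spark.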