The InfoQ Podcast

How to Use Apache Spark to Craft a Multi-Year Data Regression Testing and Simulations Framework

Nov 26, 2025
Vivek Yadav, an engineering manager at Stripe, shares his expertise in building a multi-year data regression testing framework on Apache Spark. He highlights the importance of testing migrations against years of historical data to catch user-facing regressions before rollout, and explains how Spark's parallel processing makes bulk request replays efficient. Vivek discusses designing reusable libraries and controlled testing environments that boost developer confidence while keeping costs low compared with traditional database approaches. He also emphasizes the framework's versatility for what-if analyses and projections.
AI Snips
INSIGHT

Treat Services As Bulk-Processable Libraries

  • Large services can be viewed as input → business logic → output pipelines, similar to Spark's bulk read/process/write model.
  • Recasting the service logic as a library lets you run it in production or as a Spark job against historical data (see the sketch after this list).
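A minimal sketch of this idea in Scala: the business logic lives in a pure function with no IO, so the same code can serve an online endpoint or be mapped over historical data in Spark. The Request/Response types, the CoreLogic object, and the fee calculation are hypothetical illustrations, not Stripe's actual code.

```scala
// Hypothetical request/response types for a fee-calculation service.
case class Request(requestId: String, amountCents: Long)
case class Response(requestId: String, feeCents: Long)

object CoreLogic {
  // The "business logic" stage of the input -> logic -> output pipeline.
  // It performs no IO, which is what lets it run unchanged inside an
  // online service handler or inside a Spark executor.
  def handle(req: Request): Response =
    Response(req.requestId, feeCents = math.max(30L, req.amountCents * 29 / 1000))
}
```

Because `handle` is deterministic and side-effect free, calling it from a production request handler and from a Spark map over archived requests produces directly comparable outputs.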
ADVICE

Package Core Logic As A Reusable Library

  • Organize core service logic as a library, with separate IO layers for each runtime environment.
  • Wrap that library in a Spark job so large historical datasets can be processed in parallel for testing (sketched below).
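A hedged sketch of that Spark wrapper, reusing the hypothetical `Request`/`Response` types and `CoreLogic` from the previous snippet; the Parquet paths and storage layout are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ReplayJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("historical-replay").getOrCreate()
    import spark.implicits._

    // Batch IO layer, input side: read archived production requests.
    val requests = spark.read.parquet("s3://archive/requests/").as[Request]

    // The same library call the online service makes, applied across
    // years of traffic, one partition per executor task.
    val responses = requests.map(CoreLogic.handle)

    // Batch IO layer, output side: persist results for a later diff step.
    responses.write.mode("overwrite").parquet("s3://archive/replay-responses/")
    spark.stop()
  }
}
```

Keeping the IO at the edges of the job is the design choice that makes this cheap: the library itself never knows whether it is serving one live request or a partition of millions of archived ones.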
ANECDOTE

Backtesting Migration With Historical Requests

  • Stripe replayed past production requests against new code and compared the new responses with the recorded old responses to find regressions.
  • They built a diff job that surfaces only the differing rows for engineers to inspect and fix (see the sketch below).
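A sketch of what such a diff job might look like under the same assumptions as the earlier snippets: join the recorded production responses against the replayed responses on the request identity, and keep only the rows whose outputs disagree. The paths, column names, and join key are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DiffJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("response-diff").getOrCreate()

    val oldResponses = spark.read.parquet("s3://archive/prod-responses/")
    val newResponses = spark.read.parquet("s3://archive/replay-responses/")

    // Join old and new outputs by request, then keep only mismatches,
    // so engineers inspect a small, targeted set of regressions.
    val diffs = oldResponses.alias("old")
      .join(newResponses.alias("new"), Seq("requestId"))
      .filter(col("old.feeCents") =!= col("new.feeCents"))

    diffs.write.mode("overwrite").parquet("s3://archive/response-diffs/")
    spark.stop()
  }
}
```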