The InfoQ Podcast

How to Use Apache Spark to Craft a Multi-Year Data Regression Testing and Simulations Framework

Nov 26, 2025
Vivek Yadav, an engineering manager at Stripe, shares his expertise in building a multi-year data regression testing framework on Apache Spark. He highlights the importance of testing migrations against years of historical data to catch user-facing regressions before rollout, and explains how Spark's parallel processing makes bulk request replays efficient. Vivek discusses designing reusable libraries and controlled testing environments that boost developer confidence while keeping costs low compared with traditional database approaches. He also emphasizes the framework's versatility for what-if analyses and projections.
AI Snips
INSIGHT

Treat Services As Bulk-Processable Libraries

  • Large services can be viewed as input → business logic → output pipelines, similar to Spark's bulk read/process/write model.
  • Recasting the service logic as a library lets you run it in production or as a Spark job against historical data (see the sketch after this list).
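A minimal sketch of this idea in Scala: the business logic lives in a pure function with no IO, so the same code can serve an online endpoint or be mapped over historical data in Spark. The Request/Response types, the CoreLogic object, and the fee calculation are hypothetical illustrations, not Stripe's actual code.

```scala
// Hypothetical request/response types for a fee-calculation service.
case class Request(requestId: String, amountCents: Long)
case class Response(requestId: String, feeCents: Long)

object CoreLogic {
  // The "business logic" stage of the input -> logic -> output pipeline.
  // It performs no IO, which is what lets it run unchanged inside an
  // online service handler or inside a Spark executor.
  def handle(req: Request): Response =
    Response(req.requestId, feeCents = math.max(30L, req.amountCents * 29 / 1000))
}
```

Because `handle` is deterministic and side-effect free, calling it from a production request handler and from a Spark map over archived requests produces directly comparable outputs.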
ADVICE

Package Core Logic As A Reusable Library

  • Organize core service logic as a library, with separate IO layers for each runtime environment.
  • Wrap that library in a Spark job so large historical datasets can be processed in parallel for testing (sketched below).
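A hedged sketch of that Spark wrapper, reusing the hypothetical `Request`/`Response` types and `CoreLogic` from the previous snippet; the Parquet paths and storage layout are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ReplayJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("historical-replay").getOrCreate()
    import spark.implicits._

    // Batch IO layer, input side: read archived production requests.
    val requests = spark.read.parquet("s3://archive/requests/").as[Request]

    // The same library call the online service makes, applied across
    // years of traffic, one partition per executor task.
    val responses = requests.map(CoreLogic.handle)

    // Batch IO layer, output side: persist results for a later diff step.
    responses.write.mode("overwrite").parquet("s3://archive/replay-responses/")
    spark.stop()
  }
}
```

Keeping the IO at the edges of the job is the design choice that makes this cheap: the library itself never knows whether it is serving one live request or a partition of millions of archived ones.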
ANECDOTE

Backtesting Migration With Historical Requests

  • Stripe replayed past production requests against new code and compared the new responses with the recorded old responses to find regressions.
  • They built a diff job that surfaces only the differing rows for engineers to inspect and fix (see the sketch below).
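A sketch of what such a diff job might look like under the same assumptions as the earlier snippets: join the recorded production responses against the replayed responses on the request identity, and keep only the rows whose outputs disagree. The paths, column names, and join key are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DiffJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("response-diff").getOrCreate()

    val oldResponses = spark.read.parquet("s3://archive/prod-responses/")
    val newResponses = spark.read.parquet("s3://archive/replay-responses/")

    // Join old and new outputs by request, then keep only mismatches,
    // so engineers inspect a small, targeted set of regressions.
    val diffs = oldResponses.alias("old")
      .join(newResponses.alias("new"), Seq("requestId"))
      .filter(col("old.feeCents") =!= col("new.feeCents"))

    diffs.write.mode("overwrite").parquet("s3://archive/response-diffs/")
    spark.stop()
  }
}
```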