
Data Engineering Podcast
Strategies For A Successful Data Platform Migration
Podcast summary created with Snipd AI
Quick takeaways
- Choosing between lift and shift vs. re-architecting impacts cost, timeline, and success of data migrations.
- User acceptance testing and data governance are vital for verifying data integrity and stakeholder confidence.
- Completing a data migration involves retiring legacy systems and defining completion metrics for monitoring progress.
- Effective cost management in data migrations requires assessing actual financial, time, and opportunity costs early in the process.
Deep dives
Migration Strategies: Lift and Shift vs. Improvement
When planning a data stack migration, a key decision is whether to perform a complete lift and shift or to re-architect and improve the existing system along the way. A lift and shift replicates existing workloads on the new system without significant modifications, with the goal of maintaining parity between the old and new systems. In contrast, trying to improve the architecture while migrating can add complexity, extra cost, and delays while stakeholders build consensus on the new design. Both approaches have implications for the cost, timeline, and overall success of the migration.
User Acceptance Testing and Governance in Data Migrations
User acceptance testing and governance play vital roles in the success of data migrations. By setting clear goals and measuring progress in terms of atomic elements migrated, such as transformation workflows or BI dashboards, teams can track their migration milestones effectively. User acceptance testing, especially in large organizations, builds stakeholder confidence by verifying that data integrity is maintained post-migration. Tools like data-diff can automate this verification, supporting the user acceptance stage that assures stakeholders of the migration's success.
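As a rough illustration of tracking progress by atomic elements, the sketch below counts a unit as done only when it has been migrated and has passed user acceptance testing. The `MigrationUnit` class, its field names, and the example unit names are all hypothetical and do not come from the episode.

```python
from dataclasses import dataclass


@dataclass
class MigrationUnit:
    """One atomic element, e.g. a transformation workflow or a BI dashboard."""
    name: str
    migrated: bool = False    # workload runs on the new system
    uat_passed: bool = False  # stakeholders signed off on the migrated output


def progress(units):
    """Return the fraction of units that are both migrated and accepted."""
    if not units:
        return 0.0
    done = sum(1 for u in units if u.migrated and u.uat_passed)
    return done / len(units)


# Hypothetical inventory of units to migrate.
units = [
    MigrationUnit("orders_daily_transform", migrated=True, uat_passed=True),
    MigrationUnit("revenue_dashboard", migrated=True, uat_passed=False),
    MigrationUnit("customer_dim_transform"),
    MigrationUnit("churn_dashboard", migrated=True, uat_passed=True),
]

print(f"{progress(units):.0%} of units migrated and accepted")
```

Requiring both flags keeps the metric honest: a dashboard that runs on the new system but has not been accepted by its users still counts as open work.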
Defining Completion and Managing Dependencies in Data Migrations
Determining when a data migration is complete involves more than just shifting workloads to a new system; it requires the complete retirement of the legacy system from budget and operational considerations. Defining completion metrics, such as migrating key data tables or BI workflows, helps monitor the progress and success of the migration project. Managing dependencies and identifying deep integration chains early in the process facilitates smoother transitions and minimizes disruptions once users are fully migrated to the new system.
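One way to surface deep integration chains early is to model assets and their upstream dependencies as a graph. The sketch below uses Python's standard `graphlib` module to derive an order in which upstream assets are migrated before their consumers, and a small helper to find the longest chain. The asset names and the `depends_on` map are made up for illustration, and the graph is assumed to be free of cycles.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each asset lists the assets it reads from.
depends_on = {
    "raw_orders": set(),
    "raw_customers": set(),
    "orders_clean": {"raw_orders"},
    "customer_dim": {"raw_customers"},
    "revenue_report": {"orders_clean", "customer_dim"},
}


def migration_order(graph):
    """Order assets so each one comes after all of its upstream dependencies."""
    return list(TopologicalSorter(graph).static_order())


def chain_depth(graph, node):
    """Length of the longest upstream chain ending at `node` (1 for a source)."""
    return 1 + max((chain_depth(graph, up) for up in graph[node]), default=0)


order = migration_order(depends_on)
deepest = max(depends_on, key=lambda n: chain_depth(depends_on, n))

print("Migration order:", order)
print("Deepest chain ends at:", deepest, "with depth", chain_depth(depends_on, deepest))
```

Assets at the end of the deepest chains are the ones most likely to break when something upstream moves, so they are reasonable candidates for early attention.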
Managing Costs and Architectural Design in Data Migrations
Cost accounting in data migrations evolves from assessing the value of the migration to evaluating the actual financial, time, and opportunity costs of operating the new system. Architectural decisions shape the migration path, most visibly in the choice between a lift and shift and improving the architecture during the migration. Understanding the hardware, access control mechanisms, and dependencies early in the process allows for effective cost management and architectural planning.
Early Signals Gathering and Preventing Migration Pitfalls
When embarking on a data migration project, gathering early signals helps teams understand the impact of the intended target system and assess the overall difference in cost. By identifying representative use cases and running proof of concept exercises, teams can evaluate the feasibility and benefits of the migration early on. Preventing migration pitfalls involves minimizing clunky system dependencies, keeping stakeholders informed, and validating progress through user acceptance testing and clear completion criteria.
Use of data-diff and Datafold in Migration Projects
Tools such as data-diff and Datafold help ensure consistency between the old and new systems during a migration. They provide confidence in replication processes by comparing tables and the outputs of transformation logic. data-diff verifies that rows match between the two systems, treating the old system as the ground truth for accuracy.
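To make the idea of a row-by-row comparison concrete, here is a minimal sketch that diffs two tables on a primary key using Python's built-in `sqlite3` module. It does not use or reproduce the data-diff tool's API; the `diff_tables` function and the table names are hypothetical. It also assumes both tables share the same column order and are small enough to load into memory.

```python
import sqlite3


def diff_tables(conn, old_table, new_table, key):
    """Compare two tables row by row, matching rows on the `key` column.

    Table names are interpolated into SQL, so pass trusted identifiers only.
    """
    def load(table):
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [c[0] for c in cur.description]
        k = cols.index(key)
        return {row[k]: row for row in cur}

    old_rows, new_rows = load(old_table), load(new_table)
    shared = old_rows.keys() & new_rows.keys()
    return {
        "missing_in_new": sorted(old_rows.keys() - new_rows.keys()),
        "extra_in_new": sorted(new_rows.keys() - old_rows.keys()),
        "changed": sorted(k for k in shared if old_rows[k] != new_rows[k]),
    }


# Two in-memory tables stand in for the legacy and the new system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE new_orders (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO legacy_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO new_orders VALUES (1, 10.0), (2, 25.0), (4, 40.0);
""")

result = diff_tables(conn, "legacy_orders", "new_orders", "id")
print(result)
```

An empty result in all three buckets is the kind of evidence that can be shown to stakeholders during user acceptance testing, with the legacy table serving as the reference.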
Challenges and Lessons Learned in Data Platform Migrations
One challenge is the impact of technology choice on user experience during a migration. Systems like Hive, although scalable, can slow a project down because of slow query performance. Choosing tools that match users' technical aptitude is essential to avoid added migration complexity. Building internal advocacy early on also matters, and optimizing development workflows remains a crucial gap in data management technology today.
Summary
All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!
- Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?
- Is it possible to completely avoid having to invest in a migration?
- What are the signals that point to the need for a migration?
- What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one)
- What are some signals that a migration is not the right solution for a perceived problem?
- Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?
- What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?
- What are some of the ways that a migration effort might fail?
- What are the major pitfalls that teams need to be aware of as they work through a data platform migration?
- What are the opportunities for automation during the migration process?
- What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?
- What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?
Contact Info
- Gleb
- Rob
- RobGoretsky on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Datafold
- Informatica
- Airflow
- Snowflake
- Redshift
- Eventbrite
- Teradata
- BigQuery
- Trino
- EMR == Elastic Map-Reduce
- Shadow IT
- Mode Analytics
- Looker
- Sunk Cost Fallacy
- data-diff
- SQLGlot
- [Dagster](https://dagster.io/)
- dbt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Hex: Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex) and get 30 days free!
- RudderStack: Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)