The Future of Data Engineering: AI, LLMs, and Automation
Feb 26, 2025
Gleb Mezhanskiy, CEO and co-founder of Datafold, shares insights from his journey in data engineering and the integration of AI. He discusses how large language models can streamline code writing, improve data accessibility, and facilitate testing and code reviews. Mezhanskiy emphasizes the challenges at the intersection of AI and data workflows, advocating for continuous adaptation. With practical applications like text-to-SQL and enhanced data observability, he paints an optimistic picture for the future of data engineering.
AI technology can significantly expedite data migrations, making them up to ten times faster through automation and efficient code translation.
Distinct operational workflows between data engineering and software engineering necessitate specialized tools to address the unique complexities of managing data.
Leveraging large language models for tasks like text-to-SQL queries can enhance data accessibility, but requires robust foundational data management.
Deep dives
AI-Powered Data Migrations
AI technology has the potential to revolutionize data migrations, which are typically prolonged and resource-intensive endeavors. Automation tools, such as Datafold's AI-powered migration agent, can drastically reduce migration timelines by translating code and validating data automatically. This efficiency allows companies to complete migrations up to ten times faster than traditional methods, alleviating stress and resource drain. Additionally, with a written guarantee on timelines, businesses can confidently plan their operations around faster data migrations.
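The automated validation step can be pictured with a small sketch. This is not Datafold's implementation, only an illustration of one common approach: fingerprint the same table on the legacy and migrated systems and compare the results. The `orders` table, the helper names, and the use of in-memory SQLite databases are all assumptions made for the example.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key):
    """Return (row_count, digest) for a table, reading rows in key order."""
    # Table and key names are interpolated directly, which is acceptable
    # only because this sketch uses trusted, hard-coded names.
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY {key}").fetchall()
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return len(rows), digest.hexdigest()


def tables_match(source, target, table, key):
    """True when both sides hold the same rows, regardless of insert order."""
    return table_fingerprint(source, table, key) == table_fingerprint(target, table, key)


def make_db(rows):
    """Build a hypothetical in-memory database holding an orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


legacy = make_db([(1, 10.0), (2, 25.5)])
migrated_ok = make_db([(2, 25.5), (1, 10.0)])   # same rows, different load order
migrated_bad = make_db([(1, 10.0), (2, 25.0)])  # one amount drifted

print(tables_match(legacy, migrated_ok, "orders", "id"))
print(tables_match(legacy, migrated_bad, "orders", "id"))
```

A whole-table digest only says whether the two sides differ. A real validation tool would also need to report which rows and columns diverge, which this sketch does not attempt.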
Distinguishing Software Engineering from Data Engineering
Data engineering and software engineering share similarities but follow distinct operational workflows that shape their practices. Unlike software engineers, who build deterministic applications, data engineers often deal with complex, imperfect data and need detailed context about data relationships and transformations. Sophisticated software engineering tools applied to data engineering workflows may not yield the same benefits, because those tools lack the requisite understanding of the data context. Data engineers therefore need specialized tools built for these complexities.
Exploring AI's Role in Data Engineering
While many solutions currently focus on enabling data engineers to deliver AI-driven applications, it is crucial to shift the perspective toward how AI can enhance data engineering workflows themselves. This means leveraging large language models (LLMs) to reduce the burdensome tasks of data engineers rather than merely facilitating AI deployment. One example is text-to-SQL, which lets users pose business questions without deep technical knowledge. However, ensuring that AI systems comprehend the true context of the underlying datasets remains a significant challenge.
Leveraging LLMs for Structured Data Interaction
Using LLMs to make structured data more accessible has become increasingly appealing. Text-to-SQL interfaces can let users generate meaningful queries on well-structured datasets and extract insights with minimal input. However, the success of such AI-driven interfaces hinges on data engineers first ensuring that the underlying data is curated, filtered, and well-defined: garbage in, garbage out. Thus, while LLMs hold promise for transforming data interaction, a solid data foundation must precede their implementation.
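A minimal sketch of a text-to-SQL flow, under stated assumptions, shows where the data foundation enters: the model only sees whatever schema it is handed. The `stub_model` function is a hypothetical stand-in for a real LLM call and simply returns a canned query. The `orders` table and all helper names are likewise invented for illustration.

```python
import sqlite3


def schema_context(conn):
    """Collect the CREATE TABLE statements so the model sees the real schema."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(r[0] for r in rows)


def build_prompt(question, schema):
    return f"Schema:\n{schema}\n\nWrite one SQLite SELECT statement answering: {question}"


def stub_model(prompt):
    # Hypothetical stand-in for an LLM call; it ignores the prompt and
    # returns a canned query so the example runs without any external service.
    return "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region ORDER BY region"


def run_question(conn, question, model=stub_model):
    """Ask the model for SQL, accept only a single SELECT, then execute it."""
    sql = model(build_prompt(question, schema_context(conn))).strip().rstrip(";")
    if not sql.lower().startswith("select") or ";" in sql:
        raise ValueError("only single SELECT statements are allowed")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 5.0), (3, "EU", 2.5)],
)

print(run_question(conn, "What is revenue by region?"))
```

The SELECT-only check is a deliberately crude guard. It illustrates that generated SQL should be validated before execution, and it is not a complete safety measure.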
Contextual Understanding in Data Systems
Incorporating LLMs into data workflows highlights the critical need for contextual understanding within the operational structure of data systems. Providing the appropriate context to LLMs is not trivial; it requires a thoughtful delineation of inputs, including code differences, data changes, and lineage graphs. These elements together allow AI to effectively gauge the ramifications of alterations made in data workflows, aiding in tasks such as code reviews and testing. Consequently, organizations should be aware that the quality and relevance of context provided to AI directly influence its efficiency and effectiveness in data engineering tasks.
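The three inputs named above (code differences, data changes, and lineage graphs) can be bundled into a single review context along these lines. This is a sketch under assumptions, not a description of any particular product: the lineage dictionary, the model names, and the `review_context` helper are hypothetical, and the data-change summary is passed in as a plain string rather than computed.

```python
import difflib
from collections import deque


def downstream(lineage, start):
    """Breadth-first walk of every model that depends on `start`."""
    seen, queue = [], deque(lineage.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(lineage.get(node, []))
    return seen


def review_context(model_name, old_sql, new_sql, lineage, data_change):
    """Assemble code diff, data change, and lineage into one prompt-ready text."""
    diff = "\n".join(difflib.unified_diff(
        old_sql.splitlines(), new_sql.splitlines(),
        fromfile="before", tofile="after", lineterm=""))
    affected = ", ".join(downstream(lineage, model_name)) or "none"
    return (f"Model: {model_name}\n"
            f"Code diff:\n{diff}\n"
            f"Data change: {data_change}\n"
            f"Downstream models: {affected}")


# Hypothetical lineage: each key maps to the models that read from it.
lineage = {
    "stg_orders": ["fct_orders"],
    "fct_orders": ["revenue_dashboard", "finance_export"],
}

context = review_context(
    "stg_orders",
    "SELECT id, amount FROM raw_orders",
    "SELECT id, amount FROM raw_orders WHERE amount > 0",
    lineage,
    "row count dropped after the change",  # illustrative summary, not measured
)
print(context)
```

Keeping the context as structured, labeled sections makes it easier to check what the model was actually told when a review turns out to be wrong.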
Summary
In this episode of the Data Engineering Podcast, Gleb Mezhanskiy, CEO and co-founder of Datafold, talks about the intersection of AI and data engineering. He discusses the challenges and opportunities of integrating AI into data engineering, particularly using large language models (LLMs) to enhance productivity and reduce manual toil. The conversation covers the potential of AI to transform data engineering tasks, such as text-to-SQL interfaces and creating semantic graphs to improve data accessibility, and explores practical applications of LLMs in automating code reviews, testing, and understanding data lineage.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about
Interview
Introduction
How did you get involved in the area of data management?
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.