Join Joe Reis and Matt Housley as they chat with Jon Krohn about data engineering basics, the distinctions between data scientists and data engineers, undercurrents in the data engineering lifecycle, trade-offs in data pipelines, and their favorite data engineering tools and techniques.
Effective communication with downstream stakeholders is key for data engineers, ensuring alignment and optimizing data flows.
Assessing latency trade-offs helps data engineers optimize processing, reduce time to value, and deliver insights faster.
Tools like Snowflake, Databricks, and cloud-based technologies enhance data engineering efficiency and collaboration among teams.
Deep dives
Importance of Effective Communication in Data Engineering
Communication with downstream stakeholders is crucial for data engineers, as it ensures alignment between the data engineering process and the needs of end users. Bidirectional communication helps in understanding requirements, optimizing data flows, and improving collaboration between data engineers and application developers.
Latency Trade-offs in Data Engineering
Focusing on latency and understanding latency trade-offs are central concerns for data engineers. By assessing the trade-offs in latency across the data engineering lifecycle, data engineers can optimize data processing, reduce time to value, and deliver insights faster to end users. Latency considerations vary based on use cases and context.
Recommended Data Engineering Tools and Techniques
Recommended data engineering tools include Snowflake, Databricks, AWS SageMaker, GCP's Vertex AI, and Azure Machine Learning, among others. Becoming familiar with cloud-based tools and technologies, along with Apache ecosystem projects, can round out a data scientist's toolkit. Techniques such as migrating work to cloud environments and collaborating in cloud-based notebooks can improve efficiency and collaboration among data science teams.
Importance of SQL for Data Engineers and Data Scientists
SQL is highlighted as a crucial tool for both data engineers and data scientists, allowing them to filter and analyze data quickly. The podcast stresses that SQL remains highly useful even though it was undervalued during the big data era, and emphasizes its role in quick problem-solving and effective data handling.
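As a rough illustration of the kind of quick filtering and aggregation SQL makes possible, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, the sample rows, and the function name are all hypothetical and are not taken from the episode.

```python
import sqlite3

# Hypothetical sample data: (user_id, amount) pairs.
SAMPLE_ORDERS = [
    ("ana", 40.0),
    ("ben", 5.0),
    ("ana", 15.0),
    ("cy", 30.0),
    ("ben", 25.0),
]


def total_spend_by_user(rows, min_amount):
    """Return (user_id, total) pairs, counting only orders of at least min_amount."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (user_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT user_id, SUM(amount) AS total
            FROM orders
            WHERE amount >= ?          -- filter out small orders
            GROUP BY user_id           -- one row per user
            ORDER BY total DESC, user_id ASC
            """,
            (min_amount,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for user_id, total in total_spend_by_user(SAMPLE_ORDERS, 10):
        print(user_id, total)
```

The filter, grouping, and sort are all expressed in one declarative query, which is the sort of speed the discussion attributes to SQL.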
Evolution of Data Engineering and Machine Learning Engineering
The episode delves into the convergence of data engineering and machine learning engineering, illustrating how the two fields are intertwined and moving toward a unified approach. It discusses the overlap and collaboration between data engineers and ML engineers, emphasizing the need for integration and alignment within teams to improve productivity and performance. The concept of the live data stack is also introduced: applications and real-time data processing are closely interconnected, which may transform traditional data and ML engineering roles.
Tune in as Joe Reis and Matt Housley, co-founders of Ternary Data and co-authors of the book “Fundamentals of Data Engineering,” join Jon Krohn to discuss major undercurrents across the data engineering lifecycle, and their top tools and techniques.
In this episode you will learn:
• What is data engineering? [3:55]
• Why Joe and Matt identify as “recovering data scientists” [6:12]
• What kinds of people tend to become data scientists vs. data engineers? [10:38]
• Key components of Joe and Matt’s book [26:31]
• Major undercurrents across the data engineering lifecycle [28:26]
• The most under-utilized tool in a data engineer's toolbox [34:39]
• Why every data pipeline involves latency trade-offs, even though faster is typically the default assumption [38:55]
• Joe and Matt’s favorite data engineering tools and techniques [43:39]