

Data Engineering Weekly
Ananth Packkildurai
The Weekly Data Engineering Newsletter www.dataengineeringweekly.com
Episodes

Jun 9, 2023 • 28min
DEW #131: dbt model contract, Instacart ads modularization in LakeHouse Architecture, Jira to automate Glue tables, Server-Side Tracking
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author's and our perspectives.

On DEW #131, we selected the following articles.

Ramon Marrero: DBT Model Contracts - Importance and Pitfalls

dbt introduced model contracts with the 1.5 release. There were a few critiques of the implementation, such as "The False Promise of dbt Contracts." I found the argument made in that piece surprising, especially the comment below.

"As a model owner, if I change the columns or types in the SQL, it's usually intentional."

My immediate reaction was: Hmm, not really.

However, as with any first iteration of a system, the dbt model contract implementation has pros and cons. I'm sure it will evolve as adoption increases. The author did an amazing job writing a balanced view of dbt model contracts.

https://medium.com/geekculture/dbt-model-contracts-importance-and-pitfalls-20b113358ad7

Instacart: How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark

Instacart writes about its journey of building its ads measurement platform. A couple of things stand out for me in the blog.

* The event store is moving from S3/Parquet storage to Delta Lake storage, a sign of Lakehouse format adoption across the board.
* Instacart's adoption of the Databricks ecosystem alongside Snowflake.
* The move to rewrite SQL into a composable Spark SQL pipeline for better readability and testing.

https://tech.instacart.com/how-instacart-ads-modularized-data-pipelines-with-lakehouse-architecture-and-spark-e9863e28488d

Timo Dechau: The extensive guide for Server-Side Tracking

The blog is an excellent overview of server-side event tracking. The author highlights how event tracking usually sits closer to the UI flow than the business flow, and all the things that can go wrong with frontend event tracking. A must-read article if you're passionate about event tracking like me.

Credit Saison: Using Jira to Automate Updations and Additions of Glue Tables

This schema change could've been a JIRA ticket!

The article is an excellent example of workflow automation on top of a familiar ticketing system, JIRA. The blog narrates the challenges with Glue Crawlers and how selectively applying database change management through JIRA helped overcome the technical debt of a custom crawler that ran for 6+ hours.

https://medium.com/credit-saison-india/using-jira-to-automate-updations-and-additions-of-glue-tables-58d39adf9940
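As a companion to the Glue automation discussion, here is a minimal sketch of the kind of table change a JIRA-driven workflow might apply with boto3 instead of re-running a long crawler. The database, table, and column names are hypothetical and not taken from the Credit Saison post.

```python
import boto3

# Hypothetical example: a JIRA-approved schema change adds one column to an
# existing Glue table, avoiding a multi-hour crawler run. Names are illustrative.
glue = boto3.client("glue", region_name="us-east-1")

table = glue.get_table(DatabaseName="analytics", Name="orders")["Table"]
storage = table["StorageDescriptor"]
storage["Columns"].append({"Name": "coupon_code", "Type": "string"})

glue.update_table(
    DatabaseName="analytics",
    TableInput={
        "Name": table["Name"],
        "StorageDescriptor": storage,
        "PartitionKeys": table.get("PartitionKeys", []),
        "TableType": table.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": table.get("Parameters", {}),
    },
)
```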

May 27, 2023 • 32min
DEW #129: DoorDash's Generative AI, Europe data salary, Data Validation with Great Expectations, Expedia's Event Sourcing
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author's and our perspectives.

On DEW #129, we selected the following articles.

DoorDash identifies Five big areas for using Generative AI

Generative AI took the industry by storm, and every company is trying to figure out what it means to them. DoorDash writes about its exploration of generative AI and its application to boost its business.

* Assisting customers in completing tasks
* Better-tailored and interactive discovery [recommendations]
* Generation of personalized content and merchandising
* Extraction of structured information
* Enhancement of employee productivity

https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/

Mikkel Dengsøe: Europe data salary benchmark 2023

Fascinating findings on data salaries across European countries. The key findings are:

* German-based roles pay less.
* London- and Dublin-based roles have the highest compensation. The Dublin sample skews toward more senior roles, with 55% of reported salaries being senior, which says more about the sample than about Dublin jobs paying more than London.
* Jobs at the 75th percentile in Amsterdam, London, and Dublin pay nearly 50% more than those in Berlin.

https://medium.com/@mikldd/europe-data-salary-benchmark-2023-b68cea57923d

Trivago: Implementing Data Validation with Great Expectations in Hybrid Environments

The article by Trivago discusses integrating data validation with Great Expectations. It presents a well-balanced case study that emphasizes the significance of data validation and the need for sophisticated statistical validation methods.

https://tech.trivago.com/post/2023-04-25-implementing-data-validation-with-great-expectations-in-hybrid-environments.html

Expedia: How Expedia Reviews Engineering Is Using Event Streams as a Source Of Truth

"Events as a source of truth" is a simple but powerful idea: persist the state of a business entity as a sequence of state-changing events. How do you build such a system? Expedia writes about its review stream system to demonstrate how it adopted the event-first approach.

https://medium.com/expedia-group-tech/how-expedia-reviews-engineering-is-using-event-streams-as-a-source-of-truth-d3df616cccd8
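To make the "events as a source of truth" idea concrete, here is a minimal sketch with a hypothetical review event schema (not Expedia's actual model), showing how the current state of an entity can be rebuilt by folding over its event log.

```python
# Hypothetical review events; field names are illustrative, not Expedia's schema.
events = [
    {"type": "ReviewSubmitted", "review_id": "r1", "rating": 4, "text": "Great stay"},
    {"type": "ReviewEdited", "review_id": "r1", "rating": 5},
    {"type": "ReviewModerated", "review_id": "r1", "status": "approved"},
]

def replay(events: list[dict]) -> dict:
    """Fold the immutable event log into the current state of the entity."""
    state: dict = {}
    for event in events:
        if event["type"] == "ReviewSubmitted":
            state = {
                "review_id": event["review_id"],
                "rating": event["rating"],
                "text": event["text"],
                "status": "pending",
            }
        elif event["type"] == "ReviewEdited":
            state["rating"] = event.get("rating", state.get("rating"))
        elif event["type"] == "ReviewModerated":
            state["status"] = event["status"]
    return state

print(replay(events))  # current view derived purely from the event history
```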

Apr 29, 2023 • 37min
DEW #124: State of Analytics Engineering, ChatGPT, LLM & the Future of Data Consulting, Unified Streaming & Batch Pipeline, and Kafka Schema Management
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author's and our perspectives.

On DEW #124, we selected the following articles.

dbt: State of Analytics Engineering

dbt publishes the State of Analytics [data??? 🤔] Engineering report. If you follow Data Engineering Weekly, we actively talk about data contracts and how data is a collaboration problem, not just an ETL problem. The survey validates this: two of the top five concerns are data ownership and collaboration between data producers and consumers. Here are the top 5 key learnings from the report.

* 46% of respondents plan to invest more in data quality and observability this year, the most popular area for future investment.
* Lack of coordination between data producers and data consumers is perceived by all respondents to be this year's top threat to the ecosystem.
* Data and analytics engineers are most likely to believe they have clear goals and are most likely to agree their work is valued.
* 71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
* Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is "Data isn't where business users need it."

https://www.getdbt.com/state-of-analytics-engineering-2023/

Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting

It is fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of an industrial revolution of computing.

"Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT and computing to date look like the spinning jenny that was the start of the industrial revolution."

🤺🤺🤺🤺🤺🤺🤺🤺🤺 May the best LLM win!! 🤺🤺🤺🤺🤺🤺

https://www.rittmananalytics.com/blog/2023/3/26/chatgpt-large-language-models-and-the-future-of-dbt-and-analytics-consulting

LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam

One of the curses of adopting Lambda Architecture is rewriting business logic in both the streaming and batch pipelines. Spark attempted to solve this with a unified RDD model for streaming and batch; Flink introduced the Table API to bridge the gap with batch processing. LinkedIn writes about its experience adopting Apache Beam, which offers a unified pipeline abstraction that can run on target runtimes such as Samza, Spark, and Flink. (A minimal Beam sketch follows these notes.)

https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc

Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices

Wix writes about managing schemas for 2,000 (😬) microservices by standardizing schema structure with Protobuf and the Kafka schema registry. Some exciting reads include patterns like the internal Wix Docs approach and integrating documentation publishing into the CI/CD pipelines.

https://medium.com/wix-engineering/how-wix-manages-schemas-for-kafka-and-grpc-used-by-2000-microservices-2117416ea17b
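As a companion to the LinkedIn discussion, here is a minimal Apache Beam sketch (with an assumed input file and a trivial count-per-key transform, not LinkedIn's actual pipeline) showing the unified model: the same transform graph can be executed on different runners by changing pipeline options rather than rewriting the logic.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative only: count page views per user from a local file. Swapping the
# runner (DirectRunner, FlinkRunner, SparkRunner, ...) via PipelineOptions reuses
# the same transforms instead of maintaining separate batch and streaming code.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read events" >> beam.io.ReadFromText("page_views.csv")
        | "Extract user id" >> beam.Map(lambda line: line.split(",")[0])
        | "Count per user" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write counts" >> beam.io.WriteToText("user_view_counts")
    )
```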

Apr 22, 2023 • 33min
DEW 123: Generative AI at BuzzFeed, Building OnCall Culture & Dimensional Modeling at WhatNot
Welcome to another episode of Data Engineering Weekly Radio. Ananth and Aswin discuss a blog from BuzzFeed that shares lessons learned from building products powered by generative AI. The blog highlights how generative AI can be integrated into a company's work culture and workflow to enhance creativity rather than replace jobs. BuzzFeed provided its employees with intuitive access to the APIs and integrated the technology into Slack for better collaboration.

Some of the lessons learned from BuzzFeed's experience include:

* Getting the technology into the hands of creative employees to amplify their creativity.
* Effective prompts are the result of close collaboration between writers and engineers.
* Moderation is essential and requires building guardrails into the prompts.
* Demystifying the technical concepts behind the technology can lead to better applications and tools.
* Educating users about the limitations and benefits of generative AI.
* The economics of using generative AI can be challenging, especially for hands-on business models.

The conversation also touches on the non-deterministic nature of generative AI systems, the importance of prompt engineering, and the potential challenges of integrating generative AI into data engineering workflows. As the technology progresses, the economics of generative AI are expected to become more favorable for businesses.

https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376

Moving on, we discuss the importance of on-call culture in data engineering teams. We emphasize the significance of data pipelines and their impact on businesses. With a focus on communication, ownership, and documentation, we highlight how data engineers should prioritize and address issues in data systems.

We also discuss the importance of an on-call rotation, runbooks, and tools like PagerDuty and Airflow to streamline alerts and responses. Additionally, we mention the value of an on-call handoff process, where one engineer summarizes their experiences and alerts during their on-call period, allowing for improvements and a better understanding of common issues.

Overall, this conversation stresses the need for a learning culture within data engineering teams, focusing on building robust systems, improving team culture, and increasing productivity.

https://towardsdatascience.com/how-to-build-an-on-call-culture-in-a-data-engineering-team-7856fac0c99

Finally, Ananth and Aswin discuss an article about adopting dimensional data modeling in hyper-growth companies. We appreciate the learning culture and emphasize balancing speed, maturity, scale, and stability.

We highlight how dimensional modeling was initially essential because computing was limited and storage expensive. As storage became cheaper and computing more accessible, dimensional modeling was often overlooked, leading to data junkyards. In the current landscape, it is important to maintain business-aware, domain-driven data marts and acknowledge that dimensional modeling still has a role.

The conversation also touches on the challenges of tracking slowly changing dimensions and the responsibility of data architects, data engineers, and analytics engineers in identifying and implementing such dimensions.
We discuss the need for a fine balance between design thinking and experimentation and stress the importance of finding the right mix of correctness and agility for each company.

https://medium.com/whatnot-engineering/same-data-sturdier-frame-layering-in-dimensional-data-modeling-at-whatnot-5e6a548ee713
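Since slowly changing dimensions come up in the discussion, here is a minimal, hypothetical Type 2 sketch in pandas: when an attribute changes, the current row is expired and a new versioned row is appended, so history is preserved. Table and column names are illustrative, not from the Whatnot post.

```python
import pandas as pd

# Toy dimension table: one row per customer version with a validity window.
dim = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Austin", "Berlin"],
    "valid_from": ["2022-01-01", "2022-01-01"],
    "valid_to": [None, None],      # None marks the open-ended, current version
    "is_current": [True, True],
})

# Incoming change captured from the source system: customer 1 moved.
update = {"customer_id": 1, "city": "Chicago", "effective_date": "2023-04-01"}

# 1. Expire the current row for that customer (SCD Type 2 keeps history).
mask = (dim["customer_id"] == update["customer_id"]) & dim["is_current"]
dim.loc[mask, "valid_to"] = update["effective_date"]
dim.loc[mask, "is_current"] = False

# 2. Append a new current row carrying the changed attribute.
new_row = {
    "customer_id": update["customer_id"],
    "city": update["city"],
    "valid_from": update["effective_date"],
    "valid_to": None,
    "is_current": True,
}
dim = pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)
print(dim)
```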

Apr 13, 2023 • 41min
Podcast: dbt Reimagined, Change Data Capture @ Brex, on Data Products and how to describe them
DBT Reimagined by Pedram Navid

The challenge with Jinja templating is twofold. First, it resolves at runtime, so you have to build the project and then run simulations to understand whether you did it correctly or not. Second, Jinja templates add cognitive load: developers have to know how the Jinja template expands and how the SQL works, and the result becomes difficult to read and understand.

In this conversation with Aswin, we discuss the article "DBT Reimagined" by Pedram Navid. We talk about the strengths and weaknesses of dbt and what we would like to see in a future version of the tool.

Aswin agrees with Pedram Navid that a DSL would be better than a templated language for dbt. He also points out that the Jinja templating system can be difficult to read and understand.

I agree with both Aswin and Pedram Navid. A DSL would be a great way to improve dbt; it would make the tool more powerful and easier to use. I am also interested in a native programming language for dbt. It would allow developers to write their own custom functions and operators, giving them even more flexibility in using the tool.

The conversation shifts to the advantages of a DSL over templated code, and we discuss other tools like SQLMesh, Malloy, and an internal tool by Criteo. I believe more experimentation with SQL is needed.

Overall, "DBT Reimagined" is a valuable contribution to the discussion about the future of data transformation tools. It raises important questions about the strengths and weaknesses of dbt and offers some interesting ideas for how to improve it.

Change Data Capture at Brex by Jun Zhao

https://medium.com/brexeng/change-data-capture-at-brex-c71263616dd7

Aswin provided a great definition of CDC, explaining it as a mechanism to listen to database replication logs and capture, stream, and reproduce data in real time 🕒. He shared his first encounter with CDC back in 2013, working on a proof of concept (POC) for a bank 🏦.

Aswin explains that CDC is a way to capture changes made to data in a database. This can be useful for a variety of reasons, such as:

* Auditing: CDC can be used to track changes made to data.
* Compliance: CDC can be used to ensure that data complies with regulations.
* Data replication: CDC can replicate data from one database to another.
* Data integration: CDC can be used to integrate data from multiple sources.

Aswin also discusses some of the challenges of using CDC, such as:

* Complexity: CDC can be a complex process to implement.
* Cost: CDC can be a costly process to implement.
* Performance: CDC can impact the performance of the database.

In summary of the conversation about change data capture (CDC):

* CDC is a way to capture changes made to data in a database.
* CDC can be used for various purposes, such as auditing, compliance, data replication, and integration.
* CDC can be implemented using a variety of tools, such as Debezium.
* Some of the challenges of CDC include latency, cost, and performance.
* CDC can't carry business context, which can be expensive to recreate.
* Overall, CDC is a valuable tool for data engineers.
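To ground the CDC discussion, here is a minimal sketch of consuming Debezium-style change events from Kafka in Python. It is illustrative only: the topic, broker address, and table are hypothetical, and the envelope handling assumes Debezium's default JSON format, not Brex's actual setup.

```python
import json
from kafka import KafkaConsumer  # kafka-python; broker and topic names are hypothetical

consumer = KafkaConsumer(
    "dbserver1.public.accounts",               # Debezium-style topic: <server>.<schema>.<table>
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
)

for message in consumer:
    if message.value is None:                  # tombstone record emitted after a delete
        continue
    payload = message.value.get("payload", message.value)
    op = payload["op"]                         # c=create, u=update, d=delete, r=snapshot read
    before, after = payload.get("before"), payload.get("after")
    if op in ("c", "r"):
        print("insert:", after)
    elif op == "u":
        print("update:", before, "->", after)
    elif op == "d":
        print("delete:", before)
```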
On Data Products and How to Describe Them by Max Illis

https://medium.com/@maxillis/on-data-products-and-how-to-describe-them-76ae1b7abda4

The library example is close to Aswin's heart since his father started his career as a librarian! 📖 Aswin highlights Max's broad definition of data products, which includes data sets, tables, views, APIs, and machine learning models. Ananth agrees that BI dashboards can also be data products. 📊🔍

We emphasize the importance of exposing tribal knowledge and democratizing the data product world. Max's journey from skeptic to believer in data products is admirable. 🌟📝

We dive into the structural and behavioral properties of data products and Max's detailed description of build-time and runtime properties. We also appreciate the idea of reference queries to facilitate data consumption. 🧩🚀

In conclusion, Max's blog post is one of the best write-ups on data products around! Big thanks to Max for sharing his thoughts! 🙌

Apr 6, 2023 • 36min
What Happened at Data Council 2023?
Hey folks, have you heard about the Data Council conference in Austin? The three-day event was jam-packed with exciting discussions and innovative ideas on data engineering and infrastructure, data science and algorithms, MLOps, generative AI, streaming infrastructure, analytics, and data culture and community.

People are so nice in the data community. Meeting them and brainstorming with so many ideas and different thought processes was amazing. The conference felt like a jam session of thought processes, ideas, and entrepreneurship.

The keynote by Shirshanka from Acryl Data talked about how data catalogs are becoming the control center for pipelines, a game-changer for the industry.

I also had a chance to attend a session on Malloy, a new way of thinking about SQL queries. It was experimental but had some cool ideas on abstracting complicated SQL queries.

ChatGPT will change the game in terms of data engineering jobs and productivity. ChatGPT, for example, has improved my productivity by 60%. And generative AI is becoming so advanced that it can produce dynamic SQL code in just a few lines.

But of course, with all this innovation and change, there are still questions about the future. Will Snowflake and Databricks outsource the data governance experience to other companies? Will the modern data stack become more mature and consolidated? These are the big questions we need to ask as we move forward in the world of data.

Uber gave a talk on migrating its Ubermetric system from Elasticsearch to Apache Pinot, which, by the way, is an incredibly flexible and powerful system. We also chatted about Pinot's semi-structured storage support, which is important in modern data engineering.

Now, let's talk about something (non)controversial: the idea that big data is dead. The DuckDB talk brought up three intriguing points to back up this claim (see the sketch after these notes).

* Not every company has big data.
* Instances with higher memory are becoming a commodity.
* Even companies that have big data mostly do incremental processing, which can be small enough.

Abhi Sivasailam presented a thought-provoking approach to metric standardization. He introduced the concept of "metric trees": connecting high-level metrics to other metrics and building semantics around them. The best part? You can create a whole tree structure that shows the impact of one metric on another. Imagine the possibilities! You could simulate your business performance by tweaking the metric tree, which is mind-blowing!

Another amazing talk was about cross-company data exchange, where Pardis discussed the various ways companies share data, like APIs, file uploads, or even Snowflake sharing. But the real question is: how do we deal with revenue sharing, data governance, and preventing sensitive data leaks? Pardis's startup, General Folders, is tackling this issue, aiming to become the "Dropbox" of data exchange. How cool is that?

To wrap it up, three key learnings from the conference were:

* The intriguing idea that "big data is dead" and how it impacts data infrastructure architecture.
* The data catalog as a control plane for the modern data stack: is it a dream or reality?
* The growing importance of data contracts and the fascinating idea of metric trees.

Overall, the Data Council conference was an incredible experience, and I can't wait to see what they have in store for us next year.
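As a rough illustration of the "big data is dead" argument, here is a tiny DuckDB sketch: a single-node query over a local Parquet extract. The file and column names are hypothetical; the point is simply that incremental, filtered workloads often fit comfortably on one machine.

```python
import duckdb

# Hypothetical single-node analysis over a local Parquet extract.
con = duckdb.connect()
rows = con.execute(
    "SELECT event_date, count(*) AS events "
    "FROM read_parquet('events.parquet') "
    "WHERE event_date >= DATE '2023-03-01' "
    "GROUP BY event_date "
    "ORDER BY event_date"
).fetchall()
print(rows)
```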

Mar 31, 2023 • 49min
Podcast: Analysis on MAD [Machine Learning, Artificial Intelligence & Data] Landscape
In this episode of Data Engineering Weekly Radio, we delve into modern data stacks under pressure and the potential consolidation of the data industry. We refer to a four-part article series that explores the data infrastructure landscape and the Software as a Service (SaaS) products available in data engineering, machine learning, and artificial intelligence.

We discussed that the siloed nature of many data products has led to industry consolidation, ultimately benefiting customers. Throughout our discussion, we touch on how the Modern Data Stack (MDS) movement has resulted in various specialized tools in areas such as ingestion, cataloging, governance, and quality. However, we also acknowledge that as budgets tighten and CFOs become more cautious, the market is now experiencing a push toward bundling and consolidation.

In this consolidation, we explore the roles of large players like Snowflake, Databricks, and Microsoft and cloud companies like AWS and Google. We debate who will be the "control center" of the data workload, as many companies claim to be the central component in the data ecosystem. As hosts, we agree it's difficult to predict the industry's future, but we anticipate the market will mature and settle soon.

We discussed the potential consolidation of various tools and categories in the modern data stack, including ETL, reverse ETL, data quality, observability, and data catalogs. Consolidation is likely, as many of these tools share common ground and can benefit from unified experiences for users. We also explored how tools like dbt, Airflow, and Databricks could emit information about data lineage, potentially leading to a "catalog of catalogs" that centralizes the visualization and governance of data.

We suggested that the convergence of data quality, observability, and catalogs would revolve around ensuring clean, trusted data that is easily discoverable. We also touched on the role of data lineage and pondered whether the control of data lineage would translate to control over the entire data stack. We considered the possibility that orchestration engines might step into data quality, observability, and catalogs, leading to further consolidation in the industry.

We also acknowledged the shift in conversation within the data community from focusing on technology comparisons to examining organizational landscapes and the production and consumption of data. We agreed that there is still much room for innovation in this space and that consolidating features is more beneficial than competing with one another.

We contemplated how tools like dbt might extend their capabilities by tackling other aspects of the data stack, such as ingestion. Additionally, we discussed the potential consolidation in the MLOps space, with various tools stepping on each other's territory as they address customer needs.

Overall, we emphasized the importance of unifying user experiences and blurring the lines between individual categories in the data infrastructure landscape. We also noted the parallels between feature stores and data products, suggesting that there may be further convergence between MLOps and data engineering practices in the future. Ultimately, customer delight and experience are the driving forces behind these developments.

We also discussed ETL's potential future, the rise of zero ETL, and its challenges.
Additionally, we touched on the growing importance of data products and contracts, emphasizing the need for a contract-first approach in building successful data products.

We also shared our thoughts on the potential convergence of various categories, like data cataloging and data contracts, which could give rise to more comprehensive and powerful data solutions. Furthermore, we discussed the significance of interfaces and their potential to shape the future of the data stack.

In conclusion, Matt Turck's blog provided us with an excellent opportunity to discuss and analyze the current trends in the data industry. We look forward to seeing how these trends continue to evolve and shape the future of data management and analytics. Until the next edition, take care, and see you all!

References

https://mattturck.com/mad2023/
https://mattturck.com/mad2023-part-iii/

Mar 22, 2023 • 43min
Podcast: Data Product @ Oda, Reflections on Talking with Data Leaders & Great Migration To Snowflake
We are back with Data Engineering Weekly Radio for edition #121. We take 2 or 3 articles from each week's Data Engineering Weekly edition and go through an in-depth analysis. Please subscribe to our podcast on your favorite app.

From edition #121, we took the following articles.

Oda: Data as a product at Oda

Oda writes an exciting blog about "Data as a Product," describing why we must treat data as a product, the dashboard as a product, and the ownership model for data products.

https://medium.com/oda-product-tech/data-as-a-product-at-oda-fda97695e820

The blog highlights six key principles for the value creation of data.

* Domain knowledge + discipline expertise
* Distributed Data Ownership and shared Data Ownership
* Data as a Product
* Enablement over Handover
* Impact through Exploration and Experimentation
* Proactive attitude towards Data Privacy & Ethics

https://medium.com/oda-product-tech/the-six-principles-for-how-we-run-data-insight-at-oda-ba7185b5af39

Here are a few highlights from the podcast:

"Oda built the whole data product principle and implementation structure on top of its core values, instead of reflecting industry jargon."

"Don't make me think. The moment you make your users think, you lose your value proposition as a platform or a product."

"The platform enables the domain; the domain enables your consumer. It's a chain of value creation, simplifying everyone's life: accessing data and making informed decisions."

"Putting that down, documenting it, even at the start, is where the equations start proving themselves. That's essentially what product thinking is all about."

Peter Bruins: Some reflections on talking with Data leaders

Data Mesh, Data Products, Data Contracts: all these concepts are trying to address this problem, and it is a billion-dollar problem to solve. The author leaves a bigger question: ownership plays a central role in all these concepts, but what is the incentive to take ownership?

https://www.linkedin.com/pulse/some-reflections-talking-data-leaders-peter-bruins/

Here are a few highlights from the podcast:

"Ownership. It's all about the ownership." - Peter Bruins

"The weight of the success (growth of adoption) of the data leads to its failure."

Faire: The great migration from Redshift to Snowflake

Is Redshift dying? I'm seeing an increasing pattern of people migrating from Redshift to Snowflake or a Lakehouse. Faire wrote a detailed blog on the reasoning behind its Redshift-to-Snowflake migration, its journey, and its key takeaways.

https://craft.faire.com/the-great-migration-from-redshift-to-snowflake-173c1fb59a52

Faire also open-sourced some utility scripts to make it easier to move from Redshift to Snowflake.

https://github.com/Faire/snowflake-migration

Here are a few highlights from the podcast:

"If one percent of your data is still in Redshift and 99% of your data is in Snowflake, you're degrading your velocity and the quality of your delivery."

Mar 12, 2023 • 36min
Data Engineering Weekly Radio #120
We are back with Data Engineering Weekly Radio for edition #120. We take 2 or 3 articles from each week's Data Engineering Weekly edition and go through an in-depth analysis.

From edition #120, we took the following articles.

Topic 1: Colin Campbell: The Case for Data Contracts - Preventative data quality rather than reactive data quality

In this episode, we focus on the importance of data contracts in preventing data quality issues. We discuss an article by Colin Campbell highlighting the need for a data catalog and the market scope for data contract solutions. We also touch on the idea that data creation will become a decentralized process and the role of tools like data contracts in enabling successful decentralized data modeling. We emphasize the importance of creating high-quality data and the need for both technological and organizational solutions to achieve this goal.

Key highlights of the conversation:

"Preventative data quality rather than reactive data quality. It should start with contracts." - Colin Campbell, author of the article

"Contracts put a preventive structure in place." - Aswin

"The successful data-driven companies all do one thing very well. They create high-quality data." - Ananth

Ananth's post on Schemata

Topic 2: Yerachmiel Feltzman: Action-Position data quality assessment framework

In this conversation, we discuss a framework for data quality assessment called the Action-Position framework. The framework helps define what actions should be taken based on the severity of a data quality problem. We also discuss two patterns for data quality: Write-Audit-Publish (WAP) and Audit-Write-Publish (AWP). The WAP pattern involves writing data, auditing it, and then publishing it, while the AWP pattern involves auditing data, writing it, and then publishing it. (A minimal WAP sketch follows these notes.) We encourage readers to share their best practices for addressing data quality issues.

Are you using any data quality framework in your organization? Do you have any best practices for how you address data quality issues? What do you think of the Action-Position data quality framework? Please add your comments in the Substack chat.

https://medium.com/everything-full-stack/action-position-data-quality-assessment-framework-d833f6b77b7

Dremio WAP pattern: https://www.dremio.com/resources/webinars/the-write-audit-publish-pattern-via-apache-iceberg/

Topic 3: Guy Fighel - Stop emphasizing the Data Catalog

We discuss the limitations of data catalogs and the author's view of the semantic layer as an alternative. The author argues that data catalogs are passive and quickly become outdated, and that a stronger contract with enforced data quality could be a better solution. We also highlight the cost factors of implementing a data catalog and suggest that a more decentralized approach may be necessary to keep up with the increasing number of data sources. Innovation is needed in this space to improve how organizations discover and consume data assets.

Something to think about from this conversation:

"If you don't catalog everything and only catalog what is required for business decision-making, does that solve the data catalog problem in an organization?"

https://www.linkedin.com/pulse/stop-emphasizing-data-catalog-guy-fighel/
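As a small illustration of the Write-Audit-Publish pattern mentioned in Topic 2, here is a minimal, file-based sketch: land the data in a staging location, run audits, and only then atomically promote it for consumers. Paths, column names, and the audit checks are hypothetical, and it assumes pyarrow is available for Parquet I/O; table-format tools like Apache Iceberg (per the Dremio link) implement the same idea with branches and snapshots.

```python
from pathlib import Path
import pandas as pd

def write_audit_publish(df: pd.DataFrame, table_dir: str) -> None:
    """Minimal Write-Audit-Publish: stage the data, audit it, then promote it."""
    table_path = Path(table_dir)
    table_path.mkdir(parents=True, exist_ok=True)
    staging = table_path / "_staging.parquet"
    published = table_path / "current.parquet"

    # 1. Write: land the new snapshot in a staging location consumers never read.
    df.to_parquet(staging)

    # 2. Audit: fail fast before anything is exposed downstream.
    staged = pd.read_parquet(staging)
    assert len(staged) > 0, "audit failed: empty dataset"
    assert staged["order_id"].notna().all(), "audit failed: null order_id"

    # 3. Publish: atomically swap the audited snapshot into place.
    staging.replace(published)

# Hypothetical usage with a toy dataframe.
write_audit_publish(
    pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 7.25]}),
    "warehouse/orders",
)
```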

Mar 8, 2023 • 23min
Podcast: Data Engineering Weekly #119
We are super excited to be back discussing Data Engineering Weekly newsletter articles every week. We take 2 or 3 articles from each week's Data Engineering Weekly edition and go through an in-depth analysis.

From Data Engineering Weekly edition #119, we are taking three articles.

#1 Netflix's article about Scaling Media Machine Learning at Netflix
https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243

#2 Alex Woodie's article about Open Table Formats Square Off in Lakehouse Data Smackdown
https://www.datanami.com/2023/02/15/open-table-formats-square-off-in-lakehouse-data-smackdown/

#3 Plum Living's article about Building a semantic layer in Preset (Superset) with dbt
https://medium.com/plum-living/building-a-semantic-layer-in-preset-superset-with-dbt-71ee3238fc20

We referenced David Jayatillake's article about Metricalypse in the show.