
Data Engineering Podcast
Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack
Podcast summary created with Snipd AI
Quick takeaways
- Anomstack is an open-source project that simplifies the process of building anomaly detection systems by providing a flexible framework for defining metrics and configuring models.
- Anomstack leverages libraries like PyOD for anomaly detection, allowing users to define their own pre-processing functions and models or use the provided defaults.
- Anomstack uses SQL queries to identify anomalous metrics based on thresholds or anomaly scores and stores alerts in the metrics table, enabling easy browsing and insights from anomalies.
Deep dives
Anomstack Overview
Anomstack is an open-source project that provides easy and customizable anomaly detection on business metrics. It simplifies the process of building anomaly detection systems by taking care of the orchestration and providing a flexible framework for defining metrics and configuring models.
Metrics Definition and Ingestion
In Anomstack, metrics are defined through ingest SQL or custom Python functions. The metrics are ingested into a metrics table, which serves as the central data source. Users can bring their own ingest logic and customize configurations for scheduling and model parameters.
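As a rough illustration of the "custom Python function feeding a metrics table" idea, here is a minimal sketch using only the standard library. The function names, the `orders` source table, and the `metric_timestamp` / `metric_name` / `metric_value` column names are assumptions made for this example and are not taken from the episode.

```python
import sqlite3
from datetime import datetime, timezone


def ingest_order_count(source: sqlite3.Connection) -> list[tuple[str, str, float]]:
    """Hypothetical ingest function: return (timestamp, name, value) rows."""
    now = datetime.now(timezone.utc).isoformat()
    (count,) = source.execute("SELECT COUNT(*) FROM orders").fetchone()
    return [(now, "order_count", float(count))]


def save_metrics(store: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> None:
    """Append metric rows to a single central metrics table (assumed schema)."""
    store.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "metric_timestamp TEXT, metric_name TEXT, metric_value REAL)"
    )
    store.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
    store.commit()


# Stand-in for a business database with three orders in it.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER)")
source.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])

# Stand-in for the warehouse that holds the metrics table.
store = sqlite3.connect(":memory:")
save_metrics(store, ingest_order_count(source))

stored = store.execute("SELECT metric_name, metric_value FROM metrics").fetchall()
print(stored)
```

Running the ingest step on a schedule would append one row per metric per run, which is what gives the later model steps a time series to work with.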
Machine Learning and Anomaly Detection
Anomstack leverages libraries like PyOD for anomaly detection. Users can define their own pre-processing functions and models, or use the provided defaults. The ML models generate anomaly scores, which are stored in the metrics table for further analysis.
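To make the "pre-process, then score" flow concrete, here is a small sketch. It deliberately swaps in a plain z-score as the model instead of a PyOD detector so that it runs with the standard library alone. The function names, the differencing step, and the score scaling are assumptions for illustration, not the project's defaults.

```python
from statistics import mean, stdev


def preprocess(values: list[float], diff_n: int = 1) -> list[float]:
    """Turn raw values into differences so the model sees changes, not levels."""
    return [b - a for a, b in zip(values, values[diff_n:])]


def anomaly_score(train: list[float], latest: float) -> float:
    """Z-score magnitude of `latest` relative to `train`, squashed into [0, 1)."""
    mu, sigma = mean(train), stdev(train)
    if sigma == 0:
        # No variation in the training window: any change is maximally unusual.
        return 0.0 if latest == mu else 1.0
    z = abs(latest - mu) / sigma
    return z / (1.0 + z)


# Hourly values for one metric; the final observation jumps sharply.
history = [100, 102, 99, 101, 100, 103, 98, 100, 101, 160]
features = preprocess(history)

# Score the newest change against all earlier changes.
spike_score = anomaly_score(features[:-1], features[-1])

# For comparison, score the change just before the spike.
normal_score = anomaly_score(features[:-2], features[-2])

print(round(spike_score, 3), round(normal_score, 3))
```

A real detector would replace `anomaly_score` with a trained model, but the shape stays the same: features in, one score per metric observation out, ready to be written back alongside the raw values.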
Alerting and Anomaly Visualization
Anomstack uses SQL queries to identify anomalous metrics based on predefined thresholds or anomaly scores. Alerts are stored in the metrics table, which can be integrated with various tools for alerting and visualization. Users can easily browse through alerts and gain insights from the anomalies.
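The threshold-based alerting described above can be sketched with a single SQL statement. In this example the `metric_type` column, the 0.8 threshold, and the sample rows are all hypothetical, chosen only to show scores and alerts living side by side in one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics ("
    "metric_timestamp TEXT, metric_name TEXT, metric_type TEXT, metric_value REAL)"
)

# Sample anomaly scores, as a scoring step might have written them.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T00:00:00", "order_count", "score", 0.12),
        ("2024-01-01T01:00:00", "order_count", "score", 0.95),
        ("2024-01-01T01:00:00", "signup_count", "score", 0.40),
    ],
)

THRESHOLD = 0.8  # assumed alert threshold on the anomaly score

# Write one 'alert' row back into the same table for each score over the threshold.
conn.execute(
    """
    INSERT INTO metrics
    SELECT metric_timestamp, metric_name, 'alert', 1.0
    FROM metrics
    WHERE metric_type = 'score' AND metric_value >= ?
    """,
    (THRESHOLD,),
)

alerts = conn.execute(
    "SELECT metric_timestamp, metric_name FROM metrics "
    "WHERE metric_type = 'alert' ORDER BY metric_timestamp"
).fetchall()
print(alerts)
```

Keeping alerts in the same table as values and scores means any tool that can query the warehouse can browse them without a separate alert store.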
Flexibility and Contribution
Anomstack aims to be easy to use and contribute to. It provides a config-based approach and avoids complex UIs. Users can contribute by adding new features or improving existing ones. The project is under active development, with plans to explore additional ML capabilities and support for different use cases.
Summary
If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business-critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem in each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Anomstack is and the story behind it?
- What are your goals for this project?
- What other tools/products might teams be evaluating while they consider Anomstack?
- In the context of Anomstack, what constitutes a "metric"?
- What are some examples of useful metrics that a data team might want to monitor?
- You put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project?
- What are the core capabilities and constraints that you selected to provide the focus and architecture of the project?
- Can you describe how Anomstack is implemented?
- How have the design and goals of the project changed since you first started working on it?
- What are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform?
- What are the sharp edges that are still present in the system?
- What are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack?
- What are the most interesting, innovative, or unexpected ways that you have seen Anomstack used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack?
- When is Anomstack the wrong choice?
- What do you have planned for the future of Anomstack?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
- Anomstack Github repo
- Airflow Anomaly Detection Provider Github repo
- Netdata
- Metric Tree
- Semantic Layer
- Prometheus
- Anodot
- Chaos Genius
- Metaplane
- Anomalo
- PyOD
- Airflow
- DuckDB
- Anomstack Gallery
- Dagster
- InfluxDB
- TimeGPT
- Prophet
- GreyKite
- OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst:  This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack:  Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Miro:  Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at [dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).
- Materialize:  You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!