

MLOps.community
Demetrios
Relaxed Conversations around getting AI into production, whatever shape that may come in (agentic, traditional ML, LLMs, Vibes, etc)

Nov 16, 2021 • 53min
PyTorch: Bridging AI Research and Production // Dmytro Dzhulgakov // Coffee Sessions #63
Dmytro Dzhulgakov, PyTorch: Bridging AI Research and Production.
Talking PyTorch is always interesting, as the Facebook ML OSS project is one of the most important parts of the machine learning tooling ecosystem. This week, we talked to Dmytro Dzhulgakov, a tech lead for PyTorch.
We started off talking about Dmytro's journey to being an engineer and tech lead at Facebook, and what his role entails. Dmytro has been at Facebook for 10+ years, so he gave some very interesting advice on how to manage a career in software engineering for the machine learning world. After that, we got deep into the present and future of PyTorch and what improvements the project is making to support MLOps workflows. PyTorch is a large project, and Dmytro shared with us the valuable lessons he learned from confronting multifaceted scaling challenges while working on PyTorch. Finally, we talked about the future of machine learning engineering, especially how it compares with the way software engineers work.
// Abstract
Over the past few years, PyTorch became the tool of choice for many AI developers ranging from academia to industry. With the fast evolution of state-of-the-art in many AI domains, the key desired property of the software toolchain is to enable the swift transition of the latest research advances to practical applications.
In this coffee session, Dmytro discusses some of the design principles that contributed to this popularity, how PyTorch navigates inherent tension between research and production requirements, and how AI developers can leverage PyTorch and PyTorch ecosystem projects for bringing AI models to their domain.
// Bio
Dmytro Dzhulgakov is a technical lead of PyTorch at Facebook where he focuses on the framework core development and building the toolchain for bringing AI from research to production.
Previously he was one of the creators of ONNX, a joint initiative aimed at making AI development more interoperable. Before that Dmytro built several generations of large-scale machine learning infrastructure that powered products like Ads or News Feed.
// Relevant Links
https://pytorch.org/
https://pytorch.org/blog/
https://ai.facebook.com/blog/pytorch-builds-the-future-of-ai-and-machine-learning-at-facebook/
--------------- ✌️Connect With Us ✌️ -------------
Join our slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, Feature Store, Machine Learning Monitoring and Blogs: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Dmytro on LinkedIn: https://www.linkedin.com/in/dzhulgakov/

Nov 9, 2021 • 56min
I Don't Like Jupyter Notebooks // Joel Grus // Coffee Sessions #62
MLOps Coffee Sessions #62 with Joel Grus, MLOps from Scratch.
// Abstract
In this talk, Joel Grus of “I don’t like notebooks” fame shares his 2021 perspective on notebooks, where he thinks MLOps stands today, and his current hot takes on the data space.
// Bio
Joel Grus is a Principal Engineer at Capital Group, where he leads a team that builds search, data, and machine learning products for the investment group. He is the author of the bestselling O'Reilly book *Data Science from Scratch*, the not-bestselling self-published book *Ten Essays on Fizz Buzz*, and the controversial JupyterCon talk "I Don't Like Notebooks." He recently moved to Texas after living in Seattle for a very long time.
// Relevant Links
Data Science from Scratch book: https://www.oreilly.com/library/view/data-science-from/9781491901410/
Data Science from Scratch, 2nd Edition book: https://www.oreilly.com/library/view/data-science-from/9781492041122/
Ten Essays on Fizz Buzz: Meditations on Python, mathematics, science, engineering, and design book: https://www.amazon.com/Ten-Essays-Fizz-Buzz-Meditations/dp/0982481829 or https://leanpub.com/fizzbuzz/
I Don't Like Notebooks talk: https://www.youtube.com/watch?v=7jiPeIFXb6U
I Don't Like Notebooks - #JupyterCon 2018 slides: https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_658
Fizz Buzz in Tensorflow: https://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Joel on LinkedIn: https://www.linkedin.com/in/joelgrus/
Timestamps:
[00:00] Introduction to Joel Grus
[01:32] Joel's background in tech
[07:47] Joel's I Don't Like Notebooks talk at JupyterCon
[13:42] Better tooling around notebooks
[16:48] Hex
[17:20] Step function evolution
[20:41] Kinds of professionals required in Joel's organization to practice MLOps
[23:08] Evaluation process
[25:51] Sagemaker bring your own algorithm
[27:30] Flexibility of models
[31:55] Hot takes on data science world
[34:19] Current Overall Maturity of MLOps
[37:23] Kinds of problem in NLP and search
[39:52] Finding ways to put structures
[40:50] Probabilistic nature of machine learning systems
[43:10] Data scientists catching up on writing production code
[46:33] Invaluability of code review
[47:22] Common repo structure
[47:57] Reviewing code
[49:15] Code pals
[50:36] Readability and function
[52:23] Leverage code review
[53:10] Remote work

Nov 2, 2021 • 41min
ML Tests // Svet Penkov // Coffee Sessions #61
MLOps Coffee Sessions #61 with Svet Penkov, ML Tests.
// Abstract
How confident do you feel when you deploy a new model? Does improving an ML model feel like a game of "whack-a-mole"? ML is taking over all sorts of industries and yet ML testing tools are virtually non-existent.
Drawing parallels from software engineering and electronic circuit board design to the aviation and semiconductor industries, the need for a principled quality assurance (QA) step in the MLOps pipeline is long overdue. Let's talk about why ML testing is hard, what we can do about it, and what place ML QA should take in the future.
// Bio
Svet has been building robots ever since he was a kid. At some point, Svet got interested in not just how to build them, but actually how to make them think, and so he did a Ph.D. in AI & Robotics at the University of Edinburgh, UK. Towards the end of Svet's Ph.D., he joined FiveAI as a Research Scientist and led the motion prediction team for 3 years.
Throughout his career, Svet spent endless hours fixing model regressions and fighting with edge cases and so at some point he had enough of it and decided it's time to do something about it. That's how Svet started Efemarai where they are building a platform for testing and improving ML continuously.
// Relevant Links
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Svet on LinkedIn: https://www.linkedin.com/in/svpenkov/
Timestamps:
[00:00] Introduction to Svet Penkov
[02:10] Svet's background in tech
[04:34] Testing on robotics vs areas of machine learning
[05:21] What's missing in testing right now?
[08:56] Who should test?
Step 1. Figuring out the requirements
[12:04] Edge cases
Step 2. Access of variation
[13:29] Step 3. Validation and Verification
[16:15] New challenges that need to be addressed
[18:25] Test-driven development viability argument
[20:26] Software engineering tests vs machine learning engineering tests
[23:23] Role of tools in MLOps
[26:15] Figuring out the difficulty in designing the APIs
[27:48] Svet's vision for the future
[29:15] Moving goal post
[31:00] 10 data points being realistic
[31:27] Getting less
[32:20] Efemarai: Where it came from and Why?
[33:53] Efemarai - Functional Magnetic Resonance Imaging
[35:21] A perfect world journey
[36:22] Value of tests
[37:55] Get ready for the MLOps Community Slack testing channel!

Oct 25, 2021 • 52min
Linkedin Job Recommendations // Alexandre Patry // Coffee Sessions #60
Coffee Sessions #60 with Alexandre Patry, Path to Productivity in Job Search and Job Recommendation AI at LinkedIn.
// Abstract
A year ago, LinkedIn job search and recommendation AI teams were at the end of a growth cycle. They were fighting many fires at once: a high number of user complaints, engineers spending a significant amount of their time keeping their machine learning pipelines running, online infrastructure that wasn't supporting their growth, and challenges ramping up new models for experiments. In this talk, Alex discusses how they all came together to manage these challenges and set themselves up for their next phase of growth.
// Bio
Alex has been a machine learning engineer at LinkedIn for almost seven years. He has done tours of duty in LinkedIn Groups, content search and discovery, and feed, and has been a tech lead in LinkedIn Talent Solutions and Careers for the last two years.
Prior to working at LinkedIn, Alex lived in Montreal, where he completed a Ph.D. in Statistical Machine Translation and then worked for five years on information extraction.
// Relevant Links
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Skylar on LinkedIn: https://www.linkedin.com/in/skylar-payne-766a1988/
Connect with Alexandre on LinkedIn: https://www.linkedin.com/in/patry/

Oct 11, 2021 • 1h 11min
Data Selection for Data-Centric AI: Data Quality Over Quantity // Cody Coleman // Coffee Sessions #59
Coffee Sessions #59 with Cody Coleman, Data Quality Over Quantity or Data Selection for Data-Centric AI.
// Abstract
Big data has been critical to many of the successes in ML, but it brings its own problems. Working with massive datasets is cumbersome and expensive, especially with unstructured data like images, videos, and speech. Careful data selection can mitigate the pains of big data by focusing computational and labeling resources on the most valuable examples.
Cody Coleman, a recent Ph.D. from Stanford University and founding member of MLCommons, joins us to describe how a more data-centric approach that focuses on data quality rather than quantity can lower the AI/ML barrier. Instead of managing clusters of machines and setting up cumbersome labeling pipelines, you can spend more time tackling real problems.
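To make the idea of spending labeling effort on the most valuable examples a bit more concrete, here is a minimal Python sketch of uncertainty sampling, one common active-learning heuristic. It is not the method from Cody's papers; the function names, the `(example_id, probabilities)` input format, and the toy pool are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Return the ids of the k unlabeled examples the model is least sure about.

    pool_probs is a list of (example_id, class_probabilities) pairs,
    a hypothetical format chosen for this sketch.
    """
    ranked = sorted(pool_probs, key=lambda item: entropy(item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]


# Toy pool of model predictions on unlabeled data (made-up numbers).
pool = [
    ("a", [0.98, 0.02]),  # confident prediction, low labeling value
    ("b", [0.50, 0.50]),  # maximally uncertain
    ("c", [0.70, 0.30]),
]
to_label = select_most_uncertain(pool, k=2)
```

Only the selected examples would be sent for labeling, which is where the savings over labeling the whole pool would come from.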
// Bio
Cody Coleman recently finished his Ph.D. in CS at Stanford University, where he was advised by Professors Matei Zaharia and Peter Bailis. His research spans from performance benchmarking of hardware and software systems (i.e., DAWNBench and MLPerf) to computationally efficient methods for active learning and core-set selection. His work has been supported by the NSF GRFP, the Stanford DAWN Project, and the Open Phil AI Fellowship.
// Relevant Links
[preprint] Similarity Search for Efficient Active Learning and Search of Rare Concepts: https://arxiv.org/abs/2007.00077
[video] Similarity Search for Efficient Active Learning and Search of Rare Concepts: https://www.youtube.com/watch?v=vRVyOEK2JUU
[blog post] Selection via Proxy: Efficient Data Selection for Deep Learning: https://dawn.cs.stanford.edu/2020/04/23/selection-via-proxy/
[slides] The DAWN of MLPerf: https://drive.google.com/file/d/17ZpX0GOtOXG8QMn6KEc_Le8tUfDBlgDE/view
[blog post] About Cody's research: https://hai.stanford.edu/news/cody-coleman-lowering-machine-learnings-barriers-help-people-tackle-real-problems
[video] About Cody: https://www.youtube.com/watch?v=stxJMsxxxtA
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Cody on LinkedIn: https://www.linkedin.com/in/codyaustun/
Timestamps:
[00:00] Introduction to Cody Coleman
[03:10] Cody's life story
[07:35] Cody's journey in tech
[15:04] How Cody's interest in Machine Learning and work at Stanford came about
[21:48] Data-centric Machine Learning Data Quality
[28:56] Research and Industry being together
[33:33] Advice to practitioners
[38:03] Principles of Machine Learning in an academic setting
[43:50] Data-centric promising techniques that stand out
[53:51] Developing benchmarks
[56:34] Guardrails for machine learning vs automated testing suites
[1:02:57] Creating something valuable and useful
[1:07:05] Data collecting vs Data Hoarding

Oct 7, 2021 • 56min
10 Types of Features your Location ML Model is Missing // Anne Cocos // Coffee Sessions #58
Coffee Sessions #58 with Anne Cocos, 10 Types of Features your Location ML Model is Missing.
// Abstract
Machine learning on geographic data is relatively under-studied in comparison to ML on other formats like images or graphs. But geographic data is prevalent across a wide variety of domains (although many practitioners may not think of it that way). Clearly, any dataset with `latitude` and `longitude` columns can be viewed as geographic data, but also any dataset with a `zipcode`, `city`, `address`, or `county` can be construed as geographic. Demographics, weather, foot traffic, points of interest, and topographic features can all be used to enrich a dataset with any of these types of keys.
Incorporating relatively straightforward geographic features into models can yield substantial improvements; adding "distance to the beach" or "square mileage reachable within 10 min drive" to a real estate pricing model, for example, can lead to significant decreases in model error.
Unfortunately, many ML teams find it difficult to incorporate these types of geographic data into their models because the process of ingesting from geographic formats (geojson or shapefiles), projecting, and properly joining with their existing data can be a large infrastructure lift.
In this coffee session, Anne discusses ways to simplify the process of incorporating geographic or location data into the MLOps workflow, as well as interesting trends in the geographic ML research community that will ultimately make it easier for us to learn from geography just as we do with images or graphs today.
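As a rough illustration of the "distance to the beach" style of feature mentioned above, the following Python sketch computes the great-circle (haversine) distance from each record to the nearest of a set of reference points. The coordinates, field names, and helper names are hypothetical, and real pipelines would more likely work with coastline geometries and a geospatial library than with a handful of points.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_km(lat, lon, points):
    """Distance from (lat, lon) to the closest point in a list of (lat, lon) pairs."""
    return min(haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in points)


# Hypothetical reference points standing in for "the beach".
beach_points = [(39.36, -74.42), (38.93, -74.91)]

# Hypothetical listings with latitude/longitude columns.
listings = [
    {"id": "house-1", "latitude": 39.95, "longitude": -75.17},
    {"id": "house-2", "latitude": 39.40, "longitude": -74.50},
]

# Enrich each record with the new geographic feature.
for row in listings:
    row["distance_to_beach_km"] = distance_to_nearest_km(
        row["latitude"], row["longitude"], beach_points
    )
```

The enriched `distance_to_beach_km` column could then be fed to a pricing model like any other numeric feature.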
// Bio
Dr. Anne Cocos currently leads data science and machine learning at Ask Iggy, Inc., a venture-backed, seed round startup focused on location analytics. Her team builds tools that make it simple for data scientists to leverage location information in their models and analyses. Previously she was the Director and Head, NLP and Knowledge Graph at GlaxoSmithKline, where she built algorithms and infrastructure to enable GSK’s scientists to leverage all the world’s written biomedical knowledge for drug discovery. She also worked on applied natural language processing research at The Children’s Hospital of Philadelphia Department of Biomedical Informatics. Anne completed her Ph.D. in computer science at the University of Pennsylvania, where she was supported by the Google Ph.D. Fellowship and the Allen Institute for Artificial Intelligence Key Scientific Challenges award.
Before shifting her career toward artificial intelligence, Anne spent several years as an end-user of early ML-powered technologies in the U.S. Navy and at HelloWallet. Her previous degrees are from the U.S. Naval Academy, Royal Holloway University of London, and Oxford University. She currently lives just outside Philadelphia with her husband and three boys.
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Anne on LinkedIn: https://www.linkedin.com/in/annecocos/

Oct 1, 2021 • 55min
The Future of ML and Data Platforms // Michael Del Balso - Erik Bernhardsson // Coffee Sessions #57
Coffee Sessions #57 with Michael Del Balso and Erik Bernhardsson, The Future of ML and Data Platforms.
// Abstract
Machine learning, data analytics, and software engineering are converging as data-intensive systems become more ubiquitous. Erik Bernhardsson, ex-CTO at Better and former Spotify machine learning lead, and Mike Del Balso, CEO at Tecton and former Uber machine learning lead and co-creator of Michelangelo, sit down to chat with us today.
These two jammed with us about building machine learning platform systems and teams, the modern operational data stack and how it allows more machine learning applications to thrive, and how to successfully take advantage of data in the process of building products and companies.
// Bio
Michael Del Balso
Mike is the co-founder of Tecton, where he is focused on building next-generation data infrastructure for Operational ML. Before Tecton, Mike was the PM lead for the Uber Michelangelo ML platform. He was also a product manager at Google where he managed the core ML systems that power Google’s Search Ads business. Previous to that, he worked on Google Maps. He holds a BSc in Electrical and Computer Engineering summa cum laude from the University of Toronto.
Erik Bernhardsson
Erik is currently working on some crazy data stuff since early 2021 but previously spent 6 years as the CTO of Better.com, growing the tech team from 1 to 300. Before Better, Erik spent 6 years at Spotify, building the music recommendation system and managing a team focused on machine learning.
// Relevant Links
Building a Data Team at a Mid-stage Startup: A Short Story
https://erikbern.com/2021/07/07/the-data-team-a-short-story.html
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Mike on LinkedIn: https://www.linkedin.com/in/michaeldelbalso/
Connect with Erik on LinkedIn: https://www.linkedin.com/in/erikbern
Timestamps:
[01:12] Introduction to Michael Del Balso and Erik Bernhardsson
[03:23] High-level space in data
[07:25] Complexity in the data world
[09:13] Data lake + Databricks
[15:20] Platform strategy
[16:05] "Platform is when the economic value of everybody that uses this exceeds the value of the company that creates it." - Bill Gates
[18:17] Centralizing platforms
[21:06] Team spin up centralization or decentralization
[27:18] Manifestations of being too far from a centralized and decentralized platform
[29:24] Centralized vs Decentralized
[33:33] Platform value and appropriate sizing
[35:43] Building a Data Team at a Mid-stage Startup: A Short Story blog post by Erik Bernhardsson
[38:51] Machine Learning as a sub-problem of Data
[42:16] Operational ML
[46:30] Spotify recommendations
[47:13] Real-time data flows at Spotify
[49:40] Data stack, Machine Learning stack, and Back-end stack reusability
[51:40] Container management

Sep 27, 2021 • 52min
A Few Learnings from Building a Bootstrapped MLOps Services Startup // Soumanta Das // Coffee Sessions #56
Soumanta wouldn't claim that Yugen.ai has reached where it wants to be; the team is still learning, so he's happy to share failures as well as successes.
// Abstract
Determining Minimum Achievable Goals helps Yugen.ai keep a significant amount of focus on value added and impact before diving deep into solutions and building ML systems. In this episode, Soumanta discusses balancing ML development against Ops and monitoring efforts while scaling, as well as the team's focus on making improvements in small sprints.
// Bio
Soumanta is a Co-founder at Yugen.ai, an early-stage startup in the Data Science and MLOps space.
We imagine the future to be shaped by the convergence and simultaneous adoption of Algorithms, Engineering and Ops, and Responsible AI. Our mission is to help effectuate and expedite the same for our client partners by creating large-scale, reliable, and personalized ML Systems.
// Relevant Links
A blog Soumanta wrote when Yugen turned one https://medium.com/swlh/yugen-ai-turns-one-1089f3bf169
Presentation, ML REPA 2021. Title of the talk: Reducing the distance between Prototyping and Production, Why obsessing over experimentation and iteration compounds ROIs
Slides - https://drive.google.com/file/d/1J9Cv6IPPkGpOTq8Xl_AQCKaR0-pKMUmA/view?usp=sharing
Video - https://youtu.be/4PEbgQTw1W0
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Soumanta on LinkedIn: www.linkedin.com/in/soumanta-das/
Timestamps:
[00:00] Introduction to Soumanta Das
[00:24] What's Yugen.ai's name all about?
[02:02] Starting during the pandemic
[05:13] Determination to continue during the pandemic
[08:02] State of the art in Yugen.ai and its future
[11:32] Time to value defining ML to a business
[13:01] Building a strong ML engineering culture
[19:06] Data scientists patterns
[20:00] Helper functions
[22:45] Code review
[25:32] Repeatable use cases
[27:48] Minimum achievable goals
[30:30] Production management goals
[34:30] Use cases and System design document
[36:20] Practices that helped Yugen.ai build ML systems
[40:05] Growing pains in the scaling process
[43:54] Yugen.ai war stories
[46:50] Not realizing something is wrong when something actually is wrong
[48:10] Data observability tools
[49:42] Hands-on deck

Sep 21, 2021 • 48min
Learning and Teaching MLOps Applications // Salwa Muhammad // MLOps Coffee Sessions #55
Coffee Sessions #55 with Salwa Muhammad, Learning and Teaching MLOps Applications.
// Abstract
Salwa shared her perspective on how FourthBrain and all learners can keep their education strategy fresh enough for the current zeitgeist. Furthermore, Salwa, Demetrios, and Vishnu talked about principles of effective learning that are important to keep in mind while embarking on any educational journey.
This was a great conversation with a lot of practical tips that we hope you all listen to!
// Bio
Salwa Nur Muhammad is the Founder/CEO of FourthBrain, an AI/ML education startup backed by Andrew Ng's AI Fund. FourthBrain trains Machine Learning engineers through hybrid 2-3 month cohort-based programs that combine the accountability of weekly instructor-led live sessions with the flexibility of online content.
Salwa founded FourthBrain after executive leadership roles at Udacity and Trilogy Education Services (acquired by 2U Inc).
She has over 10 years of experience leveraging technology to develop scalable education programs at higher-ed institutions and ed-tech companies, building new business units, launching international programs, and hiring and training cross-functional teams.
// Relevant Links
https://www.fourthbrain.ai/
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Salwa on LinkedIn: https://www.linkedin.com/in/salwanur/
Timestamps:
[00:00] Introduction to Salwa Muhammad
[01:20] Salwa's journey in tech
[05:30] Advice to new ML engineers
[10:21] Curriculum development process
[17:36] FourthBrain's current status and what's next
[21:53] Hardest piece in the course
[24:49] Knowing the right job in a role-confused world
[30:05] Needing to upskill without going insane
[35:10] Generalist vs Specialist on T-shaped Analogy
[41:15] Counseling learners in terms of long-term progression
[43:00] MLOps trajectories recommendation

Sep 10, 2021 • 49min
Machine Learning SRE // Niall Murphy // MLOps Coffee Sessions #54
Coffee Sessions #54 with Niall Murphy, Machine Learning SRE.
// Abstract
SRE is making its way into the machine learning world. Software engineering for machine learning requires reliability, performance, and maintainability. Site reliability engineering is the field that deals with reliability and ensuring constant, real-time performance. Niall Murphy, most recently Global Head of SRE at Microsoft Azure, helps us understand what SRE can do for modern ML products and teams.
Building machine learning teams requires a diverse set of technical experiences, and Niall shares his thoughts on how to do that most effectively. Machine learning organizations need to start to take advantage of SRE best practices like SLOs, which Niall walks through. Production machine learning depends on high-quality software engineering, and we get Niall's take on how to ensure that in a machine learning context.
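For listeners new to SLOs, here is a small Python sketch of the error-budget arithmetic that usually goes with them. It is a generic illustration rather than anything Niall prescribes in the episode; the function name, the notion of a "bad" request (for an ML service this might mean a prediction served too slowly or from a stale model), and the numbers are assumptions.

```python
def error_budget_remaining(slo_target, total_requests, bad_requests):
    """Fraction of the error budget still unspent in the current window.

    slo_target is the fraction of requests that must be "good", e.g. 0.99.
    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    allowed_bad = (1.0 - slo_target) * total_requests
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no room for bad requests.
        return 1.0 if bad_requests == 0 else 0.0
    return 1.0 - bad_requests / allowed_bad


# Hypothetical window: a 99% SLO over 100,000 prediction requests,
# of which 400 were counted as bad.
remaining = error_budget_remaining(0.99, 100_000, 400)
```

A team might use a number like `remaining` to decide whether it is safe to roll out a new model or whether reliability work should come first.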
// Bio
Niall Murphy has been interested in Internet infrastructure since the mid-1990s. He has worked with all of the major cloud providers from their Dublin, Ireland offices - most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). His books have sold approximately a quarter of a million copies worldwide, most notably the award-winning Site Reliability Engineering, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin, Ireland, with his wife and two children.
--------------- ✌️Connect With Us ✌️ -------------
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Niall on LinkedIn: https://www.linkedin.com/in/niallm/
Timestamps:
[00:00] Introduction to Niall Murphy
[00:36] SRE background to Machine Learning space transition
[07:10] SLOs being a challenge in the ML space
[09:42] SRE Hiring Investments
[15:10] Behavior of teams concept
[17:45] Challenges dealing with ML production
[18:27] Update on Reliable Machine Learning book
[22:46] Monitoring
[25:05] Difference between ML and SRE
[29:18] Incident response in Machine Learning
[34:46] Rollbacks
[35:50] Machine Learning burden over time
[42:42] Niall's journey to the SRE space and focus to develop himself