How to Fix a Failure in a System?
Service level objectives are a way to reason about your production environment over time. They don't necessarily tell me what is wrong, but they help me figure out that something is off. For example, we can all start with alerting, and again, I mentioned them earlier. But then, as we understand, OK, this particular microservice is constrained on CPU, right? Like, when throughput goes up, it uses more CPU, so we know that's its failure mode. Or alternatively, it uses a lot of memory, so we need to alert on memory, things like that. We add more and more monitoring and more and more nou
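
To make the service level objective idea concrete, here is a minimal sketch in Python of how an error budget could flag that "something is off" without saying what is wrong. Everything in it is an assumption for illustration and not something described in the episode: the `SLOWindow` structure, the function names, the 99.9% target, and the 25% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Request counts observed over one SLO window (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SLOWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0  # no traffic, so none of the budget has been spent
    # The budget is the number of failures the target tolerates in this window.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def something_is_off(window: SLOWindow, slo_target: float,
                     threshold: float = 0.25) -> bool:
    """Flag the service when less than `threshold` of the budget is left.

    The threshold is an assumed value; the check says that something is off,
    not what the cause is.
    """
    return error_budget_remaining(window, slo_target) < threshold


if __name__ == "__main__":
    window = SLOWindow(total_requests=100_000, failed_requests=90)
    print(something_is_off(window, slo_target=0.999))
```

A check like this would sit alongside the resource alerts on CPU or memory rather than replace them: the SLO signals that users are affected, and the resource alerts help narrow down a known failure mode.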