There's a Datadog dashboard out there that has, like, the number of requests that are coming in and how many of them are failing. The biggest thing I would say is that for any metric, the more you look at one top-level metric, as opposed to decomposing it across, like, cohorts, categories, or whatever... For example, looking at your latency by region is a lot more interesting than looking at your global latency. Or looking at the CPU utilization of your fleet across data centers, or across the teams it's allocated to, like data engineering versus machine learning versus production...
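To make the point about decomposing a metric concrete, here is a minimal sketch in Python. The region names and latency values are invented for illustration and do not come from the episode; the helper `mean_latency_by_region` is likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request samples as (region, latency in ms).
# These numbers are made up purely for illustration.
SAMPLES = [
    ("us-east", 100), ("us-east", 110), ("us-east", 90),
    ("eu-west", 120), ("eu-west", 130), ("eu-west", 110),
    ("ap-south", 400), ("ap-south", 440),
]


def mean_latency_by_region(samples):
    """Group latency samples by region and return the mean for each region."""
    grouped = defaultdict(list)
    for region, latency_ms in samples:
        grouped[region].append(latency_ms)
    return {region: mean(values) for region, values in grouped.items()}


# The single top-level number blends every region together.
global_mean = mean(latency_ms for _, latency_ms in SAMPLES)

# The decomposed view shows which region is driving that number.
by_region = mean_latency_by_region(SAMPLES)
slowest_region = max(by_region, key=by_region.get)

print(f"global mean latency: {global_mean} ms")
for region, value in sorted(by_region.items()):
    print(f"{region}: {value} ms")
print(f"slowest region: {slowest_region}")
```

In this made-up data the global mean sits well above two of the three regions, which is the kind of detail a single top-level number hides. The same grouping could be applied to CPU utilization by data center or by team.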
Will Larson, the CTO at Calm, covers a wide range of topics including whether Infrastructure Engineering is chronically understaffed, the role of Eng Ops, how his opinion on the “build vs buy” question has changed, his thoughts on metrics, and more.
Helpful resources: