
Reliability Enablers

Latest episodes

Nov 12, 2024 • 29min

#63 - Does "Big Observability" Neglect Mobile?

Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability, with a current focus on mobile observability. Drawing on his experience at AWS and New Relic, he's vocal about the need for more user-focused observability, especially in mobile, where traditional practices fall short.

* Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.
* Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.
* Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.
* Motivation for User-Centric Tools: Leaving "big observability" to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end users.
* Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on the user experience in the app, not just backend performance.
* Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.
* Observability Over-Focused on Backend Systems: Andrew points out that "big observability" has largely catered to backend engineers because of the immense complexity of backend systems built on microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.
* Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager's observations, Andrew highlights the scarcity of VPs with mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.
* OpenTelemetry's Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don't align with traditional time-based observability.
* SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences, a critical factor in retaining app users (a minimal error-budget sketch follows after this summary).
* Amazon's Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon's practice of running operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.
* Shifting Focus to "Answerability" in Observability: For Andrew, the goal of observability should evolve toward "answerability," where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
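The SLO point above is worth making concrete. Below is a minimal, hypothetical sketch (not from the episode) of how an SRE might express a user-facing mobile objective such as "99% of cold app starts complete within 2 seconds" and track the remaining error budget from measured counts; the SLO name, target, and event counts are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A user-experience SLO expressed as a target ratio of 'good' events."""
    name: str
    target: float  # e.g. 0.99 means 99% of events must be good


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left in the current window (1.0 = untouched)."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return max(0.0, 1 - actual_bad / allowed_bad)


# Hypothetical example: cold starts under 2 seconds count as "good" events.
cold_start_slo = Slo(name="cold_start_under_2s", target=0.99)
print(error_budget_remaining(cold_start_slo, good_events=98_500, total_events=99_200))
```

Tracking a budget like this is one way backend and mobile telemetry can be tied back to the end-user experience Andrew describes.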
Nov 5, 2024 • 36min

#62 - Early YouTube SRE Shares Modern Reliability Strategy

Andrew Fong, Co-founder and CEO of Prodvana and former VP of Infrastructure at Dropbox, dives into the evolution of Site Reliability Engineering (SRE) amidst changing tech landscapes. He advocates for addressing problems over rigid roles, emphasizing reliability and efficiency. Andrew explores how AI is reshaping SRE, the balance between innovation and operational management, and the importance of a strong organizational culture. His insights provide a values-first approach to tackle engineering challenges, fostering collaboration and a proactive reliability mindset.
Oct 22, 2024 • 38min

#61 Scott Moore on SRE, Performance Engineering, and More

Scott Moore, a performance engineer with decades of experience and a knack for educational content, shares his insights on software performance. He discusses how parody music videos make performance engineering engaging and accessible. The conversation delves into the importance of redefining operational requirements and how performance metrics should not be overlooked. Scott highlights the relationship between performance engineering and reliability, and how collaboration can reduce team burnout. He also reveals how a performance-centric culture can optimize cloud costs and improve development processes.
Oct 1, 2024 • 31min

#60 How to NOT fail in Platform Engineering

Ankit, who started programming at age 11 and naturally gravitated towards platform engineering, shares his insights on this evolving field. He discusses how platform engineering aids team efficiency through self-service capabilities. Ankit highlights the challenges of turf wars among DevOps, SRE, and platform engineering roles, as well as the dysfunctions caused by rigid ticketing systems. He emphasizes the need for autonomy and reducing cognitive load to foster creativity and effective teamwork, drawing from his rich experiences across various sectors.
Sep 24, 2024 • 8min

#59 Who handles monitoring in your team and how?

Why many copy Google's monitoring team setup

* Google's Influence. Google played a key role in defining the concept of software reliability.
* Success in Reliability. Few can dispute Google's ability to ensure high levels of reliability, or its ability to share useful ways to improve reliability in other settings.

BUT there's a problem:

* It's not always replicable. While Google's practices are admired, they may not be a perfect fit for every team.

What is Google's monitoring approach within teams?

Here's the thing that Google does:

* Google assigns one or two people per team to manage monitoring.
* Even with centralized infrastructure, a dedicated person handles monitoring.
* Many organizations use a separate observability team, unlike Google's integrated approach.

If your org is large enough and prioritizes reliability highly enough, you might find it feasible to follow Google's model to a tee. Otherwise, a centralized team with occasional "embedded X engineer" secondments might be more effective.

Can your team mimic Google's model?

Here are a few things you should factor in.

Size matters. Google's model works because of its scale and technical complexity. Many organizations don't have the size, resources, or technology to replicate this.

What are the options for your team?

Dedicated monitoring team (very popular but $$$). If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget, so it's not something that a startup or SME can easily justify.

Dedicate SREs to monitoring work (effective but difficult to manage). You might do this on rotation or make an SRE permanently "responsible for all monitoring matters". Putting SREs on permanent tasks might lead to burnout, as it might not suit their goals, and rotation work requires effective planning.

Internal monitoring experts (useful but a hard capability to build). One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team's needs. This should be how we get monitoring work done, but it's hard to get volunteers across a majority of teams.

Transitioning monitoring from project work to maintenance

There are 2 distinct phases:

* Initial setup (the "project"). SREs may help set up the monitoring/observability infrastructure. Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.
* Post-project phase ("keep the lights on"). Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?

Who will maintain the monitoring system?

Answer: usually not the same team. After the project phase, a new set of people, often different from the original team, typically handles maintenance.

Options to consider (once again):

* Spin up a monitoring/observability team. Create a dedicated team for observability infrastructure.
* Take a decentralized approach. Engineers across various teams take on observability roles as part of their regular duties.
* Appoint internal monitoring/observability experts. They can take responsibility for monitoring and ensure best practices are followed.

The key thing to remember here is: adapt to your organizational context.

One size doesn't fit all. Google's model may not work for everyone. Tailor your approach based on your organization's specific needs.

The core principle to keep in mind: as long as people understand why monitoring/observability matters and pay attention to it, you're on the right track.

Work according to engineer awareness. If engineers within product and other non-operations teams are aware of monitoring, you can attempt to decentralize the effort and involve more team members. If awareness or interest is low, consider dedicated observability roles or an SRE team to ensure monitoring gets the attention it needs.

In conclusion, there's no universal solution. Whether you centralize or decentralize monitoring depends on your team's structure, size, and expertise. The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.

PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting, i.e. "executive short-hand". Tell me what you think.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Sep 17, 2024 • 8min

#58 Fixing Monitoring's Bad Signal-to-Noise Ratio

Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It's a challenge that has been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep.

Sebastian dove into this problem, highlighting that the issue isn't just meaningless pages but also the struggle to find valuable information amidst the noise. When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.

Sebastian proposes a fundamental fix for this data overload: be deliberate with the data you emit. When instrumenting your systems, be intentional about what data you collect and transport. Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.

To combat this, focus on:

* Being deliberate with data. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.
* Filtering data effectively. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.
* Refining alerts. Optimize alert rules, such as creating tiered alerts to distinguish between critical issues and minor warnings.

Dan Ravenstone, who leads platform at Top Hat, discussed "triaging alerts" recently. He shared that managing millions of alerts, often filled with noise, is a significant issue. His advice: scrutinize alerts for value, ensure they meet the criteria of a good alert, and discard those that don't impact the user journey.

According to Dan, the anatomy of a good alert includes:

* A run book
* A defined priority level
* A corresponding dashboard
* Consistent labels and tags
* Clear escalation paths and ownership

(A minimal sketch of such an alert definition follows after this summary.)

To elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.

The learning point is simple: aim for quality over quantity. By refining your data practices and focusing on what's truly valuable, you can improve the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
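To make the "anatomy of a good alert" and the idea of tiered alerts concrete, here is a minimal sketch in Python (my own illustration, not from the episode): an alert definition that carries a run book, a priority tier, a dashboard, labels, and ownership, plus a crude check that would discard alerts missing those basics. Every name and URL in it is invented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Tiers so critical issues page someone while minor warnings only notify."""
    CRITICAL = "critical"  # page on-call immediately
    WARNING = "warning"    # route to a chat channel or queue
    INFO = "info"          # record only, never page


@dataclass
class Alert:
    """One alert definition carrying the metadata of a 'good alert'."""
    name: str
    severity: Severity
    runbook_url: str       # every alert links to a run book
    dashboard_url: str     # corresponding dashboard for context
    escalation_owner: str  # clear ownership and escalation path
    labels: dict = field(default_factory=dict)  # consistent labels and tags

    def is_actionable(self) -> bool:
        """Rough filter: alerts missing the basics are candidates for removal."""
        return bool(self.runbook_url and self.dashboard_url and self.escalation_owner)


# Hypothetical tiered pair of alerts for the same symptom.
alerts = [
    Alert("checkout_latency_p99_high", Severity.CRITICAL,
          runbook_url="https://wiki.example.com/runbooks/checkout-latency",
          dashboard_url="https://grafana.example.com/d/checkout",
          escalation_owner="payments-oncall",
          labels={"service": "checkout", "team": "payments"}),
    Alert("checkout_latency_p95_elevated", Severity.WARNING,
          runbook_url="https://wiki.example.com/runbooks/checkout-latency",
          dashboard_url="https://grafana.example.com/d/checkout",
          escalation_owner="payments-oncall",
          labels={"service": "checkout", "team": "payments"}),
]

for alert in alerts:
    print(alert.name, alert.severity.value, "keep" if alert.is_actionable() else "discard")
```

In a real system these fields would live in your alerting tool's rule configuration rather than application code; the point is simply that every alert should carry this metadata and a deliberate tier.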
Sep 10, 2024 • 32min

#57 How Technical Leads Support Software Reliability

The question then condenses down to: can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years, including a few years at the coveted consultancy Thoughtworks, and now coaches others. She and I discussed the link between this role and software reliability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Sep 4, 2024 • 27min

#56 Resolving DORA Metrics Mistakes

We're already well into 2024, and it's sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas. Not at every organization, but some places are turning it into a case of "hitting metrics" without caring for the underlying capabilities and conversations.

Nathen Harvey is no stranger to this problem. He used to talk a lot about SRE at Google as a developer advocate. Then he became the lead advocate for DORA when Google acquired it in 2018. His focus has been on questions like: how do we help teams get better at delivering and operating software?

You and I can agree that this is an important question to ask. I'd listen to what he has to say about DORA because he's got a wealth of experience behind him, having also run community engineering at Chef Software.

Before we continue, let's explore "What is DORA?" in Nathen's (paraphrased) words. DORA is a software research program that's been running since 2015. It looks to figure out: how do teams get good at delivering, operating, building, and running software? The researchers drew out the metrics by correlating teams that have good technology practices with highly robust software delivery outcomes. They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction. Essentially, all the things that matter to the business.

One of the challenges the researchers found over the last decade was working out how to measure something like software delivery. It's not the same as a factory system where you can go and count the widgets being delivered.

The unfortunate problem is that the factory mindset still leaks in. I've personally noted some silly metrics over the years, like lines of code. Imagine being asked constantly: "How many lines of code did you write this week?" You might not have to imagine. It might be a reality for you.

DORA's researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. They settled on and validated 4 key measures for software delivery performance. Nathen elaborated that 2 of these measures look at throughput; they really ask two questions:

* How long does it take for a change of any kind, whether it's a code change, configuration change, or anything else, to go from the developer's workstation right through to production?
* How frequently are you updating production?

In plain English, these 2 metrics are:

* Deployment Frequency. How often code is deployed to production. This metric reflects the team's ability to deliver new features or updates quickly.
* Lead Time for Changes. The time it takes from code being committed to being deployed to production.

Nathen recounted his experience of working at organizations that differed in how often they update production, from once every six months to multiple times a day. They're very different types of organizations, so their perspectives on throughput metrics will be wildly different. This has some implications for the speed of software delivery.

Of course, everyone wants to move faster, but there's this other thing that comes in, and that's stability. The other two stability-oriented metrics look at what happens when you do update production and something has gone horribly wrong: "Yeah, we need to roll that back quickly or push a hotfix."

In plain English, they are:

* Change Failure Rate. The percentage of deployments that cause a failure in production (e.g., outages, bugs).
* Failed Deployment Recovery Time. How long it takes to recover from a failure in production.

You might be thinking the same thing as me. These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics. But keep in mind, it's about balancing all 4 metrics. (A minimal sketch of computing the four metrics follows at the end of this summary.)

Nathen believes it's fair to say that across many organizations today, throughput and stability are treated as tradeoffs of one another: we can either be fast or we can be stable. But the interesting thing the DORA researchers have learned from a decade of collecting data is that throughput and stability aren't trade-offs of one another. They tend to move together. They've seen organizations of every shape and size, in every industry, doing well across all four of those metrics. Those are the best performers. The size of your organization doesn't matter, and neither does your industry. Whether you're working in a highly regulated or unregulated industry, it doesn't matter.

The key insight Nathen thinks we should be searching for is: how do you get there? To him, it's about shipping smaller changes. When you ship small changes, they're easier to move through your pipeline and easier to reason about. And when something goes wrong, they're easier to recover from so you can restore service.

Along with those small changes, we need to think about feedback cycles. Every line of code we write is in reality a little bit of an experiment. We think it's going to do what we expect and help our users in some way, but we need to get feedback on that as quickly as possible.

Underlying all of this, both small changes and fast feedback, is a real climate for learning. Nathen drew up a few thinking points from this: What is the learning culture like within our organization? Is there a climate for learning? Are we using things like failures as opportunities to learn, so that we can keep improving?

I don't know if you're thinking the same as me already, but we're learning that DORA is a lot more than just metrics. To Nathen (and me), the metrics should be one of the least interesting parts of DORA, because DORA digs into useful capabilities, like small changes and fast feedback. That's what truly helps determine how well you're going to do against those performance metrics. Not saying, "We are a low to medium performer. Now go and improve the metrics!"

I think the issue is that a lot of organizations emphasize the metrics because they're something that can sit on an executive dashboard. But the true reason we have metrics is to help drive conversations. Through those conversations, we drive improvement. That's important because, according to Nathen, an unfortunately noticeable number of organizations are doing this:

"I've seen organizations [where it's like]: 'Oh, we're going to do DORA. Here's my dashboard. Okay, we're done. We've done DORA. I can look at these metrics on a dashboard.' That doesn't change anything. We have to go a step further and put those metrics into action."

We should be treating the metrics as a kind of compass on a map. You can use those metrics to help orient yourself and understand, "Where are we heading?" But then you have to choose how you're going to make progress toward whatever your goal is.

The capabilities enabled by the DORA framework should help answer questions like:

* Where are our bottlenecks?
* Where are our constraints?
* Do we need to do some improvement work as a team?

We also talked about the SPACE framework, a follow-on tool from DORA metrics. It is a framework for understanding developer productivity. It encourages teams or organizations to look at five dimensions when trying to measure something from a productivity perspective. It stands for:

* S: satisfaction and well-being
* P: performance
* A: activity
* C: communication and collaboration
* E: efficiency and flow

What the SPACE framework recommends is that you first pick metrics from two to three of those five categories (you don't need a metric from every one, but find something that works well for your team), then write down those metrics and start measuring them.

Here's the interesting thing: DORA is an implementation of SPACE. You can correlate each metric with the SPACE acronym:

* Lead time for changes is a measure of efficiency and flow
* Deployment frequency is an activity
* Change failure rate is about performance
* Failed deployment recovery time is about efficiency and flow

Keep in mind that SPACE itself has no metrics. It is a framework for identifying metrics. Nathen reiterated that you can't use "the SPACE metrics" because there is no such thing.

I mentioned earlier that DORA is a means of identifying the capabilities that can improve the metrics. These can be technical practices like using continuous integration, but they can also be capabilities like collaboration and communication. As an example, you might look at what your change approval process looks like. You might look at how collaboration and communication have failed when you've had to send changes off to an external approval board like a CAB (change approval board).

DORA's research backs this up:

"What our research has shown through collecting data over the years is that, while they do exist, on the whole an external change approval body will slow you down. That's no surprise. So your change lead time is going to increase, and your deployment frequency will decrease. But, at best, they have zero impact on your change fail rate. In most cases, they have a negative impact on your change fail rate. So you're failing more often."

It goes back to the idea of smaller changes, faster feedback, and being able to validate that, building in audit controls and so forth. This is something reliability-focused engineers should be able to help with, because one of the things Sebastian and I talk about a lot is embracing and managing risk effectively, not trying to mitigate it through stifling measures like CABs.

In short, DORA and software reliability are not mutually exclusive concepts. They're certainly in the same universe. Nathen went as far as to say that some SRE practices necessarily go a little deeper than the capability level DORA operates at and provide even more specific guidance on how to do things.

He clarified a doubt I had, because a lot of people have argued with me (mainly at conferences) that DORA is a thing developers do earlier in the SDLC, while SRE is completely different because it focuses on the production side. The worst possible situation would be turning to developers and saying, "These 2 throughput metrics are yours. Make sure they go up no matter what," and then turning to our SREs and saying, "Those stability metrics are yours. Make sure they stay good." All that does is put false incentives in place, and we end up fighting against each other.

We talked a little more about the future of DORA in our podcast episode (player/link right at the top of this post) if you want to hear about that.

Here are some useful links from Nathen for further research:

* DORA online community of practice
* DORA homepage
* [Article] The SPACE of Developer Productivity
* Nathen Harvey's Linktree

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
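As a rough illustration of how the four measures fit together, here is a minimal sketch of my own (not DORA's definition or tooling) that computes them from a list of deployment records; the Deployment shape, field names, and example values are all invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA measures over a window of deployment records."""
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time_for_changes": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else None,
        "median_failed_deployment_recovery_time": median(recoveries) if recoveries else None,
    }


# Tiny made-up history: three deployments over a 7-day window.
t0 = datetime(2024, 9, 1, 9, 0)
history = [
    Deployment(t0, t0 + timedelta(hours=3)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=5),
               failed=True, restored_at=t0 + timedelta(days=2, hours=6)),
    Deployment(t0 + timedelta(days=5), t0 + timedelta(days=5, hours=2)),
]
print(dora_metrics(history, window_days=7))
```

Nathen's point still stands, though: numbers like these are a compass for conversations about smaller changes and faster feedback, not a scoreboard to optimize in isolation.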
Aug 27, 2024 • 11min

#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards

We'll explore 3 use cases for monitoring data. They are:

* Analyzing long-term trends
* Comparing over time or experiment groups
* Conducting ad hoc retrospective analysis

Analyzing long-term trends

You can ask yourself a couple of simple questions as a starting point:

* How big is my database?
* How fast is the database growing?
* How quickly is my user count growing?

As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like:

* How is the database performance evolving? Are there signs of degradation?
* Is there consistent growth in data volume that may require future infrastructure adjustments?
* How is overall resource utilization trending over time across different services?
* How is the cost of cloud resources evolving, and what does that mean for budget forecasting?
* Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them?

Sebastian mentioned that this is a part of observability he enjoys doing. I can understand why. It's exciting to see how components are changing over a period and to work out solutions before you end up in an incident-response nightmare.

Analyzing trends effectively requires the right data retention settings, because if you're throwing out your logs, traces, and metrics too early, you won't have enough historical data to do this kind of work. Doing this right means having enough data in place to analyze trends over your desired period. (A minimal trend-projection sketch follows at the end of this summary.)

Comparing over time or experiment groups

Google's definition: you're comparing the data results for different groups that you want to compare and contrast. Using a few examples from the SRE (2016) book:

* Are your queries faster in this version of this database or this version of that database?
* How much better is my memcache hit rate with an extra node, and is my site slower than it was last week?

You're comparing different buckets of time and different types of products.

A proper use case for comparing groups: Sebastian did this recently because he had to compare two different technologies for deploying code: AWS Lambda vs AWS Fargate on ECS. He took those two services and played around with different memory settings and different virtual CPUs. Then he ran different amounts of requests against those settings and tried to figure out which one was the better technology option in the most cost-effective way.

His need for this went beyond engineering work; it enabled product teams with the right decision-making data. He wrote a knowledge base article to give them guidance for a more educated decision on the right AWS service. Having the data to compare the two services allowed him to answer questions like:

* When should you be using either of these technologies?
* What use cases would either technology be more suitable for?

This data-based decision support is built mainly on monitoring or observability data. Using monitoring data to compare tools and technologies for guiding product teams is something I think reliability folk can gain a lot of value from doing.

Conducting ad hoc retrospective analysis (debugging)

Debugging is a bread-and-butter responsibility for any software engineer at any level. It's something everybody should know a little more about than other tasks, because there are very effective and very ineffective ways of going about debugging. Monitoring data can help the debugging process fall on the effective side.

There are organizations where you have 10 different systems. In one system, you might get one fragmented piece of information. In another, you'll get another fragment. And so on for all the different systems. Then you have to correlate these pieces of information in your head and hopefully get some clarity out of the fragments to form some kind of insight.

Monitoring data brought together into one data stream can help correlate and combine all these pieces of information. With it, you can:

* Pinpoint slow-running queries or functions by analyzing execution times and resource usage, helping you identify inefficiencies in your code
* Correlate application logs with infrastructure metrics to determine if a performance issue is due to code errors or underlying infrastructure problems
* Track memory leaks or CPU spikes by monitoring resource usage trends, which can help you identify faulty code or services
* Set up detailed error tracking that automatically flags code exceptions and matches them with infrastructure events, to get to the root cause faster
* Monitor system load alongside application performance to see if scaling issues are related to traffic spikes or inefficient code paths

Being able to do all this makes the insight part easier for you. Your debugging approach becomes very different: much more effective, much less time-consuming, and potentially even fun, because you get to the root cause of the thing that isn't working much faster.

Your monitoring/observability data setup can make debugging pleasant to a certain degree, or it can make it downright miserable. If it's done well, it's just one of those things you don't even have to think about. It's part of your job. You do it, it's very effective, and you move on.

Wrapping up

So we've covered three more use cases for monitoring data, other than the usual alerts and dashboards. They are, once again:

* Analyzing long-term trends
* Comparing over time or experiment groups
* Conducting ad hoc retrospective analysis, aka debugging

Next time your boss asks you what all these systems do, you now have three more reasons to focus on your monitoring and be able to use it more effectively. Until next time, happy monitoring.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
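For the long-term-trends use case, here is a minimal sketch (my own illustration, not from the episode) of the kind of analysis Sebastian describes: fit a straight line to daily database-size samples and project when a capacity threshold would be reached. The sample sizes and the 1000 GB capacity are assumed values.

```python
# Hypothetical daily database-size samples in GB over the last 30 days,
# e.g. exported from your metrics backend. The numbers are invented.
sizes_gb = [500 + 2.3 * day + (day % 7) * 0.5 for day in range(30)]

# Ordinary least-squares fit of size against day index.
xs = list(range(len(sizes_gb)))
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(sizes_gb) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sizes_gb))
    / sum((x - mean_x) ** 2 for x in xs)
)

capacity_gb = 1000  # assumed provisioned capacity
days_until_full = (
    (capacity_gb - sizes_gb[-1]) / slope if slope > 0 else float("inf")
)

print(f"growth rate: {slope:.2f} GB/day")
print(f"projected days until {capacity_gb} GB is reached: {days_until_full:.0f}")
```

The same pattern, collecting samples over a long enough retention window and fitting a trend, applies to the other questions above, from resource utilization to cloud cost.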
Aug 20, 2024 • 37min

#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity

Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He has dedicated much of his talk time at DevOps events to a topic less covered at such technical events. A lot of what he said alluded to ways to become a more valuable engineer. I've broken them down into the following areas:

* Avoid the heroic efforts
* Mind + heart > mind alone
* Curiosity > credentials
* Experience > certifications
* Thinking for complexity

When I saw him in Toronto, I thought he would talk about pre-production observability. It would only make sense after watching the previous presenter do a deep dive into Kubernetes tooling. But surprisingly, he started with culture and the need to prevent burnout among engineers, a topic that is as important today as it was 2 years ago when he gave the talk. Here's a look into Shlomo's philosophy and the practices he champions.

Avoid the heroic efforts

Shlomo's perspective on heroics in engineering and operations challenges a traditional mindset that often glorifies excessive individual effort at the cost of long-term sustainability. He emphasizes that relying on heroics, where individuals consistently go above and beyond to save the day, creates an unhealthy work environment. "We shouldn't be rewarding people for pulling all-nighters to save a project; we should be asking why those all-nighters were necessary in the first place."

This approach not only burns out engineers but also masks underlying systemic issues that need to be addressed. So, instead of celebrating heroic efforts, Shlomo advocates for creating processes and metrics that ensure smooth operations without the need for constant intervention.

Mind + heart > mind alone

One of the challenges Shlomo has faced recently is scaling his engineering organization amidst rapid growth. His approach to hiring is unusual; he doesn't just look for technical skills but prioritizes self-awareness and kindness. "Hiring with heart means looking for individuals who bring empathy and integrity to the team, not just expertise."

When he joined The Score, a subsidiary of Penn Interactive, Shlomo immediately revamped the hiring practices by integrating the values above into the process. He favors role-playing scenarios over solely using behavioral interviews to evaluate candidates, as this method reveals how individuals might react in real production situations. I tend to agree with this approach, as seeing how people do the work is more enlightening than only asking them how they behaved in a past situation.

Curiosity > credentials

How it plays into career progression: when it comes to career progression, Shlomo places little value on traditional markers like education or years of experience. Instead, he values adaptability, resilience, and curiosity. This last trait is the one he doubles down on. According to Shlomo, curiosity is the cornerstone of continuous growth and innovation. It's not just about asking questions; it's about fostering a mindset that constantly seeks to understand the "why" behind everything. Shlomo advocates for a deep, insatiable curiosity that drives engineers to explore beyond the surface of problems, looking for underlying causes and potential improvements. He believes that this kind of curiosity is what separates good engineers from great ones, as it leads to discovering solutions that aren't immediately obvious and pushes the boundaries of what's possible.

How it plays into teamwork: for Shlomo, curiosity also plays a crucial role in building a cohesive and forward-thinking team. He encourages leaders to cultivate an environment where questions are welcomed and no stone is left unturned. This approach not only sparks creativity but also ensures that everyone is engaged in a continuous learning process, which is vital in a field that evolves as rapidly as DevOps and SRE. By nurturing curiosity, teams can stay ahead of the curve. They can anticipate challenges before they arise and develop right-fit solutions that keep their work relevant and impactful. Shlomo advises engineers not to let their current organization limit them and to always seek out new challenges and learning opportunities. This mindset will make them valuable to any organization they may work with.

Experience > certifications

Shlomo's stance on certifications is clear: they don't necessarily lead to career advancement. He argues that the best engineers are those who are too busy doing the work to focus on accumulating certifications. Instead, he encourages engineers to network with industry leaders, demonstrate their skills, and seek mentorship opportunities. Experience and mentorship, he believes, are far more critical to growth than any piece of paper.

Thinking for complexity

It's a well-trodden saying now, almost a cliche, but still very relevant to standing out in a crowded engineering talent market. Shlomo and I talked about the issue of many engineers being trained to think in terms of best practices. I feel that over time this emphasis will fade, especially for more senior roles. Best practices are not directly applicable to solving today's problems, which are growing in complexity. Shlomo tries to test potential hires to see if they can handle that complexity. During interviews, he presents candidates with unreasonable scenarios to test their ability to think outside the box. This approach not only assesses their problem-solving skills but also helps them understand the interconnectedness of the challenges they will face.

Wrapping up

The insights Shlomo shared with me underscore a crucial point: the most successful engineers are those who combine technical prowess with a strong sense of curiosity, a commitment to continuous improvement, and a genuine understanding of their role within the team. By embracing these qualities, you not only enhance your current contributions but also set yourself on a path for long-term growth and success.

The takeaway is clear: to truly stand out and advance in your career, it's not just about doing your job well. It's about constantly seeking to learn more, improve processes, and connect with your team on a deeper level. These are the traits that make you not just a good engineer, but a valuable one.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
