Managing Meta's millions of machines (Ship It! #102)
May 4, 2024
Anita Zhang, a Meta expert, shares insights on managing millions of Linux hosts and containers. Topics include AI requirements, open-source contributions, efficient work practices, Meta's infrastructure, and scaling with Sentry. The podcast also features a fun game segment and a casual discussion on future content plans.
Maintaining ABI boundaries ensures compatibility during frequent OS upgrades at Meta.
TW Shared infrastructure with systemd offers isolation for containers and host profile configurations.
Upstream contributions drive Meta's release culture with automation for reliable service operations.
Deep dives
Managing Rolling OS Upgrades
Major OS upgrades at Meta can take up to a year to complete, but rolling OS upgrades happen more frequently. Because ABI boundaries are maintained, the changes in a rolling upgrade are usually bug fixes that keep programs compatible. Any bleeding-edge packages that are needed are released immediately through Hyperscale. Challenges mostly arise when updating core components like systemd, which require more intentional rollouts.
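The idea of a "more intentional rollout" for core components can be sketched as a staged plan. This is only an illustration, not Meta's tooling: the stage sizes, the `CORE_PACKAGES` set, and the `rollout_plan` function are all hypothetical names and numbers chosen for the example.

```python
# Hypothetical rollout stages, in parts per thousand of the fleet.
# These numbers are assumptions for illustration, not Meta's real schedule.
ROUTINE_STAGES = [10, 100, 500, 1000]
CAUTIOUS_STAGES = [1, 10, 50, 250, 500, 1000]

# Assumption: packages treated as "core" get the slower schedule.
CORE_PACKAGES = {"systemd"}


def rollout_plan(package: str, fleet_size: int) -> list[int]:
    """Return the cumulative number of hosts updated after each stage."""
    stages = CAUTIOUS_STAGES if package in CORE_PACKAGES else ROUTINE_STAGES
    # Always update at least one host per stage so small fleets still progress.
    return [max(1, fleet_size * per_mille // 1000) for per_mille in stages]


if __name__ == "__main__":
    print(rollout_plan("systemd", 1_000_000))
    print(rollout_plan("bash", 1_000_000))
```

A core package such as systemd gets more, smaller stages, so a regression surfaces on a small slice of the fleet before the update spreads further.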
TW Shared Infrastructure and Host Profiles
TW Shared serves as a common infrastructure where containers run directly under systemd, which improves isolation. Host profiles let users specify machine types and purposes; some adjustments, such as huge pages, require a host restart. Resources are allocated dynamically and managed by a host agent that focuses on operationalization and service readiness.
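A host profile change that needs a restart can be modeled roughly as follows. This is a minimal sketch under stated assumptions: `HostProfile`, `RESTART_REQUIRED`, `needs_restart`, and the setting names are hypothetical and do not reflect Meta's actual host agent or schema.

```python
from dataclasses import dataclass, field

# Assumption: settings that only take effect after a host restart.
# The episode mentions huge pages as one such adjustment.
RESTART_REQUIRED = {"huge_pages"}


@dataclass(frozen=True)
class HostProfile:
    """Hypothetical description of what a machine is configured for."""
    name: str
    settings: dict = field(default_factory=dict)


def needs_restart(current: HostProfile, desired: HostProfile) -> bool:
    """Return True if moving between profiles changes a restart-only setting."""
    keys = current.settings.keys() | desired.settings.keys()
    changed = {
        key for key in keys
        if current.settings.get(key) != desired.settings.get(key)
    }
    return bool(changed & RESTART_REQUIRED)


if __name__ == "__main__":
    web = HostProfile("web", {"huge_pages": 0})
    db = HostProfile("db", {"huge_pages": 1024})
    print(needs_restart(web, db))
```

Separating restart-only settings from the rest lets an agent apply cheap changes in place and schedule a restart only when one is actually required.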
Automating System Management and Infrastructure Challenges
Meta's emphasis on upstream contributions fosters a release-frequently culture, which requires robust system management tools. Automation helps manage a million hosts and keeps service operations reliable. Encapsulating services on standardized hosts makes behavior more predictable, while optimizing and validating deployments improves infrastructure stability.
AI Optimization Strategies for Infrastructure Scaling
To adapt to evolving computational demands, the episode explores stacking compute hosts. Stacking enables efficient resource utilization, particularly by accommodating large hosts for RAM-intensive jobs and optimizing host profiles for better performance. Infrastructure adjustments to support AI applications have produced notable efficiency gains, highlighting the impact of specialized compute on operational costs and resource consumption.
Enhancing Error Monitoring with Sentry Tracing
For medium to large engineering teams, tracing in Sentry supports root cause analysis of errors. By tracing requests across services and identifying what triggered an error, teams can streamline triage and assignment so that issues reach the right owner. By selectively managing alerts and using trace data, organizations can improve operational efficiency and maintain application reliability.
Anita Zhang is here to tell us how Meta manages millions of bare metal Linux hosts and containers. We also discuss the Twine white paper and how AI is changing their requirements.
Changelog++ members save 8 minutes on this episode because they made the ads disappear. Join today!
Sponsors:
FireHydrant – The alerting and on-call tool designed for humans, not systems. Signals puts teams at the center, giving you ultimate control over rules, policies, and schedules. No need to configure your services or do wonky workarounds. Signals filters out the noise, alerting you only on what matters. Manage coverage requests and on-call notifications effortlessly within Slack. But here’s the game-changer… Signals natively integrates with FireHydrant’s full incident management suite, so as soon as you’re alerted you can seamlessly kick off and manage your entire incident inside a single platform. Learn more or switch today at firehydrant.com/signals
Sentry – Code breaks, fix it faster. Don’t just observe. Take action. Sentry is the only app monitoring platform built for developers that gets to the root cause for every issue. 90,000+ growing teams use Sentry to find problems fast. Use the code CHANGELOG when you sign up to get $100 OFF the team plan.