Anita Zhang shares insights on managing millions of machines at Meta, open source contributions, automating repository syncing, and navigating an AI fleet. The conversation also explores transitioning from indie dev to supporting large teams, research paper titles in AI, and generating future content ideas.
Major upgrades at Meta take about a year, frequent rolling OS updates prevent issues, and contributing upstream first keeps systems current.
The TW scheduler runs containers with systemd, using systemd features to isolate jobs, with logs handled by a sidecar service.
Host profiles define machine types and enable dynamic changes, while infrastructure teams focus on reliability and scalability within Meta.
Deep dives
Managing Updates and Contributions to Upstream
The podcast discusses Meta's approach to managing updates across its millions of hosts: major upgrades take about a year, while rolling OS updates occur more frequently and without major issues. Contributing upstream first lets Meta stay current with new developments and avoid the problems that come from running outdated systems.
Container Infrastructure and Systemd Usage
Meta's internal container scheduler, TW, runs containers directly with systemd, with no intermediate agent layer. Container jobs appear as systemd units within the container, which isolates each job and lets systemd features handle isolation and management. Logs are handled by a sidecar service.
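To make the idea of running a job as a systemd unit concrete, here is a minimal sketch that builds a systemd-run command line with resource limits. It is not Meta's scheduler or its configuration: the JobSpec class, its field names, and the example values are hypothetical, and the code only assembles the arguments without executing anything. The systemd-run options --unit and --property, and the MemoryMax= and CPUQuota= properties, are standard systemd features.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job description; names are illustrative, not from the episode."""
    name: str
    command: List[str]
    memory_max: str  # value for systemd's MemoryMax=, e.g. "512M"
    cpu_quota: str   # value for systemd's CPUQuota=, e.g. "50%"


def systemd_run_argv(job: JobSpec) -> List[str]:
    """Build (but do not execute) a systemd-run invocation for the job.

    Each job becomes its own transient unit, so systemd applies the
    resource limits to that job's cgroup.
    """
    return [
        "systemd-run",
        f"--unit={job.name}",
        f"--property=MemoryMax={job.memory_max}",
        f"--property=CPUQuota={job.cpu_quota}",
        *job.command,
    ]


if __name__ == "__main__":
    demo = JobSpec("demo-job", ["/bin/sleep", "60"], "512M", "50%")
    print(" ".join(systemd_run_argv(demo)))
```

Keeping the function free of side effects means the generated command can be inspected or logged before anything is started.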
Host Profiles and Infrastructure Teams
Host profiles define machine types for specific workloads, enabling dynamic changes like kernel parameters and file systems. These changes typically involve host restarts to apply profiles. Within Meta, production engineers focus on operationalizing services while software engineers focus on feature development, combining efforts to ensure reliability and scalability of Meta's infrastructure.
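A host profile can be pictured as a small record of settings plus a comparison step that decides whether a host has to be restarted. The sketch below is only an illustration under assumptions: the HostProfile class, its fields, and the rule that any difference requires a restart are hypothetical simplifications of "changes typically involve host restarts", not Meta's actual profile system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Change = Tuple[Optional[str], Optional[str]]


@dataclass(frozen=True)
class HostProfile:
    """Hypothetical host profile: a machine type with its settings."""
    name: str
    filesystem: str
    kernel_params: Dict[str, str] = field(default_factory=dict)


def profile_diff(current: HostProfile, target: HostProfile) -> Dict[str, Change]:
    """Return every setting that differs, as {setting: (old, new)}."""
    changes: Dict[str, Change] = {}
    keys = set(current.kernel_params) | set(target.kernel_params)
    for key in sorted(keys):
        old = current.kernel_params.get(key)
        new = target.kernel_params.get(key)
        if old != new:
            changes[f"kernel_params.{key}"] = (old, new)
    if current.filesystem != target.filesystem:
        changes["filesystem"] = (current.filesystem, target.filesystem)
    return changes


def needs_restart(current: HostProfile, target: HostProfile) -> bool:
    """Assumption: any changed setting is applied through a host restart."""
    return bool(profile_diff(current, target))
```

Separating the diff from the restart decision keeps the policy easy to change, for example if some settings could later be applied without a restart.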
Exploring new approaches to scaling hosts for running multiple jobs
As scalability needs grow, the podcast delves into stacking hosts to accommodate varied job requirements efficiently. By weighing factors like RAM and storage capacity, the approach aims to optimize resources for different job types, especially those that need specialized features. The discussion also covers predicting infrastructure changes driven by AI and the significant shifts made to support an AI fleet.
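Placing several jobs on one host while respecting RAM and storage limits can be sketched as a simple first-fit packing. This is not the placement logic described in the episode: the Host and Job classes, the largest-first ordering, and the first-fit rule are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    """Hypothetical host with its remaining free capacity."""
    name: str
    ram_gb: int
    storage_gb: int
    jobs: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Job:
    """Hypothetical job with its resource requirements."""
    name: str
    ram_gb: int
    storage_gb: int


def stack_jobs(hosts: List[Host], jobs: List[Job]) -> Dict[str, Optional[str]]:
    """Place each job on the first host with enough RAM and storage.

    Jobs are considered largest first. Hosts are mutated to track the
    capacity that remains. Jobs that fit nowhere map to None.
    """
    placement: Dict[str, Optional[str]] = {}
    ordered = sorted(jobs, key=lambda j: (j.ram_gb, j.storage_gb), reverse=True)
    for job in ordered:
        placement[job.name] = None
        for host in hosts:
            if host.ram_gb >= job.ram_gb and host.storage_gb >= job.storage_gb:
                host.ram_gb -= job.ram_gb
                host.storage_gb -= job.storage_gb
                host.jobs.append(job.name)
                placement[job.name] = host.name
                break
    return placement
```

First-fit is easy to reason about but not optimal; a real scheduler would weigh many more constraints than the two shown here.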
Improving representation and inclusivity in AI development
The episode touches on the topic of enhancing diversity in AI development, a crucial aspect often overlooked. The conversation highlights the importance of considering inclusivity strategies and the potential benefits of a more diverse approach to AI technology. Despite the fictitious nature of the specific title mentioned in the episode, the underlying theme of fostering diversity and representation in AI innovation remains a key focus for future advancements.
Anita Zhang is here to tell us how Meta manages millions of bare metal Linux hosts and containers. We also discuss the Twine white paper and how AI is changing their requirements.
Changelog++ members save 8 minutes on this episode because they made the ads disappear. Join today!
Sponsors:
FireHydrant – The alerting and on-call tool designed for humans, not systems. Signals puts teams at the center, giving you ultimate control over rules, policies, and schedules. No need to configure your services or do wonky work-arounds. Signals filters out the noise, alerting you only on what matters. Manage coverage requests and on-call notifications effortlessly within Slack. But here’s the game-changer… Signals natively integrates with FireHydrant’s full incident management suite, so as soon as you’re alerted you can seamlessly kick off and manage your entire incident inside a single platform. Learn more or switch today at firehydrant.com/signals
Sentry – Code breaks, fix it faster. Don’t just observe. Take action. Sentry is the only app monitoring platform built for developers that gets to the root cause for every issue. 90,000+ growing teams use Sentry to find problems fast. Use the code CHANGELOG when you sign up to get $100 OFF the team plan.