Performance Engineering on Hard Mode with Andrew Hunter
Nov 28, 2023
Andrew Hunter, a performance engineer who makes code really fast, discusses what it takes to optimize systems at hyperscale and the distinct challenges of trading systems. He shares his favorite profiling techniques, tools for visualizing traces, and how optimizing OCaml differs from optimizing C++. Andrew and Ron also touch on the joys of musical theater and how to pass an interview when sleep-deprived.
Performance optimization is easier at hyperscale due to the impact of even small changes.
Tracing tools like magic-trace are valuable for latency-oriented performance optimization.
Balancing hardware capabilities with efficient software optimization is key for optimal performance in various application scenarios.
Deep dives
Performance Engineering as an Addictive Passion
Andrew Hunter, a software engineer with a background in performance engineering, describes his addictive passion for understanding how systems work and making them faster. Despite the complexity and challenges involved, he finds the work immensely exciting and keeps asking questions that dig into the details of code and implementations.
The Importance of Caring About System Details
Hunter emphasizes the value of caring deeply about the intricacies of hardware, operating systems, and other components that underlie software systems. Drawing from his own experience in college, where he engaged with the deep details of code implementation, he gained a significant understanding of how different components of a system work together. This comprehensive knowledge enables experts like him to excel in performance engineering by identifying and optimizing critical elements.
Performance Engineering in Different Contexts: Scale and Tails
Hunter explains that performance engineering in large-scale organizations like Google offers a 'target-rich environment' where even marginal improvements can have a massive impact due to the sheer scale of operations. In trading systems, by contrast, the focus shifts to minimizing latency and taming tail performance: the challenge lies in understanding and optimizing what the system does during critical moments rather than in aggregate. Sampling profilers prove useful for identifying bottlenecks at scale, while tracing tools like magic-trace provide insight into tail performance and point to areas for improvement.
The Value of Tracing for Performance Analysis
Tracing, whether of memory allocations or RPCs, provides a comprehensive, detailed view of a system's data flow and behavior. Because it captures every event rather than a statistical sample, it supports a deeper understanding of potential bottlenecks and performance issues. Tracing is particularly useful for focused, latency-oriented optimization, since it gives direct visibility into individual events and their impact on performance. Unlike statistical profilers, tracing tools like magic-trace record all the information and present it in an easily interpretable form, making them valuable for optimizing system performance.
The Role of Hardware and Dialects in Performance Optimization
Hardware plays a crucial role in achieving low-latency performance, and optimizing system architecture involves leveraging hardware capabilities such as FPGA-based NICs. Software optimization still matters across a range of time scales, though: hardware can provide unmatched speed, but software is needed to handle complex logic and keep the overall system efficient. Within software optimization there are two broad approaches: writing code in lower-level languages like C to access specific hardware features, and using DSLs or dialects of the main language to gain better performance and control over memory layout. Balancing hardware capabilities with efficient software optimization is essential for achieving good performance across application scenarios ranging from high-speed trading to systems with humans in the loop.
Andrew Hunter makes code really, really fast. Before joining Jane Street, he worked for seven years at Google on multithreaded architecture, and was a tech lead for tcmalloc, Google’s world-class scalable malloc implementation. In this episode, Andrew and Ron discuss how, paradoxically, it can be easier to optimize systems at hyperscale because of the impact that even minuscule changes can have. Finding performance wins in trading systems—which operate at a smaller scale, but which have bursty, low-latency workloads—is often trickier. Andrew explains how he approaches the problem, including his favorite profiling techniques and tools for visualizing traces; the unique challenges of optimizing OCaml versus C++; and when you should and shouldn’t care about nanoseconds. They also touch on the joys of musical theater, and how to pass an interview when you’re sleep-deprived.
You can find the transcript for this episode on our website.
Some links to topics that came up in the discussion: