The InfoQ Podcast

InfoQ
Jan 3, 2020 • 31min

Katie Gamanji on Condé Nast’s Kubernetes Platform, Self-Service, and the Federation and Cluster APIs

In this podcast, Daniel Bryant sat down with Katie Gamanji, Cloud Platform Engineer at Condé Nast International. Topics covered included: exploring the architecture of the Condé Nast Kubernetes-based platform; the importance of enabling self-service deployment for developers; and how the Kubernetes Federation API and Cluster API may enable more opportunities for platform automation. Why listen to this podcast:
- Founded in the early 1900s, Condé Nast is a global media company that has recently migrated its application deployment platforms from individually curated, geographically based platforms to a standardised distributed platform built on Kubernetes and AWS.
- The Condé Nast engineering team creates and manages its own Kubernetes clusters, currently using CoreOS’s/Red Hat’s Tectonic tool. Self-service deployment of applications is managed via Helm charts.
- The platform team works closely with its “customer” developer teams in order to ensure their requirements are being met.
- The Kubernetes Federation API makes it easy to orchestrate the deployment of applications to multiple clusters. This works well for cookie-cutter style deployments that require only small configuration differences, such as scaling the number of running application instances based on geographic traffic patterns.
- The Cluster API is a Kubernetes project that brings declarative APIs to cluster creation, configuration, and management. This enables more effective automation of the cluster lifecycle, and may provide more opportunities for multi-cloud Kubernetes use.
- Kubernetes Ingress on the Condé Nast platform is handled by Traefik, due to its good Helm support and cloud integration (for example, AWS Route 53 and IAM role synchronization). The platform team is exploring the use of a service mesh for 2020.
- Abstractions, interfaces, and security will be interesting focal points for improvement in the Kubernetes ecosystem in 2020.
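The cookie-cutter deployments described above differ only in a handful of configuration values. A minimal sketch of that idea in Python (region names, traffic shares, and chart values are invented for illustration, not Condé Nast's actual configuration):

```python
# Sketch: derive per-cluster deployment values that differ only in
# small, region-specific settings (replica counts scaled by traffic).
BASE_VALUES = {"image": "site-frontend:1.4.2", "port": 8080}

# Hypothetical relative traffic shares per geographic cluster.
REGION_TRAFFIC = {"eu-west": 0.5, "us-east": 0.3, "ap-south": 0.2}

def values_for_region(region: str, total_replicas: int = 20) -> dict:
    """Return the shared chart values plus a region-scaled replica count."""
    share = REGION_TRAFFIC[region]
    return {**BASE_VALUES, "replicas": max(1, round(total_replicas * share))}

# One values dict per cluster; everything except `replicas` is identical.
per_cluster = {r: values_for_region(r) for r in REGION_TRAFFIC}
```

Each resulting dict could then be fed to a templating or federation tool; the point is that the per-cluster delta stays tiny and mechanical.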
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2FeYPrE You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Check the landing page on InfoQ: https://bit.ly/2FeYPrE
Dec 27, 2019 • 26min

Joseph Jacks on Commercial Open Source Software, RISC-V, and Disrupting the Application Layer

In this podcast, Daniel Bryant spoke to Joseph Jacks, Founder of OSS Capital and the Open Core Summit, and discussed topics including the open source and open core models, innovations within open source hardware and the RISC-V instruction set architecture, and current opportunities for disruption using commercial open source software. Why listen to this podcast:
- Recently, open source software and the open core business model have driven a lot of innovation and created a lot of value, particularly within the cloud “as-a-service” space.
- There has been some disagreement between the open source and commercially focused communities, for example, in relation to licensing models and how value is captured.
- The Open Core Summit (OCS) is a new conference focusing on the intersection of commercialisation and open source software that aims to facilitate discussion in this space.
- Organisations building around open source software can potentially look at large cloud vendors as partners. Public clouds can provide effective distribution, and typically focus on offering breadth of services rather than the depth of expertise that can be provided by a specialist company.
- RISC-V is an open-source hardware instruction set architecture (ISA) based on the well-established reduced instruction set computer (RISC) principles. Leveraging RISC-V can reduce the time and cost of customising chip designs.
- A lot of recent open source innovation has focused on the infrastructure layer within computing systems. This means that the application layer is now potentially ripe for disruption via commercial open source software.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2rDfYYU
Dec 16, 2019 • 33min

The InfoQ Podcast Hosts Take a Look Back at 2019, Discussing Teal, Edge, Quantum Computing, and more

In this special year-end wrap-up podcast Wes Reisz, Shane Hastie, Daniel Bryant, and Charles Humble discuss what we’ve seen in 2019 and speculate a little on what we hope to see in 2020. Topics include business agility and Teal, what it means to be an ethical engineer, bringing your whole self to work, highlights from QCon and InfoQ during 2019, the rise of Python, and progress in quantum computing. Why listen to this podcast:
- Business agility is one of the major themes that the InfoQ team has seen emerge this year, with a stronger emphasis on outcomes over outputs. We’ve also seen a growing interest in ethics and the ethical implications of the work we all do.
- On the programming languages front the rise of Python continues, driven largely by its popularity in data science.
- As Kubernetes cements its dominant position we’re hoping to see a simplification of the workflows associated with it, as well as in areas like observability.
- There have been several big announcements in quantum computing in the past year, and this is an area we continue to watch with interest.
- Another key trend for next year is edge computing. The edge of the cloud infrastructure has an amazing amount of available compute resource, as does the device edge.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2Z0Q9OI
Dec 9, 2019 • 31min

Josh Wills on Building Resilient Data Engineering and Machine Learning Products at Slack

Josh Wills, a software engineer working on data engineering problems at Slack, discusses the Slack data architecture and how they build and observe their pipelines. Josh, along with color commentary such as the move from IC to manager (and back), discusses recommendations, tips, tools, and lessons Slack engineering teams discovered while building products like Slack Search. The podcast covers machine learning, observability, data engineering, and general practices for building highly resilient software. Why listen to this podcast:
- Slack has a philosophy of building only what they need, with a “don’t reinvent the wheel” mindset.
- Slack was originally a PHP monolith. Today, it is largely Hack-lang on HHVM, plus several Java and Go binaries. On the data side, application logs are in Thrift (there is a plan to migrate to Protobuf). Events are processed through a Kafka cluster that handles hundreds of thousands of events per second. Everything is kept in S3 with a large Hive metastore, and EMR is spun up on demand. Presto, Airflow, Slack, Snowflake (business analytics), and Quiver (a key-value store) are all used.
- ML worked best for Slack when it was used to help people answer questions. Approaches like Learning to Rank (LTR) became the most effective use of ML at Slack.
- You can get pretty far with rules. Use machine learning only when rules are no longer enough.
- When applying observability to a data pipeline, a key lesson for Slack was to focus on structured data, tracing, and high-cardinality events. This let them use the tools they were already familiar with (ELK, Prometheus, Grafana) and go deep into understanding what is happening in their systems.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2PsVA4q
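The structured, high-cardinality events mentioned above can be illustrated with a small sketch (field names are hypothetical, not Slack's actual schema): each pipeline stage emits one JSON event carrying identifiers such as a trace ID and job ID, which make narrow queries possible later.

```python
import json
import time
import uuid

def emit_event(stage: str, trace_id: str, **fields) -> str:
    """Serialise one structured pipeline event; high-cardinality fields
    (trace_id, job_id, user_id, ...) enable precise queries later."""
    event = {
        "ts": time.time(),
        "stage": stage,
        "trace_id": trace_id,   # ties all stages of one pipeline run together
        **fields,
    }
    line = json.dumps(event, sort_keys=True)
    # In a real pipeline this line would go to Kafka / ELK rather than be returned.
    return line

trace = uuid.uuid4().hex
emit_event("ingest", trace, topic="client_logs", records=1842)
emit_event("transform", trace, job_id="hive-0042", errors=0)
```

Because every event is structured, the same ELK/Prometheus-style tooling mentioned in the episode can slice by any field rather than grepping free text.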
Nov 15, 2019 • 30min

Bryan Liles on Making Kubernetes Easier for Developers, the CNCF, and “Serverless”

In this podcast, Daniel Bryant sat down with Bryan Liles, senior staff engineer at VMware. Topics covered included: the challenges with deploying applications into Kubernetes, using the open source Octant tool to increase a user’s understanding of Kubernetes clusters, and how “serverless” technologies may influence future approaches to building software. Why listen to this podcast:
- Octant is a highly extensible platform for developers to better understand the complexity of Kubernetes clusters. Octant runs locally, using the local Kubernetes credentials. It currently displays information about a Kubernetes cluster and related applications as a web page. Soon this tool and resulting display will be provided as a standalone application.
- The goal of Octant is to enable users to discover what they need to discover. The tool aims to provide context relevant to where a user is and what they are trying to achieve. The Octant plugin system allows integration with other tooling, such as logging and metrics frameworks. This aims to facilitate quick problem detection and resolution.
- Cloud native platforms like Kubernetes are complicated, as there are lots of moving parts. The most important challenge to be tackled to increase the adoption of platforms like Kubernetes is “how do we move code from our IDEs to wherever it needs to run with the least amount of friction?”. Testing needs to be implicit, as do security verification and the act of deployment. Kubernetes needs its “Ruby on Rails” moment.
- Creating “serverless” systems is an interesting approach, but we may currently be using this technology in a non-optimal way. For example, creating web applications using this technology enables scalability, but can lead to systems that are difficult to understand and that require a lot of boilerplate configuration. Arguably, a more interesting use case is implementing large-scale batch processing using simple event-driven models.
- The Cloud Native Computing Foundation (CNCF) has created a series of communities of practice called Special Interest Groups (SIGs), such as SIG App Delivery. This allows folks with similar interests to work together as a community, focusing on solving a specific set of well-scoped problems. There are many ways to get involved, from discussions to coding and creating documentation.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/37iUwIG
Nov 8, 2019 • 28min

Victor Dibia on TensorFlow.js and Building Machine Learning Models with JavaScript

Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. On today’s podcast, Wes and Victor talk about the realities of building machine learning in the browser. The two discuss the capabilities, limitations, process, and realities around using TensorFlow.js, and wrap up discussing techniques like model distillation that may enable machine learning models to be deployed in smaller footprints, such as serverless. Why listen to this podcast:
- While there are limitations to running machine learning processes in a resource-constrained environment like the browser, tools like TensorFlow.js make it worthwhile. One powerful use case is the ability to protect the privacy of a user base while still making recommendations. TensorFlow.js takes advantage of the WebGL library for its more computationally intensive operations.
- TensorFlow.js enables workflows for training and scoring models (doing inference) purely online, for importing a model built offline with more traditional Python tooling, and for a hybrid approach that builds offline and fine-tunes online. To build an offline model, you can train it with TensorFlow in Python (perhaps using a GPU cluster), export it in the TensorFlow SavedModel format (or the Keras model format), and then convert it with the TensorFlow.js converter into the TensorFlow.js web model format. At that point, the model can be imported directly into your JavaScript.
- TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models, made available by the Google AI team. It can give developers a quick jumpstart into using trained models.
- Model compression promises to make models small enough to run in places we couldn’t run models before. Model distillation is a process where a smaller model is trained to replicate the behavior of a larger one. In one case, BERT (a model almost 500MB in size) was distilled to about 7MB (almost 60x compression).
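One core ingredient of the model distillation described above is softening the teacher model's outputs with a temperature, so the student can learn from the teacher's relative confidence in near-miss classes, not just its top answer. A stdlib-only sketch (logit values invented for illustration):

```python
import math

def softmax(logits, temperature: float = 1.0):
    """Convert logits to probabilities. Temperature > 1 flattens the
    distribution, exposing the teacher's view of runner-up classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 1.0]
hard = softmax(teacher_logits)                  # near one-hot: top class dominates
soft = softmax(teacher_logits, temperature=4)   # flatter targets for the student
assert soft[1] > hard[1]                        # runner-up class is now visible
```

The student is then trained against these softened targets (typically alongside the true labels), which is part of how large models like BERT can be compressed so dramatically.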
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/32rWnab
Nov 1, 2019 • 34min

Michelle Krejci on Moving to Microservices: Visualising Technical Debt, Kubernetes, and GraphQL

In this podcast, Daniel Bryant spoke to Michelle Krejci, service engineer lead at Pantheon, about the Drupal and WordPress webops-based company’s move to a microservices architecture. Michelle is a well-known conference speaker in the space of technical leadership and continuous integration, and she shared her lessons learned over the past four years of the migration. Why listen to this podcast:
- The backend for the Pantheon webops platform began as a Python-based monolith with a Cassandra data store. This architecture choice initially enabled rapid feature development as the company searched for product/market fit. However, as the company found success and began scaling its engineering teams, adding new functionality to the monolith rapidly became challenging.
- Conceptual debt and technical debt greatly impact the ability to add new features to an application. Moving to microservices does not eliminate either of these forms of debt, but use of this architectural pattern can make the debt easier to identify and manage, for example by creating well-defined APIs and boundaries between modules.
- Technical debt -- and the associated engineering toil -- is real debt, with a dollar value, and should be tracked and made visible to everyone. Establishing “quick wins” during the early stages of the migration towards microservices was essential. Building new business-focused services using asynchronous “fire and forget” event-driven integrations with the monolith helped greatly with this goal.
- Using containers and Kubernetes provided the foundations for rapidly deploying, releasing, and rolling back new versions of a service. Running multiple Kubernetes namespaces also allowed engineers to clone the production namespace and environment (without data) and perform development and testing within an individually owned sandboxed namespace.
- Using the Apollo GraphQL platform allowed schema-first development. Frontend and backend teams collaborated on creating a GraphQL schema, and then individually built their respective services using this schema as a contract. Using GraphQL also allowed easy mocking during development, and creating backward-compatible schemas allowed the deployment and release of functionality to be decoupled.
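Backward-compatible schema evolution, as described above, means a new schema may add fields but must not remove ones existing clients rely on. A toy Python check illustrates the contract (the SDL snippets are hypothetical, and the naive regex field extraction stands in for real schema-diffing tooling such as Apollo's):

```python
import re

OLD_SDL = """
type Site {
  id: ID!
  name: String!
}
"""

# The new schema adds `region` but keeps every existing field.
NEW_SDL = """
type Site {
  id: ID!
  name: String!
  region: String
}
"""

def field_names(sdl: str) -> set:
    """Naively extract `name: Type` field declarations from an SDL string."""
    return set(re.findall(r"^\s*(\w+)\s*:", sdl, flags=re.MULTILINE))

def backward_compatible(old: str, new: str) -> bool:
    """Every field the old contract exposed must still exist in the new one."""
    return field_names(old) <= field_names(new)

assert backward_compatible(OLD_SDL, NEW_SDL)
assert not backward_compatible(NEW_SDL, OLD_SDL)  # removing `region` breaks clients
```

A check like this in CI is one way a team could keep deployment decoupled from release: new fields ship dark, and nothing a frontend depends on can silently disappear.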
Oct 4, 2019 • 29min

Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems

In today’s podcast we sit down with Ryan Kitchens, a senior site reliability engineer and member of the CORE team at Netflix. This team is responsible for the entire lifecycle of incident management at Netflix, from incident response to memorialising an issue. Why listen to this podcast:
- Top-level metrics can be used as a proxy for user experience, and can be used to determine that an issue should be alerted on and investigated. For example, at Netflix, if the customer playback initiation “streams per second” metric declines rapidly, this may be an indication that something has broken.
- Focusing on how things go right can provide valuable insight into the resilience within your system, e.g. what are people doing every day that helps us overcome incidents? Finding sources of resilience is somewhat “the story of the incident you didn’t have”.
- When conducting an incident postmortem, simply reconstructing an incident is often not sufficient to determine what needs to be fixed; there is no single root cause in complex socio-technical systems such as those found at Netflix and most modern web-based organisations. Instead, teams must dig a little deeper, and look for what went well, what contributed to the problem, and where the recurring patterns are.
- Resilience engineering is a multidisciplinary field that was established in the early 2000s, and the associated community that has emerged is both academic and deeply practical. Although much resilience engineering focuses on domains such as aviation, surgery, and the military, there is much overlap with the domain of software engineering.
- Make sure that support staff within an organisation have a feedback loop into the product team, as the people providing support often know where all of the hidden problems are, the nuances of the systems, and the workarounds.
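The “streams per second” proxy metric described above lends itself to a simple rate-of-change alert: page when the latest sample falls sharply below the recent baseline. A minimal sketch (window size and threshold are invented for illustration, not Netflix's actual alerting logic):

```python
def sps_drop_alert(window, max_drop: float = 0.3) -> bool:
    """Alert when the latest streams-per-second sample has fallen by more
    than `max_drop` (as a fraction) relative to the recent average."""
    if len(window) < 2:
        return False                      # not enough history to judge
    *history, latest = window
    baseline = sum(history) / len(history)
    return baseline > 0 and (baseline - latest) / baseline > max_drop

# Steady traffic: no alert.  Sudden collapse: alert fires.
assert not sps_drop_alert([1000, 990, 1005, 998])
assert sps_drop_alert([1000, 990, 1005, 450])
```

The appeal of such a proxy metric is that it alerts on degraded user experience directly, regardless of which of the many underlying services actually broke.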
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2LLwk8T
Sep 20, 2019 • 25min

Oliver Gould on the Three Pillars of Service Mesh, SMI, and Making Technology Bets

In this podcast we sit down with Oliver Gould, co-founder and CTO of Buoyant. Oliver has a strong background in networking, architecture, and observability, and worked on solving associated technical challenges at both Yahoo! and Twitter. Oliver is a regular presenter at cloud and infrastructure conferences, and alongside his co-founder William Morgan, you can often find them in the hallway track, waxing lyrical about service mesh -- a term they practically coined -- and trying to bring others along on the journey. Service mesh technology is still young, and the ecosystem is still very much a work in progress, but there have been several recent interesting developments within this space. One of these was the announcement of the Service Mesh Interface (SMI) at the recent KubeCon EU in Barcelona. The SMI spec seeks to unlock service mesh integrators and implementers, as it provides an abstraction that removes the need to bet on any single service mesh implementation. This can be good for both tool makers and enterprise early adopters. Many organisations, including Microsoft, HashiCorp, and Buoyant, are working alongside the community to help define the SMI. In this podcast we summarise the evolution of the service mesh concept, with a focus on the three pillars: visibility, security, and reliability. We explore the new traffic “tap” feature within Linkerd that allows near real-time in-situ querying of metrics, and discuss how to implement network security by leveraging primitives such as the Service Account provided by Kubernetes. We also discuss how reliability features, such as retries, timeouts, and circuit breakers, are becoming table stakes for infrastructure platforms. We also cover the evolution of the Service Mesh Interface, explore how service meshes may impact development and platforms in the future, and briefly discuss some of the benefits offered by the Rust language in relation to building a data plane for Linkerd.
We conclude the podcast with a discussion of the importance of community building. Why listen to this podcast:
- A well-implemented service mesh can make a distributed software system more observable. Linkerd 2.0 supports both the emitting of mesh telemetry for offline analysis and the ability to “tap” communications and make queries dynamically against the data. The Linkerd UI currently makes use of the tap functionality.
- Linkerd aims to make the implementation of secure service-to-service communication easy, and it does this by leveraging existing Kubernetes primitives. For example, Service Accounts are used to bootstrap the notion of identity, which in turn is used as a basis for Linkerd’s mTLS implementation.
- Offering reliability is “table stakes” for any service mesh. A service mesh should make it easy for platform owners to offer fundamental service-to-service communication reliability to application owners.
- The future of software development platforms may move (back) to more PaaS-like offerings. Kubernetes-based function-as-a-service (FaaS) frameworks like OpenFaaS and Knative are providing interesting features in this space. A service mesh may provide some of the glue for this type of platform.
- Working on the Service Mesh Interface (SMI) specification allowed the Buoyant team to sit down with other community members like HashiCorp and Microsoft, share ideas, and identify commonality between existing service mesh implementations.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2m5DSJ6
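The reliability features described above as table stakes -- retries with backoff, bounded by a budget -- are applied transparently by the mesh's sidecar proxies. This hypothetical Python helper just illustrates the semantics (function names and backoff values are invented, not Linkerd's implementation):

```python
import time

def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.05):
    """Retry a failing call with exponential backoff, as a sidecar proxy
    might do transparently for service-to-service requests."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure
            time.sleep(base_delay * (2 ** attempt))

# A flaky callee that succeeds on the third try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"

assert call_with_retries(flaky) == "ok"
```

The value of putting this in the mesh rather than the application is that every service gets consistent retry, timeout, and circuit-breaking behaviour without each team reimplementing it.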
Sep 13, 2019 • 25min

Event Sourcing: Bernd Rücker on Architecting for Scale

Today on the podcast, Bernd Rücker of Camunda talks about event sourcing. In particular, Wes and Bernd discuss thoughts around scalability, events, commands, consensus, and the orchestration engines Camunda has implemented. This podcast is a primer on the considerations between an RDBMS and event-driven systems. Why listen to this podcast:
- An event-driven system is a more modern approach to building highly scalable systems.
- An RDBMS can limit throughput and scalability. Camunda was able to achieve higher levels of scale by implementing an event-driven system.
- Commands and events are often confused. Commands are requests for something to happen; events describe something that has already happened. Conflating the two causes confusion in the development of event-driven systems.
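The command/event distinction above can be made concrete with a small sketch (the type names are illustrative, not Camunda's API): a command is an imperative request that may still be rejected, while an event is an immutable record of a fact, conventionally named in the past tense.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReserveSeat:          # command: asks the system to do something;
    order_id: str           # the request may still be rejected.
    seat: str

@dataclass(frozen=True)
class SeatReserved:         # event: records something that already
    order_id: str           # happened; an immutable, past-tense fact.
    seat: str

def handle(cmd: ReserveSeat, taken: set):
    """Decide on a command, emitting events rather than mutating state."""
    if cmd.seat in taken:
        return []                       # command rejected: no fact produced
    return [SeatReserved(cmd.order_id, cmd.seat)]

assert handle(ReserveSeat("o1", "12A"), taken=set()) == [SeatReserved("o1", "12A")]
assert handle(ReserveSeat("o2", "12A"), taken={"12A"}) == []
```

Keeping the handler's output as a list of events (instead of in-place updates) is what lets an event-sourced system rebuild state by replaying the log.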
