Comcast’s TV products serve tens of millions of customers and are powered by a suite of dozens of services that are continuously developed and operated by hundreds of technical staff. While we have enjoyed many of the touted benefits of a microservice architecture—looser coupling between teams, independent deployments—we have also encountered the corresponding challenges. In particular, we’ve learned that operating a platform composed of this many services in a reliable fashion is fraught with peril. Delivering business value can seem like hacking your way through the wilderness at times.
In this talk, we’ll start by briefly reviewing some “ancient” (20 years old!) literature: partial failures make distributed systems fundamentally different. When an application can have some parts fail while other parts continue working, it can be difficult to reason about overall correctness. After learning about different types of partial failures, the audience will have an intuitive understanding of these fundamental hazards.
For the main portion of the talk, we’ll go over three strategies for surviving in a jungle of partial failures:
1. using idempotent service interfaces;
2. placing service boundaries between optional or less-critical functionality; and
3. recombining services.
Each survival tip will be explained through a concrete example, or “adventure story”, from our experience. By the end of the talk, attendees will be familiar with the pitfalls of partial failures—the main technical weakness of a microservice architecture—but will also be armed with techniques to successfully avoid them.
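To make the first tip concrete, here is a minimal sketch of an idempotent service interface. It is not taken from the talk: the `OrderService`, `LossyClient`, and `submit_with_retries` names, and the use of a client-generated request ID, are assumptions chosen for illustration. The partial failure simulated here is one where the service does the work but the response never reaches the caller, so the caller cannot tell whether to retry.

```python
import uuid


class OrderService:
    """Hypothetical service whose `submit` operation is idempotent."""

    def __init__(self):
        self.orders_created = 0
        self._responses = {}  # request_id -> response already returned

    def submit(self, request_id, item):
        # A repeated request_id returns the stored response
        # instead of creating a second order.
        if request_id in self._responses:
            return self._responses[request_id]
        self.orders_created += 1
        response = {"request_id": request_id, "item": item}
        self._responses[request_id] = response
        return response


class LossyClient:
    """Simulates a partial failure: the service does the work,
    but the first few responses are lost on the way back."""

    def __init__(self, service, responses_to_lose):
        self.service = service
        self.responses_to_lose = responses_to_lose

    def submit(self, request_id, item):
        response = self.service.submit(request_id, item)
        if self.responses_to_lose > 0:
            self.responses_to_lose -= 1
            raise TimeoutError("response lost")
        return response


def submit_with_retries(client, item, max_attempts=3):
    # One request ID is generated up front and reused for every attempt,
    # so the service can recognize retries of the same logical request.
    request_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return client.submit(request_id, item)
        except TimeoutError:
            continue
    raise TimeoutError("gave up after %d attempts" % max_attempts)


service = OrderService()
client = LossyClient(service, responses_to_lose=2)
result = submit_with_retries(client, "example-item")
```

In this sketch the caller retries twice after lost responses, yet the service creates only one order. Without the request ID check, each retry would have created a duplicate.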
This talk was delivered at the 2015 O'Reilly Software Architecture Conference: