Talk 43:00

Mid-air airplane repair: troubleshooting at WhatsApp

Simple, reliable messaging. It takes a lot to support this statement. For 10 years WhatsApp has demonstrated unprecedented reliability and availability, serving over 1.5B users. There is no way to reproduce the interactions between all of them within a cluster spanning over 10,000 nodes and multiple datacenters. Investigations must be done on a live system without disturbing connected users. If repairs are needed, they have to be done on the fly.

This talk will guide you through the debugging and troubleshooting techniques used at WhatsApp. Maxim will share a few case studies and explain monitoring, introspection, performance analysis, and tools.

Some knowledge of Erlang and C is necessary.

OBJECTIVES
Share processes, best practices, tools, and war stories from 10 years of running a reliable messaging service.

TARGET AUDIENCE
Software developers, DevOps engineers, Site Reliability Engineers, System Administrators, and anyone else interested in troubleshooting live production systems.