Talk 47:00

The Hunt for the Cluster-Killer Bug

We know Erlang is all about fault tolerance. A well-engineered Erlang system, such as Kred at the heart of Klarna’s business, will never stop, no matter what. Yet, about a year ago, a short Kafka outage shook our mighty Kred so badly that it knocked out all but one node. A few days later, a second outage took down the entire cluster. How could this happen?

This is the story of our hunt for the cluster-killer bug before it could strike again. It is a story of unexpected twists and of a descent into the deepest layers of the technology stack powering an Erlang application.

OBJECTIVES

Provide new tools for debugging low-level issues in an Erlang stack.

Explain Erlang’s memory model.

AUDIENCE

Developers who would like to add some new tricks to their debugging toolbox.