The Hunt for the Cluster-Killer Bug
We know Erlang is all about fault tolerance. A well-engineered Erlang system, such as Kred at the heart of Klarna’s business, will never stop, no matter what. Yet about a year ago, a short Kafka outage shook our mighty Kred so badly that it knocked out all but one node. A few days later, a second outage took down the entire cluster. How could this happen?
This is the story of our hunt for the cluster-killer bug before it could strike again. It is a story of unexpected twists and of a descent into the deepest layers of the technology stack powering an Erlang application.
OBJECTIVES
Provide new tools for debugging low-level issues in an Erlang stack.
Explain Erlang’s memory model.
AUDIENCE
Developers who would like to add some new tricks to their debugging toolbox.