April 25, 2023, 10:01 p.m. | USENIX


We're Still Down: A Metastable Failure Tale

Kyle Lexmond

"The status? The system has been down for hours, and we haven't been able to get it back up yet." Those are words on an incident conference call that you probably don't want to hear.

This talk explores how a globally distributed CDN experienced a metastable failure, design changes that make future failures less likely, and the unorthodox fix that made a recovery possible (and can hopefully apply to future metastable failures—maybe even yours). …
