April 25, 2023, 10:01 p.m. | USENIX

USENIX www.youtube.com

We're Still Down: A Metastable Failure Tale

Kyle Lexmond

"The status? The system has been down for hours, and we haven't been able to get it back up yet." These are words on an incident conference call that you probably don't want to hear.

This talk explores how a globally distributed CDN experienced a metastable failure, design changes that make future failures less likely, and the unorthodox fix that made a recovery possible (and can hopefully apply to future metastable failures—maybe even yours). …

