Don't gamble when it comes to reliability Tom Croucher Uber Site Reliability Engineering tomc@uber.com @sh1mmer
You are in a 5th floor office. You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. > south
You approach the elevator. You can smell lunch. You receive a page: “Blackbox is alerting in 6 cities” > commandeer conference room
You are in an incident war room. Everything is f̨̧͔̳̝̫̳̥̭͙̰̜͕̝̣̗͍̤̺̦̼̘̘̲̮͔̠̩̞ͅ ̲̻͕͖̬̞͔̣̹̮͈̝̗͓̗̲̫̘̜̱̫̦̭̼̭̙͇͖̭̥̼ͅ ̧̨̨̢̧̗̞̝̤̙̗̗̣͎̗̭̯̘̤̩̬̦͍̖͈͚̭͎̱̠͜ ̧͖̳̗͙̖̣̙̣̦̣̦̬̲͇̰̩̠̠̳̼͕̗̲̞̼̭ͅͅ ̨̡̡̞̣̗̳̱̻̮͔̩̘̳̲͕̰͎͓͔̜̳͕͇͙̠̯͓͓͜ ̢̢̳̘̗̤͕̦͕̞̮̟͎̞̭̠̳̙̻̤̰̺̻̼ͅͅͅ ̢̨̨̣͓̠̦͈͙̝̯̮̮̘̙̥͍͖̯̟̪͖̣̜̬̪̩̭ͅ ̧̢̲̺͎̠̙̰͖͈̻̹̘̦̳͇͚̘̩̼͚̰̹̦͓̰̟̞̥͈͚͍̹͎͙̙͕̥ ̧̡̡̧͔̝͎̭̩̞̞̭̼̺̳͚͎̙͉͕̥̗̲͕͓̻͎͖͇̟̣̭͔̙̞̖̙̬̼͜ͅ ̨̡̭̭͕̫̭̫̘̠͙̖̪̺̻̥̪̼̘̭̖̩͉̦͜͜͜ͅͅ >
Internal dashboards aren’t working
Rule 1. Always Know When It’s Broken
Diagram labels: Uber Datacenter, Test Trips, Blackbox monitoring system, Alerts, Tom’s Phone
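A minimal sketch of the blackbox idea, assuming the probe runs synthetic test trips per city and alerts on the failure rate. The names `run_test_trips`, `cities_to_alert`, and `fake_trip`, the city codes, and the 20% threshold are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    city: str
    attempts: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0


def run_test_trips(city: str, trip: Callable[[str], bool], attempts: int = 5) -> ProbeResult:
    """Run synthetic trips end to end; an exception counts as a failed trip."""
    failures = 0
    for _ in range(attempts):
        try:
            ok = trip(city)
        except Exception:
            ok = False
        if not ok:
            failures += 1
    return ProbeResult(city, attempts, failures)


def cities_to_alert(results: List[ProbeResult], max_failure_rate: float = 0.2) -> List[str]:
    """Return the cities whose failure rate exceeds the (assumed) threshold."""
    return sorted(r.city for r in results if r.failure_rate > max_failure_rate)


def fake_trip(city: str) -> bool:
    # Hypothetical stand-in for a real test trip: pretend two cities are down.
    return city not in {"sf", "nyc"}


results = [run_test_trips(c, fake_trip) for c in ["sf", "nyc", "paris"]]
alerting = cities_to_alert(results)
if alerting:
    print(f"Blackbox is alerting in {len(alerting)} cities: {', '.join(alerting)}")
```

The point of the sketch is that the probe only looks at the outcome of a trip, so it can keep paging even when internal dashboards are down.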
Many services are suddenly losing a high % of requests...
...so all these teams jump on a conference call.
It began with a simple mistake.
insert firewall all_services
insert [into] firewall [group] all_services [group]
insert [into] all_services [group] firewall [group]
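One way to remove this kind of ambiguity is to make the caller name each role. The sketch below is an illustration only: `insert_into_group` and its parameters are hypothetical and do not represent the tool used in the incident.

```python
def insert_into_group(*, member: str, into: str) -> str:
    """Build the command text; keyword-only arguments force the caller to name each role."""
    if member == into:
        raise ValueError("a group cannot be inserted into itself")
    return f"insert {member} into {into}"


# The two readings from the slides are now visibly different calls.
print(insert_into_group(member="firewall", into="all_services"))
print(insert_into_group(member="all_services", into="firewall"))
```

A positional call such as `insert_into_group("firewall", "all_services")` raises `TypeError`, so the ambiguous form cannot be written at all.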
While the team was fixing the Redis issue, the firewall change was pushed globally to the shared service cluster.
Rule 2. Avoid Global Changes
Unplanned
  Internet: Network provider failure, e.g. dropping BGP routes, etc
  Network: Switch/Router bug/failure
  Software: Software bug
  Compute: Machine failure, TORS failure, etc
  Cooling: Chiller failure, BMS failure, etc
  Power: Grid failure, UPS failure, etc
Planned
  You deployed
  You messed with it
Having multiple network providers, availability zones, datacenters, racks, switches, routers, and service instances protects you against fate, or other people’s mistakes.
If you deploy all at once, nothing can protect you from your own mistakes.
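A minimal sketch of the alternative to deploying all at once: roll a change out in stages and stop at the first unhealthy stage. `staged_rollout`, the stage names, and the health check are hypothetical, chosen only to illustrate the rule.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy stage by stage; halt after the first stage that is unhealthy.

    Returns every target that received the change, i.e. the blast radius.
    """
    changed: List[str] = []
    for stage in stages:
        for target in stage:
            deploy(target)
            changed.append(target)
        if not all(healthy(t) for t in stage):
            return changed  # later stages never see the change
    return changed


# Simulate a bad change: every target that receives it becomes unhealthy.
deployed = set()
stages = [["canary-host"], ["datacenter-a"], ["datacenter-b", "datacenter-c"]]
changed = staged_rollout(stages, deployed.add, lambda t: t not in deployed)
print(changed)  # ['canary-host']
```

With a single global stage the same bad change would reach every target before any health check ran.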
...so we find and fix the firewall issue.
It’s been a long day, but everything is back to normal.
You are on the 6th floor patio. In front of you is <a PR approved beverage>. Exits are West. > drink beverage
You are on the 6th floor patio. You receive a page: “Blackbox is alerting in 8 cities” Exits are West. > exit east
A few rebooted machines got the bad firewall config back in a single datacenter, so the team failed over to another datacenter.
Rule 3. Moving traffic is faster than fixing
CAP diagram labels: Consistent, Available, Partition Tolerant (a later build adds the label “UNAVAILABLE”)
Diagram labels: client, client. Can a client proceed?
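A minimal sketch of what "moving traffic" can mean in practice, assuming traffic is split across datacenters by weight. The `drain` function and the datacenter names are hypothetical; the talk does not describe the actual failover mechanism.

```python
from typing import Dict


def drain(weights: Dict[str, float], failing: str) -> Dict[str, float]:
    """Set the failing datacenter's weight to 0 and rescale the rest to sum to 1."""
    if failing not in weights:
        raise KeyError(failing)
    healthy_total = sum(w for dc, w in weights.items() if dc != failing)
    if healthy_total <= 0:
        raise ValueError("no healthy datacenter left to take the traffic")
    return {
        dc: 0.0 if dc == failing else w / healthy_total
        for dc, w in weights.items()
    }


weights = {"dc-a": 0.5, "dc-b": 0.25, "dc-c": 0.25}
print(drain(weights, "dc-a"))  # {'dc-a': 0.0, 'dc-b': 0.5, 'dc-c': 0.5}
```

The change is a small, reversible weight update, which is why it can be applied before anyone understands the root cause.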
...the traffic moved to the new datacenter starts failing...
...the existing traffic in the new datacenter starts failing.
User Authentication Flow
Diagram tiers: Frontend, User Cache, Micro Services
Diagram components: nginx, Varnish, HAProxy, Health Checker, workers, service, fast auth
Diagram arrows: health check, service request
if token in nginx cache, do service req; else do fast-auth req
Load balance if request hostname is localhost

    backend userCache0 { … }
    backend userCache1 { … }
    …

    # Round-robin director over the user cache backends
    director userCache round-robin {
        { .backend = userCache0; }
        { .backend = userCache1; }
        { .backend = userCache11; }
    }

    sub vcl_recv {
        # Requests addressed to localhost go through the director
        if (req.http.host ~ "^localhost") {
            set req.backend = userCache;
        }
    }
Rule 4. Make your mitigations normal
Do test drills of your mitigations regularly
  Test at peak traffic
  Test without telling anyone else first
  Make a plan of tests you need
  Keep a log of what you’ve tested and when
Understand how mitigations affect your system
  How much extra pressure does adding 25%, 50%, 100% extra traffic put on a datacenter?
  Load test, and capacity plan based on failover needs at peak, not just peak
  Load test even more often than you drill
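The failover capacity arithmetic can be sketched as follows, assuming peak traffic is spread evenly across datacenters and a failed datacenter's share is split evenly among the survivors. `failover_load`, `has_headroom`, and the 1000 requests/second figure are hypothetical.

```python
def failover_load(peak_rps_per_dc: float, datacenters: int, failed: int = 1) -> float:
    """Per-datacenter load at peak once `failed` datacenters are drained."""
    survivors = datacenters - failed
    if survivors <= 0:
        raise ValueError("at least one datacenter must survive")
    return peak_rps_per_dc * datacenters / survivors


def has_headroom(capacity_rps: float, peak_rps_per_dc: float, datacenters: int, failed: int = 1) -> bool:
    """True if a surviving datacenter can absorb the failover load at peak."""
    return capacity_rps >= failover_load(peak_rps_per_dc, datacenters, failed)


# Extra load on each survivor when one datacenter fails at peak.
for n in (5, 3, 2):
    extra = failover_load(1000, n) / 1000 - 1
    print(f"{n} datacenters: +{extra:.0%}")
# 5 datacenters: +25%
# 3 datacenters: +50%
# 2 datacenters: +100%
```

Under these assumptions, a datacenter sized for exactly its own peak fails the check for any number of datacenters, which is the difference between planning for peak and planning for failover at peak.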
Rule 1. Always Know When It’s Broken
Rule 2. Avoid Global Changes
Rule 3. Moving traffic is faster than fixing
Rule 4. Make your mitigations normal
Tom Croucher Uber Site Reliability Engineering tomc@uber.com @sh1mmer