1
Don't gamble when it comes to reliability
Tom Croucher, Uber Site Reliability Engineering, @sh1mmer
2
You are in a 5th floor office.
You overhear some office workers: “...service can’t talk to Redis”.
Exits are North, South, East, and West.
>
3
You are in a 5th floor office.
You overhear some office workers: “...service can’t talk to Redis”.
Exits are North, South, East, and West.
> south
4
You approach the elevator. You can smell lunch.
You receive a page: “Blackbox is alerting in 6 cities”
>
5
You approach the elevator. You can smell lunch.
You receive a page: “Blackbox is alerting in 6 cities”
> commandeer conference room
6
You are in an incident war room.
Everything is f̨̧͔̳̝̫̳̥̭͙̰̜͕̝̣̗͍̤̺̦̼̘̘̲̮͔̠̩̞ͅ ̲̻͕͖̬̞͔̣̹̮͈̝̗͓̗̲̫̘̜̱̫̦̭̼̭̙͇͖̭̥̼ͅ ̧̨̨̢̧̗̞̝̤̙̗̗̣͎̗̭̯̘̤̩̬̦͍̖͈͚̭͎̱̠͜ ̧͖̳̗͙̖̣̙̣̦̣̦̬̲͇̰̩̠̠̳̼͕̗̲̞̼̭ͅͅ ̨̡̡̞̣̗̳̱̻̮͔̩̘̳̲͕̰͎͓͔̜̳͕͇͙̠̯͓͓͜ ̢̢̳̘̗̤͕̦͕̞̮̟͎̞̭̠̳̙̻̤̰̺̻̼ͅͅͅ ̢̨̨̣͓̠̦͈͙̝̯̮̮̘̙̥͍͖̯̟̪͖̣̜̬̪̩̭ͅ ̧̢̲̺͎̠̙̰͖͈̻̹̘̦̳͇͚̘̩̼͚̰̹̦͓̰̟̞̥͈͚͍̹͎͙̙͕̥ ̧̡̡̧͔̝͎̭̩̞̞̭̼̺̳͚͎̙͉͕̥̗̲͕͓̻͎͖͇̟̣̭͔̙̞̖̙̬̼͜ͅ ̨̡̭̭͕̫̭̫̘̠͙̖̪̺̻̥̪̼̘̭̖̩͉̦͜͜͜ͅͅ >
7
Internal dashboards aren’t working
8
Rule 1. Always Know When It’s Broken
9
Rule 1. Always Know When It’s Broken
[Diagram: the Blackbox monitoring system sends test trips to the Uber datacenter and sends alerts to Tom’s phone.]
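The blackbox idea on this slide can be sketched in a few lines. This is a minimal illustration, not Uber's implementation: `run_test_trip` and `page` are hypothetical callbacks standing in for the real probe and pager, and the alert wording simply mirrors the pages quoted in the story.

```python
from typing import Callable, Iterable, List


def check_cities(
    cities: Iterable[str],
    run_test_trip: Callable[[str], bool],  # hypothetical probe: True if the test trip succeeded
    page: Callable[[str], None],           # hypothetical pager hook
) -> List[str]:
    """Run one synthetic test trip per city and page once if any city fails."""
    failing: List[str] = []
    for city in cities:
        try:
            ok = run_test_trip(city)
        except Exception:
            # A probe that crashes counts as a failure: seen from outside,
            # an error and an outage look the same.
            ok = False
        if not ok:
            failing.append(city)
    if failing:
        page(f"Blackbox is alerting in {len(failing)} cities")
    return failing


if __name__ == "__main__":
    # Fake probe for illustration only: two of three cities are failing.
    down = {"sf", "nyc"}
    alerts: List[str] = []
    print(check_cities(["sf", "nyc", "paris"], lambda c: c not in down, alerts.append))
    print(alerts)
```

Because the probe only uses the system the way a customer would, it keeps working as a signal even when internal dashboards do not.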
10
Many services are suddenly losing a high % of requests...
11
...so all these teams jump on a conference call.
12
It began with a simple mistake.
13
insert firewall all_services
14
insert [into] firewall [group] all_services [group]
15
insert [into] all_services [group] firewall [group]
16
While fixing the Redis issue the firewall change was pushed to the shared service cluster globally.
17
Rule 2. Avoid Global Changes
18
Rule 2. Avoid Global Changes
Unplanned failures, by layer:
Internet: network provider failure (e.g. dropping BGP routes)
Network: switch/router bug or failure
Software: software bug
Compute: machine failure, TORS failure, etc.
Cooling: chiller failure, BMS failure, etc.
Power: grid failure, UPS failure, etc.
Planned changes (Network, Software, Compute, Cooling, Power): you deployed; you messed with it.
19
Rule 2. Avoid Global Changes
Having multiple network providers, availability zones, datacenters, racks, switches, routers, and service instances protects you against fate, or other people’s mistakes.
If you deploy all at once, nothing can protect you from your own mistakes.
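One way to avoid deploying all at once is to push a change in stages and stop at the first sign of trouble. The sketch below is an assumption about how that could look, not something shown in the talk: `apply_change`, `is_healthy`, and `rollback` are hypothetical hooks, and the stage names in the usage note are made up.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],  # hypothetical: push the change to one target
    is_healthy: Callable[[str], bool],    # hypothetical: health signal for one target
    rollback: Callable[[str], None],      # hypothetical: undo the change on one target
) -> List[str]:
    """Apply a change one stage at a time.

    If any target in a stage is unhealthy after the change, roll back
    everything applied so far and return an empty list. Otherwise return
    the targets that now carry the change.
    """
    applied: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            applied.append(target)
        if not all(is_healthy(target) for target in stage):
            # The blast radius is limited to the stages touched so far.
            for target in reversed(applied):
                rollback(target)
            return []
    return applied
```

Called with stages such as `[["canary"], ["dc1-a", "dc1-b"], ["dc2-a"]]`, a bad change that breaks `dc1-b` never reaches `dc2-a`.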
20
...so we find and fix the firewall issue.
21
It’s been a long day, but everything is back to normal.
22
You are on the 6th floor patio.
In front of you is <a PR approved beverage>.
Exits are West.
>
23
You are on the 6th floor patio.
In front of you is <a PR approved beverage>.
Exits are West.
> drink beverage
24
You are on the 6th floor patio.
You receive a page: “Blackbox is alerting in 8 cities”
Exits are West.
>
25
You are on the 6th floor patio.
You receive a page: “Blackbox is alerting in 8 cities”
Exits are West.
> exit east
26
A few rebooted machines just got the bad firewall config back in a single datacenter. So the team failed over to another datacenter.
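Failing over amounts to changing where traffic goes rather than repairing the broken site first. A minimal sketch, assuming traffic is split by per-datacenter weights that sum to 1 (the talk does not say how Uber's routing is configured; the `drain` helper and datacenter names are hypothetical):

```python
from typing import Dict


def drain(weights: Dict[str, float], failing: str) -> Dict[str, float]:
    """Return new weights with `failing` set to 0 and its share spread
    proportionally over the datacenters that were already taking traffic."""
    remaining = {dc: w for dc, w in weights.items() if dc != failing and w > 0}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy datacenter left to take the traffic")
    new_weights = {dc: 0.0 for dc in weights}
    for dc, w in remaining.items():
        new_weights[dc] = w / total
    return new_weights


if __name__ == "__main__":
    print(drain({"dc1": 0.5, "dc2": 0.5}, "dc1"))
```

The function refuses to drain the last datacenter carrying traffic, since at that point moving traffic is no longer an option and only fixing is.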
27
Rule 3. Moving traffic is faster than fixing
28
Rule 3. Moving traffic is faster than fixing
[Diagram: CAP triangle: Consistent, Available, Partition Tolerant]
29
Rule 3. Moving traffic is faster than fixing
[Diagram: CAP triangle: Consistent, Available, Partition Tolerant, marked “UNAVAILABLE”]
30
Rule 3. Moving traffic is faster than fixing
[Diagram: several clients. Can a client proceed?]
31
Rule 3. Moving traffic is faster than fixing
[Diagram: several clients. Can a client proceed?]
32
...the traffic moved to the new datacenter starts failing...
33
...the existing traffic in the new datacenter starts failing.
34
User Authentication Flow
[Diagram tiers: Frontend, User Cache, Micro Services]
[Diagram components: nginx, HAProxy, Varnish, workers, Health Checker]
[Diagram arrows: health check, service, fast auth, service request]
Frontend logic: if token in nginx cache, do service req; else do fast-auth req
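The frontend branch on this slide ("if token in nginx cache do service req else do fast-auth req") can be sketched as below. The slide does not say what happens after a fast-auth request, so this sketch assumes the result is cached and the service request then proceeds; `fast_auth`, `service_request`, and the dictionary standing in for the nginx cache are all hypothetical.

```python
from typing import Callable, Dict


def handle_request(
    token: str,
    token_cache: Dict[str, str],            # stands in for the nginx cache on the slide
    fast_auth: Callable[[str], str],        # hypothetical: validates a token, returns a user id
    service_request: Callable[[str], str],  # hypothetical: the actual service call
) -> str:
    """Route one request: cached tokens go straight to the service,
    unknown tokens take the fast-auth path first."""
    if token not in token_cache:
        # Cache miss: authenticate, then remember the result (assumption,
        # not stated on the slide).
        token_cache[token] = fast_auth(token)
    return service_request(token_cache[token])
```

One consequence worth noting: a datacenter that suddenly receives failed-over traffic starts with none of those tokens cached, so every moved request takes the fast-auth path at once.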
35
Load balance if request hostname is localhost
backend userCache0 { … }
backend userCache1 { … }
…
director userCache round-robin {
    { .backend = userCache0; }
    { .backend = userCache1; }
    { .backend = userCache11; }
}
sub vcl_recv {
    if (req.http.host ~ "^localhost") {
        set req.backend = userCache;
    }
}
36
Rule 4. Make your mitigations normal
37
Rule 4. Make your mitigations normal
Do test drills of your mitigations regularly
    Test at peak traffic
    Test without telling anyone else first
    Make a plan of tests you need
    Keep a log of what you’ve tested and when
Understand how mitigations affect your system
    How much extra pressure does adding 25%, 50%, 100% extra traffic put on a datacenter?
    Load test, and capacity plan based on failover needs at peak, not just peak
    Load test even more often than you drill
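The 25%, 50%, and 100% figures line up with a simple back-of-the-envelope calculation. The sketch below assumes traffic is spread evenly across datacenters and that the survivors share a lost site's traffic equally; both are simplifying assumptions, and the function names are made up for illustration.

```python
def extra_load_fraction(datacenters: int, lost: int = 1) -> float:
    """Extra load on each surviving datacenter, as a fraction of its normal load."""
    if lost < 0 or lost >= datacenters:
        raise ValueError("need at least one surviving datacenter")
    return lost / (datacenters - lost)


def required_capacity(peak_per_dc: float, datacenters: int, lost: int = 1) -> float:
    """Capacity each datacenter needs to absorb a failover at peak."""
    return peak_per_dc * (1 + extra_load_fraction(datacenters, lost))


if __name__ == "__main__":
    for n in (2, 3, 5):
        # Losing one of 2, 3, or 5 sites adds 100%, 50%, or 25% to each survivor.
        print(n, extra_load_fraction(n))
```

Under these assumptions, a fleet of three datacenters that each serve 1000 requests per second at peak needs each site provisioned for 1500, which is the "failover needs at peak, not just peak" point from the list above.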
39
Rule 1. Always Know When It’s Broken
Rule 2. Avoid Global Changes
Rule 3. Moving traffic is faster than fixing
Rule 4. Make your mitigations normal
Tom Croucher, Uber Site Reliability Engineering, @sh1mmer