Don't gamble when it comes to reliability

1 Don't gamble when it comes to reliability
Tom Croucher, Uber Site Reliability Engineering, @sh1mmer

2 You are in a 5th floor office.
You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. >

3 You are in a 5th floor office.
You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. > south

4 You approach the elevator. You can smell lunch.
You receive a page: “Blackbox is alerting in 6 cities” >

5 You approach the elevator. You can smell lunch.
You receive a page: “Blackbox is alerting in 6 cities” > commandeer conference room

6 You are in an incident war room.
Everything is f̨̧͔̳̝̫̳̥̭͙̰̜͕̝̣̗͍̤̺̦̼̘̘̲̮͔̠̩̞ͅ ̲̻͕͖̬̞͔̣̹̮͈̝̗͓̗̲̫̘̜̱̫̦̭̼̭̙͇͖̭̥̼ͅ ̧̨̨̢̧̗̞̝̤̙̗̗̣͎̗̭̯̘̤̩̬̦͍̖͈͚̭͎̱̠͜ ̧͖̳̗͙̖̣̙̣̦̣̦̬̲͇̰̩̠̠̳̼͕̗̲̞̼̭ͅͅ ̨̡̡̞̣̗̳̱̻̮͔̩̘̳̲͕̰͎͓͔̜̳͕͇͙̠̯͓͓͜ ̢̢̳̘̗̤͕̦͕̞̮̟͎̞̭̠̳̙̻̤̰̺̻̼ͅͅͅ ̢̨̨̣͓̠̦͈͙̝̯̮̮̘̙̥͍͖̯̟̪͖̣̜̬̪̩̭ͅ ̧̢̲̺͎̠̙̰͖͈̻̹̘̦̳͇͚̘̩̼͚̰̹̦͓̰̟̞̥͈͚͍̹͎͙̙͕̥ ̧̡̡̧͔̝͎̭̩̞̞̭̼̺̳͚͎̙͉͕̥̗̲͕͓̻͎͖͇̟̣̭͔̙̞̖̙̬̼͜ͅ ̨̡̭̭͕̫̭̫̘̠͙̖̪̺̻̥̪̼̘̭̖̩͉̦͜͜͜ͅͅ >

7 Internal dashboards aren’t working

8 Rule 1. Always Know When It’s Broken

9 Rule 1. Always Know When It’s Broken
Diagram labels: Blackbox monitoring system, Test Trips, Uber Datacenter, Alerts, Tom’s Phone
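
A minimal sketch of the blackbox idea on this slide: run synthetic test trips from outside the system and page when a city fails. The `probe` callable, the retry count, and the function names are assumptions for illustration, not the monitoring system described in the talk.

```python
from typing import Callable, List


def failing_cities(probe: Callable[[str], bool],
                   cities: List[str],
                   attempts: int = 3) -> List[str]:
    """Return the cities where every synthetic test trip failed.

    `probe` is a hypothetical callable that runs one test trip for a
    city and returns True on success.
    """
    failing = []
    for city in cities:
        # Require all attempts to fail so one flaky probe does not page anyone.
        if not any(probe(city) for _ in range(attempts)):
            failing.append(city)
    return failing


def alert_message(failing: List[str]) -> str:
    """Format the page text; an empty string means no page is sent."""
    if not failing:
        return ""
    noun = "city" if len(failing) == 1 else "cities"
    return f"Blackbox is alerting in {len(failing)} {noun}"
```

Because the probe runs outside the services it checks, it can keep paging even when internal dashboards are down.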

10 Many services are suddenly losing a high % of requests...

11 ...so all these teams jump on a conference call.

12 It began with a simple mistake.

13 insert firewall all_services

14 insert [into] firewall [group] all_services [group]

15 insert [into] all_services [group] firewall [group]
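
The two readings above differ only in which positional argument is treated as the target group. One possible guard, which the talk itself does not prescribe, is to make both roles explicit at the call site. The sketch below uses Python keyword-only arguments; `insert_group`, `member`, and `into` are hypothetical names.

```python
from typing import Dict, Set


def insert_group(groups: Dict[str, Set[str]], *, member: str, into: str) -> None:
    """Add group `member` to group `into`.

    The bare `*` makes both arguments keyword-only, so a caller cannot
    swap them silently by position.
    """
    groups.setdefault(into, set()).add(member)


groups: Dict[str, Set[str]] = {}
insert_group(groups, member="all_services", into="firewall")
# insert_group(groups, "firewall", "all_services") raises TypeError
```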

16 While fixing the Redis issue, the firewall change was pushed to the shared service cluster globally.

17 Rule 2. Avoid Global Changes

18 Rule 2. Avoid Global Changes
Unplanned:
  Internet: network provider failure, e.g. dropping BGP routes, etc.
  Network: switch/router bug or failure
  Software: software bug
  Compute: machine failure, TORS failure, etc.
  Cooling: chiller failure, BMS failure, etc.
  Power: grid failure, UPS failure, etc.
Planned:
  You deployed
  You messed with it

19 Rule 2. Avoid Global Changes
Having multiple network providers, availability zones, datacenters, racks, switches, routers, and service instances protects you against fate, or other people’s mistakes.
If you deploy all at once, nothing can protect you from your own mistakes.
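
One way to act on this rule is a staged rollout: deploy to a small slice, check health, then widen. This is a minimal sketch under assumptions; `deploy` and `healthy` are hypothetical callables, and the stage names in the usage note are made up.

```python
from typing import Callable, List, Sequence


def staged_rollout(stages: Sequence[Sequence[str]],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> List[str]:
    """Deploy one stage at a time and stop at the first unhealthy stage.

    Returns the targets that received the change, so the blast radius
    of a bad change is visible to the caller.
    """
    deployed: List[str] = []
    for stage in stages:
        for target in stage:
            deploy(target)
            deployed.append(target)
        # Later stages are never touched if this stage looks unhealthy.
        if not all(healthy(target) for target in stage):
            break
    return deployed
```

With stages such as `[["canary"], ["dc1"], ["dc2", "dc3"]]`, a change that breaks `dc1` never reaches `dc2` or `dc3`.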

20 ...so we find and fix the firewall issue.

21 It’s been a long day, but everything is back to normal.

22 You are on the 6th floor patio.
In front of you is <a PR approved beverage>. Exits are West. >

23 You are on the 6th floor patio.
In front of you is <a PR approved beverage>. Exits are West. > drink beverage

24 You are on the 6th floor patio.
You receive a page: “Blackbox is alerting in 8 cities” Exits are West. >

25 You are on the 6th floor patio.
You receive a page: “Blackbox is alerting in 8 cities” Exits are West. > exit east

26 A few rebooted machines just got the bad firewall config back in a single datacenter.
So the team failed over to another datacenter.

27 Rule 3. Moving traffic is faster than fixing

28 Rule 3. Moving traffic is faster than fixing
Consistent Available Partition Tolerant

29 Rule 3. Moving traffic is faster than fixing
“UNAVAILABLE” Consistent Available Partition Tolerant

30 Rule 3. Moving traffic is faster than fixing
client client Can A client Proceed?

31 Rule 3. Moving traffic is faster than fixing
client client Can A client Proceed?

32 ...the traffic moved to the new datacenter starts failing...

33 ...the existing traffic in the new datacenter starts failing.

34 User Authentication Flow
Tiers: Frontend, User Cache, Micro Services
Components: nginx workers, HAProxy, Health Checker, Varnish workers, service workers
Request types: health check, fast auth, service request
Flow: if token in nginx cache, do service req; else do fast-auth req

35 Load balance if request hostname is localhost
backend userCache0 { … }
backend userCache1 { … }
director userCache round-robin {
  { .backend = userCache0; }
  { .backend = userCache1; }
  { .backend = userCache11; }
}
sub vcl_recv {
  # Route through the round-robin director only for localhost requests
  if (req.http.host ~ "^localhost") {
    set req.backend = userCache;
  }
}

36 Rule 4. Make your mitigations normal

37 Rule 4. Make your mitigations normal
Do test drills of your mitigations regularly
  Test at peak traffic
  Test without telling anyone else first
  Make a plan of tests you need
  Keep a log of what you’ve tested and when
Understand how mitigations affect your system
  How much extra pressure does adding 25%, 50%, 100% extra traffic put on a datacenter?
  Load test, and capacity plan based on failover needs at peak, not just peak
  Load test even more often than you drill
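
The capacity question above reduces to simple arithmetic. The sketch below assumes load scales linearly with traffic, which real systems often violate, and the request rates in the usage note are hypothetical.

```python
def utilization_after_failover(peak_rps: float,
                               capacity_rps: float,
                               extra_fraction: float) -> float:
    """Fraction of capacity used when a datacenter at peak absorbs extra traffic.

    extra_fraction is the added share of traffic: 0.25, 0.5, or 1.0 for
    the 25%, 50%, and 100% cases. A result above 1.0 means the
    datacenter cannot absorb the failover at peak.
    """
    return peak_rps * (1.0 + extra_fraction) / capacity_rps


# Hypothetical datacenter: 6,000 rps at peak, load-tested to 10,000 rps.
for extra in (0.25, 0.5, 1.0):
    used = utilization_after_failover(6000, 10000, extra)
    print(f"+{extra:.0%} traffic -> {used:.0%} of capacity")
```

In this made-up case the datacenter survives a 50% increase but not a full failover, which is the gap between planning for peak and planning for failover at peak.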

39 Rule 1. Always Know When It’s Broken
Rule 2. Avoid Global Changes
Rule 3. Moving traffic is faster than fixing
Rule 4. Make your mitigations normal
Tom Croucher, Uber Site Reliability Engineering, @sh1mmer

