Don't gamble when it comes to reliability

Don't gamble when it comes to reliability
Tom Croucher, Uber Site Reliability Engineering
tomc@uber.com, @sh1mmer

You are in a 5th floor office. You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. > south

You approach the elevator. You can smell lunch. You receive a page: “Blackbox is alerting in 6 cities” > commandeer conference room

You are in an incident war room. Everything is f... >

Internal dashboards aren’t working

Rule 1. Always Know When It’s Broken

Rule 1. Always Know When It’s Broken
[Diagram labels: Blackbox monitoring system, Test Trips, Uber Datacenter, Alerts, Tom’s Phone]
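The diagram labels suggest a monitor that sends synthetic test trips at the datacenter from the outside and pages a person when they fail, so it keeps working when internal dashboards do not. A minimal sketch of that idea follows. The names `run_test_trips`, `alert_message`, and the `probe` callable are hypothetical, not Uber's actual Blackbox implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    city: str
    ok: bool


def run_test_trips(cities: List[str], probe: Callable[[str], bool]) -> List[ProbeResult]:
    """Run one synthetic test trip per city.

    `probe` is an assumed callable that exercises the public API for a city
    and returns True on success. A probe that raises counts as a failure.
    """
    results = []
    for city in cities:
        try:
            ok = bool(probe(city))
        except Exception:
            ok = False
        results.append(ProbeResult(city, ok))
    return results


def alert_message(results: List[ProbeResult], threshold: int = 1) -> Optional[str]:
    """Return a page message if at least `threshold` cities fail, else None."""
    failing = [r.city for r in results if not r.ok]
    if len(failing) >= threshold:
        return f"Blackbox is alerting in {len(failing)} cities"
    return None
```

The probe is injected so the monitor shares no code path, host, or network with the services it watches.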

Many services are suddenly losing a high % of requests...

...so all these teams jump on a conference call.

It began with a simple mistake.

insert firewall all_services

insert [into] firewall [group] all_services [group]

insert [into] all_services [group] firewall [group]
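The two readings above differ only in which positional argument plays which role. One way to remove that ambiguity in tooling is to make every role a named argument. This is a sketch with a hypothetical `insert_group` helper and an in-memory dict of groups, not the actual tool from the talk.

```python
from typing import Dict, Set


def insert_group(groups: Dict[str, Set[str]], *, member: str, into: str) -> None:
    """Add group `member` to group `into`.

    The bare `*` makes both arguments keyword-only, so a caller cannot
    swap them by accident: each call site has to spell out the direction.
    """
    groups.setdefault(into, set()).add(member)


groups: Dict[str, Set[str]] = {}

# The two readings of "insert firewall all_services" are now visibly different calls.
insert_group(groups, member="firewall", into="all_services")
insert_group(groups, member="all_services", into="firewall")
```

A positional call such as `insert_group(groups, "firewall", "all_services")` raises `TypeError` instead of silently picking one reading.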

While fixing the Redis issue, the firewall change was pushed to the shared service cluster globally.

Rule 2. Avoid Global Changes

Rule 2. Avoid Global Changes
Unplanned:
Internet: Network provider failure, e.g. dropping BGP routes, etc.
Network: Switch/router bug or failure
Software: Software bug
Compute: Machine failure, TORS failure, etc.
Cooling: Chiller failure, BMS failure, etc.
Power: Grid failure, UPS failure, etc.
Planned (Network, Software, Compute, Cooling, Power):
You deployed
You messed with it

Rule 2. Avoid Global Changes
Having multiple network providers, availability zones, datacenters, racks, switches, routers, and service instances protects you against fate, or other people’s mistakes.
If you deploy all at once, nothing can protect you from your own mistakes.
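One common way to avoid deploying everything at once is a staged rollout that stops at the first unhealthy stage. The sketch below assumes hypothetical `deploy` and `healthy` callables and made-up stage names. It illustrates the idea and is not the deployment system described in the talk.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    stages: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one stage at a time and halt at the first unhealthy stage.

    Returns the stages that were deployed and then verified healthy.
    Later stages are never touched once a stage fails its health check.
    """
    verified = []
    for stage in stages:
        deploy(stage)
        if not healthy(stage):
            break  # stop here so the mistake stays contained to this stage
        verified.append(stage)
    return verified
```

Ordering the stages from smallest to largest blast radius (for example one rack, then one datacenter, then the rest) means a bad change is caught while redundancy can still absorb it.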

...so we find and fix the firewall issue.

It’s been a long day, but everything is back to normal.

You are on the 6th floor patio. In front of you is <a PR approved beverage>. Exits are West. > drink beverage

You are on the 6th floor patio. You receive a page: “Blackbox is alerting in 8 cities” Exits are West. > exit east

A few rebooted machines just got the bad firewall config back in a single datacenter. So the team failed over to another datacenter.

Rule 3. Moving traffic is faster than fixing

Rule 3. Moving traffic is faster than fixing Consistent Available Partition Tolerant

Rule 3. Moving traffic is faster than fixing “UNAVAILABLE” Consistent Available Partition Tolerant

Rule 3. Moving traffic is faster than fixing
[Diagram: two clients. Can a client proceed?]
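Moving traffic can be as simple as recomputing routing weights so the failing datacenter receives nothing and the others absorb its share. A minimal sketch, assuming traffic is split by per-datacenter weights that sum to 1. The `drain` function and the datacenter names are hypothetical.

```python
from typing import Dict


def drain(weights: Dict[str, float], failed: str) -> Dict[str, float]:
    """Set the failed datacenter's weight to 0 and rescale the rest to sum to 1.

    Surviving datacenters keep their relative proportions.
    """
    remaining = sum(w for dc, w in weights.items() if dc != failed)
    if remaining <= 0:
        raise ValueError("no healthy datacenter left to take the traffic")
    return {
        dc: 0.0 if dc == failed else w / remaining
        for dc, w in weights.items()
    }


new_weights = drain({"dc1": 0.5, "dc2": 0.25, "dc3": 0.25}, failed="dc1")
```

Here `dc2` and `dc3` each go from a quarter of the traffic to half of it, which is exactly the extra load the target datacenter has to be able to carry.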

...the traffic moved to the new datacenter starts failing...

...the existing traffic in the new datacenter starts failing.

User Authentication Flow
[Diagram: three tiers labeled Frontend, User Cache, and Micro Services. Components shown: nginx, HAProxy, Varnish, workers, and a Health Checker. Arrows are labeled health check, fast auth, and service request.]
if token in nginx cache: do service req
else: do fast-auth req

Load balance if request hostname is localhost
backend userCache0 { … }
backend userCache1 { … }
…
director userCache round-robin {
    { .backend = userCache0; }
    { .backend = userCache1; }
    { .backend = userCache11; }
}
sub vcl_recv {
    if (req.http.host ~ "^localhost") {
        set req.backend = userCache;
    }
}

Rule 4. Make your mitigations normal

Rule 4. Make your mitigations normal
Do test drills of your mitigations regularly
    Test at peak traffic
    Test without telling anyone else first
    Make a plan of tests you need
    Keep a log of what you’ve tested and when
Understand how mitigations affect your system
    How much extra pressure does adding 25%, 50%, or 100% extra traffic put on a datacenter?
    Load test, and capacity plan based on failover needs at peak, not just peak
    Load test even more often than you drill
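The capacity point can be made concrete with a little arithmetic: after a failover, each surviving datacenter carries its own peak plus a share of the failed one's. The sketch below assumes the failed datacenter's load is spread evenly across survivors. The function names and the request-rate numbers are illustrative only.

```python
from typing import Dict, List


def load_after_failover(peak: Dict[str, float], failed: str) -> Dict[str, float]:
    """Peak load per surviving datacenter once `failed` is drained.

    Assumes the failed datacenter's peak load is split evenly among survivors.
    """
    survivors = [dc for dc in peak if dc != failed]
    if not survivors:
        raise ValueError("no surviving datacenter")
    extra = peak[failed] / len(survivors)
    return {dc: peak[dc] + extra for dc in survivors}


def overloaded(peak: Dict[str, float], capacity: Dict[str, float], failed: str) -> List[str]:
    """Datacenters whose post-failover load would exceed their capacity."""
    after = load_after_failover(peak, failed)
    return sorted(dc for dc, load in after.items() if load > capacity[dc])


peak = {"dc1": 1000.0, "dc2": 1000.0, "dc3": 1000.0}     # requests/s at peak
capacity = {"dc1": 1400.0, "dc2": 1400.0, "dc3": 1400.0}  # provisioned for peak + 40%
at_risk = overloaded(peak, capacity, failed="dc1")
```

With three equal datacenters, losing one adds 50% to each survivor, so 40% headroom over peak is not enough and both survivors show up in `at_risk`.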

Rule 1. Always Know When It’s Broken
Rule 2. Avoid Global Changes
Rule 3. Moving traffic is faster than fixing
Rule 4. Make your mitigations normal
Tom Croucher, Uber Site Reliability Engineering
tomc@uber.com, @sh1mmer