Published by Jessie Banks. Modified over 9 years ago.
1
Experience with some Principles for Building an Internet-Scale Reliable System
Mike Afergan (Akamai and MIT)
Joel Wein (Akamai and Polytechnic University, Brooklyn, NY)
Amy LaMeyer (Akamai)
2
Overview: Background; Our Development Philosophy; Guiding Principles; Metrics and Benefits of the Approach
3
Downloading www.xyz.com with Akamai’s EdgeSuite
1) User enters www.xyz.com; the browser requests the IP address for www.xyz.com, which is CNAMEd to Akamai (DNS)
2) DNS returns the IP address of the optimal Akamai server
3) Browser requests HTML
4) Akamai server assembles the page, contacting the customer Web server if necessary
5) Optimal Akamai server returns HTML
6, 7) Browser obtains objects from optimal Akamai servers, contacting the customer Web server if necessary
15,000+ servers; 1,100+ networks; 2,500+ locations
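The DNS part of this flow (CNAME to the CDN, then selection of an edge server) can be sketched as a toy lookup. This is only an illustration: the CDN hostname, the addresses, and the "distance" scores are made-up assumptions, and the slide does not say how the optimal server is actually chosen.

```python
# Toy DNS tables; every name, address and score below is hypothetical.
CNAME = {"www.xyz.com": "www.xyz.com.cdn.example.net"}

# Candidate edge servers per CDN name, each with an assumed "distance" to the user.
EDGE_SERVERS = {
    "www.xyz.com.cdn.example.net": [
        ("192.0.2.10", 40),
        ("192.0.2.20", 12),
        ("192.0.2.30", 95),
    ],
}


def resolve(name, max_hops=8):
    """Follow CNAME records, then return the address of the closest edge server."""
    for _ in range(max_hops):
        if name in CNAME:
            # The customer hostname is an alias; keep following the chain.
            name = CNAME[name]
            continue
        candidates = EDGE_SERVERS.get(name)
        if not candidates:
            raise LookupError(f"no address for {name}")
        # "Optimal" is modelled here as simply the smallest distance score.
        return min(candidates, key=lambda c: c[1])[0]
    raise LookupError("CNAME chain too long")


print(resolve("www.xyz.com"))
```

Calling `resolve("www.xyz.com")` follows the alias once and returns the address with the lowest score in the toy table.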
4
What is this Paper About?
Internal effort to assess and further formalize internal processes for reliability.
Produced a long list of principles, some quite basic (e.g. input checking).
A smaller set of principles capturing our basic approach to building distributed systems emerged.
Some we realized only in retrospect.
Many are not unique or new to us.
5
Sharing our Principles
“Not always easy in practice”
Similarities with academic literature
Enables useful operational approach
This talk is not:
a detailed exposition or justification of the entire system or architecture;
a scientific reliability study;
an adequate comparison with previous literature.
6
Overview: Background; Our Development Philosophy; Guiding Principles; Metrics and Benefits of the Approach
7
Challenges
Failures all the time at different levels: machines, racks, datacenters, networks, multiple networks
[Chart: connectivity statistics; axes: time, “health”]
8
Our Philosophy
Our software is designed to work seamlessly despite numerous failures as part of the operational network.
Assumption: A significant and constantly changing number of component or other failures occur at all times in the network.
9
Consequence of Philosophy
Do: commodity hardware. Don’t: more reliable servers.
Do: third-party datacenters. Don’t: own our own network.
Do: smaller regions. Don’t: larger, more reliable clusters.
Do: spread regions within ISPs. Don’t: find the most reliable datacenters.
Do: use the public Internet. Don’t: have dedicated links.
10
Overview: Background; Our Development Philosophy; Guiding Principles; Metrics and Benefits of the Approach
11
Our Principles
Assumption: Significant and constantly changing failures
Philosophy: Work with numerous failures
Principle #1: Ensure Significant Redundancy
Principle #2: Logic and Software for Message Reliability
Principle #3: Distributed Control
Principle #4: Fail-Stop & Restart
Principle #5: Zoning for Releases
Principle #6: Notice and Quarantine Faults
12
Principle #1: Ensure Significant Redundancy
Base approach: redundancy at every layer
Example problem: gTLDs return only 13 entries, and the set is relatively static
Solution: IP Anycast to overload the IP addresses
Other practical constraints: DNS TTLs constrain flexibility; third-party protocols; cost
Simple in theory, often challenging in practice.
[Diagram: Redundancy]
13
Principle #2: Use Logic and Software to Provide Message Reliability
Many message types: monitoring information; customer content
We use an overlay transport (UDP and HTTP)
We do not: have dedicated pipes; own datacenters
[Diagram: Redundancy; Logic and Software]
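One way to picture reliability provided by software rather than by dedicated links is a sender that tries several overlay paths until one delivers. The sketch below is an assumption-laden illustration, not Akamai's protocol: the path names and the `send` callback are hypothetical, and a real transport would also need acknowledgements, timeouts, and duplicate suppression.

```python
def send_with_redundancy(message, paths, send):
    """Try each overlay path in order until one delivers the message.

    `paths` is a list of hypothetical path identifiers and `send(path, message)`
    is a caller-supplied function that returns True on delivery.
    Returns the path that succeeded, or None if every path failed.
    """
    for path in paths:
        try:
            if send(path, message):
                return path
        except OSError:
            # Treat a transport error like a failed delivery and try the next path.
            continue
    return None


# Usage with a fake transport in which the direct path is down.
down = {"direct"}


def fake_send(path, message):
    return path not in down


print(send_with_redundancy("status", ["direct", "via-region-b"], fake_send))
```

With the direct path marked as down, the message goes out over the second path instead.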
14
Principle #3: Distributed Control
Different layers: leader election; failover
Suspending region ensures reliability
Region contains the only reliable content!
[Diagram: Redundancy; Logic and Software; Distributed Control]
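Leader election with failover can be sketched very simply if one assumes every machine sees the same health information. The slide does not describe the actual election algorithm, so the rule used here (lowest-numbered healthy machine wins) and the names `elect_leader` and `is_healthy` are illustrative assumptions only.

```python
def elect_leader(machines, is_healthy):
    """Return the lowest-numbered healthy machine, or None if none are healthy."""
    healthy = [m for m in machines if is_healthy(m)]
    return min(healthy) if healthy else None


# Usage: machine 1 leads until it fails, then machine 2 takes over.
health = {1: True, 2: True, 3: True}
print(elect_leader(health, lambda m: health[m]))

health[1] = False  # the current leader fails
print(elect_leader(health, lambda m: health[m]))
```

Because the rule is deterministic, every machine that evaluates it on the same health view picks the same leader without a central coordinator; disagreement about health is the hard part that this sketch leaves out.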
15
Our ability to tolerate failures facilitates our approach to software development and operations.
16
Principle #4: Fail Stop and Restart
Why?
1) Significant downside to a mistake
2) Strong mechanism for recovery
Akamai could be viewed as a seven-year experiment in running Recovery Oriented Computing.
[Diagram: Redundancy; Logic and Software; Distributed Control; Fail-Stop & Restart]
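The fail-stop idea is that a component which detects something wrong stops rather than trying to repair itself in place, and a separate mechanism restarts it from a clean state. A minimal sketch, assuming a hypothetical `task` callable and an arbitrary restart limit:

```python
def run_with_restart(task, max_restarts=3):
    """Run `task`; on any error, discard the attempt and start again from scratch."""
    for attempt in range(max_restarts + 1):
        try:
            return task()
        except Exception as exc:
            # Fail-stop: no in-place repair of possibly corrupt state, just restart.
            print(f"attempt {attempt} failed: {exc!r}; restarting")
    raise RuntimeError("task kept failing; giving up")


# Usage: a task that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("corrupt state detected")
    return "ok"


print(run_with_restart(flaky))
```

The restart limit matters: a component that keeps failing should eventually be reported rather than restarted forever.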
17
A Cautious Approach
[Diagram: stages 1 through 4, with X marks]
Problems:
1) Continual rolls
2) System-wide issues
[Diagram: Redundancy; Logic and Software; Distributed Control; Fail-Stop & Restart]
18
Principle #6: Notice and Quarantine Faults
Challenging problem
Many classes of solution
Open problem of vital importance
[Diagram: Redundancy; Logic and Software; Distributed Control; Fail-Stop & Restart; Zoning; Fault-Isolation]
19
Principle #5: Zoning
Phase 1: One machine
Phase 2: Subset (< 1/8th) of the network
Phase 3: Entire network
Release type and timing:
Customer Config: 15 mins (1->2), 20 mins (2->3)
System Config: 30 min (1->2), 2 hours (2->3)
Standard Software Release: 24 hours
The prior principles, unexpectedly, have enabled a much more reliable and aggressive release process.
[Diagram: Redundancy; Logic and Software; Distributed Control; Fail-Stop & Restart; Zoning; Fault-Isolation]
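The three-phase rollout can be sketched as a loop that widens the deployment only while everything updated so far still looks healthy. This is a simplified illustration under stated assumptions: `deploy` and `healthy` are hypothetical callbacks, the phase-two share of 10% is just one value below the 1/8 bound from the slide, and the waiting periods between phases are left out.

```python
def phased_rollout(machines, deploy, healthy, subset_fraction=0.1):
    """Deploy in three phases, aborting as soon as an updated machine is unhealthy.

    Returns (status, updated_machines).
    """
    n2 = int(len(machines) * subset_fraction)
    phases = [
        ("phase 1", machines[:1]),         # one machine
        ("phase 2", machines[1:1 + n2]),   # a small subset of the network
        ("phase 3", machines[1 + n2:]),    # everything else
    ]
    updated = []
    for name, group in phases:
        for m in group:
            deploy(m)
            updated.append(m)
        # Check every machine updated so far before widening the rollout.
        if not all(healthy(m) for m in updated):
            return f"aborted in {name}", updated
    return "complete", updated


# Usage: machine 2 misbehaves once it runs the new release.
deployed = set()
status, touched = phased_rollout(
    list(range(20)),
    deploy=deployed.add,
    healthy=lambda m: m != 2,
)
print(status, touched)
```

In this run the problem is caught in phase two, after only three of the twenty machines have been touched.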
20
Overview: Background; Our Development Philosophy; Guiding Principles; Metrics and Benefits of the Approach
21
Benefits to Software Development
High rate of change per month: ~22 network/software releases; ~1000 customer configuration releases
Aborts by phase (data from 25 months = 556 software releases):
Phase One: 36 aborts (6.49% of total)
Phase Two: 17 aborts (3.06% of total)
Add’l Phase: 3 aborts (0.54% of total)
World: 23 aborts (4.14% of total)
Our confidence in our network’s ability to handle faults enables an aggressive rate of change.
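The percentages can be rechecked from the counts, assuming "% of Total" means aborts divided by the 556 software releases (the slide does not state the denominator). Under that assumption the Phase Two, Add'l Phase, and World rows reproduce exactly, while Phase One computes to roughly 6.47 rather than the slide's 6.49.

```python
# Abort counts per phase and the release total, as given on the slide.
aborts = {"Phase One": 36, "Phase Two": 17, "Add'l Phase": 3, "World": 23}
total_releases = 556

for phase, count in aborts.items():
    share = 100 * count / total_releases
    print(f"{phase}: {count} aborts, {share:.2f}% of releases")

# Overall share of releases that were aborted at some phase.
total_aborts = sum(aborts.values())
print(f"All phases: {total_aborts} aborts, {100 * total_aborts / total_releases:.2f}% of releases")
```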
22
Benefit to Operations
Normal NOCC staffing: 7-8 people during the day; 3 people at night
Per person: 1800 servers; 300 datacenters
Our ability to treat faults as normal occurrences, not as crises, helps us scale.
23
Principles
1) Ensure Significant Redundancy
2) Use Logic and Software for Messaging
3) Employ Distributed Control
4) Fail Stop and Restart
5) Employ Zoning
6) Notice and Quarantine Faults
Key Points
These principles:
1) Build upon each other
2) Enable Akamai’s highly reliable service and ability to scale