Experience with some Principles for Building an Internet-Scale Reliable System
Mike Afergan (Akamai and MIT)
Joel Wein (Akamai and Polytechnic University, Brooklyn NY)
Amy LaMeyer (Akamai)

Overview
- Background
- Our Development Philosophy
- Guiding Principles
- Metrics and Benefits of the Approach

Downloading www.xyz.com with Akamai’s EdgeSuite
1. User enters www.xyz.com; the browser requests the IP address for www.xyz.com, which is CNAMEd to Akamai.
2. DNS returns the IP address of the optimal Akamai server.
3. Browser requests HTML.
4. Akamai server assembles the page, contacting the customer Web server if necessary.
5. Optimal Akamai server returns HTML.
6, 7. Browser obtains objects from optimal Akamai servers, contacting the customer Web server if necessary.
[Diagram labels: www.xyz.com, DNS, customer Web server]
Network scale: 15,000+ servers, 1,100+ networks, 2,500+ locations.

What is this Paper About?
- Internal effort to assess and further formalize internal processes for reliability.
- Produced a long list of principles, some quite basic (e.g., input checking).
- A smaller set of principles capturing our basic approach to building distributed systems emerged.
  - Some we realized only in retrospect.
  - Many are not unique or new to us.

Sharing our Principles
- “Not always easy in practice”
- Similarities with academic literature
- Enables useful operational approach
This talk is not:
- A detailed exposition or justification of the entire system or architecture
- A scientific reliability study
- An adequate comparison with previous literature

Challenges
- Failures all the time at different levels: machines, racks, datacenters, networks, multiple networks
[Chart: Connectivity Statistics; axes: Time, “Health”]

Our Philosophy
Our software is designed to work seamlessly despite numerous failures as part of the operational network.
Assumption: A significant and constantly changing number of component or other failures occur at all times in the network.

Consequence of Philosophy

Do                             Don't
Commodity hardware             More reliable servers
Third-party datacenters        Own our own network
Smaller regions                Larger, more reliable clusters
Spread regions within ISPs     Find most reliable datacenters
Use the public Internet        Have dedicated links

Our Principles
Assumption: Significant and constantly changing failures
Philosophy: Work with numerous failures
Principle #1: Ensure Significant Redundancy
Principle #2: Logic and Software for Message Reliability
Principle #3: Distributed Control
Principle #4: Fail-Stop & Restart
Principle #5: Zoning for Releases
Principle #6: Notice and Quarantine Faults

Principle #1: Ensure Significant Redundancy
Base approach: Redundancy at every layer
Example
- Problem: gTLDs return only 13 entries, and the set is relatively static.
- Solution: IP Anycast to overload the IP addresses.
Other practical constraints
- DNS TTLs constrain flexibility
- Third-party protocols
- Cost
Simple in theory, often challenging in practice.
[Principle stack: Redundancy]

Principle #2: Use Logic and Software to Provide Message Reliability
Many message types:
- Monitoring information
- Customer content
We use an overlay transport (UDP and HTTP).
We do not:
- Have dedicated pipes
- Own datacenters
[Principle stack: Redundancy, Logic and Software]

Principle #3: Distributed Control
Different layers:
- Leader election
- Failover
Suspending a region ensures reliability.
The region may contain the only reliable content!
[Principle stack: Redundancy, Logic and Software, Distributed Control]
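The slide names leader election and failover without describing a protocol. As a minimal sketch, assuming a rule where the lowest-named healthy machine in a region leads (the rule, the node names, and the `elect_leader` helper are illustrative assumptions, not Akamai's mechanism), failover then amounts to re-running the election once the leader is marked unhealthy:

```python
from typing import Dict, Optional


def elect_leader(health: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-named healthy node, or None if no node is healthy."""
    healthy = sorted(name for name, ok in health.items() if ok)
    return healthy[0] if healthy else None


# Hypothetical region of three machines; the names are illustrative only.
region = {"node-a": True, "node-b": True, "node-c": True}
first_leader = elect_leader(region)

# Simulate the leader failing: re-running the election picks the next node.
region[first_leader] = False
second_leader = elect_leader(region)
```

A real deployment would also need agreement on the health view itself; this sketch assumes every node sees the same `health` map.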

Our ability to tolerate failures facilitates our approach to software development and operations.

Principle #4: Fail Stop and Restart
Why?
1. Significant downside to a mistake
2. Strong mechanism for recovery
Akamai could be viewed as a seven-year experiment in running Recovery Oriented Computing.
[Principle stack: Redundancy, Logic and Software, Distributed Control, Fail-Stop & Restart]
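One way to picture fail-stop and restart is a component that halts as soon as it sees state it cannot trust, paired with a supervisor that restarts it from a clean state. The sketch below is an assumption-laden illustration: `InvariantViolation`, `serve`, `supervise`, the non-negative-request invariant, and the restart limit are all hypothetical names and choices, not taken from the talk.

```python
class InvariantViolation(Exception):
    """Raised when a component detects state it cannot trust."""


def serve(request: int) -> int:
    # Hypothetical invariant: requests must be non-negative.
    if request < 0:
        raise InvariantViolation(f"unexpected request {request}")
    return request * 2


def supervise(requests, max_restarts: int = 3):
    """Run serve() over requests, restarting after each fail-stop."""
    results, restarts = [], 0
    for request in requests:
        try:
            results.append(serve(request))
        except InvariantViolation:
            restarts += 1
            if restarts > max_restarts:
                raise  # repeated failures: escalate instead of looping forever
            # "Restart": abandon the failed request and continue cleanly.
    return results, restarts


results, restarts = supervise([1, -5, 3])
```

The design choice illustrated is that stopping is preferred to continuing with a possibly wrong answer, which is only acceptable because redundancy elsewhere absorbs the lost work.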

A Cautious Approach
[Diagram: steps 1 through 4, marked with X's]
Problems:
1) Continual rolls
2) System-wide issues
[Principle stack: Redundancy, Logic and Software, Distributed Control, Fail-Stop & Restart]

Principle #6: Notice and Quarantine Faults
- Challenging problem
- Many classes of solution
- Open problem with vital importance
[Principle stack: Redundancy, Logic and Software, Distributed Control, Fail-Stop & Restart, Zoning, Fault-Isolation]
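The slide treats quarantine as an open problem and gives no mechanism. Purely as an illustration of one simple class of solution, the sketch below counts failures per source and stops accepting input from a source once a threshold is reached. The `Quarantine` class, its threshold, and the idea of a string-named "source" are assumptions made for the example.

```python
from collections import Counter


class Quarantine:
    """Count failures per source and isolate repeat offenders."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # hypothetical cutoff, not from the talk
        self.failures = Counter()
        self.quarantined = set()

    def report_failure(self, source: str) -> None:
        self.failures[source] += 1
        if self.failures[source] >= self.threshold:
            self.quarantined.add(source)

    def allowed(self, source: str) -> bool:
        return source not in self.quarantined


q = Quarantine(threshold=2)
q.report_failure("config-17")
q.report_failure("config-17")
```

After the two reports above, `q.allowed("config-17")` is false while any other source is still accepted.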

Principle #5: Zoning
Phase 1: One machine
Phase 2: Subset (< 1/8th) of the network
Phase 3: Entire network

Release Type                 1->2       2->3
Customer Config              15 mins    20 mins
System Config                30 min     2 hours
Standard Software Release    24 hours (one value listed)

The prior principles, unexpectedly, have enabled a much more reliable and aggressive release process.
[Principle stack: Redundancy, Logic and Software, Distributed Control, Fail-Stop & Restart, Zoning, Fault-Isolation]
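The three phases can be sketched as a staged rollout that stops as soon as a phase looks unhealthy. This is a hedged illustration only: `plan_phases`, `rollout`, the machine names, and the `deploy` and `healthy` callbacks are hypothetical, the phases are modeled as cumulative target sets, and the minimum wait between phases shown in the table is not modeled.

```python
from typing import Callable, List, Sequence


def plan_phases(machines: Sequence[str]) -> List[List[str]]:
    """Return cumulative target sets for the three phases on the slide."""
    if not machines:
        return [[], [], []]
    # Largest count strictly below 1/8 of the network; on very small
    # networks this falls back to one machine (same as phase 1).
    limit = max(1, (len(machines) - 1) // 8)
    return [list(machines[:1]), list(machines[:limit]), list(machines)]


def rollout(machines: Sequence[str],
            deploy: Callable[[str], None],
            healthy: Callable[[str], bool]) -> int:
    """Deploy phase by phase; return the number of phases completed."""
    done = set()
    for index, phase in enumerate(plan_phases(machines)):
        for machine in phase:
            if machine not in done:
                deploy(machine)
                done.add(machine)
        if not all(healthy(m) for m in phase):
            return index  # abort: later phases never receive the release
    return 3


machines = [f"m{i}" for i in range(32)]
deployed: List[str] = []
# Hypothetical fault: machine m2 becomes unhealthy after the release.
completed = rollout(machines, deployed.append, lambda m: m != "m2")
```

In this run the fault surfaces in phase 2, so only three of the 32 machines ever receive the release.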

Benefits to Software Development
High rate of change per month:
- ~22 network/software releases
- ~1000 customer configuration releases

Phase          Aborts    % of Total
Phase One      36        6.49
Phase Two      17        3.06
Add’l Phase    3         0.54
World          23        4.14

Data from 25 months = 556 software releases.
Our confidence in our network’s ability to handle faults enables an aggressive rate of change.
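The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers shown (the variable names are ours): it totals the aborts and the listed percentages, and derives the monthly release rate from 556 releases over 25 months.

```python
# Abort counts and listed percentages, copied from the slide.
aborts = {"Phase One": 36, "Phase Two": 17, "Add'l Phase": 3, "World": 23}
listed_pct = {"Phase One": 6.49, "Phase Two": 3.06, "Add'l Phase": 0.54, "World": 4.14}

total_aborts = sum(aborts.values())
total_pct = round(sum(listed_pct.values()), 2)

# 556 software releases over 25 months.
releases_per_month = round(556 / 25, 2)

print(total_aborts, total_pct, releases_per_month)
```

The monthly rate of about 22 agrees with the "~22 network/software releases" figure, and the listed percentages add up to roughly 14% of releases aborted at some phase.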

Benefit to Operations
Normal NOCC staffing:
- 7-8 people during the day
- 3 people at night
Per person:
- 1800 servers
- 300 datacenters
Our ability to treat faults as normal occurrences, not as crises, helps us scale.

Principles
1. Ensure Significant Redundancy
2. Use Logic and Software for Messaging
3. Employ Distributed Control
4. Fail Stop and Restart
5. Employ Zoning
6. Notice and Quarantine Faults
Key Points
These principles:
1) Build upon each other
2) Enable Akamai’s highly reliable service and ability to scale
