Why Cloud Architecture is Different! Michael Stiefel Architecting For Failure.

Slides:



Advertisements
Similar presentations
Tableau Software Australia
Advertisements

Performance Testing - Kanwalpreet Singh.
Database Architectures and the Web
Case Study - Amazon. Amazon r Amazon has many Data Centers r Hundreds of services r Thousands of commodity machines r Millions of customers at peak times.
Local Touch – Global Reach Avoiding the Chaos Monkey Brent Stineman – National Cloud Solution Specialist.
Oracle Data Guard Ensuring Disaster Recovery for Enterprise Data
Adding scalability to legacy PHP web applications Overview Mario A. Valdez-Ramirez.
NETWORK LOAD BALANCING NLB.  Network Load Balancing (NLB) is a Clustering Technology.  Windows Based. (windows server).  To scale performance, Network.
Business Continuity and DR, A Practical Implementation Mich Talebzadeh, Consultant, Deutsche Bank
Features Scalability Availability Latency Lifecycle Data Integrity Portability Manage Services Deliver Features Faster Create Business Value.
Module 8: Concepts of a Network Load Balancing Cluster
Overview Of Microsoft New Technology ENTER. Processing....
Wade Wegner Windows Azure Technical Evangelist Microsoft Corporation Windows Azure AppFabric Caching.
Distributed Information Systems - The Client server model
ArcGIS for Server Reference Implementations An ArcGIS Server’s architecture tour.
Capacity planning for web sites. Promoting a web site Thoughts on increasing web site traffic but… Two possible scenarios…
National Manager Database Services
11 SERVER CLUSTERING Chapter 6. Chapter 6: SERVER CLUSTERING2 OVERVIEW  List the types of server clusters.  Determine which type of cluster to use for.
Microsoft Load Balancing and Clustering. Outline Introduction Load balancing Clustering.
Design For Failure Is The Path To Success In Cloud Ashay Chaudhary.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Computer Measurement Group, India Reliable and Scalable Data Streaming in Multi-Hop Architecture Sudhir Sangra, BMC Software Lalit.
Server Load Balancing. Introduction Why is load balancing of servers needed? If there is only one web server responding to all the incoming HTTP requests.
Cloud Computing for the Enterprise November 18th, This work is licensed under a Creative Commons.
Chapter 10 : Designing a SQL Server 2005 Solution for High Availability MCITP Administrator: Microsoft SQL Server 2005 Database Server Infrastructure Design.
Module 13: Network Load Balancing Fundamentals. Server Availability and Scalability Overview Windows Network Load Balancing Configuring Windows Network.
HTCondor workflows at Utility Supercomputing Scale: How? Ian D. Alderman Cycle Computing.
Module 12: Designing High Availability in Windows Server ® 2008.
Performance Concepts Mark A. Magumba. Introduction Research done on 1058 correspondents in 2006 found that 75% OF them would not return to a website that.
IMPROUVEMENT OF COMPUTER NETWORKS SECURITY BY USING FAULT TOLERANT CLUSTERS Prof. S ERB AUREL Ph. D. Prof. PATRICIU VICTOR-VALERIU Ph. D. Military Technical.
Introduction to Cloud Computing
Cloud Computing & Amazon Web Services – EC2 Arpita Patel Software Engineer.
Windows Azure Conference 2014 Deploy your Java workloads on Windows Azure.
1 Moshe Shadmon ScaleDB Scaling MySQL in the Cloud.
Module 13 Implementing Business Continuity. Module Overview Protecting and Recovering Content Working with Backup and Restore for Disaster Recovery Implementing.
3-Tier Architecture Chandrasekaran Rajagopalan Cs /01/99.
11 CLUSTERING AND AVAILABILITY Chapter 11. Chapter 11: CLUSTERING AND AVAILABILITY2 OVERVIEW  Describe the clustering capabilities of Microsoft Windows.
70-293: MCSE Guide to Planning a Microsoft Windows Server 2003 Network, Enhanced Chapter 12: Planning and Implementing Server Availability and Scalability.
VMware vSphere Configuration and Management v6
High Availability in DB2 Nishant Sinha
Dynamo: Amazon’s Highly Available Key-value Store DAAS – Database as a service.
Data Communications and Networks Chapter 9 – Distributed Systems ICT-BVF8.1- Data Communications and Network Trainer: Dr. Abbes Sebihi.
Enhancing Scalability and Availability of the Microsoft Application Platform Damir Bersinic Ruth Morton IT Pro Advisor Microsoft Canada
Features Scalability Manage Services Deliver Features Faster Create Business Value Availability Latency Lifecycle Data Integrity Portability.
Building Cloud Solutions Presenter Name Position or role Microsoft Azure.
(re)-Architecting cloud applications on the windows Azure platform CLAEYS Kurt Technology Solution Professional Microsoft EMEA.
Deploying Highly Available SQL Server in Windows Azure A Presentation and Demonstration by Microsoft Cluster MVP David Bermingham.
 Cloud Computing technology basics Platform Evolution Advantages  Microsoft Windows Azure technology basics Windows Azure – A Lap around the platform.
Migration of Real Product into Windows Azure Lessons Learned.
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
70-293: MCSE Guide to Planning a Microsoft Windows Server 2003 Network, Enhanced Chapter 12: Planning and Implementing Server Availability and Scalability.
Building Scalable Resilient Websites in Azure
Business Continuity & Disaster Recovery
Recipes for Use With Thin Clients
Building a Virtual Infrastructure
Network Load Balancing Functionality
Network Load Balancing
Maximum Availability Architecture Enterprise Technology Centre.
GlassFish in the Real World
Software Architecture in Practice
Business Continuity & Disaster Recovery
Windows Azure 講師: 李智樺, Ruddy Lee
Outline Virtualization Cloud Computing Microsoft Azure Platform
Zhen Xiao, Qi Chen, and Haipeng Luo May 2013
INFO 344 Web Tools And Development
AWS Cloud Computing Masaki.
Specialized Cloud Architectures
Developing Microsoft Azure Solutions Jump Start
Building global and highly-available services using Windows Azure
Lecture 21: Replication Control
Presentation transcript:

Why Cloud Architecture is Different! Michael Stiefel Architecting For Failure

Outsource Infrastructure?

Traditional Web Application Web Site Virtual Machine / Directly on Hardware 100 MB Relational Database Inbound Transactions Output Transactions File System

Hosting Provider Costs Provider$ / Monthly Cost Host Gator9.95 Go Daddy10 ORCS Web69 Amazon83+ BYOS Windows Azure97 Note: traditional hosting, no custom colocation, virtualized data centers.

Cloud is Not Cheaper for Hosting

Perhaps, Higher Availability?

SLA is Not Radically Different ProviderCompute SLA (%) Go Daddy99.9 ORCS Web99.9 Host Gator99.9 Amazon99.95 Azure99.95 Difference is seven minutes a day; 1.75 days a year.

Higher Rate Since You Pay for Flexibility

Hosting is Not Cloud Computing

Why Utility Computing? Scalability: do not have to pay for peak scenarios. Availability: can approach 100% if you want to pay.

Architecturally, they are the same problem

You must design to accommodate missing computing resources.

Designing for Failure is Cloud Computing

What’s wrong with this Code Fragment? ClientProxy client = new ClientProxy(); Response response = client.Do (request);

Never assume that any interface between two components always succeeds.

So You Put in a Catch Handler try { ClientProxy client = new ClientProxy(); int result = client.Do (a, b, c); } catch (Exception ex) { }

What if… a timeout, how many retries? the result is a complete failure? the underlying hardware crashed? you need to save the user’s data? you are in the middle of a transaction?

What Do You Put in the Catch Handler? try { ClientProxy client = new ClientProxy(); int result = client.Do (a, b, c); } catch (Exception ex) { ???? }

You can’t program yourself out of a failure.

Failure is a first-class design citizen.

The critical issue is how to respond to failure. The underlying infrastructure cannot guarantee availability. Principle #1

Consequences of Failure Multiple tiers and dependencies. If your order queue fails, you cannot do orders. If your customer service fails, you cannot get membership information. The more dependencies, the more consequences of a poorly handle failure. Dependencies include your code, third parties, the Internet/Web, anything you do not control.

Unhandled failures propagate (like cracks) through your application.

Failures Cascade – an unhandled failure in one part of the system becomes a failure of your application. Principle #2

Two Types of Failure Transient Failure Resource Failure

Typical Response to a Transient Failure Retry How Often? How Long Before You Give Up?

Delays Cascade Just Like Failures Delays occur while you are waiting or retrying Delays hog resources like threads, TCP/IP ports, database connections, memory. Since delays are usually the result of resource bottlenecks, waiting or retrying for long periods adds to the bottleneck.

Transient failures become resource failures.

Transient Failures Retry for a short time, then give up (like a circuit breaker) if unsuccessful. Never block on I/O, timeout and assume failure.

There is no such thing as a transient failure. Fail fast and treat it as a resource failure. Principle #3

Make Components Failure Resistant Design For Beyond Largest Expected Load Understand latency of adding a new resource User load, virtual memory, CPU size, bandwidth, database Handle all Errors Failure affects more people than on the desktop.

Define your own SLA.

Stress test components and system.

A chain is a strong as its weakest link.

Use a Margin of Safety when designing the resources used. Principle #4

What is the cost of availability?

Any component or instance can fail – eliminate single points of failure.

Search for Dependencies Hardware / Virtual Machines Third Party Libraries Internet/Web Interfaces to your own components TCP/IP ports DNS Servers Message Queues Database Drivers Credit Card Processors, Geocoding services, etc.

Examine Queries Only three types of result sets: Zero, One, Many (can become large overnight) Search Providers limit results returned Remember those 5 way joins your ORM uses Objects on a DCOM or RMI call

Eliminate single points of failure. Accept the fact that you must build a distributed application. Principle #5

You need redundancy...

but you have to manage state.

Solutions such as database mirroring may have unacceptable latencies, such as over geography.

Reduce the parts of your application that handle state to a minimum.

Loss of a stateful component usually means loss of user data.

State Handling Components Does the UI layer need session state? Business Logic, Domain Layer should be stateless Use queues where they make sense to hold data Design services for minimal dependencies Pay with a customer number Keep state with the message Don’t forget infrastructure logs, configuration files State is in specialized stores

Build atomic services. Atomic means unified, not small. Decouple the services.

Stateless components allow for scalability and redundancy.

What about the data tier?

Can you relax consistency constraints? What is acceptable data loss?

What is the cost of an apology?

How important is the relational model?

Design for Eventual Consistency

Consider CQRS

Monitor your components. Understand why they fail.

Reroute traffic to existing instances or another data center or geographic area?

Add more instances?

Caching or throttling can help your application run under failure.

Poorer performance may be acceptable.

Automate…Automate….Automate

Degrade gracefully and predictably. Know what you can live without. Principle #6

Cloud Outages Happen

Some Are Normal Some Are Black Swans

Humans Reason About Probabilities Poorly

Assume the Rare Will Occur - It Will Occur Principle #7

Case Study: Amazon Four Day Outage

Facts April 21, 2011 One Day of Stabilization, Three Days of Recovery Problems: EC2, EBS, Relational Database Service Affected: Quora, Hootsite, Foursquare, Reddit Unaffected: Netflix, Twillo

Why were Netflix and Twillo Unaffected? They Designed For Failure

Netflix Explicitly Architected For Failure

Although more errors, higher latency, no increase in customer service calls or inability to find or start movies.

Key Architectural Decisions Stateless Services Data stored across isolation zones Could switch to hot standby Had Excess Capacity (N + 1) Handle large spikes or transient failures Used relational databases only where needed. Could partition data Degraded Gracefully

Fail Fast, Aggressive Timeouts Can degrade to lower quality service no personalized movie list, still can get list of available movies Non Critical Features can be removed.

Chaos Monkey

Some Problems Had to manually reroute traffic; use more automation in the future for failover and recovery Round robin load balancer can overload decreased number of instances. May have to change auto scaling algorithm and internal load balancing. Expand to Geographic Regions

Summary Hosting in a cloud computing environment is valid. Cloud Computing means designing for failure.