The Case for Drill-Ready Cloud Computing. Vision Paper. Tanakorn Leesatapornwongsa and Haryadi S. Gunawi

Presentation transcript:

The Case for Drill-Ready Cloud Computing Vision Paper Tanakorn Leesatapornwongsa and Haryadi S. Gunawi 1

Cloud Services Cheap Convenient Reliable 2

Yahoo Mail Disruption 3
Hardware failures
Wrong failover
Disruptions
– Some users could not access their mail
– Some users saw wrong notifications
– Several days to recover

Outlook Disruption 4
Hardware failures
– Caching server
Failover to backend servers worked correctly
Requests flooded the servers
Service went down
Microsoft needed to change its software infrastructure

Cloud Outages 5
Outage | Root Event | Supposedly tolerable failure | Incorrect Recovery | Major Outage
Amazon EBS | Network misconfig | Network partition | Re-mirroring storm | Clusters collapsed
Gmail | Upgrade event | Servers offline | Bad request routing | All routing servers down
App Engine | Power failure | 25% machines offline | Bad failover | All user apps were degraded
Skype | Overload | 30% nodes failed | Positive feedback loop | Almost all nodes failed
Google Drive | Network bug | Network offline | Timeout during failover | 33% requests affected
Outlook | Caching failure | Failover to backend | Request flooding | 7-hour outage
Yahoo Mail | Hardware failures | Servers offline | Buggy failover | 1% of users affected

Journey of Cloud Dependability Research 6

Fault-Tolerant Systems 7
Complex failures
Hard to handle and implement correctly
Recovery protocols are very complex
Recovery code is one of the buggiest parts

Offline Testing Thoroughly verify recovery mechanism 8

Offline Testing 9
Thoroughly verify recovery mechanism
Fault injection, model checking, stress testing, etc.
“Mini cluster” that represents production runs
Testing and production environments are different
– Cluster, workload, failure
[Figure: a mini cluster with a test workload versus a production run with the real workload]

Offline Testing 10
Thoroughly verify recovery mechanism
Fault injection, model checking, stress testing, etc.
“Mini cluster” that represents production runs
Testing and production environments are different
– Cluster, workload, failure
Orders of magnitude different in scale
– Facebook used 100 machines to mimic a 3000-machine production run [2011]
Small start-ups forego the luxury
– Many tests are much smaller than this

Diagnosis 11
Helps administrators pinpoint and reproduce the causes of outages
BUT
– Post-mortem: does not prevent disruptions
– Passive approach: waits for outages to happen before diagnosing

Online Testing and Failure Drills 12
“Inject failures online”
Users outnumber testers
Real, deep scenarios
[Figure labels: Customers, Requests, Administrators, Test]

A Missing Piece 13
Employee: Boss, let's inject failures online using Chaos Monkey
Boss: Hmm …
Letter: Dear beloved customers, Thank you for trusting our services, but we accidentally lost your data because of the failure drills that we ran...

Future of Failure Drill 14
Current drill: a team of engineers standing by
Future: drill-ready clouds

Drill-Ready Cloud Computing 15
Automatic failure drills and automatic cancellation
In a safe, efficient, and easy manner
Ideally, no engineering effort required

Drill-Ready Cloud Computing 16
[Figure labels: Administrator, Drill Spec, Drill-Ready System, Drill Mode]
Drill Spec: Kill 25%; if it disrupts, revert back
Drill-ready cloud computing: systems take care of failure injection and cancellation

Outline Safety Efficiency Usability Generality 17

Safety 18
Learn about failure implications without suffering through them
Learn whether data can be lost
– But do not lose the data
Learn whether the SLA can be violated
– But do not violate it for a long time

Safety Solutions 19
Normal and drill states
[Figure label: Not drill aware]

Safety Solutions 20
Normal and drill states
[Figure: Normal Topology and Drill Topology]
“Maintaining 2 states”
Revert back to normal state easily
Normal and drill states: the first and most important thing for drill-ready clouds
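
One way to picture "maintaining 2 states" is a cluster view that leaves the normal topology untouched and keeps a separate drill topology only while a drill is running. The sketch below is hypothetical: the class `ClusterView` and its method names are illustrative and do not come from the slides.

```python
class ClusterView:
    """Keeps a normal topology and an optional drill topology side by side."""

    def __init__(self, nodes):
        self.normal = frozenset(nodes)  # never modified by a drill
        self.drill = None               # set only while a drill is active

    def start_drill(self, killed):
        # The drill topology is derived from the normal one; "killed" nodes
        # are only hidden in the drill view, not actually removed.
        self.drill = self.normal - frozenset(killed)

    def cancel_drill(self):
        # Reverting is cheap because the normal state was never changed.
        self.drill = None

    def active_nodes(self):
        # Requests see the drill topology during a drill, else the normal one.
        return self.drill if self.drill is not None else self.normal
```

Because the normal topology is immutable here, cancelling a drill is a single assignment rather than a recovery procedure.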

Safety Solutions 21
Drill state isolation
Self cancellation
– Real failures during the drill
– Drill master and drill agents
– The drill master commands the agents
– What if the network partitions? Agents are left in a limbo state
– Agents cancel the drill themselves when they cannot contact the master
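
The self-cancellation idea can be sketched as an agent that treats every message from the master as a heartbeat and leaves the drill state once the master has been silent for too long. This is a minimal sketch under assumptions: the class `DrillAgent`, the timeout parameter, and the polling method `tick` are all hypothetical names, not part of the slides.

```python
import time


class DrillAgent:
    """Hypothetical agent that cancels its own drill if the master is unreachable."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock              # injectable so the logic can be tested
        self.in_drill = False
        self.last_contact = clock()

    def on_master_message(self, start_drill=None):
        # Any message from the master counts as a heartbeat.
        self.last_contact = self.clock()
        if start_drill is not None:
            self.in_drill = start_drill

    def tick(self):
        # Called periodically. If the master has been silent longer than the
        # timeout (for example during a network partition), leave drill state.
        silent_for = self.clock() - self.last_contact
        if self.in_drill and silent_for > self.timeout_s:
            self.in_drill = False
        return self.in_drill
```

With this design an agent never stays in limbo longer than the timeout, at the cost of occasionally cancelling a drill when the master is merely slow.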

Safety Solutions 22
Drill state isolation
Self cancellation
Safe drill specification
– Drill specification
Drill Spec
- What failures?
- How long?
- Cancellation conditions
- Etc.
Example: Kill 25%; if the SLA is violated, revert back
Safe drill specification: check whether the specification can run safely
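
A static check of a drill specification might look like the sketch below. The slides do not define what "can run safely" means, so the three rules here (a bounded kill fraction, a bounded duration, and at least one cancellation condition) are assumptions, as are the field names and the default limits.

```python
def check_drill_spec(spec, max_kill_fraction=0.5, max_duration_s=3600):
    """Return a list of problems; an empty list means the spec passed these checks."""
    problems = []

    # "What failures?": the fraction of machines to kill must be present and bounded.
    kill = spec.get("kill_fraction")
    if kill is None or not (0 < kill <= max_kill_fraction):
        problems.append("kill_fraction missing or outside allowed range")

    # "How long?": the drill must have a bounded duration.
    duration = spec.get("duration_s")
    if duration is None or not (0 < duration <= max_duration_s):
        problems.append("duration_s missing or outside allowed range")

    # "Cancellation conditions": at least one must be given.
    if not spec.get("cancel_conditions"):
        problems.append("no cancellation condition")

    return problems
```

For example, a spec that kills 25% of machines for ten minutes and cancels on an SLA violation passes, while the same spec without a cancellation condition is rejected.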

Efficiency 23
Failures trigger data migration
Monetary cost
– Bandwidth
– Storage space
System performance
– Affects users

Efficiency Solutions 24
Low-overhead drill setup and cleanup
– Do we need to do a real key re-balance?
– Depends on the objective of the test
Yes, if we want to see the impact of background re-balancing
[Figure: six nodes with key ranges [1-10], [11-20], [21-30], [31-40], [41-50], [51-60]; sub-ranges [11-15], [16-20], [41-45], [46-50]; labels: Read / Write, data, SLA okay?]

Efficiency Solutions 25
Low-overhead drill setup and cleanup
– Do we need to do a real key re-balance?
– Depends on the objective of the test
No, if we want to measure performance when we lose 2 nodes
[Figure: four nodes with key ranges [1-15], [16-30], [31-45], [46-60]; labels: Read / Write, [46-60], SLA okay?, No key [11]]
Low-overhead setup and cleanup: the cost depends on the drill objectives, so drill objectives must be part of the drill specification
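
The key ranges shown on these two slides are consistent with splitting keys 1 to 60 evenly across six nodes and then across the four nodes that remain after two are lost. The helper below reproduces that arithmetic; the function name `even_ranges` is hypothetical, and an even contiguous split is an assumption about how the example system partitions keys.

```python
def even_ranges(first_key, last_key, num_nodes):
    """Split the inclusive key space [first_key, last_key] into contiguous ranges."""
    total = last_key - first_key + 1
    base, extra = divmod(total, num_nodes)
    ranges = []
    start = first_key
    for i in range(num_nodes):
        # The first `extra` nodes take one additional key when the split is uneven.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

A drill that only measures performance could route requests using the four-node ranges without moving any data, which is the cheaper of the two options the slides describe.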

Efficiency Solutions 26
Low-overhead drill setup and cleanup
Cheap drill specification
– Smarter and cheaper drill specification
– If replication is 50% correct, assume that the rest is correct
– Stop halfway and report success
[Figure label: Replicating progress status]

Usability Solutions 27
Declarative drill specification language
– Need a declarative language
– Describe results
– Easy to read and write
Drill Specification
During peak load
  Kill 5% machines
If SLA violated > 1 mins
  Cancel the drill
If recovery is 50% good
  Stop the drill
  Report success
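
The example specification on this slide can be represented as plain data plus a small decision function, which is one way to read "declarative". This is only a sketch: the `DrillSpec` fields, the `next_action` function, and the metric arguments are hypothetical, and the "during peak load" trigger is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillSpec:
    kill_fraction: float              # "Kill 5% machines"
    max_sla_violation_s: float        # "If SLA violated > 1 mins"
    success_recovery_fraction: float  # "If recovery is 50% good"


def next_action(spec, sla_violation_s, recovery_fraction):
    """Decide what the drill should do given the current (hypothetical) metrics."""
    # Cancellation is checked first so that safety wins over early success.
    if sla_violation_s > spec.max_sla_violation_s:
        return "cancel"
    if recovery_fraction >= spec.success_recovery_fraction:
        return "stop_and_report_success"
    return "continue"


# The slide's example, with 1 minute expressed as 60 seconds.
SPEC = DrillSpec(kill_fraction=0.05,
                 max_sla_violation_s=60.0,
                 success_recovery_fraction=0.5)
```

Checking the cancellation condition before the success condition is a design choice made here, not something the slides specify.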

Generality Solutions 28
Elasticity drill
Configuration change drill
Software upgrade drill
Security attack drill

Conclusion 29
Drill-ready cloud computing
– New reliability paradigm
Sketching a first draft
We want your FEEDBACK

Thank You 30