Smart Redundancy for Distributed Computation. George Edwards (Blue Cell Software, LLC), Yuriy Brun (University of Washington), Jae young Bang (University of Southern California), Nenad Medvidovic (University of Southern California)

Presentation transcript:

Smart Redundancy for Distributed Computation George Edwards Blue Cell Software, LLC Yuriy Brun University of Washington Jae young Bang University of Southern California Nenad Medvidovic University of Southern California

Distributed Computation Architectures
Solve large computational problems and/or process large data sets
Provide a platform and API for applications
Transparently parallelize computation across a pool of computers
Examples:
– Clouds
– Grids
– Volunteer computing

DCA Applications
Highly parallelizable problems
– Find the nth digit of π
– Factor 2ⁿ – 1
Driven by:
– Basic research
– Pharmaceutical applications
– Web analytics
– …

Volunteer Computing
Attempts to leverage the more than 1 billion (mostly idle) machines on the Internet
– Volunteers install a client
– When idle, the client requests work from a server and sends back results
Aids projects that have limited funding but large public appeal

Dealing with Faults
Context:
– Volunteers may fail or maliciously return false results
– Volunteers are not accountable
– Malicious volunteers may collude
– Well-formed but incorrect results are hard to detect
– The reliability of volunteers is difficult to estimate
Solution:
– Redundancy and voting

System Model
A task server subdivides computations into tasks
The task server replicates each task into multiple identical jobs
The task server assigns each job to a node in the node pool
Nodes perform work, send results, and rejoin the pool
New volunteer nodes may join the pool while other nodes may leave
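This model can be sketched in a few lines of Python. All names here (`Node`, `TaskServer`, `run_task`) are illustrative, not from the paper, and faulty behavior is reduced to "return a well-formed but wrong value":

```python
import random

class Node:
    """A volunteer node: returns the correct result with probability
    `reliability`, and a well-formed but wrong result otherwise."""
    def __init__(self, reliability):
        self.reliability = reliability

    def execute(self, job):
        correct = job()
        if random.random() < self.reliability:
            return correct
        return ("wrong", correct)  # plausible-looking incorrect result

class TaskServer:
    """Subdivides a computation into tasks, replicates each task into
    identical jobs, and assigns jobs to nodes drawn from the pool."""
    def __init__(self, pool):
        self.pool = pool  # nodes may join or leave between assignments

    def run_task(self, task, copies):
        # Replicate the task into `copies` identical jobs, one per chosen node.
        return [random.choice(self.pool).execute(task) for _ in range(copies)]
```

The redundancy schemes on the following slides differ only in how many copies they request and when.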

k-vote Traditional Redundancy (TR)
Performs k independent executions of each task
Takes a vote on the correctness of the result
Requires expending a factor of k resources or suffering a factor of k slowdown in performance
Example: k = 19, r = 0.7
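A minimal sketch of k-vote TR (an illustrative helper, not the authors' code): all k jobs are issued up front, and a majority vote picks the answer.

```python
from collections import Counter

def traditional_redundancy(execute, k):
    """Run k independent executions, then take a majority vote."""
    results = [execute() for _ in range(k)]
    winner, votes = Counter(results).most_common(1)[0]
    return winner, votes
```

With k = 19, all 19 jobs are paid for even when the first few already agree, which is the waste the next slide quantifies.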

Insights
Redundant computations need not be simultaneous
DCAs can dynamically adjust the level of redundancy based on run-time information
k-vote traditional redundancy wastes computations
Example: 19 independent computations (k = 19), 70% node reliability (r = 0.7)
– (0.7)^10 ≈ 2.8% of the time, the first 10 of them will return the correct result
– The last 9 results are irrelevant
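The example's arithmetic can be checked directly:

```python
r = 0.7          # node reliability
p = r ** 10      # chance that the first 10 of the 19 jobs are all correct
print(round(p, 4))  # 0.0282: about 2.8% of the time the last 9 jobs are wasted
```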

k-vote Progressive Redundancy (PR)
Distributes jobs in waves
In each wave, distributes the minimum number of jobs needed to produce a consensus (assuming all agree)
Repeats until a consensus is reached
Example: k = 19, r = 0.7
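A sketch of PR (illustrative code, assuming consensus means a simple majority of k): the first wave holds just enough jobs to form a majority, and each later wave adds only the jobs the current leader still needs.

```python
from collections import Counter

def progressive_redundancy(execute, k):
    """k-vote PR sketch: issue jobs in waves sized so that, if every new
    result agreed with the current leader, a majority of k is reached."""
    majority = k // 2 + 1          # e.g. 10 agreeing results when k = 19
    votes = Counter()
    while not votes or votes.most_common(1)[0][1] < majority:
        leader = votes.most_common(1)[0][1] if votes else 0
        for _ in range(majority - leader):   # minimum possible wave
            votes[execute()] += 1
    return votes.most_common(1)[0][0]
```

With perfectly reliable nodes and k = 19, PR stops after the first wave of 10 jobs instead of running all 19.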

Insights
The confidence level associated with a result can be computed
k-vote progressive redundancy produces results with varying confidence
Example: k = 19, r = 0.7
– If the vote is 10-0, confidence level ≈ 99.98%
– If the vote is 10-9, confidence level = 70%
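One standard way to compute that confidence (the two-outcome Bayesian reading below is an assumption here, but it reproduces the slide's numbers): if the leading result has a votes against b, only the margin a − b matters.

```python
def confidence(a, b, r):
    """Confidence that a result leading a votes to b is correct, assuming
    each node independently returns the correct result with probability r."""
    d = a - b
    return r ** d / (r ** d + (1 - r) ** d)

print(round(confidence(10, 0, 0.7), 4))  # 0.9998  (unanimous 10-0 vote)
print(round(confidence(10, 9, 0.7), 4))  # 0.7     (narrow 10-9 vote)
```

So a unanimous 10-0 vote is worth far more than a narrow 10-9 one, even though PR treats both as "consensus reached."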

Iterative Redundancy (IR)
Distributes jobs in waves
In each wave, distributes the minimum number of jobs required to achieve a desired confidence level
Repeats until the desired confidence level is reached
Example: d = 4, r = 0.7
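A sketch of IR, under the assumption that the slide's d is the vote margin (leader minus runner-up) corresponding to the desired confidence level:

```python
from collections import Counter

def iterative_redundancy(execute, d):
    """IR sketch: issue jobs until the leading result is d votes ahead of
    the runner-up; each wave holds only the jobs that could close the gap."""
    votes = Counter()
    while True:
        top = votes.most_common(2)
        lead = top[0][1] if top else 0
        second = top[1][1] if len(top) > 1 else 0
        if lead - second >= d:
            return top[0][0]
        for _ in range(d - (lead - second)):  # minimum possible wave
            votes[execute()] += 1
```

With d = 4 and agreeing nodes, only 4 jobs run; any disagreement shrinks the margin and automatically triggers another minimal wave, which is how IR adapts redundancy to the observed results.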

Algorithm Comparison
System reliability approaches 1 exponentially for TR, PR, and IR
IR produces the same reliability at a lower cost
– Or, equivalently, higher reliability at the same cost
IR is optimal with respect to cost
– Guaranteed to use the minimum computation needed to achieve the desired system reliability
[Figure: system reliability vs. cost factor for TR, PR, and IR]

Algorithm Comparison
PR and IR perform best when the reliability of the node pool is high
[Figure: improvement ratio over traditional redundancy vs. node reliability]

Adaptive Behavior
IR maintains a constant system reliability as node reliability fluctuates
– Injects redundancy where it is needed ("unlucky" situations)
– Removes redundancy where it is unnecessary
[Figure: node reliability, cost factor, and system reliability over time]

Node Reliability Estimation
Incorrectly estimating node reliability does not affect the performance of IR
[Figure: system reliability vs. cost factor under misestimated node reliability]

Conclusions
Iterative redundancy automatically replicates computation with optimal efficiency
Iterative redundancy can be used when:
– A computation can be broken down into independent tasks
– Computation is performed by a pool of independent processing resources
– Task deployment decisions can be made at runtime
– The reliability of resources in the pool is unknown

For More Information
To appear in ICDCS 2011: "Smart Redundancy for Distributed Computation" by Yuriy Brun, George Edwards, Jae young Bang, and Nenad Medvidovic