Scalable Computing on Open Distributed Systems Jon Weissman University of Minnesota National E-Science Center CLADE 2008.


What is the Problem?
Open distributed systems
– Tasks are submitted to the "system" for execution
– Workers do the computing: they execute a task and return an answer
The challenge
– Computations that are erroneous or late are less useful
– Nodes fail, err, get hacked, or are misconfigured
– The time to return answers is unpredictable
Applies to both local- and wide-area systems
– The focus here is on volunteer wide-area systems

Shape of the Solution
Replication works for all sources of unreliability
– both computation and data
How do we do this intelligently – and scalably?

Replication Challenges
How many replicas?
– too many: a waste of resources
– too few: the application suffers
Most approaches assume ad-hoc replication
– under-replicate: task re-execution (increased latency)
– over-replicate: wasted resources (reduced throughput)
Using information about the past behavior of a node, we can intelligently size the amount of redundancy

Problems with Ad-Hoc Replication
[Figure: tasks x and y sent to fixed-size groups A and B, each a mix of reliable and unreliable nodes]

System Model
Reputation rating r_i – degree of node reliability
Dynamically size the redundancy based on r_i
Note: groups are variable-sized
Assume no correlated errors for now; this is relaxed later
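The reputation rating r_i can be maintained as a sliding-window ratio of correct (or timely) results to total results, as the next slide describes. A minimal sketch; the class name, window size, and neutral prior for new nodes are illustrative choices, not details from the talk:

```python
from collections import deque

class WorkerRating:
    """Sliding-window reputation: r_i = correct (or timely) / total."""
    def __init__(self, window=50):
        self.outcomes = deque(maxlen=window)  # True = correct/timely result

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def rating(self):
        if not self.outcomes:
            return 0.5  # neutral prior for a node with no history yet
        return sum(self.outcomes) / len(self.outcomes)
```

Bounding the window keeps the rating responsive to nodes whose behavior changes over time.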

Smart Replication
Rating based on past interactions with clients
– probability r_i over a window: correct/total or timely/total
– extends to a worker group (assuming no collusion) => likelihood of correctness (LOC)
Smarter redundancy
– variable-sized worker groups
– intuition: higher-reliability clients => smaller groups

Terms
LOC (likelihood of correctness) of a group g
– the 'actual' probability of getting a correct or timely answer from a group g of clients
Target LOC
– the success rate that the system tries to ensure while forming client groups
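Under the no-correlated-errors assumption, a group's LOC with majority voting follows from the individual ratings, and a group can be grown until the target is met. A sketch under those assumptions; the exhaustive enumeration and the greedy most-reliable-first strategy are illustrative, not necessarily the talk's algorithm:

```python
from itertools import combinations

def loc_majority(ratings):
    """P(a strict majority of independent workers return a correct answer)."""
    n = len(ratings)
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        for correct in combinations(range(n), k):
            p = 1.0
            for i in range(n):
                p *= ratings[i] if i in correct else 1.0 - ratings[i]
            total += p
    return total

def form_group(ratings, target):
    """Greedily add the most reliable remaining workers until LOC >= target."""
    pool = sorted(ratings, reverse=True)
    group = []
    while pool and loc_majority(group) < target:
        group.append(pool.pop(0))
    return group
```

A more reliable pool reaches the target with fewer replicas, matching the slide's intuition that higher-reliability clients mean smaller groups.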

Scheduling Metrics
Guiding metrics
– throughput: the set of successfully completed tasks in an interval
– success rate s: the ratio of throughput to the number of tasks attempted

Algorithm Space
How many replicas?
– algorithms compute how many replicas are needed to meet a success threshold
How to reach consensus?
– majority voting (better against Byzantine threats)
– M-1 (better for timeliness)
– M-2 (two matching answers)
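The three consensus rules might be sketched as below; "M-1" is read here as accept-the-first-answer and "M-2" as first-two-matching, which is one plausible reading of the slide's shorthand:

```python
from collections import Counter

def consensus(answers, scheme="majority", group_size=None):
    """Select an answer from replies arriving in order, or None if undecided."""
    if not answers:
        return None
    if scheme == "m-1":   # first answer wins: best latency, no error checking
        return answers[0]
    if scheme == "m-2":   # first value seen twice: cheap agreement
        seen = set()
        for a in answers:
            if a in seen:
                return a
            seen.add(a)
        return None
    # majority: a strict majority of the full group must agree
    n = group_size if group_size is not None else len(answers)
    value, votes = Counter(answers).most_common(1)[0]
    return value if votes > n / 2 else None
```

The trade-off on the slide shows up directly: M-1 returns immediately but trusts a single node, while majority tolerates Byzantine answers at the cost of waiting for more replies.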

One Scheduling Algorithm

Evaluation
Baselines
– Fixed algorithm: statically sized, equal groups; uses no reliability information
– Random algorithm: forms groups by randomly assigning nodes until the target is reached
Simulated a wide variety of node-reliability distributions
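The two baselines can be sketched as follows; the function names, the LOC estimator being passed in as a parameter, and the group size are illustrative:

```python
import random

def fixed_groups(workers, size):
    """Fixed baseline: statically sized, equal groups, reliability-blind."""
    return [workers[i:i + size] for i in range(0, len(workers), size)]

def random_group(ratings, target, loc_fn):
    """Random baseline: add randomly chosen nodes until loc_fn hits target."""
    pool = list(ratings)
    random.shuffle(pool)
    group = []
    while pool and loc_fn(group) < target:
        group.append(pool.pop())
    return group
```

Neither baseline consults the reputation ratings, which is exactly what the smart-replication scheme exploits.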

Experimental Results: Correctness
Simulation with Byzantine behavior only, using majority voting

Role of the Target LOC
A key parameter, but hard to specify
– too large: groups will be too large (low throughput)
– too small: groups will be too small (low success rate)
Instead, adaptively learn it
– biased toward throughput, success rate, or both
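The adaptive idea can be sketched as a simple feedback nudge on the target: raise it when the observed success rate falls short (groups were too small), lower it when success exceeds the goal (reclaim throughput). The step size and bounds are illustrative; the talk's actual adaptive algorithm is not reproduced here:

```python
def adapt_target(target, observed_success, desired_success,
                 step=0.01, lo=0.50, hi=0.999):
    """One feedback step on the target LOC after an interval of tasks."""
    if observed_success < desired_success:
        return min(hi, target + step)  # too many failures: demand larger groups
    return max(lo, target - step)      # comfortably succeeding: shrink groups
```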

Adaptive Algorithm

What About Time? Timeliness
A result returned after time T is less (or not) useful
– (1) soft deadlines: a user interacting with the system, visualization output from a computation
– (2) hard deadlines: need to get X results done before the HPDC/NSDI/... deadline
Live experimentation on PlanetLab
Real application: BLAST

Some PlanetLab Data
Computation: varies both across and within nodes
Communication: varies both across and within nodes
Temporal variability

PlanetLab Environment
RIDGE is our live system that implements reputation
– 120 wide-area nodes, fully correct, M-1 consensus
Three timeliness environments based on deadlines: D=120s, D=180s, D=240s

Experimental Results: Timeliness
Best BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE

Makespan Comparison

Collusion
What if errors are correlated? How could that happen?
– a widespread bug (hardware or software)
– misconfiguration
– a virus
– a Sybil attack
– a malicious group
Joint work with Emmanuel Jeannot (INRIA)

Key Ideas
Executing a task yields answer groups A_1, A_2, ..., A_k
– each answer A_i has associated workers W_i1, W_i2, ..., W_in
– and a collusion probability P_collusion(workers in A_i)
Learn the probability of correlated errors
– P_collusion(W_1, W_2)
Estimate the probability of group correlated errors
– P_collusion(G), G = [W_1, W_2, W_3, ...], via f{P_collusion(W_i, W_j) for all i, j}
Rank answers and select one
– based on P_collusion(G) and |G|
– then update the matrix entries P_collusion(W_1, W_2)
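A minimal sketch of the matrix bookkeeping and the group estimate. The talk leaves the combiner f unspecified; taking the maximum over pairwise probabilities is one conservative stand-in, and the class and method names are illustrative:

```python
from itertools import combinations

class CollusionMatrix:
    """Running pairwise estimates of P_collusion(w1, w2) from observations."""
    def __init__(self):
        self.stats = {}  # (w1, w2) -> (colluding observations, total)

    def update(self, w1, w2, colluded):
        key = tuple(sorted((w1, w2)))  # order-insensitive pair key
        hits, total = self.stats.get(key, (0, 0))
        self.stats[key] = (hits + int(colluded), total + 1)

    def pair(self, w1, w2):
        hits, total = self.stats.get(tuple(sorted((w1, w2))), (0, 0))
        return hits / total if total else 0.0

    def group(self, workers):
        """P_collusion(G): here, the max over all pairs in the group."""
        pairs = combinations(sorted(workers), 2)
        return max((self.pair(a, b) for a, b in pairs), default=0.0)
```

Answers backed by groups with a low P_collusion(G) and large |G| would then be ranked highest.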

Bootstrap Problem
Building the collusion matrix requires first "baiting" colluders
– over-replicate so that the majority group is still correct, exposing the colluders
– given the probability of worker collusion and the probability that colluders fool the system, choose the group size k
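The bait sizing comes down to a binomial tail: if each group member is independently a colluder with some probability, the chance that colluders seize a majority shrinks as the group is over-replicated. The talk's exact formula is not given; this is the standard independent-colluder calculation, with illustrative function names:

```python
from math import comb

def p_colluder_majority(p_collude, k):
    """P(colluders form a strict majority of a size-k group), assuming each
    member is independently a colluder with probability p_collude."""
    return sum(comb(k, j) * p_collude ** j * (1 - p_collude) ** (k - j)
               for j in range(k // 2 + 1, k + 1))

def bait_group_size(p_collude, fool_bound, k_max=99):
    """Smallest odd k whose colluder-majority probability is below fool_bound."""
    for k in range(3, k_max + 1, 2):
        if p_colluder_majority(p_collude, k) < fool_bound:
            return k
    return None
```

With 30% colluders, a group of 3 is fooled over a fifth of the time, which is why the bait phase must over-replicate well beyond that.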

Experimental Results: Correctness (collusion scenarios)
– Scenario 4: one group, 30% colluders, always colluding
– Scenario 5: the same group, colluding 30% of the time
– Scenario 7: two groups (40% and 30% colluders)

Experimental Results: Throughput (same collusion scenarios)

Summary
Reliable, scalable computing
– correctness and timeliness
Future work
– combined models and metrics
– workflows: coupling data and computation reliability
Visit ridge.cs.umn.edu to learn more