1
Scalable Computing on Open Distributed Systems
Jon Weissman, University of Minnesota
National E-Science Center
CLADE 2008
2
What is the Problem?
Open distributed systems
– Tasks are submitted to the "system" for execution
– Workers do the computing: execute a task, return an answer
The Challenge
– Computations that are erroneous or late are less useful
– Failures, errors, hacked or misconfigured nodes
– Unpredictable time to return answers
Both local- and wide-area systems
– Focus on volunteer wide-area systems
3
Shape of the Solution
Replication
– Works for all sources of unreliability: computation and data
How do we do this intelligently and scalably?
4
Replication Challenges
How many replicas?
– Too many: waste of resources
– Too few: the application suffers
Most approaches assume ad-hoc replication
– Under-replicate: task re-execution (higher latency)
– Over-replicate: wasted resources (lower throughput)
Using information about the past behavior of a node, we can intelligently size the amount of redundancy
5
Problems with ad-hoc replication
[Figure: task x sent to group A, task y sent to group B; the groups mix reliable and unreliable nodes]
6
System Model
[Figure: worker nodes annotated with reputation ratings such as 0.9, 0.8, 0.7, 0.4, 0.3]
Reputation rating r_i: degree of node reliability
Dynamically size the redundancy based on r_i
Note: variable-sized groups
Assume no correlated errors (relaxed later)
7
Smart Replication
Rating based on past interaction with clients
– probability r_i over a window: correct/total or timely/total
– extends to a worker group (assuming no collusion) => likelihood of correctness (LOC)
Smarter redundancy
– variable-sized worker groups
– intuition: higher-reliability clients => smaller groups
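The rating above can be sketched as a sliding-window counter. A minimal sketch, assuming an illustrative window size and a neutral prior for unseen workers (neither is from the talk):

```python
# Hedged sketch of a per-worker reputation rating: the fraction of
# correct (or timely) answers over a sliding window of recent tasks.
from collections import deque

class WorkerRating:
    def __init__(self, window=50):
        self.history = deque(maxlen=window)   # 1 = correct/timely, 0 = not

    def record(self, ok):
        self.history.append(1 if ok else 0)

    def rating(self):
        # r_i = correct (or timely) answers / total answers in the window
        if not self.history:
            return 0.5   # neutral prior before any interaction (assumption)
        return sum(self.history) / len(self.history)
```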
8
Terms
LOC (Likelihood of Correctness) of a group g
– the 'actual' probability of getting a correct or timely answer from a group g of clients
Target LOC
– the success rate that the system tries to ensure while forming client groups
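As a hedged illustration, if worker errors are assumed independent and the group votes by majority, the LOC of a group g with ratings r_i could be written as

```latex
\mathrm{LOC}(g) \;=\; \sum_{\substack{S \subseteq g \\ |S| > |g|/2}} \; \prod_{i \in S} r_i \prod_{j \in g \setminus S} (1 - r_j)
```

and groups are then formed so that LOC(g) meets or exceeds the target LOC.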
9
Scheduling Metrics
Guiding metrics
– throughput: the number of successfully completed tasks in an interval
– success rate s: ratio of throughput to the number of tasks attempted
10
Algorithm Space
How many replicas?
– algorithms compute how many replicas are needed to meet a success threshold
How to reach consensus?
– Majority (better for Byzantine threats)
– M-1 (better for timeliness)
– M-2 (two matching answers)
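A minimal sketch of these consensus rules, assuming the group's answers arrive as a list of hashable values in return order (function names are illustrative):

```python
# Hedged sketch of the consensus rules named above.
from collections import Counter

def majority(answers):
    """Accept the answer returned by a strict majority of the group, else None."""
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(answers) / 2 else None

def m_first(answers, m):
    """Accept as soon as m matching answers have arrived:
    m=1 (M-1) takes the first answer, best for timeliness;
    m=2 (M-2) waits for two matching answers."""
    seen = Counter()
    for a in answers:
        seen[a] += 1
        if seen[a] >= m:
            return a
    return None
```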
11
One Scheduling Algorithm
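The algorithm itself appears only as a figure on this slide. A hypothetical greedy sketch consistent with the surrounding slides (grow each group from the most reliable available workers until its estimated LOC reaches the target, assuming independent errors and majority voting; all names are illustrative):

```python
# Hypothetical greedy group-formation sketch, not the talk's exact algorithm.
from itertools import combinations
from math import prod

def loc_majority(ratings):
    """P(a strict majority of the group answers correctly), assuming independence."""
    n = len(ratings)
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        for correct in combinations(range(n), k):
            cs = set(correct)
            total += prod(ratings[i] if i in cs else 1 - ratings[i]
                          for i in range(n))
    return total

def form_group(workers, target):
    """workers: list of (worker_id, rating) pairs; returns a group meeting the target LOC."""
    group = []
    for wid, r in sorted(workers, key=lambda w: w[1], reverse=True):
        group.append((wid, r))
        if loc_majority([rating for _, rating in group]) >= target:
            return group
    return group   # best effort if the target cannot be met
```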
12
Evaluation
Baselines
– Fixed algorithm: statically sized, equal groups; uses no reliability information
– Random algorithm: forms groups by randomly assigning nodes until the target LOC is reached
Simulated a wide variety of node reliability distributions
13
Experimental Results: correctness
Simulation: Byzantine behavior only, majority voting
14
Role of the target LOC
Key parameter – hard to specify
– Too large: groups will be too large (low throughput)
– Too small: groups will be too small (low success rate)
Instead, adaptively learn it
– bias toward throughput, success rate s, or both
15
Adaptive Algorithm
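The adaptive algorithm is also shown only as a figure. A hypothetical feedback rule in the spirit of slide 14, raising the target LOC when the observed success rate s falls short and lowering it to shrink groups and recover throughput (step size and bounds are illustrative assumptions):

```python
# Hypothetical sketch, not the talk's algorithm: nudge the target LOC based
# on the observed success rate, trading success rate against throughput.
def adapt_target(target, observed_s, desired_s, step=0.01, lo=0.5, hi=0.999):
    if observed_s < desired_s:
        return min(hi, target + step)   # too many failures: grow groups
    return max(lo, target - step)       # headroom: shrink groups for throughput
```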
16
What about time? Timeliness
A result returned after time T is less (or not) useful
– (1) soft deadlines: a user interacting with or visualizing output from the computation
– (2) hard deadlines: need to get X results done before the HPDC/NSDI/… deadline
Live experimentation on PlanetLab
Real application: BLAST
17
Some PlanetLab data
– Computation: varies both across and within nodes
– Communication: varies both across and within nodes
– Temporal variability
18
PlanetLab Environment
RIDGE is our live system that implements reputation-based scheduling
– 120 wide-area nodes, fully correct, M-1 consensus
– 3 timeliness environments based on deadlines: D=120s, D=180s, D=240s
19
Experimental Results: timeliness
Best-case BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE
20
Makespan Comparison
21
Collusion
What if errors are correlated? How could that happen?
– Widespread bug (hardware or software)
– Misconfiguration
– Virus
– Sybil attack
– Malicious group
Joint work with Emmanuel Jeannot (Inria)
22
Key Ideas
Executing a task yields answer groups
– A_1, A_2, …, A_k
– each A_i has associated workers W_i1, W_i2, …, W_in
– and an estimate P_collusion(workers in A_i)
Learn the probability of correlated errors
– P_collusion(W_1, W_2)
Estimate the probability of group correlated errors
– P_collusion(G), G = [W_1, W_2, W_3, …], via f{P_collusion(W_i, W_j) for all i, j}
Rank and select the answer
– using P_collusion(G) and |G|
– then update the matrix P_collusion(W_1, W_2)
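A hedged sketch of the estimation and ranking steps: the combining function f over pairwise collusion probabilities is not specified here, so the maximum over pairs is used purely as a placeholder, and the scoring of answer groups by P_collusion(G) and |G| is illustrative:

```python
# Hedged sketch: estimate a group's collusion probability from the pairwise
# matrix and pick the answer backed by a large, low-collusion group.
from itertools import combinations

def group_collusion(p_collusion, workers):
    """P_collusion(G) via a placeholder f = max over pairwise entries."""
    if len(workers) < 2:
        return 0.0
    return max(p_collusion.get(frozenset(pair), 0.0)
               for pair in combinations(workers, 2))

def select_answer(answer_groups, p_collusion):
    """answer_groups: dict mapping an answer to the workers that returned it."""
    def score(item):
        _, workers = item
        return (1 - group_collusion(p_collusion, workers)) * len(workers)
    return max(answer_groups.items(), key=score)[0]
```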
23
Bootstrap Problem
Building the collusion matrix: must first "bait" colluders
– Over-replicate such that the majority group is still correct, to expose colluders
– probability of worker collusion
– probability that colluders fool the system
Given group size k
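As a simplified illustration only (it assumes each worker colludes independently with probability p, which understates truly correlated collusion), the chance that a baiting group of size k still has an honest majority is

```latex
\Pr[\text{honest majority}] \;=\; \sum_{j=0}^{\lceil k/2 \rceil - 1} \binom{k}{j}\, p^{\,j}\, (1-p)^{\,k-j},
```

so k can be over-sized until this probability is high enough that colluders are exposed rather than rewarded.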
24
Experimental results: correctness under collusion
– one group of 30% colluders that always collude
– the same group colluding 30% of the time
– two colluding groups (40% and 30% colluders)
25
Experimental results: throughput under the same collusion scenarios
26
Summary
Reliable, scalable computing
– correctness and timeliness
Future work
– combined models and metrics
– workflows: coupling data and computation reliability
Visit ridge.cs.umn.edu to learn more