1
Trust-Sensitive Scheduling on the Open Grid
Jon B. Weissman, with help from Jason Sonnek and Abhishek Chandra
Department of Computer Science, University of Minnesota
Trends in HPDC Workshop, Amsterdam, 2006
2
Background
Public donation-based infrastructures are attractive
– positives: cheap, scalable, fault tolerant (UW-Condor, *@home, ...)
– negatives: "hostile" – uncertain resource availability/connectivity, node behavior, end-user demand => best-effort service
3
Background
Such infrastructures have been used for throughput-based applications
– just make progress; all tasks equal
Service applications are more challenging
– all tasks not equal
– explicit boundaries between user requests
– may even have SLAs, QoS, etc.
4
Service Model
Distributed Service
– request -> set of independent tasks
– each task mapped to a donated node
– metric: makespan
– e.g. BLAST service: user request (input sequence) + chunk of DB form a task
5
BOINC + BLAST
workunit = input_sequence + chunk of DB
– generated when a request arrives
6
The Challenge
Nodes are unreliable
– timeliness: heterogeneity, bottlenecks, ...
– cheating: hacked, malicious (> 1% of SETI@home nodes), misconfigured
– failure
– churn
For a service, this matters
7
Some data – timeliness
Computation heterogeneity – both across and within nodes
Communication heterogeneity – both across and within nodes
PlanetLab – lower bound
8
The Problem for Today
Deal with node misbehavior
Result verification
– application-specific verifiers – not general
– redundancy + voting
Most approaches assume ad-hoc replication
– under-replicate: task re-execution (increased latency)
– over-replicate: wasted resources (reduced throughput)
Using information about the past behavior of a node, we can intelligently size the amount of redundancy
9
System Model
10
Problems with ad-hoc replication
[Figure: task x sent to group A, task y sent to group B; each group mixes reliable and unreliable nodes]
11
Smart Replication
Reputation
– ratings based on past interactions with clients
– simple sample-based probability (r_i) over a window
– extend to worker group (assuming no collusion) => likelihood of correctness (LOC)
Smarter Redundancy
– variable-sized worker groups
– intuition: higher-reliability clients => smaller groups
12
Terms
LOC (Likelihood of Correctness), LOC_g
– the 'actual' probability of getting a correct answer from a group g of clients
Target LOC (LOC_target)
– the task success rate that the system tries to ensure while forming client groups
– related to the statistics of the underlying reliability distribution
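As a minimal sketch of how a group's LOC could be computed from individual ratings r_i: assuming independent, non-colluding workers and majority voting (the function name and the exact-enumeration approach are illustrative, not from the slides):

```python
import math
from itertools import combinations

def group_loc(ratings):
    """Probability that a strict majority of a group of independent
    workers returns the correct result, given per-worker reliability
    estimates r_i. Enumerates every majority-sized subset of workers
    that answers correctly and sums the subset probabilities."""
    n = len(ratings)
    need = n // 2 + 1  # strict majority
    return sum(
        math.prod(ratings[i] if i in correct else 1 - ratings[i]
                  for i in range(n))
        for k in range(need, n + 1)
        for correct in combinations(range(n), k)
    )
```

Exact enumeration is exponential in group size, which is acceptable here since groups are small (minimum size 3 in the experiments).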
13
Trust-Sensitive Scheduling
Guiding metrics
– throughput t: the number of successfully completed tasks in an interval
– success rate s: the ratio of throughput to the number of tasks attempted
14
Scheduling Algorithms
First-Fit
– attempt to form the first group that satisfies LOC_target
Best-Fit
– attempt to form the group that best satisfies LOC_target
Random-Fit
– attempt to form a random group that satisfies LOC_target
Fixed-size
– randomly form fixed-size groups; ignore client ratings
Random and Fixed are our baselines
Min group size = 3
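A sketch of the First-Fit variant: scan the available workers in order, growing the group until its LOC meets the target. The helper, the worker-list shape, and the majority-voting LOC are my assumptions for illustration, not the authors' implementation:

```python
import math
from itertools import combinations

def group_loc(ratings):
    """Probability a strict majority of independent workers is correct."""
    n, need = len(ratings), len(ratings) // 2 + 1
    return sum(
        math.prod(ratings[i] if i in c else 1 - ratings[i] for i in range(n))
        for k in range(need, n + 1)
        for c in combinations(range(n), k)
    )

def first_fit_group(available, target, min_size=3):
    """First-Fit: take workers in the order offered, adding one at a
    time, and stop at the first group (of at least min_size) whose
    LOC reaches the target. Returns worker ids, or None if even the
    whole pool cannot meet the target."""
    group, ratings = [], []
    for worker_id, r in available:
        group.append(worker_id)
        ratings.append(r)
        if len(group) >= min_size and group_loc(ratings) >= target:
            return group
    return None
```

Best-Fit and Random-Fit would differ only in how candidate groups are enumerated before checking them against the target.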
15
Scheduling Algorithms
16
Scheduling Algorithms (cont’d)
17
Different Groupings
LOC_target = 0.5
18
Evaluation
Simulated a wide variety of node-reliability distributions
Set LOC_target to the success rate of Fixed
– goal: match the success rate of Fixed (which over-replicates) yet achieve higher throughput
– if desired, can drive throughput even higher (but success rate would suffer)
19
Comparison
gain: 25-250%
open question: how much better could we have done?
20
Non-stationarity
Nodes may suddenly shift gears
– deliberately malicious, virus, detach/rejoin
– underlying reliability distribution changes
Solution
– window-based rating (reduce from infinite history)
Experiment: "blackout" at round 300 (30% affected)
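The window-based rating above can be sketched as a sliding window over a node's most recent interactions, so a sudden behavior shift (like the blackout) pushes old samples out quickly. Class name, window size, and the 0.5 prior for unrated nodes are assumptions for illustration:

```python
from collections import deque

class WindowedRating:
    """Sample-based reliability estimate r_i over a sliding window of
    the most recent interactions, rather than over all history."""

    def __init__(self, window=50):
        # deque with maxlen automatically drops the oldest sample
        self.history = deque(maxlen=window)

    def record(self, correct):
        """Record one verified interaction (True = correct result)."""
        self.history.append(1 if correct else 0)

    def rating(self):
        """Fraction of correct results within the window."""
        if not self.history:
            return 0.5  # neutral prior for an unknown node (assumption)
        return sum(self.history) / len(self.history)
```

After a blackout, a window of all-incorrect results drives the rating to 0, whereas an all-history average would decay far more slowly.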
21
Role of LOC_target
Key parameter
Too large
– groups will be too large (low throughput)
Too small
– groups will be too small (low success rate)
Adaptively learn it (parameterless)
– maximize t * s: "goodput"
– or bias toward t or s
22
Adaptive Algorithm
Multi-objective optimization
– choose the target LOC to simultaneously maximize throughput t and success rate s
– use a weighted combination (w1*t + w2*s) to reduce the multiple objectives to a single objective
– employ hill-climbing and feedback techniques to control dynamic parameter adjustment
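One way the hill-climbing feedback loop could look: after each scheduling round, evaluate the weighted objective and keep moving LOC_target in the same direction while the objective improves, reversing when it degrades. Class name, step size, and clamping bounds are assumptions; the slides do not specify them:

```python
class AdaptiveTarget:
    """Hill-climbing controller for LOC_target, maximizing the
    weighted objective w1*t + w2*s measured each round."""

    def __init__(self, target=0.5, step=0.05, w1=1.0, w2=0.0):
        self.target = target
        self.step = step
        self.w1, self.w2 = w1, w2
        self.prev_obj = None
        self.direction = 1  # +1 = raise target, -1 = lower it

    def update(self, throughput, success_rate):
        """Feed back one round's measurements; returns the new target."""
        obj = self.w1 * throughput + self.w2 * success_rate
        if self.prev_obj is not None and obj < self.prev_obj:
            self.direction *= -1  # objective got worse: reverse course
        self.prev_obj = obj
        # clamp to a sane range (bounds are an assumption)
        self.target = min(0.99, max(0.10,
                          self.target + self.direction * self.step))
        return self.target
```

With w1 = 1 and w2 = 0 this chases pure throughput, matching the "Throughput (w1=1, w2=0)" experiment; nonzero w2 biases toward success rate.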
23
Adapting LOC_target
Blackout example
24
Throughput (w1 = 1, w2 = 0)
25
Current/Future Work
Implementation of a reputation-based scheduling framework (BOINC and PlanetLab)
Mechanisms to retain node identities (and hence r_i) under node churn
– "node signatures" that capture the characteristics of the node
26
Current/Future Work (cont'd)
Timeliness
– extending reliability to encompass time
– a node whose performance is highly variable is less reliable
Client collusion
– detection: group signatures
– prevention: combine quiz-based tasks with reputation systems; form random groupings
27
Thank you.