Trust-Sensitive Scheduling on the Open Grid
Jon B. Weissman, with help from Jason Sonnek and Abhishek Chandra
Department of Computer Science, University of Minnesota
Trends in HPDC Workshop, Amsterdam, 2006
Background
Public donation-based infrastructures are attractive
–positives: cheap, scalable, fault tolerant (UW-Condor, …)
–negatives: “hostile” - uncertain resource availability/connectivity, node behavior, end-user demand => best-effort service
Background
Such infrastructures have been used for throughput-based applications
–just make progress; all tasks are equal
Service applications are more challenging
–all tasks are not equal
–explicit boundaries between user requests
–may even have SLAs, QoS, etc.
Service Model
Distributed service
–request -> set of independent tasks
–each task mapped to a donated node
–metric: makespan
–e.g., BLAST service: a user request (input sequence) + a chunk of the DB form a task
BOINC + BLAST
workunit = input_sequence + chunk of DB
–workunits are generated when a request arrives
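To make the request-to-task fan-out concrete, here is a minimal sketch; the Workunit class and make_workunits helper are illustrative names invented for this example, not BOINC API calls.

# Illustrative sketch (hypothetical names): one request fans out into one workunit per DB chunk.
from dataclasses import dataclass
from typing import List

@dataclass
class Workunit:
    input_sequence: str   # the user's query sequence
    db_chunk_id: int      # which slice of the database this task searches

def make_workunits(input_sequence: str, num_db_chunks: int) -> List[Workunit]:
    # Each workunit is an independent task that can be mapped to a different donated node.
    return [Workunit(input_sequence, chunk) for chunk in range(num_db_chunks)]

# Example: one BLAST query against a database split into 8 chunks -> 8 tasks.
tasks = make_workunits("ACGT...", 8)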
The Challenge
Nodes are unreliable
–timeliness: heterogeneity, bottlenecks, …
–cheating: hacked, malicious (> 1% of SETI nodes), misconfigured
–failure
–churn
For a service, this matters
Some data – timeliness
(plots: computation heterogeneity – both across and within nodes; communication heterogeneity – both across and within nodes; measured on PlanetLab, a lower bound)
The Problem for Today
Deal with node misbehavior
Result verification
–application-specific verifiers – not general
–redundancy + voting
Most approaches assume ad-hoc replication
–under-replicate: task re-execution (higher latency)
–over-replicate: wasted resources (lower throughput)
Using information about the past behavior of a node, we can intelligently size the amount of redundancy
System Model
Problems with ad-hoc replication
(figure: task x sent to group A, task y sent to group B; groups mix reliable and unreliable nodes)
Smart Replication
Reputation
–ratings based on past interactions with clients
–simple sample-based probability (r_i) over a window
–extend to the worker group (assuming no collusion) => likelihood of correctness (LOC)
Smarter Redundancy
–variable-sized worker groups
–intuition: higher-reliability clients => smaller groups
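A minimal sketch of the sample-based rating, assuming r_i is simply the fraction of a node's recent verified results that were correct; the class name, window size, and prior for unseen nodes are illustrative choices, not details from the talk.

from collections import deque

class NodeRating:
    """Sample-based reliability estimate r_i over a sliding window of outcomes."""
    def __init__(self, window: int = 100):
        self.outcomes = deque(maxlen=window)  # True = result verified correct

    def record(self, correct: bool) -> None:
        self.outcomes.append(correct)

    @property
    def r(self) -> float:
        # Neutral prior for unseen nodes (a design choice, not from the talk).
        if not self.outcomes:
            return 0.5
        return sum(self.outcomes) / len(self.outcomes)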
Terms
LOC (Likelihood of Correctness) of a group g
–the ‘actual’ probability of getting a correct answer from the group of clients g
Target LOC
–the task success rate that the system tries to ensure while forming client groups
–related to the statistics of the underlying reliability distribution
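One plausible way to compute a group's LOC, assuming independent node failures, no collusion, and strict-majority voting over the returned results; this is a sketch consistent with the definitions above, not necessarily the exact formulation used in the system.

from itertools import combinations
from math import prod

def group_loc(ratings: list[float]) -> float:
    """Probability that a strict majority of the group returns the correct result,
    assuming node i is correct independently with probability ratings[i]."""
    n = len(ratings)
    need = n // 2 + 1  # strict majority
    loc = 0.0
    for k in range(need, n + 1):
        for correct in combinations(range(n), k):
            wrong = set(range(n)) - set(correct)
            loc += prod(ratings[i] for i in correct) * prod(1 - ratings[j] for j in wrong)
    return loc

# Example: three nodes with ratings 0.9, 0.8, 0.6 -> LOC ≈ 0.876
print(group_loc([0.9, 0.8, 0.6]))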
Trust-Sensitive Scheduling
Guiding metrics
–throughput: the number of successfully completed tasks in an interval
–success rate s: the ratio of throughput to the number of tasks attempted
Scheduling Algorithms
First-Fit
–attempt to form the first group that satisfies the target LOC
Best-Fit
–attempt to form the group that best satisfies the target LOC
Random-Fit
–attempt to form a random group that satisfies the target LOC
Fixed-Size
–randomly form fixed-size groups, ignoring client ratings
Random-Fit and Fixed-Size are our baselines
Minimum group size = 3
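A hedged sketch of First-Fit grouping under some extra assumptions: nodes are considered in decreasing order of rating, the group_loc helper from the LOC sketch above is available, and the minimum group size of 3 from this slide is enforced.

def first_fit_group(ratings_by_node: dict[str, float],
                    target_loc: float,
                    min_size: int = 3) -> list[str]:
    """Greedily add nodes (highest-rated first) until the group's LOC meets the target.
    Uses group_loc(ratings) from the earlier LOC sketch."""
    group: list[str] = []
    for node in sorted(ratings_by_node, key=ratings_by_node.get, reverse=True):
        group.append(node)
        if len(group) >= min_size and \
           group_loc([ratings_by_node[n] for n in group]) >= target_loc:
            return group
    return group  # target not reachable with the available nodes

# Example: with target LOC 0.5, three highly rated nodes suffice.
available = {"a": 0.95, "b": 0.9, "c": 0.85, "d": 0.4, "e": 0.3}
print(first_fit_group(available, target_loc=0.5))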
Different Groupings
(figure: example groupings with target LOC = 0.5)
Evaluation
Simulated a wide variety of node reliability distributions
Set the target LOC to match the success rate of Fixed-Size
–goal: match the success rate of Fixed-Size (which over-replicates) yet achieve higher throughput
–if desired, can drive throughput even higher (but success rate would suffer)
Comparison
(chart: gain in %)
Open question: how much better could we have done?
Non-stationarity
Nodes may suddenly shift gears
–deliberately malicious, virus, detach/rejoin
–underlying reliability distribution changes
Solution
–window-based rating (reduce the history from infinite to a bounded window)
Experiment: “blackout” at round 300 (30% of nodes affected)
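A small illustration of why bounding the rating window matters, reusing the NodeRating sketch from the Smart Replication slide; the specific numbers are made up for illustration.

# A node is correct for 300 rounds, then a "blackout": it starts answering wrongly.
fresh = NodeRating(window=50)        # bounded history
stale = NodeRating(window=10**9)     # effectively infinite history
for _ in range(300):
    fresh.record(True); stale.record(True)
for _ in range(50):
    fresh.record(False); stale.record(False)
print(fresh.r)   # 0.0 after 50 bad rounds: the behavior shift is detected quickly
print(stale.r)   # still ~0.86: the long correct history masks the change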
Role of the Target LOC
Key parameter
Too large
–groups will be too large (low throughput)
Too small
–groups will be too small (low success rate)
Adaptively learn it (parameterless)
–maximize throughput × s: “goodput”
–or could bias toward throughput or s
Adaptive Algorithm
Multi-objective optimization
–choose the target LOC to simultaneously maximize throughput and success rate s
–use a weighted combination to reduce the multiple objectives to a single one: maximize w1 × throughput + w2 × s
–employ hill-climbing and feedback techniques to control dynamic parameter adjustment
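A hedged sketch of the adaptive loop, assuming the single objective is w1 × throughput + w2 × s and a simple hill-climbing step on the target LOC; the step size, bounds, and measurement interface are assumptions, not details from the talk.

def adapt_target(current_target: float,
                 measure,            # callable: target -> (throughput, success_rate), e.g. one scheduling round
                 w1: float = 1.0, w2: float = 0.0,
                 step: float = 0.05, rounds: int = 20) -> float:
    """Hill-climb the target LOC to maximize w1*throughput + w2*s."""
    def objective(t: float) -> float:
        tput, s = measure(t)
        # Note: throughput is a count and s is in [0, 1]; in practice throughput
        # may need normalizing before combining (the later slide uses w1=1, w2=0).
        return w1 * tput + w2 * s

    best, best_val = current_target, objective(current_target)
    for _ in range(rounds):
        for candidate in (best - step, best + step):
            candidate = min(max(candidate, 0.0), 1.0)   # keep the target LOC in [0, 1]
            val = objective(candidate)
            if val > best_val:
                best, best_val = candidate, val
    return best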
Adapting the Target LOC
(plot: target LOC over time in the blackout example)
Throughput (w1 = 1, w2 = 0)
Current/Future Work
Implementation of a reputation-based scheduling framework (BOINC and PlanetLab)
Mechanisms to retain node identities (hence r_i) under node churn
–“node signatures” that capture the characteristics of the node
Current/Future Work (cont’d)
Timeliness
–extending reliability to encompass time
–a node whose performance is highly variable is less reliable
Client collusion
–detection: group signatures
–prevention: combine quiz-based tasks with reputation systems; form random groupings
Thank you.