1
Decentralizing Grids. Jon Weissman, University of Minnesota. E-Science Institute, Nov. 8, 2007
2
Roadmap: Background – The problem space – Some early solutions – Research frontier/opportunities – Wrap-up
3
Background Grids are distributed … but also centralized – Condor, Globus, BOINC, Grid Services, VOs – Why? They are client-server based Centralization pros – security, policy, global resource management Decentralization pros – reliability, dynamic, flexible, scalable – a fertile CS research frontier
4
Challenges May have to live within the Grid ecosystem – Condor, Globus, Grid services, VOs, etc. – first-principles approaches are risky (Legion) 50K-foot view – How to decentralize Grids yet retain their existing features? – high performance, workflows, performance prediction, etc.
5
Decentralized Grid platform Minimal assumptions about each node Nodes have associated assets (A) – basic: CPU, memory, disk, etc. – complex: application services – exposed interface to assets: OS, Condor, BOINC, Web service Nodes may go up or down Node trust is not a given (asked to do X, it does Y instead) Nodes may connect to other nodes or not Nodes may be aggregates Grid may be large (> 100K nodes); scalability is key
6
Grid Overlay (diagram: the overlay spans nodes exposing a Grid service, raw OS services, a Condor network, a BOINC network)
7
Grid Overlay – Join (diagram: a node joins the overlay; same legend as above)
8
Grid Overlay – Departure (diagram: a node leaves the overlay; same legend as above)
9
Routing = Discovery A query contains sufficient information to locate a node: RSL, ClassAd, etc. Exact match or semantic match (diagram: "discover A")
10
Routing = Discovery bingo!
11
Routing = Discovery Discovered node returns a handle sufficient for the client to interact with it - perform service invocation, job/data transmission, etc
12
Routing = Discovery Three parties – the initiator of discovery events for A – the client: invocation, health of A – the node offering A Often the initiator and the client will be the same; other times the client will be determined dynamically – if W is a web service and results are returned to a calling client, we want to locate C_W near W => discover W, then C_W!
13
Routing = Discovery discover A X
14
Routing = Discovery
15
bingo!
16
Routing = Discovery
17
outside client
18
Routing = Discovery discover As
19
Routing = Discovery
20
Grid Overlay This generalizes … – resource query (the query contains job requirements) – looks like decentralized matchmaking These are the easy cases … – independent simple queries: find a CPU with characteristics x, y, z; find 100 CPUs, each with x, y, z – suppose queries are complex or related? find N CPUs with aggregate power = G Gflops; locate an asset near a prior discovered asset
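The simple, independent case can be made concrete with a small sketch. This is a minimal Python illustration, not the talk's implementation: the Node structure, the flood-style walk, and the TTL are all assumptions; only the idea of matching a requirements query against each node's exposed assets and returning a handle comes from the slides.

```python
# Minimal sketch of exact-match resource discovery over an overlay.
# Node, satisfies, and discover are illustrative names, not an actual Grid API.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    assets: dict                      # e.g. {"cpu_ghz": 2.4, "mem_gb": 8, "disk_gb": 100}
    neighbors: list = field(default_factory=list)

def satisfies(assets: dict, requirements: dict) -> bool:
    """Exact/threshold match: every requirement must be met by the node's assets."""
    return all(assets.get(k, 0) >= v for k, v in requirements.items())

def discover(start: Node, requirements: dict, ttl: int = 8, seen=None):
    """Flood-style discovery: return a handle (node_id) for the first match, else None."""
    seen = seen if seen is not None else set()
    if start.node_id in seen or ttl < 0:
        return None
    seen.add(start.node_id)
    if satisfies(start.assets, requirements):
        return start.node_id          # handle returned to the client
    for nbr in start.neighbors:
        found = discover(nbr, requirements, ttl - 1, seen)
        if found:
            return found
    return None
```

Complex or related queries (aggregate power, locality to a prior discovery) are exactly what this per-query matching does not capture, which is the point of the slides that follow.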
21
Grid Scenarios Grid applications are more challenging – an application has a more complex structure: multi-task, parallel/distributed, with control/data dependencies; an individual job/task needs a resource near a data source; workflow queries are not independent – metrics are collective, not simply raw throughput: makespan, response, QoS
22
Related Work Maryland/Purdue – matchmaking Oregon-CCOF – time-zone CAN
23
Related Work (contd) None of these approaches address the Grid scenarios (in a decentralized manner) – Complex multi-task data/control dependencies – Collective metrics
24
50K Ft Research Issues Overlay Architecture – structured, unstructured, hybrid – what is the right architecture? Decentralized control/data dependencies – how to do it? Reliability – how to achieve it? Collective metrics – how to achieve them?
25
Context: Application Model (diagram legend: component = service request, job, or task; data source; answer)
26
Context: Application Models Reliability Collective metrics Data dependence Control dependence
27
Context: Environment RIDGE project – ridge.cs.umn.edu – reliable infrastructure for donation grid environments Live deployment on PlanetLab – planet-lab.org – 700 nodes spanning 335 sites and 35 countries – emulators and simulators Applications – BLAST – traffic planning – image comparison
28
Application Models Reliability Collective metrics Data dependence Control dependence
29
Reliability Example (diagram: components B, C, D, E and asset G)
30
(diagram) C_G is responsible for G's health
31
Reliability Example (diagram: reply "G, loc(C_G)")
32
Reliability Example (diagram) – could also discover G, then C_G
33
Reliability Example (diagram: G fails)
34
(diagram: C_G re-discovers G …)
35
Reliability Example (diagram)
36
Client Replication (diagram)
37
(diagram) loc(G), loc(C_G1), loc(C_G2) propagated
38
Client Replication (diagram) – client hand-off depends on the nature of G and the interaction
39
Component Replication (diagram)
40
(diagram: C_G with component replicas G1, G2)
41
Replication Research Nodes are unreliable – crash, hacked, churn, malicious, slow, etc. How many replicas? – too many – waste of resources – too few – application suffers
42
System Model (diagram: nodes with reputation ratings such as 0.9, 0.8, 0.7, 0.4, 0.3) Reputation rating r_i – degree of node reliability Dynamically size the redundancy based on r_i Nodes are not connected and check in to a central server Note: variable-sized groups
43
Reputation-based Scheduling Reputation rating – Techniques for estimating reliability based on past interactions Reputation-based scheduling algorithms – Using reliabilities for allocating work – Relies on a success threshold parameter
44
Algorithm Space How many replicas? – first-fit, best-fit, random, fixed, … – algorithms compute how many replicas are needed to meet a success threshold How to reach consensus? – M-first (better for timeliness) – majority (better for Byzantine threats)
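As a rough illustration of the first-fit, M-first (M = 1) variant, the sketch below grows the replica group until the estimated probability that at least one replica succeeds meets the success threshold. The independence assumption, the threshold value, and the function name are mine, not RIDGE's; the group-sizing math in the actual scheduler is richer.

```python
# First-fit sketch of reputation-based replica sizing (M-first, M = 1):
# add replicas until P(at least one success) reaches the success threshold.

def size_replica_group(candidates, threshold=0.99):
    """candidates: list of (node_id, rating r_i in [0, 1]), in whatever order
    the scheduler prefers (first-fit by arrival, best-fit by rating, ...)."""
    group, p_all_fail = [], 1.0
    for node_id, rating in candidates:
        group.append(node_id)
        p_all_fail *= (1.0 - rating)          # assumes independent failures
        if 1.0 - p_all_fail >= threshold:     # P(at least one success)
            return group
    return group                              # not enough nodes to meet the threshold

# Example: ratings 0.7, 0.4, 0.8 give 1 - 0.3*0.6 = 0.82, then 1 - 0.3*0.6*0.2 = 0.964,
# so a threshold of 0.95 is met with three replicas.
group = size_replica_group([("a", 0.7), ("b", 0.4), ("c", 0.8), ("d", 0.9)], threshold=0.95)
```

Majority voting changes only the stopping rule: the group must be large enough that a majority of likely-correct replies can outvote Byzantine ones.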
45
Experimental Results: correctness Simulation of Byzantine behavior, using majority voting
46
Experimental Results: timeliness M-first (M=1), best BOINC (BOINC*), conservative (BOINC-) vs. RIDGE
47
Next steps Nodes are decentralized, but trust management is not! Need a peer-based trust exchange framework – Stanford EigenTrust project – local exchange until the network converges to a global state
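For reference, the core of EigenTrust is a power iteration over normalized local trust values; the peer-to-peer exchange converges to the same global trust vector that the centralized sketch below computes directly. The matrix values, the damping factor a, and the uniform pre-trust vector here are illustrative only.

```python
import numpy as np

# Centralized sketch of the EigenTrust fixed point (Kamvar et al.):
# C[i][j] is peer i's normalized local trust in peer j (rows sum to 1),
# p is the pre-trusted-peer distribution.

def eigentrust(C, pretrusted, a=0.15, iters=50):
    p = np.asarray(pretrusted, dtype=float)
    t = p.copy()
    for _ in range(iters):
        t = (1 - a) * (C.T @ t) + a * p    # repeated aggregation of local trust
    return t

C = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.9, 0.1, 0.0]])
p = [1/3, 1/3, 1/3]
print(eigentrust(C, p))   # global reputation ratings for the three peers
```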
48
Application Models Reliability Collective metrics Data dependence Control dependence
49
Collective Metrics (e.g., BLAST) Throughput is not always the best metric – response, completion time, application-centric metrics: makespan, response time
50
Communication Makespan Nodes download data from replicated data nodes – Nodes choose data servers independently (decentralized) – Minimize the maximum download time for all worker nodes (communication makespan) data download dominates
51
Data node selection Several possible factors – Proximity (RTT) – Network bandwidth – Server capacity [Download Time vs. RTT - linear] [Download Time vs. Bandwidth - exp]
52
Heuristic Ranking Function Query to get candidates, RTT/bw probes Node i, data server node j – cost function = rtt_{i,j} * exp(k_j / bw_{i,j}), where k_j captures load/capacity The least-cost data node is selected independently Three server selection heuristics that use k_j – BW-ONLY: k_j = 1 – BW-LOAD: k_j = n-minute average load (past) – BW-CAND: k_j = # of candidate responses in the last m seconds (~ future load)
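A minimal sketch of this ranking follows; only the cost formula and the three k_j choices come from the slide, while the field names and data layout are assumptions.

```python
import math

# Worker i scores each candidate data server j with rtt * exp(k / bw)
# and independently picks the cheapest one.

def server_cost(rtt_ij, bw_ij, k_j):
    return rtt_ij * math.exp(k_j / bw_ij)

def choose_server(candidates, heuristic="BW-ONLY"):
    """candidates: list of dicts with keys rtt, bw, load, recent_candidate_hits."""
    best, best_cost = None, float("inf")
    for c in candidates:
        if heuristic == "BW-ONLY":
            k = 1.0
        elif heuristic == "BW-LOAD":
            k = c["load"]                    # n-minute average load (past)
        else:                                # "BW-CAND"
            k = c["recent_candidate_hits"]   # responses in last m seconds (~ future load)
        cost = server_cost(c["rtt"], c["bw"], k)
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```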
53
Performance Comparison
54
Computational Makespan compute dominates: BLAST
55
Computational Makespan (plots: variable-sized vs. equal-sized groups)
56
Next Steps Other makespan scenarios Eliminate probes for bw and RTT -> estimation Richer collective metrics – deadlines: user-in-the-loop
57
Application Models Reliability Collective metrics Data dependence Control dependence
58
Application Models Reliability Collective metrics Data dependence Control dependence
59
Data Dependence A data-dependent component needs access to one or more data sources – data may be large (diagram: "discover A")
60
Data Dependence (cont'd) (diagram: "discover A") – Where to run it?
61
The Problem Where to run a data-dependent component? – determine a candidate set – select a candidate It is unlikely a candidate knows its downstream bandwidth from particular data nodes Idea: infer bandwidth from neighbor observations with respect to data nodes!
62
Estimation Technique A candidate C_1 may have had little past interaction with the data source … but its neighbors may have For each neighbor, generate a download estimate: – DT: the neighbor's prior download time to the data source – RTT: from the candidate and the neighbor to the data source, respectively – DP: an average weighted measure of prior download times for any node to any data source
63
Estimation Technique (cont'd) Download Power (DP) characterizes the download capability of a node – DP = average(DT * RTT) – DT alone is not enough (far-away vs. nearby data source) An estimate is associated with each neighbor n_i – ElapsedEst[n_i] = α * β * DT, where α = my_RTT / neighbor_RTT (to the data source) and β = neighbor_DP / my_DP No active probes: historical data, RTT inference Combining neighbor estimates – mean, median, min, … – median worked the best Take a min over all candidate estimates
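Putting the formula together, here is a hedged sketch of the per-candidate estimate and the final min-over-candidates selection; the dictionary field names are illustrative, and the scaling simply mirrors the α and β terms above.

```python
from statistics import median

# Each neighbor that has downloaded from the data source contributes one
# estimate: its observed download time DT scaled by relative RTT (alpha)
# and relative download power (beta). Per-candidate estimates are combined
# with the median, and the candidate with the smallest estimate is chosen.

def estimate_download(my_rtt_to_src, my_dp, neighbors):
    """neighbors: list of dicts with keys dt, rtt_to_src, dp."""
    estimates = []
    for n in neighbors:
        alpha = my_rtt_to_src / n["rtt_to_src"]   # alpha = my_RTT / neighbor_RTT
        beta = n["dp"] / my_dp                    # beta  = neighbor_DP / my_DP
        estimates.append(alpha * beta * n["dt"])
    return median(estimates) if estimates else float("inf")

def pick_candidate(candidates):
    """candidates: list of (candidate_id, estimated elapsed time); take the min."""
    return min(candidates, key=lambda c: c[1])
```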
64
Comparison of Candidate Selection Heuristics SELF uses direct observations
65
Take Away Next steps – routing to the best candidates Locality between a data source and component – scalable, no probing needed – many uses
66
Application Models Reliability Collective metrics Data dependence Control dependence
67
The Problem How to enable decentralized control? – propagate downstream graph stages – perform distributed synchronization Idea: – distributed dataflow – token matching – graph forwarding, futures (Mentat project)
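A toy sketch of token matching at a control node, in the spirit of the {E, B*C*D} tokens used in the following slides; the class and method names are illustrative, not the talk's protocol.

```python
from collections import defaultdict

# A token names the downstream component (e.g. E) and the inputs it joins on
# (e.g. B*C*D). The matcher buffers tokens and fires the component once every
# input has arrived. Coloring/routing tokens so they meet at the same matcher
# is exactly the open question raised later in the deck.

class TokenMatcher:
    def __init__(self):
        self.pending = defaultdict(dict)   # dest component -> {input name: payload location}

    def on_token(self, dest, inputs_needed, input_name, payload_loc):
        """inputs_needed: e.g. ("B", "C", "D") for the join B*C*D feeding dest E."""
        slot = self.pending[dest]
        slot[input_name] = payload_loc
        if all(name in slot for name in inputs_needed):
            self.fire(dest, {n: slot[n] for n in inputs_needed})
            del self.pending[dest]

    def fire(self, dest, inputs):
        # A real system would discover a node for `dest`, hand it the input
        # locations, and forward the remaining graph downstream.
        print(f"firing {dest} with inputs at {inputs}")

matcher = TokenMatcher()
matcher.on_token("E", ("B", "C", "D"), "B", "loc(S_B)")
matcher.on_token("E", ("B", "C", "D"), "C", "loc(S_C)")
matcher.on_token("E", ("B", "C", "D"), "D", "loc(S_D)")   # fires E
```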
68
Control Example (diagram: components B, C, D, E, G; a control node performs token matching)
69
Simple Example (diagram)
70
Control Example (diagram: tokens {E, B*C*D}, {C, G}, {D, G})
71
Control Example (diagram: token {E, B*C*D})
72
Control Example (diagram: tokens {E, B*C*D, loc(S_C)}, {E, B*C*D, loc(S_D)}, {E, B*C*D, loc(S_B)}) Output is stored at loc(…) – where the component is run, or the client, or a storage node
73
Control Example C D E G C D B B B
74
(diagram)
75
(diagram)
76
(diagram)
77
(diagram)
78
(diagram) How to color and route tokens so that they arrive at the same control node?
79
Open Problems Support for Global Operations – troubleshooting – what happened? – monitoring – application progress? – cleanup – application died, cleanup state Load balance across different applications – routing to guarantee dispersion
80
Summary Decentralizing Grids is a challenging problem Re-think systems, algorithms, protocols, and middleware => fertile research Keep our eye on the ball – reliability, scalability, and maintaining performance Some preliminary progress on point solutions
81
My visit Looking to apply some of these ideas to existing UK projects via collaboration Current and potential projects – Decentralized dataflow: (Adam Barker) – Decentralized applications: Haplotype analysis (Andrea Christoforou, Mike Baker) – Decentralized control: openKnowledge (Dave Robertson) Goal – improve reliability and scalability of applications and/or infrastructures
82
Questions
84
EXTRAS
85
Non-stationarity Nodes may suddenly shift gears – deliberately malicious, virus, detach/rejoin – the underlying reliability distribution changes Solution – window-based rating – adapt/learn the target Experiment: blackout at round 300 (30% affected)
86
Adapting …
87
Adaptive Algorithm (plots: success rate, throughput)
88
(plots: success rate, throughput)
89
Scheduling Algorithms
90
Estimation Accuracy Download Elapsed Time Ratio (x-axis) is a ratio of estimation to real measured time – 1 means perfect estimation Accept if the estimation is within a range measured ± (measured * error) – Accept with error=0.33: 67% of the total are accepted – Accept with error=0.50: 83% of the total are accepted Objects: 27 (.5 MB – 2MB) Nodes: 130 on PlanetLab Download: 15,000 times from a randomly chosen node
91
Impact of Churn (plot: Global(Prox) mean vs. Random mean)
92
Estimating RTT We use distance = (RTT + 1) Simple RTT inference technique based on the triangle inequality: Latency(a,c) <= Latency(a,b) + Latency(b,c), so |Latency(a,b) - Latency(b,c)| <= Latency(a,c) <= Latency(a,b) + Latency(b,c) Pick the intersected area as the range, and take the mean (diagram: inference via neighbors A, B, C; lower and upper bounds; intersected range; final inference)
93
RTT Inference Result More neighbors, greater accuracy With 5 neighbors, 85% of the total < 16% error
94
Other Constraints (diagram: tokens {E, B*C*D}, {C, A, dep-CD}, {D, A, dep-CD}) C & D interact and should be co-allocated, nearby … The dep-CD tokens should route to the same control point so a collective query for C & D can be issued
95
Support for Global Operations Troubleshooting – what happened? Monitoring – application progress? Cleanup – application died, cleanup state Solution mechanism: propagate control node IPs back to origin (=> origin IP piggybacked) Control nodes and matcher nodes report progress (or lack thereof via timeouts) to origin Load balance across different applications
96
Other Constraints C D E A B {E, B*C*D} {C, A} {D, A} C & D interact and they should be co-allocated, nearby …
97
Combining Neighbor Estimates MEDIAN shows the best results – using 3 neighbors, the error is within 50% 88% of the time (variation in download times is a factor of 10-20) – 3 neighbors gives the greatest bang
98
Effect of Candidate Size
99
Performance Comparison Parameters: Data size: 2MB Replication: 10 Candidates: 5
100
Computation Makespan (cont'd) Now bring in reliability … the makespan improvement scales well with the number of components (plot)
101
Token loss Between B and the matcher; between the matcher and the next stage – the matcher must notify C_B when the token arrives (pass loc(C_B) with B's token) – the destination (E) must notify C_B when the token arrives (pass loc(C_B) with B's token)
102
RTT Inference >= 90-95% of Internet paths obey the triangle inequality – RTT(a, c) <= RTT(a, b) + RTT(b, c) – upper bound: RTT(server, c) <= RTT(server, n_i) + RTT(n_i, c) – lower bound: |RTT(server, n_i) - RTT(n_i, c)| Iterate over all neighbors to get max L, min U; return the mid-point
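The bound computation is easy to sketch (function and variable names are mine): each neighbor contributes a [lower, upper] interval for the unknown RTT, the intervals are intersected, and the midpoint is returned.

```python
# Triangle-inequality RTT inference: for each neighbor, lower = |RTT(server, n_i)
# - RTT(n_i, client)| and upper = RTT(server, n_i) + RTT(n_i, client); intersect
# the intervals (max of lowers, min of uppers) and return the midpoint.

def infer_rtt(neighbor_rtts):
    """neighbor_rtts: list of (rtt_server_to_neighbor, rtt_neighbor_to_client)."""
    lowers = [abs(sn - nc) for sn, nc in neighbor_rtts]
    uppers = [sn + nc for sn, nc in neighbor_rtts]
    lo, hi = max(lowers), min(uppers)
    if lo > hi:                       # measurements violate the triangle inequality
        lo, hi = min(lowers), max(uppers)
    return (lo + hi) / 2.0

# Example with three neighbors: intersection of [50,110], [15,105], [80,120] is [80,105].
print(infer_rtt([(80, 30), (60, 45), (100, 20)]))   # 92.5
```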