1
Decentralizing Grids. Jon Weissman, University of Minnesota. E-Science Institute, Nov. 8, 2007
2
Roadmap: Background – The problem space – Some early solutions – Research frontier/opportunities – Wrap-up
3
Background Grids are distributed … but also centralized – Condor, Globus, BOINC, Grid Services, VOs – Why? They are client-server based Centralization pros – security, policy, global resource management Decentralization pros – reliability, dynamic, flexible, scalable – a fertile CS research frontier
4
Challenges May have to live within the Grid ecosystem – Condor, Globus, Grid services, VOs, etc. – first-principles approaches are risky (Legion) 50K-foot view – How to decentralize Grids yet retain their existing features? – high performance, workflows, performance prediction, etc.
5
Decentralized Grid platform Minimal assumptions about each node Nodes have associated assets (A) – basic: CPU, memory, disk, etc. – complex: application services – exposed interface to assets: OS, Condor, BOINC, Web service Nodes may go up or down Node trust is not a given (asked to do X, it does Y instead) Nodes may connect to other nodes or not Nodes may be aggregates Grid may be large (> 100K nodes); scalability is key
6
Grid Overlay (diagram: the overlay spans nodes exposing a Grid service, raw OS services, a Condor network, a BOINC network)
7
Grid Overlay – Join (diagram: a node joins the overlay; same legend as above)
8
Grid Overlay – Departure (diagram: a node leaves the overlay; same legend as above)
9
Routing = Discovery A query contains sufficient information to locate a node: RSL, ClassAd, etc. Exact match or semantic match (diagram: "discover A")
10
Routing = Discovery bingo!
11
Routing = Discovery Discovered node returns a handle sufficient for the client to interact with it - perform service invocation, job/data transmission, etc
12
Routing = Discovery Three parties – the initiator of discovery events for A – the client: invocation, health of A – the node offering A Often the initiator and the client will be the same; other times the client will be determined dynamically – if W is a web service and results are returned to a calling client, we want to locate C_W near W => discover W, then C_W!
13
Routing = Discovery discover A X
14
Routing = Discovery
15
bingo!
16
Routing = Discovery
17
outside client
18
Routing = Discovery discover As
19
Routing = Discovery
20
Grid Overlay This generalizes … – resource query (the query contains job requirements) – looks like decentralized matchmaking These are the easy cases … – independent simple queries: find a CPU with characteristics x, y, z; find 100 CPUs, each with x, y, z – suppose queries are complex or related? find N CPUs with aggregate power = G Gflops; locate an asset near a prior discovered asset
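The simple, independent case can be made concrete with a small sketch. This is a minimal Python illustration, not the talk's implementation: the Node structure, the flood-style walk, and the TTL are all assumptions; only the idea of matching a requirements query against each node's exposed assets and returning a handle comes from the slides.

```python
# Minimal sketch of exact-match resource discovery over an overlay.
# Node, satisfies, and discover are illustrative names, not an actual Grid API.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    assets: dict                      # e.g. {"cpu_ghz": 2.4, "mem_gb": 8, "disk_gb": 100}
    neighbors: list = field(default_factory=list)

def satisfies(assets: dict, requirements: dict) -> bool:
    """Exact/threshold match: every requirement must be met by the node's assets."""
    return all(assets.get(k, 0) >= v for k, v in requirements.items())

def discover(start: Node, requirements: dict, ttl: int = 8, seen=None):
    """Flood-style discovery: return a handle (node_id) for the first match, else None."""
    seen = seen if seen is not None else set()
    if start.node_id in seen or ttl < 0:
        return None
    seen.add(start.node_id)
    if satisfies(start.assets, requirements):
        return start.node_id          # handle returned to the client
    for nbr in start.neighbors:
        found = discover(nbr, requirements, ttl - 1, seen)
        if found:
            return found
    return None
```

Complex or related queries (aggregate power, locality to a prior discovery) are exactly what this per-query matching does not capture, which is the point of the slides that follow.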
21
Grid Scenarios Grid applications are more challenging – an application has a more complex structure: multi-task, parallel/distributed, with control/data dependencies; an individual job/task needs a resource near a data source; workflow queries are not independent – metrics are collective, not simply raw throughput: makespan, response, QoS
22
Related Work Maryland/Purdue – matchmaking Oregon-CCOF – time-zone CAN
23
Related Work (contd) None of these approaches address the Grid scenarios (in a decentralized manner) – Complex multi-task data/control dependencies – Collective metrics
24
50K Ft Research Issues Overlay Architecture – structured, unstructured, hybrid – what is the right architecture? Decentralized control/data dependencies – how to do it? Reliability – how to achieve it? Collective metrics – how to achieve them?
25
Context: Application Model (diagram legend: component = service request, job, or task; data source; answer)
26
Context: Application Models Reliability Collective metrics Data dependence Control dependence
27
Context: Environment RIDGE project – ridge.cs.umn.edu – reliable infrastructure for donation grid environments Live deployment on PlanetLab – planet-lab.org – 700 nodes spanning 335 sites and 35 countries – emulators and simulators Applications – BLAST – traffic planning – image comparison
28
Application Models Reliability Collective metrics Data dependence Control dependence
29
Reliability Example (diagram: components B, C, D, E and asset G)
30
(diagram) C_G is responsible for G's health
31
Reliability Example (diagram: reply "G, loc(C_G)")
32
Reliability Example (diagram) – could also discover G, then C_G
33
Reliability Example (diagram: G fails)
34
(diagram: C_G re-discovers G …)
35
Reliability Example (diagram)
36
Client Replication (diagram)
37
(diagram) loc(G), loc(C_G1), loc(C_G2) propagated
38
Client Replication (diagram) – client hand-off depends on the nature of G and the interaction
39
Component Replication (diagram)
40
(diagram: C_G with component replicas G1, G2)
41
Replication Research Nodes are unreliable – crash, hacked, churn, malicious, slow, etc. How many replicas? – too many – waste of resources – too few – application suffers
42
System Model (diagram: nodes with reputation ratings such as 0.9, 0.8, 0.7, 0.4, 0.3) Reputation rating r_i – degree of node reliability Dynamically size the redundancy based on r_i Nodes are not connected and check in to a central server Note: variable-sized groups
43
Reputation-based Scheduling Reputation rating – Techniques for estimating reliability based on past interactions Reputation-based scheduling algorithms – Using reliabilities for allocating work – Relies on a success threshold parameter
44
Algorithm Space How many replicas? – first-fit, best-fit, random, fixed, … – algorithms compute how many replicas are needed to meet a success threshold How to reach consensus? – M-first (better for timeliness) – majority (better for Byzantine threats)
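As a rough illustration of the first-fit, M-first (M = 1) variant, the sketch below grows the replica group until the estimated probability that at least one replica succeeds meets the success threshold. The independence assumption, the threshold value, and the function name are mine, not RIDGE's; the group-sizing math in the actual scheduler is richer.

```python
# First-fit sketch of reputation-based replica sizing (M-first, M = 1):
# add replicas until P(at least one success) reaches the success threshold.

def size_replica_group(candidates, threshold=0.99):
    """candidates: list of (node_id, rating r_i in [0, 1]), in whatever order
    the scheduler prefers (first-fit by arrival, best-fit by rating, ...)."""
    group, p_all_fail = [], 1.0
    for node_id, rating in candidates:
        group.append(node_id)
        p_all_fail *= (1.0 - rating)          # assumes independent failures
        if 1.0 - p_all_fail >= threshold:     # P(at least one success)
            return group
    return group                              # not enough nodes to meet the threshold

# Example: ratings 0.7, 0.4, 0.8 give 1 - 0.3*0.6 = 0.82, then 1 - 0.3*0.6*0.2 = 0.964,
# so a threshold of 0.95 is met with three replicas.
group = size_replica_group([("a", 0.7), ("b", 0.4), ("c", 0.8), ("d", 0.9)], threshold=0.95)
```

Majority voting changes only the stopping rule: the group must be large enough that a majority of likely-correct replies can outvote Byzantine ones.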
45
Experimental Results: correctness Simulation of Byzantine behavior, using majority voting
46
Experimental Results: timeliness M-first (M=1), best BOINC (BOINC*), conservative (BOINC-) vs. RIDGE
47
Next steps Nodes are decentralized, but trust management is not! Need a peer-based trust exchange framework – Stanford EigenTrust project – local exchange until the network converges to a global state
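For reference, the core of EigenTrust is a power iteration over normalized local trust values; the peer-to-peer exchange converges to the same global trust vector that the centralized sketch below computes directly. The matrix values, the damping factor a, and the uniform pre-trust vector here are illustrative only.

```python
import numpy as np

# Centralized sketch of the EigenTrust fixed point (Kamvar et al.):
# C[i][j] is peer i's normalized local trust in peer j (rows sum to 1),
# p is the pre-trusted-peer distribution.

def eigentrust(C, pretrusted, a=0.15, iters=50):
    p = np.asarray(pretrusted, dtype=float)
    t = p.copy()
    for _ in range(iters):
        t = (1 - a) * (C.T @ t) + a * p    # repeated aggregation of local trust
    return t

C = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.9, 0.1, 0.0]])
p = [1/3, 1/3, 1/3]
print(eigentrust(C, p))   # global reputation ratings for the three peers
```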
48
Application Models Reliability Collective metrics Data dependence Control dependence
49
Collective Metrics (e.g., BLAST) Throughput is not always the best metric – response, completion time, application-centric metrics: makespan, response time
50
Communication Makespan Nodes download data from replicated data nodes – Nodes choose data servers independently (decentralized) – Minimize the maximum download time for all worker nodes (communication makespan) data download dominates
51
Data node selection Several possible factors – Proximity (RTT) – Network bandwidth – Server capacity [Download Time vs. RTT - linear] [Download Time vs. Bandwidth - exp]
52
Heuristic Ranking Function Query to get candidates, RTT/bw probes Node i, data server node j – cost function = rtt_{i,j} * exp(k_j / bw_{i,j}), where k_j captures load/capacity The least-cost data node is selected independently Three server selection heuristics that use k_j – BW-ONLY: k_j = 1 – BW-LOAD: k_j = n-minute average load (past) – BW-CAND: k_j = # of candidate responses in the last m seconds (~ future load)
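A minimal sketch of this ranking follows; only the cost formula and the three k_j choices come from the slide, while the field names and data layout are assumptions.

```python
import math

# Worker i scores each candidate data server j with rtt * exp(k / bw)
# and independently picks the cheapest one.

def server_cost(rtt_ij, bw_ij, k_j):
    return rtt_ij * math.exp(k_j / bw_ij)

def choose_server(candidates, heuristic="BW-ONLY"):
    """candidates: list of dicts with keys rtt, bw, load, recent_candidate_hits."""
    best, best_cost = None, float("inf")
    for c in candidates:
        if heuristic == "BW-ONLY":
            k = 1.0
        elif heuristic == "BW-LOAD":
            k = c["load"]                    # n-minute average load (past)
        else:                                # "BW-CAND"
            k = c["recent_candidate_hits"]   # responses in last m seconds (~ future load)
        cost = server_cost(c["rtt"], c["bw"], k)
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```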
53
Performance Comparison
54
Computational Makespan compute dominates: BLAST
55
Computational Makespan (plots: variable-sized vs. equal-sized groups)
56
Next Steps Other makespan scenarios Eliminate probes for bw and RTT -> estimation Richer collective metrics – deadlines: user-in-the-loop
57
Application Models Reliability Collective metrics Data dependence Control dependence
58
Application Models Reliability Collective metrics Data dependence Control dependence
59
Data Dependence A data-dependent component needs access to one or more data sources – data may be large (diagram: "discover A")
60
Data Dependence (cont'd) (diagram: "discover A") – Where to run it?
61
The Problem Where to run a data-dependent component? – determine a candidate set – select a candidate It is unlikely a candidate knows its downstream bandwidth from particular data nodes Idea: infer bandwidth from neighbor observations with respect to data nodes!
62
Estimation Technique A candidate C_1 may have had little past interaction with the data source … but its neighbors may have For each neighbor, generate a download estimate: – DT: the neighbor's prior download time to the data source – RTT: from the candidate and the neighbor to the data source, respectively – DP: an average weighted measure of prior download times for any node to any data source
63
Estimation Technique (cont'd) Download Power (DP) characterizes the download capability of a node – DP = average(DT * RTT) – DT alone is not enough (far-away vs. nearby data source) An estimate is associated with each neighbor n_i – ElapsedEst[n_i] = α * β * DT, where α = my_RTT / neighbor_RTT (to the data source) and β = neighbor_DP / my_DP No active probes: historical data, RTT inference Combining neighbor estimates – mean, median, min, … – median worked the best Take a min over all candidate estimates
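Putting the formula together, here is a hedged sketch of the per-candidate estimate and the final min-over-candidates selection; the dictionary field names are illustrative, and the scaling simply mirrors the α and β terms above.

```python
from statistics import median

# Each neighbor that has downloaded from the data source contributes one
# estimate: its observed download time DT scaled by relative RTT (alpha)
# and relative download power (beta). Per-candidate estimates are combined
# with the median, and the candidate with the smallest estimate is chosen.

def estimate_download(my_rtt_to_src, my_dp, neighbors):
    """neighbors: list of dicts with keys dt, rtt_to_src, dp."""
    estimates = []
    for n in neighbors:
        alpha = my_rtt_to_src / n["rtt_to_src"]   # alpha = my_RTT / neighbor_RTT
        beta = n["dp"] / my_dp                    # beta  = neighbor_DP / my_DP
        estimates.append(alpha * beta * n["dt"])
    return median(estimates) if estimates else float("inf")

def pick_candidate(candidates):
    """candidates: list of (candidate_id, estimated elapsed time); take the min."""
    return min(candidates, key=lambda c: c[1])
```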
64
Comparison of Candidate Selection Heuristics SELF uses direct observations
65
Take Away Next steps – routing to the best candidates Locality between a data source and component – scalable, no probing needed – many uses
66
Application Models Reliability Collective metrics Data dependence Control dependence
67
The Problem How to enable decentralized control? – propagate downstream graph stages – perform distributed synchronization Idea: – distributed dataflow – token matching – graph forwarding, futures (Mentat project)
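A toy sketch of token matching at a control node, in the spirit of the {E, B*C*D} tokens used in the following slides; the class and method names are illustrative, not the talk's protocol.

```python
from collections import defaultdict

# A token names the downstream component (e.g. E) and the inputs it joins on
# (e.g. B*C*D). The matcher buffers tokens and fires the component once every
# input has arrived. Coloring/routing tokens so they meet at the same matcher
# is exactly the open question raised later in the deck.

class TokenMatcher:
    def __init__(self):
        self.pending = defaultdict(dict)   # dest component -> {input name: payload location}

    def on_token(self, dest, inputs_needed, input_name, payload_loc):
        """inputs_needed: e.g. ("B", "C", "D") for the join B*C*D feeding dest E."""
        slot = self.pending[dest]
        slot[input_name] = payload_loc
        if all(name in slot for name in inputs_needed):
            self.fire(dest, {n: slot[n] for n in inputs_needed})
            del self.pending[dest]

    def fire(self, dest, inputs):
        # A real system would discover a node for `dest`, hand it the input
        # locations, and forward the remaining graph downstream.
        print(f"firing {dest} with inputs at {inputs}")

matcher = TokenMatcher()
matcher.on_token("E", ("B", "C", "D"), "B", "loc(S_B)")
matcher.on_token("E", ("B", "C", "D"), "C", "loc(S_C)")
matcher.on_token("E", ("B", "C", "D"), "D", "loc(S_D)")   # fires E
```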
68
Control Example (diagram: components B, C, D, E, G; a control node performs token matching)
69
Simple Example (diagram)
70
Control Example (diagram: tokens {E, B*C*D}, {C, G}, {D, G})
71
Control Example (diagram: token {E, B*C*D})
72
Control Example (diagram: tokens {E, B*C*D, loc(S_C)}, {E, B*C*D, loc(S_D)}, {E, B*C*D, loc(S_B)}) Output is stored at loc(…) – where the component is run, or the client, or a storage node
73
Control Example C D E G C D B B B
74
(diagram)
75
(diagram)
76
(diagram)
77
(diagram)
78
(diagram) How to color and route tokens so that they arrive at the same control node?
79
Open Problems Support for Global Operations – troubleshooting – what happened? – monitoring – application progress? – cleanup – application died, cleanup state Load balance across different applications – routing to guarantee dispersion
80
Summary Decentralizing Grids is a challenging problem Re-think systems, algorithms, protocols, and middleware => fertile research Keep our eye on the ball – reliability, scalability, and maintaining performance Some preliminary progress on point solutions
81
My visit Looking to apply some of these ideas to existing UK projects via collaboration Current and potential projects – Decentralized dataflow: (Adam Barker) – Decentralized applications: Haplotype analysis (Andrea Christoforou, Mike Baker) – Decentralized control: openKnowledge (Dave Robertson) Goal – improve reliability and scalability of applications and/or infrastructures
82
Questions
84
EXTRAS
85
Non-stationarity Nodes may suddenly shift gears – deliberately malicious, virus, detach/rejoin – the underlying reliability distribution changes Solution – window-based rating – adapt/learn the target Experiment: blackout at round 300 (30% affected)
86
Adapting …
87
Adaptive Algorithm (plots: success rate, throughput)
88
(plots: success rate, throughput)
89
Scheduling Algorithms
90
Estimation Accuracy Download Elapsed Time Ratio (x-axis) is a ratio of estimation to real measured time – 1 means perfect estimation Accept if the estimation is within a range measured ± (measured * error) – Accept with error=0.33: 67% of the total are accepted – Accept with error=0.50: 83% of the total are accepted Objects: 27 (.5 MB – 2MB) Nodes: 130 on PlanetLab Download: 15,000 times from a randomly chosen node
91
Impact of Churn (plot: Global(Prox) mean vs. Random mean)
92
Estimating RTT We use distance = (RTT + 1) Simple RTT inference technique based on the triangle inequality: Latency(a,c) <= Latency(a,b) + Latency(b,c), so |Latency(a,b) - Latency(b,c)| <= Latency(a,c) <= Latency(a,b) + Latency(b,c) Pick the intersected area as the range, and take the mean (diagram: inference via neighbors A, B, C; lower and upper bounds; intersected range; final inference)
93
RTT Inference Result More neighbors, greater accuracy With 5 neighbors, 85% of the total < 16% error
94
Other Constraints (diagram: tokens {E, B*C*D}, {C, A, dep-CD}, {D, A, dep-CD}) C & D interact and should be co-allocated, nearby … The dep-CD tokens should route to the same control point so a collective query for C & D can be issued
95
Support for Global Operations Troubleshooting – what happened? Monitoring – application progress? Cleanup – application died, cleanup state Solution mechanism: propagate control node IPs back to origin (=> origin IP piggybacked) Control nodes and matcher nodes report progress (or lack thereof via timeouts) to origin Load balance across different applications
96
Other Constraints C D E A B {E, B*C*D} {C, A} {D, A} C & D interact and they should be co-allocated, nearby …
97
Combining Neighbor Estimates MEDIAN shows the best results – using 3 neighbors, the error is within 50% 88% of the time (variation in download times is a factor of 10-20) – 3 neighbors gives the greatest bang
98
Effect of Candidate Size
99
Performance Comparison Parameters: Data size: 2MB Replication: 10 Candidates: 5
100
Computation Makespan (cont'd) Now bring in reliability … the makespan improvement scales well with the number of components (plot)
101
Token loss Between B and the matcher; between the matcher and the next stage – the matcher must notify C_B when the token arrives (pass loc(C_B) with B's token) – the destination (E) must notify C_B when the token arrives (pass loc(C_B) with B's token)
102
RTT Inference >= 90-95% of Internet paths obey the triangle inequality – RTT(a, c) <= RTT(a, b) + RTT(b, c) – upper bound: RTT(server, c) <= RTT(server, n_i) + RTT(n_i, c) – lower bound: |RTT(server, n_i) - RTT(n_i, c)| Iterate over all neighbors to get max L, min U; return the mid-point
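The bound computation is easy to sketch (function and variable names are mine): each neighbor contributes a [lower, upper] interval for the unknown RTT, the intervals are intersected, and the midpoint is returned.

```python
# Triangle-inequality RTT inference: for each neighbor, lower = |RTT(server, n_i)
# - RTT(n_i, client)| and upper = RTT(server, n_i) + RTT(n_i, client); intersect
# the intervals (max of lowers, min of uppers) and return the midpoint.

def infer_rtt(neighbor_rtts):
    """neighbor_rtts: list of (rtt_server_to_neighbor, rtt_neighbor_to_client)."""
    lowers = [abs(sn - nc) for sn, nc in neighbor_rtts]
    uppers = [sn + nc for sn, nc in neighbor_rtts]
    lo, hi = max(lowers), min(uppers)
    if lo > hi:                       # measurements violate the triangle inequality
        lo, hi = min(lowers), max(uppers)
    return (lo + hi) / 2.0

# Example with three neighbors: intersection of [50,110], [15,105], [80,120] is [80,105].
print(infer_rtt([(80, 30), (60, 45), (100, 20)]))   # 92.5
```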