Inter-Operating Grids through Delegated MatchMaking
Alexandru Iosup, Dick Epema (PDS Group, TU Delft, NL)
Todd Tannenbaum, Matt Farrellee, Miron Livny (CS Dept., U. Wisconsin-Madison, US)
Outline
1. Grid Inter-Operation: Motivation and Goals
   - Evaluation of the e-Science Computational Demand
   - Why Grid Inter-Operation?
   - The Grid Inter-Operation Research Question
2. Alternatives to/for Grid Inter-Operation
3. Inter-Operating Grids Through Delegated MatchMaking
4. Experimental Results
5. Conclusion and Future Work
Current e-Science Computational Demand
For every grid (cluster) trace: over 500,000 jobs per year (one trace: more than 525,000 jobs in 1.5 years).
The Grid Workloads Archive: http://gwa.ewi.tudelft.nl
Reference: A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D.H.J. Epema, The Grid Workloads Archive, 2007 (submitted to FGCS).
Current e-Science Demand: Grids vs. Parallel Production Systems
- Both have (tens of) thousands of processors, but grids have distributed ownership.
- Similar CPU time per year, but grids have 10x larger arrival bursts (e.g., LCG cluster daily peak: 22,500 jobs).
- Compared: grids (cluster-based, source: GWA) vs. parallel production environments (large clusters, source: PWA).
Reference: A. Iosup, D.H.J. Epema, C. Franke, A. Papaspyrou, L. Schley, B. Song, R. Yahyapour, On Grid Performance Evaluation using Synthetic Workloads, JSSPP'06.
Current e-Science Demand: Bursty Demand Leads to High Wait Time
Reference: A. Iosup, C. Dumitrescu, D.H.J. Epema, H. Li, L. Wolters, How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications, Grid 2006.
The 1M-CPU Machine with Shared Resource Ownership
The 1M-CPU machine:
- Serves e-Science (high-energy physics, earth sciences, financial services, bioinformatics, etc.).
- It would be over-provisioned for any individual e-Science field, but right-sized when provisioning for all e-Science fields at the same time.
Shared resource ownership:
- Shared resource acquisition, shared maintenance and operation.
- The combined capacity is used more efficiently, so it is effectively higher than the sum of the individually owned capacities.
How to Build the 1M-CPU Machine with Shared Resource Ownership?
The number of clusters is increasing at a fast pace:
- Top500 supercomputers: cluster systems went from a 0% to a 75% system share in 10 years (and from 0% to 50% of the performance share).
- CERN WLCG: from 100 to 300 clusters in 2.5 years.
(Figure: Top500 share over time, clusters vs. MPPs.)
Source: http://goc.grid.sinica.edu.tw/gstat/table.html
How to Build the 1M-CPU Machine with Shared Resource Ownership?
Growth of Top500 systems over the last 10 years: median 10x, average 20x, max 100x. Over the last 4 years growth has slowed; it is now below 1.2x per year.
To build the 1M-CPU cluster:
- At the rate of the last 10 years: another 10 years.
- At the current rate: another 40 years.
Data source: http://www.top500.org
How to Build the 1M-CPU Machine with Shared Resource Ownership?
CERN WLCG cluster size over time: median +5 processors/year, average +15 processors/year, max 2x/year.
Shared clusters grow on average more slowly than Top500 cluster systems!
Data source: http://goc.grid.sinica.edu.tw/gstat/
How to Build the 1M-CPU Machine with Shared Resource Ownership?
Why doesn't CERN WLCG use larger clusters?
- Physics: dissipating heat from large clusters is hard.
- Market: pay industrial power-consumer rates, pay special system-building rates.
- Collaboration: who pays for the largest cluster?
Why doesn't CERN WLCG opt for multi-cores?
- We don't know yet how to exploit multi-cores for executing large batches of independent jobs.
How to Build the 1M-CPU Machine with Shared Resource Ownership?
The number of clusters is growing, but cluster size is not. So: many small clusters in one large distributed computing system.
- 6,000 clusters = 1M CPUs / 150 CPUs per cluster [CERN]
- 30,000 clusters = 1M CPUs / 32 CPUs per cluster [Kee et al., SC'04]
How to inter-operate 10,000s of clusters? With grids; but the largest grid has 300 clusters, and most grids have only 2-3…
Research question: How to inter-operate cluster-based grids in a scalable and efficient way?
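As a quick check of the arithmetic above, a minimal Python sketch (the function name is ours; the average cluster sizes come from the slide, and the exact results are rounded down to the slide's round figures):

```python
def clusters_needed(total_cpus: int, cpus_per_cluster: int) -> int:
    """Number of clusters of a given average size needed to reach total_cpus."""
    return -(-total_cpus // cpus_per_cluster)  # ceiling division

# Average cluster sizes quoted on the slide: ~150 CPUs (CERN) and ~32 CPUs (Kee et al., SC'04).
print(clusters_needed(1_000_000, 150))  # 6667  -- roughly the "6,000 clusters" on the slide
print(clusters_needed(1_000_000, 32))   # 31250 -- roughly the "30,000 clusters" on the slide
```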
Outline
1. Grid Inter-Operation: Introduction, Motivation, and Goals
2. Alternatives to/for Grid Inter-Operation
3. Inter-Operating Grids Through Delegated MatchMaking
4. Experimental Results
5. Conclusion and Future Work
Alternatives to/for Grid Inter-Operation
- Independent: Condor, Globus GRAM
- Centralized: Alien, Koala, OAR
- Hierarchical: CCS, Moab/Torque, OAR2
- Decentralized: NWIRE, OurGrid, Condor Flocking
Open issues across these alternatives: load imbalance? resource selection? scale? root ownership? node failures? accounting? trust?
Outline
1. Grid Inter-Operation: Introduction, Motivation, and Goals
2. Alternatives to/for Grid Inter-Operation
3. Inter-Operating Grids Through Delegated MatchMaking
   - Architecture
   - Mechanism
4. Experimental Results
5. Conclusion and Future Work
3. Inter-Operating Grids Through Delegated MatchMaking
The Delegated MatchMaking Architecture
1. Start from a hierarchical architecture.
2. Let roots exchange load.
3. Let siblings exchange load.
The result is a hybrid hierarchical/decentralized architecture for grid inter-operation (see the sketch below).
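A minimal sketch of how this hybrid structure could be represented (Python; the class, field, and link names are ours, not the paper's implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SiteNode:
    """A node in the inter-operation network: a cluster, a grid root, or an intermediate site."""
    name: str
    parent: Optional["SiteNode"] = None                          # hierarchical link (step 1)
    children: List["SiteNode"] = field(default_factory=list)
    neighbors: List["SiteNode"] = field(default_factory=list)    # root/sibling links (steps 2-3)

    def add_child(self, child: "SiteNode") -> None:
        child.parent = self
        self.children.append(child)

    def link(self, other: "SiteNode") -> None:
        """Add a decentralized load-exchange link (between roots or between siblings)."""
        self.neighbors.append(other)
        other.neighbors.append(self)

# Two small grids, each a root with two clusters; roots and siblings exchange load.
root_a, root_b = SiteNode("grid-A"), SiteNode("grid-B")
for root, clusters in [(root_a, ["A1", "A2"]), (root_b, ["B1", "B2"])]:
    for c in clusters:
        root.add_child(SiteNode(c))
root_a.link(root_b)                          # step 2: roots exchange load
root_a.children[0].link(root_a.children[1])  # step 3: siblings exchange load
```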
3. Inter-Operating Grids Through Delegated MatchMaking
The Delegated MatchMaking Mechanism
1. Deal with local load locally (if possible).
(Figure: with local load low, the cluster matches the resource request locally and grants resource usage rights.)
3. Inter-Operating Grids Through Delegated MatchMaking
The Delegated MatchMaking Mechanism
1. Deal with local load locally (if possible).
2. When local load is too high, temporarily bind resources from remote sites to the local environment. This may build delegation chains. Delegate resource usage rights, do not migrate jobs.
3. Deal with delegations in each delegation cycle (delegated matchmaking).
(Figure: with local load too high, the resource request is delegated to another site and the remote resource is bound to the local environment through its usage rights.)
In short, the Delegated MatchMaking mechanism = delegate resource usage rights, do not migrate jobs (a code sketch of this decision follows).
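A minimal, self-contained sketch of the decision above (Python; the Site class, its fields, and the delegation order are illustrative assumptions, not the paper's implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Site:
    name: str
    free_cpus: int
    targets: List["Site"] = field(default_factory=list)  # delegation order, e.g. parent first, then siblings

    def grant_usage_rights(self, cpus: int) -> Optional[str]:
        """Grant usage rights over local CPUs if available (no job migration)."""
        if self.free_cpus >= cpus:
            self.free_cpus -= cpus
            return f"rights:{self.name}:{cpus}cpus"
        # A real site could delegate further here, building a delegation chain.
        return None

def delegated_matchmaking(site: Site, request_cpus: int) -> str:
    # Step 1: deal with local load locally, if possible.
    local = site.grant_usage_rights(request_cpus)
    if local is not None:
        return local
    # Step 2: local load too high -- obtain usage rights from another site and
    # bind the remote resource to the local environment; the job itself stays local.
    for target in site.targets:
        rights = target.grant_usage_rights(request_cpus)
        if rights is not None:
            return rights
    return "queued locally until the next delegation cycle"

# Example: a fully loaded cluster delegates to its (assumed) grid root, then a sibling.
parent, sibling = Site("grid-root", free_cpus=64), Site("cluster-B", free_cpus=32)
busy = Site("cluster-A", free_cpus=0, targets=[parent, sibling])
print(delegated_matchmaking(busy, request_cpus=16))  # rights granted by grid-root
```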
3. Inter-Operating Grids Through Delegated MatchMaking
The Delegated MatchMaking Protocol
1. Delegate requests when they cannot be served locally; delegation chains may be built this way.
2. Delegation chains have a maximum length (the DTTL), defined at job submission by the user or site manager.
(Figure: example message exchange in which JM-1 obtains RM-1 from SM-1.)
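A minimal sketch of how a DTTL could bound the length of a delegation chain (the chain here is just a list of site names; the helper names are ours, not the protocol's message format):

```python
from typing import Callable, List, Optional

def delegate_with_dttl(sites: List[str], can_serve: Callable[[str], bool], dttl: int) -> Optional[str]:
    """Walk a (hypothetical) sequence of delegation targets, decrementing the DTTL
    at each hop; stop when a site can serve the request or the DTTL reaches 0."""
    chain = []
    for site in sites:
        if dttl == 0:
            break                        # maximum chain length reached
        chain.append(site)
        dttl -= 1
        if can_serve(site):
            print("delegation chain:", " -> ".join(chain))
            return site
    return None                          # request stays queued at the origin

# Example: only the third site has free resources; DTTL=2 forbids reaching it, DTTL=3 allows it.
free = {"cluster-A": False, "grid-root": False, "cluster-B": True}
print(delegate_with_dttl(list(free), can_serve=free.get, dttl=2))  # None
print(delegate_with_dttl(list(free), can_serve=free.get, dttl=3))  # cluster-B
```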
3. Inter-Operating Grids Through Delegated MatchMaking
The Delegated MatchMaking Policies
- Is the current load too high? Delegation threshold.
- Which part is extra load? Delegation policy.
- Maximum delegation chain length? DTTL.
- How to dispatch local load? Local requests dispatching policy.
- Whom to delegate to first? Target site ordering policy.
- When to delegate? Delegation cycle.
Together, these form a framework for policies in grid inter-operation (a configuration sketch follows).
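A minimal sketch of how the six policy knobs above could be grouped into one configuration object (field names and default values are illustrative, not the paper's settings):

```python
from dataclasses import dataclass

@dataclass
class DMMPolicies:
    """One place for the policy knobs listed above (illustrative names and defaults)."""
    delegation_threshold: float = 0.8                 # is the current load too high?
    delegation_policy: str = "excess-over-threshold"  # which part of the load is extra?
    dttl: int = 3                                     # maximum delegation chain length
    local_dispatch_policy: str = "FCFS"               # how to dispatch local load
    target_site_ordering: str = "parent-first"        # whom to delegate to first
    delegation_cycle_s: int = 30                      # when to delegate (cycle length, seconds)

policies = DMMPolicies(delegation_threshold=0.9, dttl=4)
print(policies)
```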
Outline
1. Grid Inter-Operation: Introduction, Motivation, and Goals
2. Alternatives to/for Grid Inter-Operation
3. Inter-Operating Grids Through Delegated MatchMaking
4. Experimental Results
   - Experimental Setup
   - Performance Evaluation
   - Overhead Evaluation
5. Conclusion and Future Work
4. Experimental Results
Experimental Setup
- System: inter-operating DAS and Grid'5000 (20 clusters, >3,000 processors), studied through discrete-event simulation.
- Alternatives compared: Independent (separated clusters+FCFS, Condor+MM), Centralized (CERN+poll, centralized grid scheduler+WF+FCFS), Decentralized (Condor with flocking+MM+FS), and DAS+Grid'5000 with DMM.
- Workloads: real traces and realistic (modeled) workloads.
- Metrics: wait time (WT), response time (RT), slowdown (SD), goodput, finished jobs [%], overhead.
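A minimal sketch of how the per-job metrics could be computed from trace records (the field names are assumed; goodput is taken here simply as the summed consumed CPU time of finished jobs):

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Job:
    submit_t: float   # submission time
    start_t: float    # dispatch time
    finish_t: float   # completion time
    cpus: int

def wait_time(j: Job) -> float:       # WT
    return j.start_t - j.submit_t

def response_time(j: Job) -> float:   # RT = wait time + run time
    return j.finish_t - j.submit_t

def slowdown(j: Job) -> float:        # SD = response time / run time
    return response_time(j) / max(j.finish_t - j.start_t, 1e-9)

def goodput(jobs: Iterable[Job]) -> float:
    return sum((j.finish_t - j.start_t) * j.cpus for j in jobs)

jobs = [Job(0, 10, 70, 4), Job(5, 40, 100, 1)]
print([wait_time(j) for j in jobs], goodput(jobs))
```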
4. Experimental Results
Realistic Grid Workloads
- Model: Lublin-Feitelson [JPDC'03] … but with 95% single-CPU jobs.
- Jobs come in batches, and runtime variability inside batches is high.
Reference: A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, The Characteristics and Performance of Groups of Jobs in Grids, Euro-Par'07.
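A minimal sketch of the batch structure described above, purely illustrative (this is not the Lublin-Feitelson model; the lognormal spread merely mimics high runtime variability inside a batch of single-CPU jobs):

```python
import random

def make_batch(batch_id: int, size: int, base_runtime_s: float):
    """A batch of single-CPU jobs submitted together, with highly variable runtimes."""
    return [
        {"batch": batch_id, "cpus": 1,
         "runtime_s": base_runtime_s * random.lognormvariate(0, 1.0)}  # heavy-tailed spread
        for _ in range(size)
    ]

workload = [job for b in range(3) for job in make_batch(b, size=50, base_runtime_s=600)]
print(len(workload),
      min(j["runtime_s"] for j in workload),
      max(j["runtime_s"] for j in workload))
```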
4. Experimental Results
Performance Evaluation
Compared with the Independent, Centralized, and Decentralized alternatives, DMM achieves:
- High goodput
- Low wait time
- All jobs finished
- Even better results under load imbalance between grids [see paper]
The DMM delivers good performance.
4. Experimental Results
Overhead Evaluation
- DMM overhead: ~16%, with 93% more control messages.
- Constant number of delegations per job up to 80% load.
- The DMM delegation threshold can be used to control overhead [see paper].
The DMM incurs reasonable overhead.
Outline
1. Grid Inter-Operation: Introduction, Motivation, and Goals
2. Alternatives to/for Grid Inter-Operation
3. Inter-Operating Grids Through Delegated MatchMaking
4. Experimental Results
5. Conclusion and Future Work
Conclusion and Future Work
Research question: How to inter-operate cluster-based grids in a scalable and efficient way?
Answer: the Delegated MatchMaking architecture, mechanism, and policies.
Contributions:
- Hybrid architecture
- Delegation of resource usage rights
- Framework for policy investigation
Evaluation of DMM: high goodput, low wait time, reasonable overhead.
Future work:
- Fault-tolerant policies (built a resource availability model: Grid 2007)
- Larger systems (we promised the 1M-CPU machine)
- Malicious participants, trust
- Real-environment evaluation (built a testing tool: GrenchMark, CCGrid'06)
Thank you! Questions? Remarks? Observations?
The Grid Workloads Archive: http://gwa.ewi.tudelft.nl/ (or Google "The Grid Workloads Archive")
Contact: A.Iosup@tudelft.nl, http://www.pds.ewi.tudelft.nl/~iosup/ (or Google "iosup")
Share your job and resource availability traces!