Euro-Par 2007, Rennes, 29th August 1 The Characteristics and Performance of Groups of Jobs in Grids Alexandru Iosup, Mathieu Jan *, Ozan Sonmez and Dick Epema PDS Group Delft University of Technology The Netherlands * : now postdoc LRI/INRIA Futurs, Orsay (Paris South), France
Outline
  Why look at groups of jobs?
  Grid traces and environment summary
  Definitions of groups of jobs
  The characteristics of job groupings
    Workload-level analysis
    Group-level analysis
    Job-level analysis
  Conclusion and future work
Why look at groups of jobs?
  Current grids run almost exclusively single-node jobs [Grid2006]
    Trace analysis: LCG, Grid3, TeraGrid, DAS-2
  How are the jobs related, then? What is their structure?
    Batches of identical jobs? Something else?
  No such analysis using long-term data from production and research grid environments
  No analysis of the impact of groups of jobs on the performance of grids
Our research questions
  What are the dependencies among the jobs submitted by a single user?
  What is the physical structure of such groupings?
  What is the impact of the job groupings on the performance of grids?
Grid traces: Grid’5000 (1/3)
  Experimental platform
  Grid’5000: 9 sites, 15 clusters
  All clusters managed by OAR
  Trace period: 05/ /2006
  CPUs: ~2500
  Jobs: 951 K
  Users: 473
  Groups: 10
  Consumed CPU time: 651 years
Grid traces: NorduGrid (2/3)
  Large-scale production grid
  NorduGrid: ~75 sites
  Handled via the ARC (Advanced Resource Connector) middleware
  Trace period: 05/ /2006
  CPUs: ~2000
  Jobs: 781 K
  Users: 387
  Groups: 106
  Consumed CPU time: 2443 years
Grid traces: GLOW (3/3)
  Grid Laboratory Of Wisconsin
  Campus-wide distributed computing environment
  Condor-based
  Trace period: 09/ /2007
  CPUs: ~1400
  Jobs: 216 K
  Users: 18
  Groups: 1
  Consumed CPU time: 55 years
Grid traces summary
                     Grid’5000    NorduGrid     GLOW
  Period             05/ /2006    05/ /2006     09/ /2007
  Sites              15           ~75           1
  CPUs               ~2500        ~2000         ~1400
  Jobs               951 K        781 K         216 K
  Groups             10           106           1
  Users              473          387           18
  Consumed CPU time  651 years    2443 years    55 years
Groups of jobs: definitions (1/2)
  Batch submission
    Maximal contiguous subsequence G of the jobs submitted by a single user, ordered by submission time, such that for any two successive jobs J, J’ in G, J’ is submitted at most Δ seconds after J
  Parameter Sweep Application (PSA)
    Batch submission + jobs execute the same application
Groups of jobs: definitions (2/2)
  In this talk, we focus on batch submissions
Characteristics of job groupings
  In our analysis, the batch time threshold is Δ = 120 seconds
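The batch definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tooling: it assumes a batch is a maximal run of one user's jobs, ordered by submission time, in which each job is submitted at most the threshold (120 seconds here) after the previous one, and that a gap of exactly the threshold still counts. The `Job` record, its field names, and the helpers `find_batches` and `is_psa` are hypothetical names introduced for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Job(NamedTuple):
    """Hypothetical trace record; the field names are assumptions."""
    user: str
    submit_time: float  # seconds since the start of the trace
    application: str


def find_batches(jobs, delta=120.0):
    """Split each user's jobs into maximal runs whose successive
    submission times are at most `delta` seconds apart.

    Runs of length 1 correspond to single jobs.
    """
    by_user = defaultdict(list)
    for job in jobs:
        by_user[job.user].append(job)

    groups = []
    for user_jobs in by_user.values():
        user_jobs.sort(key=lambda j: j.submit_time)
        current = [user_jobs[0]]
        for prev, job in zip(user_jobs, user_jobs[1:]):
            if job.submit_time - prev.submit_time <= delta:
                current.append(job)      # still within the threshold
            else:
                groups.append(current)   # gap too large: close the run
                current = [job]
        groups.append(current)
    return groups


def is_psa(group):
    """A PSA is taken here to be a batch whose jobs all run the same application."""
    return len(group) > 1 and len({j.application for j in group}) == 1
```

For example, a user who submits jobs at 0 s, 100 s and 500 s yields one batch of two jobs and one single job, because the 400-second gap exceeds the threshold.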
Workload-level analysis
  Batches (whole-trace totals in parentheses)
                 Grid’5000      NorduGrid        GLOW
    Submissions  26k            50k              13k
    Jobs         808k (951k)    738k (781k)      205k (216k)
    CPU time     193y (651y)    2192y (2443y)    53y (55y)
  Continued
    NorduGrid & GLOW: identical to batches
    Grid’5000: 14k submissions, 910k jobs, 462y
  Bursty: fewer submissions, more jobs
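As a worked check of the batch figures, the share of jobs and CPU time that falls inside batches can be computed directly from the numbers on this slide. The sketch below assumes that the values in parentheses are the whole-trace totals (they match the totals given on the trace slides); the dictionary layout and the `share` helper are hypothetical names used only for illustration.

```python
# (in batches, whole trace) per trace; jobs in thousands, CPU time in years.
BATCH_JOBS = {
    "Grid'5000": (808, 951),
    "NorduGrid": (738, 781),
    "GLOW": (205, 216),
}
BATCH_CPU_YEARS = {
    "Grid'5000": (193, 651),
    "NorduGrid": (2192, 2443),
    "GLOW": (53, 55),
}


def share(in_batches, total):
    """Percentage of the whole-trace total accounted for by batches."""
    return 100.0 * in_batches / total


for trace in BATCH_JOBS:
    jobs_pct = share(*BATCH_JOBS[trace])
    cpu_pct = share(*BATCH_CPU_YEARS[trace])
    print(f"{trace}: {jobs_pct:.1f}% of jobs, {cpu_pct:.1f}% of CPU time in batches")
```

Under that reading, batches account for roughly 85% to 95% of the jobs in each trace, while their share of CPU time ranges from about 30% (Grid’5000) to about 96% (GLOW).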
Group-level analysis: size of batches
  75% of batches are of size (Grid’5000 and NorduGrid) or <10 (GLOW)
  Average: 31 ± 110 (Grid’5000), 15 ± 33 (NorduGrid) and 15 ± 38 (GLOW)
  Heavy-tailed distribution
Group-level analysis: inter-arrival time (seconds)
  As expected, the inter-arrival time of batches is high
  50% of the values are between 400 and 700 seconds
  Reminder: the batch time threshold is Δ = 120 seconds
Group-level analysis: duration (seconds)
  The duration of batches is higher than that of single jobs
  For NorduGrid, the average duration of batches is 1.5 days vs. 1 day for single jobs
Group-level analysis: consumed CPU time (KCPUs)
  Consumed CPU time is much higher for batches than for single jobs!
Job-level analysis: run time (seconds)
  Average run time for batches
    Grid’5000: 0.66 ± 6.65 days
    GLOW: 1.04 ± 3.18 days
    NorduGrid: 2.27 ± 5.59 days
Job-level analysis: wait time (seconds)
  NorduGrid: no wait time information in the trace
  Average wait times of batches are higher than
    the run time of batches
    the wait time of single jobs
Job-level analysis: consumed CPU time (KCPUs)
  No clear distinction between batches and single jobs
Other analyses
  Do parallel jobs exist inside batches?
    Average parallelism: 1 ± 1 (Grid’5000), 2 ± 7 (NorduGrid) and 1 (GLOW)
    Grid’5000: 37% of batches are of size 2, 9% of size >2, max. = 325
  To what extent are batches PSAs?
    In Grid’5000, 75% of batches are PSAs
    PSAs compared to batches:
      Group size increased by 9 on average
      Average duration divided by 5.7
Performance impact of grouped submissions
  Performance metrics
    Group runtime (RT)
    Group duration (DT)
    Group idle time: IT = DT - RT
  Table
              Batches             Single jobs
              ART (s)   AIT (s)   ART (s)   AIT (s)
    Grid’
  Batches display a high AIT value
    Over 4000% of the ART!
  Research direction for designing scheduling policies for batches: minimization of the AIT of batches
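The idle-time metric and the "AIT as a percentage of ART" comparison can be illustrated with a small sketch. It assumes that ART and AIT are the averages of RT and IT over all groups, and it takes each group's runtime and duration as given inputs, since the slide does not spell out how they are measured. The function names and the sample numbers are hypothetical and are not taken from the traces.

```python
from statistics import mean


def idle_time(duration, runtime):
    """Group idle time, IT = DT - RT (all values in seconds)."""
    return duration - runtime


def ait_as_percent_of_art(groups):
    """`groups` is a list of (runtime, duration) pairs in seconds.

    Returns the average idle time (AIT) as a percentage of the
    average runtime (ART), assuming both are plain averages.
    """
    art = mean(rt for rt, _ in groups)
    ait = mean(idle_time(dt, rt) for rt, dt in groups)
    return 100.0 * ait / art


# Made-up example: two groups that spend most of their duration idle.
example = [(100, 500), (300, 1500)]
print(ait_as_percent_of_art(example))
```

With these made-up inputs, ART is 200 s and AIT is 800 s, so the AIT is 400% of the ART; the slide reports ratios above 4000% for batches.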
Conclusion & future work
  Formally defined 3 types of groups of jobs
    Batch (and PSAs), continued and bursty
  Analysis of 3 long-term traces from large and different platforms
    Up to 96% of CPU time consumed by batch submissions
  Performance analysis of batches compared to single jobs
  Future work
    Deeper analysis (Grid Workloads Archive)
    Research direction: minimization of idle time in groups
    Trace-driven simulations
    Dynamic resource availability [Grid2007]
Thank you! Questions? Remarks? Observations?
  Help build our community’s Grid Workloads Archive: