Energy Efficient Scheduling for High-Performance Clusters
Ziliang Zong, Adam Manzanares, and Xiao Qin
Department of Computer Science and Software Engineering, Auburn University
Where is Auburn University? Ph.D. '04, University of Nebraska-Lincoln; 2004-07, New Mexico Tech; 2007-09, Auburn University
Storage Systems Research Group at New Mexico Tech
Storage Systems Research Group at Auburn (2008)
Storage Systems Research Group at Auburn (2009)
Investigators: Ziliang Zong, Ph.D., Assistant Professor, South Dakota School of Mines and Technology; Adam Manzanares, Ph.D. Candidate, Auburn University; Xiao Qin, Ph.D., Assistant Professor, Auburn University
Introduction - Applications
Introduction – Data Centers
Motivation – Electricity Usage EPA Report to Congress on Server and Data Center Energy Efficiency, 2007
Motivation – Energy Projections EPA Report to Congress on Server and Data Center Energy Efficiency, 2007
Motivation – Design Issues: Energy Efficiency, Performance, Reliability & Security
Outline: Introduction & Motivation; General Architecture for High-Performance Computing Platforms; Energy-Efficient Scheduling for Clusters; Energy-Efficient Scheduling for Grids; Energy-Efficient Storage Systems; Conclusions
Architecture – Multiple Layers
Energy Efficient Devices
Multiple Design Goals for High-Performance Computing Platforms: Performance, Energy Efficiency, Reliability, Security
Outline: Introduction & Motivation; General Architecture for High-Performance Computing Platforms; Energy-Efficient Scheduling for Clusters; Energy-Efficient Scheduling for Grids; Energy-Efficient Storage Systems; Conclusions
Energy-Aware Scheduling for Clusters
Parallel Applications
Motivational Example – An example of duplication (Gantt charts of three schedules): Linear Schedule, time 39 s; No Duplication Schedule (NDS), time 32 s; Task Duplication Schedule (TDS), time 29 s.
Motivational Example (cont.) – Task graph with (execution time, energy) node labels (8,48), (10,60), (15,90), (6,36) and (communication time, energy) edge labels (2,2), (4,4), (5,5), (6,6); CPU_Energy = 6 W, Network_Energy = 1 W. Linear Schedule: time 39 s, energy 234 J; No Duplication Schedule (MCP): time 32 s, energy 242 J; Task Duplication Schedule (TDS): time 29 s, energy 284 J.
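A minimal sketch (Python, not the authors' code) of the energy model behind these numbers: CPU energy prices each processor's busy time at CPU_Energy = 6 W and network energy prices each message's duration at Network_Energy = 1 W; the helper name schedule_energy is illustrative only.

CPU_POWER, NET_POWER = 6, 1        # watts, from the example above

def schedule_energy(busy_times, message_times):
    """busy_times: seconds of computation on each processor (duplicates included);
    message_times: durations of the inter-processor messages actually sent."""
    return CPU_POWER * sum(busy_times) + NET_POWER * sum(message_times)

# Linear schedule: one processor busy for 39 s, no messages -> 234 J, as on the slide.
print(schedule_energy([39], []))   # 234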
Motivational Example (cont.) – The energy cost of duplicating T1: CPU side +48 J, network side -6 J, total +42 J. The performance benefit of duplicating T1: 6 s. Energy-performance tradeoff: 42/6 = 7. If Threshold = 10, duplicate T1? EAD: no (42 > 10); PEBD: yes (7 <= 10). Resulting schedules – EAD: time 32 s, energy 242 J; PEBD: time 29 s, energy 284 J.
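A minimal sketch of the two duplication tests this example illustrates (Python; the function names and threshold semantics are assumptions based on the numbers above, not the authors' implementation):

def ead_duplicate(extra_energy, threshold):
    """EAD: duplicate only if the energy increase stays within the threshold."""
    return extra_energy <= threshold

def pebd_duplicate(extra_energy, time_saved, threshold):
    """PEBD: duplicate if the energy increase per second saved is within the threshold."""
    if time_saved <= 0:              # no performance benefit -> never duplicate
        return False
    return extra_energy / time_saved <= threshold

# Numbers from the example: duplicating T1 costs 48 J of CPU energy,
# saves 6 J of network energy (net +42 J), and shortens the schedule by 6 s.
extra_energy = 48 - 6                # 42 J
time_saved = 6                       # seconds
threshold = 10

print(ead_duplicate(extra_energy, threshold))               # False -> EAD: no
print(pebd_duplicate(extra_energy, time_saved, threshold))  # True  -> PEBD: yes (42/6 = 7)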
Basic Steps of Energy-Aware Scheduling – Step 1: DAG Generation (algorithm implementation). Task description: task set {T1, T2, ..., T9, T10}. T1 is the entry task and T10 is the exit task. T2, T3, and T4 cannot start until T1 finishes; T5 and T6 cannot start until T2 finishes; T7 cannot start until both T3 and T4 finish; T8 cannot start until both T5 and T6 finish; T9 cannot start until both T6 and T7 finish; T10 cannot start until both T8 and T9 finish. A minimal encoding of this DAG is sketched below.
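One possible encoding of the precedence constraints above (a Python sketch; the dictionary names are illustrative, and node/edge weights are omitted):

# Predecessor lists for the task set above.
predecessors = {
    "T1": [],
    "T2": ["T1"], "T3": ["T1"], "T4": ["T1"],
    "T5": ["T2"], "T6": ["T2"],
    "T7": ["T3", "T4"],
    "T8": ["T5", "T6"],
    "T9": ["T6", "T7"],
    "T10": ["T8", "T9"],
}

# Successor view, convenient for the parameter calculation in Step 2.
successors = {t: [] for t in predecessors}
for task, preds in predecessors.items():
    for p in preds:
        successors[p].append(task)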
Basic Steps of Energy-Aware Scheduling – Step 2: Parameter Calculation (algorithm implementation). For each task, compute: Level (total execution time from the current task to the exit task), EST (earliest start time), ECT (earliest completion time), LAST (latest allowable start time), LACT (latest allowable completion time), and FP (favorite predecessor).
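A sketch of how Level, EST/ECT, and FP could be computed. It is simplified (it assumes every predecessor runs on a different node and leaves out LAST/LACT), reuses the predecessors/successors dictionaries from the Step 1 sketch, and assumes hypothetical exec_time and comm cost tables:

def task_level(task, exec_time, successors):
    """Level: total execution time on the longest path from `task` to the exit task."""
    if not successors[task]:
        return exec_time[task]
    return exec_time[task] + max(task_level(s, exec_time, successors) for s in successors[task])

def est_ect(task, exec_time, comm, predecessors, memo=None):
    """Earliest start / completion time, assuming predecessors run on other nodes."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    if not predecessors[task]:
        memo[task] = (0, exec_time[task])
    else:
        est = max(est_ect(p, exec_time, comm, predecessors, memo)[1] + comm[(p, task)]
                  for p in predecessors[task])
        memo[task] = (est, est + exec_time[task])
    return memo[task]

def favorite_predecessor(task, exec_time, comm, predecessors):
    """FP: the predecessor whose data would arrive last."""
    preds = predecessors[task]
    if not preds:
        return None
    return max(preds, key=lambda p: est_ect(p, exec_time, comm, predecessors)[1] + comm[(p, task)])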
Basic Steps of Energy-Aware Scheduling – Step 3: Scheduling (algorithm implementation). Using the parameters from Step 2, sort the tasks by level in ascending order to form the scheduling queue. Original task list: {10, 9, 8, 5, 6, 2, 7, 4, 3, 1}. (See the sorting sketch below.)
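A minimal sketch of the queue construction: the exit task ends up first and the entry task last. The level values below are placeholders chosen only so the resulting order matches the task list on the slide:

levels = {"T1": 42, "T2": 22, "T3": 32, "T4": 30, "T5": 14,
          "T6": 16, "T7": 24, "T8": 9, "T9": 8, "T10": 2}   # placeholder levels
task_list = sorted(levels, key=levels.get)                   # ascending by level
print(task_list)
# -> ['T10', 'T9', 'T8', 'T5', 'T6', 'T2', 'T7', 'T4', 'T3', 'T1']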
Basic Steps of Energy-Aware Scheduling – Step 4: Duplication Decision (algorithm implementation). While building the schedule from the task list {10, 9, 8, 5, 6, 2, 7, 4, 3, 1}, the algorithm makes a sequence of duplication decisions. Decision 1: duplicate T1? Decision 2: duplicate T2? Duplicate T1? Decision 3: duplicate T1?
The EAD and PEBD Algorithms (flow chart):
1. Generate the DAG of the given task set.
2. Find all the critical paths in the DAG.
3. Generate the scheduling queue based on level (ascending).
4. Select the task with the lowest level that has not been scheduled yet as the starting task.
5. For each task on the same critical path as the starting task, check whether it is already scheduled; if not, allocate it to the same processor as the other tasks on that critical path, and continue until the entry task is met.
6. If a task on the path is already scheduled, decide whether to duplicate it:
   EAD: calculate the energy increase; duplicate the task and move to the next task on the critical path only if more_energy <= Threshold.
   PEBD: if duplicating the task saves time, calculate the energy increase and time decrease, set Ratio = energy increase / time decrease, and duplicate only if Ratio <= Threshold.
A compact sketch of this loop follows.
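A compressed Python sketch of the loop in the flow chart above; the fp map and the delta_energy/delta_time callbacks (the energy increase and time saved if a task were duplicated onto the processor being filled) are assumptions, not the authors' interfaces:

def ead_pebd_schedule(task_queue, fp, delta_energy, delta_time, threshold, use_pebd):
    """Trace a chain from each unscheduled task back toward the entry task via the
    favorite predecessor, packing the chain onto one processor; an already-scheduled
    predecessor is duplicated only if the energy test passes."""
    scheduled, processors = set(), []
    for start in task_queue:                       # queue sorted by level, ascending
        if start in scheduled:
            continue
        proc, task = [], start
        while task is not None:
            if task in scheduled:                  # duplication candidate
                if use_pebd:                       # PEBD: energy per second saved
                    saved = delta_time(task, proc)
                    ok = saved > 0 and delta_energy(task, proc) / saved <= threshold
                else:                              # EAD: absolute energy increase
                    ok = delta_energy(task, proc) <= threshold
                if not ok:
                    break                          # do not duplicate; end this chain
            proc.append(task)
            scheduled.add(task)
            task = fp.get(task)                    # follow the favorite predecessor
        processors.append(proc)
    return processors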
Energy Dissipation in Processors
Parallel Scientific Applications: Fast Fourier Transform, Gaussian Elimination
Large-Scale Parallel Applications: Robot Control, Sparse Matrix Solver
Impact of CPU Power Dissipation. Figures: energy consumption for different processors (Gaussian Elimination, CCR = 0.4) and (FFT, CCR = 0.4), with annotated savings of 19.4% and 3.7%. CPU power, Power (busy) / Power (idle) / Gap: 104 W / 15 W / 89 W; 75 W / 14 W / 61 W; 47 W / 11 W / 36 W; 44 W / 26 W / 18 W. Observation: CPUs with a large gap between CPU_busy and CPU_idle can obtain greater energy savings.
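A small sketch of why the busy/idle gap drives the savings: for a fixed schedule length, shifting time from busy to idle saves (P_busy - P_idle) joules per second, so the same scheduling improvement is worth more on CPUs with a wide gap. The 5 s shift and 32 s makespan below are hypothetical:

def node_energy(p_busy, p_idle, busy_time, makespan):
    """Energy of one node over a schedule of length `makespan` (seconds, watts)."""
    return p_busy * busy_time + p_idle * (makespan - busy_time)

# Saving from turning 5 s of busy time into idle time, for each CPU row above.
for p_busy, p_idle in [(104, 15), (75, 14), (47, 11), (44, 26)]:
    saving = node_energy(p_busy, p_idle, 25, 32) - node_energy(p_busy, p_idle, 20, 32)
    print(p_busy, p_idle, saving)   # 445, 305, 180, 90 J respectively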
Impact of Interconnect Power Dissipation. Figures: energy consumption for Robot Control on Myrinet (annotated savings of 16.7% and 5%) and on InfiniBand (annotated savings of 13.3% and 3.1%). Interconnect power: Myrinet 33.6 W, InfiniBand 65 W. Observation: the energy savings of EAD and PEBD are degraded when the interconnect has a high power consumption rate.
Parallelism Degrees. Figures: energy consumption of Robot Control (Myrinet, annotated savings of 17% and 15.8%) and of Sparse Matrix Solver (Myrinet, annotated savings of 6.9% and 5.4%). Applications compared: Robot Control vs. Sparse Matrix Solver (different degrees of parallelism). Observation: Robot Control has more task dependencies, so EAD and PEBD have more opportunities to conserve energy by judiciously duplicating tasks.
Communication-Computation Ratio. Figure: energy consumption under different CCRs. Processor type: AMD Athlon; interconnect: Myrinet; simulated application: Robot Control; CCR values: 0.1, 0.5, 1, 5, 10. Observations: the overall energy consumption of EAD and PEBD is less than that of MCP and TDS; EAD and PEBD are very sensitive to CCR; MCP provides the greatest energy savings if CCR is less than 1; MCP consumes much more energy when CCR is large. (CCR: communication-to-computation ratio.)
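One common way such a CCR parameter is realized in simulation (an assumption about the setup, not a description of the authors' simulator) is to scale all message times so that the average communication cost is CCR times the average computation cost:

def scale_communication(comm_times, exec_times, ccr):
    """Rescale message durations so mean(communication) / mean(computation) == ccr."""
    avg_comm = sum(comm_times) / len(comm_times)
    avg_exec = sum(exec_times) / len(exec_times)
    factor = ccr * avg_exec / avg_comm
    return [c * factor for c in comm_times]

# Example: CCR = 5 makes messages five times as expensive as computation on average.
print(scale_communication([2, 4, 6], [10, 60, 20], ccr=5))   # [75.0, 150.0, 225.0]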
Performance. Figures: schedule length of Gaussian Elimination and of Sparse Matrix Solver. Performance degradation relative to TDS – Gaussian Elimination: EAD 5.7%, PEBD 2.2%; Sparse Matrix Solver: EAD 2.92%, PEBD 2.02%. Observation: it is worth trading a marginal degradation in schedule length for significant energy savings in cluster systems.
Heterogeneous Clusters - Motivational Example
Motivational Example (cont.) – Energy calculation for the tentative schedules C1, C2, C3, and C4.
Experimental Settings (simulation environments). Parameters: Value (Fixed) – (Varied).
Task graphs examined: Gaussian Elimination, Fast Fourier Transform.
Execution times for Gaussian Elimination: {5, 4, 1, 1, 1, 1, 10, 2, 3, 3, 3, 7, 8, 6, 6, 20, 30, 30} – (random).
Execution times for Fast Fourier Transform: {15, 10, 10, 8, 8, 1, 1, 20, 20, 40, 40, 5, 5, 3, 3} – (random).
Computing node types: AMD Athlon 64 X2 with 85 W TDP (Type 1); AMD Athlon 64 X2 with 65 W TDP (Type 2); AMD Athlon 64 X2 with 35 W TDP (Type 3); Intel Core 2 Duo E6300 (Type 4).
CCR set: between 0.1 and 10.
Computing node heterogeneity – Environment 1: 4 of Type 1, 4 of Type 2, 4 of Type 3, 4 of Type 4; Environment 2: 6, 2, 2, 6; Environment 3: 5, 3, 3, 5; Environment 4: 7, 1, 1, 7.
Network energy consumption rate: 20 W, 33.6 W, 60 W.
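The four heterogeneity environments from the table, restated as a small configuration sketch (Python; the dictionary layout is illustrative, not taken from the simulator):

environments = {
    "Environment1": {"Type 1": 4, "Type 2": 4, "Type 3": 4, "Type 4": 4},
    "Environment2": {"Type 1": 6, "Type 2": 2, "Type 3": 2, "Type 4": 6},
    "Environment3": {"Type 1": 5, "Type 2": 3, "Type 3": 3, "Type 4": 5},
    "Environment4": {"Type 1": 7, "Type 2": 1, "Type 3": 1, "Type 4": 7},
}
# Each environment has 16 computing nodes; only the type mix differs.
assert all(sum(env.values()) == 16 for env in environments.values())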
Communication-Computation Ratio CCR sensitivity for Gaussian Elimination
Heterogeneity. Figure: computational node heterogeneity experiments (per-CPU-type results across environments E1-E4). Observation: CPUs with a large gap between CPU_busy and CPU_idle can obtain greater energy savings.
Architecture for high-performance computing platforms; Energy-Efficient Scheduling for Clusters; Energy-Efficient Scheduling for Heterogeneous Systems; How to measure energy consumption? Kill-A-Watt; Conclusions
Questions