
1 GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures
41st International Conference on Parallel Processing (ICPP), 2012 Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996 Presented by Po-Ting Liu 2013/07/25

2 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

3 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

4 Introduction Popularity of GPU-CPU heterogeneous architectures
High computational throughput More efficient on SIMD operations Better energy efficiency For instance: Tianhe-1A delivers 2.5 PetaFlops at 4 MegaWatts, versus an estimated 12 MegaWatts for a CPU-only design NVIDIA. NVIDIA Tesla GPUs Power World's Fastest Supercomputer.

5 Introduction (cont.)
However, Tianhe-1A's electricity bill is about $2.7 million/year (roughly NT$81 million/year)

6 Introduction (cont.) GreenGPU
A holistic approach that improves energy efficiency with negligible performance loss Two-tier design First tier Dynamically divides the workload between CPU and GPU Second tier Dynamically scales the frequencies of CPU and GPU

7 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

8 Motivation Case study on workload division between CPU and GPU
Properly dividing the workload can reduce idle time and thus save energy *Benchmark: k-means

9 Motivation (cont.) Case study on frequency scaling for GPU memory
Properly scaling down an under-utilized component can save energy with negligible performance impact nbody: core-bounded, computation-intensive streamcluster (SC): memory-bounded, memory-intensive [Figures a and b: energy and performance under memory frequency scaling]

10 Motivation (cont.) Case study on frequency scaling for GPU core
For each component there may be a frequency level that is most suitable nbody: core-bounded, computation-intensive streamcluster (SC): memory-bounded, memory-intensive [Figures a and b: energy and performance under core frequency scaling]

11 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

12 System Design and Algorithms
[Framework diagram: First tier (workload division) and Second tier (frequency scaling)]

13 System Design and Algorithms (cont.)
First tier - Workload division - Overview Dynamically divides the workload between CPU and GPU based on their execution times Conducted every iteration, with a fixed amount of work per iteration An iteration is delimited by a reduction point or common barrier point

14 System Design and Algorithms (cont.)
First tier - Workload division - Example The CPU handles workload fraction r with execution time t_c; the GPU handles fraction 1−r with execution time t_g Assume each adjustment step is 5% t_c > t_g: use r−5% for the next iteration t_c < t_g: use r+5% for the next iteration
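The per-iteration adjustment above can be sketched as a small Python function (a hedged illustration; the names r, t_c, t_g and the clamping are assumptions, not the paper's code):

```python
STEP = 0.05  # 5% of the total workload shifted per iteration

def adjust_division(r, t_c, t_g):
    """Shift workload toward the device that finished first.

    r   -- fraction of the workload currently assigned to the CPU
    t_c -- CPU execution time of the last iteration
    t_g -- GPU execution time of the last iteration
    """
    if t_c > t_g:                 # CPU is the straggler: give 5% to the GPU
        return max(0.0, r - STEP)
    if t_c < t_g:                 # GPU is the straggler: give 5% to the CPU
        return min(1.0, r + STEP)
    return r                      # balanced: keep the current division

# CPU finished earlier, so it takes on 5% more of the workload.
print(round(adjust_division(0.10, t_c=1.2, t_g=1.5), 2))  # -> 0.15
```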

15 System Design and Algorithms (cont.)
First tier - Workload division - Avoiding oscillation Oscillation example Optimal division point: 12/88 (CPU/GPU) Division oscillates between 10/90 and 15/85 (CPU/GPU) Solution Linearly scale the previous iteration's execution times to the candidate workload to predict the next iteration's execution times Example At 10/90 (CPU/GPU) with t_c < t_g, the rule would move 5% of the workload from GPU to CPU, giving 15/85 for the next iteration Predict t_c' = (15/10) × t_c and t_g' = (85/90) × t_g If t_c' > t_g', keep the current 10/90 (CPU/GPU) division for the next iteration
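The oscillation guard can be sketched like this (illustrative only; the slide describes the idea, the code structure is an assumption):

```python
def should_shift(r, t_c, t_g, step=0.05):
    """Predict whether moving `step` of the work from GPU to CPU helps.

    Linearly scales last iteration's times (t_c, t_g) to the candidate
    division r + step, assuming time is proportional to workload share.
    """
    r_new = r + step
    t_c_pred = (r_new / r) * t_c               # e.g. (15/10) * t_c
    t_g_pred = ((1 - r_new) / (1 - r)) * t_g   # e.g. (85/90) * t_g
    # Shift only if the CPU is predicted to still finish first.
    return t_c_pred < t_g_pred

# Slide example: at 10/90 the CPU is faster (t_c < t_g), but scaling to
# 15/85 predicts t_c' > t_g', so the 10/90 division is kept.
print(should_shift(0.10, t_c=1.0, t_g=1.3))  # -> False
```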

16 System Design and Algorithms (cont.)
Second tier - CPU frequency scaling - Strategy On-demand: the default Linux power-saving strategy Start at the lowest frequency level (25MHz) When utilization rises above the threshold (≥60%), jump to the peak frequency (100MHz) When utilization falls below the threshold (<60%), scale the frequency down step by step: 75MHz → 50MHz → 25MHz
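One decision step of the on-demand strategy above might look like the following sketch (the 60% threshold and frequency levels come from the slide; the function and its structure are assumptions):

```python
LEVELS = [25, 50, 75, 100]  # frequency settings from the slide
THRESHOLD = 60              # utilization threshold (%)

def next_frequency(current, utilization):
    """One decision step of the on-demand strategy."""
    if utilization >= THRESHOLD:
        return LEVELS[-1]             # busy: jump straight to the peak
    i = LEVELS.index(current)
    return LEVELS[max(0, i - 1)]      # under-utilized: step down one level

print(next_frequency(25, 90))   # -> 100
print(next_frequency(100, 30))  # -> 75
print(next_frequency(75, 30))   # -> 50
```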

17 System Design and Algorithms (cont.)
Second tier - GPU Frequency scaling - Pseudo code

18 System Design and Algorithms (cont.)
Second tier - GPU frequency scaling - Loss factor As Loss↑, weight↓ 0 ≤ l_i(t) ≤ 1, where t is the interval index and i is the frequency level, 1 ≤ i ≤ N N: number of available frequency levels u: current utilization (%) u_mean[i]: most suitable utilization for frequency level i α: weight between energy and performance

19 System Design and Algorithms (cont.)
Second tier - GPU frequency scaling - Equations
Core loss factor: l_c_i(t) = α_c × l_c_ie(t) + (1 − α_c) × l_c_ip(t)
Memory loss factor: l_m_j(t) = α_m × l_m_je(t) + (1 − α_m) × l_m_jp(t)
Total loss: TotalLoss_ij(t) = φ × l_c_i(t) + (1 − φ) × l_m_j(t)
Weight update: weight_ij(t+1) = weight_ij(t) × (1 − (1 − β) × TotalLoss_ij(t))
φ: weight between core and memory β: weight between total loss and history weight
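One step of the weight update can be sketched directly from the equations above (all parameter values below are illustrative; only the formulas come from the slide):

```python
def loss(l_energy, l_perf, alpha):
    """Weighted energy/performance loss for one frequency level."""
    return alpha * l_energy + (1 - alpha) * l_perf

def update_weight(w, l_core, l_mem, phi, beta):
    """One weight-update step for a (core, memory) frequency pair."""
    total = phi * l_core + (1 - phi) * l_mem  # TotalLoss_ij(t)
    return w * (1 - (1 - beta) * total)       # larger loss -> smaller weight

# Illustrative step: equal energy/performance and core/memory weighting.
l_c = loss(0.1, 0.3, alpha=0.5)   # core loss factor l_c_i(t) = 0.2
l_m = loss(0.3, 0.5, alpha=0.5)   # memory loss factor l_m_j(t) = 0.4
w = update_weight(1.0, l_c, l_m, phi=0.5, beta=0.5)
print(round(w, 4))  # -> 0.85
```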

20 System Design and Algorithms (cont.)
Problem: the two tiers affect each other Solution Decouple the first tier and the second tier Configure the first tier's period to be much longer than the second tier's, since the first tier's overhead is much higher

21 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

22 Experiment Experimental environment CPU: AMD Phenom II X2
GPU: NVIDIA 8800GTX Two power supplies and two power meters: one for CPU, disk, main memory, etc.; one for GPU OS: Ubuntu 10.04

23 Experiment (cont.) Benchmark From Rodinia and NVIDIA SDK

24 Experiment (cont.) Frequency Scaling for GPU Cores and Memory
Benchmark: streamcluster (memory-bounded) Peak frequency of core: 576 MHz Peak frequency of memory: 900 MHz Scaling interval: 3 seconds

25 Experiment (cont.) Frequency Scaling for GPU Cores and Memory
avg. energy saving: 5.97% (29.2% excluding idle time) avg. CPU+GPU energy saving: 12.48%

26 Experiment (cont.) Workload Division between CPU and GPU
Initial division points are set randomly

27 Experiment (cont.) Using both workload division and frequency scaling
avg. energy saving: 21% avg. performance loss: 1.7% (longer execution time)

28 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

29 Conclusion A holistic energy management framework for CPU-GPU heterogeneous architectures Dynamically divides the workload and scales the frequencies Improves energy efficiency with only minor performance loss Achieves about 21% average energy saving

30 Thanks

