
1 GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures
41st International Conference on Parallel Processing (ICPP), 2012 Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996 Presented by Po-Ting Liu 2013/07/25

2 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

3 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

4 Introduction Popularity of GPU-CPU heterogeneous architectures
High computational throughput More efficient on SIMD operations Better energy efficiency For instance: Tianhe-1A delivers 2.5 PetaFlops at 4 MegaWatts, versus an estimated 12 MegaWatts for a CPU-only design NVIDIA. NVIDIA Tesla GPUs Power World's Fastest Supercomputer.

5 Introduction (cont.)
However, Tianhe-1A's electricity bill is about $2.7 million/year (roughly NT$81 million/year)

6 Introduction (cont.) GreenGPU
A holistic approach that improves energy efficiency with negligible performance loss Two-tier design First tier Dynamically divides the workload between CPU and GPU Second tier Dynamically scales the frequencies of CPU and GPU

7 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

8 Motivation Case study on workload division between CPU and GPU
Properly dividing the workload can reduce idle time and thus save energy *Benchmark: k-means

9 Motivation (cont.) Case study on frequency scaling for GPU memory
Properly scaling down an under-utilized component can save energy with negligible performance impact nbody: core-bounded, computation-intensive streamcluster (SC): memory-bounded, memory-intensive [Figures a and b: energy and performance under memory frequency scaling]

10 Motivation (cont.) Case study on frequency scaling for GPU core
For each component there may be a frequency level that is most suitable nbody: core-bounded, computation-intensive streamcluster (SC): memory-bounded, memory-intensive [Figures a and b: energy and performance under core frequency scaling]

11 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

12 System Design and Algorithms
[Framework diagram: First tier (workload division) and Second tier (frequency scaling)]

13 System Design and Algorithms (cont.)
First tier - Workload division - Overview Dynamically divides the workload between CPU and GPU based on their execution times Conducted every iteration, with a fixed amount of work per iteration An iteration is delimited by a reduction point or common barrier point

14 System Design and Algorithms (cont.)
First tier - Workload division - Example The CPU handles workload fraction r with execution time t_c; the GPU handles fraction 1−r with execution time t_g Assume each adjustment step is 5% t_c > t_g: use r−5% for the next iteration t_c < t_g: use r+5% for the next iteration
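The per-iteration adjustment above can be sketched as a small Python function (a hedged illustration; the names r, t_c, t_g and the clamping are assumptions, not the paper's code):

```python
STEP = 0.05  # 5% of the total workload shifted per iteration

def adjust_division(r, t_c, t_g):
    """Shift workload toward the device that finished first.

    r   -- fraction of the workload currently assigned to the CPU
    t_c -- CPU execution time of the last iteration
    t_g -- GPU execution time of the last iteration
    """
    if t_c > t_g:                 # CPU is the straggler: give 5% to the GPU
        return max(0.0, r - STEP)
    if t_c < t_g:                 # GPU is the straggler: give 5% to the CPU
        return min(1.0, r + STEP)
    return r                      # balanced: keep the current division

# CPU finished earlier, so it takes on 5% more of the workload.
print(round(adjust_division(0.10, t_c=1.2, t_g=1.5), 2))  # -> 0.15
```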

15 System Design and Algorithms (cont.)
First tier - Workload division - Avoiding oscillation Oscillation example Optimal division point: 12/88 (CPU/GPU) Division oscillates between 10/90 and 15/85 (CPU/GPU) Solution Linearly scale the previous iteration's execution times to the candidate workload to predict the next iteration's execution times Example At 10/90 (CPU/GPU) with t_c < t_g, the rule would move 5% of the workload from GPU to CPU, giving 15/85 for the next iteration Predict t_c' = (15/10) × t_c and t_g' = (85/90) × t_g If t_c' > t_g', keep the current 10/90 (CPU/GPU) division for the next iteration
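The oscillation guard can be sketched like this (illustrative only; the slide describes the idea, the code structure is an assumption):

```python
def should_shift(r, t_c, t_g, step=0.05):
    """Predict whether moving `step` of the work from GPU to CPU helps.

    Linearly scales last iteration's times (t_c, t_g) to the candidate
    division r + step, assuming time is proportional to workload share.
    """
    r_new = r + step
    t_c_pred = (r_new / r) * t_c               # e.g. (15/10) * t_c
    t_g_pred = ((1 - r_new) / (1 - r)) * t_g   # e.g. (85/90) * t_g
    # Shift only if the CPU is predicted to still finish first.
    return t_c_pred < t_g_pred

# Slide example: at 10/90 the CPU is faster (t_c < t_g), but scaling to
# 15/85 predicts t_c' > t_g', so the 10/90 division is kept.
print(should_shift(0.10, t_c=1.0, t_g=1.3))  # -> False
```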

16 System Design and Algorithms (cont.)
Second tier - CPU frequency scaling - Strategy On-demand: the default Linux power-saving strategy Start at the lowest frequency level (25MHz) When utilization rises above the threshold (≥60%), jump to the peak frequency (100MHz) When utilization falls below the threshold (<60%), scale the frequency down step by step: 75MHz → 50MHz → 25MHz
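One decision step of the on-demand strategy above might look like the following sketch (the 60% threshold and frequency levels come from the slide; the function and its structure are assumptions):

```python
LEVELS = [25, 50, 75, 100]  # frequency settings from the slide
THRESHOLD = 60              # utilization threshold (%)

def next_frequency(current, utilization):
    """One decision step of the on-demand strategy."""
    if utilization >= THRESHOLD:
        return LEVELS[-1]             # busy: jump straight to the peak
    i = LEVELS.index(current)
    return LEVELS[max(0, i - 1)]      # under-utilized: step down one level

print(next_frequency(25, 90))   # -> 100
print(next_frequency(100, 30))  # -> 75
print(next_frequency(75, 30))   # -> 50
```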

17 System Design and Algorithms (cont.)
Second tier - GPU Frequency scaling - Pseudo code

18 System Design and Algorithms (cont.)
Second tier - GPU frequency scaling - Loss factor As Loss↑, weight↓ 0 ≤ l_i(t) ≤ 1, where t is the interval index and i is the frequency level, 1 ≤ i ≤ N N: number of available frequency levels u: current utilization (%) u_mean[i]: most suitable utilization for frequency level i α: weight between energy and performance

19 System Design and Algorithms (cont.)
Second tier - GPU frequency scaling - Equations
Core loss factor: l_c_i(t) = α_c × l_c_ie(t) + (1 − α_c) × l_c_ip(t)
Memory loss factor: l_m_j(t) = α_m × l_m_je(t) + (1 − α_m) × l_m_jp(t)
Total loss: TotalLoss_ij(t) = φ × l_c_i(t) + (1 − φ) × l_m_j(t)
Weight update: weight_ij(t+1) = weight_ij(t) × (1 − (1 − β) × TotalLoss_ij(t))
φ: weight between core and memory β: weight between total loss and history weight
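One step of the weight update can be sketched directly from the equations above (all parameter values below are illustrative; only the formulas come from the slide):

```python
def loss(l_energy, l_perf, alpha):
    """Weighted energy/performance loss for one frequency level."""
    return alpha * l_energy + (1 - alpha) * l_perf

def update_weight(w, l_core, l_mem, phi, beta):
    """One weight-update step for a (core, memory) frequency pair."""
    total = phi * l_core + (1 - phi) * l_mem  # TotalLoss_ij(t)
    return w * (1 - (1 - beta) * total)       # larger loss -> smaller weight

# Illustrative step: equal energy/performance and core/memory weighting.
l_c = loss(0.1, 0.3, alpha=0.5)   # core loss factor l_c_i(t) = 0.2
l_m = loss(0.3, 0.5, alpha=0.5)   # memory loss factor l_m_j(t) = 0.4
w = update_weight(1.0, l_c, l_m, phi=0.5, beta=0.5)
print(round(w, 4))  # -> 0.85
```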

20 System Design and Algorithms (cont.)
Problem: the two tiers affect each other Solution Decouple the first tier and the second tier Configure the first tier's period to be much longer than the second tier's, since the first tier's overhead is much higher

21 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

22 Experiment Experimental environment CPU: AMD Phenom II X2
GPU: NVIDIA 8800GTX Two power supplies and two power meters: one for CPU, disk, main memory, etc.; one for GPU OS: Ubuntu 10.04

23 Experiment (cont.) Benchmark From Rodinia and NVIDIA SDK

24 Experiment (cont.) Frequency Scaling for GPU Cores and Memory
Benchmark: streamcluster (memory-bounded) Peak frequency of core: 576 MHz Peak frequency of memory: 900 MHz Scaling interval: 3 seconds

25 Experiment (cont.) Frequency Scaling for GPU Cores and Memory
avg. energy saving: 5.97% (29.2% excluding idle time) avg. CPU+GPU energy saving: 12.48%

26 Experiment (cont.) Workload Division between CPU and GPU
Initial division points are set randomly

27 Experiment (cont.) Using both workload division and frequency scaling
avg. energy saving: 21% avg. performance loss: 1.7% (longer execution time)

28 Outline Introduction Motivation System Design and Algorithms
Experiment Conclusion

29 Conclusion A holistic energy management framework for CPU-GPU heterogeneous architectures Dynamically divides the workload and scales the frequencies Improves energy efficiency with only minor performance loss Achieves about 21% average energy saving

30 Thanks

