Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors
Euijin Kwon 1,2 Jae Young Jang 2 Jae W. Lee 2 Nam Sung Kim 2,
Single-chip heterogeneous processors 2
Compared to systems based on discrete components
-Lower communication overhead
-Lower power consumption
-Lower cost (less silicon)
-Emerging application friendly (sequential + parallel processing)
Examples: AMD’s Llano, Intel’s Sandy Bridge, Samsung’s Exynos
Sources: AMD, Intel, and Samsung
Challenges 3
Performance of a single-chip heterogeneous processor (SCHP): limited by power budget
-Total chip power budget
-CPU/GPU power budget
Multiprogrammed workload
-Workload-aware power allocation
-Considering characteristics and metrics
How can we optimize overall performance within a limited power budget?
Outline 4
Motivation
Target platform: SCHP + MW
Workload-aware power allocation
-Characteristics of programs
-Evaluation Metrics
Methodology
-Power configuration
-Benchmark programs
Evaluation
Algorithm
Conclusion
Target platform: SCHP + MW 5
4-core CPU + 16-SM GPU
Multiple V/F domains
DVFS
2 programs running
Hardware resources evenly divided
[Figure: block diagram of the platform. The multiprogrammed workload consists of Program 1 and Program 2. CPU Core0 to Core3 have per-core CPU V/F domains, GPU0 and GPU1 each have their own V/F domain, and the memory controllers (MCs) have a separate V/F domain.]
Workload-aware power allocation 6
Characteristics of programs
-Non-uniform performance sensitivities
Evaluation metrics
-Throughput vs. Energy efficiency
[Figure: normalized throughput vs. power allocation (using the same HW); annotation: allocating more power to mri-q]
Methodology: shared power budget 8
Can change the power budget for CPU 1, CPU 2, GPU 1, and GPU 2
[Figure: Power Configuration feeding into Output]
Total chip power budget = 100 W
CPU power budget = 80 W
GPU power budget = 64 W
Baseline configuration
-Evenly divided (25 W for each CPU/GPU group)
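The three budgets above can be read as constraints on a four-way power configuration. The following is a minimal sketch of that reading, not the authors' tooling: it assumes the CPU budget caps CPU 1 + CPU 2, the GPU budget caps GPU 1 + GPU 2, and the total budget caps the sum of all four. The 5 W step and 5 W per-group floor are hypothetical values chosen only to keep the enumeration small.

```python
from itertools import product

TOTAL_BUDGET_W = 100  # total chip power budget (from the slide)
CPU_BUDGET_W = 80     # assumed to cap CPU 1 + CPU 2
GPU_BUDGET_W = 64     # assumed to cap GPU 1 + GPU 2
STEP_W = 5            # hypothetical search granularity
MIN_W = 5             # hypothetical per-group floor

# Baseline from the slide: 25 W for each CPU/GPU group.
# Tuple order (an assumption): (cpu1, gpu1, cpu2, gpu2).
BASELINE = (25, 25, 25, 25)


def is_feasible(cpu1, gpu1, cpu2, gpu2):
    """Check one configuration against all three budgets."""
    return (
        cpu1 + cpu2 <= CPU_BUDGET_W
        and gpu1 + gpu2 <= GPU_BUDGET_W
        and cpu1 + gpu1 + cpu2 + gpu2 <= TOTAL_BUDGET_W
    )


def feasible_configs(step=STEP_W, min_w=MIN_W):
    """Enumerate every configuration on the grid that fits the budgets."""
    levels = range(min_w, TOTAL_BUDGET_W + 1, step)
    return [cfg for cfg in product(levels, repeat=4) if is_feasible(*cfg)]


if __name__ == "__main__":
    configs = feasible_configs()
    print(len(configs), "feasible configurations;",
          "baseline feasible:", BASELINE in configs)
```

Under these assumptions the baseline is one point in the feasible set, and any search for a better allocation only has to consider configurations that pass `is_feasible`.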
Methodology: benchmark programs 9
Used 6 benchmark programs.
Divided into 3 groups depending on characteristics

Benchmark                     Acronym  Source       Characteristics
Magnetic Resonance Imaging Q  MRQ      Parboil      Compute-bound
Stream Cluster                SCL      Rodinia      Compute-bound
Hotspot                       HOT      Rodinia      Neutral
Sum of Absolute Difference    SAD      Parboil      Neutral
Stencil                       STN      Parboil      Memory-bound
Stream Copy                   SCP      CS Virginia  Memory-bound
Evaluation: case study 1 (compute- vs. memory-bound) 11
-19% throughput improvement
-32% energy efficiency improvement
Allocating more power to compute-bound
Optimal points vary depending on metrics.
Evaluation: case study 2 (memory- vs. memory-bound) 12
-10% throughput improvement
-32% energy efficiency improvement
Equally allocated power
Again, optimal point depends on
-Evaluation metric
-Workload characteristics (compute- or memory-bound)
Evaluation: variation of optimal configuration 13
Depending on programs’ characteristics and evaluation metrics
[Table: optimal CPU and GPU power (Watt) for P1 and P2 under Metric 1 (throughput) and Metric 2 (energy efficiency); the numeric entries are not preserved in this text. Program pairs listed, with (C) = compute-bound, (M) = memory-bound, (N) = neutral:]

P1       P2
MRQ (C)  SCL (C)
SCP (M)  STN (M)
SAD (N)  HOT (N)
MRQ (C)  SCP (M)
SCL (C)  SCP (M)
HOT (N)  MRQ (C)
MRQ (C)  SAD (N)
SCL (C)  SAD (N)
HOT (N)  STN (M)
HOT (N)  SCP (M)
SAD (N)  SCP (M)
Evaluation: performance improvement from optimal power allocation 14
Achieved significant improvement
-12% for throughput
-18% for energy efficiency
Algorithm for throughput maximization 15
Flowchart steps
-calculate(slope)
-Decision: abs(sp1 - sp2) < threshold (YES/NO)
-Decision: sp1 > sp2 (YES/NO)
-Actions: alloc(p1_more), alloc(p2_more), alloc(equally)
-wait(regular_time), then repeat
[Figure: normalized throughput vs. power allocation]
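One plausible reading of this flowchart is sketched below in Python. The branch mapping is an assumption, since the slide's YES/NO arrows are not recoverable here: similar slopes lead to an equal split, otherwise the program with the steeper throughput-vs-power slope receives more power. The names `allocate`, `control_loop`, `bias_w`, `measure_slopes`, and `apply_allocation` are hypothetical, and the threshold and bias values are placeholders.

```python
import time


def allocate(total_w, sp1, sp2, threshold=0.05, bias_w=10.0):
    """Return (p1_watts, p2_watts) for one decision of the flowchart.

    sp1 and sp2 are the measured slopes of normalized throughput with
    respect to power for program 1 and program 2.
    """
    half = total_w / 2
    if abs(sp1 - sp2) < threshold:
        return half, half                    # alloc(equally)
    if sp1 > sp2:
        return half + bias_w, half - bias_w  # alloc(p1_more), assumed branch
    return half - bias_w, half + bias_w      # alloc(p2_more), assumed branch


def control_loop(measure_slopes, apply_allocation, total_w,
                 period_s=1.0, iterations=3):
    """Repeat: calculate(slope), allocate, wait(regular_time)."""
    for _ in range(iterations):
        sp1, sp2 = measure_slopes()          # hypothetical callback
        apply_allocation(*allocate(total_w, sp1, sp2))
        time.sleep(period_s)


if __name__ == "__main__":
    # Program 1 is far more sensitive to power than program 2.
    print(allocate(100, sp1=0.9, sp2=0.1))
```

How the slopes are measured (for example, by briefly perturbing the V/F setting) is not stated on the slide, so the sketch leaves it to a caller-supplied function.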
Algorithm for energy efficiency maximization 16
Gradient search from the minimum power allocation
-final = min_power
-MAX = max( EE(final), EE(final, p1++), EE(final, p2++) )
-If EE(final) == MAX: exit
-Else if EE(final, p1++) > EE(final, p2++): final = (final, p1++)
-Else: final = (final, p2++)
-Repeat from the MAX computation
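The gradient search can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: `ee` is a caller-supplied function returning the energy efficiency of an allocation `(p1, p2)`, the `budget` guard and the 1 W step are additions not shown on the slide, and the quadratic `toy_ee` exists only to exercise the loop.

```python
def ee_gradient_search(ee, min_power, budget, step=1.0):
    """Greedy ascent on energy efficiency from the minimum allocation."""
    p1 = p2 = min_power                      # final = min_power
    while p1 + p2 + step <= budget:          # assumed guard: stay in budget
        here = ee(p1, p2)                    # EE(final)
        up1 = ee(p1 + step, p2)              # EE(final, p1++)
        up2 = ee(p1, p2 + step)              # EE(final, p2++)
        if here >= max(up1, up2):            # EE(final) == MAX, so exit
            break
        if up1 > up2:
            p1 += step                       # final = (final, p1++)
        else:
            p2 += step                       # final = (final, p2++)
    return p1, p2


def toy_ee(p1, p2):
    """Hypothetical efficiency surface with a single peak at (12, 8)."""
    return -((p1 - 12) ** 2) - ((p2 - 8) ** 2)


if __name__ == "__main__":
    print(ee_gradient_search(toy_ee, min_power=5, budget=100))
```

Because each accepted move strictly improves `ee`, the loop stops at the first allocation where neither single-step increase helps; on a surface with several peaks this is a local optimum, which is consistent with the slides' "(near-)optimal" wording.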
Conclusion 17
We propose a solution for optimal power allocation
-Workload-aware power allocation
-By using program characteristics and evaluation metrics
Significant performance improvement achieved
-12% for throughput
-18% for energy efficiency
Run-time algorithms effectively find (near-)optimal power allocation
Backup slides 18
Simulator 19
Integrated CPU + GPU simulator
-H. Wang, V. Sathish, R. Singh, M. Schulte and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT,
-gem5 + GPGPU-Sim
Adaptive power allocation for multiprogrammed workload
-Per-core V/F domains for CPU
-2 V/F domains for GPU