Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors Euijin Kwon 1,2 Jae Young Jang 2 Jae W. Lee 2 Nam Sung Kim.

Presentation transcript:

Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors Euijin Kwon 1,2 Jae Young Jang 2 Jae W. Lee 2 Nam Sung Kim 2,

Single-chip heterogeneous processors 2
Compared to systems based on discrete components:
- Lower communication overhead
- Lower power consumption
- Lower cost (less silicon)
- Friendly to emerging applications (sequential + parallel processing)
Examples: AMD's Llano, Intel's Sandy Bridge, Samsung's Exynos
Sources: AMD, Intel, and Samsung

Challenges 3
Performance of a single-chip heterogeneous processor (SCHP): limited by power budget
- Total chip power budget
- CPU/GPU power budget
Multiprogrammed workload (MW)
- Workload-aware power allocation
- Considering program characteristics and evaluation metrics
How can we optimize overall performance within a limited power budget?

Outline 4
- Motivation
- Target platform: SCHP + MW
- Workload-aware power allocation
  - Characteristics of programs
  - Evaluation metrics
- Methodology
  - Power configuration
  - Benchmark programs
- Evaluation
- Algorithm
- Conclusion

Target platform: SCHP + MW 5
- 4-core CPU + 16-SM GPU
- Multiple V/F domains → DVFS
- 2 programs running
- Hardware resources evenly divided
[Figure: block diagram of the target platform. GPU0 (GPU0 V/F domain), GPU1 (GPU1 V/F domain), memory controllers (MCs V/F domain), CPU Core0 to Core3 (per-core CPU V/F domains); the multiprogrammed workload consists of Program 1 and Program 2.]

Workload-aware power allocation 6
Characteristics of programs
- Non-uniform performance sensitivities
Evaluation metrics
- Throughput vs. energy efficiency
[Figure: normalized throughput vs. power allocation (using the same HW); annotation: allocating more power to mri-q]

Methodology: shared power budget 8
- Can change the power budget for CPU 1, CPU 2, GPU 1, and GPU 2
- Total chip power budget = 100 W
- CPU power budget = 80 W
- GPU power budget = 64 W
Baseline configuration
- Evenly divided (25 W for each CPU/GPU group)
[Figure: power configuration and output for CPU 1, CPU 2, GPU 1, GPU 2]
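The budget limits on this slide can be read as a feasibility test on a four-way power split. The sketch below is only an illustration under stated assumptions: it treats the CPU and GPU budgets as caps on the sum of the two CPU groups and the two GPU groups, and the function names `is_feasible` and `feasible_configs` are hypothetical, not taken from the presentation.

```python
from itertools import product

# Limits quoted on the slide (Watts).
TOTAL_BUDGET_W = 100
CPU_BUDGET_W = 80
GPU_BUDGET_W = 64


def is_feasible(cpu1, cpu2, gpu1, gpu2):
    """Return True if the split respects all three budgets.

    Assumption: the CPU/GPU budgets cap the summed power of both
    CPU groups and both GPU groups, respectively.
    """
    cpu = cpu1 + cpu2
    gpu = gpu1 + gpu2
    return (cpu <= CPU_BUDGET_W
            and gpu <= GPU_BUDGET_W
            and cpu + gpu <= TOTAL_BUDGET_W)


def feasible_configs(levels):
    """Enumerate every (cpu1, cpu2, gpu1, gpu2) split drawn from `levels`."""
    return [c for c in product(levels, repeat=4) if is_feasible(*c)]


# Baseline from the slide: 25 W for each CPU/GPU group.
baseline = (25, 25, 25, 25)
```

Under this reading the baseline is feasible, while a split such as (45, 45, 5, 5) is rejected because the two CPU groups together exceed 80 W.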

Methodology: benchmark programs 9
- Used 6 benchmark programs
- Divided into 3 groups depending on characteristics

Benchmark                       Acronym  Source       Characteristics
Magnetic Resonance Imaging Q    MRQ      Parboil      Compute-bound
Stream Cluster                  SCL      Rodinia      Compute-bound
Hotspot                         HOT      Rodinia      Neutral
Sum of Absolute Difference      SAD      Parboil      Neutral
Stencil                         STN      Parboil      Memory-bound
Stream Copy                     SCP      CS Virginia  Memory-bound

Evaluation: case study 1 (compute- vs. memory-bound) 11
- 19% throughput improvement
- 32% energy efficiency improvement
- Allocating more power to the compute-bound program
- Optimal points vary depending on metrics

Evaluation: case study 2 (memory- vs. memory-bound) 12
- 10% throughput improvement
- 32% energy efficiency improvement
- Equally allocated power
Again, the optimal point depends on
- Evaluation metric
- Workload characteristics (compute- or memory-bound)

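Evaluation: variation of optimal configuration 13
The optimal configuration varies depending on the programs' characteristics and the evaluation metric.
[Table: optimal CPU and GPU power (Watt) for P1 and P2 under Metric 1 (throughput) and Metric 2 (energy efficiency); the numeric entries are not preserved in the transcript.]
Program pairs evaluated (C = compute-bound, M = memory-bound, N = neutral):

P1        P2
MRQ (C)   SCL (C)
SCP (M)   STN (M)
SAD (N)   HOT (N)
MRQ (C)   SCP (M)
SCL (C)   SCP (M)
HOT (N)   MRQ (C)
MRQ (C)   SAD (N)
SCL (C)   SAD (N)
HOT (N)   STN (M)
HOT (N)   SCP (M)
SAD (N)   SCP (M)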

Evaluation: performance improvement from optimal power allocation 14
Achieved significant improvement
- 12% for throughput
- 18% for energy efficiency

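Algorithm for throughput maximization 15
Flowchart, repeated at a regular interval:
- calculate(slope) for each program, giving sp1 and sp2
- If abs(sp1 - sp2) < threshold: alloc(equally)
- Otherwise, if sp1 > sp2: alloc(p1_more); else: alloc(p2_more)
- wait(regular_time), then recalculate the slopes
[Figure: normalized throughput vs. power allocation]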
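The slope comparison in this flowchart can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the finite-difference slope estimate, the fixed 1 W transfer between programs, the reading of alloc(equally) as "leave the current split unchanged", and the logarithmic throughput curves are all assumptions made here for the sake of a runnable example.

```python
import math


def slope(throughput, power, step=1.0):
    """Finite-difference estimate of d(throughput)/d(power) (assumed method)."""
    return (throughput(power + step) - throughput(power)) / step


def decide(sp1, sp2, threshold):
    """Mirror the flowchart's two decisions on the slopes sp1 and sp2."""
    if abs(sp1 - sp2) < threshold:
        return "equally"
    return "p1_more" if sp1 > sp2 else "p2_more"


def rebalance(tp1, tp2, p1, p2, threshold=0.001, delta=1.0, rounds=50):
    """Repeat the decision `rounds` times, moving `delta` W per round.

    Each loop iteration stands in for one wait(regular_time) period.
    """
    for _ in range(rounds):
        decision = decide(slope(tp1, p1), slope(tp2, p2), threshold)
        if decision == "p1_more":
            p1, p2 = p1 + delta, p2 - delta
        elif decision == "p2_more":
            p1, p2 = p1 - delta, p2 + delta
        # "equally": slopes are balanced, so keep the split (assumption).
    return p1, p2


# Synthetic curves: program 1 is more sensitive to power than program 2.
def compute_bound(p):
    return math.log(p)


def memory_bound(p):
    return 0.2 * math.log(p)


p1, p2 = rebalance(compute_bound, memory_bound, 25.0, 25.0)
```

Starting from an even split, the loop shifts power toward the program whose throughput responds more strongly and stops once the two slopes differ by less than the threshold; the total power stays constant throughout.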

Algorithm for energy efficiency maximization 16
Gradient search from the minimum power allocation:
- final = min_power
- MAX = max( EE(final), EE(final, p1++), EE(final, p2++) )
- If EE(final) == MAX: exit
- Otherwise, if EE(final, p1++) > EE(final, p2++): final = (final, p1++); else: final = (final, p2++)
- Repeat from the MAX computation
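The gradient search described on this slide can be sketched as a simple hill climb. The code below is a hedged illustration only: the `budget` cap (added so the loop always terminates), the 1 W step, and the synthetic quadratic `toy_ee` function are assumptions that do not appear in the presentation, and `ee_gradient_search` is a hypothetical name.

```python
def ee_gradient_search(ee, min_power, step=1.0, budget=100.0):
    """Climb from the minimum allocation until neither increment improves EE.

    ee        -- callable taking a (p1, p2) tuple and returning energy efficiency
    min_power -- starting (p1, p2) allocation
    budget    -- assumed cap on p1 + p2; increments beyond it are never taken
    """
    neg_inf = float("-inf")
    final = tuple(min_power)
    while True:
        p1, p2 = final
        up1, up2 = (p1 + step, p2), (p1, p2 + step)  # p1++ and p2++
        e0 = ee(final)
        e1 = ee(up1) if sum(up1) <= budget else neg_inf
        e2 = ee(up2) if sum(up2) <= budget else neg_inf
        if e0 == max(e0, e1, e2):
            return final  # EE(final) == MAX, so exit
        # Strict ">" as in the flowchart: ties go to p2++.
        final = up1 if e1 > e2 else up2


# Synthetic EE surface with a single peak at (12, 8); purely illustrative.
def toy_ee(alloc):
    p1, p2 = alloc
    return -(p1 - 12) ** 2 - (p2 - 8) ** 2


best = ee_gradient_search(toy_ee, (5, 5))
```

On a single-peaked surface like `toy_ee` the search reaches the peak; on a real workload it may stop at a local maximum, which is consistent with the presentation's claim of (near-)optimal results.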

Conclusion 17
We propose a solution for optimal power allocation
- Workload-aware power allocation
- Uses program characteristics and evaluation metrics
Significant performance improvement achieved
- 12% for throughput
- 18% for energy efficiency
Run-time algorithms effectively find (near-)optimal power allocation

Backup slides 18

Simulator 19
Integrated CPU + GPU simulator
- H. Wang, V. Sathish, R. Singh, M. Schulte and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT
- gem5 + GPGPU-Sim
Adaptive power allocation for multiprogrammed workload
- Per-core V/F domains for CPU
- 2 V/F domains for GPU