Cooperative Boosting: Needy versus Greedy Power Management INDRANI PAUL 1,2, SRILATHA MANNE 1, MANISH ARORA 1,3, W. LLOYD BIRCHER 1, SUDHAKAR YALAMANCHILI.

Presentation transcript:

Cooperative Boosting: Needy versus Greedy Power Management
INDRANI PAUL 1,2, SRILATHA MANNE 1, MANISH ARORA 1,3, W. LLOYD BIRCHER 1, SUDHAKAR YALAMANCHILI 2
JUNE 2013
1 Advanced Micro Devices, Inc.
2 Georgia Institute of Technology
3 University of California, San Diego

COOPERATIVE BOOSTING: NEEDY VERSUS GREEDY POWER MANAGEMENT | JUNE, 2013

GOAL & OUTLINE
• Goal:
  – Optimize performance under power and thermal constraints in heterogeneous architecture
• Outline:
  – State-of-the-Art Power and Thermal Management
  – Thermal Coupling
  – Performance Coupling
  – Cooperative Boosting
  – Results

STATE-OF-THE-ART POWER AND THERMAL MANAGEMENT

STATE-OF-THE-ART PROCESSOR
[Figure: accelerated processing unit (APU), labeled with the components below]
• Graphics processing unit (GPU): 384 AMD Radeon™ cores
• Multi-threaded CPU cores
• Shared Northbridge → access to overlapping CPU-GPU physical address spaces
• Many resources shared among CPU and GPU
  – For example, memory hierarchy, power, and thermal capacity

PROGRAMMING MODEL
• Coupled programming model
• Offload compute intensive tasks to the GPU
[Figure: User Application, OpenCL™ Software Stack, Operating System, and APU Hardware (CPU, GPU); Host Tasks and GPU Tasks. Each OpenCL kernel: grid of threads (N-Dimensional Range), each operating over a data partition]

WHAT IS THERMAL DESIGN POWER?
• Thermal design power: TDP
  – Upper bound for the sustainable power draw
  – Determines the cooling solution and package limits
  – Usually set by determining worst-case execution profile
• Performance depends on effective utilization of thermal headroom
[Figure: instructions/cycle versus time]

STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
• Chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, GPU TE
• Power management algorithm
  1. Calculate digital estimate of power consumption
  2. Convert power to temperature: RC network model for heat transfer
  3. Assign new power budgets to TEs based on temperature headroom
  4. TEs locally control (boost) their own DVFS states
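
The four BAPM steps can be sketched roughly as follows. This is a minimal illustration, not AMD's implementation: it assumes one lumped RC node per thermal entity (the slide mentions an RC network, which would also couple the entities), takes the digital power estimate as a plain input, and uses hypothetical names (`ThermalEntity`, `next_temperature`, `power_budget`, `pick_dvfs_state`) and made-up resistance, capacitance, and power values.

```python
from dataclasses import dataclass


@dataclass
class ThermalEntity:
    name: str
    temp_c: float  # current temperature estimate (degC)
    r_th: float    # thermal resistance to ambient (degC/W), assumed value
    c_th: float    # thermal capacitance (J/degC), assumed value


def next_temperature(te, power_w, ambient_c, dt_s):
    """Step 2: advance a single-node RC model by one explicit Euler step."""
    # dT/dt = (P - (T - T_ambient) / R) / C
    d_temp = (power_w - (te.temp_c - ambient_c) / te.r_th) / te.c_th
    return te.temp_c + d_temp * dt_s


def power_budget(te, ambient_c, max_temp_c, dt_s):
    """Step 3: largest power that keeps the entity at or below max_temp_c
    after one step (next_temperature solved for power)."""
    budget = (max_temp_c - te.temp_c) * te.c_th / dt_s \
        + (te.temp_c - ambient_c) / te.r_th
    return max(budget, 0.0)


def pick_dvfs_state(states, budget_w):
    """Step 4: highest-power state that fits the budget.
    `states` is a list of (name, watts) sorted from highest to lowest power."""
    for name, watts in states:
        if watts <= budget_w:
            return name
    return states[-1][0]  # nothing fits: fall back to the lowest state


# Hypothetical entity and state table, for illustration only.
cu0 = ThermalEntity("CU0", temp_c=60.0, r_th=2.0, c_th=5.0)
states = [("Pb0", 25.0), ("P0", 15.0), ("Pmin", 5.0)]
budget = power_budget(cu0, ambient_c=40.0, max_temp_c=60.2, dt_s=0.1)
chosen = pick_dvfs_state(states, budget)
```

Each entity chooses its state from its own budget only, which is the "local boost" behavior that the later slides contrast with Cooperative Boosting.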

CURRENT BOOST ALGORITHMS: POWER VS. THERMAL MANAGEMENT
• Convert thermal headroom to higher performance through boost
[Figure: APU die temperature and APU performance versus time, marking max die temp, thermal headroom, HW boost states, and SW-visible states]
CPU DVFS state:
  – HW only (boost): Pb0, Pb1
  – SW-visible: P0, P1, P, Pmin
GPU DVFS state:
  – HW only: High, Medium, Low

KEY TAKEAWAYS
• Power and thermals are shared resources in a heterogeneous processor → thermal coupling
• Overall application performance is a function of both the CPU and the GPU → performance coupling
• State of the practice: Managing to thermal limits by locally boosting when thermal headroom is available → utilize all of the headroom!

THERMAL COUPLING

THERMAL SIGNATURES: CPU & GPU
Steady-state thermal fields produced by BAPM on a 19W AMD Trinity APU
• High-power GPU benchmark
  – Sustained power: 19.7 W
• High-power CPU benchmark, idle GPU
  – Sustained power: 18.8 W
• Higher thermal density of CPUs
  – steeper thermal gradients
• Faster consumption of thermal headroom on the CPU

THERMAL TIME CONSTANT
• Significant rise in temperature of the idle component due to thermal coupling and pollution from the active components within a die
• CPU consumes thermal headroom more rapidly (4X faster)
• GPU can sustain higher power boosts longer
Idle GPU temperature rose by ~20 °C
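
To see why a shorter thermal time constant consumes headroom faster, consider a first-order RC node under constant power, whose temperature rise is P·R·(1 − e^(−t/τ)) with τ = R·C. The sketch below solves that expression for the time at which a given headroom is used up. The function name and every numeric value are hypothetical; only the 4X ratio between the two time constants is taken from the slide.

```python
import math


def time_to_consume_headroom(headroom_c, power_w, r_th, tau_s):
    """Seconds until a first-order RC node rises by `headroom_c` degC
    under constant power. Returns infinity if it settles below the limit."""
    steady_rise = power_w * r_th  # final temperature rise (degC)
    if steady_rise <= headroom_c:
        return math.inf
    # headroom = steady_rise * (1 - exp(-t / tau)), solved for t
    return -tau_s * math.log(1.0 - headroom_c / steady_rise)


# Same power, resistance, and headroom; the CPU time constant is
# assumed to be one quarter of the GPU's (illustrative values).
cpu_t = time_to_consume_headroom(10.0, 20.0, 1.0, tau_s=1.0)
gpu_t = time_to_consume_headroom(10.0, 20.0, 1.0, tau_s=4.0)
```

With everything else equal, the time to reach the limit scales linearly with τ, so in this simplified model the entity with the 4X longer time constant can hold a boost 4X longer.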

THERMAL COUPLING: THERMAL HEADROOM AVAILABILITY

THERMAL COUPLING: BOOST FOR CONSUMPTION OF THERMAL HEADROOM
6 °C rise in GPU temperature once CPU power limit was removed and both CUs were allowed to boost

THERMAL COUPLING: THERMAL THROTTLING
• Minimize detrimental effects of thermal coupling by capping maximum CPU P-state
  – P-state limiting

RESIDENCY IN DIFFERENT POWER STATES
[Figure: residency in different power states; surviving panel labels: BAPM, P2]
• Capping the max CPU DVFS state at P2
• Capping the max CPU DVFS state at P4

KEY TAKEAWAYS
• Thermal signatures differ between the CPU and the GPU
• Heterogeneity in physical properties
• High thermal density leads to faster consumption of thermal headroom in the CPU cores
• Significant thermal coupling from active to idle components
• Near the thermal limit, boosting based on available thermal headroom introduces inefficiencies
  – Reduce the CPU P-state limit

PERFORMANCE COUPLING

CPU-GPU PERFORMANCE COUPLING
• CPU should be just fast enough to keep the GPU fully utilized
• P-state should be high enough
[Figure: User Application, OpenCL™ Software Stack, Operating System, and APU Hardware (CPU, GPU); Host Tasks and GPU Tasks. Each OpenCL kernel: grid of threads (N-Dimensional Range), each operating over a data partition]

MANAGING THERMALS FOR PERFORMANCE-COUPLED APPLICATIONS

P-STATE SENSITIVITY

DETERMINING CRITICAL CPU P-STATE
• Find the inflection point in performance as a function of CPU P-state
  – critical P-state
• Critical P-state is determined by interference (CPU vs. GPU) in the memory system
[Figure: performance versus CPU P-state, marking the critical CPU P-state limit]
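
One simple way to locate such a point from profiling data is sketched below. It is an assumption about how the search could be done, not the method used in the talk: it treats the "inflection point" as the lowest-frequency P-state whose measured performance is still within a small tolerance of the best observed. The function name, the tolerance, and the sample measurements are all hypothetical.

```python
def critical_p_state(perf_by_state, tolerance=0.02):
    """Return the lowest-frequency P-state that is still within `tolerance`
    of the best performance.

    `perf_by_state` is a list of (p_state_name, performance) pairs ordered
    from the highest-frequency state to the lowest-frequency state.
    """
    best = max(perf for _, perf in perf_by_state)
    critical = perf_by_state[0][0]
    for name, perf in perf_by_state:
        if perf >= best * (1.0 - tolerance):
            critical = name  # later entries are lower-frequency states
    return critical


# Hypothetical normalized performance of a performance-coupled workload:
# the boost states lose a little to thermal coupling, the lowest state
# starves the GPU.
samples = [("Pb0", 0.97), ("P0", 0.99), ("P2", 1.00), ("P4", 0.995), ("P5", 0.80)]
limit = critical_p_state(samples)
```

Because this needs a performance sweep per workload, it corresponds to static P-state limiting, which requires profiling ahead of time.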

KEY TAKEAWAYS
• Performance coupling: CPU-GPU performance dependency
• Balance between the detrimental effects of thermal coupling and the needs of performance coupling
• CPU critical P-state limit is determined by performance coupling and thermal coupling
• GPU memory bandwidth gradients as a function of CPU frequency, along with CPU IPC, serve as a measure of performance coupling

COOPERATIVE BOOSTING ALGORITHM

COOPERATIVE BOOSTING (CB)
• Overlaid on top of BAPM
• Invoked periodically when thermal coupling is detrimental, i.e., when the thermal limit is approached
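
The transcript does not preserve the algorithm's flowchart, so the following is only a hypothetical sketch of what one periodic CB decision could look like, pieced together from the bullets in this deck: act only near the thermal limit, and use the GPU memory-bandwidth gradient with respect to CPU frequency as the coupling signal. The function name, the thresholds, and the convention that index 0 is the fastest P-state are assumptions. CPU IPC, which the deck also names as a signal, is left out for brevity.

```python
def next_cpu_cap(temp_c, max_temp_c, cap, num_states, bw_gradient,
                 margin_c=3.0, gradient_threshold=0.05):
    """Return the new CPU P-state cap index (0 = fastest state allowed).

    bw_gradient: measured change in GPU memory bandwidth per unit change
    in CPU frequency (hypothetical normalized units).
    """
    if temp_c < max_temp_c - margin_c:
        # Plenty of headroom: do not interfere, BAPM boosts as usual.
        return 0
    if bw_gradient > gradient_threshold:
        # GPU throughput tracks CPU speed: the CPU is "needy", relax the cap.
        return max(cap - 1, 0)
    # GPU does not benefit from a faster CPU: the CPU is "greedy",
    # tighten the cap so the GPU keeps the thermal headroom.
    return min(cap + 1, num_states - 1)
```

Calling this once per control interval moves the cap one state at a time, so the CPU only gets power back when the GPU measurably depends on it.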

EXPERIMENTAL SET-UP
• Trinity A8-4555M APU: 19W TDP
• CPU: Managed by HW or SW
  [Table: columns P-state, Voltage (V), Freq (MHz); HW Only (Boost) rows: Pb, Pb; SW-Visible rows: P, P, P, P, P]
• GPU: Managed by HW only
  – GPU-high: 423 MHz
  – GPU-med: 320 MHz
• Cooperative Boosting implemented as a system software policy overlaid on top of BAPM in real hardware

BENCHMARKS
BM (Description) | Problem Size | Type
NDL (Needleman-Wunsch) | 4096x4096 data points, 1K iterations | Performance-coupled
HS (HotSpot) | 1024x1024 data points, 100K iterations | Performance-coupled
BF (BoxFilter SAT) | 1Kx1K input image, 6x6 filter, 10K iterations | Performance-coupled
FAH (Folding at Home) | Synthesis of large protein: spectrin | Performance-coupled
BS (Binary Search) | 4096 inputs, 256 segments, 1M iterations | Performance-coupled
Viewdle (Haar facial recognition) | Image 1920x1080, 2K iterations | Performance-coupled
Lbm (CPU2006) | 4 threads, Ref input | CPU-centric
Gcc (CPU2006) | 4 threads, Ref input | CPU-centric

RESULTS

PERFORMANCE IMPROVEMENT WITH COOPERATIVE BOOSTING
• Static P-state limiting requires profiling and a priori information about the workload
• An average of 15% performance gain for performance-coupled applications with CB

POWER SAVINGS
• Average 10% power savings across performance-coupled applications
• 5 °C reduction in peak temperature for BS → large percentage of leakage power savings

ENERGY*DELAY^2
• Average 33% energy-delay^2 savings across performance-coupled applications
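
For reference, energy-delay^2 for a run is energy multiplied by the square of the runtime, which equals average power times runtime cubed. The sketch below shows the arithmetic with made-up inputs; the helper names are hypothetical and the numbers are not measurements from the talk. Per-benchmark savings computed this way are not expected to match what one would get by combining the average performance and power figures from the previous slides, because averages of ratios do not compose.

```python
def energy_delay_squared(avg_power_w, runtime_s):
    """ED^2 = energy * delay^2 = (P * t) * t^2."""
    energy_j = avg_power_w * runtime_s
    return energy_j * runtime_s ** 2


def relative_savings(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return 1.0 - improved / baseline


# Illustrative numbers only: a run that gets both faster and lower power.
base = energy_delay_squared(avg_power_w=20.0, runtime_s=10.0)
tuned = energy_delay_squared(avg_power_w=18.0, runtime_s=8.0)
saved = relative_savings(base, tuned)
```

The cubic dependence on runtime is why this metric rewards performance gains much more strongly than power savings.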

CONCLUSIONS
• Demonstrated effects of thermal and performance coupling on performance
  – Applications with high GPU compute-to-load ratio are more susceptible to detrimental effects of thermal coupling
  – Emergent balanced workloads with split CPU-GPU computation are tightly performance-coupled
• Proposed Cooperative Boosting (CB) technique to determine critical CPU P-state at which effects of thermal coupling are balanced with needs of performance coupling
  – Shifts power to CPU only when needed
• Demonstrated effectiveness of CB on real hardware as a well-rounded power and thermal management scheme

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

BACKUP

VIEWDLE PERFORMANCE ANALYSIS

BINARY SEARCH TEMPERATURE