Measuring and Modeling On-Chip Interconnect Power on Real Hardware

Presentation transcript:

Measuring and Modeling On-Chip Interconnect Power on Real Hardware
Vignesh Adhinarayanan, Indrani Paul, Joseph L. Greathouse, Wei Huang, Ashutosh Pattnaik, Wu-chun Feng

Power and Energy are First-Class Design Constraints
Power and energy have become first-class design constraints in all areas of computing:
- Phones and laptops must optimize for energy because they are battery operated.
- Desktops require loud cooling solutions.
- Supercomputers and other large data centers are constrained by power.
(Image credits: Glenn Batuyong, CC BY 2.0; modified from an image by Vernon Chan, CC BY 2.0; Patrick Strandberg, CC BY-SA 2.0.)

Data Movement: A Major Source of Power Consumption
- Data movement through the memory hierarchy is thought to be a major source of power consumption.
- Measurements on real systems are lacking: current estimates are based on simulation studies, and real-world measurements are coarse-grained and do not give a breakdown for data movement.
Based on: Shalf et al., "Exascale Computing Technology Challenges," VECPAR 2010.

This Talk in One Sentence A new methodology to measure data-movement power on real hardware

State-of-the-art Measurement Approach
Two microbenchmarks are run: an L1 microbenchmark that moves data between registers and L1, and an L2 microbenchmark that moves data between registers and L2.
- E_L2 = energy consumed by the L2 microbenchmark
- E_L1 = energy consumed by the L1 microbenchmark
- Energy cost of moving data from L2 to L1 = E_L2 - E_L1
Issue: this over-estimates data-movement energy.
G. Kestor et al., "Quantifying the Energy Cost of Data Movement in Scientific Applications," IISWC 2013.

State-of-the-art Measurement Approach: Limitation
The same microbenchmarks from the previous slide, redrawn to expose the shortcoming of the related work: the L2 microbenchmark's data path is compute, L1, interconnect, L2, while the L1 microbenchmark's path stops at L1.
Issue: the "data-movement" power obtained by subtraction also includes L2-access power.
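Stated as an equation (our restatement of the slide's point, using the E terms defined above):

```latex
E_{L2} - E_{L1} = E_{\text{interconnect}} + E_{\text{L2 access}}
```

so the subtraction over-estimates interconnect energy by the cost of the L2 array access itself.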

Our Proposed Approach

A Bit of Background …
Representative block diagram of the AMD FirePro™ W9100 GPU (previously code-named "Hawaii").
The distance between the shader engines and the L2 banks differs.

A Bit of Background …
Representative block diagram of the AMD FirePro™ W9100 GPU (previously code-named "Hawaii").
The physical distance traversed by data is thought to affect data-movement power.

Our Proposed Approach
(Figure: the long-path and short-path microbenchmarks mapped onto the compute units of Shader Engines I and II and the L2 cache banks.)
The difference in distance causes the interconnect activity to differ by a factor of 2, which helps isolate the interconnect power.
Design microbenchmarks based on data-movement distance to properly isolate the interconnect.

Challenges

Challenges
- OpenCL™ lacks native support to pin threads to programmer-specified compute units.
- The temperature of the device during tests will affect the power consumption.
- The power difference between the two sets of microbenchmarks can be hard to observe.

Challenges
- There is only a small difference between the used L2 capacity and the L1 cache size, which can make it challenging to write L2-only microbenchmarks.
- Ensuring that the same amount of work is done by the two microbenchmarks can be challenging due to NUCA (non-uniform cache access) effects.

Addressing the Challenges: first, OpenCL™'s lack of native support for pinning threads to programmer-specified compute units.

Addressing the Challenges: Pinning Threads to Cores with OpenCL™
This can be accomplished with some binary hacking.

Addressing the Challenges: Pinning Threads to Cores with OpenCL™
First, we write an OpenCL snippet that does useful work (here, reading from L2) only when the workgroup ID falls within a certain range. The workgroup ID can be obtained through OpenCL, but it really serves only as a placeholder for the compute-unit ID, since both are scalar values.
1. Use the workgroup ID as a placeholder for the CU ID.
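A minimal sketch of what such a kernel could look like (our own construction in OpenCL C; the deck's actual kernel is not in the transcript, and the buffer size, stride, and argument names are assumptions):

```c
// Do L2-read work only for workgroups in [lo, hi); the group ID is a
// stand-in that the later binary patch replaces with the hardware CU ID.
#define BUF_WORDS (1u << 17)   /* 512 KB of uints: larger than a 16 KB L1,
                                  small enough to stay resident in L2 */

__kernel void l2_read(__global const uint *src, __global uint *dst,
                      uint lo, uint hi, uint iters)
{
    uint gid = (uint)get_group_id(0);     /* placeholder for the CU ID */
    if (gid < lo || gid >= hi)
        return;                           /* unselected units stay idle */

    uint acc = 0;
    uint idx = (uint)get_local_id(0);
    for (uint i = 0; i < iters; ++i) {
        acc += src[idx];                  /* reads that thrash L1, hit L2 */
        idx = (idx + 4097u) & (BUF_WORDS - 1u);  /* odd stride visits all words */
    }
    dst[get_global_id(0)] = acc;          /* keep the loads observable */
}
```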

Addressing the Challenges: Pinning Threads to Cores with OpenCL™
Second, we disassemble the code and, in the assembly, manually identify the instruction responsible for writing the workgroup ID to the register that gets checked in the conditional. Its equivalent binary encoding is noted.
2. Identify the instruction that writes the workgroup ID to the register checked in the conditional.

Addressing the Challenges: Pinning Threads to Cores with OpenCL™
Then we open the binary in a hex editor and locate the instruction encoding we just noted.
3. Locate that instruction's encoding in the binary.
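A sketch of what steps 2-4 amount to mechanically, written as a tiny C patcher. The 32-bit encodings below are placeholders, not real Hawaii opcodes; the real values come from the disassembly and AMD's ISA manual, and a real tool should also verify the surrounding bytes before patching:

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholder encodings (hypothetical values, for illustration only): */
#define OLD_INSN 0x7E000208u  /* "write workgroup ID to the checked register" */
#define NEW_INSN 0xB8D3F10Bu  /* "read CU ID from the read-only HW register"  */

int patch_kernel_binary(const char *path)
{
    FILE *f = fopen(path, "r+b");
    if (!f) return -1;

    uint32_t word;
    long off = 0;
    while (fread(&word, sizeof word, 1, f) == 1) {
        if (word == OLD_INSN) {
            word = NEW_INSN;
            fseek(f, off, SEEK_SET);   /* seek required between read and write */
            fwrite(&word, sizeof word, 1, f);
            fseek(f, off + (long)sizeof word, SEEK_SET);  /* resume scanning */
        }
        off += (long)sizeof word;
    }
    fclose(f);
    return 0;
}
```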

Addressing the Challenges: Pinning Threads to Cores with OpenCL™
The binary is then modified so that the register checked in the conditional holds the CU ID instead of the workgroup ID. While the CU ID cannot be read through OpenCL, AMD hardware exposes it in a read-only hardware register, and it can be read by forming the right instruction (its binary encoding is shown in green on the slide). The instruction can be constructed from AMD's ISA manual for the Hawaii architecture.
4. Replace the workgroup ID with the CU ID (read from a read-only HW register) in the correct register in the binary.

Addressing the Challenges: Pinning Threads to Cores with OpenCL™
Once the binary is modified, the executable is functionally equivalent to the OpenCL snippet at the bottom right of the slide: it performs L2 reads only in shader engine 1. With this method, the first challenge is overcome.
5. Obtain a functionally equivalent OpenCL snippet.
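The snippet itself is an image the transcript does not preserve; here is a hedged reconstruction of what the patched binary behaves like (the CU_ID() placeholder and the assumption that CU IDs 0-10 map to shader engine 1 are ours):

```c
// Standard OpenCL C has no way to spell "CU ID", so a placeholder macro
// keeps this compilable; the binary patch substitutes the HW-register read.
#define CU_ID() ((uint)get_group_id(0))

__kernel void l2_read_se1(__global const uint *src, __global uint *dst)
{
    if (CU_ID() < 11u) {   /* Hawaii: 44 CUs, 4 shader engines, 11 CUs each */
        dst[get_global_id(0)] = src[get_global_id(0)];  /* L2 reads on SE 1 only */
    }
}
```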

Addressing the Challenges: next, the effect of device temperature on measured power.

Addressing the Challenges: Eliminating Temperature Effects
Solution 1: run the GPU fans at very high speed to limit the temperature difference between runs.

Addressing the Challenges: Eliminating Temperature Effects
Solution 2: model idle power separately and subtract it from the measured power.
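A sketch of solution 2, assuming a simple temperature-dependent idle-power model (the exponential leakage form and the fitted constants a, b, c are illustrative assumptions, not the paper's model):

```c
#include <math.h>

/* Idle power: a constant baseline plus a leakage term that grows with
   temperature; a, b, c would be fit from idle measurements taken across
   a range of device temperatures. */
static double idle_power_w(double temp_c, double a, double b, double c)
{
    return a + b * exp(c * temp_c);
}

/* Dynamic power of a microbenchmark run, with the temperature-dependent
   idle component removed. */
static double dynamic_power_w(double measured_w, double temp_c,
                              double a, double b, double c)
{
    return measured_w - idle_power_w(temp_c, a, b, c);
}
```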

Addressing the Challenges: finally, the difficulty of observing the power difference between the two sets of microbenchmarks.

Addressing the Challenges: Saturating Interconnect Bandwidth
- The power difference between the microbenchmarks can be too low to observe reliably unless the amount of data going through the interconnect increases.
- It is difficult to saturate interconnect bandwidth without increasing the number of wavefronts:
  - Fewer wavefronts can result in stalling, exposing latency-difference issues.
  - More wavefronts lead to more L1 hits when accesses per thread are low, or to register pressure and memory spills when accesses per thread are high.
- Solution: modify the firmware to artificially shrink the L1 cache and then increase the number of wavefronts.
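A back-of-the-envelope sizing check shows why the wavefront count is hard to balance (the cache sizes are Hawaii-class figures; the exact values the paper uses may differ):

```c
#include <stdbool.h>

#define L1_BYTES   (16u * 1024u)     /* per-CU L1 on a Hawaii-class GPU */
#define L2_BYTES   (1024u * 1024u)   /* shared L2 on a Hawaii-class GPU */
#define LINE_BYTES 64u

/* The per-CU working set must overflow L1 (so reads reach L2) while the
   combined set of all active CUs still fits in L2. With many CUs active,
   the window between cus * L1_BYTES and L2_BYTES is narrow, which is why
   shrinking L1 in firmware widens it. */
static bool misses_l1_fits_l2(unsigned cus, unsigned wavefronts_per_cu,
                              unsigned lines_per_wavefront)
{
    unsigned per_cu = wavefronts_per_cu * lines_per_wavefront * LINE_BYTES;
    return per_cu > L1_BYTES && (per_cu * cus) <= L2_BYTES;
}
```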

Challenges (additional details in the paper)
- There is only a small difference between the used L2 capacity and the L1 cache size, which can make it challenging to write L2-only microbenchmarks.
- Ensuring that the same amount of work is done by the two microbenchmarks can be challenging due to NUCA effects.

Characterization Studies

Parameters Affecting Interconnect Power
- Average interconnect distance
- Data toggle rate
- Bandwidth observed on the interconnect
- The interconnect's voltage and frequency

Interconnect Power vs Distance Our experiments confirm that the distance traversed by data affects power consumption

Interconnect Power vs Distance Within 15% of industrial estimates for energy/bit/mm

Interconnect Power vs Distance Linear relationship observed between data-movement distance and interconnect power

Interconnect Power vs Toggle Rate
(Data patterns tested: 11111111, 00000000, 10101010, 0x0x0x0x, xxxxxxxx.)
Linear relationship observed between toggle rate and interconnect power.
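Toggle rate here is the fraction of bit positions that flip between consecutive words on the interconnect; a small helper of our own making computes it for a data stream (__builtin_popcount is a GCC/Clang builtin):

```c
#include <stddef.h>
#include <stdint.h>

/* Fraction of the 32 bit positions that toggle between consecutive words.
   All-0s or all-1s streams give 0.0; random data averages about 0.5, the
   50% toggle rate at which the 110 fJ/bit/mm figure is quoted later. */
static double toggle_rate(const uint32_t *words, size_t n)
{
    if (n < 2) return 0.0;
    uint64_t flips = 0;
    for (size_t i = 1; i < n; ++i)
        flips += (uint64_t)__builtin_popcount(words[i] ^ words[i - 1]);
    return (double)flips / (32.0 * (double)(n - 1));
}
```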

Interconnect Power vs Toggle Rate
Approximately 10% of interconnect power goes toward arbitration and similar overheads.

Interconnect Power vs Toggle Rate
Transmitting 0s is slightly more expensive than transmitting 1s.

Interconnect Power vs Voltage and Frequency
The impact of voltage and frequency is as expected: power scales with frequency and roughly with the square of voltage.

Putting it all together
Interconnect power = energy/bit/mm × avg. distance × avg. bits/sec × (scaled voltage)² × scaled frequency × avg. toggle rate
For real applications:
- Avg. bits per second is calculated from performance counters (e.g., L1 and L2 accesses).
- The other parameters are pre-computed or pre-measured (e.g., distance is measured from the layout; energy/bit/mm is computed to be 110 fJ/bit/mm).
- The 110 fJ/bit/mm figure is for random data, i.e., a 50% toggle rate; the per-transition cost would be 220 fJ/bit/mm.
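The model translates directly into code; here is a sketch using the slide's terms (normalizing the toggle rate to the 50% calibration point is our reading of the 110 fJ/bit/mm note, and the baseline-relative v_scale/f_scale parameterization is ours):

```c
/* Interconnect power model from the slide:
     P = E/bit/mm * avg distance * avg bits/sec
         * (V/Vbase)^2 * (f/fbase) * (toggle rate / 0.5)                */
static double interconnect_power_w(double avg_dist_mm,
                                   double avg_bits_per_sec,
                                   double v_scale,      /* V / Vbase */
                                   double f_scale,      /* f / fbase */
                                   double toggle)       /* 0.0 .. 1.0 */
{
    const double energy_j_per_bit_mm = 110e-15;  /* at 50% toggle rate */
    return energy_j_per_bit_mm * avg_dist_mm * avg_bits_per_sec
           * v_scale * v_scale * f_scale * (toggle / 0.5);
}
```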

Evaluation of Applications

Energy Cost of Data Movement on Real Applications
Up to 14% of dynamic power is spent on interconnects in today's chips.

Energy Cost of Data Movement on Real Applications
Even non-memory-bound applications can show high interconnect power (but at a different level of the hierarchy).

Energy Cost of Data Movement on Real Applications
The data-movement power problem is exacerbated in future technology nodes.

Optimizations

Interconnect Power Optimizations
Layout-based optimizations:
- We explore how much impact layout-based optimization can have on real-world applications.
- We examine two layouts: one reduces the distance between L1 and L2, and the other reduces the distance between L2 and the memory controller.
Cache resizing (details in the paper):
- Increasing the cache size decreases the average data-movement distance, since most data is then fetched from nearer memories.
- We quantify the magnitude of the difference when the cache size is increased 4×.

Interconnect Power Optimizations – Layout Optimization
(Figure: two candidate floorplans. Layout A: L1 to L2 = 17.0 units, L2 to memory controller = 7.6 units. Layout B: L1 to L2 = 3.5 units, L2 to memory controller = 12.0 units.)
Layout A is optimized to reduce the L2-to-memory-controller distance; it is not too different from current GPUs.

Interconnect Power Optimizations – Layout Optimization
(Figure: the same two floorplans. Layout A: L1 to L2 = 17.0 units, L2 to memory controller = 7.6 units. Layout B: L1 to L2 = 3.5 units, L2 to memory controller = 12.0 units.)
Layout B is optimized to reduce the L1-to-L2 distance.

Interconnect Power Optimizations – Layout Optimization
Interconnect power is reduced by 48% on average when the L1-to-L2-distance-optimized layout is used.

Conclusion
- Distance-based microbenchmarking is a promising approach to measuring data-movement power.
- Over 14% of dynamic power can go toward on-chip data movement in today's chips. This is lower than past estimates because we separate data-access power from data-movement power.
- The share can increase to 22% at the 7 nm technology node.
- Optimizing on-chip interconnects to reduce the distance to frequently accessed portions can reduce on-chip data-movement power by 48%.

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD FirePro, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple, Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.