Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache
Rajiv Ravindran, Pracheeti Nagarkar, Ganesh Dasika, Robert Senger, Eric Marsman, Scott Mahlke, Richard Brown

Presentation transcript:

1 Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache
Rajiv Ravindran, Pracheeti Nagarkar, Ganesh Dasika, Robert Senger, Eric Marsman, Scott Mahlke, Richard Brown
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor

2 Introduction
Instruction fetch power is dominant in low-power embedded processors
– ~27% for the StrongARM
– ~50% for the Motorola MCORE
Two alternatives
Scratch-pad
+ No hardware overhead
+ Part of the physical address space
– Managed in software
Instruction-cache
+ Hardware managed
+ Transparent to the user
– Power-hungry tag-checking and comparison logic

3 Focus Of This Work
Explore the use of a scratch-pad for reducing instruction fetch power
Two possible software-managed schemes
Static
– Map 'hot' regions prior to execution
– Contents do not change during execution
Dynamic
– Allow contents to change during execution
– Explicit copying of 'hot' regions

4 Scratch-pad Management: Static Approach
[Figure: control flow graph of basic blocks BB1-BB14 grouped into traces T1, T2, T3]

5 Scratch-pad Management: Static Approach
[Figure: CFG with traces T1-T3 and a table listing size, freq, and profit per trace]
profit = size * freq

6 Scratch-pad Management: Static Approach
[Figure: 96-byte scratch-pad packed with T1 (64 bytes) and T2 (32 bytes)]
profit = size * freq
Trace selection is equivalent to bin-packing
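The static selection step above, which the slide notes is equivalent to bin-packing, can be sketched as a greedy heuristic. This is an illustrative simplification, not necessarily the paper's exact algorithm; the trace sizes and frequencies below are hypothetical.

```python
# Greedy sketch of static scratch-pad allocation: rank traces by
# profit = size * freq (as on the slide) and pack until the pad is full.
# This approximates bin-packing; it is not the paper's exact algorithm.

def static_allocate(traces, sp_size):
    """traces: list of (name, size_bytes, freq); returns names placed."""
    ranked = sorted(traces, key=lambda t: t[1] * t[2], reverse=True)
    placed, used = [], 0
    for name, size, freq in ranked:
        if used + size <= sp_size:
            placed.append(name)
            used += size
    return placed

# Hypothetical numbers for the slide's three traces and a 96-byte pad
traces = [("T1", 64, 100), ("T2", 32, 400), ("T3", 32, 50)]
print(static_allocate(traces, 96))  # T2 (highest profit) and T1 fit; T3 does not
```

Because the static scheme fixes the contents before execution, anything not packed here always executes from memory, which is what motivates the dynamic scheme on the next slides.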

7 Scratch-pad Management: Dynamic Approach
[Figure: timeline of 96-byte scratch-pad contents: copy T1 in]

8 Scratch-pad Management: Dynamic Approach
[Figure: timeline of 96-byte scratch-pad contents: copy T1, then copy T2]

9 Scratch-pad Management: Dynamic Approach
[Figure: timeline of 96-byte scratch-pad contents: copy T1, copy T2, then copy T3 over T2]

10 Scratch-pad Management: Dynamic Approach
[Figure: timeline of 96-byte scratch-pad contents: copy T1, copy T2, copy T3 over T2, then copy T2 over T3]

11 Scratch-pad Management: Dynamic Approach
[Figure: full copy timeline: copy1 (T1), copy2 (T2), copy3 (T2), copy4 (T3), with T3 and T2 alternately overwriting each other in the 96-byte scratch-pad]

12 Objectives Of This Work
Develop a dynamic compiler-managed scheme to exploit the scratch-pad
Prior work [Verma et al., '04]
– ILP-based solution
– Not scalable
– Limits scope of analysis to a single procedure, loop nests
Practical solution
– Scalable
– Handles arbitrary control flow graphs
– Inter-procedural analysis

13 Our Approach
Two phases
– Trace selection & scratch-pad (SP) allocation
  Identify frequently executed traces
  Select the most energy-beneficial traces
  Place them with possible overlap to reduce copy overhead
– Copy placement
  Insert copies to realize the placement
  Hoist within the control flow graph to minimize overhead
  Fix branch targets into selected traces

14 SP Allocation: Computing Energy Gain
Benefit: energy savings when the trace is executed from scratch-pad instead of memory
  Benefit = ProfileWeight * Size * ΔFetchEnergy
CopyCost: overhead associated with copying the trace once
  CopyCost = Size * (FetchEnergy + WriteEnergy)
Energy Gain = Benefit - CopyCost
[Figure: CFG with traces T1-T3]
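The gain computation maps directly to code. The per-unit energy constants below are made-up placeholders, not the paper's measured values (the paper derives them from PowerMill/CACTI):

```python
# Energy-gain computation from the slide. All energy constants are
# illustrative assumptions, not measured values.
MEM_FETCH_NJ = 1.0   # fetch one unit from main memory (assumed)
SP_FETCH_NJ  = 0.2   # fetch one unit from scratch-pad (assumed)
SP_WRITE_NJ  = 0.3   # write one unit into scratch-pad (assumed)

def benefit(profile_weight, size):
    # savings per fetched unit, times trace size, times execution count
    return profile_weight * size * (MEM_FETCH_NJ - SP_FETCH_NJ)

def copy_cost(size):
    # one copy: read each unit from memory and write it to scratch-pad
    return size * (MEM_FETCH_NJ + SP_WRITE_NJ)

def energy_gain(profile_weight, size, n_copies=1):
    return benefit(profile_weight, size) - n_copies * copy_cost(size)

print(energy_gain(profile_weight=400, size=32))  # hot trace: positive gain
print(energy_gain(profile_weight=1, size=32))    # cold trace: copying loses energy
```

The key point the slide makes is that a trace is only worth copying when its execution-weighted fetch savings exceed the one-time cost of moving it in.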

15 SP Allocation: Placing Traces
Trace execution sequence: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
(initial copy of T1, initial copy of T2, recopy of T1, recopy of T2, ...)
Dynamic Copy Cost = #copies of T1 * CopyCost(T1) + #copies of T2 * CopyCost(T2)
[Figure: CFG with traces T1-T3]

16 Temporal Relationship Graph [Gloy et al., '97]
Nodes are traces; the edge weight between two nodes denotes the Dynamic Copy Cost they impose on each other if they share scratch-pad space
Execution sequence: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
Edge (T1, T2) = 2 * CopyCost(T1) + 2 * CopyCost(T2)
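The temporal relationship graph can be built from the profiled trace reference stream. The sketch below is a simplified reading of the Gloy et al. construction: whenever a trace recurs after other traces have run, each intervening trace is charged the recurring trace's copy cost on their shared edge. The copy costs are hypothetical.

```python
# Simplified TRG construction (illustrative, not the exact published
# algorithm): edge (A, B) accumulates the recopy cost A would pay each
# time B executes between two references to A, and vice versa.
from collections import defaultdict

def build_trg(ref_stream, copy_cost):
    trg = defaultdict(float)
    last_seen = {}                  # traces referenced so far
    seen_since = defaultdict(set)   # trace -> traces run since its last reference
    for t in ref_stream:
        for other in list(last_seen):
            if other != t:
                seen_since[other].add(t)
        if t in last_seen and seen_since[t]:
            # t recurs after others ran: sharing a slot with any of them
            # would force a recopy of t, so charge each such pair
            for other in seen_since[t]:
                trg[tuple(sorted((t, other)))] += copy_cost[t]
        last_seen[t] = True
        seen_since[t] = set()
    return dict(trg)

# The slide's reference sequence, with hypothetical copy costs
stream = ["T1", "T1", "T2", "T2", "T2", "T1", "T1", "T2", "T2", "T2",
          "T3", "T1", "T1", "T2", "T2", "T2", "T3"]
print(build_trg(stream, {"T1": 10, "T2": 5, "T3": 5}))
```

On this sequence the (T1, T2) edge accumulates 30, i.e. 2 * CopyCost(T1) + 2 * CopyCost(T2) with these hypothetical costs, matching the slide's formula for that edge.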

17 SP Allocation: Placing Traces
96-byte scratch-pad
Energy Gain: T1 → 3104 nJ, T2 → 15952 nJ, T3 → 752 nJ
[Figure: T2, the highest-gain trace, placed first]

18 SP Allocation: Placing Traces
96-byte scratch-pad
Energy Gain: T1 → 3104 nJ, T2 → 15952 nJ, T3 → 752 nJ
[Figure: T1 placed alongside T2]

19 SP Allocation: Placing Traces
[Figure: 96-byte scratch-pad holding T1 and T2; TRG over T1, T2, T3 with edge weights 432 nJ, 96 nJ, and 144 nJ]

20 SP Allocation: Placing Traces
[Figure: T3 placed overlapping T2 in the 96-byte scratch-pad, guided by the TRG edge weights]
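One plausible reading of the overlap decision above: when a trace no longer fits on its own, overlap it with the already-placed trace whose TRG edge (conflict cost) with it is cheapest, and only if its energy gain still exceeds that cost. The rule and the edge-to-weight mapping below are my interpretation of the slides, not the paper's stated algorithm; the nJ figures come from the slides.

```python
# Sketch of the overlap decision (an interpretation, not the paper's
# exact rule): pick the cheapest-conflict partner from the TRG.

def choose_overlap(new_trace, placed, trg, gain):
    """Return the placed trace to overlap with, or None if not worth it."""
    def edge(a, b):
        return trg.get(tuple(sorted((a, b))), 0.0)
    best = min(placed, key=lambda p: edge(new_trace, p))
    return best if gain[new_trace] > edge(new_trace, best) else None

# Edge weights and gains from the slides (edge assignment assumed)
trg = {("T1", "T2"): 432.0, ("T2", "T3"): 96.0, ("T1", "T3"): 144.0}
gain = {"T1": 3104.0, "T2": 15952.0, "T3": 752.0}
# T1 and T2 already occupy the 96-byte pad; T3 must overlap someone:
print(choose_overlap("T3", ["T1", "T2"], trg, gain))  # picks T2 (96 nJ edge)
```

With these numbers T3's 752 nJ gain comfortably exceeds the 96 nJ conflict cost of sharing T2's slot, which is consistent with the slide placing T3 over T2.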

21 Copy Placement
Initially, naively place copies at trace entry points
– Guarantees correct but inefficient execution
Reduce the copy overhead
– Identify frequently executed copies
– Iteratively hoist copies to less frequently executed blocks
– Remove redundant copies
– Ensure that the hoists and removals are legal
– Ensure traces are present prior to execution
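The iterative hoisting step can be sketched as a simple fixed-point loop. The CFG, profile, and legality checks here are simplified stand-ins supplied by the caller; block names and frequencies in the example are hypothetical.

```python
# Sketch of the copy-hoisting loop: repeatedly move the hottest copy to
# a colder block, as long as the move keeps execution correct.

def hoist_copies(copies, freq, hoist_target, is_legal):
    """copies: {trace: block holding its copy}. hoist_target(block)
    returns a candidate block to hoist into (or None). is_legal must
    check the copy still covers every entry into the trace and does not
    clobber an overlapping trace's live range."""
    changed = True
    while changed:
        changed = False
        # visit the most frequently executed copy sites first
        for trace, block in sorted(copies.items(), key=lambda c: -freq[c[1]]):
            cand = hoist_target(block)
            if cand is not None and freq[cand] < freq[block] and is_legal(trace, cand):
                copies[trace] = cand
                changed = True
    return copies

freq = {"BB8": 100, "BB3": 10, "BB1": 1}      # hypothetical block frequencies
parents = {"BB8": "BB3", "BB3": "BB1"}        # hypothetical hoist candidates
# Pretend hoisting T2's copy into BB1 would be illegal (outside its range):
print(hoist_copies({"T2": "BB8"}, freq, parents.get,
                   lambda trace, block: block != "BB1"))  # copy settles in BB3
```

The legality predicate is doing the real work: as the following slides show, a hoist is legal only when the hoisted copy's live range does not collide with a trace sharing the same scratch-pad space.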

22 Copy Placement: Initial Placement
[Figure: CFG with copies C1-C3 for each of T1, T2, T3 inserted at the trace entry points]

23 Copy Placement: Redundant Copies
[Figure: CFG highlighting which initial copies are redundant given the placement of T1 and the overlapped T2/T3]

24 Copy Placement: Hoisting
Live ranges: T1 → BB4, BB6, BB7; T2 → BB9, BB10; T3 → BB12
[Figure: CFG with the remaining copies C1-T1, C1-T2, C1-T3]

25 Copy Placement: Hoisting
[Figure: live range of T2 before the hoist]

26 Copy Placement: Hoisting
[Figure: live range of T2 after the hoist: the hoist is legal]

27 Experimental Setup
Trimaran compiler framework
Measured instruction fetch power
Varied scratch-pad size from 32 bytes to 4 Kbytes
Two configurations
WIMS microcontroller at the Univ. of Michigan
– On-chip memory and scratch-pad
– Static vs dynamic schemes
– PowerMill
Conventional processor
– Off-chip memory, on-chip scratch-pad vs on-chip I-cache
– CACTI model
– Scratch-pad vs I-cache
DMA copying
– 2 bytes per cycle, stalling

28 Energy Savings: Static vs Dynamic
[Chart: WIMS energy savings with a 64-byte scratch-pad across the benchmarks (fir, rawcaudio, rawdaudio, g721encode/decode, mpeg2enc/dec, pegwitenc/dec, pgpencode/decode, gsmencode/decode, epic, unepic, cjpeg, djpeg, sha, blowfish) and their average]
Average savings for dynamic: 28%
Average savings for static: 17%

29 Effect of Varying Scratch-pad Size (pegwitenc)
[Charts: % hit rate and % energy savings vs scratch-pad size in bytes, static vs dynamic]

30 Scratch-pad Size For 95% Hit Rate
[Chart: scratch-pad size in bytes needed per benchmark, static vs dynamic]
Dynamic is 2.5x better than static

31 Energy Savings: SP vs I-Cache
[Chart: CACTI energy savings with a 64-byte scratch-pad/I-cache across the benchmarks and their average]
Average savings for dynamic: 48%
Average savings for static: 25%
Average savings for I-cache: 30%

32 Conclusions
Compiler-directed dynamic placement in a scratch-pad
– Arbitrary control flow graphs
– Inter-procedural
– Two phases: SP allocation & copy placement
28% savings for dynamic vs 16% for static for a 64-byte scratch-pad
41% savings for dynamic vs 31% for static for a 256-byte scratch-pad
2 to 10% stall cycles
Within 0 to 11% of optimal, but scalable

33 For More Information
Thank You!