Lightweight Runtime Control Flow Analysis for Adaptive Loop Caching
Marisha Rawlins and Ann Gordon-Ross+
University of Florida, Department of Electrical and Computer Engineering
+ Also affiliated with the NSF Center for High-Performance Reconfigurable Computing

Instruction Cache Optimization
The instruction cache is a good candidate for optimization; it accounts for a large share of processor power, e.g., 29% of power consumption in the ARM920T (Segars '01, cited in Gordon-Ross '04).
Several optimizations exploit the 90/10 rule:
– 90% of execution time is spent in 10% of the code, known as critical regions
– These critical regions typically consist of several loops

Caching Instructions for Energy Savings
The traditional L1 cache has been a good source of energy savings:
– Smaller device than the next-level cache or main memory
– Low dynamic and static energy consumption
– Close proximity to the processor
– Faster access to instructions
An even smaller L0 cache, known as the filter cache, was introduced to increase energy savings at the expense of system performance.

Filter Cache
A small (L0) cache placed between the instruction cache and the processor (Kin/Gupta/Mangione-Smith '97).
Energy savings come from fetching instructions out of a small cache, but:
– Energy is still consumed during tag comparisons
– A performance penalty of 21% results from the high miss rate (Kin '97)
[Diagram: Microprocessor <-> Filter Cache (L0 cache) <-> L1 Instruction Cache or Main Memory]

Loop Caching
The loop cache achieves energy savings by storing loops in a smaller device than the L1 cache without sacrificing performance:
– Tagless device: no energy is used for tag comparisons
– Miss-less device: there is no performance penalty
– Only stores loops
Critical regions typically consist of several loops, so caching the most frequently executed (critical) regions has a large impact on energy savings.
Three previous loop cache designs were introduced for caching different types of loops:
– Dynamically Loaded Loop Cache (DLC)
– Preloaded Loop Cache (PLC)
– Hybrid Loop Cache (HLC)

Dynamically Loaded Loop Cache (DLC)
A small instruction cache placed in parallel with the L1 cache (Lee/Moyer/Arends '99).
Operation:
– Filled when a short backward branch (sbb) is detected in the instruction stream
– Provides the processor with instructions on the next loop iteration
Benefits:
– Smaller device → energy savings
– Tagless device → energy savings
– Miss-less device → no performance penalty
– Loop cache operation is invisible to the user
Example loop (straight-line code ending in an sbb):
[0] lw r1, 100(r2)
[1] addi r3, r1, 1
[2] sw r3, 500(r2)
[3] addi r2, r2, 1
[4] sbb -4
[Diagram: Microprocessor <-> Dynamic Loop Cache / L1 Instruction Cache or Main Memory]
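To make this fill-then-serve behaviour concrete, here is a minimal software model of a DLC controller written in C; the state names, the dlc_step interface, and the exact transition conditions are illustrative assumptions for this sketch, not the published design.

#include <stdbool.h>

/* Illustrative model of DLC behaviour: detect a taken sbb, copy the
 * loop during the next iteration, then serve later iterations. */
enum dlc_state { DLC_IDLE, DLC_FILL, DLC_ACTIVE };

struct dlc { enum dlc_state state; unsigned sbb_pc; };

/* Called once per fetched instruction; returns true when the fetch
 * can be served by the loop cache instead of the L1 cache. */
static bool dlc_step(struct dlc *lc, unsigned pc,
                     bool taken_sbb, bool other_cof)
{
    switch (lc->state) {
    case DLC_IDLE:
        if (taken_sbb) { lc->sbb_pc = pc; lc->state = DLC_FILL; }
        return false;                      /* fetch from L1 */
    case DLC_FILL:
        if (other_cof)
            lc->state = DLC_IDLE;          /* cof breaks the fill (thrashing) */
        else if (taken_sbb && pc == lc->sbb_pc)
            lc->state = DLC_ACTIVE;        /* whole loop captured in one pass */
        return false;                      /* still fetching from L1 */
    case DLC_ACTIVE:
        if (other_cof || (taken_sbb && pc != lc->sbb_pc)) {
            lc->state = DLC_IDLE;          /* left the loop */
            return false;
        }
        return true;                       /* serve the fetch from the DLC */
    }
    return false;
}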

Dynamically Loaded Loop Cache Limitations
– Cannot cache loops containing control-of-flow changes (cofs), which include common if/else constructs; these cause thrashing, i.e., alternating between idle and filling but never fetching from the loop cache
– All loop instructions must be cached in one iteration to provide a 100% hit rate; gaps in the loop cache cannot be identified at runtime, so only sequential instruction fetching is allowed
Example loop containing a cof (the bne):
[0] lw r1, 100(r2)
[1] addi r3, r1, 1
[2] sw r3, 500(r2)
[3] bne r4, r3, 2
[4] addi r2, r2, 1
[5] sbb -5

Preloaded Loop Cache (PLC)
The preloaded loop cache statically stores the most frequently executed loops (Gordon-Ross/Cotterell/Vahid '02).
Operation:
– The application is profiled and critical regions are stored in the loop cache
– The PLC provides the instructions when a stored critical region is executed
– Exit bits are used to indicate the location of the next instruction fetch and are critical for maintaining a 100% hit rate
Benefits:
– Can cache loops with cofs
– Tagless / miss-less
Example: the loop below (which contains a cof) resides in the L1 cache and is also preloaded into the PLC:
[0] lw r1, 100(r2)
[1] addi r3, r1, 1
[2] sw r3, 500(r2)
[3] bne r4, r3, 2
[4] addi r2, r2, 1
[5] sbb -5

Preloaded Loop Cache Limitations
– Profiling and designer effort are required to identify the critical regions stored in the loop cache
– Only a limited number of loops can be stored in the loop cache
– Loop cache contents are fixed, i.e., not suitable for changing application phases or changing behaviour

Hybrid Loop Cache (HLC)
Consists of a PLC partition and a DLC partition (Gordon-Ross/Vahid '02):
– The PLC partition is checked on a cof
– If a loop is not preloaded, it is cached by the DLC partition at runtime (if possible)
Benefits:
– Can cache loops with cofs (PLC partition)
– Tagless / miss-less
Limitations:
– If profiling is not performed, functionality is reduced to that of a DLC
– Can cache loops with cofs only if the loop is preloaded

Choosing a Suitable Loop Cache
Application behaviour:
– The DLC is most suitable for small loops with straight-line code
– However, the PLC/HLC is needed for applications with loops containing cofs
Desired designer effort:
– Designer effort is the amount of time/effort spent on pre-analysis
– The DLC is highly flexible and requires no designer effort
– The PLC/HLC can store loops with cofs but requires considerable pre-analysis to identify and cache complex critical regions and to set up the exit conditions needed to ensure a 100% loop cache hit rate

Contribution
The best loop cache design therefore depends on application behaviour and the desired designer effort.
We introduce an Adaptive Loop Cache (ALC) which:
– Integrates the PLC's complex functionality of caching loops with cofs with the DLC's runtime flexibility
– Adapts to nearly any application behaviour via lightweight runtime critical-region control flow analysis and dynamic loop cache exit condition evaluation
– Eliminates costly designer pre-analysis effort
– Eliminates the uncertainty in choosing the most appropriate loop cache design for the application, since the ALC performs very well for all application types (even if it is not the best loop cache for every application)

Architecture Overview
[Block diagram: the microprocessor's PC and control signals feed the Loop Cache Controller (LCC), which contains the i_buffer, last_PC, and tr_sbb registers and receives the sbb/cof indicators; the Address Translation Unit (ATU) holds cr_start and produces the loop cache address; the Adaptive Loop Cache stores instructions along with valid bits; multiplexers select between instructions supplied by the loop cache and by the level-one instruction cache.]
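For readers following along in software, the diagram's registers and storage can be summarised as a plain data structure. The sketch below is a simplified model; the field widths, the 32-entry size, and the exact register semantics are assumptions based on the diagram labels (the nv/tnv valid bits are introduced on the Valid Bits slide).

#include <stdint.h>
#include <stdbool.h>

#define ALC_ENTRIES 32          /* illustrative size */

/* One loop cache entry: an instruction plus its two valid bits. */
struct alc_entry {
    uint32_t instruction;
    bool nv;
    bool tnv;
};

/* State visible in the block diagram, gathered into one structure. */
struct alc {
    /* Loop Cache Controller (LCC) registers */
    uint32_t i_buffer;          /* instruction buffered from the L1 cache */
    uint32_t last_pc;           /* previously seen PC                     */
    uint32_t tr_sbb;            /* address of the triggering sbb          */
    /* Address Translation Unit (ATU) register */
    uint32_t cr_start;          /* start address of the cached loop       */
    /* Loop cache storage */
    struct alc_entry entry[ALC_ENTRIES];
};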

Architecture: Address Translation Unit (ATU)
The ATU translates the incoming memory address into a loop cache address:
– Necessary for caching loops with cofs
– Allows gaps within the loop
The translation (ad_trans) is lc_address = mem_addr - cr_start; for sequential fetches the loop cache address is simply incremented (lc_address++).
Example (cr_start = 500):

Mem. Addr. | Instruction      | Loop cache contents
500        | lw r1, 200(r2)   | [0] lw r1, 200(r2)
501        | addi r3, r1, 1   | [1] addi r3, r1, 1
502        | sw r3, 500(r2)   | [2] sw r3, 500(r2)
503        | bne r4, r3, 3    | [3] bne r4, r3, 3
504        | srl r4, r5, 10   | [4] xxxxxxxxxx (gap)
505        | or r6, r4, r1    | [5] xxxxxxxxxx (gap)
506        | addi r2, r2, 1   | [6] addi r2, r2, 1
507        | sbb -7           | [7] sbb -7
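A minimal sketch of this translation in C, assuming 32-bit addresses and the signal names used above; with cr_start = 500, address 503 maps to loop cache entry 3.

#include <stdint.h>

/* ad_trans: map a fetched memory address to a loop cache index
 * relative to the critical region start (cr_start). A sketch of the
 * behaviour described above, not the hardware itself. */
static uint32_t ad_trans(uint32_t mem_addr, uint32_t cr_start)
{
    return mem_addr - cr_start;   /* lc_address */
}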

Architecture: Encountering a New Loop
The instruction cache services instruction fetches while the loop cache is in the idle state. A taken sbb indicates a new loop; the loop cache then spends 1 cycle in the buffer state preparing for the new loop.
[Diagram: a taken sbb -3 at the end of a small loop triggers the buffer state; the LCC loads i_buffer (i_buffer_ld) and tr_sbb (tr_sbb_ld), checks same_loop, and invalidates (inv) the previous loop cache contents in preparation for filling.]

Architecture: Valid Bits
Valid bits indicate the location of the next instruction fetch, i.e., loop cache or instruction cache. They are critical for maintaining a 100% loop cache hit rate. Each entry stores an nv bit and a tnv bit.

Instruction      | nv | tnv
lw r1, 200(r2)   | 1  | 0
addi r3, r1, 1   | 1  | 0
bne r4, r3, 3    | 0  | 1
xxxxxxxxxx       | 0  | 0
xxxxxxxxxx       | 0  | 0
addi r2, r2, 1   | 1  | 0
sbb -6           | 0  | 1
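The bit patterns above suggest that nv marks the sequential (fall-through) successor as cached and tnv marks the branch-taken successor as cached; under that reading, which is inferred from the example rather than stated on the slide, the fetch-source decision reduces to a sketch like the following.

#include <stdbool.h>

/* Decide where the NEXT instruction should be fetched from, given the
 * valid bits of the instruction just supplied by the loop cache.
 * The nv/tnv interpretation is an inference from the example table. */
static bool next_fetch_from_loop_cache(bool nv, bool tnv,
                                       bool is_cof, bool cof_taken)
{
    if (is_cof && cof_taken)
        return tnv;   /* branch-taken successor cached? */
    return nv;        /* sequential successor cached?   */
}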

Architecture: Backfilling the ALC
Operating in the fill state, the LCC backfills the loop cache from the i_buffer while the L1 cache continues to supply the processor. The ATU (cr_start = 500) translates each address (ad_trans), the decoder detects cofs (i_cof) and out-of-range targets (PC_out_of_range), and the LCC sets the valid bits (set_nv, set_tnv) as instructions are written. The L1 instruction stream is the eight-instruction loop at addresses 500-507 shown on the ATU slide.

Loop cache after one fill pass:
Entry | Instruction      | nv | tnv
[0]   | lw r1, 200(r2)   | 1  | 0
[1]   | addi r3, r1, 1   | 1  | 0
[2]   | sw r3, 500(r2)   | 1  | 0
[3]   | bne r4, r3, 3    | 0  | 1
[4]   | xxxxxxxxxx       | 0  | 0
[5]   | xxxxxxxxxx       | 0  | 0
[6]   | addi r2, r2, 1   | 1  | 0
[7]   | sbb -7           | 0  | 1

Architecture: Supplying the Processor with Instructions
Operating in the active state, the ALC supplies the processor with instructions: the ATU (cr_start = 500) translates the PC into a loop cache index, and the valid bits determine whether the next fetch comes from the ALC or the L1 cache. With the loop cache contents shown on the previous slide, the gap entries [4] and [5] (xxxxxxxxxx, nv = tnv = 0) must still be fetched from the L1 instruction cache when the cof at entry [3] is not taken.

Architecture: Filling the Gaps
When execution falls into a gap, the ALC returns to the fill state:
– The instruction cache provides the processor with instructions
– The instruction is always buffered while in the active state
– Both valid bits are set for the cof instruction
– The ALC then returns to the active state
After the gaps are filled, any path through the instruction stream will hit.

Fill state (gaps present):
Entry | Instruction      | nv | tnv
[0]   | lw r1, 200(r2)   | 1  | 0
[1]   | addi r3, r1, 1   | 1  | 0
[2]   | sw r3, 500(r2)   | 1  | 0
[3]   | bne r4, r3, 3    | 0  | 1
[4]   | xxxxxxxxxx       | 0  | 0
[5]   | xxxxxxxxxx       | 0  | 0
[6]   | addi r2, r2, 1   | 1  | 0
[7]   | sbb -7           | 0  | 1

After the gaps are filled (execute):
Entry | Instruction      | nv | tnv
[0]   | lw r1, 200(r2)   | 1  | 0
[1]   | addi r3, r1, 1   | 1  | 0
[2]   | sw r3, 500(r2)   | 1  | 0
[3]   | bne r4, r3, 3    | 1  | 1
[4]   | srl r4, r5, 10   | 1  | 0
[5]   | or r6, r4, r1    | 1  | 0
[6]   | addi r2, r2, 1   | 1  | 0
[7]   | sbb -7           | 0  | 1

Architecture
For loops larger than the loop cache:
– For a loop cache of M entries, the ALC stores the first M instructions of the loop
– The sbb instruction is not stored in the loop cache; however, the sbb address is still stored in tr_sbb and is used to identify the loop
– The LCC overflow signal indicates that only part of the loop is stored
– If overflow is asserted and the sbb is taken again, the LCC transitions from idle to active and supplies the processor with M instructions
If a new loop is encountered in the fill state (e.g., a nested loop), the LCC transitions from fill back to the buffer state to prepare for the new loop. The overall controller behaviour is summarised in the sketch below.
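Pulling the preceding slides together, a rough software model of the LCC state transitions might look like the following. The hardware transition conditions are more involved, and the helper names (lcc_step, gap_hit, the M value) and the exact handling of overflow and gap returns are illustrative assumptions for this sketch.

#include <stdbool.h>

/* Illustrative model of the LCC states: idle -> buffer -> fill ->
 * active, with gap refills, nested loops, and overflow handling. */
enum lcc_state { LCC_IDLE, LCC_BUFFER, LCC_FILL, LCC_ACTIVE };

struct lcc {
    enum lcc_state state;
    unsigned tr_sbb;      /* address of the sbb identifying the loop */
    unsigned cr_start;    /* critical region start (sbb target)      */
    unsigned filled;      /* entries written so far                  */
    bool overflow;        /* loop larger than the M-entry cache      */
};

#define M 32              /* loop cache entries (illustrative size)  */

static void lcc_step(struct lcc *c, unsigned pc, bool taken_sbb,
                     unsigned sbb_target, bool gap_hit)
{
    switch (c->state) {
    case LCC_IDLE:
        if (taken_sbb) {
            if (c->overflow && pc == c->tr_sbb) {
                c->state = LCC_ACTIVE;       /* serve the stored M instructions */
            } else {
                c->tr_sbb   = pc;            /* new loop: 1 cycle to prepare */
                c->cr_start = sbb_target;
                c->filled   = 0;
                c->overflow = false;
                c->state    = LCC_BUFFER;
            }
        }
        break;
    case LCC_BUFFER:
        c->state = LCC_FILL;                 /* invalidate old contents, start filling */
        break;
    case LCC_FILL:
        if (taken_sbb && pc != c->tr_sbb) {  /* nested loop encountered */
            c->tr_sbb   = pc;
            c->cr_start = sbb_target;
            c->filled   = 0;
            c->state    = LCC_BUFFER;
        } else if (++c->filled > M) {
            c->overflow = true;              /* only part of the loop is stored */
            c->state    = LCC_IDLE;
        } else if (taken_sbb && pc == c->tr_sbb) {
            c->state = LCC_ACTIVE;           /* loop captured; serve from the ALC */
        }
        break;
    case LCC_ACTIVE:
        if (gap_hit)
            c->state = LCC_FILL;             /* backfill the gap, then return */
        else if (pc < c->cr_start || pc > c->tr_sbb)
            c->state = LCC_IDLE;             /* left the critical region */
        break;
    }
}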

Experimental Setup
– Modified SimpleScalar¹ to implement the ALC
– 31 benchmarks from the EEMBC², Powerstone³, and MiBench⁴ suites
– Energy model based on access and miss statistics (SimpleScalar) and energy values (CACTI⁵)
What are we measuring?
– Loop cache access rate: the % of instructions fetched from the loop cache instead of the L1 i-cache
– Energy savings
1 (Burger/Austin/Bennet 96), 2 (EEMBC), 3 (Lee/Arends/Moyer 98), 4 (Guthaus/Ringenberg/Ernst/Austin/Mudge/Brown 01), 5 (Shivakumar/Jouppi 01)
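As a rough illustration of how these two metrics are derived, a simplified model could be computed as below. The per-access energy values stand in for CACTI-derived numbers, and the real energy model also accounts for misses, fills, and static energy, so this is only a hedged approximation, not the paper's model.

#include <stdio.h>

struct stats {
    unsigned long lc_fetches;   /* instructions served by the loop cache */
    unsigned long l1_fetches;   /* instructions served by the L1 i-cache */
};

/* Fraction of instruction fetches satisfied by the loop cache. */
static double access_rate(const struct stats *s)
{
    return (double)s->lc_fetches / (s->lc_fetches + s->l1_fetches);
}

/* Fraction of fetch energy saved relative to fetching everything
 * from the L1 cache; e_lc and e_l1 are per-access energies. */
static double energy_savings(const struct stats *s, double e_lc, double e_l1)
{
    double base = (double)(s->lc_fetches + s->l1_fetches) * e_l1;
    double with = (double)s->lc_fetches * e_lc + (double)s->l1_fetches * e_l1;
    return 1.0 - with / base;
}

int main(void)
{
    struct stats s = { 800000UL, 200000UL };   /* illustrative counts */
    printf("access rate    = %.0f%%\n", access_rate(&s) * 100.0);
    printf("energy savings = %.0f%%\n", energy_savings(&s, 0.1, 1.0) * 100.0);
    return 0;
}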

Results: Loop Cache Access Rate
The ALC has a higher loop cache access rate than the DLC for all cache sizes:
– The ALC can cache loops with cofs
– Average improvements are as high as 19% (EEMBC), 41% (MiBench), and 75% (Powerstone)
– Individual benchmarks show improvements as high as 93% (EEMBC), 98% (MiBench), and 98% (Powerstone)

Results: Loop Cache Access Rate
The ALC has a higher loop cache access rate than the PLC/HLC for loop cache sizes up to 128 entries:
– The ALC caches loops with cofs, and its contents are not static
– On average, the 256-entry PLC/HLC performs better (by less than 5%) since it can cache most (or all) of the critical regions and requires no runtime filling
– The ALC is good for size-constrained systems

Results: Loop Cache Access Rate
The ALC can have a higher loop cache access rate than the PLC/HLC for all loop cache sizes when critical regions are too large to fit in the PLC/HLC.
[Chart: benchmark with a 391-instruction critical region; ALC access-rate advantages of 60% and 71% annotated.]

Results: Energy Savings
The ALC has higher energy savings than the DLC for all cache sizes:
– Higher loop cache access rate
– Average improvements are as high as 12% (EEMBC), 26% (MiBench), and 49% (Powerstone)

Results: Energy Savings
The ALC has higher energy savings than the PLC/HLC for loop cache sizes up to 64 entries:
– A 64-entry PLC/HLC is not large enough to compete with the ALC
– At 128 entries: the ALC typically has a higher access rate, but energy is spent filling the ALC, which can result in lower energy savings than the PLC/HLC
– At 256 entries: the PLC/HLC typically has a higher access rate and thus higher energy savings

Results: Energy Savings
– The increase in HLC energy savings comes from increasing the PLC partition, i.e., designer effort is needed
– Energy savings decrease for loop caches larger than 256 entries
– The PLC/HLC can outperform the ALC for all sizes when critical loops are executed frequently but only iterate a few times successively
[Chart annotations: energy savings of 50% and 0.1% highlighted.]

Conclusions
We introduced the Adaptive Loop Cache (ALC), which:
– Combines the flexibility of the dynamically loaded loop cache with the preloaded loop cache's ability to cache complex critical regions
– Eliminates the designer pre-analysis effort needed for the preloaded loop cache and hybrid loop cache
– Improves the average loop cache access rate by as much as 75% (relative to the DLC)
– Improves the average energy savings by as much as 49% (relative to the DLC)
