
Slide 1: Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California

Slide 2: Introduction — Memory Power
On-chip memories are a major contributor to system energy; the data cache alone accounts for ~16% of power in the StrongARM [Unsal et al., '01].

Hardware techniques (banking, dynamic voltage/frequency scaling, dynamic resizing):
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
– Limited program information
– Reactive

Software techniques (software-controlled scratch-pads, data/code reorganization):
+ Whole-program information
+ Proactive
– No dynamic adaptability
– Conservative

Slide 3: Reducing Data Memory Power — Compiler Managed, Hardware Assisted
Hardware techniques are transparent and handle arbitrary instruction/data accesses, but have limited program information and are reactive. Software techniques have whole-program information and are proactive, but lack dynamic adaptability and are conservative.
A compiler-managed, hardware-assisted approach combines the strengths of both:
- Global program knowledge
- Proactive optimizations
- Dynamic adaptability
- Efficient execution
- Aggressive software optimizations

Slide 4: Data Caches — Tradeoffs
Advantages:
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups
Disadvantages:
– Fixed replacement policy
– Set index carries no program locality
– Set-associativity has high overhead: multiple data/tag arrays are activated per access

Slide 5: Traditional Cache Architecture
[Figure: a 4-way set-associative cache. The address is split into tag, set, and offset fields; each way holds tag/data/LRU arrays, all tags are compared against the address tag, and a 4:1 mux selects the hit way.]
Lookup: activate all ways on every access.
Replacement: choose among all the ways.

Slide 6: Partitioned Cache Architecture
[Figure: the same 4-way structure, with the ways exposed as partitions P0–P3. Each load/store carries a k-bit partition vector and an R/U bit in addition to its address register.]
Lookup: restricted to the partitions specified in the bit-vector if the R bit is set; otherwise defaults to all partitions.
Replacement: restricted to the partitions specified in the bit-vector.
Advantages:
- Improve performance by controlling replacement
- Reduce cache access power by restricting the number of partitions accessed
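The lookup rule above can be sketched in a few lines of C. This is a minimal illustration, not the hardware's implementation: `ways_to_probe` and `arrays_activated` are hypothetical helper names, and a 4-partition cache is assumed to match the figure.

```c
#include <assert.h>

#define NUM_WAYS 4   /* assumed: one partition per way, as in the figure */

/* Decide which partitions to probe for one access.
 * bitvec: the k-bit partition vector carried by the load/store.
 * restricted: the R/U bit -- if R (nonzero), probe only the partitions
 * named in bitvec; if U, fall back to all partitions.
 * Returns a mask of ways to activate. */
unsigned ways_to_probe(unsigned bitvec, int restricted) {
    unsigned all = (1u << NUM_WAYS) - 1;
    if (restricted && (bitvec & all) != 0)
        return bitvec & all;
    return all;   /* unrestricted: behave like a normal cache */
}

/* Count how many tag/data arrays one probe activates. */
int arrays_activated(unsigned bitvec, int restricted) {
    unsigned m = ways_to_probe(bitvec, restricted);
    int n = 0;
    while (m) { n += m & 1u; m >>= 1; }
    return n;
}
```

A restricted access with a one-hot bit-vector activates a single tag/data array instead of all four, which is exactly where the access-power savings on the slide come from.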

Slide 7: Partitioned Caches — Example

    for (i = 0; i < N1; i++) {
      ...
      for (j = 0; j < N2; j++)
        y[i + j] += *w1++ + x[i + j];
      for (k = 0; k < N3; k++)
        y[i + k] += *w2++ + x[i + k];
    }

The y accesses (ld1/st1, ld2/st2) are placed in way-0 with directive [100], R; the w1/w2 accesses (ld3, ld4) in way-1 with [001], R; and the x accesses (ld5, ld6) in way-2 with [010], R.
This reduces the number of tag checks per iteration from 12 to 4!

Slide 8: Compiler Controlled Data Partitioning
Goal: place loads/stores into cache partitions.
Analyze the application's memory characteristics:
- Cache requirements: the number of partitions each load/store needs
- Predicted conflicts
Place loads/stores into partitions so that each instruction's caching needs are satisfied, conflicts are avoided, and instructions are overlapped where possible.

Slide 9: Cache Analysis — Estimating the Number of Partitions
[Figure: access streams of the j-loop (X, W1, Y, Y repeating) and the k-loop (X, W2, Y, Y repeating); the W1 reference M has working-set size 1, so a single cache block B1 suffices.]
- Find the minimal number of partitions that avoids conflict/capacity misses
- Use a probabilistic hit-rate estimate
- Use the working set to compute the number of partitions

Slide 10: Cache Analysis — Estimating the Number of Partitions
- Avoid conflict/capacity misses for an instruction
- Estimate the hit rate from the reuse distance (D), the total number of cache blocks (B), and the associativity (A) (Brehob et al., '99)
- In practice, compute energy estimates and pick the most energy-efficient configuration per instruction

Slide 11: Cache Analysis — Computing Interferences
- Avoid conflicts among temporally co-located references
- Model conflicts using an interference graph
[Figure: references M1–M4, each with reuse distance D = 1, connected in an interference graph derived from the overlapping access streams of the j- and k-loops.]

Slide 12: Partition Assignment
- The placement phase can overlap references; compute the combined working set
- Use the graph-theoretic notion of a clique
- For each clique, the new D is the sum of the Ds of its nodes
- The combined D over all overlaps is the maximum across all cliques
Example: M1–M4 each have D = 1. Clique 1 = {M1, M2, M4} gives a new reuse distance D = 3; Clique 2 = {M1, M3, M4} also gives D = 3. Combined reuse distance = max(3, 3) = 3.
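The clique rule above is simple enough to state directly in code. A minimal sketch, with cliques given explicitly as index lists (in the real compiler they are enumerated from the interference graph; the function name is hypothetical):

```c
#include <assert.h>

#define MAXN 8   /* assumed upper bound on clique size */

/* Combined reuse distance: for each clique of temporally co-located
 * references, the new D is the sum of its members' Ds; the distance
 * used for placement is the maximum over all cliques.
 * d: per-instruction reuse distances.
 * cliques: each row lists member indices, terminated by -1. */
int combined_reuse_distance(const int d[], int cliques[][MAXN], int ncliques) {
    int best = 0;
    for (int c = 0; c < ncliques; c++) {
        int sum = 0;
        for (int i = 0; i < MAXN && cliques[c][i] >= 0; i++)
            sum += d[cliques[c][i]];
        if (sum > best) best = sum;
    }
    return best;
}
```

On the slide's example — four references with D = 1 and cliques {M1, M2, M4} and {M1, M3, M4} — this returns max(3, 3) = 3.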

Slide 13: Experimental Setup
- Trimaran compiler and simulator infrastructure
- ARM9 processor model
- Cache configurations: 1 KB to 32 KB, 32-byte block size; 2, 4, and 8 partitions vs. 2-, 4-, and 8-way set-associative caches
- Mediabench suite
- CACTI for cache energy modeling

Slide 14: Reduction in Tag & Data-Array Checks
[Chart: average number of ways accessed for 2-, 4-, and 8-partition caches across cache sizes from 2 KB to 32 KB.]
36% reduction on an 8-partition cache.

Slide 15: Improvement in Fetch Energy
[Chart: percentage energy improvement of 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way configurations on a 16 KB cache, across the Mediabench programs (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and their average.]

Slide 16: Summary
- Maintains the advantages of a hardware cache
- Exposes placement and lookup decisions to the compiler: avoid conflicts, eliminate redundancies
- 24% energy savings on a 4 KB cache with 4 partitions
Extensions:
- Hybrid scratch-pads and caches: disable selected tags and convert those partitions to scratch-pads
- 35% additional savings on a 4 KB cache with 1 partition used as a scratch-pad

Slide 17: Thank You & Questions

Slide 18: Cache Analysis Step 1 — Instruction Fusioning
- Combine loads/stores that access the same set of objects, using points-to analysis
- Avoids coherence problems and duplication

    for (i = 0; i < N1; i++) {
      ...
      for (j = 0; j < readInput1(); j++)
        y[i + j] += *w1++ + x[i + j];
      for (k = 0; k < readInput2(); k++)
        y[i + k] += *w2++ + x[i + k];
    }

The references ld1/st1 through ld6 in the two inner loops are fused into groups (M1, M2) when they may access the same objects.
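Fusioning is a partitioning of the load/store set under a may-access-same-object relation, which a union-find structure captures naturally. A minimal sketch under that assumption — the alias relation is an input here, whereas the real compiler derives it from points-to analysis, and the `fuse_*` names are hypothetical:

```c
#include <assert.h>

#define MAX_INSTR 16   /* assumed bound on loads/stores per region */

static int parent[MAX_INSTR];

/* Start with every load/store in its own group. */
void fuse_init(int n) {
    for (int i = 0; i < n; i++) parent[i] = i;
}

/* Find a group's representative, with path halving. */
int fuse_find(int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
}

/* a and b may access the same object: merge their groups so they
 * receive one common partition assignment. */
void fuse(int a, int b) {
    parent[fuse_find(a)] = fuse_find(b);
}

int same_group(int a, int b) {
    return fuse_find(a) == fuse_find(b);
}
```

For the slide's loop, the y accesses from both inner loops (e.g. ld1/st1 and ld2/st2) would be fused into one group, while the w1 and x accesses stay in separate groups.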

Slide 19: Partition Assignment
- Greedily place instructions based on their cache estimates; overlap instructions if required
- Compute the number of partitions for overlapped instructions: enumerate cliques within the interference graph and compute the combined working set over all cliques
- Assign the R/U bit to control lookup
[Figure: the interference graph of M1–M4 (each with D = 1) and its two cliques.]
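The greedy placement step can be sketched as graph coloring over the interference graph. This is a simplification of the slide's algorithm: the real pass weighs per-instruction cache estimates, whereas this sketch only separates interfering instructions, overlapping non-interfering ones in the same partition; the function name and the 4-partition budget are assumptions.

```c
#include <assert.h>

#define MAX_I 16   /* assumed bound on instructions */
#define NP    4    /* assumed number of cache partitions */

/* Greedy placement: visit instructions in order and give each the
 * lowest-numbered partition not used by an already-placed instruction
 * it interferes with. interf[i][j] != 0 means i and j are temporally
 * co-located. Falls back to partition 0 when the budget runs out. */
void greedy_assign(int interf[MAX_I][MAX_I], int n, int part[]) {
    for (int i = 0; i < n; i++) {
        int used[NP] = {0};
        for (int j = 0; j < i; j++)
            if (interf[i][j]) used[part[j]] = 1;
        part[i] = 0;
        for (int p = 0; p < NP; p++)
            if (!used[p]) { part[i] = p; break; }
    }
}
```

Two interfering loads end up in different partitions (each later annotated with a one-hot bit-vector and R bit), while a load that interferes with neither can share partition 0.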

Slide 20: Related Work
- Direct-addressed and cool caches [Unsal '01, Asanovic '01]: tags maintained in registers that are addressed within loads/stores
- Split temporal/spatial cache [Rivers '96]: hardware managed, two partitions
- Column partitioning [Devadas '00]: individual ways can be configured as scratch-pads; no load/store-based partitioning
- Region-based caching [Tyson '02]: separates heap, stack, and globals; our approach offers finer-grained control and management
- Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]: reduce tag-check power but compromise cycle time; orthogonal to our technique

Slide 21: Code Size Overhead
[Chart: percentage of added instructions per Mediabench program, split into annotated loads/stores and extra MOV instructions; the averages are roughly 15% and 16%.]