Compiler Managed Partitioned Data Caches for Low Power

Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor

* Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California
Introduction: Memory Power

On-chip memories are a major contributor to system energy
- Data caches: ~16% of system energy in StrongARM [Unsal et al., '01]

Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
- Limited program information
- Reactive

Software techniques: software-controlled scratch-pads, data/code reorganization
+ Whole-program information
+ Proactive
- No dynamic adaptability
- Conservative
Reducing Data Memory Power: Compiler Managed, Hardware Assisted

Combining the hardware and software approaches above gives:
- Global program knowledge
- Proactive optimizations
- Dynamic adaptability
- Efficient execution
- Aggressive software optimizations
Data Caches: Tradeoffs

Advantages
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups

Disadvantages
- Fixed replacement policy
- Set index bears no relation to program locality
- Set-associativity has high overhead
- Multiple data/tag arrays activated per access
Traditional Cache Architecture

The address is split into tag | set | offset. Each way holds (tag, data, LRU) entries; the set index selects one entry per way, the tags are compared (=?), and a 4:1 mux selects the matching way's data.
- Lookup: activate all ways on every access
- Replacement: choose among all the ways
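As a concrete software model of this organization, the sketch below simulates a 4-way set-associative LRU cache in which every access probes the tag entries of all ways. The class and parameter names are illustrative, not from the paper.

```python
# Minimal sketch of a set-associative, LRU cache: every lookup activates
# the tag array of all ways, and replacement may choose any way in the set.
class SetAssocCache:
    def __init__(self, num_sets=16, ways=4, block=32):
        self.num_sets, self.ways, self.block = num_sets, ways, block
        # Each set is a list of tags, ordered LRU-first / MRU-last.
        self.sets = [[] for _ in range(num_sets)]
        self.tag_checks = 0

    def access(self, addr):
        blk = addr // self.block
        idx, tag = blk % self.num_sets, blk // self.num_sets
        s = self.sets[idx]
        self.tag_checks += self.ways      # all ways probed on every access
        if tag in s:                      # hit: move to MRU position
            s.remove(tag)
            s.append(tag)
            return True
        if len(s) == self.ways:           # miss in a full set: evict LRU
            s.pop(0)
        s.append(tag)
        return False

c = SetAssocCache()
c.access(0)          # cold miss, 4 tag checks
assert c.access(0)   # hit, 4 more tag checks
assert c.tag_checks == 8
```

Even a hit pays for probing all four ways, which is exactly the per-access overhead the partitioned design attacks.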
Partitioned Cache Architecture

The address is again split into tag | set | offset, but each load/store carries extra fields: Ld/St Reg, [Addr], [k-bit vector], [R/U]. The ways form partitions P0-P3.
- Lookup: restricted to the partitions specified in the bit-vector if the R bit is set; otherwise defaults to all partitions
- Replacement: restricted to the partitions specified in the bit-vector

Advantages
+ Improve performance by controlling replacement
+ Reduce cache access power by restricting the number of partitions accessed
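The lookup and replacement rules above can be sketched as a small software model; this is an illustrative rendering of the policy, not the hardware itself, and the fill policy on a miss is simplified (no LRU within a partition).

```python
# Sketch of the partitioned lookup/replacement policy: each way is a
# partition, and every access carries a partition bit-vector plus an
# R/U flag that decides whether lookup is restricted to those partitions.
class PartitionedCache:
    def __init__(self, num_sets=16, ways=4, block=32):
        self.num_sets, self.ways, self.block = num_sets, ways, block
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.tag_checks = 0

    def access(self, addr, bitvec, restricted=True):
        blk = addr // self.block
        idx, tag = blk % self.num_sets, blk // self.num_sets
        # R: probe only the partitions named in the bit-vector;
        # U: fall back to probing all partitions.
        probe = [w for w in range(self.ways) if bitvec[w]] if restricted \
                else list(range(self.ways))
        self.tag_checks += len(probe)
        for w in probe:
            if self.tags[idx][w] == tag:
                return True                   # hit
        # Replacement is always restricted to the bit-vector's partitions.
        victims = [w for w in range(self.ways) if bitvec[w]]
        self.tags[idx][victims[0]] = tag      # simplified fill, no LRU
        return False

c = PartitionedCache()
assert c.access(0, [1, 0, 0, 0]) is False    # cold miss, 1 tag check
assert c.access(0, [1, 0, 0, 0]) is True     # hit, 1 tag check
assert c.tag_checks == 2
```

An R-annotated access thus pays for exactly as many tag checks as its bit-vector has set bits, instead of one per way.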
Partitioned Caches: Example

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];   // ld1/st1 (y), ld3 (w1), ld5 (x)
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];   // ld2/st2 (y), ld4 (w2), ld6 (x)
}

Partition assignment over a 3-way cache:
- way-0: ld1, st1, ld2, st2 (y), annotated ld1 [100], R
- way-1: ld5, ld6 (x), annotated ld5 [010], R
- way-2: ld3, ld4 (w1/w2), annotated ld3 [001], R

Each inner-loop iteration issues four memory operations; probing all 3 ways costs 12 tag checks, while probing one partition per access costs 4. The number of tag checks per iteration drops from 12 to 4!
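The 12-to-4 claim reduces to one line of arithmetic, assuming four memory operations per inner-loop iteration (a load and a store of y, a load of w1 or w2, and a load of x) on the 3-way cache shown:

```python
# Tag checks per inner-loop iteration, assuming 4 memory operations
# (ld y, st y, ld w, ld x) on the 3-way cache of the example.
WAYS = 3
OPS_PER_ITER = 4

conventional = OPS_PER_ITER * WAYS   # every access probes all ways
partitioned = OPS_PER_ITER * 1       # each R-bit access probes one partition

assert (conventional, partitioned) == (12, 4)
```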
Compiler Controlled Data Partitioning

Goal: place loads/stores into cache partitions
- Analyze the application's memory characteristics
  - Cache requirements: number of partitions per load/store
  - Predict conflicts
- Place loads/stores into different partitions
  - Satisfy each instruction's caching needs
  - Avoid conflicts; overlap if possible
Cache Analysis: Estimating Number of Partitions

Access stream for block B1 (j-loop, then k-loop): X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y. Each reference M in the stream has working-set size = 1.
- Find the minimal number of partitions that avoids conflict/capacity misses
- Use a probabilistic hit-rate estimate
- Use the working set to compute the number of partitions
Cache Analysis: Estimating Number of Partitions

[Figure: estimated hit rate vs. number of cache blocks (8-32) and associativity (1-4) for reuse distances D = 0, 1, and 2]

- Avoid conflict/capacity misses for an instruction
- Estimate the hit rate from the reuse distance (D), the total number of cache blocks (B), and the associativity (A)
- In reality, compute energy matrices and pick the most energy-efficient configuration per instruction (Brehob et al., '99)
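The estimate can be sketched as follows. The binomial form below is one common reading of the Brehob-style model and is an assumption here, not necessarily the paper's exact formula: a reference with reuse distance D hits in a B-block, A-way cache if fewer than A of the D intervening unique blocks fall into its set, each doing so with probability A/B.

```python
# Sketch (assumed model, in the spirit of Brehob et al.): probabilistic
# hit-rate estimate from reuse distance D, cache blocks B, associativity A.
from math import comb

def hit_rate(D, B, A):
    p = A / B   # chance an intervening block maps into the same set
    # Hit if 0..A-1 of the D intervening blocks land in this set.
    return sum(comb(D, i) * p**i * (1 - p)**(D - i) for i in range(A))

def reuse_distance(trace, i):
    """Distinct blocks touched between reference i and the previous
    reference to the same block (None if there is no previous use)."""
    blk, seen = trace[i], set()
    for j in range(i - 1, -1, -1):
        if trace[j] == blk:
            return len(seen)
        seen.add(trace[j])
    return None

# D = 0 always hits; larger D lowers the estimated hit rate.
assert hit_rate(0, 32, 1) == 1.0
assert hit_rate(2, 32, 1) < hit_rate(1, 32, 1)
```

Per instruction, the compiler can then evaluate such estimates (or full energy matrices) for each candidate partition count and keep the cheapest one.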
Cache Analysis: Computing Interferences

- Avoid conflicts among temporally co-located references
- Model conflicts using an interference graph

For the access stream X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y, the fused instructions M1-M4 each have D = 1; edges connect references that are temporally co-located: M1-M2, M1-M4, M2-M4, M1-M3, M3-M4.
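A simple way to build such a graph, assuming (as in the example) that two fused instructions interfere when they execute in the same loop region, is to connect all pairs within each region. The region contents below follow the example: the j-loop touches X (M1), W1 (M2), and Y (M4); the k-loop touches X (M1), W2 (M3), and Y (M4).

```python
# Sketch: interference graph from loop regions, assuming two fused memory
# instructions interfere iff they are live in the same loop region.
from itertools import combinations

def interference_graph(regions):
    edges = set()
    for refs in regions:
        for a, b in combinations(sorted(refs), 2):
            edges.add((a, b))
    return edges

regions = [{"M1", "M2", "M4"},   # j-loop: X, W1, Y
           {"M1", "M3", "M4"}]   # k-loop: X, W2, Y
g = interference_graph(regions)
# W1 (M2) and W2 (M3) never interfere: they live in different loops.
assert ("M2", "M3") not in g
```

The absence of an M2-M3 edge is what later lets w1 and w2 share a partition.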
Partition Assignment

- The placement phase can overlap references, so compute the combined working set
- Use the graph-theoretic notion of a clique
- For each clique, the new D is the sum of the D values of its nodes
- The combined D over all overlaps is the max over all cliques

Example (nodes M1-M4, each with D = 1):
- Clique 1: M1, M2, M4 -> new reuse distance D = 3
- Clique 2: M1, M3, M4 -> new reuse distance D = 3
- Combined reuse distance: max(3, 3) = 3
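The combined working-set rule can be sketched directly: enumerate maximal cliques of the interference graph, sum D within each clique, and take the max across cliques. Basic Bron-Kerbosch enumeration is an illustrative choice here; the paper's exact enumeration method is not specified.

```python
# Sketch: combined reuse distance via maximal-clique enumeration
# (plain Bron-Kerbosch, an illustrative choice).
def maximal_cliques(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    found = []
    def bk(R, P, X):
        if not P and not X:
            found.append(R)          # R is maximal: nothing extends it
        for v in list(P):
            bk(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)
    bk(set(), set(nodes), set())
    return found

def combined_D(nodes, edges, D):
    # New D per clique = sum of member Ds; combined D = max over cliques.
    return max(sum(D[n] for n in c) for c in maximal_cliques(nodes, edges))

nodes = ["M1", "M2", "M3", "M4"]
edges = [("M1", "M2"), ("M1", "M3"), ("M1", "M4"),
         ("M2", "M4"), ("M3", "M4")]
D = {n: 1 for n in nodes}
assert combined_D(nodes, edges, D) == 3   # cliques {M1,M2,M4}, {M1,M3,M4}
```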
Experimental Setup

- Trimaran compiler and simulator infrastructure
- ARM9 processor model
- Cache configurations:
  - 1-KB to 32-KB
  - 32-byte block size
  - 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
- Mediabench suite
- CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks

[Chart: average way accesses per access (0-8) for 8-, 4-, and 2-partition caches across cache sizes 1-KB to 32-KB, plus the average]

36% reduction on an 8-partition cache
Improvement in Fetch Energy

[Chart: percentage energy improvement (0-60%) of 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way for a 16-KB cache, across the Mediabench programs (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and their average]
Summary

- Maintains the advantages of a hardware cache
- Exposes placement and lookup decisions to the compiler
  - Avoids conflicts, eliminates redundancies
- 24% energy savings for a 4-KB cache with 4 partitions
- Extensions:
  - Hybrid scratch-pads and caches
  - Disable selected tags to convert partitions into scratch-pads
  - 35% additional savings in a 4-KB cache with 1 partition as scratch-pad
Thank You & Questions
Cache Analysis Step 1: Instruction Fusioning

- Combine loads/stores that access the same set of objects, identified by points-to analysis
- Avoids coherence problems and duplication
- The fused groups become the instructions M1, M2, ... used in the later analysis

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < readInput1(); j++)
    y[i + j] += *w1++ + x[i + j];   // ld1/st1, ld3, ld5
  for (k = 0; k < readInput2(); k++)
    y[i + k] += *w2++ + x[i + k];   // ld2/st2, ld4, ld6
}
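Fusioning can be sketched as a union-find pass over per-instruction points-to sets, assuming the alias analysis yields one object set per load/store (the sets below are taken from the example: the y accesses alias each other, ld5/ld6 alias x, while ld3 and ld4 touch the distinct arrays w1 and w2).

```python
# Sketch: instruction fusioning via union-find over points-to sets.
# Instructions whose points-to sets overlap are merged into one fused
# node, so one data object is never cached in two partitions at once.
def fuse(points_to):
    parent = {i: i for i in points_to}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    insns = list(points_to)
    for a in insns:
        for b in insns:
            if a < b and points_to[a] & points_to[b]:
                parent[find(a)] = find(b)   # shared object => same group
    groups = {}
    for i in insns:
        groups.setdefault(find(i), set()).add(i)
    return sorted(sorted(g) for g in groups.values())

pt = {"ld1": {"y"}, "st1": {"y"}, "ld2": {"y"}, "st2": {"y"},
      "ld3": {"w1"}, "ld4": {"w2"}, "ld5": {"x"}, "ld6": {"x"}}
# Four fused groups: the y accesses, ld3, ld4, and the x accesses.
assert fuse(pt) == [["ld1", "ld2", "st1", "st2"], ["ld3"], ["ld4"],
                    ["ld5", "ld6"]]
```

ld3 and ld4 stay separate here because their points-to sets are disjoint; placing them in the same way is a later placement decision, not a fusioning one.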
Partition Assignment

- Greedily place instructions based on their cache estimates
- Overlap instructions if required
- Compute the number of partitions for overlapped instructions:
  - Enumerate cliques within the interference graph
  - Compute the combined working set of all cliques
- Assign the R/U bit to control lookup

Example: nodes M1-M4, each with D = 1; Clique 1 = {M1, M2, M4}, Clique 2 = {M1, M3, M4}
Related Work

- Direct-addressed and cool caches [Unsal '01, Asanovic '01]
  - Tags maintained in registers that are addressed within loads/stores
- Split temporal/spatial cache [Rivers '96]
  - Hardware managed, two partitions
- Column partitioning [Devadas '00]
  - Individual ways can be configured as a scratch-pad
  - No load/store-based partitioning
- Region-based caching [Tyson '02]
  - Heap, stack, globals
  - Our approach offers finer-grained control and management
- Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]
  - Reduce tag-check power
  - Compromise cycle time
  - Orthogonal to our technique
Code Size Overhead

[Chart: percentage of instructions (0-12%) from annotated loads/stores and extra MOV instructions across the Mediabench programs and their average; two outliers reach 15% and 16%]