Clustered Data Cache Designs for VLIW Processors
PhD Candidate: Enric Gibert
Advisors: Antonio González, Jesús Sánchez

Motivation
Two major problems in processor design
– Wire delays
– Energy consumption
Data from D. Matzke, "Will Physical Scalability Sabotage Performance Gains?" in IEEE Computer 30(9), pp , 1997

Clustering
[Figure: CLUSTER 1 to CLUSTER 4, each with a register file (Reg. File) and functional units (FUs). Labels: register-to-register communication buses, memory buses, L1 cache, L2 cache.]

Data Cache
Latency
Energy
– Leakage will soon dominate energy consumption
– Cache memories will probably be the main source of leakage
In this Thesis:
– Latency Reduction Techniques
– Energy Reduction Techniques
(S. Hill, Hot Chips 13)
SIA projections (64KB cache): 100 nm, 70 nm, 50 nm, 35 nm; 4 cycles, 5 cycles, 7 cycles

Contributions of this Thesis
Memory hierarchy for clustered VLIW processors
– Latency Reduction Techniques
   Distribution of the Data Cache among clusters
   Cost-effective cache coherence solutions
   Word-Interleaved distributed data cache
   Flexible Compiler-Managed L0 Buffers
– Energy Reduction Techniques
   Heterogeneous Multi-module Data Cache
      Unified processors
      Clustered processors

Evaluation Framework
IMPACT C compiler
– Compile + optimize + memory disambiguation
Mediabench benchmark suite
   Benchmark    Profile       Execution
   adpcmdec     clinton       S_16_44
   adpcmenc     clinton       S_16_44
   epicdec      test_image    titanic
   epicenc      test_image    titanic
   g721dec      clinton       S_16_44
   g721enc      clinton       S_16_44
   gsmdec       clinton       S_16_44
   gsmenc       clinton       S_16_44
   jpegdec      testimg       monalisa
   jpegenc      testimg       monalisa
   mpeg2dec     mei16v2       tek6
   pegwitdec    pegwit        techrep
   pegwitenc    pgptest       techrep
   pgpdec       pgptext       techrep
   pgpenc       pgptest       techrep
   rasta        ex5_c1
Microarchitectural VLIW simulator

Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Distributing the Data Cache
[Figure: CLUSTER 1 to CLUSTER 4, each with a register file and functional units, connected by register-to-register communication buses. The L1 cache is split into four L1 cache modules, one per cluster. Labels: memory buses, L2 cache.]

MultiVLIW (Sánchez and González, MICRO33)
[Figure: CLUSTER 1 to CLUSTER 4, each with a register file, functional units, and its own L1 cache module, connected by register-to-register communication buses and backed by an L2 cache. Labels: MSI cache coherence protocol, cache block.]

Memory Coherence
[Figure: CLUSTER 1 to CLUSTER 4 with cache modules, connected through memory buses to the NEXT MEMORY LEVEL; X resides in one cache module. Schedule: one cluster issues "store to X" at cycle i and another issues "load from X" at cycle i+4. The store sends "Update X" with the new value of X, and the load sends "Read X" and expects the new value of X.]
Bus traffic: remote accesses, misses, replacements, others
NON-DETERMINISTIC BUS LATENCY!

Coherence Solutions: Overview
Local scheduling solutions applied to loops
– Memory Dependent Chains (MDC)
– Data Dependence Graph Transformations (DDGT)
   Store replication
   Load-store synchronization
Software-based solutions with little hardware support
Applicable to different configurations
– Word-interleaved cache
– Replicated distributed cache
– Flexible Compiler-Managed L0 Buffers

Scheme 1: Mem. Dependent Chains
Sets of memory dependent instructions
– Memory disambiguation by the compiler
   Conservative assumptions
– Assign instructions in same set to same cluster
[Figure: a dependence graph with LD, ADD, and ST nodes linked by register deps and memory deps; "store to X", "load from X", and "store to X" are all issued from the same cluster, whose cache module holds X.]
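
A minimal sketch of the MDC idea, not the thesis implementation: memory instructions that the compiler cannot prove independent are merged into sets with a union-find, and every set is placed in a single cluster. The function names, the input format (a list of may-alias pairs), and the "least loaded cluster" placement policy are assumptions made for illustration.

```python
from collections import defaultdict


def memory_dependent_chains(mem_instrs, may_alias_pairs):
    """Group memory instructions connected by (possible) memory dependences."""
    parent = {ins: ins for ins in mem_instrs}

    def find(x):
        # Follow parent links to the set representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in may_alias_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    chains = defaultdict(list)
    for ins in mem_instrs:
        chains[find(ins)].append(ins)
    return list(chains.values())


def assign_chains_to_clusters(chains, n_clusters):
    """Assumed policy: largest chain first, each to the least loaded cluster."""
    load = [0] * n_clusters
    assignment = {}
    for chain in sorted(chains, key=len, reverse=True):
        cluster = load.index(min(load))
        for ins in chain:
            assignment[ins] = cluster  # whole chain shares one cluster
        load[cluster] += len(chain)
    return assignment
```

With conservative disambiguation the pair list grows, chains get longer, and the scheduler has less freedom in placing instructions.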

Scheme 2: DDG transformations (I)
2 transformations applied together
Store replication overcomes memory-flow (MF) and memory-output (MO) dependences
– Little support from the hardware
[Figure: "store to X" is replicated in CLUSTER 1 to CLUSTER 4, each with a cache module; the copy in the cluster whose cache module holds X (CLUSTER 2) is the local instance and the other copies are remote instances. "load from X" is then free to be scheduled in any cluster.]
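
Scheme 2: DDG transformations (II)
Load-store synchronization overcomes memory-anti (MA) dependences
[Figure: a dependence graph with LD, add, and ST nodes; the LD feeds the add through a register-flow (RF) edge, and the MA edge from LD to ST is replaced by a SYNC edge from the add. "load from X" and "store to X" are scheduled in different clusters (CLUSTER 1 to CLUSTER 4, with X in one cache module).]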

Results: Memory Coherence
Memory Dependent Chains (MDC)
– Bad because it restricts the assignment of instructions to clusters
– Good when memory disambiguation is accurate
DDG Transformations (DDGT)
– Good when there is pressure on the memory buses
   Increases the number of local accesses
– Bad when there is pressure on the register buses
   Big increase in inter-cluster communications
Solutions useful for different cache schemes

Word-Interleaved Cache
Simplify hardware
– As compared to MultiVLIW
Avoid replication
Strides +1/-1 element are predominant
– Page interleaved
– Block interleaved
– Word interleaved: best suited

Architecture
[Figure: CLUSTER 1 to CLUSTER 4, each with a register file, functional units, and a cache module, connected by register-to-register communication buses and backed by an L2 cache. A cache block (TAG, W0 to W7) is split into four subblocks, one per cache module: TAG W0 W4 (subblock 1), W1 W5, W2 W6, W3 W7. Access types: local hit, remote hit, local miss, remote miss.]

Instruction Scheduling (I): Unrolling
Cache modules: CLUSTER 1 holds a[0] a[4], CLUSTER 2 holds a[1] a[5], CLUSTER 3 holds a[2] a[6], CLUSTER 4 holds a[3] a[7]
Original loop: 25% of local accesses
   for (i=0; i<MAX; i++) { ld … }
Loop unrolled by 4, one ld per cluster: 100% of local accesses
   for (i=0; i<MAX; i=i+4) { ld ld ld ld … }
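
The 25% and 100% figures can be reproduced with a small model. This is a sketch under stated assumptions: elements are one word wide, a[0] maps to the first cache module (index 0 here, CLUSTER 1 on the slide), and the helper names are hypothetical.

```python
N_CLUSTERS = 4


def home_cluster(index, n_clusters=N_CLUSTERS):
    """Word-interleaved mapping: element i lives in cache module i mod n_clusters."""
    return index % n_clusters


def local_fraction(accesses):
    """accesses: iterable of (issuing_cluster, element_index) pairs."""
    accesses = list(accesses)
    local = sum(1 for cluster, idx in accesses if home_cluster(idx) == cluster)
    return local / len(accesses)


def rolled_loop(max_i, cluster=0):
    # One load instruction, statically scheduled in a single cluster.
    return [(cluster, i) for i in range(max_i)]


def unrolled_loop(max_i, factor=N_CLUSTERS):
    # Unrolled by `factor`: copy k of the load is scheduled in cluster k.
    return [(k, i + k) for i in range(0, max_i, factor) for k in range(factor)]


print(local_fraction(rolled_loop(16)))    # 0.25
print(local_fraction(unrolled_loop(16)))  # 1.0
```

After unrolling, each copy of the load has a stride of 4 elements, so it always touches the same cache module and can be scheduled in that module's cluster.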

Instruction Scheduling (II)
Assign appropriate latency to memory instructions
– Small latencies: ILP ↑, stall time ↑
– Large latencies: ILP ↓, stall time ↓
– Start with a large latency (remote miss) + iteratively reassign appropriate latencies (local miss, remote hit, local hit)
[Figure: two schedules over Cluster 1, C2, C3, C4 for an LD feeding an add through a register-flow (RF) dependence. With small latencies the schedule fits in cycles 1 to 3; with large latencies the add is delayed and the schedule spans cycles 1 to 5.]

Instruction Scheduling (III)
Assign instructions to clusters
– Non-memory instructions
   Minimize inter-cluster communications
   Maximize workload balance among clusters
– Memory instructions: 2 heuristics
   Preferred cluster (PrefClus)
      Average preferred cluster of the memory dependent set
   Minimize inter-cluster communications (MinComs)
      Min. comms. for the 1st instruction of the memory dependent set

Memory Accesses
Sources of remote accesses:
– Indirect, chains restrictions, double precision, …

Attraction Buffers
Cost-effective mechanism to increase local accesses
[Figure: CLUSTER 1 to CLUSTER 4, each with a cache module (holding a[0] a[4], a[1] a[5], a[2] a[6], a[3] a[7] respectively) and an Attraction Buffer (AB). A loop executing "load a[i]; i=i+4" from i=0 brings a[0] a[4] into the AB of the cluster that issues the load; local accesses go from 0% to 50%.]
Results
– ~15% INCREASE in local accesses
– ~30-35% REDUCTION in stall time
– 5-7% REDUCTION in overall execution time

Performance

Why L0 Buffers
Still keep hardware simple, but…
Allow dynamic binding between addresses and clusters

L0 Buffers: Flexible Compiler-Managed L0 Buffers
Small number of entries, flexibility
– Adaptive to the application + dynamic address-cluster binding
Controlled by software: load/store hints
– Mark instructions to access the buffers: which and how
[Figure: CLUSTER 1 to CLUSTER 4, each with a register file, INT, FP, and MEM units, and an L0 buffer; the L0 buffers connect to the shared L1 cache through unpack logic.]

Mapping Flexibility
[Figure: an L1 block (16 bytes) in the L1 cache holds a[0] to a[7]; each of CLUSTER 1 to CLUSTER 4 has an L0 Buffer with 4-byte entries, filled through the unpack logic.]
Linear mapping
– "load a[0]" with stride 1 element: the L0 Buffer entry receives a[0] a[1]
Interleaved mapping (1 cycle penalty)
– "load a[0]", "load a[1]", "load a[2]", "load a[3]" in clusters 1 to 4, all loads with a 4-element stride
– L0 Buffer entries receive a[0] a[4], a[1] a[5], a[2] a[6], a[3] a[7]
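
The two mappings can be sketched as list slicing. This is an illustration, not the hardware unpack logic: it assumes 2-byte elements (so a 16-byte L1 block holds a[0] to a[7] and a 4-byte L0 entry holds two elements), and that under interleaved mapping cluster c receives slice c of the block. The function names are hypothetical.

```python
def linear_subblock(block, element, entry_elems):
    """Linear mapping: the L0 entry gets the consecutive elements
    of the subblock that contains the requested element."""
    start = (element // entry_elems) * entry_elems
    return block[start:start + entry_elems]


def interleaved_subblocks(block, n_clusters):
    """Interleaved mapping: the block is dealt out element by element,
    so cluster c receives elements c, c + n_clusters, ..."""
    return [block[c::n_clusters] for c in range(n_clusters)]


block = [f"a[{i}]" for i in range(8)]  # one 16-byte L1 block
print(linear_subblock(block, 0, 2))    # ['a[0]', 'a[1]']
print(interleaved_subblocks(block, 4))
# [['a[0]', 'a[4]'], ['a[1]', 'a[5]'], ['a[2]', 'a[6]'], ['a[3]', 'a[7]']]
```

Linear mapping suits a load with unit stride kept in one cluster; interleaved mapping suits a loop unrolled by 4 whose loads each have a 4-element stride.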

Hints and L0-L1 Interface
Memory hints
– Access or bypass the L0 Buffers
– Data mapping: linear/interleaved
– Prefetch hints: next/previous blocks
L0 Buffers are write-through with respect to L1
– Simplifies replacements
– Makes hardware simple
   No arbitration
   No logic to pack data back correctly
– Simplifies coherence among L0 Buffers

Instruction Scheduling
Selective loop unrolling
– No unroll vs. unroll by N
Assign latencies to memory instructions
– Critical instructions (slack) use L0 Buffers
– Do not overflow L0 Buffers
   Use a counter of L0 Buffer free entries per cluster
   Do not schedule a critical instruction into a cluster with counter == 0
– Memory coherence
Cluster assignment + schedule instructions
– Minimize global communications
– Maximize workload balance
– Critical: priority to clusters where the L0 Buffer can be used
Explicit prefetching
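
A minimal sketch of the free-entry counter rule. It assumes that each critical memory instruction consumes one L0 Buffer entry and that, when no cluster has a free entry, the instruction falls back to bypassing the L0 Buffers in its most preferred cluster; both are assumptions, as is the `pick_cluster` name.

```python
def pick_cluster(free_entries, candidates, critical):
    """Choose a cluster for a memory instruction.

    free_entries: remaining L0 Buffer entries per cluster (updated in place).
    candidates:   clusters in preference order (e.g. fewest communications first).
    Returns (cluster, uses_l0).
    """
    if critical:
        for c in candidates:
            if free_entries[c] > 0:      # never pick a cluster with counter == 0
                free_entries[c] -= 1
                return c, True
    # Non-critical instruction, or every counter is 0: bypass the L0 Buffers.
    return candidates[0], False


free = [1, 0]
print(pick_cluster(free, [1, 0], critical=True))   # (0, True)
print(pick_cluster(free, [1, 0], critical=True))   # (1, False)
```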

Number of Entries

Performance

Global Comparative
                                        MultiVLIW   Word Interleaved   L0 Buffers
Hardware Complexity (lower is better)   High        Low                Low
Software Complexity (lower is better)   Low         Medium             High
Performance (higher is better)          High        Medium             High

Motivation
Energy consumption is a first-class design goal
Heterogeneity
– ↓ supply voltage and/or ↑ threshold voltage
Cache memory: ARM10
– D-cache: 24% of dynamic energy
– I-cache: 22% of dynamic energy
Exploit heterogeneity in the L1 D-cache?
[Figure: processor front-end and processor back-end, with some structures tuned for performance and others tuned for energy.]

Multi-Module Data Cache
Instruction-Based Multi-Module (Abella and González, ICCD 2003)
[Figure: PROCESSOR with a CRITICALITY TABLE indexed by the instruction PC and fed from the ROB; a FAST CACHE MODULE and a SLOW CACHE MODULE in front of the L2 D-CACHE.]
Variable-Based Multi-Module
[Figure: the address space is split into a FAST SPACE and a SLOW SPACE, each with its own STACK, HEAP DATA, and GLOBAL DATA; distributed stack frames use two stack pointers (SP1, SP2). The L1 D-CACHE consists of a FAST MODULE and a SLOW MODULE with load/store queues, in front of the L2 D-CACHE.]
It is possible to exploit heterogeneity!

Cache Configurations
Cache modules (8KB)
          Latency   Ports
   FAST   L=2       1 R/W
   SLOW   L=4       1 R/W
   SLOW vs. FAST: latency x2, energy by 1/3
Configurations (FIRST MODULE in CLUSTER 1 + SECOND MODULE in CLUSTER 2)
– FAST+NONE
– FAST+FAST
– SLOW+NONE
– SLOW+SLOW
– FAST+SLOW
[Figure: CLUSTER 1 and CLUSTER 2, each with FU and RF, connected by register buses; the first and second cache modules sit in front of the L2 D-CACHE.]

Instr.-to-Variable Graph (IVG)
Built with profiling information
Variables = global, local, heap
[Figure: memory instructions LD1, LD2, ST1, LD3, ST2, LD4, LD5 linked to the variables they access (VAR V1 to VAR V4); variables are mapped to the FIRST or SECOND address space, and the instructions are split accordingly between the CACHE of CLUSTER 1 and the CACHE of CLUSTER 2.]

Greedy Mapping Algorithm
Initial mapping all to space
Assign affinities to instructions
– Express a preferred cluster for memory instructions: [0,1]
– Propagate affinities to other instructions
Schedule code + refine mapping
Flow: Compute IVG, Compute mapping, Compute affinities + propagate affinities, Schedule code

Computing and Propagating Affinity
[Figure: a dependence graph with nodes add1 to add7, LD1 to LD4, mul1, and ST1, annotated with latencies (L=1, L=3) and slacks (0, 2, 0, 2, 0, 5). In the IVG, LD1, LD2, LD3, LD4, and ST1 access variables V1 to V4, which are mapped to the FIRST or SECOND space. Instructions that access the FIRST MODULE (CLUSTER 1) get AFFINITY=0 and those that access the SECOND MODULE (CLUSTER 2) get AFFINITY=1; propagated affinities take intermediate values (e.g. AFF.=0.4).]

Cluster Assignment
Cluster affinity + affinity range used to:
– Define a preferred cluster
– Guide the instruction-to-cluster assignment process
Strongly preferred cluster
– Schedule the instruction in that cluster
Weakly preferred cluster
– Schedule the instruction where global comms. are minimized
Affinity range (0.3, 0.7): affinity ≤ 0.3 or ≥ 0.7 is a strong preference
[Figure: instructions IA (Affinity=0), IB (Affinity=0.9), and IC (Affinity=0.4) with variables V1 and V2 mapped to the CACHE of CLUSTER 1 or CLUSTER 2; IA and IB go to their strongly preferred clusters, while the cluster for IC is left open (?).]
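
The rule above can be sketched as follows, for two clusters with affinity 0 meaning the first cluster and affinity 1 the second. The function names, the `comm_cost` input (estimated global communications if the instruction is placed in each cluster), and the tie-break towards the weakly preferred cluster are assumptions for illustration.

```python
def classify(affinity, low=0.3, high=0.7):
    """Return (strength, preferred_cluster) for an affinity in [0, 1]."""
    if affinity <= low:
        return "strong", 0
    if affinity >= high:
        return "strong", 1
    return "weak", 0 if affinity < 0.5 else 1


def choose_cluster(affinity, comm_cost, low=0.3, high=0.7):
    """comm_cost[c]: estimated global communications if placed in cluster c."""
    strength, preferred = classify(affinity, low, high)
    if strength == "strong":
        return preferred
    # Weak preference: minimize communications; ties go to the preferred cluster.
    return min(range(len(comm_cost)), key=lambda c: (comm_cost[c], c != preferred))


print(choose_cluster(0.0, [5, 0]))  # 0: strong preference wins over comms
print(choose_cluster(0.9, [0, 5]))  # 1
print(choose_cluster(0.4, [3, 1]))  # 1: weak preference, fewer comms in cluster 1
```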

EDD Results
Best configuration by application sensitivity:
                              Memory Ports: Sensitive   Memory Ports: Insensitive
Memory Latency: Sensitive     FAST+FAST                 FAST+NONE
Memory Latency: Insensitive   SLOW+SLOW                 SLOW+NONE
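
A small sketch of how configurations could be ranked, assuming ED denotes the energy-delay product and EDD the energy-delay-squared product (the slides do not spell out the definitions). The numbers below are made up for illustration and are not results from the thesis.

```python
def ed(energy, delay):
    """Energy-delay product (assumed meaning of ED)."""
    return energy * delay


def edd(energy, delay):
    """Energy-delay^2 product (assumed meaning of EDD); weighs delay more."""
    return energy * delay ** 2


def best_config(results, metric):
    """results: {name: (energy, delay)}, normalized to some baseline."""
    return min(results, key=lambda name: metric(*results[name]))


# Illustrative, made-up numbers: a slow module saves energy but adds delay.
results = {"FAST+FAST": (1.0, 1.0), "SLOW+SLOW": (0.6, 1.4)}
print(best_config(results, ed))   # SLOW+SLOW
print(best_config(results, edd))  # FAST+FAST
```

Because EDD penalizes delay quadratically, latency-sensitive applications tend to favor FAST modules under EDD even when SLOW modules win under ED.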

Other Results
[Table: EDD and ED of the BEST, UNIFIED FAST, and UNIFIED SLOW configurations]
ED
– The SLOW schemes are better
In all cases, these schemes are better than unified cache
– 29-31% better in EDD, 19-29% better in ED
No configuration is best for all cases

Reconfigurable Cache Results
The OS can set each module in one state:
– FAST mode / SLOW mode / Turned-off
The OS reconfigures the cache on a context switch
– Depending on the applications scheduled in and scheduled out
Two different V_DD and V_TH for the cache
– Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
Simple heuristic to show potential
– For each application, choose the estimated best cache configuration
         BEST DISTRIBUTED    RECONFIGURABLE SCHEME
   EDD   0.89 (FAST+SLOW)    0.86
   ED    0.89 (SLOW+SLOW)    0.86

Conclusions
Cache partitioning is a good latency reduction technique
Cache heterogeneity can be exploited to improve energy efficiency
The most energy- and performance-efficient scheme is a distributed data cache
– Dynamic vs. static mapping between addresses and clusters
   Dynamic for performance (L0 Buffers)
   Static for energy consumption (Variable-Based mapping)
– Hardware vs. software-based memory coherence solutions
   Software solutions are viable

List of Publications
Distributed Data Cache Memories
– ICS, 2002
– MICRO-35, 2002
– CGO-1, 2003
– MICRO-36, 2003
– IEEE Transactions on Computers, October 2005
– Concurrency & Computation: Practice and Experience (to appear late ’05 / ’06)
Heterogeneous Data Cache Memories
– Technical report UPC-DAC-RR-ARCO , 2004
– PACT, 2005

Questions…