Clustered Data Cache Designs for VLIW Processors
PhD Candidate: Enric Gibert
Advisors: Antonio González, Jesús Sánchez


Slide 1: Clustered Data Cache Designs for VLIW Processors. PhD Candidate: Enric Gibert. Advisors: Antonio González, Jesús Sánchez

Slide 2: Motivation
Two major problems in processor design:
– Wire delays
– Energy consumption
Data from D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), 1997

Slide 3: Clustering
[Figure: four clusters, each with a register file and FUs, connected by register-to-register communication buses; a shared L1 cache backed by an L2 cache over memory buses]

Slide 4: Data Cache
Latency (S. Hill, Hot Chips 13; SIA projections for a 64KB cache):
100 nm | 70 nm | 50 nm | 35 nm
4 cycles | 5 cycles | 7 cycles | –
Energy:
– Leakage will soon dominate energy consumption
– Cache memories will probably be the main source of leakage
In this thesis:
– Latency reduction techniques
– Energy reduction techniques

Slide 5: Contributions of this Thesis
Memory hierarchy for clustered VLIW processors
– Latency reduction techniques
  - Distribution of the data cache among clusters
  - Cost-effective cache coherence solutions
  - Word-interleaved distributed data cache
  - Flexible Compiler-Managed L0 Buffers
– Energy reduction techniques
  - Heterogeneous multi-module data cache, for unified and clustered processors

Slide 6: Evaluation Framework
IMPACT C compiler (compile + optimize + memory disambiguation)
Microarchitectural VLIW simulator
Mediabench benchmark suite (profile / execution inputs):
– adpcmdec, adpcmenc: clintonS_16_44
– epicdec, epicenc: test_image / titanic
– g721dec, g721enc: clintonS_16_44
– gsmdec, gsmenc: clintonS_16_44
– jpegdec, jpegenc: testimg / monalisa
– mpeg2dec: mei16v2 / tek6
– pegwitdec: pegwit / techrep
– pegwitenc: pgptest / techrep
– pgpdec: pgptext / techrep
– pgpenc: pgptest / techrep
– rasta: ex5_c1

Slide 7: Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Slide 8: Distributing the Data Cache
[Figure: the L1 cache is split into one L1 cache module per cluster; clusters keep their register-to-register communication buses, and the modules connect to the L2 cache over memory buses]

Slide 9: MultiVLIW (Sánchez and González, MICRO-33)
[Figure: one L1 cache module per cluster, kept coherent at cache-block granularity by an MSI cache coherence protocol backed by the L2 cache]

Slide 10: Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Slide 11: Memory Coherence
[Example: in cycle i, cluster 1 stores to X; the update travels to cluster 4's cache module over the memory buses; in cycle i+4, cluster 4 loads from X and must see the new value]
The bus latency is non-deterministic: remote accesses, misses, replacements and other traffic share the memory buses, so the compiler cannot know when the update arrives.

Slide 12: Coherence Solutions: Overview
Local scheduling solutions, applied to loops:
– Memory Dependent Chains (MDC)
– Data Dependence Graph Transformations (DDGT): store replication, load-store synchronization
Software-based solutions with little hardware support, applicable to different configurations:
– Word-interleaved cache
– Replicated distributed cache
– Flexible Compiler-Managed L0 Buffers

Slide 13: Scheme 1: Memory Dependent Chains
– Build sets of memory-dependent instructions, using compiler memory disambiguation with conservative assumptions
– Assign all instructions in the same set to the same cluster
[Example: a store to X, a load from X and a second store to X, linked by register and memory dependences, are all scheduled on the cluster whose cache module holds X]
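A minimal sketch of how a compiler might form these sets, assuming a plain union-find over explicit memory-dependence edges (the instruction numbering and edge list are hypothetical, not the thesis compiler's IR):

```python
def memory_dependent_sets(n_instrs, mem_deps):
    """Group memory instructions 0..n_instrs-1 into memory-dependent
    sets; mem_deps is a list of (i, j) memory-dependence edges.
    Every instruction in a set must go to the same cluster."""
    parent = list(range(n_instrs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in mem_deps:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # union the two sets

    sets = {}
    for i in range(n_instrs):
        sets.setdefault(find(i), []).append(i)
    return list(sets.values())

# Example: LD0 -> ST1 and ST1 -> LD2 are memory dependent; LD3 is free.
groups = memory_dependent_sets(4, [(0, 1), (1, 2)])
# instructions 0, 1, 2 must share a cluster; 3 can go anywhere
```

Once the sets are known, the cluster assignment heuristics on the later slides pick one cluster per set instead of per instruction.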

Slide 14: Scheme 2: DDG Transformations (I)
Two transformations applied together. Store replication overcomes memory-flow (MF) and memory-output (MO) dependences, with little support from the hardware:
[Example: a store to X executes as a local instance on the cluster holding X and as remote instances on the other clusters, so a later load from X can proceed locally on any cluster]

Slide 15: Scheme 2: DDG Transformations (II)
Load-store synchronization overcomes memory-anti (MA) dependences:
[Example: a load from X sends a synchronization through a register (RF) dependence to a later store to X on another cluster, so the store cannot overwrite X before the load has read it]
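The two transformations can be sketched on a toy DDG, represented as a dict mapping each node to its list of predecessors (node names and the `@c` instance suffix are my own, not the thesis compiler's notation):

```python
def replicate_stores(ddg, stores, n_clusters):
    """Store replication: give each store one instance per cluster
    (one local, the rest remote) so every cache module sees the new
    value, overcoming memory-flow and memory-output dependences."""
    out = {n: list(p) for n, p in ddg.items() if n not in stores}
    for st in stores:
        for c in range(n_clusters):
            out[f"{st}@c{c}"] = list(ddg.get(st, []))  # same predecessors
    return out

def add_load_store_sync(ddg, load, store):
    """Load-store synchronization: add an edge load -> store so the
    store cannot execute before the load reads the old value
    (a memory-anti dependence turned into a scheduling constraint)."""
    ddg.setdefault(store, [])
    if load not in ddg[store]:
        ddg[store].append(load)
    return ddg

ddg = {"add": [], "ST_X": ["add"], "LD_X": []}
ddg = replicate_stores(ddg, ["ST_X"], 4)          # four instances of the store
ddg = add_load_store_sync(ddg, "LD_X", "ST_X@c0") # load must precede the store
```

After the transformation, each cluster schedules its own instance of the store, and the synchronization edge serializes the load before the local instance.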

Slide 16: Results: Memory Coherence
Memory Dependent Chains (MDC):
– Bad when it restricts the assignment of instructions to clusters
– Good when memory disambiguation is accurate
DDG Transformations (DDGT):
– Good when there is pressure on the memory buses (increases the number of local accesses)
– Bad when there is pressure on the register buses (big increase in inter-cluster communications)
Both solutions are useful for different cache schemes.

Slide 17: Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Slide 18: Word-Interleaved Cache
Simplify the hardware compared to MultiVLIW, and avoid replication.
Strides of +1/-1 element are predominant, so among the candidate layouts:
– Page interleaved
– Block interleaved
– Word interleaved → best suited

Slide 19: Architecture
[Figure: each cluster has a cache module backed by the shared L2 cache; a cache block's words are interleaved across the modules as subblocks (e.g. W0 and W4 in cluster 1, W1 and W5 in cluster 2, and so on); an access can be a local hit, local miss, remote hit or remote miss]
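The interleaving itself reduces to simple address arithmetic. A minimal model, assuming 4-byte words and 4 clusters, with the four access outcomes from the slide:

```python
WORD_SIZE = 4
N_CLUSTERS = 4

def home_cluster(addr):
    # Consecutive words of a cache block live in consecutive clusters.
    return (addr // WORD_SIZE) % N_CLUSTERS

def classify(addr, issuing_cluster, block_present):
    # The four outcomes on the slide: local/remote x hit/miss.
    where = "local" if home_cluster(addr) == issuing_cluster else "remote"
    what = "hit" if block_present else "miss"
    return f"{where} {what}"

# a[0], a[1], ... (4-byte elements at base address 0) are dealt round-robin:
assert [home_cluster(4 * i) for i in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
assert classify(4, issuing_cluster=1, block_present=True) == "local hit"
assert classify(8, issuing_cluster=0, block_present=False) == "remote miss"
```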

Slide 20: Instruction Scheduling (I): Unrolling
Original loop: for (i=0; i<MAX; i++) { ld a[i]; ... } places all loads on one cluster, so only every fourth element is local: 25% local accesses.
Unrolled by 4: for (i=0; i<MAX; i=i+4) { ld a[i]; ld a[i+1]; ld a[i+2]; ld a[i+3]; ... } lets each load be assigned to the cluster owning its element: 100% local accesses.
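Under the word-interleaved mapping, the slide's locality numbers can be reproduced with a toy count (assumptions: 4-byte elements starting at address 0, 4 clusters):

```python
WORD = 4
N = 4  # clusters

def home(addr):
    return (addr // WORD) % N

def locality(assignments):
    """assignments: list of (issuing_cluster, address). Returns the
    fraction of accesses whose address lives in the issuing cluster."""
    local = sum(1 for c, a in assignments if home(a) == c)
    return local / len(assignments)

addrs = [4 * i for i in range(16)]          # a[0..15], 4-byte elements

no_unroll = [(0, a) for a in addrs]          # every load issued on cluster 0
unrolled  = [(home(a), a) for a in addrs]    # copy c handles a[4i+c]

assert locality(no_unroll) == 0.25
assert locality(unrolled) == 1.0
```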

Slide 21: Instruction Scheduling (II)
Assign an appropriate latency to each memory instruction:
– Small assumed latencies: ILP ↑, stall time ↑
– Large assumed latencies: ILP ↓, stall time ↓
– Strategy: start with the largest latency (remote miss), then iteratively reassign more appropriate latencies (local miss, remote hit, local hit)
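The iterative reassignment can be sketched as a latency ladder the scheduler walks down, one step per scheduling pass, never past what the access actually turns out to be; the cycle counts here are illustrative, not the thesis's measured values:

```python
# Pessimistic-to-optimistic ladder of assumed access classes.
LADDER = ["remote miss", "local miss", "remote hit", "local hit"]
LATENCY = {"remote miss": 10, "local miss": 8, "remote hit": 3, "local hit": 1}

def refine(assumed, observed):
    """Move one step down the ladder toward the observed class, never
    past it, so stall time shrinks without becoming over-optimistic."""
    i, j = LADDER.index(assumed), LADDER.index(observed)
    return LADDER[min(i + 1, j)] if i < j else assumed

cls = "remote miss"            # initial pessimistic assumption
for _ in range(3):             # three refinement passes
    cls = refine(cls, "local hit")
assert cls == "local hit" and LATENCY[cls] == 1
```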

Slide 22: Instruction Scheduling (III)
Assign instructions to clusters:
– Non-memory instructions: minimize inter-cluster communications, maximize workload balance among clusters
– Memory instructions, 2 heuristics:
  - Preferred cluster (PrefClus): average preferred cluster of the memory dependent set
  - Minimize inter-cluster communications (MinComs): minimize communications for the 1st instruction of the memory dependent set

Slide 23: Memory Accesses
Sources of remote accesses: indirect accesses, chain restrictions, double precision, …

Slide 24: Attraction Buffers
A cost-effective mechanism to increase local accesses: each cluster adds a small Attraction Buffer (AB) that keeps copies of remotely fetched words.
[Example: a loop loads a[i] with i=i+4, so it repeatedly touches a[0] and a[4], which live in cluster 1's module; the ABs raise its local accesses from 0% to 50%]
Results:
– ~15% increase in local accesses
– ~30-35% reduction in stall time
– 5-7% reduction in overall execution time
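A toy model of the idea (my own simplification: a small FIFO of remotely fetched words per cluster) reproduces the 0% to 50% jump from the slide's example:

```python
WORD, N = 4, 4
home = lambda a: (a // WORD) % N   # word-interleaved home cluster

def run(accesses, cluster, ab_entries=4, use_ab=True):
    """Replay a trace of addresses issued by one cluster and return
    the fraction of accesses served locally (module or AB)."""
    ab, local = [], 0
    for a in accesses:
        if home(a) == cluster or (use_ab and a in ab):
            local += 1
        elif use_ab:
            ab.append(a)               # attract the remote word here
            if len(ab) > ab_entries:
                ab.pop(0)              # FIFO replacement (an assumption)
    return local / len(accesses)

# A cluster other than the home one repeatedly reads a[0] and a[4]
# (addresses 0 and 16, both homed on cluster 0):
trace = [0, 16, 0, 16]
assert run(trace, 2, use_ab=False) == 0.0
assert run(trace, 2, use_ab=True) == 0.5   # first touches miss, reuses hit
```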

Slide 25: Performance [figure]

Slide 26: Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Slide 27: Why L0 Buffers
Still keep the hardware simple, but allow dynamic binding between addresses and clusters.

Slide 28: L0 Buffers
– Small number of entries → flexibility: adaptive to the application, plus dynamic address-cluster binding
– Controlled by software through load/store hints: instructions are marked to access the buffers (which buffer, and how)
[Figure: each cluster adds an L0 buffer with unpack logic between its functional units (INT, FP, MEM) and the shared L1 cache]

Slide 29: Mapping Flexibility
[Figure: a 16-byte L1 block holds a[0..3] in 4-byte words; with the interleaved mapping (1 cycle penalty in the unpack logic), loads of a[0..3] with a 4-element stride deal one word to each cluster's L0 buffer; with the linear mapping, a load of a[0] with stride 1 element brings consecutive words a[0], a[1] into the same buffer]
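The two mapping hints can be modeled as two ways of dealing one L1 block's words into the L0 buffers (a sketch: buffer contents here are word labels, not real data, and the function names are mine):

```python
def interleaved_map(block_words, n_clusters):
    """Interleaved hint: the unpack logic deals the block's words
    round-robin, one slice into each cluster's L0 buffer."""
    return {c: block_words[c::n_clusters] for c in range(n_clusters)}

def linear_map(block_words, cluster):
    """Linear hint: consecutive words all land in the requesting
    cluster's L0 buffer."""
    return {cluster: list(block_words)}

block = ["a0", "a1", "a2", "a3"]   # one 16-byte L1 block of 4-byte words
assert interleaved_map(block, 4) == {0: ["a0"], 1: ["a1"], 2: ["a2"], 3: ["a3"]}
assert linear_map(block, 1) == {1: ["a0", "a1", "a2", "a3"]}
```

The interleaved layout suits strided loops where each cluster consumes one element per block; the linear layout suits stride-1 accesses from a single cluster.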

Slide 30: Hints and the L0-L1 Interface
Memory hints:
– Access or bypass the L0 Buffers
– Data mapping: linear/interleaved
– Prefetch hints: next/previous blocks
L0 Buffers are write-through with respect to L1:
– Simplifies replacements
– Makes the hardware simple (no arbitration, no logic to pack data back correctly)
– Simplifies coherence among L0 Buffers

Slide 31: Instruction Scheduling
– Selective loop unrolling: no unroll vs. unroll by N
– Assign latencies to memory instructions: critical instructions (by slack) use the L0 Buffers
– Do not overflow the L0 Buffers: keep a counter of free L0 Buffer entries per cluster, and do not schedule a critical instruction into a cluster whose counter is 0
– Memory coherence
– Cluster assignment + scheduling: minimize global communications, maximize workload balance, and give critical instructions priority for clusters where an L0 Buffer can be used
– Explicit prefetching

Slide 32: Number of Entries [figure]

Slide 33: Performance [figure]

Slide 34: Global Comparative
                                        MultiVLIW | Word-Interleaved | L0 Buffers
Hardware complexity (lower is better):  High      | Low              | Low
Software complexity (lower is better):  Low       | Medium           | High
Performance (higher is better):         High      | Medium           | High

Slide 35: Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Slide 36: Motivation
Energy consumption is a 1st-class design goal.
Heterogeneity: lower the supply voltage and/or raise the threshold voltage, tuning some structures for performance and others for energy.
Cache memory in the ARM10:
– D-cache: 24% of dynamic energy
– I-cache: 22% of dynamic energy
Can heterogeneity be exploited in the L1 D-cache?

Slide 37: Multi-Module Data Cache
– Instruction-Based Multi-Module (Abella and González, ICCD 2003): a criticality table indexed by instruction PC steers loads/stores from the ROB to a FAST or a SLOW cache module
– Variable-Based Multi-Module: the address space is split into a FAST space and a SLOW space (global data, heap data and stack, with distributed stack frames via two stack pointers SP1/SP2); load/store queues feed the FAST and SLOW L1 modules, backed by the L2 D-cache
It is possible to exploit heterogeneity!

Slide 38: Cache Configurations
Two 8KB modules, each with 1 R/W port: FAST (L=2) and SLOW (L=4, twice the latency, one third the energy).
Configurations across the two clusters (first module in cluster 1, second in cluster 2, connected by register buses and backed by the L2 D-cache):
– FAST+NONE
– FAST+FAST
– SLOW+NONE
– SLOW+SLOW
– FAST+SLOW

Slide 39: Instruction-to-Variable Graph (IVG)
– Built with profiling information
– Variables = global, local, heap
– Edges connect each load/store to the variables it accesses; the mapping of variables to the FIRST or SECOND module then determines which cluster's cache each instruction uses
[Example: loads/stores LD1, LD2, ST1, LD3, ST2, LD4, LD5 access variables V1-V4, which are split between the FIRST and SECOND modules]
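Building the IVG from a profile is essentially edge counting. A sketch, assuming the profile is available as a trace of (instruction, variable) pairs:

```python
from collections import defaultdict

def build_ivg(profile):
    """Return the Instruction-to-Variable Graph as a nested dict:
    ivg[instr][var] = number of profiled accesses from instr to var."""
    ivg = defaultdict(lambda: defaultdict(int))
    for instr, var in profile:
        ivg[instr][var] += 1
    return {i: dict(vs) for i, vs in ivg.items()}

# Hypothetical profile trace:
profile = [("LD1", "V1"), ("LD1", "V1"), ("LD2", "V1"), ("ST1", "V2")]
ivg = build_ivg(profile)
assert ivg["LD1"] == {"V1": 2}
assert ivg["ST1"] == {"V2": 1}
```

The edge weights are what the affinity computation on the next slides averages over.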

Slide 40: Greedy Mapping Algorithm
1. Compute the IVG
2. Compute an initial mapping (all variables to one space)
3. Assign affinities to memory instructions, expressing a preferred cluster in [0,1], and propagate them to other instructions
4. Schedule the code and refine the mapping

Slide 41: Computing and Propagating Affinity
[Example DDG: loads LD1-LD4 access variables V1-V4, mapped to the FIRST module (affinity 0) and the SECOND module (affinity 1); affinities propagate through the graph weighted by slack (slacks of 0, 2 and 5 in the example), giving intermediate values such as 0.4 for instructions that feed both sides]

Slide 42: Cluster Assignment
Cluster affinity and an affinity range, e.g. (0.3, 0.7), are used to define a preferred cluster and guide the instruction-to-cluster assignment:
– Affinity ≤ 0.3 or ≥ 0.7: strongly preferred cluster; schedule the instruction in that cluster
– Otherwise: weakly preferred cluster; schedule the instruction where global communications are minimized
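The affinity-range rule can be sketched directly; the (0.3, 0.7) range is the one shown on the slide, while the communications-cost table is hypothetical:

```python
LOW, HIGH = 0.3, 0.7   # affinity range from the slide

def assign(affinity, comms_cost):
    """comms_cost[c] = inter-cluster communications if placed on
    cluster c. Strong preferences win outright; weak ones defer to
    whichever cluster minimizes communications."""
    if affinity <= LOW:
        return 1               # strongly preferred: cluster 1
    if affinity >= HIGH:
        return 2               # strongly preferred: cluster 2
    return min(comms_cost, key=comms_cost.get)   # weakly preferred

assert assign(0.0, {1: 5, 2: 0}) == 1   # strong preference beats cost
assert assign(0.9, {1: 0, 2: 5}) == 2
assert assign(0.4, {1: 3, 2: 1}) == 2   # weak: fewest communications wins
```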

Slide 43: EDD Results
Best configuration by sensitivity:
                                 Memory ports: sensitive | insensitive
Memory latency, sensitive:       FAST+FAST               | FAST+NONE
Memory latency, insensitive:     SLOW+SLOW               | SLOW+NONE

Slide 44: Other Results
– In ED, the SLOW schemes are better
– In all cases, these schemes are better than a unified cache (UNIFIED FAST or UNIFIED SLOW): 29-31% better in EDD, 19-29% better in ED
– No configuration is best for all cases

Slide 45: Reconfigurable Cache Results
The OS can set each module in one state (FAST mode, SLOW mode, or turned off) and reconfigures the cache on a context switch, depending on the applications scheduled in and out.
– Two different VDD and VTH for the cache; reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
– Simple heuristic to show the potential: for each application, choose the estimated best cache configuration
       Best distributed  | Reconfigurable scheme
EDD:   0.89 (FAST+SLOW)  | 0.86
ED:    0.89 (SLOW+SLOW)  | 0.86

Slide 46: Presentation Outline
Latency reduction techniques
– Software memory coherence in distributed caches
– Word-interleaved distributed cache
– Flexible Compiler-Managed L0 Buffers
Energy reduction techniques
– Multi-Module cache for clustered VLIW processor
Conclusions

Slide 47: Conclusions
– Cache partitioning is a good latency reduction technique
– Cache heterogeneity can be exploited for energy efficiency
– The most energy- and performance-efficient scheme is a distributed data cache
  - Dynamic vs. static mapping between addresses and clusters: dynamic for performance (L0 Buffers), static for energy consumption (variable-based mapping)
  - Hardware vs. software memory coherence: software solutions are viable

Slide 48: List of Publications
Distributed data cache memories:
– ICS, 2002
– MICRO-35, 2002
– CGO-1, 2003
– MICRO-36, 2003
– IEEE Transactions on Computers, October 2005
– Concurrency & Computation: Practice and Experience (to appear late '05 / '06)
Heterogeneous data cache memories:
– Technical report UPC-DAC-RR-ARCO, 2004
– PACT, 2005

Slide 49: Questions…