
1 Variable-Based Multi-Module Data Caches for Clustered VLIW Processors
Enric Gibert 1,2, Jaume Abella 1,2, Jesús Sánchez 1, Xavier Vera 1, Antonio González 1,2
1 Intel Barcelona Research Center, Intel Labs, Barcelona
2 Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona

2 Variable-Based Multi-Module Data Caches for Clustered VLIW Processors (PACT’05)
Issue #1: Energy Consumption
First-class design goal
Heterogeneity
– ↓ supply voltage and/or ↑ threshold voltage
Cache memory → ARM10
– D-cache → 24% dynamic energy
– I-cache → 22% dynamic energy
Heterogeneity can be exploited in the D-cache for VLIW processors
[Figure: two processor front-end/back-end pairs, one labeled "Higher performance, Higher energy" and the other "Lower performance, Lower energy"]

3 Issue #2: Wire Delays
From capacity-bound to communication-bound
One possible solution: clustering
Unified cache clustered VLIW processor
– Used as a baseline throughout this work
[Figure: clusters 1 to n, each with a register file and FUs, linked by global communication buses and connected to a single cache through memory buses]

4 Contributions
GOAL: exploit heterogeneity in the L1 D-cache for clustered VLIW processors
Power-efficient distributed L1 data cache
– Divide data cache into two modules and assign each to a cluster
  Modules may be heterogeneous
– Map variables statically between cache modules
– Develop instruction scheduling techniques
Results summary
– Heterogeneous distributed data cache → good design point
– Distributed data cache vs. unified data cache
  Distributed caches outperform unified schemes in EDD and ED
– No single distributed cache configuration is the best
  Reconfigurable distributed cache → allows additional improvements

5 Talk Outline
Variable-Based Multi-Module Data Cache
Distributed Cache Configurations
Instruction Scheduling
Results
Conclusions

6 Variable-Based Multi-Module Cache
Memory instructions have a preferred cluster → cluster affinity
“Wrong” cluster assignment → affects performance, not correctness
Access issued from the “wrong” cluster (e.g. load X, load *p):
– Stall clusters
– Empty communication buses
– Send request
– Access memory
– Send reply back
– Resume execution
Logical address space
– Split into a FIRST SPACE and a SECOND SPACE, each with its own stack, heap data and global data
– Two stack pointers (SP1, SP2) → distributed stack frames
[Figure: cluster 1 (FU, RF) with the FIRST MODULE holding var X and cluster 2 (FU, RF) with the SECOND MODULE holding var Y, connected by register buses and backed by the L2 D-cache]

7 Talk Outline
Variable-Based Multi-Module Data Cache
Distributed Cache Configurations
Instruction Scheduling
Results
Conclusions

8 Distributed Cache Configurations
Cache modules: 8KB, 1 R/W port, FAST or SLOW (from FAST to SLOW: latency ↑, energy ↓)
Configurations (module of cluster 1 + module of cluster 2):
– FAST+NONE: FAST module in cluster 1, no module in cluster 2
– FAST+FAST: FAST module in both clusters
– SLOW+NONE: SLOW module in cluster 1, no module in cluster 2
– SLOW+SLOW: SLOW module in both clusters
– FAST+SLOW: FAST module in cluster 1, SLOW module in cluster 2
[Figure: cluster 1 (FU, RF) with the FIRST MODULE and cluster 2 (FU, RF) with the SECOND MODULE, connected by register buses and backed by the L2 D-cache]

9 Talk Outline
Variable-Based Multi-Module Data Cache
Distributed Cache Configurations
Instruction Scheduling
Results
Conclusions

10 Instructions-to-Variables Graph
Built with profiling information
Variables = global, local, heap
[Figure: memory instructions LD1-LD5, ST1 and ST2 linked to the variables V1-V4 they access; the variables are divided between the FIRST and SECOND spaces, and the instructions are distributed over the two clusters (cache + FU + RF)]
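The slide only states that the graph is built from profiling information. As a minimal sketch, assuming the profile is available as a list of (instruction, variable) pairs, one per dynamic access, the graph can be held as per-instruction access counts. The function name `build_ivg` and the trace below are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def build_ivg(profile):
    """Build an instructions-to-variables graph from profiled accesses.

    `profile` is an iterable of (instruction, variable) pairs, one pair per
    dynamic memory access (assumed input format). The result maps each
    instruction to {variable: number of profiled accesses}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for instruction, variable in profile:
        counts[instruction][variable] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {inst: dict(per_var) for inst, per_var in counts.items()}


# Hypothetical trace: LD1 always touches V1, LD2 touches V1 and V2.
profile = [("LD1", "V1"), ("LD1", "V1"), ("LD2", "V1"), ("LD2", "V2"), ("ST1", "V2")]
ivg = build_ivg(profile)
```

The edge weights (access counts) are what a later step would need in order to decide how strongly an instruction is tied to each cache module.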

11 Greedy Mapping / Scheduling Algorithm
Initial mapping → all to first address space
Assign affinities to instructions
– Express a preferred cluster for memory instructions: [0,1]
– Propagate affinities from memory insts. to other insts.
Schedule code + refine mapping
[Flowchart boxes: Compute IVG; Compute mapping; Compute affinities using IVG + propagate affinities; Schedule code]

12 Computing and Propagating Affinity
[Figure: example dependence graph with instructions add1-add7, mul1, LD1-LD4 and ST1, edge latencies L=1 and L=3, and per-instruction slack values (0, 2, 5). The memory instructions are linked to variables V1-V4, which are mapped to FIRST (AFFINITY=0) or SECOND (AFFINITY=1); one instruction is annotated AFF.=0.4. The two clusters (FU, RF) with the FIRST and SECOND modules are connected by register buses]

13 Cluster Assignment
Cluster affinity + affinity range → used to:
– Define a preferred cluster
– Guide the instruction-to-cluster assignment process
Strongly preferred cluster
– Schedule instruction in that cluster
Weakly preferred cluster
– Schedule instruction where global comms. are minimized
[Figure: affinity range (0.3, 0.7), with affinities ≤ 0.3 and ≥ 0.7 marking a strongly preferred cluster. Example instructions IA, IB and IC with affinities 0, 0.4 and 0.9; access counts of 100 to V1 and of 60 and 40 to V2 and V3; two clusters (cache + FU + RF)]
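The affinity values on this slide are consistent with one simple reading: the affinity of a memory instruction is the fraction of its profiled accesses that go to variables mapped to the second module, and the affinity range decides whether the preference is strong or weak. The sketch below encodes that reading. It is an assumption, not a formula stated in the talk, and the function names, the variable-to-module mapping and the pairing of access counts with instructions are illustrative.

```python
def affinity(access_counts, mapping):
    """Fraction of profiled accesses that target the SECOND module.

    0.0 means every access goes to the first module, 1.0 means every
    access goes to the second one (assumed definition).
    """
    total = sum(access_counts.values())
    to_second = sum(n for var, n in access_counts.items() if mapping[var] == "SECOND")
    return to_second / total


def preferred_cluster(aff, low=0.3, high=0.7):
    """Return 1 or 2 for a strongly preferred cluster, None for a weak preference."""
    if aff <= low:
        return 1
    if aff >= high:
        return 2
    # Weak preference: the scheduler would pick the cluster that
    # minimizes global communications instead.
    return None


# Assumed mapping and access counts, chosen to match the slide's numbers.
mapping = {"V1": "FIRST", "V2": "FIRST", "V3": "SECOND"}
aff_a = affinity({"V1": 100}, mapping)           # all accesses to the first module
aff_b = affinity({"V2": 60, "V3": 40}, mapping)  # 40 of 100 accesses to the second module
choices = [preferred_cluster(a) for a in (aff_a, aff_b, 0.9)]
```

With the (0.3, 0.7) range, the first instruction is strongly tied to cluster 1, the one with affinity 0.9 to cluster 2, and the 0.4 case is left to the communication-minimizing choice.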

14 Talk Outline
Variable-Based Multi-Module Data Cache
Distributed Cache Configurations
Instruction Scheduling
Results
Conclusions

15 Evaluation Framework
IMPACT compiler infrastructure + 16 Mediabench benchmarks
Cache parameters
– CACTI 3.0 + SIA projections + ARM10 datasheets
  Data cache consumes 1/3 of the processor energy
  Leakage accounts for 50% of the total energy
Results outline
– Distributed cache schemes: F+Ø, F+F, F+S, S+S, S+Ø
  Affinity range
  EDD and ED comparison → the lower, the better
  F+Ø used as baseline throughout presentation
– Comparison with a unified cache scheme
  FAST and SLOW unified schemes
  State-of-the-art scheduling techniques for these schemes
– Reconfigurable distributed cache
Cache modules (8KB each):
           FAST     SLOW
Ports      1 R/W    1 R/W
Latency    L = 2    L = 4
SLOW relative to FAST: latency x2, energy by 1/3
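The results are reported as ED and EDD values relative to a baseline. As a minimal sketch, assuming the usual definitions (ED is the energy-delay product and EDD is energy times delay squared) and assuming each scheme is summarized by a total energy and an execution time, the normalization can be written as follows. The numbers are made up for illustration and are not results from the talk.

```python
def ed(energy, delay):
    """Energy-delay product (assumed definition of ED)."""
    return energy * delay


def edd(energy, delay):
    """Energy-delay-squared product (assumed definition of EDD)."""
    return energy * delay ** 2


def normalized(metric, scheme, baseline):
    """Metric of `scheme` divided by the metric of `baseline`; lower is better."""
    return metric(*scheme) / metric(*baseline)


# Hypothetical (energy, delay) pairs in arbitrary units.
baseline = (1.0, 1.0)   # stands in for the baseline scheme
candidate = (0.8, 1.1)  # 20% less energy, 10% more execution time

norm_ed = normalized(ed, candidate, baseline)
norm_edd = normalized(edd, candidate, baseline)
```

Because EDD squares the delay, a scheme that saves energy at the cost of extra cycles looks better under ED than under EDD, which is why both metrics are reported.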

16 Affinity Range
Affinity plays a key role in cluster assignment
– 36-44% better in EDD than no-affinity
– 32% better in ED than no-affinity
(0,1) affinity range is the best
– ~92% of memory instructions access a single variable
– Binary affinity for memory instructions
Affinity range   0-1    0.1-0.9   0.2-0.8   0.3-0.7   0.4-0.6   0.5-0.5   NO AFFINITY
FAST+FAST EDD    0.96   1.01      1.02      1.03      1.02      1.05      1.63
FAST+SLOW EDD    0.89   0.93      0.94                0.93      0.94      1.58
SLOW+SLOW EDD    0.95   0.99                0.98      0.99                1.69

17 EDD Results
                               Memory ports: sensitive   Memory ports: insensitive
Memory latency: sensitive      FAST+FAST                 FAST+NONE
Memory latency: insensitive    SLOW+SLOW                 SLOW+NONE

18 ED Results

19 Comparison With Unified Cache
       BEST DISTRIBUTED    UNIFIED FAST   UNIFIED SLOW
EDD    0.89 (FAST+SLOW)    1.29           1.25
ED     0.89 (SLOW+SLOW)    1.25           1.07
Distributed schemes are better than unified schemes
– 29-31% better in EDD and 19-29% better in ED
Instruction scheduling: Aletà et al. (PACT’02)
[Figure: two unified-cache machines, one with a FAST cache and one with a SLOW cache, each shared by cluster 1 and cluster 2 (FUs, RF)]

20 Reconfigurable Distributed Cache
The OS can set each module to one of three states:
– FAST mode / SLOW mode / Turned-off
The OS reconfigures the cache on a context switch
– Depending on the applications scheduled in and scheduled out
Two different V DD and V TH for the cache
– Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
Simple heuristic to show potential
– For each application, choose the estimated best cache configuration
       BEST DISTRIBUTED    RECONFIGURABLE SCHEME
EDD    0.89 (FAST+SLOW)    0.86
ED     0.89 (SLOW+SLOW)    0.86
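The heuristic on this slide (for each application, choose the estimated best cache configuration) can be sketched as a per-application minimum over estimated metric values. How the estimates are obtained is not described in the transcript, so the table below is purely hypothetical, as are the application names and function names.

```python
def best_configuration(estimates):
    """Return the configuration with the lowest estimated metric (lower is better)."""
    return min(estimates, key=estimates.get)


def reconfiguration_plan(per_app_estimates):
    """Map each application to the configuration the OS would select for it."""
    return {app: best_configuration(est) for app, est in per_app_estimates.items()}


# Hypothetical normalized EDD estimates per application and configuration.
per_app = {
    "app_a": {"FAST+FAST": 0.92, "FAST+SLOW": 0.88, "SLOW+SLOW": 0.97},
    "app_b": {"FAST+FAST": 1.05, "FAST+SLOW": 0.95, "SLOW+SLOW": 0.90},
}
plan = reconfiguration_plan(per_app)
```

On a context switch, the OS would then put each module into the mode given by the plan entry of the application being scheduled in.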

21 Talk Outline
Variable-Based Multi-Module Data Cache
Distributed Cache Configurations
Instruction Scheduling
Results
Conclusions

22 Conclusions
Distributed Variable-Based Multi-Module Cache
– Affinity is crucial for achieving good performance
  36-44% better in EDD and 32% in ED than no-affinity
– Heterogeneity (FAST+SLOW) is a good design point
  4-11% better in EDD and from 6% worse to 10% better in ED
– No single cache configuration is the best
  Reconfigurable cache modules → exploit additional 3-4%
Distributed schemes vs. unified schemes
– All distributed schemes outperform unified ones
  29-31% better in EDD, 19-29% better in ED

23 Q&A

