ICS’02 UPC
An Interleaved Cache Clustered VLIW Processor
E. Gibert, J. Sánchez* and A. González*
Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC)
* Also at Intel Barcelona Research Center
June 2002

Motivation
Capacity-bound vs. communication-bound
Solution: clustered microarchitectures
Partition some hardware resources
Simpler + faster
Power consumption
Communications not homogeneous
Goal: clustering the memory hierarchy in statically scheduled processors

Talk Outline
State-of-the-art: MultiVLIW
Interleaved Cache Clustered VLIW
Scheduling Algorithms
Enhancement: Attraction Buffers
Experimental Framework
Results
Conclusions

State-of-the-art: MultiVLIW
Sánchez and González [MICRO’00]
[Figure: clusters 1 to n, each with a register file (Reg. File), functional units (F.U.) and its own L1 data cache. Labels: Coherency network, Register-to-register buses, Next memory level.]

Basic Interleaved Cache Clustered VLIW Processor
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, FUs and one cache module. A cache block (TAG, W0 to W7) is split into subblocks, one per cache module; the modules hold TAG W0 W4, TAG W1 W5, TAG W2 W6 and TAG W3 W7. Labels: Subblock 1, cache block, memory buses, NEXT MEMORY LEVEL, Register-to-register buses.]

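The word-to-module mapping in the figure can be sketched as follows. This is a minimal illustration, assuming byte addresses, word interleaving and 1-based cluster numbers with W0 in cluster 1; the function names and the default values (4 clusters, 4-byte interleaving, 32-byte lines) are taken from the configuration in this talk or chosen for illustration, not from the authors' implementation.

```python
def home_cluster(address, num_clusters=4, interleave_bytes=4):
    """Return the 1-based cluster whose cache module holds the word at `address`.

    Assumption: consecutive words of `interleave_bytes` bytes are assigned to
    consecutive cache modules in round-robin order.
    """
    word_index = address // interleave_bytes
    return word_index % num_clusters + 1


def block_layout(line_bytes=32, num_clusters=4, interleave_bytes=4):
    """List which words of one cache block end up in each cluster's subblock."""
    layout = {cluster: [] for cluster in range(1, num_clusters + 1)}
    for word in range(line_bytes // interleave_bytes):
        cluster = home_cluster(word * interleave_bytes, num_clusters, interleave_bytes)
        layout[cluster].append(f"W{word}")
    return layout


if __name__ == "__main__":
    for cluster, words in block_layout().items():
        print(f"cluster {cluster}: {' '.join(words)}")
```

With the default parameters, each subblock receives two of the eight words of a block, matching the W0/W4, W1/W5, W2/W6, W3/W7 split in the figure.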
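Modulo Scheduling
Extract ILP from loops: overlap the execution of iterations
[Figure: LOOP L with operations A, B, C. Successive iterations (A B C, A’ B’ C’, A’’ B’’ C’’) overlap in time and form a Kernel. Labels: II (initiation interval), SC (stage count).]
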
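The relation between II, SC and the kernel can be illustrated with a small sketch. It assumes the usual modulo scheduling definitions (a new iteration starts every II cycles; an operation issued at cycle c of a single iteration lands in kernel row c mod II and stage c div II). The operation names and issue cycles in the example are hypothetical.

```python
import math


def stage_count(schedule_length, ii):
    """Number of stages (SC) when one iteration spans `schedule_length` cycles."""
    return math.ceil(schedule_length / ii)


def total_cycles(iterations, schedule_length, ii):
    """Cycles to run `iterations` overlapped iterations, ignoring stalls.

    The last iteration starts at (iterations - 1) * ii and then needs
    `schedule_length` cycles to complete.
    """
    return (iterations - 1) * ii + schedule_length


def kernel_rows(issue_cycle, ii):
    """Fold a one-iteration schedule into II kernel rows of (operation, stage)."""
    rows = [[] for _ in range(ii)]
    for name, cycle in sorted(issue_cycle.items(), key=lambda item: item[1]):
        rows[cycle % ii].append((name, cycle // ii))
    return rows


if __name__ == "__main__":
    schedule = {"A": 0, "B": 1, "C": 2}  # hypothetical issue cycles
    print(stage_count(3, 1))             # stages for a 3-cycle iteration at II = 1
    print(kernel_rows(schedule, 1))      # one kernel row holding A, B and C
    print(total_cycles(10, 3, 1))        # cycles for 10 overlapped iterations
```

With II = 1 the three operations of the example share a single kernel row, each belonging to a different in-flight iteration.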
Base Scheduling Algorithm
Used for Unified Cache
[Flowchart: START, Sort nodes, Next node, Select possible clusters, How many? (0: II=II+1; >0: Best profit in output edges), How many? (1: Schedule it; >1: Least loaded, then Schedule it).]

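The cluster selection step of the flowchart might be sketched as below. This is only one reading of the flowchart: the per-cluster "profit in output edges" and the cluster load are taken as precomputed inputs, because the slide does not say how they are computed, and all names are hypothetical.

```python
def choose_cluster(candidates, edge_profit, load):
    """Pick a cluster for the current node, or None if II must be increased.

    candidates:  clusters where the node can be scheduled at the current II
    edge_profit: cluster -> profit in output edges (assumed to be given)
    load:        cluster -> current workload (assumed to be given)
    """
    if not candidates:
        return None  # no possible cluster: the caller sets II = II + 1 and restarts

    # Keep only the clusters with the best profit in output edges.
    best_profit = max(edge_profit[c] for c in candidates)
    best = [c for c in candidates if edge_profit[c] == best_profit]
    if len(best) == 1:
        return best[0]

    # Still more than one: take the least loaded (ties broken by cluster number).
    return min(best, key=lambda c: (load[c], c))


if __name__ == "__main__":
    profit = {1: 2, 2: 2, 3: 1}
    load = {1: 5, 2: 3, 3: 0}
    print(choose_cluster([1, 2, 3], profit, load))
```

In the example, clusters 1 and 2 tie on profit, so the less loaded of the two is chosen even though cluster 3 carries no load at all.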
Interleaved Cache Scheduling Algorithm
Unroll the loop to maximize the number of memory instructions whose stride is a multiple of NxI: each such instruction always accesses ONE cache module
Assign latencies to memory instructions
Assign memory instructions to clusters:
– IPBC (Interleaved Pre-Build Chains): minimize stall time
– IBC (Interleaved Build Chains): minimize compute time

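The effect of the unrolling step can be checked with a short sketch. It assumes N is the number of clusters and I the interleaving factor in bytes, handles a single strided memory instruction, and uses hypothetical function names; how the actual algorithm trades off several instructions with different strides is not covered here.

```python
from math import gcd


def unroll_factor(stride_bytes, num_clusters=4, interleave_bytes=4):
    """Smallest unroll factor that makes the stride a multiple of N x I."""
    period = num_clusters * interleave_bytes
    stride = abs(stride_bytes)
    if stride == 0:
        return 1  # the same address is accessed in every iteration
    return period // gcd(period, stride)


def modules_touched(base, stride_bytes, unroll, iterations,
                    num_clusters=4, interleave_bytes=4):
    """For each unrolled copy, collect the cache modules (1-based) it accesses."""
    touched = [set() for _ in range(unroll)]
    for iteration in range(iterations):
        for copy in range(unroll):
            address = base + (iteration * unroll + copy) * stride_bytes
            touched[copy].add((address // interleave_bytes) % num_clusters + 1)
    return touched


if __name__ == "__main__":
    factor = unroll_factor(4)                 # 4-byte elements, N = 4, I = 4
    print(factor)
    print(modules_touched(0, 4, 1, 8))        # not unrolled: one load hits every module
    print(modules_touched(0, 4, factor, 8))   # unrolled: one module per copy
```

Without unrolling, the single load sweeps over all four modules; after unrolling by the computed factor, each copy stays on one module, which is what allows a preferred cluster to be assigned to it.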
Memory Dependent Instructions
[Figure: dependence graph of store, load and add instructions, grouped into memory dependent chain 1 and memory dependent chain 2. Annotations: Preferred=1, Preferred=2.]
IPBC: preferred cluster info is used
vs.
IBC: minimize register communications

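Grouping memory instructions into memory dependent chains can be sketched with a union-find structure. This is an illustrative assumption rather than the authors' implementation: instructions linked by a memory dependence end up in the same chain, and the chain is then given the preferred cluster that most of its members vote for (the majority vote is a choice made for this sketch only). All instruction names are hypothetical.

```python
from collections import Counter


def build_chains(mem_ops, dependences):
    """Group memory instructions connected by memory dependences into chains."""
    parent = {op: op for op in mem_ops}

    def find(op):
        # Follow parent links to the chain representative, halving paths on the way.
        while parent[op] != op:
            parent[op] = parent[parent[op]]
            op = parent[op]
        return op

    for first, second in dependences:
        parent[find(first)] = find(second)

    chains = {}
    for op in mem_ops:
        chains.setdefault(find(op), []).append(op)
    return sorted(chains.values())


def chain_cluster(chain, preferred):
    """Most frequent preferred cluster in a chain (lowest number wins ties)."""
    votes = Counter(preferred[op] for op in chain if op in preferred)
    if not votes:
        return None  # no member has a known preferred cluster
    top = max(votes.values())
    return min(cluster for cluster, count in votes.items() if count == top)


if __name__ == "__main__":
    ops = ["ld_a", "st_a", "ld_b", "st_b", "ld_c"]
    deps = [("ld_a", "st_a"), ("ld_b", "st_b")]
    preferred = {"ld_a": 1, "st_a": 1, "ld_b": 2, "st_b": 3}
    for chain in build_chains(ops, deps):
        print(chain, "->", chain_cluster(chain, preferred))
```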
Enhancement: Attraction Buffers
[Figure: local logic of one cluster. The ADDRESS is presented to both the CACHE MODULE (TAG, W2, W6; tag comparison and word select) and the ATTRACTION BUFFER (ABuffer); each produces data and hit signals. Label: Local Data.]

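The behaviour suggested by the figure can be approximated with a small access classifier. This is a sketch under several assumptions that are not taken from the slides: the buffer has a fixed number of entries with LRU replacement, a remote access copies the whole remote subblock into the buffer, every access hits in the cache, and stores and coherence are ignored. The class and function names are hypothetical.

```python
from collections import OrderedDict


class AttractionBuffer:
    """Small buffer of remote subblocks, with LRU replacement (an assumption)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.subblocks = OrderedDict()

    def lookup(self, key):
        if key in self.subblocks:
            self.subblocks.move_to_end(key)  # mark as most recently used
            return True
        return False

    def fill(self, key):
        self.subblocks[key] = True
        self.subblocks.move_to_end(key)
        if len(self.subblocks) > self.entries:
            self.subblocks.popitem(last=False)  # evict the least recently used


def classify_accesses(cluster, addresses, num_clusters=4, interleave_bytes=4,
                      line_bytes=32, entries=16):
    """Count local-module hits, attraction-buffer hits and remote accesses."""
    buffer = AttractionBuffer(entries)
    counts = {"local": 0, "abuffer": 0, "remote": 0}
    for address in addresses:
        home = (address // interleave_bytes) % num_clusters + 1
        if home == cluster:
            counts["local"] += 1
            continue
        key = (address // line_bytes, home)  # (cache block, owning module)
        if buffer.lookup(key):
            counts["abuffer"] += 1
        else:
            counts["remote"] += 1
            buffer.fill(key)  # attract the remote subblock for later accesses
    return counts


if __name__ == "__main__":
    # Cluster 1 reads eight consecutive 4-byte elements of one cache block.
    print(classify_accesses(1, range(0, 32, 4)))
```

In the example, the first access to each remote subblock goes over the memory buses, and the second word of the same subblock is then served from the attraction buffer.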
An Example
N = 4 clusters, I = 4 bytes

for (i=0; i<MAX; i++) {
  ld r3, a[i]
  r4 = OP(r3)
  st r4, b[i]
}

Unroll x4: 16-byte strides (NxI multiple)

for (i=0; i<MAX; i+=4) {
  ld r31, a[i]      (stride 16)
  ld r32, a[i+1]
  ld r33, a[i+2]
  ld r34, a[i+3]
  r41 = OP(r31)
  r42 = OP(r32)
  r43 = OP(r33)
  r44 = OP(r34)
  st r41, b[i]
  st r42, b[i+1]
  st r43, b[i+2]
  st r44, b[i+3]
}

[Figure: CLUSTER 1 to CLUSTER 4, each with a Local module and an ABuffer; array elements a[0], a[1], a[2], a[3], ... are distributed across the modules. Labels include a[0] a[4], a[3] a[7] and the access ld r31, a[0].]

Enhancement: Attraction Buffers
Why remote accesses? Why Attraction Buffers?
– Double precision accesses: low benefit
– Indirect accesses, a[b[i]]: low benefit
– “Unclear” preferred cluster: big benefit
    for (i=0; i<MAX; i++)
      for (k=i; k<i+MAX; k+=4)
        ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]
– Memory dependent chains: big benefit
– IBC, where preferred cluster info is not used: big benefit

Experimental Framework
IMPACT C compiler
Modulo scheduling on hyperblock loops
– BASE for a Unified Cache
– IPBC and IBC for an Interleaved Cache
– IPBC and IBC for the MultiVLIW
– The same unrolling factor has been used for all architecture configurations!
Mediabench benchmark suite

Experimental Framework
Number of clusters: 4
Functional units: 1 FP / cluster + 1 int / cluster + 1 mem / cluster
Cache configuration: 8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses: 4 buses that run at ½ the core frequency
Memory buses: 4 buses that run at ½ (or ¼) the core frequency
Next memory level: 4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache): 4 bytes
Latencies: 1-10 (Unified Cache + MultiVLIW), 1-(5/6) (Interleaved Cache)

Results (I)
IPBC vs. IBC: similar cycle count results
MultiVLIW vs. Interleaved: similar results, BUT… the Interleaved Cache has lower complexity!

Results (II)
Memory dependent chains
– Interleaved cache: workload imbalance + remote accesses
– MultiVLIW: workload imbalance
– Working on techniques to overcome scheduling restrictions

Results (III)
Attraction Buffers:
– Local hits are increased by 15% (average)
– Stall time is reduced by 30% (average)

Conclusions
Scheduling Algorithms
– Good latency assignment process (stall time accounts for 9% of execution time)
– Coherence kept through memory dependent chains (5% cycle count degradation)
Attraction Buffers
– Effective to increase local hits (15% average) + reduce stall time (30% average)
– Reduce remote hits to previously accessed subblocks (70% average)
Cycle count results
– Similar to Unified Cache and MultiVLIW

Questions