UPC MICRO35 Istanbul Nov

Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor

Enric Gibert (1), Jesús Sánchez (1,2), Antonio González (1,2)
(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

Motivation
Capacity-bound vs. communication-bound designs
Clustered microarchitectures
–Simpler + faster
–Power consumption
–Communications not homogeneous
Clustering in the embedded/DSP domain

Clustered Microarchitectures
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file and functional units (FUs), connected by register-to-register communication buses. Memory buses link the clusters to a single L1 cache, backed by an L2 cache.]
GOAL: distribute the memory hierarchy!

Contributions
Distribution of the data cache:
–Interleaved cache clustered VLIW processor
Hardware enhancement:
–Attraction Buffers
Effective instruction scheduling techniques:
–Modulo scheduling
–Loop unrolling + smart assignment of latencies + padding

Talk Outline
–MultiVLIW
–Interleaved-cache clustered VLIW processor
–Instruction scheduling algorithms and techniques
–Hardware enhancement: Attraction Buffers
–Simulation framework
–Results
–Conclusions

MultiVLIW
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and its own cache module, connected by register-to-register communication buses and backed by an L2 cache. Each cache module stores whole cache blocks (TAG + STATE + DATA).]
Cache-coherence protocol needed!

Interleaved Cache
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and a cache module, connected by register-to-register communication buses and backed by an L2 cache.]
A cache block (TAG, W0 to W7) is split into one subblock per cluster, each with its own copy of the tag:
  CLUSTER 1: TAG W0 W4 (subblock 1)
  CLUSTER 2: TAG W1 W5
  CLUSTER 3: TAG W2 W6
  CLUSTER 4: TAG W3 W7
Access types: local hit, remote hit, local miss, remote miss
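
The word-to-cluster mapping in the figure can be sketched in C. This is a minimal illustration, not the hardware described in the talk: the function names, the 0-based cluster numbering (the slides count clusters from 1) and the `line_present` flag are assumptions introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* The four access types named on the slide. */
typedef enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS } access_class;

/* 0-based cluster whose cache module holds the word at byte address addr,
 * assuming consecutive interleave_bytes-sized words go to consecutive clusters. */
unsigned home_cluster(uintptr_t addr, unsigned n_clusters, unsigned interleave_bytes)
{
    assert(n_clusters > 0 && interleave_bytes > 0);
    return (unsigned)((addr / interleave_bytes) % n_clusters);
}

/* Classify an access issued by issuing_cluster. line_present (hypothetical
 * input) says whether the block is currently cached at all. */
access_class classify_access(uintptr_t addr, unsigned issuing_cluster,
                             int line_present,
                             unsigned n_clusters, unsigned interleave_bytes)
{
    int local = home_cluster(addr, n_clusters, interleave_bytes) == issuing_cluster;
    if (line_present)
        return local ? LOCAL_HIT : REMOTE_HIT;
    return local ? LOCAL_MISS : REMOTE_MISS;
}
```

With 4 clusters and a 4-byte interleaving factor, byte address 12 (word W3) maps to cluster index 3, which is CLUSTER 4 on the slide.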

BASE Scheduling Algorithm
[Flowchart:
START → Sort nodes → Next node → Select possible clusters → How many?
  0: II = II + 1
  >0: Best profit in output edges → How many?
    1: Schedule it
    >1: Least loaded → Schedule it
Schedule it: successful → Next node; not successful → II = II + 1]

Scheduling Algorithm
For word-interleaved cache clustered processors
Scheduling steps:
1. Loop unrolling
2. Assignment of latencies to memory instructions
   –Higher assigned latencies: less stall time, more compute time
3. Order instructions (DDG nodes)
4. Cluster assignment and scheduling

STEP 1: Loop Unrolling
[Figure: array a word-interleaved across four cache modules. CLUSTER 1: a[0], a[4]; CLUSTER 2: a[1], a[5]; CLUSTER 3: a[2], a[6]; CLUSTER 4: a[3], a[7].]
Original loop:
    for (i=0; i<MAX; i++) {
      ld r3, a[i]
      r4 = OP(r3)
      st r4, b[i]
    }
ld r3, a[i]: 25% local accesses
Unrolled loop:
    for (i=0; i<MAX; i+=4) {
      ld r31, a[i]    (stride 16 bytes)
      ld r32, a[i+1]  (stride 16 bytes)
      ld r33, a[i+2]  (stride 16 bytes)
      ld r34, a[i+3]  (stride 16 bytes)
      ...
    }
ld r31, a[i]; ld r32, a[i+1]; ld r33, a[i+2]; ld r34, a[i+3]: 100% local accesses
Selective unrolling:
–No unrolling
–Unroll x N
–OUF unrolling
Optimum Unrolling Factor (OUF): strides become a multiple of NxI (N = number of clusters, I = interleaving factor)
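
One way to find an unrolling factor that makes a stride a multiple of NxI is sketched below in C. It is an illustration under stated assumptions, not the OUF selection used in the talk (which also offers selective unrolling): the helper names are hypothetical, strides are assumed constant and known at compile time, and the per-loop factor is taken as the least common multiple of the per-instruction factors.

```c
#include <assert.h>

/* Greatest common divisor (Euclid). */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest factor u >= 1 such that u * stride_bytes is a multiple of
 * n_clusters * interleave_bytes (NxI on the slide). */
unsigned unroll_factor_for_stride(unsigned stride_bytes,
                                  unsigned n_clusters, unsigned interleave_bytes)
{
    unsigned nxi = n_clusters * interleave_bytes;
    assert(stride_bytes > 0 && nxi > 0);
    return nxi / gcd_u(nxi, stride_bytes);
}

/* Assumed combination rule: least common multiple over all memory
 * instructions in the loop body. */
unsigned unroll_factor_for_loop(const unsigned *strides, unsigned count,
                                unsigned n_clusters, unsigned interleave_bytes)
{
    unsigned factor = 1;
    unsigned k;
    for (k = 0; k < count; k++) {
        unsigned u = unroll_factor_for_stride(strides[k], n_clusters, interleave_bytes);
        factor = factor / gcd_u(factor, u) * u;
    }
    return factor;
}
```

For the loop on the slide (4-byte stride, N = 4, I = 4 bytes) this gives a factor of 4, after which each load has a 16-byte stride and always touches the same cluster.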

STEP 2: Latency Assignment
[Figure: data dependence graph (DDG) with two recurrences, linked by memory dependences and register-flow dependences. REC1 (distance=1): n1 load, n2 load, n3 add, n4 store, n5 sub. REC2 (distance=1): n6 load, n7 div, n8 add.]
Latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles
Tables STEP 1 and STEP 2 (columns: load, latency change, II, stall, B):
–n1: to LM, to RH, to LH
–n2: to LM, to RH, to LH
Figure sequence (latency labels L and the two MII values shown at each stage):
–L=1, L=8, L=1, L=15: MII=33, MII=22
–L=15, L=10, L=15: MII=28, MII=22
–L=15, L=5, L=15: MII=23, MII=22
–L=5, L=1: MII=9, MII=10
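
The link between assigned load latencies and the MII of a recurrence can be sketched in C with the standard modulo-scheduling bound: the sum of the latencies on the recurrence divided by its distance, rounded up. The function name is hypothetical, and the 1-cycle latencies used below for add, store and sub are an assumption; under that assumption the bound gives 33, 28, 23 and 9 for REC1, the same values that appear on the slide.

```c
#include <assert.h>

/* Lower bound on the initiation interval (II) imposed by one recurrence:
 * ceil(sum of node latencies / dependence distance). */
unsigned recurrence_mii(const unsigned *latencies, unsigned count, unsigned distance)
{
    unsigned total = 0;
    unsigned k;
    assert(distance > 0);
    for (k = 0; k < count; k++)
        total += latencies[k];
    return (total + distance - 1) / distance;
}
```

For example, with n1 and n2 both at the remote-miss latency (15 cycles) and the three other nodes at 1 cycle, the bound is 15 + 15 + 1 + 1 + 1 = 33 cycles; lowering n1 to the local-miss latency (10 cycles) brings it down to 28.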

STEPS 3 and 4
Step 3: Order instructions
Step 4: Cluster assignment and scheduling
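
Scheduling Restrictions
[Figure: CLUSTER 1 (cache module holding a[0], a[4]), CLUSTER 2, CLUSTER 3 and CLUSTER 4 (cache module holding a[3], a[7]), connected by memory buses to the NEXT MEMORY LEVEL.]
Example: one cluster issues a store to a[0] at cycle i; another cluster issues a load from a[0] at cycle i+3.
NON-DETERMINISTIC BUS LATENCY! A remote access may arrive late, so the order in which the store and the load reach the cache module is not guaranteed.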

STEPS 3 and 4
Step 3: Order instructions
Step 4: Cluster assignment and scheduling
–Non-memory instructions: same as BASE
   Minimize register communications + maximize workload balance
–Memory instructions:
   Memory instructions in the same chain are assigned to the same cluster
   IPBC (Interleaved Preferred Build Chains)
   –Average preferred cluster of the chain
   –Padding to an NxI boundary gives meaningful preferred cluster information
     »Stack frames
     »Dynamically allocated data
   IBC (Interleaved Build Chains)
   –Minimize register communications of the 1st instruction of the chain
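
Memory Dependent Chains
[Figure: DDG with memory dependences and register-flow dependences. Nodes: n1 load, n2 load, n3 add, n4 store, n5 sub (distance=1); n6 load, n7 div, n8 add (distance=1). Preferred cluster labels: Preferred = 1, Preferred = 2. Latency labels: L=1, L=8, L=1, L=5, L=1.]
Latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles
Cluster choice for the memory nodes n1, n2, n4 and n6:
–IPBC: chain {n1, n2, n4} in cluster 1; n6 in cluster 2
–IBC: n4 and n6 placed to minimize register communications; n1 and n2 same as n4
order={n5, n4, n3, n2, n1, n8, n7, n6}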
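
A possible reading of "average preferred cluster of the chain" is sketched below in C: add up, per cluster, the profiled accesses of every memory instruction in the chain and pick the cluster with the largest total. The `mem_profile` record, the function name and the tie-break toward the lowest cluster index are assumptions made for illustration; the slides do not specify how the average is computed.

```c
#include <assert.h>

#define N_CLUSTERS 4

/* Hypothetical profile record: how often one memory instruction accessed
 * the cache module of each cluster during the profile run. */
typedef struct {
    unsigned long accesses[N_CLUSTERS];
} mem_profile;

/* Preferred cluster (0-based) for a whole memory dependent chain: the
 * cluster with the most profiled accesses summed over the chain.
 * Ties go to the lowest cluster index (an arbitrary choice). */
unsigned chain_preferred_cluster(const mem_profile *chain, unsigned count)
{
    unsigned long total[N_CLUSTERS] = {0};
    unsigned best = 0;
    unsigned k, c;
    assert(chain != 0 && count > 0);
    for (k = 0; k < count; k++)
        for (c = 0; c < N_CLUSTERS; c++)
            total[c] += chain[k].accesses[c];
    for (c = 1; c < N_CLUSTERS; c++)
        if (total[c] > total[best])
            best = c;
    return best;
}
```

Because every instruction of a chain must run in the same cluster, one instruction with a strong preference can outweigh several with weak ones; that is the trade-off this kind of vote makes explicit.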

Attraction Buffers
Cost-effective mechanism to increase local accesses
[Figure: array a word-interleaved across four cache modules (CLUSTER 1: a[0], a[4]; CLUSTER 2: a[1], a[5]; CLUSTER 3: a[2], a[6]; CLUSTER 4: a[3], a[7]), with an attraction buffer (ABuffer) holding a[3] and a[7] in the cluster that executes the loads.]
Loads: ld r3, a[3]; ld r3, a[7]; ... (stride 16 bytes)
Local accesses = 0% without the ABuffer
Local accesses = 50% with the ABuffer
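
The 0% to 50% figure can be reproduced with a small C model. Everything specific in it is an assumption made for illustration: the buffer is direct-mapped with 8 entries, a remote access copies the whole remote subblock (both words of the block held by that cluster) into the buffer, every access hits in the cache, and the names are hypothetical. The slides do not give the actual buffer organization.

```c
#include <assert.h>
#include <stdint.h>

enum { N_CLUSTERS = 4, INTERLEAVE = 4, BLOCK_BYTES = 32, AB_ENTRIES = 8 };

/* One entry holds a copy of one remote subblock, identified by the block
 * number and the cluster that owns the subblock. */
typedef struct {
    int valid;
    uintptr_t block;
    unsigned home;
} ab_entry;

typedef struct {
    ab_entry e[AB_ENTRIES];
} attraction_buffer;

/* Returns 1 if the access is served locally (own cache module or attraction
 * buffer) and 0 if it goes remote. A remote access installs the subblock in
 * the buffer so that later accesses to it become local. */
int access_is_local(attraction_buffer *ab, unsigned cluster, uintptr_t addr)
{
    unsigned home = (unsigned)((addr / INTERLEAVE) % N_CLUSTERS);
    uintptr_t block = addr / BLOCK_BYTES;
    ab_entry *slot;
    assert(ab != 0 && cluster < N_CLUSTERS);
    if (home == cluster)
        return 1;
    slot = &ab->e[(block * N_CLUSTERS + home) % AB_ENTRIES];
    if (slot->valid && slot->block == block && slot->home == home)
        return 1;
    slot->valid = 1;
    slot->block = block;
    slot->home = home;
    return 0;
}
```

Running the loads a[3], a[7], a[11], a[15], ... (4-byte elements) from cluster index 0 through this model, the first access to each 32-byte block goes remote and the second is served by the buffer, so half of the accesses are local.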

Evaluation Framework
IMPACT C compiler
Mediabench benchmark suite
Benchmark    Profile      Execution
epicdec      test_image   titanic
epicenc      test_image   titanic
g721dec      clinton      S_16_44
g721enc      clinton      S_16_44
gsmdec       clinton      S_16_44
gsmenc       clinton      S_16_44
jpegdec      testimg      monalisa
jpegenc      testimg      monalisa
mpeg2dec     mei16v2      tek6
pegwitdec    pegwit       techrep
pegwitenc    pgptest      techrep
pgpdec       pgptext      techrep
pgpenc       pgptest      techrep
rasta        ex5_c1

Evaluation Framework
Common to all three configurations (Unified cache, MultiVLIW, Interleaved cache):
–# clusters: 4
–Functional units: 1 FP / cluster + 1 integer / cluster + 1 memory / cluster
–Register buses: 4 buses running at ½ the core frequency
–Cache configuration: 8KB, 2-way set-associative, 32-byte blocks; L2 always hits
Per configuration:
                      Unified cache    MultiVLIW        Interleaved cache
Cache latencies       Hit=5            Hit=1            Local Hit=1, Remote Hit=5
                      Miss=14          Miss=10          Local Miss=10, Remote Miss=15
Algorithm             BASE             IBC              IPBC + IBC
Interleaving factor   -                -                4 bytes

Local Accesses
[Figure: local accesses chart. Legend: OUF = Optimum Unrolling Factor, P = Padding, NC = No Chains]

Why Remote Accesses?
Double precision accesses (mpeg2dec)
Unclear preferred cluster information
–Indirect accesses (e.g. a[b[i]]) (jpegdec, jpegenc, pegwitdec, pegwitenc)
–Different alignment (epicenc, jpegdec, jpegenc)
–Strides not multiple of NxI (selective unrolling, …)
Memory dependent chains (epicdec, pgpdec, pgpenc, rasta)
Example (the inner loop starts at a different element for each k):
    for (k=0; k<MAX; k++) {
      for (i=k; i<MAX; i++)
        load a[i]
    }

Stall Time
[Figure: stall time chart]

Cycle Count Results
[Figure: cycle count chart]

Conclusions
Interleaved cache clustered VLIW processor
Effective instruction scheduling techniques
–Smart assignment of latencies
–Loop unrolling + padding (27% more local hits)
Source of remote accesses and stall time
Attraction Buffers (stall time reduced by up to 34%)
Cycle count results:
–vs. MultiVLIW: 7% slowdown, but simpler hardware
–vs. Unified cache: 11% speedup

Questions?

Question: Latency Assignment
MII(REC1)=20, MII(DDG)=10
Node   II    stall   B (ratio)   B (subtract)
n
n
n3     5     1       5           4
n4     5     1       5           4
n5     10    0       MAX         10

Question: Padding
    #include <stdlib.h>

    /* MAX is assumed to be defined elsewhere. */
    void foo(int *array, int *accum)
    {
      int i;
      *accum = 0;
      for (i=0; i<MAX; i++)
        *accum += array[i];
    }

    int main(void)
    {
      int *a, value;
      a = malloc(MAX*sizeof(int));
      foo(a, &value);
      return 0;
    }
[Figure: data placement across cache modules. CLUSTER 1: a[0], a[4], ...; CLUSTER 2: accum, a[1], a[5], ...; CLUSTER 3: a[2], a[6], ...; CLUSTER 4: a[3], a[7], ...]
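
The effect of padding to an NxI boundary can be sketched in C. The helpers below are hypothetical and assume 4 clusters with a 4-byte interleaving factor, as in the evaluated configuration; they only show the arithmetic, not how the compiler or allocator in the talk applies it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N_CLUSTERS = 4, INTERLEAVE_BYTES = 4 };

/* Round a size (or an offset) up to the next multiple of NxI bytes. */
size_t pad_to_nxi(size_t bytes)
{
    size_t nxi = (size_t)N_CLUSTERS * INTERLEAVE_BYTES;
    return (bytes + nxi - 1) / nxi * nxi;
}

/* 0-based cluster that holds element index of an array starting at base. */
unsigned cluster_of_element(uintptr_t base, size_t index, size_t elem_bytes)
{
    uintptr_t addr = base + (uintptr_t)(index * elem_bytes);
    assert(elem_bytes > 0);
    return (unsigned)((addr / INTERLEAVE_BYTES) % N_CLUSTERS);
}
```

When the base address of an `int` array is a multiple of NxI (16 bytes here), element `a[k]` always lands in cluster `k % 4`, so the compiler knows the preferred cluster of each unrolled load even though `malloc` decides the address at run time.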

Question: Coherence
Memory Dependent Chains
–Modified data
   Present in only one Attraction Buffer
–Data present in multiple Attraction Buffers
   Replicated in read-only manner
–Local scheduling technique
   At the end of the loop, flush the Attraction Buffers' contents
[Figure: a[2] replicated in the ABuffers of CLUSTER 1, CLUSTER 2 and CLUSTER 4; the ABuffer of CLUSTER 3 is shown without it.]