MICRO-36, San Diego, December 2003

Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona
Motivation

Baseline: a clustered VLIW processor in which each cluster has its own register file and functional units, clusters communicate over register-to-register buses, and memory is a unified L1 cache backed by an L2 cache over memory buses. Two ways to give each cluster faster access to memory:
- Option 1: distribute the L1 cache, placing an L1 cache module in each cluster.
- Option 2: keep the L1 cache unified and add a small memory buffer to each cluster.
Contributions

- A small L0 buffer in each cluster:
  - Flexible mechanisms to map data to the buffers.
  - Compiler-controlled memory instruction hints.
- Instruction scheduling techniques (VLIW):
  - Mark "critical" instructions to use the buffers.
  - Use the appropriate memory instruction hints.
- Data coherence among buffers [CGO'03]:
  - Three mechanisms: same cluster, partial store replication, and not using the buffers.
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
L0 Buffers

Each cluster keeps its register file and INT/FP/MEM functional units and adds a small L0 buffer, with unpack logic, between its memory unit and the shared L1 cache. Clusters still communicate through the register-to-register buses.
Mapping Flexibility

A 16-byte L1 block holds four 4-byte elements a[0]..a[3] (a[4]..a[7] sit in the next block).
- Linear mapping: a single "load a[0] with stride 1 element" brings consecutive elements (a[0], a[1], ...) into one cluster's L0 buffer.
- Interleaved mapping (1-cycle penalty in the unpack logic): four loads (load a[0] .. load a[3]), each with a 4-element stride, spread the data across the four L0 buffers, so cluster 1 receives a[0] and a[4], cluster 2 receives a[1] and a[5], and so on.
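The two mappings can be mimicked in a few lines. This is an illustrative sketch, not the paper's hardware: the function names and the dictionary representation of the per-cluster buffers are mine.

```python
N_CLUSTERS = 4

def linear_mapping(elements, cluster):
    """Linear: one cluster's L0 buffer receives consecutive elements."""
    return {cluster: list(elements)}

def interleaved_mapping(elements):
    """Interleaved: element a[i] goes to cluster i % N_CLUSTERS, so each
    cluster's L0 buffer holds a stride-N_CLUSTERS subsequence (at the
    cost of 1 extra cycle in the unpack logic)."""
    buffers = {c: [] for c in range(N_CLUSTERS)}
    for i, elem in enumerate(elements):
        buffers[i % N_CLUSTERS].append(elem)
    return buffers

a = [f"a[{i}]" for i in range(8)]      # two consecutive L1 blocks
print(interleaved_mapping(a)[0])       # cluster 0 holds ['a[0]', 'a[4]']
```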
Memory Hints

- Access directives: no access, sequential, or parallel. Example: in cycle i, "load sequential a[0]" is served from the L0 buffer; in cycle i+1, "load no access *p" bypasses the buffer and goes to the L1 cache.
- Mapping hints: linear or interleaved.
- Prefetching hints: none, positive, or negative. Example: "load *a (positive prefetch)" in cycle i, followed by a++, brings the following elements (a[1], a[2], a[3]) into the L0 buffer before they are needed in cycle i+1.
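The three hint families compose independently on each memory instruction. A minimal sketch; the enum names are my labels, not the actual ISA encoding:

```python
from enum import Enum

class Access(Enum):          # access directive
    NO_ACCESS = 0            # bypass the L0 buffer, go straight to L1
    SEQUENTIAL = 1
    PARALLEL = 2

class Mapping(Enum):         # mapping hint
    LINEAR = 0
    INTERLEAVED = 1          # 1 extra cycle in the unpack logic

class Prefetch(Enum):        # prefetching hint
    NONE = 0
    POSITIVE = 1             # prefetch the next block
    NEGATIVE = 2             # prefetch the previous block

def serves_from_l0(access):
    """Only 'no access' loads skip the L0 buffer entirely."""
    return access is not Access.NO_ACCESS

print(serves_from_l0(Access.SEQUENTIAL))   # True
print(serves_from_l0(Access.NO_ACCESS))    # False
```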
L0-L1 Interaction

L0 buffers are write-through: a store updates both the L0 buffer and the L1 cache. This brings three benefits:
1) It simplifies replacements: no bus arbitration and no flush instruction are needed, since evicted data is already in L1.
2) No pack logic is required (only unpack logic on the load path).
3) Data consistency between the L0 buffers and the L1 cache is preserved.
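The write-through policy can be modeled with a toy two-level store. This is a sketch under simplifying assumptions of mine: dictionaries stand in for the buffer and the cache, loads allocate L0 entries, and replacement choice is ignored.

```python
class WriteThroughL0:
    """Toy model of one cluster's write-through L0 buffer over L1."""

    def __init__(self, n_entries):
        self.n_entries = n_entries
        self.l0 = {}                 # addr -> value (per-cluster L0 buffer)
        self.l1 = {}                 # addr -> value (simplified L1 cache)

    def load(self, addr):
        value = self.l1.get(addr, 0)
        if len(self.l0) < self.n_entries:
            self.l0[addr] = value    # allocate on load while space remains
        return value

    def store(self, addr, value):
        if addr in self.l0:
            self.l0[addr] = value
        self.l1[addr] = value        # write-through: L1 is always updated

    def evict(self, addr):
        self.l0.pop(addr, None)      # silent drop: L1 already has the data
```

Because every store also reaches L1, `evict` never writes anything back, which is exactly why no pack logic or flush instruction is needed.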
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
Memory Coherence

Running example (schedule): cycle i: load A, load B; cycle i+1: store C; cycle i+2: load D; cycle i+3: store E. Two of the [CGO'03] mechanisms:
- Not using the buffers (NB): the conflicting memory instructions bypass the L0 buffers and access the L1 cache directly.
- One cluster (1C): dependent memory instructions (e.g. store C and the later load D) are assigned to the same cluster, so the load sees the store's value through that cluster's L0 buffer.
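The 1C mechanism amounts to a partition constraint at cluster-assignment time: instructions that may touch the same data must land on the same cluster. A hedged sketch; the union-find grouping and the round-robin placement are illustrative, not the paper's assignment heuristic:

```python
def assign_clusters(insts, deps, n_clusters=2):
    """insts: instruction names; deps: pairs (a, b) that must share a
    cluster (a may-dependence through memory). Groups dependent
    instructions, then spreads the groups round-robin over clusters."""
    parent = {i: i for i in insts}

    def find(x):                      # union-find root lookup
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in deps:                 # merge dependent instructions
        parent[find(a)] = find(b)

    roots = sorted({find(i) for i in insts})
    cluster_of_root = {r: k % n_clusters for k, r in enumerate(roots)}
    return {i: cluster_of_root[find(i)] for i in insts}

insts = ["load A", "load B", "store C", "load D", "store E"]
deps = [("store C", "load D")]        # load D must observe store C
placement = assign_clusters(insts, deps)
assert placement["store C"] == placement["load D"]
```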
Scheduling Algorithm (I)

Overview:
- Candidate instructions: strided memory instructions. They dominate in Mediabench (fraction of strided memory accesses per benchmark): epicdec 99%, g721dec 100%, g721enc 100%, gsmdec 97%, gsmenc 99%, jpegdec 60%, jpegenc 49%, mpeg2dec 96%, pegwitdec 50%, pegwitenc 56%, pgpdec 99%, pgpenc 86%, rasta 95%.
- Assign the "critical" candidate instructions to the buffers.

Loop unrolling:
- Unroll factors: 1 or N.
- Unrolling by N may benefit from interleaved mapping.
- Trade off global communications against workload balance.
- Do not overflow the buffers.
Scheduling Algorithm (II)

The scheduler extends Swing Modulo Scheduling:
1) Sort the nodes and initialize the data structures (NFreeEntries per L0 buffer; latencies derived from slack).
2) Take the next node and compute P, its set of possible clusters (honoring coherence constraints and memory dependences).
3) If P is empty, increase the initiation interval (II = II + 1) and retry.
4) Sort P by L0 availability, minimizing global communications and maximizing workload balance, and compute latencies using NFreeEntries.
5) Schedule the node in a cluster of P; update NFreeEntries, recompute criticality, and reassign latencies.

The running example on the slide tracks NFreeEntries as instructions are placed, e.g. from {2, 2} down to {1, 0} or {1, 1} depending on where memory dependences force each access.
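The cluster-selection step ("Sort P", then "Schedule in a Cluster of P") boils down to picking the best candidate among the possible clusters. A simplified sketch, assuming a lexicographic cost (communications first, then workload) rather than the paper's exact heuristic:

```python
def pick_cluster(possible, free_entries, comm_cost, workload):
    """possible: candidate cluster ids for the current node;
    free_entries[c]: free L0 buffer entries on cluster c;
    comm_cost[c]: global register communications induced by choosing c;
    workload[c]: instructions already scheduled on c.
    Returns the best cluster, or None (the caller then bumps II)."""
    candidates = [c for c in possible if free_entries[c] > 0]
    if not candidates:
        return None                       # P effectively empty: II = II + 1
    # Prefer fewer global communications, then better workload balance.
    return min(candidates, key=lambda c: (comm_cost[c], workload[c]))

best = pick_cluster(possible=[0, 1],
                    free_entries={0: 1, 1: 0},   # cluster 1's L0 is full
                    comm_cost={0: 2, 1: 0},
                    workload={0: 3, 1: 1})
assert best == 0
```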
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
Evaluation Framework (I)

- IMPACT C compiler: compilation, optimization, and memory disambiguation, extended with the proposed instruction scheduler.
- Mediabench benchmark suite (benchmark: input): epicdec: titanic; g721dec: S_16_44; g721enc: S_16_44; gsmdec: S_16_44; gsmenc: S_16_44; jpegdec: monalisa; jpegenc: monalisa; mpeg2dec: tek6; pegwitdec: techrep; pegwitenc: techrep; pgpdec: techrep; pgpenc: techrep; rasta: ex5_c1.
Evaluation Framework (II)

Architecture configuration:
- Clusters: 4.
- Functional units: 1 FP + 1 integer + 1 memory unit per cluster.
- L0 buffers: 8-byte subblocks, fully associative, 1-cycle latency; 1 extra cycle for interleaved mapping (unpack logic).
- L1 cache: 8 KB total size, 32-byte blocks, 2-way set associative, 6-cycle latency.
- L2 cache: 10-cycle latency, always hits.
- Register communications: 4 buses with a 2-cycle latency.
Number of L0 Entries
L0 Hit Rate
Improving L0 Hit Rate

Problem: with II = 2, a block prefetched only one block ahead (e.g. a[2]'s block) is needed before it reaches the L0 buffer.
Solution: prefetch two blocks in advance.
- This uses more L0 buffer entries.
- Speedups: 1.12 in epicdec (+7% hit rate) and 1.04 in rasta (+12% hit rate).
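A back-of-the-envelope check on the prefetch distance. Assumptions of mine: the evaluated 6-cycle L1 latency, II = 2, and two loop iterations consuming each block, as in the slide's example.

```python
import math

def prefetch_distance(l1_latency, ii, iters_per_block):
    """Blocks to prefetch ahead so data arrives in L0 before it is used.
    Iterations elapsed while the prefetch is in flight = latency / II;
    convert that to a whole number of blocks."""
    iters_in_flight = math.ceil(l1_latency / ii)
    return math.ceil(iters_in_flight / iters_per_block)

# 6-cycle L1, II = 2, 2 iterations per block: one block ahead is not
# enough, matching the slide's fix of prefetching two blocks in advance.
print(prefetch_distance(6, 2, 2))   # 2
```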
Distributed Cache

Two previously proposed distributed schemes used for comparison:
- Word-interleaved cache [MICRO35]: one L1 module per cluster, with the words of a block interleaved across modules (e.g. cluster 1 holds W0, W2, W4, W6 and cluster 2 holds W1, W3, W5, W7).
- MultiVLIW [MICRO33]: one L1 module per cluster holding entire L1 cache blocks, kept consistent by a cache-coherence protocol.
Performance Results
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
Conclusions

- Flexible compiler-managed L0 buffers: mapping flexibility and memory instruction hints.
- Instruction scheduling techniques: mark "critical" instructions, do not overflow the buffers, and apply the memory coherence solutions of [CGO'03].
- Performance results: 16% better than a unified L1 cache without buffers; outperforms the word-interleaved cache [MICRO35]; competitive with MultiVLIW [MICRO33].
Questions?