MICRO-36, San Diego, December 2003

Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona
Motivation

Baseline: a clustered VLIW processor in which each cluster has its own register file and functional units, clusters communicate over register-to-register buses, and memory is a unified L1 cache backed by an L2 cache over memory buses. Two ways to give each cluster faster access to memory:
- Option 1: distribute the L1 cache, placing an L1 cache module in each cluster.
- Option 2: keep the L1 cache unified and add a small memory buffer to each cluster.
Contributions

- A small L0 buffer in each cluster:
  - Flexible mechanisms to map data to the buffers.
  - Compiler-controlled memory instruction hints.
- Instruction scheduling techniques (VLIW):
  - Mark "critical" instructions to use the buffers.
  - Use the appropriate memory instruction hints.
- Data coherence among buffers [CGO'03]:
  - Three mechanisms: same cluster, partial store replication, and not using the buffers.
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
L0 Buffers

Each cluster keeps its register file and INT/FP/MEM functional units and adds a small L0 buffer, with unpack logic, between its memory unit and the shared L1 cache. Clusters still communicate through the register-to-register buses.
Mapping Flexibility

A 16-byte L1 block holds four 4-byte elements a[0]..a[3] (a[4]..a[7] sit in the next block).
- Linear mapping: a single "load a[0] with stride 1 element" brings consecutive elements (a[0], a[1], ...) into one cluster's L0 buffer.
- Interleaved mapping (1-cycle penalty in the unpack logic): four loads (load a[0] .. load a[3]), each with a 4-element stride, spread the data across the four L0 buffers, so cluster 1 receives a[0] and a[4], cluster 2 receives a[1] and a[5], and so on.
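The two mappings can be mimicked in a few lines. This is an illustrative sketch, not the paper's hardware: the function names and the dictionary representation of the per-cluster buffers are mine.

```python
N_CLUSTERS = 4

def linear_mapping(elements, cluster):
    """Linear: one cluster's L0 buffer receives consecutive elements."""
    return {cluster: list(elements)}

def interleaved_mapping(elements):
    """Interleaved: element a[i] goes to cluster i % N_CLUSTERS, so each
    cluster's L0 buffer holds a stride-N_CLUSTERS subsequence (at the
    cost of 1 extra cycle in the unpack logic)."""
    buffers = {c: [] for c in range(N_CLUSTERS)}
    for i, elem in enumerate(elements):
        buffers[i % N_CLUSTERS].append(elem)
    return buffers

a = [f"a[{i}]" for i in range(8)]      # two consecutive L1 blocks
print(interleaved_mapping(a)[0])       # cluster 0 holds ['a[0]', 'a[4]']
```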
Memory Hints

- Access directives: no access, sequential, or parallel. Example: in cycle i, "load sequential a[0]" is served from the L0 buffer; in cycle i+1, "load no access *p" bypasses the buffer and goes to the L1 cache.
- Mapping hints: linear or interleaved.
- Prefetching hints: none, positive, or negative. Example: "load *a (positive prefetch)" in cycle i, followed by a++, brings the following elements (a[1], a[2], a[3]) into the L0 buffer before they are needed in cycle i+1.
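The three hint families compose independently on each memory instruction. A minimal sketch; the enum names are my labels, not the actual ISA encoding:

```python
from enum import Enum

class Access(Enum):          # access directive
    NO_ACCESS = 0            # bypass the L0 buffer, go straight to L1
    SEQUENTIAL = 1
    PARALLEL = 2

class Mapping(Enum):         # mapping hint
    LINEAR = 0
    INTERLEAVED = 1          # 1 extra cycle in the unpack logic

class Prefetch(Enum):        # prefetching hint
    NONE = 0
    POSITIVE = 1             # prefetch the next block
    NEGATIVE = 2             # prefetch the previous block

def serves_from_l0(access):
    """Only 'no access' loads skip the L0 buffer entirely."""
    return access is not Access.NO_ACCESS

print(serves_from_l0(Access.SEQUENTIAL))   # True
print(serves_from_l0(Access.NO_ACCESS))    # False
```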
L0-L1 Interaction

L0 buffers are write-through: a store updates both the L0 buffer and the L1 cache. This brings three benefits:
1) It simplifies replacements: no bus arbitration and no flush instruction are needed, since evicted data is already in L1.
2) No pack logic is required (only unpack logic on the load path).
3) Data consistency between the L0 buffers and the L1 cache is preserved.
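The write-through policy can be modeled with a toy two-level store. This is a sketch under simplifying assumptions of mine: dictionaries stand in for the buffer and the cache, loads allocate L0 entries, and replacement choice is ignored.

```python
class WriteThroughL0:
    """Toy model of one cluster's write-through L0 buffer over L1."""

    def __init__(self, n_entries):
        self.n_entries = n_entries
        self.l0 = {}                 # addr -> value (per-cluster L0 buffer)
        self.l1 = {}                 # addr -> value (simplified L1 cache)

    def load(self, addr):
        value = self.l1.get(addr, 0)
        if len(self.l0) < self.n_entries:
            self.l0[addr] = value    # allocate on load while space remains
        return value

    def store(self, addr, value):
        if addr in self.l0:
            self.l0[addr] = value
        self.l1[addr] = value        # write-through: L1 is always updated

    def evict(self, addr):
        self.l0.pop(addr, None)      # silent drop: L1 already has the data
```

Because every store also reaches L1, `evict` never writes anything back, which is exactly why no pack logic or flush instruction is needed.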
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
Memory Coherence

Running example (schedule): cycle i: load A, load B; cycle i+1: store C; cycle i+2: load D; cycle i+3: store E. Two of the [CGO'03] mechanisms:
- Not using the buffers (NB): the conflicting memory instructions bypass the L0 buffers and access the L1 cache directly.
- One cluster (1C): dependent memory instructions (e.g. store C and the later load D) are assigned to the same cluster, so the load sees the store's value through that cluster's L0 buffer.
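The 1C mechanism amounts to a partition constraint at cluster-assignment time: instructions that may touch the same data must land on the same cluster. A hedged sketch; the union-find grouping and the round-robin placement are illustrative, not the paper's assignment heuristic:

```python
def assign_clusters(insts, deps, n_clusters=2):
    """insts: instruction names; deps: pairs (a, b) that must share a
    cluster (a may-dependence through memory). Groups dependent
    instructions, then spreads the groups round-robin over clusters."""
    parent = {i: i for i in insts}

    def find(x):                      # union-find root lookup
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in deps:                 # merge dependent instructions
        parent[find(a)] = find(b)

    roots = sorted({find(i) for i in insts})
    cluster_of_root = {r: k % n_clusters for k, r in enumerate(roots)}
    return {i: cluster_of_root[find(i)] for i in insts}

insts = ["load A", "load B", "store C", "load D", "store E"]
deps = [("store C", "load D")]        # load D must observe store C
placement = assign_clusters(insts, deps)
assert placement["store C"] == placement["load D"]
```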
Scheduling Algorithm (I)

Overview:
- Candidate instructions: strided memory instructions. They dominate in Mediabench (fraction of strided memory accesses per benchmark): epicdec 99%, g721dec 100%, g721enc 100%, gsmdec 97%, gsmenc 99%, jpegdec 60%, jpegenc 49%, mpeg2dec 96%, pegwitdec 50%, pegwitenc 56%, pgpdec 99%, pgpenc 86%, rasta 95%.
- Assign the "critical" candidate instructions to the buffers.

Loop unrolling:
- Unroll factors: 1 or N.
- Unrolling by N may benefit from interleaved mapping.
- Trade off global communications against workload balance.
- Do not overflow the buffers.
Scheduling Algorithm (II)

The scheduler extends Swing Modulo Scheduling:
1) Sort the nodes and initialize the data structures (NFreeEntries per L0 buffer; latencies derived from slack).
2) Take the next node and compute P, its set of possible clusters (honoring coherence constraints and memory dependences).
3) If P is empty, increase the initiation interval (II = II + 1) and retry.
4) Sort P by L0 availability, minimizing global communications and maximizing workload balance, and compute latencies using NFreeEntries.
5) Schedule the node in a cluster of P; update NFreeEntries, recompute criticality, and reassign latencies.

The running example on the slide tracks NFreeEntries as instructions are placed, e.g. from {2, 2} down to {1, 0} or {1, 1} depending on where memory dependences force each access.
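The cluster-selection step ("Sort P", then "Schedule in a Cluster of P") boils down to picking the best candidate among the possible clusters. A simplified sketch, assuming a lexicographic cost (communications first, then workload) rather than the paper's exact heuristic:

```python
def pick_cluster(possible, free_entries, comm_cost, workload):
    """possible: candidate cluster ids for the current node;
    free_entries[c]: free L0 buffer entries on cluster c;
    comm_cost[c]: global register communications induced by choosing c;
    workload[c]: instructions already scheduled on c.
    Returns the best cluster, or None (the caller then bumps II)."""
    candidates = [c for c in possible if free_entries[c] > 0]
    if not candidates:
        return None                       # P effectively empty: II = II + 1
    # Prefer fewer global communications, then better workload balance.
    return min(candidates, key=lambda c: (comm_cost[c], workload[c]))

best = pick_cluster(possible=[0, 1],
                    free_entries={0: 1, 1: 0},   # cluster 1's L0 is full
                    comm_cost={0: 2, 1: 0},
                    workload={0: 3, 1: 1})
assert best == 0
```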
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
Evaluation Framework (I)

- IMPACT C compiler: compilation, optimization, and memory disambiguation, extended with the proposed instruction scheduler.
- Mediabench benchmark suite (benchmark: input): epicdec: titanic; g721dec: S_16_44; g721enc: S_16_44; gsmdec: S_16_44; gsmenc: S_16_44; jpegdec: monalisa; jpegenc: monalisa; mpeg2dec: tek6; pegwitdec: techrep; pegwitenc: techrep; pgpdec: techrep; pgpenc: techrep; rasta: ex5_c1.
Evaluation Framework (II)

Architecture configuration:
- Clusters: 4.
- Functional units: 1 FP + 1 integer + 1 memory unit per cluster.
- L0 buffers: 8-byte subblocks, fully associative, 1-cycle latency; 1 extra cycle for interleaved mapping (unpack logic).
- L1 cache: 8 KB total size, 32-byte blocks, 2-way set associative, 6-cycle latency.
- L2 cache: 10-cycle latency, always hits.
- Register communications: 4 buses with a 2-cycle latency.
Number of L0 Entries
L0 Hit Rate
Improving L0 Hit Rate

Problem: with II = 2, a block prefetched only one block ahead (e.g. a[2]'s block) is needed before it reaches the L0 buffer.
Solution: prefetch two blocks in advance.
- This uses more L0 buffer entries.
- Speedups: 1.12 in epicdec (+7% hit rate) and 1.04 in rasta (+12% hit rate).
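A back-of-the-envelope check on the prefetch distance. Assumptions of mine: the evaluated 6-cycle L1 latency, II = 2, and two loop iterations consuming each block, as in the slide's example.

```python
import math

def prefetch_distance(l1_latency, ii, iters_per_block):
    """Blocks to prefetch ahead so data arrives in L0 before it is used.
    Iterations elapsed while the prefetch is in flight = latency / II;
    convert that to a whole number of blocks."""
    iters_in_flight = math.ceil(l1_latency / ii)
    return math.ceil(iters_in_flight / iters_per_block)

# 6-cycle L1, II = 2, 2 iterations per block: one block ahead is not
# enough, matching the slide's fix of prefetching two blocks in advance.
print(prefetch_distance(6, 2, 2))   # 2
```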
Distributed Cache

Two previously proposed distributed schemes used for comparison:
- Word-interleaved cache [MICRO35]: one L1 module per cluster, with the words of a block interleaved across modules (e.g. cluster 1 holds W0, W2, W4, W6 and cluster 2 holds W1, W3, W5, W7).
- MultiVLIW [MICRO33]: one L1 module per cluster holding entire L1 cache blocks, kept consistent by a cache-coherence protocol.
Performance Results
Talk Outline

- Flexible Compiler-Managed L0 Buffers
- Instruction Scheduling Techniques
- Evaluation
- Conclusions
Conclusions

- Flexible compiler-managed L0 buffers: mapping flexibility and memory instruction hints.
- Instruction scheduling techniques: mark "critical" instructions, do not overflow the buffers, and apply the memory coherence solutions of [CGO'03].
- Performance results: 16% better than a unified L1 cache without buffers; outperforms the word-interleaved cache [MICRO35]; competitive with MultiVLIW [MICRO33].
Questions?