Published by Lily Parmley. Modified over 9 years ago.
1
Dynamically Collapsing Dependencies for IPC and Frequency Gain
Peter G. Sassone, D. Scott Wills
Georgia Tech, Electrical and Computer Engineering
{sassone, scott.wills}@ece.gatech.edu
2
MICRO-37 / Sassone & Wills / Georgia Tech / Dynamic Strands
Motivation
Outside of pipeline, global communication dominates
Memory wall is well studied
Inside, traditionally computation or logic dominated
[Figure: pipeline stages (fetch, decode, rename, issue, exec, commit) and memory hierarchy (I cache, D cache, L2 cache, memory)]
3
Motivation
Now dominated by local communication paths:
– issue window
– reorder buffer
– register file
– bypass network
Bottlenecks both IPC and frequency
[Figure: issue queue, issue logic, ALU, register file]
4
Motivation
RISC instruction sets create superfluous traffic
All instructions and operands are treated as equal
Little focus on exposing sequentiality
[Figure: issue queue, issue logic, ALU, register file]
5
Contributions
Dynamic Strands:
– collapse dependence chains without fan-out
– exploit properties for simple value precomputation
– increase efficiency of critical resources
– preserve binary compatibility
IPC improvements:
– 17-20% speedup on SPEC2000int and MediaBench
Frequency improvements:
– 37% fewer in-flight instructions
– reduced dependence on dependencies
6
Outline
Motivation
Transient Operands and Strands
Instruction Replacement Hardware
Results
Conclusion
7
Dyadic Dilemma
Performing any operation on more than two sources requires temporary values

    int sum( int a, int b, int c, int d )
    {
        return a + b + c + d;
    }

    ...
    add R1 ← R1, R2
    add R1 ← R1, R3
    add R9 ← R1, R4
    ...

[Figure: dataflow graph of three chained adds over R1, R2, R3, R4 producing R9, with temporaries R1' and R1'']
8
Transient Operands
We term these temporary values transient operands:
– values produced by an ALU instruction
– values consumed only once, and only by an ALU instruction
Common in modern integer workloads…
On average, about 40% of all dynamic operands are transient
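The definition of a transient operand can be made concrete with a small trace scan. This is a minimal sketch, not the authors' measurement tooling: the `Inst` record, the use of -1 for an unused operand, and the function `count_transients` are hypothetical names introduced for illustration. It assumes a value counts as transient when it is produced by an ALU instruction and read exactly once, by an ALU instruction, before its register is overwritten.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical trace record: one dynamic instruction.
   A register number of -1 means "operand not used". */
typedef struct {
    bool is_alu;   /* true for simple ALU ops, false for loads, stores, branches */
    int  dest;     /* destination register, or -1 */
    int  src1;     /* first source register, or -1 */
    int  src2;     /* second source register, or -1 */
} Inst;

/* Counts values that are produced by an ALU instruction and read exactly
   once, by an ALU instruction, before their register is overwritten.
   Values still live at the end of the trace are conservatively skipped. */
size_t count_transients(const Inst *trace, size_t n)
{
    size_t transients = 0;

    for (size_t i = 0; i < n; i++) {
        if (!trace[i].is_alu || trace[i].dest < 0)
            continue;

        int    reg         = trace[i].dest;
        size_t reads       = 0;
        bool   alu_only    = true;
        bool   overwritten = false;

        for (size_t j = i + 1; j < n && !overwritten; j++) {
            /* Sources are read before the destination is written. */
            if (trace[j].src1 == reg) {
                reads++;
                alu_only = alu_only && trace[j].is_alu;
            }
            if (trace[j].src2 == reg) {
                reads++;
                alu_only = alu_only && trace[j].is_alu;
            }
            overwritten = (trace[j].dest == reg);
        }

        if (overwritten && reads == 1 && alu_only)
            transients++;
    }
    return transients;
}
```

Fed the three-add sequence from the Dyadic Dilemma slide followed by one more instruction that overwrites R1, the scan counts two transients, R1' and R1''. The final value in R9 is skipped because its lifetime is still open when the trace ends.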
9
Strands
Strands:
– linear chains of instructions joined by transient operands
– non-consecutive
– span basic blocks
– three instructions
– only the final output needs to be committed
Strands are common
– dyadic temporaries
– compiler strategies
– language semantics
[Figure: dataflow graph of a + b + c + d as three chained adds]
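One way to picture a strand is as a single macro-operation that reads its live inputs, chains the ALU operations internally, and writes only the final output. The sketch below is illustrative only: `Strand`, `StrandStep`, `AluOp`, and `execute_strand` are hypothetical names, the set of operations is an assumption, and the transient value is assumed to always be the left operand of the next operation.

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_STRAND_OPS = 3 };   /* assumption: a strand holds three ALU ops */

typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } AluOp;

/* One link in the chain: applies `op` to the running value and one live-in register. */
typedef struct {
    AluOp op;
    int   src;
} StrandStep;

typedef struct {
    int        first_src;              /* register supplying the initial running value */
    int        dest;                   /* the only architectural register written */
    size_t     len;                    /* number of valid steps */
    StrandStep step[MAX_STRAND_OPS];
} Strand;

static uint32_t alu(AluOp op, uint32_t a, uint32_t b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    case OP_XOR: return a ^ b;
    }
    return 0;
}

/* Executes the whole chain; intermediate (transient) values live only in `acc`
   and are never written back to the register array. */
void execute_strand(const Strand *s, uint32_t regs[])
{
    uint32_t acc = regs[s->first_src];

    for (size_t i = 0; i < s->len; i++)
        acc = alu(s->step[i].op, acc, regs[s->step[i].src]);

    regs[s->dest] = acc;
}
```

For the sum example, a strand with `first_src = 1`, `dest = 9`, and three `OP_ADD` steps over R2, R3, and R4 writes a + b + c + d into R9. R1 keeps its old value, whereas the original instruction sequence would have left a + b + c there.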
10
Outline
Motivation
Transient Operands and Strands
Instruction Replacement Hardware
Results
Conclusion
11
Hardware Overview
[Figure: pipeline of fetch, decode, rename, issue queue, reg file, closed-loop ALUs, and commit, with a strand cache fill unit, strand cache, and dispatch engine added off the critical path; arrows labeled instructions, transients, and strands]
12
Algorithm Example
[Figure: the hardware overview diagram (fetch, decode, rename, issue queue, reg file, closed-loop ALUs, commit, strand cache fill unit, strand cache, dispatch engine) annotated with numbered markers 0 to 3]
13
Strand Cache Fill Unit
Based around the operand table
Detects conditions of transients
When found…
– append to existing strand
– begin new strand
[Figure: operand table indexed by arch reg (R4, R5, R6) with columns last producer instruction, last consumer instruction, and consumer count. Example code: 1404: R5 ← R0 + 0; 1408: ...; 1412: R1 ← R5 + 0; 1416: R5 ← R0 + 0. At PC 1416 the R5 entry reads PC 1404, PC 1412, 1.]
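The operand table lends itself to a short software model. The following is a sketch under stated assumptions, not the hardware described in the talk: the 32-register file size, the `producer_is_alu` and `consumers_all_alu` flags, and the names `OperandTable`, `TransientLink`, and `observe` are all hypothetical. It keeps the three fields shown on the slide (last producer, last consumer, consumer count) per architectural register and reports a producer/consumer pair whenever a register is overwritten after exactly one ALU read of an ALU-produced value.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_ARCH_REGS = 32 };   /* assumption: 32 architectural registers */

typedef struct {
    bool     valid;              /* a producer has been seen for this register */
    bool     producer_is_alu;    /* hypothetical flag, not shown on the slide */
    bool     consumers_all_alu;  /* hypothetical flag, not shown on the slide */
    uint32_t last_producer_pc;
    uint32_t last_consumer_pc;
    unsigned consumer_count;
} OperandEntry;

typedef struct {
    OperandEntry reg[NUM_ARCH_REGS];
} OperandTable;

/* Result of feeding one instruction: a producer -> consumer pair joined
   by a transient operand, if the overwritten value turned out to be one. */
typedef struct {
    bool     found;
    uint32_t producer_pc;
    uint32_t consumer_pc;
} TransientLink;

static void note_read(OperandTable *t, int r, uint32_t pc, bool is_alu)
{
    if (r < 0)
        return;                  /* -1 means "operand not used" */
    OperandEntry *e = &t->reg[r];
    e->consumer_count++;
    e->last_consumer_pc = pc;
    e->consumers_all_alu = e->consumers_all_alu && is_alu;
}

/* Feed one instruction in program order. Sources are recorded first,
   because an instruction reads its operands before writing its result. */
TransientLink observe(OperandTable *t, uint32_t pc, bool is_alu,
                      int dest, int src1, int src2)
{
    TransientLink link = { false, 0, 0 };

    note_read(t, src1, pc, is_alu);
    note_read(t, src2, pc, is_alu);

    if (dest >= 0) {
        OperandEntry *e = &t->reg[dest];

        /* The old value dies here; decide whether it was transient. */
        if (e->valid && e->producer_is_alu &&
            e->consumers_all_alu && e->consumer_count == 1) {
            link.found = true;
            link.producer_pc = e->last_producer_pc;
            link.consumer_pc = e->last_consumer_pc;
        }

        /* Start tracking the newly produced value. */
        e->valid = true;
        e->producer_is_alu = is_alu;
        e->consumers_all_alu = true;
        e->last_producer_pc = pc;
        e->last_consumer_pc = 0;
        e->consumer_count = 0;
    }
    return link;
}
```

Replaying the slide's example (PCs 1404, 1412, 1416) through `observe`, the write to R5 at PC 1416 reports the pair (1404, 1412). A fill unit could use such a pair to append to an existing strand or begin a new one.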
14
Strand Cache
About 175 bytes per line, though very few lines are needed for effect
[Figure: strand cache line layout for strands 1 to 3. Field labels: status bits (101110101), previous reader info, instructions i1, i2, i3, pc, ready, value, instructions seen, and pc / inst / seen entries for this instruction, source 1, and source 2]
15
Dispatch Engine
Watches for strand cache matches
Inserts ready strands into the stream eagerly
Removes component instructions when seen
Correctness checking with dirty table
[Figure: dispatch engine with decode, rename, and strand cache; labels: pre-renamed instructions, strands, recovery strands, kill signals, dirty table]
16
Closed-Loop ALUs
Full bypass is half of the execute stage delay
Regular ALUs with double-speed closed-loop mode
– two dependent ALU operations in a single cycle
– intermediate values (the transients) are discarded!
– final result still takes ½ cycle for full bypass
[Figure: ALU with a mode switch between the full bypass network (½ cycle) and a "free" local bypass]
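The latency argument can be worked through with simple arithmetic. The sketch below is a back-of-the-envelope model, not a result from the talk: it assumes the execute stage is one cycle, split evenly between the ALU operation and the full bypass, and it counts in half-cycles so the numbers stay integral. The function names are hypothetical.

```c
/* Latencies in half-cycles. Assumption: execute stage = 1 cycle,
   of which half is the ALU operation and half is full bypass. */
enum {
    ALU_OP_HALF_CYCLES      = 1,
    FULL_BYPASS_HALF_CYCLES = 1
};

/* Dependent chain on regular ALUs: every result, including each
   intermediate one, crosses the full bypass network. */
unsigned chain_half_cycles_regular(unsigned n_ops)
{
    return n_ops * (ALU_OP_HALF_CYCLES + FULL_BYPASS_HALF_CYCLES);
}

/* Same chain in closed-loop mode: intermediate values use the "free"
   local bypass, and only the final result pays for full bypass. */
unsigned chain_half_cycles_closed_loop(unsigned n_ops)
{
    if (n_ops == 0)
        return 0;
    return n_ops * ALU_OP_HALF_CYCLES + FULL_BYPASS_HALF_CYCLES;
}
```

Under these assumptions a three-operation chain takes 6 half-cycles (3 cycles) on regular ALUs and 4 half-cycles (2 cycles) in closed-loop mode. Two dependent operations fit in one cycle of ALU time, plus the final ½ cycle of bypass.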
17
Oops… Dirty Read
R1 is dirty!
Insert recovery sub-strand to recover R1
[Figure: the strand over R1, R2, R3, R4 producing R9 (temporaries R1' and R1''), followed by load 16 [ R1 ]; a recovery sub-strand over R1, R2, R3 recomputes the value the load expects]
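The dirty-read check can be sketched as a one-bit-per-register table. This is an assumed model, since the slides do not give the dirty table's layout or update rules: the `DirtyTable` type and the three functions are hypothetical, and register numbers are assumed to lie in the range 0 to 31.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical dirty table: bit r is set when register r holds a stale
   value because a strand skipped the intermediate writes to it.
   Assumption: register numbers are in the range 0..31. */
typedef struct {
    uint32_t dirty;
} DirtyTable;

/* A strand replaced an instruction that would have written `reg`,
   so the architectural value of `reg` is now out of date. */
void mark_skipped_write(DirtyTable *t, int reg)
{
    t->dirty |= (UINT32_C(1) << reg);
}

/* A later instruction wrote `reg`, so it holds a correct value again. */
void note_write(DirtyTable *t, int reg)
{
    t->dirty &= ~(UINT32_C(1) << reg);
}

/* True if an instruction reading `reg` must first be preceded by a
   recovery sub-strand that recomputes the discarded value. */
bool needs_recovery(const DirtyTable *t, int reg)
{
    return ((t->dirty >> reg) & 1u) != 0;
}
```

In the slide's example, dispatching the strand would call `mark_skipped_write` for R1, so the later `load 16 [ R1 ]` sees `needs_recovery` return true, while R9, which the strand did write, is clean.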
18
Oops… Anti-Dependence Violation
R9 has already been replaced
Renaming not sufficient – outside reorder buffer safety net
Insert load immediate of previous value
[Figure: the strand over R1, R2, R3, R4 producing R9 (temporaries R1' and R1''), with load 32 [ R9 ] needing R9's previous value]
19
Outline
Motivation
Transient Operands and Strands
Instruction Replacement Hardware
Results
Conclusion
20
Instruction Coverage
High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size
Average ALU inst coverage:
– 16 entries: 12%
– 1024 entries: 27%
[Figure: coverage with various strand cache sizes]
21
IPC Improvements
Average IPC speedup:
– 4-wide: 17%
– 8-wide: 20%
Some benchmarks almost double in IPC
Some see almost no speedup at all
[Figure: 4-wide IPC speedup with 16-entry strand cache]
22
Resource Occupancy
CISCification of instructions reduces traffic
– reorder buffer occupancy is reduced up to 37%
– issue queue occupancy is reduced up to 34%
– traffic reduction coverage
Reduced dependence on dependencies
– opportunity for pipelined bypass
– opportunity for pipelined issue
[Figure: strand of chained add operations]
23
Resource Occupancy
Caveat emptor
– more worst-case issue CAMs
– more worst-case register ports
Prior work applicable
– only 1.2 live inputs / strand
[Figure: strand of chained add operations]
24
Outline
Motivation
Transient Operands and Strands
Instruction Replacement Hardware
Results
Conclusion
25
Conclusion
Key points:
– eagerly executing macro-instructions value precomputation
– limiting focus to transient operands
– all new hardware off critical path
Results:
– IPC speedup of 18-20% with 3 KB strand cache
– potential for frequency gains
– full binary compatibility
Lots of current and future research:
– relaxed constraint of ALU instructions
– quantified frequency improvements
– static detection of strands
Questions?
26
Backup Slides
27
Sensitivity to Dispatch Delay
On average, speedup only drops 1% with three cycles of delay
Some actually get faster due to fewer errant strands
Most benchmarks lose a small amount of speedup
[Figure: 4-wide IPC speedup with 16-entry strand cache]