Published by Alena Tinkham. Modified over 9 years ago.
1
Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue
David Tarjan*, Michael Boyer, and Kevin Skadron*
University of Virginia, Department of Computer Science
* Currently on internship/sabbatical at NVIDIA Research
2
Motivation
[Figure: three chip organizations, labeled Homogeneous, Heterogeneous, and Adaptive (Federation). Legend: multithreaded scalar IO core, 2-way OO core, L2.]
3
Basic Insights
A multithreaded in-order core has many registers which can be reused for a reorder buffer or active list.
If cores are small, single-cycle communication between neighbors is feasible.
Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible.
4
In-order & Out-of-order Pipelines
[Figure: two pipeline diagrams.]
In-order stages: Fetch, Decode, Execute, Mem, Writeback
Out-of-order stages: the same five, plus Bpred, Allocate, Rename, Issue, and Commit
5
Issue Queue Example
[Figure: issue queue with entries 1 to 5 and columns Ready Bits, Subscriber Slot 1, Subscriber Slot 2; subscriber slots hold IQ slot numbers such as IQ2 and IQ3.]
Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002
Sassone et al., Matrix Scheduler Reloaded, ISCA 2007
6
Simplified Load-Store Queue
Memory Alias Table (MAT)
No store forwarding
No conservative waiting on stores
Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits
Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005
7
MAT Example
st 0x13, r5
ld r1, 0x13
MAT entries 0-7: 0 0 0 0 0 0 0 0
8
MAT Example
st 0x13, r5
ld r1, 0x13 (EXE)
MAT entries 0-7: 0 0 0 1 0 0 0 0
ld executes and increments counter
9
MAT Example
st 0x13, r5 (COM)
ld r1, 0x13
MAT entries 0-7: 0 0 0 1! 0 0 0 0
st commits and sets flag
10
MAT Example
ld r1, 0x13 (COM)
MAT entries 0-7: 0 0 0 1! 0 0 0 0
Flush: ld commits, sees flag, and flushes pipeline
11
MAT Example
ld r1, 0x13
MAT entries 0-7: 0 0 0 0 0 0 0 0
MAT is reset and execution resumes
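The MAT walkthrough on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware design. The class and method names are hypothetical, the index function (address modulo table size) is an assumption chosen because it maps 0x13 to entry 3 as in the slides, and two behaviors are assumed rather than stated: a committing store sets the flag only when the counter is nonzero, and a load that commits without seeing a flag decrements its counter.

```python
class MemoryAliasTable:
    """Toy model of the MAT example; names and hash are assumptions."""

    def __init__(self, entries=8):
        self.counters = [0] * entries   # loads executed but not yet committed
        self.flags = [False] * entries  # set by a committing store that sees a load

    def _index(self, addr):
        # Assumed hash: low-order address bits (0x13 % 8 == 3, as in the slides).
        return addr % len(self.counters)

    def load_execute(self, addr):
        # "ld executes and increments counter"
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # "st commits and sets flag" (assumed: only if a load already executed)
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        # "ld commits, sees flag, and flushes pipeline"
        i = self._index(addr)
        if self.flags[i]:
            self.reset()                # "MAT is reset and execution resumes"
            return True                 # caller must flush the pipeline
        self.counters[i] -= 1           # assumed: clean commit releases the entry
        return False

    def reset(self):
        n = len(self.counters)
        self.counters = [0] * n
        self.flags = [False] * n


mat = MemoryAliasTable()
mat.load_execute(0x13)            # younger load runs ahead of the store
mat.store_commit(0x13)            # older store commits and finds the counter set
flush = mat.load_commit(0x13)     # load commits, sees the flag
```

Because different addresses can share an entry, a model like this also reports violations that did not actually happen, which matches the "false positives" mentioned on the Commit slide.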
12
Performance Impact
13
Performance
14
Energy Efficiency
15
Area Efficiency
16
Conclusions
Two in-order cores can be federated at run-time to form a 2-way OO core.
Almost doubling the IPC of a throughput core is possible with very little extra hardware.
Don’t want traditional OO structures because their performance comes at too high a price.
Best combined area- and energy-efficiency.
17
Q & A
18
Backup
19
Core Fusion Data
Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007
20
Overall Results
Scalar in-order core: 8KB I- and D-caches, 256KB L2
Base 2-way core: 16KB I- and D-caches, 256KB L2, 32-entry ROB, 16-entry issue queue, 16-entry LSQ, bimodal bpred
4-way core: 32KB I- and D-caches, 2MB L2, 128-entry ROB, 32-entry IQ and LSQ, tournament bpred
21
Branch Prediction
Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS).
NLS is OK if the instruction working set is no larger than the I$.
A small bimodal predictor is OK for a small-window processor.
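For readers unfamiliar with the term, a bimodal predictor is a table of 2-bit saturating counters indexed by low-order PC bits. The sketch below shows the textbook form only. The table size, the 4-byte instruction assumption in the index, and the initial counter value are illustrative choices, not parameters taken from the slides.

```python
class BimodalPredictor:
    """Textbook bimodal predictor: one 2-bit saturating counter per entry."""

    def __init__(self, entries=16):
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        # Start at 1 (weakly not-taken); this initial value is an assumption.
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two low PC bits carry no information.
        return (pc >> 2) % len(self.counters)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bp = BimodalPredictor()
bp.update(0x40, True)        # one taken outcome moves the counter from 1 to 2
prediction = bp.predict(0x40)
```

The second bit provides hysteresis: a branch that is usually taken needs two consecutive not-taken outcomes before its prediction flips.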
22
Fetch
Two I$’s act as an I$ of twice the size and associativity (with random replacement).
More logic and buffers to capture two instructions.
Extra cycle to route instructions from the two I$’s to the two decoders.
23
Decode
Cancel the second instruction if the first turns out to be a branch.
Extra cycle to route decoded instructions to the new allocate stage.
24
Allocate
New logic and free lists to allocate ROB and IQ entries.
25
Rename
New table, since it has too many ports.
One centralized rename table, not distributed.
Has a separate table (or a field in each RAT entry) holding the IQ-slot number of each register’s producer instruction (see our new issue queue).
26
Issue
Uses a simple lookup table as the wakeup structure, where instructions subscribe to their input instructions (explained in detail later).
Centralized: one IQ for the two cores.
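A rough sketch of subscription-based wakeup, as described on the Issue and Rename slides, is given below. It is only an illustration under stated assumptions: the class and method names are hypothetical, two subscriber slots per entry are taken from the Issue Queue Example figure, and the policy when both slots are full (reject the insert so the caller can stall) is an assumption because the slides do not say what happens in that case.

```python
from dataclasses import dataclass, field

SUBSCRIBER_SLOTS = 2  # per the Issue Queue Example figure


@dataclass
class IQEntry:
    pending: int = 0                                  # source operands not yet ready
    subscribers: list = field(default_factory=list)   # IQ slots of consumers


class SubscriptionIssueQueue:
    """Toy wakeup table: consumers subscribe to their producers' IQ slots."""

    def __init__(self):
        self.entries = {}

    def insert(self, slot, producers):
        # 'producers' are IQ slots of in-flight producers, as the rename
        # stage would supply them. Reject if any producer lacks free slots
        # (assumed policy: the caller stalls and retries later).
        for p in set(producers):
            used = len(self.entries[p].subscribers)
            if used + producers.count(p) > SUBSCRIBER_SLOTS:
                return False
        self.entries[slot] = IQEntry(pending=len(producers))
        for p in producers:
            self.entries[p].subscribers.append(slot)
        return True

    def ready(self, slot):
        return self.entries[slot].pending == 0

    def issue(self, slot):
        # Wakeup is a direct table lookup: no tag broadcast, no CAM search.
        entry = self.entries.pop(slot)
        for s in entry.subscribers:
            self.entries[s].pending -= 1
        return entry.subscribers


iq = SubscriptionIssueQueue()
iq.insert(1, [])        # no in-flight inputs, ready immediately
iq.insert(2, [1])       # waits on slot 1
iq.insert(3, [1, 2])    # waits on slots 1 and 2
woken = iq.issue(1)     # wakes the subscribers of slot 1
```

The point of the structure is that an issuing instruction touches at most two other entries, so the wakeup logic stays small compared with a broadcast-based issue queue.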
27
Register File
Register file is mirrored in the two cores.
No extra copy instructions or load-balancing questions.
28
Execute
Add an extra cycle for copying the result to the other core’s register file (like EV6).
29
Memory Access
The two D$s are checked in parallel, each responsible for half of the merged D$’s ways.
No standard LSQ, only a Memory Alias Table (details later).
The MAT only detects ordering violations and sends a signal to the pipeline.
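The merged lookup can be pictured as two caches that share one set index, each contributing half of the ways. The sketch below is an illustration only: the class names, cache geometry, and line size are hypothetical, and the random victim choice is borrowed from the Fetch slide's description of the merged I$, since the slides do not state a replacement policy for the D$.

```python
import random


class CacheHalf:
    """One core's cache: tag storage only, no data."""

    def __init__(self, sets, ways):
        self.tags = [[None] * ways for _ in range(sets)]

    def hit(self, set_idx, tag):
        return tag in self.tags[set_idx]

    def fill(self, set_idx, way, tag):
        self.tags[set_idx][way] = tag


class FederatedDCache:
    """Two halves probed with the same set index; together they form
    one cache with twice the ways (geometry values are assumptions)."""

    def __init__(self, sets=4, ways_per_core=2, line_bytes=32, seed=0):
        self.sets = sets
        self.ways = ways_per_core
        self.line_bytes = line_bytes
        self.halves = [CacheHalf(sets, ways_per_core) for _ in range(2)]
        self.rng = random.Random(seed)

    def access(self, addr):
        block = addr // self.line_bytes
        set_idx, tag = block % self.sets, block // self.sets
        # Hardware probes both halves in parallel; here we just check both.
        if any(h.hit(set_idx, tag) for h in self.halves):
            return True
        # Miss: pick a random half and a random way as the victim (assumed policy).
        victim = self.rng.choice(self.halves)
        victim.fill(set_idx, self.rng.randrange(self.ways), tag)
        return False


dcache = FederatedDCache()
first = dcache.access(0x1000)    # cold miss, line is filled
second = dcache.access(0x1000)   # same line now hits in one of the halves
```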
30
Commit
Centralized commit, no slippage.
Recovers from branch mispredictions at commit, since the RAT is not checkpointed on branches.
Recovers from memory order violations (or false positives) reported by the MAT.