A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
Ayose Falcón, Alex Ramirez, Mateo Valero
HPCA-10, February 18, 2004
Simultaneous Multithreading
- SMT [Tullsen95] / Multistreaming [Yamamoto95]
- Instructions from different threads coexist in each processor stage
- Resources are shared among the different threads
- But sharing implies competition: in caches, queues, functional units, ...
- The fetch policy decides!

Motivation
- SMT performance is limited by fetch performance: a superscalar fetch is not enough to feed an aggressive SMT core, so SMT fetch is a bottleneck [Tullsen96] [Burns99]
- Straightforward solution: fetch from several threads each cycle
  a) Multiple fetch units (one per thread): expensive!
  b) Shared fetch unit + fetch policy [Tullsen96]: multiple PCs, multiple branch predictions per cycle, multiple I-cache accesses per cycle
- Does the performance of this fetch organization compensate for its complexity?

Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

Fetching from a Single Thread (1.X)
- Fine-grained, non-simultaneous sharing of the fetch unit
- Simple: similar to a superscalar fetch unit (branch predictor, instruction cache, SHIFT&MASK logic); no additional hardware needed
- A fetch policy is needed to decide fetch priority among threads; there are several proposals in the literature

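The SHIFT&MASK step above can be sketched functionally. This is a minimal Python model, not the actual hardware: the function name and arguments are illustrative, while the widths (64B lines, 16 instructions, 8-wide fetch) follow the simulation setup later in the talk.

```python
# Minimal sketch of the SHIFT&MASK fetch-extraction step: shift the
# fetched cache line to the current PC's offset, then mask off anything
# past the first predicted-taken branch or the fetch-width limit.

LINE_WORDS = 16   # 64B line / 4B per instruction
FETCH_WIDTH = 8

def shift_and_mask(line, pc_offset, taken_branch_offset=None):
    """Select up to FETCH_WIDTH instructions starting at pc_offset.

    line: the LINE_WORDS instruction words of the fetched cache line.
    taken_branch_offset: line offset of a predicted-taken branch
                         (inclusive), or None if no branch is taken.
    """
    end = pc_offset + FETCH_WIDTH
    if taken_branch_offset is not None:
        # Stop after the taken branch: the rest of the line is not on
        # the predicted path.
        end = min(end, taken_branch_offset + 1)
    return line[pc_offset:min(end, LINE_WORDS)]

line = list(range(LINE_WORDS))      # stand-in instruction words
# Entering mid-line wastes fetch slots: only 4 of 8 slots are filled.
assert len(shift_and_mask(line, pc_offset=12)) == 4
# A predicted-taken branch at offset 5 cuts the block to 6 instructions.
assert len(shift_and_mask(line, pc_offset=0, taken_branch_offset=5)) == 6
```

The two assertions illustrate exactly why a single thread underuses the fetch bandwidth: line boundaries and taken branches both truncate the fetch block below the full width.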
Fetching from a Single Thread (1.X), continued
- But a single thread is not enough to fill the fetch bandwidth: a gshare/hybrid branch predictor + BTB limits fetch width to one basic block per cycle (6-8 instructions)
- Fetch bandwidth is heavily underused: on average 40% is wasted with 1.8 and 60% with 1.16
- Cycles that fully use the fetch bandwidth: 31% with 1.8, only 6% with 1.16

Fetching from Multiple Threads (2.X)
- Increases fetch throughput: more threads give more possibilities to fill the fetch bandwidth
- More fetch bandwidth is used than with 1.X
- Cycles that fully use the fetch bandwidth: 54% with 2.8, 16% with 2.16

Fetching from Multiple Threads (2.X), continued
(figure: branch predictor and two I-cache banks feeding SHIFT&MASK and MERGE logic)
- 2 predictions per cycle + 2 predictor ports
- Multibanked, multiported instruction cache
- Replication of the SHIFT&MASK logic
- New hardware to realign and merge cache lines
- But what is the additional hardware cost of a 2.X fetch?

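The realign-and-merge step listed above can be sketched functionally. This is a hedged software model with illustrative names; the real logic is combinational hardware, and the policy of packing thread A first is an assumption for the example.

```python
# Functional sketch of the MERGE step in a 2.X fetch: two cache banks
# each deliver one per-thread fetch block, and the merge logic packs
# them into a single fetch-width bundle, tagging each slot with its
# thread id so later stages can tell the threads apart.

FETCH_WIDTH = 8

def merge_blocks(block_a, block_b, tid_a=0, tid_b=1):
    """Concatenate two per-thread fetch blocks into one bundle of at
    most FETCH_WIDTH (thread_id, instruction) pairs, thread A first."""
    bundle = [(tid_a, insn) for insn in block_a]
    bundle += [(tid_b, insn) for insn in block_b]
    return bundle[:FETCH_WIDTH]

# Thread 0 supplies 5 instructions; thread 1 fills the remaining slots.
bundle = merge_blocks(["a0", "a1", "a2", "a3", "a4"],
                      ["b0", "b1", "b2", "b3"])
assert len(bundle) == FETCH_WIDTH
assert bundle[5] == (1, "b0")
```

Even in this toy form, the cost the slide asks about is visible: two blocks must arrive in the same cycle (two predictions, two cache ports) before the merge can even run.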
Our Goal
- Can we take the best of both worlds? The low complexity of a 1.X fetch architecture plus the high performance of a 2.X fetch architecture
- That is: can a single thread provide enough instructions to fill the available fetch bandwidth?

High-Performance Fetch Engines (I)
- We look for high performance, but a gshare/hybrid branch predictor + BTB gives low performance: it limits the fetch bandwidth to one basic block per cycle (6-8 instructions)
- We look for low complexity, but a trace cache, branch target address cache, collapsing buffer, etc. fetch multiple basic blocks per cycle (12-16 instructions) at high complexity

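As a concrete reference point for the baseline, the gshare lookup is just an XOR of the branch PC with the global history, yielding one taken/not-taken prediction per cycle, which is what limits fetch to a single basic block. A sketch, with sizes taken from the backup slides (64K-entry table, 16 history bits) and illustrative names:

```python
# Sketch of the baseline gshare lookup: the pattern-history table is
# indexed by PC XOR global-history. One lookup per cycle means at most
# one predicted branch, i.e. at most one basic block of fetch.

HIST_BITS = 16
PHT_ENTRIES = 1 << HIST_BITS    # 64K entries, as in the backup slides

def gshare_index(pc, ghr):
    # Drop the 2 byte-offset bits of a 4-byte-aligned PC, then fold in
    # the global history register and mask to the table size.
    return ((pc >> 2) ^ ghr) & (PHT_ENTRIES - 1)

# The index always falls inside the table.
assert 0 <= gshare_index(0x40001234, 0xBEEF) < PHT_ENTRIES
```

The fetch engines the talk advocates (FTB, stream predictor) avoid this one-block-per-cycle ceiling by predicting larger units than a basic block, not by predicting more branches per cycle.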
High-Performance Fetch Engines (II)
Our alternatives:
- Gskew [Michaud97] + FTB [Reinman99]: FTB fetch blocks are larger than basic blocks; 5% speedup over gshare+BTB in superscalars
- Stream predictor [Ramirez02]: streams are larger than FTB fetch blocks; 11% speedup over gskew+FTB in superscalars

Simulation Setup
- Modified version of SMTSIM [Tullsen96]: trace-driven, allowing wrong-path execution
- Decoupled fetch (1 additional pipeline stage)
- Branch predictor sizes of approx. 45KB
- Decode & rename width limited to 8 instructions

Configuration:
- Fetch width: 8/16 instructions
- Fetch buffer: 32 instructions
- Fetch policy: ICOUNT
- RAS per thread: 64-entry
- FTQ size per thread: 4-entry
- Functional units: 6 int, 4 ld/st, 3 fp
- Instruction queues: 32 int, 32 ld/st, 32 fp
- ROB per thread: 256-entry
- Physical registers: 384 int, 384 fp
- L1 I-cache & D-cache: 32KB, 2-way, 8 banks
- L2 cache: 1MB, 2-way, 8 banks, 10-cycle latency
- Line size: 64B (16 instructions)
- TLB: 48 I + 48 D entries
- Memory latency: 100 cycles

Workloads
- SPECint2000, with code layout optimized using Spike [Cohn97] and profile data from the train input
- Most representative 300M-instruction trace of each benchmark, using the ref input
- Workloads include 2, 4, 6, and 8 threads, classified according to thread characteristics:
  - ILP: only ILP benchmarks
  - MEM: memory-bounded benchmarks
  - MIX: a mix of ILP and MEM benchmarks

Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results: ILP workloads; MEM & MIX workloads (only for 2 & 4 threads; see the paper for the rest)
- Summary & Conclusions

ILP Workloads - Fetch Throughput
- For a given fetch bandwidth, fetching from two threads always benefits fetch performance
- The critical point is 1.16: the stream predictor achieves better fetch performance than 2.8, while gshare+BTB and gskew+FTB perform worse than 2.8
(figure: fetch throughput)

ILP Workloads - 1.X (1.8) vs 2.X (2.8)
- ILP benchmarks have few memory problems and high parallelism, so the fetch unit is the real limiting factor
- The higher the fetch throughput, the higher the IPC
(figure: commit throughput)

ILP Workloads, continued
- So 2.X is better than 1.X in ILP workloads
- But what about 1.2X instead of 2.X, that is, 1.16 instead of 2.8?
  - It maintains single-thread fetch
  - Cache lines and buses are already 16 instructions wide
  - Only the selection hardware has to be modified, to select 16 instructions instead of 8

ILP Workloads - 2.X (2.8) vs 1.2X (1.16)
- With 1.16, the stream predictor increases throughput (9% on average): streams are long enough for a 16-wide fetch
- Fetching a single basic block per cycle is not enough: gshare+BTB suffers a 10% slowdown and gskew+FTB a 4% slowdown
- Similar or better performance than 2.16!
(figure: commit throughput)

MEM & MIX Workloads - Fetch Throughput
- Same trend as the ILP fetch throughput: for a given fetch bandwidth, fetching from two threads is better
- Stream > gskew+FTB > gshare+BTB
(figure: fetch throughput)

MEM & MIX Workloads - 1.X (1.8) vs 2.X (2.8)
- With memory-bounded benchmarks, overall performance actually decreases!
- Memory-bounded threads monopolize resources for many cycles
- This problem has been identified before: new fetch policies flush [Tullsen01] or stall [Luo01, El-Moursy03] problematic threads
(figure: commit throughput)

MEM & MIX Workloads, continued
- Fetching from only one thread means fetching only from the first, highest-priority thread
  - This lets the highest-priority thread proceed with more resources
  - It prevents low-quality (lower-priority) threads from monopolizing more and more resources (registers, IQ slots, etc.) while they wait on cache misses
- When a cache miss is resolved, instructions from the second thread are consumed, and ICOUNT gives that thread more priority again
- A powerful fetch unit can be harmful if not well used

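The interplay with ICOUNT described above can be sketched as follows. This is a hedged model of the priority choice only, with illustrative names; the simulator's actual implementation is not shown in the talk.

```python
# Sketch of the ICOUNT fetch policy [Tullsen96]: each cycle, fetch from
# the runnable thread with the fewest instructions in the pre-issue
# stages. A thread stalled behind a long cache miss piles up in-flight
# instructions, so ICOUNT naturally deprioritizes it.

def icount_pick(inflight, blocked=()):
    """Return the id of the highest-priority runnable thread, or None.

    inflight: per-thread counts of instructions in decode/rename/queues.
    blocked:  ids of threads that cannot fetch this cycle.
    """
    runnable = [t for t in range(len(inflight)) if t not in blocked]
    if not runnable:
        return None
    return min(runnable, key=lambda t: inflight[t])

# Thread 1 has piled up 30 in-flight instructions behind a miss, so a
# 1.X fetch unit gives the whole fetch bandwidth to thread 0.
assert icount_pick([8, 30]) == 0
# Once the miss resolves and thread 1 drains, it regains priority.
assert icount_pick([8, 3]) == 1
```

This is the mechanism behind the slide's argument: a 1.X fetch gives all the bandwidth to the single thread ICOUNT ranks best, whereas a 2.X fetch also feeds the second-ranked, possibly memory-bounded thread.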
MEM & MIX Workloads - 1.X (1.8) vs 1.2X (1.16)
- Even 2.16 has worse commit performance than 1.8: more interference is introduced by low-quality threads
- Overall, 1.16 is the best combination: the low complexity of fetching from one thread plus the high performance of a wide fetch
(figure: commit throughput)

Summary
- The fetch unit is the most significant obstacle to high SMT performance
- However, researchers usually pay little attention to SMT fetch performance: they focus on how to combine threads to fill the available fetch bandwidth, a simple gshare/hybrid + BTB is commonly used, and everybody assumes that 2.8 (2.X) is the correct answer
- Fetching from many threads can be counterproductive: sharing implies competing, and low-quality threads monopolize more and more resources

Conclusions
- 1.16 (1.2X) is the best fetch option when using a wide fetch architecture: it is not the prediction accuracy, it is the fetch width
- Beneficial for both ILP and MEM workloads: 1.X is bad for ILP, 2.X is bad for MEM
- It fetches only from the most promising thread (according to the fetch policy), and as much as possible from it
- It offers the best performance/complexity tradeoff
- Fetching from a single thread may require revisiting current SMT fetch policies

Thanks. Questions & Answers

Backup Slides

SMT Workloads
- 2_ILP: eon, gcc
- 2_MEM: mcf, twolf
- 2_MIX: gzip, twolf
- 4_ILP: eon, gcc, gzip, bzip2
- 4_MEM: mcf, twolf, vpr, perlbmk
- 4_MIX: gzip, twolf, bzip2, mcf
- 6_ILP: eon, gcc, gzip, bzip2, crafty, vortex
- 6_MIX: gzip, twolf, bzip2, mcf, vpr, eon
- 8_ILP: eon, gcc, gzip, bzip2, crafty, vortex, gap, parser
- 8_MIX: gzip, twolf, bzip2, mcf, vpr, eon, gap, parser

Simulation Setup (backup)
- Fetch policy: ICOUNT
- Gshare predictor: 64K-entry, 16-bit history
- Gskew predictor: 3 x 32K-entry, 15-bit history
- BTB/FTB: 2K-entry, 4-way associative
- Stream predictor: 1K-entry, 4-way + 4K-entry, 4-way
- RAS per thread: 64-entry
- FTQ size per thread: 4-entry
- Functional units: 6 int, 4 ld/st, 3 fp
- Instruction queues: 32 int, 32 ld/st, 32 fp
- ROB per thread: 256-entry
- Physical registers: 384 int, 384 fp
- L1 I-cache & D-cache: 32KB, 2-way, 8 banks
- L2 cache: 1MB, 2-way, 8 banks, 10-cycle latency
- Line size: 64B (16 instructions)
- TLB: 48 I + 48 D entries
- Memory latency: 100 cycles