CS 7810 Lecture 21: Threaded Multiple Path Execution. S. Wallace, B. Calder, D. Tullsen, Proceedings of ISCA-25, June 1998
Leveraging SMT
Recall branch fan-out from “Limits of ILP”
Future processors will likely have no shortage of idle thread contexts
Spawned threads are parallel, but have dependences on earlier instructions: registers, uncommitted stores, data cache values
SMT may be an ideal candidate because its threads share the same set of resources
SMT vs. CMP
A multi-threaded workload (on an SMT) is more tolerant of branch mispredicts – TME makes the most sense when there is a shortage of threads
Power overheads are enormous – on an SMT, we may not have the option of executing speculative threads on low-power pipelines
What about energy? Is a CMP a better candidate?
Renaming Overview
[Figure: example rename stream – r1 initially maps to p1; successive writes to r1 across branches remap it, e.g. to p3 and p5]
Every branch causes a checkpoint of the mappings, so we can recover quickly on a mispredict
Each thread in the SMT can have 8 checkpoints
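A minimal sketch of this checkpoint-and-recover scheme. The `RenameMap` class, the identity initial mapping, and the free-register allocation policy are illustrative assumptions, not the paper's implementation; only the 32 architectural registers and the 8 checkpoints per context come from the slides.

```python
class RenameMap:
    """Per-thread rename map with branch checkpoints (illustrative sketch)."""
    MAX_CHECKPOINTS = 8  # each SMT thread context holds up to 8 checkpoints

    def __init__(self, num_arch_regs=32):
        # Assumed identity mapping to start: r_i -> p_i
        self.map = list(range(num_arch_regs))
        self.next_free = num_arch_regs
        self.checkpoints = []

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a destination write."""
        self.map[arch_reg] = self.next_free
        self.next_free += 1
        return self.map[arch_reg]

    def checkpoint(self):
        """Snapshot the full map at a branch, so recovery is one copy-back."""
        assert len(self.checkpoints) < self.MAX_CHECKPOINTS
        self.checkpoints.append(list(self.map))

    def recover(self):
        """On a mispredict, discard the youngest checkpoint and reinstate it."""
        self.map = self.checkpoints.pop()
```

Because the whole map is snapshotted at the branch, recovery is a single table copy rather than a walk of younger instructions.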
Threaded Multi-Path Execution
Key elements in TME:
Identifying low-confidence branches
Efficient thread spawning
Efficient recovery on branch resolution
Fetch priorities for each thread on the SMT
Path Selection
Only the primary path can spawn threads (this prevents an exponential increase in the number of threads)
For each branch predictor entry, keep track of successive correct predictions (reset on a mispredict) – if the counter is below a threshold, the branch is low-confidence
Note that a small counter size is more selective in picking low-confidence branches
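The confidence counter can be sketched as below. The table size, counter width, and threshold are illustrative assumptions; the slide specifies only the mechanism (count successive correct predictions, reset on a mispredict, compare against a threshold).

```python
COUNTER_BITS = 4  # assumed saturating-counter width
THRESHOLD = 8     # assumed cutoff: fewer consecutive correct => low confidence

class ConfidenceEstimator:
    def __init__(self, entries=4096):
        self.counters = [0] * entries  # one counter per predictor entry

    def is_low_confidence(self, pc):
        return self.counters[pc % len(self.counters)] < THRESHOLD

    def update(self, pc, prediction_correct):
        i = pc % len(self.counters)
        if prediction_correct:
            # saturating increment on a correct prediction
            self.counters[i] = min(self.counters[i] + 1, (1 << COUNTER_BITS) - 1)
        else:
            self.counters[i] = 0  # reset on a mispredict
```

A branch is a spawn candidate only while its counter sits below the threshold, so recently-mispredicted branches trigger alternate-path threads and well-behaved branches do not.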
Register Mappings
In SMT, each thread can read any physical register
Thread spawning requires a copy of the register mappings at that branch
A copy involves a transfer of 32 x 9 bits – the new thread cannot begin renaming until this copy is complete – the copy may also hold up the primary thread if map-table read ports are scarce
Alternatively, every new mapping can be placed on a bus so that idle threads can snoop and keep pace
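A back-of-the-envelope sketch of the copy cost and the snooping alternative. The 32 registers and 9-bit physical names are from the slide; the 64-bit bus width is an assumption for illustration.

```python
import math

ARCH_REGS = 32      # architectural registers (from the slide)
PHYS_REG_BITS = 9   # bits per physical-register name (from the slide)
MAP_BUS_WIDTH = 64  # bits transferred per cycle -- an assumed bus width

# Bulk copy at spawn time: the new thread stalls until all bits arrive.
bits_to_copy = ARCH_REGS * PHYS_REG_BITS                # 288 bits
copy_cycles = math.ceil(bits_to_copy / MAP_BUS_WIDTH)   # 5 transfers at 64b

# Snooping alternative: each new (arch_reg -> phys_reg) mapping is broadcast,
# and idle contexts apply it to a shadow map that is current at spawn time.
def snoop(shadow_map, arch_reg, phys_reg):
    shadow_map[arch_reg] = phys_reg
```

With snooping, the spawn-time copy disappears entirely: the idle context's shadow map already matches the primary's when the low-confidence branch is reached.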
Spawning Algorithm
When thread contexts are idle, they keep pace with the primary and spawn a thread as soon as a low-confidence branch is encountered
When a thread context becomes free and a low-confidence checkpoint already exists, the new context synchronizes its mappings with the primary context and executes the primary path, while the old primary context executes the alternate path after reinstating the checkpoint
If a newly idle thread has a low-confidence checkpoint, it starts executing the alternate path
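The second case above (a context frees up after a low-confidence checkpoint already exists) is the least obvious, so here is a sketch of it. `Context` and `Checkpoint` are hypothetical helper types invented for illustration; only the role swap between the contexts comes from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    mappings: dict = field(default_factory=dict)
    path: str = ""

@dataclass
class Checkpoint:
    mappings: dict
    alternate_path: str

def spawn_on_free_context(primary, free_ctx, ckpt):
    """The freed context copies the primary's current mappings and continues
    the primary (predicted) path; the old primary reinstates the checkpoint
    and follows the alternate path instead."""
    free_ctx.mappings = dict(primary.mappings)
    free_ctx.path = primary.path
    primary.mappings = dict(ckpt.mappings)
    primary.path = ckpt.alternate_path
    return free_ctx  # free_ctx is now the primary context
```

Swapping roles this way means the far-along primary path never has to be re-fetched; only the short alternate path starts from the checkpoint.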
Introduced Complexity
Book-keeping to manage checkpoint locations – every branch has to track the location of its checkpoint
Who frees a physical register value?
What about memory dependences? Loads can ignore stores that are not their predecessors
Maintain an array of bits to represent the path taken (each basic block corresponds to a bit in the array)
Check for memory dependences only if the store’s path is a subset of the load’s path
[Figure: example of r1 mapped to p5, p7, and p8 along different paths]
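The path-subset test reduces to one bitwise operation if each path bit-vector is packed into an integer. The encoding (one bit per basic block, packed into an int) is an illustrative assumption consistent with the bit-array described above.

```python
def store_precedes_load(store_path: int, load_path: int) -> bool:
    """True iff the store's path is a subset of the load's path, i.e. every
    basic-block bit set in store_path is also set in load_path. Only then
    must the load honor the store's value."""
    return (store_path & load_path) == store_path
```

A store on a sibling speculative path fails the test, so the load correctly ignores it.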
Processor Parameters
Eight-wide processor with up to eight contexts; each context has eight checkpoints
32-entry issue queues, 4Kb gshare branch predictor, 7-cycle mispredict penalty, memory latency of 62 cycles
ICOUNT 2.8 fetch policy: the first thread can bring in up to 8 instructions and the second thread fills unused slots; occupancy in the front-end determines priority
Focus on branch-limited programs: compress (20%), gcc (18%), go (30%), li (6%)
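A sketch of how the ICOUNT 2.8 priority selection described above might work: pick the two threads with the fewest instructions in the front-end, let the first supply up to 8 instructions, and let the second fill the leftover slots. The data-structure shapes here are assumptions for illustration.

```python
def icount_2_8(front_end_counts, available):
    """front_end_counts: {thread_id: instructions in the front-end}
    available: {thread_id: instructions the thread can supply this cycle}
    Returns {thread_id: instructions fetched}, totaling at most 8."""
    # Fewest front-end instructions => highest fetch priority; take two threads.
    order = sorted(front_end_counts, key=front_end_counts.get)[:2]
    fetched, slots = {}, 8
    for t in order:
        n = min(available.get(t, 0), slots)
        if n:
            fetched[t] = n
            slots -= n
    return fetched
```

For example, with front-end counts {0: 5, 1: 2, 2: 9}, thread 1 fetches first and thread 0 fills the remaining slots; thread 2 is shut out for the cycle.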
Results: Spare Contexts
Results: Bus Latency
Results: Branch Confidence
Results: Path Selection
Results: Fetch Policy
Results: Mpred Penalty
Conclusions
Too much complexity/power overhead for too little benefit?
Benefits may be higher for deeper pipelines, larger windows (this paper evaluates 8 windows of 48 instructions; does 2 x 192 yield better results?), and longer memory latencies
There is room for improvement with better branch confidence metrics
CMPs will incur a greater cost during thread spawning, but may be more power-efficient