CS 7810 Lecture 20
Initial Observations of the Simultaneous Multithreading Pentium 4 Processor
N. Tuck and D.M. Tullsen
Proceedings of PACT-12, September 2003

Pentium 4 Architecture
- Fetch/commit width = 3 ops, execution width = 6
- 128 registers; 126 in-flight instructions (48 loads, 24 stores)
- Trace cache has 12K entries, each line holding 6 ops
- Latencies: L1 – 2 cycles, L2 – 18 cycles, memory – 361 cycles

Hyper-Threading
- Two threads – the Linux operating system operates as if it were executing on a two-processor system
- When only one thread is available, the processor behaves like a regular single-threaded superscalar
- Statically divided resources: ROB, LSQ, issue queues – a slow thread cannot cripple throughput (though this might not scale)
- Dynamically shared: trace cache and decode (fine-grained multithreaded, round-robin – sketched below), functional units, data cache, branch predictor
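
The round-robin sharing of the front end can be pictured with a small simulation. A minimal sketch, assuming a two-thread model in which a stalled thread forfeits its fetch slot to the other; the names and the stall pattern are hypothetical, for illustration only:

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 2

/* Hypothetical per-thread front-end state. */
typedef struct {
    bool stalled;      /* e.g., waiting on a trace-cache miss */
    unsigned fetched;  /* fetch slots granted so far */
} ThreadCtx;

/* Fine-grained round-robin arbitration: alternate threads each
 * cycle, skipping a stalled thread so the other keeps fetching. */
static int pick_fetch_thread(const ThreadCtx t[], int last) {
    int c = (last + 1) % NUM_THREADS;   /* next thread in rotation */
    if (t[c].stalled)
        c = (c + 1) % NUM_THREADS;      /* skip a stalled thread   */
    return t[c].stalled ? -1 : c;       /* -1: nobody can fetch    */
}

int main(void) {
    ThreadCtx t[NUM_THREADS] = { {false, 0}, {false, 0} };
    int last = NUM_THREADS - 1;
    for (int cycle = 0; cycle < 8; cycle++) {
        t[1].stalled = (cycle >= 4);    /* thread 1 stalls late in the run */
        int who = pick_fetch_thread(t, last);
        if (who >= 0) { t[who].fetched++; last = who; }
        printf("cycle %d: fetch thread %d\n", cycle, who);
    }
    printf("slots: T0=%u T1=%u\n", t[0].fetched, t[1].fetched);
    return 0;
}
```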

Results
- Throughput goes from 2.2 (single thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

Methodology
- Three workloads: a single-threaded baseline, a parallel workload (two parallel threads of the same SPLASH application), and a heterogeneous workload (each single-threaded app running alongside each of the other apps)
- For heterogeneous workloads: run the two threads together, restarting a program whenever it finishes; do this 12 times, discard the last execution, and compute the average IPC of each thread
- If thread A executes at 85% efficiency and thread B at 75%, the speedup equals 1.6
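
Here each efficiency is a thread's IPC when sharing the core divided by its IPC running alone, and the multiprogrammed speedup is simply their sum:

\[
\text{speedup} = \frac{IPC_A^{\,shared}}{IPC_A^{\,alone}} + \frac{IPC_B^{\,shared}}{IPC_B^{\,alone}} = 0.85 + 0.75 = 1.60
\]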

Static Partitioning
- A single thread is statically assigned half the queues – this by itself reduces IPC
- A dummy thread ensures there is no contention for dynamically assigned resources (caches, branch predictor) – this isolates the effect of static partitioning
- SPEC-int achieves 83% efficiency and SPEC-fp 85%; the range is 71–98%
Multi-Programmed Speedup
- sixtrack and eon do not degrade their partners (small working sets?)
- swim and art degrade their partners (cache contention?)
- Best combination: swim & sixtrack; worst combination: swim & art
- Static partitioning ensures low interference – in the worst case a thread still runs at 0.9 of its single-threaded performance

Static vs. Dynamic
- Statically partitioned resources (queues, ROB): threads run at 83–85% efficiency
- Dynamically partitioned resources (fetch bandwidth, caches, branch predictor): threads run at ~60% efficiency
- Both contribute roughly equally – but without static partitioning, the effects of dynamic sharing could grow unchecked

Parallel Thread Results
- Parallel threads have similar characteristics and put more pressure on shared resources

Communication Speed
- Locking and reading a value takes 68 cycles
- Locking and updating a value takes 171 cycles (lower than the memory access time)
- To parallelize efficiently, each loop must contain enough parallel work to offset synchronization costs – roughly 20,000 computations for SMT versus 200,000 for an SMP – so the synchronization mechanism assumed in past research was more optimistic than the real design (a timing sketch follows)
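
Latencies like these are obtained with cycle-counter microbenchmarks. A minimal sketch, assuming GCC/Clang on x86 (for __rdtsc) and a pthread spinlock; note it times only the uncontended lock-and-update path on one thread, whereas the paper measures communication between two hardware threads, so the structure and constants here are illustrative only:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

static pthread_spinlock_t lock;
static volatile long shared_value;   /* the value being locked and updated */

int main(void) {
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

    enum { ITERS = 100000 };
    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);
        shared_value++;              /* lock-and-update */
        pthread_spin_unlock(&lock);
    }
    uint64_t cycles = __rdtsc() - start;

    printf("avg cycles per lock-and-update: %llu\n",
           (unsigned long long)(cycles / ITERS));
    pthread_spin_destroy(&lock);
    return 0;
}
```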

Microbenchmark
- Parallel region
- Loop-carried dependence
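
A plausible shape for such a microbenchmark, as a minimal pthreads sketch: each iteration does a tunable amount of independent work (the parallel region) and then updates a shared value under a lock (the loop-carried dependence). The structure, primitive, and constants are assumptions, not the paper's exact code:

```c
#include <pthread.h>
#include <stdio.h>

#define WORK_PER_ITER 20000   /* independent work per iteration (tunable) */
#define ITERS 1000

static long carried = 0;      /* the loop-carried value */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < ITERS; i++) {
        volatile long sum = 0;
        for (int k = 0; k < WORK_PER_ITER; k++)   /* parallel region */
            sum += k;
        pthread_mutex_lock(&m);    /* serialized: read and update the  */
        carried += sum + id;       /* value carried across iterations  */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("carried = %ld\n", carried);
    return 0;
}
```

Sweeping WORK_PER_ITER varies the computation-to-communication ratio, which is the axis explored on the next slide.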
Computation vs. Communication

Thread Co-Scheduling
- Diverse programs interfere less with each other
- Average speedup is 1.20; running two copies of the same thread yields only 1.11, int-int pairs yield 1.17, fp-fp 1.20, and int-fp 1.21
- Symbiotic job scheduling: each thread has two favorable partners – construct a schedule in which every thread is co-scheduled only with its partners – average speedup of 1.27 (see the pairing sketch below)
- Linux cannot exploit this – it has two independent schedulers
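
One way to realize such a schedule is to pair threads greedily by measured pairwise speedup. A minimal sketch, with a made-up symbiosis matrix (the thread count and all values are hypothetical):

```c
#include <stdbool.h>
#include <stdio.h>

#define N 4   /* number of threads; even, so all can be paired */

/* Hypothetical pairwise speedups: speedup[i][j] is the combined
 * multiprogrammed speedup when threads i and j share the core. */
static const double speedup[N][N] = {
    {0.00, 1.11, 1.21, 1.17},
    {1.11, 0.00, 1.20, 1.21},
    {1.21, 1.20, 0.00, 1.19},
    {1.17, 1.21, 1.19, 0.00},
};

int main(void) {
    bool used[N] = {false};
    double total = 0.0;
    int pairs = 0;
    /* Greedy pairing: repeatedly take the best remaining pair. */
    while (pairs < N / 2) {
        int bi = -1, bj = -1;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (!used[i] && !used[j] &&
                    (bi < 0 || speedup[i][j] > speedup[bi][bj])) {
                    bi = i; bj = j;
                }
        used[bi] = used[bj] = true;
        total += speedup[bi][bj];
        pairs++;
        printf("co-schedule threads %d and %d (speedup %.2f)\n",
               bi, bj, speedup[bi][bj]);
    }
    printf("average speedup: %.2f\n", total / pairs);
    return 0;
}
```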

Compiler Optimizations
- Multithreading is tolerant of low-ILP code
- Higher optimization levels improve overall performance but reduce the speedup from SMT

Unanswered Questions
- Area overhead of SMT? (multiple rename tables, return address stacks, PC registers)
- Register utilization?
- Effect of fetch policies – is fetch a bottleneck?
- Influence on power, energy, and temperature?

Conclusions
- The real design matches simulation-based expectations
- Static partitioning is important for minimizing conflicts and controlling throughput losses
- Dynamic partitioning might be required for 8 threads
- Synchronization is an order of magnitude faster than on an SMP, but there is still room for improvement