Pentium 4
- Deeply pipelined processor that supports multiple issue with speculation and multithreading
- 2004 version: 31 clock cycles from fetch to retire and a 3.2 GHz clock rate (the deep pipeline allows the higher clock rate)
- The front-end decoder translates each IA-32 instruction into a series of RISC-like micro-operations called uops
- Uops are executed by a dynamically scheduled speculative pipeline
Section 2.10
Pentium 4, continued
- Uops are stored in an execution trace cache
- It stores sequences of instructions to be executed, including nonadjacent instructions
- It is accessed using the branch prediction bits and the address of the first instruction in the trace
- It has its own branch target buffer for predicting the outcome of uop branches
- Its hit rate is very high, so IA-32 instruction fetch is rarely needed
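The trace cache lookup described above can be pictured as a table keyed by the start address together with the predicted branch outcomes. The sketch below is only a toy model under that assumption; the class name `TraceCache`, its methods, and the uop strings are hypothetical and say nothing about how the real hardware is organized.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key: (address of the first instruction, predicted branch outcomes).
TraceKey = Tuple[int, Tuple[bool, ...]]


class TraceCache:
    """Toy trace cache: maps a start address plus branch predictions to uops."""

    def __init__(self) -> None:
        self._traces: Dict[TraceKey, List[str]] = {}

    def fill(self, start: int, predictions: Tuple[bool, ...], uops: List[str]) -> None:
        # Store a copy of the decoded uop sequence for this trace.
        self._traces[(start, predictions)] = list(uops)

    def lookup(self, start: int, predictions: Tuple[bool, ...]) -> Optional[List[str]]:
        # A miss (None) means the front end must fetch and decode IA-32 instructions.
        return self._traces.get((start, predictions))


cache = TraceCache()
# This trace follows a taken branch, so its uops come from nonadjacent code.
cache.fill(0x1000, (True,), ["load r1", "cmp r1, 0", "br taken", "add r2, r1"])

hit = cache.lookup(0x1000, (True,))    # same address, same prediction: hit
miss = cache.lookup(0x1000, (False,))  # same address, different prediction: miss
```

The same start address with a different prediction is a different trace, which is why the prediction bits are part of the key in this model.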
Pentium 4, continued
- Uops are executed by an out-of-order speculative pipeline that uses register renaming rather than a reorder buffer
- Up to three uops per clock cycle can be renamed and dispatched to the functional unit queues
- Up to three uops can commit each clock cycle
- Up to six uops can be dispatched to the functional units each clock cycle
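Register renaming in general can be sketched as a map from architectural to physical registers plus a free list. The code below is a minimal illustration of that idea, not the Pentium 4's actual mechanism: the `Renamer` class is hypothetical, the pool size of 128 is the figure quoted elsewhere in these slides, and returning registers to the free list at commit is not modeled.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

NUM_PHYSICAL = 128  # pool size quoted in these slides; everything else is assumed


class Renamer:
    """Toy register renamer: architectural names map to physical registers."""

    def __init__(self, arch_regs: List[str]) -> None:
        # Each architectural register starts mapped to its own physical register.
        self.map: Dict[str, int] = {r: i for i, r in enumerate(arch_regs)}
        self.free: Deque[int] = deque(range(len(arch_regs), NUM_PHYSICAL))

    def rename(self, dest: str, srcs: List[str]) -> Tuple[int, List[int]]:
        # Sources read the current mapping before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        phys_dest = self.free.popleft()  # raises IndexError if none are free
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


renamer = Renamer(["eax", "ebx", "ecx"])
first = renamer.rename("eax", ["ebx", "ecx"])  # eax = ebx op ecx
second = renamer.rename("eax", ["eax"])        # eax = op eax (reads the new eax)
```

Because each write to `eax` gets a fresh physical register, the second uop depends on the first only through a true data dependence; the name reuse itself causes no conflict.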
Figure 2.26
About Figure 2.26
- Front-end BTB: predicts the next IA-32 instruction to fetch; accessed only on a miss in the execution trace cache
- Execution trace cache: holds uops
- Trace cache BTB: predicts the next uop
- Registers for renaming: 128, which supports 128 uops executing simultaneously
- Functional units: 7 (the simple ones run at twice the clock rate and accept up to two uops every clock cycle)
About Figure 2.26
- L1 data cache: supports up to 8 outstanding misses; integer load latency is 4 cycles; FP load latency is 12 cycles
- L2 cache: 18-cycle access time
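To get a feel for these cycle counts, they can be converted to nanoseconds. The sketch below assumes the 3.2 GHz clock rate quoted earlier for the 2004 version and assumes every latency is counted in cycles of that clock; the helper `cycles_to_ns` is a hypothetical name.

```python
CLOCK_GHZ = 3.2  # assumed: the 2004 clock rate quoted earlier in these slides


def cycles_to_ns(cycles: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Convert a cycle count to nanoseconds at the given clock rate in GHz."""
    return cycles / clock_ghz


# Cycle counts taken from the slides.
latencies_in_cycles = {
    "fetch to retire": 31,
    "integer load (L1 hit)": 4,
    "FP load (L1 hit)": 12,
    "L2 access": 18,
}

for name, cycles in latencies_in_cycles.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
```

At this clock rate one cycle is about 0.31 ns, so the 31-cycle pipeline takes a little under 10 ns from fetch to retire.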
Pentium 4
- The deep pipeline makes speculation and branch prediction very important for high performance
- The cost of a cache miss is also very high, because the queues fill while waiting for the miss to be handled
Pentium 4: Branch misprediction
- Figure 2.28 (next slide) shows the branch-misprediction rate per 1000 instructions
- The top five are integer benchmarks (average of 186 branches per 1000 instructions)
- The bottom five are FP benchmarks (48 branches per 1000 instructions)
- The misprediction rate for the integer benchmarks is 8 times higher than for the FP benchmarks
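Part of the gap comes simply from integer code having more branches. The arithmetic below separates the two effects, assuming the "8 times" figure refers to mispredictions per 1000 instructions (the metric Figure 2.28 plots); the variable names are illustrative only.

```python
# Figures from the slide.
INT_BRANCHES_PER_K = 186   # integer benchmarks, branches per 1000 instructions
FP_BRANCHES_PER_K = 48     # FP benchmarks, branches per 1000 instructions

# Assumption: the 8x ratio is measured in mispredictions per 1000 instructions.
MISPREDICT_RATIO_PER_K = 8.0

# How many times more branches the integer benchmarks execute.
branch_density_ratio = INT_BRANCHES_PER_K / FP_BRANCHES_PER_K

# What remains is how much more often each individual branch is mispredicted.
per_branch_ratio = MISPREDICT_RATIO_PER_K / branch_density_ratio

print(f"branch density ratio (int/FP): {branch_density_ratio:.2f}")
print(f"per-branch misprediction ratio (int/FP): {per_branch_ratio:.2f}")
```

Under that assumption, integer code has roughly 3.9 times as many branches, and each branch is mispredicted roughly twice as often as in FP code.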
Figure 2.28
Pentium 4: Misspeculation
- A misprediction causes wrong instructions to be executed (misspeculated instructions), which requires recovery time and wastes energy
- Figure 2.29 (next slide) shows the percentage of issued uops that are misspeculated
- Note that Figure 2.29 closely matches Figure 2.28
Figure 2.29
Pentium 4: Cache misses
- Trace cache miss rates are almost negligible for the SPEC benchmarks
- L1 and L2 miss rates are more significant
- Figure 2.30 (next slide) shows misses per 1000 instructions for the L1 and L2 caches
- L1 misses are more frequent, but the miss penalty for L2 is higher, so both affect performance
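The trade-off between miss frequency and miss penalty can be made concrete with a rough stall estimate: stall cycles per instruction = (misses per 1000 instructions / 1000) × penalty. In the sketch below the miss counts and the memory penalty are hypothetical values chosen only for illustration (they are not read from Figure 2.30); the 18-cycle L2 access time is the one quoted in these slides. The model also ignores the overlap that out-of-order execution provides.

```python
def stall_cpi(misses_per_1000: float, penalty_cycles: float) -> float:
    """Stall cycles added per instruction, assuming misses do not overlap."""
    return misses_per_1000 / 1000.0 * penalty_cycles


L2_ACCESS_CYCLES = 18     # from the slides: cost of an L1 miss that hits in L2
MEMORY_PENALTY = 200      # hypothetical cost of an L2 miss, in cycles

l1_misses_per_1000 = 20   # hypothetical
l2_misses_per_1000 = 5    # hypothetical

l1_stall = stall_cpi(l1_misses_per_1000, L2_ACCESS_CYCLES)
l2_stall = stall_cpi(l2_misses_per_1000, MEMORY_PENALTY)

print(f"L1 miss contribution to CPI: {l1_stall:.2f}")
print(f"L2 miss contribution to CPI: {l2_stall:.2f}")
```

With these made-up numbers the L2 misses are four times rarer yet cost more cycles in total, which is the sense in which both levels matter.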
Figure 2.30
Pentium 4: CPI
- Figure 2.31 (next slide) shows cycles per instruction for the same 10 SPEC benchmarks
- Note that mcf has the worst misspeculation rate and the worst L1 and L2 miss rates, and it also has the highest CPI
- Note that swim has high L1 and L2 miss rates and is the lowest-performing FP benchmark
Figure 2.31
Comparing Pentium 4 to AMD Opteron
- Both use a dynamically scheduled, speculative pipeline capable of issuing three IA-32 instructions per clock cycle
- Both have two levels of on-chip cache, but the Opteron L1 instruction cache is not a trace cache
- The biggest difference is that the Pentium 4 is more deeply pipelined
- The Pentium 4 has a higher CPI (Figure 2.32), which makes sense given its deeper pipeline
Figure 2.32
Comparing Pentium 4 to AMD Opteron
- Deeper pipelining allows an increase in clock rate. Will this increase make up for the increase in CPI?
- Figure 2.33 (next slide) compares a 2.8 GHz AMD Opteron with a 3.8 GHz Intel Pentium 4
- Note that the Opteron has higher performance, so the higher clock rate is insufficient to overcome the higher CPI
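The question can be framed with the usual relation: execution time = instruction count × CPI / clock rate. Since both processors run IA-32 code, the sketch below assumes equal instruction counts, so only CPI and clock rate matter. The clock rates are from the slide; the CPI values are hypothetical and are not taken from Figure 2.32.

```python
def relative_performance(clock_a_ghz: float, cpi_a: float,
                         clock_b_ghz: float, cpi_b: float) -> float:
    """Return how many times faster machine A is than machine B.

    Assumes both execute the same number of instructions, so
    execution time = instructions * CPI / clock rate.
    """
    return (clock_a_ghz / cpi_a) / (clock_b_ghz / cpi_b)


P4_CLOCK_GHZ = 3.8        # from the slide
OPTERON_CLOCK_GHZ = 2.8   # from the slide

# The Pentium 4 wins only if its CPI is less than this multiple of the Opteron's.
break_even_cpi_ratio = P4_CLOCK_GHZ / OPTERON_CLOCK_GHZ

# Hypothetical CPI values, chosen only for illustration.
opteron_cpi = 1.0
p4_cpi = 1.5

opteron_speedup = relative_performance(
    OPTERON_CLOCK_GHZ, opteron_cpi, P4_CLOCK_GHZ, p4_cpi
)

print(f"break-even CPI ratio: {break_even_cpi_ratio:.2f}")
print(f"Opteron speedup over Pentium 4: {opteron_speedup:.2f}")
```

The break-even ratio is about 1.36: the Pentium 4's clock advantage is cancelled as soon as its CPI is more than roughly 36% higher than the Opteron's.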
Figure 2.33
Comparing Pentium 4 to IBM Power5
- Sophisticated multiple-issue pipelines usually have slower clock rates than simple pipelines
- A faster clock rate will win when ILP is limited
- IBM Power5: designed for high-performance integer and FP; two processor cores, each capable of sustaining four instructions per clock cycle; 1.9 GHz clock rate
Comparing Pentium 4 to IBM Power5
- Pentium 4: single processor with multithreading; very deep pipeline; can sustain three instructions per clock cycle; higher clock rate (3.8 GHz)
- Figure 2.34 (next slide) compares the performance of these machines
- Note that the Power5 often does better on the FP benchmarks (fewer branches, more parallelism)
- The Pentium 4 does better on the integer benchmarks (higher clock rate)
Figure 2.34