1
EECS 470
Superscalar Architectures and the Pentium 4
Lecture 12
2
Optimizing CPU Performance
Golden Rule: t_CPU = N_inst * CPI * t_CLK
Given this, what are our options?
– Reduce the number of instructions executed
– Reduce the cycles to execute an instruction
– Reduce the clock period
Our next focus: further reducing CPI
– Approach: superscalar execution
– Capable of initiating multiple instructions per cycle
– Possible to implement for in-order or out-of-order pipelines
3
Why Superscalar?
[Figure panels: Pipelining | Superscalar + Pipelining]
Optimization results in more complexity
– Longer wires, more logic => higher t_CLK and t_CPU
– Architects must strike a balance with reductions in CPI
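The balance between a lower CPI and a longer clock period can be checked directly against the Golden Rule, t_CPU = N_inst * CPI * t_CLK. The sketch below is only an illustration: the instruction count, both CPI values, and the 20% clock-period penalty are hypothetical numbers, not measurements from any real design.

```python
def cpu_time(n_inst, cpi, t_clk):
    """Golden Rule: t_CPU = N_inst * CPI * t_CLK, in seconds."""
    return n_inst * cpi * t_clk

# Hypothetical numbers, chosen only for illustration.
N_INST = 1_000_000_000                      # instructions executed

# Scalar pipeline: CPI of 1.0 at a 1 ns clock period (1 GHz).
scalar = cpu_time(N_INST, cpi=1.0, t_clk=1.0e-9)

# Assume superscalar issue cuts CPI to 0.6, but the extra wires and
# logic stretch the clock period by 20%.
superscalar = cpu_time(N_INST, cpi=0.6, t_clk=1.2e-9)

speedup = scalar / superscalar
print(f"scalar: {scalar:.3f} s, superscalar: {superscalar:.3f} s, "
      f"speedup: {speedup:.2f}x")
```

Under these assumed numbers the CPI gain outweighs the slower clock, giving a speedup of about 1.39x; a larger clock penalty or a smaller CPI gain would erase it.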
4
Implications of Superscalar Execution
Instruction fetch?
– Taken branches, multiple branches, partial cache lines
Instruction decode?
– Simple for fixed length ISA, much harder for variable length
Renaming?
– Multi-port RT, inter-inst dependencies must be recognized
Dynamic Scheduling?
– Requires multiple result buses, smarter selection logic
Execution?
– Multiple functional units, multiple result buses
Commit?
– Multiple ROB/ARF ports, dependencies must be recognized
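The requirement that inter-instruction dependencies be recognized when several instructions are renamed in the same cycle can be sketched as a read-after-write check within one group. This is a software illustration only; the function name, the tuple encoding of instructions, and the register names are assumptions made for the example, and real hardware does the comparison with parallel comparators rather than a loop.

```python
def intra_group_raw(group):
    """Find read-after-write dependencies inside one rename group.

    group: list of (dest, srcs) tuples in program order.
    Returns a list of (consumer_index, producer_index, register).
    """
    deps = []
    last_writer = {}                     # register -> index of latest writer
    for i, (dest, srcs) in enumerate(group):
        for s in srcs:
            if s in last_writer:         # source produced earlier in the group
                deps.append((i, last_writer[s], s))
        last_writer[dest] = i            # later readers must see this writer
    return deps

# Hypothetical 4-wide group.
group = [
    ("r1", ["r2", "r3"]),   # 0: r1 = r2 op r3
    ("r4", ["r1", "r5"]),   # 1: reads r1 written by 0
    ("r1", ["r4"]),         # 2: reads r4 written by 1, rewrites r1
    ("r6", ["r1"]),         # 3: must read r1 from 2, not from 0
]
deps = intra_group_raw(group)
print(deps)
```

Instruction 3 shows why the check must track the most recent writer: it depends on instruction 2, even though instruction 0 also wrote r1.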
5
P4 Overview
Latest iA32 processor from Intel
– Equipped with the full set of iA32 SIMD operations
– First flagship architecture since the P6 microarchitecture
– Pentium 4 ISA = Pentium III ISA + SSE2
– SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating point operations + prefetch
6
Comparison Between Pentium III and Pentium 4
7
Execution Pipeline
8
Front End
Predicts branches
Fetches/decodes code into trace cache
Generates µops for complex instructions
Prefetches instructions that are likely to be executed
9
Branch Prediction
Dynamically predict the direction and target of branches based on PC using BTB
If no dynamic prediction available, statically predict
– Taken for backwards looping branches
– Not taken for forward branches
– Implemented at decode
Traces built across (predicted) taken branches to avoid taken branch penalties
Also includes a 16-entry return address stack predictor
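The static fallback rule and the return address stack described on this slide can be sketched as follows. The 16-entry size comes from the slide; everything else is an assumption for illustration, including the function and class names, the "target at or below the branch means backward" test, and the overflow policy of silently dropping the oldest entry.

```python
from collections import deque

def static_predict_taken(branch_pc, target_pc):
    """Static rule: backward branches predicted taken, forward not taken."""
    return target_pc <= branch_pc

class ReturnAddressStack:
    """Fixed-size stack of predicted return addresses."""

    def __init__(self, entries=16):
        # Assumed policy: when full, the oldest entry is overwritten.
        self.stack = deque(maxlen=entries)

    def on_call(self, return_pc):
        self.stack.append(return_pc)

    def on_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

# A loop-closing branch jumps backward; an if-skip jumps forward.
loop_branch = static_predict_taken(branch_pc=0x1000, target_pc=0x0F00)
skip_branch = static_predict_taken(branch_pc=0x1000, target_pc=0x1040)

ras = ReturnAddressStack()
ras.on_call(0x2004)          # outer call
ras.on_call(0x3008)          # nested call
inner_return = ras.on_return()
outer_return = ras.on_return()
```

Returns are predicted in last-in, first-out order, which matches nested call sequences as long as the call depth stays within the stack size.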
10
Decoder
Single decoder available
– Operates at a maximum of 1 instruction per cycle
Receives instructions from L2 cache 64 bits at a time
Some complex instructions must enlist the micro-ROM
– Used for very complex iA32 instructions (> 4 µops)
– After the microcode ROM finishes, the front end resumes fetching µops from the Trace Cache
11
Execution Pipeline
12
Trace Cache
Primary instruction cache in P4 architecture
– Stores 12K decoded µops
On a miss, instructions are fetched from L2
Trace predictor connects traces
Trace cache removes
– Decode latency after mispredictions
– Decode power for all pre-decoded instructions
13
Branch Hints
P4 software can provide hints to branch prediction and trace cache
– Specify the likely direction of a branch
– Implemented with conditional branch prefixes
– Used for decode-stage predictions and trace building
14
Execution Pipeline
16
Execution
126 µops can be in flight at once
– Up to 48 loads / 24 stores
Can dispatch up to 6 µops per cycle
2x trace cache and retirement µop bandwidth
– Provides additional B/W for scheduling misspeculation
17
Execution Units
18
Register Renaming
19
8-entry architectural register file
128-entry physical register file
2 RATs (Front-end RAT and Retirement RAT)
Retirement RAT eliminates register writes into ARF
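One way to picture the two-RAT scheme is as two tables of pointers into the physical register file: the front-end RAT tracks speculative mappings, and the retirement RAT tracks committed ones, so retiring a µop only moves a pointer instead of copying a value into a separate ARF. The sketch below is a simplified model under stated assumptions: the 8 and 128 sizes come from the slide, while the class, its method names, the free-list handling, and recovery by copying the retirement RAT are hypothetical and not a description of the actual P4 circuitry.

```python
class Renamer:
    """Toy model of renaming with a front-end RAT and a retirement RAT."""

    def __init__(self, num_arch=8, num_phys=128):
        # Initially architectural register i lives in physical register i.
        self.frontend_rat = list(range(num_arch))
        self.retire_rat = list(range(num_arch))
        self.num_phys = num_phys
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one µop; returns (phys_dest, phys_srcs)."""
        phys_srcs = [self.frontend_rat[s] for s in srcs]  # read before write
        phys_dest = self.free_list.pop(0)
        self.frontend_rat[dest] = phys_dest
        return phys_dest, phys_srcs

    def retire(self, dest, phys_dest):
        """Commit by updating a pointer; no data is copied."""
        freed = self.retire_rat[dest]
        self.retire_rat[dest] = phys_dest
        self.free_list.append(freed)

    def recover(self):
        """Squash all speculative state (assumes every in-flight µop is flushed)."""
        self.frontend_rat = list(self.retire_rat)
        live = set(self.retire_rat)
        self.free_list = [p for p in range(self.num_phys) if p not in live]

r = Renamer()
d1, s1 = r.rename(dest=0, srcs=[1, 2])   # r0 = r1 op r2
d2, s2 = r.rename(dest=3, srcs=[0])      # reads the renamed r0
r.retire(0, d1)                          # first µop commits
r.recover()                              # second µop squashed
```

After recovery, register 0 still points at the committed physical register, while register 3 falls back to its original mapping because the µop that renamed it never retired.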
20
Store and Load Scheduling
Out of order store and load operations
Stores are always in program order
48 loads and 24 stores could be in flight
Store/load buffers are allocated at the allocation stage
– Total 24 store buffers and 48 load buffers
21
Execution Pipeline
22
Retirement
Can retire 3 µops per cycle
Implements precise exceptions
Reorder buffer used to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
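In-order retirement from a reorder buffer, limited to three per cycle, might be modeled as below. The width of 3 is from the slide; the dictionary layout of a ROB entry and the function name are assumptions made for this sketch.

```python
from collections import deque

def retire_cycle(rob, width=3):
    """Retire up to `width` completed entries from the head of the ROB.

    Retirement stops at the first entry that has not finished, which is
    what keeps exceptions precise: nothing younger commits ahead of it.
    """
    retired = []
    while rob and rob[0]["done"] and len(retired) < width:
        retired.append(rob.popleft()["id"])
    return retired

# Hypothetical ROB contents: entry 2 is still executing.
rob = deque(
    {"id": i, "done": done}
    for i, done in enumerate([True, True, False, True, True])
)

first = retire_cycle(rob)     # blocked behind entry 2 after two retirements
rob[0]["done"] = True         # entry 2 completes
second = retire_cycle(rob)    # now limited only by the retirement width
```

Entries 3 and 4 finished early but cannot retire until entry 2 does, so completion is out of order while commitment stays in program order.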
23
Data Stream of Pentium 4 Processor
24
On-chip Caches
L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
– All caches use a pseudo-LRU replacement algorithm
Parameters:
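The slide does not say which pseudo-LRU variant is used. As one common example, a tree-based pseudo-LRU for a single 4-way set can be sketched with three bits; the 4-way associativity, the class name, and the bit encoding here are all assumptions chosen for illustration.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way cache set (3 bits)."""

    def __init__(self):
        # bits[0]: root, 0 -> next victim is in ways 0-1, 1 -> ways 2-3
        # bits[1]: within ways 0-1, value is the victim's offset
        # bits[2]: within ways 2-3, value is the victim's offset
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1             # protect the left half
            self.bits[1] = 1 - way       # victim is the sibling way
        else:
            self.bits[0] = 0             # protect the right half
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Way to replace next, found by following the bits from the root."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]

plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
victim_after_fill = plru.victim()    # way 0, same as true LRU here

plru.touch(0)
victim_after_reuse = plru.victim()   # way 2, although true LRU would pick way 1
```

The second result shows the approximation: three bits per set cannot record a full access order, so the choice sometimes differs from true LRU in exchange for much cheaper state.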
25
L1 Data Cache
Non-blocking
– Supports up to 4 outstanding load misses
Load latency
– 2-clock for integer
– 6-clock for floating-point
1 load and 1 store per clock
Load speculation
– Assume the access will hit the cache
– “Replay” the dependent instructions when a miss is detected
26
L2 Cache
Non-blocking
Load latency
– Net load access latency of 7 cycles
Bandwidth
– 1 load and 1 store in one cycle
– New cache operations may begin every 2 cycles
– 256-bit wide bus between L1 and L2
– 48 Gbytes per second @ 1.5 GHz
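Combining the 2-cycle L1 integer load latency with the 7-cycle net L2 latency gives a rough expected load latency for a given L1 hit rate. This is a back-of-the-envelope sketch: the 95% hit rate is a hypothetical value, and the model assumes every L1 miss hits in L2 and ignores replay overhead.

```python
L1_HIT_CYCLES = 2    # integer load latency quoted for the L1 data cache
L2_HIT_CYCLES = 7    # net load access latency quoted for the L2 cache

def avg_int_load_latency(l1_hit_rate):
    """Expected integer load latency in cycles.

    Assumes every L1 miss is satisfied by the L2 (no memory accesses).
    """
    return l1_hit_rate * L1_HIT_CYCLES + (1.0 - l1_hit_rate) * L2_HIT_CYCLES

avg = avg_int_load_latency(0.95)     # 0.95 is a hypothetical hit rate
print(f"average integer load latency: {avg:.2f} cycles")
```

With these assumptions the average comes to 2.25 cycles, so even a 5% miss rate adds an eighth to the effective load latency.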
27
L2 Cache Data Prefetcher
Hardware prefetcher monitors the reference patterns
Brings cache lines in automatically
Attempts to fetch 256 bytes ahead of current access
Prefetches for up to 8 simultaneous independent streams
28
System Bus
Delivers data at 3.2 Gbytes/s
64-bit wide bus
Four data phases per clock cycle (quad pumped)
100 MHz clocked system bus
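The peak bandwidth figures quoted for the system bus and for the L1/L2 interface both follow from width times transfers per clock times clock rate. The helper below is a small check of that arithmetic; the function name is made up for the example, and treating the 256-bit L1/L2 bus as one transfer per 1.5 GHz clock is an assumption that happens to reproduce the quoted 48 Gbytes/s.

```python
def peak_bandwidth(width_bits, transfers_per_clock, clock_hz):
    """Peak bandwidth in bytes per second."""
    return (width_bits // 8) * transfers_per_clock * clock_hz

# System bus: 64 bits wide, quad pumped, 100 MHz clock.
system_bus = peak_bandwidth(64, 4, 100_000_000)

# L1 <-> L2 bus: 256 bits wide, assumed one transfer per clock at 1.5 GHz.
l1_l2_bus = peak_bandwidth(256, 1, 1_500_000_000)

print(f"system bus: {system_bus / 1e9:.1f} Gbytes/s")
print(f"L1/L2 bus:  {l1_l2_bus / 1e9:.1f} Gbytes/s")
```

The two results, 3.2 and 48 Gbytes/s, differ by a factor of 15, which is one way to see why the on-chip caches and the hardware prefetcher matter so much to this design.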
29
Execution on MPEG4 Benchmarks @ 1 GHz
30
Performance Trends
[Figure: performance trends; labels: Moore's Law Speedup, Performance Gap, Real-time speech, 10k SPECInt2000]
31
Power Trends
[Figure: power trends; labels: Power, Real-time Speech, 500 mW, Power Gap, Hot Plate, Nuclear Reactor, Rocket Nozzle]