Computer Architecture Pipelines & Superscalars

Pipelines - Data Hazards
Code:
    lw  $4, 0($1)
    add $15, $1, $1
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
The last four instructions all depend on the result ($2) produced by the sub!
MIPS arithmetic instructions have the format: op dest, src_a, src_b

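Pipelines - Data Hazards
- Examine the pipeline (ignore the first 2 instructions, lw and add $15!)
- $2 (r2) is only updated in time for the final add; the 'and' and 'or' ahead of it read the old value
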
Pipelines - Data Hazards
Compiler solution
- Insert NOOPs between the dependent instructions
- Inefficient: every NOOP is a wasted instruction slot!

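The NOOP approach can be sketched in a few lines of Python. This is only an illustration, not how any particular compiler does it: the tuple encoding `(op, dest, sources)`, the function name `insert_nops`, and the assumption that two intervening instructions are enough to separate a write from a read (`min_gap=2`) are all hypothetical choices made for this example.

```python
# Hypothetical encoding: (opcode, destination register or None, source registers)
NOP = ("nop", None, ())

PROGRAM = [
    ("lw",  "$4",  ("$1",)),
    ("add", "$15", ("$1", "$1")),
    ("sub", "$2",  ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("or",  "$13", ("$6", "$2")),
    ("add", "$14", ("$2", "$2")),
    ("sw",  None,  ("$15", "$2")),   # a store defines no register
]


def insert_nops(program, min_gap=2):
    """Pad `program` so that at least `min_gap` instructions separate a
    register write from any later read of that register.
    min_gap=2 is an assumption about the pipeline, not a universal value."""
    out = []
    for op, dest, srcs in program:
        needed = 0
        # Look back over the last `min_gap` emitted slots for a producer.
        for back in range(1, min_gap + 1):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest in srcs:
                needed = max(needed, min_gap - back + 1)
        out.extend([NOP] * needed)
        out.append((op, dest, srcs))
    return out


padded = insert_nops(PROGRAM)
for op, dest, srcs in padded:
    print(op, dest, *srcs)
```

Under these assumptions only the `and` needs padding: two NOOPs go between the `sub` and the `and`, and they also give the `or`, the final `add` and the `sw` enough distance from the `sub`.
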
Pipelines - Data Hazards
Second compiler solution: reorder
Original order:
    lw  $4, 0($1)
    add $15, $1, $1
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Reordered:
    sub $2, $1, $3       (reads $1, $3; writes $2)
    lw  $4, 0($1)
    add $15, $1, $1
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
The lw and add (the two instructions the sub moves past) must not define $1 or $3!

Pipelines - Data Hazards
Second compiler solution: reorder
    sub $2, $1, $3       (reads $1, $3; writes $2)
    lw  $4, 0($1)
    add $15, $1, $1
    and $12, $2, $5      (first use of $2)
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Two independent instructions now sit between the write of $2 and its first use.

Pipelines - Data Hazards
Compiler analyses dependencies
- Register definitions
- Register use
- Read After Write (RAW) dependency
- No dependencies: instruction can be moved!
    sub $2, $1, $3       ($2 written)
    lw  $4, 0($1)
    add $15, $1, $1
    and $12, $2, $5      (use of $2)
    or  $13, $6, $2      (use of $2)
    add $14, $2, $2      (use of $2)
    sw  $15, 100($2)     (use of $2)

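A minimal sketch of this analysis, assuming the same hypothetical `(op, dest, sources)` encoding of the reordered listing: track the most recent definition of each register, and record a RAW dependency whenever a later instruction uses it. The function name `raw_dependencies` is illustrative only.

```python
# Reordered listing from the slide, in a hypothetical tuple encoding.
PROGRAM = [
    ("sub", "$2",  ("$1", "$3")),    # 0: defines $2
    ("lw",  "$4",  ("$1",)),         # 1
    ("add", "$15", ("$1", "$1")),    # 2
    ("and", "$12", ("$2", "$5")),    # 3
    ("or",  "$13", ("$6", "$2")),    # 4
    ("add", "$14", ("$2", "$2")),    # 5
    ("sw",  None,  ("$15", "$2")),   # 6: a store defines no register
]


def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) triples."""
    last_writer = {}      # register -> index of its most recent definition
    deps = []
    for i, (_op, dest, srcs) in enumerate(program):
        for reg in dict.fromkeys(srcs):      # each source register once, in order
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps


deps = raw_dependencies(PROGRAM)
uses_of_2 = [consumer for _producer, consumer, reg in deps if reg == "$2"]
# Instructions with no RAW dependency on anything earlier in this listing.
independent = [i for i in range(len(PROGRAM))
               if all(consumer != i for _p, consumer, _r in deps)]
print(deps, uses_of_2, independent)
```

In this listing the `sub`, `lw` and `add $15` depend on nothing earlier, which is why the compiler is free to move them. A real scheduler would also have to check the other hazard classes before moving an instruction.
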
Pipelines - Data Hazards
Hardware solution: value forwarding
- Hardware detects the dependency (scoreboard)
- Forwards the result from WB to EX for subsequent use
- Done in hardware: transparent to software!

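The idea behind forwarding can be sketched as an operand-selection rule: if a result for the wanted register has been computed but not yet written back, use that value instead of the register file. The Python below is a software analogy only; `read_operand`, `regfile` and `in_flight` are hypothetical names, and the register values are made up for the example.

```python
def read_operand(reg, regfile, in_flight):
    """Return the value an instruction entering EX should use for `reg`.

    regfile   -- dict of register values already written back
    in_flight -- list of (dest, value) pairs for results that have been
                 computed but not yet written back, newest first
    """
    for dest, value in in_flight:
        if dest == reg:
            return value          # forwarded: bypass the register file
    return regfile[reg]           # no pending writer: normal register read


# Made-up register contents for illustration.
regfile = {"$1": 10, "$2": 0, "$3": 4, "$5": 3}

# sub $2, $1, $3 has computed 10 - 4 = 6 but has not written it back yet.
in_flight = [("$2", regfile["$1"] - regfile["$3"])]

# and $12, $2, $5 needs $2 now; without forwarding it would read the stale 0.
and_result = (read_operand("$2", regfile, in_flight)
              & read_operand("$5", regfile, in_flight))
print(and_result)
```
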
Data Hazards - classification
- Read after Write (RAW)
  - Instruction 1 must write before instruction 2 reads
- Write after Write (WAW)
  - Instructions 1 and 2 both write
  - Instruction 2 must write after 1
- Write after Read (WAR)
  - Instruction 1 reads
  - Instruction 2 writes (overwrites)
  - Instruction 2 must not write before 1 reads
Reordering algorithms must consider all three!

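The three classes can be checked mechanically for any ordered pair of instructions. The sketch below assumes a hypothetical `(op, dest, sources)` tuple encoding; `hazards` and `can_reorder` are illustrative names, and the `or` and second `lw` used in the examples are invented to trigger WAW and WAR.

```python
def hazards(first, second):
    """Hazard classes between two instructions, `first` earlier in program order."""
    _, d1, s1 = first
    _, d2, s2 = second
    kinds = set()
    if d1 is not None and d1 in s2:
        kinds.add("RAW")          # second reads what first writes
    if d1 is not None and d1 == d2:
        kinds.add("WAW")          # both write the same register
    if d2 is not None and d2 in s1:
        kinds.add("WAR")          # second overwrites what first reads
    return kinds


def can_reorder(first, second):
    """Two adjacent instructions may be swapped only if no hazard links them."""
    return not hazards(first, second)


sub   = ("sub", "$2",  ("$1", "$3"))
and_  = ("and", "$12", ("$2", "$5"))
lw    = ("lw",  "$4",  ("$1",))
add15 = ("add", "$15", ("$1", "$1"))
or2   = ("or",  "$2",  ("$6", "$7"))     # invented: also writes $2
lw1   = ("lw",  "$1",  ("$7",))          # invented: overwrites $1

print(hazards(sub, and_), hazards(sub, or2), hazards(add15, lw1), can_reorder(lw, sub))
```
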
Lecture 5 - Key Points
- Data Hazards
  - RAW (most common)
  - WAW
  - WAR
- Compiler
  - Looks for dependencies, then re-orders
- Hardware
  - Scoreboard: monitors dependencies, ensures correct operation
  - Value forwarding hardware: forwards results from EX stage

Pipelines - Exceptions
- Caused by overflow, underflow
- Example: add $1, $2, $1
  - Overflow detected in EX stage
  - Causes jump to exception handler
    - Treated as a branch: remainder of pipeline flushed
  - but: compiler needs the original $1 that caused the overflow
    - Register must not be overwritten
    - EX stage needs to squash the WB operation
- Precise Exception problem: more later!

Time to complete each instruction = t
- Total: fetch + decode + fetch operands + operation + write-back
- Clock frequency: f = 1/t
An n-stage pipeline allows n instructions ‘in flight’ simultaneously
- Each pipeline stage does 1/n of the work
- Each stage requires time t/n
- Assumes a perfectly balanced pipeline!
  - Balanced = each stage requires the same time
- Clock frequency: f_pipe = 1/(t/n) = n/t
Increasing n increases processor power?

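As a worked version of the ideal case, consider running N instructions with no stalls on a perfectly balanced n-stage pipeline. The symbols N, T_seq, T_pipe and S are introduced here for illustration; t and n are as defined above. The first instruction needs all n stages, and each later one completes one stage time (t/n) after its predecessor:

```latex
\begin{align*}
T_{\text{seq}}(N)  &= N\,t \\
T_{\text{pipe}}(N) &= (n + N - 1)\,\frac{t}{n} \\
S(N) &= \frac{T_{\text{seq}}(N)}{T_{\text{pipe}}(N)}
      = \frac{nN}{n + N - 1}
      \;\xrightarrow{\;N \to \infty\;}\; n
\end{align*}
```

For example, n = 5 and N = 100 give S = 500/104 ≈ 4.8, already close to the ideal factor of 5. This bound ignores hazards and per-stage overheads.
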
Pipelines - Depth
- Pipeline can’t be too deep
- Hazards are frequent → many stalls in deep pipelines
[Figure: relative performance vs. pipeline depth, with regions labelled "Superpipelined" and "Too Deep!"]

Pipeline depth
Increasing number of stages
- Each stage adds overheads
- Problems balancing the pipeline
  - Require t_pd_1 ≈ t_pd_2 ≈ t_pd_3
- Stage time is t_pd_j + t_pd_reg
- n stages means n × t_pd_reg overhead
[Figure: Register, Operation (work), Register, Operation (work), ... with delays t_pd_reg, t_pd_1, t_pd_2, t_pd_3]

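The effect of the register overhead on the achievable speed-up can be sketched numerically. The model below assumes perfectly balanced stages and no hazards, so each stage takes t_work/n + t_reg; the function names and the delay values (10 ns of work, 0.5 ns per register) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def stage_time(t_work, n, t_reg):
    """Clock period of a balanced n-stage pipeline: work share plus register delay."""
    return t_work / n + t_reg


def speedup(t_work, n, t_reg):
    """Clock-rate gain of n stages over a single registered stage."""
    return stage_time(t_work, 1, t_reg) / stage_time(t_work, n, t_reg)


T_WORK = 10.0   # ns of logic for one instruction (made-up value)
T_REG = 0.5     # ns per pipeline register (made-up value)

for n in (1, 5, 10, 20, 40):
    print(n, round(speedup(T_WORK, n, T_REG), 2))

# Upper bound as n grows: the register delay alone sets the clock period.
limit = (T_WORK + T_REG) / T_REG
```

With these numbers, doubling the depth from 5 to 10 stages gives far less than double the clock rate, and no depth can exceed the limit set by the register delay, even before any hazards are considered.
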
CISC and pipelines
- High speed CISC processors are pipelined
  - Overlap IF, EX
- Variable instruction length and running time (number of microcode cycles)
  → pipeline imbalance
  → "backup" in pipe stages
  → complicates hazard detection
- Complex addressing modes
  → auto-increment updates address register
  → multiple memory accesses required
  → smooth pipeline flow more difficult!

Instruction Queues
- Vital performance determinant: rate of instruction fetch
- High performance processors
  - Fetching multiple instructions in each cycle is common
  - Use a wide datapath to memory
    - PowerPC bits = 4 instructions
- Despatch unit
  - Examines dependencies
  - Determines which instructions can be despatched

Instruction Queues
- Queue "matches" fetch and despatch rates
- General strategy for matching producers and consumers
  - Use of FIFO-style queues
  - Absorb asynchronous delivery / consumption rates
  - Provides elasticity in pipelines
[Figure: Producer → FIFO → Consumer, with differing instantaneous rates]

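A toy simulation shows the elasticity a FIFO provides. It assumes a producer that fetches in bursts of 4 every other cycle and a consumer that can despatch 2 per cycle, so the average rates match but the instantaneous rates do not. The function `simulate`, the fetch pattern and the capacities are all hypothetical.

```python
from collections import deque


def simulate(fetch_pattern, despatch_per_cycle, capacity):
    """Return the number of instructions despatched in each cycle."""
    queue = deque()
    despatched = []
    for fetched in fetch_pattern:
        # Producer: enqueue only what fits. In this simple model a fetch
        # slot that does not fit is lost rather than retried.
        space = capacity - len(queue)
        for _ in range(min(fetched, space)):
            queue.append("instr")
        # Consumer: despatch as many as are available, up to its limit.
        n = min(despatch_per_cycle, len(queue))
        for _ in range(n):
            queue.popleft()
        despatched.append(n)
    return despatched


BURSTY_FETCH = [4, 0, 4, 0, 4, 0]      # 4 instructions every other cycle

deep = simulate(BURSTY_FETCH, despatch_per_cycle=2, capacity=8)
shallow = simulate(BURSTY_FETCH, despatch_per_cycle=2, capacity=2)
print(deep, sum(deep))
print(shallow, sum(shallow))
```

With the deeper queue the consumer is kept busy in every cycle. With a queue only 2 entries deep, half of each burst cannot be buffered and the consumer idles on alternate cycles.
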
Superscalar Processors
PowerPC organisation
- PowerPC 601, ~1993
- 3-way superscalar: Integer, Branch, Floating Point
- A newer machine will have more functional units here!
- New: look in the "Example Processors" section of the Web notes
[Figure: PowerPC 601 block diagram, with the boundary of the Si die marked]

Superscalar Processors
- Multiple functional units
  - PowerPC 604: 6-way superscalar
    - Branch Unit
    - Load/Store Unit
    - 3 Integer Units
    - Floating Point Unit
- Despatch unit
  - Sends "ready" instructions to all free units
  - PowerPC 604: potential 4 instructions/cycle (pipeline lengths are different!)
  - Reality: 2-3 instructions/cycle? (program dependent!)

Superscalar Processors
- Mix of functional units
  - Up to 8-way superscalar common now
    - 2 Floating point units (usually have ~3 cycle latency)
    - 3 Integer arithmetic units
    - Branch unit
    - Load / store unit
    - + ….?
- Marketing departments can play some games with the ‘n’ of an n-way superscalar!

Pentium Quad Core
- Distinguish between
  - Multiple ‘cores’ (separate processors; covered later), and
  - Superscalars: multiple functional units per processor
    - "Wide dynamic execution" in Intel-speak
- Quad core
  - 4 cores
  - Complete up to 4 instructions / cycle each
    - IIU can issue four instructions / cycle
  - 3 MB L2 cache / processor (total 12 MB)
  - Master clock 3.2 GHz, front side bus 1.6 GHz
  - 771 pins

Superscalar Limitations
- To achieve maximum performance, the instruction mix must match the functional unit mix
  - e.g. if we have 2 Integer ALUs, 2 FPUs, 1 branch unit, 1 load/store unit
  - Instruction issue unit (IIU) can issue 4 instructions
  - Each group of four instructions should be able to use 4 of the functional units
- If the instruction stream doesn’t have the right mix
  - Some functional units will remain idle
- FPUs require multiple cycles
  - Additional stalls
- Pipeline hazards stall the pipeline
- A 4-way superscalar gets fewer than 4 instructions completed per cycle
  - Program dependent!

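The dependence on instruction mix can be illustrated with a deliberately simplified issue model. It assumes in-order issue, single-cycle units, no data dependencies, and the unit mix from the example above (2 integer, 2 FP, 1 branch, 1 load/store, issue width 4). The function `cycles_to_issue` and the two instruction streams are hypothetical.

```python
from collections import Counter

UNITS = {"int": 2, "fp": 2, "branch": 1, "ls": 1}


def cycles_to_issue(stream, units=UNITS, width=4):
    """Cycles needed to issue `stream` in order, at most `width` per cycle.

    Issue stops for the cycle at the first instruction whose unit type
    has no free unit left (in-order issue)."""
    if any(units.get(kind, 0) < 1 for kind in stream):
        raise ValueError("stream needs a unit type the machine does not have")
    cycles = 0
    i = 0
    while i < len(stream):
        free = Counter(units)            # all units free at the start of a cycle
        issued = 0
        while i < len(stream) and issued < width and free[stream[i]] > 0:
            free[stream[i]] -= 1
            issued += 1
            i += 1
        cycles += 1
    return cycles


balanced = ["int", "fp", "ls", "int"] * 3     # matches the unit mix
int_heavy = ["int"] * 12                      # only the 2 integer ALUs are useful

ipc_balanced = len(balanced) / cycles_to_issue(balanced)
ipc_int_heavy = len(int_heavy) / cycles_to_issue(int_heavy)
print(ipc_balanced, ipc_int_heavy)
```

Even in this idealised model the integer-only stream reaches just half the peak issue rate. Multi-cycle FPU latencies and pipeline hazards, which the sketch leaves out, reduce the rate further.
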