Slide 1: Linear Pipeline Processors
- A cascade of processing stages that are linearly connected to perform a fixed function
- k processing stages
- External input is fed in at stage S1; the final result emerges from stage Sk
Slide 2: Asynchronous Model
- Data flow between adjacent stages is controlled by handshaking
- Si sends a ready signal to Si+1 when it is ready to transmit
- Si+1 sends an ack signal to Si after receiving the data
- Allows a variable throughput rate at the stages
Slide 3: [figure only]
Slide 4: Synchronous Model
- Clocked latches are used to interface between stages
- Upon arrival of a clock pulse, all latches transfer data to the next stage simultaneously
- Requires approximately equal delay in all stages
Slide 5: Reservation Table
- Specifies the utilization pattern of successive stages
- A linear pipeline follows a diagonal streamline
- A task needs k clock cycles to flow through
- One result emerges at each cycle if the tasks are independent of each other
Slide 6: Clock Cycle
- τi = time delay of the circuitry in stage Si
- d = time delay of a latch
- τm = max{τi} = maximum stage delay
- τ = τm + d (clock cycle of the pipeline)
- Data is latched to the master flip-flop of each latch register at the rising edge of the clock pulse
- d also equals the width of the clock pulse (τm >> d)
Slide 7: Pipeline Throughput
- f = 1/τ = pipeline frequency
- At best, one result per cycle can be expected, so f represents the maximum throughput
- Actual throughput is less than f due to initiation latency and dependencies
Slide 8: Clock Skewing
- The same clock pulse may arrive at different stages with a time offset of s
- tmax (tmin) = time delay of the longest (shortest) logic path within a stage
- Choose τm ≥ tmax + s and d ≤ tmin - s
- These constraints bound the clock cycle: d + tmax + s ≤ τ ≤ τm + tmin - s
- Ideally s = 0, tmax = τm, and tmin = d
Slide 9: Speedup Factor
- Ideally, a k-stage pipeline can process n tasks in k + (n - 1) cycles: Tk = [k + (n - 1)]τ
- The flow-through delay for one task is kτ, so a nonpipelined processor needs T1 = nkτ for n tasks
- Sk = T1/Tk = nk / [k + (n - 1)]
Slide 10: Number of Stages
- Micropipelining: divide at the logic gate level
- Macropipelining: divide at the processor level
- The optimal number of stages should maximize the performance/cost ratio (PCR)
- Pipeline cycle p = t/k + d, so f = 1/p
- Total pipeline cost = c + kh
- PCR = f/(c + kh) = 1/[(t/k + d)(c + kh)]
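As a numerical sanity check, the sketch below sweeps k and locates the PCR maximum. The parameter values, and the reading of t as total flow-through delay, d as latch delay, c as the cost of all logic stages, and h as the cost of each latch, are assumptions for illustration:

```python
# Sketch: find the stage count k maximizing PCR = 1/[(t/k + d)(c + kh)].
def pcr(k, t, d, c, h):
    return 1.0 / ((t / k + d) * (c + k * h))

t, d, c, h = 100.0, 1.0, 50.0, 2.0        # illustrative values
best_k = max(range(1, 101), key=lambda k: pcr(k, t, d, c, h))
print(best_k)  # -> 50, matching the analytic optimum k0 = sqrt(t*c/(d*h))
```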
Slide 11: Efficiency and Throughput
- Efficiency: Ek = Sk/k = n / [k + (n - 1)]
- Throughput: Hk = n / {[k + (n - 1)]τ} = nf / [k + (n - 1)]
- Hk reaches its maximum f when Ek → 1 as n → ∞
- Hk = Ek·f = Ek/τ = Sk/(kτ)
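A minimal sketch evaluating the speedup, efficiency, and throughput formulas above (the stage count, task count, and clock period are made-up example values):

```python
# Sketch: S_k, E_k, H_k for a k-stage pipeline processing n tasks.
def pipeline_metrics(k, n, tau):
    s_k = n * k / (k + n - 1)         # speedup    S_k = T1/Tk
    e_k = s_k / k                     # efficiency E_k = S_k/k
    h_k = n / ((k + n - 1) * tau)     # throughput H_k = E_k * f
    return s_k, e_k, h_k

# k = 4 stages, n = 64 tasks, tau = 10 ns
print(pipeline_metrics(4, 64, 10e-9))  # S_k ~3.82, E_k ~0.96, H_k ~95.5M tasks/s
```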
Slide 12: Dynamic Pipeline
- Can be reconfigured to perform variable functions at different times
- Allows feedforward and feedback connections, making the pipeline nonlinear
- Linear pipelines are static, for fixed functions
- By following different dataflow patterns, the same pipeline can be used to evaluate different functions
Slide 13: Reservation Tables
- Multiple reservation tables can be generated for the evaluation of different functions
- Different functions may follow different paths through the pipeline
- There is a one-to-many mapping between a pipeline configuration and its reservation tables
- The number of columns is the evaluation time of a given function
Slide 14: [figure only]
Slide 15: Latency
- Latency: the number of time units between two initiations
- Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision (a resource conflict)
- Forbidden latencies: latencies that cause collisions
- To detect forbidden latencies, check the distance between any two marks in the same row of the reservation table
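A small sketch of this check in Python; the 3-stage reservation table is invented for illustration:

```python
# Sketch: forbidden latencies of a reservation table. Rows are stages;
# a 1 marks a clock cycle in which that stage is used.
table = [
    [1, 0, 0, 0, 0, 1],   # S1 used at cycles 0 and 5
    [0, 1, 0, 1, 0, 0],   # S2 used at cycles 1 and 3
    [0, 0, 1, 0, 1, 0],   # S3 used at cycles 2 and 4
]

def forbidden_latencies(table):
    """Distances between any two marks in the same row are forbidden."""
    forbidden = set()
    for row in table:
        marks = [t for t, used in enumerate(row) if used]
        for i in range(len(marks)):
            for j in range(i + 1, len(marks)):
                forbidden.add(marks[j] - marks[i])
    return forbidden

print(sorted(forbidden_latencies(table)))  # -> [2, 5]
```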
Slide 16: Latency Analysis
- Latency sequence: a sequence of permissible latencies between successive task initiations
- Latency cycle: a latency sequence that repeats the same subsequence indefinitely
- Average latency: the sum of all latencies in a cycle divided by the number of latencies in it
- Constant cycle: a cycle that contains only one latency value
Slide 17: [figure only]
Slide 18: [figure only]
Slide 19: Collision Vectors
- Maximum forbidden latency m ≤ n - 1 (n = number of columns in the reservation table)
- Permissible latency p: 1 ≤ p ≤ m - 1 (p = 1 is ideal)
- Collision vector: an m-bit binary vector displaying the set of permissible and forbidden latencies
- Ci = 1 if latency i causes a collision, Ci = 0 otherwise
- Cm = 1 always (m is the maximum forbidden latency)
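Continuing the example above, a sketch that packs the forbidden-latency set {2, 5} into an m-bit collision vector:

```python
# Sketch: collision vector C = (C_m ... C_1) from a forbidden-latency set.
def collision_vector(forbidden):
    m = max(forbidden)                        # maximum forbidden latency
    return ''.join('1' if i in forbidden else '0'
                   for i in range(m, 0, -1))  # bits C_m down to C_1

print(collision_vector({2, 5}))  # -> '10010'
```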
Slide 20: State Diagrams
- Specify the permissible state transitions among successive initiations
- Initial collision vector: corresponds to the initial state at time 1
- The next state at time t + p is obtained with an m-bit right shift register
- The next state after p shifts is obtained by ORing the initial collision vector with the shifted register contents
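The shift-and-OR rule can be sketched directly on integers, treating bit i-1 of the state as C_i:

```python
# Sketch: next state after initiating a task with permissible latency p.
def next_state(state, p, c0):
    return (state >> p) | c0     # p-bit right shift, then OR with C0

c0 = int('10010', 2)             # initial collision vector from the example
s = next_state(c0, 1, c0)        # initiate the next task after latency 1
print(format(s, '05b'))          # -> '11011'
```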
Slide 21: [figure only]
Slide 22: Greedy Cycles
- Simple cycles: cycles in which each state appears only once
- Some simple cycles are greedy cycles: cycles whose edges are all made with the minimum latencies from their respective starting states
- Their average latencies must be lower than those of the other simple cycles
- The cycle with the minimal average latency (MAL) is chosen
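A sketch that follows the greedy cycle from an initial collision vector by always taking the smallest permissible latency; the average latency it reports upper-bounds the MAL but is not guaranteed to equal it:

```python
# Sketch: trace states until one repeats, picking the minimum latency each time.
def greedy_cycle(c0_bits):
    m = len(c0_bits)
    c0 = int(c0_bits, 2)
    state, seen, latencies = c0, {}, []
    while state not in seen:
        seen[state] = len(latencies)
        # smallest p whose bit C_p is 0; latency m+1 is always permissible
        p = next((p for p in range(1, m + 1)
                  if not (state >> (p - 1)) & 1), m + 1)
        latencies.append(p)
        state = (state >> p) | c0
    cycle = latencies[seen[state]:]          # the repeating part
    return cycle, sum(cycle) / len(cycle)

print(greedy_cycle('10010'))  # -> ([1, 3, 3], 2.33...) for the running example
```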
Slide 23: Bounds on MAL
- MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table
- MAL is lower than or equal to the average latency of any greedy cycle in the state diagram
- The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1 (this is also an upper bound on MAL)
Slide 24: Optimizing the Schedule
- Choosing a greedy cycle is not sufficient for optimality of the MAL; reaching the lower bound is
- Approach the lower bound by modifying the reservation table
- Try to reduce the maximum number of marks in any row
- The modified table must preserve the original function being evaluated
Slide 25: Delay Insertion
- Use noncompute delay stages to increase pipeline performance by achieving a shorter MAL
- The purpose is to modify the reservation table
- This yields a new collision vector
- And results in a modified state diagram
Slide 26: [figure only]
Slide 27: [figure only]
Slide 28: Pipeline Throughput
- Initiation rate: the average number of task initiations per cycle
- If N tasks are initiated within n pipeline cycles, the initiation rate, or pipeline throughput, is N/n
- The scheduling strategy affects performance: the shorter the MAL, the higher the throughput
- Unless the MAL is reduced to 1, the throughput is a fraction (of one task per cycle)
Slide 29: Pipeline Efficiency
- Stage utilization: the percentage of time each stage is used over a long series of task initiations
- The rate accumulated over all stages determines the efficiency
- Higher efficiency implies less idle time and higher throughput
Slide 30: Instruction Execution Phases
- Instruction execution consists of fetch, decode, operand fetch, execute, and write-back phases
- Ideal for overlapped execution on a linear pipeline
- Each phase may require one or more clock cycles
Slide 31: Instruction Pipeline Stages
- Fetch: fetches instructions from the cache
- Decode: reveals the function to perform and identifies the needed resources
- Issue: reserves resources, maintains control interlocks, and reads register operands
- Execute: one or several stages
- Writeback: writes results into the registers
Slide 32: [figure only]
Slide 33: [figure only]
Slide 34: Prefetch Buffers
- Three types of buffers can be used to match the instruction fetch rate to the pipeline consumption rate:
  - Sequential buffer: for in-sequence pipelining
  - Target buffer: holds instructions fetched from a branch target
  - Loop buffer: holds sequential instructions within a loop
- A block of instructions is fetched into a prefetch buffer in one memory access time
Slide 35: [figure only]
Slide 36: Multiple Functional Units
- The bottleneck stage is the one with the maximum number of marks in its row of the reservation table
- Resolve it by using multiple copies of the same stage simultaneously
- Reservation stations for each unit are used to resolve data or resource dependencies
Slide 37: Reservation Stations
- Operands wait in the RS until their data dependencies have been resolved
- Each RS has an ID tag, monitored by a tag unit
- This allows the hardware to resolve conflicts between source and destination registers
- Reservation stations also serve as buffers
Slide 38: [figure only]
Slide 39: Internal Data Forwarding
- Improves throughput further by replacing some memory access operations with register transfer operations
- Store-load forwarding: a load is replaced by a move operation (e.g., Store M, R1 followed by Load R2, M becomes Store M, R1 and Move R2, R1)
- Load-load forwarding: the second load of the same address is replaced with a move operation
- Store-store overwriting: when two stores to the same address have no intervening load, the first store operation can be removed
Slide 40: [figure only]
Slide 41: Hazard Avoidance
- Reads and writes of shared variables by different instructions may lead to different results if they are executed out of order
- Three types of hazards: RAW, WAW, and WAR
- Domain D(I) = input set of instruction I; range R(I) = output set of instruction I
Slide 42: [figure only]
Slide 43: Hazard Conditions
- RAW (flow dependence): R(I) ∩ D(J) ≠ ∅
- WAW (output dependence): R(I) ∩ R(J) ≠ ∅
- WAR (antidependence): D(I) ∩ R(J) ≠ ∅
- These are necessary, but not sufficient, conditions
- Whether a hazard occurs depends on the order in which the two instructions are executed
- A special tag bit used with each operand register indicates whether it is safe or hazard-prone
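A sketch of these checks as set intersections; the register sets in the example are invented:

```python
# Sketch: potential hazards between an earlier instruction I and a later J,
# given their domains (registers read) and ranges (registers written).
def hazards(dom_i, rng_i, dom_j, rng_j):
    found = []
    if rng_i & dom_j:
        found.append('RAW')  # J reads what I writes (flow dependence)
    if rng_i & rng_j:
        found.append('WAW')  # I and J write the same register (output dep.)
    if dom_i & rng_j:
        found.append('WAR')  # J writes what I reads (antidependence)
    return found

# I: Add R2, R0, R1 (reads R0,R1; writes R2); J: Sub R4, R2, R3
print(hazards({'R0', 'R1'}, {'R2'}, {'R2', 'R3'}, {'R4'}))  # -> ['RAW']
```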
Slide 44: Static Scheduling
- Data dependencies create an interlocked relationship between a sequence of instructions
- They can be resolved by a compiler-based static scheduling approach
- The compiler increases the separation between interlocked instructions
- Cheaper to implement and flexible to apply
Slide 45: Static Scheduling Example

Original code (the multiply is held up by the preceding load of its operand):
  Add  R0, R1    / 2 cycles
  Move R1, R5    / 1 cycle
  Load R2, M(a)  / 2 cycles
  Load R3, M(b)  / 2 cycles
  Mult R2, R3    / 3 cycles

Code reordered by the compiler (no delay for the multiply):
  Load R2, M(a)  / 2 cycles
  Load R3, M(b)  / 2 cycles
  Add  R0, R1    / 2 cycles
  Move R1, R5    / 1 cycle
  Mult R2, R3    / 3 cycles
Slide 46: Tomasulo's Algorithm
- A hardware dependence-resolution scheme
- Resolves resource conflicts as well as data dependencies using register tagging
- An issued instruction whose operands are not yet available is forwarded to an RS associated with the functional unit it will use
Slide 47: [figure only]
Slide 48: CDC Scoreboarding
- Dynamic instruction scheduling hardware
- The scoreboard unit keeps track of the registers needed by instructions waiting for the various units
- When all registers have valid data, the scoreboard enables the instruction's execution
- When an instruction finishes, its resources are released
Slide 49: [figure only]
Slide 50: Branching Terms
- Fetching a nonsequential instruction after a branch instruction is called branch taken
- The instruction to be executed after a branch taken is called the branch target
- The number of cycles between a branch taken and its target is called the delay slot (denoted by b)
Slide 51: Effect of Branching
- When a branch taken occurs, all instructions following the branch in the pipeline are drained, losing useful cycles
- p = probability that an instruction is a conditional branch
- q = probability that the branch is taken
- Branch penalty = pqnbτ (b extra cycles per taken branch)
- Teff = [k + (n - 1)]τ + pqnbτ
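A sketch evaluating Teff and the resulting effective throughput; the workload numbers are illustrative:

```python
# Sketch: effective throughput n / Teff, in tasks per clock cycle tau.
def effective_throughput(n, k, p, q, b, tau=1.0):
    t_eff = (k + (n - 1)) * tau + p * q * n * b * tau
    return n / t_eff

# n=100 instructions, k=5 stages, 20% branches, 60% taken, b=3
print(effective_throughput(100, 5, 0.2, 0.6, 3))  # ~0.714 vs ~0.962 with no branching
```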
Slide 52: Branch Prediction
- The branch instruction type or branch history can be used for prediction
- Requires collecting the frequencies and probabilities of branch taken, by branch type, over a large number of traces
- Static prediction (taken or not taken) is wired in; predicting taken usually gives the best performance
- Once wired in, the prediction cannot be changed
Slide 53: Dynamic Branch Strategy
- Uses recent branch history for prediction
- Three classes of strategies:
  - Use only the information found at the decode stage
  - Use a cache to store target addresses, at the stage where the effective address of the branch target is computed
  - Use a cache to store target instructions, at the fetch stage
- Additional hardware is required to track the history
Slide 54: Branch Target Buffer
- Holds recent branch information, including the address of the branch target used
- The address of the branch instruction locates its entry in the BTB
- A BTB entry contains backtracking information to guide the prediction
- The BTB can also store the target instruction(s)
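A minimal sketch of the idea, reducing the BTB to a map from branch address to last-used target; a real BTB is a set-associative hardware table with more fields than this:

```python
# Sketch: predict via the last recorded target; fall through on a miss.
btb: dict[int, int] = {}

def predict(branch_pc: int, fallthrough_pc: int) -> int:
    return btb.get(branch_pc, fallthrough_pc)

def update(branch_pc: int, taken: bool, target_pc: int) -> None:
    if taken:
        btb[branch_pc] = target_pc     # record/refresh the taken target
    else:
        btb.pop(branch_pc, None)       # drop entries for not-taken branches

update(0x4000, True, 0x4800)
print(hex(predict(0x4000, 0x4004)))    # -> 0x4800
```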
Slide 55: Delayed Branches
- Reduce the branch penalty by shortening the delay slot
- A delayed branch of d cycles allows at most d - 1 useful instructions to be executed following the branch taken
- These instructions must be independent of the branch outcome
- NOPs can be used as fillers when no such instructions are found
Slide 56: Delayed Branch Example

Original code:
  I1. Load R1, A
  I2. Dec R3, 1
  I3. BRZero R3, I5
  I4. Add R2, R4
  I5. Sub R5, R6
  I6. Store R5, B

Modified code (I1, independent of the branch outcome, is moved into the delay slot):
  I2. Dec R3, 1
  I3. BRZero R3, I5
  I1. Load R1, A
  I4. Add R2, R4
  I5. Sub R5, R6
  I6. Store R5, B
Slide 57: Pipeline Design Parameters
- k-stage pipeline
- Pipeline cycle = 1 time unit for the scalar base machine (the base cycle)
- Issue rate = 1 for the base machine
- Issue latency = 1 for the base machine
- Simple operation latency = 1 for the base machine
- Instruction-level parallelism (ILP): the maximum number of instructions that can be executed simultaneously
Slide 58: Superscalar Pipeline Structure
- Degree m: issues m instructions concurrently per cycle
- The instruction decoding and execution resources are increased to form m pipelines
- Functional units may be shared by multiple pipelines at some stages
Slide 59: [figure only]
Slide 60: Scheduling Difficulties
- Scheduling is more difficult when instructions are retrieved from the same source
- Goals in scheduling:
  - Avoid pipeline stalling
  - Minimize pipeline idle time
Slide 61: Pipeline Stalling
- Stalling lowers pipeline utilization, more so for superscalar pipelines than for scalar ones
- Caused by data or resource conflicts among instructions in, or about to enter, the pipeline
- Also caused by branching
Slide 62: [figure only]
Slide 63: Multipipeline Scheduling
- In-order vs. out-of-order issue (relative to the original program order)
- In-order vs. out-of-order completion
- In-order is easier to implement, but may not be optimal
- Performance is measured by the total execution time and the utilization rate of the pipeline stages
Slide 64: Superscalar Performance
For N independent instructions on a k-stage pipeline:
- Base scalar machine: T(1,1) = k + N - 1
- Superscalar of degree m: T(m,1) = k + (N - m)/m
- Speedup: S(m,1) = T(1,1)/T(m,1) = m(N + k - 1) / [N + m(k - 1)]
Slide 65: Superpipelined Design
- Degree n: pipeline cycle time = 1/n of the base cycle
- A fixed-point addition that takes one cycle in a base scalar processor takes n short cycles in a superpipelined processor
- Issue rate = 1, issue latency = 1/n, ILP = n
- Requires high-speed clocking
Slide 66: [figure only]
Slide 67: Superpipelined Performance
For N instructions, degree n, k stages:
- T(1,n) = k + (N - 1)/n
- S(1,n) = n(k + N - 1) / (nk + N - 1)
Slide 68: Superpipelined Superscalar
- Degree (m, n): executes m instructions every cycle, with a pipeline cycle equal to 1/n of the base cycle
- Instruction issue latency = 1/n
- ILP = mn instructions
Slide 69: Superpipelined Superscalar Performance
For N independent instructions, degree (m, n):
- T(m,n) = k + (N - m)/(mn)
- S(m,n) = mn(k + N - 1) / [mnk + N - m]
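A sketch evaluating the T and S formulas from the last three slides; setting m = 1 or n = 1 recovers the pure superpipelined or superscalar cases:

```python
# Sketch: speedup of a degree-(m,n) design over the base scalar machine.
def speedup(m, n, k, N):
    t_base = k + N - 1               # T(1,1)
    t_mn = k + (N - m) / (m * n)     # T(m,n)
    return t_base / t_mn

k, N = 5, 120
print(speedup(3, 1, k, N))  # superscalar, degree 3      (~2.82)
print(speedup(1, 2, k, N))  # superpipelined, degree 2   (~1.92)
print(speedup(3, 2, k, N))  # combined, degree (3,2)     (~5.06)
```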
Slide 70: Design Approaches
- Superpipelined: emphasizes temporal parallelism; needs faster transistors; the design must minimize the effects of clock skewing
- Superscalar: depends on spatial parallelism; needs more transistors; a better match for CMOS technology