Slide 1: Linear Pipeline Processors
- A cascade of processing stages that are linearly connected to perform a fixed function
- k processing stages
- External input is fed in at stage S1; the final result emerges from stage Sk
Slide 2: Asynchronous Model
- Data flow between adjacent stages is controlled by handshaking
- Si sends a ready signal to Si+1 when it is ready to transmit
- Si+1 sends an ack signal to Si after receiving the data
- Allows a variable throughput rate at the stages
Slide 3: [figure only]
Slide 4: Synchronous Model
- Clocked latches are used to interface between stages
- Upon arrival of a clock pulse, all latches transfer data to the next stage simultaneously
- Requires approximately equal delay in all stages
Slide 5: Reservation Table
- Specifies the utilization pattern of successive stages
- A linear pipeline follows a diagonal streamline
- A task needs k clock cycles to flow through
- One result emerges at each cycle if the tasks are independent of each other
Slide 6: Clock Cycle
- τi = time delay of the circuitry in stage Si
- d = time delay of a latch
- τm = max{τi} = maximum stage delay
- τ = τm + d (clock cycle of the pipeline)
- Data is latched to the master flip-flop of each latch register at the rising edge of the clock pulse
- d also equals the width of the clock pulse (τm >> d)
Slide 7: Pipeline Throughput
- f = 1/τ = pipeline frequency
- At best, one result per cycle can be expected, so f represents the maximum throughput
- Actual throughput is less than f due to initiation latency and dependencies
Slide 8: Clock Skewing
- The same clock pulse may arrive at different stages with a time offset of s
- tmax (tmin) = time delay of the longest (shortest) logic path within a stage
- Choose τm ≥ tmax + s and d ≤ tmin - s
- These constraints bound the clock cycle: d + tmax + s ≤ τ ≤ τm + tmin - s
- Ideally s = 0, tmax = τm, and tmin = d
Slide 9: Speedup Factor
- Ideally, a k-stage pipeline can process n tasks in k + (n - 1) cycles: Tk = [k + (n - 1)]τ
- The flow-through delay for one task is kτ, so a nonpipelined processor needs T1 = nkτ for n tasks
- Sk = T1/Tk = nk / [k + (n - 1)]
Slide 10: Number of Stages
- Micropipelining: divide at the logic gate level
- Macropipelining: divide at the processor level
- The optimal number of stages should maximize the performance/cost ratio (PCR)
- Pipeline cycle p = t/k + d, so f = 1/p
- Total pipeline cost = c + kh
- PCR = f/(c + kh) = 1/[(t/k + d)(c + kh)]
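As a numerical sanity check, the sketch below sweeps k and locates the PCR maximum. The parameter values, and the reading of t as total flow-through delay, d as latch delay, c as the cost of all logic stages, and h as the cost of each latch, are assumptions for illustration:

```python
# Sketch: find the stage count k maximizing PCR = 1/[(t/k + d)(c + kh)].
def pcr(k, t, d, c, h):
    return 1.0 / ((t / k + d) * (c + k * h))

t, d, c, h = 100.0, 1.0, 50.0, 2.0        # illustrative values
best_k = max(range(1, 101), key=lambda k: pcr(k, t, d, c, h))
print(best_k)  # -> 50, matching the analytic optimum k0 = sqrt(t*c/(d*h))
```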
Slide 11: Efficiency and Throughput
- Efficiency: Ek = Sk/k = n / [k + (n - 1)]
- Throughput: Hk = n / {[k + (n - 1)]τ} = nf / [k + (n - 1)]
- Hk reaches its maximum f when Ek → 1 as n → ∞
- Hk = Ek·f = Ek/τ = Sk/(kτ)
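A minimal sketch evaluating the speedup, efficiency, and throughput formulas above (the stage count, task count, and clock period are made-up example values):

```python
# Sketch: S_k, E_k, H_k for a k-stage pipeline processing n tasks.
def pipeline_metrics(k, n, tau):
    s_k = n * k / (k + n - 1)         # speedup    S_k = T1/Tk
    e_k = s_k / k                     # efficiency E_k = S_k/k
    h_k = n / ((k + n - 1) * tau)     # throughput H_k = E_k * f
    return s_k, e_k, h_k

# k = 4 stages, n = 64 tasks, tau = 10 ns
print(pipeline_metrics(4, 64, 10e-9))  # S_k ~3.82, E_k ~0.96, H_k ~95.5M tasks/s
```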
Slide 12: Dynamic Pipeline
- Can be reconfigured to perform variable functions at different times
- Allows feedforward and feedback connections, making the pipeline nonlinear
- Linear pipelines are static, for fixed functions
- By following different dataflow patterns, the same pipeline can be used to evaluate different functions
Slide 13: Reservation Tables
- Multiple reservation tables can be generated for the evaluation of different functions
- Different functions may follow different paths through the pipeline
- There is a one-to-many mapping between a pipeline configuration and its reservation tables
- The number of columns is the evaluation time of a given function
Slide 14: [figure only]
Slide 15: Latency
- Latency: the number of time units between two initiations
- Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision (a resource conflict)
- Forbidden latencies: latencies that cause collisions
- To detect forbidden latencies, check the distance between any two marks in the same row of the reservation table
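A small sketch of this check in Python; the 3-stage reservation table is invented for illustration:

```python
# Sketch: forbidden latencies of a reservation table. Rows are stages;
# a 1 marks a clock cycle in which that stage is used.
table = [
    [1, 0, 0, 0, 0, 1],   # S1 used at cycles 0 and 5
    [0, 1, 0, 1, 0, 0],   # S2 used at cycles 1 and 3
    [0, 0, 1, 0, 1, 0],   # S3 used at cycles 2 and 4
]

def forbidden_latencies(table):
    """Distances between any two marks in the same row are forbidden."""
    forbidden = set()
    for row in table:
        marks = [t for t, used in enumerate(row) if used]
        for i in range(len(marks)):
            for j in range(i + 1, len(marks)):
                forbidden.add(marks[j] - marks[i])
    return forbidden

print(sorted(forbidden_latencies(table)))  # -> [2, 5]
```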
Slide 16: Latency Analysis
- Latency sequence: a sequence of permissible latencies between successive task initiations
- Latency cycle: a latency sequence that repeats the same subsequence indefinitely
- Average latency: the sum of all latencies in a cycle divided by the number of latencies in it
- Constant cycle: a cycle that contains only one latency value
Slide 17: [figure only]
Slide 18: [figure only]
Slide 19: Collision Vectors
- Maximum forbidden latency m ≤ n - 1 (n = number of columns in the reservation table)
- Permissible latency p: 1 ≤ p ≤ m - 1 (p = 1 is ideal)
- Collision vector: an m-bit binary vector displaying the set of permissible and forbidden latencies
- Ci = 1 if latency i causes a collision, Ci = 0 otherwise
- Cm = 1 always (m is the maximum forbidden latency)
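Continuing the example above, a sketch that packs the forbidden-latency set {2, 5} into an m-bit collision vector:

```python
# Sketch: collision vector C = (C_m ... C_1) from a forbidden-latency set.
def collision_vector(forbidden):
    m = max(forbidden)                        # maximum forbidden latency
    return ''.join('1' if i in forbidden else '0'
                   for i in range(m, 0, -1))  # bits C_m down to C_1

print(collision_vector({2, 5}))  # -> '10010'
```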
Slide 20: State Diagrams
- Specify the permissible state transitions among successive initiations
- Initial collision vector: corresponds to the initial state at time 1
- The next state at time t + p is obtained with an m-bit right shift register
- The next state after p shifts is obtained by ORing the initial collision vector with the shifted register contents
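The shift-and-OR rule can be sketched directly on integers, treating bit i-1 of the state as C_i:

```python
# Sketch: next state after initiating a task with permissible latency p.
def next_state(state, p, c0):
    return (state >> p) | c0     # p-bit right shift, then OR with C0

c0 = int('10010', 2)             # initial collision vector from the example
s = next_state(c0, 1, c0)        # initiate the next task after latency 1
print(format(s, '05b'))          # -> '11011'
```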
Slide 21: [figure only]
Slide 22: Greedy Cycles
- Simple cycles: cycles in which each state appears only once
- Some simple cycles are greedy cycles: cycles whose edges are all made with the minimum latencies from their respective starting states
- Their average latencies must be lower than those of the other simple cycles
- The cycle with the minimal average latency (MAL) is chosen
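A sketch that follows the greedy cycle from an initial collision vector by always taking the smallest permissible latency; the average latency it reports upper-bounds the MAL but is not guaranteed to equal it:

```python
# Sketch: trace states until one repeats, picking the minimum latency each time.
def greedy_cycle(c0_bits):
    m = len(c0_bits)
    c0 = int(c0_bits, 2)
    state, seen, latencies = c0, {}, []
    while state not in seen:
        seen[state] = len(latencies)
        # smallest p whose bit C_p is 0; latency m+1 is always permissible
        p = next((p for p in range(1, m + 1)
                  if not (state >> (p - 1)) & 1), m + 1)
        latencies.append(p)
        state = (state >> p) | c0
    cycle = latencies[seen[state]:]          # the repeating part
    return cycle, sum(cycle) / len(cycle)

print(greedy_cycle('10010'))  # -> ([1, 3, 3], 2.33...) for the running example
```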
Slide 23: Bounds on MAL
- MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table
- MAL is lower than or equal to the average latency of any greedy cycle in the state diagram
- The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1 (this is also an upper bound on MAL)
Slide 24: Optimizing the Schedule
- Choosing a greedy cycle is not sufficient for optimality of the MAL; reaching the lower bound is
- Approach the lower bound by modifying the reservation table
- Try to reduce the maximum number of marks in any row
- The modified table must preserve the original function being evaluated
Slide 25: Delay Insertion
- Use noncompute delay stages to increase pipeline performance by achieving a shorter MAL
- The purpose is to modify the reservation table
- This yields a new collision vector
- And results in a modified state diagram
Slide 26: [figure only]
Slide 27: [figure only]
Slide 28: Pipeline Throughput
- Initiation rate: the average number of task initiations per cycle
- If N tasks are initiated within n pipeline cycles, the initiation rate, or pipeline throughput, is N/n
- The scheduling strategy affects performance: the shorter the MAL, the higher the throughput
- Unless the MAL is reduced to 1, the throughput is a fraction (of one task per cycle)
Slide 29: Pipeline Efficiency
- Stage utilization: the percentage of time each stage is used over a long series of task initiations
- The rate accumulated over all stages determines the efficiency
- Higher efficiency implies less idle time and higher throughput
Slide 30: Instruction Execution Phases
- Instruction execution consists of fetch, decode, operand fetch, execute, and write-back phases
- Ideal for overlapped execution on a linear pipeline
- Each phase may require one or more clock cycles
Slide 31: Instruction Pipeline Stages
- Fetch: fetches instructions from the cache
- Decode: reveals the function to perform and identifies the needed resources
- Issue: reserves resources, maintains control interlocks, and reads register operands
- Execute: one or several stages
- Writeback: writes results into the registers
Slide 32: [figure only]
Slide 33: [figure only]
Slide 34: Prefetch Buffers
- Three types of buffers can be used to match the instruction fetch rate to the pipeline consumption rate:
  - Sequential buffer: for in-sequence pipelining
  - Target buffer: holds instructions fetched from a branch target
  - Loop buffer: holds sequential instructions within a loop
- A block of instructions is fetched into a prefetch buffer in one memory access time
Slide 35: [figure only]
Slide 36: Multiple Functional Units
- The bottleneck stage is the one with the maximum number of marks in its row of the reservation table
- Resolve it by using multiple copies of the same stage simultaneously
- Reservation stations for each unit are used to resolve data or resource dependencies
Slide 37: Reservation Stations
- Operands wait in the RS until their data dependencies have been resolved
- Each RS has an ID tag, monitored by a tag unit
- This allows the hardware to resolve conflicts between source and destination registers
- Reservation stations also serve as buffers
Slide 38: [figure only]
Slide 39: Internal Data Forwarding
- Improves throughput further by replacing some memory access operations with register transfer operations
- Store-load forwarding: a load is replaced by a move operation (e.g., Store M, R1 followed by Load R2, M becomes Store M, R1 and Move R2, R1)
- Load-load forwarding: the second load of the same address is replaced with a move operation
- Store-store overwriting: when two stores to the same address have no intervening load, the first store operation can be removed
Slide 40: [figure only]
Slide 41: Hazard Avoidance
- Reads and writes of shared variables by different instructions may lead to different results if they are executed out of order
- Three types of hazards: RAW, WAW, and WAR
- Domain D(I) = input set of instruction I; range R(I) = output set of instruction I
Slide 42: [figure only]
Slide 43: Hazard Conditions
- RAW (flow dependence): R(I) ∩ D(J) ≠ ∅
- WAW (output dependence): R(I) ∩ R(J) ≠ ∅
- WAR (antidependence): D(I) ∩ R(J) ≠ ∅
- These are necessary, but not sufficient, conditions
- Whether a hazard occurs depends on the order in which the two instructions are executed
- A special tag bit used with each operand register indicates whether it is safe or hazard-prone
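A sketch of these checks as set intersections; the register sets in the example are invented:

```python
# Sketch: potential hazards between an earlier instruction I and a later J,
# given their domains (registers read) and ranges (registers written).
def hazards(dom_i, rng_i, dom_j, rng_j):
    found = []
    if rng_i & dom_j:
        found.append('RAW')  # J reads what I writes (flow dependence)
    if rng_i & rng_j:
        found.append('WAW')  # I and J write the same register (output dep.)
    if dom_i & rng_j:
        found.append('WAR')  # J writes what I reads (antidependence)
    return found

# I: Add R2, R0, R1 (reads R0,R1; writes R2); J: Sub R4, R2, R3
print(hazards({'R0', 'R1'}, {'R2'}, {'R2', 'R3'}, {'R4'}))  # -> ['RAW']
```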
Slide 44: Static Scheduling
- Data dependencies create an interlocked relationship between a sequence of instructions
- They can be resolved by a compiler-based static scheduling approach
- The compiler increases the separation between interlocked instructions
- Cheaper to implement and flexible to apply
Slide 45: Static Scheduling Example

Original code (the multiply is held up by the preceding load of its operand):
  Add  R0, R1    / 2 cycles
  Move R1, R5    / 1 cycle
  Load R2, M(a)  / 2 cycles
  Load R3, M(b)  / 2 cycles
  Mult R2, R3    / 3 cycles

Code reordered by the compiler (no delay for the multiply):
  Load R2, M(a)  / 2 cycles
  Load R3, M(b)  / 2 cycles
  Add  R0, R1    / 2 cycles
  Move R1, R5    / 1 cycle
  Mult R2, R3    / 3 cycles
Slide 46: Tomasulo's Algorithm
- A hardware dependence-resolution scheme
- Resolves resource conflicts as well as data dependencies using register tagging
- An issued instruction whose operands are not yet available is forwarded to an RS associated with the functional unit it will use
Slide 47: [figure only]
Slide 48: CDC Scoreboarding
- Dynamic instruction scheduling hardware
- The scoreboard unit keeps track of the registers needed by instructions waiting for the various units
- When all registers have valid data, the scoreboard enables the instruction's execution
- When an instruction finishes, its resources are released
Slide 49: [figure only]
Slide 50: Branching Terms
- Fetching a nonsequential instruction after a branch instruction is called branch taken
- The instruction to be executed after a branch taken is called the branch target
- The number of cycles between a branch taken and its target is called the delay slot (denoted by b)
Slide 51: Effect of Branching
- When a branch taken occurs, all instructions following the branch in the pipeline are drained, losing useful cycles
- p = probability that an instruction is a conditional branch
- q = probability that the branch is taken
- Branch penalty = pqnbτ (b extra cycles per taken branch)
- Teff = [k + (n - 1)]τ + pqnbτ
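A sketch evaluating Teff and the resulting effective throughput; the workload numbers are illustrative:

```python
# Sketch: effective throughput n / Teff, in tasks per clock cycle tau.
def effective_throughput(n, k, p, q, b, tau=1.0):
    t_eff = (k + (n - 1)) * tau + p * q * n * b * tau
    return n / t_eff

# n=100 instructions, k=5 stages, 20% branches, 60% taken, b=3
print(effective_throughput(100, 5, 0.2, 0.6, 3))  # ~0.714 vs ~0.962 with no branching
```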
Slide 52: Branch Prediction
- The branch instruction type or branch history can be used for prediction
- Requires collecting the frequencies and probabilities of branch taken, by branch type, over a large number of traces
- Static prediction (taken or not taken) is wired in; predicting taken usually gives the best performance
- Once wired in, the prediction cannot be changed
Slide 53: Dynamic Branch Strategy
- Uses recent branch history for prediction
- Three classes of strategies:
  - Use only the information found at the decode stage
  - Use a cache to store target addresses, at the stage where the effective address of the branch target is computed
  - Use a cache to store target instructions, at the fetch stage
- Additional hardware is required to track the history
Slide 54: Branch Target Buffer
- Holds recent branch information, including the address of the branch target used
- The address of the branch instruction locates its entry in the BTB
- A BTB entry contains backtracking information to guide the prediction
- The BTB can also store the target instruction(s)
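A minimal sketch of the idea, reducing the BTB to a map from branch address to last-used target; a real BTB is a set-associative hardware table with more fields than this:

```python
# Sketch: predict via the last recorded target; fall through on a miss.
btb: dict[int, int] = {}

def predict(branch_pc: int, fallthrough_pc: int) -> int:
    return btb.get(branch_pc, fallthrough_pc)

def update(branch_pc: int, taken: bool, target_pc: int) -> None:
    if taken:
        btb[branch_pc] = target_pc     # record/refresh the taken target
    else:
        btb.pop(branch_pc, None)       # drop entries for not-taken branches

update(0x4000, True, 0x4800)
print(hex(predict(0x4000, 0x4004)))    # -> 0x4800
```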
Slide 55: Delayed Branches
- Reduce the branch penalty by shortening the delay slot
- A delayed branch of d cycles allows at most d - 1 useful instructions to be executed following the branch taken
- These instructions must be independent of the branch outcome
- NOPs can be used as fillers when no such instructions are found
Slide 56: Delayed Branch Example

Original code:
  I1. Load R1, A
  I2. Dec R3, 1
  I3. BRZero R3, I5
  I4. Add R2, R4
  I5. Sub R5, R6
  I6. Store R5, B

Modified code (I1, independent of the branch outcome, is moved into the delay slot):
  I2. Dec R3, 1
  I3. BRZero R3, I5
  I1. Load R1, A
  I4. Add R2, R4
  I5. Sub R5, R6
  I6. Store R5, B
Slide 57: Pipeline Design Parameters
- k-stage pipeline
- Pipeline cycle = 1 time unit for the scalar base machine (the base cycle)
- Issue rate = 1 for the base machine
- Issue latency = 1 for the base machine
- Simple operation latency = 1 for the base machine
- Instruction-level parallelism (ILP): the maximum number of instructions that can be executed simultaneously
Slide 58: Superscalar Pipeline Structure
- Degree m: issues m instructions concurrently per cycle
- The instruction decoding and execution resources are increased to form m pipelines
- Functional units may be shared by multiple pipelines at some stages
Slide 59: [figure only]
Slide 60: Scheduling Difficulties
- Scheduling is more difficult when instructions are retrieved from the same source
- Goals in scheduling:
  - Avoid pipeline stalling
  - Minimize pipeline idle time
Slide 61: Pipeline Stalling
- Stalling lowers pipeline utilization, more so for superscalar pipelines than for scalar ones
- Caused by data or resource conflicts among instructions in, or about to enter, the pipeline
- Also caused by branching
Slide 62: [figure only]
Slide 63: Multipipeline Scheduling
- In-order vs. out-of-order issue (relative to the original program order)
- In-order vs. out-of-order completion
- In-order is easier to implement, but may not be optimal
- Performance is measured by the total execution time and the utilization rate of the pipeline stages
Slide 64: Superscalar Performance
For N independent instructions on a k-stage pipeline:
- Base scalar machine: T(1,1) = k + N - 1
- Superscalar of degree m: T(m,1) = k + (N - m)/m
- Speedup: S(m,1) = T(1,1)/T(m,1) = m(N + k - 1) / [N + m(k - 1)]
Slide 65: Superpipelined Design
- Degree n: pipeline cycle time = 1/n of the base cycle
- A fixed-point addition that takes one cycle in a base scalar processor takes n short cycles in a superpipelined processor
- Issue rate = 1, issue latency = 1/n, ILP = n
- Requires high-speed clocking
Slide 66: [figure only]
Slide 67: Superpipelined Performance
For N instructions, degree n, k stages:
- T(1,n) = k + (N - 1)/n
- S(1,n) = n(k + N - 1) / (nk + N - 1)
Slide 68: Superpipelined Superscalar
- Degree (m, n): executes m instructions every cycle, with a pipeline cycle equal to 1/n of the base cycle
- Instruction issue latency = 1/n
- ILP = mn instructions
Slide 69: Superpipelined Superscalar Performance
For N independent instructions, degree (m, n):
- T(m,n) = k + (N - m)/(mn)
- S(m,n) = mn(k + N - 1) / [mnk + N - m]
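A sketch evaluating the T and S formulas from the last three slides; setting m = 1 or n = 1 recovers the pure superpipelined or superscalar cases:

```python
# Sketch: speedup of a degree-(m,n) design over the base scalar machine.
def speedup(m, n, k, N):
    t_base = k + N - 1               # T(1,1)
    t_mn = k + (N - m) / (m * n)     # T(m,n)
    return t_base / t_mn

k, N = 5, 120
print(speedup(3, 1, k, N))  # superscalar, degree 3      (~2.82)
print(speedup(1, 2, k, N))  # superpipelined, degree 2   (~1.92)
print(speedup(3, 2, k, N))  # combined, degree (3,2)     (~5.06)
```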
Slide 70: Design Approaches
- Superpipelined: emphasizes temporal parallelism; needs faster transistors; the design must minimize the effects of clock skewing
- Superscalar: depends on spatial parallelism; needs more transistors; a better match for CMOS technology