EE3A1 Computer Hardware and Digital Design Lecture 9 Pipelining
Introduction: Pipelining is a technique to speed up hardware. It is used in ASICs and microprocessors.
Flip-flops: On rising_edge(clk), the value of D at that precise instant is read into memory, and the new value appears at Q a delta later. At all other times, Q holds its old value. A flip-flop is one memory bit.
Registers: On rising_edge(clk), the value of D(7 downto 0) at that precise instant is read into memory, and the new value appears at Q(7 downto 0) a delta later. At all other times, Q holds its old value. A register is a group of memory bits.
A Chain of Registers: Feed in a stream of numbers 8, 3, 7, 4, … The function is a shift register: values shift one stage on each clock cycle.
Before the first rising clock edge: 8 is at the input.
After the first rising clock edge: 3 is at the input; stage 1 holds 8.
After the second rising clock edge: 7 is at the input; stage 1 holds 3, stage 2 holds 8.
After the third rising clock edge: 4 is at the input; stage 1 holds 7, stage 2 holds 3, stage 3 holds 8.
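The shift-register behaviour can be sketched in software. This is a minimal Python model, not hardware code: the function name `clock_edge` and the use of `None` for "contents unknown" are assumptions made for illustration.

```python
def clock_edge(stages, d_in):
    """Apply one rising clock edge to a chain of registers.

    Every register samples its D input at the same instant, so each
    stage takes the value its predecessor held *before* the edge.
    """
    return [d_in] + stages[:-1]


stages = [None, None, None]          # three registers, contents unknown at start
history = []
for value in [8, 3, 7, 4]:           # the input stream from the slides
    stages = clock_edge(stages, value)
    history.append(list(stages))

for edge, contents in enumerate(history, start=1):
    print(f"after edge {edge}: stage contents = {contents}")
```

Each list is ordered stage 1 first, so after the third edge the chain holds 7, 3, 8.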
This is NOT what happens: Feed in a stream of numbers 8, 3, 7, 4, … Before the first rising clock edge, 8 is at the input. After the first rising clock edge, the 8 has NOT raced through more than one stage.
Why not? Stage 1 has a finite (delta) delay. By the time stage 1 outputs the 8, stage 2 is no longer reading its input, so the 8 is blocked between stage 1 and stage 2 until the next rising clock edge.
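The difference between the two behaviours can be sketched in Python. The function names `edge_triggered` and `transparent_ripple` are hypothetical; the second function models the incorrect picture, in which each stage would see its neighbour's new value within the same edge.

```python
def edge_triggered(stages, d_in):
    """Correct model: all D inputs are sampled first, then all Q outputs change."""
    sampled = [d_in] + stages[:-1]   # what each D input sees at the edge
    return sampled                   # Q outputs take these values a delta later


def transparent_ripple(stages, d_in):
    """Incorrect model: each stage is updated in turn and the next stage
    immediately reads the new value, so the input races down the chain."""
    result = list(stages)
    carry = d_in
    for i in range(len(result)):
        result[i] = carry            # stage i takes the incoming value...
        carry = result[i]            # ...and passes it straight on
    return result


start = [0, 0, 0]
correct = edge_triggered(start, 8)       # the 8 reaches stage 1 only
wrong = transparent_ripple(start, 8)     # the 8 reaches every stage at once
print("edge-triggered:", correct)
print("ripple (not what happens):", wrong)
```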
Pipelines: We add some useful logic between the shift register stages.
Before the first rising clock edge: the 1st inputs are at the input.
After the first rising clock edge: the 2nd inputs are at the input; the 1st inputs are in stage 1.
After the second rising clock edge: the 3rd inputs are at the input; the 2nd inputs are in stage 1, the 1st inputs in stage 2.
After the third rising clock edge: the 4th inputs are at the input; the 3rd inputs are in stage 1, the 2nd inputs in stage 2, the 1st inputs in stage 3.
Abbreviated notation: Registers are shown by a dashed line.
Before the first rising clock edge: the 1st inputs are at the input.
After the first rising clock edge: 2nd inputs, 1st inputs.
After the second rising clock edge: 3rd inputs, 2nd inputs, 1st inputs.
After the third rising clock edge: 4th inputs, 3rd inputs, 2nd inputs, 1st inputs.
Inputs proceed one stage per clock cycle.
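A pipeline with logic between the registers can be sketched the same way. This is only an illustrative Python model: the three stage functions are hypothetical (the slides do not say what the logic does), and `None` stands for "no valid data yet".

```python
def step(registers, stage_logic, new_input):
    """One rising clock edge of a pipeline.

    registers[i] holds the output of stage i. stage_logic[i] is the
    combinational function in front of register i.
    """
    sources = [new_input] + registers[:-1]      # values present before the edge
    return [None if src is None else logic(src)
            for logic, src in zip(stage_logic, sources)]


# Hypothetical stage functions, chosen only for illustration.
stage_logic = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

registers = [None, None, None]
outputs = []
for item in [10, 20, 30, 40, None, None]:       # None = no new input
    registers = step(registers, stage_logic, item)
    outputs.append(registers[-1])               # what the last register shows

print(outputs)
```

The first result appears after three edges, and then one result emerges per clock cycle, which is the "one stage per clock cycle" behaviour described above.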
Registered Logic: The register gives a clean output. The output is valid one cycle after the corresponding inputs. The worst-case output settling time for this adder is 1.9 ns and the clock period is 4 ns, so we could use a faster clock. (Figure labels: c is glitchy; c_reg is clean.)
Measures of Speed: Latency is the time between the inputs appearing and the corresponding outputs appearing. Here the latency is 1 clock cycle: 4 ns. Throughput is the rate at which we put new inputs into our circuit. Here the throughput is 1 / 4 ns = 250 MHz.
Measures of Speed: Latency measures delay (in seconds): high latency means slow. Throughput measures rate (in Hz): high throughput means fast. For this circuit, latency is 4 ns and throughput is 250 MHz.
Turning up the speed: Now use a 2 ns clock. No problem, but with a 1.9 ns worst-case delay, the output for the worst-case input has only just settled before the clock edge.
Turning up the speed too high: Now use a 1.8 ns clock. The 1.9 ns worst-case delay is longer than the clock period, so the answer is sometimes wrong. Our adder does not add: unacceptable.
Timing Diagrams: This is the notation used on data sheets and in textbooks. With a 1.9 ns worst-case delay, c is untrustworthy until 1.9 ns after a transition, and during that time c is shown as X. Clock period: 2 ns.
Timing Diagrams: With a 1.9 ns worst-case delay and a 1.8 ns clock, c never becomes trustworthy. This diagram needs to be interpreted with care: if you inspect the output of a real device, it looks mostly normal, with just a few wrong results.
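The three clock choices discussed so far (4 ns, 2 ns, 1.8 ns) can be checked with a small Python sketch. The helper names are hypothetical, and the model is deliberately simplified: it compares the clock period with the worst-case logic delay only and ignores register setup time and clock skew.

```python
ADDER_DELAY_NS = 1.9   # worst-case settling time of the adder, from the slides


def meets_timing(clock_period_ns, worst_case_delay_ns):
    """True if the logic is guaranteed to settle before the next clock edge.

    Simplifying assumption: register setup time and clock skew are ignored.
    """
    return clock_period_ns > worst_case_delay_ns


def slack_ns(clock_period_ns, worst_case_delay_ns):
    """Spare time between the logic settling and the next clock edge."""
    return clock_period_ns - worst_case_delay_ns


results = {period: meets_timing(period, ADDER_DELAY_NS)
           for period in (4.0, 2.0, 1.8)}

for period, ok in results.items():
    margin = round(slack_ns(period, ADDER_DELAY_NS), 3)
    print(f"{period} ns clock: slack {margin} ns, "
          f"{'OK' if ok else 'sometimes wrong'}")
```

A negative slack corresponds to the case where c never becomes trustworthy.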
Datapaths: In a datapath, data flows in one end, is modified on the way, and flows out the other end. A simple datapath is an adder tree: g <= a + b + d + e
Speed of a combinational datapath: Combinational means no memory and no registers. The worst-case settling time of each adder is 1.9 ns. Settling time from a to g is (time a to c) plus (time c to g) = 1.9 ns + 1.9 ns = 3.8 ns. Overall settling time is the slowest of (a to g, b to g, d to g, e to g) = 3.8 ns.
Speed of a Registered datapath: Outputs appear 1 cycle after the corresponding inputs. The clock edge must come after the circuit has settled. The first adder level settles 1.9 ns after a or b change, and the second settles 1.9 ns after c changes, so the clock edge must be applied after at least 3.8 ns. The clock period must be > 3.8 ns; let's use 4 ns. Latency = 4 ns. Throughput = 1 / 4 ns = 250 MHz.
Speed of a Pipelined datapath: We can do better. c settles only 1.9 ns after a or b change, so catch this value in a register that holds it stable for a cycle. The second adder then settles 1.9 ns after the registered c changes. We can use a 2 ns clock. How does this help?
Comparison: Sequence of inputs. [Figure: the same sequence of inputs fed through the datapath with a 4 ns clock and with a 2 ns clock, shown at Time = 0, 2 ns, 4 ns, 6 ns, 8 ns and 10 ns.]
Comparison: Sequence of inputs
4 ns clock (1-stage pipeline): Each item takes 4 ns to traverse the datapath. Latency = 4 ns. Insert a new item every 4 ns. Throughput = 1 / 4 ns = 250 MHz. Output valid 1 cycle after inputs.
2 ns clock (2-stage pipeline): Each item takes 4 ns to traverse the datapath. Latency = 4 ns. Insert a new item every 2 ns. Throughput = 1 / 2 ns = 500 MHz. Output valid 2 cycles after inputs.
Waveforms for 1-stage Pipeline: Outputs valid 1 cycle after inputs (4 ns clock).
Waveforms for 2-stage Pipeline: Outputs valid 2 cycles after inputs. Compared with the 1-stage pipeline, latency is the same and throughput is double.
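The two adder-tree variants can be compared with a cycle-by-cycle Python sketch. This is an illustrative model, not the lecture's VHDL. The register name `f_reg` (holding d + e) is an assumption, since the slides name only c, and `None` marks an output that is not yet valid.

```python
def one_stage(inputs):
    """Registered adder tree: g_reg captures a + b + d + e on each edge."""
    g_reg = None
    trace = []
    for a, b, d, e in inputs:
        trace.append(g_reg)          # value visible during this cycle
        g_reg = a + b + d + e        # captured at the edge ending the cycle
    trace.append(g_reg)
    return trace


def two_stage(inputs):
    """Pipelined adder tree: c_reg and f_reg split the sum into two stages."""
    c_reg = f_reg = g_reg = None
    trace = []
    for a, b, d, e in inputs + [(0, 0, 0, 0)]:     # one extra cycle to flush
        trace.append(g_reg)
        # All registers sample simultaneously, so compute next values first.
        next_g = None if c_reg is None else c_reg + f_reg
        c_reg, f_reg, g_reg = a + b, d + e, next_g
    trace.append(g_reg)
    return trace


inputs = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
slow = one_stage(inputs)             # one entry per 4 ns cycle
fast = two_stage(inputs)             # one entry per 2 ns cycle

# Time at which each valid result first becomes visible.
slow_times = [i * 4 for i, g in enumerate(slow) if g is not None]
fast_times = [i * 2 for i, g in enumerate(fast) if g is not None]
print("4 ns clock:", slow, "results at", slow_times, "ns")
print("2 ns clock:", fast, "results at", fast_times, "ns")
```

In this model the first result appears after 4 ns in both cases, but later results arrive every 2 ns instead of every 4 ns with the pipelined version.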
Speed of n-Stage Datapath: n-stage datapath with no pipelining.
Speed of 1-Stage Pipeline: Register the input and the output to get a 1-stage pipeline. A 1-stage pipeline is not normally regarded as a “true” pipeline.
Speed of n-Stage Pipeline: For an n-stage pipelined datapath, the clock rate is n times higher, the throughput is higher by a factor of n, and the latency is unchanged.
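These relationships can be written out as a short Python sketch. It assumes every stage has the same worst-case delay and that the registers themselves add no delay; the function and key names are hypothetical.

```python
def unpipelined(n_stages, stage_delay_ns):
    """n stages of logic between one pair of registers."""
    period = n_stages * stage_delay_ns          # all stages must settle in one cycle
    return {"clock_period_ns": period,
            "latency_ns": period,               # one (long) clock cycle
            "throughput_mhz": 1000.0 / period}


def pipelined(n_stages, stage_delay_ns):
    """A register after every stage."""
    period = stage_delay_ns                     # only one stage must settle per cycle
    return {"clock_period_ns": period,
            "latency_ns": n_stages * period,    # n (short) clock cycles
            "throughput_mhz": 1000.0 / period}


# The adder tree from the earlier slides: 2 stages, 2 ns allowed per stage.
before = unpipelined(2, 2)
after = pipelined(2, 2)
print(before)
print(after)
```

With these numbers the sketch gives 250 MHz before and 500 MHz after pipelining, with 4 ns latency in both cases, matching the comparison slides.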
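Data Rate on an n-Stage Pipeline: Suppose we have m data items to process. The time taken to process m items is the time to fill the pipeline (n clock periods) plus one clock period for each item that follows.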
Numerical example: 10,000 items to process on a 10-stage pipeline with a clock rate of 500 MHz (i.e. a clock period of 2 ns). Pipeline latency is 10 stages × 2 ns clock period = 20 ns. It takes 20 ns to fill the pipeline. Then the answers emerge at a rate of one per clock cycle. Total time is therefore about 20 µs: the 20 ns fill time plus roughly one 2 ns cycle per remaining item.
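The arithmetic can be checked with a short Python sketch. The counting convention is an assumption: the first answer emerges after n cycles and each of the remaining m - 1 answers one cycle later, giving (n + m - 1) clock periods. Counting a full cycle for every item after the fill time would give (n + m) periods instead; the difference is one clock period. The unpipelined figure assumes each item occupies the whole datapath for 20 ns.

```python
def total_time_ns(n_stages, m_items, clock_period_ns):
    """Time until the last of m items emerges from an n-stage pipeline.

    Assumed convention: first result after n cycles, then one per cycle.
    """
    return (n_stages + m_items - 1) * clock_period_ns


pipelined_ns = total_time_ns(10, 10_000, 2)

# Without pipelining, each item would occupy the datapath for 10 x 2 ns.
unpipelined_ns = 10_000 * 10 * 2

speedup = unpipelined_ns / pipelined_ns
print(f"pipelined:   {pipelined_ns} ns")
print(f"unpipelined: {unpipelined_ns} ns")
print(f"speed-up:    {speedup:.2f}x")
```

For large m the fill time becomes negligible and the speed-up approaches the number of stages.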
Summary: Pipelining places registers at intermediate points in the datapath. This means that new inputs can be inserted before previous inputs have emerged. An n-stage pipeline has n times the throughput of the non-pipelined datapath, while the latency is unchanged.