ECE 352 Digital System Fundamentals

Pipelining

In this presentation, we will introduce the concept of pipelining. Pipelining is a technique we often use to improve the performance of data processing machines. But before we look at what pipelining means in digital circuits, let’s first look at a more familiar example where you are probably already using this idea.

Laundry Pipelining Example
You could wait for a load to completely finish before you start the next: four loads in 8 time units.

[Figure: unpipelined schedule over time units 1 to 8. Green: done after 2. Purple: done after 4. Orange: done after 6. Blue: done after 8.]

Suppose you needed to do four loads of laundry. Let’s assume the washer and dryer each take the same amount of time. You could completely finish each load before starting the next one. Of course, this isn’t very efficient. The washer and dryer are both idle half of the time. As a result, it will take eight time units to finish the four loads.

2-stage pipeline: start a new washer load at the same time you put the first load in the dryer. Work on two loads at once, at different stages of completion: four loads in 5 time units.

[Figure: pipelined schedule over time units 1 to 5. Green: done after 2. Purple: done after 3. Orange: done after 4. Blue: done after 5.]

What you probably do, of course, is start a new load in the washer as soon as you empty it. This means that you are simultaneously washing one load of laundry while drying a different one. After the first load moves to the dryer, both the washer and dryer are always being used, until the last load is in the dryer. The result is that we can get the laundry done a lot quicker.
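The two schedules can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `completion_times` and its parameters are made up for this example, and it assumes every stage (washer, dryer) takes exactly one time unit.

```python
def completion_times(num_loads, num_stages=2, pipelined=True):
    """Return the time unit at which each load finishes.

    Assumes every stage (washer, dryer) takes exactly one time unit.
    """
    if pipelined:
        # Load i (0-based) enters the first stage at time i and
        # leaves the last stage num_stages time units later.
        return [i + num_stages for i in range(num_loads)]
    # Unpipelined: each load waits until the previous one is completely done.
    return [(i + 1) * num_stages for i in range(num_loads)]


print(completion_times(4, pipelined=False))  # [2, 4, 6, 8]
print(completion_times(4, pipelined=True))   # [2, 3, 4, 5]
```

The printed lists match the "done after" times for the green, purple, orange, and blue loads in the two schedules.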

What does this mean?

If you wash your favorite shirt first, it is not done sooner. But you get approximately twice the amount of laundry done in the same time (if doing many loads).

[Figure: the pipelined schedule again, four loads in 5 time units. Green: done after 2. Purple: done after 3. Orange: done after 4. Blue: done after 5.]

Let’s think about what is happening in these examples. The time to wash and dry any single piece of clothing has not changed, but if we are doing multiple loads of laundry, we’ll get more done in a given amount of time. These two ideas are fundamental concepts in pipelining, so we have special names for them.
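To see why the gain is only "approximately" twice, and only for many loads, here is a small sketch. The helper `total_time` is a hypothetical name, and it again assumes one time unit per stage.

```python
def total_time(num_loads, num_stages=2, pipelined=True):
    """Time units needed to finish all loads (one time unit per stage)."""
    if pipelined:
        # The first load takes num_stages units; each later load adds one.
        return num_stages + (num_loads - 1)
    return num_stages * num_loads


for n in (1, 4, 100):
    speedup = total_time(n, pipelined=False) / total_time(n, pipelined=True)
    print(n, round(speedup, 2))
# 1 1.0
# 4 1.6
# 100 1.98
```

With a single load there is no benefit at all, and the speedup approaches 2 only as the number of loads grows.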

Terminology

Latency: the length of time it takes for a value ready at the input to propagate to a “result” at the output.
Throughput: the rate at which “results” are produced.

Laundry example:
Unpipelined: latency = 2 time units, throughput = 1 result per 2 time units
Pipelined: latency = 2 time units, throughput = 1 result per time unit
Note: throughput ignores startup latency (2 time units until the first result is produced).

Latency is the length of time it takes for an input to propagate to the output. In the laundry example, an input is a dirty sock going in the washer, and the output is a clean and dry sock coming out of the dryer. In a digital circuit, latency is the time from when a signal arrives at the input of the circuit until the effects of that signal have finished propagating to the circuit’s output.

Throughput is the rate at which results are produced, or the number of operations per unit time. For the laundry example, throughput might be measured in terms of the number of loads of laundry completed per hour. Note that throughput is based on the steady-state output rate after the first output has been produced (in other words, after the pipeline is full).

So for our laundry example, we would say the unpipelined version has a latency of 2 time units and a throughput of one result per 2 time units. The pipelined version still has a latency of 2 time units, but its throughput is double that of the unpipelined version.
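As a rough sketch of how these two quantities could be read off a schedule, the hypothetical helper below takes the start and finish time of each load. Latency is measured on the first load, and throughput on the gap between the last two finishes, which is the steady-state rate.

```python
def latency_and_throughput(start_times, finish_times):
    """Return (latency, steady-state throughput) for a schedule.

    Assumes at least two results, listed in the order they are produced.
    """
    latency = finish_times[0] - start_times[0]
    # Steady-state rate: one result per gap between consecutive finishes.
    gap = finish_times[-1] - finish_times[-2]
    return latency, 1 / gap


# Laundry schedules from the example (times in time units).
print(latency_and_throughput([0, 2, 4, 6], [2, 4, 6, 8]))  # (2, 0.5)
print(latency_and_throughput([0, 1, 2, 3], [2, 3, 4, 5]))  # (2, 1.0)
```

Both schedules have the same latency, while the pipelined one delivers results at twice the rate.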

Pipelining Digital Circuits
The original circuit has lower throughput than desired. What can we do to increase it? Increase throughput by still producing one result per clock cycle, but with a shorter t_min. How can we decrease t_min and still accomplish the same amount of work?

[Figure: a block of combinational logic between registers.]

Now let’s see how these ideas apply to digital machines. Let’s assume that between two registers in a machine, we have combinational logic that performs some operation. We want to have higher throughput, but throughput is limited by the maximum clock frequency. To clock the circuit faster we need to somehow decrease the minimum clock period.

Pipelining Digital Circuits
Insert registers to subdivide long-latency combinational blocks into two (or more) stages. The new circuit has a shorter critical path. The clock rate of the modified circuit is limited by the longest combinational path between registers, so we want to subdivide as evenly as possible (balance the stages).

[Figure: Registers, Logic Stage 1, Registers, Logic Stage 2, Registers.]

Well, we can insert registers to break up the critical path, creating two or more pipeline stages. In this example, we’ve created two of them. The longest path between any two flip-flops is no longer all the way through the big block of combinational logic. Instead it will be through one of the two stages on either side of the pipeline register in the middle. Compared to the original circuit, the new t_min is smaller, so the new f_max is higher. If the stages have very different delays, we’re still limited by the longest one. So we try to make the stage delays as similar as possible to minimize t_min.

Pipelining Effects

The original circuit produces 1 result per cycle. Pipelined with 2 evenly balanced stages, it still produces 1 result per cycle.

[Figure: the ORIGINAL and PIPELINED circuits, drawn as registers (Reg) and logic blocks.]

t_comb,pipe = t_comb,orig / 2
t_min,pipe = t_pd + t_comb,pipe + t_s
           = t_pd + (t_comb,orig / 2) + t_s
t_min,pipe > t_min,orig / 2

The pipelined circuit can be clocked faster, but not 2× faster!

Let’s compare the pipelined circuit to the original one. The original circuit produced one result per clock cycle. The pipelined circuit also produces one result per cycle, after the first result. But now we can use a faster clock. Let’s assume that we split the delay of the original circuit evenly among the two pipeline stages. This means that the pipelined circuit’s critical path delay is one half of the original circuit’s critical path delay. We can calculate the minimum clock period for the pipelined machine. But note that only the combinational delay was reduced. So although the pipelined version’s t_min is smaller, it is still more than half of the original t_min because the flip-flop propagation delay and setup time were not changed by pipelining.
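The inequality can be checked by writing out both clock periods and subtracting. This short derivation assumes, as the slide does, two perfectly balanced stages and unchanged flip-flop parameters:

```latex
\begin{align*}
t_{min,orig} &= t_{pd} + t_{comb,orig} + t_{s} \\
t_{min,pipe} &= t_{pd} + \frac{t_{comb,orig}}{2} + t_{s} \\
t_{min,pipe} - \frac{t_{min,orig}}{2}
  &= \left(t_{pd} + \frac{t_{comb,orig}}{2} + t_{s}\right)
   - \frac{t_{pd} + t_{comb,orig} + t_{s}}{2}
   = \frac{t_{pd} + t_{s}}{2} > 0
\end{align*}
```

The leftover term is half of the flip-flop overhead, which is why the speedup from two stages always falls short of 2×.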

Pipelining Effects

Throughput is increased! We produce a result once per cycle, and f_max,pipe is higher than f_max,orig.
Latency is increased! There are N stages, so latency is N cycles, and t_min,pipe is more than t_min,orig / N.
Pipelining is only useful if we can take advantage of the throughput increase and can tolerate the latency increase. We need to be processing a sequence of data.
There are diminishing returns as pipeline depth (N) increases.

[Figure: the ORIGINAL and PIPELINED circuits, drawn as registers (Reg) and logic blocks.]

So what does this mean for the throughput and latency of the circuit? Pipelining breaks up the critical path, increasing f_max and therefore the circuit throughput. But it also increases latency. It increases the number of cycles required to produce any one output, but cannot reduce t_min enough to compensate. This means that pipelining is only useful if we can make use of the increased throughput and can tolerate the increased latency. Usually it works best for logic that processes long continuous streams of data.

Can we always improve throughput by adding more pipeline stages? No. Remember, pipelining only improves the t_comb portion of the t_min calculation. As we add more and more pipeline stages, each stage of logic becomes very small, and t_pd and t_s become responsible for the majority of the minimum clock period.
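The diminishing returns can be made concrete with a small sketch. The delay values below (22 ns of combinational logic, t_pd = 4 ns, t_s = 1 ns) are assumptions chosen for illustration, and the model assumes the logic can always be split into N perfectly balanced stages, which real circuits rarely allow.

```python
def t_min_ns(num_stages, t_comb_ns, t_pd_ns, t_s_ns):
    """Minimum clock period with the logic split into equal stages."""
    return t_pd_ns + t_comb_ns / num_stages + t_s_ns


# Assumed example delays in nanoseconds (illustrative only).
T_COMB, T_PD, T_S = 22.0, 4.0, 1.0

for n in (1, 2, 4, 8, 16):
    period = t_min_ns(n, T_COMB, T_PD, T_S)
    f_max_mhz = 1000 / period      # 1 / ns expressed in MHz
    latency = n * period           # N cycles at the minimum period
    print(f"N={n:2d}  t_min={period:6.3f} ns  "
          f"f_max={f_max_mhz:6.1f} MHz  latency={latency:6.1f} ns")
```

However many stages are added, t_min in this model never drops below t_pd + t_s = 5 ns, while latency keeps growing with N.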

Add Four Values: Non-Pipelined
For these delay values: t_s = 1 ns, t_pd = 4 ns, t_ADD1 = 10 ns, t_ADD2 = 12 ns

[Figure: adder tree. Registers A, B, C, and D feed two adders (t_ADD1), whose sums feed a final adder (t_ADD2) that drives output register Y.]

Calculate t_min. All paths in this circuit have the same delay.
t_min = t_pd + t_ADD1 + t_ADD2 + t_s = 4 + 10 + 12 + 1 = 27 ns

Calculate latency: the time it takes for input values that are ready in their registers to propagate to output Y.
min latency = 1 cycle × t_p = 1 × t_min = 27 ns

Throughput = 1 result per cycle
Max = 1 result / 27 ns = 37 M results / s

Let’s apply pipelining to a more concrete example: a tree of adders that computes the sum of four values. We’ll use the delay values shown here for our calculations. Note that the delay of the rightmost adder is slightly higher: this adder is one bit wider than the others so that we don’t have to worry about overflow.

First we can calculate t_min. All combinational paths through the circuit have the same delay: t_ADD1 plus t_ADD2. The minimum clock period is thus equal to 27 ns. Next we can calculate latency, which is the time it takes for the sum to appear at Y after the inputs have been loaded into registers A through D. That requires a single clock cycle. The minimum possible latency is the minimum possible clock period, which is 27 ns. The throughput of the circuit is one result per cycle. Assuming we run the circuit as fast as possible, this throughput is equal to 1/t_min, which in this case is 37 million results per second.

Add Four Values: Pipelined
For these delay values: t_s = 1 ns, t_pd = 4 ns, t_ADD1 = 10 ns, t_ADD2 = 12 ns

[Figure: the adder tree with pipeline registers inserted between the first-level adders (t_ADD1) and the final adder (t_ADD2).]

Calculate t_min, based on the longest path. t_comb is the longest of the register-to-register paths.
t_min = t_pd + max(t_ADD1, t_ADD2) + t_s = 4 + max(10, 12) + 1 = 17 ns

Calculate latency. Same idea, but remember that the length of each pipeline stage is dictated by the same clock!
min latency = 2 cycles × t_p = 2 × t_min = 34 ns

Throughput = 1 result per cycle
Max = 1 result / 17 ns = 59 M results / s

To pipeline our adder circuit, we insert registers between the adder stages. This cuts the critical path delay nearly in half. We now have two shorter paths instead of one long one. If the pipeline stages are not equally balanced, we need to use the longer of the two delays to compute t_min, which, for this circuit, is now 17 ns. The latency will now be two clock cycles due to the added registers, and the minimum latency is now 2 times the new t_min, which is 34 ns. Throughput is still one result per cycle. If we run the circuit as fast as possible, the throughput will be 59 million results per second.
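A cycle-by-cycle software model can make the two-cycle latency and the one-result-per-cycle throughput visible. This is a behavioral sketch only, not a hardware description: the function name `pipelined_add4` and the cycle numbering (cycle 0 is the clock edge that loads the first inputs) are assumptions made for this example.

```python
def pipelined_add4(inputs):
    """Simulate the two-stage adder tree one clock edge at a time.

    inputs: list of (a, b, c, d) tuples, one new tuple loaded per cycle.
    Returns (cycle, y) pairs, where cycle is the clock edge at which the
    sum is captured in output register Y.
    """
    input_regs = None  # registers A, B, C, D
    pipe_regs = None   # pipeline registers holding (a + b, c + d)
    results = []
    # Two extra edges with no new input drain the pipeline at the end.
    for cycle, new_inputs in enumerate(list(inputs) + [None, None]):
        # All registers update on the same clock edge, so compute every
        # next value from the current register contents first.
        next_y = None if pipe_regs is None else pipe_regs[0] + pipe_regs[1]
        next_pipe = None
        if input_regs is not None:
            a, b, c, d = input_regs
            next_pipe = (a + b, c + d)
        if next_y is not None:
            results.append((cycle, next_y))
        pipe_regs = next_pipe
        input_regs = new_inputs
    return results


print(pipelined_add4([(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]))
# [(2, 10), (3, 26), (4, 42)]
```

The first sum is captured two edges after its inputs were loaded, and after that a new sum arrives on every edge.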

Comparison

For these delay values: t_s = 1 ns, t_pd = 4 ns, t_ADD1 = 10 ns, t_ADD2 = 12 ns

                   Non-Pipelined          Pipelined
t_min              27 ns                  17 ns
Max throughput     1 result / cycle       1 result / cycle
                   = 37 M results / s     = 59 M results / s
Minimum latency    1 cycle = 1 × t_min    2 cycles = 2 × t_min
                   = 27 ns                = 34 ns

[Figure: the non-pipelined and pipelined adder trees side by side.]

Comparing the two circuits, we see that pipelining reduced t_min. The throughput is one result per cycle in both circuits, which means the pipelined circuit, with its smaller t_min, has a greater maximum throughput. However, the pipelined circuit’s latency increased from one clock period to two clock periods, so even though the pipelined circuit has a smaller t_min, it has a longer minimum latency than the original circuit. Note that the pipelined circuit is also larger, since it has two additional registers.
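The comparison numbers can be recomputed from the four delay values. In this sketch the helper `summarize` is a hypothetical name; it takes the combinational delay of each pipeline stage and applies the t_min formula used in the slides.

```python
# Delay values from the example, in nanoseconds.
T_S, T_PD, T_ADD1, T_ADD2 = 1, 4, 10, 12


def summarize(stage_delays_ns):
    """Timing figures for a circuit with the given per-stage logic delays."""
    t_min = T_PD + max(stage_delays_ns) + T_S
    return {
        "t_min_ns": t_min,
        "latency_ns": len(stage_delays_ns) * t_min,
        # 1 result per t_min nanoseconds, in millions of results per second.
        "throughput_m_per_s": 1000 / t_min,
    }


non_pipelined = summarize([T_ADD1 + T_ADD2])  # one stage: both adders in series
pipelined = summarize([T_ADD1, T_ADD2])       # two stages: one adder level each

for name, s in (("non-pipelined", non_pipelined), ("pipelined", pipelined)):
    print(name, s["t_min_ns"], s["latency_ns"], round(s["throughput_m_per_s"]))
# non-pipelined 27 27 37
# pipelined 17 34 59
```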

Pipelining Summary

Pipelining is a technique that can increase frequency and throughput at the expense of latency and area.
If adding pipeline stages, we need to evaluate:
Is latency, throughput, or area most important for how that particular circuit will be used?
Where should pipeline registers be added? Clock speed depends on the longest path.
Clock speed is limited by flip-flop t_s and t_pd (diminishing returns). There are tricks we can use to mitigate this, but they are beyond the scope of the class.

Pipelining illustrates common tradeoffs in digital circuit design. Often, to improve one characteristic, such as throughput, we need to pay a price in one or more other characteristics, such as circuit area and latency. In fact, not all circuits should be pipelined. For example, if a circuit must react quickly to something that happens infrequently, then we may care more about latency than throughput, and adding a lot of pipeline stages is probably a bad idea.

When we do pipeline a circuit, it is important to balance the stage delays as much as possible. This usually involves some logic reorganization or duplication so we can split it evenly, which often increases circuit size even more than the added pipeline registers already do. Finally, we also need to remember that there is a limit to how much pipelining can reduce t_min, since in general, t_min cannot be shorter than the flip-flop delays. However, used wisely, pipelining is an important technique for improving circuit performance.

This concludes our video on pipelining.