ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing.


Techniques for improving performance
Performance can be improved by exploiting parallelism in two ways:
–Pipelining
  Pipeline latches are used to reduce the critical path delay.
  The shorter critical path can be exploited either to increase the clock speed or sample speed, or to reduce power consumption at the same speed.
–Parallel processing
  Multiple outputs are computed in parallel in one clock period using parallel hardware.
  The effective sampling speed increases with the level of parallelism.
  It can also be used to reduce power consumption.

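Example of a 3-tap FIR filter
$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$
[Figure: direct-form 3-tap FIR filter with input x(n), two delay elements (D) producing x(n-1) and x(n-2), multipliers a, b, c, and output y(n)]
Let $T_M$ be the delay of a multiplier and $T_A$ the delay of an adder. The critical path passes through one multiplier and two adders, so the sampling period must satisfy $T_{sample} \ge T_M + 2T_A$.
How can we improve the performance?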

Pipelining of FIR digital filter
The critical path is shortened by adding latches.
[Figure: original filter, critical path = $T_M + 2T_A$; pipelined filter with latches on the cutset and internal nodes 1, 2, 3, critical path = $T_M + T_A$]
Schedule of events in the pipeline:

Clock  Input  Node 1         Node 2         Node 3   Output
0      x(0)   ax(0)+bx(-1)   -              -        -
1      x(1)   ax(1)+bx(0)    ax(0)+bx(-1)   cx(-2)   y(0)
2      x(2)   ax(2)+bx(1)    ax(1)+bx(0)    cx(-1)   y(1)
3      x(3)   ax(3)+bx(2)    ax(2)+bx(1)    cx(0)    y(2)
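The schedule above can be reproduced with a small clock-by-clock model. This is only a sketch: the function names `fir_direct` and `fir_pipelined` are made up for illustration, samples before n = 0 are assumed to be zero, and the latches are modelled as plain variables updated once per loop iteration.

```python
def fir_direct(x, a, b, c):
    """Reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def at(n):
        return x[n] if n >= 0 else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_pipelined(x, a, b, c):
    """Clock-by-clock model with pipeline latches at nodes 2 and 3."""
    x1 = x2 = 0              # input delay line: x(n-1) and x(n-2)
    node2 = node3 = None     # pipeline latches, empty before the first clock edge
    outputs = []
    for xn in x:
        # The output adder only sees values latched on the previous clock.
        if node2 is not None:
            outputs.append(node2 + node3)
        # Combinational logic in front of the cutset (multiplier + one adder).
        node1 = a * xn + b * x1
        # Clock edge: update the pipeline latches and the input delay line.
        node2 = node1
        node3 = c * x2
        x2, x1 = x1, xn
    return outputs


x = [1, 2, 3, 4]
print(fir_direct(x, 2, 3, 5))     # y(0) .. y(3)
print(fir_pipelined(x, 2, 3, 5))  # y(0) .. y(2): same values, one clock later
```

The pipelined model produces the same output sequence as the direct form, but each y(n) appears one clock later, which is the latency penalty of pipelining.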

Pipelining properties
M-level pipelining needs M-1 more delay elements in any path from input to output.
The increase in speed comes with the following penalties:
–Increase in system latency
–Increase in the number of latches
Pipelining latches can only be placed across a feed-forward cutset of the graph (signal flow graph/DFG).

Cutset pipelining
Cutset: a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint.
Feed-forward cutset: a cutset in which the data move in the forward direction on all the edges of the cutset, e.g. the dotted line in the previous slide.
We can place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm.
Data movement between the two disjoint sub-graphs occurs only on the feed-forward cutset, so delaying or advancing the data movement along all edges of the cutset by the same amount of time does not change the behavior.
[Figure: sub-graphs SG1 and SG2 separated by a cutset, with a delay D1 on the cutset edge]

Feed-forward cutset
[Figure: three versions of a graph with nodes A1 to A6. The original graph; a version with delays on only some cutset edges (not a valid pipelining); a version with delays on all edges of the cutset]
Delays must be placed on all edges in the cutset.
With the valid cutset, the critical path is reduced to 2.

Data-broadcast structures
The critical path of the original 3-tap FIR filter can be reduced without introducing pipelining latches by transposing the structure.
Transposition: reversing the direction of all the edges in a given SFG and interchanging the input and output ports preserves the functionality of the system.
[Figure: SFG of the FIR filter with input x(n), delays $z^{-1}$, gains a, b, c and output y(n); transposed SFG of the FIR filter with input and output interchanged]

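Data-broadcast structures
The data-broadcast structure is based on the transposed form: data are not stored but are broadcast to all the multipliers simultaneously.
[Figure: data-broadcast 3-tap FIR filter with input x(n) feeding multipliers c, b, a, delay elements (D) between the adders, and output y(n)]
Critical path delay = $T_M + T_A$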
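A short behavioral sketch of the data-broadcast form follows. The function name `fir_broadcast` and the register names `s1`, `s2` are illustrative only, and the registers are assumed to start at zero. Each register is fed by exactly one multiplier and one adder, which is why the critical path is one multiplication plus one addition.

```python
def fir_broadcast(x, a, b, c):
    """Transposed (data-broadcast) 3-tap FIR: x(n) feeds all multipliers at once."""
    s1 = s2 = 0                  # delay registers between the adders, assumed reset to 0
    y = []
    for xn in x:
        y.append(a * xn + s1)    # output adder: one multiply + one add
        s1_next = b * xn + s2    # middle tap:  one multiply + one add
        s2_next = c * xn         # last tap:    one multiply
        s1, s2 = s1_next, s2_next
    return y


# Same result as y(n) = a*x(n) + b*x(n-1) + c*x(n-2) with zero initial state.
print(fir_broadcast([1, 2, 3, 4], 2, 3, 5))
```

Unlike the pipelined version, no extra latency is introduced: y(n) is available in the same clock as x(n).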

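Fine-grain pipelining
The functional units themselves are broken down further by pipelining to increase performance, e.g. each multiplier is split into two smaller units m1 and m2.
[Figure: data-broadcast FIR filter with each multiplier split into m1 (delay 6) and m2 (delay 4) separated by a latch; adder delay 2]
Critical path delay = $T_{M2} + T_A = 4 + 2 = 6$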

Parallel Processing
Parallel processing and pipelining techniques are duals of each other.
–Both exploit the concurrency available in the computation.
Parallel processing: multiple outputs are computed in parallel using duplicate hardware.

A Parallel FIR System
E.g. a 3-tap FIR filter as a single-input single-output (SISO) system:
$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$
A parallel system with 3 inputs per clock cycle, i.e. level of parallel processing L = 3:
–$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$
–$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$
–$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$
[Figure: sequential system, a SISO block with input x(n) and output y(n); 3-parallel system, a MIMO block with inputs x(3k), x(3k+1), x(3k+2) and outputs y(3k), y(3k+1), y(3k+2)]

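A Parallel FIR System
[Figure: 3-parallel FIR filter with inputs x(3k), x(3k+1), x(3k+2), two delay elements (D), and three sets of multipliers a, b, c producing y(3k), y(3k+1), y(3k+2). Panel labels: parallel system; pipelined system]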
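The three block equations of the L = 3 system can be checked with a minimal sketch. The function name `fir_3_parallel` is hypothetical, each loop iteration stands for one clock cycle of the MIMO system, and samples before n = 0 are assumed to be zero.

```python
def fir_3_parallel(x, a, b, c):
    """3-parallel FIR: consumes x(3k), x(3k+1), x(3k+2) and emits three outputs per clock."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of 3")
    xm1 = xm2 = 0                      # x(3k-1) and x(3k-2), held in delay elements
    y = []
    for k in range(0, len(x), 3):      # one iteration = one clock cycle
        x0, x1, x2 = x[k], x[k + 1], x[k + 2]
        y0 = a * x0 + b * xm1 + c * xm2    # y(3k)
        y1 = a * x1 + b * x0 + c * xm1     # y(3k+1)
        y2 = a * x2 + b * x1 + c * x0      # y(3k+2)
        y.extend([y0, y1, y2])
        xm1, xm2 = x2, x1              # a delay here spans 3 sample periods
    return y


print(fir_3_parallel([1, 2, 3, 4, 5, 6], 2, 3, 5))
```

Note that the two delay elements now hold samples across a whole block, so one delay in the parallel system corresponds to L = 3 sample periods.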

Complete parallel processing system
[Figure: x(n), with sampling period T/4, enters a serial-to-parallel converter that produces x(4k), x(4k+1), x(4k+2), x(4k+3); a MIMO system with clock period T computes y(4k), y(4k+1), y(4k+2), y(4k+3); a parallel-to-serial converter with clock period T/4 produces y(n)]

When should we use parallel processing over pipelining?
There is a fundamental limit to pipelining imposed by input/output (I/O) bottlenecks.
[Figure: Chip 1 output pad connected to Chip 2 input pad, showing the communication time $T_{comm}$ and the computation time $T_{computation}$]
Communication-bounded systems:
–The communication time (input/output pad delay + wire delay) is larger than the computation delay.
–Pipelining can only be used to reduce the critical path computation delay.
–For a communication-bounded system, this cannot help.
–So only parallel processing can be used to improve the performance.
–Further improvement can be achieved by combining pipelining and parallel processing.

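Low Power Signal Processing
Pipelining and parallel processing can be used either for higher speed or for low power.
Dynamic power consumption: $P = C_{total} V_0^2 f$
Propagation delay: $T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
–$C_{charge}$: the capacitance to be charged/discharged
–$V_0$: supply voltage; $V_t$: threshold voltage
–$k$: technology parameter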

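Pipelining for Low Power
$P_{seq} = C_{total} V_0^2 f$
After pipelining, the critical path is reduced, hence we can use a lower supply voltage $V' = \beta V_0$ at the same clock frequency. The new power is
$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$
The power consumption reduction factor $\beta$ can be found as follows:
Sequential (critical path, supply $V_0$): $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
Pipelined (critical path with M levels, supply $\beta V_0$): $T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
Setting $T_{pip} = T_{seq}$ gives $M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$.
[Figure: critical path timing for the sequential filter at $V_0$ and the pipelined filter at $\beta V_0$ when M = 3]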

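Example
Assume:
–The capacitance of a multiplier, $C_M$, is 5 times that of an adder, $C_A$.
–Fine-grain pipelining is used, with $C_{m1} = 3C_A$ and $C_{m2} = 2C_A$.
–$V_{dd} = 5$ V and $V_t = 0.6$ V.
[Figure: fine-grain pipelined filter with multipliers split into m1 (6) and m2 (4) and adder delay (2); original data-broadcast filter with multipliers c, b, a]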

Solution
For the original filter, $C_{charge} = C_M + C_A = 6C_A$.
For the pipelined filter, $C_{charge} = C_{m1} = C_{m2} + C_A = 3C_A$.
Now M = 2, so we have $2(5\beta - 0.6)^2 = \beta (5 - 0.6)^2$. Solving this equation gives $\beta = 0.6033$.
The supply voltage of the pipelined filter is $V_{pipe} = \beta V_0 \approx 3$ V.
The power consumption ratio is $\beta^2 = 36.4\%$.
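The numbers in this solution can be reproduced numerically. The sketch below solves $M(\beta V_0 - V_t)^2 = \beta(V_0 - V_t)^2$ for the voltage scaling factor (written $\beta$ here) as an ordinary quadratic; the function name `pipelining_beta` is made up for illustration, and the root is chosen on the assumption that the scaled supply must stay above the threshold voltage.

```python
import math


def pipelining_beta(M, v0, vt):
    """Solve M*(beta*v0 - vt)**2 = beta*(v0 - vt)**2 for the physically valid beta."""
    # Expanded form: (M*v0^2)*beta^2 - (2*M*v0*vt + (v0 - vt)^2)*beta + M*vt^2 = 0
    qa = M * v0 ** 2
    qb = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    qc = M * vt ** 2
    disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # Assumption: only a root with beta*v0 > vt is meaningful (the device must still switch).
    return max(r for r in roots if r * v0 > vt)


beta = pipelining_beta(M=2, v0=5.0, vt=0.6)
print(f"beta        = {beta:.4f}")
print(f"V_pipe      = {beta * 5.0:.2f} V")
print(f"power ratio = {beta ** 2:.3f}")
```

The second root of the quadratic corresponds to a supply below the threshold voltage, which is why it is discarded.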

Parallel Processing for low power
In an L-parallel architecture, we can assume the charging capacitance remains the same, but the total capacitance (i.e. $C_{total}$) is increased L times.
The clock frequency of the L-parallel architecture is reduced to 1/L of the original (i.e. $f = 1/(L\,T_{pd})$) to maintain the same sampling rate.
The supply voltage can be reduced to $\beta V_0$, since more time is allowed to charge or discharge the same capacitance.

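Parallel Processing for low power
Sequential (critical path, supply $V_0$): $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
Parallel (critical path when L = 3, supply $\beta V_0$): $3T_{seq}$. In general, $L\,T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$, which gives $L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ when the charging capacitance is unchanged.
[Figure: critical path timing for the sequential filter at $V_0$ and the 3-parallel filter at $\beta V_0$]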

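Example: Reduce power by parallel processing
Consider the following FIR filters.
[Figure: sequential FIR filter with input x(n), delay elements (D) and output y(n); 2-parallel FIR filter with inputs x(2k), x(2k+1) and outputs y(2k), y(2k+1)]
Assumptions:
–$C_M = 8C_A$
–$T_M = 8T_A$
–Both architectures operate at a sampling period of $9T_A$.
–Supply voltage = 3.3 V and $V_t = 0.45$ V.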

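Solution
Charging capacitance $C_{charge}$:
–Sequential: $C_{charge} = C_M + C_A = 9C_A$
–Parallel: $C_{charge} = C_M + 2C_A = 10C_A$
With L = 2, the parallel critical path may take $2T_{seq}$, so $9(3.3\beta - 0.45)^2 = 5\beta (3.3 - 0.45)^2$.
Power ratio: $\beta^2 = 0.434$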
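As a check on the power ratio above, the following sketch assumes the delay model $T_{pd} = C_{charge} V_0 / (k (V_0 - V_t)^2)$ and requires the parallel critical path at the reduced supply to take L times the sequential critical path. The function name `parallel_beta` and its parameters are illustrative; the capacitances are given in units of $C_A$.

```python
import math


def parallel_beta(L, c_seq, c_par, v0, vt):
    """Voltage scaling factor beta for an L-parallel design.

    Assumed condition: c_par*beta*v0/(beta*v0 - vt)**2 == L * c_seq*v0/(v0 - vt)**2,
    i.e. r*(beta*v0 - vt)**2 == beta*(v0 - vt)**2 with r = L*c_seq/c_par.
    """
    r = L * c_seq / c_par
    qa = r * v0 ** 2
    qb = -(2 * r * v0 * vt + (v0 - vt) ** 2)
    qc = r * vt ** 2
    disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # Assumption: keep only the root whose scaled supply is above the threshold voltage.
    return max(root for root in roots if root * v0 > vt)


beta = parallel_beta(L=2, c_seq=9, c_par=10, v0=3.3, vt=0.45)
print(f"beta        = {beta:.4f}")
print(f"power ratio = {beta ** 2:.3f}")
```

With the example's values (9 and 10 adder capacitances, 3.3 V supply, 0.45 V threshold) this reproduces the stated ratio of about 0.434.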
