Presentation is loading. Please wait.

Presentation is loading. Please wait.

ELEC692 VLSI Signal Processing Architecture Lecture 4

Similar presentations


Presentation on theme: "ELEC692 VLSI Signal Processing Architecture Lecture 4"— Presentation transcript:

1 ELEC692 VLSI Signal Processing Architecture Lecture 4
Unfolding

2 Unfolding Transformation
Create a new program describing more than one iteration of the original program Unfolding factor J: J consecutive iterations of the original program Also named as loop unrolling, used with software pipelining in compiler design Unfolding can reveal hidden concurrencies so that the program can be scheduled to a smaller iteration period and thus increase throughput, e.g. with pipelining unfolding 1 2 3 1 2 3 1 2 3 1 2 3 ….. looping pipelining 1 2 3 …… 1 2 3 …… Increase throughput 1 2 3

3 Unfolding Also unfolding can work with parallel processing to increase throughput E.g. 3-tap FIR filters Y(n) = ax(n-1)+bx(n-2)+cx(n-3) Unroll three times we get

4 A Parallel FIR System x(3k+2) x(3k) x(3k+1) b a c Y(3k+2) D Y(3k+1)
Parallel system Pipelined system

5 Unfolding Example Example The above can be rewritten as
Y(n)=ay(n-9)+x(n), after 2-unfolding we have Y(2k)=ay(2k-9)+x(2k), Y(2k+1)=ay(2k-8)+x(2k+1), The above can be rewritten as Y(2k)=ay(2(k-5)+1)+x(2k), Y(2k+1)=ay(2(k-4)+x(2k+1), 9D X(n) Y(n) a X(2k) X(2k+1) y(2k) y(2k+1) 5D 4D

6 Timing after unfolding
After J-unfolding, the clock period T = J * data sampling period. Ex. Timing diagram of the previous example: y(0) y(1) y(2) y(3) y(4) y(5) y(6) y(7) y(8) y(9) y(10) y(11) x(n) Tclock=Tsample unfolding y(0) y(2) y(4) y(6) y(8) y(10) X(2k) Tclock=2*Tsample y(1) y(3) y(5) y(7) y(9) y(11) x(2k+1) Tclock=2*Tsample

7 Unfolding Algorithm How to construct J-unfolded DFG Algorithm
The resulting DFG contains J times as many edges and edges as the original DFG. Algorithm For each node U in the original DFG, draw the J nodes Uo, U1,…, UJ-1. For each edge U->V with w delays in the original DFG, draw the J edges with delays for I = 0,1,…,J-1. *Here is the floor of x and a%b is the remainder after dividing a by b.

8 Example B0 B A0 C0 D0 5D 9D A C D 4D A1 C1 D1 B1

9 More example 4-unfolded U1 V1 37D V U 9D V2 U2 9D Unfolding of an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in J-unfolded DFG when w<J V3 U3 9D 10D V4 U4 Unfolding preserves precedence constraints of a DSP program

10 Another Example 2D U V 2D U0 V0 T0 D T 2D 2D U1 V1 T1 2D U2 V2 T2 D D
3-unfolded U0 V0 T0 D 5D 6D T 2D 2D U1 V1 T1 2D U2 V2 T2 D

11 Unfolding preserves precedence constraints
The e edges in the original DFG explicitly show the precedence constraints for 1 iterations of the original program The J*e edges in the J-unfolded DFG explicitly show the precedence constraints for J iterations of the program The edge with delay in the unfolded DFG corresponds to the edge U->V with w delay in the original DFG.

12 Precedence preservation (I)
For unfolding, the k-th iteration of the node Ui in the J-unfolded DFG executes the (Jk+i)-th iteration of the node U in the original DFG. Due to the delays in the edge , the output of the k-th iteration of the node Ui is consumed by the (k )-th iteration of the node in the unfolded DFG. The k-th iteration of Ui corresponds to the (Jk+i)-th iteration of the node U, and the (k )-th iteration of V(i+w)%J corresponds to the ( )-th of the node V

13 Precedence preservation (II)
Therefore in the original DFG, the output of the (Jk+i)-th iteration of the node U is consumed by the ( )-th of the node V. So the number of delays on the edge UV is Hence the number of delays on the edge UV is (i+w)-i=w This term is simply i+w

14 Properties of Unfolding
Unfolding preserves the number of delays in a DFG Sum of the delays on the J unfolded edges UiV(i+w)%J, I = 0,1,…,J-1, is the same as the number of delays on the edge UV in the original DFG, i.e. When a loop is unfolded, let l be a loop with wl delays in the original DFG. Let A be a node in l. (we denote this as with wl delays). If the loop l is traversed p times (p>=1) this results in the path with p*wl delays. The correspodning unfolded path starting at the node Ai, 0<=i <= J-1, in the J-unfolded DFG is Ai A(i+pwl)%J, and this path forms a loop in the unfolded DFG if i = (i+pwl )%J. Find the minimum p such that Ai A(i+pwl)%J, is a loop in the unfolded DFG.

15 Example D B D 3-unfolded A0 B0 C0 A 2D D D 3D C A1 B1 C1 4-unfolded D
A single loop with wl=6 D B i=(i+pwl)%J =>i=(i+6p)%3 =>p=1(for i=0,1,2 are true), so we have 3 loops in the unfolded DFG D 3-unfolded A0 B0 C0 A 2D D D 3D C A1 B1 C1 4-unfolded D D A2 B2 C2 D A0 B0 C0 D D D D i=(i+pwl)%J =>i=(i+6p)%4 => Hods for i=0,1,2,3 when p=2, so unfolding p=2 instances of the loop results in a loop in the 4-unfolded DFG and we have 2 loops in the unfolded DFG A1 B1 C1 D A2 B2 C2 D A3 B3 C3

16 Properties of Unfolding
When a loop is unfolded, the number of resulting unfolded loops can be determined by finding the minimum p value such that the path AiA(i+pwl)%J forms a path,i.e. i=(i+pwl)%J. Two lemma Lemma 1: =(i+pwl)%J pwl=qJ for an integer q. Lemma 2: The smallest positive integer p that satisfies pwl=qJ is J/gcd(wl,J) where gcd is the greatest common divisor Based on Lemma 1, AiA(i+pwl)%J is a loop iff pwl=qJ holds Minimum p is equal to the number of times the loop l is traversed to form an unfolded loop, and from Lemma 2 it is equal to J/gcd(wl,J) Thus an unfolded loop contains J/gcd(wl,J) copies oe each node in l and the unfolded DFG contains a total of J copies of each node, so there must be J/(J/gcd(wl,J))= gcd(wl,J) unfolded loops. The unfolded DFG contains wl,delays and each of the gcd(wl,J) unfolded loops contains wl,/gcd(wl,J) delays.

17 Properties of Unfolding
Unfolding a DFG with iteration bound T¥ results in a J-unfolded DFG with iteration bound JT. Original iteration bound is given by T¥ =maxl{tl/wl} From the previous property, we can show that the iteration bound of the unfolded DFG is

18 Example Iteration bound = 9/9 = 1 Iteration bound = 18/9 = 2 B0 B A0
C0 D0 5D 9D A C D 4D A1 C1 D1 B1 Iteration bound = 9/9 = 1 Iteration bound = 18/9 = 2

19 Properties regarding critical path, unfolding and retiming
Consider a path with w delays in the original DFG. J-unfolding of this path leads to (J-w) paths with no delays and w paths with 1 delay each, when w<J. Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore a path in the original DFG with J or more delays cannot create a critical path in the J-unfolded DFG. From these, we can retime the original DFG such that the J-unfolded version of the retimed DFG will meet a specified critical path delay c. This is true if there exists a path in the original DFG with computation time c and less than J delays. Assume that the critical path of the J-folded DFG is c, if D(UV) ³ c, then wr(U,V)=W(U,V)+r(V)-r(U) ³ J.

20 Properties regarding critical path, unfolding and retiming
Any feasible clock cycle period that can be obtained by retiming he J-unfolded DFG, GJ, an be achieved by retiming the original DFG, G, directly and then unfolding it by unfolding factor J. Let r’ be a legal retiming for GJ, which lead to critical path c. Let r be a retiming for G defined as Consider an edge UV with w delays in G, since r’ is a legal retiming on the unfolded DFG, we have Summing these inequalities for i=0 to J-1, we have r(U-r(V) ³ w which shows this is a legal retiming For the J-unfolded DFG, since r’ satisfies the critical path (c) constraints, i.e. Summing these inequalities for i=0 to J-1, we have r(U-r(V) £ W(U,V)-J which is the desired inequality

21 Applications of Unfolding
Sample Period Reduction Parallel Processing Word-Level Parallel Processing Bit-Level Parallel Processing

22 Sample Period Reduction
How unfolding can allow DSP program to be implemented with an iteration period equal to the iteration bound T¥. 2 cases Case 1: A node in the DFG has computation time greater than T¥.. Case 2: Iteration bound T¥. is not an integer

23 Case 1 Even retiming cannot be used to reduce the computation time of the critical path of the DFG to T¥. Example (4) S Loop bound for loop 1 = (4+1+1)/2=3 (4) D (1) Loop bound for loop 2 = (4+1+1)/3=2 Q T 2D Iteration bound = max(3,2)=3 (0) (0) P R U Maximum computation delay time of a component = 4 > iteration bound (1) loop1 loop2

24 Case 1 Example P0 R0 Q0 T0 S0 U0 P1 R1 Q1 T1 S1 U1
(0) (4) (1) 2D P1 R1 Q1 T1 S1 U1 D loop1 loop2 loop3 The previous DFG can be unfolded with a factor 2 and from this the loop bound can be reduced. Loop bound for loop 1 = (4+1+1)/1=6 Loop bound for loop 2 = (4+1+1)/1=6 Loop bound for loop 3 = 12/3 = 4 (0) So for this unfolded DFG, T¥ = 6 However it performs 2 iterations of the original DSP in 6 unit time and hence the sampling period is 6/2 = 3 which is the iteration bound T¥ of the original DFG. ** In general, unfolding is used where tu is the max. computation time of a node. In this example, it is and so = 2.

25 Case 2: T¥ is not an integer
U1 V2 D (1) S1 T2 U3 V0 S2 T0 U0 V1 Iteration period cannot achieve the iteration bound (1) (1) (1) (1) D D S T U V 3-Unfolded D Even retiming cannot not achieve a critical path of less than 2. for each loop ** In general if a critical loop bound is tl/wl where tl and wl are mutually prime, then wl-unfolding should be used. # Of sampling = 3 and hence the minimum sampling period of the unfolded DFG is 4/3 =T¥ of the original DFG

26 Parallel Processing Unfolding transformation is used to derive parallel processing architectures from serial processing architectures. Word-level parallel processing Processing multiple (J) samples at the same times by J-unfolding X0 C0 B0 A0 D0 E0 2D X1 C1 B1 A1 D1 E1 D X2 C2 B2 A2 D2 E2 x(n) x(n) a b c a b c y(n) 2D 4D y(n) D x(n) 2D D X 3-unfolded a b c y(n) C B A D 2D D x(n) 4D D E a b c y(n)

27 Bit-level Parallel Processing
Let W be the wordlength of the data Bit-serial processing – one bit processed per clock cycle and a complete word is processed in W clock cycles Bit-parallel processing – one word of W bits is processed every clock cycle. Digital-serial processing: N bits are processed per clock cycle and a word is processed in W/N clock cycles. N is the digit size.

28 Bit-level Parallel Processing
an Bit-parallel bn a2 …. …. b2 a1 b1 a0 b0 Bit-serial bn an b2 b1 b0 a2 a1 a0 Digit-serial (digit size=2) a2 a0 b2 b0 a3 a1 b3 b1

29 Bit serial addition To obtain a digit-serial or bit-parallel adder architecture, the bit-serial adder must be unfolded. The issue is how to unfold the edge with a switch. For J-unfolding of the switch, we assume The wordlength W is a multiple of the unfolding factor J, i.e. W=W’J. All edges into and out of the switch have no delays.

30 Unfolding of Bit serial addition
The edge is unfolded as follows: Write the switching instance as Draw an edge with no delays in the unfolded graph from the node Uu%J ot the node Vu%J, which is switched at time instance ( ). E.g. The above switch, Let J=3 assuming W=12 and u=7 We have W = W’J=>12=(4)(3) and the edge has no delays, and In the unfolded DFG, the unfolded edge is from the node U1 to the node V1 and is switched at 4l+2

31 Example of unfolding of switch
V0 U V 4l +0,2 To unfold the DFG by 3 (J=3), the switching instances are as follows: U1 V1 4l +3 U2 V2

32 Example of unfolding of switch
IF an edge contains a switch and a positive number of delays, a dummy node can be used to reduce the problem to the case where the edge contains zero delays.

33 Example of the bit serial adder

34 Example of the bit serial adder

35 How about if wordlength W is not a multiple of the unfolding factor J?
Determine L = LCM(W,J) and replacing the switching instance Wl+u with the L/W switching instances Ll+u+wW for w=0 to L/W -1. Switch periodicity changed from W to L and each switching instance has been expanded by a factor L/W. E.g. Let W = 4, J=3. The LCM(4,3)=12. Then 4l is equivalent to 12l, 12l+4, 12l+8; 4l+1 is equivalent to 12l+1, 12l+5, 12l+9; 4l+2 is equivalent to 12l+2, 12l+6, 12l+10; 4l+3 is equivalent to 12l+3, 12l+7, 12l+11. All new switching instances are now multiples of J=3 and can now be unfolded using the previous approach.

36 Example


Download ppt "ELEC692 VLSI Signal Processing Architecture Lecture 4"

Similar presentations


Ads by Google