Chapter 5 Unfolding.


Definitions
Unfolding transforms a loop so that several consecutive iterations of the original loop are executed in a single iteration of the new loop.
Also known as (a.k.a.):
  Loop unrolling (in compilers for parallel programs)
  Block processing

Applications
  Reducing the sampling period to achieve the iteration bound T (the desired throughput rate).
  Parallel (block) processing, to execute several iterations concurrently.
  Digit-serial or bit-serial processing.

(C) 1997-2006 by Yu Hen Hu

An example

Before unfolding:
  For n = 0 to N-1,
    y(n) = a*y(n-9) + x(n)
  end

Unfolding once (J = 2):
  For k = 0 to N/2-1,
    y(2k)   = a*y(2k-9) + x(2k)
    y(2k+1) = a*y(2k-8) + x(2k+1)
  end

Unfolding twice (J = 3):
  For k = 0 to N/3-1,
    y(3k)   = a*y(3k-9) + x(3k)
    y(3k+1) = a*y(3k-8) + x(3k+1)
    y(3k+2) = a*y(3k-7) + x(3k+2)
  end

Block processing formulation

J = 3, 9/J = 3 (an integer):
  X(k) = [x(3k) x(3k+1) x(3k+2)]^T
  Y(k) = [y(3k) y(3k+1) y(3k+2)]^T
  Y(k) = a*Y(k-3) + X(k)

J = 2, 9/J = 4.5 (not an integer):
  X(k) = [x(2k) x(2k+1)]^T
  Y(k) = [y(2k) y(2k+1)]^T
  Here no single block delay works: y(2k) depends on the second element of Y(k-5), and y(2k+1) depends on the first element of Y(k-4). The form Y(k) = a*Y(k-5) + X(k) therefore holds only if the elements of the delayed block are taken from two different past blocks.
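The equivalence of the three loops above can be checked numerically. The following is a minimal sketch, not part of the original slides. It assumes zero initial conditions (y(n) = 0 for n < 0) and that N is a multiple of J; the function names `original` and `unfolded` are hypothetical.

```python
def original(x, a, w=9):
    """Reference loop: y(n) = a*y(n-w) + x(n), assuming y(n) = 0 for n < 0."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        prev = y[n - w] if n >= w else 0.0
        y[n] = a * prev + x[n]
    return y


def unfolded(x, a, J, w=9):
    """J-unfolded loop: one pass over k computes J consecutive outputs."""
    assert len(x) % J == 0, "sketch assumes N is a multiple of J"
    y = [0.0] * len(x)
    for k in range(len(x) // J):
        for i in range(J):  # the J statements inside one unfolded iteration
            n = J * k + i
            prev = y[n - w] if n >= w else 0.0
            y[n] = a * prev + x[n]
    return y


x = [float(n + 1) for n in range(36)]  # N = 36 is divisible by both 2 and 3
ref = original(x, a=0.5)
same_J2 = unfolded(x, 0.5, J=2) == ref
same_J3 = unfolded(x, 0.5, J=3) == ref
print(same_J2, same_J3)
```

Both comparisons are exact because the unfolded loops perform the same arithmetic in the same order; only the grouping of iterations changes.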

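Implementation with J=3
[Figure: the input stream x(0), x(1), x(2), x(3), x(4), x(5), ... arrives with sample period Ts and passes through a serial-to-parallel converter. Three parallel branches, each with an adder (+), a multiplier (X) and a delay (D), operate with clock period 3Ts. A parallel-to-serial converter produces the output stream y(0), y(1), y(2), y(3), y(4), y(5), ... with sample period Ts.]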

Unfolding the DFG
Rewrite the algorithm formulation so that every index has the form 2(k-m)+i:
  y(2k)   = a*y(2k-9) + x(2k)     = a*y(2(k-5)+1) + x(2k)
  y(2k+1) = a*y(2k-8) + x(2k+1)   = a*y(2(k-4))   + x(2k+1)
After J-fold unfolding, the clock period is T = J*Ts, where Ts is the data sampling period.
[Figure: the original DFG, labeled T = Ts, and the unfolded DFG, labeled T = J*Ts.]

Timing Diagram
[Figure: with T = Ts, the outputs y(0), y(1), ..., y(13) appear one per clock period, and each output is used 9T later. With T = 2Ts, the even-indexed outputs y(0), y(2), ..., y(12) and the odd-indexed outputs y(1), y(3), ..., y(13) appear on two parallel lines, with dependencies spanning 4T and 5T.]
The timing diagram assumes that the sampling period Ts remains unchanged, so the clock period T increases J-fold.
Since 9/2 is not an integer, the output pair (y(0), y(1)) is needed by two different future iterations: y(0) is used 4T later (to compute y(9)), and y(1) is used 5T later (to compute y(10)).
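The 4T and 5T spacings can be recovered by integer division. The sketch below is an illustration rather than something from the slides; the helper name `consumer` is hypothetical, and it assumes the recurrence y(n) = a*y(n-w) + x(n) with w = 9.

```python
def consumer(n, w, J):
    """For output y(n), return (iterations_later, slot) of the unfolded
    iteration and statement that consumes it, i.e. that computes y(n + w)."""
    k_prod, _ = divmod(n, J)            # unfolded iteration that produces y(n)
    k_cons, slot = divmod(n + w, J)     # unfolded iteration that computes y(n+w)
    return k_cons - k_prod, slot


for n in (0, 1):
    later, slot = consumer(n, w=9, J=2)
    print(f"y({n}) is used {later}T later by statement {slot}")
```

With J = 3 the same helper returns a distance of 3 for every output, which matches the case where 9/J is an integer.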

General DFG Unfolding Method
Define: ⌊x⌋ is the largest integer less than or equal to x, and a % b is the remainder of a divided by b.
Step 1. For each node U in the original DFG, draw J nodes {U_i; 0 ≤ i ≤ J-1} in the unfolded DFG.
Step 2. For each edge from U to V with w delays, draw J edges from U_i to V_((i+w)%J) with ⌊(i+w)/J⌋ delays, for i = 0, 1, ..., J-1.
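The two steps translate directly into a few lines of code. This is a minimal sketch under assumed conventions: a DFG is represented as a list of (U, V, w) edge tuples, and an unfolded node U_i is written as the pair (U, i). Neither the representation nor the name `unfold` comes from the slides.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (U, V, w) edges.

    Returns edges ((U, i), (V, (i + w) % J), (i + w) // J) for i = 0..J-1.
    Step 1 (node copies) is implicit in the (name, index) node labels.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


# The recurrence y(n) = a*y(n-9) + x(n) as a single node A with a 9-delay self-loop.
for edge in unfold([("A", "A", 9)], J=2):
    print(edge)
```

For the 9-delay self-loop and J = 2 this yields an edge A_0 to A_1 with 4 delays and an edge A_1 to A_0 with 5 delays, in agreement with the timing diagram.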

Another DFG Unfolding Example (J=2)
[Figure: an original DFG with nodes Q, R, S and T, with edges carrying delays of 2D and 3D, labeled T=3. Next to it is the unfolded DFG with nodes Q0, Q1, R0, R1, S0, S1, T0 and T1, built up over three slides. A table with columns i, w and (i+w)%J accompanies the figure. The completed unfolded DFG is labeled T=6.]
Step 1. Duplicate J copies of each node.
Step 2. Add all edges with 0 delay on them.
Step 3. Use the table to figure out the edges with delays.

Properties of Unfolding
1. Unfolding preserves the number of registers (delays) in a DFG.
2. A loop with w delays in a DFG that is unfolded J times leads to gcd(w, J) loops in the unfolded DFG. Each of these loops contains w/gcd(w, J) delays and J/gcd(w, J) copies of each node that appears in the original loop.
3. Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound J*T.
4. A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
5. Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Such a path therefore cannot create a critical path in the J-unfolded DFG.
6. Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then unfolding it by J.
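Several of these properties can be checked on small cases. The sketch below is an assumption-laden illustration: it looks only at a single node with a w-delay self-loop (and, for the path property, a single edge), and the helper names `unfold_edge` and `loops_of_self_loop` are hypothetical.

```python
from math import gcd


def unfold_edge(w, J):
    """Unfolded copies of one edge with w delays: (i, (i+w) % J, (i+w) // J)."""
    return [(i, (i + w) % J, (i + w) // J) for i in range(J)]


def loops_of_self_loop(w, J):
    """Loops created by J-unfolding a single-node loop with w delays.

    Returns one (node_copies, total_delays) pair per loop.
    """
    nxt = {i: (j, d) for i, j, d in unfold_edge(w, J)}
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        i, nodes, delays = start, 0, 0
        while i not in seen:        # walk the cycle until it closes
            seen.add(i)
            j, d = nxt[i]
            nodes += 1
            delays += d
            i = j
        loops.append((nodes, delays))
    return loops


for w, J in [(9, 2), (9, 3), (6, 4)]:
    g = gcd(w, J)
    loops = loops_of_self_loop(w, J)
    delays_kept = sum(d for _, _, d in unfold_edge(w, J)) == w      # delays preserved
    loops_match = loops == [(J // g, w // g)] * g                    # gcd(w, J) loops
    print(w, J, loops, delays_kept, loops_match)

# A path with w < J delays gives J-w zero-delay and w one-delay copies.
one_delay = sorted(d for _, _, d in unfold_edge(1, 3))
print(one_delay)
```

For example, w = 6 and J = 4 gives gcd = 2, so the unfolded graph contains two loops, each with 2 node copies and 3 delays.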