Presentation is loading. Please wait.

Presentation is loading. Please wait.

Fang Fang James C. Hoe Markus Püschel Smarahara Misra

Similar presentations


Presentation on theme: "Fang Fang James C. Hoe Markus Püschel Smarahara Misra"— Presentation transcript:

1 Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform
Fang Fang James C. Hoe Markus Püschel Smarahara Misra Carnegie Mellon University

2 Conventional Approach: Static IP Cores
IP cores improve productivity and reduce time-to-market. e.g. Xilinx LogiCore library: FFT for N=16, 64, 256 and 1024 on 16-bit complex numbers chip library application ? May not match the application’s needs: parameters, speed, power, area and their trade-off. .

3 Alternative Approach: IP Core Generation
Generate IP cores to match specific application requirements (speed, area, power, numerical accuracy, and I/O bandwidth…) Application parameters Generator + Evaluator Speed / area / power requirements Optimized IP cores

4 Design space Design Space Designer’s Focus
DSP transform design can be studied at several levels. More math knowledge involved  Bigger design space to explore. Design Space Math Designer’s Focus Algorithm Architecture Gate Circuit

5 Problem Problem: gap between transform mathematics and hardware design
A hardware engineer A math guy What I know: Linear algebra Digital signal processing Adaptive filter theory … What I know: Finite state machine Pipelining Systolic array …

6 Bridge: Formula Solution: - Formula representation of DSP transforms Automated formula manipulation and mapping Formula example A hardware engineer A math guy Formula Representation Manipulation Mapping What I know: Linear algebra Digital signal processing Adaptive filter theory … What I know: Finite state machine Pipelining Systolic array …

7 Outline Introduction Technical Details (illustrated by WHT transform)
What are the degrees of design freedom? How do we explore this design space? Experimental Results Summary and Future work

8 Walsh-Hadamard Transform
Why WHT? Typical access pattern for a DSP transform Close to 2-power FFT Study important construct  Definition n fold Tensor product A  B = [ ak,l • B ], where A = [ak,l]

9 From Formula to Architecture
WHT23 Addition Subtraction an F2 block

10 Pease Algorithm Stride permutation L2N

11 Pease Algorithm Regular routing an F2 block
Possibility for vertical folding an F2 block 7 6 5 4 3 2 1 1 2 3 4 5 6 7 Possibility for horizontal folding 12 F2 blocks total

12 ? Folding 8 Repeat 3 times Horizontal Folding (HF) 8 4 F2 blocks
Folded L28 2 HF + Vertical Folding (VF) Repeat 3 times I4 F2 L28 ?

13 Challenge in Vertical Folding
How to fold these wires? L2N Folded L28 N ports Q ports Straightforward approach: Memory-based reordering Extra control logic to reorder address Computation speed is limited by memory speed Ad-hoc approach: Register routing Hard to automate the process Our approach: formula-based matrix factorization

14 Factorization of Stride Permutation
7 1 2 3 5 4 6 J8 JN can be easily folded [1] L2Q has Q input ports Q=2q, N=2n [1]. J.H.Takala etc., “Multi-Port Interconnection Networks for Radix-R Algorithms”, ICASSP01 Example of (L264)4 (N=64, Q = 4) (J64)4 (J32)4 (J16)4 (J8)4 (J4)4 L24

15 Freedom in Horizontal Folding
WHT2n has n horizontal stages in the flattened design The divisors of n are all the possible folding degrees Example: HF degrees of WHT26 can be 1, 2, 3, 6 Effects of more horizontal folding degree Latency (cycle) Same Throughput (op / cycle) Lower Area less adders, more muxs & wires Speed Not clear Less pipeline depth  lower throughput

16 Freedom in Vertical Folding
WHT2n has 2n vertical ports in the flattened design 1, 2, 4… 2n-1 are all possible folding degrees Example: VF degrees of WHT26 could be 1, 2, 4, … 32 Effects of more vertical folding degree Latency (cycle) Longer Throughput (op / cycle) Lower Area less adders, more regs & muxs Speed Not clear Less I/O bandwidth  longer computation

17 Outline Introduction Technical Details Experimental Results
Summary and Future work

18 Design Space Exploration
HF factor (1,2,3,6) VF factor (1,2,4, ) X = 24 different designs Bit-width (8) WHT Generator Transform size(64) Technology Libary Xilinx FPGA Synthesis Performance requirement Evaluator Xilinx FPGA Place&Route

19 Area vs. Folding Degrees
To achieve the same area, multiple folding options are available.

20 Latency vs. Folding Degrees (WHT64)

21 Latency vs. Folding Degrees (WHT64)

22 Latency vs. Folding Degrees (WHT64)
Latency is almost unaffected by HF, except comparing flattened design with folded design

23 Throughput vs. Folding Degrees
Folding always lowers throughput

24 Comparison with an Existing Design
WHT8 8 bit fixed-point FPGA: Xilinx Virtex xcv1000e-fg680 Speed grade: -8 Compare our fastest generated designs against results reported by Amira, et al. [2] Design in [2] Our fastest design 60% more area 80% reduction in latency 13 times higher throughput Area (#of slices) Latency(ns) Throughput(MOP/s) [2] A.Amira et al., “Novel FPGA Implementations of Walsh-Hardamard Transforms for Signal Processing”, Vision, Image and Signal Processing, IEE Proceedings- , Volume: 148 Issue: 6 , Dec. 2001

25 Comparison with an Existing Design
WHT8 8 bit fixed-point FPGA: Xilinx Virtex xcv1000e-fg680 Speed grade: -8 Compare our smallest generated designs against results reported by Amira, et al. [2] Design in [2] Our smallest design Less area Shorter latency Higher throughput Area (#of slices) Latency(ns) Throughput(MOP/s) [2] A.Amira et al., “Novel FPGA Implementations of Walsh-Hardamard Transforms for Signal Processing”, Vision, Image and Signal Processing, IEE Proceedings- , Volume: 148 Issue: 6 , Dec. 2001

26 Summary Large performance variations over the design space of horizontal and vertical folding Automatic design space exploration through formula manipulation and mapping can find the best trade-off Horizontal folding Performance throughput / latency / area … Vertical folding

27 Future work More DSP transforms Formula
Representation Manipulation Mapping More design decisions Pipelining Systolic array Distributed Arithmetic Fix-point vs. Floating-point … DFT DCT DST DWT …

28 Thank you ! Contact: Fang Fang
URL:


Download ppt "Fang Fang James C. Hoe Markus Püschel Smarahara Misra"

Similar presentations


Ads by Google