Fang Fang James C. Hoe Markus Püschel Smarahara Misra

Slides:

Advertisements

Similar presentations

Common Subexpression Elimination Involving Multiple Variables for Linear DSP Synthesis 15 th IEEE International Conference on Application Specific Architectures.

Advertisements

Chapter 15 Digital Signal Processing

Digital Kommunikationselektronik TNE027 Lecture 4 1 Finite Impulse Response (FIR) Digital Filters Digital filters are rapidly replacing classic analog.

Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.

1 EECS Components and Design Techniques for Digital Systems Lec 21 – RTL Design Optimization 11/16/2004 David Culler Electrical Engineering and Computer.

1 Chapter 13 Cores and Intellectual Property. 2 Overview FPGA intellectual property (IP) can be defined as a reusable design block (Hard, Firm or soft)

UCB November 8, 2001 Krishna V Palem Proceler Inc. Customization Using Variable Instruction Sets Krishna V Palem CTO Proceler Inc.

A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.

Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University.

GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.

FPGA Based Fuzzy Logic Controller for Semi- Active Suspensions Aws Abu-Khudhair.

© 2011 Xilinx, Inc. All Rights Reserved Intro to System Generator This material exempt per Department of Commerce license exception TSU.

Real time DSP Professors: Eng. Julian Bruno Eng. Mariano Llamedo Soria.

03/12/20101 Analysis of FPGA based Kalman Filter Architectures Arvind Sudarsanam Dissertation Defense 12 March 2010.

1 Miodrag Bolic ARCHITECTURES FOR EFFICIENT IMPLEMENTATION OF PARTICLE FILTERS Department of Electrical and Computer Engineering Stony Brook University.

Highest Performance Programmable DSP Solution September 17, 2015.

Computational Technologies for Digital Pulse Compression

Matrix Multiplication on FPGA Final presentation One semester – winter 2014/15 By : Dana Abergel and Alex Fonariov Supervisor : Mony Orbach High Speed.

Making FPGAs a Cost-Effective Computing Architecture Tom VanCourt Yongfeng Gu Martin Herbordt Boston University BOSTON UNIVERSITY.

Floating Point vs. Fixed Point for FPGA 1. Applications Digital Signal Processing -Encoders/Decoders -Compression -Encryption Control -Automotive/Aerospace.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Efficient FPGA Implementation of QR

High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.

Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

High Performance Scalable Base-4 Fast Fourier Transform Mapping Greg Nash Centar 2003 High Performance Embedded Computing Workshop

FPGA Implementations for Volterra DFEs

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

1 Fly – A Modifiable Hardware Compiler C. H. Ho 1, P.H.W. Leong 1, K.H. Tsoi 1, R. Ludewig 2, P. Zipf 2, A.G. Oritz 2 and M. Glesner 2 1 Department of.

J. Greg Nash ICNC 2014 High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations J. Greg.

Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.

Distributed WHT Algorithms Kang Chen Jeremy Johnson Computer Science Drexel University Franz Franchetti Electrical and Computer Engineering.

StrideBV: Single chip 400G+ packet classification Author: Thilan Ganegedara, Viktor K. Prasanna Publisher: HPSR 2012 Presenter: Chun-Sheng Hsueh Date:

COARSE GRAINED RECONFIGURABLE ARCHITECTURES 04/18/2014 Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku DAY

Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.

A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.

An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.

VLSI SP Course 2001 台大電機吳安宇 1 Why Systolic Architecture ? H. T. Kung Carnegie-Mellon University.

Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.

Programmable Logic Devices

Presenter: Darshika G. Perera Assistant Professor

Backprojection Project Update January 2002

Floating-Point FPGA (FPFPGA)

Topics SRAM-based FPGA fabrics: Xilinx. Altera..

ECE354 Embedded Systems Introduction C Andras Moritz.

Complex Programmable Logic Device (CPLD) Architecture and Its Applications

EEE2135 Digital Logic Design Chapter 1. Introduction

Introduction to Programmable Logic

Session 4: Reconfigurable Computing

CS184a: Computer Architecture (Structure and Organization)

FPGA Acceleration of Convolutional Neural Networks

Instructor: Dr. Phillip Jones

Electronics for Physicists

DESIGN AND IMPLEMENTATION OF DIGITAL FILTER

Centar ( Global Signal Processing Expo

CMPT 733, SPRING 2016 Jiannan Wang

COE 561 Digital System Design & Synthesis Introduction

CSE 370 – Winter 2002 – Comb. Logic building blocks - 1

Jian Huang, Matthew Parris, Jooheung Lee, and Ronald F. DeMara

The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.

Architecture Synthesis

HIGH LEVEL SYNTHESIS.

ICS 252 Introduction to Computer Design

UNIVERSITY OF MASSACHUSETTS Dept

Electronics for Physicists

ESE534: Computer Organization

Real time signal processing

Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.

Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.

UNIVERSITY OF MASSACHUSETTS Dept

(Lecture by Hasan Hassan)

Presentation transcript:

Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform Fang Fang James C. Hoe Markus Püschel Smarahara Misra Carnegie Mellon University

Conventional Approach: Static IP Cores IP cores improve productivity and reduce time-to-market. e.g. Xilinx LogiCore library: FFT for N=16, 64, 256 and 1024 on 16-bit complex numbers chip library application ? May not match the application’s needs: parameters, speed, power, area and their trade-off. .

Alternative Approach: IP Core Generation Generate IP cores to match specific application requirements (speed, area, power, numerical accuracy, and I/O bandwidth…) Application parameters Generator + Evaluator Speed / area / power requirements Optimized IP cores

Design space Design Space Designer’s Focus DSP transform design can be studied at several levels. More math knowledge involved  Bigger design space to explore. Design Space Math Designer’s Focus Algorithm Architecture Gate Circuit

Problem Problem: gap between transform mathematics and hardware design A hardware engineer A math guy What I know: Linear algebra Digital signal processing Adaptive filter theory … What I know: Finite state machine Pipelining Systolic array …

Bridge: Formula Solution: - Formula representation of DSP transforms - Automated formula manipulation and mapping Formula example A hardware engineer A math guy Formula Representation Manipulation Mapping What I know: Linear algebra Digital signal processing Adaptive filter theory … What I know: Finite state machine Pipelining Systolic array …

Outline Introduction Technical Details (illustrated by WHT transform) What are the degrees of design freedom? How do we explore this design space? Experimental Results Summary and Future work

Walsh-Hadamard Transform Why WHT? Typical access pattern for a DSP transform Close to 2-power FFT Study important construct  Definition n fold Tensor product A  B = [ ak,l • B ], where A = [ak,l]

From Formula to Architecture WHT23 Addition Subtraction an F2 block

Pease Algorithm Stride permutation L2N

Pease Algorithm Regular routing an F2 block Possibility for vertical folding an F2 block 7 6 5 4 3 2 1 1 2 3 4 5 6 7 Possibility for horizontal folding 12 F2 blocks total

? Folding 8 Repeat 3 times Horizontal Folding (HF) 8 4 F2 blocks Folded L28 2 HF + Vertical Folding (VF) Repeat 3 times I4 F2 L28 ?

Challenge in Vertical Folding How to fold these wires? L2N Folded L28 N ports Q ports Straightforward approach: Memory-based reordering Extra control logic to reorder address Computation speed is limited by memory speed Ad-hoc approach: Register routing Hard to automate the process Our approach: formula-based matrix factorization

Factorization of Stride Permutation 7 1 2 3 5 4 6 J8 JN can be easily folded [1] L2Q has Q input ports Q=2q, N=2n [1]. J.H.Takala etc., “Multi-Port Interconnection Networks for Radix-R Algorithms”, ICASSP01 Example of (L264)4 (N=64, Q = 4) (J64)4 (J32)4 (J16)4 (J8)4 (J4)4 L24

Freedom in Horizontal Folding WHT2n has n horizontal stages in the flattened design The divisors of n are all the possible folding degrees Example: HF degrees of WHT26 can be 1, 2, 3, 6 Effects of more horizontal folding degree Latency (cycle) Same Throughput (op / cycle) Lower Area less adders, more muxs & wires Speed Not clear Less pipeline depth  lower throughput

Freedom in Vertical Folding WHT2n has 2n vertical ports in the flattened design 1, 2, 4… 2n-1 are all possible folding degrees Example: VF degrees of WHT26 could be 1, 2, 4, … 32 Effects of more vertical folding degree Latency (cycle) Longer Throughput (op / cycle) Lower Area less adders, more regs & muxs Speed Not clear Less I/O bandwidth  longer computation

Outline Introduction Technical Details Experimental Results Summary and Future work

Design Space Exploration HF factor (1,2,3,6) VF factor (1,2,4, ... 32) X = 24 different designs Bit-width (8) WHT Generator Transform size(64) Technology Libary Xilinx FPGA Synthesis Performance requirement Evaluator Xilinx FPGA Place&Route

Area vs. Folding Degrees To achieve the same area, multiple folding options are available.

Latency vs. Folding Degrees (WHT64)

Latency vs. Folding Degrees (WHT64)

Latency vs. Folding Degrees (WHT64) Latency is almost unaffected by HF, except comparing flattened design with folded design

Throughput vs. Folding Degrees Folding always lowers throughput

Comparison with an Existing Design WHT8 8 bit fixed-point FPGA: Xilinx Virtex xcv1000e-fg680 Speed grade: -8 Compare our fastest generated designs against results reported by Amira, et al. [2] Design in [2] Our fastest design 60% more area 80% reduction in latency 13 times higher throughput Area (#of slices) Latency(ns) Throughput(MOP/s) [2] A.Amira et al., “Novel FPGA Implementations of Walsh-Hardamard Transforms for Signal Processing”, Vision, Image and Signal Processing, IEE Proceedings- , Volume: 148 Issue: 6 , Dec. 2001

Comparison with an Existing Design WHT8 8 bit fixed-point FPGA: Xilinx Virtex xcv1000e-fg680 Speed grade: -8 Compare our smallest generated designs against results reported by Amira, et al. [2] Design in [2] Our smallest design Less area Shorter latency Higher throughput Area (#of slices) Latency(ns) Throughput(MOP/s) [2] A.Amira et al., “Novel FPGA Implementations of Walsh-Hardamard Transforms for Signal Processing”, Vision, Image and Signal Processing, IEE Proceedings- , Volume: 148 Issue: 6 , Dec. 2001

Summary Large performance variations over the design space of horizontal and vertical folding Automatic design space exploration through formula manipulation and mapping can find the best trade-off Horizontal folding Performance throughput / latency / area … Vertical folding

Future work More DSP transforms Formula Representation Manipulation Mapping More design decisions Pipelining Systolic array Distributed Arithmetic Fix-point vs. Floating-point … DFT DCT DST DWT …

Thank you ! Contact: Fang Fang Email: ffang@cmu.edu URL: www.ece.cmu.edu/~ffang