Parallel Implementation of Fast Fourier Transform on a Multi-core System Tao Liu Chi-Li Yu Nov. 29, 2007.

Presentation transcript:

Parallel Implementation of Fast Fourier Transform on a Multi-core System Tao Liu Chi-Li Yu Nov. 29, 2007

Goal
- Implement and optimize a 2D FFT on an FPGA platform.
- Evaluate multi-core architectures with various numbers of cores.
- Design memory structures suitable for each multi-core architecture.

Basic method and the problem
- N-point 1D FFT: generated by Xilinx LogiCORE. Throughput: 1 sample per clock, at up to 150 MHz.
- N*N matrix: stored in a dual-port SRAM built from Xilinx BRAM.
- Total latency: row-wise + column-wise = N^2 + N^2 = 2N^2 clocks.
- Our target is to reduce this latency.
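The row-column method above can be sketched in software. This is a minimal illustration, not the authors' Verilog design: a naive DFT stands in for the LogiCORE FFT core, and the names `dft_1d`, `fft2_row_column` and `single_core_latency` are hypothetical. The latency function simply restates the 2N^2 model, assuming one sample per clock and no pipeline fill or memory stalls.

```python
import cmath


def dft_1d(x):
    """Naive O(N^2) DFT; a stand-in for the hardware N-point FFT core."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft2_row_column(a):
    """2D DFT of a square matrix: transform every row, then every column."""
    rows = [dft_1d(r) for r in a]                 # row-wise pass
    cols = [dft_1d(list(c)) for c in zip(*rows)]  # column-wise pass on the transposed data
    return [list(r) for r in zip(*cols)]          # transpose back to [row][column] order


def single_core_latency(n):
    """Idealized clock count: N^2 samples for the rows plus N^2 for the columns."""
    return n * n + n * n


if __name__ == "__main__":
    impulse = [[1, 0], [0, 0]]
    print(fft2_row_column(impulse))   # an impulse transforms to a flat spectrum
    print(single_core_latency(128))   # idealized single-core clock count for 128x128
```

For a 128x128 input the model gives 32,768 clocks, which is the idealized figure the measured single-core latency is compared against later in the slides.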

Quad-Core Architecture
- Four (N/2)-point 1D FFTs.
- Lower latency: the local 2D FFT takes only ¼ of the single-core latency (N^2/2 clocks).
- Overhead: 2 radix-2 butterflies are required for preprocessing.
- Extra latency: 2*(N/2)*(N/2) = N^2/2 clocks.
- Total latency: N^2 clocks (single-core: 2N^2).
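The slides do not spell out how the butterfly preprocessing splits the problem. One standard way to obtain four independent (N/2)x(N/2) sub-problems is a radix-2 decimation-in-frequency step applied in both dimensions; the sketch below assumes that decomposition. All function names are hypothetical, and a naive DFT again stands in for the FFT cores.

```python
import cmath


def dft_1d(x):
    """Naive DFT standing in for a hardware FFT core."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_2d(a):
    """Row-column 2D DFT; used for the local transforms and as the reference."""
    rows = [dft_1d(r) for r in a]
    cols = [dft_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def quad_split_fft2(a):
    """2D DFT via radix-2 butterflies followed by four (N/2)x(N/2) local 2D DFTs."""
    n = len(a)
    h = n // 2
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(h)]  # twiddle factors W_N^k
    subs = {(p, q): [[0j] * h for _ in range(h)] for p in (0, 1) for q in (0, 1)}

    # Butterfly preprocessing: combine the four samples that sit N/2 apart.
    for r in range(h):
        for c in range(h):
            x00, x01 = a[r][c], a[r][c + h]
            x10, x11 = a[r + h][c], a[r + h][c + h]
            subs[0, 0][r][c] = x00 + x01 + x10 + x11
            subs[0, 1][r][c] = (x00 - x01 + x10 - x11) * w[c]
            subs[1, 0][r][c] = (x00 + x01 - x10 - x11) * w[r]
            subs[1, 1][r][c] = (x00 - x01 - x10 + x11) * w[r] * w[c]

    # Each sub-array is independent, so each could run on its own core.
    out = [[0j] * n for _ in range(n)]
    for (p, q), s in subs.items():
        y = dft_2d(s)
        for k1 in range(h):
            for k2 in range(h):
                out[2 * k1 + p][2 * k2 + q] = y[k1][k2]  # interleave even/odd frequencies
    return out
```

Under this assumption each local transform handles (N/2)^2 samples in two passes, which matches the N^2/2 clocks quoted for the local 2D FFT.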

8-Core Architecture
- Eight (N/2)-point 1D FFTs; latency: N^2/4 clocks.
- 16 banks of memory.
- 8 radix-2 butterflies.
- Extra latency is reduced to N^2/8 clocks.
- Total latency: 3N^2/8 clocks.

16-Core Architecture
- Sixteen (N/4)-point FFTs.
- 16 banks of memory.
- 4 radix-4 butterflies.
- Latency: N^2/4 clocks.
- The FPGA does not have enough hardware resources for this design!
[Figure: radix-4 butterfly (BTY)]

Implementation
- We implemented the architectures in the Verilog hardware description language.
- We used Xilinx ISE Foundation to synthesize the designs.
- The target FPGA platform is the Digilent XUP V2 Pro.

Comparisons

| | Single core | Quad-core | 8-core | 16-core | 16-core (strip-down ver.) |
|---|---|---|---|---|---|
| Butterfly | 0 | Radix-2 Bty * 2 | Radix-2 Bty * 8 | Radix-4 Bty * 4 | |
| 1D FFT | N-point * 1 | (N/2)-point * 4 | (N/2)-point * 8 | (N/4)-point * 16 | (N/4)-point * 8 |
| Banks of mem. | 1 | 4 | 16 | 16 | |
| FPGA occupation* (slices) | 4299 (10%) | (28%) | (59%) | (109%) | (54%) |
| Latency (butterfly) | 0 | 2*(N/2)*(N/2) | 2*(N/4)*(N/4) | 2*(N/4)*(N/4) | 2*(N/4)*(N/4) |
| Latency (local 2D FFT) | 2*N*N | 2*(N/2)*(N/2) | (N/2)*(N/2) | 2*(N/4)*(N/4) | 4*(N/4)*(N/4) |
| Total latency | 2*N^2 | 1*N^2 | (3/8)*N^2 | (1/4)*N^2 | (3/8)*N^2 |
| Total latency* (measured) | 32,998 | 16,614 | 6,… | | …,374 |

*: 128x128 2D FFT. Target FPGA: Xilinx XC2VP100, which contains … slices.
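The theoretical totals in the comparison can be evaluated for the 128x128 case and set against the measured values. The sketch below is only a restatement of the slide formulas. The function and dictionary names are hypothetical, and only the two measurements that survive intact in the transcript (single core and quad-core) are included.

```python
def theoretical_latency(n):
    """Total latency in clocks for an n x n 2D FFT, using the formulas from the slides."""
    n2 = n * n
    return {
        "single-core": 2 * n2,                      # 2*N^2
        "quad-core": n2,                            # 1*N^2
        "8-core": 3 * n2 // 8,                      # (3/8)*N^2
        "16-core": n2 // 4,                         # (1/4)*N^2
        "16-core (strip-down ver.)": 3 * n2 // 8,   # (3/8)*N^2
    }


# Measured clock counts that are fully legible in the transcript.
measured = {"single-core": 32998, "quad-core": 16614}

model = theoretical_latency(128)
for name, cycles in measured.items():
    overhead = cycles - model[name]
    print(f"{name}: model {model[name]}, measured {cycles}, "
          f"overhead {overhead} clocks ({100 * overhead / model[name]:.2f}%)")
```

For both legible measurements the gap to the model is 230 clocks, about 0.7% for the single core and 1.4% for the quad-core. The slides do not say where those extra clocks come from.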

Conclusion
- Implemented a 2D FFT on an FPGA.
- Evaluated various multi-core architectures.
- Designed and optimized memory structures for every multi-core architecture.
- Experimental results agree with the theoretical predictions.