Very Large Fast DFT (VL FFT) Implementation on KeyStone Multicore Applications.

Slides:



Advertisements
Similar presentations
Chapter 19 Fast Fourier Transform (FFT) (Theory and Implementation)
Advertisements

Chapter 19 Fast Fourier Transform
DFT & FFT Computation.
KeyStone Connectivity and Priorities
Computer Architecture
Categories of I/O Devices
David Hansen and James Michelussi
Technische universiteit eindhoven November 2000Ad Verschueren and Bart Theelen1 The Multi Micro Processor Eindhoven.
Fourier Transform Fourier transform decomposes a signal into its frequency components Used in telecommunications, data compression, digital signal processing,
11/19/2002Yun (Helen) He, SC20021 MPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi- Dimensional Array.
Implementing Large Scale FFTs on Heterogeneous Multicore Systems Yan Li 1, Jeff Diamond 2, Haibo Lin 1, Yudong Yang 3, Zhenxing Han 3 June 4 th,
Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project.
The Study of Cache Oblivious Algorithms Prepared by Jia Guo.
Parallel Fast Fourier Transform Ryan Liu. Introduction The Discrete Fourier Transform could be applied in science and engineering. Examples: ◦ Voice recognition.
Digital Kommunikationselektronik TNE027 Lecture 5 1 Fourier Transforms Discrete Fourier Transform (DFT) Algorithms Fast Fourier Transform (FFT) Algorithms.
DFT and FFT FFT is an algorithm to convert a time domain signal to DFT efficiently. FFT is not unique. Many algorithms are available. Each algorithm has.
The Discrete Fourier Transform. The spectrum of a sampled function is given by where –  or 0 .
Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
ECE 734: Project Presentation Pankhuri May 8, 2013 Pankhuri May 8, point FFT Algorithm for OFDM Applications using 8-point DFT processor (radix-8)
KeyStone Training Multicore Navigator Overview. Overview Agenda What is Navigator? – Definition – Architecture – Queue Manager Sub-System (QMSS) – Packet.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
Introduction to Fast Fourier Transform (FFT) Algorithms R.C. Maher ECEN4002/5002 DSP Laboratory Spring 2003.
DSP online algorithms for the ATLAS TileCal Read Out Drivers Cristobal Cuenca Almenar IFIC (University of Valencia-CSIC)
Chapter 15 Digital Signal Processing
May 29, Final Presentation Sajib Barua1 Development of a Parallel Fast Fourier Transform Algorithm for Derivative Pricing Using MPI Sajib Barua.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
ARM-DSP Communication Architecture
Introduction to Algorithms
Multicore Design Considerations KeyStone Training Multicore Applications Literature Number: SPRP811.
Multicore Design Considerations. Multicore: The Forefront of Computing Technology “We’re not going to have faster processors. Instead, making software.
Parallelizing the Fast Fourier Transform David Monismith cs599.
Processor Architecture Needed to handle FFT algoarithm M. Smith.
CHAPTER 8 DSP Algorithm Implementation Wang Weilian School of Information Science and Technology Yunnan University.
CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.
Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.
Challenges in KeyStone Workshop Getting Ready for Hawking, Moonshot and Edison.
Mar. 1, 2001Parallel Processing1 Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs * Jeremy R. Johnson Wed. Mar. 1, 2001 *Parts of this lecture.
Multicore Design Considerations. Multicore: The Forefront of Computing Technology “We’re not going to have faster processors. Instead, making software.
2009/4/21 Third French-Japanese PAAP Workshop 1 A Volumetric 3-D FFT on Clusters of Multi-Core Processors Daisuke Takahashi University of Tsukuba, Japan.
2007/11/2 First French-Japanese PAAP Workshop 1 The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite Daisuke Takahashi Center for Computational.
ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Under-Graduate Project Case Study: Single-path Delay Feedback FFT Speaker: Yu-Min.
Computer Architecture Lecture 32 Fasih ur Rehman.
Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.
TI Information – Selective Disclosure Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS September 28, 2015 Devangi Parikh.
Professor A G Constantinides 1 Discrete Fourier Transforms Consider finite duration signal Its z-tranform is Evaluate at points on z-plane as We can evaluate.
A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.
Performed by: Lior Raviv & Zohar koritzki Instructor: Reuven Nisar הטכניון - מכון טכנולוגי לישראל הפקולטה להנדסת חשמל Technion - Israel institute of technology.
Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.
Parallel Computing Presented by Justin Reschke
FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.
EE345S Real-Time Digital Signal Processing Lab Fall 2006 Lecture 17 Fast Fourier Transform Prof. Brian L. Evans Dept. of Electrical and Computer Engineering.
Husheng Li, UTK-EECS, Fall The specification of filter is usually given by the tolerance scheme.  Discrete Fourier Transform (DFT) has both discrete.
TI Information – Selective Disclosure
Embedded Systems Design
Fast Fourier Transform
Implementation of IDEA on a Reconfigurable Computer
Centar ( Global Signal Processing Expo
High Level Synthesis Overview
Real-time 1-input 1-output DSP systems
Fast Fourier Transformation (FFT)
Chapter 19 Fast Fourier Transform
Mapping the FFT Algorithm to the IBM Cell Processor
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Speaker: Chris Chen Advisor: Prof. An-Yeu Wu Date: 2014/10/28
Fast Fourier Transform
Presentation transcript:

Very Large Fast DFT (VL FFT) Implementation on KeyStone Multicore Applications

Outlines Basic Algorithm for parallelizing DFT Multi-core implementation of DFT Review Benchmark Performance

Goals and Requirements Goal: – To implement very large floating point fast DFT on TI multicore devices: Shannon and Nyquist Requirements: – FFT sizes: 4K – 1M samples – Configurable to run on different number of cores: 1, 2, 4, 8 – High performance

Algorithm for Very Large DFT A generic discrete Fourier transform (DFT) is shown below, Here N is the total size of DFT,

Algorithm for Very Large DFT For very large N, it can be factored into N = N1*N2 and with decimation-in-time, the DFT can be formulated as,

Algorithm for Very Large DFT The above DFT formula can be :

Algorithm for Very Large DFT A vary large DFT of size N=N1*N2 can be computed in the following steps: 1)Formulate input into N1xN2 matrix 2)Matrix transpose: N1xN2 -> N2xN1 3)Compute N2 FFTs and multiply twiddle factors. Each FFT is N1 size. 4)Matrix transpose: N2xN1 -> N1xN2 5)Compute N1 FFTs. Each is N2 size. 6)Matrix transpose: N1xN2 -> N2xN1

Implementing VLFFT on Multiple Cores Two iterations of computations 1 st iteration – N2 FFTs are distributed across all the cores. – Each core implements matrix transpose and computes N2/numCores FFTs and multiplying twiddle factor. 2 nd iteration – N1 FFTs of N2 size are distributed across all the cores – Each core computes N1/numCores FFTs and implements matrix transpose before and after FFT computation.

Data Buffers DDR3: Three float complex arrays of size N – Input buffer, output buffer, working buffer L2 SRAM: – Two ping-pong buffers, each buffer is the size of 16 FFT input/output – Some working buffer – Buffers for twiddle factors Twiddle factors for N1 and N2 FFT N2 global twiddle factors

Global Twiddle Factors Global Twiddle Factors: Total of N1*N2 global twiddle factors are required. N1 are actually pre-computed and saved. The rest are computed during run time.

DMA Scheme Each core has dedicated in/out DMA channels Each core configures and triggers its own DMA channels for input/output On each core, the processing is divided into blocks of 8 FFT each. For each block on every core – DMA transfer 8 lines of FFT input – DSP computes FFT/transpose – DMA transfers 8 lines of FFT output

VLFFT Pseudo Code VLFFT_start: 1) Core0 sends message to each core to start 1st iteration processing. 2) Each core does the following, Wait message from core 0 to start, numBlk = 0; While( numBlk < totalBlk ) { 1) Trigger DMA to transfer (n+1)th blk from Input Buffer to L2 and to transfer (n-1)th blk output from L2 to Temp Buffer 2) Implement transpose, compute FFT, and multiply twiddle factors for nth blk 3) wait for DMA completion 4) numBlk++ } Send a message to core 0 3) Core0 waits for message from each core for completion of its own processing 4) After receiving all the messages from all the other cores, core0 sends message to each core to start 2nd iteration processing 5) Each core does the following, Wait message from core 0 to start, numBlk = 0; While( numBlk < totalBlk ) { 1) Trigger DMA to transfer (n+1)th blk from Temp Buffer to L2 and to transfer (n-1)th blk output from L2 to Output Buffer 2) Compute FFT and transpose for nth blk 3) wait for DMA completion 4) numBlk++ } Send a message to core 0 6) Core0 waits for message back from each core for completion its own processing VLFFT_end:

Matrix Transpose The transpose is required for the following matrixes from each core: – N1x8 -> 8xN1 – N2x8 -> 8xN2 – 8xN2 -> N2x8 DSP computes matrix transpose from L2 SRAM – DMA bring samples from DDR to L2 SRAM – DSP implements transpose for matrixes in L2 SRAM – 32K L1 Cache

Major Kernels FFT: single precision floating point FFT from c66x DSPLIB Global twiddle factor compute and multiplication: 1 cycle per complex sample Transpose: 1 cycle per complex sample

Major Software Tools SYS BIOS 6 CSL for EDMA configuration IPC for inter-processor communication

Conclusion After the demo