Mapping the FFT Algorithm to the IBM Cell Processor

Andy Polidore
Advisors: Brendan Burns, Joseph Czechowski

Motivation
- MRI imaging
- Fast Fourier Transforms
  - Efficient algorithm for computing a Discrete Fourier Transform (DFT)
  - The DFT converts time-domain data to the frequency domain
  - 2D FFT: perform a 1D FFT on each row of an image, then a 1D FFT on each resulting column
- The Cell
  - Nine cores
  - 1 Power Processing Unit (PPU)
  - 8 Synergistic Processing Units (SPUs)
- My project is to map the 2D FFT algorithm to the Cell.
- You may be asking... why? Faster processing = faster, better images.

Strategy
- Cell comes with a 2D routine
  - Limited SPU memory
  - Needs to be called twice
  - First call organizes the data in contiguous column form
- Striping
- Limited SPU memory
- Quad buffering
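To make "limited SPU memory" concrete, here is a rough budget calculation. Each SPE has 256 KB of local store, shared by code, stack, and data. The sample size (complex single precision, 8 bytes) and the `reserved_bytes` allowance are assumptions for illustration, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* SPE local store size */
#define QUAD_BUFFERS 4u                   /* four buffers, as in quad buffering */
#define BYTES_PER_SAMPLE 8u               /* assumed: complex single precision, 2 x 4-byte float */

/* Largest power-of-two number of complex samples that fits in each of the
   four buffers after setting aside `reserved_bytes` (a hypothetical allowance
   for program code, stack, and lookup tables). Returns 0 if nothing fits. */
size_t max_samples_per_buffer(size_t reserved_bytes)
{
    assert(reserved_bytes < LOCAL_STORE_BYTES);
    size_t per_buffer = (LOCAL_STORE_BYTES - reserved_bytes) / QUAD_BUFFERS;
    size_t samples = per_buffer / BYTES_PER_SAMPLE;
    if (samples == 0)
        return 0;
    size_t pow2 = 1;
    while (pow2 * 2 <= samples)
        pow2 *= 2;
    return pow2;
}
```

With 64 KB reserved, for example, each buffer gets 48 KB, which holds 6144 samples, so the largest power-of-two row is 4096 points. A whole image will not fit under these assumptions, which is why the data has to be streamed through the SPUs in pieces.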

[Diagram: single-SPU data flow between the PPU and SPU 0. Data is DMA'd in from the input buffer, the FFT runs, and the FFT output is DMA'd out to the output buffer.]

[Diagram: the same data flow (input buffer, DMA in, FFT, output buffer) spread across SPU 0, SPU 1, ... SPU 7.]

[Diagram: multi-SPU data flow (SPU 0, SPU 1, SPU 2) with a PPU sync point between two rounds of DMA in, FFT, DMA out.]

Quad buffering
- Why is it required?
  - Space problems
  - Maximizing processing power
- Buffers
  - IN handles incoming data
  - FFTin and FFTout are used to process the data
  - OUT stores the data ready to be DMA'ed back to main memory

Buffering

Step  A        B        C        D
      FILL     -------  -------  -------
1     FFTIN    FFTOUT   FILL     -------
2     FFTOUT   OUT      FFTIN    FILL
3     OUT      FILL     FFTOUT   FFTIN
4     FILL     FFTIN    OUT      FFTOUT
5     FFTIN    FFTOUT   FILL     OUT
6     FFTOUT   OUT      FFTIN    FILL
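The rotation in the buffering table follows a simple rule: each buffer cycles through FILL, FFTIN, FFTOUT, OUT, and the four buffers are offset from one another so that every role is covered at each step. A minimal sketch of that rule follows. It treats the unnumbered first row of the table (the initial fill) as step 0, which is an assumption about the slide's numbering, and the names `buffer_role` and `PHASE_OFFSET` are hypothetical.

```c
#include <assert.h>

enum role { ROLE_IDLE, ROLE_FILL, ROLE_FFTIN, ROLE_FFTOUT, ROLE_OUT };

enum { NUM_BUFFERS = 4 };

/* Order in which every buffer changes role from one step to the next. */
static const enum role ROLE_CYCLE[NUM_BUFFERS] = {
    ROLE_FILL, ROLE_FFTIN, ROLE_FFTOUT, ROLE_OUT
};

/* Position of buffers A, B, C, D in the cycle at step 0,
   chosen so that the output reproduces the table. */
static const int PHASE_OFFSET[NUM_BUFFERS] = {0, 1, 3, 2};

/* Role of `buffer` (0 = A ... 3 = D) at `step`, where step 0 is the initial fill. */
enum role buffer_role(int buffer, int step)
{
    assert(buffer >= 0 && buffer < NUM_BUFFERS && step >= 0);
    enum role r = ROLE_CYCLE[(step + PHASE_OFFSET[buffer]) % NUM_BUFFERS];

    /* Pipeline start-up: nothing to transform until one fill has completed,
       and nothing to send back until one transform has completed. */
    if ((r == ROLE_FFTIN || r == ROLE_FFTOUT) && step < 1)
        return ROLE_IDLE;
    if (r == ROLE_OUT && step < 2)
        return ROLE_IDLE;
    return r;
}
```

From step 2 onward all four buffers are busy at once, so the incoming DMA, the FFT, and the outgoing DMA can overlap instead of waiting on each other.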

Striping

[Diagram: main memory divided into stripes, one for each of SPU 0 through SPU 7.]
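One way to picture striping is as a split of the image's rows into eight contiguous blocks, one per SPU. The slide does not say how the stripes are actually assigned, so the contiguous-block layout below is an assumption, and `struct stripe` and `stripe_for_spu` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SPUS 8u

struct stripe {
    size_t first_row;  /* index of the first row in this SPU's stripe */
    size_t row_count;  /* number of rows in the stripe */
};

/* Contiguous stripe of rows handled by `spu` (0..7) out of `total_rows`.
   When the rows do not divide evenly, the lowest-numbered SPUs take one extra row. */
struct stripe stripe_for_spu(size_t total_rows, size_t spu)
{
    assert(spu < NUM_SPUS);
    size_t base = total_rows / NUM_SPUS;
    size_t extra = total_rows % NUM_SPUS;
    struct stripe s;
    s.row_count = base + (spu < extra ? 1 : 0);
    s.first_row = spu * base + (spu < extra ? spu : extra);
    return s;
}
```

For a 1024-row image this gives each SPU 128 rows, with SPU 3 starting at row 384.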

Challenges
- Simulator
  - Testing is slow
  - Alignment
  - Compiler
- C coding
  - Working with bytes
- Parallel processing
  - Data movement
  - Debugging

Knowledge Gained
- Mastering Linux
- C: makefiles, linking, etc.
- Data movement strategies
- Multi-core processing
- Debugging!

Results and Conclusions
- Success?
- Future work
  - Arbitrary-size input

Questions?