FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
Date: February 23, 2007


Current Objectives
- Validate the number of complex multiplications
- Run the code with the Intel compiler and compare the results
  - For a single run
  - For multiple runs
- Tabulate all the results
- Analyse these using VTune

Number of Complex Multiplications
- Our result: (11/4) * n * log4(n) = 8960
- Result found online: (3/4) * n * log4(n) = 3840
- The inner loop is trivial and does not require any “complex multiplications”
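As a quick arithmetic check of the two formulas above, here is a small sketch. The function name and the choice n = 1024 are assumptions made for illustration; n = 1024 is used because it is the value for which (3/4) * n * log4(n) equals 3840. For the same n, 7 multiplications per butterfly gives 8960 and 11 per butterfly gives 14080.

```python
def radix4_complex_mults(n, per_butterfly):
    """Complex multiplications for an n-point radix-4 FFT (n a power of 4).

    Each of the log4(n) stages performs n/4 butterflies.
    """
    stages = 0
    size = 1
    while size < n:
        size *= 4
        stages += 1
    if size != n:
        raise ValueError("n must be a power of 4")
    butterflies_per_stage = n // 4
    return per_butterfly * butterflies_per_stage * stages


# n = 1024 is an assumed transform size (see the note above).
for per_butterfly in (3, 7, 11):
    print(per_butterfly, radix4_complex_mults(1024, per_butterfly))
# prints:
# 3 3840
# 7 8960
# 11 14080
```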

Inner loop of our algorithm
T ← A[k+j]
U ← w*A[k+j+m/4]
V ← w*w*A[k+j+m/2]
X ← w*w*w*A[k+j+3*m/4]
A[k+j] ← T+U+V+X
A[k+j+m/4] ← T+(i)U-V-(i)X
A[k+j+2m/4] ← T-U+V-X
A[k+j+3m/4] ← T-(i)U-V+(i)X
w ← w*w_m
Total number of multiplications in this loop: 11
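The loop above can be sketched as a complete in-place radix-4 transform. This is a minimal illustration, not the authors' code: the function names, the base-4 digit-reversal step, and the loop order are assumptions, and w_m = exp(2*pi*i/m) is assumed because the +i on the second output implies that sign convention.

```python
import cmath


def digit_reverse_base4(a):
    """Return a copy of a permuted by base-4 digit reversal of the index."""
    n = len(a)
    digits = 0
    while 4 ** digits < n:
        digits += 1
    if 4 ** digits != n:
        raise ValueError("length must be a power of 4")
    out = [0j] * n
    for i in range(n):
        r, t = 0, i
        for _ in range(digits):
            r = r * 4 + (t % 4)
            t //= 4
        out[r] = a[i]
    return out


def fft_radix4(x):
    """Radix-4 FFT with a running twiddle w, mirroring the slide's inner loop.

    Computes X[k] = sum_t x[t] * exp(+2*pi*i*k*t/n) (assumed sign convention).
    """
    a = digit_reverse_base4(list(x))
    n = len(a)
    m = 4
    while m <= n:
        w_m = cmath.exp(2j * cmath.pi / m)
        q = m // 4
        for k in range(0, n, m):
            w = 1 + 0j
            for j in range(q):
                t = a[k + j]
                u = w * a[k + j + q]              # 1 multiplication
                v = w * w * a[k + j + 2 * q]      # 2 multiplications
                xx = w * w * w * a[k + j + 3 * q]  # 3 multiplications
                a[k + j] = t + u + v + xx
                a[k + j + q] = t + 1j * u - v - 1j * xx
                a[k + j + 2 * q] = t - u + v - xx
                a[k + j + 3 * q] = t - 1j * u - v + 1j * xx
                w = w * w_m                        # 1 multiplication
        m *= 4
    return a
```

Counting the four multiplications by i together with the seven marked above gives the 11 per butterfly quoted on the slide.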

New inner loop of our algorithm
T ← A[k+j]
U ← twiddle[k]*A[k+j+m/4]
V ← twiddle[2*k]*A[k+j+m/2]
X ← twiddle[3*k]*A[k+j+3*m/4]
A[k+j] ← T+U+V+X
A[k+j+m/4] ← T+i*U-V-i*X
A[k+j+2m/4] ← T-U+V-X
A[k+j+3m/4] ← T-i*U-V+i*X
Total number of multiplications in this loop: 3
Total: (3/4) * n * log4(n) = 3840
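The table-lookup idea can be sketched as follows. A recursive formulation is used here only to keep the example short (the slide's loop works in place), and every name is hypothetical. The table layout twiddle[t] = exp(2*pi*i*t/n) and the index s = j * stride, which plays the role of the slide's k in twiddle[k], are assumptions.

```python
import cmath


def make_twiddle(n):
    """Precompute twiddle[t] = exp(2*pi*i*t/n) once for the whole transform."""
    return [cmath.exp(2j * cmath.pi * t / n) for t in range(n)]


def _fft4(x, twiddle, stride):
    """Radix-4 decimation-in-time step using only table lookups for twiddles."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    # Transform the four interleaved subsequences x[r], x[r+4], x[r+8], ...
    sub = [_fft4(x[r::4], twiddle, stride * 4) for r in range(4)]
    out = [0j] * n
    for j in range(q):
        s = j * stride  # table index for w_n**j
        t = sub[0][j]
        u = twiddle[s] * sub[1][j]       # the only three complex
        v = twiddle[2 * s] * sub[2][j]   # multiplications per butterfly
        xx = twiddle[3 * s] * sub[3][j]
        out[j] = t + u + v + xx
        out[j + q] = t + 1j * u - v - 1j * xx
        out[j + 2 * q] = t - u + v - xx
        out[j + 3 * q] = t - 1j * u - v + 1j * xx
    return out


def fft_radix4_table(x):
    """Compute X[k] = sum_t x[t] * exp(+2*pi*i*k*t/n) for n a power of 4."""
    n = len(x)
    size = 1
    while size < n:
        size *= 4
    if n == 0 or size != n:
        raise ValueError("length must be a power of 4")
    return _fft4(list(x), make_twiddle(n), 1)
```

The multiplications by i are not counted as complex multiplications because they only swap and negate the real and imaginary parts.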

Stuff we tried
- Improved the “bit reversal”
  - Better than last time
  - Though still inefficient (O(n log n)), it runs faster than the previous implementation
  - Many faster algorithms still exist
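For reference, a straightforward O(n log n) bit reversal can be sketched as below. This is an illustrative baseline, not the implementation used in the project; the function name is hypothetical. Each index is reversed bit by bit, which costs log2(n) steps per element.

```python
def bit_reverse_permute(a):
    """Permute list a in place so that a[i] moves to the bit-reversed index.

    Cost is O(n log n): every index is reversed one bit at a time.
    """
    n = len(a)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of 2")
    for i in range(n):
        r, t = 0, i
        for _ in range(bits):
            r = (r << 1) | (t & 1)
            t >>= 1
        if i < r:  # swap each pair only once
            a[i], a[r] = a[r], a[i]
    return a


print(bit_reverse_permute(list(range(8))))
# prints [0, 4, 2, 6, 1, 5, 3, 7]
```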

System Specifications
- Processor: Intel Pentium 4 CPU, 3.00 GHz
- Cache size: 1 MB
- RAM: 1 GB
- Flags supported: sse, sse2

Results
User time (ms) for 1024 points (single iteration)

Results
User time (ms) for 1024 points (10 iterations)

Results
User time (ms) for 4096 points (single iteration)

Results
User time (ms) for 4096 points (10 iterations)

Results
User time (ms) for points (single iteration)

Results
User time (ms) for points (10 iterations)

Analysis
- Results are comparable for the following reasons:
  - Change in bit reversal
  - Number of computations
- FFTW: compiling option gcc
- The code needs to be rewritten to handle an arbitrary number of points

Tabular Representation (1024 points)
Time (ms)
Columns: Recursive, Final, and FFTW, each as a single run and as 10 runs, each on icpc and on g++
Rows: Real, User, System

Tabular Representation (4096 points)
Time (ms)
Columns: Recursive, Final, and FFTW, each as a single run and as 10 runs, each on icpc and on g++
Rows: Real, User, System

Tabular Representation ( point)
Time (ms)
Columns: Recursive, Final, and FFTW, each as a single run and as 10 runs, each on icpc and on g++
Rows: Real, User, System
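The slides do not say how the Real, User, and System times were collected. One possible way to gather the same three figures for a single run or for 10 runs is sketched below; the helper name time_runs and the use of os.times are assumptions for illustration, not the project's measurement method.

```python
import os
import time


def time_runs(func, runs):
    """Run func `runs` times and return real, user and system time in ms."""
    cpu_start = os.times()
    wall_start = time.perf_counter()
    for _ in range(runs):
        func()
    wall_end = time.perf_counter()
    cpu_end = os.times()
    return {
        "real": (wall_end - wall_start) * 1e3,
        "user": (cpu_end.user - cpu_start.user) * 1e3,
        "system": (cpu_end.system - cpu_start.system) * 1e3,
    }


# Example: time a placeholder workload once and ten times.
workload = lambda: sum(i * i for i in range(10000))
single = time_runs(workload, 1)
ten = time_runs(workload, 10)
```

For transforms as small as 1024 points, a single run may be shorter than the resolution of the CPU-time counters, which is one reason to also time repeated runs.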

VTune Analysis
- TODO: VTune (not available)

Further Improvements
- Fast digit reversal
- Fast “twiddle compute”
- TODO:
  - Comparison with Intel Math Kernel Library
  - Study FFTW implementation
  - VTune analysis
- Try Winograd algorithm
- Code more efficiently

References
- Alan H. Karp, “Bit Reversal on Uniprocessors”
- Angelo A. Yong, “A better FFT Bit-reversal Algorithm”

Thank You