FFT Accelerator Project
Rohit Prakash, Anand Silodia


Work done till now
– Studied various FFT algorithms
– Implemented radix-4, recursive, and iterative algorithms
– Optimized these implementations
– Compared the results with FFTW
– Result: FFTW fares better than our implementation

Current Objectives
– Validate the number of complex computations in our implementation against the theoretical count
– Document the work done so far
– Make a website for the project
– Study the FFTW code (and figure out the reasons for its efficiency)
– Run the code under the Intel compiler (icc) and Visual C++

Validating the computations
– The theoretical formula we first used (from cnx.org) was incorrect
– Correct formula for the number of complex computations: (11/4) n log4(n) = 8960
– Incorrect formula: (3/4) n log4(n) = 3840
– Actual count in our implementation: 8960

Documentation and website
– The website of the project is up
– It includes the details and results of our experiments (up to last week)

Running on the Intel compiler (icc)
– No improvement
– Possible reasons:
  – Tested on an Intel Pentium Mobile
  – This processor does not support optimizations such as exploiting SSE3 instructions (the -fast flag)

FFTW code
– 56,489+ LOC (contains code written in OCaml and C)
– We decided to study why FFTW is so fast before going into the code itself
– Texts we came across in this context:
  – "The Design and Implementation of FFTW3" (Matteo Frigo and Steven G. Johnson)
  – The FFTW documentation

Why is FFTW fast?
– The transform is computed by an executor, composed of highly optimized, composable blocks of C code called codelets
– At runtime, a planner finds an efficient way to compose codelets: it measures the speed of different plans and chooses the best using a dynamic programming algorithm
– The executor interprets the plan with negligible overhead
– Codelets are generated automatically and are fast

Contd.
– The executor implements the recursive divide-and-conquer Cooley-Tukey FFT algorithm
– Basically, FFTW adapts to the hardware in order to maximize performance
– "Performance has little to do with the number of operations. Fast code must exploit the instruction-level parallelism of the processor. It is important to write the code in such a way that the C compiler can schedule it efficiently."

Contd.
– It uses some tricky optimizations
– It also exploits SIMD instructions

Further plan
– Since FFTW supports MPI and adapts itself to the given hardware architecture, we may use it as is.

References
– Matteo Frigo and Steven G. Johnson, "The Design and Implementation of FFTW3"
– Matteo Frigo and Steven G. Johnson, "The Fastest Fourier Transform in the West"

Thank You