FFT Accelerator Project
Rohit Prakash, Anand Silodia
Date: June 7th, 2007

Objectives
– Analysis using random input points
– Percentage improvement over the previous implementations
– Cache profiling

Improvements
– Fewer calls to sine/cosine
– Separate arrays for powers and some other terms
  – Fewer divisions
  – Fewer multiplications
– Error from last time corrected (FFTW floating point)
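The slide does not say which terms were moved into separate arrays. As one possible illustration, the sketch below fills a table of per-stage strides for a radix-4 transform once, so that inner loops can look the value up instead of recomputing it with a division. The function name `fill_radix4_strides` and its parameters are hypothetical, and the code assumes the transform length is a power of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill stride[s] = n / 4^(s+1) for each radix-4 stage.
   Assumes n is a power of 4. Returns the number of stages written. */
static size_t fill_radix4_strides(size_t n, size_t *stride, size_t max_stages)
{
    size_t stages = 0;
    assert(n > 0 && stride != NULL);
    while (n > 1 && stages < max_stages) {
        n >>= 2;                 /* one shift per stage, done once up front */
        stride[stages++] = n;    /* butterflies later read this entry */
    }
    return stages;
}
```

For a 64-point transform this gives three stages with strides 16, 4 and 1. The same idea applies to sine/cosine values: computing them once into a table replaces repeated library calls inside the butterfly loops.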

System Configuration
– CPU: Intel Pentium 4 (HT), 3.0 GHz
– RAM: 1 GB
– Cache: 1 MB L2
– OS: Fedora Core 3
– Compiler: icc
– Flags used: -xW, -O3, -ipo-prec-div, -static

User time: vs. FFTW (single precision)
– Radix-4 runs 1.5 times slower than FFTW
– Radix-8 runs 1.6 times slower than FFTW
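The slides do not show how user time was collected. A minimal sketch of one way to read it on a POSIX system such as the Fedora machine above is given here, using `getrusage`. The helper names `user_seconds` and `slowdown` are hypothetical.

```c
#include <assert.h>
#include <sys/resource.h>

/* User CPU time consumed so far by this process, in seconds.
   Returns -1.0 if getrusage fails. */
static double user_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}

/* Ratio of our user time to a reference time (for example FFTW's). */
static double slowdown(double ours, double reference)
{
    assert(reference > 0.0);
    return ours / reference;
}
```

Calling `user_seconds()` before and after a transform and subtracting gives the user time of that run; `slowdown(3.0, 2.0)` returns 1.5, the kind of ratio quoted for radix-4.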

User time: previous (double) vs. new (float)
– Approximately 20% improvement

User time: previous (double) vs. new (float)
– Approximately 19% improvement

Cache Organization

Cache level   Size    Associativity   Line size (bytes)
L2            1 MB    8-way           64
I1            16 KB   4-way           64
D1            16 KB   4-way           64
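To relate these parameters to FFT access patterns, the sketch below works out the number of sets in each cache and the set an address falls in. It assumes the usual set-associative indexing (line number modulo set count); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets = cache size / (associativity * line size). */
static size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    return size_bytes / (ways * line_bytes);
}

/* Set selected by a byte address, assuming simple modulo indexing. */
static size_t set_index(size_t addr, size_t sets, size_t line_bytes)
{
    assert(sets > 0 && line_bytes > 0);
    return (addr / line_bytes) % sets;
}
```

With the table's values, L2 has 2048 sets and D1 has 64 sets. Under this indexing, addresses 4096 bytes apart map to the same D1 set, so a butterfly that touches more than four such addresses would compete for the four ways of that set.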

Radix-4 L2 misses
– Approximately 30% fewer L2 misses

Radix-4 D1 misses
– Approximately 1.6% fewer D1 misses

Radix-8 L2 misses
– Approximately 13.6% fewer L2 misses

Radix-8 D1 misses
– Approximately 0.96% fewer D1 misses

Profiling results: using VTune and gprof (five pairs of slides; figures not reproduced)

Further Improvements: use SSE instructions
– Vectorize the loop

    T ← A[r]
    U ← w*A[r+p]
    V ← w*w*A[r+2*p]
    W ← w*w*w*A[r+3*p]

    Complex temp[4];
    for (i = 1; i < 4; i++) {
        temp[i] = twiddle[i*p] * A[r + i*l];
    }
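The twiddle multiplications in the loop above are complex products, which map onto packed single-precision SSE operations. The sketch below is one possible form, not the authors' code: it multiplies two interleaved complex pairs at once using SSE intrinsics from `xmmintrin.h`, with a scalar fallback when SSE is unavailable. The function name `cmul2` and the `{re0, im0, re1, im1}` layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out[k] = a[k] * b[k] for two complex floats stored as {re0, im0, re1, im1}. */
static void cmul2(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);                                  /* a0r a0i a1r a1i */
    __m128 vb = _mm_loadu_ps(b);                                  /* b0r b0i b1r b1i */
    __m128 vbr = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 2, 0, 0)); /* b0r b0r b1r b1r */
    __m128 vbi = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 3, 1, 1)); /* b0i b0i b1i b1i */
    __m128 vas = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1)); /* a0i a0r a1i a1r */
    __m128 sign = _mm_set_ps(1.0f, -1.0f, 1.0f, -1.0f);           /* -1 +1 -1 +1 in lane order */
    __m128 t1 = _mm_mul_ps(va, vbr);                              /* ar*br, ai*br */
    __m128 t2 = _mm_mul_ps(_mm_mul_ps(vas, vbi), sign);           /* -ai*bi, +ar*bi */
    _mm_storeu_ps(out, _mm_add_ps(t1, t2));
#else
    for (int k = 0; k < 2; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;   /* real part */
        out[2 * k + 1] = ai * br + ar * bi;   /* imaginary part */
    }
#endif
}
```

For example, with `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}` the result is `{-7, 16, -11, 52}`, that is (1+2i)(5+6i) and (3+4i)(7+8i). Unaligned loads are used so the sketch works on any buffer; aligned loads would need 16-byte aligned arrays.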

Thank You