18.337 Parallel FFT in Julia

Presentation transcript:

Parallel FFT in Julia

Review of FFT

Sequential FFT Pseudocode
Recursive-FFT ( array )
- arrayEven = even indexed elements of array
- arrayOdd = odd indexed elements of array
- Recursive-FFT ( arrayEven )
- Recursive-FFT ( arrayOdd )
- Combine results using Cooley-Tukey butterfly
- Bit reversal, could be done either before, after or in between
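
The pseudocode above can be made concrete as a small sketch. It is written in plain Python rather than Julia, purely for illustration; the function name `recursive_fft` is hypothetical, and the input length is assumed to be a power of 2. In this form the even/odd slicing performs the bit reversal implicitly, so no separate reordering pass is needed.

```python
import cmath

def recursive_fft(x):
    """Radix-2 decimation-in-time FFT. Assumes len(x) is a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = recursive_fft(x[0::2])   # transform of even-indexed elements
    odd = recursive_fft(x[1::2])    # transform of odd-indexed elements
    out = [0j] * n
    for k in range(n // 2):
        # Cooley-Tukey butterfly: combine the two half-size transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each level of the recursion does O(n) butterfly work and there are log2(n) levels, which is where the O(n log n) cost comes from.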

Parallel FFT Algorithms
- Binary Exchange Algorithm
- Transpose Algorithm

Binary Exchange Algorithm
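
The slide content for this algorithm did not survive in the transcript, so the following is only a sketch of the standard textbook formulation, not the presenter's code. It assumes N elements in natural order, block-distributed over P processors (both powers of 2), with butterfly distances halving from N/2 down to 1. The helper names `exchange_stages` and `partner_rank` are hypothetical. The point it illustrates: only the first log2(P) stages pair elements owned by different processors, and in those stages each processor exchanges data with exactly one partner.

```python
def exchange_stages(n, p):
    """List (butterfly distance, needs_communication) for every FFT stage.

    Assumes n and p are powers of 2 with p <= n, and that processor r
    owns the contiguous block of indices [r * n/p, (r + 1) * n/p).
    """
    block = n // p
    stages = []
    dist = n // 2
    while dist >= 1:
        # Index i is paired with i XOR dist. If dist >= block, that flips
        # a bit of the owner rank, so the partner element is remote.
        stages.append((dist, dist >= block))
        dist //= 2
    return stages

def partner_rank(rank, dist, block):
    """Rank holding the butterfly partners of `rank` at distance `dist`."""
    if dist < block:
        return rank                    # partner elements are local
    return rank ^ (dist // block)      # flip one bit of the rank

# For n = 16, p = 4: [(8, True), (4, True), (2, False), (1, False)]
print(exchange_stages(16, 4))
```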

Parallel Transpose Algorithm
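
The slide body is missing from the transcript, so what follows is a sketch of one common formulation of the transpose approach (often called the four-step FFT), which may differ from what the slide showed. The input of length N = n1 * n2 is viewed as an n1 x n2 row-major matrix. Every 1D transform is local to one row or column, and the transpose in the middle is the only step that would need all-to-all communication. The names `local_fft` and `fft_transpose` are hypothetical, and `local_fft` is a naive DFT standing in for any sequential FFT.

```python
import cmath

def local_fft(x):
    """Stand-in for a sequential FFT (naive O(n^2) DFT for brevity)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_transpose(x, n1, n2):
    """FFT of x (length n1 * n2) via column FFTs, twiddles, transpose, row FFTs."""
    n = n1 * n2
    # Step 1: transform each column; x[j2::n2] is column j2 of the matrix.
    cols = [local_fft(x[j2::n2]) for j2 in range(n2)]        # cols[j2][k1]
    # Step 2: multiply by the twiddle factors w_N^(j2 * k1).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Step 3: transpose (the all-to-all exchange in a distributed setting).
    rows = [[cols[j2][k1] for j2 in range(n2)] for k1 in range(n1)]
    # Step 4: transform each row; rows[k1][k2] is output element k1 + n1 * k2.
    rows = [local_fft(r) for r in rows]
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = rows[k1][k2]
    return out
```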

Julia implementations
- Input is represented as distributed arrays.
- Assumption: N and P are powers of 2, for the sake of simplicity.
- Focus is on minimizing communication overhead rather than on computational optimizations.

Easy
Recursive-FFT ( array )
- ………….
- Recursive-FFT ( arrayEven )
- Recursive-FFT ( arrayOdd )
- ………….
Same as FFTW parallel implementation for 1D input using Cilk.

Too much unnecessary overhead because of random spawning. Better:
Recursive-FFT ( array )
- ………….
- owner Recursive-FFT ( arrayEven )
- owner Recursive-FFT ( arrayOdd )
- ………….
Not so fast

Binary Exchange Implementation
FFT_Parallel ( array )
- Bit reverse input array and distribute
- owner Recursive-FFT ( first half )
- owner Recursive-FFT ( last half )
- Combine results
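
A single-process sketch of this scheme, in Python for illustration only (`bit_reverse`, `fft_halves` and `fft_parallel_sketch` are hypothetical names, and no actual spawning happens here). Once the input is in bit-reversed order, the even-indexed elements occupy the first half and the odd-indexed elements the last half, so each recursive call works on a contiguous range. That contiguity is what would let each call run on the processor that owns the range.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_halves(a, lo, n):
    """In-place FFT of a[lo:lo+n], assuming that range is bit-reversed."""
    if n == 1:
        return
    half = n // 2
    fft_halves(a, lo, half)          # would run on the owner of the first half
    fft_halves(a, lo + half, half)   # would run on the owner of the last half
    for k in range(half):            # combine with butterflies
        t = cmath.exp(-2j * cmath.pi * k / n) * a[lo + half + k]
        u = a[lo + k]
        a[lo + k] = u + t
        a[lo + half + k] = u - t

def fft_parallel_sketch(x):
    """Bit reverse the input, then recurse on contiguous halves."""
    n = len(x)                       # assumed to be a power of 2
    bits = n.bit_length() - 1
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    fft_halves(a, 0, n)
    return a
```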

Alternate approach – Black box
- Initially similar to parallel transpose method: data is distributed so that each sub-problem is locally contained within one node
FFT_Parallel ( array )
- Bit reverse input array and distribute equally
- for each processor proc: FFT-Sequential ( local array )
- Redistribute data and combine locally
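
A rough single-process simulation of the black-box idea, in Python for illustration (all names are hypothetical, and `local_fft` is a naive DFT standing in for whatever sequential package is used). One assumption is made explicit here: each simulated worker receives its stride-P subsequence in natural order, with the workers themselves arranged in bit-reversed order, so that an off-the-shelf FFT can consume each chunk unchanged. The order in which data reaches each sub-problem matters; handing a worker a bit-reversed chunk would give wrong results with a transform that expects natural order.

```python
import cmath

def local_fft(x):
    """Stand-in for a black-box sequential FFT (naive O(n^2) DFT)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_black_box(x, p):
    """Simulate p workers; len(x) and p are powers of 2 with p <= len(x)."""
    pbits = p.bit_length() - 1
    # Distribute: worker w gets x[bit_reverse(w)::p] and transforms it locally.
    chunks = [local_fft(x[bit_reverse(w, pbits)::p]) for w in range(p)]
    # Combine: log2(p) rounds, each merging adjacent pairs with butterflies.
    while len(chunks) > 1:
        merged = []
        for i in range(0, len(chunks), 2):
            even, odd = chunks[i], chunks[i + 1]
            m = len(even)
            out = [0j] * (2 * m)
            for k in range(m):
                t = cmath.exp(-2j * cmath.pi * k / (2 * m)) * odd[k]
                out[k] = even[k] + t
                out[k + m] = even[k] - t
            merged.append(out)
        chunks = merged
    return chunks[0]
```

In a distributed version the combine rounds are where redistribution cost would appear, since each merge needs both operands on the same node.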

Alternate approach – Black box
Pros:
- Eliminates the need for redundant spawning, which is expensive
- Can leverage black box packages such as FFTW for the local sequential run, or black box combiners
- Warning: Order of input to sub-problems is important
Note:
- Have not tried FFTW due to configuration issues

Benchmark Caveats
array = redist(array, 1, 5)

Benchmark Results
[Chart. Series: FFTW, Binary Exchange, Black Box, Communication, Black Box 8p, Communication 8p. Annotations: "A year?", "Apocalypse"]

Communication issues

New Results
[Chart. Series: FFTW, Black Box Improved Redistribution, Communication for Improved Redistribution, Black Box Improved Redistribution 8p, Communication for Improved Redistribution 8p]

More issues and considerations
- Communication cost: where and how. Better redistribution method.
- Leverage of sequential FFTW on black box problems
- A separate algorithm, better data distribution?

Questions