Parallel FFT Sathish Vadhiyar.

Slides:



Advertisements
Similar presentations
General algorithmic techniques: Balanced binary tree technique Doubling technique: List Ranking Problem Divide and concur Lecture 6.
Advertisements

PRAM Algorithms Sathish Vadhiyar. PRAM Model - Introduction Parallel Random Access Machine Allows parallel-algorithm designers to treat processing power.
Parallel Processing (CS 730) Lecture 7: Shared Memory FFTs*
The Study of Cache Oblivious Algorithms Prepared by Jia Guo.
Parallel Fast Fourier Transform Ryan Liu. Introduction The Discrete Fourier Transform could be applied in science and engineering. Examples: ◦ Voice recognition.
Parallel Sorting Sathish Vadhiyar. Sorting  Sorting n keys over p processors  Sort and move the keys to the appropriate processor so that every key.
Lecture 3: Parallel Algorithm Design
1 Parallel Parentheses Matching Plus Some Applications.
FFT1 The Fast Fourier Transform. FFT2 Outline and Reading Polynomial Multiplication Problem Primitive Roots of Unity (§10.4.1) The Discrete Fourier Transform.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
1 Tuesday, November 14, 2006 “UNIX was never designed to keep people from doing stupid things, because that policy would also keep them from doing clever.
1 Lecture 11 Sorting Parallel Computing Fall 2008.
FFT1 The Fast Fourier Transform by Jorge M. Trabal.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 16: Application-Driven Hardware Acceleration (1/4)
CS 684.
CS 584. Algorithm Analysis Assumptions n Consider ring, mesh, and hypercube. n Each process can either send or receive a single message at a time. n No.
CS 584. Sorting n One of the most common operations n Definition: –Arrange an unordered collection of elements into a monotonically increasing or decreasing.
Introduction to Algorithms
Fast Fourier Transform Irina Bobkova. Overview I. Polynomials II. The DFT and FFT III. Efficient implementations IV. Some problems.
Parallelizing the Fast Fourier Transform David Monismith cs599.
Lecture 12: Parallel Sorting Shantanu Dutt ECE Dept. UIC.
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 February 8, 2005 Session 8.
CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.
Outline  introduction  Sorting Networks  Bubble Sort and its Variants 2.
FFT1 The Fast Fourier Transform. FFT2 Outline and Reading Polynomial Multiplication Problem Primitive Roots of Unity (§10.4.1) The Discrete Fourier Transform.
The Fast Fourier Transform
Complexity 20-1 Complexity Andrei Bulatov Parallel Arithmetic.
Mar. 1, 2001Parallel Processing1 Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs * Jeremy R. Johnson Wed. Mar. 1, 2001 *Parts of this lecture.
Chapter 12 Binary Search and QuickSort Fundamentals of Java.
1 Fast Polynomial and Integer Multiplication Jeremy R. Johnson.
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
Data Structures and Algorithms in Parallel Computing Lecture 8.
Non-Linear Algebra Problems Sathish Vadhiyar SERC IISc.
Lecture 2 What is a computational problem? What is an instance of a problem? What is an algorithm? How to guarantee that an algorithm is correct? What.
Unit-8 Sorting Algorithms Prepared By:-H.M.PATEL.
Sorting: Parallel Compare Exchange Operation A parallel compare-exchange operation. Processes P i and P j send their elements to each other. Process P.
Divide and Conquer Algorithms Sathish Vadhiyar. Introduction  One of the important parallel algorithm models  The idea is to decompose the problem into.
CSG523/ Desain dan Analisis Algoritma
CSG523/ Desain dan Analisis Algoritma
Chapter 2 Divide-and-Conquer algorithms
Lecture 3: Parallel Algorithm Design
An Iterative FFT We rewrite the loop to calculate nkyk[1] once
Chapter 2 Divide-and-Conquer algorithms
Data Structures and Algorithms (AT70. 02) Comp. Sc. and Inf. Mgmt
Divide-and-Conquer Design
Introduction to parallel algorithms
September 4, 1997 Applied Symbolic Computation (CS 300) Fast Polynomial and Integer Multiplication Jeremy R. Johnson.
Lecture 16: Parallel Algorithms I
Quick-Sort 11/14/2018 2:17 PM Chapter 4: Sorting    7 9
Quick-Sort 11/19/ :46 AM Chapter 4: Sorting    7 9
Communication costs of Schönhage-Strassen fast integer multiplication
Applied Symbolic Computation
FAST FOURIER TRANSFORM ALGORITHMS
Algorithm An algorithm is a finite set of steps required to solve a problem. An algorithm must have following properties: Input: An algorithm must have.
DFT and FFT By using the complex roots of unity, we can evaluate and interpolate a polynomial in O(n lg n) An example, here are the solutions to 8 =
4.1 DFT In practice the Fourier components of data are obtained by digital computation rather than by analog processing. The analog values have to be.
September 4, 1997 Applied Symbolic Computation (CS 300) Fast Polynomial and Integer Multiplication Jeremy R. Johnson.
The Fast Fourier Transform
Introduction to parallel algorithms
September 4, 1997 Applied Symbolic Computation (CS 567) Fast Polynomial and Integer Multiplication Jeremy R. Johnson.
Applied Symbolic Computation
Richard Anderson Lecture 14 Inversions, Multiplication, FFT
Quick-Sort 2/23/2019 1:48 AM Chapter 4: Sorting    7 9
(2,4) Trees 2/28/2019 3:21 AM Sorting Lower Bound Sorting Lower Bound.
Zhongguo Liu Biomedical Engineering
The Fast Fourier Transform
Fast Fourier Transform (FFT) Algorithms
Fast Polynomial and Integer Multiplication
Bucket-Sort and Radix-Sort
Introduction to parallel algorithms
Presentation transcript:

Parallel FFT Sathish Vadhiyar

Sequential FFT – Quick Review Twiddle factor – primitive nth root of unity in complex plane -

Sequential FFT – Quick Review (n/2)th root of unity 2 (n/2)-point DFTs

Sequential FFT – quick review X(0) Y(0) wn0 wn0 wn0 X(4) Y(1) wn4 wn2 wn1 X(2) Y(2) wn0 wn4 wn2 X(6) Y(3) wn4 wn6 wn3 X(1) Y(4) wn0 wn0 wn4 X(5) Y(5) wn4 wn2 wn5 X(3) Y(6) wn0 wn4 wn6 X(7) Y(7) wn4 wn6 wn7

Sequential FFT – recursive solution 1. procedure R_FFT(X, Y, n, w) 2. if (n=1) then Y[0] := X[0] else 3. begin 4. R_FFT(<X(0), X(2), …, X[n-2]>, <Q[0], Q[1], …, Q[n/2]>, n/2, w2) 5. R_FFT(<X(1), X(3), …, X[n-1]>, <T[0], T[1], …, T[n/2]>, n/2, w2) 6. for i := 0 to n-1 do 7. Y[i] := Q[i mod (n/2)] + wiT(i mod (n/2)]; 8. end R_FFT

Sequential FFT – iterative solution 1. procedure ITERATIVE_FFT(X, Y, n) 2. begin 3. r := log n; 4. for i:= 0 to n-1 do R[i] := X[i]; 5. for m:= 0 to r-1 do 6. begin 7. for i:= 0 to n-1 do S[i] := R[i]; 8. for i:= 0 to n-1 do 9. begin /* Let (b0, b1, b2, … br-1) be the binary representation of i */ 10. j := (b0 … bm-10bm+1 .. br-1); 11. k := (b0 … bm-11bm+1 .. br-1); 12. R[i] := S[j] + S[k] x w(bmbm-1..b00..0) ; 13. endfor; 14. endfor; 15. for i:= 0 to n-1 do Y[i] := R[i]; 16. end ITERATIVE_FFT

Example of w calculation m/i 1 2 3 4 5 6 7 000 100 010 110 001 101 011 111 For a given m and i, the power of w is computed by reversing the order of the m+1 most significant bits of i and padding them by 0’s to the right.

Parallel FFT – Binary exchange 000 P0 001 X(1) Y(4) 010 X(2) Y(2) P1 011 X(3) Y(6) 100 X(4) Y(1) P2 101 X(5) Y(5) 110 X(3) Y(3) P3 111 X(7) Y(7) d r

Binary Exchange d – number of bits for representing processes; r – number of bits representing the elements The d most significant bits of element i indicate the process that the element belongs to. Only the first d of the r iterations require communication In a given iteration, m, a process i communicates with only one other process obtained by flipping the (m+1)th MSB of i Total execution time - ? (n/P)logN + logP(l) + (n/P)logP (b)

Parallel FFT – 2D Transpose 1 2 3 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 m = 0 m = 1 Phase 1 – FFTs along columns

Parallel FFT – 2D Transpose 1 2 3 4 8 12 4 5 6 7 1 5 9 13 8 9 10 11 2 6 10 14 12 13 14 15 3 7 11 15 Phase 2 – Transpose

Parallel FFT – 2D Transpose 4 8 12 4 8 12 1 5 9 13 1 5 9 13 2 6 10 14 2 6 10 14 3 7 11 15 3 7 11 15 m = 2 m = 3 Phase 3 – FFTs along columns

2D Transpose In general, n elements arranged as √n x √n p processes arranged along columns. Each process owns √n/p columns Each process does √n/p FFTs of size √n each Parallel runtime – 2(√n/p)√nlog√n + (p-1)(l)+ n/p(b)

3D Transpose n1/3 x n1/3 x n1/3 elements √p x √p processes Steps ? Parallel runtime – (n/p)logn(c) + 2(√p-1)(l) + 2(n/p)(b)

In general For q dimensions: Parallel runtime – (n/p)logn + (q-1)(p1/(q-1) -1) [l]+ (q-1)(n/p) [b] Time due to latency decreases; due to bandwidth increases For implementation – only 2D and 3D transposes are feasible. Moreover, there are restrictions on n and p in terms of q.

Choice of algorithm Binary exchange – small latency, large bandwidth 2D transpose – large latency, small bandwidth Other transposes lie between binary exchange and 2D transpose For a given parallel computer, based on l and b, different algorithms can give different performances for different problem sizes