Slide 1: Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs*
Jeremy R. Johnson
Wed. Mar. 1, 2001
*Parts of this lecture were derived from material by Johnson, Johnson, and Pryor.

Slide 2: Introduction
Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).
Topics:
- Derivation of the FFT
  - Iterative version
  - Pease algorithm and generalizations
  - Tensor permutations
- Distributed implementation of tensor permutations
  - Stride permutation
  - Bit reversal
- Distributed FFT

Slide 3: FFT as a Matrix Factorization
Compute y = F_n x, where F_n is the n-point Fourier matrix.
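The factorization itself does not survive in the transcript; as a reference, the radix-2 Cooley-Tukey factorization implied by the factor names in the code on slide 4 can be written as

$$
F_n = (F_2 \otimes I_m)\, T^n_m\, (I_2 \otimes F_m)\, L^n_2, \qquad n = 2m,
$$

where L^n_2 is the stride-2 permutation, T^n_m is the diagonal twiddle-factor matrix, and ⊗ is the tensor (Kronecker) product.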

Slide 4: Matrix Factorizations and Algorithms

function y = fft(x)
  n = length(x);
  if n == 1
    y = x;
  else
    % [x0 x1] = L^n_2 x
    x0 = x(1:2:n-1); x1 = x(2:2:n);
    % [t0 t1] = (I_2 tensor F_m) [x0 x1]
    t0 = fft(x0); t1 = fft(x1);
    % w = W_m(omega_n)
    w = exp((2*pi*i/n)*(0:n/2-1));
    % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
    y0 = t0 + w.*t1; y1 = t0 - w.*t1;
    y = [y0 y1];
  end
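A quick way to sanity-check this recursion (a hedged sketch: the rename to fft_ct is ours, to avoid shadowing MATLAB's builtin fft, and x must be a row vector for the concatenation [y0 y1] to work):

% Save the function above as fft_ct.m, with the recursive calls renamed to match.
n = 8;
x = rand(1, n);                            % row vector input
F = exp((2*pi*i/n) * (0:n-1)' * (0:n-1));  % n-point Fourier matrix, omega_n = e^(2*pi*i/n)
err = norm(fft_ct(x).' - F * x.')          % should be on the order of 1e-15

Note the positive exponent: the code uses the convention omega_n = e^(2*pi*i/n), so it differs from MATLAB's builtin fft (which uses the negative exponent) by conjugation.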

Slide 5: Rewrite Rules
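The rules themselves are lost in the transcript; representative rewrite rules from the tensor product framework (a reconstruction, not necessarily the slide's exact list) include

$$
A \otimes B = (A \otimes I_n)(I_m \otimes B), \qquad
I_m \otimes (I_n \otimes A) = I_{mn} \otimes A, \qquad
L^{mn}_n\,(A_m \otimes B_n)\,L^{mn}_m = B_n \otimes A_m,
$$

where subscripts give matrix sizes. The first rule separates parallel and vector structure; the commutation rule exchanges the two tensor factors at the cost of stride permutations, which is how the FFT variants on the next slide are derived from one another.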

Slide 6: FFT Variants
- Cooley-Tukey recursive FFT
- Iterative FFT
- Vector FFT (Stockham)
- Vector FFT (Korn-Lambiotte)
- Parallel FFT (Pease)

Slide 7: Example TPL Programs

; Recursive 8-point FFT
(compose (tensor (F 2) (I 4))
         (T 8 4)
         (tensor (I 2) (compose (tensor (F 2) (I 2))
                                (T 4 2)
                                (tensor (I 2) (F 2))
                                (L 4 2)))
         (L 8 2))

; Iterative 8-point FFT
(compose (tensor (F 2) (I 4))
         (T 8 4)
         (tensor (I 2) (F 2) (I 2))
         (tensor (I 2) (T 4 2))
         (tensor (I 4) (F 2))
         (tensor (I 2) (L 4 2))
         (L 8 2))

Slide 8: FFT Dataflow
Different formulas for the FFT have different dataflow (memory access patterns). The dataflow in a class of FFT algorithms can be described by a sequence of permutations. An "FFT dataflow" is a sequence of permutations that can be modified, by inserting butterfly computations with appropriate twiddle factors, to form a factorization of the Fourier matrix. FFT dataflows can be classified with respect to cost and used to find "good" FFT implementations.

Slide 9: Distributed FFT Algorithm
Experiment with different dataflow and locality properties by changing the radix and the permutations.

Slide 10: Cooley-Tukey Dataflow

Slide 11: Pease Dataflow

Slide 12: Tensor Permutations
A natural class of permutations compatible with the FFT. Let σ be a permutation of {1,…,t}; the corresponding tensor permutation is the mixed-radix counting permutation of vector indices induced by σ. Well-known examples are stride permutations and bit reversal.
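The slide's formula is not legible in the transcript; one standard way to write the definition (a reconstruction, with the direction in which σ acts a matter of convention) is, for N = n_1 n_2 ⋯ n_t and basis vectors indexed by digits 0 ≤ i_j < n_j:

$$
P_\sigma \left( e_{i_1} \otimes e_{i_2} \otimes \cdots \otimes e_{i_t} \right)
= e_{i_{\sigma(1)}} \otimes e_{i_{\sigma(2)}} \otimes \cdots \otimes e_{i_{\sigma(t)}}.
$$

When every n_j = 2, the digits are bits and P_σ simply permutes the bit positions of each index: a cyclic shift of bit positions gives a stride permutation, and reversing all bit positions gives bit reversal.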

Slide 13: Example (Stride Permutation)
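The example itself is a figure; as an illustration, here is a hedged MATLAB sketch of the stride permutation y = L^n_s x, generalizing the stride-2 gather used in the code on slide 4:

n = 8; s = 2;
x = (0:n-1);                        % label each element with its index
% L^n_s gathers elements at stride s: [x(1:s:n), x(2:s:n), ..., x(s:s:n)].
% Equivalently: view x as an s x (n/s) matrix (column-major) and transpose it.
y = reshape(reshape(x, s, n/s).', 1, n)
% y = [0 2 4 6 1 3 5 7], matching x0 = x(1:2:n-1), x1 = x(2:2:n) on slide 4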

Slide 14: Example (Bit Reversal)
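Again the example is a figure; a hedged MATLAB sketch of the bit-reversal permutation on n = 2^t points, using only base MATLAB:

n = 8; t = log2(n);
x = (0:n-1);
idx = bin2dec(fliplr(dec2bin(0:n-1, t))).';  % reverse the t binary digits of each index
y = x(idx + 1)
% y = [0 4 2 6 1 5 3 7]: index b2 b1 b0 maps to b0 b1 b2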

Slide 15: Twiddle Factor Matrix
Diagonal matrix containing roots of unity. The generalized twiddle matrix is compatible with tensor permutations.
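The matrix itself is lost in the transcript; in the radix-2 case used by the code on slide 4 it can be reconstructed as

$$
T^n_m = \begin{pmatrix} I_m & \\ & W_m(\omega_n) \end{pmatrix}, \qquad
W_m(\omega_n) = \operatorname{diag}\!\left(1, \omega_n, \ldots, \omega_n^{m-1}\right), \qquad n = 2m,
$$

with ω_n = e^{2πi/n}, matching w = exp((2*pi*i/n)*(0:n/2-1)) in that code. The generalized twiddle matrix (denoted with I, J, and (n_1,…,n_t) on slide 17) extends this to arbitrary factorizations; its precise form does not survive in the transcript.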

Slide 16: Distributed Computation
Allocate equal-sized segments of the vector to each processor, and index the distributed vector with a pid and a local offset. Tensor product operations are then interpreted with this addressing scheme:

  pid = b_{k+l-1} … b_l   |   offset = b_{l-1} … b_1 b_0
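Concretely (a minimal sketch, assuming P = 2^k processors and M = 2^l elements per processor), the pid is the top k bits of a global index and the offset is the low l bits:

k = 3; l = 2;                  % 8 processors, 4 elements each
g = 0:2^(k+l)-1;               % global indices
pid    = bitshift(g, -l);      % top k bits
offset = bitand(g, 2^l - 1);   % low l bits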

Slide 17: Distributed Tensor Product and Twiddle Factors
Assume P processors. I_n ⊗ A becomes a parallel do over all processors when n ≥ P. Twiddle factors are determined independently from the pid and offset; the necessary bits are determined from I, J, and (n_1,…,n_t) in the generalized twiddle notation.
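The reason I_n ⊗ A parallelizes cleanly: applying I_n ⊗ A to a vector is n independent applications of A to contiguous length-m segments, which can be split across processors with no communication. A hedged sequential MATLAB sketch of this identity:

n = 4; m = 2;
A = rand(m, m);
x = rand(n*m, 1);
% (I_n tensor A) x = apply A independently to each length-m segment of x
y = reshape(A * reshape(x, m, n), n*m, 1);
% check against the explicit Kronecker product
err = norm(y - kron(eye(n), A) * x)   % should be on the order of 1e-15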

Mar. 1, 2001Parallel Processing18 Distributed Tensor Permutations b  (k+l-1) … b  (l) b  (l-1) ………... b  (1) b  (0) b k+l-1 ……b l b l-1 …...……... b 1 b 0 pidoffset

Slide 19: Classes of Distributed Tensor Permutations
1. Local (pid is fixed by σ): only permute elements locally within each processor.
2. Global (offset is fixed by σ): permute the entire local arrays amongst the processors.
3. Global*Local (bits in pid and bits in offset are moved by σ, but no bit crosses the pid/offset boundary): permute elements locally, followed by a global permutation.
4. Mixed (at least one offset bit and one pid bit are exchanged): elements from a processor are sent/received to/from more than one processor.
A sketch classifying a given bit permutation appears below.
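A minimal MATLAB sketch of this classification (classify_tensor_perm is our own hypothetical helper, not from the lecture):

function c = classify_tensor_perm(s, l)
% s: permutation of the bit positions 0..t-1, given as a vector where
%    s(i+1) is the destination position of source bit i.
% l: number of offset (low) bits; the remaining high bits form the pid.
  t = numel(s);
  src = 0:t-1;
  crosses  = any((src < l) ~= (s < l));          % some bit crosses the pid/offset boundary
  pidFixed = all(s(src >= l) == src(src >= l));  % every pid bit stays put
  offFixed = all(s(src <  l) == src(src <  l));  % every offset bit stays put
  if crosses
    c = 'Mixed';
  elseif pidFixed && offFixed
    c = 'Identity';
  elseif pidFixed
    c = 'Local';
  elseif offFixed
    c = 'Global';
  else
    c = 'Global*Local';
  end
end

For example, with t = 5 and l = 2, the right cyclic shift s = [4 0 1 2 3] (bit 0 moves to position 4) is classified as Mixed, which is why the distributed stride permutation on the following slides requires interprocessor communication.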

Slide 20: Distributed Stride Permutation
Address mapping for a distributed stride-2 permutation with 8 processors and 4 elements each (addresses written pid | offset, x a free bit; each destination address is the source address cyclically shifted right by one bit):

  000|x0 → 000|0x    100|x0 → 010|0x
  000|x1 → 100|0x    100|x1 → 110|0x
  001|x0 → 000|1x    101|x0 → 010|1x
  001|x1 → 100|1x    101|x1 → 110|1x
  010|x0 → 001|0x    110|x0 → 011|0x
  010|x1 → 101|0x    110|x1 → 111|0x
  011|x0 → 001|1x    111|x0 → 011|1x
  011|x1 → 101|1x    111|x1 → 111|1x

Slide 21: Communication Pattern
(Figure: communication pattern of the distributed stride permutation; surviving slice labels: X(0:2:6), Y(4:1:3), X(1:2:7), Y(0:1:7).)

Slide 22: Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs.

Slide 23: Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs.

Slide 24: Communication Pattern
Each PE sends 1/8 of its data to 8 different PEs.

Slide 25: Implementation of Distributed Stride Permutation

D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs:
//   Y, X  distributed vectors of size N = 2^t,
//         with M = 2^l elements per processor
//   P = 2^k  number of processors
//   S = 2^j, 0 <= j <= k, the stride
// Output:
//   Y = L^N_S X
p = pid
for i = 0, ..., 2^j - 1 do
  put x(i : S : i + S*(M/S - 1))
  in  y((M/S)*(p mod S) : (M/S)*(p mod S) + M/S - 1)
  on PE p/2^j + i*2^(k-j)
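A hedged sequential MATLAB model of this addressing (our own sketch, not the T3E code): the destination of each global index under L^N_S is its t-bit address cyclically shifted right by j bits, and splitting each address into pid and offset recovers the communication pattern of slides 20-24.

t = 5; k = 3; l = t - k;                 % N = 32 elements, 8 PEs, 4 elements each
j = 1; N = 2^t; S = 2^j; M = 2^l;
src = 0:N-1;
dst = floor(src/S) + mod(src, S)*(N/S);  % rotate each t-bit address right by j bits
x = src;                                 % label elements by global index
y = zeros(1, N); y(dst+1) = x;           % y = L^N_S x
% communication pattern: which PE sends to which
srcPE = floor(src/M); dstPE = floor(dst/M);
disp(unique([srcPE' dstPE'], 'rows'))    % each PE sends to 2^j different PEs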

Slide 26: Cyclic Scheduling
Each PE sends 1/4 of its data to 4 different PEs.

Slide 27: Distributed Bit Reversal Permutation
A mixed tensor permutation. The full bit reversal

  b7 b6 b5 | b4 b3 b2 b1 b0  →  b0 b1 b2 | b3 b4 b5 b6 b7

is implemented using a factorization: first reverse the pid bits and the offset bits separately (a Global*Local permutation),

  b7 b6 b5 | b4 b3 b2 b1 b0  →  b5 b6 b7 | b0 b1 b2 b3 b4

then rotate the address so the reversed pid and offset blocks swap into place.

Slide 28: Experiments on the CRAY T3E
All experiments were performed on a 240-node (8x4x8 with partial plane) T3E, using 128 processors (300 MHz) with 128 MB of memory each.
- Task 1 (pairwise communication): implemented with shmem_get, shmem_put, and mpi_sendrecv.
- Task 2 (all 7! = 5040 global tensor permutations): implemented with shmem_get, shmem_put, and mpi_sendrecv.
- Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words; only run on a single node): implemented using streams on/off and cache bypass.
- Task 4 (distributed stride permutations): implemented with shmem_iput, shmem_iget, and mpi_sendrecv.

Slide 29: Task 1 Performance Data

Slide 30: Task 2 Performance Data

Slide 31: Task 3 Performance Data

Slide 32: Task 4 Performance Data

Slide 33: Network Simulator
An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
- Specify the processor layout, the route table, and the number of virtual processors with a given start node.
- Each processor can simultaneously issue a single send.
- Contention is measured as the maximum number of messages across any edge/node.
- The simulator was used to study global and mixed tensor permutations.

Slide 34: Task 2 Grid Simulation Analysis

Slide 35: Task 2 Grid Simulation Analysis

Slide 36: Task 2 Torus Simulation Analysis

Slide 37: Task 2 Torus Simulation Analysis