Distributed WHT Algorithms Kang Chen Jeremy Johnson Computer Science Drexel University Franz Franchetti Electrical and Computer Engineering Carnegie Mellon University

Sponsors
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT administered by the Army Directorate of Contracting.

Objective
 Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
 Explore alternative implementations and optimize using formula generation, manipulation and search
 Prototype implementation using WHT
   - Build on the existing sequential package
   - SMP implementation using OpenMP
   - Distributed memory implementation using MPI
   - Sequential package presented at ICASSP'00 and '01; OpenMP extension presented at IPDPS'02
 Incorporate into SPIRAL
   - Automatic performance tuning for DSP transforms
Collaborators: CMU: J. Hoe, J. Moura, M. Püschel, M. Veloso; Drexel: J. Johnson; UIUC: D. Padua; R. W. Johnson

Outline  Introduction  Bit permutations  Distributed WHT algorithms  Theoretical results  Experimental results

Outline  Introduction  Bit permutations  Distributed WHT algorithms  Theoretical results  Experimental results

Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
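The factored form itself did not survive the transcript; as a sketch, the standard factorization from the WHT literature, which every algorithm below instantiates, is

    WHT_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
    WHT_{2^n} = \prod_{i=1}^{t}
      \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right),

for any composition n = n_1 + ... + n_t. Each composition, applied recursively to the factors, yields a different fast algorithm.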

SPIRAL WHT Package (Johnson and Püschel: ICASSP 2000)
 All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration
 Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
 The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar
 Optimization/adaptation to architectures is performed by searching for the fastest algorithm, using Dynamic Programming (DP) or an Evolutionary Algorithm (STEER)
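As a sketch of that grammar (notation following the WHT papers; the package's concrete syntax may differ in detail), an algorithm is either straight-line code for a small size or a split into smaller WHTs:

    wht ::= small[k]              (straight-line code, 1 <= k <= 8)
          | split[wht, ..., wht]  (recursive factorization)

    split[small[5], split[small[4], small[4]], small[3]]
        (one of the many possible algorithms for size 2^16, since 5+4+4+3 = 16)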

Performance of WHT Algorithms (II)
 Automatically generate random algorithms for WHT_2^16 using SPIRAL
 Only difference: order of arithmetic instructions
 Measured runtimes nevertheless spread by a factor of 5

Architecture Dependency
The best WHT algorithm also depends on the architecture:
 Memory hierarchy
 Cache structure
 Cache miss penalty
(Figure: the best trees found on UltraSPARC II, POWER3 II, and PowerPC RS64 III differ; annotations mark a DDL split node and an IL=1 straight-line WHT_32 node.)

Outline  Introduction  Bit permutations  Distributed WHT algorithms  Theoretical results  Experimental results

Definition   A permutation of {0,1,…, n -1} ( b n -1 … b 1 b 0 ) Binary representation of 0  i < 2 n.  P  Permutation of {0,1,…,2 n -1 } defined by ( b n -1 … b 1 b 0 )  ( b  ( n -1) … b  (1) b  (0) ) Distributed interpretation  P = 2 p processors  Block cyclic data distribution.  Leading p bits are the pid  Trailing ( n-p ) bits are the local offset. pid offset pid offset ( b n -1 … b n - p | b n-p -1 … b 1 b 0 )  ( b  ( n -1) … b  ( n - p ) | b  ( n - p -1) … b  (1) b  (0) ) Bit Permutations

Stride Permutation
Write at stride 4 (= 8/2): (b_2 b_1 b_0) → (b_0 b_2 b_1)
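In index notation (one common convention; the transposed convention also appears in the literature), the stride permutation L^N_S that writes consecutive input elements at stride S is

    L^{N}_{S} : \; qT + r \;\longmapsto\; rS + q,
    \qquad N = ST, \quad 0 \le q < S, \quad 0 \le r < T.

For N = 8 and S = 4 (so T = 2) this is exactly (b_2 b_1 b_0) → (b_0 b_2 b_1), a cyclic rotation of the index bits.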

Distributed Stride Permutation
Example: (b_3 b_2 b_1 b_0) → (b_0 b_3 b_2 b_1) on 8 processors (#0 to #7) holding 2 elements each; the tables give the processor and local address mappings, i.e. the communication rule for each processor (pid|offset → pid|offset):
000|0 → 000|0    000|1 → 100|0
001|0 → 000|1    001|1 → 100|1
010|0 → 001|0    010|1 → 101|0
011|0 → 001|1    011|1 → 101|1
100|0 → 010|0    100|1 → 110|0
101|0 → 010|1    101|1 → 110|1
110|0 → 011|0    110|1 → 111|0
111|0 → 011|1    111|1 → 111|1

Communication Pattern
Each PE sends half of its data to 2 different PEs. Looks nicely regular…

Communication Pattern
(Figure: X(0:2:6) → Y(0:1:3) and X(1:2:7) → Y(4:1:7).)
…but is highly irregular…

Communication Pattern
Each PE sends a quarter of its data to 4 different PEs.
…and gets worse for larger parameters of L.

Multi-Swap Permutation
(b_2 b_1 b_0) → (b_0 b_1 b_2): writes at stride 4, with pairwise exchange of data

Communication Pattern
Each PE exchanges half of its data with one other PE (4 All-to-Alls of size 2)

Communication Pattern
(Figure: pairwise exchange of X(0:2:6) and X(1:2:7).)

Communication Pattern
Each PE sends a quarter of its data to 4 different PEs (2 All-to-Alls of size 4)

Communication Scheduling
 An order-two Latin square
 Used to schedule the All-to-All permutation
 Uses point-to-point communication
 Simple recursive construction
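As a sketch of one such schedule (my illustration; the package's actual construction may differ): the XOR table L[t][i] = t XOR i is a Latin square of order P = 2^p with a simple recursive block structure of order-two squares, and reading it row by row runs the All-to-All as P rounds of disjoint point-to-point exchanges, since the pairing i ↔ i XOR t is symmetric.

    #include <stdio.h>

    #define P 8  /* number of processors, a power of two */

    int main(void) {
        /* Row t, column i of the Latin square names the partner of
           processor i in round t.  Each row is a permutation and each
           column contains every value exactly once. */
        for (int t = 0; t < P; t++) {
            printf("round %d:", t);
            for (int i = 0; i < P; i++)
                printf("  %d<->%d", i, i ^ t);  /* pairwise: (i^t)^t == i */
            printf("\n");
        }
        return 0;
    }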

Outline  Introduction  Bit permutations  Distributed WHT algorithms  Theoretical results  Experimental results

Parallel WHT Package
 The WHT partition tree is parallelized at the root node
 SMP implementation obtained using OpenMP
 Distributed memory implementation using MPI
 Dynamic programming decides when to use parallelism
 DP decides the best parallel root node
 DP builds the partition with the best sequential subtrees
Sequential WHT package: Johnson and Püschel, ICASSP 2000 and ICASSP 2001. Dynamic data layout: N. Park and V. K. Prasanna, ICASSP 2001. OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002.
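A sketch of the dynamic programming search, simplified to binary splits and with a toy cost model standing in for the runtime measurements the package actually performs (measure() below is hypothetical, not the package API):

    #include <stdio.h>
    #include <float.h>

    #define MAX_N 16
    static double best[MAX_N + 1];     /* best[n]: fastest time found for WHT_2^n   */
    static int    split_at[MAX_N + 1]; /* 0: straight-line leaf; k: split 2^k x 2^(n-k) */

    /* Toy cost model (hypothetical): leaves only exist up to size 2^8,
       and a split reuses the best times already found for its subtrees. */
    static double measure(int n, int k) {
        if (k == 0)
            return n <= 8 ? (double)(n << n) : DBL_MAX;
        return best[k] + best[n - k] + 0.1 * (1 << n);
    }

    int main(void) {
        for (int n = 1; n <= MAX_N; n++) {
            best[n] = measure(n, 0);           /* leaf candidate */
            split_at[n] = 0;
            for (int k = 1; k < n; k++) {      /* all binary root splits */
                double t = measure(n, k);
                if (t < best[n]) { best[n] = t; split_at[n] = k; }
            }
            printf("n=%2d  best time %.1f  split at k=%d\n", n, best[n], split_at[n]);
        }
        return 0;
    }

The key property DP exploits is that the best tree for size 2^n only needs the best subtrees already computed for smaller sizes, so adding parallel root nodes (d_split) just enlarges the candidate set at the root.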

Distributed Memory WHT Algorithms
 Distributed split, d_split, as the root node
 Data equally distributed among the processors
 Distributed stride permutations exchange data between them
 Different sequences of permutations are possible
 Parallel form: local WHT transforms on local data
(Figure: the Pease dataflow uses stride permutations; the general dataflow uses bit permutations; the parallel local WHTs run the sequential algorithm.)
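To make the structure concrete, here is a minimal MPI sketch (a binary-exchange formulation of the same idea, not the package's actual d_split code): with block-distributed data, the first n-p butterfly stages are purely local, and each of the last p stages pairs processor pid with pid XOR 2^s for a pairwise exchange of the whole local block, the pattern the multi-swap permutation produces.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* WHT of N = 2^n points on P = 2^p processors, block data layout. */
    static void dist_wht(double *x, int n, int p, MPI_Comm comm) {
        int pid;
        MPI_Comm_rank(comm, &pid);
        int local = 1 << (n - p);
        double *tmp = malloc(local * sizeof *tmp);

        for (int s = 0; s < n - p; s++)            /* local butterfly stages */
            for (int i = 0; i < local; i++)
                if (!((i >> s) & 1)) {
                    int j = i | (1 << s);
                    double a = x[i], b = x[j];
                    x[i] = a + b;  x[j] = a - b;
                }

        for (int s = 0; s < p; s++) {              /* distributed stages */
            int partner = pid ^ (1 << s);
            MPI_Sendrecv(x,   local, MPI_DOUBLE, partner, s,
                         tmp, local, MPI_DOUBLE, partner, s,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < local; i++)        /* lower pid keeps the sum */
                x[i] = (pid < partner) ? x[i] + tmp[i] : tmp[i] - x[i];
        }
        free(tmp);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int P, pid, n = 10;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        int p = 0;  while ((1 << p) < P) p++;      /* P assumed a power of two */
        double *x = calloc(1 << (n - p), sizeof *x);
        if (pid == 0) x[0] = 1.0;                  /* impulse: WHT is all ones */
        dist_wht(x, n, p, MPI_COMM_WORLD);
        if (pid == 0) printf("x[0] = %g (expect 1)\n", x[0]);
        free(x);
        MPI_Finalize();
        return 0;
    }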

Outline  Introduction  Bit permutations  Distributed WHT algorithms  Theoretical results  Experimental results

Theoretical Results
Problem statement: find the sequence of permutations that minimizes communication and congestion.
 Pease dataflow: total bandwidth = N log(N) (1 - 1/P)
 Conjectured optimal: total bandwidth = N/2 log(P) + N (1 - 1/P)
The optimal solution uses independent pairwise exchanges (except for the last permutation).
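The conjectured-optimal count can be read off from that structure (a sketch of the accounting, assuming each pairwise-exchange stage moves half of the N data points and the last permutation is a full all-to-all):

    \underbrace{\tfrac{N}{2}\,\log_2 P}_{\log_2 P \text{ pairwise-exchange stages, } N/2 \text{ words each}}
    \;+\;
    \underbrace{N\left(1 - \tfrac{1}{P}\right)}_{\text{one final all-to-all}}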

Pease Dataflow

Theoretically Optimal Dataflow

Outline  Introduction  Bit permutations  Distributed WHT algorithms  Theoretical results  Experimental results

Platform  32 Pentium III processors, 450 MHz  512 MB 8ns PCI-100 memory and  2 SMC 100 mbps fast Ethernet cards Distributed WHT package implemented using MPI Experiments  All-to-All  Distributed stride vs. multi-swap permutations  Distributed WHT Experimental Results

All-to-All
Three different implementations of the All-to-All permutation were compared; point-to-point is fastest.

Stride vs. Multi-Swap

Distributed WHT 2^30 vs. …

Summary  Self-adapting WHT package  Optimize distributed WHT over different communication patterns and combinations of sequential code  Use of point-to-point primitives for all-to-all Ongoing work:  Lower bounds  Use high-speed interconnect  Generalize to other transforms  Incorporate into SPIRAL