Distributed WHT Algorithms
Kang Chen and Jeremy Johnson, Computer Science, Drexel University
Franz Franchetti, Electrical and Computer Engineering, Carnegie Mellon University
Sponsors
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through a research grant (DABT) administered by the Army Directorate of Contracting.
Objective
Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
Explore alternative implementations and optimize using formula generation, manipulation, and search
Prototype the approach using the WHT as the prototype transform
  Build on the existing sequential package
  SMP implementation using OpenMP
  Distributed memory implementation using MPI
  Sequential package presented at ICASSP'00 and '01; OpenMP extension presented at IPDPS'02
Incorporate into SPIRAL: automatic performance tuning for DSP transforms
  CMU: J. Hoe, J. Moura, M. Püschel, M. Veloso
  Drexel: J. Johnson
  UIUC: D. Padua
  R. W. Johnson
Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results
Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
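For reference, the family of factorizations the package searches over can be written as follows (standard tensor-product notation; this restates the well-known recursion rather than quoting the slide's figure):

```latex
% WHT_{2^n} as a product of tensor factors, for any composition n = n_1 + ... + n_t
WHT_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},
\qquad
WHT_{2^n} = \prod_{i=1}^{t}
  \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right).
```

Each choice of composition, and of how each smaller factor is itself expanded, yields a different algorithm with the same arithmetic count.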
SPIRAL WHT Package
All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration
Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar
Optimization/adaptation to architectures is performed by searching for the fastest algorithm
  Dynamic Programming (DP)
  Evolutionary Algorithm (STEER)
Johnson and Püschel: ICASSP 2000
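As a concrete illustration, here is a minimal recursive sketch in plain C, not the package's generated code; the fixed binary split and the size-2 leaf are simplifying assumptions:

```c
/* Minimal sketch of a recursive WHT based on the binary split
 * WHT_{2^n} = (WHT_{2^k} (x) I_{2^m}) (I_{2^k} (x) WHT_{2^m}),  n = k + m.
 * The real package generates straight-line code for small sizes and
 * searches over all split sequences; here the split is fixed to n/2. */
#include <stddef.h>

/* Size-2 butterfly on x[0] and x[stride]. */
static void wht2(double *x, size_t stride) {
    double a = x[0], b = x[stride];
    x[0] = a + b;
    x[stride] = a - b;
}

/* In-place WHT of size 2^n on the elements x[0], x[s], x[2s], ... */
void wht_rec(double *x, int n, size_t s) {
    if (n <= 0) return;
    if (n == 1) { wht2(x, s); return; }
    int k = n / 2, m = n - k;
    size_t K = (size_t)1 << k, M = (size_t)1 << m;
    for (size_t i = 0; i < K; i++)      /* I_{2^k} (x) WHT_{2^m}: contiguous blocks */
        wht_rec(x + i * M * s, m, s);
    for (size_t j = 0; j < M; j++)      /* WHT_{2^k} (x) I_{2^m}: strided access */
        wht_rec(x + j * s, k, M * s);
}
```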
Performance of WHT Algorithms (II)
Automatically generate random algorithms for WHT_2^16 using SPIRAL
Only difference: the order of the arithmetic instructions
(Figure annotation: the measured runtimes spread by a factor of 5.)
Architecture Dependency
The best WHT algorithms also depend on the architecture
  Memory hierarchy
  Cache structure
  Cache miss penalty
  ...
(Figure: the best WHT partition trees found on UltraSPARC II, POWER3 II, and PowerPC RS64 III differ; annotations mark a DDL split node and an IL=1 straight-line WHT_32 node.)
Outline (next: Bit permutations)
Bit Permutations
Definition
  Let sigma be a permutation of {0, 1, ..., n-1} and let (b_{n-1} ... b_1 b_0) be the binary representation of an index 0 <= i < 2^n
  P_sigma is the permutation of {0, 1, ..., 2^n - 1} defined by
    (b_{n-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(1)} b_{sigma(0)})
Distributed interpretation
  P = 2^p processors, block cyclic data distribution
  Leading p bits are the pid, trailing n-p bits are the local offset
    (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(n-p)} | b_{sigma(n-p-1)} ... b_{sigma(1)} b_{sigma(0)})
    i.e., (pid | offset) -> (pid | offset)
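A small sketch of how such a distributed bit permutation can be evaluated (hypothetical helper names, not the WHT package's interface): the image index is assembled bit by bit and then split into pid and local offset.

```c
/* Sketch: evaluate a bit permutation on global indices and split the
 * result into (pid, offset) for P = 2^p processors, with the leading
 * p bits as the pid.  Names (Addr, bit_permute, to_addr) are
 * illustrative, not taken from the WHT package. */
#include <stdint.h>

typedef struct { uint64_t pid, offset; } Addr;

/* Image of i under the bit permutation: bit j of the result is bit
 * sigma[j] of i, matching (b_{n-1}...b_0) -> (b_{sigma(n-1)}...b_{sigma(0)}). */
uint64_t bit_permute(uint64_t i, const int *sigma, int n) {
    uint64_t r = 0;
    for (int j = 0; j < n; j++)
        r |= ((i >> sigma[j]) & UINT64_C(1)) << j;
    return r;
}

/* Split a global n-bit index into (pid, local offset). */
Addr to_addr(uint64_t i, int n, int p) {
    Addr a = { i >> (n - p), i & ((UINT64_C(1) << (n - p)) - 1) };
    return a;
}
```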
Stride Permutation
Write at stride 4 (= 8/2): (b_2 b_1 b_0) -> (b_0 b_2 b_1)
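In general terms (a restatement of the convention shown by this example, not quoted from the slide), this stride permutation rotates the index bits, which in index form reads:

```latex
% Bit form and index form of the example's stride permutation on 2^n points
(b_{n-1} \cdots b_1 b_0) \longmapsto (b_0\, b_{n-1} \cdots b_1),
\qquad
i \longmapsto \lfloor i/2 \rfloor + (i \bmod 2)\, 2^{n-1}.
```

That is, consecutive input elements are written at stride 2^{n-1}, which is 4 for the n = 3 case on the slide.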
Distributed Stride Permutation
Global mapping (pid | offset) -> (pid | offset), 8 processors #0-#7, 2 elements per processor:
  000|0 -> 000|0    100|0 -> 010|0
  000|1 -> 100|0    100|1 -> 110|0
  001|0 -> 000|1    101|0 -> 010|1
  001|1 -> 100|1    101|1 -> 110|1
  010|0 -> 001|0    110|0 -> 011|0
  010|1 -> 101|0    110|1 -> 111|0
  011|0 -> 001|1    111|0 -> 011|1
  011|1 -> 101|1    111|1 -> 111|1
(The slide also showed the processor address mapping, the local address mapping, and the resulting communication rules per processor.)
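As a usage example of the earlier sketch (again with hypothetical helper names), the table above and the per-processor communication rules can be tabulated by pushing every (pid | offset) pair through the bit permutation for n = 4, p = 3:

```c
/* Tabulate the distributed stride permutation of the table above using
 * the bit_permute()/to_addr() sketch from the earlier slide (their
 * definitions are assumed; prototypes repeated here for completeness). */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t pid, offset; } Addr;
uint64_t bit_permute(uint64_t i, const int *sigma, int n);
Addr     to_addr(uint64_t i, int n, int p);

int main(void) {
    const int n = 4, p = 3;
    const int sigma[4] = {1, 2, 3, 0};   /* (b3 b2 b1 b0) -> (b0 b3 b2 b1) */
    for (uint64_t i = 0; i < (UINT64_C(1) << n); i++) {
        Addr s = to_addr(i, n, p);
        Addr d = to_addr(bit_permute(i, sigma, n), n, p);
        printf("#%llu|%llu -> #%llu|%llu\n",
               (unsigned long long)s.pid, (unsigned long long)s.offset,
               (unsigned long long)d.pid, (unsigned long long)d.offset);
    }
    return 0;
}
```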
Communication Pattern: Each PE sends 1/2 of its data to 2 different PEs. Looks nicely regular…
Communication Pattern: …but is highly irregular… (Figure: the strided sections X(0:2:6) and X(1:2:7) map to the contiguous sections Y(0:1:3) and Y(4:1:7).)
Communication Pattern: Each PE sends 1/4 of its data to 4 different PEs. …and it gets worse for larger stride parameters of L.
Multi-Swap Permutation: (b_2 b_1 b_0) -> (b_0 b_1 b_2). Writes at stride 4; pairwise exchange of data.
Communication Pattern: Each PE exchanges 1/2 of its data with one other PE (4 size-2 All-to-Alls).
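A minimal MPI sketch of one such pairwise exchange, assuming the partner is obtained by flipping a single pid bit (the mask, counts, and buffer layout are illustrative, not the package's d_split code):

```c
/* Pairwise exchange as used by a multi-swap permutation: each rank swaps
 * half of its local data with a single partner rank, assumed here to be
 * rank ^ mask (one flipped pid bit).  MPI_Sendrecv keeps the exchange
 * deadlock-free. */
#include <mpi.h>

void pairwise_exchange(const double *sendbuf, double *recvbuf, int count,
                       int mask, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    int partner = rank ^ mask;
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, partner, 0,
                 recvbuf, count, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
}
```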
Communication Pattern (figure: data exchanged between the strided sections X(0:2:6) and X(1:2:7)).
Communication Pattern: Each PE sends 1/4 of its data to 4 different PEs (2 size-4 All-to-Alls).
Communication Scheduling: an order-two Latin square is used to schedule the All-to-All permutation with point-to-point communication; it has a simple recursive construction (see the sketch below).
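A sketch of one possible recursive construction (an assumption about the exact scheme used; for power-of-two orders it reproduces the XOR table, so every step of the schedule is a perfect matching of processors):

```c
/* Sketch of a recursive Latin-square construction for scheduling an
 * All-to-All among P = 2^p processors with point-to-point messages:
 * L[step][proc] names the partner of `proc` in `step`.  Built by
 * doubling: the order-2k square consists of four order-k blocks, with
 * the off-diagonal blocks shifted by k.  For powers of two this yields
 * L[step][proc] = step ^ proc. */
#include <stdio.h>
#define MAXP 64

static int L[MAXP][MAXP];

static void double_latin(int k) {            /* order-k square -> order-2k square */
    for (int i = 0; i < k; i++)
        for (int j = 0; j < k; j++) {
            L[i][j + k]     = L[i][j] + k;   /* top-right block */
            L[i + k][j]     = L[i][j] + k;   /* bottom-left block */
            L[i + k][j + k] = L[i][j];       /* bottom-right block */
        }
}

int main(void) {
    const int P = 8;
    L[0][0] = 0;                             /* order-1 square */
    for (int k = 1; k < P; k *= 2)
        double_latin(k);
    for (int step = 0; step < P; step++) {   /* print the schedule */
        for (int proc = 0; proc < P; proc++)
            printf("%d ", L[step][proc]);    /* equals step ^ proc */
        printf("\n");
    }
    return 0;
}
```

Row `step` lists the partner of each processor in that step, so the All-to-All among P processors completes in P point-to-point steps.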
Outline (next: Distributed WHT algorithms)
Parallel WHT Package
The WHT partition tree is parallelized at the root node
  SMP implementation obtained using OpenMP
  Distributed memory implementation using MPI
Dynamic programming decides when to use parallelism
  DP chooses the best parallel root node
  DP builds the partition with the best sequential subtrees (see the sketch below)
Sequential WHT package: Johnson and Püschel, ICASSP 2000 and ICASSP 2001
Dynamic Data Layout: N. Park and V. K. Prasanna, ICASSP 2001
OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002
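A sketch of the dynamic-programming search idea (binary splits only; time_leaf() and time_split() are hypothetical measurement hooks standing in for running the generated code, and the real search also covers iterative nodes, wider splits, and the parallel root node):

```c
/* DP over transform sizes: best[n] holds the fastest measured time for
 * WHT_2^n built from the best smaller trees; split[n] records the chosen
 * split point (0 means "use a straight-line leaf", sizes 2^1..2^8 only). */
#include <math.h>

#define MAX_N 26
double time_leaf(int n);            /* hypothetical: time a straight-line WHT_2^n      */
double time_split(int k, int m);    /* hypothetical: time a split node whose children  */
                                    /* use the best trees found so far                 */
static double best[MAX_N + 1];
static int    split[MAX_N + 1];

void dp_search(int max_n) {
    for (int n = 1; n <= max_n; n++) {
        best[n]  = (n <= 8) ? time_leaf(n) : INFINITY;
        split[n] = 0;
        for (int k = 1; k < n; k++) {            /* try every split n = k + (n-k) */
            double t = time_split(k, n - k);
            if (t < best[n]) { best[n] = t; split[n] = k; }
        }
    }
}
```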
Distributed Memory WHT Algorithms
Distributed split, d_split, as the root node
  Data equally distributed among the MPI processes
  Distributed stride permutation to exchange data
  Different sequences of permutations are possible
  Parallel form: WHT transform on the local data (see the schematic below)
(Figure labels: Pease dataflow / stride permutations; general dataflow / bit permutations; parallel local WHT / sequential algorithm.)
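A schematic of the d_split structure under simplifying assumptions (P = 2^p ranks, n >= 2p, and a plain MPI_Alltoall standing in for the distributed stride or multi-swap permutation); this is not the package's actual implementation:

```c
/* Schematic only: distributed WHT of size 2^n on P = 2^p MPI ranks as
 *   local WHT of size 2^(n-p) -> global redistribution -> local WHT of size 2^p.
 * wht_rec() is the recursive sketch from the earlier slide; the result is
 * left in the redistributed (bit-permuted) layout. */
#include <mpi.h>
#include <stdlib.h>

void wht_rec(double *x, int n, size_t s);     /* earlier sketch */

void d_wht(double *local, int n, MPI_Comm comm) {
    int P, p = 0;
    MPI_Comm_size(comm, &P);
    while ((1 << p) < P) p++;                 /* P = 2^p assumed */
    size_t len = (size_t)1 << (n - p);        /* local vector length */

    /* Stage 1: I_{2^p} (x) WHT_{2^(n-p)}: entirely local. */
    wht_rec(local, n - p, 1);

    /* Stage 2: redistribute so that all P values of each offset class end
     * up on one rank (each rank sends len/P elements to every rank). */
    double *tmp = malloc(len * sizeof *tmp);
    MPI_Alltoall(local, (int)(len / (size_t)P), MPI_DOUBLE,
                 tmp,   (int)(len / (size_t)P), MPI_DOUBLE, comm);

    /* Stage 3: WHT_{2^p} (x) I: one size-2^p WHT per offset, at stride len/P. */
    for (size_t i = 0; i < len / (size_t)P; i++)
        wht_rec(tmp + i, p, len / (size_t)P);

    /* ... copy tmp back into `local` / permute as required by the chosen dataflow ... */
    free(tmp);
}
```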
Outline (next: Theoretical results)
Theoretical Results
Problem statement: find the sequence of permutations that minimizes communication and congestion
Pease dataflow: total bandwidth = N log(N) (1 - 1/P)
Conjectured optimal: total bandwidth = (N/2) log(P) + N (1 - 1/P)
The optimal sequence uses independent pairwise exchanges (except for the last permutation)
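Restated in formula form (the counts are taken directly from the slide; base-2 logarithms are an assumption, as the base is not stated):

```latex
B_{\mathrm{Pease}} = N \log N \left(1 - \tfrac{1}{P}\right),
\qquad
B_{\mathrm{conjectured\;opt}} = \tfrac{N}{2} \log P + N \left(1 - \tfrac{1}{P}\right).
```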
Pease Dataflow
Theoretically Optimal Dataflow
Outline (next: Experimental results)
Experimental Results
Platform: 32 Pentium III processors (450 MHz), 512 MB of 8 ns PCI-100 memory, and 2 SMC 100 Mbps Fast Ethernet cards; distributed WHT package implemented using MPI
Experiments: All-to-All; distributed stride vs. multi-swap permutations; distributed WHT
All-to-All: three different implementations of the All-to-All permutation were compared; the point-to-point version is the fastest.
Stride vs. Multi-Swap
Distributed WHT 2^30 vs. … (performance comparison figure)
Summary
  Self-adapting WHT package
  Optimizes the distributed WHT over different communication patterns and combinations of sequential code
  Uses point-to-point primitives for the All-to-All
Ongoing work
  Lower bounds
  Use of a high-speed interconnect
  Generalization to other transforms
  Incorporation into SPIRAL