1
Distributed WHT Algorithms
Kang Chen and Jeremy Johnson, Computer Science, Drexel University, http://www.spiral.net
Franz Franchetti, Electrical and Computer Engineering, Carnegie Mellon University
2
Sponsors
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
3
Objective
- Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
- Explore alternative implementations and optimize using formula generation, manipulation, and search
- Prototype the approach using the WHT as the prototype transform
  - Build on the existing sequential package
  - SMP implementation using OpenMP
  - Distributed-memory implementation using MPI
  - Sequential package presented at ICASSP'00 and '01; OpenMP extension presented at IPDPS'02
- Incorporate into SPIRAL: automatic performance tuning for DSP transforms
CMU: J. Hoe, J. Moura, M. Püschel, M. Veloso; Drexel: J. Johnson; UIUC: D. Padua; R. W. Johnson
www.spiral.net
4
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
5
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
6
Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
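For reference, the underlying factorization is the standard recursive WHT identity (as in the Johnson-Püschel WHT package literature); every composition n = n_1 + ... + n_t yields a different algorithm:

```latex
\mathrm{WHT}_{2} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\mathrm{WHT}_{2^{n}} = \prod_{i=1}^{t}
  \left( I_{2^{\,n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{\,n_{i+1}+\cdots+n_t}} \right),
\quad n = n_1 + \cdots + n_t.
```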
7
SPIRAL WHT Package
- All WHT algorithms have the same arithmetic cost, O(N log N), but different data access patterns and varying amounts of recursion and iteration
- Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
- The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar
- Optimization/adaptation to an architecture is performed by searching for the fastest algorithm, using Dynamic Programming (DP) or an evolutionary algorithm (STEER); a DP sketch follows below
Johnson and Püschel: ICASSP 2000
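A minimal sketch (our Python, with a hypothetical `measure` timing oracle) of DP over binary WHT split trees; the real package times generated code and also searches non-binary splits and implementation variants:

```python
from functools import lru_cache

def best_wht_plan(n, measure, max_leaf=8):
    """Dynamic programming over binary WHT split trees for size 2^n.

    A plan is either an int k (a straight-line leaf for WHT of size 2^k)
    or a pair (left, right) of sub-plans whose sizes sum to the node size.
    measure(plan) is a hypothetical timing oracle.
    """
    @lru_cache(maxsize=None)
    def best(m):
        candidates = []
        if m <= max_leaf:
            candidates.append(m)                          # straight-line code leaf
        candidates += [(best(k), best(m - k)) for k in range(1, m)]
        return min(candidates, key=measure)
    return best(n)
```

DP assumes the fastest subtree is independent of its context, which keeps the search polynomial instead of exploring all O(7^n) algorithms.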
8
Performance of WHT Algorithms (II)
- Random algorithms for WHT of size 2^16, automatically generated using SPIRAL
- Only difference between them: the order of arithmetic instructions
[Plot: runtimes of the generated algorithms, spanning a factor of 5]
9
Architecture Dependency
The best WHT algorithm also depends on the architecture:
- Memory hierarchy
- Cache structure
- Cache miss penalty
- ...
[Figure: best WHT 2^22 partition trees found on UltraSPARC II v9, POWER3 II, and PowerPC RS64 III; in the trees, a marked node is a DDL split node and 2^5,(1) denotes an IL=1 straight-line WHT_32 node]
10
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
11
Bit Permutations
Definition: let sigma be a permutation of {0, 1, ..., n-1}, and let (b_{n-1} ... b_1 b_0) be the binary representation of an index 0 <= i < 2^n. The bit permutation P_sigma is the permutation of {0, 1, ..., 2^n - 1} defined by

    (b_{n-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(1)} b_{sigma(0)})

Distributed interpretation: with P = 2^p processors and block cyclic data distribution, the leading p bits are the pid and the trailing (n - p) bits are the local offset:

    (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(n-p)} | b_{sigma(n-p-1)} ... b_{sigma(1)} b_{sigma(0)})
            pid               offset                         pid                       offset
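A small plain-Python sketch (function name ours) of applying P_sigma to a vector of length 2^n:

```python
def apply_bit_permutation(x, sigma):
    """Apply P_sigma to x (length 2^n).

    sigma is a permutation of {0, ..., n-1}; the output index is obtained
    by setting bit k of the new index to bit sigma(k) of the old index,
    i.e. (b_{n-1} ... b_0) -> (b_{sigma(n-1)} ... b_{sigma(0)}).
    """
    n = len(sigma)
    assert len(x) == 1 << n
    y = [None] * len(x)
    for i in range(len(x)):
        j = 0
        for k in range(n):
            j |= ((i >> sigma[k]) & 1) << k
        y[j] = x[i]
    return y
```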
12
Stride Permutation
Write at stride 4 (= 8/2): (b2 b1 b0) -> (b0 b2 b1)

    000 -> 000        100 -> 010
    001 -> 100        101 -> 110
    010 -> 001        110 -> 011
    011 -> 101        111 -> 111
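In terms of the sketch above, this stride permutation is sigma = [1, 2, 0] (new bit 0 is old bit 1, new bit 1 is old bit 2, new bit 2 is old bit 0):

```python
y = apply_bit_permutation(list(range(8)), [1, 2, 0])
print(y)  # [0, 2, 4, 6, 1, 3, 5, 7]: consecutive inputs are written 4 apart (write at stride 4)
```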
13
Distributed Stride Permutation
Example for 16 points on 8 processors, addresses written pid | local offset: (b3 b2 b1 b0) -> (b0 b3 b2 b1)

    000|0 -> 000|0        100|0 -> 010|0
    000|1 -> 100|0        100|1 -> 110|0
    001|0 -> 000|1        101|0 -> 010|1
    001|1 -> 100|1        101|1 -> 110|1
    010|0 -> 001|0        110|0 -> 011|0
    010|1 -> 101|0        110|1 -> 111|0
    011|0 -> 001|1        111|0 -> 011|1
    011|1 -> 101|1        111|1 -> 111|1

The leading bits give the processor address mapping, the trailing bits the local address mapping; together they yield the communication rules per processor (#0 to #7).
14
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE sends 1/2 of its data to 2 different PEs. Looks nicely regular...
15
Communication Pattern
[Figure: data flow across PEs 0-7: X(0:2:6) -> Y(0:1:3), X(1:2:7) -> Y(4:1:7)]
...but is highly irregular...
16
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE sends 1/4 of its data to 4 different PEs. ...and gets worse for larger parameters of L.
17
Multi-Swap Permutation
(b2 b1 b0) -> (b0 b1 b2): writes at stride 4, pairwise exchange of data

    000 -> 000        100 -> 001
    001 -> 100        101 -> 101
    010 -> 010        110 -> 011
    011 -> 110        111 -> 111
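Since sigma here just swaps bits 2 and 0, it is its own inverse, so P_sigma is an involution; that is what makes a pairwise exchange of data possible. With the earlier sketch, sigma = [2, 1, 0]:

```python
x = list(range(8))
y = apply_bit_permutation(x, [2, 1, 0])          # swap bits b2 and b0
print(y)                                         # [0, 4, 2, 6, 1, 5, 3, 7]
assert apply_bit_permutation(y, [2, 1, 0]) == x  # involution: applying it twice is the identity
```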
18
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE exchanges 1/2 of its data with one other PE (4 All-to-Alls of size 2).
19
Communication Pattern
[Figure: pairwise exchange between X(0:2:6) and X(1:2:7) across PEs 0-7]
20
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE sends 1/4 of its data to 4 different PEs (2 All-to-Alls of size 4).
21
Communication Scheduling
- Order-two Latin square
- Used to schedule the All-to-All permutation
- Uses point-to-point communication
- Simple recursive construction (see the sketch below)
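One such recursive construction is the XOR schedule L[i][t] = i xor t, grown by block doubling from the order-2 square [[0, 1], [1, 0]] (a sketch of ours; the presentation's exact construction may differ). Column t tells each PE its partner for step t, and since (i xor t) xor t == i, every step is a set of disjoint pairwise point-to-point exchanges:

```python
def latin_square(p):
    """Build the 2^p x 2^p XOR Latin square recursively from the order-2 base.

    Doubling step: L' = [[L, L + k], [L + k, L]], where k is the current size.
    Entry L[i][t] = i ^ t is PE i's partner in all-to-all step t.
    """
    L = [[0]]
    for _ in range(p):
        k = len(L)
        L = [row + [v + k for v in row] for row in L] + \
            [[v + k for v in row] + row for row in L]
    return L

for row in latin_square(2):   # schedule for 4 PEs
    print(row)
# [0, 1, 2, 3]
# [1, 0, 3, 2]
# [2, 3, 0, 1]
# [3, 2, 1, 0]
```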
22
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
23
Parallel WHT Package
- The WHT partition tree is parallelized at the root node
- SMP implementation obtained using OpenMP
- Distributed-memory implementation using MPI
- Dynamic programming decides when to use parallelism:
  - DP decides the best parallel root node
  - DP builds the partition with the best sequential subtrees
Sequential WHT package: Johnson and Püschel: ICASSP 2000, ICASSP 2001
Dynamic Data Layout: N. Park and V. K. Prasanna: ICASSP 2001
OpenMP SMP version: K. Chen and J. Johnson: IPDPS 2002
24
Distributed Memory WHT Algorithms
- Distributed split, d_split, as the root node
- Data equally distributed among processors
- Distributed stride permutation to exchange data
- Different sequences of permutations are possible
- Parallel form: WHT transform on local data (see the sketch below)
[Diagram labels: Pease dataflow (stride permutations); general dataflow (bit permutations); parallel local WHT (sequential algorithm)]
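A sequential simulation (our Python, not the package's C/MPI code) of the d_split idea via the Kronecker identity WHT_{2^n} = (WHT_{2^p} (x) I)(I (x) WHT_{2^(n-p)}): each processor first transforms its local block, and the cross-processor factor only combines elements that share a local offset, which is exactly the data the distributed permutations must bring together:

```python
import numpy as np

def wht(x):
    """In-place iterative WHT of a length-2^m numpy vector (local sequential kernel)."""
    n = x.size
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def distributed_wht(x, p):
    """Simulate d_split on P = 2^p 'processors' holding equal blocks of x."""
    P = 1 << p
    blocks = x.reshape(P, -1).astype(float)      # row i = local data of processor i
    for b in blocks:                             # step 1: independent local WHTs
        wht(b)
    for j in range(blocks.shape[1]):             # step 2: WHT of length P across
        blocks[:, j] = wht(blocks[:, j].copy())  # processors, one column per offset
    return blocks.reshape(-1)
```

In the MPI package, step 2 is where the distributed stride or multi-swap permutations come in: they redistribute the data so that the cross-processor factor can again be computed locally.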
25
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
26
Theoretical Results
Problem statement: find the sequence of permutations that minimizes communication and congestion.
- Pease dataflow: total bandwidth = N log(N) (1 - 1/P)
- Conjectured optimal: total bandwidth = (N/2) log(P) + N (1 - 1/P)
- The optimal dataflow uses independent pairwise exchanges (except for the last permutation)
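For concreteness (our arithmetic, logs base 2): at N = 2^16 and P = 8, Pease dataflow moves N log(N) (1 - 1/P) = 2^16 * 16 * 7/8 = 14 * 2^16 elements in total, while the conjectured optimum moves (N/2) log(P) + N (1 - 1/P) = (1.5 + 0.875) * 2^16 = 2.375 * 2^16, roughly a 6x reduction in total traffic.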
27
Pease Dataflow
28
Theoretically Optimal Dataflow
29
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
30
Distributed WHT Experimental Results
Platform:
- 32 Pentium III processors, 450 MHz
- 512 MB of 8 ns PC-100 memory and 2 SMC 100 Mbps fast Ethernet cards
- Distributed WHT package implemented using MPI
Experiments:
- All-to-All
- Distributed stride vs. multi-swap permutations
- Distributed WHT
31
All-to-All
Three different implementations of the All-to-All permutation were compared; point-to-point was fastest.
32
Stride vs. Multi-Swap
33
Distributed WHT 2^30 vs.
34
Summary
- Self-adapting WHT package
- Optimizes the distributed WHT over different communication patterns and combinations of sequential code
- Uses point-to-point primitives for All-to-All
http://www.spiral.net
Ongoing work:
- Lower bounds
- Use of a high-speed interconnect
- Generalization to other transforms
- Incorporation into SPIRAL