Distributed WHT Algorithms — Kang Chen, Jeremy Johnson (Computer Science, Drexel University), Franz Franchetti (Electrical and Computer Engineering, Carnegie Mellon University)


1 Distributed WHT Algorithms
Kang Chen, Jeremy Johnson — Computer Science, Drexel University (http://www.spiral.net)
Franz Franchetti — Electrical and Computer Engineering, Carnegie Mellon University

2 Sponsors
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, under research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

3 Objective
- Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
- Explore alternative implementations and optimize using formula generation, manipulation, and search
- Prototype the approach using the WHT
  - Build on the existing sequential package
  - SMP implementation using OpenMP
  - Distributed memory implementation using MPI
  - Sequential package presented at ICASSP '00 & '01; OpenMP extension presented at IPDPS '02
- Incorporate into SPIRAL: automatic performance tuning for DSP transforms
CMU: J. Hoe, J. Moura, M. Püschel, M. Veloso; Drexel: J. Johnson; UIUC: D. Padua; R. W. Johnson
www.spiral.net

4 Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results

5 Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results

6 Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
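The factoring idea can be sketched in NumPy. The binary split WHT_{2^n} = (WHT_{2^k} ⊗ I_{2^(n-k)}) (I_{2^k} ⊗ WHT_{2^(n-k)}) shown here is one standard factorization consistent with the deck; function names are illustrative, not the package's API.

```python
import numpy as np

def wht(n):
    """WHT_{2^n} as an n-fold Kronecker product of the 2x2 core matrix."""
    core = np.array([[1, 1], [1, -1]])
    result = np.array([[1]])
    for _ in range(n):
        result = np.kron(result, core)
    return result

def wht_split(n, k):
    """Factored form: WHT_{2^n} = (WHT_{2^k} (x) I) (I (x) WHT_{2^(n-k)})."""
    left = np.kron(wht(k), np.eye(2 ** (n - k)))
    right = np.kron(np.eye(2 ** k), wht(n - k))
    return left @ right

# Every binary split position k yields the same transform matrix.
assert all(np.array_equal(wht_split(4, k), wht(4)) for k in range(1, 4))
```

Applying the two sparse factors to a vector, instead of the dense matrix, is what turns the O(N^2) matrix-vector product into the fast O(N log N) algorithm.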

7 SPIRAL WHT Package
- All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration
- Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
- The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar
- Optimization/adaptation to architectures is performed by searching for the fastest algorithm
  - Dynamic Programming (DP)
  - Evolutionary algorithm (STEER)
Johnson and Püschel: ICASSP 2000
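The DP search can be illustrated with a toy model over binary split trees (a sketch with a made-up cost function; the real package times generated code on the target machine). Using the exact arithmetic-operation count as the cost also demonstrates the claim above: every split tree ties at N log N operations.

```python
from functools import lru_cache

def dp_best(n, leaf_cost):
    """Toy DP over binary WHT split trees. Splitting WHT_{2^m} as (k, m-k)
    costs 2^(m-k) applications of WHT_{2^k} plus 2^k of WHT_{2^(m-k)};
    sizes 2^1 .. 2^8 may also be straight-line leaves."""
    @lru_cache(maxsize=None)
    def best(m):
        choices = [leaf_cost(m)] if m <= 8 else []
        choices += [2 ** (m - k) * best(k) + 2 ** k * best(m - k)
                    for k in range(1, m)]
        return min(choices)
    return best(n)

# With cost = exact arithmetic count m * 2^m, the minimum is n * 2^n:
# all trees share the same O(N log N) arithmetic cost, so only data
# access patterns (and a real timing-based cost) distinguish them.
assert dp_best(12, lambda m: m * 2 ** m) == 12 * 2 ** 12
```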

8 Performance of WHT Algorithms (II)
- Automatically generate random algorithms for WHT of size 2^16 using SPIRAL
- Only difference: order of arithmetic instructions
[Figure: runtimes of the generated algorithms differ by a factor of 5.]

9 Architecture Dependency
The best WHT algorithms also depend on the architecture:
- Memory hierarchy
- Cache structure
- Cache miss penalty
[Figure: best partition trees for WHT of size 2^22 found on UltraSPARC II v9, POWER3 II, and PowerPC RS64 III; in the trees, nodes such as 2^22 are DDL split nodes and 2^5,(1) denotes an IL=1 straight-line WHT_32 leaf.]

10 Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results

11 Bit Permutations
Definition
- Let sigma be a permutation of {0, 1, ..., n-1} and (b_{n-1} ... b_1 b_0) the binary representation of 0 <= i < 2^n.
- P_sigma is the permutation of {0, 1, ..., 2^n - 1} defined by (b_{n-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(1)} b_{sigma(0)}).
Distributed interpretation
- P = 2^p processors, block cyclic data distribution
- The leading p bits are the pid; the trailing (n - p) bits are the local offset:
  (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(n-p)} | b_{sigma(n-p-1)} ... b_{sigma(1)} b_{sigma(0)})
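The definition can be sketched directly on indices (a sketch; here `sigma` is a list where output bit j takes source bit sigma[j], and the pid/offset split follows the leading-p-bits convention above):

```python
def bit_perm(sigma, i):
    """P_sigma: bit j of the output index is bit sigma[j] of the input i."""
    return sum(((i >> sigma[j]) & 1) << j for j in range(len(sigma)))

def pid_offset(i, n, p):
    """P = 2^p processors: leading p bits of the n-bit index are the pid,
    trailing n - p bits are the local offset."""
    return i >> (n - p), i & ((1 << (n - p)) - 1)

# The stride permutation of the next slide, (b2 b1 b0) -> (b0 b2 b1):
sigma = [1, 2, 0]   # output bit 0 <- b1, bit 1 <- b2, bit 2 <- b0
print([format(bit_perm(sigma, i), "03b") for i in range(8)])
# ['000', '100', '001', '101', '010', '110', '011', '111']
```

Whether a bit permutation forces communication is visible immediately: it does exactly when some of the leading p (pid) bits are exchanged with trailing (offset) bits.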

12 Stride Permutation
(b_2 b_1 b_0) -> (b_0 b_2 b_1), write at stride 4 (= 8/2):
000 -> 000, 001 -> 100, 010 -> 001, 011 -> 101,
100 -> 010, 101 -> 110, 110 -> 011, 111 -> 111

13 Distributed Stride Permutation
[Table: processor address mapping and local address mapping for the distributed stride permutation on 8 PEs (#0 - #7), written as pid|offset -> pid|offset, together with the resulting communication rules per processor.]

14 Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs.
Looks nicely regular…
[Figure: communication graph among PEs 0-7.]

15 Communication Pattern
…but is highly irregular…
[Figure: data movement X(0:2:6) -> Y(0:1:3) and X(1:2:7) -> Y(4:1:7) among PEs 0-7.]

16 Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs
…and it gets worse for larger parameters of L.
[Figure: communication graph among PEs 0-7.]

17 Multi-Swap Permutation
(b_2 b_1 b_0) -> (b_0 b_1 b_2), writes at stride 4, pairwise exchange of data:
000 -> 000, 001 -> 100, 010 -> 010, 011 -> 110,
100 -> 001, 101 -> 101, 110 -> 011, 111 -> 111
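Because this permutation only swaps the leading and trailing bit, it is its own inverse, which is why the resulting data exchange is pairwise rather than cyclic. A minimal sketch:

```python
def multiswap(i, hi, lo):
    """Swap bit positions hi and lo of index i;
    (b2 b1 b0) -> (b0 b1 b2) is multiswap(i, 2, 0)."""
    if ((i >> hi) & 1) != ((i >> lo) & 1):
        i ^= (1 << hi) | (1 << lo)
    return i

# An involution: applying it twice is the identity, so PEs exchange
# data in disjoint pairs (contrast with the stride permutation, whose
# bit rotation moves data around a cycle of PEs).
assert all(multiswap(multiswap(i, 2, 0), 2, 0) == i for i in range(8))
```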

18 Communication Pattern
Each PE exchanges 1/2 of its data with one other PE (4 All-to-Alls of size 2).
[Figure: pairwise exchanges among PEs 0-7.]

19 Communication Pattern
[Figure: exchange X(0:2:6) <-> X(1:2:7) among PEs 0-7.]

20 Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs (2 All-to-Alls of size 4).
[Figure: communication graph among PEs 0-7.]

21 Communication Scheduling
- Order-2 Latin square
- Used to schedule the All-to-All permutation
- Uses point-to-point communication
- Simple recursive construction
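One standard schedule with exactly these properties is the XOR pairing: in round r, PE i exchanges point-to-point with PE i ^ r, and the round-by-PE table is a Latin square built from the order-2 square by doubling. This sketch shows that common construction; the deck does not spell out its exact schedule, so treat it as an assumed instance.

```python
def xor_schedule(P):
    """All-to-All schedule for P = 2^p PEs: in round r (1..P-1),
    PE i exchanges point-to-point with PE i ^ r."""
    return [[i ^ r for i in range(P)] for r in range(P)]

table = xor_schedule(8)
# Latin square: every row and every column is a permutation of 0..P-1,
# so each round is a perfect matching and no PE sits idle.
assert all(sorted(row) == list(range(8)) for row in table)
assert all(sorted(col) == list(range(8)) for col in zip(*table))
# Pairwise: my partner's partner in round r is me.
assert all(table[r][table[r][i]] == i for r in range(8) for i in range(8))
```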

22 Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results

23 Parallel WHT Package
- The WHT partition tree is parallelized at the root node
- SMP implementation obtained using OpenMP
- Distributed memory implementation using MPI
- Dynamic programming decides when to use parallelism
  - DP decides the best parallel root node
  - DP builds the partition with the best sequential subtrees
Sequential WHT package: Johnson and Püschel, ICASSP 2000, ICASSP 2001
Dynamic data layout: N. Park and V. K. Prasanna, ICASSP 2001
OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002

24 Distributed Memory WHT Algorithms
- Distributed split, d_split, as the root node
- Data equally distributed among processors
- Distributed stride permutation to exchange data
- Different sequences of permutations are possible
- Parallel form: WHT transform on local data
[Diagram: Pease dataflow = stride permutations; general dataflow = bit permutations; parallel local WHT = sequential algorithm.]

25 Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results

26 Theoretical Results
Problem statement: find the sequence of permutations that minimizes communication and congestion.
- Pease dataflow: total bandwidth = N log(N) (1 - 1/P)
- Conjectured optimal: total bandwidth = (N/2) log(P) + N (1 - 1/P)
The optimal dataflow uses independent pairwise exchanges (except for the last permutation).
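The two totals are easy to compare numerically (a sketch; formulas taken directly from the slide, with logs base 2 and N, P powers of two):

```python
import math

def pease_bandwidth(N, P):
    """Pease dataflow: N log(N) (1 - 1/P) words communicated in total."""
    return N * math.log2(N) * (1 - 1 / P)

def optimal_bandwidth(N, P):
    """Conjectured optimal: (N/2) log(P) + N (1 - 1/P)."""
    return N / 2 * math.log2(P) + N * (1 - 1 / P)

N, P = 2 ** 20, 16
print(pease_bandwidth(N, P) / optimal_bandwidth(N, P))
# roughly 6.4x less total traffic for the conjectured optimal dataflow
```

The gap grows with N at fixed P: the Pease total scales with log(N), while the conjectured optimum pays log(P) once plus a single data redistribution.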

27 Pease Dataflow

28 Theoretically Optimal Dataflow

29 Outline: Introduction, Bit permutations, Distributed WHT algorithms, Theoretical results, Experimental results

30 Experimental Results
Platform:
- 32 Pentium III processors, 450 MHz
- 512 MB 8 ns PCI-100 memory
- 2 SMC 100 Mbps fast Ethernet cards
Distributed WHT package implemented using MPI.
Experiments:
- All-to-All
- Distributed stride vs. multi-swap permutations
- Distributed WHT

31 All-to-All
Three different implementations of the All-to-All permutation; point-to-point is fastest.

32 Stride vs. Multi-Swap

33 Distributed WHT 2^30 vs.

34 Summary
- Self-adapting WHT package
- Optimizes the distributed WHT over different communication patterns and combinations of sequential code
- Uses point-to-point primitives for All-to-All
http://www.spiral.net
Ongoing work:
- Lower bounds
- Use of high-speed interconnects
- Generalization to other transforms
- Incorporation into SPIRAL

