1
Distributed WHT Algorithms
Kang Chen and Jeremy Johnson, Computer Science, Drexel University, http://www.spiral.net
Franz Franchetti, Electrical and Computer Engineering, Carnegie Mellon University
2
Sponsors
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
3
Objective
- Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
- Explore alternative implementations and optimize using formula generation, manipulation, and search
- Prototype the approach using the WHT as the prototype transform
  - Build on the existing sequential package
  - SMP implementation using OpenMP
  - Distributed-memory implementation using MPI
  - Sequential package presented at ICASSP'00 and '01; OpenMP extension presented at IPDPS'02
- Incorporate into SPIRAL: automatic performance tuning for DSP transforms
CMU: J. Hoe, J. Moura, M. Püschel, M. Veloso; Drexel: J. Johnson; UIUC: D. Padua; R. W. Johnson
www.spiral.net
4
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
5
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
6
Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
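For reference, the underlying factorization is the standard recursive WHT identity (as in the Johnson-Püschel WHT package literature); every composition n = n_1 + ... + n_t yields a different algorithm:

```latex
\mathrm{WHT}_{2} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\mathrm{WHT}_{2^{n}} = \prod_{i=1}^{t}
  \left( I_{2^{\,n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{\,n_{i+1}+\cdots+n_t}} \right),
\quad n = n_1 + \cdots + n_t.
```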
7
SPIRAL WHT Package
- All WHT algorithms have the same arithmetic cost, O(N log N), but different data access patterns and varying amounts of recursion and iteration
- Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
- The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar
- Optimization/adaptation to an architecture is performed by searching for the fastest algorithm, using Dynamic Programming (DP) or an evolutionary algorithm (STEER); a DP sketch follows below
Johnson and Püschel: ICASSP 2000
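A minimal sketch (our Python, with a hypothetical `measure` timing oracle) of DP over binary WHT split trees; the real package times generated code and also searches non-binary splits and implementation variants:

```python
from functools import lru_cache

def best_wht_plan(n, measure, max_leaf=8):
    """Dynamic programming over binary WHT split trees for size 2^n.

    A plan is either an int k (a straight-line leaf for WHT of size 2^k)
    or a pair (left, right) of sub-plans whose sizes sum to the node size.
    measure(plan) is a hypothetical timing oracle.
    """
    @lru_cache(maxsize=None)
    def best(m):
        candidates = []
        if m <= max_leaf:
            candidates.append(m)                          # straight-line code leaf
        candidates += [(best(k), best(m - k)) for k in range(1, m)]
        return min(candidates, key=measure)
    return best(n)
```

DP assumes the fastest subtree is independent of its context, which keeps the search polynomial instead of exploring all O(7^n) algorithms.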
8
Performance of WHT Algorithms (II)
- Random algorithms for WHT of size 2^16, automatically generated using SPIRAL
- Only difference between them: the order of arithmetic instructions
[Plot: runtimes of the generated algorithms, spanning a factor of 5]
9
Architecture Dependency
The best WHT algorithm also depends on the architecture:
- Memory hierarchy
- Cache structure
- Cache miss penalty
- ...
[Figure: best WHT 2^22 partition trees found on UltraSPARC II v9, POWER3 II, and PowerPC RS64 III; in the trees, a marked node is a DDL split node and 2^5,(1) denotes an IL=1 straight-line WHT_32 node]
10
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
11
Bit Permutations
Definition: let sigma be a permutation of {0, 1, ..., n-1}, and let (b_{n-1} ... b_1 b_0) be the binary representation of an index 0 <= i < 2^n. The bit permutation P_sigma is the permutation of {0, 1, ..., 2^n - 1} defined by

    (b_{n-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(1)} b_{sigma(0)})

Distributed interpretation: with P = 2^p processors and block cyclic data distribution, the leading p bits are the pid and the trailing (n - p) bits are the local offset:

    (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) -> (b_{sigma(n-1)} ... b_{sigma(n-p)} | b_{sigma(n-p-1)} ... b_{sigma(1)} b_{sigma(0)})
            pid               offset                         pid                       offset
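A small plain-Python sketch (function name ours) of applying P_sigma to a vector of length 2^n:

```python
def apply_bit_permutation(x, sigma):
    """Apply P_sigma to x (length 2^n).

    sigma is a permutation of {0, ..., n-1}; the output index is obtained
    by setting bit k of the new index to bit sigma(k) of the old index,
    i.e. (b_{n-1} ... b_0) -> (b_{sigma(n-1)} ... b_{sigma(0)}).
    """
    n = len(sigma)
    assert len(x) == 1 << n
    y = [None] * len(x)
    for i in range(len(x)):
        j = 0
        for k in range(n):
            j |= ((i >> sigma[k]) & 1) << k
        y[j] = x[i]
    return y
```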
12
Stride Permutation
Write at stride 4 (= 8/2): (b2 b1 b0) -> (b0 b2 b1)

    000 -> 000        100 -> 010
    001 -> 100        101 -> 110
    010 -> 001        110 -> 011
    011 -> 101        111 -> 111
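In terms of the sketch above, this stride permutation is sigma = [1, 2, 0] (new bit 0 is old bit 1, new bit 1 is old bit 2, new bit 2 is old bit 0):

```python
y = apply_bit_permutation(list(range(8)), [1, 2, 0])
print(y)  # [0, 2, 4, 6, 1, 3, 5, 7]: consecutive inputs are written 4 apart (write at stride 4)
```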
13
Distributed Stride Permutation
Example for 16 points on 8 processors, addresses written pid | local offset: (b3 b2 b1 b0) -> (b0 b3 b2 b1)

    000|0 -> 000|0        100|0 -> 010|0
    000|1 -> 100|0        100|1 -> 110|0
    001|0 -> 000|1        101|0 -> 010|1
    001|1 -> 100|1        101|1 -> 110|1
    010|0 -> 001|0        110|0 -> 011|0
    010|1 -> 101|0        110|1 -> 111|0
    011|0 -> 001|1        111|0 -> 011|1
    011|1 -> 101|1        111|1 -> 111|1

The leading bits give the processor address mapping, the trailing bits the local address mapping; together they yield the communication rules per processor (#0 to #7).
14
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE sends 1/2 of its data to 2 different PEs. Looks nicely regular...
15
Communication Pattern
[Figure: data flow across PEs 0-7: X(0:2:6) -> Y(0:1:3), X(1:2:7) -> Y(4:1:7)]
...but is highly irregular...
16
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE sends 1/4 of its data to 4 different PEs. ...and gets worse for larger parameters of L.
17
Multi-Swap Permutation
(b2 b1 b0) -> (b0 b1 b2): writes at stride 4, pairwise exchange of data

    000 -> 000        100 -> 001
    001 -> 100        101 -> 101
    010 -> 010        110 -> 011
    011 -> 110        111 -> 111
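Since sigma here just swaps bits 2 and 0, it is its own inverse, so P_sigma is an involution; that is what makes a pairwise exchange of data possible. With the earlier sketch, sigma = [2, 1, 0]:

```python
x = list(range(8))
y = apply_bit_permutation(x, [2, 1, 0])          # swap bits b2 and b0
print(y)                                         # [0, 4, 2, 6, 1, 5, 3, 7]
assert apply_bit_permutation(y, [2, 1, 0]) == x  # involution: applying it twice is the identity
```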
18
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE exchanges 1/2 of its data with one other PE (4 All-to-Alls of size 2).
19
Communication Pattern
[Figure: pairwise exchange between X(0:2:6) and X(1:2:7) across PEs 0-7]
20
Communication Pattern
[Figure: communication among PEs 0-7]
Each PE sends 1/4 of its data to 4 different PEs (2 All-to-Alls of size 4).
21
Communication Scheduling
- Order-two Latin square
- Used to schedule the All-to-All permutation
- Uses point-to-point communication
- Simple recursive construction (see the sketch below)
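One such recursive construction is the XOR schedule L[i][t] = i xor t, grown by block doubling from the order-2 square [[0, 1], [1, 0]] (a sketch of ours; the presentation's exact construction may differ). Column t tells each PE its partner for step t, and since (i xor t) xor t == i, every step is a set of disjoint pairwise point-to-point exchanges:

```python
def latin_square(p):
    """Build the 2^p x 2^p XOR Latin square recursively from the order-2 base.

    Doubling step: L' = [[L, L + k], [L + k, L]], where k is the current size.
    Entry L[i][t] = i ^ t is PE i's partner in all-to-all step t.
    """
    L = [[0]]
    for _ in range(p):
        k = len(L)
        L = [row + [v + k for v in row] for row in L] + \
            [[v + k for v in row] + row for row in L]
    return L

for row in latin_square(2):   # schedule for 4 PEs
    print(row)
# [0, 1, 2, 3]
# [1, 0, 3, 2]
# [2, 3, 0, 1]
# [3, 2, 1, 0]
```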
22
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
23
Parallel WHT Package
- The WHT partition tree is parallelized at the root node
- SMP implementation obtained using OpenMP
- Distributed-memory implementation using MPI
- Dynamic programming decides when to use parallelism:
  - DP decides the best parallel root node
  - DP builds the partition with the best sequential subtrees
Sequential WHT package: Johnson and Püschel: ICASSP 2000, ICASSP 2001
Dynamic Data Layout: N. Park and V. K. Prasanna: ICASSP 2001
OpenMP SMP version: K. Chen and J. Johnson: IPDPS 2002
24
Distributed Memory WHT Algorithms
- Distributed split, d_split, as the root node
- Data equally distributed among processors
- Distributed stride permutation to exchange data
- Different sequences of permutations are possible
- Parallel form: WHT transform on local data (see the sketch below)
[Diagram labels: Pease dataflow (stride permutations); general dataflow (bit permutations); parallel local WHT (sequential algorithm)]
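A sequential simulation (our Python, not the package's C/MPI code) of the d_split idea via the Kronecker identity WHT_{2^n} = (WHT_{2^p} (x) I)(I (x) WHT_{2^(n-p)}): each processor first transforms its local block, and the cross-processor factor only combines elements that share a local offset, which is exactly the data the distributed permutations must bring together:

```python
import numpy as np

def wht(x):
    """In-place iterative WHT of a length-2^m numpy vector (local sequential kernel)."""
    n = x.size
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def distributed_wht(x, p):
    """Simulate d_split on P = 2^p 'processors' holding equal blocks of x."""
    P = 1 << p
    blocks = x.reshape(P, -1).astype(float)      # row i = local data of processor i
    for b in blocks:                             # step 1: independent local WHTs
        wht(b)
    for j in range(blocks.shape[1]):             # step 2: WHT of length P across
        blocks[:, j] = wht(blocks[:, j].copy())  # processors, one column per offset
    return blocks.reshape(-1)
```

In the MPI package, step 2 is where the distributed stride or multi-swap permutations come in: they redistribute the data so that the cross-processor factor can again be computed locally.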
25
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
26
Theoretical Results
Problem statement: find the sequence of permutations that minimizes communication and congestion.
- Pease dataflow: total bandwidth = N log(N) (1 - 1/P)
- Conjectured optimal: total bandwidth = (N/2) log(P) + N (1 - 1/P)
- The optimal dataflow uses independent pairwise exchanges (except for the last permutation)
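For concreteness (our arithmetic, logs base 2): at N = 2^16 and P = 8, Pease dataflow moves N log(N) (1 - 1/P) = 2^16 * 16 * 7/8 = 14 * 2^16 elements in total, while the conjectured optimum moves (N/2) log(P) + N (1 - 1/P) = (1.5 + 0.875) * 2^16 = 2.375 * 2^16, roughly a 6x reduction in total traffic.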
27
Pease Dataflow
28
Theoretically Optimal Dataflow
29
Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
30
Distributed WHT Experimental Results
Platform:
- 32 Pentium III processors, 450 MHz
- 512 MB of 8 ns PC-100 memory and 2 SMC 100 Mbps fast Ethernet cards
- Distributed WHT package implemented using MPI
Experiments:
- All-to-All
- Distributed stride vs. multi-swap permutations
- Distributed WHT
31
All-to-All
Three different implementations of the All-to-All permutation were compared; point-to-point was fastest.
32
Stride vs. Multi-Swap
33
Distributed WHT 2^30 vs.
34
Summary
- Self-adapting WHT package
- Optimizes the distributed WHT over different communication patterns and combinations of sequential code
- Uses point-to-point primitives for All-to-All
http://www.spiral.net
Ongoing work:
- Lower bounds
- Use of a high-speed interconnect
- Generalization to other transforms
- Incorporation into SPIRAL