A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms
Kang Chen and Jeremy Johnson
Department of Mathematics and Computer Science
Drexel University

Motivation and Overview
High-performance implementation of critical signal processing kernels
A self-optimizing parallel package for computing fast signal transforms
– Prototype transform (WHT)
– Build on existing sequential package
– SMP implementation using OpenMP
Part of the SPIRAL project

Outline
Walsh-Hadamard Transform (WHT)
Sequential performance and optimization using dynamic programming
A parallel implementation of the WHT
Parallel performance and optimization, including parallelism in the search

Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix
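
The slide's equations did not survive extraction. As a reference point only, the standard definition of the WHT and the usual tensor-product factorization can be written as follows; the symbols $n_i$ and $t$ are notation assumed here, not recovered from the slide.

```latex
\[
\mathrm{WHT}_{2} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
\mathrm{WHT}_{2^n} = \underbrace{\mathrm{WHT}_2 \otimes \cdots \otimes \mathrm{WHT}_2}_{n \text{ factors}}
\]
\[
\mathrm{WHT}_{2^n} = \prod_{i=1}^{t}
  \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right),
\qquad n = n_1 + \cdots + n_t
\]
```

Each choice of $n_1, \ldots, n_t$, applied recursively to the smaller transforms, gives a different algorithm with the same arithmetic cost.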

SPIRAL WHT Package
All WHT algorithms have the same arithmetic cost, $O(N \lg N)$, but different data access patterns
Different factorizations lead to varying amounts of recursion and iteration
Small transforms (sizes $2^1$ to $2^8$) are implemented in straight-line code to reduce overhead
The WHT package allows exploration of different algorithms and implementations
Optimization/adaptation to architectures is performed by searching for the fastest algorithm
Johnson and Püschel: ICASSP 2000
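
To illustrate how two factorizations perform the same arithmetic with different control flow, here is a minimal sketch of a fully iterative and a fully recursive WHT. The function names are hypothetical and this is not the package's code, which uses generated straight-line leaves.

```c
#include <assert.h>
#include <stddef.h>

/* Iterative WHT (natural/Hadamard order), in place, n a power of two.
 * Stage h applies butterflies between elements h apart. */
void wht_iterative(double *x, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t base = 0; base < n; base += 2 * h) {
            for (size_t j = base; j < base + h; j++) {
                double a = x[j], b = x[j + h];
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
    }
}

/* Recursive WHT using WHT_n = (WHT_2 (x) I_{n/2}) (I_2 (x) WHT_{n/2}),
 * where (x) denotes the tensor product. Same butterflies, different order. */
void wht_recursive(double *x, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1)
        return;
    size_t h = n / 2;
    wht_recursive(x, h);       /* I_2 (x) WHT_{n/2}: first half  */
    wht_recursive(x + h, h);   /*                    second half */
    for (size_t j = 0; j < h; j++) {   /* WHT_2 (x) I_{n/2} */
        double a = x[j], b = x[j + h];
        x[j] = a + b;
        x[j + h] = a - b;
    }
}
```

Both routines execute $N \lg N$ additions and subtractions; they differ only in the order in which memory is touched.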

Dynamic Programming
Exhaustive search: searching all possible algorithms
– Cost is on the order of $4^n / n^{3/2}$ for binary factorizations
Dynamic programming: searching among algorithms generated from previously determined best algorithms
– Cost is on the order of $n^2$ for binary factorizations
[Figure: the best algorithms found at smaller sizes (one shown at size $2^4$) are combined into a possibly best algorithm at a larger size.]
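
The dynamic programming search over binary factorizations can be sketched as below. This is an illustration under stated assumptions: `cost_fn`, `dp_search`, and `toy_cost` are hypothetical names, and `toy_cost` is an invented analytic model standing in for the timing of real code.

```c
#include <assert.h>
#include <float.h>

#define MAX_LEAF 8   /* straight-line leaves exist for 2^1 .. 2^8 */

/* Hypothetical cost callback: cost of WHT of size 2^m whose root is a leaf
 * (left == 0) or splits m into left + (m - left), with both subtrees taken
 * as the best ones already found. A real search would time actual code. */
typedef double (*cost_fn)(int m, int left, const double *best);

/* Fills best[1..n] and split[1..n] (split[m] == 0 means leaf).
 * Returns the number of candidates evaluated, which grows like n^2. */
int dp_search(int n, cost_fn cost, int *split, double *best)
{
    assert(n >= 1);
    int evaluations = 0;
    for (int m = 1; m <= n; m++) {
        best[m] = DBL_MAX;
        split[m] = 0;
        if (m <= MAX_LEAF) {
            best[m] = cost(m, 0, best);
            evaluations++;
        }
        for (int left = 1; left < m; left++) {
            double c = cost(m, left, best);
            evaluations++;
            if (c < best[m]) {
                best[m] = c;
                split[m] = left;
            }
        }
    }
    return evaluations;
}

/* Invented toy model (not from the slides): arithmetic cost plus a fixed
 * overhead of 2^m for every split node. */
double toy_cost(int m, int left, const double *best)
{
    double size = (double)(1ULL << m);
    if (left == 0)
        return m * size;
    int right = m - left;
    return (double)(1ULL << right) * best[left]
         + (double)(1ULL << left) * best[right]
         + size;
}
```

Because only the best subtree of each smaller size is kept, the result is the best algorithm only if optimal trees are built from optimal subtrees, which is why the slide calls it "possibly best".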

Performance of WHT Algorithms
Iterative algorithms have less overhead
Recursive algorithms have better data locality
The best WHT algorithms are a compromise between low overhead and a good data flow pattern

Architecture Dependency
The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.
[Figure: best partition trees (root size $2^{22}$) on UltraSPARC v9, POWER3 II, and PowerPC RS64 III. Legend: a DDL split node; $2^5$,(1) denotes an IL=1 straight-line WHT 32 node.]

Improved Data Access Patterns
A stride tensor makes the WHT access data outside the current block, which loses locality
Large strides introduce more conflict cache misses
[Figure: accesses to x0 ... x7 over time for the stride tensor and the union tensor.]

Dynamic Data Layout
DDL uses an in-place pseudo transpose to reorder the data so that a stride tensor becomes a union tensor.
[Figure: the pseudo transpose reorders x0 x1 x2 x3 x4 x5 x6 x7 into x0 x4 x2 x6 x1 x5 x3 x7.]
N. Park and V. K. Prasanna: ICASSP 2001

Loop Interleaving
IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.
[Figure: access order (1st to 4th) over x0 ... x7 for $\mathrm{WHT}_2 \otimes I_4$ computed without interleaving, with IL=1, and with IL=2.]
Gatlin and Carter: PACT 2000; implemented by Bo Hong
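
The idea of interleaving can be sketched for $\mathrm{WHT}_4 \otimes I_S$. In the plain version each strided transform runs to completion before the next starts; in the interleaved version `B` adjacent transforms advance together, so neighboring elements on one cache line are used while they are resident. The function names and the block parameter `B` (corresponding to $2^{\mathrm{IL}}$) are assumptions made for this illustration, not the package's generated code.

```c
#include <assert.h>
#include <stddef.h>

/* WHT_4 on the strided vector x[0], x[s], x[2s], x[3s]. */
void wht4_strided(double *x, size_t s)
{
    double a = x[0] + x[s],         b = x[0] - x[s];
    double c = x[2 * s] + x[3 * s], d = x[2 * s] - x[3 * s];
    x[0] = a + c;
    x[s] = b + d;
    x[2 * s] = a - c;
    x[3 * s] = b - d;
}

/* WHT_4 (x) I_S without interleaving: one strided transform at a time. */
void wht4_tensor_plain(double *x, size_t S)
{
    for (size_t k = 0; k < S; k++)
        wht4_strided(x + k, S);
}

/* WHT_4 (x) I_S with B transforms interleaved (B must divide S):
 * every butterfly stage sweeps B consecutive elements. */
void wht4_tensor_interleaved(double *x, size_t S, size_t B)
{
    assert(B > 0 && S % B == 0);
    for (size_t k = 0; k < S; k += B) {
        double *p = x + k;
        for (size_t u = 0; u < B; u++) {          /* stage 1 */
            double a = p[u], b = p[u + S];
            double c = p[u + 2 * S], d = p[u + 3 * S];
            p[u] = a + b;
            p[u + S] = a - b;
            p[u + 2 * S] = c + d;
            p[u + 3 * S] = c - d;
        }
        for (size_t u = 0; u < B; u++) {          /* stage 2 */
            double a = p[u], b = p[u + S];
            double c = p[u + 2 * S], d = p[u + 3 * S];
            p[u] = a + c;
            p[u + 2 * S] = a - c;
            p[u + S] = b + d;
            p[u + 3 * S] = b - d;
        }
    }
}
```

Both versions compute the same result; only the order of memory accesses changes.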

Best WHT Partition Trees
Environment: PowerPC RS64 III/ MHz, 128/128 KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc
[Figure: standard best tree, best tree with DDL, and best tree with IL (root size $2^{16}$). Legend: a DDL split node; $2^5$,(3) denotes an IL=3 straight-line WHT 32 node.]

Effect of IL and DDL on Performance
DDL and IL improve performance when the data size exceeds the L1 cache, 128 KB = $2^{14} \times 8$ bytes.
IL level 4 makes maximal use of the cache line, 128 bytes = $2^4 \times 8$ bytes.

Parallel WHT Package
SMP implementation obtained using OpenMP
WHT partition tree is parallelized at the root node
– Simple to insert OpenMP directives
– Better performance obtained with manual scheduling
DP decides when to use parallelism
DP builds the partition with best sequential subtrees
DP decides the best parallel root node
– Parallel split
– Parallel split with DDL
– Parallel pseudo-transpose
– Parallel split with IL
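
OpenMP Implementation

Version 1: OpenMP work-sharing directive

#pragma omp parallel
{
    /* R, S, i, j, k are assumed private to each thread */
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        /* the R*S sub-transforms of stage i are independent */
        #pragma omp for
        for (j = 0; j < R; j++) {
            for (k = 0; k < S; k++) {
                WHT(N(i)) * x(j, k, S, N(i));
            }
        }
        S = S * N(i);
    }
}

Version 2: manual scheduling

#pragma omp parallel
{
    total = omp_get_num_threads();
    id = omp_get_thread_num();
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        /* cyclic distribution: thread id takes tasks id, id + total, ... */
        for (m = id; m < R * S; m += total) {
            j = m / S; k = m % S;
            WHT(N(i)) * x(j, k, S, N(i));
        }
        S = S * N(i);
        /* all threads finish stage i before any starts stage i + 1 */
        #pragma omp barrier
    }
}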
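
A self-contained variant of the manual-scheduling scheme can be sketched as follows. It assumes the factor sizes are passed in an array `Ni` (each a power of two with product `N`), and it replaces the slide's `WHT(N(i)) * x(j, k, S, N(i))` notation with a hypothetical helper `wht_strided`; neither name comes from the package. The code also compiles without OpenMP, in which case it runs on one thread.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* WHT of size m (power of two) on x[0], x[s], ..., x[(m-1)*s], in place. */
void wht_strided(double *x, size_t m, size_t s)
{
    for (size_t h = 1; h < m; h *= 2)
        for (size_t base = 0; base < m; base += 2 * h)
            for (size_t j = base; j < base + h; j++) {
                double a = x[j * s], b = x[(j + h) * s];
                x[j * s] = a + b;
                x[(j + h) * s] = a - b;
            }
}

/* Computes WHT_N as the product over i of I_R (x) WHT_{Ni[i]} (x) I_S,
 * with the R*S sub-transforms of each stage dealt out cyclically. */
void wht_parallel(double *x, size_t N, const size_t *Ni, int t)
{
    assert(N != 0 && t >= 1);
    #pragma omp parallel
    {
        size_t total = 1, id = 0;
#ifdef _OPENMP
        total = (size_t)omp_get_num_threads();
        id = (size_t)omp_get_thread_num();
#endif
        size_t R = N, S = 1;            /* private: declared in the region */
        for (int i = 0; i < t; i++) {
            R /= Ni[i];
            for (size_t m = id; m < R * S; m += total) {
                size_t j = m / S, k = m % S;
                wht_strided(x + j * Ni[i] * S + k, Ni[i], S);
            }
            S *= Ni[i];
            /* stage i must be complete before stage i + 1 reads its output */
            #pragma omp barrier
        }
    }
}
```

For example, `Ni = {2, 4}` with `N = 8` runs four size-2 transforms at stride 1, then two size-4 transforms at stride 2.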

Parallel DDL
In $\mathrm{WHT}_{RS} = L\,(I_S \otimes \mathrm{WHT}_R)\,L\,(I_R \otimes \mathrm{WHT}_S)$, the pseudo transpose $L$ can be parallelized at different granularities:
– Coarse-grained pseudo transpose
– Fine-grained pseudo transpose
– Fine-grained pseudo transpose with ID shift
[Figure: blocks of the data array, with dimensions labeled R and S, assigned to threads 1 to 4 under each scheme.]
Comparison of Parallel Schemes

Best Tree of Parallel DDL Schemes
[Figure: best trees for coarse-grained DDL, fine-grained DDL, and fine-grained DDL with ID shift. Legend: a parallel DDL split node; a DDL split node.]

Normalized Runtime of PowerPC RS64
The three plateaus in the figure are due to the L1 and L2 caches.
A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.
[Figure: normalized runtime on PowerPC RS64 III.]
Overall Parallel Speedup

Parallel Performance
[Tables: A. PowerPC RS64 III; B. POWER3 II; C. UltraSPARC v8plus]
Data size is $2^{25}$ for Table A and $2^{23}$ for Tables B and C.

Conclusion and Future Work
The parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP
– Self-adapts to different architectures using search
– Must take into account the data access pattern
– Parallel implementation should not constrain the search
– Package is available for download at the SPIRAL website
Working on a distributed-memory version using MPI
Effect of Scheduling Strategy

Parallel Split Node with IL and DDL
Parallel IL uses pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.
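
Modified Scheduling
Choice in scheduling WHT tasks for $(\mathrm{WHT}_R \otimes I_S)$ and $(I_R \otimes \mathrm{WHT}_S)$:
– Small granularity: tasks of size R or S
– Large granularity: tasks of size R S / number of threads
[Figure: assignment of tasks to threads 1 to 4 under each granularity.]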
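
One way to realize the large-granularity choice is to give each thread a single contiguous block of about R S / threads sub-transforms instead of dealing them out one at a time. The helper below is a hypothetical sketch of that index calculation, not the package's scheduler.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [*lo, *hi) of task indices owned by thread id when
 * `tasks` tasks are split into `threads` contiguous blocks; any remainder
 * is spread over the first threads, one extra task each. */
void block_range(size_t tasks, size_t threads, size_t id,
                 size_t *lo, size_t *hi)
{
    assert(threads > 0 && id < threads);
    size_t q = tasks / threads, r = tasks % threads;
    *lo = id * q + (id < r ? id : r);
    *hi = *lo + q + (id < r ? 1 : 0);
}
```

A thread would then loop over `m` in its range and recover `j = m / S` and `k = m % S` as in the manual-scheduling code.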