
1 A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms Kang Chen and Jeremy Johnson Department of Mathematics and Computer Science Drexel University

2 Motivation and Overview
High performance implementation of critical signal processing kernels
A self-optimizing parallel package for computing fast signal transforms
– Prototype transform (WHT)
– Build on existing sequential package
– SMP implementation using OpenMP
Part of the SPIRAL project
– http://www.ece.cmu.edu/~spiral

3 Outline
Walsh-Hadamard Transform (WHT)
Sequential performance and optimization using dynamic programming
A parallel implementation of the WHT
Parallel performance and optimization, including parallelism in the search

4 Walsh-Hadamard Transform
The WHT of size N = 2^n is the n-fold tensor product WHT_2^n = WHT_2 ⊗ WHT_2 ⊗ … ⊗ WHT_2, where WHT_2 is the 2×2 matrix with rows (1, 1) and (1, −1). Fast WHT algorithms are obtained by factoring the WHT matrix: for n = n_1 + n_2 + … + n_t,
WHT_2^n = ∏_{i=1..t} ( I_2^(n_1+…+n_(i−1)) ⊗ WHT_2^(n_i) ⊗ I_2^(n_(i+1)+…+n_t) )
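As a concrete illustration only (this is not the SPIRAL package's code), a minimal in-place radix-2 WHT of size N = 2^n can be sketched in C as:

```c
/* In-place Walsh-Hadamard transform of x[0..(1<<n)-1].
   Each pass applies WHT_2 butterflies at stride s; the whole
   transform costs N lg N additions/subtractions, matching the
   arithmetic count shared by all WHT factorizations. */
static void wht(double *x, int n) {
    long N = 1L << n;
    for (long s = 1; s < N; s <<= 1) {
        for (long j = 0; j < N; j += 2 * s) {
            for (long k = j; k < j + s; k++) {
                double a = x[k], b = x[k + s];
                x[k]     = a + b;   /* top output of the butterfly */
                x[k + s] = a - b;   /* bottom output */
            }
        }
    }
}
```

Since WHT_N · WHT_N = N · I, applying the transform twice scales the input by N, which makes a handy sanity check.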

5 SPIRAL WHT Package
All WHT algorithms have the same arithmetic cost, O(N lg N), but different data access patterns.
Different factorizations lead to varying amounts of recursion and iteration.
Transforms of small sizes (2^1 to 2^8) are implemented in straight-line code to reduce overhead.
The WHT package allows exploration of different algorithms and implementations.
Optimization/adaptation to an architecture is performed by searching for the fastest algorithm.
Johnson and Püschel: ICASSP 2000

6 Dynamic Programming
Exhaustive search examines all possible algorithms; its cost is Θ(4^n / n^(3/2)) for binary factorizations.
Dynamic programming searches only among algorithms generated from previously determined best algorithms; its cost is Θ(n^2) for binary factorizations.
[Figure: the best algorithm at size 2^9 combined with the best algorithm at size 2^4 gives a possibly best algorithm at size 2^13.]
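The dynamic-programming search can be sketched as follows; `combine_cost` is a hypothetical stand-in for the measured runtime of a candidate split (the actual package times each candidate on the target machine):

```c
#include <float.h>

#define MAXN 16

static double best[MAXN + 1];  /* best[n]: cost of best algorithm for WHT_{2^n} */
static int split[MAXN + 1];    /* split[n]: left subtree size of the winning split */

/* Hypothetical stand-in for the measured runtime of combining the best
   size-2^a and size-2^b subtrees; the real package times candidates on
   the target machine rather than using a formula. */
static double combine_cost(double left, double right, int a, int b) {
    return left + right + (double)(1L << (a + b));
}

/* Dynamic programming over binary factorizations: only O(n^2)
   (a, n-a) pairs are examined, versus Theta(4^n / n^(3/2)) algorithms
   visited by exhaustive search. */
static void dp(int nmax) {
    best[1] = 1.0;             /* base case: straight-line code */
    for (int n = 2; n <= nmax; n++) {
        best[n] = DBL_MAX;
        for (int a = 1; a < n; a++) {
            double c = combine_cost(best[a], best[n - a], a, n - a);
            if (c < best[n]) { best[n] = c; split[n] = a; }
        }
    }
}
```

With this toy cost model the search prefers balanced splits; with measured runtimes it adapts to whatever the machine rewards.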

7 Performance of WHT Algorithms
Iterative algorithms have less overhead.
Recursive algorithms have better data locality.
The best WHT algorithms are a compromise between low overhead and a good data flow pattern.

8 Architecture Dependency
The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.
[Figure: best partition trees of size 2^22 on UltraSPARC v9, POWER3 II, and PowerPC RS64 III. Legend: 2^22 denotes a DDL split node; 2^{5,(1)} denotes an IL=1 straight-line WHT_32 node.]

9 Improved Data Access Patterns
The stride tensor causes the WHT to access data out of block, losing locality; a large stride introduces more conflict cache misses.
[Figure: access order over time for x_0 … x_7 under the stride tensor versus the union tensor for a size-2^3 transform.]

10 Dynamic Data Layout
DDL uses an in-place pseudo transpose to swap data in a special way so that the stride tensor is changed to the union tensor.
[Figure: the pseudo transpose reorders x_0 x_1 x_2 x_3 x_4 x_5 x_6 x_7 into x_0 x_4 x_2 x_6 x_1 x_5 x_3 x_7.]
N. Park and V. K. Prasanna: ICASSP 2001
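A minimal out-of-place sketch of the relayout idea (the package's pseudo transpose is in place and more elaborate): viewing the vector as an R-by-S row-major matrix and transposing it turns stride-S accesses into unit-stride accesses.

```c
/* Out-of-place sketch: view 'in' as an R-by-S row-major matrix and
   write its transpose to 'out', so elements that were S apart become
   adjacent.  Illustrative only; not the package's in-place routine. */
static void pseudo_transpose(const double *in, double *out, int R, int S) {
    for (int r = 0; r < R; r++)
        for (int s = 0; s < S; s++)
            out[(long)s * R + r] = in[(long)r * S + s];
}
```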

11 Loop Interleaving
IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.
[Figure: access order (1st–4th) of x_0 … x_7 for WHT_2 ⊗ I_4, WHT_2^{IL=1} ⊗ I_{4/2}, and WHT_2^{IL=2} ⊗ I_{4/4}.]
Gatlin and Carter: PACT 2000; implemented by Bo Hong
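To picture interleaving, consider the stage WHT_2 ⊗ I_4: it can be written with one butterfly per loop iteration, or with two butterflies interleaved (IL=2 in spirit) so that adjacent elements on a prefetched cache line are consumed together. A sketch, not the package's generated code:

```c
/* Plain version of the stage WHT_2 (tensor) I_4 on x[0..7]:
   one butterfly (x[k], x[k+4]) per iteration. */
static void stage_plain(double *x) {
    for (int k = 0; k < 4; k++) {
        double a = x[k], b = x[k + 4];
        x[k] = a + b;
        x[k + 4] = a - b;
    }
}

/* Interleaved version: two butterflies per iteration touch adjacent
   memory locations, consuming a prefetched cache line before it can
   be evicted.  The results are identical to stage_plain. */
static void stage_il2(double *x) {
    for (int k = 0; k < 4; k += 2) {
        double a0 = x[k],     b0 = x[k + 4];
        double a1 = x[k + 1], b1 = x[k + 5];
        x[k]     = a0 + b0;  x[k + 4] = a0 - b0;
        x[k + 1] = a1 + b1;  x[k + 5] = a1 - b1;
    }
}
```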

12 Best WHT Partition Trees
Environment: PowerPC RS64 III/12, 450 MHz, 128/128 KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5.
[Figure: best partition trees for size 2^16 — the standard best tree, the best tree with DDL, and the best tree with IL. Legend: 2^16 denotes a DDL split node; 2^{5,(3)} denotes an IL=3 straight-line WHT_32 node.]

13 Effect of IL and DDL on Performance
DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = 2^14 × 8 bytes.
IL level 4 reaches the maximal use of a cache line, 128 bytes = 2^4 × 8 bytes.

14 Parallel WHT Package
SMP implementation obtained using OpenMP.
The WHT partition tree is parallelized at the root node:
– Simple to insert OpenMP directives
– Better performance obtained with manual scheduling
DP decides when to use parallelism:
– DP builds the partition with the best sequential subtrees
– DP decides the best parallel root node: parallel split, parallel split with DDL, parallel pseudo-transpose, or parallel split with IL

15 OpenMP Implementation

Version 1 — OpenMP parallel for on the block loop:

    #pragma omp parallel
    {
      R = N; S = 1;
      for (i = 0; i < t; i++) {
        R = R / N(i);
        #pragma omp parallel for
        for (j = 0; j < R; j++)
          for (k = 0; k < S; k++)
            WHT(N(i)) * x(j, k, S, N(i));
        S = S * N(i);
      }
    }

Version 2 — manual cyclic scheduling by thread id:

    #pragma omp parallel
    {
      total = get_total_threads();
      id = get_thread_id();
      R = N; S = 1;
      for (i = 0; i < t; i++) {
        R = R / N(i);
        for (; id < R * S; id += total) {
          j = id / S; k = id % S;
          WHT(N(i)) * x(j, k, S, N(i));
        }
        id = id % total;   /* restore the thread id for the next stage */
        S = S * N(i);
        #pragma omp barrier
      }
    }
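The manually scheduled variant can be made compilable as a sketch; `wht_block` and `butterfly2` are hypothetical stand-ins for applying a small WHT to one data block, and the code falls back to serial execution when OpenMP is unavailable:

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in: apply a size-2 WHT butterfly at 'base' with
   the given stride.  The real package dispatches straight-line code
   for sizes 2^1..2^8. */
typedef void (*block_fn)(double *x, long base, long stride, long size);

static void butterfly2(double *x, long base, long stride, long size) {
    (void)size;                       /* always 2 in this sketch */
    double a = x[base], b = x[base + stride];
    x[base] = a + b;
    x[base + stride] = a - b;
}

/* One WHT stage with manual cyclic scheduling: the R*S blocks are
   dealt out as id, id+T, id+2T, ... so each thread gets a fixed,
   contention-free share.  Without OpenMP the pragma is ignored and
   the loop runs serially (total = 1, id = 0). */
static void wht_stage_manual(double *x, long R, long S, long Ni,
                             block_fn wht_block) {
    #pragma omp parallel
    {
        long total = 1, id = 0;
    #ifdef _OPENMP
        total = omp_get_num_threads();
        id = omp_get_thread_num();
    #endif
        for (long b = id; b < R * S; b += total) {
            long j = b / S, k = b % S;
            wht_block(x, j * Ni * S + k, S, Ni);
        }
    }
}
```

For N = 4 factored as 2 · 2, calling the stage twice — first with (R, S) = (2, 1), then (1, 2) — reproduces WHT_4.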

16 Parallel DDL
In WHT_{R·S} = L (I_S ⊗ WHT_R) L (I_R ⊗ WHT_S), the pseudo transpose L can be parallelized at different granularities:
– Coarse-grained pseudo transpose
– Fine-grained pseudo transpose
– Fine-grained pseudo transpose with ID shift
[Figure: the R × S data grid partitioned among threads 1–4 under each scheme.]

17 Comparison of Parallel Schemes

18 Best Trees of Parallel DDL Schemes
[Figure: best partition trees for size 2^26 under the coarse-grained DDL, fine-grained DDL, and fine-grained with ID shift DDL schemes. Legend: 2^26 denotes a parallel DDL split node; 2^17 denotes a DDL split node.]

19 Normalized Runtime on PowerPC RS64 III
The three plateaus in the figure are due to the L1 and L2 caches.
A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.

20 Overall Parallel Speedup

21 Parallel Performance
A. PowerPC RS64 III; B. POWER3 II; C. UltraSPARC v8plus.
Data size is 2^25 for Table A and 2^23 for Tables B and C.

22 Conclusion and Future Work
The parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP:
– Self-adapts to different architectures using search
– Must take into account the data access pattern
– Parallel implementation should not constrain the search
– Package is available for download at the SPIRAL website: http://www.ece.cmu.edu/~spiral
Working on a distributed memory version using MPI.

23 Effect of Scheduling Strategy

24 Parallel Split Node with IL and DDL
Parallel IL utilizes pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.

25 Modified Scheduling
Choice in scheduling the WHT tasks (WHT_R ⊗ I_S) and (I_R ⊗ WHT_S):
– small granularity: tasks of size R or S
– large granularity: tasks of size R·S / (number of threads)
[Figure: the task grid partitioned among threads 1–4 under each granularity.]

