
A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms Kang Chen and Jeremy Johnson Department of Mathematics and Computer Science Drexel University

Motivation and Overview
High performance implementation of critical signal processing kernels
A self-optimizing parallel package for computing fast signal transforms
 – Prototype transform (WHT)
 – Builds on the existing sequential package
 – SMP implementation using OpenMP
Part of the SPIRAL project

Outline
Walsh-Hadamard Transform (WHT)
Sequential performance and optimization using dynamic programming
A parallel implementation of the WHT
Parallel performance and optimization, including parallelism in the search

Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
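For reference, a sketch of the recursive factorization behind these algorithms, written in the standard notation of the SPIRAL WHT papers (the slide itself showed only the matrix picture); here n = n_1 + ... + n_t:

WHT_{2^n} \;=\; \prod_{i=1}^{t} \left( I_{2^{n_1+\cdots+n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}} \right),
\qquad
WHT_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.

Each choice of t and of the exponents n_1, ..., n_t gives a different algorithm (partition tree) with the same arithmetic cost but a different data access pattern.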

SPIRAL WHT Package
All WHT algorithms have the same arithmetic cost O(N lg N) but different data access patterns.
Different factorizations lead to varying amounts of recursion and iteration.
Transforms of small sizes (2^1 to 2^8) are implemented in straight-line code to reduce overhead.
The WHT package allows exploration of different algorithms and implementations.
Optimization/adaptation to architectures is performed by searching for the fastest algorithm.
Johnson and Püschel: ICASSP 2000

Dynamic Programming
Exhaustive search: searching all possible algorithms
 – Cost is Θ(4^n / n^{3/2}) for binary factorizations
Dynamic programming: searching among algorithms generated from previously determined best algorithms
 – Cost is Θ(n^2) for binary factorizations
(Figure: the best algorithms at two smaller sizes are combined into a possibly best algorithm at the larger size.)
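As a concrete illustration, a minimal sketch of the dynamic programming search over binary factorizations; make_leaf, make_split, and measure are hypothetical placeholders, not the actual SPIRAL WHT package API:

#include <float.h>
#include <stddef.h>

#define MAX_EXP 26

typedef struct node {
    int exp;                       /* transform size 2^exp                  */
    struct node *left, *right;     /* both NULL for a straight-line leaf    */
} node;

/* Hypothetical helpers, assumed to exist elsewhere in the package. */
node  *make_leaf(int exp);                 /* straight-line WHT, exp <= 8   */
node  *make_split(node *l, node *r);       /* combine two children          */
double measure(node *t);                   /* run and time the algorithm    */

static node *best[MAX_EXP + 1];            /* best[k]: fastest tree for 2^k */

void dp_search(int n)
{
    for (int k = 1; k <= n; k++) {
        node  *best_tree = NULL;
        double best_time = DBL_MAX;

        if (k <= 8) {                      /* candidate: unrolled leaf code */
            node  *leaf = make_leaf(k);
            double t = measure(leaf);
            if (t < best_time) { best_time = t; best_tree = leaf; }
        }
        for (int a = 1; a < k; a++) {      /* candidates: binary splits of  */
            node  *split = make_split(best[a], best[k - a]);   /* best subtrees */
            double t = measure(split);
            if (t < best_time) { best_time = t; best_tree = split; }
        }
        best[k] = best_tree;               /* O(n) candidates per size, O(n^2) total */
    }
}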

Performance of WHT Algorithms
Iterative algorithms have less overhead.
Recursive algorithms have better data locality.
The best WHT algorithms are a compromise between low overhead and a good data flow pattern.

Architecture Dependency
The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, cache miss penalty, etc.
(Figure: best partition trees of size 2^22 on UltraSPARC v9, POWER3 II, and PowerPC RS64 III; legend: a DDL split node, an IL=1 straight-line WHT 32 node of size 2^5.)

Improved Data Access Patterns
The stride tensor causes the WHT to access data out of block, losing locality.
A large stride introduces more conflict cache misses.
(Figure: data access over time to x_0 ... x_7 for the stride tensor versus the union tensor.)

Dynamic Data Layout
DDL uses an in-place pseudo transpose to swap data in a special way so that the stride tensor is changed into the union tensor.
(Figure: pseudo transpose of x_0 ... x_7.)
N. Park and V. K. Prasanna: ICASSP 2001
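For intuition, a minimal out-of-place sketch of the kind of stride permutation the pseudo transpose performs; the package's in-place version is more involved, and stride_permute is an illustrative name, not the package API:

/* View x as an R x S matrix stored row-major; y receives its transpose,
 * i.e. the stride permutation that turns the stride tensor into the
 * union tensor in the DDL formulation. */
void stride_permute(const double *x, double *y, int R, int S)
{
    for (int i = 0; i < R; i++)
        for (int j = 0; j < S; j++)
            y[j * R + i] = x[i * S + j];
}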

Loop Interleaving
IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.
(Figure: access order on x_0 ... x_7 for WHT_2 ⊗ I_4, WHT_2 ⊗_{IL=1} I_{4/2}, and WHT_2 ⊗_{IL=2} I_{4/4}.)
Gatlin and Carter: PACT 2000; implemented by Bo Hong.
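A rough sketch of what IL = 1 amounts to for (WHT_2 ⊗ I_S): the loop over the S independent butterflies is unrolled by two, so each iteration works on adjacent, already pre-fetched elements. This assumes S is even and is illustrative code only, not the package's generated straight-line routines:

void wht2_tensor_is_il1(double *x, int S)
{
    for (int k = 0; k < S; k += 2) {                       /* two butterflies per pass */
        double a0 = x[k],     b0 = x[k + S];
        double a1 = x[k + 1], b1 = x[k + 1 + S];
        x[k]     = a0 + b0;  x[k + S]     = a0 - b0;       /* butterfly k   */
        x[k + 1] = a1 + b1;  x[k + 1 + S] = a1 - b1;       /* butterfly k+1 */
    }
}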

Best WHT Partition Trees
Environment: PowerPC RS64 III / MHz, 128/128 KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc.
(Figure: standard best tree, best tree with DDL, and best tree with IL for size 2^16; legend: a DDL split node, an IL=3 straight-line WHT 32 node of size 2^5.)

Effect of IL and DDL on Performance
DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = 2^14 × 8 bytes.
IL level 4 reaches the maximal use of a cache line, 128 bytes = 2^4 × 8 bytes.

Parallel WHT Package
SMP implementation obtained using OpenMP.
The WHT partition tree is parallelized at the root node:
 – Simple to insert OpenMP directives
 – Better performance obtained with manual scheduling
DP decides when to use parallelism.
DP builds the partition with the best sequential subtrees.
DP decides the best parallel root node:
 – Parallel split
 – Parallel split with DDL
 – Parallel pseudo-transpose
 – Parallel split with IL

OpenMP Implementation
Two variants of the parallel split at the root node (pseudocode; WHT(N(i)) * x(j, k, S, N(i)) denotes applying the size-N(i) WHT to the indicated subvector of x):

/* Variant 1: simple insertion of an OpenMP work-sharing directive. */
R = N; S = 1;
for (i = 0; i < t; i++) {
    R = R / N(i);
    #pragma omp parallel for
    for (j = 0; j < R; j++)
        for (k = 0; k < S; k++)
            WHT(N(i)) * x(j, k, S, N(i));
    S = S * N(i);
}

/* Variant 2: manual scheduling inside a single parallel region. */
#pragma omp parallel
{
    total = get_total_threads();
    id    = get_thread_id();
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        for (w = id; w < R * S; w += total) {   /* cyclic assignment of the
                                                   R*S sub-transforms       */
            j = w / S;
            k = w % S;
            WHT(N(i)) * x(j, k, S, N(i));
        }
        S = S * N(i);
        #pragma omp barrier                     /* wait before the next stage */
    }
}

Parallel DDL
In WHT_{RS} = L (I_S ⊗ WHT_R) L (I_R ⊗ WHT_S), the pseudo transpose L can be parallelized at different granularities.
(Figure: coarse-grained pseudo transpose, fine-grained pseudo transpose, and fine-grained pseudo transpose with ID shift, with the R × S data divided among threads 1–4.)
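A rough sketch of the difference in granularity (illustrative code, not the package's routines): coarse-grained corresponds to a parallel for over whole rows of the R × S view, while fine-grained deals individual elements out to threads cyclically; the ID-shift variant additionally rotates each thread's starting offset so neighbouring threads do not write into the same cache line at the same time.

#include <omp.h>

/* Coarse-grained: each thread transposes whole rows. */
void pt_coarse(const double *x, double *y, int R, int S)
{
    #pragma omp parallel for
    for (int i = 0; i < R; i++)
        for (int j = 0; j < S; j++)
            y[j * R + i] = x[i * S + j];
}

/* Fine-grained: the R*S elements are dealt out to threads cyclically. */
void pt_fine(const double *x, double *y, int R, int S)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num(), total = omp_get_num_threads();
        for (int p = id; p < R * S; p += total)
            y[(p % S) * R + (p / S)] = x[p];
    }
}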

Comparison of Parallel Schemes

Best Tree of Parallel DDL Schemes
(Figure: best partition trees for coarse-grained DDL, fine-grained DDL, and fine-grained with ID shift DDL; legend: a parallel DDL split node, a DDL split node.)

Normalized Runtime on PowerPC RS64
The three plateaus in the figure are due to the L1 and L2 caches.
A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.
(Figure: normalized runtime on PowerPC RS64 III.)

Overall Parallel Speedup

Parallel Performance
(Tables: A. PowerPC RS64 III, B. POWER3 II, C. UltraSPARC v8plus.)
Data size is 2^25 for Table A and 2^23 for Tables B and C.

Conclusion and Future Work
The parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP:
 – Self-adapts to different architectures using search
 – Must take the data access pattern into account
 – The parallel implementation should not constrain the search
 – The package is available for download at the SPIRAL website
Working on a distributed memory version using MPI.

Effect of Scheduling Strategy

Parallel Split Node with IL and DDL
Parallel IL utilizes pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.

Modified Scheduling
Choice in scheduling the WHT tasks for (WHT_R ⊗ I_S) and (I_R ⊗ WHT_S):
 – Small granularity: tasks of size R or S
 – Large granularity: tasks of size R × S / number of threads
(Figure: task assignment across threads 1–4.)
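As a rough illustration of the two choices for the I_R ⊗ WHT_S factor (apply_wht is a placeholder for the package's codelet call, not its real API): small granularity hands out the R independent size-S sub-transforms one at a time, while large granularity gives each thread one contiguous chunk of R / num_threads of them.

#include <omp.h>

void apply_wht(double *x, int S);   /* placeholder: size-S WHT on x[0..S-1] */

/* Small granularity: one size-S sub-transform per scheduled task. */
void small_grain(double *x, int R, int S)
{
    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < R; j++)
        apply_wht(x + j * S, S);
}

/* Large granularity: each thread gets one contiguous block of R / num_threads
 * sub-transforms (the default static schedule). */
void large_grain(double *x, int R, int S)
{
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < R; j++)
        apply_wht(x + j * S, S);
}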