Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado (UCM)

Index
1. Motivation
2. Experimental environment
3. Lifting Transform
4. Memory hierarchy exploitation
5. SIMD optimization
6. Conclusions
7. Future work

Motivation

- Applications based on the Wavelet Transform: JPEG-2000, MPEG-4
- Usage of the lifting scheme
- Study based on a modern general-purpose microprocessor: Pentium 4
- Objectives:
  - Efficient exploitation of the memory hierarchy
  - Use of the SIMD ISA extensions

Experimental Environment

UCMUCM 6 RedHat Distribution 7.2 (Enigma) Operating System 1 GB RDRAM (PC800)Memory 512 KB, 128 Byte/LineL2 8 KB, 64 Byte/Line, Write-Through DL1 NAIL1 Cache DFI WT70-EC Motherboard Intel Pentium4 (2,4 GHz) Platform Intel ICC compiler GCC compiler Compiler

Lifting Transform

Lifting Transform
[Figure: dataflow of the 1-D lifting transform. Each original element passes through four lifting steps with coefficients α, β, γ and δ, each adding a scaled sum of the two neighbouring samples; a final scaling produces the approximation (A) and detail (D) coefficients.]
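This dataflow can be written out directly. Below is a minimal scalar sketch of one 1-D lifting pass, assuming the coefficient names the slides use later (alfa, beta, gama, delt, phi) and the usual split into even samples s[] and odd samples d[]; boundary handling is omitted for brevity.

/* Minimal sketch of one 1-D lifting pass (our reconstruction).
   s[]: even samples -> approximation; d[]: odd samples -> detail. */
void lifting_1d(float *s, float *d, int n,
                float alfa, float beta, float gama, float delt, float phi)
{
    for (int i = 0; i < n - 1; i++) d[i] += alfa * (s[i] + s[i + 1]); /* 1st step */
    for (int i = 1; i < n; i++)     s[i] += beta * (d[i - 1] + d[i]); /* 2nd step */
    for (int i = 0; i < n - 1; i++) d[i] += gama * (s[i] + s[i + 1]); /* 3rd step */
    for (int i = 1; i < n; i++)     s[i] += delt * (d[i - 1] + d[i]); /* 4th step */
    for (int i = 0; i < n; i++) {   /* last step: scaling */
        s[i] *= phi;                /* approximation (A) */
        d[i] *= 1.0f / phi;         /* detail (D) */
    }
}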

N-Level Lifting Transform
[Figure: one level of the 2-D transform applies horizontal filtering (a 1-D lifting transform along the rows) and then vertical filtering (a 1-D lifting transform along the columns); the next level is applied only to the resulting approximation, for N levels in total.]
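A rough sketch of this recursion, assuming hypothetical helpers lifting_1d_row and lifting_1d_col (not from the slides): each level filters all rows, then all columns, then halves the region that the next level works on.

extern void lifting_1d_row(float *img, int row, int cols);
extern void lifting_1d_col(float *img, int col, int rows, int cols);

/* N-level 2-D lifting transform: each level recurses on the LL quadrant. */
void dwt_2d(float *img, int rows, int cols, int levels)
{
    for (int l = 0; l < levels; l++) {
        for (int i = 0; i < rows; i++)
            lifting_1d_row(img, i, cols);        /* horizontal filtering */
        for (int j = 0; j < cols; j++)
            lifting_1d_col(img, j, rows, cols);  /* vertical filtering */
        rows /= 2;   /* the next level only processes... */
        cols /= 2;   /* ...the approximation (LL) quadrant */
    }
}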

Lifting Transform
[Figure: traversal orders of the horizontal and vertical filtering passes over the image.]

Memory Hierarchy Exploitation

Memory Hierarchy Exploitation
- Poor data locality of one component (canonical layouts). E.g., a column-major layout while processing image rows (horizontal filtering).
  - Remedy: aggregation (loop tiling), sketched below.
- Poor data locality of the whole transform.
  - Remedy: other layouts.
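A generic form of the aggregation idea might look as follows; TILE and process_row_element are illustrative placeholders rather than the authors' code. Each block of TILE columns is revisited row after row while its cache lines are still resident.

extern void process_row_element(float **img, int i, int j); /* hypothetical kernel */

/* Loop tiling for row-wise processing of a column-major image (sketch). */
void filter_rows_tiled(float **img, int rows, int cols)
{
    enum { TILE = 64 };  /* tile width; would be tuned to the cache sizes */
    for (int jj = 0; jj < cols; jj += TILE)          /* blocks of columns */
        for (int i = 0; i < rows; i++)               /* every row */
            for (int j = jj; j < jj + TILE && j < cols; j++)
                process_row_element(img, i, j);      /* stays within one block */
}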

Memory Hierarchy Exploitation
[Figure: traversal orders of the horizontal and vertical filtering passes, repeated from the Lifting Transform section.]

Memory Hierarchy Exploitation
[Figure: aggregation (loop tiling) applied to the horizontal filtering of the image.]

Memory Hierarchy Exploitation
Different studied schemes (the output placement of the first two is contrasted in the sketch below):
- INPLACE
  - Common implementation of the transform
  - Memory: only requires the original matrix
  - For most applications needs post-processing
- MALLAT
  - Memory: requires 2 matrices
  - Stores the image in the expected order
- INPLACE-MALLAT
  - Memory: requires 2 matrices
  - Stores the image in the expected order
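A 1-D illustration of where the first two schemes place their output (our reconstruction with hypothetical helpers, not the original code):

extern float aprox_val(int i);   /* hypothetical: i-th approximation coefficient */
extern float detail_val(int i);  /* hypothetical: i-th detail coefficient */

void place_outputs(float *inplace_buf, float *mallat_buf, int n)
{
    for (int i = 0; i < n / 2; i++) {
        /* INPLACE: A and D stay interleaved in the original buffer,
           hence the post-processing step. */
        inplace_buf[2 * i]     = aprox_val(i);
        inplace_buf[2 * i + 1] = detail_val(i);
        /* MALLAT: a second buffer, all approximations first, then all
           details: already the expected order. */
        mallat_buf[i]         = aprox_val(i);
        mallat_buf[n / 2 + i] = detail_val(i);
    }
}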

Memory Hierarchy Exploitation
[Figure: INPLACE. Horizontal filtering turns the original elements of Matrix 1 into interleaved L and H columns; vertical filtering then produces the LL/LH/HL/HH subbands in the same matrix. The physical view interleaves coefficients of different subbands (LL1, LH1, LL2, LH2, HH1, HL1, ...), so it does not match the logical subband view.]

Memory Hierarchy Exploitation
[Figure: MALLAT. Horizontal filtering reads Matrix 1 and writes the L and H columns to Matrix 2; vertical filtering then writes the LL/LH/HL/HH subbands so that the physical view of the transformed image matches the logical subband view.]

Memory Hierarchy Exploitation
[Figure: INPLACE-MALLAT. Horizontal filtering is performed in place on Matrix 1; vertical filtering stores the approximation (LL) subband in Matrix 1 and the detail (LH/HL/HH) subbands in Matrix 2, both in the expected order.]

Memory Hierarchy Exploitation
- Execution time breakdown for several sizes, comparing both compilers.
- I, IM and M denote the inplace, inplace-mallat and mallat strategies, respectively.
- Each bar shows the execution time of each level and of the post-processing step.

Memory Hierarchy Exploitation: conclusions
- The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above.
- Both suffer a noticeable slowdown at the 1st level: larger working set and a more complex access pattern.
- The Inplace-Mallat version achieves the best execution time.
- The ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach.

SIMD Optimization

SIMD Optimization
- Objective: extract the parallelism available in the Lifting Transform.
- Different strategies: semi-automatic vectorization and hand-coded vectorization.
- Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout).

SIMD Optimization
Automatic vectorization (Intel C/C++ Compiler) requires (a loop shaped this way is sketched below):
- Inner loops
- Simple array index manipulation
- Iteration over contiguous memory locations
- Global variables avoided
- Pointer disambiguation if pointers are employed
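For instance, a loop meeting these conditions could look like the following (illustrative names, not from the slides); the restrict qualifiers give the compiler the pointer disambiguation mentioned above.

/* Vectorizer-friendly loop: inner loop, simple unit-stride indexing,
   no globals, and restrict-qualified pointers (C99). */
void scale_add(float *restrict dst, const float *restrict src,
               float coeff, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += coeff * src[i];   /* contiguous accesses, no aliasing */
}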

SIMD Optimization
[Figure: the four lifting steps (α, β, γ, δ) and final scaling, repeated from the Lifting Transform section; these A and D computations are the target of the vectorization.]

SIMD Optimization
[Figure: horizontal filtering with a column-major layout. The scalar version applies each lifting step to one element at a time; the vectorial version applies it to four elements of a column at once.]

SIMD Optimization
[Figure: vertical filtering with a column-major layout, scalar versus vectorial.]

SIMD Optimization

Horizontal vectorial filtering (semi-automatic). colN denotes a pointer to the N-th column of a sliding window of consecutive columns (col_m1 is the column preceding col0); detail and aprox point into the output subbands:

for (j = 2, k = 1; j < (n_columns - 4); j += 2, k++) {
  #pragma vector aligned
  for (i = 0; i < n_rows; i++) {
    /* 1st operation */
    col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
    /* 2nd operation */
    col2[i] = col2[i] + beta * (col3[i] + col1[i]);
    /* 3rd operation */
    col1[i] = col1[i] + gama * (col2[i] + col0[i]);
    /* 4th operation */
    col0[i] = col0[i] + delt * (col1[i] + col_m1[i]);
    /* Last step */
    detail[i] = col1[i] * phi_inv;
    aprox[i]  = col0[i] * phi;
  }
}

SIMD Optimization
Hand-coded vectorization:
- SIMD parallelism has to be expressed explicitly
- Intrinsics allow more flexibility
- Possibility to also vectorize the vertical filtering

SIMD Optimization

Horizontal vectorial filtering (hand-coded, SSE intrinsics); t0-t4 and coeff are __m128 registers holding four floats each:

/* 1st operation: col3 += alfa * (col4 + col2) */
t2 = _mm_load_ps(col2);
t4 = _mm_load_ps(col4);
t3 = _mm_load_ps(col3);
coeff = _mm_set_ps1(alfa);
t4 = _mm_add_ps(t2, t4);
t4 = _mm_mul_ps(t4, coeff);
t3 = _mm_add_ps(t4, t3);
_mm_store_ps(col3, t3);
/* 2nd operation */ ...
/* 3rd operation */ ...
/* 4th operation */ ...
/* Last step */
_mm_store_ps(detail, t1);
_mm_store_ps(aprox, t0);
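The 2nd to 4th operations are elided on the slide; by analogy with the scalar code, the 2nd one would presumably look like this (our reconstruction, not the original source):

/* 2nd operation: col2 += beta * (col3 + col1); t3 still holds col3 */
t1 = _mm_load_ps(col1);
__m128 t5;                      /* scratch register (our addition) */
coeff = _mm_set_ps1(beta);
t5 = _mm_add_ps(t3, t1);        /* col3 + col1 */
t5 = _mm_mul_ps(t5, coeff);     /* beta * (col3 + col1) */
t2 = _mm_add_ps(t5, t2);        /* col2 + beta * (col3 + col1) */
_mm_store_ps(col2, t2);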

SIMD Optimization
- Execution time breakdown of the horizontal filtering ( pixels image).
- I, IM and M denote the inplace, inplace-mallat and mallat approaches.
- S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.

SIMD Optimization: conclusions
- Speedup between 4 and 6 depending on the strategy. Such a large improvement comes not only from the vectorial computations but also from a considerable reduction in memory accesses.
- The speedups achieved by the strategies with recursive layouts (i.e., inplace-mallat and mallat) are higher than those of the inplace version, since computation in the latter can only be vectorized at the first level.
- For ICC, both vectorization approaches (i.e., automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.

SIMD Optimization
- Execution time breakdown of the whole transform ( pixels image).
- I, IM and M denote the inplace, inplace-mallat and mallat approaches.
- S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.

SIMD Optimization: conclusions
- Speedup between 1.5 and 2 depending on the strategy.
- For ICC, the shortest execution time is reached by the mallat version.
- When using GCC, both recursive-layout strategies obtain similar results.

SIMD Optimization
- Speedup achieved by the different vectorial codes over inplace-mallat and over inplace.
- Shown: hand-coded ICC, automatic ICC, and hand-coded GCC.

SIMD Optimization: conclusions
- The speedup grows with the image size, since larger images put more pressure on the memory system.
- On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 over the inplace strategy.
- Focusing on the compilers, ICC clearly outperforms GCC by a significant margin for all image sizes.

Conclusions

Conclusions
- Scalar version: we have introduced a new scheme, called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme.
- SIMD exploitation: code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler, semi-automatic and intrinsics-based vectorization; both provide similar results.
- Speedup: about 4-6 for the horizontal filtering (vectorization also reduces the pressure on the memory system); around 2 for the whole transform.
- The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.
- Most of our insights are compiler independent.

Future work

Future work
- 4D layout for a lifting-based scheme
- Measurements on other platforms: Intel Itanium; Intel Pentium 4 with Hyper-Threading
- Parallelization using OpenMP (SMT)