Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado (UCM)

Slide 2: Index
1. Motivation
2. Experimental environment
3. Lifting Transform
4. Memory hierarchy exploitation
5. SIMD optimization
6. Conclusions
7. Future work

Slide 3: Motivation

Slide 4: Motivation
- Applications based on the Wavelet Transform: JPEG-2000, MPEG-4
- Usage of the lifting scheme
- Study based on a modern general-purpose microprocessor
  o Pentium 4
- Objectives:
  o Efficient exploitation of the memory hierarchy
  o Use of the SIMD ISA extensions

Slide 5: Experimental Environment

Slide 6: Experimental Environment

Platform           Intel Pentium 4 (2.4 GHz)
Motherboard        DFI WT70-EC
Cache    IL1       NA
         DL1       8 KB, 64 bytes/line, write-through
         L2        512 KB, 128 bytes/line
Memory             1 GB RDRAM (PC800)
Operating system   RedHat Distribution 7.2 (Enigma)
Compiler           Intel ICC compiler, GCC compiler

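Slide 7: Lifting Transform

Slide 8: Lifting Transform
[Figure: data-flow diagram of the 1D lifting transform. Legend: original element, 1st step, 2nd step. Multiply-add stages (+, ×, +) with the coefficients α, β, γ and δ, followed by a final scaling, turn the original elements into alternating approximation (A) and detail (D) coefficients.]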
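The data flow of the diagram can be sketched in scalar C as four in-place multiply-add passes followed by a scaling pass. This is a minimal sketch under stated assumptions, not the authors' code: the names `lift_coeffs` and `lifting_1d` are hypothetical, the coefficient values are left to the caller because the slides do not give them, and mirroring at the borders is an assumed boundary treatment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the lifting coefficients; the slides name them
   alfa, beta, gama, delt and phi but do not give their values. */
typedef struct { float alpha, beta, gamma, delta, zeta; } lift_coeffs;

/* Update every odd sample from its two even neighbours. */
static void lift_odd(float *x, size_t n, float c) {
    for (size_t i = 1; i + 1 < n; i += 2)
        x[i] += c * (x[i - 1] + x[i + 1]);
    /* Last odd sample: the missing right neighbour is mirrored (assumption). */
    x[n - 1] += 2.0f * c * x[n - 2];
}

/* Update every even sample from its two odd neighbours. */
static void lift_even(float *x, size_t n, float c) {
    /* First even sample: the missing left neighbour is mirrored (assumption). */
    x[0] += 2.0f * c * x[1];
    for (size_t i = 2; i < n; i += 2)
        x[i] += c * (x[i - 1] + x[i + 1]);
}

/* In-place 1D lifting: on return, even positions hold approximation (A)
   coefficients and odd positions hold detail (D) coefficients. */
void lifting_1d(float *x, size_t n, const lift_coeffs *k) {
    assert(n >= 2 && n % 2 == 0);
    lift_odd(x, n, k->alpha);   /* 1st operation */
    lift_even(x, n, k->beta);   /* 2nd operation */
    lift_odd(x, n, k->gamma);   /* 3rd operation */
    lift_even(x, n, k->delta);  /* 4th operation */
    for (size_t i = 0; i < n; i += 2) {  /* last step: scaling */
        x[i] *= k->zeta;
        x[i + 1] /= k->zeta;
    }
}
```

Because every pass only reads samples of the opposite parity, each pass can run in place, which is what makes the INPLACE storage scheme discussed later possible.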

Slide 9: Lifting Transform
[Figure: N-level lifting transform. One level consists of horizontal filtering (1D lifting transform) followed by vertical filtering (1D lifting transform). Legend: original element, approximation.]

Slide 10: Lifting Transform
[Figure: horizontal filtering (labels 1, 2) and vertical filtering (labels 2, 1) of the image.]

Slide 11: Memory Hierarchy Exploitation

Slide 12: Memory Hierarchy Exploitation
- Poor data locality of one component (canonical layouts)
  E.g.: column-major layout when processing image rows (horizontal filtering)
  o Aggregation (loop tiling)
- Poor data locality of the whole transform
  o Other layouts

Slide 13: Memory Hierarchy Exploitation
[Figure: horizontal filtering (labels 1, 2) and vertical filtering (labels 2, 1) of the image.]

Slide 14: Memory Hierarchy Exploitation
[Figure: aggregation applied to the horizontal filtering of the image.]
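One way to picture aggregation is to compare a row-at-a-time horizontal pass with a tiled one on a column-major image. The sketch below is an illustration only: the function names, the `AT` macro and the `tile` parameter are hypothetical, a single lifting operation stands in for the whole filter, and the last column is left untouched for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major storage: element (r, c) lives at img[c * rows + r]. */
#define AT(img, rows, r, c) ((img)[(size_t)(c) * (rows) + (r)])

/* Row at a time: consecutive accesses are `rows` floats apart, so almost
   every access touches a different cache line. */
void predict_rows_naive(float *img, size_t rows, size_t cols, float alpha) {
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 1; c + 1 < cols; c += 2)
            AT(img, rows, r, c) +=
                alpha * (AT(img, rows, r, c - 1) + AT(img, rows, r, c + 1));
}

/* Aggregated: `tile` adjacent rows are filtered together, so the innermost
   loop walks contiguous memory inside each column. */
void predict_rows_aggregated(float *img, size_t rows, size_t cols,
                             float alpha, size_t tile) {
    assert(tile > 0);
    for (size_t r0 = 0; r0 < rows; r0 += tile) {
        size_t r1 = (r0 + tile < rows) ? r0 + tile : rows;
        for (size_t c = 1; c + 1 < cols; c += 2)
            for (size_t r = r0; r < r1; r++)
                AT(img, rows, r, c) +=
                    alpha * (AT(img, rows, r, c - 1) + AT(img, rows, r, c + 1));
    }
}
```

Both functions compute the same values; only the traversal order differs. A contiguous innermost loop is also the shape that the vectorization slides later rely on.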

Slide 15: Memory Hierarchy Exploitation
Different studied schemes

INPLACE
- Common implementation of the transform
- Memory: only requires the original matrix
- For most applications, needs post-processing

MALLAT
- Memory: requires 2 matrices
- Stores the image in the expected order

INPLACE-MALLAT
- Memory: requires 2 matrices
- Stores the image in the expected order

Slide 16: Memory Hierarchy Exploitation
[Figure: INPLACE scheme, logical and physical views. Horizontal filtering turns the original elements (O) of Matrix 1 into L and H coefficients; vertical filtering then produces the LL, LH, HL and HH coefficients. In one view the transformed image keeps the coefficients interleaved (LL1 LH1 LL2 LH2 ... / HH1 HL1 HH2 HL2 ...); in the other they are grouped by subband (LL1 LL2 LL3 LL4, LH1 LH2 LH3 LH4, ..., HL1 ...).]

Slide 17: Memory Hierarchy Exploitation
[Figure: MALLAT scheme, logical and physical views, using Matrix 1 and Matrix 2. Horizontal filtering turns the original elements (O) into L and H coefficients; vertical filtering then produces the LL, LH, HL and HH coefficients. The transformed image is stored grouped by subband (LL1 LL2 LL3 LL4, LH1 LH2 LH3 LH4, ..., HL1 ...).]

Slide 18: Memory Hierarchy Exploitation
[Figure: INPLACE-MALLAT scheme, logical and physical views, using Matrix 1 and Matrix 2. Horizontal filtering turns the original elements (O) into L and H coefficients; vertical filtering then produces the LL, LH, HL and HH coefficients. The transformed image is split between the two matrices: Matrix 1 holds the LL coefficients (LL1 LL2 LL3 LL4 ...) and Matrix 2 holds LH1 LH2 LH3 LH4 ..., HL1 ...]
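The difference between the interleaved and the subband-ordered storage shown in the last three figures can be illustrated with a small reordering routine. This is only a sketch of the kind of post-processing an interleaved result may need; the function name is hypothetical, a column-major layout is assumed, and the code does not attach subband names to the quadrants because naming conventions vary.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder an interleaved one-level result into four contiguous quadrants.
   Column-major storage is assumed: element (r, c) is at index c * rows + r.
   Even rows/columns go to the first half, odd ones to the second half. */
void deinterleave_subbands(const float *src, float *dst,
                           size_t rows, size_t cols) {
    assert(rows % 2 == 0 && cols % 2 == 0);
    for (size_t c = 0; c < cols; c++) {
        size_t dc = c / 2 + ((c % 2) ? cols / 2 : 0);      /* destination column */
        for (size_t r = 0; r < rows; r++) {
            size_t dr = r / 2 + ((r % 2) ? rows / 2 : 0);  /* destination row */
            dst[dc * rows + dr] = src[c * rows + r];
        }
    }
}
```

The routine needs a second matrix and a full extra pass over the image. Schemes that write the coefficients in subband order during filtering avoid that pass, at the price of the second matrix listed on slide 15.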

Slide 19: Memory Hierarchy Exploitation
- Execution time breakdown for several image sizes, comparing both compilers.
- I, IM and M denote the inplace, inplace-mallat and mallat strategies, respectively.
- Each bar shows the execution time of each level and of the post-processing step.

Slide 20: Memory Hierarchy Exploitation
CONCLUSIONS
- The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above
- These 2 approaches have a noticeable slowdown for the 1st level:
  o Larger working set
  o More complex access pattern
- The Inplace-Mallat version achieves the best execution time
- The ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach

Slide 21: SIMD Optimization

Slide 22: SIMD Optimization
- Objective: extract the parallelism available in the Lifting Transform
- Different strategies:
  o Semi-automatic vectorization
  o Hand-coded vectorization
- Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout)

Slide 23: SIMD Optimization
Automatic Vectorization (Intel C/C++ Compiler)
- Inner loops
- Simple array index manipulation
- Iterate over contiguous memory locations
- Global variables avoided
- Pointer disambiguation if pointers are employed
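A loop that meets the conditions listed above might look like the following sketch. The function name and parameters are hypothetical, and C99 `restrict` is used here as one possible form of pointer disambiguation; the slides do not say which mechanism the authors relied on.

```c
#include <assert.h>
#include <stddef.h>

/* One lifting operation on a whole column: target[i] += coeff * (left[i] + right[i]).
   - innermost loop with a simple index (i)
   - unit-stride accesses over contiguous memory
   - no global variables
   - `restrict` promises the compiler that the three arrays do not overlap */
void lift_column(float *restrict target, const float *restrict left,
                 const float *restrict right, float coeff, size_t rows) {
    assert(target != NULL && left != NULL && right != NULL);
    for (size_t i = 0; i < rows; i++)
        target[i] += coeff * (left[i] + right[i]);
}
```

With a column-major layout, `target`, `left` and `right` would point to three neighbouring image columns, which is the situation of the horizontal filtering.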

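Slide 24: SIMD Optimization
[Figure: data-flow diagram of the 1D lifting transform, repeated from slide 8. Legend: original element, 1st step, 2nd step. Multiply-add stages (+, ×, +) with the coefficients α, β, γ and δ, followed by a final scaling, produce approximation (A) and detail (D) coefficients.]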

Slide 25: SIMD Optimization
[Figure: column-major layout. The multiply-add (+, ×, +) of the horizontal filtering, shown in its scalar form and in its vectorial form.]

Slide 26: SIMD Optimization
[Figure: column-major layout. The multiply-add (+, ×, +) of the vertical filtering, shown in its scalar form and in its vectorial form.]

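Slide 27: SIMD Optimization
Horizontal Vectorial Filtering (semi-automatic)

    /* n_rows and n_columns stand for the slide's #rows and #columns, and
       col_m1 for its col-1. The [i] row subscripts are restored here on the
       assumption that the transcript dropped them. The slide does not show
       how the column pointers are set up for each j. */
    for (j = 2, k = 1; j < (n_columns - 4); j += 2, k++) {
        #pragma vector aligned
        for (i = 0; i < n_rows; i++) {
            /* 1st operation */
            col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
            /* 2nd operation */
            col2[i] = col2[i] + beta * (col3[i] + col1[i]);
            /* 3rd operation */
            col1[i] = col1[i] + gama * (col2[i] + col0[i]);
            /* 4th operation */
            col0[i] = col0[i] + delt * (col1[i] + col_m1[i]);
            /* Last step */
            detail[i] = col1[i] * phi_inv;
            aprox[i]  = col0[i] * phi;
        }
    }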

Slide 28: SIMD Optimization
Hand-coded Vectorization
- SIMD parallelism has to be explicitly expressed
- Intrinsics allow more flexibility
- Possibility to also vectorize the vertical filtering
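With a column-major layout, the neighbours used by the vertical filtering sit next to each other in memory, so a hand-coded version has to separate even and odd samples itself. The sketch below shows one way to do this for a single lifting operation with SSE shuffles. It is an assumption about how such code could look, not the authors' implementation: the function name is hypothetical, unaligned loads are used for simplicity, and the last sample of the column is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Along one contiguous column: x[2i+1] += alpha * (x[2i] + x[2i+2]),
   four odd samples per iteration. */
void predict_column_sse(float *x, size_t n, float alpha) {
    assert(x != NULL);
    const __m128 coeff = _mm_set1_ps(alpha);
    size_t i = 0;
    /* Each iteration reads x[i .. i+9] and rewrites x[i .. i+7]. */
    for (; i + 10 <= n; i += 8) {
        __m128 a = _mm_loadu_ps(x + i);      /* x0 x1 x2 x3 */
        __m128 b = _mm_loadu_ps(x + i + 4);  /* x4 x5 x6 x7 */
        __m128 c = _mm_loadu_ps(x + i + 2);  /* x2 x3 x4 x5 */
        __m128 d = _mm_loadu_ps(x + i + 6);  /* x6 x7 x8 x9 */
        __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* x0 x2 x4 x6 */
        __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* x1 x3 x5 x7 */
        __m128 next = _mm_shuffle_ps(c, d, _MM_SHUFFLE(2, 0, 2, 0)); /* x2 x4 x6 x8 */
        odd = _mm_add_ps(odd, _mm_mul_ps(coeff, _mm_add_ps(even, next)));
        /* Re-interleave the unchanged even samples with the updated odd ones. */
        _mm_storeu_ps(x + i,     _mm_unpacklo_ps(even, odd));
        _mm_storeu_ps(x + i + 4, _mm_unpackhi_ps(even, odd));
    }
    /* Scalar loop for the remaining interior odd samples. */
    for (size_t k = i + 1; k + 1 < n; k += 2)
        x[k] += alpha * (x[k - 1] + x[k + 1]);
}
```

The extra shuffles are the cost of working along the contiguous direction. The horizontal filtering needs none, because there each vector lane simply holds a different row.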

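Slide 29: SIMD Optimization
Horizontal Vectorial Filtering (hand)

    /* Needs <xmmintrin.h>; t0..t4 and coeff are __m128 values, and the
       col, detail and aprox pointers must be 16-byte aligned for
       _mm_load_ps and _mm_store_ps. */

    /* 1st operation: col3 = col3 + alfa * (col4 + col2), four rows at a time */
    t2 = _mm_load_ps(col2);
    t4 = _mm_load_ps(col4);
    t3 = _mm_load_ps(col3);
    coeff = _mm_set_ps1(alfa);
    t4 = _mm_add_ps(t2, t4);
    t4 = _mm_mul_ps(t4, coeff);
    t3 = _mm_add_ps(t4, t3);
    _mm_store_ps(col3, t3);
    /* 2nd operation */
    /* 3rd operation */
    /* 4th operation */
    /* Last step */
    _mm_store_ps(detail, t1);
    _mm_store_ps(aprox, t0);

[Figure: registers t2, t3 and t4 feeding the multiply-add (+, ×, +) of the 1st operation.]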

Slide 30: SIMD Optimization
- Execution time breakdown of the horizontal filtering (1024 × 1024 pixel image).
- I, IM and M denote the inplace, inplace-mallat and mallat approaches.
- S, A and H denote the scalar, automatically vectorized and hand-coded vectorized versions.

Slide 31: SIMD Optimization
CONCLUSIONS
- Speedup between 4 and 6, depending on the strategy. Such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in memory accesses.
- The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than those of their inplace counterparts, since the computation in the latter can only be vectorized in the first level.
- For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.

Slide 32: SIMD Optimization
- Execution time breakdown of the whole transform (1024 × 1024 pixel image).
- I, IM and M denote the inplace, inplace-mallat and mallat approaches.
- S, A and H denote the scalar, automatically vectorized and hand-coded vectorized versions.

Slide 33: SIMD Optimization
CONCLUSIONS
- Speedup between 1.5 and 2, depending on the strategy.
- For ICC, the shortest execution time is reached by the mallat version.
- When using GCC, both recursive-layout strategies obtain similar results.

Slide 34: SIMD Optimization
- Speedup achieved by the different vectorial codes over the inplace-mallat and inplace versions.
- We show the hand-coded ICC, the automatic ICC and the hand-coded GCC versions.

Slide 35: SIMD Optimization
CONCLUSIONS
- The speedup grows with the image size.
- On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when measured against the inplace strategy.
- Focusing on the compilers, ICC clearly outperforms GCC, by a significant 20-25% for all the image sizes.

Slide 36: Conclusions

Slide 37: Conclusions
- Scalar version: we have introduced a new scheme, called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme.
- SIMD exploitation: code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler: semi-automatic and intrinsic-based vectorization. Both provide similar results.
- Speedup:
  o Horizontal filtering: about 4-6 (vectorization also reduces the pressure on the memory system).
  o Whole transform: around 2.
- The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.
- Most of our insights are compiler independent.

Slide 38: Future work

Slide 39: Future work
- 4D layout for a lifting-based scheme
- Measurements using other platforms
  o Intel Itanium
  o Intel Pentium 4 with Hyper-Threading
- Parallelization using OpenMP (SMT)

For additional information: http://www.dacya.ucm.es/dchaver
