Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado U C M
UCMUCM 2 Index 1.Motivation 2.Experimental environment 3.Lifting Transform 4.Memory hierarchy exploitation 5.SIMD optimization 6.Conclusions 7.Future work
UCMUCM 3 Motivation
UCMUCM 4 Applications based on the Wavelet Transform: JPEG-2000 MPEG-4 Usage of the lifting scheme Study based on a modern general purpose microprocessor oPentium 4 Objectives: oEfficient exploitation of Memory Hierarchy oUse of the SIMD ISA extensions
UCMUCM 5 Experimental Environment
UCMUCM 6 RedHat Distribution 7.2 (Enigma) Operating System 1 GB RDRAM (PC800)Memory 512 KB, 128 Byte/LineL2 8 KB, 64 Byte/Line, Write-Through DL1 NAIL1 Cache DFI WT70-EC Motherboard Intel Pentium4 (2,4 GHz) Platform Intel ICC compiler GCC compiler Compiler
UCMUCM 7 Lifting Transform
UCMUCM 8 D 1 st Lifting Transform Original element 1 st step 2 nd step + x + + β x + + x + + δ x+ x x A D D D A A A 1 st
UCMUCM 9 N Levels Lifting Transform 1 Level Horizontal Filtering (1D Lifting Transform) Vertical Filtering (1D Lifting Transform) Original element Approximation
UCMUCM 10 Lifting Transform Horizontal Filtering 1 2 Vertical Filtering 2 1
UCMUCM 11 Memory Hierarchy Exploitation
UCMUCM 12 Poor data locality of one component (canonical layouts) E.g. : column-major layout processing image rows (Horizontal Filtering) o Aggregation (loop tiling) Memory Hierarchy Exploitation Poor data locality of the whole transform o Other layouts
UCMUCM 13 Memory Hierarchy Exploitation Horizontal Filtering 1 2 Vertical Filtering 2 1
UCMUCM 14 Aggregation Horizontal Filtering IMAGE 2 1 Memory Hierarchy Exploitation
UCMUCM 15 Memory Hierarchy Exploitation INPLACE Common implementation of the transform Memory: Only requires the original matrix For most applications needs post-processing MALLAT Memory: requires 2 matrices Stores the image in the expected order INPLACE-MALLAT Memory: requires 2 matrices Stores the image in the expected order Different studied schemes
UCMUCM 16 Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 L L L L L L L L H H H H H H H H Horizontal Filtering LL 1 HH 1 HL 1 LH 1 LL 3 HH 3 HL 3 LH 3 LL 4 HH 4 HL 4 LH 4 LL 2 HH 2 HL 2 LH 2 Vertical Filtering Transformed image... LL 1 LH 1 LL 2 LH 2 HH 1 HL 1 HH 2 HL 2 LL 3 logical view physical view INPLACE LL 1 LL 2 LL 3 LL 4 LH 2 LH 1 LH 4 LH 3... HL 1
UCMUCM 17 Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O L L L L L L L L H H H H H H H H Horizontal Filtering MATRIX 1MATRIX 2 LL 1 LL 2 LL 4 LL 3 HH 3 HH 4 HH 2 HH 1 HL 1 HL 2 HL 4 HL 3 LH 1 LH 2 LH 4 LH 3 Vertical Filtering Transformed image LL 1 LL 2 LL 3 LL 4 LH 2 LH 1 LH 4 LH 3... HL 1 logical view physical view MALLAT
UCMUCM 18 Memory Hierarchy Exploitation MATRIX 1 MATRIX 2 O O O O O O O O O O O O O O O O logical view L L L L L L L L H H H H H H H H Horizontal Filtering LL 1 LL 2 LL 4 LL 3 HH 3 HH 4 HH 2 HH 1 HL 1 HL 2 HL 4 HL 3 LH 1 LH 2 LH 4 LH 3 Vertical Filtering Transformed image (Matrix 1) LL 1 LL 2 LL 3 LL 4... Transformed image (Matrix 2) LH 2 LH 1 LH 4 LH 3... HL 1 physical view INPLACE- MALLAT
UCMUCM 19 Memory Hierarchy Exploitation Execution time breakdown for several sizes comparing both compilers. I, IM and M denote inplace, inplace-mallat, and mallat strategies respectively. Each bar shows the execution time of each level and the post-processing step.
UCMUCM 20 The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above These 2 approaches have a noticeable slowdown for the 1 st level: Larger working set More complex access pattern The Inplace-Mallat version achieves the best execution time ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach Memory Hierarchy Exploitation CONCLUSIONS
UCMUCM 21 SIMD Optimization
UCMUCM 22 Objective: Extract the parallelism available on the Lifting Transform Different strategies: Semi-automatic vectorization Hand-coded vectorization Only the horizontal filtering of the transform can be semi- automatically vectorized (when using a column-major layout) SIMD Optimization
UCMUCM 23 SIMD Optimization Automatic Vectorization (Intel C/C++ Compiler) Inner loops Simple array index manipulation Iterate over contiguous memory locations Global variables avoided Pointer disambiguation if pointers are employed
UCMUCM 24 Original element 1 st step 2 nd step + x + + β x + + x + + δ x+ x x A D SIMD Optimization 1 st
UCMUCM 25 SIMD Optimization Column-major layout Vectorial Horizontal filtering + x + Horizontal filtering + x +
UCMUCM 26 SIMD Optimization Column-major layout Vectorial Vertical filtering + x + Vertical filtering + x +
UCMUCM 27 for(j=2,k=1;j<(#columns-4);j+=2,k++) { #pragma vector aligned for(i=0;i<#rows;i++) { /* 1st operation */ col3=col3 + alfa*( col4+ col2); /* 2nd operation */ col2=col2 + beta*( col3+ col1); /* 3rd operation */ col1=col1 + gama*( col2+ col0); /* 4th operation */ col0 =col0 + delt*( col1+ col-1); /* Last step */ detail = col1 *phi_inv; aprox = col0 *phi; } Horizontal Vectorial Filtering (semi-automatic) SIMD Optimization
UCMUCM 28 SIMD Optimization Hand-coded Vectorization SIMD parallelism has to be explicitly expressed Intrinsics allow more flexibility Possibility to also vectorize the vertical filtering
UCMUCM 29 Horizontal Vectorial Filtering (hand) SIMD Optimization /* 1st operation */ t2 = _mm_load_ps(col2); t4 = _mm_load_ps(col4); t3 = _mm_load_ps(col3); coeff = _mm_set_ps1(alfa); t4 = _mm_add_ps(t2,t4); t4 = _mm_mul_ps(t4,coeff); t3 = _mm_add_ps(t4,t3); _mm_store_ps(col3,t3); /* 2nd operation */ /* 3rd operation */ /* 4th operation */ /* Last step */ _mm_store_ps(detail,t1); _mm_store_ps(aprox,t0); t2t3t4 + x +
UCMUCM 30 SIMD Optimization Execution time breakdown of the horizontal filtering ( pixels image). I, IM and M denote inplace, inplace- mallat and mallat approaches. S, A and H denote scalar, automatic- vectorized and hand-coded- vectorized.
UCMUCM 31 SIMD Optimization Speedup between 4 and 6 depending on the strategy. The reason for such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in the memory accesses. The speedups achieved by the strategies with recursive layouts (i.e. inplace- mallat and mallat) are higher than the inplace version counterparts, since the computation on the latter can only be vectorized in the first level. For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer. CONCLUSIONS
UCMUCM 32 SIMD Optimization Execution time breakdown of the whole transform ( pixels image). I, IM and M denote inplace, inplace- mallat and mallat approaches. S, A and H denote scalar, automatic- vectorized and hand-coded- vectorized.
UCMUCM 33 SIMD Optimization Speedup between 1,5 and 2 depending on the strategy. For ICC the shortest execution time is reached by the mallat version. When using GCC both recursive-layout strategies obtain similar results. CONCLUSIONS
UCMUCM 34 SIMD Optimization Speedup achieved by the different vectorial codes over the inplace- mallat and inplace. We show the hand- coded ICC, the automatic ICC, and the hand-coded GCC.
UCMUCM 35 SIMD Optimization The speedup grows with the image size since. On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy. Focusing on the compilers, ICC clearly outperforms GCC by a significant % for all the image sizes CONCLUSIONS
UCMUCM 36 Conclusions
UCMUCM 37 Scalar version: We have introduced a new scheme called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme. SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with ICC compiler: semi- automatic and intrinsic-based vectorizations. Both provide similar results. Speedup: Horizontal filtering about 4-6 (vectorization also reduces the pressure on the memory system). Whole transform around 2. The vectorial Mallat approach outperforms the other schemes and exhibits a better scalability. Most of our insights are compiler independent. Conclusions
UCMUCM 38 Future work
UCMUCM 39 4D layout for a lifting-based scheme Measurements using other platforms Intel Itanium Intel Pentium-4 with hiperthreading Parallelization using OpenMP (SMT) Future work For additional information: