
Slide 1: SP 3D Running Average Implementation. SSE + OpenMP, benchmarking on different platforms. Dr. Zvi Danovich, Senior Application Engineer, January 2008. Copyright © 2007 Intel Corporation.

Slide 2: Agenda
- What is 3D Running Average (RA)?
- From 1D to 3D RA implementation
- Basic SSE technique: AoS <=> SoA transforms
- 1D RA 4-lines SSE implementation
- 2nd dimension completion
- 3rd dimension completion
- Adding OpenMP, benchmarking, conclusions

Slide 3: 3D Running Average (RA): what is it?
3D RA is computed for each voxel V as the normalized sum of the source voxels v inside a k x k x k cube (k is odd) located "around" the given voxel: V = (1/k^3) * sum of v over the k x k x k cube. In other words, 3D RA can be considered a 3D convolution with a kernel whose components are all equal to 1/(k x k x k).
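Written out with indices (a restatement of the slide's formula, with r denoting the cube radius):

```latex
\[
V_{x,y,z} \;=\; \frac{1}{k^{3}}\sum_{i=-r}^{r}\;\sum_{j=-r}^{r}\;\sum_{l=-r}^{r} v_{x+i,\,y+j,\,z+l},
\qquad r = \tfrac{k-1}{2},\quad k \text{ odd.}
\]
```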

Slide 4: Agenda (section divider; same items as Slide 2).

Slide 5: 1D Running Average (RA)
Unlike 1D convolution, 1D RA can be computed at O(1) cost per output point (independent of k) using the following approach:
- Prolog: compute the sum S of the first k voxels, S = sum(v_i), i = 0..k-1.
- Main step: to compute the next sum S+1 = sum(v_i), i = 1..k, the first member of the previous sum (v_0) is subtracted and the next component (v_k) is added: S+1 = S - v_0 + v_k.
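For reference, a minimal scalar C sketch of this prolog/main-step scheme (the function name, signature and the "output i covers samples i..i+k-1" convention are illustrative; normalization by 1/k is folded in here and boundary handling is omitted):

```c
#include <stddef.h>

/* 1D running average of window width k (odd): out[i] receives the
   normalized sum of in[i] .. in[i+k-1], interior positions only.     */
static void running_average_1d(const float *in, float *out, size_t n, int k)
{
    if ((size_t)k > n)
        return;

    /* Prolog: sum of the first k samples. */
    float s = 0.0f;
    for (int i = 0; i < k; ++i)
        s += in[i];
    out[0] = s / (float)k;

    /* Main step: one subtract (sample leaving the window) and one add
       (sample entering it) per output, regardless of k.              */
    for (size_t i = 1; i + (size_t)k <= n; ++i) {
        s = s - in[i - 1] + in[i + k - 1];
        out[i] = s / (float)k;
    }
}
```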

Slide 6: Extending 1D Running Average toward 2D
Given a slice (plane) in which all lines L_i are already 1D-averaged, we can extend the averaging to 2D by the same approach:
- Prolog: compute the sum S of the first k lines, S = sum(L_i), i = 0..k-1.
- Main step: to compute the next sum S+1 = sum(L_i), i = 1..k, the first line of the previous sum (L_0) is subtracted and the next line (L_k) is added: S+1 = S - L_0 + L_k.
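The same recurrence applied to whole lines might look like the scalar sketch below (names and the "output row y holds the average of rows y..y+k-1" convention are illustrative, and in/out must not overlap in this sketch); the 3D extension over planes on the next slide has exactly the same shape, with planes in place of lines:

```c
#include <stddef.h>

/* 2D completion: rows of in[] are already 1D-averaged along x; produce
   out[] averaged over k rows as well.  'width' is the row length.      */
static void complete_2d(const float *in, float *out,
                        size_t width, size_t height, int k)
{
    if ((size_t)k > height)
        return;

    /* Prolog: column-wise sum of the first k rows, written normalized. */
    for (size_t x = 0; x < width; ++x) {
        float s = 0.0f;
        for (int y = 0; y < k; ++y)
            s += in[(size_t)y * width + x];
        out[x] = s / (float)k;
    }

    /* Main step: S_{+1} = S - L_0 + L_k, a whole row leaves and a whole
       row enters the window at each step.                              */
    for (size_t y = 1; y + (size_t)k <= height; ++y) {
        const float *leave = in  + (y - 1) * width;
        const float *enter = in  + (y + (size_t)k - 1) * width;
        const float *prev  = out + (y - 1) * width;
        float       *cur   = out + y * width;
        for (size_t x = 0; x < width; ++x)
            cur[x] = prev[x] + (enter[x] - leave[x]) / (float)k;
    }
}
```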

Slide 7: Extending 2D Running Average toward 3D
Given a stack of planes in which all planes P_i are already 2D-averaged, we can extend the averaging to 3D by the same approach:
- Prolog: compute the sum S of the first k planes, S = sum(P_i), i = 0..k-1.
- Main step: to compute the next sum S+1 = sum(P_i), i = 1..k, the first plane of the previous sum (P_0) is subtracted and the next plane (P_k) is added: S+1 = S - P_0 + P_k.

Slide 8: Agenda (section divider; same items as Slide 2).

Slide 9: Array of Structures (AoS) => Structure of Arrays (SoA): why and how should the data be transformed to vectorize 1D Running Average?
The original "natural" serial data structure is AoS: each of the four lines L0..L3 stores its own samples m_0, m_1, ..., m_k consecutively, and the running sum along one line, M_s = sum(m_i) for i = 0..k-1, M_s+1 = sum(m_i) for i = 1..k = M_s - m_0 + m_k, advances one scalar at a time: NOT enabled for SSE.
The "transposed" data structure is SoA: for each position j, the four values v0_j, v1_j, v2_j, v3_j (the j-th samples of lines 0..3) are packed into one SSE register, so the same recurrence S = sum(v_i) for i = 0..k-1, S+1 = sum(v_i) for i = 1..k = S - v_0 + v_k updates four lines at once: ENABLED for SSE!

Slide 10: Array of Structures (AoS) => Structure of Arrays (SoA)
Presented below: transposition of 4 quads from the 4 original lines (L0 org .. L3 org) into 4 SSE registers of x, y, z and w components. Pairs of loadlo/loadhi operations build four intermediate registers, xy10 = (x0 y0 x1 y1), zw10 = (z0 w0 z1 w1), xy32 = (x2 y2 x3 y3), zw32 = (z2 w2 z3 w3); four shuffles then produce the final SSE registers: x = shuffle(xy10, xy32, (2,0,2,0)), y = shuffle(xy10, xy32, (3,1,3,1)), z = shuffle(zw10, zw32, (2,0,2,0)), w = shuffle(zw10, zw32, (3,1,3,1)). The transform takes 12 SSE operations per 16 components.
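One way to spell this out with SSE intrinsics, following the loadlo/loadhi + shuffle scheme on the slide, is sketched below (the helper name and pointer arguments are illustrative, not the original code); the eight half-register loads plus four shuffles give the 12 operations per 16 floats mentioned above:

```c
#include <xmmintrin.h>

/* Transpose one quad of 4 consecutive floats from each of 4 rows (AoS)
   into 4 SSE registers, one per position within the quad (SoA).
   l0..l3 point at the current quad inside lines 0..3; lane i of every
   output register belongs to line i.                                   */
static void aos_to_soa_quad(const float *l0, const float *l1,
                            const float *l2, const float *l3,
                            __m128 *x, __m128 *y, __m128 *z, __m128 *w)
{
    /* Intermediates: xy10 = x0 y0 x1 y1, zw10 = z0 w0 z1 w1, etc.
       (_mm_setzero_ps only supplies a register for the first half-load). */
    __m128 xy10 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)l0);
    xy10        = _mm_loadh_pi(xy10,             (const __m64 *)l1);
    __m128 zw10 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)(l0 + 2));
    zw10        = _mm_loadh_pi(zw10,             (const __m64 *)(l1 + 2));
    __m128 xy32 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)l2);
    xy32        = _mm_loadh_pi(xy32,             (const __m64 *)l3);
    __m128 zw32 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)(l2 + 2));
    zw32        = _mm_loadh_pi(zw32,             (const __m64 *)(l3 + 2));

    /* Final registers: pick even / odd lanes of the intermediates. */
    *x = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(2, 0, 2, 0)); /* x0 x1 x2 x3 */
    *y = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(3, 1, 3, 1)); /* y0 y1 y2 y3 */
    *z = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(2, 0, 2, 0)); /* z0 z1 z2 z3 */
    *w = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(3, 1, 3, 1)); /* w0 w1 w2 w3 */
}
```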

Slide 11: Array of Structures (AoS) <= Structure of Arrays (SoA)
Presented below: the inverse transposition of the 4 SSE registers x, y, z, w into 4 memory places (L0 ptr .. L3 ptr). unpack_lo / unpack_hi build the intermediates xy10, xy32, zw10, zw32 from the original SSE registers; shuffles of the xy/zw pairs then rebuild each row, which is stored. This also takes 12 SSE operations per 16 components.
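A matching intrinsics sketch of the inverse direction (again with illustrative names; four unpacks, four shuffles and four stores give the 12 operations):

```c
#include <xmmintrin.h>

/* Scatter 4 SoA result registers back into 4 rows (AoS).
   Lane i of x, y, z, w belongs to line i; p0..p3 point at the
   destination quad inside output lines 0..3.                           */
static void soa_to_aos_quad(__m128 x, __m128 y, __m128 z, __m128 w,
                            float *p0, float *p1, float *p2, float *p3)
{
    /* Intermediates: interleave the components pairwise. */
    __m128 xy10 = _mm_unpacklo_ps(x, y);   /* x0 y0 x1 y1 */
    __m128 xy32 = _mm_unpackhi_ps(x, y);   /* x2 y2 x3 y3 */
    __m128 zw10 = _mm_unpacklo_ps(z, w);   /* z0 w0 z1 w1 */
    __m128 zw32 = _mm_unpackhi_ps(z, w);   /* z2 w2 z3 w3 */

    /* Rebuild the original rows and store them (unaligned stores for
       generality; the original code may use aligned ones).             */
    _mm_storeu_ps(p0, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(p1, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(3, 2, 3, 2)));
    _mm_storeu_ps(p2, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(p3, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(3, 2, 3, 2)));
}
```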

Slide 12: Agenda (section divider; same items as Slide 2).

Slide 13: 1D Running Average, 4-lines SSE implementation (width 11): cyclic SSE array buffer
Each AoS=>SoA transform loads 4 SSE registers (one QUAD). RA with width 11 needs to keep 12 registers alive together; they can fit in 3 QUADs of registers, but depending on where the window starts they can crawl across 4 QUADs as well. So 16 registers (4 QUADs) must be allocated and used in a cyclic way: when the last QUAD is freed, it is refilled by AoS=>SoA with the next QUAD of values.
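A minimal sketch of that register pool (purely illustrative; the real code may simply keep 16 named __m128 variables):

```c
#include <xmmintrin.h>

/* Cyclic pool of 16 SSE registers = 4 QUADs.  The 12 live values of the
   width-11 window always fit, wherever the window currently starts.    */
typedef struct {
    __m128 reg[16];
    int    head;                     /* index of the oldest live register */
} cyclic_quads;

static inline __m128 *quad_slot(cyclic_quads *c, int offset)
{
    return &c->reg[(c->head + offset) & 15];   /* wrap around 16 entries  */
}

static inline void free_oldest_quad(cyclic_quads *c)
{
    c->head = (c->head + 4) & 15;    /* one QUAD (4 registers) is reusable */
}
```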

Slide 14: 1D Running Average, 4-lines SSE implementation (width 11): prolog
1. Load 12 SSE registers by AoS=>SoA.
2. Sum up (accumulate) the first 5.
3. 4 times: sum up the next register and save the result in an SSE register (SoA form); then save this QUAD of results to memory by AoS<=SoA.
4. 2 times: sum up the next register and save the result in an SSE register (SoA form).
5. 1 time: sum up the next register, subtract the very first one, and save the result in an SSE register.
Here all 12 loaded registers are used (5+4+2+1), and the 3 last result registers are NOT yet saved to memory.

Slide 15: 1D Running Average, 4-lines SSE implementation (width 11): main step and epilog
Main step:
1. Load 4 SSE registers by AoS=>SoA into the 4 "last" (just freed) registers of the cyclic buffer.
2. Sum up the next register, subtract the register 11 positions back, and save the result in an SSE register; it is the 4th pending result, so the QUAD of results is saved to memory by AoS<=SoA.
3. 3 times: sum up the next register, subtract the first register of its window, and save the result in an SSE register.
During one step, 4 new SSE registers are loaded, 4 results (3 old and 1 new) are saved to memory, and 3 result registers remain unsaved.
Epilog: for the 5 last results, only the subtraction is done.
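Putting the prolog and the main loop together, a heavily simplified skeleton of the 4-line routine could look as follows. It reuses the aos_to_soa_quad / soa_to_aos_quad sketches above, assumes the line length n is a multiple of 4 and at least 12, folds the normalization by 1/11 into the result, and omits the boundary/epilog handling described on the slides; it illustrates the control flow only and is not the original Intel code:

```c
#include <xmmintrin.h>

#define W 11                             /* running-average width          */

/* Process 4 rows of length n in parallel, one row per SSE lane.
   l0..l3: input rows, o0..o3: output rows (out[j] = average of cols j..j+10).
   Assumes n % 4 == 0 and n >= 12; uses the transpose helpers shown above. */
static void ra1d_4lines(const float *l0, const float *l1,
                        const float *l2, const float *l3,
                        float *o0, float *o1, float *o2, float *o3, int n)
{
    __m128 buf[16];                      /* cyclic pool: 4 QUADs            */
    __m128 res[4];                       /* one QUAD of pending results     */
    const __m128 inv_w = _mm_set1_ps(1.0f / W);
    int in = 0, nres = 0;

    if (n < 12 || (n & 3) != 0)
        return;

    /* Prolog: load 3 QUADs (12 columns) and accumulate the first window. */
    for (; in < 12; in += 4)
        aos_to_soa_quad(l0 + in, l1 + in, l2 + in, l3 + in,
                        &buf[in], &buf[in + 1], &buf[in + 2], &buf[in + 3]);
    __m128 s = buf[0];
    for (int j = 1; j < W; ++j)
        s = _mm_add_ps(s, buf[j]);       /* s = v0 + ... + v10              */

    /* Main loop: slide the window one column at a time on all 4 lines.   */
    for (int out = 0; out + W <= n; ++out) {
        res[nres++] = _mm_mul_ps(s, inv_w);
        if (nres == 4) {                 /* flush one QUAD of results       */
            soa_to_aos_quad(res[0], res[1], res[2], res[3],
                            o0 + out - 3, o1 + out - 3,
                            o2 + out - 3, o3 + out - 3);
            nres = 0;
        }
        if (out + W >= in && in + 4 <= n) {          /* refill freed QUAD  */
            aos_to_soa_quad(l0 + in, l1 + in, l2 + in, l3 + in,
                            &buf[in & 15], &buf[(in + 1) & 15],
                            &buf[(in + 2) & 15], &buf[(in + 3) & 15]);
            in += 4;
        }
        if (out + W < n)                 /* subtract leaving, add entering  */
            s = _mm_add_ps(_mm_sub_ps(s, buf[out & 15]),
                           buf[(out + W) & 15]);
    }
    /* Pending results in res[] and the partial-window boundary columns
       would be handled by the epilog, omitted in this sketch.            */
}
```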

Slide 16: Agenda (section divider; same items as Slide 2).

Slide 17: 2nd dimension completion. 2D RA, based on the 4-lines 1D SSE implementation: prolog
The logical flow of 2D RA (an in-place routine) is very similar to the 1D RA 4-lines implementation. To hold the intermediate 1D RA lines we use 16 working lines, the analog of the 16 SSE registers.
Prolog:
1. Compute 12 1D RA lines by 3 calls to the 1D RA 4-lines routine.
2. Sum up (accumulate) the first 5 in working memory.
3. 6 times: sum up the next line and save the result in its final place.
4. 1 time: sum up the next line, subtract the first line, and save the result in its final place.
Here all 12 1D RA lines are used: 5+6+1.

Slide 18: 2nd dimension completion. 2D RA, based on the 4-lines 1D SSE implementation: main step and epilog
Main step:
- Compute 4 new 1D RA lines by calling the 1D RA 4-lines routine, outputting into the 4 "last" (just freed) lines of the cyclic buffer of working lines.
- 4 times: sum up the next line, subtract the line 11 positions back, and save the result in its final place.
Epilog: for the 5 last results, only the subtraction is done.
Important cache-related note: a typical line length is ~400 floats, i.e. ~1.6 KB, so the cyclic buffer of 16 lines is ~26 KB, less than the 32 KB L1 data cache. Most of the data manipulation is therefore done in the L1 cache!
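A scalar sketch of one such main step over the cyclic pool of working lines (sizes and names are illustrative; the real routine produces the entering lines four at a time with the SSE code above and keeps the pool small enough to stay in L1):

```c
#include <stddef.h>

#define K      11               /* running-average width                   */
#define NLINES 16               /* cyclic pool of working lines (~26 KB
                                    for 400-float lines)                   */

/* One main step of the 2D completion.
   pool:  NLINES line buffers of 'width' floats, used cyclically
   next:  index (>= K) of the 1D-averaged line that has just been produced
   sum:   running column-wise sum over the last K lines, updated in place
   dst:   receives one finished, normalized 2D line                        */
static void ra2d_main_step(const float *pool, size_t width,
                           int next, float *sum, float *dst)
{
    const float *enter = pool + (size_t)(next % NLINES)       * width;
    const float *leave = pool + (size_t)((next - K) % NLINES) * width;

    for (size_t x = 0; x < width; ++x) {
        sum[x] += enter[x] - leave[x];    /* S_{+1} = S - L_old + L_new    */
        dst[x]  = sum[x] / (float)K;
    }
}
```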

Slide 19: Agenda (section divider; same items as Slide 2).

Slide 20: 3rd dimension completion
The 3rd dimension (in-place) computation is done after the 2D computation has been completed for the whole stack of images (planes). It is straightforward, because it is fully independent of the previously computed 2D results, in contrast to the 2D computation, which includes the 1D computation as an internal part. In general, its logical flow is very similar to the 2D one. The important difference is that (because the routine works in place) the results are first saved into a cyclic buffer (a pool of 12 working lines) and are copied to their final place only after the corresponding source 2D RA line has been used for subtraction.
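A scalar sketch of that ordering for one step along z (illustrative names; an uncentered window is used for brevity, and in the real routine the results arrive in groups, which is why a pool of 12 lines rather than a single parked line is needed):

```c
#include <stddef.h>
#include <string.h>

#define K     11                 /* running-average width                  */
#define NPOOL 12                 /* cyclic pool of parked result lines     */

/* One in-place step of the 3rd-dimension pass for a single row position.
   line[z]: the 2D-averaged row at plane z (overwritten with the 3D result)
   sum:     running column-wise sum over planes z .. z+K-1
   pool:    NPOOL * width floats of parking space
   The caller guarantees that plane z+K exists.                            */
static void ra3d_step_inplace(float **line, size_t width, int z,
                              float *sum, float *pool)
{
    float *parked = pool + (size_t)(z % NPOOL) * width;

    /* First: compute the result for the window [z, z+K-1] and park it.   */
    for (size_t x = 0; x < width; ++x)
        parked[x] = sum[x] / (float)K;

    /* Slide the window: subtract the plane that leaves, add the one entering. */
    for (size_t x = 0; x < width; ++x)
        sum[x] += line[z + K][x] - line[z][x];

    /* Second: line[z] is no longer needed as input, so the parked result
       can safely overwrite it.                                            */
    memcpy(line[z], parked, width * sizeof(float));
}
```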

Slide 21: Agenda (section divider; same items as Slide 2).

Slide 22: Parallelizing by OpenMP and benchmarking
To parallelize the above algorithm with OpenMP, 16 working lines are allocated for each thread. Using OpenMP is straightforward for 2 loops: (1) calling the 2D RA routine for each plane in the stack, and (2) calling the routine that computes a "stack" of 3D RA lines, i.e. the loop in the y direction (explained on the appropriate slide).
Benchmarking results for several platforms:

Platform                          SSE run time (msec)   Speed-up, serial/SSE   SSE+OpenMP run time (msec)   Speed-up, SSE/SSE+OpenMP
Pentium-M T43 laptop, 1.86 GHz    32                    2.5x                   NA                           NA
Merom T61 laptop, 2.0 GHz         15                    4x                     13                           1.15x
Conroe WS, 2.4 GHz                14                    3.2x                   ?                            ?
WoodCrest WS, 2.66 GHz            12.5                  3.6x                   ?                            ?
HPTN Bensley, 2.8 GHz             9.4                   4.2x                   5.7                          1.6x

Conclusions:
- The SSE/serial speed-up for Penryn/Merom is ~4x, compared with 2.5x for the "old" Pentium-M.
- The absolute SSE run time for Merom (12-15 msec) is 2-2.5x better than for Pentium-M (32 msec), and more than 3x better for Penryn (9.4 msec).
- OpenMP scalability is very low; it seems that performance is restricted by FSB speed.
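The shape of those two parallel loops might be as sketched below; the two routines it calls and the work-buffer layout are placeholders standing in for the code described on the earlier slides, not the original implementation:

```c
#include <stddef.h>
#include <omp.h>

/* Placeholders for the single-threaded routines described earlier:
   ra2d_plane   - full 2D running average of one plane (uses 16 work lines)
   ra3d_lines_y - 3rd-dimension pass for one y index across all planes.   */
void ra2d_plane(float *plane, int width, int height, float *work_lines);
void ra3d_lines_y(float **planes, int nplanes, int width, int y,
                  float *work_lines);

void ra3d_volume(float **planes, int nplanes, int width, int height,
                 float *work, size_t work_stride /* >= 16 lines per thread */)
{
    /* Loop 1: one 2D running average per plane of the stack. */
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < nplanes; ++p) {
        float *my_work = work + (size_t)omp_get_thread_num() * work_stride;
        ra2d_plane(planes[p], width, height, my_work);
    }

    /* Loop 2: one "stack" of 3D RA lines per y index, across all planes. */
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        float *my_work = work + (size_t)omp_get_thread_num() * work_stride;
        ra3d_lines_y(planes, nplanes, width, y, my_work);
    }
}
```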

