Copyright © 2007 Intel Corporation. SP 3D Running Average Implementation – SSE + OpenMP, Benchmarking on Different Platforms. Dr. Zvi Danovich, Senior Application Engineer, January 2008

Agenda
– What is 3D Running Average (RA)?
– From 1D to 3D RA implementation
– Basic SSE technique: AoS ⇔ SoA transforms
– 1D RA 4-lines SSE implementation
– 2nd dimension completion
– 3rd dimension completion
– Adding OpenMP, benchmarking, conclusions

3D Running Average (RA) – what is it?
3D RA is computed for each voxel V as the normalized sum over a k×k×k cube (k is odd) located "around" the given voxel:

    V = (1/k³) · ∑ v,  the sum running over the k×k×k cube of source voxels v

In other words, 3D RA can be considered a 3D convolution with a kernel whose components all equal 1/(k×k×k).
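For reference, a brute-force scalar version of this definition – a minimal sketch under assumed names and layout, not code from the deck – makes the cost visible: O(k³) work per voxel, which the running-average scheme below reduces to O(1) per voxel.

    #include <stddef.h>

    /* Naive 3D running average: O(k^3) per voxel. vol/out are
       depth*height*width float arrays; border voxels are skipped. */
    void ra3d_naive(const float *vol, float *out,
                    int width, int height, int depth, int k)
    {
        const int r = k / 2;
        const float norm = 1.0f / ((float)k * (float)k * (float)k);
        for (int z = r; z < depth - r; ++z)
            for (int y = r; y < height - r; ++y)
                for (int x = r; x < width - r; ++x) {
                    float s = 0.0f;
                    for (int dz = -r; dz <= r; ++dz)
                        for (int dy = -r; dy <= r; ++dy)
                            for (int dx = -r; dx <= r; ++dx)
                                s += vol[((size_t)(z + dz) * height + (y + dy)) * width + (x + dx)];
                    out[((size_t)z * height + y) * width + x] = s * norm;
                }
    }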

1D Running Average (RA)
Unlike 1D convolution, 1D RA can be computed with O(1) complexity per output using the following approach:
– Prolog: compute the sum S of the first k voxels: S = ∑ v_i over i = 0..k−1
– Main step: to compute the next sum S′, the first member of the previous sum (v0) is subtracted and the next component (vk) is added: S′ = ∑ v_i over i = 1..k = S − v0 + vk
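A scalar sketch of this scheme (my sketch – the function name and layout are assumptions; border handling is omitted):

    /* O(1)-per-output running sum of width k: the prolog sums the first
       k elements, then each main step adds one element and drops one.
       out[i] = v[i] + ... + v[i+k-1]; the input holds n+k-1 elements. */
    void ra1d(const float *v, float *out, int n, int k)
    {
        float s = 0.0f;
        for (int j = 0; j < k; ++j)      /* prolog */
            s += v[j];
        out[0] = s;
        for (int i = 1; i < n; ++i) {    /* main steps */
            s += v[i + k - 1];           /* add next component    */
            s -= v[i - 1];               /* subtract first member */
            out[i] = s;
        }
    }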

Extending 1D Running Average toward 2D
Given a slice (plane) with all lines (Li) 1D-averaged, we can extend the averaging to 2D by the same approach:
– Prolog: compute the sum S of the first k lines: S = ∑ L_i over i = 0..k−1
– Main step: to compute the next sum S′, the first line of the previous sum (L0) is subtracted and the next line (Lk) is added: S′ = ∑ L_i over i = 1..k = S − L0 + Lk

Extending 2D Running Average toward 3D
Given a stack of planes with all planes (Pi) 2D-averaged, we can extend the averaging to 3D by the same approach:
– Prolog: compute the sum S of the first k planes: S = ∑ P_i over i = 0..k−1
– Main step: to compute the next sum S′, the first plane of the previous sum (P0) is subtracted and the next plane (Pk) is added: S′ = ∑ P_i over i = 1..k = S − P0 + Pk

Array of Structures (AoS) => Structure of Arrays (SoA)
Why should we transform the data to vectorize 1D Running Average, and how can it be transformed?
– The original "natural" serial data structure is AoS: each SSE register holds 4 consecutive voxels of one line. It is NOT enabled for SSE – the running sum M′ = ∑ m_i over i = 1..k = M − m0 + mk would mix components within the same register.
– The "transposed" data structure is SoA: each SSE register holds one voxel column taken from 4 lines L0..L3. It IS enabled for SSE – S′ = ∑ v_i over i = 1..k = S − v0 + vk becomes a whole-register add/subtract that advances 4 lines at once.

Array of Structures (AoS) => Structure of Arrays (SoA)
Presented below: transposition of 4 quads from 4 original lines into 4 SSE registers x, y, z, w. It takes 12 SSE operations per 16 components:
– loadlo/loadhi fill the intermediates: the low halves of lines L0, L1 form xy10 and their high halves form zw10; likewise L2, L3 give xy32 and zw32 (8 operations)
– shuffle(xy10, xy32, (2,0,2,0)) and shuffle(xy10, xy32, (3,1,3,1)) produce the final x and y registers; shuffle(zw10, zw32, (2,0,2,0)) and shuffle(zw10, zw32, (3,1,3,1)) produce z and w (4 operations)
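In SSE intrinsics this is a 4×4 transpose. The following is my reconstruction of the slide's scheme (function name and pointer layout are assumptions): 8 half-register loads plus 4 shuffles, i.e. 12 operations.

    #include <xmmintrin.h>

    /* AoS -> SoA: transpose one quad from each of 4 source lines into
       x/y/z/w registers. l0..l3 point at 4 consecutive floats per line. */
    static inline void aos_to_soa_4x4(const float *l0, const float *l1,
                                      const float *l2, const float *l3,
                                      __m128 *x, __m128 *y, __m128 *z, __m128 *w)
    {
        __m128 xy10 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                          (const __m64 *)l0), (const __m64 *)l1);             /* x0 y0 x1 y1 */
        __m128 zw10 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                          (const __m64 *)(l0 + 2)), (const __m64 *)(l1 + 2)); /* z0 w0 z1 w1 */
        __m128 xy32 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                          (const __m64 *)l2), (const __m64 *)l3);             /* x2 y2 x3 y3 */
        __m128 zw32 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                          (const __m64 *)(l2 + 2)), (const __m64 *)(l3 + 2)); /* z2 w2 z3 w3 */

        *x = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(2, 0, 2, 0));  /* x0 x1 x2 x3 */
        *y = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(3, 1, 3, 1));  /* y0 y1 y2 y3 */
        *z = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(2, 0, 2, 0));  /* z0 z1 z2 z3 */
        *w = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(3, 1, 3, 1));  /* w0 w1 w2 w3 */
    }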

Array of Structures (AoS) <= Structure of Arrays (SoA)
Presented below: the inverse transposition of the 4 SSE registers x, y, z, w into 4 memory locations. It also takes 12 SSE operations per 16 components:
– unpack_lo/unpack_hi interleave x with y and z with w into the intermediates xy10, xy32, zw10, zw32 (4 operations)
– shuffle(xy10, zw10, …) + store and shuffle(xy32, zw32, …) + store write one complete quad to each of the 4 line pointers L0–L3 (8 operations)
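The companion sketch for the inverse direction, again my reconstruction under assumed names: 4 unpacks + 4 shuffles + 4 stores = 12 operations.

    #include <xmmintrin.h>

    /* SoA -> AoS: write one x,y,z,w quad back to each of 4 destination lines. */
    static inline void soa_to_aos_4x4(__m128 x, __m128 y, __m128 z, __m128 w,
                                      float *l0, float *l1, float *l2, float *l3)
    {
        __m128 xy10 = _mm_unpacklo_ps(x, y);   /* x0 y0 x1 y1 */
        __m128 xy32 = _mm_unpackhi_ps(x, y);   /* x2 y2 x3 y3 */
        __m128 zw10 = _mm_unpacklo_ps(z, w);   /* z0 w0 z1 w1 */
        __m128 zw32 = _mm_unpackhi_ps(z, w);   /* z2 w2 z3 w3 */

        _mm_storeu_ps(l0, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(1, 0, 1, 0))); /* x0 y0 z0 w0 */
        _mm_storeu_ps(l1, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(3, 2, 3, 2))); /* x1 y1 z1 w1 */
        _mm_storeu_ps(l2, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(1, 0, 1, 0))); /* x2 y2 z2 w2 */
        _mm_storeu_ps(l3, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(3, 2, 3, 2))); /* x3 y3 z3 w3 */
    }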

1D Running Average 4-lines SSE implementation (width 11) – cyclic SSE array buffer
– Each AoS=>SoA transform loads 4 SSE registers (one QUAD).
– RA of width 11 needs to maintain 12 registers alive together; depending on how the window is aligned, they can fit in 3 QUADs of registers but can also crawl into 4 QUADs.
– So 16 registers (4 QUADs) must be allocated and used in a cyclic way: when the last QUAD is freed, it is refilled by AoS=>SoA with the next QUAD of values.

1D Running Average 4-lines SSE implementation (width 11) – prolog
1. Load 12 SSE registers by AoS=>SoA.
2. Sum up (accumulate) the first 5.
3. 4 times: sum up the next register and save the result in an SSE register (SoA form); then save this QUAD of results to memory by AoS<=SoA.
4. 2 times: sum up the next register and save the result in an SSE register (SoA form).
5. 1 time: sum up the next register, subtract the very first one, and save the result in an SSE register.
At this point all 12 loaded QUADs have been used, and the last 3 result registers are NOT yet saved.

1D Running Average 4-lines SSE implementation (width 11) – main step & epilog
Main step:
1. Load 4 SSE registers by AoS=>SoA into the 4 "last" (just freed) registers of the cyclic buffer.
2. Sum up the next register, subtract register (next−11), and save the result in an SSE register – it becomes the 4th pending result; save this QUAD of results to memory by AoS<=SoA.
3. 3 times: sum up the next register, subtract register (next−11), and save the result in an SSE register.
During the step, 4 new SSE registers are loaded, 4 results (3 old and 1 new) are saved to memory, and 3 result registers remain unsaved.
Epilog: for the 5 last results, ONLY the subtraction is done.
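Once the columns sit in SoA registers, the running-sum arithmetic itself is just whole-register adds and subtracts. A simplified sketch (my names; the AoS⇔SoA transposes and the 4-QUAD cyclic refill are factored out, and reg[] here stands for a fully transposed line):

    #include <xmmintrin.h>

    enum { RA_K = 11 };

    /* reg[j] holds column j of 4 lines (SoA); res[] receives one SSE
       register per output column, i.e. 4 result lines advance at once. */
    void ra1d_4lines_soa(const __m128 *reg, __m128 *res, int width)
    {
        __m128 s = _mm_setzero_ps();
        for (int j = 0; j < RA_K; ++j)           /* prolog: first 11 columns */
            s = _mm_add_ps(s, reg[j]);
        res[0] = s;
        for (int j = RA_K; j < width; ++j) {     /* main steps */
            s = _mm_add_ps(s, reg[j]);           /* sum up next          */
            s = _mm_sub_ps(s, reg[j - RA_K]);    /* subtract (next - 11) */
            res[j - RA_K + 1] = s;
        }
    }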

2nd dimension completion – 2D RA based on the 4-lines 1D SSE implementation: prolog
The logical flow of 2D RA (an in-place routine) is very similar to the 4-lines 1D RA implementation. To hold the intermediate 1D RA lines we use 16 working lines – the analog of the 16 SSE registers.
Prolog:
1. Compute 12 1D RA lines by 3 calls to the 1D RA 4-lines routine.
2. Sum up (accumulate) the first 5 in working memory.
3. 6 times: sum up the next line and save the result in its final place.
4. 1 time: sum up the next line, subtract the first line, and save the result in its final place.
At this point all 12 1D RA lines have been used, producing the first 7 resulting 2D RA lines.

2nd dimension completion – 2D RA based on the 4-lines 1D SSE implementation: main step & epilog
Main step:
– Compute 4 new 1D RA lines by calling the 1D RA 4-lines routine, outputting into the 4 "last" (just freed) lines of the cyclic working buffer.
– 4 times: sum up the next line, subtract line (next−11), and save the result in its final place.
Epilog: for the 5 last results, ONLY the subtraction is done.
Important cache-related note: a typical line length is ~400 floats => 1.6 KB, so the cyclic buffer of 16 lines is ~26 KB – less than the 32 KB L1 cache. Most of the data manipulation is done in the L1 cache!
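One step of this pass, sketched with SSE intrinsics (assumed names; lines 16-byte aligned, length a multiple of 4):

    #include <xmmintrin.h>

    /* acc keeps the running vertical sum of 11 1D-RA lines; each main
       step adds the newest line, subtracts the oldest, and stores the
       resulting 2D RA line in its final place. */
    static void line_step(float *acc, const float *newest,
                          const float *oldest, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_load_ps(acc + i);
            a = _mm_add_ps(a, _mm_load_ps(newest + i)); /* + next 1D RA line  */
            a = _mm_sub_ps(a, _mm_load_ps(oldest + i)); /* - line (next - 11) */
            _mm_store_ps(acc + i, a);
            _mm_store_ps(out + i, a);
        }
    }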

3rd dimension completion
The 3rd-dimension (in-place) computation is done after the 2D computation has been completed for the whole stack of images (planes). It is straightforward, since it is fully independent of the previously computed 2D results – in contrast to the 2D computation, which includes the 1D computation as an internal part.
In general its logical flow is very similar to the 2D one. The important difference is that (because it works in place) each result line is first saved into a cyclic buffer – a pool of 12 working lines – and is copied to its final place only after the corresponding 2D RA source line has been used for subtraction.
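A hedged sketch of that deferred copy (hypothetical accessor line2d(q) returning one row of 2D-RA plane q; scalar arithmetic for clarity; the epilog that flushes the remaining pooled lines is omitted). With centered output, the result of window i (planes i..i+10) targets line i+5, which is read for the last time when window i+6 subtracts it – so each result waits a few steps in the pool:

    #include <string.h>

    enum { K3 = 11, R3 = K3 / 2, POOL = 12 };

    extern float *line2d(int plane);   /* hypothetical accessor */

    void ra_dim3_line(float *acc, float *pool[POOL], int depth, int n)
    {
        memset(acc, 0, n * sizeof(float));
        for (int i = 0; i + K3 <= depth; ++i) {
            if (i == 0) {                          /* prolog: sum first 11 lines */
                for (int j = 0; j < K3; ++j)
                    for (int t = 0; t < n; ++t) acc[t] += line2d(j)[t];
            } else {                               /* main step: slide the window */
                for (int t = 0; t < n; ++t)
                    acc[t] += line2d(i + K3 - 1)[t] - line2d(i - 1)[t];
            }
            memcpy(pool[i % POOL], acc, n * sizeof(float)); /* park result       */
            int w = i - R3 - 1;                    /* its target (w+R3 = i-1)     */
            if (w >= 0)                            /* was subtracted just above   */
                memcpy(line2d(w + R3), pool[w % POOL], n * sizeof(float));
        }
    }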

Parallelizing by OpenMP and benchmarking
To parallelize the above algorithm with OpenMP, 16 working lines are allocated for each thread. Using OpenMP is straightforward for 2 loops: (1) calling the 2D RA routine for each plane in the stack, and (2) calling the routine that computes a "stack" of 3D RA lines – the loop in the "y" direction (explained on the appropriate foil).
Benchmarking results for several platforms:

                                 Pentium-M T43     Merom T61       Conroe WS   WoodCrest WS   HPTN Bensley
                                 laptop 1.86 GHz   laptop 2.0 GHz  2.4 GHz     2.66 GHz       2.8 GHz
    SSE run time, msec           32                12–15           ?           ?              9.4
    Speed-up, serial/SSE         2.5x              4x              3.2x        3.6x           4.2x
    SSE+OpenMP run time, msec    NA                13              ?           ?              5.7
    Speed-up, SSE/(SSE+OpenMP)   NA                1.15x           ?           ?              1.6x

Conclusions:
– The SSE/serial speed-up for Penryn/Merom is ~4x, about 60% better than for the "old" Pentium-M (2.5x).
– The absolute SSE run time for Merom (12–15 msec) is 2–2.5x better than for Pentium-M (32 msec), and that of Penryn (9.4 msec) is >3x better.
– OpenMP scalability is very low; it seems that performance is restricted by FSB speed.
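A sketch of loop (1), the per-plane parallelization (routine names are assumptions, not the deck's code; compile with OpenMP enabled). The key point from the slide is that every thread owns its own 16 working lines:

    #include <stdlib.h>

    extern void ra2d_plane(float *plane, int width, int height, float *work);

    void ra2d_all_planes(float *volume, int width, int height, int depth)
    {
        #pragma omp parallel
        {
            /* per-thread cyclic buffer of 16 working lines */
            float *work = malloc(16u * (size_t)width * sizeof(float));
            #pragma omp for
            for (int p = 0; p < depth; ++p)
                ra2d_plane(volume + (size_t)p * width * height,
                           width, height, work);
            free(work);
        }
    }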