Optimizing 3D Pipeline using Streaming SIMD Extensions in C++
Ronen Zohar (ronen.zohar@intel.com)
March 1999

Agenda
- Streaming SIMD Extensions overview.
- Streaming SIMD Extensions in C++.
- Streaming SIMD Extensions & memory.
- Some 3D code samples.

Streaming SIMD Extensions Overview
Streaming SIMD Extensions introduce three types of new instructions:
- SIMD single-precision floating-point instructions
- Memory streaming instructions
- SIMD integer instructions

Streaming SIMD Extensions Overview (cont.)
Streaming SIMD Extensions introduce a new set of eight registers (xmm0-xmm7):
- Each of these registers is 128 bits long.
- Each register holds 4 single-precision floating-point numbers.
[Diagram: three register sets side by side: legacy x86 registers (eax, …), x87 stack / MMX registers, Streaming SIMD Extensions registers.]

SIMD Single-Precision Floating-Point Instructions
Operations supported:
- Data transfer (move, load, store)
- Numerical (add, subtract, square root, ...)
- Bitwise operations (and, or, exclusive or, ...)
- Compares (==, !=, <=, …)
These instructions can operate on two xmm registers, or on a register and a 16-byte-aligned memory operand.

How to Write SIMD Code?
The old-fashioned way (assembly), calculating x^2 for four FP numbers located at ptr:
    ...
    mov    ecx, ptr
    movaps xmm0, [ecx]
    mulps  xmm0, [ecx]
    movaps [ecx], xmm0
    …
Disadvantages:
- Hard to code, and even harder to debug.
- No compiler optimizations.
- Code maintenance is difficult.

How to Write SIMD Code? (cont.)
Using C intrinsic instructions:
    ...
    __m128 *ptr, val;
    val = *ptr;
    *ptr = _mm_mul_ps(val, val);
    …
Advantages:
- No need to allocate registers manually.
- Compiler-based optimizations.
Disadvantages:
- Hard to read/maintain the code.
The type __m128 describes a 128-bit basic data element.
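As a compilable counterpart to the intrinsic fragment above, here is a minimal sketch that squares four floats in place. The helper name `square4` is hypothetical and not part of the presentation; the intrinsics themselves come from `xmmintrin.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, _mm_store_ps

// Squares four packed floats in place. `p` must be 16-byte aligned.
// square4 is a hypothetical helper name used only for this illustration.
inline void square4(float* p) {
    assert((reinterpret_cast<std::uintptr_t>(p) & 15u) == 0);
    __m128 val = _mm_load_ps(p);            // aligned load of 4 floats
    _mm_store_ps(p, _mm_mul_ps(val, val));  // lane-wise multiply, aligned store
}
```

The compiler picks the xmm registers, which is the main practical difference from the hand-written assembly version.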

How to Write SIMD Code? (cont.)
Using C++ SIMD classes:
    ...
    F32vec4 *ptr, val;
    val = *ptr;
    *ptr = val * val;
    …
Advantages:
- Natural to code and read.
- No need to allocate registers manually.
- Compiler-based optimizations.
The class F32vec4 describes a 128-bit basic data element with all the SIMD FP operations as C++ overloaded operators.

How to Write SIMD Code? (cont.)
Tools for coding Streaming SIMD Extensions code:
- For assembly language: MASM + Streaming SIMD Extensions macro package, or the Intel® C/C++ compiler (for inline assembly).
- For intrinsic C functions: Intel C/C++ compiler, include the file xmmintrin.h.
- For SIMD classes: Intel C/C++ compiler, include the file fvec.h.

Agenda
- Streaming SIMD Extensions overview.
- Streaming SIMD Extensions in C++.
- Streaming SIMD Extensions & memory. (current section)
- Some 3D code samples.

Memory Alignment
Memory accessed via intrinsics or SIMD data type pointers MUST be 16-byte aligned.
- No need to align local variables (done by the compiler).
- To align a global variable, use the _MM_ALIGN16 macro:
    _MM_ALIGN16 DWORD mask[4];
- To align a dynamically allocated buffer (the DWORD cast assumes 32-bit pointers):
    orig_buff = malloc(size + 15);
    buff = (void *)(((DWORD) orig_buff + 15) & 0xfffffff0);
    ...
    free(orig_buff);
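The over-allocate-and-round-up idea above can be sketched in a pointer-width-independent way. This is only an illustration under stated assumptions: `alloc16` and `free16` are hypothetical names, and stashing the original pointer in front of the aligned block is a design choice of this sketch, not something the slide prescribes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Returns a 16-byte-aligned block of at least `size` bytes, or nullptr.
// The pointer returned by malloc is stored just before the aligned block
// so free16 can recover it. Both function names are hypothetical.
inline void* alloc16(std::size_t size) {
    void* orig = std::malloc(size + 15 + sizeof(void*));
    if (orig == nullptr) return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(orig) + sizeof(void*);
    std::uintptr_t aligned = (start + 15) & ~static_cast<std::uintptr_t>(15);
    assert((aligned & 15u) == 0);
    reinterpret_cast<void**>(aligned)[-1] = orig;  // remember the malloc pointer
    return reinterpret_cast<void*>(aligned);
}

// Releases a block obtained from alloc16; a null pointer is ignored.
inline void free16(void* p) {
    if (p != nullptr) std::free(static_cast<void**>(p)[-1]);
}
```

Using `std::uintptr_t` instead of a 32-bit integer type keeps the rounding correct on 64-bit targets, and the caller no longer has to keep the unaligned pointer around.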

Memory Alignment (cont.)
If alignment is not guaranteed:
- Load from memory using the loadu function.
- Store to memory using the storeu function.
    F32vec4 in, out;
    float *in_ptr, *out_ptr;
    loadu(in, in_ptr);      // loading four unaligned floats.
    …
    storeu(out_ptr, out);   // storing four unaligned floats.
- These functions are slower than aligned memory access.
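At the intrinsic level, the same unaligned access is available through `_mm_loadu_ps` and `_mm_storeu_ps`. A minimal sketch follows; the function name `add4_unaligned` is hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Adds src[0..3] to dst[0..3]. Neither pointer needs 16-byte alignment.
// add4_unaligned is a hypothetical name used only for this illustration.
inline void add4_unaligned(float* dst, const float* src) {
    assert(dst != nullptr && src != nullptr);
    __m128 a = _mm_loadu_ps(dst);          // unaligned load
    __m128 b = _mm_loadu_ps(src);          // unaligned load
    _mm_storeu_ps(dst, _mm_add_ps(a, b));  // unaligned store
}
```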

Memory Arrangement Issues
Traditional procedures require horizontal operations that do not utilize the SIMD structure.
- For example: 3D vector normalization.
[Diagram: vertex data stored as an array of structures (AOS), starting at Base:
 X0 Y0 Z0 Nx0 Ny0 Nz0 Tu0 Tv0 | X1 Y1 Z1 Nx1 Ny1 Nz1 Tu1 Tv1 | X2 Y2 Z2 Nx2 Ny2 Nz2 Tu2 Tv2]

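3D Vector Normalization
[Diagram: one register holds X, Y, Z, Nx. It is multiplied by itself to give X*X, Y*Y, Z*Z, Nx*Nx; the first three lanes are summed horizontally to X*X + Y*Y + Z*Z; 1/sqrt of the sum is taken and multiplied back into the register, giving X/sqrt(X*X + Y*Y + Z*Z), Y/sqrt(X*X + Y*Y + Z*Z), Z/sqrt(X*X + Y*Y + Z*Z).]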

A Different Memory Approach (SOA)
Store each of the data components in its own contiguous array (Structure of Arrays). For example, starting at Base:
    X0  X1  X2  ...
    Y0  Y1  Y2  ...
    Z0  Z1  Z2  ...
    Nx0 Nx1 Nx2 ...
    Ny0 Ny1 Ny2 ...
    Nz0 Nz1 Nz2 ...
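The two layouts can be written down as C++ types, together with a one-time conversion from the interleaved form. All type and function names here (`VertexAOS`, `VerticesSOA`, `aos_to_soa`) are hypothetical; the field order follows the diagrams in the slides.

```cpp
#include <cassert>
#include <cstddef>

// AOS: one record per vertex, components interleaved in memory.
struct VertexAOS { float x, y, z, nx, ny, nz, tu, tv; };

// SOA: one contiguous array per component (caller-owned storage).
struct VerticesSOA { float *x, *y, *z, *nx, *ny, *nz; };

// Copies n vertices from the AOS layout into the SOA arrays.
inline void aos_to_soa(const VertexAOS* in, std::size_t n, const VerticesSOA& out) {
    assert(in != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out.x[i]  = in[i].x;   out.y[i]  = in[i].y;   out.z[i]  = in[i].z;
        out.nx[i] = in[i].nx;  out.ny[i] = in[i].ny;  out.nz[i] = in[i].nz;
    }
}
```

Once the data is in SOA form, four consecutive X values can be loaded into one register with a single aligned load, provided the arrays themselves are 16-byte aligned.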

Vector Normalization using SOA
[Diagram: three registers hold X0 X1 X2 X3, Y0 Y1 Y2 Y3 and Z0 Z1 Z2 Z3. Each is multiplied by itself, the three products are added lane by lane, and 1/sqrt is applied, giving 1/sqrt(Xi*Xi + Yi*Yi + Zi*Zi) for i = 0..3 in one register. That register is multiplied with the X, Y and Z registers to produce four normalized vectors.]

SOA Advantages
- Calculate 4 items in a single iteration.
- No need to move items around the SIMD register.
- Better memory utilization. In the vector normalization example:
    AOS used 3 of every 8 FP numbers in each cache line.
    SOA used 8 of every 8 FP numbers in each cache line.

SOA Disadvantages
- If your algorithm uses constants, you must pre-create a SIMD version of these constants.
- The same rule applies to data generated outside the main loop (e.g. transformation matrix, lights data, etc.).
- For a small number of iterations the overhead is bigger than the savings.

Streaming Instructions
- By using the _mm_prefetch intrinsic, you can hint the processor to load data that is not required now but will be needed soon (in the next iteration/pass).
- By using the store_nta function, you can write data that will not be read again soon directly to memory without polluting the caches.
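A minimal sketch of both hints at the intrinsic level, assuming a simple scale-and-copy loop. The function name `scale_stream` and the prefetch distance of 16 floats are assumptions made for illustration; a useful distance depends on the processor and the loop body. `_mm_stream_ps` is the intrinsic form of a non-temporal store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// dst[i] = k * src[i] for n floats. n must be a multiple of 4 and both
// pointers 16-byte aligned. scale_stream is a hypothetical name.
inline void scale_stream(float* dst, const float* src, std::size_t n, float k) {
    assert(n % 4 == 0);
    assert((reinterpret_cast<std::uintptr_t>(dst) & 15u) == 0);
    assert((reinterpret_cast<std::uintptr_t>(src) & 15u) == 0);
    const __m128 kk = _mm_set1_ps(k);  // SIMD version of the constant
    for (std::size_t i = 0; i < n; i += 4) {
        if (i + 16 < n) {
            // Hint: this input will be needed a few iterations from now.
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 16), _MM_HINT_NTA);
        }
        // Non-temporal store: write the result without keeping it in the caches.
        _mm_stream_ps(dst + i, _mm_mul_ps(_mm_load_ps(src + i), kk));
    }
    _mm_sfence();  // make the streaming stores globally visible before returning
}
```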

Agenda
- Streaming SIMD Extensions overview.
- Streaming SIMD Extensions in C++.
- Streaming SIMD Extensions & memory.
- Some 3D code techniques & samples. (current section)

Branch Elimination
Classic code:
    if (x > y) x = y;
    if (x < y) x = x + y; else x = x - y;
SIMD code:
    x = simd_min(x, y);
    x = select_lt(x, y, x + y, x - y);
(lt = less than)
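One common way to build such a select from raw intrinsics is a compare mask combined with and / and-not / or. The sketch below shows that technique; the names `select_lt4` and `add_or_sub` are hypothetical, and this is not claimed to be how the class library implements `select_lt`.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Per lane: (a < b) ? t : f, with no branches.
// select_lt4 is a hypothetical name used only for this illustration.
inline __m128 select_lt4(__m128 a, __m128 b, __m128 t, __m128 f) {
    __m128 mask = _mm_cmplt_ps(a, b);  // all ones where a < b, else all zeros
    return _mm_or_ps(_mm_and_ps(mask, t), _mm_andnot_ps(mask, f));
}

// Per lane: if (x < y) x + y; else x - y;
inline __m128 add_or_sub(__m128 x, __m128 y) {
    return select_lt4(x, y, _mm_add_ps(x, y), _mm_sub_ps(x, y));
}

// Per lane: if (x > y) x = y;
inline __m128 clamp_above(__m128 x, __m128 y) {
    return _mm_min_ps(x, y);
}
```

Both sides of the branch are computed for every lane, so the approach pays off when the two expressions are cheap compared with a mispredicted branch.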

Approximation Functions
Classic code:
    y = 1.0/x;
    y = 1.0/sqrt(x);
SIMD code:
    y = rcp(x);
    y = rsqrt(x);
These functions are approximations (but fast ones). To improve the approximation, use the _nr suffix (rcp_nr/rsqrt_nr).
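The refinement that the _nr suffix suggests is a Newton-Raphson step. The sketch below applies one such step on top of the hardware approximations `_mm_rcp_ps` and `_mm_rsqrt_ps`; the names `rcp_nr4` and `rsqrt_nr4` are hypothetical, and the library's own functions may differ in detail.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Approximate 1/x refined by one Newton-Raphson step: r' = r * (2 - x*r).
inline __m128 rcp_nr4(__m128 x) {
    __m128 r = _mm_rcp_ps(x);
    return _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(x, r)));
}

// Approximate 1/sqrt(x) refined by one step: r' = 0.5 * r * (3 - x*r*r).
inline __m128 rsqrt_nr4(__m128 x) {
    __m128 r = _mm_rsqrt_ps(x);
    __m128 xrr = _mm_mul_ps(x, _mm_mul_ps(r, r));
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), r),
                      _mm_sub_ps(_mm_set1_ps(3.0f), xrr));
}
```

Each step roughly doubles the number of correct bits, at the cost of a few extra multiplies and a subtract. Inputs of zero are not handled specially here.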

Vector Normalization
Classic code:
    float *x, *y, *z, len;
    int i;
    for (i = 0; i < n; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = 1.0f / sqrt(len);
        *x *= len; *y *= len; *z *= len;
    }
SIMD code (each pointer step covers four vectors, so n is assumed to be a multiple of 4):
    F32vec4 *x, *y, *z, len;
    int i;
    for (i = 0; i < n/4; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = rsqrt(len);
        *x *= len; *y *= len; *z *= len;
    }
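For readers without the fvec.h class library, the SIMD loop above can be sketched with plain intrinsics. The function name `normalize_soa` is hypothetical; the arrays are assumed to be 16-byte aligned SOA data with a length that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Normalizes n 3D vectors stored as separate x, y, z arrays (SOA).
// Uses the approximate reciprocal square root, so results carry a
// relative error on the order of 1e-4. Zero-length vectors are not handled.
inline void normalize_soa(float* x, float* y, float* z, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);
        __m128 vy = _mm_load_ps(y + i);
        __m128 vz = _mm_load_ps(z + i);
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)),
                                 _mm_mul_ps(vz, vz));
        __m128 inv = _mm_rsqrt_ps(len2);  // ~ 1/sqrt(len2) for four vectors at once
        _mm_store_ps(x + i, _mm_mul_ps(vx, inv));
        _mm_store_ps(y + i, _mm_mul_ps(vy, inv));
        _mm_store_ps(z + i, _mm_mul_ps(vz, inv));
    }
}
```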

Code Samples
3D transform (SOA):
    F32vec4 x, y, z, tx, ty, tz, w, m[4][4];
    w  = x*m[3][0] + y*m[3][1] + z*m[3][2] + m[3][3];
    w  = rcp(w);    // ~ 1.0/w
    tx = w*(x*m[0][0] + y*m[0][1] + z*m[0][2] + m[0][3]);
    ty = w*(x*m[1][0] + y*m[1][1] + z*m[1][2] + m[1][3]);
    tz = w*(x*m[2][0] + y*m[2][1] + z*m[2][2] + m[2][3]);

Code Samples (cont.)
A simple directional light:
    static const F32vec4 ZERO = 0.0f;   // expanding a constant
    F32vec4 dot;
    dot = light_dir->x * norm->x +
          light_dir->y * norm->y +
          light_dir->z * norm->z;
    dot = simd_max(dot, ZERO);          // clear all items less than 0.0
    color->r += dot;
    color->g += dot;
    color->b += dot;

Packing color values to RGB format
Step 1: scaling to [0..255] & saturating colors.
    static const F32vec4 _255_ = 255.0f;
    r = simd_min(color->r * _255_, _255_);
    g = simd_min(color->g * _255_, _255_);
    b = simd_min(color->b * _255_, _255_);

Packing color values to RGB format (cont.)
Step 2: Convert & pack. Only the lower 2 SIMD items can be converted to an MMX double-DWORD vector in each pass.
[Diagram: starting from R0 R1 R2 R3, G0 G1 G2 G3, B0 B1 B2 B3:
 1. SIMD integer conversion (only the 2 lower floats) gives integers R0 R1, G0 G1, B0 B1 in MMX registers.
 2. Moving the high SIMD half to the low SIMD half gives R2 R3 R2 R3, G2 G3 G2 G3, B2 B3 B2 B3.
 3. A second SIMD integer conversion (only the 2 lower floats) gives R2 R3, G2 G3, B2 B3.]

Packing color values to RGB format (cont.)
    // Converting the lower 2 SIMD items.
    Is32vec2 color[2];
    color[0] = (F32vec4ToIs32vec2(r) << 16) |
               (F32vec4ToIs32vec2(g) << 8)  |
               (F32vec4ToIs32vec2(b));
    // Converting the upper 2 SIMD items.
    color[1] = (F32vec4ToIs32vec2(_mm_movehl_ps(r, r)) << 16) |
               (F32vec4ToIs32vec2(_mm_movehl_ps(g, g)) << 8)  |
               (F32vec4ToIs32vec2(_mm_movehl_ps(b, b)));
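For comparison only, the same packing can be sketched with SSE2 integer intrinsics, which convert all four lanes at once. This is a swapped-in technique: SSE2 is a later extension that the presentation does not cover, and the function name `pack_rgb4` is hypothetical. The inputs are assumed to be already scaled and saturated to [0, 255] as in step 1.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_cvtps_epi32, _mm_slli_epi32

// Packs four pixels into 0x00RRGGBB words. r, g, b hold values in [0, 255].
// SSE2 replaces the two MMX passes from the slides with a single pass.
inline void pack_rgb4(__m128 r, __m128 g, __m128 b, std::uint32_t out[4]) {
    assert(out != nullptr);
    __m128i ri = _mm_cvtps_epi32(r);  // float -> int32, current rounding mode
    __m128i gi = _mm_cvtps_epi32(g);
    __m128i bi = _mm_cvtps_epi32(b);
    __m128i rgb = _mm_or_si128(
        _mm_or_si128(_mm_slli_epi32(ri, 16), _mm_slli_epi32(gi, 8)), bi);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), rgb);
}
```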

Backup

Code Samples
Multiplying a vector by a matrix:
[Diagram: vector V0 V1 V2 V3 and matrix rows m00 m01 m02 m03, m10 m11 m12 m13, m20 m21 m22 m23, m30 m31 m32 m33.]
1. Cast (broadcast) each one of the vector components to all 4 SIMD components.
2. Multiply each broadcast component with the matching matrix line.
3. Sum the results.

Code Samples
Expand the previous example and multiply a matrix (m1) by a matrix (m2) to a result matrix (m3).
    F32vec4 m1[4], m2[4], m3[4];
    F32vec4 a, b, c, d;
    int i;
    for (i = 0; i < 4; i++) {
        // each iteration multiplies a line vector from m1 by m2.
        a = F32vec4((m1[i])[0]);   // cast to all items from row i column 0.
        b = F32vec4((m1[i])[1]);   // cast to all items from row i column 1.
        c = F32vec4((m1[i])[2]);   // cast to all items from row i column 2.
        d = F32vec4((m1[i])[3]);   // cast to all items from row i column 3.
        m3[i] = a * m2[0] + b * m2[1] + c * m2[2] + d * m2[3];
    }
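The same broadcast, multiply and sum scheme can be sketched with plain intrinsics on row-major float arrays. The function name `mat_mul4` and the choice of unaligned loads and stores are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <xmmintrin.h>

// m3 = m1 * m2 for row-major 4x4 matrices stored as 16 floats each.
// m3 must not alias m1 or m2. mat_mul4 is a hypothetical name.
inline void mat_mul4(const float* m1, const float* m2, float* m3) {
    assert(m3 != m1 && m3 != m2);
    const __m128 r0 = _mm_loadu_ps(m2 + 0);
    const __m128 r1 = _mm_loadu_ps(m2 + 4);
    const __m128 r2 = _mm_loadu_ps(m2 + 8);
    const __m128 r3 = _mm_loadu_ps(m2 + 12);
    for (int i = 0; i < 4; ++i) {
        // Broadcast each element of row i of m1 to all four lanes.
        __m128 a = _mm_set1_ps(m1[4 * i + 0]);
        __m128 b = _mm_set1_ps(m1[4 * i + 1]);
        __m128 c = _mm_set1_ps(m1[4 * i + 2]);
        __m128 d = _mm_set1_ps(m1[4 * i + 3]);
        // Multiply by the matching rows of m2 and sum the results.
        __m128 row = _mm_add_ps(_mm_add_ps(_mm_mul_ps(a, r0), _mm_mul_ps(b, r1)),
                                _mm_add_ps(_mm_mul_ps(c, r2), _mm_mul_ps(d, r3)));
        _mm_storeu_ps(m3 + 4 * i, row);
    }
}
```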
