Published by Myra Warren. Modified over 9 years ago.
1
Optimizing 3D Pipeline using Streaming SIMD Extensions in C++
Ronen Zohar (ronen.zohar@intel.com)
March 1999
2
Agenda
- Streaming SIMD Extensions overview.
- Streaming SIMD Extensions in C++.
- Streaming SIMD Extensions and memory.
- Some 3D code samples.
3
Streaming SIMD Extensions Overview
Streaming SIMD Extensions introduce three types of new instructions:
- SIMD single-precision floating-point instructions
- Memory streaming instructions
- SIMD integer instructions
4
Streaming SIMD Extensions Overview (cont.)
- Streaming SIMD Extensions introduce a new set of eight registers (xmm0-xmm7).
- Each of these registers is 128 bits long.
- Each register holds four single-precision floating-point numbers.
[Figure: register sets: legacy x86 registers (eax, …); x87 stack / MMX registers; Streaming SIMD Extensions registers]
5
SIMD Single-Precision Floating-Point Instructions
Operations supported:
- Data transfer (move, load, store)
- Numerical (add, subtract, square root, ...)
- Bitwise operations (and, or, exclusive or, ...)
- Compares (==, !=, <=, …)
These instructions operate either on two xmm registers or on a register and a 16-byte-aligned memory operand.
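To make the operation categories concrete, here is a minimal sketch using the C intrinsics from xmmintrin.h. The function names `add_sqrt4` and `le_mask4` are made up for illustration; the sketch covers data transfer, numerical, and compare operations, and assumes all pointers are 16-byte aligned.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Data transfer + numerical: out[i] = sqrt(a[i] + b[i]) for i = 0..3.
// Hypothetical helper; a, b and out must be 16-byte aligned.
inline void add_sqrt4(const float* a, const float* b, float* out) {
    __m128 va = _mm_load_ps(a);   // aligned load of four floats
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_sqrt_ps(_mm_add_ps(va, vb)));
}

// Compare: returns a 4-bit mask with bit i set when a[i] <= b[i].
// Hypothetical helper; a and b must be 16-byte aligned.
inline int le_mask4(const float* a, const float* b) {
    __m128 m = _mm_cmple_ps(_mm_load_ps(a), _mm_load_ps(b));
    return _mm_movemask_ps(m);    // collects the sign bit of each lane
}
```

A compare produces an all-ones or all-zeros pattern per lane rather than a single boolean, which is why the result is reduced with `_mm_movemask_ps` here.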
6
How to Write SIMD Code?
The old-fashioned way (assembly):
    ...
    mov    ecx, ptr
    movaps xmm0, [ecx]
    mulps  xmm0, [ecx]
    movaps [ecx], xmm0
    …
This calculates x^2 for the four FP numbers located at ptr.
Disadvantages:
- Hard to code, and even harder to debug.
- No compiler optimizations.
- Code maintenance is difficult.
7
How to Write SIMD Code? (cont.)
Using C intrinsics:
    ...
    __m128 *ptr, val;
    val = *ptr;
    *ptr = _mm_mul_ps(val, val);
    …
Advantages:
- No need to allocate registers manually.
- Compiler-based optimizations.
Disadvantages:
- Hard to read and maintain the code.
The type __m128 describes a 128-bit basic data element.
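As a slightly fuller sketch of the intrinsics style, the same squaring can be wrapped in a loop over an array. The function name `square_in_place` is an assumption for illustration, as are the requirements that `n` is a multiple of 4 and that `data` is 16-byte aligned.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Squares n floats in place, four per iteration.
// Assumes n is a multiple of 4 and data is 16-byte aligned,
// as required by _mm_load_ps / _mm_store_ps.
inline void square_in_place(float* data, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);
        _mm_store_ps(data + i, _mm_mul_ps(v, v));
    }
}
```

Register allocation and instruction scheduling are left to the compiler, which is the advantage the slide lists over hand-written assembly.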
8
How to Write SIMD Code? (cont.)
Using C++ SIMD classes:
    ...
    F32vec4 *ptr, val;
    val = *ptr;
    *ptr = val * val;
    …
Advantages:
- Natural to code and read.
- No need to allocate registers manually.
- Compiler-based optimizations.
The class F32vec4 describes a 128-bit basic data element with all the SIMD FP operations as C++ overloaded operators.
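The idea behind such a class can be sketched in a few lines. The `Vec4f` type below is a hypothetical stand-in written for this note, not the F32vec4 class from fvec.h; its constructor argument order and member names are assumptions. It only shows how overloaded operators can forward to intrinsics.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical stand-in for a SIMD class: a thin wrapper over __m128.
struct Vec4f {
    __m128 v;
    Vec4f() : v(_mm_setzero_ps()) {}
    Vec4f(__m128 x) : v(x) {}
    explicit Vec4f(float s) : v(_mm_set1_ps(s)) {}  // broadcast s to all four lanes
    Vec4f(float e0, float e1, float e2, float e3)   // e0 is the lowest lane
        : v(_mm_setr_ps(e0, e1, e2, e3)) {}
    // Read one lane (slow path, meant for debugging and tests).
    float operator[](int i) const {
        float t[4];
        _mm_storeu_ps(t, v);
        return t[i];
    }
};

inline Vec4f operator+(Vec4f a, Vec4f b) { return Vec4f(_mm_add_ps(a.v, b.v)); }
inline Vec4f operator-(Vec4f a, Vec4f b) { return Vec4f(_mm_sub_ps(a.v, b.v)); }
inline Vec4f operator*(Vec4f a, Vec4f b) { return Vec4f(_mm_mul_ps(a.v, b.v)); }
```

With this in place, `val * val` compiles to the same multiply as the intrinsic call while reading like scalar code.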
9
How to Write SIMD Code? (cont.)
Tools for coding Streaming SIMD Extensions code:
- For assembly language: MASM + Streaming SIMD Extensions macro package, or the Intel® C/C++ compiler (for inline assembly).
- For C intrinsics: Intel C/C++ compiler; include the file xmmintrin.h.
- For SIMD classes: Intel C/C++ compiler; include the file fvec.h.
10
Agenda
- Streaming SIMD Extensions overview.
- Streaming SIMD Extensions in C++.
- Streaming SIMD Extensions and memory.
- Some 3D code samples.
11
Memory Alignment
- Memory accessed via intrinsics or through pointers to SIMD data types MUST be 16-byte aligned.
- Local variables need no manual alignment (the compiler does it).
- To align a global variable, use the _MM_ALIGN16 macro:
    _MM_ALIGN16 DWORD mask[4];
- To align a dynamically allocated buffer, over-allocate by 15 bytes and round the address up to the next multiple of 16:
    orig_buff = malloc(size + 15);
    buff = (void *)(((DWORD) orig_buff + 15) & 0xfffffff0);  // the DWORD cast assumes 32-bit pointers
    ……….
    free(orig_buff);  // free the original pointer, not the aligned one
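The same over-allocate-and-round-up trick can be written without the 32-bit pointer assumption by using `std::uintptr_t`. The names `AlignedBuffer`, `alloc_aligned16` and `free_aligned16` are hypothetical helpers made up for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: keeps the pointer returned by malloc next to the
// aligned pointer, because only the original may be passed to free().
struct AlignedBuffer {
    void* orig;     // what malloc returned
    void* aligned;  // first 16-byte-aligned address inside the block
};

inline AlignedBuffer alloc_aligned16(std::size_t size) {
    AlignedBuffer b;
    b.orig = std::malloc(size + 15);  // 15 spare bytes for rounding up
    b.aligned = nullptr;
    if (b.orig != nullptr) {
        std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(b.orig);
        addr = (addr + 15) & ~static_cast<std::uintptr_t>(15);  // round up to 16
        b.aligned = reinterpret_cast<void*>(addr);
    }
    return b;
}

inline void free_aligned16(AlignedBuffer b) {
    std::free(b.orig);
}
```

The mask `~15` clears the low four address bits at any pointer width, whereas the constant `0xfffffff0` is only correct when pointers are 32 bits wide.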
12
Memory Alignment (cont.)
If alignment is not guaranteed:
- Load from memory using the loadu function.
- Store to memory using the storeu function.
    F32vec4 in, out;
    float *in_ptr, *out_ptr;
    loadu(in, in_ptr);     // load four unaligned floats
    …..
    storeu(out_ptr, out);  // store four unaligned floats
These functions are slower than aligned memory accesses.
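At the intrinsics level the unaligned accesses correspond to `_mm_loadu_ps` and `_mm_storeu_ps`. The sketch below uses a hypothetical function name, `add_one_unaligned`, and works on any float address.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Adds 1.0f to four consecutive floats starting at in_ptr and writes the
// results to out_ptr. Neither pointer needs to be 16-byte aligned.
inline void add_one_unaligned(const float* in_ptr, float* out_ptr) {
    __m128 in = _mm_loadu_ps(in_ptr);                // unaligned load
    __m128 out = _mm_add_ps(in, _mm_set1_ps(1.0f));
    _mm_storeu_ps(out_ptr, out);                     // unaligned store
}
```

A typical use is processing a sub-range of a larger array, where the start offset is not a multiple of four floats.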
13
Memory Arrangement Issues
Traditional procedures require horizontal operations that do not utilize the SIMD structure. For example: 3D vector normalization.
[Figure: array-of-structures (AOS) vertex layout starting at Base:
 X0 Y0 Z0 Nx0 Ny0 Nz0 Tu0 Tv0 | X1 Y1 Z1 Nx1 Ny1 Nz1 Tu1 Tv1 | X2 Y2 Z2 Nx2 Ny2 Nz2 Tu2 Tv2]
14
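3D Vector Normalization
[Figure: normalizing one vector in the AOS layout, with X, Y, Z, Nx sharing a register:
 - Multiply the register by itself: X*X, Y*Y, Z*Z, Nx*Nx.
 - Add horizontally: X*X + Y*Y + Z*Z.
 - Take 1/sqrt(X*X + Y*Y + Z*Z); the fourth item is set to 1.0.
 - Multiply: X/sqrt(X*X + Y*Y + Z*Z), Y/sqrt(X*X + Y*Y + Z*Z), Z/sqrt(X*X + Y*Y + Z*Z), Nx.]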
15
A Different Memory Approach (SOA)
Store each data component in its own contiguous array (Structure of Arrays). For example:
[Figure: SOA layout starting at Base:
 X0 X1 X2 …
 Y0 Y1 Y2 …
 Z0 Z1 Z2 …
 Nx0 Nx1 Nx2 …
 Ny0 Ny1 Ny2 …
 Nz0 Nz1 Nz2 …]
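A minimal sketch of converting between the two layouts follows. The `Vertex` and `VertexSoA` types and the `to_soa` function are hypothetical names chosen to mirror the X/Y/Z, Nx/Ny/Nz, Tu/Tv components in the layout example. Note that `std::vector<float>` does not promise 16-byte alignment, so real SIMD code would pair this with an aligned allocator or unaligned loads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical AOS vertex: position, normal, texture coordinates.
struct Vertex {
    float x, y, z, nx, ny, nz, tu, tv;
};

// Hypothetical SOA container: one contiguous array per component.
struct VertexSoA {
    std::vector<float> x, y, z, nx, ny, nz, tu, tv;
};

// Scatters each vertex field into its own array.
inline VertexSoA to_soa(const std::vector<Vertex>& in) {
    VertexSoA out;
    for (const Vertex& v : in) {
        out.x.push_back(v.x);   out.y.push_back(v.y);   out.z.push_back(v.z);
        out.nx.push_back(v.nx); out.ny.push_back(v.ny); out.nz.push_back(v.nz);
        out.tu.push_back(v.tu); out.tv.push_back(v.tv);
    }
    return out;
}
```

After the conversion, four consecutive X values sit next to each other in memory and can be loaded into one SIMD register with a single load.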
16
Vector Normalization using SOA
[Figure: normalizing four vectors at once in the SOA layout:
 - Registers hold X0 X1 X2 X3, Y0 Y1 Y2 Y3 and Z0 Z1 Z2 Z3.
 - Multiply each register by itself: Xi*Xi, Yi*Yi, Zi*Zi.
 - Add the three products and take 1/sqrt: 1/sqrt(Xi*Xi + Yi*Yi + Zi*Zi) for i = 0..3.
 - Multiply the X, Y and Z registers by the result to get the normalized vectors.]
17
SOA Advantages
- Calculate 4 items in a single iteration.
- No need to move items around inside the SIMD register.
- Better memory utilization. In the vector normalization example:
    AOS used 3 of the 8 FP numbers in each cache line.
    SOA used 8 of the 8 FP numbers in each cache line.
18
SOA Disadvantages
- If your algorithm uses constants, you must pre-create a SIMD version of these constants.
- The same rule applies to data generated outside the main loop (e.g. transformation matrix, lights data, etc.).
- For a small number of iterations, the overhead is bigger than the savings.
19
Streaming Instructions
- By using the _mm_prefetch intrinsic, you can hint the processor to load data that is not required now but will be soon (in the next iteration or pass).
- By using the store_nta function, you can write data that is no longer needed directly to memory without polluting the caches.
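Both hints can be combined in one loop, sketched here with the intrinsics `_mm_prefetch` and `_mm_stream_ps` (the non-temporal store at the intrinsics level). The function name `scale_stream` and the prefetch distance of 16 floats are assumptions; a real distance has to be tuned by measurement.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// dst[i] = src[i] * s for n floats.
// Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
inline void scale_stream(const float* src, float* dst, int n, float s) {
    const int kAhead = 16;  // prefetch distance in floats: an untuned guess
    const __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4) {
        if (i + kAhead < n) {
            // Hint only: asks for data that a later iteration will read.
            _mm_prefetch(reinterpret_cast<const char*>(src + i + kAhead),
                         _MM_HINT_NTA);
        }
        // Non-temporal store: writes the result without keeping it in cache.
        _mm_stream_ps(dst + i, _mm_mul_ps(_mm_load_ps(src + i), vs));
    }
    _mm_sfence();  // order the streaming stores before later memory accesses
}
```

Streaming stores only pay off when the output is not read again soon; if the next pass consumes `dst` immediately, an ordinary store is likely the better choice.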
20
Agenda
- Streaming SIMD Extensions overview.
- Streaming SIMD Extensions in C++.
- Streaming SIMD Extensions and memory.
- Some 3D code techniques and samples.
21
Branch Elimination
Classic code:
    if (x > y) x = y;
    if (x < y) x = x + y; else x = x - y;
SIMD code:
    x = simd_min(x, y);
    x = select_lt(x, y, x + y, x - y);  // lt = less than
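A select of this kind can be built from a compare mask and three bitwise operations. The sketch below uses plain intrinsics; `select4` and `min_then_select` are hypothetical names, and this is one common way to build a select, not necessarily how select_lt is implemented in fvec.h.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Per lane: returns a where the mask lane is all ones, b where it is all zeros.
inline __m128 select4(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Per lane: x = min(x, y); then x = (x < y) ? x + y : x - y.
inline __m128 min_then_select(__m128 x, __m128 y) {
    x = _mm_min_ps(x, y);
    __m128 lt = _mm_cmplt_ps(x, y);  // all-ones lanes where x < y
    return select4(lt, _mm_add_ps(x, y), _mm_sub_ps(x, y));
}
```

Both candidate results are computed for every lane and the mask picks one, so the cost is fixed regardless of how the comparison turns out.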
22
Approximation Functions
Classic code:
    y = 1.0 / x;
    y = 1.0 / sqrt(x);
SIMD code:
    y = rcp(x);
    y = rsqrt(x);
These functions are approximations (but fast ones).
To improve the approximation, use the _nr suffix (rcp_nr / rsqrt_nr).
23
Vector Normalization
Classic code:
    float *x, *y, *z, len;
    for (i = 0; i < n; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = 1.0f / sqrt(len);
        *x *= len;
        *y *= len;
        *z *= len;
    }
SIMD code (SOA arrays, 16-byte aligned, n a multiple of 4):
    F32vec4 *x, *y, *z, len;
    for (i = 0; i < n/4; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = rsqrt(len);  // approximate 1/sqrt, four vectors at once
        *x *= len;
        *y *= len;
        *z *= len;
    }
24
Code Samples
3D transform (SOA):
    F32vec4 x, y, z, tx, ty, tz, w, m[4][4];
    w  = x*m[3][0] + y*m[3][1] + z*m[3][2] + m[3][3];
    w  = rcp(w);  // ~ 1.0/w
    tx = w*(x*m[0][0] + y*m[0][1] + z*m[0][2] + m[0][3]);
    ty = w*(x*m[1][0] + y*m[1][1] + z*m[1][2] + m[1][3]);
    tz = w*(x*m[2][0] + y*m[2][1] + z*m[2][2] + m[2][3]);
25
Code Samples (cont.)
A simple directional light:
    static const F32vec4 ZERO = 0.0f;  // expanding a constant
    F32vec4 dot;
    dot = light_dir->x * norm->x + light_dir->y * norm->y + light_dir->z * norm->z;
    dot = simd_max(dot, ZERO);  // clear all items less than 0.0
    color->r += dot;
    color->g += dot;
    color->b += dot;
26
Packing color values to RGB format
Step 1: scale to [0..255] and saturate the colors.
    static const F32vec4 _255_ = 255.0f;
    r = simd_min(color->r * _255_, _255_);
    g = simd_min(color->g * _255_, _255_);
    b = simd_min(color->b * _255_, _255_);
27
Packing color values to RGB format (cont.)
Step 2: convert and pack. Only the lower 2 SIMD items can be converted to an MMX vector of two DWORDs in each pass.
[Figure: conversion in two passes:
 - Start with R0 R1 R2 R3, G0 G1 G2 G3, B0 B1 B2 B3.
 - SIMD integer conversion (only the 2 lower floats) gives R0 R1, G0 G1, B0 B1 as integers in MMX registers.
 - Move the high SIMD half to the low SIMD half: R2 R3 R2 R3, G2 G3 G2 G3, B2 B3 B2 B3.
 - SIMD integer conversion (only the 2 lower floats) gives R2 R3, G2 G3, B2 B3.]
28
Packing color values to RGB format (cont.)
    // Converting the lower 2 SIMD items.
    Is32vec2 color[2];
    color[0] = (F32vec4ToIs32vec2(r) << 16) |
               (F32vec4ToIs32vec2(g) << 8) |
               (F32vec4ToIs32vec2(b));
    // Converting the upper 2 SIMD items.
    color[1] = (F32vec4ToIs32vec2(_mm_movehl_ps(r, r)) << 16) |
               (F32vec4ToIs32vec2(_mm_movehl_ps(g, g)) << 8) |
               (F32vec4ToIs32vec2(_mm_movehl_ps(b, b)));
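For comparison, here is a sketch of the same packing with a different technique: SSE2 integer operations on the xmm registers, which convert all four floats in one pass and need no MMX registers. SSE2 postdates this presentation, so this is a swapped-in alternative and not the method shown in the code above. The function name `pack_rgb4` is hypothetical, and inputs are assumed to be non-negative, since (like step 1) only the upper bound is clamped.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)

// Packs four colours, given as r/g/b lanes in [0, 1], into four
// 0x00RRGGBB words. Values above 1.0 saturate to 255.
inline void pack_rgb4(__m128 r, __m128 g, __m128 b, std::uint32_t* out) {
    const __m128 k255 = _mm_set1_ps(255.0f);
    // Step 1: scale to [0..255] and saturate the upper bound.
    __m128i ri = _mm_cvtps_epi32(_mm_min_ps(_mm_mul_ps(r, k255), k255));
    __m128i gi = _mm_cvtps_epi32(_mm_min_ps(_mm_mul_ps(g, k255), k255));
    __m128i bi = _mm_cvtps_epi32(_mm_min_ps(_mm_mul_ps(b, k255), k255));
    // Step 2: shift each channel into place and combine.
    __m128i rgb = _mm_or_si128(
        _mm_or_si128(_mm_slli_epi32(ri, 16), _mm_slli_epi32(gi, 8)), bi);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), rgb);
}
```

The shift-and-or structure is the same as in the two-pass version; only the conversion width differs.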
29
Backup
30
Code Samples
Multiplying a vector by a matrix:
[Figure: vector (V0 V1 V2 V3) and matrix rows (m00 m01 m02 m03), (m10 m11 m12 m13), (m20 m21 m22 m23), (m30 m31 m32 m33)]
- Cast (broadcast) each one of the vector components to all 4 SIMD items.
- Multiply each broadcast component by the matching matrix line.
- Sum the results.
31
Code Samples
Expand the previous example and multiply a matrix (m1) by a matrix (m2) into a result matrix (m3).
    F32vec4 m1[4], m2[4], m3[4];
    F32vec4 a, b, c, d;
    int i;
    for (i = 0; i < 4; i++) {
        // Each iteration multiplies a line vector from m1 by m2.
        a = F32vec4((m1[i])[0]);  // cast row i, column 0 to all items
        b = F32vec4((m1[i])[1]);  // cast row i, column 1 to all items
        c = F32vec4((m1[i])[2]);  // cast row i, column 2 to all items
        d = F32vec4((m1[i])[3]);  // cast row i, column 3 to all items
        m3[i] = a * m2[0] + b * m2[1] + c * m2[2] + d * m2[3];
    }
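The same row-by-matrix scheme can be sketched with raw intrinsics, where `_mm_set1_ps` plays the role of the cast to all items. The function name `mat4_mul` and the flat row-major float arrays are assumptions made for this example.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// m3 = m1 * m2 for row-major 4x4 matrices stored as 16 floats each.
// Assumes m2 and m3 are 16-byte aligned and m3 does not alias m1 or m2.
inline void mat4_mul(const float* m1, const float* m2, float* m3) {
    const __m128 r0 = _mm_load_ps(m2 + 0);   // rows of m2
    const __m128 r1 = _mm_load_ps(m2 + 4);
    const __m128 r2 = _mm_load_ps(m2 + 8);
    const __m128 r3 = _mm_load_ps(m2 + 12);
    for (int i = 0; i < 4; ++i) {
        // Broadcast each element of row i of m1 to all four lanes.
        __m128 a = _mm_set1_ps(m1[4 * i + 0]);
        __m128 b = _mm_set1_ps(m1[4 * i + 1]);
        __m128 c = _mm_set1_ps(m1[4 * i + 2]);
        __m128 d = _mm_set1_ps(m1[4 * i + 3]);
        // Row i of the result is a weighted sum of the rows of m2.
        __m128 row = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(a, r0), _mm_mul_ps(b, r1)),
            _mm_add_ps(_mm_mul_ps(c, r2), _mm_mul_ps(d, r3)));
        _mm_store_ps(m3 + 4 * i, row);
    }
}
```

Because each result row is built from whole rows of m2, no transposition or horizontal add is needed.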