® 1 March 1999 Optimizing 3D Pipeline using Streaming SIMD Extensions in C++ Ronen Zohar

® 2 March 1999 Agenda
→ Streaming SIMD extensions overview.
→ Streaming SIMD extensions in C++.
- Streaming SIMD extensions & memory.
- Some 3D code samples.

® 3 March 1999 Streaming SIMD Extensions Overview
Streaming SIMD extensions introduce three types of new instructions:
- SIMD floating point single precision instructions
- Memory streaming instructions
- SIMD integer instructions

® 4 March 1999 Streaming SIMD Extensions Overview (Cont.)
Streaming SIMD extensions introduce a new set of eight registers (xmm0-xmm7):
- Each of these registers is 128 bits long
- Each register holds 4 single precision floating point numbers
[Diagram: the legacy x86 registers (eax, …), the x87 stack/MMX® registers, and the new Streaming SIMD Extension registers.]

® 5 March 1999 SIMD Floating Point Single Precision Instructions
Operations supported:
- Data transfer (move, load, store)
- Numerical (add, subtract, square root, ...)
- Bitwise operations (and, or, exclusive or, ...)
- Compares (==, !=, <=, ...)
These instructions can operate between two xmm registers, or between a register and a 16-byte-aligned memory location.

® 6 March 1999 How to Write SIMD Code?
The old-fashioned way (assembly):
    ...
    mov    ecx, ptr
    movaps xmm0, [ecx]
    mulps  xmm0, [ecx]
    movaps [ecx], xmm0
    ...
Calculating x^2 for four FP numbers located at ptr.
Disadvantages:
- Hard to code, and even harder to debug.
- No compiler optimizations.
- Code maintenance is difficult.
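For reference, a self-contained sketch of the assembly route, assuming MSVC/Intel-style __asm inline assembly and a compiler that accepts __declspec(align(16)) on locals (the array name v and its contents are illustrative):

    #include <cstdio>

    int main() {
        // Square four packed floats in place with SSE instructions.
        __declspec(align(16)) float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        __asm {
            lea    ecx, v          ; ecx = address of the aligned array
            movaps xmm0, [ecx]     ; load 4 packed single-precision floats
            mulps  xmm0, [ecx]     ; multiply element-wise: v[i] * v[i]
            movaps [ecx], xmm0     ; store the squares back
        }
        printf("%f %f %f %f\n", v[0], v[1], v[2], v[3]);
        return 0;
    }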

® 7 March 1999 How to Write SIMD Code? (cont.)
Using C intrinsic instructions:
    ...
    __m128 *ptr, val;
    val  = *ptr;
    *ptr = _mm_mul_ps(val, val);
    ...
The type __m128 describes a 128-bit basic data element.
Advantages:
- No need to allocate registers manually.
- Compiler based optimizations.
Disadvantages:
- Hard to read/maintain the code.
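A complete, compilable version of the intrinsic fragment above (a sketch; it assumes a compiler that ships xmmintrin.h and supports __declspec(align), and the array name data is illustrative):

    #include <cstdio>
    #include <xmmintrin.h>

    int main() {
        // 16-byte aligned storage for four packed single-precision floats.
        __declspec(align(16)) float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

        __m128 val = _mm_load_ps(data);   // load 4 aligned floats into an xmm register
        val = _mm_mul_ps(val, val);       // square each element
        _mm_store_ps(data, val);          // store the result back

        printf("%f %f %f %f\n", data[0], data[1], data[2], data[3]);
        return 0;
    }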

® 8 March 1999 How to Write SIMD Code? (cont.)
Using C++ SIMD classes:
    ...
    F32vec4 *ptr, val;
    val  = *ptr;
    *ptr = val * val;
    ...
The class F32vec4 describes a 128-bit basic data element with all the SIMD FP operations as C++ overloaded operators.
Advantages:
- Natural to code and read.
- No need to allocate registers manually.
- Compiler based optimizations.
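And the same operation written with the F32vec4 class (a sketch, assuming the Intel C/C++ compiler's fvec.h; the _MM_ALIGN16 macro introduced a few slides later keeps the array aligned):

    #include <cstdio>
    #include <fvec.h>     // F32vec4 and the overloaded SIMD operators

    int main() {
        _MM_ALIGN16 float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

        F32vec4 *ptr = (F32vec4 *)data;   // reinterpret the aligned array
        F32vec4 val  = *ptr;
        *ptr = val * val;                 // operator* maps to a single mulps

        printf("%f %f %f %f\n", data[0], data[1], data[2], data[3]);
        return 0;
    }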

® 9 March 1999 How to Write SIMD Code? (cont.)
Tools for coding Streaming SIMD extension code:
- For assembly language: MASM + Streaming SIMD extension macro package, or the Intel® C/C++ compiler (for inline assembly)
- For intrinsic C functions: Intel C/C++ compiler, include the file xmmintrin.h
- For SIMD classes: Intel C/C++ compiler, include the file fvec.h

® 10 March 1999 Agenda
- Streaming SIMD extension overview.
- Streaming SIMD extension in C++.
→ Streaming SIMD extension & memory.
- Some 3D code samples.

® 11 March 1999 Memory Alignment
Memory accessed via intrinsics or SIMD data type pointers MUST be 16-byte aligned.
- No need to align local variables (done by the compiler).
- To align a global variable, use the _MM_ALIGN16 macro:
    _MM_ALIGN16 DWORD mask[4];
- To align a dynamically allocated buffer:
    orig_buff = malloc(size + 15);
    buff = (void *)(((DWORD)orig_buff + 15) & 0xfffffff0);
    ...
    free(orig_buff);
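Put together, the dynamic-buffer trick looks like this (a sketch that, like the slide, assumes a 32-bit target where a pointer fits in a DWORD):

    #include <stdlib.h>

    typedef unsigned long DWORD;   // 32-bit, as assumed above

    int main() {
        size_t size = 64 * sizeof(float);

        // Over-allocate by 15 bytes, then round the pointer up to a 16-byte boundary.
        void *orig_buff = malloc(size + 15);
        void *buff = (void *)(((DWORD)orig_buff + 15) & 0xfffffff0);

        // ... use 'buff' with aligned SIMD loads/stores ...

        free(orig_buff);   // always free the original, unaligned pointer
        return 0;
    }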

® 12 March 1999 Memory Alignment (cont.)
If alignment is not guaranteed:
- Load from memory using the loadu function
- Store to memory using the storeu function
- These functions are slower than aligned memory access
    F32vec4 in, out;
    float *in_ptr, *out_ptr;
    loadu(in, in_ptr);      // loading four unaligned floats
    ...
    storeu(out_ptr, out);   // storing four unaligned floats

® 13 March 1999 Memory Arrangement Issues
Traditional procedures require horizontal operations that do not utilize the SIMD structure.
For example: 3D vector normalization.
[Diagram: an Array-of-Structures (AOS) layout in memory, starting at Base:
 X0 Y0 Z0 Nx0 Ny0 Nz0 Tu0 Tv0 | X1 Y1 Z1 Nx1 Ny1 Nz1 Tu1 Tv1 | X2 Y2 Z2 Nx2 Ny2 Nz2 Tu2 Tv2 ...]
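The layout in the diagram corresponds to a vertex structure along these lines (a sketch; the field names follow the diagram, not any particular API):

    // One vertex per structure: an array of these produces the
    // X0 Y0 Z0 Nx0 Ny0 Nz0 Tu0 Tv0 | X1 Y1 Z1 ... layout shown above.
    struct VertexAOS {
        float x, y, z;       // position
        float nx, ny, nz;    // normal
        float tu, tv;        // texture coordinates
    };

    VertexAOS vertices[1024];   // Array of Structures (AOS)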

® 14 March 1999 3D Vector Normalization
[Diagram: normalizing one AOS vertex with SIMD. A register holding (X, Y, Z, Nx) is squared to give (X*X, Y*Y, Z*Z, Nx*Nx); the first three products must be summed horizontally to X*X + Y*Y + Z*Z, then 1/sqrt of that sum is broadcast and multiplied back in to give X/sqrt(...), Y/sqrt(...), Z/sqrt(...).]

® 15 March 1999 A Different Memory Approach (SOA)
Store each of the data components in its own contiguous array (Structure of Arrays). For example:
[Diagram: SOA layout in memory, starting at Base:
 X0 X1 X2 ... | Y0 Y1 Y2 ... | Z0 Z1 Z2 ... | Nx0 Nx1 Nx2 ... | Ny0 Ny1 Ny2 ... | Nz0 Nz1 Nz2 ...]
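A matching Structure-of-Arrays container might look as follows (a sketch; MAX_VERTS and the struct name are illustrative, and _MM_ALIGN16 comes from xmmintrin.h):

    #include <xmmintrin.h>   // for the _MM_ALIGN16 macro

    const int MAX_VERTS = 1024;

    // Each component gets its own contiguous, 16-byte-aligned array, so four
    // consecutive vertices load straight into one SIMD register with no shuffling.
    struct VerticesSOA {
        _MM_ALIGN16 float x[MAX_VERTS];
        _MM_ALIGN16 float y[MAX_VERTS];
        _MM_ALIGN16 float z[MAX_VERTS];
        _MM_ALIGN16 float nx[MAX_VERTS];
        _MM_ALIGN16 float ny[MAX_VERTS];
        _MM_ALIGN16 float nz[MAX_VERTS];
    };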

® 16 March 1999 Vector Normalization using SOA
[Diagram: normalizing four vectors at once. The registers (X0 X1 X2 X3), (Y0 Y1 Y2 Y3), (Z0 Z1 Z2 Z3) are each squared; the three registers of squares are summed vertically; a single rsqrt gives 1/sqrt(Xi*Xi + Yi*Yi + Zi*Zi) for i = 0..3; multiplying that back into the X, Y and Z registers yields four normalized vectors with no horizontal operations.]

® 17 March 1999 SOA Advantages
- Calculate 4 items in a single iteration.
- No need to move items around the SIMD register.
- Better memory utilization; in the vector normalization example:
  AOS used 3 of every 8 FP numbers in each cache line.
  SOA used 8 of every 8 FP numbers in each cache line.

® 18 March 1999 SOA Disadvantages
- If your algorithm uses constants, you must pre-create a SIMD version of these constants. The same rule applies to data generated outside the main loop (e.g. transformation matrix, lights data, etc.), as sketched below.
- For a small number of iterations the overhead is bigger than the savings.
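For example, a scalar constant computed outside the loop can be expanded into all four SIMD items once, up front (a sketch; the function and variable names are illustrative):

    #include <fvec.h>

    void scale_and_bias(float *data, int n, float scale, float bias) {
        // Pre-create SIMD versions of the scalar constants, outside the loop.
        const F32vec4 simd_scale(scale);   // broadcast scale to all 4 items
        const F32vec4 simd_bias(bias);     // broadcast bias to all 4 items

        F32vec4 *v = (F32vec4 *)data;      // data assumed 16-byte aligned, n % 4 == 0
        for (int i = 0; i < n / 4; i++)
            v[i] = v[i] * simd_scale + simd_bias;
    }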

® 19 March 1999 Streaming Instructions
- By using the _mm_prefetch intrinsic, you can hint the processor to load data that is not required now but will be needed soon (in the next iteration/pass).
- By using the store_nta function, you can write data that is no longer needed directly to memory without polluting the caches.
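A sketch combining the two in a streaming loop (assumes fvec.h's store_nta wrapper and a prefetch distance of 16 floats ahead; the names and the prefetch hint are illustrative):

    #include <fvec.h>
    #include <xmmintrin.h>

    // Scale n floats (n % 4 == 0, both pointers 16-byte aligned). Results are
    // written with non-temporal stores since they will not be read again soon,
    // and data for upcoming iterations is prefetched.
    void scale_stream(const float *src, float *dst, int n, float s) {
        const F32vec4 scale(s);
        for (int i = 0; i < n; i += 4) {
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0);
            F32vec4 v = *(const F32vec4 *)(src + i);
            v = v * scale;
            store_nta(dst + i, v);   // write straight to memory, bypassing the caches
        }
    }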

® 20 March 1999 Agenda
- Streaming SIMD extension overview.
- Streaming SIMD extension in C++.
- Streaming SIMD extension & memory.
→ Some 3D code techniques & samples.

® 21 March 1999 Branch Elimination
Classic code:
    if (x > y) x = y;
    if (x < y) x = x + y; else x = x - y;
SIMD code:
    x = simd_min(x, y);
    x = select_lt(x, y, x + y, x - y);   // lt = less than
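A sketch of the second transformation applied over a whole array (select_lt(a, b, c, d) from fvec.h picks, item by item, c where a < b and d otherwise; the function name branchless_update is illustrative):

    #include <fvec.h>

    // For each group of four floats: x = (x < y) ? x + y : x - y, with no branches.
    void branchless_update(float *x, const float *y, int n) {
        F32vec4 *vx = (F32vec4 *)x;               // both arrays assumed 16-byte
        const F32vec4 *vy = (const F32vec4 *)y;   // aligned, n % 4 == 0
        for (int i = 0; i < n / 4; i++)
            vx[i] = select_lt(vx[i], vy[i], vx[i] + vy[i], vx[i] - vy[i]);
    }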

® 22 March 1999 Approximation Functions
Classic code:
    y = 1.0/x;
    y = 1.0/sqrt(x);
SIMD code:
    y = rcp(x);
    y = rsqrt(x);
These functions are approximations (but fast ones). To improve the approximation, use the _nr suffix (rcp_nr/rsqrt_nr), which adds a Newton-Raphson refinement step.

® 23 March 1999 Vector Normalization
Classic code:
    float *x, *y, *z, len;
    for (i = 0; i < n; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = 1.0f / sqrt(len);
        *x *= len; *y *= len; *z *= len;
    }
SIMD code:
    F32vec4 *x, *y, *z, len;
    for (i = 0; i < n/4; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = rsqrt(len);
        *x *= len; *y *= len; *z *= len;
    }

® 24 March 1999 Code Samples
3D transform (SOA):
    F32vec4 x, y, z, tx, ty, tz, w, m[4][4];
    w  = x*m[3][0] + y*m[3][1] + z*m[3][2] + m[3][3];
    w  = rcp(w);                                           // ~ 1.0/w
    tx = w*(x*m[0][0] + y*m[0][1] + z*m[0][2] + m[0][3]);
    ty = w*(x*m[1][0] + y*m[1][1] + z*m[1][2] + m[1][3]);
    tz = w*(x*m[2][0] + y*m[2][1] + z*m[2][2] + m[2][3]);

® 25 March 1999 Code Samples (cont.)
A simple directional light:
    static const F32vec4 ZERO = 0.0f;      // expanding a constant
    F32vec4 dot;
    dot = light_dir->x * norm->x +
          light_dir->y * norm->y +
          light_dir->z * norm->z;
    dot = simd_max(dot, ZERO);             // clear all items less than 0.0
    color->r += dot;
    color->g += dot;
    color->b += dot;

® 26 March 1999 Packing color values to RGB format
Step 1: scaling to [0..255] & saturating colors.
    static const F32vec4 _255_ = 255.0f;
    r = simd_min(color->r * _255_, _255_);
    g = simd_min(color->g * _255_, _255_);
    b = simd_min(color->b * _255_, _255_);

® 27 March 1999 Packing color values to RGB format (cont.)
Step 2: Convert & pack. Only the lower 2 SIMD items can be converted to an MMX double DWORD vector in each pass.
[Diagram: from the float registers (R0 R1 R2 R3), (G0 G1 G2 G3), (B0 B1 B2 B3), a SIMD integer conversion of the two lower floats yields MMX registers holding (R0 R1), (G0 G1), (B0 B1); the high SIMD half is then moved to the low half and a second conversion yields (R2 R3), (G2 G3), (B2 B3).]

® 28 March 1999 Packing color values to RGB format (cont.)
    Is32vec2 color[2];
    // Converting the lower 2 SIMD items.
    color[0] = (F32vec4ToIs32vec2(r) << 16) |
               (F32vec4ToIs32vec2(g) << 8)  |
               (F32vec4ToIs32vec2(b));
    // Converting the upper 2 SIMD items.
    color[1] = (F32vec4ToIs32vec2(_mm_movehl_ps(r, r)) << 16) |
               (F32vec4ToIs32vec2(_mm_movehl_ps(g, g)) << 8)  |
               (F32vec4ToIs32vec2(_mm_movehl_ps(b, b)));

® 29 March 1999 Backup

® 30 March 1999 Code Samples
Multiplying a vector by a matrix:
[Diagram: the vector (V0 V1 V2 V3) and the matrix rows (m00..m03), (m10..m13), (m20..m23), (m30..m33). Each vector component Vi is cast (broadcast) to all 4 SIMD components, multiplied with the matching matrix row, and the four products are summed to give the result.]
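A sketch of that vector-by-matrix multiply with the F32vec4 class (the matrix is stored as four row vectors; the function name is illustrative):

    #include <fvec.h>

    // result = v * m, where m[0..3] are the four rows of a 4x4 matrix.
    F32vec4 vec_mat_mul(F32vec4 v, const F32vec4 m[4]) {
        F32vec4 a(v[0]);   // broadcast V0 to all 4 SIMD components
        F32vec4 b(v[1]);   // broadcast V1
        F32vec4 c(v[2]);   // broadcast V2
        F32vec4 d(v[3]);   // broadcast V3
        return a*m[0] + b*m[1] + c*m[2] + d*m[3];   // multiply rows, then sum
    }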

® 31 March 1999 Code Samples
Expand the previous example and multiply a matrix (m1) by a matrix (m2) into a result matrix (m3).
    F32vec4 m1[4], m2[4], m3[4], a, b, c, d;
    for (i = 0; i < 4; i++) {
        // Each iteration multiplies a row vector from m1 by m2.
        a = F32vec4((m1[i])[0]);   // broadcast row i, column 0 to all items
        b = F32vec4((m1[i])[1]);   // broadcast row i, column 1 to all items
        c = F32vec4((m1[i])[2]);   // broadcast row i, column 2 to all items
        d = F32vec4((m1[i])[3]);   // broadcast row i, column 3 to all items
        m3[i] = a*m2[0] + b*m2[1] + c*m2[2] + d*m2[3];
    }