Lecture 17. Vector Machine, and Intel MMX/SSEx Extensions

1 Lecture 17. Vector Machine, and Intel MMX/SSEx Extensions
COSC3330 Computer Architecture, Lecture 17: Vector Machine and Intel MMX/SSEx Extensions
Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston

2 Topics Vector Machine Intel MMX/SSEx Extensions

3 Supercomputer
A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved. The CDC 6600 (1964) is regarded as the first supercomputer, and arguably the first RISC processor. Seymour Cray, the "father of supercomputing," was its main designer.

4 History of Supercomputers
In the 1970s and 1980s, "supercomputer" effectively meant "vector machine."

5 Vector Supercomputers
Epitomized by the Cray-1 (1976), one of the best-known and most successful supercomputers, installed at LANL for $8.8 million.
Vector extension: vector registers, vector instructions.
Implementation: highly pipelined functional units, interleaved memory system, no data caches, no virtual memory.
Programming: Cray-1 FORTRAN with an auto-vectorizing compiler!

6 Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors."
SCALAR (1 operation):  add r3, r1, r2      # r3 = r1 + r2
VECTOR (N operations): add.vv v3, v1, v2   # v3[i] = v1[i] + v2[i], for i = 0 .. vector length - 1

7 SIMD
A SIMD (single instruction, multiple data) architecture performs the same operation on multiple data elements in parallel, e.g. PADDW MM0, MM1 (packed add of words).

8 Vector Length Register
A scalar register file (r0-r15) is paired with a vector register file (v0-v15); each vector register holds elements [0], [1], [2], ..., [VLRMAX-1]. The vector length register (VLR) specifies how many of those elements an instruction operates on.

9 Vector Arithmetic Instructions
ADDV v3, v1, v2 adds corresponding elements of v1 and v2 and writes them to v3: v3[i] = v1[i] + v2[i], for i = 0 .. VLR-1.

10 Vector Load and Store Instructions
LV v1, r1, r2 loads a vector from memory into vector register v1, starting at the base address in r1 and stepping by the stride in r2 between elements.

11 Vector Code Example
# C code
for (i=0; i<64; i++)
    C[i] = A[i] + B[i];

# Scalar Code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector Code
      LI     VLR, 64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3

12 VLIW: Very Long Instruction Word
Multiple operations are packed into one long instruction with fixed slots, e.g. two integer ops, two memory ops, and two floating-point ops per instruction:
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
Each operation slot is for a fixed function, and constant operation latencies are specified. The architecture requires the compiler to guarantee:
- Parallelism within an instruction => no cross-operation RAW check
- No data use before data ready => no data interlocks

13 Vector Instruction Set Advantages
Compact: one short instruction encodes N operations.
Expressive: tells hardware that these N operations
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
Scalable: the same code can run on more parallel pipelines (lanes).

14 Vector Arithmetic Execution
Use a deep pipeline (=> fast clock) to execute element operations. Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!). Example: a six-stage multiply pipeline producing v3 from operands v1 and v2, one element per cycle.

15 Vector Instruction Execution
ADDV C, A, B can execute using one pipelined functional unit, producing one element of C per cycle, or using four pipelined functional units (lanes), each handling every fourth element and together completing four elements per cycle.

16 Vector Memory System
Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency. Bank busy time is the number of cycles between permitted accesses to the same bank. An address generator steps through memory from a base address by a stride, spreading successive elements across the banks and feeding the vector registers.
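The bank busy time can be illustrated with a minimal C sketch. This is a simplified model, not Cray-1 detail: the function name and the one-access-per-cycle issue assumption are mine. It shows why a unit stride streams at full rate over 16 banks while a stride of 16 serializes on a single bank.

```c
#include <string.h>

#define NBANKS 16
#define BUSY   4   /* cycles a bank stays busy after an access */

/* Cycles needed to issue n word accesses with the given word stride,
   assuming one access may issue per cycle, but an access to a busy
   bank must stall until that bank frees up. */
int access_cycles(int n, int stride) {
    int free_at[NBANKS];            /* cycle at which each bank is free */
    memset(free_at, 0, sizeof free_at);
    int cycle = 0;
    for (int i = 0; i < n; i++) {
        int bank = (i * stride) % NBANKS;
        if (cycle < free_at[bank]) cycle = free_at[bank];  /* stall */
        free_at[bank] = cycle + BUSY;
        cycle++;                    /* next access can issue next cycle */
    }
    return cycle;
}
```

With stride 1, each bank is revisited only every 16 accesses, well past its 4-cycle busy time, so 64 accesses take 64 cycles; with stride 16, every access hits the same bank and each must wait out the busy time.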

17 Vector Unit Structure
The vector registers and functional units are split into lanes. With four lanes, lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., lane 2 holds elements 2, 6, 10, ..., and lane 3 holds elements 3, 7, 11, .... Each lane connects to the memory subsystem.

18 Vector Instruction Parallelism
Execution of multiple vector instructions can overlap. Example machine: 32 elements per vector register and 8 lanes, with separate load, multiply, and add units. While issuing only one short instruction per cycle, the machine completes 24 operations per cycle (3 units x 8 lanes).

19 Vector Chaining Vector version of register bypassing
Introduced with the Cray-1. Results are forwarded element by element from one functional unit to the next, chaining the load unit to the multiplier and the multiplier to the adder:
LV   v1            # load from memory
MULV v3, v1, v2    # chained to the load
ADDV v5, v3, v4    # chained to the multiply

20 Vector Chaining Advantage
Without chaining, a dependent instruction must wait for the last element of the result to be written before starting. With chaining, a dependent instruction can start as soon as the first result element appears, so the load, multiply, and add overlap in time.

21 Automatic Code Vectorization
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
Scalar sequential code performs a load, add, and store in every iteration; vectorized code performs one vector load, one vector add, and one vector store for a whole group of iterations. Vectorization is a massive compile-time reordering of operation sequencing and requires extensive loop dependence analysis.
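The dependence analysis above can be made concrete with a short C sketch (the function names are mine, for illustration): the first loop's iterations are independent and can be vectorized, while the second carries a dependence from one iteration to the next and cannot.

```c
/* Independent iterations: a vectorizing compiler can turn this loop
   into vector loads, a vector add, and a vector store. */
void add_arrays(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: c[i] needs c[i-1] from the previous
   iteration, so dependence analysis must reject vectorization here. */
void prefix_sum(const float *a, float *c, int n) {
    c[0] = a[0];
    for (int i = 1; i < n; i++)
        c[i] = c[i-1] + a[i];
}
```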

22 Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]];
Indexed load instruction (gather):
LV     vD, rD      # Load indices in D vector
LVI    vC, rC, vD  # Load indirect from rC base
LV     vB, rB      # Load B vector
ADDV.D vA, vB, vC  # Do add
SV     vA, rA      # Store result
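The gather sequence above has this scalar meaning, written out as a C sketch (the function name is mine): the indexed load fetches C[D[i]] through the index vector before the add.

```c
/* Scalar semantics of the gather sequence:
   vD <- D;  vC <- gather C[vD];  vA <- vB + vC;  A <- vA */
void gather_add(const int *B, const int *C, const int *D, int *A, int n) {
    for (int i = 0; i < n; i++)
        A[i] = B[i] + C[D[i]];   /* C[D[i]] is the gathered element */
}
```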

23 Vector Scatter/Gather
Scatter example:
for (i=0; i<N; i++)
    A[B[i]]++;
Is the following a correct translation?
LV   vB, rB      # Load indices in B vector
LVI  vA, rA, vB  # Gather initial A values
ADDV vA, vA, 1   # Increment
SVI  vA, rA, vB  # Scatter incremented values
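The question can be settled with a small C emulation of both versions (function names and the fixed-size temporary are mine): when B contains duplicate indices, every gathered copy sees the old value, so the element is incremented only once. The translation is correct only if the indices in B are all distinct.

```c
/* Correct scalar semantics of the loop A[B[i]]++ */
void scatter_inc_scalar(int *A, const int *B, int n) {
    for (int i = 0; i < n; i++)
        A[B[i]]++;
}

/* Emulation of the gather / add / scatter translation. With duplicate
   indices, all copies are incremented from the same old value, and the
   later scatter simply overwrites the earlier one. */
void scatter_inc_vector(int *A, const int *B, int n) {
    int vA[64];
    for (int i = 0; i < n; i++) vA[i] = A[B[i]];   /* LVI: gather  */
    for (int i = 0; i < n; i++) vA[i] += 1;        /* ADDV         */
    for (int i = 0; i < n; i++) A[B[i]] = vA[i];   /* SVI: scatter */
}
```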

24 Vector Conditional Execution
Problem: want to vectorize loops with conditional code:
for (i=0; i<N; i++)
    if (A[i] > 0) A[i] = B[i];
Solution: add vector mask (or flag) registers, 1 bit per element, and maskable vector instructions: a vector operation becomes a NOP at elements where the mask bit is clear.
Code example:
CVM              # Turn on all elements
LV      vA, rA   # Load entire A vector
SGTVS.D vA, F0   # Set bits in mask register where A>0
LV      vA, rB   # Load B vector into A under mask
SV      vA, rA   # Store A back to memory under mask
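The masked sequence above behaves like this C sketch (the function name and explicit mask array are mine, standing in for the 1-bit-per-element mask register): elements whose mask bit is clear are simply left untouched.

```c
/* Emulation of masked vector execution: build the mask from the
   compare (SGTVS.D), then load/store only under the mask. */
void masked_copy(double *A, const double *B, int n) {
    int mask[64];
    for (int i = 0; i < n; i++)
        mask[i] = (A[i] > 0.0);          /* SGTVS.D: mask = A > 0 */
    for (int i = 0; i < n; i++)
        if (mask[i]) A[i] = B[i];        /* LV/SV under mask; NOP else */
}
```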

25 Masked Vector Instructions
Simple implementation: execute all N operations, but turn result writeback on or off per element according to the mask bits (write where M[i]=1, squash where M[i]=0).
Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits.

26 Intel MMX/SSE/SSE2 Extension Why MMX?
- Accelerate multimedia and communications applications
- Maintain full compatibility with existing operating systems and applications
- Exploit the inherent parallelism in multimedia and communication algorithms
- Add new instructions and data types to improve performance

27 First Step: Examine Code
Examined a wide range of applications: graphics, MPEG video, music synthesis, speech compression, speech recognition, image processing, games, video conferencing. Identified and analyzed the most compute-intensive routines

28 Common Characteristics
- Small integer data types: e.g. 8-bit pixels, 16-bit audio samples
- Small, highly repetitive loops
- Frequent multiply-and-accumulate
- Compute-intensive algorithms
- Highly parallel operations

29 IA-32 SIMD Development
MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX, Pentium II). SSE (Streaming SIMD Extension) was introduced with the Pentium III. SSE2 was introduced with the Pentium 4. SSE3, which adds 13 more instructions, was introduced with the Pentium 4 supporting Hyper-Threading Technology. SSE4 added 54 instructions.

30 MMX Data Types

31 MMX
Packed integer types allow operations to be applied to multiple integers at once.

Aliases to existing FP registers: the MMX registers MM0-MM7 occupy the low 64 bits of the x87 FP registers, with the remaining bits set to all ones (11...11), so the contents read as NaN or infinity when interpreted as a real number. MMX is hidden behind the FPU: MMX and the FPU cannot be used at the same time, and switching between them carries a big overhead.

33 MMX Instructions
57 MMX instructions are defined to perform parallel operations on multiple data elements packed into 64-bit data types:
- Basic arithmetic: add, subtract, multiply, arithmetic shift, multiply-add
- Comparison
- Conversion: pack and unpack
- Logical
- Shift
- Move: register-to-register
- Load/store: 64-bit and 32-bit
All instructions except data moves use MMX registers as operands.

34 Packed Add Word with wrap around
Each addition is independent; the rightmost element overflows and wraps around.
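Wrap-around packed addition behaves like this C sketch (the function name is mine; the semantics are those of a packed add of 16-bit words, as in PADDW): each lane is computed independently, and overflow simply wraps modulo 2^16.

```c
#include <stdint.h>

/* Wrap-around packed add on four 16-bit words: each lane is
   independent, and an overflowing lane wraps mod 2^16. */
void paddw_wrap(uint16_t *dst, const uint16_t *a, const uint16_t *b) {
    for (int i = 0; i < 4; i++)
        dst[i] = (uint16_t)(a[i] + b[i]);   /* cast wraps on overflow */
}
```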

35 Saturation
Saturation: if an addition overflows or underflows, the result is clamped to the largest or smallest representable value. This is important for pixel calculations, where it prevents a wrap-around add from causing a black pixel to suddenly turn white.
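A minimal C sketch of one saturating lane (the function name is mine; the clamping behavior matches an unsigned saturating byte add): sums above 255 clamp to 255 instead of wrapping to a small value.

```c
#include <stdint.h>

/* Unsigned saturating byte add: results above 255 clamp to 255
   instead of wrapping, so a bright pixel stays bright. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;      /* widen to avoid wrap */
    return (uint8_t)(sum > 255 ? 255 : sum);
}
```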

36 No Mode
There is no "saturation mode bit": a new mode bit would require a change to the operating system. Instead, separate instructions are used to generate wrap-around and saturating results.

37 Packed Add Word with unsigned saturation
Each addition is independent; the rightmost element saturates instead of wrapping.

38 Multiply-Accumulate
Multiply-accumulate operations are fundamental to many signal-processing algorithms: vector dot products, matrix multiplies, FIR and IIR filters, FFTs, DCTs, etc.

39 Packed Multiply-Add
Multiply four pairs of 16-bit words, generating four 32-bit products. Add the two products on the left for one result and the two products on the right for the other result.
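The packed multiply-add can be emulated in plain C (the function name is mine; the lane arithmetic matches the word-wise multiply-add described above): four signed 16-bit products, summed in adjacent pairs into two 32-bit results.

```c
#include <stdint.h>

/* Packed multiply-add: multiply four pairs of signed 16-bit words to
   32-bit products, then add adjacent products into two 32-bit sums. */
void pmaddwd(int32_t dst[2], const int16_t a[4], const int16_t b[4]) {
    dst[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];  /* left pair  */
    dst[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];  /* right pair */
}
```

This is exactly the inner step of a dot product, which is why multiply-accumulate kernels map onto it so well.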

40 Packed Parallel Compare
The result can be used as a mask to select elements from different inputs using logical operations, eliminating branches.

41 Conditional Select
The chroma keying example demonstrates how conditional selection using the MMX instruction set removes branch mispredictions, in addition to performing multiple selection operations in parallel. Text overlay on a picture/video background and sprite overlays in games are some of the other operations that would benefit from this technique.

42 Chroma Keying (figure: foreground on green screen + new background = composited image)

43 Chroma Keying (con't)
Take the pixels from the picture with the airplane on a green background. A compare instruction builds a mask for that data: a sequence of bytes that are all ones or all zeros. We now know which pixels are the unwanted background and which we want to keep.

44 Create Mask Assume pixels alternate green/not_green

45 Combine: !AND, AND, OR
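The mask-and-combine steps can be sketched in plain C (the function name and the GREEN byte value are mine, for illustration): a compare builds an all-ones/all-zeros mask per byte, and the combine is result = (background AND mask) OR (foreground AND NOT mask), the !AND / AND / OR sequence above.

```c
#include <stdint.h>

#define GREEN 0x20   /* hypothetical byte value marking the green screen */

/* Chroma-key combine: where the foreground byte is GREEN, take the
   background byte; elsewhere keep the foreground byte. */
void chroma_combine(uint8_t *out, const uint8_t *fg, const uint8_t *bg, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t mask = (fg[i] == GREEN) ? 0xFF : 0x00;  /* compare -> mask */
        out[i] = (uint8_t)((bg[i] & mask) | (fg[i] & ~mask)); /* AND/!AND/OR */
    }
}
```

An MMX implementation does the same thing eight bytes at a time with PCMPEQB, PAND, PANDN, and POR, with no branches.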

46 SSE
SSE introduced eight 128-bit data registers (called XMM registers).

47 SSE Programming Environment
The SSE programming environment comprises the XMM registers (XMM0-XMM7), the MMX registers (MM0-MM7), and the general-purpose registers EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP.

48 SSE Data Types
The SSE extensions introduced one new data type: the 128-bit packed single-precision floating-point data type. SSE2 introduced five more data types.

49 Inline Assembly Code Assembly language source code that is inserted directly into a HLL program. Compilers such as Microsoft Visual C++ and GCC have compiler-specific directives that identify inline ASM code. Simple to code because there are no external names, memory models, or naming conventions involved. Decidedly not portable because it is written for a single platform.

50 __asm directive in Microsoft Visual C++
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{
  __asm
  {
    mov esi, src;          //src pointer
    mov edi, dest;         //dest pointer
    mov ebx, size;         //ebx is our counter
    shr ebx, 7;            //divide by 128 (8 * 128-bit registers)

  loop_copy:
    movdqa xmm0, 0[ESI];   //move data from src to registers
    movdqa xmm1, 16[ESI];
    movdqa xmm2, 32[ESI];
    movdqa xmm3, 48[ESI];
    movdqa xmm4, 64[ESI];
    movdqa xmm5, 80[ESI];
    movdqa xmm6, 96[ESI];
    movdqa xmm7, 112[ESI];

    movntdq 0[EDI], xmm0;  //move data from registers to dest
    movntdq 16[EDI], xmm1;
    movntdq 32[EDI], xmm2;
    movntdq 48[EDI], xmm3;
    movntdq 64[EDI], xmm4;
    movntdq 80[EDI], xmm5;
    movntdq 96[EDI], xmm6;
    movntdq 112[EDI], xmm7;

    add esi, 128;          //advance both pointers
    add edi, 128;
    dec ebx;               //count down
    jnz loop_copy;         //loop until the counter hits zero

  loop_copy_end:
  }
}
The __asm directive marks the beginning of a block of assembly-language statements.

51 Intel MMX/SSE Intrinsics
Intrinsics are C/C++ functions and procedures that map to MMX/SSE instructions. With intrinsics, one can use these instructions without writing assembly code. In general, there is a one-to-one correspondence between MMX/SSE instructions and intrinsics. Naming follows _mm_<opcode>_<suffix>, where the suffix is, for example:
ps: packed single-precision
ss: scalar single-precision

52 Intrinsics
#include <xmmintrin.h>
__m128 a, b, c;
c = _mm_add_ps(a, b);
is equivalent to the scalar loop
float a[4], b[4], c[4];
for (int i = 0; i < 4; ++i)
    c[i] = a[i] + b[i];
Intrinsics compose like ordinary expressions:
// a = b * c + d / e;
__m128 a = _mm_add_ps(_mm_mul_ps(b, c), _mm_div_ps(d, e));

