Lecture 17. Vector Machine, and Intel MMX/SSEx Extensions


COSC3330 Computer Architecture, Lecture 17: Vector Machine and Intel MMX/SSEx Extensions. Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston.

Topics: Vector Machine; Intel MMX/SSEx Extensions.

Supercomputer. A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved. The CDC 6600 (1964) is regarded as the first supercomputer, and as the first RISC processor. Its main designer was Seymour Cray, the "father of supercomputing".

History of Supercomputers. In the 1970s-80s, "supercomputer" effectively meant "vector machine".

Vector Supercomputers. Epitomized by the Cray-1 (1976), one of the best-known and most successful supercomputers, installed at LANL for $8.8 million. Vector extension: vector registers and vector instructions. Implementation: highly pipelined functional units, an interleaved memory system, no data caches, no virtual memory. Programming the Cray-1: FORTRAN with an auto-vectorizing compiler!

Vector Processing. Vector processors have high-level operations that work on linear arrays of numbers: "vectors". A scalar add performs one operation (add r3, r1, r2), while a vector add performs N operations, one per element pair up to the vector length (add.vv v3, v1, v2).

SIMD. A SIMD (single instruction, multiple data) architecture performs the same operation on multiple data elements in parallel, e.g. PADDW MM0, MM1.

Vector Registers. The register file holds scalar registers r0-r15 and vector registers v0-v15, each vector register holding elements [0] through [VLRMAX-1]. The vector length register (VLR) gives the number of active elements.

Vector Arithmetic Instructions. ADDV v3, v1, v2 adds v1 and v2 element by element, writing elements [0] through [VLR-1] of v3.

Vector Load and Store Instructions. LV v1, r1, r2 loads vector register v1 from memory, starting at the base address in r1 and stepping by the stride in r2.

Vector Code Example.

# C code
for (i=0; i<64; i++)
    C[i] = A[i] + B[i];

# Scalar code
        LI     R4, 64
loop:   L.D    F0, 0(R1)
        L.D    F2, 0(R2)
        ADD.D  F4, F2, F0
        S.D    F4, 0(R3)
        DADDIU R1, 8
        DADDIU R2, 8
        DADDIU R3, 8
        DSUBIU R4, 1
        BNEZ   R4, loop

# Vector code
        LI     VLR, 64
        LV     V1, R1
        LV     V2, R2
        ADDV.D V3, V1, V2
        SV     V3, R3

VLIW: Very Long Instruction Word. Multiple operations are packed into one instruction, each in a fixed-function slot: two integer units (single-cycle latency), two load/store units (three-cycle latency), and two floating-point units (four-cycle latency). Constant operation latencies are specified. The architecture requires the compiler to guarantee parallelism within an instruction (no cross-operation RAW check) and no use of data before it is ready (no data interlocks).

Vector Instruction Set Advantages. Compact: one short instruction encodes N operations. Expressive: it tells hardware that these N operations are independent, use the same functional unit, access disjoint registers, access registers in the same pattern as previous instructions, and access a contiguous block of memory (unit-stride load/store). Scalable: the same code can run on more parallel pipelines (lanes).

Vector Arithmetic Execution. Use a deep pipeline (hence a fast clock) to execute the element operations. Control of the deep pipeline is simple because the elements in a vector are independent (no hazards!). Figure: a six-stage multiply pipeline computing V3 from V1 and V2.

Vector Instruction Execution. ADDV C, A, B can execute using one pipelined functional unit (one element per cycle) or using four pipelined functional units, each handling every fourth element (C[0], C[1], C[2], C[3] start in parallel, then C[4]-C[7], and so on).

Vector Memory System. Cray-1: 16 memory banks, 4-cycle bank busy time, 12-cycle latency. Bank busy time is the number of cycles between accesses to the same bank. The address generator takes a base and a stride from the vector registers and spreads element accesses across the banks.

Vector Unit Structure. The vector unit is organized into lanes: each lane holds a slice of the vector register file and a pipeline of each functional unit. With four lanes, lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., and so on, all connected to the memory subsystem.

Vector Instruction Parallelism. Execution of multiple vector instructions can overlap. Example: a machine with 32 elements per vector register and 8 lanes, with separate load, multiply, and add units, can complete 24 operations/cycle while issuing one short instruction per cycle.

Vector Chaining. The vector version of register bypassing, introduced with the Cray-1: results from one functional unit are forwarded element by element to a dependent instruction. Example: in the sequence LV v1; MULV v3, v1, v2; ADDV v5, v3, v4, the load chains into the multiply, which chains into the add.

Vector Chaining Advantage. Without chaining, a dependent instruction must wait for the last element of the result to be written before starting. With chaining, the dependent instruction can start as soon as the first result element appears.

Automatic Code Vectorization. Consider: for (i=0; i < N; i++) C[i] = A[i] + B[i];. Scalar sequential code performs load, add, store for iteration 1, then for iteration 2, and so on; vectorized code performs the loads, the add, and the store across all iterations as vector instructions. Vectorization is a massive compile-time reordering of operation sequencing and requires extensive loop dependence analysis.

Vector Scatter/Gather. We want to vectorize loops with indirect accesses: for (i=0; i<N; i++) A[i] = B[i] + C[D[i]];. The indexed load instruction (gather) makes this possible:

        LV     vD, rD        # Load indices in D vector
        LVI    vC, rC, vD    # Load indirect from rC base
        LV     vB, rB        # Load B vector
        ADDV.D vA, vB, vC    # Do add
        SV     vA, rA        # Store result

Vector Scatter/Gather. Scatter example: for (i=0; i<N; i++) A[B[i]]++;. Is the following a correct translation?

        LV   vB, rB       # Load indices in B vector
        LVI  vA, rA, vB   # Gather initial A values
        ADDV vA, vA, 1    # Increment
        SVI  vA, rA, vB   # Scatter incremented values

Vector Conditional Execution. Problem: we want to vectorize loops with conditional code: for (i=0; i<N; i++) if (A[i]>0) A[i] = B[i];. Solution: add vector mask (or flag) registers with one bit per element, and maskable vector instructions: a vector operation becomes a NOP at elements where the mask bit is clear. Code example:

        CVM               # Turn on all elements
        LV      vA, rA    # Load entire A vector
        SGTVS.D vA, F0    # Set bits in mask register where A[i] > 0
        LV      vA, rB    # Load B vector into A under mask
        SV      vA, rA    # Store A back to memory under mask

Masked Vector Instructions. Simple implementation: execute all N operations, but turn off result writeback according to the mask. Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits.

Intel MMX/SSE/SSE2 Extension. Why MMX? Accelerate multimedia and communications applications; maintain full compatibility with existing operating systems and applications; exploit the inherent parallelism in multimedia and communication algorithms. MMX includes new instructions and data types to improve performance.

First Step: Examine Code. Intel examined a wide range of applications: graphics, MPEG video, music synthesis, speech compression, speech recognition, image processing, games, and video conferencing, and identified and analyzed the most compute-intensive routines.

Common Characteristics. Small integer data types (e.g., 8-bit pixels, 16-bit audio samples); small, highly repetitive loops; frequent multiply-and-accumulate; compute-intensive algorithms; highly parallel operations.

IA-32 SIMD Development. MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX, then Pentium II). SSE (Streaming SIMD Extension) was introduced with the Pentium III. SSE2 was introduced with the Pentium 4. SSE3, adding 13 more instructions, was introduced with the Pentium 4 supporting hyper-threading technology. SSE4, adding 54 instructions, was introduced in 2006.

MMX Data Types (figure): the four packed formats of a 64-bit MMX register: packed bytes (8 x 8-bit), packed words (4 x 16-bit), packed doublewords (2 x 32-bit), and quadword (1 x 64-bit).

MMX packed integer types allow operations to be applied to multiple integers at once.

MMX Registers: Aliases to Existing FP Registers. The MMX registers MM0-MM7 occupy bits 63-0 of the 80-bit x87 FP registers; bits 79-64 are set to all ones, so an MMX value reads as a NaN or infinity if interpreted as a floating-point number. MMX is hidden behind the FPU: MMX and the FPU cannot be used at the same time, and switching between them incurs a big overhead.

MMX Instructions. 57 MMX instructions are defined to perform parallel operations on multiple data elements packed into 64-bit data types: basic arithmetic (add, subtract, multiply, arithmetic shift, and multiply-add); comparison; conversion (pack and unpack); logical; shift; move (register-to-register); and load/store (64-bit and 32-bit). All instructions except the data moves use MMX registers as operands.

Packed Add Word with wrap-around. Each addition is independent; if the rightmost addition overflows, it wraps around.

Saturation. If an addition results in overflow or underflow, the result is clamped to the largest or smallest representable value. This is important for pixel calculations, where it prevents a wrap-around add from suddenly turning a black pixel white.

No Mode. There is no "saturation mode bit": a new mode bit would require a change to the operating system. Instead, separate instructions are used to generate wrap-around and saturating results.

Packed Add Word with unsigned saturation. Each addition is independent; the rightmost addition saturates instead of wrapping.

Multiply-Accumulate. Multiply-accumulate operations are fundamental to many signal processing algorithms, such as vector dot products, matrix multiplies, FIR and IIR filters, FFTs, and DCTs.

Packed Multiply-Add. Multiply the packed 16-bit words, generating four 32-bit products; add the two products on the left for one result and the two products on the right for the other result.

Packed Parallel Compare. The result can be used as a mask to select elements from different inputs using logical operations, eliminating branches.

Conditional Select. The chroma keying example demonstrates how conditional selection using the MMX instruction set removes branch mispredictions, in addition to performing multiple selection operations in parallel. Text overlay on a picture/video background and sprite overlays in games are other operations that benefit from this technique.

Chroma Keying (figure: foreground on a green screen + new background = composited image).

Chroma Keying (cont'd). Take pixels from the picture with the airplane on a green background. A compare instruction builds a mask for that data: a sequence of bytes that are all ones or all zeros. We now know what is the unwanted background and what we want to keep.

Create Mask Assume pixels alternate green/not_green

Combine: AND-NOT (PANDN), AND, OR.

SSE. SSE introduced eight 128-bit data registers, called XMM registers.

SSE Programming Environment. XMM0 through XMM7 (128-bit SSE registers), MM0 through MM7 (64-bit MMX registers), and the general-purpose registers EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP.

SSE Data Types. The SSE extensions introduced one new data type, the 128-bit packed single-precision floating-point data type. SSE2 introduced five more data types.

Inline Assembly Code. Assembly language source code that is inserted directly into an HLL program. Compilers such as Microsoft Visual C++ and GCC have compiler-specific directives that identify inline ASM code. Simple to code because there are no external names, memory models, or naming conventions involved; decidedly not portable because it is written for a single platform.

The __asm directive in Microsoft Visual C++ marks the beginning of a block of assembly language statements:

void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size_t)
{
  __asm
  {
    mov esi, src;     //src pointer
    mov edi, dest;    //dest pointer
    mov ebx, size_t;  //ebx is our counter
    shr ebx, 7;       //divide by 128 (8 * 128-bit registers)
  loop_copy:
    ...
    movdqa xmm0, 0[ESI];   //move data from src to registers
    movdqa xmm1, 16[ESI];
    movdqa xmm2, 32[ESI];
    movntdq 0[EDI], xmm0;  //move data from registers to dest
    movntdq 16[EDI], xmm1;
    movntdq 32[EDI], xmm2;
    jnz loop_copy;    //loop please
  loop_copy_end:
  }
}

Intel MMX/SSE Intrinsics. Intrinsics are C/C++ functions and procedures that map to MMX/SSE instructions; with intrinsics, one can use these instructions without writing assembly. In general, there is a one-to-one correspondence between MMX/SSE instructions and intrinsics. Naming: _mm_<opcode>_<suffix>, where the suffix is, e.g., ps (packed single-precision) or ss (scalar single-precision).

Intrinsics.

#include <xmmintrin.h>

__m128 a, b, c;
c = _mm_add_ps(a, b);

// Equivalent scalar code:
float a[4], b[4], c[4];
for (int i = 0; i < 4; ++i)
    c[i] = a[i] + b[i];

// a = b * c + d / e;
__m128 a = _mm_add_ps(_mm_mul_ps(b, c), _mm_div_ps(d, e));