CS 201 Advanced Topics: SIMD, x86-64, ARM
Vector instructions (MMX/SSE/AVX)
Background: IA32 Floating Point
What does this have to do with SIMD?
- Floating Point Unit (x87 FPU)
  - Hardware to add, multiply, and divide IEEE floating point numbers
  - 8 80-bit registers organized as a stack (st0-st7)
  - Operands are pushed onto the stack; operations can pop results off into memory
- History
  - 8086 generation: IEEE FP implemented in the separate 8087 FPU (floating point unit)
  - 486: merged the FPU and integer unit onto one chip
- Block diagram: instruction decoder and sequencer feeding the integer unit and FPU, both connected to memory
FPU Data Register Stack
- FPU register format (extended precision): sign bit 79, 15-bit exponent (bits 78-64), 64-bit fraction (bits 63-0)
- FPU registers
  - 8 registers, logically forming a shallow stack that grows down
  - Top called %st(0), then %st(1), %st(2), %st(3), ...
  - When you push too many values, the bottom values disappear
Simplified FPU operation
- "load" instruction: pushes a number onto the stack
- "storep" instruction: pops the top element from the stack and stores it in memory
- Unary operation: "neg" = pop top element, negate it, push result onto stack
- Binary operations: "addp", "multp" = pop top two elements, perform operation, push result onto stack
- Stack operation similar to Reverse Polish Notation: "a b +" = push a, push b, add (pop a & b, add, push result)
Example calculation: x = (a-b)/(-b+c)
    load c
    load b
    neg
    addp
    load a
    load b
    subp
    divp
    storep x
FPU instructions
- Large number of floating point instructions and formats
  - ~50 basic instruction types
  - load (fld*), store (fst*), add (fadd), multiply (fmul), sin (fsin), cos (fcos), tan (ftan), etc.
- Sample instructions:
    Instruction    Effect                          Description
    fldz           push 0.0                        Load zero
    flds Addr      push M[Addr]                    Load single precision real
    fmuls Addr     %st(0) <- %st(0)*M[Addr]        Multiply
    faddp          %st(1) <- %st(0)+%st(1); pop    Add and pop; after the pop, %st(0) has the result
FPU instruction mnemonics
- Precision
  - "s": single precision
  - "l": double precision
- Operand order
  - Default: Op1 <op> Op2
  - "r": reverse operand order (i.e. Op2 <op> Op1)
- Stack operation
  - "p": pop a single value from the stack upon completion
Floating Point Code Example
Compute inner product of two vectors
- Single precision arithmetic
- Common computation

float ipf(float x[], float y[], int n)
{
    int i;
    float result = 0.0;
    for (i = 0; i < n; i++) {
        result += x[i] * y[i];
    }
    return result;
}

        pushl %ebp              # setup
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx       # %ebx=&x
        movl 12(%ebp),%ecx      # %ecx=&y
        movl 16(%ebp),%edx      # %edx=n
        fldz                    # push +0.0
        xorl %eax,%eax          # i=0
        cmpl %edx,%eax          # if i>=n done
        jge .L3
.L5:
        flds (%ebx,%eax,4)      # push x[i]
        fmuls (%ecx,%eax,4)     # st(0) *= y[i]
        faddp                   # st(1) += st(0); pop
        incl %eax               # i++
        cmpl %edx,%eax          # if i<n repeat
        jl .L5
.L3:
        movl -4(%ebp),%ebx      # finish
        movl %ebp, %esp
        popl %ebp
        ret                     # st(0) = result
Inner Product Stack Trace
Initialization
  1. fldz                  stack: %st(0)=0.0
Iteration 0
  2. flds (%ebx,%eax,4)    stack: %st(1)=0.0, %st(0)=x[0]
  3. fmuls (%ecx,%eax,4)   stack: %st(1)=0.0, %st(0)=x[0]*y[0]
  4. faddp                 stack: %st(0)=0.0+x[0]*y[0]
Iteration 1
  5. flds (%ebx,%eax,4)    stack: %st(1)=x[0]*y[0], %st(0)=x[1]
  6. fmuls (%ecx,%eax,4)   stack: %st(1)=x[0]*y[0], %st(0)=x[1]*y[1]
  7. faddp                 stack: %st(0)=x[0]*y[0]+x[1]*y[1]
Serial, sequential operation
Motivation for SIMD
- Multimedia, graphics, scientific, and security applications
  - Require a single operation across large amounts of data
  - Frame differencing for video encoding
  - Image fade-in/fade-out
  - Sprite overlay in games
  - Matrix computations
  - Encryption/decryption
- Algorithm characteristics
  - Access data in a regular pattern
  - Operate on short data types (8-bit, 16-bit, 32-bit)
  - Have an operating paradigm in which data streams through fixed processing stages (data-flow operation)
Natural fit for SIMD instructions
Single Instruction, Multiple Data
- Also known as vector instructions
- Before SIMD: one instruction per data location
- With SIMD: one instruction over multiple sequential data locations
  - Execution units must support "wide" parallel execution
- Examples in many processors
  - Intel x86: MMX, SSE, AVX
  - AMD: 3DNow!
Example
- Scalar: three separate multiply-accumulates
    R = R + XR * 1.08327
    G = G + XG * 1.89234
    B = B + XB * 1.29835
- The same data laid out in one array:
    R = R + X[i+0]
    G = G + X[i+1]
    B = B + X[i+2]
- SIMD: one vector operation updates all three at once
    [R G B] = [R G B] + X[i:i+2]
Example
- Scalar loop: one element per iteration
    for (i = 0; i < 64; i += 1)
        A[i+0] = A[i+0] + B[i+0];
- Vector loop: four elements per iteration
    for (i = 0; i < 64; i += 4)
        A[i:i+3] = A[i:i+3] + B[i:i+3];
SIMD in x86
- MMX (MultiMedia eXtensions): Pentium, Pentium II
- SSE (Streaming SIMD Extensions) (1999): Pentium III
- SSE2 (2000), SSE3 (2004): Pentium 4
- SSSE3 (2006), SSE4 (2007): Intel Core
- AVX (2011): Intel Sandy Bridge, Ivy Bridge
General idea
- SIMD (single-instruction, multiple data) vector instructions
- New data types, registers, operations
- Parallel operation on small (length 2-8) vectors of integers or floats
- Example: a "4-way" add performs four element-wise additions with one instruction
MMX (MultiMedia eXtensions)
MMX re-uses the FPU registers for SIMD execution of integer operations
- Aliases the FPU registers st0-st7 as MM0-MM7
  - Treated as 8 64-bit data registers, randomly accessible
- Registers are partitioned based on the data type of the vector
  - A single operation is applied in parallel on the individual parts
  - 8 byte additions (PADDB)
  - 4 short or word additions (PADDW)
  - 2 int or dword additions (PADDD)
  - How many different partitions are there for a vectored add?
- Why not new registers?
  - Wanted to avoid adding CPU state, so the change does not impact context switching (the OS does not need to know about MMX)
  - Drawback: can't use the FPU and MMX at the same time
SSE (Streaming SIMD Extensions)
Larger, independent registers
- MMX doesn't allow use of the FPU and SIMD simultaneously
- 8 128-bit data registers separate from the FPU
  - New hardware registers (XMM0-XMM7)
  - New status register for flags (MXCSR)
- Vectored floating point supported
  - MMX only supported vectored integer operations
  - SSE adds support for vectored floating point operations: 4 single precision floats
- Streaming support
  - Prefetching and cacheability control in loading/storing operands
- Additional integer operations for permutations
  - Shuffling, interleaving
SSE2 adds more data types and instructions
- Vectored double-precision floating point operations
  - 2 double precision floats
- Full support for vectored integer types over the 128-bit XMM registers
  - 16 byte vectors
  - 8 word vectors
  - 4 double word vectors
  - 2 quad word vectors
SSE3
- Horizontal vector operations: operations within a vector (e.g. min, max)
- Speed up DSP and 3D operations
- Complex arithmetic
- All x86-64 chips have SSE3
SSE4
- Video encoding accelerators
  - Sum of absolute differences (frame differencing)
  - Horizontal minimum search (motion estimation)
  - Conditional copying
- Graphics building blocks
  - Dot product
  - 32-bit vector integer operations on 128-bit registers
  - Dword multiplies
  - Vector rounding
Feature summary
- Integer vectors, 64-bit registers (MMX)
- Single-precision vectors (SSE)
- Double-precision vectors (SSE2)
- Integer vectors, 128-bit registers (SSE2)
- Horizontal arithmetic within a register (SSE3/SSSE3)
- Video encoding accelerators (H.264) (SSE4)
- Graphics building blocks (SSE4)
Intel Architectures (Focus Floating Point)
    Processors (over time)       Architecture      Features
    8086, 286                    x86-16
    386, 486, Pentium            x86-32
    Pentium MMX                  x86-32            MMX
    Pentium III                  x86-32            SSE (4-way single precision fp)
    Pentium 4                    x86-32            SSE2 (2-way double precision fp)
    Pentium 4E                   x86-32            SSE3
    Pentium 4F, Core 2 Duo       x86-64 / em64t    SSE4
SSE3 Registers
- 128-bit registers %xmm0-%xmm15, all caller saved
- %xmm0-%xmm7 pass floating point arguments #1-#8
- %xmm0 holds the floating point return value
SSE3 Registers: different data types and associated instructions
- Integer vectors: 16-way byte, 8-way short, 4-way int
- Floating point vectors: 4-way single (float), 2-way double
- Floating point scalars: a single or double in the low (LSB) slot of the 128-bit register
SSE3 Instruction Names
                        packed (vector)    single slot (scalar)
    single precision    addps              addss
    double precision    addpd              addsd
SSE3 Instructions: Examples
- Single precision 4-way vector add: addps %xmm0, %xmm1
  - Each of the four lanes of %xmm1 becomes %xmm0 lane + %xmm1 lane
- Single precision scalar add: addss %xmm0, %xmm1
  - Only the low lane of %xmm1 becomes %xmm0 low lane + %xmm1 low lane
SSE3 Basic Instructions
- Moves
    Single    Double    Effect
    movss     movsd     D <- S
  - Usual operand forms: reg -> reg, reg -> mem, mem -> reg
  - Packed versions load a vector from memory
- Arithmetic
    Single    Double    Effect
    addss     addsd     D <- D + S
    subss     subsd     D <- D - S
    mulss     mulsd     D <- D x S
    divss     divsd     D <- D / S
    maxss     maxsd     D <- max(D,S)
    minss     minsd     D <- min(D,S)
    sqrtss    sqrtsd    D <- sqrt(S)
x86-64 FP Code Example
- Compute inner product of two vectors
  - Single precision arithmetic
  - Uses SSE3 instructions

float ipf(float x[], float y[], int n)
{
    int i;
    float result = 0.0;
    for (i = 0; i < n; i++)
        result += x[i]*y[i];
    return result;
}

ipf:
        xorps %xmm1, %xmm1          # result = 0.0
        xorl %ecx, %ecx             # i = 0
        jmp .L8                     # goto middle
.L10:                               # loop:
        movslq %ecx,%rax            # icpy = i
        incl %ecx                   # i++
        movss (%rsi,%rax,4), %xmm0  # t = y[icpy]
        mulss (%rdi,%rax,4), %xmm0  # t *= x[icpy]
        addss %xmm0, %xmm1          # result += t
.L8:                                # middle:
        cmpl %edx, %ecx             # i:n
        jl .L10                     # if < goto loop
        movaps %xmm1, %xmm0         # return result
        ret
SSE3 Conversion Instructions
- Conversions use the same operand forms as moves
    Instruction    Description
    cvtss2sd       single -> double
    cvtsd2ss       double -> single
    cvtsi2ss       int -> single
    cvtsi2sd       int -> double
    cvtsi2ssq      quad int -> single
    cvtsi2sdq      quad int -> double
    cvttss2si      single -> int (truncation)
    cvttsd2si      double -> int (truncation)
    cvttss2siq     single -> quad int (truncation)
    cvttsd2siq     double -> quad int (truncation)
Detecting if it is supported
    mov eax, 1
    cpuid                  ; supported since Pentium
    test edx, 800000h      ; 800000h  (bit 23) MMX
                           ; 2000000h (bit 25) SSE
                           ; 4000000h (bit 26) SSE2
    jnz HasMMX
Detecting if it is supported
#include <stdio.h>
#include <string.h>

#define cpuid(func,ax,bx,cx,dx)\
    __asm__ __volatile__ ("cpuid":\
    "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));

int main(int argc, char* argv[])
{
    int a, b, c, d, i;
    char x[13];
    int* q;

    for (i = 0; i < 13; i++) x[i] = 0;
    q = (int*) x;

    /* 12-char string returned in 3 registers (EBX, EDX, ECX) */
    cpuid(0, a, q[0], q[2], q[1]);
    printf("str: %s\n", x);

    /* Bits returned in all 4 registers */
    cpuid(1, a, b, c, d);
    printf("a: %08x, b: %08x, c: %08x, d: %08x\n", a, b, c, d);
    printf("  bh * 8 = cache line size\n");
    printf("  bit 0 of c = SSE3 supported\n");
    printf("  bit 25 of c = AES supported\n");
    printf("  bit 0 of d = On-board FPU\n");
    printf("  bit 4 of d = Time-stamp counter\n");
    printf("  bit 26 of d = SSE2 supported\n");
    printf("  bit 25 of d = SSE supported\n");
    printf("  bit 23 of d = MMX supported\n");
    return 0;
}
Detecting if it is supported
mashimaro <~> 12:43PM % cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
stepping        : 11
cpu MHz         :
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        :
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
AVX2
- Intel codename Haswell (2013), Broadwell (2014)
- Expansion of most integer AVX instructions to 256 bits
- "Gather" support to load data from non-contiguous memory locations
- 3-operand FMA (fused multiply-add) operations at full precision (a+b*c)
  - Dot products, matrix multiplications, polynomial evaluations via Horner's rule (see the DEC VAX POLY instruction, 1977)
  - Speeds up software-based division and square root operations (so dedicated hardware for these operations can be removed)
Programming SIMD
- Store data contiguously (i.e. in an array)
- Define the total size of the vector in bytes
  - 8 bytes (64 bits) for MMX
  - 16 bytes (128 bits) for SSE2 and beyond
- Define the type of the vector elements; for 128-bit registers:
  - 2 double
  - 4 float
  - 4 int
  - 8 short
  - 16 char
- SIMD instructions are based on each vector type
Example: SIMD via macros/libraries
- Rely on compiler macros or library calls for SSE acceleration
  - Macros embed in-line assembly into the program
  - Calls go into library functions compiled with SSE
- Adding two 128-bit vectors containing 4 floats:

// Microsoft-specific compiler intrinsic function
__m128 _mm_add_ps(__m128 a, __m128 b);

__m128 a, b, c;

// intrinsic function
c = _mm_add_ps(a, b);

    a:  1  2  3  4
    b:  2  4  6  8
        +  +  +  +
    c:  3  6  9  12
Example: SIMD in C
- Adding two vectors (SSE)
- Must pass the compiler hints about your vector
  - Size of the vector in bytes (i.e. vector_size(16))
  - Type of each vector element (i.e. float)
- Compile with: gcc -msse2

// vector of four single floats
typedef float v4sf __attribute__ ((vector_size(16)));

union f4vector {
    v4sf v;
    float f[4];
};

void add_vector()
{
    union f4vector a, b, c;

    a.f[0] = 1; a.f[1] = 2; a.f[2] = 3; a.f[3] = 4;
    b.f[0] = 5; b.f[1] = 6; b.f[2] = 7; b.f[3] = 8;

    c.v = a.v + b.v;
}
Examples: SSE in C Measuring performance improvement using rdtsc
Vector Instructions
- Starting with version 4.1.1, gcc can autovectorize to some extent
  - -O3 or -ftree-vectorize
  - No speed-up guaranteed; still very limited
  - icc is currently much better
- For highest performance, vectorize yourself using intrinsics
  - Intrinsics = C interface to vector instructions
AES
- AES-NI announced 2008
  - http://software.intel.com/file/24917
  - Added to Intel Westmere processors and beyond (2010)
  - Separate from MMX/SSE/AVX
- AESENC/AESDEC performs one round of an AES encryption/decryption flow
  - One single-byte substitution step, one row-wise permutation step, one column-wise mixing step, addition of the round key (order depends on whether one is encrypting or decrypting)
  - 10 rounds per block for 128-bit keys, 12 rounds per block for 192-bit keys, 14 rounds per block for 256-bit keys
- Speed up from 28 cycles per byte to 3.5 cycles per byte
- Software support from security vendors is widespread
x86-64
x86-64
- History
  - 64-bit version of the x86 architecture
  - Developed by AMD in 2000; first processor released in 2003
  - Adopted by Intel in 2004
- Features
  - 64-bit registers and instructions
  - Additional integer registers
  - Adoption and extension of Intel's SSE
  - No-execute bit
  - Conditional move instruction (avoiding branches)
64-bit registers
- From IA-32
  - %ah/%al: 8 bits
  - %ax: 16 bits
  - %eax: 32 bits
- Now
  - %rax: 64 bits
- Each smaller register is the low part of the next larger one: %al/%ah within %ax (bits 15-0), %ax within %eax (bits 31-0), %eax within %rax (bits 63-0)
More integer registers
Denoted:
- %rXb: 8 bits
- %rXw: 16 bits
- %rXd: 32 bits
- %rX: 64 bits
where X is from 8 to 15
Within gdb: 'info registers'
x86-64 Integer Registers
- Twice the number of registers, accessible as 8, 16, 32, or 64 bits
    %rax (%eax)    %r8  (%r8d)
    %rbx (%ebx)    %r9  (%r9d)
    %rcx (%ecx)    %r10 (%r10d)
    %rdx (%edx)    %r11 (%r11d)
    %rsi (%esi)    %r12 (%r12d)
    %rdi (%edi)    %r13 (%r13d)
    %rsp (%esp)    %r14 (%r14d)
    %rbp (%ebp)    %r15 (%r15d)
More vector registers
- XMM0-XMM7: the 128-bit SSE registers available prior to x86-64
- XMM8-XMM15: additional 128-bit registers
64-bit instructions
- All 32-bit instructions have quad-word equivalents
- Use suffix 'q' to denote:
    movq $0x4,%rax
    addq %rcx,%rax
- Exception for stack operations
  - pop, push, call, ret, enter, leave are implicitly 64-bit
  - 32-bit versions are not valid
- 32-bit results are zero-extended to 64 bits
Modified calling convention
- Previously
  - Function parameters pushed onto the stack
  - Frame pointer management and update
  - A lot of memory operations and overhead!
- x86-64
  - Use registers to pass function parameters
    - %rdi, %rsi, %rdx, %rcx, %r8, %r9 used for argument build
    - %xmm0-%xmm7 for floating point arguments
    - Use the stack if more than 6 parameters
  - Avoid frame management when possible
    - Simple functions do not incur frame management overhead
    - All references to the stack frame via the stack pointer; eliminates the need to update %ebp/%rbp
  - Kernel interface also uses registers for parameters
    - %rdi, %rsi, %rdx, %r10, %r8, %r9
  - Callee saved registers: %rbp, %rbx, %r12 through %r15
x86-64 Integer Registers
    %rax   Return value       %r8    Argument #5
    %rbx   Callee saved       %r9    Argument #6
    %rcx   Argument #4        %r10   Callee saved
    %rdx   Argument #3        %r11   Used for linking
    %rsi   Argument #2        %r12   Callee saved
    %rdi   Argument #1        %r13   Callee saved
    %rsp   Stack pointer      %r14   Callee saved
    %rbp   Callee saved       %r15   Callee saved
x86-64 Long Swap

void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

    movq (%rdi), %rdx
    movq (%rsi), %rax
    movq %rax, (%rdi)
    movq %rdx, (%rsi)
    ret

- Operands passed in registers
  - First (xp) in %rdi, second (yp) in %rsi
  - 64-bit pointers
- No stack operations required (except ret)
- Avoiding the stack: can hold all local information in registers
x86-64 Locals in the Red Zone

/* Swap, using local array */
void swap_a(long *xp, long *yp)
{
    volatile long loc[2];
    loc[0] = *xp;
    loc[1] = *yp;
    *xp = loc[1];
    *yp = loc[0];
}

swap_a:
    movq (%rdi), %rax
    movq %rax, -24(%rsp)
    movq (%rsi), %rax
    movq %rax, -16(%rsp)
    movq -16(%rsp), %rax
    movq %rax, (%rdi)
    movq -24(%rsp), %rax
    movq %rax, (%rsi)
    ret

- Avoiding a stack pointer change
  - The compiler manages the stack frame without changing %rsp
  - Locals are allocated in a window beyond the stack pointer:
        rtn ptr    <- %rsp
        unused        -8
        loc[1]        -16
        loc[0]        -24
Interesting Features of the Stack Frame
- Allocate the entire frame at once
  - Done by decrementing the stack pointer
  - All stack accesses can be relative to %rsp
  - Allocation can be delayed, since it is safe to temporarily use the red zone
- Simple deallocation
  - Increment the stack pointer
- No base/frame pointer needed
x86-64 function calls via jump

long scount = 0;

/* Swap a[i] & a[i+1] */
void swap_ele(long a[], int i)
{
    swap(&a[i], &a[i+1]);
}

swap_ele:
    movslq %esi,%rsi          # Sign extend i
    leaq (%rdi,%rsi,8), %rdi  # &a[i]
    leaq 8(%rdi), %rsi        # &a[i+1]
    jmp swap                  # swap()

- When swap executes ret, it will return from swap_ele
- Possible since swap is a "tail call" (no instructions afterwards)
x86-64 Procedure Summary
- Heavy use of registers
  - Parameter passing
  - More temporaries, since there are more registers
- Minimal use of stack
  - Sometimes none
  - Allocate/deallocate the entire block at once
- Many tricky optimizations
  - What kind of stack frame to use
  - Calling with jump
  - Various allocation techniques
- Turning 64-bit code generation on/off:
    $ gcc -m64 code.c -o code
    $ gcc -m32 code.c -o code
ARM
ARM history
- Acorn RISC Machine (Acorn Computers, UK)
  - Design initiated 1983, first silicon 1985
  - 32-bit reduced instruction set machine inspired by Berkeley RISC (Patterson)
- Licensing model allows for custom designs (contrast to x86)
  - ARM does not produce its own chips; companies customize the base CPU for their products
  - P.A. Semi (fabless SoC startup acquired by Apple for its A4 design that powers the iPhone/iPad)
  - ARM estimated to make $0.11 on each chip (royalties + license)
- Runs 98% of all mobile phones (2005)
- Per-watt performance currently better than x86
  - Fewer "legacy" instructions to implement
ARM architecture: RISC features
- Fewer instructions
  - Complex instructions handled via multiple simpler ones
  - Results in a smaller execution unit
- Only loads/stores to and from memory
- Uniform-size instructions
  - Less decoding logic
  - 16-bit in Thumb mode to increase code density
ARM architecture: ALU features
- Conditional execution built into many instructions
  - Fewer branches, so less power lost to stalled pipelines
  - No need for branch prediction logic
- Operand bit-shifts supported in certain instructions
  - Built-in barrel shifter in the ALU: bit shifting plus an ALU operation in one instruction
- Support for 3-operand instructions: <R> = <Op1> OP <Op2>
ARM architecture: control state features
- Shadow registers (pre v7)
  - Allow efficient interrupt processing (no need to save registers onto the stack)
- Link register
  - Stores the return address for leaf functions (no stack operation needed)
ARM architecture: advanced features
- SIMD (NEON) to compete with x86 at the high end
  - mp3, AES, SHA support
- Hardware virtualization
  - Hypervisor mode
- Jazelle DBX (Direct Bytecode eXecution)
  - Native execution of Java bytecode
- Security
  - No-execute page protection (return2libc attacks still possible)
  - TrustZone: support for trusted execution via hardware-based access control and context management (e.g. isolate DRM processing)
ARM vs. x86: key architectural differences
- CISC vs. RISC
  - Legacy instructions impact per-watt performance
  - Atom (a stripped-down x86 core) was once a candidate for the iPad, until an Apple VP threatened to quit over the choice
- State pushed onto the stack vs. swapped from shadow registers
- Conditional execution via branches
  - x86 later added conditional moves
- Bit shifting via separate, explicit instructions
- Memory locations usable as ALU operands
- Mostly 2-operand instructions (<D> = <D> OP <S>)
ARM vs. x86: key differences
- Intel is the only producer of x86 chips and designs
  - No SoC customization (everyone gets the same hardware)
  - Must wait for Intel to give you what you want
  - ARM allows Apple to differentiate itself
- Intel and ARM
  - XScale: Intel's version of ARM, sold to Marvell in 2006
- Speculation
  - Leakage current will eventually dominate power consumption (versus switching current)
  - Intel could use its process advantage to make RISC/CISC moot: make the process advantage bigger than the custom-design + RISC advantage (and avoid wasting money on a license)
  - Latest attempt: Medfield (2012)
Extra
Example: SIMD in assembly
- Add a constant to a vector (MMX)

// Microsoft Macro Assembler format (MASM)
char d[]={5, 5, 5, 5, 5, 5, 5, 5};   // 8 bytes
char clr[]={65,66,68,...,87,88};     // 24 bytes

__asm{
    movq mm1, d          // load constant into mm1 reg
    mov cx, 3            // initialize loop counter
    mov esi, 0           // set index to 0
L1: movq mm0, clr[esi]   // load 8 bytes into mm0 reg
    paddb mm0, mm1       // perform vector addition
    movq clr[esi], mm0   // store 8 bytes of result
    add esi, 8           // update index
    loop L1              // loop macro (on cx)
    emms                 // clear MMX register state
}