SSE and SSE2 Jeremy Johnson Timothy A. Chagnon All images from Intel® 64 and IA-32 Architectures Software Developer's Manuals.

Slides:



Advertisements
Similar presentations
Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.
Advertisements

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.
1 Agenda AVX overview Proposed AVX ABI changes −For IA-32 −For x86-64 AVX and vectorizer infrastructure. Ongoing projects by Intel gcc team: −Stack alignment.
ECE291 Computer Engineering II Lecture 24 Josh Potts University of Illinois at Urbana- Champaign.
Comp Sci Floating Point Arithmetic 1 Ch. 10 Floating Point Unit.
Fall EE 333 Lillevik 333f06-l20 University of Portland School of Engineering Computer Organization Lecture 20 Pipelining: “bucket brigade” MIPS.
Computer Organization and Assembly Languages Yung-Yu Chuang
Computer Organization and Architecture
Computer Organization and Architecture
Pentium 4 and IA-32 ISA ELEC 5200/6200 Computer Architecture and Design, Fall 2006 Lectured by Dr. V. Agrawal Lectured by Dr. V. Agrawal Kyungseok Kim.
IA-32 Processor Architecture
S. Barua – CPSC 440 CHAPTER 2 INSTRUCTIONS: LANGUAGE OF THE COMPUTER Goals – To get familiar with.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
EECS 470 Cache Systems Lecture 13 Coverage: Chapter 5.
CS854 Pentium III group1 Instruction Set General Purpose Instruction X87 FPU Instruction SIMD Instruction MMX Instruction SSE Instruction System Instruction.
Intel Pentium 4 Processor Presented by Presented by Steve Kelley Steve Kelley Zhijian Lu Zhijian Lu.
Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.
CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.
CMPE 511 Computer Architecture Caner AKSOY CmpE Boğaziçi University December 2006 Intel ® Core 2 Duo Desktop Processor Architecture.
AMD Opteron - AMD64 Architecture Sean Downes. Description Released April 22, 2003 The AMD Opteron is a 64 bit microprocessor designed for use in server.
Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Lecture#14. Last Lecture Summary Memory Address, size What memory stores OS, Application programs, Data, Instructions Types of Memory Non Volatile and.
IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.
Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas Miriam Leeser Northeastern University Boston, MA.
Classifying GPR Machines TypeNumber of Operands Memory Operands Examples Register- Register 30 SPARC, MIPS, etc. Register- Memory 21 Intel 80x86, Motorola.
History of Microprocessor MPIntroductionData BusAddress Bus
® 1 March 1999 Optimizing 3D Pipeline using Streaming SIMD Extensions in C++ Ronen Zohar
Instructor: Justin Hsia
Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +
C.E. Goutis V.I.Kelefouras University of Patras Department of Electrical and Computer Engineering VLSI lab Date: 31/01/2014 Compilers for Embedded Systems.
December 2, 2015Single-Instruction Multiple Data (SIMD)1 Performance Optimization, cont. How do we fix performance problems?
Introduction to MMX, XMM, SSE and SSE2 Technology
November 22, 1999The University of Texas at Austin Native Signal Processing Ravi Bhargava Laboratory of Computer Architecture Electrical and Computer.
With a focus on floating point.  For floating point (i.e., real numbers), MASM supports:  real4  single precision; IEEE standard; analogous to float.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
Introdution to SSE or How to put your algorithms on steroids! Christian Kerl
The Alpha Thomas Daniels Other Dude Matt Ziegler.
Guest Lecturer: Alan Christopher
11/13/2012CS4230 CS4230 Parallel Programming Lecture 19: SIMD and Multimedia Extensions Mary Hall November 13, 2012.
Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.
® GDC’99 Streaming SIMD Extensions Overview Haim Barad Project Leader/Staff Engineer Media Team Haifa, Israel Intel Corporation March.
Instruction Sets. Instruction set It is a list of all instructions that a processor can execute. It is a list of all instructions that a processor can.
Introduction to Intel IA-32 and IA-64 Instruction Set Architectures.
C.E. Goutis V.I.Kelefouras University of Patras Department of Electrical and Computer Engineering VLSI lab Date: 20/11/2015 Compilers for Embedded Systems.
SIMD Programming CS 240A, Winter Flynn* Taxonomy, 1966 In 2013, SIMD and MIMD most common parallelism in architectures – usually both in same.
Computer Science 516 Intel x86 Overview. Intel x86 Family Eight-bit 8080, 8085 – 1970s 16-bit 8086 – was internally 16 bits, externally 8 bits.
SIMD Multimedia Extensions
Exploiting Parallelism
Vladimir Stojanovic & Nicholas Weaver
Basics Of X86 Architecture
Vector Processing => Multimedia
SIMD Programming CS 240A, 2017.
MMX technology for Pentium
MMX Multi Media eXtensions
Special Instructions for Graphics and Multi-Media
CS170 Computer Organization and Architecture I
ECE/CS 552: Instruction Sets – x86
Introduction to Intel IA-32 and IA-64 Instruction Set Architectures
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
EE 193: Parallel Computing
Optimizing program performance
Notes on Homework 1 CS267 Lecture 2 CS267 Lecture 2 1.
Other Processors Having learnt MIPS, we can learn other major processors. Not going to be able to cover everything; will pick on the interesting aspects.
CPU Structure CPU must:
Chapter 10 Instruction Sets: Characteristics and Functions
Presentation transcript:

SSE and SSE2 Jeremy Johnson Timothy A. Chagnon All images from Intel® 64 and IA-32 Architectures Software Developer's Manuals

Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extentions Successor to 64-bit MMX integer and AMD's 3DNow! extentions SSE Introduced in 1999 with the Pentium III 128-bit packed single precision floating point bit registers xmm0-xmm7, hold 4 floats each Additional 8 on x86-64 (xmm8-xmm15)‏ SSE2 Introduced with Pentium 4 Added double precision FP and several integer types SSE 3 & 4 More ops: DSP, permutations, fixed point, dot product...

Data Types Instructions end in D, S, I, Q etc. to designate type Conversion instructions CVTxx2xx SSE SSE2

Instruction Basics Packed vs. Scalar Ops Most operate on 2 args  OP xmm-dest, xmm/m128-src  In-place operation on dest. Instruction Varieties  MOV loads, stores  Arithmetic & Logical  Shuffle & Unpack  CVT conversion  Cache & Ordering Control

Arithmetic & Logical Instructions Can take a memory reference as 2 nd argument All put results into 1 st argument  CMP can be made to use EFLAGS Full instruction names are OP+TYPE...  ADDPD add packed double  ANDSS and scalar single  etc. MOVAPDxmm0, [eax] MOVAPD xmm1, [eax+16] MOVAPD xmm2, xmm0 ADDPD xmm2, xmm1 SUBPD xmm0, xmm1 MOVAPD [eax], xmm2 MOVAPD [eax+16], xmm0 ADD SUB MUL DIV SQRT MAX MIN CMP

Moving Data to/from Memory MOVAPD move aligned MOVUPD move unaligned  reg-reg, reg-mem, mem-reg Memory should be 16-byte aligned if possible  Starts at a 128-bit boundary in Virtual?/Physical? addressing  Special types and attributes are used in C/C++ MOVUPD takes much longer than MOVAPD  Intel's optimization manual says it's actually faster to do MOVSD xmm0, [mem] MOVSD xmm1, [mem+8] UNPCKLPD xmm0, xmm1

Shuffle & Unpack SHUFPD xmm0, xmm1, pattern UNPCKHPD xmm0, xmm1 UNPCKLPD xmm0, xmm1 Combine parts of 2 vectors In-Place Value(s) from 1 st argument alway go into low ½ of dest. Note that UNPCKs are just a shortcut of SHUF  But is it faster or slower?

C/C++ Intrinsics Both Intel's ICC and GCC have “intrinsic” functions to access the SSE instructions  More or less a 1:1 mapping of instructions Instead of in-place, they look like 2-in, 1-out functions  Does this indicate that penalties are incurred for unnecessary register copies/loads/stores?  Or are these optimized away?... It appears that way. MOV is split up into load, store, and set __m128d _mm_add_pd(__m128d a, __m128d b)‏ Adds the two DP FP values of a and b. r0 := a0 + b0 r1 := a1 + b1 __m128d _mm_load_pd(double const*dp)‏ (uses MOVAPD) Loads two DP FP values. The address p must be 16-byte aligned. r0 := p[0] r1 := p[1]

Example #include int main(void){ int i; __m128d a, b, c; double x0[2] __attribute__((aligned(16))) = {1.2, 3.5}; double x1[2] __attribute__((aligned(16))) = {-0.7, 2.6}; a = _mm_load_pd(x0); b = _mm_load_pd(x1); c = _mm_add_pd(a, b); _mm_store_pd(x0, c); for(i = 0; i < 2; i++)‏ printf("%.2f\n", x0[i]); return 0; }

Efficient Use Use 16-byte aligned memory Compiler driven automated vectorization is limited Use struct-of-arrays (SoA) instead of AoS (3d vectors)‏  i.e. block across multiple entities and do normal operations

To Investigate Denormals and underflow can cause penalties (1500? cycles)‏ Are intrinsics efficient? Do they use direct memory refs? SHUF vs UNPCK ? Cache utilization, ordering instructions... x87 FPU and SSE are disjoint and can be mixed... MMX registers can be used for shuffle or copy Intel Core architecture has more efficient SIMD than NetBurst Direct memory references Trace cache and reorder buffer effects on loop unrolling Non-temporal stores Prefetch

SHUF vs. UNPCK Intel Optimization Manual shows, sometimes UNPCK is faster  Probably because there's no decoding of immediate field gcc-4.x will convert a call to the shuffle intrinsic to an unpck instruction if it can Table data represents usage of register arguments; delays from direct memory references depend heavily on cache state. NetBurst Arch. (Pentium 4, Xeon)‏ Core Arch. (Core Duo, Pentium M)‏ DisplayFamily_DisplayModel represent variation across different processors. Family 0F is NetBurst Architecture, Family 06 is Intel Core Architecture Model for NetBurst in range 00H to 06H, data for 03H applies to 04H and 06H Model for Pentium M are 09H and 0DH, Core Solo/Duo is 0EH, Core is 0FH

References Wikipedia: Streaming SIMD Extensions  Intel® 64 and IA-32 Architectures Software Developer's Manuals  Intel® C++ Compiler Documentation  Apple Developer Connection – SSE Performance Programming 