Streaming SIMD Extension (SSE)

SIMD architectures
- A data-parallel architecture: the same instruction is applied to many data elements, which saves control logic.
- A related design is the vector architecture.
- SIMD and vector architectures offer high performance for vector operations.

Vector operations
- Vector addition Z = X + Y: for (i = 0; i < n; i++) z[i] = x[i] + y[i];
- Vector scaling Y = a * X: for (i = 0; i < n; i++) y[i] = a * x[i];
- Dot product: for (i = 0; i < n; i++) r += x[i] * y[i];

SISD and SIMD vector operations
- C = A + B: for (i = 0; i < n; i++) c[i] = a[i] + b[i];
- SISD: one addition per instruction, stepping through the elements sequentially.
- SIMD: several additions per instruction, e.g. four packed single-precision floats added in one step.

x86 architecture SIMD support
- Both AMD's and Intel's current x86 processors provide ISA and microarchitecture support for SIMD operations.
- ISA SIMD extensions: MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX. See the flags field in /proc/cpuinfo.
- SSE (Streaming SIMD Extensions) is a SIMD instruction-set extension to the x86 architecture: instructions for operating on multiple data elements simultaneously (vector operations).
- Microarchitecture support: many functional units; 8 128-bit vector registers, XMM0, XMM1, ..., XMM7.

SSE programming
Vector registers support three data types:
- integer (16 bytes, 8 shorts, 4 ints, 2 long long ints, or 1 dqword)
- single-precision floating point (4 floats)
- double-precision floating point (2 doubles) [Klimovitski 2001]

SSE instructions
Assembly instruction categories:
- Data movement instructions: move data in and out of vector registers.
- Arithmetic instructions: arithmetic operations on multiple data (2 doubles, 4 floats, 16 bytes, etc.).
- Logical instructions: logical operations on multiple data.
- Comparison instructions: compare multiple data.
- Shuffle instructions: move data around within SIMD registers.
- Miscellaneous: data conversion between x86 and SIMD registers; cache control (vector data may pollute the caches); etc.

SSE instructions: data movement
- MOVUPS: move 128 bits of data to an SIMD register from memory or an SIMD register; unaligned.
- MOVAPS: move 128 bits of data to an SIMD register from memory or an SIMD register; aligned.
- MOVHPS: move 64 bits to the upper bits of an SIMD register (high).
- MOVLPS: move 64 bits to the lower bits of an SIMD register (low).
- MOVHLPS: move the upper 64 bits of the source register to the lower 64 bits of the destination register.
- MOVLHPS: move the lower 64 bits of the source register to the upper 64 bits of the destination register.
- MOVMSKPS: move the sign bits of each of the 4 packed scalars to an x86 integer register.
- MOVSS: move 32 bits to an SIMD register from memory or an SIMD register.

SSE instructions: arithmetic, logical, and comparison
- Suffixes: pd = two packed doubles, ps = four packed floats, ss = scalar single.
- Arithmetic instructions: ADD, SUB, MUL, DIV, SQRT, MAX, MIN, RCP, etc. ADDPS adds four floats; ADDSS is a scalar add.
- Logical instructions: AND, OR, XOR, ANDN, etc. ANDPS is a bitwise AND of the operands; ANDNPS is a bitwise AND NOT.
- Comparison instructions: CMPPS, CMPSS compare operands and return all 1s or all 0s in each element.

SSE instructions: shuffle and others
- SHUFPS: shuffle numbers from one operand to another.
- UNPCKHPS: unpack (interleave) the high-order numbers into an SIMD register: unpckhps [x4,x3,x2,x1], [y4,y3,y2,y1] = [y4,x4,y3,x3].
- UNPCKLPS: unpack the low-order numbers: [y2,x2,y1,x1].
- Data conversion: CVTPS2PI mm, xmm/mem64.
- Cache control: MOVNTPS stores data from a SIMD floating-point register to memory, bypassing the cache.
- State management: LDMXCSR loads the MXCSR status register.

SSE programming in C/C++: intrinsics
- An intrinsic is a function known to the compiler that maps directly to a sequence of one or more assembly-language instructions.
- Intrinsic functions are inherently more efficient than called functions because no calling linkage is required.
- Intrinsics provide a C/C++ interface to processor-specific enhancements.
- Supported by major compilers such as gcc.

SSE intrinsics
Header files for the SSE intrinsics:
#include <mmintrin.h>  // MMX
#include <xmmintrin.h> // SSE
#include <emmintrin.h> // SSE2
#include <pmmintrin.h> // SSE3
#include <tmmintrin.h> // SSSE3
#include <smmintrin.h> // SSE4
- MMX/SSE/SSE2 are widely supported; SSE4 is not as well supported.
- When compiling, use -msse, -mmmx, -msse2 (machine-dependent code); some are on by default in gcc.
- A side note: gcc's default include path can be seen with 'cpp -v'. On linprog, the SSE header files are in /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.3.2/include/

SSE intrinsics
- Data types (each maps to an XMM register): __m128 (floats), __m128d (doubles), __m128i (integers).
- Data movement and initialization: _mm_load_ps, _mm_loadu_ps, _mm_load_pd, _mm_loadu_pd, etc.; _mm_store_ps, ...; _mm_setzero_ps; _mm_loadl_pd, _mm_loadh_pd; _mm_storel_pd, _mm_storeh_pd.

SSE intrinsics
- Arithmetic intrinsics: _mm_add_ss, _mm_add_ps, ..., _mm_add_pd, _mm_mul_pd.
- More details in the MSDN library at http://msdn.microsoft.com/en-us/library/y0dh78ez(v=VS.80).aspx
- See ex1.c and sapxy.c.

SSE intrinsics: data alignment
- Some intrinsics require memory to be aligned to 16 bytes and may not work when it is not. See sapxy1.c.
- Writing a more generic SSE routine: check memory alignment first, and fall back to a slow path otherwise; the slow path may not get any performance benefit from SSE. See sapxy2.c.

Summary
- Contemporary CPUs have SIMD support for vector operations; SSE is its programming interface.
- SSE can be accessed from high-level languages through intrinsic functions.
- SSE programming needs to be very careful about memory alignment, both for correctness and for performance.

References
- Intel® 64 and IA-32 Architectures Software Developer's Manuals (volumes 2A and 2B). http://www.intel.com/products/processor/manuals/
- SSE Performance Programming. http://developer.apple.com/hardwaredrivers/ve/sse.html
- Alex Klimovitski, "Using SSE and SSE2: Misconceptions and Reality." Intel Developer Update Magazine, March 2001.
- Intel SSE Tutorial: An Introduction to the SSE Instruction Set. http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#D
- SSE intrinsics tutorial. http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/
- MSDN library, MMX, SSE, and SSE2 intrinsics. http://msdn.microsoft.com/en-us/library/y0dh78ez(v=VS.80).aspx