CS 295: Modern Systems Modern Processors – SIMD Extensions

CS 295: Modern Systems Modern Processors – SIMD Extensions Sang-Woo Jun Spring, 2019

Modern Processor Topics
Transparent performance improvements: pipelining, caches, superscalar, out-of-order execution, branch prediction, speculation, … (covered in CS250A and others)
Explicit performance improvements: SIMD extensions, AES extensions, …
Non-performance topics: virtualization extensions, secure enclaves, transactional memory, …

Flynn Taxonomy (1966) Recap

                                 Single Data Stream              Multi Data Stream
  Single Instruction Stream      SISD (single-core processors)   SIMD (GPUs, Intel SSE/AVX extensions, …)
  Multi Instruction Stream       MISD (systolic arrays, …)       MIMD (VLIW, parallel computers)

Flynn Taxonomy Recap
Single-Instruction Single-Data (single-core processors)
Single-Instruction Multi-Data (GPUs, Intel SIMD extensions)
Multi-Instruction Single-Data (systolic arrays, …)
Multi-Instruction Multi-Data (parallel computers)

Intel SIMD Extensions
New instructions and new registers, introduced in phases/groups of functionality:
SSE – SSE4 (1999 – 2006): 128-bit operations
AVX, FMA, AVX2, AVX-512 (2008 – 2015): 256- to 512-bit operations
AVX-512 chips not available yet (as of Spring 2019), but soon!
F16C, and more to come?

Intel SIMD Registers (AVX-512)
XMM0 – XMM15: 128-bit registers (SSE)
YMM0 – YMM15: 256-bit registers (AVX, AVX2)
ZMM0 – ZMM31: 512-bit registers (AVX-512)
Each XMM register is the low 128 bits of the corresponding YMM register, which in turn is the low 256 bits of the corresponding ZMM register (XMM0/YMM0/ZMM0 … XMM31/YMM31/ZMM31).

SSE/AVX Data Types
A 256-bit YMM register can be treated as 8 floats, 4 doubles, 8 32-bit integers, 16 16-bit integers, or 32 8-bit integers.
Operation on 32 8-bit values in one instruction!

Sandy Bridge Microarchitecture e.g., “Port 5 pressure” when code uses too many shuffle operations

Skylake Die Layout

Aside: Do I Have SIMD Capabilities? less /proc/cpuinfo and check the “flags” field (sse, avx, avx2, …)

Processor Microarchitectural Effects on Power Efficiency
The majority of a CPU's power consumption is not from the ALUs, but from cache management, data movement, decoding, and other infrastructure
Adding a few more ALUs should not significantly impact power consumption
Indeed, 4X performance via AVX does not add 4X power consumption
From i7 4770K measurements: Idle: 40 W, Under load: 117 W, Under AVX load: 128 W
Aside: Not all AVX operations are active at the same time – “Specialization horseman”?

Compiler Automatic Vectorization
In gcc, the flags "-O3 -mavx -mavx2" enable automatic vectorization
Works pretty well for simple loops, but not for anything complex; e.g., naïve bubblesort code is not parallelized at all
Generated using GCC explorer: https://gcc.godbolt.org/
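As an illustration (a minimal sketch, not from the original slides; the function name and signature are hypothetical), a simple dense loop like the one below is the kind of code gcc auto-vectorizes with those flags:

  #include <stddef.h>

  /* Hypothetical example: gcc -O3 -mavx2 typically turns this loop into
   * 256-bit packed multiply/add instructions. The restrict qualifiers tell
   * the compiler the arrays do not overlap, which makes vectorization
   * easier to prove safe. */
  void saxpy(float * restrict y, const float * restrict x, float a, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }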

Intel SIMD Intrinsics
Use C functions instead of inline assembly to call AVX instructions; the compiler manages registers, etc.
Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide (one of my most-visited pages…)
e.g.,
  __m256 a, b, c;
  __m256 d = _mm256_fmadd_ps(a, b, c); // d[i] = a[i]*b[i]+c[i] for i = 0…7
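For context, here is a self-contained sketch of how such an intrinsic is typically used; the function name and the surrounding unaligned load/store are illustrative additions, not from the slides, and FMA support (-mfma) is assumed:

  #include <immintrin.h>

  /* Illustrative sketch: multiply-add 8 floats at a time.
   * Computes d[i] = a[i]*b[i] + c[i] for i = 0..7. */
  void fmadd8(const float *a, const float *b, const float *c, float *d)
  {
      __m256 va = _mm256_loadu_ps(a);   /* unaligned loads for simplicity */
      __m256 vb = _mm256_loadu_ps(b);
      __m256 vc = _mm256_loadu_ps(c);
      __m256 vd = _mm256_fmadd_ps(va, vb, vc);
      _mm256_storeu_ps(d, vd);
  }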

Data Types in AVX/AVX2

  Type      Description
  __m128    128-bit vector containing 4 floats
  __m128d   128-bit vector containing 2 doubles
  __m128i   128-bit vector containing 1~16 signed/unsigned integers
  __m256    256-bit vector containing 8 floats
  __m256d   256-bit vector containing 4 doubles
  __m256i   256-bit vector containing 1~32 signed/unsigned integers

__m512 variants in the future for AVX-512

Intrinsic Naming Convention
_mm<width>_[function]_[type]
E.g., _mm256_fmadd_ps : perform fmadd (floating point multiply-add) on 256 bits of packed single-precision floating point values (8 of them)

  Width   Prefix
  128     _mm_
  256     _mm256_
  512     _mm512_

  Type                       Postfix
  Single precision           _ps
  Double precision           _pd
  Packed signed integer      _epiNNN (e.g., epi32)
  Packed unsigned integer    _epuNNN (e.g., epu32)
  Scalar integer             _siNNN (e.g., si256)

Not all permutations exist! Check the guide

Load/Store/Initialization Operations
_mm256_setzero_ps/pd/epi32/…
_mm256_set_...
Load/Store: address MUST be aligned to 256 bits (32 bytes)
_mm256_load_...
_mm256_store_...
And many more! (masked read/write, strided reads, etc…)
e.g.,
  __m256d t = _mm256_load_pd(double const * mem); // loads 4 double values from mem to t
  __m256i v = _mm256_set_epi32(h,g,f,e,d,c,b,a);  // loads 8 integer values to v
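As a sketch of what the alignment requirement means in practice (an illustrative helper of my own, not from the slides; aligned_alloc is C11):

  #include <immintrin.h>
  #include <stdlib.h>

  /* Illustrative sketch: _mm256_load_pd / _mm256_store_pd require 32-byte
   * aligned addresses, so the buffer is allocated with aligned_alloc. */
  void store_four_doubles(void)
  {
      double *p = (double *)aligned_alloc(32, 4 * sizeof(double));
      if (!p) return;
      __m256d v = _mm256_set_pd(4.0, 3.0, 2.0, 1.0); /* note: highest element first */
      _mm256_store_pd(p, v);                          /* p[0..3] = 1.0, 2.0, 3.0, 4.0 */
      free(p);
  }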

Vertical Vector Instructions
Add/Subtract/Multiply/Divide: _mm256_add/sub/mul/div_ps/pd/epi
  Mul only supported for epi32/epu32/ps/pd
  Div only supported for ps/pd
  Consult the guide!
Max/Min/GreaterThan/Equals
Sqrt, Reciprocal, Shift, etc…
FMA (Fused Multiply-Add): (a*b)+c, -(a*b)-c, -(a*b)+c, and other permutations!
e.g.,
  __m256d a, b, c;
  __m256d d = _mm256_fmadd_pd(a, b, c);
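A hedged sketch of a typical vertical operation used in a loop (the function name and the assumption that n is a multiple of 8 are mine, not from the slides):

  #include <immintrin.h>
  #include <stddef.h>

  /* Illustrative sketch: element-wise c[i] = a[i] + b[i], 8 floats per iteration.
   * Assumes n is a multiple of 8; a real version would handle the remainder. */
  void vadd(const float *a, const float *b, float *c, size_t n)
  {
      for (size_t i = 0; i < n; i += 8) {
          __m256 va = _mm256_loadu_ps(a + i);
          __m256 vb = _mm256_loadu_ps(b + i);
          _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
      }
  }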

Integer Multiplication Caveat
Integer multiplication of two N-bit values requires 2N bits
E.g., _mm256_mul_epi32 and _mm256_mul_epu32
  Only use the lower four 32-bit values; the result has four 64-bit values
E.g., _mm256_mullo_epi32
  Uses all eight 32-bit values; the result has eight truncated 32-bit values (the low half is the same for signed and unsigned)
And more options! Consult the guide…
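A minimal sketch of the two multiply flavors (illustrative wrappers, assumes AVX2):

  #include <immintrin.h>

  /* Widening multiply: uses the low 32-bit element of each 64-bit lane,
   * producing four full 64-bit products. */
  __m256i widen_mul(__m256i a, __m256i b)
  {
      return _mm256_mul_epi32(a, b);
  }

  /* Truncating multiply: multiplies all eight 32-bit elements and keeps
   * only the low 32 bits of each product. */
  __m256i trunc_mul(__m256i a, __m256i b)
  {
      return _mm256_mullo_epi32(a, b);
  }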

Horizontal Vector Instructions
Horizontal add/subtract: adds or subtracts adjacent pairs of values within the operands
E.g., __m256d _mm256_hadd_pd (__m256d a, __m256d b)
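As an illustration of where this is useful (my own sketch, not from the slides), a full horizontal sum of 4 doubles can combine hadd with a lane extract, since hadd only pairs values within each 128-bit lane:

  #include <immintrin.h>

  /* Illustrative sketch: sum all 4 doubles in a 256-bit register. */
  double hsum_pd(__m256d v)
  {
      __m256d t  = _mm256_hadd_pd(v, v);         /* { v0+v1, v0+v1, v2+v3, v2+v3 } */
      __m128d lo = _mm256_castpd256_pd128(t);    /* { v0+v1, v0+v1 } */
      __m128d hi = _mm256_extractf128_pd(t, 1);  /* { v2+v3, v2+v3 } */
      return _mm_cvtsd_f64(_mm_add_sd(lo, hi));  /* (v0+v1) + (v2+v3) */
  }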

Shuffling/Permutation
Within 128-bit lanes:
  _mm256_shuffle_ps/pd/… (a, b, imm8)
  _mm256_permute_ps/pd
  _mm256_permutevar_ps/…
Across 128-bit lanes:
  _mm256_permute2x128/4x64 : uses an 8-bit control
  _mm256_permutevar8x32/… : uses a 256-bit control
Not all type permutations exist, but variables can be cast back and forth
Matt Scarpino, “Crunching Numbers with AVX and AVX2,” 2016
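A minimal sketch of a cross-lane permute (an illustrative example of my own, assumes AVX2): reversing the 8 floats of a register with a per-element index control.

  #include <immintrin.h>

  /* Illustrative sketch: dst[i] = v[idx[i]], with indices chosen to reverse
   * the register across both 128-bit lanes. */
  __m256 reverse8(__m256 v)
  {
      __m256i idx = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
      return _mm256_permutevar8x32_ps(v, idx);
  }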

Blend
Merges two vectors using a control
_mm256_blend_... : uses an 8-bit control (e.g., _mm256_blend_epi32)
_mm256_blendv_... : uses a 256-bit control (e.g., _mm256_blendv_epi8)
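A short sketch of both forms (illustrative wrappers, assumes AVX2):

  #include <immintrin.h>

  /* Immediate control: bit i of 0xAA (0b10101010) selects b for element i,
   * otherwise a -- so odd elements come from b, even elements from a. */
  __m256i blend_fixed(__m256i a, __m256i b)
  {
      return _mm256_blend_epi32(a, b, 0xAA);
  }

  /* Vector control: the sign bit of each byte of mask selects b for that byte. */
  __m256i blend_var(__m256i a, __m256i b, __m256i mask)
  {
      return _mm256_blendv_epi8(a, b, mask);
  }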

Alignr
Right-shifts the concatenated value of two registers, byte-wise
Often used to implement a circular shift by passing the same register as both inputs
_mm256_alignr_epi8 (a, b, count)
Example: 64-bit values shifted by 8 bytes
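A minimal sketch of the circular-shift idiom (my own illustration using the 128-bit variant; note that the 256-bit _mm256_alignr_epi8 concatenates within each 128-bit lane):

  #include <immintrin.h>

  /* Illustrative sketch: rotate four 32-bit values down by one position by
   * passing the same register as both alignr inputs (count is in bytes). */
  __m128i rotate_down_epi32(__m128i v)
  {
      return _mm_alignr_epi8(v, v, 4); /* result[i] = v[(i+1) % 4] */
  }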

Helper Instructions
Cast: __m256i <-> __m256, etc… Syntactic sugar -- does not spend cycles
Convert: 4 floats <-> 4 doubles, etc…
Movemask: __m256 mask -> int bitmask
And many more…

Case Study: Sorting Important, fundamental application! Can be parallelized via divide-and-conquer How can SIMD help?

Reminder: Sorting Network
A network structure for sorting a fixed number of values; a type of “comparator network” whose comparators perform compare-and-swap
Easily pipelined and parallelized
Example: a 4-element sorting network with 5 comparators, pipelined over 3 cycles
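For concreteness, here is a scalar sketch of that 4-element network (my own illustration; each call is one comparator):

  /* Illustrative sketch: the 4-element sorting network as scalar
   * compare-and-swap steps -- 5 comparators arranged in 3 stages. */
  static void cas(int *x, int *y)
  {
      if (*x > *y) { int t = *x; *x = *y; *y = t; }
  }

  void sort4(int v[4])
  {
      cas(&v[0], &v[1]); cas(&v[2], &v[3]); /* stage 1: independent comparators */
      cas(&v[0], &v[2]); cas(&v[1], &v[3]); /* stage 2: independent comparators */
      cas(&v[1], &v[2]);                    /* stage 3 */
  }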

Reminder: Sorting Network
It is simple to generate correct sorting networks, but optimal structures are not well-known
Some known optimal sorting networks
Source: Wikipedia (Sorting Network)

SIMD And Sorting Networks
AVX compare/max/min instructions do not operate across elements within a single register
Typically the same values are copied to two different registers, permuted, and compared
Or the “two register trick”, and many more optimizations
In this case, optimal sorting networks are typically not very helpful: irregular computation means bubbles in the compute structure
A simple odd-even sorting network works well
What if I want to sort more than 8 values?
8-element odd-even sorting network

SIMD And Sorting Networks
Typically, we are sorting more than one set of tuples
If we have multiple tasks, we can exploit task-level parallelism – and use optimized networks!
Sort multiple tuples at the same time: each SIMD variable holds one value from each sorting-network instance (a sketch follows below)
Then we need to transpose the 8 8-element variables
Non-SIMD works, or a string of unpackhi/unpacklo/blend
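A minimal sketch of one comparator stage in this layout (illustrative, assumes AVX2; each register holds one element from each of 8 independent network instances):

  #include <immintrin.h>

  /* Illustrative sketch: one compare-and-swap stage across 8 independent
   * sorting-network instances -- min/max act as 8 parallel comparators. */
  void cas_stage(__m256i *a, __m256i *b)
  {
      __m256i lo = _mm256_min_epi32(*a, *b);
      __m256i hi = _mm256_max_epi32(*a, *b);
      *a = lo;
      *b = hi;
  }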

SIMD And Sorting Networks
Some SIMD instructions have high throughput but also high latency
A data dependency between two consecutive max instructions can take 8 cycles on Skylake
If each parallel stage has fewer than 4 operations, the pipeline may stall
Solution: interleave two sets of parallel 8-tuple sorting
In reality, since each stage issues both a min and a max, the pipeline stays filled even for 4-tuples
Source: Intrinsics guide

The Two Register Merge
Merges two pre-sorted registers of K elements each:

  minv = A, maxv = B
  // repeat K times:
      tmp  = min(minv, maxv)
      maxv = max(minv, maxv)
      // circular shift one value down
      minv = alignr(tmp, tmp, imm)

Inoue et al., “SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures,” VLDB 2015
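A hedged, self-contained rendering of that loop with SSE intrinsics (my own sketch, not the code from Inoue et al.; uses 128-bit registers holding 4 sorted 32-bit ints each, and requires SSE4.1/SSSE3):

  #include <immintrin.h>

  /* Illustrative sketch: merge two sorted 4-element registers.
   * On return, *a holds the 4 smallest values and *b the 4 largest, both sorted. */
  void merge_two_registers(__m128i *a, __m128i *b)
  {
      __m128i minv = *a, maxv = *b;
      for (int i = 0; i < 4; i++) {             /* repeat K = 4 times */
          __m128i t = _mm_min_epi32(minv, maxv);
          maxv      = _mm_max_epi32(minv, maxv);
          minv      = _mm_alignr_epi8(t, t, 4); /* circular shift one value down */
      }
      *a = minv;
      *b = maxv;
  }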

SIMD And Merge Sort
Hierarchically merge sorted subsections, using the SIMD merger for sorting
vector_merge is the two-register sorter from before
What if we want to sort structs? Instead of min/max, use cmpgt, followed by permute for both keys and values; for large values, use <key, index>
Inoue et al., “SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures,” VLDB 2015
Typically reports 2x+ performance!

Topic Under Active Research!
Papers are being written about:
  Architecture-optimized matrix transposition
  Register-level sorting algorithms
  Merge sort
  … and more!
A good find can accelerate your application kernel by 2x

Case Study: Matrix Multiply Source: Boser & Katz, “CS61C: Great Ideas In Computer Architecture,” Lecture 18 – Parallel Processing – SIMD

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf