SIMD Multimedia Extensions


Lecture 16: SSE vector processing

Improving performance with SSE

We've seen how we can apply multithreading to speed up the cardiac simulator. But there is another kind of parallelism available to us: SSE.

Scott B. Baden / CSE 160 / Wi '16

Hardware Control Mechanisms

Flynn's classification (1966) asks how the processors issue instructions.

SIMD (Single Instruction, Multiple Data): a single control unit broadcasts a global instruction stream that all processing elements (PEs) execute in lock-step.

MIMD (Multiple Instruction, Multiple Data): each PE is paired with its own control unit; processors (e.g., in clusters and servers) execute instruction streams independently.

[Figure: block diagrams of the SIMD and MIMD organizations, with PEs joined by an interconnect]

SIMD (Single Instruction Multiple Data)

Operate on regular arrays of data. Two landmark SIMD designs:
- ILLIAC IV (1960s)
- Connection Machine 1 and 2 (1980s)

Vector computer: Cray-1 (1976)

Intel and others support SIMD for multimedia and graphics: SSE (Streaming SIMD Extensions), AltiVec. Also GPUs and the Cell Broadband Engine (Sony PlayStation).

Operations are defined on vectors, e.g.:

forall i = 0:N-1
    p[i] = a[i] * b[i]
end forall

Reduced performance on data-dependent or irregular computations, e.g.:

forall i = 0:n-1
    x[i] = y[i] + z[K[i]]
end forall

forall i = 0:n-1
    if (x[i] < 0) then
        y[i] = x[i]
    else
        y[i] = x[i]
    end if
end forall
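The second forall above hinges on the indirect reference z[K[i]]. Written as plain C (the function and variable names below are illustrative, not from the slide), the loop looks harmless, but the gather defeats a single wide load:

```c
#include <stddef.h>

/* Plain-C rendering of the slide's forall with an indirect index.
 * Each z[K[i]] may land anywhere in memory, so SSE2 cannot fetch
 * the operands with one contiguous 128-bit load. */
void gather_add(double *x, const double *y, const double *z,
                const int *K, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] + z[K[i]];
}
```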

Are SIMD processors general purpose?
A. Yes
B. No

What kind of parallelism does multithreading provide?
A. MIMD
B. SIMD

Streaming SIMD Extensions

SIMD instruction set on short vectors. Bang runs SSE3, but most will need only SSE2. See https://goo.gl/DIokKj and https://software.intel.com/sites/landingpage/IntrinsicsGuide

Bang: 8 x 128-bit vector registers (newer CPUs have 16). Each 128-bit register holds 2 doubles, or 4 floats, ints, etc.

for i = 0:N-1 { p[i] = a[i] * b[i]; }

[Figure: the elementwise multiply carried out one 128-bit register's worth of lanes at a time]
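The loop above can be written directly with SSE2 intrinsics, two doubles per 128-bit register. A minimal sketch (the helper name vec_mul is mine; unaligned loads are used so any array address works):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* p[i] = a[i] * b[i], two lanes at a time. Assumes n is even.
 * _mm_loadu_pd/_mm_storeu_pd tolerate unaligned addresses. */
void vec_mul(double *p, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&p[i], _mm_mul_pd(va, vb));
    }
}
```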

SSE Architectural support

SSE2, SSE3, SSE4, AVX. SSE2 and later provide 16 XMM registers (128 bits each); these are in addition to the conventional registers and are treated specially.

Vector operations on short vectors: add, subtract, 128-bit load/store. Shuffling (handles conditionals).

See the Intel intrinsics guide: software.intel.com/sites/landingpage/IntrinsicsGuide

May need to invoke compiler options depending on level of optimization.
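"Shuffling (handles conditionals)" means branches become bitmask selects. As a sketch (my own example, using standard SSE2 calls): an absolute-value loop where the compare produces an all-ones lane mask used to pick between x and -x:

```c
#include <emmintrin.h>

/* y[i] = (x[i] < 0) ? -x[i] : x[i], with no branch in the loop body.
 * _mm_cmplt_pd yields all-ones in each lane where the test holds;
 * and/andnot/or then merge the two candidate results. n must be even. */
void vec_abs(double *y, const double *x, int n) {
    const __m128d zero = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2) {
        __m128d v    = _mm_loadu_pd(&x[i]);
        __m128d mask = _mm_cmplt_pd(v, zero);   /* x < 0 ? ~0 : 0 */
        __m128d neg  = _mm_sub_pd(zero, v);     /* -x */
        __m128d r    = _mm_or_pd(_mm_and_pd(mask, neg),
                                 _mm_andnot_pd(mask, v));
        _mm_storeu_pd(&y[i], r);
    }
}
```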

C++ intrinsics

C++ functions and datatypes that map directly onto one or more machine instructions. Supported by all major compilers. The interface provides 128-bit data types and operations on those datatypes:
- __m128 (float)
- __m128d (double)

Data movement and initialization:
- _mm_load_pd (aligned load)
- _mm_store_pd
- _mm_loadu_pd (unaligned load)

Data may need to be aligned:

__m128d vec1, vec2, vec3;
for (i=0; i<N; i+=2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}
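The fragment above needs 16-byte-aligned arrays for _mm_load_pd. A complete, compilable rendering (the array size, function name, and alignment handling are my additions) might look like:

```c
#include <emmintrin.h>

#define N 8   /* must be even for the 2-wide loop */

/* The slide's loop as a function: a[i] = sqrt(b[i] / c[i]).
 * _mm_load_pd/_mm_store_pd fault on addresses that are not
 * 16-byte aligned, so callers must pass aligned arrays
 * (e.g. declared with _Alignas(16)). */
void vec_sqrt_div(double *a, const double *b, const double *c) {
    for (int i = 0; i < N; i += 2) {
        __m128d vec1 = _mm_load_pd(&b[i]);
        __m128d vec2 = _mm_load_pd(&c[i]);
        __m128d vec3 = _mm_div_pd(vec1, vec2);
        vec3 = _mm_sqrt_pd(vec3);
        _mm_store_pd(&a[i], vec3);
    }
}
```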

How do we vectorize?

Original code:

double a[N], b[N], c[N];
for (i=0; i<N; i++)
    a[i] = sqrt(b[i] / c[i]);

Identify vector operations, reduce loop bound:

for (i = 0; i < N; i+=2)
    a[i:i+1] = vec_sqrt(b[i:i+1] / c[i:i+1]);

The vector instructions:

__m128d vec1, vec2, vec3;
for (i=0; i<N; i+=2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

Performance

Without SSE vectorization: 0.777 sec.
With SSE vectorization: 0.454 sec.
Speedup due to vectorization: 1.7x
(Code: $PUB/Examples/SSE/Vec)

Scalar version:

double *a, *b, *c;
for (i=0; i<N; i++)
    a[i] = sqrt(b[i] / c[i]);

Vectorized version:

double *a, *b, *c;
__m128d vec1, vec2, vec3;
for (i=0; i<N; i+=2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

The assembler code

Scalar source:

double *a, *b, *c;
for (i=0; i<N; i++)
    a[i] = sqrt(b[i] / c[i]);

Generated loop (note the scalar movsd/divsd/sqrtsd instructions):

.L12:
    movsd   xmm0, QWORD PTR [r12+rbx]
    divsd   xmm0, QWORD PTR [r13+0+rbx]
    sqrtsd  xmm1, xmm0
    ucomisd xmm1, xmm1                  // checks for illegal sqrt
    jp      .L30
    movsd   QWORD PTR [rbp+0+rbx], xmm1
    add     rbx, 8                      # ivtmp.135
    cmp     rbx, 16384
    jne     .L12

What prevents vectorization

Interrupted flow out of the loop:

for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
    if (maxval > 1000.0) break;
}

Compiler report: "Loop not vectorized/parallelized: multiple exits"

This loop will vectorize:

for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
}
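One common workaround (not shown on the slide; the helper below is a hypothetical sketch) is loop fission: do the arithmetic in a single-exit loop the compiler can vectorize, then find the exit point in a separate scalar scan. Unlike the original, all of a[] gets computed, which is acceptable when the extra stores are harmless:

```c
/* Split the multi-exit loop: pass 1 (single exit, vectorizable)
 * computes a[]; pass 2 (scalar) tracks the running max and finds
 * where it first exceeds the threshold, reproducing the break. */
int add_then_scan(double *a, const double *b, const double *c,
                  int n, double *maxout) {
    double maxval = 0.0;
    for (int i = 0; i < n; i++)          /* no early exit here */
        a[i] = b[i] + c[i];
    for (int i = 0; i < n; i++) {        /* scalar scan */
        maxval = (a[i] > maxval) ? a[i] : maxval;
        if (maxval > 1000.0) { *maxout = maxval; return i; }
    }
    *maxout = maxval;
    return n;
}
```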

SSE2 Cheat sheet (load and store)

xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
{A} the 128-bit operand is aligned in memory
{U} the 128-bit operand is unaligned in memory
{H} move the high half of the 128-bit operand
{L} move the low half of the 128-bit operand

(Adapted from Krste Asanovic & Randy H. Katz)
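A small example (mine, not from the slide) tying the table's suffixes to intrinsics: movupd ({U}) via _mm_loadu_pd can read from any address, _mm_unpackhi_pd moves the {H} half into the low lane, and a scalar {SD} add finishes a two-element sum:

```c
#include <emmintrin.h>

/* Sum two adjacent doubles starting at an arbitrary address,
 * possibly 8-byte rather than 16-byte aligned. movapd ({A},
 * _mm_load_pd) would fault here; movupd ({U}, _mm_loadu_pd)
 * does not. */
double sum_pair(const double *p) {
    __m128d v  = _mm_loadu_pd(p);            /* {U}: unaligned packed load */
    __m128d hi = _mm_unpackhi_pd(v, v);      /* bring the {H} half low */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi)); /* {SD}: scalar double add */
}
```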