Slide 1: Performance Optimization, cont. How do we fix performance problems?

Slide 2: How do we improve performance? Imagine you want to build a house. How long would it take you? What could you do to build that house faster?

Slide 3: Exploiting Parallelism
Of the computing problems for which performance matters, many have inherent parallelism.
E.g., computer games:
- Graphics, physics, sound, A.I., etc. can be done separately.
- Furthermore, there is often parallelism within each of these:
  - The color of each pixel on the screen can be computed independently.
  - Non-contacting objects can be updated/simulated independently.
  - The artificial intelligence of each non-human entity can be computed independently.
E.g., Google queries:
- Every query is independent.
- Google searches are read-only!

Slide 4: Exploiting Parallelism at the Instruction Level (SIMD)
Consider adding together two arrays:

void array_add(int A[], int B[], int C[], int length) {
  int i;
  for (i = 0; i < length; ++i) {
    C[i] = A[i] + B[i];
  }
}

You could write assembly for the loop body, something like:

  lw  $t0, 0($a0)
  lw  $t1, 0($a1)
  add $t2, $t0, $t1
  sw  $t2, 0($a2)

(plus all of the address arithmetic, plus the loop control)

Slide 5: Exploiting Parallelism at the Instruction Level (SIMD)
Consider adding together two arrays:

void array_add(int A[], int B[], int C[], int length) {
  int i;
  for (i = 0; i < length; ++i) {
    C[i] = A[i] + B[i];
  }
}

(Figure: the loop walks the arrays, operating on one element at a time.)

Slide 7: Exploiting Parallelism at the Instruction Level (SIMD)
Same loop as above, but now operate on MULTIPLE elements per instruction: this is Single Instruction, Multiple Data (SIMD).

Slide 8: Intel SSE/SSE2 as an example of SIMD
Added new 128-bit registers (XMM0 through XMM7); each can store:
- 4 single-precision FP values (SSE): 4 x 32b
- 2 double-precision FP values (SSE2): 2 x 64b
- 16 byte values (SSE2): 16 x 8b
- 8 word values (SSE2): 8 x 16b
- 4 double-word values (SSE2): 4 x 32b
- 1 128-bit integer value (SSE2): 1 x 128b
(Figure: XMM registers drawn as four 32-bit lanes holding values such as 4.0, -2.0, 2.3, 1.7.)
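
As a concrete illustration of the four-lane view, here is a minimal sketch using SSE intrinsics (this program is not part of the original slides; the lane values are borrowed from the slide's figure):

  #include <stdio.h>
  #include <xmmintrin.h>   /* SSE intrinsics */

  int main(void) {
      /* Pack four 32-bit floats into one 128-bit XMM value.
         Note: _mm_set_ps takes lanes from high to low. */
      __m128 v = _mm_set_ps(4.0f, -2.0f, 2.3f, 1.7f);
      float lanes[4];
      _mm_storeu_ps(lanes, v);   /* spill the four lanes to memory */
      printf("%.1f %.1f %.1f %.1f\n",
             lanes[0], lanes[1], lanes[2], lanes[3]);  /* 1.7 2.3 -2.0 4.0 */
      return 0;
  }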

Slide 9: SIMD Extensions
More than 70 instructions. Arithmetic operations supported: addition, subtraction, multiplication, division, square root, maximum, minimum. Can operate on floating-point or integer data.
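
A hedged sketch of a few of these operations via SSE intrinsics (packed_ops_demo is a hypothetical helper name, not from the slides):

  #include <xmmintrin.h>   /* SSE: packed single-precision operations */

  /* Per-lane sqrt, max, and min on four floats at once. */
  void packed_ops_demo(const float a[4], const float b[4],
                       float sqrts[4], float maxs[4], float mins[4]) {
      __m128 va = _mm_loadu_ps(a);
      __m128 vb = _mm_loadu_ps(b);
      _mm_storeu_ps(sqrts, _mm_sqrt_ps(va));     /* sqrtps */
      _mm_storeu_ps(maxs,  _mm_max_ps(va, vb));  /* maxps  */
      _mm_storeu_ps(mins,  _mm_min_ps(va, vb));  /* minps  */
  }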

Slide 10: Annotated SSE code for array_add (adding two arrays)

  movdqa (%eax,%edx,4), %xmm0   # load A[i] to A[i+3]
  movdqa (%ebx,%edx,4), %xmm1   # load B[i] to B[i+3]
  paddd  %xmm0, %xmm1           # CCCC = AAAA + BBBB
  movdqa %xmm1, (%ecx,%edx,4)   # store C[i] to C[i+3]
  addl   $4, %edx               # i += 4
  (loop control code)

Decoding the mnemonics:
- movdqa: mov = data movement, dq = double-quad (128 bits), a = aligned
- paddd: p = packed, add = add, d = double-word (i.e., 32-bit integer)
Register usage: %eax = A, %ebx = B, %ecx = C, %edx = i.
The addressing mode (%eax,%edx,4) computes A + 4*i. Why scale by 4? Each int is 4 bytes.
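
For reference, the same operation can be expressed in C with SSE2 intrinsics instead of raw assembly. This is a minimal sketch, not from the original slides: array_add_simd is a hypothetical name, it assumes length is a multiple of 4, and it uses unaligned loads/stores so the pointers need not be 16-byte aligned:

  #include <emmintrin.h>   /* SSE2 intrinsics */

  void array_add_simd(int A[], int B[], int C[], int length) {
      for (int i = 0; i < length; i += 4) {
          __m128i a = _mm_loadu_si128((__m128i *)&A[i]);  /* A[i..i+3] */
          __m128i b = _mm_loadu_si128((__m128i *)&B[i]);  /* B[i..i+3] */
          __m128i c = _mm_add_epi32(a, b);                /* paddd */
          _mm_storeu_si128((__m128i *)&C[i], c);          /* C[i..i+3] */
      }
  }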

Slide 12: Is it always that easy?
No, not always. Let's look at a slightly more challenging one.

unsigned sum_array(unsigned *array, int length) {
  int total = 0;
  for (int i = 0; i < length; ++i) {
    total += array[i];
  }
  return total;
}

Is there parallelism here?

Slide 13: Exposing the parallelism
(Same code as above.) Unlike array_add, each iteration depends on the previous one through total, so the loop cannot be vectorized as-is. But addition is associative, so we can keep several independent partial sums and combine them at the end.

Slide 14: We first need to restructure the code

unsigned sum_array2(unsigned *array, int length) {
  unsigned total, i;
  unsigned temp[4] = {0, 0, 0, 0};
  for (i = 0; i < (length & ~0x3); i += 4) {
    temp[0] += array[i];
    temp[1] += array[i+1];
    temp[2] += array[i+2];
    temp[3] += array[i+3];
  }
  total = temp[0] + temp[1] + temp[2] + temp[3];
  for ( ; i < length; ++i) {
    total += array[i];
  }
  return total;
}

(Note the parentheses in (length & ~0x3): because < binds tighter than & in C, the unparenthesized i < length & ~0x3 parses as (i < length) & ~0x3, which always evaluates to 0, so the loop would never execute.)
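
A quick usage sketch (not from the slides) showing how the two loops split the work when length is not a multiple of 4:

  #include <stdio.h>

  unsigned sum_array2(unsigned *array, int length);  /* defined above */

  int main(void) {
      unsigned data[7] = {1, 2, 3, 4, 5, 6, 7};
      /* length & ~0x3 == 4: the unrolled loop covers data[0..3],
         the cleanup loop covers data[4..6] */
      printf("%u\n", sum_array2(data, 7));   /* prints 28 */
      return 0;
  }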

Slide 15: Then we can write SIMD code for the hot part
(The hot part is the four-way accumulation loop in sum_array2 above; the cleanup loop stays scalar.)
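
The slide does not show the SIMD version itself; here is a minimal sketch of what it could look like using SSE2 intrinsics (sum_array_simd is a hypothetical name, and unaligned loads are used so array need not be 16-byte aligned). The four packed partial sums in one XMM register play exactly the role of temp[0..3] above:

  #include <emmintrin.h>   /* SSE2 intrinsics */

  unsigned sum_array_simd(unsigned *array, int length) {
      __m128i sums = _mm_setzero_si128();   /* four packed partial sums */
      int i;
      for (i = 0; i < (length & ~0x3); i += 4) {
          __m128i v = _mm_loadu_si128((__m128i *)&array[i]);
          sums = _mm_add_epi32(sums, v);    /* paddd: four adds at once */
      }
      unsigned temp[4];
      _mm_storeu_si128((__m128i *)temp, sums);
      unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {            /* leftover 0-3 elements */
          total += array[i];
      }
      return total;
  }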

Slide 16: Summary
- Performance is of primary concern in some applications: games, servers, mobile devices, supercomputers.
- Many important applications have parallelism; exploiting it is a good way to speed up programs.
- Single Instruction, Multiple Data (SIMD) does this at the ISA level: registers hold multiple data items, and instructions operate on all of them at once.
- Can achieve speedups by factors of 2, 4, or 8 on kernels.
- May require some restructuring of code to expose the parallelism.