EE 193: Parallel Computing

EE 193: Parallel Computing
Fall 2017, Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
Lecture 7: SIMD

Goals
Where are we?
- We've learned some basic (and some not-so-basic) architecture.
- Today is a different topic (SIMD)… just so you don't get too burnt out on architecture :-)
- Then it's back to our final architecture topic: ring caches.
Primary goals:
- Learn what SIMD is, and (roughly) how to use it.
- No programming assignments on this, but yes, it is covered in the short quizzes.
- And if you want, you can do some SIMD programming for a final project.

Flynn’s Taxonomy*
- SISD (single instruction stream, single data stream): the classic von Neumann machine.
- SIMD (single instruction stream, multiple data streams): today's class.
- MISD (multiple instruction streams, single data stream): not used in practice.
- MIMD (multiple instruction streams, multiple data streams): multicore.
*Mike Flynn, Stanford, 1966

Problems with multithreading
- We had lots of flexibility: many threads, all doing different things.
- Yet our simple use model actually had all the threads executing the same code.
- And keeping them all in sync was hard.

SIMD
- Parallelism is achieved by dividing data among multiple execution units (which may be just one datapath) in the same thread.
- The same instruction is applied to multiple data items.
- This is called data parallelism.

SIMD example
for (i = 0; i < n; i++)
    x[i] += y[i];
[Figure: a single control unit broadcasts the instruction to ALU1…ALUn; each ALU applies it to its own data item x[1]…x[n], so all n elements are updated at once.]
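As a concrete illustration, here is a minimal sketch of that loop written with SSE intrinsics (the function name vec_add is mine, not from the lecture). One _mm_add_ps applies the add to four floats at a time; the scalar tail handles the leftovers when n is not a multiple of 4.

    #include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

    void vec_add(float *x, const float *y, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {                   // one "round" per 4 elements
            __m128 vx = _mm_loadu_ps(&x[i]);           // load 4 floats of x
            __m128 vy = _mm_loadu_ps(&y[i]);           // load 4 floats of y
            _mm_storeu_ps(&x[i], _mm_add_ps(vx, vy));  // 4 adds with one instruction
        }
        for (; i < n; i++)                             // scalar tail for n % 4 items
            x[i] += y[i];
    }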

SIMD
What if we don’t have as many ALUs as data items? Divide the work and process iteratively.
Example: 4 ALUs and 15 data items.

Round   ALU1    ALU2    ALU3    ALU4
  1     X[0]    X[1]    X[2]    X[3]
  2     X[4]    X[5]    X[6]    X[7]
  3     X[8]    X[9]    X[10]   X[11]
  4     X[12]   X[13]   X[14]   (idle)

Problems with SIMD
We’ve shown the use of many ALUs, but we skipped the hard part. What have we skipped?
- How do you get the data from memory to the ALUs? SIMD does have parallel loads & stores, but it gets harder when you load many things and some hit the cache while others miss.
- It is not as flexible as MIMD. Even when we had different threads all running the same code, cache misses could push them out of sync, and the threads that had cache hits would happily move forward. SIMD cannot do that: every lane advances in lockstep.

SIMD history
- 1996 MMX: reused the FP registers (!) for 2x32b, 4x16b and 8x8b integer ops. MMX was aimed at graphics shading operations, but graphics cards soon took over that job.
- 1999 SSE: a new register file of 16 B registers, XMM0-7 (XMM8-15 were added later in 64-bit mode). 4x float; SSE2 then added 2x double and numerous integer formats.
- 2011 AVX: new 32 B registers YMM0-15; 8x float.
- 2016 AVX-512: new 64 B registers ZMM0-31. Only available on Xeon Phi so far (as of this lecture).

Example
A 4x4 vector dot product using the SSE4.1 instruction
DPPS xmm2, xmm0, xmm1, imm8
(DPPS = Dot Product of Packed Singles; three-operand syntax as in AVX's VDPPS — the plain SSE encoding overwrites its first operand.)
XMM0-15 are 16 B registers that can each hold, e.g., 4 floats. The instruction does xmm2 = xmm0 ∙ xmm1. Actually:

tmp = Σ (i = 0…3) of xmm0[i] * xmm1[i] * imm[i+4]
then, for all i = 0…3: xmm2[i] = imm[i] ? tmp : 0

imm[7:4] are mask bits: they choose which products enter the sum.
imm[3:0] choose where to place the result; the other output lanes are zeroed.
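To make the masks concrete, here is a hedged sketch (the function name dot4 is mine) of a full 4-element dot product using the matching SSE4.1 intrinsic _mm_dp_ps. The immediate 0xF1 multiplies all four lanes (imm[7:4] = 0xF) and places the sum in lane 0 (imm[3:0] = 0x1).

    #include <smmintrin.h>  // SSE4.1: _mm_dp_ps

    float dot4(const float a[4], const float b[4]) {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        // 0xF1: multiply lanes 0-3 (high nibble 0xF), put the sum in lane 0
        // (low nibble 0x1); lanes 1-3 of the result are zeroed.
        __m128 d = _mm_dp_ps(va, vb, 0xF1);
        return _mm_cvtss_f32(d);  // extract lane 0
    }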

SIMD
Recall: tmp = Σ (i = 0…3) of xmm0[i] * xmm1[i] * imm[i+4]; then xmm2[i] = imm[i] ? tmp : 0.
Why might you want to mask out inputs using imm[7:4]?
- You might only want vectors of size 2 or 3 and not 4.
- You may have vectors of size 6; you do 4 and then 2.
Nobody wants to program in assembly language. The intrinsics call is res = _mm_dp_ps(opA, opB, imm);
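For instance, masking lets a 3-element dot product ignore whatever garbage sits in lane 3. A small sketch (dot3 is my name for it), using input mask 0x7 and broadcast mask 0x1:

    #include <smmintrin.h>  // SSE4.1: _mm_dp_ps

    float dot3(__m128 a, __m128 b) {
        // 0x71: imm[7:4] = 0x7 multiplies lanes 0-2 only; imm[3:0] = 0x1
        // places the sum in lane 0. Lane 3 never enters the computation.
        return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71));
    }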

What's good about SIMD
A cheap, simple, power-efficient way to get parallelism.
- Cheap: just add a few new instructions to an existing core. It's easy to turn a 64b adder into four 16b adders, and it's not hard to widen the FPU datapath.
- Simple: it's still one thread, so there are no critical-section issues. SIMD is easier to program than multithreading, with many fewer weird corner-case bugs.
- Power-efficient: one instruction launches many computations, which saves the energy of decoding lots of instructions.

Matrix multiply with DPPS
The usual question: the compute units sound good, but how do you get data to them? Consider a 4x4 matrix multiply:
P = A * B, i.e., P[r][c] = Σ (k = 0…3) of A[r][k] * B[k][c]
Matrix multiply is just a lot of vector dot products, so we should be able to use DPPS. Time for some details.

Data storage
How should we store our matrices to use DPPS? What about the normal way (row major)? We would store each row of a matrix in a single XMM register: the 4x4 matrix A becomes four XMM registers, one per row (each register holds A[r][0]…A[r][3]).

Data storage
Does this layout work for matrix multiply? No. DPPS can grab a row of A, but it cannot grab a column of B.
Would it help to store each matrix column in a register instead? No. Then DPPS could access B but not A.
Any clever ideas?

Data storage
How about this: store the rows of A in 4 XMM registers, and store the columns of B in 4 more XMM registers.
Now can we use DPPS? Well, let's find out.

In-class exercise
Can you fill in the rest of this matrix-multiply code? Assume A, B and P are 4x4 matrices, each implemented as an array of 4 XMM registers (4 packed floats per register).

    #include <smmintrin.h>  // SSE4.1: _mm_dp_ps

    __m128 A[4], B[4], P[4];
    // Assume A and P are stored with one XMM register per row,
    // and B is stored with one XMM register per column.
    for (int r = 0; r < 4; r++) {
        P[r] = _mm_setzero_ps();
        for (int c = 0; c < 4; c++) {
            // Multiply all 4 lanes (imm[7:4] = 0xF) and place the dot product
            // in lane c (imm[3:0] = 1<<c); the other lanes come back as 0,
            // so OR-ing accumulates the four dot products into one row of P.
            unsigned imm = 0xF0 | (1u << c);
            P[r] = _mm_or_ps(P[r], _mm_dp_ps(A[r], B[c], imm));
        }
    }
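One wrinkle: the immediate operand of _mm_dp_ps must be a compile-time constant, so the loop above is really pseudocode. A compilable version (a sketch; matmul4 is my name for it) simply unrolls the inner loop:

    #include <smmintrin.h>  // SSE4.1: _mm_dp_ps

    // P = A * B for 4x4 floats: A and P hold one __m128 per row, B one per column.
    void matmul4(const __m128 A[4], const __m128 B[4], __m128 P[4]) {
        for (int r = 0; r < 4; r++) {
            __m128 p =       _mm_dp_ps(A[r], B[0], 0xF1);   // dot product -> lane 0
            p = _mm_or_ps(p, _mm_dp_ps(A[r], B[1], 0xF2));  // -> lane 1
            p = _mm_or_ps(p, _mm_dp_ps(A[r], B[2], 0xF4));  // -> lane 2
            p = _mm_or_ps(p, _mm_dp_ps(A[r], B[3], 0xF8));  // -> lane 3
            P[r] = p;
        }
    }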

Setup
How do we get A stored in our registers in rows, but B in columns? Write some code to do a matrix transpose.
MMX, SSE and AVX have instructions (pack/unpack, shuffle) that help with matrix transpose.
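For the 4x4 case, SSE's xmmintrin.h even ships a helper macro, _MM_TRANSPOSE4_PS, built from exactly those shuffle/unpack instructions. A hedged sketch, assuming a row-major float b[4][4]:

    #include <xmmintrin.h>  // SSE: _mm_loadu_ps, _MM_TRANSPOSE4_PS

    float b[4][4];  // assumed row-major source matrix
    __m128 B0 = _mm_loadu_ps(b[0]);
    __m128 B1 = _mm_loadu_ps(b[1]);
    __m128 B2 = _mm_loadu_ps(b[2]);
    __m128 B3 = _mm_loadu_ps(b[3]);
    // After the macro, B0..B3 hold the columns of b -- ready for DPPS.
    _MM_TRANSPOSE4_PS(B0, B1, B2, B3);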

Gather/scatter
We could avoid transposing the matrix if we had instructions to gather data from the columns. Gather is available as of AVX2, but arguably doesn't work that well: each lane may touch a different cache line, so the hardware must iterate internally to read more than one line. The matching scatter only appeared as of AVX-512, and has the same issues as the gather instruction.
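For reference, a hedged sketch of what a gather looks like (the function name and the 8x8 row-major matrix m are my assumptions, not from the lecture): each lane loads from base + index*scale, so one call pulls in an entire column.

    #include <immintrin.h>  // AVX2: _mm256_i32gather_ps

    // Gather column c of an 8x8 row-major float matrix m into one YMM register:
    // lane i loads m[i*8 + c].
    __m256 gather_column(const float *m, int c) {
        __m256i idx = _mm256_setr_epi32(0*8 + c, 1*8 + c, 2*8 + c, 3*8 + c,
                                        4*8 + c, 5*8 + c, 6*8 + c, 7*8 + c);
        return _mm256_i32gather_ps(m, idx, sizeof(float));  // 8 loads, up to 8 cache lines
    }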

SIMD summary
The good:
- SIMD is cheaper to implement, easier to program (since there's only one thread), and more power-efficient than the alternatives.
The bad:
- There's no special instruction for every pattern; e.g., there's no instruction to build 4 histograms at once. The common cases have been parallelized, but not everything.
- The number of elements in a vector is encoded in the instruction, which makes it hard to have an orthogonal instruction set. (They're starting to fix this with mask bits, but those are usually an immediate field, and so must be constant.)
The state of SIMD:
- Compilers now use AVX reasonably well.
- It's also been inserted by hand into various libraries.
- You can put it into your C++ code using intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined