CS4961 Parallel Programming Lecture 6: SIMD Parallelism in SSE-3 (Mary Hall, September 10, 2010)



Administrative
- Programming assignment 1 to be posted Friday, Sept. 11
- Due Tuesday, September 22, before class
  - Use the "handin" program on the CADE machines
  - Use the following command: "handin cs4961 prog1 "
- Mailing list set up:
- Sriram office hours:
  - MEB 3115, Mondays and Wednesdays, 2-3 PM

Homework 2

Problem 1 (#10 in text on p. 111): The Red/Blue computation simulates two interactive flows. An n x n board is initialized so cells have one of three colors: red, white, and blue, where white is empty, red moves right, and blue moves down. Colors wrap around to the opposite side when reaching the edge. In the first half step of an iteration, any red color can move right one cell if the cell to the right is unoccupied (white). On the second half step, any blue color can move down one cell if the cell below it is unoccupied. The case where red vacates a cell (first half) and blue moves into it (second half) is okay. Viewing the board as overlaid with t x t tiles (where t divides n evenly), the computation terminates if any tile's colored squares are more than c% one color. Use Peril-L to write a solution to the Red/Blue computation.

Homework 2, cont.

Problem 2: For the following task graphs, determine the following:
(1) Maximum degree of concurrency.
(2) Critical path length.
(3) Maximum achievable speedup over one process, assuming an arbitrarily large number of processes is available.
(4) The minimum number of processes needed to obtain the maximum possible speedup.
(5) The maximum achievable speedup if the number of processes is limited to (a) 2 and (b) 8.

Today's Lecture
- SIMD for multimedia extensions (SSE-3 and AltiVec)
- Sources for this lecture:
  - Jaewook Shin, http://www-unix.mcs.anl.gov/~jaewook/slides/vectorization-uchicago.ppt
  - Some of the above from "Exploiting Superword Level Parallelism with Multimedia Instruction Sets", Larsen and Amarasinghe (PLDI 2000).
- References:
  - "Intel Compilers: Vectorization for Multimedia Extensions"
  - Programmer's guide: 66.pdf
  - List of SSE intrinsics: 1.htm

Scalar vs. SIMD in Multimedia Extensions

Multimedia Extensions
- At the core of multimedia extensions:
  - SIMD parallelism
  - Variable-sized data fields: vector length = register width / type size
- Example: AltiVec. One 128-bit "wide unit" register can be treated as sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands.
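As a concrete illustration (a sketch of mine, not from the slides), the same idea on SSE2: one 128-bit register can be added as sixteen 8-bit, eight 16-bit, or four 32-bit lanes, so the effective vector length is the register width divided by the element size.

  #include <emmintrin.h>  /* SSE2 intrinsics */
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      /* One 128-bit value, viewed three different ways. */
      __m128i a = _mm_set1_epi8(1);         /* sixteen 8-bit lanes, each = 1 */
      __m128i b = _mm_set1_epi8(2);

      __m128i sum8  = _mm_add_epi8(a, b);   /* 16 adds of  8-bit elements */
      __m128i sum16 = _mm_add_epi16(a, b);  /*  8 adds of 16-bit elements */
      __m128i sum32 = _mm_add_epi32(a, b);  /*  4 adds of 32-bit elements */

      /* All three produce the same bits here (no carries cross lanes),
         but the hardware interprets lane boundaries differently. */
      uint8_t out[16];
      _mm_storeu_si128((__m128i *)out, sum8);
      printf("first byte = %d\n", out[0]);  /* prints 3 */
      (void)sum16; (void)sum32;
      return 0;
  }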

Multimedia / Scientific Applications
- Image
  - Graphics: 3D games, movies
  - Image recognition
  - Video encoding/decoding: JPEG, MPEG-4
- Sound
  - Encoding/decoding: IP phone, MP3
  - Speech recognition
  - Digital signal processing: cell phones
- Scientific applications
  - Double-precision matrix-matrix multiplication (DGEMM)
  - Y[] = a*X[] + Y[] (SAXPY)
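To illustrate the last bullet (my sketch, not from the slides), SAXPY maps directly onto 4-wide single-precision SIMD: broadcast the scalar a, then multiply and add four elements per iteration. The sketch assumes n is a multiple of 4 and the arrays are 16-byte aligned.

  #include <xmmintrin.h>  /* SSE intrinsics */

  /* y[i] = a * x[i] + y[i]; assumes n % 4 == 0 and 16-byte-aligned arrays */
  void saxpy_sse(int n, float a, const float *x, float *y) {
      __m128 va = _mm_set1_ps(a);             /* broadcast a into all 4 lanes */
      for (int i = 0; i < n; i += 4) {
          __m128 vx = _mm_load_ps(&x[i]);     /* four x values */
          __m128 vy = _mm_load_ps(&y[i]);     /* four y values */
          vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
          _mm_store_ps(&y[i], vy);
      }
  }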

Characteristics of Multimedia Applications
- Regular data access pattern
  - Data items are contiguous in memory
- Short data types
  - 8, 16, 32 bits
- Data streaming through a series of processing stages
  - Some temporal reuse for such data streams
- Sometimes...
  - Many constants
  - Short iteration counts
  - Requires saturation arithmetic

Programming Multimedia Extensions
- Language extension
  - Programming interface similar to a function call
  - C: built-in functions; Fortran: intrinsics
  - Most native compilers support their own multimedia extensions
    - GCC: -faltivec, -msse2
    - AltiVec: dst = vec_add(src1, src2);
    - SSE2: dst = _mm_add_ps(src1, src2);
    - BG/L: dst = __fpadd(src1, src2);
    - No standard!
- Need automatic compilation
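To make the portability problem concrete (a sketch of mine, not from the slides), here is the same four-wide float add written once with the SSE intrinsic named above and once with GCC's generic vector extension; each compiler family has its own spelling, which is why automatic vectorization is attractive.

  #include <xmmintrin.h>   /* SSE intrinsics */

  /* Intel/AMD spelling: explicit SSE intrinsic */
  __m128 add_sse(__m128 a, __m128 b) {
      return _mm_add_ps(a, b);           /* four single-precision adds */
  }

  /* GCC-specific spelling: generic vector extension */
  typedef float v4sf __attribute__((vector_size(16)));
  v4sf add_gcc(v4sf a, v4sf b) {
      return a + b;                      /* compiler emits the SIMD add */
  }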

Programming Complexity Issues
- High level: use the compiler
  - May not always be successful
- Low level: use intrinsics or inline assembly
  - Tedious and error prone
- Data must be aligned and adjacent in memory
  - Unaligned data may produce incorrect results
  - May need to copy to get adjacency (overhead)
- Control flow introduces complexity and inefficiency
- Exceptions may be masked
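A small sketch of mine (not from the slides) of what "aligned" means in practice with SSE: allocate 16-byte-aligned storage and use the aligned load/store forms; for data whose alignment you cannot guarantee, the unaligned forms are safe but historically slower.

  #include <xmmintrin.h>   /* SSE intrinsics; also declares _mm_malloc/_mm_free */

  void scale(float *a, float *b, int n) {
      /* a is assumed 16-byte aligned; b may not be */
      for (int i = 0; i < n; i += 4) {
          __m128 va = _mm_load_ps(&a[i]);    /* aligned load: address % 16 == 0 */
          __m128 vb = _mm_loadu_ps(&b[i]);   /* unaligned load: no requirement  */
          _mm_store_ps(&a[i], _mm_mul_ps(va, vb));
      }
  }

  int main(void) {
      int n = 64;
      /* 16-byte-aligned allocation so aligned loads/stores are legal */
      float *a = _mm_malloc(n * sizeof(float), 16);
      float *b = _mm_malloc(n * sizeof(float), 16);
      for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
      scale(a, b, n);
      _mm_free(a);
      _mm_free(b);
      return 0;
  }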

1. Independent ALU Ops

  R = R + XR * …
  G = G + XG * …
  B = B + XB * …

These three independent operations can be packed into a single SIMD operation:

  [R]   [R]   [XR]   […]
  [G] = [G] + [XG] * […]
  [B]   [B]   [XB]   […]

2. Adjacent Memory References

  R = R + X[i+0]
  G = G + X[i+1]
  B = B + X[i+2]

The adjacent loads combine into one wide memory reference:

  [R]   [R]
  [G] = [G] + X[i:i+2]
  [B]   [B]

3. Vectorizable Loops

  for (i=0; i<100; i+=1)
    A[i+0] = A[i+0] + B[i+0]

3. Vectorizable Loops

Unroll the loop by 4:

  for (i=0; i<100; i+=4)
    A[i+0] = A[i+0] + B[i+0]
    A[i+1] = A[i+1] + B[i+1]
    A[i+2] = A[i+2] + B[i+2]
    A[i+3] = A[i+3] + B[i+3]

Then replace the four statements with a single superword operation:

  for (i=0; i<100; i+=4)
    A[i:i+3] = A[i:i+3] + B[i:i+3]

4. Partially Vectorizable Loops

  for (i=0; i<16; i+=1)
    L = A[i+0] - B[i+0]
    D = D + abs(L)

4. Partially Vectorizable Loops

Unroll the loop by 2:

  for (i=0; i<16; i+=2)
    L = A[i+0] - B[i+0]
    D = D + abs(L)
    L = A[i+1] - B[i+1]
    D = D + abs(L)

The subtraction vectorizes, but the abs() and the reduction into D remain scalar:

  for (i=0; i<16; i+=2)
    [L0 L1] = A[i:i+1] - B[i:i+1]
    D = D + abs(L0)
    D = D + abs(L1)
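For reference, a sketch of mine (not from the slides) of how the whole loop above could be done four-wide in SSE: abs() is implemented by clearing the sign bit with a mask, the partial sums stay vectorized inside the loop, and a single horizontal reduction at the end produces D. This assumes n is a multiple of 4 and the arrays are 16-byte aligned.

  #include <emmintrin.h>   /* SSE2 intrinsics */

  /* D = sum over i of |A[i] - B[i]| */
  float sum_abs_diff(const float *A, const float *B, int n) {
      /* AND-ing with this mask clears the sign bit in each lane -> abs() */
      const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
      __m128 vd = _mm_setzero_ps();            /* four partial sums */
      for (int i = 0; i < n; i += 4) {
          __m128 diff = _mm_sub_ps(_mm_load_ps(&A[i]), _mm_load_ps(&B[i]));
          vd = _mm_add_ps(vd, _mm_and_ps(diff, abs_mask));
      }
      /* horizontal reduction of the four partial sums into one scalar */
      float tmp[4];
      _mm_storeu_ps(tmp, vd);
      return tmp[0] + tmp[1] + tmp[2] + tmp[3];
  }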

Exploiting SLP with SIMD Execution
- Benefit:
  - Multiple ALU ops → one SIMD op
  - Multiple ld/st ops → one wide memory op
- Cost:
  - Packing and unpacking
  - Reshuffling within a register
  - Alignment overhead

Packing/Unpacking Costs

  C = A + 2
  D = B + 3

As one SIMD operation:

  [C]   [A]   [2]
  [D] = [B] + [3]

Packing/Unpacking Costs
- Packing source operands
  - Copying into contiguous memory: A and B are produced separately by f() and g(), so they must first be packed into one register [A; B].

  A = f()
  B = g()
  C = A + 2
  D = B + 3

  [C]   [A]   [2]
  [D] = [B] + [3]

Packing/Unpacking Costs
- Packing source operands
  - Copying into contiguous memory
- Unpacking destination operands
  - Copying back to their original locations: C and D must be unpacked from [C; D] before the scalar uses E = C / 5 and F = D * 7.

  A = f()
  B = g()
  C = A + 2
  D = B + 3
  E = C / 5
  F = D * 7

  [C]   [A]   [2]
  [D] = [B] + [3]
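A sketch of mine (not from the slides) of what packing and unpacking look like as actual SSE instructions: _mm_set_ps gathers the scalars into a register (packing), and copying lanes back out after the SIMD add is the unpacking; these data movements are pure overhead relative to the two scalar adds they replace.

  #include <xmmintrin.h>

  void example(float A, float B, float *E, float *F) {
      /* Packing: copy the scalar operands into contiguous SIMD lanes */
      __m128 src = _mm_set_ps(0.0f, 0.0f, B, A);      /* lanes: [A, B, 0, 0] */
      __m128 cst = _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f);

      /* One SIMD op computes C = A + 2 and D = B + 3 together */
      __m128 res = _mm_add_ps(src, cst);

      /* Unpacking: copy C and D back out so scalar code can use them */
      float tmp[4];
      _mm_storeu_ps(tmp, res);
      *E = tmp[0] / 5.0f;     /* E = C / 5 */
      *F = tmp[1] * 7.0f;     /* F = D * 7 */
  }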

Alignment
- Most multimedia extensions require aligned memory accesses.
- What is an aligned memory access?
  - A memory access is aligned to a 16-byte boundary if the address is a multiple of 16.
  - Example: for 16-byte memory accesses in AltiVec, the last 4 bits of the address are ignored.
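A two-line sketch of mine capturing the arithmetic in the bullet above: an address is 16-byte aligned exactly when its last 4 bits are zero.

  #include <stdint.h>

  /* 1 if p lies on a 16-byte boundary (last 4 address bits are 0) */
  static int is_aligned16(const void *p) {
      return ((uintptr_t)p & 0xF) == 0;   /* same as (uintptr_t)p % 16 == 0 */
  }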

Alignment Code Generation
- Aligned memory access
  - The address is always a multiple of 16 bytes
  - Just one superword load or store instruction

  float a[64];
  for (i=0; i<64; i+=4)
    Va = a[i:i+3];

Alignment Code Generation (cont.)
- Misaligned memory access
  - The address is always a non-zero constant offset away from the 16-byte boundaries.
  - Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge.

  float a[64];
  for (i=0; i<60; i+=4)
    Va = a[i+2:i+5];

  becomes

  float a[64];
  for (i=0; i<60; i+=4)
    V1 = a[i:i+3];
    V2 = a[i+4:i+7];
    Va = merge(V1, V2, 8);

Statically align loop iterations

  float a[64];
  for (i=0; i<60; i+=4)
    Va = a[i+2:i+5];

  becomes

  float a[64];
  Sa2 = a[2]; Sa3 = a[3];
  for (i=2; i<62; i+=4)
    Va = a[i+2:i+5];

Alignment Code Generation (cont.)
- Unaligned memory access
  - The offset from 16-byte boundaries is varying, or not enough information is available.
  - Dynamic alignment: the merging point is computed at run time.

  float a[64];
  for (i=0; i<60; i++)
    Va = a[i:i+3];

  becomes

  float a[64];
  for (i=0; i<60; i++)
    V1 = a[i:i+3];
    V2 = a[i+4:i+7];
    align = (&a[i:i+3]) % 16;
    Va = merge(V1, V2, align);
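For comparison (my addition, not from the slides), SSE also provides an unaligned load instruction, so a compiler or programmer can sidestep the explicit two-load-and-merge sequence at the cost of a potentially slower memory access:

  #include <xmmintrin.h>

  /* Reads four floats starting at an arbitrary (possibly unaligned) address. */
  __m128 load_any(const float *p) {
      return _mm_loadu_ps(p);   /* movups: no 16-byte alignment requirement */
  }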

SIMD in the Presence of Control Flow

  for (i=0; i<16; i++)
    if (a[i] != 0)
      b[i]++;

  becomes

  for (i=0; i<16; i+=4) {
    pred = a[i:i+3] != (0, 0, 0, 0);
    old = b[i:i+3];
    new = old + (1, 1, 1, 1);
    b[i:i+3] = SELECT(old, new, pred);
  }

Overhead: both control flow paths are always executed!
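A sketch of mine of how the SELECT above can be spelled with SSE2 integer intrinsics: the comparison produces an all-ones/all-zeros mask per lane, and AND/ANDNOT blend the "new" and "old" values. (SSE4.1 adds a dedicated blend instruction, but this masking idiom works on plain SSE2.)

  #include <emmintrin.h>

  /* b[i]++ wherever a[i] != 0; assumes n % 4 == 0 (unaligned accesses for simplicity) */
  void incr_if_nonzero(const int *a, int *b, int n) {
      const __m128i zero = _mm_setzero_si128();
      const __m128i ones = _mm_set1_epi32(1);
      for (int i = 0; i < n; i += 4) {
          __m128i va  = _mm_loadu_si128((const __m128i *)&a[i]);
          __m128i vb  = _mm_loadu_si128((const __m128i *)&b[i]);
          /* eqz lane = all ones where a[i] == 0, else all zeros */
          __m128i eqz = _mm_cmpeq_epi32(va, zero);
          __m128i incremented = _mm_add_epi32(vb, ones);
          /* SELECT(old, new, pred): keep old where a==0, take new where a!=0 */
          __m128i result = _mm_or_si128(_mm_and_si128(eqz, vb),
                                        _mm_andnot_si128(eqz, incremented));
          _mm_storeu_si128((__m128i *)&b[i], result);
      }
  }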

An Optimization: Branch-On-Superword-Condition-Code

  for (i=0; i<16; i+=4) {
    pred = a[i:i+3] != (0, 0, 0, 0);
    branch-on-none(pred) L1;
    old = b[i:i+3];
    new = old + (1, 1, 1, 1);
    b[i:i+3] = SELECT(old, new, pred);
  L1:
  }
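SSE has no branch-on-none instruction as such, but _mm_movemask_epi8 collapses a lane mask into an ordinary integer that a normal branch can test, which gives the same effect. A sketch of mine building on the previous example:

  #include <emmintrin.h>

  void incr_if_nonzero_bosc(const int *a, int *b, int n) {
      const __m128i zero = _mm_setzero_si128();
      const __m128i ones = _mm_set1_epi32(1);
      for (int i = 0; i < n; i += 4) {
          __m128i va  = _mm_loadu_si128((const __m128i *)&a[i]);
          /* pred lane = all ones where a[i] != 0 */
          __m128i neq = _mm_andnot_si128(_mm_cmpeq_epi32(va, zero),
                                         _mm_set1_epi32(-1));
          /* "branch-on-none": skip the vector update if no lane satisfies the predicate */
          if (_mm_movemask_epi8(neq) == 0)
              continue;
          __m128i vb  = _mm_loadu_si128((const __m128i *)&b[i]);
          __m128i incremented = _mm_add_epi32(vb, ones);
          __m128i result = _mm_or_si128(_mm_andnot_si128(neq, vb),
                                        _mm_and_si128(neq, incremented));
          _mm_storeu_si128((__m128i *)&b[i], result);
      }
  }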

Control Flow
- Not likely to be supported in today's commercial compilers
  - Increases complexity of the compiler
  - Potential for slowdown
  - Performance is dependent on input data
- Many are of the opinion that SIMD is not a good programming model when there is control flow.
- But speedups are possible!

Nuts and Bolts
- What does a piece of code really look like?

  for (i=0; i<100; i+=4)
    A[i:i+3] = B[i:i+3] + C[i:i+3]

  for (i=0; i<100; i+=4) {
    __m128 btmp = _mm_load_ps(&B[i]);        /* load four floats from B    */
    __m128 ctmp = _mm_load_ps(&C[i]);        /* load four floats from C    */
    __m128 atmp = _mm_add_ps(btmp, ctmp);    /* four single-precision adds */
    _mm_store_ps(&A[i], atmp);               /* store four floats to A     */
  }

Wouldn't you rather use a compiler?
- The Intel compiler is pretty good
  - icc -msse3 -vec-report3
  - Get feedback on why loops were not "vectorized"
- First programming assignment
  - Use the compiler and rewrite code examples to improve vectorization
  - One example: write in low-level intrinsics
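As a sketch of mine of what "let the compiler do it" looks like: with restrict-qualified pointers the compiler can prove the arrays do not overlap and will typically vectorize this loop on its own; gcc can produce similar vectorization feedback with -fopt-info-vec (newer versions) or -ftree-vectorizer-verbose (older versions).

  /* Compiled with something like: icc -O2 -msse3 -vec-report3 add.c
     or:                           gcc -O2 -msse3 -fopt-info-vec add.c  */

  void add_arrays(float * restrict a,
                  const float * restrict b,
                  const float * restrict c, int n) {
      /* restrict tells the compiler a, b, and c do not alias,
         which removes the main obstacle to auto-vectorization */
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }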

Summary of Lecture
- SIMD parallelism for multimedia extensions (SSE-3)
  - Widely available
  - Portable to AMD platforms; similar capability on other platforms
- Parallel execution of:
  - "Vector" or superword operations
  - Memory accesses
  - Partially parallel computations
  - Mixes well with scalar instructions
- Performance issues to watch out for:
  - Alignment of memory accesses
  - Overhead of packing operands
  - Control flow
- Next time:
  - More SSE-3