CS4961 Parallel Programming
Lecture 7: Introduction to SIMD
Mary Hall
September 14, 2010

Homework 2, Due Friday, Sept. 10, 11:59 PM
To submit your homework:
-Submit a PDF file
-Use the “handin” program on the CADE machines
-Use the following command: “handin cs4961 hw2 ”
Problem 1 (based on #1 in text on p. 59): Consider the Try2 algorithm for “count3s” from Figure 1.9 on p. 19 of the text. Assume you have an input array of 1024 elements and 4 threads, and that the input data is evenly split among the four processors so that accesses to the input array are local and have unit cost. Assume the number of appearances of 3 in the elements assigned to each thread is the same constant, which we call NTPT. What is a bound for the memory cost for a particular thread predicted by the CTA, expressed in terms of λ and NTPT?

Homework 2, cont.
Problem 2 (based on #2 in text on p. 59): Now provide a bound for the memory cost for a particular thread predicted by the CTA for the Try4 algorithm of Fig. 1.14 on p. 23 (or Try3, assuming each element is placed on a separate cache line).
Problem 3: For these examples, how is algorithm selection impacted by the value of NTPT?
Problem 4 (in general, not specific to this problem): How is algorithm selection impacted by the value of λ?

Programming Assignment 1, Due Wednesday, Sept. 21 at 11:59 PM
Logistics:
-You’ll use water.eng.utah.edu (a Sun UltraSPARC T2), for which all of you have accounts that match the userid and password of your CADE Linux account.
-Compile using “cc -O3 -xopenmp p01.c”
Write the prefix sum computation from HW1 in OpenMP using the test harness found on the website.
-What is the parallel speedup of your code as reported by the test harness?
-If your code does not speed up, you will need to adjust the parallelism granularity: the amount of work each processor does between synchronization points. You can adjust this by changing the number of threads and the frequency of synchronization. You may also want to think about reducing the parallelism overhead, since the solutions we have discussed introduce a lot of overhead.
-What happens when you try different numbers of threads or different schedules?

Programming Assignment 1, cont.
What to turn in:
-Your source code, so we can see your solution
-A README file that describes at least three variations on the implementation or parameters and the performance impact of those variations
-handin “cs4961 p1 ”
Lab hours:
-Thursday afternoon and Tuesday afternoon

Review: Predominant Parallel Control Mechanisms

SIMD and MIMD Architectures: What’s the Difference?
(Slide source: Grama et al., Introduction to Parallel Computing)

Overview of SIMD Programming
Vector architectures
-Early examples were SIMD supercomputers
TODAY, mostly:
-Multimedia extensions such as SSE and AltiVec
-Graphics and games processors
-Accelerators (e.g., ClearSpeed)
Is there a dominant SIMD programming model?
-Unfortunately, NO!!!
Why not?
-Vector architectures were programmed by scientists
-Multimedia extension architectures are programmed by systems programmers (almost assembly language!)
-GPUs are programmed by games developers (domain-specific libraries)

Scalar vs. SIMD in Multimedia Extensions

Multimedia Extension Architectures
At the core of multimedia extensions:
-SIMD parallelism
-Variable-sized data fields:
 -Vector length = register width / type size

Multimedia / Scientific Applications
Image
-Graphics: 3D games, movies
-Image recognition
-Video encoding/decoding: JPEG, MPEG-4
Sound
-Encoding/decoding: IP phone, MP3
-Speech recognition
-Digital signal processing: cell phones
Scientific applications
-Double-precision matrix-matrix multiplication (DGEMM)
-Y[] = a*X[] + Y[] (SAXPY)

Characteristics of Multimedia Applications
Regular data access pattern
-Data items are contiguous in memory
Short data types
-8, 16, 32 bits
Data streaming through a series of processing stages
-Some temporal reuse for such data streams
Sometimes…
-Many constants
-Short iteration counts
-Requires saturation arithmetic

Why SIMD
+ More parallelism
 +When parallelism is abundant
 +SIMD in addition to ILP
+ Simple design
 +Replicated functional units
 +Small die area
 +No heavily ported register files
+ Die area:
 +MAX-2 (HP): 0.1%
 +VIS (Sun): 3.0%
- Must be explicitly exposed to the hardware
 -By the compiler or by the programmer

Programming Multimedia Extensions
Language extension
-Programming interface similar to a function call
-C: built-in functions; Fortran: intrinsics
-Most native compilers support their own multimedia extensions
 -GCC: -faltivec, -msse2
 -AltiVec: dst = vec_add(src1, src2);
 -SSE2: dst = _mm_add_ps(src1, src2);
 -BG/L: dst = __fpadd(src1, src2);
-No standard! Need automatic compilation
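For concreteness, here is what the SSE built-in style looks like inside a full loop (a minimal sketch; the function and array names are illustrative, and the arrays are assumed 16-byte aligned with a length that is a multiple of 4):

#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* dst[i] = src1[i] + src2[i], four floats at a time. */
void vec_add(float *dst, const float *src1, const float *src2, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(&src1[i]);          /* load 4 packed floats */
        __m128 b = _mm_load_ps(&src2[i]);
        _mm_store_ps(&dst[i], _mm_add_ps(a, b));   /* one SIMD add, one store */
    }
}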

Programming Complexity Issues
High level: use a compiler
-May not always be successful
Low level: use intrinsics or inline assembly
-Tedious and error-prone
Data must be aligned and adjacent in memory
-Unaligned data may produce incorrect results
-May need to copy to get adjacency (overhead)
Control flow introduces complexity and inefficiency
Exceptions may be masked

1. Independent ALU Ops

R = R + XR * …
G = G + XG * …
B = B + XB * …

becomes one superword operation:

[R]   [R]   [XR]
[G] = [G] + [XG] * …
[B]   [B]   [XB]

2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

becomes one wide load:

[R]   [R]
[G] = [G] + X[i:i+2]
[B]   [B]
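A sketch of how such adjacent references map to SSE (names are illustrative; assumes n is a multiple of 3 and X has at least one element of padding after the last triple, since each wide load reads four floats):

#include <xmmintrin.h>

/* Accumulate interleaved R,G,B samples into three running sums. */
void rgb_sums(const float *X, int n, float *R, float *G, float *B) {
    __m128 acc = _mm_setzero_ps();                   /* lanes: (R, G, B, unused) */
    for (int i = 0; i + 2 < n; i += 3)
        acc = _mm_add_ps(acc, _mm_loadu_ps(&X[i]));  /* one wide load covers X[i..i+2]
                                                        plus one padding element */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                       /* unpack the three sums */
    *R = lanes[0]; *G = lanes[1]; *B = lanes[2];
}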

3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

3. Vectorizable Loops

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]

is derived from the unrolled loop:

for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

4. Partially Vectorizable Loops

for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)

4. Partially Vectorizable Loops

for (i=0; i<16; i+=2)
  (L0, L1) = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)

is derived from the unrolled loop:

for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
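One way to express this partial vectorization with SSE (a sketch; names are illustrative, and A and B are assumed 16-byte aligned with 16 elements): the subtraction runs in SIMD, while abs() and the reduction into D stay scalar.

#include <math.h>
#include <xmmintrin.h>

float sum_abs_diff(const float *A, const float *B) {
    float D = 0.0f;
    for (int i = 0; i < 16; i += 4) {
        float L[4];
        _mm_storeu_ps(L, _mm_sub_ps(_mm_load_ps(&A[i]),    /* vectorized part */
                                    _mm_load_ps(&B[i])));
        for (int j = 0; j < 4; j++)
            D += fabsf(L[j]);                              /* scalar part */
    }
    return D;
}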

Exploiting SLP with SIMD Execution
Benefit:
-Multiple ALU ops → one SIMD op
-Multiple load/store ops → one wide memory op
Cost:
-Packing and unpacking
-Reshuffling within a register
-Alignment overhead

Packing/Unpacking Costs

C = A + 2
D = B + 3

becomes

[C]   [A]   [2]
[D] = [B] + [3]

Packing/Unpacking Costs, cont.
Packing source operands
-Copying into contiguous memory

A = f()
B = g()
C = A + 2
D = B + 3

[C]   [A]   [2]
[D] = [B] + [3]

Packing/Unpacking Costs, cont.
Packing source operands
-Copying into contiguous memory
Unpacking destination operands
-Copying back to their original locations

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

[C]   [A]   [2]
[D] = [B] + [3]
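The same example written with SSE makes the packing and unpacking explicit (a sketch; f and g stand in for the slide’s scalar producers):

#include <xmmintrin.h>

extern float f(void), g(void);       /* scalar producers, as on the slide */

float packed_version(void) {
    float A = f(), B = g();
    __m128 ab = _mm_set_ps(0.0f, 0.0f, B, A);               /* pack: lanes (A, B, -, -) */
    __m128 cd = _mm_add_ps(ab, _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f));
    float out[4];
    _mm_storeu_ps(out, cd);                                 /* unpack C and D */
    float C = out[0], D = out[1];
    float E = C / 5.0f, F = D * 7.0f;                       /* scalar consumers */
    return E + F;                                           /* just to use E and F */
}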

Alignment Code Generation
Aligned memory access
-The address is always a multiple of 16 bytes
-Just one superword load or store instruction

float a[64];
for (i=0; i<64; i+=4)
  Va = a[i:i+3];
…
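The aligned case relies on the base address being 16-byte aligned; one way to guarantee that in C (a sketch; aligned_alloc is C11, with _mm_malloc or posix_memalign as common alternatives):

#include <stdlib.h>
#include <xmmintrin.h>

float sum_aligned(size_t n) {        /* n assumed to be a multiple of 4 */
    float *a = aligned_alloc(16, n * sizeof(float));   /* &a[0] % 16 == 0 */
    for (size_t i = 0; i < n; i++) a[i] = 1.0f;
    __m128 vsum = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_load_ps(&a[i]));    /* one aligned load per superword */
    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    free(a);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}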

Alignment Code Generation, cont.
Misaligned memory access
-The address is always a non-zero constant offset away from the 16-byte boundaries
-Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];
…

becomes

float a[64];
for (i=0; i<60; i+=4)
  V1 = a[i:i+3];
  V2 = a[i+4:i+7];
  Va = merge(V1, V2, 8);
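With SSE, the merge for this fixed offset of two floats (8 bytes) can be expressed with a shuffle (a sketch; merge() on the slide is abstract, and the helper name here is illustrative):

#include <xmmintrin.h>

/* Returns a[i+2 : i+5], assuming a is 16-byte aligned, i is a multiple of 4,
   and i+7 is still in bounds. */
__m128 load_offset2(const float *a, int i) {
    __m128 v1 = _mm_load_ps(&a[i]);          /* aligned load of a[i   : i+3] */
    __m128 v2 = _mm_load_ps(&a[i + 4]);      /* aligned load of a[i+4 : i+7] */
    /* take lanes 2,3 of v1 and lanes 0,1 of v2 */
    return _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(1, 0, 3, 2));
}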

Statically align loop iterations

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];

becomes

float a[64];
Sa2 = a[2]; Sa3 = a[3];
for (i=2; i<62; i+=4)
  Va = a[i+2:i+5];

Alignment Code Generation, cont.
Unaligned memory access
-The offset from 16-byte boundaries is varying, or not enough information is available
-Dynamic alignment: the merge point is computed at run time

float a[64];
for (i=0; i<60; i++)
  Va = a[i:i+3];
…

becomes

float a[64];
for (i=0; i<60; i++)
  V1 = a[i:i+3];
  V2 = a[i+4:i+7];
  align = (&a[i:i+3]) % 16;
  Va = merge(V1, V2, align);
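On x86 there is also a hardware alternative to the software merge: an unaligned load instruction (a sketch; this trades the merge code for a load that is slower than an aligned one on older parts and nearly free on recent cores):

#include <xmmintrin.h>

__m128 load_any(const float *a, int i) {
    return _mm_loadu_ps(&a[i]);     /* a[i : i+3] at any alignment */
}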

SIMD in the Presence of Control Flow

for (i=0; i<16; i++)
  if (a[i] != 0)
    b[i]++;

becomes

for (i=0; i<16; i+=4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
}

Overhead: both control-flow paths are always executed!
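The SELECT above is not a single instruction on plain SSE2; it is usually spelled with the and/andnot/or mask idiom (a sketch for this exact loop; a and b are assumed 16-byte aligned int arrays of length 16):

#include <emmintrin.h>   /* SSE2 integer intrinsics */

void incr_where_nonzero(const int *a, int *b) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i one  = _mm_set1_epi32(1);
    for (int i = 0; i < 16; i += 4) {
        __m128i av  = _mm_load_si128((const __m128i *)&a[i]);
        __m128i old = _mm_load_si128((const __m128i *)&b[i]);
        __m128i inc = _mm_add_epi32(old, one);        /* b[i]+1 in every lane */
        __m128i eq0 = _mm_cmpeq_epi32(av, zero);      /* all-ones where a[i] == 0 */
        /* SELECT(old, inc, a != 0): keep old where a == 0, take inc elsewhere */
        __m128i res = _mm_or_si128(_mm_and_si128(eq0, old),
                                   _mm_andnot_si128(eq0, inc));
        _mm_store_si128((__m128i *)&b[i], res);       /* both paths were computed */
    }
}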

An Optimization: Branch-On-Superword-Condition-Code

for (i=0; i<16; i+=4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  branch-on-none(pred) L1;
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
L1:
}
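On SSE2 the branch-on-none test can be approximated with a movemask that summarizes the lane predicates (a sketch extending the select version above):

#include <emmintrin.h>

void incr_where_nonzero_boscc(const int *a, int *b) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i one  = _mm_set1_epi32(1);
    for (int i = 0; i < 16; i += 4) {
        __m128i av  = _mm_load_si128((const __m128i *)&a[i]);
        __m128i eq0 = _mm_cmpeq_epi32(av, zero);
        if (_mm_movemask_epi8(eq0) == 0xFFFF)         /* every a[i..i+3] is zero */
            continue;                                 /* "branch-on-none(pred)" */
        __m128i old = _mm_load_si128((const __m128i *)&b[i]);
        __m128i inc = _mm_add_epi32(old, one);
        __m128i res = _mm_or_si128(_mm_and_si128(eq0, old),
                                   _mm_andnot_si128(eq0, inc));
        _mm_store_si128((__m128i *)&b[i], res);
    }
}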

Control Flow
Not likely to be supported in today’s commercial compilers
-Increases complexity of the compiler
-Potential for slowdown
-Performance is dependent on input data
Many are of the opinion that SIMD is not a good programming model when there is control flow.
But speedups are possible!

Nuts and Bolts
What does a piece of code really look like?

for (i=0; i<100; i+=4)
  A[i:i+3] = B[i:i+3] + C[i:i+3]

for (i=0; i<100; i+=4) {
  __m128 btmp = _mm_load_ps(&B[i]);
  __m128 ctmp = _mm_load_ps(&C[i]);
  __m128 atmp = _mm_add_ps(btmp, ctmp);
  _mm_store_ps(&A[i], atmp);
}

Wouldn’t you rather use a compiler?
The Intel compiler is pretty good
-icc -msse3 -vec-report3
-Get feedback on why loops were not “vectorized”
First programming assignment
-Use the compiler and rewrite code examples to improve vectorization
-One example: write in low-level intrinsics
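One rewrite that often helps the compiler vectorize without intrinsics (a sketch; restrict is C99 and promises the arrays do not alias, which is a common reason loops like the one above are not vectorized automatically):

void add_arrays(float * restrict A, const float * restrict B,
                const float * restrict C, int n) {
    for (int i = 0; i < n; i++)
        A[i] = B[i] + C[i];         /* compiler can emit packed adds */
}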

Next Time
Discuss the Red/Blue computation, problem 10 on page 111 (not assigned, just for discussion)
More on data parallel algorithms