CS4961 Parallel Programming
Lecture 10: Introduction to SIMD
Mary Hall, September 27, 2011

Homework 3, Due Thursday, Sept. 29 before class
To submit your homework:
-Submit a PDF file
-Use the “handin” program on the CADE machines
-Use the following command: “handin cs4961 hw3 ”
Problem 1 (problem 5.8 in Wolfe, 1996):
(a) Find all the data dependence relations in the following loop, including dependence distances:
    for (i=0; i<N; i+=2)
      for (j=i; j<N; j+=2)
        A[i][j] = (A[i-1][j-1] + A[i+2][j-1])/2;
(b) Draw the iteration space of the loop, labeling the iteration vectors, and include an edge for any loop-carried dependence relation.

Homework 3, cont.
Problem 2 (problem 9.19 in Wolfe, 1996): Loop coalescing is a reordering transformation in which a nested loop is combined into a single loop with a larger number of iterations. When coalescing two loops that have a dependence relation with distance vector (d1,d2), what is the distance vector for the dependence relation in the resulting loop?
Problem 3: Provide pseudo-code for a locality-optimized version of the following code:
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        a[i][j] = b[j][i] + c[i][j]*5.0;

Homework 3, cont.
Problem 4: In words, describe how you would optimize the following code for locality using loop transformations.
    for (l=0; l<N; l++)
      for (k=0; k<N; k++) {
        C[k][l] = 0.0;
        for (j=0; j<W; j++)
          for (i=0; i<W; i++)
            C[k][l] += A[k+i][l+j]*B[i][j];
      }

Today’s Lecture
-Conceptual understanding of SIMD
-Practical understanding of SIMD in the context of multimedia extensions
Slide sources:
-Sam Larsen, PLDI 2000
-Jaewook Shin, my former PhD student

Review: Predominant Parallel Control Mechanisms

SIMD and MIMD Architectures: What’s the Difference?
Slide source: Grama et al., Introduction to Parallel Computing

Overview of SIMD Programming
Vector architectures
-Early examples of SIMD supercomputers
TODAY, mostly:
-Multimedia extensions such as SSE and AltiVec
-Graphics and games processors (CUDA, stay tuned)
-Accelerators (e.g., ClearSpeed)
Is there a dominant SIMD programming model?
-Unfortunately, NO!!!
Why not?
-Vector architectures were programmed by scientists
-Multimedia extension architectures are programmed by systems programmers (almost assembly language!)
-GPUs are programmed by games developers (domain-specific libraries)

Scalar vs. SIMD in Multimedia Extensions
Slide source: Sam Larsen

Multimedia Extension Architectures
At the core of multimedia extensions:
-SIMD parallelism
-Variable-sized data fields
-Vector length = register width / type size (for example, a 128-bit register holds 4 32-bit floats or 8 16-bit integers)
Slide source: Jaewook Shin

Multimedia / Scientific Applications
Image
-Graphics: 3D games, movies
-Image recognition
-Video encoding/decoding: JPEG, MPEG4
Sound
-Encoding/decoding: IP phone, MP3
-Speech recognition
-Digital signal processing: cell phones
Scientific applications
-Array-based data-parallel computation, another level of parallelism

Characteristics of Multimedia Applications
Regular data access pattern
-Data items are contiguous in memory
Short data types
-8, 16, 32 bits
Data streaming through a series of processing stages
-Some temporal reuse for such data streams
Sometimes...
-Many constants
-Short iteration counts
-Requires saturation arithmetic
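To illustrate saturation arithmetic, here is a small sketch of my own (not from the slides) using the SSE2 intrinsic _mm_adds_epu8, which adds packed 8-bit pixel values and clamps at 255 instead of wrapping around:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        unsigned char a[16], b[16], c[16];
        for (int i = 0; i < 16; i++) { a[i] = 200; b[i] = 100; }

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_adds_epu8(va, vb);      /* 200 + 100 saturates to 255 */
        _mm_storeu_si128((__m128i *)c, vc);

        printf("%d\n", c[0]);                    /* prints 255, not (300 mod 256) = 44 */
        return 0;
    }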

Why SIMD
+More parallelism
 +When parallelism is abundant
 +SIMD in addition to ILP
+Simple design
 +Replicated functional units
 +Small die area
  +No heavily ported register files
  +Die area: MAX-2 (HP): 0.1%, VIS (Sun): 3.0%
-Must be explicitly exposed to the hardware
 -By the compiler or by the programmer

Programming Multimedia Extensions
Language extension
-Programming interface similar to function calls
-C: built-in functions; Fortran: intrinsics
-Most native compilers support their own multimedia extensions
 -GCC: -faltivec, -msse2
 -AltiVec: dst = vec_add(src1, src2);
 -SSE2: dst = _mm_add_ps(src1, src2);
 -BG/L: dst = __fpadd(src1, src2);
 -No standard!
Need automatic compilation
-PhD student Jaewook Shin wrote a thesis on locality optimizations and control flow optimizations for SIMD multimedia extensions
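As a concrete sketch of my own (not from the slides), this is how the SSE intrinsic style above looks in a complete C function; the function name and the assumption that n is a multiple of 4 are mine:

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    /* Add two float arrays four elements at a time.
     * Assumes n is a multiple of 4; unaligned loads/stores keep the
     * sketch independent of how the arrays were allocated. */
    void vadd(float *dst, const float *src1, const float *src2, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(&src1[i]);
            __m128 b = _mm_loadu_ps(&src2[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(a, b));
        }
    }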

Rest of Lecture
1. Understanding multimedia extension execution
2. Understanding the overheads
3. Understanding how to write code to deal with the overheads
Note: A few people write very low-level code to access this capability. Today the compilers do a pretty good job, at least on relatively simple codes. When we study CUDA, we will see a higher-level notation for SIMD execution on a different architecture.

Exploiting SLP (Superword Level Parallelism) with SIMD Execution
Benefits:
-Multiple ALU ops -> one SIMD op
-Multiple ld/st ops -> one wide memory op
What are the overheads?
-Packing and unpacking
 -Rearranging data so that it is contiguous
-Alignment overhead
 -Accessing data from the memory system so that it is aligned to a “superword” boundary
-Control flow
 -Control flow may require executing all paths

1. Independent ALU Ops
Three independent scalar multiply-adds, each scaling by its own constant:
    R = R + XR * ...
    G = G + XG * ...
    B = B + XB * ...
become a single SIMD multiply-add over packed operands:
    [R G B] = [R G B] + [XR XG XB] * [... ... ...]
Slide source: Sam Larsen
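A hedged SSE sketch of the same idea (mine, not from the slide; c0, c1, c2 are placeholder scale constants, since the slide's constants did not survive the transcript):

    #include <xmmintrin.h>

    /* Pack the three scalar multiply-adds into one SIMD multiply-add.
     * c0, c1, c2 are placeholder constants; one vector lane goes unused. */
    void update_rgb(float *R, float *G, float *B,
                    float XR, float XG, float XB,
                    float c0, float c1, float c2) {
        __m128 rgb   = _mm_set_ps(0.0f, *B, *G, *R);   /* lanes: [R G B _] */
        __m128 xrgb  = _mm_set_ps(0.0f, XB, XG, XR);
        __m128 scale = _mm_set_ps(0.0f, c2, c1, c0);
        rgb = _mm_add_ps(rgb, _mm_mul_ps(xrgb, scale));

        float out[4];
        _mm_storeu_ps(out, rgb);                        /* unpack results */
        *R = out[0]; *G = out[1]; *B = out[2];
    }

Note that the packing (_mm_set_ps) and unpacking here are exactly the overheads discussed later in the lecture; the win comes when the values already live in vector registers.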

2. Adjacent Memory References
Three scalar updates reading adjacent memory locations:
    R = R + X[i+0]
    G = G + X[i+1]
    B = B + X[i+2]
become a single wide load plus one SIMD add:
    [R G B] = [R G B] + X[i:i+2]
Slide source: Sam Larsen

3. Vectorizable Loops
    for (i=0; i<100; i+=1)
      A[i+0] = A[i+0] + B[i+0]
Slide source: Sam Larsen

3. Vectorizable Loops (cont.)
Unroll the loop by the vector length:
    for (i=0; i<100; i+=4)
      A[i+0] = A[i+0] + B[i+0]
      A[i+1] = A[i+1] + B[i+1]
      A[i+2] = A[i+2] + B[i+2]
      A[i+3] = A[i+3] + B[i+3]
then pack the four adjacent statements into one SIMD statement per iteration:
    for (i=0; i<100; i+=4)
      A[i:i+3] = A[i:i+3] + B[i:i+3]
Slide source: Sam Larsen
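For reference, a sketch of what that vectorized loop looks like as real SSE intrinsics (my own illustration; it assumes A and B are 16-byte aligned so the aligned load/store forms are legal):

    #include <xmmintrin.h>

    /* A[i:i+3] = A[i:i+3] + B[i:i+3], four floats per iteration.
     * Assumes A and B are 16-byte aligned and the trip count (100)
     * is a multiple of 4, as in the slide's example. */
    void vec_loop(float *A, const float *B) {
        for (int i = 0; i < 100; i += 4) {
            __m128 va = _mm_load_ps(&A[i]);
            __m128 vb = _mm_load_ps(&B[i]);
            _mm_store_ps(&A[i], _mm_add_ps(va, vb));
        }
    }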

4. Partially Vectorizable Loops
    for (i=0; i<16; i+=1)
      L = A[i+0] - B[i+0]
      D = D + abs(L)
Slide source: Sam Larsen

4. Partially Vectorizable Loops (cont.)
Unroll by 2:
    for (i=0; i<16; i+=2)
      L = A[i+0] - B[i+0]
      D = D + abs(L)
      L = A[i+1] - B[i+1]
      D = D + abs(L)
The subtraction vectorizes, but the abs and the accumulation into D remain scalar:
    for (i=0; i<16; i+=2)
      [L0 L1] = A[i:i+1] - B[i:i+1]
      D = D + abs(L0)
      D = D + abs(L1)
Slide source: Sam Larsen
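A hedged C sketch of this partial vectorization (mine, not from the slide), doing the subtraction four lanes at a time and keeping the abs/accumulate scalar:

    #include <math.h>
    #include <xmmintrin.h>

    /* Partial vectorization: the subtraction runs 4 lanes at a time;
     * the abs() and the accumulation into D are unpacked and kept
     * scalar, which is the overhead this example is pointing at.
     * Assumes A and B each hold 16 floats, as in the slide. */
    float sum_abs_diff(const float *A, const float *B) {
        float D = 0.0f;
        for (int i = 0; i < 16; i += 4) {
            __m128 vl = _mm_sub_ps(_mm_loadu_ps(&A[i]), _mm_loadu_ps(&B[i]));
            float L[4];
            _mm_storeu_ps(L, vl);        /* unpack the vector result */
            for (int k = 0; k < 4; k++)
                D += fabsf(L[k]);        /* scalar abs + accumulate */
        }
        return D;
    }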

Programming Complexity Issues
High level: use the compiler
-May not always be successful
Low level: use intrinsics or inline assembly
-Tedious and error prone
Data must be aligned and adjacent in memory
-Unaligned data may produce incorrect results
-May need to copy to get adjacency (overhead)
Control flow introduces complexity and inefficiency
Exceptions may be masked
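One common way to deal with the alignment requirement is to allocate the data aligned in the first place; a minimal sketch of my own (the function name is invented), using the POSIX allocator:

    #include <stdlib.h>

    /* Allocate a float array on a 16-byte boundary so that aligned SSE
     * loads/stores (_mm_load_ps / _mm_store_ps), which fault on unaligned
     * addresses, are safe to use on it.  _mm_malloc/_mm_free is a common
     * compiler-provided alternative. */
    float *alloc_aligned_floats(size_t n) {
        void *p = NULL;
        if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
            return NULL;
        return (float *)p;
    }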

Packing/Unpacking Costs
Two scalar statements:
    C = A + 2
    D = B + 3
as one SIMD operation:
    [C D] = [A B] + [2 3]
Slide source: Sam Larsen

Packing/Unpacking Costs (cont.)
Packing source operands
-Copying into contiguous memory
    A = f()
    B = g()
    C = A + 2
    D = B + 3
Before the SIMD add [C D] = [A B] + [2 3] can execute, the separately produced values A and B must be packed into one register [A B].

Packing/Unpacking Costs (cont.)
Packing source operands
-Copying into contiguous memory
Unpacking destination operands
-Copying back to their original locations
    A = f()
    B = g()
    C = A + 2
    D = B + 3
    E = C / 5
    F = D * 7
A and B are packed into [A B], the SIMD add computes [C D] = [A B] + [2 3], and then C and D must be unpacked again for the later scalar uses E = C / 5 and F = D * 7.
Slide source: Sam Larsen
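A hedged sketch (mine) that makes the pack/compute/unpack steps of this example explicit in SSE intrinsics; f() and g() are just the slide's placeholder producers:

    #include <xmmintrin.h>

    float f(void);   /* placeholder producers, as on the slide */
    float g(void);

    /* One SIMD add surrounded by packing the scalars A and B and
     * unpacking C and D for their later scalar uses; the extra pack
     * and unpack instructions are the cost being counted here. */
    void pack_unpack_example(float *E, float *F) {
        float A = f();
        float B = g();

        __m128 ab = _mm_set_ps(0.0f, 0.0f, B, A);                  /* pack: [A B _ _] */
        __m128 cd = _mm_add_ps(ab, _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f));

        float out[4];
        _mm_storeu_ps(out, cd);                                    /* unpack C and D */
        float C = out[0], D = out[1];

        *E = C / 5;
        *F = D * 7;
    }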

Alignment Code Generation
Aligned memory access
-The address is always a multiple of 16 bytes
-Just one superword load or store instruction
    float a[64];
    for (i=0; i<64; i+=4)
      Va = a[i:i+3];
    ...

Alignment Code Generation (cont.)
Misaligned memory access
-The address is always a non-zero constant offset away from the 16-byte boundaries.
-Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge.
Original:
    float a[64];
    for (i=0; i<60; i+=4)
      Va = a[i+2:i+5];
    ...
After static alignment:
    float a[64];
    for (i=0; i<60; i+=4) {
      V1 = a[i:i+3];
      V2 = a[i+4:i+7];
      Va = merge(V1, V2, 8);
    }

Statically align loop iterations
Original:
    float a[64];
    for (i=0; i<60; i+=4)
      Va = a[i+2:i+5];
After peeling the first two elements into scalar loads, the remaining vector accesses start on a 16-byte boundary:
    float a[64];
    Sa2 = a[2]; Sa3 = a[3];
    for (i=4; i<64; i+=4)
      Va = a[i:i+3];
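A hedged C sketch of the same peeling idea for a loop whose body is known (mine; the example computation and function name are invented): run scalar iterations until the address is aligned, then use aligned loads, then finish the tail in scalar code.

    #include <stdint.h>
    #include <xmmintrin.h>

    /* Peel scalar iterations until &a[i] is 16-byte aligned, run the
     * bulk of the loop with aligned SSE loads/stores, and handle any
     * leftover tail in scalar code.  The body here just doubles each
     * element. */
    void peel_then_vectorize(float *a, int n) {
        int i = 0;
        while (i < n && ((uintptr_t)&a[i] % 16) != 0) {   /* scalar prologue */
            a[i] = a[i] * 2.0f;
            i++;
        }
        for (; i + 4 <= n; i += 4) {                      /* aligned vector body */
            __m128 v = _mm_load_ps(&a[i]);
            _mm_store_ps(&a[i], _mm_add_ps(v, v));        /* a[i..i+3] *= 2 */
        }
        for (; i < n; i++)                                /* scalar epilogue */
            a[i] = a[i] * 2.0f;
    }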

Alignment Code Generation (cont.)
Unaligned memory access
-The offset from 16-byte boundaries is varying, or not enough information is available.
-Dynamic alignment: the merging point is computed at run time.
Original:
    float a[64];
    start = read();
    for (i=start; i<60; i++)
      Va = a[i:i+3];
After dynamic alignment:
    float a[64];
    start = read();
    for (i=start; i<60; i++) {
      V1 = a[i:i+3];
      V2 = a[i+4:i+7];
      align = (&a[i:i+3]) % 16;
      Va = merge(V1, V2, align);
    }
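On SSE specifically, the compiler (or programmer) can also just issue an explicitly unaligned load when the offset is unknown at compile time; a minimal sketch of my own:

    #include <xmmintrin.h>

    /* Dynamic-alignment case handled with a plain unaligned load:
     * _mm_loadu_ps accepts any address, so no explicit two-load-plus-
     * merge sequence is needed (it may be slower than an aligned
     * _mm_load_ps on older hardware). */
    __m128 load_window(const float *a, int i) {
        return _mm_loadu_ps(&a[i]);
    }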