Data Parallel Computations and Pattern


ITCS 4/5145 Parallel Computing, UNC-Charlotte, B. Wilkinson, slides6c.ppt, Nov 11, 2014

Data Parallel Computations
Same operation performed on different data elements simultaneously, i.e., in parallel and fully synchronously. Particularly convenient because:
• Ease of programming (only one program!).
• Can scale easily to larger problem sizes.
• Many numeric and some non-numeric problems can be cast in a data parallel form.

Single Instruction Multiple Data (SIMD) model
Data parallel model used in vector supercomputer designs in the 1970s:
• Synchronism at the instruction level.
• Each instruction specifies a “vector” operation and the elements of the array to perform the operation on.
• Multiple execution units, each executing the operation on a different element or pair of elements in synchronism.
• Only one instruction fetch/decode unit.
Subsequently seen in Intel processors as vector SSE (Streaming SIMD Extensions) instructions.

(SIMD) Data Parallel Pattern
Could be described as a computational “pattern”:
[Figure: a program sends the same instruction to all execution units at the same time; each execution unit has its own data.]
Each execution unit performs the same operation but on different data in parallel. Usually the data are elements of an array.

SIMD Example
To add the same constant, k, to each element of an array:
for (i = 0; i < N; i++)
    a[i] = a[i] + k;
The statement a[i] = a[i] + k; could be executed simultaneously by multiple processors, each using a different index i (0 <= i < N). As a single vector instruction, this means: add k to all elements a[i], 0 <= i < N.
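The vector form of this loop can be sketched with the SSE intrinsics mentioned earlier. This is only an illustration under stated assumptions: the function name add_constant is hypothetical, the array holds floats, and the SSE path is compiled only when the compiler defines __SSE__ (otherwise the plain loop does all the work).

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: add k to every element of a[0..n-1]. */
void add_constant(float *a, size_t n, float k)
{
    size_t i = 0;
#if defined(__SSE__)
    __m128 vk = _mm_set1_ps(k);               /* k copied into all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);      /* load 4 elements (unaligned) */
        _mm_storeu_ps(&a[i], _mm_add_ps(va, vk)); /* one add covers 4 elements */
    }
#endif
    /* Scalar loop handles the leftover elements (or everything without SSE). */
    for (; i < n; i++)
        a[i] = a[i] + k;
}
```

Each _mm_add_ps call applies the same add to four array elements at once, which is the instruction-level synchronism described for the SIMD model.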

Using forall construct for data parallel pattern
Could use forall to specify data parallel operations:
forall (i = 0; i < n; i++)
    a[i] = a[i] + k;
However, forall is more general: it states that the n instances of the body can be executed simultaneously or in any order (not necessarily at the same time). We shall see this in the GPU implementation of the data parallel pattern. Note that forall does imply synchronism at its end: all instances must complete before continuing, which will be true in GPUs.
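forall is pseudocode notation, not C. One way to approximate it in C, assuming an OpenMP-capable compiler, is a parallel for loop: iterations may run in any order on any thread, and the loop ends with an implicit barrier, matching the synchronism at the end of forall. The function name forall_add_constant is hypothetical. Without OpenMP enabled, the pragma is ignored and the loop simply runs sequentially.

```c
/* Hypothetical sketch: forall (i = 0; i < n; i++) a[i] = a[i] + k; */
void forall_add_constant(int *a, int n, int k)
{
    /* Iterations are independent, so they may execute in any order. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = a[i] + k;
    /* Implicit barrier: every iteration has finished before we return. */
}
```

With GCC or Clang, OpenMP is typically enabled with the -fopenmp flag.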

Data Parallel Example: Prefix Sum Problem
Given a list of numbers, x[0], …, x[n-1], compute all the partial summations, i.e.:
x[0] + x[1];
x[0] + x[1] + x[2];
x[0] + x[1] + x[2] + x[3];
x[0] + x[1] + x[2] + x[3] + x[4];
…
Can also be defined with associative operations other than addition. Widely studied. Practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.
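As a baseline for comparison, the partial sums can be computed sequentially in place with a single pass. This is a minimal sketch; the function name prefix_sum_sequential is hypothetical, and it assumes the result overwrites the input, with x[0] left unchanged as the first partial sum.

```c
#include <stddef.h>

/* Hypothetical helper: after the call, x[i] holds x[0] + ... + x[i]. */
void prefix_sum_sequential(int *x, size_t n)
{
    /* Each element adds the already-accumulated sum to its left. */
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```

This takes n - 1 additions, but each step depends on the previous one, so it offers no parallelism as written.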

Data parallel method for prefix sum operation

Sequential pseudo code (2^j denotes 2 to the power j; log is base 2)
for (j = 0; j < log(n); j++)          // at each step
    for (i = 2^j; i < n; i++)         // accumulate sum
        x[i] = x[i] + x[i - 2^j];     **
Parallel code using forall notation
for (j = 0; j < log(n); j++)          // at each step
    forall (i = 0; i < n; i++)        // accumulate sum
        if (i >= 2^j) x[i] = x[i] + x[i - 2^j];     **
** Note this will not work as written because of data dependences: x[i - 2^j] may already have been overwritten in the current step. In real code, use a temporary array or, in the sequential version, count i downwards.
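The temporary-array fix suggested in the note can be sketched in C as follows. This is an illustration, not the course's own code: the function name prefix_sum_steps is hypothetical, and the inner loop stands in for the forall, since its iterations read only from x and write only to tmp and so are independent of each other.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: in-place inclusive prefix sum using about log2(n)
   data parallel steps. Returns 0 on success, -1 if allocation fails. */
int prefix_sum_steps(int *x, size_t n)
{
    if (n == 0)
        return 0;
    int *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* stride takes the values 2^0, 2^1, 2^2, ... while below n */
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Conceptually a forall: every i reads old x values only. */
        for (size_t i = 0; i < n; i++)
            tmp[i] = (i >= stride) ? x[i] + x[i - stride] : x[i];
        /* All instances are done; publish the results for the next step. */
        memcpy(x, tmp, n * sizeof *x);
    }
    free(tmp);
    return 0;
}
```

Writing into tmp and copying back after each step plays the role of the synchronization at the end of forall: no instance sees a value produced in the same step.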

Low level image processing
Involves manipulating image pixels (picture elements), often applying the same operation to each pixel using neighboring pixel values. The SIMD (single instruction multiple data) model is very applicable. Historically, GPUs were designed for creating image data for displays using this model.
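As one illustration of "the same operation on each pixel using neighboring pixel values", the sketch below applies a 3x3 mean (smoothing) filter. The choice of filter, the function name mean_filter_3x3, the 8-bit grayscale row-major layout, and the decision to copy border pixels unchanged are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper: out[y][x] = mean of the 3x3 neighborhood of in[y][x].
   Images are 8-bit grayscale, stored row by row. Border pixels are copied. */
void mean_filter_3x3(const unsigned char *in, unsigned char *out,
                     int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            size_t idx = (size_t)y * width + x;
            if (y == 0 || y == height - 1 || x == 0 || x == width - 1) {
                out[idx] = in[idx];          /* no full neighborhood here */
                continue;
            }
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += in[(size_t)(y + dy) * width + (x + dx)];
            out[idx] = (unsigned char)(sum / 9);
        }
    }
}
```

Every output pixel is computed from the input image only, so all (x, y) instances are independent and could each be handled by a separate execution unit.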

Single Instruction Multiple Thread Programming Model (SIMT)
A version of SIMD used in recent GPUs:
• Multiple threads, each executing the same instruction sequence.
• Groups of threads are scheduled to execute at the same time on execution cores.
• Very low thread overhead.

SIMT Example -- Matrix Multiplication
Matrix multiplication is easy to make into a data parallel version. Change the two outer for's to forall's:
forall (i = 0; i < n; i++)            // for each row of A
    forall (j = 0; j < n; j++) {      // for each column of B
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
Each instance of the body is a separate thread, doing the same calculation but on different elements of the arrays.

forall (i = 0; i < n; i++)            // for each row of A
    forall (j = 0; j < n; j++) {      // for each column of B
    }
Threads:
First thread (i = 0, j = 0):
    c[0][0] = 0;
    for (k = 0; k < n; k++)
        c[0][0] += a[0][k] * b[k][0];
Last thread (i = n-1, j = n-1):
    c[n-1][n-1] = 0;
    for (k = 0; k < n; k++)
        c[n-1][n-1] += a[n-1][k] * b[k][n-1];
One thread for each c element, doing the same calculation but using different a and b elements.
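The "one thread per c element" idea can be mimicked in plain C by isolating the work of one thread in a function and calling it once per element. This is a sequential stand-in, not GPU code: the names thread_body and matmul_one_thread_per_element, the flat thread index tid, and the row-major 1-D storage of the matrices are assumptions for the sketch.

```c
#include <stddef.h>

/* Work of one conceptual thread: compute the single element c[i][j].
   Matrices are n x n, stored row by row in 1-D arrays. */
static void thread_body(const double *a, const double *b, double *c,
                        size_t n, size_t i, size_t j)
{
    double sum = 0.0;
    for (size_t k = 0; k < n; k++)
        sum += a[i * n + k] * b[k * n + j];
    c[i * n + j] = sum;
}

/* Hypothetical driver: n*n independent instances, one per element of c.
   On a GPU each instance would be a thread; here they run one by one. */
void matmul_one_thread_per_element(const double *a, const double *b,
                                   double *c, size_t n)
{
    for (size_t tid = 0; tid < n * n; tid++)
        thread_body(a, b, c, n, tid / n, tid % n);  /* tid -> (row, column) */
}
```

Because each instance writes a different element of c and only reads a and b, the calls can be made in any order, which is the property forall requires.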

We will explore programming GPUs for high performance computing next. Questions so far?