3- Parallel Programming Models

Parallelization of programs. Decomposition of the computations: the computations of the sequential algorithm are decomposed into tasks; the decomposition step must find a good compromise between the number of tasks and their granularity. Assignment of tasks to processes or threads: a process or a thread represents a flow of control executed by a physical processor or core; the main goal of the assignment step is to assign the tasks such that a good load balance results. Mapping of processes or threads to physical processors or cores: the main goal of the mapping step is to achieve an equal utilization of the processors or cores while keeping the communication between them small.

Levels of parallelism and granularity: instruction level and data level (fine-grained), loop level (medium-grained), and function level (coarse-grained).

Parallelism at instruction level: multiple instructions of a program can be executed in parallel at the same time if they are independent of each other. Types of instruction dependencies: Flow dependency (true dependency): there is a flow dependency from instruction I1 to I2 if I1 computes a result value that is then used by I2 as an operand. Anti-dependency: there is an anti-dependency from I1 to I2 if I1 uses a register or variable as an operand that is later used by I2 to store the result of another computation. Output dependency: there is an output dependency from I1 to I2 if I1 and I2 use the same register or variable to store the results of their computations. Control dependency: there is a control dependency from I1 to I2 if the execution of instruction I2 depends on the outcome of instruction I1.
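A small C illustration (not from the slides) of the four dependency types; the variable names are placeholders.

```c
/* Illustrative statements for the four dependency types; the variable
   names (a1, a2, b, c, x) are placeholders. */
void dependencies(void) {
    int a1, a2, b = 2, c = 3, x = 4;

    a1 = b + c;   /* I1                                                          */
    a2 = a1 * 2;  /* I2: flow (true) dependency on I1 -- reads a1 written by I1  */
    b  = 7;       /* I3: anti-dependency on I1 -- I1 reads b before I3 writes it */
    a1 = c - 1;   /* I4: output dependency on I1 -- both write a1                */
    if (a2 > 0)   /* I5                                                          */
        x = 0;    /* I6: control dependency on I5 -- executes only if I5 is true */
    (void)x;      /* silence unused-variable warning */
}
```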

Data parallelism is used when the same operation must be applied to different elements of a larger data structure (e.g., an array). The elements of the data structure are distributed among the processors, and each processor performs the operation on its assigned elements. Data parallelism is used in: the SIMD model, where extended instructions work on multiple data items held in extended registers; and the MIMD model, mostly implemented as SPMD (Single Program Multiple Data), where a complete program is executed on separate chunks of data. In the SPMD program on the slide, me denotes the processor rank and Reduce denotes a result-gathering operation on processor 0 (see the sketch below).
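A minimal MPI sketch of the SPMD pattern (an assumption of this example, since the slide's program is not reproduced here): every process runs the same code on its own index range, and MPI_Reduce gathers the result on process 0.

```c
/* SPMD data parallelism with MPI: every process runs the same program,
   works on its own chunk of the index range, and a reduction gathers the
   result on process 0 (the "Reduce" mentioned above). Sizes are illustrative,
   and any remainder of N/p is ignored for brevity. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int me, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);   /* me = processor rank      */
    MPI_Comm_size(MPI_COMM_WORLD, &p);    /* p  = number of processes */

    /* each process sums its own block of indices */
    long long local = 0, global = 0;
    for (long long i = me * (long long)(N / p); i < (me + 1) * (long long)(N / p); i++)
        local += i;

    /* result gathering on process 0 */
    MPI_Reduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (me == 0) printf("sum = %lld\n", global);

    MPI_Finalize();
    return 0;
}
```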

Loop parallelism: a loop is usually executed sequentially, which means that the computations of the i-th iteration are not started before all computations of the (i − 1)-th iteration are completed. If there are no dependencies between the iterations of a loop, they can be executed in arbitrary order and in parallel by different processors.
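A minimal OpenMP sketch of loop parallelism for a loop without inter-iteration dependencies; the array size and the scaling operation are illustrative.

```c
/* Independent loop iterations distributed over threads with OpenMP. */
#include <omp.h>

#define N 1000

void scale(double *a, double factor) {
    /* no iteration depends on another, so the iterations may run
       in any order and in parallel on different threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * factor;
}
```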

Functional parallelism: many programs contain parts that are independent of each other and can be executed in parallel. This form of parallelism is called task parallelism or functional parallelism. The tasks and their dependencies are represented as a task graph. A popular technique for dynamic task scheduling is the use of a task pool, in which tasks that are ready for execution are stored and from which processors can retrieve tasks when they have finished the execution of their current task. The task pool concept is particularly useful for shared address space machines (see the sketch below).
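A minimal sketch of task parallelism with OpenMP tasks, whose runtime keeps ready tasks in a pool from which idle threads retrieve work; work_a and work_b are placeholder functions, not names from the slides.

```c
/* Functional (task) parallelism: two independent program parts are
   submitted as tasks; the OpenMP runtime keeps ready tasks in a pool
   from which idle threads fetch work. */
#include <omp.h>

void work_a(void);
void work_b(void);

void run_tasks(void) {
    #pragma omp parallel
    #pragma omp single          /* one thread creates the tasks        */
    {
        #pragma omp task        /* independent task 1                  */
        work_a();
        #pragma omp task        /* independent task 2                  */
        work_b();
        #pragma omp taskwait    /* wait until both tasks have finished */
    }
}
```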

Explicit and implicit parallelism. Implicit parallelism: compiler-based parallelization. Explicit parallelism with implicit distribution: the parallelism is expressed explicitly, but no explicit distribution and assignment to processes or threads is demanded (OpenMP). Explicit distribution: the mapping to processors or cores as well as the communication between processors remains implicit, e.g., BSP (bulk synchronous parallel). Explicit assignment to processors: the communication between the processors does not need to be specified (Linda). Explicit communication and synchronization: the programmer must specify all details of a parallel execution, as in message-passing models (MPI) and thread-based models (Pthreads).

SIMD computations. Vector processors (e.g., the NEC SX-9 series): specialized vector load instructions can be used to load a set of memory elements with consecutive addresses, or with a fixed distance between their memory addresses (called the stride). Sparse matrix operations are also possible using index registers. Vector instructions can be applied to a part of the vector registers by specifying the length of the vector in a vector length register (VLR). Vector instructions can also use scalar registers as operands.

SIMD computations. SIMD instructions for general-purpose processors: Intel MMX (1996) reused the 64-bit floating-point registers for eight 8-bit integer operations or four 16-bit integer operations. The Streaming SIMD Extensions (SSE 1, 2, 3, 4; 1999-2007) added separate 128-bit registers, supporting eight 16-bit integer operations, four 32-bit integer/floating-point operations, or two 64-bit integer/floating-point operations; a double-precision (64-bit) floating-point data type was also added. The Advanced Vector Extensions (AVX, 2010) introduced sixteen 256-bit registers, doubling the operand width. Across these extensions, many DSP and multimedia operations were added, such as addition, subtraction, multiplication, division, square root, minimum, maximum, and rounding.
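A small sketch of such SIMD instructions in use, via SSE intrinsics (an assumption of this example; the slides do not show code): one packed instruction adds four single-precision floats, and n is assumed to be a multiple of 4.

```c
/* Packed single-precision addition with SSE intrinsics: one _mm_add_ps
   adds four floats at once. n is assumed to be a multiple of 4. */
#include <xmmintrin.h>

void add_vec(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* c[i..i+3] = a[i..i+3] + b[i..i+3] */
    }
}
```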

SIMD computations. Graphics Processing Unit (GPU): the CPU is the host, the GPU is the parallel accelerator device. CUDA (Compute Unified Device Architecture) was developed by NVIDIA; OpenCL is a similar open language. All forms of GPU parallelism are unified as CUDA threads, and the programming model is Single Instruction Multiple Thread (SIMT): threads are grouped into warps of 32 threads that execute in lockstep, and warps are organized into thread blocks for execution.

Data distributions for arrays: the decomposition of data and its mapping to different processors is called data distribution, data decomposition, or data partitioning. Examples for two-dimensional arrays are the columnwise (or rowwise) distribution and the checkerboard distribution.
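As a small illustration (not from the slides), the following C helpers compute which process owns a matrix element under a rowwise, columnwise, or checkerboard block distribution, assuming an n x m matrix whose dimensions divide evenly and a row-major pr x pc process grid.

```c
/* Owner (process rank) of element (i, j) under block distributions.
   Assumes n % p == 0, m % p == 0, n % pr == 0, m % pc == 0. */
int owner_rowwise(int i, int n, int p)    { return i / (n / p); }   /* block of rows    */
int owner_columnwise(int j, int m, int p) { return j / (m / p); }   /* block of columns */
int owner_checkerboard(int i, int j, int n, int m, int pr, int pc) {
    return (i / (n / pr)) * pc + (j / (m / pc));  /* row-major pr x pc process grid */
}
```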

Information exchange: shared variables. Shared data can be stored in shared variables, and synchronization operations are provided to ensure that concurrent accesses to the same variable are synchronized. A race condition occurs when the result of the parallel execution of a program part by multiple execution units depends on the order in which the statements are executed by them. Program parts in which race conditions may occur, and which therefore hold the danger of inconsistent values, are called critical sections. Letting only one thread at a time execute a critical section is called mutual exclusion; race conditions can be avoided by a lock mechanism (see the sketch below).
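A minimal Pthreads sketch of mutual exclusion for a critical section; the shared counter and the loop bound are illustrative.

```c
/* Mutual exclusion for a critical section with a Pthreads lock variable. */
#include <pthread.h>

static long counter = 0;                                  /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* lock (mutex) variable */

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread at a time ...       */
        counter++;                    /* ... executes the critical section   */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
```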

Information exchange: communication operations. The exchange of data and information between the processors is performed by communication operations. Single transfer: processor Pi (sender) sends a message to processor Pj (receiver) with j ≠ i. Single-broadcast: one processor sends the same data block to all other processors. Single-accumulation: each processor provides a data block, and the blocks are combined with a reduction operation (e.g., sum or maximum) into a single result on one processor.

Information exchange: communication operations. Gather: one processor collects a potentially different data block from every processor. Scatter: one processor sends a potentially different data block to every processor. Multi-broadcast: every processor sends the same data block to all other processors, so that every processor obtains all blocks.

Information exchange: communication operations. Multi-accumulation: each processor provides a data block for every other processor; the blocks destined for the same processor are combined with a reduction operation, so that every processor obtains an accumulated result. Total exchange: every processor sends a potentially different data block to every other processor. (A possible mapping to MPI collectives is sketched below.)
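One possible mapping of these operations onto MPI collectives; the buffer sizes, the choice of rank 0 as root, the MAXP bound, and the use of MPI_Reduce_scatter_block for multi-accumulation are assumptions of this sketch, not part of the slides.

```c
/* Sketch: communication operations expressed as MPI collectives
   (one double per block, root = rank 0, at most MAXP processes). */
#include <mpi.h>

#define MAXP 64   /* assumed upper bound on the number of processes */

void collectives(void) {
    int me, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);          /* p <= MAXP assumed */

    double x = me, acc = 0.0, mine = 0.0;
    double blocks[MAXP], gathered[MAXP], spread[MAXP];
    for (int i = 0; i < p; i++) blocks[i] = me + i;

    /* single transfer: a point-to-point MPI_Send/MPI_Recv pair (not shown)  */
    /* single-broadcast: rank 0 sends x to all                               */
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* single-accumulation: all values reduced (summed) onto rank 0          */
    MPI_Reduce(&x, &acc, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    /* gather: rank 0 collects one block from every process                  */
    MPI_Gather(&x, 1, MPI_DOUBLE, gathered, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* scatter: rank 0 hands each process its own block                      */
    MPI_Scatter(gathered, 1, MPI_DOUBLE, &mine, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* multi-broadcast: every process ends up with every block               */
    MPI_Allgather(&x, 1, MPI_DOUBLE, gathered, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    /* multi-accumulation: per-destination blocks are reduced, one result
       per process                                                           */
    MPI_Reduce_scatter_block(blocks, &mine, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* total exchange: every process sends a distinct block to every other   */
    MPI_Alltoall(blocks, 1, MPI_DOUBLE, spread, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    (void)acc; (void)spread;
}
```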

Duality of communication operations. A single-broadcast or single-accumulation operation can be implemented using a spanning tree with the sending or receiving processor as root (see the sketch below). A scatter operation can be implemented by a top-down traversal of a spanning tree where each node receives a set of data blocks from its parent node and forwards those that are meant for its subtree to its child nodes. A gather operation can be implemented by a bottom-up traversal of the spanning tree where each node receives a set of data blocks from its child nodes and forwards all of them, including its own data block, to its parent.
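A minimal sketch of the spanning-tree idea for a single-broadcast, built from MPI point-to-point messages over a binary tree rooted at rank 0; the tree shape and the message tag are arbitrary choices of this sketch (in practice MPI_Bcast would be used).

```c
/* Single-broadcast over a binary spanning tree rooted at rank 0:
   each node receives the block from its parent, then forwards it
   to its children. */
#include <mpi.h>

void tree_broadcast(double *block, int count) {
    int me, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (me != 0) {                              /* receive from parent node */
        int parent = (me - 1) / 2;
        MPI_Recv(block, count, MPI_DOUBLE, parent, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    for (int c = 2 * me + 1; c <= 2 * me + 2 && c < p; c++)  /* forward to children */
        MPI_Send(block, count, MPI_DOUBLE, c, 0, MPI_COMM_WORLD);
}
```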

Hierarchy of communication operations

Parallel matrix-vector product (Ab = c). (a) The row-oriented representation of matrix A, with the computation viewed as n scalar products (a_i, b), i = 1, ..., n, leads to a parallel implementation in which each of the p processors computes approximately n/p scalar products. (The slide shows both a distributed-memory and a shared-memory program; a shared-memory sketch follows.)
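A minimal shared-memory sketch of the row-oriented variant with OpenMP (not the program on the slide); A is assumed to be stored row-major.

```c
/* Row-oriented parallel matrix-vector product c = A*b for shared memory:
   the n rows (scalar products (a_i, b)) are divided among the threads. */
#include <omp.h>

void matvec_rows(int n, int m, const double *A, const double *b, double *c) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {          /* each thread gets ~n/p rows */
        double s = 0.0;
        for (int j = 0; j < m; j++)
            s += A[i * m + j] * b[j];      /* scalar product (a_i, b)    */
        c[i] = s;
    }
}
```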

Parallel matrix-vector product (Ab = c). (b) The column-oriented representation of matrix A, with the computation viewed as a linear combination of the columns, leads to a parallel implementation in which each processor computes the contribution of approximately m/p column vectors. (The slide shows a distributed-memory program; a sketch follows.)
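A minimal distributed-memory sketch of the column-oriented variant (not the program on the slide): each process is assumed to hold mloc = m/p columns of A stored column-major plus the matching entries of b, and the partial results are combined with an accumulation operation (MPI_Allreduce).

```c
/* Column-oriented parallel matrix-vector product c = A*b for distributed
   memory: each process forms a partial linear combination of its local
   columns, then the partial vectors are summed across all processes. */
#include <mpi.h>
#include <string.h>

void matvec_cols(int n, int mloc, const double *Aloc, const double *bloc,
                 double *partial, double *c) {
    memset(partial, 0, n * sizeof(double));
    for (int j = 0; j < mloc; j++)                    /* local columns       */
        for (int i = 0; i < n; i++)
            partial[i] += bloc[j] * Aloc[j * n + i];  /* add b_j * a_j       */
    /* multi-accumulation: every process obtains the full result vector c    */
    MPI_Allreduce(partial, c, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```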

Processes and threads. Process: a process comprises the executable program along with all information necessary for its execution, including the program data on the runtime stack and the heap, the current values of the registers, and the content of the program counter. Each process has its own address space. Threads: each process may consist of multiple independent control flows, called threads; the threads of one process share the address space of the process. User-level threads are managed by a thread library without specific support from the operating system; this has the advantage that a switch from one thread to another can be done quite fast, but the operating system cannot map different threads of the same process to different execution resources, and it cannot switch to another thread if one thread executes a blocking I/O operation. Kernel threads are threads that are generated and managed by the operating system. (A Pthreads sketch follows.)
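A minimal Pthreads sketch of creating and joining threads that share their process's address space; the number of threads and the work function are illustrative.

```c
/* Thread creation and termination with POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *work(void *arg) {
    long id = (long)arg;                 /* threads share global data, but each */
    printf("thread %ld running\n", id);  /* has its own stack and registers     */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);  /* start control flows  */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);                      /* wait for termination */
    return 0;
}
```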

Execution models for threads- N:1 mapping

Execution models for threads- 1:1 mapping

Execution models for threads- N:M mapping

Thread states

Synchronization mechanisms. Lock synchronization: lock variables are also called mutex variables, as they help to ensure mutual exclusion; threads that access the same shared data must use the same lock variable to ensure synchronization. Semaphores: a semaphore is a data structure that contains an integer counter s and to which two atomic operations P(s) and V(s) can be applied. The wait operation P(s) blocks until s is nonzero and then decrements it; the signal operation V(s) increments s. (See the sketch below.)
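A minimal sketch of the P and V operations using POSIX semaphores, guarding a pool of K resources; K and the function names are placeholders of this example.

```c
/* A counting semaphore guarding a pool of K resources:
   sem_wait is the P operation, sem_post the V operation. */
#include <semaphore.h>

#define K 3
static sem_t slots;                 /* counter s, initialized to K */

void init_pool(void) { sem_init(&slots, 0, K); }

void use_resource(void) {
    sem_wait(&slots);               /* P(s): block until s > 0, then s-- */
    /* ... use one of the K resources ... */
    sem_post(&slots);               /* V(s): s++                          */
}
```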

Synchronization mechanisms. Thread execution control: barrier synchronization defines a synchronization point at which each thread must wait until all other threads have also reached this point. The barrier state is initialized to "stop" by the first thread entering the barrier; each arriving thread checks the number of threads already in the barrier, and only the last one sets the barrier state to "pass" so that all threads can leave the barrier. Condition synchronization: a thread T1 is blocked until a given condition has been established. Because the condition may no longer hold when T1 becomes executable and running again, it must be re-checked before T1 proceeds (see the sketch below).
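A minimal Pthreads sketch of condition synchronization; the ready flag and the function names are placeholders. Note the while loop that re-checks the condition after the waiting thread is woken up.

```c
/* Condition synchronization with a Pthreads condition variable. */
#include <pthread.h>

static int ready = 0;                          /* the condition */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void wait_until_ready(void) {       /* thread T1 */
    pthread_mutex_lock(&m);
    while (!ready)                  /* condition may no longer hold: re-check */
        pthread_cond_wait(&cv, &m); /* atomically releases m while waiting    */
    pthread_mutex_unlock(&m);
}

void establish_condition(void) {    /* another thread */
    pthread_mutex_lock(&m);
    ready = 1;                      /* establish the condition */
    pthread_cond_signal(&cv);       /* wake a waiting thread   */
    pthread_mutex_unlock(&m);
}
```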

Deadlock occurs when program execution reaches a state in which each thread waits for an event that can only be caused by another thread (see the sketch below).
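A minimal Pthreads sketch of how such a state can arise: two threads acquire the same two locks in opposite order, so each may end up waiting for a lock the other holds.

```c
/* Classic deadlock scenario: opposite lock acquisition order. */
#include <pthread.h>

static pthread_mutex_t la = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lb = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg) {
    pthread_mutex_lock(&la);
    pthread_mutex_lock(&lb);   /* may block forever if thread2 holds lb */
    pthread_mutex_unlock(&lb);
    pthread_mutex_unlock(&la);
    return NULL;
}

void *thread2(void *arg) {
    pthread_mutex_lock(&lb);
    pthread_mutex_lock(&la);   /* may block forever if thread1 holds la */
    pthread_mutex_unlock(&la);
    pthread_mutex_unlock(&lb);
    return NULL;
}
```

Acquiring all locks in one globally fixed order breaks the cycle and avoids this deadlock.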

Parallel programming approaches. Popular and standardized: MPI, Pthreads, Java threads, OpenMP. Less widespread but more user-friendly: Unified Parallel C (UPC); the DARPA HPCS (High Productivity Computing Systems) programming languages Fortress (Sun Microsystems), X10 (IBM), and Chapel (Cray Inc.); and Global Arrays (GA).