3- Parallel Programming Models
Parallelization of programs
Decomposition of the computations: the computations of the sequential algorithm are decomposed into tasks. The decomposition step must find a good compromise between the number of tasks and their granularity.
Assignment of tasks to processes or threads: a process or a thread represents a flow of control executed by a physical processor or core. The main goal of the assignment step is to assign the tasks such that a good load balance results.
Mapping of processes or threads to physical processors or cores: the main goal of the mapping step is an equal utilization of the processors or cores while keeping the communication between them small.
Levels of parallelism, ordered from fine-grained to coarse-grained granularity: instruction level, data level, loop level, function level.
Parallelism at instruction level
Multiple instructions of a program can be executed in parallel at the same time if they are independent of each other. Types of instruction dependencies:
Flow dependency (true dependency): there is a flow dependency from instruction I1 to I2 if I1 computes a result value that is then used by I2 as an operand.
Anti-dependency: there is an anti-dependency from I1 to I2 if I1 uses a register or variable as operand that is later used by I2 to store the result of another computation.
Output dependency: there is an output dependency from I1 to I2 if I1 and I2 use the same register or variable to store the results of their computations.
Control dependency: there is a control dependency from I1 to I2 if the execution of instruction I2 depends on the outcome of instruction I1.
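A short C fragment (the variables are made up for illustration, not taken from the slides) showing all four dependency types:

```c
/* Illustrative fragment only; the variables are hypothetical. */
void dependency_example(void) {
    int x = 1, y = 2, c = 3, d = 4, a, b;

    a = x + y;     /* I1 */
    b = a * 2;     /* I2: flow (true) dependency on I1: reads a, which I1 writes     */
    x = c + 1;     /* I3: anti-dependency on I1: writes x, which I1 read as operand  */
    a = d - 5;     /* I4: output dependency on I1: both I1 and I4 write a            */
    if (a > 0)     /* I5 */
        b = 0;     /* I6: control dependency on I5: whether I6 runs depends on I5    */

    (void)b;       /* silence unused-variable warnings */
}
```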
Data parallelism
Data parallelism is used when the same operation must be applied to different elements of a larger data structure (e.g., an array). The elements of the data structure are distributed among the processors, and each processor performs the operation on its assigned elements. Data parallelism is used in:
SIMD model: extended instructions work on multiple data elements held in extended (wide) registers.
MIMD model: mostly implemented as SPMD (Single Program Multiple Data), where one complete program is executed by every processor on its own chunk of the data. In the slide's SPMD example, me denotes the processor rank and Reduce a result-gathering operation on processor 0.
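A minimal SPMD sketch in C with MPI, assuming a problem of size N split into equal chunks; the names me, chunk, and local_sum mirror the slide's legend and are otherwise illustrative:

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024                                  /* assumed total problem size */

int main(int argc, char **argv) {
    int me, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);         /* me = processor rank */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int chunk = N / p;                          /* every process works on its own chunk */
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = me * chunk; i < (me + 1) * chunk; i++)
        local_sum += (double)i;                 /* same operation, different data */

    /* Reduce = result gathering on processor 0 */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (me == 0) printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```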
Loop parallelism
A loop is usually executed sequentially, which means that the computations of the i-th iteration are not started before all computations of the (i-1)-th iteration are completed. If there are no dependencies between the iterations of a loop, they can be executed in arbitrary order, and in parallel by different processors.
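A brief OpenMP sketch of an independent loop (the arrays are illustrative): no iteration reads or writes data of another iteration, so the iterations can be distributed over threads in any order:

```c
#include <omp.h>

void vector_add(const double *a, const double *b, double *c, int n) {
    /* Each iteration touches only its own elements, so the loop
       may be split among the threads of the team. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```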
Functional parallelism
Many programs contain parts that are independent of each other and can be executed in parallel. This form of parallelism is called task parallelism or functional parallelism. The tasks and their dependencies are represented as a task graph. A popular technique for dynamic task scheduling is the use of a task pool, in which tasks that are ready for execution are stored and from which processors can retrieve tasks when they have finished the execution of their current task. The task pool concept is particularly useful for shared address space machines.
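A minimal sketch of the task-pool idea using OpenMP tasks, where the runtime maintains the pool; process_item is a hypothetical work function:

```c
#include <omp.h>

/* Hypothetical unit of work: one independent task. */
void process_item(int i) {
    /* ... work on item i ... */
    (void)i;
}

void run_all(int n) {
    #pragma omp parallel       /* the team of threads acts as the set of processors   */
    #pragma omp single         /* one thread creates the tasks and fills the pool ... */
    for (int i = 0; i < n; i++) {
        #pragma omp task firstprivate(i)
        process_item(i);       /* ... idle threads retrieve ready tasks from the pool */
    }
}
```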
Explicit and implicit parallelism
Implicit parallelism: compiler-based parallelization.
Explicit parallelism with implicit distribution: the parallelism is specified explicitly, but no explicit distribution and assignment to processes or threads is demanded (e.g., OpenMP).
Explicit distribution: the mapping to processors or cores as well as the communication between processors remain implicit, e.g., BSP (bulk synchronous parallel).
Explicit assignment to processors: the communication between the processors does not need to be specified (e.g., Linda).
Explicit communication and synchronization: the programmer must specify all details of the parallel execution, as in message-passing models (MPI) and thread-based models (Pthreads).
SIMD computations: vector processors (e.g., the NEC SX-9 series)
Specialized vector load instructions can be used to load a set of memory elements with consecutive addresses, or with a fixed distance between their memory addresses (called the stride). Sparse matrix operations are also possible with index registers. Vector instructions can be applied to a part of the vector registers by specifying the length of the vector in a vector length register (VLR). Vector instructions can also use scalar registers as operands.
SIMD computations: SIMD instructions for general-purpose processors
Intel MMX (1996): uses the 64-bit floating-point registers for eight 8-bit integer ops or four 16-bit integer ops.
Streaming SIMD Extensions (SSE 1, 2, 3, 4): added separate 128-bit registers; eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops; a double-precision (64-bit) floating-point data type was added.
Advanced Vector Extensions (AVX, 2010): introduced sixteen 256-bit floating-point registers, doubling the data width. Many DSP and multimedia instructions were added, such as addition, subtraction, multiplication, division, square root, minimum, maximum, and rounding.
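As a hedged illustration of the "four 32-bit fp ops at once" case, a small C function using SSE intrinsics (array names are made up; n is assumed to be a multiple of 4):

```c
#include <xmmintrin.h>   /* SSE intrinsics: 128-bit __m128 holds four 32-bit floats */

/* Adds two float arrays four elements at a time. */
void add_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load four packed floats    */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, four adds */
        _mm_storeu_ps(&c[i], vc);
    }
}
```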
SIMD computations: Graphics Processing Units (GPUs)
The CPU is the host; the GPU is the parallelizing device.
CUDA (Compute Unified Device Architecture) was developed by NVIDIA; OpenCL is a similar open language.
All forms of GPU parallelism are unified as CUDA threads; the programming model is "Single Instruction, Multiple Thread" (SIMT).
Threads are grouped into thread blocks for execution, and the hardware schedules them in groups of 32 threads (warps).
Data distributions for arrays
The decomposition of data and their mapping to different processors is called data distribution, data decomposition, or data partitioning. Typical distributions for arrays are the columnwise (or rowwise) distribution and the checkerboard distribution.
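A small C sketch (the helper name and parameters p and me are illustrative) computing which block of rows a processor owns under a blockwise rowwise distribution:

```c
/* Blockwise rowwise distribution of n rows over p processors.
   Processor me (0 <= me < p) owns rows [first, first + count).
   The first n % p processors get one extra row when n is not divisible by p. */
void my_row_block(int n, int p, int me, int *first, int *count) {
    int base = n / p, rest = n % p;
    *count = base + (me < rest ? 1 : 0);
    *first = me * base + (me < rest ? me : rest);
}
```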
Information exchange: Shared variables
Shared data can be stored in shared variables. Synchronization operations are provided to ensure that concurrent accesses to the same variable are synchronized. A race condition arises when the result of a parallel execution of a program part by multiple execution units depends on the order in which the statements are executed by them. Program parts in which race conditions may occur, and which therefore hold the danger of inconsistent values, are called critical sections. Letting only one thread at a time execute a critical section is called mutual exclusion; race conditions can be avoided by a lock mechanism that enforces it.
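A minimal Pthreads sketch of mutual exclusion (the shared counter and the thread function are illustrative); the mutex turns the increment of the shared variable into a critical section:

```c
#include <pthread.h>

static long counter = 0;                        /* shared variable        */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *increment_worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);                 /* enter critical section */
        counter++;                              /* race-free increment    */
        pthread_mutex_unlock(&m);               /* leave critical section */
    }
    return NULL;
}
```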
Information exchange: Communication operations
The exchange of data and information between the processors is performed by communication operations:
Single transfer: processor Pi (sender) sends a message to processor Pj (receiver) with j ≠ i.
Single-broadcast: a root processor sends the same data block to all other processors.
Single-accumulation: every processor provides a data block, and the blocks are combined by a reduction operation (e.g., a sum) into a single block at a root processor.
Gather: every processor provides a data block, and the blocks are collected at a root processor without being combined.
Scatter: a root processor holds a separate data block for every processor and sends to each processor its block.
Multi-broadcast: every processor sends its data block to all other processors, so that each processor ends up with the data blocks of all processors.
Multi-accumulation: every processor is the target of a separate accumulation, i.e., each processor receives a reduced result combined from blocks contributed by all processors.
Total exchange: every processor provides a separate data block for each other processor, and each processor receives exactly the blocks destined for it.
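These operations map closely onto the MPI collectives; the following C sketch is illustrative only (one integer block per process, buffer contents made up):

```c
#include <stdlib.h>
#include <mpi.h>

/* One integer "block" per process; rank 0 plays the role of the root processor. */
void collectives_demo(MPI_Comm comm) {
    int me, p, root = 0;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);

    int x = me, sum;
    int *blocks  = malloc(p * sizeof(int));   /* p blocks, one per process        */
    int *permsg  = malloc(p * sizeof(int));   /* a separate block for each target */
    int *recvall = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) permsg[i] = me * p + i;

    MPI_Bcast(&x, 1, MPI_INT, root, comm);                             /* single-broadcast    */
    MPI_Reduce(&x, &sum, 1, MPI_INT, MPI_SUM, root, comm);             /* single-accumulation */
    MPI_Gather(&x, 1, MPI_INT, blocks, 1, MPI_INT, root, comm);        /* gather              */
    MPI_Scatter(blocks, 1, MPI_INT, &x, 1, MPI_INT, root, comm);       /* scatter             */
    MPI_Allgather(&x, 1, MPI_INT, blocks, 1, MPI_INT, comm);           /* multi-broadcast     */
    MPI_Reduce_scatter_block(permsg, &sum, 1, MPI_INT, MPI_SUM, comm); /* multi-accumulation  */
    MPI_Alltoall(permsg, 1, MPI_INT, recvall, 1, MPI_INT, comm);       /* total exchange      */

    free(blocks); free(permsg); free(recvall);
}
```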
Duality of communication operations
A single-broadcast or single-accumulation operation can be implemented using a spanning tree with the sending or receiving processor as root. A scatter operation can also be implemented by a top-down traversal of a spanning tree, where each node receives a set of data blocks from its parent node and forwards those that are meant for its subtree to its child nodes. A gather operation can be implemented by a bottom-up traversal of the spanning tree, where each node receives a set of data blocks from its child nodes and forwards all of them, including its own data block, to its parent.
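A hedged sketch in C of a broadcast over a binomial spanning tree built from MPI point-to-point messages (illustrative only; MPI libraries implement MPI_Bcast internally with similar trees):

```c
#include <mpi.h>

/* Broadcast buf from rank 0 over a binomial spanning tree. */
void tree_broadcast(int *buf, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* In the round with distance `step`, every rank below `step` already holds
       the data and forwards it to rank + step, its child in the spanning tree. */
    for (int step = 1; step < size; step <<= 1) {
        if (rank < step) {
            int child = rank + step;
            if (child < size)
                MPI_Send(buf, count, MPI_INT, child, 0, comm);
        } else if (rank < 2 * step) {
            MPI_Recv(buf, count, MPI_INT, rank - step, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}
```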
Hierarchy of communication operations
Parallel matrix-vector product (Ab = c)
(a) The row-oriented representation of matrix A turns the computation into n scalar products (a_i, b), i = 1, ..., n, and leads to a parallel implementation in which each of the p processors computes approximately n/p scalar products. (The slide shows a distributed-memory and a shared-memory program for this variant.)
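A shared-memory sketch of the row-oriented variant in C with OpenMP, assuming A is stored row-major in a flat array of size n*n:

```c
#include <omp.h>

/* c = A * b, row-oriented: row i yields the scalar product (a_i, b). */
void matvec_rows(const double *A, const double *b, double *c, int n) {
    #pragma omp parallel for          /* each thread handles about n/p rows */
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * b[j];
        c[i] = s;
    }
}
```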
(b) The column-oriented representation of matrix A turns the computation into a linear combination of the columns and leads to a parallel implementation in which each processor computes the contributions of approximately m/p column vectors. (The slide shows a distributed-memory program for this variant.)
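A distributed-memory sketch of the column-oriented variant in C with MPI; it assumes each process holds my_cols columns contiguously in Alocal together with the matching entries of b in blocal (all names are illustrative):

```c
#include <stdlib.h>
#include <mpi.h>

/* Column-oriented c = A*b: each process accumulates a partial result vector d
   from its own columns; the partial vectors are then summed on process 0. */
void matvec_cols(const double *Alocal, const double *blocal,
                 double *c, int n, int my_cols, MPI_Comm comm) {
    double *d = calloc(n, sizeof(double));         /* local partial linear combination */
    for (int j = 0; j < my_cols; j++)
        for (int i = 0; i < n; i++)
            d[i] += blocal[j] * Alocal[j * n + i]; /* b_j times column j */

    /* Single-accumulation: elementwise sum of the partial vectors on root 0. */
    MPI_Reduce(d, c, n, MPI_DOUBLE, MPI_SUM, 0, comm);
    free(d);
}
```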
Processes and Threads
Process: a process comprises the executable program along with all information necessary for its execution, including the program data on the runtime stack or the heap, the current values of the registers, and the contents of the program counter. Each process has its own address space.
Threads: each process may consist of multiple independent control flows, which are called threads. The threads of one process share the address space of the process.
User-level threads are managed by a thread library without specific support of the operating system. This has the advantage that a switch from one thread to another can be done quite fast; but the operating system cannot map different threads of the same process to different execution resources, and it cannot switch to another thread if one thread executes a blocking I/O operation.
Kernel threads are threads that are generated and managed by the operating system.
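A minimal Pthreads example of creating and joining threads within one process (the worker function and the thread count are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each thread is an independent control flow sharing the process address space. */
void *thread_func(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, thread_func, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);          /* wait for all threads to finish */
    return 0;
}
```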
Execution models for threads- N:1 mapping
Execution models for threads- 1:1 mapping
Execution models for threads- N:M mapping
Thread states
Synchronization mechanisms
Lock synchronization:
Lock variables: also called mutex variables, as they help to ensure mutual exclusion. Threads that access the same shared data must use the same lock variable for their accesses to be synchronized.
Semaphores: a data structure containing an integer counter s to which two atomic operations P(s) and V(s) can be applied. The wait operation P(s) decrements s when it is nonzero (and blocks while it is zero); the signal operation V(s) increments s.
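A brief sketch with POSIX semaphores, where sem_wait plays the role of P and sem_post the role of V (the producer/consumer framing is illustrative):

```c
#include <semaphore.h>

static sem_t items;                    /* counter s, here: number of available items */

void init(void) {
    sem_init(&items, 0, 0);            /* shared between threads of one process, s = 0 */
}

void producer_step(void) {
    /* ... produce an item ... */
    sem_post(&items);                  /* V(s): increment the counter */
}

void consumer_step(void) {
    sem_wait(&items);                  /* P(s): block while s == 0, then decrement */
    /* ... consume the item ... */
}
```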
Thread execution control:
Barrier synchronization: a synchronization point at which each thread must wait until all other threads have also reached this point. The barrier state is initialized to "stop" by the first threads arriving at the barrier. Each entering thread checks how many threads are already in the barrier; only the last one sets the barrier state to "pass", so that all threads can leave the barrier.
Condition synchronization: a thread T1 is blocked until a given condition has been established. The condition must be re-checked whenever the thread becomes executable and runs again, since other threads may have invalidated it in the meantime.
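A condition-synchronization sketch with Pthreads condition variables (the flag data_ready is illustrative); note the while loop that re-checks the condition after each wakeup:

```c
#include <pthread.h>

static int data_ready = 0;                       /* the condition */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void wait_for_data(void) {                       /* the blocked thread T1 */
    pthread_mutex_lock(&m);
    while (!data_ready)                          /* re-check after every wakeup */
        pthread_cond_wait(&cv, &m);              /* releases m while waiting    */
    pthread_mutex_unlock(&m);
}

void publish_data(void) {                        /* the thread establishing the condition */
    pthread_mutex_lock(&m);
    data_ready = 1;
    pthread_cond_signal(&cv);                    /* wake a waiting thread */
    pthread_mutex_unlock(&m);
}
```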
Deadlock: program execution reaches a state in which each thread waits for an event that can only be caused by another thread, so none of them can proceed.
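A hedged sketch of a classic lock-ordering deadlock with two Pthreads mutexes: each thread holds one mutex and waits for the other, so neither can proceed; acquiring the mutexes in the same global order in both threads removes the deadlock:

```c
#include <pthread.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg) {
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);   /* blocks forever if thread2 already holds b */
    /* ... */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void *thread2(void *arg) {
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   /* blocks forever if thread1 already holds a */
    /* ... */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}
```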
Parallel programming approaches
Popular and standardized: MPI, Pthreads, Java threads, OpenMP.
Less popular but more user-friendly: Unified Parallel C (UPC); the DARPA HPCS (High Productivity Computing Systems) programming languages Fortress (Sun Microsystems), X10 (IBM), and Chapel (Cray Inc.); Global Arrays (GA).