
1 Chapter 3 Parallel Programming Models

2 Abstraction
Machine Level – Looks at hardware, OS, buffers
Architectural Model – Looks at interconnection network, memory organization, synchronous vs. asynchronous operation
Computational Model – Cost models, algorithm complexity, RAM vs. PRAM
Programming Model – Uses a programming-language description of processes

3 Control Flows
Process – Separate address spaces – Distributed memory
Thread – Shared address space – Shared memory
Either can be created statically (as in MPI-1) or dynamically at run time (MPI-2 allows this, as do pthreads).

4 Parallelization of a Program
Decomposition of the computations
– Can be done at many levels (e.g., pipelining)
– Divide into tasks and identify the dependencies between tasks
– Can be done statically (at compile time) or dynamically (at run time)
– The number of tasks places an upper bound on the parallelism that can be exploited
– Granularity: the computation time of a task

5 Assignment of Tasks
The number of processes or threads does not need to equal the number of processors
Load balancing: each process/thread should have the same amount of work (computation, memory access, communication)
Have tasks that use the same memory execute on the same thread (good cache use)
Scheduling: the assignment of tasks to threads/processes

6 Assignment to Processors
1-to-1: map each process/thread to a unique processor
Many-to-1: map several processes to a single processor (load-balancing issues)
Done by the OS or by the programmer

7 Scheduling
Precedence constraints – dependencies between tasks
Capacity constraints – a fixed number of processors
Want to meet the constraints and finish in minimum time

8 Levels of Parallelism
Instruction Level
Data Level
Loop Level
Functional Level

9 Instruction Level
Executing multiple instructions in parallel. May have problems with dependencies:
– Flow dependency – the next instruction needs a value computed by the previous instruction
– Anti-dependency – an instruction reads a value from a register or memory location that the next instruction overwrites (the order cannot be reversed)
– Output dependency – two instructions store into the same location
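A minimal C sketch of the three dependency types; the variable names are hypothetical stand-ins:

```c
#include <stdio.h>

int main(void) {
    int a, b = 2, c = 3, d, e, f;

    /* Flow dependency: the second statement reads the value
       the first one writes, so their order is fixed. */
    a = b + c;
    d = a * 2;          /* reads a */

    /* Anti-dependency: the next statement reads b before the
       one after overwrites it; the two cannot be swapped. */
    e = b + 1;          /* reads b */
    b = 10;             /* writes b */

    /* Output dependency: both statements write f, so the final
       value depends on which executes last. */
    f = a + e;
    f = b + d;

    printf("%d %d %d %d\n", a, d, e, f);
    return 0;
}
```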

10 Data Level
The same operation is applied to different elements of a large data structure
If these operations are independent, the data can be distributed among the processors
One single control flow: SIMD

11 Loop Level
If there are no dependencies between the iterations of a loop, then each iteration can be done independently, in parallel
Similar to data parallelism
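As an illustration, a loop-level-parallel vector addition using an OpenMP parallel for (a sketch; the array names and size are arbitrary):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* Each iteration touches only its own index, so there are no
       loop-carried dependencies and iterations may run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}
```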

12 Functional Level
Look at the parts of a program and determine which parts can be done independently
Use a dependency graph to find the dependencies/independencies
Static or dynamic assignment of tasks to processors
– Dynamic assignment would use a task pool

13 Explicit/Implicit Parallelism Expression
Language dependent
Some languages hide the parallelism in the language
For other languages, you must explicitly state the parallelism

14 Parallelizing Compilers
Take a program in a sequential language and generate parallel code
– Must analyze the dependencies and not violate them
– Should provide good load balancing (difficult)
– Should minimize communication
Functional programming languages
– Express computations as the evaluation of functions with no side effects
– Allows for parallel evaluation

15 More Explicit/Implicit
Explicit parallelism, implicit distribution
– The language explicitly states the parallelism in the algorithm but lets the system assign the tasks to processors
Explicit assignment to processors
– The programmer does not have to worry about communication
Explicit communication and synchronization (MPI)
– Must additionally state communication and synchronization points explicitly

16 Parallel Programming Patterns
Process/Thread Creation
Fork-Join
Parbegin-Parend
SPMD, SIMD
Master-Slave (Worker)
Client-Server
Pipelining
Task Pools
Producer-Consumer

17 Process/Thread Creation
Static or dynamic
Threads are traditionally created dynamically
Processes are traditionally created statically, but dynamic creation has become available (e.g., MPI-2)

18 Fork-Join
An existing thread can create a number of child threads with a fork
The child threads work in parallel
Join waits for all the forked threads to terminate
Spawn/exit is similar
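A minimal fork-join sketch using POSIX threads; NUM_THREADS and work are illustrative names:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Work done by each forked (created) thread. */
static void *work(void *arg) {
    long id = (long)arg;
    printf("thread %ld working\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];

    /* "Fork": create the child threads. */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);

    /* "Join": wait for all of them to terminate. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);

    return 0;
}
```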

19 Parbegin-Parend
Also called cobegin-coend
Each statement (block/function call) in the cobegin-coend block is executed in parallel
Statements after coend are not executed until all the parallel statements are complete
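OpenMP's sections construct is one concrete analogue of parbegin-parend; a minimal sketch:

```c
#include <stdio.h>

int main(void) {
    /* The sections construct behaves like parbegin-parend: each
       section may run in parallel, and execution continues past
       the closing brace only when all sections have finished. */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("statement 1\n");

        #pragma omp section
        printf("statement 2\n");
    }
    printf("after the implicit join\n");
    return 0;
}
```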

20 SPMD – SIMD
Single Program, Multiple Data vs. Single Instruction, Multiple Data
Both use a number of threads/processes that apply the same program to different data
SIMD executes the statements synchronously on different data
SPMD executes the statements asynchronously

21 Master-Slave
One thread/process controls all the others
With dynamic thread/process creation, the master is usually the one that creates the workers
The master "assigns" work to the workers, and the workers send their results back to the master

22 Client-Server
Multiple clients connect to a server that responds to requests
The server may satisfy requests in parallel (multiple requests handled concurrently, or a parallel solution to a single involved request)
The client also does some work with the response from the server
A very good model for heterogeneous systems

23 Pipelining
The output of one thread is the input to another thread
A special type of functional decomposition
Another case where heterogeneous systems are useful

24 Task Pools
Keep a collection of tasks to be done, along with the data to do them on
A thread/process can generate tasks to add to the pool, as well as obtain a new task when it finishes its current one
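A minimal task-pool sketch using a mutex-protected stack of tasks; put_task/get_task are hypothetical names, and tasks are reduced to plain integers for brevity (a real pool would store function pointers plus argument data):

```c
#include <pthread.h>

#define POOL_SIZE 16

static int pool[POOL_SIZE];
static int count = 0;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Add a task; returns 0 on success, -1 if the pool is full. */
int put_task(int task) {
    int ok = -1;
    pthread_mutex_lock(&pool_lock);
    if (count < POOL_SIZE) { pool[count++] = task; ok = 0; }
    pthread_mutex_unlock(&pool_lock);
    return ok;
}

/* Fetch a task; returns -1 when the pool is currently empty. */
int get_task(void) {
    int task = -1;
    pthread_mutex_lock(&pool_lock);
    if (count > 0) task = pool[--count];
    pthread_mutex_unlock(&pool_lock);
    return task;
}
```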

25 Producer-Consumer
Producer threads create data used as input by the consumer threads
Data is stored in a common buffer that is accessed by producers and consumers
A producer cannot add if the buffer is full
A consumer cannot remove if the buffer is empty
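The classic bounded-buffer realization with POSIX semaphores; a sketch, with produce/consume as illustrative names:

```c
#include <pthread.h>
#include <semaphore.h>

#define BUF_SIZE 8

static int buf[BUF_SIZE];
static int in = 0, out = 0;
static sem_t empty_slots, full_slots;   /* counting semaphores */
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

void produce(int item) {
    sem_wait(&empty_slots);             /* block while the buffer is full */
    pthread_mutex_lock(&buf_lock);
    buf[in] = item;
    in = (in + 1) % BUF_SIZE;
    pthread_mutex_unlock(&buf_lock);
    sem_post(&full_slots);
}

int consume(void) {
    sem_wait(&full_slots);              /* block while the buffer is empty */
    pthread_mutex_lock(&buf_lock);
    int item = buf[out];
    out = (out + 1) % BUF_SIZE;
    pthread_mutex_unlock(&buf_lock);
    sem_post(&empty_slots);
    return item;
}

/* Initialization, called once before starting the threads:
   sem_init(&empty_slots, 0, BUF_SIZE);
   sem_init(&full_slots, 0, 0);         */
```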

26 Array Data Distributions
1-D
– Blockwise
Each process gets ceil(n/p) elements of A, except the last process, which gets n - (p-1)*ceil(n/p) elements
Alternatively, the first n mod p processes get ceil(n/p) elements while the rest get floor(n/p) elements
– Cyclic
Process P_i gets elements i + k*p (k = 0 .. ceil(n/p) - 1)
– Block-cyclic
Distribute blocks of size b to processes in a cyclic manner
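These distributions can be summarized by which process owns element j; a sketch with illustrative function names, using the first blockwise variant above:

```c
#include <stdio.h>

/* Owner of element j (0-based) of an array of length n distributed
   over p processes; b is the block size in the block-cyclic case. */

int owner_block(int j, int n, int p) {
    int block = (n + p - 1) / p;        /* ceil(n/p) */
    return j / block;
}

int owner_cyclic(int j, int p) {
    return j % p;
}

int owner_block_cyclic(int j, int p, int b) {
    return (j / b) % p;
}

int main(void) {
    /* Element 10 of a 100-element array on 4 processes: */
    printf("block: %d  cyclic: %d  block-cyclic(b=4): %d\n",
           owner_block(10, 100, 4), owner_cyclic(10, 4),
           owner_block_cyclic(10, 4, 4));
    return 0;
}
```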

27 2-D Array Distribution
Blockwise distribution of rows or columns
Cyclic distribution of rows or columns
Blockwise-cyclic distribution of rows or columns

28 Checkerboard
Take an array of size n x m
Overlay a grid of size g x f
– g <= n
– f <= m
– Most easily seen when n is a multiple of g and m is a multiple of f
Blockwise checkerboard
– Assign each n/g x m/f submatrix to a processor
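A sketch of the owner computation for the blockwise checkerboard, assuming (as above) that n is a multiple of g and m a multiple of f; the function name is illustrative:

```c
/* Owner (process-grid coordinates) of element (i, j) of an n x m
   array under a blockwise checkerboard on a g x f process grid. */
void owner_checkerboard(int i, int j, int n, int m, int g, int f,
                        int *proc_row, int *proc_col) {
    *proc_row = i / (n / g);    /* which row block of the grid  */
    *proc_col = j / (m / f);    /* which column block of the grid */
}
```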

29 Cyclic Checkerboard
Take each item in an n/g x m/f submatrix and assign it in a cyclic manner
Block-cyclic checkerboard
– Take each n/g x m/f submatrix and assign all the data in the submatrix to a processor in a cyclic fashion

30 Information Exchange
Shared variables
– Used in shared memory
– When thread T1 wants to share information with thread T2, T1 writes the information into a variable that is shared with T2
– Must avoid two or more processes reading and writing the same variable at the same time (race condition)
– Race conditions lead to non-deterministic behavior

31 Critical Sections
Sections of code where there may be concurrent accesses to shared variables
Must make these sections mutually exclusive
– Only one process can execute the section at any one time
A lock mechanism is used to keep sections mutually exclusive
– A process checks whether the section is "open"
– If it is, the process "locks" it and executes (unlocking when done)
– If not, the process waits until the section is unlocked
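A minimal critical-section sketch with a POSIX mutex guarding a shared counter:

```c
#include <pthread.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* The read-modify-write of the shared counter is the critical
   section; the mutex makes it mutually exclusive. */
void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* wait if "locked" */
        counter++;                      /* critical section */
        pthread_mutex_unlock(&lock);    /* "unlock" when done */
    }
    return NULL;
}
```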

32 Communication Operations
Single transfer – P_i sends a message to P_j
Single broadcast – one process sends the same data to all other processes
Single accumulation – many values are combined by an operation into a single value placed at the root process
Gather – each process provides a block of data to a single common process
Scatter – the root process sends a separate block of a large data structure to every other process
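These operations map directly onto MPI collectives (MPI_Bcast, MPI_Reduce, MPI_Gather, MPI_Scatter); a minimal sketch of the first two:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) x = 42;

    /* Single broadcast: the root sends the same value to everyone. */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Single accumulation: values from all processes are combined
       into one result stored at the root (here, the sum of ranks).
       Gather and scatter use MPI_Gather / MPI_Scatter similarly. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("x=%d sum=%d\n", x, sum);
    MPI_Finalize();
    return 0;
}
```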

33 More Communications
Multi-broadcast – every process sends its data to every other process, so every process ends up with all the data that was spread out among the processes
Multi-accumulate – like accumulation, but every process gets the result
Total exchange – each process provides p data blocks; the i-th data block is sent to process P_i, and each process receives p blocks and assembles them in order
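The MPI counterparts are MPI_Allgather (multi-broadcast), MPI_Allreduce (multi-accumulate), and MPI_Alltoall (total exchange); a minimal multi-accumulate sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Multi-accumulate: like MPI_Reduce, but every process
       receives the combined result (here, the sum of all ranks). */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees sum %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}
```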

34 Applications: Parallel Matrix-Vector Product
Compute c = Ab, where A is n x m, b has length m, and c has length n
Want A to be in contiguous memory: a single array, not an array of arrays
Row-oriented: each block of rows, together with all of b, computes a block of c
– Used if A is stored row-wise
Column-oriented: each block of columns, with the matching block of b, computes partial columns that must then be summed
– Used if A is stored column-wise
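A sketch of the row-blockwise variant; matvec_rows and its parameters are illustrative, with each process/thread passing its own row range [lo, hi):

```c
/* Row-blockwise matrix-vector product c = A*b: each process (or
   thread) computes the entries of c for its own block of rows,
   using the whole of b. A is stored row-wise in one contiguous
   array, so A[i*m + j] is element (i, j). */
void matvec_rows(const double *A, const double *b, double *c,
                 int lo, int hi, int m) {
    for (int i = lo; i < hi; i++) {     /* my block of rows */
        double s = 0.0;
        for (int j = 0; j < m; j++)
            s += A[i * m + j] * b[j];   /* contiguous row access */
        c[i] = s;
    }
}
```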

35 Processes and Threads
Process – a program in execution
– Includes the code, the program data on the stack and heap, the register values, and the PC
– Assigned to a processor or core for execution
– If there are more processes than resources (processors or memory), they execute in a round-robin, time-shared manner
– Context switch – changing which process is executing on a processor

36 Fork
The Unix fork system call
– Creates a new process
– Makes a copy of the program
– The copy starts at the statement after the fork
– NOT a shared-memory model; a distributed-memory model
– Can take a while to execute
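A minimal fork example; note that the child starts at the statement after the fork with its own copy of memory:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();     /* copies the whole process */

    if (pid == 0) {
        /* Child: runs the copy, starting right after the fork;
           its memory is separate from the parent's. */
        printf("child %d\n", (int)getpid());
    } else if (pid > 0) {
        /* Parent: pid holds the child's process id. */
        waitpid(pid, NULL, 0);
        printf("parent %d\n", (int)getpid());
    }
    return 0;
}
```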

37 Threads
Share a single address space
Best with physically shared memory
Cheaper to start than a process – no copy of the code space
Two types
– Kernel threads – managed by the OS
– User threads – managed by a thread library

38 Thread Execution
If user threads are executed by a thread library/scheduler (no OS support for threads), then all the threads are part of one process that is scheduled by the OS
– Only one thread executes at a time, even if there are multiple processors
If the OS has thread management, then threads can be scheduled by the OS and multiple threads can execute concurrently
Alternatively, a thread scheduler can map user threads to kernel threads (several user threads may map to one kernel thread)

39 Thread States
Newly generated
Executable
Running
Waiting
Finished
Threads transition from state to state based on events (start, interrupt, end, block, unblock, assign-to-processor)

40 Synchronization: Locks
A process "locks" a shared variable at the beginning of a critical section
– The lock allows the process to proceed if the variable is unlocked
– The process blocks if the variable is locked, until it is unlocked
Locking is an "atomic" operation

41 Semaphore
Usually a binary type, but can be an integer
wait(s)
– Waits until the value of s is 1 (or greater)
– When it is, decrements s by 1 and continues
signal(s)
– Increments s by 1
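With POSIX semaphores, wait(s) and signal(s) correspond to sem_wait and sem_post; a sketch (enter/leave are illustrative names):

```c
#include <semaphore.h>

sem_t s;

/* sem_wait blocks until s > 0, then decrements it atomically;
   sem_post increments s, possibly waking a waiter. */
void enter(void) { sem_wait(&s); }   /* wait(s)   */
void leave(void) { sem_post(&s); }   /* signal(s) */

/* For a binary semaphore, initialize once with:
   sem_init(&s, 0, 1); */
```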

42 Barrier Synchronization
A way to have every process wait until every process has reached a certain point
Guarantees the state of every process before certain code is executed
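A sketch using a POSIX barrier; NUM_THREADS and phase_worker are illustrative names:

```c
#include <pthread.h>

#define NUM_THREADS 4

static pthread_barrier_t barrier;   /* init once before starting threads:
       pthread_barrier_init(&barrier, NULL, NUM_THREADS); */

void *phase_worker(void *arg) {
    /* ... phase 1 work ... */

    /* No thread passes this point until all NUM_THREADS
       threads have reached it. */
    pthread_barrier_wait(&barrier);

    /* ... phase 2 work, which may rely on phase 1 being
       complete in every thread ... */
    return NULL;
}
```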

43 Condition Synchronization
A thread is blocked until a given condition is established
– If the condition is not true, the thread is put into the blocked state
– When the condition becomes true, the thread is moved from blocked to ready (not necessarily directly onto a processor)
– Since other threads may execute in the meantime, by the time this thread reaches a processor the condition may no longer be true
So, the thread must recheck the condition after being woken
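This is exactly why condition-variable code rechecks its condition in a while loop; a POSIX sketch with illustrative names:

```c
#include <pthread.h>

static int ready = 0;   /* the condition */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void wait_until_ready(void) {
    pthread_mutex_lock(&m);
    /* A while loop, not an if: by the time this thread is
       rescheduled, another thread may have made the condition
       false again, so it must be rechecked. */
    while (!ready)
        pthread_cond_wait(&cv, &m);   /* unlocks m while blocked */
    pthread_mutex_unlock(&m);
}

void make_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_broadcast(&cv);      /* move waiters to ready state */
    pthread_mutex_unlock(&m);
}
```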

44 Efficient Thread Programs
Use the proper number of threads
– Consider the degree of parallelism in the application
– The number of processors
– The size of the shared cache
Avoid synchronization as much as possible
– Make critical sections as small as possible
Watch for deadlock conditions

45 Memory Access
Must consider writes to shared memory that is held in local caches
False sharing
– Consider two processes writing to different memory locations
– This SHOULD not cause cache traffic, since no location is shared by the two caches
– HOWEVER, if the locations are close together they may lie in the same cache line; that line is then replicated in both caches, and every write by one core invalidates the other core's copy even though no data is logically shared
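A common mitigation is to pad each thread's data to a full cache line; a sketch assuming 64-byte lines (the size and names are illustrative):

```c
/* Two counters updated by different threads. Without padding they
   would likely share one 64-byte cache line, and every write would
   invalidate the other core's copy (false sharing); the padding
   forces each counter onto its own line. */
struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];
};

static struct padded_counter counters[2];

void *worker(void *arg) {
    int id = *(int *)arg;               /* 0 or 1 */
    for (long i = 0; i < 100000000L; i++)
        counters[id].value++;           /* no longer contends on a line */
    return NULL;
}
```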

