
Chapter 3 Parallel Programming Models

Abstraction
Machine Level – looks at hardware, OS, buffers
Architectural Models – look at interconnection network, memory organization, synchronous & asynchronous
Computational Model – cost models, algorithm complexity, RAM vs. PRAM
Programming Model – uses a programming-language description of processes

Control Flows
Process – separate address spaces – distributed memory
Thread – shares an address space – shared memory
Control flows can be created statically (as in MPI-1) or dynamically during run time (MPI-2 allows this, as do Pthreads).

Parallelization of a Program Decomposition of the computations – Can be done at many levels (e.g. pipelining). – Divide the computation into tasks and identify the dependencies between tasks. – Can be done statically (at compile time) or dynamically (at run time) – The number of tasks places an upper bound on the parallelism that can be used – Granularity: the computation time of a task

Assignment of Tasks The number of processes or threads does not need to be the same as the number of processors Load Balancing: each process/thread should have the same amount of work (computation, memory access, communication) Have tasks that use the same memory execute on the same thread (good cache use) Scheduling: the assignment of tasks to threads/processes

Assignment to Processors 1-1: map each process/thread to a unique processor Many-to-1: map several processes/threads to a single processor (load balancing issues) Done by the OS or by the programmer

Scheduling Precedence constraints – Dependencies between tasks Capacity constraints – A fixed number of processors Want to meet constraints and finish in minimum time

Levels of Parallelism Instruction Level Data Level Loop Level Functional Level

Instruction Level Executing multiple instructions in parallel. May have problems with dependencies – Flow dependency – the next instruction needs a value computed by the previous instruction – Anti-dependency – an instruction reads a value from a register or memory location that the next instruction overwrites (so the order of the two instructions cannot be reversed) – Output dependency – two instructions store into the same location
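A small C fragment (illustrative variable names, not from the slides) showing the three dependency types:

    void deps(void) {
        int a, b, c = 0, x = 0;
        a = x + 1;   /* I1 */
        b = a * 2;   /* I2: flow dependency on I1 -- reads the a written by I1 */
        a = c + 3;   /* I3: anti-dependency with I2 -- I2 must read a before I3 overwrites it */
        a = c * 4;   /* I4: output dependency with I3 -- both write a; order decides the final value */
        (void)a; (void)b;
    }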

Data Level The same operation is applied to different elements of a large data structure If these operations are independent, then the data can be distributed among the processors A single control flow (SIMD)

Loop Level If there are no dependencies between the iterations of a loop, then each iteration can be done independently, in parallel Similar to data parallelism
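A minimal loop-level sketch in C with OpenMP, assuming the iterations really are independent (function and array names are illustrative; compile with an OpenMP flag such as -fopenmp):

    /* Each iteration writes only c[i], so the iterations are independent
       and the loop can be split across threads. */
    void vec_add(int n, const double *a, const double *b, double *c) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }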

Functional Level Look at the parts of a program and determine which parts can be done independently. Use a dependency graph to find the dependencies/independencies Static or Dynamic assignment of tasks to processors – Dynamic would use a task pool

Explicit/Implicit Parallelism Expression How parallelism is expressed is language dependent Some languages hide the parallelism within the language For other languages, you must explicitly state the parallelism

Parallelizing Compilers Takes a program in a sequential language and generates parallel code – Must analyze the dependencies and not violate them – Should provide good load balancing (difficult) – Minimize communication Functional Programming Languages – Express computations as evaluation of a function with no side effects – Allows for parallel evaluation

More explicit/implicit Explicit parallelism/implicit distribution – The language explicitly states the parallelism in the algorithm, but allows the system to assign the tasks to processors. Explicit assignment to processors – do not have to worry about communications Explicit Communication and Synchronization – MPI – additionally must explicitly state communications and synchronization points

Parallel Programming Patterns Process/Thread Creation Fork-Join Parbegin-Parend SPMD, SIMD Master-Slave (worker) Client-Server Pipelining Task Pools Producer-Consumer

Process/Thread Creation Static or Dynamic Threads, traditionally dynamic Processes, traditionally static, but dynamic creation has more recently become available

Fork-Join An existing thread can create a number of child threads with a fork. The child threads work in parallel. Join waits for all the forked threads to terminate. Spawn/exit is similar
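A minimal Pthreads sketch of fork-join (the thread count and work function are illustrative assumptions):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                                /* illustrative */

    static void *work(void *arg) {
        printf("child %ld working\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)           /* "fork": create the children */
            pthread_create(&t[i], NULL, work, (void *)i);
        for (int i = 0; i < NTHREADS; i++)            /* "join": wait for all children */
            pthread_join(t[i], NULL);
        return 0;
    }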

Parbegin-Parend Also called cobegin-coend Each statement (block/function call) in the cobegin-coend block is to be executed in parallel. Statements after coend are not executed until all the parallel statements are complete.
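C has no cobegin/coend construct; OpenMP sections are a rough equivalent, sketched here with placeholder functions f, g, and h:

    void f(void); void g(void); void h(void);          /* placeholders */

    void cobegin_coend(void) {
        #pragma omp parallel sections
        {
            #pragma omp section
            f();
            #pragma omp section
            g();
            #pragma omp section
            h();
        }
        /* implicit barrier: execution continues here only after f, g, and h all finish */
    }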

SPMD – SIMD Single Program, Multiple Data vs. Single Instruction, Multiple Data Both use a number of threads/processes which apply the same program to different data SIMD executes the statements synchronously on different data SPMD executes the statements asynchronously

Master-Slave One thread/process that controls all the others If dynamic thread/process creation, the master is the one that usually does it. Master would “assign” the work to the workers and the workers would send the results to the master

Client-Server Multiple clients connected to a server that responds to requests Server could be satisfying requests in parallel (multiple requests being done in parallel or if the request is involved, a parallel solution to the request) The client would also do some work with response from server. Very good model for heterogeneous systems

Pipelining Output of one thread is the input to another thread A special type of functional decomposition Another case where heterogeneous systems would be useful

Task Pools Keep a collection of tasks to be done together with the data they operate on A thread/process can generate new tasks to add to the pool, as well as obtain a task from the pool when it finishes its current one

Producer Consumer Producer threads create data used as input by the consumer threads Data is stored in a common buffer that is accessed by producers and consumers Producer cannot add if buffer is full Consumer cannot remove if buffer is empty
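A classic bounded-buffer sketch in C with POSIX semaphores and a mutex (buffer size and item type are illustrative assumptions). The counting semaphores make producers block when the buffer is full and consumers block when it is empty:

    #include <pthread.h>
    #include <semaphore.h>

    #define BUF_SIZE 8                       /* illustrative */

    static int buf[BUF_SIZE];
    static int in = 0, out = 0;
    static sem_t empty_slots, full_slots;    /* counting semaphores */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void buffer_init(void) {
        sem_init(&empty_slots, 0, BUF_SIZE); /* all slots start empty */
        sem_init(&full_slots, 0, 0);         /* no items yet */
    }

    void produce(int item) {
        sem_wait(&empty_slots);              /* blocks while the buffer is full */
        pthread_mutex_lock(&lock);
        buf[in] = item;
        in = (in + 1) % BUF_SIZE;
        pthread_mutex_unlock(&lock);
        sem_post(&full_slots);               /* one more item available */
    }

    int consume(void) {
        sem_wait(&full_slots);               /* blocks while the buffer is empty */
        pthread_mutex_lock(&lock);
        int item = buf[out];
        out = (out + 1) % BUF_SIZE;
        pthread_mutex_unlock(&lock);
        sem_post(&empty_slots);              /* one more free slot */
        return item;
    }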

Array Data Distributions 1-D
– Blockwise Each process gets ceil(n/p) elements of A, except for the last process, which gets n-(p-1)*ceil(n/p) elements Alternatively, the first n%p processes get ceil(n/p) elements while the rest get floor(n/p) elements.
– Cyclic Process i gets elements i + k*p (k = 0, 1, ..., while i + k*p < n)
– Block-cyclic Distribute blocks of size b to processes in a cyclic manner
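A small C sketch of which process owns element i under these distributions (the blockwise version follows the second alternative above; function names are illustrative):

    /* Blockwise (second alternative): the first n % p processes hold
       ceil(n/p) elements each, the remaining processes hold floor(n/p). */
    int block_owner(int i, int n, int p) {
        int big = (n + p - 1) / p;               /* ceil(n/p) */
        int small = n / p;                       /* floor(n/p) */
        int cutoff = (n % p) * big;              /* elements held by the "big" processes */
        return (i < cutoff) ? i / big : (n % p) + (i - cutoff) / small;
    }

    /* Cyclic: element i belongs to process i mod p. */
    int cyclic_owner(int i, int p) {
        return i % p;
    }

    /* Block-cyclic with block size b: whole blocks are dealt out cyclically. */
    int block_cyclic_owner(int i, int b, int p) {
        return (i / b) % p;
    }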

2-D Array distribution Blockwise distribution of rows or columns Cyclic distribution of rows or columns Blockwise-cyclic distribution of rows or columns

Checkerboard Take an array of size n x m Overlay a grid of size g x f – g<=n – f<=m – More easily seen if n is a multiple of g and m is a multiple of f Blockwise Checkerboard – Assign each n/g x m/f submatrix to a processor

Cyclic Checkerboard Take each item in a n/g x m/f submatrix and assign it in a cyclic manner. Block-Cyclic checkerboard – Take each n/g x m/f submatrix and assign all the data in the submatrix to a processor in a cyclic fashion

Information Exchange Shared Variables – Used in shared memory – When thread T1 wants to share information with thread T2, T1 writes the information into a variable that is shared with T2 – Must avoid two or more threads accessing the same variable at the same time when at least one of them writes it (a race condition) – Leads to non-deterministic behavior.

Critical Sections Sections of code where there may be concurrent accesses to shared variables Must make these sections mutually exclusive – Only one process can be executing this section at any one time Lock mechanism is used to keep sections mutually exclusive – Process checks to see if section is “open” – If it is, then “lock” it and execute (unlock when done) – If not, wait until unlocked
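A minimal mutual-exclusion sketch with a Pthreads mutex (the shared counter is an illustrative stand-in for any shared variable):

    #include <pthread.h>

    static long counter = 0;                                /* shared variable */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* the lock */

    void increment(void) {
        pthread_mutex_lock(&m);    /* waits here if another thread holds the lock */
        counter++;                 /* critical section: one thread at a time */
        pthread_mutex_unlock(&m);  /* release so a waiting thread can enter */
    }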

Communication Operations
Single transfer – Pi sends a message to Pj
Single broadcast – one process sends the same data to all other processes
Single accumulation – many values are combined into a single value that is placed in the root process
Gather – each process provides a block of data to a single common process
Scatter – the root process sends a separate block of a large data structure to every other process

More Communications Multi-broadcast – every process sends its data to every other process, so every process ends up with all the data that was spread among the processes Multi-accumulate – like accumulation, but every process gets the result Total exchange – each process provides p data blocks; its i-th block is sent to process Pi. Each process receives one block from every other process and assembles them in index order.
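In MPI these operations correspond to the standard collectives; a sketch with illustrative buffer sizes (assumes MPI_Init has already been called and at most 64 processes):

    #include <mpi.h>

    void collectives_demo(void) {
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);                   /* assume p <= 64 below */

        double x = rank, sum, block, all[64], slice[64];     /* sizes illustrative */

        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);                           /* single broadcast    */
        MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);           /* single accumulation */
        MPI_Gather(&x, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);      /* gather              */
        MPI_Scatter(all, 1, MPI_DOUBLE, &block, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); /* scatter             */
        MPI_Allgather(&x, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, MPI_COMM_WORLD);      /* multi-broadcast     */
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);           /* multi-accumulate    */
        MPI_Alltoall(all, 1, MPI_DOUBLE, slice, 1, MPI_DOUBLE, MPI_COMM_WORLD);    /* total exchange      */

        (void)block; (void)slice; (void)sum;
    }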

Applications Parallel Matrix-Vector Product – Ab=c, where A is n x m, b has m elements, and c has n elements – Want A to be in contiguous memory A single array, not an array of arrays – Have blocks of rows, each with all of b, calculate a block of c Used if A is stored row-wise – Have blocks of columns, each with the corresponding block of b, compute partial column results that must then be summed. Used if A is stored column-wise
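A sketch of the row-blocked variant in C with OpenMP, assuming A is stored row-major in one contiguous array (n rows, m columns); each thread computes a block of rows of c using all of b:

    void matvec_rows(int n, int m, const double *A, const double *b, double *c) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < m; j++)
                sum += A[i * m + j] * b[j];   /* row i of A times all of b */
            c[i] = sum;
        }
    }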

Processes and Threads Process – a program in execution – Includes code, program data on the stack or heap, register values, and the PC. – Assigned to a processor or core for execution – If there are more processes than resources (processors or memory) for all of them, they execute in a round-robin, time-shared manner – Context switch – changing which process is executing on a processor.

Fork The Unix fork system call – Creates a new process – Makes a copy of the calling process – The copy resumes execution at the statement after the fork – NOT a shared memory model – Distributed memory model – Can take a while to execute
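A minimal fork() sketch; the child receives a copy of the parent's address space and resumes just after the call:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();          /* duplicate the calling process */
        if (pid == 0) {
            printf("child: my copy of the program resumes after fork()\n");
        } else if (pid > 0) {
            wait(NULL);              /* parent waits for the child to finish */
            printf("parent: child %d finished\n", (int)pid);
        } else {
            perror("fork");          /* creation failed */
        }
        return 0;
    }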

Threads Share a single address space Best with physically shared memory Easier to get started than a process – no copy of code space Two types – Kernel threads – managed by the OS – User threads – managed by a thread library

Thread Execution If user threads are executed by a thread library/scheduler, (no OS support for threads) then all the threads are part of one process that is scheduled by the OS – Only one thread executed at a time even if there are multiple processors If OS has thread management, then threads can be scheduled by OS and multiple threads can execute concurrently Or, Thread scheduler can map user threads to kernel threads (several user threads may map to one kernel thread)

Thread States Newly generated Executable Running Waiting Finished Threads transition from state to state based on events (start, interrupt, end, block, unblock, assign-to-processor)

Synchronization Locks – A process “locks” a shared variable at the beginning of a critical section The lock allows the process to proceed if the shared variable is unlocked The process is blocked if the variable is locked, until it is unlocked Locking is an “atomic” operation.

Semaphore Usually binary (0/1) but can be a general counting (integer) semaphore wait(s) – Waits until the value of s is 1 (or greater) – When it is, decreases s by 1 and continues signal(s) – Increments s by 1
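With POSIX semaphores, wait and signal appear as sem_wait and sem_post; a minimal binary-semaphore sketch:

    #include <semaphore.h>

    static sem_t s;

    void init(void)  { sem_init(&s, 0, 1); }   /* binary semaphore, initially 1 ("open") */
    void enter(void) { sem_wait(&s); }         /* wait(s): block until s > 0, then decrement */
    void leave(void) { sem_post(&s); }         /* signal(s): increment s, possibly waking a waiter */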

Barrier Synchronization A way to have every process wait until every process is at a certain point Assures the state of every process before certain code is executed
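A Pthreads barrier sketch (thread count is an illustrative assumption); every thread blocks in pthread_barrier_wait until all have arrived:

    #include <pthread.h>

    #define NTHREADS 4                          /* illustrative */
    static pthread_barrier_t barrier;

    void barrier_init(void) {
        pthread_barrier_init(&barrier, NULL, NTHREADS);
    }

    void *phase_worker(void *arg) {
        /* ... phase 1 work ... */
        pthread_barrier_wait(&barrier);         /* nobody starts phase 2 until all finish phase 1 */
        /* ... phase 2 work ... */
        return arg;
    }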

Condition Synchronization A thread is blocked until a given condition is established – If the condition is not true, the thread is put into the blocked state – When the condition becomes true, it is moved from blocked to ready (not necessarily directly onto a processor) – Since other processes may execute in the meantime, by the time this thread gets a processor the condition may no longer be true So the condition must be re-checked after the thread resumes
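This re-check is why condition variables are waited on inside a while loop; a Pthreads sketch with an illustrative predicate `ready`:

    #include <pthread.h>

    static int ready = 0;                                   /* illustrative condition */
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    void wait_until_ready(void) {
        pthread_mutex_lock(&m);
        while (!ready)                  /* re-check: may be false again by the time we run */
            pthread_cond_wait(&cv, &m); /* atomically release m and block; reacquire on wakeup */
        /* ... use the condition while still holding m ... */
        pthread_mutex_unlock(&m);
    }

    void make_ready(void) {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&cv);       /* move a waiter from blocked to ready */
        pthread_mutex_unlock(&m);
    }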

Efficient Thread Programs Proper number of threads – Consider degree of parallelism in application – Number of processors – Size of shared cache Avoid synchronization as much as possible – Make critical section as small as possible Watch for deadlock conditions

Memory Access Must consider writing values to shared memory that is also held in local caches False sharing – Consider 2 processes writing to different memory locations – This SHOULD not be an issue, since the locations are not shared between the two caches – HOWEVER, if the memory locations are close to each other they may lie in the same cache line, so both caches hold copies of that line and every write by one process invalidates the other's copy, even though no location is actually shared
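A sketch of the layout issue; the 64-byte cache-line size is an assumption about the target machine:

    /* Two threads each update "their own" counter, but if the counters sit in
       the same cache line, that line ping-pongs between the cores' caches. */
    struct counters_bad {
        long a;                        /* updated by thread 0 */
        long b;                        /* updated by thread 1 -- likely same 64-byte line as a */
    };

    /* Padding each counter to its own cache line avoids the ping-pong. */
    struct counters_padded {
        long a;
        char pad[64 - sizeof(long)];   /* assumes a 64-byte cache line */
        long b;
    };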