SE-292 High Performance Computing Intro. To Concurrent Programming & Parallel Architecture Sathish Vadhiyar L1
Concurrent Programming Until now: execution involved one flow of control through program Concurrent programming is about programs with multiple flows of control For example: a program that runs as multiple processes cooperating to achieve a common goal To cooperate, processes must somehow communicate L21
Inter Process Communication (IPC)
Using files
- Parent process creates 2 files before forking the child process
- Child inherits the file descriptors from the parent, and they share the file pointers
- Can use one file for parent to write and child to read, the other for child to write and parent to read
Using a pipe (supported by the OS)
- Producer writes at one end (write-end) and consumer reads from the other end (read-end)
- Corresponds to 2 file descriptors (int fd[2])
- A read from fd[0] returns data written to fd[1] in FIFO order; POSIX guarantees only this direction, so two-way communication needs two pipes
- Used with fork: parent process creates a pipe and uses it to communicate with a child process
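The pipe-plus-fork pattern described above can be sketched in C roughly as follows. This is a minimal illustration, assuming a POSIX system; the helper name pipe_roundtrip and its parameters are hypothetical and chosen only for this example. The child acts as producer and the parent as consumer.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes msg into a pipe and the parent
   reads it back into out. Returns the number of bytes read, or -1. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fd[2];                      /* fd[0]: read-end, fd[1]: write-end */
    if (outlen == 0 || pipe(fd) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                 /* child: producer */
        close(fd[0]);               /* child does not read */
        ssize_t w = write(fd[1], msg, strlen(msg));
        close(fd[1]);
        _exit(w < 0 ? 1 : 0);
    }

    /* parent: consumer */
    close(fd[1]);                   /* so read() sees EOF once the child closes */
    size_t total = 0;
    ssize_t n;
    while (total < outlen - 1 &&
           (n = read(fd[0], out + total, outlen - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(fd[0]);
    waitpid(pid, NULL, 0);          /* reap the child */
    return (ssize_t)total;
}
```

Each process closes the end it does not use; otherwise the reader would never see end-of-file.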
Other IPC Mechanisms
Processes could communicate through variables that are shared between them
- Shared variables, shared memory; other variables are private to a process
- Special OS support for a program to specify objects that are to be in shared regions of the address space
- System V shared memory: shmget, shmat (the POSIX shared memory interface is shm_open with mmap)
Processes could communicate by sending and receiving messages to each other
- Special OS support for these messages
More Ideas on IPC Mechanisms Sometimes processes don’t need to communicate explicit values to cooperate They might just have to synchronize their activities Example: Process 1 reads 2 matrices, Process 2 multiplies them, Process 3 writes the result matrix Process 2 should not start work until Process 1 finishes reading, etc. Called process synchronization Synchronization primitives Examples: mutex lock, semaphore, barrier L21
Programming With Shared Variables
Consider a 2 process program in which both processes increment a shared variable
    shared int X = 0;
    P1: X++;
    P2: X++;
Q: What is the value of X after this?
Complication: remember that X++ compiles into something like
    LOAD  R1, 0(R2)
    ADD   R1, R1, 1
    STORE 0(R2), R1
Problem with using shared variables
Final value of X could be 1!
- P1 loads X into R1, increments R1
- P2 loads X into its register before P1 stores the new value into X
- Net result: P1 stores 1, P2 stores 1
Moral of example
- Necessary to synchronize processes that are interacting using shared variables
- Problem arises when 2 or more processes try to update a shared variable
- Critical Section: part of the program where a shared variable is accessed like this
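The lost update can be replayed deterministically by modelling each process's register as a local variable and writing out the LOAD, ADD and STORE steps by hand. This is only an illustrative sketch: the function names and the explicit step ordering are assumptions made for the example, and no real concurrency is involved.

```c
/* r1_p1 and r1_p2 stand in for register R1 of processes P1 and P2. */

/* Bad interleaving: P2 loads X before P1 stores its new value. */
int lost_update(void)
{
    int X = 0;
    int r1_p1, r1_p2;

    r1_p1 = X;           /* P1: LOAD                         */
    r1_p1 = r1_p1 + 1;   /* P1: ADD                          */
    r1_p2 = X;           /* P2: LOAD, still sees 0           */
    X = r1_p1;           /* P1: STORE 1                      */
    r1_p2 = r1_p2 + 1;   /* P2: ADD                          */
    X = r1_p2;           /* P2: STORE 1, P1's update is lost */
    return X;
}

/* Serialized order: P1 finishes all three steps before P2 starts. */
int serialized_update(void)
{
    int X = 0;
    int r1_p1, r1_p2;

    r1_p1 = X;           /* P1: LOAD    */
    r1_p1 = r1_p1 + 1;   /* P1: ADD     */
    X = r1_p1;           /* P1: STORE 1 */
    r1_p2 = X;           /* P2: LOAD, sees 1 */
    r1_p2 = r1_p2 + 1;   /* P2: ADD     */
    X = r1_p2;           /* P2: STORE 2 */
    return X;
}
```

Here lost_update() returns 1 and serialized_update() returns 2; mutual exclusion is what forces the second ordering.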
Critical Section Problem: Mutual Exclusion Must synchronize processes so that they access shared variable one at a time in critical section; called Mutual Exclusion Mutex Lock: a synchronization primitive AcquireLock(L) Done before critical section of code Returns when safe for process to enter critical section ReleaseLock(L) Done after critical section Allows another process to acquire lock L21
Implementing a Lock
    int L = 0;   /* 0: lock available */
    AcquireLock(L):
        while (L == 1) ;   /* `BUSY WAITING’ */
        L = 1;
    ReleaseLock(L):
        L = 0;
Why this implementation fails
    while (L == 1) ;
    L = 1;
In machine instructions:
    wait: LW   R1, Addr(L)
          BNEZ R1, wait
          ADDI R1, R0, 1
          SW   R1, Addr(L)
Assume that lock L is currently available (L = 0) and that 2 processes, P1 and P2, try to acquire the lock L
    time  Process 1                Process 2
    1     LW (R1 loaded with 0)
    2     Context Switch
    3                              LW (R1 loaded with 0)
    4                              BNEZ, ADDI, SW
    5                              Enter CS
    6                              Context Switch
    7     BNEZ, ADDI, SW
    8     Enter CS
IMPLEMENTATION ALLOWS PROCESSES P1 AND P2 TO BE IN CRITICAL SECTION TOGETHER!
Busy Wait Lock Implementation
Hardware support will be useful to implement a lock
Example: Test&Set instruction
    Test&Set Lock:
        tmp = Lock
        Lock = 1
        Return tmp
These 3 steps happen atomically (indivisibly), i.e., all 3 happen as one operation with nothing happening in between
Atomic Read-Modify-Write (RMW) instruction
Busy Wait Lock with Test&Set
    AcquireLock(L):
        while (Test&Set(L)) ;
    ReleaseLock(L):
        L = 0;
Consider the case where P1 is currently in a critical section and P2-P10 are executing AcquireLock: all are executing the while loop
When P1 releases the lock, by the definition of Test&Set exactly one of P2-P10 will read the new lock value of 0 and set L back to 1
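A Test&Set busy-wait lock can be sketched in C11, where atomic_flag_test_and_set from <stdatomic.h> provides the atomic read-modify-write step. The sketch assumes POSIX threads stand in for the processes; the names acquire_lock, release_lock and spin_counter_demo are hypothetical and exist only for this example.

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16

static atomic_flag L = ATOMIC_FLAG_INIT;   /* clear: lock available */
static long shared_x;                      /* protected by L */

static void acquire_lock(atomic_flag *lock)
{
    /* test_and_set returns the old value; loop until it was clear */
    while (atomic_flag_test_and_set(lock))
        ;                                  /* busy wait */
}

static void release_lock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

static void *incrementer(void *arg)
{
    long iters = *(const long *)arg;
    for (long i = 0; i < iters; i++) {
        acquire_lock(&L);
        shared_x++;                        /* critical section */
        release_lock(&L);
    }
    return NULL;
}

/* Runs nthreads incrementers and returns the final value of shared_x,
   or -1 if the threads could not all be started. */
long spin_counter_demo(int nthreads, long iters)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    shared_x = 0;
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, incrementer, &iters) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == nthreads ? shared_x : -1;
}
```

With the lock in place, the final count is always nthreads * iters, because no increment is lost.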
More on Locks Other names for this kind of lock Mutex Spin wait lock Busy wait lock Can have locks where instead of busy waiting, an unsuccessful process gets blocked by the operating system Lec22
Semaphore
A more general synchronization mechanism
Operations: P (wait) and V (signal)
P(S)
- If S is nonzero, decrements S and returns
- Else, suspends the process until S becomes nonzero, when the process is restarted
- After restarting, decrements S and returns
V(S)
- Increments S by 1
- If there are processes blocked for S, restarts exactly one of them
Critical Section Problem & Semaphore
Semaphore S = 1;
- Before critical section: P(S)
- After critical section: V(S)
Semaphores can do more than mutex locks
- Initialize S to 10 and 10 processes will be allowed to proceed
- P1: read matrices; P2: multiply; P3: write product
- Semaphores S1 = S2 = 0; end of P1: V(S1), beginning of P2: P(S1), etc.
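The read / multiply / write ordering with S1 = S2 = 0 can be sketched with POSIX semaphores, where sem_wait plays the role of P and sem_post the role of V. This is a minimal sketch assuming threads in place of processes and unnamed semaphores (sem_init), which are available on Linux; the stage functions and run_pipeline are hypothetical names, and each stage only records its number instead of doing real work.

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t s1, s2;          /* both start at 0 */
static char stage_log[4];     /* order in which the stages ran */
static int log_len;

static void record(char c) { stage_log[log_len++] = c; }

static void *stage1(void *arg)   /* "read matrices" */
{
    (void)arg;
    record('1');
    sem_post(&s1);               /* V(S1): stage 2 may start */
    return NULL;
}

static void *stage2(void *arg)   /* "multiply" */
{
    (void)arg;
    sem_wait(&s1);               /* P(S1): wait for stage 1 */
    record('2');
    sem_post(&s2);               /* V(S2): stage 3 may start */
    return NULL;
}

static void *stage3(void *arg)   /* "write product" */
{
    (void)arg;
    sem_wait(&s2);               /* P(S2): wait for stage 2 */
    record('3');
    return NULL;
}

/* Starts the stages in reverse order on purpose; the semaphores still
   force them to run as 1, 2, 3. Returns the recorded order. */
const char *run_pipeline(void)
{
    pthread_t t1, t2, t3;

    log_len = 0;
    sem_init(&s1, 0, 0);
    sem_init(&s2, 0, 0);
    pthread_create(&t3, NULL, stage3, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    stage_log[log_len] = '\0';
    sem_destroy(&s1);
    sem_destroy(&s2);
    return stage_log;
}
```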
Deadlock
Consider the following process:
    P1: lock(L); wait(L);
P1 is waiting for something (release of the lock that it is holding) that will never happen
Simple case of a general problem called deadlock
- Cycle of processes waiting for resources held by others while holding resources needed by others
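The self-deadlock case can be observed safely with a POSIX error-checking mutex, which reports the problem instead of hanging. The sketch below assumes a Pthreads implementation that supports PTHREAD_MUTEX_ERRORCHECK; the function name self_relock_result is hypothetical. With a default mutex the second lock call would block forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Locks a mutex twice from the same thread and returns the result of
   the second attempt. An error-checking mutex returns EDEADLK here. */
int self_relock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);            /* first acquisition succeeds */
    int rc = pthread_mutex_lock(&m);   /* waiting for a lock we hold */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```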
Classical Problems Producers-Consumers Problem Bounded buffer problem Producer process makes things and puts them into a fixed size shared buffer Consumer process takes things out of shared buffer and uses them Must ensure that producer doesn’t put into full buffer or consumer take out of empty buffer While treating buffer accesses as critical section Lec22
Producers-Consumers Problem
    shared Buffer[0 .. N-1]
    Producer: repeatedly
        Produce x
        if (buffer is full) wait for consumption
        Buffer[i++] = x
        signal consumer
    Consumer: repeatedly
        if (buffer is empty) wait for production
        y = Buffer[--i]
        signal producer
        Consume y
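One common way to realize the waits and signals above is with two counting semaphores plus a mutex around the buffer accesses. The following is a sketch under that assumption, using POSIX threads and unnamed semaphores; the names empty_slots, full_slots and producer_consumer_sum are hypothetical. It keeps the slide's stack-style indexing (Buffer[i++] and Buffer[--i]), and the consumer simply sums what it takes out.

```c
#include <pthread.h>
#include <semaphore.h>

#define N 8                          /* buffer capacity */

static int buffer[N];
static int top;                      /* the slide's index i */
static sem_t empty_slots;            /* counts free slots, starts at N */
static sem_t full_slots;             /* counts filled slots, starts at 0 */
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;
static long consumed_sum;

static void *producer(void *arg)
{
    int n = *(const int *)arg;
    for (int x = 1; x <= n; x++) {          /* produce x */
        sem_wait(&empty_slots);             /* buffer full: wait for consumption */
        pthread_mutex_lock(&buf_lock);
        buffer[top++] = x;                  /* critical section */
        pthread_mutex_unlock(&buf_lock);
        sem_post(&full_slots);              /* signal consumer */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    int n = *(const int *)arg;
    long sum = 0;
    for (int k = 0; k < n; k++) {
        sem_wait(&full_slots);              /* buffer empty: wait for production */
        pthread_mutex_lock(&buf_lock);
        int y = buffer[--top];              /* critical section */
        pthread_mutex_unlock(&buf_lock);
        sem_post(&empty_slots);             /* signal producer */
        sum += y;                           /* consume y */
    }
    consumed_sum = sum;
    return NULL;
}

/* Produces 1..n, consumes all items, and returns their sum. */
long producer_consumer_sum(int n)
{
    pthread_t prod, cons;

    top = 0;
    consumed_sum = 0;
    sem_init(&empty_slots, 0, N);
    sem_init(&full_slots, 0, 0);
    pthread_create(&prod, NULL, producer, &n);
    pthread_create(&cons, NULL, consumer, &n);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    sem_destroy(&empty_slots);
    sem_destroy(&full_slots);
    return consumed_sum;
}
```

The semaphores prevent overflow and underflow; the mutex alone makes the index update a critical section.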
Dining Philosophers Problem
N philosophers sitting around a circular table, with a plate of food in front of each and a fork between each 2 philosophers
Philosopher does: repeatedly
- Eat (using 2 forks)
- Think
Problem: avoid deadlock; be fair
THREADS
Thread
- A basic unit of CPU utilization
- Thread of control in a process
- `Light weight process’
Weight related to
- Time for creation
- Time for context switch
- Size of context
Recall context of process
Threads and Processes
Thread context
- Thread id
- Stack
- Stack pointer, PC, GPR values
So, thread context switching can be fast
Many threads in the same process share parts of the process context
- Virtual address space (other than stack)
So, threads in the same process share variables that are not stack allocated
Threads and Sharing Shares with other threads of a process – code section, data section, open files and signals
Threads
Benefits: responsiveness, communication, parallelism and scalability
Types: user threads and kernel threads
Multithreading models
- Many-to-one: efficient, but the entire process will block if a thread makes a blocking system call
- One-to-one: e.g. Linux. Allows parallelism, but heavy weight
- Many-to-many: balance between the above two schemes
Thread Implementation
Could either be supported in the operating system or by a library
Pthreads: POSIX thread library, a standard for defining thread creation and synchronization
- int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg)
- pthread_attr
- pthread_join
- pthread_exit
- pthread_detach
Do "man -k pthread"
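A short example of pthread_create and pthread_join may help. The sketch below is illustrative only: struct task, square and sum_of_squares_threaded are hypothetical names. Each thread receives a pointer to its own task, so no locking is needed, and the main thread reads the results only after joining.

```c
#include <pthread.h>

#define MAX_TASKS 8

struct task {
    int input;
    int result;
};

/* Thread start routine: matches void *(*)(void *). */
static void *square(void *arg)
{
    struct task *t = arg;
    t->result = t->input * t->input;
    return NULL;
}

/* Squares 0..n-1 in n threads and returns the sum, or -1 on error. */
int sum_of_squares_threaded(int n)
{
    pthread_t tid[MAX_TASKS];
    struct task tasks[MAX_TASKS];
    int started = 0;
    int sum = 0;

    if (n < 0 || n > MAX_TASKS)
        return -1;
    for (int i = 0; i < n; i++) {
        tasks[i].input = i;
        tasks[i].result = 0;
        if (pthread_create(&tid[i], NULL, square, &tasks[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);      /* wait before reading result */
        sum += tasks[i].result;
    }
    return started == n ? sum : -1;
}
```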
Synchronization Primitives
Mutex locks
- int pthread_mutex_lock(pthread_mutex_t *mutex)
- If the mutex is already locked, the calling thread blocks until the mutex becomes available
- Returns with the mutex object referenced by mutex in the locked state, with the calling thread as its owner
- pthread_mutex_unlock
Semaphores
- sem_init
- sem_wait
- sem_post
Pthread scheduling Process contention scope – scheduling user-level threads among a set of kernel threads. System contention scope – scheduling kernel threads for CPU. Functions for setting the scope - pthread_attr_setscope, pthread_attr_getscope Can use PTHREAD_SCOPE_PROCESS for PCS and PTHREAD_SCOPE_SYSTEM for SCS
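Setting and reading back the contention scope on a thread attribute object might look like the sketch below. The helper name scope_after_set is hypothetical. Which scopes are supported is implementation dependent: Linux, for example, accepts PTHREAD_SCOPE_SYSTEM and rejects PTHREAD_SCOPE_PROCESS, so the helper returns -1 when the request is refused.

```c
#include <pthread.h>

/* Requests a contention scope on a fresh attribute object and returns
   the scope actually stored, or -1 if the request was rejected. */
int scope_after_set(int requested)
{
    pthread_attr_t attr;
    int scope = -1;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setscope(&attr, requested) != 0) {
        pthread_attr_destroy(&attr);
        return -1;                   /* scope not supported here */
    }
    if (pthread_attr_getscope(&attr, &scope) != 0)
        scope = -1;
    pthread_attr_destroy(&attr);
    return scope;
}
```

The attribute object would normally be passed to pthread_create so that the new thread is scheduled with that scope.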
Thread Safety
A function is thread safe if it always produces correct results when called repeatedly from multiple concurrent threads
Thread unsafe functions
- That don't protect shared variables
- That keep state across multiple invocations
- That return a pointer to a static variable
- That call thread unsafe functions
Races
- When the correctness of a program depends on one thread reaching a point x before another thread reaches a point y
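The "returns a pointer to a static variable" case can be illustrated without any threads at all. In the sketch below, label_unsafe and label_r are hypothetical functions invented for the example: the first hands every caller the same static buffer, while the second writes into a buffer supplied by the caller, the usual way such functions are made reentrant.

```c
#include <stddef.h>
#include <stdio.h>

/* Thread unsafe: every call returns the same static buffer, so a
   second call (possibly from another thread) overwrites the first. */
const char *label_unsafe(int id)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "id-%d", id);
    return buf;
}

/* Reentrant variant: the caller owns the storage. */
char *label_r(int id, char *buf, size_t len)
{
    snprintf(buf, len, "id-%d", id);
    return buf;
}
```

Two calls to label_unsafe return the same pointer, so the earlier result is silently replaced; two calls to label_r with separate buffers keep both results.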
Parallel Architecture
PARALLEL ARCHITECTURE
Parallel Machine: a computer system with more than one processor
Motivations
- Faster execution time, because regions of code with no dependencies between them can run at the same time
- Presents a level of modularity
- Resource constraints, e.g. large databases
- Certain classes of algorithms lend themselves to parallel execution
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade: 40%
- Memory access time improvement in the past decade: 10%
Classification of Architectures – Flynn’s classification In terms of parallelism in instruction and data stream Single Instruction Single Data (SISD): Serial Computers Single Instruction Multiple Data (SIMD) - Vector processors and processor arrays - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600 Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification Multiple Instruction Single Data (MISD): Not popular Multiple Instruction Multiple Data (MIMD) - Most popular - IBM SP and most other supercomputers, clusters, computational Grids etc. Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory Shared memory 2 types – UMA and NUMA NUMA Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q UMA Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space
- Communication can be done through this shared memory
The alternative is sometimes referred to as a message passing machine or a distributed memory machine
[Figure: a message passing machine, with processor-memory pairs (P, M) attached to an interconnect, and a shared memory machine, with processors (P) connected through an interconnect to main memory]
Shared Memory Machines The shared memory could itself be distributed among the processor nodes Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time Terms: NUMA vs UMA architecture Non-Uniform Memory Access Uniform Memory Access Lec22
Classification of Architectures – Based on Memory Distributed memory Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/ Recently multi-cores Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Parallel Architecture: Interconnection Networks An interconnection network defined by switches, links and interfaces Switches – provide mapping between input and output ports, buffering, routing etc. Interfaces – connects nodes with network
Parallel Architecture: Interconnections Indirect interconnects: nodes are connected to interconnection medium, not directly to each other Shared bus, multiple bus, crossbar, MIN Direct interconnects: nodes are connected directly to each other Topology: linear, ring, star, mesh, torus, hypercube Routing techniques: how the route taken by the message from source to destination is decided Network topologies Static – point-to-point communication links among processing nodes Dynamic – Communication links are formed dynamically by switches Lec22
Interconnection Networks
Static
- Bus
- Completely connected
- Star
- Linear array, Ring (1-D torus)
- Mesh
- k-d mesh: d dimensions with k nodes in each dimension
- Hypercubes: a k-d mesh with k = 2 and d = log p
- Trees (e.g. our campus network)
Dynamic: communication links are formed dynamically by switches
- Crossbar
- Multistage
For more details, and evaluation of topologies, refer to the book by Grama et al.
Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars]
Direct Interconnect Topologies
[Figure: star, ring, linear, 2D mesh, torus, and hypercube (binary n-cube) for n=2 and n=3]
Evaluating Interconnection topologies
Diameter: maximum distance between any two processing nodes
- Fully connected: 1
- Star: 2
- Ring: p/2
- Hypercube: log P
Connectivity: multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
- Linear array: 1
- Ring: 2
- 2-d mesh: 2
- 2-d mesh with wraparound: 4
- d-dimensional hypercube: d
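The diameter values for the ring and the hypercube can be checked by brute force. The sketch below assumes the usual node labelling: ring nodes are numbered 0..p-1 around the circle, and hypercube nodes carry d-bit labels with neighbours differing in exactly one bit, so the hop count between two nodes is the number of differing bits. All function names are hypothetical.

```c
/* Hops between hypercube nodes a and b: number of differing label bits. */
int hypercube_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    int hops = 0;
    while (diff != 0) {
        hops += (int)(diff & 1u);
        diff >>= 1;
    }
    return hops;
}

/* Hops between ring nodes a and b in a p-node ring: go the shorter way. */
int ring_distance(int a, int b, int p)
{
    int d = a > b ? a - b : b - a;
    return d < p - d ? d : p - d;
}

/* Diameter of a d-dimensional hypercube, by checking every node pair. */
int hypercube_diameter(int d)
{
    unsigned nodes = 1u << d;
    int best = 0;
    for (unsigned a = 0; a < nodes; a++)
        for (unsigned b = 0; b < nodes; b++)
            if (hypercube_distance(a, b) > best)
                best = hypercube_distance(a, b);
    return best;
}

/* Diameter of a p-node ring, by checking every node pair. */
int ring_diameter(int p)
{
    int best = 0;
    for (int a = 0; a < p; a++)
        for (int b = 0; b < p; b++)
            if (ring_distance(a, b, p) > best)
                best = ring_distance(a, b, p);
    return best;
}
```

For P = 16 nodes this gives a ring diameter of 8 (p/2) and a hypercube diameter of 4 (log P).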
Evaluating Interconnection topologies
Bisection width: minimum number of links to be removed from the network to partition it into 2 equal halves
- Ring: 2
- P-node 2-D mesh: sqrt(P)
- Tree: 1
- Star: 1
- Completely connected: P^2/4
- Hypercube: P/2
Evaluating Interconnection topologies channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes channel rate – performance of a single physical wire channel bandwidth – channel rate times channel width bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth
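As a worked example of these definitions, the sketch below multiplies the quantities out for a hypercube. It assumes the channel rate is given in bits per second per wire, which the slide does not specify, and uses the hypercube bisection width of P/2; the function names are hypothetical.

```c
/* channel bandwidth = channel rate x channel width */
double channel_bandwidth(int channel_width_bits, double channel_rate)
{
    return channel_width_bits * channel_rate;
}

/* bisection bandwidth = bisection width x channel bandwidth */
double bisection_bandwidth(long bisection_width, int channel_width_bits,
                           double channel_rate)
{
    return (double)bisection_width *
           channel_bandwidth(channel_width_bits, channel_rate);
}

/* Bisection width of a P-node hypercube. */
long hypercube_bisection_width(long p)
{
    return p / 2;
}
```

For a 16-node hypercube with 32-bit channels at 1e9 bits/s per wire, the bisection width is 8 and the bisection bandwidth is 8 x 32 x 1e9 = 2.56e11 bits/s.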
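Shared Memory Architecture: Caches
Sequence of events, with X initially 0:
- P1: Read X (P1's cache now holds X: 0)
- P2: Read X (P2's cache now holds X: 0)
- P1: Write X=1 (P1's cache now holds X: 1)
- P2: Read X is a cache hit on the old copy X: 0. Wrong data!!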
Cache Coherence Problem If each processor in a shared memory multiple processor machine has a data cache Potential data consistency problem: the cache coherence problem Shared variable modification, private cache Objective: processes shouldn’t read `stale’ data Solutions Hardware: cache coherence mechanisms Lec24
Cache Coherence Protocols
Write update: on every write by a processor, propagate the updated cache line to the other processors that hold a copy
Write invalidate: on a write, the copies in other caches are marked invalid; a processor fetches the updated cache line only when it next reads the data it held as stale
Which is better?
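Invalidation Based Cache Coherence
Sequence of events, with X initially 0:
- P1: Read X (P1's cache holds X: 0)
- P2: Read X (P2's cache holds X: 0)
- P1: Write X=1 (P1's cache holds X: 1; an Invalidate is sent for P2's copy)
- P2: Read X misses on the invalidated copy and obtains X: 1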
Cache Coherence using invalidate protocols
3 states associated with data items
- Shared: a variable shared by 2 caches
- Invalid: another processor (say P0) has updated the data item
- Dirty: state of the data item in P0
Implementations
- Snoopy, for bus based architectures
  - Shared bus interconnect where all cache controllers monitor all bus activity
  - There is only one operation through the bus at a time; cache controllers can be built to take corrective action and enforce coherence in caches
  - Memory operations are propagated over the bus and snooped
- Directory-based
  - Instead of broadcasting memory operations to all processors, propagate coherence operations to the relevant processors
  - A central directory maintains states of cache blocks and their associated processors
  - Implemented with presence bits
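The three states can be exercised with a small single-threaded simulation of one shared variable X and two caches. This is a deliberately simplified sketch, not a description of any real protocol implementation: it assumes every read miss and every write is seen by all caches, as on a snooped bus, and all names (cache_read, cache_write, coherence_reset, cache_state) are hypothetical.

```c
enum line_state { INVALID, SHARED, DIRTY };

struct cache_line {
    enum line_state state;
    int value;
};

#define NPROC 2

static int memory_x;                       /* X in main memory */
static struct cache_line cache_x[NPROC];   /* X in each processor's cache */

void coherence_reset(int initial)
{
    memory_x = initial;
    for (int p = 0; p < NPROC; p++) {
        cache_x[p].state = INVALID;
        cache_x[p].value = 0;
    }
}

/* Processor p reads X. A miss is visible to the other caches, so a
   cache holding the line Dirty writes it back before memory is read. */
int cache_read(int p)
{
    if (cache_x[p].state == INVALID) {
        for (int q = 0; q < NPROC; q++) {
            if (q != p && cache_x[q].state == DIRTY) {
                memory_x = cache_x[q].value;   /* owner writes back */
                cache_x[q].state = SHARED;
            }
        }
        cache_x[p].value = memory_x;
        cache_x[p].state = SHARED;
    }
    return cache_x[p].value;                   /* hit: use cached copy */
}

/* Processor p writes X: all other copies are invalidated and p's copy
   becomes Dirty. X is overwritten whole, so no old data is needed. */
void cache_write(int p, int v)
{
    for (int q = 0; q < NPROC; q++)
        if (q != p)
            cache_x[q].state = INVALID;
    cache_x[p].value = v;
    cache_x[p].state = DIRTY;
}

enum line_state cache_state(int p)
{
    return cache_x[p].state;
}
```

Replaying the earlier scenario (both processors read X, one writes X=1, the other reads again), the second read now misses on the invalidated copy and returns 1 instead of the stale 0.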