SE-292 High Performance Computing


SE-292 High Performance Computing
Introduction to Concurrent Programming and Parallel Architecture
Sathish Vadhiyar

Concurrent Programming
- Until now, execution involved a single flow of control through the program.
- Concurrent programming is about programs with multiple flows of control.
- Example: a program that runs as multiple processes cooperating to achieve a common goal.
- To cooperate, processes must somehow communicate.

Inter-Process Communication (IPC)
- Using files: the parent process creates 2 files before forking the child. The child inherits file descriptors from the parent, and they share the file pointers. One file can be used for the parent to write and the child to read, the other for the child to write and the parent to read.
- The OS also supports pipes. A producer writes at one end (the write end) and a consumer reads from the other (the read end). A pipe corresponds to 2 file descriptors (int fd[2]): a read from fd[0] accesses data written to fd[1] in FIFO order, and vice versa. Typically used with fork: the parent creates a pipe and uses it to communicate with a child process, as sketched below.
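A minimal sketch of the pipe-with-fork pattern in C (the message and buffer size are illustrative; error handling is abbreviated):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {            /* child: reads from fd[0] */
        close(fd[1]);             /* close unused write end */
        char buf[64];
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("child read: %s\n", buf); }
        close(fd[0]);
    } else {                      /* parent: writes to fd[1] */
        close(fd[0]);             /* close unused read end */
        const char *msg = "hello";
        write(fd[1], msg, strlen(msg));
        close(fd[1]);
        wait(NULL);               /* reap the child */
    }
    return 0;
}
```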

Other IPC Mechanisms
- Processes can communicate through variables that are shared between them: shared variables live in shared memory, while other variables remain private to a process. The OS provides special support for a program to place objects in shared regions of its address space (e.g., System V shared memory: shmget, shmat).
- Processes can also communicate by sending and receiving messages to each other, again with special OS support.
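A sketch of the shmget/shmat calls named above, with a parent and child sharing one int (permissions and error handling are simplified):

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* create a shared segment holding one int */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *shared = (int *)shmat(shmid, NULL, 0);  /* map into address space */
    *shared = 0;

    if (fork() == 0) {            /* child sees the same memory */
        *shared = 42;
        shmdt(shared);
        _exit(0);
    }
    wait(NULL);
    printf("parent reads %d\n", *shared);        /* prints 42 */
    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);               /* remove the segment */
    return 0;
}
```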

More on IPC Mechanisms
- Sometimes processes don't need to communicate explicit values to cooperate; they may just have to synchronize their activities.
- Example: Process 1 reads 2 matrices, Process 2 multiplies them, Process 3 writes the result matrix. Process 2 should not start work until Process 1 finishes reading, and so on.
- This is called process synchronization.
- Synchronization primitives include the mutex lock, the semaphore, and the barrier.

Programming With Shared Variables
Consider a 2-process program in which both processes increment a shared variable:

shared int X = 0;
P1: X++;        P2: X++;

Q: What is the value of X after this?
Complication: remember that X++ compiles into something like

LOAD  R1, 0(R2)
ADD   R1, R1, 1
STORE 0(R2), R1

Problem With Using Shared Variables
The final value of X could be 1!
- P1 loads X into R1 and increments R1.
- P2 loads X into its register before P1 stores the new value into X.
- Net result: P1 stores 1, and P2 also stores 1.
Moral of the example: processes that interact through shared variables must be synchronized. The problem arises when 2 or more processes try to update the shared variable. The part of the program where the shared variable is accessed like this is called a critical section.
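A small pthreads demonstration of this effect (illustrative; how much of the count is lost depends on scheduling, but the final value is usually well below the expected total):

```c
#include <stdio.h>
#include <pthread.h>

#define NITER 1000000
int X = 0;                       /* shared, unprotected */

void *incr(void *arg) {
    for (int i = 0; i < NITER; i++)
        X++;                     /* load / add / store: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incr, NULL);
    pthread_create(&t2, NULL, incr, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("X = %d (expected %d)\n", X, 2 * NITER);
    return 0;
}
```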

Critical Section Problem: Mutual Exclusion
Processes must be synchronized so that they access the shared variable one at a time in the critical section; this is called mutual exclusion.
Mutex lock: a synchronization primitive.
- AcquireLock(L): done before the critical section; returns when it is safe for the process to enter the critical section.
- ReleaseLock(L): done after the critical section; allows another process to acquire the lock.

Implementing a Lock (first attempt)

int L = 0;               /* 0: lock available */

AcquireLock(L):
    while (L == 1) ;     /* `BUSY WAITING' */
    L = 1;

ReleaseLock(L):
    L = 0;

Why This Implementation Fails
Assume that lock L is currently available (L = 0) and that 2 processes, P1 and P2, try to acquire it. The AcquireLock loop compiles to:

wait:  LW   R1, Addr(L)
       BNEZ R1, wait
       ADDI R1, R0, 1
       SW   R1, Addr(L)

One possible interleaving over time:
- P1: LW loads 0 into R1. Context switch.
- P2: LW loads 0, BNEZ falls through, ADDI, SW; P2 enters the critical section. Context switch.
- P1: BNEZ falls through (its R1 still holds 0), ADDI, SW; P1 enters the critical section too.
THE IMPLEMENTATION ALLOWS PROCESSES P1 AND P2 TO BE IN THE CRITICAL SECTION TOGETHER!

Busy-Wait Lock Implementation
Hardware support is useful for implementing a lock. Example: the Test&Set instruction.

Test&Set(Lock):
    tmp  = Lock
    Lock = 1
    return tmp

These 3 steps happen atomically, or indivisibly: all 3 happen as one operation, with nothing happening in between. This is an atomic Read-Modify-Write (RMW) instruction.

Busy-Wait Lock With Test&Set

AcquireLock(L):
    while (Test&Set(L)) ;

ReleaseLock(L):
    L = 0;

Consider the case where P1 is currently in the critical section and P2-P10 are executing AcquireLock: all of them spin in the while loop. When P1 releases the lock, by the definition of Test&Set exactly one of P2-P10 will read the new lock value of 0 and set L back to 1.
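A sketch of this spin lock in C11, where atomic_flag_test_and_set provides exactly the atomic RMW described above (GCC/Clang with <stdatomic.h>; names are illustrative):

```c
#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;   /* initially clear: lock available */

void acquire(void) {
    /* atomically: tmp = lock; lock = 1; return tmp */
    while (atomic_flag_test_and_set(&lock))
        ;                              /* busy wait while lock is held */
}

void release(void) {
    atomic_flag_clear(&lock);          /* L = 0 */
}
```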

More on Locks
Other names for this kind of lock: mutex, spin-wait lock, busy-wait lock.
Locks can also be implemented so that, instead of busy waiting, an unsuccessful process is blocked by the operating system.

Semaphore
A more general synchronization mechanism, with two operations: P (wait) and V (signal).
- P(S): if S is nonzero, decrements S and returns. Otherwise, suspends the process until S becomes nonzero, at which point the process is restarted; after restarting, it decrements S and returns.
- V(S): increments S by 1. If there are processes blocked on S, restarts exactly one of them.

Critical Section Problem & Semaphores
Semaphore S = 1; before the critical section: P(S); after the critical section: V(S).
Semaphores can do more than mutex locks: initialize S to 10, and 10 processes will be allowed to proceed.
Pipeline example: P1 reads the matrices, P2 multiplies, P3 writes the product. With semaphores S1 = S2 = 0: V(S1) at the end of P1, P(S1) at the beginning of P2, and so on (sketched below).
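A sketch of the read/multiply/write pipeline using POSIX semaphores between threads (the stage bodies are placeholders; sem_* names are the POSIX calls, everything else is illustrative):

```c
#include <pthread.h>
#include <semaphore.h>

sem_t s1, s2;                        /* both initialized to 0 */

void *p1(void *a) { /* read matrices */   sem_post(&s1); return NULL; }
void *p2(void *a) { sem_wait(&s1); /* multiply */ sem_post(&s2); return NULL; }
void *p3(void *a) { sem_wait(&s2); /* write product */ return NULL; }

int main(void) {
    pthread_t t1, t2, t3;
    sem_init(&s1, 0, 0);             /* S1 = 0 */
    sem_init(&s2, 0, 0);             /* S2 = 0 */
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t3, NULL, p3, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    sem_destroy(&s1); sem_destroy(&s2);
    return 0;
}
```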

Deadlock
Consider the following process:

P1: lock(L);
    wait(L);

P1 is waiting for something (release of the lock that it is itself holding) that will never happen. This is a simple case of a general problem called deadlock: a cycle of processes, each waiting for resources held by others while holding resources needed by others.

Classical Problems: Producers-Consumers
Also known as the bounded buffer problem.
- A producer process makes things and puts them into a fixed-size shared buffer.
- A consumer process takes things out of the shared buffer and uses them.
- We must ensure that the producer doesn't insert into a full buffer and the consumer doesn't remove from an empty buffer, while treating buffer accesses as a critical section.

Producers-Consumers Problem

shared Buffer[0 .. N-1]

Producer, repeatedly:
    produce x
    if (buffer is full) wait for consumption
    Buffer[i++] = x
    signal consumer

Consumer, repeatedly:
    if (buffer is empty) wait for production
    y = Buffer[--i]
    signal producer
    consume y

(A complete semaphore-based version is sketched below.)
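The classic three-semaphore solution, as a C sketch: `empty_slots` counts free slots, `full_slots` counts filled slots, and a mutex protects the buffer indices (assumes one producer and one consumer; names are illustrative):

```c
#include <pthread.h>
#include <semaphore.h>

#define N 16
int buffer[N];
int in = 0, out = 0;                  /* circular-buffer indices */

sem_t empty_slots;                    /* free slots, init to N */
sem_t full_slots;                     /* filled slots, init to 0 */
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void init_buffer(void) {
    sem_init(&empty_slots, 0, N);
    sem_init(&full_slots, 0, 0);
}

void produce(int x) {
    sem_wait(&empty_slots);           /* wait if buffer is full */
    pthread_mutex_lock(&mtx);         /* critical section: buffer access */
    buffer[in] = x;
    in = (in + 1) % N;
    pthread_mutex_unlock(&mtx);
    sem_post(&full_slots);            /* signal the consumer */
}

int consume(void) {
    sem_wait(&full_slots);            /* wait if buffer is empty */
    pthread_mutex_lock(&mtx);
    int y = buffer[out];
    out = (out + 1) % N;
    pthread_mutex_unlock(&mtx);
    sem_post(&empty_slots);           /* signal the producer */
    return y;
}
```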

Dining Philosophers Problem
N philosophers sit around a circular table with a plate of food in front of each and a fork between every 2 philosophers. Each philosopher repeatedly eats (using 2 forks) and thinks.
Problem: avoid deadlock, and be fair.
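One standard deadlock-avoidance sketch (resource ordering, which the slide does not detail: every philosopher picks up the lower-numbered fork first, so the circular wait cannot form; all names here are illustrative):

```c
#include <pthread.h>

#define N 5
pthread_mutex_t fork_[N];            /* one mutex per fork */

void init_forks(void) {
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&fork_[i], NULL);
}

void philosopher(int i) {
    int first = i, second = (i + 1) % N;
    if (first > second) { int t = first; first = second; second = t; }
    for (;;) {
        pthread_mutex_lock(&fork_[first]);   /* lower-numbered fork first */
        pthread_mutex_lock(&fork_[second]);
        /* eat */
        pthread_mutex_unlock(&fork_[second]);
        pthread_mutex_unlock(&fork_[first]);
        /* think */
    }
}
```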

THREADS
A thread is a basic unit of CPU utilization: a thread of control within a process, sometimes called a `lightweight process'.
Its `weight' relates to: time for creation, time for a context switch, and the size of the context (recall the context of a process).

Threads and Processes
Thread context: thread id, stack, and the stack pointer, PC, and GPR values. So thread context switching can be fast.
Many threads in the same process share parts of the process context, notably the virtual address space (other than the stack). So threads in the same process share variables that are not stack allocated.

Threads and Sharing
A thread shares with the other threads of its process: the code section, data section, open files, and signals.

Threads: Benefits and Models
Benefits: responsiveness, communication, parallelism, and scalability.
Types: user threads and kernel threads.
Multithreading models:
- Many-to-one: efficient, but the entire process blocks if one thread makes a blocking system call.
- One-to-one (e.g., Linux): true parallelism, but heavyweight.
- Many-to-many: a balance between the above two schemes.

Thread Implementation
Threads can be supported either in the operating system or by a library.
Pthreads: the POSIX thread library, a standard for defining thread creation and synchronization.

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Related calls: the pthread_attr_* functions, pthread_join, pthread_exit, pthread_detach.
Do `man -k pthread` for the full list.
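A minimal create/join example for the calls listed above (the worker function and its argument are illustrative):

```c
#include <stdio.h>
#include <pthread.h>

void *worker(void *arg) {
    printf("hello from thread %ld\n", (long)arg);
    return NULL;                     /* equivalent to pthread_exit(NULL) */
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, (void *)1L);
    pthread_join(tid, NULL);         /* wait for the thread to finish */
    return 0;
}
```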

Synchronization Primitives
Mutex locks:
- int pthread_mutex_lock(pthread_mutex_t *mutex): if the mutex is already locked, the calling thread blocks until the mutex becomes available; returns with the mutex object referenced by mutex in the locked state, with the calling thread as its owner.
- pthread_mutex_unlock
Semaphores: sem_init, sem_wait, sem_post.
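These calls fix the earlier X++ race; a sketch of the increment loop with the critical section protected by a pthread mutex:

```c
#include <pthread.h>

int X = 0;                            /* shared */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *incr(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);       /* enter critical section */
        X++;
        pthread_mutex_unlock(&m);     /* leave critical section */
    }
    return NULL;
}
```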

Pthread Scheduling
- Process contention scope (PCS): scheduling user-level threads onto a set of kernel threads.
- System contention scope (SCS): scheduling kernel threads onto CPUs.
Functions for setting the scope: pthread_attr_setscope, pthread_attr_getscope. Use PTHREAD_SCOPE_PROCESS for PCS and PTHREAD_SCOPE_SYSTEM for SCS.

Thread Safety
A function is thread safe if it always produces correct results when called repeatedly from multiple concurrent threads.
Thread-unsafe functions are those that:
- don't protect shared variables,
- keep state across multiple invocations,
- return a pointer to a static variable, or
- call thread-unsafe functions.
Races: when the correctness of a program depends on one thread reaching a point x before another thread reaches a point y. (The static-variable case is sketched below.)
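An illustration of the "returns a pointer to a static variable" case, with a reentrant rewrite that takes a caller-supplied buffer (function names are invented for the example; the same pattern distinguishes, e.g., strtok from strtok_r):

```c
#include <stdio.h>

/* Thread-unsafe: all callers share one static buffer. */
char *itoa_unsafe(int n) {
    static char buf[16];
    snprintf(buf, sizeof buf, "%d", n);
    return buf;                  /* another thread may overwrite this */
}

/* Thread-safe: state lives in the caller's buffer. */
char *itoa_r(int n, char *buf, size_t len) {
    snprintf(buf, len, "%d", n);
    return buf;
}
```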

Parallel Architecture

PARALLEL ARCHITECTURE
Parallel machine: a computer system with more than one processor.
Motivations:
- Faster execution time, exploiting non-dependence between regions of code.
- Presents a level of modularity.
- Resource constraints, e.g., large databases.
- A certain class of algorithms lend themselves to parallelism.
- Aggregate bandwidth to memory/disk; increase in data throughput.
- Clock rate improvement in the past decade: ~40%; memory access time improvement: ~10%.

Classification of Architectures – Flynn's Classification
Classifies machines in terms of parallelism in the instruction and data streams.
- Single Instruction Single Data (SISD): serial computers.
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays. Examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Flynn's Classification (contd.)
- Multiple Instruction Single Data (MISD): not popular.
- Multiple Instruction Multiple Data (MIMD): the most popular; IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Based on Memory
Shared memory, of 2 types: UMA and NUMA.
NUMA examples: HP Exemplar, SGI Origin, Sequent NUMA-Q.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space, and communication can be done through this shared memory.
The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figures: processors each paired with private memory and connected by an interconnect (message passing), vs processors connected through an interconnect to a shared main memory.]

Shared Memory Machines
The shared memory may itself be distributed among the processor nodes: each processor may have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.
Terms: NUMA (Non-Uniform Memory Access) vs UMA (Uniform Memory Access) architecture.

Classification of Architectures – Based on Memory (contd.)
Distributed memory; more recently, multi-cores.
Yet another classification: MPPs, NOW (Berkeley), COW, computational Grids.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Parallel Architecture: Interconnection Networks
An interconnection network is defined by its switches, links, and interfaces.
- Switches provide mapping between input and output ports, buffering, routing, etc.
- Interfaces connect nodes to the network.

Parallel Architecture: Interconnections
- Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple bus, crossbar, MIN (multistage interconnection network).
- Direct interconnects: nodes are connected directly to each other. Topologies: linear, ring, star, mesh, torus, hypercube.
- Routing techniques decide the route a message takes from source to destination.
Network topologies:
- Static: point-to-point communication links among processing nodes.
- Dynamic: communication links are formed dynamically by switches.

Interconnection Networks
Static:
- Bus
- Completely connected
- Star
- Linear array, ring (1-D torus)
- Mesh; k-d mesh: d dimensions with k nodes in each dimension
- Hypercube: a log p-dimensional mesh with 2 nodes in each dimension
- Trees (e.g., a campus network)
Dynamic (communication links are formed dynamically by switches):
- Crossbar
- Multistage
For more details, and an evaluation of topologies, refer to the book by Grama et al.

Indirect Interconnects
[Figures: shared bus, multiple bus, crossbar switch built from 2x2 crossbars, and a multistage interconnection network.]

Direct Interconnect Topologies
[Figures: star, ring, linear array, 2-D mesh, torus, and hypercube (binary n-cube) for n = 2 and n = 3.]

Evaluating Interconnection Topologies
Diameter: the maximum distance between any two processing nodes.
- Fully connected: 1
- Star: 2
- Ring: p/2
- Hypercube: log p
Connectivity: the multiplicity of paths between 2 nodes, i.e., the minimum number of arcs that must be removed from the network to break it into two disconnected networks.
- Linear array: 1
- 2-D mesh: 2
- 2-D mesh with wraparound: 4
- d-dimensional hypercube: d

Evaluating Interconnection Topologies (contd.)
Bisection width: the minimum number of links that must be removed from the network to partition it into 2 equal halves.
- Ring: 2
- P-node 2-D mesh: sqrt(P)
- Tree: 1
- Star: 1
- Completely connected: P^2/4
- Hypercube: P/2

Evaluating Interconnection Topologies (contd.)
- Channel width: the number of bits that can be communicated simultaneously over a link, i.e., the number of physical wires between 2 nodes.
- Channel rate: the performance of a single physical wire.
- Channel bandwidth: channel rate times channel width.
- Bisection bandwidth: the maximum volume of communication between the two halves of a network, i.e., bisection width times channel bandwidth.
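A worked example (the link parameters are assumed for illustration): take a 64-node hypercube with 32-bit links clocked at 1 GHz. Its diameter is log2 64 = 6 and its bisection width is 64/2 = 32 links. The channel bandwidth is 32 bits x 1 GHz = 4 GB/s per link, so the bisection bandwidth is 32 links x 4 GB/s = 128 GB/s.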

Shared Memory Architecture: Caches
Example trace with two processors, P1 and P2 (X is initially 0 in memory):
- P1 reads X: caches X = 0.
- P2 reads X: caches X = 0.
- P1 writes X = 1: P1's cache now holds X = 1.
- P2 reads X: cache hit on its stale copy, X = 0. Wrong data!

Cache Coherence Problem
If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem: the cache coherence problem. It arises when a shared variable is modified in a private cache.
Objective: processes shouldn't read `stale' data.
Solution: hardware cache coherence mechanisms.

Cache Coherence Protocols
- Write update: on every write by a processor, propagate the updated cache line to the other processors' caches.
- Write invalidate: a write invalidates the copies in other caches; each processor fetches the updated cache line the next time it reads that data.
Which is better?

Invalidation-Based Cache Coherence
The same trace as before, now with invalidation:
- P1 reads X: caches X = 0.
- P2 reads X: caches X = 0.
- P1 writes X = 1: an invalidate is broadcast, and P2's copy of X is invalidated.
- P2 reads X: miss; it obtains the up-to-date value, X = 1.

Cache Coherence Using Invalidate Protocols
3 states are associated with a cached data item:
- Shared: the item may be present in 2 or more caches.
- Invalid: another processor (say P0) has updated the data item, so this copy is stale.
- Dirty: the state of the data item in P0, the processor holding the modified copy.
Implementations:
- Snoopy, for bus-based architectures: a shared bus interconnect where all cache controllers monitor all bus activity. Since there is only one operation on the bus at a time, cache controllers can take corrective action and enforce coherence in their caches. Memory operations are propagated over the bus and snooped.
- Directory-based: instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors. A central directory maintains the state of each cache block and its associated processors, implemented with presence bits.
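A toy sketch of the per-line state machine these three states imply, as seen by one cache (heavily simplified: real protocols such as MESI have more states and explicit bus transactions; the function and event names are invented for illustration):

```c
/* States named on the slide. */
typedef enum { INVALID, SHARED, DIRTY } line_state;

/* New state of this cache's copy after an event it observes. */
line_state next_state(line_state s, int own_write, int own_read,
                      int bus_write_seen) {
    if (bus_write_seen)  return INVALID;  /* another cache wrote: invalidate */
    if (own_write)       return DIRTY;    /* our write: modified, exclusive */
    if (own_read && s == INVALID)
                         return SHARED;   /* miss serviced: shared copy */
    return s;                             /* otherwise unchanged */
}
```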