SE-292 High Performance Computing
Introduction to Concurrent Programming & Parallel Architecture
R. Govindarajan (govind@serc)
PARALLEL ARCHITECTURE
Parallel machine: a computer system with more than one processor. Special parallel machines are designed to make the interaction (communication) overhead between processors low.
Questions:
- What about multicores? Yes, these are parallel machines.
- What about a network of machines? Yes, but the time involved in interaction (communication) may be high, since the system is designed assuming the machines are more or less independent.
Lec22
Classification of Parallel Machines
Flynn's classification: in terms of the number of instruction streams and data streams.
- Instruction stream: path to instruction memory (PC)
- Data stream: path to data memory
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams
- MIMD: multiple instruction streams, multiple data streams
PARALLEL PROGRAMMING
Recall Flynn's classification: SIMD, MIMD.
Programming side: SPMD, MPMD.
- Shared memory: using threads
- Message passing: using messages
Speedup = Time on 1 processor / Time on n processors
Lec24
How Much Speedup is Possible?
Let s be the fraction of the sequential execution time of a given program that cannot be parallelized. Assume the program is parallelized so that the remaining fraction (1 - s) is perfectly divided to run in parallel across n processors. Then

Speedup = 1 / (s + (1 - s)/n) <= 1/s

The maximum achievable speedup is limited by the sequential fraction s of the sequential program. This is Amdahl's Law.
Understanding Amdahl's Law
Compute: min over (i, j) of |A[i,j] - B[i,j]|
[Figure: time vs. concurrency profiles for (a) serial execution (n^2 work at concurrency 1), (b) naive parallel execution (n^2/p work per processor at concurrency p, followed by a serial phase), and (c) parallel execution.]
Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space, and communication can be done through this shared memory.
The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figure: message-passing organization (processor-memory (P-M) nodes connected by an interconnect) vs. shared-memory organization (processors connected through an interconnect to a common main memory).]
Shared Memory Machines
The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it, and therefore accessible in less time.
Terms:
- Shared vs. private
- Local vs. remote
- Centralized vs. distributed shared memory
- UMA vs. NUMA (Non-Uniform Memory Access) architecture
Shared Memory Architecture
[Figure: two organizations. Distributed shared memory, i.e., Non-Uniform Memory Access (NUMA): processor-cache-memory (P-$-M) nodes connected by a network. Centralized shared memory, i.e., Uniform Memory Access (UMA): processor-cache (P-$) nodes connected through a network to shared memory modules.]
© RG@SERC,IISc
MultiCore Structure
[Figure: a multicore chip with cores C0-C7, each with a private L1 cache ($), sharing an L2 cache and memory.]
NUMA Architecture
[Figure: a two-socket system. Each socket has cores C0-C7 with private L1 and L2 caches, a shared L3 cache, and an integrated memory controller (IMC) attached to local memory; QPI links connect the sockets.]
Distributed Memory Architecture (Cluster)
[Figure: processor-cache-memory (P-$-M) nodes connected by a network.]
Message passing architecture:
- Memory is private to each node
- Processes communicate by messages
Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other.
- Examples: shared bus, multiple bus, crossbar, MIN (multistage interconnection network)
Direct interconnects: nodes are connected directly to each other.
- Topologies: linear, ring, star, mesh, torus, hypercube
- Routing techniques: how the route taken by a message from source to destination is decided
Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars.]
Direct Interconnect Topologies
[Figure: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube, shown for n = 2 and n = 3).]
Shared Memory Architecture: Caches
Example: P1 reads X (caches X: 0), then P2 reads X (caches X: 0). P1 writes X = 1, updating its own cached copy. When P2 reads X again, it gets a cache hit on its stale copy X: 0. Wrong data!
Cache Coherence Problem
If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem: the cache coherence problem.
- Cause: modification of a shared variable in a private cache
- Objective: processes should not read 'stale' data
Solutions:
- Hardware: cache coherence mechanisms
- Software: compiler-assisted cache coherence
Example: Write Once Protocol
Assumption: a shared bus interconnect where all cache controllers monitor all bus activity. This is called snooping.
Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence in the caches. Corrective action could involve updating or invalidating a cache block.
Invalidation Based Cache Coherence
Same example: P1 and P2 both read X (each caches X: 0). When P1 writes X = 1, the coherence protocol invalidates P2's cached copy. P2's next read of X misses in its cache and fetches the up-to-date value X: 1.
Snoopy Cache Coherence and Locks
One processor holds lock L (L: 1) and is in its critical section while the others spin, repeatedly executing Test&Set(L). Each Test&Set writes L, invalidating the copies in the other caches and generating bus traffic. When the holder releases L (writes L: 0), one spinning processor's Test&Set reads 0, writes L: 1, and acquires the lock.
Space of Parallel Computing
Programming models (what the programmer uses in coding applications; specify synchronization and communication):
- Shared address space
- Message passing
- Data parallel
Parallel architectures:
- Shared memory: centralized shared memory (UMA) or distributed shared memory (NUMA)
- Distributed memory (a.k.a. message passing), e.g., clusters
Memory Consistency Model
The order in which memory operations will appear to execute: what value can a read return?
A contract between application software and the system; it affects both ease of programming and performance.
Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if
- operations from different processors were executed in some sequential (interleaved) order, and
- the memory operations of each process appear in that order in program order.
[Figure: processors P1, P2, P3, ..., Pn all accessing a single shared memory.]
NUMA Architecture (with I/O)
[Figure: the two-socket NUMA system, with each socket's cores C0-C7, private L1/L2 caches, shared L3, and IMC attached to local memory; QPI links connect the sockets and the IO-Hubs, which provide PCI-E connectivity, including the NIC.]
Using MultiCores to Build a Cluster
[Figure: four multicore nodes (Node 0 to Node 3), each with its own memory and NIC, connected through a network switch. A send of A(1:N) travels from Node 0's memory through its NIC and the switch to Node 1.]
Multi-core Workshop