
SE-292 High Performance Computing


1 SE-292 High Performance Computing
Intro. to Concurrent Programming & Parallel Architecture. R. Govindarajan. L1

2 PARALLEL ARCHITECTURE
Parallel machine: a computer system with more than one processor. Questions: What about multicores? What about a network of machines? Both qualify, but for a network of machines the time involved in interaction (communication) can be high, as the system is designed assuming that the machines are more or less independent. Special parallel machines are designed to make this interaction overhead smaller. Lec22

3 Classification of Parallel Machines
Flynn's classification: in terms of the number of instruction streams and data streams. Instruction stream: path to instruction memory (PC). Data stream: path to data memory.
SISD: Single Instruction stream, Single Data stream
SIMD: Single Instruction stream, Multiple Data streams
MIMD: Multiple Instruction streams, Multiple Data streams
Lec22

4 PARALLEL PROGRAMMING
Recall Flynn's classification: SIMD, MIMD. On the programming side: SPMD, MPMD. Shared memory: threads. Message passing: using messages.
Speedup = (execution time on one processor) / (execution time on n processors)
Lec24

5 How Much Speedup is Possible?
Let s be the fraction of the sequential execution time of a given program that cannot be parallelized. Assume the program is parallelized so that the remaining fraction (1 - s) is perfectly divided to run in parallel across n processors. Then
Speedup = 1 / (s + (1 - s)/n)
The maximum speedup achievable is limited by the sequential fraction of the sequential program: as n grows, Speedup approaches 1/s. This is Amdahl's Law. Lec24

6 Understanding Amdahl’s Law
Compute: min( |A[i,j] - B[i,i]| ) over an n x n problem.
[Figure: concurrency vs. time profiles for (a) serial execution: n^2 work at concurrency 1; (b) naive parallel: n^2/p work per processor at concurrency p, followed by a serial combining phase; (c) parallel: n^2/p work at concurrency p throughout]

7 Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory. The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figure: left, processor-memory (P-M) nodes connected by an interconnect (message passing); right, processors connected through an interconnect to a shared main memory]
Lec22

8 Shared Memory Machines
The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time. Terms: shared vs private; local vs remote; centralized vs distributed shared; UMA (Uniform Memory Access) vs NUMA (Non-Uniform Memory Access) architecture. Lec22

9 Shared Memory Architecture
[Figure: left, Distributed Shared Memory, Non-Uniform Memory Access (NUMA): processor-cache nodes each with local memory, connected by a network; right, Centralized Shared Memory, Uniform Memory Access (UMA): processor-cache nodes sharing memory modules across a network]

10 MultiCore Structure
[Figure: a multicore chip with eight cores C0-C7, each core with a private L1 cache, sharing an L2 cache and connected to memory]

11 NUMA Architecture
[Figure: two sockets, one with cores C0, C2, C4, C6 and one with cores C1, C3, C5, C7; each core has private L1 and L2 caches, each socket has a shared L3 cache and an integrated memory controller (IMC) to local memory; the sockets are connected by QPI links]

12 Distributed Memory Architecture
Message passing architecture (cluster): memory is private to each node; processes communicate by messages.
[Figure: processor-cache-memory (P-$-M) nodes connected by a network]

13 Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple bus, crossbar, MIN (multistage interconnection network). Direct interconnects: nodes are connected directly to each other. Topology: linear, ring, star, mesh, torus, hypercube. Routing techniques: how the route taken by a message from source to destination is decided. Lec22

14 Indirect Interconnects
[Figure: a shared bus; a multiple bus; a crossbar switch; and a multistage interconnection network built from 2x2 crossbars]
Lec22

15 Direct Interconnect Topologies
[Figure: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube) for n=2 and n=3]
Lec22

16 Shared Memory Architecture: Caches
[Example: P1 reads X and caches X=0; P2 reads X and caches X=0; P1 writes X=1; P2 reads X again and gets a cache hit on its stale copy X=0: wrong data!]
Lec24

17 Cache Coherence Problem
If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem: the cache coherence problem. It arises when a shared variable is modified while copies of it sit in private caches. Objective: processes shouldn't read `stale' data. Solutions: hardware cache coherence mechanisms, or software (compiler-assisted) cache coherence. Lec24

18 Example: Write Once Protocol
Assumption: a shared bus interconnect where all cache controllers monitor all bus activity, called snooping. Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence in the caches. The corrective action could involve updating or invalidating a cache block. Lec24

19 Invalidation Based Cache Coherence
[Example: P1 and P2 both read X and cache X=0. P1 writes X=1 and an Invalidate is sent on the bus, removing P2's copy. When P2 reads X again it misses and obtains the up-to-date value X=1]
Lec24

20 Snoopy Cache Coherence and Locks
[Example: one processor holds lock L (L=1) and is in its critical section while the others spin, repeatedly executing Test&Set(L). Each Test&Set writes L, invalidating the other caches' copies and generating bus traffic. On Release L (L=0), one spinning processor's Test&Set succeeds and it acquires the lock]
Lec24

21 Space of Parallel Computing
Programming models: what the programmer uses in coding applications; a model specifies synchronization and communication. Programming models: shared address space, message passing, data parallel.
Parallel architecture: shared memory, either centralized shared memory (UMA) or distributed shared memory (NUMA); and distributed memory, a.k.a. message passing, e.g., clusters.

22 Memory Consistency Model
The order in which memory operations will appear to execute: what value can a read return? It is a contract between application software and the system, and it affects both ease of programming and performance.

23 Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if operations from different processors were executed in some sequential (interleaved) order, with the memory operations of each processor appearing in program order.
[Figure: processors P1, P2, P3, ..., Pn issuing operations to a single shared memory]

24 Distributed Memory Architecture
Message passing architecture: memory is private to each node; processes communicate by messages.
[Figure: processor nodes (each with P, $, M) connected by a network]

25 NUMA Architecture
[Figure: the two-socket NUMA system of slide 11, with each socket's IMC attached to local memory, plus an IO-Hub connected over QPI providing PCI-E slots and a NIC]

26 Using MultiCores to Build Cluster
[Figure: four multicore nodes (Node 0 to Node 3), each with its own memory and NIC, connected through a network switch]

27 Using MultiCores to Build Cluster
[Figure: the same four-node cluster; Node 0 performs Send A(1:N), the data flowing through its NIC and the network switch to the destination node]
Multi-core Workshop

