Graduate Computer Architecture I Lecture 11: Distributed Memory Multiprocessors


Graduate Computer Architecture I Lecture 11: Distributed Memory Multiprocessors

Slide 2 - CSE/ESE 560M – Graduate Computer Architecture I: Natural Extensions of the Memory System
[Figure: three organizations ordered by scale – Shared Cache (processors P1..Pn share a first-level cache through a switch to interleaved main memory); Centralized Memory "Dance Hall" (UMA – processors with private caches on one side of the interconnection network, memory on the other); Distributed Memory (NUMA – a memory module attached to each processor node).]

Slide 3: Fundamental Issues
1. Naming
2. Synchronization
3. Performance: latency and bandwidth

Slide 4: Fundamental Issue #1 – Naming
Naming
– what data is shared
– how it is addressed
– what operations can access the data
– how processes refer to each other
The choice of naming affects
– the code produced by a compiler: via a load instruction, where it suffices to remember an address, or by keeping track of a processor number and a local virtual address for message passing
– the replication of data: via loads into the cache hierarchy, or via software replication and consistency

Slide 5: Fundamental Issue #1 – Naming
Global physical address space
– any processor can generate an address and access it in a single operation
– memory can be anywhere: virtual address translation handles it
Global virtual address space
– if the address space of each process can be configured to contain all shared data of the parallel program
Segmented shared address space
– locations are named uniformly for all processes of the parallel program

Slide 6: Fundamental Issue #2 – Synchronization
Message passing: implicit coordination
– transmission of data
– arrival of data
Shared address space: explicit coordination
– write a flag
– awaken a thread
– interrupt a processor
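The contrast above can be sketched in a few lines. This is a toy illustration (function names are my own, not the lecture's): with message passing the blocking receive itself is the synchronization, while with a shared address space the program must coordinate explicitly, here by setting a flag.

```python
import queue
import threading

# Message passing: the arrival of the data implicitly coordinates the
# two threads -- get() blocks until the sender's put() delivers a value.
def message_passing_demo():
    chan = queue.Queue()
    result = []

    def consumer():
        result.append(chan.get())  # blocks until the data arrives

    t = threading.Thread(target=consumer)
    t.start()
    chan.put(42)  # the send itself is the synchronization event
    t.join()
    return result[0]

# Shared address space: the write to shared data is not enough; the
# program explicitly writes a flag (an Event) to awaken the reader.
def shared_address_demo():
    shared = {"A": 0}
    flag = threading.Event()
    result = []

    def reader():
        flag.wait()                # explicit wait on the flag
        result.append(shared["A"])

    t = threading.Thread(target=reader)
    t.start()
    shared["A"] = 1                # write the shared datum ...
    flag.set()                     # ... then explicitly set the flag
    t.join()
    return result[0]

print(message_passing_demo())  # 42
print(shared_address_demo())   # 1
```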

Slide 7: Parallel Architecture Framework
Programming model
– Multiprogramming: lots of independent jobs, no communication
– Shared address space: communicate via memory
– Message passing: send and receive messages
Communication abstraction
– Shared address space: load, store, atomic swap
– Message passing: send/receive library calls
– Ongoing debate over this topic: ease of programming vs. scalability

Slide 8: Scalable Machines
Design trade-offs
– specialized vs. commodity nodes
– capability of the node-to-network interface
– supporting programming models
Scalability
– avoids inherent design limits on resources
– bandwidth increases as resources are added
– latency does not increase
– cost increases slowly as resources are added

Slide 9: Bandwidth Scalability
What fundamentally limits bandwidth?
– the amount of wires
– a shared bus vs. a network of switches: a bus offers a fixed aggregate bandwidth shared by all processors, while a switched network adds bandwidth with every link

Slide 10: Dancehall Multiprocessor Organization

Slide 11: Generic Distributed System Organization

Slide 12: Key Properties of a Distributed System
Large number of independent communication paths between nodes
– allows a large number of concurrent transactions over different wires
Independent initialization
No global arbitration
The effect of a transaction is visible only to the nodes involved
– effects are propagated through additional transactions

Slide 13: Programming Models Realized by Protocols
[Figure: layered view – parallel applications (CAD, database, scientific modeling) sit on programming models (multiprogramming, shared address, message passing, data parallel); below them the communication abstraction at the user/system boundary, realized by compilation or a library with operating-system support; then communication hardware at the hardware/software boundary, over the physical communication medium. Network transactions tie the layers together.]

Slide 14: Network Transaction
Interpretation of the message
– complexity of the message
Processing in the communication assist (CA)
– processing power
[Figure: a scalable network connecting nodes, each a processor-memory pair (P, M) with a communication assist. Output processing: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.]

Slide 15: Shared Address Space Abstraction
Fundamentally a two-way request/response protocol
– even writes have an acknowledgement
Issues
– fixed or variable-length (bulk) transfers
– remote virtual or physical address
– deadlock avoidance when input buffers fill
– memory coherence and consistency
[Figure: timeline of a remote load – (1) initiate memory access, (2) address translation, (3) local/remote check, (4) read-request transaction, (5) remote memory access, (6) read-response transaction, (7) complete memory access; the source waits between request and response.]
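The seven steps of the remote load can be traced in a toy model. This is a sketch under assumed names (Node, Machine, a simple interleaving of the address space across nodes), not the machine the slide depicts; it shows how the local/remote check decides whether a request/reply transaction pair is generated.

```python
# Toy model of the slide's 7-step remote read:
# (1) initiate access, (2) address translation, (3) local/remote check,
# (4) request transaction, (5) remote memory access, (6) reply
# transaction, (7) complete memory access.
class Node:
    def __init__(self, node_id, mem):
        self.node_id = node_id
        self.mem = mem  # this node's slice of the global address space

class Machine:
    def __init__(self, nodes, words_per_node):
        self.nodes = nodes
        self.words_per_node = words_per_node
        self.trace = []  # network transactions generated

    def load(self, requester, global_addr):           # (1) initiate
        home = global_addr // self.words_per_node     # (2) translation
        offset = global_addr % self.words_per_node
        if home == requester:                         # (3) local/remote check
            return self.nodes[requester].mem[offset]  # purely local access
        self.trace.append(("read_req", requester, home))    # (4) request
        value = self.nodes[home].mem[offset]          # (5) remote access
        self.trace.append(("read_reply", home, requester))  # (6) reply
        return value                                  # (7) complete

m = Machine([Node(0, [10, 11]), Node(1, [20, 21])], words_per_node=2)
print(m.load(0, 3))   # address 3 lives on node 1, offset 1 -> 21
print(m.trace)        # one request and one reply transaction
```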

Slide 16: Shared Physical Address Space

Slide 17: Shared Address Abstraction
Source and destination data addresses are specified by the source of the request
– implies a degree of logical coupling and trust
No storage logically "outside the address space"
– may employ temporary buffers for transport
Operations are fundamentally request/response
A remote operation can be performed on remote memory
– logically it does not require intervention of the remote processor

Slide 18: Message Passing
Bulk transfers
Synchronous
– send completes after the matching receive has posted and the source data has been sent
– receive completes after the data transfer from the matching send is complete
Asynchronous
– send completes as soon as the send buffer may be reused

Slide 19: Synchronous Message Passing
Constrained programming model
Destination contention is very limited
[Figure: timeline across the user/system boundary – (1) initiate send, (2) address translation on the source, (3) local/remote check, (4) send-ready request, (5) remote check for a posted receive (assume success) with tag check, (6) receive-ready reply, (7) bulk data transfer from source VA to destination VA.]
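The synchronous/asynchronous distinction from the two slides above can be sketched with a toy rendezvous channel (SyncChannel is an illustrative name, not a library API): a synchronous send returns only after the matching receive has taken the data, while an asynchronous send completes as soon as the message is buffered.

```python
import queue
import threading

class SyncChannel:
    """Toy rendezvous: send() returns only after the matching recv()
    has taken the data (the recv-ready handshake from the slide)."""
    def __init__(self):
        self.slot = queue.Queue(maxsize=1)
        self.taken = threading.Event()

    def send(self, msg):
        self.slot.put(msg)
        self.taken.wait()          # block until the receiver has the data

    def recv(self):
        msg = self.slot.get()
        self.taken.set()           # release the blocked sender
        return msg

sync_chan = SyncChannel()
received = []

t = threading.Thread(target=lambda: received.append(sync_chan.recv()))
t.start()
sync_chan.send("hello")            # returns only after recv() matched it
t.join()
print(received, sync_chan.slot.empty())   # ['hello'] True

# Asynchronous send: put() completes as soon as the message is
# buffered, even though no receiver has been posted yet.
async_chan = queue.Queue()
async_chan.put("in flight")
print(async_chan.qsize())          # 1
```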

Slide 20: Asynchronous Message Passing (Optimistic)
More powerful programming model
Wildcard receive → non-determinism
How much storage is required within the message layer?

Slide 21: Active Messages
User-level analog of a network transaction
– transfer a data packet and invoke a handler to extract it from the network and integrate it with the on-going computation
Request/reply structure
Event notification: interrupts, polling, or events?
May also perform memory-to-memory transfer
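The key idea, that each packet names a handler the destination runs on arrival, can be shown in a toy form. The class and handler names below are illustrative, not from any real active-message library:

```python
# Toy active-message layer: each message carries a handler name that the
# destination invokes immediately to integrate the payload with its
# ongoing computation, instead of buffering the message for a later recv.
class ActiveMessageNode:
    def __init__(self):
        self.accumulator = 0
        self.handlers = {"add": self.handle_add}

    def handle_add(self, value):
        self.accumulator += value           # fold data into the computation

    def deliver(self, handler_name, payload):
        self.handlers[handler_name](payload)  # invoke handler on arrival

dest = ActiveMessageNode()
for v in (1, 2, 3):
    dest.deliver("add", v)   # three request messages, handled on arrival
print(dest.accumulator)      # 6
```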

Slide 22: Message Passing Abstraction
The source knows the send data address, the destination knows the receive data address
– after the handshake, both know both
Arbitrary storage "outside the local address spaces"
– may post many sends before any receives
– non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
Fundamentally a three-phase transaction
– includes a request/response
– can use an optimistic one-phase protocol in limited "safe" cases

Slide 23: Data Parallel
Operations are performed in parallel on each element of a large regular data structure, such as an array
– data-parallel programming languages lay out data to processors
Processing elements (PEs)
– one control processor broadcasts to many PEs
– when computers were large, the control portion could be amortized over many replicated PEs
– a condition flag per PE allows elements to be skipped
– data is distributed across the PE memories
Early-1980s VLSI brought a SIMD rebirth
– a chip with 32 one-bit PEs plus memory served as the building block
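The per-PE condition flag amounts to masked execution: one broadcast instruction, applied only where the flag is set. A minimal sketch (plain Python lists standing in for PE memories):

```python
# SIMD-style masked execution: the control processor broadcasts one
# operation ("double your element"); each PE has a 1-bit condition
# flag, and PEs whose flag is clear skip the operation.
data  = [3, -1, 4, -1, 5]
flags = [x > 0 for x in data]   # per-PE condition flags

data = [2 * x if f else x for x, f in zip(data, flags)]
print(data)   # [6, -1, 8, -1, 10] -- negative elements were skipped
```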

Slide 24: Data Parallel Architecture
Development
– vector processors have similar ISAs, but no data-placement restriction
– SIMD led to data-parallel programming languages
– Single Program Multiple Data (SPMD) model: all processors execute an identical program
Advanced VLSI technology
– single-chip FPUs
– fast microprocessors (making SIMD less attractive)

Slide 25: Cache Coherent Systems
Invoking the coherence protocol
– the state of the line is maintained in the cache
– the protocol is invoked when an "access fault" occurs on the line
Actions to maintain coherence
– look at the state of the block in other caches
– locate the other copies
– communicate with those copies

Slide 26: Scalable Cache Coherence
Realizing programming models through network transaction protocols
– requires an efficient node-to-network interface that interprets transactions
Caches naturally replicate data
– coherence through bus-snooping protocols
– consistency
Scalable networks allow many simultaneous transactions over scalable distributed memory
We need cache coherence protocols that scale
– no broadcast, no single point of ordering

Slide 27: Bus-based Coherence
All actions are broadcast on the bus
– the faulting processor sends out a "search"
– the others respond to the search probe and take the necessary action
It could be done on a scalable network too
– broadcast to all processors and let them respond
Conceptually simple, but it does not scale with the processor count p
– on a bus, the bus bandwidth does not scale
– on a scalable network, every fault leads to at least p network transactions
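A back-of-envelope comparison makes the scaling point concrete. The two formulas are my own simplification of the slide's argument (broadcast reaches all p processors per miss; a directory contacts only the home node plus the actual sharers):

```python
# Broadcast coherence: every access fault generates a probe to each of
# the p processors, so traffic grows linearly in machine size.
def broadcast_transactions(p, misses):
    return misses * p

# Directory coherence: one directory lookup plus messages only to the
# nodes that actually hold copies (avg_sharers), independent of p.
def directory_transactions(avg_sharers, misses):
    return misses * (1 + avg_sharers)

print(broadcast_transactions(p=64, misses=1000))           # 64000
print(directory_transactions(avg_sharers=2, misses=1000))  # 3000
```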

Slide 28: One Approach – Hierarchical Snooping
Extend the snooping approach
– hierarchy of broadcast media
– processors sit in bus- or ring-based multiprocessors at the leaves
– parents and children are connected by two-way snoopy interfaces
– main memory may be centralized at the root or distributed among the leaves
Actions are handled as on a bus, but without a full broadcast
– the faulting processor issues a "search" bus transaction on its own bus
– the search propagates up and down the hierarchy based on snoop results
Problems
– high latency: multiple levels, with a snoop/lookup at every level
– bandwidth bottleneck at the root

Slide 29: Scalable Approach – Directories
A directory
– maintains the locations of cached block copies
– maintains the state of each memory block
On a miss to a block in its own memory, a node
– looks up the directory entry
– communicates only with the nodes that hold copies
Scalable networks
– communication through network transactions
– different ways to organize the directory

Slide 30: Basic Directory Transactions
[Figure: (a) read miss to a block in dirty state – the requestor contacts the directory node for the block, which forwards the request to the node holding the dirty copy (3a/3b); data and revision messages return (4a/4b). (b) Write miss to a block with two sharers – the directory sends invalidations to the sharers, collects acknowledgements (val./ack), and grants ownership to the requestor.]

Slide 31: Example Directory Protocol (First Read)
[Figure: P1 executes ld vA, translated to a read of physical address pA. The read request travels to the directory controller at the home memory, which records P1 as a sharer and replies with the data; P1's cache line enters the Shared (S) state.]

Slide 32: Example Directory Protocol (Read Share)
[Figure: P2 also executes ld vA → rd pA. The directory adds P2 to the sharer list for pA and replies with the data; both P1's and P2's cache lines are in the Shared (S) state.]

Slide 33: Example Directory Protocol (Write to Shared)
[Figure: P1 executes st vA → wr pA while the block is shared by P1 and P2. P1 issues a read-to-update request; the directory sends an invalidate for pA to P2, collects the invalidation acknowledgement, then replies exclusively to P1. P1's line moves to the Exclusive/Modified state and P2's line to Invalid.]
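The three example slides above can be condensed into a minimal directory sketch. This is a toy model under assumed names (states "I"/"S"/"M", a dict as the directory), not the lecture's full protocol: reads add the requester to the sharer list, and a write invalidates only the nodes that actually hold copies.

```python
# Minimal directory sketch: the home directory tracks, per block, a
# state and the set of sharer nodes, and generates network transactions
# only toward nodes that actually hold copies.
class Directory:
    def __init__(self):
        self.entries = {}   # block -> [state, set of sharer node ids]
        self.messages = []  # network transactions generated

    def read_miss(self, node, block):
        state, sharers = self.entries.setdefault(block, ["I", set()])
        sharers.add(node)
        self.entries[block][0] = "S"              # shared after a read
        self.messages.append(("reply_data", block, node))

    def write_miss(self, node, block):
        state, sharers = self.entries.setdefault(block, ["I", set()])
        for s in sharers - {node}:                # invalidate real copies only
            self.messages.append(("invalidate", block, s))
        self.entries[block] = ["M", {node}]       # writer is sole owner

d = Directory()
d.read_miss(1, "pA")    # P1 reads pA  -> state S, sharers {1}
d.read_miss(2, "pA")    # P2 reads pA  -> state S, sharers {1, 2}
d.write_miss(1, "pA")   # P1 writes pA -> invalidate P2, state M
print(d.entries["pA"])  # ['M', {1}]
```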

Slide 34: A Popular Middle Ground
Two-level "hierarchy"
– coherence across nodes is directory-based: the directory keeps track of nodes, not individual processors
– coherence within a node is snooping or directory-based: orthogonal, but it needs a good interface of functionality
Examples
– Convex Exemplar: directory-directory
– Sequent, Data General, HAL: directory-snoopy

Slide 35: Two-level Hierarchies
[Figure: four organizations – (a) snooping-snooping: snooping buses within nodes joined by a snooping adapter on a bus or ring; (b) snooping-directory: bus-based nodes attached to a network through a directory-based network assist; (c) directory-directory: directory nodes joined by a directory adapter over a second network; (d) directory-snooping: directory nodes joined by a dir/snoopy adapter.]

Slide 36: Memory Consistency
Memory coherence provides a consistent view of memory
– but it does not ensure how consistent, or in what order operations appear to execute
[Figure: P1 executes "A = 1; flag = 1;" while P3 executes "while (flag == 0); print A;". With A initially 0, a congested or delayed path in the interconnection network can make the write flag = 1 visible to P3 before the write A = 1, so P3 prints the stale value of A.]
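The hazard in the figure can be simulated directly. This toy function (my own construction, not from the slides) replays P1's two writes in a chosen visibility order and returns the value P3's spin loop would print; reordering the writes yields the stale read:

```python
# Replay the slide's example: P1 writes A then flag; P3 spins on flag
# and reads A. The argument gives the order in which the writes become
# visible to P3 -- a delayed path can make flag visible first.
def run(write_order):
    mem = {"A": 0, "flag": 0}
    for var, val in write_order:
        mem[var] = val
        if mem["flag"] == 1:     # P3's "while (flag == 0)" would exit here
            return mem["A"]      # ... and P3 prints A
    return mem["A"]

print(run([("A", 1), ("flag", 1)]))   # program order: prints 1
print(run([("flag", 1), ("A", 1)]))   # delayed write of A: prints stale 0
```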

Slide 37: Memory Consistency (continued)
Relaxed consistency
– allows out-of-order completion
– different read and write ordering models
– increased performance, but possible errors
Current systems
– use relaxed models
– expect correctly synchronized programs
– rely on standard synchronization libraries