7/2/2015 slide 1 PCOD: Scalable Parallelism (ICs) Per Stenström (c) 2008, Sally A. McKee (c) 2011 Scalable Multiprocessors What is a scalable design? (7.1)


Slide 1: Scalable Multiprocessors
What is a scalable design? (7.1)
Realizing programming models (7.2)
Scalable communication architectures (SCAs)
  - Message-based SCAs (7.3-7.5)
  - Shared-memory based SCAs (7.6)
Read Dubois/Annavaram/Stenström Chapter 5.5-5.6 (COMA architectures could be paper topic)
Read Dubois/Annavaram/Stenström Chapter 6

Slide 2: Scalability Goals (P is number of processors)
Bandwidth: scale linearly with P
Latency: short and independent of P
Cost: low fixed cost and scale linearly with P
Example: A bus-based multiprocessor
  - Bandwidth: constant
  - Latency: short and constant
  - Cost: high for infrastructure and then linear

Slide 3: Organizational Issues
Network composed of switches for performance and cost
Many concurrent transactions allowed
Distributed memory can bring down bandwidth demands
Bandwidth scaling:
  - no global arbitration and ordering
  - broadcast bandwidth fixed and expensive
[Figures: distributed memory organization; dance-hall memory organization]

Slide 4: Scaling Issues
Latency scaling:
  - T(n) = Overhead + Channel Time + Routing Delay
  - Channel Time is a function of bandwidth
  - Routing Delay is a function of number of hops in network
Cost scaling:
  - Cost(p,m) = Fixed cost + Incremental Cost(p,m)
  - Design is cost-effective if speedup(p,m) > costup(p,m)
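The two models on this slide can be tried out with a few lines of Python. This is only a sketch under stated assumptions: channel time is taken as message size divided by bandwidth, routing delay as hops times a per-hop delay, the incremental cost is assumed linear in p (the memory parameter m is left out), and costup is taken as cost(p)/cost(1). All function and parameter names are made up for illustration.

```python
def latency(overhead, msg_bytes, bandwidth, hops, per_hop_delay):
    """T(n) = Overhead + Channel Time + Routing Delay."""
    channel_time = msg_bytes / bandwidth   # assumed: size / bandwidth
    routing_delay = hops * per_hop_delay   # assumed: linear in hop count
    return overhead + channel_time + routing_delay


def cost(p, fixed_cost, cost_per_node):
    """Assumed linear model: fixed cost plus an increment per processor."""
    return fixed_cost + p * cost_per_node


def costup(p, fixed_cost, cost_per_node):
    """Cost of a p-processor machine relative to a 1-processor machine."""
    return cost(p, fixed_cost, cost_per_node) / cost(1, fixed_cost, cost_per_node)


def is_cost_effective(speedup, p, fixed_cost, cost_per_node):
    """The slide's criterion: speedup must exceed costup."""
    return speedup > costup(p, fixed_cost, cost_per_node)
```

With a large fixed cost, costup grows much more slowly than p, so a speedup well below p can still satisfy the criterion: with fixed_cost=100 and cost_per_node=10, costup(16) is about 2.4, and a speedup of 8 on 16 processors is cost-effective.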

Slide 5: Physical Scaling
Chip, board, system-level partitioning has a big impact on scaling
However, little consensus

Slide 6: Network Transaction Primitives
Primitives to implement the programming model on a scalable machine
One-way transfer between source and destination
Resembles a bus transaction but much richer in variety
Examples:
  - A message send transaction
  - A write transaction in a SAS machine

Slide 7: Bus vs. Network Transactions
Design issue | Bus transactions | Network transactions
Protection | V->P address translation | Done at multiple points
Format | Fixed | Flexible
Output buffering | Simple | Support flexible in format
Media arbitration | Global | Distributed
Destination name & routing | Direct | Via several switches
Input buffering | One source | Several sources
Action | Response | Rich diversity
Completion detection | Simple | Response transaction
Transaction ordering | Global order | No global order

Slide 8: SAS Transactions
Issues:
  - Fixed or variable size transfers
  - Deadlock avoidance and input buffer full

Slide 9: Sequential Consistency
Issues:
  - Writes need acks to signal completion
  - SC may cause extreme waiting times

Slide 10: Message Passing
Multiple flavors of synchronization semantics
Blocking versus non-blocking
  - Blocking send/recv returns when operation completes
  - Non-blocking returns immediately (probe function tests completion)
Synchronous
  - Send completes after matching receive has executed
  - Receive completes after data transfer from matching send completes
Asynchronous (buffered, in MPI terminology)
  - Send completes as soon as send buffer may be reused
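The difference between blocking and non-blocking calls under synchronous semantics can be sketched in-process with Python threads. This is a toy model, not the MPI API: `Channel`, `Request`, `isend`, `test`, and `wait` are hypothetical names chosen for illustration, and a queue stands in for the network.

```python
import queue
import threading


class Request:
    """Handle returned by a non-blocking send (hypothetical name)."""

    def __init__(self):
        self._done = threading.Event()

    def _complete(self):
        self._done.set()

    def test(self):
        """Probe: has the operation completed? Never blocks."""
        return self._done.is_set()

    def wait(self):
        """Block until the operation has completed."""
        self._done.wait()


class Channel:
    """Toy point-to-point channel with synchronous send semantics."""

    def __init__(self):
        self._q = queue.Queue()

    def isend(self, data):
        """Non-blocking send: returns immediately with a request handle."""
        req = Request()
        self._q.put((data, req))
        return req

    def send(self, data):
        """Blocking synchronous send: returns once the matching recv ran."""
        self.isend(data).wait()

    def recv(self):
        """Blocking receive: returns when the data has arrived."""
        data, req = self._q.get()
        req._complete()  # the send completes only after the matching receive
        return data


ch = Channel()
req = ch.isend("hello")      # returns immediately
pending = not req.test()     # True: no matching receive has executed yet
msg = ch.recv()              # matching receive
finished = req.test()        # True: the send has now completed
```

An asynchronous (buffered) send would instead mark the request complete as soon as the data has been copied out of the sender's buffer, which in this model would mean completing it inside `isend`.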

Slide 11: Synchronous MP Protocol
Alternative: Keep match table at the sender, enabling a two-phase receive-initiated protocol

Slide 12: Asynchronous Optimistic MP Protocol
Issues:
  - Copying overhead at receiver from temp buffer to user space
  - Huge buffer space at receiver to cope with worst case

Slide 13: Asynchronous Robust MP Protocol
Note: after handshake, send and recv buffer addresses are known, so data transfer can be performed with little overhead
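The protocol diagram for this slide did not survive, so the following is a sketch of one common way such a handshake is structured, not a reconstruction of the slide: the sender first announces a message with a small send-ready request, the receiver answers with recv-ready once a matching receive is posted, and only then is the data moved, directly into the user buffer. The class and method names (`Sender`, `Receiver`, `on_send_ready`, `on_recv_ready`) are hypothetical, and direct method calls stand in for network transactions.

```python
class Sender:
    def __init__(self):
        self.outgoing = {}  # tag -> data held in the sender's own buffer

    def send(self, tag, data, receiver):
        """Phase 1: announce the message; no data leaves the sender yet."""
        self.outgoing[tag] = data
        receiver.on_send_ready(tag, self)

    def on_recv_ready(self, tag, user_buf):
        """Phase 3: both buffers are known, so transfer the data directly."""
        user_buf.extend(self.outgoing.pop(tag))


class Receiver:
    def __init__(self):
        self.posted = {}     # tag -> user buffer of a receive posted early
        self.announced = {}  # tag -> sender that announced before the receive

    def on_send_ready(self, tag, sender):
        if tag in self.posted:
            # Phase 2: a matching receive is already posted, reply at once.
            sender.on_recv_ready(tag, self.posted.pop(tag))
        else:
            # Remember only the small announcement, never the data itself.
            self.announced[tag] = sender

    def recv(self, tag, user_buf):
        if tag in self.announced:
            # Phase 2: the send was announced earlier, reply now.
            self.announced.pop(tag).on_recv_ready(tag, user_buf)
        else:
            self.posted[tag] = user_buf


s, r = Sender(), Receiver()
buf = []
s.send(7, [1, 2, 3], r)  # receive not posted yet: data stays at the sender
r.recv(7, buf)           # handshake completes, data lands in buf
```

Unlike the optimistic protocol, the receiver here never parks data in a temporary buffer, so there is no extra copy and no worst-case buffer to provision; the price is the extra round trip before the data moves.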

Slide 14: Active Messages
User-level analog of network transactions
  - transfer data packet and invoke handler to extract it from network and integrate with ongoing computation
[Figure labels: Request, handler, Reply]
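The request/handler/reply pattern can be sketched as a handler table per node, where each message names the handler to run on arrival. This is an illustrative model only: `Node`, `register`, `deliver`, and the two handlers are hypothetical names, a shared dictionary stands in for the network, and delivery is a direct function call.

```python
class Node:
    def __init__(self, node_id, network):
        self.id = node_id
        self.network = network  # shared dict: node id -> Node
        self.handlers = {}      # handler name -> function
        self.memory = {}        # local data the handlers operate on
        self.replies = []       # values integrated into the local computation
        network[node_id] = self

    def register(self, name, fn):
        self.handlers[name] = fn

    def send(self, dest, handler, *args):
        """Send an active message: handler name plus a small payload."""
        self.network[dest].deliver(self.id, handler, args)

    def deliver(self, src, handler, args):
        """On arrival, run the named handler to pull the data off the network."""
        self.handlers[handler](self, src, *args)


def read_request(node, src, addr):
    # Request handler: look up the value and send a reply back to the source.
    node.send(src, "read_reply", addr, node.memory[addr])


def read_reply(node, src, addr, value):
    # Reply handler: integrate the value with the requester's computation.
    node.replies.append((addr, value))


net = {}
a, b = Node(0, net), Node(1, net)
for n in (a, b):
    n.register("read_request", read_request)
    n.register("read_reply", read_reply)

b.memory[0x10] = 42
a.send(1, "read_request", 0x10)  # a.replies becomes [(16, 42)]
```

Handlers are kept short and non-blocking in this style; the request handler does no more than read a value and issue the reply.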

Slide 15: Challenges Common to SAS and MP
Input buffer overflow: how to signal buffer space is exhausted
Solutions:
  - ACK at protocol level
  - back pressure flow control
  - special ACK path or drop packets (requires time-out)
Fetch deadlock (revisited): a request often generates a response that can form dependence cycles in the network
Solutions:
  - two logically independent request/response networks
  - NACK requests at receiver to free space
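Two of the overflow remedies listed above can be contrasted in a small model: a receiver that NACKs packets when its input buffer is full, and a sender that applies back pressure by tracking credits so it never sends into a full buffer. Credit counting is one possible way to realize back pressure, assumed here for illustration; the names `Receiver`, `CreditSender`, `accept`, and `credit_returned` are hypothetical.

```python
from collections import deque


class Receiver:
    def __init__(self, slots):
        self.slots = slots     # input buffer capacity in packets
        self.buffer = deque()

    def accept(self, packet):
        """Return True (ACK) if buffered, False (NACK) if the buffer is full."""
        if len(self.buffer) >= self.slots:
            return False
        self.buffer.append(packet)
        return True

    def consume(self):
        """Drain one packet, freeing a buffer slot."""
        return self.buffer.popleft()


class CreditSender:
    """Sender that stalls locally instead of overflowing the receiver."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.slots  # one credit per free receiver slot

    def try_send(self, packet):
        if self.credits == 0:
            return False               # back pressure: hold the packet here
        self.credits -= 1
        return self.receiver.accept(packet)

    def credit_returned(self):
        """Called when the receiver reports that it freed a slot."""
        self.credits += 1


rx = Receiver(slots=2)
tx = CreditSender(rx)
sent = [tx.try_send(p) for p in ("a", "b", "c")]  # third send is held back
first = rx.consume()                              # frees a slot
tx.credit_returned()
retry = tx.try_send("c")                          # now goes through
```

The NACK path costs network traffic for every rejected packet, while the credit scheme keeps rejected packets off the network at the price of tracking per-destination state at the sender.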

Slide 16: Spectrum of Designs
None, physical bit stream
  - blind, physical DMA: nCUBE, iPSC, ...
User/System
  - User-level port: CM-5, *T
  - User-level handler: J-Machine, Monsoon, ...
Remote virtual address
  - Processing, translation: Paragon, Meiko CS-2
Global physical address
  - Proc + Memory controller: RP3, BBN, T3D
Cache-to-cache
  - Cache controller: Dash, KSR, Flash
Increasing HW Support, Specialization, Intrusiveness, Performance (???)

Slide 17: MP Architectures
Design tradeoff: how much processing in CA vs P, and how much interpretation of network transaction
  - Physical DMA (7.3)
  - User-level access (7.4)
  - Dedicated message processing (7.5)
[Figure: node architecture with P, M, and communication assist (CA) on a scalable network; label: Message]
Output Processing: checks, translation, formatting, scheduling
Input Processing: checks, translation, buffering, action

Slide 18: Physical DMA
Node processor packages messages in user/system mode
DMA used to copy between network and system buffers
Problem: no way to distinguish between user/system messages, which results in much overhead because node processor must be involved
Examples: nCUBE/2, IBM SP1

Slide 19: User-Level Access
Network interface mapped into user address space
Communication assist does protection checks, translation, etc.
No intervention by kernel except for interrupts
Example: CM-5

Slide 20: Dedicated Message Processing
The message processor (MP):
  - Interprets messages
  - Supports message operations
  - Off-loads P with a clean message abstraction
[Figure: nodes on a network, each with Mem, P, MP, and NI; labels: User, System, dest]
Issues:
  - P and MP communicate via shared memory: coherence traffic
  - MP can be a bottleneck due to all concurrent actions

Slide 21: Shared Physical Address Space
Remote read/write performed by pseudo processors
Cache coherence issues treated in Ch. 8
[Figure: nodes on a scalable network, each with P, M, pseudo memory, and pseudo processor]

