Slide 1: Scalable Multiprocessors
PCOD: Scalable Parallelism (ICs), Per Stenström (c) 2008, Sally A. McKee (c) 2011
- What is a scalable design? (7.1)
- Realizing programming models (7.2)
- Scalable communication architectures (SCAs):
  - Message-based SCAs (7.3-7.5)
  - Shared-memory-based SCAs (7.6)
- Reading: Dubois/Annavaram/Stenström, Sections 5.5-5.6 (COMA architectures could be a paper topic) and Chapter 6
Slide 2: Scalability Goals
(P is the number of processors)
- Bandwidth: scales linearly with P
- Latency: short and independent of P
- Cost: low fixed cost, then scaling linearly with P
Example: a bus-based multiprocessor
- Bandwidth: constant
- Latency: short and constant
- Cost: high fixed cost for the infrastructure, then linear
Slide 3: Organizational Issues
- Network composed of switches, for both performance and cost
- Many concurrent transactions allowed
- Distributed memory can bring down bandwidth demands on the network
- Bandwidth scaling: no global arbitration or ordering; broadcast bandwidth is fixed and expensive
[Figures: dance-hall vs. distributed memory organization]
Slide 4: Scaling Issues
Latency scaling: T(n) = Overhead + Channel Time + Routing Delay
- Channel Time is a function of bandwidth (and of the message size n)
- Routing Delay is a function of the number of hops in the network
Cost scaling: Cost(p,m) = Fixed Cost + Incremental Cost(p,m)
- A design is cost-effective if speedup(p,m) > costup(p,m), where costup(p,m) = Cost(p,m) / Cost(1,m)
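The cost-effectiveness rule is easy to check numerically. The sketch below plugs in made-up fixed and per-node costs and an Amdahl-style speedup model (both are assumptions for illustration, not values from the slides) and reports where speedup(p) still exceeds costup(p):

```c
/* Illustrative check of "cost-effective if speedup > costup" (slide 4).
   The cost figures and the 5% serial fraction are invented for the example. */
#include <stdio.h>

int main(void) {
    double fixed_cost      = 50000.0;  /* network, cabinets, ... (assumed)        */
    double cost_per_node   = 5000.0;   /* incremental cost per processor (assumed) */
    double serial_fraction = 0.05;     /* Amdahl-style serial part of the workload */

    for (int p = 2; p <= 64; p *= 2) {
        double speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
        double costup  = (fixed_cost + p * cost_per_node)
                       / (fixed_cost + 1 * cost_per_node);
        printf("p=%2d  speedup=%5.2f  costup=%5.2f  cost-effective: %s\n",
               p, speedup, costup, speedup > costup ? "yes" : "no");
    }
    return 0;
}
```

With these particular numbers the high fixed cost keeps costup low at first, so every machine size shown remains cost-effective despite the 5% serial fraction.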
Slide 5: Physical Scaling
- Chip-, board-, and system-level partitioning has a big impact on scaling
- However, there is little consensus on how to do it
Slide 6: Network Transaction Primitives
- Primitives used to implement the programming model on a scalable machine
- A network transaction is a one-way transfer between source and destination
- Resembles a bus transaction, but much richer in variety
- Examples: a message send transaction; a write transaction in a shared address space (SAS) machine
Slide 7: Bus vs. Network Transactions
Design issue               | Bus transaction           | Network transaction
Protection                 | V->P address translation  | Done at multiple points
Format                     | Fixed                     | Flexible
Output buffering           | Simple                    | Supports flexible formats
Media arbitration          | Global                    | Distributed
Destination name & routing | Direct                    | Via several switches
Input buffering            | One source                | Several sources
Action                     | Response                  | Rich diversity
Completion detection       | Simple                    | Response transaction
Transaction ordering       | Global order              | No global order
Slide 8: SAS Transactions
Issues:
- Fixed- or variable-size transfers
- Deadlock avoidance when input buffers fill up
Slide 9: Sequential Consistency
Issues:
- Writes need acknowledgements to signal completion
- SC may cause extreme waiting times
Slide 10: Message Passing
Multiple flavors of synchronization semantics:
- Blocking versus non-blocking
  - A blocking send/recv returns when the operation completes
  - A non-blocking send/recv returns immediately (a probe function tests for completion)
- Synchronous
  - Send completes after the matching receive has executed
  - Receive completes after the data transfer from the matching send completes
- Asynchronous (buffered, in MPI terminology)
  - Send completes as soon as the send buffer may be reused
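These flavors correspond directly to MPI's send variants. The two-rank sketch below (compile with mpicc, run with 2 ranks) is only meant to show which call embodies which semantics:

```c
/* Minimal sketch of the four send flavors of slide 10, in MPI terms. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int data = 42, recvd;

    if (rank == 0) {
        /* Blocking standard send: returns when 'data' may be reused. */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Non-blocking send: returns immediately; MPI_Test "probes" completion. */
        MPI_Request req;
        int done = 0;
        MPI_Isend(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        while (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);

        /* Synchronous send: completes only after the matching receive has started. */
        MPI_Ssend(&data, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);

        /* Buffered ("asynchronous") send: completes as soon as the message
           has been copied into the attached user buffer. */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&data, 1, MPI_INT, 1, 3, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (rank == 1) {
        for (int tag = 0; tag < 4; tag++)
            MPI_Recv(&recvd, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```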
Slide 11: Synchronous MP Protocol
Alternative: keep the match table at the sender, enabling a two-phase, receive-initiated protocol
Slide 12: Asynchronous Optimistic MP Protocol
Issues:
- Copying overhead at the receiver, from the temporary buffer to user space
- Huge buffer space needed at the receiver to cope with the worst case
Slide 13: Asynchronous Robust MP Protocol
Note: after the handshake, the send and receive buffer addresses are both known, so the data transfer can be performed with little overhead
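To see why the handshake helps: once the envelope has been matched, both buffer addresses are known, so the payload can move in a single copy (or DMA) with no intermediate buffering. The single-process simulation below models the phases with plain function calls; all structure and function names are illustrative, not from the lecture:

```c
/* Toy model of the three-phase "robust" protocol of slides 11-13:
   (1) sender ships only an envelope, (2) receiver matches it against a posted
   receive and returns the destination buffer, (3) data moves directly. */
#include <stdio.h>
#include <string.h>

typedef struct {            /* envelope carried by the request transaction */
    int tag;
    size_t len;
    const char *src_buf;    /* known only at the sender */
} Envelope;

typedef struct {            /* receive posted by the destination process */
    int tag;
    char *dst_buf;
    size_t capacity;
} PostedRecv;

/* Phase 2: receiver matches the envelope and returns the destination buffer. */
static char *match_receive(const Envelope *env, PostedRecv *posted, int n) {
    for (int i = 0; i < n; i++)
        if (posted[i].tag == env->tag && posted[i].capacity >= env->len)
            return posted[i].dst_buf;
    return NULL;            /* no match yet: sender waits instead of buffering blindly */
}

int main(void) {
    char user_recv_buf[64];
    PostedRecv posted[1] = {{ .tag = 7, .dst_buf = user_recv_buf,
                              .capacity = sizeof user_recv_buf }};

    const char *msg = "payload";
    Envelope env = { .tag = 7, .len = strlen(msg) + 1, .src_buf = msg };

    /* Phase 1: send envelope.  Phase 2: receiver answers with its buffer. */
    char *dst = match_receive(&env, posted, 1);

    /* Phase 3: both addresses are now known, so the transfer is one plain copy
       (on real hardware: DMA straight into user space, no temp buffer). */
    if (dst) memcpy(dst, env.src_buf, env.len);
    printf("received: %s\n", user_recv_buf);
    return 0;
}
```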
Slide 14: Active Messages
- User-level analog of network transactions: transfer a data packet and invoke a handler that extracts it from the network and integrates it with the ongoing computation
[Figure: a request message invokes a handler at the destination, which may send a reply]
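As a concrete picture of the idea, the sketch below models an active-message packet as a handler index plus an argument, and delivery as a direct dispatch into the named handler; the handler table and packet layout are invented for the example:

```c
/* In-process sketch of the active-message idea on slide 14: each packet names
   a handler, and the receiver runs that handler to pull the data out of the
   network and fold it into the ongoing computation. */
#include <stdio.h>

typedef struct { int handler_id; int arg; } AMPacket;

static int remote_sum = 0;   /* state the handlers integrate data into */

/* Request handler: accumulate the argument (a reply could be issued here). */
static void add_handler(int arg)   { remote_sum += arg; }
static void print_handler(int arg) { (void)arg; printf("sum so far = %d\n", remote_sum); }

typedef void (*am_handler_t)(int);
static am_handler_t handler_table[] = { add_handler, print_handler };

/* "Network delivery": on arrival, dispatch straight to the named handler
   instead of buffering the packet for a later receive call. */
static void am_deliver(AMPacket p) { handler_table[p.handler_id](p.arg); }

int main(void) {
    am_deliver((AMPacket){ .handler_id = 0, .arg = 5 });
    am_deliver((AMPacket){ .handler_id = 0, .arg = 7 });
    am_deliver((AMPacket){ .handler_id = 1, .arg = 0 });  /* prints 12 */
    return 0;
}
```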
Slide 15: Challenges Common to SAS and MP
Input buffer overflow: how to signal that buffer space is exhausted
Solutions:
- ACKs at the protocol level
- Back-pressure flow control
- A special ACK path, or dropping packets (requires a time-out)
Fetch deadlock (revisited): a request often generates a response, which can form dependence cycles in the network
Solutions:
- Two logically independent request/response networks
- NACK requests at the receiver to free buffer space
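One of the listed solutions, back-pressure flow control, can be sketched as a credit scheme: the sender injects only while it holds credits for free input-buffer slots, and credits return as the receiver drains its buffer. The toy model below is single-process and purely illustrative:

```c
/* Tiny sketch of sender-side back-pressure flow control (slide 15). */
#include <stdio.h>

#define INPUT_SLOTS 4

static int credits   = INPUT_SLOTS;  /* free slots the sender believes exist   */
static int in_buffer = 0;            /* packets sitting in the receiver buffer */

static int try_send(int pkt) {
    if (credits == 0) return 0;      /* would overflow: hold the packet (back pressure) */
    credits--; in_buffer++;
    printf("sent %d (credits left %d)\n", pkt, credits);
    return 1;
}

static void receiver_drain(void) {   /* consuming a packet returns one credit */
    if (in_buffer > 0) { in_buffer--; credits++; }
}

int main(void) {
    for (int i = 0; i < 6; i++)
        if (!try_send(i)) { receiver_drain(); try_send(i); }
    return 0;
}
```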
Slide 16: Spectrum of Designs
Interpretation of transaction | Communication assist          | Example machines
None (physical bit stream)    | Blind, physical DMA           | nCUBE, iPSC, ...
User/System                   | User-level port               | CM-5, *T
                              | User-level handler            | J-Machine, Monsoon, ...
Remote virtual address        | Processing, translation       | Paragon, Meiko CS-2
Global physical address       | Processor + memory controller | RP3, BBN, T3D
Cache-to-cache                | Cache controller              | Dash, KSR, Flash
Top to bottom: increasing HW support, specialization, intrusiveness, performance (???)
Slide 17: MP Architectures
Design tradeoff: how much processing to do in the communication assist (CA) versus the processor (P), and how much interpretation of the network transaction
- Physical DMA (7.3)
- User-level access (7.4)
- Dedicated message processing (7.5)
[Figure: node architecture; nodes (P, M, CA) attached to a scalable network]
- Output processing: checks, translation, formatting, scheduling
- Input processing: checks, translation, buffering, action
Slide 18: Physical DMA
- The node processor packages messages in user/system mode
- DMA is used to copy between the network and system buffers
- Problem: there is no way to distinguish user from system messages, which results in much overhead because the node processor must be involved on every message
- Examples: nCUBE/2, IBM SP1
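The overhead the slide refers to shows up in the receive path: the NI can only DMA into a system buffer, so the node processor must inspect every arriving message and copy user messages onward. A minimal illustrative model (all structures invented for the example):

```c
/* Sketch of the physical-DMA receive path on slide 18: the extra copy and the
   per-message involvement of the node processor are the cost. */
#include <stdio.h>
#include <string.h>

typedef struct { int is_user_msg; int len; char payload[64]; } SysBuf;

static char user_mailbox[64];

/* Runs on the node processor for every arriving message. */
static void handle_arrival(const SysBuf *sb) {
    if (sb->is_user_msg)
        memcpy(user_mailbox, sb->payload, sb->len);   /* extra copy to user space */
    else
        printf("kernel message, %d bytes\n", sb->len);
}

int main(void) {
    SysBuf sb = { .is_user_msg = 1, .len = 6 };
    memcpy(sb.payload, "hello", 6);    /* pretend the DMA engine deposited this */
    handle_arrival(&sb);
    printf("user sees: %s\n", user_mailbox);
    return 0;
}
```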
Slide 19: User-Level Access
- Network interface mapped into the user address space
- The communication assist does protection checks, translation, etc.
- No kernel intervention except for interrupts
- Example: CM-5
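From software's point of view, a user-level NI looks like a page of device registers mapped into the process. The sketch below assumes a hypothetical device file and register layout (/dev/ni0, three registers) purely to show that, once mapped, sending needs only ordinary stores and no system call per message:

```c
/* Sketch of a user-level NI as seen from software (slide 19).
   Device path and register layout are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/ni0", O_RDWR);            /* hypothetical NI device */
    if (fd < 0) { perror("open"); return 1; }

    /* Map one page of NI registers into this process's address space. */
    volatile uint32_t *ni = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (ni == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    ni[0] = 7;       /* destination node  (assumed register layout)            */
    ni[1] = 0x1234;  /* payload word pushed into the outgoing FIFO             */
    ni[2] = 1;       /* "go" doorbell: the assist, not the kernel, does the
                        protection checks and translation                      */

    munmap((void *)ni, 4096);
    close(fd);
    return 0;
}
```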
Slide 20: Dedicated Message Processing
The message processor (MP):
- Interprets messages
- Supports message operations
- Off-loads P by providing a clean message abstraction
[Figure: each node contains memory, a compute processor P, a message processor MP, and a network interface NI, with a user/system split]
Issues:
- P and MP communicate via shared memory, which generates coherence traffic
- The MP can become a bottleneck, since it handles all concurrent actions
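One plausible form of the P/MP shared-memory interface is a descriptor queue: P enqueues send descriptors and returns to computing, while the MP drains the queue and feeds the NI. The sketch below is a single-threaded illustration of that hand-off (queue layout and names are assumptions); the shared head/tail words are exactly the data that would generate coherence traffic between P and MP:

```c
/* Illustrative P-to-MP hand-off through a shared-memory descriptor queue. */
#include <stdatomic.h>
#include <stdio.h>

#define QSIZE 8
typedef struct { int dest; int len; const void *payload; } SendDesc;

static SendDesc queue[QSIZE];
static atomic_int head = 0, tail = 0;  /* these words ping-pong between P's and MP's caches */

/* P side: post a message descriptor and get back to computing. */
static int post_send(SendDesc d) {
    int t = atomic_load(&tail);
    if (t - atomic_load(&head) == QSIZE) return 0;   /* queue full: MP is the bottleneck */
    queue[t % QSIZE] = d;
    atomic_store(&tail, t + 1);
    return 1;
}

/* MP side: format each descriptor and hand it to the network interface. */
static void mp_poll(void) {
    int h = atomic_load(&head);
    while (h < atomic_load(&tail)) {
        SendDesc d = queue[h % QSIZE];
        printf("MP: sending %d bytes to node %d\n", d.len, d.dest);
        h++;
    }
    atomic_store(&head, h);
}

int main(void) {
    post_send((SendDesc){ .dest = 3, .len = 64,  .payload = "data" });
    post_send((SendDesc){ .dest = 5, .len = 128, .payload = "more" });
    mp_poll();
    return 0;
}
```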
Slide 21: Shared Physical Address Space
- Remote reads/writes are performed by pseudo processors and pseudo memories
- Cache coherence issues are treated in Ch. 8
[Figure: two nodes, each with processor P, memory M, pseudo memory, and pseudo processor, connected by a scalable network]
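A toy model of this organization: a load whose physical address maps to another node is turned into a request transaction by the local pseudo memory, and the remote node's pseudo processor performs the access and returns the data. The node/offset address split below is an assumption made only for the example:

```c
/* Single-process model of the slide-21 organization. */
#include <stdint.h>
#include <stdio.h>

#define NODES 2
#define WORDS_PER_NODE 1024

static uint32_t memory[NODES][WORDS_PER_NODE];   /* each node's local memory */

/* Pseudo processor at the home node: services incoming read requests. */
static uint32_t pseudo_processor_read(int home, uint32_t offset) {
    return memory[home][offset];
}

/* Pseudo memory at the requesting node: traps loads whose address maps to a
   remote node and turns them into network request/response transactions. */
static uint32_t load(int my_node, uint32_t global_addr) {
    int home = global_addr / WORDS_PER_NODE;
    uint32_t offset = global_addr % WORDS_PER_NODE;
    if (home == my_node) return memory[home][offset];   /* local access  */
    return pseudo_processor_read(home, offset);         /* remote access */
}

int main(void) {
    memory[1][5] = 99;                       /* written by node 1 */
    printf("node 0 loads global addr 1029 -> %u\n", load(0, 1029));
    return 0;
}
```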