Slide 1: Scalable Multiprocessors
PCOD: Scalable Parallelism (ICs)
Per Stenström (c) 2008, Sally A. McKee (c) 2011

- What is a scalable design? (7.1)
- Realizing programming models (7.2)
- Scalable communication architectures (SCAs)
  - Message-based SCAs (7.3-7.5)
  - Shared-memory based SCAs (7.6)
- Reading: Dubois/Annavaram/Stenström Chapter 6 (COMA architectures could be a paper topic)

Slide 2: Scalability Goals (P = number of processors)
- Bandwidth: scales linearly with P
- Latency: short and independent of P
- Cost: low fixed cost, scaling linearly with P

Example of a design that does not scale: a bus-based multiprocessor
- Bandwidth: constant, since the one shared bus serves all P processors
- Latency: short and constant
- Cost: high fixed cost for the bus infrastructure, then linear

Slide 3: Organizational Issues
- Network composed of switches, for both performance and cost
- Many concurrent transactions allowed
- Distributed memory can bring down bandwidth demands
- Bandwidth scaling:
  - no global arbitration and ordering
  - broadcast bandwidth is fixed and expensive
- Two organizations: distributed memory vs. dance-hall memory

Slide 4: Scaling Issues
- Latency scaling:
  - T(n) = Overhead + Channel Time + Routing Delay
  - Channel Time is a function of bandwidth
  - Routing Delay is a function of the number of hops in the network
- Cost scaling:
  - Cost(p, m) = Fixed Cost + Incremental Cost(p, m)
  - A design is cost-effective if speedup(p, m) > costup(p, m), where costup(p, m) = Cost(p, m) / Cost(1, m)
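To make the cost-effectiveness test concrete, here is a minimal sketch in C; the fixed cost, per-node cost, and run times are invented numbers, not from the slides:

```c
#include <stdio.h>

/* Hypothetical cost model: a fixed infrastructure cost plus a
 * per-node incremental cost, as on the slide. */
static double cost(int p) {
    const double fixed    = 50000.0;   /* assumed fixed cost      */
    const double per_node = 5000.0;    /* assumed cost per node   */
    return fixed + per_node * p;
}

int main(void) {
    double t1 = 1000.0;   /* assumed 1-processor run time (s)  */
    double tp = 40.0;     /* assumed 32-processor run time (s) */
    int p = 32;

    double speedup = t1 / tp;
    double costup  = cost(p) / cost(1);  /* costup(p) = Cost(p)/Cost(1) */

    printf("speedup = %.1f, costup = %.1f -> %s\n",
           speedup, costup,
           speedup > costup ? "cost-effective" : "not cost-effective");
    return 0;
}
```

With these numbers, 32 processors give a speedup of 25 at a costup of only about 3.8, so the design passes the test.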

Slide 5: Physical Scaling
- Chip-, board-, and system-level partitioning has a big impact on scaling
- However, there is little consensus on how to partition

Slide 6: Network Transaction Primitives
- Primitives used to implement the programming model on a scalable machine
- A network transaction is a one-way transfer between source and destination
- Resembles a bus transaction, but much richer in variety
- Examples: a message-send transaction; a write transaction in a SAS machine
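As a rough illustration of that variety, a network transaction might be described by a descriptor like the one below; the field names and sizes are hypothetical, not taken from any particular machine:

```c
#include <stdint.h>

/* A hypothetical network-transaction descriptor. Real machines differ
 * widely in format -- that variety is the point of the slide. */
typedef struct {
    uint16_t dest_node;    /* destination node id                    */
    uint16_t src_node;     /* source node id                         */
    uint8_t  kind;         /* e.g. MSG_SEND, SAS_WRITE, ACK, NACK    */
    uint8_t  payload_len;  /* number of payload bytes that follow    */
    uint64_t addr;         /* remote address, for SAS read/write     */
    uint8_t  payload[64];  /* the data, e.g. the value being written */
} net_txn_t;
```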

Slide 7: Bus vs. Network Transactions

Design issue                 Bus transaction            Network transaction
Protection                   V->P address translation   Done at multiple points
Format                       Fixed                      Flexible
Output buffering             Simple                     Supported, flexible in format
Media arbitration            Global                     Distributed
Destination name & routing   Direct                     Via several switches
Input buffering              One source                 Several sources
Action                       Response                   Rich diversity
Completion detection         Simple                     Response transaction
Transaction ordering         Global order               No global order

Slide 8: SAS Transactions
Issues:
- Fixed- or variable-size transfers
- Deadlock avoidance and handling of full input buffers

Slide 9: Sequential Consistency
Issues:
- Writes need acknowledgments to signal completion
- SC may cause extreme waiting times
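A minimal sketch of the acknowledgment mechanism, with all names hypothetical: the source counts outstanding writes and, to preserve SC, stalls until the count drains before issuing the next memory operation.

```c
#include <stdatomic.h>

/* Hypothetical per-processor state: writes issued into the network
 * but not yet acknowledged by the remote memory. */
static atomic_int outstanding_writes;

void issue_write(void) {                /* dest/addr/data omitted */
    atomic_fetch_add(&outstanding_writes, 1);
    /* ... inject the write transaction into the network ... */
}

void on_write_ack(void) {               /* called when an ack arrives */
    atomic_fetch_sub(&outstanding_writes, 1);
}

/* Under SC, a processor may not issue its next memory operation
 * until all of its previous writes have completed remotely. */
void sc_wait_for_writes(void) {
    while (atomic_load(&outstanding_writes) != 0)
        ;                               /* the waiting the slide warns about */
}
```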

Slide 10: Message Passing
Multiple flavors of synchronization semantics:
- Blocking versus non-blocking
  - Blocking send/recv returns when the operation completes
  - Non-blocking returns immediately (a probe function tests completion)
- Synchronous
  - Send completes after the matching receive has executed
  - Receive completes after the data transfer from the matching send completes
- Asynchronous (buffered, in MPI terminology)
  - Send completes as soon as the send buffer may be reused
(The MPI sketch below shows how these flavors map onto concrete calls.)
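In MPI, these flavors correspond to distinct send calls. A minimal two-rank sketch (buffer contents and tag values are arbitrary):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0};
    if (rank == 0) {
        /* Blocking standard send: returns when buf may be reused
         * (MPI may or may not have buffered the message). */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

        /* Synchronous send: completes only after the matching
         * receive has started -- the "synchronous" flavor above. */
        MPI_Ssend(buf, 4, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);

        /* Non-blocking send: returns immediately; completion is
         * tested (MPI_Test) or awaited (MPI_Wait) separately. */
        MPI_Request req;
        MPI_Isend(buf, 4, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        for (int tag = 0; tag < 3; tag++)
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```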

Slide 11: Synchronous MP Protocol
- Baseline: a sender-initiated, three-phase protocol (request to send, ready-to-receive reply, data transfer)
- Alternative: keep the match table at the sender, enabling a two-phase receive-initiated protocol

Slide 12: Asynchronous Optimistic MP Protocol
Issues:
- Copying overhead at the receiver, from a temporary buffer to user space
- Huge buffer space needed at the receiver to cope with the worst case
(See the receiver-side sketch below.)
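The "optimistic" part is that the sender transmits the data immediately, before knowing whether the matching receive has been posted. A receiver-side sketch in C; the helper functions and types are assumed for illustration, not part of any real interface:

```c
#include <string.h>

typedef struct { int src, tag, len; char data[256]; } msg_t;
typedef struct { int src, tag, done; char *user_buf; } recv_entry_t;

/* Assumed helpers (declarations only, for the sketch): */
recv_entry_t *match_posted_recv(int src, int tag);  /* NULL if none   */
void enqueue_unexpected(const msg_t *m);            /* temp buffering */

/* Receiver side: the data may arrive before recv() is posted. */
void on_message_arrival(const msg_t *m) {
    recv_entry_t *r = match_posted_recv(m->src, m->tag);
    if (r) {
        memcpy(r->user_buf, m->data, (size_t)m->len); /* straight to user space */
        r->done = 1;
    } else {
        /* No match yet: stash the message in a temporary buffer.
         * This is exactly the copying overhead and worst-case
         * buffer-space problem listed on the slide. */
        enqueue_unexpected(m);
    }
}
```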

Slide 13: Asynchronous Robust MP Protocol
- Note: after the handshake, the send and receive buffer addresses are both known, so the data transfer can be performed with little overhead
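A sketch of that handshake as event handlers, assuming the three-phase request/ready/data structure described on slide 11; all message kinds and helpers are hypothetical:

```c
/* Message kinds for the three-phase rendezvous. */
enum { REQ_TO_SEND, READY_TO_RECV, DATA };

typedef struct { void *buf; int len; } pending_send_t;

/* Assumed primitives (declarations only, for the sketch): */
void net_send(int dest, int kind, int tag, const void *payload, int len);
void remember_pending_send(int dest, int tag, void *buf, int len);
int  recv_posted(int src, int tag);
pending_send_t *lookup_pending_send(int src, int tag);

/* Sender: announce the message instead of sending the data. */
void robust_send(int dest, int tag, void *buf, int len) {
    net_send(dest, REQ_TO_SEND, tag, NULL, 0);
    remember_pending_send(dest, tag, buf, len);
}

/* Receiver: reply only once the matching recv is posted, so no
 * temporary buffering of bulk data is ever needed. */
void on_req_to_send(int src, int tag) {
    if (recv_posted(src, tag))
        net_send(src, READY_TO_RECV, tag, NULL, 0);
    /* else: record the request and reply when recv() is posted */
}

/* Sender: both buffer addresses are now known, so the transfer
 * itself proceeds with little overhead, as the slide notes. */
void on_ready_to_recv(int src, int tag) {
    pending_send_t *s = lookup_pending_send(src, tag);
    net_send(src, DATA, tag, s->buf, s->len);
}
```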

Slide 14: Active Messages
- User-level analog of network transactions: transfer a data packet and invoke a handler that extracts it from the network and integrates it with the ongoing computation
- Request message -> handler at the destination -> reply message -> handler back at the source
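Because each packet names its handler, input processing becomes a dispatch rather than a buffering step. A minimal sketch; the handler table, packet layout, and names are illustrative, not from the Active Messages paper:

```c
#include <stdint.h>

/* Handlers run on arrival to pull data out of the network and fold
 * it into the computation. */
typedef void (*am_handler_t)(int src, uint64_t arg0, uint64_t arg1);

#define MAX_HANDLERS 64
static am_handler_t handler_table[MAX_HANDLERS];

typedef struct {
    uint16_t src;
    uint16_t handler_ix;   /* table index, not a raw pointer */
    uint64_t arg0, arg1;
} am_packet_t;

/* Network input processing: dispatch, don't buffer. */
void am_dispatch(am_packet_t *p) {
    handler_table[p->handler_ix](p->src, p->arg0, p->arg1);
}

/* Example request handler: fold a value into local state; a real
 * handler would then send the reply that invokes a completion
 * handler back at the requester. */
static uint64_t cell;
void add_handler(int src, uint64_t val, uint64_t unused) {
    (void)src; (void)unused;
    cell += val;
    /* registration at init time, e.g.: handler_table[7] = add_handler; */
}
```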

Slide 15: Challenges Common to SAS and MP
- Input buffer overflow: how to signal that buffer space is exhausted
  Solutions:
  - ACK at the protocol level
  - back-pressure flow control
  - special ACK path, or drop packets (requires a time-out)
- Fetch deadlock (revisited): a request often generates a response, and the two can form dependence cycles in the network
  Solutions:
  - two logically independent request/response networks
  - NACK requests at the receiver to free space
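One common realization of back-pressure is credit-based flow control: a sender may inject a packet only while it holds a credit, i.e. a guaranteed input buffer slot at the receiver. A minimal sketch with hypothetical names:

```c
#define NODES 64
#define SLOTS_PER_DEST 8

/* One credit per reserved input buffer slot at each destination;
 * assumed to start at SLOTS_PER_DEST. */
static int credits[NODES];

void inject(int dest, void *pkt);   /* assumed network primitive */

int try_send(int dest, void *pkt) {
    if (credits[dest] == 0)
        return 0;                   /* back pressure: caller retries */
    credits[dest]--;
    inject(dest, pkt);
    return 1;
}

/* The receiver returns a credit when it frees an input buffer slot;
 * the return can ride a special ACK path or piggyback on other
 * traffic. To avoid fetch deadlock, requests and responses would use
 * separate credit pools (or logically separate networks). */
void on_credit_return(int dest) {
    credits[dest]++;
}
```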

Slide 16: Spectrum of Designs
How much of the network transaction is interpreted in hardware, from none to full:
- None, physical bit stream: blind, physical DMA (nCUBE, iPSC, ...)
- User/System distinction:
  - user-level port (CM-5, *T)
  - user-level handler (J-Machine, Monsoon, ...)
- Remote virtual address: processing, translation (Paragon, Meiko CS-2)
- Global physical address: processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache: cache controller (Dash, KSR, Flash)
Moving down the list: increasing HW support, specialization, intrusiveness, performance (???)

Slide 17: MP Architectures
- Design tradeoff: how much processing happens in the communication assist (CA) vs. the processor (P), and how much interpretation of the network transaction
  - Physical DMA (7.3)
  - User-level access (7.4)
  - Dedicated message processing (7.5)
[Figure: node architecture; each P+M node connects to the scalable network through a communication assist]
- Output processing: checks, translation, formatting, scheduling
- Input processing: checks, translation, buffering, action

Slide 18: Physical DMA
- The node processor packages messages in user/system mode
- DMA is used to copy between the network and system buffers
- Problem: there is no way to distinguish user from system messages, so the node processor must be involved in every message, adding much overhead
- Examples: nCUBE/2, IBM SP1

Slide 19: User-Level Access
- Network interface mapped into the user address space
- Communication assist does protection checks, translation, etc.
- No intervention by the kernel except for interrupts
- Example: CM-5

Slide 20: Dedicated Message Processing
The message processor (MP):
- interprets messages
- supports message operations
- off-loads P behind a clean message abstraction
[Figure: each node holds P, memory, and an MP with its network interface (NI), split across user/system]
Issues:
- P and MP communicate via shared memory, which generates coherence traffic
- The MP can become a bottleneck, since all concurrent actions funnel through it
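One plausible form of the P-to-MP shared-memory channel is a single-producer/single-consumer command queue. The sketch below is an assumption about the mechanism, not a description of any specific machine:

```c
#include <stdatomic.h>

/* P produces commands at the tail; the MP consumes at the head.
 * QSIZE and the command layout are invented for the sketch. */
#define QSIZE 128
typedef struct { int dest, len; char data[240]; } mp_cmd_t;

static mp_cmd_t q[QSIZE];
static atomic_uint head, tail;

int p_post_send(const mp_cmd_t *c) {     /* runs on the compute processor */
    unsigned t = atomic_load(&tail);
    if (t - atomic_load(&head) == QSIZE)
        return 0;                        /* queue full: the MP bottleneck */
    q[t % QSIZE] = *c;                   /* this store is the coherence
                                            traffic mentioned above       */
    atomic_store(&tail, t + 1);
    return 1;
}

void mp_loop(void) {                     /* runs on the message processor */
    for (;;) {
        unsigned h = atomic_load(&head);
        if (h == atomic_load(&tail))
            continue;                    /* poll for work */
        /* format/translate q[h % QSIZE] and inject it into the network */
        atomic_store(&head, h + 1);
    }
}
```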

Slide 21: Shared Physical Address Space
- Remote reads and writes are performed by pseudo processors and pseudo memories on behalf of the requester
- Cache coherence issues are treated in Ch. 8
[Figure: at each node, P and M attach to the scalable network through a pseudo memory (outgoing requests) and a pseudo processor (incoming requests)]