Computer Science Division

Slides:

Advertisements

Similar presentations

Graduate Computer Architecture, Fall 2005 Lecture 10 Distributed Memory Multiprocessors Shih-Hao Hung Computer Science & Information Engineering National.

Advertisements

Realizing Programming Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

Latency Tolerance: what to do when it just won’t go away CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

ECE669 L20: Evaluation and Message Passing April 13, 2004 ECE 669 Parallel Computer Architecture Lecture 20 Evaluation and Message Passing.

Scalability CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

Scalable Distributed Memory Multiprocessors Todd C. Mowry CS 495 October 24 & 29, 2002.

EECC756 - Shaaban #1 lec # 12 Spring Scalable Distributed Memory Machines Goal: Parallel machines that can be scaled to hundreds or thousands.

Distributed Memory Multiprocessors CS 252, Spring 2005 David E. Culler Computer Science Division U.C. Berkeley.

General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

7/2/2015 slide 1 PCOD: Scalable Parallelism (ICs) Per Stenström (c) 2008, Sally A. McKee (c) 2011 Scalable Multiprocessors What is a scalable design? (7.1)

Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

Switching, routing, and flow control in interconnection networks.

Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.

TELE202 Lecture 5 Packet switching in WAN 1 Lecturer Dr Z. Huang Overview ¥Last Lectures »C programming »Source: ¥This Lecture »Packet switching in Wide.

Data and Computer Communications Circuit Switching and Packet Switching.

Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,

Graduate Computer Architecture I Lecture 11: Distribute Memory Multiprocessors.

Chapter 13 – I/O Systems (Pgs ). Devices  Two conflicting properties A. Growing uniformity in interfaces (both h/w and s/w): e.g., USB, TWAIN.

3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.

CS252 Graduate Computer Architecture Lecture 24 Network Interface Design Memory Consistency Models Prof John D. Kubiatowicz

Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.

Cluster Computers. Introduction Cluster computing –Standard PCs or workstations connected by a fast network –Good price/performance ratio –Exploit existing.

CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.

Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.

COMP8330/7330/7336 Advanced Parallel and Distributed Computing Communication Costs in Parallel Machines Dr. Xiao Qin Auburn University

Network layer (addendum) Slides adapted from material by Nick McKeown and Kevin Lai.

Data and Computer Communications Eighth Edition by William Stallings Lecture slides by Lawrie Brown Chapter 10 – Circuit Switching and Packet Switching.

McGraw-Hill©The McGraw-Hill Companies, Inc., 2000 Muhammad Waseem Iqbal Lecture # 20 Data Communication.

Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.

Fundamental Design Issues

Deterministic Communication with SpaceWire

Packet Switching Networks & Frame Relay

Overview Parallel Processing Pipelining

Berkeley Cluster Projects

The network-on-chip protocol

Advanced Computer Networks

Packet Switching Outline Store-and-Forward Switches

Operating Systems (CS 340 D)

1 Input-Output Organization Computer Organization Computer Architectures Lab Peripheral Devices Input-Output Interface Asynchronous Data Transfer Modes.

Azeddien M. Sllame, Amani Hasan Abdelkader

Chapter 3 Top Level View of Computer Function and Interconnection

Switching Techniques In large networks there might be multiple paths linking sender and receiver. Information may be switched as it travels through various.

BIC 10503: COMPUTER ARCHITECTURE

CMSC 611: Advanced Computer Architecture

Architecture of Parallel Computers CSC / ECE 506 Summer 2006 Scalable Programming Models Lecture 11 6/19/2006 Dr Steve Hunter.

Introduction to Scalable Interconnection Network Design

ECE 4450:427/527 - Computer Networks Spring 2017

Switching, routing, and flow control in interconnection networks

Lecture 1: Parallel Architecture Intro

Architectural Interactions in High Performance Clusters

Convergence of Parallel Architectures

ECEG-3202 Computer Architecture and Organization

Computer Science Division

Characteristics of Reconfigurable Hardware

Overview of Computer Architecture and Organization

Peng Liu Lecture 14 I/O Peng Liu

Data Communication Networks

Networks Networking has become ubiquitous (cf. WWW)

EE 122: Lecture 7 Ion Stoica September 18, 2001.

Interconnection Networks Contd.

Overview of Computer Architecture and Organization

Latency Tolerance: what to do when it just won’t go away

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

CS 6290 Many-core & Interconnect

Computer Science Division

Packet Switching Outline Store-and-Forward Switches

Switching, routing, and flow control in interconnection networks

William Stallings Computer Organization and Architecture

Multiprocessors and Multi-computers

Presentation transcript:

Computer Science Division Scalability CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley CS258 S99

Recap: Gigaplane Bus Timing 3/3/99 CS258 S99

Enterprise Processor and Memory System 2 procs per board, external L2 caches, 2 mem banks with x-bar Data lines buffered through UDB to drive internal 1.3 GB/s UPA bus Wide path to memory so full 64-byte line in 1 mem cycle (2 bus cyc) Addr controller adapts proc and bus protocols, does cache coherence its tags keep a subset of states needed by bus (e.g. no M/E distinction) 3/3/99 CS258 S99

Enterprise I/O System I/O board has same bus interface ASICs as processor boards But internal bus half as wide, and no memory path Only cache block sized transactions, like processing boards Uniformity simplifies design ASICs implement single-block cache, follows coherence protocol Two independent 64-bit, 25 MHz Sbuses One for two dedicated FiberChannel modules connected to disk One for Ethernet and fast wide SCSI Can also support three SBUS interface cards for arbitrary peripherals Performance and cost of I/O scale with no. of I/O boards 3/3/99 CS258 S99

Limited Scaling of a Bus Characteristic Bus Physical Length ~ 1 ft Number of Connections fixed Maximum Bandwidth fixed Interface to Comm. medium memory inf Global Order arbitration Protection Virt -> physical Trust total OS single comm. abstraction HW Bus: each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components 3/3/99 CS258 S99

Workstations in a LAN? Characteristic Bus LAN Physical Length ~ 1 ft KM Number of Connections fixed many Maximum Bandwidth fixed ??? Interface to Comm. medium memory inf peripheral Global Order arbitration ??? Protection Virt -> physical OS Trust total none OS single independent comm. abstraction HW SW No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve. Independent failure and restart 3/3/99 CS258 S99

Scalable Machines What are the design trade-offs for the spectrum of machines between? specialize or commodity nodes? capability of node-to-network interface supporting programming models? What does scalability mean? avoids inherent design limits on resources bandwidth increases with P latency does not cost increases slowly with P 3/3/99 CS258 S99

Bandwidth Scalability What fundamentally limits bandwidth? single set of wires Must have many independent wires Connect modules through switches Bus vs Network Switch? 3/3/99 CS258 S99

Dancehall MP Organization Network bandwidth? Bandwidth demand? independent processes? communicating processes? Latency? 3/3/99 CS258 S99

Generic Distributed Memory Org. Network bandwidth? Bandwidth demand? independent processes? communicating processes? Latency? 3/3/99 CS258 S99

Key Property Large number of independent communication paths between nodes => allow a large number of concurrent transactions using different wires initiated independently no global arbitration effect of a transaction only visible to the nodes involved effects propagated through additional transactions 3/3/99 CS258 S99

Latency Scaling T(n) = Overhead + Channel Time + Routing Delay Channel Time(n) = n/B --- BW at bottleneck RoutingDelay(h,n) 3/3/99 CS258 S99

Typical example max distance: log n number of switches: a n log n overhead = 1 us, BW = 64 MB/s, 200 ns per hop Pipelined T64(128) = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop = 4.2 us T1024(128) = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop = 5.0 us Store and Forward T64sf(128) = 1.0 us + 6 hops * (2.0 + 0.2) us/hop = 14.2 us T64sf(1024) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop = 23 us 3/3/99 CS258 S99

Cost Scaling cost(p,m) = fixed cost + incremental cost (p,m) Bus Based SMP? Ratio of processors : memory : network : I/O ? Parallel efficiency(p) = Speedup(P) / P Costup(p) = Cost(p) / Cost(1) Cost-effective: speedup(p) > costup(p) Is super-linear speedup 3/3/99 CS258 S99

Cost Effective? 2048 processors: 475 fold speedup at 206x cost 3/3/99 CS258 S99

Physical Scaling Chip-level integration Board-level System level 3/3/99 CS258 S99

nCUBE/2 Machine Organization S i n g l e - c h p o d B a s m u H y r b t w k f D R A M U I F & 6 4 E O $ x 1024 Nodes Entire machine synchronous at 40 MHz 3/3/99 CS258 S99

CM-5 Machine Organization 3/3/99 CS258 S99

System Level Integration 3/3/99 CS258 S99

Realizing Programming Models CAD Multipr ogramming Shar ed addr ess Message passing Data parallel Database Scientific modeling Parallel applications Pr ogramming models Communication abstraction User/system boundary Compilation or library Operating systems support Communication har dwar e Physical communication medium Har e/softwar e boundary 3/3/99 CS258 S99

Network Transaction Primitive one-way transfer of information from a source output buffer to a dest. input buffer causes some action at the destination occurrence is not directly visible at source deposit data, state change, reply 3/3/99 CS258 S99

Bus Transactions vs Net Transactions Issues: protection check V->P ?? format wires flexible output buffering reg, FIFO ?? media arbitration global local destination naming and routing input buffering limited many source action completion detection 3/3/99 CS258 S99

Shared Address Space Abstraction fixed format, request/response, simple action 3/3/99 CS258 S99

Consistency is challenging 3/3/99 CS258 S99

Synchronous Message Passing 3/3/99 CS258 S99

Asynch. Message Passing: Optimistic Storage??? 3/3/99 CS258 S99

Asynch. Msg Passing: Conservative 3/3/99 CS258 S99

Active Messages User-level analog of network transaction Request handler Reply handler User-level analog of network transaction Action is small user function Request/Reply May also perform memory-to-memory transfer 3/3/99 CS258 S99

Common Challenges Input buffer overflow N-1 queue over-commitment => must slow sources reserve space per source (credit) when available for reuse? Ack or Higher level Refuse input when full backpressure in reliable network tree saturation deadlock free what happens to traffic not bound for congested dest? Reserve ack back channel drop packets 3/3/99 CS258 S99

Challenges (cont) Fetch Deadlock For network to remain deadlock free, nodes must continue accepting messages, even when cannot source them what if incoming transaction is a request? Each may generate a response, which cannot be sent! What happens when internal buffering is full? logically independent request/reply networks physical networks virtual channels with separate input/output queues bound requests and reserve input buffer space K(P-1) requests + K responses per node service discipline to avoid fetch deadlock? NACK on input buffer full NACK delivery? 3/3/99 CS258 S99

Summary Scalability Realizing Programming Models physical, bandwidth, latency and cost level of integration Realizing Programming Models network transactions protocols safety N-1 fetch deadlock Next: Communication Architecture Design Space how much hardware interpretation of the network transaction? 3/3/99 CS258 S99