EECC756 - Shaaban, lec # 12, Spring 2000 (4-25-2000): Scalable Distributed Memory Machines


EECC756 - Shaaban #1 lec # 12 Spring 2000
Scalable Distributed Memory Machines
Goal: Parallel machines that can be scaled to hundreds or thousands of processors.
Design choices:
  – Custom-designed or commodity nodes?
  – Network scalability.
  – Capability of the node-to-network interface (critical).
  – Supported programming models?
What does hardware scalability mean?
  – Avoids inherent design limits on resources.
  – Bandwidth increases with machine size P.
  – Latency should not increase with machine size P.
  – Cost should increase slowly with P.

EECC756 - Shaaban #2 lec # 12 Spring 2000
MPPs Scalability Issues
Problems:
  – Memory-access latency.
  – Interprocess communication complexity or synchronization overhead.
  – Multi-cache inconsistency.
  – Message-passing and message-processing overheads.
Possible solutions:
  – Fast, dedicated, proprietary and scalable networks and protocols.
  – Low-latency fast synchronization techniques, possibly hardware-assisted.
  – Hardware-assisted message processing in communication assists (node-to-network interfaces).
  – Weaker memory consistency models.
  – Scalable directory-based cache coherence protocols.
  – Shared virtual memory.
  – Improved software portability; standard parallel and distributed operating system support.
  – Software latency-hiding techniques.

EECC756 - Shaaban #3 lec # 12 Spring 2000
One Extreme: Limited Scaling of a Bus
Bus: each level of the system design is grounded in the scaling limits of the layers below and in assumptions of close coupling between components.

  Characteristic              Bus
  Physical length             ~ 1 ft
  Number of connections       fixed
  Maximum bandwidth           fixed
  Interface to comm. medium   memory interface
  Global order                arbitration
  Protection                  virtual -> physical
  Trust                       total
  OS                          single
  Comm. abstraction           HW

=> Poor scalability.

EECC756 - Shaaban #4 lec # 12 Spring 2000
Another Extreme: Scaling of Workstations in a LAN?
No clear limit to physical scaling, no global order; consensus is difficult to achieve.

  Characteristic              Bus                   LAN
  Physical length             ~ 1 ft                KM
  Number of connections       fixed                 many
  Maximum bandwidth           fixed                 ???
  Interface to comm. medium   memory interface      peripheral
  Global order                arbitration           ???
  Protection                  virtual -> physical   OS
  Trust                       total                 none
  OS                          single                independent
  Comm. abstraction           HW                    SW

EECC756 - Shaaban #5 lec # 12 Spring 2000
Bandwidth Scalability
Depends largely on network characteristics:
  – Channel bandwidth.
  – Static networks: topology (node degree, bisection width, etc.).
  – Multistage networks: switch size and connection pattern properties.
  – Node-to-network interface capabilities.

EECC756 - Shaaban #6 lec # 12 Spring 2000
Dancehall MP Organization
Network bandwidth?
Bandwidth demand?
  – Independent processes?
  – Communicating processes?
Latency?
Extremely high demands on the network in terms of bandwidth and latency, even for independent processes.

EECC756 - Shaaban #7 lec # 12 Spring 2000
Generic Distributed Memory Organization
Network bandwidth?
Bandwidth demand?
  – Independent processes?
  – Communicating processes?
Latency? O(log2 P) increase?
Cost scalability of the system?
Network: multistage interconnection network (MIN)? Custom-designed? Network protocols? OS supported?
Node: O(10)-processor bus-based SMP; custom-designed CPU?
Node/system integration level: how far? Cray-on-a-Chip? SMP-on-a-Chip?
Communication assist: extent of functionality? Message transactions? DMA? Global virtual shared address space?

EECC756 - Shaaban #8 lec # 12 Spring 2000
Key System Scaling Property
Large number of independent communication paths between nodes
  => allows a large number of concurrent transactions using different channels.
Transactions are initiated independently.
No global arbitration.
The effect of a transaction is visible only to the nodes involved.
  – Effects are propagated through additional transactions.

EECC756 - Shaaban #9 lec # 12 Spring 2000
Latency Scaling
T(n) = Overhead + Channel Time + Routing Delay
  – Scaling of overhead?
  – Channel Time(n) = n/B, where B is the bandwidth at the bottleneck link.
  – Routing Delay = RoutingDelay(h, n), a function of the number of hops h and the message size n.

EECC756 - Shaaban #10 lec # 12 Spring 2000
Network Latency Scaling Example
O(log2 n)-stage MIN using switches:
  – Max distance: log2 n hops.
  – Number of switches: proportional to n log n.
Assume: overhead = 1 us, link BW = 64 MB/s, 200 ns per hop, 128-byte messages (channel time = 128 B / 64 MB/s = 2 us).
Using pipelined (cut-through) routing:
  T64(128)   = 1.0 us + 2.0 us + 6 hops  x 0.2 us/hop = 4.2 us
  T1024(128) = 1.0 us + 2.0 us + 10 hops x 0.2 us/hop = 5.0 us
Store-and-forward:
  T64sf(128)   = 1.0 us + 6 hops  x (2.0 + 0.2) us/hop = 14.2 us
  T1024sf(128) = 1.0 us + 10 hops x (2.0 + 0.2) us/hop = 23 us
Only a 20% increase in latency for a 16x increase in machine size (cut-through case).
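The arithmetic above follows directly from the latency model on the previous slide. A minimal C sketch that reproduces the four numbers, assuming the simple cut-through and store-and-forward formulas given there (function and variable names are illustrative):

```c
#include <stdio.h>

/* Latency model from the previous slide:
 *   cut-through:       T(n) = Overhead + n/B + h * per_hop_delay
 *   store-and-forward: T(n) = Overhead + h * (n/B + per_hop_delay)
 * Times in microseconds, n in bytes, B in bytes per microsecond (= MB/s). */
static double t_cut_through(double n, double B, int h,
                            double overhead, double hop) {
    return overhead + n / B + h * hop;
}

static double t_store_forward(double n, double B, int h,
                              double overhead, double hop) {
    return overhead + h * (n / B + hop);
}

int main(void) {
    const double overhead = 1.0;   /* us */
    const double B = 64.0;         /* 64 MB/s = 64 bytes per microsecond */
    const double hop = 0.2;        /* 200 ns per hop */
    const double n = 128.0;        /* message size in bytes */

    /* 64 nodes -> log2(64) = 6 hops; 1024 nodes -> log2(1024) = 10 hops */
    printf("T_64(128)      = %.1f us\n", t_cut_through(n, B, 6, overhead, hop));    /* 4.2  */
    printf("T_1024(128)    = %.1f us\n", t_cut_through(n, B, 10, overhead, hop));   /* 5.0  */
    printf("T_64_sf(128)   = %.1f us\n", t_store_forward(n, B, 6, overhead, hop));  /* 14.2 */
    printf("T_1024_sf(128) = %.1f us\n", t_store_forward(n, B, 10, overhead, hop)); /* 23.0 */
    return 0;
}
```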

EECC756 - Shaaban #11 lec # 12 Spring 2000
Cost Scaling
cost(p, m) = fixed cost + incremental cost(p, m)
  – Bus-based SMP?
  – Ratio of processors : memory : network : I/O?
Parallel efficiency(p) = Speedup(p) / p
Similar to speedup, one can define: Costup(p) = Cost(p) / Cost(1)
A system is cost-effective when Speedup(p) > Costup(p).

EECC756 - Shaaban #12 lec # 12 Spring 2000
Cost Effective?
Example: 2048 processors give a 475-fold speedup at 206x the cost; speedup exceeds costup, so the machine is cost-effective despite modest parallel efficiency.
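Applying the cost-effectiveness test from the previous slide to these numbers, a small illustrative C sketch (the figures 475 and 206 are from the slide; the function names are mine):

```c
#include <stdio.h>
#include <stdbool.h>

/* Cost-effectiveness test from the Cost Scaling slide:
 * a machine is cost-effective when speedup(p) > costup(p). */
static bool cost_effective(double speedup, double costup) {
    return speedup > costup;
}

int main(void) {
    double speedup = 475.0;  /* 475-fold speedup on 2048 processors */
    double costup  = 206.0;  /* at 206x the cost of one processor   */
    printf("parallel efficiency = %.2f\n", speedup / 2048.0);  /* ~0.23 */
    printf("speedup / costup    = %.2f\n", speedup / costup);  /* ~2.3  */
    printf("cost-effective: %s\n", cost_effective(speedup, costup) ? "yes" : "no");
    return 0;
}
```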

EECC756 - Shaaban #13 lec # 12 Spring 2000
Physical Scaling
Chip-level integration:
  – Integrate the network interface, message router, and I/O links.
  – Memory/bus controller/chip set.
  – IRAM-style Cray-on-a-Chip.
  – Future: SMP on a chip?
Board-level integration:
  – Replicating standard microprocessor cores.
  – CM-5 replicated the core of a Sun SPARCstation 1 workstation.
  – Cray T3D and T3E replicated the core of a DEC Alpha workstation.
System-level integration:
  – IBM SP-2 uses 8-16 almost complete RS6000 workstations placed in racks.

EECC756 - Shaaban #14 lec # 12 Spring 2000
Chip-Level Integration Example: nCUBE/2 Machine Organization
  – Entire machine synchronous at 40 MHz.
  – Single-chip node (about 500,000 transistors, large at the time) integrating the MMU, instruction fetch and decode, 64-bit integer unit, IEEE floating point, operand cache, execution unit, DRAM interface, DMA channels, and router.
  – Basic module: 64 nodes socketed on a board.
  – Hypercube network configuration: 13 links per node, up to 8192 (2^13) nodes possible.

EECC756 - Shaaban #15 lec # 12 Spring 2000
Board-Level Integration Example: CM-5 Machine Organization
  – The design replicated the core of a Sun SPARCstation 1 workstation on each node.
  – Nodes are connected by a fat-tree network.

EECC756 - Shaaban #16 lec # 12 Spring 2000
System-Level Integration Example: IBM SP-2
  – 8-16 almost complete RS6000 workstations placed in racks.

EECC756 - Shaaban #17 lec # 12 Spring 2000
Realizing Programming Models: Realized by Protocols
Layered view (figure):
  – Parallel applications: CAD, database, scientific modeling, ...
  – Programming models: multiprogramming, shared address space, message passing, data parallel.
  – Communication abstraction (user/system boundary), realized by compilation or library plus operating system support.
  – Communication hardware (hardware/software boundary) over the physical communication medium.
All of these are ultimately realized by network transactions.

EECC756 - Shaaban #18 lec # 12 Spring 2000
Challenges in Realizing Programming Models in Large-Scale Machines
No global knowledge, nor global control.
  – Barriers, scans, reduce, and global-OR give only a fuzzy global state.
Very large number of concurrent transactions.
Management of input buffer resources:
  – Many sources can issue a request and over-commit the destination before any of them sees the effect.
Latency is large enough that one is tempted to "take risks":
  – Optimistic protocols.
  – Large transfers.
  – Dynamic allocation.
Many more degrees of freedom in the design and engineering of these systems.

EECC756 - Shaaban #19 lec # 12 Spring 2000
Network Transaction Processing
Key design issues:
  – How much interpretation of the message by the communication assist (CA) without involving the CPU?
  – How much dedicated processing in the CA?
Node architecture (figure): each node has a processor, memory, and a communication assist, attached to a scalable network.
  – Output processing: checks, translation, formatting, scheduling.
  – Input processing: checks, translation, buffering, action.
CA = Communication Assist.

EECC756 - Shaaban #20 lec # 12 Spring 2000
Spectrum of Designs
None: physical bit stream
  – Blind, physical DMA: nCUBE, iPSC, ...
User/system
  – User-level port: CM-5, *T
  – User-level handler: J-Machine, Monsoon, ...
Remote virtual address
  – Processing, translation: Paragon, Meiko CS-2
Global physical address
  – Processor + memory controller: RP3, BBN, T3D
Cache-to-cache
  – Cache controller: DASH, KSR, FLASH
Moving down the list: increasing HW support, specialization, intrusiveness, performance (???)

EECC756 - Shaaban #21 lec # 12 Spring 2000
No CA Net Transaction Interpretation: Physical DMA
  – DMA controlled by registers, generates interrupts.
  – Physical addresses => the OS initiates transfers.
Send side:
  – Construct a system "envelope" around the user data in the kernel area (fields: sender, auth, dest addr).
Receive side:
  – Must receive into a system buffer, since there is no message interpretation in the CA.
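As a concrete illustration of this send path, here is a minimal C sketch assuming a hypothetical DMA device with source-address, length, and start registers; the kernel copies the user data behind an envelope and programs the engine. None of the names correspond to a real machine:

```c
#include <stdint.h>
#include <string.h>

/* System "envelope" built by the kernel around user data for blind
 * physical DMA: the CA does no interpretation, so all checking is
 * done in software before the transfer starts. (Illustrative.) */
struct envelope {
    uint32_t sender;     /* sending node id                  */
    uint32_t auth;       /* authentication / protection tag  */
    uint32_t dest_addr;  /* destination node / buffer        */
    uint32_t length;     /* payload length in bytes          */
    uint8_t  payload[];  /* user data copied into kernel area */
};

/* Hypothetical memory-mapped DMA registers. */
struct dma_regs {
    volatile uint64_t src_phys;   /* physical source address                     */
    volatile uint32_t len;        /* transfer length                             */
    volatile uint32_t start;      /* write 1 to start; interrupt on completion   */
};

/* Kernel-level send: copy user data into a kernel buffer behind an
 * envelope, then hand the physical address to the DMA engine. */
void dma_send(struct dma_regs *dma, struct envelope *kbuf,
              uint32_t me, uint32_t auth, uint32_t dest,
              const void *user_data, uint32_t len,
              uint64_t (*virt_to_phys)(const void *)) {
    kbuf->sender    = me;
    kbuf->auth      = auth;
    kbuf->dest_addr = dest;
    kbuf->length    = len;
    memcpy(kbuf->payload, user_data, len);      /* user data into kernel area    */

    dma->src_phys = virt_to_phys(kbuf);         /* physical address => OS involvement */
    dma->len      = sizeof(*kbuf) + len;
    dma->start    = 1;                          /* completion raises an interrupt */
}
```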

EECC756 - Shaaban #22 lec # 12 Spring 2000
nCUBE/2 Network Interface
Independent DMA channel per link direction:
  – Leave input buffers always open.
  – Segmented messages.
Routing determines whether a message is intended for the local node or a remote node:
  – Dimension-order routing on the hypercube.
  – Bit-serial, with 36-bit cut-through.
Measured overheads:
  Os (send):    16 instructions, 260 cycles, 13 us
  Or (receive): 18 instructions, 200 cycles, 15 us (includes interrupt)
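Dimension-order (e-cube) routing on a hypercube, mentioned above, amounts to resolving the differing address bits one dimension at a time, lowest dimension first. A small illustrative C sketch of the idea (not the nCUBE/2 router logic itself):

```c
#include <stdio.h>

/* Dimension-order (e-cube) routing on a hypercube: at each node,
 * forward the message along the lowest dimension in which the
 * current address and the destination address still differ.
 * Returns that dimension, or -1 if the message has arrived. */
static int next_dimension(unsigned current, unsigned dest) {
    unsigned diff = current ^ dest;
    if (diff == 0)
        return -1;                 /* arrived at the destination */
    int dim = 0;
    while (((diff >> dim) & 1u) == 0)
        dim++;                     /* lowest differing bit */
    return dim;
}

int main(void) {
    unsigned node = 0x0A5, dest = 0x1F0;   /* example 13-bit node addresses */
    int hops = 0;
    while (node != dest) {
        int dim = next_dimension(node, dest);
        node ^= 1u << dim;         /* traverse the link in that dimension */
        hops++;
    }
    printf("delivered in %d hops\n", hops);  /* = number of differing address bits */
    return 0;
}
```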

EECC756 - Shaaban #23 lec # 12 Spring 2000
DMA in Conventional LAN Network Interfaces
Figure: the NIC sits on the I/O bus (bridged to the memory bus shared by the processor and host memory). Its DMA controller is programmed through address, length, and transmit/receive registers, and it walks TX and RX descriptor rings in host memory, where each descriptor holds Addr, Len, Status, and Next fields pointing at data buffers.
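The Addr/Len/Status/Next fields in the figure are the classic NIC descriptor ring. A minimal C sketch of such a ring and the host-side receive processing, with field names and ownership flags chosen for illustration rather than taken from any particular NIC:

```c
#include <stdint.h>

/* One entry of a transmit/receive descriptor ring in host memory.
 * The NIC walks the ring via DMA; the host fills (TX) or drains (RX)
 * entries and flips the ownership bit in 'status'. (Illustrative.) */
struct descriptor {
    uint64_t addr;     /* physical address of the data buffer      */
    uint32_t len;      /* buffer length / bytes received           */
    uint32_t status;   /* ownership + completion/error flags       */
    uint64_t next;     /* physical address of the next descriptor  */
};

#define DESC_OWNED_BY_NIC  0x1u
#define DESC_COMPLETE      0x2u

/* Host-side receive loop: hand completed buffers to the stack and
 * return each descriptor to the NIC. */
void rx_poll(struct descriptor *ring, int ring_size,
             void (*deliver)(uint64_t buf_phys, uint32_t len)) {
    for (int i = 0; i < ring_size; i++) {
        struct descriptor *d = &ring[i];
        if ((d->status & DESC_COMPLETE) && !(d->status & DESC_OWNED_BY_NIC)) {
            deliver(d->addr, d->len);            /* pass the buffer up the stack */
            d->status = DESC_OWNED_BY_NIC;       /* recycle the descriptor       */
        }
    }
}
```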

EECC756 - Shaaban #24 lec # 12 Spring 2000
User-Level Ports
  – Transactions are initiated at user level.
  – The CA interprets and delivers messages to the user without OS intervention.
  – Network port mapped into user space.
  – User/system flag in the envelope.
    – Protection check, translation, routing, and media access in the source CA.
    – User/system check in the destination CA; interrupt on system messages.
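A user-level port reduces to network FIFOs and status registers mapped into the user address space, so a send is just a sequence of stores with no system call. A minimal C sketch under that assumption (the register layout and names are hypothetical):

```c
#include <stdint.h>

/* Hypothetical memory-mapped user-level network port: the output
 * FIFO and its status register are mapped into the user's address
 * space, so injecting a message needs no OS involvement. */
struct user_port {
    volatile uint32_t out_status;   /* nonzero => space in the output FIFO */
    volatile uint32_t out_fifo;     /* write words to inject a message     */
    volatile uint32_t in_status;    /* nonzero => message waiting          */
    volatile uint32_t in_fifo;      /* read words to extract a message     */
};

/* Send a small message entirely at user level. The envelope carries a
 * user/system flag; protection, translation, and routing are handled
 * by the source CA, not by this code. */
void user_send(struct user_port *port, uint32_t dest_node,
               const uint32_t *words, int nwords) {
    while (!port->out_status)
        ;                            /* spin until the FIFO has room */
    port->out_fifo = dest_node;      /* header word: destination      */
    for (int i = 0; i < nwords; i++)
        port->out_fifo = words[i];   /* payload words                 */
}
```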

EECC756 - Shaaban #25 lec # 12 Spring 2000
User-Level Network Example: CM-5
  – Two data networks and one control network.
  – Input and output FIFOs for each network.
  – Tag per message:
    – Indexes the Network Interface (NI) mapping table.
  – *T integrated the NI on chip; a similar approach was also used in iWARP.
Measured overheads:
  Os (send):    50 cycles, 1.5 us
  Or (receive): 53 cycles, 1.6 us
  Interrupt:    10 us

EECC756 - Shaaban #26 lec # 12 Spring 2000
User-Level Handlers
  – Tighter integration of the user-level network port with the processor, at the register level.
  – Hardware support to vector to the address specified in the message.
    – Message ports in registers.
Figure: a message carrying destination, handler address, and data moves between nodes (each a processor-memory pair), tagged as user or system.
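With user-level handlers, the message itself names the code to run on arrival (the handler address field in the figure). A minimal active-message-style dispatch sketch in C, illustrating the idea rather than the J-Machine or Monsoon hardware:

```c
#include <stdint.h>
#include <stdio.h>

/* A message whose first field is the address of the user-level
 * handler to vector to on arrival, as described above. */
typedef void (*handler_fn)(uint32_t src, const uint32_t *args, int nargs);

struct message {
    handler_fn handler;     /* handler address carried in the message  */
    uint32_t   src;         /* sending node                            */
    uint32_t   args[4];     /* small payload delivered "in registers"  */
    int        nargs;
};

/* On arrival, the hardware (simulated here) vectors directly to the
 * handler named by the message, without OS intervention. */
static void dispatch(const struct message *m) {
    m->handler(m->src, m->args, m->nargs);
}

/* Example user handler: accumulate a remote contribution. */
static uint32_t sum;
static void add_handler(uint32_t src, const uint32_t *args, int nargs) {
    (void)src; (void)nargs;
    sum += args[0];
}

int main(void) {
    struct message m = { add_handler, 3, { 42 }, 1 };
    dispatch(&m);
    printf("sum = %u\n", sum);   /* prints 42 */
    return 0;
}
```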

EECC756 - Shaaban #27 lec # 12 Spring 2000
iWARP
  – Nodes integrate communication with computation on a systolic basis.
  – Message data goes directly to registers.
  – Data can also be streamed into memory.
Figure: the host attaches through an interface unit.

EECC756 - Shaaban #28 lec # 12 Spring 2000
Dedicated Message Processing Without Specialized Hardware Design
  – A general-purpose processor performs arbitrary output processing (at system level).
  – A general-purpose processor interprets incoming network transactions (at system level).
  – The user processor and the message processor (MP) share memory.
  – Message processors communicate with each other via system network transactions.
Node (figure): a bus-based SMP with memory, a compute processor P, a message processor MP, and a network interface NI; the user/system split falls between the two processors.

EECC756 - Shaaban #29 lec # 12 Spring 2000
Levels of Network Transaction
  – The user processor stores cmd / msg / data into a shared output queue.
    – Must still check for output queue full (or make it elastic); see the sketch below.
  – Communication assists make the transaction happen.
    – Checking, translation, scheduling, transport, interpretation.
  – The effect is observed on the destination address space and/or as events.
  – The protocol is divided between the two layers.
Figure: two nodes (memory, compute processor P, message processor, NI) with the user/system split, connected across the network.
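A minimal C sketch of the user-processor side of the shared output queue, including the queue-full check called out above. The single-producer/single-consumer ring layout and all names are illustrative assumptions, not taken from a specific machine:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Shared output queue between the compute processor (producer) and the
 * message processor (consumer), living in memory both can see.
 * Sizes and layout are illustrative. */
#define QUEUE_SLOTS 64
#define SLOT_BYTES  128

struct out_queue {
    _Atomic uint32_t head;                     /* advanced by the compute processor */
    _Atomic uint32_t tail;                     /* advanced by the message processor */
    uint8_t slots[QUEUE_SLOTS][SLOT_BYTES];    /* cmd / msg / data per slot         */
};

/* Enqueue one command/message; returns false if the queue is full, so the
 * caller must retry or back off (the "check for output queue full" point). */
bool enqueue(struct out_queue *q, const void *msg, uint32_t len) {
    uint32_t head = atomic_load(&q->head);
    uint32_t next = (head + 1) % QUEUE_SLOTS;
    if (next == atomic_load(&q->tail))
        return false;                          /* queue full               */
    if (len > SLOT_BYTES)
        return false;                          /* does not fit in one slot */
    memcpy(q->slots[head], msg, len);
    atomic_store(&q->head, next);              /* publishes the slot to the MP */
    return true;
}
```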

EECC756 - Shaaban #30 lec # 12 Spring 2000
Example: Intel Paragon
Figure: each node is a small bus-based SMP of i860XP processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI), one serving as compute processor and one as message processor, sharing memory, send/receive DMA engines (sDMA, rDMA), and a network interface onto the duplex network; I/O nodes and service devices attach to the same network. The message processor handles routing (rte), handler dispatch, variable-size data, and end-of-packet (EOP) processing; packets are up to 2048 B.

EECC756 - Shaaban #31 lec # 12 Spring 2000
Message Processor Events
Figure: the message processor runs a dispatcher that reacts to events from the compute processor (user output queues, kernel/system events), from the network FIFOs (send FIFO nearly empty, receive FIFO nearly full), and from the DMA engines (send DMA, receive DMA, DMA done).
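Pulling the event sources in the figure together, a skeletal C dispatch loop for the message processor; every predicate and handler here is an assumed stand-in for real hardware status bits and service routines:

```c
#include <stdbool.h>

/* Assumed event-source predicates and service routines, standing in
 * for real hardware status bits and handler code on the MP. */
bool user_output_queue_nonempty(void);  void handle_user_output(void);
bool system_event_pending(void);        void handle_system_event(void);
bool send_fifo_nearly_empty(void);      void refill_send_fifo(void);
bool recv_fifo_nearly_full(void);       void drain_recv_fifo(void);
bool dma_done(void);                    void complete_dma(void);

/* The message processor spends its life in a dispatch loop, servicing
 * whichever event source needs attention next. */
void dispatcher(void) {
    for (;;) {
        if (recv_fifo_nearly_full())        drain_recv_fifo();   /* inbound first */
        if (dma_done())                     complete_dma();
        if (send_fifo_nearly_empty())       refill_send_fifo();
        if (user_output_queue_nonempty())   handle_user_output();
        if (system_event_pending())         handle_system_event();
    }
}
```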

EECC756 - Shaaban #32 lec # 12 Spring 2000
Message Processor Assessment
Concurrency intensive:
  – Need to keep inbound flows moving while outbound flows are stalled.
  – Large transfers are segmented; this reduces overhead but adds latency (see the sketch below).
Figure: as on the previous slide, with the addition of user input queues and the virtual address space (VAS) visible to the message processor's dispatcher.
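A minimal C sketch of the segmentation point above: a large transfer is broken into fixed-size packets, and inbound traffic is serviced whenever the outbound path stalls. The 2048-byte segment size echoes the Paragon slide; send_segment() and service_inbound() are hypothetical stubs:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define MAX_SEGMENT 2048   /* bytes per packet (Paragon-style); an assumption */

/* Hypothetical stubs standing in for the real send path and the
 * inbound service routine on the message processor. */
static bool send_segment(uint32_t dest, const uint8_t *data, size_t len) {
    (void)dest; (void)data; (void)len;
    return true;                      /* pretend the segment was accepted */
}
static void service_inbound(void) {
    /* drain the receive FIFO, complete receive DMA, etc. */
}

/* Segment a large outbound transfer so inbound flows keep moving
 * even while the outbound path is stalled. */
void send_large(uint32_t dest, const uint8_t *data, size_t len) {
    size_t off = 0;
    while (off < len) {
        size_t chunk = (len - off > MAX_SEGMENT) ? MAX_SEGMENT : (len - off);
        while (!send_segment(dest, data + off, chunk))
            service_inbound();        /* outbound stalled: service inbound work   */
        off += chunk;
        service_inbound();            /* interleave inbound work between segments */
    }
}
```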