Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.



Slide 2 (3/10/99, CS258 S99): Recap: Common Challenges
Input buffer overflow
– N-1 queue over-commitment => must slow sources
– reserve space per source (credit)
» when available for reuse? Ack or higher level
– refuse input when full
» backpressure in reliable network
» tree saturation
» deadlock free
» what happens to traffic not bound for congested dest?
– reserve ack back channel
– drop packets
– utilize higher-level semantics of programming model
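The per-source credit scheme on this slide can be sketched as follows. This is a minimal illustration, assuming one credit per reserved input-buffer slot and an explicit ack that returns the credit to the source. The names `Receiver`, `CreditSender`, `try_send` and `on_ack` are hypothetical and do not come from the lecture.

```python
from collections import deque


class Receiver:
    """Input buffer with `credits_per_source` slots reserved for each source."""

    def __init__(self, num_sources, credits_per_source):
        self.capacity = num_sources * credits_per_source
        self.buffer = deque()

    def accept(self, src, payload):
        # With correct credit accounting this check never fires.
        if len(self.buffer) >= self.capacity:
            raise OverflowError("input buffer overflow")
        self.buffer.append((src, payload))

    def consume(self):
        # Frees one slot; the caller then acks the source to return a credit.
        return self.buffer.popleft()


class CreditSender:
    """Source that may only send while it holds at least one credit."""

    def __init__(self, name, credits):
        self.name = name
        self.credits = credits

    def try_send(self, receiver, payload):
        if self.credits == 0:
            return False  # source is slowed until an ack arrives
        self.credits -= 1
        receiver.accept(self.name, payload)
        return True

    def on_ack(self):
        self.credits += 1
```

A sender holding two credits can send twice, is then refused, and can send again only after the receiver consumes a message and the ack returns a credit.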

Slide 3: Recap: Challenges (cont)
Fetch deadlock
– for the network to remain deadlock free, nodes must continue accepting messages, even when they cannot source msgs
– what if the incoming transaction is a request?
» each may generate a response, which cannot be sent!
» what happens when internal buffering is full?
Logically independent request/reply networks
– physical networks
– virtual channels with separate input/output queues
Bound requests and reserve input buffer space
– K(P-1) requests + K responses per node
– service discipline to avoid fetch deadlock?
NACK on input buffer full
– NACK delivery?
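The buffer bound on this slide can be worked as a small calculation. The sketch below assumes P is the number of nodes and K is the maximum number of requests any one node may have outstanding; that reading of the symbols is an assumption, and the function name `reserved_input_slots` is hypothetical.

```python
def reserved_input_slots(num_nodes, max_outstanding):
    """Input-buffer slots one node reserves under the K(P-1) + K bound."""
    p, k = num_nodes, max_outstanding
    requests = k * (p - 1)  # up to K requests from each of the other P-1 nodes
    responses = k           # one response per request this node has outstanding
    return requests + responses
```

Under these assumptions a 64-node machine with K = 4 reserves 4 * 63 + 4 = 256 slots per node, so the reservation grows linearly with machine size.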

Slide 4: Network Transaction Processing
Key design issue: how much interpretation of the message? How much dedicated processing in the Comm. Assist?
[Figure: node architecture with processor (P), memory (M) and communication assist (CA) per node, attached to a scalable network that carries messages]
Output processing
– checks
– translation
– formatting
– scheduling
Input processing
– checks
– translation
– buffering
– action

Slide 5: Spectrum of Designs
None: physical bit stream
– blind, physical DMA: nCUBE, iPSC, ...
User/System
– user-level port: CM-5, *T
– user-level handler: J-Machine, Monsoon, ...
Remote virtual address
– processing, translation: Paragon, Meiko CS-2
Global physical address
– proc + memory controller: RP3, BBN, T3D
Cache-to-cache
– cache controller: Dash, KSR, Flash
Down the list: increasing HW support, specialization, intrusiveness, performance (???)

Slide 6: Net Transactions: Physical DMA
DMA controlled by regs, generates interrupts
Physical => OS initiates transfers
Send side
– construct system “envelope” around user data in kernel area
Receive
– must receive into system buffer, since no interpretation in CA
[Figure labels: sender, auth, dest addr]

Slide 7: nCUBE Network Interface
Independent DMA channel per link direction
– leave input buffers always open
– segmented messages
Routing interprets envelope
– dimension-order routing on hypercube
– bit-serial with 36 bit cut-through
Overheads
– Os: 16 ins, 260 cy, 13 us
– Or: 18 ins, 200 cy, 15 us (includes interrupt)
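Dimension-order routing on a hypercube, as named on this slide, can be sketched in a few lines. The sketch assumes node addresses are `dims`-bit integers and that dimensions are corrected from the lowest bit upward; the slide does not state the order, so that choice is an assumption. The function name `ecube_route` is hypothetical.

```python
def ecube_route(src, dst, dims):
    """Return the list of nodes visited when routing src -> dst."""
    path = [src]
    node = src
    for d in range(dims):
        # Cross dimension d only if the current node and dst differ in that bit.
        if (node ^ dst) & (1 << d):
            node ^= 1 << d
            path.append(node)
    return path
```

For example, in a 3-dimensional cube the route from node 000 to node 101 crosses dimension 0 and then dimension 2, visiting nodes 0, 1 and 5. Because every message corrects dimensions in the same fixed order, the number of hops equals the number of differing address bits.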

Slide 8: Conventional LAN NI
[Figure: host memory holds TX and RX chains of descriptors, each with Addr, Len, Status and Next fields, pointing at data buffers; the NIC (controller, DMA addr/len registers, trncv) sits on the IO bus, and the processor reaches host memory over the mem bus]
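The descriptor chain in this figure (Addr, Len, Status, Next) can be sketched as a small transmit ring. This is an illustration under assumptions: the ownership values `HOST_OWNED` and `NIC_OWNED`, the class `TxRing` and its methods are hypothetical, and a real NIC would fetch descriptors by DMA rather than through a method call.

```python
from dataclasses import dataclass

HOST_OWNED = 0  # assumed status value: host may fill this descriptor
NIC_OWNED = 1   # assumed status value: NIC may transmit this descriptor


@dataclass
class Descriptor:
    addr: int    # address of the data buffer in host memory
    length: int  # number of bytes in the buffer
    status: int  # who currently owns the descriptor
    next: int    # index of the next descriptor in the chain


class TxRing:
    def __init__(self, size):
        self.ring = [Descriptor(0, 0, HOST_OWNED, (i + 1) % size)
                     for i in range(size)]
        self.head = 0  # next descriptor the host fills
        self.tail = 0  # next descriptor the NIC transmits

    def post(self, addr, length):
        """Host side: hand one buffer to the NIC, or report a full ring."""
        d = self.ring[self.head]
        if d.status != HOST_OWNED:
            return False
        d.addr, d.length, d.status = addr, length, NIC_OWNED
        self.head = d.next
        return True

    def nic_transmit_one(self):
        """NIC side: take the next posted buffer and return the descriptor."""
        d = self.ring[self.tail]
        if d.status != NIC_OWNED:
            return None
        sent = (d.addr, d.length)
        d.status = HOST_OWNED
        self.tail = d.next
        return sent
```

The Status field is what lets host and NIC share the chain without locking: each side only touches descriptors it currently owns.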

Slide 9: User Level Ports
Initiate transaction at user level
Deliver to user without OS intervention
Network port in user space
User/system flag in envelope
– protection check, translation, routing, media access in src CA
– user/sys check in dest CA, interrupt on system

Slide 10: User Level Network Ports
Appears to user as logical message queues plus status
What happens if no user pop?
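The "logical message queues plus status" view can be sketched as below. The class `UserPort` and its methods are hypothetical. The sketch also shows one possible answer to the question on the slide: if the user never pops, the input queue fills and the port refuses further deliveries, so the stall is pushed back into the network.

```python
from collections import deque


class UserPort:
    """A network port in user space: two bounded queues plus status."""

    def __init__(self, depth):
        self.depth = depth
        self.out_q = deque()
        self.in_q = deque()

    # Status the user polls before touching the queues.
    def send_ok(self):
        return len(self.out_q) < self.depth

    def recv_ready(self):
        return len(self.in_q) > 0

    def push(self, msg):
        """User-level send: enqueue if the output queue has room."""
        if not self.send_ok():
            return False
        self.out_q.append(msg)
        return True

    def pop(self):
        """User-level receive: dequeue the oldest message, if any."""
        return self.in_q.popleft() if self.recv_ready() else None

    def deliver(self, msg):
        """Network side: refuse the message when the input queue is full."""
        if len(self.in_q) >= self.depth:
            return False
        self.in_q.append(msg)
        return True
```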

Slide 11: Example: CM-5
Input and output FIFO for each network
2 data networks
Tag per message
– index NI mapping table
Context switching?
*T integrated NI on chip
iWARP also
Overheads
– Os: 50 cy, 1.5 us
– Or: 53 cy, 1.6 us
– interrupt: 10 us

Slide 12: User Level Handlers
Hardware support to vector to address specified in message
– message ports in registers
[Figure: two nodes (P, Mem); the message carries a user/system flag, Dest, Data and Address]
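Vectoring to a handler named in the message can be sketched as a table lookup followed by a call. In hardware the message would carry a handler address; this sketch substitutes an index into a `handlers` table. The message layout and the two example handlers are hypothetical.

```python
def dispatch(message, handlers, memory):
    """Vector to the handler named in the message and run it."""
    handler = handlers[message["handler"]]
    return handler(memory, message["address"], message["data"])


def store_handler(memory, address, data):
    # Deposit the message data at the given address.
    memory[address] = data
    return None


def fetch_add_handler(memory, address, data):
    # Add the message data to the word at address; return the old value.
    old = memory.get(address, 0)
    memory[address] = old + data
    return old
```

Here `memory` is a plain dictionary standing in for node memory, and `handlers` maps handler identifiers to functions, for example `{0: store_handler, 1: fetch_add_handler}`.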

Slide 13: J-Machine: Msg-Driven Processor
Each node a small msg-driven processor
HW support to queue msgs and dispatch to msg handler task

Slide 14: Monsoon Explicit Token-Store

Slide 15: *T: Network Co-Processor

Slide 16: iWARP: Systolic Computation
Nodes integrate communication with computation on systolic basis
Msg data direct to register
Stream into memory
[Figure labels: interface unit, host]

Slide 17: Dedicated processing without dedicated hardware design

Slide 18: Dedicated Message Processor
General-purpose processor performs arbitrary output processing (at system level)
General-purpose processor interprets incoming network transactions (at system level)
User processor and msg processor share memory
Msg processor to msg processor: via system network transaction
[Figure: two nodes on the network, each with Mem, NI, a user-level processor (P) and a system-level message processor (MP); the message is addressed to dest]

Slide 19: Levels of Network Transaction
User processor stores cmd / msg / data into shared output queue
– must still check for output queue full (or make elastic)
Communication assists make transaction happen
– checking, translation, scheduling, transport, interpretation
Effect observed on destination address space and/or events
Protocol divided between two layers
[Figure: two nodes on the network, each with Mem, NI, a user-level processor (P) and a system-level message processor (MP); the message is addressed to dest]

Slide 20: Example: Intel Paragon

Slide 21: User Level Abstraction (Lok Liu)
Any user process can post a transaction for any other in protection domain
– communication layer moves OQ src –> IQ dest
– may involve indirection: VAS src –> VAS dest
[Figure: four processes, each with an output queue (OQ), input queue (IQ) and virtual address space (VAS)]

Slide 22: Msg Processor Events
[Figure: a dispatcher fed by these event sources: user output queues, send FIFO ~empty, rcv FIFO ~full, send DMA, rcv DMA, DMA done, compute processor kernel, system event]

Slide 23: Basic Implementation Costs: Scalar
Cache-to-cache transfer (two 32B lines, quad word ops)
– producer: read(miss,S), chk, write(S,WT), write(I,WT), write(S,WT)
– consumer: read(miss,S), chk, read(H), read(miss,S), read(H), write(S,WT)
To NI FIFO: read status, chk, write, ...
From NI FIFO: read status, chk, dispatch, read, read, ...
[Figure: message path through CP, User OQ, MP, Net FIFO, Net, MP, User IQ and CP, with registers and cache; surviving annotations (some values lost in extraction): 5.4 µs, 10.5 µs, 7 wds, ns + H*40ns]

Slide 24: Virtual DMA -> Virtual DMA
Send MP segments into 8K pages and does VA –> PA
Recv MP reassembles, does dispatch and VA –> PA per page
[Figure: message path through CP, User OQ, MP, Net FIFO, Net, MP, User IQ and CP, with registers, cache, memory, sDMA, hdr and rDMA; surviving annotations (some values lost in extraction): wds, 222, MB/s, 175 MB/s, 400 MB/s]
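The send-side step on this slide, segmenting a transfer at 8K page boundaries and translating each piece from virtual to physical address, can be sketched as follows. The page table is modeled as a dictionary from virtual page number to physical frame number; that representation and the function name `segment` are assumptions made for illustration.

```python
PAGE_SIZE = 8 * 1024  # 8K pages, as on the slide


def segment(va, length, page_table):
    """Split [va, va + length) into (pa, nbytes) chunks, one per page touched."""
    chunks = []
    while length > 0:
        offset = va % PAGE_SIZE
        nbytes = min(length, PAGE_SIZE - offset)  # stop at the page boundary
        frame = page_table[va // PAGE_SIZE]       # VA -> PA for this page
        chunks.append((frame * PAGE_SIZE + offset, nbytes))
        va += nbytes
        length -= nbytes
    return chunks
```

For example, with virtual pages 0 and 1 mapped to frames 7 and 3, a 400-byte transfer starting at virtual address 8000 splits into 192 bytes at physical address 65344 and 208 bytes at physical address 24576. Each chunk can then be handed to the DMA engine separately, since consecutive virtual pages need not be physically contiguous.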

Slide 25: Single Page Transfer Rate
Actual buffer size: 2048
Effective buffer size: 3232

Slide 26: Msg Processor Assessment
Concurrency intensive
– need to keep inbound flows moving while outbound flows stalled
– large transfers segmented
Reduces overhead but adds latency
[Figure: a dispatcher fed by user output queues, send FIFO ~empty, rcv FIFO ~full, send DMA, rcv DMA, DMA done, compute processor kernel and system event, with user input queues and VAS]

Slide 27: Case Study: Meiko CS2 Concept
Circuit-switched network transaction
– source-dest circuit held open for request response
– limited cmd set executed directly on NI
Dedicated communication processor for each step in flow

Slide 28: Case Study: Meiko CS2 Organization

Slide 29: Shared Physical Address Space
NI emulates memory controller at source
NI emulates processor at dest
– must be deadlock free
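The two roles of the NI on this slide can be sketched with a toy model. It assumes a global address splits into (home node, offset) by dividing by a fixed per-node memory size, and it replaces the network with a direct method call, so it does not model buffering or the deadlock concern. `NODE_MEM_WORDS`, `Node` and its methods are hypothetical.

```python
NODE_MEM_WORDS = 1024  # assumed words of memory per node


class Node:
    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network  # dict: node id -> Node, stands in for the network
        self.memory = [0] * NODE_MEM_WORDS
        network[node_id] = self

    def load(self, global_addr):
        home, offset = divmod(global_addr, NODE_MEM_WORDS)
        if home == self.node_id:
            return self.memory[offset]
        # Source NI plays memory controller: it accepts the address from
        # the processor and turns the access into a read request to the home.
        return self.network[home].ni_read(offset)

    def store(self, global_addr, value):
        home, offset = divmod(global_addr, NODE_MEM_WORDS)
        if home == self.node_id:
            self.memory[offset] = value
        else:
            self.network[home].ni_write(offset, value)

    # Destination NI plays processor: it performs the access on local
    # memory on behalf of the remote requester.
    def ni_read(self, offset):
        return self.memory[offset]

    def ni_write(self, offset, value):
        self.memory[offset] = value
```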

Slide 30: Case Study: Cray T3D
Build up info in ‘shell’
Remote memory operations encoded in address

Slide 31: Case Study: NOW
General purpose processor embedded in NIC

Slide 32: Message Time Breakdown
Communication pipeline

Slide 33: Message Time Comparison

Slide 34: SAS Time Comparison

Slide 35: Message-Passing Time vs Size

Slide 36: Message-Passing Bandwidth vs Size

Slide 37: Application Performance on LU

Slide 38: Application Performance on BT

Slide 39: Message Profile on BT

Slide 40: Reflective Memory
Writes to local region reflected to remote
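The behaviour described on this slide can be sketched as a region whose writes are copied to every mapped remote buffer. The class `ReflectiveRegion` and its methods are hypothetical, and the copy is done synchronously here, whereas real hardware would forward writes over the interconnect.

```python
class ReflectiveRegion:
    """Local memory region whose writes are reflected to remote copies."""

    def __init__(self, size):
        self.size = size
        self.local = bytearray(size)
        self.remotes = []

    def map_remote(self):
        """Create and return a remote buffer that mirrors later writes."""
        remote = bytearray(self.size)
        self.remotes.append(remote)
        return remote

    def write(self, offset, data):
        end = offset + len(data)
        if offset < 0 or end > self.size:
            raise IndexError("write outside the reflected region")
        self.local[offset:end] = data
        for remote in self.remotes:
            remote[offset:end] = data  # reflect the same bytes remotely
```

Reads are always served from the local copy in this model; only writes generate traffic.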

Slide 41: Case Study: DEC Memory Channel
See also Shrimp