General Purpose Node-to-Network Interface in Scalable Multiprocessors
CS 258, Spring 99
David E. Culler, Computer Science Division, U.C. Berkeley

Slide 2: *T: Network Co-Processor

Slide 3: iWARP: Systolic Computation
- Nodes integrate communication with computation on a systolic basis
- Message data moves directly to registers
- Streams into memory
(Figure: host and network interface unit.)

Slide 4: Dedicated Message Processor
- General purpose processor performs arbitrary output processing (at system level)
- General purpose processor interprets incoming network transactions (at system level)
- User processor and message processor share memory
- Message processor to message processor via system network transactions
(Figure: two nodes, each with memory, compute processor P, message processor MP, and NI, split into user and system halves, connected by the network.)

Slide 5: Levels of Network Transaction
- User processor stores cmd / msg / data into a shared output queue
  - must still check for output queue full (or make it elastic); see the sketch below
- Communication assists make the transaction happen
  - checking, translation, scheduling, transport, interpretation
- Effect observed on destination address space and/or events
- Protocol divided between the two layers
(Figure: source and destination nodes, each with memory, P, MP, and NI, user/system split, joined by the network.)
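To make the first bullet concrete, here is a minimal sketch of a user-level post into a shared output queue with the mandatory full check. It assumes a cache-coherent region shared with the communication assist; the names (output_queue, oq_post, OQ_SLOTS) and sizes are illustrative, not taken from any machine in these slides.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define OQ_SLOTS    64      /* power of two, so wraparound is cheap */
#define OQ_PAYLOAD  48      /* small fixed-size message body */

typedef struct {
    unsigned dest;                  /* destination node */
    unsigned len;
    char     payload[OQ_PAYLOAD];
} oq_entry;

typedef struct {
    _Atomic unsigned head;          /* advanced by the communication assist */
    _Atomic unsigned tail;          /* advanced by the user processor */
    oq_entry slot[OQ_SLOTS];
} output_queue;

/* Post one transaction; returns false when the queue is full, so the
 * caller must retry or back off instead of overwriting live entries. */
bool oq_post(output_queue *q, unsigned dest, const void *msg, unsigned len)
{
    if (len > OQ_PAYLOAD)
        return false;
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == OQ_SLOTS)
        return false;               /* output queue full */
    oq_entry *e = &q->slot[tail % OQ_SLOTS];
    e->dest = dest;
    e->len  = len;
    memcpy(e->payload, msg, len);
    /* Release ordering: the assist must see the payload before the tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```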

Slide 6: Example: Intel Paragon

Slide 7: User Level Abstraction (Lok Liu)
- Any user process can post a transaction for any other in its protection domain
  - communication layer moves OQ_src -> IQ_dest
  - may involve indirection: VAS_src -> VAS_dest
(Figure: four processes, each with an output queue OQ, input queue IQ, and virtual address space VAS.)

Slide 8: Msg Processor Events
(Figure: event dispatcher fed by user output queues, send FIFO ~empty, receive FIFO ~full, send DMA, receive DMA, DMA done, and compute processor kernel system events.)

Slide 9: Basic Implementation Costs: Scalar
- Cache-to-cache transfer (two 32B lines, quad-word ops)
  - producer: read(miss,S), chk, write(S,WT), write(I,WT), write(S,WT)
  - consumer: read(miss,S), chk, read(H), read(miss,S), read(H), write(S,WT)
- To NI FIFO: read status, chk, write, ...
- From NI FIFO: read status, chk, dispatch, read, read, ...
(Figure: pipeline CP -> user OQ -> MP -> registers/cache -> net FIFO -> net -> net FIFO -> MP -> user IQ -> CP; annotations include 5.4 µs and 10.5 µs stage times, 7-word messages, and a network term of ... ns + H*40 ns.)
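The cost breakdown above is counting coherence traffic for exactly this kind of software pattern: a small message guarded by a flag word, with producer and consumer each spinning and checking. A hedged sketch of the software side only; the actual miss/hit sequence depends on the coherence protocol, and the sizes here are illustrative.

```c
#include <stdatomic.h>

/* One message buffer: 7 data words plus a flag word, roughly the
 * two-cache-line transfer counted in the slide. */
typedef struct {
    long data[7];
    _Atomic long full;      /* 0 = empty, 1 = full */
} msg_buf;

void produce(msg_buf *b, const long src[7])
{
    while (atomic_load_explicit(&b->full, memory_order_acquire))
        ;                   /* wait until the consumer has drained it */
    for (int i = 0; i < 7; i++)
        b->data[i] = src[i];
    atomic_store_explicit(&b->full, 1, memory_order_release);
}

void consume(msg_buf *b, long dst[7])
{
    while (!atomic_load_explicit(&b->full, memory_order_acquire))
        ;                   /* wait until the producer has filled it */
    for (int i = 0; i < 7; i++)
        dst[i] = b->data[i];
    atomic_store_explicit(&b->full, 0, memory_order_release);
}
```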

Slide 10: Virtual DMA -> Virtual DMA
- Send MP segments the transfer into 8K pages and does VA -> PA translation
- Receive MP reassembles, does dispatch and VA -> PA per page
(Figure: pipeline CP -> user OQ -> MP -> net FIFO -> net -> net FIFO -> MP -> user IQ -> CP through memory, with send DMA (sDMA), header, and receive DMA (rDMA) stages; annotations include bandwidths of 175 MB/s and 400 MB/s.)
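The send-side segmentation step can be pictured as the following loop: split the virtual buffer at 8 KB page boundaries and translate each piece before handing it to DMA. Here va_to_pa() and dma_send() are hypothetical stand-ins for the message processor's translation and DMA primitives.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192u

extern uintptr_t va_to_pa(uintptr_t va);                      /* hypothetical */
extern void dma_send(unsigned dest, uintptr_t pa, size_t n);  /* hypothetical */

void send_virtual(unsigned dest, uintptr_t va, size_t len)
{
    while (len > 0) {
        /* Never cross a page boundary in one DMA: a translation is
         * only valid within a single 8 KB page. */
        size_t room  = PAGE_SIZE - (va & (PAGE_SIZE - 1));
        size_t chunk = len < room ? len : room;
        dma_send(dest, va_to_pa(va), chunk);
        va  += chunk;
        len -= chunk;
    }
}
```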

Slide 11: Single Page Transfer Rate
- Actual buffer size: 2048
- Effective buffer size: 3232

Slide 12: Msg Processor Assessment
- Concurrency intensive
  - need to keep inbound flows moving while outbound flows are stalled
  - large transfers segmented
- Reduces overhead but adds latency
(Figure: dispatcher with user output and input queues, send/receive FIFOs and DMA, VAS, and compute processor kernel system events.)

Slide 13: Case Study: Meiko CS2 Concept
- Circuit-switched network transaction
  - source-destination circuit held open for request-response
  - limited command set executed directly on the NI
- Dedicated communication processor for each step in the flow

Slide 14: Case Study: Meiko CS2 Organization

Slide 15: Shared Physical Address Space
- NI emulates a memory controller at the source
- NI emulates a processor at the destination
  - must be deadlock free

Slide 16: Case Study: Cray T3D
- Build up information in a 'shell'
- Remote memory operations encoded in the address
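A toy sketch of "remote operations encoded in the address": pack a node number into the high bits of a pointer so that an ordinary load or store is routed to that node's memory by the shell hardware. The bit layout below (NODE_SHIFT) is invented for illustration and is not the actual T3D encoding.

```c
#include <stdint.h>

#define NODE_SHIFT 32          /* hypothetical: node id lives in bits 32 and up */

static inline volatile uint64_t *remote_addr(unsigned node, uint64_t local_off)
{
    return (volatile uint64_t *)(((uint64_t)node << NODE_SHIFT) | local_off);
}

/* A plain store now becomes a remote write: the interface hardware
 * extracts the node number from the address and ships the operation. */
static inline void remote_store(unsigned node, uint64_t off, uint64_t v)
{
    *remote_addr(node, off) = v;
}
```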

Slide 17: Case Study: NOW
- General purpose processor embedded in the NIC

Slide 18: Message Time Breakdown
- Communication pipeline

Slide 19: Message Time Comparison

Slide 20: SAS Time Comparison

Slide 21: Message-Passing Time vs. Size

Slide 22: Message-Passing Bandwidth vs. Size

Slide 23: Application Performance on LU

Slide 24: Working Sets Change with P
- 8-fold reduction in miss rate from 4 to 8 processors

Slide 25: Application Performance on BT

Slide 26: NAS Communication Scaling
(Figure: normalized messages per processor and average message size.)

Slide 27: NAS Communication Scaling: Volume

Slide 28: Communication Characteristics: BT

Slide 29: Beware Average BW Analysis

Slide 30: Reflective Memory
- Writes to a local region are reflected to remote nodes
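In code, the usage model is just plain stores into a mapped transmit region, with remote readers polling the reflected copy. The mapping calls below (map_tx_region, map_rx_region) are hypothetical placeholders for whatever setup API a real reflective-memory system provides, and the sketch assumes writes are reflected in order so a flag written last arrives last.

```c
#include <stdint.h>

extern volatile uint32_t *map_tx_region(unsigned offset);  /* hypothetical */
extern volatile uint32_t *map_rx_region(unsigned offset);  /* hypothetical */

void sender(void)
{
    volatile uint32_t *tx = map_tx_region(0x1000);
    tx[0] = 42;            /* data word */
    tx[1] = 1;             /* flag word: written last */
}

void receiver(void)
{
    volatile uint32_t *rx = map_rx_region(0x1000);
    while (rx[1] != 1)
        ;                  /* spin until the reflected flag arrives */
    uint32_t v = rx[0];    /* data is valid once the flag is visible */
    (void)v;
}
```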

Slide 31: Case Study: DEC Memory Channel
- See also Shrimp

Slide 32: Scalable Synchronization Operations
- Messages: point-to-point synchronization
- Build all-to-all as trees
- Recall: sophisticated locks reduced contention by spinning on separate locations
  - caching brought them local
  - test&test&set, ticket lock, array lock
    - O(p) space
- Problem: with the array lock, a waiter's location is determined by arrival order => not likely to be local
- Solution: queue lock
  - build a distributed linked list; each waiter spins on its own local node (see the sketch after the next slide)

Slide 33: Queue Locks
- Head holds the lock; each node points to the next waiter
- Shared pointer to the tail
- Acquire
  - swap (fetch&store) tail with own node address, chain in behind the predecessor
- Release
  - signal the next waiter
  - compare&swap plus check to reset the tail
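Below is a minimal MCS-style queue lock, the distributed linked-list scheme these two slides describe: acquire swaps (fetch&store) itself in as the tail and spins on a flag in its own node; release signals the next waiter, or uses compare&swap to reset the tail when no successor is visible yet. A sketch in C11 atomics, not any particular machine's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    _Atomic bool locked;
} mcs_node;

typedef struct {
    _Atomic(mcs_node *) tail;
} mcs_lock;

void mcs_acquire(mcs_lock *l, mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* fetch&store: swap ourselves in as the new tail. */
    mcs_node *prev = atomic_exchange(&l->tail, me);
    if (prev) {
        /* Chain in behind the predecessor, then spin locally. */
        atomic_store(&prev->next, me);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

void mcs_release(mcs_lock *l, mcs_node *me)
{
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        /* No visible successor: try to reset the tail to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* A waiter is mid-enqueue; wait for its next-pointer to land. */
        while (!(succ = atomic_load(&me->next)))
            ;
    }
    /* Signal the next waiter; it spins on its own node, so this write
     * touches only one remote location. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```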

Slide 34: Parallel Prefix: Upward Sweep
- Generalization of barrier (reduce-broadcast)
- Compute S_i = X_i ⊕ X_(i-1) ⊕ ... ⊕ X_0, for i = 0, 1, ...
- Combine children; store the least significant child's value

Slide 35: Downward Sweep of Parallel Prefix
- Least significant branch sends to the most significant child
- When a value is received from above:
  - send it to the least significant child
  - combine it with the stored value and send the result to the most significant child
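The two sweeps are easiest to see in the flat-array form of the same algorithm. Here is a sequential sketch using + for the associative operator ⊕: the up-sweep combines children up the implicit tree, storing partial sums at internal nodes, and the down-sweep pushes values back down, passing the incoming value to the least significant child and the combination with the stored value to the most significant child. It computes the exclusive prefix, and n is assumed to be a power of two.

```c
#include <stdio.h>

void prefix_sums(long x[], int n)      /* exclusive scan, in place */
{
    /* Up-sweep: each internal node combines its two children. */
    for (int d = 1; d < n; d <<= 1)
        for (int i = 2*d - 1; i < n; i += 2*d)
            x[i] += x[i - d];

    /* Down-sweep: the root gets the identity, then each node passes
     * its value down to the left child and the combination with the
     * stored left-subtree sum to the right child. */
    x[n - 1] = 0;
    for (int d = n >> 1; d >= 1; d >>= 1)
        for (int i = 2*d - 1; i < n; i += 2*d) {
            long left = x[i - d];
            x[i - d]  = x[i];          /* pass down unchanged */
            x[i]     += left;          /* combine with stored value */
        }
}

int main(void)
{
    long x[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    prefix_sums(x, 8);
    for (int i = 0; i < 8; i++)
        printf("%ld ", x[i]);          /* prints: 0 3 4 11 11 15 16 22 */
    printf("\n");
    return 0;
}
```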