Haoyuan Li, CS 6410, Fall 2009
U-Net: A User-Level Network Interface for Parallel and Distributed Computing ◦ Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels
Active Messages: A Mechanism for Integrated Communication and Computation ◦ Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser
Thorsten von Eicken ◦ Werner Vogels ◦ David E. Culler ◦ Seth Copen Goldstein ◦ Klaus Erik Schauser
Motivation Design Implementation ◦ SBA-100 ◦ SBA-200 Evaluation ◦ Active Messages ◦ Split-C ◦ IP Suite Conclusion
Processing overhead ◦ Network fabric bandwidth vs. software overhead Flexibility ◦ Designing new protocols Small messages ◦ Remote object execution ◦ Cache-maintenance messages ◦ RPC-style client/server architectures Economics ◦ Expensive multiprocessor supercomputers with custom network designs ◦ vs. ◦ Clusters of standard workstations connected by off-the-shelf communication hardware
Provide low-latency communication in a local-area setting Exploit the full network bandwidth even with small messages Facilitate the use of novel communication protocols All built on CHEAP hardware!
Motivation Design Implementation ◦ SBA-100 ◦ SBA-200 Evaluation ◦ Active Messages ◦ Split-C ◦ IP Suite Conclusion
Communication segment Send queue Receive queue Free queue
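A minimal sketch of how an endpoint's pieces could be laid out in C; the field names and descriptor format below are illustrative assumptions, not the actual U-Net/SBA-200 data structures.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor: names a buffer inside the communication
 * segment by offset and length, plus the channel (destination) tag. */
struct unet_desc {
    uint32_t offset;   /* offset of the buffer within the communication segment */
    uint32_t length;   /* number of valid bytes */
    uint16_t channel;  /* channel ID used for multiplexing/demultiplexing */
};

/* Fixed-size circular queue of descriptors, shared with the NI. */
struct unet_queue {
    struct unet_desc *ring;  /* descriptor slots */
    size_t            size;  /* number of slots */
    volatile size_t   head;  /* consumer index */
    volatile size_t   tail;  /* producer index */
};

/* A U-Net endpoint, mapped into the process address space so that
 * sends and receives need no kernel calls. */
struct unet_endpoint {
    void             *segment;      /* pinned communication segment */
    size_t            segment_len;
    struct unet_queue send_q;       /* descriptors of packets to transmit */
    struct unet_queue recv_q;       /* descriptors of packets received */
    struct unet_queue free_q;       /* free buffers the NI may fill */
};
```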
Send and receive packets Multiplexing and demultiplexing of messages Zero-copy vs. true zero-copy Base-Level U-Net Kernel emulation of U-Net Direct-Access U-Net
Send path: the process prepares a packet and places it in the communication segment, then places a descriptor on the send queue; the U-Net NI takes the descriptor from the queue and transfers the packet from memory onto the network (a code sketch of both paths follows the receive path below). Diagram from Itamar Sagi.
Receive path: the U-Net NI receives a message and decides which endpoint it belongs to, takes a free buffer from the free queue, places the message in the communication segment, and places a descriptor on the receive queue; the process takes the descriptor from the receive queue (by polling or on a signal) and reads the message. Diagram from Itamar Sagi.
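Building on the endpoint layout above, a minimal sketch of the send and receive paths as user-level functions; unet_send, unet_poll, and the queue helpers are illustrative assumptions, not the real U-Net interface.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Condensed endpoint types from the earlier sketch. */
struct unet_desc  { uint32_t offset, length; uint16_t channel; };
struct unet_queue { struct unet_desc *ring; size_t size; volatile size_t head, tail; };
struct unet_endpoint {
    char *segment; size_t segment_len;
    struct unet_queue send_q, recv_q, free_q;
};

static bool q_put(struct unet_queue *q, struct unet_desc d) {
    size_t next = (q->tail + 1) % q->size;
    if (next == q->head) return false;          /* queue full */
    q->ring[q->tail] = d;
    q->tail = next;                             /* NI sees the new descriptor */
    return true;
}

static bool q_get(struct unet_queue *q, struct unet_desc *d) {
    if (q->head == q->tail) return false;       /* queue empty */
    *d = q->ring[q->head];
    q->head = (q->head + 1) % q->size;
    return true;
}

/* Send path: copy the payload into the communication segment and post a
 * descriptor on the send queue; the NI drains the queue asynchronously. */
static bool unet_send(struct unet_endpoint *ep, uint16_t channel,
                      uint32_t seg_offset, const void *payload, uint32_t len) {
    memcpy(ep->segment + seg_offset, payload, len);
    struct unet_desc d = { .offset = seg_offset, .length = len, .channel = channel };
    return q_put(&ep->send_q, d);
}

/* Receive path: the NI has already demultiplexed the packet into a free
 * buffer and posted a descriptor; the process polls the receive queue,
 * consumes the data, and returns the buffer via the free queue. */
static bool unet_poll(struct unet_endpoint *ep, void *out, uint32_t *len) {
    struct unet_desc d;
    if (!q_get(&ep->recv_q, &d)) return false;  /* nothing arrived yet */
    memcpy(out, ep->segment + d.offset, d.length);
    *len = d.length;
    q_put(&ep->free_q, d);                      /* recycle the buffer */
    return true;
}
```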
Channel setup and memory allocation via the OS Communication channel IDs Isolation Protection
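A toy sketch of channel-based demultiplexing and protection: a table maps channel IDs to their owning endpoint, and packets on unregistered channels are dropped. The names and the in-process simulation are assumptions for illustration, not the actual U-Net setup path.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_CHANNELS 256

/* Each registered channel is bound to exactly one endpoint (here just an
 * owner ID); packets on unknown channels are rejected (protection). */
struct channel_table {
    int owner[MAX_CHANNELS];   /* -1 = unregistered */
};

static void table_init(struct channel_table *t) {
    for (int i = 0; i < MAX_CHANNELS; i++) t->owner[i] = -1;
}

/* Setup path (would involve the OS): bind a channel ID to an endpoint. */
static int register_channel(struct channel_table *t, uint8_t chan, int endpoint) {
    if (t->owner[chan] != -1) return -1;   /* already taken: isolation */
    t->owner[chan] = endpoint;
    return 0;
}

/* Data path (NI): demultiplex an incoming packet by its channel ID. */
static void deliver(const struct channel_table *t, uint8_t chan) {
    if (t->owner[chan] == -1)
        printf("drop packet on unregistered channel %u\n", chan);
    else
        printf("deliver packet on channel %u to endpoint %d\n", chan, t->owner[chan]);
}

int main(void) {
    struct channel_table t;
    table_init(&t);
    register_channel(&t, 5, 0);   /* endpoint 0 owns channel 5 */
    deliver(&t, 5);               /* delivered */
    deliver(&t, 9);               /* dropped: never registered */
    return 0;
}
```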
True zero-copy: no intermediate buffering ◦ Direct-Access U-Net The communication segment spans the entire process address space The sender specifies the offset where data is to be deposited Zero-copy: one intermediate copy into a networking buffer ◦ Base-Level U-Net Communication segments are allocated and pinned in physical memory Optimization for small messages ◦ Kernel emulation of U-Net Used when resources for communication segments and message queues are scarce
Motivation Design Implementation ◦ SBA-100 ◦ SBA-200 Evaluation ◦ Active Messages ◦ Split-C ◦ IP Suite Conclusion
SPARCstations running SunOS Fore SBA-100 and SBA-200 ATM interfaces by FORE Systems (now part of Ericsson) AAL5
On-board processor DMA-capable AAL5 CRC generator Firmware modified to implement the U-Net NI on the on-board processor
Motivation Design Implementation ◦ SBA-100 ◦ SBA-200 Evaluation ◦ Active Messages ◦ Split-C ◦ IP Suite Conclusion
Active Messages ◦ A mechanism that allows efficient overlapping of communication with computation in multiprocessors Implementation of the GAM (Generic Active Messages) specification over U-Net
Split-C based on UAM (U-Net Active Messages) vs. CM-5 and Meiko CS-2
Block matrix multiply Sample sort (2 versions) Radix sort (2 versions) Connected component algorithm Conjugate gradient solver
Plots: TCP max bandwidth and UDP max bandwidth
Motivation Design Implementation ◦ SBA-100 ◦ SBA-200 Evaluation ◦ Active Messages ◦ Split-C ◦ IP Suite Conclusion
U-Net's main objectives are achieved: ◦ Provides efficient, low-latency communication ◦ Offers a high degree of flexibility U-Net round-trip latency for messages smaller than 40 bytes: a win! U-Net's flexibility shows in good performance for the TCP and UDP protocols
Key challenges in large-scale multiprocessor design Active messages Message-passing architectures Message-driven architectures Potential hardware support Conclusions
Minimize communication overhead Allow communication to overlap computation Coordinate the two above without sacrificing processor cost/performance
Key challenges in large-scale multiprocessor design Active messages Message-passing architectures Message-driven architectures Potential hardware support Conclusions
A mechanism for sending messages ◦ The message header contains the address of a handler (an instruction address) ◦ The handler extracts the message, cannot block, and does no computing ◦ No buffering is needed A simple interface that matches the hardware Allows computation and communication to overlap
The sender asynchronously sends a message to a receiver without blocking its computation The receiver pulls the message out of the network and integrates it into the computation through a handler ◦ The handler executes without blocking ◦ The handler delivers data to the ongoing computation but does not perform computation itself
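A minimal sketch of the active-message idea in C: the head of each message names the handler to run, and the receive loop simply invokes it on the payload. The message layout and names are illustrative assumptions (not the GAM or CM-5 interface); on a real network the header would carry a handler index or code address valid on the receiving node, so the raw function pointer here only works because everything runs in one process.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* An active message: the header carries the handler to run on arrival,
 * and the body carries its arguments. */
typedef void (*am_handler_t)(const void *args, size_t len);

struct active_msg {
    am_handler_t handler;   /* address of the user-level handler */
    uint8_t      args[32];  /* small, fixed-size argument payload */
    size_t       len;
};

/* Example handler: deposits a received value into the ongoing computation's
 * data structures; it must not block or compute for long. */
static double reply_slot;
static volatile int replies_pending;

static void store_reply(const void *args, size_t len) {
    memcpy(&reply_slot, args, sizeof(double));
    replies_pending--;          /* signal the waiting computation */
    (void)len;
}

/* Receive path: no matching, no buffering beyond the network itself --
 * each message is consumed immediately by running its handler. */
static void am_poll(const struct active_msg *msg) {
    msg->handler(msg->args, msg->len);
}

int main(void) {
    /* Simulate one message arriving "from the network". */
    struct active_msg m = { .handler = store_reply, .len = sizeof(double) };
    double v = 3.14;
    memcpy(m.args, &v, sizeof v);
    replies_pending = 1;

    am_poll(&m);                /* handler runs at message arrival */
    printf("reply = %f (pending = %d)\n", reply_slot, replies_pending);
    return 0;
}
```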
Key challenges in large-scale multiprocessor design Active messages Message-passing architectures Message-driven architectures Potential hardware support Conclusions
3-phase protocol Simple but inefficient No buffering needed
Communication can overlap with computation Buffer space must be allocated throughout the computation
An extension of C for SPMD programs ◦ The global address space is partitioned into local and remote portions ◦ Brings shared-memory benefits to distributed memory ◦ Split-phase access Active Messages serve as the interface for Split-C
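A sketch of how a split-phase remote read could be mapped onto active messages: the request handler runs on the data's owner and fires a reply whose handler stores the value and retires a counter, and sync waits on that counter. The function names and the single-process simulation are assumptions for illustration, not the actual Split-C runtime.

```c
#include <stdio.h>

/* Counter of outstanding split-phase operations. */
static volatile int outstanding;

/* "Remote" data, owned by another node in a real system. */
static double remote_array[4] = { 1.0, 2.0, 3.0, 4.0 };

/* Reply handler: deposit the fetched value and retire the operation. */
static void reply_handler(double *dst, double value) {
    *dst = value;
    outstanding--;
}

/* Request handler: runs on the data's owner, reads the word, and fires
 * the reply active message back to the requester. In this single-process
 * sketch the "network" is just a direct call. */
static void get_request_handler(double *dst, int index) {
    reply_handler(dst, remote_array[index]);
}

/* Split-phase get: issue the request and return immediately. */
static void split_get(double *dst, int index) {
    outstanding++;
    get_request_handler(dst, index);   /* would be an active message send */
}

/* sync(): wait (poll) until all outstanding split-phase ops complete. */
static void sync_all(void) {
    while (outstanding > 0) { /* poll the network, running handlers */ }
}

int main(void) {
    double x, y;
    split_get(&x, 1);          /* like Split-C's  x := remote[1]  */
    split_get(&y, 3);
    /* ...overlap other computation here... */
    sync_all();
    printf("x=%f y=%f\n", x, y);
    return 0;
}
```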
Key challenges in large-scale multiprocessor design Active messages Message-passing architectures Message-driven architectures Potential hardware support Conclusions
Designed to support languages with dynamic parallelism Communication is integrated into the processor Computation is driven by messages, which contain the name of a handler and some data Computation happens within message handlers Messages may be buffered upon receipt ◦ Buffers can grow to any size, depending on the amount of excess parallelism Less locality
Key challenges in large-scale multiprocessor design Active messages Message-passing architectures Message-driven architectures Potential hardware support Conclusions
Network Interface Support ◦ Large messages ◦ Message registers ◦ Reuse of message data ◦ Single network port ◦ Protection ◦ Frequent message accelerators Processor Support ◦ Fast polling ◦ User-level interrupts ◦ PC injection ◦ Dual processors
Key challenges in large-scale multiprocessor design Active messages Message-passing architectures Message-driven architectures Potential hardware support Conclusions
Asynchronous communication No buffering Improved performance Handlers are kept simple