Split-C for the New Millennium Andrew Begel, Phil Buonadonna, David Gay

Introduction
Berkeley’s new Millennium cluster
– 16 2-way Intel 400 MHz PII SMPs
– Myrinet NICs
Virtual Interface Architecture (VIA) user-level network
Active Messages
Split-C
Project Goals
Implement Active Messages over VIA
Implement and measure Split-C over VIA

VI Architecture
[Diagram labels: VI Consumer; Virtual Address Space; RM; VI; Send Q; Recv Q; Descriptor; Status; Send Doorbell; Receive Doorbell; Network Interface Controller]

Active Messages
Paradigm for message-based communication
– Concept: Overlap communication/computation
Implementation
– Two-phase request/reply pairs
– Endpoints: a process's connection to a virtual network
– Bundles: Collection of process endpoints
Operations
– AM_Map(), AM_Request(), AM_Reply(), AM_Poll()
– Credit-based flow-control scheme

AM-VIA Components
VI Queue (VIQ)
– Logical channel for AM message type
– VI & independent Send/Receive Queues
– Independent request credit scheme (counter n)
[Diagram labels: VI; Send; Recv; Dxs (2*k); Dxs (2*k+1); Data (2*k); Data (2*k+1); n < k]

AM-VIA Components
VI Queue (VIQ)
– Logical channel for AM message type
– VI & independent Send/Receive Queues
– Independent request credit scheme (counter n)
MAP Object
– Container for 3 VIQs: Short, Medium, Long
– Single Registered Memory Region
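The per-VIQ credit counter can be sketched in C as follows. This is a minimal model, assuming that n counts requests still awaiting a reply and that a new request is allowed only while n < k, as the slide's diagram label suggests. The type and function names (viq_credits, viq_try_request, viq_on_reply) are hypothetical and are not part of AM-VIA.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one VI Queue's request credits. */
typedef struct {
    int k;  /* credit limit: maximum outstanding requests */
    int n;  /* requests currently awaiting a reply */
} viq_credits;

void viq_init(viq_credits *q, int k) {
    assert(k > 0);
    q->k = k;
    q->n = 0;
}

/* Consume one credit for a new request; permitted only while n < k. */
bool viq_try_request(viq_credits *q) {
    if (q->n >= q->k)
        return false;  /* out of credits: caller must poll for replies first */
    q->n++;
    return true;
}

/* An incoming reply returns one credit to the sender. */
void viq_on_reply(viq_credits *q) {
    assert(q->n > 0);
    q->n--;
}
```

Because each VIQ keeps its own counter, running out of credits on one message type would not block requests on the others in this model.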

AM-VIA Integration
Bundle: Pair of VI Completion Queues
– Send/Receive
Endpoints: Collection of MAP objects
– Virtual network emulated by point-to-point connections
[Diagram labels: Proc A; Proc B; Proc C]

AM-VIA Operations
Map
– Allocates VI and registered memory resources and establishes connections.
Send operations
– Copies data into a free send buffer and posts a descriptor.
Receive operations
– Short/Long messages: copies data and invokes handler
– Medium: invokes handler with pointer to data buffer
Polling
– Request/Reply marshalling
  Empties completion queue into Request/Reply FIFO queues
  Processes a single Request and/or Reply on each iteration
– Recycles send descriptors
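The polling step described above can be sketched as a small single-process model. Everything here is an assumption made for illustration: the ring-buffer fifo type, the endpoint struct, the entry encoding (id * 2 + kind), and the function names are hypothetical, and the real implementation works on VI completion queues and descriptors rather than integers.

```c
#include <assert.h>
#include <stddef.h>

enum msg_kind { MSG_REQUEST = 0, MSG_REPLY = 1 };

#define QCAP 16

/* Fixed-capacity ring buffer of integers. */
typedef struct {
    int items[QCAP];
    size_t head, count;
} fifo;

int fifo_push(fifo *f, int v) {
    if (f->count == QCAP) return 0;
    f->items[(f->head + f->count) % QCAP] = v;
    f->count++;
    return 1;
}

int fifo_pop(fifo *f, int *out) {
    if (f->count == 0) return 0;
    *out = f->items[f->head];
    f->head = (f->head + 1) % QCAP;
    f->count--;
    return 1;
}

typedef struct {
    fifo completions;       /* stand-in for the receive completion queue */
    fifo requests, replies; /* marshalled Request/Reply FIFO queues */
    int requests_handled, replies_handled;
} endpoint;

/* Simulate arrival of a message; id must be non-negative. */
int post_completion(endpoint *ep, enum msg_kind kind, int id) {
    assert(id >= 0);
    return fifo_push(&ep->completions, id * 2 + (int)kind);
}

/* One poll iteration: drain completions into the two FIFOs, then handle
 * at most one request and at most one reply. Returns messages handled. */
int endpoint_poll(endpoint *ep) {
    int entry, id, handled = 0;
    while (fifo_pop(&ep->completions, &entry)) {
        fifo *dst = (entry % 2 == MSG_REPLY) ? &ep->replies : &ep->requests;
        int ok = fifo_push(dst, entry / 2);
        assert(ok);  /* this sketch does not handle FIFO overflow */
        (void)ok;
    }
    if (fifo_pop(&ep->requests, &id)) { ep->requests_handled++; handled++; }
    if (fifo_pop(&ep->replies, &id))  { ep->replies_handled++;  handled++; }
    return handled;
}
```

Handling only one request and one reply per iteration bounds the work done in a single poll, at the cost of leaving a backlog in the FIFOs when traffic is heavy.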

Design Tradeoffs
Logical Channels for Short/Medium/Long messages
– Balances resources (VIs, buffering) and reliability
– Fine-grained credit scheme
– Requires advance knowledge of reply size
– Requires request-reply marshalling upon receipt
Data Copying
– Simplest, most robust means of buffer management
– Zero copy on medium receives requires k+1 buffering
Completion Queue/Bundle
– Straightforward implementation of bundle
– May overflow on high communication volume
– Prevents endpoint migration

Reflections
AMVIA Implementation
– Robust. Works for wide variety of AM applications
– Performance suffers due to subtle architectural differences
VI Architecture shortcomings
– Lack of support for mapping a VI to a user context
– VI Naming complicates IPC on the same host
Active Message shortcomings
– Memory Ownership semantics prevent true zero-copy for medium messages
Both benefit from some direct hardware support
– VIA: Hardware doorbell management
– AM: Distinction of request/reply messages

Split-C
C-based shared address space, parallel language
Distributed memory, explicit global pointers
Split-phase global read/writes:
– l := r and r := l, completed by sync()
– r :- l, completed by store_sync()
[Diagram: a global pointer shown as a (process, address) pair, here (1, 0xdeadbeef), alongside Process 0, Process 1, and an ASCII-art cow]

Implementing Split-C
Split-C implemented as a modified gcc compiler
Split-phase reads, writes translated to library calls
⇒ Just need to implement a library
Essential library calls: get, put, store (for char, int, ...), plus bulk variants, sync, store_sync
Four implementations:
– Split-C over AMVIA
– Split-C over reliable VIA
– Split-C over unreliable VIA
– Split-C over shared memory + AMVIA

Split-C over AMVIA
Establish connection between every pair of processes
Simple requests/replies to implement get, put, store, e.g.:
p0: get(loc, )
    request "get"(1, loc, 0xbeef) → p1
    p0 continues program execution
p1: receive request "get"(…)
    reply "getr"(loc, a-cow) → p0
p0: receive reply "getr"(…)
    store cow at loc
[Diagram labels: AM connection; Process 0; Process 1; Process 2; ASCII-art cows]
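The get exchange above can be mimicked in a single-process C model. This is only a sketch under stated assumptions: both "processes" live in one array, the address half of a global pointer is an array index, and sc_sync() stands in for the "getr" replies arriving. The names global_ptr, sc_get, and sc_sync are hypothetical and are not the Split-C library's actual entry points.

```c
#include <assert.h>
#include <stddef.h>

#define NPROCS 2
#define MEM_WORDS 8
#define MAX_PENDING 8

/* A global pointer names a word by (process, address). */
typedef struct {
    int proc;
    size_t addr;  /* index into that process's simulated memory */
} global_ptr;

/* One outstanding get: where the result goes and where it comes from. */
typedef struct {
    int *loc;
    global_ptr src;
} pending_get;

int memory[NPROCS][MEM_WORDS];  /* simulated per-process memories */
pending_get pending[MAX_PENDING];
size_t n_pending = 0;

/* Split-phase read: record the request and return immediately,
 * so the caller can continue computing. */
void sc_get(int *loc, global_ptr src) {
    assert(n_pending < MAX_PENDING);
    assert(src.proc >= 0 && src.proc < NPROCS && src.addr < MEM_WORDS);
    pending[n_pending].loc = loc;
    pending[n_pending].src = src;
    n_pending++;
}

/* Complete every outstanding get. In this model the owner's reply is
 * simulated by reading its memory and storing the value at loc. */
void sc_sync(void) {
    for (size_t i = 0; i < n_pending; i++)
        *pending[i].loc = memory[pending[i].src.proc][pending[i].src.addr];
    n_pending = 0;
}
```

The value at loc is undefined between sc_get and sc_sync in real split-phase code; here it simply keeps its old contents until the sync.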

Split-C over Reliable VIA
Goal: Reduce send and receive overhead for Split-C operations
Method 1: Specialise AMVIA for Split-C library
– support only short, medium messages
– remove all dynamic dispatch (AM calls, handler dispatch)
– reduce message size
Method 2: Allow reply-free requests (for stores)
– reply to every nth store request, rather than every one
– n = 1/4 of maximum credits
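Method 2 can be illustrated with a receiver-side counter. The sketch below assumes that one reply returns n credits at once, which the slides do not state, and that the maximum credit count is at least 4 so that n is non-zero. The names store_receiver, receiver_init, and receiver_on_store are hypothetical.

```c
#include <assert.h>

/* Hypothetical receiver state for reply-free store requests. */
typedef struct {
    int ack_every;     /* n = max_credits / 4 */
    int unacked;       /* stores received since the last reply */
    int replies_sent;
} store_receiver;

void receiver_init(store_receiver *r, int max_credits) {
    assert(max_credits >= 4);
    r->ack_every = max_credits / 4;
    r->unacked = 0;
    r->replies_sent = 0;
}

/* Handle one store request. Returns the number of credits handed back
 * to the sender: 0 for a reply-free store, n when a reply is sent. */
int receiver_on_store(store_receiver *r) {
    r->unacked++;
    if (r->unacked < r->ack_every)
        return 0;
    r->unacked = 0;
    r->replies_sent++;
    return r->ack_every;  /* assumption: one reply covers n stores */
}
```

With 16 credits, this model sends one reply per four stores, so reply traffic drops to a quarter while the sender can still have up to 16 stores in flight.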

Split-C over Unreliable VIA
Replace request/reply mechanism of Split-C over reliable VIA
Sliding-window + credit-based protocol
Acknowledge processed requests/replies
⇒ reply-free requests handled automatically
Timeouts detected in polling routine (unimplemented)
[Diagram labels: Stores; Process Request; Ack; Process Ack]

Split-C over Shared Memory
How can two processes on the same host communicate?
– Loopback through network
– Multi-Protocol VIA
– Multi-Protocol AM
– Shared Memory Split-C
Each process maps the address space of every other process on the same host into its own.
Heap is allocated with Sys V IPC Shared Memory.
Data segment is mmapped via /proc file system.
Stack is too dynamic to map.
[Diagram: address spaces on host mm4.millennium.berkeley.edu; P1’s address space holds Process 1 local memory and P1’s view of Process 2; P2’s address space holds Process 2 local memory and P2’s view of Process 1]
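The heap-sharing idea can be tried with the standard System V calls shmget, shmat, shmdt, and shmctl. The sketch below is not the presenters' code: it shares a single int, and it uses fork() so the child inherits the attachment, whereas independent Split-C processes would each have to attach the segment themselves. The function name shared_heap_demo is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a shared segment, let a child process write `value` into it,
 * and return what the parent then reads. Returns -1 on any failure. */
int shared_heap_demo(int value) {
    assert(value >= 0);  /* keeps -1 unambiguous as the error result */

    int id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    int *cell = shmat(id, NULL, 0);
    if (cell == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }
    *cell = 0;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        *cell = value;  /* child writes through the inherited mapping */
        _exit(0);
    }
    if (pid > 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid)
            result = *cell;  /* parent sees the child's write */
    }

    shmdt(cell);
    shmctl(id, IPC_RMID, NULL);  /* remove the segment */
    return result;
}
```

A store to another process on the same host then becomes an ordinary memory write, with no message sent at all.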

Split-C Microbenchmarks Split-C Store Performance (Short and Bulk Messages) (smaller numbers are better)

Split-C Application Benchmarks Figure : Split-C application performance (bigger is better)

Reflections
The specialization of the communications layer for Split-C reduced send and receive overhead.
This overhead reduction appears to correlate with increased application performance and scaling.
Sharing a process’s address space should be much easier than it is in Linux.

AM(v2) Architecture
Components
– Endpoints
[Diagram labels: request_hndlr_a(); request_hndlr_b(); reply_hndlr_a(); reply_hndlr_b(); ...; Network]

AM(v2) Architecture
Components
– Endpoints
– Virtual Networks
– Bundles
Operations
– Request / Reply: Short, Med, Long
– Create, Map, Free
– Poll, Wait
Credit-based flow control
[Diagram labels: Proc A; Proc B; Proc C]

Active Messages
Split-phase remote procedure calls
– Concept: Overlap communication/computation
[Diagram: Proc A sends a Request to the Request Handler on Proc B; Proc B sends a Reply to the Reply Handler on Proc A]