REPLICATING FILES AND OTHER BIG OBJECTS “OUT OF BAND” WITH ISIS2 Ken Birman 1 Cornell University.



Core Challenge (slide 2)
- Many cloud computing systems work with very large files or other big objects
- Frequently they take the form of massive byte arrays, and it isn’t at all uncommon to “map” them into memory
- On Linux and Windows the memory-mapped file API makes this easy to do: it takes a file name and returns a pointer to a memory region where you can directly access the bytes of that file

Not long ago, Isis2 wasn’t a good choice for applications with big objects… (slide 3)
- We created the OOB layer because moving big objects inside Isis2 was simply too costly
- You can put big things into messages, and Isis2 carves them into smaller chunks
- But they can seriously disrupt the steady flow in the system
- The issue is that
  - Isis2 needs to maintain FIFO ordering for lower-level communication between group members
  - Hence a big object needs to be fully transferred before small things sent after it can be delivered, even if they were sent by some other thread for some other reason
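The cost of FIFO ordering can be made concrete with a toy model. The sketch below is not Isis2 code: the class name, the method name and the idea of counting "chunk slots" are assumptions made purely for illustration. It counts how many chunk transmissions must finish before each queued message can be delivered when chunks leave strictly in queue order.

```csharp
using System;

// Toy model of a FIFO channel (illustrative only, not the Isis2 transport):
// messages are cut into fixed-size chunks, and chunks leave the sender
// strictly in the order in which their messages were queued.
public static class FifoDelayDemo
{
    // For each message, returns the number of chunk slots that elapse before
    // its last chunk has been sent, i.e. before it can be delivered.
    public static long[] DeliverySlots(long[] messageSizes, long chunkSize)
    {
        var slots = new long[messageSizes.Length];
        long elapsed = 0;
        for (int i = 0; i < messageSizes.Length; i++)
        {
            // Ceiling division: a partial chunk still occupies a full slot.
            long chunks = (messageSizes[i] + chunkSize - 1) / chunkSize;
            elapsed += chunks;
            slots[i] = elapsed;
        }
        return slots;
    }
}
```

With a chunk size of 100 bytes, a 10-byte message queued alone is deliverable after 1 slot, but queued behind a 1000-byte object it has to wait for slot 11. Scale the object up to gigabytes and the small message waits for the whole transfer.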

Out of Band (OOB) Concept (slide 4)
- We added a way to move very big byte[] objects “outside” of the normal Isis2 communication path
- We start by assuming the objects are memory-mapped files (they don’t have to exist on disk at all, but they do have file names that look like the names of disk files)
- You can
  - Create these from a file
  - Create a big mapped memory region and put data in it
- These mapped files can be shared easily within a single computer and are ultra efficient because no copying occurs. Much faster than ANY form of copying!
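The "create these from a file" option can be sketched with the standard .NET System.IO.MemoryMappedFiles API. This is a minimal illustration and does not involve Isis2; the class and method names are assumptions chosen for the example.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MapFromFileDemo
{
    // Writes `data` (which must be non-empty) to `path`, maps the file, and
    // returns the byte at `index` as seen through the mapping.
    public static byte WriteThenReadMapped(string path, byte[] data, long index)
    {
        File.WriteAllBytes(path, data);
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, data.Length, MemoryMappedFileAccess.Read))
        {
            // The accessor reads straight from the mapped pages; the file is
            // not first loaded into a managed byte[].
            return view.ReadByte(index);
        }
    }
}
```

Calling WriteThenReadMapped(path, new byte[] { 10, 20, 30, 40 }, 2) returns 30. Any other program on the same machine that maps the same file sees the same bytes.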

Out of Band (OOB) Concept (slide 5)
[Figure: “Before” and “After” diagrams. Machine A has a big memory object. We want copies on B and C, but not on D.]
Keep in mind that the object is really big, and there may be many such transfers to do, all at the same time.

Out of Band (OOB) Concept (slide 6)
- So…
  - You’ve created a memory-mapped region
  - … and put data into it, somehow
  - … it might be huge (hundreds of megabytes? Or even gigabytes? No problem! But > 6 GB needs a 64-bit O/S)
- Our goal: use Isis2 to efficiently move these from computer to computer in a cloud computing data center or a cluster
- Ideally: a single DMA transfer, or a super-efficient series of Ethernet multicasts

Out of Band (OOB) Concept (slide 7)
[Figure: Machine A has a big memory object X. We want copies on machines B and C, but not on D.]
(1) Tell Isis2 about X using OOBRegister
(2) Tell your application on B and C to fetch X
(3) Applications on B and C call OOBFetch
(3) OOBReReplicate tells Isis2 to modify the replication pattern

Steps (slide 8)
- First you need to tell the Isis2 subsystem that the file exists. There are three cases.
  1. Isis2 could be linked directly to your application code
  2. Isis2 could run in a server that you talk to via RPC, perhaps from native C
  3. We also have a command-line program that can talk to our server for you, so you can access OOB by issuing commands if the server is running
- Isis2 wants to know the file name. In RPC mode the data lives in the mapped memory and isn’t copied to Isis2

Steps (slide 9)
- So… you
  1. Register the memory-mapped file
- Now you can
  1. Form a process group
  2. Replicate data within/among the group members. We call this “rereplicate” because you can do it again and again, changing the replication pattern
  3. On the receiving “side”, fetch a pointer into the memory-mapped file region (this will wait until the data arrives)

Why do we call it “out of band”? (slide 10)
- Often you’ll mix Isis2 RPC and multicast with out-of-band data transfer
  - Register a file, and start transferring it
  - In parallel, tell some group member(s) about it, by name
- In such cases
  - Isis2 carries out the OOB transfer as efficiently as it can
  - The OOBFetch operation in the receiver blocks until the bytes have been correctly received and are available
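The interplay between "tell the receiver the name" and "the fetch blocks until the bytes arrive" can be pictured with a small in-process stand-in. The sketch below is not the Isis2 API: FakeOobStore, Complete and Fetch are hypothetical names, and the "transfer" is just a byte[] handed over inside one process. It only mimics the blocking behaviour described on the slide.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in, NOT the Isis2 API: models only the rule that a
// fetch by name blocks until the transfer of that name has completed.
public sealed class FakeOobStore
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<byte[]>> pending =
        new ConcurrentDictionary<string, TaskCompletionSource<byte[]>>();

    // One completion slot per name, created by whoever asks first.
    private TaskCompletionSource<byte[]> Slot(string name)
    {
        return pending.GetOrAdd(name, _ => new TaskCompletionSource<byte[]>());
    }

    // Called when the (simulated) transfer of `name` has fully arrived.
    public void Complete(string name, byte[] bytes)
    {
        Slot(name).TrySetResult(bytes);
    }

    // Blocks the caller until Complete(name, ...) has been called.
    public byte[] Fetch(string name)
    {
        return Slot(name).Task.Result;
    }
}
```

A receiver can therefore be told the name early, call Fetch right away, and simply wake up when the data is there. The name travels in band while the bytes travel out of band.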

Other options (slide 11)
- You can also register an upcall handler
  - The OOB layer will notify it each time an incoming OOB file has been fully transferred
- And you can ask for the replication map
  - It tells you which group members have which files
- The idea is to be able to rereplicate in a flash, in parallel for multiple files if desired, and as close as possible to the raw hardware speed of the network

OOB interface (slide 12)
- Example: creating a new mapped file

    // (1) Creates a completely new memory-mapped object
    MemoryMappedFile mmf = MemoryMappedFile.CreateNew(fname, CAPACITY);
    // (2) An “Accessor” allows you to access the bytes in the object
    MemoryMappedViewAccessor mva = mmf.CreateViewAccessor();
    // (3) An example of byte-by-byte access
    for (int n = 0; n < CAPACITY; n++)
    {
        byte b = (byte)(n & 0xFF);
        mva.Write(n, ref b);
    }

- You can also open an existing mapped file, if some other program on the same computer created it
- Then call g.OOBRegister(string fname, MemoryMappedFile mmf)
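A self-contained variant of the slide's example, runnable without Isis2, might look like the sketch below. It differs from the slide in one deliberate way: it passes null as the map name to get an unnamed region, because current .NET only supports named maps on Windows. A region intended for OOBRegister needs a name as on the slide. The class and method names are assumptions.

```csharp
using System;
using System.IO.MemoryMappedFiles;

public static class MappedRegionDemo
{
    // Fills a new unnamed mapped region with the pattern (n & 0xFF) and
    // returns the byte read back at position `probe`.
    public static byte FillAndProbe(int capacity, long probe)
    {
        // A null map name creates an unnamed region, which is portable.
        using (var mmf = MemoryMappedFile.CreateNew(null, capacity))
        using (var mva = mmf.CreateViewAccessor(0, capacity))
        {
            for (int n = 0; n < capacity; n++)
            {
                mva.Write(n, (byte)(n & 0xFF));
            }
            return mva.ReadByte(probe);
        }
    }
}
```

FillAndProbe(1024, 300) returns 44, since 300 & 0xFF is 44.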

Now Isis2 knows about the file (slide 13)
- Next we can call ReReplicate:

    g.OOBReReplicate(fname, where);

- fname is the file name. But what goes in “where”?

The “where” argument to ReReplicate (slide 14)
- This should be an object of type List<Address>. For example, given a view v for a group,

    List<Address> everywhere = v.members.ToList();

  creates a list with every group member in it.
- It must list ALL the places where you want replicas. Isis2 will create new replicas and also delete unwanted ones
- To create new replicas before deleting old ones, use two steps
- OOBDelete(fname) is short for OOBReReplicate with an empty replica location list.
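One reading of the "two steps" remark is: first rereplicate to the union of the current and desired locations, then rereplicate again to the desired locations only, so that no old replica disappears before the new ones exist. The sketch below computes those two target lists. It is an interpretation, not something the slides spell out, and it uses a generic element type so that plain strings can stand in for group-member addresses. ReplicaPlan, Step1 and Step2 are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplicaPlan
{
    // Step 1 target: every current holder plus every desired holder, so that
    // nothing is deleted before the new copies have been created.
    public static List<T> Step1<T>(IEnumerable<T> current, IEnumerable<T> desired)
    {
        return current.Union(desired).ToList();
    }

    // Step 2 target: only the desired holders; surplus replicas are dropped.
    public static List<T> Step2<T>(IEnumerable<T> desired)
    {
        return desired.Distinct().ToList();
    }
}
```

With the object currently on A and wanted on B and C, Step1 yields A, B, C and Step2 yields B, C. Each list would be passed as the “where” argument of one rereplication call.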

Now Isis2 knows about the file (slide 15)
- ReReplicate also has a second overload:

    g.OOBReReplicate(fname, where, (Action<string, MemoryMappedFile>)delegate(string oobfname, MemoryMappedFile m)
    {
        IsisSystem.WriteLine("ReReplicate finished for " + oobfname);
    });

- The delegate method will be called by Isis2 when the transfer finishes. The transfer itself runs asynchronously: out of band!

How to access your replica (slide 16)
- You call

    MemoryMappedFile xmmf = g.OOBFetch(fname);

- This call will wait until the ReReplication action finishes (so it is a mistake to call it if you haven’t started one!). That could take a while if the file is big: a 5 GB file on a 10 Gb/s network needs at least 4 seconds to transfer (40 Gb at 10 Gb/s), even at 100% of the line rate
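The waiting time has a simple lower bound: bytes times 8, divided by the link speed in bits per second. The helper below is a minimal sketch of that arithmetic; the class and method names are assumptions, and the result ignores protocol overhead, contention and everything else that makes real transfers slower.

```csharp
public static class TransferTime
{
    // Lower bound, in seconds, for moving `bytes` over a link that carries
    // `bitsPerSecond`, assuming the link is used at 100% of its rate.
    public static double MinSeconds(double bytes, double bitsPerSecond)
    {
        return bytes * 8.0 / bitsPerSecond;
    }
}
```

For a 5 GB object on a 10 Gb/s link, MinSeconds(5e9, 10e9) gives 4 seconds. Anything an OOBFetch caller observes in practice will be at or above this figure.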

How our server works (slide 17)
- We built a very simple server that accepts RPC requests in Web Services style
- Then we created a simple “thin” library to talk to it
  - You can pass a file name to it, and it will do an OOB operation using that file name as the argument
- Remember: memory-mapped files are accessible from any program on the same machine!
  - So Isis2 can access your memory-mapped files even from this server, even if you aren’t “linked” to it!
- The command-line API works the same way

Recap: A very fast way to move objects around (slide 18)
[Figure: Machine A has a big memory object X. We want copies on machines B and C.]
(1) Tell Isis2 about X using OOBRegister
(2) Tell your application on B and C to fetch X
(3) Applications on B and C call OOBFetch
(3) OOBReReplicate tells Isis2 to modify the replication pattern

How we use OOB inside Isis2 (slide 19)
- One situation where Isis2 has to copy identical data to lots of group members involves a master/worker startup with many new members joining
  - All the new members need the new group view!
  - … and because they don’t have the prior group view, we can’t just send the delta, which is how new view events normally work
- So, if the group is large, Isis2 creates a memory-mapped object containing the view, then uses OOB to transfer it to the joining processes!

You might use it for state transfer too! (slide 20)
- The initialization case is a form of state transfer
- Suppose you are building a group whose state is very large, like a file service
- If you try to transfer the state “in band” it could take ages and disrupt the group for a long time!

OOB to the rescue! (slide 21)
- Better: pre-transfer as much state as you can using the OOB tool
- You’ll need a way to contact the group before even trying to join. A good option: the Client API
  - Allows you to bind to a randomly chosen “representative”
  - Load balances these roles…
  - Representative must “allow client requests” to handlers you can call as a client.
- So, you create a state pre-fetch API for clients
  - Joining member shows up, perhaps authenticates itself, and you use OOB to pre-send all that state

But if updates are active… (slide 22)
- … a race condition forms!
  - Suppose the state is A…W but, between the time you finish being a client and the time you join, updates X and Y occur in the group
  - Your state is “stale”. Should it be discarded?
- We recommend:
  - Associate a counter or timestamp with the state. The version you pre-transferred had, perhaps, T=23
  - Now we can use this to “finalize” the state

Implementation (slide 23)
- g.Join() has an overload where you can pass in a long integer. Pass this timestamp
- The process that initiates your state transfer will find the timestamp value in the view, in a field called v.offset
- It can compute a state for you that includes updates done subsequent to when you pre-transferred state!

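OOB pre-transfer idea (slide 24)
[Figure: timeline with group members P, Q, R and a joining process acting “as Client of G”. Labels, in the order they appear: “Pre-transfer please?”; “look in /tmp/xxx, T=12345”; OOBFetch(); T=123; OOBReReplicate; OOBDelete; Memory Mapped Byte[] Representation; Updates since T=12345; g.Join(12345); Create mapped file.]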

Group obligation? (slide 25)
- If the state of the group is an append-style log, this concept is easily implemented
- Otherwise, the group needs to keep a log of “recent” updates and implement some form of periodic snapshot in which the stored state has an associated time (how many updates it reflects), and the log has the remaining updates
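For the append-style case, the bookkeeping can be sketched in a few lines. The model below is an illustration under simplifying assumptions, not Isis2 code: the state is just a list of updates, the "timestamp" is the number of updates a copy reflects, and UpdateLog with its members is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model: the group's state is a sequence of updates, and the timestamp
// of a copy is simply how many updates that copy reflects.
public sealed class UpdateLog
{
    private readonly List<string> updates = new List<string>();

    public long Timestamp { get { return updates.Count; } }

    public void Append(string update) { updates.Add(update); }

    // Copy of the state as of now, suitable for an out-of-band pre-transfer.
    public List<string> Snapshot() { return new List<string>(updates); }

    // Updates applied after the joiner's pre-transferred copy was taken.
    public List<string> Since(long joinerTimestamp)
    {
        return updates.Skip((int)joinerTimestamp).ToList();
    }
}
```

A joiner that pre-fetched a snapshot at timestamp T and then joins with T only needs Since(T) appended to its copy to be current. That is the role the value passed to g.Join plays in the scheme above.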

Serialization (slide 26)
- We have several ways to create the byte[] representation of these view objects
  - Msg.ToBArray(objs…)
  - C# serialization
  - Your favorite way of generating a byte[] object
- But keep in mind that because an mva isn’t a byte[] object, copying does occur at the last step of transforming data into a C# managed object
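The copying mentioned in the last bullet is visible in the standard accessor API: getting a byte[] into or out of a mapped view goes through WriteArray and ReadArray, both of which copy. The sketch below shows that round trip with plain .NET calls and no Isis2. The class and method names are assumptions, and the payload must be non-empty.

```csharp
using System;
using System.IO.MemoryMappedFiles;

public static class MappedBytesDemo
{
    // Copies `payload` into a new unnamed mapped region, then copies it back
    // out into a fresh managed array. Both directions copy the bytes.
    public static byte[] RoundTrip(byte[] payload)
    {
        using (var mmf = MemoryMappedFile.CreateNew(null, payload.Length))
        using (var mva = mmf.CreateViewAccessor(0, payload.Length))
        {
            mva.WriteArray(0, payload, 0, payload.Length);
            var copy = new byte[payload.Length];
            mva.ReadArray(0, copy, 0, copy.Length);
            return copy;
        }
    }
}
```

The returned array has the same contents as the input but is a distinct managed object, which is exactly the extra copy the slide warns about.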

Performance considerations (slide 27)
- In theory, the very best way to move the bytes is with Ethernet multicast or Infiniband
- Isis2 supports both… but they behave differently
  - Ethernet multicast is highly efficient for 1:n, but the data is still copied from kernel to user address space
  - Infiniband multicast doesn’t work well, hence we use Infiniband “verbs” to send the data via multiple 1:1 streams. But these avoid any kernel/user copying
- Worst performance: ISIS_UNICAST_ONLY case