
The Kangaroo Approach to Data Movement on the Grid Douglas Thain, Jim Basney, Se-Chang Son, and Miron Livny Condor Project University of Wisconsin

The Grid is BYOFS.

“Bring Your Own File System”: you can’t depend on the host.
- Problems of configuration: Execution sites do not necessarily have a distributed file system, or even a userid for you.
- Problems of correctness: Networks go down, servers crash, disks fill, users forget to start servers…
- Problems of performance: Bandwidth and latency may fluctuate due to competition with other users, both local and remote.

Applications are Not Prepared to Handle These Errors
- Examples:
  - open(“input”) -> “connection refused”
  - write(file,buffer,length) -> “wait ten minutes”
  - close(file) -> “couldn’t write data”
- Applications respond by dumping core, exiting, or producing incorrect results… or just by running slowly.
- Users respond with…
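To make the slide's point concrete, here is a minimal C sketch (not from the original presentation) of the kind of I/O path a typical batch application uses. Every call can fail on the Grid for reasons the program cannot sensibly recover from, and the only response it has is to abort; the file names and buffer size are illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Each call below can fail for Grid-level reasons: "connection
         * refused", "wait ten minutes", "couldn't write data"...  The
         * program has no better option than to give up. */
        int in = open("input", O_RDONLY);
        if (in < 0) { perror("open input"); exit(1); }

        int out = open("output", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) { perror("open output"); exit(1); }

        char buffer[4096];
        ssize_t n;
        while ((n = read(in, buffer, sizeof(buffer))) > 0) {
            if (write(out, buffer, n) != n) {   /* e.g. disk full, server down */
                perror("write");
                exit(1);                        /* exit or dump core: no recovery */
            }
        }

        if (close(out) < 0) {                   /* delayed-write errors surface here */
            perror("close");
            exit(1);
        }
        close(in);
        return 0;
    }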

Focus: “Half-Interactive” Jobs
- Users want to submit batch jobs to the Grid, but still be able to monitor the output interactively.
- But, network failures are expected as a matter of course, so keeping the job running takes priority over getting output.
- Examples:
  - Simulation of high-energy collider events. (CMS)
  - Simulation of molecular structures. (Gaussian)
  - Rendering of animated images. (Maya)
(Diagram: an application communicating over an unreliable network.)

The Kangaroo Approach To Data Movement
- Make a third party responsible for executing each application’s I/O operations.
- Never return an error to the application. (Maybe tell the user or scheduler.)
- Use all available resources to hide latencies.
- Benefit: Higher throughput, fault tolerance.
- Cost: Weaker consistency.

Philosophical Musings
- Two problems, one solution:
  - Hiding errors: Retry, report the error to a third party, and use another resource to satisfy the request.
  - Hiding latencies: Use another resource to satisfy the request in the background, but if an error occurs, there is no channel to report it.

This is an Old Problem
(Diagram: App -> RAM buffer -> Disk, with a data mover process delivering to another disk.)
- Relieves the application of the responsibility of collecting, scheduling, and retrying operations.
- Application can request consistency operations: Is it done? Wait until done.
- Interface is a file system.
- Weak consistency guarantees.
- Scheduler chooses when and where.

Apply it to a New World
(Diagram: App -> RAM buffer and Disk on one node -> data mover process -> RAM buffer and Disk on another node; file system interface at the application, occasional explicit consistency requests.)
- Accepts the responsibility of moving data. App should never receive errors.
- Provides weak consistency guarantees.
- Moves data according to network, buffer and target availability.

Our Vision: A Grid File System
(Diagram: an application with a disk and several file systems connected by a data movement system of Kangaroo (K) nodes.)

Reality Check: Are we on the right track?
- David Turek on Reliability: Be less like a lizard, and more like a human. (Be self-repairing.)
- Peter Nugent on Weak Consistency: Datasets are written once. Recollection or recomputation results in a new file. (No read/write or write/write issues.)
- Miron Livny on the Grid Environment: The grid is constantly changing. Networks go up and down, and machines come and go. Software must be agile.

Introducing Kangaroo
- A user-level data movement system that ‘hops’ partial files from node to node on the Grid.
- A background process that will ‘fight’ for your jobs’ I/O needs.
- A ‘data valet’ that worries about errors, but never admits them to the job.

Kangaroo Prototype
- We have built a first-try Kangaroo that validates the central ideas of hiding errors and latencies.
- Emphasis on high-level reliability and throughput, not on low-level optimizations.
- First, work to improve writes, but leave room in the design to improve reads.

Kangaroo Prototype
(Diagram: an App contacts one of several Kangaroo (K) nodes, each backed by a disk. Callouts: “Where are my data? Have they arrived yet?” and “Disk full. Credentials expired. Permission denied.”)
- An application may contact any node in the system and perform partial-file reads and writes.
- The node may then execute or buffer operations as conditions warrant.

The Kangaroo Protocol
- Simple, easy to implement.
- Same protocol is used between all participants: Client -> Server, Server -> Server.
- Can be thought of as an “indirect NFS.”
  - Idempotent operations on a (host,file) name.
  - Servers need not remember state of clients.

The Kangaroo Protocol
- Get( host, file, offset, length, data ) -> returns success/failure + data
- Put( host, file, offset, length, data ) -> no response
- Commit() -> returns success/failure
- Push( host, file ) -> returns success/failure
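One possible C rendering of these four operations is sketched below. The operation names, arguments, and return behavior follow the slide; the concrete types, the kangaroo_ prefix, and the function-call framing are assumptions made for illustration, not the actual Kangaroo API.

    #include <stddef.h>
    #include <sys/types.h>

    /* Get: read 'length' bytes of 'file' on 'host' starting at 'offset'.
     * Returns success/failure and fills 'data' on success. */
    int kangaroo_get(const char *host, const char *file,
                     off_t offset, size_t length, void *data);

    /* Put: write 'length' bytes to 'file' on 'host' at 'offset'.
     * Deliberately has no response: errors are never returned to the caller. */
    void kangaroo_put(const char *host, const char *file,
                      off_t offset, size_t length, const void *data);

    /* Commit: block until all writes are safely recorded in some stable storage. */
    int kangaroo_commit(void);

    /* Push: block until all writes to 'file' reach their ultimate destination. */
    int kangaroo_push(const char *host, const char *file);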

Writes do not return a result! Why?
- A grid application has no reasonable response to possible errors:
  - Connection lost
  - Out of space
  - Permission denied
- The Kangaroo server becomes responsible for trying and retrying the write, whether it is an intermediate or ultimate destination.
- If there is a brief resource shortage, the server may simply pause the incoming stream.
- If there is a catastrophic error, the server may drop the connection -- the caller must roll back.
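The sketch below, which is purely illustrative and not the real server code, shows the obligation this places on a Kangaroo server: once a write is accepted, the server keeps retrying delivery and never reports failure back to the application. The helper try_forward_block() is a hypothetical stand-in for the actual network transfer, and the host and file names are invented.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical stand-in for sending one buffered block to the next hop
     * (or the final destination).  Here it "fails" twice and then succeeds,
     * just to exercise the retry loop. */
    static bool try_forward_block(const char *host, const char *file,
                                  long offset, const void *data, size_t length)
    {
        static int attempts = 0;
        (void)host; (void)file; (void)offset; (void)data; (void)length;
        return ++attempts > 2;
    }

    /* Keep retrying until the block is delivered.  Transient failures are
     * absorbed here; the application that issued the Put never sees them. */
    static void forward_until_delivered(const char *host, const char *file,
                                        long offset, const void *data, size_t length)
    {
        while (!try_forward_block(host, file, offset, data, length)) {
            fprintf(stderr, "delivery to %s failed; retrying shortly\n", host);
            sleep(1);   /* back off, then try again */
        }
    }

    int main(void)
    {
        char block[10] = "some data";
        forward_until_delivered("storage.example.org", "/out.dat",
                                0, block, sizeof(block));
        return 0;
    }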

The Kangaroo Protocol
- Two consistency operations:
  - Commit: Block until all writes have been safely recorded in some stable storage. App must do this before it exits.
  - Push: Block until all writes are delivered to their ultimate destinations. App may do this to externally synchronize. User may do this to discover if data movement is done.
- Consistency guarantees: The end result is the same as an interactive system.
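Building on the hypothetical prototypes sketched earlier, this is one way an application and a user might invoke the two consistency operations; the target host and file name are invented for illustration.

    #include <stddef.h>
    #include <string.h>
    #include <sys/types.h>

    /* Declarations repeated from the earlier sketch (hypothetical API). */
    void kangaroo_put(const char *host, const char *file,
                      off_t offset, size_t length, const void *data);
    int  kangaroo_commit(void);
    int  kangaroo_push(const char *host, const char *file);

    /* Application side: write results, then Commit before exiting so the
     * data is safe somewhere, even if it has not reached the target yet. */
    void finish_job(const char *target_host)
    {
        const char *result = "final results\n";
        kangaroo_put(target_host, "/results/out.dat", 0, strlen(result), result);
        kangaroo_commit();   /* must succeed before the job exits */
    }

    /* User or scheduler side: Push blocks until the data has actually
     * arrived at its ultimate destination, i.e. data movement is done. */
    int wait_for_delivery(const char *target_host)
    {
        return kangaroo_push(target_host, "/results/out.dat");
    }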

User Interface
- Although applications could write to the Kangaroo interface, we don’t expect or require this.
- An interposition agent is responsible for converting POSIX operations into the Kangaroo protocol.
(Diagram: App -> Agent via POSIX; Agent -> Kangaroo (K) node via the Kangaroo protocol.)

User Interface
- Interposition agent built with Bypass.
  - A tool for trapping UNIX I/O operations and routing them through new code.
  - Works on any dynamically-linked, unmodified, unprivileged program.
- Examples:
  - vi /kangaroo/coral.cs.wisc.edu/etc/hosts
  - gcc /gsiftp/server/input.c -o /kangaroo/server/output.exe

Performance Evaluation
- Not a full-fledged file-system evaluation.
- A proof-of-concept that shows latencies and errors can be hidden correctly.
- Preview of results:
  - As a data-copier, Kangaroo is reasonable.
  - Real benefit comes from the ability to overlap I/O and CPU.

Microbenchmark: File Transfer
- Create a large output file at the execution site, and send it to a storage site.
- Ideal conditions: No competition for CPU, network, or disk bandwidth.
- Three methods:
  - Stream output directly to target. (Online)
  - Stage output to disk. (Offline)
  - Kangaroo

Macrobenchmark: Image Processing
- Post-processing of satellite image data: need to compute various enhancements and produce output for each.
  - Read input image
  - For I = 1 to N:
    - Compute transformation of image
    - Write output image
- Example:
  - Image size: about 5 MB
  - Compute time: about 6 sec
  - I/O-CPU ratio: about 0.9 MB/s
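For concreteness, here is a rough C sketch of the benchmark's structure (the slide gives only pseudocode). The /kangaroo/... paths and the value of N are illustrative; with the interposition agent described earlier, the program issues ordinary POSIX I/O and the agent routes it through Kangaroo. The transform is a trivial stand-in for the real ~6-second computation.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N 10                             /* number of enhancements (assumed) */
    #define IMAGE_SIZE (5 * 1024 * 1024)     /* ~5 MB image, per the slide */

    /* Stand-in for the real ~6 s enhancement; here it just copies the pixels. */
    static void transform_image(const char *in, char *out, int which)
    {
        memcpy(out, in, IMAGE_SIZE);
        (void)which;
    }

    int main(void)
    {
        char *in  = malloc(IMAGE_SIZE);
        char *out = malloc(IMAGE_SIZE);

        /* Read the input image once. */
        FILE *f = fopen("/kangaroo/storage.example.org/data/input.img", "rb");
        if (!f) { perror("fopen input"); return 1; }
        fread(in, 1, IMAGE_SIZE, f);
        fclose(f);

        /* For I = 1 to N: compute a transformation, write an output image.
         * With Kangaroo, each write is buffered locally and moved in the
         * background, so output can overlap the next round of computation. */
        for (int i = 1; i <= N; i++) {
            transform_image(in, out, i);

            char name[128];
            snprintf(name, sizeof(name),
                     "/kangaroo/storage.example.org/data/output%d.img", i);
            f = fopen(name, "wb");
            if (!f) { perror("fopen output"); return 1; }
            fwrite(out, 1, IMAGE_SIZE, f);
            fclose(f);
        }

        free(in);
        free(out);
        return 0;
    }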

I/O Models Compared
(Timeline diagram comparing Online I/O, Offline I/O, and Kangaroo: each row interleaves INPUT, CPU, and OUTPUT phases differently, marking when the CPU is released and when the task is done. With Kangaroo, output overlaps computation and a final PUSH completes the task after the CPU is released.)

Summary of Results
- At the micro level, our prototype provides reliability with reasonable performance.
- At the macro level, I/O overlap gives reliability and speedups (for some applications).
- Kangaroo allows the application to survive on its real I/O needs: 0.91 MB/s. Without it, there is ‘false pressure’ to provide fast networks.

Research Problems
- Commit means “make data safe somewhere.”
  - Greedy approach: Commit all dirty data here.
  - Lazy approach: Commit nothing until final delivery.
  - Solution must be somewhere in between.
- Disk as buffer, not as file system.
  - Existing buffer implementation is clumsy and inefficient.
  - Need to optimize for 1-write, 1-read, 1-delete.
- Fine-grained scheduling.
  - Reads should have priority over writes.
  - This is easy at one node, but multiple nodes?

Related Work
- Some items neglected from the paper:
  - HPSS data movement: Move data from RAM -> disk -> tape.
  - Internet Backplane Protocol (IBP): Passive storage building block. Kangaroo could use IBP as underlying storage.
  - PUNCH virtual file system: Uses NFS as data protocol. Uses indirection for implicit naming.

Conclusion
- The Grid is BYOFS.
- Error hiding and latency hiding are tightly-knit problems.
- The solution to both is to make a third party responsible for I/O execution.
- The benefits of high-level overlap can outweigh any low-level inefficiencies.

Contact Us
- Douglas Thain
- Miron Livny
- Kangaroo software, papers, and more:
- Condor in general:
- Questions now?