Reliable and Efficient Grid Data Placement using Stork and DiskRouter Tevfik Kosar University of Wisconsin-Madison April 15th, 2004
A Single Project.. LHC (Large Hadron Collider): comes online in 2006; will produce 1 Exabyte of data by 2012; accessed by ~2000 physicists, 150 institutions, 30 countries.
And Many Others.. Genomic information processing applications; Biomedical Informatics Research Network (BIRN) applications; Cosmology applications (MADCAP); Methods for modeling large molecular systems; Coupled climate modeling applications; Real-time observatories, applications, and data-management (ROADNet).
The Same Big Problem.. Need for data placement: locate the data; send data to processing sites; share the results with other sites; allocate and de-allocate storage; clean up everything; do all of this reliably and efficiently.
Outline Introduction Stork DiskRouter Case Studies Conclusions
Stork: a scheduler for data placement activities in the Grid. What Condor is for computational jobs, Stork is for data placement. Stork comes with a new concept: “Make data placement a first class citizen in the Grid.”
The Concept: Stage-in → Execute the Job → Stage-out, expanded into a full chain: Allocate space for input & output data → Stage-in → Execute the job → Stage-out → Release input space → Release output space; all of these are handled as Individual Jobs.
The Concept: the same chain (Allocate space for input & output data → Stage-in → Execute the job → Stage-out → Release input space → Release output space), now divided into Data Placement Jobs (the allocate, stage-in, stage-out, and release steps) and Computational Jobs (the job execution itself).
The Concept: DAGMan drives the whole workflow from a single DAG specification (DaP A A.submit; DaP B B.submit; Job C C.submit; …; Parent A child B; Parent B child C; Parent C child D, E; …), sending data placement (DaP) nodes to the Stork job queue and computational (Job) nodes to the Condor job queue.
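As a concrete illustration, here is a minimal sketch of such a mixed DAG specification in the DaP/Job/Parent-child notation shown above, mapping the allocate/stage-in/execute/stage-out/release chain from the earlier slides onto DAG nodes (the node names and submit-file names are illustrative, not taken from the talk):

# DaP nodes are handed to the Stork job queue, Job nodes to the Condor job queue
DaP A allocate.stork      # allocate space for input & output data
DaP B stage_in.stork      # transfer the input data to the execute site
Job C compute.submit      # execute the job under Condor
DaP D release_in.stork    # release the input space
DaP E stage_out.stork     # transfer the output data back
DaP F release_out.stork   # release the output space

Parent A child B
Parent B child C
Parent C child D E
Parent E child F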
Why Stork? Stork understands the characteristics and semantics of data placement jobs, so it can make smart scheduling decisions for reliable and efficient data placement.
Failure Recovery and Efficient Resource Utilization: fault tolerance (just submit a bunch of data placement jobs and then go away); control over the number of concurrent transfers from/to any storage system (prevents overloading); space allocation and de-allocation (make sure space is available before data arrives).
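To make the space allocation and de-allocation point concrete, here is a hedged sketch of what reserve and release requests could look like, written in the same ClassAd style as the transfer jobs shown later in the talk; the dap_type values and the other attribute names are assumptions for illustration, not confirmed Stork syntax (the host name is reused from a later example):

[
  dap_type     = "reserve";                 // allocate space before the stage-in runs
  dest_host    = "quest2.ncsa.uiuc.edu";    // storage system to reserve on
  reserve_size = "500 MB";                  // amount of space requested (illustrative)
  duration     = "2 hours";                 // how long the reservation is needed
]

[
  dap_type  = "release";                    // de-allocate the space once the data is no longer needed
  dest_host = "quest2.ncsa.uiuc.edu";
]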
Support for Heterogeneity Protocol translation using Stork memory buffer.
Support for Heterogeneity Protocol translation using Stork Disk Cache.
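One way to picture protocol translation through the Stork disk cache: a transfer between two protocols with no common client, such as the SRB-to-NeST example on the next slide, can be thought of as two back-to-back transfers through a local cache file. This decomposition is only a sketch of the idea; the file:// URLs and the cache path are assumptions for illustration, and Stork carries out the intermediate step itself.

[
  dap_type = "transfer";
  src_url  = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";   // pull from SRB ...
  dest_url = "file:///tmp/stork_cache/x.dat";                 // ... into the local disk cache (hypothetical path)
]

[
  dap_type = "transfer";
  src_url  = "file:///tmp/stork_cache/x.dat";                 // read back from the cache ...
  dest_url = "nest://turkey.cs.wisc.edu/kosart/x.dat";        // ... and push to NeST
]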
Flexible Job Representation and Multilevel Policy Support
[
  Type       = "Transfer";
  Src_Url    = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url   = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  Max_Retry  = 10;
  Restart_in = "2 hours";
]
Run-time Adaptation: Dynamic protocol selection
[
  dap_type = "transfer";
  src_url  = "drouter://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]

[
  dap_type = "transfer";
  src_url  = "any://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
]
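As a sketch of what the alt_protocols list amounts to, the first request above can be read as Stork attempting the concrete transfers below in order, falling back to the next protocol pair when one fails; this expansion is illustrative and assumes the ordering given in alt_protocols:

[ dap_type = "transfer"; src_url = "drouter://slic04.sdsc.edu/tmp/test.dat"; dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat"; ]
[ dap_type = "transfer"; src_url = "nest://slic04.sdsc.edu/tmp/test.dat"; dest_url = "nest://quest2.ncsa.uiuc.edu/tmp/test.dat"; ]
[ dap_type = "transfer"; src_url = "gsiftp://slic04.sdsc.edu/tmp/test.dat"; dest_url = "gsiftp://quest2.ncsa.uiuc.edu/tmp/test.dat"; ]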
Run-time Adaptation: Run-time Protocol Auto-tuning
[
  link     = "slic04.sdsc.edu – quest2.ncsa.uiuc.edu";
  protocol = "gsiftp";
  bs       = 1024KB;   // block size
  tcp_bs   = 1024KB;   // TCP buffer size
  p        = 4;        // parallelism
]
Outline Introduction Stork DiskRouter Case Studies Conclusions
DiskRouter: a mechanism for high-performance, large-scale data transfers. Uses hierarchical buffering to aid large-scale data transfers; enables an application-level overlay network for maximizing bandwidth; supports application-level multicast.
Store and Forward: improves performance when the bandwidth fluctuation between A and B is independent of the bandwidth fluctuation between B and C. [Figure: data flowing A → B → C, shown with and without a DiskRouter at the intermediate node B.]
DiskRouter Overlay Network [Figure: direct path from A to B, running at 90 Mb/s.]
DiskRouter Overlay Network: add a DiskRouter node C, which is not necessarily on the path from A to B, to enforce use of an alternative path. [Figure: the direct A–B link at 90 Mb/s and a 400 Mb/s path through the DiskRouter at C.]
Data Mover / Distributed Cache: the source writes to its closest DiskRouter, and the destination picks the data up from its closest DiskRouter. [Figure: source and destination connected through a DiskRouter cloud.]
Outline Introduction Stork DiskRouter Case Studies Conclusions
Case Study I: SRB-UniTree Data Pipeline. Transfer ~3 TB of DPOSS data from the SRB server to the UniTree server, using a data pipeline created with Stork and DiskRouter. [Figure: pipeline stages SRB Server → SDSC Cache → NCSA Cache → UniTree Server, driven from a Submit Site.]
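A hedged sketch of how such a pipeline can be laid out as a DAG of Stork transfer jobs, one hop per stage. The cache host names are reused from the protocol-selection example earlier (slic04.sdsc.edu at SDSC, quest2.ncsa.uiuc.edu at NCSA); the submit-file names and the data file name are illustrative:

# One DaP node per pipeline hop
DaP A srb_to_sdsc.stork       # SRB Server -> SDSC cache node
DaP B sdsc_to_ncsa.stork      # SDSC cache -> NCSA cache, over DiskRouter
DaP C ncsa_to_unitree.stork   # NCSA cache -> UniTree Server

Parent A child B
Parent B child C

The middle hop is then an ordinary Stork transfer over the drouter protocol, for example:

[
  dap_type = "transfer";
  src_url  = "drouter://slic04.sdsc.edu/tmp/dposs/f001.dat";        // illustrative file name
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/dposs/f001.dat";
]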
Failure Recovery [Figure: failures encountered and recovered from during the pipeline run: UniTree not responding; DiskRouter reconfigured and restarted; SDSC cache reboot & UW CS network outage; software problem.]
Case Study II
Dynamic Protocol Selection
Runtime Adaptation
Before Tuning: parallelism = 1, block_size = 1 MB, tcp_bs = 64 KB
After Tuning: parallelism = 4, block_size = 1 MB, tcp_bs = 256 KB
Conclusions: regard data placement as a first-class citizen; introduce a specialized scheduler for data placement; introduce a high-performance data transfer tool; end-to-end automation, fault tolerance, run-time adaptation, multilevel policy support, reliable and efficient transfers.
Future work: enhanced interaction between Stork, DiskRouter, and higher-level planners (co-scheduling of CPU and I/O); enhanced authentication mechanisms; more run-time adaptation.
You don’t have to FedEx your data anymore.. We deliver it for you! For more information, Stork: Tevfik Kosar; DiskRouter: George Kola.