STORK: A Scheduler for Data Placement Activities in Grid
Tevfik Kosar
University of Wisconsin-Madison
kosart@cs.wisc.edu
Some Remarkable Numbers
Characteristics of four physics experiments targeted by GriPhyN:

Application   First Data   Data Volume (TB/yr)   User Community
SDSS          1999         10                    100s
LIGO          2002         250
ATLAS/CMS     2005         5,000                 1000s

Source: GriPhyN Proposal, 2000
Even More Remarkable…
“...the data volume of CMS is expected to subsequently increase rapidly, so that the accumulated data volume will reach 1 Exabyte (1 million Terabytes) by around 2015.”
Source: PPDG Deliverables to CMS
Other Data Intensive Applications
- Genomic information processing applications
- Biomedical Informatics Research Network (BIRN) applications
- Cosmology applications (MADCAP)
- Methods for modeling large molecular systems
- Coupled climate modeling applications
- Real-time observatories, applications, and data-management (ROADNet)
Need to Deal with Data Placement
Data need to be moved, staged, replicated, cached, and removed; storage space for data needs to be allocated and de-allocated. We call all of these data-related activities in the Grid Data Placement (DaP) activities.
State of the Art
Data placement activities in the Grid are performed either manually or by simple scripts. They are regarded as “second class citizens” of the computation-dominated Grid world.
Our Goal
Our goal is to make data placement activities “first class citizens” in the Grid, just like computational jobs. They need to be queued, scheduled, monitored, managed, and even checkpointed.
Outline
- Introduction
- Grid Challenges
- Stork Solutions
- Case Study: SRB-UniTree Data Pipeline
- Conclusions & Future Work
Grid Challenges
- Heterogeneous Resources
- Limited Resources
- Network/Server/Software Failures
- Different Job Requirements
- Scheduling of Data & CPU together
Stork
Stork intelligently and reliably schedules, runs, monitors, and manages Data Placement (DaP) jobs in a heterogeneous Grid environment and ensures that they complete. Stork is to DaP jobs what Condor is to computational jobs. Just submit a bunch of DaP jobs and then relax.
Stork Solutions to Grid Challenges
- Specialized in Data Management
- Modularity & Extendibility
- Failure Recovery
- Global & Job Level Policies
- Interaction with Higher Level Planners/Schedulers
Already Supported URLs
file:/ -> Local File
ftp:// -> FTP
gsiftp:// -> GridFTP
nest:// -> NeST (chirp) protocol
srb:// -> SRB (Storage Resource Broker)
srm:// -> SRM (Storage Resource Manager)
unitree:// -> UniTree server
diskrouter:// -> UW DiskRouter
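The scheme list above amounts to a lookup from URL scheme to storage system. A minimal Python sketch of that lookup follows, assuming a scheduler only needs the scheme to decide which protocol a source or destination uses. The names `SUPPORTED_SCHEMES`, `protocol_for`, and `transfer_pair` are hypothetical and are not part of Stork.

```python
from urllib.parse import urlsplit

# Scheme-to-system table copied from the slide; the dict itself is illustrative.
SUPPORTED_SCHEMES = {
    "file": "Local File",
    "ftp": "FTP",
    "gsiftp": "GridFTP",
    "nest": "NeST (chirp) protocol",
    "srb": "SRB (Storage Resource Broker)",
    "srm": "SRM (Storage Resource Manager)",
    "unitree": "UniTree server",
    "diskrouter": "UW DiskRouter",
}


def protocol_for(url):
    """Return the storage system named by the URL's scheme.

    Raises ValueError when the scheme is missing or not in the table.
    """
    scheme = urlsplit(url).scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return SUPPORTED_SCHEMES[scheme]


def transfer_pair(src_url, dest_url):
    """Return the (source system, destination system) pair for one transfer."""
    return protocol_for(src_url), protocol_for(dest_url)
```

Called with the two URLs from the sample submit file, `transfer_pair` would pair SRB on the source side with NeST on the destination side. How Stork itself selects a transfer module is not described on these slides.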
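Higher Level Planners
[Figure: layered architecture. Higher-level planners sit above DAGMan, which hands computational jobs to Condor-G (compute) and data placement jobs to Stork (DaP). Components shown on the compute side: Gate Keeper, StartD. Components shown on the DaP side: SRB, SRM, NeST, GridFTP, RFT.]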
Interaction with DAGMan
[Figure: DAGMan takes the DAG description below and feeds two queues, the Condor Job Queue and the Stork Job Queue. Nodes shown: A, B, C, D, X, Y.]
Job A A.submit
DaP X X.submit
Job C C.submit
Parent A child C, X
Parent X child B
…..
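The DAG description on this slide can be read as a dependency graph in which `Job` nodes go to the Condor queue and `DaP` nodes go to the Stork queue. The Python sketch below parses that slide notation and lists which nodes become ready together. It is only an illustration of the idea, not DAGMan's parser or its actual file syntax. The line `Job B B.submit` is an assumption, because the slide elides B's declaration, and `parse_dag` and `dispatch_waves` are hypothetical names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DAG_TEXT = """\
Job A A.submit
DaP X X.submit
Job C C.submit
Job B B.submit
Parent A child C, X
Parent X child B
"""  # "Job B B.submit" is assumed; the slide does not show it


def parse_dag(text):
    """Return (kinds, parents): node -> 'Job' or 'DaP', node -> set of parents."""
    kinds, parents = {}, {}
    for line in text.splitlines():
        words = line.replace(",", " ").split()
        if not words:
            continue
        if words[0] in ("Job", "DaP"):
            kinds[words[1]] = words[0]
            parents.setdefault(words[1], set())
        elif words[0] == "Parent":
            split = words.index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return kinds, parents


def dispatch_waves(text):
    """Group nodes into waves; each node is tagged with the queue it goes to.

    Simplification: every node in a wave is treated as finishing before the
    next wave is released.
    """
    kinds, parents = parse_dag(text)
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(
            [(name, "Stork" if kinds[name] == "DaP" else "Condor") for name in ready]
        )
        sorter.done(*ready)
    return waves
```

For the text above, `dispatch_waves(DAG_TEXT)` releases A first, then C and X together (C to Condor, X to Stork), then B once X has finished.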
Sample Stork submit file
[
  Type = "Transfer";
  Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  Max_Retry = 10;
  Restart_in = "2 hours";
]
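When many files have to be moved, submit descriptions like the one above are usually generated rather than written by hand. The Python sketch below renders a dictionary into the bracketed `name = value;` layout shown on the slide. It uses only the attribute names that appear in the sample, does not fill in the attributes elided by `……`, and does not escape quotes inside values. `render_submit` is a hypothetical helper, not a Stork tool.

```python
def render_submit(fields):
    """Render a dict as a bracketed attribute list in the slide's layout.

    Strings are double-quoted; other values (such as integers) are written
    as-is. Quotes inside string values are not escaped.
    """
    lines = ["["]
    for key, value in fields.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"  {key} = {rendered};")
    lines.append("]")
    return "\n".join(lines)


# Attribute names and values below are the ones visible in the sample.
sample = render_submit(
    {
        "Type": "Transfer",
        "Src_Url": "srb://ghidorac.sdsc.edu/kosart.condor/x.dat",
        "Dest_Url": "nest://turkey.cs.wisc.edu/kosart/x.dat",
        "Max_Retry": 10,
        "Restart_in": "2 hours",
    }
)
```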
Case Study: SRB-UniTree Data Pipeline
We have transferred ~3 TB of DPOSS data (2611 x 1.1 GB files) from SRB to UniTree using 3 different pipeline configurations. The pipelines are built using Condor and Stork scheduling technologies. The whole process is managed by DAGMan.
Configuration 1
[Figure: pipeline driven from the Submit Site.]
SRB Server -> (SRB get) -> NCSA Cache -> (UniTree put) -> UniTree Server
Configuration 2
[Figure: pipeline driven from the Submit Site.]
SRB Server -> (SRB get) -> SDSC Cache -> (GridFTP) -> NCSA Cache -> (UniTree put) -> UniTree Server
Configuration 3
[Figure: pipeline driven from the Submit Site.]
SRB Server -> (SRB get) -> SDSC Cache -> (DiskRouter) -> NCSA Cache -> (UniTree put) -> UniTree Server
Outcomes of the Study
1. Stork interacted easily and successfully with different underlying systems: SRB, UniTree, GridFTP, and DiskRouter.
Outcomes of the Study (2)
2. We had the chance to compare different pipeline topologies and configurations:

Configuration   End-to-end rate (MB/sec)
1               5.0
2               3.2
3               5.95
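To put these rates in perspective, the Python sketch below estimates how long the 2611 x 1.1 GB data set would take at each reported end-to-end rate. It assumes decimal units (1 GB = 1000 MB) and a rate sustained for the whole run with no interruptions. The slides report neither the units convention nor the actual durations, so the results are rough estimates only.

```python
FILES = 2611          # number of files, from the case study slide
FILE_SIZE_GB = 1.1    # size of each file, from the case study slide
MB_PER_GB = 1000      # assumption: decimal units; the slides do not say
SECONDS_PER_DAY = 86400

# End-to-end rates in MB/sec, copied from the table above.
RATES_MB_PER_SEC = {1: 5.0, 2: 3.2, 3: 5.95}


def total_mb():
    """Total data volume in MB under the decimal-units assumption."""
    return FILES * FILE_SIZE_GB * MB_PER_GB


def days_to_transfer(rate_mb_per_sec):
    """Idealized transfer time in days at a constant rate, ignoring failures."""
    return total_mb() / rate_mb_per_sec / SECONDS_PER_DAY


for config, rate in RATES_MB_PER_SEC.items():
    print(f"configuration {config}: about {days_to_transfer(rate):.1f} days")
```

Under these assumptions the estimates come out at roughly 6.6, 10.4, and 5.6 days for configurations 1, 2, and 3.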
Outcomes of the Study (3)
3. Almost all possible network, server, and software failures were recovered automatically.
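One simple form of automatic recovery is to retry a failed transfer a bounded number of times, in the spirit of the `Max_Retry` attribute in the sample submit file. The Python sketch below is a generic retry loop, not Stork's recovery mechanism, which these slides do not describe. The function name, the choice of `OSError` as the failure signal, and the reading of `max_retry` as "retries after the first attempt" are all assumptions.

```python
import time


def run_with_retry(transfer, max_retry=10, delay_seconds=0.0, sleep=time.sleep):
    """Call transfer() until it succeeds or the retry budget is used up.

    Returns the number of attempts made. Assumption: max_retry counts retries
    after the first attempt, so at most max_retry + 1 calls are made. The last
    OSError is re-raised when the budget is exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            transfer()
            return attempts
        except OSError:
            if attempts > max_retry:
                raise
            sleep(delay_seconds)  # wait before the next attempt
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting. A production scheduler would also need to tell transient failures from permanent ones, which this sketch does not attempt.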
Failure Recovery
[Figure: failure events annotated on the slide.]
- DiskRouter reconfigured and restarted
- UniTree not responding
- SDSC cache reboot & UW CS Network outage
- SRB server maintenance
For more information on the results of this study, please check: http://www.cs.wisc.edu/condor/stork/
Conclusions
- Stork makes data placement a “first class citizen”.
- Stork is the Condor of the data placement world.
- Stork is fault tolerant, easy to use, modular, extendible, and very flexible.
Future Work
- More intelligent scheduling
- Data level management instead of file level management
- Checkpointing for transfers
- Security
You don’t have to FedEx your data anymore. Stork delivers it for you!
For more information
Drop by my office anytime. Room: 3361, Computer Science & Stats. Bldg.
Email to: kosart@cs.wisc.edu