The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab
Gabriele Garzoglio, ACAT 2003
Overview
- Introduction
- The grid-level services: an overview
  - Job Management
- The fabric-level services
  - Local batch system adaptation
  - Dynamic product retrieval
  - Local sandbox management
  - Job complex-status logging
Introduction
- SAM is a Data Handling System for HEP: the project was started in 1997 by DZero
- SAM-Grid project started in to handle DZero’s expanded needs for globally distributed computing
- CDF joined SAM-Grid at the end of 2002
- JIM complements the data handling system (SAM) with Job and Info Management: SAM-Grid = JIM + SAM
- JIM is funded by PPDG and GridPP
- Participated at SC02 and SC03
Job Management
[Diagram. Labels: JOB, User Interface, Submission Client, Queuing System, Broker, Match Making Service, Information Collector, Execution Site #1 … Execution Site #n, Computing Element, Storage Element, Grid Sensors, Data Handling System]
Running jobs on Grid resources: the trend
- Grid resources are not dedicated to a single experiment
- Translation:
  - no daemons running on the worker nodes of a batch system
  - no experiment-specific software installed
Running jobs on Grid resources: today
- The situation is in transition:
  - Generally, experiments can install specific services on a node close to the cluster
  - Worker nodes typically access the software via a shared FS: not scalable!
  - Local resource configurations are still too diverse to plug easily into the Grid
- Today, most of our effort goes into coping with (the lack of) standard local fabric services
Motivation: local batch system adaptation
- Problem: “standard” grid batch system adapters (globus job-managers) are too restrictive to fit all the local configurations
- Examples:
  - the terms of the agreement for using the batch system can be expressed with special directives to the batch system
  - system administrators end up writing wrappers around the standard batch system commands
SAM Batch System Adapter
- We factor out the local batch system configuration using an intermediate layer that abstracts the basic interactions with the batch system:
  - submit command
  - lookup command
  - remove command
- For each of the commands above, the administrator can specify how to parse the output to extract the relevant information, e.g. the local job id when submitting
- We have written JIM globus job managers that use this layer
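The adapter idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAM-Grid code: the class `BatchCommand`, the table `SITE_COMMANDS`, the wrapper commands `site_submit`, `site_status` and `site_remove`, and their output formats are all hypothetical. Each interaction pairs a command template with a regular expression that the administrator supplies to parse the output.

```python
import re
import shlex
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, List


def run_locally(argv: List[str]) -> str:
    """Default runner: execute the command and return its standard output."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


@dataclass
class BatchCommand:
    """One batch-system interaction: a command template plus an output parser."""
    template: str  # command line with {placeholders}, filled in at call time
    pattern: str   # regex with named groups that pick out the relevant fields

    def execute(self, runner: Callable[[List[str]], str] = run_locally,
                **params: str) -> Dict[str, str]:
        argv = shlex.split(self.template.format(**params))
        output = runner(argv)
        match = re.search(self.pattern, output)
        if match is None:
            raise ValueError(f"cannot parse output of {argv[0]!r}: {output!r}")
        return match.groupdict()


# Hypothetical per-site configuration: command names and output formats are
# invented for this sketch; a real site would describe its own batch system.
SITE_COMMANDS = {
    "submit": BatchCommand("site_submit {script}", r"job (?P<job_id>\d+) submitted"),
    "lookup": BatchCommand("site_status {job_id}", r"state=(?P<state>\w+)"),
    "remove": BatchCommand("site_remove {job_id}", r"job (?P<job_id>\d+) removed"),
}
```

A job manager written against this layer would call, for example, `SITE_COMMANDS["submit"].execute(script="run.sh")` and read the local job id from the returned dictionary. Passing a different `runner` makes it possible to exercise the parsing rules without a batch system.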
Motivation: dynamic product retrieval
- Portability of the software for DZero and CDF is still not a completely solved problem
- Most of the CDF and DZero applications still rely on the offline software being preinstalled at the site
- Administrators need to install and maintain the software at each site
- A job submitted to the grid must be able to execute at a site where its dependencies are installed
Old solution: software advertisement
- Administrators install the software at each site
- The JIM advertisement framework senses the new product and advertises it to the broker as one of the characteristics of the site
- Drawbacks:
  - the administrators still need to install the software
  - increased complexity of the advertisement framework: it needs to know how to detect the list of installed products
  - increased complexity of the broker: it needs to enforce the matching to the eligible sites
  - jobs running on old software versions may not find an eligible site
New solution: dynamic software retrieval
- Product developers store the software in SAM with appropriate metadata
- Before running a job at a site, the infrastructure asks SAM to deliver the products the job depends on
- The products live in the SAM cache and are automatically managed
- Drawbacks:
  - increased complexity of local job submission
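A minimal sketch of the retrieval step, assuming a site-local cache that fetches a product only when it is missing. The names `ProductCache`, `prepare_job` and the `fetch` callable are hypothetical stand-ins for the SAM delivery request, and the least-recently-used eviction is an assumption of this sketch: the talk only says the cache is managed automatically.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List, Tuple

Product = Tuple[str, str]  # (name, version)


class ProductCache:
    """Site-local product cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, fetch: Callable[[Product], str], capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch        # hypothetical stand-in for a delivery request
        self._capacity = capacity  # maximum number of cached products
        self._entries: "OrderedDict[Product, str]" = OrderedDict()

    def deliver(self, product: Product) -> str:
        """Return the local path of a product, fetching it only on a cache miss."""
        if product in self._entries:
            self._entries.move_to_end(product)  # mark as most recently used
        else:
            self._entries[product] = self._fetch(product)
            while len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return self._entries[product]


def prepare_job(dependencies: Iterable[Product], cache: ProductCache) -> List[str]:
    """Ask for every product the job depends on before the job starts."""
    return [cache.deliver(product) for product in dependencies]
```

With this shape the job description only lists its dependencies; whether a product is already present at the site is decided at delivery time rather than at match-making time.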
Nomenclature
- Input sandbox:
  - from the client (user sandbox):
    - the executable
    - configuration files
    - special dependencies (libraries, products, …)
  - from the local site:
    - the product dependencies
- Output sandbox:
  - stdout, stderr
  - log files
  - small custom output (e.g. histograms)
Requirements
- We want an infrastructure that:
  - locally stores the user sandbox (from the Grid) at the site
  - transports and installs the input sandbox to the worker node
  - packages the output and hands it over to the Grid
Limitations to overcome
- the file transport mechanism of a batch system is site specific and needs to be factored out
- shared file systems have scalability limits: we want to rely on them as little as possible
- the worker nodes may have connectivity restrictions (firewalls)
The sandbox management 1
- It creates a sandbox area (reorganizing the native globus gass cache)
- It starts up a gridftp server for the communications between worker nodes and head node (no shared FS)
- It requests the delivery of the product dependencies
- It creates a self-extracting archive that contains the gridftp client and a bootstrapping script; when run, the script transfers and installs the product dependencies, then passes control to the application
The sandbox management 2
- It submits parallel instances of the self-extracting archive to the batch system
- The job relies on SAM for large input/output file transfers
- When the job finishes, stdout/stderr and the custom output are packaged at the head node to be transported back to the submission site via grid mechanisms
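The self-extracting archive can be illustrated with a short Python sketch. It is not the SAM-Grid implementation: the function `build_self_extracting`, the stub text and the file name `bootstrap.py` are hypothetical, and the gridftp transfer of the product dependencies is left out. The sketch only shows the packaging idea: a single file that unpacks itself into a scratch directory on the worker node and then passes control to a bootstrap script.

```python
import base64
import io
import tarfile
from pathlib import Path
from typing import Dict

# Stub executed on the worker node: it unpacks the embedded payload into a
# fresh scratch directory and hands control to the bootstrap script.
STUB = """\
import base64, io, subprocess, sys, tarfile, tempfile
PAYLOAD = "{payload}"
workdir = tempfile.mkdtemp(prefix="sandbox_")
with tarfile.open(fileobj=io.BytesIO(base64.b64decode(PAYLOAD)), mode="r:gz") as tar:
    tar.extractall(workdir)
# Pass control to the bootstrap script, forwarding any arguments.
sys.exit(subprocess.call([sys.executable, "bootstrap.py"] + sys.argv[1:], cwd=workdir))
"""


def build_self_extracting(files: Dict[str, bytes], out_path: str) -> Path:
    """Pack `files` (which must include bootstrap.py) into one runnable script."""
    if "bootstrap.py" not in files:
        raise ValueError("the archive needs a bootstrap.py entry point")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    payload = base64.b64encode(buffer.getvalue()).decode("ascii")
    out = Path(out_path)
    out.write_text(STUB.format(payload=payload))
    return out
```

Because the result is one ordinary file, it can be handed to the batch system as the job script, and the same file can be submitted many times for parallel instances without relying on a shared file system.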
Open problems
- Not all batch systems allow the selection of a node with sufficient scratch space to install the needed software
- We would greatly simplify this infrastructure if there were a “standard” local storage service at all the sites (e.g. DiskFarm)
Motivation: job complex-status logging
- Distributed logging of job status/history
- Web monitoring
- Statistics on historical data
- Grid scheduling based upon job status/history at a certain site
The XML DB Status Logger
- The status of the job is reported to an XML database deployed at each execution site
- The information comes from the local batch system (simple job status, e.g. “idle”, “running”, …) AND from the application (complex status, e.g. “Processing executable X in the chain”)
- The XML database gives flexible remote access via standard mechanisms, such as XPath
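To make the XPath access concrete, here is a small sketch that uses an in-memory ElementTree document as a stand-in for the site XML database. The element and attribute names (`jobs`, `job`, `status`, `id`, `source`) and the function `log_status` are assumptions of this example; the talk does not describe the actual schema. Python's `xml.etree.ElementTree` supports only a subset of XPath, which is enough for these queries.

```python
import xml.etree.ElementTree as ET


def log_status(db: ET.Element, job_id: str, source: str, status: str) -> None:
    """Append a status record for a job, creating the job entry if needed."""
    job = db.find(f"./job[@id='{job_id}']")
    if job is None:
        job = ET.SubElement(db, "job", id=job_id)
    ET.SubElement(job, "status", source=source).text = status


# Hypothetical schema: one <job> element per job, one <status> child per report.
db = ET.Element("jobs")
log_status(db, "42", "batch", "idle")
log_status(db, "42", "batch", "running")
log_status(db, "42", "application", "Processing executable X in the chain")
log_status(db, "43", "batch", "idle")

# Simple status history of job 42, as reported by the batch system.
batch_history = [s.text for s in db.findall("./job[@id='42']/status[@source='batch']")]

# Most recent report for job 42, whatever its source.
latest = db.find("./job[@id='42']/status[last()]").text
```

Keeping the batch-system and application reports as sibling records under the same job means one query can return either the simple state history or the latest complex status.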
Conclusions
- The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
- The SAM-Grid adopts Fabric-level configurable solutions for batch system adaptation, product delivery, sandboxing and job complex-status logging
- The community needs to come up with standard fabric-level services to make any Grid usable
More info at…
- Morag Burgon-Lyon’s talk on SAM-Grid for CDF!