The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.


Gabriele Garzoglio, ACAT 2003

Overview
Introduction
The grid-level services: an overview
  Job Management
The fabric-level services
  Local batch system adaptation
  Dynamic product retrieval
  Local sandbox management
  Job complex-status logging

Introduction
SAM is a data handling system for HEP; the project was started in 1997 by DZero.
The SAM-Grid project was started in … to handle DZero's expanded needs for globally distributed computing.
CDF joined SAM-Grid at the end of 2002.
JIM complements the data handling system (SAM) with Job and Information Management: SAM-Grid = JIM + SAM.
JIM is funded by PPDG and GridPP.
Participated in SC02 and SC03.

Gabriele Garzoglio, ACAT 2003 Overview Introduction  The grid-level services: an overview Job Management The fabric-level services Local batch system adaptation Dynamic product retrieval Local sandbox management Job complex-status logging

[Figure: SAM-Grid job management architecture. Components shown: the Submission Client (User Interface, Queuing System); Job Management (User Interface, Broker, Match Making Service, Information Collector); Execution Sites #1 … #n, each with a Computing Element, Queuing System, Grid Sensors, Storage Elements and the Data Handling System.]
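Judging from its labels, the diagram has the broker's Match Making Service choose an execution site using the site information that the Information Collector gathers from the Grid Sensors. The following is a minimal sketch of that idea only. The `SiteAd` record, the exact-equality requirement check and the ranking by free CPUs are all assumptions for illustration, not the SAM-Grid broker's actual algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class SiteAd:
    """Hypothetical site advertisement, as an information collector might hold it."""
    name: str
    free_cpus: int
    attributes: dict[str, str] = field(default_factory=dict)


def match(requirements: dict[str, str], ads: list[SiteAd]) -> list[SiteAd]:
    """Keep the sites that satisfy every job requirement, best-ranked first.

    Exact attribute equality and ranking by free CPUs are illustrative choices.
    """
    eligible = [
        ad for ad in ads
        if all(ad.attributes.get(key) == value for key, value in requirements.items())
    ]
    return sorted(eligible, key=lambda ad: ad.free_cpus, reverse=True)


if __name__ == "__main__":
    ads = [
        SiteAd("site1", free_cpus=10, attributes={"experiment": "dzero"}),
        SiteAd("site2", free_cpus=40, attributes={"experiment": "dzero"}),
        SiteAd("site3", free_cpus=99, attributes={"experiment": "cdf"}),
    ]
    print([ad.name for ad in match({"experiment": "dzero"}, ads)])  # ['site2', 'site1']
```

Separating requirements (hard filters) from rank (a preference among eligible sites) keeps the broker's policy easy to change without touching the advertisement format.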

Gabriele Garzoglio, ACAT 2003 Overview Introduction The grid-level services: an overview Job Management  The fabric-level services Local batch system adaptation Dynamic product retrieval Local sandbox management Job complex-status logging

Running jobs on Grid resources: the trend
Grid resources are not dedicated to a single experiment. In practice this means:
no daemons running on the worker nodes of a batch system
no experiment-specific software installed

Running jobs on Grid resources: today
The situation is in transition:
Generally, experiments can install specific services on a node close to the cluster.
Worker nodes typically access the software via a shared file system, which does not scale.
Local resource configurations are still too diverse to plug easily into the Grid.
Today, most of our effort goes into coping with the lack of standard local fabric services.

Gabriele Garzoglio, ACAT 2003 Overview Introduction The grid-level services: an overview Job Management The fabric-level services  Local batch system adaptation Dynamic product retrieval Local sandbox management Job complex-status logging

Motivation
Problem: the "standard" grid batch system adapters (Globus job managers) are too restrictive to fit all local configurations. Examples:
the terms of the agreement for using the batch system can be expressed as special directives to the batch system
system administrators end up writing wrappers around the standard batch system commands

SAM Batch System Adapter
We factor out the local batch system configuration with an intermediate layer that abstracts the basic interactions with the batch system:
submit command
lookup command
remove command
For each of these commands, the administrator can specify how to parse the output to extract the relevant information (e.g. the local job ID returned at submission).
We have written JIM Globus job managers that use this layer.
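The adapter idea (three configurable commands, each paired with a rule for parsing its output) can be sketched as follows. This is a hypothetical illustration, not the JIM code. The `BatchCommand` configuration format, the use of `str.format` placeholders and regular expressions with named groups (`job_id`, `status`) are all assumptions, and `echo` stands in for a real batch system so the example runs anywhere.

```python
import re
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class BatchCommand:
    """One administrator-configured batch operation (hypothetical configuration format)."""
    template: str              # command line with {placeholders}, e.g. "qsub {script}"
    output_pattern: str = ""   # regex with named groups used to parse stdout; "" = ignore output


class BatchAdapter:
    """Intermediate layer hiding the local submit/lookup/remove commands."""

    def __init__(self, submit: BatchCommand, lookup: BatchCommand, remove: BatchCommand):
        self.commands = {"submit": submit, "lookup": lookup, "remove": remove}

    def _run(self, name: str, **params: str) -> dict[str, str]:
        cmd = self.commands[name]
        argv = shlex.split(cmd.template.format(**params))
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        if not cmd.output_pattern:
            return {}
        match = re.search(cmd.output_pattern, result.stdout)
        if match is None:
            raise RuntimeError(f"{name}: cannot parse batch output {result.stdout!r}")
        return match.groupdict()

    def submit(self, script: str) -> str:
        return self._run("submit", script=script)["job_id"]   # local job ID

    def lookup(self, job_id: str) -> str:
        return self._run("lookup", job_id=job_id)["status"]

    def remove(self, job_id: str) -> None:
        self._run("remove", job_id=job_id)


if __name__ == "__main__":
    # `echo` stands in for a real batch system so the sketch runs anywhere.
    adapter = BatchAdapter(
        submit=BatchCommand("echo Your job 1234 ({script}) has been submitted",
                            r"Your job (?P<job_id>\d+)"),
        lookup=BatchCommand("echo {job_id} running", r"^\d+ (?P<status>\w+)"),
        remove=BatchCommand("echo removed {job_id}"),
    )
    job_id = adapter.submit("run.sh")
    print(job_id, adapter.lookup(job_id))  # 1234 running
    adapter.remove(job_id)
```

A site whose administrators wrap the batch commands only changes the templates and patterns; the job-manager code calling `submit`/`lookup`/`remove` stays the same. Shell quoting of parameters that contain spaces is left out for brevity.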

Gabriele Garzoglio, ACAT 2003 Overview Introduction The grid-level services: an overview Job Management The fabric-level services Local batch system adaptation  Dynamic product retrieval Local sandbox management Job complex-status logging

Motivation
Portability of the DZero and CDF software is still not completely solved.
Most CDF and DZero applications still rely on the offline software being preinstalled at the site.
Administrators need to install and maintain the software at each site.
A job submitted to the Grid must be able to execute at a site where its dependencies are installed.

Old solution: software advertisement
Administrators install the software at each site.
The JIM advertisement framework senses the new product and advertises it to the broker as one of the characteristics of the site.
Drawbacks:
the administrators still need to install the software
increased complexity of the advertisement framework: it needs to know how to detect the list of installed products
increased complexity of the broker: it needs to enforce matching to the eligible sites
jobs that depend on old software versions may not find an eligible site
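The two complexity drawbacks of software advertisement can be made concrete with a minimal sketch. Here the site sensor must discover what is installed, and the broker must check every dependency against what each site advertises. The `<root>/<product>/<version>/` installation layout and the product name `reco` are hypothetical assumptions, not how the JIM advertisement framework actually detects products.

```python
import tempfile
from pathlib import Path


def detect_products(products_root: Path) -> dict[str, set[str]]:
    """Site sensor: list installed products, assuming a <root>/<product>/<version>/ layout."""
    installed: dict[str, set[str]] = {}
    for product_dir in products_root.iterdir():
        if product_dir.is_dir():
            installed[product_dir.name] = {v.name for v in product_dir.iterdir() if v.is_dir()}
    return installed


def eligible_sites(dependencies: dict[str, str],
                   advertised: dict[str, dict[str, set[str]]]) -> list[str]:
    """Broker side: a site is eligible only if it advertises every product/version the job needs."""
    return sorted(
        site for site, products in advertised.items()
        if all(version in products.get(product, set())
               for product, version in dependencies.items())
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for version in ("v1", "v2"):
            (root / "reco" / version).mkdir(parents=True)
        site_a = detect_products(root)   # detects reco versions v1 and v2
    advertised = {"siteA": site_a, "siteB": {"reco": {"v2"}}}
    print(eligible_sites({"reco": "v2"}, advertised))  # ['siteA', 'siteB']
    print(eligible_sites({"reco": "v0"}, advertised))  # []
```

The last call shows the fourth drawback: once no site keeps an old version installed, a job that depends on it has nowhere to run.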

New solution: dynamic software retrieval
Product developers store the software in SAM with the appropriate metadata.
Before running a job at a site, the infrastructure asks SAM to deliver the products the job depends on.
The products live in the SAM cache and are managed automatically.
Drawback: increased complexity of local job submission
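Dynamic retrieval turns product installation into a cache lookup: on a miss the infrastructure requests delivery, and on a hit it reuses the local copy. The sketch below makes several assumptions. It uses least-recently-used eviction as the cache policy, `fetch` is a hypothetical stand-in for a SAM delivery request (not a real SAM API), and eviction only drops bookkeeping rather than deleting files.

```python
from collections import OrderedDict
from typing import Callable


class ProductCache:
    """Site-local product cache with least-recently-used eviction (illustrative sketch).

    `fetch(product, version)` stands in for asking SAM to deliver a product and
    returns its local path; it is a hypothetical callable, not a real SAM API.
    """

    def __init__(self, capacity: int, fetch: Callable[[str, str], str]):
        self.capacity = capacity
        self.fetch = fetch
        self._entries: "OrderedDict[tuple[str, str], str]" = OrderedDict()

    def ensure(self, product: str, version: str) -> str:
        key = (product, version)
        if key in self._entries:
            self._entries.move_to_end(key)       # cache hit: mark as most recently used
            return self._entries[key]
        path = self.fetch(product, version)      # cache miss: request delivery
        self._entries[key] = path
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)    # forget the least recently used product
        return path

    def prepare_job(self, dependencies: dict[str, str]) -> list[str]:
        """Make every dependency available locally before the job starts."""
        return [self.ensure(product, version) for product, version in dependencies.items()]


if __name__ == "__main__":
    delivered = []

    def fake_sam_fetch(product: str, version: str) -> str:
        delivered.append((product, version))
        return f"/cache/{product}/{version}"

    cache = ProductCache(capacity=2, fetch=fake_sam_fetch)
    print(cache.prepare_job({"reco": "v2", "geometry": "v1"}))
    print(cache.prepare_job({"reco": "v2"}), delivered)  # second request is a cache hit
```

A production cache would also pin products that running jobs still use, so that eviction cannot remove them mid-job.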

Gabriele Garzoglio, ACAT 2003 Overview Introduction The grid-level services: an overview Job Management The fabric-level services Local batch system adaptation Dynamic product retrieval  Local sandbox management Job complex-status logging

Nomenclature
Input sandbox:
  from the client (the user sandbox): the executable, configuration files, special dependencies (libraries, products, …)
  from the local site: the product dependencies
Output sandbox: stdout, stderr, log files, small custom output (e.g. histograms)
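The nomenclature maps naturally onto two record types. This is a hypothetical data model for illustration only; the class and field names are not taken from the SAM-Grid code.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class InputSandbox:
    # Shipped by the client: the "user sandbox"
    executable: Path
    config_files: list[Path] = field(default_factory=list)
    special_dependencies: list[Path] = field(default_factory=list)  # libraries, products, ...
    # Resolved at the execution site: product name -> version
    product_dependencies: dict[str, str] = field(default_factory=dict)

    def user_sandbox(self) -> list[Path]:
        """Files that travel from the client through the Grid."""
        return [self.executable, *self.config_files, *self.special_dependencies]


@dataclass
class OutputSandbox:
    stdout: Path
    stderr: Path
    log_files: list[Path] = field(default_factory=list)
    custom_output: list[Path] = field(default_factory=list)  # small files only, e.g. histograms

    def files(self) -> list[Path]:
        """Everything handed back to the Grid when the job ends."""
        return [self.stdout, self.stderr, *self.log_files, *self.custom_output]
```

Keeping the product dependencies as names and versions, rather than files, reflects the split above: only the user sandbox crosses the Grid, while products are resolved at the site.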

Requirements
We want an infrastructure that:
locally stores the user sandbox (received from the Grid) at the site
transports the input sandbox to the worker node and installs it there
packages the output and hands it over to the Grid

Limitations to overcome
the file transport mechanism of a batch system is site-specific and needs to be factored out
shared file systems have scalability limits: we want to rely on them as little as possible
the worker nodes may have connectivity restrictions (firewalls)

The sandbox management (1)
It creates a sandbox area (reorganizing the native Globus GASS cache).
It starts up a GridFTP server for communication between the worker nodes and the head node (no shared file system).
It requests the delivery of the product dependencies.
It creates a self-extracting archive containing the GridFTP client and a bootstrapping script. When run, the script transfers and installs the product dependencies, then passes control to the application.
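A self-extracting archive is a shell header followed by a compressed payload. The header finds where the payload starts, unpacks it into a scratch directory and hands over to a bootstrap script. The sketch below builds one with Python's `tarfile` and assumes a POSIX `sh`, `tail`, `tar` and `mktemp` on the worker node. In the system described on the slide, the payload would also carry the GridFTP client and the bootstrap would fetch products with it; here the bootstrap content is simply supplied by the caller. `build_self_extracting` is a hypothetical name.

```python
import io
import stat
import subprocess
import tarfile
import tempfile
from pathlib import Path

# Shell header placed in front of a gzipped tarball. @ARCHIVE_LINE@ becomes the
# line number at which the binary payload starts. WORKDIR is not cleaned up here.
HEADER_TEMPLATE = """#!/bin/sh
# Self-extracting job sandbox (illustrative sketch)
set -e
WORKDIR=$(mktemp -d)
tail -n +@ARCHIVE_LINE@ "$0" | tar -C "$WORKDIR" -xzf -
cd "$WORKDIR"
exec sh ./bootstrap.sh "$@"
"""


def build_self_extracting(output: Path, files: dict[str, Path], bootstrap: str) -> Path:
    """Pack `files` plus a bootstrap script into one executable shell archive."""
    payload = io.BytesIO()
    with tarfile.open(fileobj=payload, mode="w:gz") as tar:
        for arcname, path in files.items():
            tar.add(str(path), arcname=arcname)
        script = bootstrap.encode()
        info = tarfile.TarInfo("bootstrap.sh")
        info.size = len(script)
        info.mode = 0o755
        tar.addfile(info, io.BytesIO(script))

    archive_line = HEADER_TEMPLATE.count("\n") + 1   # first line after the header
    header = HEADER_TEMPLATE.replace("@ARCHIVE_LINE@", str(archive_line))
    output.write_bytes(header.encode() + payload.getvalue())
    output.chmod(output.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return output


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / "job.cfg").write_text("events=100\n")
        # A real bootstrap would first fetch product dependencies from the head node.
        archive = build_self_extracting(
            tmp_path / "sandbox.sh",
            {"job.cfg": tmp_path / "job.cfg"},
            "#!/bin/sh\ncat job.cfg\n",
        )
        result = subprocess.run(["sh", str(archive)], capture_output=True, text=True, check=True)
        print(result.stdout, end="")  # events=100
```

The `exec` in the header matters: it replaces the shell before it can try to read the binary payload as commands. Because the archive is a single file, it can be handed to any batch system's native submit command.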

The sandbox management (2)
It submits parallel instances of the self-extracting archive to the batch system.
The job relies on SAM for transfers of large input/output files.
When the job finishes, stdout/stderr and custom output are packaged at the head node and transported back to the submission site via Grid mechanisms.
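Packaging the output of parallel instances at the head node amounts to collecting a few small files per instance into one archive, so that a single file travels back over the Grid. The sketch below makes several assumptions: one directory per instance, files literally named `stdout` and `stderr`, and custom output matching `*.hist`. The function name `package_output` is hypothetical. Large outputs are deliberately excluded, since those go through SAM.

```python
import tarfile
import tempfile
from pathlib import Path


def package_output(instance_dirs: list[Path], dest: Path,
                   custom_patterns: tuple[str, ...] = ("*.hist",)) -> list[str]:
    """Bundle stdout, stderr and small custom outputs of all parallel instances.

    The file names `stdout`/`stderr` and the `*.hist` pattern are illustrative
    assumptions; large outputs are expected to travel through SAM instead.
    """
    packed: list[str] = []
    with tarfile.open(dest, "w:gz") as tar:
        for instance in instance_dirs:
            candidates = [instance / "stdout", instance / "stderr"]
            for pattern in custom_patterns:
                candidates.extend(sorted(instance.glob(pattern)))
            for path in candidates:
                if path.is_file():
                    arcname = f"{instance.name}/{path.name}"   # keep instances apart
                    tar.add(str(path), arcname=arcname)
                    packed.append(arcname)
    return packed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in range(2):
            (root / f"instance_{i}").mkdir()
            (root / f"instance_{i}" / "stdout").write_text(f"instance {i} done\n")
        (root / "instance_0" / "energy.hist").write_text("0 12\n1 30\n")
        print(package_output(sorted(root.glob("instance_*")), root / "output.tgz"))
```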

Open problems
Not all batch systems allow selecting a node with enough scratch space to install the needed software.
This infrastructure would be greatly simplified if there were a "standard" local storage service at all sites (e.g. DiskFarm).

Gabriele Garzoglio, ACAT 2003 Overview Introduction The grid-level services: an overview Job Management The fabric-level services Local batch system adaptation Dynamic product retrieval Local sandbox management  Job complex-status logging

Motivation
distributed logging of job status/history
Web monitoring
statistics on historical data
Grid scheduling based on job status/history at a given site

The XML DB Status Logger
The status of the job is reported to an XML database deployed at each execution site.
The information comes both from the local batch system (simple status, e.g. "idle", "running", …) and from the application (complex status, e.g. "Processing executable X in the chain").
The XML database gives flexible remote access via standard mechanisms such as XPath.
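Here is a minimal sketch of how simple and complex status records can share one XML document and be queried with XPath-style paths. An in-memory `xml.etree.ElementTree` tree stands in for the XML database. The element and attribute names (`jobs`, `job`, `status`, `id`, `source`) are assumptions, and ElementTree supports only a subset of XPath. Job IDs are assumed not to contain quote characters.

```python
import xml.etree.ElementTree as ET


class StatusLog:
    """Per-site job status log held as an XML tree (stand-in for the XML database)."""

    def __init__(self) -> None:
        self.root = ET.Element("jobs")

    def _job(self, job_id: str) -> ET.Element:
        job = self.root.find(f"job[@id='{job_id}']")
        if job is None:
            job = ET.SubElement(self.root, "job", id=job_id)
        return job

    def report(self, job_id: str, source: str, status: str) -> None:
        """Append a status record; `source` is "batch" (simple) or "application" (complex)."""
        record = ET.SubElement(self._job(job_id), "status", source=source)
        record.text = status

    def history(self, job_id: str) -> list[tuple[str, str]]:
        return [(s.get("source"), s.text)
                for s in self.root.findall(f"job[@id='{job_id}']/status")]

    def jobs_that_reached(self, status: str) -> list[str]:
        # [tag='text'] selects elements having a child <status> with exactly this text
        return [job.get("id") for job in self.root.findall(f"job[status='{status}']")]


if __name__ == "__main__":
    log = StatusLog()
    log.report("42", "batch", "idle")
    log.report("42", "batch", "running")
    log.report("42", "application", "Processing executable X in the chain")
    log.report("43", "batch", "idle")
    print(log.jobs_that_reached("idle"))   # ['42', '43']
    print(ET.tostring(log.root, encoding="unicode"))
```

Keeping the history as a sequence of records, rather than overwriting a single status field, is what makes the statistics and history-based scheduling listed under Motivation possible.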

Conclusions
The SAM-Grid offers an extensible, working framework for Grid-level job, data and information management.
The SAM-Grid adopts configurable fabric-level solutions for batch system adaptation, product delivery, sandboxing and job complex-status logging.
The community needs to come up with standard fabric-level services to make any Grid usable.

More info: Morag Burgon-Lyon's talk on SAM-Grid for CDF!