The SAM-Grid and the use of Condor-G as a grid job management middleware
Gabriele Garzoglio for the SAM-Grid Team
Fermilab, Computing Division
Apr 16, 2004

Overview
- Computation in High Energy Physics
- The SAM-Grid computing infrastructure
- The Job Management and Condor-G
- Real life experience
- Future work

High Energy Physics Challenges
- High Energy Physics studies the fundamental interactions of Nature.
- A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field: the collaborations are geographically distributed.
- Experiments become more challenging and expensive with every decade: the collaborations are large groups of people.
- The phenomena studied are statistical in nature and the interesting events are very rare: a large amount of data (statistics) is needed.

A HEP laboratory: Fermilab

FNAL Run II detectors

FNAL Run II detectors: DZero

The Size of the D0 Collaboration
- ~500 physicists
- 72 institutions
- 18 countries
(figure: DZero and CDF institutions)

Data size for the D0 Experiment
- Detector data: 1,000,000 channels, event size 250 KB, event rate ~50 Hz, on-line data rate 12 MBps, 100 TB/year.
- Total data (detector, reconstructed, simulated): 400 TB/year.
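For orientation, the quoted on-line rate and yearly detector volume are consistent with the event size and rate; the live time of order 10^7 seconds per year used below is an assumption for the cross-check, not a number from the slide:

\[ 250\ \mathrm{KB/event} \times 50\ \mathrm{events/s} \approx 12\ \mathrm{MB/s}, \qquad 12\ \mathrm{MB/s} \times \mathcal{O}(10^{7}\ \mathrm{s/year}) \approx \mathcal{O}(100)\ \mathrm{TB/year}. \]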

Typical DZero activities

Overview → The SAM-Grid computing infrastructure

The SAM-Grid Project
- Mission: enable fully distributed computing for DZero and CDF.
- Strategy: enhance the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing (JIM).
- History: SAM since 1997, JIM since the end of 2001.
- Funding: the Particle Physics Data Grid (US) and GridPP (UK).
- People: computer scientists and physicists from Fermilab and the collaborating institutions.

Overview → The Job Management and Condor-G

Job Management: Requirements
- Foster site autonomy.
- Operate in batch mode: submit and disconnect.
- Reliability: handle the job request persistently; execute it and retrieve output and/or errors.
- Flexible automatic resource selection: optimization of various metrics/policies.
- Fault tolerance: survive transient service disruptions; automatic rematching and resubmission capabilities.
- Automatic execution of complex interdependent job structures.

Service Architecture
(Architecture diagram. Client side: a Grid Client with submission user interfaces feeding a Global Job Queue. Global services: the Resource Selector / Match Making service, Info Collector, Info Gatherer, Resource Optimizer, and the global data handling services (SAM naming server, SAM log server, SAM DB server, RC/MetaData catalog, bookkeeping service). Each execution site: a Grid Gateway with JIM Advertise, an Info Manager with Info Providers (XML DB server, site configuration, global/local job ID map, MDS), a Local Job Handler (CAF, D0MC, BS, ...) in front of the local job handling cluster and worker nodes, AAA and distributed file system services, MSS and cache, the SAM station with its stagers and other data handling services, and a site web server with grid monitoring user tools. The diagram shows the flow of job, data and meta-data.)

Technological choices (2001)
- Low level resource management: Globus GRAM. Clearly not enough on its own...
- Condor-G: the right components and functionalities, but not enough at the time. DZero and the Condor team have been collaborating ever since, under the auspices of PPDG, to address the requirements of a large distributed system with distributively owned and shared resources (a baseline Condor-G submission of that era is sketched below).
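As a point of reference, a plain Condor-G job of that era was bound to a single, statically chosen gatekeeper. The sketch below is illustrative only: the gatekeeper host, job manager and executable names are hypothetical, and it assumes the globus-universe submit syntax of early Condor-G.

    # Minimal Condor-G submit description (sketch; hypothetical values).
    # The gatekeeper is chosen by hand: no brokering, no resource selection.
    universe        = globus
    globusscheduler = gatekeeper.example.org:2119/jobmanager-pbs
    executable      = run_montecarlo.sh
    output          = job.out
    error           = job.err
    log             = job.log
    queue

The SAM-Grid enhancements described in the next slides remove exactly this limitation: the gatekeeper becomes a value that the match making service fills in at negotiation time.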

Condor-G: added functionalities I
- Use of the Condor Match Making Service (MMS) as Grid Resource Selector.
- Advertisement of grid site capabilities to the MMS.
- Dynamic $$(gatekeeper) selection for jobs that specify requirements on grid sites (see the sketch below).
- Concurrent submission of multiple jobs to the same grid resource: at any given moment a grid site is capable of accepting up to N jobs, so the MMS was modified to push up to N jobs to the same site within the same negotiation cycle.
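The following sketch, modeled on the ClassAds shown later in this talk, illustrates the idea: the job leaves the gatekeeper unspecified and constrains the match through requirements on the attributes advertised by SAM-Grid sites (the names ending in an underscore). The wrapper script name is hypothetical, and the submit syntax is meant as an illustration rather than the production SAM-Grid configuration.

    # Job description relying on the MMS for resource selection (sketch).
    universe        = globus
    # Not hard-coded: resolved at match time from the selected site's ClassAd.
    globusscheduler = $$(gatekeeper_url_)
    # Hypothetical wrapper script.
    executable      = samgrid_job_wrapper.sh
    # Constrain the match to suitable sites...
    requirements    = (TARGET.station_experiment_ == "d0" && TARGET.cluster_architecture_ == "Linux+2.4")
    # ...and pass site-specific values into the job environment.
    environment     = SAM_STATION=$$(station_name_);MATCH_RESOURCE_NAME=$$(name)
    queue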

Condor-G: added functionalities II
- Flexible Match Making logic: the job/resource match criteria should be arbitrarily complex (based on more information than what fits in the ClassAd), stateful (remembering the match history) and "pluggable" (by administrators and users).
- Example: send the job where most of the data are. The MMS contacts the site data handling service to rank a job/site match (see the note below).
- This leads to a very thin and flexible "grid broker".
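To see why external logic is needed, compare with what a plain ClassAd rank could do if data locality were advertised as a static site attribute; the attribute below is hypothetical and is precisely the kind of information that does not fit in a ClassAd, which is why SAM-Grid plugs an external ranking function into the MMS and lets it query SAM during the negotiation cycle instead.

    # Hypothetical: rank candidate sites by the fraction of the requested
    # dataset already cached there, if such an attribute were advertised.
    rank = TARGET.cached_fraction_of_dataset_
    # SAM-Grid cannot rely on such a static attribute: the modified MMS
    # calls pluggable external logic that contacts the site's SAM data
    # handling service to compute the ranking at match time.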

Condor-G: added functionalities III
- Light clients: a user should be able to submit a job from a laptop and then turn it off.
- The client software (condor_submit, etc.) and the queuing service (condor_schedd) should be on different machines.
- This leads to a three-tier architecture for Condor-G: client, queuing system, execution sites (see the sketch below).
- Security was implemented via X509 certificates.
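A minimal sketch of the user's view of the three-tier setup, assuming a thin client that only prepares the submit description and hands it to a remote queuing node; the proxy path is a hypothetical example following the usual Globus default location.

    # Addition to the job description for the three-tier setup (sketch;
    # the proxy path is hypothetical).
    # The user's X509 proxy travels with the job, so grid authentication
    # keeps working after the client machine goes offline.
    x509userproxy = /tmp/x509up_u12345

Once the description reaches the queuing tier (e.g. via condor_submit's remote-submission option, whose exact form depends on the Condor version), the laptop can disconnect: the schedd holds the job persistently and negotiates with the MMS on the user's behalf.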

Condor-G: added functionalities IV
- Resubmission/rematching logic: if the MMS matched a job to a site which cannot accept it after N submission attempts, the job should be rematched to a different site (see the sketch below).
- Flexible penalization of already failed matches.
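A sketch of how such a policy can be stated at the job level, assuming the globus_resubmit / globus_rematch expressions supported by Condor-G submit descriptions of that era; the threshold of three attempts is an arbitrary example, not the SAM-Grid production setting.

    # Retry policy for grid submission failures (sketch; example thresholds).
    # Let a held job be released so it can be retried automatically.
    periodic_release = (NumSystemHolds <= 3)
    # Resubmit to the same gatekeeper for the first few failures...
    globus_resubmit  = (NumSystemHolds <= 3)
    # ...then ask the match maker for a different site.
    globus_rematch   = (NumSystemHolds > 3)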

Overview → Real life experience

(Slide: deployment diagram and job descriptions. The diagram shows submission clients with their user interfaces feeding a queuing system; the broker, made of the Match Making Service with its external logic and the Information Collector, matches jobs to execution sites #1...#n; each execution site runs grid sensors, a computing element, storage elements and the data handling system.)

The user-level SAM-Grid job description:

    job_type            = montecarlo
    station_name        = ccin2p3-analysis
    runjob_requestid    = ...
    runjob_numevts      = ...
    d0_release_version  = p...
    jobfiles_dataset    = san_jobset2
    minbias_dataset     = ccin2p3_minbias_dataset
    sam_experiment      = d0
    sam_universe        = prd
    group               = test
    instances           = 1

The ClassAd advertised by an execution site to the Information Collector / Match Making Service:

    MyType                "Machine"
    TargetType            "Job"
    Name                  "ccin2p3-analysis.d0.prd.jobmanager-runjob"
    gatekeeper_url_       "ccd0.in2p3.fr:2119/jobmanager-runjob"
    DbURL                 "..."
    sam_nameservice_      "IOR: a49444c ..."
    station_name_         "ccin2p3-analysis"
    station_experiment_   "d0"
    station_universe_     "prd"
    cluster_architecture_ "Linux+2.4"
    cluster_name_         "LyonsGrid"
    local_storage_path_   "/samgrid/disk"
    local_storage_node_   "ccd0.in2p3.fr"
    schema_version_       "1_1"
    site_name_            "ccin2p3"
    ...

The corresponding Condor-G job ClassAd, with $$() references resolved at match time:

    MyType              "Job"
    TargetType          "Machine"
    ClusterId           304
    JobType             "montecarlo"
    GlobusResource      "$$(gatekeeper_url_)"
    Requirements        (TARGET.station_name_ == "ccin2p3-analysis" && ...)
    Rank                ...
    station_universe_   "prd"
    station_experiment_ "d0"
    RequestId           "11866"
    ProjectId           "sam_ccd0_012457_25321_0"
    DbURL               "$$(DbURL)"
    cert_subject        "/DC=org/DC=doegrids/OU=People/CN=Aditya Nishandar..."
    Env                 "MATCH_RESOURCE_NAME=$$(name); SAM_STATION=$$(station_name_); SAM_USER_NAME=aditya; ..."
    Args                "--requestId=11866" "--gridId=sam_ccd0_012457"

Montecarlo Production Statistics
- Started at the beginning of 2004 and ramped up in March.
- 3 sites: Wisconsin (...via Miron), Manchester, Lyon. New sites are joining (UTA, LU, OU, LTU, ...).
- Inefficiency due to the Grid infrastructure: much less than 5%.
- 30 GB/week = 80,000 events/week (about 1/4 of total production).

Overview → Future work

Future work of DZero with Condor
- Use of DAGMan to automate the management of interdependent grid job structures (see the sketch below).
- Address potential scalability limits.
- Investigate a non-central brokering service via grid flocking.
- Integrate/implement a proxy management infrastructure (e.g. MyProxy).
- All the rest (...fix bugs, improve error reporting, hand holding, sailing...).
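As an indication of what DAGMan adds, an interdependent job structure can be described declaratively and DAGMan then drives the submissions in the right order; the node names, submit files and retry count below are hypothetical examples, not an existing SAM-Grid workflow.

    # Hypothetical DAGMan description of an interdependent job structure.
    # Each node points to an ordinary (Condor-G) submit description.
    JOB  generate     generate.sub
    JOB  reconstruct  reconstruct.sub
    JOB  merge        merge.sub
    # Reconstruction runs only after generation succeeds, merging last.
    PARENT generate    CHILD reconstruct
    PARENT reconstruct CHILD merge
    # Retry a failed node a couple of times before giving up.
    RETRY  reconstruct 2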

Conclusions
- The collaboration between DZero and the Condor team has been very fruitful since it began in 2001.
- DZero has worked together with Condor to enhance the Condor-G framework, in order to address the distributed computing requirements of a large HEP experiment.
- DZero is running "production" jobs on the Grid.

Acknowledgments
- Condor Team
- PPDG
- DZero
- CDF

More info at: d0.fnal.gov/computing/grid/