The SAM-Grid / LCG Interoperability Test Bed
Gabriele Garzoglio
Speaker: Pierre Girard

Presentation transcript:

Gabriele Garzoglio, Sep 28, 2005

Overview
- The Interoperability Test Bed
- Motivations
- Architecture
- Status Report
- Lessons learned / Problems encountered
- Still discussing…
- Conclusions

Motivations for the interoperability project
- The SAM-Grid is a convenient meta-computing system for the Run II experiments because it offers:
  - transparent access to the experiment data through SAM
  - integrated application management (job environment preparation, application-sensitive policies, job aggregation)
- But deployment is expensive.
- The idea: DZero will increase its resource pool within the framework of LCG (EGEE), while relying on the SAM-Grid for data and application management.

Basic Architecture
[Diagram: SAM-Grid, the SAM-Grid / LCG Forwarding Node, LCG, and the SAM-Grid VO-specific services. Arrows: "Flow of Job Submission" and "Offers services to …"]
Main issues to track down:
- Accessibility of the services
- Usability of the resources
- Scalability

Service/Resource Multiplicity
[Diagram: the SAM-Grid connected through forwarding nodes to many LCG clusters and several VO services, across network boundaries. Legend: FW = Forwarding Node; C = LCG Cluster; S = VO-Service (SAM); arrows = Job Flow and Offers Service]

Current Test Bed Configuration
[Diagram: the SAM-Grid connected through forwarding nodes to LCG clusters and VO services, across network boundaries. Sites shown: Wuppertal, CCIN2P3, Clermont-Ferrand, Imperial College, RAL, Lancaster. Legend: FW = Forwarding Node; C = LCG Cluster; S = VO-Service (SAM); Integration in Progress; arrows = Job Flow and Offers Service]

Job Scheduling System Adaptation I
- The SAM-Grid sees the FW node as another gateway.
- The SAM-Grid has developed a grid-to-fabric interface (job-manager) that interacts with multiple fabric services (SAM, Monitoring, Environment Preparation); the batch system is one of them.
- Batch system adaptation is done through a layer of abstraction and implemented via robust local scheduler handlers.

Job Scheduling System Adaptation II
- This mechanism is flexible enough that it allowed the adaptation of SAM-Grid to LCG.
- Job management (submit, status poll, kill, output gathering, …) is implemented via an LCG "scheduler" handler.
- The handler uses the LCG UI to submit jobs to an LCG broker (logically part of the FW node; in practice it can be anywhere).
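The handler abstraction described on the two slides above can be pictured roughly as follows. This is only a sketch under stated assumptions: the names `SchedulerHandler`, `InMemoryHandler`, `HANDLERS`, and `get_handler` are hypothetical and do not come from the SAM-Grid code. A real LCG handler would wrap the LCG UI commands, where this stand-in keeps jobs in memory.

```python
from abc import ABC, abstractmethod


class SchedulerHandler(ABC):
    """Hypothetical interface a job manager could call for any batch system."""

    @abstractmethod
    def submit(self, job_description: str) -> str:
        """Submit a job and return a local job identifier."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return the current state of the job."""

    @abstractmethod
    def kill(self, job_id: str) -> None:
        """Cancel the job."""

    @abstractmethod
    def fetch_output(self, job_id: str) -> str:
        """Gather the job output."""


class InMemoryHandler(SchedulerHandler):
    """Stand-in handler (assumption): stores jobs in a dict, where a real
    handler would talk to a batch system or to the LCG UI."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job_description):
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"state": "submitted", "description": job_description}
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]["state"]

    def kill(self, job_id):
        self._jobs[job_id]["state"] = "killed"

    def fetch_output(self, job_id):
        return f"output of {self._jobs[job_id]['description']}"


# The job manager picks a handler by name; supporting a new back end
# (for example LCG) means registering one more class here.
HANDLERS = {"lcg": InMemoryHandler}


def get_handler(name):
    return HANDLERS[name]()
```

The point of the design is that the job manager only ever sees the four operations, so the forwarding node can treat LCG as one more "batch system".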

Status Report
- We can submit real DZero data reprocessing and Monte Carlo jobs to LCG via SAM-Grid.
- Jobs land on the available LCG clusters.
- Jobs rely on the SAM station at CCIN2P3 to handle input (binaries and data) and output.
- …see the SAM-Grid monitoring

Problems/Lessons Learned I
- Scratch management is the responsibility of the site OR the application.
- DZero requirements on local scratch space:
  - Cannot run on NFS because of intensive I/O
  - Need 4 GB of local space
- SAM-Grid uses job wrappers to do "smart" scratch management (find the best scratch area to use).
- These wrappers rely on the job managers to set up scratch variables ($TMP_DIR, …).
- Under discussion: one criterion for considering a cluster DZero-certified should be that the scratch variables are defined.
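As an illustration of what "smart" scratch selection could look like, here is a minimal sketch. It is not the SAM-Grid wrapper. `pick_scratch_dir` and `candidate_dirs` are hypothetical names, `TMPDIR` is an assumed extra variable next to the `TMP_DIR` named on the slide, and the sketch does not try to detect NFS mounts.

```python
import os
import shutil

REQUIRED_BYTES = 4 * 1024**3  # the 4 GB local-space requirement from the slide


def candidate_dirs(environ=os.environ):
    """Collect scratch directories advertised through environment variables.

    TMP_DIR comes from the slide; TMPDIR is an assumption for illustration.
    """
    names = ("TMP_DIR", "TMPDIR")
    return [environ[n] for n in names if n in environ]


def pick_scratch_dir(candidates, required_bytes=REQUIRED_BYTES):
    """Return the writable candidate with the most free space.

    Returns None when no candidate exists, is writable, and has at least
    required_bytes free.
    """
    best_dir, best_free = None, -1
    for path in candidates:
        if not path or not os.path.isdir(path) or not os.access(path, os.W_OK):
            continue
        free = shutil.disk_usage(path).free
        if free >= required_bytes and free > best_free:
            best_dir, best_free = path, free
    return best_dir
```

A wrapper could call `pick_scratch_dir(candidate_dirs())` and fail early with a clear message when the result is `None`, which is the case the certification discussion is meant to prevent.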

Problems/Lessons Learned II
Use of the LCG brokers
- Experienced problems with disk space for the input sandbox (input sandbox 4 MB, all the rest via SAM)
- Needed administrative action to resolve the problem
- Possibly mitigated since we can use multiple brokers (tested with Wuppertal and CCIN2P3 brokers)

Problems/Lessons Learned III
Job Failure Analysis
- In general, for a single SAM-Grid job, the forwarding node submits multiple LCG jobs (aggregation management). The output of all the jobs is bundled together in an output sandbox.
- We observed problems retrieving the output of "aborted" LCG jobs:
  - "Maradona" fails in handling the output
  - In this case, it is tough to understand what went wrong with the job
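The one-to-many relation between a SAM-Grid job and its LCG jobs can be sketched as below. This is an illustration only. The function name `summarize` and the lowercase state strings `"done"`, `"running"`, and `"aborted"` are assumptions, not the actual LCG state names or SAM-Grid code.

```python
from collections import Counter


def summarize(lcg_states):
    """Reduce the states of the LCG jobs behind one SAM-Grid job.

    lcg_states maps an LCG job identifier to a state string.
    Aborted jobs are listed separately because their output may be
    impossible to retrieve and so they need manual inspection.
    """
    if not lcg_states:
        return {"overall": "empty", "counts": {}, "needs_inspection": []}

    counts = Counter(lcg_states.values())
    aborted = sorted(job for job, state in lcg_states.items() if state == "aborted")

    if aborted:
        overall = "failed"
    elif all(state == "done" for state in lcg_states.values()):
        overall = "done"
    else:
        overall = "running"

    return {"overall": overall, "counts": dict(counts), "needs_inspection": aborted}
```

Keeping the list of aborted job identifiers next to the aggregate state gives an operator at least a starting point when the bundled output sandbox is missing.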

Problems/Lessons Learned IV
Resubmission of non-reentrant jobs
- Some jobs should not be resubmitted in case of failure. They will be recovered as a separate activity.
- Problems overriding the job submission retries from the JDL and the UI configuration
- Is this a known bug? A configuration problem on our part?

Problems/Lessons Learned V
Network configuration
- Sites hosting SAM must allow incoming network traffic from the FW node and from all LCG clusters (worker nodes) to allow data handling control and transport.
- SAM should be modified to provide port range control.

Problems/Lessons Learned VI
SAM configuration
- SAM can only use TCP-based communication (as expected, UDP does not work in practice on the WAN).
- SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs call-back interfaces).
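The difference between a call-back and a pull-based interface can be illustrated with a toy model. This is not SAM code, and `PullService`, `post`, and `poll` are hypothetical names. With call-backs the service would have to open a connection to the job, which fails when the job sits in a private network. With pulling, the job opens every connection itself and asks whether anything is waiting.

```python
from collections import deque


class PullService:
    """Toy service that never connects back to its clients (assumption)."""

    def __init__(self):
        self._pending = {}

    def post(self, client_id, message):
        """Queue a message for a client; nothing is pushed to the client."""
        self._pending.setdefault(client_id, deque()).append(message)

    def poll(self, client_id):
        """Called by the client over a connection the client opened.

        Returns the oldest pending message, or None if nothing is waiting.
        """
        queue = self._pending.get(client_id)
        return queue.popleft() if queue else None
```

The cost of this design is latency and polling traffic; the benefit is that only outbound connections from the worker node are required.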

Still discussing... I
What does it mean to certify LCG for a certain DZero activity?
- For reprocessing, all the SAM-Grid clusters have undergone an initial certification phase.
- The cluster processes a well-known dataset, then the results are compared with a reference result.
- What do we do for LCG? Should every individual cluster be certified? Should the LCG as a whole be certified?
- The answer probably depends on the type of activity (Reprocessing, Monte Carlo, Analysis, …).
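The "process a known dataset, compare with a reference" step could be automated along these lines. This is a sketch under a strong assumption: it compares outputs byte for byte through SHA-256 digests, whereas the slides do not say how DZero compares results, and a real comparison might tolerate small numerical differences. The names `digest` and `certify` are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an output file's contents."""
    return hashlib.sha256(data).hexdigest()


def certify(cluster_outputs, reference_digests):
    """Compare a cluster's outputs with the reference digests.

    cluster_outputs maps file name to the bytes the cluster produced.
    reference_digests maps file name to the expected digest.
    Returns (certified, mismatches), where mismatches lists the files
    that are missing or differ from the reference.
    """
    mismatches = []
    for name, expected in sorted(reference_digests.items()):
        produced = cluster_outputs.get(name)
        if produced is None or digest(produced) != expected:
            mismatches.append(name)
    return (not mismatches, mismatches)
```

Whether such a check runs once per cluster or once for LCG as a whole is exactly the open question raised on the slide.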

Still discussing... II
Who operates the SAM-Grid / LCG interoperability system?
- For the SAM-Grid DZero reprocessing, people at the facilities had an interest in having their resources utilized: people at each facility ran operations, submitting jobs to their own facilities.
- Running "operations" means being responsible for the production of the data (routine job submission/monitoring, troubleshooting, facility maintenance/upgrade, …).
- How do we organize the people who operate the LCG interoperability system? Is one responsible person enough?

Still discussing... III
Support on LCG
- In case something goes wrong on the LCG, DZero has to learn the best channels to request support.
- What response can DZero expect now and in 2 years?
- As the system becomes more complex, it becomes difficult for the operators to pinpoint the reasons for job failures. LCG will get reports for failures on the SAM-Grid side… and vice versa.

Conclusions / SAM
- We are moving the test bed to "production" by
  - expanding the system
  - ramping up usage
- We are discussing open issues in operating the interoperability system:
  - LCG certification
  - Organizing the operations
  - Obtaining support for LCG problems
- Our principal target production application is Monte Carlo for DZero.

Conclusions / LCG
- Grid batch job environment variables
  - Proposal for standardization made at the last HEPiX and the last Operations Workshop (Bologna)
  - What is the next step? How to proceed with implementation?
- Make middleware (MW) error handling easier
  - By using a well-defined set of MW error codes?
  - Suitable for automatic handling

More info at…
- integration.pdf
- integration-Lyon-report.pdf