The SAM-Grid / LCG interoperability system: a bridge between two Grids

Gabriele Garzoglio, Andrew Baranovski, Parag Mhashilkar, Anoop Rajendra*, Sudhamsh Reddy*, Tibor Kurca**, Frédéric Villeneuve-Séguier***, Torsten Harenberg****
Computing Division, Fermilab, Batavia, IL; * University of Texas at Arlington, TX; ** IPN / CCIN2P3, Lyon, France; *** Imperial College, London, UK; **** University of Wuppertal, Wuppertal, Germany

INTRODUCTION

The SAM-Grid system is an integrated data, job, and information management infrastructure that addresses the distributed computing needs of the Run II experiments at Fermilab. The system typically relies on SAM-Grid services deployed at remote facilities in order to manage computing resources. Such deployment requires special agreements with each resource provider and is a labor-intensive process. On the other hand, the DZero VO also has access to computing resources through the LCG infrastructure, where resource-sharing agreements and the deployment of standard middleware are negotiated within the framework of the EGEE project.

The SAM-Grid / LCG interoperability project was started to let DZero users retain the user-friendliness of the SAM-Grid interface while gaining access to the LCG pool of resources. This "bridging" between grids benefits both the SAM-Grid and LCG, since it minimizes the deployment effort of the SAM-Grid team and exercises the LCG computing infrastructure with the data-intensive production applications of a running experiment.

The interoperability system is centered around job "forwarding" nodes, which receive jobs prepared by the SAM-Grid and submit them to LCG. We discuss the architecture of the system and how it addresses inherent issues of service accessibility and scalability. We also present the operational and support challenges that arise when running the system in production.

WHY NOT SUBMIT JOBS TO LCG DIRECTLY?

The SAM-Grid implements features that make the system user-friendly for the Run II experiments.

Transparent access to the data: the SAM-Grid system is fully integrated with SAM, the data handling system of the experiments.

Integrated application management: the SAM-Grid system has knowledge of the typical applications running on the system. The SAM-Grid provides:
– Job environment preparation: dynamic software deployment, configuration management, and workflow management.
– Application-sensitive policies: the SAM-Grid allows the implementation of different policies for data access and local job management. In more detail, different types of applications can access data through different data access queues, each configured with its own policy settings. In addition, different types of applications can be submitted to a local scheduler under different local policies, generally enforced using different job queues.
– Job aggregation: a job request to the system is automatically split at the level of the local scheduler into multiple parallel instances of the same process. The multiple jobs are aggregated and presented to the user as the single initial request. This allows resource optimization and user-friendly management of the job (a minimal sketch of the idea follows below).
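To make the job-aggregation idea concrete, the following minimal Python sketch splits one grid-level request into parallel local jobs and rolls their statuses back up into a single status for the original request. All names and the status values are illustrative assumptions; the actual SAM-Grid implementation differs.

    # A minimal sketch of job aggregation: one request, many local instances.
    def split_request(request_id, n_events, events_per_job):
        """Split one grid job request into parallel local-job descriptions."""
        jobs = []
        for i, first in enumerate(range(0, n_events, events_per_job)):
            jobs.append({
                "request": request_id,
                "instance": i,
                "first_event": first,
                "n_events": min(events_per_job, n_events - first),
            })
        return jobs

    def aggregate_status(statuses):
        """Present many local-job statuses as one status for the request."""
        if all(s == "done" for s in statuses):
            return "done"
        if any(s == "failed" for s in statuses):
            return "partially failed"
        return "running"

    # Example: a 1,000,000-event request becomes 40 parallel 25,000-event jobs,
    # reported to the user as a single unit of work.
    jobs = split_request("dzero-reco-001", 1000000, 25000)
    print(len(jobs), aggregate_status(["done"] * len(jobs)))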
SAM-GRID TO LCG JOB FORWARDING

In order to maintain the advantages of the SAM-Grid system while using the resources provided by LCG, we have implemented the following architecture. Forwarding nodes act as an interface between the SAM-Grid and LCG. To the SAM-Grid, a forwarding node is an execution site, or, in other words, a gateway to computing resources. Jobs submitted to the forwarding node are submitted in turn to LCG, using the LCG user interface, and are then dispatched to LCG resources through the LCG Resource Broker. A SAM installation offers data handling services to the jobs running on LCG.

[Architecture diagrams: the flow of job submission runs from the SAM-Grid through the SAM-Grid / LCG forwarding node into LCG, with SAM-Grid VO-specific services offered to the jobs; a second diagram shows the multiplicity of resources and services, with forwarding nodes (FW), LCG clusters (C), VO services such as SAM (S), network boundaries, job flow, and service relationships.]

The main issues to consider when implementing this architecture are service accessibility, usability of the resources, and scalability.

Handling of user credentials for job forwarding

The forwarding node accepts jobs from the SAM-Grid via the GRAM protocol (Globus gatekeeper). A problem to overcome is that the user credentials delegated to the gatekeeper at the forwarding node are limited and cannot be used directly to submit grid jobs to LCG. We use an online credential repository (MyProxy) to address the problem. Users upload their credentials to MyProxy before submitting the job. After the job has entered the forwarding node, the delegated limited credentials of the user are used to retrieve unlimited credentials from MyProxy. These fresh credentials are then used to submit the job to LCG.
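The credential flow just described can be summarized with a minimal Python sketch of the forwarding step. This is illustrative only: the MyProxy server name, the account name, and the use of the "-n" (no passphrase) option are assumptions that depend on how the credential repository is configured for renewal; myproxy-logon and edg-job-submit are the standard MyProxy and LCG user-interface commands of this era.

    import os
    import subprocess
    import tempfile

    MYPROXY_SERVER = "myproxy.example.org"  # placeholder, not the real host
    ACCOUNT = "samgrid"                     # placeholder MyProxy account name

    def forward_to_lcg(jdl_path, limited_proxy):
        """Refresh the user's credentials from MyProxy, then submit to LCG."""
        fresh_proxy = tempfile.mktemp(prefix="x509up.")

        # Step 1: authenticate with the limited proxy delegated through the
        # gatekeeper and retrieve full-strength credentials from MyProxy.
        env = dict(os.environ, X509_USER_PROXY=limited_proxy)
        subprocess.check_call(
            ["myproxy-logon", "-s", MYPROXY_SERVER, "-l", ACCOUNT,
             "-o", fresh_proxy, "-n"],
            env=env)

        # Step 2: submit the job to LCG with the fresh proxy, through the
        # LCG user interface; the Resource Broker then does the scheduling.
        env = dict(os.environ, X509_USER_PROXY=fresh_proxy)
        subprocess.check_call(["edg-job-submit", jdl_path], env=env)

Users upload their long-lived credentials beforehand (with myproxy-init), so only short-lived proxies ever travel with the jobs.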
SAM-GRID TO LCG FORWARDING SYSTEM: THE PRODUCTION CONFIGURATION

The system is used in production to run DZero Monte Carlo and data reprocessing jobs.

[Deployment diagram: two forwarding nodes with SAM services bridge the SAM-Grid to LCG clusters at Wuppertal, CCIN2P3, Clermont-Ferrand, Imperial College, RAL, and Lancaster, with integration in progress at NIKHEF and Prague.]

The system runs hundreds of jobs per day, processing hundreds of gigabytes of data.

PROBLEMS FACED AND LESSONS LEARNED

Deploying and operating the SAM-Grid to LCG forwarding infrastructure exposed a series of problems. The most relevant issues are listed below.

Local cluster configuration

The misconfiguration of even a single worker node on the grid can significantly lower the job success rate. Misconfigured worker nodes tend to fail jobs very quickly and therefore often appear to the batch system to be "idle"; all queued jobs consequently tend to be routed to the misconfigured nodes, with catastrophic consequences for the job success rate. Typical worker node misconfigurations include time asynchrony (which causes security problems) and scratch disk management problems (full disks).

Scratch management is the responsibility of the site OR the application

DZero has special requirements on local scratch space management: the jobs cannot run on NFS, because of their intensive I/O, and they need more than 4 GB of local space. The SAM-Grid uses job wrappers to do "smart" scratch management, in order to find a scratch area that satisfies these requirements. The possible choices of scratch area are made available to the job through the LCG job managers (environment variables such as $TMPDIR); sites that accept jobs from DZero must support this configuration of the job managers. (A sketch of such wrapper logic is given at the end of this page.)

Grid services configuration

Resubmission of non-reentrant jobs: some jobs should not be resubmitted in case of failure and must instead be recovered as a separate activity. We experienced problems overriding the retrial of job submission from the JDL and the UI configuration. (See the second sketch at the end of this page.)

Broker input sandbox space management: on some brokers, disk space was not properly cleaned up, requiring administrative intervention to resume the job submission activity.

Job failure analysis

We experienced difficulties in analyzing the output of failed jobs. In particular, we could not retrieve the output of "aborted" jobs (the "Maradona" server fails in handling the output).

Scheduling policies for "clusters" of jobs are difficult to express on LCG

Jobs submitted to the SAM-Grid tend to be "large": the SAM-Grid needs to split these jobs into parallel instances of the same process in order to execute them in a reasonable time. These job "clusters" tend to have the same characteristics and, in our experience, are most efficiently executed on the same computing cluster. Since the LCG Job Description Language does not provide a way to reference previously scheduled jobs, scheduling such job clusters all on the same site is challenging.

SAM data handling configuration

Service accessibility: SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs. call-back interfaces).

Communication reliability: in order to serve jobs running on the grid, SAM is configured to accept TCP-based communication only, as UDP does not work in practice on the WAN.

System usability: sites hosting the SAM data handling system must allow incoming network traffic from the forwarding node and from all LCG clusters (worker nodes) to enable data handling control and transport. The SAM system should be modified to provide port range control.

Certification of LCG for DZero computing activities

For some computing activities, the experiments run cluster certification procedures. For example, for DZero data reprocessing, the SAM-Grid clusters process a well-known dataset, and physicists compare the output with reference results. What does it mean to certify LCG? Should every individual cluster be certified, or should LCG as a whole be certified? We currently certify every LCG cluster to which the SAM-Grid submits jobs.

Operation and support of the SAM-Grid / LCG interoperability system

In DZero, institutions get credit for the computing cycles used by the collaboration, and collaborators at an institution tend to run their share of operations by submitting jobs to their own facility. Running "operations" means being responsible for the production of the data (routine job submission and monitoring, troubleshooting, facility maintenance and upgrades, ...) and being the contact point for support at that facility. It is not clear that this operational and accounting model can be reused on the grid, where jobs can run at institutions that are not part of the collaboration.

CONCLUSIONS

Users of the SAM-Grid have access to the pool of LCG resources via the "interoperability" system described here. This mechanism increases the resources available to the DZero collaboration without increasing the cost of system deployment. The SAM-Grid is responsible for job preparation, for data handling, and for interfacing the users to the grid; LCG is responsible for job handling (resource selection and scheduling). DZero is using the system for production activities. We have described the problems encountered and the lessons learned operating this infrastructure.
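As referenced under "Scratch management" above, here is a minimal Python sketch of the kind of "smart" scratch selection a job wrapper can perform on a worker node. The candidate list and the crude NFS test are illustrative assumptions; the actual SAM-Grid wrappers differ in detail.

    import os

    # Candidate scratch locations, in order of preference; $TMPDIR is the
    # variable advertised by the LCG job managers, the others are guesses.
    CANDIDATES = [os.environ.get("TMPDIR"), "/scratch", "/tmp"]

    MIN_FREE_BYTES = 4 * 1024**3  # DZero jobs need more than 4 GB locally

    def free_bytes(path):
        """Free space available at 'path', in bytes."""
        s = os.statvfs(path)
        return s.f_bavail * s.f_frsize

    def is_nfs(path):
        """Crude NFS test: find the longest mount point covering 'path'."""
        best, fstype = "", ""
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _, mnt, fs = line.split()[:3]
                if path.startswith(mnt) and len(mnt) > len(best):
                    best, fstype = mnt, fs
        return fstype.startswith("nfs")

    def pick_scratch():
        """Return the first writable, non-NFS area with enough free space."""
        for path in CANDIDATES:
            if (path and os.path.isdir(path) and os.access(path, os.W_OK)
                    and not is_nfs(path) and free_bytes(path) >= MIN_FREE_BYTES):
                return path
        raise RuntimeError("no suitable scratch area: job cannot run here")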
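Similarly, for the "Grid services configuration" lesson above, the sketch below builds a JDL that requests no automatic resubmission (RetryCount = 0) and, as one partial workaround for the cluster-scheduling issue, pins a job instance to a given computing element through a Requirements expression. RetryCount and GlueCEInfoHostName are standard LCG JDL / GLUE names; the helper function and the host name are illustrative, and we found in practice that such retry settings could still be overridden by the UI configuration.

    def make_jdl(executable, ce_host=None, retry_count=0):
        """Build a minimal LCG JDL string for one job instance."""
        lines = [
            'Executable = "%s";' % executable,
            'StdOutput = "stdout.log";',
            'StdError = "stderr.log";',
            'OutputSandbox = {"stdout.log", "stderr.log"};',
            # 0 asks the broker not to resubmit non-reentrant jobs on failure.
            'RetryCount = %d;' % retry_count,
        ]
        if ce_host:
            # Steer every instance of the job "cluster" to the same site.
            lines.append(
                'Requirements = other.GlueCEInfoHostName == "%s";' % ce_host)
        return "\n".join(lines)

    print(make_jdl("dzero_reco.sh", ce_host="ce.example.org"))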