Published by Jerome Gibson. Modified over 9 years ago.
Testing and integrating the WLCG/EGEE middleware in the LHC computing

Simone Campana, Alessandro Di Girolamo, Elisa Lanciotti, Nicolò Magini, Patricia Méndez Lorenzo, Enzo Miccio, Roberto Santinelli, Andrea Sciabà
CERN (Switzerland), INFN-CNAF (Italy)

The EIS team

The Experiment Integration Support (EIS) team in the Worldwide LHC Computing Grid (WLCG) project has been active since 2002, helping the LHC experiments and other user communities to use the Grid as effectively as possible. The EIS activities include:

- Helping to integrate the experiment computing frameworks with the Grid middleware
- Acting as the interface between the user communities, the middleware developers and the WLCG infrastructure operations
- Developing new user tools to implement functionality missing from the Grid middleware
- Testing middleware components as they become available
- Participating directly in the experiment computing activities (data challenges, Monte Carlo production, etc.)
- Providing end-user documentation

The gLite WMS and the experiments

Middleware testing is one of the most important activities of the EIS team. Its purpose is to verify that the middleware is ready to meet the needs of the LHC experiments.

The LHC experiments need to generate and process huge amounts of simulated data to validate the reconstruction software, test their computing models and develop physics data analysis algorithms. For example, the current and foreseen production rates for ATLAS and CMS are of the order of 50 million events/month in 2007 and 100 million events/month in 2008. Each experiment needs to submit and manage about 10^5 jobs/day at several tens of participating sites.

The gLite Workload Management System (WMS) is an evolution of the LCG Resource Broker. It provides better scalability and new functionality, of which "bulk" submission is the most important.
Experiment monitoring with SAM

The Service Availability Monitoring system (SAM) is a framework developed to provide a global and uniform monitoring tool for Grid services. It works by periodically executing tests, organized in "sensors" (one for each type of Grid service), on all Grid services. The test results are published in an Oracle database with a Tomcat-based web service interface. SAM is the main source of information for Grid operations and is used to measure the availability of Grid services.

The flexibility of the SAM framework also makes it an excellent choice for any Virtual Organisation that wants to implement custom tests on existing service types, or even on experiment-specific services. The EIS team is strongly involved in the integration of the experiment monitoring with SAM. All the LHC experiments are currently using SAM:

- ALICE uses SAM to monitor the services running on its "VO boxes" (nodes which host all the ALICE-specific software at a site)
- ATLAS uses the same "standard" tests as the Grid operations team, but runs them with ATLAS Grid credentials; more specific tests will also be run in the near future
- LHCb uses the SAM database to publish the results of software installation and validation jobs

The gLite WMS architecture

[Diagram of the gLite WMS components: Client, WMProxy, Task queue, Workload Manager, Matchmaker, Job Submission and Monitoring, Logging & Bookkeeping, Information Supermarket, Information System, LB Proxy, Computing Element]

Testing the gLite WMS

During its final development phase, the WMS was mainly tested by the EIS team. The tests involved submitting large numbers of jobs to the WLCG production infrastructure, using both simple "hello world" scripts and real experiment applications. In an iterative process, the problems encountered were reported to the developers, who provided bug fixes.
Acceptance criteria were defined to assess the compliance of the WMS with the requirements from the experiments and the WLCG operations:

- Uninterrupted submission of at least 10^4 jobs/day for a period of at least five days
- No service restart required during this period
- No degradation in performance at the end of this period
- Number of "stale" jobs less than 1% of the total at the end of the test

gLite WMS test results

The gLite WMS was tested by submitting both single jobs and job collections of a few hundred jobs each. The status of the jobs was monitored, and all failures were identified and investigated. The internal status of the WMS (system load, memory usage, etc.) was also monitored. A test to verify the acceptance criteria gave these results:

- 115,000 jobs submitted in 7 days (about 16,000 jobs/day)
- 320 jobs (0.3%) aborted due to the WMS
- Negligible delay between job submission and arrival on the CE

The acceptance criteria were fully met.

An example: CMS monitoring

CMS has adopted SAM as the system to implement Grid-wide monitoring of computing and storage elements. CMS contacts at sites must ensure that the CMS tests run successfully.

Computing Element tests

Name          Checks
basic         CMS software area and CMS site local configuration
swinst        Presence of the required versions of CMSSW
Monte Carlo   Stage-out of a file from the WN to the local SE
Squid         Basic functionality of the closest Squid server
FroNtier

SRM tests

Name              Checks
get-pfn-from-tfc  Gets the LFN-to-PFN rule from a central CMS DB
put               Copies a test file into the SRM via srmcp
get-metadata      Gets the metadata of the remote file
get               Copies back the remote file
advisory-delete   Removes the remote file

[Figure: snapshot of the SAM test results on OSG]

Site availability

The outcome of the CMS SAM tests is used to measure the availability of sites for CMS. It is expressed as the fraction of successful tests as a function of time.
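The availability measure described above can be illustrated with a minimal sketch. It assumes that the test results for one site are available as (timestamp, passed) pairs and that time is divided into daily bins; the function name, the data layout and the sample values are hypothetical and are not taken from SAM itself.

```python
from collections import defaultdict
from datetime import datetime


def daily_availability(results):
    """Return {day: fraction of successful tests} for one site.

    `results` is an iterable of (timestamp, passed) pairs. This layout is
    an assumption made for the example, not the SAM database schema.
    """
    counts = defaultdict(lambda: [0, 0])  # day -> [successful, total]
    for timestamp, passed in results:
        day = timestamp.date()
        counts[day][1] += 1
        if passed:
            counts[day][0] += 1
    return {day: ok / total for day, (ok, total) in sorted(counts.items())}


# Hypothetical test outcomes for a single site over two days.
sample = [
    (datetime(2007, 5, 1, 0, 15), True),
    (datetime(2007, 5, 1, 6, 15), True),
    (datetime(2007, 5, 1, 12, 15), False),
    (datetime(2007, 5, 1, 18, 15), True),
    (datetime(2007, 5, 2, 0, 15), False),
    (datetime(2007, 5, 2, 6, 15), True),
]

availability = daily_availability(sample)
for day, value in availability.items():
    print(day, f"{value:.0%}")
```

A different bin width (hourly or weekly, for example) would change only how the timestamp is mapped to a bin; the ratio of successful to total tests stays the same.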
EGEE/OSG interoperability

An interesting side effect of choosing SAM for the CMS monitoring was a strong push for interoperability between EGEE and OSG: SAM submits its jobs through the LCG Resource Broker to OSG sites as well.

[Figure: 50% of sites over the 80% availability mark]
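A summary figure such as "50% of sites over the 80% availability mark" can be derived from per-site availabilities as sketched below. The site names and values are invented for illustration, and counting a site that sits exactly on the mark as "over" it is an assumption, since the poster does not say how ties are treated.

```python
def fraction_of_sites_over(availability_by_site, mark=0.8):
    """Return the fraction of sites whose availability reaches `mark`.

    `availability_by_site` maps a site name to an availability in [0, 1].
    Sites exactly at the mark are counted as over it (an assumption).
    """
    if not availability_by_site:
        raise ValueError("no sites given")
    over = sum(1 for value in availability_by_site.values() if value >= mark)
    return over / len(availability_by_site)


# Hypothetical per-site availabilities.
sites = {"site_a": 0.95, "site_b": 0.82, "site_c": 0.60, "site_d": 0.40}
print(f"{fraction_of_sites_over(sites):.0%} of sites over the 80% mark")
```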