U.S. ATLAS Grid Production Experience


1 U.S. ATLAS Grid Production Experience
Kaushik De, University of Texas at Arlington
Troubleshooting and Fault Tolerance in Grid Environments, Chicago, December 11, 2002

2 U.S. ATLAS testbed launched February 2001

3 Fabric Testing

4 Testbed Production

Goals:
- Demonstrate distributed ATLAS data production, access and analysis using grid middleware and tools developed by the testbed group

Production (and testing) experience so far:
- Fast simulation (Atlfast)
  - short jobs; 5 sites used (all 8 sites certified)
  - generated ~10 million events during two weeks in July 2002; 6000 files fully catalogued and accessible through the grid
- Data Challenge production (Atlsim), Phase 1
  - CPU intensive: ~14 hours per job/output file
  - 3 heterogeneous sites participated: 15, 30 and 300 nodes; Condor (2 sites) and LSF; a range of CPU clock speeds (MHz)
  - generated 200k events, 5000 files in August 2002
- DC Phase 2
  - ~25 hours per job, 50-60k events in January 2003
  - pre-production testing started

5 General Remarks

Tackled a large number of complex issues:
- repackaging of applications (by hand)
- software deployment (PACMAN)
- site verification (GridView)
- production tools (GRAT, Grappa)
- data management (Magda)
- VO management (BNL tools)
- ...

Troubleshooting:
- ignore & resubmit, check log files, check databases (a minimal resubmit loop is sketched after this list)
- most of the troubleshooting was done by the tool developers - not a robust operations model!

Fault tolerance:
- redundancy, independent verification process, concatenated logs, error handling

Not a production environment yet - still a development testbed doing production!
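The "ignore & resubmit" troubleshooting step above amounts to a retry wrapper around job submission. A minimal sketch in Python, assuming hypothetical submit_job and output_registered helpers that stand in for the actual GRAT scripts:

    import time

    MAX_RETRIES = 3      # how many times to resubmit a failed partition
    RETRY_DELAY = 600    # seconds to wait before trying again

    def run_partition(partition, submit_job, output_registered):
        """Submit one DC1 partition, resubmitting it on failure.

        submit_job(partition)        -> True if the grid job ran to completion
        output_registered(partition) -> True if the output file shows up in the catalogue
        Both callables are hypothetical stand-ins for the actual GRAT scripts.
        """
        for attempt in range(1, MAX_RETRIES + 1):
            if submit_job(partition) and output_registered(partition):
                return True
            # "ignore & resubmit": note the failure, wait, and try again
            print("partition %s failed (attempt %d); resubmitting" % (partition, attempt))
            time.sleep(RETRY_DELAY)
        return False

    # Example: run_partition("00042", lambda p: True, lambda p: True)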

6 Databases Used in U.S. DC1

MySQL databases play a central role in the U.S. DC1 production scripts:
- Production database: used to track job status (filename, submitting site, processing site, job id, time started, time finished, temporary and final file locations, ...); the information is updated periodically while the job runs (a sketch of such a table follows this list)
- Data management: used to transfer input and output files with GridFTP and to register file locations in the Magda catalogue
- Virtual Data Catalogue: used to define the job (transformation) and to store job parameters and random numbers
- Metadata catalogue: stores post-production summary information (data provenance, physics summary, ...)
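The slides do not give the actual schema of the production database; the following is a minimal sketch of what the job-tracking table and the periodic status update could look like, using MySQL through Python's MySQLdb module. All table, column and file names here are illustrative, not the real DC1 schema:

    import MySQLdb  # MySQL-python bindings; any DB-API driver would work the same way

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS dc1_jobs (
        filename       VARCHAR(250) PRIMARY KEY,  -- output partition name
        submit_site    VARCHAR(64),
        exec_site      VARCHAR(64),
        job_id         VARCHAR(128),              -- Globus/batch job identifier
        time_started   DATETIME,
        time_finished  DATETIME,
        tmp_location   VARCHAR(250),
        final_location VARCHAR(250),
        status         VARCHAR(32)                -- e.g. 'submitted', 'running', 'done', 'failed'
    )
    """

    def update_status(conn, filename, status, location=None):
        """Periodic status update, as done while a DC1 job is running."""
        cur = conn.cursor()
        cur.execute("UPDATE dc1_jobs SET status=%s, tmp_location=%s WHERE filename=%s",
                    (status, location, filename))
        conn.commit()

    if __name__ == "__main__":
        conn = MySQLdb.connect(host="db.example.org", user="dc1", passwd="***", db="production")
        conn.cursor().execute(SCHEMA)
        update_status(conn, "dc1.simul.00001.root", "running", "/atlas_scratch/tmp")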

7 GRAT Software

- ~50 independently executable, modular scripts based on Globus and Magda
- Minimal requirements on a grid production site:
  - Globus & Magda installed on the gatekeeper
  - shared $ATLAS_SCRATCH disk for all nodes
- Automatic job submission under full user control:
  - one, many, or an infinite sequence of jobs at one or many sites, using the grid even for local submits
  - any user from any site can submit production jobs
- Independent data management scripts to check the consistency of production semi-automatically (a minimal check is sketched after this list):
  - query the production database
  - check Globus for job completion status
  - check the data catalogue (Magda) for output files
  - recover from many possible production failures
- Data management using Magda: moving and registering output files to BNL HPSS and to replica locations on the grid
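The semi-automatic consistency check described above boils down to comparing what the production database claims with what the catalogue actually holds. A small sketch, assuming hypothetical sets of filenames collected from the production database and from Magda:

    def check_consistency(db_done_files, magda_files):
        """Compare production-database bookkeeping against the Magda catalogue.

        db_done_files : filenames marked 'done' in the production database
        magda_files   : filenames actually registered in Magda
        Returns partitions that need recovery (claimed done but not catalogued)
        and orphans (catalogued but not recorded as done).
        """
        missing_output = set(db_done_files) - set(magda_files)  # candidates for resubmission
        orphans        = set(magda_files) - set(db_done_files)  # bookkeeping to repair
        return missing_output, orphans

    # Example with illustrative partition names:
    missing, orphans = check_consistency({"part_00001", "part_00002", "part_00003"},
                                         {"part_00001", "part_00003", "part_00099"})
    print("resubmit:", sorted(missing))          # -> ['part_00002']
    print("fix bookkeeping:", sorted(orphans))   # -> ['part_00099']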

8 GRAT Execution Model

[Diagram: the DC1 production host (UTA) talks to a remote gatekeeper with its batch system and scratch disk, a local replica, Magda (BNL) and the parameter database (CERN); the numbered arrows correspond to the steps below.]

1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring

The ten steps read as one linear pipeline per job; a sketch follows.
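A minimal pipeline sketch in Python showing how the ten numbered steps fit together. Every function is a placeholder for the corresponding GRAT script or Globus/Magda command; only the ordering is taken from the slide, and the return values are invented for illustration:

    # Stub implementations so the sketch runs end to end.
    def discover_resources():           return "uta-gatekeeper"    # 1. Resource Discovery
    def select_partition(site):         return "partition_00042"   # 2. Partition Selection (production DB)
    def create_job_script(site, part):  return "run_%s.sh" % part  # 3. Job Creation
    def prestage_input(site, part):     pass                       # 4. Pre-stage input via GridFTP
    def submit_batch(site, script):     return "job-001"           # 5. Batch Submission through the gatekeeper
    def parameterize(part):             pass                       # 6. Job Parameterization (virtual data catalogue)
    def run_simulation(handle):         pass                       # 7. Simulation (Atlsim) on the worker node
    def poststage_output(site, part):   pass                       # 8. Post-stage output to mass storage
    def catalogue_output(part):         pass                       # 9. Cataloging in Magda
    def update_monitoring(part, state): pass                       # 10. Monitoring / production DB update

    def run_one_job():
        site   = discover_resources()
        part   = select_partition(site)
        script = create_job_script(site, part)
        prestage_input(site, part)
        handle = submit_batch(site, script)
        parameterize(part)
        run_simulation(handle)
        poststage_output(site, part)
        catalogue_output(part)
        update_monitoring(part, "done")

    run_one_job()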

9 GRAT Job Scheduling

[Diagram: the production scheduler chains together the site select, partition select, replica storage select, create job script, and move files/cleanup modules, querying the environment, the Magda database and the Virtual Data Catalogue and registering results; on the remote side the gatekeeper queue delivers the job to a node, the software is staged on ATLAS_SCRATCH, and the Atlsim job executes. The two selection modules are sketched below.]
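The site-select and partition-select modules are the policy heart of the scheduler. A small sketch of how they might cooperate; the selection criteria shown (fewest queued jobs, lowest-numbered unprocessed partition) are assumptions for illustration, not the actual GRAT policy:

    def select_site(site_load):
        """Pick the grid site with the fewest queued jobs (illustrative policy)."""
        return min(site_load, key=site_load.get)

    def select_partition(done, in_progress, total):
        """Return the lowest-numbered partition not yet done or being processed."""
        for n in range(1, total + 1):
            if n not in done and n not in in_progress:
                return n
        return None  # nothing left to schedule

    site = select_site({"UTA": 12, "LBNL": 3, "BNL": 7})
    part = select_partition(done={1, 2}, in_progress={3}, total=5000)
    print(site, part)  # -> LBNL 4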

10 DC1 Jobs on U.S. Grid

11 DC1 Production Experience
Grid production requires robust software:
- During 18 days of grid production (in August), every system died at least once
- Local experts were not always accessible (many of them on vacation)
- Examples:
  - scheduling machines died 5 times (thrice power failure, twice system hung)
  - long network outages, multiple times
  - the gatekeeper died at every site at least 2-3 times
  - three databases used (production, Magda and virtual data); each was inaccessible at least once!
  - scheduled maintenance: HPSS, Magda server, LBNL hardware, LBNL RAID array, ...
- Such outages should be expected on the grid, as we include many more sites
- We managed > 100 files/day (~75% efficiency) in spite of these stoppages!

12 Future Plans

Continue production/development:
- pileup data production (data intensive, not CPU intensive)
- other production/analysis use cases

GRAT improvements:
- Use Condor-G for job submission
  - detailed plan developed, working with the Condor team
  - need database publication of Condor log files
  - 1 month time-scale
- Use DAGMan for pileup production (a sketch of such a DAG follows this list)
  - nice use case: hundreds of nodes to be managed over many days or weeks
  - 3 month time-scale
- Migrate to Chimera
  - 6 month time-scale
- MDS integration (using GLUE & Pippy schema)
- Implement a resource broker
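DAGMan expresses such a workflow as a plain-text DAG file of jobs and parent/child dependencies. A minimal sketch that generates one for pileup production; the two-stage structure (simulate, then pile up) and the .sub submit-file names are assumptions for illustration, while JOB, RETRY and PARENT ... CHILD are standard DAGMan keywords:

    def write_pileup_dag(partitions, dag_path="pileup.dag"):
        """Emit a DAGMan DAG in which each partition is piled up only after
        its simulation job has finished successfully."""
        lines = []
        for p in partitions:
            lines.append("JOB sim_%s  simulate_%s.sub" % (p, p))   # hypothetical submit files
            lines.append("JOB pile_%s pileup_%s.sub" % (p, p))
            lines.append("RETRY pile_%s 3" % p)                    # resubmit a failed pileup job up to 3 times
            lines.append("PARENT sim_%s CHILD pile_%s" % (p, p))
        with open(dag_path, "w") as dag:
            dag.write("\n".join(lines) + "\n")

    write_pileup_dag(["00001", "00002", "00003"])
    # Submit with: condor_submit_dag pileup.dag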

