
1  The ATLAS Grid Progress
   Roger Jones, Lancaster University
   GridPP CM, QMUL, 28 June 2006

2  ATLAS partial & "average" T1 Data Flow (2008)
[Data-flow diagram: streams between the Tier-0 CPU farm, this Tier-1 (tape, disk buffer and disk storage), other Tier-1s and each associated Tier-2]

Tier-0 CPU farm -> Tier-1 disk buffer:
- RAW    1.6 GB/file   0.02 Hz    1.7K files/day    32 MB/s   2.7 TB/day
- ESD2   0.5 GB/file   0.02 Hz    1.7K files/day    10 MB/s   0.8 TB/day
- AOD2    10 MB/file   0.2 Hz      17K files/day     2 MB/s   0.16 TB/day
- AODm2  500 MB/file   0.004 Hz   0.34K files/day    2 MB/s   0.16 TB/day
- RAW + ESD2 + AODm2 combined: 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day
Tier-0 CPU farm output, all streams (RAW, ESD x2, AODm x10): 1 Hz, 85K files/day, 720 MB/s
Tier-1 tape:
- RAW    1.6 GB/file   0.02 Hz    1.7K files/day    32 MB/s   2.7 TB/day
Tier-1 disk storage:
- AODm2  500 MB/file   0.004 Hz   0.34K files/day    2 MB/s   0.16 TB/day
- ESD2   0.5 GB/file   0.02 Hz    1.7K files/day    10 MB/s   0.8 TB/day
- AOD2    10 MB/file   0.2 Hz      17K files/day     2 MB/s   0.16 TB/day
Tier-1 <-> other Tier-1s (per direction):
- ESD2   0.5 GB/file   0.02 Hz    1.7K files/day    10 MB/s   0.8 TB/day
- AODm2  500 MB/file   0.036 Hz    3.1K files/day   18 MB/s   1.44 TB/day
- ESD1   0.5 GB/file   0.02 Hz    1.7K files/day    10 MB/s   0.8 TB/day
- AODm1  500 MB/file   0.04 Hz     3.4K files/day   20 MB/s   1.6 TB/day
Tier-1 -> each Tier-2:
- AODm1  500 MB/file   0.04 Hz     3.4K files/day   20 MB/s   1.6 TB/day
- AODm2  500 MB/file   0.04 Hz     3.4K files/day   20 MB/s   1.6 TB/day
Plus simulation and analysis data flow.
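The per-stream figures above are internally consistent; a minimal sketch in plain Python (not ATLAS code) recomputes the files/day, bandwidth and daily-volume columns from just the file size and trigger rate:

```python
# Sketch: check that file size x rate reproduces the quoted files/day, MB/s and TB/day.
SECONDS_PER_DAY = 86400

def stream(name, file_size_mb, rate_hz):
    files_per_day = rate_hz * SECONDS_PER_DAY            # e.g. 0.02 Hz -> ~1.7K files/day
    bandwidth_mb_s = rate_hz * file_size_mb               # e.g. 0.02 Hz * 1600 MB -> 32 MB/s
    volume_tb_day = bandwidth_mb_s * SECONDS_PER_DAY / 1e6
    print(f"{name:6s} {files_per_day/1000:5.2f}K files/day "
          f"{bandwidth_mb_s:5.1f} MB/s  {volume_tb_day:5.2f} TB/day")

stream("RAW",   1600, 0.02)    # quoted: 1.7K files/day, 32 MB/s, 2.7 TB/day
stream("ESD2",   500, 0.02)    # quoted: 1.7K files/day, 10 MB/s, 0.8 TB/day
stream("AOD2",    10, 0.2)     # quoted: 17K  files/day,  2 MB/s, 0.16 TB/day
stream("AODm2",  500, 0.004)   # quoted: 0.34K files/day, 2 MB/s, 0.16 TB/day
```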

3  Computing System Commissioning
ATLAS developments are all driven by the Computing System Commissioning (CSC)
- Runs from June 2006 to ~March 2007
- Not monolithic: many components
- Careful scheduling of interrelated components is needed; workshop next week for package leaders
Begins with Tier-0/Tier-1/(some) Tier-2s
- Exercising the data handling and transfer systems
Lesson from the previous round of experiments at CERN (LEP, 1989-2000)
- Reviews in 1988 underestimated the computing requirements by an order of magnitude!

4  CSC items
- Full Software Chain
- Tier-0 Scaling
- Streaming tests
- Calibration & Alignment
- High-Level Trigger
- Distributed Data Management
- Distributed Production
- Physics Analysis

5  ATLAS Distributed Data Management
ATLAS reviewed all its own Grid distributed systems (data management, production, analysis) during the first half of 2005
- Data management is key
A new Distributed Data Management system (DDM) was designed, based on:
- A hierarchical definition of datasets
- Central dataset catalogues and distributed file catalogues
- Data blocks as units of file storage and replication
- Automatic data transfer mechanisms using distributed services (dataset subscription system)
The DDM system supports the basic data tasks:
- Distribution of raw and reconstructed data from CERN to the Tier-1s
- Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis
- Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing
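To make the dataset and subscription concepts concrete, here is a minimal sketch in plain Python; it is not the DDM/DQ2 implementation or its API, and the dataset, site and file names are purely illustrative. A dataset is a named collection of files, and a subscription asks the transfer machinery to keep a site's replica complete as the dataset grows:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A named collection of files (identified here by GUID -> logical file name)."""
    name: str
    files: dict = field(default_factory=dict)

@dataclass
class Subscription:
    """A site's request to hold a complete replica of a dataset."""
    dataset: Dataset
    site: str                                     # e.g. a Tier-1 or Tier-2 storage site
    replicated: set = field(default_factory=set)  # GUIDs already copied to the site

    def missing_files(self):
        # An agent serving this subscription would trigger transfers for these files.
        return {g: lfn for g, lfn in self.dataset.files.items()
                if g not in self.replicated}

# Usage: a new AOD file is added centrally; the subscribed site sees it as "missing"
aod = Dataset("csc11.AOD.merged")                         # illustrative dataset name
aod.files["guid-0001"] = "AOD.merged._0001.pool.root"
sub = Subscription(aod, site="UKI-NORTHGRID-LANCS-HEP")   # illustrative site name
print(sub.missing_files())                                # -> the file still to transfer
```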

6  ATLAS DDM Organization

7  Central vs Local Services
The DDM system has a central role with respect to ATLAS Grid tools
- Its slow roll-out on LCG is causing problems for other components
It is predicated on distributed file catalogues and auxiliary services
- We do not ask every single Grid centre to install ATLAS services
- Instead we decided to install "local" catalogues and services at the Tier-1 centres
- We then defined "regions", each consisting of a Tier-1 and all other Grid computing centres that:
  - are well connected (in network terms) to this Tier-1
  - depend on this Tier-1 for ATLAS services (including the file catalogue)
CSC will establish whether this scales to the needs of the LHC data-taking era:
- Moving several tens of thousands of files per day
- Supporting up to 100,000 organised production jobs per day
- Supporting the analysis work of more than 1,000 active ATLAS physicists
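As a sketch of the region idea (plain Python, not ATLAS code; the Tier-1 and site names are illustrative examples), every site can be mapped to the single Tier-1 that hosts the ATLAS services it depends on:

```python
# Sketch only: each Grid site is served by exactly one Tier-1, which hosts its
# local file catalogue, FTS server and VO box.  Names below are examples.
REGIONS = {
    "RAL":      ["UKI-NORTHGRID-LANCS-HEP", "UKI-LT2-QMUL", "UKI-SCOTGRID-GLASGOW"],
    "CC-IN2P3": ["GRIF", "LPC"],
}

SITE_TO_TIER1 = {site: t1 for t1, sites in REGIONS.items() for site in sites}

def services_for(site):
    """Which Tier-1 a site depends on for catalogue, FTS channels and VO box."""
    t1 = SITE_TO_TIER1.get(site, site)      # a Tier-1 serves itself
    return {"file_catalogue": t1, "fts_server": t1, "vo_box": t1}

print(services_for("UKI-LT2-QMUL"))         # -> all services hosted at RAL
```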

8  ATLAS Data Management Model
In practice it turns out to be convenient (and more robust) to partition the Grid so that there are default (not compulsory) Tier-1↔Tier-2 paths
- FTS channels are installed for these data paths for production use
- All other data transfers go through normal network routes
In this model, a number of data management services are installed only at Tier-1s and also act on their "associated" Tier-2s:
- VO Box
- FTS channel server (both directions)
- Local file catalogue (part of DDM/DQ2)
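As a sketch of how these default paths pair up (plain Python, not the actual FTS configuration; the site names and the "SRC-DST" channel naming are assumptions), each Tier-1↔Tier-2 association yields one production channel per direction, while everything else falls back to ordinary network routes:

```python
# Sketch only: enumerate the default Tier-1<->Tier-2 production channels implied
# by a region map, one channel per direction.
REGIONS = {"RAL": ["UKI-NORTHGRID-LANCS-HEP", "UKI-LT2-QMUL"]}   # illustrative names

def default_channels(regions):
    for t1, tier2s in regions.items():
        for t2 in tier2s:
            yield (t1, t2)   # Tier-1 -> Tier-2, e.g. AOD distribution
            yield (t2, t1)   # Tier-2 -> Tier-1, e.g. upload of simulated data

for src, dst in default_channels(REGIONS):
    print(f"{src}-{dst}")    # transfers outside these pairs use normal routes
```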

9  Tiers of ATLAS
[Diagram: Tier-0, Tier-1s and Tier-2s with their services: FTS servers at the Tier-0 and at each Tier-1, plus an LFC and a VO box per Tier-1; the LFC is local within each 'cloud'; all SEs are SRM]

10  Job Management: Productions
Next step: rework the distributed production system to optimise job distribution by sending jobs to the data (or as close to the data as possible)
- This was not the case previously: jobs were sent to free CPUs and had to copy the input file(s) to the local worker node from wherever in the world the data happened to be
Make better use of the task and dataset concepts
- A "task" acts on a dataset and produces more datasets
- Use bulk submission functionality to send all jobs of a given task to the location of their input datasets
- Minimise file transfers and the waiting time before execution
- Collect output files from the same dataset on the same SE and transfer them asynchronously to their final locations
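A minimal sketch of this "send jobs to the data" brokering (plain Python, not the ATLAS production system; the dataset, task and site names are invented):

```python
# Sketch: send every job of a task to a site that already holds the task's input
# dataset, instead of to any free CPU somewhere in the world.
REPLICA_CATALOGUE = {
    "csc11.005300.AOD": ["RAL", "CNAF"],        # dataset -> sites with a full replica
    "csc11.005300.ESD": ["RAL"],
}

def broker_task(task_name, input_dataset, n_jobs):
    sites = REPLICA_CATALOGUE.get(input_dataset)
    if not sites:
        raise RuntimeError(f"no replica of {input_dataset}; subscribe a site first")
    site = sites[0]                              # real brokering would also weigh free slots
    # Bulk-submit all jobs of the task to one site so no job copies input over the WAN.
    return [{"task": task_name, "job": i, "site": site, "input": input_dataset}
            for i in range(n_jobs)]

jobs = broker_task("recon-v12", "csc11.005300.AOD", n_jobs=3)
print(jobs[0])                                   # every job lands where the data already is
```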

11  Job Management: Analysis
A central job queue is good for scheduled productions (priority settings) but too heavy for user analysis
Interim tools have been developed to submit Grid jobs on specific deployments and with limited data management:
- LJSF for the LCG/EGEE Grid
- pathena, which generates ATLAS jobs that act on a dataset and submits them to PanDA on the OSG Grid
The baseline tool to help users submit Grid jobs is Ganga
- Job splitting and bookkeeping
- Several submission possibilities
- Collection of output files
- Now becoming useful as DDM is populated
- Rapid progress after user feedback; rich feature set
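As an illustration of the Ganga approach, here is roughly how an Athena analysis job on a DDM dataset might be put together inside the Ganga Python shell, where Job, Athena, DQ2Dataset, AthenaSplitterJob and LCG are provided by Ganga and its ATLAS plugins. The attribute names, file name and dataset name are indicative of the GangaAtlas conventions of the time and may differ in detail, so treat this as a sketch rather than a working recipe:

```python
# Illustrative only: an Athena analysis job in the Ganga shell (names approximate).
j = Job()
j.application = Athena()
j.application.option_file = 'MyAnalysis_jobOptions.py'    # user's job options (example name)
j.application.prepare()                                   # package the user's code

j.inputdata = DQ2Dataset()
j.inputdata.dataset = 'csc11.005300.recon.AOD.v12000601'  # dataset located via DDM (example)
j.splitter = AthenaSplitterJob()
j.splitter.numsubjobs = 10                                # split the job over the input files

j.backend = LCG()                                         # submit to the LCG/EGEE Grid
j.submit()                                                # bookkeeping via Ganga's job registry
```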

12  ATLAS Analysis Work Model
Job preparation:
- Local system (shell): prepare JobOptions; run Athena (interactive or batch); get output
Medium-scale (on-demand) running and testing:
- Local system (Ganga): prepare JobOptions; find dataset from DDM; generate and submit jobs
- Grid: run Athena
- Local system (Ganga): job bookkeeping; access output from the Grid; merge results
Large-scale (scheduled) running:
- Local system (Ganga): prepare JobOptions; find dataset from DDM; generate and submit jobs
- ProdSys: run Athena on the Grid; store output on the Grid
- Local system (Ganga): job bookkeeping; get output

13  Analysis Jobs at Tier-2s
Analysis jobs must run where the input data files are
Most analysis jobs will take AODs as input for complex calculations and event selections
- They will most likely output Athena-Aware Ntuples (AANs, to be stored on a nearby SE) and histograms (to be sent back to the user)
People will develop their analyses on reduced samples many, many times before launching runs on a complete dataset
- There will be a large number of failures due to people's code!
We are exploring a priority system that separates centrally organised productions from analysis tasks

14  ATLAS requirements
General production
- Organised production
- Share defined by the management
Group production
- Organised production
- About 24 groups identified
- Share defined by the management
General users
- Chaotic use pattern
- Fair shares between users
An analysis service is to be deployed over the summer
- Various approaches to prioritisation (VOViews, gpbox, queues) are to be explored
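As a sketch of what share-based prioritisation means in practice (plain Python; the share values are invented for illustration and are not ATLAS policy), a scheduler can always start the next job from the activity that is furthest below its target share, which yields fair shares over time:

```python
# Sketch of share-based prioritisation (target shares below are invented).
TARGET_SHARES = {"general_production": 0.5, "group_production": 0.3, "user_analysis": 0.2}

def next_activity(cpu_used):
    """Pick the activity whose recent CPU usage lags furthest behind its target share."""
    total = sum(cpu_used.values()) or 1.0
    deficits = {a: TARGET_SHARES[a] - cpu_used.get(a, 0.0) / total for a in TARGET_SHARES}
    return max(deficits, key=deficits.get)

print(next_activity({"general_production": 700, "group_production": 250, "user_analysis": 50}))
# -> "user_analysis": it has used 5% of the CPU against a 20% target
```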

15  Conditions data model
All non-event data for simulation, reconstruction and analysis
- Calibration/alignment data, but also DCS (slow controls) data, subdetector and trigger configuration, monitoring, ...
Several technologies are employed:
- Relational databases: COOL for Intervals Of Validity (IOVs) and some payload data, plus other relational database tables referenced by COOL
  - COOL databases live in Oracle, MySQL or SQLite file-based databases
  - They are accessed through the CORAL software (a common, backend-independent database layer), so the client code is independent of the underlying database
  - Mixing technologies is part of the database distribution strategy
- File-based data (persistified calibration objects): stored in files, indexed/referenced by COOL
  - File-based data will be organised into datasets and handled using DDM (the same system as used for event data)
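To illustrate the interval-of-validity lookup that COOL provides (a plain-Python sketch, not the COOL or CORAL API; the folder and payload names are invented):

```python
# Sketch of an interval-of-validity (IOV) lookup: each conditions object is valid
# for a half-open [since, until) range of run number, time or luminosity block.
class ConditionsFolder:
    def __init__(self):
        self._iovs = []                        # list of (since, until, payload)

    def store(self, since, until, payload):
        """Register a payload valid for the range [since, until)."""
        self._iovs.append((since, until, payload))
        self._iovs.sort(key=lambda iov: iov[0])

    def find(self, when):
        """Return the payload whose validity interval covers 'when'."""
        for since, until, payload in self._iovs:
            if since <= when < until:
                return payload
        raise LookupError(f"no conditions valid at {when}")

# Usage: alignment constants for two run ranges (names and values invented)
pixel_align = ConditionsFolder()
pixel_align.store(since=1000, until=2000, payload={"dx_mm": 0.012})
pixel_align.store(since=2000, until=3000, payload={"dx_mm": 0.015})
print(pixel_align.find(2500))                  # -> {'dx_mm': 0.015}
```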

16  Calibration data challenge
So far, ATLAS Tier-2s have only done simulation and reconstruction
- Static replicas of conditions data in SQLite files, or preloaded MySQL replicas: the conditions data are already known in advance
The ATLAS calibration data challenge (late 2006) will change this
- Reconstruct misaligned/miscalibrated data, derive calibrations, re-reconstruct and iterate, staying as close as possible to real data
- This will require 'live' replication of new data out to Tier-1/2 centres
Technologies to be used at Tier-2s
- We will need COOL replication, either via local MySQL replicas or via Frontier
- We are currently just starting ATLAS tests of Frontier and need experience
- A decision on what to use for the calibration data challenge will be taken in a few months
- We will definitely need DDM replication of new conditions datasets (sites subscribe to evolving datasets)
- External sites will submit updates as COOL SQLite files to be merged into the central CERN Oracle databases
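The practical difference between these back-ends is mostly the connection string a job hands to COOL/CORAL; the strings below illustrate the general format only (server, schema and dbname values are invented, and the Frontier form in particular is approximate):

```python
# Illustrative connection strings only; the client code stays the same, only the
# string passed to COOL/CORAL changes.
CONDDB_REPLICAS = {
    # central CERN Oracle service
    "oracle":   "oracle://ATLAS_COOLPROD;schema=ATLAS_COOL_INDET;dbname=COMP200",
    # local MySQL replica at a Tier-1/Tier-2
    "mysql":    "mysql://conddb.example.org;schema=atlas_cool;dbname=COMP200",
    # static SQLite file shipped with the job or distributed via DDM
    "sqlite":   "sqlite://;schema=conditions.db;dbname=COMP200",
    # cached access through a Frontier/squid server (format approximate)
    "frontier": "frontier://ATLF/();schema=ATLAS_COOL_INDET;dbname=COMP200",
}

def conditions_connect_string(backend):
    """Hypothetical helper: pick the replica appropriate for this site."""
    connect_string = CONDDB_REPLICAS[backend]
    # A job would then open it via PyCool, roughly:
    # from PyCool import cool
    # db = cool.DatabaseSvcFactory.databaseService().openDatabase(connect_string, True)
    return connect_string

print(conditions_connect_string("sqlite"))
```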

17  Conclusions
We are trying not to impose any particular load on Tier-2 managers, by running the distributed services at Tier-1s
- Although this concept breaks the symmetry and forces us to set up default Tier-1–Tier-2 associations
All that is required of Tier-2s is to set up the Grid environment
- Including whichever job-queue priority scheme will be found most useful
- And SRM Storage Elements with (when available) a correct implementation of the space reservation and accounting system

