ATLAS Distributed Computing. Alexei Klimentov, Brookhaven National Laboratory. XXII-th International Symposium on Nuclear Electronics and Computing (NEC2009), Varna, 10 September 2009.

Presentation transcript:

ATLAS Distributed Computing
Computing Model, Data Management, Production System, Distributed Analysis, Information System, Monitoring
Alexei Klimentov, Brookhaven National Laboratory
XXII-th International Symposium on Nuclear Electronics and Computing (NEC2009), Varna, 6-13 September 2009

Introduction
- The title that Vladimir gave me cannot be covered in 20 minutes.
- I will talk about the Distributed Computing components, but I am certainly biased, as any operations person is.

ATLAS Collaboration
6 continents, 37 countries, 169 institutions, 2800 physicists, 700 students, >1000 technical and support staff
Albany, Alberta, NIKHEF Amsterdam, Ankara, LAPP Annecy, Argonne NL, Arizona, UT Arlington, Athens, NTU Athens, Baku, IFAE Barcelona, Belgrade, Bergen, Berkeley LBL and UC, HU Berlin, Bern, Birmingham, Bogotá, Bologna, Bonn, Boston, Brandeis, Bratislava/SAS Kosice, Brookhaven NL, Buenos Aires, Bucharest, Cambridge, Carleton, Casablanca/Rabat, CERN, Chinese Cluster, Chicago, Chilean Cluster (Santiago+Valparaiso), Clermont-Ferrand, Columbia, NBI Copenhagen, Cosenza, AGH UST Cracow, IFJ PAN Cracow, DESY, Dortmund, TU Dresden, JINR Dubna, Duke, Frascati, Freiburg, Geneva, Genoa, Giessen, Glasgow, Göttingen, LPSC Grenoble, Technion Haifa, Hampton, Harvard, Heidelberg, Hiroshima, Hiroshima IT, Indiana, Innsbruck, Iowa SU, Irvine UC, Istanbul Bogazici, KEK, Kobe, Kyoto, Kyoto UE, Lancaster, UN La Plata, Lecce, Lisbon LIP, Liverpool, Ljubljana, QMW London, RHBNC London, UC London, Lund, UA Madrid, Mainz, Manchester, Mannheim, CPPM Marseille, Massachusetts, MIT, Melbourne, Michigan, Michigan SU, Milano, Minsk NAS, Minsk NCPHEP, Montreal, McGill Montreal, FIAN Moscow, ITEP Moscow, MEPhI Moscow, MSU Moscow, Munich LMU, MPI Munich, Nagasaki IAS, Nagoya, Naples, New Mexico, New York, Nijmegen, BINP Novosibirsk, Ohio SU, Okayama, Oklahoma, Oklahoma SU, Oregon, LAL Orsay, Osaka, Oslo, Oxford, Paris VI and VII, Pavia, Pennsylvania, Pisa, Pittsburgh, CAS Prague, CU Prague, TU Prague, IHEP Protvino, Regina, Ritsumeikan, UFRJ Rio de Janeiro, Rome I, Rome II, Rome III, Rutherford Appleton Laboratory, DAPNIA Saclay, Santa Cruz UC, Sheffield, Shinshu, Siegen, Simon Fraser Burnaby, SLAC, Southern Methodist Dallas, PNPI St.Petersburg, Stockholm, KTH Stockholm, Stony Brook, Sydney, AS Taipei, Tbilisi, Tel Aviv, Thessaloniki, Tokyo ICEPP, Tokyo MU, Toronto, TRIUMF, Tsukuba, Tufts, Udine/ICTP, Uppsala, Urbana UI, Valencia, UBC Vancouver, Victoria, Washington, Weizmann Rehovot, FH Wiener Neustadt, Wisconsin, Wuppertal, Yale, Yerevan

Necessity of Distributed Computing?
- ATLAS will collect RAW data at 320 MB/s for 50k seconds/day and ~100 days/year
  - RAW data: 1.6 PB/year
- Processing (and re-processing) these events will require ~10k CPUs full time in the first year of data taking, and a lot more in the future as data accumulate
- Reconstructed events will also be large, as people want to study detector performance as well as do physics analysis using the output data
  - ESD data: 1.0 PB/year; AOD data: 0.1 PB/year
- At least 10k CPUs are also needed for continuous simulation production of at least 30% of the real data rate, and for analysis
- There is no way to concentrate all the needed computing power and storage capacity at CERN
  - The LEP model will not scale to this level
- The idea of distributed computing, and later of the computing grid, became fashionable at the turn of the century and looked promising when applied to the computing needs of the HEP experiments
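The RAW-data figure above follows directly from the quoted rate and live time. A minimal back-of-the-envelope check in Python; the 320 MB/s, 50k s/day and ~100 days/year numbers come from the slide, the rest is unit conversion:

```python
# Check of the RAW data volume quoted on the slide.
RAW_RATE_MB_S = 320            # MB/s out of the event filter (from the slide)
LIVE_SECONDS_PER_DAY = 50_000  # ~50k seconds of data taking per day
DAYS_PER_YEAR = 100            # ~100 days of running per year

raw_bytes_per_year = RAW_RATE_MB_S * 1e6 * LIVE_SECONDS_PER_DAY * DAYS_PER_YEAR
print(f"RAW volume: {raw_bytes_per_year / 1e15:.1f} PB/year")   # -> 1.6 PB/year
```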

Computing Model: Main Operations

Tier-0 (CERN):
- Copy RAW data to the CERN Castor Mass Storage System (tape) for archival
- Copy RAW data to Tier-1s for storage and reprocessing
- Run first-pass calibration/alignment (within 24 hours)
- Run first-pass reconstruction (within 48 hours)
- Distribute reconstruction output (ESDs, AODs and TAGs) to the Tier-1s

Calibration Tier-2s: 5 sites in Europe and the US

Tier-1s:
- Archive a fraction of the RAW data
- (Re)run calibration and alignment
- Re-process data with better calibration/alignment and/or algorithms
- Distribute derived data to the Tier-2s
- Run HITS reconstruction and large-scale event selection and analysis jobs

Tier-2s (36 Tier-2s, ~80 sites):
- Run MC simulation
- Keep AOD and TAG for analysis
- Run analysis jobs

Tier-3s (O(100) sites worldwide):
- Contribute to MC simulation
- User analysis

Incomplete list of data formats: ESD: Event Summary Data; AOD: Analysis Object Data; DPD: Derived Physics Data; TAG: event meta-information

ATLAS Grid Sites and Data Distribution
- 3 Grids, 10 Tier-1s, ~80 Tier-2(3)s
- A Tier-1 and its associated Tier-n sites form a cloud; ATLAS clouds have from 2 to 15 sites. We also have T1-T1 associations.
- Data flows shown: data export from CERN, plus distribution of reprocessed and MC data
- Examples of MoU and Computing Model (CM) Tier-1 data shares:
  - BNL: RAW 24%; ESD, AOD, DPD, TAG 100%
  - IN2P3: RAW, ESD 15%; AOD, DPD, TAG 100%
  - FZK: RAW, ESD 10%; AOD, DPD, TAG 100%
- US Tier-2s in the BNL cloud: MWT2, AGLT2, NET2, SWT2, SLAC
- (Table on slide: estimated input rates to the Tier-1s, tape, disk and total in MB/s, for BNL, CNAF, FZK, IN2P3, NDGF, PIC, RAL, SARA, TAIWAN (ASGC) and TRIUMF; the numbers did not survive the transcript.)
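As an illustration of the cloud model described above, a small Python sketch of the Tier-1/Tier-2 association. Only the US cloud is populated, with the Tier-2 names taken from the slide; the other clouds are left as placeholders, and the structure itself is an assumption about how the topology could be represented, not ATLAS code:

```python
# Sketch of the ATLAS "cloud" topology: each Tier-1 anchors a cloud of
# associated Tier-2/Tier-3 sites. Only the BNL (US) cloud is filled in here,
# using the Tier-2 names shown on the slide; the rest are placeholders.
clouds = {
    "BNL":   ["MWT2", "AGLT2", "NET2", "SWT2", "SLAC"],  # US ATLAS Tier-2s
    "IN2P3": [],   # French cloud: Tier-2s omitted in this sketch
    "FZK":   [],   # German cloud: Tier-2s omitted in this sketch
    # ... 7 more Tier-1 clouds (CNAF, NDGF, PIC, RAL, SARA, ASGC, TRIUMF)
}

def cloud_of(site):
    """Return the Tier-1 whose cloud contains the given site, or None."""
    for tier1, tier2s in clouds.items():
        if site == tier1 or site in tier2s:
            return tier1
    return None

print(cloud_of("AGLT2"))   # -> "BNL"
```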

Ubiquitous Wide Area Network Bandwidth
- The first Computing TDRs assumed there would not be enough network bandwidth
- The MONARC project proposed the multi-tier model with this in mind
- Today network bandwidth is our least problem
- But we still have the tier model in the LHC experiments
- The network is not yet ideal in all parts of the world (the "last mile")
- LHCOPN provides an excellent backbone for the Tier-0 and the Tier-1s
- Each LHC experiment has adopted it differently
K. Bos, "Status and Prospects of The LHC Experiments Computing", CHEP'09

Distributed Computing Components
The ATLAS Grid architecture is based on:
- Distributed Data Management (DDM)
- Distributed Production System (ProdSys, PanDA)
- Distributed Analysis (DA): GANGA, PanDA
- Monitoring
- Grid Information System
- Accounting
- Networking
- Databases

ATLAS Distributed Data Management (1/2)
DQ2, the second generation of the ATLAS DDM system, is built on top of the Grid data transfer tools.
- Moved to a dataset-based approach
  - Dataset: an aggregation of files plus associated DDM metadata
  - The dataset is the unit of storage and replication
  - Automatic data transfer mechanisms using distributed site services
    - Subscription system
    - Notification system
- Technicalities:
  - Global services
    - dataset repository
    - dataset location catalog
    - logical file names only, no global physical file catalog
  - Local site services (local file catalog)
    - provide the logical to physical file name mapping
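To make the dataset/subscription vocabulary above concrete, here is a purely illustrative Python sketch of the data model. The class and field names are my own, not the DQ2 API; it only mirrors the catalog split described on the slide: a global dataset/location view and a per-site logical-to-physical mapping.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A dataset: an aggregation of logical file names plus DDM metadata."""
    name: str
    files: list                       # logical file names only, no physical paths
    metadata: dict = field(default_factory=dict)

@dataclass
class Subscription:
    """A request that a dataset be replicated to a destination site.
    Distributed site services poll subscriptions and trigger the transfers."""
    dataset: str
    destination_site: str
    complete: bool = False

# Global services: dataset repository + dataset location catalog
dataset_repository = {}               # dataset name -> Dataset
dataset_locations = {}                # dataset name -> set of sites with a replica

# Local site service: logical -> physical file name mapping, one per site
local_file_catalog = {"SITE_DATADISK": {}}   # placeholder site name

def subscribe(dataset_name, site):
    """Register a subscription; the site services would then copy the files and
    update both the location catalog and the local file catalog."""
    return Subscription(dataset=dataset_name, destination_site=site)
```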

ATLAS Distributed Data Management (2/2)
STEP09 results:
- Data export from CERN to the Tiers (plot on slide: daily and average rates in MB/s versus days of running)
- Replication of reprocessed datasets between Tier-1s, measured as ΔT [hours] = T_last_file_transfer - T_subscription
  - 99% of the data were transferred within 4 hours
  - The tail is due to latency in reprocessing or to site issues
  - One dataset was still not replicated after 3 days
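A tiny sketch of how such a replication-latency figure can be computed from subscription and transfer timestamps. The ΔT definition and the 4-hour threshold come from the slide; the dataset names and timestamps are made up for the illustration:

```python
from datetime import datetime, timedelta

# Hypothetical (subscription time, last-file-transfer time) pairs per dataset.
transfers = {
    "repro_dataset_A": (datetime(2009, 6, 2, 10, 0), datetime(2009, 6, 2, 12, 30)),
    "repro_dataset_B": (datetime(2009, 6, 2, 10, 0), datetime(2009, 6, 2, 18, 45)),
}

# Delta-T as defined on the slide: T_last_file_transfer - T_subscription.
delta_t = {ds: done - subscribed for ds, (subscribed, done) in transfers.items()}

within_4h = sum(dt <= timedelta(hours=4) for dt in delta_t.values())
print(f"{100.0 * within_4h / len(delta_t):.0f}% replicated within 4 hours")
```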

ATLAS Production System (1/2)
- Manages ATLAS simulation (full chain) and reprocessing jobs on the WLCG
- Task request interface to define a related group of jobs
  - Input: DQ2 dataset(s) (with the exception of some event generation)
  - Output: DQ2 dataset(s) (the jobs are done only when the output is at the Tier-1)
- Due to temporary site problems, jobs are allowed several attempts
- Job definitions and attempt states are stored in the Production Database (Oracle)
- Jobs are supervised by the ATLAS Production System, which consists of many components:
  - DDM/DQ2 for data management
  - PanDA task request interface and job definitions
  - PanDA for job supervision
  - ATLAS Dashboard and PanDA monitor for monitoring
  - Grid middlewares
  - ATLAS software
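A minimal sketch of the retry behaviour described above. The attempt limit of 3, the state names and the dict-based store are assumptions for the illustration, not the actual ProdSys schema:

```python
import random

MAX_ATTEMPTS = 3   # assumed value, for illustration only

# Stand-in for the Production Database (Oracle in the real system).
job_db = {
    42: {"state": "defined", "attempt": 0, "input_dataset": "mc09.example.EVNT"},
}

def run_once(job):
    """Stand-in for running one attempt at a Grid site (random outcome here)."""
    return random.random() < 0.7

def execute(job_id):
    job = job_db[job_id]
    while job["attempt"] < MAX_ATTEMPTS:
        job["attempt"] += 1
        job["state"] = "running"
        if run_once(job):
            job["state"] = "done"     # done only once the output is at the Tier-1
            return True
        job["state"] = "failed"       # temporary site problem: retry
    return False

print(execute(42), job_db[42])
```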

ATLAS Production System (2/2)
Schematic of the system (diagram on slide):
- Task request interface
- Production Database: job definitions, job states, metadata
- Task input and output: DQ2 datasets; task states fed back to the database
- Execution on 3 Grids / 10 clouds / 90+ production sites
- Monitoring of sites, tasks and jobs
- Job brokering is done by the PanDA service (bamboo) according to input data and site availability
A. Read, Mar 2009
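A simplified sketch of the brokering idea ("jobs go to the data"): pick a site that already hosts a replica of the input dataset and is currently available. The site list, availability flags and queue numbers are invented for the illustration; this is not the bamboo implementation:

```python
# Illustrative brokering: choose a site that (a) hosts a replica of the input
# dataset and (b) is online, preferring the least loaded one.
replica_catalog = {
    "mc09.example.EVNT": {"BNL", "FZK", "IN2P3"},
}
site_status = {   # made-up availability and queue depths
    "BNL":   {"online": True,  "queued": 120},
    "FZK":   {"online": True,  "queued": 40},
    "IN2P3": {"online": False, "queued": 10},
}

def broker(input_dataset):
    candidates = [s for s in replica_catalog.get(input_dataset, ())
                  if site_status[s]["online"]]
    if not candidates:
        return None                   # would trigger data pre-placement instead
    return min(candidates, key=lambda s: site_status[s]["queued"])

print(broker("mc09.example.EVNT"))    # -> "FZK"
```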

Data Processing Cycle
Data processing at CERN (Tier-0 processing):
- First-pass processing of the primary event stream
- The derived datasets (ESD, AOD, DPD, TAG) are distributed from the Tier-0 to the Tier-1s
- RAW data (received from the Event Filter farm) are exported within 24 hours; this is why first-pass processing can also be done by the Tier-1s (though this facility was not used during the LHC beam and cosmic ray runs)
Data reprocessing at the Tier-1s:
- 10 Tier-1 centres worldwide, each taking a subset of the RAW data (Tier-1 shares range from 5% to 25%); the ATLAS production facilities at CERN can be used in case of emergency
- Each Tier-1 reprocesses its share of the RAW data; the derived datasets are distributed ATLAS-wide
See P. Nevski's talk at NEC2009, "LHC Computing"
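For scale, combining the 1.6 PB/year RAW estimate from the earlier slide with the Tier-1 shares quoted here gives the rough per-site RAW volumes; a trivial check, with all input numbers taken from these slides:

```python
RAW_PB_PER_YEAR = 1.6    # from the "Necessity of Distributed Computing" slide
BNL_RAW_SHARE = 0.24     # BNL share from the Tier-1 data-shares slide

print(f"BNL RAW share:  ~{RAW_PB_PER_YEAR * BNL_RAW_SHARE:.2f} PB/year")
# Tier-1 shares range from 5% to 25%:
print(f"5% share:  ~{RAW_PB_PER_YEAR * 0.05:.2f} PB/year")
print(f"25% share: ~{RAW_PB_PER_YEAR * 0.25:.2f} PB/year")
```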

ATLAS Data Simulation and Reprocessing
(Plots on slide: running jobs and reprocessing activity, Sep 2008 - Sep 2009)
- The Production System has been in continuous operation; it does both data simulation and data reprocessing
- All 10 clouds use LFC as the file catalog and PanDA as the job executor
- CPUs are on average under-utilized; peak rate of 33k jobs/day
- ProdSys can produce 100 TB/week of Monte Carlo
- Average walltime efficiency is over 90%

ATLAS Distributed Analysis
- Probably the most important area at this point
- It depends on a functional data management system and job management system
- Two widely used distributed analysis tools (Ganga and pathena); they capture the great majority of users
- We expect the usage to grow substantially in the preparation for, and especially during, the 2009/10 run
- Present/traditional use cases: AOD/DPD analysis is clearly very important, but users also run over selected RAW data (for detector debugging, studies, etc.)
- ATLAS jobs go to the data
J. Elmsheuser, Sep 2009

ATLAS Grid Information System (AGIS)
- The overall purpose of the ATLAS Grid Information System is to store and to expose the static, dynamic and configuration parameters needed by ATLAS Distributed Computing (ADC) applications. AGIS is a database-oriented system.
- The first AGIS proposal came from G. Poulard; the pioneering work was done by R. Pezoa and R. Rocha in summer 2008, together with the definition of the basic design principles implemented in the 'dashboards'. Development is now led by the ATLAS BINP group.
- Today's situation: the various configuration parameters and the information about available resources, services and their status and properties are extracted from different sources or defined in different configuration files (sometimes Grid information is even hard-coded in application programs).

AGIS Architecture Overview
- The system architecture should allow new classes of information or site configuration parameters to be added, the ATLAS cloud topology and production queues to be reconfigured, and user information to be added and modified.
- AGIS is an Oracle-based information system.
- AGIS stores, as read-only, data extracted from external databases (e.g. OIM, GOCDB, BDII), together with ADC configuration information that can be modified.
- The synchronization of the AGIS content with the external sources will be done by agents (data providers); the agents will access the databases via standard interfaces.
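A schematic sketch of such a data-provider agent. The source names OIM, GOCDB and BDII are from the slide; the fetch function, the table layout and the update policy are illustrative assumptions, not the AGIS implementation:

```python
# Illustrative data-provider agent: pull site/service records from external
# information sources and mirror them, read-only, into the AGIS store.
EXTERNAL_SOURCES = ["OIM", "GOCDB", "BDII"]   # names taken from the slide

agis_store = {}          # stands in for the Oracle tables

def fetch(source):
    """Stand-in for querying an external information system through its
    standard interface; returns {site_name: attributes}."""
    return {f"{source}-SITE-1": {"status": "online", "source": source}}

def sync_once():
    for source in EXTERNAL_SOURCES:
        for site, attrs in fetch(source).items():
            # External data are stored read-only: providers overwrite them.
            # ADC-specific configuration would live in separate, editable tables.
            agis_store[site] = attrs

if __name__ == "__main__":
    sync_once()          # a real agent would run this on a schedule
    print(agis_store)
```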

AGIS Components
(Diagram on slide showing the AGIS components and their connections, including the ATP and the Logging Service.)
A. Anisenkov, D. Krivashin, Sep 2009

AGIS Information
- ATLAS clouds, tiers and sites
  - Topology: clouds, tiers, site specifics (e.g. geography, names, etc.)
- Site resources and services information
  - list of resources and services (FTS servers, SRM, LFC)
  - site service properties (name, status, type, endpoints)
- Site information and configuration
  - available CE and SE information (CPU and disk information, status, available resources)
  - availability and various status information, such as the site status in ATLAS data distribution, Monte Carlo production and functional tests
  - site downtime periods
  - relations to currently running/planned tests, tasks or runs
- Data replication: site shares and pairing
  - list of activities (e.g. reprocessing), activity start and end times
- Global configuration parameters needed by ADC applications
- User-related information (privileges, roles, account info)

ATLAS Distributed Computing Monitoring (Today)
(Slide shows the current set of monitoring applications.)
R. Rocha, Sep 2009

ATLAS Distributed Computing Monitoring (Next)
- Simplify into one monitoring application (where possible)
- Standardize the monitoring messages (see …shboard/wiki/WorkInProgress)
  - HTTP for transport
  - JSON for data serialization
- Attempt to have a common (single) dashboard client application
  - built using the Google Web Toolkit (GWT)
- Expose the source data directly from its source (like the PanDA database)
  - avoid aggregation databases like we have today
  - server-side technology left open
R. Rocha, Sep 2009
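A minimal sketch of what a standardized monitoring message along these lines could look like, sent as JSON over HTTP. Only the transport and serialization choices come from the slide; the endpoint URL, message fields and collector are invented for the illustration:

```python
import json
import urllib.request

# Hypothetical endpoint of a monitoring collector; not a real ATLAS service.
COLLECTOR_URL = "http://example.org/monitoring/messages"

message = {                       # illustrative message schema
    "source": "panda",
    "type": "job_state_change",
    "site": "BNL",
    "job_id": 42,
    "state": "finished",
    "timestamp": "2009-09-10T12:00:00Z",
}

req = urllib.request.Request(
    COLLECTOR_URL,
    data=json.dumps(message).encode("utf-8"),   # JSON for data serialization
    headers={"Content-Type": "application/json"},
    method="POST",                              # HTTP for transport
)
# urllib.request.urlopen(req)   # would actually send it; commented out here
```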

Summary and Conclusions
- The ATLAS Collaboration has developed a set of software and middleware tools that give all members of the collaboration access to the data for physics analysis, independently of their geographical location. The main building blocks of this infrastructure are:
  - the Distributed Data Management system;
  - Ganga and pathena for distributed analysis on the Grid;
  - the Production System to (re)process and to simulate ATLAS data.
- Almost all the required functionality is already provided, and is extensively used for simulated data as well as for real data from beam and cosmic ray events.
- The Grid Information System technical proposal is finalized, and the system must be in production by the end of the year.
- Monitoring system standardization is in progress.

Many thanks! (МНОГО БЛАГОДАРЮ)

Acknowledgements
Thanks to A. Anisenkov, D. Barberis, K. Bos, M. Branco, S. Campana, A. Farbin, J. Elmsheuser, D. Krivashin, A. Read, R. Rocha, A. Vaniachine, T. Wenaus, and others for the pictures and slides used in this presentation.