Introduction to Distributed Analysis Dietrich Liko
Overview: Introduction to Grid computing; the three grid flavors in ATLAS: EGEE, OSG, Nordugrid; Distributed Analysis activities: GANGA/LCG, PANDA/OSG, other tools; How to find your data? Where is the data stored? Which data is really available?
Evolution of CERN computing: 1958: Ferranti Mercury (2 years to build, 3 months to install, 320 kBytes of storage, less computing power than today's calculators); 1967: CDC 6400; 1976: IBM 370/168; 1988: IBM 3090, DEC VAX, Cray X-MP; 2001: PC farm. The scope and complexity of particle-physics experiments have increased in parallel with increases in computing power, with a massive upsurge in computing requirements in going from LEP to LHC.
Strategy for processing LHC data: The majority of data processing (reconstruction/simulation/analysis) for the LEP experiments was performed at CERN, with about 50% of physics analyses run at collaborating institutes. A similar approach might have been possible for LHC: increase data-processing capacity at CERN and take advantage of the Moore's-Law increase in CPU power and storage. The LHC Computing Review (CERN/LHCC/2001-004) discouraged a LEP-type approach: it rules out access to funding not available to CERN and makes poor use of expertise and resources at collaborating institutes. A solution for managing distributed data and CPUs is required: Grid computing. The project for the LHC Computing Grid (LCG) started in 2002.
Grid Computing: The ideas behind Grid computing have been around since the 1970s, but became very fashionable around the turn of the century. "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." (Ian Foster and Carl Kesselman, The Grid: Blueprint for a New Computing Infrastructure, 1998). The first release of the Globus Toolkit for Grid infrastructures was made in 1998. The World Wide Web became commercially attractive by the late 1990s, and e-Everything was suddenly in vogue: e-mail, e-Commerce, e-Science (dot-com bubble 1998-2002). The Grid was proposed as an evolution of the World Wide Web: access to resources as well as to information. Many projects: EGEE, OSG, Nordugrid, GridPP, INFN Grid, D-Grid.
Distributed Analysis activities: Data Analysis (AOD & ESD analysis, TAG-based analysis) with pathena/PANDA and GANGA/LCG; User Production with Prodsys, LJSF and GANGA (DQ2 integration).
EGEE: Job submission via the LCG Resource Broker, LFC file catalog; the new gLite RB is on its way. CondorG submission is also possible, but requires some expertise and has no support from the service provider. A new approach using Condor glideins is under investigation (Cronus).
Resource Broker Model (diagram: jobs submitted through Resource Brokers, which dispatch them to the Computing Elements).
OSG/PANDA: PANDA is an integrated production and distributed analysis system. It is pilot-job based, similar to DIRAC and AliEn, and uses simple file catalogs at the sites. It will be supported by GANGA in release 4.3.
Three grids: ATLAS is using three large infrastructures: EGEE, OSG and Nordugrid. The grids have different middleware, different software to submit jobs and different catalogs to store the data. We have to aim to hide these differences from the ATLAS user.
PANDA Model (diagram: a central task queue feeding pilot jobs on the Computing Elements).
Nordugrid: ARC middleware for job submission, RLS file catalog; powerful and simple. It will be supported by GANGA in release 4.3.
ARC Model (diagram: job submission to the Computing Elements).
How can we live with that? A data management layer hides these differences (Don Quixote 2), and tools aim to hide the difficulties of submitting jobs: pathena/PANDA on OSG, GANGA on LCG. In the future, better interoperability at the level of the ATLAS tools and at the level of the middleware.
pathena/PANDA: Lightweight client, integrated into the Athena release; very nice work. A lot of work has been done to better support user jobs: short queues, multitasking pilots, etc. A large set of data is available, and the system has been available for some time.
GANGA/LCG: Text UI & GUI; a pathena-like interface is available. Multiple backends: LCG/EGEE, LSF (works also with CAT queues), PBS, PANDA and Nordugrid for 4.3, and others.
Dashboard Monitoring: We are setting up a framework to monitor distributed analysis jobs, based on MonALISA (OSG, LCG), R-GMA, the Imperial College DB and the production system: http://dashboard.cern.ch/atlas. GANGA has been instrumented to understand its usage.
Since September 1st …
Dataset distribution: In principle data should be everywhere; AOD & ESD during this year is ~30 TB at most. Three steps: not all data can be consolidated (other grids, Tier-2s); distribution between the Tier-1s is not yet perfect; distribution to the Tier-2s can only be the next step.
Latest numbers by Alexei – Feb 27:
Site     Files requested  Files copied  Copied (%)  Waiting(*)  Transferred in 7 days
ASGC     5604             1883          33.6        53          1883
BNL      1891             1532          81.0         5            24
CERN     5587             5489          98.2         1          2581
CNAF     5610             2801          49.9        12          1111
FZK      5645             5541          98.2         0          2668
LYON     5529             5464          98.8         0          2643
NDGF     4822             3116          64.6        10           893
NIKHEF   5700             5471          96.0         1          2563
PIC      5787             2362          40.8        32          2617
RAL      5763             3903          67.7        12            30
TRIUMF   5744             3740          65.1        13           843
The mileage varies between 33.6% and 98.8%.
Monitoring of transfers
Why can I not send the jobs to the data automatically? I will advise you to send jobs to selected sites. This is not the final word, it is just a way to address the current situation. ATLAS is using a dataset concept: datasets have a content, datasets have one or more locations, and datasets can be complete or incomplete at a location. Only complete datasets can be used in a dataset-based brokering process (sketched below). We are currently trying to understand how much data is available as complete datasets and whether we can do file-based brokering for incomplete datasets. We have made big progress in the last months, but not everything is working as we would like yet.
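As an illustration only (not actual DQ2 or GANGA code; the helper function and its input are hypothetical), the brokering distinction described above can be sketched in plain Python:

    # Hypothetical sketch of dataset- vs file-based brokering; not DQ2/GANGA code.
    def candidate_sites(dataset_locations):
        """dataset_locations maps site name -> 'complete' or 'incomplete'."""
        complete = [site for site, state in dataset_locations.items() if state == 'complete']
        if complete:
            # Dataset-based brokering: any site holding the complete dataset qualifies.
            return complete
        # Fallback under investigation: file-based brokering over incomplete replicas.
        return list(dataset_locations)

    print(candidate_sites({'CERN': 'complete', 'RAL': 'incomplete'}))   # -> ['CERN']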
How to find out which data exists AMI Metadata http://lpsc1168x.in2p3.fr:8080/opencms/opencms/AMI/www/index.html Prodsys database http://cern.ch/atlas-php/DbAdmin/Ora/php-4.3.4/proddb/monitor/Datasets.php Dataset browser http://panda.atlascomp.org/?overview=dslist
How to access data? You can download with dq2_get and analyze locally: this works (sometimes) but is not scalable. Or: the data is distributed to sites and jobs are sent to the sites to analyze the data; DA is promoting this way of working. The process of finding the data will be fully automated in due course.
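A minimal CLIP sketch of the second way of working (sending the job to the data), assuming the GangaAtlas Athena and DQ2Dataset plugins; the dataset name is the one used in the script example later, the DQ2Dataset attribute name is an assumption, and preparation of the user area is omitted:

    j = Job()
    j.application = Athena()                      # analysis code runs inside Athena
    j.application.option_file = 'AnalysisSkeleton_topOptions.py'
    j.inputdata = DQ2Dataset()                    # input resolved through DQ2
    j.inputdata.dataset = 'trig1_misal1_csc11.005033.Jimmy_jetsJ4.recon.AOD.v12000601'
    j.backend = LCG()                             # job is sent to a site holding the data
    j.submit()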
POSIX-like IO: DA wants to read data directly from the SE (Prodsys downloads the data using gridftp). We want to use POSIX-like IO via rfio, dcap, GFAL or xrootd, because of the size of the local disk available to the job and because we need neither the full event nor all events. As of today ATLAS AOD jobs read data at ~2 MB/s.
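For illustration, direct reading from an SE amounts to opening the file through a protocol URL instead of copying it first; a minimal PyROOT sketch with a made-up file path (dcap:// or root:// URLs work the same way):

    import ROOT
    # POSIX-like access through the rfio protocol; only the parts actually read are transferred
    f = ROOT.TFile.Open('rfio:///castor/cern.ch/user/s/someuser/example.AOD.pool.root')
    if f and not f.IsZombie():
        print(f.GetSize())
        f.Close()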
Analysis jobs: Today one job reads 10 to 100 AOD files of 130 MB each. For one year of LHC running there are 150 TB of AOD according to the ATLAS computing model; even with a file size of 10 GB that is still of the order of 10000 files (see the estimate below). Back-navigation reduces IO but increases the load on the SE due to more "open" calls.
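A back-of-the-envelope check of the file count, in plain Python, using the numbers quoted above:

    aod_per_year = 150e12     # 150 TB of AOD per year of LHC running
    file_size    = 10e9       # assumed file size of 10 GB
    print(aod_per_year / file_size)   # 15000 files, i.e. of the order of 10000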
Some measurements (standard analysis example, 10 files of 130 MB each): local: 14:02 min; DPM using rfio: 16:30 min; Castor-2: 20:29 min. Extrapolated to 150 TB: about 1000 days.
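The "about 1000 days" figure follows directly from the measured rate of a single job; a rough check with the slide's numbers:

    data_read = 10 * 130e6            # 10 files of 130 MB each
    t_local   = 14 * 60 + 2           # 14:02 min in seconds
    rate      = data_read / t_local   # ~1.5 MB/s, consistent with the ~2 MB/s quoted earlier
    days      = 150e12 / rate / 86400 # one year of AOD read by a single job
    print(round(days))                # ~1100 days, i.e. about 1000 days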
DPM in Glasgow
Athena jobs Athena uses POOL/ROOT Many issues concerning plugins and current configuration See Wiki page https://twiki.cern.ch/twiki/bin/view/Atlas/IssuesWithPosixIO
Highlights: dCache: wrong dCache library (except at BNL). DPM: need to provide a symbolic link (libdpm.so -> libshift.so); broken RFIO plugin; DPM URLs not supported. Castor: new Castor syntax not supported; no files larger than 2 GB. Some issues will go away with release 13, but the RFIO plugin will still be outdated and the new rfio library is not yet released. We need to do systematic tests, as proposed by Stephane.
Backporting the ROOT RFIO plugin: Advantages: new syntax à la Castor-2, large files > 2 GB. Problems with DPM: a different URL format, some problems querying the file attributes. Several patches are required to make it work: a security context is required, but since last week the Grid UI clashes with Athena due to the Python version. A new RFIO plugin is under development inside ROOT; in general, new ROOT IO plugins should be backported to the agreed ROOT versions.
Short queues: Distributed Analysis competes with Production, and short queues can be used to speed up the analysis. There is a lot of discussion going on about how useful short queues are; empirically, I prefer to send jobs to short queues. https://twiki.cern.ch/twiki/bin/view/Atlas/DAGangaFAQ#How_to_find_out_suited_Computing Selecting the queues is the easy part; selecting the dataset location is the complicated aspect (fully automatic for complete datasets).
Summary: Several tools are available to perform Distributed Analysis, integrated with DQ2. Data is being collected and also distributed. There is still a lot of work in front of us. We are learning how to access data everywhere: how to find data and how to read it; not fully automatic yet, but we aim for that. We are learning how to handle user jobs: job priorities on LCG, short queues.
Next steps: Increase the number of sites; we have to push getting the data to all Tier-1s, as they are the backbone of the ATLAS data distribution. Interoperability will for sure be an issue this year: GANGA will send jobs to other sites, PANDA will run on LCG, and Cronus wants to bridge all resources.
GANGA Introduction
Who is ATLAS GANGA? GANGA Core: Ulrik Egede, Karl Harrison, Jakub Moscicki, A. Soroko, V. Romanovsky, Adrina Murao. GANGA GUI: Chun Lik Tan. Athena AOD analysis: Johannes Elmsheuser. Tag Navigator: Mike Kenyon, Caitherina Nicholson. User production: Fredric Brochu. EGEE/LCG: Hurng-Chun Lee, Dietrich Liko. Nordugrid: Katarina Pajchel, Bjoern Hallvard. PANDA: Dietrich Liko, with support from PANDA. Cronus: Rod Walker. AMI integration: Farida Fassi, Chun Lik Tan, with support from AMI. MonALISA monitoring: Benjamin Gaidioz, Jae Yu, Tummalapalli Reddy.
What is GANGA? Ganga is an easy-to-use frontend for job definition and management. It allows simple switching between testing on a local batch system and large-scale data processing on distributed resources (Grid). Developed in the context of ATLAS and LHCb. For ATLAS it supports the Athena framework, JobTransformations, the DQ2 data-management system and EGEE/LCG, and for release 4.3 AMI, PANDA/OSG, Nordugrid and Cronus. The component architecture readily allows extension. Implemented in Python.
Users
Domains
GANGA Job Abstraction: a Job consists of an Application (what to run), a Backend (where to run), an Input Dataset (data read by the application), an Output Dataset (data written by the application), a Splitter (rule for dividing into subjobs) and a Merger (rule for combining outputs).
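A schematic CLIP example of how this abstraction maps onto a job object; apart from Athena, DQ2Dataset and LCG, the class names are assumptions used only to illustrate the slots:

    j = Job()
    j.application = Athena()                 # what to run
    j.backend     = LCG()                    # where to run
    j.inputdata   = DQ2Dataset()             # data read by the application
    j.outputdata  = ATLASOutputDataset()     # data written by the application (name is an assumption)
    j.splitter    = AthenaSplitterJob()      # rule for dividing into subjobs (name is an assumption)
    j.merger      = AthenaOutputMerger()     # rule for combining outputs (name is an assumption)
    j.submit()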
Framework for plugins (diagram): plugins derive from GangaObject and implement one of the interfaces IApplication, ISplitter, IDataset, IMerger or IBackend. Example plugins and schemas: the Athena application (atlas_release, max_events, options, option_file, user_setupfile, user_area) and the LCG backend (CE, requirements, jobtype, middleware, id, status, reason, actualCE, exitcode).
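A schematic sketch of what such a plugin looks like; the module paths and schema calls follow later Ganga releases and are an assumption here, shown only to illustrate how a plugin declares its schema (compare the Athena schema above):

    # Assumed Ganga plugin API; paths and signatures may differ in release 4.2/4.3.
    from Ganga.GPIDev.Base import GangaObject
    from Ganga.GPIDev.Schema import Schema, Version, SimpleItem

    class MyApplication(GangaObject):
        _schema = Schema(Version(1, 0), {
            'atlas_release': SimpleItem(defvalue='', doc='ATLAS release to use'),
            'max_events':    SimpleItem(defvalue=-1, doc='number of events to process'),
            'option_file':   SimpleItem(defvalue='', doc='Athena job options file'),
        })
        _category = 'applications'   # which plugin category/interface it belongs to
        _name = 'MyApplication'      # name exposed to the user interface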
Backends and Applications (diagram of which application runs on which backend): applications are Executable, Athena (simulation/digitisation/reconstruction/analysis), AthenaMC (production) and, for LHCb, Gauss/Boole/Brunel/DaVinci (simulation/digitisation/reconstruction/analysis); backends include PBS, LSF, OSG/PANDA (the US-ATLAS WMS) and the LHCb WMS; some combinations are implemented, others are coming soon.
Status: Current version 4.2.11: AOD analysis, TAG-based analysis, MonALISA-based monitoring, LCG/EGEE, batch handlers. Upcoming version 4.3: Tag Navigator, AMI integration, PANDA, Nordugrid, Cronus.
How do the elements work together? Ganga has built-in support for ATLAS and LHCb; the component architecture allows customisation for other user groups. (Diagram: applications (ATLAS, LHCb, other); metadata catalogues, data storage and retrieval, and file tools for data management; GANGA as the user interface for job definition and management, with local and remote repositories for the Ganga job archives and the Ganga monitoring loop; processing systems (backends): experiment-specific workload-management systems, local batch systems, distributed (Grid) systems.)
Different working styles: The Command Line Interface in Python (CLIP) provides interactive job definition and submission from an enhanced Python shell (IPython); it is especially good for trying things out and seeing how the system works. Scripts, which may contain any Python/IPython or CLIP commands, allow automation of repetitive tasks; the scripts included in the distribution enable the kind of approach traditionally used when submitting jobs to a local batch system. The Graphical User Interface (GUI) allows job management based on mouse selections and field completion, with lots of configuration possibilities.
Scripts provide a pathena-like interface:
ganga athena --inDS trig1_misal1_csc11.005033.Jimmy_jetsJ4.recon.AOD.v12000601 --outputdata AnalysisSkeleton.aan.root --split 3 --maxevt 100 --lcg --ce ce102.cern.ch:2119/jobmanager-lcglsf-grid_2nh_atlas AnalysisSkeleton_topOptions.py
The job status can then be monitored, for example, using the GUI or the CLI.
IPython and CLIP: IPython is a comfortable Python shell with many useful extensions (http://ipython.scipy.org/). CLIP is the GANGA command line interface. How to define a job:
j = Job()
j.application = Executable()
j.application.exe = '/bin/echo'
j.application.args = ['Hello World']
j.backend = LCG()
j.submit()
Other commands: jobs, jobs[20].kill(), jobs[20].copy()
GUI
Exercises Subset adapted for today https://cern.ch/twiki/bin/view/Atlas/GangaTutorialAtCCIN2P3 Current Tutorial that explains more features https://cern.ch/twiki/bin/view/Atlas/GangaGUITutorial427 FAQ https://cern.ch/twiki/bin/view/Atlas/DAGangaFAQ User Support using hypernews https://hypernews.cern.ch/HyperNews/Atlas/get/GANGAUserDeveloper.html