
1 Analysis vs Storage 4 the LHC Experiments (… ATLAS-biased view)
Alessandro Di Girolamo, CERN/INFN
14 May 2009

2 Outline
– Overview of LHCb limitations for data access by analysis jobs
– Overview of data access for CMS analysis jobs (dCache vs Lustre)
– ATLAS user analysis stress tests

3 LHCb: What and How
Understanding current limitations for data access by user analysis jobs:
– Select a user with working analysis code and the need to run over a large data sample.
– Job submission and analysis of output; repeat the submission with different samples periodically (2–3 times a week).
– Each submission consists of ~600 jobs, each reading 100 different files with 500 events per file.
– Average file size: 200 MB.
– 1 job = 100 files = 50,000 events ≈ 20 GB ≈ 1–2 hours ≈ 2–3 MB/s (see the sketch below).
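A back-of-the-envelope check of the per-job numbers quoted above; the inputs are the figures from this slide, the script itself is only illustrative.

```python
# Back-of-the-envelope check of the LHCb per-job numbers quoted above.
files_per_job    = 100
events_per_file  = 500
avg_file_size_mb = 200.0

events_per_job  = files_per_job * events_per_file            # 50,000 events
data_per_job_gb = files_per_job * avg_file_size_mb / 1024.0  # ~19.5 GB, i.e. ~20 GB

for hours in (1.0, 2.0):                                     # quoted runtime range
    rate_mb_s = files_per_job * avg_file_size_mb / (hours * 3600)
    print(f"{hours:.0f} h job -> {rate_mb_s:.1f} MB/s average read rate")
# The 2 h case gives ~2.8 MB/s, consistent with the quoted 2-3 MB/s.
print(f"events/job = {events_per_job}, data/job = {data_per_job_gb:.1f} GB")
```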

4 LHCb: some results
[Plots: total number of successful jobs; CPU/wall-clock ratio for successful jobs; daily amount of data read, by site; daily number of jobs, by site.]

5 LHCb: Tier-1 performance
CNAF: 0.11 s/evt + 0.76 s/file
IN2P3: 0.18 s/evt + 1.7 s/file
[Plot: distribution of the wall-clock time to process 100 consecutive events (black line), and the same distribution when the 100 events include a new file opening (red line). The difference of the means gives the file-opening time.]
These per-event and per-file costs can be folded back into the job description above; see the sketch below.
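A rough estimate of a full job's wall-clock time from the quoted costs (100 files of 500 events each, as defined on slide 3); only the per-event and per-file numbers come from the slide.

```python
# Estimate the wall-clock time of one LHCb job (100 files x 500 events/file)
# from the per-event and per-file costs quoted above.
files_per_job, events_per_file = 100, 500

sites = {
    "CNAF":  {"s_per_evt": 0.11, "s_per_file": 0.76},
    "IN2P3": {"s_per_evt": 0.18, "s_per_file": 1.7},
}

for name, c in sites.items():
    per_file_s = events_per_file * c["s_per_evt"] + c["s_per_file"]
    total_h = files_per_job * per_file_s / 3600
    open_frac = c["s_per_file"] / per_file_s
    print(f"{name}: ~{total_h:.1f} h per job, "
          f"file opening ~{100 * open_frac:.1f}% of the time")
```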

6 CMS analysis jobs
To find the limits, coming from data access, on the number of concurrent analysis jobs, real CMS jobs were used in the test; they just read the input files without doing any computation. In this condition the I/O required by a single job is about 10 MB/s. This looks like the upper bandwidth limit, set by unzipping ROOT objects (on a Xeon E5405 @ 2.00 GHz). The plots show the CPU time and the overall time spent in I/O operations for a CMS analysis job, as a function of the number of concurrent jobs, both for Lustre and dCache (a toy model of the expected scaling follows below).
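One way to read the plots on the next slide: per-job execution time stays flat until the aggregate demand (number of jobs times ~10 MB/s) exceeds what the storage system can deliver, then grows roughly linearly. A toy model; the site bandwidth and baseline job time below are assumptions, not measurements.

```python
# Toy model of per-job execution time vs. number of concurrent jobs.
# Each job wants ~10 MB/s (from the slide); the storage bandwidth and the
# unconstrained job duration below are ASSUMED values for illustration.
per_job_mb_s  = 10.0
site_bw_mb_s  = 400.0    # assumed aggregate storage bandwidth
base_time_min = 30.0     # assumed job duration when I/O is not limiting

for n_jobs in (10, 20, 40, 80, 160):
    slowdown = max(1.0, n_jobs * per_job_mb_s / site_bw_mb_s)
    print(f"{n_jobs:4d} concurrent jobs -> ~{base_time_min * slowdown:.0f} min per job")
```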

7 CMS analysis jobs
[Plot: execution time (min) as a function of the number of concurrent jobs.]

8 What the ATLAS user wants (would like!)
My laptop ≈ GRID (but the GRID must be faster):
– "Classic" analyses with Athena & AthenaROOTAccess: Monte Carlo processing, cosmics, reprocessed data; various sizes of input data (AOD, DPD, ESD); TAG analyses for direct data access
– Calibrations & alignment: RAW data and remote database access
– Small MC sample production: transformations
– ROOT: generic ROOT applications, also with DQ2 access
– Generic executables: for everything else

9 ATLAS: jobs go to the data
Example storage layout for a 200 TB Tier-2, managed with space tokens:
– DATA: detector data, 70 TB (RAW, ESD, AOD, DPD), centrally managed
– MC: simulated data, 80 TB (RAW, ESD, AOD, DPD), centrally managed
– GROUP: physics group data, 20 TB (DnPD, ntuples, histograms, …), group managed
– SCRATCH: user scratch data, 20 TB, transient
– Buffers, spare: 10 TB
– LOCAL: non-pledged local storage at the Tier-2/Tier-3 (alongside CPUs and analysis tools), user data, locally managed

10 ATLAS Distributed Analysis
[Diagram: jobs are submitted either via the WMS (push to worker nodes at the sites) or via pAthena to PANDA (pull by pilots on the worker nodes).]

11 PANDA
[Diagram: end-users and ProdSys submit jobs to the Panda server over https; pilots, sent to sites A and B through condor-g / gLite schedulers, pull jobs from the server and run them on the worker nodes.]
The scheduler sends pilots to the batch systems and the Grid:
– CondorG scheduler: for most US ATLAS OSG sites
– Local scheduler: BNL (Condor) and UTA (PBS); very efficient and robust
– Generic scheduler: also supports non-ATLAS OSG VOs and LCG
Pilot submission is moving from a global submission point to a site-local pilot factory.

12 How pilots work
The pilot sends several parameters to the Panda server for job matching:
– CPU speed
– available memory on the WN
– list of available ATLAS releases at the site
It then runs the job immediately (all input files should already be available at the site), sends a heartbeat every 30 minutes, copies the output files to the local SE and registers them in the catalogue (a minimal sketch of this loop follows below).
Analysis jobs run under the production proxy unless gLExec is implemented in identity-switching mode:
– gLExec-based identity change on the WN to the submitter identity for user jobs is under testing (proxy management done by MyProxy)
– security issues have been investigated and clarified; for ATLAS, gLExec is considered mature
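A minimal sketch of the pilot loop described above. The server URL, endpoint paths and payload fields are invented for illustration; they are not the real PanDA pilot protocol.

```python
# Minimal pilot sketch: advertise WN resources, pull one matching job,
# heartbeat while it runs, then hand over to stage-out. The URL, endpoints
# and field names are INVENTED for illustration (not the real PanDA protocol).
import json, subprocess, threading, urllib.request

PANDA_URL = "https://pandaserver.example.org"      # hypothetical endpoint

def post(path, payload):
    req = urllib.request.Request(PANDA_URL + path,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    return json.load(urllib.request.urlopen(req))

def heartbeat(job_id, stop):
    while not stop.wait(30 * 60):                  # every 30 minutes
        post("/heartbeat", {"job_id": job_id, "status": "running"})

def run_pilot(wn_info):
    job = post("/getJob", wn_info)                 # job matching happens server-side
    if not job:
        return
    stop = threading.Event()
    threading.Thread(target=heartbeat, args=(job["id"], stop), daemon=True).start()
    try:
        subprocess.run(job["command"], shell=True, check=False)  # inputs already at the site
    finally:
        stop.set()
    # Stage-out to the local SE and catalogue registration would go here
    # (e.g. with lcg-cr), followed by reporting the final status:
    post("/finished", {"job_id": job["id"]})

run_pilot({"cpu_speed_mhz": 2000, "memory_mb": 2048,
           "releases": ["14.5.2", "15.0.0"]})      # parameters sent for matching
```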

13 GANGA
Ganga is a Grid user interface for HEP experiments, and a key piece of the distributed-analysis systems for ATLAS and LHCb. It can submit jobs both to the WMS and to PANDA.
It manages large-scale scientific applications on the Grid:
– configuring the applications
– switching between testing on a local batch system and large-scale processing on the Grid
– keeping track of results
– discovering dataset locations by interfacing directly to metadata and file catalogues
Portable Python code (an illustrative job configuration follows below).
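For illustration, an Athena analysis job in a Ganga (GPI) session looks roughly like the snippet below. The attribute names are recalled from the 2009-era GangaAtlas plugin and may differ between Ganga versions; the dataset name is a placeholder.

```python
# Rough shape of a Ganga (GPI) Athena job; to be typed inside a Ganga session,
# where Job, Athena, DQ2Dataset, etc. are predefined. Attribute names are
# indicative only and may differ between Ganga versions.
j = Job()
j.application = Athena()
j.application.option_file = 'AnalysisSkeleton_topOptions.py'
j.application.prepare()                       # pack the user area into the input sandbox
j.inputdata = DQ2Dataset()
j.inputdata.dataset = 'mc08.SomeSample.recon.AOD.e123_s456_r789'   # placeholder name
j.outputdata = DQ2OutputDataset()
j.splitter = DQ2JobSplitter()
j.splitter.numsubjobs = 20                    # one subjob per chunk of input files
j.backend = Panda()                           # or LCG() for WMS submission, Local() for testing
j.submit()
```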

14 Testing the DA infrastructure
Functional tests: GangaRobot, vital to verify site configurations.
Stress tests: HammerCloud, "the" way to simulate chaotic user analysis (creators and main developers: J. Elmsheuser & D. van der Ster).
In fall 2008 there was interest in testing Tier-2s under load. The first tests were in Italy and were manual: 2–5 users submitting ~200 jobs each at the same time, with results merged and analyzed 24–48 hours later. The Italian tests saturated the 1 Gbps networks at the Tier-2s: <3 Hz per job.

15 How HammerCloud works
An operator defines the tests:
– What: a Ganga job template, specifying the input datasets and including an input sandbox tar.gz (the Athena analysis code)
– Where: list of sites to test, number of jobs
– When: start and end times
– How: input data I/O (posix I/O == DQ2Local, or FileStager)
Each job runs Athena over an entire input dataset. The test is defined with a dataset pattern (e.g. mc08.*.AOD.*), and HC generates one job per dataset. It tries to run with the same datasets at all sites, but there are not always enough replicas!
HammerCloud then runs the tests (see the skeleton below):
1. generate the appropriate jobs for each site
2. submit the jobs (LCG and NG, and now Panda)
3. poll their statuses, writing incremental results into the HC DB
4. read the HC DB to plot results on the web
5. clean up leftovers; kill jobs that are still incomplete
When running many tests, each stage handles each test sequentially; this limits the number of tests that can run at once (work in progress).
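A skeleton of that five-stage cycle. Every class, method and field name below is invented for illustration; it is not the real HammerCloud code.

```python
# Skeleton of the HammerCloud test cycle; all names are invented for
# illustration and do not correspond to the real HammerCloud implementation.
import fnmatch, time

def run_test(test, hc_db, grid):
    # 1. one job per dataset matching the pattern, for each site under test
    jobs = [grid.make_job(test.template, site=site, dataset=ds)
            for site in test.sites
            for ds in grid.list_datasets(site)
            if fnmatch.fnmatch(ds, test.dataset_pattern)]   # e.g. "mc08.*.AOD.*"

    # 2. submit (LCG / NG / Panda backends hidden behind grid.submit)
    for job in jobs:
        grid.submit(job)

    # 3. + 4. poll until the end time, writing incremental results to the HC DB
    #         (a separate web page reads the DB and plots them)
    while time.time() < test.end_time and not all(j.finished for j in jobs):
        for job in jobs:
            hc_db.record(test.id, job.id, grid.status(job))
        time.sleep(300)

    # 5. clean up leftovers: kill anything still incomplete
    for job in jobs:
        if not job.finished:
            grid.kill(job)
```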

16 HammerCloud: the tests
HammerCloud tests real analyses:
– AOD analysis, based on the Athena UserAnalysis package, analyzing mainly muons; input data: muon AOD datasets, or other AODs if muons are not available
– Reprocessed DPD analysis, intended to test the remote conditions database (at the local Tier-1)
HammerCloud metrics (a sketch of the basic ones follows below):
– exit status and log files
– CPU/wall-clock ratio, events per second
– job timing: queue, input sandbox stage-in, Athena/CMT setup, LFC lookup, Athena execution, output storage
– number of events and files processed (versus what was expected)
– some local statistics (e.g. network and storage rates) are only available in site-level monitoring, so site contacts are very important!
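The headline metrics are simple ratios over the per-job timing and counters listed above; a small helper, with field names that are illustrative rather than the actual HammerCloud schema.

```python
# Derive the headline HammerCloud metrics from one job record.
# The dictionary keys are illustrative, not the actual HammerCloud schema.
def job_metrics(job):
    wallclock = job["end_time"] - job["start_time"]                 # seconds
    return {
        "cpu_wallclock_ratio": job["cpu_time"] / wallclock,
        "event_rate_hz": job["events_processed"] / job["athena_exec_time"],
        "completeness": job["files_processed"] / job["files_expected"],
    }

print(job_metrics({"start_time": 0, "end_time": 7200, "cpu_time": 4300,
                   "athena_exec_time": 5000, "events_processed": 100000,
                   "files_processed": 98, "files_expected": 100}))
```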

17 HammerCloud: the tests (2)
The key HammerCloud variable (up until now) is data access:
– Posix I/O with the local protocol: to tune rfio, dcap, gsidcap, StoRM, Lustre, etc.; testing with read-ahead buffers on or off, and large, small or tweaked
– Copy/stream the files locally: but disk space is limited, and restarting Athena causes overhead
– The Athena FileStager plugin: uses a background thread to copy the input files from storage, overlapping copy and processing: startup, copy f1, process f1 & copy f2, process f2 & copy f3, etc. (a minimal prefetching sketch follows below)
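A minimal sketch of the FileStager idea, overlapping the copy of the next file with the processing of the current one; copy_to_local() and process() are placeholders, not the real Athena/FileStager API.

```python
# Minimal sketch of the FileStager idea: while file i is being processed, a
# background thread already copies file i+1. copy_to_local() and process()
# are placeholders, not the real Athena/FileStager API.
import os, shutil, tempfile
from concurrent.futures import ThreadPoolExecutor

def copy_to_local(remote_path):
    # Placeholder stage-in; in reality an rfcp/dccp/lcg-cp style copy.
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(remote_path))
    shutil.copy(remote_path, local_path)
    return local_path

def process(local_path):
    # Placeholder for running Athena over one staged-in file.
    print(f"processing {local_path}")

def run_with_filestager(remote_files):
    if not remote_files:
        return
    with ThreadPoolExecutor(max_workers=1) as stager:
        pending = stager.submit(copy_to_local, remote_files[0])   # startup: copy f1
        for i in range(len(remote_files)):
            local = pending.result()                              # wait for file i
            if i + 1 < len(remote_files):                         # start copying file i+1
                pending = stager.submit(copy_to_local, remote_files[i + 1])
            process(local)                                        # overlaps with the copy
            os.remove(local)                                      # free scratch space
```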

18 HammerCloud: some results…
Example I/O rates from a classic Athena AOD analysis (worked through in the sketch below):
– a fully loaded CPU can read events at ~20 Hz (i.e. at this rate the CPU, not the file I/O, is the bottleneck)
– 20 Hz × 0.2 MB per event = 4 MB/s per CPU
– a site with 200 CPUs could therefore consume data at 800 MB/s
– this requires a 10 Gbps network and a storage system that can handle such a load; put the other way round, a 200-CPU cluster with a 1 Gbps network will deliver only ~3 Hz per analysis job
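The same arithmetic spelled out; only the 20 Hz event rate and the 0.2 MB/event size come from the slide, the rest follows from them.

```python
# Reproduce the I/O-rate arithmetic quoted above.
event_rate_hz = 20.0      # CPU-limited read rate per fully loaded core
event_size_mb = 0.2       # typical AOD event size
n_cpus        = 200

per_cpu_mb_s = event_rate_hz * event_size_mb        # 4 MB/s per CPU
site_mb_s    = n_cpus * per_cpu_mb_s                # 800 MB/s -> needs ~10 Gbps

one_gbps_mb_s = 1000 / 8.0                          # ~125 MB/s on a 1 Gbps link
hz_per_job = one_gbps_mb_s / n_cpus / event_size_mb # shared fairly among 200 jobs
print(per_cpu_mb_s, site_mb_s, round(hz_per_job, 1))  # 4.0 800.0 3.1
```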

19 Example of HC results [plot]

20 Example of HC results [plot]

21 Example of HC results [plot]

22 Example of HC results [plot]

23 Overall HammerCloud statistics
Throughout the history of HammerCloud:
– 74 sites tested; nearly 200 tests; the top sites tested >25 times
– ~50,000 jobs in total, with an average runtime of 2.2 hours
– 2.7 billion events processed in 10.5 million files
Success rate: 29 sites have a >80% success rate; 9 sites >90%.
Across all tests:
– CPU utilisation: 27 sites >50%; 8 sites >70%
– event rate: 19 sites >10 Hz; 7 sites >15 Hz
With the FileStager data access mode:
– CPU utilisation: 36 sites >50%; 24 sites >70%
– event rate: 33 sites >10 Hz; 20 sites >15 Hz; 4 sites >20 Hz
Full statistics: http://gangarobot.cern.ch/st/summary.html
NOTE: these are overall summaries without a quality cut, i.e. the numbers include old tests without tuned data access.

24 Lessons learned so far…
The expected benefits:
– Most sites are not optimized to start out: HC can find the weaknesses. Sites rely on large quantities of jobs to tune their networks and storage.
– HammerCloud is a benchmark for the sites: site admins can change their configuration and then request a test to see how it affects performance.
– We are building a knowledge base of optimal data access modes at the sites: there is no magic solution w.r.t. posix I/O vs. FileStager. It is essential for the DA tools to make use of this information about the sites.

25 … and also…
Unexpected benefits:
– Unexpected storage bottlenecks (the hot-dataset problem): data not well distributed across all storage pools? One pool was overloaded while the others sat idle; we need to understand how to balance the pools.
– Misunderstood behaviour of the distributed data management tools: DB-access jobs require a large SQLite database (fetched with dq2-get before the job starts), and dq2-get did not retrieve it from the different storage areas of the site. A large test could have brought systems down, but this was caught before the test thanks to a friendly user. Ganga's download of the SQLite DB was changed (as was dq2-get's behaviour).
– An Athena I/O bug/misunderstanding was found: HC saw discrepancies between the number of files intended to be processed and the number actually processed. Athena returned exit code 0 if a file open() timed out: "success". The behaviour was changed for Athena 15.

26 The Challenge: support of user activities
Difficult to simulate: real life will provide new challenges and opportunities.

27 Questions?
Thanks to all (…those whom I took these slides from)!

