Functional and Large-Scale Testing of the ATLAS Distributed Analysis Facilities with Ganga

D.C. van der Ster (CERN), J. Elmsheuser (LMU), M. Slater (Birmingham), C. Serfon (LMU), M. Biglietti (INFN-Napoli), F. Galeazzi (INFN-Roma III)

CHEP 2009, Prague, 21-27 March 2009

Overview of the talk

- Overview of Distributed Analysis in ATLAS: what needs to be tested? Workflows and resources.
- Functional Testing with GangaRobot: daily short tests to verify the analysis workflows.
- Stress Testing with HammerCloud: infrequent (~weekly) large-scale tests to stress specific components.

DA in ATLAS: What can the users do?

- The ATLAS Distributed Computing (ADC) operational situation in a nutshell:
  - The grids and resources are established.
  - Distributed production and data distribution are well understood and tested.
  - Now the priority is on validating distributed analysis for users.
- What do the users want to do?
  - "What runs on my laptop should run on the grid!"
- Classic analyses with Athena and AthenaROOTAccess:
  - A lot of MC processing, cosmics, reprocessed data.
  - Various sizes of input data: AOD, DPD, ESD.
  - TAG analyses for direct data access.
- Calibrations & alignment: RAW data and remote database access.
- Small MC sample production: transformations.
- ROOT: generic ROOT applications, also with DQ2 access.
- Generic executables: for everything else.

DA in ATLAS: What are the resources?

The frontends, Pathena and Ganga, share a common "ATLAS Grid" library. The sites are highly heterogeneous in technology and configuration. So how do we validate ATLAS DA: do the use-case functionalities work, and how do the sites behave under load?

ATLAS DA Operations Activities

- This talk presents two activities that address these problems:
  - Functional testing with GangaRobot.
  - Simulated stress testing with HammerCloud.
- The third piece of the puzzle, not covered in this talk, is what we call the "Distributed Analysis Jamboree":
  - A coordinated stress test with real users, real analyses, generating real chaos.
  - The US has some experience with this type of test, and worldwide distributed analysis jamborees are being organized right now.

Functional Testing with GangaRobot

- Definitions:
  - Ganga is a distributed analysis user interface with a scriptable Python API.
  - GangaRobot is both:
    a) a component of Ganga which allows rapid definition and execution of test jobs, with hooks for pre- and post-processing;
    b) an ATLAS service which uses (a) to run DA functional tests.
  - In this talk, GangaRobot means (b).
- So what does GangaRobot test, and how does it work?

Functional Testing with GangaRobot

1. Tests are defined by the GR operator:
   - Athena version, analysis code, input datasets, and which sites to test.
   - Short jobs, mainly to test the software and data access.
2. Ganga submits the jobs to OSG/Panda, EGEE/LCG, and NG/ARC.
3. Ganga periodically monitors the jobs until they have completed or failed; results are recorded locally.
4. GangaRobot then publishes the results to three systems:
   - The Ganga Runtime Info System, to avoid failing sites.
   - SAM, so that sites can see the failures (EGEE only; OSG in deployment).
   - The GangaRobot website, monitored by ATLAS DA shifters; GGUS and RT tickets are sent for failures.

A sketch of this workflow in Ganga's API is shown below.
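The following is a minimal sketch, in Ganga's scriptable Python API, of the kind of short test job GangaRobot submits. The class and field names follow the GangaAtlas plugins of this era but may differ by Ganga version, and the dataset name is only illustrative; this is not GangaRobot's actual code.

```python
import time
from Ganga.GPI import Job, Athena, DQ2Dataset, LCG  # Ganga's GPI namespace

# One short Athena analysis job per site under test: it exercises the
# software setup and the data access path, nothing more.
j = Job(name='gangarobot-functional-test')
j.application = Athena()
j.inputdata = DQ2Dataset()
j.inputdata.dataset = 'mc08.105200.T1_McAtNlo_Jimmy.recon.AOD.e357_s462_r541'
j.backend = LCG()          # or Panda()/NG() for the OSG and NorduGrid flavours
j.submit()

# Poll until done, then publish the per-site verdict (in the real service:
# Ganga InfoSys, SAM, and the GangaRobot website).
while j.status not in ('completed', 'failed'):
    time.sleep(60)         # Ganga's monitoring thread updates j.status
print('verdict:', 'OK' if j.status == 'completed' else 'FAILED')
```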

What happens with the results?

- The analysis tools need to avoid sites with failed tests:
  - For Ganga/EGEE users, feeding the results into the Ganga InfoSys accomplishes this.
  - For OSG and NG, the sites are set offline manually by a shifter.
  - In the future, the results need to go to the planned central ATLAS InfoSys (AGIS).
- Results need to be relevant and clear so that problems can be fixed rapidly:
  - The GangaRobot website has all the results, but causes information overload for non-experts.
  - SAM is friendlier and better integrated, but doesn't present the whole picture (and at the moment includes only EGEE).

Overall Statistics with GangaRobot

[Plots from the SAM dashboard of daily and percentage availability of EGEE sites over the past 3.5 months.]

WARNING: don't automatically blame the sites! The fault could lie anywhere in the systems.

- The good: many sites with >90% efficiency.
- The bad: fewer than 50% of sites have >80% uptime.
- The expected: many transient errors and 1-2 day downtimes; a few sites are permanently failing.

Distributed Analysis Stress Testing

- The need for DA stress testing, with example I/O rates from a classic Athena AOD analysis (the arithmetic is spelled out in the sketch below):
  - A fully loaded CPU can read events at ~20 Hz (i.e. at this rate the CPU, not the file I/O, is the bottleneck).
  - 20 Hz * 0.2 MB per event = 4 MB/s per CPU.
  - A site with 200 CPUs could therefore consume data at 800 MB/s; this requires a 10 Gbps network and a storage system that can handle such a load.
  - Alternatively, a 200-CPU cluster with a 1 Gbps network will result in ~3 Hz per analysis job.
- In fall 2008, the clouds started getting interested in testing the Tier-2s under load. The first tests were in Italy, and were manual:
  - 2-5 users submitting ~200 jobs each at the same time, with results merged and analyzed hours later.
  - The Italian tests saturated the 1 Gbps networks at the Tier-2 sites, resulting in <3 Hz per job.
- From these early tests we saw the need for an automated stress-testing system able to test all clouds simultaneously: hence, we developed HammerCloud.
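A back-of-the-envelope check of the rates quoted above; the 20 Hz event rate and 0.2 MB event size are the slide's own figures, everything else is arithmetic.

```python
event_rate_hz = 20        # events/s for one fully loaded CPU
event_size_mb = 0.2       # MB per AOD event
site_cpus = 200

per_cpu_mb_s = event_rate_hz * event_size_mb            # 4 MB/s per CPU
site_mb_s = per_cpu_mb_s * site_cpus                    # 800 MB/s per site
site_gbps = site_mb_s * 8 / 1000.0                      # 6.4 Gbps -> needs a 10 Gbps network

# Conversely, a shared 1 Gbps network caps the per-job event rate:
network_mb_s = 1000.0 / 8                               # ~125 MB/s for the whole site
per_job_hz = network_mb_s / site_cpus / event_size_mb   # ~3.1 Hz per job

print(f'{site_mb_s:.0f} MB/s site demand ({site_gbps:.1f} Gbps); '
      f'~{per_job_hz:.1f} Hz/job on a 1 Gbps network')
```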

HammerCloud: How does it work?

- What does HammerCloud (HC) do? An operator defines the tests:
  - What: a Ganga job template, specifying the input datasets and including an input sandbox tar.gz (the Athena analysis code).
  - Where: the list of sites to test and the number of jobs.
  - When: the start and end times.
  - How: the input data I/O mode (Posix I/O, copy locally, or FileStager).
- Each job runs Athena over an entire input dataset. The test is defined with a dataset pattern (e.g. mc08.*.AOD.*), and HC generates one job per dataset.
  - HC tries to run with the same datasets at all sites, but there are not always enough replicas.
- HammerCloud then runs the tests:
  1. Generate the appropriate jobs for each site.
  2. Submit the jobs (LCG and NG now; Panda and batch coming).
  3. Poll their statuses, writing incremental results to the HC DB.
  4. Read the HC DB to plot the results on the web.
  5. Clean up leftovers; kill jobs that are still incomplete.
- When running many tests, each stage handles each test sequentially (e.g. gen A, gen B, sub A, sub B, ...), which limits the number of tests that can run at once. A sketch of this model follows below.
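To make the test model concrete, here is a hypothetical sketch of a test definition and the sequential stage loop described above. All names (StressTest, run_stage, the sandbox, pattern, and site list) are invented for illustration; they are not HammerCloud's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class StressTest:
    template: str           # Ganga job template + input sandbox tar.gz
    dataset_pattern: str    # e.g. 'mc08.*.AOD.*' -> one job per matching dataset
    sites: list
    io_mode: str            # 'posix', 'copy_local', or 'filestager'
    start: datetime
    end: datetime
    jobs: list = field(default_factory=list)

def run_stage(stage: str, test: StressTest) -> None:
    # Placeholder for the real work: generate one job per dataset, submit
    # via Ganga, poll statuses into the HC DB, plot, clean up leftovers.
    print(f'{stage}: {test.dataset_pattern} at {len(test.sites)} sites')

tests = [
    StressTest('aod_muon_analysis.tgz', 'mc08.*.AOD.*',
               ['SITE_A', 'SITE_B', 'SITE_C'], 'filestager',
               datetime(2009, 3, 1, 8, 0), datetime(2009, 3, 1, 20, 0)),
]

# Each stage handles every test in turn (gen A, gen B, sub A, sub B, ...),
# which is exactly what limits how many tests can run at once.
for stage in ('generate', 'submit', 'poll', 'plot', 'cleanup'):
    for test in tests:
        run_stage(stage, test)
```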

HammerCloud: What are the tests?

- HammerCloud tests real analyses:
  - An AOD analysis based on the Athena UserAnalysis package, analyzing mainly muons:
    - Input data: muon AOD datasets, or other AODs if muons are not available.
    - In principle, the results should be similar for any analysis where the file I/O is the bottleneck.
  - A reprocessed DPD analysis, intended to test the remote conditions database (at the local Tier-1).
- What metrics does HammerCloud measure? (A sketch of how they combine is shown below.)
  - Exit status and log files.
  - CPU/wallclock ratio and events per second.
  - Job timing: queuing, input sandbox stage-in, Athena/CMT setup, LFC lookup, Athena execution, output storage.
  - The number of events and files processed (versus what was expected).
  - Some local statistics (e.g. network and storage rates) are only available in site-level monitoring, so site contacts are very important!
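A small sketch of the per-job metrics listed above; the function and field names are illustrative, not HammerCloud's code. The file-completeness ratio is what catches the "hidden failures" discussed on the example-results slide further below.

```python
def job_metrics(cpu_s, wall_s, events, files_done, files_expected):
    """Derive the headline HammerCloud metrics for one finished job."""
    return {
        'cpu_over_wall': cpu_s / wall_s,                   # low => I/O-bound
        'events_per_s': events / wall_s,
        'file_completeness': files_done / files_expected,  # < 1.0 => hidden failure
    }

m = job_metrics(cpu_s=5400, wall_s=9000, events=180000,
                files_done=92, files_expected=100)
print(m)  # cpu_over_wall=0.6, events_per_s=20.0, file_completeness=0.92
```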

HammerCloud: What are the tests? (2)

- Until now, the key variable HammerCloud has evaluated is the data access method:
  - Posix I/O with the local protocol:
    - To tune rfio, dcap, gsidcap, storm, lustre, etc.
    - Testing with read-ahead buffers on or off; large, small, or tweaked.
  - Copying the files locally before running:
    - But disk space is limited, and restarting Athena adds overhead.
  - The Athena FileStager plugin:
    - Uses a background thread to just-in-time copy the input files from storage.
    - Startup - copy f1 - process f1 & copy f2 - process f2 & copy f3 - etc. (sketched below).
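Below is a minimal sketch of the FileStager idea: a background thread stages the next input file while the current one is being processed. The copy step is simulated with a sleep, and the file names and processing stand-in are placeholders; the real plugin lives inside Athena.

```python
import queue
import threading
import time

def stage_files(files, staged):
    for f in files:
        time.sleep(1)                # stand-in for the JIT grid->scratch copy
        staged.put(f)
    staged.put(None)                 # sentinel: no more files

def process(local_file):
    print('processing', local_file)  # stand-in for the Athena event loop

files = ['f1.root', 'f2.root', 'f3.root']
staged = queue.Queue(maxsize=1)      # stay at most one file ahead of the event loop

threading.Thread(target=stage_files, args=(files, staged), daemon=True).start()

# Startup - copy f1 - process f1 & copy f2 - process f2 & copy f3 - ...
while (f := staged.get()) is not None:
    process(f)
```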

HammerCloud Website

Results from all tests are kept indefinitely.

Example HammerCloud Test Results

[Plots of %CPU used and events/second for 600 jobs across 12 sites, processing ~50 million events in ~20,000 files.]

- 7 sites had no errors. But beware hidden failures! Did each job actually process the files it was supposed to?
- No: only 92% of the files that should have been processed were. The other 8%? See later.

Overall HammerCloud Statistics

- Throughout the history of HammerCloud:
  - 74 sites tested; nearly 200 tests; the top sites tested >25 times.
  - ~50,000 jobs in total, with an average runtime of 2.2 hours.
  - 2.7 billion events processed in 10.5 million files.
- Success rate: 29 sites have a >80% success rate; 9 sites >90%.
- Across all tests:
  - CPU utilisation: 27 sites >50% CPU; 8 sites >70%.
  - Event rate: 19 sites >10 Hz; 7 sites >15 Hz.
- With the FileStager data access mode:
  - CPU utilisation: 36 sites >50%; 24 sites >70%.
  - Event rate: 33 sites >10 Hz; 20 sites >15 Hz; 4 sites >20 Hz.
- Full statistics are available on the HammerCloud website.

NOTE: these are overall summaries without a quality cut; i.e. the numbers include old tests without tuned data access.

What have we learned so far?

- The expected benefits:
  - We have found that most sites are not optimized to start with, and HC can find the weaknesses. Sites rely on large quantities of jobs to tune their networks and storage.
  - HammerCloud is a benchmark for the sites: site admins can change their configuration and then request a test to see how it affects performance.
  - We are building a knowledge base of the optimal data access modes at the sites:
    - There is no magic solution w.r.t. Posix I/O vs. FileStager.
    - It is essential for the DA tools to use this information about the sites.

What have we learned so far? (2)

- The unexpected benefits:
  - Unexpected storage bottlenecks (the "hot dataset" problem):
    - In many tests we found that the data was not well distributed across all storage pools, resulting in one pool being overloaded while the others sat idle. We need to understand how to balance the pools.
  - Misunderstood behaviour of the distributed data management tools:
    - The DB-access jobs require a large sqlite database to be fetched with dq2-get before starting. It was not known that dq2-get was not designed to retrieve from a close site.
    - A large test could have brought systems down (but this was caught before the test thanks to a friendly user).
    - Ganga's download of the sqlite DB was changed (as was dq2-get's behaviour).
  - An Athena I/O bug/misunderstanding:
    - HC found discrepancies between the number of files intended to be processed and the number actually processed.
    - We found that Athena, in the case that a file open() times out, would exit with status 0 and report "success".
    - This behaviour was changed in a subsequent Athena release. The kind of post-job check that exposed it is sketched below.
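A sketch of the post-job check that exposed the Athena bug above: because a job could exit 0 ("success") even after a file open() timed out, the exit code alone is not trusted and the processed-file count is compared against the dataset contents. The function name is illustrative.

```python
def job_really_succeeded(exit_code, files_processed, files_in_dataset):
    """Exit code 0 is necessary but not sufficient: demand full file coverage."""
    if exit_code != 0:
        return False
    return files_processed == files_in_dataset

assert job_really_succeeded(0, 100, 100)
assert not job_really_succeeded(0, 92, 100)   # the 'hidden failure' case above
```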

Next Steps

- GangaRobot functional testing TODO list:
  - Technical improvements:
    - More tests: enumerate the workflows and test them all.
    - Better integration with SAM/dashboards/AGIS: add non-EGEE sites!
  - Procedural improvements:
    - More effort is needed to report to and fix the broken sites.
- HammerCloud stress testing TODO list:
  - V0.2 is ready, pending verification:
    - A new testing model (continuous parallel tests) that will allow upward scaling.
    - Advance booking of repeated (e.g. daily/weekly) tests.
  - Implement testing on the Panda and batch backends: testing on Panda is the top priority.
  - More metrics, improved presentation, and correlation of results: we have more than 60 GB of logfiles... any data miners interested?
  - Make it more generic, with support for other VOs: LHCb testing would be rather simple.

Conclusions

- Validating the grid for user analysis is a top priority for ATLAS Distributed Computing:
  - The functionality available to users is rather complete; now we are testing to see what breaks under full load.
- GangaRobot is an effective tool for functional testing:
  - Daily tests of the common use cases are essential if we want to keep sites working.
- HammerCloud is a relatively new tool, and there is a lot of work to do:
  - Many sites have already improved their network and storage configurations.
  - ATLAS-wide application of these tests is the top development priority.