
Assessment of Analysis Failures, DAST Perspectives
Nurcan Ozturk, University of Texas at Arlington
US ATLAS Distributed Facility Workshop at SLAC, 12-13 October 2010

Distributed Analysis Panda User Jobs in All Clouds
The US cloud has been very active in distributed user analysis since the beginning. Ganga user jobs account for 20% of Panda user jobs.

Distributed Analysis Panda User Jobs in US Cloud
All US sites have been contributing successfully to distributed user analysis. Job counts went from 2K to 7K over the year.

Analysis job summary in all clouds, last 12 hours (CERN time)
The US cloud has been running the most jobs among all clouds: a stable cloud with a low job failure rate. Note: BNL queues are being drained for the dCache intervention scheduled for today.

Analysis job summary in US cloud, last 12 hours (CERN time)
BNL queues are being drained for the dCache intervention scheduled for today.

Quick Look at BNL and HU Site Failures
(plot annotation: user code problem)

Closer Look at Analysis Job Error Report (1)

trans:
- Unspecified error, consult log file
- TRF_SEGVIO - Segmentation violation

panda:
- Put error: Error in copying the file from job workdir to localSE
- Lost heartbeat
- No space left on local disk
- Get error: Staging input file failed
- Put error: Failed to add file size and checksum to LFC
- Payload stdout file too big
- Exception caught by pilot
- Put error: Failed to import LFC python module
- Bad replica entry returned by lfc_getreplicas(): SFN not set in LFC for this guid
- wget command failed to download trf
- Missing installation
- etc.

"panda" errors (pilotErrorCode) are the ones relevant for the attention of the US Facility; however, it is difficult to observe a pattern in them.

Closer Look at Analysis Job Error Report (2)

ddm:
- Could not add output files to dataset

athena:
- Athena ran out of memory, Athena core dump
- ATH_FAILURE - Athena non-zero exit
- Athena core dump or timeout, or conddb DB connect exception
- Athena crash - consult log file

user:
- User work directory too large

other:
- Unknown error code

Note: the above is my own classification of the errors; I have not double-checked with Torre how they are classified in the analysis job error report on the Panda monitor.
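To make this classification concrete, here is a minimal sketch in Python. It is my illustration rather than the Panda monitor's actual code; the substring patterns are simply lifted from the error lists on these two slides.

# Minimal sketch (not the Panda monitor's actual code): map a raw job
# error message onto the error categories used on these slides. The
# substring patterns come from the lists above; a real error report
# would need a more complete table.
ERROR_CATEGORIES = {
    "panda": ["Put error", "Get error", "Lost heartbeat",
              "No space left on local disk", "Exception caught by pilot"],
    "trans": ["Unspecified error", "TRF_SEGVIO"],
    "ddm": ["Could not add output files to dataset"],
    "athena": ["Athena ran out of memory", "ATH_FAILURE", "Athena core dump"],
    "user": ["User work directory too large"],
}

def classify_error(message):
    """Return the slide's error category for a raw error message."""
    for category, patterns in ERROR_CATEGORIES.items():
        if any(p in message for p in patterns):
            return category
    return "other"  # e.g. "Unknown error code"

print(classify_error("Get error: Staging input file failed"))  # -> panda

Anything matching none of the patterns lands in "other", mirroring the "Unknown error code" entry above.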

User problems reported to DAST for US sites

DAST – Distributed Analysis Support Team

- DAST started in September 2008 to provide combined support for pathena and Ganga users.
- It is the first point of contact for distributed analysis questions.
- All kinds of problems are discussed in the DA help forum (hn-atlas-dist-), not just pathena- and Ganga-related ones:
  - athena
  - physics analysis tools
  - conditions database access
  - site/service problems
  - dq2-* tools
  - data access at sites
  - data replication
  - etc.
- DAST helps directly by solving the problem or by escalating it to the relevant experts.

Team Members

EU and NA time zone members: Daniel van der Ster, Nurcan Ozturk (now in EU time zone), Mark Slater, Alden Stradling, Hurng-Chun Lee, Sergey Panitkin, Bjorn Samset, Bill Edson, Christian Kummer, Wensheng Deng, Maria Shiyakova, Shuwei Ye, Jaroslava Schovancova, Nils Krumnack, Manoj Jha, Woo Chun Park, Elena Oliver Garcia, Karl Harrison, Frederic Brochu, Daniel Geerts, Carl Gwilliam, Mohamed Gouighri, Borge Gjelsten, Katarina Pajchel, Eric Lancon (names in red on the slide: trainees).

A small team covers the NA time zone; most of its members are fully or partly supported by the US ATLAS Physics Support and Computing Program, and thus have close ties to US ATLAS Facilities and Support.

User problems reported to DAST for US sites (1)

Looking at recent DAST shift reports, here are some of the issues reported by users at US sites:

- A wrong update concerning the analysis cache affected several sites, including US ones.
- Error at BNL - pilot: Get error: lsm-get failed. The pool hosting the input files was unavailable due to a machine reboot; the pool came back shortly after.
- MWT2 site admin reporting user analysis failures (HammerCloud and user code problems). Very useful for heads-up and for understanding site performance better.
- Production cache missing at MWT2. Installed by Xin; available at several analysis sites but not all; no indication of usage in physics analyses at:

User problems reported to DAST for US sites (2)

- dq2-get error, "lcg_cp: Communication error on send" at MWT2 - a transient error.
- Why do Australian grid certificates not work at BNL? dq2-get difficulties for some Australian users. The shifter has not yet escalated this issue to BNL.
- Tier3 issue - a user can't use the WISC_LOCALGROUPDISK as a DaTRI source. DDM people reported that a fix is in the works.

User File Access Pattern (1)

Pedro examined the user access pattern at BNL. His observations are as follows:

- He took a 1.5 h sample; 29 files per second were accessed just for reading, and 56% of the files were user library files. This suggests that users are splitting jobs into many subjobs (each subjob needs to access the library file).
- Outcome: lots of reads and writes to the userdisk namespace database; the writes clog up and cause timeouts on the whole pnfs namespace. Heavy usage of the same file risks breaking the storage server, and the files would then be inaccessible since there is no redundancy (this is not a hotdisk area). In this case he estimated a failure rate of more than 2/sec, and depending on the problem the recovery can take 2 min to 15 min (… job failures).
- Solution: if this continues to be the practice among users, he needs to make more changes to the local site mover and add some extra 'spices' to be able to scale the whole system (except for data availability; for that, lib files would need to go to HOTDISK by default, or to some new HOTUSERDISK token).
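As an illustration of the kind of numbers quoted above, here is a minimal sketch that computes the read rate and the library-file fraction from a sample of (timestamp, filename) read records. The record format and the ".lib.tgz" test for user library files are assumptions, and the synthetic sample only mimics the scale of the 1.5 h measurement.

# Minimal sketch (assumed record format): estimate reads/second and the
# fraction of user library files in a sampled window.
def summarize_reads(records, window_seconds):
    n = len(records)
    rate = n / window_seconds
    # Assumed heuristic: pathena user library files end in ".lib.tgz".
    n_lib = sum(1 for _, name in records if name.endswith(".lib.tgz"))
    return rate, (n_lib / n if n else 0.0)

# Synthetic 1.5 h (5400 s) sample at ~29 reads/s; the slide quotes
# 29 reads/s with 56% library files (the 50% here is just synthetic).
sample = [(t, "user.x.lib.tgz" if t % 2 else "data.root") for t in range(156600)]
rate, lib_frac = summarize_reads(sample, window_seconds=5400)
print(f"{rate:.0f} reads/s, {lib_frac:.0%} library files")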

User File Access Pattern (2)

I looked at the user jobs Pedro gave as examples:

                           gregorm    davie                csandova     zmeng
jobsetID
# subjobs
# input files/job          4 ESD      1 D3PD               1 ESD
average run time/job       3h         3' to 37'            13' to 3h    14' to 50'
average input file size    800 MB     200 MB (some 6 MB)   3 GB
average output file size   60 KB      KB                   200 MB       130 MB

User File Access Pattern (3)

- Conclusion: users are splitting jobs into many subjobs, and their output file sizes are small; they then have difficulty downloading the outputs with dq2-get. As discussed on the DAST help list last week, the file lookup can take as long as 30 s in the DDM system, so users have been advised to merge their output files at the source and then download (see the sketch after this list).
- The default limit on the input file size is 14 GB in the pilot, so these users could run on more input files in one job. They are probably splitting more to be on the safe side.
- How can we advise users? Can pathena print a message when a job is split into too many subjobs?
- This needs to be further discussed.
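The merge-then-download advice could look like the sketch below, assuming the subjob outputs are ROOT files and ROOT's hadd merge utility is available where they are stored; the file name pattern and the merged file name are hypothetical.

# Minimal sketch: merge many small subjob outputs into one file with
# ROOT's hadd before transferring, so dq2-get fetches a single file
# instead of hundreds (one DDM lookup instead of one per file).
import glob
import subprocess

outputs = sorted(glob.glob("user.someuser.analysis.*.root"))  # hypothetical pattern
if len(outputs) > 1:
    # hadd -f target source1 source2 ...  (-f overwrites an existing target)
    subprocess.run(["hadd", "-f", "merged.root"] + outputs, check=True)

The same merge can of course be done with plain hadd on the command line; the point is that the user then transfers one merged.root rather than one file per subjob.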

Storage space alerts for SCRATCHDISKs

DAST receives notifications from the DQ2 system. An example from today:

Date: Tue, 12 Oct 2010
Subject: [DQ2 notification] Storage space alerts for SCRATCHDISKs

Site                     Free(TB)  Total(TB)
IN2P3-LPSC_SCRATCHDISK   (16%)
NET2_SCRATCHDISK         (8%)

For questions related to this alarm service:

NET2 is often in the list. Is there a problem with cleaning the SCRATCHDISK?
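The alert logic implied by such a notification can be sketched as follows. This is an illustration, not the DQ2 service's actual code; the 20% threshold is an assumption, and the free-space fractions are the two from the example above.

# Minimal sketch (not the actual DQ2 alarm code): flag scratch areas
# whose free fraction falls below an assumed threshold.
ALERT_THRESHOLD = 0.20  # assumed: alert below 20% free

free_fraction = {
    "IN2P3-LPSC_SCRATCHDISK": 0.16,  # fractions from the notification above
    "NET2_SCRATCHDISK": 0.08,
}

for site, frac in sorted(free_fraction.items()):
    if frac < ALERT_THRESHOLD:
        print(f"ALERT: {site} is only {frac:.0%} free")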

How the US Cloud Provides Support to DAST

- Site issues are followed up via GGUS tickets: good response from site admins. Some site admins also watch the DA help forum for issues reported at their sites, thus providing prompt support.
- General questions: good response; Yuri often keeps an eye on this list.
- Release installation issues at US sites: Xin Zhao.
- Also private communications with Paul Nilsson and Torre Wenaus on pilot and Panda monitor issues (not only for US sites, of course).

Summary

- The US cloud continues to be a successful cloud in ATLAS distributed user analysis.
- Good US site performance: among the least problematic analysis sites in ATLAS.
- User analysis jobs fail with different errors. The dominant ones are classified as "panda" and "athena" on the Panda monitor. "athena" errors often point to user code problems; "panda" errors vary, and it is difficult to observe a pattern or classify them as a common problem.
- Good US cloud support to DAST. No operational issues to report in the US Facility, in particular from the DAST point of view.
- User access patterns need to be discussed further for better site/storage performance.
- I have not mentioned here the new Panda Dynamic Data Placement (PD2P) mechanism, which has greatly helped both users (long waiting periods in the queue) and sites (storage). I expect Kaushik will cover this topic.