TANYA LEVSHINA Monitoring, Diagnostics and Accounting.

Slides:

Advertisements

Similar presentations

Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.

Advertisements

Role Based VO Authorization Services Ian Fisk Gabriele Carcassi July 20, 2005.

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

A Computation Management Agent for Multi-Institutional Grids

Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.

Basic Grid Job Submission Alessandra Forti 28 March 2006.

K.Harrison CERN, 23rd October 2002 HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION - Overview of LHCb distributed production system - Configuration.

First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova

Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova.

Minerva Infrastructure Meeting – October 04, 2011.

Utilizing Condor and HTC to address archiving online courses at Clemson on a weekly basis Sam Hoover 1 Project Blackbird Computing,

The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.

OSG End User Tools Overview OSG Grid school – March 19, 2009 Marco Mambelli - University of Chicago A brief summary about the system.

Track 1: Cluster and Grid Computing NBCR Summer Institute Session 2.2: Cluster and Grid Computing: Case studies Condor introduction August 9, 2006 Nadya.

OSG Public Storage and iRODS

Condor Tugba Taskaya-Temizel 6 March What is Condor Technology? Condor is a high-throughput distributed batch computing system that provides facilities.

OSG Site Provide one or more of the following capabilities: – access to local computational resources using a batch queue – interactive access to local.

03/27/2003CHEP20031 Remote Operation of a Monte Carlo Production Farm Using Globus Dirk Hufnagel, Teela Pulliam, Thomas Allmendinger, Klaus Honscheid (Ohio.

INFSO-RI Enabling Grids for E-sciencE Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the.

3rd June 2004 CDF Grid SAM:Metadata and Middleware Components Mòrag Burgon-Lyon University of Glasgow.

G RID M IDDLEWARE AND S ECURITY Suchandra Thapa Computation Institute University of Chicago.

Apr 30, 20081/11 VO Services Project – Stakeholders’ Meeting Gabriele Garzoglio VO Services Project Stakeholders’ Meeting Apr 30, 2008 Gabriele Garzoglio.

11/30/2007 Overview of operations at CC-IN2P3 Exploitation team Reported by Philippe Olivero.

Grid job submission using HTCondor Andrew Lahiff.

Evolution of the Open Science Grid Authentication Model Kevin Hill Fermilab OSG Security Team.

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

Open Science Grid OSG CE Quick Install Guide Siddhartha E.S University of Florida.

June 10, D0 Use of OSG D0 relies on OSG for a significant throughput of Monte Carlo simulation jobs, will use it if there is another reprocessing.

Report from USA Massimo Sgaravatto INFN Padova. Introduction Workload management system for productions Monte Carlo productions, data reconstructions.

EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.

July 11-15, 2005Lecture3: Grid Job Management1 Grid Compute Resources and Job Management.

What is SAM-Grid? Job Handling Data Handling Monitoring and Information.

Local Monitoring at SARA Ron Trompert SARA. Ganglia Monitors nodes for Load Memory usage Network activity Disk usage Monitors running jobs.

Overview of Privilege Project at Fermilab (compilation of multiple talks and documents written by various authors) Tanya Levshina.

Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.

1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.

OSG AuthZ components Dane Skow Gabriele Carcassi.

Auditing Project Architecture VERY HIGH LEVEL Tanya Levshina.

Data Management The European DataGrid Project Team

December 07, 2006Parag Mhashilkar, Fermilab1 Samgrid – OSG Interoperability Parag Mhashilkar, Fermi National Accelerator Laboratory.

JSS Job Submission Service Massimo Sgaravatto INFN Padova.

Open Science Grid Build a Grid Session Siddhartha E.S University of Florida.

BNL dCache Status and Plan CHEP07: September 2-7, 2007 Zhenping (Jane) Liu for the BNL RACF Storage Group.

HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison.

July 26, 2007Parag Mhashilkar, Fermilab1 DZero On OSG: Site And Application Validation Parag Mhashilkar, Fermi National Accelerator Laboratory.

A Data Handling System for Modern and Future Fermilab Experiments Robert Illingworth Fermilab Scientific Computing Division.

Geant4 GRID production Sangwan Kim, Vu Trong Hieu, AD At KISTI.

FIFE Architecture Figures for V1.2 of document. Servers Desktops and Laptops Desktops and Laptops Off-Site Computing Off-Site Computing Interactive ComputingSoftware.

Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.

Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!

ATLAS Computing Wenjing Wu outline Local accounts Tier3 resources Tier2 resources.

Patrick Gartung 1 CMS 101 Mar 2007 Introduction to the User Analysis Facility (UAF) Patrick Gartung - Fermilab.

Integrating HTCondor with ARC Andrew Lahiff, STFC Rutherford Appleton Laboratory HTCondor/ARC CE Workshop, Barcelona.

Open Science Grid Consortium Storage on Open Science Grid Placing, Using and Retrieving Data on OSG Resources Abhishek Singh Rana OSG Users Meeting July.

Parag Mhashilkar (Fermi National Accelerator Laboratory)

Vendredi 27 avril 2007 Management of ATLAS CC-IN2P3 Specificities, issues and advice.

Advanced Computing Facility Introduction

Jean-Philippe Baud, IT-GD, CERN November 2007

Quick Look on dCache Monitoring at FNAL

Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)

Primer for Site Debugging

Workload Management System

CREAM-CE/HTCondor site

Artem Trunov and EKP team EPK – Uni Karlsruhe

Troubleshooting Your Jobs

Haiyan Meng and Douglas Thain

HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.

Michael P. McCumber Task Force Meeting April 3, 2006

Troubleshooting Your Jobs

The LHCb Computing Data Challenge DC06

Presentation transcript:

TANYA LEVSHINA Monitoring, Diagnostics and Accounting

Talk Overview New Infrastructure Offsite Submission Monitoring and Debugging:  Grid Jobs  Data Management Accounting Help! 6/17/14 FIFE Workshop 2

What’s New and Why Should I Care? (I) New jobsub: client/server architecture  More scalable than old architecture  Changes that affect you:  You cannot login on submission node  jobsub commands have changed – check help & documentation  Log files need to be fetched from the server site  jobsub will show the status of your jobs but if you want to see job details for now you should use the command condor_q (and specify pool and schedd) High Availability Service  Multiple jobsub servers  Changes that affect you:  new job 6/17/14 FIFE Workshop 3

What’s New and Why Should I Care? (II) New software distribution paradigm: OASIS/CVMFS  Robust, scalable way to distribute software on remote sites  Changes that affect you:  You need to build relocatable software  Higher probability of encountering access problems on remote worker nodes New resources: FermiCloud, the OSG( more than 80 institutions), and AWS clusters:  More opportunistic resources  Changes that affect you:  Complicated troubleshooting  Cannot expect to have access to BlueArc 6/17/14 FIFE Workshop 4

Gird Jobs: Monitoring and Debugging 6/17/14 FIFE Workshop 5

Job Submission You submitted a job: <> jobsub_submit.py --resource-provides=usage_model=DEDICATED --group uboone …. Submitting job(s) job(s) submitted to cluster 386. JobsubJobId of first job: Use job id to retrieve output Let’s look at the status job <> jobsub_q.py -G uboone 00.0test_local_env.sh 00.0test_local_env.sh 00.0test_local_env.sh Why my job is idle? 6/17/14 FIFE Workshop 6

Idle jobs (I) There are multiple reasons why your job can be idle: It requires some time to request and start a glidein on a remote worker node and match your job. Be patient especially if your are submitting jobs to opportunistic, offsite, fermicloud or AWS resources. Your job doesn’t match any requirements  Unsupported resources (SL7, huge memory requirement) Your job is starting, failing right away and it is put back into a queue The system is too busy running other jobs. 6/17/14 FIFE Workshop 7

Idle jobs (II) If your job is idle for more than half an hour, check: requirements using condor_q –better-analyze. for number of restarts: condor_q -g -pool fifebatch1.fnal.gov –l -format "%s\n" NumRestarts NumRestarts = 5 the log file and look for memory usage: …. 006 ( ) 06/12 13:33:42 Image size of job updated: MemoryUsage of job (MB) ResidentSetSize of job (KB)... GlideinWMS Frontend monitoring page. 6/17/14 FIFE Workshop 8

Idle Jobs – Very Busy System 6/17/14 FIFE Workshop 9 VO Frontend Monitor (

Held Jobs If you see: condor_q -g -pool fifebatchgpvmhead1.fnal.gov -name fifebatch2.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD tlevshin 6/5 10: :00:28 H test.sh Check the held reason: condor_q -g -pool fifebatchgpvmhead1.fnal.gov –name fifebatch2.fnal.gov –l -format "%s " MATCH_EXP_JOB_GLIDEIN_Site -format "%s\n" HoldReason MATCH_EXP_JOB_GLIDEIN_Site = "BNL" HoldReason = "Error from error changing sandbox ownership to the user: condor_glexec_setup exited with status 768 Glexec exited with status 203;” Other reasons: HoldReason = "Job restarted too many times (by user condor)” HoldReason = "Error from STARTER at failed to send file(s) to : error reading from …SHADOW failed to receive file(s) from ” Open SNOW ticket! 6/17/14 FIFE Workshop 10

Job Monitoring Use jobsub_q.py to check if your jobs is still running Use jobsub_fetchlog to download your log files Check FIFEMON (we plan to support FIFEMON for the new infrastructure) - user’s jobs info - batch details, enstore, blue arc and dCache activities per experiment 6/17/14 FIFE Workshop 11

FIFEMON 6/17/14 FIFE Workshop 12 Batch details about your experiment Storage details Batch details about you job

Job Debugging (I) 6/17/14 FIFE Workshop 13 Job failed: Do you have enough information to debug a problem  Add set –x for simplify debugging  Exit with non-zero exit code in case of failure  Add timestamps to your log (echo “ Started ifdh `date`”) especially when you are doing file transfers.  You should always log the following debugging information: site, host, and OS your job has been executed: echo Site:${GLIDEIN_ResourceName} echo "the worker node is " `hostname` "OS: " `uname –a` Your script has a bug. Have you tested it on your local machine? Authentication/authorization problems with remote site

Job Debugging (II) 6/17/14 FIFE Workshop 14 Worker node problems:  cvmfs problem  missing libraries  not enough space (worker node scratch area per slot is limited to 20GB – 80GB)  not enough memory (OSG limit is usually 2GB per slot) Create a SNOW ticket, provide: Your experiment, role (DN and FQAN of a proxy certificate) On what Site, Node your job was executed Job stdout/stderr and log files You can always request a FermiCloud VM node that looks like a “generic” grid worker node and do testing and debugging there.

Data Management: Monitoring and Debugging 6/17/14 FIFE Workshop 15

Data Management 6/17/14 FIFE Workshop 16 We're not in Kansas anymore! The jobs could be started at Nebraska, Oklahoma or other OSG sites. So you should not expect to get direct access to BlueArc.  How do you get files to/from a worker node? Do you use SAM/SAM WEB? Check your SAM station for input file download: Do you use IFDH? Do you know where your output is going: Fermi Public dCache? Fermilab BlueArc? Remote Storage Element?

SAM Monitoring 6/17/14 FIFE Workshop 17

IFDH Monitoring 6/17/14 FIFE Workshop 18 It is strongly recommended to add to your script or pass it with –e to jobsub (especially for jobs running offsite): export IFDH_DEBUG=1 ifdh does a lot of guessing how to reach a proper storage when plain path is specified er_data_routing er_data_routing If you need to use the remote storage element, you will need to know the address and path to the storage element and use it as the source or destination arguments in the ifdh command.

IFDH Debugging (I) 6/17/14 FIFE Workshop 19 What can go wrong (especially when running offsite and/or using remote SEs)?:  Service (SRM Endpoint or gridftp) is down  Network problems  Authentication (proxy has expired)  Authorization (you don’t have access to a SE or permissions to access a specific directory)  Worker node problem:  missing executables, libraries (globus, srm client commands) or cvmfs problems  problem with a host certificate or crl files  File not found, wrong path, etc – trivial bugs

IFDH Debugging (II) 6/17/14 FIFE Workshop 20 To debug:  Check logs  Execute ifdh command manually Create a SNOW ticket, provide:  Your experiment, role (DN and FQAN of a proxy certificate)  On what Site, Node your job was executed  What command you have used to transfer files  Timestamp  Command stdout/stderr output  Results of executing the same command manually in local environments

dCache Monitoring 6/17/14 FIFE Workshop 21 You can check dCache transfers if your files are going to Fermilab Public dCache (path /pnfs/fnal.gov/usr/ /scratch/…) Home page: Per experiment space usage: bin/sg_usage_cgi.pyhttp://fndca.fnal.gov/cgi- bin/sg_usage_cgi.py Per experiment dCache transfers: bin/sg_transfers_cgi.pyhttp://fndca.fnal.gov/cgi- bin/sg_transfers_cgi.py Active transfers: Recent FTP transfers: File lifetime plots:

dCache Monitoring Examples 6/17/14 FIFE Workshop 22

Enstore Monitoring 6/17/14 FIFE Workshop 23 You can check Enstore (path /pnfs/fnal.gov/usr/ /rawdata) Home page: Small Files Aggregation: bin/enstore_sfa_hud_cgi.py Per experiment tape burn-rate: stken.fnal.gov/enstore/all_sg_burn_rates.html Quotas (more usage): stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS Per experiment drive usage: stken.fnal.gov/enstore/drive-hours-sep/plot_enstore_system.htmlhttp://www- stken.fnal.gov/enstore/drive-hours-sep/plot_enstore_system.html Complete file listings: bin/enstore_file_listing_cgi.pyhttp://www-stken.fnal.gov/cgi- bin/enstore_file_listing_cgi.py

Enstore Monitoring Examples 6/17/14 FIFE Workshop 24 Tape burn rate per experiment Drive usage per experiment

FTS Monitoring 6/17/14 FIFE Workshop 25 The File Transfer Service (FTS) is used for adding new files into the SAM catalogue, copying them to new locations, and archiving them to Enstore. Example: /fts/status /fts/status

Accounting 6/17/14 FIFE Workshop 26

Grid Accounting (Gratia) 6/17/14 FIFE Workshop 27 Fermi Accounting scoreboard:  Weekly generated customizable plots (wall duration, job count, VMs usage, transfers to dCache and BlueArc, usage of the OSG).  Weekly and monthly generated customizable plots per experiments  We can add more plots on request.

Accounting Examples(I) 6/17/14 FIFE Workshop 28 Experiment Accounting: d/ _week.html

Accounting Examples (II) 6/17/14 FIFE Workshop 29

Accounting Example (III) 6/17/14 FIFE Workshop 30 It looks like MWT2 sites were not very reliable sites for NOvA during the last week. Let’s check for a monthly trend. Efficient usage of the OSG sites (CPU wall hours/Wall hours):

Getting Help 6/17/14 FIFE Workshop 31 Open a SNOW ticket: