Analysis Operations Monitoring Requirements Stefano Belforte


Analysis Ops Monitoring Requirements, May 11, 2011

5 areas
- High Level Metrics
  - James Letts plots and weekly reports
- Job Monitoring
  - For users, for Ops
- Services Monitoring
  - Crab Server, grid schedulers
- Disk Space management
  - Filled, available, used by jobs
- Alarms and Alerts

High Level Metrics
- Under control
  - See November review
- E.g. requirement: current plots and tables automatically on the web, not by hand on the twiki
- James will be at CERN for June's CMS week, a good time to work out details

Disk Space Management
- Disk usage by site/group: Overview fits the bill
  - No major request, add /store/group /store/users
- Dataset (un)usage: coming
  - Deployment, commissioning, validation etc.
- Requirement 1: combine the views
  - Sort of CPU-weighted space, need some good idea for presentation; Claudio's model for data allocation may also be an interesting way to represent it
- Requirement 2: maintenance and support
  - How much support will be there, and for how long?

Job Monitoring
- Can't really find a better way to summarize the need than the November 2010 review: will not repeat
- Let's split timelines
  - What to do in Dashboard now
  - What to address in Crab3/WMA
- To be concrete
  - I do not expect significant changes until Crab3
  - Need to look at what WMA already has before making a shopping list

Dashboard until Crab3
- Weekly summary of CMSSW version usage
  - High Level metrics
- Faster interactive UI for short term (up to 15 days)
  - Could possibly get a lot by working at task, not job level
  - May be a good thing to have even long term
- Current selections/correlations ~OK
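To illustrate what working at task level rather than job level could mean for the Dashboard wishes above, here is a minimal Python sketch. It is not the Dashboard's implementation: the record layout (`task`, `status`, `cmssw`), the task names, and the version strings are all assumptions invented for the example.

```python
from collections import Counter, defaultdict

# Made-up job records; field names and values are assumptions,
# not the real Dashboard schema.
jobs = [
    {"task": "user1_taskA", "status": "done", "cmssw": "CMSSW_4_1_3"},
    {"task": "user1_taskA", "status": "done", "cmssw": "CMSSW_4_1_3"},
    {"task": "user1_taskA", "status": "failed", "cmssw": "CMSSW_4_1_3"},
    {"task": "user2_taskB", "status": "running", "cmssw": "CMSSW_4_2_3"},
]


def summarize_by_task(job_records):
    """Collapse per-job records into one status count per task."""
    summary = defaultdict(Counter)
    for job in job_records:
        summary[job["task"]][job["status"]] += 1
    return {task: dict(counts) for task, counts in summary.items()}


def cmssw_usage(job_records):
    """Count jobs per CMSSW version, e.g. for a weekly summary."""
    return Counter(job["cmssw"] for job in job_records)


if __name__ == "__main__":
    print(summarize_by_task(jobs))
    print(cmssw_usage(jobs))
```

A UI that only needs the per-task rows handles far fewer records than one listing every job, which is one possible reading of "could possibly get a lot".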

Job Monitoring in Crab2: wishes
- Exit codes as links to a FAQ on what they mean and what to do
- Daily digest of site-related failures prepared for site admins
- Data reading failures summarised in such a way that we can highlight "blocks needing PhEDEx verify or read test via jobs"
  - The latter requires submission with a file (block) list, which is not allowed by current tools, plus some simple "file check" executable/script, which is CMSSW version dependent, i.e. trickier than it seems
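The first two wishes (exit codes linked to a FAQ, and a daily digest of site-related failures) could be sketched roughly as follows in Python. Everything specific here is hypothetical: the FAQ URL is a placeholder, the exit codes are invented numbers rather than real CMSSW or Crab codes, and the site names are made up.

```python
from collections import Counter, defaultdict

# Placeholder FAQ location (assumption, not a real page).
FAQ_BASE = "https://example.org/crab-faq#exit-code-"

# Invented exit codes that we pretend are site-related (assumption).
SITE_RELATED = {1001, 1002}


def faq_link(exit_code):
    """Build the FAQ link shown next to an exit code."""
    return f"{FAQ_BASE}{exit_code}"


def daily_digest(failures, site_related_codes=SITE_RELATED):
    """Group site-related failures by site, counting each exit code."""
    digest = defaultdict(Counter)
    for failure in failures:
        if failure["exit_code"] in site_related_codes:
            digest[failure["site"]][failure["exit_code"]] += 1
    return {site: dict(codes) for site, codes in digest.items()}


# Made-up failures for one day.
failures = [
    {"site": "T2_A", "exit_code": 1001},
    {"site": "T2_A", "exit_code": 1001},
    {"site": "T2_B", "exit_code": 1002},
    {"site": "T2_A", "exit_code": 2001},  # not site-related, left out
]

if __name__ == "__main__":
    for site, codes in daily_digest(failures).items():
        for code, count in codes.items():
            print(f"{site}: {count} x exit {code} -> {faq_link(code)}")
```

Which exit codes count as site-related is the real work; the sketch simply takes that classification as given.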

Job Monitoring in Crab3 - 1
- While the need is clear (why did those jobs fail?), the solution is not simply monitoring
- How can we avoid needing so (too) much monitoring?
  - Contain job failures: prevent, preempt, report
  - Give users no need to guess, dig, fish, ask ...
  - Examples: out of memory, CPU, disk, sites ...
  - Define "the box" we can handle and make jobs fit
- Monitoring-wise, WMA job summary looks good
  - Need to look at content details

Job Monitoring in Crab3 - 2
- Error parsing with "code du jour"
  - Example: Job Robot summaries
  - Ideally input new classification on the web
- Easy navigation to stdout of failed jobs
  - Running jobs?
- Keep existing Task Monitor as user portal
  - Fill when submitting to Crab3, not when Crab3 submits to Grid
  - Solve Crab vs. Dashboard status
- In the end, if jobs succeed, users do not care to monitor; needs will depend on how well we do

Monitoring of services we operate
- Have not looked at WMA yet
  - Hope this is all there, or can be added easily
- Requirement: car's rpm dial
  - How fast it runs now
  - What's the possible range
  - Where's the safe limit and how close are we
- Requirement: flow by components
  - Is someone bottlenecking the flow?
  - Are some jobs/tasks stuck?
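The "rpm dial" requirement amounts to reporting, for each service metric, the current value, the possible range, and the distance to a safe limit. A minimal Python sketch of such a gauge, under assumptions, is given below: the class name `Gauge`, its fields, and the sample numbers are illustrative and do not come from WMA or Crab Server.

```python
from dataclasses import dataclass


@dataclass
class Gauge:
    """One dial: current value, possible range, and safe limit (all assumed)."""
    name: str
    current: float
    minimum: float
    maximum: float
    safe_limit: float

    def __post_init__(self):
        if self.maximum <= self.minimum:
            raise ValueError("maximum must be greater than minimum")

    def fraction_of_range(self):
        """How far along the possible range the service is running now."""
        return (self.current - self.minimum) / (self.maximum - self.minimum)

    def headroom(self):
        """Distance left before the safe limit (negative once exceeded)."""
        return self.safe_limit - self.current

    def in_red_zone(self):
        """True when the current value has reached the safe limit."""
        return self.current >= self.safe_limit


if __name__ == "__main__":
    # Made-up numbers for a hypothetical metric.
    dial = Gauge("jobs submitted per hour", current=600, minimum=0,
                 maximum=1000, safe_limit=800)
    print(dial.fraction_of_range(), dial.headroom(), dial.in_red_zone())
```

The hard part, as a later slide notes, is having the metrics at all; once a metric and its limits exist, the view is a small amount of code.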

Crab Server monitoring now
- Currently have:
  - One page with top view: http://cmsdoc.cern.ch/cms/LCG/crab/overview/overview.html
  - One page for drilling down (MonAlisa repository): http://glidein-mon.t2.ucsd.edu:8080/
- Basically a publish-subscribe model using MA turned out to be fast to set up and maintenance-free afterwards
- Next WMA based system could be like that
- What we miss now is the metrics, not the views

Monitoring of services we use
- Need a good downtime calendar
  - Sites: discussed for years, still work to do
  - Would like to have it also for services: CMSWEB, VOMS, Oracle, ...
- Avoid subscribing to N lists and browsing N announcement pages
- Can only work if automated

Alarms and alerts
- Monitoring pages are to be used after an alarm is raised, not in search of an abnormal situation
- Requirement: a common framework where each monitoring component reports problems
  - Eventually we will have, like now, many pages and views and things... how do we tie them together? How do we know which are obsolete?
  - Each can set a LookAtMe; obnoxious ones can be shut off, good ones will make themselves known when needed
- A good alarm bell has a Silence button
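One possible shape for the common framework described above is a shared board where each monitoring component raises its own LookAtMe flag and operators can silence noisy ones. The Python sketch below is only an illustration of that idea; the class `AlarmBoard`, its method names, and the component names are hypothetical and not taken from any existing CMS tool.

```python
class AlarmBoard:
    """Shared place where monitoring components report problems (sketch)."""

    def __init__(self):
        self._raised = {}       # component name -> latest message
        self._silenced = set()  # components an operator has shut off

    def look_at_me(self, component, message):
        """A component flags that its page deserves attention."""
        self._raised[component] = message

    def clear(self, component):
        """A component reports that its problem is gone."""
        self._raised.pop(component, None)

    def silence(self, component):
        """The Silence button: hide alarms from an obnoxious component."""
        self._silenced.add(component)

    def unsilence(self, component):
        self._silenced.discard(component)

    def active(self):
        """Alarms an operator should look at now, silenced ones excluded."""
        return {name: msg for name, msg in self._raised.items()
                if name not in self._silenced}


if __name__ == "__main__":
    board = AlarmBoard()
    board.look_at_me("crab-server", "submission queue not draining")
    board.look_at_me("disk-space", "site above 95% full")
    board.silence("disk-space")
    print(board.active())
```

Silencing hides a component without deleting its state, so un-silencing brings back any alarm that is still raised.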