Published by Lucy York. Modified over 6 years ago.
1
Analysis Operations Monitoring Requirements
Stefano Belforte
2
Analysis Ops Monitoring Requirements
5 areas
High Level Metrics
  James Letts plots and weekly reports
Job Monitoring
  For users, for Ops
Services Monitoring
  Crab Server, grid schedulers
Disk Space management
  Filled, available, used by jobs
Alarms and Alerts
May 11, 2011, Analysis Ops Monitoring Requirements
3
High Level Metrics
Under control
  See November review e.g.
Requirement:
  Current plots and tables automatically on the web, not by hand on the twiki
James will be at CERN for June's CMS week, a good time to work out details
4
Disk Space Management
Disk usage by site/group: Overview fits the bill
  No major request, add /store/group /store/users
Dataset (un)usage: coming
  Deployment, commissioning, validation etc.
Requirement 1: combine the views
  Sort of CPU-weighted space, need some good idea for presentation
  Claudio's model for data allocation may also be an interesting way to represent it
Requirement 2: maintenance and support
  How much support will there be, and for how long?
5
Job Monitoring
Can't really find a better way to summarize the need than the November 2010 review: will not repeat
Let's split timelines
  What to do in Dashboard now
  What to address in Crab3/WMA
To be concrete
  I do not expect significant changes until Crab3
  Need to look at what WMA already has before making a shopping list
6
Dashboard until Crab3
Weekly summary of CMSSW version usage
High Level metrics
Faster interactive UI for short term (up to 15 days)
  Could possibly get a lot by working at task, not job level
  May be a good thing to have even long term
Current selections/correlations ~OK
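A minimal sketch of the two ideas on this slide: rolling job records up to task level, and counting CMSSW version usage. The record fields, task names and version strings are made up for illustration and are not the Dashboard schema.

```python
from collections import Counter, defaultdict

# Hypothetical job records; field names and values are placeholders,
# not taken from the Dashboard.
jobs = [
    {"task": "userA_task1", "cmssw": "CMSSW_4_1_X", "status": "done"},
    {"task": "userA_task1", "cmssw": "CMSSW_4_1_X", "status": "failed"},
    {"task": "userB_task7", "cmssw": "CMSSW_4_2_X", "status": "done"},
    {"task": "userB_task7", "cmssw": "CMSSW_4_2_X", "status": "done"},
    {"task": "userC_task2", "cmssw": "CMSSW_4_1_X", "status": "running"},
]


def summarize_by_task(records):
    """Collapse job records into one row per task with status counts."""
    tasks = defaultdict(Counter)
    for rec in records:
        tasks[rec["task"]][rec["status"]] += 1
    return {task: dict(counts) for task, counts in tasks.items()}


def cmssw_usage(records):
    """Count jobs per CMSSW version, most used first."""
    return Counter(rec["cmssw"] for rec in records).most_common()


task_summary = summarize_by_task(jobs)
version_usage = cmssw_usage(jobs)
```

The UI then has one row per task to display instead of one per job, which is where the speed-up for short-term views would be expected to come from.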
7
Job Monitoring in Crab2: wishes
Exit codes as link to FAQ on what it means and what to do
Daily digest of site-related failures prepared for site admins
Data reading failures summarised in such a way that we can highlight "blocks needing PhEDEx verify or read test via jobs"
  The latter requires submission with a file (block) list, which is not allowed by current tools, plus some simple "file check" executable/script, which is CMSSW version dependent, i.e. trickier than it seems
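The first two wishes could look roughly like this. Everything specific here is an assumption for illustration: the FAQ URL, the exit codes treated as site-related, and the site names are placeholders, not real CMS values.

```python
from collections import defaultdict

# Hypothetical FAQ location; the real FAQ URL is not given in the slides.
FAQ_BASE = "https://example.org/analysis-faq"

# Placeholder set of exit codes considered site-related.
SITE_RELATED = {9001, 9002}


def faq_link(exit_code):
    """Build a link to the FAQ entry explaining an exit code."""
    return f"{FAQ_BASE}#exit-{exit_code}"


def daily_digest(failures):
    """Group site-related failures by site and exit code, one line each."""
    counts = defaultdict(lambda: defaultdict(int))
    for failure in failures:
        if failure["exit_code"] in SITE_RELATED:
            counts[failure["site"]][failure["exit_code"]] += 1
    lines = []
    for site in sorted(counts):
        for code in sorted(counts[site]):
            lines.append(
                f"{site}: {counts[site][code]} job(s) with exit code {code}, "
                f"see {faq_link(code)}"
            )
    return lines


failures = [
    {"site": "T2_XX_SiteA", "exit_code": 9001},
    {"site": "T2_XX_SiteA", "exit_code": 9001},
    {"site": "T2_XX_SiteB", "exit_code": 9002},
    {"site": "T2_XX_SiteB", "exit_code": 1},  # not site-related, left out
]
digest_lines = daily_digest(failures)
```

Each digest line carries its own FAQ link, so a site admin reading the mail does not need to look the code up separately.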
8
Job Monitoring in Crab3 - 1
While the need is clear (why did those jobs fail?), the solution is not simply monitoring
How can we avoid needing so (too) much monitoring?
  Contain job failures: prevent, preempt, report
  Give users no need to guess, dig, fish, ask ...
  Examples: out of memory, CPU, disk, sites ...
  Define "the box" we can handle and make jobs fit
Monitoring-wise, the WMA job summary looks good
  Need to look at content details
9
Job Monitoring in Crab3 - 2
Error parsing with "code du jour"
  Example: Job Robot summaries
  Ideally input new classification on the web
Easy navigation to stdout of failed jobs
  Running jobs?
Keep existing Task Monitor as user portal
  Fill when submitting to Crab3, not when Crab3 submits to Grid
  Solve Crab vs. Dashboard status
In the end, if jobs succeed, users do not care to monitor; needs will depend on how well we do
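One possible reading of error parsing with a "code du jour" is a rule list that Ops can extend whenever a new failure pattern shows up. The sketch below assumes regular-expression rules matched against job stdout; the patterns and labels are invented for illustration and are not what the Job Robot summaries use.

```python
import re

# Ordered (pattern, label) rules; first match wins.
# Patterns and labels are illustrative assumptions.
rules = [
    (re.compile(r"std::bad_alloc|out of memory", re.I), "memory"),
    (re.compile(r"could not open file", re.I), "data access"),
]


def classify(stdout_text, rule_list):
    """Return the label of the first rule matching the job's stdout."""
    for pattern, label in rule_list:
        if pattern.search(stdout_text):
            return label
    return "unclassified"


# The "code du jour": a rule added today for a newly observed failure.
rules.insert(0, (re.compile(r"stage-?out failed", re.I), "stage-out"))
```

Keeping the rules as data rather than code is what would allow a new classification to be entered from a web form, as the slide asks.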
10
Monitoring of services we operate
Have not looked at WMA yet
  Hope this is all there, or can be added easily
Requirement: car's rpm dial
  How fast it runs now
  What's the possible range
  Where's the safe limit and how close are we
Requirement: flow by components
  Is someone bottlenecking the flow?
  Are some jobs/tasks stuck?
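The rpm dial requirement amounts to three numbers per metric: current value, possible range, safe limit. A minimal sketch follows; the metric name, the numbers, and the 80% warning threshold are arbitrary assumptions, not measured Crab Server values.

```python
from dataclasses import dataclass


@dataclass
class Gauge:
    """One dial: how fast now, what is the range, where is the safe limit."""
    name: str
    current: float
    maximum: float
    safe_limit: float

    def fraction_of_safe_limit(self):
        """How close the current value is to the safe limit (1.0 = at limit)."""
        return self.current / self.safe_limit

    def zone(self):
        """Green below 80% of the safe limit (assumed threshold), red at or above it."""
        if self.current >= self.safe_limit:
            return "red"
        if self.current >= 0.8 * self.safe_limit:
            return "yellow"
        return "green"


# Illustrative numbers only.
dial = Gauge("jobs submitted per hour", current=4500, maximum=10000, safe_limit=5000)
```

The same three numbers per component would also answer the flow question: a component sitting in the red zone while its neighbours are green is the likely bottleneck.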
11
Crab Server monitoring now
Currently have:
  One page with top view
  One page for drilling down (MonAlisa repository)
Basically a publish-subscribe model using MA
  Turned out to be fast to set up and maintenance-free afterwards
Next WMA-based system could be like that
What we miss now is the metrics, not the views
12
Monitoring of services we use
Need a good downtime calendar
  Sites: discussed for years, still work to do
  Would like to have it also for services: CMSWEB, VOMS, Oracle, ...
Avoid subscribing to N lists and browsing N announcement pages
Can only work if automated
13
Alarms and alerts
Monitoring pages are to be used after an alarm is raised, not to search for abnormal situations
Requirement: a common framework where each monitoring component reports problems
  Eventually we will have, like now, many pages and views and things... how do we tie them together? How do we know which are obsolete?
  Each can set a LookAtMe; obnoxious ones can be shut off, good ones will make themselves known when needed
A good alarm bell has a Silence button
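A sketch of what such a common framework might reduce to: every component can raise a LookAtMe, and the operator can silence the noisy ones. The class and method names are hypothetical; only LookAtMe and the Silence button come from the slide.

```python
class AlarmBoard:
    """Single place where monitoring components report problems."""

    def __init__(self):
        self._raised = {}       # component name -> latest message
        self._silenced = set()  # components the operator has shut off

    def look_at_me(self, component, message):
        """A component flags that it needs attention."""
        self._raised[component] = message

    def clear(self, component):
        """A component reports that its problem is gone."""
        self._raised.pop(component, None)

    def silence(self, component):
        """The Silence button: hide alarms from an obnoxious component."""
        self._silenced.add(component)

    def unsilence(self, component):
        self._silenced.discard(component)

    def active(self):
        """Alarms the operator should look at right now."""
        return {c: m for c, m in self._raised.items() if c not in self._silenced}


board = AlarmBoard()
board.look_at_me("crab_server", "queue not draining")
board.look_at_me("disk_space", "95% full")
board.silence("disk_space")
```

With this shape the operator watches one list, `board.active()`, and opens a monitoring page only for the components that appear there.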