Published by Anna Jackson. Modified over 9 years ago.
1
7/2/2003 Supervision & Monitoring section
Supervision & Monitoring: Organization and work plan
Olof Bärring
2
Mandate
Develop and deploy a monitoring solution that addresses LHC-era needs in areas such as data rates, data volumes and scalability, and that provides appropriate information for users, administrators, operators and management, both for individual component services and in logical service groupings.
Develop and deploy an automated fault tolerance solution that is compatible with the deployed monitoring solution.
Develop and maintain infrastructure for remote console access and system reset.
Fulfil CERN’s commitments to the monitoring and fault tolerance tasks within EDG/WP4 + WP4 management & integration.
3
LCG-1 monitoring: criteria
All measurement data in Oracle
– for service and computer center managers
  powerful reporting tools
  complex correlation queries
Physics users must be given access to measurement data
– API for query/subscription
– web based query interface?
Alarm display
– for operators and service managers
4
LCG-1 monitoring: client
WP4 Monitoring Sensor Agent (MSA) deployed on all CPU, disk and tape servers.
Sensors:
– FioSensor.pl: exception metrics
– LinuxSensorProc: performance metrics
– Castor performance/exception metrics would be desirable? e.g.
  Tape queue length per device group
  Tape pools (% free)
  Drive status (physical and VDQM)
– Network switches: performance metrics
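To make the idea of a performance sensor concrete, here is a minimal sketch of how a sensor might turn a Linux `/proc` file into metric samples. It is not the WP4 MSA or LinuxSensorProc code: the `Sample` record, the metric names and the function names are all hypothetical, and the only thing assumed about the platform is the standard `/proc/loadavg` layout.

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """One measurement (hypothetical record layout, not the WP4 format)."""
    node: str
    metric: str
    timestamp: float
    value: float


def parse_loadavg(text: str, node: str, now: float) -> List[Sample]:
    """Turn the contents of /proc/loadavg into three load-average samples."""
    # The first three whitespace-separated fields are the 1, 5 and 15 minute
    # load averages; the remaining fields are ignored here.
    names = ("load_1min", "load_5min", "load_15min")
    fields = text.split()[:3]
    return [Sample(node, name, now, float(raw)) for name, raw in zip(names, fields)]


def read_loadavg(node: str, path: str = "/proc/loadavg") -> List[Sample]:
    """Sample the local machine (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.read(), node, time.time())
```

Keeping the parsing separate from the file access means the same function can be exercised on any platform with a captured string, while an agent would call `read_loadavg` periodically and forward the samples to the repository.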
5
LCG-1 monitoring: Server (1)
Measurement Repository, deploy:
– WP4 MR server, TCP or UDP transport
– PVSS, UDP transport
Both need to be evaluated w.r.t.
– Performance in a large deployment
– Operational & maintenance burden
– Physics user interface requirements
Evaluation period: 1–2 months
6
LCG-1 monitoring: Server (2)
Oracle DB
– Use PVSS info server to regularly export to Oracle
– Use WP4 MR server with Oracle backend from David Front (LCG/Israel)
User interfaces
– Service mgrs: Oracle tools
– Users: WP4 repository API + web based query interface
– Operators: alarm display
7
Evaluation phase architecture
[Diagram. Components: MSA, oracleMonServer, PVSS, PVSS Info Server (Export, W2K), Oracle DB, and three API boxes. Annotation on one API: "Can this be given to users?"]
8
Monitoring deployment: Issues
WP4 alarm display: needs to be finalized and deployed
Externalized repository API for PVSS: Andreu’s library requires PVSS client to be installed
Continue to duplicate efforts for another 2 months, knowing that ~half of the work will be thrown away afterwards
9
LCG-1 monitoring: Scenarios
Test both solutions in parallel for ~2 months
Document the evaluation and decide:
– WP4 solution selected
– PVSS solution selected
– Would need both because the requirement scope is too wide, e.g.
  PVSS alarm display is best for the operators
  WP4 implementation of the repository API is best for the users
10
Fault tolerance (FT): plans
Model the escalation procedures: ~May
– Tracing of recovery actions
– Exception escalation hierarchies
Evaluate WP4 FT framework: ~September
– Adaptable to the modeled escalation procedure?
– If not: survey other frameworks (e.g. Pete’s Oracle based correlation engine)
Adapt the LCG-1 monitoring to the FT recovery action tracing if necessary: ~October
Deploy: ~November
11
FT: modeling (~May)
Model the escalation procedure
[Flow diagram. States: Exception raised, Local recoveries, Global recoveries, Alarm raised, Exception reset. Steps: Try local repairs, Try global repairs, Try manual repairs, Problem fixed! Open questions: Escalation? Trace of actions?]
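The escalation idea on this slide can be sketched in a few lines of code. The ordering used here (local repairs first, then global repairs, then an alarm for manual intervention) is an assumption read off the slide labels, since the arrows of the original diagram are not available; the function and type names are hypothetical and do not come from the WP4 FT framework.

```python
from typing import Callable, Dict, List, Tuple

# A repair takes the exception name and returns True if the problem is fixed.
Repair = Callable[[str], bool]
# Each trace entry records (exception, level, succeeded).
Trace = List[Tuple[str, str, bool]]

# Assumed escalation order for automated repairs.
AUTOMATED_LEVELS = ("local", "global")


def handle_exception(name: str, repairs: Dict[str, Repair]) -> Tuple[str, Trace]:
    """Try automated repairs in escalation order and trace every attempt."""
    trace: Trace = []
    for level in AUTOMATED_LEVELS:
        repair = repairs.get(level)
        if repair is None:
            continue  # no repair registered at this level, escalate further
        fixed = repair(name)
        trace.append((name, level, fixed))
        if fixed:
            return "exception reset", trace
    # All automated repairs failed: hand over to a human via an alarm.
    return "alarm raised", trace
```

For example, `handle_exception("daemon_dead", {"local": restart, "global": reroute})` would call the hypothetical `restart` first and only try `reroute` if that fails. The returned trace is the kind of record that could be fed back to monitoring, which is the "trace of actions" question the slide raises.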
12
FT: evaluation (~September)
Evaluate WP4 FT framework
– Does it scale to global correlations?
– Is the rule syntax rich enough?
Check other frameworks
– Pete’s Oracle based solution
13
FT: deployment (~November)
Make sure the framework works together with the LCG-1 monitoring
– FT related metrics
– Correlation engines need:
  API for data consumption (subscription/queries)
  API for action tracing (feedback to monitoring)
Deploy the system and...
– Develop correlation engine and exception escalation hierarchies
– Check that it works in production
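As an illustration of the two APIs a correlation engine would need, the sketch below shows a toy in-memory repository with subscription, query and action-tracing calls. It is not the WP4 repository API or the PVSS interface: the class, the method names, the sample tuple layout and the `ft_action:` metric prefix are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# (node, metric, timestamp, value); a hypothetical sample layout.
Sample = Tuple[str, str, float, float]
Callback = Callable[[Sample], None]


class InMemoryRepository:
    """Toy stand-in for a measurement repository (illustration only)."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []
        self._subscribers: Dict[str, List[Callback]] = {}

    def subscribe(self, metric: str, callback: Callback) -> None:
        """Data consumption, push style: call back on every new sample."""
        self._subscribers.setdefault(metric, []).append(callback)

    def store(self, sample: Sample) -> None:
        """Record a sample and notify subscribers of its metric."""
        self._samples.append(sample)
        for callback in self._subscribers.get(sample[1], []):
            callback(sample)

    def query(self, metric: str, since: float) -> List[Sample]:
        """Data consumption, pull style: samples of a metric since a time."""
        return [s for s in self._samples if s[1] == metric and s[2] >= since]

    def trace_action(self, node: str, action: str, timestamp: float,
                     succeeded: bool) -> None:
        """Action tracing: feed a recovery action back in as a metric."""
        value = 1.0 if succeeded else 0.0
        self.store((node, "ft_action:" + action, timestamp, value))
```

Storing recovery actions as ordinary samples is one possible design choice: it lets the same query and subscription calls serve both measurements and the action trace, so the monitoring side needs no separate channel for fault tolerance feedback.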
14
Timelines
[Gantt chart, Feb to Dec. Tasks shown: Deploy WP4 server; Deploy PVSS; Run both systems in parallel; Evaluation report; Selection; Gather input from selected set of LHC users; Maintenance; Fault tolerance: model escalation and tracing; Evaluate WP4 FT framework; Adapt and deploy]
15
Other tasks
Develop and maintain infrastructure for remote console access and system reset
– Strategy, man-power ??
WP4 management
– WP4 manager
– WP4 monitoring task leader
16
Who does what?
[Assignment matrix; the task-to-person mapping did not survive extraction.]
Tasks: PVSS dev/depl, WP4 mon depl, WP4 mon dev/plan, WP4 mgr, WP4 integr, FT esc, FT eval, Remote console
People: Bill, Hugo, Jan, Juan, Karim, Maite, Olof, Sylvain