Published by Lindsey Logan. Modified over 8 years ago.
1
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
IT Monitoring WG
IT/CS Monitoring System
Virginie Longo
September 14th 2011
2
Summary
- CS Monitoring Systems
  - Spectrum CA
  - Performance Analysis
  - Other Tools
- Data storage
- Requirements
  - NMS Status
  - Requirements
  - Research
3
CS Monitoring systems
4
Spectrum CA
Description:
- Commercial tool
- Fault-management-oriented system
- Root cause analysis / alarm correlation
- Topology view
- Service Manager => relation with SLS view
- Basic performance manager
Volumes:
- ~3000 devices monitored
- Support for 3K Laser devices for simple alarms (UP/DOWN)
- Thousands of attributes polled and analyzed
- 6 GB of event data over 30 days
Monitoring protocols: SNMP and ICMP
- Information fed only by SNMP (no remote agent)
- A few other protocols supported: DNS / DHCP / TRACEROUTE / NTP / HTTP
- A few home-made scripts for DHCP and web monitoring
5
Alarm Monitoring
Spectrum Architecture (Storage system)
[Diagram: storage components of the Spectrum system and of non-Spectrum systems]
- Spectrum DB: models, topology, current polling values, alarms
- Oracle Stats (CSR)
- Oracle Alarm History (LANDB)
- MySQL Events
- Remote MySQL Service Manager
- SLS Devices Info
6
Performance Analysis
Statistics architecture:
- Mix of a home-made system and the Spectrum tool
- Data extracted from Spectrum into an Oracle DB
- Data consolidated into RRDs
- Displayed on the Netstat website (PHP)
Volumes:
- ~9000 models (ports + devices) for 24K RRDs
- 36 metrics
- 157 attributes
- ~160K entries loaded into the Oracle DB per 5-minute poll
- Data kept 1 month in Oracle
- 2 years of consolidated data in RRDs
Note: a metric is a group of attributes, such as Bandwidth = in/out bits and in/out packets.
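As a rough cross-check of the volumes above, the following sketch works out how many rows a 5-minute poll of ~160K entries produces per day and over the 1-month Oracle retention. The 30-day month and all function names are assumptions made for illustration; they do not come from the slides.

```python
# Rough sizing sketch. The two input figures come from the slide;
# the 30-day month is an assumption.
ENTRIES_PER_POLL = 160_000   # ~160K entries loaded per polling cycle
POLL_INTERVAL_MIN = 5        # one polling cycle every 5 minutes
RETENTION_DAYS = 30          # "1 month" in Oracle, assumed to be 30 days


def polls_per_day(interval_min: int) -> int:
    """Number of polling cycles in 24 hours."""
    return (24 * 60) // interval_min


def rows_per_day(entries_per_poll: int, interval_min: int) -> int:
    """Rows written to the database in one day."""
    return entries_per_poll * polls_per_day(interval_min)


def rows_retained(entries_per_poll: int, interval_min: int, days: int) -> int:
    """Rows held in the database at steady state for a given retention."""
    return rows_per_day(entries_per_poll, interval_min) * days


if __name__ == "__main__":
    print(polls_per_day(POLL_INTERVAL_MIN))                                   # 288
    print(rows_per_day(ENTRIES_PER_POLL, POLL_INTERVAL_MIN))                  # 46080000
    print(rows_retained(ENTRIES_PER_POLL, POLL_INTERVAL_MIN, RETENTION_DAYS)) # 1382400000
```

Under these assumptions the raw table holds on the order of 1.4 billion rows, which is consistent with keeping only one month unaggregated and relying on RRD consolidation for the 2-year history.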
7
Performance Analysis
8
Other Tools
Syslog event recording:
- Gathers all logs from network devices
- Stored in an Oracle DB
- Accessible from CSDB
- Filtering and propagation by notification
LHCOPN: perfSONAR tool
- Decentralized network tool
- Regular OWD, latency and throughput tests
- Other tools such as traceroute
- LHCOPN network analysis
Implementation ongoing; testing phase with a 1 Gb link; security tests not yet complete. (www.perfsonar.net)
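The slide does not say how Syslog filtering is implemented. As a minimal sketch of one common approach, the example below decodes the standard syslog PRI field (PRI = facility * 8 + severity) and keeps only messages at or above a chosen severity for notification. The function names, the threshold and the sample messages are hypothetical.

```python
import re
from typing import Optional, Tuple

# A syslog line starts with "<PRI>", where PRI = facility * 8 + severity.
PRI_RE = re.compile(r"^<(\d{1,3})>")


def parse_pri(line: str) -> Optional[Tuple[int, int]]:
    """Return (facility, severity), or None if the line has no valid PRI."""
    match = PRI_RE.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    return divmod(pri, 8)


def should_notify(line: str, max_severity: int = 3) -> bool:
    """Severity 0 is the most urgent; notify when severity <= max_severity.

    The default threshold of 3 (error) is an illustrative assumption.
    """
    parsed = parse_pri(line)
    if parsed is None:
        return False
    _, severity = parsed
    return severity <= max_severity


if __name__ == "__main__":
    # Hypothetical sample messages.
    down = "<187>Sep 14 10:00:00 sw1 %LINK-3-UPDOWN: Interface down"
    info = "<189>Sep 14 10:00:05 sw1 %SYS-5-CONFIG_I: Configured"
    print(parse_pri(down), should_notify(down))
    print(parse_pri(info), should_notify(info))
```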
9
Data storage
10
Data Storage
Summary:
- Spectrum proprietary DBs for core and alarms
- MySQL database for events and Service Manager
- Oracle database for stats (CSR) and alarm history (LANDB)
- Oracle database for Syslog info
- Standalone MySQL database for perfSONAR tools
Too many different types of storage.
Missing correlation between Syslog and SNMP.
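To illustrate what the missing Syslog/SNMP correlation could look like once both sources are reachable from one place, here is a minimal sketch that pairs Syslog events with SNMP alarms from the same device within a time window. The `Event` type, the 60-second window and the sample data are assumptions for illustration, not part of the existing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical common record for both sources."""
    device: str
    timestamp: datetime
    source: str   # "syslog" or "snmp"
    message: str


def correlate(
    syslog_events: List[Event],
    snmp_alarms: List[Event],
    window: timedelta = timedelta(seconds=60),
) -> List[Tuple[Event, Event]]:
    """Pair each Syslog event with SNMP alarms on the same device
    whose timestamps differ by at most `window`."""
    pairs = []
    for log in syslog_events:
        for alarm in snmp_alarms:
            same_device = log.device == alarm.device
            close_in_time = abs(log.timestamp - alarm.timestamp) <= window
            if same_device and close_in_time:
                pairs.append((log, alarm))
    return pairs


if __name__ == "__main__":
    t0 = datetime(2011, 9, 14, 10, 0, 0)
    logs = [Event("sw1", t0, "syslog", "Interface down")]
    alarms = [
        Event("sw1", t0 + timedelta(seconds=20), "snmp", "linkDown"),
        Event("sw2", t0 + timedelta(seconds=20), "snmp", "linkDown"),
    ]
    for log, alarm in correlate(logs, alarms):
        print(log.device, log.message, "<->", alarm.message)
```

The nested loop is quadratic and only meant to show the matching rule; a real deployment would more likely do this as an indexed join inside a single database.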
11
Requirements
12
NMS Status
Advantages:
- Efficient root cause analysis
- Correct event/alarm management
- High availability
- Really good topology views (useful for the intervention group)
- Support for NICE users
- Very good level of filtering (topology, alarms)
- Notification support
Negative points / weaknesses:
- Expensive
- Polling limit almost reached (a new version with a completely redesigned polling system will arrive in 2 years)
- Not a performance system: cannot handle 50K statistics
- Integration of non-certified manufacturers is complex
- Data collection mostly limited to SNMP (changes ongoing)
13
Requirements
Mandatory:
- Root cause analysis
- High polling rate: 1-2 min for critical nodes, 3-5 min for others
- Network topology representation
- Notifications (SMS / mail / XMPP) and general console
- Distributed environment
- High-availability system
- Complete performance management
- IPv6 support
Nice to have:
- Autodiscovery system
- Mobile version
- Oracle centralized database
Numbers and storage time:
- Polling capacity for at least 5K nodes
- Performance statistics for 56K ports
- Data lifetime: 1 month without aggregation, maximum with aggregation
- Device alarms: around 2 years
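The polling requirements above translate into a sustained request rate that a candidate tool has to handle. The sketch below estimates that rate; the node and port counts come from the slide, while the split between critical and other nodes and the chosen intervals within the stated ranges are assumptions for illustration.

```python
from typing import Iterable, Tuple


def polls_per_second(targets: int, interval_s: float) -> float:
    """Average polls per second needed to visit every target once per interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return targets / interval_s


def total_rate(groups: Iterable[Tuple[int, float]]) -> float:
    """Sum the rates of several (target_count, interval_seconds) groups."""
    return sum(polls_per_second(count, interval) for count, interval in groups)


if __name__ == "__main__":
    # 56K ports polled every 5 minutes (upper bound of the 3-5 min range).
    print(round(polls_per_second(56_000, 5 * 60), 1))

    # 5K nodes, ASSUMING 500 are critical (1 min) and 4500 are not (3 min).
    print(round(total_rate([(500, 60), (4_500, 180)]), 1))
```

These are averages; a real poller also needs headroom for retries, timeouts and multiple attributes per target.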
14
Research
List of tools that fit best:
- Icinga: Nagios-like (fork) (not yet tested)
- Zabbix: large polling scale, open source, notification, Oracle database, distributed (not yet tested) (http://www.zabbix.com/features.php)
- SolarWinds: commercial, but includes performance management and is less expensive (not yet tested)
- OpenNMS: open source
  - Completely customizable
  - High polling rate with distributed environment
  - Event correlation, alarm management, notification
  - Many data collection methods supported (SNMP, HTML, JMX, JDBC, NAGIOS-NSCLIENT) (http://www.opennms.org/about/)
Links:
- http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems
- http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
15
Thanks
Questions?