Download presentation
Presentation is loading. Please wait.
Published byMartin Warren Modified over 9 years ago
1
Log analysis in the accelerator sector Steen Jensen, BE-CO-DO
2
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 2
3
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 3
4
CERN’s accelerator complex Openlab workshop on data analytics, CERN, November 16th 2012 4
5
Cern’s Control Centre, CCC Openlab workshop on data analytics, CERN, November 16th 2012 5
6
Control system infrastructure Openlab workshop on data analytics, CERN, November 16th 2012 GPN T T T T T Operator Consoles Fixed Displays File Servers Appl Servers PVSS Servers Real-time Linux LynxOS Computers WorldFIP Systems PLCs Cryo, Vacuum, … Magnet Control, RF Quench Protection, … Beam instrum. Kickers Machine Protection, … Other operational Consoles Oracle Technet Timing 6
7
Hardware 425 consoles 300 servers 1300 front ends 600 module types ~85.000 device instances Openlab workshop on data analytics, CERN, November 16th 2012 7
8
Software Java – operational code 400 GUIs, 200 servers ~8MLOC, > 1000 jars Up to 1000 processes 80 people, 10 groups C/C++ > 1300 Front End servers Real-time processes Drivers (Linux/LynxOS) 80 people, 8 groups Openlab workshop on data analytics, CERN, November 16th 2012 8
9
Software environment Operating systems Linux, LynxOS, Windows Languages Java, C++, C, Scripts Developers Staff, Fellows, Students – as primary or secondary activity CO, OP and Equipment Groups Legacy 10+ years Openlab workshop on data analytics, CERN, November 16th 2012 9
10
Connections in middleware layer Openlab workshop on data analytics, CERN, November 16th 2012 10
11
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 11
12
Motivation Further improve quality and availability of the controls system New responsibility model => non-experts must be able to quickly identify who to call Increase diagnostic efficiency Widen and deepen diagnostic capabilities Proactive & preventive maintenance & evolution We already have many log files that can be exploited better Openlab workshop on data analytics, CERN, November 16th 2012 12
13
Log data characteristics Continuous stream of text messages ~2Gb per day, bursts > 10Gb per day Multiple transport mechanisms syslog, log4J, JMS Scattered, system-specific log files Diversity in format and length Ambiguity in log level interpretation Variations in frequency Differences in relevance Openlab workshop on data analytics, CERN, November 16th 2012 13
14
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 14
15
Use cases Allow people to diagnose systems other than their own Facilitate correlation of messages across systems Detect what has changed if something does not work Observe trends over time Easily obtain customized views of one or more systems Receive notifications based on custom criteria Openlab workshop on data analytics, CERN, November 16th 2012 15
16
Non-functional requirements Easy to use and access by (non-)specialists Overviews and graphical visualisations Efficient data mining Allow for knowledge capture, sharing and re-use grep/awk/sed does not provide this Looking at Splunk, it is unrealistic to develop own system Openlab workshop on data analytics, CERN, November 16th 2012 16
17
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 17
18
Log data pre-processing Centralization Filtering of irrelevant messages Throttling and burst protection Down from 2+ Gb / day to ~100 Mb / day, not counting Java, i.e. likely to increase considerably Openlab workshop on data analytics, CERN, November 16th 2012 18
19
Splunk setup Central instance receiving pre-filtered messages via syslog and JMS Automated alerts Dashboards and saved searches Manual monitoring and follow-up Openlab workshop on data analytics, CERN, November 16th 2012 19
20
Splunk dashboard Openlab workshop on data analytics, CERN, November 16th 2012 20
21
RBAC denials on SET Openlab workshop on data analytics, CERN, November 16th 2012 21
22
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 22
23
Detailed case: Leap second On July 1 st, 01:59:59 a 60 th leap second was added globally Using Splunk, a system administrator discovered that Some computers failed to add the leap second on time Certain types of computers added the leap second at a later moment There seems to be no pattern over time No computer added a leap second more than once Openlab workshop on data analytics, CERN, November 16th 2012 23
24
Use cases RBAC security Tokens missing, malformed or expired CMW Client issues Slow consumer, zombies, incorrect accesses, error handling Timing system telegram layout issues Performance testing of new system CO Frameworks Improvement: Token check prior to access Bug: applying wrong token in certain cases Bug: Loop on timing error Openlab workshop on data analytics, CERN, November 16th 2012 24
25
Use cases, continued Design considerations Architectural decisions based on usage info Driver error reporting, missing module System issues yum updates misbehaving due to race condition Separation between operational and test environments Openlab workshop on data analytics, CERN, November 16th 2012 25
26
Outline Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 26
27
Conclusions Log analysis is very valuable – we learned a lot using Splunk Knowledge about control system usage Pro-active detection, trimming, improving Re-active diagnostics It is a continuous discover-learn-improve process It involves a mix of automated alerts manual monitoring ad hoc data mining Experience allows improving log data quality Culture: Developer usage Structure: Key-value pairs Splunk is a very promising tool in CO’s context, and projects show strong interest Openlab workshop on data analytics, CERN, November 16th 2012 27
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.