ACET: Accelerator Controls Exploitation Tools
Progress and plans, December 2012
Outline
- Controls system overview
- Motivation and purpose
- Focus points
- 2013
- Conclusions
ACET - TC on 06 December
Controls system overview
[Diagram; labels recovered from the figure]
- Layer labels: Applications, Middle tier, Front Ends, Hardware
- Group labels: Services, "Core", Diagnostics
- Components: Knobs, Sequencer, Orbit, InCA/LSA, Proxies, JMS, SIS, CMW/FESA, Timing, Drivers, DB, Boot, NFS, cmwDir, RBAC, DiaMon, cmwAdmin, FESA Navigator, Video, Syslog, Tune, RT
- Scale: 425 consoles, 400 GUIs, 300 servers, 200 Java servers, 1300 FECs, 600 module types, devices (count not preserved)
ACET Motivation
- Distributed and complex controls system
- Knowledge distributed over many experts
- Move towards uniform (LHC) exploitation model across machines
Purpose
- Allow (non-)experts to carry out more efficient diagnostics
- ACET collaborates with CO projects to improve diagnostic facilities of the control system
Focus points
- Diagnostic Tools – aggregation and training
- Process metrics – JMX & CMX
- DiaMon – GUI and CLIC agent
- Documentation
  - Wiki/site structure, Portal and Useful links
- Dynamic/runtime dependencies
- Feedback – Tracing & Config
  - Message format, transport, analysis
  - Trace analysis using Splunk
  - Config analysis in CCDB
Diagnostic tools
- Tools evaluated for criticality
- Aggregation into CCM diagnostic menu
- Training given during shutdown lectures
Process metrics – JMX architecture
[Diagram; labels recovered from the figure]
- SRV: JVM with jar1, jar2, mgt, JMX MBeans
- JmxDirectory, jmx-dir-client
- Viewers: JMX viewer, jConsole, jVisualVM
- C2Mon, JMX-DAQ, DiaMon GUI
- Other labels: Metrics, RMI
Process metrics – CMX architecture
[Diagram; labels recovered from the figure]
- FEC: C process p1 (lib1, lib2, cmx-lib-c), C++ process p2 (lib3, lib4, cmx-lib-c++)
- Shared memory segments, cmx-lib registry
- CLIC agent, CMX viewer, Command line tool
- C2Mon, CLIC-DAQ, DiaMon GUI
- Other labels: DB, Metrics
Process metrics – DiaMon JMX integration
Process metrics – jConsole
Process metrics – Viewers
Process metrics – JMX lookup
Documentation – Structure
Documentation – Portal
Documentation – Useful links
Dependencies – architecture
[Diagram; labels recovered from the figure]
- FEC (CMW/FESA), cmwAdmin, cmwadmin-scanner, cmwDirectory
- Client connections, log files
- Dependency analysis, "dot" files, Visualization
- Note: Data collection before LS1
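The step from collected client connections to "dot" files can be sketched as follows. This is only an illustration, assuming the connection data has already been reduced to (client, server) pairs; the function name `connections_to_dot` and the sample host names are hypothetical and are not taken from the ACET tooling.

```python
from collections import defaultdict


def connections_to_dot(connections, graph_name="dependencies"):
    """Render (client, server) pairs as a Graphviz DOT digraph string.

    Repeated pairs are merged into one edge labelled with the number
    of connections observed.
    """
    edge_counts = defaultdict(int)
    for client, server in connections:
        edge_counts[(client, server)] += 1

    lines = [f"digraph {graph_name} {{"]
    for (client, server), count in sorted(edge_counts.items()):
        lines.append(f'  "{client}" -> "{server}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample data: two applications talking to two front ends.
sample = [
    ("app-a", "fec-1"),
    ("app-a", "fec-1"),
    ("app-b", "fec-1"),
    ("app-b", "fec-2"),
]
dot_text = connections_to_dot(sample)
print(dot_text)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` tool.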
Dependencies – a view
Dependencies – a view: Face FecBook
Feedback – architecture
[Diagram; labels recovered from the figure]
- FEC/SRV: C process (cmw-fb-c, cmw, FESA3, cmw-log), Java process (jar1, jar2, cmw-log4j), /var/log/messages, logfiles
- Syslog tracing, Java tracing, syslog converters, Listeners
- CCDB, APEX GUIs, Splunk, GUIs
- Tracing & Config libs
- Impl: make, Scripts, cmmnbld, deploy, wreboot
Feedback – CCDB tracing GUI
Feedback – Hardware config CCDB GUI
Splunk – architecture
- Central instance running on dedicated machine
- Project accounts set up
- Training given to projects
- Project-specific searches created
- Contact Steen for Splunk access
[Diagram; labels recovered from the figure]
- FEC: /var/log/messages, logfiles, cmw-log, filter & throttle
- SRV: logfiles, cmw-log4j, filters
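The "filter & throttle" stage in front of the central instance could look roughly like the sketch below. It assumes a fixed time window and a per-window message budget; the class name `FilterThrottle`, its parameters, and the sample messages are hypothetical and do not describe the filter actually deployed on the FECs.

```python
import re


class FilterThrottle:
    """Drop messages matching unwanted patterns, then cap the rate per window."""

    def __init__(self, drop_patterns, max_per_window, window_seconds):
        self.drop_patterns = [re.compile(p) for p in drop_patterns]
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0
        self.suppressed = 0  # messages rejected by the throttle only

    def accept(self, message, now):
        """Return True if the message should be forwarded."""
        # Filter step: discard anything matching a drop pattern.
        if any(p.search(message) for p in self.drop_patterns):
            return False
        # Throttle step: start a new window once the old one has expired.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count >= self.max_per_window:
            self.suppressed += 1
            return False
        self.count += 1
        return True


# Hypothetical stream of (message, timestamp in seconds).
stream = [
    ("ERROR a", 0.0),
    ("DEBUG b", 0.1),
    ("ERROR c", 0.2),
    ("ERROR d", 0.3),
    ("ERROR e", 1.5),
]
ft = FilterThrottle([r"DEBUG"], max_per_window=2, window_seconds=1.0)
results = [ft.accept(msg, t) for msg, t in stream]
print(results, ft.suppressed)
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real filter would more likely read a monotonic clock.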
Splunk – Message filter GUI
Splunk – saved searches
Splunk – visualization
Splunk – dashboard
Splunk – Use case: japc-ext-dir
- Queue overflow messages from CMW proxy
- Hosts and PIDs reported
- Client application identified
- japc-ext-dir suspected – and verified
- Subscriptions made to "constant" properties
- Data never consumed => Queue overflow in proxy
- Problem fixed by Eric
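The failure mode in this use case, a subscriber that never consumes its data, can be reproduced with a plain bounded queue. The sketch below only illustrates the mechanism and is not the CMW proxy implementation; the `publish` helper and the queue size are assumptions made for the example.

```python
import queue


def publish(updates, subscriber_queue):
    """Push updates to a subscriber queue and count those that overflow."""
    dropped = 0
    for update in updates:
        try:
            subscriber_queue.put_nowait(update)
        except queue.Full:
            # Nobody is reading, so the bounded queue has no room left.
            dropped += 1
    return dropped


# A subscriber that never calls get(): the queue fills after 3 updates.
subscriber_queue = queue.Queue(maxsize=3)
dropped = publish(range(10), subscriber_queue)
print(f"queued={subscriber_queue.qsize()} dropped={dropped}")
```

Each overflow is the kind of event that, once logged with host and PID, lets a trace search point back to the client application.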
Splunk – Use cases
- Leap second
- RBAC tokens missing/malformed/expired
- CMW slow clients
- Telegram layout and configuration
- JAPC applying wrong token in certain cases
- FESA handling of Timlib error
- Separating test environment from operational
Splunk – Comments (1)
- "Proper usage requires very good configuration"
- "We need to rework our way to log information…"
- "Log files are a bit of a mess now, and only contain a sub-set of necessary data…it is necessary to clean up and extend logging…"
- "…it must be possible for others to access the data…"
Splunk – Comments (2)
Positive comments
- "Powerful tool for detecting and reporting anomalies"
- "Very useful for proactive actions"
- "Powerful tool to make statistics"
- "It avoids spending time creating tools for decoding traces"
- "It is an agile way to gather analytics, to inform design decisions"
- "It is a very powerful auditing tool"
- "Trends over time allow spotting new types of problems"
- "It was useful for me several times for seeing if a problem is on one or multiple machines"
- "It gives an easy, reusable way of looking at logfiles"
- "It could become a valuable tool to spot errors, where currently we feel blind whenever there is a problem"
Splunk – vision
- Active, daily use by component providers: dashboards
- Exploit tracing for
  - Pro-active operation
  - Informed evolution
  - Preventive maintenance
- 10 user-friendly message types per project
  - ERROR or WARNING
  - Contact information
  - Link to documentation
  - Message body meaningful to non-expert
  - No Java stack trace
- Continuous improvement of messages
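The message requirements listed above (level, contact, documentation link, readable body, no stack trace) could be enforced at the point where a message is created. The following is a minimal sketch under that assumption; the `TraceMessage` class, its field names, the key=value output layout, and the example values are all hypothetical and do not reflect the actual ACET message format.

```python
from dataclasses import dataclass

ALLOWED_LEVELS = {"ERROR", "WARNING"}


@dataclass(frozen=True)
class TraceMessage:
    level: str
    project: str
    contact: str
    doc_link: str
    body: str

    def __post_init__(self):
        if self.level not in ALLOWED_LEVELS:
            raise ValueError(f"level must be one of {sorted(ALLOWED_LEVELS)}")
        for name in ("contact", "doc_link", "body"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        # A multi-line body usually means a stack trace was pasted in.
        if "\n" in self.body:
            raise ValueError("body must be a single line (no stack trace)")

    def format(self):
        """Serialize as one key=value line, convenient for log searches."""
        return (
            f'{self.level} project={self.project} contact="{self.contact}" '
            f'doc="{self.doc_link}" msg="{self.body}"'
        )


msg = TraceMessage(
    level="WARNING",
    project="demo-project",
    contact="expert@example.org",
    doc_link="https://example.org/docs/queue-overflow",
    body="Subscriber queue is full; updates are being dropped",
)
line = msg.format()
print(line)
```

Rejecting malformed messages at construction time keeps the quality check close to the developer instead of leaving it to whoever reads the logs.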
Plans for 2013 (a)
DiaMon
- Interactive service-oriented dependency view
- Declare and monitor process metrics
- Integrate metrics viewers
- Launching of external tools
- Make contact information accessible
Splunk
- Improve current setup and configurations
- Increase support and project uptake
- Investigate integration of ITAT
Plans for 2013 (b)
Documentation
- Agree/implement CO-wide website/wiki structure
- Agree on maintenance responsibilities
- Portal – review, add and extend pages
- Content – all projects provide ½-page description
Databases
- Finalize Hardware Configuration Feedback mechanisms
- Capturing version information, detecting time bombs
- Update contact information
Plans for 2013 (c)
Feedback (Tracing and Configuration)
- Improve message quality (structure, content, level)
- Increase project usage of feedback API
- All projects review configuration/version feedback
Process metrics
- Work with projects to expose metrics
- Extend CMX (commands,…) ?
- MW team take over jmxDirectory
Plans for 2013 (d)
Runtime dependency data
- Analysis and visualization of CMW data
- Collecting network connection information
Drivers
- Finalize hardware configuration feedback
- Version feedback implementation
Conclusions
Done
- Means for provision/transport of tracing, configuration and metrics
- Centralized tracing and analysis
Todo
- Data generation by projects
- Documentation
- Analysis and presentation
Good support from projects in 2012, but…
- Too many other priorities for developers – and for me…
2013 is for bringing the pieces together
ACET needs time from all projects in 2013