MONITORING WITH MONALISA Costin Grigoras
M ON ALISA COMMUNICATION ARCHITECTURE MonALISA software components and the connections between them Data consumers Multiplexing layer Helps firewalled endpoints connect Registration and discovery JINI-Lookup Services Secure & Public MonALISA services Proxies Clients HL services Agents Network of Data gathering services Fully Distributed System with no Single Point of Failure T1/T2 ALICE Karlsruhe
ALICE G RID MONITORING ARCHITECTURE Monitoring follows the general AliEn layout: one service per site collects and aggregates site-local monitoring information 3 Long History DB LCG Tools MonALISA AliEn Site ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn Job Agent MonALISA LCG Site ApMon AliEn CE ApMon AliEn SE ApMon Cluster Monitor ApMon AliEn TQ ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn CE ApMon AliEn SE ApMon Cluster Monitor ApMon AliEn IS ApMon AliEn Optimizers ApMon AliEn Brokers ApMon MySQL Servers ApMon CastorGrid Scripts ApMon API Services MonALISARepository Aggregated Data rss vsz cpu time run time job slots free space nr. of files open files Queued JobAgents cpu ksi2k job status disk used processes load net In/out jobs status sockets migrated mbytes active sessions MyProxy status Alerts Actions T1/T2 ALICE Karlsruhe
A P M ON ApMon: embeddable APlication MONitoring library Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Service(s) over UDP (open XDR binary format) Flexible configuration (hardcoded / configuration file / URL) Dynamic change of options and target while the app is running Background system monitoring (optional) Load, CPU, memory & swap usage Network interfaces (in/out/ip/errs) Sockets in each state, processes in each state Disk IO, swap IO Background application monitoring (optional) Used CPU & wall time, % of the machine CPU Partition stats, size of workdir, open files Memory usage (rss, virtual and %), page faults Very high throughput (O(10K msg/s) on a regular machine) T1/T2 ALICE Karlsruhe
S ERVICE D EPLOYMENT One service per site deployed on the VoBox AliEn services, proxies’ status, critical local directories on the VoBox Xrootd storage nodes (ALICE:: :: _xrootd_*) Machine parameters (CPU, load, memory, sockets, processes, network traffic, disk and swap IO) Xrootd internal parameters Storage space as seen by xrootd Job agents ( _JobAgent) CPU and memory usage, payload job ID, status Jobs (ALICE:: :: _Jobs) Full machine parameters (same as above) CPU and memory usage of the process tree itself (+ Si2K normalized values) Owner and Masterjob / Subjob IDs Status (STARTING, RUNNING, SAVING) Xrdcp transfers, Available bandwidth, Traceroute, PROOF monitoring, POD, … T1/T2 ALICE Karlsruhe
A GGREGATED VALUES PER SITE Values are aggregated in various categories (*_Summary) Job summaries Number of jobs in each state (absolute and rates) min/max/avg/total resource usage (absolute and rates) Per cluster and per user Top memory consumers Job agent overview Number of JAs in each state Average TTL Queued JAs (that don’t report yet active monitoring data) Histograms of number of executed jobs per JA Cluster nodes’ summary status Aggregated network traffic per source/target SE Xrootd and PROOF cluster summary T1/T2 ALICE Karlsruhe
O PEN DATA ACCESS Adapting ML to your needs You can recycle the VoBox ML instance for your own purposes Sending extra data to it from local services For example cluster monitoring with Or start a separate, independent service in the “alice” group In case you want to deploy custom filters / alarms Or start your own monitoring group For several sites / redundant monitoring Services and clients can belong/connect to several groups Any number of consumers can subscribe to any cut in the data stream But try to find the minimal expression that gives the data you want T1/T2 ALICE Karlsruhe
M ON ALISA S ERVICE MonALISA service includes many modules; easily extendable The service package includes: Local host monitoring (CPU, memory, network traffic, processes and sockets in each state, LM sensors, APC UPSs), log files tailing SNMP generic & specific modules; Condor, PBS, LSF and SGE (accounting & host monitoring), Ganglia Ping, tracepath, traceroute, pathload, xrootd Ciena, Optical switches (TL1); Netflow/Sflow (Force10) Calling external applications/scripts that output the values as text XDR-formatted UDP messages (ApMon) New modules can be added by implementing a simple Java interface. Filters can also be defined to aggregate data in new ways The Service can also react to the monitoring data it receives, more about the actions it can take later T1/T2 ALICE Karlsruhe
M ON ALISA CLIENTS Two important clients GUI client Interactive exploring of all the parameters Can plot history or real-time values Customizable history query interval Subscribes to those particular series and updates the plots in real time Storage client (aka Repository) Subscribes to a set of parameters and stores them in database structures suitable for long-term archival Is usually complemented by a web interface presenting these values Can also be embedded in another controlling application WebServices & REST clients Limited functionality: they lack the subscription mechanism T1/T2 ALICE Karlsruhe
I NTERACTIVE CLIENT Select “alice” only T1/T2 ALICE Karlsruhe
I NTERACTIVE CLIENT Browsing parameters 4 hierarchical levels of parameters (Farm, Cluster, Node, Function) T1/T2 ALICE Karlsruhe
I NTERACTIVE CLIENT Dynamic views T1/T2 ALICE Karlsruhe
W EB INTERFACE Single package including Headless client Apache Tomcat PostgreSQL database (64bit binaries) Web interface includes examples of dynamic views History charts Real-time bar plots Pie, spider, histograms Status tables The above only need a configuration file per plot describing what should be displayed (ask Stefano, Kilian, Martin for configuration tips) With this foundation you can build a custom repository Dynamic pages (JSP or servlets) Over plots with the included JFreeChart library Data aggregation filters Alarms T1/T2 ALICE Karlsruhe
P LOT EXAMPLES History charts with annotations; any past time interval can be selected T1/T2 ALICE Karlsruhe
P LOT EXAMPLES History in area mode T1/T2 ALICE Karlsruhe Statistics per host below the charts
P LOT EXAMPLES Bar chart with optional second axis for cumulative values T1/T2 ALICE Karlsruhe
P LOT EXAMPLES Dynamic data aggregation T1/T2 ALICE Karlsruhe Pie with several data selection options (last / integral / min / max / avg) Bar charts, same options
P LOT EXAMPLES Radar plots T1/T2 ALICE Karlsruhe
P LOT EXAMPLES Table view - mostly used to display current status, but aggregation functions are also available T1/T2 ALICE Karlsruhe
P LOT EXAMPLES Custom views T1/T2 ALICE Karlsruhe
P LOT EXAMPLES Custom views T1/T2 ALICE Karlsruhe
MonALISA keeps a memory buffer for a minimal monitoring history In addition, data can be kept in configurable database structures -The service keeps one week of raw data and one month of averaged values -The client creates three averaged structures Parallel database backends can be used to increase performance and reliability D ATA STORAGE MODEL Shared codebase between components Response Short term, high resolution Short term, high resolution Medium term, lower resolution Medium term, lower resolution Long term, low resolution Long term, low resolution Memory buffer Volatile storage Persistent storage (DB) time Request at highest resolution T1/T2 ALICE Karlsruhe
D ATA VOLUMES Disk space is not infinite … One can subscribe to detailed information, aggregate and discard the details The system is quick in forgetting The ALICE Central Repository is now 330GB database for 6 years of history data (~100B / data point) 100K distinct active parameters that are stored Usually receives updates for ~300K parameters (1KHz) Storing at a flat rate of 0.6KHz Generating dynamic pages at 2Hz in avg(0.2s) T1/T2 ALICE Karlsruhe
A UTOMATIC A CTIONS What to do with this monitoring data? Decisions taken upon: Absence/presence of some parameter(s) Values above/below predefined thresholds Arbitrary correlations between values User-defined code Action types: Notifications RSS/Atom feeds Annotation of charts with the events Calling external programs Logging Running custom code, for example in ALICE restart site services maintain DNS-based load balancing ML Service Actions Traffic Jobs Hosts Apps Temperature Humidity A/C Power … Global ML Services Global ML Services T1/T2 ALICE Karlsruhe
C USTOM FILTERS Memory watchdog T1/T2 ALICE Karlsruhe
2Q||!2Q ? Questions ? P.S. Don’t forget to subscribe T1/T2 ALICE Karlsruhe