
Presentation transcript:

MONITORING WITH MONALISA Costin Grigoras

MONALISA COMMUNICATION ARCHITECTURE
MonALISA software components and the connections between them:
- Clients and HL services: data consumers
- Proxies: multiplexing layer; helps firewalled endpoints connect
- JINI Lookup Services (secure & public): registration and discovery
- MonALISA services and agents: network of data gathering services
Fully distributed system with no single point of failure
T1/T2 ALICE Karlsruhe

ALICE GRID MONITORING ARCHITECTURE
Monitoring follows the general AliEn layout: one service per site collects and aggregates site-local monitoring information.
[Diagram: ApMon senders, per-site MonALISA services and the MonALISA Repository]
- ApMon senders at the sites (AliEn site, LCG site): AliEn Job Agents, AliEn CE, AliEn SE, Cluster Monitor
- ApMon senders in the central services: AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL servers, CastorGrid scripts, API services
- MonALISA Repository: aggregated data, long history DB, LCG tools, alerts, actions
- Monitored parameters: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status

APMON
ApMon: embeddable Application Monitoring library
- Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA service(s) over UDP (open XDR binary format)
- Flexible configuration (hardcoded / configuration file / URL)
- Dynamic change of options and target while the application is running
- Background system monitoring (optional)
  - Load, CPU, memory & swap usage
  - Network interfaces (in/out/ip/errs)
  - Sockets in each state, processes in each state
  - Disk IO, swap IO
- Background application monitoring (optional)
  - Used CPU & wall time, % of the machine CPU
  - Partition stats, size of workdir, open files
  - Memory usage (rss, virtual and %), page faults
- Very high throughput (O(10K msg/s) on a regular machine)
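The idea of sending XDR-encoded values over UDP can be sketched in a few lines. This is not the ApMon library and not its real wire format: the datagram layout (cluster, node, count, then name/value pairs) and the names `xdr_string`, `encode_datagram` and `send_metrics` are assumptions made for illustration. Only the XDR primitives themselves (big-endian integers and doubles, length-prefixed strings padded to 4 bytes) follow the XDR standard.

```python
import socket
import struct


def xdr_string(text: str) -> bytes:
    """XDR string: 4-byte big-endian length, the bytes, zero padding to a multiple of 4."""
    raw = text.encode("utf-8")
    padding = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding


def encode_datagram(cluster: str, node: str, params: dict) -> bytes:
    """Hypothetical layout: cluster, node, parameter count, then (name, double) pairs."""
    payload = xdr_string(cluster) + xdr_string(node) + struct.pack(">i", len(params))
    for name, value in params.items():
        payload += xdr_string(name) + struct.pack(">d", float(value))
    return payload


def send_metrics(destination, cluster, node, params):
    """Fire-and-forget: a single UDP datagram, no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_datagram(cluster, node, params), destination)
```

A call such as `send_metrics(("127.0.0.1", 8884), "MyCluster", "node01", {"load": 0.5})` would emit one datagram; the host, port and names here are placeholders. Because UDP needs no connection or acknowledgement, the sender never blocks on the monitoring service, which is consistent with the high message rates quoted on the slide.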

SERVICE DEPLOYMENT
One service per site, deployed on the VoBox
- AliEn services, proxies’ status, critical local directories on the VoBox
- Xrootd storage nodes (ALICE:: :: _xrootd_*)
  - Machine parameters (CPU, load, memory, sockets, processes, network traffic, disk and swap IO)
  - Xrootd internal parameters
  - Storage space as seen by xrootd
- Job agents ( _JobAgent)
  - CPU and memory usage, payload job ID, status
- Jobs (ALICE:: :: _Jobs)
  - Full machine parameters (same as above)
  - CPU and memory usage of the process tree itself (+ Si2K normalized values)
  - Owner and Masterjob / Subjob IDs
  - Status (STARTING, RUNNING, SAVING)
- Xrdcp transfers, available bandwidth, traceroute, PROOF monitoring, POD, …

AGGREGATED VALUES PER SITE
Values are aggregated in various categories (*_Summary)
- Job summaries
  - Number of jobs in each state (absolute and rates)
  - min/max/avg/total resource usage (absolute and rates)
  - Per cluster and per user
  - Top memory consumers
- Job agent overview
  - Number of JAs in each state
  - Average TTL
  - Queued JAs (that don’t yet report active monitoring data)
  - Histograms of number of executed jobs per JA
- Cluster nodes’ summary status
- Aggregated network traffic per source/target SE
- Xrootd and PROOF cluster summary
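A job summary of the kind listed above (jobs per state plus min/max/avg/total of each resource) could be computed roughly as follows. This is a minimal sketch, not MonALISA filter code; the record fields (`status`, `user`, `rss_mb`, `cpu_time_s`) and the function name `summarize_jobs` are assumptions chosen for the example.

```python
from collections import Counter


def summarize_jobs(jobs):
    """Count jobs per state and compute min/max/avg/total for each numeric metric."""
    summary = {"jobs_per_state": dict(Counter(job["status"] for job in jobs))}
    # Every key that is not a label is treated as a numeric resource metric.
    metrics = sorted({key for job in jobs for key in job if key not in ("status", "user")})
    for metric in metrics:
        values = [job[metric] for job in jobs if metric in job]
        summary[metric] = {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "total": sum(values),
        }
    return summary


# Hypothetical per-job records, as a site service might hold them in memory.
jobs = [
    {"status": "RUNNING", "user": "alice", "rss_mb": 900, "cpu_time_s": 120},
    {"status": "RUNNING", "user": "bob", "rss_mb": 1500, "cpu_time_s": 300},
    {"status": "SAVING", "user": "alice", "rss_mb": 600, "cpu_time_s": 480},
]
summary = summarize_jobs(jobs)
print(summary["jobs_per_state"])
print(summary["rss_mb"])
```

Publishing only such a summary instead of every per-job value keeps the upstream data volume proportional to the number of sites rather than the number of jobs.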

OPEN DATA ACCESS
Adapting ML to your needs
- You can recycle the VoBox ML instance for your own purposes
  - Sending extra data to it from local services
  - For example cluster monitoring with
- Or start a separate, independent service in the “alice” group
  - In case you want to deploy custom filters / alarms
- Or start your own monitoring group
  - For several sites / redundant monitoring
  - Services and clients can belong/connect to several groups
- Any number of consumers can subscribe to any cut in the data stream
  - But try to find the minimal expression that gives the data you want
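The advice to subscribe with the minimal expression can be illustrated with a small matcher over the four parameter levels (Farm, Cluster, Node, Function) used elsewhere in this talk. The shell-style wildcard syntax, the sample series names and the functions `matches` and `select` are assumptions for this sketch; the real MonALISA predicate syntax is not shown in the slides.

```python
from fnmatch import fnmatchcase


def matches(predicate, series):
    """True when every level of the series name matches the predicate's pattern."""
    return all(fnmatchcase(value, pattern) for pattern, value in zip(predicate, series))


def select(predicate, updates):
    """Keep only the updates that a subscriber with this predicate would receive."""
    return [update for update in updates if matches(predicate, update[:4])]


# Hypothetical updates: (farm, cluster, node, function, value).
updates = [
    ("SiteA", "Site_Jobs_Summary", "jobs", "RUNNING_jobs", 120.0),
    ("SiteA", "Master", "vobox01", "load1", 0.7),
    ("SiteB", "Site_Jobs_Summary", "jobs", "RUNNING_jobs", 80.0),
]

broad = ("*", "*", "*", "*")                       # receives everything
narrow = ("*", "*_Summary", "*", "RUNNING_jobs")   # receives only what is needed

print(len(select(broad, updates)), len(select(narrow, updates)))
```

The narrower predicate delivers the same information for a "running jobs per site" view while discarding unrelated series before they are transferred.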

MONALISA SERVICE
MonALISA service includes many modules; easily extendable
The service package includes:
- Local host monitoring (CPU, memory, network traffic, processes and sockets in each state, LM sensors, APC UPSs), log files tailing
- SNMP generic & specific modules; Condor, PBS, LSF and SGE (accounting & host monitoring), Ganglia
- Ping, tracepath, traceroute, pathload, xrootd
- Ciena, optical switches (TL1); Netflow/Sflow (Force10)
- Calling external applications/scripts that output the values as text
- XDR-formatted UDP messages (ApMon)
New modules can be added by implementing a simple Java interface. Filters can also be defined to aggregate data in new ways.
The service can also react to the monitoring data it receives; more about the actions it can take later.

MONALISA CLIENTS
Two important clients
- GUI client
  - Interactive exploring of all the parameters
  - Can plot history or real-time values
  - Customizable history query interval
  - Subscribes to those particular series and updates the plots in real time
- Storage client (aka Repository)
  - Subscribes to a set of parameters and stores them in database structures suitable for long-term archival
  - Is usually complemented by a web interface presenting these values
  - Can also be embedded in another controlling application
WebServices & REST clients
- Limited functionality: they lack the subscription mechanism

INTERACTIVE CLIENT: Select “alice” only

INTERACTIVE CLIENT: Browsing parameters. 4 hierarchical levels of parameters (Farm, Cluster, Node, Function)

INTERACTIVE CLIENT: Dynamic views

WEB INTERFACE
Single package including
- Headless client
- Apache Tomcat
- PostgreSQL database (64bit binaries)
Web interface includes examples of dynamic views
- History charts
- Real-time bar plots
- Pie, spider, histograms
- Status tables
The above only need a configuration file per plot describing what should be displayed (ask Stefano, Kilian, Martin for configuration tips)
With this foundation you can build a custom repository
- Dynamic pages (JSP or servlets)
- Over plots with the included JFreeChart library
- Data aggregation filters
- Alarms

PLOT EXAMPLES: History charts with annotations; any past time interval can be selected

PLOT EXAMPLES: History in area mode. Statistics per host below the charts

PLOT EXAMPLES: Bar chart with optional second axis for cumulative values

PLOT EXAMPLES: Dynamic data aggregation. Pie with several data selection options (last / integral / min / max / avg). Bar charts, same options

PLOT EXAMPLES: Radar plots

PLOT EXAMPLES: Table view, mostly used to display current status, but aggregation functions are also available

PLOT EXAMPLES: Custom views

PLOT EXAMPLES: Custom views

DATA STORAGE MODEL
Shared codebase between components
- MonALISA keeps a memory buffer for a minimal monitoring history
- In addition, data can be kept in configurable database structures
  - The service keeps one week of raw data and one month of averaged values
  - The client creates three averaged structures
- Parallel database backends can be used to increase performance and reliability
[Diagram: request and response along the time axis, answered at the highest resolution. Memory buffer (volatile storage), then persistent storage (DB): short term, high resolution; medium term, lower resolution; long term, low resolution]
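The tiered model can be read as a simple lookup rule: answer from the highest-resolution structure that still reaches back far enough. The sketch below assumes that rule; only the one-week and one-month retention periods come from the slide, while the memory buffer depth, the point spacings and the names `Store` and `pick_store` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    name: str
    retention_s: float   # how far back this structure keeps data
    resolution_s: float  # assumed spacing between stored points


# Ordered from highest to lowest resolution.
STORES = [
    Store("memory buffer", 3600, 1),                  # assumed depth and spacing
    Store("raw data (DB)", 7 * 86400, 60),            # one week of raw data
    Store("averaged values (DB)", 30 * 86400, 900),   # one month of averaged values
]


def pick_store(age_of_oldest_point_s, stores=STORES):
    """Return the highest-resolution store whose retention covers the request."""
    for store in stores:
        if age_of_oldest_point_s <= store.retention_s:
            return store
    # Older than everything we keep: fall back to the coarsest structure.
    return stores[-1]


print(pick_store(600).name)
print(pick_store(2 * 86400).name)
print(pick_store(20 * 86400).name)
```

A request for the last ten minutes is served from memory, one for the last two days from the raw table, and one for the last twenty days from the averaged table.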

DATA VOLUMES
Disk space is not infinite …
- One can subscribe to detailed information, aggregate and discard the details
- The system is quick in forgetting
The ALICE Central Repository is now a 330 GB database holding 6 years of history data (~100 B / data point)
- 100K distinct active parameters are stored
- Usually receives updates for ~300K parameters (1 kHz)
- Stores at a flat rate of 0.6 kHz
- Generates dynamic pages at 2 Hz, in 0.2 s on average
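A quick back-of-the-envelope check shows what "quick in forgetting" means for these figures. The calculation assumes decimal gigabytes, a storing rate that stayed constant at 0.6 kHz for the whole six years (it was probably lower in earlier years), and that the 1 kHz figure is the total update rate across the ~300K parameters; the variable names are only for this sketch.

```python
DB_SIZE_BYTES = 330e9          # 330 GB, decimal units assumed
BYTES_PER_POINT = 100          # ~100 B per data point
YEARS = 6
STORE_RATE_HZ = 600            # 0.6 kHz, assumed constant over the period
UPDATE_RATE_HZ = 1000          # 1 kHz, assumed to be the total across parameters
UPDATED_PARAMETERS = 300_000
SECONDS_PER_YEAR = 365.25 * 86400

# Points that fit in the database today.
points_in_db = DB_SIZE_BYTES / BYTES_PER_POINT

# Points written over the whole period if the rate never changed.
points_written = STORE_RATE_HZ * YEARS * SECONDS_PER_YEAR

# Share of written points still present, the rest was averaged away or dropped.
retained_fraction = points_in_db / points_written

# Mean time between two updates of the same parameter.
mean_update_interval_s = UPDATED_PARAMETERS / UPDATE_RATE_HZ

print(f"points in DB:      {points_in_db:.2e}")
print(f"points written:    {points_written:.2e}")
print(f"retained fraction: {retained_fraction:.1%}")
print(f"mean update interval per parameter: {mean_update_interval_s:.0f} s")
```

Under these assumptions the database holds about 3.3 billion points, roughly 3% of what a constant 0.6 kHz stream would have written, which is the effect of keeping detailed data only for a short time and averaged data for longer.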

AUTOMATIC ACTIONS
What to do with this monitoring data?
Decisions taken upon:
- Absence/presence of some parameter(s)
- Values above/below predefined thresholds
- Arbitrary correlations between values
- User-defined code
Action types:
- Notifications
- RSS/Atom feeds
- Annotation of charts with the events
- Calling external programs
- Logging
- Running custom code, for example in ALICE
  - restart site services
  - maintain DNS-based load balancing
[Diagram: ML services and global ML services take actions based on traffic, jobs, hosts, apps, temperature, humidity, A/C, power, …]
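The first two decision types (absence of a parameter, value above or below a threshold) can be sketched as a small rule evaluator. This is an illustration only, not the MonALISA actions framework; the rule dictionary layout, the parameter names and the functions `evaluate` and `run_rules` are assumptions.

```python
def evaluate(rule, latest, now):
    """Return True when the rule's condition holds for the latest known sample."""
    sample = latest.get(rule["parameter"])  # (timestamp, value) or None
    if rule["kind"] == "absent":
        # Fires when the parameter was never seen or has gone stale.
        return sample is None or now - sample[0] > rule["max_age_s"]
    if sample is None:
        return False
    if rule["kind"] == "above":
        return sample[1] > rule["threshold"]
    if rule["kind"] == "below":
        return sample[1] < rule["threshold"]
    raise ValueError(f"unknown rule kind: {rule['kind']}")


def run_rules(rules, latest, now):
    """Run the action of every rule whose condition holds; collect the results."""
    return [rule["action"](rule["parameter"]) for rule in rules if evaluate(rule, latest, now)]


# Hypothetical state: parameter name -> (timestamp in seconds, value).
latest = {"SiteA/load": (1000.0, 12.5), "SiteA/heartbeat": (400.0, 1.0)}
rules = [
    {"kind": "above", "parameter": "SiteA/load", "threshold": 10.0,
     "action": lambda p: f"notify: {p} too high"},
    {"kind": "absent", "parameter": "SiteA/heartbeat", "max_age_s": 300,
     "action": lambda p: f"restart services: no recent {p}"},
    {"kind": "below", "parameter": "SiteA/load", "threshold": 1.0,
     "action": lambda p: f"log: {p} idle"},
]
fired = run_rules(rules, latest, now=1005.0)
print(fired)
```

In a real deployment the action would send a notification, annotate a chart or call an external program rather than return a string; returning a value here just keeps the sketch easy to inspect.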

CUSTOM FILTERS: Memory watchdog

2Q||!2Q ? Questions ? P.S. Don’t forget to subscribe