MONITORING WITH MONALISA Costin Grigoras. M ONITORING WITH M ON ALISA What is MonALISA ? MonALISA communication architecture Monitoring modules ApMon.

Slides:



Advertisements
Similar presentations
TeraGrid Deployment Test of Grid Software JP Navarro TeraGrid Software Integration University of Chicago OGF 21 October 19, 2007.
Advertisements

ALICE G RID SERVICES IP V 6 READINESS
May 2005 Iosif Legrand 1 Iosif Legrand California Institute of Technology May 2005 An Agent Based, Dynamic Service System to Monitor, Control and Optimize.
May 2005 Iosif Legrand 1 Iosif Legrand California Institute of Technology ICFA WORKSHOP Daegu, May 2005 Daegu, May 2005 An Agent Based, Dynamic Service.
1 CHEP 2000, Roberto Barbera Roberto Barbera (*) Grid monitoring with NAGIOS WP3-INFN Meeting, Naples, (*) Work in collaboration with.
June 2003 Iosif Legrand MONitoring Agents using a Large Integrated Services Architecture Iosif Legrand California Institute of Technology.
October 2003 Iosif Legrand Iosif Legrand California Institute of Technology.
Windows Server 2008 Chapter 11 Last Update
Statistics of CAF usage, Interaction with the GRID Marco MEONI CERN - Offline Week –
G RID SERVICES IP V 6 READINESS
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
September 2005 Iosif Legrand 1 End User Agents: extending the "intelligence" to the edge in Distributed Service Systems Iosif Legrand California Institute.
ATLAS Off-Grid sites (Tier-3) monitoring A. Petrosyan on behalf of the ATLAS collaboration GRID’2012, , JINR, Dubna.
Online Monitoring with MonALISA Dan Protopopescu Glasgow, UK Dan Protopopescu Glasgow, UK.
Module 10: Monitoring ISA Server Overview Monitoring Overview Configuring Alerts Configuring Session Monitoring Configuring Logging Configuring.
ACAT 2003 Iosif Legrand Iosif Legrand California Institute of Technology.
Ramiro Voicu December Design Considerations  Act as a true dynamic service and provide the necessary functionally to be used by any other services.
1 Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa Monitoring and.
The Network Performance Advisor J. W. Ferguson NLANR/DAST & NCSA.
Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework.
February 2006 Iosif Legrand 1 Iosif Legrand California Institute of Technology February 2006 February 2006 An Agent Based, Dynamic Service System to Monitor,
1 Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Catalin Cirstoiu, Ciprian Dobre An Agent Based, Dynamic Service System to Monitor, Control.
The huge amount of resources available in the Grids, and the necessity to have the most up-to-date experimental software deployed in all the sites within.
Stuart Wakefield Imperial College London Evolution of BOSS, a tool for job submission and tracking W. Bacchi, G. Codispoti, C. Grandi, INFN Bologna D.
PROOF Cluster Management in ALICE Jan Fiete Grosse-Oetringhaus, CERN PH/ALICE CAF / PROOF Workshop,
Graphing and statistics with Cacti AfNOG 11, Kigali/Rwanda.
Tool Integration with Data and Computation Grid GWE - “Grid Wizard Enterprise”
N EWS OF M ON ALISA SITE MONITORING
Giuseppe Codispoti INFN - Bologna Egee User ForumMarch 2th BOSS: the CMS interface for job summission, monitoring and bookkeeping W. Bacchi, P.
And Tier 3 monitoring Tier 3 Ivan Kadochnikov LIT JINR
Site operations Outline Central services VoBox services Monitoring Storage and networking 4/8/20142ALICE-USA Review - Site Operations.
Overview of ALICE monitoring Catalin Cirstoiu, Pablo Saiz, Latchezar Betev 23/03/2007 System Analysis Working Group.
Monitoring with MonALISA Costin Grigoras. What is MonALISA ?  Caltech project started in 2002
1 MonALISA Team Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan MonALISA capabilities for the LHCOPN LHCOPN.
Xrootd Monitoring and Control Harsh Arora CERN. Setting Up Service  Monalisa Service  Monalisa Repository  Test Xrootd Server  ApMon Module.
ClearQuest XML Server with ClearCase Integration Northwest Rational User’s Group February 22, 2007 Frank Scholz Casey Stewart
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
April 2003 Iosif Legrand MONitoring Agents using a Large Integrated Services Architecture Iosif Legrand California Institute of Technology.
PPDG February 2002 Iosif Legrand Monitoring systems requirements, Prototype tools and integration with other services Iosif Legrand California Institute.
Tier3 monitoring. Initial issues. Danila Oleynik. Artem Petrosyan. JINR.
JAliEn Java AliEn middleware A. Grigoras, C. Grigoras, M. Pedreira P Saiz, S. Schreiner ALICE Offline Week – June 2013.
Tool Integration with Data and Computation Grid “Grid Wizard 2”
AliEn central services Costin Grigoras. Hardware overview  27 machines  Mix of SLC4, SLC5, Ubuntu 8.04, 8.10, 9.04  100 cores  20 KVA UPSs  2 * 1Gbps.
+ AliEn site services and monitoring Miguel Martinez Pedreira.
SPI NIGHTLIES Alex Hodgkins. SPI nightlies  Build and test various software projects each night  Provide a nightlies summary page that displays all.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
Global ADC Job Monitoring Laura Sargsyan (YerPhI).
SAN DIEGO SUPERCOMPUTER CENTER Welcome to the 2nd Inca Workshop Sponsored by the NSF September 4 & 5, 2008 Presenters: Shava Smallen
October 2006 Iosif Legrand 1 Iosif Legrand California Institute of Technology An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed.
Copyright 2007, Information Builders. Slide 1 iWay Web Services and WebFOCUS Consumption Michael Florkowski Information Builders.
03/09/2007http://pcalimonitor.cern.ch/1 Monitoring in ALICE Costin Grigoras 03/09/2007 WLCG Meeting, CHEP.
Gennaro Tortone, Sergio Fantinel – Bologna, LCG-EDT Monitoring Service DataTAG WP4 Monitoring Group DataTAG WP4 meeting Bologna –
D.Spiga, L.Servoli, L.Faina INFN & University of Perugia CRAB WorkFlow : CRAB: CMS Remote Analysis Builder A CMS specific tool written in python and developed.
TF meeting – July 13, 2006 Support for taking actions in MonALISA Costin Grigoras.
MONITORING WITH MONALISA Costin Grigoras. M ON ALISA COMMUNICATION ARCHITECTURE MonALISA software components and the connections between them Data consumers.
The ALICE data quality monitoring Barthélémy von Haller CERN PH/AID For the ALICE Collaboration.
1 R. Voicu 1, I. Legrand 1, H. Newman 1 2 C.Grigoras 1 California Institute of Technology 2 CERN CHEP 2010 Taipei, October 21 st, 2010 End to End Storage.
E-commerce Architecture Ayşe Başar Bener. Client Server Architecture E-commerce is based on client/ server architecture –Client processes requesting service.
A System for Monitoring and Management of Computational Grids Warren Smith Computer Sciences Corporation NASA Ames Research Center.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Author etc Alarm framework requirements Andrea Sciabà Tony Wildish.
MONALISA MONITORING AND CONTROL Costin Grigoras. O UTLINE MonALISA services and clients Usage in ALICE Online SE discovery mechanism Data management 3.
Federating Data in the ALICE Experiment
Database Replication and Monitoring
California Institute of Technology
Monitoring and Fault Tolerance
ALICE Monitoring
Open Source distributed document DB for an enterprise
ALICE FAIR Meeting KVI, 2010 Kilian Schwarz GSI.
AliEn central services (structure and operation)
Publishing ALICE data & CVMFS infrastructure monitoring
Presentation transcript:

MONITORING WITH MONALISA Costin Grigoras

M ONITORING WITH M ON ALISA What is MonALISA ? MonALISA communication architecture Monitoring modules ApMon Data storage model MonALISA clients AliEn monitoring architecture The Repository Automatic actions Putting it all together Useful links

M ON ALISA What is MonALISA ? Caltech project started in Java-based set of distributed, self-describing services Offers the infrastructure to collect any type of information Can process it in near real time The services can cooperate in performing the monitoring tasks Can act as a platform for running distributed user agents

M ON ALISA COMMUNICATION ARCHITECTURE MonALISA software components and the connections between them Data consumers Multiplexing layer Helps firewalled endpoints connect Registration and discovery JINI-Lookup Services Secure & Public MonALISA services Proxies Clients HL services Agents Network of Data gathering services Fully Distributed System with no Single Point of Failure

M ON ALISA COMMUNICATION ARCHITECTURE Subscriber/notification paradigm Configuration Control (SSL) Predicates & Agents Data (via ML Proxy) Applications ML Client LookupService Registration Discovery AGENTS FILTERS / TRIGGERS Monitoring Modules Dynamic Loading Data Store Push or Pull, depending on device ML Service

M ONITORING MODULES MonALISA service includes many modules; easily extendable The service package includes: -Local host monitoring (CPU, memory, network traffic, processes and sockets in each state, LM sensors, APC UPSs), log files tailing -SNMP generic & specific modules -Condor, PBS, LSF and SGE (accounting & host monitoring), Ganglia -Ping, tracepath, traceroute, pathload and other network-related measurements -Ciena, Optical switches -Calling external applications/scripts that return as output the values -XDR-formatted UDP messages (such as ApMon). New modules can be added by implementing a simple Java interface. Filters can also be defined to aggregate data in new ways. The Service can also react to the monitoring data it receives, more about the actions it can take later. MonALISA can run code as distributed agents -Used by VRVS/Evo to maintain the tree of connections between reflectors -Eof optical paths between two network endpoints for particular data transfers

A P M ON Embeddable APlication MONitoring library ApMon is a collection of libraries in various languages (C, C++, Java, Perl, Python), all offering a simple API to sending monitoring information Based on the XDR open format of data packing (efficient packing of the values) Implemented over UDP to minimize the impact on the monitored application Allows applications to send particular values and also provides local host monitoring The Perl version is used by AliEn to send monitoring information from each service and JobAgent to the site local MonALISA instance The C/C++ implementations are used in ROOT (TMonalisaWriter), Proof and Xrootd. CMS also uses ApMon to send monitoring information from all jobs to a single point Can also be used stand-alone, in a wrapper application that loops forever, sending the default host monitoring parameters + any other interesting values (like some services’ status) It can be configured by API, local configuration file or URL of one

MonALISA keeps a memory buffer for a minimal monitoring history In addition, data can be kept in configurable database structures -The service keeps one week of raw data and one month of averaged values -The client creates three averaged structures Parallel database backends can be used to increase performance and reliability D ATA STORAGE MODEL Shared codebase between components Response Short term, high resolution Short term, high resolution Medium term, lower resolution Medium term, lower resolution Long term, low resolution Long term, low resolution Memory buffer Volatile storage Persistent storage (DB) time Request at highest resolution

M ON ALISA CLIENTS Two important clients GUI client -Interactive exploring of all the parameters -Can plot history or real-time values -Customizable history query interval -Subscribes to those particular series and updates the plots in real time Storage client (aka Repository) -Subscribes to a set of parameters and stores them in database structures suitable for long-term archival -Is usually complemented by a web interface presenting these values -Can also be embedded in another controlling application WebServices clients -Limited functionality: they lack the subscription mechanism

A LI E N MONITORING ARCHITECTURE Monitoring follows the general AliEn layout: one service per site collects and aggregates site-local monitoring information 10 Long History DB LCG Tools MonALISA AliEn Site ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn Job Agent MonALISA LCG Site ApMon AliEn CE ApMon AliEn SE ApMon Cluster Monitor ApMon AliEn TQ ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn CE ApMon AliEn SE ApMon Cluster Monitor ApMon AliEn IS ApMon AliEn Optimizers ApMon AliEn Brokers ApMon MySQL Servers ApMon CastorGrid Scripts ApMon API Services MonALISARepository Aggregated Data rss vsz cpu time run time job slots free space nr. of files open files Queued JobAgents cpu ksi2k job status disk used processes load net In/out jobs status sockets migrated mbytes active sessions MyProxy status Alerts Actions

T HE R EPOSITORY MonALISA repository for ALICE

T HE R EPOSITORY Monitoring the ALICE Grid: some numbers 85 sites defined 9000 worker nodes parallel jobs 25 central machines More than 1.1M parameters Storing only aggregated data where possible we have reached >35000 active parameters in the database We store new values at ~100Hz dynamic pages / day 1 client/web server + 3 database instances 170GB of history

A UTOMATIC A CTIONS What to do with this monitoring data? Decisions taken upon: -Absence/presence of some parameter(s) -Values above/below predefined thresholds -Arbitrary correlations between values -User-defined code Action types: -Notifications -RSS/Atom feeds -Annotation of charts with the events -Calling external programs -Logging -Running custom code, for example in ALICE -restart site services -maintain DNS-based load balancing ML Service Actions Traffic Jobs Hosts Apps Temperature Humidity A/C Power … Global ML Services Global ML Services

P UTTING IT ALL TOGETHER How to monitor your system Step 0: basic monitoring support Install a MonALISA service instance or reuse the one installed by AliEn Step 1: instrument the applications with monitoring calls ApMon covers both the host monitoring and sending your own parameters. Use a pointer to the actual configuration Step 2: collect the data Install a Repository and collect only the parameters that you need Step 3: present the data Learning from the past is important. Add annotations of changes Step 4: react to events Start by sending s when things don’t work. Continue by automatizing the manual procedures Step 5: go to a workshop in Philippines.

U SEFUL LINKS Exploring ML world can start from here Main documentation page Experiment repositories ALICE: PANDA: Cluster monitoring repositories GSI: Muenster: CAF:

2Q||!2Q ? Thanks a lot for your attention! Any questions?