LEMON – Monitoring in the CERN Computer Centre


Helge Meinhard / CERN-IT
LCG Project Execution Board, 23 September 2003

Outline
- LEMON – Monitoring
  - Overview
  - MSA and sensors
  - MSA → repository transport
  - Repository
  - Query API and displays
  - Local recovery, derived metrics, QoS
- LEAF – Advanced fabric management
  - Hardware management system
  - State management system
  - Fault tolerance, HSM operator interface

LEMON: LHC Era MONitoring

LEMON Overview
[Architecture diagram: on each monitored node, plug-in sensors feed the Monitoring Sensor Agent (MSA) and its local cache; the MSA forwards data to the Measurement Repository (MR), which serves local and global consumers.]
- Monitoring Sensor Agent (MSA)
  - Calls plug-in sensors to sample configured metrics
  - Stores all collected data in a local disk buffer
  - Sends the collected data to the global repository
- Plug-in sensors
  - Programs/scripts that implement a simple sensor-agent ASCII text protocol
  - A C++ interface class is provided on top of the text protocol to facilitate the implementation of new sensors
- The local cache
  - Ensures data is still collected when the node cannot connect to the network
  - Allows for node autonomy for local repairs
- Transport
  - Transport is pluggable; two protocols, over UDP and TCP, are currently supported, of which only the latter can guarantee delivery
- Measurement Repository (MR)
  - The data is stored in a database: a proprietary flat-file database or Oracle, with a generic interface (ODBC) being developed
  - A memory cache guarantees fast access to the most recent data, which is normally what is used for fault-tolerance correlations
  - Repository API: SOAP RPC, query of history data, subscription to new data
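A sensor is just a program driven by the MSA over a simple line-based text protocol. The exact commands and reply format of that protocol are not reproduced here; the C++ sketch below only illustrates the shape of such a plug-in (the `sample`/`quit` commands, the metric name and the reply layout are assumptions):

```cpp
// Hypothetical plug-in sensor sketch.  The real MSA/sensor ASCII protocol is
// not reproduced here; this only illustrates the idea of a small program that
// is driven by line-based commands on stdin and reports metric samples on
// stdout.  Command names and the output format are assumptions.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Read the 1-minute load average from /proc/loadavg (Linux-specific).
static double loadAverage1min() {
    std::ifstream proc("/proc/loadavg");
    double load = 0.0;
    proc >> load;
    return load;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream cmd(line);
        std::string verb, metric;
        cmd >> verb >> metric;
        if (verb == "sample" && metric == "LoadAvg") {
            // Assumed reply format: "<metric> <value>"
            std::cout << "LoadAvg " << loadAverage1min() << std::endl;
        } else if (verb == "quit") {
            break;
        }
    }
    return 0;
}
```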

MSA and Sensors
- MSA stable for almost 2 years and in production on GNU/Linux machines at CERN for ~18 months
  - Increasing functionality over the period
- Sensors deployed on GNU/Linux to provide performance and exception metrics for HW, OS and application-specific items (e.g. batch schedulers)
  - 150 metrics defined
  - Installed on almost 20 different clusters, 1500 nodes; 80-120 metrics collected per node, depending on the cluster
- Now developing sensors to collect other information, specifically from disk & tape servers
  - Much code already exists; need to bring the measurements under the Lemon framework and collect the metrics centrally

MSA → Repository Transport
- EDG/WP4-specific UDP-based protocol in production
  - Potential concern about routers dropping UDP packets, but no problem today
  - No security: anybody can inject any value into the repository
- TCP version of the EDG/WP4-specific protocol tested
  - Some concern about the load of multiple permanently open sockets on the repository → proxies developed
  - Required for security (but nothing tested here) and also to resend metrics after a network failure
- Need work to interface to the SNMP world, e.g. for routers and switches
  - SNMP has been tested successfully for input to the PVSS repository
  - Could be implemented as an additional sensor, too
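For illustration, here is a minimal sketch of the fire-and-forget UDP push that the production transport relies on. The host name, port and payload layout below are assumptions, not the actual EDG/WP4 wire format:

```cpp
// Minimal sketch of pushing a metric sample to a repository over UDP, in the
// spirit of the EDG/WP4 transport.  The actual wire format is not
// reproduced; the payload layout, host name and port are illustrative.
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

#include <ctime>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    const std::string repoHost = "measurement-repo.example.cern.ch";  // assumed
    const char *repoPort = "12409";                                   // assumed

    addrinfo hints{}, *res = nullptr;
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_DGRAM;
    if (getaddrinfo(repoHost.c_str(), repoPort, &hints, &res) != 0) {
        std::cerr << "cannot resolve repository host\n";
        return 1;
    }

    int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sock < 0) {
        std::cerr << "cannot create socket\n";
        freeaddrinfo(res);
        return 1;
    }

    // Illustrative payload: "<node> <metric-id> <timestamp> <value>"
    std::ostringstream payload;
    payload << "lxb0001 " << 4101 << ' ' << std::time(nullptr) << ' ' << 0.42;
    const std::string msg = payload.str();

    // Fire-and-forget: UDP gives no delivery guarantee, which is exactly the
    // concern that motivates the TCP variant and the proxies.
    sendto(sock, msg.data(), msg.size(), 0, res->ai_addr, res->ai_addrlen);

    close(sock);
    freeaddrinfo(res);
    return 0;
}
```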

LEMON Repository (1)
- Oracle-based repository required for long-term storage of metrics
  - Needed to understand detailed (node-to-node) performance issues over the required timescales (days…months)
  - Don't want compression of data
- Oracle-based EDG/WP4 repository (OraMon) in production
- Alternative approach: PVSS-based repository, also in production

LEMON Repository (2)
- The two alternatives were compared earlier this year
  - Found that both systems can do the job, and both are necessary to fully address our requirements
  - Native Oracle archive for PVSS, promised for end-2003, is potentially interesting given our requirements…
- Decision (June): have all clients feed data both to the OraMon repository and to PVSS
- Scaling limitations seen with both repositories already; very likely that the final (2006) system will run on multiple repositories of the same kind
  - Imminent deployment of smoothing at the MSA level will alleviate this problem
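Smoothing at the MSA level means forwarding a sample to the repository only when it has changed significantly, or after a maximum silence interval, which directly reduces repository load. A minimal sketch of the idea follows (the 5% threshold and 5-minute interval are assumed values, not the actual MSA configuration):

```cpp
// Sketch of per-metric smoothing: forward a sample only if it moved by more
// than a relative threshold since the last forwarded value, or if a maximum
// silence interval has expired.  Threshold and interval values are assumed.
#include <algorithm>
#include <cmath>
#include <ctime>
#include <iostream>

class Smoother {
public:
    Smoother(double relThreshold, std::time_t maxSilence)
        : relThreshold_(relThreshold), maxSilence_(maxSilence) {}

    // Returns true if the sample should be forwarded to the repository.
    bool shouldForward(double value, std::time_t now) {
        const bool firstSample = !haveLast_;
        const bool expired = haveLast_ && (now - lastTime_) >= maxSilence_;
        const bool changed =
            haveLast_ &&
            std::fabs(value - lastValue_) >
                relThreshold_ * std::max(std::fabs(lastValue_), 1e-9);
        if (firstSample || expired || changed) {
            haveLast_ = true;
            lastValue_ = value;
            lastTime_ = now;
            return true;
        }
        return false;
    }

private:
    double relThreshold_;
    std::time_t maxSilence_;
    bool haveLast_ = false;
    double lastValue_ = 0.0;
    std::time_t lastTime_ = 0;
};

int main() {
    Smoother s(0.05, 300);  // forward on >5% change or at least every 5 minutes
    const double samples[] = {1.00, 1.01, 1.02, 1.20, 1.21, 0.90};
    std::time_t t = std::time(nullptr);
    for (double v : samples) {
        std::cout << v << (s.shouldForward(v, t) ? " -> forwarded" : " -> suppressed")
                  << '\n';
        t += 60;  // pretend one sample per minute
    }
    return 0;
}
```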

Query API & Displays (1)
- EDG/WP4 defined an API to retrieve metrics
  - Implementations exist for the Oracle repository (in C) and for direct extraction from PVSS (in C and Perl)
  - Perl (for OraMon) and command-line implementations still to be done
- Operator and synoptic displays created for PVSS
  - Scales well with the number of client machines so far
  - Used by the Bat. 513 operators in test mode for > 6 months (the production system is based on the old SURE system)
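To make the role of the query API concrete, the sketch below shows what a consumer might do with it: fetch an hour of history for one metric on one node and compute an average. The type and function names (MetricSample, queryHistory) are illustrative stand-ins, not the real EDG/WP4 C API:

```cpp
// Hypothetical illustration of a consumer of the metric-retrieval API.  The
// names below are NOT the real EDG/WP4 API; they only sketch the two
// operations the repository offers: querying history data and working on the
// returned time series.
#include <ctime>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

struct MetricSample {
    std::time_t timestamp;
    double value;
};

// Stand-in for a call into the repository query API (assumed signature).
std::vector<MetricSample> queryHistory(const std::string &node,
                                       const std::string &metric,
                                       std::time_t from, std::time_t to) {
    // A real client would go through the C library / SOAP RPC layer.
    return {{from, 0.8}, {from + 300, 1.4}, {from + 600, 1.1}};  // dummy data
}

int main() {
    const std::time_t now = std::time(nullptr);
    const auto samples = queryHistory("lxb0001", "LoadAvg", now - 3600, now);

    const double sum = std::accumulate(samples.begin(), samples.end(), 0.0,
                                       [](double s, const MetricSample &m) {
                                           return s + m.value;
                                       });
    if (!samples.empty())
        std::cout << "average LoadAvg over the last hour: "
                  << sum / samples.size() << '\n';
    return 0;
}
```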

Query API & Displays (2)
- Display situation for EDG/WP4:
  - Java/Swing alarm displays for small-scale clusters in development
  - A Java-based time-series display exists but does not work in all browsers
  - Some simple web displays also exist (table-based display of metric values extracted with the API)
- Clarification about future directions needed

Local Recovery Actions
- Two parts:
  - A framework that interfaces to the repository; it "subscribes" to metrics and is alerted when metric values meet a defined condition
  - Actuators, invoked as necessary by the framework to take action (e.g. restart a daemon)
- A framework has been developed by Heidelberg as part of EDG/WP4
  - Successfully invokes actuators in simple cases
  - More complex cases still to be tested
- Work needed at CERN to implement actuators, but lower priority than…
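The subscribe-and-actuate idea can be summarised in a few lines. The sketch below is not the Heidelberg/EDG-WP4 framework itself; the rule structure, metric name and restart command are assumptions:

```cpp
// Illustrative sketch of the subscribe/actuate idea behind local recovery.
// A rule binds a metric condition to an actuator command, and the actuator
// fires when a new sample violates the condition.
#include <cstdlib>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct RecoveryRule {
    std::string metric;                     // e.g. "sshd_alive"
    std::function<bool(double)> inTrouble;  // condition on the metric value
    std::string actuatorCommand;            // e.g. a restart script
};

class LocalRecovery {
public:
    void subscribe(RecoveryRule rule) { rules_.push_back(std::move(rule)); }

    // Called by the monitoring side for every new local sample.
    void onSample(const std::string &metric, double value) {
        for (const auto &rule : rules_) {
            if (rule.metric == metric && rule.inTrouble(value)) {
                std::cout << "condition met for " << metric
                          << ", running actuator: " << rule.actuatorCommand << '\n';
                std::system(rule.actuatorCommand.c_str());  // e.g. restart a daemon
            }
        }
    }

private:
    std::vector<RecoveryRule> rules_;
};

int main() {
    LocalRecovery recovery;
    // Hypothetical rule: if the "sshd alive" metric drops to 0, restart sshd.
    recovery.subscribe({"sshd_alive", [](double v) { return v < 0.5; },
                        "/sbin/service sshd restart"});

    recovery.onSample("sshd_alive", 1.0);  // healthy: nothing happens
    recovery.onSample("sshd_alive", 0.0);  // dead: actuator invoked
    return 0;
}
```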

Derived Metrics
- Created from values read from the repository and, perhaps, elsewhere, typically combining metrics from different nodes
  - e.g. maximum [system load | /tmp usage | /pool usage] on batch nodes running jobs for [alice | atlas | cms | lhcb]
- Sampled regularly, and/or triggered by value changes
- Investigating whether the EDG framework is useful (and whether we wish to use it…)
- Alternative implementation: just another sensor (see the sketch below)
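A sketch of the "just another sensor" approach: the derived-metric sensor pulls the latest load value for each batch node of one experiment and publishes the maximum as a new metric. The node list and the latestValue() stand-in are hypothetical:

```cpp
// Derived metric as "just another sensor": compute the maximum of a base
// metric over the nodes of a cluster and publish it like any other sample.
// latestValue() stands in for a repository query; node names are made up.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Stand-in for querying the most recent sample of a metric for one node.
double latestValue(const std::string &node, const std::string &metric) {
    static const std::map<std::string, double> fakeLoad = {
        {"lxb0001", 0.7}, {"lxb0002", 3.2}, {"lxb0003", 1.5}};
    const auto it = fakeLoad.find(node);
    return it != fakeLoad.end() ? it->second : 0.0;
}

int main() {
    // Hypothetical list of batch nodes currently running jobs for one experiment.
    const std::vector<std::string> atlasBatchNodes = {"lxb0001", "lxb0002", "lxb0003"};

    double maxLoad = 0.0;
    for (const auto &node : atlasBatchNodes)
        maxLoad = std::max(maxLoad, latestValue(node, "LoadAvg"));

    // Publish the derived value like any other sensor sample.
    std::cout << "MaxLoadAtlasBatch " << maxLoad << std::endl;
    return 0;
}
```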

Quality of Service
- Essentially a derived metric: a per-application combination of the system parameters that affect the performance of the application on a node or on the overall service
- Initial BARC work was for a CPU-bound application
  - Wall-clock time increases with system load; little affected by anything else except swapping
- Now looking at other applications, in particular those with I/O requirements
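One simple model consistent with the CPU-bound observation: the job receives roughly 1/load of a single CPU when the run queue is contended, with an extra penalty while swapping. The sketch below uses assumed coefficients and is not the BARC model:

```cpp
// Toy quality-of-service estimate for a CPU-bound application, consistent
// with the observation that wall-clock time grows with system load and is
// otherwise mostly sensitive to swapping.  The formula and the swap penalty
// are illustrative assumptions.
#include <algorithm>
#include <iostream>

// Estimated wall-clock time (seconds) for a job needing cpuSeconds of CPU,
// on a single-CPU node with run-queue length 'load' and a given swap rate.
double estimatedWallClock(double cpuSeconds, double load, double swapRatePagesPerSec) {
    const double cpuSharing = std::max(1.0, load);                 // ~1/load of the CPU when contended
    const double swapPenalty = 1.0 + 0.01 * swapRatePagesPerSec;   // assumed penalty factor
    return cpuSeconds * cpuSharing * swapPenalty;
}

int main() {
    const double cpuSeconds = 3600.0;  // one hour of CPU work
    std::cout << "idle node:        " << estimatedWallClock(cpuSeconds, 1.0, 0.0) << " s\n";
    std::cout << "load 3, no swap:  " << estimatedWallClock(cpuSeconds, 3.0, 0.0) << " s\n";
    std::cout << "load 3, swapping: " << estimatedWallClock(cpuSeconds, 3.0, 50.0) << " s\n";
    return 0;
}
```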

LEMON Status
- Sensors, the MSA, and the OraMon and PVSS repositories are running in production, and running well
- Status of, e.g., lxbatch nodes has much improved since the initial sensor/MSA deployment
- Solaris port working
- Still much work to do, though, notably for displays, derived metrics and local recovery

LEAF

LEAF Components
- HMS: Hardware Management System
- SMS: State Management System
- Fault Tolerance

HMS (LEAF)
- HMS tracks systems through the steps of the HW life cycle; defined, implemented and in production (a simple state-machine sketch follows below):
  - Install
  - Move
  - Retire
  - Repair (vendor call)
- HMS was used (inter alia) to manage the migration of systems to the vault
- Some developments/improvements needed as the other systems develop
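The four workflows can be pictured as transitions of a small per-box state machine. The states and allowed transitions in the sketch below are illustrative assumptions, not the actual HMS definitions:

```cpp
// Simple state-machine sketch of HMS-style hardware life-cycle workflows
// (Install, Move, Retire, Repair).  States and allowed transitions are
// illustrative; the real HMS workflows are more detailed.
#include <iostream>
#include <map>
#include <set>
#include <string>

enum class HwState { Ordered, Installed, InProduction, InRepair, Retired };

static const std::map<HwState, std::set<HwState>> kAllowed = {
    {HwState::Ordered,      {HwState::Installed}},                   // Install
    {HwState::Installed,    {HwState::InProduction}},
    {HwState::InProduction, {HwState::Installed,                     // Move (relocate, e.g. to the vault)
                             HwState::InRepair,                      // Repair (vendor call)
                             HwState::Retired}},                     // Retire
    {HwState::InRepair,     {HwState::InProduction, HwState::Retired}},
    {HwState::Retired,      {}},
};

bool transition(HwState &current, HwState target) {
    const auto &next = kAllowed.at(current);
    if (next.count(target) == 0) return false;
    current = target;
    return true;
}

int main() {
    HwState node = HwState::Ordered;
    std::cout << std::boolalpha;
    std::cout << transition(node, HwState::Installed) << '\n';     // true
    std::cout << transition(node, HwState::InProduction) << '\n';  // true
    std::cout << transition(node, HwState::Retired) << '\n';       // true
    std::cout << transition(node, HwState::InRepair) << '\n';      // false: already retired
    return 0;
}
```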

SMS (LEAF)
- Use cases:
  - "Give me 200 nodes, any 200. Make them like this. By then."
  - "Take this sick node out of lxbatch"
- Tightly coupled to Lemon, to understand the current state, and to CDB, to record changes of desired state (which SMS must update)
- Analysis & requirements documents discussed; led to an architecture prototype and the definition of interfaces to CDB
- Aim for an initial SMS-driven state change by the end of the year

Fault Tolerance (LEAF)
- Requires actions such as reallocating a disk server to experiment A "because one of theirs has failed and they are the priority users at present"
- Could also:
  - reallocate a disk server to experiment B because this looks a good move given the LSF/Castor load
  - reconfigure LSF queue parameters at night or at the weekend based on workloads and budgets
- Essentially just a combination of derived metrics and SMS (see the sketch below)
- Concentrate on building the basic elements for now and start with simple use cases when we have the tools
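The "derived metrics + SMS" combination can be sketched as a rule that watches a derived metric and hands the resulting decision to SMS. All names, numbers and the SMS call below are hypothetical:

```cpp
// Illustration of "fault tolerance = derived metrics + SMS": a rule watches
// a derived metric (healthy disk servers per experiment) and, when the
// priority experiment drops below its target, asks a hypothetical SMS
// interface to reallocate a spare server.  All values are assumptions.
#include <iostream>
#include <map>
#include <string>

// Stand-in for an SMS call that changes the desired state of a node in CDB.
void smsRequestReallocation(const std::string &server, const std::string &experiment) {
    std::cout << "SMS request: reallocate " << server << " to " << experiment << '\n';
}

int main() {
    // Derived metrics, as they might be read back from the repository.
    std::map<std::string, int> healthyDiskServers = {{"alice", 4}, {"atlas", 2}};
    std::map<std::string, int> targetDiskServers  = {{"alice", 4}, {"atlas", 3}};
    const std::string priorityExperiment = "atlas";  // "the priority users at present"

    if (healthyDiskServers[priorityExperiment] < targetDiskServers[priorityExperiment]) {
        // Pick a spare server (hard-coded here) and hand the decision to SMS.
        smsRequestReallocation("lxfsrk123", priorityExperiment);
    }
    return 0;
}
```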

HSM Operator Interface (LEAF)
- Ought to be a GUI on top of an intelligent combination of Quattor, Lemon and LEAF
- Want to be able to (re)allocate disk space (servers) to users/applications as necessary
  - These are CDB and SMS issues
  - But there are strong requirements on Castor and the stager
- Many of the necessary stager changes are planned for next year
- In the meantime, concentrate on extending CDB to cover the definition of, e.g., staging pools

Conclusion
- FIO have gone a long way in the direction of fabric automation
- Still a long way to go; some areas are still sketchy
- We have started to see enormous benefits already