HEPSYSMAN Monitoring Workshop Introduction to the Day and Overview of Ganglia Pete Gronbech.

Slide 2 (31st October 2007, Introduction & Ganglia): Agenda, Wednesday 31st October

    Time      Session                                           Speaker
    :00       Start / Coffee
    10: :00   Introduction & Ganglia Overview                   Pete Gronbech
    11: :30   MonAMI Interactive Workshop                       Paul Millar
    12: :30   Lunch
    13: :00   Intro to Nagios                                   A. Elwell
    14: :30   GRID Service Monitoring Group                     Ian Neilson
    14: :00   Further Nagios Scripts                            Chris Brew
    15: :00   Live install at a site and workshop discussion
    16: :30   Other monitoring tools discussion (Pakiti, gridmap, accounting cpu and storage, SAM, SAM admins page etc.)
    16:30     AOB and wrap up

Slide 3: Why Monitoring
- Untrustworthy machines that are critical. Your systems will fail.
- When they do fail, two things save you from downtime: redundancy and monitoring systems.
- Limited manpower at sites
- Ever increasing sizes of clusters
- Complex software with many failure modes
- Need to meet SLAs (95% uptime)
- PR and reporting
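To make the 95% uptime figure concrete, here is a minimal Python sketch that converts an availability target into a downtime budget. The function name and the 30-day and 365-day periods are illustrative assumptions; the slides do not say over what period the SLA is measured.

```python
def allowed_downtime_hours(availability: float, period_days: float) -> float:
    """Return the hours of downtime permitted over a period at a given availability.

    availability is a fraction (0.95 for 95%); period_days is an assumed
    measurement window, not something defined in the slides.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_days * 24.0


if __name__ == "__main__":
    for days in (30, 365):
        hours = allowed_downtime_hours(0.95, days)
        print(f"95% uptime over {days} days allows {hours:.1f} hours of downtime")
```

At 95% this works out to about 36 hours in a 30-day month, so a single unnoticed weekend outage can use up most of the budget.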

Slide 4: Many external monitoring sites
- Gstat
- Steve Lloyd's page
- SAM

Slide 5: Many more external monitoring sites
- Gridmap
- Gridview
- Accounting

Slide 6: Local Site Monitoring
- Unix tools
- Batch system tools
- Really need something that can provide a quick visual overview of the health and load on your cluster: Ganglia

Slide 7: Ganglia

Slide 8: How does Ganglia work?
- Ganglia works through a small agent, gmond, on each node or machine to be monitored.
- You can distribute a single gmond instance to lots of machines at once.
- The gmonds communicate the state of their local node to a machine running a master gmetad instance.
- The server uses RRDtool to store the data over time.
- The Ganglia framework can be extended to monitor many parameters.

Slide 9: Setup
- The software can be downloaded from
- Diagram: Computers A, B and C run gmond; Computer D runs gmetad, gmond and httpd.
- Clients just have to run gmond, which is configured by /etc/gmond.conf.
- The server that collects the data runs gmetad. It could also run gmond to monitor itself.
- The web interface needs to run on a webserver.

Slide 10: Client Setup
Diagram: Computers A, B and C run gmond.

Steps:
1. yum install ganglia-gmond
2. Edit the config file
3. service gmond start
4. chkconfig gmond on

/etc/gmond.conf extracts:

    cluster {
      name = "LCG Workers"
    }

    /* Feel free to specify as many udp_send_channels as you like. Gmond
       used to only support having a single channel */
    udp_send_channel {
      mcast_join =
      port = 8649
    }

    /* You can specify as many udp_recv_channels as you like as well. */
    udp_recv_channel {
      mcast_join =
      port = 8649
      bind =
    }

    /* You can specify as many tcp_accept_channels as you like to share
       an xml description of the state of the cluster */
    tcp_accept_channel {
      port = 8649
    }

    udp_send_channel {
      port = 8649
      host = pplxconfig
    }
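The tcp_accept_channel in the gmond.conf extract shares an XML description of the cluster state on port 8649. As a rough illustration, the Python sketch below reads that dump and pulls one metric per host. The element and attribute names (HOST, METRIC, NAME, VAL) and the metric name load_one are assumptions about gmond's XML output and should be checked against your own installation; the function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET
from typing import Dict, Union


def fetch_gmond_xml(host: str, port: int = 8649, timeout: float = 5.0) -> bytes:
    """Read everything gmond writes to a tcp_accept_channel connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closed the connection: dump complete
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_by_host(xml_data: Union[bytes, str], metric_name: str) -> Dict[str, str]:
    """Map host name -> metric value (kept as a string) for one metric.

    Assumes HOST elements containing METRIC elements with NAME and VAL
    attributes; verify against the XML your gmond actually emits.
    """
    root = ET.fromstring(xml_data)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = metric.get("VAL")
    return values


# Example use against a node running gmond (host name taken from the extract):
#   loads = metric_by_host(fetch_gmond_xml("pplxconfig"), "load_one")
#   for name, load in sorted(loads.items()):
#       print(name, load)
```

Keeping the network read separate from the parsing means the parser can be exercised on a saved XML dump without touching a live cluster.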

Slide 11: Server Setup
Diagram: Computer D runs gmetad, gmond and httpd.

Steps:
1. yum install ganglia-gmond ganglia-gmetad ganglia-web
2. Edit /etc/gmond.conf
3. Edit /etc/gmetad.conf

Extracts from /etc/gmetad.conf:

    data_source "LCG Workers" computerA.physics.ox.ac.uk ComputerB.physics.ox.ac.uk computerC.physics.ox.ac.uk
    data_source "LCG Servers" t2se01.physics.ox.ac.uk:8656 t2ce02.physics.ox.ac.uk:8656 gridlogger.physics.ox.ac.uk:8656
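The two data_source lines differ in one detail worth noticing: the worker hosts carry no port, while the server hosts name port 8656. The Python sketch below parses such lines to make that explicit. It assumes gmetad falls back to port 8649 when no port is given and that an optional polling interval may follow the cluster name; both should be confirmed against the gmetad.conf documentation. The DataSource type and function name are hypothetical.

```python
import shlex
from typing import List, NamedTuple, Optional, Tuple

DEFAULT_GMOND_PORT = 8649  # assumed default when a host has no ":port" suffix


class DataSource(NamedTuple):
    name: str
    poll_interval: Optional[int]  # seconds, if given on the line
    hosts: List[Tuple[str, int]]  # (host, port) pairs


def parse_data_source(line: str) -> DataSource:
    """Parse one gmetad.conf data_source line into its parts."""
    tokens = shlex.split(line, comments=True)  # keeps "LCG Workers" as one token
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError(f"not a data_source line: {line!r}")
    name, rest = tokens[1], tokens[2:]
    poll_interval = None
    if rest[0].isdigit():  # optional polling interval before the host list
        poll_interval = int(rest[0])
        rest = rest[1:]
    if not rest:
        raise ValueError("data_source needs at least one host")
    hosts = []
    for entry in rest:
        host, sep, port = entry.partition(":")
        hosts.append((host, int(port) if sep else DEFAULT_GMOND_PORT))
    return DataSource(name, poll_interval, hosts)


if __name__ == "__main__":
    line = 'data_source "LCG Servers" t2se01.physics.ox.ac.uk:8656 t2ce02.physics.ox.ac.uk:8656'
    source = parse_data_source(line)
    for host, port in source.hosts:
        print(f"{source.name}: poll {host} on port {port}")
```

Listing several hosts per cluster gives gmetad alternatives to poll, so a source stays readable if the first host is down.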

Slide 12: Aggregating sub clusters

Slide 13: Host level detail

Slide 14: Customizing
- Adding PBS batch queue data

Slide 15: PBS Queue Monitoring
- Originally based on RAL Tier 1 work.
- Actually fairly complicated. See Chris Brew or me later for details.
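The slides do not show how the PBS data is fed in, so the following is only a simplified sketch of one possible approach, not the RAL Tier 1 or Oxford implementation. It counts jobs per state in a default qstat listing and builds a gmetric command to publish the count as a custom metric. The assumed qstat layout (six columns with the state letter second to last), the metric names and the function names are all hypothetical and need adapting to the local batch system.

```python
import subprocess
from typing import List


def count_jobs(qstat_output: str, state: str) -> int:
    """Count jobs in one state ("R" running, "Q" queued) in a qstat listing.

    Assumes job rows have six whitespace-separated columns with the state
    letter in the second-to-last one; header rows do not match.
    """
    count = 0
    for line in qstat_output.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[-2] == state:
            count += 1
    return count


def gmetric_command(name: str, value: int, units: str) -> List[str]:
    """Build the argument list for publishing one integer metric via gmetric."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", "uint32", "--units", units]


def publish_queue_metrics() -> None:
    """Run qstat once and publish running and queued job counts."""
    listing = subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout
    for metric, state in (("pbs_running_jobs", "R"), ("pbs_queued_jobs", "Q")):
        subprocess.run(gmetric_command(metric, count_jobs(listing, state), "jobs"), check=True)


# publish_queue_metrics() would typically be called from cron on the batch
# server every minute or so; it is not invoked here.
```

A production version has to deal with multiple queues, job arrays and qstat failures, which is probably part of why the slide calls the real thing fairly complicated.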

Slide 16: How is Ganglia different from Nagios?

Ganglia is architecturally designed to perform efficiently in very large monitoring environments: each Ganglia gmond performs its service checks locally, reporting in at a regular interval to the gmetad. Nagios performs its service checks by polling each device across a network connection and waiting for a response (known as "active checks"), which can be more resource and bandwidth intensive.

Nagios uses the results of its active checks to determine state by comparing the metrics it polls to thresholds. These state changes can in turn be used to generate notifications and customizable corrective actions. Ganglia, by contrast, has no built-in thresholds, and so does not generate events or notifications.

The general rule of thumb has been: if you need to monitor a limited number of aspects of a large number of identical devices, use Ganglia; if you want to monitor lots of aspects of a smaller number of different devices, use Nagios. But those distinctions are blurring as Ganglia supports more and more devices, and as Nagios' scalability improves.
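The threshold idea described here can be sketched in a few lines of Python. This is a deliberately simplified model, not Nagios's actual logic (which also has retries and soft and hard states). The numeric state values follow the conventional Nagios plugin return codes; the function names, sample values and thresholds are illustrative assumptions.

```python
from enum import IntEnum
from typing import Iterable, Iterator, Optional, Tuple


class State(IntEnum):
    """States numbered like the conventional Nagios plugin return codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def evaluate(value: Optional[float], warning: float, critical: float) -> State:
    """Compare one polled metric with its thresholds (higher is worse)."""
    if value is None:  # the check returned no data
        return State.UNKNOWN
    if value >= critical:
        return State.CRITICAL
    if value >= warning:
        return State.WARNING
    return State.OK


def state_changes(samples: Iterable[Optional[float]], warning: float,
                  critical: float) -> Iterator[Tuple[int, State, State]]:
    """Yield (sample index, old state, new state) each time the state changes."""
    previous = None
    for index, value in enumerate(samples):
        current = evaluate(value, warning, critical)
        if previous is not None and current != previous:
            yield index, previous, current
        previous = current


if __name__ == "__main__":
    load_samples = [1.0, 5.0, 9.0, 2.0]  # made-up one-minute load averages
    for index, old, new in state_changes(load_samples, warning=4.0, critical=8.0):
        print(f"sample {index}: {old.name} -> {new.name}, would notify")
```

Only the transitions produce output, which is the point of the comparison: a Ganglia graph shows every sample, whereas a threshold-based check stays silent until something changes.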

Slide 17: How is Ganglia different from Nagios?
- The problem with Ganglia, and with all the other external web pages we have been looking at, is that you have to look at them!
- If all is well with your system, you don't want to have to look.
- This is where Nagios comes in. It can be set up to alert you when something goes wrong or a value passes a threshold.