Monitoring HTCondor with Ganglia

Ganglia Overview
- Scalable distributed monitoring for HPC clusters
- Two daemons:
  - gmond: runs on every host; collects and sends metrics
  - gmetad: runs on a single host; persists metrics from the local gmond in RRD files
- Web frontend presents graphs built from the persisted data
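
For orientation, a minimal sketch of how the two daemons are wired together; the cluster name, polling interval, and port are illustrative assumptions, not from the slides:

  # gmetad.conf (illustrative excerpt): poll the local gmond every 15 seconds
  # and persist the collected metrics to RRD files
  data_source "htcondor-pool" 15 localhost:8649

  # gmond.conf (illustrative excerpt): the cluster name hosts report under
  cluster {
    name = "htcondor-pool"
  }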

Why Ganglia?
- Widely used monitoring system for clusters and grids
- Easy to add new metrics
- Can create custom graphs

Running condor_gangliad
- condor_gangliad runs on a single host
- Gathers daemon ClassAds from the condor_collector
- Publishes metrics to Ganglia with host spoofing
- Can run on any host
  - May be co-located with the condor_collector or gmetad
  - Consider the network traffic each placement generates

Put Them Together

Possible Deployments
- Ganglia is already used for monitoring:
  - Start condor_gangliad on the gmetad host (least configuration), or
  - Start condor_gangliad on the central manager (saves network traffic)
- Ganglia is not yet in use for monitoring:
  - Set up a dedicated host to run Ganglia and condor_gangliad
  - Generates graphs for web pages on demand

Ganglia Interface
- Uses the gmetric method to add metrics to Ganglia
- Prefers the shared libganglia library on the system to send updates
  - Fast and efficient
- Falls back to invoking the gmetric command
  - Much slower
- Uses gstat to determine which hosts Ganglia is already monitoring
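
To illustrate the fallback path, a single update sent through the gmetric command looks roughly like this (metric name, value, and spoofed host are illustrative); each metric costs one process spawn, which is what makes the fallback much slower than the library:

  # One gmetric invocation per metric update; --spoof makes the metric
  # appear to come from the monitored host rather than the gangliad host
  gmetric --name=JobsSubmitted --value=1234 --type=uint32 --units=jobs \
          --spoof=192.168.1.10:submit-1.chtc.wisc.edu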

Configuration Macros
- GANGLIA_GSTAT_COMMAND
  - Defaults to querying gmond on localhost; change it if the master gmond runs elsewhere
  - Default: gstat --all --mpifile --gmond_ip=localhost --gmond_port=8649
- GANGLIA_SEND_DATA_FOR_ALL_HOSTS
  - Set to True to send data even for hosts not currently known to Ganglia
- GANGLIAD_VERBOSITY
  - Defaults to 0; set it higher to publish more (higher-verbosity) metrics
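
Putting these macros together, a sketch of a condor_gangliad configuration for a pool whose master gmond runs on another host (the hostname is an illustrative assumption):

  # condor_config.local (illustrative excerpt)
  GANGLIA_GSTAT_COMMAND = gstat --all --mpifile --gmond_ip=ganglia.example.edu --gmond_port=8649
  GANGLIA_SEND_DATA_FOR_ALL_HOSTS = True
  GANGLIAD_VERBOSITY = 1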

Running condor_gangliad
- Add GANGLIAD to the daemon list:
  DAEMON_LIST = …, GANGLIAD
- Check the GangliadLog for gmetric integration
  - Look for the libganglia load message
  - The library interface has been stable over many releases
  - You may have to specify the path to the library
  - If it falls back to the gmetric command, look closely at update timing
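
A hedged sketch of what this looks like in a local configuration file; the $(DAEMON_LIST) idiom appends to the existing daemon list, and GANGLIA_LIB is given as an assumed macro name for the explicit library path (verify it against your HTCondor version's manual):

  # condor_config.local (illustrative)
  DAEMON_LIST = $(DAEMON_LIST) GANGLIAD
  # Assumed macro: explicit path to libganglia if it is not found
  # automatically (path taken from the log snippet below)
  GANGLIA_LIB = /usr/lib64/libganglia-3.1.7.so.0.0.0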

Log Snippet
04/24/14 08:05:43 Testing gmetric
04/24/14 08:05:43 Loading libganglia /usr/lib64/libganglia-3.1.7.so.0.0.0
04/24/14 08:05:43 Will use libganglia to interact with ganglia.
04/24/14 08:06:03 Starting update...
04/24/14 08:06:03 Ganglia is monitoring 1 hosts
04/24/14 08:06:10 Got 7687 daemon ads
04/24/14 08:06:14 Ganglia metrics sent: 1858
04/24/14 08:06:14 Heartbeats sent: 80

Limit Data
- GANGLIAD_PER_EXECUTE_NODE_METRICS
  - Set to False for large pools
- Use a Requirements expression to limit the daemon ads fetched:
  GANGLIAD_REQUIREMENTS = Machine == "cm.chtc.wisc.edu" || \
    Machine == "submit-1.chtc.wisc.edu" || \
    Machine == "submit-2.chtc.wisc.edu" || \
    Machine == "submit-3.chtc.wisc.edu"

Metrics to Track
- Described in files under /etc/condor/ganglia.d/
- A default set is provided
- Each metric is expressed as a ClassAd:
  - Name: unique metric name used by Ganglia
  - Value: ClassAd expression to evaluate; defaults to the attribute given by Name

Minimal Example
[
  Name = "JobsSubmitted";
  Desc = "Number of jobs submitted";
  Units = "jobs";
  TargetType = "Scheduler";
]

Simple Example
[
  Name = strcat(MyType, "DaemonCoreDutyCycle");
  Value = RecentDaemonCoreDutyCycle;
  Desc = "Recent fraction of busy time in the daemon event loop";
  Scale = 100;
  Units = "%";
  TargetType = "Scheduler,Negotiator,Machine_slot1";
]

Aggregate Metrics
- Metrics can be aggregated over the entire pool
  - Sums: e.g., running jobs across the pool
  - Min and Max: e.g., space available
  - Average
- Aggregates appear in the "HTCondor Pool" group on the central manager

Aggregate Example
[
  Name = "TotalJobAds";
  Desc = "Number of jobs currently in this schedd's queue";
  Units = "jobs";
  TargetType = "Scheduler";
]
[
  Aggregate = "SUM";
  Name = "Jobs in Pool";
  Value = TotalJobAds;
  Desc = "Number of jobs currently in schedds reporting to this pool";
]

Scaling Example
[
  Name = strcat(MyType, "MonitorSelfResidentSetSize");
  Value = MonitorSelfResidentSetSize;
  Verbosity = 1;
  Desc = "RAM allocated to this daemon";
  Units = "bytes";
  Scale = 1024;
  Type = "float";
  TargetType = "Scheduler,Negotiator,Machine_slot1";
]

Other Attributes
- Title = "Graph Title" (defaults to Name)
- Regex = pattern for dynamic metrics (e.g., one metric per user)
- Type = set automatically based on the value's type
  - Coerce integers to float when scaling or when values may be large
- Group = "Group on Web Page"
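
As a sketch of how these attributes combine, a hypothetical metric definition for /etc/condor/ganglia.d/ that sets a custom title and web-page group; the attribute values are illustrative, not from the slides, and the Scale assumes the underlying attribute is reported in KiB, as in the scaling example above:

  [
    Name = strcat(MyType, "MonitorSelfImageSize");
    Value = MonitorSelfImageSize;
    Title = "Daemon Image Size";
    Desc = "Virtual memory image size of this daemon";
    Units = "bytes";
    Scale = 1024;     // assumed KiB-to-bytes conversion
    Type = "float";   // coerce to float: scaled values may be large
    Group = "HTCondor Daemons";
    TargetType = "Scheduler,Negotiator";
  ]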

Future Work
- Composite graphs
  - For example, I/O load and throughput on one graph, to make it easier to draw conclusions
- Graph slot states
- Determine which metrics are most useful

Live Demo
http://timt.chtc.wisc.edu/ganglia
http://cm.batlab.org/ganglia