Www.cs.wisc.edu/condor 1 HawkEye A Monitoring and Management Tool for Distributed Systems Todd Tannenbaum Department of Computer Sciences University of.

Presentation transcript:

1 HawkEye A Monitoring and Management Tool for Distributed Systems Todd Tannenbaum Department of Computer Sciences University of Wisconsin-Madison

2 What does Condor have?
› …lots of core technology for building a distributed system

3 What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine

4 What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a workload of tasks

5 What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a workload of tasks
› …lots of really, truly skilled developers and researchers, experienced at building distributed systems. Some of the best. Standout state employees. Honest.
  - for Wisconsin Gov Scott McCallum:

6 One day an avid Condor user asked:

7 One day an avid Condor user asked: Say, could Condor Technology be used for distributed system administration??

8 Time to think…
› We gathered up our experiences with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
› Completely separate from Condor from the end user's perspective.
  - Can install HawkEye, or Condor, or both

9 First Component: MONITORING
› Sysadmins first need information about what is happening on the machines they are responsible for.
  - Both current and past
  - Information must be consolidated and easily accessible
  - Information must be dynamic

10 Condor ClassAds
› Technology for an entity to describe itself
› Simple attribute-value pairs
  [
    load_average = 1.3
    free_Swap_space_mb = 140
    number_of_processes = 92
    keyboard_idle_secs = 6
    ram = 128
    total_swap = 512
    total_memory = ram + total_swap
    busy = load_average > 1.0
  ]
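To make the idea of attributes that hold either values or expressions concrete, here is a minimal Python sketch. It is a toy model and not the Condor ClassAd library: the `evaluate` helper and the use of Python callables to stand in for ClassAd expressions are assumptions made for illustration only. The attribute names and numbers are taken from the slide.

```python
# Toy model of a ClassAd: each attribute is either a plain value or an
# expression (modeled here as a Python callable) evaluated against the ad.
# This is an illustration only, not the Condor ClassAd library.

def evaluate(ad, name):
    """Return the value of attribute `name`, evaluating it if it is an expression."""
    value = ad[name]
    return value(ad) if callable(value) else value

machine_ad = {
    "load_average": 1.3,
    "free_swap_space_mb": 140,
    "number_of_processes": 92,
    "keyboard_idle_secs": 6,
    "ram": 128,
    "total_swap": 512,
    # Expressions refer to other attributes of the same ad.
    "total_memory": lambda ad: evaluate(ad, "ram") + evaluate(ad, "total_swap"),
    "busy": lambda ad: evaluate(ad, "load_average") > 1.0,
}

print(evaluate(machine_ad, "total_memory"))  # 640
print(evaluate(machine_ad, "busy"))          # True
```

Because expressions are evaluated on demand, `busy` changes as soon as `load_average` is updated, with no extra bookkeeping.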

11 Condor ClassAds, cont.
› No fixed schema
› Attributes can contain values or expressions
› Serialize Ads in XML
› Open source libraries in C++ and Java to:
  - Manipulate Ads and Ad attributes
  - Store Ads
  - Query collections of Ads
› Bindings for Perl and others on the way…

12 [Diagram labels: HawkEye Monitoring Agent; HawkEye Manager; ClassAd Updates Via Secure UDP]

13 [Diagram labels: HawkEye Monitoring Agent; HawkEye Manager; HawkEye Monitoring Agent]

14 [Diagram labels: HawkEye Monitoring Agent; /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; ClassAd Updates Via Secure UDP]

15 Monitor Agent, cont.
› Updates are sent periodically
  - Information does not get stale
› Updates also serve as a heartbeat monitor
  - Know when a machine is down
› Out of the box, the update ClassAd has many attributes about the machine of interest for system administration
  - Current Prototype = 184 attributes
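The heartbeat idea can be sketched in a few lines of Python: if periodic updates double as heartbeats, a machine whose last update is older than some timeout can be treated as down. The class name, the timeout value, and the machine names below are hypothetical; the slides do not say how HawkEye implements this.

```python
# Sketch of heartbeat detection from periodic updates.
# HeartbeatTracker, the 300-second timeout, and the node names are
# illustrative assumptions, not HawkEye's actual implementation.

class HeartbeatTracker:
    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.last_seen = {}  # machine name -> time of most recent update

    def record_update(self, machine, now):
        """Note that an update from `machine` arrived at time `now` (seconds)."""
        self.last_seen[machine] = now

    def down_machines(self, now):
        """Machines whose most recent update is older than the timeout."""
        return sorted(
            machine
            for machine, seen in self.last_seen.items()
            if now - seen > self.timeout_secs
        )

tracker = HeartbeatTracker(timeout_secs=300)
tracker.record_update("node1", now=0)
tracker.record_update("node2", now=0)
tracker.record_update("node1", now=290)   # node1 keeps reporting
print(tracker.down_machines(now=400))     # ['node2']
```

Timestamps are passed in explicitly so the logic stays deterministic; a real agent would presumably read the system clock.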

16 What if I want to monitor something you didn’t think about?

17 Custom Attributes
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; Data from hawkeye_update_attribute command-line tool]
Create your own HawkEye plugins, or share plugins with others
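As an illustration of what a custom plugin might compute, the Python sketch below measures free disk space and formats it as a ClassAd-style `name = value` line. The attribute name `free_disk_mb` and both helper functions are hypothetical. The slides do not show the arguments of `hawkeye_update_attribute`, so the hand-off to that tool is deliberately left out.

```python
# Sketch of a custom-attribute probe. The attribute name and the helper
# functions are illustrative assumptions; how the result is passed to
# hawkeye_update_attribute is not described in the slides and is omitted.
import shutil

def free_disk_attribute(path="/"):
    """Return a (name, value) pair giving free disk space in megabytes."""
    usage = shutil.disk_usage(path)
    return "free_disk_mb", usage.free // (1024 * 1024)

def format_attribute(name, value):
    """Render one attribute in ClassAd-style 'name = value' form."""
    return f"{name} = {value}"

name, value = free_disk_attribute(".")
print(format_attribute(name, value))
```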

18 Role of HawkEye Manager
› Store all incoming ClassAds in an indexed resident data structure
  - Fast response to client tool queries about current state
  - “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes into a Round Robin Database
  - Store information over time
  - “Show me a graph with the load average for this machine over the past week”
› Speak to clients via CEDAR, HTTP
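The Manager's two storage roles can be sketched in Python under simplifying assumptions: a dictionary keyed by machine name stands in for the indexed in-memory structure, and a fixed-length `deque` stands in for a round-robin database, where the oldest sample is overwritten once the slots are full. `ManagerSketch` and its methods are hypothetical names, and the query here is a plain linear scan rather than a true index lookup.

```python
# Sketch of the Manager's current-state store and fixed-size history.
# ManagerSketch is a hypothetical name; a deque with maxlen only mimics
# the overwrite-oldest behaviour of a round-robin database.
from collections import defaultdict, deque

class ManagerSketch:
    def __init__(self, history_slots=4):
        self.current = {}  # machine name -> most recent ad
        self.history = defaultdict(lambda: deque(maxlen=history_slots))

    def receive(self, machine, ad):
        """Store the latest ad and append its load average to the history."""
        self.current[machine] = ad
        self.history[machine].append(ad["load_average"])

    def query(self, predicate):
        """Names of machines whose current ad satisfies `predicate`."""
        return sorted(m for m, ad in self.current.items() if predicate(ad))

    def load_history(self, machine):
        """Retained load-average samples, oldest first."""
        return list(self.history[machine])

manager = ManagerSketch(history_slots=3)
for load in (0.5, 2.0, 11.0, 12.5):
    manager.receive("node1", {"load_average": load})
manager.receive("node2", {"load_average": 3.0})

print(manager.query(lambda ad: ad["load_average"] > 10))  # ['node1']
print(manager.load_history("node1"))                      # [2.0, 11.0, 12.5]
```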

19 Several different clients
› Command-line, GUI, Web-based

20 But sysadmins also sometimes have to do work…
› Task: copy a new library onto the local disk of each machine.
  - Just a script to copy via rcp/scp to every machine… or is it?

21 Running tasks on behalf of the sysadmin
› Submit your sysadmin tasks to HawkEye
  - Tasks are stored in a persistent queue by the Manager
  - Tasks can leave the queue upon completion, or repeat after specified intervals
  - Tasks can have complex interdependencies via DAGMan
  - Records are kept on which task ran where
› Sounds like Condor, eh?
  - Yes, but simpler…
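A rough Python sketch of the queue semantics described above: a task either leaves the queue after it runs or is re-queued after its interval, and every run is logged with the machine it ran on. The `TaskQueue` class, the task names, and the in-memory heap are assumptions for illustration. Persistence and DAGMan interdependencies are not modeled.

```python
# Sketch of once-only and repeating tasks with a run log.
# TaskQueue and the task names are hypothetical; the queue lives in
# memory only, and DAGMan-style dependencies are not modeled.
import heapq

class TaskQueue:
    def __init__(self):
        self._heap = []  # entries: (run_at, sequence, name, repeat_every)
        self._seq = 0    # tie-breaker so entries with equal run_at stay ordered
        self.log = []    # records of (task name, machine, scheduled time)

    def submit(self, name, run_at, repeat_every=None):
        if repeat_every is not None and repeat_every <= 0:
            raise ValueError("repeat_every must be positive")
        heapq.heappush(self._heap, (run_at, self._seq, name, repeat_every))
        self._seq += 1

    def run_due(self, now, machine):
        """Run every task scheduled at or before `now`; re-queue repeating ones."""
        while self._heap and self._heap[0][0] <= now:
            run_at, _, name, repeat_every = heapq.heappop(self._heap)
            self.log.append((name, machine, run_at))
            if repeat_every is not None:
                self.submit(name, run_at + repeat_every, repeat_every)

    def pending(self):
        return sorted((run_at, name) for run_at, _, name, _ in self._heap)

queue = TaskQueue()
queue.submit("copy_library", run_at=0)                  # once only
queue.submit("rotate_logs", run_at=0, repeat_every=60)  # repeats every 60 s
queue.run_due(now=0, machine="node1")
print(queue.log)        # [('copy_library', 'node1', 0), ('rotate_logs', 'node1', 0)]
print(queue.pending())  # [(60, 'rotate_logs')]
```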

22 Run Tasks in response to monitoring information
› ClassAd “Requirements” Attribute
› Example: Send if a machine is low on disk space or low on swap space
  - Submit a task with an attribute: Requirements = free_disk < 5 || free_swap < 5
› Example w/ task interdependency: If load average is high and OS=Linux and console is Idle, submit a task which runs “top”; if top sees Netscape, submit a task to kill Netscape
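The Requirements example can be read as a predicate evaluated against each machine's ClassAd: the task is eligible to run wherever the predicate is true. The Python sketch below mirrors `free_disk < 5 || free_swap < 5`. The machine names and values are made up, and the slide does not state the units, so none are assumed here.

```python
# Sketch of matching a task's Requirements against machine ads.
# Machine names and attribute values are invented for illustration;
# the slide does not give units for free_disk or free_swap.

def requirements(ad):
    """Mirror of: Requirements = free_disk < 5 || free_swap < 5"""
    return ad["free_disk"] < 5 or ad["free_swap"] < 5

def machines_matching(requirement, ads):
    """Names of machines whose ad satisfies the requirement."""
    return sorted(name for name, ad in ads.items() if requirement(ad))

machine_ads = {
    "node1": {"free_disk": 40, "free_swap": 120},
    "node2": {"free_disk": 3, "free_swap": 200},   # low on disk
    "node3": {"free_disk": 50, "free_swap": 2},    # low on swap
}

print(machines_matching(requirements, machine_ads))  # ['node2', 'node3']
```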

23 HawkEye Design Goals
› Monitoring
  - Reliable presence
  - Get data off the node in an extensible, consistent manner
› Run Tasks
  - In response to probe information
  - Repeat or once-only semantics
  - Audit log
› Independent and self-contained
› Cross-Platform

24 Current Status
› Just beginning this project
› Initial release early summer
› Prototypes already running – Stop in and see initial HawkEye Work Rm 3385 on Weds 9am – 12pm

25 Thank you! I was an overworked sysadmin. Now I have more free time thanks to HawkEye!