Published by Earl Harris. Modified over 9 years ago.
1
www.cs.wisc.edu/condor
HawkEye: A Monitoring and Management Tool for Distributed Systems
Todd Tannenbaum
Department of Computer Sciences, University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
condor-admin@cs.wisc.edu
5
What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a workload of tasks
› …lots of really, truly skilled and experienced developers and researchers at building distributed systems. Some of the best. Standout state employees. Honest.
Email for Wisconsin Gov Scott McCallum: wisgov@gov.state.wi.us
7
One day an avid Condor user asked:
Say, could Condor technology be used for distributed system administration?
8
Time to think…
› We gathered up our experiences with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
› Completely separate from Condor from the end user's perspective. You can install HawkEye, or Condor, or both.
9
First Component: MONITORING
› Sysadmins first need information about what is happening on the machines they are responsible for.
  Both current and past
  Information must be consolidated and easily accessible
  Information must be dynamic
10
Condor ClassAds
› Technology for an entity to describe itself
› Simple attribute value pairs
[
  load_average = 1.3
  free_Swap_space_mb = 140
  number_of_processes = 92
  keyboard_idle_secs = 6
  ram = 128
  total_swap = 512
  total_memory = ram + total_swap
  busy = load_average > 1.0
]
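The idea that an attribute can hold either a value or an expression over other attributes can be sketched in a few lines of Python. This is only an illustration, not the ClassAd library: the `evaluate` helper and the use of lambdas to stand in for ClassAd expressions are assumptions made for this example.

```python
# Hypothetical sketch: a ClassAd-like record as a plain Python dict.
# Literal values are stored directly; expressions are stored as
# functions of the ad and evaluated lazily when looked up.

def evaluate(ad, name):
    """Return the value of attribute `name`, evaluating expressions on demand."""
    value = ad[name]
    return value(ad) if callable(value) else value

machine_ad = {
    "load_average": 1.3,
    "free_Swap_space_mb": 140,
    "number_of_processes": 92,
    "keyboard_idle_secs": 6,
    "ram": 128,
    "total_swap": 512,
    # Expressions refer to other attributes of the same ad.
    "total_memory": lambda ad: evaluate(ad, "ram") + evaluate(ad, "total_swap"),
    "busy": lambda ad: evaluate(ad, "load_average") > 1.0,
}

print(evaluate(machine_ad, "total_memory"))  # 640
print(evaluate(machine_ad, "busy"))          # True
```

Because expressions are evaluated at lookup time, changing `load_average` changes `busy` without any other update.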
11
Condor ClassAds, cont.
› No fixed schema
› Attributes can contain values or expressions
› Serialize Ads in XML
› Open source libraries in C++ and Java to:
  Manipulate Ads and Ad attributes
  Store Ads
  Query collections of Ads
› Bindings for Perl and others on the way…
12
HawkEye Monitoring Agent
(Diagram labels: HawkEye Monitoring Agent; HawkEye Manager; ClassAd Updates via Secure UDP)
13
HawkEye Monitoring Agent
(Diagram labels: HawkEye Monitoring Agent; HawkEye Manager; HawkEye Monitoring Agent)
14
HawkEye Monitoring Agent
(Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; ClassAd Updates via Secure UDP)
15
Monitor Agent, cont.
› Updates are sent periodically
  Information does not get stale
› Updates also serve as a heartbeat monitor
  Know when a machine is down
› Out of the box, the update ClassAd has many attributes about the machine of interest for system administration
  Current Prototype = 184 attributes
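Treating periodic updates as a heartbeat amounts to remembering when each machine last reported and flagging those that have been silent too long. The sketch below assumes a hypothetical `HeartbeatTracker` class and a 300-second timeout; neither comes from HawkEye itself.

```python
import time

class HeartbeatTracker:
    """Hypothetical sketch: flag machines whose updates have stopped arriving."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.last_seen = {}  # machine name -> time of last update

    def record_update(self, machine, now=None):
        # Each incoming update doubles as a heartbeat.
        self.last_seen[machine] = time.time() if now is None else now

    def down_machines(self, now=None):
        # A machine is presumed down if it has been silent past the timeout.
        now = time.time() if now is None else now
        return sorted(m for m, seen in self.last_seen.items()
                      if now - seen > self.timeout_secs)

tracker = HeartbeatTracker(timeout_secs=300)  # timeout value is an assumption
tracker.record_update("node1", now=1000)
tracker.record_update("node2", now=1250)
print(tracker.down_machines(now=1400))  # ['node1']
```

The timeout would normally be a small multiple of the update interval, so that one lost UDP packet does not mark a machine as down.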
16
What if I want to monitor something you didn’t think about?
17
Custom Attributes
(Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; Data from hawkeye_update_attribute command line tool)
Create your own HawkEye plugins, or share plugins with others
18
Role of HawkEye Manager
› Store all incoming ClassAds in an indexed resident data structure
  Fast response to client tool queries about current state
  “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes into a Round Robin Database
  Store information over time
  “Show me a graph with the load average for this machine over the past week”
› Speak to clients via CEDAR, HTTP
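The two storage roles described here, current state for queries and a bounded history for graphs, can be illustrated together. This is a rough sketch under stated assumptions: `ManagerStore` is a hypothetical name, the query is a plain scan rather than a real index, and a fixed-length `deque` stands in for a round robin database (it is not RRDtool).

```python
from collections import defaultdict, deque

class ManagerStore:
    """Hypothetical sketch of a manager's current-state and history stores."""

    def __init__(self, history_len):
        self.current = {}  # machine name -> latest ad (a dict)
        # A deque with maxlen discards the oldest sample once it is full,
        # which is the "round robin" behaviour in miniature.
        self.history = defaultdict(lambda: deque(maxlen=history_len))

    def update(self, machine, ad):
        self.current[machine] = ad
        self.history[machine].append(ad["load_average"])

    def query(self, predicate):
        # Linear scan over the latest ads; a real manager would use an index.
        return sorted(m for m, ad in self.current.items() if predicate(ad))

store = ManagerStore(history_len=3)
for load in (2.0, 4.0, 8.0, 12.0):
    store.update("node1", {"load_average": load})
store.update("node2", {"load_average": 0.5})

# "Show me all machines with a load average > 10"
print(store.query(lambda ad: ad["load_average"] > 10))  # ['node1']
# Only the three most recent samples survive.
print(list(store.history["node1"]))  # [4.0, 8.0, 12.0]
```

The fixed history length keeps storage constant per machine no matter how long the system runs, which is the main attraction of the round robin approach.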
19
Several different clients › Command-line, GUI, Web-based
20
But sysadmins also sometimes have to do work…
› Task: copy a new library onto the local disk of each machine.
  Just a script to copy via rcp/scp to every machine… or is it?
21
Running tasks on behalf of the sysadmin
› Submit your sysadmin tasks to HawkEye
  Tasks are stored in a persistent queue by the Manager
  Tasks can leave the queue upon completion, or repeat after specified intervals
  Tasks can have complex interdependencies via DAGMan
  Records are kept on which task ran where
› Sounds like Condor, eh? Yes, but simpler…
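The once-only versus repeating behaviour, together with the record of which task ran where, might look roughly like the following. This is an in-memory sketch only: the `TaskQueue` class and the task names are hypothetical, and persistence and DAGMan dependencies are left out entirely.

```python
import heapq

class TaskQueue:
    """Hypothetical in-memory sketch of once-only and repeating tasks."""

    def __init__(self):
        self._heap = []  # entries: (next_run_time, sequence, name, repeat_secs)
        self._seq = 0    # tie-breaker so entries never compare beyond this field
        self.log = []    # audit record: (time, task name, machine)

    def submit(self, name, run_at, repeat_secs=None):
        heapq.heappush(self._heap, (run_at, self._seq, name, repeat_secs))
        self._seq += 1

    def run_due(self, now, machine):
        # Pop every task whose time has come; repeating tasks are re-queued.
        while self._heap and self._heap[0][0] <= now:
            run_at, _, name, repeat_secs = heapq.heappop(self._heap)
            self.log.append((run_at, name, machine))
            if repeat_secs is not None:
                self.submit(name, run_at + repeat_secs, repeat_secs)

queue = TaskQueue()
queue.submit("copy_library", run_at=0)               # leaves the queue when done
queue.submit("check_disk", run_at=0, repeat_secs=60)  # repeats every 60 seconds
queue.run_due(now=130, machine="node1")
for entry in queue.log:
    print(entry)
```

After this run the log holds one `copy_library` entry and three `check_disk` entries (at times 0, 60 and 120), and `check_disk` remains queued for time 180.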
22
Run Tasks in response to monitoring information
› ClassAd “Requirements” Attribute
› Example: Send email if a machine is low on disk space or low on swap space
  Submit an email task with an attribute:
  Requirements = free_disk < 5 || free_swap < 5
› Example w/ task interdependency: If load average is high and OS=Linux and console is Idle, submit a task which runs “top”; if top sees Netscape, submit a task to kill Netscape
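Matching a task's Requirements against incoming machine ads can be sketched as a predicate applied to each ad. The function names and sample values below are hypothetical, and the units of `free_disk` and `free_swap` are not stated on the slide, so the numbers are purely illustrative.

```python
def low_resources(ad):
    # Mirrors the slide's expression: free_disk < 5 || free_swap < 5
    return ad["free_disk"] < 5 or ad["free_swap"] < 5

def matching_machines(requirements, machine_ads):
    """Return the machines whose ads satisfy the task's requirements."""
    return sorted(name for name, ad in machine_ads.items() if requirements(ad))

# Illustrative ads; attribute names follow the slide, values are made up.
machine_ads = {
    "node1": {"free_disk": 40, "free_swap": 3},
    "node2": {"free_disk": 2, "free_swap": 90},
    "node3": {"free_disk": 60, "free_swap": 70},
}

# The email task would be triggered for each machine returned here.
print(matching_machines(low_resources, machine_ads))  # ['node1', 'node2']
```

Re-evaluating the predicate whenever a fresh ad arrives is what lets a task fire in response to monitoring data rather than on a fixed schedule.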
23
HawkEye Design Goals
› Monitoring
  Reliable presence
  Get Data off the node in an extensible, consistent manner
› Run Tasks
  In response to probe information
  Repeat or once-only semantics
  Audit Log
› Independent and self-contained
› Cross-Platform
24
Current Status
› Just Beginning this project
› Initial release early summer
› Prototypes already running
  Stop in and see initial HawkEye Work, Rm 3385 on Weds 9am – 12pm
25
Thank you!
I was an overworked sysadmin. Now I have more free time thanks to HawkEye!