1 Monitoring Grid Services Yin Chen June 2003
2 Contents
- Issues of Monitoring
- Project Proposal
3 Issues of Monitoring
- What are the goals of Grid monitoring?
- What are the characteristics of a Grid system?
- What may need to be monitored?
- What are the characteristics of monitoring data?
- Related work
4 What are the goals of Grid monitoring?
- Propagate errors to users/management
- Performance monitoring to
  - tune the application
  - use the Grid more efficiently
- The question is
  - not how to measure resources
  - but how to deliver information to end-users and system/Grid
5 What are the characteristics of a Grid system?
- Complex distributed system => unexpectedly low performance is often observed
- Where is the bottleneck?
  - application
  - operating system
  - disks
  - network adapters on either the sending or the receiving host
  - network switches, routers
- Experience of the NetLogger group
  - 40% network, 40% application, 20% host problems
  - application: 50% client, 50% server process problems
6 What are the characteristics of a Grid system? (cont.)
- Dynamic environment
- World-wide distributed environment with
  - high latency
  - frequent faults
  - very heterogeneous resources
7 What may need to be monitored
- Disk space, processor speed, network bandwidth, CPU load, memory load, network load, network communication time, number of parallel streams, stripes, TCP/IP buffer size, and disk access time, which includes the time to copy data to or from the local hard disk on the server. [2][3]
- Some of this information is relatively static, while the rest is dynamic run-time information.
8 What are the characteristics of monitoring data?
- Run-time monitoring data goes "old" quickly
- The producer should be near the monitored entities
- Data must be transported rapidly and efficiently from producer to consumer
- Information should be explicitly qualified, e.g. by timestamps
- Updates are frequent
- Performance information is often stochastic
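One way to make the age of run-time data explicit is to stamp every sample at the producer and let consumers discard samples that have outlived their validity. The following is a minimal sketch under assumptions: the `Measurement` type, its `ttl` field and the `fresh_only` helper are hypothetical names chosen for illustration and do not come from any of the systems surveyed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One monitoring sample, stamped by the producer (hypothetical type)."""
    resource: str     # e.g. a host name
    metric: str       # e.g. "cpu_load"
    value: float
    timestamp: float  # seconds since the epoch, set when the value was measured
    ttl: float        # assumed validity period in seconds

    def is_stale(self, now: float) -> bool:
        """Return True once the sample is older than its validity period."""
        return now - self.timestamp > self.ttl


def fresh_only(samples, now):
    """Keep only the samples that are still valid at time `now`."""
    return [s for s in samples if not s.is_stale(now)]
```

A consumer would pass the current time (for example `time.time()`) as `now`. Taking it as a parameter keeps the check deterministic and easy to test.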
9 Related Work
- Monitoring and Discovery Service (MDS)
- Grid Monitoring Architecture (GMA)
- Relational Grid Monitoring Architecture (R-GMA)
- Hawkeye
- Globus Heartbeat Monitor (HBM)
- Network Weather Service (NWS)
- GridRM
10 MDS Architecture
11 GMA Architecture
12 R-GMA Architecture
13 Hawkeye Architecture
14 HBM Architecture
15 NWS Architecture
16 The Global Layer of GridRM
17 The Local GridRM Layer
18 Summary and Conclusion
- A variety of monitoring systems exist
- Each system has its own strengths and weaknesses
- Systems tend to use standard and open components
- The GGF advocates the GMA architecture
19 Summary and Conclusion (cont.)
- Similarities in architecture
  - At the lowest level, a sensor or other program generates a piece of data
  - Some systems allow data to be aggregated from a set of resources
  - At the resource level, the data from several information collectors is gathered into one component
  - A directory component
  - A decentralised hierarchical structure, which gives better fault tolerance
- Differences in the use of a push or pull mechanism
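The common layering summarised above (sensors at the bottom, a resource-level collector, and a directory on top) can be sketched in a few lines. This is an illustrative toy, not the design of MDS, R-GMA or any other surveyed system. The class names `Sensor`, `Collector` and `Directory` and their methods are assumptions made for the example.

```python
from statistics import mean


class Sensor:
    """Lowest level: a program that generates one piece of data."""

    def __init__(self, metric, read_fn):
        self.metric = metric
        self.read_fn = read_fn  # callable returning the current value

    def read(self):
        return self.read_fn()


class Collector:
    """Resource level: gathers the data of several sensors into one component."""

    def __init__(self, resource, sensors):
        self.resource = resource
        self.sensors = list(sensors)

    def collect(self):
        return {s.metric: s.read() for s in self.sensors}


class Directory:
    """Directory component: knows the collectors and can aggregate over them."""

    def __init__(self):
        self._collectors = {}

    def register(self, collector):
        self._collectors[collector.resource] = collector

    def query(self, resource):
        return self._collectors[resource].collect()

    def aggregate(self, metric):
        """Average one metric over every resource that reports it."""
        values = []
        for collector in self._collectors.values():
            data = collector.collect()
            if metric in data:
                values.append(data[metric])
        return mean(values) if values else None
```

A single `Directory` is used here only for brevity. A decentralised hierarchy would register directories with other directories in the same way.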
20 Project Proposal
- Goal
- Requirement
- Architecture -- Pull Model
- Specification
- Implementation
- Testing
- Schedule
21 Goal
- Realisation
- Lightweight & Simple design
- Reliability & Robustness
22 Architecture
- What is the pull model?
  - The monitor sends requests to the service for information. This implies repeated queries of resource attributes over some time period at a specific frequency.
  - In a push model, on the other hand, the service sends out notifications to a subscribed sink.
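The difference between the two models can be illustrated with a toy service that supports both. This is a sketch under assumptions: `Service`, `get_load`, `subscribe`, `set_load` and `poll` are hypothetical names, and everything runs in one process so that no network calls are needed.

```python
import time


class Service:
    """A toy monitored service offering both access styles (hypothetical)."""

    def __init__(self):
        self._load = 0.0
        self._sinks = []

    # Pull: the monitor asks, the service answers.
    def get_load(self):
        return self._load

    # Push: sinks subscribe once and are notified on every change.
    def subscribe(self, sink):
        self._sinks.append(sink)

    def set_load(self, value):
        self._load = value
        for sink in self._sinks:
            sink(value)


def poll(service, times, interval, sleep=time.sleep):
    """Pull model: query the service `times` times, `interval` seconds apart."""
    samples = []
    for _ in range(times):
        samples.append(service.get_load())
        sleep(interval)
    return samples
```

In the pull case the monitor decides when and how often data is collected. In the push case the service decides, and the monitor only has to register a callable as the sink.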
23 Benefits of Pull
- Less network traffic: collections are initiated only from the top
- No time synchronisation problem: data is collected from the resources at the same time
- The server can determine the size of the file, select the appropriate alternate server, and passively control the bandwidth and storage space
- According to Globus, the "push" model "generates a large amount of data and results in constant updates to the MDS."
- Standard LDAP databases are not designed to handle frequent updates
24 Benefits of Pull (cont.)
- The pull model is based on distributing intelligence to the asset site, so that it becomes automated.
- Using machine-to-machine communications with connected sensors and autonomic computing, the asset performs self-diagnostics, maintains and repairs itself, re-routes energy flows, schedules non-routine maintenance and reports on any out-of-the-ordinary activity that poses a security threat.
- IBM calls this autonomic computing: machine-to-machine communications take place to optimise the performance of computing and network resources.
25 Problems of Pull
- Current measurements must be gathered from all resources
- If the real-time data volume is large, this may cause a bottleneck
- May not be useful for fault detection: heartbeat events are valid only for a short time interval and should be delivered within that time constraint
- May not be useful for dynamic sensor management
- The push model is the most efficient in terms of bandwidth, as no requests are sent, only responses from the service
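The fault-detection point can be made concrete with a small heartbeat tracker: each resource pushes a heartbeat, and a resource is suspected to have failed once no heartbeat has arrived within the validity interval. This is a minimal sketch, not the Globus HBM implementation. The `HeartbeatMonitor` class and its `timeout` parameter are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per resource (hypothetical helper)."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds a heartbeat is considered valid
        self.last_seen = {}     # resource name -> time of last heartbeat

    def beat(self, resource, now):
        """Record a heartbeat pushed by `resource` at time `now`."""
        self.last_seen[resource] = now

    def suspected_failed(self, now):
        """Return the resources whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

With a pull model, the monitor would have to poll every resource more often than once per `timeout` to reach the same detection delay, which is where the extra request traffic comes from.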
26 Monitoring Grid Services
- Thanks