Performance of the Relational Grid Monitoring Architecture (R-GMA)


1 Performance of the Relational Grid Monitoring Architecture (R-GMA)
CMS data challenges. The nature of the problem. What is GMA, and what is R-GMA? Performance test description. Performance test results. Conclusions.

2 The Nature of the problem
As part of the preparations for data taking, CMS is performing DATA CHALLENGES: a large number of simulated events to optimise detectors and prepare software, with enormous processing requirements. BUT each event is independent of all the others, so each event can be generated on a machine without any interaction with any other.

3 The local solution
Work is split between farms. How to handle the book-keeping? A database, automatically updated, implemented via a job wrapper (BOSS). Output to stdout and stderr is intercepted and the information is recorded in a MySQL production database. Event generation and job accounting are decoupled.
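As a rough illustration of the wrapper idea (not the actual BOSS code), the sketch below runs a job, intercepts its stdout and stderr, and records each line in a database. sqlite3 stands in for the MySQL production database, and all names are illustrative.

```python
import sqlite3
import subprocess
import sys

def run_wrapped(cmd, db_path=":memory:"):
    """BOSS-style wrapper sketch: run a job, intercept its stdout/stderr,
    and record every line in a database (sqlite3 stands in for MySQL)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS job_log (stream TEXT, line TEXT)")
    proc = subprocess.run(cmd, capture_output=True, text=True)
    for stream, text in (("stdout", proc.stdout), ("stderr", proc.stderr)):
        for line in text.splitlines():
            con.execute("INSERT INTO job_log VALUES (?, ?)", (stream, line))
    con.commit()
    return con

# A trivial "job" that prints one progress message
con = run_wrapped([sys.executable, "-c", "print('event 1 done')"])
rows = con.execute("SELECT stream, line FROM job_log").fetchall()
print(rows)
```

Because the wrapper only watches the job's output streams, event generation and job accounting stay decoupled, as the slide notes.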

4 The local solution (schematic)
(Diagram: a submission machine (UI) dispatches work to many worker nodes (WN), which all report to a single database machine.)

5 The grid solution (schematic)
(Diagram: the submission machine (UI) and the database machine, now communicating with remote worker nodes over the grid.)

6 Grid Monitoring Architecture (GMA) of the GGF
(Diagram: a Producer registers with the Registry (directory services) and advertises its data; a Consumer asks the Registry for data, obtains the address of a matching producer, locates it, and then receives the data directly from the Producer.)

7 R-GMA (Relational GMA)
Developed for the E(uropean) D(ata) G(rid). Extends the GMA in two important ways: it introduces a time stamp on the data, and it is a relational implementation. Hides the registry behind the API. Can be used for information and monitoring. Each Virtual Organisation appears to have one RDBMS.

8 The syntax of R-GMA
The user interface to R-GMA is via SQL statements (not all SQL statements and structures are supported). Information is advertised via a table create. Information is published via insert. Information is read via select … from table. The first read request registers the consumer as interested in this data. Relational queries are supported. NOTE: SQL is the interface; it should not be supposed that an actual database lies behind it.
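The create/insert/select flow can be sketched with ordinary SQL. Here sqlite3 merely stands in for the virtual per-VO database (in R-GMA the SQL is only an interface, with no actual database behind it), and the table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the virtual per-VO database

# Advertise information via a table create
con.execute("CREATE TABLE jobStatus (jobId TEXT, event INTEGER, message TEXT)")

# Publish information via insert (what a producer would do)
con.execute("INSERT INTO jobStatus VALUES ('job42', 1, 'event generated')")

# Read information via select ... from table (what a consumer would do);
# in R-GMA the first such read also registers the consumer's interest
rows = con.execute("SELECT message FROM jobStatus WHERE jobId = 'job42'").fetchall()
print(rows)
```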

9 Fit between R-GMA and BOSS
R-GMA can be dropped into the framework with very little disruption. Set-up calls for MySQL are replaced by those for R-GMA producers. An archiver (joint consumer/producer) runs on a single machine; it collects the data from all the running jobs and writes it to a local database (and possibly republishes it). The data can then be queried either by direct MySQL calls or via an R-GMA consumer (a distributed database has been created).

10 Fit between R-GMA and BOSS (i)
(Diagram: BOSS jobs publish through R-GMA producers over LAN and WAN connections to an R-GMA archiver, which writes to the database.)

11 R-GMA Measurements
The architecture of GMA clearly provides a putative solution to the wide-area monitoring problem. BUT does a specific implementation provide a practical solution? Before entrusting CMS production to R-GMA, we must be confident that it will perform. What load will it fail at, and why?

12 Message time distribution from 44 jobs
Mean message length: 35 characters.

13 Simulation of a CMS job
Multi-threaded job: each thread produces messages of length 35 chars, with a suitable distribution. The threads' starting-time distribution can be altered. One machine delivers the R-GMA load of a whole farm. (Diagram: the simulated jobs feed an R-GMA servlet consumer.)
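A toy load generator in this spirit might look as follows; the thread count, message count, and Gaussian length distribution are all illustrative parameters, and a shared queue stands in for the R-GMA producer servlet.

```python
import queue
import random
import string
import threading

def make_message(mean_len=35):
    """One message, with length drawn from an assumed Gaussian around 35 chars."""
    n = max(1, int(random.gauss(mean_len, 5)))
    return "".join(random.choices(string.ascii_letters, k=n))

def simulate_farm(n_threads=8, msgs_per_thread=10):
    """One machine delivering the message load of a farm: one thread per
    simulated job, all feeding a shared queue (stand-in for the servlet)."""
    out = queue.Queue()
    def job():
        for _ in range(msgs_per_thread):
            out.put(make_message())
    threads = [threading.Thread(target=job) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [out.get() for _ in range(out.qsize())]

msgs = simulate_farm()
print(len(msgs))  # 80 messages from 8 simulated jobs
```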

14 Simulation of the CMS Grid
One machine per grid cluster, providing loads greater than that of the cluster. (Diagram: the simulated clusters feed an R-GMA consumer servlet.)

15 Current status
R-GMA can survive loads of around 20% of the current CMS requirements, and provides a grid method for monitoring. An overload by a factor of 2 causes problems after about five minutes of running. We believe these instabilities are soluble. When production starts in earnest we will compare reality with our model.

16 GridICE Server Installation

17 Brief Introduction GridICE:
is a distributed monitoring tool for grid systems; integrates with local monitoring systems; offers a web interface for publishing monitoring data at the Grid level; is fully integrated in the LCG-2 Middleware. The gridice-clients data collector installation and configuration for each site is realized by the Yaim scripts.

18 System Requirements
The suggested operating system is Scientific Linux with a minimal installation. The GridICE server should be installed on a performant machine: the PostgreSQL service makes RAM-intensive demands, and the Apache web server makes RAM- and CPU-intensive demands.

19 Core Packages & Dependencies
The GridICE server software is composed of three core packages: gridice-core (setup and maintenance scripts / discovery components), gridice-www (web interface scripts and components), and gridice-plugins (monitoring scripts). Plus several dependencies: the Apache HTTP web server, the PostgreSQL database server, the Nagios monitoring tool, ...

20 The Four Main Phases of Monitoring
Generation: sensors inquire entities and encode the measurements according to a schema. Dynamics: e.g., fairly static (software and hardware configuration) or dynamic (current processor load). Timing: e.g., periodic or on demand. Processing: e.g., filtering according to some predefined criteria, or summarising a group of events. Distributing: transmission of the events from the source to any interested parties (data delivery model: push vs. pull; periodic vs. aperiodic). Presenting: processing and abstracting the received events in order to enable the consumer to draw conclusions about the operation of the monitored system.

21 The GridICE Approach

22 Generating Events Generation of events:
Sensors: typically Perl scripts or C programs. Schema: GLUE Schema with GridICE extension. System related (e.g., CPU load, CPU type, memory size). Grid service related (e.g., CE ID, queued jobs). Network related (e.g., packet loss). Job usage (e.g., CPU time, wall time). All sensors are executed in a periodic fashion.
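A toy sensor in this spirit (the field names are illustrative, not the actual GLUE/GridICE schema) might sample the one-minute load average periodically:

```python
import os
import time

def cpu_sensor():
    """Illustrative GridICE-style sensor: sample the 1-minute load average
    and encode it as a schema-like event (field names are assumptions)."""
    load1, _, _ = os.getloadavg()
    return {"metric": "CPULoad1min", "value": load1, "timestamp": time.time()}

def run_periodically(sensor, period_s, iterations):
    """All sensors run in a periodic fashion; collect one event per period."""
    events = []
    for _ in range(iterations):
        events.append(sensor())
        time.sleep(period_s)
    return events

events = run_periodically(cpu_sensor, period_s=0.01, iterations=3)
print(len(events))
```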

23 Distributing Events Distribution of events: Hierarchical model.
Intra-site: by means of the local monitoring service (default choice: LEMON). Inter-site: by offering data through the Grid Information Service. Final consumer: depending on the client application. Mixed data delivery model. Intra-site: depending on the local monitoring service (push for LEMON). Inter-site: depending on the GIS (current choice MDS 2.x: pull). Final consumer: pull (browser/application), push (publish/subscribe notification service coming in the next release).

24 Presenting Events
Data stored in an RDBMS are used to build aggregated statistics. Data retrieved from the RDBMS are encoded in XML files. XSL-to-XHTML transformations publish the aggregated data in a Web context.

25 Monitoring a Grid

26 Challenges for Data Collection
The distribution of monitoring data is strongly characterised by significant requirements (e.g., scalability, heterogeneity, security, system health). None of the existing tools satisfies all of these requirements. Grid data collection should be customized depending on the needs of your selected Grid users.

27 Challenges for Data Presentation
Different Grid users are interested in different subsets of Grid data and different aggregation levels. Usability principles should be taken into account to help users find relevant Grid monitoring information. A synthetic data aggregation is crucial to permit a drill-down navigation (from the general to the detailed) of the Grid data.

28 Grid Monitoring Architecture (GMA) of the GGF
(Diagram: a Producer registers with the Registry (directory services) and advertises its data; a Consumer asks the Registry for data, obtains the address of a matching producer, locates it, and then receives the data directly from the Producer.)

29 R-GMA (Relational GMA)
Developed for the E(uropean) D(ata) G(rid). Extends the GMA in two important ways: it introduces a time stamp on the data, and it is a relational implementation. Hides the registry behind the API. Can be used for information and monitoring. Each Virtual Organisation appears to have one RDBMS.

30 The syntax of R-GMA
The user interface to R-GMA is via SQL statements (not all SQL statements and structures are supported). Information is advertised via a table create. Information is published via insert. Information is read via select … from table. The first read request registers the consumer as interested in this data. Relational queries are supported. NOTE: SQL is the interface; it should not be supposed that an actual database lies behind it.

31 Fit between R-GMA and BOSS
R-GMA can be dropped into the framework with very little disruption. Set-up calls for MySQL are replaced by those for R-GMA producers. An archiver (joint consumer/producer) runs on a single machine; it collects the data from all the running jobs and writes it to a local database (and possibly republishes it). The data can then be queried either by direct MySQL calls or via an R-GMA consumer (a distributed database has been created).

32 Fit between R-GMA and BOSS (i)
(Diagram: BOSS jobs publish through R-GMA producers over LAN and WAN connections to an R-GMA archiver, which writes to the database.)

33 How is Ganglia different from Nagios
Ganglia is architecturally designed to perform efficiently in very large monitoring environments: each Ganglia gmond performs its service checks locally, reporting in at a regular interval to the gmetad. Nagios performs its service checks by polling each device across a network connection and waiting for a response (known as "active checks"), which can be more resource and bandwidth intensive. Nagios uses the results of its active checks to determine state by comparing the metrics it polls to thresholds. These state changes can in turn be used to generate notifications and customizable corrective actions. Ganglia, by contrast, has no built-in thresholds, and so does not generate events or notifications. The general rule of thumb has been: if you need to monitor a limited number of aspects of a large number of identical devices, use Ganglia; if you want to monitor lots of aspects of a smaller number of different devices, use Nagios. But those distinctions are blurring as Ganglia supports more and more devices, and as Nagios' scalability improves. 6/1/2018 T.R.LEKHAA/AP/IT/SNSCE
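The threshold-based state determination described for Nagios can be sketched as follows; the warning and critical values here are illustrative examples, not Nagios defaults.

```python
def check_state(value, warn, crit):
    """Nagios-style state sketch: compare a polled metric against
    warning/critical thresholds (threshold values are assumptions)."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"

# e.g. a load average polled from a host, with warn=4 and crit=8
print(check_state(2.5, warn=4, crit=8))  # OK
print(check_state(9.1, warn=4, crit=8))  # CRITICAL
```

State changes like these are what drive Nagios notifications and corrective actions; Ganglia, having no built-in thresholds, collects the metrics but generates no such events.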

34 How is Ganglia different from Nagios
The problem with Ganglia and all the other external web pages we have been looking at is that you have to look at them! If all is well with your system you don't want to have to look. This is where Nagios comes in. It can be set up to alert you when something goes wrong, or when a value passes a threshold.

35 Monitoring: What?
Packet loss: Data is transmitted in packets, and unsurprisingly, packet loss is a measure of how many packets are lost during transport. This includes packets which are discarded because they have arrived with corrupted "transmission" data (separate from the payload/user-data part of the packet). Packets that are discarded or fail to arrive must be re-transmitted, and this quickly causes a "traffic jam" if the packet loss is severe.
RTT: Round-trip time is a measure of the time it takes to send a packet from node x to y and receive a response back at x. TCP is a send-acknowledge protocol: a block of data cannot be sent until the receipt of the previously transmitted block has been acknowledged. If this acknowledgement takes a long time to arrive (due to a long RTT), transmission delays are created.
Connectivity: Simply an indication of whether you can connect to a remote site/machine. It can be used to identify sudden faults, such as link failures, or more sporadic problems such as loss of connectivity at certain times of the day, for example due to network congestion at those times causing packets to be discarded.
TCP/UDP throughput: Throughput is essentially a measure of the rate at which data is or can be transferred. However, you must be careful about what kind of throughput you are talking about: maximum throughput, network throughput, end-to-end throughput, or throughput on the wire? The GridMon toolkit monitors network throughput (what the network sees) for TCP and UDP traffic, and end-to-end throughput (what applications and end users see) for TCP traffic. You can use this data to see what transfer rates you can expect to achieve to other sites, and to identify inefficiency (e.g. if network throughput is significantly better than end-to-end).
Inter-packet jitter: A measure of the variation in the delay of packets arriving. It is a very important metric for real-time applications such as video conferencing, where we require the packets to arrive at a steady rate. This metric is currently only measured for UDP traffic (including multicast), but it is planned to extend this to TCP traffic in the future.
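Two of these metrics can be computed directly from packet records. The jitter definition below (mean absolute variation between consecutive inter-arrival gaps) is one common choice among several, picked here for illustration:

```python
def packet_loss_pct(sent, received):
    """Packet loss: percentage of sent packets that never arrived."""
    return 100.0 * (sent - received) / sent

def mean_jitter(arrival_times):
    """Inter-packet jitter sketch: mean absolute variation between
    consecutive inter-arrival gaps (one definition among several)."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    diffs = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return sum(diffs) / len(diffs)

print(packet_loss_pct(sent=1000, received=990))  # 1.0 (%)
print(mean_jitter([0.0, 0.10, 0.21, 0.30]))      # fairly steady arrivals
```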

36 Monitoring Architecture
Monitoring: How(1)? (Diagram: a monitor node at each centre runs IperfER, PingER, UDPmon, MiperfER and bbcp/ftp every 30 mins, feeding a publication service and the Grid middleware.) The monitoring is performed by a kit of tools installed on a suitable machine at each e-Science Centre. Performance data is stored locally on that machine, and is published to interested people via a web interface, and to the Grid middleware via a publication service (LDAP, Grid or web). LDAP seems to be popular in the States, with OGSA growing in popularity in Europe. An (OGSA) Grid service is essentially a web service with some Grid-specific add-ons/pre-requisites. Every 30 minutes (90 minutes for bbcp/ftp) each machine performs monitoring between itself and all other e-Science Centres. In this way a mesh of monitoring is created, allowing each centre to build up a picture of the quality of its links to all other centres. The mesh approach is feasible given the relatively low number of sites involved (12-15 in this case). The times at which the individual machines run tests are staggered in an attempt to minimise the disruption one machine's tests may have on another's. The monitoring host must obviously meet some requirements, the most important being that it is a dedicated monitoring machine. There is little point installing the toolkit on your web server, for example, because it will not allow like-for-like performance data to be collected. In the same vein, the machine must have similar connectivity to the other networked e-Science resources at your centre. Performance data will not be representative if the tools are, for example, installed on a machine hanging from the primary link to the outside world, while users in the e-Science centre are three or four levels further down the network hierarchy. Tools are installed on a dedicated and similar node at each centre, forming a MESH.
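The scale of the mesh and the staggering idea can be sketched numerically; the even spacing of start offsets below is an assumption for illustration, not GridMon's actual schedule.

```python
def mesh_tests(n_sites):
    """Full-mesh monitoring: every site tests every other site each cycle."""
    return n_sites * (n_sites - 1)

def staggered_offsets(n_sites, interval_min=30):
    """Illustrative staggering: spread the sites' start times evenly across
    one 30-minute cycle so one machine's tests disturb another's less."""
    return [round(i * interval_min / n_sites, 2) for i in range(n_sites)]

print(mesh_tests(12))         # 132 directed site pairs per cycle
print(staggered_offsets(12))  # start offsets (minutes) within the cycle
```

With only 12-15 sites the quadratic growth of the mesh stays manageable, which is why the slide calls the approach feasible at this scale.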

37 Monitoring: How(2)?
PingER is a collection of Perl scripts which use the ICMP ping utility to send 10 ping requests to remote hosts. The results (RTT etc.) are recorded for later analysis. The tool was developed at SLAC, with assistance from various other sources. IperfER is based on NCSA's iperf utility, used for measuring the network's view of TCP or UDP throughput between two different hosts. Iperf consists of client and server executables which sit at either end of a TCP/UDP connection, streaming data between each other. IperfER allows the findings of iperf to be stored for subsequent review. UDPmon is essentially a UDP equivalent of IperfER, developed in Manchester University's HEP department, within the bounds of the EDG project. bbcp and bbftp are basic tools for copying files between sites, albeit using multiple TCP streams and large TCP window sizes. They are important in relation to monitoring because they are end-user software, not monitoring tools; this allows us to obtain a picture of end-to-end performance. This approach has been pioneered at SLAC. Building on their work, the monitoring toolkit will use SSH to log in to a remote machine (via RSA public-private key authentication) and copy a series of files (ranging in size from 180 MB) across the network. Data about the transfer is recorded for future processing. MiperfER is a (new and experimental) multicast version of the IperfER tool. The most notable application for multicast is of course video conferencing. The Access Grid has been proving to be pretty useful for e-Science, hence the interest in multicast. As the Grid promotes the use of geographically distributed teams (via VOs), you would expect the demand for multicast apps (like video conferencing) also to increase. Note: MiperfER is being used as more of a diagnostic tool; any possible use of data published to the middleware is unclear.

38 Network Weather Service

39 Introduction
“NWS provides accurate forecasts of dynamically changing performance characteristics from a distributed set of metacomputing resources.” What will be the future load (not the current load) when a program is executed? NWS produces short-term performance forecasts based on historical performance measurements. The forecasts can be used by dynamic scheduling agents.

40 Introduction
Resource allocation and scheduling decisions must be based on predictions of resource performance during a timeframe. NWS takes periodic measurements of performance and, using numerical models, forecasts resource performance.

41 NWS Goals
Components: persistent state; name server; sensors, both passive (CPU availability) and active (network measurements); forecaster.

42 Architecture

43 Architecture

44 Performance measurements
Using sensors. CPU sensors: measure CPU availability; use uptime, vmstat, and active probes. Network sensors: measure latency and bandwidth. Each host maintains: current data; one-step-ahead predictions; a time series of data.

45 Network Measurements

46 Issues with Network Sensors
Appropriate transfer size for measuring throughput. Collision of network probes. Solutions: tokens, and hierarchical trees with cliques.

47 Available CPU measurement

48 Available CPU measurement
The formula shown does not take job priorities into account; hence an active probe is periodically run to adjust the estimates.

49 Predictions
To generate a forecast, the forecaster requests persistent state data. When a forecast is requested, the forecaster makes predictions for the existing measurements using different forecast models. The forecast model is chosen dynamically, based on the best Mean Absolute Error, Mean Square Prediction Error, or Mean Percentage Prediction Error. Forecasts are requested by InitForecaster() and RequestForecasts(). Forecasting methods: mean-based, median-based, autoregressive.
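The dynamic choice of forecast model can be sketched as follows. The "last value" predictor stands in here for the autoregressive method, and the method set and history values are illustrative, not the actual NWS catalogue:

```python
import statistics

def mae(preds, actual):
    """Mean Absolute Error between predictions and measurements."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

def one_step_forecasts(history, method):
    """Predict each measurement from the measurements before it."""
    return [method(history[:i]) for i in range(1, len(history))]

def best_forecast(history):
    """NWS-style dynamic choice sketch: run several simple predictors over
    the history and use the one with the lowest MAE for the next forecast."""
    methods = {
        "mean": lambda h: statistics.mean(h),
        "median": lambda h: statistics.median(h),
        "last": lambda h: h[-1],  # degenerate stand-in for autoregression
    }
    errors = {name: mae(one_step_forecasts(history, m), history[1:])
              for name, m in methods.items()}
    winner = min(errors, key=errors.get)
    return winner, methods[winner](history)

history = [0.42, 0.45, 0.44, 0.46, 0.43, 0.45]  # e.g. CPU availability samples
name, forecast = best_forecast(history)
print(name, round(forecast, 3))
```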

50 Forecasting Methods
Notation: given measurements m_1, …, m_t and one-step-ahead predictions p_1, …, p_t, the prediction error at step i is |p_i - m_i|. Prediction accuracy: the Mean Absolute Error (MAE) is the average of these errors, MAE = (1/t) * Σ_i |p_i - m_i|.

51 Forecasting Methods – Mean-based
1. 2. 3.

52 Forecasting Methods – Mean-based
4. 5.

53 Forecasting Methods – Median-based
1. 2. 3.

54 Autoregression
The coefficients a_i are found such that they minimize the overall error; r_{i,j} is the autocorrelation function for the series of N measurements.
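For the simplest one-coefficient case, the error-minimising coefficient reduces to the lag-1 autocorrelation of the series. A minimal AR(1) sketch (illustrative, not the NWS implementation, with made-up measurement values):

```python
def autocorr(series, lag):
    """Sample autocorrelation r_lag of a measurement series."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def ar1_forecast(series):
    """AR(1) sketch: with one coefficient, the error-minimising a_1 is the
    lag-1 autocorrelation, and the forecast scales the last deviation from
    the mean by that factor."""
    mean = sum(series) / len(series)
    a1 = autocorr(series, 1)
    return mean + a1 * (series[-1] - mean)

series = [10.0, 10.4, 9.8, 10.2, 9.9, 10.3]
print(round(ar1_forecast(series), 3))
```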

55 Forecasting Methodology

56 Forecast Results

57 Forecasting Complexity vs Accuracy
Semi Non-parametric Time Series Analysis (SNP) – an accurate but complicated model Model fit using iterative search Calculation of conditional expected value using conditional probability density

58 Sensor Control
Each sensor connects to the other sensors and performs measurements: O(N²). To reduce the time complexity, sensors are organized in a hierarchy of cliques. To avoid collisions, tokens are used. Adaptive control uses adaptive token timeouts, adaptive time-out discovery, and a distributed leader-election protocol.
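The payoff of the clique hierarchy can be illustrated by counting probes. The two-level grouping below is an assumption made for the sketch, not NWS's exact scheme:

```python
def full_mesh_probes(n):
    """Every sensor probes every other sensor: O(N^2) measurements."""
    return n * (n - 1)

def clique_probes(n, clique_size):
    """Illustrative two-level hierarchy: probes run within each clique,
    plus one representative per clique probing the other representatives."""
    n_cliques = n // clique_size
    intra = n_cliques * clique_size * (clique_size - 1)
    inter = n_cliques * (n_cliques - 1)
    return intra + inter

print(full_mesh_probes(64))  # 4032 pairwise probes
print(clique_probes(64, 8))  # far fewer with 8 cliques of 8 sensors
```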

