1
Experiment: Step by Step
Author: Anna Bekkerman, abekkerm@ecs.umass.edu
2
Setup
[Diagram labels: Server, LMM, Client, Node, Target system; arrows: Control signals, Data]
3
Configuration File
Describes an experiment:
– Nodes
    IP addresses, types (SOCC node/radar node), etc.
– Commands to start/stop involved processes
– Collected metrics (CPU/memory utilization, etc.)
– Monitored processes
– Net control parameters
    Delays, drop rates
– Refresh rates
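The slides do not show the actual file format, so the following is only a sketch of how the listed items could be modeled and sanity-checked. Every field name, the `NodeConfig` class, and the `validate` helper are hypothetical; only the categories of information come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeConfig:
    """Hypothetical per-node entry; field names are illustrative only."""
    ip: str
    node_type: str                                   # "SOCC" or "radar"
    start_commands: List[str] = field(default_factory=list)
    stop_commands: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    monitored_processes: List[str] = field(default_factory=list)
    delay_ms: int = 0                                # net control: added delay
    drop_rate: float = 0.0                           # net control: fraction dropped
    refresh_seconds: float = 1.0                     # collection refresh rate


def validate(node: NodeConfig) -> List[str]:
    """Return a list of human-readable problems; empty means the entry looks sane."""
    problems = []
    if node.node_type not in ("SOCC", "radar"):
        problems.append(f"unknown node type: {node.node_type!r}")
    if not 0.0 <= node.drop_rate <= 1.0:
        problems.append("drop_rate must be between 0 and 1")
    if node.delay_ms < 0:
        problems.append("delay_ms must not be negative")
    if node.refresh_seconds <= 0:
        problems.append("refresh_seconds must be positive")
    return problems


# Example entry using a documentation-only IP address.
radar = NodeConfig(ip="192.0.2.10", node_type="radar",
                   metrics=["cpu", "memory"], drop_rate=0.05, delay_ms=20)
```

Checking the entries before any LMM is started keeps configuration mistakes from surfacing halfway through an experiment.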
4
Start LMMs
When started, the RAPIDS server:
– Grabs two ports:
    49162 to communicate with LMMs
    8888 to communicate with RAPIDS clients
– Reads a configuration file
– Starts LMMs on all nodes through SSH connections
– Waits for ack signals from all LMMs
– Starts setting LMMs up according to the configuration file
FIXME: The server will wait indefinitely for the acks from all LMMs. A time-out mechanism should be introduced.
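One possible shape for the time-out mentioned in the FIXME is sketched below. It is not the RAPIDS implementation: the `wait_for_acks` function is hypothetical, and a `queue.Queue` of node names stands in for acks arriving over the LMM port.

```python
import queue
import time


def wait_for_acks(expected, acks, timeout_s):
    """Wait until every node in `expected` has acked or the deadline passes.

    `acks` is a queue.Queue of node identifiers (a stand-in for the network).
    Returns the set of nodes that never acked; an empty set means success.
    """
    pending = set(expected)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            node = acks.get(timeout=remaining)
        except queue.Empty:
            break
        pending.discard(node)
    return pending
```

Returning the set of silent nodes, rather than a plain boolean, lets the server report which LMMs failed to start.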
5
Set LMMs Up
A home-made protocol is used to set up LMM parameters.
Examples of commands sent from the server to LMMs:
– STM: set metric
– STP: set monitored process
– STE: set start-up command
– STT: start
– SPP: stop
When a parameter is set, the LMM sends an ack signal back to the server.
At the end of each step, the server waits for acks from all LMMs.
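The command mnemonics above come from the slide, but the wire format is not described. The sketch below assumes, purely for illustration, one ASCII line per command with an optional space-separated argument; `encode` and `decode` are hypothetical helpers.

```python
# Mnemonics as listed on the slide; the framing below is an assumption.
COMMANDS = {
    "STM": "set metric",
    "STP": "set monitored process",
    "STE": "set start-up command",
    "STT": "start",
    "SPP": "stop",
}


def encode(cmd, arg=""):
    """Build one command line, e.g. b"STM cpu\\n" (assumed framing)."""
    if cmd not in COMMANDS:
        raise ValueError(f"unknown command: {cmd!r}")
    line = cmd if not arg else f"{cmd} {arg}"
    return (line + "\n").encode("ascii")


def decode(data):
    """Split a command line back into (mnemonic, argument)."""
    line = data.decode("ascii").rstrip("\n")
    cmd, _, arg = line.partition(" ")
    if cmd not in COMMANDS:
        raise ValueError(f"unknown command: {cmd!r}")
    return cmd, arg
```

Rejecting unknown mnemonics on both sides means a malformed message fails loudly instead of leaving the server waiting for an ack that will never arrive.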
6
Start Monitoring
When the LMM receives the start command:
– If needed, the network control application is started
    The network control application runs only if iptables is turned on. iptables selects IP packets (as specified in the iptables rules) and queues them for processing by the application. The application introduces delays and/or drops packets according to the settings in the configuration file.
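The per-packet decision described above can be sketched without touching iptables at all. The `verdict` function below is hypothetical and only models the choice between dropping and delaying; how RAPIDS actually receives queued packets is not shown in the slides.

```python
import random
from typing import Optional


def verdict(drop_rate: float, delay_ms: int, rng: random.Random) -> Optional[int]:
    """Decide what to do with one queued packet.

    Returns None if the packet should be dropped, otherwise the number of
    milliseconds to hold it before forwarding. `drop_rate` is a fraction
    between 0 and 1, as it might appear in the configuration file.
    """
    if rng.random() < drop_rate:
        return None
    return delay_ms


# A seeded generator makes an experiment's drop pattern repeatable.
rng = random.Random(42)
decisions = [verdict(0.2, 50, rng) for _ in range(1000)]
dropped = sum(1 for d in decisions if d is None)
```

Passing the random generator in explicitly is a design choice for repeatability: two runs with the same seed drop the same packets.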
7
Start Monitoring
When the LMM receives the start command:
– If needed, the network control application is started
– RAPIDS Message Queues (RMQ) are initialized
    A mechanism used for communication between RAPIDS and monitored applications. See more in the “RMQ” section.
8
Start Monitoring
When the LMM receives the start command:
– If needed, the network control application is started
– RAPIDS Message Queues (RMQ) are initialized
– Heartbeat applications are started
    They send “I’m alive” signals from radar nodes to SOCC nodes. If a signal has not been received, RAPIDS reports a link failure.
    FIXME: A timeout mechanism should be added to minimize false alarms.
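A common way to reduce false alarms of the kind the FIXME mentions is to tolerate a few missed heartbeats before declaring a failure. The sketch below illustrates that idea only; the `HeartbeatMonitor` class, its parameters, and the miss threshold are assumptions, not part of RAPIDS.

```python
class HeartbeatMonitor:
    """Track "I'm alive" signals from one radar node (hypothetical helper)."""

    def __init__(self, interval_s, allowed_misses, now):
        self.interval_s = interval_s          # expected time between heartbeats
        self.allowed_misses = allowed_misses  # consecutive misses tolerated
        self.last_seen = now

    def beat(self, now):
        """Record that a heartbeat arrived at time `now`."""
        self.last_seen = now

    def link_failed(self, now):
        """Report failure only after more than `allowed_misses` intervals of silence."""
        return (now - self.last_seen) > self.interval_s * self.allowed_misses
```

Timestamps are passed in rather than read from the clock, which keeps the logic easy to exercise with simulated times.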
9
Start Monitoring
When the LMM receives the start command:
– If needed, the network control application is started
– RAPIDS Message Queues (RMQ) are initialized
– Heartbeat applications are started
– Processes are started
    Commands are specified by the user in the configuration file
10
Start Monitoring
When the LMM receives the start command:
– If needed, the network control application is started
– RAPIDS Message Queues (RMQ) are initialized
– Heartbeat applications are started
– Processes are started
    Commands are specified by the user in the configuration file
– “Collection sessions” are started every t seconds
    According to the refresh rates provided by the user in the configuration file
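Launching a session every t seconds can be sketched as a loop that sleeps until the next due time and can be interrupted by a stop request. The `run_sessions` function and its `max_sessions` parameter are hypothetical; the slides do not say how the LMM schedules sessions.

```python
import threading
import time


def run_sessions(session, period_s, stop, max_sessions=None):
    """Call `session()` every `period_s` seconds until `stop` is set.

    `stop` is a threading.Event. `max_sessions` bounds the loop, which is
    convenient for demonstrations. Returns the number of sessions run.
    """
    done = 0
    next_due = time.monotonic()
    while not stop.is_set() and (max_sessions is None or done < max_sessions):
        session()
        done += 1
        # Schedule against the previous due time so session run time
        # does not accumulate as drift.
        next_due += period_s
        stop.wait(max(0.0, next_due - time.monotonic()))
    return done
```

Waiting on the event instead of calling `time.sleep` means a stop command takes effect immediately rather than after the current period ends.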
11
Collection Session
During each collection session, the LMM:
– Collects metrics
– Reads events accumulated in RMQ
– Sends the metrics and events to the RAPIDS server
More details in the “LMM” section
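The three steps above can be sketched as a single function. Everything concrete here is assumed for illustration: a `queue.Queue` stands in for RMQ, metric collectors are plain callables, the payload is JSON, and `send` is whatever delivers bytes to the server.

```python
import json
import queue


def collection_session(collectors, events, send):
    """Run one collection session and return the payload that was sent.

    collectors: dict mapping a metric name to a zero-argument callable
    events:     queue.Queue of accumulated events (stand-in for RMQ)
    send:       callable that takes the encoded payload as bytes
    """
    # 1. Collect metrics.
    metrics = {name: read() for name, read in collectors.items()}

    # 2. Read every event accumulated since the last session.
    drained = []
    while True:
        try:
            drained.append(events.get_nowait())
        except queue.Empty:
            break

    # 3. Send metrics and events together.
    payload = json.dumps({"metrics": metrics, "events": drained}).encode("utf-8")
    send(payload)
    return payload
```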
12
Stop Monitoring
When the server is stopped, it sends stop commands to all LMMs.
Upon receiving the stop signal, the LMM:
– Stops launching collection sessions
– Stops processes
    Using the commands specified by the user in the configuration file
– Stops heartbeat applications
– Deletes RMQ
– Stops network control applications
13
What Might Go Wrong?
When the server is stopped, it sends stop commands to all LMMs.
Upon receiving the stop signal, the LMM:
– Stops launching collection sessions
– Stops processes
    Using the commands specified by the user in the configuration file
– Stops heartbeat applications
– Deletes RMQ
– Stops network control applications
If “untrappable” signals (SIGKILL and SIGSTOP) are used to kill the server, the shut-down procedures will not be executed!
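The difference between trappable and untrappable signals can be seen directly from Python on a POSIX system. In this sketch, `install_shutdown_handlers` and `can_trap` are hypothetical helpers; the slides do not show how the RAPIDS server registers its own handlers.

```python
import signal


def install_shutdown_handlers(on_shutdown):
    """Run `on_shutdown(signum)` when a trappable termination signal arrives."""
    def handler(signum, frame):
        on_shutdown(signum)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)


def can_trap(sig):
    """Return True if the OS lets this process change the action for `sig`."""
    try:
        previous = signal.signal(sig, signal.SIG_DFL)
    except OSError:
        # The kernel rejects any attempt to change SIGKILL or SIGSTOP.
        return False
    if previous is not None:
        signal.signal(sig, previous)  # restore whatever was installed before
    return True
```

In practice this means stopping the server with `kill <pid>` (SIGTERM) or Ctrl-C (SIGINT) gives the shut-down procedures a chance to run, while `kill -9` does not.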
14
What Might Go Wrong?
When the server is stopped, it sends stop commands to all LMMs.
Upon receiving the stop signal, the LMM:
– Stops launching collection sessions
– Stops processes
    Using the commands specified by the user in the configuration file
– Stops heartbeat applications
– Deletes RMQ
– Stops network control applications
If the commands provided by the user do not stop all processes, the LMM will hang waiting for their termination.
While an LMM is hanging, the port used for communication with the server remains unreleased. A new experiment therefore cannot be started until the LMMs are stopped and all necessary clean-up procedures have been completed.
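One way to keep an LMM from hanging is to bound the wait and escalate if a process outlives it. The sketch below is an assumption, not RAPIDS behavior: it sends SIGTERM directly instead of running the user's stop command, and `stop_process` is a hypothetical helper.

```python
import subprocess
import sys


def stop_process(proc, grace_s):
    """Ask `proc` to exit, wait up to `grace_s` seconds, then force it.

    Returns "terminated" if the process exited within the grace period,
    or "killed" if it had to be forced.
    """
    proc.terminate()                      # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                       # cannot be ignored by the child
        proc.wait()                       # collect the exit status
        return "killed"


# Example: a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_process(child, grace_s=5.0)
```

Because the wait is bounded, the LMM can always proceed to release its port, even when a monitored process ignores the stop request.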
15
What Might Go Wrong?
When the server is stopped, it sends stop commands to all LMMs.
Upon receiving the stop signal, the LMM:
– Stops launching collection sessions
– Stops processes
    Using the commands specified by the user in the configuration file
– Stops heartbeat applications
– Deletes RMQ
– Stops network control applications
FIXME: These applications do not always react to the termination signal properly.
Symptom: sometimes a number of zombie processes appear.
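The slides do not identify the cause of the zombies. In general, a zombie is a child that has exited but whose parent has not yet collected its exit status, so one common remedy is for the parent to reap children explicitly. The `reap_children` helper below is a hypothetical, POSIX-only sketch of that technique and may not match the actual fix needed in RAPIDS.

```python
import os


def reap_children():
    """Collect the exit status of every child that has already exited.

    Never blocks. Returns the list of reaped process IDs.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break          # this process has no children at all
        if pid == 0:
            break          # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

Calling such a function at the end of the stop sequence, or from a SIGCHLD handler, is a typical way to keep exited children from lingering in the process table.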