Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Catalin Cirstoiu, Ciprian Dobre
An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems
ACAT, November 2008, Erice
Iosif Legrand November
The MonALISA Framework
MonALISA is a dynamic, distributed service system capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems. The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.
Distributed Object Systems: CORBA, DCOM
"Traditional" distributed object models (CORBA, DCOM)
[Diagram: Client with Stub, Server with Skeleton, Lookup Service, "IDL" Compiler]
The Stub is linked to the Client. The Client must know about the service from the beginning and needs the right stub for it. The Server and the Client code must be created together.
Distributed Object Systems: Web Services (WSDL/SOAP)
[Diagram: Client, Server, Lookup Service, WSDL interface, SOAP]
The client can dynamically generate the data structures and the interfaces for using remote objects based on WSDL. Platform independent.
Mobile Code and Distributed Services
Act as a true dynamic service and provide the necessary functionality to be used by any other services that require such information (Jini, interface to WSDL/SOAP):
- a mechanism to dynamically discover all the "Service Units"
- remote event notification for changes in any system
- a lease mechanism for each registered unit
Dynamic code loading: services can be used dynamically.
[Diagram: Client, Service, Lookup Service, Proxy]
- Remote Services: Proxy == RMI Stub
- Mobile Agents: Proxy == Entire Service
- "Smart Proxies": the proxy adjusts to the client
Any protocol well suited to the application can be used.
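The lease mechanism mentioned on this slide can be illustrated with a minimal Python sketch. It is not Jini's API: the class `LeaseRegistry`, its method names and the injectable clock are hypothetical, chosen only for illustration. The point it shows is that a registered unit stays visible in the lookup service only while it keeps renewing its lease.

```python
import time


class LeaseRegistry:
    """Toy lookup service: an entry stays visible only while its lease is valid."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock, so the behaviour can be tested
        self._expiry = {}            # service name -> absolute expiry time

    def register(self, name, lease_seconds):
        """Grant a lease to a service unit."""
        self._expiry[name] = self._clock() + lease_seconds

    def renew(self, name, lease_seconds):
        """Extend a lease that is still valid; expired leases cannot be renewed."""
        if name not in self.active():
            raise KeyError(f"lease for {name!r} has expired or was never granted")
        self._expiry[name] = self._clock() + lease_seconds

    def active(self):
        """Drop expired entries and return the names that are still registered."""
        now = self._clock()
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service that crashes simply stops renewing, so its entry disappears without any explicit deregistration step.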
MonALISA Service & Data Handling
[Diagram labels: Data Store; Data Cache Service & DB; Configuration Control (SSL); Predicates & Agents; Data (via ML Proxy); Applications; Clients or Higher Level Services; WS Clients and Services; Web Service (WSDL, SOAP); Lookup Services (Registration, Discovery); Postgres; Agents; Filters / Triggers; Monitoring Modules (collect any type of information, dynamic loading, push and pull)]
The MonALISA Architecture
- Regional or global high-level (HL) services, repositories and clients
- Proxies: secure and reliable communication, dynamic load balancing, scalability and replication, AAA for clients
- JINI lookup services (secure and public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
- MonALISA services and agents: a network of distributed systems for gathering and analyzing information based on mobile agents (customized aggregation, triggers, actions)
Fully distributed system with no single point of failure.
Registration / Discovery, Admin Access and AAA for Clients
[Diagram labels: MonALISA Services (data, filters & agents, trust keystore); Lookup Services; registration (signed certificate); discovery; Services Proxy Multiplexers; clients (other services); applications; AAA services (client authentication); Admin (SSL connection, trust keystore)]
Monitoring Grid Sites, Running Jobs, Network Traffic, and Connectivity
[Screenshots: Topology, Jobs, Accounting, Running Jobs]
Monitoring architecture in ALICE
[Diagram: ApMon is attached to the AliEn Job Agents, AliEn CE, AliEn SE, Cluster Monitor, AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL Servers, CastorGrid Scripts and API Services. Other labels: MonALISA LCG Site, LCG Tools, MonaLisaRepository, Long History DB, Aggregated Data, Alerts, Actions.]
Monitored parameters: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status.
ALICE: Global Views, Status & Jobs
ALICE: Job status – history plots
ALICE: Resource Usage Monitoring
Cumulative parameters:
- CPU time & CPU KSI2K
- Wall time & Wall KSI2K
- Read & written files
- Input & output traffic (xrootd)
Running parameters:
- Resident memory
- Virtual memory
- Open files
- Workdir size
- Disk usage
- CPU usage
Aggregated per site.
ALICE: Job Agents Monitoring
From the Job Agent itself:
- Requesting job
- Installing packages
- Running job
- Done
- Error statuses
From the Computing Element:
- Available job slots
- Queued Job Agents
- Running Job Agents
Monitoring the Execution of Jobs and the Time Evolution
[Figure: split jobs and lifelines for jobs. Submit a job DAG; labels: Job, Job1, Job2, Job3, Job 31, Job 32]
Local and Global Decision Framework
Two levels of decisions: local (autonomous) and global (correlations).
Actions triggered by:
- values above/below given thresholds
- absence/presence of values
- correlations between any values
Action types:
- alerts (e-mails, instant messages, Atom feeds)
- running an external command
- automatic chart annotations in the repository
- running custom code, such as securely ordering an ML service to (re)start a site service
[Diagram: sensors (traffic, jobs, hosts, apps, temperature, humidity, A/C, power, …) feed the ML services. ML Service: local decisions, actions based on local information. Global ML Services: global decisions, actions based on global information.]
ALICE: Automatic Job Submission, Restarting Services
- The MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage.
- The ALICE production jobs queue is kept full by the automatic submission. Trigger: threshold on the number of aliprod waiting jobs.
- Administrators are kept up to date on the services' status. Trigger: presence/absence of monitored information.
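The threshold and presence/absence triggers described on these slides can be sketched in a few lines of Python. This is an illustration only, not MonALISA's configuration format or API: the `Trigger` class, the helper functions, the parameter names (`mysql_vsz_mb`, `aliprod_waiting_jobs`, `vobox_heartbeat`) and the threshold values in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # evaluated on the latest values
    action: str                                    # label of the action to run


def above(param: str, limit: float):
    """Fires when the parameter is present and above the limit."""
    return lambda values: param in values and values[param] > limit


def below(param: str, limit: float):
    """Fires when the parameter is present and below the limit."""
    return lambda values: param in values and values[param] < limit


def absent(param: str):
    """Fires when no value was received for the parameter."""
    return lambda values: param not in values


def fired_actions(triggers: List[Trigger], values: Dict[str, float]) -> List[str]:
    """Return the actions of all triggers whose condition holds."""
    return [t.action for t in triggers if t.condition(values)]
```

For example, with the hypothetical triggers `Trigger("mysql-memory", above("mysql_vsz_mb", 4096), "restart mysqld")`, `Trigger("queue-low", below("aliprod_waiting_jobs", 100), "submit jobs")` and `Trigger("silent", absent("vobox_heartbeat"), "notify admin")`, a snapshot that lacks the heartbeat value and reports only 20 waiting jobs fires the last two actions.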
Automatic actions in ALICE
ALICE uses the monitoring information to automatically:
- resubmit error jobs until a target completion percentage is reached;
- submit new jobs when necessary, watching the task queue size for each service account (production jobs; RAW data reconstruction jobs, for each pass);
- restart site services whenever tests of VoBox services fail but the central services are OK;
- send notifications / add chart annotations when a problem was not solved by a restart;
- dynamically modify the DNS aliases of central services for efficient load balancing.
Most of the actions are defined by configuration files of a few lines.
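As an illustration of the first action (resubmitting error jobs until a target completion percentage is reached), the following Python sketch computes how many failed jobs would need to be resubmitted. The function name, its arguments and the integer-percentage convention are assumptions made for this example; the slide does not describe how ALICE implements the rule.

```python
def jobs_to_resubmit(done: int, failed: int, total: int, target_percent: int) -> int:
    """Number of failed jobs to resubmit so that `target_percent` of `total` can complete.

    Uses integer arithmetic (ceiling division) to avoid floating-point rounding.
    """
    required_done = -(-target_percent * total // 100)  # ceil(target_percent * total / 100)
    missing = required_done - done
    # Never resubmit more jobs than have actually failed, and never a negative number.
    return max(0, min(failed, missing))
```

With 1000 jobs, a 95% target and 900 jobs done, the sketch asks for 50 of the failed jobs to be resubmitted; once 950 or more are done it returns 0.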
Monitoring USLHCnet
Operations and management are assisted by agent-based software.
Used on the new CIENA equipment used for network management.
USLHCnet: Precise Measurements for the Operational Status on the WAN Link
Operations and management are assisted by agent-based software.
Used on the new CIENA equipment used for network management.
USLHCnet: Traffic on different segments
USLHCnet: Accounting for Integrated Traffic
The UltraLight Network
[Plot labels: BNL, ESnet, IN/OUT]
Available Bandwidth Measurements
Embedded Pathload module.
Monitoring Network Topology, Latency, Routers
[Figure labels: Networks, AS, Routers]
Real-time topology discovery & display.
EVO: Real-time monitoring for Reflectors and the quality of all possible connections
EVO: Creating a Dynamic, Global Minimum Spanning Tree to Optimize the Connectivity
A weighted connected graph G = (V, E) with n vertices and m edges. The quality of connectivity between any two reflectors is measured every second. A minimum spanning tree with additional constraints is built in near real time.
Dynamic MST to Optimize the Connectivity for Reflectors
Frequent measurements of RTT, jitter, traffic and lost packets. The MST is recreated in ~1 s in case of communication problems.
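A minimum spanning tree over a weighted graph of reflectors can be computed, for example, with Kruskal's algorithm, sketched below in Python. The slides do not say which algorithm EVO uses, and the "additional constraints" they mention are not specified, so this sketch omits them. The function name and the edge format `(weight, u, v)` are assumptions; the weight stands for any link-quality metric such as RTT in milliseconds.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm.

    nodes: iterable of hashable node names.
    edges: iterable of (weight, u, v) tuples; lower weight means a better link.
    Returns the list of edges kept in the tree, in order of increasing weight.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the set representative, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:           # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree
```

Recomputing the tree after each round of measurements is cheap for a modest number of reflectors, since the cost is dominated by sorting the m edges.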
EVO: Optimize how clients connect to the system for best performance and load balancing
FDT – Fast Data Transfer
FDT is an application for efficient data transfers. It is easy to use, written in Java, and runs on all major platforms. It is based on an asynchronous, multithreaded system that uses the NIO library and is able to:
- stream a list of files continuously
- use independent threads to read and write on each physical device
- transfer data in parallel on multiple TCP streams, when necessary
- use appropriately sized buffers for disk I/O and networking
- resume a file transfer session
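The idea of decoupling reading from writing with independent threads and a bounded set of buffers can be sketched with Python's standard library. This is not FDT's code (FDT is written in Java on top of NIO): the function `pipelined_copy` and its parameters are hypothetical, and a bounded queue stands in for the buffer pool.

```python
import queue
import threading


def pipelined_copy(source, sink, buffer_size=64 * 1024, pool_size=4):
    """Copy `source` to `sink` with one reader thread and one writer thread.

    The bounded queue plays the role of a buffer pool: the reader blocks when
    `pool_size` chunks are waiting, so a slow writer throttles the reader.
    """
    filled = queue.Queue(maxsize=pool_size)

    def reader():
        while True:
            chunk = source.read(buffer_size)
            filled.put(chunk)          # an empty chunk marks the end of the stream
            if not chunk:
                return

    def writer():
        while True:
            chunk = filled.get()
            if not chunk:
                return
            sink.write(chunk)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The same pattern extends to one reader per source device and one writer per destination device, which is the arrangement the slide describes.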
FDT – Fast Data Transfer
[Diagram: on each side, independent threads per device and a pool of buffers in kernel space; data transfer over sockets / channels; files restored from the buffers; separate control connection / authorization]
FDT features (April 2007, Iosif Legrand)
The FDT architecture allows external security APIs to be plugged in and used for client authentication and authorization. It supports several security schemes:
- IP filtering
- SSH
- GSI-SSH
- Globus-GSI
- SSL
User-defined loadable modules for pre- and post-processing provide support for dedicated MS system, compression, …
FDT can be monitored and controlled dynamically by MonALISA services.
FDT – Memory to Memory Tests in WAN (October 2006, Iosif Legrand)
CPUs: Dual Core Intel 3.00 GHz, 4 GB RAM, 4 x 320 GB SATA disks, connected with 10 Gb/s Myricom.
[Plot: ~9.0 Gb/s and ~9.4 Gb/s]
Disk-to-Disk Transfers in WAN (October 2007, Iosif Legrand)
New York – Geneva (1U nodes with 4 disks): reads and writes on 4 SATA disks in parallel on each server. Mean traffic ~210 MB/s, ~0.75 TB per hour.
CERN – Caltech (4U disk servers with 24 disks): reads and writes on two 12-port RAID controllers in parallel on each server. Mean traffic ~545 MB/s, ~2 TB per hour.
- Lustre read/write ~320 MB/s between Florida and Caltech
- Works with xrootd
- Interface to dCache using the dcap protocol
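The per-hour figures follow directly from the mean rates. The short Python check below assumes decimal units (1 TB = 1,000,000 MB); the function name is chosen for illustration.

```python
def tb_per_hour(mb_per_second: float) -> float:
    """Convert a sustained rate in MB/s to TB per hour (decimal units)."""
    seconds_per_hour = 3600
    mb_per_tb = 1_000_000
    return mb_per_second * seconds_per_hour / mb_per_tb


# 210 MB/s gives about 0.756 TB/h and 545 MB/s gives about 1.962 TB/h,
# consistent with the rounded "~0.75 TB" and "~2 TB" per hour on the slide.
rate_4_disks = tb_per_hour(210)
rate_24_disks = tb_per_hour(545)
```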
Monitoring Optical Switches
Dynamic restoration of the lightpath if a segment has problems.
Monitoring the Topology and Optical Power on Fibers for Optical Circuits
[Figure labels: port power monitoring; controlling; Glimmerglass switch example]
"On-Demand", End-to-End Optical Path Allocation
[Diagram: an application at site A runs ">FDT A/fileX B/path/"; steps shown: OS path available, configuring interfaces, starting data transfer. Labels: Internet, regular IP path, active light path, optical switch (OS) monitored and controlled via TL1, OS agent, MonALISA service, MonALISA distributed service system, LISA agents at A and B, real-time monitoring, data.]
LISA sets up the network interfaces, the TCP stack, the kernel parameters and the routes, and tells the application which interface to use ("use eth1.2, …").
Creates an end-to-end path in < 1 s. Detects errors and automatically recreates the path in less than the TCP timeout.
Controlling Optical Planes: Automatic Path Recovery
[Map labels: CERN (Geneva), Caltech (Pasadena), Starlight, Manlan, USLHCnet, Internet2]
"Fiber cut" simulations: the traffic moves from one transatlantic line to the other one. The FDT transfer (CERN – Caltech) continues uninterrupted. TCP fully recovers in ~20 s.
[Plot: FDT transfer at 200+ MBytes/sec from a 1U node, with 4 fiber cut emulations]
End-to-End Path Provisioning on Different Layers
Between Site A and Site B:
- Layer 3: default IP route
- Layer 2: VCAT and VLAN channels
- Layer 1: optical path
Functions:
- Monitor layout / set up circuit
- Monitor host & end-to-end paths / set up end-host parameters
- Control transfers and bandwidth reservations
- Monitor interfaces traffic
"On-Demand", L2 Dynamic Channel and Path Allocation
[Diagram: the application runs ">FDT A/fileX B/path/"; steps shown: path or channel allocation, configuring interfaces, starting data transfer. Labels: regular IP path, local VLANs.]
Recommended to use two NICs:
- one for management, one for data
- bonding two NICs to the same IP
Map local VLANs to WAN channels or light paths.
The Need for Planning and Scheduling for Large Data Transfers
[Plot: two reading tasks performed in parallel vs. sequentially]
It is 2.5x faster to perform the two reading tasks sequentially.
Dynamic Path Provisioning: Queueing and Scheduling
[Diagram labels: User, Request, Scheduling, Control, Monitoring, End Host Agents, Realtime Feedback]
- Channel allocation based on VO/priority [+ wait time, etc.]
- Create on demand an end-to-end path or channel and configure end hosts
- Automatic recovery (rerouting) in case of errors
- Dynamic reallocation of throughput per channel to manage priorities and control time to completion, where needed
- Reallocate resources requested but not used
Dynamic Priority for FDT Transfers on Common Segments
[Plot: three transfers with Priority 4, Priority 2 and Priority 8]
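One simple way to turn priorities into per-transfer throughput on a shared segment is a proportional split, sketched below in Python. This is an assumption made for illustration: the slides do not state the allocation rule that MonALISA applies to FDT transfers, and the function name, the transfer names and the 10 Gb/s capacity are hypothetical.

```python
def share_bandwidth(capacity_gbps: float, priorities: dict) -> dict:
    """Split the capacity of a common segment in proportion to each transfer's priority."""
    total = sum(priorities.values())
    return {name: capacity_gbps * p / total for name, p in priorities.items()}


# Three transfers with the priorities shown on the slide (4, 2 and 8)
# sharing a hypothetical 10 Gb/s segment.
shares = share_bandwidth(10.0, {"transfer-a": 4, "transfer-b": 2, "transfer-c": 8})
```

Under this rule the priority-8 transfer receives twice the rate of the priority-4 transfer and four times that of the priority-2 transfer; when a transfer's priority changes, the shares are simply recomputed.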
Bandwidth Challenge at SC
[Plot: throughput in Gbs]
~500 TB total in 4 h
FDT & MonALISA Used at SC 2006
17.7 Gb/s disk to disk on a 10 Gb/s link used in both directions, from Florida to Caltech.
SC2006
[Plots: Official BWC, Hyper BWC]
SC Gb/s disk to disk
Communities using MonALISA
Major communities: ALICE, CMS, ATLAS, EVO, LCG Russia, UNAM Grid (Mx), ITU, USLHCnet, UltraLight, GLORIAD, Abilene, RoEduNET, Enlightened, VRVS, OSG
MonALISA today:
- Running 24 x 7 at ~340 sites
- Collecting ~ parameters in near real time
- Update rate of 20,000 parameter updates per second
- Monitoring 12,000 computers
- > 100 WAN links
- Thousands of Grid jobs running concurrently