DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 1 / 29 GridICE The eyes of the grid A monitoring tool for a Grid Operation Center by DataTAG WP4 Sergio Fantinel, INFN LNL/PD
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 2 / 29 Sergio Fantinel’s Outline Discovery: resources, components Server side processes layout DB schema Graphs/data presentation Next steps Packaging/installation
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 3 / 29 Discovery entities RESOURCES: are the entities discovered from the GIS, ex: Cluster Head Nodes Storage Services Cluster Worker Nodes COMPONENTS: are the entities belonging to resources and discovered directly from resource itself, ex: Computing Elements Storage Space Network Adapters
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 4 / 29 Discovery purposes track the life of the entities: they are characterized by a status (new, available, disappeared, dead). Configure the monitoring system accordingly the status of these entities to collect metrics, status and other info.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 5 / 29 Discovery entities list This is the list of entities currently tracked by the monitoring system: Clusters Storage Services Worker Nodes (CL) Computing Elements (CL) Run Time Environments (CL) Virtual Organizations (CE) Storage Extents (WN) Network Adapters (WN) Storage Space (SE) Storage Protocols (SE) CL = Cluster WN = Worker Node/host SE= Storage Service
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 6 / 29 Discovery entities (2) Every entity (resource or component) is described by a number of characterizing information. Entities may be linked together: Ex. Network Adapter -> Worker Node -> Cluster To track the life of the entities it is used a SQL database where are stored also all the information related to every single entity.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 7 / 29 Discovery entities (3) Monitoring DB Monitorig Server GIIS GRIS GIIS Server Computing Element/ Storage Element SQL 1: LDAP Query 2: available CE/SE 3: LDAP Query 4: CEIDs, WNs, Steps 3,4 repeated for every CE/SE LDAP
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 8 / 29 Discovery config/check processes All the info collected are used to generate a number of Nagios configuration files (configuration process). Nagios schedule, according to some other DB stored parameters, at different interval times, the execution of a number of scripts (Nagios plug-ins wrote by the DataTAG WP4) that collect the info associated to every entity (check process) and put those in the DB.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 9 / 29 Server Side processes layout Discovery Config Nagios/scheduler Check Monitoring DB GIIS GRIS GRID 1A 1B 1C 2A 2B 34A 4B Gfx/Presentation WEB 5A 5B Developed by DataTAG WP4 1: entities discovery 2: generation of config files 3: check scheduling 4: entities info collection 5: DB info rendering Grid Information System LDAP Interface
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 10 / 29 Scheduling Discovery and config generation run as cron jobs; although the two processes can be scheduled independently at different time intervals, a discovery is just followed by a config generation. Check plug-ins are scheduled by Nagios; the interval for each one is set by a corresponding parameter in the DB.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 11 / 29 DataBase stored info Three types of information are stored in the database (~50 tables): Entities: actual status, historical status (fed by discovery process ) Info about entities (fed by check process) Monitoring configuration parameters (fed manually by monitoring administrator)
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 12 / 29 DataBase date/time attributes Many record types has associated two date/time attributes (Date, LastCheck) that determines the time interval where the record is valid. Checking the time distance between the Date attribute of a tuple and the LastCheck of the tuple that immediately precede, we can figure out problems of the resource or the monitoring system (VETOES generation, see later).
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 13 / 29 DataBase (2) For many entities the related info are grouped into two sets: slow, fast parameters; in the first group are those params that change very slowly in the time (ex. SO version, RAM size of a node, bogoMIPS,…); in the second are all the other parameters.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 14 / 29 DataBase check process For a set of parameters: the check process get the info from the GRIS, then compare the values against the last record in the DB and, if no one has changed, it update only the LastCheck attribute of the tuple; otherwise it insert a new tuple with the current date/time in Date/LastCheck attributes and the new info.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 15 / 29 DataBase thresholds We are introducing a threshold mechanism: For a selected set of attributes, nearly all the fast parameters, we define a threshold If the new values for a tuple doesn’t exceed the related threshold over the old values, we change only the LastCheck attribute
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 16 / 29 DataBase thresholds (2) This threshold mechanism become very useful to save DB space resource allocation. For example We doesn’t need to appreciate a variation of a CPU load less the 5- 10% (think to the fluctuations around the 100% idle when no job is running on a CPU) The same concept may be applied to a Storage Area with a 5-10MB threshold
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 17 / 29 DB Example: GE simple info RESOURCE Date LastCheck IP ResName ID_GeType Group Cluster Date Data string(16) string(64) Int string(64) int RL_GE_CPU_SLOWPARAM Date LastCheck Vendor Model Version ClockSpeed TimeInterval date string(32) string(64) string(32) int RL_GE_CPU_FASTPARAM Date LastCheck Load1Min Load5Min Load15Min date int User Nice System Idle int NB: this is a simplified structure
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 18 / 29 RL_GE_CPU_FASTPARAM Date LastCheck FreeCpus EstimatedResponseTime RunningJobs WaitingJobs WorstResponseTime date int DB Example: queues RESOURCE RL_CE_CEID Date LastCheck CeidName date string(256) RL_GE_CPU_SLOWPARAM Date LastCheck CeidStatus TotalCpus MaxCpuTime MaxRunningJobs date int Int int MaxTotalJobs MaxWallClockTime Priority LRMSType LRMSVersion int string(16) NB: this is a simplified structure
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 19 / 29 RL_SA_FASTPARAM DB Example: status history RESOURCE RL_SE_SA RL_SA_SLOWPARAM NB: this is a simplified structure RL_SA_CHANGE_STATUS ID_ChgStatus Date int date RL_RES_CHANGE_STATUS ID_ChgStatus Date int date
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 20 / 29 GETYPE DB Example: configuration RL_GETYPE_SERVICE NB: this is a simplified structure SERIVCE ServiceName CheckInterval Template Command string(64) Int string(32) Service List: CheckCeidFP CheckCeidSP CheckSeFP CheckSeSP CheckWnCpu CheckWnCtxc CheckWnInfo CheckWnIntc CheckWnMem CheckWnMpc CheckWnNet CheckWnPrcc CheckWnSckc CheckWnSpc CheckWnStg CheckWnVirtual
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 21 / 29 Data presentation We based our work on the JpGraph tool (PHPv4), a very powerful set of API to generate a number of graph types that use directly the GDLib API (v2 is needed for an accurate colour management) To overcame some limitation in the managing of date/time labels, merging and plotting of large set of date of JpGraph, and to have in general a fine control, we developed a decoupling layer between the DB and the render engine.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 22 / 29 Data presentation (2) Monitoring DB JpGraph GDLib Data Load Resample Data Merging Graph generation Developed by DataTAG WP4 Main Analysis Process
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 23 / 29 Data presentation (3) Monitoring DB Data Load Time interval Object descriptors (Table, fields,…) Veto thresholdArray of homogeneous tuples with their time interval validity (ALV) Array of VETOES (if option enabled) (AVT)
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 24 / 29 Data presentation (4) Resample Time interval (ALV) (AVT) X-axis plot (dot) size (xSize) Array with xSize number of elements; Every element is a record of metrics values; each value is the average of the metric function in the time interval slot represented by the xpoint considered Graph Labels Graph VETO bars (if enabled)
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 25 / 29 Data presentation (5) The presentation of the date was made addressing different user types (see later on the live demo session): Vo views, for a VO manager Site views, system manager Single entity views
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 26 / 29 Next steps, integration the next HIGH PRIORITY step is the integration with the EDG 2.0 information system. Due to the structure of our monitoring system, the migration should be very smooth and fast.
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 27 / 29 Next steps, long term Glue Network Service (NE) Job monitoring Collective services (RB, LB, ROS, RLS,…) We are thinking to evaluate different technologies that can help us to improve our tool; we have in mind, for example, OGSA and JINI. Major issue: distribution of the monitoring DB to get more scalability Authentication/authorization Remote system management Better analysis of monitoring date and their presentation Personal monitoring
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 28 / 29 Packaging/installation Server side Linux Red Hat 7.3 PostgreSQL PHP (GDLibv2) JpGraph 1.2 Nagios
DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 29 / 29 Packaging/installation (2) Server side: Guide to installation and configuration Client side: Packaged rpms to install on the Head Node and on the Worker Nodes (fmon server/agents) Replacing the Glue-CE.schema file on the Head Node (GLUE+)