1
GridMonitor: Integration of Large Scale Facility Monitoring With MDS
Richard Baker, Antonio Chan, Jason Smith, Dantong Yu
USATLAS/RHIC Computing Facility, Brookhaven National Lab
2
6/27/2015 CHEP 03, La Jolla
Outline
  Requirements
  System Framework, Structure and Characteristics
  I: Ganglia and Its Information Provider
  II: Archive and Its Information Provider
  Gridview, Front End System: http://heppc1.uta.edu/atlas/grid-status/mds.gremlin.usatlas.bnl.gov.html
  Current Status and Future Work
3
Requirements
  Modularity and Extensibility: Make Use of Existing Monitoring Pieces
  Flexibility: Adjustable to the Dynamics of the Monitored Systems
  Overhead: Non-intrusive
  Scalability
  Security, Consistency, Inter-operability, Etc-bility
4
What Needs to Be Monitored
Linux Farm Monitoring
  Description: About 1100 Dual-CPU Linux Nodes. Performance Data Must Be Summarized for Advertising to the Grid.
  Performance Events Required:
    Configuration Information
    Status Information: CPU Load (1, 5, 10, 15), Memory Load, Disk Load, and Network Load
  Example Usage: A Resource Broker Might Ask About the Availability of Linux Farm System Resources in Order to Plan the Efficient Execution of Tasks
5
More…
Network Monitoring
  Description: 8 USATLAS Testbeds
    Publish the Connectivity of These Testbeds and Monitor the Health of the USATLAS Network
    Archived Performance Data Can Be Used to Predict Network Behavior
    A User Can Choose the Source and Destination for File Replication
  Performance Events Required: Bandwidth, Delay (Round Trip Time), Trace Route
6
Monitoring Framework
[Diagram of the monitoring framework. Labels: Monitoring Database (ODBC+MySQL) or RRD DB; Info. Providers; Data Collectors; Aggregate Service Index (GIIS); Grid-View (Web Server); Information Provider (GRIS), shown four times; Grid-info-search Server; HPSS; Network; Computing Nodes; Sensor]
7
Monitoring System Components
Four-Tier Structure
  Sensors
    Host: Ganglia, top, /proc and LSF Host Load
  Archive System (Database System)
    Round Robin Database (RRD)
    Relational Database: unixODBC + MyODBC + MySQL Database
  Information Providers
    Monitoring and Discovery Service (MDS 2.2), GLUE Schema, Customized Ganglia Client Tool Reporting the Latest Monitoring Data, and Database Client Tools Reporting the Summary Information
  Front-end Browsing System
    Gridview (Grid Visualization Tool Developed at Univ. of Texas at Arlington)
8
Advantages
  The Information Provider Provides a Cache for the Newest Values From the MySQL Database
  Non-intrusiveness: The Information Provider Can Eliminate Random User Accesses to the Database Server
  Scalability Can Be Significantly Increased
    1000 Linux Nodes Are Being Monitored
    Network Connectivity of Eight USATLAS Testbeds: Each Site Monitors the Paths From Itself to the Other Seven. Network Topology and Traffic Can Be Easily Constructed
  Flexibility: Independent of Sensors. Many Sensors Can Be Easily Plugged In As Long As Each Has a Well-Defined Protocol and API: We Could Switch Among Ganglia, top, /proc
  The Archive System Is Independent of the Underlying Database: It Can Be an RDBMS (Oracle, MySQL, Sybase, Informix), Flat Files, or Objectivity, As Long As an ODBC Driver Is Available
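The caching idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the facility's implementation: the class name `CachedProvider`, the `fetch` callback standing in for a database query, and the injectable `clock` are all hypothetical. The 600-second value mirrors the `cachetime: 600` setting shown later in the deck.

```python
import time
from typing import Callable, Dict, Tuple


class CachedProvider:
    """Hypothetical sketch: serve a cached value until it is older than cachetime."""

    def __init__(self, fetch: Callable[[str], dict], cachetime: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch            # stand-in for a query against the archive DB
        self._cachetime = cachetime    # seconds a cached value stays fresh
        self._clock = clock            # injectable so the sketch can be tested
        self._cache: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._cachetime:
            return hit[1]              # fresh: the database is not touched
        value = self._fetch(key)       # stale or missing: exactly one query
        self._cache[key] = (now, value)
        return value


# Usage: many client searches within 600 s cause a single database query.
queries = []
provider = CachedProvider(
    fetch=lambda name: queries.append(name) or {"MdsQueueAverageLoad": 0.42},
    cachetime=600,
)
for _ in range(5):
    provider.get("1000")
print(len(queries))
```

The point of the design is that client searches hit the provider's cache, so the load on the database server depends on the refresh interval rather than on how many users are querying.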
9
I: Ganglia Monitoring with MDS
Ganglia Information Provider
  Front-end: GLUE Schema, http://www.cnaf.infn.it/~sergio/datatag/glue/
  Back-end: XML
[Diagram. Labels: Cluster A; Multicast Channel; Gmond; XML; Gmetad (filtered), shown twice; … ?; MDS; Ganglia IP; XML; GLUE; Layered Gmetad]
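As a rough illustration of the XML back-end and GLUE front-end named on this slide, the sketch below parses a Ganglia-style XML document and prints one LDIF entry per cluster. It is not the actual Ganglia information provider: the sample XML (host names, addresses, values) is invented, the function names are hypothetical, and only the GLUE attributes that appear in the grid-info-search output in this deck are emitted.

```python
import xml.etree.ElementTree as ET

# Invented sample in the CLUSTER/HOST/METRIC layout that gmond reports.
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.1" SOURCE="gmond">
  <CLUSTER NAME="ATLAS Linux Cluster">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.50" TYPE="float"/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="1.50" TYPE="float"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def clusters_to_ldif(xml_text, suffix="mds-vo-name=local, o=grid"):
    """Return one GlueCluster LDIF entry per CLUSTER element."""
    root = ET.fromstring(xml_text)
    entries = []
    for cluster in root.iter("CLUSTER"):
        name = cluster.get("NAME")
        entries.append("\n".join([
            f"dn: cl={name}, {suffix}",
            "objectClass: GlueClusterTop",
            "objectClass: GlueCluster",
            f"GlueClusterName: {name}",
        ]))
    return "\n\n".join(entries)


def host_loads(xml_text, metric="load_one"):
    """Return {host name: metric value} for every HOST that reports the metric."""
    root = ET.fromstring(xml_text)
    loads = {}
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                loads[host.get("NAME")] = float(m.get("VAL"))
    return loads


print(clusters_to_ldif(SAMPLE_XML))
print(host_loads(SAMPLE_XML))
```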
10
I: Ganglia Monitoring with MDS
gremlin % grid-info-search -x -h spider.usatlas.bnl.gov -s one

# ATLAS Linux Cluster, local, grid
dn: cl=ATLAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: ATLAS Linux Cluster
GlueClusterUniqueID: ATLAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# PHOBOS CAS Linux Cluster, local, grid
dn: cl=PHOBOS CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: PHOBOS CAS Linux Cluster
GlueClusterUniqueID: PHOBOS_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# STAR CAS Linux Cluster, local, grid
dn: cl=STAR CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: STAR CAS Linux Cluster
GlueClusterUniqueID: STAR_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute
11
II: Farm Monitoring
  The Linux Farm Is Divided Into Different Sub-clusters Based on Site Policy, Different Experiments, OS and Version, and CPU Speed. A Sub-cluster Contains Hosts With the Same Configuration
  The BNL ATLAS Farm Is Partitioned Into Five Sub-clusters: CPU 400 MHz, CPU 700 MHz, CPU 1 GHz, CPU 1.4 GHz and CPU 2.4 GHz
  The Status Information of a Sub-cluster Is Summarized From All Nodes in That Sub-cluster
  The Grid Resource Broker Schedules at the Level of Farm Sub-clusters
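The per-sub-cluster summarization described on this slide might look roughly like the following. This is a sketch under assumptions: the node records, their field names (`cpu_mhz`, `ncpu`, `load_5`) and the function name `summarize` are hypothetical, and grouping by CPU speed alone is a simplification of the site policy mentioned above. The output keys reuse attribute names from the Queue-Info schema in this deck.

```python
from collections import defaultdict
from statistics import mean


def summarize(nodes):
    """Group nodes by CPU speed and summarize each group (hypothetical sketch)."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node["cpu_mhz"]].append(node)

    summary = {}
    for speed, members in sorted(groups.items()):
        summary[speed] = {
            "MdsQueueNumberOfCpu": sum(m["ncpu"] for m in members),
            "MdsQueueSpeed": speed,
            "MdsQueueAverageLoad": round(mean(m["load_5"] for m in members), 2),
        }
    return summary


# Invented node records: two dual-CPU 1 GHz hosts and one 2.4 GHz host.
nodes = [
    {"host": "node01", "cpu_mhz": 1000, "ncpu": 2, "load_5": 0.5},
    {"host": "node02", "cpu_mhz": 1000, "ncpu": 2, "load_5": 1.5},
    {"host": "node03", "cpu_mhz": 2400, "ncpu": 2, "load_5": 2.0},
]
for speed, info in summarize(nodes).items():
    print(speed, info)
```

A resource broker reading such summaries sees one record per sub-cluster instead of one per node, which keeps the advertised data small for a farm of this size.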
12
Information Schema (Linux Farm Monitoring)
Queue-Info:
  objectclass ( 1.3.6.1.4.1.3536.2.6.0.0.0.0
    NAME 'Queue-Info'
    SUP 'Mds'
    STRUCTURAL
    MUST ( MdsQueueNumberOfCpu $ MdsQueueSpeed $ MdsQueueAverageLoad $ MdsQueueAverageUserPercent $ MdsQueueAverageSysPercent ) )
Needs to Be Replaced by the GLUE Schema
13
Backend Data Structure
Node Status Information
mysql> describe node_load;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| load_index | int(10) unsigned |      | PRI | NULL    | auto_increment |
| sampletime | timestamp(14)    | YES  | MUL | NULL    |                |
| machine_id | varchar(31)      |      |     |         |                |
| owner      | varchar(8)       |      |     |         |                |
| load_5     | float(10,2)      |      |     | 0.00    |                |
| user_cpu   | float(10,2)      |      |     | 0.00    |                |
| sys_cpu    | float(10,2)      |      |     | 0.00    |                |
+------------+------------------+------+-----+---------+----------------+
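To make the table concrete, here is one way a client tool could read the newest sample per machine from `node_load`. It is a sketch only: an in-memory SQLite database stands in for the MySQL/ODBC archive, column types are approximated, the rows are invented, and the helper names `make_demo_db` and `latest_loads` are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for node_load with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE node_load (
            load_index INTEGER PRIMARY KEY AUTOINCREMENT,
            sampletime TEXT,   -- 14-digit YYYYMMDDHHMMSS, like timestamp(14)
            machine_id TEXT,
            owner      TEXT,
            load_5     REAL DEFAULT 0.0,
            user_cpu   REAL DEFAULT 0.0,
            sys_cpu    REAL DEFAULT 0.0
        )""")
    rows = [
        ("20030324120000", "node01", "atlas", 0.50, 20.0, 2.0),
        ("20030324121000", "node01", "atlas", 1.50, 60.0, 5.0),
        ("20030324121000", "node02", "atlas", 0.25, 10.0, 1.0),
    ]
    conn.executemany(
        "INSERT INTO node_load "
        "(sampletime, machine_id, owner, load_5, user_cpu, sys_cpu) "
        "VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def latest_loads(conn):
    """Return (machine_id, load_5, user_cpu, sys_cpu) for each machine's newest row."""
    sql = """
        SELECT n.machine_id, n.load_5, n.user_cpu, n.sys_cpu
        FROM node_load AS n
        JOIN (SELECT machine_id, MAX(sampletime) AS t
              FROM node_load GROUP BY machine_id) AS last
          ON n.machine_id = last.machine_id AND n.sampletime = last.t
        ORDER BY n.machine_id
    """
    return conn.execute(sql).fetchall()


for row in latest_loads(make_demo_db()):
    print(row)
```

The query relies on the fixed-width timestamp sorting correctly as text; against the real archive the same join would run through whatever ODBC driver is configured.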
14
Information Provider (Linux Farm Monitoring)
# generate Farm information every 10 minutes
dn: MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: /usr/local/globus-new/customize
base: mds-farm-batch-info.pl
args: -dn MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid -ttl 900
cachetime: 600
timelimit: 20
sizelimit: 400
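The configuration above runs an external program with `-dn` and `-ttl` arguments and publishes what it prints. The sketch below shows the general shape such a program could take, written in Python rather than Perl. It is not `mds-farm-batch-info.pl`: the function names and the sample statistics are invented, the attribute list is taken from the Queue-Info schema in this deck, and the TTL is only echoed as an LDIF comment because the original script's handling of it is not shown.

```python
import argparse

# Attributes required by the Queue-Info object class shown in this deck.
QUEUE_ATTRS = (
    "MdsQueueNumberOfCpu",
    "MdsQueueSpeed",
    "MdsQueueAverageLoad",
    "MdsQueueAverageUserPercent",
    "MdsQueueAverageSysPercent",
)


def parse_args(argv):
    """Accept the same -dn and -ttl options used in the args: line."""
    parser = argparse.ArgumentParser(description="Farm queue info provider (sketch)")
    parser.add_argument("-dn", required=True, help="distinguished name to publish")
    parser.add_argument("-ttl", type=int, default=900, help="time to live, seconds")
    return parser.parse_args(argv)


def render_entry(dn, stats, ttl):
    """Return one LDIF entry; refuse to emit it if a required attribute is missing."""
    missing = [a for a in QUEUE_ATTRS if a not in stats]
    if missing:
        raise ValueError("missing attributes: " + ", ".join(missing))
    lines = [f"# valid for {ttl} seconds", f"dn: {dn}", "objectclass: Queue-Info"]
    lines += [f"{attr}: {stats[attr]}" for attr in QUEUE_ATTRS]
    return "\n".join(lines) + "\n"


# Demo with an explicit argument list and invented statistics.
args = parse_args(["-dn", "MdsFarmQueueName=1000,Mds-Vo-name=local,o=grid", "-ttl", "900"])
demo_stats = {
    "MdsQueueNumberOfCpu": 4,
    "MdsQueueSpeed": 1000,
    "MdsQueueAverageLoad": 1.0,
    "MdsQueueAverageUserPercent": 40.0,
    "MdsQueueAverageSysPercent": 3.5,
}
print(render_entry(args.dn, demo_stats, args.ttl), end="")
```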
15
Observation from Grid-View
16
Current Status and Future Work
Current Status:
  Sensors & Local Monitoring Tools Put Less Than 1 Percent CPU Load: Non-intrusive
  Improved the Ganglia Information Provider: It Can Obtain Information From Both Gmond and Gmetad
  Multiple & Hierarchical Clusters Are Supported
Future Work:
  Merge the Ganglia RRD Information Provider and the Archive DB Information Provider
  Work With the Ganglia Team and GLUE Schema to Help Define Requirements for What Information Should Be Monitored for Job Scheduling
  Automate the Mapping From XML to GLUE Schema, Provide Flexibility
  Continue to Optimize the Information Provider to Deliver Data Faster
  Scalability Test
  Extend This Prototype to Other Facility Monitoring