October 27, 2015 Atlas Monitoring Infrastructure in Grid Environment Richard Baker Dantong Yu Brookhaven National Lab.

Presentation transcript:


What needs to be monitored

Linux Farm Monitoring
Description:
- 800 Linux nodes.
- Advertise farm information for grid-level scheduling.
- Performance data must be summarized before it is advertised to the Grid.
Performance events required:
- Configuration information.
- Status information: CPU load (5, 10, and 15 minutes), memory load, disk load, and network load.
Example usage: A resource broker might query the availability of Linux farm system resources in order to plan the efficient execution of tasks.

More...

Network Monitoring
Description:
- 8 USATLAS testbeds.
- Publish the connectivity of these testbeds and monitor the health of the USATLAS network.
- Archived performance data can be used to predict network behavior, so that a user can choose the source and destination for file replication.
Performance events required:
- Bandwidth, delay (round-trip time), traceroute.
Requirements: Interface, Overhead, Scalability, Security, Archive, Consistency.
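The slide does not say how archived data is turned into a prediction. As a minimal sketch only, the following assumes a simple moving average over the most recent bandwidth samples and picks the source with the highest predicted bandwidth to a destination. The function names, the archive layout, and the site names are hypothetical, not part of the original framework.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical archive layout: (source, destination) -> bandwidth samples in Mbit/s,
# oldest first.
Archive = Dict[Tuple[str, str], List[float]]


def predict_bandwidth(history: List[float], window: int = 5) -> float:
    """Predict the next bandwidth value as the mean of the last `window` samples."""
    if not history:
        raise ValueError("no archived samples to predict from")
    return mean(history[-window:])


def choose_source(archive: Archive, destination: str, window: int = 5) -> str:
    """Return the source site with the highest predicted bandwidth to `destination`."""
    candidates = {
        src: predict_bandwidth(samples, window)
        for (src, dst), samples in archive.items()
        if dst == destination and samples
    }
    if not candidates:
        raise LookupError(f"no archived measurements towards {destination!r}")
    return max(candidates, key=candidates.get)
```

A moving average is only one possible predictor; it is used here because it needs nothing beyond the archived samples themselves.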

Monitoring System Components

Four-tier structure:
- Sensors. Implementation: Unix commands (top), /proc, and LSF host load.
- Archive System (database system). Implementation: unixODBC + MyODBC + MySQL database.
- Information Providers. Globus 2.0 Beta, MDS 2.1.
- Front-end browsing system. GridView (Grid visualization tool developed at UTA).
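To illustrate the sensor tier, here is a minimal sketch of a /proc-based load sensor. It relies on the standard Linux /proc/loadavg format, whose first three fields are the 1-, 5-, and 15-minute load averages. The names LoadSample, parse_loadavg, and read_loadavg are hypothetical; the original sensors are not shown in the slides.

```python
from typing import NamedTuple


class LoadSample(NamedTuple):
    load_1: float
    load_5: float
    load_15: float


def parse_loadavg(text: str) -> LoadSample:
    """Parse the contents of /proc/loadavg into three load averages."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError(f"unexpected /proc/loadavg content: {text!r}")
    return LoadSample(float(fields[0]), float(fields[1]), float(fields[2]))


def read_loadavg(path: str = "/proc/loadavg") -> LoadSample:
    """Read and parse the load averages of the local Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

Keeping the parser separate from the file read lets the same code be exercised on archived text without access to a live node.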

ATLAS Monitoring Framework

[Architecture diagram. Labels recovered from the figure: LSF Grid Cluster (LSF Server 1, LSF Server 2); Gatekeeper; Job manager; Information Provider (GRIS), fed by top, LSF load and NSW; Monitoring Database (ODBC + MySQL or Oracle); DB Info. Providers; Data Collectors; Row to object class; Aggregate Service Index (GIIS); Register; Grid-View (UTA Web Server); grid-info-search; Network Sensor (Iperf, GridFTP); Server; HPSS Sensor.]

Advantages

- The Information Provider caches the newest values from the MySQL database.
- Non-intrusiveness: the Information Provider eliminates random user accesses to the database server.
- Scalability can be significantly increased. Currently monitored:
  - 800 Linux nodes.
  - Network connectivity of eight USATLAS testbeds.
- Flexibility: independent of sensors. Many sensors can be plugged in easily, as long as each has a well-defined protocol and API.
- The archive system is independent of the underlying database: it can be an RDBMS (Oracle, MySQL, Sybase, Informix), flat files, or Objectivity, as long as an ODBC driver is available.
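The caching behaviour described above can be sketched as a small time-based cache in front of the database. This is an illustration under stated assumptions, not the MDS implementation: CachingProvider and its fetch callback are hypothetical, and the 600-second default mirrors the cachetime value in the provider configuration shown later in the slides.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProvider:
    """Serve repeated queries from memory and hit the database only when stale."""

    def __init__(self, fetch: Callable[[str], dict], cachetime: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical callback that queries the database
        self._cachetime = cachetime  # seconds a cached value stays valid
        self._clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._cachetime:
            return hit[1]            # fresh enough: no database access
        value = self._fetch(key)     # stale or missing: one database access
        self._cache[key] = (now, value)
        return value
```

With this arrangement the database load depends on the number of distinct keys and the cache lifetime rather than on how many users query the provider.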

Level of Farm Monitoring

- The Linux farm is divided into sub-clusters based on site policy, experiment, OS and version, and CPU speed.
- A sub-cluster contains hosts with the same configuration.
- The BNL ATLAS farm is partitioned into four sub-clusters: CPU200MHz, CPU400MHz, CPU700MHz, and CPU1GHz.
- The status information of a sub-cluster is summarized from all nodes in that sub-cluster.
- The Grid resource broker schedules at the level of farm sub-clusters.
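The per-sub-cluster summary can be sketched as a group-and-average step. This is a minimal sketch assuming plain averages; the slides do not state the exact summary formula. The key names no_cpu, load_5, user_cpu, and sys_cpu follow the backend table columns, while the subcluster key and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def summarize_subclusters(nodes: Iterable[dict]) -> Dict[str, dict]:
    """Group node records by sub-cluster and summarize their status."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for node in nodes:
        groups[node["subcluster"]].append(node)

    summary = {}
    for name, members in groups.items():
        summary[name] = {
            "number_of_cpu": sum(n["no_cpu"] for n in members),
            "average_load": mean(n["load_5"] for n in members),
            "average_user_percent": mean(n["user_cpu"] for n in members),
            "average_sys_percent": mean(n["sys_cpu"] for n in members),
        }
    return summary
```

The resulting fields line up with the kind of per-queue values a broker would read, such as CPU count and average load.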

Information Schema (Linux Farm Monitoring)

Queue-Info:

    objectclass ( NAME 'Queue-Info' SUP 'Mds' STRUCTURAL
        MUST ( MdsQueueNumberOfCpu $
               MdsQueueSpeed $
               MdsQueueAverageLoad $
               MdsQueueAverageUserPercent $
               MdsQueueAverageSysPercent ) )
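As an illustration of how an information provider might emit an entry of this object class, the sketch below renders the five required attributes as LDIF-style text. The function name and the sample values are hypothetical, and the code handles only simple printable values; it does not implement LDIF line folding or base64 encoding.

```python
REQUIRED_QUEUE_ATTRS = (
    "MdsQueueNumberOfCpu",
    "MdsQueueSpeed",
    "MdsQueueAverageLoad",
    "MdsQueueAverageUserPercent",
    "MdsQueueAverageSysPercent",
)


def render_queue_info(dn: str, attrs: dict) -> str:
    """Render a Queue-Info entry, refusing entries that lack a MUST attribute."""
    missing = [name for name in REQUIRED_QUEUE_ATTRS if name not in attrs]
    if missing:
        raise ValueError(f"missing required attributes: {', '.join(missing)}")
    lines = [f"dn: {dn}", "objectclass: Queue-Info"]
    lines += [f"{name}: {attrs[name]}" for name in REQUIRED_QUEUE_ATTRS]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(render_queue_info(
        "MdsFarmQueueName=1000, Mds-Vo-name=local, o=grid",
        {
            "MdsQueueNumberOfCpu": 64,
            "MdsQueueSpeed": 1000,
            "MdsQueueAverageLoad": 0.75,
            "MdsQueueAverageUserPercent": 42.0,
            "MdsQueueAverageSysPercent": 3.5,
        },
    ))
```

Checking the MUST attributes before printing keeps a schema violation from reaching the directory server.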

Information Schema (Linux Farm Monitoring)

Host-Info:

    objectclass ( NAME 'Host-Info' SUP 'Mds' STRUCTURAL
        MUST ( MdsNodeAddress $
               MdsHostNodeName $
               MdsHostNodeDomainName $
               MdsNetMacAddr )
        MAY  ( MdsHostVendor $
               MdsCpuVendor $
               MdsCpuSmpSize $
               MdsOsName $
               MdsOsKernelVersion $
               MdsMemoryRamSizeMB $
               MdsMemoryVmSizeMB $
               MdsTimeFrom $
               MdsCpuLoad5min $
               MdsCpuUser15min $
               MdsCpuSystem15min ) )

Backend Data Structure (Linux Farm Monitoring)

Node Configuration Information

    mysql> describe machines;
    +----------+-------------+------+-----+---------+-------+
    | Field    | Type        | Null | Key | Default | Extra |
    +----------+-------------+------+-----+---------+-------+
    | nodename | varchar(15) |      | PRI |         |       |
    | domain   | varchar(30) | YES  |     | NULL    |       |
    | address  | varchar(15) | YES  |     | NULL    |       |
    | macaddr  | varchar(12) | YES  |     | NULL    |       |
    | brand    | varchar(20) | YES  |     | NULL    |       |
    | hw_type  | varchar(20) | YES  |     | NULL    |       |
    | cpu_id   | tinyint(2)  | YES  |     | NULL    |       |
    | no_cpu   | tinyint(2)  | YES  |     | NULL    |       |
    | os       | varchar(20) | YES  |     | NULL    |       |
    | kernel   | varchar(20) | YES  |     | NULL    |       |
    | memory   | int(5)      | YES  |     | NULL    |       |
    | swap     | int(5)      | YES  |     | NULL    |       |
    | home     | int(5)      | YES  |     | NULL    |       |
    +----------+-------------+------+-----+---------+-------+

Backend Data Structure

Node Status Information

    mysql> describe node_load;
    +------------+------------------+------+-----+---------+----------------+
    | Field      | Type             | Null | Key | Default | Extra          |
    +------------+------------------+------+-----+---------+----------------+
    | load_index | int(10) unsigned |      | PRI | NULL    | auto_increment |
    | sampletime | timestamp(14)    | YES  | MUL | NULL    |                |
    | machine_id | varchar(31)      |      |     |         |                |
    | owner      | varchar(8)       |      |     |         |                |
    | load_5     | float(10,2)      |      |     | 0.00    |                |
    | user_cpu   | float(10,2)      |      |     | 0.00    |                |
    | sys_cpu    | float(10,2)      |      |     | 0.00    |                |
    +------------+------------------+------+-----+---------+----------------+
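To show how an information provider might read the newest sample per node from this table, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the MySQL/ODBC backend, so the column types are approximations of the ones above. The function names and the query are assumptions, not the original provider code.

```python
import sqlite3
from typing import List, Tuple


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the node_load table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE node_load (
            load_index INTEGER PRIMARY KEY AUTOINCREMENT,
            sampletime TEXT,
            machine_id TEXT NOT NULL,
            owner      TEXT NOT NULL,
            load_5     REAL NOT NULL DEFAULT 0.0,
            user_cpu   REAL NOT NULL DEFAULT 0.0,
            sys_cpu    REAL NOT NULL DEFAULT 0.0
        )
        """
    )
    return conn


def latest_load(conn: sqlite3.Connection) -> List[Tuple[str, str, float]]:
    """Return (machine_id, sampletime, load_5) for each node's newest sample."""
    return conn.execute(
        """
        SELECT n.machine_id, n.sampletime, n.load_5
        FROM node_load AS n
        JOIN (SELECT machine_id, MAX(sampletime) AS latest
              FROM node_load
              GROUP BY machine_id) AS m
          ON n.machine_id = m.machine_id AND n.sampletime = m.latest
        ORDER BY n.machine_id
        """
    ).fetchall()
```

The timestamps are stored here as text in year-first order so that MAX(sampletime) selects the most recent sample.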

Information Provider (Linux Farm Monitoring)

    # generate Farm information every 10 minutes
    dn: MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid
    objectclass: GlobusTop
    objectclass: GlobusActiveObject
    objectclass: GlobusActiveSearch
    type: exec
    path: /usr/local/globus-new/customize
    base: mds-farm-batch-info.pl
    args: -dn MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid -ttl 900
    cachetime: 600
    timelimit: 20
    sizelimit: 400

DEMO

status/mds.gremlin.usatlas.bnl.gov.html

Observation from Grid-View

Future work

- Work with the PPDG monitoring group and GGF; install “recommended” sensors on the ATLAS Monitoring Framework.
- Identify what should be monitored for the Grid resource broker, and deploy tools.
- Optimize the backend data structure.
- Scalability test: the system should be able to handle 1600 Linux nodes.
- Extend this prototype to other facility monitoring.