EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 1 WP4 demonstration Fabric Monitoring and Fault Tolerance Sylvain Chapeland Lord Hess.

Presentation transcript:

n° 1 WP4 demonstration: Fabric Monitoring and Fault Tolerance. Sylvain Chapeland, Lord Hess

n° 2 Fabric Monitoring and Fault Tolerance in the global picture
[Diagram labels: Workload Management (WP1), Data Management (WP2), Information Service (WP3), Fabric Management (WP4), Storage Element (WP5), Networking (WP7)]

n° 3 Outline
- System architecture
  - Fabric Monitoring
  - Fault Tolerance
- Demonstration
  - Hardware setup
  - Use case
- Summary
- Questions

n° 4 Fabric Monitoring architecture
[Diagram labels: Monitored nodes, Sensor, Monitoring Sensor Agent (MSA), Cache, Measurement Repository (MR), Database, Consumer]

n° 5 Fault Tolerance architecture
[Diagram labels: Local Node, Sensor, MSA, Cache, Fault Tolerance daemon (FTd), Decision unit, Rules, Actuator agent, Actuator, monitoring]

n° 6 Demonstration setup
[Diagram labels: Laptop, Slides, Beamer 1, Beamer 2, Monitoring data, Shells, Monitored node, Server node, MSA, FTd, MR, FT Rule editor]

n° 7 Demonstration
- Use case based on daemon monitoring
- Fabric Monitoring
  - Check a daemon status with the monitoring system while killing and restarting it
- Fault Tolerance
  - Edit a rule to restart the daemon automatically
  - Kill the daemon while following its status in monitoring

n° 8 The MSA monitors a daemon's status. The information is propagated to the repository and to consumers.
[Diagram labels: Monitored node (daemon, MSA), Server node (MR); steps: Check ok, Transport, Store, Notify; Daemon status: "Daemon ok"; Status display: consumer application connected to repository]

n° 9 When the daemon is killed, the MSA updates the daemon status in the repository. Consumers are notified of the new metric value.
[Diagram labels: Monitored node (daemon, MSA), Server node (MR); Shell: Kill; steps: Check not ok, Transport, Store, Notify; Daemon status: "Daemon dead"; Status display: consumer application connected to repository]
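The check, transport, store and notify steps from slides 8 and 9 can be sketched in a few lines of Python. This is only an illustration of the flow: the class names `Repository` and `Agent`, the `(node, metric, value)` sample layout, the node name `node01` and the choice to notify consumers only when a value changes are all assumptions, not the EDG WP4 interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical sample layout: (node, metric, value). Not the real WP4 format.
Sample = Tuple[str, str, str]


@dataclass
class Repository:
    """Stand-in for the Measurement Repository (MR)."""
    latest: Dict[Tuple[str, str], str] = field(default_factory=dict)
    consumers: List[Callable[[Sample], None]] = field(default_factory=list)

    def subscribe(self, consumer: Callable[[Sample], None]) -> None:
        self.consumers.append(consumer)

    def store(self, sample: Sample) -> None:
        node, metric, value = sample
        changed = self.latest.get((node, metric)) != value
        self.latest[(node, metric)] = value          # Store
        if changed:                                  # Notify (assumed: only on change)
            for consumer in self.consumers:
                consumer(sample)


@dataclass
class Agent:
    """Stand-in for the Monitoring Sensor Agent (MSA) on a monitored node."""
    node: str
    repository: Repository

    def poll(self, metric: str, check: Callable[[], bool]) -> None:
        value = "ok" if check() else "dead"          # Check
        self.repository.store((self.node, metric, value))  # Transport


# Simulated daemon and a status display that just collects notifications.
daemon = {"running": True}
seen: List[Sample] = []


def daemon_is_running() -> bool:
    return daemon["running"]


mr = Repository()
mr.subscribe(seen.append)
msa = Agent("node01", mr)

msa.poll("daemon_status", daemon_is_running)   # daemon up
daemon["running"] = False                      # the "Kill" issued from the shell
msa.poll("daemon_status", daemon_is_running)   # daemon down, consumers notified
msa.poll("daemon_status", daemon_is_running)   # unchanged, no new notification
```

In this sketch the status display receives two notifications, one for "ok" and one for "dead", while the repository always holds the latest value.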

n° 10 A manual operation is required to get back to normal status.
[Diagram labels: Monitored node (MSA), Server node (MR); Shell: Relaunch daemon; steps: Check ok, Transport, Store, Notify; Daemon status: "Daemon ok"; Status display: consumer application connected to repository]

n° 11 A rule is added to automatically restart the daemon when it is dead.
[Diagram labels: Monitored node (FTd), Server node (Rule editor), Web browser; links: HTTP, rule, Transport; Rule editor accessed by web browser]

n° 12 When the daemon is killed, the FTd is notified and triggers the recovery action specified in the rule.
[Diagram labels: Monitored node (daemon, MSA, FTd, rule), Server node (MR); Shell: Kill; steps: Check, Transport, Store, Notify, Relaunch; Daemon status: "Daemon ok", "Daemon dead", "Daemon ok"; Status display: consumer application connected to repository]

n° 13 Recovery actions are also fed back to the monitoring.
[Diagram labels: Monitored node (daemon, MSA, FTd), Server node (MR); steps: Transport, Store, Notify; entries: "Daemon ok", "Daemon dead", Log: "Daemon restarted"; Log viewer: consumer application connected to repository]
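The rule-driven recovery of slides 12 and 13 can be sketched as a small Python example. The slides do not show the FTd rule format, so the `Rule` fields, the `FaultToleranceDaemon` class and the `report` callback used to feed the action back to monitoring are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical rule: when `metric` takes `trigger_value`, run `action`."""
    metric: str
    trigger_value: str
    action: Callable[[], None]
    description: str


class FaultToleranceDaemon:
    """Stand-in for the FTd: a monitoring consumer that applies rules."""

    def __init__(self, rules: List[Rule], report: Callable[[str], None]) -> None:
        self.rules = list(rules)
        self.report = report  # assumed path back into the monitoring system

    def notify(self, metric: str, value: str) -> None:
        for rule in self.rules:
            if rule.metric == metric and rule.trigger_value == value:
                rule.action()                  # recovery action (actuator)
                self.report(rule.description)  # fed back to monitoring


# Simulated daemon that has just been killed, plus a list acting as the log.
daemon = {"running": False}
monitoring_log: List[str] = []


def relaunch() -> None:
    daemon["running"] = True


ftd = FaultToleranceDaemon(
    [Rule("daemon_status", "dead", relaunch, "Daemon restarted")],
    report=monitoring_log.append,
)

ftd.notify("daemon_status", "dead")  # rule matches: daemon is relaunched
ftd.notify("daemon_status", "ok")    # no rule matches: nothing happens
```

Keeping the rule as plain data separate from the daemon mirrors the demonstration, where the rule is edited in a web-based rule editor and then handed to the FTd.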

n° 14 Metric history is available in the measurement repository.
[Diagram labels: Monitored node (MSA), Server node (MR), Web browser, HTTP; History on web browser]
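A repository that keeps metric history, as on slide 14, could look roughly like the following in-memory Python sketch. The `MetricHistory` class, its method names and the timestamps are invented for illustration. The slides do not describe the real repository's storage or query interface.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# Hypothetical data point: (timestamp, value).
Point = Tuple[float, str]


class MetricHistory:
    """Keeps every value of a metric, not only the latest one."""

    def __init__(self) -> None:
        self._series: DefaultDict[Tuple[str, str], List[Point]] = defaultdict(list)

    def record(self, node: str, metric: str, timestamp: float, value: str) -> None:
        self._series[(node, metric)].append((timestamp, value))

    def history(self, node: str, metric: str, since: float = 0.0) -> List[Point]:
        # Return the points recorded at or after `since`, in insertion order.
        return [p for p in self._series.get((node, metric), []) if p[0] >= since]


store = MetricHistory()
store.record("node01", "daemon_status", 10.0, "ok")
store.record("node01", "daemon_status", 20.0, "dead")
store.record("node01", "daemon_status", 30.0, "ok")

recent = store.history("node01", "daemon_status", since=20.0)
```

A web front end such as the one in the demonstration would sit on top of a query like `history`.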

n° 15 Summary
- Monitoring system to get live status of a node
  - Centralization of data
  - Measurements available remotely
- Fault Tolerance as monitoring data consumer
  - Rule editing for recovery actions
  - Automatic actions taken according to monitoring status
- Deployment status
  - Monitoring agent runs in production mode on ~1000 nodes in the CERN computer center
  - Will be available in the next EDG release