DataGrid is a project funded by the European Union 22 September 2003 – n° 1 EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance



8 July 2003 – Towards automation of computing fabrics – n° 2
Talk Outline
- WP4 objective and partners
- Automated management of large clusters
- Fabric Monitoring
- Fabric Fault Tolerance

n° 3 WP4 objective and partners
“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”
- User job management (Grid and local)
- Automated management of large clusters
- 6 partners: CERN, NIKHEF, ZIB, KIP, PPARC, INFN.
- The development work is divided into 6 subtasks: Configuration Mgt, Installation Mgt, Monitoring, Fault Tolerance, Resource Mgt, Gridification.

n° 4 Architecture logical overview
[Diagram labels]
- Users: Grid User, Local User
- Farms: Farm A (LSF), Farm B (PBS)
- Fabric mgt subsystems: Installation & Node Mgmt, Configuration Management, Monitoring & Fault Tolerance, Fabric Gridification, Resource Management
- Other services: Grid Info Services, Resource Broker, Data Mgmt, Grid Data Storage
- (Mass storage, Disk pools)

n° 5 Architecture logical overview
[Diagram: architecture overview as on slide n° 4]
Fabric Gridification:
- Interface between Grid-wide services and local fabric.
- Provides local authentication, authorization and mapping of grid credentials.

n° 6 Architecture logical overview
[Diagram: architecture overview as on slide n° 4, with Data Mgmt labelled (WP2)]
Resource Management:
- Provides transparent access (both job and admin) to different cluster batch systems.
- Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting).

n° 7 Architecture logical overview
[Diagram: architecture overview as on slide n° 4, with Grid Data Storage labelled (WP5)]
Installation & Node Mgmt:
- Provides the tools to install and manage all software running on the fabric nodes.
- Agent to install, upgrade, remove and configure software packages on the nodes.
- Bootstrap services and software repositories.

n° 8 Architecture logical overview
[Diagram: architecture overview as on slide n° 4]
Configuration Management:
- Provides central storage and management of all fabric configuration information.
- Central DB and set of protocols and APIs to store and retrieve information.

n° 9 Architecture logical overview
[Diagram: architecture overview as on slide n° 4; legend: WP4 subsystems, other WPs]
Monitoring & Fault Tolerance:
- Provides the tools for gathering monitoring information on fabric nodes.
- Central measurement repository stores all monitoring information.
- Fault tolerance correlation engines detect failures and trigger recovery actions.

n° 10 Fabric Monitoring architecture
[Diagram labels]
- Monitored nodes: Sensors, Monitoring Sensor Agent (MSA), Cache, Local Consumer
- Measurement Repository (MR): Database
- Global Consumer

n° 11 Monitored nodes
- Sensors measure metrics, mainly locally: CPU utilization, network throughput, daemon status, etc.
- MSA (Monitoring Sensor Agent):
  - collects sensor data, stores it in a local cache and sends it to the central repository
  - triggers measurements according to the configured schedule.
- Local cache: gives local consumers fast access to monitoring information. Data can be collected even if the node is temporarily isolated from the central repository.
- Outgoing data consist of a value associated with a metric identifier (what), a timestamp (when), a target identifier (where), and an agent identifier (who).
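The sample layout and the agent behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Sample`, `SensorAgent`, `register`, `tick`, `flush` and `latest`, the metric name `cpu.load`, and the use of `ConnectionError` to model an unreachable repository are all hypothetical and are not taken from the WP4 software.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    metric_id: str    # what
    timestamp: float  # when
    target_id: str    # where
    agent_id: str     # who
    value: float


class SensorAgent:
    """Hypothetical stand-in for an MSA; not the EDG implementation."""

    def __init__(self, agent_id, target_id, send, cache_size=1000):
        self.agent_id = agent_id
        self.target_id = target_id
        self.send = send                       # ships one Sample; may raise ConnectionError
        self.sensors = {}                      # metric_id -> (period in seconds, read function)
        self.next_due = {}                     # metric_id -> time of next measurement
        self.cache = deque(maxlen=cache_size)  # local cache read by local consumers
        self.unsent = deque()                  # samples not yet delivered to the repository

    def register(self, metric_id, period_s, read_fn):
        self.sensors[metric_id] = (period_s, read_fn)
        self.next_due[metric_id] = 0.0

    def tick(self, now):
        """Trigger every measurement that is due, then try to deliver the backlog."""
        for metric_id, (period_s, read_fn) in self.sensors.items():
            if now >= self.next_due[metric_id]:
                sample = Sample(metric_id, now, self.target_id, self.agent_id, read_fn())
                self.cache.append(sample)
                self.unsent.append(sample)
                self.next_due[metric_id] = now + period_s
        self.flush()

    def flush(self):
        while self.unsent:
            try:
                self.send(self.unsent[0])
            except ConnectionError:
                return  # node is isolated: keep the samples and retry on a later tick
            self.unsent.popleft()

    def latest(self, metric_id):
        """Fast local lookup, independent of the central repository."""
        for sample in reversed(self.cache):
            if sample.metric_id == metric_id:
                return sample
        return None


# Usage: the repository is unreachable for the first tick, reachable for the second.
received = []
link = {"up": False}


def send(sample):
    if not link["up"]:
        raise ConnectionError("repository unreachable")
    received.append(sample)


agent = SensorAgent("msa-1", "node042", send)
agent.register("cpu.load", 60.0, lambda: 0.5)
agent.tick(now=0.0)    # measured and cached locally, delivery fails
link["up"] = True
agent.tick(now=30.0)   # no measurement due, but the backlog is delivered
```

Keeping the cache and the delivery queue separate is one possible design choice: local consumers can read recent values whether or not the central repository has received them yet.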

n° 12 Measurement Repository
- Receives samples from all the nodes
- Stores data:
  - Plain text DB
  - Oracle DB
  - MySQL DB being implemented
- Forwards data to subscribers
- Answers queries
- Consumers can subscribe to metrics and receive notification when new samples are available
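The receive, store, forward and query roles listed on this slide can be illustrated with a small in-memory sketch. It assumes nothing about the real repository: `MeasurementRepository`, `receive`, `subscribe` and `query` are hypothetical names, and a Python list stands in for the plain text, Oracle or MySQL back end.

```python
from collections import namedtuple

# Same four identifiers plus value as in the outgoing sample description.
Sample = namedtuple("Sample", "metric_id timestamp target_id agent_id value")


class MeasurementRepository:
    """Hypothetical in-memory repository; a list replaces the database back end."""

    def __init__(self):
        self._samples = []
        self._subscribers = {}  # metric_id -> list of callbacks

    def subscribe(self, metric_id, callback):
        self._subscribers.setdefault(metric_id, []).append(callback)

    def receive(self, sample):
        """Store the sample, then forward it to the subscribers of its metric."""
        self._samples.append(sample)
        for callback in self._subscribers.get(sample.metric_id, ()):
            callback(sample)

    def query(self, metric_id, target_id=None, since=0.0):
        return [
            s for s in self._samples
            if s.metric_id == metric_id
            and (target_id is None or s.target_id == target_id)
            and s.timestamp >= since
        ]


# Usage: one consumer subscribes to a single metric, then samples arrive.
repo = MeasurementRepository()
notified = []
repo.subscribe("cpu.load", notified.append)

repo.receive(Sample("cpu.load", 10.0, "node001", "msa-1", 0.7))
repo.receive(Sample("cpu.load", 20.0, "node002", "msa-2", 0.2))
repo.receive(Sample("net.rx", 20.0, "node001", "msa-1", 1500.0))

node001_load = repo.query("cpu.load", target_id="node001")
recent_load = repo.query("cpu.load", since=15.0)
```

Subscription gives consumers new samples as they arrive, while `query` serves consumers that only need stored history.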

n° 13 Fault Tolerance Architecture
[Diagram labels]
- Local node: Sensors, MSA, Cache, Actuators
- Monitoring
- Fault Tolerance daemon (FTd): Decision Unit (DU), Actuator agent, Rules

n° 14 Fault tolerance framework: status
- Main features
  - Rule: defines the exception condition (using monitoring metrics) and its association with recovery actions (actuators)
  - Web-based rule editor
  - Central Rule repository (MySQL)
  - Local FTd (fault tolerance daemon) that
    - Automatically subscribes to monitoring metrics specified by the rules
    - Launches the associated actuators when the correlation evaluates to an exception
    - Reports back to the monitoring system the recovery actions taken and their status
- Not yet deployed in production environment
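The rule and daemon behaviour listed above can be sketched as follows. This is a minimal illustration, not the FTd code: the `Rule` fields, the `FaultToleranceDaemon` class, the `subscribe`/`publish` helpers, the metric name `sshd.running` and the status strings `"recovered"` and `"failed"` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Rule:
    name: str
    metrics: Tuple[str, ...]                         # metrics the condition reads
    is_exception: Callable[[Dict[str, float]], bool]  # exception condition
    actuator: Callable[[], bool]                      # recovery action; True on success


class FaultToleranceDaemon:
    """Hypothetical local daemon: subscribes, evaluates rules, launches actuators."""

    def __init__(self, rules, subscribe, report):
        self.rules = list(rules)
        self.report = report
        self.latest = {}  # metric_id -> most recent value
        # Subscribe automatically to every metric named by at least one rule.
        for metric_id in sorted({m for r in self.rules for m in r.metrics}):
            subscribe(metric_id, self.on_sample)

    def on_sample(self, metric_id, value):
        self.latest[metric_id] = value
        for rule in self.rules:
            if metric_id not in rule.metrics:
                continue
            if not all(m in self.latest for m in rule.metrics):
                continue  # not enough data yet to evaluate the condition
            if rule.is_exception(self.latest):
                succeeded = rule.actuator()
                # Report the action taken and its status back to monitoring.
                self.report(rule.name, "recovered" if succeeded else "failed")


# Usage with a toy publish/subscribe layer standing in for the monitoring system.
subscriptions = {}
actions = []
reports = []


def subscribe(metric_id, callback):
    subscriptions.setdefault(metric_id, []).append(callback)


def publish(metric_id, value):
    for callback in subscriptions.get(metric_id, []):
        callback(metric_id, value)


def restart_sshd():
    actions.append("restart sshd")  # a real actuator would restart the service
    return True


rule = Rule(
    name="restart-sshd",
    metrics=("sshd.running",),
    is_exception=lambda values: values["sshd.running"] == 0,
    actuator=restart_sshd,
)
ftd = FaultToleranceDaemon([rule], subscribe, lambda name, status: reports.append((name, status)))

publish("sshd.running", 1.0)  # healthy, nothing happens
publish("sshd.running", 0.0)  # exception, actuator runs and the result is reported
```

Because the daemon derives its subscriptions from the rules, adding a rule is enough to make it start watching the metrics that rule needs.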