DataGRID WP4 - Fabric Management Status. HEPiX 2002, Catania / IT, 18.04.02, Jan Iven

Presentation transcript:

DataGRID WP4 - Fabric Management Status
HEPiX 2002, Catania / IT, 18.04.02, Jan Iven
+ Role and Architecture
+ Status reports
  = Installation
  = Monitoring
  = Others
+ Short-term planning

J.Iven - 2 HEPiX 2002, Catania / IT
WP4 - Background information
+ WP4's objective: deliver the necessary tools to manage a computing fabric providing grid services on clusters scaling up to thousands of nodes.
+ Main scope:
  = Fabric (system administration) management
  = User job management (Grid and local)
+ Official participants: CERN (leading partner), INFN, NIKHEF, University of Heidelberg, ZIB (Berlin) and University of Edinburgh/PPARC

J.Iven - 3
Functionality
+ Enterprise system administration - scalable to O(10K) nodes
  = Automated installation and maintenance of nodes
  = Resource management (batch, interactive)
  = Monitoring of events and performance
  = Fault tolerance & recovery actions
  = Fabric Configuration Management
+ Provision for running Grid jobs
  = Authorization according to local policies
  = Mapping Grid credentials to local ones
  = Publication of fabric resources and job information
+ Provision for running local jobs
  = Sharing of resources according to local policies

J.Iven - 4
DataGRID Architecture
[Layered architecture diagram. Side labels: "Grid", "Fabric". Callout: "WP4 tasks".]
+ Local Computing: Local Application, Local Database
+ Grid Application Layer: Data Management, Job Management, Metadata Management, Object to File Mapping
+ Collective Services: Information & Monitoring, Replica Manager, Grid Scheduler
+ Underlying Grid Services: Computing Element Services, Authorization Authentication and Accounting, Replica Catalog, Storage Element Services, SQL Database Services, Service Index
+ Fabric services: Configuration Management, Node Installation & Management, Monitoring and Fault Tolerance, Resource Management, Fabric Storage Management

J.Iven - 5
WP4 Architecture overview
[Diagram. Users: Grid User, Local User. Fabric: Farm A (LSF), Farm B (PBS), (Mass storage, Disk pools). Other WPs: Resource Broker (WP1), Data Mgmt (WP2), Grid Info Services (WP3), Grid Data Storage (WP5).]
WP4 subsystems:
+ Fabric Gridification
  - Interface between Grid-wide services and local fabric;
  - Provides local authentication, authorization and mapping of grid credentials.
+ Resource Management
  - provides transparent access to different cluster batch systems;
  - enhanced capabilities (extended scheduling policies, advanced reservation, local accounting).
+ Configuration Management
  - provides a central storage and management of all fabric configuration information;
  - central DB and set of protocols and APIs to store and retrieve information.
+ Installation & Node Mgmt
  - provides the tools to install and manage all software running on the fabric nodes;
  - Agent to install, upgrade, remove and configure software packages on the nodes.
  - bootstrap services and software repositories;
+ Monitoring & Fault Tolerance
  - provides the tools for gathering monitoring information on fabric nodes;
  - central measurement repository stores all monitoring information;
  - fault tolerance correlation engines detect failures and trigger recovery actions.
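The "transparent access to different cluster batch systems" idea can be illustrated with a small sketch. This is not WP4 code: the `JobRequest` type, the `submit_command` function and the farm names are hypothetical, and the only external facts assumed are that LSF submits with `bsub` and PBS with `qsub`, both of which accept `-q` to select a queue.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobRequest:
    """Batch-system-independent job description (hypothetical fields)."""
    script: str
    queue: str


def submit_command(batch_system: str, job: JobRequest) -> List[str]:
    """Translate a generic request into the native submit command line."""
    if batch_system == "LSF":
        # LSF: bsub submits the command; -q selects the queue.
        return ["bsub", "-q", job.queue, job.script]
    if batch_system == "PBS":
        # PBS: qsub submits the job script; -q selects the queue.
        return ["qsub", "-q", job.queue, job.script]
    raise ValueError(f"unsupported batch system: {batch_system}")


# Mirrors "Farm A (LSF)" and "Farm B (PBS)" from the slide.
FARMS: Dict[str, str] = {"farm-a": "LSF", "farm-b": "PBS"}

job = JobRequest(script="run_analysis.sh", queue="short")
for farm, system in FARMS.items():
    # The commands are only built and printed here, never executed.
    print(farm, " ".join(submit_command(system, job)))
```

The caller describes the job once; only the translation layer knows which batch system sits behind each farm.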

J.Iven - 6
Status: Installation
+ Main focus: Linux on i386
+ Prototype available, based on a tool originally developed by Edinburgh University: LCFG (Local ConFiGuration system).
+ Main features:
  = automatic installation of the O.S.
  = installation/upgrade/removal of software packages
  = configuration and management of standard services and custom packages
  = uses a central, hierarchical configuration database
+ Deployed on EDG testbed 1 in September 2001, interim releases every ~2 months. Currently ~70 nodes (CERN, INFN, NIKHEF, RAL + other UK, IFAE-Barcelona, LIP-Lisbon), +~30 being installed now (ESA-ESRIN, NIKHEF)

J.Iven - 7
LCFG overview
+ Abstract configuration parameters for all nodes stored in a central repository
+ A collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services
[LCFG diagram. Server: LCFG Config Files -> Make XML Profile -> XML Profile -> Web Server. Client nodes: HTTP -> rdxprof (Read Profile) -> Local cache -> ldxprof (Load Profile) -> Profile Object -> Generic Component -> LCFG Objects.]
Example, LCFG config files:
    +inet.services telnet login ftp
    +inet.allow telnet login ftp sshd
    +inet.allow_telnet ALLOWED_NETWORKS
    +inet.allow_login ALLOWED_NETWORKS
    +inet.allow_ftp ALLOWED_NETWORKS
    +inet.allow_sshd ALL
    +inet.daemon_sshd yes
    auth.users mickey
    +auth.userhome_mickey /home/Mickey
    +auth.usershell_mickey /bin/tcsh
XML profiles:
    .... /home/Mickey /bin/tcsh ....
Profile Object: inet, auth
Config files:
    inet: /etc/services, /etc/inetd.conf, /etc/hosts.allow
        in.telnetd : , in.rlogind : , in.ftpd : , sshd : ALL
    auth: /etc/shadow, /etc/group, /etc/passwd
        .... mickey:x:999:20::/home/Mickey:/bin/tcsh ....
Slide prepared by INFN
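The step from abstract parameters to traditional config files can be sketched as follows. This is only an illustration of the idea, not how LCFG components are implemented: the functions `parse_resources`, `render_hosts_allow` and `render_passwd` are hypothetical, the XML profile and HTTP transport stages are skipped, `ALLOWED_NETWORKS` is left unexpanded, and the uid/gid values 999 and 20 are taken from the slide's sample `/etc/passwd` line rather than from any resource shown.

```python
from collections import defaultdict

# Resource lines copied from the slide; ALLOWED_NETWORKS is left unexpanded.
SOURCE = """\
+inet.allow telnet login ftp sshd
+inet.allow_telnet ALLOWED_NETWORKS
+inet.allow_login ALLOWED_NETWORKS
+inet.allow_ftp ALLOWED_NETWORKS
+inet.allow_sshd ALL
auth.users mickey
+auth.userhome_mickey /home/Mickey
+auth.usershell_mickey /bin/tcsh
"""

# Service-to-daemon names as they appear in the slide's hosts.allow output.
DAEMONS = {"telnet": "in.telnetd", "login": "in.rlogind", "ftp": "in.ftpd"}


def parse_resources(text):
    """Group 'component.resource value' lines by component name."""
    profile = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip().lstrip("+")
        if not line:
            continue
        key, _, value = line.partition(" ")
        component, _, resource = key.partition(".")
        profile[component][resource] = value.strip()
    return profile


def render_hosts_allow(inet):
    """Hypothetical 'inet' agent: one hosts.allow line per allowed service."""
    lines = []
    for service in inet["allow"].split():
        daemon = DAEMONS.get(service, service)
        lines.append(f"{daemon} : {inet['allow_' + service]}")
    return "\n".join(lines)


def render_passwd(auth, uid=999, gid=20):
    """Hypothetical 'auth' agent: one passwd line per configured user."""
    lines = []
    for user in auth["users"].split():
        home = auth["userhome_" + user]
        shell = auth["usershell_" + user]
        lines.append(f"{user}:x:{uid}:{gid}::{home}:{shell}")
    return "\n".join(lines)


profile = parse_resources(SOURCE)
print(render_hosts_allow(profile["inet"]))
print(render_passwd(profile["auth"]))
```

Each component reads only its own part of the profile (`inet` or `auth`), which is what lets one central description drive several unrelated config files.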

J.Iven - 8
Monitoring + Fault Tolerance
+ Principle:
  = A Monitoring Agent running on each node samples the configured metrics via sensors
  = The samples are sent to a central Monitoring Repository and stored. The samples are also stored locally to allow for local fault tolerance if appropriate
  = Correlation engines act on local or central data
    - Trigger actions, e.g. Alarms or Recovery
    - Create higher-level data
  = User Interface queries the Repository and displays alarms
+ Status:
  = Component APIs have been defined
  = First version of Agent available, deployed on EDG (~15) and CERN (~1000) nodes
  = Current prototype uses simple DB based on flat files
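A minimal sketch of the agent / repository / correlation-engine principle described above. All names here (`Agent`, `Repository`, `correlate`, the metric names and the threshold) are hypothetical and do not reflect the WP4 component APIs; the repository is an in-memory list standing in for the flat-file store, and sensors are plain callables.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# (timestamp, node, metric, value)
Sample = Tuple[float, str, str, float]


class Repository:
    """Stand-in for the central Monitoring Repository (in-memory only)."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def store(self, sample: Sample) -> None:
        self.samples.append(sample)


class Agent:
    """Per-node agent: samples each configured metric through its sensor."""

    def __init__(self, node: str, sensors: Dict[str, Callable[[], float]],
                 repository: Repository) -> None:
        self.node = node
        self.sensors = sensors
        self.repository = repository
        self.local_cache: List[Sample] = []  # kept for local fault tolerance

    def sample_once(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        for metric, sensor in self.sensors.items():
            sample = (now, self.node, metric, sensor())
            self.local_cache.append(sample)   # local copy
            self.repository.store(sample)     # central copy


def correlate(samples: List[Sample], metric: str, threshold: float) -> List[str]:
    """Toy correlation engine: raise an alarm for each sample over threshold."""
    return [
        f"ALARM {node}: {name}={value} exceeds {threshold}"
        for (_, node, name, value) in samples
        if name == metric and value > threshold
    ]


repo = Repository()
agent = Agent("node01", {"load": lambda: 0.5, "cpu_temp": lambda: 97.0}, repo)
agent.sample_once()
for alarm in correlate(repo.samples, "cpu_temp", 80.0):
    print(alarm)
```

Because `correlate` takes a plain list of samples, the same function can run against `agent.local_cache` on the node or against `repo.samples` centrally, matching the "local or central data" point on the slide.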

J.Iven - 9
Other Statuses (Stati?)
+ Fault Tolerance
  = Prototype which periodically checks the CPU/chip set temperatures as well as the fan speeds.
+ Configuration Management
  = High Level Configuration Description Language: declarative way of describing configuration of computer systems. First draft available.
  = High Level Configuration Language to Low Level Configuration Language compiler. Alpha prototype available.
  = Central Configuration Database (CDB) (central store for all fabric configuration information). Being designed.
+ Resource Management
  = Working on first prototype of the Resource Management Subsystem
+ Gridification
  = Enhancing the Globus gatekeeper with plug-in authorization and credential mapping components.
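The Gridification item (plug-in authorization followed by credential mapping) might look roughly like the sketch below. It is not the WP4 or Globus gatekeeper code: the plug-in functions, the certificate subjects and the account names are invented for illustration. The one external assumption is the common grid-mapfile layout of a quoted certificate subject followed by a local account name, simplified here to a single account per line.

```python
import shlex
from typing import Callable, Dict, List, Optional

# An authorization plug-in takes a certificate subject (DN) and says yes or no.
AuthzPlugin = Callable[[str], bool]

# Hypothetical map file content: "<quoted DN> <local account>" per line.
GRIDMAP_TEXT = '''\
# subject                              local account
"/O=Grid/O=CERN/CN=Jane Doe" jdoe
"/O=Grid/O=INFN/CN=Mario Rossi" mrossi
'''


def parse_gridmap(text: str) -> Dict[str, str]:
    """Parse map lines into {DN: account}, skipping blanks and comments."""
    mapping: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)  # handles the quoted DN with spaces
        mapping[dn] = account
    return mapping


def authorize_and_map(dn: str, plugins: List[AuthzPlugin],
                      gridmap: Dict[str, str]) -> Optional[str]:
    """Return the local account only if every plug-in authorizes the DN."""
    if not all(plugin(dn) for plugin in plugins):
        return None
    return gridmap.get(dn)


# Two hypothetical local policies expressed as plug-ins.
BANNED = {"/O=Grid/O=INFN/CN=Mario Rossi"}
plugins: List[AuthzPlugin] = [
    lambda dn: dn not in BANNED,           # site ban list
    lambda dn: dn.startswith("/O=Grid/"),  # only accept this name space
]

gridmap = parse_gridmap(GRIDMAP_TEXT)
print(authorize_and_map("/O=Grid/O=CERN/CN=Jane Doe", plugins, gridmap))
print(authorize_and_map("/O=Grid/O=INFN/CN=Mario Rossi", plugins, gridmap))
```

Keeping authorization and mapping as separate steps means a site can add or remove local policy checks without touching the account mapping.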

J.Iven - 10
Planning up to Release 2 (09/02)
+ Installation: Split "Production" and "Research"
  = Deliver Production quality LCFG in R2
  = Move to latest LCFG: support PXE installations, RedHat 7.2
  = New Configuration Schema
  = Deploy new configuration language compiler
+ Monitoring: complete Prototype in R2
  = Prototype-quality agent, deploy everywhere
  = Simple alarm display / GUI
  = Rework transport layer (UDP -> reliable transport)
  = Integrate SNMP
  = Select and move to "real" database

J.Iven - 11
Summary and Links
+ WP4 is well-established: software deployed and used
+ Release 2 will be Evolution, not Revolution
  = close collaboration between WP4, farm admins and LCG team
+ DataGRID project:
+ DataGRID WP4:
+ LCFG: