Olof Bärring – EDG WP4 status&plans- 22/10/2002 - n° 1 Partner Logo EDG WP4 (fabric mgmt): status&plans Large Cluster.

Slides:



Advertisements
Similar presentations
GridPP7 – June 30 – July 2, 2003 – Fabric monitoring– n° 1 Fabric monitoring for LCG-1 in the CERN Computer Center Jan van Eldik CERN-IT/FIO/SM 7 th GridPP.
Advertisements

Andrew McNab - Manchester HEP - 22 April 2002 EU DataGrid Testbed EU DataGrid Software releases Testbed 1 Job Lifecycle Authorisation at your site More.
Data Management Expert Panel - WP2. WP2 Overview.
Andrew McNab - Manchester HEP - 2 May 2002 Testbed and Authorisation EU DataGrid Testbed 1 Job Lifecycle Software releases Authorisation at your site Grid/Web.
Database Architectures and the Web
FP7-INFRA Enabling Grids for E-sciencE EGEE Induction Grid training for users, Institute of Physics Belgrade, Serbia Sep. 19, 2008.
Gridification Task Development Plan for Release 1.1 – 2.0 For Gridification: David Groep
DataGrid is a project funded by the European Union CHEP 2003 – March 2003 – Towards automation of computing fabrics... – n° 1 Towards automation.
Andrew McNab - EDG Access Control - 14 Jan 2003 EU DataGrid security with GSI and Globus Andrew McNab University of Manchester
DataGrid is a project funded by the European Union 22 September 2003 – n° 1 EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance
EU-GRID Work Program Massimo Sgaravatto – INFN Padova Cristina Vistoli – INFN Cnaf as INFN members of the EU-GRID technical team.
1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.
1 Grid services based architectures Growing consensus that Grid services is the right concept for building the computing grids; Recent ARDA work has provoked.
ASIS et le projet EU DataGrid (EDG) Germán Cancio IT/FIO.
Workload Management Workpackage Massimo Sgaravatto INFN Padova.
GGF Toronto Spitfire A Relational DB Service for the Grid Peter Z. Kunszt European DataGrid Data Management CERN Database Group.
DataGrid Kimmo Soikkeli Ilkka Sormunen. What is DataGrid? DataGrid is a project that aims to enable access to geographically distributed computing power.
Workload Management Massimo Sgaravatto INFN Padova.
The EU DataGrid Architecture The European DataGrid Project Team
The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.
WP4-install task report WP4 workshop Barcelona project conference 5/03 German Cancio.
A Lightweight Platform for Integration of Resource Limited Devices into Pervasive Grids Stavros Isaiadis and Vladimir Getov University of Westminster
7/2/2003Supervision & Monitoring section1 Supervision & Monitoring Organization and work plan Olof Bärring.
1 Linux in the Computer Center at CERN Zeuthen Thorsten Kleinwort CERN-IT.
Olof Bärring – WP4 summary- 6/3/ n° 1 Partner Logo WP4 report Status, issues and plans
1 School of Computer, National University of Defense Technology A Profile on the Grid Data Engine (GridDaEn) Xiao Nong
EDG WP4: installation task LSCCW/HEPiX hands-on, NIKHEF 5/03 German Cancio CERN IT/FIO
WP4 Security and AA(A) issues For WP4: David Groep
Grid Workload Management & Condor Massimo Sgaravatto INFN Padova.
Partner Logo DataGRID WP4 - Fabric Management Status HEPiX 2002, Catania / IT, , Jan Iven Role and.
1 DIRAC – LHCb MC production system A.Tsaregorodtsev, CPPM, Marseille For the LHCb Data Management team CHEP, La Jolla 25 March 2003.
Olof Bärring – WP4 summary- 4/9/ n° 1 Partner Logo WP4 report Plans for testbed 2
May PEM status report. O.Bärring 1 PEM status report Large-Scale Cluster Computing Workshop FNAL, May Olof Bärring, CERN.
- Distributed Analysis (07may02 - USA Grid SW BNL) Distributed Processing Craig E. Tull HCG/NERSC/LBNL (US) ATLAS Grid Software.
Introduction to dCache Zhenping (Jane) Liu ATLAS Computing Facility, Physics Department Brookhaven National Lab 09/12 – 09/13, 2005 USATLAS Tier-1 & Tier-2.
1 The new Fabric Management Tools in Production at CERN Thorsten Kleinwort for CERN IT/FIO HEPiX Autumn 2003 Triumf Vancouver Monday, October 20, 2003.
Author - Title- Date - n° 1 Partner Logo EU DataGrid, Work Package 5 The Storage Element.
German Cancio – WP4 developments Partner Logo System Management: Node Configuration & Software Package Management
And Tier 3 monitoring Tier 3 Ivan Kadochnikov LIT JINR
 Apache Airavata Architecture Overview Shameera Rathnayaka Graduate Assistant Science Gateways Group Indiana University 07/27/2015.
20-May-2003HEPiX Amsterdam EDG Fabric Management on Solaris G. Cancio Melia, L. Cons, Ph. Defert, I. Reguero, J. Pelegrin, P. Poznanski, C. Ungil Presented.
G. Cancio, L. Cons, Ph. Defert - n°1 October 2002 Software Packages Management System for the EU DataGrid G. Cancio Melia, L. Cons, Ph. Defert. CERN/IT.
Maite Barroso – WP4 Barcelona – 13/05/ n° 1 -WP4 Barcelona- Closure Maite Barroso 13/05/2003
Lemon Monitoring Miroslav Siket, German Cancio, David Front, Maciej Stepniewski CERN-IT/FIO-FS LCG Operations Workshop Bologna, May 2005.
What is SAM-Grid? Job Handling Data Handling Monitoring and Information.
Installing, running, and maintaining large Linux Clusters at CERN Thorsten Kleinwort CERN-IT/FIO CHEP
May http://cern.ch/hep-proj-grid-fabric1 EU DataGrid WP4 Large-Scale Cluster Computing Workshop FNAL, May Olof Bärring, CERN.
Resource Management Task Report Thomas Röblitz 19th June 2002.
Olof Bärring – WP4 summary- 4/9/ n° 1 Partner Logo WP4 report Plans for testbed 2 [Including slides prepared by Lex Holt.]
CASTOR evolution Presentation to HEPiX 2003, Vancouver 20/10/2003 Jean-Damien Durand, CERN-IT.
German Cancio – WP4 developments Partner Logo WP4 / ATF ATF meeting, 9/4/2002
EU 2nd Year Review – Feb – WP4 demo – n° 1 WP4 demonstration Fabric Monitoring and Fault Tolerance Sylvain Chapeland Lord Hess.
M.Biasotto, CERN, 5 november Fabric Management Massimo Biasotto, Enrico Ferro – INFN LNL.
Maite Barroso - 10/05/01 - n° 1 WP4 PM9 Deliverable Presentation: Interim Installation System Configuration Management Prototype
David Foster LCG Project 12-March-02 Fabric Automation The Challenge of LHC Scale Fabrics LHC Computing Grid Workshop David Foster 12 th March 2002.
The EDG Testbed The European DataGrid Project Team
Gennaro Tortone, Sergio Fantinel – Bologna, LCG-EDT Monitoring Service DataTAG WP4 Monitoring Group DataTAG WP4 meeting Bologna –
Site Authorization Service Local Resource Authorization Service (VOX Project) Vijay Sekhri Tanya Levshina Fermilab.
Quattor tutorial Introduction German Cancio, Rafael Garcia, Cal Loomis.
Bob Jones – Project Architecture - 1 March n° 1 Project Architecture, Middleware and Delivery Schedule Bob Jones Technical Coordinator, WP12, CERN.
Partner Logo Olof Bärring, WP4 workshop 10/12/ n° 1 (My) Vision of where we are going WP4 workshop, 10/12/2002 Olof Bärring.
A System for Monitoring and Management of Computational Grids Warren Smith Computer Sciences Corporation NASA Ames Research Center.
Monitoring and Fault Tolerance
WP4 Fabric Management 3rd EU Review Maite Barroso - CERN
GGF OGSA-WG, Data Use Cases Peter Kunszt Middleware Activity, Data Management Cluster EGEE is a project funded by the European.
WP4-install status update
German Cancio CERN IT .quattro architecture German Cancio CERN IT.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
Towards automation of computing fabrics using tools from the fabric management workpackage of the EU DataGrid project Maite Barroso Lopez (WP4)
Wide Area Workload Management Work Package DATAGRID project
Presentation transcript:

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 1 Partner Logo EDG WP4 (fabric mgmt): status&plans Large Cluster Computing Workshop FNAL, 22/10/2002 Olof Bärring

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 2 Outline u What’s “EDG” and “WP4” ?? u Recap from LCCWS 2001 u Architecture design and the ideas behind… u Subsystem status&plans&issues n Configuration mgmt n Installation mgmt n Monitoring n Fault tolerance n Resource mgmt n Gridification u Conclusions

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 3 “EDG” == EU DataGrid project u Project started 1/1/2001 and ends 31/12/2003 u 6 principal contractors: CERN, CNRS, ESA-ESRIN, INFN, NIKHEF/FOM, PPARC u 15 assistant contractors u ~150FTE u u 12 workpackages

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 4 “WP” == workpackage EDG WPs u WP1: Workload Management u WP2: Grid Data Management u WP3: Grid Monitoring Services u WP4: Fabric management u WP5: Mass Storage Management u WP6: Integration Testbed – Production quality International Infrastructure u WP7: Network Services u WP8: High-Energy Physics Applications u WP9: Earth Observation Science Applications u WP10: Biology Science Applications u WP11: Information Dissemination and Exploitation u WP12: Project Management This is what I’m gonna talk about today

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 5 WP4: main objective “To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.” User job management (Grid and local) Automated management of large clusters

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 6 WP4: structure u ~14 FTEs (6 funded by the EU). Presently split over ~ people u 6 partners: CERN, NIKHEF, ZIB, KIP, PPARC, INFN u The development work divided into 6 subtasks

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 7 Recap from LCCWS-1 EDG WP4 presentations in LCCWS-1 //-sessions SessionWhat we saidWhat happened InstallationPlans for using the LCFG tool from Edinburgh Univ. as an interim installation/maintenance system LCFG in production on EDG testbed since 12 months. Will be replaced by new system 2Q03. MonitoringPEM vs. WP4. Design for node autonomy where possible System deployed on EDG testbed since one month GridEarly architecture design ideas and development plans up to Sept Architecture design refined and adopted. Delivery OK. Not everything worked smoothly n Architecture design: had to reach consensus between partners with different agendas and motivations. n Delivered software: we learned some lessons and had taken some uncomfortable decisions

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 8 Architecture design and the ideas behind u Information model. Configuration is distinct from monitoring n Configuration == desired state (what we want) n Monitoring == actual state (what we have) u Aggregation of configuration information n Good experience with LCFG concepts with central configuration template hierarchies u Node autonomy. Resolve local problems locally if possible n Cache node configuration profile and local monitoring buffer u Scheduling of intrusive actions u Plug-in authorization and credential mapping

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 9 DataGrid Architecture Collective Services Information & Monitoring Replica Manager Grid Scheduler Local Application Local Database Underlying Grid Services Computing Element Services Authorization Authentication and Accounting Replica Catalog Storage Element Services SQL Database Services Fabric services Configuration Management Configuration Management Node Installation & Management Node Installation & Management Monitoring and Fault Tolerance Monitoring and Fault Tolerance Resource Management Fabric Storage Management Fabric Storage Management Grid Fabric Local Computing Grid Grid Application Layer Data Management Job Management Metadata Management Object to File Mapping Service Index WP4 tasks

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 10 Farm A (LSF)Farm B (PBS ) Grid User (Mass storage, Disk pools) Local User Installation & Node Mgmt Configuration Management Monitoring & Fault Tolerance Fabric Gridification Resource Management Grid Info Services (WP3) WP4 subsystems Other Wps Resource Broker (WP1) Data Mgmt (WP2) Grid Data Storage (WP5) WP4 Architecture logical overview - Interface Grid-wide services with local fabric - Provides local authorization and mapping of grid credentials. - Interface Grid-wide services with local fabric - Provides local authorization and mapping of grid credentials. - provides transparent access (both job and admin) to different cluster batch systems - enhanced capabilities (extended scheduling policies, advanced reservation, local accounting) - provides transparent access (both job and admin) to different cluster batch systems - enhanced capabilities (extended scheduling policies, advanced reservation, local accounting) -provides a central storage and management of all fabric configuration information -Compile HLD templates to LLD node profiles - central DB and set of protocols and APIs to store and retrieve information -provides a central storage and management of all fabric configuration information -Compile HLD templates to LLD node profiles - central DB and set of protocols and APIs to store and retrieve information - provides the tools to install and manage all software running on the fabric nodes -Agent to install, upgrade, remove and configure software packages on the nodes -bootstrap services and software repositories - provides the tools to install and manage all software running on the fabric nodes -Agent to install, upgrade, remove and configure software packages on the nodes -bootstrap services and software repositories - provides the tools for gathering monitoring information on fabric nodes -central measurement repository stores all monitoring information - fault tolerance correlation engines detect failures and trigger recovery actions - provides the tools for gathering monitoring information on fabric nodes -central measurement repository stores all monitoring information - fault tolerance correlation engines detect failures and trigger recovery actions

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 11 User job management (Grid and local) Farm A (LSF)Farm B (PBS ) Grid User (Mass storage, Disk pools) Local User Monitoring Fabric Gridification Resource Management Grid Info Services (WP3) WP4 subsystems Other Wps Resource Broker (WP1) Data Mgmt (WP2) Grid Data Storage (WP5) - Submit job - Optimized selection of site -Authorization -Map grid  local credentials -Authorization -Map grid  local credentials -Select an optimal batch queue and submit -Return job status and output -Select an optimal batch queue and submit -Return job status and output - publish resource and accounting information

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 12 Automated management of large clusters WP4 subsystems Other Wps Farm A (LSF)Farm B (PBS ) Installation & Node Mgmt Configuration Management Monitoring & Fault Tolerance Resource Management Information Invocation - Update configuration templates - Node malfunction detected -Remove node from queue -Wait for running jobs(?) -Remove node from queue -Wait for running jobs(?) - Trigger repair - Repair (e.g. restart, reboot, reconfigure, …) - Node OK detected -Put back node in queue Automation

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 13 Node autonomy Cfg cache Monitoring Buffer Correlation engines Node mgmt components Monitoring Measurement Repository Configuration Data Base Central (distributed) Buffer copy Cache Node profile Local recover if possible (e.g. restarting daemons) Automation

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 14 ClientServer XML HLDL PAN DBM Notification + Transfer Low Level API Access API Components Template Subtasks: configuration management

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 15 # TEST Linux system #################################### object template TEST_i386_rh72; "/system/platform" = "i386_rh72"; "/system/network/interfaces/0/ip" = “ "; "/system/network/hostname" = “myhost"; include node_profile; # Default node profile #################################### template node_profile; # Include validation functions ############################## include functions; # Include basic type definitions ################################ include hardware_types; include system_types; include software_types; # Include default configuration data #################################### include default_hardware; include default_system; include default_software; # SYSTEM: Default configuration ######################### template default_system; # Include default system configuration ###################################### include default_users; include default_network; include default_filesystems; # SYSTEM: Default network configuration ################################### template default_network; "/system/network" = value("//network_" + value("/system/platform") + "/network"); Configuration templates like this …

Olof Bärring – EDG WP4 status&plans- 22/10/ n° <nlist name="profile" derivation="TEST_i386_rh72,node_profile,functions,hardware_types,… ….. - i386_rh72 - <nlist name="network“ derivation="TEST_i386_rh72,default_network,network_i386_rh72,std_network“ type="record"> myhost - - eth true ….. … generate XML profile like this u Description of the High Level Definition Language (HLDL), the compiler and the Low Level Definition Language (LLDL) can be found at:

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 17 Global configuration schema tree hardware systemsoftware CPUharddiskmemory…. sys_nameinterface_typesize…. networkplatformpartitionsservices…. hda1 sizetypeid hda2…. packagesknown_repositoriesedg_lcas …. versionrepositories…. cluster …. Component specific configuration The population of the global schema is an ongoing activity i386_rh72

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 18 Subtask: installation management u Node Configuration Deployment u Base system installation u Software Package Management

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 19 ClientServer XML HLDL PAN DBM Notification + Transfer Low Level API Access API Component Template Node configuration deployment

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 20 Node configuration deployment infrastructure server client Node View Access (NVA) API DBM Cache Invocation registration & notification XML profiles Component libs SUE sysmgt Logging Template processor Monitoring interface Configure() “low level” API Configuration Dispatch daemon (cdispd) Component

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 21 Component example sub Configure { my ($self) # access configuration information my $config=NVA::Config->new(); my $arch=$config->getValue('/system/platform’); # low-level API $self->Fail (“not supported") unless ($arch eq ‘i386_rh72’); # (re)generate and/or update local config file(s) open (myconfig,’/etc/myconfig’); … # notify affected (SysV) services if required if ($changed) { system(‘/sbin/service myservice reload’); … }

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 22 Base Installation and Software Package management u Use of standard tools u Base installation n Generation of kickstart or jumpstart files from node profile u Software package management n Framework with pluggable packager s rpm s pkg s ?? n It can be configured to respect locally installed packages, ie. it can be used for managing only a subset of packages on the node (useful for desktops)

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 23 Software Package Management (SPM) rpmt Repository packages SPM Packages (RPM, pkg) RPM db Local Config file Transaction set filesystem Package files HTTP(S), NFS, FTP Installed pkgs “desired” configuration SPM Component

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 24 Installation (&configuration): status u LCFG (Local Configuration) tool from Univ. of Edinburgh has been in production at the EDG testbed since more than 12 months n Learned a lot from it to understand what we really want n Used at almost all EDG testbed sites  very valuable feedback from a large O(5-10) group of site admins u Disadvantages with LCFG n Enforces a private per component configuration schema n High level language lacks possibilities to attach compile time validation n Maintains propriety solutions where standards exist (e.g. base installation) u New developments progress well and complete running system is expected by April 2003

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 25 Subtask: fabric monitoring u Framework for n Collecting monitoring information from sensors running on the nodes n Store the information in a local buffer s Assures that data is collected and stored even if network is down s Allows for local fault tolerance n Transports the data to a central repository database s Allows for global correlations and fault tolerance s Facilitate generation of periodic resource utilisation reports u Status: framework deployed on EDG testbed. Enhancements will come n Oracle DB repository backend. MySQL and/or PostgreSQL also planned n GUIs: alarm display and data analysis

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 26 Fabric monitoring Sensor Agent Cache Repository server DB Application Sensor API Cache used by local fault tolerance Native DB API (e.g. SQL) Repository API (SOAP RPC) Transport (UDP or TCP) Nodes Server node Desktop Repository API (Local access)

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 27 Subtask: fault tolerance u Framework consists of n Rule editor s Enter metric correlation algorithms and bind them to actions (actuators) n Correlation engines implements the rules s Subscribe to the defined set of input metrics s Detect exception conditions determined by the correlation algorithms and report to the monitoring system (exception metric) s Try out the action(s) and report back the success/failure to the monitoring system (action metric) n Actuators s Plug-in modules (scripts/programs) implementing the actions n Status: first prototype expected by mid-November 2002

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 28 Subtask: resource management u Manage grid jobs and local jobs. Layer between grid scheduler and local batch system. Allows for enhancing scheduling capabilities if necessary n Advanced reservations n Priorities u Provides common API for administrating underlying batch system n Scheduling of maintenance jobs n Draining node/queues from batch jobs u Status: prototype exists since a couple of months. Not yet deployed on EDG testbed.

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 29 queues resources Batch system: PBS, LSF, etc. Scheduler Runtime Control System Grid Local fabric Gatekeeper (Globus or WP4) job 1 job 2 job n JM 1JM 2JM n schedule d jobs new jobs user queue 2 execution queue stopped, visible for usersstarted, invisible for users submit user queue 1 get job info mov e move job exec job RMS components PBS-, LSF-Cluster Globus components Resource management prototype (R1.3)

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 30 Subtask: gridification u Layer between local fabric and the grid n Local Centre Authorisation Service, LCAS s Framework for local authorisation based on grid certificate and resource specification (job description) s Allows for authorisation plug-ins to extend the basic set of authorisation policies (gridmap file, user ban lists, wall-clock time) n Local Credential Mapping Service, LCMAPS s Framework for mapping authorised user’s grid certificates onto local credentials s Allows for credential mapping plug-ins. Basic set should include uid mapping and AFS token mapping n Job repository n Status: LCAS deployed in May LCMAPS and job repository expected 1Q03.

Olof Bärring – EDG WP4 status&plans- 22/10/ n° 31 Conclusions u Since last LCCWS we have learned a lot n We do have an architecture and a plan to implement it n Development work is progressing well n Adopting LCFG as interim solution was a good thing s Experience and feedback with a real tool helps in digging out what people really want s Forces middleware providers and users to respect some rules when delivering software s Automated configuration has become an important for implementing quality assurance in EDG n Internal and external coordination with other WPs and projects result in significant overhead n Sociology is an issue (see next 30 slides…)