Olof Bärring – WP4 summary – 4/9/2002 – n° 1 WP4 report: Plans for testbed 2


n° 2: Summary
- Reminder on how it all fits together
- What’s in R1.2 (deployed and not-deployed but integrated)
- Piled up software from R1.3, R1.4
- Timeline for R2 developments and beyond
- A WP4 problem
- Conclusions

n° 3: How it all fits together (job management)
[Diagram. Legend: WP4 subsystems; other WPs.]
Components shown: Grid User; Local User; Resource Broker (WP1); Data Mgmt (WP2); Grid Info Services (WP3); Grid Data Storage (WP5) (mass storage, disk pools); Gridification; Resource Management; Monitoring; Fabric; Farm A (LSF); Farm B (PBS).
Steps shown:
- Submit job
- Optimized selection of site
- Authorize
- Map grid → local credentials
- Select an optimal batch queue and submit
- Return job status and output
- Publish resource and accounting information
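The fabric-side steps on this slide (authorize, map grid credentials to local ones, select a batch queue, submit) can be sketched roughly as follows. This is only an illustration and not the WP4 implementation: the credential map, the queue names, and the rule that "optimal" means "fewest pending jobs" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    farm: str
    pending_jobs: int


# Hypothetical grid-to-local credential map (subject name -> local account).
GRID_MAP = {"/O=Grid/O=Example/CN=Alice Example": "grid001"}


def authorize(subject: str) -> bool:
    """Authorize: here simply 'the subject is known to the site' (assumption)."""
    return subject in GRID_MAP


def map_credentials(subject: str) -> str:
    """Map grid -> local credentials."""
    return GRID_MAP[subject]


def select_queue(queues: list) -> Queue:
    """Select an 'optimal' queue; assumed here to be the one with fewest pending jobs."""
    return min(queues, key=lambda q: q.pending_jobs)


def submit(subject: str, queues: list) -> tuple:
    """Run the steps in slide order and return (local account, chosen queue name)."""
    if not authorize(subject):
        raise PermissionError(f"subject not authorized: {subject}")
    local_user = map_credentials(subject)
    queue = select_queue(queues)
    return local_user, queue.name


QUEUES = [
    Queue("farmA-short", "Farm A (LSF)", pending_jobs=12),
    Queue("farmB-short", "Farm B (PBS)", pending_jobs=3),
]
```

With the sample data, `submit("/O=Grid/O=Example/CN=Alice Example", QUEUES)` picks the Farm B queue because it has fewer pending jobs; an unknown subject is rejected before any mapping takes place.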

n° 4: How it all fits together (system mgmt)
[Diagram. Legend: WP4 subsystems; other WPs; Information; Invocation.]
Components shown: Installation & Node Mgmt; Configuration Management; Monitoring & Fault Tolerance; Resource Management; Farm A (LSF); Farm B (PBS).
Automation steps shown:
- Update configuration templates
- Node malfunction detected
- Remove node from queue
- Wait for running jobs(?)
- Trigger repair
- Repair (e.g. restart, reboot, reconfigure, …)
- Node OK detected
- Put back node in queue
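The automation sequence on this slide reads like a small state machine. A minimal sketch, assuming hypothetical names throughout (`Node`, `NodeState`, `handle_malfunction`) and a repair action passed in as a callable; the real WP4 subsystems and their interfaces are not described in the slide:

```python
from enum import Enum


class NodeState(Enum):
    IN_QUEUE = "in queue"
    DRAINING = "draining"
    REPAIRING = "repairing"


class Node:
    def __init__(self, name, running_jobs=0):
        self.name = name
        self.running_jobs = running_jobs
        self.state = NodeState.IN_QUEUE
        self.log = []


def handle_malfunction(node, repair):
    """Follow the slide's sequence after 'node malfunction detected'."""
    # Remove node from queue so no new jobs are scheduled on it.
    node.state = NodeState.DRAINING
    node.log.append("removed from queue")

    # Wait for running jobs; the decrement is a stand-in for real waiting.
    while node.running_jobs > 0:
        node.running_jobs -= 1
    node.log.append("running jobs finished")

    # Trigger repair (e.g. restart, reboot, reconfigure).
    node.state = NodeState.REPAIRING
    node.log.append("repair triggered")
    node_ok = repair(node)

    # Node OK detected: put the node back in the queue.
    if node_ok:
        node.state = NodeState.IN_QUEUE
        node.log.append("put back in queue")
    return node_ok
```

If the repair callable reports failure, the node stays in the `REPAIRING` state and out of the queue, which is one possible reading of the slide; what happens after a failed repair is not stated there.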

n° 5: How it all fits together (node autonomy)
[Diagram.] Labels: Cfg cache; Monitoring Buffer; Correlation engines; Node mgmt components; Automation; Monitoring Measurement Repository and Configuration Data Base (central, distributed); Buffer copy; Node profile.
Note: Local recover if possible (e.g. restarting daemons).
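One way to picture node autonomy is a node-local buffer whose contents are copied to the central repository, together with a local recovery attempt that does not need the central services. The sketch below is an assumption-laden illustration: `MonitoringBuffer`, `recover_locally`, and the metric and daemon names are invented for the example and are not WP4 interfaces.

```python
class MonitoringBuffer:
    """Node-local buffer: samples are kept on the node and copied centrally."""

    def __init__(self):
        self.samples = []
        self.copied = 0  # number of samples already copied to the repository

    def record(self, metric, value):
        self.samples.append((metric, value))

    def copy_to(self, repository):
        """Copy samples not yet sent; return how many were copied."""
        pending = self.samples[self.copied:]
        repository.extend(pending)
        self.copied = len(self.samples)
        return len(pending)


def recover_locally(daemons, restart):
    """Try to restart each daemon that is down; return those still down.

    `daemons` maps a daemon name to True (running) or False (down);
    `restart` is a callable returning True when the restart succeeded.
    """
    still_down = []
    for name, running in daemons.items():
        if not running and not restart(name):
            still_down.append(name)
    return still_down
```

Anything returned by `recover_locally` would be a candidate for escalation beyond the node; how that escalation works is outside what the slide shows.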

n° 6: What’s in R1.2 (and deployed)
- Gridification:
  - Library implementation of LCAS

n° 7: What’s in R1.2 but not used/deployed
- Resource management
  - Information provider for Condor (not fully tested because you need a complete testbed including a Condor cluster)
- Monitoring
  - Agent + first prototype repository server + basic linuxproc sensors
  - No LCFG object → not deployed
- Installation mgmt
  - LCFG light exists in R1.2. Please provide us feedback on any problems you have with it.

n° 8: Piled up software from R1.3, R1.4
- Everything mentioned here is ready, unit tested and documented (and rpms are built by autobuild)
  - Gridification
    - LCAS with dynamic plug-ins. (already in R1.2.1???)
  - Resource mgmt
    - Complete prototype enterprise level batch system management with proxy for PBS (see next slide). Includes LCFG object.
  - Monitoring
    - New agent. Production quality. Already used on CERN production clusters sampling some 110 metrics/node. Has also been tested on Solaris.
    - LCFG object
  - Installation mgmt
    - Next generation LCFG: LCFGng for RH6.2 (RH7.2 almost ready)

n° 9: Enterprise level batch system mgmt prototype (R1.3)
[Diagram. Legend: RMS components; PBS-, LSF-Cluster; Globus components.]
Labels: Grid; Local fabric; Gatekeeper (Globus or WP4); job 1, job 2, job n; JM 1, JM 2, JM n; new jobs; submit; user queue 1, user queue 2 (stopped, visible for users); execution queue (started, invisible for users); scheduled jobs; Scheduler; Runtime Control System; get job info; move job; exec job; queues; resources; Batch system: PBS, LSF, etc.
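The queue arrangement in this diagram can be modelled in a few lines: jobs land in user queues that are visible but stopped, and a scheduler moves them into an execution queue that is started but hidden from users. The class below is a sketch under that reading of the diagram; the `BatchSystem` name and the round-robin move policy are assumptions, not the prototype's actual scheduling logic.

```python
from collections import deque


class BatchSystem:
    def __init__(self, user_queue_names):
        # User queues: visible to users, but stopped (nothing runs from them).
        self.user_queues = {name: deque() for name in user_queue_names}
        # Execution queue: started, but invisible to users.
        self.execution_queue = deque()

    def submit(self, queue_name, job):
        """New jobs are submitted to a user queue."""
        self.user_queues[queue_name].append(job)

    def visible_jobs(self):
        """What a user sees: only the contents of the user queues."""
        return {name: list(q) for name, q in self.user_queues.items()}

    def schedule(self, free_slots):
        """Move up to free_slots jobs to the execution queue (round-robin, assumed)."""
        moved = []
        while len(moved) < free_slots and any(self.user_queues.values()):
            for queue in self.user_queues.values():
                if queue and len(moved) < free_slots:
                    job = queue.popleft()
                    self.execution_queue.append(job)
                    moved.append(job)
        return moved
```

Keeping the user queues stopped means the underlying batch system never starts a job on its own; only the move into the execution queue, decided by the scheduler, makes a job runnable.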

n° 10: Timeline for R2 developments
- Configuration management: complete central part of framework
  - High Level Definition Language: 30/9/2002
  - PAN compiler: 30/9/2002
  - Configuration Database (CDB): 31/10/2002
- Installation mgmt
  - LCFGng for RH7.2: 30/9/2002
- Monitoring: complete final framework
  - TCP transport: 30/9/2002
  - Repository server: 30/9/2002
  - Repository API WSDL: 30/9/2002
  - Oracle DB support: 31/10/2002
  - Alarm display: 30/11/2002
  - Open Source DB (MySQL or PostgreSQL): mid-December 2002

n° 11: Timeline for R2 developments
- Resource mgmt
  - GLUE info providers: 15/9/2002
  - Maintenance support API (e.g. enable/disable a node in the queue): 30/9/2002
  - Provide accounting information to WP1 accounting group: 30/9/2002
  - Support Maui as scheduler
- Fault tolerance framework
  - Various components already delivered
  - Complete framework by end of November

n° 12: Beyond release 2
- Conclusion from WP4 workshop, June 2002: LCFG is not the future for EDG (see WP4 quarterly report for 2Q02) because:
  - Inherent LCFG constraints on the configuration schema (per-component config)
  - LCFG is a project of its own and our objectives do not always coincide
  - We have learned a lot from the LCFG architecture and we continue to collaborate with the LCFG team
- EDG future: first release by end-March 2003
  - Proposal for a common schema for all fabric configuration information to be stored in the configuration database, implemented using the HLDL.
  - New configuration client and node management replacing the LCFG client (the server side will already be delivered in October).
  - New software package management (replacing updaterpms) split into two modules: an OS independent part and an OS dependent part (packager).

n° 13: Global schema tree
[Tree diagram.] Node labels, grouped as they appear: hardware, system, sw, cluster, …; CPU, harddisk, memory, …; sys_name, interface_type, size, …; hostname, architecture, partitions, services, …; hda1 (size, type, id), hda2, …; packages, known_repositories, edg_lcas, …; version, repositories, ….
Annotation: Component specific configuration.
The population of the global schema is an ongoing activity.

n° 14: Global schema example
SW repository structure (maintained by repository managers):
/sw/known_repositories/Arep/url = (host, protocol, prefix dir)
                           /owner =
                           /extras =
                           /directories/dir_name_X/path = (asis)
                                                  /platform = (i386_rh61)
                                                  /packages/pck_a/name = (kernel)
                                                                 /version = (2.4.9)
                                                                 /release = 31.1.cern
                                                                 /architecture = (i686)
                                       /dir_name_Y/path = (sun_system)
                                                  /platform = (sun4_58)
                                                  /packages/pck_b/name = (SUNWcsd)
                                                                 /version =
                                                                 /release =
                                                                 /architecture = (?)
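To make the path notation concrete, the example can be mirrored as a nested dictionary with a small path resolver. This is an illustration in Python, not the HLDL or the configuration database API; `lookup` and `packages_for_platform` are hypothetical helpers, and only values that appear on the slide are filled in (entries left empty on the slide are omitted).

```python
SCHEMA = {
    "sw": {
        "known_repositories": {
            "Arep": {
                "directories": {
                    "dir_name_X": {
                        "path": "asis",
                        "platform": "i386_rh61",
                        "packages": {
                            "pck_a": {
                                "name": "kernel",
                                "version": "2.4.9",
                                "release": "31.1.cern",
                                "architecture": "i686",
                            }
                        },
                    },
                    "dir_name_Y": {
                        "path": "sun_system",
                        "platform": "sun4_58",
                        "packages": {"pck_b": {"name": "SUNWcsd"}},
                    },
                }
            }
        }
    }
}


def lookup(tree, path):
    """Resolve a slash-separated schema path; raises KeyError if it does not exist."""
    node = tree
    for key in path.strip("/").split("/"):
        node = node[key]
    return node


def packages_for_platform(tree, repository, platform):
    """Return the package names a repository offers for one platform."""
    directories = lookup(tree, f"/sw/known_repositories/{repository}/directories")
    names = []
    for directory in directories.values():
        if directory["platform"] == platform:
            names.extend(pkg["name"] for pkg in directory["packages"].values())
    return names
```

For instance, `lookup(SCHEMA, "/sw/known_repositories/Arep/directories/dir_name_X/packages/pck_a/version")` returns the kernel version given on the slide, and `packages_for_platform(SCHEMA, "Arep", "sun4_58")` lists the single Solaris package.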

n° 15: Problem
- Very little of the delivered WP4 software is of any interest to EDG application WPs, possibly with the exception of producing nice colour plots of the CPU loads when a job was run…
- This is normal, but…
  - Site administrators do not grow on trees. Because of the lack of good system admin tools, like the ones WP4 tries to develop, the configuration, installation and supervision of the testbed installations require a substantial amount of manual work.
- However, thanks to Bob’s new priority list, the need for automated configuration and installation has bubbled up the required-features stack to become absolutely vital for assuring good quality.

n° 16: Summary
- Substantial amount of s/w piled up from R1.3, R1.4 to be deployed now
- R2 also includes two large components:
  - LCFGng: migration is non-trivial, but we already perform as much as possible of the non-trivial part ourselves, so TB integration should be smooth
  - Complete monitoring framework
- Beyond R2: LCFG is not the future for EDG WP4. First version of the new configuration and node management system in March 2003