LCG-1 Status and Issues
Ian Neilson, for the LCG Deployment Group, CERN
HEPiX 2003, Vancouver

Outline
– LHC Computing Grid (LCG): project, challenges, milestones
– Deployment Group: structure, goals
– Deployed software: history, status, configuration
– Deployment process: activities, tools, interactions
– Deployment status: milestones; sites, monitoring, communications
– Lessons from LCG-1: diversity, complexity
– What's next?

Our Customers
– LHCb: ~6-8 PetaBytes/year, ~10^8 events/year, ~10^3 batch and interactive users
(Federico Carminati, EU review presentation)

What is LCG?
LHC Computing Grid – project goal: prototype and deploy the computing environment for the LHC experiments, in 2 phases:
– Phase 1 (2002–2005)
  - Build a service prototype, based on existing grid middleware
  - Gain experience running a production grid service
  - Produce the Technical Design Report for the production system
– Phase 2 (2006–2008)
  - Build and commission the initial LHC computing environment

What LCG is NOT!
– LCG is NOT a development project

Grid Deployment Group
– Certification and Testing
  - Certification testbed administration
  - Packaging and release management
  - Integration and patching of grid middleware packages
– Grid Infrastructure Services
  - Configuration and deployment support for sites
  - Release coordination across sites
  - Administration of grid services at CERN: User Registration Service + CERN Certification Authority + EDG Virtual Organisation servers at NIKHEF
  - Work with the LCG Security Group (access policy, incident response, audit, etc.)
– Experiment Integration Support
  - Support integration of experiment applications with grid middleware
  - End-user documentation

LCG-1 Software
Current LCG-1 (LCG1-1_0_2) is:
– VDT (Globus 2.2.4)
– EDG WP1 (Resource Broker)
– EDG WP2 (Replica Management tools): one central RMC and LRC for each VO, located at CERN, Oracle backend
– Several bits from other WPs (config objects, information providers, packaging, …)
– GLUE 1.1 (information schema) + a few essential LCG extensions
– MDS-based Information System with LCG enhancements
– EDG components at approximately the edg-2.0 version
– LCG modifications:
  - Job managers to avoid shared-filesystem problems (GASS cache, etc.)
  - MDS → BDII LDAP (fault-tolerant Information System) – see the query sketch below
  - Globus gatekeeper enhancements (accounting/auditing records, log rotation)
  - Many, many bug fixes to EDG and Globus/VDT
2 further releases are planned before the end of the year.
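As an illustration only (not part of the release), the sketch below shows how a client could query the BDII information system over LDAP for GLUE 1.1 computing-element attributes. The endpoint, port and base DN are assumptions chosen for the example, not values taken from these slides.

```python
# Minimal sketch: query an LCG BDII (LDAP) for GLUE 1.1 CE information.
# Hostname, port and base DN are illustrative assumptions, not from the slides.
import ldap  # python-ldap

BDII_URL = "ldap://lcg-bdii.example.org:2170"   # hypothetical BDII endpoint
BASE_DN = "mds-vo-name=local,o=grid"            # typical LCG-style base DN

def list_computing_elements(url=BDII_URL, base=BASE_DN):
    conn = ldap.initialize(url)
    conn.simple_bind_s()  # anonymous bind; the information system is world-readable
    results = conn.search_s(
        base,
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueCE)",
        ["GlueCEUniqueID", "GlueCEStateStatus", "GlueCEInfoTotalCPUs"],
    )
    for dn, attrs in results:
        ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
        status = attrs.get("GlueCEStateStatus", [b"?"])[0].decode()
        cpus = attrs.get("GlueCEInfoTotalCPUs", [b"?"])[0].decode()
        print(f"{ce:60s} {status:10s} {cpus} CPUs")
    conn.unbind_s()

if __name__ == "__main__":
    list_computing_elements()
```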

Deployment Process
– Certification and testing
  - Software is first assembled on the Certification & Test testbeds: 6 at CERN + external sites (Taipei, Budapest, Wisconsin)
  - Installation and functionality tests (resolving problems found in the services)
  - Certification test suite almost finished
  - Software handed to the Deployment Team: adjustments to the configuration, release notes for the external sites, decision on time to release
– How do we deploy?
  - Service nodes (RB, CE, SE, …): LCFGng, sample configurations in CVS; we provide configuration files for new sites based on a questionnaire (see the sketch after this slide)
  - Worker nodes – the aim is to allow sites to use their existing tools as required:
    LCFGng (automated installation) – YES
    Instructions allowing system managers to use their existing tools – SOON
  - User interface:
    LCFGng – YES
    Installed on a cluster (e.g. lxplus at CERN) with LCFGng-lite – YES
    Instructions allowing system managers to use their existing tools – SOON
– LCFGng – Local ConFiGuration system (Univ. of Edinburgh + EDG WP4)
  - Central server publishes configuration to clients; client components handle local configuration
  - LCFGng-lite version available
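To make the questionnaire-to-configuration step concrete, here is a minimal sketch of the kind of helper that could render per-site configuration stubs from questionnaire answers before they are committed to CVS. The field names, template text and output layout are hypothetical illustrations; they are not the real LCFGng source format or the actual LCG questionnaire.

```python
# Sketch: render per-site configuration stubs from questionnaire answers.
# Field names, template text and output layout are hypothetical illustrations.
from pathlib import Path
from string import Template

SITE_TEMPLATE = Template(
    "/* $site_name - generated from questionnaire, review before use */\n"
    "#define SITE_NAME   $site_name\n"
    "#define CE_HOST     $ce_host\n"
    "#define SE_HOST     $se_host\n"
    "#define BDII_HOST   $bdii_host\n"
    "#define WN_COUNT    $wn_count\n"
)

def write_site_config(answers: dict, outdir: str = "site-cfg") -> Path:
    """Write a site configuration stub from questionnaire answers, return its path."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"site-cfg-{answers['site_name']}.h"
    path.write_text(SITE_TEMPLATE.substitute(answers))
    return path

if __name__ == "__main__":
    example = {
        "site_name": "example-tier2",       # hypothetical site
        "ce_host": "ce01.example.org",
        "se_host": "se01.example.org",
        "bdii_host": "bdii.example.org",
        "wn_count": 24,
    }
    print("wrote", write_site_config(example))
```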

Adding a Site
1) Site contacts us (LCG)
2) Leader of the GD group decides if the site can join (hours)
3) Site gets mail with pointers to documentation of the process
4) Site fills in the questionnaire
5) We, or the primary site, write the LCFGng config files and place them in CVS
6) Site checks out the config files, studies them, corrects them, asks questions, …
7) Site starts installing
8) Site runs first tests locally (described in the material provided; see the sketch after this slide)
9) Site maintains its config in CVS (helps us find problems)
10) Site contacts us or the primary site to be certified
   – Currently we run a few more tests; a certification suite is in preparation
   – Site creates a CVS tag
   – Site is added to the Information System (we currently lack a proper tool to express this hierarchy in the IS)
[Diagram: "LCG Instant – add computers & network": CERN → primary sites A, B → Tier 2 sites a–e]
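A rough sketch of the site-side part of steps 6–9: check the configuration out of CVS and verify that the expected files are present before starting the install. The CVSROOT, module name and file names are hypothetical placeholders, not the real repository layout.

```python
# Sketch of a site-side helper for steps 6-9: check out the site's configuration
# from CVS and verify the expected files exist before installing.
# CVSROOT, module name and file names are hypothetical placeholders.
import subprocess
from pathlib import Path

CVSROOT = ":pserver:anonymous@cvs.example.org:/cvs/lcg-site-config"    # assumption
MODULE = "site-config/example-tier2"                                   # assumption
EXPECTED = ["site-cfg.h", "ce.example.org.def", "se.example.org.def"]  # assumption

def checkout_and_check(workdir: str = ".") -> bool:
    # Fetch (or refresh) the site's configuration module from CVS.
    subprocess.run(["cvs", "-d", CVSROOT, "checkout", MODULE], cwd=workdir, check=True)
    site_dir = Path(workdir) / MODULE
    missing = [name for name in EXPECTED if not (site_dir / name).exists()]
    for name in missing:
        print("missing configuration file:", name)
    return not missing

if __name__ == "__main__":
    ok = checkout_and_check()
    print("configuration complete" if ok else "ask the primary site before installing")
```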

Support Services
– Operations support
  - RAL is leading the sub-project on developing distributed operations services
  - Initial prototype: basic monitoring tools; mail lists and rapid communications/coordination for problem resolution
  - Working on defining policies for operation and responsibilities (draft document)
  - Monitoring: GridICE (development of the DataTAG Nagios-based tools), GridPP job submission monitoring
– User support
  - FZK is leading the sub-project to develop distributed user support services
  - Draft user support policy
  - Web portal for problem reporting
  - Triage done by the experiments

Sites in LCG-1
[Map: snapshot of the participating sites, taken several days before this talk]

Deployment Status – 1
What we wanted – planned milestones for 2003:
– April: deploy candidate middleware on the C&T testbeds
– July: introduce the initial publicly available LCG-1 global grid service
  - 10 Tier 1 centres on 3 continents
– November: expanded resources and functionality for the 2004 computing data challenges
  - Additional Tier 1 centres, several Tier 2 centres – more countries
  - Expanded resources at the Tier 1s (e.g. at CERN, make the LXBatch service grid-accessible)
  - Agreed performance and reliability targets

Deployment Status – 2
What we got – history of 2003 so far:
– First set of reasonable middleware on the C&T testbed at the end of July (PLAN: April)
  - Limited functionality and stability
– Deployment started to the 10 initial sites
  - Focus not on functionality, but on establishing procedures
  - Getting sites used to LCFGng
– End of August: only 5 sites in
  - Lack of effort at the participating sites
  - Gross underestimation of the effort and dedication needed by the sites
  - Many complaints about complexity
  - Inexperience with (and dislike of) the install/config tool
  - Lack of a one-stop installation (tar, run a script and go)
  - Instructions with more than 100 words might be too complex/boring to follow
– First certified version, LCG1-1_0_0, released September 1st (PLAN: June)
  - Limited functionality, improved reliability
  - Training paid off: 5 sites upgraded (reinstalled) in 1 day (the last after 1 week…)
  - Security patch LCG1-1_0_1, the first unscheduled upgrade, took only 24h
– Sites need between 3 days and several weeks to come online; none so far is using a non-LCFGng setup
– Middleware was late

Deployment Status – 3
Up-to-date status can be seen here:
– Links to maps with the sites that are in operation
– Links to the GridICE-based monitoring tool (history of VOs' jobs, etc.), using information provided by the information system
– Tables with deployment status
Sites currently in LCG-1:
– PIC-Barcelona (RB), Budapest (RB), CERN (RB), CNAF (RB), FNAL, FZK, Krakow, Moscow (RB), Prague, RAL (RB), Taipei (RB), Tokyo
– Total number of CPUs: ~120 WNs
Expected by end of 2003:
– Sites to enter soon: BNL, (Lyon)
– Several Tier 2 centres in Italy and Spain
– Sites preparing to join: Pakistan, Sofia, Switzerland
Users (now): EDG Loose Cannons; experiments starting (ALICE, ATLAS, …)

Getting the Experiments On
Experiments are starting to use the service now
– Agreement between LCG and the experiments: the system has limitations, test what is there
– Focus on:
  - Testing with loads similar to production programs (long jobs, etc.)
  - Testing the experiments' software on LCG
– We don't want:
  - Destructive testing to explore the limits of the system with artificial loads (this can be done in scheduled sessions on the C&T testbed)
  - Adding experiments and sites rapidly in parallel – that is problematic
– Getting the experiments on one after the other
  - Limited number of users that we can interact with and keep informed

Lessons – 1
Many issues are specific to each site
– How many service machines, which services where, security, …
– History of the components, many config files – far too complex
– No tool to pack a site's config and send it to us
– Sites fight with firewalls
Sites without LCFGng (even the lite version) have severe problems
– We can't help much; dependencies on the base system installed
– The configuration is not understood well enough (by them, by us)
– Need a one-keystroke "Instant Grid" distribution (hard…)
– Middleware dependencies are too complex
Packaging is a big issue
– We cannot force installation tools on sites – they have their own already
– USA vs. Europe, rpm vs. tar, PACMAN vs. LCFG or other?

Lessons – 2
Debugging a site is hard
– Can't put the site remotely into a debugging mode
– The GLUE status variable covers the LRM's state
– Jobs keep on coming
– Discovering the other site's setup for support is hard
Testing is a big issue
– Different architectures, features, networking, interoperability, …
– Scalability will become the next *big thing*
Some sites are in contact with grids for the first time
– There is nothing like a "Beginner's Guide to Grids"
LCG is not a top priority at many sites
– Many sysadmins don't find time to work for several hours in a row
– Instructions are not followed correctly (shortcuts taken)
Time zones slow things down

What's Next?
Learn from the lessons; 2 incremental software releases in Q4 2003
Gain experience
– Long-running jobs, many jobs, complex jobs (data access, many files, …)
– Scalability test of the whole system with complex jobs
– Chaotic usage test (many users, asynchronous access, bursts)
– Tests of strategies to stabilize the information system under heavy load (we have several that we want to try as soon as more Tier 2 sites join)
– We need to learn how the systems behave when operated for a long time (in the past some services tended to "age" or "pollute" the platforms they ran on)
– We need to learn how to capture the "state" of services so they can be restarted on different nodes
– Learn how to upgrade systems (RMC, LRC, …) without stopping the service – you can't drain LCG-1 for upgrading
Prepare for the Q1–Q2 2004 data challenges
– 20 sites
– Resources promised: 5600 kSI2K (1 kSI2K ~ one 2.8 GHz P4), 1169 TB disk, 4223 TB tape, 120.0 FTE (see the worked numbers below)
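A small worked conversion of the promised capacity, using only the rule of thumb quoted on the slide (1 kSI2K ~ one 2.8 GHz P4); the per-site average is purely illustrative, since the resources are not evenly spread across sites.

```python
# Rough capacity arithmetic for the promised 2004 data-challenge resources,
# using the rule of thumb quoted on the slide (1 kSI2K ~ one 2.8 GHz P4).
promised_ksi2k = 5600
ksi2k_per_p4 = 1.0   # 1 kSI2K ~ one 2.8 GHz P4
sites = 20

p4_equivalents = promised_ksi2k / ksi2k_per_p4
print(f"~{p4_equivalents:.0f} P4-class CPUs in total")                      # ~5600
print(f"~{p4_equivalents / sites:.0f} P4-class CPUs per site on average")   # ~280
```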

Acknowledgement
Nearly all the material in this presentation has been culled from presentations and work by others. Thanks are extended to all.