Deployment Area Status Report
Ian Bird, LCG, IT Division
SC2 Meeting, 8 August 2003

Overview
- LCG-1 status
- Middleware
- Deployment and architecture
- Security
- Operations & support
- Milestone analysis
- Resources
- Plans for the remainder of 2003

LCG-1 Status
- Middleware release: we have a candidate LCG-1 release
  - Quite robust compared with previous EDG releases
  - The certification team and the EDG Loose Cannons are currently running tests
  - Some problems (broken new functionality), but no major showstoppers
  - Currently trying to stress the system to find its limits; it has not failed disastrously, but performance degrades (RB)
- Deployment:
  - Deployment of the previous tag has started to 10 Tier 1 sites
  - Going very slowly: only 5 sites up, slow responses; we expect 8 and are dubious about 2 sites
  - We want to start pushing out the LCG-1 release this week; this will be an upgrade
  - We hope experiments can start controlled testing before the end of August
- The two activities (certification and deployment) run in parallel

LCG-1 Release Candidate: Contents and Status

LCG-1 Contents
- Based on VDT
  - 2 changes from LCG: a gridftp bug fix and a fix for GRIS instability, not yet in the official VDT
  - Will soon move to a newer VDT release, which will be a converged US and LCG VDT
- EDG components
  - WP1: Resource Broker
  - WP2: Replica Management tools, including the EDG Replica Location Service (RLS)
  - WP4: gatekeeper and the LCFG installation tool
- Information services (see below)
  - MDS with EDG/LCG improvements and bug fixes
  - GLUE Schema v1.1 with extensions from LCG; information providers from WP3, WP4 and WP5
- Storage access
  - "Classical" EDG Storage Element: disk pools accessed via gridFTP; tested in conjunction with the Replica Management tools
  - Will add access to Castor and Enstore via gridFTP in the next few weeks, once the system is deployed

LCG-1 Contents (2)
- VDT
  - Excellent collaboration, very responsive to our needs, fast turnaround
- LCG contributions to VDT and EDG
  - Added accounting and security auditing (connecting the process, the batch system and the submitter)
  - Fixed the gatekeeper log file growing without bound by implementing a rotating-log scheme, avoiding gatekeeper restarts (a copy-truncate rotation of this kind is sketched below)
  - Incorporated MDS bug fixes from NorduGrid, improving the timeout handling at the same time; this allows LCG to deploy MDS on a bigger scale; found and fixed bugs in GRIS
  - New LCG versions of the job managers that do not need shared home directories
    - Issue raised by deploying LCG-0; solves scalability issues that prevented use of more than a few tens of worker nodes
  - Fixed a gass-cache inode leak: "dead" inodes were never removed, filling up the disk and eventually causing the service to crash
- It was a significant effort to put all of this together in a coherent way
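As an illustration of the rotating-log idea only (the actual fix lives in the gatekeeper code itself), the sketch below shows a copy-truncate rotation in Python: the file is copied aside and truncated in place, so the writing process keeps its open file descriptor and never needs a restart. The log path and retention count are hypothetical.

```python
# Sketch of copy-truncate log rotation: rotate a log without restarting the
# daemon that writes it.  Paths and retention are illustrative assumptions.
import os
import shutil
import time

def rotate_copytruncate(log_path: str, keep: int = 5) -> None:
    """Copy the log aside and truncate it in place.

    The writer's file descriptor stays valid, so no restart is needed.  If the
    writer opened the file with O_APPEND it continues at the new end of file;
    otherwise a sparse gap may appear (the usual copy-truncate caveat).
    """
    if not os.path.exists(log_path):
        return
    log_dir = os.path.dirname(log_path) or "."
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.copy2(log_path, f"{log_path}.{stamp}")
    with open(log_path, "r+b") as fh:
        fh.truncate(0)
    # Drop the oldest rotated copies beyond the retention count.
    rotated = sorted(name for name in os.listdir(log_dir)
                     if name.startswith(os.path.basename(log_path) + "."))
    for old in rotated[:-keep]:
        os.remove(os.path.join(log_dir, old))

if __name__ == "__main__":
    # Hypothetical gatekeeper log location, for illustration only.
    rotate_copytruncate("/var/log/globus-gatekeeper.log", keep=5)
```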

Certification
- Tests so far have been on the LCG certification testbed
  - A local set of clusters, no WAN; there has been no time yet for true WAN tests
- Testing is being done by LCG and the Loose Cannons
  - We now have a matrix of tests as a baseline and will not accept changes that break these tests (a minimal runner is sketched below)
  - We will do regression testing against future changes
  - We will push the baseline acceptance tests upwards, becoming more stringent as we evolve
- Certification testbed
  - Intended to reproduce and test the actual deployed configuration
    - Test the information system architecture
    - Test various installation options
    - Test various batch systems and jobmanagers
    - Test middleware functionality and robustness
  - Expanding now to Wisconsin and Hungary, and will also include CNAF, FNAL and Moscow (although Moscow does not have sufficient resources at the moment)
  - Stress the system and determine its limits in parallel with the deployed LCG-1
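A minimal sketch of what such a baseline test matrix and regression run could look like; the test names and commands are hypothetical placeholders, not the actual LCG certification suite. The point is simply that a release candidate is rejected if any baseline test fails.

```python
# Minimal sketch of a baseline test matrix and regression runner.
# Test names and commands are illustrative placeholders only.
import subprocess

BASELINE_MATRIX = {
    "rb-sequential-submit":  ["./tests/rb_sequential.sh"],
    "rb-parallel-submit":    ["./tests/rb_parallel.sh", "--streams", "20"],
    "replica-copy-register": ["./tests/rm_copy_register.sh"],
    "mds-info-consistency":  ["./tests/mds_consistency.sh"],
}

def run_matrix(matrix: dict) -> bool:
    """Run every baseline test; reject the candidate on any failure."""
    all_ok = True
    for name, cmd in matrix.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        ok = result.returncode == 0
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
        all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    raise SystemExit(0 if run_matrix(BASELINE_MATRIX) else 1)
```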

Certification Test Bed Architecture
[Diagram: six clusters with their own CEs, SEs and worker nodes; LCFGng and "lite" installs; LSF and Condor batch systems; two Resource Brokers and BDIIs; MDS servers; user interfaces (UI_1, UI_4); a proxy node; and RLS back ends on both MySQL and Oracle.]

Testing
- Progress in the last few weeks, with new Russian people involved
- We have the following tests, to define when LCG-1 can be deployed:
  - Sequential and parallel job submission (RB functionality)
  - Job storms (parametrizable); a driver is sketched below
    - normal
    - replica manager copy (gridFTP)
    - checksum (big files through the sandbox, with verification)
  - MDS testing suite
  - Replica Manager (simulates Monte Carlo production)
  - Globus functionality tests (through the VDT test suite)
  - Service node functionality tests
    - MDS x BDII coherence tests
    - LRC x RMC coherence tests
- Many of these are based on the EDG "stress tests" of ATLAS and CMS
- We still need to define a "site verification test suite", to be run as validation of a site installation before the site connects to the grid
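A minimal sketch of a parametrizable job-storm driver, assuming the standard edg-job-submit command-line client and a valid proxy; the JDL file name, stream count and jobs-per-stream are illustrative parameters, not the actual LCG test suite.

```python
# Sketch of a parametrizable "job storm": submit N jobs in S parallel streams
# through the Resource Broker and count how many submissions succeed.
# Assumes the EDG workload management CLI (edg-job-submit) is installed;
# the JDL file and the parameters below are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor

JDL_FILE = "hello.jdl"       # hypothetical test JDL
STREAMS = 20                 # parallel submission streams
JOBS_PER_STREAM = 50         # 20 x 50 = 1000 jobs, as in the RB test

def submit_one(_: int) -> bool:
    result = subprocess.run(["edg-job-submit", JDL_FILE],
                            capture_output=True, text=True)
    return result.returncode == 0

def run_storm() -> None:
    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        outcomes = list(pool.map(submit_one, range(STREAMS * JOBS_PER_STREAM)))
    failed = outcomes.count(False)
    print(f"submitted {len(outcomes)} jobs, {failed} submission failures")

if __name__ == "__main__":
    run_storm()
```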

LCG-1 Status (test results)
- Resource Broker:
  - 1000 jobs in 20 parallel streams: 3 failed (1 in submission, which would normally work with auto-retry; 2 for local, non-grid reasons)
  - 500 long jobs (such that the proxy expired) in a single stream: 10 failed, but this was due to 1 job that put Condor-G into a strange state; under investigation, but hard to reproduce, and subsequent runs did not fail
  - New functionality (output data file) created bad JDL
  - The RB can use all the CPU of a 2 x 800 MHz machine, so a large machine is needed for the RB (we are trying to understand why); it did not fail, performance just degrades
- Replica Management:
  - A large variety of tests done: 1% failure of copy/register/replicate between 4 sites (due to a known problem in BDII, under investigation)
  - 60 parallel streams replicating 10 x 1 GB files worked without problems
  - Some new functionality did not work (block copy and register)
  - Oracle was used as the back-end service
- Combined functionality:
  - Matchmaking requiring files works (but was not stressing the system)

Middleware Expectations
- Middleware developments LCG would like in 2003
  - The cut-off for these is October, in order to have the system ready for 2004
  - EDG's further developments will finish in September
- R-GMA:
  - Multiple registries are essential for a full production service; this work still has to be done and may not be simple
  - It will not be started until the initial R-GMA is stable
  - If not delivered, we have a single point of failure/bottleneck, or fall back to MDS (**)
- RLS:
  - We need the RLI component to have distributed LRCs
  - The code is ready and is being stress-tested by the developers
  - The fallback is a single global catalogue, but that is a single point of failure/bottleneck
- RLS proxy service for Replica Management:
  - Essential to remove the requirement of IP connectivity on worker nodes
  - If not delivered, this limits the sites that can run LCG and the resources that can be included (**)
- VOMS (VO management service):
  - The service itself is ready and tested
  - Needs integration with the CE and storage, which must be done by the developers of those services
  - The fallback is to continue as now, with basic authorization via the grid-mapfile
- GFAL (Grid File Access Library):
  - An LCG development
  - A prototype is available; a production version is expected in August

Middleware Expectations, But...
- Based on the last 6 months' experience, it seems unlikely that we will get much more than bug fixes from EDG now
- Desirables:
  - gcc 3.2
  - Globus 2.4 (i.e. an update via VDT)
  - R-GMA: we may get a first implementation, but it will have a single registry
    - Comparisons between MDS and R-GMA are essential
  - Replica Management: the RLI implementation and the proxy service
    - We also need to align the different RLS versions (Globus-EDG and EDG)
- Priorities:
  - The proxy service helps with clusters that have no outgoing connectivity
  - Aligning the RLS versions avoids having two RLS universes
    - There is a plan to converge, but it means we do not get the RLI and the proxy service this year

Middleware Development
- Recent experience has shown that it is very difficult for the EDG developers to bug-fix "LCG" in parallel with developing EDG
- We have agreed a process with EDG
- The current LCG release started from a consistent EDG tag
  - We have made specific LCG fixes; it happened that these did not have dependencies between packages
  - LCG-1 is therefore not a consistent EDG tag
- Once EDG 2.0 has been released:
  - Re-align the LCG release with an EDG tag
  - Branch the CVS repository
    - A production branch (LCG-1) for bug fixes
    - A development branch for EDG
- We will not accept anything that does not pass our current baseline test matrix on the certification testbed

Deployment
- Status
  - Deployment has started to the initial 10 Tier 1 sites (CERN, FNAL, BNL, CNAF, RAL, Taipei, Tokyo, FZK, IN2P3, Moscow); Hungary is also ready to join immediately
  - It started several weeks ago, with sites asked to set up LCFG (different to LCG-0) and complete installation of an earlier release, with the intent of deploying the LCG-1 release as an update
  - Many sites have been very slow to respond and do the installation
  - The LCG-1 release is now prepared for deployment and roll-out will start next week
- Caveats
  - The first deployment insisted on PBS as the batch system; we suggest sites add a CE for their favourite batch system and migrate. LSF, PBS and Condor work, while FBSng and BQS require a small amount of local modifications
  - The first deployment forced a full LCFGng install (i.e. including the OS); the real LCG-1 distribution has a fully tested "lite" version (install on top of an existing OS)
- Deployment status pages
  - Site web pages and a general status page (see the LCG main web page)
  - The real status is the monitoring system

Deployed Sites
- Tier 1: CERN, CNAF, RAL, FNAL, Taipei, FZK, IN2P3, BNL, Russia (Moscow), Tokyo
- Tier 2: Legnaro (INFN)
- LCG-0: spring deployment of pre-production middleware based on VDT and the previous EDG version (1.4.x)
- LCG-1pre: deployment of the full system in preparation for installing the final LCG-1 release; full installation procedure using the LCFGng tools; pre-release tag
- LCG-1: the initial LCG-1 release, with tested middleware based on VDT and EDG 2.0 components
[Table in the original slide showed per-site status for LCG-0 and LCG-1pre, keyed as done / started / unknown]

Deployed System Architecture
- RLS
  - A single LRC per VO, run at CERN with Oracle back ends
  - When the RLI is introduced, we propose to run an LRC with Oracle at all Tier 1s (agreed in principle by the GDB)
    - Tests have started at Taipei, FNAL and RAL
- VO services
  - Run by NIKHEF for the experiments
  - LCG at CERN for the LCG-1 VO (where users sign the usage rules) and dteam
  - Registration web server at CERN
- Configuration and installation servers
  - Run at CERN
- Batch system
  - Begin with PBS (the most tested)
  - Add a parallel CE for LSF/Condor/FBSng/BQS and migrate
  - Start with a few worker nodes only; add more when the service is stable
- All sites run a disk SE and a UI
  - Most run 1 RB; CERN will run 2
  - At CERN the UI is available on AFS and LXplus

Deployed System Architecture (diagram)
[Diagram: services at CERN (per-VO RLS with RMC and LRC for ALICE, ATLAS, CMS, LHCb and the LCG-Team; LCG proxy; UIs for AFS users and on LXplus; RB-1 and RB-2; disk SE; CEs with PBS and LSF worker nodes; LCG registration server; LCG CVS server), services at other sites (UI, proxy, RB, disk SE, CE with PBS or the site's favourite batch system), and VO servers for ATLAS, CMS and LHCb at NIKHEF with the LCG-Team VO at CERN.]

Information System Architecture
- LCG-1 uses MDS
  - The top level is the BDII (static interface), modified to get regular and frequent updates
- Each site will run a site GIIS; on day one this will run on one of the CEs
- The site GIIS registers to two or more regional GIISes
  - These will be well known and included in the configuration that we distribute
- The BDII system has been modified by LCG to handle multiple regions and to react if one instance of a regional GIIS fails
- The problem of stale information has been limited by repopulating and swapping the LDAP trees
- Every site that runs an RB will run its own BDII
- There is room to improve the way this system works via small modifications to the RB
  - (not requesting DNs, using alternate multiple BDIIs, ...)
  - These changes can be handled after we gain experience with the first release(s)
  - In addition, we can try to register the GRISes directly with the regional GIISes to see whether this improves reliability
- US grid sites (non-LCG) will likely use the work on MDS that we have done
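As a rough illustration of how a client reads this information system, the sketch below performs an anonymous LDAP search against a BDII (or a site GIIS) using the python-ldap module. The host name, port, base DN and the choice of GLUE attributes are assumptions made for illustration; the real values depend on the deployed LCG-1 configuration.

```python
# Minimal sketch: anonymously query a BDII/GIIS over LDAP and list CEs.
# Endpoint, base DN and attribute choices are illustrative assumptions.
import ldap  # python-ldap

BDII_URL = "ldap://bdii.example.org:2170"   # hypothetical endpoint
BASE_DN = "o=grid"                          # assumed base of the LDAP tree

def list_computing_elements() -> None:
    conn = ldap.initialize(BDII_URL)
    conn.simple_bind_s()  # anonymous bind; the information is world-readable
    results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                            "(objectClass=GlueCE)",
                            ["GlueCEUniqueID", "GlueCEInfoTotalCPUs"])
    for _dn, attrs in results:
        ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
        cpus = attrs.get("GlueCEInfoTotalCPUs", [b"?"])[0].decode()
        print(f"{ce}  totalCPUs={cpus}")
    conn.unbind_s()

if __name__ == "__main__":
    list_computing_elements()
```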

LCG-1 First-Launch Information System Overview
[Diagram: CE and SE GRISes at sites A-D register with their site GIIS, which registers with primary and secondary regional GIISes (RegionA1/A2, RegionB1/B2); BDII A and BDII B hold LDAP snapshots (dataCurrent/dataNew), which the RB queries.]
- While serving data from one directory, the BDII queries the regional GIISes to fill another directory structure; when this has finished, the BDII is stopped, the directories are swapped, and the BDII is restarted. The restart takes less than 0.5 s.
- To improve availability during the restart, it was suggested (David) that the TCP port be switched off and the TCP protocol left to take care of the retry; this has to be tested.
- Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the regional GIISes.
- Using multiple BDIIs requires RB changes.
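A minimal sketch of the populate-then-swap cycle just described, assuming the OpenLDAP ldapsearch client and an external command to restart the BDII's LDAP server. The GIIS endpoints, directory paths and restart command are hypothetical placeholders, not the actual LCG BDII implementation.

```python
# Sketch of the BDII refresh cycle: dump each regional GIIS into a fresh
# directory while the running BDII keeps serving the old one, then swap the
# directories and restart the server.  Endpoints, paths and the restart
# command are illustrative assumptions.
import os
import shutil
import subprocess

REGIONAL_GIISES = [              # hypothetical regional GIIS endpoints
    ("ldap://west1-giis.example.org:2170", "mds-vo-name=west,o=grid"),
    ("ldap://east1-giis.example.org:2170", "mds-vo-name=east,o=grid"),
]
DATA_CURRENT = "/var/lib/bdii/dataCurrent"
DATA_NEW = "/var/lib/bdii/dataNew"
RESTART_CMD = ["/etc/init.d/bdii", "restart"]   # placeholder restart command

def refresh_bdii() -> None:
    # 1. Populate a clean "new" directory by dumping each regional GIIS as LDIF.
    shutil.rmtree(DATA_NEW, ignore_errors=True)
    os.makedirs(DATA_NEW)
    for i, (url, base) in enumerate(REGIONAL_GIISES):
        dump = subprocess.run(
            ["ldapsearch", "-x", "-LLL", "-H", url, "-b", base],
            capture_output=True, text=True, timeout=60)
        with open(os.path.join(DATA_NEW, f"region{i}.ldif"), "w") as fh:
            fh.write(dump.stdout)
    # 2. Swap the directory trees; the old data stays served until this point.
    old = DATA_CURRENT + ".old"
    os.rename(DATA_CURRENT, old)
    os.rename(DATA_NEW, DATA_CURRENT)
    os.rename(old, DATA_NEW)
    # 3. Restart the LDAP server on the new data (sub-second in practice).
    subprocess.run(RESTART_CMD, check=False)

if __name__ == "__main__":
    refresh_bdii()
```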

LCG-1 First-Launch Information System Structure
[Diagram: the same hierarchy with a single BDII A: site GIISes (sites A-D, each aggregating CE and SE GRISes) register with primary and secondary regional GIISes (RegionA1/A2, RegionB1/B2), which BDII A queries on behalf of the RB.]

LCG-1 First-Launch Information System: Sites and Regions
- A region should not contain too many sites, since we have observed problems with MDS when a large number of sites is involved
- To allow for future expansion without making the system too complex, we suggest starting with two regions and, if needed, splitting later into smaller regions
- The regions are West and East of 0 degrees longitude
  - The idea is to have a large region and a small one and see how they work
- For the West, 2 regional GIISes will be set up at the beginning; for the East, 3 (a possible encoding of this layout is sketched below)
  - West (WEST1, WEST2 regional GIISes): RAL, FNAL, BNL
  - East (EAST1, EAST2, EAST3 regional GIISes): CERN, CNAF, Lyon, Moscow, FZK, Tokyo, Taipei
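The region layout could be distributed with the release as simple configuration data; the sketch below is purely illustrative (the regional GIIS host names are hypothetical), showing which regional GIISes a given site GIIS would register with.

```python
# Illustrative encoding of the two-region layout above.  Site lists come from
# the slide; the regional GIIS host names are hypothetical placeholders.
REGIONS = {
    "west": {
        "giises": ["west1-giis.example.org", "west2-giis.example.org"],
        "sites": ["RAL", "FNAL", "BNL"],
    },
    "east": {
        "giises": ["east1-giis.example.org", "east2-giis.example.org",
                   "east3-giis.example.org"],
        "sites": ["CERN", "CNAF", "Lyon", "Moscow", "FZK", "Tokyo", "Taipei"],
    },
}

def giises_for_site(site: str) -> list:
    """Return the regional GIISes a given site GIIS should register with."""
    for region in REGIONS.values():
        if site in region["sites"]:
            return region["giises"]
    raise KeyError(f"unknown site: {site}")

if __name__ == "__main__":
    print(giises_for_site("CERN"))
```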

Security
- The security group led by Dave Kelsey has been very active; it includes many sites and experiment representatives
  - It also includes representatives of sites that overlap with LCG
- It has put in place the agreements and infrastructure needed for LCG-1
- It is actively planning the security policy and the implementation plan for 2004
- An incident response group has been set up, as well as a contacts list
- The next few slides are from Dave Kelsey's report to the July GDB; the numbers refer to GDB documents

Rules for Use of LCG-1 (#36)
- To be agreed to by all users (signed via a private key in the browser) when they register with LCG-1
- Deliberately based on the current EDG usage rules
  - Does not override site rules and policies
  - Only allows professional use
- Once discussions on changes start, there is a chance we will never converge!
  - We know that the rules are far from perfect
- Are there major objections today?
  - One comment says we should define the list of user data fields (as agreed at the last GDB)
- Use them now and work on a better version for January 2004
  - Consult lawyers?

Audit Requirements (#37)
- UI: none
- RB: none; look later at the origin of job submission
- CE: the gatekeeper maps the DN to a local account; keep the gatekeeper and jobmanager logs
- SE/GridFTP: keep input and output data transfer logs
- Batch system: jobmanager logs (or batch system logs)
- Need to trace process activity: pacct logs
  - These are large
  - Central storage of all log files, rather than on the WN?
- Logs to be kept for at least 90 days by all sites (see the retention sketch below)
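A minimal sketch of the 90-day retention requirement, assuming the relevant logs have been gathered centrally under one directory; the path and the use of file modification time as the age criterion are illustrative assumptions, not a prescribed LCG procedure.

```python
# Sketch: enforce a 90-day minimum retention window on centrally collected
# audit logs by deleting files older than the cut-off.  The directory path is
# a hypothetical example; real sites would adapt this to their layout.
import os
import time

AUDIT_LOG_ROOT = "/var/log/grid-audit"   # hypothetical central log area
RETENTION_DAYS = 90

def purge_old_logs(root: str, retention_days: int) -> None:
    cutoff = time.time() - retention_days * 24 * 3600
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Delete only files whose last modification is beyond the window.
            if os.path.getmtime(path) < cutoff:
                os.remove(path)

if __name__ == "__main__":
    purge_old_logs(AUDIT_LOG_ROOT, RETENTION_DAYS)
```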

Incident Response (#38)
- Procedures for the LCG-1 start (before the GOC exists)
  - Incidents, communications, enforcement, escalation, etc.
- The party discovering an incident is responsible for:
  - Taking local action
  - Informing all other security contacts
- It is difficult to be precise at this stage; we have to learn!
- We have created an operational security list (before the GOC)
  - The default entry per site is the contact person, but an operational list would be better; LCG-1 sites need to refine and improve this
- All sites must buy in to the procedures

User Registration & VO Management (#39)
- A user registers once with LCG-1
  - Accepts the user rules
  - Gives the agreed set of personal data (last GDB)
  - Requests to join one VO/experiment
- We need robust VO registration authorities to check that:
  - The user actually made the request
  - The user is a valid member of the experiment
  - The user is at the listed institution
  - All user data looks reasonable (e.g. the mail address)
- The web form will warn that these checks will be made
- User data is distributed to all LCG-1 sites

User Registration Aims
- To provide LCG-1 with accurate information about users, for:
  - Pre-registration of accounts (where needed)
  - Auditing (legal requirements)
- To ensure VO managers do the appropriate checks
  - To allow LCG-1 sites to open resources to the VO
- BUT the current procedures have limited resources
  - To some extent this has to be "best efforts"; e.g. do we need backup VO managers?

VO Registration (2)
- Today's VO managers:
  - ALICE: Daniele Mura (INFN)
  - ATLAS: Alessandro De Salvo (INFN)
  - CMS: Andrea Sciaba (INFN)
  - LHCb: Joel Closier (CERN)
  - DTEAM: Ian Neilson (CERN)
- The plan is to continue using the existing VO servers and services (run by NIKHEF) and the current VO managers (all agree to continue)
  - DTEAM is run at CERN

VO/Experiment Registration Authority
- For the LCG-1 start, the VO manager checks a request via one of:
  - Direct personal knowledge or contact (not email)
  - A check in an official CERN or experiment database
  - The official experiment contact person at the employing institute
    - Signed email? (not done today)
- Identity and employing institute are the critical checks
- VO managers/LCG registrar to maintain a list of institutes and contact persons
- Work is needed on more robust procedures for 2004
  - Procedures that can scale, perhaps with distributed RAs?

Operations
- A prototype operations centre has been set up by RAL
- It uses several monitoring tools
  - GridIce (INFN): significant effort by the INFN and CERN groups to set it up for LCG-1
- A task force has been set up to define how this will evolve
  - Requirements
  - Tools
  - ...

Monitoring "Dashboard"

Operations & Monitoring

Support
- An initial user support prototype has been implemented by FZK
- This will evolve over time
- The agreement is that initial problem triage will be done by the experiments' support teams
  - Experiment experts will submit problems to LCG support
- The next few slides are from Klaus-Peter Mickel's presentation to the PEB
- A user guide and an installation guide are available as drafts

The Support Model: Three Levels
- Customer/Experiment level (problem oriented and information oriented):
  - Submit a problem; track a problem
  - Ask for current grid status, documentation, training
- Support level: at least three identical support centres, each with:
  - A helpdesk application
  - User, ticket and resource databases
  - A knowledge base
  - An on-call service outside working hours
- Local operations level: at the Central Grid Operation Center and at each T-1-C (and also at each T-2-C?), e.g.:
  - Problem solving
  - Maintenance
  - Local services
  - Resource management
  - Preventive activities
  - Problem announcements


Deployment Milestones
- Recent
  - Initial middleware delivery for LCG-1 (due 30/4/03)
    - Was not met; now delivered (~31/7/03)
    - LCG contributed 2 FTE to assist the integration process
    - LCG decided to use MDS as the information system
  - Implement the LCG-1 security model (due 30/6/03)
    - Was met (see above)
  - Prototype operations and support service (due 30/6/03)
    - Met for support (FZK)
    - Now met for operations (5/8/03, see above): RAL plus the GD team
  - Deploy LCG-1 to 10 Tier 1 sites (due 15/7/03)
    - Is late, but in progress now; expect completion by 31/8/03
  - Experiment verification of LCG-1 (due 31/7/03)
    - Is late; it cannot happen before LCG-1 is deployed and will start around the end of August
- Upcoming
  - Middleware functionality complete (30/9/03)
    - This will be a cut-off for significant new functionality available from EDG
  - Job execution model defined (30/9/03)
    - Will specify how LCG-1 will be usable

Resources
- 6 INFN fellows have been recruited
  - Starting September/October
  - 3 will work on experiment integration
  - 3 will work on certification, debugging and deployment
- The FZK post is being filled
  - Starting October?
  - To work on deployment, service operation and troubleshooting
- 1 Portuguese trainee has started
  - Grid systems administrator
- Moscow group
  - Have had 2 people (3-month rotations to CERN) working on testing (RLS and R-GMA); this will be ongoing, and the effort in Moscow is building up
- Taipei
  - 1 more physicist joined us on 1/8/03 for 1 year, working on deployment
  - 3-month rotations of 2 people (2 here now, 2 more arriving in September)
    - 1 working on the Oracle/RLS installation, 1 on the GOC task force, with the goal of building a GOC in Taipei

Lessons Learned
- We must have a running service and must keep it running; this is the only basis on which to progress and evolve
- Big-bang integration (à la EDG) is unworkable and must not be carried into EGEE
  - We must have a development service in parallel with the production service, on which we verify incremental changes and back them out if they do not work
- Sites are not honest about available staffing
  - Bits of 1 overworked person are not equivalent to 2 FTE, even allowing for vacations
  - The committed-resources figures count many dedicated FTE at most sites; this is clearly not true, and we must adjust them to reflect reality
  - The buy-in commitment to LCG-1 was a minimum of 2 FTE as well as machines
- Every site seems to over-commit resources; this is a real problem which we must resolve if we want to operate a service

Summary
- The middleware for LCG-1 is ready
  - The tests that we and the Loose Cannons have done are promising
- Deployment of the pre-cursor release (for configurations etc.) has been completed at 5 of 10 sites (expect FNAL and RAL today?)
- Deployment of LCG-1 to the 10 sites will start next week
  - It will take a few days for already-configured sites, longer for the others
- Expect experiments to have access in mid-August
  - In a controlled way at first, to monitor problems
- Planning for the next steps (expansion, features) is in hand
  - Once the Tier 1s are up and stable, we would like to start adding Tier 2s and other countries as they are ready

Potential Issues for Deployment
- Grid 3
  - It is not entirely clear what its relationship to LCG is and how it will affect the deployment of LCG middleware and services in the US
- Middleware support for the next year
  - For the EDG work packages we are assuming EGEE or institutional commitments, but this is not yet clear