1 CERN Deployment Area Status Report Ian Bird LCG, IT Division SC2 Meeting 8 August 2003

2  Overview
– LCG-1 Status
– Middleware
– Deployment and architecture
– Security
– Operations & Support
– Milestone analysis
– Resources
– Plans for remainder of 2003

3  LCG-1 Status
Two activities in parallel:
Middleware release: we have a candidate LCG-1 release!
– Quite robust compared with previous EDG releases
– Certification team & EDG Loose Cannons currently running tests
– Some problems (broken new functionality), but no major showstoppers
– Currently stress-testing to find limitations – the system has not failed disastrously; rather, performance degrades (RB)
Deployment:
– Deployment of the previous tag has started to 10 Tier 1 sites
– Going very slowly – only 5 up, slow responses; expect 8, dubious about 2 sites
– Want to start pushing out the LCG-1 release this week; this will be an upgrade
– Hope experiments can start controlled testing before the end of August

4  LCG-1 Release Candidate – Contents and Status

5  LCG-1 Contents
Based on VDT 1.1.8-9
– 2 changes from LCG – a gridftp bug fix and a fix for GRIS instability – not yet in the official VDT
– Will soon move to VDT 1.1.9 (or 1.1.10), which will be a converged US and LCG VDT
EDG components
– WP1 – Resource Broker
– WP2 – Replica Management tools, including the EDG Replica Location Service (RLS)
– WP4 – gatekeeper, and the LCFG installation tool
Information Services (see below)
– MDS with EDG/LCG improvements and bug fixes
– GLUE Schema v1.1 with extensions from LCG; information providers from WP3, 4, 5
Storage access
– "Classical" EDG Storage Element: disk pools accessed via gridFTP (a transfer sketch follows this slide); tested in conjunction with the Replica Management tools
– Will add access to Castor and Enstore via gridFTP in the next few weeks, once the system is deployed
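As a minimal illustration of how a file lands on a classical disk-pool SE, the sketch below drives a gridFTP transfer with the globus-url-copy client shipped in Globus/VDT. It assumes a valid grid proxy already exists (grid-proxy-init); the SE host name and destination path are hypothetical examples, not values from this release.

```python
# Minimal sketch: copy a local file to a classical disk SE over gridFTP.
# Assumes a valid grid proxy exists and that the globus-url-copy client
# from VDT is on the PATH. Host and destination path are example values.
import subprocess

def copy_to_se(local_path, se_host, se_path):
    """Transfer local_path to gsiftp://se_host/se_path using globus-url-copy."""
    src = "file://" + local_path                     # becomes file:///...
    dst = "gsiftp://%s%s" % (se_host, se_path)
    # globus-url-copy <source URL> <destination URL>
    rc = subprocess.call(["globus-url-copy", src, dst])
    if rc != 0:
        raise RuntimeError("gridFTP transfer failed with exit code %d" % rc)

if __name__ == "__main__":
    # Hypothetical SE host and flat-file area.
    copy_to_se("/tmp/testfile.dat", "lxshare0384.cern.ch",
               "/flatfiles/SE00/dteam/testfile.dat")
```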

6  LCG-1 Contents – 2
VDT 1.1.8-9
– Excellent collaboration – very responsive to our needs, fast turnaround
LCG contributions to VDT and EDG
– Added accounting and security auditing (connection between the process, the batch system and the submitter)
– Fixed the gatekeeper log file growing without bound by implementing a rotating-log scheme, avoiding gatekeeper restarts (see the sketch after this slide)
– Incorporated MDS bug fixes from NorduGrid, improving the timeout handling at the same time; this allows LCG to deploy MDS on a bigger scale. Found and fixed bugs in GRIS
– New LCG versions of the job managers that do not need shared home directories – an issue raised by deploying LCG-0; this solves the scalability problem that prevented use of more than a few tens of worker nodes
– Fixed a gass-cache inode leak: "dead" inodes were never removed, filling up the disk and eventually causing the service to crash
It was a significant effort to put all of this together in a coherent way.
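The rotating-log idea mentioned above can be illustrated with a small copy-and-truncate sketch: the running gatekeeper keeps its open file descriptor, the current log contents are copied to a dated generation file, and the live log is truncated in place, so no restart is needed. This is only an illustration of the scheme, not the actual LCG patch; the log path and the number of retained generations are assumptions.

```python
# Illustrative copy-and-truncate log rotation: the writing process (here, the
# gatekeeper) keeps its open file descriptor, so no restart is needed.
# Note there is a small window between copy and truncate in which newly
# written lines could be lost; the real implementation may handle this better.
import os
import shutil
import time

LOG = "/var/log/globus-gatekeeper.log"   # example path, not necessarily the real one
KEEP = 7                                  # retain this many rotated generations

def rotate(log_path=LOG, keep=KEEP):
    stamp = time.strftime("%Y%m%d%H%M%S")
    rotated = "%s.%s" % (log_path, stamp)
    shutil.copy2(log_path, rotated)        # snapshot the current contents
    with open(log_path, "r+") as f:        # truncate in place; writer's fd stays valid
        f.truncate(0)
    # Drop the oldest generations beyond the retention limit.
    log_dir = os.path.dirname(log_path)
    prefix = os.path.basename(log_path) + "."
    generations = sorted(p for p in os.listdir(log_dir) if p.startswith(prefix))
    for old in generations[:-keep]:
        os.remove(os.path.join(log_dir, old))

if __name__ == "__main__":
    rotate()
```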

7  Certification
Tests so far have been on the LCG certification testbed
– Local set of clusters – no WAN
– There has been no time yet for true WAN tests
Testing is being done by LCG and the Loose Cannons
– We now have a matrix of tests as a baseline – will not accept changes that break these tests
– Will do regression testing against future changes
– Push the baseline acceptance tests upwards – more stringent as we evolve
Certification testbed
– Intended to reproduce and test the actual deployed configuration: information system architecture, various installation options, various batch systems and job managers, middleware functionality and robustness
– Expanding now to Wisconsin and Hungary, to include also CNAF, FNAL and Moscow (although Moscow do not have sufficient resources at the moment)
– Stress the system and determine its limits in parallel with the deployed LCG-1

8  Certification Test Bed Architecture (diagram)
Six clusters (LCFGng and "lite" installs) with UIs, RBs, BDIIs, MDS servers, CEs, SEs, a proxy node and worker nodes; LSF and Condor batch systems; RLS back ends on both MySQL and Oracle.

9  Testing
Progress in the last few weeks, with new Russian people involved.
We have the following tests – to define when LCG-1 can be deployed:
– Sequential & parallel job submission (RB functionality)
– Job storms (parametrizable): normal, replica manager copy (gridFTP), checksum (big files through the sandbox, with verification)
– MDS testing suite
– Replica Manager (simulates Monte Carlo production)
– Globus functionality tests (through the VDT test suite)
– Service node functionality tests: MDS x BDII coherence tests, LRC x RMC coherence tests
Many of these are based on the EDG "stress tests" of ATLAS and CMS. (A job-storm sketch follows this slide.)
We still need to define a "site verification test suite", to be run as validation of a site installation before the site connects to the grid.
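To make the "job storm" idea concrete, here is a minimal driver that fires a parametrizable number of parallel submission streams at a Resource Broker and then polls the jobs once. It assumes the EDG user-interface commands edg-job-submit and edg-job-status are installed and a valid proxy exists; the JDL contents, the stream and job counts, and the output parsing are illustrative assumptions, not the actual test-suite parameters.

```python
# Minimal job-storm sketch: N parallel streams, each submitting M simple jobs
# through the EDG UI commands. Stream and job counts are example parameters.
import subprocess
import tempfile
import threading

JDL = """
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

def submit_stream(stream_id, jobs_per_stream, job_ids):
    jdl = tempfile.NamedTemporaryFile(suffix=".jdl", mode="w", delete=False)
    jdl.write(JDL)
    jdl.close()
    for _ in range(jobs_per_stream):
        # edg-job-submit prints the job identifier (an https://... URL) on success;
        # the parsing below is approximate.
        out = subprocess.run(["edg-job-submit", jdl.name],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if line.strip().startswith("https://"):
                job_ids.append(line.strip())

def run_storm(streams=20, jobs_per_stream=50):
    job_ids, threads = [], []
    for s in range(streams):
        t = threading.Thread(target=submit_stream, args=(s, jobs_per_stream, job_ids))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    print("submitted %d jobs" % len(job_ids))
    # Poll each job once; a real test loops until all jobs reach a final state.
    for jid in job_ids:
        subprocess.run(["edg-job-status", jid])

if __name__ == "__main__":
    run_storm(streams=20, jobs_per_stream=50)
```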

10  LCG-1 status
Resource Broker:
– 1000 jobs in 20 parallel streams – 3 failed (1 in submission, which would normally work with auto-retry; 2 for local, non-grid reasons)
– 500 long jobs (such that the proxy expired), single stream – 10 failed, but this was due to 1 job that put Condor-G in a strange state; under investigation, but hard to reproduce – subsequent runs did not fail
– New functionality (output data file) created bad JDL
– The RB can use all the CPU of a 2x800 MHz machine – need a large machine for the RB (trying to understand why) – but it did not fail, performance just degrades
Replica Management:
– Large variety of tests done – 1% failure of copy/register/replicate between 4 sites (due to a known problem in the BDII, under investigation)
– 60 parallel streams replicating 10 1 GB files worked without problem (see the sketch after this slide)
– Some new functionality did not work (block copy and register)
– Oracle was used as the back-end service
Combined functionality:
– Matchmaking requiring files works (but was not stressing the system)
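The replica-storm result above can be pictured with the following sketch, which runs several parallel streams that copy-and-register local files and then make a second replica. It assumes the EDG replica manager command-line interface (edg-rm) with the copyAndRegisterFile and replicateFile sub-commands as documented in contemporary user guides; the VO, SE host names and file names are hypothetical, and exact option syntax may differ in this release.

```python
# Illustrative replica storm: several parallel streams copy-and-register local
# files into the grid and then replicate them to a second SE. Assumes the EDG
# replica manager CLI (edg-rm) and a valid proxy; SE names, the VO and the
# exact sub-command syntax are assumptions, not values from the LCG-1 tests.
import subprocess
import threading

VO = "dteam"
SRC_SE = "lxshare0384.cern.ch"       # example close SE
DST_SE = "gppse05.gridpp.rl.ac.uk"   # example remote SE

def rm(*args):
    """Run one edg-rm command and return its exit status."""
    return subprocess.call(["edg-rm", "--vo", VO] + list(args))

def stream(stream_id, files_per_stream):
    for i in range(files_per_stream):
        lfn = "lfn:storm-%d-%d" % (stream_id, i)
        local = "file:///tmp/testfile.dat"   # pre-created test file
        # Copy the local file to the source SE and register it in the catalogue...
        rm("copyAndRegisterFile", local, "-d", SRC_SE, "-l", lfn)
        # ...then create a second replica on the destination SE.
        rm("replicateFile", lfn, "-d", DST_SE)

def run(streams=60, files_per_stream=10):
    threads = [threading.Thread(target=stream, args=(s, files_per_stream))
               for s in range(streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    run()
```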

11  Middleware expectations
Middleware developments LCG would like in 2003
– The cut-off for this is October, in order to have the system ready for 2004
– EDG further developments will finish in September
R-GMA:
– Multiple registries are essential for a full production service – this work still has to be done and may not be simple
– Will not be started until the initial R-GMA is stable
– If not delivered, we have a single point of failure/bottleneck, or MDS (**)
RLS:
– Need the RLI component to have distributed LRCs
– Code is ready; being stress-tested by the developers
– Fallback is a single global catalogue – but that is a single point of failure/bottleneck
RLS: proxy service for Replica Management
– Essential to remove the requirement on WN IP connectivity
– If not delivered, limits the sites that can run LCG, and the resources that can be included (**)
VOMS (VO management service)
– The service itself is ready and tested
– Needs integration with the CE and storage – must be done by the developers of those services
– Fallback is to continue as now, with basic authorization via the grid-mapfile
GFAL (Grid File Access Library)
– LCG development
– Prototype available; expect a production version in August

12  Middleware expectations – but
Based on the last 6 months' experience, it seems unlikely that we get much more than bug fixes from EDG now.
Desirables:
– gcc 3.2
– Globus 2.4 (i.e. an update via VDT)
– R-GMA: may get a first implementation, but it will have a single registry; comparisons between MDS and R-GMA are essential
– Replica Management: RLI implementation, proxy service; we also need to align the different RLS versions (Globus-EDG and EDG)
Priorities:
– The proxy service helps with clusters that have no outgoing connectivity
– Aligning the RLS versions avoids two RLS universes
– There is a plan to converge, but it means we don't get the RLI and proxy service this year

13  Middleware development
Recent experience has shown it is very difficult for the EDG developers to bug-fix "LCG" in parallel with developing EDG. We have agreed a process with EDG.
Currently the LCG release started from a consistent EDG tag
– Have made specific LCG fixes – it happened that these did not have dependencies between packages
– LCG-1 is not a consistent EDG tag
Once EDG 2.0 has been released
– Re-align the LCG release with an EDG tag
– Branch the CVS repository: a production branch (LCG-1) for bug fixes, a development branch for EDG
We will not accept anything that does not meet our current baseline test matrix on the certification testbed.

14  Deployment
Status
– Deployment has started to the initial 10 Tier 1 sites (CERN, FNAL, BNL, CNAF, RAL, Taipei, Tokyo, FZK, IN2P3, Moscow); Hungary is also ready to join immediately
– Started several weeks ago, with sites asked to set up LCFG (different from LCG-0) and complete installation of an earlier release, with the intent of deploying the LCG-1 release as an update
– Many sites have been very slow to respond and do the installation
– The LCG-1 release is now prepared for deployment; deployment will start next week
Caveats
– The first deployment insisted on PBS as the batch system – we suggest sites add a CE for their favourite batch system and migrate; LSF, PBS and Condor work, while FBSng and BQS require a small amount of local modification
– The first deployment forced a full LCFGng install (i.e. including the OS) – the real LCG-1 distribution has a fully tested "lite" version (install on top of an existing OS)
Deployment status pages
– Site web pages and a general status page (see the LCG main web page)
– The real status is the monitoring system

15  Deployed sites
(Table of per-site status for LCG-0 and LCG-1pre; the Done/Started status marks did not survive transcription.)
Tier 1: 0 CERN, 1 CNAF, 2 RAL, 3 FNAL, 4 Taipei, 5 FZK, 6 IN2P3 (?), 7 BNL, 8 Russia (Moscow), 9 Tokyo
Tier 2: 10 Legnaro (INFN) (N/A)
LCG-0: spring deployment of pre-production middleware, based on VDT 1.1.6 and the previous EDG version (1.4.x)
LCG-1pre: deployment of the full system, preparatory to installing the final release for LCG-1; full installation procedure using the LCFGng tools; pre-release tag
LCG-1: the initial LCG-1 release, with tested middleware based on VDT 1.1.8-9 and EDG 2.0 components
Legend: Done / Started / ? Unknown

16  Deployed System Architecture
RLS
– Single LRC per VO – run at CERN with Oracle back ends
– When the RLI is introduced, we propose to run an LRC with Oracle at all Tier 1s (agreed in principle by the GDB); tests started in Taipei, FNAL, RAL
VO Services
– Run by NIKHEF for the experiments
– LCG at CERN for the LCG-1 VO (sign user rules) and dteam
– Registration web server at CERN
Configuration and installation servers
– Run at CERN
Batch system
– Begin with PBS (most tested)
– Add a parallel CE for LSF/Condor/FBSng/BQS and migrate
– Start with a few WNs only – add more when the service is stable
All sites run
– Disk SE, UI
– Most run 1 RB; CERN will run 2
– (CERN) UI available on AFS and LXplus

17  Deployed system architecture (diagram)
Services at NIKHEF: VO servers for ALICE, ATLAS, CMS and LHCb.
Services at CERN: RLS (RMC & LRC) for each experiment and for the LCG-Team VO; LCG-Team VO server; LCG registration server; LCG CVS server; proxy; UIs for AFS users and on LXplus; two RBs; disk SE; CEs with PBS and LSF worker nodes.
Services at other sites: UI, proxy, RB, disk SE, and a CE with worker nodes running PBS or the site's favourite batch system.

18  Information system architecture
LCG-1 uses MDS
– The top level is the BDII (static interface), but modified to get regular and frequent updates
Each site will run a site GIIS; on day one this will run on one of the CEs.
The site GIIS registers to two or more regional GIISes.
– These will be well known and in the configuration that we distribute.
The BDII system has been modified by LCG to handle multiple regions and to react to one instance of a regional GIIS failing. The problem of stale information has been limited by repopulating and swapping the LDAP trees.
Every site that runs an RB will run its own BDII.
There is room to improve the way this system works via small modifications to the RB
– (not requesting DNs, using alternate multiple BDIIs, ...)
– These changes can be handled after we gain experience with the first release(s)
– In addition, we can try to register the GRISes directly with the regional GIISes to see how this improves reliability.
US grid sites (non-LCG) will likely use the work we have done on MDS. (A query sketch follows this slide.)
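Since the whole hierarchy is plain LDAP, a simple way to see what the information system publishes is an LDAP search against a BDII. The sketch below uses the third-party python-ldap module to list CEs and their free CPUs; the host, port, base DN and GLUE attribute names are assumptions based on the GLUE 1.1 schema and typical MDS/BDII settings, so the distributed configuration should be consulted for the real values.

```python
# Minimal sketch of querying the information system with python-ldap:
# ask a BDII for the GLUE CE entries and print the free-CPU count per CE.
# Host, port, base DN and attribute names are assumptions, not LCG-1 values.
import ldap

BDII_URL = "ldap://lxshare0222.cern.ch:2170"   # hypothetical BDII host/port
BASE_DN = "mds-vo-name=local,o=grid"           # assumed top of the tree

def list_free_cpus():
    con = ldap.initialize(BDII_URL)
    con.simple_bind_s()                        # anonymous bind, as MDS allows
    entries = con.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                           "(objectClass=GlueCE)",
                           ["GlueCEUniqueID", "GlueCEStateFreeCPUs"])
    for _dn, attrs in entries:
        ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
        free = attrs.get("GlueCEStateFreeCPUs", [b"?"])[0].decode()
        print("%-60s free CPUs: %s" % (ce, free))

if __name__ == "__main__":
    list_free_cpus()
```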

19  LCG-1 First Launch Information System Overview
(Diagram: CE and SE GRISes at each site register with the site GIIS; site GIISes register with primary and secondary regional GIISes; BDIIs query the regional GIISes and are in turn queried by the RB. Using multiple BDIIs requires RB changes.)
While using the data from one directory (/dataCurrent/...), the BDII queries the regional GIISes to fill another directory structure (/dataNew/...). When this has finished, the BDII is stopped, the directories are swapped, and the BDII is restarted; the restart takes less than 0.5 s. To improve availability during this window it was suggested (David) that the TCP port should be switched off so that the TCP protocol takes care of the retry – this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the regional GIISes. (The swap-and-restart cycle is sketched after this slide.)
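A minimal sketch of the repopulate-and-swap cycle described above, assuming the BDII's slapd is driven by an init-style script and the regional GIISes are dumped with the standard ldapsearch client. The paths, GIIS hosts and service-control command are assumptions for illustration, not the actual BDII scripts.

```python
# Illustrative BDII repopulate-and-swap cycle: dump the regional GIISes into a
# fresh directory with ldapsearch, then stop slapd, swap the data directories
# and restart. Paths, GIIS hosts and the service control command are examples.
import os
import subprocess

REGIONAL_GIISES = ["west1-giis.example.org", "east1-giis.example.org"]  # hypothetical
DATA_CURRENT = "/opt/bdii/dataCurrent"
DATA_NEW = "/opt/bdii/dataNew"
BASE_DN = "mds-vo-name=local,o=grid"

def repopulate():
    """Fill the 'new' directory with fresh LDIF dumps while the BDII keeps serving."""
    os.makedirs(DATA_NEW, exist_ok=True)
    for host in REGIONAL_GIISES:
        ldif = subprocess.run(
            ["ldapsearch", "-x", "-LLL", "-h", host, "-p", "2135", "-b", BASE_DN],
            capture_output=True, text=True).stdout
        with open(os.path.join(DATA_NEW, host + ".ldif"), "w") as f:
            f.write(ldif)
    # (A real BDII would now rebuild the slapd database from these LDIF dumps.)

def swap_and_restart():
    """The only downtime is the stop/swap/start window, under half a second."""
    subprocess.run(["/etc/init.d/bdii", "stop"])       # example service script
    os.rename(DATA_CURRENT, DATA_CURRENT + ".old")
    os.rename(DATA_NEW, DATA_CURRENT)
    os.rename(DATA_CURRENT + ".old", DATA_NEW)         # reused for the next cycle
    subprocess.run(["/etc/init.d/bdii", "start"])

if __name__ == "__main__":
    repopulate()
    swap_and_restart()
```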

20  LCG-1 First Launch Information System Structure (diagram)
Same registration hierarchy as the previous slide: site GRISes register with their site GIIS, site GIISes register with primary and secondary regional GIISes, and a BDII queries the regional GIISes on behalf of the RB.

21  LCG-1 First Launch Information System – Sites and Regions
A region should not contain too many sites, since we have observed problems with MDS when a large number of sites are involved. To allow for future expansion, but not make the system too complex, we suggest starting with two regions and, if needed, splitting later into smaller regions.
The regions are West and East of 0 degrees. The idea is to have a large region and a small one and see how they work.
At the beginning, 2 regional GIISes will be set up for the West (WEST1 and WEST2, serving RAL, FNAL and BNL) and 3 for the East (EAST1, EAST2 and EAST3, serving CERN, CNAF, Lyon, Moscow, FZK, Tokyo and Taipei).

22  Security
The security group led by Dave Kelsey has been very active; it includes many sites and experiment reps
– Also includes reps of sites that overlap with LCG
Has put in place the agreements and infrastructure needed for LCG-1
Is actively planning security policy, and the implementation plan for 2004
Has set up an incident response group as well as a contacts list
The next few slides are from Dave Kelsey's report to the July GDB; numbers refer to GDB documents 36-39 (http://cern.ch/lcg/Documents)

23  Rules for Use of LCG-1 (#36)
To be agreed to by all users (signed via private key in browser) when they register with LCG-1
Deliberately based on the current EDG Usage Rules
– Does not override site rules and policies
– Only allows professional use
Once discussions start on changes, there is a chance we never converge! We know that they are far from perfect.
Are there major objections today?
– One comment says we should define the list of user data fields (as agreed at the last GDB)
Use now and work on a better version for Jan 2004
– Consult lawyers?

24  Audit Requirements (#37)
UI: none
RB: none – look at later, for the origin of job submission
CE: gatekeeper maps DN to local account; keep gatekeeper and jobmanager logs
SE/GridFTP: keep input and output data transfer logs
Batch system: jobmanager logs (or batch system logs)
Need to trace process activity – pacct logs (these are large)
Central storage of all logfiles? Rather than on the WN
To be kept for at least 90 days by all sites (see the retention sketch after this slide)
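As an illustration of the 90-day retention requirement, the sketch below walks a central log archive and removes files older than the retention window (keeping everything at least that long). The archive layout and the idea of a periodic sweep are assumptions for illustration, not a prescribed LCG tool; a site may of course keep logs longer.

```python
# Illustrative retention sweep for centrally stored audit logs (gatekeeper,
# jobmanager, gridFTP transfer logs, pacct): delete archived files older than
# RETENTION_DAYS. The archive directory layout is an example only.
import os
import time

ARCHIVE_ROOT = "/var/spool/grid-audit-logs"   # example central archive location
RETENTION_DAYS = 90                            # minimum required by the policy

def sweep(root=ARCHIVE_ROOT, days=RETENTION_DAYS):
    cutoff = time.time() - days * 86400
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    print("removed %d log files older than %d days" % (removed, days))

if __name__ == "__main__":
    sweep()
```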

25  Incident Response (#38)
Procedures for the LCG-1 start (before the GOC)
– Incidents, communications, enforcement, escalation, etc.
The party discovering an incident is responsible for
– Taking local action
– Informing all other security contacts
Difficult to be precise at this stage – we have to learn!
We have created an ops security list (before the GOC)
– The default site entry is the contact person, but an operational list would be better; LCG-1 sites need to refine and improve this
All sites must buy in to the procedures

26  User Registration & VO Management (#39)
A user registers once with LCG-1
– Accepts the User Rules
– Gives the agreed set of personal data (last GDB)
– Requests to join one VO/experiment
We need robust VO Registration Authorities to check that
– The user actually made the request
– The user is a valid member of the experiment
– The user is at the listed institution
– All user data looks reasonable (e.g. mail address)
The web form will warn that these checks will be made
User data is distributed to all LCG-1 sites

27  User Registration aims
To provide LCG-1 with accurate information about users, for
– Pre-registration of accounts (where needed)
– Auditing (legal requirements)
To ensure VO managers do the appropriate checks
– To allow LCG-1 sites to open resources to the VO
BUT... the current procedures have limited resources
– To some extent this has to be "best efforts"; e.g. do we need backup VO managers?

28  VO Registration (2)
Today's VO managers:
– ALICE: Daniele Mura (INFN)
– ATLAS: Alessandro De Salvo (INFN)
– CMS: Andrea Sciaba (INFN)
– LHCb: Joel Closier (CERN)
– DTEAM: Ian Neilson (CERN)
Plan to continue to use the existing VO servers and services (run by NIKHEF) and the current VO managers (all agree to continue); a grid-mapfile sketch follows this slide
– DTEAM is run at CERN
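On the site side, membership of these VOs ultimately shows up as grid-mapfile entries that map certificate subjects to local accounts; in production this is generated from the VO servers by the edg-mkgridmap tool. The sketch below only illustrates the mapping format, taking DNs from a local text file and mapping them to a pooled account; the input file name, output path and pool-account name are assumptions.

```python
# Illustrative grid-mapfile writer: map a list of certificate subject DNs (one
# per line in an input file) to a VO pool account. In production this is done
# by edg-mkgridmap against the VO servers; the file names and the pool-account
# naming used here are assumptions for illustration only.
VO_POOL_ACCOUNT = ".dteam"          # leading dot: pool-account convention
DN_LIST = "dteam-members.txt"       # example export of VO member DNs
GRID_MAPFILE = "grid-mapfile"       # normally /etc/grid-security/grid-mapfile

def write_gridmap(dn_list=DN_LIST, mapfile=GRID_MAPFILE, account=VO_POOL_ACCOUNT):
    with open(dn_list) as src, open(mapfile, "w") as out:
        for line in src:
            dn = line.strip()
            if not dn or dn.startswith("#"):
                continue
            # grid-mapfile format: "<certificate subject>" <local account>
            out.write('"%s" %s\n' % (dn, account))

if __name__ == "__main__":
    write_gridmap()
```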

29  VO/Experiment RA
For the LCG-1 start, the VO manager checks a request via one of
– Direct personal knowledge or contact (not e-mail)
– A check in the official CERN or experiment database
– The official experiment contact person at the employing institute
– Signed e-mail? (not done today)
Identity and employing institute are the critical checks
VO managers/the LCG registrar will maintain a list of institutes and contact persons
Work is needed on more robust procedures for 2004
– That can scale – with distributed RAs?

30  Operations
An operations prototype has been set up by RAL
– Uses several monitoring tools
– GridIce (INFN) – significant effort by the INFN and CERN groups to set it up for LCG-1
A task force has been set up to define how this will evolve
– Requirements, tools, ...

31  Monitoring "Dashboard"

32  Operations & Monitoring

33  Support
An initial user support prototype has been implemented by FZK; it will evolve over time
The agreement is that initial problem triage will be done by the experiments' support teams
– Experiment experts will submit problems to LCG support
The next few slides are from Klaus-Peter Mickel's presentation to the PEB (http://agenda.cern.ch/fullAgenda.php?ida=a031492)
A user guide and an installation guide are available as drafts

34  The Support Model — three levels
Customer/Experiment Level (problem oriented / information oriented):
– Submit a problem, track a problem
– Ask for current Grid status, documentation, training
Support Level: at least three identical support centers, with
– Helpdesk application
– User, ticket and resource database
– Knowledge base
– On-call service outside working hours
Local Operations Level: at the Central Grid Operation Center and at each T-1-C (and also at each T-2-C?), e.g.:
– Problem solving, maintenance, local services, resource management, preventive activities, problem announcements

36  Deployment Milestones
Recent
– 1.4.1.1 Initial M/W delivery for LCG-1 (30/4/03): was not met – now delivered (~31/7/03); LCG contributed 2 FTE to assist the integration process; LCG decided to use MDS as the information system
– 1.4.1.2 Implement LCG-1 Security Model (30/6/03): was met (see above)
– 1.4.1.3 Prototype operations and support service (30/6/03): met for support (FZK); now met for operations (5/8/03, see above) – RAL + GD team
– 1.4.1.4 Deploy LCG-1 to 10 Tier 1 sites (15/7/03): is late, but in progress now – expect completion by 31/8/03
– 1.4.1.6 Experiment verification of LCG-1 (31/7/03): is late – cannot happen before LCG-1 is deployed – start around end of August
Upcoming
– 1.4.2.19 Middleware functionality complete (30/9/03): this will be a cut-off for significant new functionality available from EDG
– 1.4.2.21 Job Execution Model defined (30/9/03): will specify how LCG-1 will be usable

37  Resources
6 INFN fellows have been recruited
– Starting September/October; 3 will work on experiment integration, 3 on certification, debugging and deployment
The FZK post is being filled
– Starting October? To work on deployment, service operation and troubleshooting
1 Portuguese trainee has started
– Grid systems administrator
Moscow group
– Have had 2 people (3-month rotations to CERN) working on testing (RLS and R-GMA) – this will be ongoing, building up effort in Moscow
Taipei
– 1 more physicist joined us on 1/8/03 for 1 year – deployment
– 3-monthly rotations of 2 people (2 here now, 2 more arriving in September); 1 working on the Oracle/RLS installation, 1 on the GOC task force, with the goal of building a GOC in Taipei

38  Lessons learned
Must have a running service – and must keep it running – this is the only basis on which to progress and evolve
Big-bang integration (a la EDG) is unworkable – it must not be carried into EGEE
– Must have a development service in parallel with the production service, on which we verify incremental changes – and back them out if they don't work
Sites are not honest about available staffing
– Bits of 1 overworked person are not equivalent to 2 FTE, even allowing for vacations
– Committed resources count many dedicated FTE at most sites – clearly not true; we must adjust this to reflect reality
– The buy-in commitment to LCG-1 was a minimum of 2 FTE as well as machines
Every site seems to over-commit resources – this is a real problem which we must resolve if we want to operate a service

39  Summary
Middleware for LCG-1 is ready
– The tests that we and the Loose Cannons have done are promising
Deployment of the precursor release (for configurations etc.) was completed at 5 of 10 sites (expect FNAL and RAL today?)
Deployment of LCG-1 to the 10 sites will start next week
– Will take a few days for already-configured sites, longer for the others
Expect experiments to have access mid-August
– In a controlled way at first, to monitor problems
Planning for the next steps – expansion, features – is in hand
– Once the Tier 1s are up and stable, we would like to start adding Tier 2s and other countries as they are ready

40  Potential Issues for deployment
Grid 3
– Not entirely clear what the relationship to LCG is, and how it will affect deployment of LCG middleware and services in the US
Middleware support for the next year
– For the EDG work packages we are assuming EGEE or institutional commitments, but this is not yet clear

