Grid Deployment Area Status Report Ian Bird SC2 14 March 2003
Deployment Status Summary
- GDB working groups reported and defined LCG-1
- First “packaged” version of LCG (LCG-0) is released and available
- LCG-0 has been deployed at CERN, RAL, CNAF, (Taiwan)
- Expect middleware for LCG-1 (July) to be delivered from VDT/EDG by end of April
- Re-aligned EDG/LCG at CERN to share resources and people
  - Re-organisation of groups and people within IT also
- Collaborative activities developing in several areas
Ian.Bird@cern.ch
Grid Deployment Organisation
[Organisation chart; box labels only]
- Experiments: ALICE, ATLAS, CMS, LHCb
- Grid deployment board: policies, strategy, scheduling, standards, recommendations
- Grid deployment manager
- Grid Resource Coordinator
- LCG security group
- LCG operations team
- LCG toolkit integration & certification
- Grid infrastructure team
- Experiment support team
- Joint Trillium/EDG/LCG testing team
- Group label: CERN-based teams
- Regional centre operations (several boxes)
- Operations call centre
- Core infrastructure
- Security tools
- Grid monitoring
- Group label: anticipated teams at other institutes
Deployment Personnel
Columns: CERN, LCG, Other, Requested*
- Management: 1
- Certification & Test: 2, 3
- Test group: 1.3
- System analysis & support
- Grid infrastructure (people moved from ADC and retain commitments to EDG WP4,6): 0.5, 2*0.5
- Experiment integration & support
- Total (13.8 + 6): 1.5, 8, 4.3, 6
* Expected to be fulfilled by INFN-funded Fellows by July
Personnel Issues
- Staff to support LCG-1 grid services have commitments to EDG test-bed support – part of the reason for rationalising EDG/LCG resources
- Have used 2 people from certification to help support test-beds/LCG Pilot
- Testing group is understaffed – need to find at least 3 full-time people to contribute here
- Don’t expect to get final LCG funded effort before July
Milestones
- Define LCG-1 – Feb 1
  - The GDB WG reports were published and discussed at the Feb 6 GDB.
  - Sufficient to define direction and issues to be addressed; used in planning and deploying
- Initial Pilot Service available (Feb 1), Pilot 2 (March 15)
  - Strategy at CERN changed – no separation between LCG and physics production systems.
  - Pilot cluster is available – configured as minimum, but batch nodes can be moved between Pilot and LXBatch as needed. This makes more efficient use of the machines.
  - Pilot service worker nodes managed by FIO group.
  - Integrating LSF, addressing NFS vs AFS issues
- Deployment schedule
  - LCG-0 deployed to CERN, RAL, CNAF + Legnaro (T2), Taiwan; preparing for FNAL by end of March
  - This is actually ahead of the proposed schedule
LCG-1 Ramp-up Schedule
Pilot 1 start – Feb 1
       Date     Regional Center          Experiment
       15/2/03  CERN                     All
  1    28/2/03  CNAF, RAL
  2    30/3/03  FNAL                     CMS
  3    15/4/03  Taiwan                   Atlas, CMS
  4    30/4/03  Karlsruhe
  5    7/5/03   IN2P3
  6    15/5/03  BNL (want to wait)       Atlas
  7    21/5/03  Russia (Moscow), Tokyo
LCG-1 Initial Public Service Start – July 1
Tier 2 centres will be brought on-line in parallel once the local Tier 1 is up to provide support
Milestones - 2
- Certification process defined (January)
  - This has been done – agreed common process with EDG
  - Have agreed joint project with VDT (US):
    - VDT provide basic level (Globus, Condor) testing suites
    - We provide higher level testing
  - Expect to get HEPCAL test-cases from GAG
  - Need to pull in other expertise
    - E.g. EDG WP8/loose cannons
  - Look at using common tools and frameworks (where it makes sense)
    - NMI/VDT-LCG
  - We need to do this soon to avoid divergence
  - Need much more effort on devising & writing tests
    - Real effort currently is only 2 people
Test and Validation process
[Process diagram; recoverable labels only]
- Stages: Unit Test, Build, Integration, Certification, Production
- Environments: Developers machines, Build system, Development Testbed (~15 cpu), Certification Testbed (~40 cpu), Production
- Actors: WPs, Build system, Integration Team, Test Group, Appl. Representatives, Users
- Steps: WPs add unit tested code to CVS repository; run nightly build & auto. tests; individual WP tests; overall release tests; grid certification; application certification; certified public release for use by apps; fix problems
- Artifacts: tagged package; tagged release selected for certification; release candidates; tagged releases; certified releases; certified release selected for deployment
- Support hours: office hours, 24x7
- Bugzilla anomaly reports
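The stage sequence on this slide can be read as a promotion pipeline: a release moves forward only when the tests at its current stage pass, and otherwise stays put while an anomaly is recorded and fixed. The sketch below is only an illustration of that reading. The pairing of stages with environments is inferred from the order in which the labels appear on the slide, and the names `Release` and `advance` are hypothetical, not part of any EDG or LCG tool.

```python
from dataclasses import dataclass, field

# Stage and environment names come from the slide. Pairing them by position
# is an assumption made for this sketch.
STAGES = [
    ("Unit Test", "Developers machines"),
    ("Build", "Build system"),
    ("Integration", "Development Testbed"),
    ("Certification", "Certification Testbed"),
    ("Production", "Production"),
]


@dataclass
class Release:
    """Hypothetical record of a tagged release moving through the stages."""
    tag: str
    stage_index: int = 0
    anomalies: list = field(default_factory=list)

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index][0]

    @property
    def environment(self) -> str:
        return STAGES[self.stage_index][1]


def advance(release: Release, tests_passed: bool, anomaly: str = "unspecified") -> Release:
    """Promote the release one stage if its tests passed.

    On failure the release stays at its current stage and the anomaly is
    recorded (the slide mentions Bugzilla anomaly reports and a
    "fix problems" loop).
    """
    if not tests_passed:
        release.anomalies.append((release.stage, anomaly))
        return release
    if release.stage_index < len(STAGES) - 1:
        release.stage_index += 1
    return release
```

A release that fails at one stage is held there until a later call reports passing tests; a release already in Production is left unchanged by further calls.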
Testing and Certification
- Building the certification test-bed
  - At CERN: 4x10 nodes (25 available now), run as a local grid
  - Will include Wisconsin asap
- Currently installing the EDG 2.0/LCG-1 precursor on the certification test-bed and beginning the certification work
Test-beds and services
- Agreement 2 weeks ago to merge EDG and LCG production services and to separate test-beds
  - The only way to share scarce resources (support)
- March – July:
  - No EDG production test-bed at CERN now (but access to Castor)
  - EDG core sites will run dev. test-beds and either EDG production or LCG pilot, unless they have resources for both
  - LCG pilot on other EDG sites and at non-EDG centres
- From July:
  - There will be a single production system – LCG-1, at least at CERN, hopefully at other EDG/LCG sites
Milestones – 3
- Packaging/configuration mechanism defined – March
  - Group (EDG, LCG, VDT) has documented an agreed common approach
  - Will now proceed with a staged implementation
  - Basic for LCG-1 in July, and more developed later
- Delivery of middleware – March 1
  - We have a working set (“LCG-0”) that is in use now
  - Expect delivery of middleware for July by end of April (from EDG)
- Identify operations and call centres – February 1
  - 2 candidates for operations centres – this should be clarified in the next 2 weeks
  - No clear candidate for a support centre – but we (LCG CERN group) already have a basic support service in place.
Other Progress
Middleware system support
- European grid support centre
  - Maarten Litmaath is one of the 3 technical Globus support people (SE, UK, LCG)
  - Will participate in Globus 2.4 release process
- LCG team
  - Currently 2 (3) people
  - Building relationships with EDG, Globus, VDT
  - Worked on RLS stress testing and debugging
Security
- Dave Kelsey will lead ongoing security activity
  - Policies
  - Security strategy and plan
  - This is needed urgently – as basis for operational agreements at centres
- Security operational issues:
  - Led by Dane Skow (FNAL), group of site security contacts
  - Gathering issues, constraints, etc.
  - This group will handle daily security issues
- Proposing collaboration on VO management
  - FNAL, INFN, …
Collaborative Activities
- HICB – JTB
  - GLUE Schema and evolution
  - Validation and Test Suites
  - Distribution, Meta-Packaging, Configuration
  - Monitoring tools (proposed), aspects of ops centres
  - Proposed collaboration on VO tools (led by FNAL)
- GGF
  - Production Grid Management (operations)
  - User Services (call centres)
    - Tools, trouble ticket exchange standards, etc.
  - Site AAA (security)
  - Particle and Nuclear Physics Applications area
    - As a forum in GGF to present issues and get collaboration
- Other
  - HEPiX – Fabric, operations, tools, procedures
  - Security – site security contacts
  - Storage Interfaces – SRM
Compiler Issues
- EDG 2.0 release plan has assumed gcc 2.95.2
  - This was agreed with the experiments
- CMS and Atlas now request EDG 2.0/LCG-1 with gcc 3.2.2
  - For LHCb and ALICE 2.95.2 is acceptable
  - gcc 3.2.2 is essential for integration with POOL
- Problem
  - Switching will delay delivery of EDG 2.0 to LCG by ~6 weeks (an estimate made after 5 minutes' thought)
- Possibilities:
  - Continue as scheduled, deliver EDG 2.0 with 2.95.2, and foresee an update in September
    - This implies LCG-1 cannot use POOL before the upgrade
  - Switch to gcc 3.2.2 now – introduces a 6 week delay
Compiler Issues
- Observations:
  - POOL will not have been integrated in experiment software and tested before LCG-1 is deployed in July
    - It has a command-line interface
  - A delay of 6 weeks will mean nothing is deployed during the summer (vacations)
  - An upgrade of middleware is already foreseen for Sep/Oct
- (I propose) Continue with agreed plan – deploy EDG 2.0/LCG-1 as planned, and upgrade to gcc 3.2 and include POOL in September
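To make the schedule pressure behind this proposal concrete, the date arithmetic can be sketched as follows. It uses the figures given elsewhere in the report (middleware delivery expected by end of April, a delay of about 6 weeks, LCG-1 public service start on July 1). Treating "end of April" as 30 April 2003 and the delay as exactly 42 days are assumptions made for illustration only.

```python
from datetime import date, timedelta

# Dates from the report; "end of April" is taken as 30 April (assumption).
planned_delivery = date(2003, 4, 30)
public_service_start = date(2003, 7, 1)

# The ~6 week delay quoted for switching compilers, taken as exactly 6 weeks.
compiler_switch_delay = timedelta(weeks=6)

delayed_delivery = planned_delivery + compiler_switch_delay

# Time left for certification and deployment before the July 1 start.
slack_as_planned = (public_service_start - planned_delivery).days
slack_if_delayed = (public_service_start - delayed_delivery).days

print("Delivery if delayed:", delayed_delivery.isoformat())
print("Days before July 1 as planned:", slack_as_planned)
print("Days before July 1 if delayed:", slack_if_delayed)
```

Under these assumptions the delayed delivery falls on 11 June 2003, leaving 20 days rather than 62 before the planned start of the public service.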
Summary
- Progress on schedule with deployment
- Need to find effort on testing activities
- Need to get operations and call centre activities started