1
Status of EGEE Operations
Ian Bird, CERN
SA1 Activity Leader
EGEE 3rd Conference, Athens, 18th April 2005
INFSO-RI-508833, Enabling Grids for E-sciencE, www.eu-egee.org
2
Overview
- Overall activity status
  - Service & Operations
- Planning for remainder of project
  - Main focus of activities
  - gLite migration
- Summary
- Tomorrow's plenary session for technical details
3
Operations Status
4
Computing Resources: April 2005
[Map legend: country providing resources; country anticipating joining]
- In LCG-2: 131 sites, 30 countries; >12,000 CPUs; ~5 PB storage
- Includes non-EGEE sites: 9 countries, 20 sites
5
Infrastructure metrics
Countries, sites, and CPU available in EGEE production service

Region                Countries  Sites  CPU M6 (TA)  CPU M15 (TA)  CPU actual
EGEE partner regions
CERN                          0      1          900          1800        1841
UK/Ireland                    2     19          100          2200        2398
France                        1      8          400           895        1172
Italy                         1     21          553           679        2164
South East                    5     16          146           322         159
South West                    2     13          250                       498
Central Europe                5     10          385           730         629
Northern Europe               2      4          200          2000         427
Germany/Switzerland           2     10          100           400        1733
Russia                        1      9           50           152         276
EGEE-total                   21    111         3084          9428       11297
Other collaborating sites
USA                           1      3            -             -         555
Canada                        1      6            -             -         316
Asia-Pacific                  6      8            -             -         394
Hewlett-Packard               1      3            -             -         172
Total other                   9     20            -             -        1437
Grand Total                  30    131            -             -       12734
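The totals in the metrics table can be cross-checked with a few lines of Python. This is only a sketch: the row values are transcribed from the slide, the names `EGEE_REGIONS`, `EGEE_TOTAL_ROW` and `column_sums` are illustrative, and the South West M15 cell is treated as blank (`None`) because no value is legible in the transcript.

```python
# Rows transcribed from the table above:
# region -> (countries, sites, cpu_m6, cpu_m15, cpu_actual).
# None marks the South West M15 cell, which is blank in the transcript.
EGEE_REGIONS = {
    "CERN": (0, 1, 900, 1800, 1841),
    "UK/Ireland": (2, 19, 100, 2200, 2398),
    "France": (1, 8, 400, 895, 1172),
    "Italy": (1, 21, 553, 679, 2164),
    "South East": (5, 16, 146, 322, 159),
    "South West": (2, 13, 250, None, 498),
    "Central Europe": (5, 10, 385, 730, 629),
    "Northern Europe": (2, 4, 200, 2000, 427),
    "Germany/Switzerland": (2, 10, 100, 400, 1733),
    "Russia": (1, 9, 50, 152, 276),
}
# The "EGEE-total" row as stated on the slide.
EGEE_TOTAL_ROW = (21, 111, 3084, 9428, 11297)


def column_sums(rows):
    """Sum each column over all rows, skipping blank (None) cells."""
    return tuple(
        sum(value for value in column if value is not None)
        for column in zip(*rows.values())
    )


sums = column_sums(EGEE_REGIONS)
# Difference between the stated total and the summed rows, per column.
gaps = tuple(total - s for total, s in zip(EGEE_TOTAL_ROW, sums))
print("column sums:", sums)
print("stated total minus sums:", gaps)
```

Every column except M15 matches the stated total exactly. The M15 column falls short by 250, which is what the blank South West cell would have to hold if the total row is taken at face value.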
6
Service Usage
- VOs and users on the production service
  - Active HEP experiments: 4 LHC, D0, CDF, Zeus, Babar
  - Active other VOs: Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics)
  - 6 disciplines
  - Registered users in these VOs: 600
  - In addition, there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
- Scale of work performed
  - LHC Data Challenges 2004:
    - >1 M SI2K years of CPU time (~1000 CPU years)
    - 400 TB of data generated, moved and stored
    - 1 VO achieved ~4000 simultaneous jobs (~4 times CERN grid capacity)
[Chart: number of jobs processed/month]
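The two equivalences quoted for the 2004 Data Challenges can be unpacked as simple arithmetic. The sketch below assumes a rating of about 1000 SI2K per CPU, which is inferred from the slide's own "1 M SI2K years is about 1000 CPU years" and is not stated explicitly; the function names are illustrative.

```python
def cpu_years(si2k_years: float, si2k_per_cpu: float) -> float:
    """Convert work in SI2K-years to CPU-years for CPUs of a given SI2K rating."""
    return si2k_years / si2k_per_cpu


def implied_capacity(simultaneous_jobs: float, multiple_of_capacity: float) -> float:
    """Capacity (in job slots) implied by 'N jobs is about k times capacity'."""
    return simultaneous_jobs / multiple_of_capacity


# Assumption inferred from the slide: roughly 1000 SI2K per CPU.
ASSUMED_SI2K_PER_CPU = 1000.0

dc_cpu_years = cpu_years(1_000_000, ASSUMED_SI2K_PER_CPU)
cern_slots = implied_capacity(4000, 4)
print(dc_cpu_years, cern_slots)
```

Under that assumption, the data challenges correspond to about 1000 CPU-years, and "~4 times CERN grid capacity" implies a CERN grid capacity of roughly 1000 simultaneous jobs at the time.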
7
SA1 – Operations Structure
- Operations Management Centre (OMC)
- Core Infrastructure Centres (CIC)
  - Manage daily grid operations – oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd level support to ROCs
  - UK/I, Fr, It, CERN, + Russia (M12)
  - Weekly rotation in place since October
  - Taipei also runs a CIC
- Regional Operations Centres (ROC)
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region – many distributed
- User Support Centre (GGUS)
  - In FZK – manage PTS – provide single point of contact (service desk)
  - Not foreseen as such in TA, but need is clear
8
Operations Procedures
- Driven by experience during 2004 Data Challenges, and reflecting the outcome of the November Operations Workshop
- Operations Procedures
  - roles of CICs, ROCs, RCs
  - weekly rotation of operations centre duties (CIC-on-duty)
    - Process in place since October
  - daily tasks of the operations shift
    - monitoring (tools, frequency)
    - problem reporting
      - problem tracking system
    - communication with ROCs & RCs
    - escalation of unresolved problems
    - handing over the service to the next CIC
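The weekly CIC-on-duty rotation can be pictured as a simple round-robin schedule. The sketch below is illustrative only: the CIC names come from the SA1 structure slide, while their order, the start date, and the function name `cic_on_duty_schedule` are assumptions and do not describe the actual operations tooling.

```python
from datetime import date, timedelta

# CIC names as listed on the SA1 structure slide; the ordering is an assumption.
CICS = ["CERN", "UK/I", "Fr", "It", "Russia"]


def cic_on_duty_schedule(first_week_start, weeks, cics=CICS):
    """Return (week_start, cic) pairs, rotating one CIC per week in list order."""
    return [
        (first_week_start + timedelta(weeks=i), cics[i % len(cics)])
        for i in range(weeks)
    ]


# Hypothetical start date, chosen only for illustration.
schedule = cic_on_duty_schedule(date(2005, 4, 18), 6)
for week_start, cic in schedule:
    print(week_start.isoformat(), cic)
```

With five CICs, each centre is on duty one week in five, and the hand-over to the next CIC happens at each week boundary.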
9
New Release Process (simplified)
[Flow diagram with steps numbered 1 to 7; labels in order of appearance]
- Bugs/Patches/Tasks (Savannah): C&T, EIS, GIS, GDB, Applications, RC
- Prioritization & selection: EIS, CICs, Head of Deployment, Developers, Applications
- List for next release (can be empty)
- Integration & first tests (C&T)
- Internal Releases
- User-level install of client tools (EIS)
- Full deployment on test clusters (6); functional/stress tests, ~1 week (C&T)
- Assign and update cost (Bugs/Patches/Tasks, Savannah); components ready at cutoff
- Internal Client Release
- Client Release, Service Release, Updates Release, Core Service Release (C&T)
10
Deployment process
[Flow diagram; labels in order of appearance]
- Release(s)
- Certification is run daily
- Update User Guides (EIS); Update Release Notes (GIS)
- Release Notes, Installation Guides, User Guides
- Re-Certify (CIC)
- Release; Client Release
- Deploy Client Releases (User Space): GIS
- Deploy Service Releases (Optional): CICs, RCs
- Deploy Major Releases (Mandatory): ROCs, RCs
- YAIM
- Timing labels: every month; every 3 months on fixed dates!; at own pace
11
Planning for next year
12
Future work – comments from review
- "Testing and software packaging will be critical to success. Reinforce these also intellectually very demanding activities even further."
  - Yes – this is agreed!
- "Work hard on event-based monitoring techniques, triggering preventive maintenance actions, to improve the stability of the Grid infrastructure. Implement a strong mechanism to quickly isolate unstable sites in the production Grid."
  - These are both part of the ongoing programme of work
    - Use R-GMA as monitoring framework; build triggers and alarms on top
    - Better mechanism to remove sites – web interface to allow VO to select
- "Improve the middleware deployment process (technical, organisational) even further to increase the stability of the infrastructure and consequently improve the job success rate and reduce the load on the support team."
  - Already updated and streamlined deployment and release process and improved configuration mechanisms
13
15 month plan
- No major changes to goals or work
- Areas of work focus:
  - Migration to gLite
    - See next slides
  - Improving operational and grid reliability
    - Follow recommendations of review discussed above
    - Improve monitoring systems – build reactive alarms
    - Site isolation – need simple mechanism (CIC tool) to remove sites
      - Bad sites, security problems, etc.
  - Improving user support
    - In progress – need recognised usable service by mid-year
  - 24x7 service availability
    - Availability of service rather than components
    - Identify critical services
    - Issues: on-call support; hot stand-by machines; etc. (might need work on middleware to support this!)
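A reactive alarm for site isolation could, in its simplest form, flag any site whose most recent tests have all failed. The sketch below is a hypothetical illustration of that idea: it does not use R-GMA or any existing CIC tool, and the function name, the input format, and the threshold of three consecutive failures are all assumptions.

```python
from collections import defaultdict


def sites_to_isolate(results, threshold=3):
    """Return sites whose latest `threshold` (or more) test results all failed.

    `results` is an iterable of (site, passed) pairs in time order.
    """
    streak = defaultdict(int)  # current run of consecutive failures per site
    for site, passed in results:
        streak[site] = 0 if passed else streak[site] + 1
    return sorted(site for site, run in streak.items() if run >= threshold)


# Hypothetical test history for three made-up sites.
history = [
    ("site-a", False), ("site-b", True), ("site-a", False),
    ("site-b", False), ("site-a", False), ("site-c", False),
    ("site-c", True),
]
flagged = sites_to_isolate(history)
print(flagged)
```

Counting consecutive failures rather than total failures means a site that recovers is cleared immediately, so only persistently bad sites are proposed for removal.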
14
Review recommendations to SA1
- "The migration path to gLite needs to be better planned, as it is inherently difficult to support two different grid software stacks indefinitely. More specifically, establishing a fixed time-line for migration as well as deprecation deadlines for LCG-2 services, plus possibly identifying who would be the earliest adopters from the application side and the time-line for their possible early committal, would be essential; otherwise, existing users may not be motivated to migrate."
  - Migration plan is being worked out in detail – but will be driven by experience in the certification and pre-production deployment
  - Must be a migration plan and not a switch from old to new
  - Early adopters include LCG; others should be identified via NA4
15
Migration to gLite
- Migration strategy
  - Needs to be incremental rather than big-bang – as has been stated for a year
- 2 activities in parallel:
  - Deploy components into LCG-2 certification test-bed and then to pre-production
  - Deploy pre-production sites in parallel
- PPS and Production are evolutionary
  - LCG-2, gLite components
- Cannot provide LCG-2 end-of-life estimate/deadlines
  - LCG-2 is the fallback solution
- Applications must test services and decide which ones they need
[Timeline diagram, 2004 to 2005; labels: LCG-2 (=EGEE-0); prototyping; product; LCG-3 (=EGEE-x?); product]
16
Review recommendations to SA1
- "Consider the current gLite as a stepping stone towards a more robust standards-based infrastructure, rather than a final deployment solution. Select additional components for integration and deployment through collaborations with other international middleware R&D initiatives."
  - Work with Globus, VDT, OSG, etc. on common solutions/interfaces – but has to be driven by the applications and experience from operations
  - Should be in a position to deploy components needed by the applications
  - Integration and certification process: mechanism for selecting other components
17
Review recommendations to SA1
- "Continue to conduct application-driven investigation that may result in complex usage scenarios and consider how the advanced middleware and infrastructure would support them in a viable manner. As such, keep a keen eye on new generations of production-level Grid middleware from various international groups that go beyond gLite features."
  - For HEP – Data Challenges and Service Challenges bring specific goals and targets (and timescales) – this will continue
  - Other applications might consider similar exercises – define some goals
18
Milestones for rest of project
- M14: full production grid in production
  - 9 ROCs, 5 CICs (include Russia at M12), 20 sites
  - Should be based on EGEE re-engineered middleware. This is dependent on the quality and robustness of gLite components
    - Experience: takes 6 months to put new software into production
    - Will not deploy new components unless they improve upon existing components or add new required functionality
- M21: expanded production infrastructure in place
  - As above, but expanded to 50 sites
  - Now decoupled from specific gLite release
19
Deliverables for rest of project
- Release notes corresponding to milestones
  - Updated relative to first set of release notes; snapshots corresponding to milestones
  - NB: ALL releases are accompanied by a full set of release notes
- EGEE "Cookbook"
  - Foreseen as planning guides to assist new participants to join or build components of the infrastructure
    - Resource centres and their administrators
    - ROCs, CICs, and VOs
  - Templates and checklists to assist administrators to: design a facility, determine what resources to acquire, how to configure them, etc.
  - Detailed enough to allow admins to understand what the limitations of the system are and how to address them (e.g. what services can run on 1 machine, how to configure, etc.)
  - Make use of expertise of CICs, ROCs and staff in RCs ("and use technical writers in NA3")
- M24: Assessment of infrastructure operation throughout the project
  - Remove suggestions on long-term sustainability; put into EGEE-2 planning
20
Summary
- Production grid is operational and in use
  - Larger scale than foreseen; use in 2004 was probably the first time such a set of large-scale grid productions has been done
  - Modest growth in resources foreseen over next year
- Operational infrastructure in place and working
  - Need to continue to improve reliability of service
  - Need to continue to improve user support
- Support for applications and VOs
  - VO deployment should become still simpler and more routine
  - Application support needs more resources than foreseen
- Deployment and migration to gLite is now a major focus