Slide 1: BNL Tier 1 Service Planning & Monitoring
Bruce G. Gibbard
Grid Deployment Board (GDB), 5-6 September 2006
Slide 2: Service Planning (Requirements)
- The Tier 1 Center at BNL serves only ATLAS
- Some members of staff are ATLAS collaborators
- Some participation in ATLAS planning
  o ATLAS computing model
  o ATLAS Computing TDR, etc.
- Base requirement is the agreed fraction of the appropriate revision of the overall ATLAS Tier 1 requirement from the computing model/TDR
- The BNL Tier 1 site also supplies additional US ATLAS specific capacity
- Requirement covers total CPU, disk, tape, and network bandwidth
- Plus associated, implementation-dependent needs:
  o Cyber infrastructure (LAN, LDAP, backup, Grid servers, etc.)
  o Staffing
  o Physical infrastructure (space, power, cooling, fire protection, security, etc.)
Slide 3: Service Planning (Facility Evolution & Cost Plan)
- Facility staff convert the requirements into a plan for the current and future years, including:
- A projected set of costed capital equipment procurements (a rough sketch of this kind of projection follows below) based on
  o In-house experience
  o Experience at other sites
  o Interactions with vendors, etc.
- A projected operations budget
  o Labor
  o Licenses, maintenance, media, small equipment
  o Space and power charges, general overhead
- With the approval of US ATLAS program management this plan is
  o Integrated into the overall Computing Facilities Plan, which is then
  o Integrated into the overall Software and Computing Plan, which is then
  o Integrated into the overall US ATLAS Program Plan
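The slides give no formulas for costing a procurement; as a rough illustration only, the sketch below frames one way such a projection can be made, assuming a hypothetical capacity gap, a hypothetical current unit price, and a hypothetical annual price/performance improvement rate. None of the numbers are BNL's.

```python
# Illustrative sketch of projecting capital equipment cost from a capacity
# requirement; improvement rate and unit price are made-up examples.

def projected_cost(required, installed, unit_price_now,
                   annual_improvement, years_ahead):
    """Cost to buy the capacity gap, assuming unit price falls each year."""
    gap = max(required - installed, 0.0)
    unit_price = unit_price_now / (1.0 + annual_improvement) ** years_ahead
    return gap * unit_price

# Hypothetical example: need 1500 kSI2k of CPU in 2 years, 600 kSI2k
# already installed, $1.0 per SI2k today, ~40%/year improvement.
print(projected_cost(1500e3, 600e3, 1.0, 0.40, 2))
```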
Slide 4: Service Planning (Review and Approval)
- Program and plans are reviewed
  o By the funding agencies: the Department of Energy and the National Science Foundation
  o Twice yearly; the most detailed review, usually in winter, includes many technically knowledgeable consultants
- The review flags issues and makes recommendations
- US ATLAS Program Management, with agency approval, allocates funds
  o Out of common US ATLAS Program funding
  o To the Tier 1 for the current year
Slide 5: Execution of Plan
- Funding is by US fiscal year, starting October 1 rather than January 1
- Funding typically arrives in two chunks, at the beginning of and halfway through the fiscal year, with details of how much and when negotiated with program management based on:
  o Competing needs within the program
  o The schedule of capacity requirements
  o When funds can be most effectively spent
- Technology/product evaluation and review is a year-round activity
- Major equipment procurements typically require one to two months to execute
- Installation and commissioning typically takes from a couple of weeks to a couple of months to complete
Slide 6: Maintaining Availability of Services
- The ATLAS Tier 1 at BNL is co-located and co-operated with the RHIC Computing Facility (RCF)
- Use of redundancy in critical elements
  o Failover and/or graceful degradation of services
- Maintenance contracts with appropriate response times on critical elements
- 24 x 7 operational coverage of fabric services
  o Jointly maintained by RCF and ATLAS Tier 1 staff
  o Five years of experience with RHIC 24 x 7 operations when the accelerator runs (25-30 weeks/year)
- Coverage includes
  o 16 x 7 on-site staff coverage; 2 operators extend coverage to weekends and evenings
  o 24 x 7 on-call staff coverage for critical fabric services
  o On-call activation by automated systems, operators/other staff members, and critical points of contact in the user communities (a sketch of such automated activation follows below)
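The slides do not describe how automated on-call activation is implemented; the sketch below is a minimal illustration of the idea, assuming hypothetical host names and ports, a plain TCP probe, and an email-to-pager gateway reached through a local mail relay. It is not BNL's actual mechanism.

```python
# Minimal sketch of automated on-call activation for critical fabric
# services; hosts, ports, and the pager address are hypothetical, and a
# mail relay is assumed to be listening on localhost.
import smtplib
import socket
from email.message import EmailMessage

CRITICAL_SERVICES = [("dcache-door.example.bnl.gov", 22125),
                     ("hpss-core.example.bnl.gov", 8080)]
ONCALL_PAGER = "oncall-pager@example.bnl.gov"  # hypothetical gateway

def is_up(host, port, timeout=5.0):
    """Probe a TCP service; a refused or timed-out connect counts as down."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def page_oncall(host, port):
    """Send a page describing the failed service to the on-call person."""
    msg = EmailMessage()
    msg["Subject"] = f"CRITICAL: {host}:{port} unreachable"
    msg["From"] = "monitor@example.bnl.gov"
    msg["To"] = ONCALL_PAGER
    msg.set_content(f"Automated probe failed for {host}:{port}; please respond.")
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

for host, port in CRITICAL_SERVICES:
    if not is_up(host, port):
        page_oncall(host, port)
```

In practice a check like this would run from cron or from the monitoring system itself, with de-duplication so a sustained outage does not page repeatedly.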
Slide 7: Monitoring
- The facility uses the RT problem-tracking system
- Substantial use of Nagios-based monitoring by individual subsystems (a sample check is sketched below)
  o Working toward facility-wide unification
- Automated monitoring and paging of staff for failures where possible
  o Physical infrastructure
  o Many common off-the-shelf subsystems
- Complex, software-intensive, or newly deployed systems still require humans for early failure identification
  o HPSS
  o dCache
- The SFT suite is not a good monitor of BNL Tier 1 availability
  o Making SFT run continues to demand significant effort at BNL but …
  o … it fails to detect problems impacting some critical site services
  o … it reports some failures which have no effect on any significant site service
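As flavor for the Nagios-based checks mentioned above: a Nagios plugin is simply an executable that prints a one-line status message and exits with code 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. The specific check below, free space on a dCache pool mount, along with its path and thresholds, is a hypothetical example, not one of the site's actual plugins.

```python
#!/usr/bin/env python3
# Sketch of a custom Nagios plugin: Nagios runs the executable and maps
# exit codes 0/1/2/3 to OK/WARNING/CRITICAL/UNKNOWN. The pool mount point
# and thresholds are illustrative.
import os
import sys

POOL_PATH = "/pool1"               # hypothetical dCache pool mount point
WARN_FREE, CRIT_FREE = 0.15, 0.05  # free-space fractions (illustrative)

def main():
    try:
        st = os.statvfs(POOL_PATH)
        free = st.f_bavail / st.f_blocks
    except OSError as e:
        print(f"POOL UNKNOWN - cannot stat {POOL_PATH}: {e}")
        return 3  # UNKNOWN
    if free < CRIT_FREE:
        print(f"POOL CRITICAL - {free:.1%} free on {POOL_PATH}")
        return 2  # CRITICAL
    if free < WARN_FREE:
        print(f"POOL WARNING - {free:.1%} free on {POOL_PATH}")
        return 1  # WARNING
    print(f"POOL OK - {free:.1%} free on {POOL_PATH}")
    return 0      # OK

if __name__ == "__main__":
    sys.exit(main())
```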
Slide 8: Monitoring Related Issues
- The facility functions within the context of the Open Science Grid (OSG) and 5 US ATLAS Tier 2 Centers
- Relatively few issues of interoperability (OSG ↔ EGEE) at the data transfer, storage, and management levels
- Significant interoperability issues with accounting, monitoring, allocation, workflow management, etc. do exist
  o Some are being addressed through OSG
- Facility functions are convoluted with the PanDA & DDM layers, so reported problems need interpretation, including expertise from the PanDA & DDM teams (a triage sketch follows below)
  o Grid production and analysis use the US ATLAS specific "PanDA" job management system
  o The ATLAS DDM system has a complex interaction with underlying facility services
  o So there is no automated monitoring for these critical systems yet
- For SCs and ATLAS CSC activities
  o On-call list for critical Grid services
  o Accessible through the OSG GOC (IU), the PanDA operations team, and selected other "power" users
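The slides only state that reported problems need expert interpretation across the PanDA, DDM, and fabric layers; the sketch below illustrates, with entirely made-up error signatures, the kind of first-pass triage that could route a report toward the right team before a human looks at it.

```python
# Hypothetical first-pass triage of a reported problem into the layer it
# most likely originates from. The signature strings are illustrative
# inventions, not real PanDA/DDM/fabric error messages.

LAYER_SIGNATURES = {
    "PanDA":  ["pilot lost", "job dispatch", "brokerage"],
    "DDM":    ["subscription stuck", "catalog lookup", "dataset"],
    "fabric": ["no route to host", "gridftp", "srm timeout", "pool offline"],
}

def triage(error_text):
    """Return the layer whose signature matches, or 'unclassified'."""
    text = error_text.lower()
    for layer, signatures in LAYER_SIGNATURES.items():
        if any(sig in text for sig in signatures):
            return layer
    return "unclassified"

print(triage("SRM timeout contacting storage element"))  # -> fabric
```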
Slide 9: Evolution of Monitoring & Support
- Increase automation, especially for services having an immediate impact on operations
  o dCache, DDM, PanDA, and other Grid-related services
  o In particular, need to codify the de-convolution of PanDA and/or DDM problems from underlying Tier 1 operations problems
- Unify monitoring (under the Nagios umbrella)
- Add an additional operator, allowing expansion of on-site staff coverage to ~24 x 7 − ε
- Integrate problem report/tracking systems (a bridging sketch follows below)
  o RT ↔ Footprints (OSG GOC at IU)
  o Footprints ↔ GGUS
- Better integrate monitoring and problem resolution with the US ATLAS Tier 2s and with the overall ATLAS effort
- Target is to establish comprehensive, ATLAS-directed ~24 x 7 operational monitoring and support by Jan '07
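The slides do not specify how RT, Footprints, and GGUS would be integrated; one common pattern is to mirror tickets between trackers over each system's API, keeping a record of the remote ticket id so status changes can be propagated. The sketch below uses hypothetical stand-in client classes and fields, not the real RT, Footprints, or GGUS interfaces.

```python
# Hypothetical sketch of bridging problem tickets between two trackers
# (e.g. a local RT instance and a GGUS-like remote system). The client
# class and field names are illustrative stand-ins for real tracker APIs.
from dataclasses import dataclass

@dataclass
class Ticket:
    local_id: str
    summary: str
    status: str            # e.g. "open", "resolved"
    remote_id: str = ""    # id of the mirrored ticket, once created

class TrackerClient:
    """Stand-in for a real tracker API client (create/update only)."""
    def __init__(self, name):
        self.name, self._next = name, 0
    def create(self, summary):
        self._next += 1
        print(f"[{self.name}] created #{self._next}: {summary}")
        return str(self._next)
    def update_status(self, remote_id, status):
        print(f"[{self.name}] #{remote_id} -> {status}")

def sync(tickets, remote):
    """Mirror new tickets to the remote tracker and propagate status."""
    for t in tickets:
        if not t.remote_id:
            t.remote_id = remote.create(t.summary)
        remote.update_status(t.remote_id, t.status)

ggus = TrackerClient("GGUS-like")
sync([Ticket("RT-101", "dCache pool offline", "open")], ggus)
```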