Published by Shona Walsh. Modified over 9 years ago.
1
WLCG Service Requirements
WLCG Workshop, Mumbai
Tim Bell, CERN/IT/FIO
2
11th February 2006, Service Checklist, Tim.Bell@cern.ch
Agenda
- LCG Memorandum of Understanding
- Defining what needs to be delivered
- Checking the plan
- Tracking delivery using a dashboard
3
What the MoU provides
- A high level definition of the service
- Basis for estimating Tier investments
- Tier responsibilities
- Overall capacity
- Basic support structure
- Implementation schedule
- Governance Roles
4
Tier0 service levels
Maximum delay in responding to operational problems, and average availability:

| Service | Delay: service interruption | Delay: capacity degraded by more than 50% | Delay: capacity degraded by more than 20% | Availability during accelerator operation | Availability at all other times |
| --- | --- | --- | --- | --- | --- |
| Raw data recording | 4 hours | 6 hours | 6 hours | 99% | n/a |
| Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| Networking service to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| All other Tier-0 services | 12 hours | 24 hours | 48 hours | 98% | 98% |
| All other services, prime service hours | 1 hour | 1 hour | 4 hours | 98% | 98% |
| All other services, outwith prime service hours | 12 hours | 24 hours | 48 hours | 97% | 97% |
5
Tier1 service levels
6
The MoU is not …
- An implementation bible
  - What grid services at which site
  - How to run the services
  - How to deploy
- Magic recipe for service delivery
  - Application: 99% availability = about 1.7 hours down / week
  - Administrator: 40 hours / week = 24% up
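Both figures on this slide follow from a 168-hour week. The sketch below works through the arithmetic, assuming a seven-day calendar week; the function names are illustrative and not taken from the presentation. The exact downtime budget at 99% comes out at 1.68 hours, which the slide quotes as a round number.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a calendar week


def weekly_downtime_budget(availability: float) -> float:
    """Hours of downtime per week permitted by an availability target (0..1)."""
    return (1.0 - availability) * HOURS_PER_WEEK


def attended_fraction(hours_worked_per_week: float) -> float:
    """Fraction of the week covered by one administrator's working hours."""
    return hours_worked_per_week / HOURS_PER_WEEK


print(f"{weekly_downtime_budget(0.99):.2f} hours down per week")  # 1.68 hours down per week
print(f"{attended_fraction(40):.0%} of the week attended")        # 24% of the week attended
```

The gap between the two numbers is the point of the slide: a 99% target cannot be met by a service that only recovers while an administrator is at work.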
7
What is your quest?
8
We seek the holy grail!
A stable and functional Grid
9
Define the site services
- What services do we provide?
- Who is responsible?
- What level of service is required?
- What capacity of service?
- What is the support structure?
- Who pays for what?
10
Service catalog approach
A service catalog consists of:
- Service Class: criticality
- Calendar: variation with time
- Product: what application
- Customer: which VO
Service = Service Class x Calendar x Product x Customer
11
Service class
https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition

| Class | Description | Downtime | Reduced | Degraded | Avail |
| --- | --- | --- | --- | --- | --- |
| C | Critical | 1 hour | 1 hour | 4 hours | 99% |
| H | High | 4 hours | 6 hours | 6 hours | 99% |
| M | Medium | 6 hours | 6 hours | 12 hours | 99% |
| L | Low | 12 hours | 24 hours | 48 hours | 98% |
| U | Unmanaged | None | None | None | None |
12
Class notes
- Downtime defines the time between the start of the problem and restoration of service at minimal capacity (i.e. basic function but capacity < 50%)
- Reduced defines the time between the start of the problem and the restoration of a reduced capacity service (i.e. > 50%)
- Degraded defines the time between the start of the problem and the restoration of a degraded capacity service (i.e. > 80%)
- Availability is derived from the total time that the service is down compared with the total time during the calendar period for the service. Site-wide failures are not considered as part of the availability calculations.
- None means the service is running unattended
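The class definitions can be applied to a single incident as in the sketch below. It is illustrative only: the hour values come from the service class table, with the assumption that a value shown once for two adjacent columns applies to both, and the function names are hypothetical.

```python
from datetime import timedelta

# Restoration targets per service class, in hours: (Downtime, Reduced, Degraded).
# Assumption: where the table shows one value across two columns, it applies to both.
TARGET_HOURS = {
    "C": (1, 1, 4),
    "H": (4, 6, 6),
    "M": (6, 6, 12),
    "L": (12, 24, 48),
}
TARGET_NAMES = ("Downtime", "Reduced", "Degraded")


def missed_targets(service_class, to_minimal, to_reduced, to_degraded):
    """Return the names of the restoration targets that an incident missed.

    Each argument after the class is the elapsed time from the start of the
    problem until capacity was back at the minimal, >50% and >80% levels.
    """
    actual = (to_minimal, to_reduced, to_degraded)
    limits = TARGET_HOURS[service_class]
    return [
        name
        for name, elapsed, hours in zip(TARGET_NAMES, actual, limits)
        if elapsed > timedelta(hours=hours)
    ]


def availability(service_downtime: timedelta, period: timedelta) -> float:
    """Availability over a calendar period; site-wide failures are excluded
    by the caller, so only service-specific downtime is passed in."""
    return 1.0 - service_downtime / period


# A High-class service back at minimal capacity after 3 h, but at >50% only after 7 h.
print(missed_targets("H", timedelta(hours=3), timedelta(hours=7), timedelta(hours=7)))
# ['Reduced', 'Degraded']
```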
13
Service calendar

| Calendar | Description | AccOn | Prime |
| --- | --- | --- | --- |
| AP | Accelerator operating, prime shift | Y | Y |
| AS | Accelerator operating, second shift | Y | N |
| OP | Accelerator off, prime shift | N | Y |
| OS | Accelerator off, second shift | N | N |

- Some services are critical only during accelerator shift
- Other services are less critical outside working hours
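The calendar code is a direct encoding of the two Y/N flags in the table. A minimal sketch of that mapping follows; the function name is hypothetical, and how a site decides whether a given hour counts as prime shift is left outside the sketch because the slides do not define it.

```python
def calendar_slot(accelerator_on: bool, prime_shift: bool) -> str:
    """Map the AccOn and Prime flags to the two-letter calendar code."""
    first = "A" if accelerator_on else "O"   # Accelerator operating / Off
    second = "P" if prime_shift else "S"     # Prime shift / Second shift
    return first + second


for acc in (True, False):
    for prime in (True, False):
        print(acc, prime, calendar_slot(acc, prime))
```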
14
Products

| Product Name | Product Short Code | Description |
| --- | --- | --- |
| Resource Broker | RB | Farms out jobs to sites + logging and book-keeping |
| MyProxy | PX | Renew/acquire credentials |
| BDII | BDII | Grid Information System |
| Compute Element | CE | Gateway to local batch systems |
| Mon Box | MONB | Grid Monitoring including archiver |
| Grid View | GRVW | Monitoring of Grid activity |
| Site Functional Test | SFT | Regular test of components per site |
| Grid Peek | GRPK | Storage of outputs of running jobs |
| VOMS | VOMS | Manage user/roles for VOs |
15
Products (cont)

| Product Name | Product Short Code | Description |
| --- | --- | --- |
| LCG File Catalog | LFC | Maps file names to storage locations |
| File Transfer Service | FTS | Reliable file transfer delivery |
| Storage Element | SE | SRM Compatible Storage Service |
16
Products notes
- Provides a first-level breakdown of the grid into smaller units
- Surprisingly dynamic list. New products arriving weekly.
- Short codes provide the basis for naming conventions
17
Service catalog

| Service | Instance | Product | Cst | AP | AS | OP | OS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RBP | Production Resource Broker | RB | SH | C | C | C | C |
| PXP | Production My Proxy | PX | SH | C | C | C | C |
| BDIIP | Production Global BDII | BDII | SH | C | C | C | C |
| BDIIS | Production Site BDII | BDII | SH | H | H | H | H |
| CEP | Production Compute Element | CE | SH | C | C | C | C |
| MONBP | Production Monbox | MONB | SH | M | M | M | M |
| GRVWP | Production Grid View | GRVW | SH | M | L | M | L |
| SFTP | Production Site Func Test | SFT | SH | M | M | M | M |
| GRPKP | Production Grid Peek Service | GRPK | SH | M | M | M | M |
| VOMSP | Production VOMS | VOMS | SH | C | C | C | C |

- Match product with customer and service class in each calendar slot
- Multiple services (e.g. production, test, site…) for a single product
18
Service catalog (cont)

| Service | Instance | Product | Cst | AP | AS | OP | OS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LFCP-ALICE | Alice Production LCG File Catalog | LFC | Alice | H | H | H | H |
| LFCP-ATLAS | Atlas Production LCG File Catalog | LFC | Atlas | H | H | H | H |
| LFCP-CMS | CMS Production LCG File Catalog | LFC | CMS | H | H | H | H |
| LFCP-LHCB | LHCb Production LCG File Catalog | LFC | LHCb | C | C | C | C |
| FTSP | Production file transfer service | FTS | SH | C | C | C | C |
| CSTRP | Production Castor + SRM | SE | SH | C | C | C | C |
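One way to hold the catalog in a machine-readable form is sketched below, using three rows from the tables above. The `Service` type and `CATALOG` name are hypothetical; the presentation tracks this information on a wiki and does not prescribe a data format.

```python
from dataclasses import dataclass

CALENDAR_SLOTS = ("AP", "AS", "OP", "OS")


@dataclass(frozen=True)
class Service:
    name: str      # service short name, e.g. "GRVWP"
    product: str   # product short code, e.g. "GRVW"
    customer: str  # "SH" as in the Cst column, or a VO name
    classes: str   # one class letter per slot, in CALENDAR_SLOTS order

    def service_class(self, slot: str) -> str:
        """Return the service class letter for one calendar slot."""
        return self.classes[CALENDAR_SLOTS.index(slot)]


CATALOG = {
    s.name: s
    for s in (
        Service("RBP", "RB", "SH", "CCCC"),
        Service("GRVWP", "GRVW", "SH", "MLML"),
        Service("LFCP-LHCB", "LFC", "LHCb", "CCCC"),
    )
}

print(CATALOG["GRVWP"].service_class("AS"))  # L
```

Keeping one class letter per calendar slot mirrors the Service = Service Class x Calendar x Product x Customer definition: each lookup fixes a product and customer and then selects the class by calendar.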
19
Questionnaire
- Simple questions to assess readiness for production
- It is not actually necessary to fill out the answers but the questions should be asked
- Focus is on the infrastructure
20
Service questions
- What service levels are required for each calendar period?
- Who is providing support for the application?
- Who supports the infrastructure?
- How should the support be contacted?
- What support service do they provide?
21
Configuration questions
- What are the application interfaces?
- What server does the application run on?
- Is there a picture of the configuration?
- What are the application parameters and how are they set up?
22
Facilities questions?
23
Facilities questions
- Are all systems in a machine room?
- Is the room access controlled?
- Is there good power provision?
  - UPS? Batteries?
- What is the response time for facilities problems?
24
Hardware questions
- What kind of machine is required?
  - CPU, RAM, Disk
- Do we need redundancy?
  - Power Supply, Disk, …
- Do maintenance contracts match the service?
Currently, there are no capacity guides for each application. These are required to avoid the purchase of inappropriate machines.
25
Sample RB disk calculation

| Parameter | Value |
| --- | --- |
| Size of input sandbox (MB) | 10 |
| Size of output sandbox (MB) | 10 |
| Jobs / Day currently | 21000 |
| Estimated Factor for LHC | 3 |
| Sandbox Purge Time (days) | 14 |
| Jobs in queue | 35000 |
| Total Disk Space Required (MB) | 17,640,000 |
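The total in the table is reproduced by multiplying the combined sandbox size per job by the daily job rate, the LHC scaling factor and the purge time. This formula is inferred from the numbers rather than stated on the slide, and the "Jobs in queue" figure does not enter it. The function name below is illustrative.

```python
def rb_sandbox_disk_mb(input_mb, output_mb, jobs_per_day, lhc_factor, purge_days):
    """Disk space (MB) needed to keep sandboxes until they are purged.

    Assumes every job stores one input and one output sandbox, and that
    sandboxes are kept for the full purge time.
    """
    per_job_mb = input_mb + output_mb
    return per_job_mb * jobs_per_day * lhc_factor * purge_days


total_mb = rb_sandbox_disk_mb(10, 10, 21000, 3, 14)
print(f"{total_mb:,} MB")  # 17,640,000 MB
```

With decimal units this is about 17.6 TB, which shows why a capacity guide matters before hardware for a Resource Broker is purchased.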
26
Network questions
- What network capacity?
- OPN connectivity?
- Bandwidth?
- Firewall ports?
Currently, there is no connectivity guide for each application. This is required for secure set up and appropriate network configuration.
27
Sample CE ports sheet

| Function | Direction | Port |
| --- | --- | --- |
| Globus Job Manager | Outgoing | 20000-21000 |
| GridFTP | Incoming | 2811 |
| GRIS BDII | Incoming | 2135 |
| EDG Log Daemon | Incoming | 9002 |
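A ports sheet like this can be kept as data and queried when reviewing firewall requests. The sketch below is only an illustration of that idea: the `CE_PORTS` structure and `functions_for` helper are hypothetical, and the port range is treated as inclusive at both ends.

```python
# Rows from the sample CE ports sheet: (function, direction, ports).
# The Globus range 20000-21000 is assumed to include both end points.
CE_PORTS = [
    ("Globus Job Manager", "Outgoing", range(20000, 21001)),
    ("GridFTP", "Incoming", range(2811, 2812)),
    ("GRIS BDII", "Incoming", range(2135, 2136)),
    ("EDG Log Daemon", "Incoming", range(9002, 9003)),
]


def functions_for(direction: str, port: int) -> list:
    """Return the CE functions that need this port in this direction."""
    return [
        name
        for name, row_direction, ports in CE_PORTS
        if row_direction == direction and port in ports
    ]


print(functions_for("Incoming", 2811))   # ['GridFTP']
print(functions_for("Outgoing", 20500))  # ['Globus Job Manager']
print(functions_for("Incoming", 22))     # []
```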
28
Database questions
- What is your site's preferred database?
- What are the options for each application?
- Expected database size / growth?
- High Availability options?
29
Backup / Restore questions
- What needs to be backed up for each service?
- How do we ensure consistency in the event of a restore? e.g. RB / CE
- Does the software corruption risk differ by application? e.g. LFC/SE vs Proxy
- Has a restore test been done?
There is currently no list of critical state data for each application or of the steps to be executed after a restore.
30
Operations questions
- How are problems identified?
  - Local console? Grid Monitoring?
- Who should be contacted to resolve the problem?
- Who should be informed of the problem?
- What new procedures / operations guides are required?
- What is the local coverage for nights / weekends?
- How do local and Grid operations interwork?
31
Validation
- Check that the service class matches the answers
  - A critical service cannot have the server in an office
- Check the dependencies so that no critical services depend on non-critical services
  - FTS, critical, requires MyProxy, therefore the MyProxy service must be critical
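The dependency check can be automated once service classes and dependencies are recorded. The sketch below generalises the rule slightly, as an assumption: it flags any service that depends on one with a weaker class, not only critical services. The names `RANK` and `dependency_violations` and the sample data are hypothetical.

```python
# Service classes ordered from most to least critical (C, H, M, L, U).
RANK = {"C": 0, "H": 1, "M": 2, "L": 3, "U": 4}


def dependency_violations(classes: dict, depends_on: dict) -> list:
    """Return (service, dependency) pairs where the dependency has a
    weaker service class than the service that relies on it."""
    return [
        (service, dependency)
        for service, dependencies in depends_on.items()
        for dependency in dependencies
        if RANK[classes[dependency]] > RANK[classes[service]]
    ]


depends_on = {"FTS": ["MyProxy"]}

# MyProxy classified as High while FTS is Critical: flagged.
print(dependency_violations({"FTS": "C", "MyProxy": "H"}, depends_on))
# [('FTS', 'MyProxy')]

# MyProxy raised to Critical: no violations.
print(dependency_violations({"FTS": "C", "MyProxy": "C"}, depends_on))
# []
```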
32
Implementation Tracking at CERN
A dashboard approach on the Wiki
33
Common Themes
But it's all green? What's the problem?
- Green does not mean no problems. We are often generous with assessments since red/yellow everywhere does not highlight issues.
Operations
- No operations or problem determination guides. Limited administration guides.
- Support call-tree unclear
- Backup/Restore details are missing
Hardware
- Limited or no capacity planning information leads to incorrect server sizing
- 'Forgot a box' problems, e.g. one per VO, not one per site
Development
- Difficult to match the user expectations (e.g. a critical service) with implementation (e.g. stateful)
34
Summary
- Complete a service catalog for your sites
- Check the questions and prepare an action plan to address items under your control
- Assess the status by service and concentrate on getting the reds to yellows
35
More Information
- LCG MoU: http://lcg.web.cern.ch/lcg/C-RRB/MoU/WLCGMoU.pdf
- SC4 Service Definitions for CERN: https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition
- SC4 CERN Dashboard: https://uimon.cern.ch/twiki/bin/view/LCG/WlcgScDash