WLCG Service Requirements
WLCG Workshop, Mumbai
Tim Bell, CERN/IT/FIO

11th February 2006, Service Checklist

Agenda
- LCG Memorandum of Understanding
- Defining what needs to be delivered
- Checking the plan
- Tracking delivery using a dashboard

What the MoU provides
- A high-level definition of the service
- Basis for estimating Tier investments
- Tier responsibilities
- Overall capacity
- Basic support structure
- Implementation schedule
- Governance
- Roles

Tier0 service levels

Maximum delay in responding to operational problems, and average availability:

| Service | Service interruption | Capacity degraded by more than 50% | Capacity degraded by more than 20% | Availability during accelerator operation | Availability at all other times |
|---|---|---|---|---|---|
| Raw data recording | 4 hours | 6 hours | 6 hours | 99% | n/a |
| Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| Networking service to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| All other Tier-0 services | 12 hours | 24 hours | 48 hours | 98% | 98% |
| All other services, prime service hours | 1 hour | 1 hour | 4 hours | 98% | 98% |
| All other services, outwith prime service hours | 12 hours | 24 hours | 48 hours | 97% | 97% |

Tier1 service levels

The MoU is not ...
- An implementation bible
  - What grid services at which site
  - How to run the services
  - How to deploy
- A magic recipe for service delivery
  - Application: 99% availability = about 1.7 hours down per week
  - Administrator: 40 hours/week = 24% up

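The gap between an availability target and a single administrator's working week can be checked with a few lines of arithmetic. The sketch below assumes a 168-hour week; the function names are illustrative only.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_WEEK) -> float:
    """Downtime budget in hours for a target availability given as a fraction."""
    return (1.0 - availability) * period_hours


def coverage_fraction(staffed_hours: float,
                      period_hours: float = HOURS_PER_WEEK) -> float:
    """Fraction of the period during which someone is on duty."""
    return staffed_hours / period_hours


if __name__ == "__main__":
    print(f"{allowed_downtime_hours(0.99):.2f} hours down per week at 99%")
    print(f"{coverage_fraction(40):.0%} of the week covered by 40 staffed hours")
```

A 99% target leaves under two hours of downtime per week, while one person working 40 hours is present for less than a quarter of it, so the target cannot rest on manual intervention alone.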
What is your quest?

We seek the holy grail!
A stable and functional Grid

Define the site services
- What services do we provide?
- Who is responsible?
- What level of service is required?
- What capacity of service?
- What is the support structure?
- Who pays for what?

Service catalog approach
A service catalog consists of:
- Service Class: criticality
- Calendar: variation with time
- Product: what application
- Customer: which VO
Service = Service Class x Calendar x Product x Customer

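One possible way to hold a catalog entry in code is a record keyed by product and customer, carrying one service class per calendar slot. This is only a sketch: the class and field names are hypothetical, and the example values are the Grid View row from the catalog table later in the talk.

```python
from dataclasses import dataclass
from typing import Dict

CALENDARS = ("AP", "AS", "OP", "OS")  # calendar slots used in the talk
CLASSES = ("C", "H", "M", "L", "U")   # Critical, High, Medium, Low, Unmanaged


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical record: one service = product x customer x (class per slot)."""
    service: str
    product: str
    customer: str
    class_by_calendar: Dict[str, str]

    def __post_init__(self) -> None:
        # Every calendar slot must be present and map to a known class.
        for slot in CALENDARS:
            if self.class_by_calendar.get(slot) not in CLASSES:
                raise ValueError(f"{self.service}: no valid class for {slot}")

    def service_class(self, calendar: str) -> str:
        return self.class_by_calendar[calendar]


grid_view = CatalogEntry("GRVWP", "GRVW", "SH", dict(zip(CALENDARS, "MLML")))
print(grid_view.service_class("AS"))
```

Keeping the class per calendar slot, rather than one class per service, is what lets a service be Medium in prime shift and Low outside it.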
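Service class

| Class | Description | Downtime | Reduced | Degraded | Avail |
|---|---|---|---|---|---|
| C | Critical | 1 hour | 1 hour | 4 hours | 99% |
| H | High | 4 hours | 6 hours | 6 hours | 99% |
| M | Medium | 6 hours | 6 hours | 12 hours | 99% |
| L | Low | 12 hours | 24 hours | 48 hours | 98% |
| U | Unmanaged | None | | | |
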
Class notes
- Downtime: the time between the start of the problem and the restoration of service at minimal capacity (i.e. basic function, but capacity < 50%).
- Reduced: the time between the start of the problem and the restoration of a reduced-capacity service (i.e. > 50%).
- Degraded: the time between the start of the problem and the restoration of a degraded-capacity service (i.e. > 80%).
- Availability: derived from the total time the service is down compared with the total time in the calendar period for the service. Site-wide failures are not considered in the availability calculations.
- None: the service is running unattended.

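The capacity thresholds and the availability rule can be sketched as two small functions. Several points are assumptions rather than part of the talk: outages are recorded as (hours, site_wide) pairs, site-wide failures are left out of the downtime but the period length is unchanged, and a capacity of exactly 50% or 80% counts as not yet past the threshold.

```python
from typing import Iterable, Tuple


def restoration_state(capacity: float) -> str:
    """Highest restoration milestone reached for a capacity fraction (0.0 to 1.0)."""
    if capacity > 0.8:
        return "degraded"   # restored to more than 80% of capacity
    if capacity > 0.5:
        return "reduced"    # restored to more than 50% of capacity
    if capacity > 0.0:
        return "minimal"    # basic function, capacity below the 50% threshold
    return "down"


def availability(period_hours: float,
                 outages: Iterable[Tuple[float, bool]]) -> float:
    """Fraction of the period the service was up, ignoring site-wide failures."""
    down = sum(hours for hours, site_wide in outages if not site_wide)
    return 1.0 - down / period_hours


if __name__ == "__main__":
    # One hour of service-specific downtime plus a five-hour site-wide failure.
    print(f"{availability(168, [(1.0, False), (5.0, True)]):.4f}")
```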
Service calendar

| Calendar | Description | AccOn | Prime |
|---|---|---|---|
| AP | Accelerator operating, prime shift | Y | Y |
| AS | Accelerator operating, second shift | Y | N |
| OP | Accelerator off, prime shift | N | Y |
| OS | Accelerator off, second shift | N | N |

- Some services are critical only during accelerator operation
- Other services are less critical outside working hours

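The four calendar codes follow directly from the two yes/no columns. A minimal sketch, with a hypothetical function name; how a site decides what counts as prime shift is not defined in the talk and is left to the caller.

```python
def calendar_slot(accelerator_on: bool, prime_shift: bool) -> str:
    """Return AP, AS, OP or OS from the two flags in the calendar table."""
    first = "A" if accelerator_on else "O"   # Accelerator operating / Off
    second = "P" if prime_shift else "S"     # Prime shift / Second shift
    return first + second


if __name__ == "__main__":
    print(calendar_slot(accelerator_on=True, prime_shift=False))
```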
Products

| Product Name | Short Code | Description |
|---|---|---|
| Resource Broker | RB | Farms out jobs to sites, plus logging and book-keeping |
| MyProxy | PX | Renew/acquire credentials |
| BDII | BDII | Grid Information System |
| Compute Element | CE | Gateway to local batch systems |
| Mon Box | MONB | Grid monitoring, including archiver |
| Grid View | GRVW | Monitoring of Grid activity |
| Site Functional Test | SFT | Regular test of components per site |
| Grid Peek | GRPK | Storage of outputs of running jobs |
| VOMS | VOMS | Manage users/roles for VOs |

Products (cont)

| Product Name | Short Code | Description |
|---|---|---|
| LCG File Catalog | LFC | Maps file names to storage locations |
| File Transfer Service | FTS | Reliable file transfer delivery |
| Storage Element | SE | SRM-compatible storage service |

Products notes
- Provides a first-level breakdown of the grid into smaller units
- Surprisingly dynamic list: new products arrive weekly
- Short codes provide the basis for naming conventions

Service catalog

| Service | Instance | Product | Cst | AP | AS | OP | OS |
|---|---|---|---|---|---|---|---|
| RBP | Production Resource Broker | RB | SH | C | C | C | C |
| PXP | Production MyProxy | PX | SH | C | C | C | C |
| BDIIP | Production Global BDII | BDII | SH | C | C | C | C |
| BDIIS | Production Site BDII | BDII | SH | H | H | H | H |
| CEP | Production Compute Element | CE | SH | C | C | C | C |
| MONBP | Production Monbox | MONB | SH | M | M | M | M |
| GRVWP | Production Grid View | GRVW | SH | M | L | M | L |
| SFTP | Production Site Func Test | SFT | SH | M | M | M | M |
| GRPKP | Production Grid Peek Service | GRPK | SH | M | M | M | M |
| VOMSP | Production VOMS | VOMS | SH | C | C | C | C |

- Match product with customer and service class in each calendar slot
- Multiple services (e.g. production, test, site, ...) for a single product

Service catalog (cont)

| Service | Instance | Product | Cst | AP | AS | OP | OS |
|---|---|---|---|---|---|---|---|
| LFCP-ALICE | ALICE Production LCG File Catalog | LFC | ALICE | H | H | H | H |
| LFCP-ATLAS | ATLAS Production LCG File Catalog | LFC | ATLAS | H | H | H | H |
| LFCP-CMS | CMS Production LCG File Catalog | LFC | CMS | H | H | H | H |
| LFCP-LHCB | LHCb Production LCG File Catalog | LFC | LHCb | C | C | C | C |
| FTSP | Production file transfer service | FTS | SH | C | C | C | C |
| CSTRP | Production Castor + SRM | SE | SH | C | C | C | C |

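Once the catalog is in tabular form, questions such as "which services are critical on second shift while the accelerator is running?" become simple lookups. The sketch below uses a handful of rows from the catalog tables; the tuple layout and function names are assumptions made for illustration.

```python
CALENDARS = ("AP", "AS", "OP", "OS")

# (service, product, customer, classes for AP/AS/OP/OS), a few rows from the catalog
ROWS = [
    ("RBP", "RB", "SH", "CCCC"),
    ("MONBP", "MONB", "SH", "MMMM"),
    ("GRVWP", "GRVW", "SH", "MLML"),
    ("LFCP-CMS", "LFC", "CMS", "HHHH"),
    ("LFCP-LHCB", "LFC", "LHCb", "CCCC"),
]


def build_catalog(rows):
    """Turn compact rows into {service: {product, customer, classes-by-slot}}."""
    return {
        service: {
            "product": product,
            "customer": customer,
            "classes": dict(zip(CALENDARS, classes)),
        }
        for service, product, customer, classes in rows
    }


def services_in_class(catalog, calendar, service_class):
    """List the services that have the given class in the given calendar slot."""
    return [name for name, entry in catalog.items()
            if entry["classes"][calendar] == service_class]


if __name__ == "__main__":
    catalog = build_catalog(ROWS)
    print(services_in_class(catalog, "AS", "C"))
```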
Questionnaire
- Simple questions to assess readiness for production
- It is not actually necessary to fill out the answers, but the questions should be asked
- Focus is on the infrastructure

Service questions
- What service levels are required for each calendar period?
- Who is providing support for the application?
- Who supports the infrastructure?
- How should the support be contacted?
- What support service do they provide?

Configuration questions
- What are the application interfaces?
- What server does the application run on?
- Is there a picture of the configuration?
- What are the application parameters and how are they set up?

Facilities questions
- Are all systems in a machine room?
- Is the room access controlled?
- Is there good power provision? UPS? Batteries?
- What is the response time for facilities problems?

Hardware questions
- What kind of machine is required? CPU, RAM, disk
- Do we need redundancy? Power supply, disk, ...
- Do maintenance contracts match the service?
Currently, there are no capacity guides for each application. These are required to avoid the purchase of inappropriate machines.

Sample RB disk calculation

| Parameter | Value |
|---|---|
| Size of input sandbox (MB) | 10 |
| Size of output sandbox (MB) | 10 |
| Jobs / day currently | 21000 |
| Estimated factor for LHC | 3 |
| Sandbox purge time (days) | 14 |
| Jobs in queue | 35000 |
| Total disk space required (MB) | 17,640,000 |

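The total in the table can be reproduced as sandbox size per job, times jobs per day, times the LHC scaling factor, times the number of days sandboxes are kept. This formula is inferred from the listed figures rather than stated in the talk; it matches the total without using the jobs-in-queue row. The function name is hypothetical.

```python
def rb_sandbox_disk_mb(input_mb: int, output_mb: int, jobs_per_day: int,
                       lhc_factor: int, purge_days: int) -> int:
    """Disk needed to hold input and output sandboxes until they are purged."""
    per_job_mb = input_mb + output_mb
    return per_job_mb * jobs_per_day * lhc_factor * purge_days


if __name__ == "__main__":
    total_mb = rb_sandbox_disk_mb(input_mb=10, output_mb=10,
                                  jobs_per_day=21000, lhc_factor=3,
                                  purge_days=14)
    print(f"{total_mb:,} MB")  # 17,640,000 MB, roughly 17.6 TB
```

Writing the estimate as a function makes it easy to see how sensitive the sizing is to the purge time and the scaling factor, which are the two least certain inputs.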
Network questions
- What network capacity? OPN connectivity? Bandwidth?
- Firewall ports?
Currently, there is no connectivity guide for each application. This is required for secure setup and appropriate network configuration.

Sample CE ports sheet

| Function | Direction | Port |
|---|---|---|
| Globus Job Manager | Outgoing | |
| GridFTP | Incoming | 2811 |
| GRIS BDII | Incoming | 2135 |
| EDG Log Daemon | Incoming | 9002 |

Database questions
- What is your site's preferred database?
- What are the options for each application?
- Expected database size / growth?
- High-availability options?

Backup / Restore questions
- What needs to be backed up for each service?
- How do we ensure consistency in the event of a restore? e.g. RB / CE
- Does the software corruption risk differ by application? e.g. LFC/SE vs Proxy
- Has a restore test been done?
There is currently no list of critical state data for each application, nor of the steps to be executed after a restore.

Operations questions
- How are problems identified? Local console? Grid monitoring?
- Who should be contacted to resolve the problem?
- Who should be informed of the problem?
- What new procedures / operations guides are required?
- What is the local coverage for nights / weekends?
- How do local and Grid operations interwork?

Validation
- Check that the service class matches the answers
  - A critical service cannot have its server in an office
- Check the dependencies, so that no critical service depends on a non-critical service
  - FTS is critical and requires MyProxy, therefore the MyProxy service must be critical

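The dependency check lends itself to automation. The sketch below flags any service that depends on a service of a lower class; extending the rule from critical services to every class is an assumption, as are the function name and the dictionary layout. The FTS and MyProxy example comes from the talk.

```python
# Lower rank means more critical: Critical, High, Medium, Low, Unmanaged.
RANK = {"C": 0, "H": 1, "M": 2, "L": 3, "U": 4}


def dependency_violations(service_class, depends_on):
    """Return (service, dependency) pairs where the dependency is less critical."""
    violations = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if RANK[service_class[dependency]] > RANK[service_class[service]]:
                violations.append((service, dependency))
    return violations


if __name__ == "__main__":
    depends_on = {"FTS": ["MyProxy"]}
    # MyProxy classed as High while FTS is Critical: reported as a violation.
    print(dependency_violations({"FTS": "C", "MyProxy": "H"}, depends_on))
    # MyProxy raised to Critical: no violations.
    print(dependency_violations({"FTS": "C", "MyProxy": "C"}, depends_on))
```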
Implementation Tracking at CERN
- A dashboard approach on the Wiki

Common Themes
But it's all green? What's the problem?
- Green does not mean no problems. We are often generous with assessments, since red/yellow everywhere does not highlight issues.
Operations
- No operations or problem determination guides
- Limited administration guides
- Support call-tree unclear
- Backup/restore details are missing
Hardware
- Limited or no capacity planning information leads to incorrect server sizing
- 'Forgot a box' problems, e.g. one per VO, not one per site
Development
- Difficult to match user expectations (e.g. a critical service) with the implementation (e.g. stateful)

Summary
- Complete a service catalog for your sites
- Check the questions and prepare an action plan to address items under your control
- Assess the status by service and concentrate on getting the reds to yellows

More Information
- LCG MoU
- SC4 Service Definitions for CERN
- SC4 CERN Dashboard