WLCG Service Requirements WLCG Workshop Mumbai Tim Bell CERN/IT/FIO.

1 WLCG Service Requirements WLCG Workshop Mumbai Tim Bell CERN/IT/FIO

2 Agenda (11th February 2006, Service Checklist, Tim.Bell@cern.ch)
• LCG Memorandum of Understanding
• Defining what needs to be delivered
• Checking the plan
• Tracking delivery using a dashboard

3 What the MoU provides
• A high level definition of the service
• Basis for estimating Tier investments
• Tier responsibilities
• Overall capacity
• Basic support structure
• Implementation schedule
• Governance
• Roles
• *B

4 Tier0 service levels
Maximum delay in responding to operational problems, and average availability:

| Service | Max delay: service interruption | Max delay: capacity degraded by more than 50% | Max delay: capacity degraded by more than 20% | Avg availability: during accelerator operation | Avg availability: at all other times |
|---|---|---|---|---|---|
| Raw data recording | 4 hours | 6 hours | 6 hours | 99% | n/a |
| Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| Networking service to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| All other Tier-0 services | 12 hours | 24 hours | 48 hours | 98% | 98% |
| All other services, prime service hours | 1 hour | 1 hour | 4 hours | 98% | 98% |
| All other services, outwith prime service hours | 12 hours | 24 hours | 48 hours | 97% | 97% |

5 Tier1 service levels

6 The MoU is not …
• An implementation bible
• What grid services at which site
• How to run the services
• How to deploy
• Magic recipe for service delivery
• Application 99% = 1.5 hours down / week
• Administrator 40 hours/week = 24% up
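
The two figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a 168-hour week as the reference period; the function names are illustrative and not part of the MoU. Computed exactly, 99% availability allows 1.68 hours of downtime per week, which the slide rounds to about 1.5 hours, and a 40-hour working week covers just under 24% of the total.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours; assumed reference period


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_WEEK) -> float:
    """Hours a service may be down in the period and still meet the target."""
    return (1.0 - availability) * period_hours


def coverage_fraction(staffed_hours: float,
                      period_hours: float = HOURS_PER_WEEK) -> float:
    """Fraction of the period during which an administrator is present."""
    return staffed_hours / period_hours


if __name__ == "__main__":
    print(f"99% availability allows {allowed_downtime_hours(0.99):.2f} h down per week")
    print(f"A 40-hour week covers {coverage_fraction(40):.1%} of the time")
```

The gap between the two numbers is the point of the slide: the application target has to be met mostly while nobody is at a console.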

7 What is your quest?

8 We seek the holy grail! A stable and functional Grid

9 Define the site services
• What services do we provide?
• Who is responsible?
• What level of service is required?
• What capacity of service?
• What is the support structure?
• Who pays for what?

10 Service catalog approach
• A service catalog consists of:
  • Service Class: criticality
  • Calendar: variation with time
  • Product: what application
  • Customer: which VO
• Service = Service Class x Calendar x Product x Customer

11 Service class
https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition

| Class | Description | Downtime | Reduced | Degraded | Avail |
|---|---|---|---|---|---|
| C | Critical | 1 hour | 1 hour | 4 hours | 99% |
| H | High | 4 hours | 6 hours | 6 hours | 99% |
| M | Medium | 6 hours | 6 hours | 12 hours | 99% |
| L | Low | 12 hours | 24 hours | 48 hours | 98% |
| U | Unmanaged | None | None | None | None |

12 Class notes
• Downtime defines the time between the start of the problem and restoration of service at minimal capacity (i.e. basic function but capacity < 50%)
• Reduced defines the time between the start of the problem and the restoration of a reduced capacity service (i.e. > 50%)
• Degraded defines the time between the start of the problem and the restoration of a degraded capacity service (i.e. > 80%)
• Availability is derived from the sum of the time that the service is down compared with the total time during the calendar period for the service. Site-wide failures are not considered as part of the availability calculations.
• None means the service is running unattended
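
The availability definition above can be written out as a short calculation. This is a minimal sketch under stated assumptions: the `Outage` record and the `availability` function are hypothetical names, and availability is taken to be one minus the ratio of service downtime to the length of the calendar period, with site-wide failures left out as the note describes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Outage:
    """One service interruption; hypothetical record used for illustration."""
    hours: float
    site_wide: bool = False  # site-wide failures are excluded from availability


def availability(outages: Iterable[Outage], period_hours: float) -> float:
    """Fraction of the calendar period during which the service was up."""
    down = sum(o.hours for o in outages if not o.site_wide)
    return 1.0 - down / period_hours


if __name__ == "__main__":
    week = [Outage(hours=1.0), Outage(hours=5.0, site_wide=True)]
    # Only the 1-hour service outage counts against the 168-hour week.
    print(f"{availability(week, 168):.4f}")
```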

13 Service calendar

| Calendar | Description | AccOn | Prime |
|---|---|---|---|
| AP | Accelerator operating, prime shift | Y | Y |
| AS | Accelerator operating, second shift | Y | N |
| OP | Accelerator off, prime shift | N | Y |
| OS | Accelerator off, second shift | N | N |

• Some services are critical only during accelerator shift
• Other services are less critical outside working hours
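
The four calendar codes are simply the combinations of two yes/no flags, which can be sketched as follows. The function name is illustrative; how a site decides whether it is currently in prime shift is not defined on the slide, so both flags are taken as inputs.

```python
def calendar_code(accelerator_on: bool, prime_shift: bool) -> str:
    """Map the AccOn and Prime flags to the two-letter calendar code."""
    first = "A" if accelerator_on else "O"   # A = accelerator operating, O = off
    second = "P" if prime_shift else "S"     # P = prime shift, S = second shift
    return first + second


if __name__ == "__main__":
    for acc in (True, False):
        for prime in (True, False):
            print(acc, prime, calendar_code(acc, prime))
```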

14 Products

| Product Name | Product Short Code | Description |
|---|---|---|
| Resource Broker | RB | Farms out jobs to sites + logging and book-keeping |
| MyProxy | PX | Renew/acquire credentials |
| BDII | | Grid Information System |
| Compute Element | CE | Gateway to local batch systems |
| Mon Box | MONB | Grid Monitoring including archiver |
| Grid View | GRVW | Monitoring of Grid activity |
| Site Functional Test | SFT | Regular test of components per site |
| Grid Peek | GRPK | Storage of outputs of running jobs |
| VOMS | | Manage user/roles for VOs |

15 Products (cont)

| Product Name | Product Short Code | Description |
|---|---|---|
| LCG File Catalog | LFC | Maps file names to storage locations |
| File Transfer Service | FTS | Reliable file transfer delivery |
| Storage Element | SE | SRM Compatible Storage Service |

16 Products notes
• Provides a first-level breakdown of the grid into smaller units
• Surprisingly dynamic list. New products arriving weekly.
• Short codes provide the basis for naming conventions

17 Service catalog

| Service | Instance | Product | Cst | AP | AS | OP | OS |
|---|---|---|---|---|---|---|---|
| RBP | Production Resource Broker | RB | SH | C | C | C | C |
| PXP | Production My Proxy | PX | SH | C | C | C | C |
| BDIIP | Production Global BDII | BDII | SH | C | C | C | C |
| BDIIS | Production Site BDII | BDII | SH | H | H | H | H |
| CEP | Production Compute Element | CE | SH | C | C | C | C |
| MONBP | Production Monbox | MONB | SH | M | M | M | M |
| GRVWP | Production Grid View | GRVW | SH | M | L | M | L |
| SFTP | Production Site Func Test | SFT | SH | M | M | M | M |
| GRPKP | Production Grid Peek Service | GRPK | SH | M | M | M | M |
| VOMSP | Production VOMS | VOMS | SH | C | C | C | C |

• Match product with customer and service class in each calendar slot
• Multiple services (e.g. production, test, site…) for a single product

18 Service catalog (cont)

| Service | Instance | Product | Cst | AP | AS | OP | OS |
|---|---|---|---|---|---|---|---|
| LFCP-ALICE | Alice Production LCG File Catalog | LFC | Alice | H | H | H | H |
| LFCP-ATLAS | Atlas Production LCG File Catalog | LFC | Atlas | H | H | H | H |
| LFCP-CMS | CMS Production LCG File Catalog | LFC | CMS | H | H | H | H |
| LFCP-LHCB | LHCb Production LCG File Catalog | LFC | LHCb | C | C | C | C |
| FTSP | Production file transfer service | FTS | SH | C | C | C | C |
| CSTRP | Production Castor + SRM | SE | SH | C | C | C | C |
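
One way to hold such a catalog in code is a mapping from service name to its product, customer and per-slot service class. The sketch below is an illustration only: `CatalogEntry`, `CATALOG` and `required_class` are hypothetical names, just three rows from the tables above are included, and the customer code "SH" is kept as it appears on the slide because the transcript does not expand it.

```python
from typing import Dict, NamedTuple


class CatalogEntry(NamedTuple):
    instance: str
    product: str             # product short code, e.g. "FTS"
    customer: str            # VO name or the slide's "SH" code
    classes: Dict[str, str]  # calendar slot (AP/AS/OP/OS) -> service class


def slots(ap: str, as_: str, op: str, os_: str) -> Dict[str, str]:
    """Build the per-slot class mapping in the column order of the slide."""
    return {"AP": ap, "AS": as_, "OP": op, "OS": os_}


# A small excerpt of the catalog shown on the slides.
CATALOG: Dict[str, CatalogEntry] = {
    "GRVWP": CatalogEntry("Production Grid View", "GRVW", "SH",
                          slots("M", "L", "M", "L")),
    "FTSP": CatalogEntry("Production file transfer service", "FTS", "SH",
                         slots("C", "C", "C", "C")),
    "LFCP-ALICE": CatalogEntry("Alice Production LCG File Catalog", "LFC",
                               "Alice", slots("H", "H", "H", "H")),
}


def required_class(service: str, accelerator_on: bool, prime_shift: bool) -> str:
    """Look up the service class that applies in the given calendar slot."""
    slot = ("A" if accelerator_on else "O") + ("P" if prime_shift else "S")
    return CATALOG[service].classes[slot]


if __name__ == "__main__":
    # Grid View drops from Medium to Low outside the prime shift.
    print(required_class("GRVWP", accelerator_on=True, prime_shift=True))
    print(required_class("GRVWP", accelerator_on=True, prime_shift=False))
```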

19 Questionnaire
• Simple questions to assess readiness for production
• It is not actually necessary to fill out the answers but the questions should be asked
• Focus is on the infrastructure

20 Service questions
• What service levels are required for each calendar period?
• Who is providing support for the application?
• Who supports the infrastructure?
• How should the support be contacted?
• What support service do they provide?

21 Configuration questions
• What are the application interfaces?
• What server does the application run on?
• Is there a picture of the configuration?
• What are the application parameters and how are they set up?

22 Facilities questions?

23 Facilities questions
• Are all systems in a machine room?
• Is the room access controlled?
• Is there good power provision?
  • UPS? Batteries?
• What is the response time for facilities problems?

24 Hardware questions
• What kind of machine is required?
  • CPU, RAM, Disk
• Do we need redundancy?
  • Power Supply, Disk, …
• Do maintenance contracts match the service?
Currently, there are no capacity guides for each application. These are required to avoid the purchase of inappropriate machines.

25 Sample RB disk calculation

| Parameter | Value |
|---|---|
| Size of input sandbox (MB) | 10 |
| Size of output sandbox (MB) | 10 |
| Jobs / day currently | 21000 |
| Estimated factor for LHC | 3 |
| Sandbox purge time (days) | 14 |
| Jobs in queue | 35000 |
| Total disk space required (MB) | 17,640,000 |
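
The total in this table can be reproduced by multiplying the per-job sandbox size by the expected job rate and the retention period. The formula below is inferred from the numbers rather than stated on the slide, so treat it as an assumption: it matches the 17,640,000 MB total exactly, and the "Jobs in queue" figure does not enter it. The function name is illustrative.

```python
def rb_sandbox_disk_mb(input_mb: float, output_mb: float, jobs_per_day: int,
                       lhc_factor: float, purge_days: int) -> float:
    """Disk needed to keep every job's sandboxes until they are purged.

    Assumed formula: (input + output) x jobs/day x LHC growth factor x days kept.
    """
    return (input_mb + output_mb) * jobs_per_day * lhc_factor * purge_days


if __name__ == "__main__":
    total_mb = rb_sandbox_disk_mb(input_mb=10, output_mb=10,
                                  jobs_per_day=21000, lhc_factor=3,
                                  purge_days=14)
    print(f"{total_mb:,.0f} MB, about {total_mb / 1_000_000:.2f} TB")
```

A capacity guide per application would amount to publishing this kind of formula together with realistic parameter values.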

26 Network questions
• What network capacity?
  • OPN connectivity?
  • Bandwidth?
  • Firewall ports?
Currently, there is no connectivity guide for each application. This is required for secure set up and appropriate network configuration.

27 Sample CE ports sheet

| Function | Direction | Port |
|---|---|---|
| Globus Job Manager | Outgoing | 20000-21000 |
| GridFTP | Incoming | 2811 |
| GRIS BDII | Incoming | 2135 |
| EDG Log Daemon | Incoming | 9002 |
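
A ports sheet like this is easy to keep as data so that firewall rules can be checked against it. The sketch below encodes the four rows of the sample sheet; `CE_PORTS` and `functions_for` are hypothetical names, and the port range for the Globus Job Manager is treated as inclusive at both ends, which is an assumption.

```python
from typing import List, Tuple

# (function, direction, ports) taken from the sample CE ports sheet.
CE_PORTS: List[Tuple[str, str, range]] = [
    ("Globus Job Manager", "outgoing", range(20000, 21001)),  # 20000-21000 inclusive
    ("GridFTP", "incoming", range(2811, 2812)),
    ("GRIS BDII", "incoming", range(2135, 2136)),
    ("EDG Log Daemon", "incoming", range(9002, 9003)),
]


def functions_for(direction: str, port: int) -> List[str]:
    """Return the CE functions that need this port open in this direction."""
    return [name for name, d, ports in CE_PORTS
            if d == direction and port in ports]


if __name__ == "__main__":
    print(functions_for("incoming", 2811))
    print(functions_for("outgoing", 20500))
    print(functions_for("incoming", 22))  # not on the sheet, so an empty list
```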

28 Database questions
• What is your site's preferred database?
• What are the options for each application?
• Expected database size / growth?
• High Availability options?

29 Backup / Restore questions
• What needs to be backed up for each service?
• How do we ensure consistency in the event of a restore? e.g. RB / CE.
• Does the software corruption risk differ by application? e.g. LFC/SE vs Proxy
• Has a restore test been done?
There is currently no list of critical state data for each application or of the steps to be executed after a restore.

30 Operations questions
• How are problems identified?
  • Local console?
  • Grid Monitoring?
• Who should be contacted to resolve the problem?
• Who should be informed of the problem?
• What new procedures / operations guides are required?
• What is the local coverage for nights / weekends?
• How do local and Grid operations interwork?

31 Validation
• Check that the service class matches the answers
  • A critical service cannot have the server in an office
• Check the dependencies so that no critical service depends on a non-critical service
  • FTS, which is critical, requires MyProxy, therefore the MyProxy service must be critical
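
The dependency check lends itself to automation once the service classes and dependencies are written down. The sketch below is one possible approach, not the method used at CERN: the names are hypothetical, and it generalises the rule slightly by flagging any service that depends on a service of a lower class, which includes the critical-on-non-critical case from the slide.

```python
from typing import Dict, List, Tuple

# Higher number = more critical; codes follow the service class table.
RANK: Dict[str, int] = {"C": 4, "H": 3, "M": 2, "L": 1, "U": 0}


def dependency_violations(classes: Dict[str, str],
                          depends_on: Dict[str, List[str]]
                          ) -> List[Tuple[str, str]]:
    """List (service, dependency) pairs where the dependency has a lower class."""
    violations = []
    for service, deps in depends_on.items():
        for dep in deps:
            if RANK[classes[dep]] < RANK[classes[service]]:
                violations.append((service, dep))
    return violations


if __name__ == "__main__":
    deps = {"FTS": ["MyProxy"]}
    # MyProxy wrongly classed as Medium: the check reports the mismatch.
    print(dependency_violations({"FTS": "C", "MyProxy": "M"}, deps))
    # MyProxy classed as Critical, as the slide requires: no violations.
    print(dependency_violations({"FTS": "C", "MyProxy": "C"}, deps))
```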

32 Implementation Tracking at CERN
• A dashboard approach on the Wiki

33 Common Themes
• But it's all green? What's the problem?
  • Green does not mean no problems. We are often generous with assessments since red/yellow everywhere does not highlight issues.
• Operations
  • No operations or problem determination guides. Limited administration guides.
  • Support call-tree unclear
  • Backup/Restore details are missing
• Hardware
  • Limited or no capacity planning information leads to incorrect server sizing
  • 'Forgot a box' problems, e.g. one per VO, not one per site
• Development
  • Difficult to match the user expectations (e.g. a critical service) with implementation (e.g. stateful)

34 Summary
• Complete a service catalog for your sites
• Check the questions and prepare an action plan to address items under your control
• Assess the status by service and concentrate on getting the reds to yellows

35 More Information
• LCG MoU
  • http://lcg.web.cern.ch/lcg/C-RRB/MoU/WLCGMoU.pdf
• SC4 Service Definitions for CERN
  • https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition
• SC4 CERN Dashboard
  • https://uimon.cern.ch/twiki/bin/view/LCG/WlcgScDash

