WLCG Service Requirements WLCG Workshop Mumbai Tim Bell CERN/IT/FIO.


11th February 2006, Service Checklist

Agenda
- LCG Memorandum of Understanding
- Defining what needs to be delivered
- Checking the plan
- Tracking delivery using a dashboard

What the MoU provides
- A high-level definition of the service
- Basis for estimating Tier investments
- Tier responsibilities
- Overall capacity
- Basic support structure
- Implementation schedule
- Governance
- Roles
*B

Tier0 service levels

The first three value columns give the maximum delay in responding to operational problems; the last two give the average availability.

| Service | Service interruption | Capacity degraded by more than 50% | Capacity degraded by more than 20% | Availability during accelerator operation | Availability at all other times |
|---|---|---|---|---|---|
| Raw data recording | 4 hours | 6 hours | 6 hours | 99% | n/a |
| Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| Networking service to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a |
| All other Tier-0 services | 12 hours | 24 hours | 48 hours | 98% | 98% |
| All other services – prime service hours | 1 hour | 1 hour | 4 hours | 98% | 98% |
| All other services – outwith prime service hours | 12 hours | 24 hours | 48 hours | 97% | 97% |

Tier1 service levels

The MoU is not …
- An implementation bible
  - What grid services at which site
  - How to run the services
  - How to deploy
- Magic recipe for service delivery
  - Application: 99% = about 1.7 hours down / week
  - Administrator: 40 hours/week = 24% up
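
The two figures at the end of this slide come from simple arithmetic over a week. A minimal sketch, assuming a 7 x 24 = 168 hour week; the function names are illustrative and not from the slides:

```python
HOURS_PER_WEEK = 7 * 24  # assumption: a full calendar week of 168 hours


def allowed_downtime_hours(availability: float, period_hours: float = HOURS_PER_WEEK) -> float:
    """Hours of downtime permitted in a period at the given availability (0..1)."""
    return (1.0 - availability) * period_hours


def coverage_fraction(staffed_hours: float, period_hours: float = HOURS_PER_WEEK) -> float:
    """Fraction of the period during which an administrator is on duty."""
    return staffed_hours / period_hours


print(f"99% availability allows {allowed_downtime_hours(0.99):.2f} hours down per week")
print(f"40 staffed hours cover {coverage_fraction(40):.0%} of the week")
```

The gap between the two numbers is the point of the slide: a 99% target leaves under two hours of downtime a week, while a single administrator is present for only about a quarter of it.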

What is your quest?

We seek the holy grail!
A stable and functional Grid

Define the site services
- What services do we provide?
- Who is responsible?
- What level of service is required?
- What capacity of service?
- What is the support structure?
- Who pays for what?

Service catalog approach
- A service catalog consists of:
  - Service Class – Criticality
  - Calendar – Variation with time
  - Product – What application
  - Customer – Which VO
- Service = Service Class x Calendar x Product x Customer
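
One way to read "Service = Service Class x Calendar x Product x Customer" is as one record per product, customer and calendar slot, each carrying a service class. The sketch below is an assumed data model, not the one used at CERN; the `CatalogEntry` and `expand` names are hypothetical, and the sample values are the Grid View row from the service catalog slide later in the deck.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    product: str        # product short code, e.g. "GRVW"
    customer: str       # VO, or "SH" as used in the catalog slides
    calendar: str       # calendar slot: "AP", "AS", "OP" or "OS"
    service_class: str  # criticality: "C", "H", "M", "L" or "U"


def expand(product: str, customer: str, class_by_slot: Dict[str, str]) -> List[CatalogEntry]:
    """Build one catalog entry per calendar slot for a product/customer pair."""
    return [
        CatalogEntry(product, customer, slot, service_class)
        for slot, service_class in class_by_slot.items()
    ]


# Grid View: medium during prime shift, low during second shift.
entries = expand("GRVW", "SH", {"AP": "M", "AS": "L", "OP": "M", "OS": "L"})
for entry in entries:
    print(entry)
```

Keeping the class per calendar slot, rather than per service, is what lets a service be less critical outside working hours.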

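Service class

| Class | Description | Downtime | Reduced | Degraded | Avail |
|---|---|---|---|---|---|
| C | Critical | 1 hour | 1 hour | 4 hours | 99% |
| H | High | 4 hours | 6 hours | 6 hours | 99% |
| M | Medium | 6 hours | 6 hours | 12 hours | 99% |
| L | Low | 12 hours | 24 hours | 48 hours | 98% |
| U | Unmanaged | None | | | |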

Class notes
- Downtime defines the time between the start of the problem and restoration of service at minimal capacity (i.e. basic function but capacity < 50%)
- Reduced defines the time between the start of the problem and the restoration of a reduced capacity service (i.e. >50%)
- Degraded defines the time between the start of the problem and the restoration of a degraded capacity service (i.e. >80%)
- Availability is defined from the total time that the service is down, compared with the total time in the calendar period for the service. Site-wide failures are not considered as part of the availability calculations.
- None means the service is running unattended
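
The definitions above can be turned into a simple check of an incident against a class's targets. This is a sketch under stated assumptions: the `ClassTargets` type and function names are hypothetical, only the class L figures (12, 24 and 48 hours) are taken from the service class table, and availability is read as one minus the down-time fraction, with site-wide failures already excluded by the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClassTargets:
    downtime_h: float  # max hours until service is back at minimal capacity
    reduced_h: float   # max hours until capacity is back above 50%
    degraded_h: float  # max hours until capacity is back above 80%


LOW = ClassTargets(downtime_h=12, reduced_h=24, degraded_h=48)  # class L


def missed_targets(targets: ClassTargets, minimal_h: float,
                   above_50_h: float, above_80_h: float) -> List[str]:
    """Return the names of the targets an incident failed to meet.

    Each *_h argument is the elapsed time, in hours, from the start of the
    problem until that capacity level was restored.
    """
    checks = [
        ("downtime", minimal_h, targets.downtime_h),
        ("reduced", above_50_h, targets.reduced_h),
        ("degraded", above_80_h, targets.degraded_h),
    ]
    return [name for name, actual, limit in checks if actual > limit]


def availability(down_hours: float, period_hours: float) -> float:
    """Availability over a calendar period; down_hours excludes site-wide failures."""
    return 1.0 - down_hours / period_hours


print(missed_targets(LOW, minimal_h=10, above_50_h=30, above_80_h=40))
print(f"{availability(3.36, 168):.0%}")
```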

Service calendar

| Calendar | Description | AccOn | Prime |
|---|---|---|---|
| AP | Accelerator operating, prime shift | Y | Y |
| AS | Accelerator operating, second shift | Y | N |
| OP | Accelerator off, prime shift | N | Y |
| OS | Accelerator off, second shift | N | N |

- Some services are critical only during accelerator shift
- Other services are less critical outside working hours
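
The calendar is a two-flag lookup, which a monitoring or escalation script could encode directly. A minimal sketch; the function name is hypothetical, and how "accelerator operating" and "prime shift" are determined at a site is left to the caller:

```python
# Calendar codes keyed by (accelerator operating, prime shift).
CALENDAR_SLOTS = {
    (True, True): "AP",
    (True, False): "AS",
    (False, True): "OP",
    (False, False): "OS",
}


def calendar_slot(accelerator_on: bool, prime_shift: bool) -> str:
    """Return the service calendar code for the current conditions."""
    return CALENDAR_SLOTS[(accelerator_on, prime_shift)]


print(calendar_slot(accelerator_on=True, prime_shift=False))
```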

Products

| Product Name | Product Short Code | Description |
|---|---|---|
| Resource Broker | RB | Farms out jobs to sites + logging and book-keeping |
| MyProxy | PX | Renew/acquire credentials |
| BDII | BDII | Grid Information System |
| Compute Element | CE | Gateway to local batch systems |
| Mon Box | MONB | Grid Monitoring including archiver |
| Grid View | GRVW | Monitoring of Grid activity |
| Site Functional Test | SFT | Regular test of components per site |
| Grid Peek | GRPK | Storage of outputs of running jobs |
| VOMS | VOMS | Manage user/roles for VOs |

Products (cont)

| Product Name | Product Short Code | Description |
|---|---|---|
| LCG File Catalog | LFC | Maps file names to storage locations |
| File Transfer Service | FTS | Reliable file transfer delivery |
| Storage Element | SE | SRM Compatible Storage Service |

Products notes
- Provides a 1st-level breakdown of the grid into smaller units
- Surprisingly dynamic list. New products arriving weekly.
- Short codes provide the basis for naming conventions

Service catalog

| Service | Instance | Product | Cst | AP | AS | OP | OS |
|---|---|---|---|---|---|---|---|
| RBP | Production Resource Broker | RB | SH | C | C | C | C |
| PXP | Production My Proxy | PX | SH | C | C | C | C |
| BDIIP | Production Global BDII | BDII | SH | C | C | C | C |
| BDIIS | Production Site BDII | BDII | SH | H | H | H | H |
| CEP | Production Compute Element | CE | SH | C | C | C | C |
| MONBP | Production Monbox | MONB | SH | M | M | M | M |
| GRVWP | Production Grid View | GRVW | SH | M | L | M | L |
| SFTP | Production Site Func Test | SFT | SH | M | M | M | M |
| GRPKP | Production Grid Peek Service | GRPK | SH | M | M | M | M |
| VOMSP | Production VOMS | VOMS | SH | C | C | C | C |

- Match product with customer and service class in each calendar slot
- Multiple services (e.g. production, test, site…) for a single product

Service catalog (cont)

| Service | Instance | Product | Cst | AP | AS | OP | OS |
|---|---|---|---|---|---|---|---|
| LFCP-ALICE | Alice Production LCG File Catalog | LFC | Alice | H | H | H | H |
| LFCP-ATLAS | Atlas Production LCG File Catalog | LFC | Atlas | H | H | H | H |
| LFCP-CMS | CMS Production LCG File Catalog | LFC | CMS | H | H | H | H |
| LFCP-LHCB | LHCb Production LCG File Catalog | LFC | LHCb | C | C | C | C |
| FTSP | Production file transfer service | FTS | SH | C | C | C | C |
| CSTRP | Production Castor + SRM | SE | SH | C | C | C | C |

Questionnaire
- Simple questions to assess readiness for production
- It is not actually necessary to fill out the answers but the questions should be asked
- Focus is on the infrastructure

Service questions
- What service levels are required for each calendar period?
- Who is providing support for the application?
- Who supports the infrastructure?
- How should the support be contacted?
- What support service do they provide?

Configuration questions
- What are the application interfaces?
- What server does the application run on?
- Is there a picture of the configuration?
- What are the application parameters and how are they set up?

Facilities questions?

11 th February 2006Service Checklist Facilities questions  Are all systems in a machine room ?  Is the room access controlled ?  Is there good power provision ?  UPS ? Batteries ?  What is the response time for facilities problems ?

Hardware questions
- What kind of machine is required?
  - CPU, RAM, Disk
- Do we need redundancy?
  - Power Supply, Disk, ….
- Do maintenance contracts match the service?

Currently, there are no capacity guides for each application. These are required to avoid the purchase of inappropriate machines.

Sample RB disk calculation

| Parameter | Value |
|---|---|
| Size of input sandbox | 10 MB |
| Size of output sandbox | 10 MB |
| Jobs / Day currently | 21000 |
| Estimated Factor for LHC | 3 |
| Sandbox Purge Time | 14 days |
| Jobs in queue | 35000 |
| Total Disk Space Required | 17,640,000 MB |
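
The slide does not state the formula behind the total. The sketch below uses an inferred one that reproduces it: sandbox size per job, times jobs per day scaled by the LHC factor, times the purge time in days. The "Jobs in queue" figure does not appear to enter this total and is left out; the function name is hypothetical.

```python
def rb_sandbox_disk_mb(input_mb: int, output_mb: int, jobs_per_day: int,
                       lhc_factor: int, purge_days: int) -> int:
    """Disk needed to hold sandboxes until they are purged, in MB (inferred formula)."""
    per_job_mb = input_mb + output_mb                       # input + output sandbox
    retained_jobs = jobs_per_day * lhc_factor * purge_days  # jobs kept before purge
    return per_job_mb * retained_jobs


total_mb = rb_sandbox_disk_mb(input_mb=10, output_mb=10, jobs_per_day=21_000,
                              lhc_factor=3, purge_days=14)
print(f"{total_mb:,} MB")  # 17,640,000 MB
```

A per-application function like this is the kind of capacity guide the hardware questions note is currently missing.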

Network questions
- What network capacity?
  - OPN connectivity?
  - Bandwidth?
  - Firewall ports?

Currently, there is no connectivity guide for each application. This is required for secure setup and appropriate network configuration.

Sample CE ports sheet

| Function | Direction | Port |
|---|---|---|
| Globus Job Manager | Outgoing | |
| GridFTP | Incoming | 2811 |
| GRIS BDII | Incoming | 2135 |
| EDG Log Daemon | Incoming | 9002 |
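
A ports sheet like this is easier to keep in step with the firewall when it is held as data. A minimal sketch, assuming a hypothetical `PortEntry` record; the Globus Job Manager port is left unset because the sheet as transcribed gives none:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PortEntry:
    function: str
    direction: str       # "Incoming" or "Outgoing"
    port: Optional[int]  # None where the sheet gives no port


CE_PORTS = [
    PortEntry("Globus Job Manager", "Outgoing", None),
    PortEntry("GridFTP", "Incoming", 2811),
    PortEntry("GRIS BDII", "Incoming", 2135),
    PortEntry("EDG Log Daemon", "Incoming", 9002),
]


def incoming_ports(entries: List[PortEntry]) -> List[int]:
    """Ports that must be opened for inbound connections, sorted."""
    return sorted(e.port for e in entries
                  if e.direction == "Incoming" and e.port is not None)


def entries_without_port(entries: List[PortEntry]) -> List[str]:
    """Functions whose port still has to be filled in on the sheet."""
    return [e.function for e in entries if e.port is None]


print(incoming_ports(CE_PORTS))
print(entries_without_port(CE_PORTS))
```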

Database questions
- What is your site's preferred database?
- What are the options for each application?
- Expected database size / growth?
- High Availability options?

Backup / Restore questions
- What needs to be backed up for each service?
- How do we ensure consistency in the event of a restore? e.g. RB / CE.
- Does the software corruption risk differ by application? e.g. LFC/SE vs Proxy
- Has a restore test been done?

There is currently no list of critical state data for each application, or of the steps to be executed after a restore.

Operations questions
- How are problems identified?
  - Local console?
  - Grid Monitoring?
- Who should be contacted to resolve the problem?
- Who should be informed of the problem?
- What new procedures / operations guides are required?
- What is the local coverage for nights / weekends?
- How do local and Grid operations interwork?

Validation
- Check that the service class matches the answers
  - A critical service cannot have the server in an office
- Check the dependencies so that no critical service depends on a non-critical service
  - FTS is critical and requires MyProxy, therefore the MyProxy service must be critical
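
The dependency rule can be checked mechanically once classes and dependencies are written down. The sketch below is illustrative only: the function name is hypothetical, it treats just class C as critical, and it ignores calendar slots. `PXP` is deliberately given class M to trigger the check; the service catalog lists it as C.

```python
from typing import Dict, List, Tuple

CRITICAL = "C"


def dependency_violations(service_class: Dict[str, str],
                          depends_on: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (service, dependency) pairs where a critical service
    depends on a service that is not itself critical."""
    violations = []
    for service, dependencies in depends_on.items():
        if service_class[service] != CRITICAL:
            continue  # the rule only constrains critical services
        for dependency in dependencies:
            if service_class[dependency] != CRITICAL:
                violations.append((service, dependency))
    return violations


classes = {"FTSP": "C", "PXP": "M"}  # PXP set to M for illustration only
dependencies = {"FTSP": ["PXP"]}     # FTS requires MyProxy

print(dependency_violations(classes, dependencies))
```

Raising `PXP` to class C, as in the catalog, makes the list of violations empty.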

Implementation Tracking at CERN
- A dashboard approach on the Wiki

Common Themes
- But it’s all green? What’s the problem?
  - Green does not mean no problems. We are often generous with assessments since red/yellow everywhere does not highlight issues.
- Operations
  - No operations or problem determination guides. Limited administration guides.
  - Support call-tree unclear
  - Backup/Restore details are missing
- Hardware
  - Limited or no capacity planning information leads to incorrect server sizing
  - ‘Forgot a box’ problems, e.g. one per VO, not one per site
- Development
  - Difficult to match the user expectations (e.g. a critical service) with implementation (e.g. stateful)

Summary
- Complete a service catalog for your sites
- Check the questions and prepare an action plan to address items under your control
- Assess the status by service and concentrate on getting the reds to yellows

More Information
- LCG MoU
- SC4 Service Definitions for CERN
- SC4 CERN Dashboard