CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t Service Level & Responsibilities Dirk Düllmann LCG 3D Database Workshop 13-14 September,

Slides:



Advertisements
Similar presentations
DataGrid is a project funded by the European Union 22 September 2003 – n° 1 EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance
Advertisements

CERN - IT Department CH-1211 Genève 23 Switzerland t Relational Databases for the LHC Computing Grid The LCG Distributed Database Deployment.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle and Streams Diagnostics and Monitoring Eva Dafonte Pérez Florbela Tique Aires.
CERN IT Department CH-1211 Genève 23 Switzerland t Recovery Exercise Wrap-up Jacek Wojcieszuk, CERN IT-DM Distributed Database Operations.
Module 8 Implementing Backup and Recovery. Module Overview Planning Backup and Recovery Backing Up Exchange Server 2010 Restoring Exchange Server 2010.
CERN IT Department CH-1211 Genève 23 Switzerland t Next generation of virtual infrastructure with Hyper-V Michal Kwiatek, Juraj Sucik, Rafal.
CERN IT Department CH-1211 Genève 23 Switzerland t Some Hints for “Best Practice” Regarding VO Boxes Running Critical Services and Real Use-cases.
Database monitoring and service validation Dirk Duellmann CERN IT/PSS and 3D
LCG Milestones for Deployment, Fabric, & Grid Technology Ian Bird LCG Deployment Area Manager PEB 3-Dec-2002.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
LCG 3D StatusDirk Duellmann1 LCG 3D Throughput Tests Scheduled for May - extended until end of June –Use the production database clusters at tier 1 and.
15 Copyright © 2005, Oracle. All rights reserved. Performing Database Backups.
CERN - IT Department CH-1211 Genève 23 Switzerland t Tier0 database extensions and multi-core/64 bit studies Maria Girone, CERN IT-PSS LCG.
INFSO-RI Enabling Grids for E-sciencE SA1: Cookbook (DSA1.7) Ian Bird CERN 18 January 2006.
Workshop Summary (my impressions at least) Dirk Duellmann, CERN IT LCG Database Deployment & Persistency Workshop.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Metalink for Tier 1 Miguel Anjo Database mini workshop 26.January.2007.
CERN IT Department CH-1211 Geneva 23 Switzerland t Daniel Gomez Ruben Gaspar Ignacio Coterillo * Dawid Wojcik *CERN/CSIC funded by Spanish.
Appendix C: Designing an Operations Framework to Manage Security.
Database Administrator RAL Proposed Workshop Goals Dirk Duellmann, CERN.
CERN Physics Database Services and Plans Maria Girone, CERN-IT
Training and Dissemination Enabling Grids for E-sciencE Jinny Chien, ASGC 1 Training and Dissemination Jinny Chien Academia Sinica Grid.
3D Workshop Outline & Goals Dirk Düllmann, CERN IT More details at
BNL Tier 1 Service Planning & Monitoring Bruce G. Gibbard GDB 5-6 August 2006.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Real Application Clusters (RAC) Techniques for implementing & running robust.
CERN IT Department CH-1211 Genève 23 Switzerland t Security Overview Luca Canali, CERN Distributed Database Operations Workshop April
CERN-IT Oracle Database Physics Services Maria Girone, IT-DB 13 December 2004.
Grid Operations Centre LCG SLAs and Site Audits Trevor Daniels, John Gordon GDB 8 Mar 2004.
CERN Database Services for the LHC Computing Grid Maria Girone, CERN.
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
Oracle for Physics Services and Support Levels Maria Girone, IT-ADC 24 January 2005.
3D Testing and Monitoring Lee Lueking LCG 3D Meeting Sept. 15, 2005.
CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review Distributed Database Workshop CERN, 27 th November 2009 Eva Dafonte.
Data Transfer Service Challenge Infrastructure Ian Bird GDB 12 th January 2005.
State of Georgia Release Management Training
CERN IT Department CH-1211 Geneva 23 Switzerland t WLCG Operation Coordination Luca Canali (for IT-DB) Oracle Upgrades.
LCG 3D Project Update (given to LCG MB this Monday) Dirk Duellmann CERN IT/PSS and 3D
CERN IT Department CH-1211 Geneva 23 Switzerland t Eva Dafonte Perez IT-DB Database Replication, Backup and Archiving.
Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.
CERN - IT Department CH-1211 Genève 23 Switzerland t Operating systems and Information Services OIS Proposed Drupal Service Definition IT-OIS.
CERN - IT Department CH-1211 Genève 23 Switzerland Operations procedures CERN Site Report Grid operations workshop Stockholm 13 June 2007.
Site Services and Policies Summary Dirk Düllmann, CERN IT More details at
Status of tests in the LCG 3D database testbed Eva Dafonte Pérez LCG Database Deployment and Persistency Workshop.
Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.
CERN IT Department CH-1211 Geneva 23 Switzerland t Distributed Database Operations Workshop CERN, 17th November 2010 Dawid Wójcik Streams.
CERN IT Department CH-1211 Geneva 23 Switzerland t James Casey CCRC’08 April F2F 1 April 2008 Communication with Network Teams/ providers.
Database Project Milestones (+ few status slides) Dirk Duellmann, CERN IT-PSS (
LHCC Referees Meeting – 28 June LCG-2 Data Management Planning Ian Bird LHCC Referees Meeting 28 th June 2004.
WLCG Service Report ~~~ WLCG Management Board, 10 th November
A quick summary and some ideas for the 2005 work plan Dirk Düllmann, CERN IT More details at
ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop.
Grid Deployment Technical Working Groups: Middleware selection AAA,security Resource scheduling Operations User Support GDB Grid Deployment Resource planning,
1 LCG Distributed Deployment of Databases A Project Proposal Dirk Düllmann LCG PEB 20th July 2004.
CERN IT Department CH-1211 Genève 23 Switzerland t CMS SAM Testing Andrea Sciabà Grid Deployment Board May 14, 2008.
Feature Overview Oracle Explorer – browse and alter schema Wizards and Designers Automatic code generation PL/SQL Editor with IntelliSense Oracle Data.
15-Jun-04D.P.Kelsey, LCG-GDB-Security1 LCG/GDB Security Update (Report from the LCG Security Group) CERN 15 June 2004 David Kelsey CCLRC/RAL, UK
CERN IT Department CH-1211 Genève 23 Switzerland t DPM status and plans David Smith CERN, IT-DM-SGT Pre-GDB, Grid Storage Services 11 November.
LCG Introduction John Gordon, STFC GDB December 7 th 2010.
Dirk Duellmann CERN IT/PSS and 3D
LCG 3D Distributed Deployment of Databases
Ian Bird GDB Meeting CERN 9 September 2003
Database Services at CERN Status Update
3D Application Tests Application test proposals
Database Readiness Workshop Intro & Goals
Maximum Availability Architecture Enterprise Technology Centre.
WLCG Service Interventions
LCG Distributed Deployment of Databases A Project Proposal
Conditions Data access using FroNTier Squid cache Server
Workshop Summary Dirk Duellmann.
3D Project Status Report
Presentation transcript:

CERN - IT Department CH-1211 Genève 23 Switzerland t Service Level & Responsibilities Dirk Düllmann LCG 3D Database Workshop September, CERN

Dirk.D ü Service Level & Procedures - 2 A Proposal As the production is starting now we need to agree on –Responsibilities of sites vs users vs application owners –Integrate communication and operational procedures with the existing grid infrastructure Next slides contain a proposal for a possible agreement –Based on discussion with site during the GridKA DB admin workshop –Taking into account the MoU requirements –the current state of experience and manpower of the sites –Some common sense…

Dirk.D ü Service Level & Procedures - 3 Database setup and maintenance Each T1 site is responsible for installation and maintenance of the DB and Frontier servers according to experiment and project request –Requests/updates will be collected by 3D and presented to LCG GDB/MB for approval every 6 month The sites are responsible for –h/w server selection, acquisition, installation and monitoring as well as related network setup –s/w installation and upgrade according to the agreed evolution defined –regular application of security patches according to site policy Tier 0 is responsible for defining the streams setup in agreement with the experiments/grid deployment 3D will collect and distribute experience and configuration examples from all sites –Including proposals for patch and security upgrades

Dirk.D ü Service Level & Procedures - 4 Frontier and Squid Service In case of h/w problems of the squid setup sites are required to replace unavailable nodes, but not the cached data. –No cache backup is required In case the squid cache becomes inconsistent (eg after a power failure) sites may clear the cache (following procedures defined by CMS) The Frontier production setup at T0 is operated by FIO (box level) and the FNAL Frontier team (tomcat & squid) –Do we need to review this?

Dirk.D ü Service Level & Procedures - 5 Database Consistency The sites are responsible for recovery from unavailability/inconsistency caused by power or h/w problems –Worst case: re-import from T0 and streams resync of one site –Need to exercise this at each site! The application owners are responsible for recovering from logical corruption caused by their s/w packages –Worst case: point-in-time recovery and re-sync on all affected sites –Need to schedule a full size recovery test to estimate the unavailability caused by this.

Dirk.D ü Service Level & Procedures - 6 Backup and Recovery Each site is responsible to setup database backup and recovery infrastructure via Oracle RMAN –This may include on-disk backups and should include tape backups and associated media Backups should be performed online and with a retention period which is compatible with the time window for point-in- time recovery required by the experiments –Eg 1 month or 3 month? This is required to allow for a standard recovery procedure including streams re-synchronisation The T1 sites are responsible for performing –Standard database recovery (eg after media faults) –Point-in-time recovery within the agreed time window in case of logical data corruption Tier 0 is responsible for the streams re-syncronisation after the site is locally consistent

Dirk.D ü Service Level & Procedures - 7 Problems Resolution & Support Team The sites are responsible for staffing their support teams to meet the problem response time and availability numbers defined by the LCG MoU –Work split between DBA and other support staff should be organised by each site Availability numbers there are only defined indirectly –eg for high level application types –need to correlate service and database availability monitoring to understand required DB availability If application code does not implement DB retry/failover then the DB availability requirements significantly grow. –Are there experiment or grid s/w policies ?

Dirk.D ü Service Level & Procedures - 8 Problem Reporting We should use as much as possible the established reporting channels Each site should now join the GGUS support setup –Expect that all service users will report problems via this channel Each site should pre-announce service changes/outage (new list for any 3D service related –EGEE broadcast and other LCG lists (gmod) Integration with operations meetings –Either: All sites database team are represented via their grid teams –Or: LCG 3D is represented by one person (eg a rotating database manager on duty - dbmod)

Dirk.D ü Service Level & Procedures - 9 Service Security Security monitoring and patching is entirely site responsibility –Each site needs to monitor for security incidents and apply security patches according to site policy –This includes emergency interventions like change of compromised admin credentials 3D will make information about content and schedule of security patches available and collect experience with their application at the sites –But their selection and application is with the sites The application of security patches does not require agreement of experiments / software provider –It is the responsibility of s/w owners to adapt their applications in case issues with security patches are detected.

Dirk.D ü Service Level & Procedures - 10 Next Steps Please point out any missing points Can we write down a first agreement during this month for the wiki pages? –And review it during upcoming workshops Would somebody (from an experiment or site) be prepared act as an editor?