CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it Distributed Database Operations Workshop CERN, 17th November 2010 Dawid Wójcik Streams.

Presentation transcript:

Streams Status Report
Dawid Wójcik
Distributed Database Operations Workshop, CERN, 17th November 2010

Outline
3D Operations – reminder
  – Tier0 responsibilities
  – Tier1 responsibilities
  – Announcing interventions
Service Incident Reports (SIRs)
Procedures
Monitoring
Recent problems and interventions
Useful documentation

Tier0 responsibilities
Initial Streams setup
Adding new schemas to the Streams environment
Split & Merge procedures
Streams resynchronization
  – Split & merge
  – Coordination between the volunteering site and the affected site
Analyze and test new features and optimizations
Validate upgrades and patches
Monitoring
  – Oracle Enterprise Manager for 3D
      Agent installation and configuration is a Tier1 responsibility
  – StreamMon

Tier1 responsibilities
Interventions
  – Before: announce
  – After: check and re-enable Streams processes
      Apply
      Propagation: use the “STRMPROP_<site>” account to connect to the downstream database (e.g. STRMPROP_PIC for the PIC Tier1)
      Enable the capture process when the site is split
Keep the 3D OEM operational
  – Check agent status (recommended version is )
  – Configure targets
Streams resynchronization
  – Collaborate with the volunteering Tier1 on re-synchronizing Streams
  – Update Tier0 on resynchronization progress
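The account-naming convention and the post-intervention checks listed above can be sketched in Python. This is only an illustration: the function name, the list name and the upper-casing of the site name are assumptions. The slide itself supplies just the STRMPROP_ prefix, the PIC example and the three processes to check.

```python
# Hypothetical helper; only the "STRMPROP_" prefix and the PIC example
# come from the slide. Upper-casing the site name is an assumption.

# Processes a Tier1 checks and re-enables after an intervention (from the slide).
POST_INTERVENTION_CHECKS = [
    "apply",
    "propagation",
    "capture (only when the site is split)",
]


def propagation_account(site: str) -> str:
    """Build the propagation account name for a Tier1 site."""
    cleaned = site.strip().upper()
    if not cleaned:
        raise ValueError("site name must not be empty")
    return f"STRMPROP_{cleaned}"


if __name__ == "__main__":
    print(propagation_account("pic"))  # STRMPROP_PIC
    for step in POST_INTERVENTION_CHECKS:
        print("check and re-enable:", step)
```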

Announcing interventions
Announce DB-related interventions
  – Schedule the new intervention using the 3D Twiki
  – Submit EGEE broadcasts
  – Register outages in the CIC portal
  – Long interventions: contact Tier0 to analyze whether it is necessary to split the Streams setup
Unplanned downtime: contact Tier0
  – Problem description, progress and expected duration
      Report regularly
  – Feel free to ask Tier0 and Tier1 DBAs for specific help

Service Incident Reports
Service Incident Reports (SIRs)
  – Unexpected database downtime > 4 hours
  – Unexpected Streams outages caused by hardware or software problems
      Create a SIR for site resynchronization
      Create a SIR for any Streams outage caused by an Oracle bug or an unknown issue
      Do not create a SIR for apply problems after node reboots or for ‘user’ errors caused by modifying read-only data
  – Log SIRs to:
  – Template:
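The SIR criteria above amount to a small decision rule, which could be sketched as follows. The class, field names and cause labels are hypothetical, and the order in which the rules are checked (resynchronization first, then long downtime, then the exemptions) is an assumption; the slide does not say which rule wins when several apply.

```python
from dataclasses import dataclass

# Threshold from the slide: unexpected database downtime > 4 hours.
SIR_DOWNTIME_THRESHOLD_HOURS = 4

# Hypothetical cause labels, chosen for this sketch only.
REPORTABLE_CAUSES = {"hardware", "software", "oracle_bug", "unknown"}
EXEMPT_CAUSES = {"node_reboot", "user_error"}


@dataclass
class Incident:
    unexpected: bool
    db_downtime_hours: float = 0.0
    streams_outage: bool = False
    cause: str = "unknown"
    needs_resynchronization: bool = False


def sir_required(incident: Incident) -> bool:
    """Return True if the slide's criteria call for a Service Incident Report."""
    # A site resynchronization always gets a SIR.
    if incident.needs_resynchronization:
        return True
    # Unexpected database downtime longer than the threshold.
    if incident.unexpected and incident.db_downtime_hours > SIR_DOWNTIME_THRESHOLD_HOURS:
        return True
    # Apply problems after node reboots and 'user' errors are exempt.
    if incident.cause in EXEMPT_CAUSES:
        return False
    # Unexpected Streams outage caused by hardware, software, a bug or an unknown issue.
    return (
        incident.unexpected
        and incident.streams_outage
        and incident.cause in REPORTABLE_CAUSES
    )


if __name__ == "__main__":
    print(sir_required(Incident(unexpected=True, db_downtime_hours=5)))
    print(sir_required(Incident(unexpected=True, streams_outage=True, cause="node_reboot")))
```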

Procedures – resynchronization
Procedures
  – See Streams resynchronization –
  – Resynchronize a Tier1 site which is out of the Streams recovery window using transportable tablespaces
  – For the Tier1 perspective and experience, see the talk from Alexander & Carmine
  – Please remember to instantiate streamed schemas after copying to the destination!

Procedures – restore/recovery
Recovery procedures in a Streams environment –
  – Recovery is not transparent in a Streams environment
      Always notify Tier0 when attempting recovery
      Changes that had already been applied but were lost by the recovery must be re-applied
  – A split is required and a new capture with a different SCN needs to be created
  – If replication is re-enabled without notifying Tier0
      Apply aborts, SCN information is updated, and the last correctly applied SCN is lost
      Streams re-configuration is no longer possible
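The preconditions above can be read as a guard to evaluate before replication is re-enabled after a restore. The sketch below is an assumption-laden illustration in Python: the class, field and function names are hypothetical, and it models only the three conditions named on the slide, not any Oracle API.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    """Hypothetical record of where a Tier1 stands after a database recovery."""
    tier0_notified: bool
    site_split: bool
    new_capture_created: bool  # new capture created with a different SCN


def blockers_before_reenabling(state: RecoveryState) -> list[str]:
    """List the slide's preconditions that are still unmet.

    An empty list means none of the three conditions blocks re-enabling.
    """
    blockers = []
    if not state.tier0_notified:
        blockers.append("notify Tier0 about the recovery")
    if not state.site_split:
        blockers.append("split the site from the main Streams setup")
    if not state.new_capture_created:
        blockers.append("create a new capture with a different SCN")
    return blockers


if __name__ == "__main__":
    state = RecoveryState(tier0_notified=False, site_split=True, new_capture_created=False)
    for item in blockers_before_reenabling(state):
        print("still to do:", item)
```

Returning the list of unmet conditions, rather than a single boolean, makes it explicit which step is missing; the slide stresses that skipping the Tier0 notification is the step that makes later re-configuration impossible.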

Monitoring
OMS 3D
  – Cleanup required
      33% of targets seen as unknown (132 targets)
      3% of targets seen as down (12 targets)
      Tier1s affected: CNAF, IN2P3, MICHIGAN, RAL, INFN, SARA, TRIUMF
  – For the moment we stay on OMS
StreamMon – new features
  – Better Streams time statistics (capture time, replication delay, time since last LCR)
  – DB PGA monitoring
  – Session metrics (total, sessions/s)
  – Schema statistics: LCRs per schema
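As a quick consistency check on the OMS figures above, the counts and percentages can be turned into an estimate of the total number of monitored targets. This is a back-of-the-envelope sketch: the slide does not state the total, the percentages are presumably rounded, and the function name is hypothetical.

```python
# Figures from the slide: 132 targets are 33% of the total, 12 targets are 3%.
UNKNOWN_COUNT, UNKNOWN_PCT = 132, 33
DOWN_COUNT, DOWN_PCT = 12, 3


def estimate_total(count: int, percent: int) -> int:
    """Estimate the total number of targets from a count and its percentage."""
    return round(count * 100 / percent)


total_from_unknown = estimate_total(UNKNOWN_COUNT, UNKNOWN_PCT)
total_from_down = estimate_total(DOWN_COUNT, DOWN_PCT)

print(total_from_unknown)  # 400
print(total_from_down)     # 400

# Remaining targets under this estimate (neither unknown nor down).
print(total_from_unknown - UNKNOWN_COUNT - DOWN_COUNT)  # 256
```

Both figures point to roughly 400 targets in total, so the two percentages on the slide are mutually consistent.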

Monitoring – weekly reports
Weekly reports for Tier1s
  – Provide
      CPU and IO usage statistics (per domain/user)
      Service usage
      Simple session statistics
      Password expiration
      Space usage
  – Installation instructions
  – Access
  – Please install if you haven’t done it yet
      Not obligatory though

Interventions at Tier1s in 2010
More transparent interventions at Tier1s in 2010 (64% of the total, compared to 48% in 2009)
Fewer interventions overall, and fewer interventions requiring longer downtime

Problems and interventions
RAL – 20th of July 2010
  – Multipath reconfiguration intervention
  – After the intervention only a few LCRs were applied before the ORA was hit. It looked like another instance of the bug where the LCRs are consumed but not physically applied on the database. A service request ( ) was opened with Oracle under the CERN support ID.
  – No LCRs were applied until the database was resynchronized on 21st July 2010

Problems and interventions
SARA – 18th of August 2010
  – DB crashed with datafile corruption
  – Extensive analysis conducted at SARA found potential storage problems
  – DB was later restored to different hardware
  – RAL helped SARA to recover using transportable tablespaces
  – 8th of September: SARA replication is back
SARA – 26th of October 2010
  – Apply failed with ORA-01403: no data found
  – Some data was found to be missing from the time of the last resynchronization with RAL (between export and import)
  – Resynchronized selectively from Tier0
  – Problem traced to missing instantiation at the destination
      The transportable tablespaces procedure needs an update

Problems and interventions
ASGC – 24th of September
  – Site-wide power cut; after half an hour the DB shut down abnormally when the UPS battery ran out
  – Several problems during data recovery
  – Decision taken to re-synchronize from a full backup from RAL
  – Problems with transferring the backup to ASGC
  – The network is not fast enough on normal links, so it was decided to put the backup in Castor and ship the data via FTS
  – 17th October: ATLAS decided to drop the conditions DB at ASGC and use Frontier instead

Problems and interventions
PSU July 2010 (PSU4)
  – Two issues not caught during PSU4 validation
      The patch is not rolling (bug ): a patched instance hangs in mount state, affecting the whole cluster
        – Oracle Support recently confirmed that PSU4 is not rolling when installed, in a fully supported configuration, with shared ASM and Oracle home
      ORA-7445 and high contention during logoff with multiple sessions in one process (bug ): whole cluster unresponsive
        – Affects COOL and other applications using multiple sessions in one server process
        – Fixed by a one-off patch, tested; fixed in
  – We are now pushing for more comprehensive validation and have gained much more experience

Useful documentation
3D wiki
  – Streams operations manual
Overview for Troubleshooting Streams Performance Issues (metalink note )
Streams monitoring
Streams health check report
  – metalink note
3D OEM
Ask questions on the grid-service-databases mailing list or use the Tier1 Service Coordination Meeting to raise any outstanding DB-related issues
