ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop.

Slides:

Advertisements

Similar presentations

Introduction to Oracle

Advertisements

INTRODUCTION TO ORACLE Lynnwood Brown System Managers LLC Oracle High Availability Solutions RAC and Standby Database Copyright System Managers LLC 2008.

DB server limits (process/sessions) Carlos Fernando Gamboa, BNL Andrew Wong, TRIUMF WLCG Collaboration Workshop, CERN Geneva, April 2008.

Database Backup and Recovery

RMAN Restore and Recovery

Backup and Recovery (2) Oracle 10g CAP364 1 Hebah ElGibreen.

Backup and Recovery Part 1.

Backup Concepts. Introduction Backup and recovery procedures protect your database against data loss and reconstruct the data, should loss occur. The.

CERN IT Department CH-1211 Genève 23 Switzerland t Recovery Exercise Wrap-up Jacek Wojcieszuk, CERN IT-DM Distributed Database Operations.

Agenda  Overview  Configuring the database for basic Backup and Recovery  Backing up your database  Restore and Recovery Operations  Managing your.

Oracle backup and recovery strategy

Using RMAN to Perform Recovery

Backup & Recovery Concepts for Oracle Database

Incident Response Updated 03/20/2015

Castor F2F Meeting Barbara Martelli Castor Database CNAF.

20 Copyright © 2004, Oracle. All rights reserved. Database Recovery.

PPOUG, 05-OCT-01 Agenda RMAN Architecture Why Use RMAN? Implementation Decisions RMAN Oracle9i New Features.

7 Copyright © 2006, Oracle. All rights reserved. Dealing with Database Corruption.

Chapter Oracle Server An Oracle Server consists of an Oracle database (stored data, control and log files.) The Server will support SQL to define.

Backup & Recovery Backup and Recovery Strategies on Windows Server 2003.

16 Copyright © 2007, Oracle. All rights reserved. Performing Database Recovery.

WLCG Service Report ~~~ WLCG Management Board, 27 th October

WLCG Service Report ~~~ WLCG Management Board, 1 st September

15 Copyright © Oracle Corporation, All rights reserved. RMAN Incomplete Recovery.

CCRC’08 Weekly Update Jamie Shiers ~~~ LCG MB, 1 st April 2008.

DB Questions and Answers open session Carlos Fernando Gamboa, BNL WLCG Collaboration Workshop, CERN Geneva, April 2008.

11 Copyright © 2004, Oracle. All rights reserved. Dealing with Database Corruption.

CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Real Application Clusters (RAC) Techniques for implementing & running robust.

IT Database Administration Section 09. Backup and Recovery Backup: The available options Full Consistent (cold) Backup Database shutdown, all files.

Review of Recent CASTOR Database Problems at RAL Gordon D. Brown Rutherford Appleton Laboratory 3D/WLCG Workshop CERN, Geneva 11 th -14 th November 2008.

14 Copyright © 2005, Oracle. All rights reserved. Backup and Recovery Concepts.

1 Copyright © 2005, Oracle. All rights reserved. Introduction.

Data & Storage Services CERN IT Department CH-1211 Genève 23 Switzerland t DSS Castor incident (and follow up) Alberto Pace.

CERN IT Department CH-1211 Genève 23 Switzerland t DBA Experience in a multiple RAC environment DM Technical Meeting, Feb 2008 Miguel Anjo.

CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review and Outlook Distributed Database Workshop PIC, 20th April 2009.

CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review Distributed Database Workshop CERN, 27 th November 2009 Eva Dafonte.

18 Copyright © 2004, Oracle. All rights reserved. Backup and Recovery Concepts.

WLCG Service Report ~~~ WLCG Management Board, 7 th July 2009.

Distributed Logging Facility Castor External Operation Workshop, CERN, November 14th 2006 Dennis Waldron CERN / IT.

6 Copyright © 2007, Oracle. All rights reserved. Performing User-Managed Backup and Recovery.

16 Copyright © 2005, Oracle. All rights reserved. Performing Database Recovery.

Oracle Architecture - Structure. Oracle Architecture - Structure The Oracle Server architecture 1. Structures are well-defined objects that store the.

Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.

GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE4004 ATLAS CMS LHCb Totals

ASGC Site Report Jason Shih ASGC Grid Ops CASTOR External Operation Face to Face Meeting.

14 Copyright © 2005, Oracle. All rights reserved. Backup and Recovery Concepts.

PIC port d’informació científica Luis Diaz (PIC) ‏ Databases services at PIC: review and plans.

Patricia Méndez Lorenzo Status of the T0 services.

CASTOR Operations Face to Face 2006 Miguel Coelho dos Santos

Status of tests in the LCG 3D database testbed Eva Dafonte Pérez LCG Database Deployment and Persistency Workshop.

CERN - IT Department CH-1211 Genève 23 Switzerland CASTOR F2F Monitoring at CERN Miguel Coelho dos Santos.

CERN IT Department CH-1211 Geneva 23 Switzerland t Distributed Database Operations Workshop CERN, 17th November 2010 Dawid Wójcik Streams.

BNL dCache Status and Plan CHEP07: September 2-7, 2007 Zhenping (Jane) Liu for the BNL RACF Storage Group.

WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August

Considerations for database servers Castor review – June 2006 Eric Grancher, Nilo Segura Chinchilla IT-DES.

WLCG Service Report ~~~ WLCG Management Board, 10 th November

9 Copyright © 2004, Oracle. All rights reserved. Incomplete Recovery.

GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals

CERN - IT Department CH-1211 Genève 23 Switzerland t Service Level & Responsibilities Dirk Düllmann LCG 3D Database Workshop September,

DB Questions and Answers open session (comments during session) WLCG Collaboration Workshop, CERN Geneva, 24 of April 2008.

Oracle Database Architectural Components

Oracle Database High Availability

WLCG DB Service Reviews

CASTOR-SRM Status GridPP NeSC SRM workshop

ASGC Status report ASGC/OPS Jason Shih Nov 26th 2009

Castor services at the Tier-0

Oracle Database High Availability

STREAMS failover and resynchronization

Performing Database Recovery

Database Backup and Recovery

Presentation transcript:

ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop

Outline Critical service incident Atlas 3D Castor DB corruption Procedures and event logs Plan

Atlas 3D incident - I Sept 27 th, First observed by Eva ORA error code, and propagation cannot be re-started normally. verbose exception msg is 'TNS:listener does not currently know of service requested in connect descriptor' Streams Monitor Error Report Report date: :21:49 Affected Site: CERN-PROD Affected Database: ATLDSC.CERN.CH Process Name: STRM_PROPAGATION_ASGC Error Time: :21:48 Error Message: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor ORA-06512: at "SYS.DBMS_AQADM_SYS", line 1087 ORA-06512: at "SYS.DBMS_AQADM_SYS", line 7642 ORA-06512: at "SYS.DBMS_AQADM", line 631 ORA-06512: at line 1

Atlas 3D incident - II 28-Sep-09 07:17 - CERN-PROD : Process error report service degradation of listener is suspicious. 28-Sep-09 13:40 - Eva confirm the same error code "ORA-12528: TNS: listener: all appropriate instances are blocking new connections“ Instance down, err code 'ERROR at line 1: ORA-01507: database not mounted‘ 28-Sep-09 14:05 - all lsnr service are blocked, and confirm the db is not open properly except with the fs mounted.

Atlas 3D incident - events 28-Sep-09 - the problem seems affect by the recent power surge and cause hardware degradation of two blade chassis, Consider performing point-in-time restore for the 3D database propagation is disable now and we confirm LCRs activities already via stream monitor page 29-Sep-09 14:05, problem persist after point-in-time recovery. In fact, it is not possible to access to some old data so this makes impossible to try any kind of resynchronization using Streams with the current database status. 30-Sep-09 - perform point-in-time restore to Sept 21, where we confirm to have full backup 30-Sep-09 13:41 - Eva confirm with some errors when try to access the data and this is why we suspect there is data corrupted (12 entries being selected with owner like 'ATLAS_COOLOFL_DCS' and object_name like '%_F0028_%', and the missing object observed: select * from ATLAS_COOLOFL_DCS.COMP200_F0028_IOVS_SEQ; ERROR at line 1: ORA-08103: object no longer exists 5-Oct-09 13:41, the datafiles are nolonger valid (the point-in-time restored refer to Sept 21 while streams might not be able to recover from a 2 weeks backlog or not having archived log files on disk

Atlas 3D incident - recovery

Castor DB incident DB lock critical alarm Date: Wed Oct 21 00:16:52 UTC 2009 Error: OracleLocks CRITICAL - CASTORDB2: 10 are waiting for lock out of the 146 sessions are connected eLog: c-db/2009-October/ html url: bin/status.cgi?host=w-rac02&servicestatustypes=28

Castor DB incident - analysis Alert log in 2nd instance indicating the block corruption: # tail /u01/app/oracle/admin/castordb/bdump/alert_castordb2.log Wed Oct 21 10:00: Recovery of Online Redo Log: Thread 2 Group 21 Seq Reading mem 0 Mem# 0: /u02/oradata/castordb/redo2101.log Block recovery completed at rba , scn Wed Oct 21 10:00: Corrupt Block Found TSN = 10, TSNAME = STAGER_DATA RFN = 10, BLK = , RDBA = OBJN = 84545, OBJD = 84545, OBJECT = SUBREQUEST, SUBOBJECT = P_STATUS_6 SEGMENT OWNER = STAGER, SEGMENT TYPE = Table Partition

Castor DB incident – recovery Oct 29 th – Nov 3 rd, Hardware preparation New blade servers, pass-through parts, fabric switch Backend storage 5 nodes in total, split into two clusters Clean cluster preparation collect any additional patches suggested by CASTOR DB team at CERN CERN lead install additional patches and the current security patch start DB and check proper server functioning initiate cloning of binaries to second ASGC cluster (SRM, Stager, DLF) import recovered name server content into ASGC cluster 1 bring up cluster 2 and import the other recovered schemata

Castor DB incident – service restart Nov 4 th, Recreate dlf schema which is less critical confirm all core services up and running Upgrading SRM to 2.8 SRM service resume Nov 5 th Nov 6 th, OP revert the wrong srm config and cause the service degraded start from Fri. Situation clarified: Stager/VO mapping Incorrect CNS host definition Full function since Mon. Exp. confirm with FT and production transfers

Plan Weekly Tape backup Spare hardware for standby cluster Enforcement of recovery exercises

Acknowledgment Atlas 3D resync Carlos from BNL Eva/Maria from CERN CastorDB recovery Eric, Luca, Jacek, Dawid, Nilo, Giuseppe, Maria and Dirk. Castor support team