WLCG DB Service Reviews

Slides:

Advertisements

Similar presentations

CASTOR Upgrade, Testing and Issues Shaun de Witt GRIDPP August 2010.

Advertisements

Database Backup and Recovery

Backup Concepts. Introduction Backup and recovery procedures protect your database against data loss and reconstruct the data, should loss occur. The.

BNL Oracle database services status and future plans Carlos Fernando Gamboa RACF Facility Brookhaven National Laboratory, US Distributed Database Operations.

Castor F2F Meeting Barbara Martelli Castor Database CNAF.

ITimpulse NOC process This is an interactive, detailed, step wise guide explaining how alerts are managed at our NOC. This document contains information.

Maintaining a Microsoft SQL Server 2008 Database SQLServer-Training.com.

LHCC Comprehensive Review – September WLCG Commissioning Schedule Still an ambitious programme ahead Still an ambitious programme ahead Timely testing.

16 Copyright © 2007, Oracle. All rights reserved. Performing Database Recovery.

WLCG Service Report ~~~ WLCG Management Board, 27 th October

DB-2: OpenEdge® Replication: How to get Home in Time … Brian Bowman Sr. Solutions Engineer Sandy Caiado Sr. Solutions Engineer.

Daniela Anzellotti Alessandro De Salvo Barbara Martelli Lorenzo Rinaldi.

Heterogeneous Database Replication Gianni Pucciani LCG Database Deployment and Persistency Workshop CERN October 2005 A.Domenici

Review of Recent CASTOR Database Problems at RAL Gordon D. Brown Rutherford Appleton Laboratory 3D/WLCG Workshop CERN, Geneva 11 th -14 th November 2008.

1 LHCb on the Grid Raja Nandakumar (with contributions from Greig Cowan) ‏ GridPP21 3 rd September 2008.

WLCG Service Report ~~~ WLCG Management Board, 16 th December 2008.

Alwayson Availability Groups

CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,

CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review and Outlook Distributed Database Workshop PIC, 20th April 2009.

WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September

CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review Distributed Database Workshop CERN, 27 th November 2009 Eva Dafonte.

WLCG Service Report ~~~ WLCG Management Board, 7 th July 2009.

CERN IT Department CH-1211 Geneva 23 Switzerland t WLCG Operation Coordination Luca Canali (for IT-DB) Oracle Upgrades.

6 Copyright © 2007, Oracle. All rights reserved. Performing User-Managed Backup and Recovery.

BNL Oracle database services status and future plans Carlos Fernando Gamboa, John DeStefano, Dantong Yu Grid Group, RACF Facility Brookhaven National Lab,

WLCG Service Report ~~~ WLCG Management Board, 31 st March 2009.

Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.

CNAF Database Service Barbara Martelli CNAF-INFN Elisabetta Vilucchi CNAF-INFN Simone Dalla Fina INFN-Padua.

Status of tests in the LCG 3D database testbed Eva Dafonte Pérez LCG Database Deployment and Persistency Workshop.

CERN IT Department CH-1211 Geneva 23 Switzerland t Distributed Database Operations Workshop CERN, 17th November 2010 Dawid Wójcik Streams.

WLCG Service Report ~~~ WLCG Management Board, 20 th January 2009.

Replicazione e QoS nella gestione di database grid-oriented Barbara Martelli INFN - CNAF.

Log Shipping, Mirroring, Replication and Clustering Which should I use? That depends on a few questions we must ask the user. We will go over these questions.

WLCG Service Roadmap 2009 – 2010 ~~~ Distributed DB Operations Workshop, PIC, April 2009.

GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals

ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop.

DB Questions and Answers open session (comments during session) WLCG Collaboration Workshop, CERN Geneva, 24 of April 2008.

GGUS summary ( 9 weeks ) VOUserTeamAlarmTotal ALICE2608 ATLAS CMS LHCb Totals

10 Copyright © 2007, Oracle. All rights reserved. Managing Undo Data.

INFN Tier1/Tier2 Cloud WorkshopCNAF, 22 November 2006 Conditions Database Services How to implement the local replicas at Tier1 and Tier2 sites Andrea.

CERN IT Department CH-1211 Geneva 23 Switzerland t Service Reliability & Critical Services January 15 th 2008.

Jean-Philippe Baud, IT-GD, CERN November 2007

WLCG IPv6 deployment strategy

Maria Girone, CERN – IT, Data Management Group

Streams Service Review

Dirk Duellmann CERN IT/PSS and 3D

WLCG Management Board, 30th September 2008

gLite->EMI2/UMD2 transition

Exadata and ZFS Storage at Nielsen

Database Services at CERN Status Update

Service Challenge 3 CERN

3D Application Tests Application test proposals

Elizabeth Gallas - Oxford ADC Weekly September 13, 2011

Database Readiness Workshop Intro & Goals

Maximum Availability Architecture Enterprise Technology Centre.

CASTOR-SRM Status GridPP NeSC SRM workshop

~~~ LCG-LHCC Referees Meeting, 16th February 2010

Olof Bärring LCG-LHCC Review, 22nd September 2008

WLCG Service Interventions

Conditions Data access using FroNTier Squid cache Server

Workshop Summary Dirk Duellmann.

WLCG Service Report 5th – 18th July

3D Project Status Report

Data Lifecycle Review and Outlook

Database Services for CERN Deployment and Monitoring

Project Name - Testing Iteration 1 UAT Kick-off

Performing Database Recovery

OPS-7: Building and Deploying a Highly Available Application

Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010

Presentation transcript:

WLCG DB Service Reviews Maria.Girone, IT-DM ~~~ Oracle Operational Review Meeting, 6th April 2009

Overview Motivation for such reviews Some concrete examples with technical details & service impact Status of the discussions with Oracle Outlook Summary

Motivation As highlighted by the experiments’ list of Critical Services and as confirmed by regular WLCG service reports, databases play a key role Used for online and offline applications, at Tier0 as well as Tier1s, for experiment applications, Grid middleware services, monitoring, data management, … Services designed and deployed with resilience in view but we need to be mindful of what can go wrong and the possible service impact Cannot expect Oracle support engineers looking at a specific problem report to understand the overall WLCG scope or impact Need to be both pro-active as well as re-active

Examples and Current High-Priority Issues Atlas, Streams and LogMiner crash CMS Frontier, change notification and Streams incompatibility Castor, BigID issue Castor, ORA-600 crash Castor, crosstalk and wrong SQL executed Oracle clients on SLC5, connectivity problem Restore performance for VLDB

Atlas Streams Setup: Capture Aborted with Fatal Logminer Error Atlas streams replication aborted with a critical error on 12-12-2008 Full re-instantiation of streams was necessary PVSS replication was down for 4 days Conditions replication was down for 3 days Replication to Tier1s was severely affected, part of the replication was down during the CERN annual shutdown period Open SR 7239707.994 with Oracle Caused by the parallel option for a index rebuild 5

Streams Issue and LogMiner Error Generated by LogMiner aborting and not being able skip ‘wrong redo’ and/or bypass single transactions Issue could not be reproduced exactly, but likely to come up any time LogMiner will run in a critical error while reading a redolog/archivelog file In a test environment we could see similar behaviour by manually corrupting redo files Workaround proposed by Oracle support on 10g: skip the log file which contains the transaction causing the problem Not acceptable in production environment Will cause logical corruption, i.e. data loss at the target database ATLAS dbas run several operations in order to rebuild some indexes on the database Skip as well other transactions contained in the log file and consequently the replicas will not be synchronized 6

CMS Frontier: Change Notification Incompatibility with Streams Streams capture process aborts with errors ORA-16146 and ORA-07445 ORA-16146 reported intermittently and capture aborts (SR 20080444.6 ) and ORA-07445: [kghufree()+485] repeatedly in a Streams environment (SR 7280176.993) Streams and change notification have incompatibilities The issue also causes high load on the server (performance issue) Forces manual capture restart Oracle support recommendations and actions Proposed patch 5868257, did not solve the issue Oracle support will try to reproduce in house Alternative solution for CMS: replace the use of change notification with systems info of last modified tables. 7

Castor and BigID issue Wrong bind value being inserted into castor tables i.e. ‘Big ID’ problem in CASTOR @RAL, ASGC & CNAF OCCI NUMBER TYPE OVERFLOW Oracle support SR 7299845.994 OCCI 10.2.0.3 application build with GCC 3.4.6 on Linux x86, Oracle 10.2.0.4 database Randomly the application inserts huge values (10^17 instead of 10, for example) Difficult to trace OCI or OCCI in multi-threaded environment RAL working on a test case to reproduce the problem for Oracle support 8

Castor and Crosstalk Between Different Schemas SQL executed in wrong schema when fully qualified names are not used Seen in 10.2.0.2 (LFC) Similar bugs are reported on 10.2.0.3 Supposed fixed in 10.2.0.4 Although Castor @ RAL has reported one occurrence of ‘crosstalk’ which caused loss of 14K files SR 7103432.994 (September 2008) Issue not reproduced since Problem seen also on VOMS (average of once/day). To follow-up 9

Invalid Number on Bind Variable for Castor Invalid number found when updating with bind variable value of type OCCI BDOUBLE Oracle support SR 7420155.994 Similar to the BigId issue (SR 7299845.994) The effect is a wrong bind variable being detected by Oracle. This generates ‘data corruption’. Sporadic issue, difficult to reproduce (although when present is consistently seen) Work in progress with Oracle analyst to understand the source 10

Castor and Block Corruption Issue Oracle crash due to ORA-600 on 14-3-2009 Castor stager for Atlas down for about 10 hours Because of bug 6647480 Instance crashed and crash recovery failed SR 7398656.993 opened with Oracle support Issue resolved with patch CERN also requested merge for 7046187 and 6647480 Note: bug 6647480 was available since October 2008, but not listed as critical in Oracle main support doc “10.2.0.4 Patch Set - Availability and Known Issues” See https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090314 some blocks corrupted that can not be fixed by transaction recovery. 11

Oracle Clients cannot connect on SLC5 and RHEL5 It’s a blocking issue for Oracle clients on SLC5 (and RHEL5) Based on Bug 6140224, "SQLPLUS FAILS TO LOAD LIBNNZ11.SO WITH SELINUX ENABLED ON EL5/RHEL5" The root cause is explained in Metalink Note 454196.1 The first proposed workaround by support (use of SELinux in ‘permissive’ mode) is not acceptable in our environment A second workaround proposed from support, involves repackaging of the Oracle client as rpms installed on localdisks (instead of AFS) requires significant changes in CERN and Tier 1 deployment model From oracle support, the issue should be fixed in 11gR2 Beta 2 We requested a backport to 10.2 One possible workaround (for CERN and T1 sites)involves the repackaging of the Oracle client as rpms installed on localdisks (instead of AFS) 12

RMAN Restore Performance Issue Backup restore performance issue Affects significantly time to restore for very large databases (VLDB) Restores are limited by gigabit Ethernet speed (100 MBPS) The issue effectively disallows to use multiple RAC nodes/NICS to restore a VLDB Example: 10TB DB, the use of a single node/channel limits restore time to 1 day or more Status: Opened as SR 7241741.994 Bug 8269674 opened by Oracle support for this issue 13

Summary of Issues Issue Services Involved Service Request Impact Logminer ATLAS Streams SR 7239707.994 PVSS, conditions service interruptions (2/3 days); Replication to T1s severly affected Streams Proposed ‘workaround’ not acceptable -> data loss! Streams/ Change notification CMS Frontier SR 7280176.993 Incompatibility with Streams BigID CASTOR SR 7299845.994 Service interruption “Crosstalk” CASTOR + ?? SR 7103432.994 Logical data corruption ORA-600 SR 7398656.993 Clients on SL5 Clients can’t connect!

Status of Discussions with Oracle After some initial contact and e-mail discussions, a first preparatory meeting is scheduled for April 6th at Oracle’s offices in Geneva There is already a draft agenda for this meeting Introductions & Welcome WLCG Service & Operations – Issues and Concerns WLCG Distributed Database Operations: Summary of Current & Recent Operational Problems Oracle Support – Reporting and Escalating Problems for Distributed Applications (Customers) Discussion on Frequency, Attendance, Location, … Wrap-up & Actions http://indico.cern.ch/conferenceDisplay.py?confId=53482 A report on this meeting will be given at the Distributed DB Operations meeting to be held in PIC later in April Agenda: http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=54037

Outlook Depending on the outcome of the discussions with Oracle, we would expect to hold such “WLCG Oracle Operational” (W) reviews roughly quarterly RAL have already offered to host such a meeting The proposed initial scope is to: Follow-up on outstanding service problems; Be made aware of critical bugs in production releases, together with associated patches and / or work-arounds.

Summary Following proposals to the WLCG MB and elsewhere WLCG Oracle Operational review meetings are being established Such meetings will initially take place quarterly Input from regular WLCG Distributed Database Operations (ex-3D) con-calls and workshops will be essential in preparing and prioritizing lists of issues, as well as to exchange knowledge on workarounds and solutions

Backup Slides Presentation title - 18

CASTOR Deployment Architecture CERN: 6 instances (4x LHC, Public, cernT3) Each instance: stager + DLF + SRM DB schemas, each on a different cluster (2 nodes) Name server on yet another cluster 38 machines! RAL: 4 different stagers + DLF + SRM, one Name server, two clusters CNAF and ASGC: one castor instance (Stager + DLF + SRM + Name server), one cluster 19