Streams Service Review

Slides:



Advertisements
Similar presentations
High Availability Group 08: Võ Đức Vĩnh Nguyễn Quang Vũ
Advertisements

CERN - IT Department CH-1211 Genève 23 Switzerland t Transportable Tablespaces for Scalable Re-Instantiation Eva Dafonte Pérez.
Chapter 11 - Monitoring Server Performance1 Ch. 11 – Monitoring Server Performance MIS 431 – created Spring 2006.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle and Streams Diagnostics and Monitoring Eva Dafonte Pérez Florbela Tique Aires.
National Manager Database Services
CERN IT Department CH-1211 Genève 23 Switzerland t Streams new features in 11g Zbigniew Baranowski.
Module 10: Monitoring ISA Server Overview Monitoring Overview Configuring Alerts Configuring Session Monitoring Configuring Logging Configuring.
ATLAS Metrics for CCRC’08 Database Milestones WLCG CCRC'08 Post-Mortem Workshop CERN, Geneva, Switzerland June 12-13, 2008 Alexandre Vaniachine.
Backup and Recovery Overview Supinfo Oracle Lab. 6.
1 Oracle Enterprise Manager Slides from Dominic Gélinas CIS
WLCG Service Report ~~~ WLCG Management Board, 9 th August
ESRI User Conference 2004 ArcSDE. Some Nuggets Setup Performance Distribution Geodatabase History.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Real Application Clusters (RAC) Techniques for implementing & running robust.
14 Copyright © 2005, Oracle. All rights reserved. Backup and Recovery Concepts.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Implementation and performance analysis of.
CERN Database Services for the LHC Computing Grid Maria Girone, CERN.
CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review and Outlook Distributed Database Workshop PIC, 20th April 2009.
Database Competence Centre openlab Major Review Meeting nd February 2012 Maaike Limper Zbigniew Baranowski Luigi Gallerani Mariusz Piorkowski Anton.
CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review Distributed Database Workshop CERN, 27 th November 2009 Eva Dafonte.
18 Copyright © 2004, Oracle. All rights reserved. Backup and Recovery Concepts.
Distributed Logging Facility Castor External Operation Workshop, CERN, November 14th 2006 Dennis Waldron CERN / IT.
Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.
18 Copyright © 2004, Oracle. All rights reserved. Recovery Concepts.
14 Copyright © 2005, Oracle. All rights reserved. Backup and Recovery Concepts.
Status of tests in the LCG 3D database testbed Eva Dafonte Pérez LCG Database Deployment and Persistency Workshop.
Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.
Virtual Machine Movement and Hyper-V Replica
CERN IT Department CH-1211 Geneva 23 Switzerland t Distributed Database Operations Workshop CERN, 17th November 2010 Dawid Wójcik Streams.
8 August 2006MB Report on Status and Progress of SC4 activities 1 MB (Snapshot) Report on Status and Progress of SC4 activities A weekly report is gathered.
WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August
Replicazione e QoS nella gestione di database grid-oriented Barbara Martelli INFN - CNAF.
Log Shipping, Mirroring, Replication and Clustering Which should I use? That depends on a few questions we must ask the user. We will go over these questions.
6 Copyright © Oracle Corporation, All rights reserved. Backup and Recovery Overview.
20 Copyright © 2006, Oracle. All rights reserved. Best Practices and Operational Considerations.
ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop.
DB Questions and Answers open session (comments during session) WLCG Collaboration Workshop, CERN Geneva, 24 of April 2008.
1 Copyright © 2005, Oracle. All rights reserved. Oracle Database Administration: Overview.
American Diploma Project Administrative Site Training New Jersey.
INFSO-RI Enabling Grids for E-sciencE Running reliable services: the LFC at CERN Sophie Lemaitre
Marcin Bogusz CERN, PH-CMG WLCG Collaboration Workshop CMS online/offline replication Online/offline replication via Oracle Streams WLCG Collaboration.
WLCG Collaboration Workshop CMS online/offline replication
Servizi core INFN Grid presso il CNAF: setup attuale
SQL Database Management
Jean-Philippe Baud, IT-GD, CERN November 2007
Replication using Oracle Streams at CERN
Troubleshooting Tools
How To Pass Oracle 1z0-060 Exam In First Attempt?
Evolution of Data(base) Replication Technologies for WLCG
WLCG DB Service Reviews
Database Services at CERN Status Update
3D Application Tests Application test proposals
Elizabeth Gallas - Oxford ADC Weekly September 13, 2011
Database Readiness Workshop Intro & Goals
Maximum Availability Architecture Enterprise Technology Centre.
WLCG Service Interventions
WLCG Accounting Task Force Update Julia Andreeva CERN WLCG Workshop 08
Workshop Summary Dirk Duellmann.
STREAMS failover and resynchronization
Oracle Database Monitoring and beyond
Review of Tier1 DB Interventions and Service procedures
Upgrading to Microsoft SQL Server 2014
3D Project Status Report
SharePoint Administrative Communications Planning: Dynamic User Notifications for Upgrades, Migrations, Testing, … Presented by Robert Freeman (
Predictive Performance
Oracle Streams Performance
Backup Monitoring – EMC NetWorker
Performing Database Recovery
Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010
Presentation transcript:

Streams Service Review Distributed Database Operations Workshop Eva Dafonte Pérez

Outline Tier0 responsibilities Tier1 responsibilities What do I have to do? Recent problems Bugs related to Streams Recommended patches Pending requests New 11g features Summary

Overview ATLAS

Overview LHCB CMS

Tier0 responsibilities Initial Streams setup Add new schemas to the Streams environment Split & Merge Streams re-synchronization Analyze and test new features and optimizations Validate upgrades and patches Monitoring

Tier1 responsibilities Announce interventions schedule new intervention using 3D wiki submit EGEE broadcasts register outages in the CIC portal long interventions: contact Tier0 to analyze if it is necessary to split the Streams setup Unplanned downtime: update Tier0 problem description, progress and expected duration Report regularly Read-only replica: ensure only reader account is open All database interventions, also transparent ones (like single node failures) and also unplanned interventions (like node reboots) need to follow the standard WLCG notification procedure as discussed on the SARA workshop. In particular, please make sure that you submit EGEE broadcasts and register outages in the CIC portal with according to an explicit description of which services are affected (eg which experiment and which grid services). This is an additional effort, but otherwise the operations teams and the wider experiment community will not be able to follow proposed interventions or correlate higher level service problems with database interventions.

Tier1 responsibilities Maintain the 3d OEM operational check agents status configure targets After an intervention: check and re-enable Streams processes re-start apply process @destination re-enable propagation job @downstream box When SPLIT: re-start capture process @downstream box

Tier1 responsibilities # connect as streams admnistrator @destination database strmadmin@db> select apply_name, status from dba_apply; Apply Process Name Status ---------------------------- ----------- STRMADMIN_APPLY_STREVA DISABLED strmadmin@db> exec dbms_apply_adm.start_apply(‘STRMADMIN_APPLY_STREVA‘); PL/SQL procedure successfully completed.

Tier1 responsibilities account with privileges to re-start the Streams components @downstream database - one per Tier1 site # connect as strmprop user @downstream database strmprop_cern@db> select propagation_name, status from dba_propagation; Propagation Name Status ------------------------------ ----------- STREAMS_PROP_STREVA_DWSDB DISABLED STREAMS_PROP_STREVA_STRMTEST ENABLED strmprop_cern@db> exec dbms_propagation_adm.start_propagation(‘STREAMS_PROP_STREVA_DWSDB‘); PL/SQL procedure successfully completed.

Tier1 responsibilities ensure you can connect using your strmprop account (password, connection string) check that you are using the correct process name # connect as strmprop user @downstream database strmprop_cern@db> select propagation_name, status from dba_propagation; strmprop_cern@db> select capture_name, status from dba_capture; Capture Name Status ---------------------------- ----------- STRMADMIN_CAPTURE_STREVA ENABLED STRMADMIN_CAP_TEMP DISABLED strmprop_cern@db> exec dbms_propagation_adm.start_propagation(‘STREAMS_PROP_TEMP‘); strmprop_cern@db> exec dbms_capture_adm.start_capture(‘STRMADMIN_CAP_TEMP‘); PL/SQL procedure successfully completed.

What do I have to do? 1. Check Streams monitoring Hi all, it looks like it is our turn now… So, what do I have to do? Cheers, Olli Streams Monitor wrote: > Streams Monitor Error Report > Report date: 2008-09-25 14:43:09 > Affected Site: NDGF-T1 > Affected Database: ATLAS.DB1TIER1.NDGF.ORG > Process Name: STRMADMIN_APPLY_ATLN > Error Time: 25-09-2008 15:42:51 > Error Message: ORA-26714: User error encountered while applying > Current process status: ABORTED 1. Check Streams monitoring 2. Check Streams Service Manual for Tier1s 3. Ask for help

What do I have to do? Apply process status: ABORTED “user error encountered while applying” get more details: exec print_errors.sql as streams administrator @destination human errors ex: modifications to system-generated names, updates by users which don’t exist at destination,… destination schema is overwritten ex: statement is executed first at the destination, then at the source (online – offline), Tier1 database is not read-only, … ORA-01403: no data found ORA-00955: name is already used by an existing object  check with Tier0 which actions are needed

What do I have to do? Apply process status: ABORTED “user error encountered while applying” database administration related ex: unable to extend tablespace, deadlock waiting for resource, … ORA-01652: unable to extend tablespace ORA-00060: deadlock detected while waiting for resource fix the problem re-execute the error and re-start apply process exec dbms_apply_adm.execute_all_errors(‘STRMADMIN_APPLY‘); exec dbms_apply_adm.start_apply(‘STRMADMIN_APPLY‘);

What do I have to do? Propagation is DISABLED after 16 attempts: ORA-00257: archiver error. Connect internal only, until freed ORA-12514: TNS:listener does not currently know of service requested in connect descriptor ORA-12545: Connect failed because target host or object does not exist ORA-12170: TNS:Connect timeout occurred ORA-12560: TNS:protocol adapter error fix the problem re-enable propagation

What do I have to do? Check our wiki: Oracle Streams Documentation https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsServiceReview Oracle Streams Documentation Oracle Streams Concepts and Administration 10g Release 2 Oracle Streams Replication Administrator's Guide 10g Release 2 Send us an email with your questions Help us to maintain the wiki updated you can also update it !!!

Recent problems Missing primary keys / indexes Apply is aborted because of duplicated rows cannot identify an unique row to apply the change Apply performance seriously impacted apply server performs full table scans  Delay on the whole replication system dependent transactions have to wait ATLAS has already implemented an automatic job to detect tables without primary key

Recent problems Apply gets stuck on “applying” status Reader and coordinator are IDLE Server shows APPLYING LCRs spilled over to disk Under investigation by Oracle Connection lost contact to Gridka Only LFC replication to Gridka affected Diagnostic patch installed

Recent problems Unresponsive NDGF Additional capture latency monitored propagation could not send LCRs to destination processes were healthy – no errors reported large number of spilled LCRs kicked up the flow control (≈ 6.000.000 LCRs) capture process « temporarily » paused Additional capture latency monitored alert sent when 90 minutes threshold exceeded Tests on the streams pool memory usage new node allocated for the downstream cluster Streams Pool Size (MB) 2.6 GB

Interventions LFC migration out of SRM v1 endpoint Streams replication stopped Data updated at source and all destinations problems with RAL, where data was finally imported from CERN CNAF, PIC and IN2P3 hardware migration re-synchronization using transportable tablespaces Tier1 sites should consider the use of Data Guard in order to minimize the impact

Bugs related to Streams Fixed: ORA-600 when dropping propagation ORA-26687 no instantiation SCN provided when drop table (2 streams setup between same source and destination databases) To be fixed: <BUG:6402302> create view on schema not in streams is replicated drop view is not replicated!

Recommended patches Metalink note 437838.1 7363767 addresses performance improvement for capture process and logminer: merge label requesy on top of 10.2.0.4 for Bugs: Bug 7345904 Streams capture slow processing direct path insert, high cpu for logmnr builder Bug:6683178 High latencies in Streams capture, while capturing primary workload with a lot of DDL activities such as truncates of empty tables Bug:6994160 Capture reader process constantly writing messages to trace file Bug:6413089 Restarting a logminer session can be slow if the session has fallen behind Bug:6650256 Parallel DDL (PDDL) transactions can cause logminer memory spill for Streams, or run slowly during adhoc log mining 7263055 + 7480651 in order to fix ORA-600 [KWQBMCRCPTS101] when dropping propagation 5933656 Propagation ora-600 [KWQPCBK179], [1], [1369] 6827260 Excessive memory usage for lcr cache due to large freelists 7219752 ORA-26773 Malformed redo on capture of long 6452375 ORA-26687 No instantiation scn provided when drop child table 7033630 Apply aborting with ORA-600 [KNLQDQM2USR:4] after installing 10.2.0.4 patchset

Pending requests MUON sites replication to CERN master: 3 Tier2 sites (Rome, Munich, Michigan) target: ATLAS offline AMI replication to CERN master: Tier1 Lyon Resources: currently 2 apply process @ATLAS offline 4 more to be added!! Service level: problems must be addressed to the master side

New 11g features Combined Capture and Apply Split/Merge of Streams capture sends LCRs directly to apply only 1 target, detected automatically big performance improvement rate: 14.000 LCRs/sec (before 5.000 LCRs/sec) Split/Merge of Streams Cross-database LCR tracking Source and Target data compare & converge compare rows in an object at 2 databases converge objects in case of differences

Summary Keep the monitoring operational Coordination with Tier0 spot problems quickly, understand bottlenecks, ... Coordination with Tier0 complex streams environments where the activity at one point might impact the whole system Feedback!!! and collaboration to improve the documentation and the service Interventions during 3 last months