Emergency Database Failover: Impacts & Recovery Plan

Trey Felton – ERCOT IT

Synopsis

ISM - Information Services Master Database
DB - Database
EDW - Electronic Data Warehouse

Synopsis

- An emergency DB failover took place on April 21, 2008.
- The Market DB (which feeds ISM) became unresponsive; data could not be written or read.
- Synchronization issues caused a 24-hour gap in data.
- The gap propagated through to ISM, leaving it out of synch by 24 hours.

Synopsis

- The Physical Standby was brought online.
- ISM was rebuilt from Source data to recover the affected extracts.

Impacts

- Market transactions were prevented from updating ISM through the Logical Standby.
  - The Market DB uses a standby to prevent outages and performance degradation.
  - The Logical Standby (RSS) fell out of synch with the Physical Standby by about 24 hours, from April 21 at 10:44am through April 22 at 11:14am.
  - Other DBs feeding ISM continued normally (only the Market DB was out of synch).
- Rebuild priority led to the Standby being rebuilt before the RSS.
  - The Market DB has to be kept up.
  - This prolonged the outage to the EDW and the affected extracts.
- Prices had to be recalculated and extracts restored from Source.
  - Price adjustments for NSRS were completed on June 5.
  - Missing extracts for April 21 through April 30 were completed on July 1.

Why did recovery take so long?

- ISM generates up to 25-35 GB of data per day.
- Data was restored from Source back to April 1.
- 120 terabytes had to be restored in order to roll forward through the transaction gap.
- Archive log changes applied during the 24-hour gap.
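The figures on this slide can be checked with a little date arithmetic. The sketch below is only an illustration, not part of ERCOT's recovery tooling: it reads the two quoted timestamps in chronological order, takes the year (2008) from the synopsis, and assumes "25-35G" means gigabytes per day. All variable names are hypothetical.

```python
from datetime import date, datetime

# Out-of-synch window, using the two timestamps quoted on the slide
# in chronological order (year taken from the synopsis).
gap_start = datetime(2008, 4, 21, 10, 44)
gap_end = datetime(2008, 4, 22, 11, 14)
gap_hours = (gap_end - gap_start).total_seconds() / 3600  # 24.5

# Days between the restore point (April 1) and the failover (April 21).
restore_point = date(2008, 4, 1)
failover_day = date(2008, 4, 21)
days_to_roll_forward = (failover_day - restore_point).days  # 20

# Daily ISM volume; "G" is assumed here to mean gigabytes.
low_gb_per_day, high_gb_per_day = 25, 35
generated_low_gb = days_to_roll_forward * low_gb_per_day    # 500
generated_high_gb = days_to_roll_forward * high_gb_per_day  # 700

print(f"Out-of-synch window: {gap_hours} hours")
print(f"Days to roll forward: {days_to_roll_forward}")
print(f"ISM data generated in that span: {generated_low_gb}-{generated_high_gb} GB")
```

The new data generated over those 20 days is well under a terabyte at the quoted rate, so the 120-terabyte figure reflects the volume that had to be restored before roll-forward could begin; the slide does not break that number down further.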

Emergency Database Failover

- All data was restored with 100% accuracy.
- The affected market systems that caused the April failure:
  - run the balancing energy and ancillary services markets
  - are not used for wholesale batch or the retail markets
- ERCOT considers this to be an isolated incident and not a systemic problem.

Going Forward

Actions to prevent future occurrences:

- Nodal market DBs will use newer hardware
  - More fault tolerance
  - Redundancy
- Change of architecture in the replication process for Nodal
  - Proof of Concept recently introduced into the Nodal market systems
  - Testing underway
- ERCOT is conducting a risk/cost analysis of several options for these Zonal systems
  - To be presented to TAC in August
- New backup and recovery procedures
  - Project initiated to stabilize our database backup procedures
  - Shorter recovery time

Data Recovery

NOTICE DATE: July 1, 2008
NOTICE TYPE: W-A042308-48 UPDATE Extracts - Wholesale
CLASSIFICATION: Public
SHORT DESCRIPTION: ERCOT has completed recovery of the missing data for April 21 through April 30, 2008.
INTENDED AUDIENCE: QSEs
DAY AFFECTED: April 21 through April 30, 2008
LONG DESCRIPTION: ERCOT conducted an emergency database failover on April 21, 2008 following a hardware failure. This database failover resulted in an out-of-synch data problem from April 21 through April 30. ERCOT developed a phased process to attempt to thoroughly recover the missing data. The missing data has been recovered for the following extracts. A market notice will be sent when the extracts are expected to be posted.

- Act_Res_Output
- Ancillary_Services_Daily
- Bids_and_Schedules_Daily
- Forecast_Data_Daily
- Market_Information_Daily
- Sched_and_Actual_Load
- Self_Sch_Energy_Services
- ASDEPLOYMENTS
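A recipient of this notice who wants to verify the reposted data could enumerate the affected days and extracts as a checklist. The sketch below is a minimal illustration: the extract names and date range come from the notice, while the assumption that each extract is expected once per affected day is ours and is not stated in the notice. The variable names are hypothetical.

```python
from datetime import date, timedelta

# Extract names as listed in the market notice.
EXTRACTS = [
    "Act_Res_Output",
    "Ancillary_Services_Daily",
    "Bids_and_Schedules_Daily",
    "Forecast_Data_Daily",
    "Market_Information_Daily",
    "Sched_and_Actual_Load",
    "Self_Sch_Energy_Services",
    "ASDEPLOYMENTS",
]

# Affected period from the notice, inclusive of both end dates.
first_day = date(2008, 4, 21)
last_day = date(2008, 4, 30)
affected_days = [
    first_day + timedelta(days=offset)
    for offset in range((last_day - first_day).days + 1)
]

# Assumption (not from the notice): one posting per extract per affected day.
checklist = [(day.isoformat(), name) for day in affected_days for name in EXTRACTS]

for day, name in checklist[:3]:
    print(day, name)
print(f"{len(affected_days)} days x {len(EXTRACTS)} extracts = {len(checklist)} items")
```

Each tuple can then be ticked off as the corresponding extract is posted.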