Role of Account Management at ERCOT Lessons Learned - 12/05 SAN Failure January 26, 2006.

Agenda - Lessons Learned
- Information Technology: assets, deployment, execution
- Internal Communications: escalation, extended event coordination, restoration decision making
- External Communications: escalation, distribution, PUCT compliance
- Risk Management: critical infrastructure and its impact on delivery of business services
- RMS/TAC Questions and Answers

Levels of data storage back-up and recovery - Summary
- Production: RAID 5 data
- Recovery Level 1: SNAPs (< 3 hr. recovery)
- Recovery Level 2: AUS Mirror, AUS DB Mirror (“SRDF”)
- Recovery Level 3: Tape back-up
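The tiering above can be read as a fallback order: try the fastest recovery source first and move down only when it is unusable. The following is a minimal sketch of that reading, not ERCOT's actual procedure. The `RecoveryLevel` class, the `choose_recovery_level` function, and the usability flags are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RecoveryLevel:
    level: int
    name: str


# Levels as listed on the slide; the ordering (fastest first) is an assumption.
LEVELS = [
    RecoveryLevel(1, "SNAPs (online snapshots)"),
    RecoveryLevel(2, "Austin mirror"),
    RecoveryLevel(3, "Tape back-up"),
]


def choose_recovery_level(usable: Dict[int, bool]) -> Optional[RecoveryLevel]:
    """Return the lowest-numbered level marked usable, or None if none is."""
    for lvl in LEVELS:
        if usable.get(lvl.level, False):
            return lvl
    return None


# Hypothetical flags for a case where snapshots and mirror are both unusable.
chosen = choose_recovery_level({1: False, 2: False, 3: True})
print(chosen.name)  # Tape back-up
```

The point of the sketch is that every level skipped pushes recovery to a slower source, which is why the availability of Levels 1 and 2 matters so much in the slides that follow.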

Enhance SAN Availability
Issue:
- Production outage triggered by dual disk failure; immediate disk recovery through “Hot Spares” was not available
Action Taken:
- Implemented 32 in-frame “Hot Spares”
Next Step:
- Will review other options to provide a higher level of redundancy
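As background on why a dual disk failure is fatal for a RAID 5 group, here is a simplified illustration of single-parity XOR recovery. It is a toy model with made-up block contents, not a description of ERCOT's storage frames: one lost block can be rebuilt from the survivors plus parity, but two lost blocks cannot. A hot spare shortens the time spent in the degraded one-failure state, when a second failure would cause data loss.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks in one stripe, plus their parity block.
data = [b"ERCO", b"T SA", b"N 05"]
parity = xor_blocks(data)

# One block lost: XOR of the surviving blocks and parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two blocks lost: the survivors only yield the XOR of the two missing
# blocks, which is not enough to recover either one individually.
leftover = xor_blocks([data[0], parity])
assert leftover == xor_blocks([data[1], data[2]])
```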

Level 1 - Online Recovery Unavailable
Issue - Level 1 Recovery (SNAPs) unavailable:
- Second disk still running, but begins creating bad sectors; SNAPs evaluated and deemed corrupted
- Original/current SNAP process does not provide adequate online recovery
Action Taken:
- Vendor engaged to review and recommend best practice changes
Next Step:
- Continue with vendor engagement

Level 2 - “Austin Mirror” Unavailable & upgrade project not executed per plan - impact to Level 3 Recovery
Issue - Austin Mirror upgrade project, critical project step not executed:
- Failed to follow post-migration step in project plan, which would have mitigated the risks
- Recovery efforts for archive/dw required back to 12/19 as opposed to 12/25
Action Taken:
- Business owners to sign off on project plans impacting critical infrastructure supporting service delivery to stakeholders
Next Step:
- Hiring Manager of Storage Management
- Reviewing storage management practices
- Changes in risk management practices

Internal Communications
Issues:
- As outage extended, communication between IT operations and business operations management was too slow to be initiated
- Initial restoration decisions made without business ops consultation
- Client Relations was contacted but had a bigger task of translating the emerging information into communications to the market
- Lack of awareness at the IT and business operations levels about Reg. Affairs needs related to PUCT notification per rules
- Lack of a common understanding of recovery capabilities/options
Action To Be Taken:
- Develop an “event” escalation matrix, including Reg. Affairs
- Address Bus/IT joint management decision-making process related to restoration
- Confirm roles and responsibilities related to internal communications during an “event”
Next Step:
- Begin development of escalation matrix
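One way to picture the escalation matrix proposed above is as a table of elapsed-outage thresholds mapped to the groups that must be brought in. The sketch below is illustrative only. The slides do not give thresholds, so the hour values, the `ESCALATION_MATRIX` table, and the `groups_to_notify` function are all assumptions. Only the group names come from the slide.

```python
# (hours since outage start, groups to bring in at that point)
# Threshold values are hypothetical; the slides do not specify them.
ESCALATION_MATRIX = [
    (0, ["IT Operations"]),
    (1, ["Business Operations", "Client Relations"]),
    (4, ["Regulatory Affairs"]),
]


def groups_to_notify(outage_hours):
    """Return every group whose threshold has been reached, in matrix order."""
    groups = []
    for threshold, names in ESCALATION_MATRIX:
        if outage_hours >= threshold:
            groups.extend(names)
    return groups


print(groups_to_notify(0.5))  # ['IT Operations']
print(groups_to_notify(5))
```

Keeping the matrix as data rather than logic means the thresholds and groups can be reviewed and signed off by business owners without touching the code that applies them.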

Risk Management
Issue:
- Internal decisions that elevated risk or reduced effectiveness of approved mitigation strategies (recover faster, restore services quickly) were made in isolation and did not evaluate/document risk elevation
Action Taken:
- Business owners’ sign-off required for critical infrastructure project plans
- Project plans address risk to service continuity and mitigation strategies
Next Step:
- Implement action steps

Follow-up Questions from RMS and TAC
During other December outages, planned or unplanned, were there any ‘warning signs’ of storage hardware problems?
- After a review of planned and unplanned outages for the month of December, there were no warning signs of disk failure. A review of the storage system logs also showed no signs of an impending disk failure.
Share the cost/benefit of the purchase of the hot swappable drives?
- Cost to ERCOT was $42,000. The benefits: (1) gain a higher degree of reliability in our primary production storage service, (2) reduce the risk of similar production storage failure requiring ERCOT to restore MP data from other on or off-line data storage sources and (3) reduce the risk of service interruption to MP’s given a similar event type. (ERCOT staff alone logged over 2,000 hours in the recovery process, with MP’s likely spending more in aggregate.)

Restoration Management/Coordination
Issue:
- Communications breakdown between Production Support and Market Operations
- Resource issues that impacted ability to perform more parallel recovery
- DR environment not adequately upgraded, maintained and tested
- Lack of a common understanding of recovery capabilities
Action Taken:
- Restoration strategies under review
- Joint business/IT involvement throughout recovery efforts via standing calls/meetings according to escalation matrix
- Include in operations report when there is a change that impacts DR environment (regardless of planned or unplanned)
Next Step:
- Development of escalation matrix
- Continue evaluation of resource availability and utilization in events requiring parallel recovery efforts

Impact Analysis - Direct and Indirect
Issues:
- Comprehensive evaluation of service impacts not completed until more than 1 week had passed
- Need to develop a comprehensive list of extracts/reports and business owners
- Restoration a priority over impact analysis; outage estimates not available
- Competition for resources affects ability to support other environments
- Amount of time spent in meetings (internally/externally) to restore confidence
Action To Be Taken:
- Develop and maintain an inventory of reports & extracts with associated business owners
- Cross-functional teams to work restoration to better ascertain outage durations and required recovery time (determined by escalation matrix)
- BU manager/director should gain general awareness of how reports/extracts are used by MPs
- As outage becomes an “event”, schedule standing internal meetings for more efficient information sharing and decision making
Next Step:
- Initiate action items above

Follow-up Questions from RMS and TAC
Share more about ERCOT’s analysis of stopping data writes when a partial failure happens, to prevent the bad data/bad tables problem
- Bad data/tables were a result of the hardware failure, not of the applications continuing to operate for a time in a degraded state because the hardware did not fail entirely at one point
Estimate recovery time if today two disks fail with the mirror synchronized and working
- There would have been no outage if the mirror were working. If an array failed in Taylor the frames would have served the data from the mirrored volumes in Austin without disruption of services to MP’s

Follow-up Questions from RMS and TAC
Who audits the storage processes at ERCOT and will ERCOT be bringing in an outside firm to assist with lessons learned?
- ERCOT’s storage administration group adheres to daily operating procedures and standards, including daily auditing and reporting. Further, auditing of the storage function is part of the annual SAS70 Type II audit.
- Yes, one of ERCOT’s storage vendors is onsite assisting in conducting an analysis and lessons learned.

ERCOT External Communications Challenge
Designing Content and Distribution Systems to Meet Diverse Needs and Wants
[Diagram labels]
- MP Segment, Size, Organization Structure
- Functions: QSE Ops Sched/Dispatch, Regulatory/Governance, Retail Trans, Info Technology, Meter/Forecasting, Disputes/ADR, Data/Extracts, Grid Planning, QSE Ops Financial
- Organizational/Market View: Policy Making (Strategic); Policy Analysis and Governance (Tactical to Strategic); Day-to-day Operations (Operating)

ERCOT Communications Challenge
Designing Content and Distribution Systems to Meet Diverse Needs and Wants
[Diagram repeated from the previous slide]
This diversity drives a need for ERCOT Staff to understand/determine:
- primary purpose/aim of a communication
- primary audience(s)
- appropriate vehicle
- specific content to meet the primary aim

ERCOT Communications Challenge
Designing Content and Distribution Systems to Meet Diverse Needs and Wants
[Diagram repeated from the previous slide, with the middle tier labeled Policy Advisory and Governance (Tactical to Strategic)]
Communication vehicles shown:
- Market Notices (Operations)
- Stakeholder Meetings (RMS, WMS, COPS, PRS, TAC)
- Stakeholder Meetings (BOARD, PUCT)

Types of Content and Volume of Messaging
Designing Content and Distribution Systems to Meet Diverse Needs and Wants
Operational notice types and estimated volumes:
- Market Notices (100’s)
- Market Bulletins (10’s)
- Market Meeting Agendas (400+)
- Meeting Minutes or Notes (400+)
- Meeting Presentations (1000+)
- Market Calls (100’s) (?)
- PRR’s and SCR’s (100+, multiple rounds)
- Project Priority List (12)
- Cost/Benefit Analyses and Impact Analyses (100+)
- Ad hoc phone calls (?)
- Training classes (100+ days of delivery, of pages of content)
- Market Data Reports and Member Data Extracts (10,000’s)
- Texas Market Link (continuous updates)
- ERCOT.com (continuous updates)

2005 Improvement Efforts
Establishment of Communications Working Group (under COPS)
- “CWG is also responsible for advising ERCOT on the content, format and frequency of communication, which is used by ERCOT to ensure that all participants receive timely and accurate market information regarding commercial operations market rules and system changes.”
Focused on operational communications
- Collaborative and productive process with market participants and ERCOT Staff
- Restructuring of market notice template
- Restructuring of list construct to better meet the needs of market participant staff and empower them to control the flow of information to them
- Always a dynamic process: MP needs and wants change over time, thus a standing body (Working Group) as opposed to a Task Force

MP Feedback on Communications - 12/05 Storage Failure/Services Disruption
Issue:
- “ERCOT should have extended its communication distribution list, to include policy makers and governance participants, as the recent operating outage became an extended outage”
Actions Taken/Recommended:
- Create a market notification list titled “ERCOT System Event” or other
  - Triggered when ERCOT deems a major system event needs escalation to governance and policy makers
  - Used for service events across ERCOT (including when a system/service outage extends to 24 hours, excluding events/actions already prescribed by NERC or PUCT)
  - Subscriber controlled
  - Gives additional transparency for policy makers into operational events that need their attention
  - The content would be targeted to the policy makers
  - Communicates summary of events, impacts, risks and issues related to market rules and other policy implications
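The trigger conditions for the proposed “ERCOT System Event” list can be written as a small decision rule. This is a sketch of one reading of the slide, not an ERCOT specification. The function name and its parameters are hypothetical, and treating the NERC/PUCT exclusion as overriding the other conditions is an assumption.

```python
def should_send_system_event_notice(outage_hours,
                                    deemed_major=False,
                                    covered_by_nerc_or_puct=False):
    """Decide whether to notify the hypothetical "ERCOT System Event" list.

    Assumed reading of the slide:
    - events already handled under NERC or PUCT prescriptions are excluded;
    - otherwise notify when ERCOT deems the event major, or when the
      outage has lasted 24 hours or more.
    """
    if covered_by_nerc_or_puct:
        return False
    return deemed_major or outage_hours >= 24


print(should_send_system_event_notice(30))                      # True
print(should_send_system_event_notice(2, deemed_major=True))    # True
print(should_send_system_event_notice(30, covered_by_nerc_or_puct=True))  # False
```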

PUCT Feedback on Communications - 12/05 Storage Failure/Services Disruption
Issue:
- ERCOT failed to meet its notice requirements with Sr. PUCT Staff in this event
Actions Taken/Recommended:
- Regulatory Affairs to create and maintain a PUCT Sr. Staff after-hours call list
- RA to make phone call to give notice of event, and to call again if necessary to confirm receipt of message
- RA to review our PUCT notification obligations with ERCOT managers, directors and officers, in an effort to ensure proper internal flow of information in the event of an extended outage
- Create a market notification list titled “ERCOT System Event”
  - ERCOT Staff to work with PUCT Sr. Staff to ensure they are properly subscribed initially

Feedback on Session