NERC Published Lessons Learned Summary


1 NERC Published Lessons Learned Summary
November 2016

2 NERC Lessons Learned - November 2016
Three NERC Lessons Learned (LL) were published in November 2016:
LL “Redundant Systems may not Cold-Start unless fully intact to prevent Dual Primary Operation”
LL “Failover Configuration Leads to Loss of EMS”
LL “Loss of ICCP due to Database Sizing Issue”

3 Redundant Systems may not Cold-Start – Details
The entity deployed a patch on several noncritical but similar hardware installations throughout the organization, in addition to the identical QA/test energy management system (EMS).
When the patch was executed on the production EMS, communications halted.
The failure was attributed to a corrupt configuration on one switch of a redundant pair.
The second switch’s configuration was verified intact, and the upgrade process had not yet executed on the second switch.

4 Redundant Systems may not Cold-Start – Details
The freshly upgraded switch was powered down and traffic was expected to return to normal, but it did not.
The switches were power cycled in various orders in an attempt to restore service, without success.
Once the corrupt/missing configuration was identified, it was restored from backup and system functionality resumed immediately.

5 Redundant Systems may not Cold-Start – Details
The network switches will not forward traffic when the two units do not have matching configurations.
A single switch would not start without its mate in a cold-start situation.
This is to prevent dual primary operation (split-brain scenario), where two isolated switches each think that they are the only operating switch.
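The forwarding rule described on this slide can be written as a small decision function. This is only an illustrative sketch, not the vendor's logic: the function name `may_forward` and all of its parameters are hypothetical. The optional `solo_start_wait_s` argument is an assumption showing how a waiting period could relax the rule for a lone switch; leaving it as `None` models the behavior described here.

```python
from typing import Optional


def may_forward(
    mate_present: bool,
    configs_match: bool,
    seconds_waited: float = 0.0,
    solo_start_wait_s: Optional[float] = None,  # hypothetical knob, None = never start alone
) -> bool:
    """Decide whether a switch in a redundant pair may forward traffic."""
    if mate_present:
        # With the mate visible, both units must agree on configuration.
        return configs_match
    if solo_start_wait_s is None:
        # Strict split-brain protection: a lone switch stays down.
        return False
    # Relaxed rule: a lone switch may start once it has waited long enough.
    return seconds_waited >= solo_start_wait_s
```

With the strict rule, `may_forward(False, True)` is `False`, which matches a single intact switch refusing to cold-start without its mate.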

6 Redundant Systems may not Cold-Start – Corrective Actions
The vendor modified its recommended configuration baseline to include the ability to cold-start a single switch after a waiting period.
This balances dual primary protection (split-brain scenario) with the operational need to start a system using a single network switch.
Engineers now have processes to quickly compare configurations with known good baselines during maintenance operations.
Standard commissioning and testing procedures include cold-starting redundant systems when the mate is not present.
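Comparing a running configuration with a known good baseline can be as simple as a line-by-line diff. The sketch below uses Python's standard `difflib` module; the helper name `config_drift` and the sample configuration lines are made up for illustration and do not come from the lesson learned document.

```python
import difflib


def config_drift(baseline: str, running: str) -> list[str]:
    """Return only the lines that differ between baseline and running config.

    Lines prefixed "- " exist only in the baseline (missing from the device);
    lines prefixed "+ " exist only in the running configuration.
    """
    diff = difflib.ndiff(baseline.splitlines(), running.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical configuration text, for illustration only.
baseline_cfg = "hostname sw1\nvlan 10\nstack priority 1\n"
running_cfg = "hostname sw1\nvlan 10\n"

for line in config_drift(baseline_cfg, running_cfg):
    print(line)
```

An empty result means the running configuration matches the baseline; any output points at the lines to inspect before cycling power.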

7 Redundant Systems may not Cold-Start – Corrective Actions
These redundant-system concepts must be understood by technicians, operators, and engineers at all levels who work with redundant systems.
See the lesson learned document for details.
In addition to common testing, redundant systems should be tested with partial outages.
Ensure backups and disaster recovery procedures are readily available before performing maintenance.

8 Failover Configuration Leads to Loss of EMS – Details
After a failover was initiated from the EMS-C server to the EMS-D server, the automatic generation control (AGC) application aborted twice within a minute during EMS-D’s initialization/startup.
This caused an automatic failover to the next backup EMS server in line (EMS-A).
The same condition was experienced by EMS-A, which initiated another automatic failover to the next backup EMS server in line (EMS-B).
When the system reached the final available server (EMS-B), all systems were in a DISABLED state.

9 Failover Configuration Leads to Loss of EMS – Corrective Actions
The entity removed the scheme that initiated an automatic failover after two consecutive AGC failures (within a minute) from the EMS process manager model.
The entity also reviewed all other schemes to ensure that the triggering of an automatic failover is properly defined.

10 Failover Configuration Leads to Loss of EMS – Lessons Learned
Review all failover configuration settings in the EMS that could initiate an automatic failover of the EMS to determine the value of the scheme.
Remove schemes that are not necessary or could lead to a cascading failover scenario.
Evaluate whether to allow these applications to fail in lieu of automatic failovers.
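The cascade in this lesson can be reproduced with a toy model. The sketch below is a simplification under stated assumptions: the function `run_failover_chain` is hypothetical, it treats a server as disabled once it triggers a failover, and it assumes the same AGC fault follows the system to every server. It is not the EMS process manager's actual logic.

```python
from typing import Optional


def run_failover_chain(
    standbys: list[str],
    agc_aborts_on_startup: bool,
    failover_on_agc_abort: bool,
) -> tuple[Optional[str], list[str]]:
    """Walk the standby list and return (active_server, disabled_servers).

    active_server is None when every standby has been exhausted.
    """
    disabled: list[str] = []
    for server in standbys:
        if agc_aborts_on_startup and failover_on_agc_abort:
            # The repeated AGC abort trips the scheme; move to the next server.
            disabled.append(server)
            continue
        # Either AGC started cleanly, or AGC is allowed to fail on its own
        # while the rest of the EMS stays up on this server.
        return server, disabled
    return None, disabled


chain = ["EMS-D", "EMS-A", "EMS-B"]
print(run_failover_chain(chain, True, True))   # scheme enabled
print(run_failover_chain(chain, True, False))  # scheme removed
```

With the scheme enabled, a fault common to every server walks the whole chain and leaves no active server. With the scheme removed, the EMS stays on EMS-D and only the failing application is lost, which is the trade-off the last bullet asks entities to evaluate.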

11 Loss of ICCP due to Database Sizing Issue – Details
An entity was updating and expanding its state estimator (STE) network model.
This STE update required an additional 13,000 points from the ICCP.
These had already been added to the ICCP database and to the development and staging STE servers.
When the production STE server was updated, it began requesting the 13,000 additional points again, and the database table grew by 26,000 points (13,000 for each of the two production STE servers).

12 Loss of ICCP due to Database Sizing Issue – Details
This exceeded the maximum allowed size of the database table and caused the ICCP processes to abort.
Investigation revealed that the database table was 97% full prior to the expansion, and the extra points caused it to exceed its maximum size.
A failback to the previous ICCP database was attempted, but this did not successfully resolve the issue.

13 Loss of ICCP due to Database Sizing Issue – Corrective Action
The database table was temporarily resized to accommodate the appropriate number of entries.
More extensive research was done to determine the total size that will be needed through the end of the network model expansion project.
A report was created to compare the current size of database tables to their maximum limit.
This is now reviewed at each ICCP database change.
Two ICCP support staff completed a vendor class on ICCP support and maintenance.
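A size-versus-limit report of the kind described above might look like the following sketch. Everything here is assumed for illustration: the `TableUsage` record, the `capacity_report` function, the 80% warning threshold, and the table names and row counts. The source gives only the 97% fill level and the 26,000 added points, so the 500,000-row limit is a made-up figure chosen to match that percentage.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    name: str
    rows_used: int
    rows_max: int


def capacity_report(
    tables: list[TableUsage],
    planned_additions: dict[str, int],
    warn_at: float = 0.80,  # assumed threshold, tune to local practice
) -> list[tuple[str, int, str]]:
    """Return (table name, projected rows, status) for each table."""
    report = []
    for t in tables:
        projected = t.rows_used + planned_additions.get(t.name, 0)
        if projected > t.rows_max:
            status = "EXCEEDS LIMIT"
        elif projected / t.rows_max >= warn_at:
            status = "WARN"
        else:
            status = "OK"
        report.append((t.name, projected, status))
    return report


# Hypothetical numbers: a table that is 97% full before 26,000 points are added.
tables = [
    TableUsage("iccp_points", rows_used=485_000, rows_max=500_000),
    TableUsage("iccp_links", rows_used=100, rows_max=1_000),
]
for row in capacity_report(tables, {"iccp_points": 26_000}):
    print(row)
```

Running the check against the planned additions before a database change, rather than against current usage alone, is what would flag the overflow ahead of time.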

14 Loss of ICCP due to Database Sizing Issue – Lessons Learned
Database sizes need to be carefully monitored as a system is expanded.
Sizes should be large enough to accommodate all data being requested, not just what is currently being transferred.
Primary databases, as well as peripherally associated databases, need to be evaluated for size constraints.
The vendor may need to be contacted to verify that database sizes can be increased without causing problems or to provide a more comprehensive validation routine.
Alternative ICCP configurations need to be evaluated to determine if there is a more efficient means to feed data into the staging and development systems.

15 Loss of ICCP due to Database Sizing Issue – Lessons Learned
ICCP databases should be set up so that external companies cannot inadvertently request data that does not originate in the host utility.
Backup support staff should be fully trained so that discovery of problems does not rest completely with primary support personnel.
Support staff should meet regularly to discuss questions, discoveries, and findings.

16 Lessons Learned Survey Link
NERC’s goal in publishing lessons learned is to provide industry with technical and understandable information that assists it in maintaining the reliability of the bulk power system.
NERC requests that industry provide input on lessons learned by taking the short survey.
A link is provided in the PDF version of each Lesson Learned.
