Gemini OSU - UKLC Update
Annie Griffith, December 2007
Discussion Items
Discussion will focus on the system elements of the recent Gemini incident:
- Summary of findings to date
- Review of learning points
- Lessons learned applicable to UKLTR
- Open discussion
Overview of events
- 21st Oct – upgraded system implemented; API errors identified
- 22nd Oct – issue with shippers viewing other shippers' data; shipper access revoked
- 24th Oct – code fix for data view implemented; internal National Grid access only
- 26th Oct – external on-line service restored
- 1st Nov – hardware changes implemented to the external service
- 2nd Nov – API service restored; further intermittent outages occurring on the APIs
- 5th Nov – last outage on the API service recorded at 13:00; root cause analysis still underway
Summary - Causes
Two problems identified:
1. Application code construct – associated with high-volume instantaneous concurrent usage of the same transaction type. Fix deployed 05:00 on 23/10/07.
2. API error – associated with saturation usage, presenting as "memory leakage": memory consumption builds up over time and eventually results in loss of service. Indications are that this is an error in a 3rd-party system software product. Investigations continuing.
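The first cause, a code construct that only fails under instantaneous concurrent use of the same transaction type, can be illustrated with a minimal sketch. The `TransactionStore` class and its methods are hypothetical illustrations, not the actual Gemini code:

```python
import threading

# Hypothetical stand-in for a shared transaction record. It shows how
# an unsynchronised read-modify-write can pass single-user testing yet
# lose updates under high-volume concurrent usage of the same
# transaction type.
class TransactionStore:
    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def post_unsafe(self):
        # Latent defect: read-modify-write with no lock. Two threads can
        # read the same value and one update is silently lost.
        current = self.count
        self.count = current + 1

    def post_safe(self):
        # Fix: serialise access to the shared record.
        with self._lock:
            self.count += 1

def drive(post, users=50, posts_each=1000):
    """Simulate many users posting the same transaction type at once."""
    threads = [
        threading.Thread(target=lambda: [post() for _ in range(posts_each)])
        for _ in range(users)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

store = TransactionStore()
drive(store.post_safe)
print(store.count)  # 50 * 1000 = 50000 with the locked version
```

Driving `post_unsafe` the same way can yield fewer than 50000 recorded transactions, and only intermittently, which is why discrete single-user testing does not surface this class of defect.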
Fixes since Go-live
Since 4th November:
- 10 application defects, all minor, all fixed
- No outstanding application errors
Gemini OSU Testing
Extensive testing programme:
- 2 months integration and system testing
- 6 weeks OAT
- Performance testing: volume testing at 130% of current user load
- 8 weeks UAT
- 4 weeks shipper trials (voluntary), 3 participants
- 7 weeks dress rehearsal, focused on the actions needed to complete the technical hardware upgrade across multiple servers and platforms
Testing Lessons Learnt
UAT:
- Each functional area was tested discretely
- Issues around concurrent usage were unknown and therefore not specifically targeted for testing
- "Field" testing of the system under fully loaded conditions may have highlighted this problem, but this is not certain
OAT:
- Although volume and stress testing completed successfully, reliability/soak testing over a prolonged period was not undertaken
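The missing soak test is the kind of check that could have caught the "memory leakage" pattern, since it only shows up over a prolonged run. A minimal sketch, assuming a hypothetical `call_service()` stand-in for the real API:

```python
import tracemalloc

def call_service():
    # Hypothetical stand-in for a real API call; a genuine soak test
    # would exercise the deployed service and watch process memory.
    return [0] * 100  # allocate a small buffer, released on return

def soak_test(iterations=10_000, max_growth_bytes=1_000_000):
    """Run the workload repeatedly and flag sustained memory growth.

    Unlike a volume or stress test, the point is not peak load but
    whether resource usage keeps climbing over time.
    """
    tracemalloc.start()
    call_service()  # warm-up, so one-off start-up allocations are excluded
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        call_service()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    return growth <= max_growth_bytes, growth

ok, growth = soak_test()
print("PASS" if ok else f"FAIL: memory grew by {growth} bytes")
```

A real soak test would run for hours or days against the deployed service; the structure (baseline, sustained load, growth threshold) is the same.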
Other Observations
- Communications during the incident undersold the scale of the change
- Engagement: were the right individuals/forums involved?
- Planning for failure... as well as success
UKLTR – What's different?
The main workhorse of the system is batch processing:
- Predictable transaction volumes
- Far easier to replicate load and volume testing
- Easy to verify outputs
Shipper interaction is batch-driven:
- Low volume of on-line users
- Does not have the same level of real-time/instantaneous transaction criticality
- Ability to do more verification following cut-over, before releasing data from the upgraded system to the outside world
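The "easy to verify outputs" point lends itself to a simple reconciliation step after cut-over: run the same batch on the old and upgraded systems and diff the results before releasing data. A minimal sketch, where the record layout and the `txn_id` key are illustrative assumptions:

```python
def reconcile(old_records, new_records, key="txn_id"):
    """Compare batch outputs from the old and upgraded systems.

    Returns (missing, extra, changed): keys present only in the old
    output, keys present only in the new output, and keys whose
    records differ between the two runs.
    """
    old_by_key = {r[key]: r for r in old_records}
    new_by_key = {r[key]: r for r in new_records}
    missing = sorted(old_by_key.keys() - new_by_key.keys())
    extra = sorted(new_by_key.keys() - old_by_key.keys())
    changed = sorted(
        k for k in old_by_key.keys() & new_by_key.keys()
        if old_by_key[k] != new_by_key[k]
    )
    return missing, extra, changed

# Illustrative data: one record changed, one missing from the new run.
old = [{"txn_id": 1, "qty": 10}, {"txn_id": 2, "qty": 5}, {"txn_id": 3, "qty": 7}]
new = [{"txn_id": 1, "qty": 10}, {"txn_id": 2, "qty": 6}]
missing, extra, changed = reconcile(old, new)
print(missing, extra, changed)  # [3] [] [2]
```

Because batch volumes are predictable and outputs are deterministic, a cut-over gate like "missing, extra and changed must all be empty before data is released" is feasible for UKLTR in a way it was not for Gemini's real-time transactions.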
UKLTR – Lessons to be applied
Plan for failure:
- Differing levels: problems vs. incidents
- Technical and resource planning
Be fully prepared:
- Incident management procedure established in advance and understood by all parties
- Escalation routes
- Communications mechanisms
Status communications to be issued during the outage period:
- Milestone updates? Issued to whom?
Fall-back options:
- The old system provides a straightforward fall-back option
- However, once interface data has been propagated to other systems, we will be in a "fix-forward" situation
Discussion?