Gemini OSU - UKLC Update
Annie Griffith, December 2007
Discussion Items
Discussion will focus on the system elements of the recent Gemini incident:
- Summary of findings to date
- Review of learning points
- Lessons learned applicable to UKLTR
- Open discussion
Overview of events
- 21st Oct – upgraded system implemented; API errors identified
- 22nd Oct – issue with shippers viewing other shippers' data; shipper access revoked
- 24th Oct – code fix for data view implemented; internal National Grid access only
- 26th Oct – external on-line service restored
- 1st Nov – hardware changes implemented to the external service
- 2nd Nov – API service restored; further intermittent outages occurring on the APIs
- 5th Nov – last outage on the API service recorded at 13:00; root cause analysis still underway
Summary - Causes
Two problems identified:
1. Application code construct – associated with high-volume instantaneous concurrent usage of the same transaction type. Fix deployed 05:00 on 23/10/07.
2. API error – associated with saturation usage, presenting as "memory leakage": memory consumption builds up over time and eventually results in loss of service. Indications are that this is an error in a 3rd-party system software product. Investigations continuing.
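The first cause, a code construct that only fails under instantaneous concurrent use of the same transaction type, can be illustrated with a minimal sketch. The `TransactionStore` class and its methods are hypothetical illustrations, not the actual Gemini code:

```python
import threading

# Hypothetical stand-in for a shared transaction record. It shows how
# an unsynchronised read-modify-write can pass single-user testing yet
# lose updates under high-volume concurrent usage of the same
# transaction type.
class TransactionStore:
    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def post_unsafe(self):
        # Latent defect: read-modify-write with no lock. Two threads can
        # read the same value and one update is silently lost.
        current = self.count
        self.count = current + 1

    def post_safe(self):
        # Fix: serialise access to the shared record.
        with self._lock:
            self.count += 1

def drive(post, users=50, posts_each=1000):
    """Simulate many users posting the same transaction type at once."""
    threads = [
        threading.Thread(target=lambda: [post() for _ in range(posts_each)])
        for _ in range(users)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

store = TransactionStore()
drive(store.post_safe)
print(store.count)  # 50 * 1000 = 50000 with the locked version
```

Driving `post_unsafe` the same way can yield fewer than 50000 recorded transactions, and only intermittently, which is why discrete single-user testing does not surface this class of defect.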
Fixes since Go-live
Since 4th November:
- 10 application defects, all minor, all fixed
- No outstanding application errors
Gemini OSU Testing
Extensive testing programme:
- 2 months integration and system testing
- 6 weeks OAT
- Performance testing: volume testing at 130% of current user load
- 8 weeks UAT
- 4 weeks shipper trials (voluntary), 3 participants
- 7 weeks dress rehearsal, focused on the actions needed to complete the technical hardware upgrade across multiple servers and platforms
Testing Lessons Learnt
UAT:
- Each functional area was tested discretely
- Issues around concurrent usage were unknown and therefore not specifically targeted for testing
- "Field" testing of the system under fully loaded conditions may have highlighted this problem, but this is not certain
OAT:
- Although volume and stress testing completed successfully, reliability/soak testing over a prolonged period was not undertaken
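The missing soak test is the kind of check that could have caught the "memory leakage" pattern, since it only shows up over a prolonged run. A minimal sketch, assuming a hypothetical `call_service()` stand-in for the real API:

```python
import tracemalloc

def call_service():
    # Hypothetical stand-in for a real API call; a genuine soak test
    # would exercise the deployed service and watch process memory.
    return [0] * 100  # allocate a small buffer, released on return

def soak_test(iterations=10_000, max_growth_bytes=1_000_000):
    """Run the workload repeatedly and flag sustained memory growth.

    Unlike a volume or stress test, the point is not peak load but
    whether resource usage keeps climbing over time.
    """
    tracemalloc.start()
    call_service()  # warm-up, so one-off start-up allocations are excluded
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        call_service()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    return growth <= max_growth_bytes, growth

ok, growth = soak_test()
print("PASS" if ok else f"FAIL: memory grew by {growth} bytes")
```

A real soak test would run for hours or days against the deployed service; the structure (baseline, sustained load, growth threshold) is the same.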
Other Observations
- Communications during the incident undersold the scale of the change
- Engagement: were the right individuals/forums involved?
- Planning for failure... as well as success
UKLTR – What's different?
The main workhorse of the system is batch processing:
- Predictable transaction volumes
- Far easier to replicate load and volume testing
- Easy to verify outputs
Shipper interaction is batch-driven:
- Low volume of on-line users
- Does not have the same level of real-time/instantaneous transaction criticality
- Ability to do more verification following cut-over, before releasing data from the upgraded system to the outside world
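The "easy to verify outputs" point lends itself to a simple reconciliation step after cut-over: run the same batch on the old and upgraded systems and diff the results before releasing data. A minimal sketch, where the record layout and the `txn_id` key are illustrative assumptions:

```python
def reconcile(old_records, new_records, key="txn_id"):
    """Compare batch outputs from the old and upgraded systems.

    Returns (missing, extra, changed): keys present only in the old
    output, keys present only in the new output, and keys whose
    records differ between the two runs.
    """
    old_by_key = {r[key]: r for r in old_records}
    new_by_key = {r[key]: r for r in new_records}
    missing = sorted(old_by_key.keys() - new_by_key.keys())
    extra = sorted(new_by_key.keys() - old_by_key.keys())
    changed = sorted(
        k for k in old_by_key.keys() & new_by_key.keys()
        if old_by_key[k] != new_by_key[k]
    )
    return missing, extra, changed

# Illustrative data: one record changed, one missing from the new run.
old = [{"txn_id": 1, "qty": 10}, {"txn_id": 2, "qty": 5}, {"txn_id": 3, "qty": 7}]
new = [{"txn_id": 1, "qty": 10}, {"txn_id": 2, "qty": 6}]
missing, extra, changed = reconcile(old, new)
print(missing, extra, changed)  # [3] [] [2]
```

Because batch volumes are predictable and outputs are deterministic, a cut-over gate like "missing, extra and changed must all be empty before data is released" is feasible for UKLTR in a way it was not for Gemini's real-time transactions.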
UKLTR – Lessons to be applied
Plan for failure:
- Differing levels: problems vs. incidents
- Technical and resource planning
Be fully prepared:
- Incident management procedure established in advance and understood by all parties
- Escalation routes
- Communications mechanisms
Status communications to be issued during the outage period:
- Milestone updates? Issued to whom?
Fall-back options:
- The old system provides a straightforward fall-back option
- However, once interface data has been propagated to other systems, we will be in a "fix-forward" situation
Discussion?