1
Brian Nisbet Network Operations Manager @natural20
The Missing Link
TNC 2017, Linz, Austria
08/02/2017
60 Second SLA at TNC16… This is a war… operational story; I’m pretty sure everyone in the room has many. So why am I taking up your time talking about a network issue? This is somewhat of a practical application of a lot of the 60 Second SLA talk, because this isn’t about the fibre break (it’s never *really* about the fibre break); it’s about what was in place, how everyone reacted and what we all did next. The missing link can be bridged by this.
2
Before
Fibre figure of 8 in Dublin: 6 or 12 nodes, depending on how you look at it.
November 28, 2018
3
Fibre Break!
On Saturday 11th March a link went down. We reported it to the fibre provider, but they said they weren’t aware of any issue (this is the fun of dark fibre!). No services were offline, so there was no need to escalate hugely. On Monday 13th March, when access could be easily gained (no service outage, resilience!), we sent a crew out to site (near the office) to work with the fibre provider to find the break. We found the break was 300m away…
4
It was not supposed to look like this… A complete change control process breakdown: someone had told someone else, assuming they’d tell everyone, and so on.
Credit: Garwin Liu
5
Fibre Break!
Loss of resilience to one half of the ring: 3 universities, one major DC, a host of smaller clients, all of the Schools and the HEAnet office. That’s fine, I mean, it isn’t like there’s a huge infrastructure project still going on across Dublin…
6
LUAS Cross City. It’s better than this now, but there is still so much work going on… Or a long weekend coming up…
Credit: Brian Nisbet
7
St Patrick’s Day: you may have heard of it, we Irish take it kinda seriously…
Major Incident declared.
Contacted all of the affected parties by phone, opened a public ticket, and set expectations for the next update, both internally and externally.
Kept the HEAnet Management Team informed.
Started restoration planning conversations with our fibre provider.
Credit: Niall Carson/PA Wire
8
Traffic Rerouting
With some repatching/resplicing, the southern side of the ring could be rerouted through the middle part of the figure of 8. This was achieved by 12:00 on Tuesday 14th. That still left the small matter of one university, a major HEAnet DC and numerous other smaller clients without resilience. We hoped this would be fixed on the 15th. It wasn’t.
9
Getting closer…
Credit: Niall Carson/PA Wire
10
Traffic Rerouting
We hoped it would happen during the morning of the 16th… No stereotypes, but you know how long weekends are… But the moves, repatching and resplicing weren’t easy.
The fibre provider continued to work very hard and to communicate.
The HEAnet NOC was also communicating internally and externally.
11
Getting closer…
Credit: Niall Carson/PA Wire
12
Medium Term Solution?
At 14:00 on Thursday 16th, the link came back up. No operationally important packets were lost during the incident! Light level checking, monitoring checking, final communication, done!
13
60 Second SLA Check-List
Resilience
Incident Management Processes
Incident Management Processes Followed
Change Management Processes
Change Management Processes Followed
Obligatory Automation Mention
Communication, Communication, Communication
This all comes down to the basics: resilience, incident management processes, engaged management and clear communication. Everyone knew we were working, and everyone who was working knew what they had to do.
14
Secret Sauce?
PARTNERSHIP!
Secret Partnership Sauce!
15
@natural20