Published by May Sullivan. Modified over 8 years ago.
1
LHCOPN operational model - 4 use-cases
Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support), on behalf of the Ops WG
LHCOPN meeting, 2009-01-15, Berlin
2
Agenda
Focus on 4 use-cases:
Incident Management
1. L3: Power outage at DE-KIT leading to routers down
2. L2: Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
Change Management
3. L3: New IP prefix for ES-PIC
Maintenance Management
4. L2: USLHCNET's scheduled power cut for devices in Chicago
GCX - LHCOPN meeting - 2009-01-15
3
Tools used
CERN’s twiki
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsModelUseCases
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsContacts
GGUS
– Public release 2009-02-01
Monitoring
– MDM, e2emon, ASPDrawer...
4
POWER OUTAGE AT DE-KIT LEADING TO ROUTERS DOWN
L3 incident management
5
Scope
2 routers unexpectedly down
Affected: NL-T1, CH-CERN, IT-INFN-CNAF, FR-CCIN2P3, DE-KIT
5 links
6
L3 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_incident_management_process
Scope: Router down, BGP filtering, bad routing...
The source site is the site where the problem lies.
1.1 A ticket is created on the LHCOPN Helpdesk by the router operator of the source site to report the problem. It is assigned to the source site itself.
1.2 The router operator contacts their counterpart at the distant site (site-site communication) to find out whether something has gone wrong there (power outage...). If the problem is at the distant site, the distant site starts this process (the ticket is then re-assigned to the distant site).
1.3 If the problem is related to an underlying layer (L2: dark fibre outage...), the router operator starts the L2 incident management process. The router operator is responsible for managing the problem with the L2 NOC (opening and following the NOC's ticket...) and remains responsible for the LHCOPN ticket in GGUS.
1.4 Otherwise the router operator owns the problem and contacts the local Grid data contact to report the impact. The distant router operator is also informed.
2 The LHCOPN TTS notifies all impacted sites about the incident.
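The branching in steps 1.1 to 2 can be sketched as a small function. This is only one illustrative reading of the process: the function name, its parameters and the step wording are assumptions made for this example, not part of GGUS or of any LHCOPN tool.

```python
def l3_incident_steps(source_site, distant_site,
                      problem_at_distant_site=False,
                      underlying_l2_problem=False):
    """Return the ordered actions for an L3 incident (hypothetical helper)."""
    steps = [
        f"1.1 {source_site} opens an LHCOPN ticket assigned to itself",
        f"1.2 {source_site} contacts its counterpart at {distant_site}",
    ]
    if problem_at_distant_site:
        # The distant site takes over and runs the process from its side.
        steps.append(f"1.2 ticket re-assigned to {distant_site}, which starts this process")
        return steps
    if underlying_l2_problem:
        steps.append(f"1.3 {source_site} starts the L2 incident management process")
    else:
        steps.append(f"1.4 {source_site} warns its Grid data contact and informs {distant_site}")
    # Assumption: the TTS notification happens on both branches,
    # since the LHCOPN ticket stays in GGUS either way.
    steps.append("2 LHCOPN TTS notifies all impacted sites")
    return steps


for step in l3_incident_steps("DE-KIT", "CH-CERN"):
    print(step)
```

Returning a list of labelled steps rather than performing actions keeps the sketch free of any assumption about how tickets or notifications are actually sent.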
7
[Diagram: L3 incident management process (V0.5, 20081215, gcx). Shows the source site (router operators, Grid data contact), the other site involved (router operators), the LHCOPN TTS (GGUS), the affected sites and the L2 incident management process, connected by steps 1.1, 1.2, (1.3), 1.4 and 2. Legend: A notifies B; A interacts with B; A reads and writes B; A goes to process B.]
8
Ticket opening
1.1 A DE-KIT router operator opens a trouble ticket in GGUS
[Diagram: DE-KIT router operators and the LHCOPN TTS (GGUS), connected by step 1.1]
9
GGUS submit interface
10
Ticket opened
11
Other steps
The outage is localised and was noticed by the source site
– No need to perform 1.2: contact counterpart at distant site
This is a power cut, not a real L2 problem
– No need to go further with 1.3: L2 incident management process
12
Grid interaction
1.4: The Grid data contact at DE-KIT is warned about the outage
– GGUS TTid provided
– He will compute impact on the Grid
– He will warn the Grid
[Diagram: DE-KIT router operators, DE-KIT Grid data contact and the LHCOPN TTS (GGUS), connected by steps 1.1 and 1.4]
13
Automatic broadcasting
2: The GGUS TTS will warn all affected sites
– This is done when the ticket is submitted
[Diagram: DE-KIT router operators, DE-KIT Grid data contact and the LHCOPN TTS (GGUS), connected by steps 1.1 and 1.4; step 2 goes to CH-CERN, FR-CCIN2P3, IT-INFN-CNAF, NL-T1 and DE-KIT]
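As a rough illustration of how the list of sites to notify could be derived, the sketch below collects every site at either end of a link that touches a down site. The data model, the function name and the link IDs are all hypothetical, and the topology is invented for the example; only the site names come from the slides. It says nothing about how GGUS actually computes its notification list.

```python
def sites_to_notify(links, down_sites):
    """Return the sorted sites touched by any link ending at a down site.

    `links` maps a link ID to its two end sites (hypothetical data model).
    """
    affected = set(down_sites)
    for end_a, end_b in links.values():
        if end_a in down_sites or end_b in down_sites:
            affected.update((end_a, end_b))
    return sorted(affected)


# Made-up link IDs and topology, for illustration only.
example_links = {
    "link-1": ("CH-CERN", "DE-KIT"),
    "link-2": ("DE-KIT", "FR-CCIN2P3"),
    "link-3": ("DE-KIT", "IT-INFN-CNAF"),
    "link-4": ("DE-KIT", "NL-T1"),
    "link-5": ("CH-CERN", "UK-T1-RAL"),
}

print(sites_to_notify(example_links, {"DE-KIT"}))
```

With this invented topology, UK-T1-RAL is left out because none of its links ends at DE-KIT.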
14
Following/Closure
Incident registration and broadcasting are complete
The DE-KIT router operator is in charge of updating/closing the GGUS ticket
– Affected sites will be notified
The local Grid data contact also has to be warned
15
History
16
Conclusion for first use case
Shortcut, as the incident is quickly localised
– Otherwise more interactions between sites
Deeply organised around GGUS tickets
– Could be opened by another site and assigned to DE-KIT
– Change status from « assigned » to « in progress » to acknowledge
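The acknowledgement by status change can be pictured with a tiny ticket model. This is a sketch under stated assumptions: the `Ticket` class and its fields are hypothetical and do not reflect the GGUS data model; only the two status names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical minimal ticket; not the GGUS schema."""
    opened_by: str
    assigned_to: str
    status: str = "assigned"
    history: list = field(default_factory=list)

    def acknowledge(self, site):
        """Move the ticket from 'assigned' to 'in progress'."""
        if site != self.assigned_to:
            raise ValueError(f"{site} is not the assigned site")
        if self.status != "assigned":
            raise ValueError(f"cannot acknowledge a ticket in status {self.status!r}")
        self.history.append((self.status, "in progress"))
        self.status = "in progress"


# A ticket opened by another site and assigned to DE-KIT.
ticket = Ticket(opened_by="CH-CERN", assigned_to="DE-KIT")
ticket.acknowledge("DE-KIT")
print(ticket.status)
```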
17
Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
L2 incident management
18
Scope
The router operator at UK-T1-RAL noticed that the link is down thanks to their monitoring system
Affected
– 1 link: CERN-RAL-LHCOPN-001
– 2 sites: CH-CERN and UK-T1-RAL
No clear idea of what and where the problem is
– Router down at CH-CERN, fibre cut…
19
Global problem management process started
20
Quick investigation
1- Nothing seems to be occurring on site
2- Take an overview of the LHCOPN
– e2emon monitoring system indicates that the L2 link is down in segment “UKERNA”
Now tracking a fibre cut
– Nothing seems to be registered in GGUS about it
Unscheduled event = Incident
Going to L2 incident management
21
L2 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_incident_management_process
Scope: Dark fibre outages...
1.1 An L2 NOC and a router operator may each notice an L2 incident. They interact to confirm it or not. A router operator may also be warned by the L3 incident management process through an LHCOPN ticket assigned to their site.
1.2 If confirmed, the router operator of a linked site puts a ticket in the LHCOPN TTS. The router operator is in charge of dealing with the L2 network providers involved and of reflecting the ongoing resolution in the LHCOPN TTS.
1.3 It is the responsibility of linked and affected sites to warn their Grid data contact.
2 All impacted sites are notified by the TTS.
3 If nothing is found at L2, the escalated incident management process is started.
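The L2 process has a single decision point: whether the incident is confirmed at L2. A minimal sketch of that branching follows; the function name, parameters and step wording are assumptions made for illustration only.

```python
def l2_incident_steps(linked_site, confirmed_at_l2):
    """Return the ordered actions for an L2 incident (hypothetical helper)."""
    steps = ["1.1 L2 NOC and router operator interact to confirm the incident"]
    if not confirmed_at_l2:
        # Nothing found at L2: hand over to the escalated process.
        steps.append("3 escalated incident management process is started")
        return steps
    steps.append(f"1.2 {linked_site} router operator puts a ticket in the LHCOPN TTS")
    steps.append("1.3 linked and affected sites warn their Grid data contact")
    steps.append("2 LHCOPN TTS notifies all impacted sites")
    return steps


for step in l2_incident_steps("UK-T1-RAL", confirmed_at_l2=True):
    print(step)
```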
22
[Diagram: L2 incident management process (V0.5, 20081215, gcx). Shows the linked sites (L2 NOC, router operators, Grid data contact), the LHCOPN TTS (GGUS), the affected sites, the end of L3 incident management and the escalated incident management process, connected by steps 1.1, 1.2, 1.3, 2 and (3). Legend: A notifies B; A interacts with B; A reads and writes B.]
23
Incident registration
1.1: The router operator at UK-T1-RAL opens a ticket with JANET for the outage
1.2: UK-T1-RAL noticed the outage, so it opens a ticket in GGUS for the LHCOPN community
– Self-assigned, because the link is under their responsibility (T0-T1)
[Diagram: UK-T1-RAL router operators, JANET NOC and the LHCOPN TTS (GGUS), connected by steps 1.1 and 1.2]
24
GGUS ticket submitted
25
Broadcasting
1.3: Grid interaction
– Local Grid data contact warned (+ #GGUS-TTid)
2: Other affected sites automatically notified by GGUS
[Diagram: linked site UK-T1-RAL (router operators, Grid data contact), JANET NOC and the LHCOPN TTS (GGUS), connected by steps 1.1, 1.2 and 1.3; step 2 goes to CH-CERN]
26
Following/Closure
UK-T1-RAL will update the GGUS ticket with information from JANET
– Grid data contact and affected sites are kept updated
The ticket will be closed by UK-T1-RAL
27
Conclusion for second use-case
Accurate and reliable monitoring is required to really shortcut investigations
The key communication is between network provider and customer
– We did not change the way this currently works
28
New IP prefix for ES-PIC
L3 change management
29
Scope
ES-PIC has a new IP prefix that must be included within the LHCOPN
Affected:
– All sites: filters to update…
– And monitoring systems
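To make "filters to update" concrete, the sketch below uses Python's standard `ipaddress` module to show that a host in a new prefix is rejected until the prefix is added to a site's allow-list. The prefixes are documentation ranges chosen for illustration, not real LHCOPN or ES-PIC addresses, and `is_accepted` is a hypothetical helper, not a router or BGP filter implementation.

```python
import ipaddress


def is_accepted(address, allowed_prefixes):
    """Return True if `address` falls inside any allowed prefix."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(prefix) for prefix in allowed_prefixes)


# Illustrative values only (RFC 5737 documentation ranges).
current_filter = ["192.0.2.0/24"]
new_prefix = "198.51.100.0/24"
host_in_new_prefix = "198.51.100.10"

print(is_accepted(host_in_new_prefix, current_filter))                 # False
print(is_accepted(host_in_new_prefix, current_filter + [new_prefix]))  # True
```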
30
L3 change management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_change_management_process
Scope: IP address changes, new prefix propagated, new filtering
The source actors for these changes are router operators.
1.1 The router operator presents the change to their Grid data contact (change in performance, new resiliency possibility...)
1.2 The router operator presents the change to affected sites (e.g. linked sites)
2.1 The change is fully documented on the global web repository, and related technical information should also be updated
2.2 An informational ticket summarising the change is put into the LHCOPN TTS. It contains a link to the full documentation of the change (e.g. URL to the global web repository)
2.3 The L3 monitoring infrastructure may be adapted if needed (new p2p IPs to be watched...)
3 The LHCOPN TTS notifies all impacted sites
4 If the change has an impact, an L3 maintenance management process is started to commit the change. Otherwise the change can be made directly.
If an L3 change impacts L2 (L3 VPN for instance), the L2 change management process should be started.
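The only branching in this process is at step 4, plus the L2 side case. A minimal sketch, assuming two boolean inputs that a router operator would decide on; the function name, parameter names and step wording are hypothetical.

```python
def l3_change_steps(has_impact, impacts_l2=False):
    """Return the ordered actions for an L3 change (hypothetical helper)."""
    steps = [
        "1.1 present the change to the Grid data contact",
        "1.2 present the change to affected sites",
        "2.1 document the change on the global web repository",
        "2.2 put an informational ticket into the LHCOPN TTS",
        "2.3 adapt the L3 monitoring infrastructure if needed",
        "3 LHCOPN TTS notifies all impacted sites",
    ]
    if has_impact:
        steps.append("4 start an L3 maintenance management process to commit the change")
    else:
        steps.append("4 make the change directly")
    if impacts_l2:
        steps.append("start the L2 change management process")
    return steps


print(l3_change_steps(has_impact=False)[-1])
```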
31
[Diagram: L3 change management (V0.5, 20081215, gcx). Shows the source site (router operators, Grid data contact), the affected sites (router operators), the global web repository (Twiki), monitoring, the LHCOPN TTS (GGUS) and the L3 maintenance management process, connected by steps 1.1, 1.2, 2.1, 2.2, (2.3), 3 and (4). Legend: A notifies B; A interacts with B; A reads and writes B.]
32
Change registration
1.1: The Grid data contact is warned about the change
– Will new hosts benefit from the LHCOPN?
1.2: This change is common and has no deep impact on others
– No need to discuss with impacted sites
[Diagram: ES-PIC router operators and ES-PIC Grid data contact, connected by step 1.1]
33
Documentation and tool update
2.1:
– The change will be documented in the change management database
  https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase
– Technical information will be updated
  https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnIpAddresses
  https://twiki.cern.ch/twiki/bin/view/LHCOPN/OverallNetworkMaps
[Diagram: ES-PIC router operators, ES-PIC Grid data contact and the global web repository (Twiki: technical information, change management DB), connected by steps 1.1 and 2.1]
34
Broadcasting
2.2: An « informational » GGUS ticket will be created
– With link to the change management database entry
– With link to the updated technical information
3: All sites will be notified
3: DANTE Operation + ENOC are put in copy
– New prefixes might also need to be monitored by MDM + ASPDrawer
35
GGUS submit interface
Dante.ops@dante.net + ENOC
36
Summary
[Diagram: ES-PIC router operators, ES-PIC Grid data contact, the global web repository (Twiki: technical information, change management DB), monitoring (MDM, BGP) and the LHCOPN TTS (GGUS), connected by steps 1.1, 2.1, 2.2 and (2.3); step 3 goes to all sites, DANTE Operation and ENOC]
37
Committing the change (1/2)
The change is documented and advertised but not yet committed
Does the change, or committing it, have an impact on existing service?
– No, so no need to commit it within a “true” maintenance
38
Committing the change (2/2)
The change will be silently implemented by ES-PIC and reported with a GGUS ticket
– Kind: Maintenance L3
– To track implementation + statistics
39
Conclusion for third use-case
Documenting and implementing are separated
– 2 tickets: Informational & Maintenance
Third-party tools might need to be updated
– MDM, e2emon, ASPDrawer, GGUS …
Lighter process for non-impacting changes
40
USLHCNET's scheduled power cut for devices in Chicago
L2 maintenance management
41
Scope (1/2)
USLHCNET will have a power cut in Chicago
42
Scope (2/2)
Fictional impact:
– US-FNAL-CMS will be fully disconnected
43
L2 maintenance management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_maintenance_management_proces
Sources of L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...)
Often there will be no negotiation phase for L2 maintenance with L2 network providers. But if an event is really disruptive, negotiation should be attempted.
1.1 The L2 NOC sends its maintenance notice to connected or affected router operators. The first router operator notified starts this process.
1.2 The router operator warns their Grid data contact (and may check with them that the date is OK)
1.3 The router operator may check with distant affected sites - off the record - that the date is suitable
1.4 If a disruptive overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.
2 All impacted sites are notified.
3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.
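The overlap check in step 1.4 amounts to an interval test. The sketch below is a minimal illustration: the function names are hypothetical, events are plain `(start, end)` tuples, and the dates are invented for the example rather than taken from any real maintenance.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b):
    """Two time windows overlap if each one starts before the other ends."""
    return start_a < end_b and start_b < end_a


def next_action(maintenance, other_events):
    """Decide step 1.4 for a (start, end) maintenance window."""
    if any(overlaps(*maintenance, *event) for event in other_events):
        return "negotiate another date with the network provider and restart at 1.1"
    return "post the maintenance in the LHCOPN TTS"


# Invented dates, for illustration only.
power_cut = (datetime(2009, 2, 10, 8, 0), datetime(2009, 2, 10, 12, 0))
distant_site_event = (datetime(2009, 2, 10, 11, 0), datetime(2009, 2, 10, 13, 0))

print(next_action(power_cut, [distant_site_event]))
print(next_action(power_cut, []))
```

Windows that merely touch (one ends exactly when the other starts) are treated as non-overlapping here; whether that is acceptable in practice is a policy choice the slides do not address.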
44
[Diagram: L2 maintenance management process (V0.5, 20081215, gcx). Shows the L2 NOC, the linked sites (router operators, Grid data contact), the LHCOPN TTS (GGUS) and the affected sites, connected by steps 1.1, 1.2, 1.3, 1.4 and 2. Legend: A notifies B; A interacts with B; A reads and writes B.]
45
Registering maintenance (1/2)
1.1: USLHCNET warns at least site US-FNAL-CMS
– Not the Grid, not all LHCOPN sites, etc.
1.2: US-FNAL-CMS will warn its local Grid data contact
– And may check with them that the date is OK
1.3: Ideally also avoid overlap with CH-CERN’s events
[Diagram: USLHCNET NOC, US-FNAL-CMS (router operators, Grid data contact) and linked site CH-CERN, connected by steps 1.1, 1.2 and 1.3]
46
Registering maintenance (2/2)
Affected sites:
– US-FNAL-CMS, CH-CERN
– US-FNAL-CMS is responsible for following this event
1.4: A FNAL router operator will put the maintenance into GGUS
47
GGUS submit interface
48
Summary
[Diagram: USLHCNET NOC, US-FNAL-CMS (router operators, Grid data contact), linked site CH-CERN (router operators) and the LHCOPN TTS (GGUS), connected by steps 1.1, 1.2, 1.3 and 1.4; step 2 goes to CH-CERN]
49
Following
US-FNAL-CMS updates the ticket according to USLHCNET reports
US-FNAL-CMS is in charge of closing the ticket when the maintenance is finished
50
Ticket handling
51
Conclusion for fourth use-case
Light process for network providers
– Like what currently happens
– Warn only your customers
– No Grid interaction
Site acts as a relay for information from network providers
– Propagated within LHCOPN community
52
Overall conclusion
53
Overall conclusion (1/2)
Sample provided here
– Many details could be adjusted
Steps for incident management
– Investigate, register, broadcast, follow
Steps for change management
– Document, register, broadcast, commit
Steps for maintenance management
– Register, broadcast, (commit), follow
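The three step sequences can be written down as a small lookup table, which makes the shared register/broadcast core easy to see. This is only a restatement of the slide as data; the names `PROCESS_STEPS`, `OPTIONAL_STEPS` and `steps_for` are made up for this sketch.

```python
# Step sequences as listed on the slide.
PROCESS_STEPS = {
    "incident": ["investigate", "register", "broadcast", "follow"],
    "change": ["document", "register", "broadcast", "commit"],
    "maintenance": ["register", "broadcast", "commit", "follow"],
}

# "(commit)" is shown in brackets for maintenance, so it is treated as optional.
OPTIONAL_STEPS = {("maintenance", "commit")}


def steps_for(kind, include_optional=True):
    """Return the steps for a process kind, optionally dropping bracketed ones."""
    return [
        step for step in PROCESS_STEPS[kind]
        if include_optional or (kind, step) not in OPTIONAL_STEPS
    ]


# Steps common to all three processes.
common = set.intersection(*(set(steps) for steps in PROCESS_STEPS.values()))
print(sorted(common))
```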
54
Overall conclusion (2/2)
Not really different from the current way of carrying out network operations?
– But formalised
Feel free to ask for details on processes
– Propose interesting/embarrassing use-cases
– Everything is/will be on the twiki
GGUS accesses/notifications are indispensable
– The access table is a key item and must be filled in accurately
55
Questions & discussion