LHCOPN operational model - 4 use-cases Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG LHCOPN meeting, 2009-01-15, Berlin.


Agenda
Focus on 4 use-cases:
Incident Management
1. L3: Power outage at DE-KIT leading to routers down
2. L2: Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
Change Management
3. L3: New IP prefix for ES-PIC
Maintenance Management
4. L2: USLHCNET's scheduled power cut for devices in Chicago
GCX - LHCOPN meeting

Tools used
CERN's twiki
GGUS
- Public release
Monitoring
- MDM, e2emon, ASPDrawer...

POWER OUTAGE AT DE-KIT LEADING TO ROUTERS DOWN
L3 incident management

Scope
Routers unexpectedly down
Affected: NL-T1, CH-CERN, IT-INFN-CNAF, FR-CCIN2P3, DE-KIT
5 links

L3 incident management
Scope: Router down, BGP filtering, bad routing...
The source site is the site where the problem lies.
1.1 A ticket is created on the LHCOPN Helpdesk by the router operator of the source site to report the problem. It is assigned to the source site itself.
1.2 The router operator contacts their counterpart at the distant site (site-to-site communication) to find out whether something is wrong there (power outage...). If the problem is on the distant site, the distant site starts this process (the ticket is then re-assigned to the distant site).
1.3 If the problem is related to an underlying layer (L2: dark fibre outage...), the router operator starts the L2 incident management process. The router operator is responsible for managing the trouble with the L2 NOC (opening and following the NOC's ticket...) and stays responsible for the LHCOPN ticket in GGUS.
1.4 Otherwise the router operator owns the problem and contacts the local Grid data contact to report the impact. The distant router operator is also informed.
2 The LHCOPN TTS notifies all impacted sites about the incident.
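The routing of an L3 incident through steps 1.1 to 2 can be sketched in code. This is a minimal illustration only, not part of any LHCOPN tooling: the `Ticket` class, the `handle_l3_incident` function and its parameters are hypothetical names, and no GGUS interface is called.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical stand-in for an LHCOPN TTS (GGUS) ticket."""
    assigned_to: str
    affected_sites: list
    log: list = field(default_factory=list)


def handle_l3_incident(source_site, affected_sites,
                       distant_site=None, l2_cause=False):
    """Walk through steps 1.1 to 2 of the L3 incident process.

    distant_site: set only if the counterpart check (1.2) shows that
                  the problem actually lies on the distant site.
    l2_cause:     True if an underlying L2 problem is suspected (1.3).
    """
    # 1.1 The source site's router operator opens a self-assigned ticket.
    ticket = Ticket(assigned_to=source_site,
                    affected_sites=list(affected_sites))
    ticket.log.append("1.1 ticket opened by " + source_site)

    # 1.2 If the problem is on the distant site, it takes over the ticket.
    if distant_site is not None:
        ticket.assigned_to = distant_site
        ticket.log.append("1.2 ticket re-assigned to " + distant_site)

    if l2_cause:
        # 1.3 Hand over to L2 incident management; the router operator
        # stays responsible for the LHCOPN ticket.
        ticket.log.append("1.3 L2 incident management started")
    else:
        # 1.4 The router operator owns the problem and warns the Grid side.
        ticket.log.append("1.4 Grid data contact warned")

    # 2 The TTS notifies every impacted site.
    ticket.log.append("2 notified: " + ", ".join(ticket.affected_sites))
    return ticket


ticket = handle_l3_incident(
    "DE-KIT",
    ["CH-CERN", "FR-CCIN2P3", "IT-INFN-CNAF", "NL-T1", "DE-KIT"])
print("\n".join(ticket.log))
```

With a localised outage such as the DE-KIT power cut, neither `distant_site` nor `l2_cause` is set, so the sketch goes straight from 1.1 to 1.4 and then to the broadcast in step 2.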

L3 incident management process
[Diagram: source site (router operators, Grid data contact), distant site involved (router operators), LHCOPN TTS (GGUS), affected sites, link to L2 incident management (1.3). Legend: A notifies B; A interacts with B; A reads and writes B; A goes to process B.]

Ticket opening
1.1: A DE-KIT router operator opens a trouble ticket in GGUS

GGUS submit interface

Ticket opened

Other steps
The outage is localised and noticed by the source site
- No need to perform 1.2: contact counterpart on distant site
This is a power cut, not a real L2 problem
- No need to go further on 1.3: L2 incident management process

Grid interaction
1.4: The Grid data contact at DE-KIT is warned about the outage
- GGUS TTid provided
- He will compute the impact on the Grid
- He will warn the Grid

Automatic broadcasting
2: The GGUS TTS will warn all affected sites (CH-CERN, FR-CCIN2P3, IT-INFN-CNAF, NL-T1, DE-KIT)
- This is done when the ticket is submitted

Following/Closure
Incident registration and broadcasting are finished
The DE-KIT router operator is in charge of updating/closing the GGUS ticket
- Affected sites will be notified
The local Grid data contact also has to be warned

History

Conclusion for first use case
Shortcut as the incident is quickly localised
- Otherwise more interactions between sites
Deeply organised around GGUS tickets
- Could be opened by another site and assigned to DE-KIT
- Put status from « assigned » to « in progress » to acknowledge

Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
L2 incident management

Scope
The router operator at UK-T1-RAL noticed that the link is down thanks to their monitoring system
Affected:
- 1 link: CERN-RAL-LHCOPN
- Sites: CH-CERN and UK-T1-RAL
No clear idea of what and where the problem is
- Router down at CH-CERN, fibre cut…

Global problem management process started

Quick investigation
1 - Nothing seems to be occurring on site
2 - Take an overview of the LHCOPN
- The e2emon monitoring system indicates that the L2 link is down in segment “UKERNA”
  Now tracking a fibre cut
- Nothing seems to be registered in GGUS about it
  Unscheduled event = Incident
Going to L2 incident management
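The triage on this slide can be read as a small decision function. The sketch below is illustrative only: `classify_event`, its parameters and its return strings are hypothetical, and in practice the inputs come from the operator's own checks (local site, e2emon, a search in GGUS). Sending a locally found fault to L3 incident management is an assumption, based on the first use case.

```python
def classify_event(local_fault_found, l2_segment_down, registered_in_tts):
    """Return the process to follow when a link is seen down.

    local_fault_found:  something wrong was found on the local site.
    l2_segment_down:    monitoring (e.g. e2emon) reports an L2 segment down.
    registered_in_tts:  a ticket already covers the event, so it is
                        scheduled or already known.
    """
    if registered_in_tts:
        return "follow existing ticket"
    # Nothing registered: the event is unscheduled, hence an incident.
    if local_fault_found:
        return "L3 incident management"
    if l2_segment_down:
        return "L2 incident management"
    return "keep investigating"


# Use case 2: nothing on site, L2 segment down, nothing registered in GGUS.
print(classify_event(False, True, False))
```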

L2 incident management
Scope: Dark fibre outages
1.1 An L2 NOC and a router operator could notice an L2 incident. They will interact to confirm it or not. A router operator could also be warned from the L3 incident management process through an LHCOPN ticket assigned to its site.
1.2 If confirmed, the router operator of a linked site will put a ticket in the LHCOPN TTS. The router operator is in charge of dealing with the L2 network providers involved and of reflecting the ongoing resolution in the LHCOPN TTS.
1.3 It is the responsibility of linked and affected sites to warn their Grid data contact.
2 All impacted sites will be notified by the TTS.
3 If nothing is found at L2, the escalated incident management process is started.
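The L2 incident steps can be sketched in the same way. This is an assumption-laden illustration rather than an implementation of the process: `l2_incident_actions` and its parameters are hypothetical names, and the function only lists the actions in order.

```python
def l2_incident_actions(reporting_site, impacted_sites,
                        confirmed, found_at_l2=True):
    """List the actions of the L2 incident process in order.

    confirmed:    the L2 NOC and the router operator agree there is
                  an L2 incident (step 1.1).
    found_at_l2:  False if the L2 investigation finds nothing (step 3).
    """
    actions = ["1.1 router operator and L2 NOC check the suspected incident"]
    if not confirmed:
        return actions  # nothing to register in the TTS

    # 1.2 The router operator of a linked site registers the incident.
    actions.append("1.2 %s opens a ticket in the LHCOPN TTS" % reporting_site)
    # 1.3 Linked and affected sites warn their own Grid data contact.
    actions.append("1.3 sites warn their Grid data contact")
    # 2 The TTS broadcasts to every impacted site.
    actions.append("2 TTS notifies " + ", ".join(impacted_sites))
    if not found_at_l2:
        # 3 Nothing found at L2: escalate.
        actions.append("3 escalated incident management started")
    return actions


actions = l2_incident_actions("UK-T1-RAL", ["CH-CERN", "UK-T1-RAL"],
                              confirmed=True)
print("\n".join(actions))
```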

L2 incident management process
[Diagram: linked sites (router operators, Grid data contact), L2 NOC, LHCOPN TTS (GGUS), affected sites; entry from the end of L3 incident management, exit to escalated incident management (3). Legend: A notifies B; A interacts with B; A reads and writes B.]

Incident registration
1.1: The router operator at UK-T1-RAL will open a ticket with JANET for the outage
1.2: UK-T1-RAL noticed the outage, so it will open a ticket in GGUS for the LHCOPN community
- Self-assigned because the link is under their responsibility (T0-T1)

GGUS ticket submitted

Broadcasting
1.3: Grid interaction
- Local Grid data contact warned (+ #GGUS-TTid)
2: Other affected sites (CH-CERN) automatically notified by GGUS

Following/Closure
UK-T1-RAL will update the GGUS ticket with information from JANET
- Grid data contact and affected sites are kept updated
The ticket will be closed by UK-T1-RAL

Conclusion for second use-case
Accurate and reliable monitoring is required to really shortcut investigations
Key communication between network provider and customer
- We did not change the way this currently works

New IP prefix for ES-PIC
L3 change management

Scope
ES-PIC has a new IP prefix that must be included within the LHCOPN
Affected:
- All sites: filters to update…
- And monitoring systems

L3 change management
Scope: IP address changes, new prefix propagated, new filtering
The source actors for these changes are router operators.
1.1 The router operator will present the change to its Grid data contact (change in performance, new resiliency possibility...)
1.2 The router operator will present the change to affected sites (e.g. linked sites)
2.1 The change will be fully documented on the global web repository, and related technical information should also be updated
2.2 An informational ticket summarising the change will be put into the LHCOPN TTS. It will contain a link to the full documentation of the change (e.g. URL to the global web repository)
2.3 The L3 monitoring infrastructure may be adapted if needed (new p2p IPs to be watched...)
3 The LHCOPN TTS notifies all impacted sites
4 If the change has an impact, an L3 maintenance management process will be started to commit it. Otherwise the change can be made directly.
If an L3 change impacts L2 (L3 VPN for instance), the L2 change management process should be started.
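The change process separates documenting a change from committing it, which a short sketch makes explicit. Everything named here is hypothetical: `plan_l3_change`, its parameters, and the placeholder URL, which only stands in for an entry in the global web repository.

```python
def plan_l3_change(summary, doc_url, has_impact, impacts_l2=False):
    """List the steps of the L3 change process for one change."""
    steps = [
        "1.1 present change to Grid data contact",
        "1.2 present change to affected sites",
        "2.1 document change in global web repository",
        # 2.2 The informational ticket carries a link to the documentation.
        "2.2 informational ticket: %s (%s)" % (summary, doc_url),
        "2.3 adapt L3 monitoring if needed",
        "3 TTS notifies impacted sites",
    ]
    # 4 Only impacting changes go through a maintenance to be committed.
    if has_impact:
        steps.append("4 start L3 maintenance management to commit")
    else:
        steps.append("4 commit change directly")
    if impacts_l2:
        steps.append("start L2 change management")
    return steps


steps = plan_l3_change("New IP prefix for ES-PIC",
                       "https://example.org/change-entry",  # placeholder URL
                       has_impact=False)
print(steps[-1])
```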

L3 change management
[Diagram: source site (router operators, Grid data contact), linked and affected sites (router operators), global web repository (Twiki), monitoring (2.3), LHCOPN TTS (GGUS) notifying affected sites (3), link to L3 maintenance management (4). Legend: A notifies B; A interacts with B; A reads and writes B.]

Change registration
1.1: The Grid data contact is warned about the change
- Will new hosts benefit from the LHCOPN?
1.2: This change is common and has no deep impact on others
- No need to discuss with impacted sites

Documentation and tool update
2.1:
- The change will be documented in the change management database
- Technical information will be updated

Broadcasting
2.2: An « informational » GGUS ticket will be created
- With link to the change management database entry
- With link to the updated technical information
3: All sites will be notified
3: DANTE Operation + ENOC are put in copy
- New prefixes might also need to be monitored by MDM + ASPDrawer

GGUS submit interface (ENOC)

Summary
[Diagram: ES-PIC (router operators, Grid data contact), global web repository (Twiki: technical information, change management DB), monitoring (2.3: MDM, BGP), LHCOPN TTS (GGUS) notifying (3) all sites, DANTE Operation and ENOC.]

Committing the change (1/2)
The change is documented and advertised but not yet committed
Does the change, or its commitment, have an impact on the existing service?
- No, so no need to commit it within a “true” maintenance

Committing the change (2/2)
The change will be silently implemented by ES-PIC and reported with a GGUS ticket
- Kind: Maintenance L3
- To track implementation + statistics

Conclusion for third use-case
Documenting and implementing are separated
- 2 tickets: Informational & Maintenance
Third-party tools might need to be updated
- MDM, e2emon, ASPDrawer, GGUS…
Lighter process for non-impacting changes

USLHCNET's scheduled power cut for devices in Chicago
L2 maintenance management

Scope (1/2)
USLHCNET will have a power cut in Chicago

Scope (2/2)
Fictional impact:
- US-FNAL-CMS will be fully disconnected

L2 maintenance management
Sources for L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...).
Often there will be no negotiation phase with L2 network providers for L2 maintenance, but if an event is really disturbing, negotiation should be attempted.
1.1 The L2 NOC sends its maintenance notice to connected or affected router operators. The first router operator notified starts this process.
1.2 The router operator warns its Grid data contact (and may check with him that the date is OK).
1.3 The router operator may check with distant affected sites, off the record, that the date is suitable.
1.4 If a disturbing overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.
2 All impacted sites are notified.
3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.
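The date check in steps 1.2 to 1.4 amounts to rejecting proposed dates that overlap a disturbing event and retrying with the next one. The sketch below assumes whole-day granularity and made-up dates. `pick_maintenance_date` and its parameters are hypothetical, and real negotiation with a provider is of course not a simple loop.

```python
from datetime import date


def pick_maintenance_date(provider_dates, site_events):
    """Return the first provider date that avoids known site events.

    provider_dates: dates offered by the L2 network provider, in the
                    order they are proposed (hypothetical input).
    site_events:    dates of disturbing events reported by the Grid data
                    contact or by distant affected sites (steps 1.2, 1.3).
    """
    blocked = set(site_events)
    for proposed in provider_dates:
        # 1.4 An overlap means negotiating another date and restarting.
        if proposed in blocked:
            continue
        return proposed  # this date can be posted in the LHCOPN TTS
    return None  # no suitable date among those offered


# Example dates are invented for illustration only.
chosen = pick_maintenance_date(
    [date(2009, 3, 2), date(2009, 3, 9)],
    [date(2009, 3, 2)])
print(chosen)
```

Returning `None` reflects the remark above that negotiation is often not possible: the sites may then simply have to accept the provider's date.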

L2 maintenance management process
[Diagram: L2 NOC, linked sites (router operators, Grid data contact), LHCOPN TTS (GGUS), affected sites. Legend: A notifies B; A interacts with B; A reads and writes B.]

Registering maintenance (1/2)
1.1: USLHCNET warns at least site US-FNAL-CMS
- Not the Grid, not all LHCOPN sites, etc.
1.2: US-FNAL-CMS will warn its local Grid data contact
- And may check with him that the date is OK
1.3: Ideally also avoid overlap with CH-CERN’s events

Registering maintenance (2/2)
Affected sites:
- US-FNAL-CMS, CH-CERN
- US-FNAL-CMS is responsible for following this event
1.4: A FNAL router operator will put the maintenance into GGUS

GGUS submit interface

Summary
[Diagram: USLHCNET NOC, US-FNAL-CMS (router operators, Grid data contact), linked site CH-CERN (router operators, 1.3), LHCOPN TTS (GGUS) notifying CH-CERN.]

Following
US-FNAL-CMS updates the ticket according to USLHCNET reports
US-FNAL-CMS is in charge of closing the ticket when the maintenance is finished

Ticket handling

Conclusion for fourth use-case
Light process for network providers
- Like what currently happens
- Warn only your customers
- No Grid interaction
Site acts as a relay for information from network providers
- Propagated within the LHCOPN community

Overall conclusion

Overall conclusion (1/2)
Sample provided here
- Many details could be adjusted
Steps for incident management
- Investigate, register, broadcast, follow
Steps for change management
- Document, register, broadcast, commit
Steps for maintenance management
- Register, broadcast, (commit), follow

Overall conclusion (2/2)
Not really different from the current way of carrying out network operations?
- But formalised
Feel free to ask for details on processes
- Propose interesting/embarrassing use-cases
- Everything is/will be on the twiki
GGUS accesses/notifications are indispensable
- The access table is a key thing to be filled in accurately

Questions & discussion