LHCOPN Operations: Yearly review

Slides:



Advertisements
Similar presentations
VoipNow Core Solution capabilities and business value.
Advertisements

Chapter 1 Introduction to Databases
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Network trouble ticket standardisation -
INFSO-RI Enabling Grids for E-sciencE Incident Response Policies and Procedures Carlos Fuentes
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN Operations update Guillaume Cessieux.
What if you suspect a security incident or software vulnerability? What if you suspect a security incident at your site? DON’T PANIC Immediately inform:
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN Ops WG Act 4 – Conclusion Guillaume.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN operations Presentation and training.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN Ops WG Act 5 Guillaume Cessieux (CNRS/IN2P3-CC,
EGEE-III Enabling Grids for E-sciencE EGEE and gLite are registered trademarks 2008 report on LHCOPN from ASPDrawer
LHCOPN operational working group Guillaume Cessieux (CNRS/FR-CCIN2P3 – EGEE SA2) third meeting CERN – December th, 2008
LHCOPN operational working group report Guillaume Cessieux (FR-CCIN2P3 / EGEE-SA2) on behalf of the Ops WG LHCOPN meeting, , Copenhagen.
TeraGrid Operations Overview Mike Pingleton NCSA TeraGrid Operations December 2 nd, 2004.
PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.
Network infrastructure at FR-CCIN2P3 Guillaume Cessieux – CCIN2P3 network team Guillaume. cc.in2p3.fr On behalf of CCIN2P3 network team LHCOPN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks ENOC - Status and plans Guillaume Cessieux.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Standard network trouble tickets exchange.
LHCOPN: Operations status LHCOPN: Operations status cc.in2p3.fr Network team, FR-CCIN2P3 LHCOPN meeting, Barcelona,
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks SA1 & SA2-ENOC Interactions status and plans.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN Operations WS: Introduction & Objectives.
LHCOPN operational model - 4 use-cases Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG LHCOPN meeting, , Berlin.
David Foster, CERN GDB Meeting April 2008 GDB Meeting April 2008 LHCOPN Status and Plans A lot more detail at:
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN Operational model: Roles and functions.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN operations Presentation and training.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks LHCOPN operations Presentation and training.
A Computing Tier 2 Node Eric Fede – LAPP/IN2P3. 2 Eric Fede – 1st Chinese-French Workshop Plan What is a Tier 2 –Context and definition To be a Tier 2.
LHCOPN operational model Guillaume Cessieux (CNRS/FR-CCIN2P3, EGEE SA2) On behalf of the LHCOPN Ops WG GDB CERN – November 12 th, 2008.
Monitoring BOF, 23 rd Jan 2007 Grid Service Monitoring Working Group Monitoring WG BOF, January 2007 James Casey/Ian Neilson.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operating an Optical Private Network: the.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI COD activity in EGI-InSPIRE Marcin Radecki CYFRONET, Poland & COD Team 9/29/2016.
LHCOPN operational handbook Documenting processes & procedures Presented by Guillaume Cessieux (CNRS/IN2P3-CC) on behalf of CERN & EGEE-SA2 LHCOPN meeting,
AMI to SmartGrid “DATA”
WLCG Network Discussion
Adam Backman Chief Cat Wrangler – White Star Software
Maintaining Windows Server 2008 File Services
Incident Management Incident Management.
CERN Openlab: my transition from science to business
LCG/EGEE Incident Response Planning
1 VO User Team Alarm Total ALICE 1 2 ATLAS CMS 4 LHCb 20
How to enable computing
The Systems Engineering Context
Marketing Your IT Strategic Planning Process: Relationship Building with Business Stakeholders Fred Mapp EFM April 10, 2013.
GOCDB current status and plans
Proposal for GOCDB workload management
WLCG Service Interventions
Update from the HEPiX IPv6 WG
‘s tools targeted to be useful for COD activity
Thanks to everyone for attending!!
LCG Operations Workshop, e-IRG Workshop
Chapter 1 The Systems Development Environment
HP Quality Center 10 Hottest Features and Project Harmonization
Change Assurance Dashboard
Benoît DAUDIN (GS-AIS-PM – CERN) 22-March-2012
Outline Announcements Fault Tolerance.
Instructional Learning Cycle:
Using outcomes data for program improvement
Engaging Your Stakeholders and Making the Most of Your Team
Twin Cities Business Architecture Forum 1/19/2016
Coordinate Operations Standard
Week Ten – IT Audit Reporting
Chapter 5 Identifying and Selecting Systems Development Projects
Critical Element: PBIS Team
Project Management How to access the power of projects!
Network performance issues recently raised at IN2P3-CC
Health Service R&D permissions
CORE 3: Unit 3 - Part D Change depends on…
FrAmework for Multi-agency Environments
Global One Communications
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
International Standards and Contemporary Technologies,
Presentation transcript:

LHCOPN Operations: Yearly review Guillaume.Cessieux @ cc.in2p3.fr Network team, FR-CCIN2P3 LHCOPN meeting, Lyon, 2011-02-11

Tickets per month AVG 19 tickets/month 231 tickets in 2010 GCX LHCOPN meeting, Lyon, 2011-02-11

LHCOPN tickets’ ownership 34% of tickets GCX LHCOPN meeting, Lyon, 2011-02-11

Breakdown per category and kind of problem 80% carrier issues 80% carrier issues GCX LHCOPN meeting, Lyon, 2011-02-11

Service impact reported At least 67% of events have no service impact GCX LHCOPN meeting, Lyon, 2011-02-11

Top 10 links Ids involved in tickets GCX LHCOPN meeting, Lyon, 2011-02-11

Top 10 link Ids involved in L2 incident tickets GCX LHCOPN meeting, Lyon, 2011-02-11

Excellent redundancy and diversity Situation in simultaneously loosing 12 previous links GCX LHCOPN meeting, Lyon, 2011-02-11

Correlation Monitoring / Operations 77% of events have no related GGUS tickets GCX LHCOPN meeting, Lyon, 2011-02-11

Focus on useful events Service impacting events None Now Target Service impacting events None service impacting events (backup link down...) None service impacting events (backup link down...) Events Reported and managed in the TTS Events Reported and managed in the TTS ~30% ~70% Ensuring, showing and tracking restoration of redundancy GCX LHCOPN meeting, Lyon, 2011-02-11

Attendance to the quarterly LHCOPN Ops phoneconf GCX LHCOPN meeting, Lyon, 2011-02-11

LHCOPN operations: Status Good Process and tools implemented and agreed Clear improvement on documenting Bad Hard to push for administrative issues (update doc etc.) Redundancy prevents us from regularly practising No evidence of faults giving feeling of unnecessary actions Can’t clearly assess work done by sites due to lack of monitoring Can’t correlate service impacting events vs managed events Can’t assess, can’t improve Backup tests (4 sites reported something in 2010) Change management DB (9 entries related to 3 sites) Interactions with WLCG GCX LHCOPN meeting, Lyon, 2011-02-11

Role of LHCOPN helpdesk T0 Network team T1 T1 LHCOPN TTS (GGUS) Network team Network team T1 Network team Centralise information Unify communications GCX LHCOPN meeting, Lyon, 2011-02-11

Problem for LHCOPN network support Site LHCOPN TTS (GGUS) Network team Storage team Site Network team WLCG TTS (GGUS) Experiments Storage team GCX LHCOPN meeting, Lyon, 2011-02-11

How/why we ended here? Two information repositories but two different goals Problem solving vs coordinating network teams Backup link down, informational/changes tickets, routing issue etc. We ended with 230 tickets/year for this... Scheduled events vs users’ enquiries Lot of LHCOPN tickets not of interest for WLCG “No service impacting event, no Grid problem, no need for a WLCG tickets” Standard GGUS not tailored for network support Particularly multi-sites notification scheme? Clear weaknesses for user support But disturb everything for 4 enquiries/year? Was assumed we can link LHCOPN TTS and WLCG TTS Who are users of the LHCOPN? Was said only storage teams on sites Network teams did not want direct exposure in WLCG TTS Only accepting enquiries from local teams or remote network teams Only local storage team can state if there is a network problem or not GCX LHCOPN meeting, Lyon, 2011-02-11

Why something specific for the LHCOPN? (1/2) Not so specific processes, just clear implementation of usual processes for a delimited and dedicated network Can’t we handle generic IP issues in the same way? Same concepts sound applicable Project ↔ On site Grid related teams (storage...) ↔ local network team Generic IP issues or ... LHCONE issues? Careful scaling required: Point to point vs any to any; 12 sites vs 300 GCX LHCOPN meeting, Lyon, 2011-02-11

Why something specific for the LHCOPN? (2/2) Two important points Should sites’ network teams be directly acting in WLCG TTS for generic issues, or should information be relayed by some other teams (Storage, Grid, support, etc.) ? What kind of issue are we discussing? Expected link cut or complex performance issues? Previously agreed: Clear demarcation point for network teams = iperf test working We learnt that solving complex issue need concurrent involvement from a LOT of supporters As it is for scheduled network downtimes: Only resource managers talk to projects Network teams did not want to duplicate actions for several projects Generic networks = generic processes not focused on WLCG Handle non dedicated networks as a generic resource like electricity ? Ownership of issues has to be very clear Who is in charge of a London – St Petersburg issue between two Tiers 2? Can this be really pre-determinated? Maybe enable transfer of responsibility if default assignment is not good GCX LHCOPN meeting, Lyon, 2011-02-11

Summary on LHCOPN network support Problems We are not doing network support No clear ownership or process for network issues appearing in WLCG TTS Generic IP vs LHCOPN Our helpdesk is particular, isolated and restricted Possible solutions Use only WLCG TTS Could it make us fully happy? Which changes are really required? Make a clear and strong bridge between the two helpdesks Initially envisioned features like “Linking tickets” etc. not sufficient Need real cross helpdesk interactions Make transparent the two helpdesks keeping specificities When something turns to be a LHCOPN issue transforms the ticket in a LHCOPN ticket and allow a wide range of supporter to act into Otherwise keep things as they currently are GCX LHCOPN meeting, Lyon, 2011-02-11

Infrastructure quality hides Ops weaknesses Conclusion Infrastructure quality hides Ops weaknesses LHCOPN operations need improvements Two key issues Network monitoring Preventing improvement process User support Communication issues between two worlds GCX LHCOPN meeting, Lyon, 2011-02-11