EGI Network Support task force: Proposal for the identified use cases


Mario Reale, IGI / GARR, mario.reale@garr.it
January 24, 2011, EGI OMB f2f meeting, Amsterdam
EGI.eu

Overview
Description of what we propose for each of the identified use cases:
- GGUS
- EGI PERT
- Network-related Scheduled Maintenances
- Troubleshooting on demand
- End-to-end multi-domain monitoring
- DownCollector
- Policy and Cooperation

1. GGUS based Network Support workflow

GGUS
- Reference tools: GGUS / Network Support Unit
- Additional network monitoring and troubleshooting tools will be brought in by the parties involved (NRENs, NOCs, PERTs, ...)
- Proposal: implement a workflow for handling network tickets based on GGUS
  - The workflow does not foresee a permanent EGI Network Support Team for this GGUS unit, since the great majority of NGIs are against it

Proposed GGUS workflow 1/2
- A user belonging to a given Virtual Organization (VO) experiences poor performance or repeated failures while transferring data from site A to site B (A -> B)
- Simple first-level debugging is assumed to be carried out at the user level, possibly involving some VO support
  - The aim is to exclude "trivial" issues (software, SE down, ...)
  - Basic troubleshooting can be provided by the troubleshooting on-demand tool
  - Also check whether monitoring data are available
- A network ticket is then opened in GGUS describing the problem

Proposed GGUS workflow 2/2
- The network ticket is assigned automatically to the site administrators of Site A and Site B
  - Both are responsible for handling the ticket
  - However, only one person should be accountable for the ticket: Site-A (A -> B: originator of the data transfer)
  - For a transfer from a User Interface node to Site-X, the ticket is assigned to the Site-X site administrator, after first basic debugging by the user
- Site administrators handle contacting the NREN contacts (APM, NOC)
  - They inform their NGI Operations Centers
  - They should first contact the local campus network admins and the local NREN APM
- NRENs will handle it using their PERT team, APM, NOCs and experts, possibly involving the DANTE/GEANT NOC and the Federated EduPERT
- NOCs further involve TELCO operators if/when required, according to their workflows
- At each step the ticket originator, the VO responsible persons and the site admins are kept informed until the issue is solved
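The assignment rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule as stated: the names `Assignment` and `assign_network_ticket` are hypothetical and do not correspond to any GGUS interface.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Assignment:
    responsible: Tuple[str, ...]  # sites whose administrators receive the ticket
    accountable: str              # the single site answerable for the ticket


def assign_network_ticket(source: str, destination: str,
                          source_is_ui: bool = False) -> Assignment:
    """Apply the proposed assignment rule (hypothetical helper)."""
    if source_is_ui:
        # User Interface node -> Site-X: only the Site-X administrator is assigned.
        return Assignment(responsible=(destination,), accountable=destination)
    # Site A -> Site B: both sites are responsible; the originator (A) is accountable.
    return Assignment(responsible=(source, destination), accountable=source)


if __name__ == "__main__":
    print(assign_network_ticket("Site-A", "Site-B"))
    print(assign_network_ticket("UI", "Site-X", source_is_ui=True))
```

Separating "responsible" from "accountable" in the data structure mirrors the distinction the slide draws between the two sites handling the ticket and the one site answerable for it.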

Network Support workflow
[Flowchart, read from top to bottom]
- Ticket posted (after initial debugging by the user has failed); NGI Operations Center / COD informed
- Site A & Site B Grid site administrators (responsible for the ticket processing) try to fix it, for an A to B transfer; VO/VRC application expert and campus network admin involved
  - Solved? YES: problem fixed, ticket closed. NO: escalate
- NREN A & NREN B NOCs try to fix it; NREN A & NREN B PERTs and local APMs involved
  - Solved? YES: problem fixed, ticket closed. NO: escalate (Site A & B are informed)
- GEANT NOC tries to fix it; Federated PERT and GEANT APMs involved
  - Solved? YES: problem fixed, ticket closed. NO: escalate
- Other actors (TELCO NOCs and operations, ...)
- Problem fixed. Ticket closed
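The escalation chain in this flowchart can be written as a simple loop. This is a minimal sketch: `ESCALATION_LEVELS` copies the levels from the slide, while the `try_fix` callback and the log wording are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# (level that tries to fix the problem, additional parties involved at that level)
ESCALATION_LEVELS: List[Tuple[str, str]] = [
    ("Site A & Site B Grid site administrators",
     "VO/VRC application expert, campus network admin"),
    ("NREN A & NREN B NOCs", "NREN A & NREN B PERTs, local APMs"),
    ("GEANT NOC", "Federated PERT, GEANT APMs"),
    ("Other actors (TELCO NOCs and operations)", ""),
]


def escalate(try_fix: Callable[[str], bool]) -> Tuple[Optional[str], List[str]]:
    """Walk the levels in order until one reports the problem as solved.

    try_fix is a hypothetical callback: it receives the level name and
    returns True if that level fixed the problem.
    Returns (name of the solving level or None, log of the steps taken).
    """
    log = ["Ticket posted; NGI Operations Center / COD informed"]
    for level, helpers in ESCALATION_LEVELS:
        step = f"{level}: trying to fix"
        if helpers:
            step += f" (involving {helpers})"
        log.append(step)
        if try_fix(level):
            log.append("Problem fixed. Ticket closed")
            return level, log
        log.append("Not solved: escalating")
    return None, log


if __name__ == "__main__":
    solver, steps = escalate(lambda level: level.startswith("NREN"))
    print(solver)
    print("\n".join(steps))
```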

Observations
- Coordination between Site A and Site B is assumed, at least until a clear domain of competence for the problem/bottleneck is identified
  - Site A digging into the issue much more than Site B should be avoided
- Grid-to-network domain crossing is the responsibility of the Grid site administrators
  - They contact the local campus admins, local NREN APMs and/or NOC
- Many parties are involved (subsequently holding the token), but from the point of view of the workflow to be implemented, GGUS just assigns the network ticket to the Site-A and Site-B site administrators (A -> B) or to the Site-X admin (UI -> X)
  - They will have to follow up the ticket
  - Ultimately, the Site-A Grid admin is responsible (starting point of the data transfer)
- Whether NRENs prefer Grid site admins to contact their NOC or the local APM first is still to be clarified (NREN questionnaire)

Involved Actors
[Diagram: the path of an A -> B transfer across LAN A, Campus A, NREN A, the backbone (GEANT/DANTE, TEIN3, ORIENT, SEEREN2, ALICE2/RedCLARA, EUMEDCONNECT2, TELCOs), NREN B, Campus B and LAN B. Actors shown at each end: user, VO experts, Grid site admin, campus network support, NREN APM and NOC. Labels mark who is responsible and who is accountable for the ticket.]

Problem Solving: Actors Stack
[Diagram: actors stacked by domain, ordered by functional/geographical distance from the user. Each level informs the next, hands over to it, or requests its support.]
- Applications: user; VO/VRC applications support/experts
- Grid: Grid site administrator; NGI Operation Center; COD
- Network: campus network administrator; NREN local APM; NREN NOC; NREN-TELCO operators; NREN PERT; GEANT APM; GEANT NOC (DANTE); Federated PERT
- TELCO operators

2. EGI PERT

EGI PERT
- At this stage we feel there was not enough consensus among NGIs to establish a permanent EGI PERT team providing both
  - Grid middleware/applications expertise
  - PERT networking expertise
- PERTs will contribute to the general Network Support workflow
  - We propose to leave the involvement of PERT teams to the NRENs and the GEANT NOC
- Our proposal is to provide a web contact point (web page) where EGI users and site administrators can find information about general PERT issues and basic procedures, and about how to reach the PERT teams of the NRENs and the Federated PERT if required
  - Gather relevant PERT contact information in one location
  - Provide a basic web guide for common PERT-related issues (examples: how do I set the TCP window size on my machine? How do I check that my machine is not blocking a port that Grid middleware depends on?)
  - Point to the EduPERT knowledge database
- The general procedure for direct involvement of PERT teams should, however, fall within the scope of the general GGUS-based workflow
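As an illustration of the kind of basic check such a web guide might cover, the sketch below uses only the Python standard library. It reads the default per-socket buffer sizes, which bound the TCP window an application can use (system-wide limits are configured at the operating-system level and are not covered here), and tests whether a TCP port accepts connections. The function names are hypothetical, and the host and port are parameters supplied by the caller; no specific Grid middleware port is assumed.

```python
import socket
from typing import Dict


def socket_buffer_sizes() -> Dict[str, int]:
    """Return the default send/receive buffer sizes (bytes) of a new TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return {
            "receive": s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            "send": s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        }


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print(socket_buffer_sizes())
```

A failed connection does not by itself say where the port is blocked (local firewall, campus network or remote site), so a result of False is a starting point for the workflow rather than a diagnosis.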

3. Network-related Scheduled Maintenances

Network-related Scheduled Maintenances
- Reference tool: currently there is no specific tool for network-related maintenances
  - GOC DB covers Grid-related maintenances
- No tool warns users about network-related scheduled maintenances
  - Locally, Grid site administrators, warned by the NREN APM of a possible unavailability, can post the unavailability of their sites/services
  - This relies on local coordination between the APM and the Grid site admin
  - No automation

Network-related Scheduled Maintenances
What can be envisaged for network-related scheduled maintenances:
- NRENs coordinate with the corresponding NGIs to establish a mapping between network devices/PoPs and the directly impacted Grid resource centers
- NGIs set up a tool implementing a mapping between Grid resource centers/services and the users to be informed
- NGIs and NRENs coordinate so that when a network device/PoP is subject to a scheduled maintenance that impacts a Grid resource center/service, the NREN informs the NGI, and the NGI informs EGI.eu Operations and the users
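The two mappings described on this slide chain naturally: a PoP maps to resource centers, and each resource center maps to the users to be informed. The sketch below shows that lookup under invented data; the PoP, site and VO names are placeholders, and a real tool would obtain both mappings from the NRENs and NGIs.

```python
from typing import Dict, List, Set

# Hypothetical example data, for illustration only.
POP_TO_RESOURCE_CENTERS: Dict[str, List[str]] = {
    "pop-1": ["SITE-A", "SITE-B"],
    "pop-2": ["SITE-C"],
}
RESOURCE_CENTER_TO_USERS: Dict[str, List[str]] = {
    "SITE-A": ["vo-alpha"],
    "SITE-B": ["vo-alpha", "vo-beta"],
    "SITE-C": ["vo-gamma"],
}


def impacted_by_maintenance(pop: str) -> Dict[str, Set[str]]:
    """Return the resource centers and users impacted by a maintenance on a PoP."""
    centers = POP_TO_RESOURCE_CENTERS.get(pop, [])
    users: Set[str] = set()
    for center in centers:
        users.update(RESOURCE_CENTER_TO_USERS.get(center, []))
    return {"resource_centers": set(centers), "users": users}


if __name__ == "__main__":
    print(impacted_by_maintenance("pop-1"))
```

Keeping the two mappings separate reflects the split of responsibilities on the slide: the first is maintained jointly by NRENs and NGIs, the second by the NGIs alone.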

Network-related Scheduled Maintenances
- Today things are left to the goodwill of local APMs and Grid site managers, and to their coordination
- A higher-level workflow should be put in place to address this issue systematically
- This proposal still needs substantial elaboration, in close coordination with the OTAG, JRA1 and EGI operations

4. Network Troubleshooting on Demand

Troubleshooting on demand
- Based on the experience gained in EGEE SA2, the French NGI has initiated and developed a new tool called HINTS
  - HINTS has been developed on a voluntary basis by UREC CNRS
  - It is flexible, and based on PerfSONAR web services and protocols
- The Task Force proposes this tool for network troubleshooting
  - Presentation later today

5. End-to-end multi-domain monitoring

E2E multi-domain monitoring
- The reference tool we propose for e2e multi-domain monitoring is PerfSONAR
  - Many NRENs are familiar with it
  - Long-term development by many key organizations and projects in both Europe and America
  - GEANT project
  - The Spanish NREN RedIRIS developed a customized version of PerfSONAR on a live CD for e2e measurements
  - The PerfSONAR team is presenting its tools and use cases today
- The NetJobs tool has been developed by CNRS and GARR to perform basic network monitoring measurements using Grid jobs
  - No need for local deployment
  - Presentations today

6. DownCollector

DownCollector
- The DownCollector is currently in use within EGI-InSPIRE TSA1.4
- Our proposal is to improve the packaging, installation and configuration of the tool, to ease the creation of new instances for the NGIs willing to deploy one; this could be achieved with reasonable effort
- In the longer term an integrated system could be built, gathering information from the various distributed instances
  - Building a mesh (service at site X reached by site Y)
  - This has to be discussed further with TSA1.4, JRA1 and OTAG
  - However, it cannot be endorsed immediately, given the available manpower, the responses to the questionnaire and the pending discussions with the relevant tasks/teams
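To make the mesh idea concrete (service at site X reached by site Y), the sketch below aggregates reachability reports from several observers into a per-target view. It does not describe how DownCollector works or stores its data; the report format, function names and site names are all assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# One report: (observer site, target site, target reachable from observer?)
Report = Tuple[str, str, bool]


def build_mesh(reports: Iterable[Report]) -> Dict[str, Dict[str, bool]]:
    """Group reports as mesh[target][observer] = reachable."""
    mesh: Dict[str, Dict[str, bool]] = {}
    for observer, target, reachable in reports:
        mesh.setdefault(target, {})[observer] = reachable
    return mesh


def unreachable_from_all(mesh: Dict[str, Dict[str, bool]]) -> List[str]:
    """Targets that no observer could reach: likely down rather than a path problem."""
    return sorted(t for t, views in mesh.items()
                  if views and not any(views.values()))


if __name__ == "__main__":
    sample = [
        ("SITE-Y", "SITE-X", True),
        ("SITE-Z", "SITE-X", False),
        ("SITE-Y", "SITE-W", False),
        ("SITE-Z", "SITE-W", False),
    ]
    mesh = build_mesh(sample)
    print(mesh)
    print(unreachable_from_all(mesh))
```

A target that is unreachable from only some observers (SITE-X in the sample) points to a problem on a specific path, which is the case the GGUS-based network workflow is meant to handle.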

7. Policy and Cooperation

Policy and Cooperation
- The majority of NGIs are against establishing a permanent EGI Network Support body for policy and cooperation
  - Some of them are very much in favor, though
- We have not yet elaborated a sufficiently structured proposal
  - We have only identified fields for cooperation
- At this stage, our proposal is to invite volunteering NGIs to join the Network Support coordination activity (within EGI-InSPIRE TSA1.7) to discuss this issue further and to elaborate a plan ensuring that the issues are tackled within TSA1.7

General Issues
- The proposed GGUS workflow for network support assumes that some basic network checks are done at the user level
  -> We should ensure users are familiar with basic network debugging operations and the tools we are providing
  - Users should try to refer to their site administrators first
- Site administrators have to deal with network-related issues; they are likely to have at least basic network know-how
  -> Should we foresee training on the tools we want to provide? A guide for users and site administrators on network support and the related debugging procedures/tools?

General Issues
- For some use cases we should find out more about the NREN-NGI interaction
- A questionnaire for NRENs about Grids and NGIs has to be organized, aimed at clarifying:
  - What are the current NREN-NGI communication channels (especially for NGI operations)?
  - How feasible is it to set up a global system that automatically advertises network-related scheduled maintenances and incidents to users and site admins?
  - Which multi-domain tools are NRENs currently using and familiar with?

General Issues
- Network-related scheduled maintenances require further analysis to refine our proposal and to design/identify the corresponding tools
  - We need to acquire more information from NRENs
  - We also need more internal discussion
- Within the GGUS-based workflow, we should find out / decide whether NRENs prefer Grid site admins to contact their local APMs or their NOC first
  - This is also to be asked of the NRENs

References
- EGI Network Support coordination: https://wiki.egi.eu/wiki/NST
- PERT: http://edupert.geant.net/index.html
- Federated PERT Knowledge Database: http://kb.pert.geant.net/PERTKB/WebHome