Update on OSG/WLCG Network Services Shawn McKee, Marian Babik 2015 WLCG Collaboration Workshop 12 th April 2015.

Slides:



Advertisements
Similar presentations
® IBM Software Group © 2010 IBM Corporation What’s New in Profiling & Code Coverage RAD V8 April 21, 2011 Kathy Chan
Advertisements

Update on OSG/WLCG perfSONAR infrastructure Shawn McKee, Marian Babik HEPIX Spring Workshop, Oxford 23 rd - 27 th March 2015.
Integrating Network and Transfer Metrics to Optimize Transfer Efficiency and Experiment Workflows Shawn McKee, Marian Babik for the WLCG Network and Transfer.
MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 11 Managing and Monitoring a Windows Server 2008 Network.
PerfSONAR in ATLAS/WLCG Shawn McKee, Marian Babik ATLAS Jamboree / Network Section 3 rd December 2014.
Proximity service Main idea – provide “glue” between experiments and sonar topology – mainly map sonars to storages and vice versa – determine existing.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
Use Cases. Summary Define and understand slow transfers – Identify weak links, narrow down the source – Understand what perfSONAR measurements mean wrt.
Integration Program Update Rob Gardner US ATLAS Tier 3 Workshop OSG All LIGO.
Connect communicate collaborate perfSONAR MDM updates: New interface, new possibilities Domenico Vicinanza perfSONAR MDM Product Manager
Network Monitoring for OSG Shawn McKee/University of Michigan OSG Staff Planning Retreat July 10 th, 2012 July 10 th, 2012.
WG Goals and Workplan We have a charter, we have a group of interested people…what are our plans? goalsOur goals should reflect what we have listed in.
Network and Transfer WG Metrics Area Meeting Shawn McKee, Marian Babik Network and Transfer Metrics Kick-off Meeting 26 h November 2014.
Connect communicate collaborate perfSONAR MDM updates: New interface, new weathermap, towards a complete interoperability Domenico Vicinanza perfSONAR.
Internet2 Performance Update Jeff W. Boote Senior Network Software Engineer Internet2.
Network and Transfer Metrics WG Meeting Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 8 th April 2015.
New perfSonar Dashboard Andy Lake, Tom Wlodek. What is the dashboard? I assume that everybody is familiar with the “old dashboard”:
ASCR/ESnet Network Requirements an Internet2 Perspective 2009 ASCR/ESnet Network Requirements Workshop April 15/16, 2009 Richard Carlson -- Internet2.
The Internet2 HENP Working Group Internet2 Spring Meeting April 9, 2003.
1 Network Measurement Summary ESCC, Feb Joe Metzger ESnet Engineering Group Lawrence Berkeley National Laboratory.
Update on WLCG/OSG perfSONAR Infrastructure Shawn McKee, Marian Babik HEPiX Fall 2015 Meeting at BNL 13 October 2015.
PerfSONAR-PS Functionality February 11 th 2010, APAN 29 – perfSONAR Workshop Jeff Boote, Assistant Director R&D.
Network and Transfer Metrics WG Meeting Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 18 h March 2015.
WLCG perfSONAR-PS Update Shawn McKee/University of Michigan WLCG Network and Transfers Metrics Co-Chair Spring 2014 HEPiX LAPP, Annecy, France May 21 st,
Network awareness and network as a resource (and its integration with WMS) Artem Petrosyan (University of Texas at Arlington) BigPanDA Workshop, CERN,
WLCG Network and Transfer Metrics WG After One Year Shawn McKee, Marian Babik GDB 4 th November
Report from the WLCG Operations and Tools TEG Maria Girone / CERN & Jeff Templon / NIKHEF WLCG Workshop, 19 th May 2012.
PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Update on Network Performance Monitoring.
Network and Transfer WG perfSONAR operations Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 28 h January 2015.
Connect communicate collaborate LHCONE Diagnostic & Monitoring Infrastructure Richard Hughes-Jones DANTE Delivery of Advanced Network Technology to Europe.
PerfSONAR Update Shawn McKee/University of Michigan LHCONE/LHCOPN Meeting Cambridge, UK February 9 th, 2015.
CERN IT Department CH-1211 Geneva 23 Switzerland t A proposal for improving Job Reliability Monitoring GDB 2 nd April 2008.
OSG Networking: Summarizing a New Area in OSG Shawn McKee/University of Michigan Network Planning Meeting Esnet/Internet2/OSG August 23 rd, 2012.
Julia Andreeva on behalf of the MND section MND review.
PerfSONAR for LHCOPN/LHCONE Update Shawn McKee/University of Michigan LHCONE/LHCOPN Meeting Amsterdam, NL October 28 th, 2015.
MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.
Network Awareness and perfSONAR Why we want it. What are the challenges? Where are we going? Shawn McKee / University of Michigan OSG AHG - US CMS Tier-2.
WLCG Latency Mesh Comments + – It can be done, works consistently and already provides useful data – Latency mesh stable, once configured sonars are stable.
FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.
GEMINI: Active Network Measurements Martin Swany, Indiana University.
© 2015 Pittsburgh Supercomputing Center Opening the Black Box Using Web10G to Uncover the Hidden Side of TCP CC PI Meeting Austin, TX September 29, 2015.
Enabling Grids for E-sciencE INFSO-RI Enabling Grids for E-sciencE Gavin McCance GDB – 6 June 2007 FTS 2.0 deployment and testing.
LHCONE Monitoring Thoughts June 14 th, LHCOPN/LHCONE Meeting Jason Zurawski – Research Liaison.
Strawman LHCONE Point to Point Experiment Plan LHCONE meeting Paris, June 17-18, 2013.
Deploying perfSONAR-PS for WLCG: An Overview Shawn McKee/University of Michigan WLCG PS Deployment TF Co-chair Fall 2013 HEPiX Ann Arbor, Michigan October.
Using Check_MK to Monitor perfSONAR Shawn McKee/University of Michigan North American Throughput Meeting March 9 th, 2016.
Maintaining and Updating Windows Server 2008 Lesson 8.
Activities and Perspectives at Armenian Grid site The 6th International Conference "Distributed Computing and Grid- technologies in Science and Education"
Secure Access and Mobility Jason Kunst, Technical Marketing Engineer March 2016 Location Based Services with Mobility Services Engine ISE Location Services.
Campana (CERN-IT/SDC), McKee (Michigan) 16 October 2013 Deployment of a WLCG network monitoring infrastructure based on the perfSONAR-PS technology.
1 Deploying Measurement Systems in ESnet Joint Techs, Feb Joseph Metzger ESnet Engineering Group Lawrence Berkeley National Laboratory.
OSG Production Foundations for 2M+ Hours/Day April 9, 2014 Rob Quick With Help from Shawn McKee and Chander Seghal.
1 Network Measurement Challenges LHC E2E Network Research Meeting October 25 th 2006 Joe Metzger Version 1.1.
PerfSONAR operations meeting 3 rd October Agenda Propose changes to the current operations of perfSONAR Discuss current and future deployment model.
SQL Database Management
Shawn McKee, Marian Babik for the
WLCG Network Discussion
Report from WLCG Workshop 2017: WLCG Network Requirements GDB - CERN 12th of July 2017
perfSONAR-PS Deployment: Status/Plans
LHCOPN/LHCONE perfSONAR Update
Networking for the Future of Science
LHCOPN/LHCONE perfSONAR Update
Monitoring the US ATLAS Network Infrastructure with perfSONAR-PS
Shawn McKee/University of Michigan ATLAS Technical Interchange Meeting
Alerting/Notifications (MadAlert)
Deployment & Advanced Regular Testing Strategies
Network Monitoring Update: June 14, 2017 Shawn McKee
Presentation transcript:

Update on OSG/WLCG Network Services Shawn McKee, Marian Babik 2015 WLCG Collaboration Workshop 12 th April 2015

Goals: – Find and isolate “network” problems; alerting in time – Characterize network use (base-lining) – Provide a source of network metrics for higher level services perfSONAR Choice of a standard open source tool: perfSONAR – Benefiting from the R&E community consensus Tasks achieved: – Monitoring in place to create a baseline of the current situation between sites – Continuous measurements to track the network, alerting on problems as they develop – Developed test coverage and made it possible to run “on-demand” tests to quickly isolate problems and identify problematic links 2 Network Monitoring in WLCG/OSG WLCG Workshop Okinawa

3 perfSONAR Deployment 259 perfSONAR instances registered in GOCDB/OIM 233 Active perfSONAR instances 171 Running latest version (3.4.2) Initial deployment coordinated by WLCG perfSONAR TF Commissioning of the network followed by WLCG Network and Transfer Metrics WG WLCG Workshop Okinawa

perfSONAR The WLCG deployment scale and monitoring has helped identify issues in the perfSONAR toolkit – Version 3.3.x had numerous issues including security problems in the underlying OS and tests that wouldn’t consistently run – Version 3.4 fixed all known issues we identified in 3.3 but introduced a few new problems: excessive logging, more memory use, regular tests stopping and a few minor bugs – Current version has addressed all the new issues we know about. See release notes at We track deployment status daily at monitoring.cern.ch/perfsonar_report.txthttp://grid- monitoring.cern.ch/perfsonar_report.txt WLCG Workshop Okinawa4 perfSONAR Current Release

Tests are organized in meshes – set of instances that test to each other perfSONAR perfSONAR regular tests currently configured: – Traceroute: End to end path, important to understand context of other metrics (Full WLCG mesh/hour) – Throughput: Notice problems and debug network, also help differentiate server problems from path problems (Full WLCG mesh/week) – Latency: Notice route changes, asymmetric routes and watch for excessive Packet Loss (Regional meshes, Continuous, 10Hz) perfSONAR perfSONAR is a testing framework, new tests and tools can be integrated as they become available – From iperf to iperf3, traceroute to tracepath Dynamic reconfigurations now possible – Creation, modification of meshes – Test frequency and parameters Additional perfSONAR nodes inside local network, and/or at periphery still needed (on LHCONE: MANLAN, WIX, GEANT) – Characterize local performance and internal packet loss – Separate WAN performance from internal performance 5 perfSONAR Metrics and Meshes WLCG Workshop Okinawa

6 Configuration Interface perfSONAR perfSONAR instance can participate in more than one mesh perfSONAR Configuration interface and auto-URL enables dynamic re- configuration of the entire perfSONAR network perfSONAR perfSONAR instance only needs to be configured once, with an auto-URL ( tname/ ) WLCG Workshop Okinawa

7 Configuration Interface WLCG Workshop Okinawa OSG and the perfSONAR developers are working on creating a standalone “mesh-config” package to allow users and sites to manage and maintain their own meshes

WLCG Workshop Okinawa8 MaDDash for Metrics Visualization

WLCG Workshop Okinawa9 OMD/Check_MK Service Monitoring We are using OMD & Check_MK to monitor our perfSONAR hosts and services. Provides useful overview of status/problems [Requires x509 in your browser]

WLCG Workshop Okinawa10 Detailed Service Checks

WLCG Workshop Okinawa11 OSG Network Datastore A critical component is the datastore to organize and store the network metrics and associated metadata q OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances q This data will be available via an API, must be visualized and must be organized to provide the “OSG Networking Service” q Operating now q Targeting a production service by mid-summer

Problem with transfers from SARA to AGLT2 noted February 20 th. – FTS transfers failing. ATLAS asked about network. Dataset had large files; transfers failed because of timeout (1 MB/sec=3.6GB/hour; 1 hour timeout) – Setup ‘Debug’ mesh using OSG tools to track SARA, CERN to AGLT2, MWT2 – Jason Zurawski/ESnet ran tests using perfSONAR. Bad throughput ~few hundred kbits/sec SARA->AGLT2 (real => network issue!) WLCG Workshop Okinawa12 Network Problem Debugging(1/4)

Ticket opened by AGLT2 with Internet2 NOC on February 20th – Poor iperf results also between SARA and MWT2 – Routes provided both ways by perfSONAR WLCG Workshop Okinawa13 Network Debugging (2/4)

Internet2 initially pursuing asymmetric routes and link congestion in US I2 opened ticket with GEANT Feb 27 GEANT brought up LHCONE perfSONAR in Frankfurt – Tests to SARA(close) and AGLT2(far) showed 3x better bandwidth to AGLT2. Problem close to SARA – Isolated link between SARA and GEANT March 4 WLCG Workshop Okinawa14 Network Debugging (3/4) SARA noticed problems once we highlighted the link

2 weeks of link-debugging Problem identified and fixed Mar 20 – Bandwidth “policing” for LHCONE. More than a year ago, LHCONE setup 1 Gbps. Never enforced. – Turned on policing start of February. – Changed BW 10Gbps March 20…fixed WLCG Workshop Okinawa15 Problem Found/Fixed – Mar 20 Packet Loss (%) Throughput MB/s 8 Gbps perfSONAR Throughput SARA to AGLT2 (1 month) perfSONAR Packet Loss SARA to AGLT2 (1 month) 300 Mbps 1% loss

What kinds of capabilities can we enable given a rich datastore of historical and current network metrics? – Users want "someone" to tell them when there is a network problem involving their site or their workflow. – Can we create a framework to identify when network problems occur and locate them? (Must minimize the false-positives). WLCG Workshop Okinawa16 OSG Network Alerting Targeting this in the PuNDIT OSG satellite project See PuNDIT Poster event/304944/contribu tion/408 [#8 Poster in booth 24 Session B] event/304944/contribu tion/408

Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents? Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate? Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology- based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks. WLCG Workshop Okinawa17 Understanding Network Topology

Most scientists just care about the end-to-end results: – How well does their infrastructure support them in doing their science? Network metrics allow us to differentiate end-site issues from network issues. There is an opportunity to do this better by having access to end-to-end metrics to compare & contrast with network-specific metrics. – What end-to-end data can WLCG/OSG regularly collect for such a purpose? WG is soliciting input here… – Is there some kind of common instrumentation that can be added to some data-transfer tools? (NetLogger in GridFTP, having transfers "report" results to the nearest perfSONAR- PS instance?, etc) – FTS data-analysis project is underway (adds pS metrics) WLCG Workshop Okinawa18 WLCG Networking and End-to-end

19 Proposed Datastore Extension WLCG Workshop Okinawa Being implemented by Henryk Giemza / LHCb Goal: easy access to specific net metrics

perfSONAR widely deployed and already showing benefits in troubleshooting network issues – Additional deployments by R&E networks still needed Significant progress in configuration and infrastructure monitoring – Helping to reach full potential of the perfSONAR deployment OSG datastore – community network data store for all perfSONAR metrics – planned to enter production in Q3 Integration projects underway to aggregate network and transfer metrics – FTS Performance – Experiment’s interface to perfSONAR Advanced network monitoring - diagnosis and alerts based on perfSONAR, developed within NSF funded PuNDIT project Questions? 20 Closing remarks WLCG Workshop Okinawa

Network Documentation OSG OSG Deployment documentation for OSG and WLCG hosted in OSG New 3.4 MA guide ps/wiki/MeasurementArchiveClientGuidehttps://code.google.com/p/perfsonar- ps/wiki/MeasurementArchiveClientGuide Modular Dashboard and OMD Prototypes – OSG Production instances for OMD, MaDDash and Datastore – – – Mesh-config in OSG Use-cases document for experiments and middleware dkPQTQic0VbH1mc/edit dkPQTQic0VbH1mc/edit WLCG Workshop Okinawa21 References