Network Monitoring Update: June 14, 2017 Shawn McKee

Presentation transcript:

OSG Area Coordinators. Network Monitoring Update: June 14, 2017. Shawn McKee. 6/14/2017 Shawn McKee - OSG Networking

OSG Network Services Update
Since the last meeting, all VMs running OSG network services have been updated to the most recent perfSONAR v4 RPMs. The PostgreSQL DB (part of ESmond) was also updated from 9.4 to 9.6. No significant issues were discovered, and all services are running well after the update.

Virtualization Resource Balancing
Once the software on all VMs was updated, we let them run for a few days to get a baseline of resource use. In a number of cases the CPU or RAM allocated to a VM did not match what it needed. Fortunately the mismatches roughly balanced out: systems needing more CPUs could use the ones released by others, and similarly for memory. The check_mk monitoring was really important for identifying the right resource requirements. One small tweak (adding back some “stolen” memory) was needed on a couple of VMs. The system now seems balanced for the current workload; we have to watch it over time and adjust as required.
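The balancing described above can be sketched in a few lines of Python. This is only an illustration: the VM names and CPU counts are hypothetical, and the "observed need" figures stand in for whatever the check_mk baseline suggested. It computes the per-VM change and the net change, which is near zero when released resources cover the new requests.

```python
from typing import Dict, Tuple


def plan_rebalance(
    allocated: Dict[str, int], observed_need: Dict[str, int]
) -> Tuple[Dict[str, int], int]:
    """Return per-VM changes (need minus allocation) and their net sum.

    Only VMs whose allocation differs from the observed need are listed.
    A net sum of zero means released resources exactly cover the requests.
    """
    changes = {
        vm: observed_need[vm] - allocated[vm]
        for vm in allocated
        if observed_need[vm] != allocated[vm]
    }
    return changes, sum(changes.values())


if __name__ == "__main__":
    # Hypothetical VM names and CPU counts, for illustration only.
    allocated = {"vm-a": 4, "vm-b": 2, "vm-c": 2}
    observed = {"vm-a": 2, "vm-b": 3, "vm-c": 3}
    changes, net = plan_rebalance(allocated, observed)
    for vm, delta in sorted(changes.items()):
        print(f"{vm}: {delta:+d} CPUs")
    print(f"net change: {net:+d} CPUs")
```

The same function works for memory if the dictionaries hold RAM sizes instead of CPU counts.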

Updating Documentation
Our existing documentation at https://twiki.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG needs: streamlining; updating for v4.0; migration to GitHub? (What is the new location?) The perfSONAR developers provide excellent documentation on installation and configuration of perfSONAR. We need to point to that and provide details on the few specific things we need to worry about for OSG/WLCG installations: firewalls configured to allow our access to the data; configuration for auto-mesh and information about MCA; information about OSG network services; information about OSG and WLCG support. Timeline: STILL guessing about 2 weeks to get to a final version once things settle down. This work has been lagging because of other issues; we are waiting for other work to converge and gaining some feedback/experience on v4.0 so that the documentation reflects identified “best practices”.

Hardware Update / Data loss
Since the beginning of the year we have been waiting to update the storage (disks) used for the network datastore. The 8TB replacements for the 4TB disks have been on hand for months. At the end of April we attempted incremental updates to the RAID-10 (4-disk) sections of the hardware. PROBLEM: the VMs running Postgresql and Cassandra were not shut down while the storage was changed, and we lost/corrupted the postgres DB. Attempted recovery (3 days of work) recovered only 5-7% of the Postgres metadata. We consulted experts, but there was nothing to be done. Cassandra retained all the data, but the “index” for it in postgres is missing. Two changes: the Postgresql DB should now have daily backups (need to confirm with Scott that this is in place), and future procedures will shut down VMs if their storage is modified.
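As an illustration of the daily-backup change, here is a minimal Python sketch that builds and runs a dated `pg_dump` command. It is not the procedure actually in place (the slide notes that still needs confirming); the database name `esmond` and the backup directory are assumptions chosen for the example.

```python
import subprocess
from datetime import date
from pathlib import Path
from typing import List, Optional


def backup_command(db_name: str, backup_dir: Path, day: date) -> List[str]:
    """Build a pg_dump command that writes a dated custom-format dump."""
    target = backup_dir / f"{db_name}-{day.isoformat()}.dump"
    return ["pg_dump", "--format=custom", f"--file={target}", db_name]


def run_backup(db_name: str, backup_dir: Path, day: Optional[date] = None) -> None:
    """Create the backup directory if needed and run pg_dump once."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pg_dump exits non-zero,
    # so a failed backup is not silently ignored.
    subprocess.run(backup_command(db_name, backup_dir, day or date.today()), check=True)


if __name__ == "__main__":
    # "esmond" and the directory below are assumed names, not from the slides.
    # Only the command is printed here; call run_backup() from a daily cron job.
    print(" ".join(backup_command("esmond", Path("/var/backups/postgres"), date.today())))
```

A custom-format dump can be restored with `pg_restore`, and the dated file names make it easy to keep several days of history.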

OSG Network Data Distribution
Since early May our OSG network data is also being replicated to FNAL, where it goes to tape (a little late to help with the data loss, but great to have for the future). Given that many perfSONAR nodes keep local copies of data for up to 1 year, Edgar is also running “long-term” queries out to the toolkits to capture any data still available. Updates to the production perfSONAR RSV yesterday now support posting data to CERN’s ActiveMQ, the OSG RabbitMQ (goes to FNAL) and ESmond. Concern: we may need to modify how the RSV perfSONAR probe gathers data; it needs to make “small” requests or risk time-outs when gathering a large time-span of data. We have an issue with the FNAL perfSONAR hosts (psonar3/4.fnal.gov) that needs debugging: psetf can get data but psds0 (RSV) can’t.
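The “small requests” idea can be sketched as splitting a long time span into fixed-size windows and issuing one query per window. The sketch below only generates the windows; the `time-start` and `time-end` parameter names are an assumption about the query interface, and the one-day step is an arbitrary example value, not a setting taken from the RSV probe.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterator, List, Tuple


def time_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[int, int]]:
    """Yield contiguous (start, end) Unix-timestamp pairs covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last window may be shorter
        yield int(cursor.timestamp()), int(upper.timestamp())
        cursor = upper


def window_params(start: datetime, end: datetime, step: timedelta) -> List[Dict[str, int]]:
    """Build one set of query parameters per window (parameter names assumed)."""
    return [{"time-start": lo, "time-end": hi} for lo, hi in time_windows(start, end, step)]


if __name__ == "__main__":
    begin = datetime(2017, 5, 1, tzinfo=timezone.utc)
    finish = datetime(2017, 5, 4, tzinfo=timezone.utc)
    for params in window_params(begin, finish, timedelta(days=1)):
        print(params)
```

Each window can then be fetched and posted independently, so a time-out costs one small request rather than the whole span.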

Standalone Mesh-config (MCA)
We (OSG networking/Soichi) have finally finished the standalone mesh config administrator (GUI)! As of May 31, 2017 the code and the responsibility for maintenance and future upgrades were given to the perfSONAR developers team. Documentation is available (http://docs.perfsonar.net/mca and https://github.com/soichih/meshconfig-admin), and issues are tracked at https://github.com/soichih/meshconfig-admin/issues (14 open, 17 closed). The OSG ITB instance is running at https://meshconfig-itb.grid.iu.edu/ (create an account to play with this). The production instance (https://meshconfig.grid.iu.edu/) is operational. Old URLs using my There are some small issues (see issue tracker), but it is working well.

Updated Net Service Monitoring
We have been using OMD/Check_mk to monitor both perfSONAR services and host status. WLCG has built an improved system (ETF = Experiments Testing Framework), which we have leveraged to create a perfSONAR-specific implementation. It is distributed as a Docker image and requires a CentOS 7.x base OS with Docker. We requested a new VM (psetf.grid.iu.edu) and ‘root’ access for Marian and me. We deployed Marian’s etf_ps docker instance, then tested and tweaked it. Once it was working, we requested that the psomd.grid.iu.edu DNS alias be switched from perfsonar1.grid.iu.edu to psetf.grid.iu.edu. We then had Tom shut down perfsonar1.grid.iu.edu (we want to keep the VM image for 6 months in case we need any data from it). Have a look: https://psomd.grid.iu.edu/etf/check_mk/ NOTE: an Apache redirect has been requested (ticket #34087) for https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk but is not yet in place. Docs and details are at https://gitlab.cern.ch/etf/docker/blob/master/README.md

Network Alerting Reminder
We have a longer-term goal of alerting and alarming on network issues. Milestone completed: technical design of a suitable analysis system based upon existing time-series technologies. The current operating implementation gathers all the perfSONAR data OSG sends to CERN and puts it in ElasticSearch. A Jupyter instance regularly runs cron tasks to analyze the data. Anyone can subscribe to simple alert emails: http://tiny.cc/RegATLASAlarm The system needs “tuning” of text and user interface for end-user use.
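To make the kind of periodic analysis concrete, here is a minimal Python sketch of a threshold check on packet-loss samples. It is not the analysis the production system runs: the 2% threshold, the minimum sample count, and the link names are all hypothetical, and the samples are passed in directly rather than read from ElasticSearch.

```python
from statistics import mean
from typing import Dict, List, Sequence


def links_to_alert(
    loss_by_link: Dict[str, Sequence[float]],
    threshold: float = 0.02,  # hypothetical: alert above 2% mean loss
    min_samples: int = 3,     # hypothetical: skip links with too little data
) -> List[str]:
    """Return the links whose mean packet-loss fraction exceeds the threshold."""
    flagged = []
    for link, samples in loss_by_link.items():
        if len(samples) < min_samples:
            continue  # not enough measurements to judge this link
        if mean(samples) > threshold:
            flagged.append(link)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical link names and loss fractions, for illustration only.
    recent = {
        "siteA->siteB": [0.00, 0.05, 0.06],
        "siteA->siteC": [0.00, 0.00, 0.01],
        "siteA->siteD": [0.50],
    }
    for link in links_to_alert(recent):
        print(f"ALERT: high packet loss on {link}")
```

Requiring a minimum number of samples avoids alerting on a single bad measurement, at the cost of reacting more slowly to links that report rarely.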

Current Example Alerting Email
Last time I showed a text example email. We have updated the format and are working to customize it by alert type.

Network Analytics
Because our perfSONAR data is published to CERN and reaches the U Chicago Elasticsearch analytics platform, we have a number of nice dashboards constructed to view the (meta)data. In the next few slides I will flip through a few of them, showing what we currently have. The dashboards are currently available only on development systems (hidden behind user/pw credentials), but we are working to publish read-only versions of them for general use.

Kibana perfSONAR Infrastructure

Grafana Dashboard (ESnet sites)

Grafana (Site to 2 Destinations) (CERN to BNL and TRIUMF)

Concerns
While we have made very good progress in updating things, there is still much to do. We need updated documentation reflecting the right best practices. We need to include public versions of the dashboards. We must recruit additional non-WLCG sites, using our developing capabilities as a “carrot”. Some services need minor fixes (perfSONAR RSV, MCA), and host admins may need to do some work to fix toolkit issues we are seeing with their sites.

Questions or Comments? Thanks!

URLs for Reference
OSG Network Datastore Documents:
  Operations: https://docs.google.com/document/d/1l144BSo-88M0cLMMjKcKMIE-Q5s21X-w3lYl-0Pn_08/edit#
  SLA: https://twiki.grid.iu.edu/bin/view/Operations/PSServiceLevelAgreement
  Data lifecycle: https://docs.google.com/document/d/1mJ1kf43nZf6gvKoNtiTOc0g0MYDv_wSfSm7YdiMs3Lo/edit#
Current OSG network documentation: https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
OSG networking year-5 goals and milestones: https://docs.google.com/document/d/1FzmXZinO4Pb8NAfd5SWUzaAFYOL23dt66hQsDmaP-WI/edit
perfSONAR adoption tracking: http://grid-monitoring.cern.ch/perfsonar_coverage.txt
Deployment documentation for both OSG and WLCG, hosted in OSG (migrated from CERN): https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR
ATLAS Analytics:
  Packet-loss: http://tiny.cc/PktLossNoUnknown (6 month view)
  perfSONAR dashboard: http://tiny.cc/pSDash
  perfSONAR link details: http://tiny.cc/pSLink
Mesh-config in OSG: https://meshconfig.grid.iu.edu/
Pre-Production Meshconfig: https://meshconfig-itb.grid.iu.edu/
perfSONAR homepage: http://www.perfsonar.net/