Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA UNCLASSIFIED IB Monitoring Through the Console Jesse Martinez Los Alamos National Laboratory

Presentation transcript:

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
UNCLASSIFIED

IB Monitoring Through the Console
Jesse Martinez
Los Alamos National Laboratory
LA-UR
April 3rd, 2013

Outline
● Monitoring Methods
  ○ Errors
  ○ Performance
● Use of Console
● Analysis and Reporting
● Future Implementations

Monitoring at LANL
● Monitoring is done per cluster fabric
  ○ Clusters range from 8 nodes to 1600 nodes
    ■ DDR, QDR, FDR systems
  ○ OpenSM to
● Monitoring at near real time:
  ○ Fabric Errors
  ○ Non Optimal Links
  ○ Performance Issues
  ○ Bandwidth and Latency (Susan Coulter)
  ○ Throughput

IBMon2
● Developed by Susan Coulter
● Suite of scripts designed to look for InfiniBand hardware errors as well as performance metrics
● Runs on the master node of each cluster
  ○ Where the subnet manager is located
● Forwards messages to both Zenoss and Splunk
● Thresholds determine when fabric errors and performance issues are sent to operators and system administrators

Error Monitoring Methods
● Subnet Manager gathers counters from the IB fabric continuously
● Scripts gather this data and convert it to a readable format
  ○ Local Device: [Error == Counter] - (Remote Device)
● Error counters reset every half hour
  ○ Allows errors to be monitored in near real time
  ○ Automatically disabled during Dedicated Service Time (DST)
● Error messages recorded in syslog for each fabric
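As an illustration of the readable format above, here is a minimal Python sketch that turns a set of counters for one port into messages of the form `Local Device: [Error == Counter] - (Remote Device)`. It is not the IBMon2 code: the function name `error_events`, the `thresholds` mapping, and the "strictly greater than the threshold" rule are assumptions made for the example.

```python
def error_events(local_device, remote_device, counters, thresholds=None):
    """Build one message per error counter that exceeds its threshold.

    counters   -- mapping of counter name to its value for this port
    thresholds -- hypothetical per-counter limits; a counter with no entry
                  is reported as soon as it is nonzero (assumed policy)
    """
    thresholds = thresholds or {}
    events = []
    for name, value in counters.items():
        if value > thresholds.get(name, 0):
            # Message layout taken from the slide:
            # Local Device: [Error == Counter] - (Remote Device)
            events.append(f"{local_device}: [{name} == {value}] - ({remote_device})")
    return events


if __name__ == "__main__":
    # Device names and counter values are made up for the example.
    for line in error_events("switch-a port 3", "node-b port 1",
                             {"SymbolErrors": 12, "LinkDowned": 0}):
        print(line)
```

Keeping the threshold check separate from the message layout means the reporting policy could change without touching the format that downstream tools such as syslog consumers rely on.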

Performance Monitoring Methods
● Scripts gather transmit and receive data from ports throughout the fabric
  ○ Recalculates actual data across 4 links and converts to MB
● Performance counters reset every half hour
● Throughput calculated from transmit and receive data
  ○ Converts performance counters to average MB/s
  ○ MB/30 minutes → ~MB/s
● Can look at overall cluster or per-port usage every half hour
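The conversion from a half-hour counter to an average rate can be sketched in Python as follows. This is only an approximation of the calculation described above, under two assumptions: that the raw data counters count 4-octet units (as the InfiniBand PortXmitData and PortRcvData counters do), which is taken here to be the factor of 4 mentioned on the slide, and that 1 MB means 10^6 bytes. The function names are hypothetical.

```python
BYTES_PER_COUNTER_UNIT = 4      # assumption: data counters count 4-octet units
RESET_INTERVAL_S = 30 * 60      # counters are reset every half hour


def counter_to_mb(counter_value):
    """Convert a raw data counter to megabytes (1 MB = 10**6 bytes, assumed)."""
    return counter_value * BYTES_PER_COUNTER_UNIT / 1e6


def average_mb_per_s(xmit_counter, rcv_counter, interval_s=RESET_INTERVAL_S):
    """Return (transmit, receive) average throughput in MB/s over the interval."""
    return (counter_to_mb(xmit_counter) / interval_s,
            counter_to_mb(rcv_counter) / interval_s)


if __name__ == "__main__":
    # 450,000,000 counter units = 1800 MB in 30 minutes = 1.0 MB/s average.
    print(average_mb_per_s(450_000_000, 0))  # (1.0, 0.0)
```

Because the counters are reset every half hour, the result is an average over the whole window, so short bursts of traffic are smoothed out.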

Counters through Console
● Before: ibqueryerrors calls
  ○ Previously used to gather error and congestion counters on the fabric, with output modified by scripts
● OpenSM console now used to dump fabric counters via PerfMgr every half hour
  ○ Allows counters to be gathered continuously over the fabric without additional calls from our scripts
  ○ Scripts parse the dump file to gather error and performance counters
  ○ Calculations done on master nodes

Console Output

Monitoring through Console
● Scripts search the dump file for all ports on the hardware (spine/line cards, HCAs)
  ○ Located at /var/log/opensm_port_counters.log
● Grep for nonzero error counters
  ○ SymbolErrors, PortRcv, LinkDowned, etc.
● Use source device/port to find remote device/port
  ○ By parsing ibnetdiscover output
● Gathers performance metrics per port
● Sends error events to syslog and Zenoss
● Stores performance numbers in a file (read by Splunk)

PerfMgr Dump File

"mu1456" 0x2c d050 active TRUE port 1
    Last Reset            : Wed Mar 26 16:03:
    Last Error Update     : Wed Mar 26 16:30:
    symbol_err_cnt        : 0
    link_err_recover      : 0
    link_downed           : 0
    rcv_err               : 0
    rcv_rem_phys_err      : 0
    rcv_switch_relay_err  : 0
    xmit_discards         : 0
    xmit_constraint_err   : 0
    rcv_constraint_err    : 0
    link_integrity_err    : 0
    buf_overrun_err       : 0
    vl15_dropped          : 0
    Last Data Update      : Wed Mar 26 16:30:
    xmit_data             : ( GB)
    rcv_data              : ( GB)
    xmit_pkts             : ( M)
    rcv_pkts              : ( M)
    unicast_xmit_pkts     : 0 (0.000)
    unicast_rcv_pkts      : 0 (0.000)
    multicast_xmit_pkts   : 0 (0.000)
    multicast_rcv_pkts    : 0 (0.000)
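A scan of such a dump for nonzero error counters might look like the Python sketch below. It assumes the layout on this slide: a header line that starts with the quoted node name and ends with `port N`, followed by `name : value` lines. The function name `nonzero_errors` and the regular expression are choices made for the example, not part of IBMon2 or OpenSM.

```python
import re

# Error counter names as they appear in the dump on this slide.
ERROR_COUNTERS = (
    "symbol_err_cnt", "link_err_recover", "link_downed", "rcv_err",
    "rcv_rem_phys_err", "rcv_switch_relay_err", "xmit_discards",
    "xmit_constraint_err", "rcv_constraint_err", "link_integrity_err",
    "buf_overrun_err", "vl15_dropped",
)

# Assumed header shape: "<node name>" ... port <number>
PORT_HEADER = re.compile(r'^"(?P<node>[^"]+)".*\bport\s+(?P<port>\d+)\s*$')


def nonzero_errors(dump_text):
    """Map (node, port) to a dict of the error counters that are nonzero."""
    results = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = PORT_HEADER.match(line)
        if header:
            current = (header.group("node"), int(header.group("port")))
            continue
        if current is None or ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if name not in ERROR_COUNTERS:
            continue  # skip timestamps and performance counters
        tokens = value.split()
        if tokens and tokens[0].isdigit() and int(tokens[0]) > 0:
            results.setdefault(current, {})[name] = int(tokens[0])
    return results


if __name__ == "__main__":
    # Path taken from the slides; adjust if the dump is written elsewhere.
    with open("/var/log/opensm_port_counters.log") as dump:
        for port, errors in nonzero_errors(dump.read()).items():
            print(port, errors)
```

Parsing into a dictionary keyed by node and port, rather than grepping line by line, keeps each nonzero counter attached to the port it belongs to, which is what the later remote-port lookup needs.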

Error Analysis and Reporting
● Two methods for monitoring errors
  ○ Zenoss
  ○ Splunk
● Why both?
  ○ Preference
  ○ Zenoss designed for real-time visualization of clusters to monitor errors
    ■ IB grid sent to Zenoss for visualization
    ■ Automatically clears events
  ○ Splunk designed for analysis and benchmarking of performance and alerts

Zenoss Example

Splunk Example

Splunk Example

Future Modifications
● Compatible IBMon2 for InfiniBand fabrics
  ○ Configuration standards
    ■ Different fabric rates
    ■ Different organizational implementations
● Pulling additional counters to look for trends in performance and error analysis
  ○ PortXmitWait
● Robust design to handle upgrades

Questions?