Office of Science U.S. Department of Energy Troubleshooting Data Movement Dan Gunter LBNL.

Presentation transcript:

Office of Science U.S. Department of Energy Troubleshooting Data Movement Dan Gunter LBNL

Background
Work is part of SciDAC CEDPS (Center for Enabling Distributed Petascale Science).
Basic question: Why did my transfer (or remote operation) fail?
We want to answer this question before the users even ask it!
- Instrument: middleware, applications, etc.
- Monitor: gather data (in response to problems)
- Diagnose: failures and performance problems

Topics
Two broad categories of work:
- Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG)
- Adding new data through instrumentation of standard Grid middleware (e.g. GridFTP)

CEDPS Troubleshooting Architecture

Syslog-ng
Features:
- Can filter logs based on level and content
- Arbitrary number of sources and destinations
- Provides remote logging
- Can act as a proxy, tunnel through firewalls
- Execute programs
  - Send e-mail, load database, etc.
- Built-in log rotation
- Timezone support

Log collection using syslog-ng

Log Parser
Normalizes unformatted logs to name=value pairs.
Plug-in architecture makes it easy to add additional log file formats.
- We will provide a set of example parsers
But…
- Parsers will be hard to write and maintain
- If middleware and application developers follow the “logging best practices” document, the parser component will not be necessary
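
The plug-in idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the registry, the `parser` decorator, the `example-transfer` format name, and the sample input line are all hypothetical and are not taken from the CEDPS parser.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical plug-in registry: format name -> parser function.
PARSERS: Dict[str, Callable[[str], Optional[Dict[str, str]]]] = {}


def parser(name):
    """Register a function as the parser for one log format."""
    def register(func):
        PARSERS[name] = func
        return func
    return register


# Assumed unformatted input, for illustration only:
#   "2007-01-01T12:00:00Z myhost transfer of /tmp/f finished in 4.2 s"
_TRANSFER_RE = re.compile(
    r"(?P<ts>\S+) (?P<host>\S+) transfer of (?P<file>\S+) "
    r"finished in (?P<dur>[\d.]+) s$"
)


@parser("example-transfer")
def parse_transfer(line):
    match = _TRANSFER_RE.match(line)
    if match is None:
        return None  # not a line in this format
    fields = match.groupdict()
    fields["event"] = "example.transfer.end"
    return fields


def normalize(fmt, line):
    """Return the line as one 'name=value ...' string, or None if unparsed."""
    fields = PARSERS[fmt](line)
    if fields is None:
        return None
    # ts and event come first, remaining fields follow in sorted order.
    lead = ["ts", "event"]
    rest = sorted(k for k in fields if k not in lead)
    return " ".join(f"{k}={fields[k]}" for k in lead + rest)
```

Adding a new log format then means writing one more decorated function. The fragile regular expression also hints at why such parsers are hard to maintain.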

Missing Event Detector
Assumes all ‘start’ events should have a corresponding ‘end’ event
- Looks for missing ‘end’ events
- Generates a replacement ‘end’ event with an error code
Planning to develop more sophisticated anomaly detection capabilities as MDS trigger services.
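
A minimal Python sketch of the start/end matching described here. Pairing events on (event base name, guid), the `-1` error code, and the `msg` text are assumptions of this sketch, not the behavior of the actual detector. A real detector would also need a timeout to decide when an ‘end’ event counts as missing.

```python
def find_missing_ends(events, error_status=-1):
    """Return synthetic '.end' events for every '.start' without a match.

    `events` is an iterable of dicts with an 'event' key and, optionally,
    a 'guid' key. Events are paired on (base name, guid).
    """
    open_starts = {}
    for ev in events:
        name, guid = ev["event"], ev.get("guid")
        if name.endswith(".start"):
            open_starts[(name[:-len(".start")], guid)] = ev
        elif name.endswith(".end"):
            # A matching end closes the pending start, if there is one.
            open_starts.pop((name[:-len(".end")], guid), None)
    # Whatever is still open never saw its end event.
    return [
        {"event": base + ".end", "guid": guid, "status": error_status,
         "msg": "missing end event"}
        for (base, guid) in open_starts
    ]
```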

Log Filter
Some sites may not want to forward potentially sensitive information to the VO archive
- E.g.: usernames, user DNs, IP addresses
Syslog-ng can filter entire events
- But we would prefer to filter out just the sensitive fields
The log filter will be able to remove or anonymize specific fields in the event
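
Field-level filtering could be sketched in Python as follows. The set of sensitive field names, the salted SHA-256 scheme, and the `site-secret` salt are all assumptions chosen for illustration; the slides do not specify how anonymization is done.

```python
import hashlib

# Assumed names of sensitive fields; a site would configure its own list.
SENSITIVE = {"DN", "localUser", "src.host", "dst.host"}


def filter_event(fields, drop=(), anonymize=SENSITIVE, salt="site-secret"):
    """Return a copy of `fields` with some fields removed or anonymized.

    Fields in `drop` are removed. Fields in `anonymize` are replaced by a
    truncated salted hash, so equal inputs still map to equal outputs and
    events can be correlated without exposing the original value.
    """
    out = {}
    for name, value in fields.items():
        if name in drop:
            continue
        if name in anonymize:
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            value = digest[:16]
        out[name] = value
    return out
```

Hashing rather than blanking a field keeps it usable as a join key in the VO archive, at the cost of being guessable if the salt leaks.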

Database Loader
This component loads normalized logs into an SQL database.
Ability to specify mapping of fields to database columns.
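
As a rough illustration of a field-to-column mapping, here is a Python sketch using the standard `sqlite3` module. The table name `events`, the column names, and the mapping itself are hypothetical; the actual loader and its target database are not described in the slides.

```python
import sqlite3

# Assumed mapping from log field names to database column names.
FIELD_TO_COLUMN = {
    "ts": "event_time",
    "event": "event_name",
    "guid": "guid",
    "status": "status",
}


def create_table(conn):
    """Create the (hypothetical) events table with one TEXT column per field."""
    cols = ", ".join(f"{col} TEXT" for col in FIELD_TO_COLUMN.values())
    conn.execute(f"CREATE TABLE IF NOT EXISTS events ({cols})")


def load_event(conn, fields):
    """Insert one normalized event.

    Fields without a mapping are ignored; mapped fields that are absent
    from the event are stored as NULL.
    """
    columns = list(FIELD_TO_COLUMN.values())
    values = [fields.get(name) for name in FIELD_TO_COLUMN]
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO events ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
```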

Sample Site Deployment

Sample Grid Deployment

Current Status

Logging “Best Practices” Recommendations
Practices
- Each logged event must contain a unique event name and an ISO-format timestamp (e.g. T07:23: Z)
- All system operations that might fail or experience performance variations should be wrapped with start and end events
- All logs from a given execution thread should be tagged with a globally unique ID (GUID), such as a Universally Unique Identifier (UUID)
Log format
- Logs should be lines of ASCII name=value pairs
- Example:
    ts= T18:48: Z event=org.globus.gridFTP.transfer.start guid=ID file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2
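
A short Python sketch of producing a line in this format. The `format_event` helper, the microsecond timestamp precision, and the rule of quoting values that contain spaces are assumptions of this sketch; the recommendations themselves only require the event name, ISO-format timestamp, GUID, and name=value layout.

```python
import uuid
from datetime import datetime, timezone


def format_event(event, guid, fields=None):
    """Build one 'name=value' log line with ts, event, and guid first.

    `fields` is a dict so that dotted names such as 'src.host' can be used.
    Values containing spaces are wrapped in double quotes.
    """
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    parts = [f"ts={ts}", f"event={event}", f"guid={guid}"]
    for name, value in (fields or {}).items():
        text = str(value)
        if " " in text:
            text = '"' + text + '"'
        parts.append(f"{name}={text}")
    return " ".join(parts)


# One GUID per execution thread, shared by the start and end events.
guid = str(uuid.uuid4())
print(format_event("org.globus.gridFTP.transfer.start", guid,
                   {"file": "/tmp/myfile", "src.host": "H1"}))
print(format_event("org.globus.gridFTP.transfer.end", guid, {"status": 0}))
```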

Event Names
Use a '.' as a separator and go from general to specific
- Same as Java class names
First part of name should be used as a unique namespace (e.g.: org.globus)
Use start/end suffixes whenever possible
- Helps immensely with troubleshooting
Examples
- org.globus.gridFTP.start
- org.globus.gridFTP.authn.start
- org.globus.gridFTP.authn.end
- org.globus.gridFTP.transfer.start
- org.globus.gridFTP.transfer.end
- org.globus.gridFTP.end
- org.globus.MDS.response.start
- org.globus.MDS.query.start
- org.globus.MDS.query.end
- org.globus.MDS.write.net.start
- org.globus.MDS.write.net.end
- org.globus.MDS.response.end

Reporting Errors
Errors should be reported as part of the ‘end’ event if possible
- Use ‘status=N’ (N >= 0 means success)
- Not attempting to define other status codes
  - Too hard to get agreement on these
Example:
    ts= T18:39: Z event=org.globus.authz.gridmap.end status=-1 DN="/O=CEDS/CN=Some User" msg="Cannot open gridmap file /etc/grid-security/grid-mapfile for reading" guid=F7D A A21F-57109AA46DFA level=ERROR

Error Reporting cont.
Depending on how the program is structured, it may be hard to propagate the error message to the ‘end’ event
Use the ‘error’ event name suffix in this situation
Examples:
    event=org.globus.gridFTP.write.error path="/home/grid" msg="write error, disk full"
    event=myprogram.input.error msg="invalid input"
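
On the consumer side, the two error conventions (a negative status on an ‘end’ event, or an ‘error’ suffix) could be checked with a sketch like this. The helper names `parse_line` and `failed` are hypothetical, and using `shlex` to handle quoted values is a choice of this sketch rather than anything prescribed by the recommendations.

```python
import shlex


def parse_line(line):
    """Split one name=value log line into a dict, honoring double quotes."""
    fields = {}
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if sep:  # skip stray tokens that have no '='
            fields[name] = value
    return fields


def failed(fields):
    """True for an '.error' event or an '.end' event with negative status."""
    name = fields.get("event", "")
    if name.endswith(".error"):
        return True
    return name.endswith(".end") and int(fields.get("status", "0")) < 0
```

For example, `failed(parse_line('event=myprogram.input.error msg="invalid input"'))` evaluates to `True`.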

Globally Unique IDs
Use the ‘guid’ reserved name to allow correlation of a set of events:
    event=org.globus.gridFTP.authn.start guid= BFDD3DA A3AF-1C5E3B8EF9BB
    event=org.globus.gridFTP.authn.end guid= BFDD3DA A3AF-1C5E3B8EF9BB
    event=org.globus.gridFTP.transfer.start guid= BFDD3DA A3AF-1C5E3B8EF9BB
    event=org.globus.gridFTP.transfer.end guid= BFDD3DA A3AF-1C5E3B8EF9BB
UUIDs are easy to generate
- uuidgen, uuidlib
- MD5 hash in hexadecimal
But a free-form ‘id’ is also allowed (e.g. process ID)
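
Generating a GUID and correlating events by it might look like the following Python sketch. The helper names are hypothetical, and `uuid.uuid4()` is used here simply as one convenient way to obtain a UUID.

```python
import uuid
from collections import defaultdict


def new_guid():
    """Return a random UUID as an upper-case string."""
    return str(uuid.uuid4()).upper()


def group_by_guid(events):
    """Map each guid to the ordered list of event names that carry it."""
    groups = defaultdict(list)
    for ev in events:
        groups[ev.get("guid")].append(ev["event"])
    return dict(groups)
```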

Example: GridFTP
    ts= T18:39: Z event=org.globus.gridFTP.start prog=GridFTP localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd id=56010
    ts= T18:39: Z event=org.globus.gridFTP.authn.start DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" id=56010
    ts= T18:39: Z event=org.globus.gridFTP.authn.end DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" msg=" successfully authorized" localUser=uscmspool381 id=56010 status=0
    ts= T18:39: Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize= numStreams=1 numStripes=1 destHost= id=56010
    ts= T18:45: Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred= id=56010 status=0
    ts= T18:45: Z event=org.globus.gridFTP.end id=56010 status=226

Logging API
We are not requiring any special library to generate log messages.
We assume that programmers use one of the standard logging APIs (syslog, Java log4j, Python logging, etc.)
- Could also use ‘printf’, a custom logging API, etc.
- Syslog-ng can be used to forward any newline-delimited ASCII log file
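
To show that a standard API is enough, here is a sketch that makes Python's `logging` module emit name=value lines. The `NameValueFormatter` class, the `bp-example` logger name, and the convention of passing extra fields through `extra={"fields": ...}` are assumptions of this sketch, not part of the recommendations.

```python
import io
import logging
import time


class NameValueFormatter(logging.Formatter):
    """Render each record as 'ts=... event=... level=... name=value ...'."""

    converter = time.gmtime  # format timestamps in UTC

    def format(self, record):
        ts = self.formatTime(record, "%Y-%m-%dT%H:%M:%S")
        ts += f".{int(record.msecs):03d}Z"
        parts = [f"ts={ts}", f"event={record.getMessage()}",
                 f"level={record.levelname}"]
        # Extra fields arrive via logger.info(..., extra={"fields": {...}}).
        for name, value in getattr(record, "fields", {}).items():
            parts.append(f"{name}={value}")
        return " ".join(parts)


def make_logger(stream):
    """Return a logger that writes name=value lines to `stream`."""
    logger = logging.getLogger("bp-example")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(NameValueFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = make_logger(buf)
log.info("myprogram.input.start", extra={"fields": {"guid": "ID1"}})
log.info("myprogram.input.end", extra={"fields": {"guid": "ID1", "status": 0}})
print(buf.getvalue(), end="")
```

Writing to a file instead of the in-memory buffer would give a newline-delimited ASCII log that syslog-ng can forward.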

Status
Working with GT4 developers to add this to
- GRAM4, GridFTP, MDS4, Java Core, C Core, Delegation Service
Working with OSG on deployment of syslog-ng to gather up logs

Log Summarizer

Issue
Would like to have detailed I/O logging for performance analysis
But detailed logs can be far too large and intrusive
- For example, a trace of the I/O operations performed by a single GridFTP server capable of saturating a 10 Gigabit network will generate
  - O(20,000) log events / second
  - over 70 million per hour
Need:
- ongoing report of I/O characteristics
- negligible perturbation (< 1%) and storage

Solution
Summarization
- developer can log thousands of events/sec
- run-time choice of summary granularity (easy to turn off by default) and algorithm
NetLogger Summarization Library
- Summarizes logs before they are ever written to disk
- Huge reduction of log volume, while retaining important information
- Can be used for bottleneck analysis and performance anomaly detection
- General-purpose tool can be extended to do different kinds of summarization (currently only does time-based)

How summarization works
Code:
    for (i=0; i < N; i++) {
        nl_write(log, "loop.start", "id=i", 0);
        double v = do_work();
        nl_write(log, "loop.end", "id=i val=d", 0, v);
    }
Configuration (XML version): markup not preserved; surviving values are event, 1, val, loop.summ, id, loop.start, loop.end
Figure: a timeline (0 sec, 1 sec, 2 sec) with log calls at the start/end of each loop as input and, as output, summary events with average time, average value per time, etc.

Programmatic configuration
New NetLogger calls (slightly simplified):
- Add events to summarizer:
    add_eventpair("my.event", "nbytes", "guid")
      "my.event": my.event.start / my.event.end
      "nbytes": value field, e.g. nbytes=
      "guid": identifier field
- Set summary interval:
    set_interval(10)
      10 second summary interval
- Get summary statistics:
    I = get_stats("my.event")
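
To make the idea of time-based summarization concrete, here is a small Python sketch. It borrows the method names `add_eventpair`, `set_interval`, and `get_stats` from the slide, but everything else is an assumption: the `Summarizer` class, the `write` method with explicit timestamps, and the shape of the returned statistics are illustrative and are not the NetLogger API.

```python
from collections import defaultdict


class Summarizer:
    """Toy time-based summarizer for start/end event pairs."""

    def __init__(self):
        self.interval = 1.0
        self.pairs = {}   # base event name -> (value field, id field)
        self.open = {}    # (base, id) -> start timestamp
        # base -> bucket index -> [count, total time, total value]
        self.buckets = defaultdict(lambda: defaultdict(lambda: [0, 0.0, 0.0]))

    def set_interval(self, seconds):
        self.interval = float(seconds)

    def add_eventpair(self, base, value_field, id_field):
        self.pairs[base] = (value_field, id_field)

    def write(self, ts, event, **fields):
        """Record one event instead of writing it to the log."""
        base, _, suffix = event.rpartition(".")
        if base not in self.pairs:
            return
        value_field, id_field = self.pairs[base]
        key = (base, fields.get(id_field))
        if suffix == "start":
            self.open[key] = ts
        elif suffix == "end" and key in self.open:
            start = self.open.pop(key)
            # The pair is counted in the interval in which it ended.
            bucket = self.buckets[base][int(ts // self.interval)]
            bucket[0] += 1
            bucket[1] += ts - start
            bucket[2] += float(fields.get(value_field, 0))

    def get_stats(self, base):
        """Return one summary dict per interval that saw completed pairs."""
        out = []
        for index in sorted(self.buckets[base]):
            count, total_time, total_value = self.buckets[base][index]
            out.append({"t": index * self.interval, "count": count,
                        "avg_time": total_time / count,
                        "sum_value": total_value})
        return out
```

With a 10 second interval, thousands of start/end pairs collapse into a single summary record per interval, which is where the reduction in log volume comes from.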

Sample GridFTP Deployment with Summarizer

Anomaly Detection
Summarized events can be used for simple anomaly detection
- Summarize disk and network throughput every 10 seconds
- Generate an alarm if disk or network throughput drops below threshold X for duration Y
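
The threshold-X-for-duration-Y rule can be sketched in a few lines of Python. The function name `find_alarms`, its parameters, and the choice to report the start time of each low-throughput run are assumptions made for illustration.

```python
def find_alarms(samples, threshold, duration, interval=10):
    """Return start times (seconds) of runs below `threshold`.

    `samples` holds one throughput value per `interval` seconds. A run
    raises an alarm only if it lasts at least `duration` seconds.
    """
    alarms = []
    run_start = None
    # The trailing sentinel closes a run that reaches the end of the data.
    for i, value in enumerate(list(samples) + [float("inf")]):
        if value < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and (i - run_start) * interval >= duration:
                alarms.append(run_start * interval)
            run_start = None
    return alarms
```

For instance, with samples `[100, 20, 30, 100, 10, 10, 10, 100]`, a threshold of 50, and a duration of 30 seconds, only the second dip is long enough to raise an alarm.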

Bottleneck Analysis
Can configure summarizer to just output a single summary at the end
Need to collect summary information at both client and server sides
Because the start/end events measure both...
- time inside instrumented function (end_i - start_i)
- time between successive calls (start_{i+1} - end_i)
...there is potential for determining which functions are busy and which are mostly waiting
- admittedly, this is somewhat complicated by OS buffering
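
The two quantities can be computed directly from the start/end timestamps. A minimal Python sketch, where the function name `busy_and_wait` and the list-of-tuples input are assumptions:

```python
def busy_and_wait(calls):
    """Split elapsed time into time inside calls and time between them.

    `calls` is a list of (start, end) timestamps for successive calls of
    one instrumented function, in order.
    """
    # Time inside the function: sum of (end_i - start_i).
    busy = sum(end - start for start, end in calls)
    # Time between successive calls: sum of (start_{i+1} - end_i).
    waiting = sum(nxt[0] - cur[1] for cur, nxt in zip(calls, calls[1:]))
    return busy, waiting
```

A function whose waiting time dominates is mostly idle, which suggests the bottleneck lies elsewhere (for example on the other side of the transfer).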

Summary
Two broad categories of work:
- Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG)
  - syslog-ng, log parser, db loader
  - missing event detector
  - anomaly detection
- Adding new data through instrumentation of standard Grid middleware
  - best practices logging recommendations
  - summarizer

More Information
CEDPS TS home page:
- Best-practices sub-page:
CEDPS TS team
- Brian Tierney, LBNL (Area lead)
- Jen Schopf, ANL
- Stu Martin, ANL
- Laura Perlman, ISI

Extra Slides

NL summarizer performance
Log dest     NL mode: Full         NL mode: Summarized
Disk         202,000 events/sec    588,000 events/sec
/dev/null    331,000 events/sec    588,000 events/sec

NL summarizer overhead
I/O, compute overlap