NetLogger Using NetLogger for Distributed Systems Performance Analysis of the BaBar Data Analysis System Data Intensive Distributed Computing Group Lawrence.

Slides:



Advertisements
Similar presentations
GPW2005 GGF Techniques for Monitoring Large Loosely-coupled Cluster Jobs Brian L. Tierney Dan Gunter Distributed Systems Department Lawrence Berkeley National.
Advertisements

Unix Systems Performance Tuning Project of COSC 513 Name: Qinghui Mu Instructor: Prof. Anvari.
Operating System Structures
© 2008 Cisco Systems, Inc. All rights reserved.Cisco ConfidentialPresentation_ID 1 Chapter 8: Monitoring the Network Connecting Networks.
1 Chapter 5 Threads 2 Contents  Overview  Benefits  User and Kernel Threads  Multithreading Models  Solaris 2 Threads  Java Threads.
Troubleshooting.
UNIX Chapter 01 Overview of Operating Systems Mr. Mohammad A. Smirat.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 11: Monitoring Server Performance.
Chapter 11 - Monitoring Server Performance1 Ch. 11 – Monitoring Server Performance MIS 431 – created Spring 2006.
(NHA) The Laboratory of Computer Communication and Networking Network Host Analyzer.
1 I/O Management in Representative Operating Systems.
Module 8: Monitoring SQL Server for Performance. Overview Why to Monitor SQL Server Performance Monitoring and Tuning Tools for Monitoring SQL Server.
DIRAC API DIRAC Project. Overview  DIRAC API  Why APIs are important?  Why advanced users prefer APIs?  How it is done?  What is local mode what.
Virtual Memory Tuning   You can improve a server’s performance by optimizing the way the paging file is used   You may want to size the paging file.
TCP/IP Protocol Suite 1 Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 9 Internet Control Message.
Linux Operations and Administration
CEDPS: Center for Enabling Distributed Petascale Science Brian Tierney Lawrence Berkeley National Laboratory
1 Lecture 4: Threads Operating System Fall Contents Overview: Processes & Threads Benefits of Threads Thread State and Operations User Thread.
CHAPTER FOUR COMPUTER SOFTWARE.
Operating System Concepts Ku-Yaw Chang Assistant Professor, Department of Computer Science and Information Engineering Da-Yeh University.
Introduction to Interactive Media Interactive Media Tools: Software.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
Module 10: Monitoring ISA Server Overview Monitoring Overview Configuring Alerts Configuring Session Monitoring Configuring Logging Configuring.
Cisco S2 C4 Router Components. Configure a Router You can configure a router from –from the console terminal (a computer connected to the router –through.
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
Citrix MPS 3.0 Licensing Douglas A. Brown President
SZTAKI in DataGrid 2003 What to do this year. Topics ● Application monitoring (GRM) ● Analysis and Presentation (Pulse) ● Performance of R-GMA.
Guide to Linux Installation and Administration, 2e1 Chapter 10 Managing System Resources.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Overview Part 2: History (continued)
MACCE and Real-Time Schedulers Steve Roberts EEL 6897.
NetLogger GGF Distributed Application Analysis and Debugging using NetLogger v2 Lawrence Berkeley National Laboratory Brian L. Tierney.
CE Operating Systems Lecture 3 Overview of OS functions and structure.
A Brief Documentation.  Provides basic information about connection, server, and client.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
Debugging and Profiling With some help from Software Carpentry resources.
Computing Division Requests The following is a list of tasks about to be officially submitted to the Computing Division for requested support. D0 personnel.
CS2204: Introduction to Unix January 19 th, 2004 Class Meeting 1 * Notes adapted by Christian Allgood from previous work by other members of the CS faculty.
Manchester University Tiny Network Element Monitor (MUTiny NEM) A Network/Systems Management Tool Dave McClenaghan, Manchester Computing George Neisser,
Chapter 1 Computers, Compilers, & Unix. Overview u Computer hardware u Unix u Computer Languages u Compilers.
CIS250 OPERATING SYSTEMS Chapter One Introduction.
Debugging TI RTOS TEAM 4 JORGE JIMENEZ JHONY MEDRANO ALBIEN FEZGA.
ClearQuest XML Server with ClearCase Integration Northwest Rational User’s Group February 22, 2007 Frank Scholz Casey Stewart
Web Server Administration Chapter 11 Monitoring and Analyzing the Web Environment.
- GMA Athena (24mar03 - CHEP La Jolla, CA) GMA Instrumentation of the Athena Framework using NetLogger Dan Gunter, Wim Lavrijsen,
EPICS and LabVIEW Tony Vento, National Instruments
A System for Monitoring and Management of Computational Grids Warren Smith Computer Sciences Corporation NASA Ames Research Center.
1 Chapter 2: Operating-System Structures Services Interface provided to users & programmers –System calls (programmer access) –User level access to system.
Web Server Administration Chapter 11 Monitoring and Analyzing the Web Environment.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Windows Server 2003 { First Steps and Administration} Benedikt Riedel MCSE + Messaging
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
Lecture 5. Example for periority The average waiting time : = 41/5= 8.2.
Introduction to threads
Operating System & Application Software
Troubleshooting Tools
CCNA Routing and Switching Routing and Switching Essentials v6.0
Software Architecture in Practice
21-2 ICMP(Internet control message protocol)
ITIS 3110 IT Infrastructure II
Chapter 10: Device Discovery, Management, and Maintenance
CCNA Routing and Switching Routing and Switching Essentials v6.0
End-to-End Monitoring and
Chapter 10: Device Discovery, Management, and Maintenance
Internet Control Message Protocol Version 4 (ICMPv4)
Chapter 8: Monitoring the Network
Brian L. Tierney, Dan Gunter
An XML-based System Architecture for IXA/IA Intercommunication
A General Approach to Real-time Workflow Monitoring
Chapter 4: Threads.
Anant Mudambi, U. Virginia
Presentation transcript:

NetLogger Using NetLogger for Distributed Systems Performance Analysis of the BaBar Data Analysis System Data Intensive Distributed Computing Group Lawrence Berkeley National Laboratory Brian L. Tierney

NetLogger Outline NetLogger Overview NetLogger Components Results from BaBar analysis

NetLogger Overview The Problem –When building distributed systems, we often observe unexpectedly low performance the reasons for which are usually not obvious –The bottlenecks can be in any of the following components: the applications the operating systems the disks or network adapters on either the sending or receiving host the network switches and routers, and so on The Solution: Highly instrumented systems with precision timing information and analysis tools

NetLogger Bottleneck Analysis Distributed system users and developers often assume the problem is network congestion –This is often not true In our experience tuning distributed applications, performance problems are due to: –network problems: 40% –host problems: 20% –application design problems/bugs: 40% 50% client, 50% server Therefore it is equally important to instrument the applications

NetLogger NetLogger Toolkit We have developed the NetLogger Toolkit, which includes: –tools to make it easy for distributed applications to log interesting events at every critical point –tools for host and network monitoring The approach is novel in that it combines network, host, and application-level monitoring to provide a complete view of the entire system. This has proven invaluable for: – isolating and correcting performance bottlenecks – debugging distributed applications

NetLogger Why “NetLogger”? The name “NetLogger” is somewhat misleading –Should really be called: “Distributed Application, Host, and Network Logger” “NetLogger” was a catchy name that stuck

NetLogger NetLogger Components NetLogger Toolkit contains the following components: –NetLogger message format –NetLogger client library –NetLogger visualization tools –NetLogger host/network monitoring tools Additional critical component for distributed applications: –NTP (Network Time Protocol) or GPS host clock is required to synchronize the clocks of all systems

NetLogger NetLogger Message Format We are using the IETF draft standard Universal Logger Message (ULM) format: a list of “field=value” pairs required fields: DATE, HOST, PROG; followed by optional user defined fields Sample ULM event DATE= HOST=foo.lbl.gov PROG=testprog LVL=Usage NL.EVNT=SEND_DATA SEND.SZ=49332 We are currently adding XML support as well

NetLogger NetLogger API NetLogger Toolkit includes application libraries for generating NetLogger messages – Can send log messages to: file host/port (netlogd) syslogd memory, then one of the above C, C++, Java, Fortran, Perl, and Python APIs are currently supported

NetLogger NetLogger API Only 6 simple calls: –NetLoggerOpen() create NetLogger handle –NetLoggerWrite() get timestamp, build NetLogger message, send to destination –NetLoggerGTWrite() must pass in results of Unix gettimeofday() call –NetLoggerFlush() flush any buffered message to destination –NetLoggerSetLevel() set ULM severity level –NetLoggerClose() destroy NetLogger handle

NetLogger Sample NetLogger Use lp = NetLoggerOpen(method, progname, NULL, hostname, NL_PORT); while (!done) { NetLoggerWrite(lp, "EVENT_START", "TEST.SIZE=%d", size); /* perform the task to be monitored */ done = do_something(data, size); NetLoggerWrite(lp, "EVENT_END"); } NetLoggerClose(lp);

NetLogger NetLogger Host/Network Tools Wrapped UNIX network and OS monitoring tools to log “interesting” events using the same log format –netstat (TCP retransmissions, etc.) –vmstat (system load, available memory, etc.) –iostat (disk activity) –ping These tools have been wrapped with Perl or Java programs which: –parse the output of the system utility –build NetLogger messages containing the results

NetLogger NetLogger Event “Life Lines”

NetLogger Event ID In order to associate a group of events into a “lifeline”, you must assign an event ID to each NetLogger event Sample Event Ids –file name –block ID –frame ID –user name –host name –etc.

NetLogger NetLogger Visualization Tools Exploratory, interactive analysis of the log data has proven to be the most important means of identifying problems –this is provided by nlv (NetLogger Visualization) nlv functionality: –can display several types of NetLogger events at once –user configurable: which events to plot, and the type of plot to draw (lifeline, load-line, or point) –play, pause, rewind, slow motion, zoom in/out, and so on –nlv can be run post-mortem or in real-time real-time mode done by reading the output of netlogd as it is being written

NetLogger NLV Example

NetLogger What to Instrument in Your Application Usually one will add NetLogger events to the following places in a distributed application: –before and after all disk I/O –before and after all network I/O –entering and leaving each distributed component –before and after any significant computation e.g.: an FFT operation –before and after any significant graphics call e.g.: certain CPU intensive OpenGL calls This is usually an iterative process –add more NetLogger events as you zero in on the bottlenecks

NetLogger Results

NetLogger Results: 2 nodes with Objectivity Error

NetLogger Results: dblock Example

NetLogger Results: Possible Deadlock

NetLogger Getting NetLogger Source code, binaries, and tutorial are available at: – Client libraries run on all Unix platforms Solaris, Linux, and Irix versions of nlv are currently supported