NetLogger GGF Distributed Application Analysis and Debugging using NetLogger v2 Lawrence Berkeley National Laboratory Brian L. Tierney.

Presentation transcript:

The Problem
  – When building distributed systems, we often observe unexpectedly low performance, the reasons for which are usually not obvious
  – The bottlenecks can be in any of the following components:
      the applications
      the operating systems
      the disks, network adapters, bus, memory, etc. on either the sending or receiving host
      the network switches and routers, and so on

Solution
A complete end-to-end monitoring framework that includes the following components:
  – instrumentation tools (application, middleware, and OS monitoring)
  – host and network sensors (host and network monitoring)
  – sensor management tools (sensor control system)
  – event publication service
  – event archive service
  – event analysis and visualization tools
  – a common set of formats and protocols for describing, exchanging and locating monitoring data

NetLogger Toolkit
We have developed the NetLogger Toolkit (short for Networked Application Logger), which includes:
  – tools to make it easy for distributed applications to log interesting events at every critical point
      NetLogger client library (C, C++, Java, Perl, Python)
  – tools for host and network monitoring
  – event visualization tools that allow one to correlate application events with host/network events
  – NetLogger event archive and retrieval tools (new)
NetLogger combines network, host, and application-level monitoring to provide a complete view of the entire system.
Open Source, available at

NetLogger Analysis: Key Concepts
NetLogger visualization tools are based on time-correlated and object-correlated events.
  – precision timestamps (default = microsecond)
If applications specify an “object ID” for related events, the NetLogger visualization tools can generate an object “lifeline”.
To associate a group of events into a “lifeline”, you must assign an “Event ID” to each NetLogger event.
  – Sample Event IDs: file name, block ID, frame ID, etc.

Example: Combined Host and Application Monitoring

NetLogger Tuning Results
[Figure: two NLV plots. First: I/O followed by processing (the next I/O starts when processing ends). Second: overlapped I/O and processing (remote I/O runs while the previous block is processed), giving almost a 2:1 speedup.]
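The "almost 2:1" figure is what a simple timing model predicts when I/O and processing take about the same time per block. The sketch below is only that model, not NetLogger code: the function names are made up for illustration, and it assumes a fixed per-block I/O time and processing time with one block of read-ahead.

```c
#include <assert.h>
#include <stddef.h>

/* Total time when each block is read and then processed, strictly in turn. */
double serial_time(size_t nblocks, double io_sec, double proc_sec)
{
    return (double)nblocks * (io_sec + proc_sec);
}

/* Total time when the read of block i+1 overlaps the processing of block i.
 * Assumes one block of read-ahead (hypothetical model, not measured data). */
double overlapped_time(size_t nblocks, double io_sec, double proc_sec)
{
    double slower = io_sec > proc_sec ? io_sec : proc_sec;

    if (nblocks == 0)
        return 0.0;
    /* first read, then (n - 1) overlapped steps, then the last processing */
    return io_sec + (double)(nblocks - 1) * slower + proc_sec;
}

/* Ratio of the two totals; approaches 2 when io_sec == proc_sec. */
double overlap_speedup(size_t nblocks, double io_sec, double proc_sec)
{
    assert(nblocks > 0);
    return serial_time(nblocks, io_sec, proc_sec) /
           overlapped_time(nblocks, io_sec, proc_sec);
}
```

With 100 blocks and equal I/O and processing times the ratio is 200/101, just under 2. When one stage dominates, the benefit shrinks toward 1, which is why the lifeline plot is useful for deciding whether overlapping is worth the effort.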

NetLogger Analysis of GridFTP: multi-stream bug

Sample NetLogger Use

    lp = NetLoggerOpen(progname, "x-netlog://loghost.lbl.gov", NL_MEM);
    while (!done) {
        NetLoggerWrite(lp, "EVENT_START", "TEST.SIZE=%d", size);
        /* perform the task to be monitored */
        done = do_something(data, size);
        NetLoggerWrite(lp, "EVENT_END");
    }
    NetLoggerClose(lp);

NetLoggerOpen() shell environment variables
Enable/disable logging:
    setenv NETLOGGER_ON {true, on, yes, 1}
        do logging
    setenv NETLOGGER_ON {false, off, no, 0}
        do not do logging
Log destination:
    setenv NETLOGGER_DEST <logging destination>
Examples:
    setenv NETLOGGER_DEST file://tmp/netlog.log
        write log messages to file /tmp/netlog.log
    setenv NETLOGGER_DEST x-netlog://loghost.lbl.gov
        send log messages to netlogd on host loghost.lbl.gov, default port
    setenv NETLOGGER_DEST x-netlog://loghost.lbl.gov:6006
        send log messages to netlogd on host loghost.lbl.gov, port 6006
NETLOGGER_DEST overrides the URL passed in via the NetLoggerOpen() call.
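To make the NETLOGGER_ON values concrete, here is a minimal sketch of how an application could interpret that variable itself. It is not the library's parser: the function name parse_netlogger_on is hypothetical, and exact, case-sensitive matching of the eight listed values is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Interpret a NETLOGGER_ON value.
 * Returns 1 for an "on" value, 0 for an "off" value, and -1 when the
 * value is NULL or not one of the eight strings listed on the slide.
 * Case-sensitive matching is an assumption of this sketch. */
int parse_netlogger_on(const char *value)
{
    static const char *const on_values[]  = { "true", "on", "yes", "1" };
    static const char *const off_values[] = { "false", "off", "no", "0" };
    size_t i;

    if (value == NULL)
        return -1;
    for (i = 0; i < sizeof on_values / sizeof on_values[0]; i++)
        if (strcmp(value, on_values[i]) == 0)
            return 1;
    for (i = 0; i < sizeof off_values / sizeof off_values[0]; i++)
        if (strcmp(value, off_values[i]) == 0)
            return 0;
    return -1;
}
```

A caller would typically pass the result of getenv("NETLOGGER_ON") (declared in stdlib.h) and treat -1 as "variable unset or unrecognized, fall back to the default".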

NetLogger Write Call
Creates and writes the log event:

    NetLoggerWrite(nl, "EVENT_NAME", "EVENTID=%d F2=%d F3=%s F4=%.2f",
                   id, user_data, user_string, user_float);

  – timestamps are generated automatically by the library
  – the “event name” field is required; all other fields are optional
  – this call is thread-safe: it automatically takes a mutex lock around the write call (compile-time option)
Example:

    NetLoggerWrite(nl, "HTTPD.START_DISK_READ",
                   "HTTPD.FILENAME=%s HTTPD.HOST=%s", filename, hostname);

Event ID
In order to associate a group of events into a “lifeline”, you must assign an event ID to each NetLogger event.
Sample event IDs:
  – file name
  – block ID
  – frame ID
  – socket ID
  – user name
  – host name
  – combination of the above
  – etc.

Sample NetLogger Use with Event IDs

    lp = NetLoggerOpen(progname, NULL, NL_ENV);
    for (i = 0; i < num_blocks; i++) {
        NetLoggerWrite(lp, "START_READ", "BLOCK_ID=%d BLOCK_SIZE=%d", i, size);
        read_block(i);
        NetLoggerWrite(lp, "END_READ", "BLOCK_ID=%d BLOCK_SIZE=%d", i, size);
        NetLoggerWrite(lp, "START_PROCESS", "BLOCK_ID=%d BLOCK_SIZE=%d", i, size);
        process_block(i);
        NetLoggerWrite(lp, "END_PROCESS", "BLOCK_ID=%d BLOCK_SIZE=%d", i, size);
        NetLoggerWrite(lp, "START_SEND", "BLOCK_ID=%d BLOCK_SIZE=%d", i, size);
        send_block(i);
        NetLoggerWrite(lp, "END_SEND", "BLOCK_ID=%d BLOCK_SIZE=%d", i, size);
    }
    NetLoggerClose(lp);
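On the analysis side, a lifeline is just the time-ordered sequence of events that share an event ID. The following sketch shows one way such grouping could be done after the log has been parsed. It is not how NLV is implemented: the struct layout and function names are hypothetical, and BLOCK_ID is assumed to be the event ID.

```c
#include <assert.h>
#include <stdlib.h>

/* One parsed log event (hypothetical in-memory layout). */
struct event {
    int block_id;       /* event ID shared by every event on one lifeline */
    double timestamp;   /* seconds */
    const char *name;   /* e.g. "START_READ" */
};

/* Order by event ID first, then by time, so each lifeline is contiguous. */
static int compare_events(const void *a, const void *b)
{
    const struct event *x = a;
    const struct event *y = b;

    if (x->block_id != y->block_id)
        return x->block_id < y->block_id ? -1 : 1;
    if (x->timestamp != y->timestamp)
        return x->timestamp < y->timestamp ? -1 : 1;
    return 0;
}

/* Sort events in place so that each lifeline is a time-ordered run. */
void group_into_lifelines(struct event *events, size_t count)
{
    qsort(events, count, sizeof events[0], compare_events);
}

/* Time from the first to the last event of one lifeline in a sorted array.
 * Returns -1.0 when no event carries the given ID. */
double lifeline_duration(const struct event *sorted, size_t count, int block_id)
{
    double first = 0.0, last = 0.0;
    int found = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (sorted[i].block_id != block_id)
            continue;
        if (!found) {
            first = sorted[i].timestamp;
            found = 1;
        }
        last = sorted[i].timestamp;
    }
    return found ? last - first : -1.0;
}
```

A lifeline whose duration is much longer than its neighbours, or whose slope flattens between two particular events, points at the stage worth instrumenting more finely.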

How to Instrument Your Application
You’ll probably want to add a NetLogger event at the following places in your distributed application:
  – before and after all disk I/O
  – before and after all network I/O
  – entering and leaving each distributed component
  – before and after any significant computation
      e.g.: an FFT operation
  – before and after any significant graphics call
      e.g.: certain CPU-intensive OpenGL calls
This is usually an iterative process
  – add more NetLogger events as you zero in on the bottleneck

NLV Analysis Tool: Plots Time vs. Event Name
[Figure: annotated NLV screenshot. Labeled elements: title, menu bar, events, legend, scale for load-line/points, zoom box, zoom window controls, zoom-box actions, window size, max window size, playback controls, playback speed, summary line, time axis, “you are here” marker.]

Use Case: Instrumented GridFTP server
An efficient data channel protocol is extremely important. For example, consider the following:
  – The FTP server is instrumented to log the start and end times of all network and disk reads and writes, which are in blocks of 64 KBytes
  – The FTP server has a fast RAID disk array and a 1000BT network
  – 10 simultaneous clients (parallel GridFTP)
  – Each client is transferring data at 10 MBytes/second
  – Total server throughput = 100 MBytes/sec
This will generate roughly 6250 events/second of monitoring data (about 1560 blocks per second, with four events per block: the start and end of the disk I/O and of the network I/O).
Assuming each monitoring event is 50 bytes, this equates to 313 KBytes/second, or 1.1 GBytes per hour, of monitoring data.
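The arithmetic behind these figures can be checked with a few lines of C. The sketch assumes decimal units (1 MByte = 10^6 bytes, 1 KByte = 10^3 bytes) and four events per 64 KByte block (start and end of one disk operation and of one network operation); both assumptions are inferred from the numbers rather than stated in the source, and the function names are illustrative.

```c
#include <assert.h>

/* Monitoring events generated per second for a given data throughput.
 * events_per_block is 4 under the assumption stated above. */
double events_per_second(double throughput_bytes_per_sec,
                         double block_bytes,
                         double events_per_block)
{
    assert(block_bytes > 0.0);
    return throughput_bytes_per_sec / block_bytes * events_per_block;
}

/* Volume of monitoring data produced per second, in bytes. */
double monitoring_bytes_per_second(double events_per_sec, double bytes_per_event)
{
    return events_per_sec * bytes_per_event;
}

/* Volume of monitoring data produced per hour, in bytes. */
double monitoring_bytes_per_hour(double events_per_sec, double bytes_per_event)
{
    return monitoring_bytes_per_second(events_per_sec, bytes_per_event) * 3600.0;
}
```

Calling events_per_second(100e6, 64e3, 4.0) gives 6250 events/second; at 50 bytes per event that is 312,500 bytes/second (about 313 KBytes/second) and 1.125e9 bytes/hour (about 1.1 GBytes/hour).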

NetLogger v2 Improvements
Rewrite of client library
  – Multiple log formats allowed with the same API
      ASCII (ULM; name = value)
          e.g.: DATE= HOST=foo.lbl.gov PROG=testprog NL.EVNT=SEND_DATA SEND_SIZE=49332
      Binary log format
          much better performance, and requires less bandwidth, than ASCII
          crucial for the GridFTP use case: the ASCII NetLogger format is not fast enough
  – Other language APIs automatically generated with SWIG
      Much faster than “100% native” implementations, especially for scripting languages such as Perl, Python, and Tcl
      Changes and bug fixes in the core are automatically propagated to all APIs

NetLogger v2 Improvements
Reliability
  – Provides fault tolerance in a WAN environment
  – Recovers gracefully from network problems
  – Also recovers from restart/reboot of the log receiver
Trigger API
  – Dynamically start, stop, and change logging levels

New NetLogger Feature: Binary Format
  – Data is sent in native format: “receiver-makes-right”
  – IEEE floating point only; simple data types supported
  – Lots of buffering (64K) to reduce system call overhead
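“Receiver-makes-right” means the sender writes values in its own byte order and the receiver converts only when its byte order differs. The sketch below illustrates that general technique for a 32-bit integer. It does not describe the actual NetLogger wire format: the sender_little_endian flag is a hypothetical stand-in for whatever byte-order indication a message carries, and all names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t swap_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Detect the byte order of the machine running this code. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Convert a value as received into host order.
 * sender_little_endian is a hypothetical flag taken from the message;
 * the value is swapped only when sender and receiver disagree. */
uint32_t receive_u32(uint32_t wire_value, int sender_little_endian)
{
    if ((sender_little_endian != 0) == (host_is_little_endian() != 0))
        return wire_value;          /* same byte order: no work at all */
    return swap_u32(wire_value);    /* different byte order: receiver fixes it */
}
```

The design choice pays off when sender and receiver share an architecture, which is the common case: neither side spends any cycles on conversion, whereas a fixed network byte order would force a swap on both ends of every little-endian pair.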

New NetLogger Feature: Reliability API
The Reliability API is used to provide fault tolerance
  – if the remote NetLogger receiver becomes unavailable, logging automatically fails over to an alternate location (e.g.: local disk or a second receiver)
  – option to periodically try to reconnect to the remote server
      send the local disk file on reconnect
NetLoggerSetBackup( handle, char *backupURL, int SendonReconnect )
  – Set the backup URL
  – SendonReconnect flag: if the backup URL is local disk, send these results after a successful reconnect
NetLoggerSetReconnect( handle, int sec )
  – Set the number of seconds between reconnect attempts when the backup URL is in use
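The failover behaviour described here can be pictured as a two-state machine: log to the primary receiver until it fails, then log to the backup while retrying the primary at a fixed interval. The sketch below models only that decision logic. It is not the NetLogger implementation, and every type and function name in it is hypothetical.

```c
#include <assert.h>
#include <time.h>

enum log_target { TARGET_PRIMARY, TARGET_BACKUP };

/* Hypothetical bookkeeping for one log handle. */
struct failover_state {
    enum log_target target;   /* where events are currently written */
    time_t last_attempt;      /* time of the failure or last reconnect try */
    int reconnect_sec;        /* seconds between reconnect attempts */
};

/* Called when a write to the primary receiver fails. */
void on_primary_failure(struct failover_state *s, time_t now)
{
    s->target = TARGET_BACKUP;
    s->last_attempt = now;
}

/* True when the backup is in use and the retry interval has elapsed. */
int should_try_reconnect(const struct failover_state *s, time_t now)
{
    return s->target == TARGET_BACKUP &&
           s->reconnect_sec > 0 &&
           difftime(now, s->last_attempt) >= (double)s->reconnect_sec;
}

/* Record the outcome of a reconnect attempt. On success the caller
 * would also forward any locally buffered events, if so configured. */
void on_reconnect_result(struct failover_state *s, time_t now, int succeeded)
{
    s->last_attempt = now;
    if (succeeded)
        s->target = TARGET_PRIMARY;
}
```

Keeping the retry check separate from the write path means a failed receiver costs at most one connection attempt per interval, so the instrumented application is not slowed down while the receiver is unreachable.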

New NetLogger Feature: Trigger API
The Trigger API is used to activate monitoring from an external “activation service”
NetLoggerSetTrigger( handle, char *filename, int sec )
  – Check the configuration file specified in “filename” every “sec” seconds
int NetLoggerSetTriggerFunc( handle, fn_t *fnptr, int sec )
  – Call a user-defined trigger function every “sec” seconds
NetLoggerCheckTrigger( handle )
  – Check the trigger function immediately (in addition to the periodic check from NetLoggerSetTrigger)

Monitoring Event Archive
Archived event data is required for
  – performance analysis and tuning
      compare current performance to previous results
  – accounting
The archive must be extremely high performance and scalable to ensure that it does not become a bottleneck.
  – a heavily loaded FTP server could generate > 6000 monitoring events per second
Must use pipelining to guarantee that applications and sensors never block when writing to the archive
  – buffer event data on disk
SQL capability desirable
  – ability to do complex queries, find averages, standard deviations, etc.

Monitoring Event Archive
Requirements for FTP server monitoring: roughly 6250 events/sec, 313 KBytes/second (1.1 GBytes/hour)

Current Work

NetLogger + GMA
NetLogger + Grid Monitoring Architecture (GMA)
  – GMA is a publish/subscribe event system
  – Using SOAP for the control channel
  – Using NetLogger binary for the data channel
      Sender: NetLoggerWrite()
      Receiver: NetLoggerRead()
The NetLogger binary format is a very efficient transport protocol for simple event data

Activation Service
We are using the GMA producer/consumer interfaces to build an activation service for NetLogger
  – Creates the “trigger file” used by the NetLogger trigger API

Timestamps: Clock Synchronization Issues
Correlating events from multiple systems requires synchronized clocks
  – NTP (Network Time Protocol) is required to synchronize the clocks of all systems
How accurate does this synchronization need to be?
  – We have found that analyzing systems from the “user perspective” requires:
      microsecond resolution between events on a single host (gettimeofday() system call)
      millisecond resolution between WAN hosts
          fairly easy to achieve this with NTP
      somewhere in between for LAN hosts
Recommendation: everyone use the IETF timestamp standard
  – example: T21:20: Z
  – YYYY-MM-DDTHH:MM:SS.SZ (T = date-time separator, Z = GMT)
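A timestamp in the recommended YYYY-MM-DDTHH:MM:SS.SZ layout can be produced with standard C library calls. The sketch below is one possible formatter, not NetLogger's own: the function name is hypothetical, six fractional digits are chosen to match microsecond resolution, and time_t is assumed to count seconds since the Unix epoch, as on POSIX systems.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Format seconds-since-epoch plus microseconds as
 * YYYY-MM-DDTHH:MM:SS.ffffffZ in UTC.
 * Returns 0 on success, -1 on bad input or if buf is too small.
 * Note: gmtime() uses static storage, so this sketch is not thread-safe. */
int format_utc_timestamp(time_t seconds, long microseconds,
                         char *buf, size_t buflen)
{
    struct tm *utc;
    char date_time[32];
    int written;

    if (microseconds < 0 || microseconds > 999999)
        return -1;
    utc = gmtime(&seconds);
    if (utc == NULL)
        return -1;
    if (strftime(date_time, sizeof date_time, "%Y-%m-%dT%H:%M:%S", utc) == 0)
        return -1;
    written = snprintf(buf, buflen, "%s.%06ldZ", date_time, microseconds);
    if (written < 0 || (size_t)written >= buflen)
        return -1;
    return 0;
}
```

On a POSIX host the two inputs would come from the tv_sec and tv_usec fields filled in by gettimeofday(). Formatting in UTC rather than local time keeps logs from hosts in different time zones directly comparable.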

For More Information