Techniques for Monitoring Large Loosely-coupled Cluster Jobs
Brian L. Tierney, Dan Gunter
Distributed Systems Department, Lawrence Berkeley National Laboratory
GPW2005, GGF

Tightly Coupled vs. Loosely Coupled

Cluster applications can be classified as follows:
–Tightly coupled: jobs have a large amount of communication between nodes, usually through specialized interfaces such as the Message Passing Interface (MPI)
–Loosely coupled: jobs have occasional synchronization points but are largely independent
–Uncoupled: jobs have no communication or synchronization points
An important class of parallel processing jobs on clusters today is workflow-based applications that process large amounts of data in parallel
–e.g., searching for supernovae or Higgs particles
–In this context we define a workflow as the processing steps required to analyze a unit of data

Uncoupled / Loosely Coupled Jobs

This type of computing is often I/O-bound or database-bound, not CPU-bound.
Performance analysis requires system-wide analysis of competition for resources such as disk arrays and database tables
–This is very different from traditional parallel-processing analysis of CPU usage and explicitly synchronized communication
A number of performance analysis tools focus on tightly coupled applications
–We focus instead on uncoupled and loosely coupled applications

Tools for Tightly Coupled Jobs

Traditional parallel-computing performance analysis tools focus on CPU usage, communication, and memory access patterns, e.g.:
–TAU
–Paraver
–FPMPI
–Intel Trace Collector
A number of other projects started out mainly for tightly coupled applications, then were extended or adapted to work for loosely coupled systems as well. These include:
–SvPablo
–Paradyn
–Prophesy

Sample Loosely Coupled Job

An example of an uncoupled cluster application is the Nearby Supernova Factory (SNfactory) project at LBL
–Mission: to find and analyze nearby Type Ia supernovae
SNfactory jobs are submitted to the PDSF cluster at NERSC, and typically run across multiple nodes
SNfactory jobs produce about one monitoring event per second on each node
–a total of up to roughly 1,100,000 events per day
Roughly 1% of jobs were failing for unknown reasons
–The SNfactory group came to us for help

Sample Loosely Coupled Job [figure]

Sample Distribution of Job Completion Time [figure]
Q: What is the cause of the very long tail?

NetLogger Toolkit

We have developed the NetLogger Toolkit (short for Networked Application Logger), which includes:
–tools to make it easy for distributed applications to log interesting events at every critical point
–tools for host and network monitoring
The approach combines network, host, and application-level monitoring to provide a complete view of the entire system. This has proven invaluable for:
–isolating and correcting performance bottlenecks
–debugging distributed applications

NetLogger Components

The NetLogger Toolkit contains the following components:
–NetLogger message format
–NetLogger client library (C, Java, Python, Perl)
–NetLogger visualization tools
–NetLogger host/network monitoring tools
An additional critical component for distributed applications:
–NTP (Network Time Protocol) or a GPS host clock is required to synchronize the clocks of all systems

NetLogger Methodology

NetLogger is both a methodology for analyzing distributed systems and a set of tools to help implement that methodology.
–You can use the NetLogger methodology without using any of the LBNL-provided tools.
The NetLogger methodology consists of the following:
1. Instrument all components to produce monitoring events. These components include application software, middleware, operating systems, and networks; the more components that are instrumented, the better.
2. Use a common format, a common set of attributes, and a globally synchronized timestamp for all monitoring events.
3. Log all of the following events: entering and exiting any program or software component, and the begin/end of all I/O (disk and network).
4. Collect all log data in a central location.
5. Use event-correlation and visualization tools to analyze the monitoring event logs.

NetLogger Analysis: Key Concepts

NetLogger visualization tools are based on time-correlated and object-correlated events
–precision timestamps (default = microsecond)
If an application specifies an object ID for related events, the NetLogger visualization tools can generate an object lifeline
To associate a group of events into a lifeline, you must assign an object ID to each NetLogger event
–Sample object IDs: file name, block ID, frame ID, Grid job ID, etc.
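As an illustration, here is a minimal sketch (in Python, not the actual NetLogger API) of how events sharing an object ID can be grouped into lifelines; the dict-based event records and the "ID" and "DATE" field names are assumptions:

    from collections import defaultdict

    def build_lifelines(events):
        """Group monitoring events into per-object lifelines, ordered by time.

        Sketch only: each event is assumed to be a dict carrying its object
        ID under "ID" and its timestamp under "DATE".
        """
        lifelines = defaultdict(list)
        for ev in events:
            lifelines[ev["ID"]].append(ev)
        for line in lifelines.values():
            line.sort(key=lambda ev: ev["DATE"])  # time-correlate within each object
        return lifelines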

Sample NetLogger Instrumentation

    log = netlogger.LogOutputStream("my.log")
    done = 0
    while not done:
        log.write("EVENT.START", {"TEST.SIZE": size})
        # perform the task to be monitored
        done = do_something(data, size)
        log.write("EVENT.END", {})

Sample event:
t DATE= s HOST=gridhost.lbl.gov s PROG=gridApp l LVL=Info s NL.EVNT=WriteData l SEND.SZ=49332
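For reference, a minimal sketch of parsing this typed key=value record format; the meaning of the one-letter type codes (t, s, l, presumably timestamp, string, and long) is inferred from the sample above, not taken from the NetLogger specification:

    def parse_event(line):
        """Parse a typed key=value record like the sample event above."""
        event = {}
        tokens = line.split()
        # tokens alternate between a one-letter type code and a KEY=VALUE pair
        for type_code, pair in zip(tokens[::2], tokens[1::2]):
            key, _, value = pair.partition("=")
            if type_code == "l" and value.lstrip("-").isdigit():
                event[key] = int(value)  # long-typed numeric field
            else:
                event[key] = value
        return event

    sample = ("t DATE= s HOST=gridhost.lbl.gov s PROG=gridApp "
              "l LVL=Info s NL.EVNT=WriteData l SEND.SZ=49332")
    print(parse_event(sample))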

SNfactory Lifelines [figure]

Scaling Issues

Running a large number of workflows on a cluster will generate far too much monitoring data to spot problems using the standard NetLogger lifeline visualization techniques
–Even for a small set of nodes, these plots can be very dense

Anomaly Detection

To address this problem, we designed and developed a new NetLogger automatic anomaly detection tool called nlfindmissing
The basic idea is to identify lifelines that are missing events (see the sketch below)
–Users define the events that make up an important linear sequence within the workflow as a lifeline
The tool then outputs the incomplete lifelines to a data file or stream
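A minimal sketch of this idea (not the actual nlfindmissing implementation; the event representation and field names are assumptions):

    def find_incomplete(events, sequence, timeout, now):
        """Report lifelines that are missing steps of the expected sequence.

        Sketch only: events are dicts with "ID", "EVENT", and "DATE"
        (seconds) fields; a lifeline is flagged once it has waited longer
        than `timeout` without completing.
        """
        lifelines = {}
        for ev in events:
            lifelines.setdefault(ev["ID"], {})[ev["EVENT"]] = ev["DATE"]
        incomplete = []
        for oid, seen in lifelines.items():
            missing = [step for step in sequence if step not in seen]
            if missing and now - min(seen.values()) > timeout:
                incomplete.append((oid, missing))
        return incomplete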

Lifeline Timeouts

Issue: given an open-ended dataset that is too large to fit in memory, how do we determine when to give up waiting for a lifeline to complete?
Our solution:
–approximate the density function of the lifeline latencies by maintaining a histogram with a relatively large number of bins (e.g., 1000)
–the timeout becomes a user-selected section of the tail of that histogram, e.g. the 99th percentile
This works well, runs in a fixed memory footprint, is computationally cheap, and does not rely on any assumptions about the distribution of the data
–additional parameters, such as minimum and maximum timeout values and how many lifelines to use as a "baseline" for dynamic calculations, make the method more robust to messy real-world data
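A minimal sketch of this histogram-based timeout, assuming a fixed latency range mapped onto the bins (the class and parameter names are illustrative, not the actual implementation):

    class TimeoutEstimator:
        """Approximate a percentile of lifeline latencies in fixed memory."""

        def __init__(self, num_bins=1000, max_latency=3600.0, percentile=0.99):
            self.num_bins = num_bins
            self.max_latency = max_latency  # latencies above this clamp to the last bin
            self.percentile = percentile
            self.counts = [0] * num_bins
            self.total = 0

        def observe(self, latency):
            # map the latency onto a bin, clamping to the histogram range
            i = min(int(latency / self.max_latency * self.num_bins),
                    self.num_bins - 1)
            self.counts[i] += 1
            self.total += 1

        def timeout(self):
            # walk the bins until the cumulative count reaches the target
            # percentile; the right edge of that bin is the timeout
            if self.total == 0:
                return self.max_latency
            target = self.percentile * self.total
            cum = 0
            for i, count in enumerate(self.counts):
                cum += count
                if cum >= target:
                    return (i + 1) * (self.max_latency / self.num_bins)
            return self.max_latency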

Anomalies Only [figure]

Anomalies Plus Context [figure]

NetLogger Cluster Deployment [figure]

Monitoring Data Management Issues

A challenge for application instrumentation on large clusters is sifting through the volume of data that even a modest amount of instrumentation can generate. For example:
–a 24-hour application run produces 50 MB of application and host monitoring data per node
–a 32-node cluster might be almost manageable (50 MB x 32 nodes = 1.6 GB)
–but when scaled to a 512-node cluster, the amount of data becomes quite unwieldy (50 MB x 512 nodes = 25.6 GB)

Data Collection [figure]

nldemux

The nldemux tool is then used to group monitoring data into manageable pieces:
–the Ganglia data is placed in its own directory, and data from each node is written to a separate file
–the entire directory is rolled over once per day
–the workflow data are placed in files named for the observation date at the telescope; this information is carried in each event record
–data is removed after 3 weeks
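A minimal sketch of this kind of demultiplexing (the field names and directory layout are assumptions for illustration, not nldemux's actual configuration):

    import os

    def demux_event(event, base_dir):
        """Route one monitoring event to a file, grouped as described above."""
        if event.get("PROG") == "ganglia":
            # host data: its own directory, one file per node
            path = os.path.join(base_dir, "ganglia", event["HOST"] + ".log")
        else:
            # workflow data: file named for the telescope observation date,
            # which is carried in each event record
            path = os.path.join(base_dir, "workflow", event["OBS.DATE"] + ".log")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "a") as f:
            f.write(" ".join("%s=%s" % (k, v) for k, v in event.items()) + "\n")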

NetLogger and Grid Job IDs [figure]

Grid Workflow Identifiers (GIDs)

A globally unique key is needed to identify a workflow
It must be propagated down and across workflow components
–This is the hard part!
–Options: modify application interfaces, or add a SOAP header
Acronyms:
–RFT = Reliable File Transfer service
–GridFTP = Grid File Transfer Protocol
–PBS = Portable Batch System
–HPSS = High Performance Storage System
–SRM = Storage Resource Manager
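A minimal sketch of generating and propagating a GID (the UUID choice and the environment-variable name GRID_WORKFLOW_GID are illustrative assumptions; the options above, application-interface changes or a SOAP header, are the real propagation paths):

    import os
    import uuid

    # Reuse an inherited GID if one was propagated to us; otherwise this is
    # the top of the workflow, so generate a new globally unique key.
    gid = os.environ.get("GRID_WORKFLOW_GID") or str(uuid.uuid4())

    # Pass the GID down to child components, here via the environment.
    os.environ["GRID_WORKFLOW_GID"] = gid

    # Each component then tags its monitoring events with the GID so events
    # from RFT, GridFTP, PBS, etc. can be correlated into one workflow, e.g.:
    # log.write("GRIDFTP.TRANSFER.START", {"GID": gid})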

Without GIDs [figure]

With GIDs [figure]

For More Information

–Source code (open source) and publications are available