CEDPS: Center for Enabling Distributed Petascale Science Brian Tierney Lawrence Berkeley National Laboratory


2 CEDPS in a Nutshell
- Center for Enabling Distributed Petascale Science
  - CEDPS is pronounced “seds” (silent P)
- DOE SciDAC Center for Enabling Technology
  - July 1, 2006 to June 30, 2011, $2.4M/yr
- Collaboration between 5 sites
  - Argonne National Laboratory
  - Fermi National Accelerator Laboratory
  - Lawrence Berkeley National Laboratory
  - USC Information Sciences Institute
  - University of Wisconsin Madison
- Three focus areas
  - Moving data to compute resources
  - Moving compute services to data sites
  - Troubleshooting and diagnosis tools

3 The Petascale Data Challenge
- DOE facilities generate many petabytes of data (2 petabytes = all U.S. academic research libraries!)
- Remote users (at labs, universities, industry) need data!
- Rapid, reliable access is key to maximizing the value of $B facilities
[Figure: DOE facilities holding massive data, surrounded by remote distributed users]

4 Bridging the Divide (1): Move Data to Users When & Where Needed
“Deliver this 100 Terabytes to locations A, B, C by 9am tomorrow”
- Fast: >10,000x faster than usual Internet
- Reliable: recover from many failures
- Predictable: data arrives when scheduled
- Secure: protect expensive resources & data
- Scalable: deal with many users & much data

5 Bridging the Divide (2): Allow Users to Move Computation Near Data
“Perform my computation F on datasets X, Y, Z”
- Science services: provide analysis functions near data source
- Flexible: easy integration of functions
- Secure: protect expensive resources & data
- Scalable: deal with many users & much data

6 Bridging the Divide (3): Troubleshoot End-to-End Problems
“Why did my data transfer (or remote operation) fail?”
- Identify & diagnose failures & performance problems
- Instrument: include monitoring points in all system components
- Monitor: collect data in response to problems
- Diagnose: identify the source of problems

CEDPS Troubleshooting Work

8 The Troubleshooting Problem
- Large production Grids (OSG, TeraGrid, etc.) report a high failure rate
  - 10-30% of jobs submitted to the Grid fail
  - mostly authentication errors and disk space problems
  - Users don’t always notice, as jobs may be automatically resubmitted and may succeed the next time
- Troubleshooting in this environment is very difficult
- Current Approach
  - Log into all hosts used (if possible)
  - ‘grep’ various log files looking for problems
    - Inconsistent logging levels
    - Multiple file formats
  - Often a tedious and time-consuming process

9 Our Approach
- Discover broken services before users do
  - Deploy monitoring software that can send alerts on errors
  - Run test jobs to detect problems
- When log analysis is needed
  - Use a unified approach to logging for applications and middleware
  - Collect logs centrally
  - Develop automatic analysis tools for anomaly detection

10 Logging Solution: “NetLogger” Methodology
- NetLogger (short for “Networked Application Logger”)
  - Methodology for troubleshooting distributed applications
  - Troubleshooting = detection and analysis of faults and performance issues
- Key Components:
  - common log format: name=value pairs
  - precision ISO-format timestamp and synchronized clocks (via NTP)
  - wrap all “interesting” actions with ‘start’ and ‘end’ log events
    - interesting = anything that might fail or run slowly
  - use unique event names
    - e.g.: event=org.globus.gridFTP.authn.start
  - include “event ID” in each log message
    - allows correlation of sets of events
  - collect logs in a centralized database
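The key components above can be sketched in a few lines of Python. This is a minimal illustration and not the NetLogger client API: the helper names `format_event` and `logged_action`, the `org.example` namespace, and the use of `status=-1` for failures are all assumptions made for the example.

```python
import sys
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone


def format_event(event, **fields):
    """Return one log line of name=value pairs with an ISO-format UTC timestamp."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    pairs = [("ts", ts), ("event", event)] + list(fields.items())
    # Values containing spaces would need quoting; omitted to keep the sketch short.
    return " ".join(f"{name}={value}" for name, value in pairs)


@contextmanager
def logged_action(name, guid, out=sys.stdout, **fields):
    """Wrap an action with '<name>.start' and '<name>.end' events sharing one guid."""
    print(format_event(name + ".start", guid=guid, **fields), file=out)
    status = 0
    try:
        yield
    except Exception:
        status = -1  # hypothetical convention: non-zero status marks a failure
        raise
    finally:
        print(format_event(name + ".end", guid=guid, status=status), file=out)


if __name__ == "__main__":
    guid = str(uuid.uuid4()).upper()
    with logged_action("org.example.copy.transfer", guid, file="/tmp/myfile"):
        pass  # the "interesting" operation would run here
```

Because the end event is written in a `finally` clause, a failing action still produces a matching end line, which is what makes later start/end correlation possible.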

11 Example Log: GridFTP
ts= T18:39: Z event=org.globus.gridFTP.start prog=GridFTP localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
ts= T18:39: Z event=org.globus.gridFTP.authn.start DN=“/DC=org/DC=doegrids/OU=People/CN=Somebody” guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
ts= T18:39: Z event=org.globus.gridFTP.authn.end DN=“/DC=org/DC=doegrids/OU=People/CN=Somebody” msg=“ successfully authorized” localUser=uscmspool381 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
ts= T18:39: Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize= numStreams=1 numStripes=1 destHost= guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
ts= T18:45: Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred= guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
ts= T18:45: Z event=org.globus.gridFTP.end guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=226
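A log in this format is easy to analyze mechanically. The sketch below parses name=value lines and computes the elapsed time between matching start and end events that share a guid. It is an illustration under assumptions: the function names are hypothetical, the sample lines use made-up timestamps and a shortened guid, and quoted values are assumed to use plain ASCII quotes.

```python
import shlex
from datetime import datetime


def parse_line(line):
    """Split one name=value log line into a dict; shlex keeps quoted values whole."""
    return dict(token.split("=", 1) for token in shlex.split(line))


def parse_ts(ts):
    """Parse an ISO-format UTC timestamp such as 2007-01-01T18:39:00.000000Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def durations(lines):
    """Return {(guid, action): seconds} for each matched start/end pair."""
    started, result = {}, {}
    for rec in map(parse_line, lines):
        action, _, phase = rec["event"].rpartition(".")
        key = (rec["guid"], action)
        if phase == "start":
            started[key] = parse_ts(rec["ts"])
        elif phase == "end" and key in started:
            result[key] = (parse_ts(rec["ts"]) - started.pop(key)).total_seconds()
    return result


# Made-up sample data for illustration only.
SAMPLE = [
    "ts=2007-01-01T18:39:00.000000Z event=org.globus.gridFTP.transfer.start "
    "file=/tmp/myfile guid=G-1",
    "ts=2007-01-01T18:45:30.500000Z event=org.globus.gridFTP.transfer.end "
    "file=/tmp/myfile guid=G-1 status=0",
]

if __name__ == "__main__":
    for (guid, action), seconds in durations(SAMPLE).items():
        print(f"{action} guid={guid} took {seconds} s")
```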

12 Log Collection
- No need to invent something new for this
- The syslog-ng tool fills all requirements
  - Open source, runs on all major OSes
  - Fault tolerant, secure (via stunnel), scalable, easy to configure, etc.
  - Large user base
  - Can filter logs based on level and content
  - Arbitrary number of sources and destinations
  - Can act as a proxy, tunnel through firewalls
  - Execute programs: Send , load database, etc.
  - Built-in log rotation
  - Timezone support and fully qualified host names
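syslog-ng is configured in its own configuration language, which is not shown here. As a rough sketch of the sending side only, an application could forward its name=value lines to a syslog collector with Python's standard `logging.handlers.SysLogHandler`. The function name, logger name, host, and port below are placeholders, not part of any CEDPS deployment.

```python
import logging
import logging.handlers


def make_forwarding_logger(host, port, name="netlogger.sketch"):
    """Return a logger that sends each message to a syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not also print through the root logger
    handler = logging.handlers.SysLogHandler(address=(host, port))  # UDP by default
    handler.setFormatter(logging.Formatter("%(message)s"))  # keep the line unchanged
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Placeholder address: a collector assumed to listen on localhost, UDP port 514.
    log = make_forwarding_logger("127.0.0.1", 514)
    log.info("event=org.example.job.start guid=EXAMPLE-GUID")
```

UDP delivery is unacknowledged, so a production setup would more likely hand logs to a local syslog-ng instance and let it deal with reliable, secured forwarding.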

13 Log collection using syslog-ng

14 Sample Site Deployment For Grid Middleware

15 Troubleshooting Architecture: Site Archive

Log Summarization Library: New Component of the NetLogger Toolkit

17 NetLogger Toolkit
- Client library that makes it easy to generate logs
  - C, Java, Perl, Python
  - First released in 1994
  - Summarizer new in NetLogger 4.0 release, Nov
  - Open Source available at
- Visualization support
  - scripts to convert logs to R, gnuplot, Excel format
  - collection of sample R and gnuplot scripts
- Database tools
  - tools to load logs into MySQL

18 New NetLogger Feature: Time-based summarization
Example: you want detailed I/O instrumentation of a simple loop that reads data, processes it, then writes out some result.
    DO for i=1, N
        log("read.start", ...)
        read_data()
        log("read.end", ...)
        process_data()
        log("write.start", ...)
        write_data()
        log("write.end", ...)
    DONE

19 Time-based summarization
- Full logging can produce many gigabytes of logs!
- Summarized logs report only a periodic summary (mean, sd) of times and values between “start” and “end” events (of a given type and ID).
- Orders of magnitude data reduction with, for many purposes, no loss in explanatory power.
- Since the original instrumentation is still there, you can turn the full detail on and off in a running program.
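The idea of time-based summarization can be illustrated with a small accumulator. This is a sketch, not the NetLogger summarizer: the class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the standard deviation is the population form.

```python
import math


class Summarizer:
    """Accumulate start/end durations and emit one summary per time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.window_start = None
        self.open = {}      # (event, id) -> time of the pending start
        self.samples = {}   # event -> durations seen in the current window

    def start(self, event, ident, now):
        if self.window_start is None:
            self.window_start = now
        self.open[(event, ident)] = now

    def end(self, event, ident, now):
        """Record a duration; return summaries if the window has elapsed, else []."""
        begun = self.open.pop((event, ident), None)
        if begun is not None:
            self.samples.setdefault(event, []).append(now - begun)
        if self.window_start is not None and now - self.window_start >= self.window:
            return self.flush(now)
        return []

    def flush(self, now):
        """Return one summary dict per event type and begin a new window."""
        out = []
        for event, values in sorted(self.samples.items()):
            n = len(values)
            mean = sum(values) / n
            sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
            out.append({"event": event + ".summary", "n": n, "mean": mean, "sd": sd})
        self.samples = {}
        self.window_start = now
        return out
```

In a real program `now` would come from a clock such as `time.time()`, and each returned summary would be written out as a single log line in place of the individual start and end events.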

20 Sample Use: GridFTP Bottleneck Detector
- GridFTP today
  - Only logs a single throughput number
  - Throughput might be limited by
    - Sending/receiving disk (NFS/AFS mounted partition?)
    - Network
  - Currently no way to tell
- New GridFTP enhancement
  - Use log summarization library to monitor all I/O streams
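One way to turn per-stream summaries into a bottleneck report is to compare the rate achieved inside each stream's I/O calls and pick the slowest. The sketch below assumes that approach; it is not taken from the GridFTP code, and the stream names and numbers are invented for illustration.

```python
def stream_rates(summaries):
    """Map each I/O stream to bytes per second of time spent inside its calls."""
    return {
        name: s["bytes"] / s["seconds"]
        for name, s in summaries.items()
        if s["seconds"] > 0
    }


def bottleneck(summaries):
    """Return (stream, rate) for the slowest stream, the likely limiting factor."""
    rates = stream_rates(summaries)
    name = min(rates, key=rates.get)
    return name, rates[name]


# Invented numbers: the same 1 GB passes through each stage of one transfer.
EXAMPLE = {
    "disk.read": {"bytes": 1_000_000_000, "seconds": 20.0},
    "net.write": {"bytes": 1_000_000_000, "seconds": 8.0},
    "net.read": {"bytes": 1_000_000_000, "seconds": 8.0},
    "disk.write": {"bytes": 1_000_000_000, "seconds": 10.0},
}

if __name__ == "__main__":
    name, rate = bottleneck(EXAMPLE)
    print(f"bottleneck={name} rate={rate / 1e6:.1f} MB/s")
```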

21 Full vs Summary Results
- Bottleneck = disk read

22 More Information
- General CEDPS information:
- Log Summarizer:
- New GridFTP with NetLogger Log Summarizer:
- Contact us if you need troubleshooting help!

Extra Slides

24 NetLogger Summarizer vs ‘R’ summarized logs

25 Logging “Best Practices” Recommendations
- Practices
  - All logs should contain a unique event name and an ISO-format timestamp
  - All system operations that might fail or experience performance variations should be wrapped with start and end events
  - All logs from a given execution thread should be tagged with a globally unique ID (GUID), such as a Universally Unique Identifier (UUID)
- Log format
  - Logs should be composed of lines of ASCII name=value pairs
  - Example:
    ts= T18:48: Z event=org.globus.gridFTP.transfer.start prog=GridFTP-v4.2 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2

26 Event Names
- Use a '.' as a separator and go from general to specific
  - Same as Java class names
- First part of name should be used as a unique namespace (e.g.: org.globus)
- Use start/end suffixes whenever possible
  - Helps immensely with troubleshooting
- Examples
  - org.globus.gridFTP.start
  - org.globus.gridFTP.authn.start
  - org.globus.gridFTP.authn.end
  - org.globus.gridFTP.transfer.start
  - org.globus.gridFTP.transfer.end
  - org.globus.gridFTP.end
  - org.globus.MDS.response.start
  - org.globus.MDS.query.start
  - org.globus.MDS.query.end
  - org.globus.MDS.write.net.start
  - org.globus.MDS.write.net.end
  - org.globus.MDS.response.end
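The start/end suffix convention pays off when looking for operations that began but never finished. The following sketch checks a name against the dotted convention and lists unfinished actions in an event sequence. The regular expression is an assumed reading of the convention (alphanumeric components, at least one dot), and the function names are hypothetical.

```python
import re

# Assumed pattern: dot-separated alphanumeric components, at least two of them.
EVENT_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9]*(\.[A-Za-z][A-Za-z0-9]*)+$")


def is_valid_name(name):
    """Return True if the name looks like a dotted, namespaced event name."""
    return EVENT_NAME.match(name) is not None


def split_event(name):
    """Return (action, phase), where phase is 'start', 'end', or None."""
    action, _, last = name.rpartition(".")
    if last in ("start", "end"):
        return action, last
    return name, None


def unfinished(events):
    """Return actions that logged a start but no matching end, in start order."""
    open_actions = []
    for name in events:
        action, phase = split_event(name)
        if phase == "start":
            open_actions.append(action)
        elif phase == "end" and action in open_actions:
            open_actions.remove(action)
    return open_actions


if __name__ == "__main__":
    seen = [
        "org.globus.gridFTP.start",
        "org.globus.gridFTP.authn.start",
        "org.globus.gridFTP.authn.end",
        "org.globus.gridFTP.transfer.start",
    ]
    print(unfinished(seen))
```

For the sequence in the usage example, the session and the transfer are both reported as unfinished, which points at the transfer as the step where the failure occurred.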

27 Globally Unique IDs
- Use the ‘guid’ or ‘id’ reserved name to allow correlation of a set of events together
  - event=org.globus.gridFTP.authn.start id=27023
  - event=org.globus.gridFTP.authn.end id=27023
  - event=org.globus.gridFTP.transfer.start id=27023
  - event=org.globus.gridFTP.transfer.end id=27023
- Can use standard unix/windows program ‘uuidgen’ to generate globally unique ID
  - e.g.: A5A563CD-D80C-4E58-9ECD-79C6B611E122
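Inside a program, an ID of the same shape can be produced without calling an external tool. The sketch below uses Python's standard `uuid` module as an alternative to ‘uuidgen’; the helper names are hypothetical, and the upper-case form is chosen only to match the example above.

```python
import uuid


def new_guid():
    """Return a random (version 4) UUID string in upper case."""
    return str(uuid.uuid4()).upper()


def tag_events(event_names, guid):
    """Attach the same guid to every event of one logical operation."""
    return [f"event={name} guid={guid}" for name in event_names]


if __name__ == "__main__":
    guid = new_guid()
    for line in tag_events(
        ["org.globus.gridFTP.authn.start", "org.globus.gridFTP.authn.end"], guid
    ):
        print(line)
```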

28 GridFTP: network vs disk performance

29 Syslog-ng Deployment for OSG