Cluster 2004, San Diego, CA. A Client-centric Grid Knowledgebase. George Kola, Tevfik Kosar and Miron Livny, University of Wisconsin-Madison. September 23rd, 2004

Presentation transcript:

Slide 1: A Client-centric Grid Knowledgebase
George Kola, Tevfik Kosar and Miron Livny
University of Wisconsin-Madison
Cluster 2004, San Diego, CA
September 23rd, 2004

Slide 2: Grid Trivia
- How many of you have submitted a job to Grid resources and never heard back from it?
- How many of you have been frustrated by the inconsistent behavior of some Grid resources?
  - Completing some jobs successfully and failing others.
  - Similar jobs performing completely differently.
We did!

Slide 3: Goal: Prevent Unexpected Behavior in a Grid
- Learn from experience and prevent failures from recurring.
- Causes of unexpected behavior in a Grid:
  - Black holes
  - Resources with
    - Faulty hardware
    - Buggy or misconfigured software
  - Extremely slow computational sites
  - Memory leaks
  - etc.

Slide 4: Black holes

Slide 5: Black holes
- Definition: "A black hole is a region of spacetime from which nothing can escape, even light."
- If you send a light beam to a black hole, you never hear back from it.
- You can only know it after you have encountered it. Is it too late? No. You should learn from experience.

Slide 6: Black holes in the Grid
- Resources that accept jobs but never complete them
  - You send a job to a resource, but never hear back from it.

Slide 7: Black hole examples from real life
- In the WCER educational video processing pipeline:
  - A specific pool was accepting and processing our jobs for a couple of hours, but evicting them before completion.
  - A machine accepted a job, but due to a memory leak it kept throwing "shadow exceptions" and retrying the job forever.
  - Some third-party (GridFTP, DiskRouter) transfers hung occasionally and never returned.
  - A machine caused an error because of a corrupted FPU. It successfully completed MPEG-1 encoding but failed MPEG-4 encoding.

Slide 8: Grid is good, but not perfect
- Heterogeneous resources
- Multiple administrative domains
- Spanning wide area networks
- Consists of commodity hardware and software
Prone to network, hardware, software, and middleware failures!
We cannot expect everything from the Grid or Grid middleware!

Slide 9: Take the Ethernet Approach
- A truly distributed (and very effective) access control protocol to a shared service
- Client responsible for access control
- Client responsible for error detection
- Client responsible for fairness
Keep track of job/resource performance and failure characteristics as observed by the client.
Use job/user log files collected at the client side to build a grid knowledgebase.

Slide 10: Grid Knowledgebase
- Parse user/job log files
- Load them into a database
- Aggregate experience of different jobs
- Interpret them
- Plan action
- Generate feedback to the scheduler as well as to the user
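The first three steps on this slide (parse, load, aggregate) can be sketched as follows. This is only an illustration under stated assumptions: the four-column log format, the `events` table, and all function names are hypothetical and are not the scheduler's real log syntax or the authors' implementation.

```python
import sqlite3

# Hypothetical, simplified log format (NOT the real scheduler log syntax):
#   <job_id> <event> <host> <unix_time>
SAMPLE_LOG = """\
1 execute hostA 100
1 terminate hostA 400
2 execute hostB 120
2 evict hostB 900
3 execute hostA 150
3 terminate hostA 500
"""


def parse_log(text):
    """Yield (job_id, event, host, time) tuples from the toy log format."""
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, event, host, ts = line.split()
        yield int(job_id), event, host, int(ts)


def load(conn, events):
    """Load parsed events into a single (assumed) events table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(job_id INTEGER, event TEXT, host TEXT, time INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    conn.commit()


def completed_runtime_per_host(conn):
    """Aggregate: average run time of jobs that terminated, per host."""
    rows = conn.execute(
        """
        SELECT s.host, AVG(t.time - s.time)
        FROM events s JOIN events t
          ON s.job_id = t.job_id AND s.host = t.host
        WHERE s.event = 'execute' AND t.event = 'terminate'
        GROUP BY s.host
        """
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
load(conn, parse_log(SAMPLE_LOG))
stats = completed_runtime_per_host(conn)  # hostB never terminated, so it is absent
```

In this toy data, hostB accepted a job and evicted it without ever completing one, so it has no entry in `stats`. That absence is the kind of signal the later slides use to spot black holes.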

Slide 11
[Figure: architecture diagram with the components Job Descriptions, Planner, Job Queue, Job Logs, Job Scheduler, Match Maker, and Grid Resources (Personal Computers, Storage Servers, Clusters)]

Slide 12
[Figure: the same architecture diagram (Job Descriptions, Planner, Job Queue, Job Logs, Job Scheduler, Match Maker, Grid Resources: Personal Computers, Storage Servers, Clusters) extended with a Job Parser, the Grid Knowledgebase (Database and Data Miner), a Notification Layer, and an Adaptation Layer]

Slide 13: Database Schema

Field               Type
JobId               int
JobName             string
State               int
SubmitHost          string
SubmitTime          int
ExecuteHost         string[]
ExecuteTime         string[]
ImageSize           int[]
ImageSizeTime       int[]
EvictTime           int[]
Checkpointed        bool[]
EvictReason         string
TerminateTime       int[]
TotalLocalUsage     string
TotalRemoteUsage    string
TerminateMessage    string
ExceptionTime       int[]
ExceptionMessage    string[]

[Figure: job state diagram with the labels User, Submit, Schedule, Execute, Suspend, Un-suspend, Evicted, Exception, Terminated Normally, Terminated Abnormally, and a decision "Exit code = 0?" (Yes: Job Succeeded; No: Job Failed)]
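For readers who prefer code to a table, the schema can be mirrored as a record type. This is a rough sketch, not the authors' data model: the Python names and the two helper methods are hypothetical, and it assumes that each array-valued field holds one entry per execution attempt or event, which the slide does not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobRecord:
    """One row of the knowledgebase; field types follow the slide's schema."""
    job_id: int
    job_name: str
    state: int
    submit_host: str
    submit_time: int
    execute_host: List[str] = field(default_factory=list)
    execute_time: List[str] = field(default_factory=list)  # string[] on the slide
    image_size: List[int] = field(default_factory=list)
    image_size_time: List[int] = field(default_factory=list)
    evict_time: List[int] = field(default_factory=list)
    checkpointed: List[bool] = field(default_factory=list)
    evict_reason: str = ""
    terminate_time: List[int] = field(default_factory=list)
    total_local_usage: str = ""
    total_remote_usage: str = ""
    terminate_message: str = ""
    exception_time: List[int] = field(default_factory=list)
    exception_message: List[str] = field(default_factory=list)

    def attempts(self) -> int:
        """Number of execution attempts (assumes one ExecuteHost entry each)."""
        return len(self.execute_host)

    def evictions(self) -> int:
        """Number of recorded evictions."""
        return len(self.evict_time)


# Illustrative values only.
rec = JobRecord(
    job_id=1,
    job_name="encode-001",
    state=0,
    submit_host="submit.example.org",
    submit_time=1095900000,
    execute_host=["hostA", "hostB"],
    evict_time=[1095903600],
    checkpointed=[False],
)
```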

Slide 14: Difference from existing approaches
- Client view
- Use only job/user log files at the client side
  - Many administrators do not want to share resource/scheduler log files.
- We do not need to know everything going on in the whole grid
  - Scalable

Slide 15: What do we get?
- Collecting job execution time statistics
  - Average job execution time
  - Standard deviation
  - Fit a distribution
- Detect and avoid black holes
  - For a normal distribution:
    - 99.7% of job execution times should lie between (avg-3*stdev) and (avg+3*stdev)
    - About 95% of job execution times should lie between (avg-2*stdev) and (avg+2*stdev)
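A minimal sketch of how these statistics could flag suspected black holes, assuming past completed run times are roughly normal. The function names, the job identifiers, and the sample durations are made up for illustration; `statistics.stdev` computes the sample standard deviation.

```python
from statistics import mean, stdev


def time_limit(durations, k=3.0):
    """Upper bound avg + k*stdev derived from past completed run times."""
    return mean(durations) + k * stdev(durations)


def suspected_black_holes(running, durations, k=3.0):
    """Return ids of running jobs whose elapsed time exceeds the limit.

    running   -- dict mapping job id to elapsed minutes so far
    durations -- run times (minutes) of jobs that completed earlier
    """
    limit = time_limit(durations, k)
    return [job for job, elapsed in running.items() if elapsed > limit]


past = [6, 8, 7, 9, 10]                     # illustrative history, in minutes
running = {"job-17": 5.0, "job-42": 45.0}   # hypothetical running jobs
late = suspected_black_holes(running, past)
```

With this history the limit is a little under 13 minutes, so only the job that has been running for 45 minutes is flagged. A client could then remove that job from the resource and resubmit it elsewhere.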

Slide 16: Detecting hanging transfers

Slide 17: Setting Execution Time Limits
- Avg = 7.8 min
- Stdev = 3.17 min
- For a normal distribution:
  - 99.7%: [0 – min]
  - About 95%: [1.46 min – min]
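The limits follow from avg ± k*stdev with the lower bound clamped at zero, since a run time cannot be negative. The short sketch below works through the arithmetic for the slide's average and standard deviation; the function name is hypothetical.

```python
def execution_time_window(avg, stdev, k):
    """Return (lower, upper) = avg -/+ k*stdev in minutes, lower clamped at 0."""
    lower = max(0.0, avg - k * stdev)
    upper = avg + k * stdev
    return round(lower, 2), round(upper, 2)


three_sigma = execution_time_window(7.8, 3.17, 3)  # (0.0, 17.31)
two_sigma = execution_time_window(7.8, 3.17, 2)    # (1.46, 14.14)
```

The two-sigma lower bound of 1.46 min matches the value shown on the slide.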

Slide 18: What do we get? (2)
- Identifying misconfigured machines
  - e.g. find the set of machines that fail jobs with I/O data size larger than 2 GB (i.e. OS limitations)
- Identifying factors affecting job run-time
- Bug hunting
  - Job failures on certain inputs
  - Memory leaks
    - The scheduler logs image size regularly
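The 2 GB example can be expressed as a simple grouping query over job outcomes. The sketch below is one possible heuristic, not the authors' method. It assumes each record carries the job's I/O data size, which is not a field in the schema shown earlier, and all names are hypothetical.

```python
from collections import defaultdict

TWO_GB = 2 * 1024 ** 3


def hosts_failing_large_io(records, threshold=TWO_GB):
    """Return hosts that fail every large-I/O job yet complete smaller ones.

    records -- iterable of (host, io_bytes, succeeded) tuples
    """
    counts = defaultdict(lambda: {"large_ok": 0, "large_fail": 0, "small_ok": 0})
    for host, io_bytes, succeeded in records:
        c = counts[host]
        if io_bytes > threshold:
            c["large_ok" if succeeded else "large_fail"] += 1
        elif succeeded:
            c["small_ok"] += 1
    return sorted(
        host
        for host, c in counts.items()
        if c["large_fail"] > 0 and c["large_ok"] == 0 and c["small_ok"] > 0
    )


GB = 1024 ** 3
history = [
    ("hostA", 3 * GB, False),
    ("hostA", 1 * GB, True),
    ("hostA", 4 * GB, False),
    ("hostB", 3 * GB, True),
    ("hostB", 1 * GB, True),
    ("hostC", 1 * GB, False),
]
suspects = hosts_failing_large_io(history)
```

Requiring at least one successful small job keeps hosts that fail everything (hostC here) out of this particular report, since those point to a different problem.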

Slide 19: Catching Memory Leaks
[Figure: job memory image size (MB) plotted against time]
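One simple way to turn logged image sizes into a leak signal is to fit a straight line to size over time and check that the size never shrinks. This is an illustrative heuristic with a made-up threshold (`min_rate`) and made-up samples, not necessarily how the authors detected leaks.

```python
def growth_rate(times, sizes):
    """Least-squares slope of image size over time (MB per time unit)."""
    n = len(times)
    if n < 2 or len(sizes) != n:
        raise ValueError("need at least two matching samples")
    mean_t = sum(times) / n
    mean_s = sum(sizes) / n
    den = sum((t - mean_t) ** 2 for t in times)
    if den == 0:
        raise ValueError("samples must span more than one time point")
    num = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, sizes))
    return num / den


def looks_like_leak(times, sizes, min_rate=1.0):
    """Flag a job whose image size never shrinks and grows at >= min_rate."""
    never_shrinks = all(b >= a for a, b in zip(sizes, sizes[1:]))
    return never_shrinks and growth_rate(times, sizes) >= min_rate


minutes = [0, 10, 20, 30]
leaky = [100, 150, 200, 250]    # grows 5 MB per minute
steady = [100, 102, 101, 102]   # fluctuates around a constant size
```

Here `growth_rate(minutes, leaky)` is 5.0 MB per minute, so `looks_like_leak` flags the first series and not the second.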

Slide 20: What do we get? (3)
- Application optimization
  - How long does each step of an application/pipeline take to execute?
- Adaptation
  - Find resources that take least time to execute jobs from a particular class
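The adaptation step amounts to ranking resources by their observed run time for a given job class. A minimal sketch follows, assuming the knowledgebase can supply (job class, host, minutes) triples; the function name, the job classes, and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fastest_hosts(history, job_class, top=1):
    """Rank hosts by mean run time for one job class, fastest first.

    history -- iterable of (job_class, host, minutes) tuples
    """
    per_host = defaultdict(list)
    for cls, host, minutes in history:
        if cls == job_class:
            per_host[host].append(minutes)
    ranked = sorted(per_host, key=lambda h: mean(per_host[h]))
    return ranked[:top]


observed = [
    ("encode", "hostA", 12.0),
    ("encode", "hostB", 7.5),
    ("encode", "hostA", 10.0),
    ("encode", "hostB", 8.5),
    ("transfer", "hostA", 2.0),
]
best = fastest_hosts(observed, "encode")
```

A scheduler could use such a ranking as a preference when matching new jobs of the same class.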

Slide 21: Conclusions
- View of the Grid from the client side
- Job/user log files as the main source of information
- Aggregate the experience of different jobs and pass it to future ones
- Helps in:
  - Catching black holes
  - Identifying faulty/misconfigured resources
  - Bug tracking
  - Statistics collection
- Future work:
  - Merge the experience of different clients

Slide 22: Thank you
For more information, contact: Tevfik Kosar