Monitoring and Debugging Dryad(LINQ) Applications with Daphne. Vilas Jagannath, Zuoning Yin, Mihai Budiu. University of Illinois, Microsoft Research SVC.

Presentation transcript:

Monitoring and Debugging Dryad(LINQ) Applications with Daphne. Vilas Jagannath, Zuoning Yin, Mihai Budiu. University of Illinois, Microsoft Research SVC. International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011.

Programming Clusters: Marketing Map-Reduce

Programming Clusters: Reality

Complexity Exposed: correctness or performance bugs break the single-system abstraction

Outline: Motivation; Job structure; The Job Object Model; Tools for job understanding; Conclusions

Data-Parallel Computation
Application: Sawzall, Java | ≈SQL | LINQ, SQL
Language: Sawzall, FlumeJava | Pig, Hive | DryadLINQ, Scope
Execution: Map-Reduce | Hadoop | Dryad
Storage: GFS, BigTable | HDFS, S3 | Cosmos, Azure, HPC

2-D Piping
Unix Pipes (1-D): grep | sed | sort | awk | perl
Dryad (2-D): grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
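The contrast between 1-D and 2-D piping can be sketched in a few lines of Python. This is a minimal illustration, not Dryad code: it reads the number after each command as the count of parallel copies of that stage, and it assumes a simple round-robin repartitioning between stages. The names `partition` and `run_pipeline`, and the toy `grep`, `sed`, and `sort` functions, are hypothetical.

```python
from typing import Callable, List, Tuple

Lines = List[str]
# A stage is (program, number of parallel copies of that program).
Stage = Tuple[Callable[[Lines], Lines], int]


def partition(lines: Lines, width: int) -> List[Lines]:
    """Split lines round-robin into `width` partitions (assumed wiring)."""
    return [lines[i::width] for i in range(width)]


def run_pipeline(data: Lines, stages: List[Stage]) -> List[Lines]:
    """Run every stage as `width` independent copies over partitioned input."""
    parts: List[Lines] = [data]
    for program, width in stages:
        # Gather the previous stage's outputs, then redistribute them.
        merged = [line for part in parts for line in part]
        parts = [program(part) for part in partition(merged, width)]
    return parts


def grep(lines: Lines) -> Lines:
    return [line for line in lines if "err" in line]


def sed(lines: Lines) -> Lines:
    return [line.replace("err", "ERR") for line in lines]


def sort(lines: Lines) -> Lines:
    return sorted(lines)


# Toy analogue of grep^4 | sed^2 | sort^1
result = run_pipeline(
    ["b err", "a ok", "c err", "d err"],
    [(grep, 4), (sed, 2), (sort, 1)],
)
print(result)  # [['b ERR', 'c ERR', 'd ERR']]
```

A Unix pipe corresponds to every width being 1. The second dimension is the per-stage width, which can differ from stage to stage.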

Dryad Job Structure
Diagram labels: Input files; Vertices (processes) running grep, sed, sort, awk, perl; Channels; Stage; Output files
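One way to picture the terms on this slide is as a small graph data structure: vertices grouped into stages and connected by channels. The sketch below is an assumption made for illustration only. The class and method names (`Vertex`, `JobGraph`, `add_stage`, `connect`, `stages`) are hypothetical and are not Dryad's API, and the all-to-all wiring is just one possible connection pattern.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Vertex:
    name: str   # e.g. "grep[0]"
    stage: str  # e.g. "grep"


@dataclass
class JobGraph:
    vertices: List[Vertex] = field(default_factory=list)
    channels: List[Tuple[Vertex, Vertex]] = field(default_factory=list)

    def add_stage(self, stage: str, width: int) -> List[Vertex]:
        """Create `width` vertices that all belong to the same stage."""
        created = [Vertex(f"{stage}[{i}]", stage) for i in range(width)]
        self.vertices.extend(created)
        return created

    def connect(self, producers: List[Vertex], consumers: List[Vertex]) -> None:
        """Add a channel from every producer to every consumer (assumed pattern)."""
        for producer in producers:
            for consumer in consumers:
                self.channels.append((producer, consumer))

    def stages(self) -> Dict[str, List[Vertex]]:
        """Group vertices by the stage they belong to."""
        grouped: Dict[str, List[Vertex]] = defaultdict(list)
        for vertex in self.vertices:
            grouped[vertex.stage].append(vertex)
        return dict(grouped)


job = JobGraph()
greps = job.add_stage("grep", 3)
seds = job.add_stage("sed", 2)
job.connect(greps, seds)
print(len(job.channels))                      # 6
print([v.name for v in job.stages()["sed"]])  # ['sed[0]', 'sed[1]']
```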

Dryad System Architecture
Diagram labels: Job manager; cluster; NS, Sched; Exec; V (vertices); job schedule; control plane; data plane; Network

How does it work in detail?
Diagram labels: Localhost; IDE; Application; Compiler; Job Submission; Firewall; Cluster/Cloud; Cluster Scheduler; Job Manager (JM); Vertex; Exec; Storage
Legend: L: Logs, IO: Input/Output, R: Resources

Logs – lots of them
– Job-related: Plan (xml), status, resources
– Job-manager: stdout.txt, stderr.txt, *.log
– Vertex: stdout.txt, *.log, *.xml, *.cmd

Monitoring Tools Structure
Layers: Cosmos, Scope, HPC v2, HPC v3; Cluster abstraction; Job Object Model; Monitoring, Profiling, Debugging GUIs

Job Object Model
Diagram labels: Logs; JOM; Views (Job, Vertices, Plan); Tools
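The idea of a Job Object Model built from logs and exposed to tools can be sketched as follows. Everything here is a hypothetical illustration: the `key=value` log record format, the field names, and the classes `VertexInfo` and `JobModel` are assumptions, not the format of real Dryad logs or the actual JOM interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class VertexInfo:
    vertex_id: str
    stage: str
    state: str


@dataclass
class JobModel:
    name: str
    vertices: Dict[str, VertexInfo] = field(default_factory=dict)

    def by_state(self, state: str) -> List[VertexInfo]:
        """A simple 'view' over the model: all vertices in a given state."""
        return [v for v in self.vertices.values() if v.state == state]


def parse_record(line: str) -> Dict[str, str]:
    """Parse one hypothetical log record of the form 'key=value key=value ...'."""
    return dict(pair.split("=", 1) for pair in line.split())


def build_job_model(name: str, log_lines: Iterable[str]) -> JobModel:
    job = JobModel(name)
    for line in log_lines:
        rec = parse_record(line)
        # Later records overwrite earlier ones, so the model keeps the latest state.
        job.vertices[rec["vertex"]] = VertexInfo(rec["vertex"], rec["stage"], rec["state"])
    return job


# Hypothetical log lines, invented for this example.
LOG = [
    "vertex=v1 stage=grep state=Running",
    "vertex=v2 stage=grep state=Running",
    "vertex=v1 stage=grep state=Completed",
    "vertex=v2 stage=grep state=Failed",
]

job = build_job_model("demo", LOG)
print([v.vertex_id for v in job.by_state("Failed")])  # ['v2']
```

Tools written against `JobModel` never touch the raw records, so a different cluster with a different log format would only need a different `parse_record`.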

Outline: Motivation; Job structure; The Job Object Model; Tools for job understanding; Conclusions

The Job Browser: Job, Stage, Vertex

Job Schedule

Failure diagnosis

Diagnosis decision tree
– “Hand-made”
– Least portable tool
– Incomplete
– High-coverage
– Bug types:
  – User level
  – System-level
  – Cluster malfunction
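A hand-made decision tree of this kind might look like the sketch below. The slide does not give the actual rules, so every signal, threshold, and branch here is a hypothetical assumption chosen only to show the shape of such a tool. The three outcomes are the bug types from the slide, and the fallback reflects that the tree is incomplete.

```python
from dataclasses import dataclass


@dataclass
class FailureEvidence:
    # All fields are hypothetical signals, invented for this sketch.
    user_exception: bool          # the failure surfaced as an exception in user code
    failed_on_every_retry: bool   # the vertex failed each time it was re-executed
    runtime_error: bool           # the failure originated in the runtime itself
    machine_failure_rate: float   # fraction of vertices that failed on the same machine


def diagnose(e: FailureEvidence, bad_machine_threshold: float = 0.5) -> str:
    """Walk a small hand-written decision tree and name the likely bug type."""
    if e.machine_failure_rate >= bad_machine_threshold:
        return "cluster malfunction"
    if e.user_exception and e.failed_on_every_retry:
        return "user-level bug"
    if e.runtime_error:
        return "system-level bug"
    return "undiagnosed"  # the tree is incomplete by design


print(diagnose(FailureEvidence(True, True, False, 0.1)))    # user-level bug
print(diagnose(FailureEvidence(False, False, False, 0.9)))  # cluster malfunction
```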

PowerShell = Interactive Queries
$cluster = get-cluster X
$job = $cluster | select-AllJobs | sort-object Date | select-object -last 1 | select-DryadJob
$failed = $job.Vertices | where-object { $_.State -eq "Failed" }
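For readers who do not use PowerShell, the same query (take the most recent job, then list its failed vertices) can be sketched in Python over in-memory objects. The `Job` and `Vertex` classes and the sample data are hypothetical stand-ins for whatever the Job Object Model actually returns.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Vertex:
    name: str
    state: str


@dataclass
class Job:
    name: str
    date: date
    vertices: List[Vertex]


def latest_job(jobs: List[Job]) -> Job:
    """Equivalent of: sort-object Date | select-object -last 1."""
    return max(jobs, key=lambda j: j.date)


def failed_vertices(job: Job) -> List[Vertex]:
    """Equivalent of: where-object { $_.State -eq "Failed" }."""
    return [v for v in job.vertices if v.state == "Failed"]


# Sample data invented for this example.
jobs = [
    Job("old", date(2011, 5, 1), [Vertex("a", "Completed")]),
    Job("new", date(2011, 5, 20), [Vertex("b", "Failed"), Vertex("c", "Completed")]),
]

job = latest_job(jobs)
print(job.name)                                # new
print([v.name for v in failed_vertices(job)])  # ['b']
```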

Vertex Debugging on Client

Vertex Profiling on Client

Debugging on Cluster
Collection collection;
var results = from c in collection
              where c.name.length > 10
              orderby c.age
              select c.name;
Diagram labels: Program; Job; Breakpoint (on: where c.name.length > 10)

Remote debugging
Diagram labels: Localhost; Visual Studio; DryadLINQ Application; Job Submission; Firewall; Cluster/Cloud; Cluster Scheduler; Job Manager (JM); Vertex 1; Vertex 2; Exec; Storage
Annotations: Breakpoint; Breakpoint hit…; attach
Legend: L: Logs, IO: Input/Output, R: Resources

Notifications: Our Implementation
Diagram labels: Localhost; Visual Studio; DryadLINQ Application; Daphne; Job Submission; Firewall; Cluster/Cloud; Cluster Scheduler; Job Manager (JM); Vertex 1; Vertex 2; Exec; Storage
Annotations: attach
Legend: L: Logs, IO: Input/Output, R: Resources

Remote debugging

Open Problems
– What happens when 100,000 processes hit a breakpoint?
– How to evaluate expressions in the debugger when state is distributed?
– How to do large-scale performance debugging?
– How to preserve map between distributed state and original program state?
– How much can the illusion of a single system be preserved?

Conclusions
– Single-machine abstractions break down in the presence of (performance/correctness) bugs
– Job Object Model insulates tools from messy details
– Design the cluster runtime to make it easy to build a JOM
– Rich interactive tools easily built on top of JOM
– Much more work needed for debugging at scale