1
Monitoring and Debugging Dryad(LINQ) Applications with Daphne
Vilas Jagannath, Zuoning Yin, Mihai Budiu
University of Illinois, Microsoft Research SVC
International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011
2
Programming Clusters: Marketing Map-Reduce
3
Programming Clusters: Reality
4
Complexity Exposed
Correctness or performance bugs break the single-system abstraction
5
– Motivation
– Job structure
– The Job Object Model
– Tools for job understanding
– Conclusions
6
Data-Parallel Computation
– Application
– Language: Sawzall, FlumeJava (Sawzall, Java) | DryadLINQ, Scope (LINQ, SQL) | Pig, Hive (≈SQL)
– Execution: Map-Reduce | Dryad | Hadoop
– Storage: GFS, BigTable | Cosmos, Azure, HPC | HDFS, S3
7
2-D Piping
– Unix Pipes: 1-D
  grep | sed | sort | awk | perl
– Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
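The difference between the 1-D and 2-D pipelines on this slide can be sketched in a few lines of Python. This is an illustrative model only, not Dryad's API: the `Stage` type and the `expand` helper are hypothetical names, and the replication counts are the ones shown on the slide.

```python
from typing import List, NamedTuple, Tuple


class Stage(NamedTuple):
    """One column of the pipeline: a program replicated `width` times."""
    program: str
    width: int


def expand(stages: List[Stage]) -> List[Tuple[str, int]]:
    """List every vertex as (program, index within its stage)."""
    return [(s.program, i) for s in stages for i in range(s.width)]


# Unix pipe: one process per program (1-D).
unix = [Stage(p, 1) for p in ("grep", "sed", "sort", "awk", "perl")]

# Dryad-style pipe: each program runs as many parallel vertices (2-D).
dryad = [Stage("grep", 1000), Stage("sed", 500), Stage("sort", 1000),
         Stage("awk", 500), Stage("perl", 50)]

unix_vertices = expand(unix)
dryad_vertices = expand(dryad)
print(len(unix_vertices), len(dryad_vertices))
```

With the slide's numbers, the same five-program pipeline grows from 5 processes to 3,050 vertices, which is the scale at which a monitoring tool has to operate.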
8
Dryad Job Structure
[Figure: job graph. Labels: Input files, Vertices (processes), Channels, Stage, Output files. Vertices run grep, sed, sort, awk, perl.]
9
Dryad System Architecture
[Figure: architecture diagram. Labels: Job manager, cluster, job schedule, control plane, data plane, Network, NS, Sched, Exec, V.]
10
How does it work in detail?
[Figure: system diagram. Labels: Localhost, IDE, Application, Compiler, Job Submission, Firewall, Cluster/Cloud, Cluster Scheduler, Job Manager (JM), Vertex, Exec, Storage. Legend: L: Logs, IO: Input/Output, R: Resources.]
11
Logs – lots of them
– Job-related: Plan (xml), status, resources
– Job-manager: stdout.txt, stderr.txt, *.log
– Vertex: stdout.txt, *.log, *.xml, *.cmd
12
Monitoring Tools Structure
[Figure: layered stack. Labels: Cosmos, Scope, HPC v2, HPC v3; Cluster abstraction; Job Object Model; Monitoring, Profiling, Debugging GUIs.]
13
Job Object Model
[Figure: diagram. Labels: Logs, JOM, Views, Job, Vertices, Plan, Tools.]
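To make the idea concrete, here is a minimal sketch of what a job object model might look like. It assumes nothing about the real JOM: the `Vertex` and `Job` classes, their fields, the `build_job` helper, and the flat records standing in for parsed logs are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    name: str
    stage: str
    state: str    # e.g. "Completed", "Failed", "Running"
    machine: str


@dataclass
class Job:
    name: str
    vertices: List[Vertex] = field(default_factory=list)

    def stages(self) -> Dict[str, List[Vertex]]:
        """View: vertices grouped by stage, in first-seen order."""
        grouped: Dict[str, List[Vertex]] = {}
        for v in self.vertices:
            grouped.setdefault(v.stage, []).append(v)
        return grouped

    def state_counts(self) -> Counter:
        """View: how many vertices are in each state."""
        return Counter(v.state for v in self.vertices)


def build_job(name: str, records: List[dict]) -> Job:
    """Build the model from flat records (a stand-in for parsed logs)."""
    return Job(name, [Vertex(**r) for r in records])


# Hypothetical records; real log formats differ from cluster to cluster.
records = [
    {"name": "grep.0", "stage": "grep", "state": "Completed", "machine": "m1"},
    {"name": "grep.1", "stage": "grep", "state": "Failed", "machine": "m2"},
    {"name": "sort.0", "stage": "sort", "state": "Running", "machine": "m1"},
]
job = build_job("demo", records)
print(list(job.stages()), dict(job.state_counts()))
```

The point of the sketch is the separation of concerns: only `build_job` knows where the data came from, so tools written against `Job` and `Vertex` would not need to change when the log source does.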
14
– Motivation
– Job structure
– The Job Object Model
– Tools for job understanding
– Conclusions
15
The Job Browser
[Figure: screenshot. Panels: Job, Stage, Vertex.]
16
Job Schedule
17
Failure diagnosis
18
Diagnosis decision tree
– “Hand-made”
– Least portable tool
– Incomplete
– High-coverage
– Bug types:
  – User-level
  – System-level
  – Cluster malfunction
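A hand-made decision tree of this kind might be sketched as follows. The slide does not show the actual rules, so everything here is an assumption made for illustration: the `FailureReport` fields, the 0.5 threshold, and the order of the checks are hypothetical. Only the three bug types come from the slide.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """Hypothetical summary of one failed vertex, gathered from logs."""
    user_code_exception: bool     # stack trace ends in user code
    failed_on_all_retries: bool   # every re-execution failed the same way
    machine_failure_rate: float   # fraction of vertices failing on that machine


def diagnose(r: FailureReport) -> str:
    """Walk a tiny hand-made decision tree and return a bug type."""
    # A deterministic failure inside user code points at the application.
    if r.user_code_exception and r.failed_on_all_retries:
        return "user-level"
    # Most vertices failing on one machine points at the machine itself.
    # The 0.5 threshold is an arbitrary illustrative choice.
    if r.machine_failure_rate > 0.5:
        return "cluster malfunction"
    # Otherwise suspect the runtime or its supporting services.
    return "system-level"


print(diagnose(FailureReport(True, True, 0.1)))
```

Rules like these are easy to write but depend on cluster-specific log details, which is consistent with the slide calling this the least portable tool.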
19
PowerShell = Interactive Queries
$cluster = get-cluster X
$job = $cluster | select-AllJobs | sort-object Date | select-object -last 1 | select-DryadJob
$failed = $job.Vertices | where-object { $_.State -eq "Failed" }
20
Vertex Debugging on Client
21
Vertex Profiling on Client
22
Debugging on Cluster
Program:
  Collection collection;
  var results = from c in collection
                where c.name.length > 10
                orderby c.age
                select c.name;
Job:
  where c.name.length > 10
Breakpoint
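For readers less familiar with LINQ, the query on this slide can be read clause by clause. The sketch below reproduces its semantics on a single machine in plain Python; it is not DryadLINQ, and the `Person` type and sample data are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    name: str
    age: int


def query(collection: List[Person]) -> List[str]:
    # where c.name.length > 10
    filtered = [c for c in collection if len(c.name) > 10]
    # orderby c.age
    ordered = sorted(filtered, key=lambda c: c.age)
    # select c.name
    return [c.name for c in ordered]


people = [
    Person("Bartholomew Jones", 52),
    Person("Ann", 30),
    Person("Christopher Lee", 41),
]
results = query(people)
print(results)  # ['Christopher Lee', 'Bartholomew Jones']
```

On a single machine a breakpoint on the filter line stops one process. The slide's Program/Job pairing suggests that on a cluster the same clause corresponds to part of a running job, which is what makes setting the breakpoint there harder.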
23
Remote debugging
[Figure: system diagram. Labels: Localhost, Visual Studio, DryadLINQ Application, Job Submission, Firewall, Cluster/Cloud, Cluster Scheduler, Job Manager (JM), Vertex 1, Vertex 2, Exec, Storage. Annotations: Breakpoint, Breakpoint hit…, attach. Legend: L: Logs, IO: Input/Output, R: Resources.]
24
Notifications: Our Implementation
[Figure: system diagram. Labels: Localhost, Visual Studio, Daphne, DryadLINQ Application, Job Submission, Firewall, Cluster/Cloud, Cluster Scheduler, Job Manager (JM), Vertex 1, Vertex 2, Exec, Storage. Annotation: attach. Legend: L: Logs, IO: Input/Output, R: Resources.]
25
Remote debugging
26
Open Problems
– What happens when 100,000 processes hit a breakpoint?
– How to evaluate expressions in the debugger when state is distributed?
– How to do large-scale performance debugging?
– How to preserve the mapping between distributed state and original program state?
– How much can the illusion of a single system be preserved?
27
Conclusions
– Single-machine abstractions break down in the presence of (performance/correctness) bugs
– Job Object Model insulates tools from messy details
– Design the cluster runtime to make it easy to build a JOM
– Rich interactive tools easily built on top of JOM
– Much more work needed for debugging at scale