Monitoring and Debugging Dryad(LINQ) Applications with Daphne
Vilas Jagannath, Zuoning Yin, Mihai Budiu
University of Illinois, Microsoft Research SVC
International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011
Programming Clusters: Marketing Map-Reduce
Programming Clusters: Reality
Complexity Exposed
Correctness or performance bugs break the single-system abstraction
Outline
Motivation
Job structure
The Job Object Model
Tools for job understanding
Conclusions
Data-Parallel Computation
Application
Language: Sawzall, FlumeJava (Sawzall, Java) | DryadLINQ, Scope (LINQ, SQL) | Pig, Hive (≈SQL)
Execution: Map-Reduce | Dryad | Hadoop
Storage: GFS, BigTable | Cosmos, Azure, HPC | HDFS, S3
2-D Piping
Unix Pipes: 1-D
    grep | sed | sort | awk | perl
Dryad: 2-D
    grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
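The 2-D idea above can be illustrated with a minimal sketch, assuming each stage is an ordinary function and each stage runs as several parallel copies (for example four copies of grep, two of sed, one of sort). The helper names (repartition, run_stage, run_pipeline) and the round-robin redistribution between stages are assumptions made for illustration; this is not Dryad's API or scheduling policy.

```python
from concurrent.futures import ThreadPoolExecutor


def repartition(partitions, width):
    """Deal records round-robin into `width` partitions (a stand-in for channels)."""
    out = [[] for _ in range(width)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % width].append(record)
            i += 1
    return out


def run_stage(func, partitions):
    """Run one copy of `func` per partition, all copies in parallel."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(func, partitions))


def grep_stage(lines):
    return [line for line in lines if "error" in line]


def sed_stage(lines):
    return [line.replace("error", "ERR") for line in lines]


def sort_stage(lines):
    return sorted(lines)


def run_pipeline(lines, plan):
    """`plan` is a list of (stage function, width) pairs."""
    partitions = [lines]
    for func, width in plan:
        partitions = run_stage(func, repartition(partitions, width))
    return [record for part in partitions for record in part]


# Hypothetical input: every third line contains "error".
lines = [f"{'error' if i % 3 == 0 else 'ok'} {i:02d}" for i in range(12)]
result = run_pipeline(lines, [(grep_stage, 4), (sed_stage, 2), (sort_stage, 1)])
print(result)
```

The horizontal dimension is the sequence of stages; the vertical dimension is the width of each stage. The final stage has width 1 here so that the output is globally sorted.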
Dryad Job Structure
[Figure: job graph. Input files feed vertices (processes) such as grep, sed, sort, awk, and perl; vertices are grouped into stages and connected by channels; the last vertices write the output files.]
Dryad System Architecture
[Figure: job manager with the job schedule; cluster with NS, Sched, and Exec daemons running vertices (V); control plane and data plane; network.]
How does it work in detail?
[Figure: a firewall separates the localhost (IDE, application, compiler, job submission) from the cluster/cloud (cluster scheduler, job manager (JM), vertices; each machine has Exec and Storage). Legend: L: Logs, IO: Input/Output, R: Resources.]
Logs – lots of them
Job-related – Plan (xml), status, resources
Job-manager – stdout.txt, stderr.txt, *.log
Vertex – stdout.txt, *.log, *.xml, *.cmd
Monitoring Tools Structure
[Figure: layered stack, bottom to top: Cosmos, Scope, HPC v2, HPC v3; cluster abstraction; Job Object Model; monitoring, profiling, and debugging GUIs.]
Job Object Model
[Figure: logs feed the Job Object Model (JOM), which offers views of the job, its vertices, and its plan to the tools.]
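As a rough illustration of the idea, the sketch below builds a tiny object model from log records and exposes two views over it. The record format (vertex,stage,state,seconds), the class names, and the sample records are all hypothetical; the real Dryad logs and the real JOM are considerably richer.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    stage: str
    state: str
    seconds: float


@dataclass
class Job:
    name: str
    vertices: list = field(default_factory=list)

    def by_stage(self):
        """View: vertices grouped by the stage they belong to."""
        groups = defaultdict(list)
        for vertex in self.vertices:
            groups[vertex.stage].append(vertex)
        return dict(groups)

    def slowest(self):
        """View: the vertex with the longest running time."""
        return max(self.vertices, key=lambda v: v.seconds)


def build_job(name, log_lines):
    """Parse hypothetical 'vertex,stage,state,seconds' records into a Job."""
    job = Job(name)
    for line in log_lines:
        vertex, stage, state, seconds = line.strip().split(",")
        job.vertices.append(Vertex(vertex, stage, state, float(seconds)))
    return job


# Made-up sample records, for illustration only.
LOG_LINES = [
    "grep-0,grep,Completed,1.5",
    "grep-1,grep,Completed,2.0",
    "sort-0,sort,Completed,7.25",
]
job = build_job("demo", LOG_LINES)
print({stage: len(vs) for stage, vs in job.by_stage().items()})
print(job.slowest().name)
```

Tools written against Job and Vertex never touch the log format, so only build_job would need to change when the underlying cluster changes.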
Outline
Motivation
Job structure
The Job Object Model
Tools for job understanding
Conclusions
The Job Browser: Job, Stage, Vertex
Job Schedule
Failure diagnosis
Diagnosis decision tree
“Hand-made”
Least portable tool
Incomplete
High-coverage
Bug types:
– User level
– System-level
– Cluster malfunction
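A hand-made decision tree of this kind might look like the sketch below, which maps a few observations about a failure to one of the three bug types. Every observation key and threshold is hypothetical and chosen only to show the shape of such a tree; the slides do not describe the actual rules.

```python
def diagnose(failure):
    """Walk a tiny hand-made decision tree and return a bug category.

    `failure` is a dict of observations; all keys and thresholds here are
    hypothetical, not the rules used by the real tool.
    """
    # Signs that the hardware or cluster services are at fault come first.
    if failure.get("machine_unreachable"):
        return "Cluster malfunction"
    if failure.get("failures_on_same_machine", 0) >= 3:
        return "Cluster malfunction"
    # An exception raised inside user code points at the application.
    if failure.get("exception_in_user_code"):
        return "User level"
    # Anything else is attributed to the runtime.
    return "System-level"


print(diagnose({"exception_in_user_code": True}))
```

A tree like this is incomplete by construction: any failure that matches no rule falls through to the last branch, so the order of the checks matters.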
PowerShell = Interactive Queries
    $cluster = get-cluster X
    $job = $cluster | select-AllJobs | sort-object Date | select-object -last 1 | select-DryadJob
    $failed = $job.Vertices | where-object { $_.State -eq "Failed" }
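For readers unfamiliar with PowerShell pipelines, the same query can be sketched in Python over plain dictionaries: pick the most recent job, then keep its failed vertices. The job list, dates, and field names below are made-up sample data, not output from a real cluster.

```python
from datetime import date

# Hypothetical sample data standing in for what the cluster would return.
jobs = [
    {"name": "job-a", "date": date(2011, 5, 1), "vertices": [
        {"name": "v0", "state": "Completed"},
    ]},
    {"name": "job-b", "date": date(2011, 5, 3), "vertices": [
        {"name": "v0", "state": "Completed"},
        {"name": "v1", "state": "Failed"},
    ]},
]

# Equivalent of: sort-object Date | select-object -last 1
latest = sorted(jobs, key=lambda j: j["date"])[-1]

# Equivalent of: where-object { $_.State -eq "Failed" }
failed = [v for v in latest["vertices"] if v["state"] == "Failed"]

print(latest["name"], [v["name"] for v in failed])
```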
Vertex Debugging on Client
Vertex Profiling on Client
Debugging on Cluster
Program:
    Collection collection;
    var results = from c in collection
                  where c.name.length > 10
                  orderby c.age
                  select c.name;
Job:
    [Figure: the job graph derived from the program, with a breakpoint on the stage for "where c.name.length > 10".]
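To make the program-to-job mapping concrete, here is a small sketch, assuming the where clause becomes a predicate that every vertex applies to its own partition of the data. The Person type, the partitions, and the function names are hypothetical; the sketch mirrors the query's logic only and does not show how DryadLINQ generates vertex code.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def where_predicate(c):
    # A breakpoint on the "where" line of the program corresponds to this
    # function, which runs once per record inside every vertex.
    return len(c.name) > 10


def where_vertex(partition):
    """One vertex of the 'where' stage: filter a single partition."""
    return [c for c in partition if where_predicate(c)]


# Made-up data, split into two partitions as it would be on a cluster.
partitions = [
    [Person("Bartholomew Jones", 41), Person("Ann", 30)],
    [Person("Alexandria Smith", 29), Person("Christopher Lee", 35)],
]

filtered = [c for part in partitions for c in where_vertex(part)]
results = [c.name for c in sorted(filtered, key=lambda c: c.age)]
print(results)
```

One breakpoint in the source therefore corresponds to as many breakpoint hits as there are records flowing through all copies of the stage, which is part of why debugging at scale is difficult.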
Remote debugging
[Figure: a firewall separates the localhost (Visual Studio, DryadLINQ application, job submission) from the cluster/cloud (cluster scheduler, job manager (JM), Vertex 1, Vertex 2; each machine has Exec and Storage). A breakpoint is hit in a vertex and Visual Studio attaches to it. Legend: L: Logs, IO: Input/Output, R: Resources.]
Notifications: Our Implementation
[Figure: a firewall separates the localhost (Visual Studio, Daphne, DryadLINQ application, job submission) from the cluster/cloud (cluster scheduler, job manager (JM), Vertex 1, Vertex 2; each machine has Exec and Storage), with an attach arrow into the cluster. Legend: L: Logs, IO: Input/Output, R: Resources.]
Remote debugging
Open Problems
What happens when 100,000 processes hit a breakpoint?
How to evaluate expressions in the debugger when state is distributed?
How to do large-scale performance debugging?
How to preserve map between distributed state and original program state?
How much can the illusion of a single system be preserved?
Conclusions
Single-machine abstractions break down in the presence of (performance/correctness) bugs
Job Object Model insulates tools from messy details
Design the cluster runtime to make it easy to build a JOM
Rich interactive tools easily built on top of JOM
Much more work needed for debugging at scale