Spark Debugger
Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica
UC Berkeley
Motivation
- Debugging distributed programs is hard
- Debuggers for general distributed systems incur high overhead
- The Spark model enables debugging at almost zero overhead
Spark Programming Model
- Resilient Distributed Datasets (RDDs) are built through deterministic transformations
- Example: find how many Wikipedia articles match a search term
- Lineage: HDFS file -> map(_.split('\t')(3)) -> articles -> filter(_.contains("Berkeley")) -> matches -> count() -> 10,000
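The same lineage written out in Spark's Scala API; a minimal sketch, assuming a SparkContext named sc and a placeholder HDFS path:

    // Minimal sketch of the lineage above; the HDFS path is a placeholder.
    val file = sc.textFile("hdfs://...")            // HDFS file
    val articles = file.map(_.split('\t')(3))       // extract the article-text column
    val matches = articles.filter(_.contains("Berkeley"))
    println(matches.count())                        // e.g. 10,000

Each transformation is deterministic: given the same input it produces the same RDD, which is the property the debugger relies on.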
Debugging a Spark Program
- Debug the individual transformations instead of the whole system:
  - Rerun tasks
  - Recompute RDDs
- Debugging a distributed program is now as easy as debugging a single-threaded one (see the sketch below)
- The approach also applies to MapReduce and Dryad
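Because transformations are deterministic, a single task can be re-executed in isolation on one machine. A conceptual sketch (not the debugger's actual interface), with invented sample data:

    // Conceptual sketch: re-running one task single-threaded. Applying the
    // same deterministic closure to the same input partition reproduces the
    // task's output, so it can be stepped through in an ordinary debugger.
    val inputPartition: Seq[String] = Seq(          // invented sample rows
      "id1\t2011\ten\tBerkeley is in California",
      "id2\t2011\ten\tAn unrelated article")
    val taskClosure = (line: String) => line.split('\t')(3)
    val output = inputPartition.map(taskClosure)
    output.foreach(println)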
Approach
As the Spark program runs, workers report key events back to the master, which logs them.
[Diagram: Workers send performance stats, exceptions, and RDD checksums to the Master, which writes them to an event log]
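A minimal sketch of what the reported events might look like; these case classes and the logging format are assumptions, not the debugger's actual wire format:

    // Hypothetical event types a worker might report to the master; the real
    // format used by the event-log branch may differ.
    sealed trait DebuggerEvent
    case class TaskStats(taskId: Int, runtimeMs: Long) extends DebuggerEvent
    case class TaskException(taskId: Int, message: String) extends DebuggerEvent
    case class RddChecksum(rddId: Int, partition: Int, checksum: Long) extends DebuggerEvent

    // The master appends each reported event to its log for later replay.
    import java.io.PrintWriter
    def logEvent(out: PrintWriter, event: DebuggerEvent): Unit =
      out.println(event)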
Approach
Later, the user can re-execute from the event log to debug in a controlled environment.
[Diagram: the Debugger attaches to the Master, which replays the event log and re-runs work on Workers]
Detecting Nondeterministic Transformations
- Re-running a nondeterministic transformation may yield different results
- We can use RDD checksums to detect nondeterminism and alert the user
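A minimal sketch of checksum-based detection, assuming a CRC32 over each partition's elements in order; the debugger's actual checksum scheme may differ:

    import java.util.zip.CRC32

    // Checksum a partition's elements in order. A deterministic
    // transformation applied to identical input must reproduce the
    // checksum logged during the original run.
    def partitionChecksum(partition: Iterator[String]): Long = {
      val crc = new CRC32
      partition.foreach(elem => crc.update(elem.getBytes("UTF-8")))
      crc.getValue
    }

    // On replay: recompute and compare against the logged value.
    def checkReplay(logged: Long, recomputed: Long): Unit =
      if (logged != recomputed)
        println("Warning: transformation appears nondeterministic")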
Demo
Example app: PageRank on the Wikipedia dataset
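For reference, a minimal PageRank in Spark's Scala API; `links` (an RDD[(String, Seq[String])] of page-to-outlinks pairs) and the iteration count are assumptions, not the demo's exact code:

    // Minimal PageRank sketch; `links` and the 10 iterations are assumptions.
    var ranks = links.mapValues(_ => 1.0)
    for (i <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (urls, rank) => urls.map(url => (url, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.take(5).foreach(println)  // print a small, unsorted sample of ranks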
Performance
Event logging introduces minimal overhead.
Future Plans
- Culprit determination
- GC monitoring
- Memory monitoring
Ankur Dave
The Spark debugger is in development on the event-log branch.
Try Spark at