The Case for Tiny Tasks in Compute Clusters. Kay Ousterhout*, Aurojit Panda*, Joshua Rosen*, Shivaram Venkataraman*, Reynold Xin*, Sylvia Ratnasamy*, Scott Shenker*+, Ion Stoica*. *UC Berkeley, +ICSI
Setting: a MapReduce/Spark/Dryad job, made up of many tasks.
Today’s tasks vs. tiny tasks: use smaller tasks!
Why? How? Where?
Problem: Skew and Stragglers. Contended machine? Data skew?
Benefit: Handling of Skew and Stragglers. Today’s tasks vs. tiny tasks: as much as 5.2x reduction in job completion time!
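The intuition behind this benefit can be illustrated with a minimal Python sketch. It is not from the talk: the function name `job_completion_time`, the four machines, the quarter-speed "contended" machine, and the 64 units of work are all made-up assumptions, and the resulting numbers say nothing about the 5.2x figure measured on real traces. Each machine simply pulls the next task as soon as it becomes free.

```python
import heapq

def job_completion_time(total_work, num_tasks, speeds):
    """Greedy pull scheduling: whichever machine frees up first takes the next task.

    total_work is split evenly into num_tasks tasks; speeds[i] is the amount
    of work machine i completes per second (hypothetical numbers).
    """
    task_work = total_work / num_tasks
    # Min-heap of (time at which the machine becomes free, machine index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)
        done = t + task_work / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

speeds = [1.0, 1.0, 1.0, 0.25]                 # assumption: one contended machine
large = job_completion_time(64, 4, speeds)     # one large task per machine
tiny = job_completion_time(64, 256, speeds)    # many tiny tasks
print(large, tiny)
```

With large tasks the job waits for the slow machine to finish its full share; with tiny tasks the slow machine simply ends up processing fewer of them, so the job finishes much closer to the time dictated by the aggregate speed of the cluster.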
Problem: Batch and Interactive Sharing. A high-priority interactive job arrives while a low-priority batch task is running. Clusters forced to trade off utilization and responsiveness!
Benefit: Improved Sharing. Today’s tasks vs. tiny tasks: high-priority tasks not subject to long wait times!
Benefits: Recap. (1) Straggler mitigation: Mantri (OSDI ’10), Scarlett (EuroSys ’11), SkewTune (SIGMOD ’12), Dolly (NSDI ’13), … (2) Improved sharing: Quincy (SOSP ’09), Amoeba (SOCC ’12), …
Why? How? Where?
Schedule task. Scheduling requirements: high throughput (millions per second), low latency (milliseconds). Distributed scheduling (e.g., Sparrow scheduler).
Schedule task, launch task. Use existing thread pool to launch tasks.
Schedule task, launch task. Use existing thread pool to launch tasks + cache task binaries. Task launch = RPC time (<1ms).
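As a rough illustration of the two ideas on this slide, the sketch below keeps one long-lived thread pool per worker and caches task code after its first use, so that launching a task costs little more than a function dispatch. This is an assumption-laden Python sketch, not the framework's implementation: the `Worker` class, its `launch` method, the `loader` callback, and the `word_count` task are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

class Worker:
    """Hypothetical worker that reuses one thread pool for every task."""

    def __init__(self, num_threads=4):
        # Created once at startup: no thread or process creation per task.
        self.pool = ThreadPoolExecutor(max_workers=num_threads)
        self.binary_cache = {}   # task name -> callable ("task binary")
        self.loads = 0           # how many times task code had to be loaded

    def _get_binary(self, name, loader):
        # Load the task code only on first use, then serve it from the cache.
        if name not in self.binary_cache:
            self.binary_cache[name] = loader()
            self.loads += 1
        return self.binary_cache[name]

    def launch(self, name, loader, *args):
        # Launching is a cache lookup plus a submit to the existing pool.
        fn = self._get_binary(name, loader)
        return self.pool.submit(fn, *args)

worker = Worker()
load_word_count = lambda: (lambda text: len(text.split()))  # stand-in for fetching a binary
futures = [worker.launch("word_count", load_word_count, "a b c") for _ in range(3)]
results = [f.result() for f in futures]
worker.pool.shutdown()
print(results, worker.loads)
```

Three tasks are launched here, but the task code is loaded only once; the remaining launches hit the cache.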
Schedule task, launch task, read input data. Smallest efficient file block size: 8MB. Distribute metadata (à la Flat Datacenter Storage, OSDI ’12).
Schedule task, launch task, read input data, execute task + read data for next task. Tons of tiny transfers! Framework-controlled I/O (enables optimizations, e.g., pipelining).
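One way to picture "execute task + read data for next task" is a loop that starts the read for the next block before computing on the current one. The Python sketch below is only an illustration under assumptions: `run_pipelined`, `read_block`, and `execute` are hypothetical names, and a single background thread stands in for framework-controlled I/O.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(blocks, read_block, execute):
    """Run execute() on each block while prefetching the following block."""
    blocks = list(blocks)
    results = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(read_block, blocks[0])
        for nxt in blocks[1:] + [None]:
            data = pending.result()                        # wait for current input
            if nxt is not None:
                pending = io_pool.submit(read_block, nxt)  # start the next read
            results.append(execute(data))                  # compute overlaps that read
    return results

# Toy stand-ins: "reading" block b yields the numbers 0..b-1; the task sums them.
out = run_pipelined([3, 4, 5], lambda b: list(range(b)), sum)
print(out)
```

Because the framework issues the reads itself rather than leaving them to task code, it is free to reorder or overlap them in this way.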
How low can you go? Schedule task, launch task, read input data, execute task + read data for next task. 8MB disk block; 100’s of milliseconds.
Why? How? Where?
[Figure: Original Job vs. Tiny Tasks Job. Map Tasks 1, 2, …, N; Reduce Tasks 1, …, each assigned a set of keys (K1:, K2:, K3:, K5:, …, Kn:).]
Original Reduce Phase: Reduce Task 1 (K1:). Tiny Tasks = ?
Splitting Large Tasks. Aggregation trees: work for functions that are associative and commutative. Framework-managed temporary state store. Ultimately, need to allow a small number of large tasks.
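An aggregation tree can be sketched as follows: instead of one large reduce over all values for a key, values are combined in small groups, and the partial results are combined again until one value remains. This is a minimal Python sketch under assumptions; `tree_aggregate` and `fan_in` are hypothetical names, and the groups are processed sequentially here, whereas a framework would run each group as a separate tiny task.

```python
from functools import reduce
import operator

def tree_aggregate(values, combine, fan_in=4):
    """Combine values level by level; each group stands for one tiny task.

    Returns (result, number_of_tiny_tasks). Regrouping is only safe when
    combine is associative; commutativity additionally allows partial
    results to be combined in whatever order they arrive.
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    tasks = 0
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [reduce(combine, g) for g in groups]  # one tiny task per group
        tasks += len(groups)
    return level[0], tasks

total, tasks = tree_aggregate(range(1, 101), operator.add, fan_in=4)
print(total, tasks)  # 5050 35
```

No single task in this example ever combines more than four values, at the cost of extra levels whose intermediate results have to be stored somewhere between tasks.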
Tiny tasks mitigate stragglers + improve sharing. Distributed scheduling; launch task in existing thread pool; distributed file metadata; pipelined task execution. Questions? Find me or Shivaram:
Backup Slides
Benefit of Eliminating Stragglers (based on Facebook trace): 5.2x at the 95th percentile!
Why Not Preemption? Preemption only handles sharing (not stragglers). Task migration is time consuming. Tiny tasks improve fault tolerance.
Dremel/Drill/Impala: Similar goals and challenges (supporting short tasks). Dremel statically assigns tablets to machines; rebalances if query dispatcher notices that a machine is processing a tablet slowly (standard straggler mitigation). Most jobs expected to be interactive (no sharing).
Scheduling Throughput: 10,000 machines, 16 cores/machine, 100 millisecond tasks. Over 1 million task scheduling decisions per second.
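The arithmetic behind this figure can be checked directly. The sketch below assumes, for simplicity, that every core runs exactly one task at a time and is always busy; the variable names are illustrative.

```python
machines = 10_000
cores_per_machine = 16
task_ms = 100                                  # task duration in milliseconds

slots = machines * cores_per_machine           # tasks running at any moment
tasks_per_second = slots * 1000 // task_ms     # each slot finishes 10 tasks per second
print(tasks_per_second)  # 1600000
```

Under these assumptions the scheduler has to make 1.6 million placement decisions per second to keep the cluster full, which is consistent with the "over 1 million" on the slide.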
Sparrow: Technique. Place m tasks on the least loaded of d·m slaves. Example: the scheduler places a job of m = 2 tasks using 4 probes (d = 2). More at tinyurl.com/sparrow-scheduler
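The placement rule on this slide can be sketched in a few lines of Python. This is a simplified illustration of the stated rule only, not Sparrow's code: `batch_sample` and `queue_lengths` are hypothetical names, probing is modeled as reading a queue length from a list, and any further mechanisms of the real scheduler are not shown.

```python
import random

def batch_sample(queue_lengths, m, d=2, rng=random):
    """Probe d*m randomly chosen workers and return the m least-loaded ones."""
    num_probes = min(d * m, len(queue_lengths))
    probed = rng.sample(range(len(queue_lengths)), num_probes)
    probed.sort(key=lambda w: queue_lengths[w])   # least loaded first
    return probed[:m]

queue_lengths = [3, 0, 5, 1, 2, 4, 0, 6]   # made-up per-worker queue lengths
chosen = batch_sample(queue_lengths, m=2, d=2, rng=random.Random(0))
print(chosen)   # indices of the 2 workers that receive the job's tasks
```

With m = 2 and d = 2 this sends 4 probes, matching the example on the slide. Because each scheduler looks at only a small random sample rather than global state, many schedulers can make such decisions independently.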
Sparrow: Performance on TPC-H Workload. Within 12% of offline optimal; median queuing delay of 8ms. More at tinyurl.com/sparrow-scheduler