Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica.


1 Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

2 Outline The Spark scheduling bottleneck Sparrow’s fully distributed, fault-tolerant technique Sparrow’s near-optimal performance

3 Spark Today Worker Spark Context User 1 User 2 User 3 Query Compilation Storage Scheduling

4 Spark Today Worker Spark Context User 1 User 2 User 3 Query Compilation Storage Scheduling

5 Job Latencies Rapidly Decreasing [Timeline; latency axis: 10 min., 10 sec., 100 ms, 1 ms] 2004: MapReduce batch job; 2009: Hive query; 2010: Dremel query; 2010: In-memory Spark query; 2012: Impala query; 2013: Spark Streaming

6 Job latencies rapidly decreasing

7 + Spark deployments growing in size Scheduling bottleneck!

8 Spark scheduler throughput: 1500 tasks/second [Plot: task duration (100 ms, 1 second, 10 seconds) against cluster size (# 16-core machines)]
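A rough back-of-the-envelope reading of this slide, as a sketch only: if each core runs one task at a time (an assumption, not stated on the slide), a cluster of N 16-core machines running tasks of duration t needs 16N/t task launches per second to stay busy. A scheduler capped at 1500 tasks/second could then keep at most 1500 * t / 16 machines busy. The function name and the one-task-per-core model are illustrative.

```python
def max_busy_machines(throughput_tasks_per_s, task_duration_s, cores_per_machine=16):
    """Largest cluster a scheduler can keep fully busy.

    Assumes one task per core and back-to-back tasks (illustrative model,
    not taken from the slides).
    """
    return throughput_tasks_per_s * task_duration_s / cores_per_machine


# Task durations shown on the slide: 100 ms, 1 second, 10 seconds.
for duration_s in (0.1, 1.0, 10.0):
    machines = max_busy_machines(1500, duration_s)
    print(f"{duration_s} s tasks: about {machines:.0f} machines")
```

Under this model the supported cluster size grows linearly with task duration, which is why shorter tasks make the scheduler the bottleneck sooner.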

9 Optimizing the Spark Scheduler 0.8: Monitoring code moved off critical path 0.8.1: Result deserialization moved off critical path Future improvements may yield 2-3x higher throughput

10 Is the scheduler the bottleneck in my cluster?

11 Worker Cluster Scheduler Task launch Task completion

12 Worker Cluster Scheduler Task launch Task completion

13 Worker Cluster Scheduler Task launch Task completion Scheduler delay
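One way to make the diagram concrete, as a hedged sketch: treat scheduler delay as the part of a task's round trip, measured at the scheduler, that the worker did not spend running the task. This definition is an assumption about what the slide depicts, and the field names below are hypothetical; they are not taken from Spark's metrics API.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    # All fields are hypothetical names chosen for this sketch.
    launch_ms: float      # when the scheduler sent the task launch
    completion_ms: float  # when the scheduler saw the task completion
    run_ms: float         # time the worker spent executing the task


def scheduler_delay_ms(timing):
    """Round-trip time at the scheduler minus time spent running the task."""
    round_trip = timing.completion_ms - timing.launch_ms
    return max(0.0, round_trip - timing.run_ms)


# A task that ran for 100 ms but took 130 ms from launch to completion.
print(scheduler_delay_ms(TaskTiming(launch_ms=0.0, completion_ms=130.0, run_ms=100.0)))
```

If this delay is large relative to the run time across many tasks, the scheduler is a plausible bottleneck candidate.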

14

15

16 Spark Today Worker Spark Context User 1 User 2 User 3 Query Compilation Storage Scheduling

17 Future Spark Worker User 1 User 2 User 3 Scheduler Query compilation Scheduler Query compilation Scheduler Query compilation Benefits: High throughput, Fault tolerance

18 Future Spark Worker User 1 User 2 User 3 Scheduler Query compilation Scheduler Query compilation Scheduler Query compilation Storage: Tachyon

19 Scheduling with Sparrow Worker Scheduler Stage Worker

20 Batch Sampling Stage Worker Scheduler Worker Place m tasks on the least loaded of 2m workers 4 probes (d = 2)
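The batch sampling rule on this slide (probe d * m workers, place the m tasks on the least loaded of them) can be sketched as follows. This is a minimal illustration, not Sparrow's code: the function name is made up, and the `queue_lengths` list stands in for the answers a real scheduler would get back from its probes.

```python
import random


def batch_sample(queue_lengths, m, d=2, rng=random):
    """Pick workers for the m tasks of one stage.

    queue_lengths[w] is worker w's current queue length (hypothetical
    input standing in for probe replies).
    """
    num_probes = min(d * m, len(queue_lengths))
    # Probe d * m distinct workers chosen uniformly at random.
    probed = rng.sample(range(len(queue_lengths)), num_probes)
    # Keep the m probed workers with the shortest queues.
    return sorted(probed, key=lambda w: queue_lengths[w])[:m]


# m = 2 tasks, d = 2: four probes, as on the slide. With only four
# workers every worker is probed, so the two shortest queues win.
print(batch_sample([5, 0, 3, 1], m=2, d=2))  # prints [1, 3]
```

Sharing the probes across the whole stage is the difference from per-task sampling: one unlucky probe no longer pins a task to a busy worker.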

21 Queue length poor predictor of wait time Worker 80 ms 155 ms 530 ms Poor performance on heterogeneous workloads

22 Late Binding Stage Worker Scheduler Worker Place m tasks on the least loaded of d · m workers 4 probes (d = 2)

23 Late Binding Scheduler Place m tasks on the least loaded of d · m workers 4 probes (d = 2) Worker Stage

24 Late Binding Scheduler Place m tasks on the least loaded of d · m workers Worker requests task Worker Stage
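A small simulation may help show what late binding changes. In this sketch, which is an interpretation of the slides rather than Sparrow's implementation, each probed worker asks the scheduler for a task once it is ready to run one, and the scheduler hands its m tasks to the first m workers that ask. The class name, method names, and the ready times are all hypothetical.

```python
from collections import deque


class LateBindingScheduler:
    """Holds a stage's tasks until probed workers ask for them."""

    def __init__(self, tasks):
        self.unlaunched = deque(tasks)
        self.placement = {}  # task -> worker that ended up running it

    def request_task(self, worker):
        """Called by a worker when it is ready to run a task."""
        if not self.unlaunched:
            return None  # all tasks already launched elsewhere
        task = self.unlaunched.popleft()
        self.placement[task] = worker
        return task


def simulate(tasks, ready_at_ms):
    """ready_at_ms maps each probed worker to the time it becomes free."""
    scheduler = LateBindingScheduler(tasks)
    # Workers request tasks in the order in which they become free.
    for worker in sorted(ready_at_ms, key=ready_at_ms.get):
        scheduler.request_task(worker)
    return scheduler.placement


# m = 2 tasks, 4 probes (d = 2); ready times are made up for illustration.
print(simulate(["t0", "t1"], {"w0": 530, "w1": 80, "w2": 155, "w3": 300}))
```

The tasks go to the workers that actually free up first, so placement no longer depends on queue length as a proxy for wait time.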

25 What about constraints?

26 Per-Task Constraints Stage Scheduler Worker Probe separately for each task

27 Technique Recap Scheduler Batch sampling + Late binding + Constraints Worker
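"Probe separately for each task" can be read as: draw each task's d probes only from the workers that satisfy that task's constraint. The sketch below assumes constraints are given as a set of allowed workers per task; that representation and the function name are illustrative, not taken from Sparrow.

```python
import random


def probes_per_task(allowed_workers, d=2, rng=random):
    """Choose up to d probe targets for each task from its allowed workers.

    allowed_workers maps task -> set of workers that satisfy the task's
    constraint (hypothetical representation).
    """
    probes = {}
    for task, workers in allowed_workers.items():
        candidates = sorted(workers)  # sample needs a sequence, not a set
        probes[task] = rng.sample(candidates, min(d, len(candidates)))
    return probes


allowed = {"t0": {"w1", "w2", "w3"}, "t1": {"w7"}}
print(probes_per_task(allowed))
```

A task whose constraint admits fewer than d workers simply gets fewer probes.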

28 How well does Sparrow perform?

29 How does Sparrow compare to Spark’s native scheduler? 100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load

30 TPC-H Queries: Background TPC-H: Common benchmark for analytics workloads Sparrow Spark Shark: SQL execution engine

31 TPC-H Queries 100 16-core EC2 nodes, 10 schedulers, 80% load [Plot: 5th, 25th, 50th, 75th, and 95th percentiles] Within 12% of ideal Median queuing delay of 9 ms

32 Policy Enforcement Priorities: Serve queues based on strict priorities (Worker: High Priority, Low Priority) Fair Shares: Serve queues using weighted fair queuing (Worker: User A (75%), User B (25%))
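Both worker-side policies can be sketched with per-queue selection functions. Note the simplification: the fair-share function below is a plain proportional-share rule (serve the user with the smallest served/weight ratio), used here as a stand-in for full weighted fair queuing. All names are hypothetical.

```python
from collections import deque


def next_by_priority(queues):
    """queues maps priority (lower number = higher priority) to a deque."""
    for priority in sorted(queues):
        if queues[priority]:
            return queues[priority].popleft()
    return None


def next_by_fair_share(queues, weights, served):
    """Serve the backlogged user that is furthest below its share."""
    backlogged = [user for user in queues if queues[user]]
    if not backlogged:
        return None
    user = min(backlogged, key=lambda u: served[u] / weights[u])
    served[user] += 1
    return queues[user].popleft()


# Shares from the slide: User A 75%, User B 25%.
queues = {"A": deque(f"a{i}" for i in range(8)),
          "B": deque(f"b{i}" for i in range(8))}
served = {"A": 0, "B": 0}
for _ in range(8):
    next_by_fair_share(queues, {"A": 75, "B": 25}, served)
print(served)  # prints {'A': 6, 'B': 2}
```

With both users backlogged, eight slots split 6 to 2, matching the 75%/25% shares.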

33 Weighted Fair Sharing

34 Fault Tolerance Scheduler 1 Scheduler 2 Spark Client 1 ✗ Spark Client 2 Timeout: 100ms Failover: 5ms Re-launch queries: 15ms

35 Making Sparrow feature-complete Interfacing with UI Delay scheduling Speculation

36 (1) Diagnosing a Spark scheduling bottleneck (2) Distributed, fault-tolerant scheduling with Sparrow www.github.com/radlab/sparrow Scheduler Worker
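A client-side failover loop in the spirit of this slide might look like the sketch below. It assumes the three phases on the slide happen one after another, so recovery takes roughly their sum; the slide does not state this. The scheduler callables and all function names are hypothetical stand-ins, not Sparrow's API.

```python
# Figures from the slide, in milliseconds.
TIMEOUT_MS = 100
FAILOVER_MS = 5
RELAUNCH_MS = 15


def recovery_time_ms():
    """Total recovery time, assuming the three phases run sequentially."""
    return TIMEOUT_MS + FAILOVER_MS + RELAUNCH_MS


def submit_with_failover(query, schedulers):
    """Try each scheduler in order; move on when one times out."""
    for submit in schedulers:
        try:
            return submit(query)
        except TimeoutError:
            continue  # this scheduler is presumed failed; try the next
    raise RuntimeError("no scheduler reachable")


def dead_scheduler(query):
    """Stand-in for a failed scheduler."""
    raise TimeoutError("no response")


def live_scheduler(query):
    """Stand-in for a healthy scheduler."""
    return f"launched {query}"


print(submit_with_failover("q1", [dead_scheduler, live_scheduler]))
print(recovery_time_ms())  # prints 120
```

Because each client can fail over on its own, no scheduler state needs to be shared or recovered in this model.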

