1
Sparrow: Distributed, Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
2
Outline: the Spark scheduling bottleneck; Sparrow's fully distributed, fault-tolerant technique; Sparrow's near-optimal performance.
3
Spark Today. [Diagram: a single Spark Context handles query compilation, storage, and scheduling for Users 1-3 and dispatches tasks to Workers.]
5
Job Latencies Rapidly Decreasing (from ~10 min. down toward 10 sec., 100 ms, and 1 ms): 2004: MapReduce batch job; 2009: Hive query; 2010: Dremel query; 2010: in-memory Spark query; 2012: Impala query; 2013: Spark Streaming.
6
Job latencies rapidly decreasing
7
+ Spark deployments growing in size → Scheduling bottleneck!
8
Spark scheduler throughput: ~1,500 tasks/second. [Chart: the cluster size (# of 16-core machines) one scheduler can support, for task durations of 100 ms, 1 second, and 10 seconds.]
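A rough back-of-envelope reading of this chart, assuming one task per core, uniform task durations, and the ~1,500 tasks/second figure above (the constants below are illustrative, not measurements from the talk):

    # How many 16-core machines can a centralized scheduler that launches
    # ~1,500 tasks/second keep fully busy, as a function of task duration?
    SCHEDULER_THROUGHPUT = 1500   # tasks launched per second
    CORES_PER_MACHINE = 16        # one running task per core assumed

    def max_cluster_size(task_duration_s):
        # Each machine needs cores/duration new task launches per second;
        # the scheduler saturates when total demand equals its throughput.
        launches_per_machine = CORES_PER_MACHINE / task_duration_s
        return SCHEDULER_THROUGHPUT / launches_per_machine

    for duration in (0.1, 1.0, 10.0):
        print(f"{duration:>4}s tasks -> ~{max_cluster_size(duration):.0f} machines")
    # 0.1s tasks -> ~9 machines, 1.0s tasks -> ~94, 10.0s tasks -> ~938

Shorter tasks mean more launches per second per machine, so a single scheduler supports far fewer machines as task durations shrink.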
9
Optimizing the Spark Scheduler: 0.8: monitoring code moved off the critical path; 0.8.1: result deserialization moved off the critical path; future improvements may yield 2-3x higher throughput.
10
Is the scheduler the bottleneck in my cluster?
11
[Diagram: the cluster scheduler sends task launches to a Worker and receives task completions back.]
13
[Diagram: the same task launch / task completion loop, with the scheduler delay between a task completing and the next task launching highlighted.]
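One way (not necessarily the method used in the talk) to estimate scheduler delay from task logs, assuming the cluster is fully loaded so that any idle gap on a slot reflects waiting for the scheduler rather than a lack of work; the event field names are hypothetical:

    from collections import defaultdict

    def scheduler_delays(task_events):
        # task_events: dicts with 'slot', 'launch_time', 'completion_time' (seconds).
        by_slot = defaultdict(list)
        for ev in task_events:
            by_slot[ev["slot"]].append(ev)

        delays = []
        for events in by_slot.values():
            events.sort(key=lambda ev: ev["launch_time"])
            for prev, nxt in zip(events, events[1:]):
                # Gap between one task finishing and the next launching on the slot.
                delays.append(max(0.0, nxt["launch_time"] - prev["completion_time"]))
        return delays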
16
Spark Today (recap). [Diagram: a single Spark Context handles query compilation, storage, and scheduling for Users 1-3 and dispatches tasks to Workers.]
17
Future Spark. [Diagram: each of Users 1-3 gets their own scheduler and query compilation, sharing the pool of Workers.] Benefits: high throughput, fault tolerance.
18
Future Spark (continued). [Diagram: as above, with shared storage provided by Tachyon.]
19
Scheduling with Sparrow. [Diagram: a Sparrow scheduler placing a stage of tasks on Workers.]
20
Batch Sampling. [Diagram: the scheduler probes workers for a stage of tasks.] Place the m tasks in a stage on the least loaded of 2m probed workers; here, 4 probes (d = 2).
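A minimal sketch of the batch-sampling rule on this slide (not Sparrow's actual implementation; queue_length is a hypothetical callback returning a worker's current queue length):

    import random

    def batch_sample(workers, queue_length, m, d=2):
        # Probe d*m randomly chosen workers and pick the m least loaded of them.
        probed = random.sample(workers, min(d * m, len(workers)))
        probed.sort(key=queue_length)     # shortest queues first
        return probed[:m]                 # one task goes to each returned worker

With m = 2 tasks and d = 2 this is the "4 probes" case on the slide.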
21
Queue length is a poor predictor of wait time. [Diagram: queued tasks of 80 ms, 155 ms, and 530 ms; queues of equal length can imply very different waits.] Result: poor performance on heterogeneous workloads.
22
Late Binding. [Diagram: the scheduler sends probes to the sampled workers.] Place the m tasks in a stage on the least loaded of d·m workers; 4 probes (d = 2).
24
Late Binding (continued). [Diagram: a probed worker that reaches the front of its own queue requests a task from the scheduler, which only then sends it one of the stage's m tasks.]
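A simplified late-binding sketch of the interaction on these slides (worker.enqueue_reservation is a hypothetical method; Sparrow's real protocol runs over RPCs): probes place lightweight reservations in worker queues, and a task is bound to a worker only when that worker asks for it, so the m tasks end up on whichever probed workers free up first.

    from collections import deque

    class LateBindingScheduler:
        def __init__(self, tasks):
            self.pending = deque(tasks)        # tasks not yet bound to any worker

        def probe(self, worker):
            worker.enqueue_reservation(self)   # reservation only, no task data yet

        def get_task(self, worker):
            # Called by a worker whose reservation reached the front of its queue.
            if self.pending:
                return self.pending.popleft()  # bind the task at the last moment
            return None                        # all tasks already launched elsewhere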
25
What about constraints?
26
Per-Task Constraints. Probe separately for each constrained task, sampling only among the workers that satisfy that task's constraints.
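A hedged sketch of per-task constrained probing, assuming a satisfies(worker, task) predicate (hypothetical) that encodes a task's constraints, such as which machines hold its input data: constrained tasks cannot share a batch of probes, so each samples its own d candidates from the workers it is allowed to run on.

    import random

    def constrained_probe_targets(tasks, workers, satisfies, d=2):
        # Returns, for each task, the d workers that will be probed for it.
        # Tasks are assumed to be hashable identifiers.
        targets = {}
        for task in tasks:
            eligible = [w for w in workers if satisfies(w, task)]
            targets[task] = random.sample(eligible, min(d, len(eligible)))
        return targets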
27
Technique Recap: batch sampling + late binding + constraints. [Diagram: scheduler and workers.]
28
How well does Sparrow perform?
29
How does Sparrow compare to Spark's native scheduler? (100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load)
30
TPC-H Queries: Background. TPC-H is a common benchmark for analytics workloads. [Diagram: Shark, a SQL execution engine, runs on Spark, scheduled by Sparrow.]
31
TPC-H Queries (100 16-core EC2 nodes, 10 schedulers, 80% load). [Chart: response times at the 5th, 25th, 50th, 75th, and 95th percentiles.] Within 12% of ideal; median queuing delay of 9 ms.
32
Policy Enforcement. Priorities: each worker keeps high-priority and low-priority queues and serves them based on strict priorities. Fair shares: per-user queues (e.g., User A 75%, User B 25%) are served using weighted fair queuing.
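A simplified sketch of worker-side policy enforcement as described on this slide (not Sparrow's code; the weighted random pick is a coarse stand-in for weighted fair queuing): each worker keeps one queue per priority or per user and chooses which queue to serve next.

    import random
    from collections import deque

    class WorkerQueues:
        def __init__(self, weights):
            self.weights = weights                            # e.g. {"A": 0.75, "B": 0.25}
            self.queues = {name: deque() for name in weights}

        def enqueue(self, name, task):
            self.queues[name].append(task)

        def next_task_strict_priority(self, priority_order):
            # Priorities: always drain higher-priority queues first.
            for name in priority_order:                       # e.g. ["high", "low"]
                if self.queues[name]:
                    return self.queues[name].popleft()
            return None

        def next_task_weighted(self):
            # Fair shares: pick among backlogged users in proportion to their weights.
            backlogged = [n for n, q in self.queues.items() if q]
            if not backlogged:
                return None
            user = random.choices(backlogged, weights=[self.weights[n] for n in backlogged])[0]
            return self.queues[user].popleft()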
33
Weighted Fair Sharing
34
Fault Tolerance. [Diagram: Spark Client 1's scheduler (Scheduler 1) fails; the client fails over to Scheduler 2.] Timeout: 100 ms; failover: 5 ms; re-launch queries: 15 ms.
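A hedged sketch of the client-side failover shown on this slide (the scheduler submit interface and TimeoutError signaling are assumptions, not Sparrow's real API): the client detects a dead scheduler via a timeout, switches to another scheduler, and re-launches the queries that were in flight.

    class FailoverClient:
        def __init__(self, schedulers, timeout_ms=100):
            self.schedulers = list(schedulers)   # e.g. [scheduler1, scheduler2]
            self.current = 0
            self.timeout_ms = timeout_ms
            self.in_flight = []                  # queries not yet completed

        def submit(self, query):
            self.in_flight.append(query)
            try:
                self.schedulers[self.current].submit(query, timeout_ms=self.timeout_ms)
            except TimeoutError:
                self.fail_over()

        def fail_over(self):
            # Switch to the next scheduler and re-launch anything that may be lost.
            self.current = (self.current + 1) % len(self.schedulers)
            for query in self.in_flight:
                self.schedulers[self.current].submit(query, timeout_ms=self.timeout_ms)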
35
Making Sparrow feature-complete: interfacing with the UI, delay scheduling, speculation.
36
(1) Diagnosing a Spark scheduling bottleneck. (2) Distributed, fault-tolerant scheduling with Sparrow: www.github.com/radlab/sparrow