Adaptive Online Scheduling in Storm
Paper by Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni
Presentation by Keshav Santhanam
Motivation
Big data
  2.5 quintillion bytes of data generated per day [IBM]
  Volume, velocity, variety
Need a complex event processing engine
  Represent data as a real-time flow of events
  Analyze this data as quickly as possible
Storm
Processing engine for high-throughput data streams
Used by Groupon, Yahoo, Flipboard, etc.
Storm
Topology: directed graph of spouts and bolts
(Figure: a data source feeds a spout, which emits tuples to a bolt; that bolt emits tuples to a second bolt, which produces the output)
Storm
(Figure: Storm architecture. Labels: topology G(V, T), w; Nimbus; scheduler plugin; deployment plan S; supervisor; worker nodes; slots; worker process; executor)
Storm
Grouping strategies
Shuffle grouping: the target task is chosen randomly
  Ensures an even distribution of tuples
Fields grouping: a tuple is forwarded to a task based on the content of the tuple
  E.g. tuples with a key beginning with A-I are sent to one task, J-R to another task, etc.
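As a rough illustration of the two grouping strategies, the sketch below picks a target task for a tuple. It is not Storm's implementation: the function names are hypothetical, and hashing the key with CRC32 for fields grouping is an assumption made for the example.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng=random):
    """Pick a target task uniformly at random (hypothetical helper)."""
    return rng.randrange(num_tasks)


def fields_grouping(key, num_tasks):
    """Pick a target task from the tuple's key, so that tuples with the
    same key always reach the same task. CRC32 hashing is an assumption."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks
```

Calling fields_grouping twice with the same key and task count returns the same task index, while shuffle_grouping spreads tuples across tasks regardless of their content.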
Storm
EvenScheduler
  Round-robin allocation strategy
  First phase: assigns executors to workers evenly
  Second phase: assigns workers to worker nodes evenly
Problem: does not take network communication overhead into account
Solution:
  Identify “hot edges” of the topology
  Map hot edges to inter-process channels
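The two round-robin phases can be sketched as follows. This is a simplified illustration rather than the EvenScheduler source; the helper names and the plain lists of executor, worker, and node identifiers are assumptions. Note that the traffic between executors plays no role in the result.

```python
def round_robin(items, bins):
    """Assign items to bins in round-robin order; returns {bin: [items]}."""
    plan = {b: [] for b in bins}
    for i, item in enumerate(items):
        plan[bins[i % len(bins)]].append(item)
    return plan


def even_schedule(executors, workers, nodes):
    """Phase 1: executors -> workers. Phase 2: workers -> worker nodes."""
    executor_plan = round_robin(executors, workers)
    worker_plan = round_robin(workers, nodes)
    return executor_plan, worker_plan
```

For example, five executors over two workers end up split 3/2, whichever executors happen to exchange the most tuples.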
Adaptive Scheduling in Storm
Adaptive Schedulers
Key idea: place executors that communicate frequently with each other into the same slot, thus reducing network traffic
Offline scheduler
  Examine the topology before deployment and use a heuristic to place the executors
Online scheduler
  Analyze network traffic at runtime and periodically recompute the schedule
Assumptions
  Only acyclic topologies
  Upper bound on the number of hops a tuple makes as it traverses the topology
  Parameter α ∈ [0, 1] affects the maximum number of executors in a single slot
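To make the key idea concrete, the sketch below measures how much traffic crosses slot boundaries for a given placement. The function name, the executor and slot names, and the tuple rates are all made up for illustration.

```python
def inter_slot_traffic(traffic, slot_of):
    """Sum the tuple rates of executor pairs placed in different slots.

    traffic: {(executor_a, executor_b): tuples per second}
    slot_of: {executor: slot}
    """
    return sum(rate for (a, b), rate in traffic.items()
               if slot_of[a] != slot_of[b])


traffic = {("e1", "e2"): 100, ("e2", "e3"): 10}  # invented rates
spread = {"e1": "s1", "e2": "s2", "e3": "s3"}
colocated = {"e1": "s1", "e2": "s1", "e3": "s2"}

print(inter_slot_traffic(traffic, spread))     # 110
print(inter_slot_traffic(traffic, colocated))  # 10
```

Putting the pair with the heaviest traffic into one slot removes most of the cross-slot traffic in this toy case.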
Topology-based Scheduling
Offline Scheduler
1. Create a partial ordering of components
  If component c_i emits tuples that are consumed by another component c_j, then c_i < c_j
  If c_i < c_j and c_j < c_k, then c_i < c_k (transitivity)
  There can be components c_i and c_j such that neither c_i < c_j nor c_j < c_i holds
2. Use the partial order to create a linearization φ
  If c_i < c_j, then c_i appears before c_j in φ
  The first element of φ is a spout
3. Iterate over φ and, for each component c_i, place its executors in the slots that already contain executors of the components that directly emit tuples towards c_i
4. Assign the slots to worker nodes in round-robin fashion
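Steps 1 to 3 can be sketched as a topological sort followed by a greedy placement. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-slot cap max_per_slot stands in for whatever limit α induces, and falling back to the least loaded slot when no upstream slot has room is an assumption.

```python
import heapq
from collections import defaultdict


def linearize(components, edges):
    """Kahn's algorithm: return an order phi in which every producer
    appears before its consumers. edges are (producer, consumer) pairs."""
    indegree = {c: 0 for c in components}
    consumers = defaultdict(list)
    for src, dst in edges:
        consumers[src].append(dst)
        indegree[dst] += 1
    ready = [c for c in components if indegree[c] == 0]
    heapq.heapify(ready)
    phi = []
    while ready:
        c = heapq.heappop(ready)
        phi.append(c)
        for d in consumers[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                heapq.heappush(ready, d)
    if len(phi) != len(components):
        raise ValueError("topology has a cycle")
    return phi


def place(phi, edges, executors, num_slots, max_per_slot):
    """Place each component's executors in slots that already hold
    executors of its direct producers (max_per_slot is an assumed cap)."""
    producers = defaultdict(list)
    for src, dst in edges:
        producers[dst].append(src)
    slots = [[] for _ in range(num_slots)]
    slots_of = defaultdict(set)  # component -> indices of slots holding it
    for c in phi:
        for k in range(executors[c]):
            upstream = sorted({s for p in producers[c] for s in slots_of[p]})
            candidates = [s for s in upstream if len(slots[s]) < max_per_slot]
            if not candidates:  # assumed fallback: any slot with room
                candidates = [s for s in range(num_slots)
                              if len(slots[s]) < max_per_slot]
            if not candidates:
                raise ValueError("not enough slot capacity")
            target = min(candidates, key=lambda s: (len(slots[s]), s))
            slots[target].append((c, k))
            slots_of[c].add(target)
    return slots
```

With components C1 to C6, edges C1→C3, C3→C4, C4→C6, C2→C5, C5→C6, one executor per component, three slots, and max_per_slot = 2, linearize returns C1, C2, C3, C4, C5, C6 and place groups the executors as {C1, C3}, {C2, C5}, {C4, C6}.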
Offline Scheduler
Problem: if a worker does not have an executor, it gets ignored
Solution: use a tuning parameter β ∈ [0, 1] to force the scheduler to use its empty slots
  Use a higher β if traffic is expected to be heavier among upstream components
Offline Scheduler: example
Topology: components C1 to C6 (spouts and bolts)
Partial order: C1 < C3, C3 < C4 < C6, C2 < C5 < C6
Linearization φ: C1 < C2 < C3 < C4 < C5 < C6
Cluster: worker processes 1 and 2 on worker node 1, worker process 3 on worker node 2
Placement, following φ:
  C1 → worker process 1
  C2 → worker process 3
  C3 → worker process 1
  C4 → worker process 2
  C5 → worker process 3
  C6 → worker process 2
Result: worker process 1 = {C1, C3}, worker process 2 = {C4, C6}, worker process 3 = {C2, C5}
Traffic-based Scheduling
Online Scheduler
Goal: dynamically adapt the scheduling as the load on the nodes changes
Need to satisfy constraints on:
1. Number of workers for each topology
2. Number of slots available on each worker node
3. Computational power on each node
Storm Architecture with Online Scheduler
(Figure labels: topology G(V, T), w; Nimbus; scheduler plugin; deployment plan S; performance log; supervisor; worker nodes; slots; worker process; executor)
Online Scheduler
I. Partition the executors among the workers
  1. Iterate over all pairs of communicating executors (most traffic first)
  2. If neither executor has been assigned, assign both to the least loaded worker
  3. Otherwise, determine the best assignment using the executors’ current workers and the least loaded worker
II. Allocate workers to available slots
  1. Iterate over all pairs of communicating workers (most traffic first)
  2. If neither worker has been assigned, assign both to the least loaded node
  3. Otherwise, determine the best assignment using the workers’ current nodes and the least loaded node
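Phase I can be sketched as a greedy loop over executor pairs. The slides do not spell out how the "best assignment" is chosen, so the sketch below makes several assumptions: load is the number of executors on a worker, max_per_worker is an assumed cap of at least 2, and the best option is the feasible one with the least traffic crossing workers among executors placed so far. All names and rates are hypothetical.

```python
from itertools import product


def partition_executors(traffic, workers, max_per_worker):
    """Greedy Phase I sketch. traffic: {(a, b): rate}.
    Returns {executor: worker}."""
    assign = {}
    load = {w: 0 for w in workers}

    def least_loaded():
        return min(workers, key=lambda w: (load[w], workers.index(w)))

    def cross_traffic(trial):
        # Traffic between already-placed executors on different workers.
        return sum(r for (x, y), r in traffic.items()
                   if x in trial and y in trial and trial[x] != trial[y])

    def fits(trial):
        counts = {}
        for w in trial.values():
            counts[w] = counts.get(w, 0) + 1
        return all(n <= max_per_worker for n in counts.values())

    # Most traffic first.
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        # Candidates: current workers of a and b, plus the least loaded one.
        candidates = []
        for w in (assign.get(a), assign.get(b), least_loaded()):
            if w is not None and w not in candidates:
                candidates.append(w)
        best = None
        for wa, wb in product(candidates, repeat=2):
            trial = dict(assign)
            trial[a], trial[b] = wa, wb
            if not fits(trial):
                continue
            cost = cross_traffic(trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        if best is None:
            raise ValueError("no feasible placement for pair")
        assign = best[1]
        load = {w: 0 for w in workers}
        for w in assign.values():
            load[w] += 1
    return assign
```

When neither executor of a pair is assigned, the only candidate is the least loaded worker, so both land there, which matches step 2. Phase II could reuse the same loop with workers in place of executors and nodes in place of workers.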
Online Scheduler: example
Phase I
  Initial assignment: worker process 1 = {C1, C3}, worker process 2 = {C4, C6}, worker process 3 = {C2, C5}, worker process 4 = {}
  Communicating pairs, most traffic first: [(C5, C6), (C4, C6), (C1, C4), (C2, C5), (C1, C3)]
  (C5, C6): candidates are worker processes 2, 3, and 4 (least loaded worker); C5 and C6 move to worker process 4
  (C4, C6): no change
  (C1, C4): C1 moves to worker process 2
  (C2, C5): no change
  (C1, C3): no change
  Result: worker process 1 = {C3}, worker process 2 = {C1, C4}, worker process 3 = {C2}, worker process 4 = {C5, C6}
Phase II
  Worker process 4 = {C5, C6} and worker process 2 = {C1, C4} on worker node 1
  Worker process 3 = {C2} on worker node 2
  Worker process 1 = {C3}
Evaluation
Topologies
  General-case reference topology
  DEBS 2013* Grand Challenge dataset
Key metrics
  Average latency for an event to traverse the entire topology
  Average inter-node traffic at runtime
Cluster specifications
  8 worker nodes, each with:
    5 worker slots
    Ubuntu x2.8 GHz CPUs
    3 GB RAM
    15 GB disk storage
*The 7th ACM International Conference on Distributed Event-Based Systems
Evaluation
Reference Topology
  Each spout executor emits tuples at a fixed rate, and the average of these rates is exactly R
  Bolts forward the received value ½ the time and a different constant value the rest of the time
(Figure labels: spout, simple, stateful, ack; stage 1, stage 2, stage N-1, stage N)
Evaluation
Reference topology settings: 7 stages, replication factor of 4 for the spout, 3 for simple bolts, 2 for stateful bolts
Each point represents the average latency over a 10-event window
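The per-point averaging can be sketched as below. The slides do not say whether the windows overlap; non-overlapping windows are assumed here, and the function name is hypothetical.

```python
def window_averages(latencies, window=10):
    """Average latency per non-overlapping window of `window` events.
    A trailing partial window is dropped (an assumption)."""
    return [sum(latencies[i:i + window]) / window
            for i in range(0, len(latencies) - window + 1, window)]
```

Each returned value corresponds to one plotted point.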
Evaluation Parameters: α = 0, β = 0.5, average data rate R = 100 tuples/s, variance V = 20%
Evaluation Parameters: 5 stage topology, replication factor 5, R = 1000 tuples/s, variance V = 20%
Evaluation
2013 DEBS Grand Challenge
  Sensors in soccer players’ shoes emit position and speed data at a frequency of 200 Hz
  Goal is to maintain up-to-date statistics such as average speed, walked distance, etc.
Grand Challenge Topology
  Spout for the sensors (sensor)
  Bolt that computes instantaneous speed and receives tuples by shuffle grouping (speed)
  Bolt that maintains and updates statistics as tuples are received from the speed bolt (analysis)
Evaluation
(Figure: Grand Challenge topology with spout sensor (x8), bolt speed (x4), and bolt analysis (x2))
Personal Thoughts
Pros:
  Key idea (scheduling based on minimizing network communication) can easily make a direct impact on average processing time for Storm topologies
  Offline algorithm is relatively simple and does not require significant architectural changes
  Online algorithm is conceptually simple to understand despite its length
Personal Thoughts
Cons:
  Authors did not prove the greedy heuristic correct for the online algorithm
  Online algorithm doesn’t consider the load due to I/O-bound operations or network communication with external systems
  Authors acknowledge that the online algorithm as presented ignores corner cases - what are those corner cases?
Questions?
Storm
Key constructs:
  Topology: directed graph representing a Storm application
  Tuple: ordered list of elements E.g.
  Stream: sequence of tuples
  Spout: node that serves as a source of tuples
  Bolt: node that processes tuples and provides an output stream
Nimbus
Java process responsible for accepting a topology, deploying it across the cluster, and detecting failures
Serves as the master to the supervisors
Uses ZooKeeper for coordination
Includes the scheduler, which deploys a topology in 2 phases:
1. Assign executors to workers
2. Assign workers to slots
Execution Terminology
Component: node in the topology graph
Task: instance of a component
Worker node: physical machine in the Storm cluster
Worker: Java process running on a worker node
Supervisor: Java process running on a worker node that launches and monitors each worker
Slot: space for a worker on a worker node
Executor: thread running on a worker
Execution Constraints
Number of tasks for a certain component is fixed by the user
Number of executors for each component is fixed by the user
  Must be less than or equal to the number of tasks to avoid idle executors
Custom Storm Schedules
Input: topology G(V, T), w + user-defined parameters (α, β, …)
Output: deployment plan
IScheduler API provided to plug in a custom scheduler
  schedule method takes 2 parameters:
  1. Object containing the definition of all currently running topologies
  2. Object representing the physical cluster