From Rivulets to Rivers: Elastic Stream Processing in Heron Bill Graham, Twitter - @billgraham Ashvin Agrawal, Microsoft Avrilia Floratou, Microsoft
Prediction is very difficult, especially if it’s about the future. Niels Bohr We cannot direct the wind, but we can adjust the sails. Dolly Parton
Outline: Heron Overview. Elastic Scaling Challenges. Current Implementation. Work in Progress – Auto-scaling.
Heron: A realtime, distributed, fault-tolerant stream processing engine.
About Heron: Developed by Twitter in 2014. Open sourced in May 2016. Storm API compatible. Isolation at all levels: topology, container, task (process-based). At-least-once and at-most-once semantics. Backpressure. Low resource overhead (< 10%).
Logical Topology [Diagram with nodes Spout 1, Spout 2, Bolt 1, Bolt 2, Bolt 3, Bolt 4, Bolt 5]
Physical Execution [Diagram with nodes Spout 1, Spout 2, Bolt 1, Bolt 2, Bolt 3, Bolt 4, Bolt 5]
Packing Plan How to distribute instances onto containers? IPacking.pack()
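One way to picture the packing question is a round-robin assignment. The sketch below is illustrative only: the function name `pack_round_robin`, its arguments, and the dictionary layout are assumptions made for this example and are not Heron's `IPacking` interface. Containers are numbered from 1 because the slides reserve Container 0 for the Topology Master.

```python
from collections import defaultdict


def pack_round_robin(parallelism, num_containers):
    """Assign component instances to containers in round-robin order.

    parallelism: dict of component name -> number of instances (hypothetical input).
    Returns a dict of container id (1-based) -> list of (component, task index).
    """
    if num_containers < 1:
        raise ValueError("need at least one container")
    plan = defaultdict(list)
    slot = 0
    for component, count in parallelism.items():
        for index in range(count):
            # Cycle through containers 1..num_containers.
            container_id = slot % num_containers + 1
            plan[container_id].append((component, index))
            slot += 1
    return dict(plan)


plan = pack_round_robin({"spout1": 2, "bolt1": 4}, num_containers=3)
for container_id in sorted(plan):
    print(container_id, plan[container_id])
```

Round-robin keeps container sizes within one instance of each other, which is the simplest form of a uniform distribution. A real packing algorithm would also account for CPU and memory per instance.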
Topology Submission [Diagram: `heron submit` goes from the Heron Client to the Heron Scheduler with a PackingPlan. Container 0 runs the Topology Master; each of the other containers runs a Stream Manager plus spout and bolt instances (labels S1, S2, B2 through B6). Stage labels: Containers Allocated, Processes Initialize, Instances Register, Stream Manager Registers, Data Flows.]
Data Rate Variations
Parallelism Challenges: Anticipating component parallelism is difficult. Changing parallelism is costly, O(hours): code change, review, merge, build, kill, submit. Tuning for load spikes or valleys is manual, O(days). Under-provisioning leads to backpressure, which in turn leads to support costs. Over-provisioning is the norm.
Over-provisioning [Chart: CPU Requested vs. CPU Used; values shown: 40%, 25%]
Elastic Scaling Opportunity: Reduce administration cost. Reduce support cost. Reduce hardware cost. Provide better SLA.
Ordinary Topology Management Process [Diagram: steps split into User Tasks and Heron System Tasks, with Time Consuming Tasks highlighted. Step labels: Releases Resources, Kill Topology, Submit Topology, Create Packing, Acquire Resources, Monitor / Estimate, Build State, Start Topology, Install Topology.]
Low-cost Topology “update” [Diagram not recoverable; surviving labels: 2, 2, 3, 4, 4, 3]
Optimized Topology Scale-up Process [Diagram: steps split into User Tasks and Heron System Tasks. Step labels: Kill Topology, Submit Topology, Create Packing, Acquire Resources, Update Topology, Pause Topology, Add / Reduce Resources, Un-Pause Topology, Prepare Components, Monitor / Estimate, Build State, Start Topology, Install Topology.]
heron “update” …
$ heron update my_cluster/user/dev MyTopology --component-parallelism=bolt1:20 --component-parallelism=bolt2:40
Available in 0.14.5. Execution time O(mins). Aims to maintain uniform component distribution. Aggressively prunes containers. Minimizes disruption. Customizable through IRepacking.repack().
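The goals listed for the update (uniform distribution, container pruning, minimal disruption) can be sketched with a toy repacking routine. Everything here is an assumption made for illustration: the function `repack`, the plan layout, and the least-loaded placement rule are not Heron's `IRepacking` implementation, and the sketch never adds new containers.

```python
def repack(plan, component, new_parallelism):
    """Return a new plan with `component` scaled to `new_parallelism`.

    plan: dict of container id -> list of component names, one per instance
    (hypothetical layout). Instances of other components are never moved,
    which keeps disruption low.
    """
    new_plan = {cid: list(instances) for cid, instances in plan.items()}
    current = sum(instances.count(component) for instances in new_plan.values())

    # Scale up: put each new instance in the least-loaded container.
    for _ in range(max(0, new_parallelism - current)):
        target = min(new_plan, key=lambda cid: (len(new_plan[cid]), cid))
        new_plan[target].append(component)

    # Scale down: remove from the most-loaded container holding the component.
    for _ in range(max(0, current - new_parallelism)):
        holders = [cid for cid in new_plan if component in new_plan[cid]]
        target = max(holders, key=lambda cid: (len(new_plan[cid]), -cid))
        new_plan[target].remove(component)

    # Prune containers that were left empty.
    return {cid: inst for cid, inst in new_plan.items() if inst}


plan = {1: ["spout1", "bolt1"], 2: ["spout1", "bolt1"], 3: ["bolt2"]}
scaled_up = repack(plan, "bolt1", 4)
scaled_down = repack(plan, "bolt2", 0)
print(scaled_up)
print(scaled_down)
```

Because only the scaled component's instances are added or removed, untouched containers keep running as they were, and a container that ends up empty is dropped from the plan.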
Current Limitations: Automated state transition not yet supported. Component scaling event notification: IUpdatable.update(). Example: KafkaSpout queue partition mappings. Fields grouping routing might change. Workaround: pause the topology for longer than the cache flush interval before scaling. Algorithmic Auto-Scaling. Modifying an existing packing plan can be more complex than creating one from scratch.
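Why routing might change can be seen with a small experiment. The sketch below assumes that a fields grouping picks a task as a stable hash of the key modulo the component parallelism; that rule, the `route` helper, and the use of CRC32 are assumptions for illustration, not a statement of how Heron hashes keys.

```python
import zlib


def route(key, parallelism):
    """Pick a task index for a key: stable hash modulo parallelism (assumed rule)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def remapped_fraction(keys, old_parallelism, new_parallelism):
    """Fraction of keys whose task index differs after the parallelism change."""
    moved = sum(
        1 for k in keys
        if route(k, old_parallelism) != route(k, new_parallelism)
    )
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
fraction = remapped_fraction(keys, old_parallelism=4, new_parallelism=5)
print(f"{fraction:.0%} of keys change task when scaling 4 -> 5")
```

Under this rule a key keeps its task only when its hash gives the same remainder for both parallelism values, so most keys move. Any per-key state cached in a bolt would then sit on the wrong task, which is the motivation for pausing past the cache flush interval before scaling.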
Algorithmic Auto-Scaling … [Diagram: steps split into User Tasks and Heron System Tasks. Step labels: Submit Topology, Create Packing, Acquire Resources, Update Topology, Pause Topology, Add / Reduce Resources, Un-Pause Topology, Prepare Components, Monitor / Estimate, Build State, Start Topology, Install Topology.]
Auto-Scaling Heron uses Dhalion to adjust to external shocks. Dhalion is a framework that provides self-regulating capabilities to Heron and will be open-sourced in the near future. Dhalion periodically observes the state of the topology and determines whether resources should be scaled up or down. Heron should automatically identify variations in the incoming load and react to them.
Using Dhalion to Auto-Scale: Dhalion scales the topology resources up and down as needed while keeping the topology in a steady state where backpressure is not observed. [Diagram: three phases. Symptom Detection: Metrics feed the Pending Packets Detector, Backpressure Detector, and Processing Rate Skew Detector, which produce Symptoms. Diagnosis Generation: Resource Overprovisioning Diagnoser, Resource Underprovisioning Diagnoser, Data Skew Diagnoser, and Slow Instances Diagnoser produce a Diagnosis. Resolution: Resolver Invocation selects among the Bolt Scale Down Resolver, Bolt Scale Up Resolver, Data Skew Resolver, and Restart Instances Resolver.]
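The three phases named on this slide (symptom detection, diagnosis generation, resolution) can be sketched as a small pipeline. This is a toy model, not Dhalion's API: the function names, the metric keys `backpressure_ms` and `pending_packets`, the threshold of 10, and the one-step scaling rule are all assumptions chosen for the example.

```python
def detect_symptoms(metrics):
    """Symptom detection: turn raw metrics into named symptoms."""
    symptoms = set()
    if metrics["backpressure_ms"] > 0:
        symptoms.add("backpressure")
    if metrics["pending_packets"] < 10:  # hypothetical threshold
        symptoms.add("low_pending_packets")
    return symptoms


def diagnose(symptoms):
    """Diagnosis generation: map symptoms to at most one diagnosis."""
    if "backpressure" in symptoms:
        return "underprovisioned"
    if "low_pending_packets" in symptoms:
        return "overprovisioned"
    return None


def resolve(diagnosis, parallelism):
    """Resolution: return the new bolt parallelism for a diagnosis."""
    if diagnosis == "underprovisioned":
        return parallelism + 1
    if diagnosis == "overprovisioned" and parallelism > 1:
        return parallelism - 1
    return parallelism


def run_policy_once(metrics, parallelism):
    """One periodic pass: detect, diagnose, resolve."""
    return resolve(diagnose(detect_symptoms(metrics)), parallelism)


print(run_policy_once({"backpressure_ms": 500, "pending_packets": 1000}, 4))
print(run_policy_once({"backpressure_ms": 0, "pending_packets": 0}, 4))
```

Keeping the phases as separate functions mirrors the slide's separation of detectors, diagnosers, and resolvers: a new symptom or resolver can be added without touching the other stages.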
Initial Results Dhalion is able to adjust the topology resources on-the-fly when workload spikes occur. Our policy eventually reaches a healthy state where backpressure is not observed and the overall throughput is maximized.
Future Plans Use Dhalion to enforce throughput and latency SLOs and to auto-tune Heron topologies. Open-source Dhalion and the auto-scaling policy as part of Heron. Combine scaling with stateful stream processing.
Get Involved http://github.com/twitter/heron http://heronstreaming.io @heronstreaming
Up Next Anomaly detection in real-time data streams using Heron Arun Kejariwal, Machine Zone Karthik Ramasamy, Twitter
Questions?