Download presentation
Presentation is loading. Please wait.
Published byBarry O’Neal’ Modified over 6 years ago
1
International Conference on Data Engineering (ICDE 2016)
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing Hey everyone: I am Anis. A PhD student from KTH Royal Institute of Technology. The work I will present today is a result of my internship at Yahoo Labs Barcelona For this work, I was working with people at the top. There is Gianmarco, David, Nicolas and Marco In our work, we propose a very simple and practical approach for load balancing for Stream processing Engines Muhammad Anis Uddin Nasir International Conference on Data Engineering (ICDE 2016)
2
Past work: ICDE 2015 Hey everyone,
I am Anis :D I am a PhD Student at KTH Royal Institute of Technology The work I will present today is the result of my internship during the PhD at Yahoo Labs Barcelona On the top is our team, where, David and Nicolas were from Yahoo Labs Barcelona, Gianmarco, currently is in Alto University in Helsinki and Marco is from Qatar Computing Research Institute Our work focuses on solving the problem of load imabalance in Stream Processing Engines. So during the talk, I will briefly describe Stream Processing Systems and Model, explain about the issues that cause load imbalance and discuss our simple solution to achieve load balance in Streaming environment
3
Stream Processing Streams are produced through event logs, user activity logs, web logs, network traffic, mobile traffic, sensors, High rate of incoming traffic millions of messages per sec 3 Vs, variety, velocity, volume
4
1st challenge: Velocity
stream
5
2nd challenge: Volume
6
Requirements Low Latency High throughput Utilization of Resources
Results in microseconds High throughput Millions of events per second Utilization of Resources Workload sharing
7
Stream Processing Engines
Streaming Application Online Machine Learning Real Time Query Processing Continuous Computation Streaming Frameworks Storm, S4, Flink Streaming, Spark Streaming Stream processing systems are specialized for low latency and high throughput processing for real time data Few common domains of streaming applications are online machine learning, real time query processing, continuous computation Due to the need of real time processing, various frameworks have been proposed in the last decade.
8
Stream Processing Model
Streaming Applications are represented by Directed Acyclic Graphs (DAGs) Worker Streaming Applications are represented as DAGs In a DAG a vertex are set of operators that are distributed across cluster, which apply various light weight transformation of the incoming stream. For example, filters, join, union, aggregrate are common stream transformation. Edges are data channels that are use to route the data from one operator to another In stream processing systems, there are various stream grouping strategies for routing the keys from one level of operators to the never level of operators In todays talk, I will be concentrating on load balancing between the group of operators at each level of DAG Source Worker Data Stream Source Operators Worker Data Channels
9
Real-Life Example (1)
10
Real-Life Example (2) k1 (p1) kn (pn)
m = <timestamp, key, value> = < , 046-XXXX817, 217> k1 (p1) kn (pn)
11
Stream Grouping 1. Key or Fields Grouping (Hash-Based)
Single worker per key Stateful operators 2. Shuffle Grouping (Round Robin) All workers per key Stateless Operators 3. Partial Key Grouping Two workers per key MapReduce-like Operators Key grouping uses hash based assignment. It applies a hash function on the incoming data and assign the worker to the message by taking the mod of the hash Key grouping is a good choice for stateful operators like aggregates Shuffle grouping is a simple round roibin assignment scheme. Shuffle grouping is a good choice for stateless operators
12
Key Grouping Key Grouping Scalable ✖ Low Memory ✔ Load Imbalance ✖
Worker Source To understand stream grouping strategies, lets take an example of word count Suppose we want to count the words in the tweets from twitter To implement twitter wordcount, we need to set of operators, A first level of operator to split the tweets into set of words. And the next level of operator to count the words As we know that many of the real workloads are highly skewed. Worker Source Key Grouping Scalable ✖ Low Memory ✔ Load Imbalance ✖ Worker
13
Shuffle Grouping Shuffle Grouping Load Balance ✔ Memory O(W) ✖
Worker Source Worker Aggr. Source Shuffle Grouping Load Balance ✔ Memory O(W) ✖ Aggregation O(W) ✖ Worker
14
Partial Key Grouping Partial Key Grouping Scalable ✖ Low Memory ✔
Worker Aggr. Source Partial Key Grouping Scalable ✖ Low Memory ✔ Load Imbalance ✖
15
Partial Key Grouping P1 = 9.32%
16
Problem Formulation Input is a unbounded sequence of messages from a key distribution Each message is assigned to a worker for processing (i.e., filter, aggregate, join) Load balance properties Memory Load Balance Network Load Balance Processing Load Balance Metric: Load Imbalance Talk about different load balance schemes. Memory Network Processing In particular, we are interested in balancing the workload for processing. So total number of messages processing
17
High-Level Idea splits head from the tail k1 (p1) kn (pn)
This is the average imbalance across the system and time, for Twitter, Wikipedia, Cashtags and a synthetic lognormal distribution. We study how well the local load estimation compares with global information. It is always very close, regardless of how many sources we allow the system to have. Some different patterns for the various datasets, but the trend is the same. Hashing is worst. Hashing is just for reference k1 (p1) kn (pn)
18
How to find the optimal threshold?
Any key that exceeds the capacity of two workers requires more than two workers pi ≥ 2/(n) We need to consider the collision of the keys while deciding the number of workers PKG guarantees nearly perfect load balance for p1 ≤ 1/(5n)
19
How to find optimal threshold?
20
How many keys are in the head?
Plots for the number of keys in head for two different thresholds
21
How many workers for the head?
D-Choices: adapts to the frequencies of the keys in the head W-Choices: allows all the workers for the keys in the head Round-Robin: employs shuffle grouping for the keys in the head
22
How many workers for the head?
How to assign a key to set of d workers? Greedy-d: uses d different hash functions generates set of d candidate workers assigns the key to the least-loaded worker In case of W-Choices, all the workers are candidates for a key
23
How to find the optimal d?
Optimization problem:
24
How to find the optimal d?
We rewrite the constraint: For instance for the first key with p1:
25
How to find the optimal d?
We rewrite the constraint: where
26
What are the values of d?
27
Memory Overhead Compared to PKG
28
Memory Overhead Compared to SG
29
Experimental Evaluation
Datasets
30
Experimental Evaluation
Algorithms
31
How good are estimated d?
Estimated d vs. the minimal experimental value of d
32
Load Imbalance for Zipf
33
Load balance for real workloads
Comparison of D-C, WC with PKG
34
Load Imbalance over time
Load imbalance over time for the real-world datasets
35
Throughput on real DSPE
Throughput on a cluster deployment on Apache Storm for KG, PKG, SG, D-C, and W-C on the ZF dataset
36
Latency on a real DSPE Latency (on a cluster deployment on Apache Storm) for KG, PKG, SG, D-C, and W-C
37
Conclusion We propose two algorithms to achieve load balance at scale for DSPEs Use heavy hitters to separate the head of the distribution and process on larger set of workers Improvement translates into 150% gain in throughput and 60% gain in latency over PKG
38
International Conference on Data Engineering (ICDE 2016)
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing Hey everyone: I am Anis. A PhD student from KTH Royal Institute of Technology. The work I will present today is a result of my internship at Yahoo Labs Barcelona For this work, I was working with people at the top. There is Gianmarco, David, Nicolas and Marco In our work, we propose a very simple and practical approach for load balancing for Stream processing Engines Muhammad Anis Uddin Nasir International Conference on Data Engineering (ICDE 2016)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.