International Conference on Data Engineering (ICDE 2016)

When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing
Muhammad Anis Uddin Nasir
Hey everyone, I am Anis, a PhD student at KTH Royal Institute of Technology. The work I will present today is the result of my internship at Yahoo Labs Barcelona. For this work, I was working with the people listed at the top: Gianmarco, David, Nicolas and Marco. In our work, we propose a very simple and practical approach to load balancing for stream processing engines.

Past work: ICDE 2015
Hey everyone, I am Anis, a PhD student at KTH Royal Institute of Technology. The work I will present today is the result of an internship I did during my PhD at Yahoo Labs Barcelona. At the top is our team: David and Nicolas were at Yahoo Labs Barcelona, Gianmarco is currently at Aalto University in Helsinki, and Marco is at Qatar Computing Research Institute. Our work focuses on solving the problem of load imbalance in stream processing engines. During the talk, I will briefly describe stream processing systems and their model, explain the issues that cause load imbalance, and discuss our simple solution for achieving load balance in a streaming environment.

Stream Processing
Streams are produced by event logs, user activity logs, web logs, network traffic, mobile traffic, and sensors.
High rate of incoming traffic: millions of messages per second.
The 3 Vs: variety, velocity, volume.

1st challenge: Velocity
stream

2nd challenge: Volume

Requirements
Low latency: results in microseconds
High throughput: millions of events per second
Utilization of resources: workload sharing

Stream Processing Engines
Streaming applications: online machine learning, real-time query processing, continuous computation
Streaming frameworks: Storm, S4, Flink Streaming, Spark Streaming
Stream processing systems are specialized for low-latency, high-throughput processing of real-time data. A few common domains of streaming applications are online machine learning, real-time query processing, and continuous computation. Due to the need for real-time processing, various frameworks have been proposed in the last decade.

Stream Processing Model
Streaming applications are represented as directed acyclic graphs (DAGs).
In a DAG, each vertex is a set of operators distributed across the cluster, which apply various lightweight transformations to the incoming stream. For example, filter, join, union, and aggregate are common stream transformations. Edges are data channels that are used to route data from one operator to another. In stream processing systems, there are various stream grouping strategies for routing keys from one level of operators to the next level of operators. In today's talk, I will concentrate on load balancing between the group of operators at each level of the DAG.
[Diagram labels: Source, Worker, Data Stream, Operators, Data Channels]

Real-Life Example (1)

Real-Life Example (2)
m = <timestamp, key, value> = < , 046-XXXX817, 217>
[Diagram labels: keys k1 ... kn with frequencies p1 ... pn]

Stream Grouping
1. Key or Fields Grouping (Hash-Based): single worker per key; stateful operators
2. Shuffle Grouping (Round Robin): all workers per key; stateless operators
3. Partial Key Grouping: two workers per key; MapReduce-like operators
Key grouping uses hash-based assignment: it applies a hash function to the key of the incoming message and assigns the message to a worker by taking the hash modulo the number of workers. Key grouping is a good choice for stateful operators such as aggregates. Shuffle grouping is a simple round-robin assignment scheme and is a good choice for stateless operators.
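The three grouping strategies can be illustrated with a minimal Python sketch. This is not code from the talk or from any stream processing engine: `stable_hash`, `key_grouping`, `ShuffleGrouping`, and `PartialKeyGrouping` are hypothetical names, SHA-256 is an arbitrary choice of hash function, and the load counters are assumed to be a local estimate kept at a single source.

```python
import hashlib


def stable_hash(key, seed=0):
    """Deterministic hash (hypothetical helper); built-in hash() is salted per process."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def key_grouping(key, n_workers):
    """Key grouping: one fixed worker per key (hash modulo number of workers)."""
    return stable_hash(key) % n_workers


class ShuffleGrouping:
    """Shuffle grouping: round-robin over all workers, ignoring the key."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self._next = 0

    def route(self, key):
        worker = self._next
        self._next = (self._next + 1) % self.n_workers
        return worker


class PartialKeyGrouping:
    """Partial key grouping: two hash-based candidates per key, pick the less loaded."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.load = [0] * n_workers  # assumed local load estimate at the source

    def route(self, key):
        first = stable_hash(key, seed=1) % self.n_workers
        second = stable_hash(key, seed=2) % self.n_workers
        worker = first if self.load[first] <= self.load[second] else second
        self.load[worker] += 1
        return worker
```

In this sketch a single hot key always lands on one worker under key grouping, on every worker under shuffle grouping, and on at most two workers under partial key grouping.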

Key Grouping
Key Grouping: Scalable ✖, Low Memory ✔, Load Imbalance ✖
To understand stream grouping strategies, let's take word count as an example. Suppose we want to count the words in tweets from Twitter. To implement Twitter word count, we need two sets of operators: a first level of operators to split the tweets into words, and a next level of operators to count the words. As we know, many real workloads are highly skewed.
[Diagram labels: Source, Worker]

Shuffle Grouping
Shuffle Grouping: Load Balance ✔, Memory O(W) ✖, Aggregation O(W) ✖
[Diagram labels: Source, Worker, Aggr.]

Partial Key Grouping
Partial Key Grouping: Scalable ✖, Low Memory ✔, Load Imbalance ✖
[Diagram labels: Source, Worker, Aggr.]

Partial Key Grouping P1 = 9.32%

Problem Formulation
Input is an unbounded sequence of messages drawn from a key distribution.
Each message is assigned to a worker for processing (i.e., filter, aggregate, join).
Load balance properties: memory load balance, network load balance, processing load balance.
Metric: load imbalance.
There are different load balance schemes: memory, network, and processing. In particular, we are interested in balancing the processing workload, that is, the total number of messages each worker processes.
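The slide names load imbalance as the metric without defining it. A minimal sketch, assuming one common definition (the load of the most loaded worker minus the average load, with load counted in messages); the function name `load_imbalance` is hypothetical:

```python
def load_imbalance(loads):
    """Assumed definition: maximum worker load minus average worker load."""
    return max(loads) - sum(loads) / len(loads)


# Three workers that each processed 10 messages are perfectly balanced.
balanced = load_imbalance([10, 10, 10])

# One worker that processed all 30 messages is 20 messages above the average.
skewed = load_imbalance([30, 0, 0])
```

Under this definition, a value of zero means every worker processed the same number of messages.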

High-Level Idea
Split the head from the tail.
This is the average imbalance across the system and over time, for Twitter, Wikipedia, Cashtags, and a synthetic lognormal distribution. We study how well local load estimation compares with global information. It is always very close, regardless of how many sources we allow the system to have. There are some different patterns for the various datasets, but the trend is the same. Hashing performs worst and is shown only for reference.
[Diagram labels: keys k1 ... kn with frequencies p1 ... pn]

How to find the optimal threshold?
Any key that exceeds the capacity of two workers requires more than two workers: p_i ≥ 2/n.
We need to consider collisions between keys when deciding the number of workers.
PKG guarantees nearly perfect load balance for p_1 ≤ 1/(5n).
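To make the two bounds concrete, here is a small sketch that selects the head of a key distribution for a given threshold. It assumes exact key counts over a finite sample, which a real streaming source would not have; `head_keys` and its parameters are hypothetical names.

```python
from collections import Counter


def head_keys(stream, threshold):
    """Return the keys whose relative frequency is at least `threshold`."""
    counts = Counter(stream)
    total = sum(counts.values())
    return {key for key, count in counts.items() if count / total >= threshold}


n = 5  # number of workers (illustrative value)
# Key "a" has frequency 0.5, "b" has 0.3, and 20 other keys have 0.01 each.
sample = ["a"] * 50 + ["b"] * 30 + [f"k{i}" for i in range(20)]

head_upper = head_keys(sample, 2 / n)        # threshold 0.4
head_lower = head_keys(sample, 1 / (5 * n))  # threshold 0.04
```

With the higher threshold 2/n only key "a" falls in the head, while the lower threshold 1/(5n) also places "b" in the head.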

How to find the optimal threshold?

How many keys are in the head?
Plots for the number of keys in head for two different thresholds
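The conclusion of the talk mentions using heavy hitters to find the head. The transcript does not say which heavy-hitter algorithm is used, so the following is only a sketch of one well-known option, a SpaceSaving-style counter summary with at most k counters; `space_saving` is a hypothetical name.

```python
def space_saving(stream, k):
    """SpaceSaving-style summary: track at most k keys with estimated counts."""
    counters = {}
    for key in stream:
        if key in counters:
            counters[key] += 1
        elif len(counters) < k:
            counters[key] = 1
        else:
            # Evict the key with the smallest estimate; the newcomer inherits it plus one.
            victim = min(counters, key=counters.get)
            counters[key] = counters.pop(victim) + 1
    return counters


# Key "a" appears 60 times out of 100 messages; 40 other keys appear once each.
messages = []
for i in range(40):
    messages += ["a", f"x{i}"]
messages += ["a"] * 20

summary = space_saving(messages, k=4)
```

Estimates in such a summary never undercount a tracked key, so a frequent key like "a" stays in the summary and can be compared against the head threshold.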

How many workers for the head?
D-Choices: adapts to the frequencies of the keys in the head
W-Choices: allows all the workers for the keys in the head
Round-Robin: employs shuffle grouping for the keys in the head

How many workers for the head?
How to assign a key to a set of d workers?
Greedy-d: uses d different hash functions, generates a set of d candidate workers, and assigns the key to the least-loaded worker.
In the case of W-Choices, all the workers are candidates for a key.
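The Greedy-d step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the d hash functions are simulated by seeding SHA-256, loads are a local estimate held in a list, and `candidate_workers` and `greedy_d` are hypothetical names. When d is at least the number of workers, the sketch treats every worker as a candidate, as in W-Choices.

```python
import hashlib


def candidate_workers(key, d, n_workers):
    """Derive up to d candidate workers from d seeded hashes (collisions may leave fewer)."""
    candidates = []
    for seed in range(d):
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        worker = int.from_bytes(digest[:8], "big") % n_workers
        if worker not in candidates:
            candidates.append(worker)
    return candidates


def greedy_d(key, d, loads):
    """Send one message for `key` to the least-loaded candidate and update its load."""
    n_workers = len(loads)
    if d >= n_workers:
        candidates = list(range(n_workers))  # W-Choices: all workers are candidates
    else:
        candidates = candidate_workers(key, d, n_workers)
    worker = min(candidates, key=lambda w: loads[w])
    loads[worker] += 1
    return worker


# W-Choices on 4 workers: 8 messages of one hot key spread evenly.
w_loads = [0] * 4
for _ in range(8):
    greedy_d("hot", 4, w_loads)

# D-Choices with d = 3 on 10 workers: the key only touches its candidates.
d_loads = [0] * 10
used = {greedy_d("hot", 3, d_loads) for _ in range(30)}
```

Because two hash functions can map a key to the same worker, the effective number of candidates may be smaller than d, which is the collision effect the threshold slide refers to.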

How to find the optimal d?
Optimization problem:

We rewrite the constraint: For instance for the first key with p1:

We rewrite the constraint: where

What are the values of d?

Memory Overhead Compared to PKG

Memory Overhead Compared to SG

Experimental Evaluation
Datasets

Experimental Evaluation
Algorithms

How good are estimated d?
Estimated d vs. the minimal experimental value of d

Load Imbalance for Zipf

Load balance for real workloads
Comparison of D-C and W-C with PKG

Load Imbalance over time
Load imbalance over time for the real-world datasets

Throughput on real DSPE
Throughput on a cluster deployment on Apache Storm for KG, PKG, SG, D-C, and W-C on the ZF dataset

Latency on a real DSPE
Latency on a cluster deployment on Apache Storm for KG, PKG, SG, D-C, and W-C

Conclusion
We propose two algorithms to achieve load balance at scale for DSPEs.
Use heavy hitters to separate the head of the distribution and process it on a larger set of workers.
The improvement translates into a 150% gain in throughput and a 60% gain in latency over PKG.
