Guangxiang Du*, Indranil Gupta


NEW TECHNIQUES TO CURTAIL THE TAIL LATENCY IN STREAM PROCESSING SYSTEMS
Guangxiang Du*, Indranil Gupta
Department of Computer Science, University of Illinois, Urbana-Champaign
*Google (work done at UIUC)
DPRG@UIUC: http://dprg.cs.uiuc.edu

Motivation
In stream processing systems, latency is critical. Most existing work (e.g., traffic-aware scheduling and elastic scaling) focuses on lowering the average latency. However, some applications, such as interactive web services and security-related applications, require low tail latency.

Contributions of Our Work
We propose three techniques to lower tail latency in stream processing systems:
- Adaptive Timeout Strategy
- Improved Concurrency Model for the worker process
- Latency feedback-based Load Balancing
We implement all three techniques in Apache Storm, one of the most popular stream processing systems, and evaluate them with a set of micro-benchmarks as well as real-world topologies.

System Model
Storm, Flink, and Samza fit this system model:
- Topology: a directed acyclic graph (DAG) of operators.
- Operators are stateless; each operator is split into multiple tasks.
- Data flows through the topology in discrete units called tuples.
- Operators are connected through shuffle-grouping streams. Shuffle-grouping: tuples arriving at an operator are spread randomly across its constituent tasks.
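Shuffle-grouping can be illustrated in a few lines. This is a minimal Python sketch of the random spreading described above, not Storm's actual implementation; the function name is ours:

```python
import random

def shuffle_grouping(tuples, num_tasks, rng=random):
    # Route each incoming tuple to a uniformly random downstream task
    # index, as in a shuffle-grouping stream.
    return [rng.randrange(num_tasks) for _ in tuples]
```

Over many tuples, each task receives roughly 1/num_tasks of the stream, regardless of how fast individual tasks are; this uniformity is what the load-balancing technique later revisits.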

Adaptive Timeout Strategy
Storm has a built-in mechanism to guarantee message processing: if a tuple has not been completely processed within a timeout, it is replayed. However, this timeout is fixed and user-specified. We propose adjusting the timeout adaptively to catch and replay straggler tuples promptly. Each tuple emitted from the source operator carries a timeout value.

Adaptive Timeout Strategy (Contd.)
At time t_i, set the timeout value for period P_i based on statistics of tuple latency in P_{i-1}. Intuition: continuously collect tuple-latency statistics, and periodically adjust the timeout based on the latency distribution of recently issued tuples. Based on how long the tail was in the last period, decide how aggressively to set the timeout, using heuristic rules. For example:
if (99th latency)_{P_{i-1}} > 2 * (90th latency)_{P_{i-1}}
then Timeout_{P_i} = (90th latency)_{P_{i-1}}
Why: if the tail is very long, we set the timeout aggressively low; otherwise we set it conservatively to avoid unnecessary replays. (Note that a long or short tail is a relative notion, e.g., how large the 99th percentile latency is with respect to the 90th percentile.)
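The heuristic above can be sketched as follows. This is an illustrative Python sketch of the slide's example rule, not Storm code: the 2x threshold and the 90th/99th percentiles come from the slide, while the simple percentile helper and function names are our assumptions:

```python
def percentile(sorted_vals, p):
    # Crude percentile over an already-sorted list (illustrative only).
    idx = min(len(sorted_vals) - 1, int(p / 100.0 * len(sorted_vals)))
    return sorted_vals[idx]

def next_timeout(latencies_prev_period):
    # Set the timeout for period P_i from the latency distribution of
    # period P_{i-1}: aggressive (90th percentile) if the tail is long,
    # conservative (99th percentile) otherwise.
    s = sorted(latencies_prev_period)
    p90, p99 = percentile(s, 90), percentile(s, 99)
    if p99 > 2 * p90:   # long tail: time out stragglers aggressively
        return p90
    return p99          # short tail: avoid unnecessary replays
```

For a latency sample with a long tail (e.g., 99 tuples at 1 ms and one at 1000 ms), the rule picks the low 90th percentile so the straggler is replayed early; for an even distribution it falls back to the 99th percentile.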

Improved Concurrency Model for the Worker Process
That was our first technique; now, moving to the second. In Storm and Flink, by default each task/executor has an independent queue to buffer incoming tuples. The improved concurrency model reduces queueing delay by merging the input queues of the tasks within a worker process: a task, whenever free, grabs the next available tuple from the shared input queue.
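A minimal sketch of the shared-queue idea, modeling a worker process whose tasks are threads pulling from one merged input queue (illustrative only; Storm's actual executor internals differ, and the function name is ours):

```python
import queue
import threading

def run_worker(tuples, num_tasks):
    # All tasks of this hypothetical worker process share one merged
    # input queue; a free task grabs the next available tuple instead
    # of waiting on its own private per-task queue.
    shared_q = queue.Queue()
    for t in tuples:
        shared_q.put(t)
    processed = []
    lock = threading.Lock()

    def task_loop():
        while True:
            try:
                t = shared_q.get_nowait()
            except queue.Empty:
                return  # queue drained; this task is done
            with lock:
                processed.append(t)

    threads = [threading.Thread(target=task_loop) for _ in range(num_tasks)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return processed
```

Because no tuple can sit behind a slow task's private backlog, a temporarily slow task cannot inflate the queueing delay of tuples that another free task could have served.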

Improved Concurrency Model for the Worker Process (Contd.)
Consider an M/M/c queue model, where:
λ: the queue's input rate
μ: each server's service rate
c: the number of servers for the queue
ρ: the utilization of the queue
Q_avg: the average queueing time
The model shows that, for a given queue utilization, increasing the number of servers serving a queue leads to lower queueing delay.
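The M/M/c claim can be checked numerically with the standard Erlang C formula (a textbook result, not from the slides; the function name is ours):

```python
from math import factorial

def mmc_avg_queueing_time(lam, mu, c):
    # Average queueing time in an M/M/c queue via the Erlang C formula.
    a = lam / mu        # offered load
    rho = a / c         # utilization; must be < 1 for stability
    assert rho < 1
    tail = (a ** c / factorial(c)) / (1 - rho)
    p_wait = tail / (sum(a ** k / factorial(k) for k in range(c)) + tail)
    return p_wait / (c * mu - lam)
```

At a fixed utilization of ρ = 0.8, for example, four servers sharing one queue (λ = 3.2, μ = 1) yield a much lower average queueing time than one server (λ = 0.8, μ = 1), matching the slide's claim.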

Latency-based Load Balancing
That was our second technique; now, moving to the third. Many stream processing systems run in heterogeneous conditions: for example, the machines (or VMs) may be heterogeneous, the task assignment may be heterogeneous (machines host different numbers of tasks), etc. As a result, some tasks may be faster than other tasks within the same operator. Partitioning the incoming stream of tuples uniformly across tasks (e.g., 33.33% to each of three tasks) thus exacerbates the tail latency.

Latency-based Load Balancing (Contd.)
Goal: faster tasks process more work and slower tasks process less, such that all tasks see essentially the same latency. The key point: each task (except sinks) periodically collects the latency of its immediate downstream tasks as feedback and sorts them from fastest to slowest. The quickest task forms a pair with the slowest, the 2nd quickest with the 2nd slowest, and so on. The technique performs smooth load adjustment within each pair to suppress load oscillation: for each pair, the algorithm shifts 1% of the upstream task's outgoing traffic from the slower task to the quicker task until the latencies of the two are close enough.
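One rebalancing round can be sketched as below, assuming we hold per-task traffic weights and measured latencies. The 1% step follows the slide; the tolerance parameter and function name are our assumptions:

```python
def rebalance(weights, latencies, step=0.01, tol=0.05):
    # One round of latency-feedback load balancing: sort tasks from
    # fastest to slowest, pair the quickest with the slowest (2nd
    # quickest with 2nd slowest, and so on), and within each pair
    # shift `step` (1%) of traffic from the slower to the quicker
    # task, unless their latencies are already close enough.
    order = sorted(range(len(latencies)), key=lambda i: latencies[i])
    new = list(weights)
    n = len(order)
    for k in range(n // 2):
        fast, slow = order[k], order[n - 1 - k]
        if latencies[slow] - latencies[fast] > tol * latencies[slow]:
            shift = min(step, new[slow])
            new[slow] -= shift
            new[fast] += shift
    return new
```

Repeating this round periodically moves load in small increments, so the assignment converges toward latency balance without oscillating.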

Evaluation
Experimental setup (Google Compute Engine): we implement our techniques in Apache Storm and evaluate them. 1 VM runs Nimbus & Zookeeper; 5 VMs serve as worker nodes. By default, each worker node runs one worker process.
VM | Machine configuration | Role
1 VM | n1-standard-1 (1 vCPU, 3.75 GB memory) | Zookeeper & Nimbus
5 VMs | n1-standard-2 (2 vCPUs, 7.5 GB memory) | Worker Node

Evaluation: Adaptive Timeout Strategy
4-operator "Exclamation Topology" from the Storm examples. Comparison between the adaptive timeout strategy and different levels of replication:
Approach | 99th latency (ms) | 99.9th latency (ms) | Cost
default | 29.2 | 76.6 | -
adaptive timeout | 24.1 | 66.4 | 2.92%
20% replication | 25.5 | 87.8 | 20%
50% replication | 22.1 | 107.7 | 50%
100% replication | 17.9 | 78.1 | 100%
Why replication: it is a popular, well-known approach for cutting tail latency in networking systems and search engines, and both replication and the adaptive timeout strategy benefit from speculative execution.

Evaluation: Improved Concurrency for the Worker Process
Micro-topology in which a spout connects to a bolt through a shuffle-grouping stream. The bolt has 20 tasks; each worker has 4 tasks. Average queueing delay drops from 2.07 ms to 0.516 ms. The 90th, 99th, and 99.9th percentile latencies improve by 3.49 ms (35.5%), 3.94 ms (24.9%), and 30.1 ms (36.2%) respectively.

Evaluation: Latency-based Load Balancing
Three kinds of heterogeneous scenarios:
- Different Storm workers are assigned different numbers of tasks.
- A subset of Storm workers compete for resources with external processes.
- Storm workers are deployed on a cluster of heterogeneous VMs.

Evaluation: Latency-based Load Balancing (Contd.)
Overall effect: load shifts gradually from slower tasks to quicker tasks, achieving latency balance among tasks of the same operator.
Percentile | Improvement
90th latency | 2.2% - 56%
99th latency | 21.4% - 60.8%
99.9th latency | 25% - 72.9%

Qualitative Conditions for the Techniques
When do we apply which technique? To find out, we conducted two experiments. First, given a topology, we vary the tasks' input-queue utilization and observe its effect on the adaptive timeout strategy and the improved concurrency model:
- Improved concurrency model: there is a positive correlation between the tasks' input-queue utilization and its improvement in tail latency.
- Adaptive timeout strategy: its improvement in tail latency weakens beyond a certain input-queue utilization.

Qualitative Conditions for the Techniques (Contd.)
Second, we vary the system workload and observe its effect on latency-based load balancing and the adaptive timeout strategy:
- Latency-based load balancing: works well under high workload, when the heterogeneity among different VMs is most prominent.
- Adaptive timeout strategy: achieves its improvement in tail latency under moderate or low system workload.

Qualitative Conditions for the Techniques (Contd.)
Since the scopes of the different techniques hardly overlap, we recommend using each of them exclusively in the situations it suits.

Real-world Topologies Evaluation
Yahoo PageLoad Topology
Yahoo Processing Topology

Real-world Topologies: Experimental Results
Technique | 90th latency | 99th latency | 99.9th latency
Adaptive Timeout Strategy | - | 28%-40% | 24%-26%
Improved Concurrency Model | 16%-19% | 36%-42% | 20%-32%
Latency-based Load Balancing | 22%-48% | 50%-57% | 21%-50%

Summary
We propose three novel techniques for reducing tail latency, based on a common system model of stream processing systems like Storm:
- Adaptive Timeout Strategy
- Improved Concurrency Model for the Worker Process
- Latency-based Load Balancing
We provide guidelines for when to use which technique, and achieve improvements in tail latency of up to 72.9% compared to the default Storm implementation.
DPRG: http://dprg.cs.uiuc.edu gdu3@illinois.edu