Online Parameter Optimization for Elastic Data Stream Processing. Thomas Heinze, Lars Roediger, Yuanzhen Ji, Zbigniew Jerzak (SAP SE), Andreas Meister (University of Magdeburg), Christof Fetzer (TU Dresden).


Online Parameter Optimization for Elastic Data Stream Processing Thomas Heinze, Lars Roediger, Yuanzhen Ji, Zbigniew Jerzak (SAP SE) Andreas Meister (University of Magdeburg) Christof Fetzer (TU Dresden) Presented by Rohit Mukerji February 18,

Data Stream Processing Engines
- Process queries using a directed acyclic graph (DAG) structure
- Used to process large amounts of data in real time, in contrast to the batch processing done in Hadoop
- Example: Apache Storm

Motivation
1) A major problem in today's cloud systems is low utilization of the entire system
- This results in high cost for the user and low energy efficiency for the cloud provider
- Average cloud utilization is only 30%, while users reserve 80%
2) Knowledge of the system is required in order to set parameters manually
- A lack of understanding leads to conservative parameter settings, which in turn results in low utilization of the system

Proposition  Want to build an elastic data stream processing system which allows to trade off monetary cost against the offered quality of service (latency) Elastic: system that automatically scales in or out based on workload needs 4

Architecture

Threshold-Based Scaling
- Three factors are considered when deciding to scale: 1) CPU, 2) network, and 3) memory consumption
- Scaling actions are triggered when a host is marked as overloaded or underloaded
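The threshold rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the metric names, the "all recent samples past the threshold for a given duration" rule, and the normalized 0–1 loads are assumptions.

```python
def classify_host(samples, upper, lower, duration):
    """Mark a host overloaded/underloaded when its highest resource usage
    (CPU, network, or memory, each normalized to 0..1) stays past a
    threshold for `duration` consecutive samples (illustrative rule)."""
    recent = samples[-duration:]
    if len(recent) < duration:
        return "normal"  # not enough history to decide
    loads = [max(s["cpu"], s["network"], s["memory"]) for s in recent]
    if all(load > upper for load in loads):
        return "overloaded"   # would trigger scale-out
    if all(load < lower for load in loads):
        return "underloaded"  # would trigger scale-in
    return "normal"
```

Requiring the threshold to hold for a whole duration, rather than a single sample, is what keeps short spikes from triggering costly scaling actions.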

Online Parameter Optimization
- The parameter set p includes: the lower threshold and its duration, the upper threshold and its duration, the grace period g, and the bin-packing variant

Cost Function
- Calculates the monetary cost of a parameter configuration; the best configuration is the one with the minimal cost value
- Utils: how much each operator utilizes a host at a given time
- Assign: the assignment of operators to hosts at a given time
- Certain parameters, e.g., the grace period and the threshold duration, have no influence on the scaling behavior if the cost is calculated for only a single utilization snapshot. The system therefore uses the last k snapshots (a number decided by the online profiler) when calculating the cost.
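A minimal sketch of such a cost over the last k snapshots, under the simplifying assumption of a flat per-host, per-snapshot price (the paper's actual formula over Utils and Assign may weight things differently):

```python
def monetary_cost(assign_history, k, price_per_host=1.0):
    """Charge a flat price for every active host in each of the last k
    utilization snapshots. `assign_history` is a list of snapshots, each
    mapping host -> list of operator utilizations placed on it
    (hypothetical representation of the paper's Assign/Utils inputs)."""
    recent = assign_history[-k:]
    return sum(price_per_host * len(hosts) for hosts in recent)
```

Evaluating over a window of snapshots, rather than a single one, is what makes parameters like the grace period visible to the optimizer: they only change how host counts evolve over time.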

Latency-Aware Cost Function
- Minimizing monetary cost alone is not enough; we also want to ensure good performance by providing a low-latency system
- A latency function is defined as follows: lat(q_l, t_j) is the latency of query q_l at time t_j, and the system latency at time t_j is measured as the average latency of all currently running queries
- The cost function is adjusted accordingly
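One way the adjustment can be sketched: penalize each snapshot whose average query latency violates a threshold. The additive penalty form and its weight are assumptions for illustration; the paper's exact combination may differ.

```python
def system_latency(query_latencies):
    """Average latency over all currently running queries at one time step,
    i.e. the slide's system latency at time t_j."""
    return sum(query_latencies) / len(query_latencies)

def latency_aware_cost(monetary, latencies_per_step, threshold, penalty=10.0):
    """Combine monetary cost with a penalty per snapshot whose system
    latency exceeds the threshold (illustrative weighting)."""
    violations = sum(
        1 for step in latencies_per_step if system_latency(step) > threshold
    )
    return monetary + penalty * violations
```

With this shape, the penalty weight directly controls the trade-off: a large penalty makes the optimizer prefer more hosts (higher cost, lower latency), a small one the reverse.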

Search Algorithm
- Goal: find the best parameter configuration p* with the minimal cost, given the current time series of operator utilizations
- Uses the Random Recursive Search (RRS) algorithm, which alternates a realign sub-phase and a shrink sub-phase
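A minimal sketch of the RRS idea over a box of parameter configurations: sample randomly, then re-center the search region on the best point found (realign) and contract it (shrink). The sample counts, shrink factor, and exact phase scheduling here are assumptions, not the algorithm's published constants.

```python
import random

def rrs(cost, bounds, samples=20, rounds=8, shrink=0.5, seed=0):
    """Random Recursive Search sketch: `bounds` is a list of (lo, hi)
    ranges, one per parameter; returns (best_config, best_cost)."""
    rng = random.Random(seed)
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    best_p, best_c = None, float("inf")
    for _ in range(rounds):
        # random sampling within the current region
        for _ in range(samples):
            p = tuple(rng.uniform(l, h) for l, h in zip(lo, hi))
            c = cost(p)
            if c < best_c:
                best_p, best_c = p, c
        # realign sub-phase: center the region on the best point found;
        # shrink sub-phase: contract the region, clipped to the bounds
        size = [(h - l) * shrink for l, h in zip(lo, hi)]
        lo = [max(b[0], x - s / 2) for b, x, s in zip(bounds, best_p, size)]
        hi = [min(b[1], x + s / 2) for b, x, s in zip(bounds, best_p, size)]
    return best_p, best_c
```

The appeal for online use is that RRS needs only cost evaluations, no gradients, and the early random samples make it robust to the noisy, multi-modal cost landscapes that scaling parameters produce.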

Online Profiler
- Responsible for deciding when to trigger the optimization, as well as the size of the utilization history to be used as input
- Needs just the right window size, so it uses an adaptive window-size strategy
- Considers the CPU usage of the various hosts: a good metric, as it indicates how "busy" a host is and the rate at which input is arriving
- Idea:
  - The window grows as long as the observed metrics do not change significantly
  - If a significant change is detected, elements at the end of the window (the oldest elements) are removed until the variance of the window is small again
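The grow-then-evict idea can be sketched in a few lines. The variance bound and the single-metric window are assumptions for illustration; the profiler's actual change detector may differ.

```python
def variance(xs):
    """Population variance of a list of CPU-usage readings."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def update_window(window, sample, max_variance=0.01):
    """Append the newest CPU reading; while the window is no longer
    'stable' (variance too high), evict the oldest readings."""
    window = window + [sample]
    while len(window) > 1 and variance(window) > max_variance:
        window = window[1:]  # drop the oldest element
    return window
```

The effect is that after a workload shift, the history handed to the optimizer quickly forgets the old regime, while during stable phases it keeps growing and gives the cost function a longer, more representative input.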

Evaluation
Questions to consider:
1. Does the online parameter optimization improve the trade-off between monetary cost and latency violations compared to state-of-the-art solutions?
2. Can the results of the parameter optimization be influenced via the latency threshold?
3. Does the online parameter optimization scale with an increasing operator count and history size?

Testing Input and Environment
Three sources of data streams:
1) Frankfurt Stock Exchange
2) Twitter feeds
3) Energy data from the DEBS Challenge 2014
The experiments run in a private shared cloud environment with up to 12 client hosts and one manager host; each host has 2 cores and 4 GB RAM.

Average Processing Latency of Different Approaches
The end-to-end latency is measured: the time between an event entering the query and leaving it at the sink. A value of 3 seconds is used as the upper threshold. Based on the results, the average processing latency is not influenced by the different strategies.

Vs. the Baseline (Manual Parameter Setting)
- Naive strategy: use the median result of all measured configurations for each workload, to illustrate the results an inexperienced user can expect
- Top 3 per workload: determine the best three configurations per workload across all 16 scenarios and average the results

Parameter Optimization vs. Other Methods

Does the Proposed Solution Scale?
- The cost function is the most important component to consider when discussing scalability
- It depends on two metrics: the number of placed operators and the length of the utilization history
- If the window size is increased by a factor of 6, from 30 seconds to 180 seconds, the average runtime increases only by a factor of 2.5, 1.4, or 3.0 for the energy, Twitter, and financial workloads respectively

Summary/Results
- The parameter-optimization version outperforms both the baseline (a manually tuned configuration) and reinforcement learning in the trade-off between monetary cost and latency
- The proposed solution is scalable
- It requires no expert knowledge to use, because the system adjusts on its own to different workloads
- It can be reused modularly without any modification

Thoughts  Central Management Component is a single point of failure. Also paper did not touch upon what happens if the online profiler were to fail as well  This system cannot handle load bursts i.e. when the magnitude of the workload sharply increases in a short duration of time.  During operator movements, the processing of incoming events is stopped to avoid any loss of information. This increases the latency of the system and can perhaps be solved by creating a temporary host.  Overall, more tests could have been conducted as this will give concrete evidence whether or not the proposed system does indeed perform well. 19

Thank You