StreamApprox Approximate Stream Analytics in Apache Flink

Slides:

Advertisements

Similar presentations

SAMPLING METHODS OR TECHNIQUES

Advertisements

Achieving Elasticity for Cloud MapReduce Jobs Khaled Salah IEEE CloudNet 2013 – San Francisco November 13, 2013.

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

Distributed Approximate Spectral Clustering for Large- Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY : BITA KAZEMI ZAHRANI 1.

1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng Department.

SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.

A Stratified Approach for Supporting High Throughput Event Processing Applications July 2009 Geetika T. LakshmananYuri G. RabinovichOpher Etzion IBM T.

Ranked Set Sampling: Improving Estimates from a Stratified Simple Random Sample Christopher Sroka, Elizabeth Stasny, and Douglas Wolfe Department of Statistics.

Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1.

1 Data Mining over the Deep Web Tantan Liu, Gagan Agrawal Ohio State University April 12, 2011.

CORE Rome Meeting – 3/4 October WP3: A Process Scenario for Testing the CORE Environment Diego Zardetto (Istat CORE team)

CS492: Special Topics on Distributed Algorithms and Systems Fall 2008 Lab 3: Final Term Project.

Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Michael Baron + * Department of Computer Science, University of Texas at Dallas + Department of Mathematical.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.

11 Experimental and Analytical Evaluation of Available Bandwidth Estimation Tools Cesar D. Guerrero and Miguel A. Labrador Department of Computer Science.

Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 1-1 Statistics for Managers Using Microsoft ® Excel 4 th Edition Chapter.

Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug.

Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.

Map-Reduce for Machine Learning on Multicore C. Chu, S.K. Kim, Y. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun (NIPS 2006) Shimin Chen Big Data Reading.

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.

ApproxHadoop Bringing Approximations to MapReduce Frameworks

McGraw-Hill/IrwinCopyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved. SAMPLING Chapter 14.

Chapter 6 Conducting & Reading Research Baumgartner et al Chapter 6 Selection of Research Participants: Sampling Procedures.

CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.

By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.

Efficient Gigabit Ethernet Switch Models for Large-Scale Simulation Dong (Kevin) Jin David Nicol Matthew Caesar University of Illinois.

An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.

IncApprox The marriage of incremental and approximate computing Pramod Bhatotia Dhanya Krishnan, Do Le Quoc, Christof Fetzer, Rodrigo Rodrigues* (TU Dresden.

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Slider Incremental Sliding Window Analytics Pramod Bhatotia MPI-SWS Umut Acar CMU Flavio Junqueira MSR Cambridge Rodrigo Rodrigues NOVA University of Lisbon.

Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 1-1 Statistics for Managers Using Microsoft ® Excel 4 th Edition Chapter.

SketchVisor: Robust Network Measurement for Software Packet Processing

Experience Report: System Log Analysis for Anomaly Detection

NOVA University of Lisbon

Learning Deep Generative Models by Ruslan Salakhutdinov

Distributed Programming in “Big Data” Systems Pramod Bhatotia wp

International Conference on Data Engineering (ICDE 2016)

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Zhu Han University of Houston Thanks for Professor Dan Wang’s slides

Distributed Network Traffic Feature Extraction for a Real-time IDS

Applying Control Theory to Stream Processing Systems

A Study of Group-Tree Matching in Large Scale Group Communications

QianZhu, Liang Chen and Gagan Agrawal

Augmented Sketch: Faster and More Accurate Stream Processing

Auditing & Investigations I

Sampling: Design and Procedures

Social Research Methods

Selectivity Estimation of Big Spatial Data

A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.

Supporting Fault-Tolerance in Streaming Grid Applications

Benchmarking Modern Distributed Stream Processing Systems

Kijung Shin1 Mohammad Hammoud1

Pyramid Sketch: a Sketch Framework

Load Shedding Techniques for Data Stream Systems

AQUA: Approximate Query Answering

StreamApprox Approximate Stream Analytics in Apache Spark

StreamApprox Approximate Computing for Stream Analytics

Communication and Memory Efficient Parallel Decision Tree Construction

Estimation of Sampling Errors, CV, Confidence Intervals

CS110: Discussion about Spark

Henge: Intent-Driven Multi-Tenant Stream Processing

Stratified Sampling for Data Mining on the Deep Web

Pramod Bhatotia, Ruichuan Chen, Myungjin Lee

Declarative Transfer Learning from Deep CNNs at Scale

Introduction to Stream Computing and Reservoir Sampling

Using Clustering to Make Prediction Intervals For Neural Networks

Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.

Presentation transcript:

StreamApprox Approximate Stream Analytics in Apache Flink Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Volker Hilt, Thorsten Strufe Sep 2017

Modern online services Stream Aggregator Stream Analytics System Useful Information

Modern online services Approximate computing Low latency Efficient resource utilization Tension

Approximate Computing Many applications: Approximate output is good enough! E.g. : Google Trends --- Big Data vs Machine Learning (Sep/2012 – Sep/2017) The trend of data is more important than the precise numbers

Approximate Computing Idea: To achieve low latency, compute over a sub-set of data items instead of the entire data-set Take a sample Approximate output ± Error bound Compute Approximate computing

State-of-the-art systems Not designed for stream analytics BlinkDB [EuroSyS’13] Using pre-existing samples ApproxHadoop [ASPLOS’15] Using multi-stage sampling Quickr [SIGMOD’16] Injecting samplers into query plan

StreamApprox: Design goals Transparent Targets existing applications w/ minor code changes Practical Supports adaptive execution based on query budget Efficient Employs online sampling techniques

Outline Motivation Design Evaluation

StreamApprox: Overview Input data stream Approximate output ± Error bound StreamApprox Streaming query Query budget Stream Aggregator (E.g Kafka) S1 S2 Sn … data stream Query budget: Latency/throughput guarantees Desired computing resources for query processing Desired accuracy

Key idea: Sampling Simple random sampling (SRS): Stratified sampling (STS): SRS

Key idea: Sampling Reservoir sampling (RS): With probability (1- k i ) Size of reservoir = k Replace by item i Drop item i

Spark-based Stratified Sampling (Spark-based STS) Spark-based Sampling Spark-based Stratified Sampling (Spark-based STS) Step #1 Create strata using groupByKey() Step #2 Apply SRS to each stratum Si Step #3 Synchronize between worker nodes to select a sample of size k These steps are very expensive

StreamApprox: Core idea Online Adaptive Stratified Reservoir Sampling (OASRS) S1 RS Size of reservoir = k Weight = #items/k = 8/4 RS Weight = #items/k = 6/4 S2 RS S3 Weight = 1 Easy to parallelize, doesn't need any synchronization between workers RS : Reservoir Sampling k = 4

StreamApprox: Core idea Weight = 2 Weight = 1.5 Weight = 1 OASRS Worker 1 Weight = 1 Weight = 2 Weight = 1.5 Size of reservoir = 4 OASRS Worker 2

Approximate output ± Error bound Implementation Approximate output ± Error bound StreamApprox Stream Aggregator S1 S2 Sn … data stream

Implementation Flink-based StreamApprox Sampling module Flink Computation Engine Output ± Error bound Error Estimation module Refined sampling parameters Stream aggregator S1 S2 Sn … Flink-based StreamApprox

Outline Motivation Design Evaluation

See the paper for more results! Experimental setup Evaluation questions Throughput vs sample size Throughput vs accuracy Testbed Cluster: 17 nodes Datasets: Synthesis: Gaussian distribution, Poisson distribution datasets CAIDA Network traffic traces; NYC Taxi ride records See the paper for more results!

With sampling fraction < 60% Throughput Higher the better Spark-based StreamApprox: ~2X higher throughput over Spark-based STS Flink-based StreamApprox: 1.3X higher throughput over Spark-based StreamApprox With sampling fraction < 60%

Throughput vs Accuracy Higher the better Spark-based StreamApprox: ~1.32X higher throughput over Spark-based STS Flink-based StreamApprox: 1.62X higher throughput over Spark-based StreamApprox With the same accuracy loss

Conclusion StreamApprox: Approximate computing for stream analytics Transparent Supports applications w/ minor code changes Practical Adaptive execution based on query budget Efficient Online stratified sampling technique Thank you! Details: StreamApprox [Middleware’17] https://streamapprox.github.io