StreamApprox Approximate Stream Analytics in Apache Spark

Slides:

Advertisements

Similar presentations

SAMPLING METHODS OR TECHNIQUES

Advertisements

A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees Shimin Chen* Phillip B. Gibbons* Suman Nath + *Intel Labs Pittsburgh.

Achieving Elasticity for Cloud MapReduce Jobs Khaled Salah IEEE CloudNet 2013 – San Francisco November 13, 2013.

Best-Effort Top-k Query Processing Under Budgetary Constraints

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.

A Stratified Approach for Supporting High Throughput Event Processing Applications July 2009 Geetika T. LakshmananYuri G. RabinovichOpher Etzion IBM T.

Ranked Set Sampling: Improving Estimates from a Stratified Simple Random Sample Christopher Sroka, Elizabeth Stasny, and Douglas Wolfe Department of Statistics.

Efficient replica maintenance for distributed storage systems Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek,

A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University.

Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1.

CORE Rome Meeting – 3/4 October WP3: A Process Scenario for Testing the CORE Environment Diego Zardetto (Istat CORE team)

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.

Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.

Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug.

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.

ApproxHadoop Bringing Approximations to MapReduce Frameworks

McGraw-Hill/IrwinCopyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved. SAMPLING Chapter 14.

Chapter 6 Conducting & Reading Research Baumgartner et al Chapter 6 Selection of Research Participants: Sampling Procedures.

By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.

Efficient Gigabit Ethernet Switch Models for Large-Scale Simulation Dong (Kevin) Jin David Nicol Matthew Caesar University of Illinois.

An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.

IncApprox The marriage of incremental and approximate computing Pramod Bhatotia Dhanya Krishnan, Do Le Quoc, Christof Fetzer, Rodrigo Rodrigues* (TU Dresden.

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Slider Incremental Sliding Window Analytics Pramod Bhatotia MPI-SWS Umut Acar CMU Flavio Junqueira MSR Cambridge Rodrigo Rodrigues NOVA University of Lisbon.

Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 1-1 Statistics for Managers Using Microsoft ® Excel 4 th Edition Chapter.

SketchVisor: Robust Network Measurement for Software Packet Processing

Performance Assurance for Large Scale Big Data Systems

NOVA University of Lisbon

Guangxiang Du*, Indranil Gupta

Distributed Programming in “Big Data” Systems Pramod Bhatotia wp

International Conference on Data Engineering (ICDE 2016)

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Zhu Han University of Houston Thanks for Professor Dan Wang’s slides

Distributed Network Traffic Feature Extraction for a Real-time IDS

Applying Control Theory to Stream Processing Systems

So, what was this course about?

Spark Presentation.

A Study of Group-Tree Matching in Large Scale Group Communications

QianZhu, Liang Chen and Gagan Agrawal

Graduate School of Business Leadership

Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang

Augmented Sketch: Faster and More Accurate Stream Processing

Sampling: Design and Procedures

Selectivity Estimation of Big Spatial Data

A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.

Supporting Fault-Tolerance in Streaming Grid Applications

COS 518: Advanced Computer Systems Lecture 11 Michael Freedman

Introduction to Spark.

Sampling: Design and Procedures

StreamApprox Approximate Stream Analytics in Apache Flink

AQUA: Approximate Query Answering

StreamApprox Approximate Computing for Stream Analytics

Communication and Memory Efficient Parallel Decision Tree Construction

Estimation of Sampling Errors, CV, Confidence Intervals

CS110: Discussion about Spark

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang.

Slides prepared by Samkit

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)

Stratified Sampling for Data Mining on the Deep Web

Pramod Bhatotia, Ruichuan Chen, Myungjin Lee

Indiana University, Bloomington

DATABASE HISTOGRAMS E0 261 Jayant Haritsa

Streaming data processing using Spark

COS 518: Advanced Computer Systems Lecture 12 Michael Freedman

Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.

Presentation transcript:

StreamApprox Approximate Stream Analytics in Apache Spark https://streamapprox.github.io Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Volker Hilt, Thorsten Strufe 10/2017

Modern online services Stream Aggregator Stream Analytics System Useful Information

Modern online services Approximate computing Low latency Efficient resource utilization Tension

Approximate Computing Many applications: Approximate output is good enough! E.g. : Google Trends --- “Spark SQL” vs “Spark Streaming” (Sep/2017 – Oct/2017) The trend of data is more important than the precise numbers 100 50 Sep 17 Sep 26 Oct 5 Average

Approximate Computing Idea: To achieve low latency, compute over a sub-set of data items instead of the entire data-set Take a sample Approximate output ± Error bound Compute Approximate computing

State-of-the-art systems Not designed for stream analytics BlinkDB [EuroSyS’13] Using pre-existing samples ApproxHadoop [ASPLOS’15] Using multi-stage sampling Quickr [SIGMOD’16] Injecting samplers into query plan

StreamApprox: Design goals Transparent Targets existing applications w/ minor code changes Practical Supports adaptive execution based on query budget Efficient Employs online sampling techniques

Outline Motivation Design Evaluation

StreamApprox: Overview Input data stream Approximate output error bound StreamApprox Streaming query Query budget Stream aggregator (E.g Kafka) S1 S2 Sn … Data stream Query budget: Latency/throughput guarantees Desired computing resources for query processing Desired accuracy

Key idea: Sampling Simple random sampling (SRS): Stratified sampling (STS): SRS

Key idea: Sampling Reservoir sampling (RS): With probability (1- k i ) Size of reservoir = k Replace by item i Drop item i

Spark-based Simple Random Sampling (Spark-based SRS) Spark-based Sampling Spark-based Simple Random Sampling (Spark-based SRS) Assign each item with a random number in [0, 1] Step #1 0.01 0.08 0.02 0.06 0.68 0.15 0.26 0.12 0.88 0.12 0.88 Step #2 Sort items based on assigned value 0.01 0.02 0.08 0.06 0.15 0.26 0.68 Step #3 Take out k smallest items 0.12 0.01 0.02 0.08 0.06 Sorting big data is very expensive

Spark-based Stratified Sampling (Spark-based STS) Spark-based Sampling Spark-based Stratified Sampling (Spark-based STS) Step #1 Create strata using groupByKey() Step #2 Apply SRS to each stratum Si Step #3 Synchronize between worker nodes to select a sample of size k These steps are very expensive

StreamApprox: Core idea Online Adaptive Stratified Reservoir Sampling (OASRS) S1 RS Size of reservoir = k Weight = #items/k = 8/4 RS Weight = #items/k = 6/4 S2 RS S3 Weight = 1 Easy to parallelize, doesn't need any synchronization between workers RS : Reservoir Sampling k = 4

StreamApprox: Core idea Weight = 2 Weight = 1.5 Weight = 1 OASRS Worker 1 Weight = 1 Weight = 2 Weight = 1.5 Size of reservoir = 4 OASRS Worker 2

Approximate output error bound Implementation Approximate output error bound StreamApprox Stream aggregator S1 S2 Sn … Data stream

Implementation Stream aggregator S1 S2 Sn … Sampling module Batch generator Batched RDDs Spark computation engine Error estimation module Output error bound Refined sampling parameters

Outline Motivation Design Evaluation

See the paper for more results! Experimental setup Evaluation questions Throughput vs sample size Throughput vs accuracy End-to-end latency Testbed Cluster: 17 nodes Datasets: Synthesis: Gaussian distribution, Poisson distribution datasets CAIDA Network traffic traces; NYC Taxi ride records See the paper for more results!

With sampling fraction < 60% Throughput Higher the better Spark-based StreamApprox: ~2X higher throughput over Spark-based STS Flink-based StreamApprox: 1.3X higher throughput over Spark-based StreamApprox With sampling fraction < 60%

Throughput vs Accuracy Higher the better Spark-based StreamApprox: ~1.32X higher throughput over Spark-based STS Flink-based StreamApprox: 1.62X higher throughput over Spark-based StreamApprox With the same accuracy loss

Latency Lower the better Spark-based StreamApprox: ~1.68X faster than Spark-based STS Spark-based StreamApprox: ~1.45X faster than Spark-based SRS

Conclusion StreamApprox: Approximate computing for stream analytics Transparent Supports applications w/ minor code changes Practical Adaptive execution based on query budget Efficient Online stratified sampling technique Thank you! Details: StreamApprox [Middleware’17] https://streamapprox.github.io