Pramod Bhatotia, Ruichuan Chen, Myungjin Lee

Presentation transcript:

ApproxIoT: Approximate Analytics for Edge Computing. Zhenyu Wen, Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, Myungjin Lee. https://ApproxIoT.github.io/ApproxIoT/

Modern online services: processing streaming data from different sources. Diagram: stream aggregator, stream analytics system, useful information.

Modern online services: low latency and efficient resource utilization are in tension. Approach: approximate computing.

Approximate computing. Many applications: approximate output is good enough! Example: a live taxi heatmap, for which a proportion of the data is useful enough.

Approximate computing. Idea: to achieve low latency, compute over a subset of data items instead of the entire dataset. Steps: approximate computing (sampling), then analyze, giving approximate output ± error bound.

State-of-the-art system: StreamApprox [Middleware’17]. Diagram: data streams from sources S1, S2, …, Sn; stream aggregator; StreamApprox in a cloud datacenter; approximate output ± error bound. Limitations: it wastes bandwidth; it utilizes only cloud datacenter resources.

Edge computing: allows data to be processed at the edge node before it’s sent to the cloud. Diagram: source of data, gateway, edge node (local processing), cloud. Opportunities: providing more computing resources; saving bandwidth.

Edge infrastructure: Azure IoT Edge, Watson IoT, AWS IoT. Source: https://peering.google.com/#/infrastructure

Problem statement: to build a stream analytics system by utilizing the cloud and edge computing resources and by leveraging approximate computing. Design goals. Efficiency: efficient utilization of computing resources. Adaptability: adaptive execution based on the available resources. Transparency: no code changes or resource management required.

Outline: Motivation, Design, Implementation, Evaluation

ApproxIoT: Overview. ApproxIoT employs sampling in the distributed environment of edge + cloud. Diagram: sources S1, …, Sn; edge nodes; regional edge; continental node; central node; cloud; query; approximate output ± error bound.

Naïve algorithm: simple random sampling (SRS). Diagram: SRS, query, approximate output ± error bound. Result: low accuracy, because sub-streams can be overlooked or sampled unfairly.

Background: stratified sampling. Advantage: the sub-streams are sampled fairly. Disadvantage: requires knowledge of each sub-stream's size.
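The disadvantage above can be made concrete with a minimal Python sketch. It assumes proportional allocation (each sub-stream gets a share of the sample proportional to its size), which the slide does not specify; the function name `stratified_sample` and its parameters are hypothetical. The point is that the quota cannot be computed without `len(items)`, which an unbounded stream does not provide up front.

```python
import random


def stratified_sample(substreams, sample_size, rng=None):
    """Sample each sub-stream in proportion to its size.

    substreams: dict mapping a sub-stream name to a list of its items.
    Proportional allocation is an assumption made for this sketch.
    """
    rng = rng or random.Random()
    total = sum(len(items) for items in substreams.values())
    if total == 0:
        return {name: [] for name in substreams}
    sample = {}
    for name, items in substreams.items():
        # The quota depends on the sub-stream size, which must be known in advance.
        quota = max(1, round(sample_size * len(items) / total))
        sample[name] = rng.sample(items, min(quota, len(items)))
    return sample
```

With a large sub-stream of 90 items and a small one of 10, a sample of size 10 takes 9 items from the first and 1 from the second, so the small sub-stream is not overlooked.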

Background: reservoir sampling. Example with size of reservoir = 4: the 5th item replaces an item in the reservoir with probability 4/5; the 6th item replaces an item in the reservoir with probability 4/6. Advantage: no prior knowledge of the sub-stream size is required. Disadvantages: the sub-streams are sampled unfairly; difficult to run on multiple nodes.
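The replacement rule on this slide is the classic reservoir sampling step: once the reservoir is full, the i-th item is accepted with probability N/i and overwrites a uniformly chosen slot. A minimal Python sketch follows; the function name and parameters are hypothetical and not taken from the ApproxIoT code.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Keep a uniform random sample of at most `capacity` items from `stream`."""
    rng = rng or random.Random()
    reservoir = []
    for count, item in enumerate(stream, start=1):
        if count <= capacity:
            # The first `capacity` items fill the reservoir directly.
            reservoir.append(item)
        elif rng.random() < capacity / count:
            # Accept the count-th item with probability capacity/count
            # and overwrite a uniformly chosen slot.
            reservoir[rng.randrange(capacity)] = item
    return reservoir
```

The stream length is never needed, which matches the advantage listed on the slide.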

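ApproxIoT sampling algorithm: weighted hierarchical sampling (WHS), combining stratified and reservoir sampling. Each sub-stream has a reservoir of size N and an initial weight of 1. After C items have arrived, the weight is W = C/N if C > N, and W = 1 if C <= N. Example with N = 4: a sub-stream with C = 6 gets W = 6/4; the other two sub-streams in the example have W = 1. Easy to parallelize; requires no synchronization between sub-streams.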
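One way to read the weight rule on this slide is as a per-sub-stream reservoir that also counts arrivals. The Python sketch below is an illustration under that assumption, not the authors' implementation; `SubStreamReservoir` and `whs_sample` are hypothetical names.

```python
import random


class SubStreamReservoir:
    """Reservoir of capacity N plus an arrival counter C for one sub-stream."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.count = 0
        self.items = []

    def offer(self, item):
        self.count += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        elif self.rng.random() < self.capacity / self.count:
            self.items[self.rng.randrange(self.capacity)] = item

    def weight(self):
        # W = C/N if C > N, otherwise the initial weight 1.
        return self.count / self.capacity if self.count > self.capacity else 1.0


def whs_sample(tagged_items, capacity, seed=None):
    """tagged_items: iterable of (substream_id, value) pairs.

    Returns a dict mapping substream_id to (weight, sampled values).
    """
    rng = random.Random(seed)
    reservoirs = {}
    for substream_id, value in tagged_items:
        if substream_id not in reservoirs:
            reservoirs[substream_id] = SubStreamReservoir(capacity, rng)
        reservoirs[substream_id].offer(value)
    return {sid: (r.weight(), list(r.items)) for sid, r in reservoirs.items()}
```

Each sub-stream keeps its own reservoir and counter, so no state is shared across sub-streams. With capacity 4, a sub-stream that delivers 6 items ends with weight 6/4, and one that delivers 3 items keeps weight 1, as in the slide's example.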

WHS on edge nodes. WHS runs at each level: edge nodes, regional edge, continental node, central node, cloud. Example with reservoir size 2. Regional edge: W = 1, W = 6/2 = 3, W = 4/2 = 2. Continental node: W = 4, W = 1, W = 3; the new weight is the carried weight times the current weight, for example W = 4*5/2 = 10 and W = 1*3/2 = 3/2. Easy to parallelize; requires no synchronization between computing nodes.
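The arithmetic in this example can be checked with a small Python sketch. It assumes the new weight is the carried weight multiplied by the local C/N factor (or by 1 when the reservoir is not overfull), which is how the slide's numbers read; the function name `updated_weight` is hypothetical.

```python
def updated_weight(carried_weight, count, capacity):
    """Weight of a sub-stream after one more level of WHS.

    carried_weight: weight attached to the items arriving from the level below.
    count: number of items C that arrived at this node for the sub-stream.
    capacity: reservoir size N at this node.
    """
    local_weight = count / capacity if count > capacity else 1.0
    return carried_weight * local_weight


# Regional edge, reservoir size 2, items arrive with the initial weight 1.
print(updated_weight(1, 6, 2))  # 3.0
print(updated_weight(1, 4, 2))  # 2.0

# Continental node, reservoir size 2, items arrive with carried weights 4 and 1.
print(updated_weight(4, 5, 2))  # 10.0
print(updated_weight(1, 3, 2))  # 1.5
```

Because a node only needs the weight carried by its incoming items, no coordination with sibling nodes is required.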

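ApproxIoT in the cloud. The weights are carried through edge nodes, regional edge, and continental node up to the central node in the cloud, where WHS runs again before the query (sum) is evaluated. Example with reservoir size 1: W = 4/3*6/1 = 8, W = 1*4/1 = 4, W = 1*2/1 = 2 (carried weights shown: W = 4/3 and W = 1). Approximate output ± error bound: 8*(sampled item) + 4*(sampled item) + 2*(sampled item).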
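The sum on this slide is a weighted sum over the sampled items, with each item scaled by the final weight of its sub-stream. A minimal Python sketch of that point estimate follows; the function name and the item values are made up for illustration, and the error bound is left out because the slides do not give its formula.

```python
def estimate_sum(weighted_samples):
    """Estimate the total of all original items from weighted samples.

    weighted_samples: iterable of (weight, sampled values) pairs,
    one pair per sub-stream.
    """
    return sum(weight * sum(values) for weight, values in weighted_samples)


# Final weights 8, 4 and 2 from the slide; the item values are hypothetical.
samples = [(8, [10.0]), (4, [5.0]), (2, [1.0])]
print(estimate_sum(samples))  # 102.0
```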

Outline: Motivation, Design, Implementation, Evaluation

Implementation. Diagram: data streams from sources S1, S2, …, Sn; edge nodes; Kafka cluster (stream pub/sub); sampled data streams; cloud datacenter; Kafka Streams. See the paper for more details.

Experimental setup. Evaluation questions: accuracy vs. sample size; throughput vs. sample size. Testbed: 25 nodes (15 nodes for ApproxIoT deployment, 10 nodes for Kafka cluster). Datasets: synthetic (Poisson and Gaussian distribution); real (Brasov pollution and New York Taxi Ride). See the paper for more results!

Accuracy vs. sample size (lower is better). The average is 0.035%. ApproxIoT: ~2600X higher accuracy over SRS.

Throughput vs. sample size (higher is better). ApproxIoT has low overhead compared to native execution. ApproxIoT has similar throughput to SRS.

Conclusion. ApproxIoT: approximate analytics for edge computing. Efficiency: efficient computing and bandwidth resource utilization. Adaptability: adaptive execution based on the available resources. Transparency: requires no code changes or resource management. Thank you! More details on the project website: https://ApproxIoT.github.io/ApproxIoT/