Slider: Incremental Sliding Window Analytics. Pramod Bhatotia (MPI-SWS), Umut Acar (CMU), Flavio Junqueira (MSR Cambridge), Rodrigo Rodrigues (NOVA University of Lisbon).

Presentation transcript:

Slider: Incremental Sliding Window Analytics. Pramod Bhatotia (MPI-SWS), Umut Acar (CMU), Flavio Junqueira (MSR Cambridge), Rodrigo Rodrigues (NOVA University of Lisbon). Middleware 2014.

Data analytics systems: Raw data (e.g. a web crawl) goes into a data analytics system (e.g. computing PageRank), which produces information (e.g. search). Example systems: Spark, Naiad, Storm, S4, Hadoop.

Design requirements: Recent trends: streaming data, sliding window, incremental updates. Combined requirement: incremental sliding window analytics for data streams.

State-of-the-art: Stream processing. Classification based on programming model. Trigger-based systems (e.g. Storm, S4, Naiad): nodes hold mutable state and process input records as they arrive. Batch-based systems (e.g. D-Streams): the stream is split into batches (Batch #1, Batch #2, ..., Batch #n), and each single batch runs Map (M) and Reduce (R) tasks from input to output.

Trade-offs for incremental updates: Trigger-based systems: (+) efficient, (-) hard to design (require dynamic algorithms). Batch-based systems: (-) inefficient (re-compute from scratch), (+) easy to design. Slider aims for both (+) properties.

Goals: 1. Retain the advantages/simplicity of the batch-based approach. 2. Achieve the efficiency of incremental processing for sliding window analytics.

Outline: Motivation; Basic design; Slider design; Evaluation.

Our approach: Take an unmodified data-parallel application written assuming unchanging data. Automatically adapt it for incremental sliding window analytics.

Behind the scenes: Step #1: divide the computation into sub-computations. Step #2: build the dependence graph. Step #3: perform change propagation. We follow this high-level approach for batch-based stream processing.

Batch-based sliding window analytics: A window slides over the stream; Map (M) tasks process the window and feed Reduce (R) tasks. Step #1: Divide the computation into Map and Reduce tasks. Step #2: Build the dependence graph, i.e. the data-flow graph of MapReduce.

Step #3: Change propagation: The stream is divided into buckets B1, B2, B3, B4, B5, ...; Map tasks M1 to M4 process the window and feed Reduce tasks R1 to R3. When the window slides, B1 (M1) is removed and B5 (M5) is added. Replace the Reduce tasks with contraction trees (contraction trees #1, #2, #3).
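A minimal sketch of the Map side of change propagation, assuming a word-count job and a per-bucket memo table. The names `map_task` and `MemoizedWindow` are hypothetical and not taken from Slider. When the window slides, only the newly added bucket runs its Map task; the Reduce step here is still a plain linear pass over all Map outputs, which is the part the talk replaces with contraction trees.

```python
from collections import Counter


def map_task(bucket):
    """Word-count Map task: one bucket of text lines -> partial counts."""
    counts = Counter()
    for line in bucket:
        counts.update(line.split())
    return counts


class MemoizedWindow:
    """Hypothetical helper: re-runs Map only for buckets new to the window."""

    def __init__(self):
        self.memo = {}      # bucket id -> memoized Map output
        self.map_runs = 0   # number of Map tasks actually executed

    def slide(self, window):
        """window: dict of bucket id -> bucket contents for the new window."""
        # Drop memoized outputs of buckets that left the window.
        self.memo = {b: out for b, out in self.memo.items() if b in window}
        # Run Map only for newly added buckets.
        for b, data in window.items():
            if b not in self.memo:
                self.memo[b] = map_task(data)
                self.map_runs += 1
        # Plain Reduce: combine every Map output (linear in the window size).
        total = Counter()
        for out in self.memo.values():
            total.update(out)
        return total


buckets = {1: ["a b"], 2: ["b c"], 3: ["c"], 4: ["a"], 5: ["b b"]}
window = MemoizedWindow()
first = window.slide({b: buckets[b] for b in (1, 2, 3, 4)})
second = window.slide({b: buckets[b] for b in (2, 3, 4, 5)})
```

In this sketch the second slide executes one Map task (for bucket 5) instead of four.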

Outline: Motivation; Basic design; Slider design (Contraction tree, Self-adjusting contraction tree, Split processing); Evaluation.

Contraction tree: What: Breaks down the work done by a Reduce task to allow fine-grained change propagation. How: Leverages Combiners at the Reducer site.

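“Zoom IN” with a single Reducer: Buckets B1 to B4 of the stream form the window and are processed by Map tasks M1 to M4, which feed one Reduce task (R). When the window slides, B1 (M1) is removed and B5 (M5) is added. Replace the Reduce task with a contraction tree.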

Example of contraction tree: The Reduce task is replaced by a tree of combiners whose leaves are the Map outputs.
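The idea of a tree of combiners can be sketched as follows, assuming the combiner is associative so that pairwise combination gives the same result as a single Reduce pass. The function name `combiner_tree` is hypothetical, and the sketch expects at least one Map output.

```python
from collections import Counter
from functools import reduce as fold


def combiner_tree(map_outputs, combine):
    """Combine Map outputs pairwise, level by level, up to a single root."""
    level = list(map_outputs)
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ... into the next level up.
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an odd node is carried up unchanged
        level = nxt
    return level[0]


outputs = [Counter(a=1, b=1), Counter(b=1, c=1), Counter(c=1),
           Counter(a=1), Counter(b=2)]
tree_result = combiner_tree(outputs, lambda x, y: x + y)
flat_result = fold(lambda x, y: x + y, outputs)
```

Both results are the same word counts; the tree form exposes intermediate nodes that can be memoized and reused.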

Basic design w/ contraction tree: Map tasks M1 to M4 process buckets B1 to B4 of the window. When the window slides, B1 (M1) is removed and B5 (M5) is added. Change propagation touches the path affected by M1 and the path affected by M5.

Limitation of the contraction tree: Naïve grouping of Combiner tasks may lead to sub-optimal reuse of the memoized results. Solution: self-adjusting contraction tree.
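One way to see this limitation, as a sketch under stated assumptions: memoize each internal node by the tuple of bucket ids below it, and always split the window in the middle. The class `NaiveContractionTree` and its memo key are hypothetical, not Slider's actual mechanism. Replacing a bucket in place reuses a sibling subtree, but sliding the window by one bucket shifts every grouping, so nothing memoized matches.

```python
class NaiveContractionTree:
    """Hypothetical sketch: memoized combiner tree with fixed midpoint grouping."""

    def __init__(self, combine):
        self.combine = combine
        self.memo = {}            # tuple of bucket ids -> combined result
        self.combine_calls = 0    # combiner invocations actually executed

    def reduce(self, leaves):
        """leaves: non-empty list of (bucket_id, map_output) in window order."""
        if len(leaves) == 1:
            return leaves[0][1]
        ids = tuple(bucket_id for bucket_id, _ in leaves)
        if ids in self.memo:
            return self.memo[ids]          # reuse a memoized subtree
        mid = len(leaves) // 2
        left = self.reduce(leaves[:mid])
        right = self.reduce(leaves[mid:])
        self.combine_calls += 1
        result = self.combine(left, right)
        self.memo[ids] = result            # note: the memo is never pruned here
        return result


def add(x, y):
    return x + y


in_place = NaiveContractionTree(add)
in_place.reduce([(1, 10), (2, 20), (3, 30), (4, 40)])
calls_before = in_place.combine_calls
in_place.reduce([(1, 10), (2, 20), (3, 30), (5, 50)])   # bucket 4 replaced
in_place_cost = in_place.combine_calls - calls_before

sliding = NaiveContractionTree(add)
sliding.reduce([(1, 10), (2, 20), (3, 30), (4, 40)])
calls_before = sliding.combine_calls
sliding.reduce([(2, 20), (3, 30), (4, 40), (5, 50)])    # window slid by one
sliding_cost = sliding.combine_calls - calls_before
```

In this sketch the in-place replacement reuses the memoized node for buckets (1, 2), while the slide regroups the leaves as (2, 3) and (4, 5) and recomputes every internal node.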

Self-adjusting contraction tree: The tree should have low depth (implies short path length for re-computation). Key ingredients: (1) Balanced tree: sublinear updates w.r.t. window size. (2) Self-adjusting capability after change propagation.

Self-adjusting contraction tree(s): Different modes of operation: general case, fixed-width, append-only. Highlighted here: fixed-width.

Fixed-width window slides

Rotating contraction tree

Rotating contraction tree: Update path for bucket 4 (B4). Memoized results are reused.
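A minimal sketch of a rotating tree for fixed-width windows, written as an array-based balanced binary tree. It assumes the window holds a power-of-two number of buckets and that the combiner is commutative as well as associative, because the leaves are overwritten round-robin and so are not kept in arrival order. The class name and layout are hypothetical.

```python
class RotatingContractionTree:
    """Hypothetical sketch: the oldest bucket's leaf is overwritten in place."""

    def __init__(self, initial, combine):
        n = len(initial)
        assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two window"
        self.n = n
        self.combine = combine
        self.combine_calls = 0
        self.next_slot = 0                    # leaf slot holding the oldest bucket
        # nodes[1] is the root; leaves live at indices n .. 2n-1.
        self.nodes = [None] * n + list(initial)
        for i in range(n - 1, 0, -1):
            self.nodes[i] = self._combine(self.nodes[2 * i], self.nodes[2 * i + 1])

    def _combine(self, left, right):
        self.combine_calls += 1
        return self.combine(left, right)

    def slide(self, new_bucket):
        """Replace the oldest bucket and recompute only its path to the root."""
        i = self.n + self.next_slot
        self.nodes[i] = new_bucket
        self.next_slot = (self.next_slot + 1) % self.n
        i //= 2
        while i >= 1:
            # Sibling subtrees off the update path are reused as memoized.
            self.nodes[i] = self._combine(self.nodes[2 * i], self.nodes[2 * i + 1])
            i //= 2
        return self.nodes[1]


tree = RotatingContractionTree([1, 2, 3, 4], lambda x, y: x + y)
initial_calls = tree.combine_calls
after_first = tree.slide(5)     # bucket holding 1 is replaced by 5
after_second = tree.slide(6)    # bucket holding 2 is replaced by 6
```

Each slide in this sketch invokes the combiner once per tree level (log2 of the window size) rather than once per bucket.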

Split processing: Change propagation is split into background pre-processing and foreground processing.

Change propagation for bucket #4: Update path for bucket 4. Memoized results are reused.

Split processing for bucket #4: Background pre-processing, then foreground processing.
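A sketch of how such a split could look, under the same assumptions as a rotating tree (power-of-two window, commutative and associative combiner). Before the new bucket arrives, the background step combines everything that will stay in the window, namely the siblings along the update path of the slot about to be replaced. When the bucket arrives, the foreground step needs a single combiner call. The function names are hypothetical, and the slides do not spell out Slider's exact split.

```python
def build_tree(leaves, combine):
    """Array-based balanced tree: root at index 1, leaves at n .. 2n-1."""
    n = len(leaves)
    assert n > 1 and n & (n - 1) == 0, "sketch assumes a power-of-two window"
    nodes = [None] * n + list(leaves)
    for i in range(n - 1, 0, -1):
        nodes[i] = combine(nodes[2 * i], nodes[2 * i + 1])
    return nodes


def background_preprocess(nodes, slot, combine):
    """Combine all siblings on the update path of `slot` (data that stays)."""
    n = len(nodes) // 2
    i = n + slot
    rest = nodes[i ^ 1]            # sibling of the leaf to be replaced
    i //= 2
    while i > 1:
        rest = combine(rest, nodes[i ^ 1])   # sibling subtree, already memoized
        i //= 2
    return rest


def foreground(rest, new_bucket, combine):
    """One combiner call on the critical path once the new bucket is known."""
    return combine(rest, new_bucket)


def add(x, y):
    return x + y


nodes = build_tree([1, 2, 3, 4], add)
rest = background_preprocess(nodes, slot=0, combine=add)   # runs ahead of time
output = foreground(rest, 5, add)                          # runs on arrival
```

The tree nodes on the update path would still have to be refreshed afterwards, which can also happen off the critical path.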

Evaluating Slider: Goal: determine how Slider works in practice. 1. What are the performance benefits? 2. How effective is split processing? 3. What is the overhead for the initial run? Case studies and more results in the paper.

Q1: Performance gains: Speedup of up to 3.8X w.r.t. the basic contraction tree.

Q2: Split processing: Foreground processing is faster by 30% on avg.

Q3: Performance overheads: Overheads of 2% to 38% for the initial run.

Case studies: Online Social Networks [IMC’11]: information propagation in Twitter. Networked Systems [NSDI’10]: Glasnost, detecting traffic shaping. Hybrid CDNs [NSDI’12]: reliable client accounting. Details in the paper.

Information propagation in Twitter: Speedup > 13X for ~5% window change.

Summary: Slider enables incremental sliding window analytics, transparently and efficiently. The Slider design includes: self-adjusting contraction trees for sub-linear updates; split processing for background pre-processing; multi-level trees for general data-flow programs (not covered in the talk).

Incremental Sliding Window Analytics: Transparent + Efficient. Thanks!