Scalable Data Partitioning Techniques for Parallel Sliding Window Processing over Data Streams
DMSN 2011
Cagri Balkesen & Nesime Tatbul
Talk Outline
- Intro & Motivation
- Stream Partitioning Techniques
  - Basic window partitioning
  - Batch partitioning
  - Pane-based partitioning
- Ring-based Query Evaluation
- Experimental Evaluation
- Conclusions & Future Work
Intro & Motivation
[Figure: DSMS]
Architectural Overview
[Figure: input stream -> split stage (split node) -> query nodes -> merge stage (merge node) -> output stream. Example QoS: latency < 5 seconds, disorder < 3 tuples]
- Classical split-merge pattern from parallel DBs
- Adjustable parallelism level, d
- QoS on max latency & order
Related Work: How to Partition?
- Content-sensitive
  - FluX: Fault-tolerant, load-balancing Exchange [1,2]
  - Use group-by values from the query to partition
  - Need explicit load balancing due to skewed data
- Content-insensitive
  - GDSM: Window-based parallelization (fixed-size tumbling windows) [3]
  - Win-Distribute: Partition at window boundaries
  - Win-Split: Partition each window into equi-length subwindows
- The problem:
  - How to handle sliding windows?
  - How to handle queries without group-by or with only a few groups?
[1] Flux: An Adaptive Partitioning Operator for Continuous Query Systems, ICDE '03
[2] Highly-Available, Fault-Tolerant, Parallel Dataflows, SIGMOD '04
[3] Customizable Parallel Execution of Scientific Stream Queries, VLDB '05
Stream Partitioning Techniques
Approach 1: Basic Sliding Window Partitioning
- Chunk the stream into independently processable units
- Window-aware splitting of the stream
- Each window has an id, and tuples are marked (first-winid, last-winid, is-win-closer)
- Tuples are replicated for each of their windows
[Figure: Split distributes windows W1-W4 over tuples t1, t2, ..., t10 to Node1, Node2, Node3. w = 6 units, s = 2 units, replication = 6/2 = 3]
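The marking and replication step can be sketched as follows. This is a minimal illustration, not the talk's implementation: it assumes integer timestamps starting at 0, 0-based window ids where window i covers [i*s, i*s + w), and a round-robin mapping of window ids to d nodes. The names `window_range` and `split` are hypothetical, and the is-win-closer flag is left out because the slide does not define it.

```python
def window_range(t, w, s):
    """Return (first_winid, last_winid) for a tuple with integer timestamp t >= 0.

    Assumption: window i (0-based) covers the half-open interval [i*s, i*s + w).
    """
    last = t // s                      # latest window that has already started
    first = max(0, (t - w) // s + 1)   # earliest window that has not yet ended
    return first, last


def split(t, w, s, d):
    """Replicate a tuple once per window; returns (winid, node) pairs.

    Assumption: windows are assigned to the d nodes round-robin.
    """
    first, last = window_range(t, w, s)
    return [(win, win % d) for win in range(first, last + 1)]


# Slide example: w = 6 units, s = 2 units, three nodes
copies = split(6, 6, 2, 3)
print(copies)
```

Once the stream is past its first w units, every tuple is copied w/s times, which is the replication factor quoted on the slide.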
Approach 1: Basic Sliding Window Partitioning
The problem with basic sliding window partitioning:
- Tuples belong to many windows, depending on the slide
- Excessive replication of tuples for each window
- Increase in the output data volume of the split
Example: w = 6 units, s = 2 units, replication = 6/2 = 3
Approach 2: Batch-based Partitioning
- Batch several windows together to reduce replication
- "Batch-window": wb = w + (B-1)*s ; sb = B*s
- All the tuples in a batch go to the same partition
- Only tuples in the overlap between batches are replicated
- Replication reduced to wb/sb partitions instead of w/s
Definitions: w: window-size, s: slide-size, B: batch-size
Example: w = 3, s = 1, B = 3 gives wb = 5, sb = 3; replication goes from 3 to 5/3
[Figure: tuples t1, t2, ..., t10 with windows w1-w8 and batches B1, B2]
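The batch-window arithmetic can be checked with a few lines of Python. This is only a worked example of the formulas on the slide; the function names are hypothetical, and exact fractions are used so that 5/3 is not rounded.

```python
from fractions import Fraction


def batch_window(w, s, B):
    """Batch-window size and slide: wb = w + (B-1)*s, sb = B*s."""
    wb = w + (B - 1) * s
    sb = B * s
    return wb, sb


def replication(size, slide):
    """Average number of partitions a tuple is sent to."""
    return Fraction(size, slide)


# Slide example: w = 3, s = 1, B = 3
wb, sb = batch_window(3, 1, 3)
print(wb, sb)                 # batch-window size and slide
print(replication(3, 1))      # without batching
print(replication(wb, sb))    # with batching
```

Since wb/sb = 1 + (w - s)/(B*s), the replication factor approaches 1 as the batch size B grows.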
The Panes Technique
- Divide overlapping windows into disjoint panes
- Reduce cost by sub-aggregation and sharing
- Each window has w/gcd(w,s) panes of size gcd(w,s)
- Query is decomposed into pane-level (PLQ) & window-level (WLQ) queries
[Figure: windows w1-w5 laid over panes p1-p8]
[1] No Pane, No Gain: Efficient Evaluation of Sliding Window Aggregates over Data Streams, SIGMOD Record '05
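A small sketch of the PLQ/WLQ decomposition, under assumptions that are not on the slide: windows are count-based, the aggregate is a sum (any aggregate that can be combined from partial results would work the same way), and only complete panes and windows are reported. The function names are hypothetical.

```python
from math import gcd


def pane_params(w, s):
    """Return (pane size, panes per window, panes per slide)."""
    p = gcd(w, s)
    return p, w // p, s // p


def sliding_sums(values, w, s):
    """Sliding-window sums computed through panes."""
    p, ww, sw = pane_params(w, s)
    n_panes = len(values) // p
    # PLQ: one partial sum per disjoint pane
    pane_sums = [sum(values[i * p:(i + 1) * p]) for i in range(n_panes)]
    # WLQ: combine ww consecutive pane results, advancing sw panes per window
    return [sum(pane_sums[j:j + ww]) for j in range(0, n_panes - ww + 1, sw)]


# w = 6, s = 4 over the tuples 1..14: panes of 2 tuples, 3 panes per window
print(pane_params(6, 4))
print(sliding_sums(list(range(1, 15)), 6, 4))
```

Each pane is aggregated once and its result is reused by every window that contains it, which is where the sharing comes from.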
Approach 3: Pane-based Partitioning
- Mark each tuple with pane-id + win-id
- Treat panes as tumbling windows with wp = sp = gcd(w,s)
- Route tuples to a node based on pane-id
- Nodes compute the PLQ over pane tuples
- Combine all PLQ results of a window to form the WLQ result
- Needs an organized topology of nodes
- We propose organizing the nodes in a ring
[Figure: Split routing panes to Node1, Node2, Node3. w = 6 units, s = 2 units]
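Marking a tuple with its pane-id and the windows that pane belongs to might look like the sketch below. It assumes integer timestamps starting at 0, 0-based pane and window ids, and window j covering panes [j*sw, j*sw + ww); the function name `mark` and the returned triple are illustrative, not the talk's tuple format.

```python
from math import gcd


def mark(t, w, s):
    """Return (pane_id, first_winid, last_winid) for integer timestamp t >= 0."""
    p = gcd(w, s)                 # panes are tumbling windows: wp = sp = gcd(w, s)
    ww, sw = w // p, s // p       # window size and slide, measured in panes
    pane = t // p
    first_win = max(0, (pane - ww) // sw + 1)
    last_win = pane // sw
    return pane, first_win, last_win


# Slide parameters: w = 6 units, s = 2 units
for t in range(0, 10):
    print(t, mark(t, 6, 2))
```

Because each tuple falls into exactly one pane, the split forwards it once; only the pane results, not the tuples, are shared across windows.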
Ring-based Query Evaluation
- High amount of pipelined result sharing among nodes
- Organized communication topology
[Figure: Input source -> Split -> ring of Node1, Node2, Node3 -> Merge. Example: W = 6 tuples, S = 4 tuples, P = GCD(6,4) = 2 tuples. Window1 = Pane1-Pane3 (tuples 1-6), Window2 = Pane3-Pane5 (tuples 5-10), Window3 = Pane5-Pane7 (tuples 9-14). Split sends the pane sequences (P1 P2 P3 P8 P9 ...), (P4 P5 P10 P11 ...), (P6 P7 P12 P13 ...) to the nodes; the nodes pass pane results (R3, R5, R7, R9, R11, R13) along the ring and send window results W1, W2, W3 to Merge]
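One routing rule that is consistent with the pane sequences on this slide (P1 P2 P3 P8 P9, P4 P5 P10 P11, P6 P7 P12 P13) is to send each pane to the node that owns the first window containing it. The sketch below is inferred from that example only: it assumes one window per node assigned round-robin, and the names `pane_owner` and `panes_per_node` are hypothetical.

```python
from math import gcd


def pane_owner(pane, w, s, d):
    """0-based node index for a 0-based pane id.

    Assumption: window j goes to node j % d, and a pane is routed to the
    node that owns the first window containing it.
    """
    p = gcd(w, s)
    ww, sw = w // p, s // p
    first_win = max(0, (pane - ww) // sw + 1)
    return first_win % d


def panes_per_node(n_panes, w, s, d):
    """Group 1-based pane labels (P1, P2, ...) by owning node."""
    nodes = [[] for _ in range(d)]
    for pane in range(n_panes):
        nodes[pane_owner(pane, w, s, d)].append(pane + 1)
    return nodes


# Slide example: W = 6, S = 4 tuples, three nodes, first 13 panes
print(panes_per_node(13, 6, 4, 3))
```

Under this rule the only pane a node is missing for its window is the one shared with the previous window (P3 for Window2, P5 for Window3), so its result has to come from the preceding node in the ring.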
Assignment of Windows and Panes to Nodes
- All pane results arrive only from predecessors
- Pane results sent to the successor are only local panes
- Each node is assigned n consecutive windows
- Min n s.t. [formula not recoverable]
Definitions: ww: win-size in # of panes, sw: slide-size in # of panes
Flexible Result Merging
- FIFO
- Fully-ordered (k = 0)
- k-ordered: k-ordering constraint [1], a certain amount of disorder allowed
- Defn: For any tuple s, every tuple s' that arrives at least k+1 tuples after s satisfies s'.A ≥ s.A
[1] Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams. ACM TODS '04
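The k-ordering definition can be turned into a small checker. This is a sketch for illustration, assuming the ordering attribute A has already been extracted into a list in arrival order; the function name is hypothetical.

```python
def is_k_ordered(values, k):
    """True if every value arriving at least k+1 positions after values[i]
    is >= values[i], for all i."""
    prefix_max = None
    for j, v in enumerate(values):
        i = j - (k + 1)            # newest tuple that is far enough behind j
        if i >= 0:
            # running maximum of values[0..i]
            prefix_max = values[i] if prefix_max is None else max(prefix_max, values[i])
            if v < prefix_max:
                return False
    return True


print(is_k_ordered([1, 2, 3], 0))   # fully ordered
print(is_k_ordered([2, 1, 3], 0))   # one adjacent swap
print(is_k_ordered([2, 1, 3], 1))   # tolerated when k = 1
```

With k = 0 the check reduces to a plain non-decreasing test, which matches the fully-ordered case on the slide.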
Experimental Evaluation
- Implementation of the techniques in Borealis
- Workload adapted from the Linear Road Benchmark
- Slightly modified segment statistics queries
- Basic aggregation functions with different window/slide ratios
Scalability of Split Operator
[Figure: maximum input rate (tuples/second) vs. window-size/slide ratio (window overlap)]
- Pane-partitioning: cost & throughput constant regardless of overlap ratio
- Window- & batch-partitioning: cost ↑ and throughput ↓ as overlap ↑
- Excessive replication in window-partitioning is reduced by batching
Scalability of Partitioning Techniques
- Pane-based: scales close to linearly until the split is saturated
  - Per-tuple cost is constant
- Window- & batch-based: extremely high replication
  - Split is not saturated, but scales very slowly
* w/s = overlap ratio = 100
Summary & Conclusions
- 1) Window-based
- 2) Batch-based
- 3) Pane-based
- Pane-partitioning is the partitioning technique of choice
  - Avoids tuple replication
  - Incurs less overhead in split and aggregate
  - Scales close to linearly
Ongoing & Future Work
- Generalization of the framework
- Support for adaptivity during runtime
- Extending complexity of query plans
- Extending performance analysis & experiments
Thank You!