Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD 2005
Introduction Window aggregation is an important query capability. Evaluating window aggregate queries over streams is non-trivial: windows form overlapping subsets of the stream (window extents), window definitions are often confused with physical stream properties, and data may arrive out of order. These issues hurt performance, in both execution time and memory bandwidth.
Introduction High arrival rates, huge data volumes, and real-time requirements make execution time and memory requirements critical. Bursty, out-of-order arrival of data makes detecting window extents difficult, and also leads to inaccurate results and higher latency. Hence the need for explicit window semantics.
Introduction Problems currently faced: a lack of explicit semantics, and a lack of implementation efficiency with respect to execution time and memory requirements. Most implementations keep active input tuples in memory, which increases memory bandwidth. Each tuple is also reprocessed multiple times, once for each extent it belongs to. In addition, most implementations assume that the input stream is ordered.
Techniques Window-ID (WID): on-the-fly processing. Does not keep tuples in memory. Does not reprocess tuples. Processes out-of-order tuples on the fly without sorting them. Does not require the data stream to be ordered. Uses punctuations to encode whatever ordering information is available. Punctuations address out-of-order data arrival.
Example 1 Q1: SELECT seg-id, max(speed), min(speed) FROM Traffic [RANGE 300 seconds SLIDE 60 seconds WATTR ts] GROUP BY seg-id
Example 1 tuple
Window Semantics Previous work often describes window semantics operationally, which leads to confusion with physical properties of the stream. Example: some window query operators process window extents sequentially, but data does not always arrive in window-extent order. In such cases a sorting mechanism, such as Aurora's BSort scheme, is used to order the data. This leads to high execution time and memory bandwidth.
Window Specification Window specification: a window type and a set of parameters that define the window used by a query, e.g. RANGE, SLIDE and WATTR in Q1. Different window aggregate queries have different window specifications: sliding window aggregate queries, time-based sliding window queries, row-based queries, slide-by-tuple queries, partitioned window queries, and queries using functions.
Window Specification Similar to CQL (Continuous Query Language). Difference: user-specified WATTR and SLIDE parameters.
Sliding Window Aggregate Time-based: Q1 Row-based: RANGE and SLIDE are different attributes:
Sliding Window Aggregate Partitioned Window Aggregate: Using function: a variation of Q3
Window Semantic Framework Defines window semantics using mappings between window-ids and tuples in both directions, via three functions: windows, extent and wids. T: a set of tuples. S: a window specification. windows(T, S): the set of window-ids identifying the window extents to which tuples in T may belong. extent(w, T, S): the set of tuples in T that belong to the window extent identified by w.
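A minimal Python sketch of windows and extent for a time-based window. The formulas themselves are not reproduced in these slides, so the convention used here is an assumption: window-id w covers the half-open interval [(w+1)*SLIDE - RANGE, (w+1)*SLIDE), WATTR values are non-negative integers, and a tuple is a (ts, payload) pair with ts as its WATTR.

```python
def extent(w, tuples, range_, slide):
    """Tuples of `tuples` that fall into the window extent identified by w.

    Assumed convention: window w covers [(w+1)*slide - range_, (w+1)*slide).
    """
    hi = (w + 1) * slide
    lo = hi - range_
    return [t for t in tuples if lo <= t[0] < hi]


def windows(tuples, range_, slide):
    """Window-ids of all extents that at least one tuple belongs to."""
    ids = set()
    for t in tuples:
        ts = t[0]
        # Smallest id whose window ends after ts, up to the largest id
        # whose window starts at or before ts.
        ids.update(range(ts // slide, (ts + range_) // slide))
    return ids


if __name__ == "__main__":
    T = [(10, "a"), (70, "b"), (130, "c")]
    print(sorted(windows(T, 300, 60)))
    print(extent(0, T, 300, 60))
```

With RANGE 300 and SLIDE 60, as in Q1, every tuple falls into five overlapping extents.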
windows, extent queries in which RANGE and SLIDE are specified on the WATTR attribute: slide-by-tuple:
slide-by-n_tuples: slide-by-n_tuples over logical order: partitioned tuple-based:
Mapping Tuples to Window-ids wids: function identifying the window extents to which a tuple t belongs. Queries in which RANGE and SLIDE are specified on the WATTR attribute: slide-by-tuple (and variations):
Partitioned tuple-based: r = rank(t, row-num, PATTR, T)
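A hedged Python sketch of wids for the case where RANGE and SLIDE are both specified on WATTR. It assumes integer, non-negative WATTR values and the convention that window-id w covers [(w+1)*SLIDE - RANGE, (w+1)*SLIDE); the helper in_extent is hypothetical and only there to show that wids is the inverse of the extent mapping.

```python
def wids(ts, range_, slide):
    """Window-ids of every extent that a tuple with WATTR value ts belongs to."""
    return list(range(ts // slide, (ts + range_) // slide))


def in_extent(w, ts, range_, slide):
    """True if a tuple with WATTR value ts lies in window w (assumed convention)."""
    hi = (w + 1) * slide
    return hi - range_ <= ts < hi


if __name__ == "__main__":
    # Q1-style window: RANGE 300 seconds, SLIDE 60 seconds.
    print(wids(10, 300, 60))
    # Tumbling window (RANGE == SLIDE): exactly one window-id per tuple.
    print(wids(125, 60, 60))
```

Because the ids are computed from the tuple's own WATTR value, no information about other tuples is needed, which is what makes this case forward-context free.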
Towards Window Query Evaluation Backward-context: given a tuple t, its backward-context is information about tuples that have arrived before t, e.g. partitioned tuple-based windows. Forward-context: given a tuple t, its forward-context is information about tuples that have arrived after t, e.g. slide-by-tuple windows. FCF (forward-context free). FCA (forward-context aware).
Disorder Causes: merging unsynchronized streams, network delays. Example: network flows sometimes use their start time as the timestamp. Methods: slack, BSort, heartbeats.
FCF Windows with WID Approach Punctuation: a message embedded in a data stream indicating that a certain subset of the data is complete. WID uses punctuations to signal the end of window extents. [Figure: wids function and punctuation]
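The idea can be sketched in Python as a small operator for a Q1-style max/min query. This is an illustration under assumptions, not the paper's implementation: the class name WidMaxMin and its methods are hypothetical, window w is taken to cover [(w+1)*SLIDE - RANGE, (w+1)*SLIDE), and a punctuation with value p is taken to mean "every tuple with ts < p has arrived".

```python
class WidMaxMin:
    """Per-window max/min kept in a hash table keyed by (window-id, seg_id).

    Input tuples are never buffered or sorted; each one only updates the
    partial aggregates of the windows it belongs to.
    """

    def __init__(self, range_, slide):
        self.range_ = range_
        self.slide = slide
        self.state = {}  # (wid, seg_id) -> [max_speed, min_speed]

    def on_tuple(self, ts, seg_id, speed):
        # Tag the tuple with its window-ids and fold it into each bucket.
        for wid in range(ts // self.slide, (ts + self.range_) // self.slide):
            agg = self.state.get((wid, seg_id))
            if agg is None:
                self.state[(wid, seg_id)] = [speed, speed]
            else:
                agg[0] = max(agg[0], speed)
                agg[1] = min(agg[1], speed)

    def on_punctuation(self, ts_bound):
        # Windows ending at or before ts_bound can receive no more tuples,
        # so their results are emitted and their state is purged.
        done = sorted(k for k in self.state
                      if (k[0] + 1) * self.slide <= ts_bound)
        return [(wid, seg, *self.state.pop((wid, seg))) for wid, seg in done]


if __name__ == "__main__":
    op = WidMaxMin(range_=120, slide=60)
    for ts, seg, speed in [(70, "a", 50), (10, "a", 30), (65, "a", 60)]:
        op.on_tuple(ts, seg, speed)  # arrival order differs from ts order
    print(op.on_punctuation(120))
```

The tuples in the usage example arrive out of timestamp order, yet nothing is reordered: correctness depends only on the punctuation arriving after the tuples it covers.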
FCA Windows with WID Approach FCB (forward-context bounded) FCU (forward-context unbounded)
Performance Environment: data generators: the XMark data generator and a network analysis tool. Inputs: 1. data in generated order; 2. data in bounded disorder; 3. data in block-sorted disorder. Comparison: against a buffering mechanism.
Result WID vs. Buffering
Conclusion Continuing with the larger picture: we show the issues on a broader base, approaches to solve the problem, and a few examples that illustrate the problems and solutions.
Issues Many systems face the bottleneck of managing continuous data streams, such as financial data or auction systems. Current systems for evaluating sliding-window aggregate queries buffer each input tuple until it is no longer needed. Each tuple is accessed multiple times, once for each window it participates in.
Issues Contd … There are a few problems with this: – The buffer size required is unbounded. – Processing each tuple multiple times leads to high computation cost.
Approaches An approach that reduces both space and computation time for query execution: it divides the stream into disjoint panes and calculates sub-aggregates over each pane. This gives significant performance benefits.
Contd… The new technique reduces the required buffer size by sub-aggregating the input stream, and reduces the cost of computing window aggregates.
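To make the cost concrete, here is a small Python sketch of the buffering approach for a max aggregate, with a counter for tuple accesses. The function name buffered_max, the (ts, value) tuple layout, and the window convention (window w covers [(w+1)*SLIDE - RANGE, (w+1)*SLIDE)) are assumptions made for illustration.

```python
def buffered_max(tuples, range_, slide):
    """Evaluate a sliding-window max by buffering all tuples.

    Returns (results, accesses): results maps window-id -> max value, and
    accesses counts how many times buffered tuples were read as window members.
    """
    buffer = list(tuples)  # every input tuple is kept in memory
    if not buffer:
        return {}, 0
    first = min(ts for ts, _ in buffer) // slide
    last = (max(ts for ts, _ in buffer) + range_) // slide - 1
    results, accesses = {}, 0
    for w in range(first, last + 1):
        hi = (w + 1) * slide
        lo = hi - range_
        members = [v for ts, v in buffer if lo <= ts < hi]
        accesses += len(members)  # each member is re-read for this window
        if members:
            results[w] = max(members)
    return results, accesses


if __name__ == "__main__":
    data = [(70, 50), (10, 30), (65, 60), (130, 40)]
    print(buffered_max(data, range_=120, slide=60))
```

With RANGE twice the SLIDE, each of the four tuples is read for two windows, so the access count is double the input size; the ratio grows with RANGE/SLIDE.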
Sliding Window Tuples
Semantics To evaluate a sliding-window aggregate query using panes, the query is decomposed into two sub-queries: – A pane-level sub-query (PLQ), a tumbling-window aggregate query that separates the input stream into non-overlapping panes. – A window-level sub-query (WLQ), a sliding-window query over the result of the PLQ that returns the window aggregate.
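The decomposition can be sketched in Python for a max aggregate. Several details are assumptions not stated on these slides: the pane size is taken to be gcd(RANGE, SLIDE), window w is taken to cover [(w+1)*SLIDE - RANGE, (w+1)*SLIDE), tuples are (ts, value) pairs with integer ts, and the function name pane_max is hypothetical.

```python
from math import gcd


def pane_max(tuples, range_, slide):
    """Sliding-window max computed from per-pane sub-aggregates."""
    pane = gcd(range_, slide)        # assumed pane size
    panes_per_window = range_ // pane
    panes_per_slide = slide // pane

    # PLQ: tumbling-window max over non-overlapping panes.
    sub = {}
    for ts, v in tuples:
        p = ts // pane
        sub[p] = max(sub.get(p, v), v)

    # WLQ: sliding-window max over the pane results only.
    results = {}
    for p, m in sub.items():
        first = p // panes_per_slide
        last = (p + panes_per_window) // panes_per_slide  # exclusive bound
        for w in range(first, last):
            results[w] = max(results.get(w, m), m)
    return results


if __name__ == "__main__":
    data = [(10, 30), (70, 50), (65, 60), (130, 40)]
    print(pane_max(data, range_=120, slide=60))
    print(pane_max(data, range_=90, slide=60))
```

The WLQ never touches raw tuples: once a pane's sub-aggregate is produced, the input tuples of that pane can be discarded, and each window combines only RANGE/pane values.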
Evaluation
Details Two types of aggregates affect the evaluation of sliding-window aggregates: – Holistic: for a sub-aggregate function L, there is no constant bound on the storage needed to store the result of L. – Differential, of two types: fully differential, if the sub-aggregate can be stored in bounded storage; pseudo-differential, if it cannot be stored within a constant bound, as in heavy-hitter queries.
Contd… Panes for holistic aggregates: – Despite there being no constant bound on buffer size, panes reduce the buffer space needed in many cases. – Using a hashtable in the PLQ improves overall performance: each hashtable entry is shared between multiple windows, thereby reducing the computation cost.
Example The PLQ maintains a hashtable of (item-id, count) pairs. Non-empty hashtable entries are output. The WLQ buffers each hashtable entry to update the sketches. Using panes, the PLQ compresses all the bids for an item to a single hash entry, reducing storage space.
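A rough Python sketch of this example, with exact counts standing in for the sketches mentioned above (that substitution is a simplification, not what the slide describes). The names pane_counts and window_counts, the (ts, item_id) bid layout, and the pane size are all hypothetical.

```python
from collections import Counter, defaultdict


def pane_counts(bids, pane):
    """PLQ: one hashtable per pane, mapping item-id -> number of bids."""
    panes = defaultdict(Counter)
    for ts, item_id in bids:
        panes[ts // pane][item_id] += 1  # many bids collapse into one entry
    return panes


def window_counts(panes, pane_ids):
    """WLQ: merge the hashtables of the panes that make up one window."""
    total = Counter()
    for p in pane_ids:
        total.update(panes.get(p, Counter()))  # Counter.update adds counts
    return total


if __name__ == "__main__":
    bids = [(1, "x"), (2, "x"), (3, "y"), (12, "x"), (15, "z")]
    panes = pane_counts(bids, pane=10)
    window = window_counts(panes, pane_ids=[0, 1])
    print(window.most_common(1))
```

The pane hashtables are built once and can be merged into every window that contains the pane, which is where the sharing between overlapping windows comes from.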