Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.

Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD 2005

Introduction Window aggregation is an important query capability. Evaluating window aggregate queries over streams is non-trivial: – Window extents are overlapping subsets of the stream. – Window definitions are easily confused with physical stream properties. – Data can arrive out of order. – All of this hurts performance, in both execution time and memory requirements.

Introduction High arrival rates, huge data volumes, and real-time requirements make execution time and memory requirements critical. Bursty, out-of-order arrival of data makes detecting window extents difficult, and also leads to inaccurate results with higher latency. Hence the need for explicit window semantics.

Introduction Problems currently faced: – Lack of explicit semantics. – Lack of implementation efficiency with respect to execution time and memory requirements. Most implementations keep active input tuples in memory, which increases memory requirements. Furthermore, each tuple is reprocessed multiple times, once for each extent it belongs to. Most implementations also assume that the input stream is ordered.

Techniques Window-ID (WID): – Processes tuples on the fly. – Does not keep tuples in memory. – Does not reprocess tuples. – Processes out-of-order tuples on the fly without sorting them. – Does not require the data stream to be ordered. – Uses punctuations to encode whatever ordering information is available. Punctuation: – Handles out-of-order data arrival.

Example 1 Q1: SELECT seg-id, max(speed), min(speed) FROM Traffic [RANGE 300 seconds SLIDE 60 seconds WATTR ts] GROUP BY seg-id

Example 1 tuple

Window Semantics Previous work often describes window semantics operationally, which leads to confusion with physical properties of the stream. – Example: some window query operators process window extents sequentially, but data does not always arrive in window-extent order. In such cases a sorting mechanism, such as Aurora's BSort scheme, is used to order the data, which leads to high execution time and memory requirements.

Window Specification Window specification: a window type and a set of parameters that define the window used by a query. – Example: RANGE, SLIDE, and WATTR in Q1. Different window aggregate queries have different window specifications: – Sliding-window aggregate query – Time-based sliding-window query – Row-based query – Slide-by-tuple query – Partitioned window query – Query using functions

Window Specification Similar to CQL (Continuous Query Language). – Difference: user-specified WATTR and SLIDE parameters.

Sliding Window Aggregate Time-based: – Q1. Row-based: RANGE and SLIDE are specified on different attributes:

Sliding Window Aggregate Partitioned Window Aggregate: Using function: a variation of Q3

Window Semantic Framework Defines window semantics using three functions that map between window-ids and tuples in both directions: windows, extent, and wids. T: a set of tuples. S: a window specification. windows(T,S): the set of window-ids identifying the window extents to which tuples in T may belong. extent(w,T,S): the set of tuples in T belonging to the window extent identified by w.

windows, extent queries in which RANGE and SLIDE are specified on the WATTR attribute: slide-by-tuple:

slide-by-n_tuples: slide-by-n_tuples over logical order: partitioned tuple-based:

Mapping Tuples to Window-ids wids: function that identifies the window extents to which a tuple t belongs. Queries in which RANGE and SLIDE are specified on the WATTR attribute: slide-by-tuple (and variations):

Partitioned tuple-based: r=rank(t,row-num,PATTR,T)
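
The formulas from these slides did not survive in the transcript, so the following is only a minimal sketch of how extent and wids could look for a time-based window such as Q1. It assumes one particular convention that is not confirmed by the transcript: window-ids are non-negative integers, timestamps are non-negative, and window w covers the half-open WATTR interval [(w+1)*SLIDE - RANGE, (w+1)*SLIDE). The function names extent_bounds and wids_for are hypothetical.

```python
RANGE = 300  # seconds, as in Q1
SLIDE = 60   # seconds, as in Q1


def extent_bounds(w, range_=RANGE, slide=SLIDE):
    """Half-open interval [start, end) of WATTR values in window w (assumed convention)."""
    end = (w + 1) * slide
    return end - range_, end


def wids_for(ts, range_=RANGE, slide=SLIDE):
    """Window-ids of every extent that contains a tuple with timestamp ts >= 0."""
    first = int(ts // slide)                # smallest w with ts < (w+1)*slide
    last = int((ts + range_) // slide) - 1  # largest w with (w+1)*slide - range_ <= ts
    return list(range(first, last + 1))


if __name__ == "__main__":
    for w in wids_for(130):
        print(w, extent_bounds(w))
```

Under this convention a tuple belongs to RANGE/SLIDE = 5 window extents for Q1, which is why buffering approaches touch each tuple several times.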

Towards Window Query Evaluation Backward-context: – Given a tuple t, its backward-context is information about tuples that have arrived before t. – Example: partitioned tuple-based window. Forward-context: – Given a tuple t, its forward-context is information about tuples that have arrived after t. – Example: slide-by-tuple. – FCF (forward-context free) – FCA (forward-context aware)

Disorder Causes: merging unsynchronized streams, network delays. – Example: network flow data sometimes uses the start time as the timestamp. – Methods for handling disorder: slack, BSort, heartbeats.

FCF Window with WID Approach Punctuation: a message embedded in a data stream indicating that a certain subset of data is complete. WID uses punctuations to signal the end of window extents. [Diagram labels: wids function, punctuation]
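
A rough sketch of the idea for Q1, written as a single Python generator rather than as the operators of any real system. Everything about the encoding is an assumption made for illustration: stream items are ("tuple", seg_id, ts, speed) or ("punct", p), a punctuation p is taken to mean that no later tuple has ts < p, and window w is assumed to cover [(w+1)*SLIDE - RANGE, (w+1)*SLIDE).

```python
def wid_aggregate(stream, range_=300, slide=60):
    """Tag each tuple with its window-ids, keep only partial aggregates,
    and emit a window's result once a punctuation shows it is complete."""
    state = {}  # (wid, seg_id) -> [max_speed, min_speed]
    for item in stream:
        if item[0] == "tuple":
            _, seg_id, ts, speed = item
            first = int(ts // slide)
            last = int((ts + range_) // slide) - 1
            for w in range(first, last + 1):
                agg = state.get((w, seg_id))
                if agg is None:
                    state[(w, seg_id)] = [speed, speed]
                else:
                    agg[0] = max(agg[0], speed)
                    agg[1] = min(agg[1], speed)
        else:
            # ("punct", p): every window ending at or before p is complete.
            _, p = item
            done = sorted(k for k in state if (k[0] + 1) * slide <= p)
            for key in done:
                max_speed, min_speed = state.pop(key)
                yield key[0], key[1], max_speed, min_speed


if __name__ == "__main__":
    demo = [("tuple", "a", 70, 50), ("tuple", "a", 10, 60),
            ("tuple", "a", 65, 40), ("punct", 120)]
    for row in wid_aggregate(demo):
        print(row)
```

The tuples in the demo arrive out of order, yet no sorting or tuple buffer is needed: only one partial aggregate per (window-id, seg-id) pair is held until the punctuation closes the window.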

FCA Windows with WID Approach – FCB (forward-context bounded) – FCU (forward-context unbounded)

Performance Environment: – Data generators: the XMark data generator and a network analysis tool. – 1. Data in generated order. – 2. Data in bounded disorder. – 3. Data in block-sorted disorder. – Comparison: buffering mechanism.

Result WID vs. buffering

Conclusion Continuing with the larger picture, we show: – The issues, on a broader base. – Approaches to solving the problem. – A few examples that illustrate the problems and solutions.

Issues Many systems, such as financial-data and auction systems, face the bottleneck of managing continuous data streams. Current systems for evaluating sliding-window aggregate queries buffer each input tuple until it is no longer needed. Each tuple is accessed multiple times, once for each window in which it participates.

Issues Contd… There are a few problems with this approach: – The required buffer size is unbounded. – Processing each tuple multiple times leads to high computation cost.

Approaches An approach that reduces both the space and the computation time needed for query execution. It divides the stream into disjoint panes and calculates sub-aggregates over each pane. This gives significant performance benefits.

Contd… The new technique reduces the required buffer size by sub-aggregating the input stream, and reduces computation cost by sharing sub-aggregates when computing window aggregates.

Sliding Window Tuples

Semantics To evaluate a sliding-window aggregate query using panes, the query is decomposed into two sub-queries: – A pane-level query (PLQ), which is a tumbling-window aggregate query that separates the input stream into non-overlapping panes. – A window-level query (WLQ), which is a sliding-window query over the result of the PLQ and returns the window aggregate.
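
The decomposition can be sketched for a max aggregate as follows. This is an illustration under stated assumptions, not the presenters' implementation: the pane size is taken to be gcd(RANGE, SLIDE), window w is assumed to cover [(w+1)*SLIDE - RANGE, (w+1)*SLIDE), timestamps are non-negative integers, and the function names plq and wlq are chosen only to mirror the slide's terms.

```python
from math import gcd


def plq(tuples, pane):
    """Pane-level query: tumbling-window max, one sub-aggregate per pane."""
    panes = {}
    for ts, value in tuples:
        p = ts // pane
        panes[p] = max(panes.get(p, value), value)
    return panes


def wlq(panes, range_, slide, pane):
    """Window-level query: sliding max over the pane sub-aggregates."""
    per_window = range_ // pane  # panes per window
    step = slide // pane         # panes per slide
    results = {}
    if not panes:
        return results
    last_pane = max(panes)
    w = 0
    # Only report windows whose last pane has been produced.
    while (w + 1) * step - 1 <= last_pane:
        end = (w + 1) * step
        subs = [panes[p] for p in range(end - per_window, end) if p in panes]
        if subs:
            results[w] = max(subs)
        w += 1
    return results


if __name__ == "__main__":
    data = [(10, 5), (70, 9), (65, 3), (130, 4)]
    pane = gcd(180, 60)
    print(wlq(plq(data, pane), 180, 60, pane))
```

Each input tuple is touched once, in the PLQ; the WLQ only reads per-pane results, so the shared work grows with the number of panes rather than the number of tuples.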

Evaluation

Details Two types of aggregates affect the evaluation of sliding-window aggregates: – Holistic: for a sub-aggregate function L, there is no constant bound on the storage needed to hold the result of L. – Differential, of two kinds: full-differential, if the result can be stored in bounded space; pseudo-differential, if it cannot be stored within a constant bound, as in heavy-hitter queries.

Contd… Panes for Holistic Aggregates: – Despite there being no constant bound on buffer size, in many cases panes reduce the amount of buffer space needed. – Using a hashtable in the PLQ improves overall performance: each hashtable entry is shared between multiple windows, thereby reducing the computation cost.

Example The PLQ maintains a hashtable of (item-id, count) entries. Non-empty hashtable entries are output. The WLQ buffers each hashtable entry to update the sketches. Using panes, the PLQ compresses all the bids for an item into a single hash entry to reduce storage space.
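
A small sketch of this example, with exact counts swapped in for the sketches mentioned on the slide. The input format (timestamp, item-id), the pane size, and the function names pane_counts and window_counts are all hypothetical.

```python
from collections import Counter


def pane_counts(bids, pane):
    """PLQ: one hashtable (item-id -> count) per pane."""
    tables = {}
    for ts, item_id in bids:
        tables.setdefault(ts // pane, Counter())[item_id] += 1
    return tables


def window_counts(tables, first_pane, last_pane):
    """WLQ: merge the pane hashtables that make up one window."""
    total = Counter()
    for p in range(first_pane, last_pane + 1):
        total.update(tables.get(p, Counter()))
    return total


if __name__ == "__main__":
    bids = [(5, "x"), (20, "x"), (70, "y"), (75, "x"), (130, "y")]
    tables = pane_counts(bids, 60)
    print(window_counts(tables, 0, 1).most_common(1))
```

All bids for one item within a pane collapse into a single hashtable entry, and that entry is reused by every window that contains the pane.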