Continuous Processing of Preference Queries in Data Streams : a Survey


Continuous Processing of Preference Queries in Data Streams : a Survey M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos Data Engineering Lab Department of Informatics Aristotle University of Thessaloniki

Presentation Layout Preliminaries Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Summary


Data Streams Data Stream is an infinite sequence of objects. Each object can be one-dimensional or multi-dimensional. Streaming Time Series are finite sequences of objects. Streaming Time Series changes over time. Arrival rate of objects usually varies.

Sliding Window Model (1) Count-based window: Sliding window contains the W most recent tuples (“active”). Older tuples expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples.]

Sliding Window Model (2) Time-based window: Sliding window contains the tuples (“active”) of the W most recent timestamps. Older records expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples.]
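The two window models can be sketched as follows. This is a minimal illustration, not code from the presentation: the class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based window assumes that tuples arrive in non-decreasing timestamp order.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the W most recent tuples."""

    def __init__(self, w):
        self.items = deque(maxlen=w)

    def insert(self, item):
        # Return the tuple that expires because of this insertion, if any.
        expired = self.items[0] if len(self.items) == self.items.maxlen else None
        self.items.append(item)  # deque(maxlen=w) drops the oldest tuple itself
        return expired


class TimeWindow:
    """Time-based window: keeps the tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, item) pairs, oldest first

    def insert(self, timestamp, item):
        self.items.append((timestamp, item))
        expired = []
        # A tuple is active while its timestamp is greater than timestamp - W.
        while self.items[0][0] <= timestamp - self.w:
            expired.append(self.items.popleft()[1])
        return expired
```

With W=5 and tuples t1 to t8 arriving one per timestamp, both windows end up holding t4 to t8, while t1 to t3 have expired.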

Database System [Figure: user/application, query, input, result.]

Continuous Evaluation in a Data Stream System [Figure: user/application, query, query processor, result.]

Motivation (1) Numerous data stream contexts Financial data analysis Network management Astronomical data analysis Sensor network Telecommunication data management

Motivation (2) Preference queries: a useful decision support tool with many applications in data streams. Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls. Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers. Query types shown: continuous skyline query, continuous top-k dominating query.

Presentation Layout (current section: Continuous skyline queries) Preliminaries Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Conclusions

Skyline Query Hotels price distance T1 4 1 T2 3 2 T3 0.5 T4 2.5 4.5 T5 1.5 T6 3.5 5 Dominant tuple: A tuple t dominates another tuple t’ if t is not worse than t’ in all dimensions, and t is better than t’ in at least one dimension. Skyline: contains all the tuples not dominated by any other tuple. [Figure: hotels T1 to T6 plotted by price and distance.]
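The dominance test and the skyline definition translate directly into code. The sketch below assumes that smaller values are better in every dimension. The hotel data is hypothetical, because some of the values in the slide's table are not legible in this transcript.

```python
def dominates(a, b):
    """True if a is not worse than b in all dimensions and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Names of all tuples that are not dominated by any other tuple."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# Hypothetical hotels as (price, distance) pairs.
hotels = {
    "A": (4.0, 1.0),
    "B": (3.0, 2.0),
    "C": (1.0, 3.0),
    "D": (2.5, 4.5),
    "E": (3.5, 5.0),
}
result = skyline(hotels)  # D is dominated by C, E is dominated by B
```

This quadratic scan is only meant to make the definition concrete. The continuous methods in the following slides aim to avoid recomputing it on every update.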

Continuous Skyline Query Problem definition: We have to continuously evaluate a skyline query in multidimensional streaming time series. Application example: network data Computers with suspicious behavior. Network traffic, number of connections, number of destinations.

Basic Idea The skyline changes due to: the insertion of a new skyline tuple; the expiration of a skyline tuple. LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]: use of a spatial index. Advantage: simple implementation. Disadvantage: the expiration of a skyline tuple is not handled efficiently.

Event Approach (1) When an existing skyline tuple expires, how can we find the new skyline tuples? This is a very costly operation. Skyline influence time (SIT): the minimum time at which a tuple may become a skyline tuple. Generate events based on SIT. Event: examine the tuples with that influence time.

Event Approach (2) Eager [Tao, TKDE06] Advantage: handles skyline expiration. Disadvantage: processing time per tuple. Example (W=10): tuple K can be discarded due to tuple L (younger and better); K.SIT=19. [Figure: tuples B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12) in the data space.]

n-of-N Skyline Queries (1) n-of-N definition S6 = {a,c} S4 = {c,g} source: icde05

n-of-N Skyline Queries (2) n-of-N definition S6 = {c,h} S4 = {e,h} source: icde05

Method cnN (1) Method cnN [Lin, ICDE05] is also based on events. Tuple K is redundant because tuple L is better and younger than K. Tuple L is dominated by D and E. The dominance relation between L and E is critical because E is the youngest tuple which dominates L. [Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12) with W=10.]

Method cnN (2) The dominance graph contains all the critical dominance relations. Generate intervals: for the skyline tuples, e.g. C = (0,3]; for the critical dominance relations, e.g. C -> G = (3,7]. Use an interval tree to store them. [Figure: dominance graph over tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), marking redundant tuples and critical dominance relations.]

Method cnN (3) A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M–n+1, where M is the total number of elements seen so far. To answer an n-of-N query, apply a stabbing query at M–n+1. Example (M = 7): intervals C = (0,3], D = (0,4], C -> G = (3,7], D -> E = (4,5], D -> F = (4,6]. For n = 4, M–n+1 = 4 and S4 = {D, G}. For n = 6, M–n+1 = 2 and S6 = {C, D}. [Figure: dominance graph over tuples A(1) to G(7).]
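The stabbing query of this example can be reproduced with a few lines of code. The sketch below replaces the interval tree with a linear scan, which is enough to check the numbers on the slide. The function name and the dictionary layout are assumptions. Each interval is keyed by the tuple it reports, so the interval of the critical dominance relation C -> G is stored under G.

```python
def n_of_n_skyline(intervals, m, n):
    """Report every tuple whose half-open interval (lo, hi] contains m - n + 1."""
    q = m - n + 1  # stabbing point
    return {name for name, (lo, hi) in intervals.items() if lo < q <= hi}


# Intervals from the slide, with M = 7 elements seen so far.
intervals = {
    "C": (0, 3),  # skyline tuple C
    "D": (0, 4),  # skyline tuple D
    "G": (3, 7),  # critical dominance relation C -> G
    "E": (4, 5),  # critical dominance relation D -> E
    "F": (4, 6),  # critical dominance relation D -> F
}

s4 = n_of_n_skyline(intervals, m=7, n=4)  # stabbing point 4
s6 = n_of_n_skyline(intervals, m=7, n=6)  # stabbing point 2
```

Both results agree with the slide: S4 = {D, G} and S6 = {C, D}. An interval tree answers the same query without scanning every interval.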

Method cnN (3) Advantages: good use of skyline properties (removes redundant tuples); multiple query processing (the graph is built for several user queries). Disadvantages: processing time per tuple; increased memory requirements (because of the necessary graph).

Frequent Skyline - Motivation Highly dynamic environment The skyline results are meaningful only if the skyline tuples appear consistently Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]

Streaming Model Client/Server architecture: the server receives object updates from the clients. Each object can be represented as a d-dimensional point. Object update: point movement in the d-dimensional space (the value changes in at least one dimension). Object insertion or deletion: point movement from/to a nonexistent position. Goal: minimization of communication cost.

Safe region technique The skyline remains unchanged if each object stays in its safe region. Communication happens only when the safe region is violated. The safe region approach leads to communication optimization. [Figure: an object as a point and its filter (safe region). source: sigmod09]

Sampling All clients report their skyline at the same sampled time The clients are synchronized with the same random seed Guaranteed quality if sampling rate is high enough

Hybrid solution Combines Filter and Sampling. Small changes: apply Filter. Larger changes: apply Sampling. Disadvantage of all three methods: energy consumption is not uniform (critical in sensor networks).

k-dominant Skyline Query - Motivation Skyline: contains tuples not dominated by any other tuple. Disadvantage: high dimensionality problem. Solution: relax the notion of dominance. k-dominant tuple: A tuple t k-dominates another tuple t’ if t is not worse than t’ in at least k dimensions and t is better than t’ in at least one of them. k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]
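A direct reading of the k-dominance definition is sketched below, assuming that smaller values are better. The three points are hypothetical and chosen so that k-dominance forms a cycle, which empties the 2-dominant skyline.

```python
def k_dominates(a, b, k):
    """True if a is not worse than b in at least k dimensions and better in one of them."""
    not_worse = sum(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    # A strictly better dimension is also a "not worse" one, so it can always
    # be chosen as one of the k dimensions.
    return not_worse >= k and better


def k_dominant_skyline(points, k):
    """Names of all tuples that are not k-dominated by any other tuple."""
    return {
        name
        for name, p in points.items()
        if not any(k_dominates(q, p, k) for other, q in points.items() if other != name)
    }


# Hypothetical 3-dimensional points.
points = {"a": (1, 2, 3), "b": (3, 1, 2), "c": (2, 3, 1)}
```

Here a 2-dominates c, c 2-dominates b, and b 2-dominates a, so the 2-dominant skyline is empty, while the conventional skyline (k = 3) contains all three points.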

k-dominant Skyline Query - Example T1 4-dominates T3. T1 5-dominates T4. T1 dominates T5. The smaller k is, the fewer tuples the k-dominant skyline contains. Conventional skyline: {T1, T2, T3, T4}. 5-dominant skyline: {T1, T2, T3}. 4-dominant skyline: {T1, T2}.

Observations Traditional or streaming skyline methods are inappropriate Skyline properties do not hold E.g. transitive property k-dominance can be cyclic Existence of multiple users and multiple queries. cnN supports multiple queries

Method CoSMuQ (1) A query on D dimensions arrives. Given a parameter value k, split the query to subqueries of d=k dimensions. Compute the conventional skyline of each subquery. The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
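The decomposition into subqueries can be illustrated as below. This is only a sketch of the idea stated on the slide: it enumerates every k-dimensional subspace, computes each conventional skyline from scratch, and intersects them. It does not reproduce how CoSMuQ maintains these skylines over a stream. Function names and data are hypothetical, and smaller values are assumed to be better.

```python
from itertools import combinations


def dominates(a, b):
    """Conventional dominance: not worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def subspace_skyline(points, dims):
    """Conventional skyline of the points projected onto the given dimensions."""
    proj = {name: tuple(p[d] for d in dims) for name, p in points.items()}
    return {
        name
        for name, p in proj.items()
        if not any(dominates(q, p) for other, q in proj.items() if other != name)
    }


def k_dominant_skyline(points, k):
    """Intersection of the skylines of all k-dimensional subqueries."""
    d = len(next(iter(points.values())))
    result = set(points)
    for dims in combinations(range(d), k):
        result &= subspace_skyline(points, dims)
    return result


# Hypothetical 3-dimensional points.
points = {"p1": (1, 1, 5), "p2": (2, 2, 1), "p3": (3, 3, 3)}
two_dominant = k_dominant_skyline(points, 2)
conventional = k_dominant_skyline(points, 3)
```

A query on D dimensions has D-choose-k subqueries, which is consistent with the memory cost in high dimensionality listed as a disadvantage of the method.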

Method CoSMuQ (2) Advantages: based on the conventional skyline (simple domination checks); properties of conventional skylines can be used; exploits the overlap between different queries. Disadvantages: memory requirements increase in high dimensionality.

Continuous Skyline methods - Summary Query Type Window Type Multiple Queries LookOut skyline time no Lazy and Eager both n-of-N count yes Filter and Sampling frequent skyline CoSMuQ k-dominant skyline

Presentation Layout (current section: Continuous top-k queries) Data streams - Preliminaries Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Summary

Top-k query - Example Hotels price distance T1 4 1 T2 3 2 T3 0.5 T4 2.5 4.5 T5 1.5 T6 3.5 5 Given a preference function, a top-k query returns the k tuples with the best scores. Preference function: F=price+distance. [Figure: hotels T1 to T6 plotted by price and distance, with the top-k results for k=1 and k=2.]

Continuous Top-k Query Problem definition: Continuous evaluation of top-k query in multidimensional streaming time series. Application example: network data. Top-100 flows with the largest individual throughput. Common destination: DDoS (distributed denial of service) attack.

Basic Idea A new tuple changes the top-k result only if it belongs to the influence region of the query. Top-k tuple expiration: query computation from scratch. TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06] Advantage: simple implementation. Disadvantage: no efficient handling of an expired top-k tuple. [Figure: influence region in the (x1, x2) plane, bounded by the line F = score(tk) = x1 + x2 through the k-th tuple tk. source: sigmod06]

Skyband - Example k-skyband: contains all the tuples which are dominated by at most k–1 other tuples. The 1-skyband (tuples not dominated by other tuples) is the skyline. 2-skyband: tuples dominated by at most 1 other tuple. 3-skyband: tuples dominated by at most 2 other tuples. [Figure: tuples A, B, C, D, E.]
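The k-skyband definition can be checked with a brute-force count of dominators. The sketch assumes that smaller values are better. The five points are hypothetical and are not the coordinates of the slide's figure.

```python
def dominates(a, b):
    """Conventional dominance: not worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def k_skyband(points, k):
    """Names of all tuples dominated by at most k - 1 other tuples."""
    return {
        name
        for name, p in points.items()
        if sum(dominates(q, p) for other, q in points.items() if other != name) <= k - 1
    }


# Hypothetical 2-dimensional points.
points = {"p1": (1, 4), "p2": (2, 2), "p3": (3, 3), "p4": (4, 1), "p5": (4, 4)}
```

For this data the 1-skyband (the skyline) is {p1, p2, p4}, the 2-skyband adds p3 (dominated only by p2), and p5 enters only at k = 5 because four tuples dominate it.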

Skyband Approach (1) Transform tuples in the (score, expiration_time) space. Scores with F=price+distance: T1 5, T2 5, T3 3.5, T4 7, T5 5.5, T6 8.5. Dominance counter (DC): number of tuples that are younger and better. Rule: keep tuples with DC < k. Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space. [Figure: tuples T1 to T6 in the original (price, distance) space and in the transformed (score, exp_time) space, annotated with their dominance counters and the top-1 result.]
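The dominance-counter rule can be sketched as follows. The code assumes that a lower score is better (as with F=price+distance for hotels) and that a later expiration time means a younger tuple. The data and the function name are hypothetical, and the counters are recomputed from scratch rather than maintained incrementally.

```python
def skyband_candidates(tuples, k):
    """Map each tuple kept by the rule DC < k to its dominance counter.

    tuples: list of (name, score, exp_time) triples.
    """
    kept = {}
    for name, score, exp in tuples:
        # DC: number of tuples that are younger (expire later) and better.
        dc = sum(1 for _, s2, e2 in tuples if e2 > exp and s2 < score)
        if dc < k:
            kept[name] = dc
    return kept


# Hypothetical stream: (name, score, expiration time).
stream = [("a", 5, 1), ("b", 3, 2), ("c", 7, 3), ("d", 4, 4), ("e", 6, 5)]
top1_candidates = skyband_candidates(stream, k=1)
```

With k = 1 only b, d and e are kept. They are exactly the tuples that become the top-1 answer as the window slides: b first, then d once b expires, then e.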

Skyband Approach (2) SMA (Skyband Monitoring Algorithm) proposed in [Mouratidis, SIGMOD06] Advantage: independent of the dimensionality, since it works in the 2-dimensional (score, exp_time) space. Disadvantage: the k-skyband may contain fewer than k tuples; in this case, a top-k tuple expiration will cause query computation from scratch.

Distributed Top-k Continuously report the k largest values obtained from distributed data streams. Objective is to minimize communication cost Proposed by [Babcock, SIGMOD03]

Streaming Model Nodes: N1, N2, …, Nm; coordinator node: N0. Set of n data objects O1, O2, …, On associated with real values V1, V2, …, Vn. Value updates are represented as $\langle O_i, N_j, \Delta \rangle$ tuples: Nj detects a change $\Delta$ in the value Vi of Oi. The change is not seen by other nodes Nk ($k \neq j$). The value Vi for an object Oi: $V_i = \sum_j V_{i,j}$, where $V_{i,j}$ is the value of the i-th object in the j-th node.

Method (1) Initialize a top-k set at the coordinator node. Set arithmetic constraints at the monitor nodes; they depend on the current top-k set. Constraints valid => no communication. Constraints invalidated => client communicates with server, possibly new top-k set, recomputation of constraints.

Method (2) - Adjustment Factors (AF) For each node $N_j$ and object $O_i$, associate an adjustment factor $\delta_{i,j}$. Constraints are evaluated after adding the adjustment factors: if $O_t \in T$ and $O_s \in U - T$, then $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$. To keep the results valid, the adjustment factors for each object sum to zero. Local top-k similar to global top-k => low communication cost. Local top-ks differ from global top-k => unnecessary constraint violations => increased communication cost. Disadvantage: energy consumption is not uniform. [Figure: two nodes holding local values for Object 1 and Object 2, with adjustment factors 0, -3 and 3; global Top-1 = {O1}; Node 1 local Top-1 = {O1}; Node 2 local Top-1 = {O2}.]
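The role of the adjustment factors can be illustrated with a small check of the local constraints. The values below are hypothetical (the numbers in the slide's figure are not fully legible), the function names are assumptions, and the sketch ignores any factor held at the coordinator node. Since the factors of each object sum to zero, adding up the local inequalities over all nodes shows that every top-k object has a global value at least as large as every other object.

```python
def constraints_hold(values, delta, top, node):
    """Check at one node that each top-k object, after adjustment,
    is not smaller than each object outside the top-k set."""
    adjusted = {o: values[o][node] + delta[o][node] for o in values}
    return all(adjusted[t] >= adjusted[s] for t in top for s in values if s not in top)


def factors_sum_to_zero(delta):
    """The adjustment factors of every object must sum to zero over the nodes."""
    return all(sum(per_node.values()) == 0 for per_node in delta.values())


# Hypothetical local values V[i][j] of object i at node j.
values = {"O1": {"N1": 9, "N2": 1}, "O2": {"N1": 1, "N2": 3}}
top = {"O1"}  # global values: O1 = 10, O2 = 4

no_adjustment = {"O1": {"N1": 0, "N2": 0}, "O2": {"N1": 0, "N2": 0}}
adjusted = {"O1": {"N1": -3, "N2": 3}, "O2": {"N1": 0, "N2": 0}}
```

Without adjustment, node N2 sees O2 (3) above O1 (1) and would report a violation although the global top-1 is unchanged. Shifting 3 units of O1 from N1 to N2 makes both local constraints hold, so no communication is needed.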

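Uncertain Data Tuples (score: probability): 6: 0.8, 5: 0.5, 2: 0.4, 8: 0.4. 4 tuples, 16 possible worlds. Possible worlds (tuples: probability): {2,5,6,8}: .064; {2,5,6}: .096; {2,6,8}: .064; {2,6}: .096; {2,5,8}: .016; {2,5}: .024; {2,8}: .016; {2}: .024; {5,6,8}: .096; {5,6}: .144; {5,8}: .024; {5}: .036; {6,8}: .096; {6}: .144; {8}: .024; Empty: .036. To compute the top-k probability of a tuple (e.g. tuple 6), sum the probabilities of the worlds in which it is in the top-k. Pk-topk query: returns the k tuples that are most likely to be in the top-k. Top-2: {6,5} with prob. {0.64, 0.5}. source: pvldb08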
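The possible-worlds computation behind this example can be reproduced by brute force. Two assumptions are made in the sketch. First, a lower score ranks higher, which is the ordering under which the slide's result (0.64 for tuple 6 and 0.5 for tuple 5) comes out. Second, the tuple with score 8 has probability 0.4, a value that is not legible in the slide but matches the listed probability .064 of the world {2,5,6,8}. The function name is hypothetical, and the enumeration is exponential, unlike the compact-set solution of the cited paper.

```python
from itertools import product


def pk_topk_probabilities(tuples, k):
    """Probability of each tuple being in the top-k, by enumerating all worlds.

    tuples: {score: existence probability}; tuples are independent.
    """
    scores = sorted(tuples)  # assumption: a lower score ranks first
    in_topk = {s: 0.0 for s in scores}
    for world in product([False, True], repeat=len(scores)):
        p = 1.0
        for s, present in zip(scores, world):
            p *= tuples[s] if present else 1.0 - tuples[s]
        present_scores = [s for s, present in zip(scores, world) if present]
        for s in present_scores[:k]:  # the k best tuples of this world
            in_topk[s] += p
    return in_topk


probs = pk_topk_probabilities({6: 0.8, 5: 0.5, 2: 0.4, 8: 0.4}, k=2)
answer = sorted(probs, key=probs.get, reverse=True)[:2]
```

The two most probable tuples are 6 and 5, in agreement with the Top-2 result on the slide.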

Pk-topk Query Solution proposed by [Jin, PVLDB08]: compact set based, space-efficient solution. Discards unnecessary tuples and applies several compression schemes to compress data. Disadvantage (model assumption): tuple probabilities are assumed to be random and independent of each other.

Continuous Top-k Methods -Summary Query Type Window Type Multiple Queries TMA and SMA top-k both yes Distributed top-k Distributed top-k time no Compact set based Pk-topk

Presentation Layout (current section: Continuous top-k dominating queries) Preliminaries Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Summary

Top-k Dominating Query - Example Hotels price distance T1 4 1 T2 3 2 T3 0.5 T4 2.5 4.5 T5 1.5 T6 3.5 5 Skyline: contains all the tuples not dominated by any other tuple. Disadvantage: high dimensionality problem. Top-k: given a preference function (here F=price+distance), a top-k query returns the k tuples with the best scores. Disadvantage: user-defined preference function. Top-k dominating: the answer contains the k tuples with the highest domination power. It combines the advantages of skyline and top-k queries and avoids their disadvantages. [Figure: hotels T1 to T6 plotted by price and distance, with the top-k results for k=1 and k=2.]
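A brute-force version of the top-k dominating query is sketched below. It takes the domination power of a tuple to be the number of tuples it dominates, and it assumes that smaller values are better. The data is hypothetical, and ties are broken by name only to keep the output deterministic.

```python
def dominates(a, b):
    """Conventional dominance: not worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def domination_scores(points):
    """Number of other tuples that each tuple dominates."""
    return {
        name: sum(dominates(p, q) for other, q in points.items() if other != name)
        for name, p in points.items()
    }


def top_k_dominating(points, k):
    """The k tuples with the highest domination power (ties broken by name)."""
    scores = domination_scores(points)
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# Hypothetical 2-dimensional points.
points = {"p1": (1, 4), "p2": (2, 2), "p3": (3, 3), "p4": (4, 1), "p5": (4, 4)}
best = top_k_dominating(points, 1)
```

No preference function is needed, and the size of the answer is fixed by k rather than growing with the dimensionality.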

Continuous Top-k Dominating Query Problem definition: Continuous evaluation of top-k dominating query in multidimensional streaming time series. Application Example: sensor network Areas with high probability of fire outbreak Temperature, humidity and wind speed

EVA Objective: reduce domination checks. Safe interval of a tuple: the tuple is ignored for this interval; the interval depends on the tuple's score and the k-th score. The end of a safe interval triggers an event. Event: try to compute a new safe interval, else compute the score from scratch. New tuple: find another tuple that dominates the new one and estimate a lower bound of the safe interval.

ADA Advanced computation of safe interval: depends on the number of tuples that dominate this tuple and expire later. Candidate tuples: tuples with scores close to the k-th score are updated in each time instance. EVA and ADA proposed by [Kontaki 2009].

Presentation Layout (current section: Summary) Preliminaries Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Summary

Summary Preference queries are very useful in data streams Presented state-of-the-art methods For continuous skyline queries For continuous top-k queries For continuous top-k dominating queries Examined advantages and disadvantages of the proposed methods

Research Directions Continuous subspace skyline queries Solutions appropriate for distributed environments uniform energy consumption Approximate algorithms Existence of multiple queries

Thank you