Continuous Processing of Preference Queries in Data Streams: A Survey. M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos. Data Engineering Lab, Department of Informatics, Aristotle University of Thessaloniki
Presentation Layout: Preliminaries; Continuous skyline queries; Continuous top-k queries; Continuous top-k dominating queries; Summary
Data Streams: A data stream is an infinite sequence of objects. Each object can be one-dimensional or multi-dimensional. Streaming time series are finite sequences of objects that change over time. The arrival rate of objects usually varies.
Sliding Window Model (1): Count-based window. The sliding window contains the W most recent tuples (“active”). Older tuples expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
Sliding Window Model (2): Time-based window. The sliding window contains the tuples (“active”) of the W most recent timestamps. Older tuples expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
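The two window models can be sketched in a few lines of Python. This is a minimal illustration, not taken from any of the surveyed systems; the class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based window is assumed to keep tuples whose timestamp lies in the last W timestamps, that is in (now - W, now].

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the W most recent tuples."""

    def __init__(self, w):
        # A bounded deque drops the oldest tuple automatically on overflow.
        self.items = deque(maxlen=w)

    def insert(self, item):
        self.items.append(item)

    def active(self):
        return list(self.items)


class TimeWindow:
    """Time-based window: keeps tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, item), in arrival order

    def insert(self, timestamp, item):
        self.items.append((timestamp, item))
        # Expire every tuple whose timestamp fell out of (timestamp - W, timestamp].
        while self.items and self.items[0][0] <= timestamp - self.w:
            self.items.popleft()

    def active(self):
        return [item for _, item in self.items]
```

With W=5, inserting t1 to t8 into a `CountWindow` leaves t4 to t8 active. A `TimeWindow` may hold any number of tuples, because several tuples can share a timestamp and the arrival rate varies.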
Database System [Figure: a user or application issues a query against the input data and receives a result]
Continuous Evaluation in a Data Stream System [Figure: a user or application registers a query with the query processor and receives results]
Motivation (1): Numerous data stream contexts: financial data analysis, network management, astronomical data analysis, sensor networks, telecommunication data management.
Motivation (2): Preference queries are a useful decision support tool with many applications in data streams. Example 1 (telecommunication data): report the clients with the maximum call time and the maximum number of calls. Example 2 (stock-market data): report the products with the maximum price, the minimum sales and the minimum number of buyers. Query types shown: continuous skyline query, continuous top-k dominating query.
Presentation Layout (current section: Continuous skyline queries): Preliminaries; Continuous skyline queries; Continuous top-k queries; Continuous top-k dominating queries; Summary
Skyline Query: Dominant tuple: a tuple t dominates another tuple t’ if t is not worse than t’ in all dimensions and t is better than t’ in at least one dimension. Skyline: contains all the tuples not dominated by any other tuple. Hotels (price, distance): T1 4 1; T2 3 2; T3 0.5; T4 2.5 4.5; T5 1.5; T6 3.5 5. [Figure: hotels T1 to T6 plotted in the price-distance plane]
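The dominance test and the skyline definition translate directly into code. The sketch below assumes that smaller values are better in every dimension and uses hypothetical points A to E rather than the hotel table; it is a quadratic brute-force check, not one of the surveyed algorithms.

```python
def dominates(a, b):
    """a dominates b: not worse in all dimensions, better in at least one.

    Assumption: smaller values are better in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the names of all points not dominated by any other point."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# Hypothetical (price, distance) pairs, for illustration only.
hotels = {
    "A": (1.0, 5.0),
    "B": (2.0, 3.0),
    "C": (4.0, 1.0),
    "D": (3.0, 4.0),  # dominated by B
    "E": (5.0, 5.0),  # dominated by A, B, C and D
}

print(sorted(skyline(hotels)))
```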
Continuous Skyline Query: Problem definition: continuously evaluate a skyline query over multidimensional streaming time series. Application example: network data. Find computers with suspicious behavior based on network traffic, number of connections and number of destinations.
Basic Idea: The skyline changes due to the insertion of a new skyline tuple or the expiration of a skyline tuple. LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06] use a spatial index. Advantage: simple implementation. Disadvantage: the expiration of a skyline tuple is not handled efficiently.
Event Approach (1): When an existing skyline tuple expires, how can we find the new skyline tuples? This is a very costly operation. Skyline influence time (SIT): the minimum time at which a tuple may become a skyline tuple. Generate events based on SIT. Event: examine the tuples with that influence time.
Event Approach (2): Eager [Tao, TKDE06]. Advantage: handles skyline expiration. Disadvantage: processing time per tuple. Example with W=10: tuple K can be discarded due to tuple L (younger and better); K.SIT=19. [Figure: tuples B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12)]
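One way to read the skyline influence time is sketched below, under two labeled assumptions: a tuple expires at arrival time + W, and the SIT of a tuple is the expiry time of the youngest tuple that dominates it (the tuple cannot enter the skyline before all of its dominators have expired). The function name and the coordinates are hypothetical; the coordinates are chosen so that the result matches the value K.SIT=19 for W=10.

```python
W = 10  # window length, as in the slide example


def dominates(a, b):
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_influence_time(new, active, w=W):
    """Earliest time at which `new` may become a skyline tuple.

    `new` is (arrival_time, point); `active` is a list of such pairs.
    If nothing dominates `new`, it can be on the skyline from its arrival.
    """
    arrival, point = new
    dominator_arrivals = [t for t, p in active if dominates(p, point)]
    if not dominator_arrivals:
        return arrival
    # The youngest dominator expires last, at its arrival time + w.
    return max(dominator_arrivals) + w


# Hypothetical coordinates: the tuple that arrived at time 9 dominates
# the one arriving at time 11, so the latter gets SIT = 9 + 10 = 19.
active = [(4, (1.0, 5.0)), (9, (2.0, 2.0))]
print(skyline_influence_time((11, (3.0, 3.0)), active))
```

An event scheduled at the SIT is then the point at which the tuple is re-examined as a skyline candidate.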
n-of-N Skyline Queries (1): n-of-N definition: the skyline of the most recent n tuples, for any n ≤ N. Example: S6 = {a,c}, S4 = {c,g}. source: icde05
n-of-N Skyline Queries (2): n-of-N definition, example: S6 = {c,h}, S4 = {e,h}. source: icde05
Method cnN (1): Method cnN [Lin, ICDE05] is also based on events. Tuple K is redundant because tuple L is better and younger than K. Tuple L is dominated by D and E. The dominance relation between L and E is critical because E is the youngest tuple that dominates L. [Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12), with W=10]
Method cnN (2): Redundant tuples: A(1), B(2). The dominance graph contains all the critical dominance relations. Generate intervals: for the skyline tuples, e.g. C = (0,3]; for the critical dominance relations, e.g. C -> G = (3,7]. Use an interval tree to store them. [Figure: dominance graph over C(3), D(4), E(5), F(6), G(7)]
Method cnN (3): A tuple t is in the answer of an n-of-N skyline query iff there exists an interval of t containing the value M–n+1, where M is the total number of elements seen so far. To answer an n-of-N query, apply a stabbing query at M–n+1. Example with M = 7 and intervals C = (0,3], D = (0,4], C -> G = (3,7], D -> E = (4,5], D -> F = (4,6]: for n = 4, M–n+1 = 4 and S4 = {D, G}; for n = 6, M–n+1 = 2 and S6 = {C, D}.
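The stabbing query on the example intervals can be checked with a short sketch. It scans a plain list instead of the interval tree the method uses, so it shows the logic only, not the intended complexity. Each interval is stored with the tuple it qualifies: the tuple itself for a skyline interval, and the dominated tuple for a critical dominance relation (an assumption that reproduces S4 and S6 above).

```python
# (low, high, tuple): the tuple is in the answer for stabbing values v
# with low < v <= high, i.e. the half-open interval (low, high].
intervals = [
    (0, 3, "C"),  # C = (0,3]
    (0, 4, "D"),  # D = (0,4]
    (3, 7, "G"),  # C -> G = (3,7]
    (4, 5, "E"),  # D -> E = (4,5]
    (4, 6, "F"),  # D -> F = (4,6]
]


def n_of_n_skyline(intervals, m, n):
    """Answer an n-of-N skyline query with a stabbing query at m - n + 1."""
    v = m - n + 1
    return {t for low, high, t in intervals if low < v <= high}


print(sorted(n_of_n_skyline(intervals, m=7, n=4)))  # stabbing value 4
print(sorted(n_of_n_skyline(intervals, m=7, n=6)))  # stabbing value 2
```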
Method cnN (4): Advantages: good use of skyline properties (removes redundant tuples); multiple query processing (the graph is built for several user queries). Disadvantages: processing time per tuple; increased memory requirements because of the necessary graph.
Frequent Skyline - Motivation Highly dynamic environment The skyline results are meaningful only if the skyline tuples appear consistently Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]
Streaming Model: Client/server architecture. The server receives object updates from the clients. Each object can be represented as a d-dimensional point. Object update (point movement in the d-dimensional space): the value in at least one dimension changes. Object insertion or deletion: point movement from/to a nonexistent position. Goal: minimization of communication cost.
Safe region technique: The skyline remains unchanged if each object stays in its safe region. Communication happens only when the safe region is violated. The safe region approach leads to communication optimization. [Figure: an object as a point and its filter (safe region)] source: sigmod09
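The client side of the safe region idea can be sketched as follows. The class name `FilteredClient` is hypothetical and the safe region is assumed to be an axis-aligned rectangle; how the server derives the regions so that the skyline is guaranteed unchanged is not modeled here.

```python
class FilteredClient:
    """Client that reports an update only when it leaves its safe region."""

    def __init__(self, position, low, high):
        self.position = position
        # Assumed shape: axis-aligned box given by its low and high corners.
        self.low, self.high = low, high
        self.messages_sent = 0

    def inside(self, p):
        return all(l <= x <= h for l, x, h in zip(self.low, p, self.high))

    def move(self, new_position):
        """Apply a local update; return True if the server must be notified."""
        self.position = new_position
        if self.inside(new_position):
            return False  # filter holds, no communication
        self.messages_sent += 1
        return True


client = FilteredClient(position=(1.0, 1.0), low=(0.0, 0.0), high=(2.0, 2.0))
print(client.move((1.5, 0.5)))  # stays inside the safe region
print(client.move((3.0, 1.0)))  # violates the safe region
```

After a violation the server would normally assign a new safe region; that step is left out of the sketch.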
Sampling All clients report their skyline at the same sampled time The clients are synchronized with the same random seed Guaranteed quality if sampling rate is high enough
Hybrid: Hybrid solution that combines Filter and Sampling. Small changes: apply Filter. Larger changes: apply Sampling. Disadvantage of all three methods: energy consumption is not uniform (critical in sensor networks).
k-dominant Skyline Query - Motivation: Skyline: contains tuples not dominated by any other tuple. Disadvantage: high dimensionality problem. Solution: relax the notion of dominance. k-dominant tuple: a tuple t k-dominates another tuple t’ if t is not worse than t’ in at least k dimensions and t is better than t’ in at least one of them. k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]
k-dominant Skyline Query - Example: T1 4-dominates T3, T1 5-dominates T4, T1 dominates T5. Conventional skyline: {T1, T2, T3, T4}. 5-dominant skyline: {T1, T2, T3}. 4-dominant skyline: {T1, T2}. The smaller k is, the fewer tuples the k-dominant skyline contains. [Table: tuples T1 to T5 over 6 dimensions]
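The k-dominance definition can be tried out with a brute-force sketch. The attribute values below are hypothetical: the slide's own values are not recoverable, so these were chosen only so that the three result sets match the ones listed. Smaller values are assumed to be better.

```python
def k_dominates(a, b, k):
    """a k-dominates b: not worse in at least k dimensions, better in one of them."""
    not_worse = [i for i in range(len(a)) if a[i] <= b[i]]
    return len(not_worse) >= k and any(a[i] < b[i] for i in not_worse)


def k_dominant_skyline(points, k):
    """Names of all points not k-dominated by any other point."""
    return {
        name
        for name, p in points.items()
        if not any(k_dominates(q, p, k) for other, q in points.items() if other != name)
    }


# Hypothetical 6-dimensional tuples (illustrative values only).
points = {
    "T1": (1, 1, 1, 1, 1, 1),
    "T2": (0, 0, 0, 5, 5, 5),
    "T3": (2, 2, 2, 2, 0.5, 0.5),  # 4-dominated by T1
    "T4": (3, 3, 3, 3, 3, 0),      # 5-dominated by T1
    "T5": (4, 4, 4, 4, 4, 4),      # dominated by T1 in all 6 dimensions
}

for k in (6, 5, 4):
    print(k, sorted(k_dominant_skyline(points, k)))
```

With k equal to the number of dimensions, the function returns the conventional skyline.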
Observations: Traditional or streaming skyline methods are inappropriate because skyline properties do not hold, e.g. the transitive property: k-dominance can be cyclic. Existence of multiple users and multiple queries: cnN supports multiple queries.
Method CoSMuQ (1): A query on D dimensions arrives. Given a parameter value k, split the query into subqueries of d=k dimensions. Compute the conventional skyline of each subquery. The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
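The decomposition into subqueries can be sketched as a static computation. This is not the CoSMuQ implementation (which is incremental and shares work across queries); it only illustrates that intersecting the conventional skylines of all k-dimensional subspaces yields the k-dominant skyline. The points are hypothetical, smaller values are assumed better, and a, b, c are chosen to 2-dominate each other cyclically.

```python
from itertools import combinations


def dominates(a, b):
    # Conventional dominance; assumption: smaller values are better.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def subspace_skyline(points, dims):
    """Conventional skyline of the points projected onto `dims`."""
    proj = {name: tuple(p[i] for i in dims) for name, p in points.items()}
    return {
        name
        for name, p in proj.items()
        if not any(dominates(q, p) for other, q in proj.items() if other != name)
    }


def k_dominant_skyline(points, k):
    """Intersect the skylines of all subqueries on k of the D dimensions."""
    d = len(next(iter(points.values())))
    result = set(points)
    for dims in combinations(range(d), k):
        result &= subspace_skyline(points, dims)
    return result


# Hypothetical 3-dimensional tuples: a 2-dominates b, b 2-dominates c,
# c 2-dominates a (a cycle), and d 2-dominates a, b and c.
points = {"a": (1, 2, 3), "b": (2, 3, 1), "c": (3, 1, 2), "d": (0, 0, 9)}

print(sorted(k_dominant_skyline(points, 2)))
print(sorted(k_dominant_skyline(points, 3)))
```

The number of subqueries is the binomial coefficient C(D, k), which is one way to see why memory requirements grow in high dimensionality.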
Method CoSMuQ (2): Advantages: based on conventional skyline (simple domination checks); properties of conventional skylines can be used; exploits the overlap between different queries. Disadvantage: memory requirements increase in high dimensionality.
Continuous Skyline Methods - Summary
Method | Query Type | Window Type | Multiple Queries
LookOut | skyline | time | no
Lazy and Eager | skyline | both |
cnN | n-of-N | count | yes
Filter and Sampling | frequent skyline | |
CoSMuQ | k-dominant skyline | |
Presentation Layout (current section: Continuous top-k queries): Preliminaries; Continuous skyline queries; Continuous top-k queries; Continuous top-k dominating queries; Summary
Top-k Query - Example: Given a preference function, a top-k query returns the k tuples with the best scores. Hotels (price, distance): T1 4 1; T2 3 2; T3 0.5; T4 2.5 4.5; T5 1.5; T6 3.5 5. Preference function: F = price + distance. [Figure: hotels T1 to T6 in the price-distance plane with the results for k=1 and k=2]
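A top-k query with a user-defined preference function is a short computation once the function is fixed. The sketch below uses F = price + distance with smaller scores treated as better, and hypothetical hotels rather than the table above.

```python
import heapq


def top_k(points, k, score):
    """Return the names of the k points with the best (smallest) scores."""
    return heapq.nsmallest(k, points, key=lambda name: score(points[name]))


def f(p):
    price, distance = p
    return price + distance


# Hypothetical (price, distance) pairs, for illustration only.
hotels = {
    "A": (1.0, 5.0),  # score 6.0
    "B": (2.0, 3.0),  # score 5.0
    "C": (4.0, 1.5),  # score 5.5
    "D": (3.0, 4.0),  # score 7.0
}

print(top_k(hotels, 1, f))
print(top_k(hotels, 2, f))
```

A different preference function generally gives a different answer, which is why the choice of F is left to the user.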
Continuous Top-k Query: Problem definition: continuous evaluation of a top-k query over multidimensional streaming time series. Application example: network data. Top-100 flows with the largest individual throughput. Common destination: DDoS (distributed denial of service) attack.
Basic Idea: A new tuple changes the top-k only if it belongs to the influence region of the query. Top-k tuple expiration: query computation from scratch. TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]. Advantage: simple implementation. Disadvantage: no efficient handling of an expired top-k tuple. [Figure: influence region in the (x1, x2) plane, bounded by the line F = score(tk) = x1 + x2] source: sigmod06
Skyband - Example: k-skyband: contains all the tuples that are dominated by at most k–1 other tuples. The 1-skyband (tuples not dominated by other tuples) is the skyline. 2-skyband: tuples dominated by at most 1 other tuple. [Figure: tuples A to E; a tuple dominated by 2 other tuples belongs to the 3-skyband]
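The k-skyband definition can be checked by counting dominators. The points below are hypothetical and smaller values are assumed to be better.

```python
def dominates(a, b):
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def k_skyband(points, k):
    """Names of the points dominated by at most k - 1 other points."""
    return {
        name
        for name, p in points.items()
        if sum(dominates(q, p) for other, q in points.items() if other != name) < k
    }


# Hypothetical 2-dimensional points, for illustration only.
points = {
    "A": (1.0, 4.0),  # dominated by nobody
    "B": (3.0, 1.0),  # dominated by nobody
    "C": (2.0, 5.0),  # dominated by A
    "D": (4.0, 4.5),  # dominated by A and B
    "E": (5.0, 6.0),  # dominated by A, B, C and D
}

for k in (1, 2, 3):
    print(k, sorted(k_skyband(points, k)))
```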
Skyband Approach (1): Transform tuples into the (score, expiration_time) space. With F = price + distance the scores are: T1 5, T2 5, T3 3.5, T4 7, T5 5.5, T6 8.5. Dominance counter (DC): number of tuples that are younger and better. Rule: keep tuples with DC < k. Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space. [Figure: tuples T1 to T6 in the original (price, distance) space and in the transformed (score, exp_time) space, annotated with their DC values and the top-1 result]
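The rule "keep tuples with DC < k" can be illustrated with a small sketch. The scores reuse the values above, but the expiration times are hypothetical, smaller scores are assumed better, and no new tuples arrive during the simulation. This is a static check of the observation, not the SMA algorithm itself.

```python
def dominance_counters(tuples):
    """DC of each tuple: how many tuples are both better and younger.

    `tuples` maps name -> (score, exp_time). Assumptions: a smaller score
    is better, a larger exp_time means the tuple is younger.
    """
    return {
        name: sum(
            1
            for other, (s2, e2) in tuples.items()
            if other != name and s2 < s and e2 > e
        )
        for name, (s, e) in tuples.items()
    }


def top_k_at(tuples, k, now):
    """Top-k among the tuples that have not expired at time `now`."""
    alive = [name for name, (_, e) in tuples.items() if e > now]
    return set(sorted(alive, key=lambda name: tuples[name][0])[:k])


# Scores from F = price + distance; expiration times are hypothetical.
tuples = {
    "T1": (5.0, 16), "T2": (4.0, 11), "T3": (3.5, 15),
    "T4": (7.0, 12), "T5": (5.5, 14), "T6": (8.5, 13),
}
k = 2
dc = dominance_counters(tuples)
skyband = {name for name, c in dc.items() if c < k}
print(dc)
print(sorted(skyband))
```

A tuple with DC ≥ k always has k better tuples that outlive it, so it can never enter the top-k; `top_k_at` returns a subset of `skyband` at every time instant.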
Skyband Approach (2): SMA (Skyband Monitoring Algorithm), proposed in [Mouratidis, SIGMOD06]. Advantage: independent of the dimensionality, since it works in the 2-dimensional (score, exp_time) space. Disadvantage: the k-skyband may contain fewer than k tuples; in this case, a top-k tuple expiration causes query computation from scratch.
Distributed Top-k Continuously report the k largest values obtained from distributed data streams. Objective is to minimize communication cost Proposed by [Babcock, SIGMOD03]
Streaming Model: Nodes $N_1, N_2, \dots, N_m$ and a coordinator node $N_0$. A set of n data objects $O_1, O_2, \dots, O_n$ associated with real values $V_1, V_2, \dots, V_n$. Value updates are represented as tuples $\langle O_i, N_j, \Delta \rangle$: node $N_j$ detects a change $\Delta$ in the value $V_i$ of $O_i$. The change is not seen by the other nodes $N_k$ ($k \neq j$). The value $V_i$ of an object $O_i$ is $V_i = \sum_j V_{i,j}$, where $V_{i,j}$ is the value of the i-th object at the j-th node.
Method (1): Initialize a top-k set at the coordinator node. Set arithmetic constraints at the monitor nodes; they depend on the current top-k set. While the constraints are valid: no communication. When a constraint is invalidated: the client communicates with the server, a new top-k set is possibly computed, and the constraints are recomputed.
Method (2) - Adjustment Factors (AF): For each node $N_j$ and object $O_i$, associate an adjustment factor $\delta_{i,j}$. Constraints are evaluated after adding the adjustment factors: if $O_t \in T$ and $O_s \in U - T$, where T is the current top-k set and U the set of all objects, then $V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$. To keep the results valid, the adjustment factors of each object sum to zero: $\sum_j \delta_{i,j} = 0$. If the local top-k sets are similar to the global top-k, the communication cost is low; if they differ, unnecessary constraint violations increase the communication cost. Disadvantage: energy consumption is not uniform. [Figure: two-node example with Top-1 = {O1}; values V1,1 = 1, V2,1 = 9, V1,2 = 3, V2,2 = 1; adjustment factors 0, -3, 3; adjusted values at Node 2: 3+0 = 3 and 1+3 = 4; local Top-1 = {O1} at Node 1 and {O2} at Node 2]
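The role of the adjustment factors can be illustrated with a small check of the local constraints. The numbers below are hypothetical (they are not the values of the slide's figure), the function names are made up for this sketch, and only the monitor nodes are modeled.

```python
def local_constraints_hold(values, adjust, top):
    """Check, at every node, that each adjusted top-k value is at least
    as large as every adjusted non-top-k value.

    `values` and `adjust` map node -> {object: number}; `top` is the
    current global top-k set.
    """
    for node, local in values.items():
        adjusted = {obj: v + adjust[node][obj] for obj, v in local.items()}
        worst_top = min(adjusted[o] for o in top)
        rest = [adjusted[o] for o in adjusted if o not in top]
        if rest and worst_top < max(rest):
            return False
    return True


def factors_sum_to_zero(adjust):
    """The adjustment factors of each object must sum to zero over the nodes."""
    objects = next(iter(adjust.values())).keys()
    return all(sum(node[o] for node in adjust.values()) == 0 for o in objects)


# Hypothetical partial values: globally O1 = 9 + 1 = 10 and O2 = 1 + 3 = 4,
# so the top-1 set is {O1}, but Node 2 locally sees O2 ahead of O1.
values = {"N1": {"O1": 9, "O2": 1}, "N2": {"O1": 1, "O2": 3}}
top = {"O1"}
no_adjust = {"N1": {"O1": 0, "O2": 0}, "N2": {"O1": 0, "O2": 0}}
adjust = {"N1": {"O1": -3, "O2": 0}, "N2": {"O1": 3, "O2": 0}}

print(local_constraints_hold(values, no_adjust, top))  # Node 2 is violated
print(local_constraints_hold(values, adjust, top))     # slack moved to Node 2
```

Because the factors of each object cancel out across nodes, satisfying every local constraint implies that the global order between top-k and non-top-k objects is preserved.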
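Uncertain Data: 4 tuples, 16 possible worlds. To compute the top-k probability of a tuple (e.g. tuple 6), sum the probabilities of the possible worlds in which the tuple is in the top-k. Pk-topk query: returns the k tuples with the highest probability of being in the top-k. Top-2: {6, 5} with probabilities {0.64, 0.5}. source: pvldb08
Score | Prob.
6 | 0.8
5 | 0.5
2 | 0.4
8 | 0.4
Possible worlds (tuples: probability): {2,5,6,8}: .064; {2,5,6}: .096; {2,6,8}: .064; {2,6}: .096; {2,5,8}: .016; {2,5}: .024; {2,8}: .016; {2}: .024; {5,6,8}: .096; {5,6}: .144; {5,8}: .024; {5}: .036; {6,8}: .096; {6}: .144; {8}: .024; empty: .036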
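The possible-worlds computation can be reproduced by brute force. Two points in the sketch are assumptions: the probability 0.4 for tuple 8 is inferred from the listed world probabilities (0.064 = 0.4 · 0.5 · 0.8 · p), and a smaller score is taken to rank higher, which is the reading that reproduces the 0.64 reported for tuple 6. Tuples are assumed to exist independently. Enumerating all worlds is exponential and serves only as a check, not as the compact-set method.

```python
from itertools import product

# Existence probability per tuple, keyed by score.
probs = {2: 0.4, 5: 0.5, 6: 0.8, 8: 0.4}


def topk_probabilities(probs, k):
    """Probability of each tuple being in the top-k, summed over all worlds."""
    items = sorted(probs)  # assumption: smaller score ranks higher
    result = dict.fromkeys(items, 0.0)
    for present in product([False, True], repeat=len(items)):
        # Probability of this possible world under independence.
        p = 1.0
        for t, here in zip(items, present):
            p *= probs[t] if here else 1.0 - probs[t]
        world = [t for t, here in zip(items, present) if here]
        for t in world[:k]:  # the k best tuples that exist in this world
            result[t] += p
    return result


def pk_topk(probs, k):
    """The k tuples with the highest probability of being in the top-k."""
    tk = topk_probabilities(probs, k)
    return sorted(tk, key=lambda t: -tk[t])[:k]


print(topk_probabilities(probs, 2))
print(pk_topk(probs, 2))
```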
Pk-topk Query: Solution proposed by [Jin, PVLDB08]: compact set based, space-efficient. It discards unnecessary tuples and applies several compression schemes to compress the data. Disadvantage (model assumption): the existence of each tuple is assumed to be random and independent of the other tuples.
Continuous Top-k Methods - Summary
Method | Query Type | Window Type | Multiple Queries
TMA and SMA | top-k | both | yes
Distributed top-k | distributed top-k | time | no
Compact set based | Pk-topk | |
Presentation Layout (current section: Continuous top-k dominating queries): Preliminaries; Continuous skyline queries; Continuous top-k queries; Continuous top-k dominating queries; Summary
Top-k Dominating Query - Example: Skyline: contains all the tuples not dominated by any other tuple. Disadvantage: high dimensionality problem. Top-k: given a preference function, a top-k query returns the k tuples with the best scores. Disadvantage: user-defined preference function. Top-k dominating: the answer contains the k tuples with the highest domination power. It combines the advantages of skyline and top-k queries and avoids their disadvantages. Hotels (price, distance): T1 4 1; T2 3 2; T3 0.5; T4 2.5 4.5; T5 1.5; T6 3.5 5. F = price + distance. [Figure: hotels T1 to T6 in the price-distance plane with the results for k=1 and k=2]
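A top-k dominating query needs no preference function: each tuple is scored by the number of tuples it dominates. The sketch below is a brute-force version with hypothetical points, smaller values assumed better, and ties broken by name.

```python
def dominates(a, b):
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def domination_scores(points):
    """Score of a point = number of other points it dominates."""
    return {
        name: sum(dominates(p, q) for other, q in points.items() if other != name)
        for name, p in points.items()
    }


def top_k_dominating(points, k):
    scores = domination_scores(points)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(points, key=lambda name: (-scores[name], name))[:k]


# Hypothetical 2-dimensional points, for illustration only.
points = {
    "A": (1.0, 4.0),  # dominates C, D, E
    "B": (3.0, 1.0),  # dominates D, E
    "C": (2.0, 5.0),  # dominates E
    "D": (4.0, 4.5),  # dominates E
    "E": (5.0, 6.0),  # dominates nothing
}

print(domination_scores(points))
print(top_k_dominating(points, 2))
```

Unlike the skyline, the answer size is fixed by k, and unlike top-k, the ranking does not depend on a user-defined function.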
Continuous Top-k Dominating Query Problem definition: Continuous evaluation of top-k dominating query in multidimensional streaming time series. Application Example: sensor network Areas with high probability of fire outbreak Temperature, humidity and wind speed
EVA: Objective: reduce domination checks. Safe interval of a tuple: the tuple can be ignored during this interval; it depends on the tuple's score and the k-th score. The end of a safe interval generates an event. On an event: try to compute a new safe interval, else compute the score from scratch. On a new tuple: find another tuple that dominates the new one and estimate a lower bound of the safe interval.
ADA: Advanced computation of the safe interval: it depends on the number of tuples that dominate this tuple and expire later. Candidate tuples: tuples with scores close to the k-th score are updated at each time instance. EVA and ADA were proposed by [Kontaki 2009].
Presentation Layout (current section: Summary): Preliminaries; Continuous skyline queries; Continuous top-k queries; Continuous top-k dominating queries; Summary
Summary Preference queries are very useful in data streams Presented state-of-the-art methods For continuous skyline queries For continuous top-k queries For continuous top-k dominating queries Examined advantages and disadvantages of the proposed methods
Research Directions: Continuous subspace skyline queries. Solutions appropriate for distributed environments (uniform energy consumption). Approximate algorithms. Existence of multiple queries.
Thank you