Continuous Processing of Preference Queries in Data Streams : a Survey


1 Continuous Processing of Preference Queries in Data Streams : a Survey
M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos Data Engineering Lab Department of Informatics Aristotle University of Thessaloniki

2 Presentation Layout Preliminaries Continuous skyline queries
Continuous top-k queries Continuous top-k dominating queries Summary

3 Presentation Layout Preliminaries Continuous skyline queries
Continuous top-k queries Continuous top-k dominating queries Summary

4 Data Streams Data Stream is an infinite sequence of objects.
Each object can be one-dimensional or multi-dimensional. Streaming Time Series are finite sequences of objects. Streaming Time Series change over time. The arrival rate of objects usually varies.

5 Sliding Window Model (1)
Count-based window: the sliding window contains the W most recent tuples (“active”). Older tuples expire.
[Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
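A minimal Python sketch of a count-based window, assuming tuples arrive one at a time. The class name `CountWindow` and its `insert` method are illustrative and do not come from the slides.

```python
from collections import deque

class CountWindow:
    """Keeps the W most recent tuples; older ones expire automatically."""

    def __init__(self, w):
        self.active = deque(maxlen=w)

    def insert(self, item):
        # Return the tuple that expires because of this arrival, if any.
        full = len(self.active) == self.active.maxlen
        expired = self.active[0] if full else None
        self.active.append(item)  # deque(maxlen=w) drops the oldest entry
        return expired

window = CountWindow(5)
expired = []
for i in range(1, 9):
    gone = window.insert(f"t{i}")
    if gone is not None:
        expired.append(gone)

print(list(window.active))  # ['t4', 't5', 't6', 't7', 't8']
print(expired)              # ['t1', 't2', 't3']
```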

6 Sliding Window Model (2)
Time-based window: the sliding window contains the tuples (“active”) of the W most recent timestamps. Older tuples expire.
[Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
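A comparable sketch for the time-based window, assuming integer timestamps that never decrease. `TimeWindow` and the sample arrivals are hypothetical. Unlike the count-based case, the number of active tuples varies with the arrival rate.

```python
from collections import deque

class TimeWindow:
    """Keeps the tuples whose timestamp is among the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.active = deque()  # (timestamp, tuple) pairs in arrival order

    def insert(self, timestamp, item):
        self.active.append((timestamp, item))
        return self.expire(timestamp)

    def expire(self, now):
        # Timestamps now-W+1 .. now are active; anything older expires.
        expired = []
        while self.active and self.active[0][0] <= now - self.w:
            expired.append(self.active.popleft()[1])
        return expired

tw = TimeWindow(5)
arrivals = [(1, "a"), (1, "b"), (3, "c"), (6, "d"), (7, "e")]
gone = []
for ts, item in arrivals:
    gone += tw.insert(ts, item)

print(gone)                             # ['a', 'b']
print([item for _, item in tw.active])  # ['c', 'd', 'e']
```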

7 Database System
[Figure: database system diagram with the labels User / Application, Query, Result, Input]

8 Continuous Evaluation in a Data Stream System
[Figure: data stream system diagram with the labels User / Application, Query, Result, Query processor]

9 Motivation (1) Numerous data stream contexts Financial data analysis
Network management Astronomical data analysis Sensor network Telecommunication data management

10 Motivation (2)
Preference queries: a useful decision support tool, with many applications in data streams.
Example 1 (telecommunication data): report the clients with the maximum call time and the maximum number of calls.
Example 2 (stock-market data): report the products with the maximum price, the minimum sales and the minimum number of buyers.
Query types shown: continuous skyline query, continuous top-k dominating query.

11 Presentation Layout Continuous skyline queries Preliminaries
Continuous top-k queries Continuous top-k dominating queries Conclusions

12 Skyline Query
Hotels (price, distance): T1 (4, 1), T2 (3, 2), T3 (0.5; one value missing), T4 (2.5, 4.5), T5 (1.5; one value missing), T6 (3.5, 5).
Dominant tuple: a tuple t dominates another tuple t’ if t is not worse than t’ in all dimensions, and t is better than t’ in at least one dimension.
Skyline: contains all the tuples not dominated by any other tuple.
[Figure: hotels T1 to T6 plotted by price and distance]
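The two definitions translate directly into a naive quadratic computation. The sketch below assumes that smaller values are better in every dimension. The hotel values are hypothetical, not the ones from the slide.

```python
def dominates(t, u):
    """True if t is not worse than u in every dimension and better in at least one.
    Assumption: smaller values are better in all dimensions."""
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))

def skyline(tuples):
    """Naive O(n^2) skyline: keep the tuples that no other tuple dominates."""
    return {
        name
        for name, t in tuples.items()
        if not any(dominates(u, t) for other, u in tuples.items() if other != name)
    }

# Hypothetical (price, distance) values.
hotels = {
    "A": (4.0, 1.0), "B": (3.0, 2.0), "C": (1.0, 3.0),
    "D": (2.5, 4.5), "E": (1.5, 4.0), "F": (3.5, 5.0),
}
result = skyline(hotels)
print(sorted(result))  # ['A', 'B', 'C']
```

D, E and F are each dominated by at least one other hotel (for instance C dominates D and E, and B dominates F), so only A, B and C remain.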

13 Continuous Skyline Query
Problem definition: continuously evaluate a skyline query over multidimensional streaming time series.
Application example (network data): computers with suspicious behavior, based on network traffic, number of connections and number of destinations.

14 Basic Idea
The skyline changes due to the insertion of a new skyline tuple or the expiration of a skyline tuple.
LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]: use of a spatial index.
Advantage: simple implementation.
Disadvantage: the expiration of a skyline tuple is not handled efficiently.

15 Event Approach (1)
An existing skyline tuple expires: how can we find the new skyline tuples? This is a very costly operation.
Skyline influence time (SIT): the minimum time at which a tuple may become a skyline tuple.
Generate events based on SIT. Event: examine the tuples with that influence time.

16 Event Approach (2)
Eager [Tao, TKDE06]
Advantage: handles skyline expiration.
Disadvantage: processing time per tuple.
[Figure: tuples labeled with arrival times, B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12), with W=10. Tuple K can be discarded due to tuple L (younger and better). K.SIT=19]

17 n-of-N Skyline Queries (1)
n-of-N definition: the skyline of the n most recent tuples, for any n ≤ N.
Example: S6 = {a,c}, S4 = {c,g}. Source: icde05

18 n-of-N Skyline Queries (2)
n-of-N definition S6 = {c,h} S4 = {e,h} source: icde05

19 Method cnN (1)
Method cnN [Lin, ICDE05] is also based on events.
Tuple K is redundant because tuple L is better and younger than K.
Tuple L is dominated by D and E. The dominance relation between L and E is critical because E is the youngest tuple which dominates L.
[Figure: tuples labeled with arrival times, A(1) to L(12), with W=10]
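The redundancy rule can be sketched as a pruning step on arrival: a newcomer is younger than every stored tuple, so any stored tuple it dominates can never re-enter the skyline and may be dropped. This is only an illustration of the rule, not the cnN data structure. The function name, the stream and the "smaller is better" convention are assumptions.

```python
def dominates(t, u):
    # Assumption: smaller is better in every dimension.
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))

def insert_and_prune(kept, name, values):
    """kept: list of (name, values) in arrival order (oldest first).
    Drops every older tuple dominated by the newcomer, then appends the newcomer."""
    kept[:] = [(n, v) for n, v in kept if not dominates(values, v)]
    kept.append((name, values))

kept = []
# Hypothetical stream: K arrives before L, and L dominates K.
# D dominates L, but L is kept because it outlives D.
stream = [("D", (1.0, 4.0)), ("K", (5.0, 5.0)), ("L", (4.0, 4.5))]
for name, values in stream:
    insert_and_prune(kept, name, values)

print([n for n, _ in kept])  # ['D', 'L']
```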

20 Method cnN (2)
The dominance graph contains all the critical dominance relations.
Generate intervals: for the skyline tuples, e.g. C = (0,3]; for the critical dominance relations, e.g. C -> G = (3,7].
Use an interval tree to store them.
[Figure: dominance graph over A(1), B(2), C(3), D(4), E(5), F(6), G(7), marking redundant tuples and critical dominance relations]

21 Method cnN (3)
A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M-n+1, where M is the total number of elements seen so far.
To answer an n-of-N query, apply a stabbing query at M-n+1.
Example with M = 7. Intervals: C = (0,3], D = (0,4], C -> G = (3,7], D -> E = (4,5], D -> F = (4,6].
For n = 4, M-n+1 = 4 and S4 = {D, G}. For n = 6, M-n+1 = 2 and S6 = {C, D}.
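The stabbing query can be checked against the intervals listed on the slide. The sketch below uses a linear scan as a stand-in for the interval tree. It assumes that the interval of a critical dominance relation X -> Y reports the dominated tuple Y, which is consistent with the slide's answers.

```python
# Intervals from the slide as (low, high, reported_tuple), meaning (low, high].
# Assumption: for a critical dominance relation X -> Y the interval reports Y.
intervals = [
    (0, 3, "C"),  # C = (0,3]
    (0, 4, "D"),  # D = (0,4]
    (3, 7, "G"),  # C -> G = (3,7]
    (4, 5, "E"),  # D -> E = (4,5]
    (4, 6, "F"),  # D -> F = (4,6]
]

def n_of_N_skyline(intervals, M, n):
    """Linear-scan stand-in for the interval-tree stabbing query at M - n + 1."""
    point = M - n + 1
    return {name for low, high, name in intervals if low < point <= high}

M = 7
s4 = n_of_N_skyline(intervals, M, 4)  # stabbing point 4
s6 = n_of_N_skyline(intervals, M, 6)  # stabbing point 2
print(sorted(s4), sorted(s6))  # ['D', 'G'] ['C', 'D']
```

An interval tree answers the same stabbing query in logarithmic time plus the output size, which is why the slides store the intervals in one.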

22 Method cnN (3)
Advantages: good use of skyline properties (removes redundant tuples); multiple query processing (the graph is built for several user queries).
Disadvantages: processing time per tuple; increased memory requirements (because of the necessary graph).

23 Frequent Skyline - Motivation
Highly dynamic environment The skyline results are meaningful only if the skyline tuples appear consistently Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]

24 Streaming Model Client/Server architecture
Server receives object updates from the clients. Each object can be represented as a d-dimensional point.
Object update (point movement in the d-dimensional space): the value in at least one dimension changes.
Object insertion or deletion: point movement from/to a nonexistent position.
Objective: minimization of communication cost.

25 Filter (safe region technique)
The skyline remains unchanged if each object stays in its safe region. Communication happens only when the safe region is violated. The safe region approach leads to communication optimization.
[Figure: an object as a point and its filter (safe region). Source: sigmod09]

26 Sampling All clients report their skyline at the same sampled time
The clients are synchronized with the same random seed Guaranteed quality if sampling rate is high enough

27 Hybrid
Hybrid solution: combines Filter and Sampling. Small changes: apply Filter. Larger changes: apply Sampling.
Disadvantage of all three methods: energy consumption is not uniform (critical in sensor networks).

28 k-dominant Skyline Query - Motivation
Skyline: contains tuples not dominated by any other tuple. Disadvantage: high dimensionality problem. Solution: relax the notion of dominance.
k-dominant tuple: a tuple t k-dominates another tuple t’ if t is not worse than t’ in at least k dimensions and t is better than t’ in at least one of them.
k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08].
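A small sketch of the k-dominance test, assuming smaller values are better. The three sample tuples are hypothetical and chosen so that 2-dominance forms a cycle, which empties the 2-dominant skyline although no tuple dominates another in the conventional sense.

```python
def k_dominates(t, u, k):
    """True if t is not worse than u in at least k dimensions and strictly
    better in at least one of them. Assumption: smaller values are better.
    A strictly better dimension is also a not-worse dimension, so counting
    the not-worse dimensions is enough."""
    not_worse = sum(a <= b for a, b in zip(t, u))
    better = any(a < b for a, b in zip(t, u))
    return not_worse >= k and better

def k_dominant_skyline(tuples, k):
    return {
        name
        for name, t in tuples.items()
        if not any(k_dominates(u, t, k) for other, u in tuples.items() if other != name)
    }

# Hypothetical 3-dimensional tuples: a 2-dominates b, b 2-dominates c, c 2-dominates a.
data = {"a": (1, 2, 3), "b": (2, 3, 1), "c": (3, 1, 2)}
print(sorted(k_dominant_skyline(data, 3)))  # ['a', 'b', 'c'] (conventional skyline)
print(sorted(k_dominant_skyline(data, 2)))  # []
```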

29 k-dominant Skyline Query - Example
T1 4-dominates T3. T1 5-dominates T4. T1 dominates T5.
Conventional skyline: {T1, T2, T3, T4}. 5-dominant skyline: {T1, T2, T3}. 4-dominant skyline: {T1, T2}.
The smaller the k, the fewer tuples in the k-dominant skyline.
[Figure: table of values for tuples T1 to T5, highlighting the domination]

30 Observations
Traditional or streaming skyline methods are inappropriate: skyline properties do not hold (e.g. the transitive property), and k-dominance can be cyclic.
Existence of multiple users and multiple queries: cnN supports multiple queries.

31 Method CoSMuQ (1)
A query on D dimensions arrives.
Given a parameter value k, split the query into subqueries of d=k dimensions.
Compute the conventional skyline of each subquery.
The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
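The intersection idea can be sketched in a static setting as follows. This is not the CoSMuQ implementation: it assumes the subqueries are all k-sized subsets of the query dimensions, that smaller values are better, and it ignores the streaming and multi-query machinery. The data is hypothetical.

```python
from itertools import combinations

def dominates(t, u, dims):
    # Conventional dominance restricted to the dimensions in dims (smaller is better).
    return all(t[d] <= u[d] for d in dims) and any(t[d] < u[d] for d in dims)

def subspace_skyline(tuples, dims):
    return {
        n
        for n, t in tuples.items()
        if not any(dominates(u, t, dims) for m, u in tuples.items() if m != n)
    }

def k_dominant_skyline(tuples, query_dims, k):
    """Intersect the conventional skylines of all k-dimensional subqueries."""
    result = set(tuples)
    for dims in combinations(query_dims, k):
        result &= subspace_skyline(tuples, dims)
    return result

# Hypothetical 3-dimensional tuples.
data = {"p": (1, 1, 5), "q": (2, 2, 1), "r": (3, 3, 3)}
print(sorted(k_dominant_skyline(data, (0, 1, 2), 3)))  # ['p', 'q']
print(sorted(k_dominant_skyline(data, (0, 1, 2), 2)))  # ['p']
```

The number of subqueries grows as the binomial coefficient of D over k, which matches the memory concern the slides raise for high dimensionality.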

32 Method CoSMuQ (2)
Advantages: based on the conventional skyline (simple domination checks); properties of conventional skylines can be used; exploits the overlap between different queries.
Disadvantages: memory requirements increase in high dimensionality.

33 Continuous Skyline methods - Summary
Columns: Method, Query Type, Window Type, Multiple Queries (some cells are missing).
LookOut: skyline, time, no
Lazy and Eager: both
n-of-N: count, yes
Filter and Sampling: frequent skyline
CoSMuQ: k-dominant skyline

34 Presentation Layout Continuous top-k queries
Data streams - Preliminaries Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Summary

35 Top-k query - Example
Hotels (price, distance): T1 (4, 1), T2 (3, 2), T3 (0.5; one value missing), T4 (2.5, 4.5), T5 (1.5; one value missing), T6 (3.5, 5).
Given a preference function, a top-k query returns the k tuples with the best scores.
[Figure: hotels T1 to T6 plotted by price and distance, with F=price+distance and the results for k=1 and k=2]
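A short sketch of a top-k query with the preference function F = price + distance. It uses only the four hotels whose two values are both legible on the slide and assumes that a lower score is better. The function name `top_k` is illustrative.

```python
import heapq

def top_k(tuples, k, score):
    """Return the names of the k tuples with the best (here: lowest) score."""
    return heapq.nsmallest(k, tuples, key=lambda name: score(tuples[name]))

# (price, distance) for the hotels with both values legible on the slide.
hotels = {"T1": (4, 1), "T2": (3, 2), "T4": (2.5, 4.5), "T6": (3.5, 5)}

def F(t):
    # Preference function from the slide: F = price + distance.
    return t[0] + t[1]

scores = {name: F(t) for name, t in hotels.items()}
best2 = top_k(hotels, 2, F)
print(scores)         # {'T1': 5, 'T2': 5, 'T4': 7.0, 'T6': 8.5}
print(sorted(best2))  # ['T1', 'T2']
```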

36 Continuous Top-k Query
Problem definition: continuous evaluation of a top-k query over multidimensional streaming time series.
Application example (network data): top-100 flows with the largest individual throughput. Common destination: DDoS (distributed denial of service) attack.

37 Basic Idea
A new tuple changes the top-k: it should belong in the influence region of the query.
Top-k tuple expiration: query computation from scratch.
TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]. Advantage: simple implementation. Disadvantage: no efficient handling of an expired top-k tuple.
[Figure: influence region in the (x1, x2) plane, bounded by the line F = score(tk) = x1 + x2 through tk. Source: sigmod06]

38 Skyband - Example
k-skyband: contains all the tuples which are dominated by at most k-1 other tuples.
1-skyband (tuples not dominated by other tuples): this is the skyline.
2-skyband: tuples dominated by at most 1 other tuple.
3-skyband: also includes the tuples dominated by 2 other tuples.
[Figure: points A, B, C, D, E and their skyband membership]
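The k-skyband definition can be sketched by counting dominators. The points below are hypothetical (they are not the A to E of the figure), and smaller values are assumed to be better.

```python
def dominates(t, u):
    # Assumption: smaller is better in every dimension.
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))

def k_skyband(tuples, k):
    """Tuples dominated by at most k-1 other tuples."""
    band = set()
    for name, t in tuples.items():
        dominators = sum(dominates(u, t) for other, u in tuples.items() if other != name)
        if dominators <= k - 1:
            band.add(name)
    return band

# Hypothetical 2-dimensional points.
points = {"A": (1, 4), "B": (3, 1), "C": (2, 5), "D": (4, 4), "E": (5, 6)}
print(sorted(k_skyband(points, 1)))  # ['A', 'B'] (the skyline)
print(sorted(k_skyband(points, 2)))  # ['A', 'B', 'C']
print(sorted(k_skyband(points, 3)))  # ['A', 'B', 'C', 'D']
```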

39 Skyband Approach (1)
Transform tuples into the (score, expiration_time) space.
Dominance counter (DC): number of tuples that are younger and better.
Rule: keep tuples with DC < k.
Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space.
Scores with F=price+distance: T1 5, T2 5, T3 3.5, T4 7, T5 5.5, T6 8.5.
[Figure: tuples T1 to T6 in the original (price, distance) space and in the transformed (score, exp_time) space, annotated with their DC values and the top-1 tuple]
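The dominance-counter rule can be sketched as follows. The sketch assumes a window in which later arrivals expire later, and that a lower score is better. The function `arrive` and the stream are hypothetical, and expiration handling is left out.

```python
def arrive(kept, name, score, k):
    """kept: list of [name, score, dc] in arrival order (oldest first).
    The newcomer is younger than every kept tuple, so it raises the dominance
    counter of each kept tuple with a worse score. Assumption: lower is better."""
    survivors = []
    for entry in kept:
        if score < entry[1]:
            entry[2] += 1
        if entry[2] < k:  # rule from the slide: keep tuples with DC < k
            survivors.append(entry)
    survivors.append([name, score, 0])
    kept[:] = survivors

kept = []
# Hypothetical stream of (name, score) pairs, with k = 2.
for name, score in [("a", 5.0), ("b", 7.0), ("c", 3.5), ("d", 5.5)]:
    arrive(kept, name, score, k=2)

print([(n, dc) for n, _, dc in kept])  # [('a', 1), ('c', 0), ('d', 0)]
```

Tuple b is dropped once two younger tuples (c and d) have better scores: it can no longer appear in any top-2 result before it expires.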

40 Skyband Approach (2)
SMA (Skyband Monitoring Algorithm), proposed in [Mouratidis, SIGMOD06].
Advantage: independent of the dimensionality, since it works in the 2-dimensional (score, exp_time) space.
Disadvantage: the k-skyband may contain fewer than k tuples. In this case, a top-k tuple expiration will cause query computation from scratch.

41 Distributed Top-k Continuously report the k largest values obtained from distributed data streams. Objective is to minimize communication cost Proposed by [Babcock, SIGMOD03]

42 Streaming Model
Nodes: $N_1, N_2, \ldots, N_m$; coordinator node: $N_0$.
Set of n data objects $O_1, O_2, \ldots, O_n$ associated with real values $V_1, V_2, \ldots, V_n$.
Value updates are represented as $\langle O_i, N_j, \Delta \rangle$ tuples: $N_j$ detects a change $\Delta$ in the value $V_i$ of $O_i$. The change is not seen by other nodes $N_k$ ($k \neq j$).
The value $V_i$ for an object $O_i$: $V_i = \sum_j V_{i,j}$, where $V_{i,j}$ is the value of the i-th object in the j-th node.

43 Method (1)
Initialize a top-k set at the coordinator node.
Set arithmetic constraints at the monitor nodes; they depend on the current top-k set.
Constraints valid => no communication.
Constraints invalidated => the client communicates with the server: possibly a new top-k set and recomputation of the constraints.

44 Method (2) - Adjustment Factors
Adjustment factors (AF): for each node $N_j$ and object $O_i$, associate an adjustment factor $\delta_{i,j}$.
Constraints are evaluated after adding the adjustment factors: if $O_t \in T$ and $O_s \in U - T$, then $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$, where T is the current top-k set and U the set of all objects.
To keep the results valid, the adjustment factors for each object sum to zero: $\sum_j \delta_{i,j} = 0$. This ensures the sum remains valid.
Local top-k similar to global => low communication cost. Local top-ks that differ from the global top-k => unnecessary constraint violations => increased communication cost.
Disadvantage: energy consumption is not uniform.
[Figure: example with Object 1 and Object 2 at Node 1 and Node 2; labels V1,1 = 1, V2,1 = 9, V1,2 = 3, V2,2 = 1; adjustment factors 0, -3, 3; Top-1 = {O1}; Node 1 local Top-1 = {O1}; Node 2 local Top-1 = {O2}; adjusted values at Node 2: 3+0 = 3 and 1+3 = 4]
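A sketch of the constraint check with adjustment factors. The numbers are illustrative and only loosely modeled on the slide's example, whose exact value assignment is not recoverable from the transcript. The function name and data layout are assumptions, and the coordinator's own factor is omitted for brevity.

```python
def constraints_hold(values, factors, top):
    """values[i][j]: partial value of object i at node j.
    factors[i][j]: adjustment factor of object i at node j.
    Checks that the factors of each object sum to zero and that, at every node,
    each top-k object is at least as large as each other object after adjustment."""
    objects = list(values)
    nodes = range(len(next(iter(values.values()))))
    if any(abs(sum(factors[i])) > 1e-9 for i in objects):
        return False
    return all(
        values[t][j] + factors[t][j] >= values[s][j] + factors[s][j]
        for j in nodes
        for t in top
        for s in objects
        if s not in top
    )

# Hypothetical data: two objects, two nodes. Global values: O1 = 10, O2 = 4.
values = {"O1": [9, 1], "O2": [1, 3]}
no_factors = {"O1": [0, 0], "O2": [0, 0]}
factors = {"O1": [-3, 3], "O2": [0, 0]}

print(constraints_hold(values, no_factors, {"O1"}))  # False: node 2 sees 1 < 3
print(constraints_hold(values, factors, {"O1"}))     # True: 6 >= 1 and 4 >= 3
```

Because the factors of each object sum to zero, adding the per-node inequalities over all nodes gives $V_t \geq V_s$ for the global values, so valid local constraints imply a valid global top-k.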

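45 Uncertain Data
Tuples (score, probability): (6, 0.8), (5, 0.5), (2, 0.4), (8, probability missing). These tuples give 16 possible worlds.
Possible worlds and their probabilities, where legible: {2, 5, 6, 8} .064; {2, 5, 6} .096; {2, 6, 8}; {2, 6}; {2, 5, 8} .016; {2, 5} .024; {2, 8}; {2}; {5, 6, 8}; {5, 6} .144; {5, 8}; {5} .036; {6, 8}; {6}; {8}; empty.
Compute the probability of 6: sum the probabilities of the worlds in which it is a top-k tuple.
Pk-topk query: returns the k tuples most likely to be in the top-k.
Top-2: {6, 5} with prob. {0.64, 0.5}. Source: pvldb08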
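The possible-worlds computation can be reproduced by brute-force enumeration. Two inputs in the sketch are inferred rather than read from the slide: the probability 0.4 for the tuple with score 8 (it follows from the world {2, 5, 6, 8} having probability .064), and the ranking direction (a lower score ranks higher, which reproduces the quoted 0.64 and 0.5). Tuple existence is treated as independent.

```python
from itertools import product

# (score, probability) pairs. The 0.4 for score 8 is inferred:
# .064 = 0.4 * 0.5 * 0.8 * 0.4 for the world {2, 5, 6, 8}.
tuples = [(6, 0.8), (5, 0.5), (2, 0.4), (8, 0.4)]

def topk_probabilities(tuples, k):
    """Enumerate all possible worlds and, for each tuple, sum the probability
    of the worlds in which it is among the k best.
    Assumption: a lower score ranks higher."""
    prob_in_topk = {score: 0.0 for score, _ in tuples}
    for world in product([True, False], repeat=len(tuples)):
        p = 1.0
        present = []
        for exists, (score, prob) in zip(world, tuples):
            p *= prob if exists else 1.0 - prob
            if exists:
                present.append(score)
        for score in sorted(present)[:k]:
            prob_in_topk[score] += p
    return prob_in_topk

probs = topk_probabilities(tuples, 2)
best = sorted(probs, key=probs.get, reverse=True)[:2]
print(best)  # [6, 5]
```

Enumeration is exponential in the number of tuples, which is the motivation for more compact techniques such as the compact-set approach on the next slide.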

46 Pk-topk Query
Solution proposed by [Jin, PVLDB08]: compact set based, space-efficient. It discards unnecessary tuples and applies several compression schemes to compress the data.
Disadvantage (model assumption): the existence of each tuple is assumed to be random and independent of the other tuples.

47 Continuous Top-k Methods -Summary
Columns: Method, Query Type, Window Type, Multiple Queries (some cells are missing).
TMA and SMA: top-k, both, yes
Distributed top-k: distributed top-k, time, no
Compact set based: Pk-topk

48 Presentation Layout Continuous top-k dominating queries Preliminaries
Continuous skyline queries Continuous top-k queries Continuous top-k dominating queries Summary

49 Top-k Dominating Query - Example
Hotels (price, distance): T1 (4, 1), T2 (3, 2), T3 (0.5; one value missing), T4 (2.5, 4.5), T5 (1.5; one value missing), T6 (3.5, 5).
Skyline: contains all the tuples not dominated by any other tuple. Disadvantage: high dimensionality problem.
Top-k: given a preference function, a top-k query returns the k tuples with the best scores. Disadvantage: user-defined preference function.
Top-k dominating: the answer contains the k tuples with the highest domination power. It combines the advantages of skyline and top-k queries and avoids their disadvantages.
[Figure: hotels T1 to T6 plotted by price and distance, with F=price+distance and the results for k=1 and k=2]
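A static sketch of a top-k dominating query, taking the domination power of a tuple to be the number of other tuples it dominates. The hotel values are hypothetical, and smaller values are assumed to be better.

```python
def dominates(t, u):
    # Assumption: smaller is better in every dimension.
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))

def top_k_dominating(tuples, k):
    """Rank tuples by domination power: how many other tuples each one dominates."""
    power = {
        name: sum(dominates(t, u) for other, u in tuples.items() if other != name)
        for name, t in tuples.items()
    }
    ranked = sorted(power, key=lambda name: -power[name])
    return ranked[:k], power

# Hypothetical (price, distance) values.
hotels = {
    "A": (4.0, 1.0), "B": (3.0, 2.0), "C": (1.0, 3.0),
    "D": (2.5, 4.5), "E": (1.5, 4.0), "F": (3.5, 5.0),
}
top2, power = top_k_dominating(hotels, 2)
print(top2)   # ['C', 'E']
print(power)  # {'A': 0, 'B': 1, 'C': 3, 'D': 1, 'E': 2, 'F': 0}
```

E is dominated by C and so lies outside the skyline, yet it ranks second by domination power, while the skyline tuple A dominates nothing. No preference function is needed, and the result size is fixed at k.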

50 Continuous Top-k Dominating Query
Problem definition: continuous evaluation of a top-k dominating query over multidimensional streaming time series.
Application example (sensor network): areas with high probability of fire outbreak, based on temperature, humidity and wind speed.

51 EVA
Objective: reduce domination checks.
Safe interval of a tuple: ignore the tuple for this interval. It depends on its score and the k-th score.
End of safe interval -> event. Event: try to compute a new safe interval, else compute the score from scratch.
New tuple: find another tuple that dominates the new one and estimate a lower bound of the safe interval.

52 ADA
Advanced computation of the safe interval: it depends on the number of tuples that dominate this tuple and expire later.
Candidate tuples: tuples with scores close to the k-th score are updated at each time instance.
EVA and ADA were proposed by [Kontaki 2009].

53 Presentation Layout Summary Preliminaries Continuous skyline queries
Continuous top-k queries Continuous top-k dominating queries Summary

54 Summary
Preference queries are very useful in data streams.
Presented state-of-the-art methods for continuous skyline queries, continuous top-k queries and continuous top-k dominating queries.
Examined advantages and disadvantages of the proposed methods.

55 Research Directions
Continuous subspace skyline queries.
Solutions appropriate for distributed environments (uniform energy consumption).
Approximate algorithms.
Existence of multiple queries.

56 Thank you

