Probabilistic n-of-N Skyline Computation over Uncertain Data Streams

Wenjie Zhang, University of New South Wales
Joint work with Aiping Li (NUDT), Ying Zhang, Muhammad Aamir Cheema, and Lijun Chang (UNSW)

Outline
Overview, Algorithms, Experiment, Conclusion
(Read word by word.)

Overview --- skyline
Skyline computation plays a vital role in daily life. Example criteria: smaller screen (lighter), higher CPU speed, lower price.

Overview
Example criteria: higher star, shorter distance to airport, lower price.

Overview --- skyline
Skyline: candidates for the best options in multi-criteria decision applications.
Given an n-dimensional numeric space D = (D1, …, Dn), a user preference ≻ is defined on each dimension.
For two points u and v, u dominates v (u ≻ v) if ∀ Di (1 ≤ i ≤ n), u.Di ⪰ v.Di, and ∃ Dj (1 ≤ j ≤ n), u.Dj ≻ v.Dj.
Skyline: the points not dominated by any other point.
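The dominance test and the skyline definition can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes that smaller values are preferred on every dimension, and the helper names `dominates` and `skyline` as well as the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u is at least as good as v on every dimension and strictly
    better on at least one (assumption: smaller is better everywhere)."""
    at_least_as_good = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return at_least_as_good and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Brute-force skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(o, p) for o in points)]


# Hypothetical items described as (weight in kg, price).
items = [(1.2, 1500.0), (2.0, 900.0), (2.5, 1600.0), (1.2, 1700.0)]
result = skyline(items)
print(result)
```

The brute-force scan costs O(m²) dominance checks for m points; it is only meant to make the definition concrete.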

Overview --- uncertainty exists

Overview --- streaming
Streaming environment: online trading systems, stock management, financial markets, real estate monitoring, ……

Overview --- a conceptual example
[Figure: elements 1 to 6 with occurrence probabilities P(1) = 0.1, P(2) = 0.1, P(3) = 0.4, P(4) = 0.8, P(5) = 0.1, P(6) = 0.5; sliding window N = 5.]
Elements continuously arrive with occurrence probabilities.
Problem: how to continuously compute skylines in a sliding window of size N (elements)?
This is a simple example of an uncertain data stream. For instance, if we keep only the most recent 5 elements, then when the 6th element arrives, the 1st one expires. The challenge is how to update the skyline results efficiently under such sliding-window updates.
(Animation: 1 to 5.)

Overview --- n-of-N
Different users may have different window sizes. How can different values of N be supported?
n-of-N model [ICDE 2005, Lin et al]: supports any window size n as long as n ≤ N.
This work: n-of-N skyline over uncertain streams.

Related work
Two areas: probabilistic skyline computation and uncertain stream processing.
Probabilistic stream skyline (ICDE 09)
Probabilistic skyline (VLDB07)
Probabilistic reverse skyline (SIGMOD08)
Probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07)
Frequent items on uncertain streams (SIGMOD08)
Top-k queries over uncertain sliding window (VLDB08)
… …
Skyline computation and streaming computation have been extensively studied. In VLDB 07, the probabilistic skyline was first investigated by Jian Pei and our group. We adopt their model in our paper. In SIGMOD 08, Lei Chen's group studied …; probabilistic aggregates and sketches were studied in SIGMOD 07 by Cormode and his colleagues, and in SODA 07 and PODS 07. Although there is some work on …, this is the first work on …

Models and Problem Definition
Model: DS is a stream of elements; each element a lies in a d-dimensional space and has an occurrence probability P(a) in (0, 1].
The skyline probability of an element a is Psky(a) = P(a) × ∏ (1 − P(a′)), where the product is taken over the elements a′ that dominate a.
Problem definition: retrieve the elements among the most recent n (n ≤ N) elements whose skyline probability is no less than a given threshold q.
We assume that the data stream is append-only. The model can be extended to time-based windows.
The problem is easy on certain data. (Example: eBay, good offers.) This is how people model the uncertain skyline. In our problem, every object has only one instance. In a conventional database, dominance is a certain relationship, but on uncertain data a dominance relationship holds only with some probability. If an element A is dominated by B and B has a very small occurrence probability, then this relationship is very weak. So there are two parts to the skyline probability of an element: the first is its occurrence probability; the second is ….
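To make the problem definition concrete, here is a naive baseline in Python that recomputes every skyline probability from scratch for the most recent n elements. It is a sketch under stated assumptions, not the algorithm of the talk: smaller values are assumed to be preferred on every dimension, each element is a hypothetical `(point, occurrence_probability)` pair, and the function names and data are made up for illustration.

```python
from typing import List, Sequence, Tuple

Element = Tuple[Tuple[float, ...], float]  # (point, occurrence probability)


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Assumption: smaller is better on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def skyline_probability(i: int, window: List[Element]) -> float:
    """P(a) times the probability that no dominating element of a occurs."""
    point, prob = window[i]
    result = prob
    for j, (other, other_prob) in enumerate(window):
        if j != i and dominates(other, point):
            result *= 1.0 - other_prob
    return result


def naive_n_of_N(stream: List[Element], n: int, q: float) -> List[int]:
    """Positions (0-based, arrival order) of the elements among the most
    recent n whose skyline probability is no less than q."""
    start = max(0, len(stream) - n)
    window = stream[start:]
    return [start + i for i in range(len(window))
            if skyline_probability(i, window) >= q]


# Hypothetical stream in arrival order.
stream = [((1.0, 1.0), 0.3), ((2.0, 2.0), 0.9),
          ((3.0, 1.5), 0.8), ((0.5, 3.0), 0.6)]
print(naive_n_of_N(stream, n=4, q=0.5))
print(naive_n_of_N(stream, n=2, q=0.5))
```

Every query here costs O(n²) dominance checks, which is the kind of recomputation the talk aims to avoid.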

Challenges and Contributions
Space efficiency: N can be too large to fit in memory.
Space reduction: from O(N) to O(ln^(d-1) N) on average, i.e., from linear to poly-logarithmic.
Time efficiency: the elements in the sliding window change continuously; naively recomputing with each change is cost-prohibitive.
There are two main challenges. The first is space efficiency. In stream computation we always try to avoid keeping all the information; the challenge is then, if you don't keep all the information, how can you ensure that the probability you calculate is correct? The space requirement of our algorithm is poly-logarithmic … The second challenge is time efficiency. We developed a very efficient incremental technique.

Outline
Overview, Algorithms, Experiment, Conclusion
(Read word by word.)

Framework: what to keep ?
Pold(2) = 1 − P(1)
Pnew(2) = (1 − P(3)) × (1 − P(4))
[Figure: elements 1 to 5 with occurrence probabilities P(1) = 0.1, P(2) = 0.1, P(3) = 0.4, P(4) = 0.8, P(5) = 0.1; window size N = 5; probability threshold q = 0.5.]
The skyline probability can be divided into three parts: the occurrence probability, the probability that no older element dominates this element, and the probability that no newer element dominates it. That is, the probability that an element a is not dominated by other elements splits into two parts: it is not dominated by elements older than it, and it is not dominated by elements newer than it. We call these Pold and Pnew, respectively. For instance, the Pold value of element 2 is 1 minus the occurrence probability of element 1, which is older than 2, and its Pnew value is the product of the non-occurrence probabilities of elements 3 and 4. The intuition is that an element can never be a skyline point if its Pnew value is smaller than the threshold, since it expires earlier than the elements newer than it. So we propose to keep only the elements with Pnew ≥ q as a candidate set.
(Animation for Pold and Pnew, together with the example of element 2.)
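The split into Pold and Pnew, and the rule of keeping only elements with Pnew ≥ q, can be sketched as follows. This is an illustrative brute-force version rather than the presenters' data structure. The coordinates are hypothetical and were chosen so that elements 1, 3, and 4 dominate element 2 (smaller values assumed better); the occurrence probabilities follow the slide's example; positions in the code are 0-based, so element 2 is position 1.

```python
from typing import List, Sequence, Tuple

Element = Tuple[Tuple[float, ...], float]  # (point, occurrence probability)


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Assumption: smaller is better on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def p_old(i: int, stream: List[Element]) -> float:
    """Probability that no OLDER element (earlier position) dominates element i."""
    result = 1.0
    for other, other_prob in stream[:i]:
        if dominates(other, stream[i][0]):
            result *= 1.0 - other_prob
    return result


def p_new(i: int, stream: List[Element]) -> float:
    """Probability that no NEWER element (later position) dominates element i."""
    result = 1.0
    for other, other_prob in stream[i + 1:]:
        if dominates(other, stream[i][0]):
            result *= 1.0 - other_prob
    return result


def candidate_set(stream: List[Element], q: float) -> List[int]:
    """Positions of the elements whose Pnew is at least q."""
    return [i for i in range(len(stream)) if p_new(i, stream) >= q]


# Hypothetical coordinates; probabilities as in the slide's example.
stream = [((2.0, 2.0), 0.1),   # element 1
          ((4.0, 4.0), 0.1),   # element 2
          ((3.0, 1.0), 0.4),   # element 3
          ((1.0, 3.0), 0.8),   # element 4
          ((5.0, 0.5), 0.1)]   # element 5
print(p_old(1, stream), p_new(1, stream))
print(candidate_set(stream, q=0.5))
```

Element 2 is dropped because its Pnew is about 0.12, below the threshold, while element 1 stays in the candidate set even though its own occurrence probability is only 0.1.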

Framework: what to keep ?
Candidate set SN,q [ICDE09, Zhang et al]
Correctness: (1) no missing skyline points; (2) no false hits when determining SN,q; (3) no false positives when determining skyline results; (4) no false negatives when determining skyline results. The probability computed from SN,q may not be accurate, but it satisfies the threshold requirement.
Here we keep only SN,q, which is much smaller than the whole sliding window. As mentioned before, the challenge is how to ensure that the skyline probability computed against this candidate set gives the right solution. We can prove that the calculation based on SN,q gives the correct skyline result: although the skyline probabilities are not precise for some elements in the candidate set, they are correctly computed for skyline points. We proved correctness based on the following four facts.
Every skyline point is in SN,q.
Pnew computed from SN,q is the same as Pnew computed from the whole stream.
If Psky computed from SN,q is < q, it is also < q when computed from the whole stream.
The probability computed against SN,q does not always equal the probability computed from the whole sliding window, but there are no false negatives: if Psky|SN,q > q, then Psky|SN,q = Psky.
(Backup example: why elements can be deleted.)

Space of Candidate Set
Theorem: in the average case under uniform distributions, the candidate set requires poly-logarithmic space, O(f(q) ln^(d-1) N).
This problem is similar to the k-skyband problem, and f(q) gives the k value of the skyband. The mathematical details are not covered here; refer to the paper.

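Result Set
[Figure: elements a1 to a10 and their critical dominance relations.]
Psky(a8) = (1 − P(a2)) × (1 − P(a3)) × (1 − P(a5)) × (1 − P(a10)) × P(a8), where the factors for the older elements a2, a3, and a5 form Pold and the factor for the newer element a10 forms Pnew.
a3 critically dominates a8: a8 is a probabilistic skyline point for any recent n elements where n < 7 (10 − 3).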
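One way to read the critical dominance idea is that each element has a contiguous range of window sizes n for which it is a result, so an n-of-N query reduces to checking that range. The Python sketch below works under that interpretation, which is an assumption on my part: the critical dominator is taken to be the youngest older dominator whose inclusion pushes the skyline probability below q. It scans the whole stream by brute force, without the candidate set or R-tree of the talk, and all names and data are hypothetical (smaller values assumed better, positions 0-based).

```python
from typing import List, Optional, Sequence, Tuple

Element = Tuple[Tuple[float, ...], float]  # (point, occurrence probability)


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Assumption: smaller is better on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def valid_n_range(i: int, stream: List[Element], q: float) -> Optional[Tuple[int, int]]:
    """Range (n_min, n_max) of window sizes for which element i has skyline
    probability >= q among the most recent n elements, or None if it never does."""
    m = len(stream)
    point, prob = stream[i]
    p = prob
    for other, other_prob in stream[i + 1:]:      # newer elements: always in the window
        if dominates(other, point):
            p *= 1.0 - other_prob
    if p < q:
        return None
    n_min, n_max = m - i, m                       # smallest window that contains i
    for j in range(i - 1, -1, -1):                # older elements, youngest first
        other, other_prob = stream[j]
        if dominates(other, point):
            p *= 1.0 - other_prob
            if p < q:                             # j acts as the critical dominator
                n_max = m - j - 1                 # largest window that excludes j
                break
    return (n_min, n_max)


def query(ranges: List[Optional[Tuple[int, int]]], n: int) -> List[int]:
    """Answer an n-of-N query from precomputed ranges."""
    return [i for i, r in enumerate(ranges) if r is not None and r[0] <= n <= r[1]]


# Hypothetical stream in arrival order, threshold q = 0.5.
stream = [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.9),
          ((3.0, 1.5), 0.8), ((0.5, 3.0), 0.6)]
ranges = [valid_n_range(i, stream, 0.5) for i in range(len(stream))]
print(ranges)
print(query(ranges, 4), query(ranges, 3))
```

The ranges are computed once; each query afterwards is a range check per element, and an interval index could replace the linear scan.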

Algorithm
R-tree based indexing.
Use a dominance check technique to quickly identify critical dominance relations.
Update as the window slides: insert, delete.

Experiment
Data sets:
Real: stock transactions; 2-d; probabilities assigned randomly; size: 2 million.
Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2d to 5d; size: 2 million.
Default values: p: 0.3; d: 3; N: 1M; spatial distribution: anti-correlated; probability: uniform.
Objects are independent.

Experiment Algorithms
q-sky: the algorithm in [ICDE09, Zhang et al] for keeping the candidate set. For an n-of-N query, it naively checks each element in the candidate set.
pnN: our processing algorithm, which utilizes the critical dominance relation.
pmnN: our algorithm for continuously maintaining the data structure that supports pnN.

Experiment - pnN

Experiment - scalability

Experiment

Conclusion and Future Work
Probabilistic skyline in data streams following the n-of-N model.
Future work: computation sharing; a more general uncertain model.

Thanks !

Framework
Space required for SN,q:
SN,q is the minimum information that must be maintained to get a correct answer.
[Figure: elements 1 to 4 with occurrence probabilities P(1) = 0.3, P(2) = 0.4, P(3) = 0.9, P(4) = 0.8; window size N = 4; probability threshold q = 0.5.]
With elements 1 and 2 kept: Psky(3) = 0.9 × (1 − 0.4) × (1 − 0.3) = 0.378 < q. With elements 1 and 2 deleted: Psky(3) = 0.9 > q.
It is a weak minimum: an element in SN,q may not be a result, but not keeping SN,q gives incorrect results. For instance, the Pnew values of elements 1 and 2 are > q, but since their occurrence probabilities are < q they will never become skyline points. However, we still need to keep them; otherwise the skyline probability of other elements, such as 3, cannot be computed correctly.
(Animation: Psky for 3 before and after deletion of 1 and 2.)
