Published by Arielle Heasley. Modified over 9 years ago.
1
Jiang Chen (Columbia University), Ke Yi (HKUST)
2
Motivation
Uncertain data naturally arises in many applications: sensor data, fuzzy data integration, data cleaning, etc.
Items are associated with a "confidence":
- they may or may not be true
- they may or may not exist
This is a very hot topic in the database community.
3
Motivation
Each item has a score and a probability, for example (sensor reading, reliability) or (page rank, how well the page matches the query).

item  score  probability
t3    100    0.2
t5    87     0.8
t4    80     0.9
t1    65     0.5
t2    30     0.6

The top-k answer depends on the interplay between score and confidence.
4
Problem Definition [Soliman et al. 07]
Return the k items with the maximum probability of being the top-k.

tuple  score  confidence
t3     100    0.2
t5     87     0.8
t4     80     0.9
t1     65     0.5
t2     30     0.6

{t3, t5}: 0.2*0.8 = 0.16
{t3, t4}: 0.2*(1-0.8)*0.9 = 0.036
{t5, t4}: (1-0.2)*0.8*0.9 = 0.576
...
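The definition can be checked by brute force on the five-tuple example. The sketch below enumerates every possible world, assuming the tuples exist independently (the slide's products suggest this, but it is an assumption here). The function name `topk_set_probabilities` is made up for illustration, and the approach is exponential in the number of tuples, so it only serves as a sanity check.

```python
from itertools import product


def topk_set_probabilities(items, k):
    """Map each possible top-k tuple set to its total probability.

    items: list of (name, probability) in decreasing score order.
    Assumes tuples exist independently of one another.
    """
    result = {}
    for world in product([True, False], repeat=len(items)):
        # Probability of this exact combination of present/absent tuples.
        prob = 1.0
        for (_, p), present in zip(items, world):
            prob *= p if present else 1.0 - p
        present_names = [name for (name, _), here in zip(items, world) if here]
        if len(present_names) >= k:
            # The top-k of a world is its k highest-scoring present tuples.
            key = tuple(present_names[:k])
            result[key] = result.get(key, 0.0) + prob
    return result


items = [("t3", 0.2), ("t5", 0.8), ("t4", 0.9), ("t1", 0.5), ("t2", 0.6)]
answers = topk_set_probabilities(items, 2)
best = max(answers, key=answers.get)
print(best, answers[best])
```

On this input the most probable top-2 set is (t5, t4), matching the slide's 0.576 up to floating-point rounding.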
5
One-Time Computation
Assume items are already sorted by score:

item         t1   t2   t3   t4   t5   t6   t7   t8   ...
probability  0.2  0.8  0.7  0.2  0.1  1    0.1  0.8  ...

Consider the i-th item ti.
Question: Among t1, ..., ti, which k items have the maximum probability of appearing while the rest do not appear?
Answer: The k items with the largest probabilities.
Example: {t2, t5} being the top-2 means t2 and t5 appear and t1, t3, t4 do not appear.
We just need to answer the question for all i.
Time: O(n log k)
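A minimal sketch of this scan, assuming independent items and every probability in (0, 1] so that the division below is safe. It keeps the k largest probabilities seen so far in a min-heap and maintains two running products, which gives O(n log k) overall. The name `best_topk` and the returned (probability, prefix index) pair are choices made for this illustration.

```python
import heapq


def best_topk(probs, k):
    """Scan prefixes t1..ti and return (best probability, 0-based index i).

    probs: probabilities in decreasing score order, each assumed in (0, 1].
    For every prefix, the candidate is "the k largest-probability items
    appear and all other items of the prefix do not".
    """
    heap = []        # min-heap holding the k largest probabilities so far
    prod_in = 1.0    # product of p over items in the heap
    prod_out = 1.0   # product of (1 - p) over items outside the heap
    best_prob, best_i = 0.0, None
    for i, p in enumerate(probs):
        if len(heap) < k:
            heapq.heappush(heap, p)
            prod_in *= p
        elif p > heap[0]:
            # The new item displaces the smallest probability in the heap.
            evicted = heapq.heapreplace(heap, p)
            prod_in = prod_in / evicted * p
            prod_out *= 1.0 - evicted
        else:
            prod_out *= 1.0 - p
        if len(heap) == k and prod_in * prod_out > best_prob:
            best_prob, best_i = prod_in * prod_out, i
    return best_prob, best_i


print(best_topk([0.2, 0.8, 0.9, 0.5, 0.6], 2))
```

The winning set itself can be recovered afterwards as the k largest probabilities among the first `best_i + 1` items.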
6
The Data Structure Problem
Build a data structure that supports:
Query
- Given j, return the top-j answer
Update
- Insert an item
- Delete an item
- Update the probability of an item
Construction
7
Our Results
A data structure of size O(n)
Query: O(log n + j)
- Given j, return the top-j answer, j = 1, ..., k
Update: O(k log n) (better than paper)
- Insert an item
- Delete an item
- Update the probability of an item
Construction: O(n log k) (better than paper)
8
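Overall Structure
[Figure: a tree node u with children v and w.]
- $\rho_j^u$: top-j probability at u
- $\varphi_{j'}^v$: probability for the j' largest-probability items in v
- $\rho_{j-j'}^w$: top-(j-j') probability at w

$\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \varphi_{j'}^v \, \rho_{j-j'}^w\}, \quad j = 1, \dots, k$

Each leaf has between k and 2k items.
Top-j query: O(log n + j)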
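The recurrence can be evaluated directly in O(k^2) per node, which is a naive baseline rather than the O(k) update the later slides obtain. The sketch below assumes v precedes w in score order, stores each child's values in plain lists, and treats entries a child does not have as 0. The name `merge_rho` and the list layout (index 0 of each `rho` list unused) are assumptions for this illustration.

```python
def merge_rho(rho_v, phi_v, rho_w, k):
    """Naive O(k^2) evaluation of rho_j^u for j = 1..k.

    rho_v[j], rho_w[j]: top-j probabilities of the children (index 0 unused).
    phi_v[j']: probability that the j' largest-probability items of v appear
               and all other items of v do not.
    Entries beyond a list's length are treated as 0 (no such answer exists).
    """
    def get(values, index):
        return values[index] if 0 <= index < len(values) else 0.0

    rho_u = [0.0] * (k + 1)
    for j in range(1, k + 1):
        best = get(rho_v, j)                       # answer lies entirely in v
        for j_prime in range(j):                   # j' items from v, j - j' from w
            best = max(best, get(phi_v, j_prime) * get(rho_w, j - j_prime))
        rho_u[j] = best
    return rho_u


# v holds probabilities (0.2, 0.8); w holds (0.9).
rho_u = merge_rho([0.0, 0.64, 0.16], [0.16, 0.64, 0.16], [0.0, 0.9], 2)
print(rho_u[1], rho_u[2])
```

With these inputs the merged top-2 probability is about 0.576, the same value as the (t5, t4) answer in the earlier example.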
9
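Update an Internal Node
$\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \varphi_{j'}^v \, \rho_{j-j'}^w\}, \quad j = 1, \dots, k$

The candidate products form a triangular matrix with one row per j:

j = 1:  $\varphi_0^v \rho_1^w$
j = 2:  $\varphi_0^v \rho_2^w$,  $\varphi_1^v \rho_1^w$
j = 3:  $\varphi_0^v \rho_3^w$,  $\varphi_1^v \rho_2^w$,  $\varphi_2^v \rho_1^w$
...
j = k:  $\varphi_0^v \rho_k^w$,  $\varphi_1^v \rho_{k-1}^w$,  $\varphi_2^v \rho_{k-2}^w$,  ...,  $\varphi_{k-1}^v \rho_1^w$

Monotone: the last item of the top-(j+1) answer can't be in front of the last item of the top-j answer.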
10
Total Monotonicity
A matrix is totally monotone if all its sub-matrices are monotone.
It is enough to check all 2x2 sub-matrices:

A B
C D

with the comparisons A > B and C > D.
For a k*k totally monotone matrix, the SMAWK algorithm [Aggarwal et al. 87] can find all row maxima in time O(k).
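To show how monotonicity cuts down the search for row maxima, here is a simpler divide-and-conquer routine. It is not SMAWK: it runs in O((rows + cols) log rows) instead of linear time. It assumes one common convention, namely that the column of each row's leftmost maximum never decreases as the row index grows; the transcript does not pin down the direction, so treat that as an assumption. The name `monotone_row_argmax` and the `entry(r, c)` callback are invented for this sketch.

```python
def monotone_row_argmax(n_rows, n_cols, entry):
    """Return, for each row, the column index of its leftmost maximum.

    entry(r, c) returns the matrix value at row r, column c.
    Assumes the leftmost-argmax column is nondecreasing in the row index.
    """
    result = [0] * n_rows

    def solve(r_lo, r_hi, c_lo, c_hi):
        if r_lo > r_hi:
            return
        mid = (r_lo + r_hi) // 2
        # Scan the allowed column range for the middle row.
        best = c_lo
        for c in range(c_lo, c_hi + 1):
            if entry(mid, c) > entry(mid, best):
                best = c
        result[mid] = best
        # Rows above can only peak at or left of `best`, rows below at or right.
        solve(r_lo, mid - 1, c_lo, best)
        solve(mid + 1, r_hi, best, c_hi)

    solve(0, n_rows - 1, 0, n_cols - 1)
    return result


print(monotone_row_argmax(4, 4, lambda r, c: -(c - r) ** 2))
```

Because entries are requested through a callback, the matrix never has to be materialized, which is the same access pattern SMAWK relies on.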
11
Total Monotonicity
Lemma: The matrix $(\varphi_{j'}^v \, \rho_{j-j'}^w)$, with rows j = 1, ..., k and columns j' = 0, ..., j-1, is totally monotone.
An internal node can be updated in time O(k).
12
Update (Recompute) a Leaf
Goal: Compute $\rho_j$, j = 1, …, n, where n = Θ(k)
Define
$\varphi_{j,i} = p(e_{1,i}) \cdot p(e_{2,i}) \cdots p(e_{j,i}) \cdot (1 - p(e_{j+1,i})) \cdot (1 - p(e_{j+2,i})) \cdots (1 - p(e_{i,i}))$
where $e_{1,i}, \dots, e_{i,i}$ are the first i items sorted by decreasing probability.
$\rho_j = \max_{1 \le i \le n} \varphi_{j,i}$
Compute the row maxima of the k*n matrix $(\varphi_{j,i})$!
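For reference, the definition can be evaluated literally. The sketch below re-sorts each prefix and multiplies out every entry, so it is far slower than the O(k log k) bound the slides reach, and it only computes entries with j ≤ i. The name `leaf_rho` and the convention that index 0 of the result is unused are assumptions of this illustration.

```python
def leaf_rho(probs, k):
    """Compute rho_j for j = 1..k by direct evaluation of phi_{j,i}.

    probs: item probabilities of the leaf in decreasing score order.
    Returns a list rho where rho[j] is the largest phi_{j,i}; rho[0] is unused.
    """
    rho = [0.0] * (k + 1)
    for i in range(1, len(probs) + 1):
        # e_{1,i}, ..., e_{i,i}: the first i items by decreasing probability.
        e = sorted(probs[:i], reverse=True)
        for j in range(1, min(i, k) + 1):
            phi = 1.0
            for position, p in enumerate(e):
                # The j largest appear; the remaining i - j do not.
                phi *= p if position < j else 1.0 - p
            rho[j] = max(rho[j], phi)
    return rho


rho = leaf_rho([0.2, 0.8, 0.9, 0.5, 0.6], 2)
print(rho[1], rho[2])
```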
13
Total Monotonicity, Again
Lemma: The k*n matrix $(\varphi_{j,i})$ is totally monotone.
Are we done yet?
The SMAWK algorithm probes O(k) entries of the matrix $(\varphi_{j,i})$, but we still need to retrieve
$\varphi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1 - p(e_{j+1,i})) \cdots (1 - p(e_{i,i}))$
on demand.
14
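Retrieve φ j,i
Rewrite
$\varphi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1 - p(e_{j+1,i})) \cdots (1 - p(e_{i,i}))$
$= \frac{p(e_{1,i})}{1 - p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1 - p(e_{j,i})} \cdot (1 - p(e_{1,i})) \cdots (1 - p(e_{i,i}))$
$= \frac{p(e_{1,i})}{1 - p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1 - p(e_{j,i})} \cdot (1 - p(t_1)) \cdots (1 - p(t_i))$
The last factor, $(1 - p(t_1)) \cdots (1 - p(t_i))$, can be pre-computed for all i in time O(k).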
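A small numerical check of the rewrite, assuming every probability is strictly below 1 so that the odds p / (1 - p) are defined. The helper names are invented for this sketch, and `phi_factored` still sorts the prefix on each call, which is the step the slides go on to replace with a persistent search structure.

```python
from math import prod


def prefix_miss_products(probs):
    """out[i] = (1 - p(t_1)) * ... * (1 - p(t_i)), with out[0] = 1."""
    out = [1.0]
    for p in probs:
        out.append(out[-1] * (1.0 - p))
    return out


def phi_direct(probs, j, i):
    """phi_{j,i} from the definition: j largest appear, the rest do not."""
    e = sorted(probs[:i], reverse=True)
    return prod(e[:j]) * prod(1.0 - p for p in e[j:])


def phi_factored(probs, j, i, prefix_miss):
    """phi_{j,i} as (product of the j largest odds) * (prefix product)."""
    e = sorted(probs[:i], reverse=True)
    odds = prod(p / (1.0 - p) for p in e[:j])   # requires p < 1
    return odds * prefix_miss[i]


probs = [0.2, 0.8, 0.9, 0.5, 0.6]
prefix = prefix_miss_products(probs)
print(phi_direct(probs, 2, 3), phi_factored(probs, 2, 3, prefix))
```

Both calls agree up to floating-point rounding, since the prefix product does not depend on the sorted order.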
15
Retrieve φ j,i
Focus on the product $\frac{p(e_{1,i})}{1 - p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1 - p(e_{j,i})}$.
[Figure: a search structure over $e_{1,i}, \dots, e_{6,i}$, and the same structure over $e_{1,i+1}, \dots, e_{7,i+1}$ after one more item is inserted.]
To support all i, make the structure partially persistent.
Insertion: O(log k)
Query: O(log k)
16
Update (Recompute) a Leaf
Goal: Compute $\rho_j$, j = 1, …, n, where n = Θ(k)
$\rho_j = \max_{1 \le i \le n} \varphi_{j,i}$
Compute the row maxima of the k*n matrix $(\varphi_{j,i})$!
The SMAWK algorithm probes O(k) entries $\varphi_{j,i}$.
Using a persistent (2,3)-tree:
- Construction: O(k log k)
- Query: O(log k)
Total time for a leaf: O(k log k)
17
Summary
Update (recompute) an internal node: O(k)
- There are O(log n) such nodes
Update (recompute) a leaf node: O(k log k)
Total update time: O(k log n)
Insertions and deletions can be handled using standard techniques (rebalancing).
Construction time: O(n log k)
- Construction is as efficient as the one-time computation
18
Final Remarks
Conjecture: Ω(k) is a lower bound for the update time.
Other top-k definitions?
- For each item, compute its probability of being one of the top-k
- Return the k items with the largest such probabilities
k-nearest neighbors in uncertain geometric data
- Each point has a pdf
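The alternative definition mentioned above can be illustrated with a short dynamic program. This is a sketch under the assumption that items exist independently: an item is in the top-k exactly when it appears and fewer than k higher-scoring items appear. The name `topk_membership_probs` is made up for this example, and the O(nk) method is a standard textbook approach, not something taken from the slides.

```python
def topk_membership_probs(probs, k):
    """For each item, the probability that it is one of the top-k.

    probs: probabilities in decreasing score order; items assumed independent.
    """
    # dist[c] = probability that exactly c higher-scoring items appear,
    # tracked only for c < k (larger counts can never help an item).
    dist = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # The item is in the top-k if it appears and fewer than k precede it.
        result.append(p * sum(dist))
        new_dist = [0.0] * k
        for c in range(k):
            new_dist[c] = dist[c] * (1.0 - p)
            if c > 0:
                new_dist[c] += dist[c - 1] * p
        dist = new_dist
    return result


names = ["t3", "t5", "t4", "t1", "t2"]
scores = topk_membership_probs([0.2, 0.8, 0.9, 0.5, 0.6], 2)
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[:2])
```

On the running example this definition also selects t5 and t4 for k = 2, although in general the two definitions need not agree.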