
1 Jiang Chen (Columbia University) and Ke Yi (HKUST)

2 Motivation
• Uncertain data arises naturally in many applications: sensor data, fuzzy data integration, data cleaning, etc.
• Items are associated with a "confidence": they may or may not be true, may or may not exist.
• A very hot topic in the database community.

3 Motivation

    item   score   probability
    t3     100     0.2
    t5      87     0.8
    t4      80     0.9
    t1      65     0.5
    t2      30     0.6

• (score, probability) can be, e.g., (sensor reading, reliability) or (page rank, how well the item matches the query).
• The top-k answer depends on the interplay between score and confidence.

4 Problem Definition [Soliman et al. 07]
• The k items with the maximum probability of being the top-k.

    tuple   score   confidence
    t3      100     0.2
    t5       87     0.8
    t4       80     0.9
    t1       65     0.5
    t2       30     0.6

• For k = 2:
    {t3, t5}: 0.2 * 0.8 = 0.16
    {t3, t4}: 0.2 * (1 - 0.8) * 0.9 = 0.036
    {t5, t4}: (1 - 0.2) * 0.8 * 0.9 = 0.576
    ...
  (A brute-force check of these numbers is sketched below.)
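As a sanity check on this definition, here is a minimal brute-force sketch (my own Python, not from the paper) that enumerates every possible world of the toy table above and sums, for each k-set, the probability that it is exactly the top-k; the function name and the decision to skip worlds with fewer than k items are my own choices.

    from itertools import product

    # Toy data from the slide: (item, score, confidence).
    items = [("t3", 100, 0.2), ("t5", 87, 0.8), ("t4", 80, 0.9),
             ("t1", 65, 0.5), ("t2", 30, 0.6)]

    def topk_set_probabilities(items, k):
        """Probability, per k-set, of being exactly the top-k, over all possible worlds."""
        probs = {}
        for present in product([False, True], repeat=len(items)):
            # Probability of this possible world.
            p = 1.0
            for (name, score, conf), here in zip(items, present):
                p *= conf if here else (1.0 - conf)
            world = [it for it, here in zip(items, present) if here]
            if len(world) < k:
                continue  # fewer than k items exist in this world; skipped here
            world.sort(key=lambda it: -it[1])          # sort by score, descending
            top = frozenset(name for name, _, _ in world[:k])
            probs[top] = probs.get(top, 0.0) + p
        return probs

    print(topk_set_probabilities(items, k=2))

For k = 2 this reproduces the numbers above: {t3, t5} gets 0.16, {t3, t4} gets 0.036, and {t5, t4} gets 0.576.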

5 One-Time Computation
• Assume the items are already sorted by score:
    t1   t2   t3   t4   t5   t6   t7   t8   ...
    0.2  0.8  0.7  0.2  0.1  1    0.1  0.8  ...
• Consider the i-th item ti.
    Question: among t1, ..., ti, which k items have the maximum probability of appearing while the rest do not appear?
    Answer: the k items with the largest probabilities.
    Example: {t2, t5} being the top-2  ⇔  t2, t5 appear and t1, t3, t4 do not appear.
• Just answer this question for all i (see the heap-based sketch below).
• Time: O(n log k)
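A sketch of the one-time computation described above (my own Python, not the paper's code): a min-heap keeps the k largest probabilities of the current prefix, and two running products give the value of the candidate answer ending at each i. It assumes all probabilities are nonzero so that the product over the heap can be maintained by division.

    import heapq

    def one_time_topk(probabilities, k):
        """For items already sorted by score (descending): the maximum, over all
        prefixes t_1, ..., t_i, of the probability that the k highest-probability
        items of the prefix appear and the remaining i-k items do not.
        Returns (best_probability, best_prefix_length)."""
        heap = []            # min-heap of the k largest probabilities in the current prefix
        in_prod = 1.0        # product of the probabilities currently in the heap
        out_prod = 1.0       # product of (1 - p) over items evicted from the heap
        best_prob, best_i = 0.0, None
        for i, p in enumerate(probabilities, start=1):
            heapq.heappush(heap, p)
            in_prod *= p
            if len(heap) > k:
                evicted = heapq.heappop(heap)
                in_prod /= evicted            # assumes p > 0; avoids an O(k) recomputation
                out_prod *= (1.0 - evicted)
            if len(heap) == k and in_prod * out_prod > best_prob:
                best_prob, best_i = in_prod * out_prod, i
        return best_prob, best_i

    # Probabilities from the slide, in score order; with k = 2 the best prefix is
    # i = 3, with probability 0.8 * 0.7 * (1 - 0.2) = 0.448.
    print(one_time_topk([0.2, 0.8, 0.7, 0.2, 0.1, 1, 0.1, 0.8], k=2))

The answer set itself is then the k largest-probability items among the first best_i items; the heap push/pop per item gives the O(n log k) bound.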

6 The Data Structure Problem
• Build a data structure that supports:
    Query: given j, return the top-j answer
    Update:
        insert an item
        delete an item
        update the probability of an item
    Construction

7 Our Results
• A data structure of size O(n).
• Query: O(log n + j): given j, return the top-j answer, j = 1, ..., k.
• Update: O(k log n) (better than in the paper):
    insert an item
    delete an item
    update the probability of an item
• Construction: O(n log k) (better than in the paper).

8 Overall Structure
• A balanced tree over the items (in score order); each leaf holds between k and 2k items.
• (Figure: an internal node u with children v and w, labelled with the top-j probability ρ_j^u, the "j' largest" probability φ_{j'}^v, and the top-(j-j') probability ρ_{j-j'}^w.)
• ρ_j^u = max{ ρ_j^v,  max_{0 ≤ j' ≤ j-1} φ_{j'}^v · ρ_{j-j'}^w },   j = 1, ..., k
• Top-j query: O(log n + j)
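A direct evaluation of this recurrence for one internal node, as a sketch (my own Python; the conventions that ρ is indexed from 1 and φ from 0, with φ_0^v being the probability that nothing below v appears, are my reading of the slide). The double loop costs O(k^2); slides 9-11 replace it with an O(k) computation via SMAWK.

    def combine_rho(rho_v, phi_v, rho_w, k):
        """Evaluate the recurrence on slide 8 directly in O(k^2) time.

        rho_v[j], rho_w[j]  -- top-j probability below the left / right child, 1 <= j <= k
        phi_v[jp]           -- probability that exactly the jp highest-probability items
                               below the left child appear and the rest of them do not,
                               0 <= jp <= k-1
        Index 0 of rho_v / rho_w / rho_u is unused."""
        rho_u = [None] * (k + 1)
        for j in range(1, k + 1):
            best = rho_v[j]                     # top-j answer entirely inside v
            for jp in range(0, j):              # jp items taken from v, j - jp from w
                best = max(best, phi_v[jp] * rho_w[j - jp])
            rho_u[j] = best
        return rho_u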

9 Update an Internal Node
• ρ_j^u = max{ ρ_j^v,  max_{0 ≤ j' ≤ j-1} φ_{j'}^v · ρ_{j-j'}^w },   j = 1, ..., k
• The candidate products form a k x k (lower-triangular) matrix, row j listing φ_{j'}^v · ρ_{j-j'}^w for j' = 0, ..., j-1:
    row 1:  φ_0^v ρ_1^w
    row 2:  φ_0^v ρ_2^w   φ_1^v ρ_1^w
    row 3:  φ_0^v ρ_3^w   φ_1^v ρ_2^w   φ_2^v ρ_1^w
    ...
    row k:  φ_0^v ρ_k^w   φ_1^v ρ_{k-1}^w   φ_2^v ρ_{k-2}^w   ...   φ_{k-1}^v ρ_1^w
• Monotone: the last item of the top-(j+1) answer cannot be in front of the last item of the top-j answer.

10 Total Monotonicity
• A matrix is totally monotone if all of its submatrices are monotone.
• It is enough to check all 2x2 submatrices
      A  B
      C  D
  i.e. A > B  ⇒  C > D.
• For a k x k totally monotone matrix, the SMAWK algorithm [Aggarwal et al. 87] finds all row maxima in time O(k). (A simpler stand-in for SMAWK is sketched below.)
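SMAWK itself is fairly intricate, so the sketch below (my own Python, not the paper's code and not SMAWK) is a simpler O(k log k) divide-and-conquer stand-in that uses only the consequence of total monotonicity needed here: the column of each row's maximum never moves left as the row index grows.

    def monotone_row_maxima(nrows, ncols, entry):
        """Leftmost row-maximum column of an implicit matrix entry(i, j), assuming
        the row-maximum positions are nondecreasing from top to bottom."""
        best = [0] * nrows

        def solve(r_lo, r_hi, c_lo, c_hi):
            if r_lo > r_hi:
                return
            mid = (r_lo + r_hi) // 2
            # Scan the allowed column window for the middle row.
            best_col = c_lo
            for c in range(c_lo, c_hi + 1):
                if entry(mid, c) > entry(mid, best_col):
                    best_col = c
            best[mid] = best_col
            # Monotonicity: rows above use columns <= best_col, rows below use >= best_col.
            solve(r_lo, mid - 1, c_lo, best_col)
            solve(mid + 1, r_hi, best_col, c_hi)

        solve(0, nrows - 1, 0, ncols - 1)
        return best

SMAWK improves on this by pruning columns as well as rows, bringing the total number of probed entries down to O(k).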

11 Total Monotonicity
• Lemma: The matrix (φ_{j'}^v · ρ_{j-j'}^w) from slide 9 is totally monotone.
• Hence an internal node can be updated in time O(k).

12 Update (Recompute) a Leaf
• Goal: compute ρ_j, j = 1, ..., n, where n = Θ(k) is the number of items in the leaf.
• Define φ_{j,i} = p(e_{1,i}) · ... · p(e_{j,i}) · (1 - p(e_{j+1,i})) · ... · (1 - p(e_{i,i})),
  where e_{1,i}, ..., e_{i,i} are the first i items sorted by decreasing probability.
• ρ_j = max_{1 ≤ i ≤ n} φ_{j,i}
• So: compute the row maxima of the k x n matrix (φ_{j,i})! (A direct evaluation of the definition is sketched below.)
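A direct, unoptimized evaluation of this definition for one leaf (my own Python sketch; it recomputes each φ_{j,i} from scratch, so it is far slower than the O(k log k) the paper achieves, but it pins down what ρ_j means).

    def leaf_rho(probs):
        """rho[j] = max over i of phi_{j,i}, evaluated straight from the definition.
        probs holds the leaf's probabilities in score order."""
        n = len(probs)
        rho = [0.0] * (n + 1)                     # rho[j] for j = 1, ..., n; index 0 unused
        for i in range(1, n + 1):
            e = sorted(probs[:i], reverse=True)   # first i items, by decreasing probability
            for j in range(1, i + 1):
                phi_ji = 1.0
                for pos, p in enumerate(e, start=1):
                    phi_ji *= p if pos <= j else (1.0 - p)
                rho[j] = max(rho[j], phi_ji)
        return rho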

13 Total Monotonicity, Again
• Lemma: The matrix (φ_{j,i}) of size k x n is totally monotone.
• Are we done yet?
• The SMAWK algorithm probes only O(k) entries of (φ_{j,i}), but each probed entry
      φ_{j,i} = p(e_{1,i}) · ... · p(e_{j,i}) · (1 - p(e_{j+1,i})) · ... · (1 - p(e_{i,i}))
  still has to be retrieved on demand.

14 Retrieve φ_{j,i}
• Rewrite
      φ_{j,i} = p(e_{1,i}) · ... · p(e_{j,i}) · (1 - p(e_{j+1,i})) · ... · (1 - p(e_{i,i}))
              = [p(e_{1,i}) / (1 - p(e_{1,i}))] · ... · [p(e_{j,i}) / (1 - p(e_{j,i}))] · (1 - p(e_{1,i})) · ... · (1 - p(e_{i,i}))
              = [p(e_{1,i}) / (1 - p(e_{1,i}))] · ... · [p(e_{j,i}) / (1 - p(e_{j,i}))] · (1 - p(t_1)) · ... · (1 - p(t_i))
• The trailing product (1 - p(t_1)) · ... · (1 - p(t_i)) can be pre-computed for all i in time O(k). (The rewrite is sketched in code below.)
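The rewrite in code, as a sketch (my own Python; it assumes every probability is strictly less than 1 so that the odds ratio p/(1-p) is defined):

    def phi_via_odds(probs, j, i):
        """phi_{j,i} through the odds-ratio rewrite: the product of (1 - p) over the
        whole prefix t_1, ..., t_i times the j largest odds ratios p/(1-p) within it
        (p/(1-p) is increasing in p, so those belong to e_{1,i}, ..., e_{j,i})."""
        all_absent = 1.0
        for p in probs[:i]:                   # (1 - p(t_1)) ... (1 - p(t_i)); pre-computable for all i
            all_absent *= (1.0 - p)
        odds = sorted((p / (1.0 - p) for p in probs[:i]), reverse=True)
        result = all_absent
        for o in odds[:j]:                    # multiply in the j largest odds ratios
            result *= o
        return result

In the actual structure the trailing product over (1 - p(t_l)) is pre-computed once per leaf, and the product of the j largest odds ratios is read off the persistent search tree of slide 15, so each probed entry costs O(log k).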

15 Retrieve φ_{j,i} (cont.)
• Focus on the remaining factor [p(e_{1,i}) / (1 - p(e_{1,i}))] · ... · [p(e_{j,i}) / (1 - p(e_{j,i}))].
• (Figure: a search tree over e_{1,i}, ..., e_{6,i}; inserting the next item gives the tree over e_{1,i+1}, ..., e_{7,i+1}.)
• To support all i, make the structure partially persistent.
    Insertion: O(log k)
    Query: O(log k)

16 Update (Recompute) a Leaf
• Goal: compute ρ_j, j = 1, ..., n, where n = Θ(k).
• ρ_j = max_{1 ≤ i ≤ n} φ_{j,i}: compute the row maxima of the k x n matrix (φ_{j,i})!
• The SMAWK algorithm probes O(k) of the φ_{j,i}'s.
• Using the persistent (2,3)-tree:
    construction: O(k log k)
    query: O(log k)
• Total time for a leaf: O(k log k)

17 Summary
• Update (recompute) an internal node: O(k); there are O(log n) such nodes on the affected root-to-leaf path.
• Update (recompute) a leaf node: O(k log k).
• Total update time: O(k log n).
    Insertions/deletions can be handled using standard techniques (rebalancing).
• Construction time: O(n log k); construction is as efficient as the one-time computation.

18 Final Remarks
• Conjecture: Ω(k) is a lower bound on the update time.
• Other top-k definitions?
    For each item, compute its probability of being one of the top k; return the k items with the largest such probability.
• k-nearest neighbors in uncertain geometric data, where each point has a pdf.
