Jiang Chen Columbia University Ke Yi HKUST. Motivation  Uncertain data naturally arises in many applications: sensor data, fuzzy data integration, data.

Presentation transcript:

Jiang Chen Columbia University Ke Yi HKUST

Motivation  Uncertain data naturally arises in many applications: sensor data, fuzzy data integration, data cleaning, etc.  Each item carries a “confidence”: it may or may not be true, or may or may not exist.  Very hot topic in the database community

Motivation  [Two tables, each listing items with a score and a probability, labelled (sensor reading, reliability) and (page rank, how well the page matches the query).]  The top-k answer depends on the interplay between score and confidence.

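Problem Definition [Soliman et al. 07]  Return the k items with the maximum probability of being the top-k.  [Table: tuples t3, t5, t4, t1, … in decreasing score order, each with a confidence; the products below use 0.2 for t3, 0.8 for t5, and 0.9 for t4.]  For k = 2: {t3, t5}: 0.2*0.8 = 0.16; {t3, t4}: 0.2*(1-0.8)*0.9 = 0.036; {t5, t4}: (1-0.2)*0.8*0.9 = 0.576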
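The three products on this slide can be reproduced with a small brute-force sketch. It assumes independent items and takes the probabilities 0.2, 0.8, 0.9 for t3, t5, t4 from the products shown; the function names are illustrative and not from the talk.

```python
from itertools import combinations

def topk_probability(probs, subset):
    """Probability that `subset` (indices into probs) is exactly the top-k.

    probs[i] is the existence probability of the item with the (i+1)-th
    highest score. Independence of items is an assumption of this sketch.
    """
    chosen = set(subset)
    last = max(chosen)              # lowest-scoring member of the candidate set
    result = 1.0
    for i in range(last + 1):       # items scoring below `last` do not matter
        result *= probs[i] if i in chosen else 1.0 - probs[i]
    return result

def best_topk_bruteforce(probs, k):
    """Enumerate all C(n, k) subsets; for illustration only."""
    return max(combinations(range(len(probs)), k),
               key=lambda s: topk_probability(probs, s))

probs = [0.2, 0.8, 0.9]             # t3, t5, t4 in decreasing score order
for s in combinations(range(len(probs)), 2):
    print(s, round(topk_probability(probs, s), 3))
print(best_topk_bruteforce(probs, 2))
```

Index 0 stands for t3, 1 for t5, and 2 for t4, so the winning pair (1, 2) corresponds to {t5, t4}.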

One-Time Computation  Assume items are already sorted by score. [Figure: items t1, t2, t3, t4, t5, t6, t7, … in score order.]  Consider the i-th item ti. Question: among t1,..., ti, which k items have the maximum probability of appearing while the rest do not appear? Answer: the k items with the largest probabilities. For example, {t2, t5} being the top-2 means that t2 and t5 appear and t1, t3, t4 do not.  Just need to answer the question for all i and keep the best answer.  Time: O(n log k)
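A minimal sketch of this one-pass scan, assuming independent items with probabilities strictly between 0 and 1 (so logarithms are defined). A min-heap holds the k largest probabilities seen so far, which gives the O(n log k) bound; the function name and the use of log-sums are choices of this sketch, not of the talk.

```python
import heapq
import math

def best_topk_one_pass(probs, k):
    """probs[i]: existence probability of the item with the (i+1)-th highest score.

    Returns (probability, sorted indices) of the most likely top-k set.
    Assumes 0 < p < 1 for every item and independence between items.
    """
    heap = []                       # min-heap of the k largest probabilities so far
    log_in = 0.0                    # sum of log p over heap members
    log_out = 0.0                   # sum of log(1 - p) over the other seen items
    best_log, best_i = -math.inf, None
    for i, p in enumerate(probs):
        if len(heap) < k:
            heapq.heappush(heap, p)
            log_in += math.log(p)
        elif p > heap[0]:
            q = heapq.heapreplace(heap, p)      # evict the smallest member
            log_in += math.log(p) - math.log(q)
            log_out += math.log(1.0 - q)
        else:
            log_out += math.log(1.0 - p)
        if len(heap) == k and log_in + log_out > best_log:
            best_log, best_i = log_in + log_out, i
    if best_i is None:              # fewer than k items
        return 0.0, []
    # Recover the set once at the end: k largest probabilities in the best prefix.
    answer = heapq.nlargest(k, range(best_i + 1), key=lambda j: probs[j])
    return math.exp(best_log), sorted(answer)

print(best_topk_one_pass([0.2, 0.8, 0.9], 2))
```

Only the best prefix length is remembered during the scan, so the answer set is materialised once rather than copied at every improvement.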

The Data Structure Problem  Build a data structure, such that:  Query Given j, return the top-j answer  Update Insert an item Delete an item Update the probability of an item  Construction

Our Results  A data structure of size O(n)  Query: O(log(n) + j) Given j, return the top-j answer, j=1,...,k  Update: O(k log n) (better than paper) Insert an item Delete an item Update the probability of an item  Construction: O(n log k) (better than paper)

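Overall Structure  [Figure: a tree node u with children v and w.] $\rho_j^u$: probability of the top-j answer in the subtree of u. $\varphi_{j'}^v$: probability for the $j'$ items with the largest probabilities in v. $\rho_{j-j'}^w$: probability of the top-$(j-j')$ answer in w.  $\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \varphi_{j'}^v \rho_{j-j'}^w\}$, j = 1,…,k.  A leaf has k ~ 2k items.  Top-j query: O(log n + j)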

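Update an Internal Node  $\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \varphi_{j'}^v \rho_{j-j'}^w\}$, j = 1,…,k. The candidate products form a triangular matrix with one row per j:

\[
\begin{array}{lllll}
\varphi_0^v \rho_1^w \\
\varphi_0^v \rho_2^w & \varphi_1^v \rho_1^w \\
\varphi_0^v \rho_3^w & \varphi_1^v \rho_2^w & \varphi_2^v \rho_1^w \\
\vdots & \vdots & \vdots & \ddots \\
\varphi_0^v \rho_k^w & \varphi_1^v \rho_{k-1}^w & \varphi_2^v \rho_{k-2}^w & \cdots & \varphi_{k-1}^v \rho_1^w
\end{array}
\]

Monotone: the last item of the top-(j+1) answer can’t be in front of the last item of the top-j answer.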

Total Monotonicity  A matrix is totally monotone if all its sub-matrices are monotone. Enough to check all 2x2 sub-matrices $\begin{pmatrix} A & B \\ C & D \end{pmatrix}$: A > B  C > D.  For a k*k totally monotone matrix, the SMAWK algorithm [Aggarwal et al. 87] can find all row maxima in time O(k).
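As a rough illustration of the 2x2 test, here is a brute-force checker. It assumes one common convention for row maxima, namely that A < B in the upper row forces C < D in the lower row, so the leftmost row maximum never moves left going down; the slide's own inequality may be stated in a mirrored form. Both function names are hypothetical.

```python
from itertools import combinations

def is_totally_monotone(matrix):
    """Check every 2x2 sub-matrix [[a, b], [c, d]] (rows r1 < r2, columns c1 < c2).

    Convention assumed in this sketch: a < b must imply c < d.
    """
    rows, cols = len(matrix), len(matrix[0])
    for r1, r2 in combinations(range(rows), 2):
        for c1, c2 in combinations(range(cols), 2):
            a, b = matrix[r1][c1], matrix[r1][c2]
            c, d = matrix[r2][c1], matrix[r2][c2]
            if a < b and not c < d:
                return False
    return True

def row_maxima_naive(matrix):
    """Column index of the leftmost maximum in each row (no SMAWK, plain scan)."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

m = [[5, 3, 1],
     [4, 6, 2],
     [1, 5, 7]]
print(is_totally_monotone(m), row_maxima_naive(m))
```

The checker costs time quadratic in both dimensions and is only meant for validating small examples; SMAWK itself never inspects the whole matrix.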

Total Monotonicity  Lemma: The matrix $(\varphi_{j'}^v \rho_{j-j'}^w)$, with rows j = 1,…,k and columns j' = 0,…,j-1, is totally monotone.  An internal node can therefore be updated in time O(k).
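The recurrence for an internal node can be sketched directly. This version scans each row of the product matrix, so it takes O(k^2) time rather than the O(k) obtained with SMAWK; it is a plain stand-in to show what is being computed. The array layout (index 0 of the rho lists unused) is an assumption of the sketch.

```python
def combine_rho(rho_v, phi_v, rho_w, k):
    """Compute rho_u[1..k] from the children v (higher scores) and w.

    rho_v[j], rho_w[j]: probability of the top-j answer in v / w, j = 1..k
                        (index 0 is unused).
    phi_v[jp]:          probability for the jp largest-probability items in v,
                        jp = 0..k-1.
    Naive O(k^2) scan; SMAWK would find the same row maxima in O(k).
    """
    rho_u = [None] * (k + 1)
    for j in range(1, k + 1):
        best = rho_v[j]                         # all j items come from v
        for jp in range(j):                     # jp items from v, j - jp from w
            best = max(best, phi_v[jp] * rho_w[j - jp])
        rho_u[j] = best
    return rho_u

# Toy example: v holds one item (p = 0.2), w holds two items (p = 0.8, 0.9).
rho_v = [None, 0.2, 0.0]
phi_v = [0.8, 0.2]
rho_w = [None, 0.8, 0.72]
print(combine_rho(rho_v, phi_v, rho_w, 2))
```

In the toy example the combined top-2 value is 0.8 * 0.72 = 0.576, matching the {t5, t4} product from the problem-definition slide.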

Update (Recompute) a Leaf  Goal: Compute $\rho_j$, j = 1,…,n, where n = Θ(k).  Define $\varphi_{j,i} = p(e_{1,i}) \cdot p(e_{2,i}) \cdots p(e_{j,i}) \cdot (1-p(e_{j+1,i})) \cdot (1-p(e_{j+2,i})) \cdots (1-p(e_{i,i}))$, where $e_{1,i},\dots,e_{i,i}$ are the first i items sorted by decreasing probability.  $\rho_j = \max_{1 \le i \le n} \varphi_{j,i}$: compute the row maxima of the matrix $(\varphi_{j,i})_{k \times n}$!
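To make the definition concrete, the sketch below evaluates every $\varphi_{j,i}$ from scratch and takes row maxima. It is deliberately naive (re-sorting each prefix) and does not use SMAWK or a persistent tree; the function name and array layout are illustrative assumptions.

```python
def leaf_rho_naive(probs, k):
    """rho[j] = max over i of phi_{j,i}, for j = 1..k (rho[0] is unused).

    probs lists the leaf's items in decreasing score order. phi_{j,i} is the
    probability that the j most probable of the first i items appear and the
    remaining i - j items do not (independence assumed).
    """
    n = len(probs)
    rho = [0.0] * (k + 1)
    for i in range(1, n + 1):
        ranked = sorted(probs[:i], reverse=True)    # e_{1,i}, ..., e_{i,i}
        for j in range(1, min(i, k) + 1):
            phi = 1.0
            for r, p in enumerate(ranked):
                phi *= p if r < j else 1.0 - p
            rho[j] = max(rho[j], phi)
    return rho

print(leaf_rho_naive([0.2, 0.8, 0.9], 2))
```

With the three probabilities from the earlier example, the row maxima are 0.64 for j = 1 and 0.576 for j = 2.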

Total Monotonicity, Again  Lemma: The matrix $(\varphi_{j,i})_{k \times n}$ is totally monotone. Are we done yet?  The SMAWK algorithm probes O(k) entries of the matrix $(\varphi_{j,i})_{k \times n}$, but each probed entry $\varphi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1-p(e_{j+1,i})) \cdots (1-p(e_{i,i}))$ still has to be retrieved on demand.

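Retrieve $\varphi_{j,i}$  Rewrite

\[
\begin{aligned}
\varphi_{j,i} &= p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1-p(e_{j+1,i})) \cdots (1-p(e_{i,i})) \\
&= \frac{p(e_{1,i})}{1-p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1-p(e_{j,i})} \cdot (1-p(e_{1,i})) \cdots (1-p(e_{i,i})) \\
&= \frac{p(e_{1,i})}{1-p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1-p(e_{j,i})} \cdot (1-p(t_1)) \cdots (1-p(t_i))
\end{aligned}
\]

The last factor, $(1-p(t_1)) \cdots (1-p(t_i))$, is pre-computed in time O(k).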
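The rewriting can be checked numerically with a short sketch. It assumes every probability is strictly below 1 so that the odds p/(1-p) are finite, and it re-sorts each prefix where the talk would use a persistent search structure; the helper names are hypothetical.

```python
from itertools import accumulate
from operator import mul

def prefix_absent(probs):
    """prefix[i] = (1 - p(t_1)) * ... * (1 - p(t_i)), with prefix[0] = 1."""
    return [1.0] + list(accumulate((1.0 - p for p in probs), mul))

def phi_via_odds(probs, prefix, j, i):
    """phi_{j,i} as (product of the j largest odds among t_1..t_i) * prefix[i]."""
    ranked = sorted(probs[:i], reverse=True)    # stand-in for the tree query
    odds = 1.0
    for p in ranked[:j]:
        odds *= p / (1.0 - p)                   # requires p < 1
    return odds * prefix[i]

def phi_direct(probs, j, i):
    """phi_{j,i} straight from its definition, for comparison."""
    ranked = sorted(probs[:i], reverse=True)
    value = 1.0
    for r, p in enumerate(ranked):
        value *= p if r < j else 1.0 - p
    return value

probs = [0.2, 0.8, 0.9]
prefix = prefix_absent(probs)
for i in range(1, len(probs) + 1):
    for j in range(1, i + 1):
        print(j, i, phi_direct(probs, j, i), phi_via_odds(probs, prefix, j, i))
```

The prefix products take one linear pass, so the remaining per-query work is only the product of the j largest odds.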

Retrieve $\varphi_{j,i}$  Focus on $\frac{p(e_{1,i})}{1-p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1-p(e_{j,i})}$. [Figure: a structure over $e_{1,i},\dots,e_{6,i}$ and its next version over $e_{1,i+1},\dots,e_{7,i+1}$.]  To support all i, make the structure partially persistent.  Insertion: O(log k)  Query: O(log k)

Update (Recompute) a Leaf  Goal: Compute $\rho_j$, j = 1,…,n, where n = Θ(k).  $\rho_j = \max_{1 \le i \le n} \varphi_{j,i}$: compute the row maxima of the matrix $(\varphi_{j,i})_{k \times n}$!  The SMAWK algorithm probes O(k) of the $\varphi_{j,i}$’s.  Using a persistent (2,3)-tree: Construction: O(k log k); Query: O(log k).  Total time for a leaf: O(k log k)

Summary  Update (recompute) an internal node: O(k); there are O(log n) such nodes.  Update (recompute) a leaf node: O(k log k).  Total update time: O(k log n). Insertions/deletions can be handled using standard techniques (rebalancing).  Construction time: O(n log k). Construction is as efficient as the one-time computation.

Final Remarks  Conjecture: Ω(k) is a lower bound for the update time.  Other top-k definitions? For each item, compute its probability of being one of the top-k; return the k items with the largest such probabilities.  k-nearest neighbors in uncertain geometric data: each point has a pdf.