On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases. Presented by Xi Zhang, February 8th, 2008.



Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

Outline Background Probabilistic database model Top-k queries & scoring functions Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

Probabilistic Databases
  Motivation
    - Uncertainty, vagueness, and imprecision in data
  History
    - Incomplete information in relational DBs [Imielinski & Lipski 1984]
    - Probabilistic DB model [Cavallo & Pittarelli 1987]
    - Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.]
  Comeback
    - Uncertain data is now abundant in real-world applications
    - Examples: WWW, biological data, sensor networks, etc.

Probabilistic Database Model [Fuhr & Rölleke 1997]
  Probabilistic Database Model
    - A generalization of the relational DB model
  Probabilistic Relational Algebra (PRA)
    - A generalization of standard relational algebra

A Table in Probabilistic Database
  DocTerm:
    DocNo  Term  Prob  Basic Event
    1      IR    …     e_DT(1, IR)
    2      DB    …     e_DT(2, DB)
    3      IR    …     e_DT(3, IR)
    3      DB    …     e_DT(3, DB)
    4      AI    …     e_DT(4, AI)
  - Each tuple carries an event expression
  - Basic events are independent

Probabilistic Relational Algebra
  Just like in Relational Algebra…
    - Selection
    - Projection
    - Join
    - Union
    - Difference


Selection
  Input: DocTerm, with basic events e_DT(1, IR), e_DT(2, DB), e_DT(3, IR), e_DT(3, DB), e_DT(4, AI)
  Result of selecting Term = IR:
    DocNo  Term  Prob  Complex Event
    1      IR    …     e_DT(1, IR)
    3      IR    …     e_DT(3, IR)
  - Complex event: in a derived table, a propositional expression of basic events

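Projection
  Input: DocTerm, with basic events e_DT(1, IR), e_DT(2, DB), e_DT(3, IR), e_DT(3, DB), e_DT(4, AI)
  Result of projecting onto Term:
    Term  Prob  Complex Event
    IR    …     e_DT(1, IR) ∨ e_DT(3, IR)
    DB    …     e_DT(2, DB) ∨ e_DT(3, DB)
    AI    …     e_DT(4, AI)
  - Tuples that collapse into one result tuple contribute the disjunction of their events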
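The projection step can be sketched in a few lines of Python. This is only an illustration: the probabilities in `doc_term` are assumed values (the Prob column is not legible in this transcript), and `project_term` is a hypothetical helper name. Because the basic events are independent, the probability of a disjunction is one minus the product of the complements.

```python
from collections import defaultdict

# (DocNo, Term, probability of the basic event); the numbers are assumed.
doc_term = [(1, "IR", 0.9), (2, "DB", 0.7), (3, "IR", 0.8), (3, "DB", 0.5), (4, "AI", 0.8)]

def project_term(rows):
    """Project onto Term: duplicate terms merge into a disjunction of basic events."""
    absent = defaultdict(lambda: 1.0)  # Term -> probability that no contributing event holds
    for _doc, term, p in rows:
        absent[term] *= 1.0 - p        # valid because basic events are independent
    return {term: 1.0 - q for term, q in absent.items()}

for term, p in project_term(doc_term).items():
    print(term, round(p, 4))
```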

Join
  DocTerm:
    DocNo  Term  Prob  Basic Event
    1      IR    …     e_DT(1, IR)
    2      DB    …     e_DT(2, DB)
  DocAu:
    DocNo  AName  Prob  Basic Event
    1      Bauer  …     e_DU(1, Bauer)
    2      Meier  …     e_DU(2, Meier)
  Result:
    DocAu.DocNo  AName  DocTerm.DocNo  Term  Prob      Complex Event
    1            Bauer  1              IR    0.9 * …   e_DU(1, Bauer) ∧ e_DT(1, IR)
    1            Bauer  2              DB    … * …     e_DU(1, Bauer) ∧ e_DT(2, DB)
    2            Meier  1              IR    … * …     e_DU(2, Meier) ∧ e_DT(1, IR)
    2            Meier  2              DB    … * 0.7   e_DU(2, Meier) ∧ e_DT(2, DB)
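A minimal sketch of the product underlying this join, assuming independent basic events. The probabilities below are illustrative assumptions rather than the slide's values, and `cross` is a hypothetical helper. Each result tuple carries the conjunction of the two input events, and its probability is the product of their probabilities.

```python
from itertools import product

# (DocNo, value, probability of the basic event); the numbers are assumed.
doc_au = [(1, "Bauer", 0.9), (2, "Meier", 0.8)]
doc_term = [(1, "IR", 0.9), (2, "DB", 0.7)]

def cross(left, right):
    """Pair every DocAu tuple with every DocTerm tuple and conjoin their events."""
    rows = []
    for (d1, name, p1), (d2, term, p2) in product(left, right):
        event = f"e_DU({d1}, {name}) AND e_DT({d2}, {term})"
        rows.append((d1, name, d2, term, event, p1 * p2))
    return rows

for row in cross(doc_au, doc_term):
    print(row)
```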

Join + Projection
  Inputs
    DocTerm: basic events e_DT(1, IR), e_DT(2, DB), e_DT(3, IR), e_DT(3, DB), e_DT(4, AI)
    DocAu:
      DocNo  AName    Prob  Basic Event
      1      Bauer    …     e_DU(1, Bauer)
      2      Bauer    …     e_DU(2, Bauer)
      2      Meier    …     e_DU(2, Meier)
      2      Schmidt  …     e_DU(2, Schmidt)
      3      Schmidt  …     e_DU(3, Schmidt)
      3      Koch     …     e_DU(3, Koch)
      3      Bauer    …     e_DU(3, Bauer)
  IR (documents about IR):
    DocNo  Prob  Complex Event
    1      …     e_DT(1, IR)
    3      …     e_DT(3, IR)
  DB (documents about DB):
    DocNo  Prob  Complex Event
    2      …     e_DT(2, DB)
    3      …     e_DT(3, DB)
  Authors of IR documents:
    AName    Prob  Complex Event
    Bauer    …     e_DU(1, Bauer) ∧ e_DT(1, IR)
    Schmidt  …     e_DU(3, S) ∧ e_DT(3, IR)
  Authors of DB documents:
    AName    Prob  Complex Event
    Bauer    …     e_DU(2, Bauer) ∧ e_DT(2, DB)
    Meier    …     e_DU(2, Meier) ∧ e_DT(2, DB)
    Schmidt  …     (e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB))
  Authors of both IR and DB documents:
    AName    Prob                  Complex Event
    Bauer    0.81 * 0.21 = 0.1701  (e_DU(1, B) ∧ e_DT(1, IR)) ∧ (e_DU(2, B) ∧ e_DT(2, DB))
    Schmidt  … * 0.91 = …          (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB)))
  Intensional Semantics vs. Extensional Semantics: the two can assign different probabilities to these complex events

Intensional vs. Extensional
  Intensional Semantics
    - Assumes data independence only for the base tables
    - Keeps track of data dependence during the evaluation
  Extensional Semantics
    - Assumes data independence throughout the evaluation
    - Can compute WRONG probabilities!

When Intensional = Extensional?
  - When no basic event occurs more than once in the event expression
    AName    Prob         Complex Event
    Bauer    0.81 * 0.21  (e_DU(1, B) ∧ e_DT(1, IR)) ∧ (e_DU(2, B) ∧ e_DT(2, DB))
    Schmidt  … * 0.91     (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB)))
  - Identical basic event: e_DU(3, S) occurs twice in Schmidt's expression
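The difference between the two semantics can be checked numerically. The sketch below is an illustration under assumptions: event expressions are encoded as nested tuples, the probabilities in `prob` are made-up values, and the intensional probability is obtained by brute-force enumeration of truth assignments (not an efficient method). For Schmidt's expression, which repeats e_DU(3, S), the two results differ; for Bauer's expression, which repeats nothing, they agree.

```python
from itertools import product

def basic_events(expr):
    """Collect the names of the basic events in a nested ("and"/"or", left, right) expression."""
    if isinstance(expr, str):
        return {expr}
    _op, left, right = expr
    return basic_events(left) | basic_events(right)

def holds(expr, world):
    """Evaluate the expression under one truth assignment of the basic events."""
    if isinstance(expr, str):
        return world[expr]
    op, left, right = expr
    if op == "and":
        return holds(left, world) and holds(right, world)
    return holds(left, world) or holds(right, world)

def intensional(expr, prob):
    """Exact probability: sum the weights of all assignments that satisfy the expression."""
    names = sorted(basic_events(expr))
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for name in names:
            weight *= prob[name] if world[name] else 1.0 - prob[name]
        if holds(expr, world):
            total += weight
    return total

def extensional(expr, prob):
    """Treat the operands of every operator as independent, even when they are not."""
    if isinstance(expr, str):
        return prob[expr]
    op, left, right = expr
    a, b = extensional(left, prob), extensional(right, prob)
    return a * b if op == "and" else 1.0 - (1.0 - a) * (1.0 - b)

# Assumed probabilities for the basic events.
prob = {"DU1B": 0.9, "DT1IR": 0.9, "DU2B": 0.3, "DT2DB": 0.7,
        "DU3S": 0.7, "DT3IR": 0.8, "DU2S": 0.8, "DT3DB": 0.5}
bauer = ("and", ("and", "DU1B", "DT1IR"), ("and", "DU2B", "DT2DB"))
schmidt = ("and", ("and", "DU3S", "DT3IR"),
           ("or", ("and", "DU2S", "DT2DB"), ("and", "DU3S", "DT3DB")))

print(intensional(bauer, prob), extensional(bauer, prob))
print(intensional(schmidt, prob), extensional(schmidt, prob))
```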

Fuhr & Rölleke 1997 Summary
  Probabilistic DB Model
    - Concept of event
    - Basic vs. complex event
    - Event expression
  Probabilistic Relational Algebra
    - Just like in Relational Algebra…
  Computation of event probabilities
    - Intensional vs. extensional semantics
    - Both yield the same result when there is NO data dependence in the event expressions

Outline Background Probabilistic database model Top-k queries & scoring functions Motivation Examples Top-k Queries in Probabilistic Databases Semantics Query Evaluation Conclusion

Top-k Queries
  Traditionally, given
    - Objects: o_1, o_2, …, o_n
    - A non-negative integer: k
    - A scoring function s that assigns a score to each object
  Question: What are the k objects with the highest scores?
  Top-k queries have been studied for the Web, XML, and relational databases, and more recently for probabilistic databases.

Scoring Function
  A scoring function s over a deterministic relation R is a function s: R → ℝ
  For any t_i and t_j from R, t_i is ranked above t_j if and only if s(t_i) > s(t_j)
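For a deterministic relation, a top-k query reduces to picking the k largest objects under the scoring function. A minimal sketch, assuming the scores from the smart environment example later in the deck (0.65, 0.55, 0.45); `top_k` is a hypothetical helper name.

```python
import heapq

def top_k(objects, score, k):
    """Return the k objects with the highest scores, best first."""
    return heapq.nlargest(k, objects, key=score)

scores = {"Aiden": 0.65, "Bob": 0.55, "Chris": 0.45}
print(top_k(list(scores), scores.get, 2))
```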

Outline Background Motivation Examples Smart Environment Example Sensor Network Example Top-k Queries in Probabilistic Databases Conclusion

Motivating Example I: Smart Environment
  Sample Question
    - "Who were the two visitors in the lab last Saturday night?"
  Data
    - Biometric data from sensors: we can measure how well the data match the profile of every candidate, which gives a scoring function
    - Historical statistics: e.g., the probability of a certain candidate being in the lab on Saturday nights

Motivating Example I (cont.)
  Personnel  Biometrics score(face detection, voice detection, …)  Probability of being in lab on Saturday nights
  Aiden      score(0.70, 0.60, …) = 0.65                           …
  Bob        score(0.50, 0.60, …) = 0.55                           …
  Chris      score(0.50, 0.40, …) = 0.45                           …
  Question: Find two people in the lab last Saturday night
    - A Top-2 query over the above probabilistic database under the above scoring function

Motivating Example II: Sensor Network in a Habitat
  Sample Question
    - "What is the temperature of the warmest spot?"
  Data
    - Sensor readings from different sensors
    - At a sampling time, only one "real" reading from each sensor
    - Each sensor reading comes with a confidence value

Motivating Example II (cont.)
  Table of readings with columns Temp (F) and Prob, grouped into C_1 (from Sensor 1) and C_2 (from Sensor 2); values …
  Question: What is the temperature of the warmest spot?
    - A Top-1 query over the above probabilistic database under a scoring function proportional to temperature

Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Semantics Query Evaluation Conclusion

Models
  A probabilistic relation R^p = ⟨R, p, C⟩
    - R: the support deterministic relation
    - p: probability function
    - C: a partition of R, such that the probabilities of the tuples in each part sum to at most 1
  Simple vs. general probabilistic relation
    - Simple: assumes tuple independence, i.e. |C| = |R|. E.g. the smart environment example
    - General: tuples can be independent or exclusive, i.e. |C| < |R|. E.g. the sensor network example
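One way to make the model concrete is to enumerate its possible worlds. The sketch below is an assumption-laden illustration: a relation is encoded as a list of parts, each part a dict from a tuple id to its probability, tuples within a part are mutually exclusive, and parts are independent. The tuple ids and probabilities are hypothetical.

```python
from itertools import product

def possible_worlds(parts):
    """Yield (world, probability) pairs; each part contributes one tuple or none."""
    choices = []
    for part in parts:
        options = list(part.items())
        none_p = 1.0 - sum(part.values())   # probability that no tuple of the part appears
        if none_p > 1e-12:
            options.append((None, none_p))
        choices.append(options)
    for combo in product(*choices):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

# Simple relation: every part is a singleton, so |C| = |R|.
simple = [{"Aiden": 0.3}, {"Bob": 0.9}, {"Chris": 0.4}]
# General relation: readings of the same sensor exclude each other, so |C| < |R|.
general = [{"s1_22F": 0.6, "s1_10F": 0.4}, {"s2_25F": 0.1, "s2_15F": 0.6}]

print(len(list(possible_worlds(simple))), len(list(possible_worlds(general))))
```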

Challenges
  Given
    - A probabilistic relation R^p = ⟨R, p, C⟩
    - An injective scoring function s over R (no ties)
    - A non-negative integer k
  What is the top-k answer set over R^p? (Semantics)
  How to compute the top-k answer of R^p? (Query Evaluation)

What is a “Good” Semantics? Desired Properties Exact-k Faithfulness Stability

Properties
  Exact-k
    - If R has at least k tuples, then exactly k tuples are returned as the top-k answer
  Faithfulness
    - A "better" tuple, i.e. one higher in both score and probability, is more likely to be in the top-k answer than a "worse" one
  Stability
    - Raising the score/prob. of a winning tuple will not cause it to lose
    - Lowering the score/prob. of a losing tuple will not cause it to win

Global-Topk Semantics
  Given
    - A probabilistic relation R^p = ⟨R, p, C⟩
    - An injective scoring function s over R (no ties)
    - A non-negative integer k
  What is the top-k answer set over R^p? (Semantics)
  Global-Topk
    - Return the k highest-ranked tuples according to their probability of being in the top-k answers of possible worlds
    - Global-Topk satisfies all three of the aforementioned properties

Smart Environment Example (Global-Topk)
  Query: Find two people in the lab last Saturday night
  Personnel  Biometrics score(face detection, voice detection, …)  Prob
  Aiden      score(0.70, 0.60, …) = 0.65                           …
  Bob        score(0.50, 0.60, …) = 0.55                           …
  Chris      score(0.50, 0.40, …) = 0.45                           …
  Possible worlds: {Aiden}, {Bob}, {Chris}, {Aiden, Bob, Chris}, {Aiden, Bob}, {Aiden, Chris}, {Bob, Chris}, each with its top-2
  Global-Topk Semantics:
    - Pr(Chris in top-2) = …
    - Pr(Aiden in top-2) = 0.3 and Pr(Bob in top-2) = 0.9 (Top-2 Answer)
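The Global-Topk probabilities in this example can be reproduced by brute-force enumeration of possible worlds. This is a sketch for illustration, not the evaluation algorithm of the talk. The probabilities 0.3 for Aiden and 0.9 for Bob follow from the values shown on the slide; Chris's probability of 0.4 is an assumption, since it is not legible in this transcript. `worlds` and `global_topk` are hypothetical helper names.

```python
from itertools import product

# name -> (score, probability of being in the lab); Chris's 0.4 is an assumed value.
people = {"Aiden": (0.65, 0.3), "Bob": (0.55, 0.9), "Chris": (0.45, 0.4)}

def worlds(rel):
    """Enumerate possible worlds of a simple relation as (present tuples, probability)."""
    names = list(rel)
    for bits in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, present in zip(names, bits):
            p = rel[name][1]
            prob *= p if present else 1.0 - p
        yield [n for n, b in zip(names, bits) if b], prob

def global_topk(rel, k):
    """Rank tuples by their total probability of appearing in a world's top-k."""
    in_topk = dict.fromkeys(rel, 0.0)
    for present, prob in worlds(rel):
        ranked = sorted(present, key=lambda n: rel[n][0], reverse=True)
        for name in ranked[:k]:
            in_topk[name] += prob
    answer = sorted(rel, key=in_topk.get, reverse=True)[:k]
    return answer, in_topk

answer, in_topk = global_topk(people, 2)
print(answer, {n: round(p, 3) for n, p in in_topk.items()})
```

Under the assumed value for Chris, the answer contains Aiden and Bob, and exactly two tuples are returned, as the Exact-k property requires.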

Other Semantics Soliman, Ilyas & Chang 2007 Two Alternative Semantics U-Topk U-kRanks

U-Topk Semantics
  Given
    - A probabilistic relation R^p = ⟨R, p, C⟩
    - An injective scoring function s over R (no ties)
    - A non-negative integer k
  What is the top-k answer set over R^p? (Semantics)
  U-Topk
    - Return the most probable top-k answer set among those that occur in possible worlds
    - U-Topk does not satisfy all three properties

Smart Environment Example (U-Topk)
  Query: Find two people in the lab last Saturday night
  Personnel  Biometrics score(face detection, voice detection, …)  Prob
  Aiden      score(0.70, 0.60, …) = 0.65                           …
  Bob        score(0.50, 0.60, …) = 0.55                           …
  Chris      score(0.50, 0.40, …) = 0.45                           …
  Possible worlds: {Aiden}, {Bob}, {Chris}, {Aiden, Bob, Chris}, {Aiden, Bob}, {Aiden, Chris}, {Bob, Chris}, each with its top-2
  U-Topk Semantics:
    - Pr({Aiden, Bob}) = … = 0.27
    - …
    - Pr({Bob}) = … (Top-2 Answer)
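A brute-force sketch of U-Topk on the same example, for illustration only. As before, Chris's probability of 0.4 is an assumed value; 0.3 and 0.9 are consistent with Pr({Aiden, Bob}) = 0.27 on the slide. The function groups possible worlds by their top-k set and returns the set with the largest total probability.

```python
from itertools import product

# name -> (score, probability); Chris's 0.4 is an assumed value.
people = {"Aiden": (0.65, 0.3), "Bob": (0.55, 0.9), "Chris": (0.45, 0.4)}

def worlds(rel):
    """Enumerate possible worlds of a simple relation as (present tuples, probability)."""
    names = list(rel)
    for bits in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, present in zip(names, bits):
            p = rel[name][1]
            prob *= p if present else 1.0 - p
        yield [n for n, b in zip(names, bits) if b], prob

def u_topk(rel, k):
    """Return the most probable top-k answer set and the probability of every candidate set."""
    mass = {}
    for present, prob in worlds(rel):
        ranked = sorted(present, key=lambda n: rel[n][0], reverse=True)
        key = tuple(ranked[:k])            # may hold fewer than k tuples
        mass[key] = mass.get(key, 0.0) + prob
    return max(mass, key=mass.get), mass

best, mass = u_topk(people, 2)
print(best, round(mass[best], 3))
```

With these numbers the most probable answer set is {Bob}, which has only one tuple even though k = 2; this is how U-Topk can violate Exact-k.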

U-kRanks Semantics
  Given
    - A probabilistic relation R^p = ⟨R, p, C⟩
    - An injective scoring function s over R (no ties)
    - A non-negative integer k
  What is the top-k answer set over R^p? (Semantics)
  U-kRanks
    - For i = 1, 2, …, k, return the most probable i-th-ranked tuple across all possible worlds
    - U-kRanks does not satisfy all three properties

Smart Environment Example (U-kRanks)
  Query: Find two people in the lab last Saturday night
  Personnel  Biometrics score(face detection, voice detection, …)  Prob
  Aiden      score(0.70, 0.60, …) = 0.65                           …
  Bob        score(0.50, 0.60, …) = 0.55                           …
  Chris      score(0.50, 0.40, …) = 0.45                           …
  Possible worlds: {Aiden}, {Bob}, {Chris}, {Aiden, Bob, Chris}, {Aiden, Bob}, {Aiden, Chris}, {Bob, Chris}, each with its top-2
  U-kRanks Semantics:
    - e.g. Pr(Chris at rank-2) = …
    - Table of Pr(tuple at rank-1) and Pr(tuple at rank-2) for Aiden, Bob, and Chris; values …
    - Bob is highest at rank-1 and highest at rank-2
    - Top-2 Answer: {Bob}
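The rank probabilities can again be sketched by enumeration. This is an illustration under the same assumption as before: Chris's probability of 0.4 is not taken from the transcript. For each rank i the function accumulates the probability that each tuple sits at rank i, then picks the most probable tuple per rank.

```python
from itertools import product

# name -> (score, probability); Chris's 0.4 is an assumed value.
people = {"Aiden": (0.65, 0.3), "Bob": (0.55, 0.9), "Chris": (0.45, 0.4)}

def worlds(rel):
    """Enumerate possible worlds of a simple relation as (present tuples, probability)."""
    names = list(rel)
    for bits in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, present in zip(names, bits):
            p = rel[name][1]
            prob *= p if present else 1.0 - p
        yield [n for n, b in zip(names, bits) if b], prob

def u_kranks(rel, k):
    """For each rank 1..k, return the tuple most likely to occupy that rank."""
    at_rank = [dict.fromkeys(rel, 0.0) for _ in range(k)]
    for present, prob in worlds(rel):
        ranked = sorted(present, key=lambda n: rel[n][0], reverse=True)
        for i, name in enumerate(ranked[:k]):
            at_rank[i][name] += prob
    winners = [max(r, key=r.get) for r in at_rank]
    return winners, at_rank

winners, at_rank = u_kranks(people, 2)
print(winners)
print([{n: round(p, 3) for n, p in r.items()} for r in at_rank])
```

Bob wins both ranks here, so the answer set collapses to {Bob}, matching the slide.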

Properties
  Semantics    Exact-k  Faithfulness  Stability
  Global-Topk  Yes      Yes           Yes
  U-Topk       No       Yes/No*       Yes
  U-kRanks     No       No            No
  * Yes when the relation is simple, No otherwise
  Global-Topk: a better semantics

Challenges
  Given
    - A probabilistic relation R^p = ⟨R, p, C⟩
    - An injective scoring function s over R (no ties)
    - A non-negative integer k
  What is the top-k answer set over R^p? (Semantics): Global-Topk
  How to compute the top-k answer of R^p? (Query Evaluation)

Global-Topk in Simple Relation
  Given R^p = ⟨R, p, C⟩, a scoring function s, a non-negative integer k
  Assumptions
    - Tuples are independent, i.e. |C| = |R|
    - R = {t_1, t_2, …, t_n}, ordered by decreasing score, i.e. s(t_1) > s(t_2) > … > s(t_n)

Global-Topk in Simple Relation
  Query Evaluation
    - Recursion on P_{k,s}(t_i), the Global-Topk probability of tuple t_i
    - Dynamic Programming
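The recursion formula itself is not legible in this transcript, so the sketch below uses an equivalent counting formulation rather than the slide's exact recurrence: for independent tuples sorted by decreasing score, t_i is in the top-k of a world exactly when it is present and at most k-1 of the higher-scored tuples are present. The function name and the `count` array are hypothetical; the running time is O(kn).

```python
def global_topk_probs(probs, k):
    """Global-Topk probability of each tuple.

    probs: membership probabilities of independent tuples, listed in decreasing score order.
    """
    if k <= 0:
        return [0.0] * len(probs)
    # count[j] = Pr(exactly j of the tuples processed so far are present), for j < k
    count = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # the tuple is present and fewer than k higher-scored tuples are present
        result.append(p * sum(count))
        for j in range(k - 1, 0, -1):
            count[j] = count[j] * (1.0 - p) + count[j - 1] * p
        count[0] *= 1.0 - p
    return result

# Aiden, Bob, Chris in decreasing score order; Chris's 0.4 is an assumed value.
print(global_topk_probs([0.3, 0.9, 0.4], 2))
```

Picking the k largest entries of the returned list gives the Global-Topk answer for a simple relation.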

Optimization: Threshold Algorithm (TA) [Fagin, Lotem & Naor 2001]
  Given a system of objects, such that
    - For each object attribute, there is a sorted list ranking the objects in decreasing order of their score on that attribute
    - An aggregation function f combines the individual attribute scores x_i, i = 1, 2, …, m, to obtain the overall object score f(x_1, x_2, …, x_m)
    - f is monotonic: f(x_1, x_2, …, x_m) <= f(x'_1, x'_2, …, x'_m) whenever x_i <= x'_i for every i
  TA is cost-optimal in finding the top-k objects
  TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
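A compact sketch of the threshold algorithm, written under simplifying assumptions: every object appears in every attribute list, each list is given as a dict from object to score, and `f` is a monotone function over a list of scores (for example `sum`). The function name and return shape are hypothetical. The scan stops as soon as the k-th best aggregated score seen so far reaches the threshold, i.e. `f` applied to the scores at the current depth of each list.

```python
import heapq

def threshold_algorithm(lists, f, k):
    """Return (top-k (object, score) pairs, number of rows scanned per list)."""
    ranked = [sorted(a.items(), key=lambda kv: kv[1], reverse=True) for a in lists]
    best = {}                                          # object -> aggregated score
    for depth in range(len(ranked[0])):
        frontier = []
        for column in ranked:
            obj, score = column[depth]                 # sorted access
            frontier.append(score)
            if obj not in best:
                best[obj] = f([a[obj] for a in lists])  # random access to the other lists
        threshold = f(frontier)                        # upper bound for any unseen object
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1]), len(ranked[0])

a1 = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
a2 = {"a": 0.85, "b": 0.9, "c": 0.2, "d": 0.4}
print(threshold_algorithm([a1, a2], sum, 1))
```

In this small example the scan stops after two rows per list, without ever reading objects c and d.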

Applying TA Optimization Global-Topk Two attributes: probability & score Aggregation function: Global-Topk probability

Global-Topk in General Relation
  Given R^p = ⟨R, p, C⟩, a scoring function s, a non-negative integer k
  Assumptions
    - Tuples are independent or exclusive, i.e. |C| < |R|
    - R = {t_1, t_2, …, t_n}, ordered by decreasing score, i.e. s(t_1) > s(t_2) > … > s(t_n)

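Global-Topk in General Relation
  Induced Event Relation
    - For each tuple t in R, there is a probabilistic relation E^p = ⟨E, p^E, C^E⟩ generated by the following two rules
    - Rule 1: E contains an event tuple t_{e_t} with probability p(t)
    - Rule 2: for every other part C_i that holds a tuple scoring higher than t, E contains an event tuple t_{e_{C_i}} whose probability is the sum of the probabilities of those higher-scoring tuples in C_i
    - E^p is simple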

Sensor Network Example
  Prob. relation (general): columns Temp (F) and Prob, with parts C_1 (from Sensor 1) and C_2 (from Sensor 2); values …
  Induced event relation (simple), for example for a tuple t = … with p(t) = 0.6, where i = 1:
    Event      Prob          Generated by
    t_{e_C1}   0.6           Rule 2
    t_{e_t}    0.6 = p(t)    Rule 1

Global-Topk in General Relation

Evaluating Global-Topk in General Relation
  - For each tuple t, generate the corresponding induced event relation
  - Compute the Global-Topk probability of t by Theorem 4.3
  - Pick the k tuples with the highest Global-Topk probability
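The three steps can be sketched as follows. Theorem 4.3 is not legible in this transcript, so the code relies on a direct argument instead: a tuple t is in the top-k of a world when t is present and fewer than k other parts contribute a tuple that outscores t. Parts are assumed independent and tuples within a part exclusive. The function names, the encoding of parts, and the sensor readings are all assumptions for illustration.

```python
def at_most(event_probs, limit):
    """Probability that at most `limit` of the independent events occur."""
    if limit < 0:
        return 0.0
    count = [1.0] + [0.0] * limit   # count[j] = Pr(exactly j events so far occurred)
    for p in event_probs:
        for j in range(limit, 0, -1):
            count[j] = count[j] * (1.0 - p) + count[j - 1] * p
        count[0] *= 1.0 - p
    return sum(count)

def global_topk_general(parts, k):
    """parts: list of parts, each a list of (tuple_id, score, prob) of exclusive tuples."""
    result = {}
    for i, part in enumerate(parts):
        for tid, score, p in part:
            # Step 1: induced events, one per other part holding a higher-scored tuple.
            induced = []
            for j, other in enumerate(parts):
                if j == i:
                    continue
                higher = sum(q for _, s, q in other if s > score)
                if higher > 0:
                    induced.append(higher)
            # Step 2: t is present and at most k-1 induced events occur.
            result[tid] = p * at_most(induced, k - 1)
    # Step 3: keep the k tuples with the highest Global-Topk probability.
    answer = sorted(result, key=result.get, reverse=True)[:k]
    return answer, result

# Hypothetical readings: (id, temperature in F, confidence), grouped by sensor.
sensors = [[("s1_22F", 22, 0.6), ("s1_10F", 10, 0.4)],
           [("s2_25F", 25, 0.1), ("s2_15F", 15, 0.6)]]
print(global_topk_general(sensors, 1))
```

Each tuple costs O(kn) here, which gives O(kn^2) overall for n tuples.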

Summary on Query Evaluation
  Simple (Independent Tuples)
    - Dynamic Programming
    - Tuples are ordered by their scores
    - Recursion on the tuple index and k
  General (Independent/Exclusive Tuples)
    - Polynomial reduction to simple cases

Complexity
           Global-Topk  U-Topk                  U-kRanks
  Simple   O(kn)        O(kn)                   O(kn)
  General  O(kn^2)      Θ(m k n^(k-1) lg n)*    Ω(m n^(k-1))*
  * m is a rule-engine-related factor; it represents how complicated the relationships between tuples can be

Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

Conclusion
  - Three intuitive semantic properties for top-k queries in probabilistic databases
  - Global-Topk semantics, which satisfies all of the properties above
  - Query evaluation algorithms for Global-Topk in simple and general probabilistic databases

Future Problems
  Weak order scoring function
    - Allow ties
    - Not clear how to extend the properties
    - Not clear how to define the semantics (other than with an "arbitrary tie breaker")
  Preference Strength
    - Sensitivity to Score: given a prob. relation R^p, if the DB is sufficiently large, manipulating the scores of tuples should make it possible to obtain different answers
    - NOT satisfied by our semantics
    - NOT satisfied by any semantics in the literature
    - Need to consider preference strength in the semantics

Thank you !

Related Works
  Introduction to Probabilistic Databases
    - Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]
  Top-k Queries in Probabilistic Databases
    - On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]
    - Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]