On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases Presented by Xi Zhang February 8th, 2008
Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion
Outline Background Probabilistic database model Top-k queries & scoring functions Motivation Examples Top-k Queries in Probabilistic Databases Conclusion
Probabilistic Databases Motivation Uncertainty/vagueness/imprecision in data History Incomplete information in relational DB [Imielinski & Lipski 1984] Probabilistic DB model [Cavallo & Pittarelli 1987] Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.] Comeback Proliferation of uncertain data in real-world applications Examples: WWW, biological data, sensor networks, etc.
Probabilistic Database Model [Fuhr & Rölleke 1997] Probabilistic Database Model A generalization of relational DB Probabilistic Relational Algebra (PRA) A generalization of standard relational algebra
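A Table in a Probabilistic Database
DocTerm(DocNo, Term), with a probability (Prob) and a basic event per tuple:
(1, IR): e_DT(1, IR)
(2, DB): e_DT(2, DB)
(3, IR): e_DT(3, IR)
(3, DB): e_DT(3, DB)
(4, AI): e_DT(4, AI)
Each tuple carries an event expression. Basic events are independent.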
Probabilistic Relational Algebra Just like in Relational Algebra… Selection (σ), Projection (π), Join (⋈), Union (∪), Difference (−)
Selection
Selecting Term = IR from DocTerm yields (1, IR) with event e_DT(1, IR) and (3, IR) with event e_DT(3, IR).
In a derived table, each tuple carries a complex event: a propositional expression of basic events.
Projection
Projecting DocTerm onto Term yields:
IR: e_DT(1, IR) ∨ e_DT(3, IR)
DB: e_DT(2, DB) ∨ e_DT(3, DB)
AI: e_DT(4, AI)
Join
DocTerm(DocNo, Term): (1, IR) with basic event e_DT(1, IR); (2, DB) with e_DT(2, DB)
DocAu(DocNo, AName): (1, Bauer) with basic event e_DU(1, Bauer); (2, Meier) with e_DU(2, Meier)
Result (DocAu.DocNo, AName, DocTerm.DocNo, Term), where each probability is the product of two basic-event probabilities:
(1, Bauer, 1, IR): e_DU(1, Bauer) ∧ e_DT(1, IR), Prob 0.9 * …
(1, Bauer, 2, DB): e_DU(1, Bauer) ∧ e_DT(2, DB)
(2, Meier, 1, IR): e_DU(2, Meier) ∧ e_DT(1, IR)
(2, Meier, 2, DB): e_DU(2, Meier) ∧ e_DT(2, DB), Prob … * 0.7
Join + Projection
DocTerm basic events: e_DT(1, IR), e_DT(2, DB), e_DT(3, IR), e_DT(3, DB), e_DT(4, AI)
IR (DocNo): 1 with e_DT(1, IR); 3 with e_DT(3, IR)
DB (DocNo): 2 with e_DT(2, DB); 3 with e_DT(3, DB)
DocAu (DocNo, AName) basic events: e_DU(1, Bauer), e_DU(2, Bauer), e_DU(2, Meier), e_DU(2, Schmidt), e_DU(3, Schmidt), e_DU(3, Koch), e_DU(3, Bauer)
IR joined with DocAu, projected onto AName: Bauer with e_DU(1, Bauer) ∧ e_DT(1, IR); Schmidt with e_DU(3, S) ∧ e_DT(3, IR)
DB joined with DocAu, projected onto AName: Bauer with e_DU(2, Bauer) ∧ e_DT(2, DB); Meier with e_DU(2, Meier) ∧ e_DT(2, DB); Schmidt with (e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB))
Combined result (AName):
Bauer: (e_DU(1, B) ∧ e_DT(1, IR)) ∧ (e_DU(2, B) ∧ e_DT(2, DB)), Prob 0.81 * 0.21 = …
Schmidt: (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB))), Prob … * 0.91 = …
Join + Projection (cont.) How should the probabilities of these complex events be computed? Intensional Semantics vs. Extensional Semantics
Intensional vs. Extensional Intensional Semantics Assumes data independence of base tables Keeps track of data dependence during the evaluation Extensional Semantics Assumes data independence during the evaluation Can compute WRONG probabilities!
When Intensional = Extensional?
When no basic event occurs more than once in the event expression
Bauer: (e_DU(1, B) ∧ e_DT(1, IR)) ∧ (e_DU(2, B) ∧ e_DT(2, DB)), Prob 0.81 * 0.21 = …
Schmidt: (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB))), Prob … * 0.91 = …
Identical basic event: e_DU(3, S) occurs twice in Schmidt's expression
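The gap between the two semantics can be checked numerically. The sketch below evaluates Schmidt's complex event twice: exactly, by summing over all truth assignments of the basic events (intensional), and by multiplying through as if every intermediate result were independent (extensional). The probabilities in P are hypothetical values chosen for illustration, not the ones on the slides, and the function names are made up for this example.

```python
from itertools import product

# Hypothetical basic-event probabilities (assumed; the slide values are not used here).
P = {"DU(3,S)": 0.7, "DT(3,IR)": 0.8, "DU(2,S)": 0.9, "DT(2,DB)": 0.7, "DT(3,DB)": 0.5}

def schmidt_event(w):
    """Schmidt's complex event; w maps each basic event name to True/False."""
    ir = w["DU(3,S)"] and w["DT(3,IR)"]
    db = (w["DU(2,S)"] and w["DT(2,DB)"]) or (w["DU(3,S)"] and w["DT(3,DB)"])
    return ir and db

def intensional(event, probs):
    """Exact probability: add up the weight of every assignment that satisfies the event."""
    names = list(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if event(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total

def extensional(probs):
    """Evaluate operator by operator, assuming independence at every step."""
    p_ir = probs["DU(3,S)"] * probs["DT(3,IR)"]
    p_db2 = probs["DU(2,S)"] * probs["DT(2,DB)"]
    p_db3 = probs["DU(3,S)"] * probs["DT(3,DB)"]
    p_db = 1.0 - (1.0 - p_db2) * (1.0 - p_db3)  # OR of "independent" events
    return p_ir * p_db                           # AND of "independent" events

exact = intensional(schmidt_event, P)
approx = extensional(P)
print(exact, approx)
```

The two numbers differ because e_DU(3, S) appears on both sides of the conjunction, so the two sides are not independent. The enumeration is exponential in the number of basic events and is meant only as a check on small examples.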
Fuhr & Rölleke 1997 Summary Probabilistic DB Model Concept of event Basic vs. complex event Event expression Probabilistic Relational Algebra Just like in Relational Algebra… Computation of event probabilities Intensional vs. extensional semantics Yield the same result when there is NO data dependence in event expressions
Outline Background Probabilistic database model Top-k queries & scoring functions Motivation Examples Top-k Queries in Probabilistic Databases Semantics Query Evaluation Conclusion
Top-k Queries Traditionally, given Objects: o_1, o_2, …, o_n A non-negative integer: k A scoring function s Question: What are the k objects with the highest scores? Top-k queries have been studied for the Web, XML, and relational databases, and more recently for probabilistic databases.
Scoring Function A scoring function s over a deterministic relation R is a function s: R → ℝ For any t_i and t_j from R, t_i is ranked above t_j iff s(t_i) > s(t_j)
Outline Background Motivation Examples Smart Environment Example Sensor Network Example Top-k Queries in Probabilistic Databases Conclusion
Motivating Example I Smart Environment Sample Question “Who were the two visitors in the lab last Saturday night?” Data Biometric data from sensors We can measure how well those data match the profile of every candidate: a scoring function Historical statistics e.g. the probability of a certain candidate being in the lab on Saturday nights
Motivating Example I (cont.)
Columns: Personnel; Biometrics (Face Detection, Voice Detection, …); score(…); Prob (probability of being in lab on Saturday nights)
Aiden: score(0.70, 0.60, …) = 0.65
Bob: score(0.50, 0.60, …) = 0.55
Chris: score(0.50, 0.40, …) = 0.45
Question: Find two people in the lab last Saturday night
This is a Top-2 query over the above probabilistic database under the above scoring function
Motivating Example II Sensor Network in a Habitat Sample Question “What is the temperature of the warmest spot?” Data Sensor readings from different sensors At a sampling time, only one “real” reading from a sensor Each sensor reading comes with a confidence value
Motivating Example II (cont.)
Columns: Temp (°F), Prob; tuples are grouped into parts C_1 (from Sensor 1) and C_2 (from Sensor 2)
Question: What is the temperature of the warmest spot?
This is a Top-1 query over the above probabilistic database under a scoring function proportional to temperature
Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Semantics Query Evaluation Conclusion
Models A probabilistic relation R^p = ⟨R, p, C⟩ R: the support deterministic relation p: probability function C: a partition of R, such that the probabilities of the tuples in each part sum to at most 1 Simple vs. General probabilistic relation Simple Assume tuple independence, i.e. |C| = |R| E.g. smart environment example General Tuples can be independent or exclusive, i.e. |C| < |R| E.g. sensor network example
Challenges Given A probabilistic relation R^p = ⟨R, p, C⟩ An injective scoring function s over R (no ties) A non-negative integer k What is the top-k answer set over R^p? (Semantics) How to compute the top-k answer of R^p? (Query Evaluation)
What is a “Good” Semantics? Desired Properties Exact-k Faithfulness Stability
Properties Exact-k If R has at least k tuples, then exactly k tuples are returned as the top-k answer Faithfulness A “better” tuple, i.e. higher in score and probability, is more likely to be in the top-k answer, compared to a “worse” one Stability Raising the score/prob. of a winning tuple will not cause it to lose Lowering the score/prob. of a losing tuple will not cause it to win
Global-Topk Semantics Given A probabilistic relation R^p = ⟨R, p, C⟩ An injective scoring function s over R (no ties) A non-negative integer k What is the top-k answer set over R^p? (Semantics) Global-Topk Return the k highest-ranked tuples according to their probability of being in the top-k answers of possible worlds Global-Topk satisfies all three aforementioned properties
Smart Environment Example (Global-Topk)
Columns: Personnel; Biometrics (Face Detection, Voice Detection, …); Score(…); Prob
Aiden: Score(0.70, 0.60, …) = 0.65
Bob: Score(0.50, 0.60, …) = 0.55
Chris: Score(0.50, 0.40, …) = 0.45
Query: Find two people in the lab last Saturday night
Possible worlds: ∅, {Aiden}, {Bob}, {Chris}, {Aiden, Bob}, {Aiden, Chris}, {Bob, Chris}, {Aiden, Bob, Chris}, each with its top-2 tuples
Global-Topk Semantics: Pr(Aiden in top-2) = 0.3; Pr(Bob in top-2) = 0.9; Pr(Chris in top-2) = … = …
Top-2 Answer: {Aiden, Bob}
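The Global-Topk definition can be checked by brute force on this example: enumerate every possible world, take its top-k tuples, and accumulate each tuple's probability of appearing there. In the sketch below, Aiden's 0.3 and Bob's 0.9 are read off the top-2 probabilities on the slide (under tuple independence the two highest-scored tuples are in the top-2 exactly when they are present); Chris's 0.4 is an assumed value, and the function names are illustrative.

```python
from itertools import product

# (name, score, probability); Chris's probability is an assumption.
TUPLES = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def possible_worlds(tuples):
    """Yield (world, probability) pairs for independent tuples."""
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for t, here in zip(tuples, present):
            prob *= t[2] if here else 1.0 - t[2]
            if here:
                world.append(t)
        yield world, prob

def global_topk(tuples, k):
    """Return the k tuples most likely to be in a world's top-k, plus all membership probabilities."""
    membership = {t[0]: 0.0 for t in tuples}
    for world, prob in possible_worlds(tuples):
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        for t in top:
            membership[t[0]] += prob
    ranked = sorted(membership, key=membership.get, reverse=True)
    return ranked[:k], membership

answer, membership = global_topk(TUPLES, 2)
print(answer, membership)
```

Enumerating 2^n worlds is only practical for tiny relations; it serves here as a reference against which faster evaluation methods can be compared.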
Other Semantics Soliman, Ilyas & Chang 2007 Two Alternative Semantics U-Topk U-kRanks
U-Topk Semantics Given A probabilistic relation R^p = ⟨R, p, C⟩ An injective scoring function s over R (no ties) A non-negative integer k What is the top-k answer set over R^p? (Semantics) U-Topk Return the most probable top-k answer set that belongs to possible worlds U-Topk does not satisfy all three properties
Smart Environment Example (U-Topk)
Columns: Personnel; Biometrics (Face Detection, Voice Detection, …); Score(…); Prob
Aiden: Score(0.70, 0.60, …) = 0.65
Bob: Score(0.50, 0.60, …) = 0.55
Chris: Score(0.50, 0.40, …) = 0.45
Query: Find two people in the lab last Saturday night
Possible worlds: ∅, {Aiden}, {Bob}, {Chris}, {Aiden, Bob}, {Aiden, Chris}, {Bob, Chris}, {Aiden, Bob, Chris}, each with its top-2 tuples
U-Topk Semantics: Pr({Aiden, Bob}) = … = 0.27; …; Pr({Bob}) = …
Top-2 Answer: {Bob}
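A brute-force sketch of U-Topk on the same example: group the possible worlds by their top-k answer (which may hold fewer than k tuples) and return the answer with the largest total probability. As before, Aiden's 0.3 and Bob's 0.9 are consistent with the slide's Pr({Aiden, Bob}) = 0.27, Chris's 0.4 is an assumed value, and the function name is illustrative.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability); Chris's probability is an assumption.
TUPLES = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def u_topk(tuples, k):
    """Return the most probable top-k answer and the probability of every candidate answer."""
    answer_prob = defaultdict(float)
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for t, here in zip(tuples, present):
            prob *= t[2] if here else 1.0 - t[2]
            if here:
                world.append(t)
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        # Worlds sharing the same top-k answer pool their probability.
        answer_prob[tuple(t[0] for t in top)] += prob
    best = max(answer_prob, key=answer_prob.get)
    return best, dict(answer_prob)

best, answer_prob = u_topk(TUPLES, 2)
print(best, answer_prob[best])
```

With these numbers the single-tuple answer containing only Bob outweighs every two-tuple answer, which is how U-Topk ends up returning fewer than k tuples for a top-2 query.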
U-kRanks Semantics Given A probabilistic relation R^p = ⟨R, p, C⟩ An injective scoring function s over R (no ties) A non-negative integer k What is the top-k answer set over R^p? (Semantics) U-kRanks For i = 1, 2, …, k, return the most probable i-th ranked tuple across all possible worlds U-kRanks does not satisfy all three properties
Smart Environment Example (U-kRanks)
Columns: Personnel; Biometrics (Face Detection, Voice Detection, …); Score(…); Prob
Aiden: Score(0.70, 0.60, …) = 0.65
Bob: Score(0.50, 0.60, …) = 0.55
Chris: Score(0.50, 0.40, …) = 0.45
Query: Find two people in the lab last Saturday night
Possible worlds: ∅, {Aiden}, {Bob}, {Chris}, {Aiden, Bob}, {Aiden, Chris}, {Bob, Chris}, {Aiden, Bob, Chris}, each with its top-2 tuples
U-kRanks Semantics: table of rank probabilities (rows Rank-1 and Rank-2; columns Aiden, Bob, Chris), e.g. Pr(Chris at rank-2) = … = …; the highest entry at rank-1 and the highest entry at rank-2 are marked
Top-2 Answer: {Bob}
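U-kRanks can be sketched the same way: for each rank i, accumulate the probability that each tuple sits at rank i in a possible world, then pick the per-rank winner. Aiden's 0.3 and Bob's 0.9 match the earlier slides, Chris's 0.4 is an assumed value, and the function name is illustrative.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability); Chris's probability is an assumption.
TUPLES = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def u_kranks(tuples, k):
    """Return the most probable tuple at each rank 1..k and the per-rank probabilities."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for t, here in zip(tuples, present):
            prob *= t[2] if here else 1.0 - t[2]
            if here:
                world.append(t)
        ranked = sorted(world, key=lambda t: t[1], reverse=True)
        for i, t in enumerate(ranked[:k]):
            rank_prob[i][t[0]] += prob
    # A rank with no candidates (k larger than the relation) is skipped.
    winners = [max(rp, key=rp.get) for rp in rank_prob if rp]
    return winners, [dict(rp) for rp in rank_prob]

winners, rank_prob = u_kranks(TUPLES, 2)
print(winners, rank_prob)
```

Under these numbers the same tuple wins both ranks, so the answer set collapses to a single tuple, matching the {Bob} answer on the slide.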
Properties
Semantics     Exact-k   Faithfulness   Stability
Global-Topk   Yes       Yes            Yes
U-Topk        No        Yes/No*        Yes
U-kRanks      No        No             No
* Yes when the relation is simple, No otherwise
Global-Topk: a better semantics
Challenges Given A probabilistic relation R^p = ⟨R, p, C⟩ An injective scoring function s over R (no ties) A non-negative integer k What is the top-k answer set over R^p? (Semantics): Global-Topk How to compute the top-k answer of R^p? (Query Evaluation)
Global-Topk in Simple Relation Given R^p = ⟨R, p, C⟩, a scoring function s, a non-negative integer k Assumptions Tuples are independent, i.e. |C| = |R| R = {t_1, t_2, …, t_n}, in decreasing order of score, i.e. s(t_1) > s(t_2) > … > s(t_n)
Global-Topk in Simple Relation Query Evaluation Recursion P_{k,s}(t_i): Global-Topk probability of tuple t_i Dynamic Programming
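One way to set up such a recursion, assuming independent tuples sorted by decreasing score: t_i is in the top-k of a world exactly when t_i is present and fewer than k of t_1, …, t_{i-1} are present. Writing q(j, i) for the probability that fewer than j of the first i-1 tuples are present, conditioning on t_{i-1} gives q(j, i) = (1 - p(t_{i-1})) q(j, i-1) + p(t_{i-1}) q(j-1, i-1), and P_{k,s}(t_i) = p(t_i) q(k, i). The sketch below is a derivation under those assumptions rather than a transcription of the slide's formula; the function name is hypothetical, and 0.4 for the third tuple is an assumed value.

```python
def global_topk_probabilities(probs, k):
    """probs[i] is the probability of the (i+1)-th highest-scored tuple.

    Returns the Global-Topk probability of every tuple in O(k * n) time.
    """
    n = len(probs)
    if k <= 0:
        return [0.0] * n
    # fewer[j] = Pr(fewer than j of the tuples processed so far are present).
    # With no tuples processed: impossible for j = 0, certain for j >= 1.
    fewer = [0.0] + [1.0] * k
    result = []
    for p in probs:
        result.append(p * fewer[k])
        # Fold the current tuple in. Iterating j downwards keeps fewer[j - 1]
        # at its value from before this tuple was added.
        for j in range(k, 0, -1):
            fewer[j] = (1.0 - p) * fewer[j] + p * fewer[j - 1]
    return result

# Smart environment example in score order: Aiden, Bob, Chris (0.4 is assumed).
probabilities = global_topk_probabilities([0.3, 0.9, 0.4], 2)
top2 = sorted(range(3), key=lambda i: probabilities[i], reverse=True)[:2]
print(probabilities, top2)
```

Only one row of the table is kept at a time, so the extra memory is O(k) on top of the output.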
Optimization Threshold Algorithm (TA) [Fagin, Lotem & Naor 2001] Given a system of objects, such that For each object attribute, there is a sorted list ranking objects in decreasing order of their score on that attribute An aggregation function f combines individual attribute scores x_i, i = 1, 2, …, m, to obtain the overall object score f(x_1, x_2, …, x_m) f is monotonic: f(x_1, x_2, …, x_m) ≤ f(x'_1, x'_2, …, x'_m) whenever x_i ≤ x'_i for every i TA is cost-optimal in finding the top-k objects TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
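A minimal sketch of the threshold idea, assuming every object appears in every attribute list and that random access to any attribute score is available. It scans the sorted lists in parallel, scores each newly seen object in full, and stops once k seen objects score at least f applied to the last scores read. The data, the function name, and the choice of sum as the aggregation function are illustrative assumptions.

```python
import heapq

def threshold_algorithm(lists, f, k):
    """lists: one dict per attribute, mapping object -> score on that attribute.
    f: monotonic aggregation function taking a list of m scores.
    Returns (top-k (object, score) pairs, number of rows read per list)."""
    sorted_lists = [sorted(d.items(), key=lambda kv: kv[1], reverse=True) for d in lists]
    depth = len(sorted_lists[0])
    seen = {}
    for row in range(depth):
        last = []
        for sl in sorted_lists:
            obj, score = sl[row]                          # sorted access
            last.append(score)
            if obj not in seen:
                seen[obj] = f([d[obj] for d in lists])    # random access
        threshold = f(last)  # no unseen object can score higher than this
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) >= k and best[-1][1] >= threshold:
            return best, row + 1
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth

attr1 = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
attr2 = {"a": 0.8, "b": 0.9, "c": 0.4, "d": 0.2}
top, rows_read = threshold_algorithm([attr1, attr2], sum, 2)
print(top, rows_read)
```

Monotonicity of f is what justifies the stopping test: any object not yet seen has every attribute score at or below the last one read from the corresponding list.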
Applying TA Optimization Global-Topk Two attributes: probability & score Aggregation function: Global-Topk probability
Global-Topk in General Relation Given R^p = ⟨R, p, C⟩, a scoring function s, a non-negative integer k Assumptions Tuples are independent or exclusive, i.e. |C| < |R| R = {t_1, t_2, …, t_n}, in decreasing order of score, i.e. s(t_1) > s(t_2) > … > s(t_n)
Global-Topk in General Relation Induced Event Relation For each tuple t in R, there is a probabilistic relation E^p generated by the following two rules E^p is simple
Sensor Network Example
Prob. relation (general): columns Temp (°F), Prob; parts C_1 (from Sensor 1) and C_2 (from Sensor 2)
Induced event relation (simple) for an example tuple t, generated by Rule 1 and Rule 2: events t_eC1 (Prob 0.6, where i = 1) and t_et (Prob 0.6 = p(t))
Global-Topk in General Relation
Evaluating Global-Topk in General Relation For each tuple t, generate corresponding induced event relation Compute the Global-Topk probability of t by Theorem 4.3 Pick the k tuples with the highest Global-Topk probability
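The three steps can be sketched as follows, under one reading of the induced event relation: for a tuple t, one event stands for t itself with probability p(t), and for every other part there is one event whose probability is the total probability of that part's tuples scoring above t. Tuples in t's own part are exclusive with t, and each other part contributes at most one tuple, so t is in the top-k when t is present and fewer than k of those part events occur. That reading, the tuple ids, and the temperature and probability values are assumptions for illustration; a brute-force enumeration is included as a cross-check.

```python
from itertools import product

# (id, score, probability, part); illustrative values only.
TUPLES = [("t1", 22, 0.6, "C1"), ("t2", 10, 0.4, "C1"),
          ("t3", 25, 0.1, "C2"), ("t4", 15, 0.6, "C2")]

def fewer_than_k(event_probs, k):
    """Pr(fewer than k of the given independent events occur)."""
    fewer = [0.0] + [1.0] * k
    for p in event_probs:
        for j in range(k, 0, -1):
            fewer[j] = (1.0 - p) * fewer[j] + p * fewer[j - 1]
    return fewer[k]

def global_topk_general(tuples, k):
    probs = {}
    for tid, score, p, part in tuples:
        # Induced events: one per other part that holds a higher-scored tuple.
        higher = {}
        for _, s2, p2, part2 in tuples:
            if part2 != part and s2 > score:
                higher[part2] = higher.get(part2, 0.0) + p2
        probs[tid] = p * fewer_than_k(higher.values(), k)
    return sorted(probs, key=probs.get, reverse=True)[:k], probs

def brute_force(tuples, k):
    """Reference: enumerate worlds, choosing at most one tuple per part."""
    parts = {}
    for t in tuples:
        parts.setdefault(t[3], []).append(t)
    choices = [[(t, t[2]) for t in ms] + [(None, 1.0 - sum(t[2] for t in ms))]
               for ms in parts.values()]
    probs = {t[0]: 0.0 for t in tuples}
    for combo in product(*choices):
        weight = 1.0
        for _, p in combo:
            weight *= p
        world = [t for t, _ in combo if t is not None]
        for t in sorted(world, key=lambda t: t[1], reverse=True)[:k]:
            probs[t[0]] += weight
    return probs

answer, probs = global_topk_general(TUPLES, 1)
reference = brute_force(TUPLES, 1)
print(answer, probs)
```

Each tuple needs one pass over the relation to build its events and an O(kn) pass over them, which is in line with a quadratic overall cost in n.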
Summary on Query Evaluation Simple (Independent Tuples) Dynamic Programming Tuples are ordered on their scores Recursion on the tuple index and k General (Independent/Exclusive Tuples) Polynomial reduction to simple cases
Complexity
Simple relation: O(kn)
General relation: Global-Topk O(kn^2); U-Topk Θ(m k n^(k-1) lg n)*; U-kRanks Ω(m n^(k-1))*
* m is a rule-engine-related factor; it represents how complicated the relationships between tuples can be
Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion
Three intuitive semantic properties for top-k queries in probabilistic databases Global-Topk semantics, which satisfies all the properties above Query evaluation algorithms for Global-Topk in simple and general probabilistic databases
Future Problems Weak order scoring function Allow ties Not clear how to extend the properties Not clear how to define the semantics (other than “arbitrary tie breaker”) Preference Strength Sensitivity to Score Given a prob. relation R^p, if the DB is sufficiently large, then by manipulating the scores of tuples we would be able to get different answers NOT satisfied by our semantics NOT satisfied by any semantics in the literature Need to consider preference strength in the semantics
Thank you!
Related Works Introduction to Probabilistic Databases Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997] Top-k Queries in Probabilistic Databases On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008] Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]