Probabilistic Data Management


Probabilistic Data Management Chapter 8: Probabilistic Query Answering (6)

Objectives
In this chapter, you will:
  Explore the definitions of more probabilistic query types
    Probabilistic top-k query

Recall: Probabilistic Query Types
Uncertain/probabilistic database
  Probabilistic Spatial Query
    Probabilistic range query
    Probabilistic k-nearest neighbor query
    Probabilistic group nearest neighbor (PGNN) query
    Probabilistic reverse k-nearest neighbor query
    Probabilistic spatial join / similarity join
  Probabilistic Preference Query
    Probabilistic top-k query (or ranked query)
    Probabilistic skyline query
    Probabilistic reverse skyline query

Motivation Example
In a coal mine surveillance application, a number of sensors are deployed to detect gas density, temperature, and so on.
Assume we have a preference function f(O) = O.temp + O.den.
Top-k query: retrieve the k sensors with the highest scores (the most dangerous ones).

Motivation Example (cont'd)
Sensor data usually contain noise.
The reported data can be modeled as uncertain objects.
Goal: obtain top-k query answers over uncertain data with high confidence.

Background of Probabilistic Top-k Query
Under possible worlds semantics:
  Each tuple t is associated with a score t.score
  Each tuple t is associated with an existence probability t.prob
[Figure: the query is answered in each possible world]

Different Semantics of Probabilistic Top-k Query
Top-k query in probabilistic databases:
  Consider each possible world and retrieve its top-k answers
  Aggregate the top-k answers (weighted by the probabilities of the possible worlds)
Aggregation semantics:
  Uncertain Top-k (U-Topk) [Soliman et al., ICDE 2007]
  Uncertain Rank-k (U-kRanks) [Soliman et al., ICDE 2007]
  Probabilistic Threshold Top-k (PT(h)) [Hua et al., SIGMOD 2008]
  Expected Ranks (Exp-Rank) [Cormode et al., ICDE 2009]
  Expected Score (E-Score) [Cormode et al., ICDE 2009]

Uncertain Top-k (U-Topk) [Soliman et al., ICDE 2007]
Group the possible worlds by their top-k answer vectors.
Find the one top-k answer vector whose possible worlds have the highest total probability.
[Figure: probabilistic database -> possible worlds -> top-k answer vectors -> U-Topk answers]

Example of U-Topk
Given the uncertain database and k=2

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules (tuples in the same rule are mutually exclusive)
R1  { t1 }
R2  { t2, t4 }
R3  { t3 }

Possible World (W)   Pr(W)
{ t1, t2, t3 }       P(t1)P(t2)P(t3) = 0.2
{ t1, t3, t4 }       P(t1)P(t3)P(t4) = 0.2
{ t2, t3 }           (1-P(t1))P(t2)P(t3) = 0.3
{ t3, t4 }           (1-P(t1))P(t3)P(t4) = 0.3

Probability of each top-2 answer vector:
Pr[{ t1, t2 }] = 0.2
Pr[{ t1, t3 }] = 0.2
Pr[{ t2, t3 }] = 0.3
Pr[{ t3, t4 }] = 0.3

Final Result: {t2, t3} or {t3, t4}
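The example above can be reproduced with a short script. This is a minimal sketch, not the algorithm from the cited paper: it enumerates every possible world by brute force, which is only feasible for tiny databases. The names `TUPLES`, `RULES`, `possible_worlds` and `u_topk` are illustrative, and P(t4) = 0.5 is an assumption inferred from the world probabilities listed on the slide.

```python
from collections import defaultdict
from itertools import product

# Example database from the slide: tuple id -> (score, existence probability).
# P(t4) = 0.5 is an assumption inferred from the listed world probabilities.
TUPLES = {"t1": (100, 0.4), "t2": (85, 0.5), "t3": (70, 1.0), "t4": (60, 0.5)}
# Generation rules: at most one tuple of each rule appears in a world.
RULES = [["t1"], ["t2", "t4"], ["t3"]]


def possible_worlds(tuples, rules):
    """Yield (world, probability); a world lists tuple ids by descending score."""
    options = []
    for rule in rules:
        choices = [(t, tuples[t][1]) for t in rule]
        absent = 1.0 - sum(p for _, p in choices)
        if absent > 1e-12:  # the rule may contribute no tuple at all
            choices.append((None, absent))
        options.append(choices)
    for combo in product(*options):
        prob = 1.0
        world = []
        for t, p in combo:
            prob *= p
            if t is not None:
                world.append(t)
        world.sort(key=lambda t: -tuples[t][0])
        yield world, prob


def u_topk(tuples, rules, k):
    """Return the top-k answer vector(s) with the highest total probability."""
    vector_prob = defaultdict(float)
    for world, prob in possible_worlds(tuples, rules):
        vector_prob[tuple(world[:k])] += prob
    best = max(vector_prob.values())
    return {v: p for v, p in vector_prob.items() if abs(p - best) < 1e-9}


if __name__ == "__main__":
    for vector, prob in u_topk(TUPLES, RULES, k=2).items():
        print(vector, round(prob, 3))
```

Ties are returned together, which matches the slide's "{t2, t3} or {t3, t4}" answer.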

Uncertain Rank-k (U-kRanks) [Soliman et al., ICDE 2007]
For each $j \in [1, k]$, group the possible worlds by the tuple at the j-th rank.
For each $j \in [1, k]$, find the one tuple that has the j-th rank in possible worlds with the highest probability.
[Figure: probabilistic database -> possible worlds -> tuple with the j-th rank -> U-kRanks answers]

Example of U-kRanks
Given the uncertain database and k=2

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules
R1  { t1 }
R2  { t2, t4 }
R3  { t3 }

Possible World (W)   Pr(W)
{ t1, t2, t3 }       P(t1)P(t2)P(t3) = 0.2
{ t1, t3, t4 }       P(t1)P(t3)P(t4) = 0.2
{ t2, t3 }           (1-P(t1))P(t2)P(t3) = 0.3
{ t3, t4 }           (1-P(t1))P(t3)P(t4) = 0.3

At rank i = 1: Pr[t1] = 0.4, Pr[t2] = 0.3, Pr[t3] = 0.3
At rank i = 2: Pr[t2] = 0.2, Pr[t3] = 0.5, Pr[t4] = 0.3

Final Result: {t1, t3}
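The rank-by-rank aggregation in this example can be sketched as follows. The code works directly on the four possible worlds listed above rather than on the rules; `WORLDS` and `u_kranks` are illustrative names, and the brute-force pass over all worlds is an assumption made for clarity, not the cited paper's algorithm.

```python
from collections import defaultdict

# The four possible worlds from the slide, each sorted by descending score.
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]


def rank_probabilities(worlds, j):
    """Pr(tuple is at 1-based rank j), aggregated over all worlds."""
    prob = defaultdict(float)
    for world, p in worlds:
        if j <= len(world):
            prob[world[j - 1]] += p
    return dict(prob)


def u_kranks(worlds, k):
    """For each rank 1..k, pick the tuple most likely to hold that rank."""
    answer = []
    for j in range(1, k + 1):
        prob = rank_probabilities(worlds, j)
        answer.append(max(prob, key=prob.get))
    return answer


if __name__ == "__main__":
    print(u_kranks(WORLDS, k=2))
```

Because each rank is decided independently, the same tuple can in general win more than one rank; that does not happen in this example.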

Probabilistic Threshold Top-k (PT(h)) [Hua et al., SIGMOD 2008]
Group the possible worlds by the tuples in their top-h answer sets.
Find the k tuples that are in the top-h answer sets of possible worlds with the highest probabilities.
[Figure: probabilistic database -> possible worlds -> top-h answer sets -> PT(h) answers]

Example of PT-k
Given the uncertain database, k=2, and threshold = 0.5

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules
R1  { t1 }
R2  { t2, t4 }
R3  { t3 }

Possible World (W)   Pr(W)
{ t1, t2, t3 }       P(t1)P(t2)P(t3) = 0.2
{ t1, t3, t4 }       P(t1)P(t3)P(t4) = 0.2
{ t2, t3 }           (1-P(t1))P(t2)P(t3) = 0.3
{ t3, t4 }           (1-P(t1))P(t3)P(t4) = 0.3

Probability of each tuple being in the top-2:
Pr[t1] = 0.4
Pr[t2] = 0.5
Pr[t3] = 0.8
Pr[t4] = 0.3

Final Result (top-2 probability at least 0.5): {t2, t3}
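A minimal sketch of the threshold computation in this example, again working on the listed possible worlds. `WORLDS`, `topk_probabilities` and `pt_k` are illustrative names; treating the threshold as inclusive (at least 0.5) is an assumption that is consistent with t2 appearing in the slide's result.

```python
from collections import defaultdict

# The four possible worlds from the slide, each sorted by descending score.
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]


def topk_probabilities(worlds, k):
    """Pr(tuple is among the top-k of a world), summed over all worlds."""
    prob = defaultdict(float)
    for world, p in worlds:
        for t in world[:k]:
            prob[t] += p
    return dict(prob)


def pt_k(worlds, k, threshold):
    """Tuples whose top-k probability reaches the threshold (inclusive)."""
    prob = topk_probabilities(worlds, k)
    # Small tolerance so that floating-point sums equal to the threshold pass.
    return sorted(t for t, p in prob.items() if p >= threshold - 1e-9)


if __name__ == "__main__":
    print(topk_probabilities(WORLDS, k=2))
    print(pt_k(WORLDS, k=2, threshold=0.5))
```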

Expected Ranks (Exp-Rank) [Cormode et al., ICDE 2009]
Expected rank of t1: $\sum_{pw} r_{pw}(t_1)\,\Pr(pw)$, where $r_{pw}(t_1)$ is the rank of t1 in possible world pw.
Find the k tuples with the best (numerically smallest) expected ranks.
[Figure: probabilistic database (tuples t1, t2, ... with alternatives) -> possible worlds]

Expected Score (E-Score) [Cormode et al., ICDE 2009]
Expected score of t1: $\sum_{pw} \mathrm{score}(t_1)\,\Pr(pw)$
Find the k tuples with the highest expected scores.
[Figure: probabilistic database (tuples t1, t2, ... with alternatives) -> possible worlds]

Example of Expected Ranks
If a tuple doesn't appear in a world, its rank is considered to be the last one.
Given the uncertain database and k=2

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules
R1  { t1 }
R2  { t2, t4 }
R3  { t3 }

Possible World (W)   Pr(W)
{ t1, t2, t3 }       P(t1)P(t2)P(t3) = 0.2
{ t1, t3, t4 }       P(t1)P(t3)P(t4) = 0.2
{ t2, t3 }           (1-P(t1))P(t2)P(t3) = 0.3
{ t3, t4 }           (1-P(t1))P(t3)P(t4) = 0.3

E[R(t1)] = 1×0.2 + 1×0.2 + 3×0.3 + 3×0.3 = 2.2
E[R(t2)] = 2.4
E[R(t3)] = 1.9
E[R(t4)] = 2.9

Final Result: {t3, t1}
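The expected ranks above can be checked with a few lines of code. This sketch assumes, as the worked value for E[R(t1)] suggests, that ranks are 1-based and that a tuple missing from a world W is given rank |W| + 1. `WORLDS` and `expected_ranks` are illustrative names.

```python
# The four possible worlds from the slide, each sorted by descending score.
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]


def expected_ranks(worlds):
    """Expected 1-based rank of every tuple over the possible worlds."""
    tuples = sorted({t for world, _ in worlds for t in world})
    expected = {t: 0.0 for t in tuples}
    for world, prob in worlds:
        for t in tuples:
            # A tuple absent from the world is placed after all present tuples.
            rank = world.index(t) + 1 if t in world else len(world) + 1
            expected[t] += rank * prob
    return expected


def exp_rank_topk(worlds, k):
    """The k tuples with the smallest expected rank."""
    expected = expected_ranks(worlds)
    return sorted(expected, key=expected.get)[:k]


if __name__ == "__main__":
    print(expected_ranks(WORLDS))
    print(exp_rank_topk(WORLDS, k=2))
```

For t2, for instance, the four worlds contribute 2×0.2 + 4×0.2 + 1×0.3 + 3×0.3 = 2.4, matching the slide.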

Unified Ranking Functions
Parameterized Ranking Function (PRF)
A probabilistic top-k query returns the k tuples with the highest $|g_w|$ values, where w is the weight function.
Li, J., Saha, B., Deshpande, A. A Unified Approach to Ranking in Probabilistic Databases. In VLDB, 2009.
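The defining formula of the PRF did not survive in the transcript. As a hedged reconstruction, based on the definition in the cited paper and written with the symbol $g_w$ used in the text (the exact symbol on the slide is an assumption), it can be stated as:

```latex
g_w(t) \;=\; \sum_{W \,:\, t \in W} w\bigl(t, r_W(t)\bigr)\,\Pr(W)
       \;=\; \sum_{i > 0} w(t, i)\,\Pr\bigl(r(t) = i\bigr)
```

Here $r_W(t)$ denotes the rank of tuple t in possible world W, and $\Pr(r(t) = i)$ is the total probability of the worlds in which t is ranked i-th. The second form groups the worlds by the rank of t, so the PRF value depends only on the rank distribution of t and the weight function w.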

Unified Ranking Functions (cont'd)
When w(t, i) = 1, the result is the set of k tuples with the highest existence probabilities
When w(t, i) = score(t), the result is E-Score
Other choices of w(t, i) yield PT(h) and U-Rank
PRF cannot simulate U-Topk

Unified Ranking Functions (cont'd)
Two new semantics: PRFw(h) and PRFe(h)
  PRFw(h): $w(t, i) = w_i$ for $i \le h$, and $w(t, i) = 0$ for $i > h$
  PRFe(h): $w(t, i) = \alpha^i$, where $\alpha$ can be a real or complex number
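To make the role of the weight function concrete, the sketch below evaluates a PRF on the possible worlds of the lecture's running example, assuming the PRF value of a tuple t is the sum over ranks i of w(t, i) · Pr(r(t) = i). The indicator weights used for PT(h) and U-Rank are assumptions (the corresponding formulas are missing from the transcript); they follow from that sum, since they select exactly Pr(r(t) ≤ h) and Pr(r(t) = j). All names (`WORLDS`, `SCORE`, `prf`, `top_k`) and the parameter values are illustrative.

```python
from collections import defaultdict

# Possible worlds of the running example, each sorted by descending score.
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]
SCORE = {"t1": 100, "t2": 85, "t3": 70, "t4": 60}


def rank_distribution(worlds):
    """Return {tuple: {rank: Pr(r(tuple) = rank)}} with 1-based ranks."""
    dist = defaultdict(lambda: defaultdict(float))
    for world, prob in worlds:
        for rank, t in enumerate(world, start=1):
            dist[t][rank] += prob
    return dist


def prf(worlds, weight):
    """PRF value per tuple: sum over ranks i of weight(t, i) * Pr(r(t) = i)."""
    dist = rank_distribution(worlds)
    return {t: sum(weight(t, i) * p for i, p in ranks.items())
            for t, ranks in dist.items()}


def top_k(values, k):
    """The k tuples with the largest absolute PRF values."""
    return sorted(values, key=lambda t: -abs(values[t]))[:k]


h, j, alpha = 2, 2, 0.5  # illustrative parameter choices
existence = prf(WORLDS, lambda t, i: 1.0)                    # w(t, i) = 1
e_score = prf(WORLDS, lambda t, i: SCORE[t])                 # w(t, i) = score(t)
pt_h = prf(WORLDS, lambda t, i: 1.0 if i <= h else 0.0)      # assumed PT(h) weight
u_rank_j = prf(WORLDS, lambda t, i: 1.0 if i == j else 0.0)  # assumed U-Rank weight
prf_e = prf(WORLDS, lambda t, i: alpha ** i)                 # PRFe: w(t, i) = alpha^i

if __name__ == "__main__":
    print(top_k(pt_h, 2), top_k(u_rank_j, 1), top_k(prf_e, 2))
```

Using the absolute value in `top_k` keeps the ranking well defined when α is complex, which the PRFe definition allows.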

Ranking Algorithms
Assuming tuple independence
Compute the probability that a tuple $t_i$ has the j-th rank
Observation: the coefficient $c_j$ of $x^j$ in a generating function $F^i(x)$ is exactly the probability that $t_i$ is at rank j

Example
Consider the rank of a tuple t3 [formula: … 0.4x …]
Incremental computation of $F^i(x)$
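The generating-function idea can be sketched in code. The form assumed here is not shown in the surviving text: with independent tuples sorted by descending score, F^i(x) is taken to be the product of (1 − Pr(t) + Pr(t)·x) over all tuples t scored higher than t_i, multiplied by Pr(t_i)·x. The product over higher-scored tuples is kept as a running prefix, so each F^i is obtained incrementally from the previous one. The probabilities in `PROBS` are hypothetical and are not the slide's example.

```python
def rank_distributions(probs):
    """Rank distributions for independent tuples.

    probs[i] is the existence probability of the i-th tuple in descending
    score order. The i-th entry of the result maps each 1-based rank j to
    Pr(tuple i is at rank j), i.e. the coefficient of x^j in F^i(x).
    """
    # Coefficients (in x) of the product over higher-scored tuples t of
    # (1 - Pr(t) + Pr(t) * x); prefix[d] is the coefficient of x^d.
    prefix = [1.0]
    result = []
    for p in probs:
        # F^i(x) = prefix(x) * p * x, so the x^j coefficient is prefix[j-1] * p.
        result.append({d + 1: c * p for d, c in enumerate(prefix)})
        # Incremental step: fold this tuple into the prefix for the next one.
        nxt = [0.0] * (len(prefix) + 1)
        for d, c in enumerate(prefix):
            nxt[d] += c * (1.0 - p)   # tuple absent: degree unchanged
            nxt[d + 1] += c * p       # tuple present: one more tuple ahead
        prefix = nxt
    return result


# Hypothetical existence probabilities, highest score first.
PROBS = [0.5, 0.6, 0.4]

if __name__ == "__main__":
    for i, dist in enumerate(rank_distributions(PROBS), start=1):
        print(f"t{i}", {j: round(c, 3) for j, c in dist.items()})
```

For the third tuple this expands (0.5 + 0.5x)(0.4 + 0.6x)(0.4x) = 0.08x + 0.2x² + 0.12x³, so it is ranked first, second, or third with probability 0.08, 0.2, and 0.12 respectively.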

Ranking Algorithms (cont'd)
Assuming a correlated database represented by an and/xor tree
Generating functions on the and/xor tree
Observation: the coefficient $c_j$ of the term $x^{j-1}y$ is $\Pr(r(t_i) = j)$
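A hedged sketch of the bivariate generating function for the simplest and/xor tree: a root "and" node over the generation rules, each rule being a "xor" node over its mutually exclusive tuples. The leaf labelling is an assumption in the spirit of the observation above: x for tuples scored higher than the target t_i, y for t_i itself, and 1 for all other tuples. The data are the lecture's running example (with P(t4) = 0.5 inferred from its world probabilities); all function names are illustrative.

```python
# Running example: tuple id -> (score, existence probability).
TUPLES = {"t1": (100, 0.4), "t2": (85, 0.5), "t3": (70, 1.0), "t4": (60, 0.5)}
RULES = [["t1"], ["t2", "t4"], ["t3"]]  # xor nodes under one and node


def poly_add(a, b):
    """Add two polynomials in x given as coefficient lists."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0.0) + (b[i] if i < len(b) else 0.0)
            for i in range(n)]


def poly_mul(a, b):
    """Multiply two polynomials in x given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            out[i + j] += ca * cb
    return out


def rank_probs(target):
    """Return {j: Pr(r(target) = j)} from F(x, y) = A(x) + B(x) * y."""
    A, B = [1.0], [0.0]
    for rule in RULES:
        # Xor node: constant term is the probability that no tuple is chosen.
        a, b = [1.0 - sum(TUPLES[t][1] for t in rule)], [0.0]
        for t in rule:
            score, p = TUPLES[t]
            if t == target:
                b = poly_add(b, [p])        # leaf labelled y
            elif score > TUPLES[target][0]:
                a = poly_add(a, [0.0, p])   # leaf labelled x
            else:
                a = poly_add(a, [p])        # leaf labelled 1
        # And node: (A + B*y) * (a + b*y); the y^2 term is always zero
        # because the target occurs in exactly one rule.
        A, B = poly_mul(A, a), poly_add(poly_mul(A, b), poly_mul(B, a))
    # Coefficient of x^(j-1) * y is the probability of rank j.
    return {d + 1: c for d, c in enumerate(B) if c > 1e-12}


if __name__ == "__main__":
    for t in TUPLES:
        print(t, {j: round(c, 3) for j, c in rank_probs(t).items()})
```

For t3 the function is (0.6 + 0.4x)(0.5 + 0.5x)·y = (0.3 + 0.5x + 0.2x²)·y, which agrees with the rank-1 and rank-2 probabilities of t3 (0.3 and 0.5) in the U-kRanks example.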

Summary
Probabilistic top-k query
  Different semantics w.r.t. ranks and probabilities in possible worlds
  A unified approach