Probabilistic Data Management

Probabilistic Data Management Chapter 12: Data Quality in Probabilistic Databases (2)

GPS Application: Rescue Ships
Ship in danger!

GPS Application: Nearest Neighbor Query
Uncertainty of GPS data

GPS Application: Rescue Ships (cont'd)
Ship in danger! Probabilistic nearest neighbor query on uncertain data

GPS Application: Probabilistic Nearest Neighbor Query
PNN: return the nearest neighbors of q with probability ≥ α

Effect of Low-Quality Samples
Low-quality samples may greatly affect probabilistic query answers! (α = 0.7)
Without the abnormal GPS sample s22: ship o3 is the NN with probability 0.75 > α, so PNN(q) = {o3}
With the abnormal GPS sample s22: ship o3 is the NN with probability 0.625 < α, so PNN(q) = ∅

Our Contributions
We study the sensitivity of probabilistic query answers to low-quality objects from the angle of causality and responsibility (CR).
We use CR to interpret and define the probabilistic nearest neighbor query, namely CR-PNN, which returns possible query answers and/or low-quality objects.
X. Lian, Y. Lin, and L. Chen. Cost-Efficient Repair in Inconsistent Probabilistic Databases. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'11), pages 1731-1736, 2011.

Outline
Background
  Problem Definition
  Causality
  Responsibility
Problem Definition
  Causality and Responsibility in Uncertain Databases
  Causality-and-Responsibility-Based PNN Query
CR-PNN Query Processing Approaches
Experimental Evaluation
Conclusion

Background: Causality in Certain Databases
Assumptions: given a database D and a query Q
Causality
Object o is a counterfactual cause for object a if:
  a is not an answer to query Q in database D
  a is an answer to query Q in D − {o}
Object o is an actual cause for object a if there is a contingency set Γ ⊆ D such that:
  a is not an answer to query Q in D − Γ
  a is an answer to query Q in D − {o} − Γ

NN Example of Causality (nearest neighbor query)
Object o1 is a counterfactual cause for o3: o1 is the cause that o3 is not the NN of q
  Object o3 is not a nearest neighbor of q
  If we remove o1, then o3 becomes the nearest neighbor of q
Object o1 is an actual cause for NN answer candidate o2: o1 is one of the causes that o2 is not the NN of q
  If we remove o1 and Γ = {o3}, then o2 is the nearest neighbor of q
  The existence of o1 and o3 collaboratively causes o2 to be a non-NN answer
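
The two definitions can be checked mechanically on a small nearest neighbor instance. The sketch below is illustrative only: the distances are hypothetical values chosen so that o1 is closest to q, then o3, then o2 (as in the example), and the function names are not from the slides.

```python
def nearest_neighbor(dist):
    """Return the object with the smallest distance to q (None if empty)."""
    return min(dist, key=dist.get) if dist else None


def is_counterfactual_cause(o, a, dist):
    """a is not the NN in D, but becomes the NN in D - {o}."""
    if o == a:
        return False
    without_o = {x: d for x, d in dist.items() if x != o}
    return nearest_neighbor(dist) != a and nearest_neighbor(without_o) == a


def is_actual_cause(o, a, dist, gamma):
    """o is a counterfactual cause for a once the contingency set gamma is removed."""
    if o in gamma or a in gamma:
        return False
    reduced = {x: d for x, d in dist.items() if x not in gamma}
    return is_counterfactual_cause(o, a, reduced)


# Hypothetical distances to the query point q (assumption, not slide data).
dist = {"o1": 1.0, "o3": 2.0, "o2": 3.0}

o1_counterfactual_for_o3 = is_counterfactual_cause("o1", "o3", dist)
o1_counterfactual_for_o2 = is_counterfactual_cause("o1", "o2", dist)
o1_actual_for_o2 = is_actual_cause("o1", "o2", dist, gamma={"o3"})
```

With these distances, removing o1 alone makes o3 the NN, whereas o2 only becomes the NN once o3 is also set aside as the contingency set.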

Background: Responsibility in Certain Databases
Responsibility (degree of causes)
Let object o ∈ D be a cause of a possible answer a to a query Q.
Then the responsibility, ρ(o, a), of o for possible answer a in D is:
$\rho(o, a) = \frac{1}{1 + \min_{\Gamma} |\Gamma|}$
where Γ is a contingency set for object o, and |Γ| is the size of the set Γ.

NN Example of Responsibility (nearest neighbor query)
Both objects o1 and o3 are responsible for o2 not being the NN of q.
Object o1 is an actual cause for NN answer candidate o2 with the contingency set Γ = {o3}.
Therefore, the responsibility ρ(o1, o2) of o1 for a possible NN answer o2 is:
$\rho(o_1, o_2) = \frac{1}{1 + |\Gamma|} = \frac{1}{1 + 1} = \frac{1}{2}$
That is, o1 has 1/2 responsibility (degree of causes) for making o2 a non-NN answer.
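
A brute-force way to obtain this value is to search for the smallest contingency set and return 1 / (1 + its size). The sketch below assumes that standard reading of responsibility; the distances and function names are hypothetical, and the exhaustive search is only meant for tiny inputs.

```python
from itertools import combinations


def nn(dist):
    """Object closest to q, or None for an empty database."""
    return min(dist, key=dist.get) if dist else None


def responsibility(o, a, dist):
    """Return 1 / (1 + |gamma|) for the smallest contingency set gamma
    that makes o a counterfactual cause for a; 0.0 if o is not a cause."""
    others = [x for x in dist if x not in (o, a)]
    for size in range(len(others) + 1):          # try smaller sets first
        for gamma in combinations(others, size):
            kept = {x: d for x, d in dist.items() if x not in gamma}
            without_o = {x: d for x, d in kept.items() if x != o}
            if nn(kept) != a and nn(without_o) == a:
                return 1.0 / (1 + size)
    return 0.0


# Hypothetical distances to q: o1 is closest, then o3, then o2.
dist = {"o1": 1.0, "o3": 2.0, "o2": 3.0}

resp_o1_o2 = responsibility("o1", "o2", dist)   # needs gamma = {o3}
resp_o3_o2 = responsibility("o3", "o2", dist)   # needs gamma = {o1}
resp_o1_o3 = responsibility("o1", "o3", dist)   # counterfactual, gamma = {}
```

For a nearest neighbor query this search always ends with the set of all other objects closer to q than a, so each of the m closer objects receives responsibility 1/m.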

Probabilistic Nearest Neighbor Query
uncertain database D: …
possible worlds pw(D): …
probabilities of possible worlds Pr{pw(D)}: s11.p · s21.p · s31.p, …, s12.p · s22.p · s32.p
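
The possible-worlds semantics can be made concrete by enumeration. The sketch below assumes independent objects with discrete samples, one sample per object in each world, and a world probability equal to the product of the chosen sample probabilities. The data set and all names are hypothetical.

```python
from itertools import product

# Hypothetical uncertain database: each object has two samples, given as
# (distance of the sample to q, existence probability of the sample).
D = {
    "o1": [(2.0, 0.5), (5.0, 0.5)],
    "o2": [(3.0, 0.5), (6.0, 0.5)],
    "o3": [(1.0, 0.5), (4.0, 0.5)],
}


def possible_worlds(db):
    """Yield (world, probability); a world maps each object to one sample distance."""
    names = list(db)
    for combo in product(*(db[n] for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p                      # product of sample probabilities
        yield {n: d for n, (d, _) in zip(names, combo)}, prob


def pnn_probabilities(db):
    """Probability that each object is the nearest neighbor of q."""
    pr = {n: 0.0 for n in db}
    for world, prob in possible_worlds(db):
        pr[min(world, key=world.get)] += prob
    return pr


def pnn(db, alpha):
    """Objects whose NN probability reaches the threshold alpha."""
    return {n for n, p in pnn_probabilities(db).items() if p >= alpha}


probs = pnn_probabilities(D)
```

Enumeration is exponential in the number of objects, which is why the later slides reduce the computation from possible worlds to uncertain objects.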

Causality and Responsibility in Uncertain Databases
The expected responsibility in uncertain databases
Let si (or sj) be a sample of uncertain object oi (or aj) that appears in a possible world pw(D).
The expected responsibility, resp(oi, aj), of oi for possible answer aj is given by: [equation for PNN not preserved in the transcript]

Responsibility Matrix
For an uncertain database D, we construct an NN responsibility matrix, resp_matrix.
Each entry, resp_matrix(oi, aj), of the matrix corresponds to the expected responsibility resp(oi, aj).

Novel Interpretation of Probabilistic Queries
Summed column value (probabilistic confidence)
We sum up all the values in the j-th column to obtain Sum_C(aj).
We can prove that Sum_C(aj) = 1 − PrQ(aj)
For a PNN query: Sum_C(aj) = 1 − PrPNN(aj)
Example: Sum_C(o3) = 0.375, so PrPNN(o3) = 1 − Sum_C(o3) = 0.625

Novel Interpretation of Probabilistic Queries (cont'd)
An equivalent PNN query interpretation
Original: return those uncertain objects aj whose PNN probabilities, PrPNN(aj), are greater than α
Equivalent: return those uncertain objects aj whose summed column values, Sum_C(aj), are smaller than (1 − α)
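
The column-sum identity can be checked numerically on a small instance. The sketch below assumes that the expected responsibility is the probability-weighted sum of per-world responsibilities, and that within one world each of the m objects closer to q than aj carries responsibility 1/m for aj. The data and names are hypothetical.

```python
from itertools import product

# Hypothetical data: object -> list of (distance to q, sample probability).
D = {
    "o1": [(2.0, 0.5), (5.0, 0.5)],
    "o2": [(3.0, 0.5), (6.0, 0.5)],
    "o3": [(1.0, 0.5), (4.0, 0.5)],
}


def resp_matrix(db):
    """resp[oi][aj]: expected responsibility of oi for aj not being the NN."""
    names = list(db)
    resp = {i: {j: 0.0 for j in names} for i in names}
    for combo in product(*(db[n] for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        world = {n: d for n, (d, _) in zip(names, combo)}
        for a, dist_a in world.items():
            closer = [o for o, d in world.items() if d < dist_a]
            for o in closer:               # each closer object shares the blame
                resp[o][a] += prob / len(closer)
    return resp


def sum_c(resp, a):
    """Summed column value of candidate answer a."""
    return sum(row[a] for row in resp.values())


R = resp_matrix(D)
pr_pnn = {a: 1.0 - sum_c(R, a) for a in D}   # PrPNN(aj) = 1 - Sum_C(aj)
```

In every world where aj is not the NN the per-world responsibilities for aj add up to 1, and they add up to 0 otherwise, so the column sum is exactly the probability that aj is not the NN.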

Influence of Uncertain Objects
Summed row value (influence)
We sum up all the values in the i-th row. The influence of object oi is given by:
$Sum\_R(o_i) = \sum_{j} resp\_matrix(o_i, a_j)$
Interpretation of influence
An uncertain object oi with a high Sum_R(oi) value is more influential to query answers.
Two possible reasons for high influence:
  Object oi is a query answer itself
  Object oi contains low-quality samples
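
As a small illustration of the row sums, the sketch below ranks objects by influence. The 3 × 3 matrix holds hypothetical expected responsibilities (rows are causes oi, columns are candidate answers aj); the helper names are assumptions.

```python
# Hypothetical responsibility matrix: resp_matrix[oi][aj].
resp_matrix = {
    "o1": {"o1": 0.0,   "o2": 0.4375, "o3": 0.1875},
    "o2": {"o1": 0.125, "o2": 0.0,    "o3": 0.1875},
    "o3": {"o1": 0.625, "o2": 0.4375, "o3": 0.0},
}


def sum_r(matrix, oi):
    """Influence of oi: the sum of all entries in its row."""
    return sum(matrix[oi].values())


def rank_by_influence(matrix):
    """Objects ordered from most to least influential."""
    return sorted(matrix, key=lambda o: sum_r(matrix, o), reverse=True)


influence = {o: sum_r(resp_matrix, o) for o in resp_matrix}
ranking = rank_by_influence(resp_matrix)
```

Here o3 has the largest row sum. The row sum alone cannot tell whether that is because o3 is a likely answer or because it carries a low-quality sample, which are the two reasons listed above.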

Problem Definition: CR-PNN
CR-PNN query: given a query point q, an uncertain database D, and a probabilistic threshold α, a CR-PNN query returns a PNN answer set AQ and a set AN of (at least) k high-influence objects.
Query answering of CR-PNN, straightforward method:
1. Compute the responsibility matrix resp_matrix
2. Add query answers aj with Sum_C(aj) < 1 − α to AQ
3. Remove all aj from database D
4. Add the object oi with the highest influence Sum_R(oi) to AN
5. AN = AN − AQ, and remove oi from database D
6. Update the matrix, and repeat from Step 2 until |AN| ≥ k
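
One possible reading of the straightforward method is sketched below. It is an interpretation rather than the authors' implementation: the matrix is recomputed by enumerating possible worlds, the influence in Step 4 is taken from the matrix computed before the answers were removed, and the loop also stops when D becomes empty (a guard not stated in the steps). The data and names are hypothetical.

```python
from itertools import product


def resp_matrix(db):
    """Expected responsibility resp[oi][aj], by enumerating possible worlds."""
    names = list(db)
    resp = {i: {j: 0.0 for j in names} for i in names}
    for combo in product(*(db[n] for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        world = {n: d for n, (d, _) in zip(names, combo)}
        for a, dist_a in world.items():
            closer = [o for o, d in world.items() if d < dist_a]
            for o in closer:
                resp[o][a] += prob / len(closer)
    return resp


def cr_pnn(db, alpha, k):
    db = dict(db)                    # work on a copy of D
    AQ, AN = set(), []
    while db and len(AN) < k:
        m = resp_matrix(db)                                   # Steps 1 and 6
        answers = {a for a in db
                   if sum(m[o][a] for o in db) < 1 - alpha}   # Step 2
        AQ |= answers
        for a in answers:                                     # Step 3
            del db[a]
        if not db:
            break
        top = max(db, key=lambda o: sum(m[o].values()))       # Step 4
        AN.append(top)         # Step 5: top is never in AQ, answers left db
        del db[top]
    return AQ, AN


# Hypothetical data: object -> list of (distance to q, sample probability).
D = {
    "o1": [(2.0, 0.5), (5.0, 0.5)],
    "o2": [(3.0, 0.5), (6.0, 0.5)],
    "o3": [(1.0, 0.5), (4.0, 0.5)],
}
AQ, AN = cr_pnn(D, alpha=0.7, k=2)
```

On this data no object qualifies as an answer at first. Once the most influential object o3 is set aside, o1 passes the threshold and is returned as an answer.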

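CR-PNN Example
CR-based PNN query with α = 0.7, k = 2
[Figure: responsibility matrix with entries (oi, oj), marking the objects with the highest influences; the labels {o2, o3}, AQ and AN appear, but their values are not recoverable from the transcript]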

CR-PNN Problem Reduction
Reduction of PNN responsibility computation
[Figure: query point q with objects oi, ok and sample sj]

Derivation of F(q, si, x)
F(q, si, x) can be computed by the following recursive function: [equation not preserved in the transcript]

Pruning Strategies
To reduce the computation cost, we compute lower/upper bounds of the responsibilities in the matrix.
Pruning with probabilistic threshold: if LB_Sum_C(aj) ≥ 1 − α, then aj can be pruned.

Pruning Strategies (cont'd)
Pruning with influence threshold: let t be the k-th largest lower bound of influence. If UB_Sum_R(oi) < t, then oi should not be put into the AN set.
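
Both pruning rules reduce to simple comparisons once bounds are available. The sketch below only applies the rules. The bound values are made up for illustration, and how the bounds themselves are derived is not shown.

```python
def prune_answers(lb_sum_c, alpha):
    """Keep candidates aj that survive the rule 'LB_Sum_C(aj) >= 1 - alpha => prune'."""
    return {a for a, lb in lb_sum_c.items() if lb < 1 - alpha}


def prune_influence(lb_sum_r, ub_sum_r, k):
    """Return (surviving objects, t), where t is the k-th largest influence
    lower bound; objects with UB_Sum_R(oi) < t are pruned. Requires k <= len."""
    t = sorted(lb_sum_r.values(), reverse=True)[k - 1]
    return {o for o, ub in ub_sum_r.items() if ub >= t}, t


# Hypothetical bounds (assumptions for illustration only).
lb_sum_c = {"o1": 0.6, "o2": 0.8, "o3": 0.2}
lb_sum_r = {"o1": 0.5, "o2": 0.2, "o3": 0.9}
ub_sum_r = {"o1": 0.7, "o2": 0.4, "o3": 1.2}

answer_candidates = prune_answers(lb_sum_c, alpha=0.7)
influence_candidates, t = prune_influence(lb_sum_r, ub_sum_r, k=2)
```

An object whose influence upper bound is below t cannot be among the k most influential objects, because at least k objects are guaranteed an influence of at least t.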

Derivation of PNN Responsibility Bounds
Bounds lb_resp(oi, aj) and ub_resp(oi, aj)
[Derivation not recoverable from the transcript: it starts from the probabilities that n objects fall into a region whose symbol was lost in extraction]

CR-PNN Query Procedure
1. Compute lower/upper bounds of the responsibility matrix resp_matrix
2. Add query answers aj with LB_Sum_C(aj) < 1 − α to AQ
3. Remove all aj from database D
4. Set t to be the k-th largest influence lower bound
5. Add objects oi with UB_Sum_R(oi) > t to AN
6. AN = AN − AQ, and remove oi from database D
7. Update the matrix, and repeat from Step 2 until |AN| ≥ k

Experimental Results
Data sets
  Real spatial data: Long Beach (LB) and Tiger/Line LA River and Railways (RR)
  Synthetic data: randomly generate center Co and half extent eo (on each dimension) of uncertain objects o
  4 data sets: lUeU, lUeG, lSeU, lSeG
Measures
  CPU time and I/O cost

Effectiveness
Inject synthetic noise into 50% of the data sets, and measure the recall ratio of CR-PNN against the PNN results on the data sets without noise.
Settings: data size N = 30K, dimensionality d = 2, k = 5, probabilistic threshold α = 0.5, [emin, emax] = [0.1, 0.5]

Efficiency of CR-PNN
Settings: dimensionality d = 2, k = 5, probabilistic threshold α = 0.5, [emin, emax] = [0.1, 0.5]

Conclusion
We introduce causality and responsibility (CR) to uncertain databases, and propose to interpret probabilistic queries using CR.
We formalize the CR-PNN problem and reduce the problem over possible worlds to one over uncertain objects.
We propose effective pruning methods to reduce the computation cost.
We propose an efficient CR-PNN query answering approach that returns both query answers and high-influence objects.
We conduct extensive experiments to verify the effectiveness and efficiency of our proposed approaches.