Probabilistic Data Management Chapter 12: Data Quality in Probabilistic Databases (2)
GPS Application – Rescue Ships: Ship in Danger!
GPS Application – Nearest Neighbor Query: Uncertainty of GPS Data
GPS Application – Rescue Ships (cont'd): Ship in Danger! Probabilistic Nearest Neighbor Query on Uncertain Data
GPS Application – Probabilistic Nearest Neighbor Query. PNN: Return nearest neighbors of q with probability ≥ a
Effect of Low-Quality Samples
Abnormal GPS sample s22: low-quality samples may greatly affect probabilistic query answers!
With a = 0.7:
- ship o3 is NN with probability 0.75 > a, so PNN(q) = {o3}
- ship o3 is NN with probability 0.625 < a, so PNN(q) = ∅
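The effect can be reproduced with a minimal sketch that evaluates a PNN query by enumerating possible worlds. The object positions and sample probabilities below are hypothetical, chosen only so that the NN probabilities of o3 come out as 0.75 and 0.625; they are not the data behind the slide's figure. Objects are assumed independent, with one sample chosen per object in each world.

```python
from itertools import product
from math import dist, prod

def pnn_probabilities(q, objects):
    """Return {name: probability of being the nearest neighbor of q}.

    objects maps a name to a list of (point, probability) samples whose
    probabilities sum to 1; objects are assumed to be independent.
    """
    names = list(objects)
    prob = {name: 0.0 for name in names}
    # Each possible world picks exactly one sample per object.
    for world in product(*(objects[n] for n in names)):
        p_world = prod(p for _, p in world)
        nearest = min(range(len(names)), key=lambda i: dist(q, world[i][0]))
        prob[names[nearest]] += p_world
    return prob

def pnn(q, objects, alpha):
    """PNN answer set: objects whose NN probability is at least alpha."""
    return {n for n, p in pnn_probabilities(q, objects).items() if p >= alpha}

q = (0.0, 0.0)
clean = {  # hypothetical samples, two per object, probability 0.5 each
    "o1": [((3.0, 0.0), 0.5), ((5.0, 0.0), 0.5)],
    "o2": [((6.0, 0.0), 0.5), ((7.0, 0.0), 0.5)],
    "o3": [((2.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
}
# Replace one sample of o2 by an outlier that lands close to q.
noisy = dict(clean, o2=[((6.0, 0.0), 0.5), ((3.5, 0.0), 0.5)])

print(pnn_probabilities(q, clean)["o3"], pnn(q, clean, 0.7))  # 0.75 {'o3'}
print(pnn_probabilities(q, noisy)["o3"], pnn(q, noisy, 0.7))  # 0.625 set()
```

A single outlier sample in another object is enough to push o3 below the threshold, which is the sensitivity the slide points at. Enumeration is exponential in the number of objects, so this is only suitable for toy inputs.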
Our Contributions
- We study the sensitivity of probabilistic query answers to low-quality objects from the angle of causality and responsibility (CR)
- We use CR to interpret and define the probabilistic nearest neighbor query, namely CR-PNN, which returns possible query answers and/or low-quality objects
X. Lian, Y. Lin, and L. Chen. Cost-Efficient Repair in Inconsistent Probabilistic Databases. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'11), pages 1731-1736, 2011.
Outline
- Background: Causality, Responsibility
- Problem Definition: Causality and Responsibility in Uncertain Databases, Causality-and-Responsibility-Based PNN Query
- CR-PNN Query Processing Approaches
- Experimental Evaluation
- Conclusion
Background – Causality in Certain Databases
Assumptions
- Given a database D and a query Q
Causality
- Object o is a counterfactual cause for object a, if it holds that:
  - a is not an answer to query Q in database D
  - a is an answer to query Q in D − {o}
- Object o is an actual cause for object a, if it holds that:
  - a is an answer to query Q in D − {o} − Γ, where Γ ⊆ D is a contingency set
NN Example of Causality
Nearest Neighbor Query
- Object o1 is a counterfactual cause for o3 (o1 is the cause such that o3 is not the NN of q)
  - Object o3 is not a nearest neighbor of q
  - If we remove o1, then o3 will become the nearest neighbor of q
- Object o1 is an actual cause for NN answer candidate o2 (o1 is one of the causes such that o2 is not the NN of q)
  - If we remove o1 and Γ = {o3}, then o2 is the nearest neighbor of q
  - The existence of o1 and o3 collaboratively causes o2 to be a non-NN-answer
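The two kinds of cause can be checked by brute force on a small certain database. The sketch below is illustrative only: the coordinates are hypothetical (they merely place o1, o3, o2 at increasing distance from q), and it additionally assumes the usual side condition that the non-answer a must still be a non-answer after removing the contingency set alone.

```python
from itertools import combinations
from math import dist

def is_nn(q, db, a):
    """True if a is the strict nearest neighbor of q among the points in db."""
    return a in db and all(dist(q, db[a]) < dist(q, p)
                           for o, p in db.items() if o != a)

def without(db, removed):
    """Copy of db with the named objects removed."""
    return {o: p for o, p in db.items() if o not in removed}

def min_contingency(q, db, o, a):
    """Smallest contingency set that makes o a cause for non-answer a.

    Returns an empty set if o is a counterfactual cause, a non-empty set
    if o is only an actual cause, and None if o is not a cause at all.
    """
    others = [x for x in db if x not in (o, a)]
    for size in range(len(others) + 1):          # try small sets first
        for gamma in combinations(others, size):
            still_blocked = not is_nn(q, without(db, gamma), a)
            unblocked = is_nn(q, without(db, gamma + (o,)), a)
            if still_blocked and unblocked:
                return set(gamma)
    return None

q = (0.0, 0.0)
db = {"o1": (1.0, 0.0), "o3": (2.0, 0.0), "o2": (3.0, 0.0)}  # hypothetical

print(min_contingency(q, db, "o1", "o3"))  # set(): counterfactual cause
print(min_contingency(q, db, "o1", "o2"))  # {'o3'}: actual cause
print(min_contingency(q, db, "o2", "o3"))  # None: o2 is farther than o3
```

The search is exponential in the database size; it is meant to mirror the definitions, not to be efficient.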
Background – Responsibility in Certain Databases
Responsibility (degree of causes)
- Let object o ∈ D be a cause of a possible answer a to a query Q
- Then, the responsibility, ρ(o, a), of o for possible answer a in D is:
$$\rho(o, a) = \frac{1}{1 + \min_{\Gamma} |\Gamma|}$$
  where Γ is a contingency set for object o, and |Γ| is the size of the set Γ.
NN Example of Responsibility
Nearest Neighbor Query
- Both objects o1 and o3 are responsible for o2 not being the NN of q
- Object o1 is an actual cause for NN answer candidate o2 with the contingency set Γ = {o3}
- Therefore, the responsibility, ρ(o1, o2), of o1 for the possible NN answer o2 is:
$$\rho(o_1, o_2) = \frac{1}{1 + |\Gamma|} = \frac{1}{1 + 1} = \frac{1}{2}$$
- o1 has 1/2 responsibility (degree of causes) for o2 being a non-NN-answer
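For the NN query the minimal contingency set has a simple shape: every object closer to q than a is a cause, and the smallest contingency set for one of them consists of all the other closer objects. Under that observation (and assuming no distance ties), responsibility reduces to one over the number of closer objects. The sketch below uses hypothetical coordinates and a hypothetical helper name, `nn_responsibility`.

```python
from fractions import Fraction
from math import dist

def nn_responsibility(q, db, o, a):
    """Responsibility of o for a not being the nearest neighbor of q.

    Assumes distinct distances. The minimal contingency set for a closer
    object o is every other object closer to q than a.
    """
    closer = [x for x in db if x != a and dist(q, db[x]) < dist(q, db[a])]
    if o not in closer:
        return Fraction(0)             # o is not a cause
    min_gamma = len(closer) - 1        # all other closer objects
    return Fraction(1, 1 + min_gamma)

q = (0.0, 0.0)
db = {"o1": (1.0, 0.0), "o3": (2.0, 0.0), "o2": (3.0, 0.0)}  # hypothetical

print(nn_responsibility(q, db, "o1", "o2"))  # 1/2
print(nn_responsibility(q, db, "o1", "o3"))  # 1
print(nn_responsibility(q, db, "o2", "o3"))  # 0
```

Exact fractions are used so that the value 1/2 from the example appears literally rather than as a rounded float.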
Probabilistic Nearest Neighbor Query
[Figure: an uncertain database D, its possible worlds pw(D), and the probabilities of the possible worlds Pr{pw(D)}, labeled with sample probabilities s11.p, s21.p, s31.p, …, s12.p, s22.p, s32.p]
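Causality and Responsibility in Uncertain Databases
The expected responsibility in uncertain databases
- Let si (or sj) be the sample of uncertain object oi (or aj) that appears in a possible world pw(D)
- The expected responsibility, resp(oi, aj), of oi for possible answer aj is the expectation, over all possible worlds, of the responsibility of si for sj:
$$resp(o_i, a_j) = \sum_{pw(D)} \Pr\{pw(D)\} \cdot \rho(s_i, s_j)$$
- For PNN, ρ(si, sj) is the responsibility of si for sj not being the nearest neighbor of q in pw(D)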
Responsibility Matrix
- For an uncertain database D, we construct an NN responsibility matrix, resp_matrix
- Each entry, resp_matrix(oi, aj), of the matrix corresponds to the expected responsibility resp(oi, aj)
[Figure: responsibility matrix with entries indexed by (oi, oj)]
Novel Interpretation of Probabilistic Queries
Summed column value (probabilistic confidence)
- We sum up all the values on the j-th column to obtain Sum_C(aj)
- We can prove that Sum_C(aj) = 1 − PrQ(aj)
- For the PNN query, Sum_C(aj) = 1 − PrPNN(aj)
Example: Sum_C(o3) = 0.375, so PrPNN(o3) = 1 − Sum_C(o3) = 0.625
[Figure: responsibility matrix with entries indexed by (oi, oj)]
Novel Interpretation of Probabilistic Queries (cont'd)
An equivalent PNN query interpretation
- Return those uncertain objects aj, such that the PNN probabilities, PrPNN(aj), are greater than or equal to a
- Equivalently, return those uncertain objects aj, such that the summed column values, Sum_C(aj), are smaller than or equal to (1 − a)
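The column-sum identity and the equivalent threshold test can be checked numerically on a toy database. The sketch below enumerates possible worlds; in each world it splits one unit of responsibility evenly among the samples closer to q than sj, which is what the certain-database definition gives for the NN query when distances are distinct. The sample data are hypothetical and were picked so that Sum_C(o3) equals 0.375.

```python
from itertools import product
from math import dist, prod

def responsibility_and_pnn(q, objects):
    """Return (resp_matrix, PNN probabilities) by possible-world enumeration.

    resp[o][a] is the expected responsibility of o for a not being the NN.
    """
    names = list(objects)
    resp = {o: {a: 0.0 for a in names} for o in names}
    pr_nn = {a: 0.0 for a in names}
    for world in product(*(objects[n] for n in names)):
        p_world = prod(p for _, p in world)
        d = {n: dist(q, point) for n, (point, _) in zip(names, world)}
        for a in names:
            closer = [o for o in names if d[o] < d[a]]
            if not closer:
                pr_nn[a] += p_world          # a is the NN in this world
            for o in closer:                 # each closer object gets 1/m
                resp[o][a] += p_world / len(closer)
    return resp, pr_nn

q = (0.0, 0.0)
objects = {  # hypothetical samples, probability 0.5 each
    "o1": [((3.0, 0.0), 0.5), ((5.0, 0.0), 0.5)],
    "o2": [((6.0, 0.0), 0.5), ((3.5, 0.0), 0.5)],
    "o3": [((2.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
}
resp, pr_nn = responsibility_and_pnn(q, objects)
sum_c = {a: sum(resp[o][a] for o in resp) for a in resp}

alpha = 0.5
by_probability = {a for a in pr_nn if pr_nn[a] >= alpha}
by_column_sum = {a for a in sum_c if sum_c[a] <= 1 - alpha}
print(sum_c["o3"], pr_nn["o3"])        # 0.375 0.625
print(by_probability, by_column_sum)   # {'o3'} {'o3'}
```

The identity holds world by world: if aj is not the NN, the responsibilities of the closer samples add up to 1, otherwise they add up to 0, so the expected column sum is the probability that aj is not the NN.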
Influence of Uncertain Objects
Summed row value (influence)
- We sum up all the values on the i-th row
- The influence of object oi is given by:
$$Sum\_R(o_i) = \sum_{j} resp\_matrix(o_i, a_j)$$
Interpretation of influence
- An uncertain object oi with a high Sum_R(oi) value is more influential to the query answers
- Two possible reasons for high influence:
  - Objects oi are query answers themselves
  - Objects oi contain low-quality samples
[Figure: responsibility matrix with entries indexed by (oi, oj)]
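Computing influence is a row sum over the responsibility matrix. The short sketch below uses a hypothetical 3×3 matrix (rows are causes oi, columns are possible answers aj); the values are illustrative and are not taken from the slide's figure.

```python
# Hypothetical expected-responsibility matrix: resp_matrix[oi][aj].
resp_matrix = {
    "o1": {"o1": 0.0,   "o2": 0.4375, "o3": 0.1875},
    "o2": {"o1": 0.125, "o2": 0.0,    "o3": 0.1875},
    "o3": {"o1": 0.625, "o2": 0.4375, "o3": 0.0},
}

def influence(matrix):
    """Sum_R(oi): sum of the i-th row of the responsibility matrix."""
    return {o: sum(row.values()) for o, row in matrix.items()}

sum_r = influence(resp_matrix)
ranked = sorted(sum_r, key=sum_r.get, reverse=True)
print(sum_r)   # {'o1': 0.625, 'o2': 0.3125, 'o3': 1.0625}
print(ranked)  # ['o3', 'o1', 'o2']
```

In this made-up matrix o3 ranks first because it is the most likely nearest neighbor itself, which corresponds to the first of the two reasons for high influence.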
Problem Definition – CR-PNN
CR-PNN query
- Given a query point q, an uncertain database D, and a probabilistic threshold a, a CR-PNN query returns a PNN answer set AQ and a set AN of (at least) k high-influence objects
The query answering of CR-PNN
Straightforward method:
1. Compute the responsibility matrix resp_matrix
2. Add query answers aj with Sum_C(aj) ≤ 1 − a to AQ
3. Remove all aj from database D
4. Add the object oi with the highest influence Sum_R(oi) to AN
5. Set AN = AN − AQ, and remove oi from database D
6. Update the matrix, and repeat from Step 2 until |AN| ≥ k
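One literal reading of the straightforward method can be sketched as follows. This is an interpretation, not the authors' implementation: the matrix is rebuilt from scratch by possible-world enumeration in every round, answers are tested with Sum_C ≤ 1 − a, the highest-influence object is taken from the matrix computed before the answers are removed, and the loop also stops when D becomes empty. The sample data and the name `cr_pnn` are hypothetical.

```python
from itertools import product
from math import dist, prod

def responsibility_matrix(q, objects):
    """Expected NN responsibility resp[o][a] via possible-world enumeration."""
    names = list(objects)
    resp = {o: {a: 0.0 for a in names} for o in names}
    for world in product(*(objects[n] for n in names)):
        p_world = prod(p for _, p in world)
        d = {n: dist(q, point) for n, (point, _) in zip(names, world)}
        for a in names:
            closer = [o for o in names if d[o] < d[a]]
            for o in closer:
                resp[o][a] += p_world / len(closer)
    return resp

def cr_pnn(q, objects, alpha, k):
    """Return (AQ, AN) following the six steps; stops early if D runs out."""
    db = dict(objects)
    AQ, AN = set(), set()
    while db and len(AN) < k:
        resp = responsibility_matrix(q, db)                      # steps 1, 6
        answers = {a for a in db
                   if sum(resp[o][a] for o in db) <= 1 - alpha}  # step 2
        AQ |= answers
        top = max(db, key=lambda o: sum(resp[o].values()))       # step 4
        AN = (AN | {top}) - AQ                                   # step 5
        for name in answers | {top}:                             # steps 3, 5
            del db[name]
    return AQ, AN

q = (0.0, 0.0)
objects = {  # hypothetical samples, probability 0.5 each
    "o1": [((3.0, 0.0), 0.5), ((5.0, 0.0), 0.5)],
    "o2": [((6.0, 0.0), 0.5), ((3.5, 0.0), 0.5)],
    "o3": [((2.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
}
print(cr_pnn(q, objects, alpha=0.7, k=1))  # (set(), {'o3'})
```

With a = 0.7 no object reaches the threshold in the first round, so AQ stays empty and the object with the largest row sum is reported in AN. The example only shows the mechanics of the loop; the cost of recomputing the matrix by enumeration is what the later pruning and reduction slides aim to avoid.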
CR-PNN Example
CR-based PNN query with a = 0.7, k = 2
[Figure: responsibility matrix indexed by (oi, oj), with the highest influences marked; labels: {o2, o3}, AQ = {o2, o3}, AN =]
CR-PNN Problem Reduction
Reduction of PNN responsibility computation
[Figure: query point q with objects oi, ok and sample sj]
Derivation of F(q, si, x)
F(q, si, x) can be computed by a recursive function.
[Figure: query point q with objects oi, ok and sample sj]
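The recursive formula for F(q, si, x) is not preserved in the text, so the following is only one plausible way to carry out such a reduction, under an explicit assumption: the quantity needed is the probability that exactly x of the other (independent) objects lie closer to q than a given sample, which can be built up one object at a time. The helper names `count_distribution` and `expected_responsibility` and the sample data are hypothetical.

```python
from math import dist

def count_distribution(probs):
    """pmf[x] = probability that exactly x of the independent events occur.

    Recursion over objects: adding an event with probability p turns
    pmf[x] into pmf[x] * (1 - p) + pmf[x - 1] * p.
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for x, pr in enumerate(pmf):
            nxt[x] += pr * (1 - p)
            nxt[x + 1] += pr * p
        pmf = nxt
    return pmf

def expected_responsibility(q, objects, oi, aj):
    """resp(oi, aj) computed per object, without enumerating possible worlds."""
    total = 0.0
    for sj, pj in objects[aj]:
        dj = dist(q, sj)
        # Probability that oi has a sample closer to q than sj.
        p_closer = sum(p for s, p in objects[oi] if dist(q, s) < dj)
        # For every other object: probability that it is closer than sj.
        others = [sum(p for s, p in objects[ok] if dist(q, s) < dj)
                  for ok in objects if ok not in (oi, aj)]
        # If x other objects are closer, oi's share of responsibility is 1/(1+x).
        share = sum(pr / (1 + x)
                    for x, pr in enumerate(count_distribution(others)))
        total += pj * p_closer * share
    return total

q = (0.0, 0.0)
objects = {  # hypothetical samples, probability 0.5 each
    "o1": [((3.0, 0.0), 0.5), ((5.0, 0.0), 0.5)],
    "o2": [((6.0, 0.0), 0.5), ((3.5, 0.0), 0.5)],
    "o3": [((2.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
}
print(expected_responsibility(q, objects, "o1", "o3"))  # 0.1875
print(expected_responsibility(q, objects, "o3", "o1"))  # 0.625
```

The recursion costs time quadratic in the number of objects per sample, instead of a number of possible worlds that grows exponentially with the number of objects, which is the point of moving the computation from possible worlds to uncertain objects.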
Pruning Strategies
- To reduce the computation cost, we compute lower/upper bounds of the responsibilities in the matrix
Pruning with probabilistic threshold
- If LB_Sum_C(aj) > 1 − a, then aj cannot have PNN probability ≥ a and can be pruned
Pruning Strategies (cont'd)
Pruning with influence threshold
- Let t be the k-th largest lower bound of influence
- If UB_Sum_R(oi) < t, then oi should not be put into the set AN
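Both pruning rules can be sketched on top of entry-wise bound matrices. The sketch assumes that LB_Sum_C, LB_Sum_R and UB_Sum_R are obtained by summing the entry-wise lower or upper bounds, and that an answer needs PNN probability ≥ a, so an object is pruned when its column lower bound exceeds 1 − a. The bound values are hypothetical: a made-up matrix widened by ±0.05.

```python
def prune(lb, ub, alpha, k):
    """Return (answer candidates, influence candidates) that survive pruning.

    lb[o][a] and ub[o][a] bound the expected responsibility of o for a.
    """
    names = list(lb)
    # Probabilistic threshold: drop aj if LB_Sum_C(aj) > 1 - alpha.
    lb_sum_c = {a: sum(lb[o][a] for o in names) for a in names}
    answer_candidates = {a for a in names if lb_sum_c[a] <= 1 - alpha}
    # Influence threshold: t is the k-th largest lower bound of influence.
    lb_sum_r = {o: sum(lb[o].values()) for o in names}
    ub_sum_r = {o: sum(ub[o].values()) for o in names}
    t = sorted(lb_sum_r.values(), reverse=True)[k - 1]
    influence_candidates = {o for o in names if ub_sum_r[o] >= t}
    return answer_candidates, influence_candidates

# Hypothetical exact matrix, loosened into bounds by eps on each entry.
exact = {
    "o1": {"o1": 0.0,   "o2": 0.4375, "o3": 0.1875},
    "o2": {"o1": 0.125, "o2": 0.0,    "o3": 0.1875},
    "o3": {"o1": 0.625, "o2": 0.4375, "o3": 0.0},
}
eps = 0.05
lb = {o: {a: max(0.0, v - eps) for a, v in row.items()}
      for o, row in exact.items()}
ub = {o: {a: (v + eps if o != a else 0.0) for a, v in row.items()}
      for o, row in exact.items()}

answers, influential = prune(lb, ub, alpha=0.7, k=2)
print(answers)              # {'o3'}
print(sorted(influential))  # ['o1', 'o3']
```

Surviving pruning does not make o3 an answer: its lower bound 0.275 is below 0.3, but the exact column sum in this made-up matrix is 0.375, so it would be rejected once refined. The bounds only decide which objects need exact computation.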
Derivation of PNN Responsibility Bounds lb_resp(oi, aj) and ub_resp(oi, aj) Assume that the probabilities that n objects fall into are Let =
CR-PNN Query Procedure
1. Compute lower/upper bounds of the responsibility matrix resp_matrix
2. Prune objects aj with LB_Sum_C(aj) > 1 − a, and add the remaining query answers aj to AQ
3. Remove all aj from database D
4. Set t to be the k-th largest influence lower bound
5. Add objects oi with UB_Sum_R(oi) ≥ t to AN
6. Set AN = AN − AQ, and remove oi from database D
7. Update the matrix, and repeat from Step 2 until |AN| ≥ k
Experimental Results
Data sets
- Real spatial data: Long Beach (LB) and Tiger/Line LA River and Railways (RR)
- Synthetic data: randomly generate the center Co and half extent eo (on each dimension) of each uncertain object o
  - 4 data sets: lUeU, lUeG, lSeU, lSeG
Measure
- CPU time and I/O cost
Effectiveness
- Synthetically inject noise into 50% of data sets, and measure the recall ratio of CR-PNN (compared with PNN results on data sets without noise)
Settings: data size N = 30K, dimensionality d = 2, k = 5, probabilistic threshold a = 0.5, [emin, emax] = [0.1, 0.5]
Efficiency of CR-PNN dimensionality d = 2, k = 5, probabilistic threshold a = 0.5, [emin, emax] = [0.1, 0.5]
Conclusion
- We introduce causality and responsibility (CR) to uncertain databases, and propose to interpret probabilistic queries using CR
- We formalize the CR-PNN problem and reduce the problem over possible worlds to one on uncertain objects
- We propose effective pruning methods to reduce the computation cost
- We propose an efficient CR-PNN query answering approach that returns both query answers and high-influence objects
- We conduct extensive experiments to verify the effectiveness and efficiency of our proposed approaches