Probabilistic Data Management

1 Probabilistic Data Management
Chapter 12: Data Quality in Probabilistic Databases (2)

2 GPS Application -- Rescue Ships
Ship in Danger!

3 GPS Application – Nearest Neighbor Query
Uncertainty of GPS Data

4 GPS Application -- Rescue Ships (cont'd)
Ship in Danger! Probabilistic Nearest Neighbor Query on Uncertain Data

5 GPS Application – Probabilistic Nearest Neighbor Query
PNN: Return nearest neighbors of q with probability ≥ α

6 Effect of Low-Quality Samples
Low-quality samples may greatly affect probabilistic query answers!
Probabilistic threshold α = 0.7
Without the abnormal GPS sample s22: ship o3 is NN with probability 0.75 > α, so PNN(q) = {o3}
With the abnormal GPS sample s22: ship o3 is NN with probability < α, so PNN(q) = ∅

7 Our Contributions
We study the sensitivity of probabilistic query answers to low-quality objects from the angle of causality and responsibility (CR)
We use CR to interpret and define a probabilistic nearest neighbor query, namely CR-PNN, which returns possible query answers and/or low-quality objects
X. Lian, Y. Lin, and L. Chen. Cost-Efficient Repair in Inconsistent Probabilistic Databases. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'11), 2011.

8 Outline
Background
  Causality
  Responsibility
Problem Definition
  Causality and Responsibility in Uncertain Databases
  Causality-and-Responsibility-Based PNN Query
CR-PNN Query Processing Approaches
Experimental Evaluation
Conclusion

9 Background – Causality in Certain Databases
Assumptions
  Given a database D and a query Q
Causality
  Object o is a counterfactual cause for object a, if it holds that:
    a is not an answer to query Q in database D
    a is an answer to query Q in D − {o}
  Object o is an actual cause for object a, if it holds that:
    a is an answer to query Q in D − {o} − Γ, where Γ ⊆ D is a contingency set

10 NN Example of Causality
[Figure: nearest neighbor query at q; o1 is the cause of o3 not being the NN of q, and o1 is one of the causes of o2 not being the NN of q]
Object o1 is a counterfactual cause for o3
  Object o3 is not a nearest neighbor of q
  If we remove o1, then o3 will become the nearest neighbor of q
Object o1 is an actual cause for NN answer candidate o2
  If we remove o1 and the contingency set Γ = {o3}, then o2 is the nearest neighbor of q
  The existence of o1 and o3 collaboratively causes o2 to be a non-NN-answer
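The two notions of cause on this slide can be checked mechanically. The sketch below is a minimal illustration, assuming objects are points on a line (1-D positions) and ties are broken by name; the function names and the toy positions are hypothetical, chosen only so that o1 is nearest to q, then o3, then o2, as in the slide's example.

```python
def nearest(db, q):
    """Name of the object closest to q (1-D positions; ties broken by name)."""
    return min(db, key=lambda name: (abs(db[name] - q), name))

def is_counterfactual_cause(db, q, o, a):
    """o is a counterfactual cause for a: a is not the NN in D, but is in D - {o}."""
    if nearest(db, q) == a:
        return False
    rest = {k: v for k, v in db.items() if k != o}
    return bool(rest) and nearest(rest, q) == a

def is_actual_cause(db, q, o, a, gamma):
    """o is an actual cause for a with contingency set gamma (gamma excludes o and a):
    after removing gamma, o becomes a counterfactual cause for a."""
    reduced = {k: v for k, v in db.items() if k not in gamma}
    return is_counterfactual_cause(reduced, q, o, a)

# Toy data (assumed): distances to q = 0 are o1 < o3 < o2.
db = {"o1": 1.0, "o3": 2.0, "o2": 3.0}
q = 0.0
cf_o1_o3 = is_counterfactual_cause(db, q, "o1", "o3")   # removing o1 makes o3 the NN
cf_o1_o2 = is_counterfactual_cause(db, q, "o1", "o2")   # removing o1 alone is not enough
ac_o1_o2 = is_actual_cause(db, q, "o1", "o2", {"o3"})   # with contingency set {o3} it is
```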

11 Background – Responsibility in Certain Databases
Responsibility (degree of causes)
Let object o ∈ D be a cause of a possible answer a to a query Q
Then, the responsibility, ρ(o, a), of o for possible answer a in D is:
  ρ(o, a) = 1 / (1 + min_Γ |Γ|)
where Γ is a contingency set for object o, and |Γ| is the size of the set Γ.

12 NN Example of Responsibility
[Figure: nearest neighbor query at q; both objects o1 and o3 are responsible for o2 not being the NN of q]
Object o1 is an actual cause for NN answer candidate o2 with the contingency set Γ = {o3}
Therefore, we have the responsibility, ρ(o1, o2), of o1 for a possible NN answer o2 as follows:
  ρ(o1, o2) = 1 / (1 + |Γ|) = 1 / (1 + 1) = 1/2
o1 has 1/2 responsibility (degree of causes) for making o2 a non-NN-answer
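As a rough check of the 1/2 value, the responsibility can be computed by brute force: try contingency sets of increasing size and stop at the first one that works. This is only a sketch under assumptions (1-D positions, ties broken by name, responsibility taken as 1 / (1 + smallest contingency set size), hypothetical names and toy data); it is exponential in the database size and is not the approach proposed in the presentation.

```python
from itertools import combinations

def nearest(db, q):
    """Name of the object closest to q (1-D positions; ties broken by name)."""
    return min(db, key=lambda name: (abs(db[name] - q), name))

def responsibility(db, q, o, a):
    """1 / (1 + size of the smallest contingency set), or 0.0 if o is not a cause of a."""
    if o == a:
        return 0.0
    others = [n for n in db if n not in (o, a)]
    for size in range(len(others) + 1):           # smallest sets first
        for gamma in combinations(others, size):
            reduced = {k: v for k, v in db.items() if k not in gamma}
            without_o = {k: v for k, v in reduced.items() if k != o}
            # After removing gamma, a is still not the NN, but removing o as well makes it the NN.
            if nearest(reduced, q) != a and nearest(without_o, q) == a:
                return 1.0 / (1 + size)
    return 0.0

# Toy data (assumed): distances to q = 0 are o1 < o3 < o2.
db = {"o1": 1.0, "o3": 2.0, "o2": 3.0}
r_o1_o2 = responsibility(db, 0.0, "o1", "o2")   # contingency set {o3}
r_o3_o2 = responsibility(db, 0.0, "o3", "o2")   # contingency set {o1}
r_o1_o3 = responsibility(db, 0.0, "o1", "o3")   # counterfactual cause, empty contingency set
r_o2_o3 = responsibility(db, 0.0, "o2", "o3")   # o2 is farther than o3, so it is not a cause
```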

14 Probabilistic Nearest Neighbor Query
Uncertain database D: …
Possible worlds pw(D): …
Probabilities of possible worlds Pr{pw(D)}: s11.p · s21.p · s31.p, …, s12.p · s22.p · s32.p
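The possible-worlds semantics can be illustrated by enumerating every world, multiplying the sample probabilities, and crediting each world's probability to its nearest neighbor. The sketch below assumes independent objects with mutually exclusive 1-D samples given as (position, probability) pairs; the data and the function names are hypothetical, and the enumeration is exponential, so it is meant for tiny examples only.

```python
from itertools import product

def possible_worlds(udb):
    """Yield (world, probability): one sample per object, probability = product of sample probabilities."""
    names = list(udb)
    for combo in product(*(udb[n] for n in names)):
        world = {n: pos for n, (pos, _) in zip(names, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def pnn_probabilities(udb, q):
    """Probability that each object is the nearest neighbor of q, summed over possible worlds."""
    probs = {n: 0.0 for n in udb}
    for world, p in possible_worlds(udb):
        nn = min(world, key=lambda n: (abs(world[n] - q), n))
        probs[nn] += p
    return probs

# Toy uncertain database (assumed): each object maps to a list of (position, probability) samples.
udb = {
    "o1": [(1.0, 0.5), (5.0, 0.5)],
    "o2": [(2.0, 0.5), (6.0, 0.5)],
    "o3": [(3.0, 1.0)],
}
probs = pnn_probabilities(udb, 0.0)
alpha = 0.5
pnn_answers = {n for n, p in probs.items() if p >= alpha}
```

With this toy data there are four equally likely worlds: o1 is the nearest neighbor in two of them, while o2 and o3 each win one.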

15 Causality and Responsibility in Uncertain Databases
The expected responsibility in uncertain databases
Let si (or sj) be a sample of uncertain object oi (or aj) that appears in a possible world pw(D)
The expected responsibility, resp(oi, aj), of oi for possible answer aj is the expectation of the responsibility over all possible worlds; for PNN:
  resp(oi, aj) = Σ_{pw(D)} Pr{pw(D)} · ρ(si, sj)

16 Responsibility Matrix
For uncertain database D
We construct an NN responsibility matrix, resp_matrix
Each entry, resp_matrix(oi, aj), of the matrix corresponds to the expected responsibility resp(oi, aj)

17 Novel Interpretation of Probabilistic Queries
Summed column value (probabilistic confidence)
We sum up all the values on the j-th column: Sum_C(aj) = Σ_i resp_matrix(oi, aj)
We can prove that Sum_C(aj) = 1 − PrQ(aj)
For PNN query, Sum_C(aj) = 1 − PrPNN(aj)
Example: Sum_C(o3) = 0.375, so PrPNN(o3) = 1 − Sum_C(o3) = 0.625
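The column-sum identity can be checked numerically on a tiny example. The sketch below assumes that the expected responsibility is the probability-weighted sum of per-world responsibilities, and uses the fact that for an NN query, if m samples lie strictly closer to q than the sample of aj, each of them is a cause with a smallest contingency set of size m − 1, hence responsibility 1/m. The data, the function names, and the 1-D setting are hypothetical, and the toy data is chosen so that no two samples are equidistant from q.

```python
from itertools import product

def resp_matrix(udb, q):
    """Expected NN responsibility of every object o for every candidate answer a (brute force)."""
    names = list(udb)
    matrix = {(o, a): 0.0 for o in names for a in names}
    for combo in product(*(udb[n] for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist = {n: abs(pos - q) for n, (pos, _) in zip(names, combo)}
        for a in names:
            closer = [o for o in names if dist[o] < dist[a]]
            for o in closer:                       # each closer object has responsibility 1/m
                matrix[(o, a)] += prob / len(closer)
    return matrix

def sum_c(matrix, a):
    """Summed column value of candidate answer a."""
    return sum(v for (_, aj), v in matrix.items() if aj == a)

# Toy uncertain database (assumed): object -> list of (position, probability) samples.
udb = {
    "o1": [(1.0, 0.5), (5.0, 0.5)],
    "o2": [(2.0, 0.5), (6.0, 0.5)],
    "o3": [(3.0, 1.0)],
}
m = resp_matrix(udb, 0.0)
column_sums = {a: sum_c(m, a) for a in udb}
```

In this toy database the NN probabilities are 0.5 for o1 and 0.25 each for o2 and o3, so the column sums should come out as 0.5, 0.75, and 0.75.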

18 Novel Interpretation of Probabilistic Queries (cont'd)
An equivalent PNN query interpretation
Return those uncertain objects aj, such that the PNN probabilities, PrPNN(aj), are greater than α
Return those uncertain objects aj, such that the summed column values, Sum_C(aj), are smaller than (1 − α)

19 Influence of Uncertain Objects
Summed row value (influence)
We sum up all the values on the i-th row
The influence of object oi is given by: Sum_R(oi) = Σ_j resp_matrix(oi, aj)
Interpretation of influence
  Uncertain objects oi with high Sum_R(oi) values are more influential to query answers
  Two possible reasons for high influence:
    Objects oi are query answers themselves
    Objects oi contain low-quality samples
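Ranking objects by influence amounts to summing each row of the responsibility matrix and sorting. The sketch below assumes the matrix is stored as a nested dictionary (row object, then column object); the entries are hypothetical toy values, not results from the presentation.

```python
def influence_ranking(resp_matrix):
    """Return (object, Sum_R) pairs sorted by decreasing influence (ties broken by name)."""
    sum_r = {o: sum(row.values()) for o, row in resp_matrix.items()}
    return sorted(sum_r.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical 3 x 3 responsibility matrix: toy[oi][aj] = expected responsibility of oi for aj.
toy = {
    "o1": {"o1": 0.0,   "o2": 0.5,  "o3": 0.375},
    "o2": {"o1": 0.125, "o2": 0.0,  "o3": 0.375},
    "o3": {"o1": 0.375, "o2": 0.25, "o3": 0.0},
}
ranking = influence_ranking(toy)
```

Here o1 ranks first with Sum_R = 0.875, followed by o3 (0.625) and o2 (0.5). A high rank alone does not say whether the object is a likely answer or carries low-quality samples; both reasons listed on the slide remain possible.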

20 Problem Definition – CR-PNN
CR-PNN query
Given a query point q, an uncertain database D, and a probabilistic threshold α, a CR-PNN query returns a PNN answer set AQ and a set AN of (at least) k high-influence objects
The query answering of CR-PNN
Straightforward method:
1. Compute responsibility matrix resp_matrix
2. Add query answers aj to AQ with Sum_C(aj) < 1 − α
3. Remove all aj from database D
4. Add object oi with the highest influence Sum_R(oi) to AN
5. AN = AN − AQ, and remove oi from database D
6. Update the matrix, and repeat Step 2 until |AN| ≥ k
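One possible reading of the straightforward method is sketched below. Several points are assumptions rather than statements from the slides: the matrix is recomputed from scratch in every round, the highest-influence object is picked from the matrix before the current answers are removed (so Step 5 can discard it if it is an answer), ties are broken by name, and the loop also stops when the database is empty. The brute-force matrix computation, the 1-D toy data, and all function names are hypothetical.

```python
from itertools import product

def resp_matrix(udb, q):
    """Expected NN responsibility matrix by possible-world enumeration (tiny inputs only)."""
    names = list(udb)
    matrix = {(o, a): 0.0 for o in names for a in names}
    for combo in product(*(udb[n] for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist = {n: abs(pos - q) for n, (pos, _) in zip(names, combo)}
        for a in names:
            closer = [o for o in names if dist[o] < dist[a]]
            for o in closer:
                matrix[(o, a)] += prob / len(closer)
    return matrix

def cr_pnn(udb, q, alpha, k):
    """Return (AQ, AN) following the six steps of the straightforward method."""
    db, AQ, AN = dict(udb), set(), set()
    while db and len(AN) < k:
        m = resp_matrix(db, q)                                   # Steps 1 and 6
        names = list(db)
        col = {a: sum(m[(o, a)] for o in names) for a in names}  # Sum_C
        row = {o: sum(m[(o, a)] for a in names) for o in names}  # Sum_R
        answers = {a for a in names if col[a] < 1 - alpha}       # Step 2
        AQ |= answers
        top = max(names, key=lambda o: (row[o], o))              # Step 4
        AN.add(top)
        AN -= AQ                                                 # Step 5
        for n in answers | {top}:                                # Steps 3 and 5
            del db[n]
    return AQ, AN

# Toy uncertain database (assumed): object -> list of (position, probability) samples.
udb = {
    "o1": [(1.0, 0.5), (5.0, 0.5)],
    "o2": [(2.0, 0.5), (6.0, 0.5)],
    "o3": [(3.0, 1.0)],
}
AQ, AN = cr_pnn(udb, 0.0, 0.7, 2)
```

On this toy input no object reaches the threshold, so AQ stays empty, and the two rounds pick o1 and then o3 as the high-influence objects.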

21 CR-PNN Example
CR-based PNN query with α = 0.7, k = 2
[Figure: responsibility matrix with the highest influences marked; labels shown: AQ = ∅, AN = ∅, {o2, o3}]

23 CR-PNN Problem Reduction
Reduction of PNN responsibility computation
[Figure: query point q, objects oi and ok, and sample sj]

24 Derivation of F(q, si, x)
F(q, si, x) can be computed recursively
[Figure: query point q, objects oi and ok, and sample sj]

25 Pruning Strategies
To reduce the computation cost, we compute lower/upper bounds of responsibility in the matrix
Pruning with probabilistic threshold
If LB_Sum_C(aj) ≥ 1 − α, then aj can be pruned

26 Pruning Strategies (cont'd)
Pruning with influence threshold Let t be the k-th largest lower bound of influence If UB_Sum_R(oi) < t, then oi should not be put into AN set
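Both pruning rules reduce to simple comparisons once bounds are available. The sketch below takes the bounds as given dictionaries, since how they are derived is not shown here; the bound values and function names are hypothetical, and k is assumed to be no larger than the number of objects.

```python
def prune_by_threshold(lb_sum_c, alpha):
    """Keep candidates aj that cannot be pruned, i.e. LB_Sum_C(aj) < 1 - alpha."""
    return {a for a, lb in lb_sum_c.items() if lb < 1 - alpha}

def prune_by_influence(lb_sum_r, ub_sum_r, k):
    """Keep objects whose influence upper bound reaches t, the k-th largest lower bound."""
    t = sorted(lb_sum_r.values(), reverse=True)[k - 1]
    return {o for o, ub in ub_sum_r.items() if ub >= t}

# Hypothetical bounds for three objects.
lb_sum_c = {"o1": 0.2, "o2": 0.6, "o3": 0.1}
candidates = prune_by_threshold(lb_sum_c, alpha=0.5)          # o2 is pruned: 0.6 >= 0.5

lb_sum_r = {"o1": 0.8, "o2": 0.3, "o3": 0.5}
ub_sum_r = {"o1": 0.9, "o2": 0.4, "o3": 0.7}
influential = prune_by_influence(lb_sum_r, ub_sum_r, k=2)     # t = 0.5, so o2 is pruned
```

The pruning is safe because a lower bound already at or above 1 − α means the true column sum cannot fall below it, and an upper bound below t means at least k other objects are guaranteed to be more influential.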

27 Derivation of PNN Responsibility Bounds
lb_resp(oi, aj) and ub_resp(oi, aj) Assume that the probabilities that n objects fall into  are Let =

28 CR-PNN Query Procedure
1. Compute lower/upper bounds of responsibility matrix resp_matrix
2. Add query answers aj to AQ with LB_Sum_C(aj) < 1 − α
3. Remove all aj from database D
4. Set t to be the k-th largest influence lower bound
5. Add objects oi with UB_Sum_R(oi) > t to AN
6. AN = AN − AQ, and remove oi from database D
7. Update the matrix, and repeat Step 2 until |AN| ≥ k

29 Experimental Results
Data sets
  Real spatial data: Long Beach (LB) and Tiger/Line LA River and Railways (RR)
  Synthetic data: Randomly generate center Co and half extent eo (on each dimension) of uncertain objects o
  4 data sets: lUeU, lUeG, lSeU, lSeG
Measure
  CPU time and I/O cost

30 Effectiveness
Synthetically inject noise into 50% of data sets, and compare the recall ratio of CR-PNN (compared with PNN results on data sets without noise)
Data size N = 30K, dimensionality d = 2, k = 5, probabilistic threshold α = 0.5, [emin, emax] = [0.1, 0.5]

31 Efficiency of CR-PNN
Dimensionality d = 2, k = 5, probabilistic threshold α = 0.5, [emin, emax] = [0.1, 0.5]

32 Conclusion
We introduce causality and responsibility (CR) to uncertain databases, and propose to interpret probabilistic queries using CR
We formalize the CR-PNN problem and reduce the problem over possible worlds to one over uncertain objects
We propose effective pruning methods to reduce the computation cost
We propose an efficient CR-PNN query answering approach that returns both query answers and high-influence objects
We conduct extensive experiments to verify the effectiveness and efficiency of our proposed approaches

