Published by Ryder Reid. Modified over 9 years ago.
1
Cleaning Uncertain Data with Quality Guarantees
Reynold Cheng, Jinchuan Chen, Xike Xie (VLDB 2008)
Presented by SHAO Yufeng
2
Outline
- Background
- Related works
- Data and query model
- PWS-quality model
- Cleaning procedure
- Experimental results
3
Uncertain Database (old model)
- Uncertainty is inherent in various applications
- Examples: RFID data, sensor network data, data protected for privacy reasons
- It is infeasible to eliminate all uncertainty in many models
4
Uncertain Database (new model)
- Previous models focus on querying the uncertain database
- But what if we are able to reduce SOME of the uncertainty in this kind of database?
- A new model is required to produce an optimal solution
5
Example 1: Sensor probing
- Some sensors in the sensor network might have transmission problems and cannot update their data
- Commands can be sent to refresh some sensors
- New, certain data are obtained
- Probing is limited by bandwidth and battery power, so sensors cannot be probed too often
6
Example 2: Movie rating
- Movie ratings (IMDB, Netflix) collected from customers might contain some uncertainty
- Managers can communicate with customers to verify the rating data
- New, certain movie rating data are obtained
- Verification is limited by manpower or other resources
7
Cleaning Data
- Uncertain DB + query -> ambiguous result
- Cleaning procedure: uncertain DB -> LESS uncertain DB
- LESS uncertain DB + query -> LESS ambiguous result
8
Real model example
A database of some products and their prices (uncertain)

Key   Product ID   Price ($)   Prob.
a1    a            120         0.7
a2    a            80          0.3
b1    b            110         0.6
b2    b            90          0.4
c1    c            140         0.5
c2    c            110         0.3
c3    c            100         0.2
d1    d            10          1

The price of product a has two possible values: 120 (prob. 0.7) or 80 (prob. 0.3)
9
Query Example 1
Table: same product table as on slide 8.
Query 1 (range query): select products with price in the range [$100, $110]
Possible world results:
- ({b1,c2}, 0.18)
- ({b1,c3}, 0.12)
- ({b1}, 0.3)
- ({c2}, 0.12)
- ({c3}, 0.08)
- (∅, 0.2)
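The possible-world results for the range query can be reproduced by brute-force enumeration. The sketch below is illustrative only: the dictionary layout `DB` and the helper names (`possible_worlds`, `range_query`, `pw_results`) are assumptions made for this example, not part of the paper's system. It assumes that each x-tuple's probabilities sum to 1, so every possible world picks exactly one tuple per x-tuple.

```python
from collections import defaultdict
from itertools import product

# Product table from the slides: x-tuple -> {key: (price, existential prob.)}
DB = {
    "a": {"a1": (120, 0.7), "a2": (80, 0.3)},
    "b": {"b1": (110, 0.6), "b2": (90, 0.4)},
    "c": {"c1": (140, 0.5), "c2": (110, 0.3), "c3": (100, 0.2)},
    "d": {"d1": (10, 1.0)},
}


def possible_worlds(db):
    """Yield (world, prob); a world maps each chosen tuple key to its price."""
    alternatives = [list(xt.items()) for xt in db.values()]
    for choice in product(*alternatives):
        prob = 1.0
        for _, (_, e) in choice:
            prob *= e
        yield {key: price for key, (price, _) in choice}, prob


def range_query(world, lo, hi):
    """Keys of the tuples whose price lies in [lo, hi] in this world."""
    return frozenset(k for k, price in world.items() if lo <= price <= hi)


def pw_results(db, query):
    """Group possible worlds by query answer and add up their probabilities."""
    results = defaultdict(float)
    for world, prob in possible_worlds(db):
        results[query(world)] += prob
    return dict(results)


results = pw_results(DB, lambda w: range_query(w, 100, 110))
for answer, prob in sorted(results.items(), key=lambda kv: -kv[1]):
    print(sorted(answer), round(prob, 3))
```

Enumeration is exponential in the number of x-tuples, so it is only useful for checking small examples like this one.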
10
Query Example 2
Table: same product table as on slide 8.
Query 2 (max query): select the product with the highest price
Possible world answers (ties at the highest price return both tuples; the probabilities sum to 1):
- ({c1}, 0.5)
- ({a1}, 0.35)
- ({b1,c2}, 0.054)
- ({b1}, 0.036)
- ({c2}, 0.036)
- ({c3}, 0.024)
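The max-query answers can be checked the same way. This is a minimal sketch, not the paper's evaluation algorithm: the layout of `DB` and the function name `max_query_results` are assumptions for illustration, and ties at the highest price are assumed to return every tied tuple.

```python
from collections import defaultdict
from itertools import product

# Product table from the slides: x-tuple -> {key: (price, existential prob.)}
DB = {
    "a": {"a1": (120, 0.7), "a2": (80, 0.3)},
    "b": {"b1": (110, 0.6), "b2": (90, 0.4)},
    "c": {"c1": (140, 0.5), "c2": (110, 0.3), "c3": (100, 0.2)},
    "d": {"d1": (10, 1.0)},
}


def max_query_results(db):
    """Map each distinct max-query answer to its total probability."""
    results = defaultdict(float)
    alternatives = [list(xt.items()) for xt in db.values()]
    for choice in product(*alternatives):
        prob = 1.0
        for _, (_, e) in choice:
            prob *= e
        top = max(price for _, (price, _) in choice)
        # Every tuple that reaches the highest price is part of the answer.
        answer = frozenset(k for k, (price, _) in choice if price == top)
        results[answer] += prob
    return dict(results)


max_results = max_query_results(DB)
for answer, prob in sorted(max_results.items(), key=lambda kv: -kv[1]):
    print(sorted(answer), round(prob, 3))
```

Because the answers partition the possible worlds, their probabilities must add up to 1, which is a quick sanity check on any hand-computed list.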
11
Clean up example
- Suppose we have some amount of resources to clean up some data
- Assume we clean up the information related to products a and c

New database with less uncertainty:

Key   Product ID   Price ($)   Prob.
a2    a            80          1
b1    b            110         0.6
b2    b            90          0.4
c3    c            100         1
d1    d            10          1
12
Clean up example (Cont.)
New database with less uncertainty:

Key   Product ID   Price ($)   Prob.
a2    a            80          1
b1    b            110         0.6
b2    b            90          0.4
c3    c            100         1
d1    d            10          1

Run query 1 again: select products with price in the range [$100, $110]
- New possible world results: ({b1,c3}, 0.6), ({c3}, 0.4)
- Old possible world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
- The cleaned database is clearly less uncertain, but the cleaning procedure is limited by the budget
13
Outline (next section: Related works)
14
Important related works
- Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003: 551-562
  Mentions the idea of cleaning for max/min and range queries, but gives no real implementation
- P. Andritsos, A. Fuxman, and R. Miller: Clean answers over dirty databases: A probabilistic approach. ICDE 2006
  Introduces a query-rewriting technique
15
Important related works (Cont.)
- Jinchuan Chen, Reynold Cheng: Quality-Aware Probing of Uncertain Data with Resource Constraints. SSDBM 2008
  Similar cleaning method
  Uncertainty is represented by continuous pdfs
  Supports fewer query types (range queries only)
- Chris Mayfield, Jennifer Neville, Sunil Prabhakar: ERACER: A Database Approach for Statistical Inference and Data Cleaning. SIGMOD 2010
  Uses attribute-level correlations to provide optimized cleaning
16
Outline (next section: Data and query model)
17
System Structure
18
Important Notations
Table: same product table as on slide 8.
- tuple t_i (n tuples in total)
- x-tuple τ_i (m x-tuples in total); the rows that share a Product ID form one x-tuple
- uncertain attribute: Price ($)
- existential probability e_i: the Prob. column
20
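Query in possible world model
[Figure: possible worlds of an example database and the query result in each world]
- Query answer: (b1, 0.28), (c2, 0.18), (c3, 0.1)
- Possible-world results shown: ({b1,c2}, 0.18), ({b1,c3}, 0.1)
- The figure labels also include the value -1.44
- Qualification probability (p_i) of tuple c2: 0.18
- Qualification probability (P_k) of x-tuple c: 0.28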
21
Probabilistic Range Query (PRQ)
Given a closed interval [a, b], where a and b are real numbers and a <= b, a PRQ returns a set of tuples (t_i, p_i), where p_i is the non-zero probability that the uncertain attribute value of t_i lies in [a, b].
Table: same product table as on slide 8.
Range query: select products with price in the range [$100, $110]
Possible world result set, each result r_j with its probability of occurrence q_j:
({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
22
Probabilistic Maximum Query (PMaxQ)
A PMaxQ returns a set of tuples (t_i, p_i), where p_i, the qualification probability of t_i, is the non-zero probability that the uncertain attribute value of t_i is at least that of every other tuple t_j, where j = 1, ..., n and j ≠ i.
Table: same product table as on slide 8.
Query: select the product with the highest price
Possible world answers (the probabilities sum to 1):
({c1}, 0.5), ({a1}, 0.35), ({b1,c2}, 0.054), ({b1}, 0.036), ({c2}, 0.036), ({c3}, 0.024)
23
Outline (next section: PWS-quality model)
24
PWS-quality
- Suppose we have two sets of possible world results
  [Figure: two possible-world result sets with their probabilities. Result sets shown: {a2,b1}, {a1,b2,c1}, {b3,c2}, {b1}, {a1,c1}; probabilities shown: 0.2, 0.1, 0.2, 0.9, 0.1, 0.3]
- We need a measure that tells which result is more uncertain, and by how much
- Solution: use an entropy-like measure to calculate the PWS-quality (degree of uncertainty)
25
PWS-Quality: Calculation
- Let q_j be the probability of getting the distinct PW-result r_j
- Let d be the number of distinct PW-results
- S(D, Q) = Σ_{j=1..d} q_j · log(q_j)
- The score S(D, Q) is never positive; the larger (closer to 0) the score, the better the quality
- 0 means no uncertainty (only one possible world result exists)
26
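PWS-quality example
Suppose we have a set of possible world results.
[Figure: result sets {b1}, {a1,c1}, {b2}; probabilities shown: 0.4, 0.1, 0.5]
PWS score (base-2 logarithm, the base that matches the -1.17 value in the later cleaning example):
S(D, Q) = 0.5·log2(0.5) + 0.4·log2(0.4) + 0.1·log2(0.1) ≈ -1.361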
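The score is the negated entropy of the distribution over distinct possible-world results. The helper below is a small sketch; the function name `pws_quality` and the choice of base-2 logarithms are assumptions (base 2 is the base that reproduces the -1.17 value in the later cleaning example).

```python
import math


def pws_quality(probs):
    """Sum of q * log2(q) over the distinct PW-result probabilities."""
    return sum(q * math.log2(q) for q in probs if q > 0)


score = pws_quality([0.5, 0.4, 0.1])
print(round(score, 3))

# A single certain result carries no uncertainty, so its score is 0.
certain = pws_quality([1.0])
print(certain)
```

With these three probabilities the score comes out at about -1.361; a distribution concentrated on one result scores 0, the best possible value.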
27
PWS-quality problem
- However, calculating the PWS-quality over all possible worlds is too expensive
- The number of possible world results might be exponential
- We need to speed up the algorithm
28
x-Form PWS-Quality
- x-form of the PWS-quality: a summation of the quality information of all the result x-tuples
- g(k, D, Q) = func(existential and qualification probabilities of the tuples in the k-th x-tuple)
- Only x-tuples that have tuples in the query answer are considered
29
x-Form of PRQ (Range Query)
- Each g(k, D, Q) only requires O(|τ_k|) time
- p_i and P_k are the qualification probabilities of the current tuple t_i and the current x-tuple τ_k, which can be calculated easily
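The formula for g(k, D, Q) did not survive in this transcript, so the sketch below uses a per-x-tuple term derived directly from the independence of x-tuples under a range query; it should be read as an assumption, not as the formula on the slide. For x-tuple k, each in-range tuple is its own outcome with probability e_i, and "no tuple in range" has probability 1 - P_k. The names `g_prq`, `x_form_quality`, and `brute_force_quality` are hypothetical.

```python
import math
from itertools import product

# Product table from the slides: x-tuple -> {key: (price, existential prob.)}
DB = {
    "a": {"a1": (120, 0.7), "a2": (80, 0.3)},
    "b": {"b1": (110, 0.6), "b2": (90, 0.4)},
    "c": {"c1": (140, 0.5), "c2": (110, 0.3), "c3": (100, 0.2)},
    "d": {"d1": (10, 1.0)},
}


def plogp(p):
    """p * log2(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def g_prq(xtuple, lo, hi):
    """Assumed per-x-tuple term for a range query; one pass over the x-tuple."""
    in_range = [e for price, e in xtuple.values() if lo <= price <= hi]
    p_k = sum(in_range)  # qualification probability of the x-tuple
    return sum(plogp(e) for e in in_range) + plogp(1.0 - p_k)


def x_form_quality(db, lo, hi):
    """PWS-quality as a sum of per-x-tuple terms, linear in the tuple count."""
    return sum(g_prq(xt, lo, hi) for xt in db.values())


def brute_force_quality(db, lo, hi):
    """Reference value obtained by enumerating every possible world."""
    results = {}
    for choice in product(*(list(xt.items()) for xt in db.values())):
        prob = math.prod(e for _, (_, e) in choice)
        answer = frozenset(k for k, (price, _) in choice if lo <= price <= hi)
        results[answer] = results.get(answer, 0.0) + prob
    return sum(plogp(q) for q in results.values())


fast = x_form_quality(DB, 100, 110)
slow = brute_force_quality(DB, 100, 110)
print(round(fast, 4), round(slow, 4))
```

On the slide's product table both routes agree, which illustrates why the decomposition avoids the exponential enumeration for range queries.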
30
x-Form of PMaxQ (Max Query)
- Calculating g(k, D, Q) for PMaxQ requires O(|τ_k|²) time
- Details of the proof are covered at the end of the presentation
31
x-form PWS-quality summary
- By transforming the original PWS-quality calculation into the x-form calculation, we avoid the exponential computation time
- Total computation time: O(m log(n/m))
- Compared to the query time, the x-form PWS-quality calculation time is small (shown in the experiments)
32
Outline (next section: Cleaning procedure)
33
Cleaning with limited budget
With a limited budget of, say, 10 units, which tuples should we clean?
Table: same product table as on slide 8.
Cleaning costs shown next to the x-tuples: 5 units, 7 units, 10 units
34
Example of cleaning
- After cleaning, the existential probability of the remaining tuple becomes 1
- The x-tuple contracts to a single tuple with a certain attribute value

Key   Product ID   Price ($)   Prob.
a1    a            120         0.7
a2    a            80          0.3
b1    b            110         0.6
b2    b            90          0.4
c3    c            100         1
d1    d            10          1
35
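Quality improvement
- The set of x-tuples that we are going to clean is represented by X = {τ_1, ···, τ_|X|}
- Expected quality after cleaning: the expectation of S(D', Q) over the possible outcomes of cleaning X
- Quality improvement: the expected quality after cleaning compared with the current quality S(D, Q)
- But computing the quality improvement this way takes exponential time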
36
Computation example
Query 2 (max query): select the product with the highest price
Suppose we decide to clean up x-tuple c.

Key   Product ID   Price ($)   Prob.   QP
a1    a            120         0.7     0.35
a2    a            80          0.3     0
b1    b            110         0.6     0.09
b2    b            90          0.4     0
c1    c            140         0.5
c2    c            110         0.3     0.05
c3    c            100         0.2     0.024
d1    d            10          1       0
37
Computation example (Cont.)
Query 2 (max query): select the product with the highest price
We decided to clean up x-tuple c; one possible case is that c3 is the real-world value.

Key   Product ID   Price ($)   Prob.   QP
a1    a            120         0.7
a2    a            80          0.3     0
b1    b            110         0.6     0.18
b2    b            90          0.4     0
c1    c            140         0.5
c2    c            110         0.3
c3    c            100         1       0.12
d1    d            10          1       0

New PWS-quality: S(D', Q) = -1.17
38
Computation example (Cont.)
Query 2 (max query): select the product with the highest price
We decided to clean up x-tuple c; another possible case is that c2 is the real-world value.

Key   Product ID   Price ($)   Prob.   QP
a1    a            120         0.7
a2    a            80          0.3     0
b1    b            110         0.6     0.18
b2    b            90          0.4     0
c1    c            140         0.5
c2    c            110         1       0.12
c3    c            100         0.2
d1    d            10          1       0

New PWS-quality: S(D', Q) = -1.17
39
Computation example (Cont.)
Query 2 (max query): select the product with the highest price

Key   Product ID   Price ($)   Prob.   QP
a1    a            120         0.7     0.35
a2    a            80          0.3     0
b1    b            110         0.6     0.09
b2    b            90          0.4     0
c1    c            140         0.5
c2    c            110         0.3     0.05
c3    c            100         0.2     0.024
d1    d            10          1       0

Cleaning x-tuple c has 3 possible real-world outcomes (c1, c2 or c3).
Expected quality after cleaning x-tuple c = 0 × 0.5 + (-1.17) × 0.3 + (-1.17) × 0.2 = -0.585
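The expected quality can be recomputed by cleaning x-tuple c once for each possible outcome and weighting the resulting PWS-quality by that outcome's probability. This is a brute-force sketch for checking the arithmetic, not the x-form method; the names `max_query_quality` and `expected_quality_after_cleaning` are assumptions, as is the use of base-2 logarithms.

```python
import math
from itertools import product

# Product table from the slides: x-tuple -> {key: (price, existential prob.)}
DB = {
    "a": {"a1": (120, 0.7), "a2": (80, 0.3)},
    "b": {"b1": (110, 0.6), "b2": (90, 0.4)},
    "c": {"c1": (140, 0.5), "c2": (110, 0.3), "c3": (100, 0.2)},
    "d": {"d1": (10, 1.0)},
}


def max_query_quality(db):
    """PWS-quality of the max query, by enumerating all possible worlds."""
    results = {}
    for choice in product(*(list(xt.items()) for xt in db.values())):
        prob = math.prod(e for _, (_, e) in choice)
        top = max(price for _, (price, _) in choice)
        answer = frozenset(k for k, (price, _) in choice if price == top)
        results[answer] = results.get(answer, 0.0) + prob
    return sum(q * math.log2(q) for q in results.values() if q > 0)


def expected_quality_after_cleaning(db, name):
    """Average the quality over every value the cleaned x-tuple may reveal."""
    expected = 0.0
    for key, (price, e) in db[name].items():
        cleaned = dict(db)
        cleaned[name] = {key: (price, 1.0)}  # the x-tuple becomes certain
        expected += e * max_query_quality(cleaned)
    return expected


before = max_query_quality(DB)
after_c = expected_quality_after_cleaning(DB, "c")
print(round(before, 3), round(after_c, 3), round(after_c - before, 3))
```

The unrounded expected quality is about -0.586; the slide's -0.585 comes from rounding each per-outcome quality to -1.17 first.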
40
x-form quality improvement
- The quality improvement can also be calculated in x-form [formula not captured in this transcript]
- X is the set of x-tuples that we are going to clean
- Proof: rewrite the original E(S(D'(t), Q)) as a two-part expression [not captured in this transcript]; the left part is equal to 0, and the right part is unchanged after the cleaning
41
Optimal Data Cleaning Algorithm
- With the x-form quality improvement, the problem becomes an objective function [formula not captured in this transcript] with the following parameters:
  c_k: the cleaning cost of the k-th x-tuple
  C: the total cleaning budget
  Z: the total number of x-tuples with p_i in (0,1)
- The problem can be transformed into the 0/1 knapsack problem
42
DP algorithm
- Time complexity: O(CZ)
- Space complexity: O(CZ²)
- C: total budget
- Z: number of x-tuples
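A textbook 0/1 knapsack table shows the idea behind the DP step. This is a generic sketch rather than the paper's exact algorithm: the function name `choose_xtuples`, the integer costs, and the gain values in the usage lines are hypothetical (only the costs 5, 7 and 10 appear on an earlier slide), and this version keeps an O(CZ) table with a traceback instead of the space bound quoted above.

```python
def choose_xtuples(costs, gains, budget):
    """Pick x-tuples that maximize total gain within an integer budget."""
    n = len(costs)
    # best[i][c]: best total gain using the first i x-tuples with budget c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, gain = costs[i - 1], gains[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip x-tuple i-1
            if cost <= c and best[i - 1][c - cost] + gain > best[i][c]:
                best[i][c] = best[i - 1][c - cost] + gain  # option 2: clean it
    # Walk the table backwards to recover which x-tuples were chosen.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= costs[i - 1]
    return best[n][budget], sorted(chosen)


# Hypothetical expected quality improvements for three x-tuples.
costs = [5, 7, 10]
gains = [0.9, 1.0, 1.5]
total_12, picked_12 = choose_xtuples(costs, gains, 12)
total_10, picked_10 = choose_xtuples(costs, gains, 10)
print(picked_12, picked_10)
```

With a budget of 12 the table selects the first two x-tuples; with a budget of 10 it selects only the third, since no cheaper combination yields a larger gain.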
43
Other heuristic methods
- Random
- MaxQP: select the x-tuples with the highest qualification probability
- Greedy: rank the x-tuples by expected quality improvement per unit of cleaning cost
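The greedy heuristic can be sketched in a few lines. The function name `greedy_select` and the example costs and gains are hypothetical; the sketch assumes positive costs and keeps scanning the ranked list after an x-tuple fails to fit, which is one of several reasonable readings of the slide.

```python
def greedy_select(costs, gains, budget):
    """Clean x-tuples in decreasing order of gain per unit cost while they fit."""
    order = sorted(range(len(costs)),
                   key=lambda k: gains[k] / costs[k],
                   reverse=True)
    chosen, remaining = [], budget
    for k in order:
        if costs[k] <= remaining:
            chosen.append(k)
            remaining -= costs[k]
    return sorted(chosen)


# Hypothetical expected quality improvements for three x-tuples.
costs = [5, 7, 10]
gains = [0.9, 1.0, 1.5]
greedy_12 = greedy_select(costs, gains, 12)
greedy_10 = greedy_select(costs, gains, 10)
print(greedy_12, greedy_10)
```

With a budget of 10 this picks only the first x-tuple (gain 0.9), while cleaning the third alone would yield 1.5, so the greedy ranking is fast but not guaranteed to be optimal.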
44
Outline (next section: Experimental results)
45
Experiment setup

Size of DB            10 K x-tuples, 100 K tuples (synthetic)
                      4,999 x-tuples, 10,037 tuples (Netflix movie ratings)
Prob. distributions   Gaussian (variance = 100)
Cleaning cost         Uniform in [1, 10]
Resource budget       [20, 500], default = 30
46
[Figure] PWS-quality (S) vs. database size (Z), PRQ
47
[Figure] Quality evaluation performance (PRQ); x-axis: database size
48
[Figure] Running time for cleanup selection (PMaxQ); x-axis: total budget
49
[Figure] Quality improvement vs. budget (PRQ); x-axis: total budget, y-axis: quality improvement
50
[Figure] Quality improvement vs. budget (PMaxQ); x-axis: total budget, y-axis: quality improvement
51
[Figure] Quality improvement vs. budget (PRQ, real data); x-axis: total budget, y-axis: quality improvement
52
Thank you Q & A
53
Appendix: Deriving x-form of PRQ
54
Appendix: Deriving x-form of PMaxQ A number in [0, ]