Sequential Data Cleaning: A Statistical Approach


1 Sequential Data Cleaning: A Statistical Approach
Aoqian Zhang, Shaoxu Song, Jianmin Wang (Tsinghua University, China). SIGMOD 2016

2 Outline
Motivation; Problem; Solutions (Exact Solution, Approximate Solution); Experiments; Conclusion

3 Erroneous Stream Data
Stream data are often dirty
  Unreliable sensor readings
  Large spike errors and small errors
Stock and flight data
  Accuracy of stock data in Yahoo! Finance is 0.93 [1]
  Accuracy of Travelocity is 0.95 [1]
Reasons
  Ambiguity in information extraction
  Pure mistakes
[1] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97-108, 2012.

4 Data Cleaning
Constraint-based methods
  Constraints on speeds of value changes [1]
  Minimum change principle
  Large spike errors: repaired to the max/min values allowed
  Small errors: fail to be identified
[1] S. Song, et al. SCREEN: Stream Data Cleaning under Speed Constraints. SIGMOD, 2015.

5 Probability Notations
Speed: $v_{i-1,i} = \frac{x_i - x_{i-1}}{t_i - t_{i-1}}$
Change of speed: $u_i = v_{i,i+1} - v_{i-1,i}$
Intuition: the change of speed between consecutive data points should not be significant
Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
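
The notations above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names and the `prob` dictionary (a discrete distribution over speed changes) are hypothetical, and the natural logarithm is assumed because the slide does not state the base.

```python
import math

def speeds(x, t):
    """Speed v_{i-1,i} between each pair of consecutive points."""
    return [(x[i] - x[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(x))]

def speed_changes(x, t):
    """Change of speed u_i = v_{i,i+1} - v_{i-1,i} for the interior points."""
    v = speeds(x, t)
    return [v[i] - v[i - 1] for i in range(1, len(v))]

def log_likelihood(x, t, prob):
    """Sum of log P(u_i); `prob` maps a speed change to its probability (assumed input)."""
    total = 0.0
    for u in speed_changes(x, t):
        p = prob.get(u, 0.0)
        if p <= 0.0:
            return float("-inf")  # an unseen speed change makes the sequence impossible
        total += math.log(p)
    return total
```

A sequence whose speed changes all have high probability scores close to zero; one improbable speed change pulls the sum down sharply.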

6 Likelihood Example
Sequence $x = \{11, 12, 15, 14, 15, 15, 17\}$
$P(u_3) = P(v_{3,4} - v_{2,3}) = P(-4) = 0.1$

Condition | Observe | Truth | Repair1 | Repair2
$L(x)$ | -8.1 | -6.6 | -7.0 | -6.0

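The speed changes of the example sequence on slide 6 can be checked with a short script. It assumes unit timestamps (the slide gives none), so each speed is just the difference of neighbouring values; the variable names are illustrative.

```python
x = [11, 12, 15, 14, 15, 15, 17]

# Speeds v_{i-1,i}, assuming timestamps 1, 2, ..., n (unit spacing).
v = [b - a for a, b in zip(x, x[1:])]

# Speed changes u_i = v_{i,i+1} - v_{i-1,i}, defined for i = 2 .. n-1.
u = [b - a for a, b in zip(v, v[1:])]

# Index the speed changes by their 1-based position i, as on the slide.
u_by_index = {i + 2: value for i, value in enumerate(u)}

print(u_by_index[3])
```

With these assumptions, `u_by_index[3]` equals $v_{3,4} - v_{2,3} = (14 - 15) - (15 - 12) = -4$, the value used on the slide.
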
7 Outline
Motivation; Problem; Solutions (Exact Solution, Approximate Solution); Experiments; Conclusion

8 Problem
Factors
  Data point and error range: $x_i' \in [x_i \pm \theta_i]$
  Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
  Repair cost: $\Delta(x, x') = \sum_{i=1}^{n} |x_i - x_i'|$
Target
  Maximal likelihood repair under budget threshold $\delta$ and error range $\theta$: $(L \mid \delta, \theta)$
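
The two constraints on a candidate repair (error range and budget) can be sketched as follows. The function names are hypothetical, and `theta` is taken to be a per-point list so that it matches $\theta_i$ on the slide.

```python
def repair_cost(x, x_repaired):
    """Repair cost: sum of absolute changes between the observed and repaired values."""
    return sum(abs(a - b) for a, b in zip(x, x_repaired))

def is_feasible(x, x_repaired, theta, delta):
    """True if every point stays within its error range and the total cost fits the budget."""
    within_range = all(abs(a - b) <= th for a, b, th in zip(x, x_repaired, theta))
    return within_range and repair_cost(x, x_repaired) <= delta
```

Among all feasible repairs, the target is the one with the largest likelihood.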

9 Problem: Budget

10 Outline
Motivation; Problem; Solutions (Exact Solution, Approximate Solution); Experiments; Conclusion

11 DP-based Solution
DP: Exact solution
  NP-complete (NPC), by reduction from 0/1 knapsack
  $O(n \theta_{max}^3 \delta)$ time and $O(n \theta_{max}^2 \delta)$ space
DPL: Approximation on cost
  Repair cost: $[0, \delta] \to [0, d]$
  $O(n d^4)$ time and $O(n d^3)$ space
DPC: Approximation on likelihood
  $K = -\epsilon \cdot \log p_{max}$
  $L'(u_i') = \lfloor L(u_i') / K \rfloor$, $L(x') \ge (1+\epsilon) \cdot L(x^*)$
  $O(n^2 \theta_{max}^3)$ time and $O(n^2 \theta_{max}^2)$ space
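
A minimal sketch of an exact dynamic program in the spirit of DP is given below. It is not the authors' implementation. It assumes integer values, unit timestamps, a single error range `theta` for all points, and a caller-supplied `log_p(u)` that returns the log-probability of a speed change; all of these names are hypothetical. The state is the pair of the last two repaired values plus the cost spent so far, which is where the $\theta_{max}^2 \delta$ factor in the space bound comes from.

```python
def dp_repair(x, theta, delta, log_p):
    """Return (repaired sequence, log-likelihood) maximizing likelihood within the budget."""
    n = len(x)
    if n < 3:
        return list(x), 0.0  # no speed change is defined for fewer than 3 points

    # Candidate repaired values for each point: the integers in [x_i - theta, x_i + theta].
    cands = [range(xi - theta, xi + theta + 1) for xi in x]

    # states[(prev, cur, cost)] = (best log-likelihood, repaired prefix)
    states = {}
    for a in cands[0]:
        for b in cands[1]:
            cost = abs(a - x[0]) + abs(b - x[1])
            if cost <= delta:
                states[(a, b, cost)] = (0.0, [a, b])

    for i in range(2, n):
        nxt = {}
        for (a, b, cost), (ll, prefix) in states.items():
            for z in cands[i]:
                new_cost = cost + abs(z - x[i])
                if new_cost > delta:
                    continue
                u = (z - b) - (b - a)  # speed change at the middle point
                new_ll = ll + log_p(u)
                key = (b, z, new_cost)
                if key not in nxt or new_ll > nxt[key][0]:
                    # Copying the prefix keeps the sketch short but adds an O(n) factor.
                    nxt[key] = (new_ll, prefix + [z])
        states = nxt

    best_ll, best = max(states.values(), key=lambda s: s[0])
    return best, best_ll
```

For example, `dp_repair([1, 2, 9, 4, 5], theta=6, delta=6, log_p=lambda u: -abs(u))` spends the whole budget on the spike at the third point. Storing a back-pointer instead of the whole prefix would avoid the extra copying.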

12 Other Approximation
QP: Approximation on probability distribution
  Discrete to continuous probability distribution
  Quadratic programming
SG: Simple greedy
  Reduction of $|u_i'|$
  $O(\max(n, \delta))$
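
One way to see why a continuous distribution leads to quadratic programming is sketched below. The Gaussian form and unit timestamps are assumptions made for illustration; the slide does not say which continuous distribution is used.

```latex
% Assumption (not from the slides): u_i' ~ N(0, \sigma^2), unit timestamps
\log P(u_i') = -\frac{(u_i')^2}{2\sigma^2} + \text{const},
\qquad u_i' = x_{i+1}' - 2x_i' + x_{i-1}'

\max_{x'} L(x')
\;\Longleftrightarrow\;
\min_{x'} \sum_{i=2}^{n-1} \left(x_{i+1}' - 2x_i' + x_{i-1}'\right)^2
\quad \text{s.t.} \quad |x_i' - x_i| \le \theta_i,\;\; \sum_{i=1}^{n} |x_i - x_i'| \le \delta
```

Under that assumption the objective is quadratic in $x'$, and both the error range and the budget can be written as linear constraints.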

13 Solutions Summary

Algorithm | Time Complexity | Feature
DP | $O(n \theta_{max}^3 \delta)$ | Better repair accuracy
DPC | $O(n^2 \theta_{max}^3)$ | Runs faster than DP with a high budget
DPL | $O(n d^4)$ | Fast, higher error
QP | | Probabilistic distribution
SG | $O(\max(n, \delta))$ | Fastest, repair accuracy not guaranteed

14 Outline
Motivation; Problem; Solutions (Exact Solution, Approximate Solution); Experiments; Conclusion

15 Experiment Setting
Datasets
  STOCK: daily prices of AIP.L from to , with data points
  GPS: 150/2358 points in the trajectory
  ENGINE: 4 sequences of a crane, switching-count = α * pump-volume + β * DT0 + γ * engine-speed
  SYNTHETIC: synthetic data generated by $N(0, )$, $\theta = 5$
Errors
  Injected errors in STOCK and SYNTHETIC
  Manually identified in GPS
  Unknown in ENGINE
Criteria
  RMS error
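
The RMS error criterion can be computed as in the sketch below. It uses the standard root-mean-square definition between the ground truth and the repaired sequence; the function name is illustrative and the slides do not show the exact formula used in the experiments.

```python
import math

def rms_error(truth, repaired):
    """Root-mean-square error between the true values and the repaired values."""
    if len(truth) != len(repaired) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(truth, repaired))
    return math.sqrt(squared / len(truth))
```

A lower value means the repair is closer to the truth, so it only applies to datasets where the truth is known (injected or manually identified errors).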

16 Comparison
Constraint based
  SCREEN [1]
Different algorithms
  DP: Exact solution
  DPC: Constant factor approximation
  DPL: Linear time approximation
  QP: Continuous probabilistic distribution approximation
  SG: Simple greedy
[1] S. Song, et al. SCREEN: Stream Data Cleaning under Speed Constraints. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.

17 Analysis on STOCK: Scalability, Budget

18 Outline
Motivation; Problem; Solutions (Exact Solution, Approximate Solution); Experiments; Conclusion

19 Conclusion
Repair
  Large spike errors are handled precisely
  Small errors can be detected and repaired
Various methods suited to different situations
Better performance in both repair accuracy and application accuracy compared to state-of-the-art constraint-based repairing

20 Q & A
Thanks!

