Published by Tito Clemente. Modified over 5 years ago.
1
Sequential Data Cleaning: A Statistical Approach
Aoqian Zhang, Shaoxu Song, Jianmin Wang
Tsinghua University, China
SIGMOD 2016
2
Outline
  Motivation
  Problem
  Solutions
    Exact Solution
    Approximate Solution
  Experiments
  Conclusion
3
Stream Data Erroneous
  Stream data are often dirty
    Unreliable sensor readings
    Large spikes and small errors
  Stock and Flight
    Accuracy of Stock in Yahoo! Finance is 0.93 [1]
    Accuracy of Travelocity is 0.95 [1]
    Reasons
      Ambiguity in information extraction
      Pure mistakes
  [1] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97-108, 2012.
4
Data Cleaning
  Constraint-based methods
    Constraints on speeds of value changes [1]
    Minimum change principle
    Large spike errors: max/min values allowed
    Small errors: fail to identify
  [1] Song, Shaoxu, et al. "SCREEN: Stream Data Cleaning under Speed Constraints." SIGMOD '15.
5
Probability Notations
  Speed: $v_{i-1,i} = \frac{x_i - x_{i-1}}{t_i - t_{i-1}}$
  Change of speed: $u_i = v_{i,i+1} - v_{i-1,i}$
  Intuition: the change of speed between consecutive data points should not be significant
  Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
6
Likelihood Example
  Sequence: $x = \{11, 12, 15, 14, 15, 15, 17\}$
  $P(u_3) = P(v_{3,4} - v_{2,3}) = P(-4) = 0.1$

  Condition   Observe   Truth   Repair1   Repair2
  $L(x)$      -8.1      -6.6    -7.0      -6.0
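The notation can be checked on the example sequence with a few lines of Python. This is a minimal sketch: unit-spaced timestamps, the natural logarithm, and every probability in `hypothetical_prob` other than P(-4) = 0.1 are assumptions made for illustration, so the printed likelihood is not expected to match the values in the table.

```python
import math

def speeds(x, t):
    """v_{i-1,i} = (x_i - x_{i-1}) / (t_i - t_{i-1}) for consecutive points."""
    return [(x[i] - x[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(x))]

def speed_changes(x, t):
    """u_i = v_{i,i+1} - v_{i-1,i}; the first entry is u_2."""
    v = speeds(x, t)
    return [v[i] - v[i - 1] for i in range(1, len(v))]

def log_likelihood(x, t, prob):
    """L(x) = sum of log P(u_i); -inf if some speed change has probability 0."""
    total = 0.0
    for u in speed_changes(x, t):
        p = prob.get(u, 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total

x = [11, 12, 15, 14, 15, 15, 17]
t = list(range(1, len(x) + 1))          # assumption: unit-spaced timestamps

# Hypothetical distribution: only P(-4) = 0.1 comes from the slide.
hypothetical_prob = {2: 0.2, -4: 0.1, -1: 0.3}

print(speed_changes(x, t))              # [2.0, -4.0, 2.0, -1.0, 2.0]
print(log_likelihood(x, t, hypothetical_prob))
```

The second entry of the list is u_3 = -4, which agrees with the slide.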
7
Outline
  Motivation
  Problem
  Solutions
    Exact Solution
    Approximate Solution
  Experiments
  Conclusion
8
Problem
  Factors
    Data point and error range: $x_i' \in [x_i \pm \theta_i]$
    Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
    Repair cost: $\Delta(x, x') = \sum_{i=1}^{n} |x_i - x_i'|$
  Target
    Maximal likelihood repair, budget threshold, error range: $(L \mid \delta, \theta)$
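The problem statement can be written down directly as an exhaustive search. This is a sketch for illustration only: it assumes integer values, unit-spaced timestamps, a single error range `theta` shared by all points, and a hypothetical probability table. It is exponential in the sequence length and is not one of the algorithms in the talk.

```python
import math
from itertools import product

def repair_cost(x, x_rep):
    """Delta(x, x') = sum of |x_i - x'_i|."""
    return sum(abs(a - b) for a, b in zip(x, x_rep))

def log_likelihood(x_rep, prob):
    """Sum of log P(u_i), with u_i = x_{i+1} - 2*x_i + x_{i-1} (unit timestamps)."""
    total = 0.0
    for i in range(1, len(x_rep) - 1):
        u = (x_rep[i + 1] - x_rep[i]) - (x_rep[i] - x_rep[i - 1])
        p = prob.get(u, 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total

def brute_force_repair(x, theta, delta, prob):
    """Try every x' with x'_i in [x_i - theta, x_i + theta] and cost <= delta."""
    best, best_ll = None, -math.inf
    ranges = [range(v - theta, v + theta + 1) for v in x]
    for candidate in product(*ranges):
        if repair_cost(x, candidate) > delta:
            continue                      # violates the budget threshold
        ll = log_likelihood(candidate, prob)
        if ll > best_ll:
            best, best_ll = list(candidate), ll
    return best, best_ll

# Hypothetical distribution that strongly prefers a constant speed.
prob = {0: 0.8, 1: 0.1, -1: 0.1}
print(brute_force_repair([0, 1, 5, 3, 4], theta=3, delta=3, prob=prob))
```

With a budget of 3 the spike at the third point is pulled back onto the line 0, 1, 2, 3, 4. With a budget of 0 no candidate has non-zero likelihood under this table, so the function returns `(None, -inf)`.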
9
Problem
  [Figure: Budget]
10
Outline
  Motivation
  Problem
  Solutions
    Exact Solution
    Approximate Solution
  Experiments
  Conclusion
11
DP-based Solution
  DP: Exact solution
    NP-complete, by reduction from 0/1 knapsack
    $O(n \theta_{\max}^3 \delta)$ time and $O(n \theta_{\max}^2 \delta)$ space
  DPL: Approximation on cost
    Repair cost: $[0, \delta] \to [0, \theta]$
    $O(n \theta^4)$ time and $O(n \theta^3)$ space
  DPC: Approximation on likelihood
    $\gamma = -\epsilon \cdot \log P_{\max}$
    $L'(u_i') = \lfloor L(u_i') / \gamma \rfloor$, $L(x') \ge (1 + \epsilon) \cdot L(x^*)$
    $O(n^2 \theta_{\max}^3)$ time and $O(n^2 \theta_{\max}^2)$ space
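A dynamic program in the spirit of the exact DP can be sketched as follows. It is not the authors' implementation: the state layout (last two repaired values plus cost spent so far), integer values, unit-spaced timestamps, a single shared error range `theta`, and the probability table are all assumptions for illustration. Each state stores its repaired prefix for brevity, where a tuned version would keep back-pointers instead.

```python
import math

def dp_repair(x, theta, delta, prob):
    """Maximum-likelihood repair under a cost budget (illustrative sketch).

    x     -- observed integer values at unit-spaced timestamps (assumption)
    theta -- error range: x'_i must lie in [x_i - theta, x_i + theta]
    delta -- integer budget on the total repair cost
    prob  -- dict mapping a speed change u to its probability P(u)
    Returns (repair, log_likelihood), or None if every repair within the
    budget has zero likelihood.
    """
    n = len(x)
    if n < 3:
        return list(x), 0.0               # no speed change is defined
    cands = [range(v - theta, v + theta + 1) for v in x]

    # State key: (x'_{i-1}, x'_i, cost so far).
    # State value: (best log-likelihood, repaired prefix).
    states = {}
    for a in cands[0]:
        for b in cands[1]:
            cost = abs(x[0] - a) + abs(x[1] - b)
            if cost <= delta:
                states[(a, b, cost)] = (0.0, (a, b))

    for i in range(2, n):
        nxt = {}
        for (a, b, cost), (ll, prefix) in states.items():
            for c in cands[i]:
                new_cost = cost + abs(x[i] - c)
                if new_cost > delta:
                    continue              # over budget
                u = (c - b) - (b - a)     # speed change at the middle point b
                p = prob.get(u, 0.0)
                if p <= 0.0:
                    continue              # zero-probability speed change
                key = (b, c, new_cost)
                new_ll = ll + math.log(p)
                if key not in nxt or new_ll > nxt[key][0]:
                    nxt[key] = (new_ll, prefix + (c,))
        states = nxt

    if not states:
        return None
    best_ll, best = max(states.values(), key=lambda s: s[0])
    return list(best), best_ll

print(dp_repair([0, 1, 5, 3, 4], theta=3, delta=3,
                prob={0: 0.8, 1: 0.1, -1: 0.1}))
```

Per position there are at most (2θ+1)² (δ+1) states and each tries 2θ+1 successors, which matches the shape of the $O(n \theta_{\max}^3 \delta)$ bound quoted on the slide.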
12
Other Approximation
  QP: Approximation on probability distribution
    Discrete to continuous probability distribution
    Quadratic programming
  SG: Simple greedy
    Reduction of $|u_i'|$
    $O(\max(n, \delta))$
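Why a continuous distribution leads to a quadratic program can be seen from a short derivation. It assumes, purely for illustration, that the speed-change probability is approximated by a normal density with mean $\mu$ and standard deviation $\sigma$ (hypothetical symbols, not taken from the slides) and that timestamps are unit-spaced.

```latex
\begin{align*}
P(u) &\approx \frac{1}{\sigma\sqrt{2\pi}}
        \exp\!\left(-\frac{(u-\mu)^2}{2\sigma^2}\right) \\
L(x') &= \sum_{i=2}^{n-1} \log P(u_i')
       = -\frac{1}{2\sigma^2} \sum_{i=2}^{n-1} (u_i' - \mu)^2
         - (n-2)\,\log\!\left(\sigma\sqrt{2\pi}\right) \\
\arg\max_{x'} L(x') &= \arg\min_{x'} \sum_{i=2}^{n-1} (u_i' - \mu)^2,
\qquad u_i' = x_{i+1}' - 2x_i' + x_{i-1}' \\
\text{subject to}\quad & x_i' \in [x_i \pm \theta_i], \qquad
\sum_{i=1}^{n} |x_i - x_i'| \le \delta
\end{align*}
```

Because each $u_i'$ is linear in $x'$, the objective is a convex quadratic, and the range and budget constraints become linear once the absolute values are split with auxiliary variables.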
13
Solutions Summary
  Algorithm   Time Complexity                 Feature
  DP          $O(n \theta_{\max}^3 \delta)$   Better repair accuracy
  DPC         $O(n^2 \theta_{\max}^3)$        Runs faster than DP with high budget
  DPL         $O(n \theta^4)$                 Fast, higher error
  QP          -                               Probabilistic distribution
  SG          $O(\max(n, \delta))$            Fastest, repair accuracy not guaranteed
14
Outline
  Motivation
  Problem
  Solutions
    Exact Solution
    Approximate Solution
  Experiments
  Conclusion
15
Experiment Setting
  Datasets
    STOCK [1]: daily prices of AIP.L from … to …, with … data points
    GPS: 150/2358 points in the trajectory
    ENGINE: 4 sequences of a crane, switching-count = α * fuel-volume + β * D[?]0 + γ * engine-speed
    SYNTHETIC: synthetic data generated by …(0, …), … = 5
  Errors
    Injected errors in STOCK and SYNTHETIC
    Manually identified in GPS
    Unknown in ENGINE
  Criteria
    RMS error
  [1] …
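The RMS error criterion can be computed as below. This is a minimal sketch using the standard root-mean-square definition over paired truth and repaired values. The slides do not spell out the formula, so the exact normalisation used in the experiments is an assumption.

```python
import math

def rms_error(truth, repaired):
    """Root-mean-square error between the true and the repaired sequence."""
    if len(truth) != len(repaired) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(truth, repaired))
    return math.sqrt(squared / len(truth))

print(rms_error([0, 0, 0, 0], [2, 2, 2, 2]))   # 2.0
```

A lower value means the repair is closer to the ground truth. A perfect repair scores 0.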
16
Comparison
  Constraint-based
    SCREEN [1]
  Different algorithms
    DP: Exact solution
    DPC: Constant-factor approximation
    DPL: Linear-time approximation
    QP: Continuous probability distribution approximation
    SG: Simple greedy
  [1] Song, Shaoxu, et al. "SCREEN: Stream Data Cleaning under Speed Constraints." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
17
Analysis: STOCK
  [Figures: Scalability, Budget]
18
Outline
  Motivation
  Problem
  Solutions
    Exact Solution
    Approximate Solution
  Experiments
  Conclusion
19
Conclusion
  Repair
    Large spike errors are handled precisely
    Small errors can be detected and repaired
  Various methods suited to different situations
  Better performance in both repair and application accuracy compared to state-of-the-art constraint-based repairing
20
Q & A
  Thanks!