Sequential Data Cleaning: A Statistical Approach
  Aoqian Zhang, Shaoxu Song, Jianmin Wang
  Tsinghua University, China
  SIGMOD 2016
Outline
  Motivation
  Problem
  Solutions: Exact Solution, Approximate Solution
  Experiments
  Conclusion
Stream Data
  Erroneous
    Stream data are often dirty
    Unreliable sensor readings
    Large spikes and small errors
  Stock and Flight
    Accuracy of Stock in Yahoo! Finance is 0.93 [1]
    Accuracy of Travelocity is 0.95 [1]
  Reasons
    Ambiguity in information extraction
    Pure mistakes
  [1] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97-108, 2012.
Data Cleaning
  Constraint-based methods
    Constraints on speeds of value changes [1]
    Minimum change principle
  Large spike errors: repaired to the max/min values allowed
  Small errors: not identified
  [1] Song, Shaoxu, et al. "SCREEN: Stream Data Cleaning under Speed Constraints." SIGMOD'15
Probability
  Notations
    Speed: $v_{i-1,i} = \frac{x_i - x_{i-1}}{t_i - t_{i-1}}$
    Change of speed: $u_i = v_{i,i+1} - v_{i-1,i}$
  Intuition: speed changes between consecutive data points should not be significant
  Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
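The notations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary `prob` that maps a speed change to its probability, the `floor` probability for unseen speed changes, and the use of natural logarithms are all assumptions made for the example.

```python
import math

def speeds(x, t):
    # v_{i-1,i} = (x_i - x_{i-1}) / (t_i - t_{i-1})
    return [(x[i] - x[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(x))]

def speed_changes(x, t):
    # u_i = v_{i,i+1} - v_{i-1,i}, defined for the interior points only
    v = speeds(x, t)
    return [v[i] - v[i - 1] for i in range(1, len(v))]

def log_likelihood(x, t, prob, floor=1e-6):
    # L(x) = sum of log P(u_i); `floor` (an assumption) stands in for
    # speed changes that do not appear in the distribution `prob`
    return sum(math.log(prob.get(u, floor)) for u in speed_changes(x, t))
```

With unit time gaps, `speed_changes([11, 12, 15, 14, 15, 15, 17], [1, 2, 3, 4, 5, 6, 7])` gives the speed changes whose probabilities are summed in log space.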
Likelihood Example
  Sequence $x = \{11, 12, 15, 14, 15, 15, 17\}$
  $P(u_3) = P(v_{3,4} - v_{2,3}) = P(-4) = 0.1$
  Condition   Observe   Truth   Repair1   Repair2
  $L(x)$      -8.1      -6.6    -7.0      -6.0
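The value $u_3 = -4$ in the example can be checked by hand. The worked arithmetic below assumes unit time gaps between consecutive points ($t_i - t_{i-1} = 1$), which the slide does not state explicitly.

```latex
\begin{align*}
v_{2,3} &= \frac{x_3 - x_2}{t_3 - t_2} = \frac{15 - 12}{1} = 3, \\
v_{3,4} &= \frac{x_4 - x_3}{t_4 - t_3} = \frac{14 - 15}{1} = -1, \\
u_3 &= v_{3,4} - v_{2,3} = -1 - 3 = -4, \\
(u_2, u_3, u_4, u_5, u_6) &= (2, -4, 2, -1, 2).
\end{align*}
```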
Problem
  Factors
    Data point and error range: $x_i' \in [x_i \pm \theta_i]$
    Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
    Repair cost: $\Delta(x, x') = \sum_{i=1}^{n} |x_i - x_i'|$
  Target
    Maximal likelihood repair under budget threshold $\delta$ and error range $\theta$: $(L \mid \delta, \theta)$
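The two constraints on a candidate repair can be written down directly. The sketch below is illustrative only; the function names are hypothetical, and `theta` is assumed to be a list holding one error range per data point.

```python
def repair_cost(x, x_rep):
    # Delta(x, x') = sum over i of |x_i - x_i'|
    return sum(abs(a - b) for a, b in zip(x, x_rep))

def is_feasible(x, x_rep, theta, delta):
    # Each repaired point must stay within its error range theta[i],
    # and the total repair cost must not exceed the budget delta.
    in_range = all(abs(a - b) <= th for a, b, th in zip(x, x_rep, theta))
    return in_range and repair_cost(x, x_rep) <= delta
```

A repair that maximizes the likelihood is then sought only among the candidates for which `is_feasible` holds.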
Problem: Budget [figure]
DP-based Solution
  DP: Exact solution
    NP-complete; reduction to 0/1 knapsack
    $O(n \theta_{max}^3 \delta)$ time and $O(n \theta_{max}^2 \delta)$ space
  DPL: Approximation on cost
    Repair cost: $[0, \delta] \to [0, d]$
    $O(n d^4)$ time and $O(n d^3)$ space
  DPC: Approximation on likelihood
    $K = -\epsilon \cdot \log p_{max}$, $L'(u_i') = \lfloor L(u_i') / K \rfloor$
    $L(x') \ge (1 + \epsilon) \cdot L(x^*)$
    $O(n^2 \theta_{max}^3)$ time and $O(n^2 \theta_{max}^2)$ space
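To make the shape of an exact dynamic program concrete, here is a small sketch under simplifying assumptions that are not taken from the slides: integer values, unit time gaps, a single error range `theta` for all points, an integer budget `delta`, natural logarithms, and a `floor` probability for speed changes missing from `prob`. The function name `dp_repair` is hypothetical, and this is not claimed to be the authors' DP algorithm. The state is (previous repaired value, current repaired value, cost so far), so a layer holds on the order of $\theta^2 \delta$ states and each state tries on the order of $\theta$ next values.

```python
import math

def dp_repair(x, theta, delta, prob, floor=1e-6):
    n = len(x)
    if n < 3:
        return list(x), 0.0  # no speed change is defined
    # Candidate repaired values for every point: [x_i - theta, x_i + theta]
    cand = [range(v - theta, v + theta + 1) for v in x]
    # layer maps (x'_{i-1}, x'_i, cost so far) -> best log-likelihood
    layer = {}
    for a in cand[0]:
        for b in cand[1]:
            c = abs(a - x[0]) + abs(b - x[1])
            if c <= delta:
                layer[(a, b, c)] = 0.0
    back = []  # back[i - 2][state] = x'_{i-2} on the best path into state
    for i in range(2, n):
        nxt, ptr = {}, {}
        for (a, b, c), ll in layer.items():
            for z in cand[i]:
                c2 = c + abs(z - x[i])
                if c2 > delta:
                    continue  # budget exceeded
                u = (z - b) - (b - a)  # speed change at the middle point
                val = ll + math.log(prob.get(u, floor))
                key = (b, z, c2)
                if key not in nxt or val > nxt[key]:
                    nxt[key], ptr[key] = val, a
        layer = nxt
        back.append(ptr)
    # Pick the best final state and walk the back-pointers
    b, z, c = max(layer, key=layer.get)
    best = layer[(b, z, c)]
    rev = [z, b]
    for i in range(n - 1, 1, -1):
        a = back[i - 2][(b, z, c)]
        rev.append(a)
        c -= abs(z - x[i])
        b, z = a, b
    return rev[::-1], best
```

For instance, with a distribution that strongly favors a zero speed change, `dp_repair([0, 1, 5, 3, 4], 3, 3, prob)` can spend the whole budget on the spike at the third point, while a budget of 0 leaves the sequence unchanged.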
Other Approximation
  QP: Approximation on probability distribution
    Discrete to continuous probability distribution
    Quadratic programming
  SG: Simple greedy
    Reduction of $|u_i'|$
    $O(\max(n, \delta))$
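The slide gives only the idea behind SG: reduce $|u_i'|$. The sketch below is one possible greedy heuristic in that spirit, not the authors' SG algorithm, and it does not reach the $O(\max(n, \delta))$ bound because it rescans the sequence on every step. It assumes integer values, unit time gaps, and a single error range `theta`; all function names are hypothetical.

```python
def total_abs_change(y):
    # Sum of |u_i| with unit time gaps: u_i = y[i+1] - 2*y[i] + y[i-1]
    return sum(abs(y[i + 1] - 2 * y[i] + y[i - 1])
               for i in range(1, len(y) - 1))

def repair_cost(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def greedy_repair(x, theta, delta):
    y = list(x)
    while True:
        base = total_abs_change(y)
        best_gain, best_move = 0, None
        for i in range(len(y)):
            for step in (-1, 1):
                old = y[i]
                y[i] = old + step  # try a unit move, then undo it
                ok = (abs(y[i] - x[i]) <= theta
                      and repair_cost(x, y) <= delta)
                gain = base - total_abs_change(y)
                y[i] = old
                if ok and gain > best_gain:
                    best_gain, best_move = gain, (i, step)
        if best_move is None:
            return y  # no feasible move reduces the total |u_i|
        i, step = best_move
        y[i] += step
```

Each accepted move strictly lowers the integer total of $|u_i|$, so the loop terminates. As with any greedy choice, the result is not guaranteed to be the maximum likelihood repair.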
Solutions Summary
  Algorithm   Time Complexity                Feature
  DP          $O(n \theta_{max}^3 \delta)$   Better repair accuracy
  DPC         $O(n^2 \theta_{max}^3)$        Runs faster than DP with a high budget
  DPL         $O(n d^4)$                     Fast, higher error
  QP                                         Probabilistic distribution
  SG          $O(\max(n, \delta))$           Fastest, repair accuracy not guaranteed
Experiment Setting
  Datasets
    STOCK [1]: daily prices of AIP.L from 1984-09 to 2010-02, with 12826 data points
    GPS: 150/2358 points in the trajectory
    ENGINE: 4 sequences of a crane, $\text{switching-count} = \alpha \cdot \text{pump-volume} + \beta \cdot \text{DT0} + \gamma \cdot \text{engine-speed}$
    SYNTHETIC: generated from $N(0, 0.8^2)$, $\theta = 5$
  Errors
    Injected in STOCK and SYNTHETIC
    Manually identified in GPS
    Unknown in ENGINE
  Criteria
    RMS error
  [1] http://finance.yahoo.com/q/hp?s=AIP.L+Historical+Prices
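The RMS error criterion compares a repaired sequence with the ground truth. A minimal sketch follows, assuming the standard root-mean-square definition; the function name `rms_error` is hypothetical.

```python
import math

def rms_error(truth, repaired):
    # Root-mean-square distance between the truth and the repaired sequence
    if len(truth) != len(repaired):
        raise ValueError("sequences must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(truth, repaired))
    return math.sqrt(squared / len(truth))
```

A lower value means the repair is closer to the truth; a perfect repair gives 0.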
Comparison
  Constraint based
    SCREEN [1]
  Different algorithms
    DP: Exact solution
    DPC: Constant factor approximation
    DPL: Linear time approximation
    QP: Continuous probabilistic distribution approximation
    SG: Simple greedy
  [1] Song, Shaoxu, et al. "SCREEN: Stream Data Cleaning under Speed Constraints." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
Analysis on STOCK [figures: Scalability; Budget]
Conclusion
  Repair
    Large spike errors are handled precisely
    Small errors can be detected and repaired
  Various methods suited to different situations
  Better performance in both repair and application accuracy compared to state-of-the-art constraint-based data repairing
Q & A
  Thanks!