Sequential Data Cleaning: A Statistical Approach

Sequential Data Cleaning: A Statistical Approach
Aoqian Zhang, Shaoxu Song, Jianmin Wang
Tsinghua University, China
SIGMOD 2016

Outline
- Motivation
- Problem
- Solutions
  - Exact Solution
  - Approximate Solution
- Experiments
- Conclusion

Erroneous Stream Data
- Stream data are often dirty
  - Unreliable sensor readings
  - Large spike errors and small errors
- Stock and Flight
  - Accuracy of Stock in Yahoo! Finance is 0.93 [1]
  - Accuracy of Travelocity is 0.95 [1]
- Reasons
  - Ambiguity in information extraction
  - Pure mistakes
[1] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97-108, 2012.

Data Cleaning
- Constraint-based methods
  - Constraints on speeds of value changes [1]
  - Minimum change principle
  - Large spike errors: repaired to the max/min values allowed
  - Small errors: fail to be identified
[1] Song, Shaoxu, et al. "SCREEN: Stream Data Cleaning under Speed Constraints." SIGMOD'15.

Probability Notations
- Speed: $v_{i-1,i} = \frac{x_i - x_{i-1}}{t_i - t_{i-1}}$
- Change of speed: $u_i = v_{i,i+1} - v_{i-1,i}$
- Intuition: the change of speed between consecutive data points should not be significant
- Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
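The notation above can be sketched in a few lines of Python. This is an illustration only: the sequence, the timestamps, and the probability table `prob` are hypothetical, the natural logarithm is an assumption (the slides do not state the base), and the `floor` probability for unseen speed changes is a convenience added here.

```python
import math

def speeds(x, t):
    """v_{i-1,i} = (x_i - x_{i-1}) / (t_i - t_{i-1}) for consecutive points."""
    return [(x[i] - x[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(x))]

def speed_changes(x, t):
    """u_i = v_{i,i+1} - v_{i-1,i} for the interior points i = 2..n-1."""
    v = speeds(x, t)
    return [v[i] - v[i - 1] for i in range(1, len(v))]

def log_likelihood(x, t, prob, floor=1e-6):
    """L(x) = sum of log P(u_i); unseen speed changes get a small floor probability."""
    return sum(math.log(prob.get(u, floor)) for u in speed_changes(x, t))

# Hypothetical data: four points with unevenly spaced timestamps.
x = [10, 12, 13, 19]
t = [0, 1, 2, 4]
# Hypothetical distribution of speed changes (in practice estimated from data).
prob = {-1: 0.3, 0: 0.4, 2: 0.1}

u = speed_changes(x, t)          # u_2 = -1.0, u_3 = 2.0
ll = log_likelihood(x, t, prob)  # log(0.3) + log(0.1)
```

Because the likelihood depends only on second-order differences, a sequence that rises or falls steadily scores as well as a flat one.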

Likelihood Example
- Sequence: $x = \{11, 12, 15, 14, 15, 15, 17\}$
- $P(u_3) = P(v_{34} - v_{23}) = P(-4) = 0.1$

Condition   Observe   Truth   Repair1   Repair2
$L(x)$      -8.1      -6.6    -7.0      -6.0
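The speed change of $-4$ at the third point can be checked by hand. The working below assumes unit-spaced timestamps, which the slide does not state explicitly.

```latex
\begin{align*}
v_{2,3} &= \frac{x_3 - x_2}{t_3 - t_2} = \frac{15 - 12}{1} = 3, \\
v_{3,4} &= \frac{x_4 - x_3}{t_4 - t_3} = \frac{14 - 15}{1} = -1, \\
u_3 &= v_{3,4} - v_{2,3} = -1 - 3 = -4, \\
L(u_3) &= \log P(u_3) = \log 0.1 .
\end{align*}
```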

Problem
- Factors
  - Data point and error range: $x_i' \in [x_i - \theta_i,\ x_i + \theta_i]$
  - Likelihood: $L(x) = \sum_{i=2}^{n-1} L(u_i) = \sum_{i=2}^{n-1} \log P(u_i)$
  - Repair cost: $\Delta(x, x') = \sum_{i=1}^{n} |x_i - x_i'|$
- Target
  - Maximal likelihood repair under a budget threshold $\delta$ and error range $\theta$: $(L \mid \delta, \theta)$
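To make the problem statement concrete, here is a brute-force sketch that enumerates every candidate repair and keeps the most likely one within the budget. It is not one of the algorithms from the talk and is exponential in the sequence length. It assumes integer values, unit-spaced timestamps, a single `theta` for all points, and a hypothetical probability table `prob`.

```python
import itertools
import math

def log_likelihood(x, prob, floor=1e-6):
    """Sum of log P(u_i) with unit-spaced timestamps (speed = neighbour difference)."""
    total = 0.0
    for i in range(1, len(x) - 1):
        u = (x[i + 1] - x[i]) - (x[i] - x[i - 1])
        total += math.log(prob.get(u, floor))
    return total

def repair_cost(x, x_rep):
    """Delta(x, x') = sum of |x_i - x_i'|."""
    return sum(abs(a - b) for a, b in zip(x, x_rep))

def brute_force_repair(x, prob, theta, delta):
    """Try every x' with |x_i - x_i'| <= theta and total cost <= delta."""
    candidates = [range(xi - theta, xi + theta + 1) for xi in x]
    best, best_ll = list(x), log_likelihood(x, prob)
    for cand in itertools.product(*candidates):
        if repair_cost(x, cand) <= delta:
            ll = log_likelihood(cand, prob)
            if ll > best_ll:
                best, best_ll = list(cand), ll
    return best, best_ll

# Hypothetical example: the third point looks like a small error.
prob = {0: 0.6, 1: 0.15, -1: 0.15, 2: 0.05, -2: 0.05}
observed = [1, 2, 6, 4, 5]
repaired, ll = brute_force_repair(observed, prob, theta=3, delta=3)
```

In this example the whole budget of 3 is spent on moving the third point from 6 to 3, which makes every speed change zero.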

Problem: Budget (figure not preserved in this transcript)

DP-based Solution
- DP: Exact solution
  - NP-complete (NPC), by reduction from 0/1 knapsack
  - $O(n\,\theta_{max}^3\,\delta)$ time and $O(n\,\theta_{max}^2\,\delta)$ space
- DPL: Approximation on cost
  - Repair cost: $[0, \delta] \to [0, d]$
  - $O(n d^4)$ time and $O(n d^3)$ space
- DPC: Approximation on likelihood
  - $K = -\epsilon \cdot \log p_{max}$, $L'(u_i') = \lfloor L(u_i') / K \rfloor$, $L(x') \ge (1+\epsilon) \cdot L(x^*)$
  - $O(n^2\,\theta_{max}^3)$ time and $O(n^2\,\theta_{max}^2)$ space
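A dynamic program with the same shape as the bound stated for DP can be sketched as follows. This is not the authors' code. It assumes integer values, unit-spaced timestamps, one `theta` for all points, and a hypothetical probability table `prob`. The state is the pair of the last two repaired values plus the cost spent so far, and each state tries every candidate for the next point, which gives roughly $n \cdot \theta^3 \cdot \delta$ steps.

```python
import math

def dp_repair(x, prob, theta, delta, floor=1e-6):
    """Best log-likelihood over repairs with |x_i - x_i'| <= theta and cost <= delta.

    Only the optimal value is returned; recovering the repaired sequence would
    require keeping every layer of the table for back-tracing.
    """
    n = len(x)
    cand = [range(xi - theta, xi + theta + 1) for xi in x]
    # table[(a, b, c)]: best log-likelihood of a prefix whose last two repaired
    # values are a and b, having spent exactly c units of repair cost.
    table = {}
    for a in cand[0]:
        for b in cand[1]:
            c = abs(a - x[0]) + abs(b - x[1])
            if c <= delta:
                table[(a, b, c)] = 0.0
    for i in range(2, n):
        nxt = {}
        for (a, b, c), ll in table.items():
            for v in cand[i]:
                c2 = c + abs(v - x[i])
                if c2 > delta:
                    continue  # over budget
                u = (v - b) - (b - a)  # speed change at the middle point b
                ll2 = ll + math.log(prob.get(u, floor))
                key = (b, v, c2)
                if ll2 > nxt.get(key, -math.inf):
                    nxt[key] = ll2
        table = nxt
    return max(table.values())

# Hypothetical example, same spirit as a small error at the third point.
prob = {0: 0.6, 1: 0.15, -1: 0.15, 2: 0.05, -2: 0.05}
best_ll = dp_repair([1, 2, 6, 4, 5], prob, theta=3, delta=3)
```

The sketch keeps only one layer of the table at a time, so its memory use is smaller than the stated space bound.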

Other Approximation
- QP: Approximation on probability distribution
  - Discrete to continuous probability distribution
  - Quadratic programming
- SG: Simple greedy
  - Reduction of $|u_i'|$
  - $O(\max(n, \delta))$
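One way to see why a continuous distribution leads to quadratic programming is to assume that speed changes follow a zero-mean normal distribution with variance $\sigma^2$ and that timestamps are unit-spaced. Both are assumptions made here for illustration; the slide does not name the distribution.

```latex
\begin{align*}
\log P(u_i') &= -\frac{u_i'^2}{2\sigma^2} - \log\!\left(\sigma\sqrt{2\pi}\right), \\
\max_{x'} L(x') \;&\Longleftrightarrow\; \min_{x'} \sum_{i=2}^{n-1} u_i'^2
  = \min_{x'} \sum_{i=2}^{n-1} \left(x_{i+1}' - 2x_i' + x_{i-1}'\right)^2, \\
\text{s.t.}\quad & |x_i - x_i'| \le \theta_i \ (1 \le i \le n), \qquad
  \sum_{i=1}^{n} |x_i - x_i'| \le \delta .
\end{align*}
```

The objective is quadratic in $x'$, and the absolute-value constraints become linear once auxiliary variables are introduced, so under these assumptions the repair is a quadratic program.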

Solutions Summary

Algorithm   Time complexity                 Feature
DP          $O(n\,\theta_{max}^3\,\delta)$  Better repair accuracy
DPC         $O(n^2\,\theta_{max}^3)$        Runs faster than DP with a high budget
DPL         $O(n d^4)$                      Fast, higher error
QP          -                               Probabilistic distribution
SG          $O(\max(n, \delta))$            Fastest, repair accuracy not guaranteed

Experiment Setting
- Datasets
  - STOCK [1]: daily prices of AIP.L from 1984-09 to 2010-02, with 12826 data points
  - GPS: 150/2358 points in the trajectory
  - ENGINE: 4 sequences of a crane, $\text{switching-count} = \alpha \cdot \text{pump-volume} + \beta \cdot \text{DT0} + \gamma \cdot \text{engine-speed}$
  - SYNTHETIC: generated from $N(0, 0.8^2)$, $\theta = 5$
- Errors
  - Injected in STOCK and SYNTHETIC
  - Manually identified in GPS
  - Unknown in ENGINE
- Criteria
  - RMS error
[1] http://finance.yahoo.com/q/hp?s=AIP.L+Historical+Prices
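The RMS error criterion can be computed as below. The sketch uses the standard root-mean-square definition between a repaired sequence and the ground truth; the function name and the sample values are illustrative and not taken from the experiments.

```python
import math

def rms_error(repaired, truth):
    """Root-mean-square error between a repaired sequence and the ground truth."""
    if len(repaired) != len(truth):
        raise ValueError("sequences must have the same length")
    squared = [(r - g) ** 2 for r, g in zip(repaired, truth)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only.
truth = [1, 2, 3, 4, 5]
observed = [1, 2, 6, 4, 5]
error_before = rms_error(observed, truth)   # sqrt(9 / 5)
error_after = rms_error(truth, truth)       # 0.0 for a perfect repair
```

A lower value means the repaired sequence is closer to the truth.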

Comparison
- Constraint-based
  - SCREEN [1]
- Different algorithms
  - DP: Exact solution
  - DPC: Constant-factor approximation
  - DPL: Linear-time approximation
  - QP: Continuous probability distribution approximation
  - SG: Simple greedy
[1] Song, Shaoxu, et al. "SCREEN: Stream Data Cleaning under Speed Constraints." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.

Analysis: STOCK (figure panels: Scalability, Budget; plots not preserved in this transcript)

Conclusion
- Repair
  - Large spike errors are handled precisely
  - Small errors can be detected and repaired
- Various methods suited to different situations
- Better performance in both repair accuracy and application accuracy compared with state-of-the-art constraint-based repairing

Q & A
Thanks!