
Shaoxu Song¹, Aoqian Zhang¹, Lei Chen², Jianmin Wang¹. ¹Tsinghua University, China; ²Hong Kong University of Science & Technology, China. VLDB 2015.


1 Shaoxu Song¹, Aoqian Zhang¹, Lei Chen², Jianmin Wang¹. ¹Tsinghua University, China; ²Hong Kong University of Science & Technology, China. VLDB 2015

2
- Motivation
- Imputation Methods
- Preliminary
- Exact Solutions
- Approximation Method
- Experiments
- Conclusion

3
- Fill the missing data
- With values of neighbors
- Editing rules
- Statistically by relational dependency networks
- Sparsity
- Values with variances
- Extensive similarity neighbors
- Using similarity rules
- Tolerance to small variations

4
- Equality rules
  - Functional dependency (FD)
  - Rather rare
- Similarity rules
  - Differential dependency (DD)
  - Plenty of them

5
- Motivation
- Imputation Methods
- Preliminary
- Exact Solutions
- Approximation Method
- Experiments
- Conclusion


10 Examples
- DD1: (Name, Street → Address, [0,1], [0,9], [0,2])
- DD2: (Street → Address, )
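One way to read DD1 is: whenever two tuples differ by at most 1 on Name and at most 9 on Street, they should differ by at most 2 on Address. The sketch below checks that reading for a pair of tuples. It assumes edit distance as the distance metric (the slide does not name one), and the helper names and the sample tuples are hypothetical, made up purely for illustration.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # inclusive [low, high] distance range


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def within(t1: Dict[str, str], t2: Dict[str, str],
           ranges: Dict[str, Range]) -> bool:
    """True if the distance on every listed attribute falls in its range."""
    return all(lo <= edit_distance(t1[attr], t2[attr]) <= hi
               for attr, (lo, hi) in ranges.items())


def satisfies_dd(t1: Dict[str, str], t2: Dict[str, str],
                 lhs: Dict[str, Range], rhs: Dict[str, Range]) -> bool:
    """A pair violates the DD only if it matches the LHS but not the RHS."""
    if not within(t1, t2, lhs):
        return True  # LHS not met, so the rule says nothing about this pair
    return within(t1, t2, rhs)


# DD1 from the slide: (Name, Street -> Address, [0,1], [0,9], [0,2])
DD1_LHS = {"Name": (0, 1), "Street": (0, 9)}
DD1_RHS = {"Address": (0, 2)}

# Hypothetical tuples, not taken from the presentation.
t1 = {"Name": "Alice", "Street": "12 Main St", "Address": "12 Main St, NY"}
t2 = {"Name": "Alise", "Street": "12 Main Street", "Address": "12 Main St, NY."}
t3 = {"Name": "Alice", "Street": "12 Main St", "Address": "99 Oak Ave, LA"}

ok_pair = satisfies_dd(t1, t2, DD1_LHS, DD1_RHS)   # similar on all attributes
bad_pair = satisfies_dd(t1, t3, DD1_LHS, DD1_RHS)  # same Name/Street, far Address
print(ok_pair, bad_pair)
```

With an FD, the left-hand side would require exact equality; the distance ranges are what let near-duplicate tuples such as t1 and t2 count as neighbors.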


14
- Motivation
- Cleaning Methods
- Preliminary
- Global Optimum
- Local Optimum
- Experiments
- Conclusion

15
- Real datasets
  - RESTAURANT: name, address, type, and city information of 864 restaurants.
  - UIS¹: database generator; at most 100k of the generated tuples are used.
- Errors
  - Injected into RESTAURANT and UIS by removing values.
- Criteria
  - Recall, precision, and F-measure
1. http://www.cs.utexas.edu/users/ml/riddle/data.html
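The three criteria can be computed once the removed values are kept as ground truth. The sketch below assumes one common convention, which the slide does not spell out: precision is the fraction of imputed cells that are correct, and recall is the fraction of removed cells that were correctly filled. The function name, the cell keys, and the sample values are hypothetical.

```python
from typing import Dict, Hashable, Tuple


def imputation_scores(truth: Dict[Hashable, str],
                      imputed: Dict[Hashable, str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) for an imputation result.

    truth   maps every removed cell to its original value.
    imputed maps each cell the method chose to fill to the value it filled in;
            cells the method left empty are simply absent.
    """
    correct = sum(1 for cell, value in imputed.items()
                  if truth.get(cell) == value)
    precision = correct / len(imputed) if imputed else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 4 values removed, 3 filled back in, 2 of them correctly.
truth = {(0, "city"): "NY", (1, "city"): "LA", (2, "type"): "thai", (3, "type"): "bbq"}
imputed = {(0, "city"): "NY", (1, "city"): "SF", (2, "type"): "thai"}

precision, recall, f_measure = imputation_scores(truth, imputed)
print(precision, recall, f_measure)
```

Under this convention, a method that fills few cells but gets them right scores high precision and low recall, which is the trade-off the F-measure summarizes.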

16
- Equality based
  - CERTAIN [1]
  - ERACER [2]
- Similarity based
  - MIBOS [3]
  - CMI [4]
1. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1):173–184, 2010.
2. C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD Conference, pages 75–86, 2010.
3. S. Wu, X. Feng, Y. Han, and Q. Wang. Missing categorical data imputation approach based on similarity. In SMC, pages 2827–2832, 2012.
4. S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang. Missing value imputation based on data clustering. Trans. on Computational Science, 1:128–138, 2008.


18
- Motivation
- Cleaning Methods
- Preliminary
- Global Optimum
- Local Optimum
- Experiments
- Conclusion

19
- Imputing missing values
- Explore the extensive similarity neighbors
- Exact approaches
- Approximation approaches
- Efficient approximation algorithms are devised with certain performance guarantees
- More room for DDs

20 Thanks!

