Detecting Data Errors: Where are we and what needs to be done?

Detecting Data Errors: Where are we and what needs to be done?
Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang

Motivation
- There has been extensive research on many different cleaning algorithms
- Usually evaluated on errors injected into clean data, which we find unconvincing (finding errors you injected…)
- How well do current techniques work "in the wild"?
- What about combinations of techniques?
- This study is not about finding the best tool or better tools!

What we did
- Ran 8 different cleaning systems on real-world datasets and measured:
  - the effectiveness of each single system
  - combined effectiveness
  - upper-bound recall
- Analyzed the impact of enrichment
- Tried out domain-specific cleaning tools

Error Types
Literature: [Hellerstein 2008, Ilyas & Chu 2015, Kim et al. 2003, Rahm & Do 2000]
General types:
- Quantitative: outliers
- Qualitative: duplicates, constraint violations, pattern violations
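To make the quantitative branch concrete, here is a minimal sketch (my own illustration, not from the slides) of the kind of error it covers: a numeric value far from the rest of its column, flagged with a simple z-score rule. The column values and threshold are illustrative.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score exceeds the threshold (a simple
    stand-in for the statistical outlier detectors discussed later)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if stdev and abs(v - mean) / stdev > threshold]

# A weight of 5000 among plausible animal weights is a quantitative error.
weights = [12.5, 11.8, 13.0, 12.1, 5000.0, 11.9]
print(zscore_outliers(weights))  # [5000.0]
```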

Error Detection Strategies
- Rule-based detection algorithms: detecting violations of constraints, such as functional dependencies (sketched below)
- Pattern verification and enforcement tools: syntactic patterns, such as date formatting; semantic patterns, such as location names
- Quantitative algorithms: statistical outliers
- Deduplication: discovering conflicting attribute values in duplicates
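As an illustrative sketch of the rule-based strategy (my own example, not from the talk), a functional dependency such as zip → city can be checked by flagging rows whose zip group contains more than one distinct city. The sketch assumes the data sits in a pandas DataFrame; column names are hypothetical.

```python
import pandas as pd

def fd_violations(df, lhs, rhs):
    """Flag rows that violate the functional dependency lhs -> rhs.

    Every group sharing the same `lhs` value must agree on `rhs`;
    rows in groups with more than one distinct `rhs` value are flagged.
    """
    conflicting = df.groupby(lhs)[rhs].transform("nunique") > 1
    return df.index[conflicting]

df = pd.DataFrame({
    "zip":  ["02139", "02139", "60611"],
    "city": ["Cambridge", "Boston", "Chicago"],  # "Boston" conflicts with "Cambridge"
})
print(list(fd_violations(df, "zip", "city")))  # [0, 1]
```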

Tool Selection
Premise:
- Tool is state-of-the-art
- Tool is sufficiently general
- Tool is available
- Tool covers at least one of the leaf error types

5 Data Sets
- MIT VPF: procurement dataset containing information about suppliers (companies and individuals); contains names, contact data, and business flags
- Merck: list of IT services and software; attributes include location, number of end users, business flags
- Animal: information about random captures of animals; attributes include tags, sex, weight, etc.
- Rayyan Bib: literature references collected from various sources; attributes include author names, publication titles, ISSN, etc.
- BlackOak: address dataset that has been synthetically dirtied; contains names, addresses, birth dates, etc.

5 Data Sets (continued)

Dataset      # columns   # rows   Ground truth    Errors
MIT VPF      42          24K      13k (partial)   6.7%
Merck        61          2,262                    19.7%
Animal       14          60k                      0.1%
Rayyan Bib   11          1M       1k (partial)    35%
BlackOak     12          94k                      34%

Evaluation Methodology
- We have the same knowledge as the data owners about the data: quality constraints, business rules
- Best effort in using all capabilities of the tools
- However: no heroics, i.e., no embedding custom Java code within a tool

Precision = correctly detected errors / cells marked as errors
Recall    = correctly detected errors / existing errors
F-measure = 2 · Precision · Recall / (Precision + Recall)
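A minimal sketch of how these metrics can be computed, assuming detections and ground truth are represented as sets of (row, column) cell coordinates (my representation, not prescribed by the paper):

```python
def evaluate(detected, ground_truth):
    """Compute precision, recall, and F-measure for error detection.

    `detected` and `ground_truth` are sets of cell identifiers,
    e.g. (row_id, column_name) tuples.
    """
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure

# Example: 3 of 4 flagged cells are real errors, out of 6 errors overall.
detected = {(1, "zip"), (2, "zip"), (5, "name"), (7, "phone")}
truth = {(1, "zip"), (2, "zip"), (5, "name"), (3, "zip"), (8, "city"), (9, "city")}
print(evaluate(detected, truth))  # (0.75, 0.5, 0.6)
```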

Single Tool Performance: MIT VPF
(42 columns, 24K rows, 13k ground truth (partial), 6.7% errors)

Tool         P     R     F
DC-Clean     .25   .14   .18
Trifacta     .94   .86   .90
OpenRefine   .95
Pentaho      .59   .73
KNIME
Gaussian     .07
Histogram    .13   .11   .12
GMM          .29   .19
Katara       .40   .01   .02
Tamr         .16   .04
Union        .24   .93   .38

Single Tool Performance: Merck
(61 columns, 2,262 rows, 19.7% errors)

Tool         P     R     F
DC-Clean     .99   .78   .87
Trifacta
OpenRefine
Pentaho
KNIME
Gaussian     .19   .00   .01
Histogram    .13   .02   .04
GMM          .17   .32   .22
Katara       --
Tamr
Union        .33   .85   .48

Single Tool Performance: Animal
(14 columns, 60k rows, 0.1% errors)

Tool         P     R     F
DC-Clean     .12   .53   .20
Trifacta     1.0   .03   .06
OpenRefine   .33   .001
Pentaho
KNIME
Gaussian     .00
Histogram
GMM
Katara       .55   .04   .07
Tamr         --
Union        .13   .58   .21

Single Tool Performance: Rayyan Bib
(11 columns, 1M rows, 1k ground truth (partial), 35% errors)

Tool         P     R     F
DC-Clean     .74   .55   .63
Trifacta     .71   .59   .65
OpenRefine   .95   .60
Pentaho      .58   .64
KNIME
Gaussian     .41   .13   .20
Histogram    .40   .16   .23
GMM          .53   .39   .44
Katara       .47
Tamr         --
Union        .85   .61

Single Tool Performance: BlackOak
(12 columns, 94k rows, 34% errors)

Tool         P     R     F
DC-Clean     .46   .43   .44
Trifacta     .96   .93   .94
OpenRefine   .99   .95   .97
Pentaho      1.0   .66   .79
KNIME
Gaussian     .91   .73   .81
Histogram    .52   .51
GMM          .38   .37
Katara       .88   .06   .11
Tamr         .41   .63   .50
Union        .39   .56

Single Tool Performance (all datasets)
Summary table combining the five per-dataset results shown above.

Combined Tool Performance
Naïve approach (a sketch follows below):
- k tools agree on a value being an error
- Typical precision-recall trade-off

Maximum entropy-based order selection:
1. Run the tools on samples and verify the results
2. Pick the tool with the highest precision (maximum entropy reduction)
3. Verify its results
4. Update the precision and recall of the other tools accordingly
5. Repeat from step 2
6. Drop tools with precision below 10%
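A minimal sketch of the naïve voting combination (my own illustration; tool outputs and cell identifiers are assumed to be sets of (row, column) pairs). Requiring more tools to agree raises precision at the cost of recall.

```python
from collections import Counter

def combine_detections(tool_outputs, min_votes=1):
    """Flag a cell if at least `min_votes` tools mark it as an error.

    min_votes=1 is the union of all tools (highest recall);
    larger values trade recall for precision.
    """
    votes = Counter()
    for detected in tool_outputs:          # each is a set of (row, column) cells
        votes.update(detected)
    return {cell for cell, count in votes.items() if count >= min_votes}

# Hypothetical outputs of three tools over the same dataset.
outputs = [
    {(1, "zip"), (2, "city")},
    {(1, "zip"), (7, "phone")},
    {(1, "zip")},
]
print(combine_detections(outputs, min_votes=2))  # {(1, 'zip')}
```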

Ordering-based approach
Precision and recall for different minimum precision thresholds (compared to the union):
- MIT VPF with 39,158 errors
- Merck with 27,208 errors

Maximum possible recall
We manually checked each undetected error and reasoned whether it could have been detected by a better variant of a tool, e.g., a more sophisticated rule or transformation.

Dataset      Best-effort recall   Upper-bound recall   Remaining errors
MIT VPF      0.92                 0.98 (+1,950)        798
Merck        0.85                 0.99 (+4,101)        58
Animal       0.57                                      592
Rayyan Bib   0.91                 (+231)               347
BlackOak     0.99                                      75

Enrichment and domain-specific tools
Enrichment:
- Manually appended more columns by joining to other tables of the database (see the sketch below)
- Improves the performance of rule-based and duplicate-detection systems
Domain-specific tool:
- Used a commercial address-cleaning service
- High precision on its specific domain
- But did not increase overall recall
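A minimal sketch of what such enrichment might look like, with hypothetical table and column names and assuming pandas: joining an auxiliary table appends extra columns that rules and duplicate detection can then exploit.

```python
import pandas as pd

# Hypothetical supplier table and a related address table from the same database.
suppliers = pd.DataFrame({
    "supplier_id": [1, 2],
    "name": ["Acme Corp", "ACME Corporation"],
})
addresses = pd.DataFrame({
    "supplier_id": [1, 2],
    "zip": ["02139", "02139"],
    "city": ["Cambridge", "Cambridge"],
})

# Left join keeps every supplier row and appends the address columns,
# giving cross-column rules and duplicate detection more signal.
enriched = suppliers.merge(addresses, on="supplier_id", how="left")
print(enriched.columns.tolist())  # ['supplier_id', 'name', 'zip', 'city']
```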

Conclusions
- There is no single dominant tool. Improving individual tools has marginal benefit; we need a combination of tools.
- Picking the right order in which to apply the tools can improve precision and help reduce the cost of validation by humans.
- Domain-specific tools can achieve, on average, high precision and recall compared to general-purpose tools.
- Rule-based systems and duplicate detection benefited from data enrichment.

Future Directions
- More reasoning on holistic combinations of tools
- Data enrichment can benefit cleaning
- Interactive dashboard
- More reasoning on real-world data

Thank you!