Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem. Hernandez & Stolfo, Columbia University, 1998. Class presentation by Haiguang Li, 01 Dec. 2011.

Presentation transcript:

1 Haiguang Li 01. Dec Real-World Data Is Dirty Data Cleansing and the Merge/Purge Problem Hernandez & Stolfo: Columbia University Class Presentation by Haiguang Li, 01. Dec 2011

2 TOPICS: Introduction; A Basic Data Cleansing Solution; Test & Real-World Results; Incremental Merge/Purge w/ New Data; Conclusion; Recap

3 Introduction

4 The problem: Some corporations acquire large amounts of information every month. The data is stored in many large databases (DBs). These databases may be heterogeneous: schemas vary, and the same data may be represented differently across the various datasets. Data in these DBs may also simply be inaccurate.

5 Requirement of the analysis: The data mining needs to be done quickly, efficiently, and accurately.

6 Examples of real-world applications: Credit card companies (assess the risk of potential new customers; find false identities; match disparate records concerning a customer). Mass-marketing companies. Government agencies.

7 A Basic Data Cleansing Solution

8 Duplicate elimination: Sorted-Neighborhood Method (SNM). This is done in three phases: (1) create a key for each record; (2) sort the records on this key; (3) merge/purge the records.

9 SNM: Create key. Compute a key for each record by extracting relevant fields or portions of fields. Example:
First | Last | Address | ID | Key
Sal | Stolfo | 123 First Street | | STLSAL123FRST456
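
The key construction in the example can be sketched in Python. The extraction rule used here (first three consonants of the last name, first three letters of the first name, the street number, the first four consonants of the street name, and the first three digits of the ID) is an assumption inferred from the example key `STLSAL123FRST456`; the slides do not spell it out. The function names and the ID value `456000` are hypothetical placeholders.

```python
VOWELS = set("AEIOU")


def consonants(text: str) -> str:
    """Return the letters of text, upper-cased, with vowels removed."""
    return "".join(c for c in text.upper() if c.isalpha() and c not in VOWELS)


def make_key(first: str, last: str, address: str, record_id: str) -> str:
    """Build a sort key from portions of fields (rule inferred, not from the slides)."""
    # Split "123 First Street" into the street number and the rest.
    number, _, street = address.partition(" ")
    street_name = street.split()[0] if street else ""
    return (
        consonants(last)[:3]            # e.g. Stolfo -> STL
        + first.upper()[:3]             # e.g. Sal -> SAL
        + number                        # e.g. 123
        + consonants(street_name)[:4]   # e.g. First -> FRST
        + record_id[:3]                 # leading ID digits
    )


# "456000" is a made-up ID used only for illustration.
print(make_key("Sal", "Stolfo", "123 First Street", "456000"))
```

Under this assumed rule, slightly different records such as "Stolpho" or "123 Forest Street" collapse onto the same key, which is what places likely duplicates next to each other after sorting.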

10 SNM: Sort data. Sort the records in the data list using the key created in step 1. This can be very time-consuming: O(N log N) for a good algorithm, O(N^2) for a bad one.

11 SNM: Merge records. Move a fixed-size window through the sequential list of records. This limits the comparisons to the records in the window.
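
A minimal sketch of the sort and window steps, assuming records are plain Python values. The function name `sorted_neighborhood` and the toy `is_match` rule (same first two letters) are hypothetical stand-ins and not the comparison logic from the paper.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """Sort records by key, then compare each record only with the
    previous (window - 1) records in the sorted order."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, current in enumerate(ordered):
        # Records that share the window with `current`.
        for previous in ordered[max(0, i - window + 1):i]:
            if is_match(previous, current):
                pairs.append((previous, current))
    return pairs


names = ["smith", "smyth", "jones", "smithe", "johns"]
same_prefix = lambda a, b: a[:2] == b[:2]  # toy match rule for illustration

print(sorted_neighborhood(names, key=lambda s: s, is_match=same_prefix, window=2))
print(sorted_neighborhood(names, key=lambda s: s, is_match=same_prefix, window=3))
```

With a window of 2 only adjacent records are compared, so the pair ("smith", "smyth") is missed; a window of 3 finds it at the cost of more comparisons.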

12 SNM: Considerations. What is the optimal window size for maximizing accuracy while minimizing computational cost? Execution time for a large DB will be bound by disk I/O and by the number of passes over the data set.

13 Selection of keys: The effectiveness of the SNM depends heavily on the key selected to sort the records. A key is defined to be a sequence of a subset of attributes. Keys must provide sufficient discriminating power.

14 Example of records and keys:
First | Last | Address | ID | Key
Sal | Stolfo | 123 First Street | | STLSAL123FRST456
Sal | Stolfo | 123 First Street | | STLSAL123FRST456
Sal | Stolpho | 123 First Street | | STLSAL123FRST456
Sal | Stiles | 123 Forest Street | | STLSAL123FRST456

15 Equational theory: The comparison during the merge phase is an inferential process. It compares much more information than simply the key. The more information there is, the better the inferences that can be made.

16 Equational theory, example: If two names are spelled nearly identically and have the same address, it may be inferred that they are the same person. If two social security numbers are the same but the names and addresses are totally different, it could be the same person who moved, or it could be two different people with an error in the social security number.

17 A simplified rule in English: Given two records, r1 and r2, IF the last name of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2, THEN r1 is equivalent to r2.

18 The distance function: A “distance function” is used to compare pieces of data (usually text). It is applied to data that “differ slightly”. A threshold is selected to capture obvious typographical errors; the threshold affects the number of successful matches and the number of false positives.
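
The simplified rule and the distance function can be sketched together. The slides do not say which distance function is used, so Levenshtein edit distance is assumed here as one common choice; the names `edit_distance` and `same_person`, the dictionary field names, and the threshold of 2 are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def same_person(r1: dict, r2: dict, threshold: int = 2) -> bool:
    """The simplified rule: equal last names, first names within the
    distance threshold, and equal addresses."""
    return (r1["last"] == r2["last"]
            and edit_distance(r1["first"], r2["first"]) <= threshold
            and r1["address"] == r2["address"])


a = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
b = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
print(edit_distance("Kathi", "Kathy"), same_person(a, b))
```

Raising the threshold accepts more typographical variation and so finds more true matches, but it also lets more unrelated names through as false positives.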

19 Examples of matched records (each row is a matched pair):
SSN | Name (First, Initial, Last) | Address
 | Lisa Boardman / Lisa Brown | 144 Wars St. / 144 Ward St.
 | Ramon Bonilla / Raymond Bonilla | 38 Ward St.
 | Diana D. Ambrosion / Diana A. Dambrosion | 40 Brik Church Av. / 40 Brick Church Av.
 | Kathi Kason / Kathy Kason | 48 North St.
 | Kathy Kason / Kathy Smith | 48 North St.

20 Building an equational theory: The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system. In complex problems, an expert’s assistance is needed to write the equational theory.

21 Transitive closure: In general, no single pass (i.e., no single key) will be sufficient to catch all matching records. An attribute that appears first in the key has higher discriminating power than those appearing after it. If an employee has two records in a DB with two different SSN values, it is unlikely that they will fall within the same window when the SSN leads the key.

22 Transitive closure: To increase the number of similar records merged, either widen the scanning window size, w, or execute several independent runs of the SNM, using a different key each time and a relatively small window. Call the latter the multi-pass approach.

23 Transitive closure: Each independent run of the multi-pass approach produces a set of pairs of records. Although one field in a record may be in error, another field may not be. Transitive closure can be applied to those pairs to decide which records to merge.

24 Multi-pass matches. Pass 1 (last name discriminates): KSNKAT48NRTH789 (Kathi Kason), KSNKAT48NRTH879 (Kathy Kason). Pass 2 (first name discriminates): KATKSN48NRTH789 (Kathi Kason), KATKSN48NRTH879 (Kathy Kason). Pass 3 (address discriminates): 48NRTH879KSNKAT (Kathy Kason), 48NRTH879SMTKAT (Kathy Smith).
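
A rough sketch of the multi-pass idea: run the windowed comparison once per key and take the union of the matched pairs. Records are assumed to be `(first, last, address)` tuples. The match rule (at least two of three fields equal), the filler record "Karl Lane", and the function names are hypothetical stand-ins for a real equational theory.

```python
def window_pairs(records, key, is_match, window):
    """One SNM pass: sort by key, compare records sharing a window."""
    ordered = sorted(records, key=key)
    found = set()
    for i, current in enumerate(ordered):
        for previous in ordered[max(0, i - window + 1):i]:
            if is_match(previous, current):
                found.add(frozenset((previous, current)))
    return found


def multi_pass(records, keys, is_match, window=2):
    """Several independent passes, each with its own key and a small window."""
    pairs = set()
    for key in keys:
        pairs |= window_pairs(records, key, is_match, window)
    return pairs


def two_fields_agree(r1, r2):
    # Toy rule for illustration only: at least two of three fields are equal.
    return sum(x == y for x, y in zip(r1, r2)) >= 2


records = [
    ("Kathi", "Kason", "48 North St."),
    ("Kathy", "Kason", "48 North St."),
    ("Karl", "Lane", "9 Elm St."),       # made-up filler record
    ("Kathy", "Smith", "48 North St."),
]
by_last = lambda r: (r[1], r[0])
by_first = lambda r: (r[0], r[1])

print(len(window_pairs(records, by_last, two_fields_agree, window=2)))
print(len(multi_pass(records, [by_last, by_first], two_fields_agree, window=2)))
```

Sorting by last name alone separates Kathy Kason from Kathy Smith, so that pass finds one pair; adding the first-name pass recovers the second pair without widening the window.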

25 Transitive equality example: IF A implies B AND B implies C, THEN A implies C. From the example: Kathi Kason, 48 North St. (A); Kathy Kason, 48 North St. (B); Kathy Smith, 48 North St. (C).
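
One way to compute the transitive closure over matched pairs is a union-find (disjoint-set) structure. This is a sketch under the assumption that records are hashable values; the slides do not say how the closure is implemented, and the function name is hypothetical.

```python
def transitive_closure(pairs):
    """Group items into clusters so that any two items connected by a
    chain of matched pairs end up in the same cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


# A matches B (pass 1 or 2) and B matches C (pass 3), so A, B, C form one cluster.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(sorted(c) for c in transitive_closure(pairs)))
```

Here A and C were never compared directly, yet they land in the same cluster because both were matched with B.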

26 Test Results

27 Test environment: Test data was created by a database generator. Names are randomly chosen from a list of real names. The database generator provides a large number of parameters: size of the DB, percentage of duplicates, amount of error, and so on.

28 Correct Duplicate Detection

29 Time for each run

30 Accuracy for each run

31 Real-world test: Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington). OCAR’s goals: How long do children stay in foster care? How many different homes do children typically stay in?

32 OCAR’s database: Most of OCAR’s data is stored in one relation. The DB contains 6,000,000 total records. The DB grows by about 50,000 records per month.

33 Typical problems in the DB: Names are frequently misspelled. SSNs or birthdays are either missing or clearly wrong. The case number often changes when the child’s family moves to another part of the state. Some records use service provider names instead of the child’s. There is no reliable unique identifier.

34 OCAR equational theory: Keys for the independent runs: (1) Last Name, First Name, SSN, Case Number; (2) First Name, Last Name, SSN, Case Number; (3) Case Number, First Name, Last Name, SSN.

35 OCAR Results

36 Incremental Merge/Purge w/ New Data

37 Incremental merge/purge: Lists are concatenated for first-time processing. Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space. An incremental merge/purge approach is needed: the Prime Representatives method.
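
A deliberately simplified sketch of the incremental idea: each new record is compared only with one representative per existing cluster instead of with every stored record. This is an assumption-laden illustration and not the algorithm from the paper, which reapplies merge/purge to the representatives together with the new data. The names `incremental_merge`, `is_match`, and `choose_pr` are hypothetical.

```python
def incremental_merge(clusters, new_records, is_match, choose_pr):
    """Add new records to existing clusters by comparing each one with a
    single representative per cluster; unmatched records start new clusters."""
    updated = [list(c) for c in clusters]  # leave the input untouched
    for record in new_records:
        for cluster in updated:
            if is_match(choose_pr(cluster), record):
                cluster.append(record)
                break
        else:
            # No representative matched: the record opens a new cluster.
            updated.append([record])
    return updated


existing = [["smith", "smithe"], ["jones"]]
increment = ["smyth", "brown", "browne"]
same_prefix = lambda a, b: a[:2] == b[:2]  # toy match rule for illustration
first_member = lambda cluster: cluster[0]  # toy representative choice

print(incremental_merge(existing, increment, same_prefix, first_member))
```

The cost per new record grows with the number of clusters and not with the total number of stored records, which is the motivation for keeping representatives.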

38 Prime-Representative, definition: A set of records extracted from each cluster of records, used to represent the information in the cluster. The “cluster centroid”, or base element of an equivalence class.

39 Prime-Representative creation: Initially, no PR exists. After the first execution of merge/purge, clusters of similar records are created. Correct selection of the PR from each cluster affects the accuracy of the results. For some clusters, selecting no PR can be the best choice.

40 Strategies for choosing PR: Random Sample: select a sample of records at random from each cluster. N-Latest: the most recent elements entered in the DB. Syntactic: choose the largest or most complete record.
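
The three strategies can be sketched as small selection functions over a cluster. Records are assumed to be dictionaries with an `entered` field giving insertion order; that field, the `case` field, the sample values, and the exact meaning of "most complete" (most non-empty fields, ties broken by total text length) are assumptions made for illustration.

```python
import random


def random_sample(cluster, k, seed=None):
    """Random Sample: pick up to k records at random from the cluster."""
    rng = random.Random(seed)
    return rng.sample(cluster, min(k, len(cluster)))


def n_latest(cluster, n):
    """N-Latest: the n most recently entered records."""
    return sorted(cluster, key=lambda r: r["entered"], reverse=True)[:n]


def syntactic(cluster):
    """Syntactic: the record with the most non-empty fields,
    with ties broken by total text length."""
    def completeness(record):
        values = [str(v) for k, v in record.items() if k != "entered" and v]
        return (len(values), sum(len(v) for v in values))
    return max(cluster, key=completeness)


cluster = [
    {"first": "Kathi", "last": "Kason", "case": "", "entered": 1},
    {"first": "Kathy", "last": "Kason", "case": "", "entered": 3},
    {"first": "Kathy", "last": "Kason", "case": "C-1", "entered": 2},  # made-up case id
]
print([r["entered"] for r in n_latest(cluster, 2)])
print(syntactic(cluster)["entered"])
```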

41 Important assumptions: No data previously used to select each cluster’s PR will be deleted; deleted records could require restructuring of clusters (expensive). No changes in the rule set will occur after the first increment of data is processed; a substantial rule change could invalidate clusters.

42 Results: Cumulative running time for the Incremental Merge/Purge algorithm is higher than for the classic algorithm. The PR selection methodology could improve cumulative running time. Total running time of the Incremental Merge/Purge algorithm is always smaller.

43 Conclusion

44 Cleansing of data: The Sorted-Neighborhood Method is expensive due to the sorting phase and the need for large windows for high accuracy. Multiple passes with small windows, followed by transitive closure, improve accuracy and performance for a given level of accuracy, increasing the number of successful matches and decreasing the number of false positives.

45 Question 1: Major reasons merging large databases becomes a difficult problem: The databases are heterogeneous. The identifiers or strings differ in how they are represented within each DB.

46 Question 2: The 3 steps in SNM are: creation of key(s); sorting records on this key; merge/purge of records.

47 Question 3: Strategies for selecting a PR: Random Sample; N-Latest; Syntactic.

48 The End. Thanks very much!