Download presentation
Presentation is loading. Please wait.
Published byDaisy Sheryl Quinn Modified over 9 years ago
1
Dirty Data Data Cleansing Xxxxxx DSCI 5240 December 4, 2012
2
Introduction Real data is dirty Why clean? – Eliminate duplicates – Smaller database – Accurate statistics The problem – Merge/Purge of large databases
3
Preview Data Cleansing Solutions Real World Data OCAR’s Data Conclusion
4
Data Cleansing Solutions Sorted-Neighborhood Method Equational Theory Transitive Closure
5
Sorted-Neighborhood Method Three phases – 1. create keys – 2. sort the data – 3. merge Three passes using different key – Multi-pass method
6
Sorted-Neighborhood Method Key selection First NameLast NameAddressIDKey SalStolfo123 First Street45678987STLSAL123FRST456 SalStolfo123 First Street45678987STLSAL123FRST456 SalStolpho123 First Street45678987STLSAL123FRST456 SalStiles123 Forest Street45654321STLSAL123FRST456
7
Sorted-Neighborhood Method Sort using the key selected First NameLast NameAddressIDKey SalStolfo123 First Street45678987STLSAL123FRST456 SalStolfo123 First Street45678987STLSAL123FRST456 SalStolpho123 First Street45678987STLSAL123FRST456 SalStiles123 Forest Street45654321STLSAL123FRST456
8
Sorted-Neighborhood Method A ‘window size’ is created for merging First NameLast NameAddressIDKey SalStolfo123 First Street45678987STLSAL123FRST456 SalStolfo123 First Street45678987STLSAL123FRST456 SalStolpho123 First Street45678987STLSAL123FRST456 SalStiles123 Forest Street45654321STLSAL123FRST456
9
Merge Phase - Equational Theory A set of equation rules that defines equivalence A type of clustering function (pattern recognition) Rules may require an expert
10
Merge Phase - Equational Theory English rules: Given two records, r1 and r2. IF (the last names of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2) THEN R1 is equivalent to r2
11
Merge Phase - Equational Theory Results SSNName (First, Initial, Last)Address 334600443Lisa Boardman144 Wars St. 334600443Lisa Brown144 Ward St. 525520001Ramon Bonilla38 Ward St. 525250001Raymond Bonilla38 Ward St. 0Diana D. Ambrosion40 Brik Church Av. 0Diana A. Dambrosion40 Brick Church Av. 0Colette Johnen600 113 th St. apt.5a5 0John Colette600 113 th St. ap. 585 850982319Ivette A Keegan23 Florida Av. 950982319Yvette A Kegan23 Florida St. r1 r2
12
Merge Phase - Transitive Closure Applied to a single pass sorted-neighborhood method Improvement of accuracy Decreases processing time and cost
13
Merge Phase - Transitive Closure English rules: Given three records a, b and c. IF (a is similar to b AND b is similar to c) THEN a is similar to c
14
Real World Data State of Washington Department of Social and Health Services Office of Children Administrative Research (OCAR) of the Department of Social and Health Services
15
OCAR’s Data 6,000,000 records Grows by 50,000 per month 19 fields – First and last name – Birthdate – SSN – Case number – Worker ID – Gender – Race – Service ID – Service dates – Payments
16
OCAR’s Data - Problems Names misspelled Missing birthdates Missing or wrong SSN Multiple case numbers Ghost records
17
OCAR’s Data - Goals To answer: – “How many children are in foster care?” – “How long do children stay in foster care?” – “How many different homes do children typically stay in?”
18
OCAR’s Data - Cleaning 128,438 records sampled (one service office) Consulted with expert 1 24 rules established Used sorted-neighborhood multi-pass methods Applied equational theory Keys – 1. Last name, First name, SSN, and Case number – 2. First name, Last name, SSN, and Case number – 3. Case number, First name, Last name, and SSN 1 Timothy Clark, OCAR Computer Information Consultant
19
OCAR’s Data - Results Identified 8,504 individuals in sample 45.8% correctly classified 86.0% where correctly merged Multi-pass sorted-neighborhood confirmed
20
Review Multi-pass sorted-neighborhood method Equational method OCAR’s data
21
Conclusions Sort-neighborhood method can be expensive – During the sorting phase Process time improved accuracy – Multiple times – Small windows – Computation of the transitive closure
22
Sources Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem; Mauricio A. Hernandez and Salvatore J. Stolfo; Department of Computer Science, Columbia University, New York, NY 10027. Haiguang Li, 2011 class presentation www.cs.columbia.edu/~sal http://www.dshs.wa.gov/default.shtm
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.