Dirty Data Data Cleansing Xxxxxx DSCI 5240 December 4, 2012.



Introduction Real data is dirty Why clean? – Eliminate duplicates – Smaller database – Accurate statistics The problem – Merge/Purge of large databases

Preview Data Cleansing Solutions Real World Data OCAR’s Data Conclusion

Data Cleansing Solutions Sorted-Neighborhood Method Equational Theory Transitive Closure

Sorted-Neighborhood Method Three phases – 1. create keys – 2. sort the data – 3. merge Three passes using different keys – Multi-pass method

Sorted-Neighborhood Method Key selection
First Name  Last Name  Address            ID  Key
Sal         Stolfo     123 First Street       STLSAL123FRST456
Sal         Stolfo     123 First Street       STLSAL123FRST456
Sal         Stolpho    123 First Street       STLSAL123FRST456
Sal         Stiles     123 Forest Street      STLSAL123FRST456
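
One way to read the key in the table is: three consonants of the last name, three letters of the first name, the house number, the consonants of the street name, and the first three digits of the ID. The sketch below follows that reading; it is an assumption, not a rule stated on the slide. The function names are hypothetical, and because the ID values did not survive in the transcript, the ID used here is a made-up placeholder whose first three digits are 456.

```python
VOWELS = set("AEIOU")


def consonants(text, limit):
    """Return up to `limit` leading consonants of `text`, upper-cased."""
    letters = [ch for ch in text.upper() if ch.isalpha() and ch not in VOWELS]
    return "".join(letters[:limit])


def make_key(first_name, last_name, address, record_id):
    """Build a sort key from fragments of each field (assumed recipe)."""
    tokens = address.split()
    house_number = tokens[0] if tokens and tokens[0].isdigit() else ""
    street_name = tokens[1] if len(tokens) > 1 else ""
    return (
        consonants(last_name, 3)      # e.g. Stolfo -> STL
        + first_name.upper()[:3]      # e.g. Sal -> SAL
        + house_number                # e.g. 123
        + consonants(street_name, 4)  # e.g. First -> FRST
        + str(record_id)[:3]          # first three digits of the ID
    )


PLACEHOLDER_ID = "45600000"  # hypothetical; the real IDs are not in the transcript
print(make_key("Sal", "Stolfo", "123 First Street", PLACEHOLDER_ID))
print(make_key("Sal", "Stiles", "123 Forest Street", PLACEHOLDER_ID))
```

Under this recipe the misspelled "Stolpho" and the different person "Stiles" both collapse to the same key as "Stolfo", which is why the later merge phase still has to compare the full records.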

Sorted-Neighborhood Method Sort using the key selected
First Name  Last Name  Address            ID  Key
Sal         Stolfo     123 First Street       STLSAL123FRST456
Sal         Stolfo     123 First Street       STLSAL123FRST456
Sal         Stolpho    123 First Street       STLSAL123FRST456
Sal         Stiles     123 Forest Street      STLSAL123FRST456

Sorted-Neighborhood Method A ‘window size’ is created for merging
First Name  Last Name  Address            ID  Key
Sal         Stolfo     123 First Street       STLSAL123FRST456
Sal         Stolfo     123 First Street       STLSAL123FRST456
Sal         Stolpho    123 First Street       STLSAL123FRST456
Sal         Stiles     123 Forest Street      STLSAL123FRST456
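
The sort and window steps can be sketched as follows: sort the records by key, then compare each record only with the records that precede it inside a fixed-size window. This is a minimal illustration; `windowed_pairs` and its parameters are hypothetical names, and the sketch only produces candidate pairs without deciding whether they match.

```python
def windowed_pairs(records, key_func, window_size):
    """Yield candidate pairs from a sliding window over key-sorted records.

    Each record is paired with at most `window_size - 1` records that
    immediately precede it in sorted order.
    """
    ordered = sorted(records, key=key_func)
    for i, current in enumerate(ordered):
        start = max(0, i - (window_size - 1))
        for earlier in ordered[start:i]:
            yield earlier, current


names = ["Stolpho", "Stiles", "Stolfo", "Adams"]
for pair in windowed_pairs(names, key_func=str.upper, window_size=2):
    print(pair)
```

With N records and window size w, this generates roughly w × N comparisons instead of the N × (N − 1) / 2 needed to compare every pair.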

Merge Phase - Equational Theory A set of equation rules that defines equivalence A type of clustering function (pattern recognition) Rules may require an expert

Merge Phase - Equational Theory English rules: Given two records, r1 and r2. IF (the last name of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2) THEN r1 is equivalent to r2
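
The English rule above could be coded roughly as shown here. The slide does not say how "differ slightly" is measured, so this sketch assumes a similarity ratio from Python's standard `difflib.SequenceMatcher` with an arbitrary threshold of 0.7; the dictionary field names are also hypothetical. Note that identical first names also pass this test.

```python
from difflib import SequenceMatcher


def differ_slightly(a, b, threshold=0.7):
    """Assumed notion of 'differ slightly': similarity ratio at or above threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_equivalent(r1, r2):
    """One equational-theory rule: same last name, similar first name, same address."""
    return (
        r1["last"].lower() == r2["last"].lower()
        and differ_slightly(r1["first"], r2["first"])
        and r1["address"].lower() == r2["address"].lower()
    )


r1 = {"first": "Ramon", "last": "Bonilla", "address": "38 Ward St"}
r2 = {"first": "Raymond", "last": "Bonilla", "address": "38 Ward St"}
print(is_equivalent(r1, r2))
```

A real equational theory would hold many such rules (the OCAR study used 24), typically written with help from someone who knows the data.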

Merge Phase - Equational Theory Results
    SSN  Name (First, Initial, Last)  Address
r1       Lisa Boardman                144 Wars St
r2       Lisa Brown                   144 Ward St
r1       Ramon Bonilla                38 Ward St
r2       Raymond Bonilla              38 Ward St.
r1  0    Diana D. Ambrosion           40 Brik Church Av.
r2  0    Diana A. Dambrosion          40 Brick Church Av.
r1  0    Colette Johnen               th St. apt.5a5
r2  0    John Colette                 th St. ap
r1       Ivette A Keegan              23 Florida Av
r2       Yvette A Kegan               23 Florida St.

Merge Phase - Transitive Closure Applied to a single pass sorted-neighborhood method Improvement of accuracy Decreases processing time and cost

Merge Phase - Transitive Closure English rules: Given three records a, b and c. IF (a is similar to b AND b is similar to c) THEN a is similar to c
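
The rule above can be applied to a whole set of matched pairs by grouping records that are connected through any chain of matches. The sketch below uses a union-find structure for this, which is one common way to compute such groups and is an implementation choice made here, not something stated on the slides; `transitive_closure` is a hypothetical name.

```python
def transitive_closure(pairs):
    """Group items connected by any chain of matched pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


# a~b and b~c imply a~c, so a, b and c end up in one group.
print(transitive_closure([("a", "b"), ("b", "c"), ("d", "e")]))
```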

Real World Data State of Washington Department of Social and Health Services – Office of Children Administrative Research (OCAR)

OCAR’s Data 6,000,000 records Grows by 50,000 per month 19 fields – First and last name – Birthdate – SSN – Case number – Worker ID – Gender – Race – Service ID – Service dates – Payments

OCAR’s Data - Problems Names misspelled Missing birthdates Missing or wrong SSN Multiple case numbers Ghost records

OCAR’s Data - Goals To answer: – “How many children are in foster care?” – “How long do children stay in foster care?” – “How many different homes do children typically stay in?”

OCAR’s Data - Cleaning 128,438 records sampled (one service office) Consulted with expert (Timothy Clark, OCAR Computer Information Consultant) 24 rules established Used sorted-neighborhood multi-pass methods Applied equational theory Keys – 1. Last name, First name, SSN, and Case number – 2. First name, Last name, SSN, and Case number – 3. Case number, First name, Last name, and SSN
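
A multi-pass run over the three key orderings listed above might look like the sketch below: each pass sorts by a different key, scans a small window, and the matched pairs from all passes are pooled. The record fields, the toy matching rule, and all function names are hypothetical; the records are invented for illustration and are not OCAR data.

```python
KEY_ORDERS = [
    ("last", "first", "ssn", "case"),
    ("first", "last", "ssn", "case"),
    ("case", "first", "last", "ssn"),
]


def make_key_funcs(orders):
    """Turn each field ordering into a key function over a record dict."""
    return [
        lambda rec, order=order: tuple(str(rec[f]).upper() for f in order)
        for order in orders
    ]


def multi_pass(records, key_funcs, is_match, window_size):
    """Pool matched index pairs from one windowed pass per key function."""
    matched = set()
    for key_func in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key_func(records[i]))
        for pos, i in enumerate(order):
            for j in order[max(0, pos - window_size + 1):pos]:
                if is_match(records[i], records[j]):
                    matched.add((min(i, j), max(i, j)))
    return matched


def same_child(r1, r2):
    """Toy rule (assumption): same case number and same first name."""
    return r1["case"] == r2["case"] and r1["first"] == r2["first"]


records = [
    {"last": "Brown", "first": "Pat", "ssn": "", "case": "C1"},
    {"last": "Clark", "first": "Ann", "ssn": "", "case": "C2"},
    {"last": "Dole", "first": "Kim", "ssn": "", "case": "C3"},
    {"last": "Prown", "first": "Pat", "ssn": "", "case": "C1"},  # misspelled last name
]
print(multi_pass(records, make_key_funcs(KEY_ORDERS), same_child, window_size=2))
```

Here the misspelled last name pushes the duplicate out of the window in the first pass, but the passes that lead with first name or case number bring the two records next to each other. The pooled pairs would then be fed to a transitive-closure step.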

OCAR’s Data - Results Identified 8,504 individuals in sample 45.8% correctly classified 86.0% were correctly merged Multi-pass sorted-neighborhood confirmed

Review Multi-pass sorted-neighborhood method Equational method OCAR’s data

Conclusions Sorted-neighborhood method can be expensive – During the sorting phase Processing time and accuracy improved by – Multiple passes – Small windows – Computation of the transitive closure

Sources – Mauricio A. Hernandez and Salvatore J. Stolfo, Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Department of Computer Science, Columbia University, New York, NY – Haiguang Li, 2011 class presentation