GENERIC ENTITY RESOLUTION WITH NEGATIVE RULES Steven Euijong Whang · Omar Benjelloun · Hector Garcia-Molina Compiled by – Darshana Pathak.

CONTENTS
1. Introduction
2. Example
3. ER-N model
4. GNR Algorithm
5. ENR Algorithm
6. How to choose negative rules?
7. Conclusion
8. References
10/7/2011

Introduction
What is entity resolution? A two-step process:
I. Identify records that refer to the same real-world entity.
II. Merge them together.
Also known as record linkage, merge-purge, deduplication, etc.
It is an application-specific, complex, and error-prone process.

Introduction
Why is this so complex? Most often because the data is:
I. Ambiguous
II. Missing or incomplete
III. Incorrect
These problems are difficult to capture no matter what logic is used to decide whether records match or how they should be merged.

Introduction
What are negative rules?
I. Integrity constraints: rules that tell us what data is invalid.
II. Sanity checks used to remove inconsistencies.
Examples:
I. One person with two genders.
II. The same address location with two different street names.

Example
General ER process:
I. Match function: M(r1, r2) = true if the two records refer to the same entity, false otherwise. Denoted r1 ≈ r2 (match) or r1 ≠ r2 (no match).
II. Merge function: µ(r1, r2) = ‹r1, r2›.
Record  Name      SSN  Gender
r1      Pat
r2      Patricia       F
r3      Pat            M

Example
Match r1 and r2; if r1 ≈ r2, merge them:
r12 | Pat, Patricia | | F
Match r12 with r3; if r12 ≈ r3, merge them:
r123 | Pat, Patricia | | F, M
Problem: r123 violates a negative rule (one person with two genders).
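The merge sequence above can be reproduced with a small sketch. Everything here is an assumption made for illustration: records are modelled as sets of (attribute, value) pairs, the match rule (shared SSN or one name being a prefix of the other) is hypothetical, and the SSN values "111" and "222" are placeholders, not the values from the slides.

```python
# Illustrative model only: records are frozensets of (attribute, value) pairs.
# The match rule and all SSN values are hypothetical placeholders.
def record(**fields):
    return frozenset((k, v) for k, v in fields.items() if v is not None)

def values(r, attr):
    """All values a (possibly merged) record holds for one attribute."""
    return {v for a, v in r if a == attr}

def match(r1, r2):
    # Hypothetical match rule: a shared SSN, or one name is a prefix of the other.
    if values(r1, "ssn") & values(r2, "ssn"):
        return True
    return any(a.startswith(b) or b.startswith(a)
               for a in values(r1, "name") for b in values(r2, "name"))

def merge(r1, r2):
    # The merged record keeps every value seen in either input.
    return r1 | r2

def violates_gender_rule(r):
    # Unary negative rule: one person cannot have two genders.
    return len(values(r, "gender")) > 1

r1 = record(name="Pat", ssn="111")
r2 = record(name="Patricia", ssn="222", gender="F")
r3 = record(name="Pat", ssn="111", gender="M")

r12 = merge(r1, r2)
r123 = merge(r12, r3)
print(match(r1, r2), match(r12, r3))                          # True True
print(violates_gender_rule(r12), violates_gender_rule(r123))  # False True
```

The violation only appears after the second merge, which is why a check applied to the input records alone would not catch it.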

Example
Why was this constraint not enforced during the ER process?
I. The constraints can be much more complex (possibly a large computer program considering many factors).
II. Patches are added to the program over time by different people.
III. The condition may be acceptable during the ER process and fixable afterwards.

Example
Resolving the inconsistency: because r123 is not acceptable as a final merge,
I. Unmerge r123 into {r12, r3}.
II. No two final records can have the same SSN.
III. So r12 and r3 cannot both be in the final record set.
IV. The problem occurred because r1 was initially merged with r2 instead of r3.
In practice, there will be no obvious ordering of merges.

ER-N Model
Basic properties of the match and merge functions:
I. Idempotence: any record matches itself, and merging a record with itself yields the same record. ∀ r, r ≈ r and ‹r, r› = r.
II. Commutativity: if r1 matches r2, then r2 matches r1, and the merged record does not depend on the order of the arguments. ∀ r1, r2, r1 ≈ r2 iff r2 ≈ r1, and if r1 ≈ r2, then ‹r1, r2› = ‹r2, r1›.

ER-N Model
Basic properties of the match and merge functions:
III. Domination (r1 ≤ r2): record r1 is dominated by r2 if both records refer to the same entity but r2's information "includes" that of r1, so r1 is redundant. We can have r1 ≤ r2 whenever r2 = ‹r1, r'› for some r'.
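A minimal sketch of these three properties, assuming (as an illustration, not as the authors' definition) that records are sets of (attribute, value) pairs and that merging is set union. Under that assumption, domination reduces to a subset test.

```python
# Assumption: merge is set union over (attribute, value) pairs.
def merge(r1, r2):
    return r1 | r2

def dominated_by(r1, r2):
    # With a union merge, r1 <= r2 exactly when r2 = merge(r1, r') for some r'.
    return r1 <= r2

r1 = frozenset({("name", "Pat")})
r2 = frozenset({("name", "Patricia"), ("gender", "F")})
r12 = merge(r1, r2)

print(merge(r1, r1) == r1)             # idempotence
print(merge(r1, r2) == merge(r2, r1))  # commutativity
print(dominated_by(r1, r12))           # r1 is redundant once r12 exists
```

Other merge functions are possible; the subset test is only valid for the union merge assumed here.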

ER-N Model: Merge Closure
The merge closure ī contains all the records that can be generated from I = {r1, …, rn} using M and µ.
Definition: the merge closure ī of I satisfies the following conditions:
1. I ⊆ ī
2. ∀ r1, r2 ∈ ī s.t. r1 ≈ r2, ‹r1, r2› ∈ ī.
3. No strict subset of ī satisfies conditions 1 and 2.

Algorithm overview
Steps to find the closure from a set I of records:
I. Start with an empty ī.
II. Loop until I is empty.
III. Take a record r from I and remove it from I.
IV. For every r' in ī, follow steps 1 and 2:
   1. If r' ≈ r, then merged = ‹r, r'›.
   2. If merged is not in I ∪ ī ∪ {r}, then I = I ∪ {merged}.
V. ī = ī ∪ {r}.
VI. Return ī.
* This basic algorithm does not consider negative rules.
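The closure steps above can be sketched as follows. The record model, the match rule, and the SSN values are assumptions made for illustration; each base record also carries a hypothetical "id" pair so that a merged record such as r13 stays distinct from r3.

```python
# Sketch of the closure loop; record model and match rule are illustrative assumptions.
def rec(rid, name, ssn, gender=None):
    fields = {("id", rid), ("name", name), ("ssn", ssn)}
    if gender is not None:
        fields.add(("gender", gender))
    return frozenset(fields)

def values(r, attr):
    return {v for a, v in r if a == attr}

def match(r1, r2):
    # Hypothetical rule: shared SSN, or one name is a prefix of the other.
    if values(r1, "ssn") & values(r2, "ssn"):
        return True
    return any(a.startswith(b) or b.startswith(a)
               for a in values(r1, "name") for b in values(r2, "name"))

def merge(r1, r2):
    return r1 | r2

def merge_closure(records):
    pending = set(records)   # the set I
    closure = set()          # the closure being built
    while pending:
        r = pending.pop()
        for other in closure:
            if match(r, other):
                merged = merge(r, other)
                # Only queue records not already known anywhere.
                if merged != r and merged not in pending and merged not in closure:
                    pending.add(merged)
        closure.add(r)
    return closure

r1 = rec("r1", "Pat", "111")
r2 = rec("r2", "Patricia", "222", "F")
r3 = rec("r3", "Pat", "111", "M")
closure = merge_closure([r1, r2, r3])
print(len(closure))  # 7: r1, r2, r3, r12, r13, r23, r123
```

The loop terminates here because a union merge can only produce finitely many distinct records from a finite input.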

Time to apply negative rules!
Negative rules are classified by their number of arguments:
I. Unary negative rule: checks whether a record r is valid by itself.
II. Binary negative rule: checks whether two different records r1 and r2 can coexist.
Two inconsistent records cannot coexist in an ER solution.
Match and merge rules and negative rules cannot be combined.

Properties of negative rules
A set of records is inconsistent if it contains
I. a single record violating a unary negative rule, and/or
II. a pair of records violating a binary negative rule.
Commutativity for negative rules: for all r1 and r2, if r1 is not consistent with r2, then r2 is not consistent with r1.
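As a sketch, the two rule types and the set-level consistency test might look like this. The concrete rules (two genders in one record, a shared SSN between two records) follow the running example, while the record model and SSN values are illustrative assumptions.

```python
from itertools import combinations

# Illustrative record model: frozensets of (attribute, value) pairs.
def rec(rid, name, ssn, gender=None):
    fields = {("id", rid), ("name", name), ("ssn", ssn)}
    if gender is not None:
        fields.add(("gender", gender))
    return frozenset(fields)

def values(r, attr):
    return {v for a, v in r if a == attr}

def unary_violation(r):
    # A record is invalid by itself if it carries two genders.
    return len(values(r, "gender")) > 1

def binary_violation(r1, r2):
    # Two records cannot coexist if they share an SSN.
    # Set intersection is symmetric, so the rule is commutative.
    return bool(values(r1, "ssn") & values(r2, "ssn"))

def is_consistent(records):
    records = list(records)
    if any(unary_violation(r) for r in records):
        return False
    return not any(binary_violation(a, b) for a, b in combinations(records, 2))

r1 = rec("r1", "Pat", "111")           # SSN values are placeholders
r2 = rec("r2", "Patricia", "222", "F")
r3 = rec("r3", "Pat", "111", "M")
r12, r13, r23 = r1 | r2, r1 | r3, r2 | r3

print(is_consistent([r13, r2]))  # True
print(is_consistent([r12, r3]))  # False: shared SSN
print(is_consistent([r23]))      # False: two genders in one record
```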

ER-N Model

Back to our example
ī = {r1, r2, r3, r12, r13, r23, r123}
The instance {r13, r2} is a valid ER-N solution, where r13 = Pat | | M and r2 = Patricia | … | F.
This satisfies all conditions of the ER-N model.

Resolving Inconsistencies
Late approach:
I. An ER solution is generated using the match and merge rules.
II. The solution is checked for inconsistencies.
III. Appropriate fixes are applied to remove the inconsistencies, guided by a domain expert (the solver).
Early approach:
I. With the help of the solver, start identifying the records that we want in the final answer J.
II. Fix problems between the records selected in J and the records not yet selected.
* The early approach is preferred over the late approach.

Resolving Inconsistencies
Ways inconsistencies can be fixed:
I. Discard data: the solver may decide to drop a record.
II. Forced merge: the solver decides that two inconsistent records should have been merged.
III. Override negative rule: the solver decides that the flagged record(s) are in fact consistent, i.e. the negative rule was wrong to flag them. e.g. Comfort Inn vs Comfort Inn Milton.

The GNR Algorithm
General algorithm for negative rules:
The solver plays a key role in making decisions.
If no solver is available, the algorithm makes choices at random or based on some heuristic, e.g. a record with more fields is preferable to one with fewer fields.
A solution generated without a solver may not be the "most desirable" solution.

The GNR Algorithm
Algorithm overview:
I. Generate the closure ī using the ER algorithm and set S = ī, J = ∅.
II. Select the set of non-dominated records (ndS) from S.
III. Select a record r from ndS with the help of the solver.
IV. S = S \ {r}, i.e. remove r from S.
V. If r is self-inconsistent, discard r and continue from step II.
VI. Otherwise J = J ∪ {r}.
VII. Remove from S all records that are either inconsistent with r or dominated by r.
VIII. Continue from step II until S is empty.
IX. Return J.
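A minimal sketch of this loop, under several assumptions: records are sets of (attribute, value) pairs, domination is a strict-subset test, the negative rules are the two from the running example, and the human solver is replaced by a scripted preference list (`preference` is a hypothetical stand-in, as are the SSN values).

```python
# Sketch of the GNR loop; record model, rules, and the scripted solver are assumptions.
def rec(rid, name, ssn, gender=None):
    fields = {("id", rid), ("name", name), ("ssn", ssn)}
    if gender is not None:
        fields.add(("gender", gender))
    return frozenset(fields)

def values(r, attr):
    return {v for a, v in r if a == attr}

def self_inconsistent(r):          # unary negative rule
    return len(values(r, "gender")) > 1

def inconsistent(r1, r2):          # binary negative rule
    return bool(values(r1, "ssn") & values(r2, "ssn"))

def gnr(closure, choose):
    S, J = set(closure), set()
    while S:
        # Non-dominated records: not a strict subset of any other record in S.
        nd = {r for r in S if not any(r < other for other in S)}
        r = choose(nd)
        S.discard(r)
        if self_inconsistent(r):
            continue               # discard r and pick again
        J.add(r)
        # Drop everything that conflicts with r or is dominated by r.
        S = {o for o in S if not inconsistent(r, o) and not o < r}
    return J

r1 = rec("r1", "Pat", "111")
r2 = rec("r2", "Patricia", "222", "F")
r3 = rec("r3", "Pat", "111", "M")
r12, r13, r23 = r1 | r2, r1 | r3, r2 | r3
r123 = r12 | r3
closure = {r1, r2, r3, r12, r13, r23, r123}

# Scripted stand-in for the human solver: pick the first preferred candidate.
preference = [r123, r13, r23, r12, r2, r1, r3]
solution = gnr(closure, lambda nd: next(r for r in preference if r in nd))
print(solution == {r13, r2})  # True
```

A different preference order can yield a different, equally valid result, which is why the solver's choices matter.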

Back to our example
ī = {r1, r2, r3, r12, r13, r23, r123}
Select r123, but it is inconsistent, so discard it.
Select one from {r12, r13, r23}: r13.
Remove all records dominated by or inconsistent with r13, so S = {r2, r23}.
We discard r23 because it is internally inconsistent.
Final solution: {r13, r2}

Things to ponder
An important metric for the GNR algorithm is the "human effort" of the solver.
How do we calculate human effort?
Entity resolution is an inherently expensive operation.
Applying negative rules makes it even more expensive.

Techniques to reduce cost
Semantic partitioning: data is divided into independent blocks using semantic knowledge, e.g. by category: book, camera, snacks, … This technique is commonly known as blocking.
Exploiting properties: exploit properties of the match and merge rules to find a correct solution with less effort.
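Blocking can be sketched as grouping records by a key before any matching is attempted. The product records and the `category` field below are made-up illustration data.

```python
from collections import defaultdict

def block_by(records, key):
    """Group records into independent blocks; ER then runs per block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return dict(blocks)

# Hypothetical product records.
products = [
    {"title": "Intro to Databases", "category": "book"},
    {"title": "DSLR body", "category": "camera"},
    {"title": "Introduction to Databases", "category": "book"},
    {"title": "Trail mix", "category": "snacks"},
]

blocks = block_by(products, key=lambda r: r["category"])
for category, items in blocks.items():
    print(category, len(items))
```

Only records within the same block are compared, so the two book titles are candidates for matching while the camera and the snack are never compared with them.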

Exploiting properties

ENR algorithm
Enhanced algorithm for negative rules: makes things simpler and more efficient.
Rather than looking at the entire merge closure of I, partition I and look at the merge closure of each partition.
This partitioning is different from blocking, as it does not assume any semantic knowledge.
This is similar to our first algorithm, except that once records r and r' are merged, they are removed from further consideration.
This works because any future record that matches r or r' also matches the merged record.
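The merge-and-discard idea can be sketched as below. This illustrates only that one step, not the full ENR algorithm; the record model, the match rule, and all field values (including the extra record r4) are assumptions.

```python
# Sketch of the merge-and-discard loop; not the complete ENR algorithm.
def record(**fields):
    return frozenset(fields.items())

def values(r, attr):
    return {v for a, v in r if a == attr}

def match(r1, r2):
    # Hypothetical rule: shared SSN, or one name is a prefix of the other.
    if values(r1, "ssn") & values(r2, "ssn"):
        return True
    return any(a.startswith(b) or b.startswith(a)
               for a in values(r1, "name") for b in values(r2, "name"))

def merge(r1, r2):
    return r1 | r2

def resolve_discarding(records):
    pending = list(records)
    resolved = []
    while pending:
        r = pending.pop()
        partner = next((o for o in resolved if match(r, o)), None)
        if partner is None:
            resolved.append(r)
        else:
            # Both originals are dropped; only the merged record is reconsidered.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved

r1 = record(name="Pat", ssn="111")
r2 = record(name="Patricia", ssn="222", gender="F")
r3 = record(name="Pat", ssn="111", gender="M")
r4 = record(name="Chris", ssn="333")

resolved = resolve_discarding([r1, r2, r3, r4])
print(len(resolved))  # 2: one merged Pat/Patricia record and r4 on its own
```

Each resulting record identifies one group of base records that can only merge with each other, so negative rules can then be applied to each group separately.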

Some more negative rules
Borderline cases: in practice, rules are often written so that all borderline cases are flagged and the solver can check them, e.g. the case in which name, DOB, address, and gender all match but the SSNs differ by one digit.
NameAddr negative rules: special string comparison rules for name and address checks, e.g. are two streets with similar names "too far apart" to be in the same record?
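A borderline rule of the first kind might be sketched as follows. The dictionary record layout, the field names, and the sample values are hypothetical.

```python
def differs_in_one_position(a, b):
    """True when two equal-length strings differ in exactly one character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def borderline_ssn(r1, r2, other_fields=("name", "dob", "address", "gender")):
    # Flag pairs that agree on everything except a single SSN digit.
    same_rest = all(r1.get(f) == r2.get(f) for f in other_fields)
    return same_rest and differs_in_one_position(r1["ssn"], r2["ssn"])

# Made-up sample records.
a = {"name": "Pat", "dob": "1980-01-01", "address": "1 Main St",
     "gender": "F", "ssn": "123456789"}
b = dict(a, ssn="123456780")   # one digit differs
c = dict(b, name="Chris")      # name differs as well

print(borderline_ssn(a, b))  # True
print(borderline_ssn(a, c))  # False
```

Flagged pairs are handed to the solver rather than rejected automatically, since a one-digit difference is as likely to be a typo as a different person.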

How to choose negative rules?
Important: design negative rules that do not generate too many unnecessary checks.
The more records are flagged, the more human effort is required.
A good understanding of the application and of the match and merge rules is necessary.
Knowing the "common errors" (weak points) of the match and merge rules helps a lot.

Conclusion
In the ER process, negative rules capture "sanity checks".
The ER process often requires human guidance to handle real-world data and unexpected situations.
The GNR algorithm is a generic way to solve ER-N.
The ENR algorithm makes the GNR algorithm less costly.
The choice of negative rules is very important.

References
1. Steven Euijong Whang, Omar Benjelloun, Hector Garcia-Molina. Generic entity resolution with negative rules. The VLDB Journal (2009) 18:1261–1277, DOI /s
2. resolution-with-negative.html

THANK YOU