Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

Slides:



Advertisements
Similar presentations
Sorting Really Big Files Sorting Part 3. Using K Temporary Files Given  N records in file F  M records will fit into internal memory  Use K temp files,
Advertisements

Ispd-2007 Repeater Insertion for Concurrent Setup and Hold Time Violations with Power-Delay Trade-Off Salim Chowdhury John Lillis Sun Microsystems University.
Lecture 11 oct 6 Goals: hashing hash functions chaining closed hashing application of hashing.
GENERIC ENTITY RESOLUTION WITH NEGATIVE RULES Steven Euijong Whang · Omar Benjelloun · Hector Garcia-Molina Compiled by – Darshana Pathak.
Safeguarding and Charging for Information on the Internet Hector Garcia-Molina, Steven P. Ketchpel, Narayanan Shivakumar Stanford University Presented.
ICS 624 Spring 2011 Entity Resolution with Evolving Rules Preface to Steven Whang’s slides Asst. Prof. Lipyeow Lim Information & Computer Science Department.
Subscription Subsumption Evaluation for Content-Based Publish/Subscribe Systems Hojjat Jafarpour, Bijit Hore, Sharad Mehrotra, and Nalini Venkatasubramanian.
1 Overview of Storage and Indexing Chapter 8 (part 1)
Searching TAL Online Developed by Northern Lights Internet Solutions Ltd. Advanced Searching.
Normalisation up to 1NF Bottom-up Approach to Data Modelling.
Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.
1 PART III STORAGE AND INDEXING CH. 8: OVERVIEW OF STORAGE AND INDEXING (introduction) CH. 9: STORING DATA: DISKS AND FILES (hardware) CH. 10: TREE-STRUCTURED.
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee David W. Cheung Ben Kao The University of Hong Kong.
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
L. Grewe. Computing hash function for a string Horner’s rule: (( … (a 0 x + a 1 ) x + a 2 ) x + … + a n-2 )x + a n-1 ) int hash( const string & key )
Methodology Conceptual Database Design
1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.
1 Towards an end-to-end architecture for handling sensitive data Hector Garcia-Molina Rajeev Motwani and students.
CS4432: Database Systems II
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Storage and Indexing February 26 th, 2003 Lecture 19.
Indexing and Hashing (emphasis on B+ trees) By Huy Nguyen Cs157b TR Lee, Sin-Min.
Or Where do I begin?. Review the key Features from the Mind map last week Read the handout on Essay Writing Now work in pairs Mind map an outline for.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Identifying Reasons for Software Changes Using Historic Databases The CISC 864 Analysis By Lionel Marks.
1 Translating E/R Diagrams into Relational Schemas.
Entity Resolution for Big Data Lise Getoor University of Maryland College Park, MD Ashwin Machanavajjhala Duke University Durham, NC
Database Management 9. course. Execution of queries.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
Data-Centric Human Computation Jennifer Widom Stanford University.
High-resolution computational models of genome binding events Yuan (Alan) Qi Joint work with Gifford and Young labs Dana-Farber Cancer Institute Jan 2007.
CS370 Spring 2007 CS 370 Database Systems Lecture 4 Introduction to Database Design.
1 Overview of Storage and Indexing Chapter 8 (part 1)
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
DBMS 2001Notes 4.1: B-Trees1 Principles of Database Management Systems 4.1: B-Trees Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina,
August 21, 2002VLDB Gurmeet Singh Manku Frequency Counts over Data Streams Frequency Counts over Data Streams Stanford University, USA.
Introduction to Database Tonga Institute of Higher Education NOS 215.
Functional Dependencies. FarkasCSCE 5202 Reading and Exercises Database Systems- The Complete Book: Chapter 3.1, 3.2, 3.3., 3.4 Following lecture slides.
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung, Ben Kao The University of Hong.
Hashing Chapter 7 Section 3. What is hashing? Hashing is using a 1-D array to implement a dictionary o This implementation is called a "hash table" Items.
ReSeTrus Development of a digital library technology based on redundancy elimination and semantic elevation, with special emphasis on trust management.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Hash Table March COP 3502, UCF 1. Outline Hash Table: – Motivation – Direct Access Table – Hash Table Solutions for Collision Problem: – Open.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
Information Integration Entity Resolution – 21.7 Presented By: Deepti Bhardwaj Roll No: 223_103.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
Paper Title Authors names Conference and Year Presented by Your Name Date.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Rule-Based Method for Entity Resolution IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING JANUARY 2015.
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
1 Lecture 08: E/R Diagrams and Functional Dependencies Friday, January 21, 2005.
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.
Databases Databases are collections of information; our study repeats a theme: Tell the computer the structure, and it can help you! © 2004, Lawrence Snyder.
SQL Basics Review Reviewing what we’ve learned so far…….
CPT-S Advanced Databases 1 Yinghui Wu EME 49 ADB (ln24)
Frequency Counts over Data Streams
Steven Whang Stanford University
Lecture 9: Entity Resolution
Predict Protein Sequence by Fuzzy-Association Rules
Data Warehousing Overview CS245 Notes 11
Lecture 09: Functional Dependencies, Database Design
Disambiguation Algorithm for People Search on the Web
Entity Resolution with Evolving Rules
Implementation of Relational Operations
Leverage Consensus Partition for Domain-Specific Entity Coreference
Lecture 08: E/R Diagrams and Functional Dependencies
Managing Information Leakage
Presentation transcript:

Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang

2 Entity Resolution N: a A: b CC#: c Ph: e e1 N: a Exp: d Ph: e e2

3 Applications comparison shopping mailing lists classified ads customer files counter-terrorism N: a A: b CC#: c Ph: e e1 N: a Exp: d Ph: e e2

4 Outline Why is ER challenging? How is ER done? Some ER work at Stanford Confidences

5 Challenges (1) No keys! Value matching –“Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”... Record matching Nm: Tom Ad: 123 Main St Ph: (650) Ph: (650) Nm: Thomas Ad: 132 Main St Ph: (650)

6 Challenges (2) Merging records Nm: Tom Ad: 123 Main St Ph: (650) Ph: (650) Nm: Thomas Ad: 132 Main St Ph: (650) Zp: Nm: Tom Nm: Thomas Ad: 123 Main St Ph: (650) Ph: (650) Zp: 94305

7 Challenges (3) Chaining Nm: Tom Wk: IBM Oc: laywer Sal: 500K Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Nm: Thomas Ad: 123 Maim Oc: lawyer Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Sal: 500K

8 Challenges (4) Un-merging Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Sal: 500K too young to make 500K at IBM!!

9 Challenges (5) Confidences in data Nm: Tom (0.9) Ad: 123 Main St (1.0) Ph: (650) (0.6) Ph: (650) (0.8) (0.8) In value matching, match rules, merge: conf = ?

10 Taxonomy Pairwise snaps vs. clustering De-duplication vs. fidelity enhancement Schema differences Relationships Exact vs. approximate Generic vs application specific Confidences

11 Schema Differences Name: Tom Address: 123 Main St Ph: (650) Ph: (650) FirstName: Tom StreetName: Main St StreetNumber: 123 Tel: (650)

12 Pair-Wise Snaps vs. Clustering r1 r2 r3 r4 r5 r6 s9 s8 s7 s10 r1 r2 r3 r4 r5 r6 r9 r8 r7 r10

13 De-Duplication vs. Fidelity Enhancement R S B S N

14 Relationships r1 r2 r5 r7 brother father business

15 Using Relationships p1 p2 p5 p7 a1 a2 a3 a5 a4 authors papers same??

16 Exact vs Approximate ER products cameras resolved cameras CDs books... resolved CDs resolved books... ER

17 Exact vs Approximate ER terrorists sort by age Widom 30 match against ages 25-35

18 Generic vs Application Specific Match function M(r, s) Merge function => t

19 Taxonomy Pairwise snaps vs. clustering De-duplication vs. fidelity enhancement Schema differences Relationships Exact vs. approximate Generic vs application specific Confidences

20 Outline Why is ER challenging? How is ER done? Some ER work at Stanford Confidences

21 Taxonomy Pairwise snaps vs. clustering De-duplication vs. fidelity enhancement Schema differences No Relationships No Exact vs. approximate Generic vs application specific Confidences... later on

22 Model Nm: Tom Wk: IBM Oc: laywer Sal: 500K Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Nm: Thomas Ad: 123 Maim Oc: lawyer Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Sal: 500K r1 r3 r2 r4: M(r1, r2) M(r4, r3)

23 Correct Answer r1 r2 r3 r4 r5 r6 s9 s8 s7 s10 ER(R) = All derivable records..... Minus “dominated” records

24 Question What is best sequence of match, merge calls that give us right answer?

25 Brute Force Algorithm Input R: –r1 = [a:1, b:2] –r2 = [a:1, c: 4, e:5] –r3 = [b:2, c:4, f:6] –r4 = [a:7, e:5, f:6]

26 Brute Force Algorithm Input R: –r1 = [a:1, b:2] –r2 = [a:1, c: 4, e:5] –r3 = [b:2, c:4, f:6] –r4 = [a:7, e:5, f:6] Match all pairs: –r1 = [a:1, b:2] –r2 = [a:1, c: 4, e:5] –r3 = [b:2, c:4, f:6] –r4 = [a:7, e:5, f:6] –r12 = [a:1, b:2, c:4, e:5]

27 Brute Force Algorithm Match all pairs: –r1 = [a:1, b:2] –r2 = [a:1, c: 4, e:5] –r3 = [b:2, c:4, f:6] –r4 = [a:7, e:5, f:6] –r12 = [a:1, b:2, c:4, e:5] Repeat: –r1 = [a:1, b:2] –r2 = [a:1, c: 4, e:5] –r3 = [b:2, c:4, f:6] –r4 = [a:7, e:5, f:6] –r12 = [a:1, b:2, c:4, e:5] –r123 = [a:1, b:2, c:4, e:5, f:6]

28 Question # 1 Can we delete r1, r2?

29 Question # 2 Can we avoid comparisons?

30 ICAR Properties Idempotence: –M(r1, r1) = true; = r1 Commutativity: –M(r1, r2) = M(r2, r1) – = Associativity – > =, r3>

31 More Properties Representativity –If = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true. r1 r2 r3 r4

32 ICAR Properties  Efficiency Commutativity Idempotence Associativity Representativity Can discard records ER result independent of processing order

33 Swoosh Algorithms Record Swoosh Merges records as soon as they match Optimal in terms of record comparisons Feature Swoosh Remembers values seen for each feature Avoids redundant value comparisons

34 Swoosh Performance

35 If ICAR Properties Do Not Hold? r1: [Joe Sr., 123 Main, DL:X] r23: [Joe Jr., 123 Main, Ph: 123, DL:Y] r12: [Joe Sr., 123 Main, Ph: 123, DL:X] r2: [Joe, 123 Main, Ph:123] r3: [Joe Jr., 123 Main, DL:Y]

36 If ICAR Properties Do Not Hold? r1: [Joe Sr., 123 Main, DL:X] r23: [Joe Jr., 123 Main, Ph: 123, DL:Y] r12: [Joe Sr., 123 Main, Ph: 123, DL:X] r2: [Joe, 123 Main, Ph:123] r3: [Joe Jr., 123 Main, DL:Y] Full Answer: ER(R) = {r12, r23, r1, r2, r3} Minus Dominated:ER(R) = {r12, r23}

37 If ICAR Properties Do Not Hold? r1: [Joe Sr., 123 Main, DL:X] r23: [Joe Jr., 123 Main, Ph: 123, DL:Y] r12: [Joe Sr., 123 Main, Ph: 123, DL:X] r2: [Joe, 123 Main, Ph:123] r3: [Joe Jr., 123 Main, DL:Y] Full Answer: ER(R) = {r12, r23, r1, r2, r3} Minus Dominated:ER(R) = {r12, r23} R-Swoosh Yields:ER(R) = {r12, r3} or {r1, r23}

38 Swoosh Without ICAR Properties

39 Distributed Swoosh P1P2P3 r1 r2 r3 r4 r5 r6...

40 Distributed Swoosh P1P2P3 r1 r3 r4 r6... r1 r2 r4 r5... r2 r3 r5 r6...

41 DSwoosh Performance

42 Outline Why is ER challenging? How is ER done? Some ER work at Stanford Confidences

43 Conclusion ER is old and important problem Our approach: generic Confidences –challenging –two ways to tame: thresholds packages

44 Thanks.