Entity Resolution for Big Data
Lise Getoor, University of Maryland, College Park, MD
Ashwin Machanavajjhala, Duke University, Durham, NC

What is Entity Resolution?
The problem of identifying and linking/grouping different manifestations of the same real-world object.
Examples of manifestations and objects:
– Different ways of referring to the same person in text (names, addresses, Facebook accounts).
– Web pages with differing descriptions of the same business.
– Different photos of the same object.
– …

Ironically, Entity Resolution has many duplicate names:
– Doubles
– Duplicate detection
– Record linkage
– Deduplication
– Object identification
– Object consolidation
– Coreference resolution
– Entity clustering
– Reference reconciliation
– Reference matching
– Householding
– Household matching
– Fuzzy match
– Approximate match
– Merge/purge
– Hardening soft databases
– Identity uncertainty

ER Motivating Examples
– Linking Census Records
– Public Health
– Web search
– Comparison shopping
– Counter-terrorism
– Knowledge Graph Construction
– …

Motivation: ER and Network Analysis
[Figure: a network before and after entity resolution]

Measuring the topology of the Internet … using traceroute

IP Aliasing Problem [Willinger et al. 2009]

Traditional Challenges in ER
– Name/attribute ambiguity (e.g., Thomas Cruise, Michael Jordan)
– Errors due to data entry
– Missing values [Gill et al; Univ of Oxford 2003]
– Changing attributes
– Data formatting
– Abbreviations / data truncation

Big-Data ER Challenges
Larger and more datasets
  – Need efficient parallel techniques
More heterogeneity
  – Unstructured, unclean and incomplete data. Diverse data types.
  – No longer just matching names with names, but Amazon profiles with browsing history on Google and friends network in Facebook.
More linked
  – Need to infer relationships in addition to “equality”
Multi-relational
  – Deal with structure of entities (Are Walmart and Walmart Pharmacy the same?)
Multi-domain
  – Customizable methods that span across domains
Multiple applications (web search versus comparison shopping)
  – Serve diverse applications with different accuracy requirements

Outline
1. Abstract Problem Statement
2. Algorithmic Foundations of ER
   a) Data Preparation and Match Features
   b) Pairwise ER
   c) Constraints in ER
   d) Algorithms
      – Record Linkage
      – Deduplication
      – Collective ER
3. Scaling ER to Big-Data
   a) Blocking/Canopy Generation
   b) Distributed ER
4. Challenges & Future Directions
(Includes a 10 minute break.)

Scope of the Tutorial
What we cover:
  – Fundamental algorithmic concepts in ER
  – Scaling ER to big datasets
  – Taxonomy of current ER algorithms
What we do not cover:
  – Schema/ontology resolution
  – Data fusion/integration/exchange/cleaning
  – Entity/Information Extraction
  – Privacy aspects of Entity Resolution
  – Details on similarity measures
  – Technical details and proofs

ER References
Book / Survey Articles
  – Data Quality and Record Linkage Techniques [T. Herzog, F. Scheuren, W. Winkler, Springer ’07]
  – Duplicate Record Detection [A. Elmagarmid, P. Ipeirotis, V. Verykios, TKDE ’07]
  – An Introduction to Duplicate Detection [F. Naumann, M. Herschel, Morgan & Claypool Synthesis Lectures 2010]
  – Evaluation of Entity Resolution Approaches on Real-world Match Problems [H. Köpcke, A. Thor, E. Rahm, PVLDB 2010]
  – Data Matching [P. Christen, Springer 2012]
Tutorials
  – Record Linkage: Similarity Measures and Algorithms [N. Koudas, S. Sarawagi, D. Srivastava, SIGMOD ’06]
  – Data Fusion: Resolving Data Conflicts for Integration [X. Dong, F. Naumann, VLDB ’09]
  – Entity Resolution: Theory, Practice and Open Challenges [L. Getoor, A. Machanavajjhala, AAAI ’12]

PART 1: ABSTRACT PROBLEM STATEMENT

Abstract Problem Statement
[Figure: Real World; Digital World (records / mentions)]

Deduplication Problem Statement
Cluster the records/mentions that correspond to the same entity
  – Intensional variant: compute a cluster representative
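
As a rough illustration of the deduplication statement, the sketch below groups records into clusters from a given set of matching pairs and then picks one representative per cluster. The toy records, the use of union-find (transitive closure) as the clustering rule, and the "longest string" rule for the representative are all assumptions made for this example, not methods taken from the tutorial.

```python
from collections import defaultdict


def cluster_records(records, matches):
    """Group record ids into clusters via transitive closure of match pairs."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r)].add(r)
    return list(clusters.values())


def representative(cluster, records):
    """Hypothetical rule: the longest value wins, ties broken alphabetically."""
    return max((records[r] for r in cluster), key=lambda v: (len(v), v))


# Toy data (assumed): record id -> mention text, plus matching pairs.
records = {1: "J. Smith", 2: "John Smith", 3: "Jon Smith", 4: "Mary Jones"}
matches = [(1, 2), (2, 3)]

clusters = cluster_records(records, matches)
for c in clusters:
    print(sorted(c), "->", representative(c, records))
```

Taking the transitive closure is only one possible choice; it merges records 1 and 3 even though that pair was never matched directly.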

Record Linkage Problem Statement
Link records that match across databases A and B

Reference Matching Problem
Match noisy records to clean records in a reference table
[Figure: Reference Table]
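
A minimal sketch of linking two databases, under stated assumptions: records are plain strings, similarity is token-level Jaccard, and links are chosen greedily one-to-one above a threshold. The function names, the threshold of 0.5, and the toy data are hypothetical and only illustrate the problem statement.

```python
def jaccard(s, t):
    """Jaccard similarity between the lowercase token sets of two strings."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def link(db_a, db_b, threshold=0.5):
    """Greedy one-to-one linkage: best-scoring pairs first, each record used once."""
    scored = sorted(
        ((jaccard(va, vb), ka, kb)
         for ka, va in db_a.items()
         for kb, vb in db_b.items()),
        reverse=True,
    )
    used_a, used_b, links = set(), set(), []
    for score, ka, kb in scored:
        if score < threshold:
            break  # remaining pairs score even lower
        if ka not in used_a and kb not in used_b:
            links.append((ka, kb))
            used_a.add(ka)
            used_b.add(kb)
    return links


# Toy databases (assumed data).
db_a = {"a1": "John Smith Durham", "a2": "Mary Jones College Park"}
db_b = {"b1": "Smith John Durham NC", "b2": "Bob Brown Boston"}

print(link(db_a, db_b))
```

The sketch compares every record in A with every record in B, which is quadratic in the input size and would not scale to large databases.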

Abstract Problem Statement
[Figure: Real World; Digital World (labels: AI, ML, DB)]

Deduplication Problem Statement

Deduplication with Canonicalization
[Figure label: AI]

Link Prediction Problem Statement

Graph/Motif Alignment
[Figure: Graph 1, Graph 2]

Relationships are crucial

Notation
R: set of records / mentions (typed)
H: set of relations / hyperedges (typed)
M: set of matches (record pairs that correspond to the same entity)
N: set of non-matches (record pairs corresponding to different entities)
E: set of entities
L: set of links
True (M_true, N_true, E_true, L_true): according to the real world
vs. Predicted (M_pred, N_pred, E_pred, L_pred): by the algorithm

Relationship between M_true and M_pred
– M_true: SameAs (equivalence)
– M_pred: similar representations and similar attributes
[Figure: M_true and M_pred shown within R × R]

Metrics
Pairwise metrics
  – Precision/Recall, F1
  – # of predicted matching pairs
Cluster-level metrics
  – Purity, completeness, complexity
  – Precision/Recall/F1: cluster-level, closest cluster, MUC, B³, Rand Index
  – Generalized merge distance [Menestrina et al, PVLDB10]
Little work evaluates correct prediction of links.
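
To make two of these metrics concrete, the sketch below computes pairwise precision/recall/F1 from the sets of true and predicted matching pairs, and B³ precision/recall as per-record averages. It assumes both clusterings cover exactly the same records; the function names and the toy clusterings are illustrative, not from the tutorial.

```python
from itertools import combinations


def pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_prf(true_clusters, pred_clusters):
    """Pairwise precision, recall and F1 over matching pairs."""
    m_true, m_pred = pairs(true_clusters), pairs(pred_clusters)
    tp = len(m_true & m_pred)
    precision = tp / len(m_pred) if m_pred else 1.0
    recall = tp / len(m_true) if m_true else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def b_cubed(true_clusters, pred_clusters):
    """B-cubed precision and recall, averaged over records."""
    true_of = {r: c for c in true_clusters for r in c}
    pred_of = {r: c for c in pred_clusters for r in c}
    records = list(true_of)
    overlap = {r: len(true_of[r] & pred_of[r]) for r in records}
    precision = sum(overlap[r] / len(pred_of[r]) for r in records) / len(records)
    recall = sum(overlap[r] / len(true_of[r]) for r in records) / len(records)
    return precision, recall


# Toy clusterings (assumed data): record 3 is placed in the wrong cluster.
true_clusters = [{1, 2, 3}, {4, 5}]
pred_clusters = [{1, 2}, {3, 4, 5}]

print(pairwise_prf(true_clusters, pred_clusters))  # (0.5, 0.5, 0.5)
print(b_cubed(true_clusters, pred_clusters))
```

On this example a single misplaced record halves the pairwise scores, while B³ precision and recall are both 11/15, which shows how the two families of metrics can weight the same error differently.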

Typical Assumptions Made
– Each record/mention is associated with a single real-world entity.
– In record linkage, there are no duplicates within the same source.
– If two records/mentions are identical, then they are true matches: (r1, r2) ∈ M_true for identical records r1 and r2.

ER versus Classification
– Finding matches vs. non-matches is a classification problem.
– Imbalanced: typically O(R) matches, O(R^2) non-matches.
– Instances are pairs of records. Pairs are not IID: (·, ·) ∈ M_true AND …
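
The class imbalance can be checked with simple counting. The sketch below, a hypothetical helper rather than anything from the tutorial, takes a list of true cluster sizes and returns the number of matching and non-matching record pairs.

```python
from math import comb


def pair_counts(cluster_sizes):
    """Return (#matching pairs, #non-matching pairs) for the given cluster sizes."""
    n = sum(cluster_sizes)
    matches = sum(comb(s, 2) for s in cluster_sizes)
    return matches, comb(n, 2) - matches


# Assumed example: 1,000 entities, each appearing as exactly 2 records.
matches, non_matches = pair_counts([2] * 1000)
print(matches, non_matches)  # 1000 1998000
```

With constant cluster sizes the matches grow linearly in the number of records while the non-matches grow quadratically, so here fewer than 1 pair in 1,000 is a match.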

ER vs (Multi-relational) Clustering
– Computing entities from records is a clustering problem.
– In typical clustering algorithms (k-means, LDA, etc.) the number of clusters is constant or sublinear in R.
– In ER, the number of clusters is linear in R, and the average cluster size is a constant. A significant fraction of clusters are singletons.
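
These properties are easy to inspect on a labeled sample. The sketch below assumes a hypothetical mapping from record id to entity id and reports the number of clusters, the average cluster size, and the fraction of singleton clusters; the names and data are illustrative only.

```python
from collections import Counter


def cluster_stats(entity_of):
    """Summarize cluster structure from a record -> entity assignment."""
    sizes = Counter(entity_of.values())
    num_clusters = len(sizes)
    avg_size = len(entity_of) / num_clusters
    singleton_fraction = sum(1 for s in sizes.values() if s == 1) / num_clusters
    return num_clusters, avg_size, singleton_fraction


# Toy assignment (assumed data): 5 records, 4 entities.
entity_of = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e3", "r5": "e4"}

print(cluster_stats(entity_of))  # (4, 1.25, 0.75)
```

In this toy sample the cluster count is close to the record count and most clusters are singletons, unlike a k-means setting where k is fixed in advance.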