Rule-Based Method for Entity Resolution. IEEE Transactions on Knowledge and Data Engineering, January 2015.



INTRODUCTION

In many applications, a real-world entity appears in multiple data sources, and its descriptions can differ considerably. For example, a person's name or a mailing address can be written in several ways. Identifying the records that refer to the same real-world entity is called Entity Resolution (ER). ER is one of the most important problems in data cleaning and arises in applications such as information integration and information retrieval. Because of its importance, it has attracted much attention in the literature.

Traditional ER approaches
- Compare records by similarity.
- Cannot identify records correctly in some cases.

Observation: both the existence and the nonexistence of certain attribute-value pairs are useful for identifying records.
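The observation can be illustrated with a small sketch. The slides do not reproduce the ER-rule syntax, so the `ERRule` class, its field names, and the sample records below are hypothetical; the sketch only shows how a rule might require some attribute-value pairs to be present and others to be absent.

```python
from dataclasses import dataclass

# Hypothetical representation: a record maps attribute names to single values.
Record = dict


@dataclass(frozen=True)
class ERRule:
    """Illustrative rule: if every positive pair is present and every
    negative pair is absent, the record is taken to refer to `entity`."""
    entity: str
    positive: tuple = ()   # (attribute, value) pairs that must exist
    negative: tuple = ()   # (attribute, value) pairs that must not exist

    def matches(self, record: Record) -> bool:
        has_all = all(record.get(a) == v for a, v in self.positive)
        has_none = all(record.get(a) != v for a, v in self.negative)
        return has_all and has_none


# Two authors share a name; the absence of one coauthor separates them.
rule = ERRule(
    entity="e1",
    positive=(("name", "wei wang"),),
    negative=(("coauthor", "zhang"),),
)

r1 = {"name": "wei wang", "coauthor": "lin"}
r2 = {"name": "wei wang", "coauthor": "zhang"}
print(rule.matches(r1), rule.matches(r2))
```

A pure similarity comparison would treat `r1` and `r2` as near-duplicates because the names are identical; the negative pair is what tells them apart in this toy example.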

Contribution

syntax

semantics

Properties of ER-Rule Set

Algorithm
- Rule Discovery (DiscR): learns rules from a training data set.
- Rule-based entity resolution (R-ER): determines which entity each record in a new data set refers to.

Rule Discovery: several definitions before the algorithm

Rule Discovery

Rule requirements

Gen-PR

Gen-SingleNR First step:

Second step:

Rule-based entity resolution
We define the weight of each ER-rule r as:
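The weight formula is not reproduced in this transcript, so the sketch below treats each rule's weight as a given number. It assumes, without confirmation from the slides, that a record is assigned to the entity whose matching rules have the largest total weight. `WeightedRule`, `rule_matches`, `resolve`, and the sample rules are all hypothetical names and data.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class WeightedRule(NamedTuple):
    entity: str
    positive: dict   # attribute -> value that must be present
    negative: dict   # attribute -> value that must be absent
    weight: float    # placeholder; the real formula is not in the slides


def rule_matches(rule: WeightedRule, record: dict) -> bool:
    return (all(record.get(a) == v for a, v in rule.positive.items())
            and all(record.get(a) != v for a, v in rule.negative.items()))


def resolve(record: dict, rules: list) -> Optional[str]:
    """Return the entity with the highest summed weight of matching rules,
    or None when no rule matches (assumed combination scheme)."""
    scores = defaultdict(float)
    for rule in rules:
        if rule_matches(rule, record):
            scores[rule.entity] += rule.weight
    if not scores:
        return None
    return max(scores, key=scores.get)


rules = [
    WeightedRule("e1", {"name": "wei wang"}, {"coauthor": "zhang"}, 0.6),
    WeightedRule("e2", {"name": "wei wang", "coauthor": "zhang"}, {}, 0.9),
    WeightedRule("e1", {"venue": "kdd"}, {}, 0.2),
]

print(resolve({"name": "wei wang", "coauthor": "zhang", "venue": "kdd"}, rules))
print(resolve({"name": "wei wang", "coauthor": "lin"}, rules))
```

Returning `None` for unmatched records leaves room for a fallback, such as creating a new entity, which this sketch does not attempt.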

Rule update
- Invalid rules
- Useless rules
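The slides name the two categories without defining them. The sketch below works under assumed definitions: a rule is "invalid" if it matches a labeled record that belongs to a different entity, and "useless" if it matches no labeled record at all. The dictionary-based rule format and the function names are hypothetical.

```python
def matches(rule: dict, record: dict) -> bool:
    """Rule holds when all positive pairs exist and all negative pairs do not."""
    return (all(record.get(a) == v for a, v in rule["positive"].items())
            and all(record.get(a) != v for a, v in rule["negative"].items()))


def classify_rules(rules: list, labeled_records: list):
    """Split rules into (valid, invalid, useless) under the assumed definitions.

    labeled_records is a list of (record, entity_label) pairs.
    """
    valid, invalid, useless = [], [], []
    for rule in rules:
        matched_labels = [label for record, label in labeled_records
                          if matches(rule, record)]
        if any(label != rule["entity"] for label in matched_labels):
            invalid.append(rule)      # fires on another entity's record
        elif not matched_labels:
            useless.append(rule)      # never fires on the current data
        else:
            valid.append(rule)
    return valid, invalid, useless


rules = [
    {"entity": "e1", "positive": {"coauthor": "lin"}, "negative": {}},
    {"entity": "e1", "positive": {"name": "wei wang"}, "negative": {}},
    {"entity": "e2", "positive": {"venue": "icde"}, "negative": {}},
]
data = [
    ({"name": "wei wang", "coauthor": "lin"}, "e1"),
    ({"name": "wei wang", "coauthor": "zhang"}, "e2"),
]

valid, invalid, useless = classify_rules(rules, data)
print(len(valid), len(invalid), len(useless))
```

Rerunning such a check whenever new labeled records arrive is one way the rule set could be kept current; how the paper performs the update is not shown in the transcript.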

Evaluation
- The effectiveness of our rule learning algorithm (DiscR) and our rule-based ER approach (R-ER)
- The impact of training data size on ER accuracy and on the number of generated rules
- The impact of the rule length threshold on ER accuracy
- The scalability of DiscR and R-ER with the size of the data

Algorithms compared with: GHOST and CFR

Summary
- DiscR and R-ER achieve high accuracy with a small amount of training data.
- Updating rules does help identify records.
- The number of generated rules scales well with the training data size on both data sets.
- Rules with length larger than 2 are seldom needed to identify records.
- Both DiscR and R-ER scale well with the size of the data.

Thank you!