ICS 624 Spring 2011 Entity Resolution with Evolving Rules Preface to Steven Whang’s slides Asst. Prof. Lipyeow Lim Information & Computer Science Department.

Presentation transcript:

ICS 624 Spring 2011: Entity Resolution with Evolving Rules. Preface to Steven Whang’s slides. Asst. Prof. Lipyeow Lim, Information & Computer Science Department, University of Hawaii at Manoa. 3/30/2011.

Outline. Facebook and LinkedIn: are these two pages referring to the same person?

Entity Resolution (ER). Applications: comparison shopping, mailing lists, classified ads, customer files, counter-terrorism. Example: record r1 has columns Name, Address, Credit Card, Phone, with Name = Sharon and Address = RI; record r2 has columns Name, Affiliation, Phone, with Name = Sharon and Affiliation = IBM. Are these two records referring to the same entity?

Entity Resolution Problem Statement. Given a table of records (about some entities), partition the records according to the “entities” that they refer to. Example table (columns Name, Address, Credit Card, Phone) with rows (Name, Address): (Sharon, RI), (Sharon, NY), (John, NY), (John, NJ). The two Sharon records are grouped as “Entity: Sharon” and the two John records as “Entity: John”.

ER Rules. ER rules: an ER algorithm that computes the mapping of records to entities (the “partition”). Match-based: boolean rules such as “if the names in the two records are the same, then they belong to the same partition”. Distance-based: uses a distance function. Example table (columns Name, Address, Credit Card, Phone) with rows (Name, Address): (Sharon, RI), (Sharon, NY), (John, NY), (John, NJ); applying the ER rule yields Entity: Sharon and Entity: John.
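The two kinds of rule can be sketched as plain functions over records. This is only an illustration: the dictionary record layout, the function names, the use of `difflib.SequenceMatcher` as a string distance, and the 0.2 threshold are all assumptions, not part of the slides.

```python
from difflib import SequenceMatcher


def match_by_name(r1, r2):
    """Match-based rule: true when the two records have the same name."""
    return r1["name"] == r2["name"]


def name_distance(r1, r2):
    """Distance-based rule: 0.0 for identical names, approaching 1.0 as they differ.

    SequenceMatcher.ratio() is a similarity in [0, 1], so 1 - ratio is used
    here as a stand-in distance function (an assumption for illustration).
    """
    return 1.0 - SequenceMatcher(None, r1["name"], r2["name"]).ratio()


def match_by_distance(r1, r2, threshold=0.2):
    """Turn the distance into a decision with a hypothetical threshold."""
    return name_distance(r1, r2) <= threshold
```

A distance-based rule can tolerate small spelling differences that an exact match-based rule rejects, at the cost of choosing a threshold.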

Sorted Neighborhood (SN). Avoids comparing all O(n^2) pairs of records by: (1) sorting records based on some column(s); (2) comparing all pairs of records in a sliding window; (3) merging connected components into entities. Example table (columns Name, Address, Credit Card, Phone) with rows (Name, Address): (Sharon, RI), (Sharon, NY), (John, NY), (John, NJ); diagram steps: Sort, then Find pairs.
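The three SN steps can be sketched as follows. This is a minimal sketch, not the implementation from the slides: the function name, the `key` and `match` parameters, the default window size, and the use of union-find to collect connected components are assumptions.

```python
def sorted_neighborhood(records, key, match, window=2):
    """Return clusters as sorted lists of record indices."""
    # Step 1: sort record indices by the chosen column(s).
    order = sorted(range(len(records)), key=lambda i: key(records[i]))

    # Union-find structure used to merge connected components.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 2: compare each record with the next (window - 1) records,
    # which covers every pair that shares a sliding window of that size.
    for pos in range(len(order)):
        for offset in range(1, window):
            if pos + offset >= len(order):
                break
            a, b = order[pos], order[pos + offset]
            if match(records[a], records[b]):
                parent[find(a)] = find(b)

    # Step 3: each connected component of matching pairs is one entity.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With n records and window size w, this performs on the order of n*w comparisons after sorting, rather than all n^2 pairs. Records that never share a window are never compared directly, so the choice of sort key matters.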

HC_B: Hierarchical Clustering, Boolean. Similar to bottom-up hierarchical agglomerative clustering. Merge two clusters if a boolean comparison rule B returns true. Rule B is applied to one chosen tuple from each of the two clusters. Example rule: B(r1, r2) = true if r1.name = r2.name. Example table (columns Name, Address, Credit Card, Phone) with rows (Name, Address): (Sharon, RI), (Sharon, NY), (John, NY), (John, NJ).
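A minimal sketch of this merging loop, assuming the "chosen tuple" of a cluster is simply its first record (the slide does not say how it is chosen). The function name `hc_b` and the index-based cluster representation are also assumptions.

```python
def hc_b(records, rule):
    """Agglomerative merging with a boolean rule on cluster representatives."""
    # Start with every record in its own cluster (lists of record indices).
    clusters = [[i] for i in range(len(records))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumption: the representative is the first record of each cluster.
                rep_a = records[clusters[a][0]]
                rep_b = records[clusters[b][0]]
                if rule(rep_a, rep_b):
                    clusters[a].extend(clusters[b])
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters
```

Because only one tuple per cluster is compared, each cluster comparison costs a single rule evaluation, but the result can depend on which tuple is chosen.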

HC_BR: Hierarchical Clustering, Boolean. Same as HC_B except in how the comparison is evaluated. Rule B is applied to all pairs of tuples, one from each of the two clusters, and the clusters are merged if B is true on at least one pair. Example rule: B(r1, r2) = true if r1.name = r2.name. Example table (columns Name, Address, Credit Card, Phone) with rows (Name, Address): (Sharon, RI), (Sharon, NY), (John, NY), (John, NJ).
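The all-pairs variant can be sketched in the same style. The function name `hc_br` and the index-based cluster representation are assumptions made for illustration.

```python
def hc_br(records, rule):
    """Agglomerative merging: merge when the rule holds for at least one cross pair."""
    clusters = [[i] for i in range(len(records))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Evaluate the rule on every pair with one record from each cluster.
                if any(
                    rule(records[i], records[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ):
                    clusters[a].extend(clusters[b])
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters
```

For the four example records and a hypothetical rule "same name or same address", this merges everything into one cluster, because (Sharon, NY) and (John, NY) share an address and link the two name groups together.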

ME: Monge-Elkan Clustering. (1) Sort records according to some column(s). (2) Initialize an empty fixed-length queue of clusters. (3) Scan through the sorted records and match each record against the clusters in the queue. (4) If the record matches an existing cluster, move that cluster to the front of the queue. (5) Otherwise, make the record into a new cluster at the front of the queue. (6) If the queue is full, the last cluster is dropped. Example table (columns Name, Address, Credit Card, Phone) with rows (Name, Address): (Sharon, RI), (Sharon, NY), (John, NY), (John, NJ); diagram: after sorting, the queue holds a Sharon cluster.
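The queue-based scan can be sketched as follows. Several details are assumptions, since the slide leaves them open: a record "matches" a cluster if it matches any member, a matched record is added to that cluster, and a cluster dropped from a full queue is kept as a finished cluster in the output. The function and parameter names are hypothetical.

```python
from collections import deque


def monge_elkan(records, key, match, queue_size=4):
    """Return clusters as lists of records, using a fixed-length cluster queue."""
    queue = deque()   # index 0 is the front of the queue
    finished = []     # clusters dropped from the back of the queue

    for rec in sorted(records, key=key):
        # Find the first queued cluster containing a record that matches.
        hit = next(
            (i for i, cluster in enumerate(queue)
             if any(match(rec, other) for other in cluster)),
            None,
        )
        if hit is not None:
            # Matched: add the record and move the cluster to the front.
            cluster = queue[hit]
            del queue[hit]
            cluster.append(rec)
            queue.appendleft(cluster)
        else:
            # No match: drop the last cluster if the queue is full,
            # then start a new cluster at the front.
            if len(queue) == queue_size:
                finished.append(queue.pop())
            queue.appendleft([rec])

    return finished + list(queue)
```

Each record is compared only against the clusters currently in the queue, so the queue length bounds the work per record. A cluster that has been dropped can no longer absorb later records.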

Distance-based ER Algorithms. Similar to bottom-up hierarchical agglomerative clustering, with variations in how the distance between two clusters is computed. HC_DS (single-link): the smallest distance between two records, one from each of the two clusters. HC_DC (complete-link): the largest distance between two records, one from each of the two clusters. Diagram labels: R1, R2, R3, R4.
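The two cluster distances, and a clustering loop that uses them, can be sketched as below. The stopping threshold is an assumption (the slide does not state when merging stops), as are the function names and the use of numbers with absolute difference as a stand-in for records and their distance function.

```python
def single_link(c1, c2, dist):
    """HC_DS: smallest distance between two records, one from each cluster."""
    return min(dist(a, b) for a in c1 for b in c2)


def complete_link(c1, c2, dist):
    """HC_DC: largest distance between two records, one from each cluster."""
    return max(dist(a, b) for a in c1 for b in c2)


def hc_distance(records, dist, cluster_dist, threshold):
    """Repeatedly merge the closest pair of clusters.

    Assumption: merging stops once the closest pair is farther apart
    than `threshold`.
    """
    clusters = [[r] for r in records]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, a, b = min(
            (cluster_dist(clusters[i], clusters[j], dist), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

For the values 1, 2, 3, 10 with a threshold of 1.5, single-link chains 1, 2, 3 into one cluster because each neighbor is 1 apart, while complete-link stops at {1, 2} because 1 and 3 are 2 apart.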

Evolving Rules. Diagram labels: Input Records, ER Rule 1, Resolved Records R, ER Rule 2, Resolved Records S, ER Rule 3. Given Rule 1 and Rule 2, can we compute S from R? Under what conditions would that be possible?