Distance functions and IE - 3 William W. Cohen CALD.


Announcements
No meeting this Wed, March 24
Thurs, March 25: talk from Carlos Guestrin on max-margin Markov nets
–Newell-Simon Hall 1507 at 9:30am
–no wait! – make that Wean Hall 4625
Writeups:
–today: “distance metrics for text” – three papers

Record linkage: definition
Record linkage: determine if pairs of data records describe the same entity
–I.e., find record pairs that are co-referent
–Entities: usually people (or organizations or…)
–Data records: names, addresses, job titles, birth dates, …
Main applications:
–Joining two heterogeneous relations
–Removing duplicates from a single relation
–Storing results of information extraction in a database, or answering queries that involve information extracted from different places
Key step: measuring similarity of two strings
–TFIDF metric (WHIRL)
–Edit distance (Monge-Elkan)

The data integration problem

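Levenshtein distance - example
distance(“William Cohen”, “Willliam Cohon”)
s:  W I L L - I A M _ C O H E N
t:  W I L L L I A M _ C O H O N
op: C C C C I C C C C C C C S C
The rows show the alignment of s and t; “-” marks the gap in s. C = copy, I = insert, S = substitute. With unit costs, the one insert and one substitution give a distance of 2.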

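Computing Levenshtein distance
$$
D(i,j) = \min \begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) + 1 & \text{// insert} \\
D(i,j-1) + 1 & \text{// delete}
\end{cases}
$$
      C  O  H  E  N
  M   1  2  3  4  5
  C   1  2  3  4  5
  C   2  2  3  4  5
  O   3  2  3  4  5
  H   4  3  2  3  4
  N   5  4  3  3  3
The bottom-right entry, 3, = D(s,t)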
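The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, assuming unit costs (d(si,tj) is 0 for a copy and 1 for a substitution); the function name `levenshtein_table` is made up for this example and is not from the slides.

```python
def levenshtein_table(s, t):
    """Fill the full DP table D for unit-cost Levenshtein distance.

    D[i][j] is the distance between the first i characters of s
    and the first j characters of t.
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # turn s[:i] into the empty string
    for j in range(m + 1):
        D[0][j] = j  # turn the empty string into t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # copy vs. substitute
            D[i][j] = min(
                D[i - 1][j - 1] + d,  # subst/copy
                D[i - 1][j] + 1,      # gap against a character of s
                D[i][j - 1] + 1,      # gap against a character of t
            )
    return D


def levenshtein(s, t):
    """Distance is the bottom-right cell of the table."""
    return levenshtein_table(s, t)[len(s)][len(t)]


table = levenshtein_table("MCCOHN", "COHEN")
print(table[6][1:])  # last row of the slide's table
print(levenshtein("William Cohen", "Willliam Cohon"))
```

Keeping the whole table (rather than two rows) costs O(|s||t|) memory, but makes it easy to compare against the table on the slide or to trace back an alignment.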

Smith-Waterman distance
[Figure: alignment matrix for “cohendorf” and “mccohnski”; dist=5]

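Affine gap distances - 3
Without affine gaps:
$$
D(i,j) = \max \begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) - 1 & \text{// insert} \\
D(i,j-1) - 1 & \text{// delete}
\end{cases}
$$
With affine gaps:
$$
D(i,j) = \max \begin{cases}
D(i-1,j-1) + d(s_i,t_j) \\
IS(i-1,j-1) + d(s_i,t_j) \\
IT(i-1,j-1) + d(s_i,t_j)
\end{cases}
$$
$$
IS(i,j) = \max \begin{cases}
D(i-1,j) - A \\
IS(i-1,j) - B
\end{cases}
\qquad
IT(i,j) = \max \begin{cases}
D(i,j-1) - A \\
IT(i,j-1) - B
\end{cases}
$$
IS(i,j): best score in which si is aligned with a ‘gap’
IT(i,j): best score in which tj is aligned with a ‘gap’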
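The three-table recurrence can be sketched as follows. Several details are assumptions made for this example and are not stated on the slide: A is read as the cost of opening a gap and B as the cost of extending one, d(si,tj) is a fixed match/mismatch score, the boundary rows and columns are initialised as in a global alignment, and the function name is hypothetical.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, A=2.0, B=1.0, match=2.0, mismatch=-1.0):
    """Best global alignment score with affine gap penalties.

    D[i][j]:  best score with s[i-1] aligned to t[j-1]
    IS[i][j]: best score with s[i-1] aligned to a gap
    IT[i][j]: best score with t[j-1] aligned to a gap
    """
    n, m = len(s), len(t)
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    # Assumed boundary: a leading gap of length k costs A + (k - 1) * B.
    for i in range(1, n + 1):
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, m + 1):
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)  # open vs. extend
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)  # open vs. extend
    return max(D[n][m], IS[n][m], IT[n][m])


print(affine_gap_score("abc", "abc"))
print(affine_gap_score("ab", "axxb"))
```

With these defaults, one gap of length two costs A + B = 3, which is cheaper than two separate gaps at 2A = 4; that preference for a few long gaps over many short ones is the point of the affine model.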


Inference in WHIRL
Explode p(X1,X2,X3): find all DB tuples for p and bind Xi to ai.
Constrain X~Y: if X is bound to a and Y is unbound,
–find DB column C to which Y should be bound
–pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
Keep track of t’s used previously, and don’t allow Y to contain one.
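The “constrain” step can be illustrated with a toy inverted index. This is only a rough sketch of the idea on the slide, not WHIRL's actual search: terms are tried in the order they appear in X (an assumption; the slide does not say how t is picked), and the function names are invented for this example.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each term to the set of row ids in the column that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(column):
        for term in set(text.lower().split()):
            index[term].add(row_id)
    return index


def constrain(x_value, column, index):
    """Yield (term, candidate binding for Y) pairs.

    After a term has been used, rows containing it are excluded,
    so no row is proposed twice.
    """
    used = []
    for term in x_value.lower().split():
        for row_id in sorted(index.get(term, ())):
            row_terms = set(column[row_id].lower().split())
            if not any(u in row_terms for u in used):
                yield term, column[row_id]
        used.append(term)


column = ["William Cohen", "Cohen Research", "Carlos Guestrin"]
index = build_inverted_index(column)
for term, binding in constrain("william cohen", column, index):
    print(term, "->", binding)
```

Excluding rows that contain an already-used term is what keeps the search from generating the same binding for Y along two different paths.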

String distance metrics so far...
Term-based (e.g. TF/IDF as in WHIRL)
–Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
–Usually weight words to account for “importance”
–Fast comparison: O(n log n) for |s|+|t|=n
Edit-distance metrics
–Distance is shortest sequence of edit commands that transform s to t.
–No notion of word importance
–More expensive: O(n^2)
Other metrics
–Jaro metric & variants
–Monge-Elkan’s recursive string matching
–etc?
Which metrics work best, for which problems?
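To make the term-based case concrete, here is a small TF-IDF cosine similarity sketch. The weighting log(TF+1)·log(IDF) with unit-length vectors is one common choice and is assumed here; whitespace tokenisation and the function names are also assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({w: v / norm for w, v in raw.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(x * v[w] for w, x in u.items() if w in v)


docs = ["William Cohen", "Will Cohen", "William Cohon", "Carlos Guestrin"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares the token "cohen"
print(cosine(vecs[0], vecs[3]))  # no shared tokens
```

Only whole tokens are compared, so a misspelled token such as “cohon” contributes nothing to the score; that is the sensitivity to spelling errors noted above.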

Jaro metric
Jaro metric is (apparently) tuned for personal names:
–Given (s,t) define c to be common in s,t if si=c, tj=c, and |i-j|<min(|s|,|t|)/2.
–Define c,d to be a transposition if c,d are common and c,d appear in different orders in s and t.
–Jaro(s,t) = average of #common/|s|, #common/|t|, and (#common - 0.5#transpositions)/#common
–Variant: weight errors early in string more heavily
Fast to compute
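A minimal sketch of the Jaro computation follows. It uses the window from the slide, |i-j| < min(|s|,|t|)/2, and the usual form of the third term, (#common - 0.5·#transpositions)/#common. The greedy left-to-right matching of common characters is an assumption of this sketch, and other implementations use a slightly different window.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if not s or not t:
        return 0.0
    window = min(len(s), len(t)) / 2
    t_used = [False] * len(t)
    s_common = []
    # Greedily match each character of s to an unused equal character of t
    # that lies within the window.
    for i, c in enumerate(s):
        for j, d in enumerate(t):
            if not t_used[j] and c == d and abs(i - j) < window:
                t_used[j] = True
                s_common.append(c)
                break
    if not s_common:
        return 0.0
    t_common = [t[j] for j in range(len(t)) if t_used[j]]
    # Common characters that appear in a different order in s and t.
    out_of_order = sum(a != b for a, b in zip(s_common, t_common))
    m = len(s_common)
    return (m / len(s) + m / len(t) + (m - 0.5 * out_of_order) / m) / 3


print(jaro("MARTHA", "MARHTA"))
```

For “MARTHA” and “MARHTA” all six characters are common and the pair T, H is out of order, so the score is (1 + 1 + 5/6)/3.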

Jaro metric

Winkler-Jaro metric


So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
Tools for evaluating performance on test data
Exploratory tool for adding, testing, combining string distances
–e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
URL –
Distribution also includes several sample matching problems.
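The idea of a generic rescorer can be sketched in Python (chosen for brevity; SecondString itself is Java, and this is not its code). The sketch assumes the wrapped function is a similarity in [0,1] and applies the commonly cited Winkler adjustment: boost the score by the length of the shared prefix, capped at 4 characters, with scaling factor p = 0.1. The names `winkler_rescore` and `common_prefix_len` are made up for this example.

```python
def common_prefix_len(s, t, cap=4):
    """Length of the shared prefix of s and t, at most `cap`."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == cap:
            break
        n += 1
    return n


def winkler_rescore(sim, s, t, p=0.1, cap=4):
    """Rescale any similarity `sim(s, t)` in [0, 1] to favour shared prefixes."""
    base = sim(s, t)
    return base + common_prefix_len(s, t, cap) * p * (1.0 - base)


def toy_sim(s, t):
    """Stand-in similarity for the demo: fraction of equal positions."""
    if not s and not t:
        return 1.0
    same = sum(a == b for a, b in zip(s, t))
    return same / max(len(s), len(t))


print(toy_sim("cohen", "cohon"), winkler_rescore(toy_sim, "cohen", "cohon"))
```

Because the boost is proportional to (1 - base), the rescored value stays in [0,1] as long as p·cap ≤ 1.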

SecondString distance functions
Edit-distance like:
–Levenshtein – unit costs
–untuned Smith-Waterman
–Monge-Elkan (tuned Smith-Waterman)
–Jaro and Jaro-Winkler
–Less ad hoc Jaro variants
Term-based
–TFIDF
–Jaccard distance:

SecondString distance functions
Edit-distance like:
–Levenshtein – unit costs
–untuned Smith-Waterman
–Monge-Elkan (tuned Smith-Waterman)
–Jaro and Jaro-Winkler

Results - Edit Distances Monge-Elkan is the best on average....

Edit distances

SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance:
–Language models: construct P_S and P_T and use

SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance
–Jensen-Shannon distance
  –smoothing toward union of S,T reduces cost of disagreeing on common terms
  –unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer
–“Simplified Fellegi-Sunter”
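Two of the term-based measures are simple enough to sketch directly. The definitions used here are the standard ones and are assumed rather than taken from the slides: Jaccard similarity is |S∩T|/|S∪T|, and the Jensen-Shannon divergence is the average KL divergence of P_S and P_T from their midpoint. Only the unsmoothed variant is shown; the Dirichlet and Jelinek-Mercer smoothing mentioned above are not implemented.

```python
import math
from collections import Counter


def jaccard(S, T):
    """Jaccard similarity of two token collections, treated as sets."""
    S, T = set(S), set(T)
    if not S and not T:
        return 1.0
    return len(S & T) / len(S | T)


def jensen_shannon(S, T):
    """Unsmoothed Jensen-Shannon divergence (base 2) between the
    term distributions P_S and P_T. Both inputs must be non-empty."""
    ps, pt = Counter(S), Counter(T)
    ns, nt = sum(ps.values()), sum(pt.values())
    div = 0.0
    for w in set(ps) | set(pt):
        p, q = ps[w] / ns, pt[w] / nt
        mid = 0.5 * (p + q)  # midpoint distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / mid)
        if q > 0:
            div += 0.5 * q * math.log2(q / mid)
    return div


s = "william w cohen".split()
t = "william cohen".split()
print(jaccard(s, t))
print(jensen_shannon(s, t))
```

With base-2 logarithms the divergence ranges from 0 for identical distributions to 1 for distributions with no terms in common.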

Results – Token Distances

SecondString distance functions
Hybrid term-based & edit-distance based:
–Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
–SoftTFIDF
  –Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
  –Downweight close tokens slightly
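The recursive matching idea can be sketched as a two-level score: split both strings into tokens, and for each token of S take its best match in T under some character-level similarity, then average. The inner similarity below is Python's `difflib.SequenceMatcher` ratio, used purely as a stand-in; it is not the tuned Smith-Waterman scorer that Monge-Elkan or SecondString use, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_sim(a, b):
    """Stand-in character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def recursive_match(s, t, inner=char_sim):
    """Average, over tokens of s, of the best inner similarity to any token of t."""
    s_tokens, t_tokens = s.lower().split(), t.lower().split()
    if not s_tokens or not t_tokens:
        return 0.0
    best = [max(inner(a, b) for b in t_tokens) for a in s_tokens]
    return sum(best) / len(s_tokens)


print(recursive_match("William Cohen", "Cohen William"))
print(recursive_match("William Cohen", "Wiliam Cohon"))
print(recursive_match("William Cohen", "Carlos Guestrin"))
```

Token order does not matter, and a misspelled token still earns partial credit, which is what distinguishes the hybrid schemes from plain TFIDF. Note that the score is not symmetric in s and t.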

Results – Hybrid distances

Results - Overall

Prospective test on two clustering tasks

An anomalous dataset

An anomalous dataset: census

Why?