Distance functions and IE William W. Cohen CALD. Announcements March 25 Thus – talk from Carlos Guestrin (Assistant Prof in Cald as of fall 2004) on max-margin.

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

SYSTEM PROGRAMMING & SYSTEM ADMINISTRATION
Chapter 7 Dynamic Programming.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Record Linkage Tutorial: Distance Metrics for Text William W. Cohen CALD.
A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep RaviKumar, Stephen Fienberg.
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Probabilistic Record Linkage: A Short Tutorial William W. Cohen CALD.
Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Aki Hecht Seminar in Databases (236826) January 2009
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Distance Functions for Sequence Data and Time Series
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
ASP.NET Database Connectivity I. 2 © UW Business School, University of Washington 2004 Outline Database Concepts SQL ASP.NET Database Connectivity.
7 -1 Chapter 7 Dynamic Programming Fibonacci Sequence Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, … F i = i if i  1 F i = F i-1 + F i-2 if.
Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit J. Hein, C. Wiuf, B. Knudsen, M.B. Moller and G. Wibling.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
SQL SQL stands for Structured Query Language SQL allows you to access a database SQL is an ANSI standard computer language SQL can execute queries against.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Class 2: Basic Sequence Alignment
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Relational Databases What is a relational database? What would we use one for? What do they look like? How can we describe them? How can you create one?
INFORMATION TECHNOLOGY IN BUSINESS AND SOCIETY SESSION 16 – SQL SEAN J. TAYLOR.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
DAY 21: MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Akhila Kondai October 30, 2013.
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest.
Developing Pairwise Sequence Alignment Algorithms
Distance functions and IE -2 William W. Cohen CALD.
Session 5: Working with MySQL iNET Academy Open Source Web Development.
Dbwebsites 2.1 Making Database backed Websites Session 2 The SQL… Where do we put the data?
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al. Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff.
Minimum Edit Distance Definition of Minimum Edit Distance.
Blocking. Basic idea: – heuristically find candidate pairs that are likely to be similar – only compare candidates, not all pairs Variant 1: – pick some.
Distance functions and IE – 5 William W. Cohen CALD.
Distance functions and IE – 4? William W. Cohen CALD.
Relation Extraction William Cohen Kernels vs Structured Output Spaces Two kinds of structured learning: –HMMs, CRFs, VP-trained HMM, structured.
Student Centered ODS ETL Processing. Insert Search for rows not previously in the database within a snapshot type for a specific subject and year Search.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
1 CS 430: Information Discovery Lecture 5 Ranking.
Learning Analogies and Semantic Relations Nov William Cohen.
CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Edit Distances William W. Cohen.
Summer Bioinformatics Workshop 2008 BLAST Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State University – Rochester Center
More announcements Unofficial auditors: send to Sharon Woodside to make sure you get any late-breaking announcements. Project: –Already.
Record Linkage and Disclosure Limitation William W. Cohen, CALD Steve Fienberg, Statistics, CALD & C3S Pradeep Ravikumar, CALD.
Minimum Edit Distance Definition of Minimum Edit Distance.
Retele de senzori Curs 2 - 1st edition UNIVERSITATEA „ TRANSILVANIA ” DIN BRAŞOV FACULTATEA DE INGINERIE ELECTRICĂ ŞI ŞTIINŢA CALCULATOARELOR.
Lecture 4: Data Integration and Cleaning CMPT 733, SPRING 2016 JIANNAN WANG.
1 Section 1 - Introduction to SQL u SQL is an abbreviation for Structured Query Language. u It is generally pronounced “Sequel” u SQL is a unified language.
Distance functions and IE - 3 William W. Cohen CALD.
Definition of Minimum Edit Distance
Database application MySQL Database and PhpMyAdmin
Definition of Minimum Edit Distance
Web Data Extraction Based on Partial Tree Alignment
Edit Distances William W. Cohen.
Kernels for Relation Extraction
Introduction to Information Retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CS639: Data Management for Data Science
Presentation transcript:

Distance functions and IE William W. Cohen CALD

Announcements March 25 Thus – talk from Carlos Guestrin (Assistant Prof in Cald as of fall 2004) on max-margin Markov nets –time constraints? Writeups: –nothing today –“distance metrics for text” – three papers due next Monday, 3/22

Record linkage: definition Record linkage: determine if pairs of data records describe the same entity –I.e., find record pairs that are co-referent –Entities: usually people (or organizations or…) –Data records: names, addresses, job titles, birth dates, … Main applications: –Joining two heterogeneous relations –Removing duplicates from a single relation

Record linkage: terminology The term “record linkage” is possibly co- referent with: –For DB people: data matching, merge/purge, duplicate detection, data cleansing, ETL (extraction, transfer, and loading), de-duping –For AI/ML people: reference matching, database hardening, object consolidation, –In NLP: co-reference/anaphora resolution –Statistical matching, clustering, language modeling, …

Motivation Q: What does this have to do with IE? A: Quite a lot, actually... Webfind (Monge & Elkan, 1995)

Finding a technical paper c Start with citation: " Experience With a Learning Personal Assistant", T.M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, Communications of the ACM, Vol. 37, No. 7, pp , July Find author’s institution (w/ INSPEC) Find web host (w/ NETFIND) Find author’s home page and (hopefully) the paper by browsing

Automatically finding a technical paper c with WebFind Start with citation: " Experience With a Learning Personal Assistant", T.M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, Communications of the ACM, Vol. 37, No. 7, pp , July Find author’s institution (w/ automated search against INSPEC) Find web host (w/ auto search on NETFIND) Find author’s home page and (hopefully) the paper by heuristic spidering

The data integration problem

Control flow (modulo details about querying –Extract (author, department) pairs from DB1 –Extract (department,www server) pairs from DB2 –Execute the two-step plan to get paper: author -> department -> wwwServer –two steps means matching (linking, integrating, deduping,....) department names in DB1/DB2 –issues are completely different if user is executing a one-step plan: one-step plan is retrieval

String distance metrics: Levenshtein Edit-distance metrics –Distance is shortest sequence of edit commands that transform s to t. –Simplest set of operations: Copy character from s over to t Delete a character in s (cost 1) Insert a character in t (cost 1) Substitute one character for another (cost 1) –This is “Levenshtein distance”

Levenshtein distance - example distance(“William Cohen”, “Willliam Cohon”) WILLIAM_COHEN WILLLIAM_COHON CCCCICCCCCCCSC s t op cost alignment

Levenshtein distance - example distance(“William Cohen”, “Willliam Cohon”) WILLIAM_COHEN WILLLIAM_COHON CCCCICCCCCCCSC s t op cost alignment gap

Computing Levenshtein distance - 1 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //insert D(i,j-1)+1 //delete

Computing Levenshtein distance - 2 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j

Computing Levenshtein distance - 3 D(i,j)= min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete COHEN M12345 C12345 C22345 O32345 H43234 N54333 = D(s,t)

Computing Levenshtein distance – 4 D(i,j) = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete COHEN M C C O H N A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)

Needleman-Wunch distance D(i,j) = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j) + G //insert D(i,j-1) + G //delete d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutibility, etc) William Cohen Wukkuan Cigeb G = “gap cost”

Smith-Waterman distance - 1 D(i,j) = max 0 //start over D(i-1,j-1) - d(si,tj) //subst/copy D(i-1,j) - G //insert D(i,j-1) - G //delete Distance is maximum over all i,j in table of D(i,j)

Smith-Waterman distance - 2 D(i,j) = max 0 //start over D(i-1,j-1) - d(si,tj) //subst/copy D(i-1,j) - G //insert D(i,j-1) - G //delete G = 1 d(c,c) = -2 d(c,d) = +1 COHEN M C C O H N

Smith-Waterman distance - 3 D(i,j) = max 0 //start over D(i-1,j-1) - d(si,tj) //subst/copy D(i-1,j) - G //insert D(i,j-1) - G //delete G = 1 d(c,c) = -2 d(c,d) = +1 COHEN M C C O H N

Smith-Waterman distance - 5 c o h e n d o r f m c c o h n s k i dist=5

Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)

Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996) Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions. Split large text fields by separators like commas, etc, and explore different pairings (since S-W assigns a large cost to large transpositions) Result competitive with plausible competitors.

Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996) String s=A 1 A 2... A K, string t=B 1 B 2... B L sim’ is editDistance scaled to [0,1] Monge-Elkan’s “recursive matching scheme” is average maximal similarity of A i to B j:

Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996) computer science department stanford university palo alto california Dept. of Comput. Sci. Stanford Univ. CA USA

Results: S-W from Monge & Elkan

Affine gap distances Smith-Waterman fails on some pairs that seem quite similar: William W. Cohen William W. ‘Don’t call me Dubya’ Cohen Intuitively, a single long insertion is “cheaper” than a lot of short insertions Intuitively, are springlest hulongru poinstertimon extisn’t “cheaper” than a lot of short insertions

Affine gap distances - 2 Idea: –Current cost of a “gap” of n characters: nG –Make this cost: A + (n-1)B, where A is cost of “opening” a gap, and B is cost of “continuing” a gap.

Affine gap distances - 3 D(i,j) = max D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)-1 //insert D(i,j-1)-1 //delete IS(i,j) = max D(i-1,j) - A IS(i-1,j) - B IT(i,j) = max D(i,j-1) - A IT(i,j-1) - B Best score in which si is aligned with a ‘gap’ Best score in which tj is aligned with a ‘gap’ D(i-1,j-1) + d(si,tj) IS(I-1,j-1) + d(si,tj) IT(I-1,j-1) + d(si,tj)

Affine gap distances - 4 -B -d(si,tj) D IS IT -d(si,tj) -A

Affine gap distances – experiments ( from McCallum,Nigam,Ungar KDD2000) Goal is to match data like this:

Affine gap distances – experiments ( from McCallum,Nigam,Ungar KDD2000) Hand-tuned edit distance Lower costs for affine gaps Even lower cost for affine gaps near a “.” HMM-based normalization to group title, author, booktitle, etc into fields

Affine gap distances – experiments TFIDFEdit Distance Adaptive Cora OrgName Orgname Restaurant Parks