Distance functions and IE – 5 William W. Cohen CALD.

Announcements
Current statistics:
– days with unscheduled student talks: 5
– students with unscheduled student talks: 3
– Projects are due: 4/28 (last day of class)
– Additional requirement: draft (for comments) no later than 4/21

String distance metrics so far...
Term-based (e.g. TF/IDF as in WHIRL)
– Distance depends on the set of words contained in both s and t, so it is sensitive to spelling errors.
– Usually weight words to account for “importance”
– Fast comparison: O(n log n) for |s|+|t|=n
Edit-distance metrics
– Distance is the length of the shortest sequence of edit commands that transforms s to t.
– No notion of word importance
– More expensive: O(n^2)
Other metrics
– Jaro metric & variants
– Monge-Elkan’s recursive string matching
– etc?
Which metrics work best, for which problems?
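To make the contrast between the two families concrete, here is a minimal sketch in Python. The function names, the tiny corpus in the usage note, and the smoothed IDF formula are assumptions made for illustration; this is not WHIRL's exact weighting.

```python
import math
from collections import Counter


def levenshtein(s, t):
    """Unit-cost edit distance by dynamic programming, O(|s|*|t|) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # copy or substitute
        prev = curr
    return prev[-1]


def tfidf_cosine(s, t, corpus):
    """Cosine similarity of TF-IDF word vectors.

    Assumption: idf(w) = log((N + 1) / (df(w) + 1)), a smoothed variant
    chosen here so that unseen words still get a finite weight.
    """
    docs = [set(d.lower().split()) for d in corpus]
    n = len(docs)

    def vector(text):
        counts = Counter(text.lower().split())
        return {w: c * math.log((n + 1) / (1 + sum(w in d for d in docs)))
                for w, c in counts.items()}

    u, v = vector(s), vector(t)
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

With a misspelling such as "willaim cohen", the edit distance to "william cohen" stays small (two substitutions), while the term-based score drops sharply because the misspelled token no longer matches any word.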

Results - Overall

Combining Information Extraction and Similarity Computations Krauthammer et al

Background
Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a large “database” of sequences.
– want to find subsequences (genes) that are highly similar (and hence probably related)
– want to ignore “accidental” matches
– possible technique is Smith-Waterman (local alignment)
  – want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

Smith-Waterman distance
[Figure: alignment score matrix for “cohendorf” vs. “mccohnski”; dist=5]

In general “peaks” in the matrix scores indicate highly similar substrings.
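The local-alignment matrix can be sketched as follows. The scoring values (match +2, mismatch -1, gap -1) and the function name are assumptions made for this illustration, so the numbers it produces are not meant to reproduce the value shown on the slide.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Return (best_score, end_i, end_j) of the best local alignment.

    h[i][j] is the best score of an alignment ending at s[i-1], t[j-1];
    clamping at 0 lets an alignment start anywhere, which is what makes
    the method local rather than global.
    """
    rows, cols = len(s) + 1, len(t) + 1
    h = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best[0]:
                best = (h[i][j], i, j)  # a "peak" in the matrix
    return best
```

Under these assumed scores, aligning "mccohn" with "cohen" peaks at the last cell: "coh" matches, one gap skips the "e", and "n" matches.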

Background
Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a large “database” of sequences.
– possible technique is Smith-Waterman (local alignment)
  – want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
  – based on substitutability theory/stats for amino acids
– doesn’t scale well
BLAST and FASTA: fast approximate S-W

BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
FASTA:
– Use inverted indices to find out where these words appear in the DB sequence
– Use S-W only near DB sections that contain some of these words
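The FASTA-style seeding step can be sketched as below: index the database by character n-grams, look up the query's n-grams, and return padded windows around the hits where a full S-W alignment would then be run. The function names and the `pad` parameter are hypothetical, and real FASTA applies further filtering that is omitted here.

```python
from collections import defaultdict


def ngrams(s, n):
    """All character n-grams ("words") of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(db, n):
    """Inverted index: n-gram -> list of start positions in db."""
    index = defaultdict(list)
    for pos, gram in enumerate(ngrams(db, n)):
        index[gram].append(pos)
    return index


def candidate_windows(query, db, n=3, pad=5):
    """Merged (start, end) windows of db that contain a query n-gram."""
    index = build_index(db, n)
    hits = sorted({p for g in ngrams(query, n) for p in index.get(g, [])})
    windows = []
    for pos in hits:
        start, end = max(0, pos - pad), min(len(db), pos + n + pad)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows
```

Only the returned windows need the quadratic alignment, which is where the speedup over running S-W against the whole database comes from.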

[Figure: query string, words and expansions]

BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
BLAST:
– Generate variations of these words by looking for changes that would lead to strong similarities
– Discard “low IDF” words (where accidental matches are likely)
– Use expanded set of n-grams to focus search
The BLAST program:
– Widely used,
– Fast implementation,
– Supports asking multiple queries against a database at once...
– Can one use it to find soft matches of protein names (from a dictionary) in text?

Basic idea:
Query algorithm (BLAST): protein database + query strings -> proposed alignment (query->database)
IE system (dictionaries+BLAST, optimized for this problem): biomedical paper + protein name dictionary -> extracted protein name (dict. entry->text)

1) Mapping text to DNA sequences (Q: what sort of char similarity is this?)

2) Optimizing BLAST
Split protein-name database into several parts (for short, medium-length, long protein names)
– Scoring depends on length of matched string
Require space chars before and after “short” protein names.
Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
– With what data?
Evaluate on one review article, 1162 protein names
– inter-annotator agreement not great (70-85%)

Results

Overall: precision 71.1%, recall 78.8% (optimized)

IE with Dictionaries Cohen & Sarawagi

Finding names you know about
Problem: given dictionary of names, find them in text
– Important task beyond (biology, link analysis,...)
– Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C), misspellings (Willaim Chen), polysemous words (June, Bill), etc
– In informal text it sometimes works very poorly
– Problem is similar to record linkage (aka data cleaning, de-duping, merge-purge,...): the problem of finding duplicate database records in heterogeneous databases.

Finding names you know about
Problem: given dictionary of names, find them in text
– Exact match is unlikely to work well for informal text.
– Problem is similar to record linkage
– Hard to combine state of the art similarity metrics (as used in record linkage) with state of the art NER system due to representational mismatch:
  Opening up the box, modern NER systems don’t really know anything about names....

IE as Sequential Word Classification
A trained IE system models the relative probability of labeled sequences of words.
To classify, find the most likely state sequence for the given words:
  Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
  Person name: Pedro Domingos
States: person name, location name, background

IE as Sequential Word Classification
Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but do not explicitly represent the names extracted.
[Figure: sequence model over w_{t-1}, w_t, w_{t+1} and O_{t-1}, O_t, O_{t+1}]
Word features:
– identity of word
– ends in “-ski”
– is capitalized
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet
– is in bold font
– is indented
– is in hyperlink anchor
– last person name was female
– next two words are “and Associates”
– …
Example: “Wisniewski” is part of a noun phrase and ends in “-ski”
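A few of the word-level features listed above can be computed directly from the token sequence, as in this rough sketch. The function name, the feature keys, and the `city_names` argument are hypothetical; features that need a parser, WordNet, or page layout are left out.

```python
def word_features(words, t, city_names=frozenset()):
    """Features of the word at position t; none of them look at whole names."""
    w = words[t]
    following = [x.lower() for x in words[t + 1:t + 3]]
    return {
        "identity": w.lower(),
        "ends_in_ski": w.lower().endswith("ski"),
        "is_capitalized": w[:1].isupper(),
        "in_city_list": w.lower() in city_names,
        "next_two_are_and_associates": following == ["and", "associates"],
    }
```

Every feature here describes a single word and its immediate context, which illustrates the point that a word-level tagger never compares a whole candidate name against a dictionary entry.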

Semi-Markov models for IE (with Sunita Sarawagi, IIT Bombay)
Train on sequences of labeled segments, not labeled words.
  S=(start,end,label)
Build probability model of segment sequences, not word sequences
Define features f of segments
  f(S) = words x_t...x_u, length, previous words, case information, ..., distance to known name
(Approximately) optimize feature weights on training data
  maximize: [equation]
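Because features are defined on whole segments, a segment can be compared directly against dictionary entries. The sketch below illustrates this under stated assumptions: the function name and feature keys are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for the string distance metrics discussed earlier in the course.

```python
from difflib import SequenceMatcher


def segment_features(words, start, end, dictionary):
    """Features of the segment words[start..end] (end inclusive)."""
    seg = words[start:end + 1]
    text = " ".join(seg).lower()
    # Similarity to the closest dictionary entry, in [0, 1]; 1.0 is an exact match.
    best = max((SequenceMatcher(None, text, name.lower()).ratio()
                for name in dictionary), default=0.0)
    return {
        "length": len(seg),
        "all_capitalized": all(w[:1].isupper() for w in seg),
        "prev_word": words[start - 1].lower() if start > 0 else "<s>",
        "first_word": seg[0].lower(),
        "last_word": seg[-1].lower(),
        "dict_similarity": best,
    }
```

A word-level model has no natural place for `dict_similarity`, since the value depends on the boundaries of the whole candidate name.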

Details: Semi-Markov model

Conditional Semi-Markov models CMM: CSMM:

A training algorithm for CSMM’s (1) Review: Collins’ perceptron training algorithm Correct tags Viterbi tags

A training algorithm for CSMM’s (2) Variant of Collins’ perceptron training algorithm: voted perceptron learner for T TRANS like Viterbi

A training algorithm for CSMM’s (3) Variant of Collins’ perceptron training algorithm: voted perceptron learner for T TRANS like Viterbi

A training algorithm for CSMM’s (3) Variant of Collins’ perceptron training algorithm: voted perceptron learner for T SEGTRANS like Viterbi
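As a rough illustration of the training loop, the sketch below pairs a segment-level Viterbi search with a plain (neither voted nor averaged) structured perceptron update. The feature templates, the `START` symbol, and all function names are assumptions made for this example and are not taken from the slides.

```python
START = "<s>"


def seg_features(words, start, end, label, prev_label):
    """Hypothetical feature templates for one labeled segment."""
    seg = " ".join(words[start:end + 1]).lower()
    return [f"label={label}", f"seg={seg}|{label}",
            f"len={end - start + 1}|{label}", f"trans={prev_label}>{label}"]


def score(weights, feats):
    return sum(weights.get(f, 0.0) for f in feats)


def seg_viterbi(words, weights, labels, max_len):
    """Best-scoring segmentation as a list of (start, end, label)."""
    n = len(words)
    if n == 0:
        return []
    # best[(i, y)] = (score, backpointer) for words[:i] ending in label y
    best = {(0, START): (0.0, None)}
    for end in range(1, n + 1):
        for y in labels:
            for length in range(1, min(max_len, end) + 1):
                start = end - length
                for py in ([START] if start == 0 else labels):
                    s = best[(start, py)][0] + score(
                        weights, seg_features(words, start, end - 1, y, py))
                    if (end, y) not in best or s > best[(end, y)][0]:
                        best[(end, y)] = (s, (start, py))
    y = max(labels, key=lambda lab: best[(n, lab)][0])
    segs, pos = [], n
    while pos > 0:  # follow backpointers from the end of the sentence
        start, py = best[(pos, y)][1]
        segs.append((start, pos - 1, y))
        pos, y = start, py
    return segs[::-1]


def features_of(words, segs):
    feats, prev = [], START
    for start, end, label in segs:
        feats.extend(seg_features(words, start, end, label, prev))
        prev = label
    return feats


def train(examples, labels, max_len=3, epochs=10):
    """Perceptron: on a mistake, add gold features, subtract predicted ones."""
    weights = {}
    for _ in range(epochs):
        mistakes = 0
        for words, gold in examples:
            pred = seg_viterbi(words, weights, labels, max_len)
            if pred != gold:
                mistakes += 1
                for f in features_of(words, gold):
                    weights[f] = weights.get(f, 0.0) + 1.0
                for f in features_of(words, pred):
                    weights[f] = weights.get(f, 0.0) - 1.0
        if mistakes == 0:
            break
    return weights
```

The only change from a word-level tagger is that the search ranges over segment lengths up to `max_len` as well as labels, so each step costs a factor of `max_len` more.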

Sample CSMM features

Experimental results
Baseline algorithms:
– HMM-VP/1: tags are “in entity”, “other”
– HMM-VP/4: tags are “begin entity”, “end entity”, “continue entity”, “unique”, “other”
– SMM-VP: all features f(w) have versions for “f(w) true for some w in segment that is first (last, any) word of segment”
– dictionaries: like Borthwick
  – HMM-VP/1: f_D(w)=“word w is in D”
  – HMM-VP/4: f_{D,begin}(w)=“word w begins entity in D”, etc, etc
Dictionary lookup
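The two word-level tag sets can be derived from entity spans as sketched below. The function name and the `scheme` argument are hypothetical; the tag strings follow the slide.

```python
def encode_tags(n_words, entities, scheme="4"):
    """Turn entity spans (start, end), end inclusive, into per-word tags.

    scheme "1" uses the tags of HMM-VP/1, scheme "4" those of HMM-VP/4.
    """
    tags = ["other"] * n_words
    for start, end in entities:
        if scheme == "1":
            for i in range(start, end + 1):
                tags[i] = "in entity"
        elif start == end:
            tags[start] = "unique"  # single-word entity
        else:
            tags[start] = "begin entity"
            tags[end] = "end entity"
            for i in range(start + 1, end):
                tags[i] = "continue entity"
    return tags
```

For example, a three-word entity at positions 1 to 3 of a five-word sentence becomes other, begin entity, continue entity, end entity, other under the second scheme.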

Datasets used Used small training sets (10% of available) in experiments.

Results

Results: varying history

Results: changing the dictionary

Results: vs CRF