Kernels for Relation Extraction

Presentation transcript:

Kernels for Relation Extraction William Cohen 10-13

Outline for Today Review: SVMs & kernels

Perceptrons vs SVMs

The voted perceptron. [Diagram: A sends an instance xi to the learner B; B predicts ŷi and then sees the true label yi.] Compute: ŷi = vk · xi. If mistake: vk+1 = vk + yi xi.

[Figure: (3a) the guess v2 after the two positive examples: v2 = v1 + x2. (3b) the guess v2 after one positive and one negative example: v2 = v1 - x2. Both panels show the guess relative to the target direction u and the margin γ.]

Perceptrons vs SVMs. For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 such that yi (u · xi) > γ for all i.

Perceptrons vs SVMs Question: why not use this assumption directly in the learning algorithm? i.e. Given: γ, (x1,y1), (x2,y2), (x3,y3), … Find: some w where ||w||=1 and for all i, w.xi.yi > γ

Perceptrons vs SVMs Question: why not use this assumption directly in the learning algorithm? i.e. Given: (x1,y1), (x2,y2), (x3,y3), … Find: some w and γ such that ||w||=1 and for all i, w.xi.yi > γ The best possible w and γ

Perceptrons vs SVMs. Question: why not use this assumption directly in the learning algorithm? i.e. Given: (x1,y1), (x2,y2), (x3,y3), … Maximize γ under the constraints ||w|| = 1 and, for all i, w.xi.yi > γ. Equivalently: minimize ||w||² under the constraint that for all i, w.xi.yi > 1. Units are arbitrary: rescaling increases γ and w.
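
In practice this optimization is handed to an off-the-shelf solver rather than worked out by hand. A minimal sketch (not from the lecture), assuming scikit-learn is available: a very large C makes the soft-margin SVM approximate the hard-margin problem above, and the geometric margin is recovered as 1/||w||. The toy data is made up.

```python
# Sketch: solve "minimize ||w||^2 s.t. w.xi.yi > 1" with an off-the-shelf SVM solver.
import numpy as np
from sklearn.svm import SVC

# made-up, linearly separable toy data
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
gamma = 1.0 / np.linalg.norm(w)     # geometric margin = 1 / ||w||
print(w, b, gamma)
```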

Perceptrons vs SVMs. Basic optimization problem: Given (x1,y1), (x2,y2), (x3,y3), …, minimize ||w||² under the constraint that for all i, w.xi.yi > 1. Variant: ranking constraints (e.g., to model click-through feedback): for all i and all j ≠ l, w.xi.yi,l > w.xi.yi,j + 1, where yi,l is the correct output for xi. But now you have exponentially many constraints … but Thorsten is a clever man.

Review of Kernels

The kernel perceptron. [Same diagram as before: A sends an instance xi to B; B predicts ŷi, then sees yi.] Compute: ŷi = vk · xi. If mistake: vk+1 = vk + yi xi. Mathematically the same as before … but allows use of the “kernel trick”. Other kernel methods (SVM, Gaussian processes) aren’t constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
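
A minimal sketch of the kernel perceptron described above (mine, not the lecture's code): the weight vector is never formed explicitly; the learner stores only the examples it made mistakes on, with weights +1/-1, and predicts using kernel evaluations against them. The kernel function K and the epochs parameter are assumptions.

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=5):
    """Train a kernel perceptron; returns the mistake examples and their +1/-1 weights.

    X : list of training instances (any objects the kernel accepts)
    y : array of labels in {+1, -1}
    K : kernel function K(x, z) -> float
    """
    support, alphas = [], []              # mistakes and their signs
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # prediction uses only kernel values against past mistakes
            score = sum(a * K(xv, xi) for xv, a in zip(support, alphas))
            if yi * score <= 0:           # mistake: add this example with weight yi
                support.append(xi)
                alphas.append(yi)
    return support, alphas

def predict(x, support, alphas, K):
    return np.sign(sum(a * K(xv, x) for xv, a in zip(support, alphas)))
```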

Extracting Relationships

What is “Information Extraction”? As a task: filling slots in a database from sub-segments of text.

Example text (23rd July 2009 05:51 GMT): Microsoft was in violation of the GPL (General Public License) on the Hyper-V code it released to open source this week. After Redmond covered itself in glory by opening up the code, it now looks like it may have acted simply to head off any potentially embarrassing legal dispute over violation of the GPL. The rest was theater. As revealed by Stephen Hemminger - a principal engineer with open-source network vendor Vyatta - a network driver in Microsoft's Hyper-V used open-source components licensed under the GPL and statically linked to binary parts. The GPL does not permit the mixing of closed and open-source elements. … Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.

Extracted records:
NAME               | TITLE              | ORGANIZATION
Stephen Hemminger  | principal engineer | Vyatta
Greg Kroah-Hartman | programmer         | Novell
Greg Kroah-Hartman | lead               | Linux Driver Proj.

[Speaker notes: What is IE? As a task: starting with some text and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.]

What is “Information Extraction”? Techniques: NER + segment + classify entity pairs from the same segment.

Example segment (23rd July 2009 05:51 GMT): Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft. [Highlighted entities include Hemminger, Microsoft, Linux Driver Project, Greg Kroah-Hartman, Novell, and the titles programmer and lead.]

One-stage process: classify (E1,E2) as unrelated or employedBy, employerOf, hasTitle, titleOf, hasPosition, positionInCompany. Two-stage process: classify (E1,E2) as related or not; then classify each related (E1,E2) as one of the relation types.

[Speaker notes: same as the previous slide.]

Bunescu & Mooney’s papers

Kernels vs Structured Output Spaces. Two kinds of structured learning: (1) HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured. E.g., for a linear-chain CRF, the output is a sequence of labels — a string in Yⁿ. (2) Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured. EMNLP: structure derived from a dependency graph. New!

Tasks: ACE relations

Dependency graphs for sentences. [Figure: dependency graph for the example sentence “Protesters seized several pumping stations, holding 127 Shell workers hostage.”]

Dependency graphs for sentences. CFG dependency parsers → dependency trees; context-sensitive formalisms → dependency DAGs.

Disclaimer: this is a shortest path, not the shortest path

K(x1 × … × xn, y1 × … × yn) = |(x1 × … × xn) ∩ (y1 × … × yn)| — i.e., count the features the two instances share. Since each xi and yi is a set, this is equivalently ∏i |xi ∩ yi| when the two paths have the same length, and 0 otherwise. Example: x1 × x2 × x3 × x4 × x5 with |x1| = 4, |x2| = 1, |x3| = 3, |x4| = 1, |x5| = 4 generates 4·1·3·1·4 = 48 features.
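
A small sketch of this counting, assuming each relation instance is represented as a list of feature sets along the (shortest) dependency path; it uses the fact that |A×B ∩ C×D| = |A∩C| · |B∩D|, i.e. the count of shared product features is the product of per-position intersections. The example feature sets are made up.

```python
def path_kernel(x, y):
    """Product-of-intersections kernel over two lists of feature sets."""
    if len(x) != len(y):          # paths of different length share no features
        return 0
    prod = 1
    for xi, yi in zip(x, y):
        prod *= len(xi & yi)      # number of common features at this position
    return prod

# made-up instance with per-position set sizes 4, 1, 3, 1, 4:
x = [{'protesters', 'NNS', 'Noun', 'PERSON'}, {'<-'},
     {'seized', 'VBD', 'Verb'}, {'->'},
     {'stations', 'NNS', 'Noun', 'FACILITY'}]
print(path_kernel(x, x))          # 4*1*3*1*4 = 48
```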

Results. -CCG, -CFG: context-sensitive CCG vs. Collins’ (CFG) parser. S1, S2: one multi-class SVM vs. two SVMs (binary, then multi-class). K4 is the baseline (two-stage SVM, custom kernel). Correct entity output is assumed.

Some background on … edit distances

String distance metrics: Levenshtein. Edit-distance metrics for a pair of strings s, t: the distance is the cost of the shortest sequence of edit commands that transform s into t. Simplest set of operations: copy a character from s over to t (cost 0); delete a character in s (cost 1); insert a character in t (cost 1); substitute one character for another (cost 1). This is “Levenshtein distance”.

Levenshtein distance - example. distance(“William Cohen”, “Willliam Cohon”) = 2. [Figure: character-by-character alignment of the two strings; the extra ‘l’ is aligned with a gap and ‘e’ is substituted by ‘o’. The “op” row shows the edit operation at each position and the “cost” row the running cost, totaling 2.]

Computing Levenshtein distance - 1. D(i,j) = score of best alignment from s1..si to t1..tj = min of:
  D(i-1,j-1), if si = tj      //copy
  D(i-1,j-1) + 1, if si ≠ tj  //substitute
  D(i-1,j) + 1                //insert
  D(i,j-1) + 1                //delete

Computing Levenshtein distance - 2. D(i,j) = score of best alignment from s1..si to t1..tj = min of:
  D(i-1,j-1) + d(si,tj)       //substitute/copy
  D(i-1,j) + 1                //insert
  D(i,j-1) + 1                //delete
(simplify by letting d(c,d) = 0 if c = d, 1 otherwise); also let D(i,0) = i (for i inserts) and D(0,j) = j.
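
A direct implementation sketch of this recurrence (mine, not the lecture's code):

```python
def levenshtein(s, t):
    n, m = len(s), len(t)
    # D[i][j] = cost of the best alignment of s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                                # i inserts/deletes
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1   # copy costs 0, substitute costs 1
            D[i][j] = min(D[i - 1][j - 1] + d,     # substitute/copy
                          D[i - 1][j] + 1,         # insert
                          D[i][j - 1] + 1)         # delete
    return D[n][m]

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2
```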

Computing Levenshtein distance - 3. D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)       //substitute/copy
  D(i-1,j) + 1                //insert
  D(i,j-1) + 1                //delete
[Figure: the DP table for the running example (columns C O H E N …), filled in cell by cell; the bottom-right cell holds D(s,t).]

Computing Levenshtein distance - 4. D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)       //substitute/copy
  D(i-1,j) + 1                //insert
  D(i,j-1) + 1                //delete
[Figure: the same DP table with backpointers.] A trace indicates where the min value came from, and can be used to recover the edit operations and/or a best alignment (there may be more than one).

Stopped HERE 10/13

Needleman-Wunsch distance. d(c,d) is an arbitrary distance function on characters (e.g., related to typo frequencies, amino acid substitutability, etc.). D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)       //substitute/copy
  D(i-1,j) + G                //insert
  D(i,j-1) + G                //delete
where G = “gap cost”. Example pair: William Cohen vs. Wukkuan Cigeb (each substituted character is an adjacent key on the keyboard, so a typo-aware d would score these as close).

Smith-Waterman distance - 1. D(i,j) = max of:
  0                           //start over
  D(i-1,j-1) - d(si,tj)       //substitute/copy
  D(i-1,j) - G                //insert
  D(i,j-1) - G                //delete
The distance is the maximum over all i,j in the table of D(i,j).

Smith-Waterman distance - 2. Same recurrence, with G = 1, d(c,c) = -2, d(c,d) = +1 (so a match scores +2 and a mismatch -1). [Figure: the partially filled table for the running example, columns C O H E N, with border entries clipped to 0.]

Smith-Waterman distance - 3. [Figure: the completed table for the example with G = 1, d(c,c) = -2, d(c,d) = +1; the score is the largest entry in the table.]
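
A sketch of the Smith-Waterman recurrence with the scores from the example slides (G = 1, match +2, mismatch -1, i.e. -d(c,c) = 2 and -d(c,d) = -1); the strings passed in are placeholders, not the slide's exact example.

```python
def smith_waterman(s, t, match=2, mismatch=-1, G=1):
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(0,                      # start over
                          D[i - 1][j - 1] + sub,  # substitute/copy
                          D[i - 1][j] - G,        # insert
                          D[i][j - 1] - G)        # delete
            best = max(best, D[i][j])             # score = max over the whole table
    return best

print(smith_waterman("MCCOHN", "COHEN"))
```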

Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)

Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996). Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions. Split large text fields by separators like commas, etc., and found the minimal cost over all possible pairings of the subfields (since S-W assigns a large cost to large transpositions). Results were competitive with plausible competitors.

Results: S-W from Monge & Elkan

Affine gap distances. Smith-Waterman fails on some pairs that seem quite similar: “William W. Cohen” vs. “William W. ‘Don’t call me Dubya’ Cohen”. Intuitively, a single long insertion is “cheaper” than a lot of short insertions. (The slide makes the same point visually: “Intuitively, are springlest hulongru poinstertimon extisn’t ‘cheaper’ than a lot of short insertions” is that same sentence with many short insertions scattered through it.)

Affine gap distances - 2 Idea: Current cost of a “gap” of n characters: nG Make this cost: A + (n-1)B, where A is cost of “opening” a gap, and B is cost of “continuing” a gap.

Affine gap distances - 3.
D(i,j) = max of:
  D(i-1,j-1) + d(si,tj)       //substitute/copy
  IS(i-1,j-1) + d(si,tj)
  IT(i-1,j-1) + d(si,tj)
IS(i,j) = max of:
  D(i-1,j) - A
  IS(i-1,j) - B
IT(i,j) = max of:
  D(i,j-1) - A
  IT(i,j-1) - B
IS(i,j) is the best score in which si is aligned with a ‘gap’; IT(i,j) is the best score in which tj is aligned with a ‘gap’.

Affine gap distances - 4. [State diagram: three states D, IS, IT. Staying in D costs -d(si,tj); moving from D into IS or IT (opening a gap) costs -A; staying in IS or IT (continuing a gap) costs -B; moving back into D costs -d(si,tj).]
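
A sketch of the three-matrix recurrence from the previous slide (Gotoh-style affine gaps). The character-similarity function and the gap costs A (open) and B (continue) are assumed example values, not fixed by the lecture; the code follows the slide's recurrence and returns the best score ending in any of the three states.

```python
NEG = float("-inf")

def affine_gap_score(s, t, d=lambda a, b: 2 if a == b else -1, A=2, B=1):
    n, m = len(s), len(t)
    D  = [[NEG] * (m + 1) for _ in range(n + 1)]   # si aligned with tj
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]   # si aligned with a gap
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]   # tj aligned with a gap
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                sub = d(s[i - 1], t[j - 1])
                D[i][j] = max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1]) + sub
            if i > 0:   # open a gap from D (cost A) or continue one in IS (cost B)
                IS[i][j] = max(D[i-1][j] - A, IS[i-1][j] - B)
            if j > 0:
                IT[i][j] = max(D[i][j-1] - A, IT[i][j-1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])

print(affine_gap_score("William W. Cohen",
                       "William W. 'Don't call me Dubya' Cohen"))
```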

Affine gap distances – experiments (from McCallum, Nigam & Ungar, KDD 2000). Goal is to match data like this: [example records shown on the slide].

Now the NIPS paper. Similar representation for relation instances: x1 × … × xn where each xi is a set … but instead of informative dependency-path elements, the x’s just represent adjacent tokens. To compensate: use a richer kernel.

Motivation. Rules for protein-protein interaction like “interaction of (gap0-3) <Protein1> with (gap0-3) <Protein2>” were used by a prior rule-based system. Add the ability to match features of words (e.g., POS tags). Add constraints: match words before-and-between, between only, or between-and-after the two proteins.

Subsequence kernel: the feature space is the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity. Relaxation of the old kernel: we don’t have to match everywhere, just at selected locations, and for every position spanned by our matching pattern we pay a penalty of λ. To pick a “feature” inside (x1 … xn): pick a subset of locations i = i1,…,ik, and then pick a feature value in each location. In the preprocessed vector x′, weight every feature for i by λ^length(i) = λ^(ik - i1 + 1).

Subsequence kernel w/cost c(x,y) Only counts u that align with last char of s and t

Dynamic programming computation.
Kn(s,t): number of matches between s and t of size n.
K′n(s,t): number of matches between s and t of size n, scored as if the final position matched — i.e., the recursion “remembers” that “there is a match to the right”.
K″n(s,t): number of matches between s and t that match the last char of s to something — i.e., the recursion “remembers” that “the final char of s matches”.
[The recursion cases correspond to: skipping position i in s; including position i; the final position of s not matched; the final position of s matched.]
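
A memoized character-level sketch in the spirit of this recursion (essentially the Lodhi et al. gap-weighted string subsequence kernel); the λ value and the test strings are placeholders. Bunescu & Mooney's version additionally works over sets of word features and the domain-specific fore/between/aft combinations, which this sketch omits.

```python
import functools

def subsequence_kernel(s, t, n, lam=0.5):
    """K_n(s, t): each common subsequence u of length n contributes
    lam**(span of u in s) * lam**(span of u in t)."""

    @functools.lru_cache(maxsize=None)
    def Kp(i, p, q):
        # K'_i over prefixes s[:p], t[:q]: scored as if there is a match to the right
        if i == 0:
            return 1.0
        if min(p, q) < i:
            return 0.0
        x = s[p - 1]
        total = lam * Kp(i, p - 1, q)                  # skip the last char of s[:p]
        for j in range(1, q + 1):                      # or match it to some t_j == x
            if t[j - 1] == x:
                total += Kp(i - 1, p - 1, j - 1) * lam ** (q - j + 2)
        return total

    @functools.lru_cache(maxsize=None)
    def K(p, q):
        # K_n over prefixes s[:p], t[:q]
        if min(p, q) < n:
            return 0.0
        x = s[p - 1]
        total = K(p - 1, q)                            # last char of s not matched
        for j in range(1, q + 1):                      # last char of s matched to t_j
            if t[j - 1] == x:
                total += Kp(n - 1, p - 1, j - 1) * lam ** 2
        return total

    return K(len(s), len(t))

# identical strings of length 3: one full match, weighted lam**3 * lam**3
print(subsequence_kernel("cat", "cat", 3, lam=0.5))    # 0.5**6 = 0.015625
```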

Additional details. Special domain-specific tricks for combining the subsequences that match in the fore, between, and aft sections of a relation-instance pair. Subsequences are of length less than 4 (is DP needed for this now?). Count fore-between, between-aft, and between subsequences separately.

Results Protein-protein interaction ERK-A: no fore/aft sequences

Results