Kernels for Relation Extraction
William Cohen 10-13
Outline for Today
Review: SVMs & kernels
Perceptrons vs SVMs
The voted perceptron
For each instance xi: compute ŷi = sign(vk · xi)
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
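A minimal sketch of this update loop in Python. The survival-count voting scheme follows Freund & Schapire's voted perceptron; the function names and interface are my own, not from the slides:

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=1):
    """Mistake-driven training, as on the slide: on an error, v <- v + yi*xi.
    Keeps every intermediate v with its survival count, for voted prediction."""
    v = np.zeros(X.shape[1])
    vs, counts, c = [], [], 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(v, xi) <= 0:      # mistake: sign(v . xi) != yi
                vs.append(v.copy()); counts.append(c)
                v = v + yi * xi              # the update v_{k+1} = v_k + yi xi
                c = 1
            else:
                c += 1
    vs.append(v.copy()); counts.append(c)
    return vs, counts

def voted_predict(vs, counts, x):
    """Each surviving v votes on sign(v . x), weighted by how long it survived."""
    return np.sign(sum(c * np.sign(np.dot(v, x)) for v, c in zip(vs, counts)))
```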
(3a) The guess v2 after two positive examples: v2 = v1 + x2
(3b) The guess v2 after one positive and one negative example: v2 = v1 - x2
[diagrams: u and the successive guesses v1, v2; each mistake moves v so that its projection onto u grows by at least γ, so after two mistakes u · v ≥ 2γ]
Perceptrons vs SVMs
For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 such that, for all i, yi (u · xi) > γ.
Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: γ, (x1,y1), (x2,y2), (x3,y3), …
Find: some w where ||w|| = 1 and, for all i, yi (w · xi) > γ
Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: (x1,y1), (x2,y2), (x3,y3), …
Find: some w and γ such that ||w|| = 1 and, for all i, yi (w · xi) > γ, for the best possible w and γ
Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: (x1,y1), (x2,y2), (x3,y3), …
Maximize γ under the constraints ||w|| = 1 and, for all i, yi (w · xi) > γ
Equivalently: minimize ||w||² under the constraints: for all i, yi (w · xi) > 1
(The units are arbitrary: rescaling w rescales γ, so we can pin the margin to 1 and minimize ||w|| instead.)
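The rescaling step, written out (this is the standard one-line equivalence behind the slide's remark):

```latex
% If ||w|| = 1 and y_i (w . x_i) >= gamma for all i, rescale: w' = w / gamma.
% Then y_i (w' . x_i) >= 1 and ||w'|| = 1/gamma, so maximizing gamma is
% exactly minimizing ||w'||^2 under margin-1 constraints:
\max_{\|w\|=1,\;\gamma}\ \gamma \ \ \text{s.t.}\ \ y_i\,(w\cdot x_i)\ge\gamma\ \ \forall i
\quad\Longleftrightarrow\quad
\min_{w'}\ \|w'\|^{2}\ \ \text{s.t.}\ \ y_i\,(w'\cdot x_i)\ge 1\ \ \forall i .
```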
Perceptrons vs SVMs
Basic optimization problem:
Given: (x1,y1), (x2,y2), (x3,y3), …
Minimize ||w||² under the constraints: for all i, yi (w · xi) > 1
Variant: ranking constraints (e.g., to model click-through feedback): for all i and all j ≠ l, w · φ(xi, yi,l) > w · φ(xi, yi,j) + 1, where yi,l is preferred to yi,j
But now you have exponentially many constraints … but Thorsten is a clever man.
Review of Kernels
The kernel perceptron
For each instance xi: compute ŷi = sign(vk · xi)
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the kernel trick
The kernel perceptron
For each instance xi: compute ŷi = sign(vk · xi)
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the “kernel trick”
Other kernel methods (SVMs, Gaussian processes) aren’t constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
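Since every update adds ±xj to v, the predictor can be kept entirely in dual form. A minimal sketch (the kernel interface and the RBF example are illustrative assumptions, not from the slides):

```python
import numpy as np

def kernel_perceptron_train(X, y, K, epochs=1):
    """Dual form of the same algorithm: v is never built explicitly. Since every
    update adds +/- xj, we have v . x = sum_j alpha_j * yj * K(xj, x), where
    alpha_j simply counts the mistakes made on example j."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(len(X)))
            if y[i] * score <= 0:
                alpha[i] += 1          # record the mistake instead of updating v
    return alpha

# any PSD kernel can be plugged in, e.g. an RBF kernel:
rbf = lambda a, b, s=1.0: np.exp(-np.dot(a - b, a - b) / (2 * s * s))
```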
Extracting Relationships
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.

23rd July :51 GMT Microsoft was in violation of the GPL (General Public License) on the Hyper-V code it released to open source this week. After Redmond covered itself in glory by opening up the code, it now looks like it may have acted simply to head off any potentially embarrassing legal dispute over violation of the GPL. The rest was theater. As revealed by Stephen Hemminger - a principal engineer with open-source network vendor Vyatta - a network driver in Microsoft's Hyper-V used open-source components licensed under the GPL and statically linked to binary parts. The GPL does not permit the mixing of closed and open-source elements. … Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.

NAME                 TITLE               ORGANIZATION
Stephen Hemminger    principal engineer  Vyatta
Greg Kroah-Hartman   programmer          Novell
Greg Kroah-Hartman   lead                Linux Driver Proj.

What is IE? As a task it is: starting with some text and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.
What is “Information Extraction”
Techniques: NER + segment + classify entity pairs from the same segment

Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.
[highlighted entities: Hemminger, Linux Driver Project, lead, Greg Kroah-Hartman, Novell, programmer, Microsoft]

One-stage process: classify (E1,E2) as unrelated, or as employedBy, employerOf, hasTitle, titleOf, hasPosition, positionInCompany
Two-stage process: classify (E1,E2) as related or not; then classify each related (E1,E2) as one of the relation types above
Bunescu & Mooney’s papers
Kernels vs Structured Output Spaces
Two kinds of structured learning:
HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured. E.g., for a linear-chain CRF, the output is a sequence of labels, i.e. a string in Yn.
New: Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured. EMNLP: the structure is derived from a dependency graph.
Tasks: ACE relations
Dependency graphs for sentences
[dependency graph for: “Protesters seized several pumping stations, holding 127 Shell workers hostage”]
Dependency graphs for sentences
CFG dependency parsers produce dependency trees; context-sensitive formalisms produce dependency DAGs.
Disclaimer: this is a shortest path, not the shortest path
x = x1 × … × xn, where each xi is a set of features for one position on the path; e.g. with set sizes 4, 1, 3, 1, 4 the product x1 × x2 × x3 × x4 × x5 spans 4·1·3·1·4 = 48 features.
K(x1 × … × xn, y1 × … × yn) = |(x1 × … × xn) ∩ (y1 × … × yn)| = ∏i |xi ∩ yi| (and 0 if the two paths have different lengths)
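A sketch of that product-of-intersections computation. Only the counting scheme follows the slide; the example feature sets are made up:

```python
def sp_kernel(x, y):
    """Kernel between two shortest dependency paths, each given as a list of
    per-position feature sets. The count of common features in the cartesian
    products factors into a product of per-position intersection sizes."""
    if len(x) != len(y):            # paths of different length never match
        return 0
    k = 1
    for xi, yi in zip(x, y):
        k *= len(xi & yi)           # common features at this position
    return k

# a 5-position path with set sizes 4,1,3,1,4, as in the slide's 48-feature example
x = [{"protesters", "NNS", "Noun", "PERSON"}, {"<-"},
     {"seized", "VBD", "Verb"}, {"->"},
     {"stations", "NNS", "Noun", "FACILITY"}]
print(sp_kernel(x, x))              # 4*1*3*1*4 = 48
```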
Results
CCG vs CFG: context-sensitive CCG parser vs Collins’ (context-free) parser
S1, S2: one multi-class SVM vs two SVMs (binary first, then multiclass)
K4 is the baseline (two-stage SVM, custom kernel)
Correct entity output is assumed
Some background on … edit distances
String distance metrics: Levenshtein
Edit-distance metrics for pairs of strings s, t
Distance is the shortest sequence of edit commands that transforms s into t.
Simplest set of operations:
Copy a character from s over to t (cost 0)
Delete a character in s (cost 1)
Insert a character in t (cost 1)
Substitute one character for another (cost 1)
This is “Levenshtein distance”.
Levenshtein distance - example
distance(“William Cohen”, “Willliam Cohon”)
Alignment (gap marked “-”):
s: W I L L - I A M   C O H E N
t: W I L L L I A M   C O H O N
One insert (the extra L, cost 1) and one substitution (E→O, cost 1): distance = 2.
Computing Levenshtein distance - 1
D(i,j) = score of the best alignment of s1..si with t1..tj
= min of:
  D(i-1,j-1), if si = tj      // copy
  D(i-1,j-1) + 1, if si ≠ tj  // substitute
  D(i-1,j) + 1                // insert
  D(i,j-1) + 1                // delete
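That recursion, filled bottom-up; a straightforward sketch, with the example pair from the earlier slide:

```python
def levenshtein(s, t):
    """D(i,j) computed bottom-up from the recursion above."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i                          # i inserts
    for j in range(len(t) + 1):
        D[0][j] = j                          # j deletes
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(
                D[i-1][j-1] + (0 if s[i-1] == t[j-1] else 1),  # copy/substitute
                D[i-1][j] + 1,               # insert
                D[i][j-1] + 1,               # delete
            )
    return D[len(s)][len(t)]

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2
```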
Computing Levenshtein distance - 2
D(i,j) = score of the best alignment of s1..si with t1..tj
= min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + 1           // insert
  D(i,j-1) + 1           // delete
(simplified by letting d(c,d) = 0 if c = d, 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.
Computing Levenshtein distance - 3
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + 1           // insert
  D(i,j-1) + 1           // delete
[worked DP table for a string against t = “COHEN”, first row initialized 1 2 3 4 5; the bottom-right cell is D(s,t)]
Computing Levenshtein distance - 4
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + 1           // insert
  D(i,j-1) + 1           // delete
A trace indicates where the min value came from, and can be used to recover the edit operations and/or a best alignment (there may be more than one).
Stopped HERE 10/13
Needleman-Wunsch distance
d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.)
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + G           // insert
  D(i,j-1) + G           // delete
G = “gap cost”
Example pair: “William Cohen” vs “Wukkuan Cigeb”
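A sketch of the same table with an arbitrary d and gap cost G. The adjacent-key cost table here is entirely made up, just to show the kind of d the slide means:

```python
def needleman_wunsch(s, t, d, G=1):
    """Levenshtein-style table, but with an arbitrary character distance d
    and a gap cost G for inserts/deletes."""
    D = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * G
    for j in range(1, len(t) + 1):
        D[0][j] = j * G
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i-1][j-1] + d(s[i-1], t[j-1]),   # subst/copy
                          D[i-1][j] + G,                     # insert
                          D[i][j-1] + G)                     # delete
    return D[len(s)][len(t)]

# a made-up d that charges adjacent-key typos half price:
ADJ = {("i","u"), ("l","k"), ("m","n"), ("o","i"), ("h","g"), ("n","b")}
typo_d = lambda a, b: 0 if a == b else (0.5 if (a,b) in ADJ or (b,a) in ADJ else 1)
print(needleman_wunsch("william cohen", "wukkuan cigeb", typo_d))
```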
Smith-Waterman distance - 1
D(i,j) = max of:
  0                      // start over
  D(i-1,j-1) - d(si,tj)  // subst/copy
  D(i-1,j) - G           // insert
  D(i,j-1) - G           // delete
The distance is the maximum of D(i,j) over all i,j in the table.
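A sketch of the local-alignment version: any cell may "start over" at 0, and the answer is the table's maximum rather than the corner. The example strings are illustrative:

```python
def smith_waterman(s, t, d, G=1):
    """Max-score local alignment per the recursion above."""
    best = 0
    D = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = max(0,                                  # start over
                          D[i-1][j-1] - d(s[i-1], t[j-1]),    # subst/copy
                          D[i-1][j] - G,                      # insert
                          D[i][j-1] - G)                      # delete
            best = max(best, D[i][j])
    return best

# the slide's parameters: G = 1, d(c,c) = -2 (a match *adds* 2), d(c,d) = +1
d = lambda a, b: -2 if a == b else 1
print(smith_waterman("mccohn", "cohen", d))   # score of the best local match
```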
Smith-Waterman distance - 2
Same recursion, with G = 1, d(c,c) = -2 (so a match adds 2), d(c,d) = +1.
[worked table against t = “COHEN”, partially filled]
Smith-Waterman distance - 3
Same recursion and parameters (G = 1, d(c,c) = -2, d(c,d) = +1).
[completed DP table; the best local-alignment score is the largest entry]
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions.
Split large text fields by separators like commas, etc., and found the minimal cost over all possible pairings of the subfields (since S-W assigns a large cost to large transpositions).
Results were competitive with plausible competitors.
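A sketch of that subfield-pairing trick. WEBFIND's exact combination isn't specified here; this is just the "min cost over all pairings" idea, with any string distance (e.g. the levenshtein sketch above) plugged in as dist:

```python
from itertools import permutations

def best_pairing_cost(field1, field2, dist):
    """Split both fields on commas and take the minimum total cost over all
    pairings of subfields, so transposed subfields (e.g. 'Cohen, William' vs
    'William Cohen' split differently) aren't punished the way one big
    transposition is under plain S-W."""
    parts1 = [p.strip() for p in field1.split(",")]
    parts2 = [p.strip() for p in field2.split(",")]
    if len(parts1) > len(parts2):
        parts1, parts2 = parts2, parts1
    # brute force over pairings; fine for the handful of subfields in a record
    return min(sum(dist(a, b) for a, b in zip(parts1, perm))
               for perm in permutations(parts2, len(parts1)))
```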
Results: S-W from Monge & Elkan
Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
  William W. Cohen
  William W. ‘Don’t call me Dubya’ Cohen
Intuitively, a single long insertion is “cheaper” than a lot of short insertions.
(Or, with the short insertions scattered through the sentence itself: “Intuitively, are springlest hulongru poinstertimon extisn’t ‘cheaper’ than a lot of short insertions”.)
Affine gap distances - 2
Idea: the current cost of a “gap” of n characters is nG.
Make this cost A + (n-1)B instead, where A is the cost of “opening” a gap and B is the cost of “continuing” it.
Affine gap distances - 3
D(i,j) = max of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  IS(i-1,j-1) + d(si,tj)
  IT(i-1,j-1) + d(si,tj)
IS(i,j) = max of:
  D(i-1,j) - A
  IS(i-1,j) - B
IT(i,j) = max of:
  D(i,j-1) - A
  IT(i,j-1) - B
IS(i,j): best score in which si is aligned with a “gap”
IT(i,j): best score in which tj is aligned with a “gap”
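A sketch of the three-table recursion (a Gotoh-style DP). The boundary conditions are my assumption; the slide only gives the interior recursion:

```python
NEG = float("-inf")

def affine_gap_score(s, t, d, A=2, B=1):
    """D holds alignments ending in a match/substitution; IS/IT hold alignments
    ending with a gap in s or t. Opening a gap costs A, extending it costs B."""
    n, m = len(s), len(t)
    D  = [[NEG] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        IS[i][0] = -A - (i - 1) * B          # a leading gap in t
    for j in range(1, m + 1):
        IT[0][j] = -A - (j - 1) * B          # a leading gap in s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d(s[i-1], t[j-1])
            D[i][j]  = max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1]) + sub
            IS[i][j] = max(D[i-1][j] - A, IS[i-1][j] - B)   # open vs continue
            IT[i][j] = max(D[i][j-1] - A, IT[i][j-1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])
```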
Affine gap distances - 4
[state diagram: three states D, IS, IT; match/substitution edges into D are labeled -d(si,tj); opening a gap (D into IS or IT) costs -A; the self-loops on IS and IT cost -B]
Affine gap distances – experiments from McCallum, Nigam & Ungar (KDD 2000)
Goal is to match data like this: [example records]
Now the NIPS paper
Similar representation for relation instances: x1 × … × xn, where each xi is a set…
…but instead of informative dependency-path elements, the x’s just represent adjacent tokens.
To compensate: use a richer kernel.
Motivation
Rules for protein-protein interaction look like:
  “interaction of (gap0-3) <Protein1> with (gap0-3) <Protein2>”
Used by a prior rule-based system.
Add the ability to match features of words (e.g., POS tags).
Add constraints: match words before & between, between only, or between & after the two proteins.
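One way to read such a rule is as a regular expression over tokens. The PROT placeholders stand in for a hypothetical NER preprocessing step; the pattern itself is illustrative:

```python
import re

# "interaction of (gap0-3) <P1> with (gap0-3) <P2>", with proteins pre-tagged
# as PROT<n> by an assumed upstream NER pass
pattern = re.compile(
    r"interaction of (?:\S+\s+){0,3}(PROT\d+) with (?:\S+\s+){0,3}(PROT\d+)")

m = pattern.search("the interaction of wild-type PROT1 with phosphorylated PROT2")
print(m.groups() if m else None)   # ('PROT1', 'PROT2')
```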
Subsequence kernel
Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.
Relaxation of the old kernel: we don’t have to match everywhere, just at selected locations.
For every position spanned by our matching pattern, we pay a penalty of λ.
To pick a “feature” inside (x1 … xn)’:
  pick a subset of locations i = i1, …, ik, and then
  pick a feature value in each location.
In the preprocessed vector x’, weight every feature for i by λ^length(i) = λ^(ik - i1 + 1).
Subsequence kernel w/cost c(x,y)
Only counts u that align with last char of s and t
Dynamic programming computation
Kn(s,t): number of matches between s and t of size n
K’n(s,t): number of matches between s and t of size n, scored as if the final position matched; i.e., the recursion “remembers” that there is a match to the right
K’’n(s,t): number of matches between s and t that match the last char of s to something; i.e., the recursion “remembers” that the final char of s matches
The recursion cases split on: skipping position i in s vs. including it, and the final position of s not matched vs. matched.
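A sketch of the gap-weighted subsequence kernel on plain strings (the Lodhi et al. form; Bunescu & Mooney generalize positions to feature sets). The K/K' recursion follows the standard formulation; memoizing K' stands in for the full DP table:

```python
from functools import lru_cache

def subseq_kernel(s, t, n, lam=0.5):
    """Counts common subsequences of length n, each weighted by
    lam ** (total span covered in s and in t)."""

    @lru_cache(maxsize=None)
    def Kp(i, s_len, t_len):
        # K'_i: scored as if a match were still required to the right
        if i == 0:
            return 1.0
        if s_len < i or t_len < i:
            return 0.0
        x = s[s_len - 1]
        return lam * Kp(i, s_len - 1, t_len) + sum(
            Kp(i - 1, s_len - 1, j) * lam ** (t_len - j + 1)
            for j in range(t_len) if t[j] == x)

    def K(s_len, t_len):
        if s_len < n or t_len < n:
            return 0.0
        x = s[s_len - 1]
        return K(s_len - 1, t_len) + lam ** 2 * sum(
            Kp(n - 1, s_len - 1, j) for j in range(t_len) if t[j] == x)

    return K(len(s), len(t))

print(subseq_kernel("cat", "cart", 2))  # weighted count of shared 2-subsequences
```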
Additional details
Special domain-specific tricks combine the subsequences matching in the fore, between, and aft sections of a relation-instance pair:
  count fore-between, between-aft, and between-only subsequences separately.
Subsequences are of length less than 4 (is DP even needed at that length?).
Results
Protein-protein interaction
ERK-A: no fore/aft sequences
Results