Kernels for Relation Extraction
William Cohen 10-13
Outline for Today
Review: SVMs & kernels
Perceptrons vs SVMs
The voted perceptron
For each instance xi: compute ŷi = sign(vk · xi)
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
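A minimal sketch of this update loop in Python. The survival-count voting scheme follows Freund & Schapire's voted perceptron; the function names and interface are my own, not from the slides:

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=1):
    """Mistake-driven training, as on the slide: on an error, v <- v + yi*xi.
    Keeps every intermediate v with its survival count, for voted prediction."""
    v = np.zeros(X.shape[1])
    vs, counts, c = [], [], 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(v, xi) <= 0:      # mistake: sign(v . xi) != yi
                vs.append(v.copy()); counts.append(c)
                v = v + yi * xi              # the update v_{k+1} = v_k + yi xi
                c = 1
            else:
                c += 1
    vs.append(v.copy()); counts.append(c)
    return vs, counts

def voted_predict(vs, counts, x):
    """Each surviving v votes on sign(v . x), weighted by how long it survived."""
    return np.sign(sum(c * np.sign(np.dot(v, x)) for v, c in zip(vs, counts)))
```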
(3a) The guess v2 after two positive examples: v2 = v1 + x2
(3b) The guess v2 after one positive and one negative example: v2 = v1 - x2
[diagrams: u and the successive guesses v1, v2; each mistake moves v so that its projection onto u grows by at least γ, so after two mistakes u · v ≥ 2γ]
Perceptrons vs SVMs
For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 such that, for all i, yi (u · xi) > γ.
Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: γ, (x1,y1), (x2,y2), (x3,y3), …
Find: some w where ||w|| = 1 and, for all i, yi (w · xi) > γ
Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: (x1,y1), (x2,y2), (x3,y3), …
Find: some w and γ such that ||w|| = 1 and, for all i, yi (w · xi) > γ, for the best possible w and γ
Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: (x1,y1), (x2,y2), (x3,y3), …
Maximize γ under the constraints ||w|| = 1 and, for all i, yi (w · xi) > γ
Equivalently: minimize ||w||² under the constraints: for all i, yi (w · xi) > 1
(The units are arbitrary: rescaling w rescales γ, so we can pin the margin to 1 and minimize ||w|| instead.)
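The rescaling step, written out (this is the standard one-line equivalence behind the slide's remark):

```latex
% If ||w|| = 1 and y_i (w . x_i) >= gamma for all i, rescale: w' = w / gamma.
% Then y_i (w' . x_i) >= 1 and ||w'|| = 1/gamma, so maximizing gamma is
% exactly minimizing ||w'||^2 under margin-1 constraints:
\max_{\|w\|=1,\;\gamma}\ \gamma \ \ \text{s.t.}\ \ y_i\,(w\cdot x_i)\ge\gamma\ \ \forall i
\quad\Longleftrightarrow\quad
\min_{w'}\ \|w'\|^{2}\ \ \text{s.t.}\ \ y_i\,(w'\cdot x_i)\ge 1\ \ \forall i .
```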
Perceptrons vs SVMs
Basic optimization problem:
Given: (x1,y1), (x2,y2), (x3,y3), …
Minimize ||w||² under the constraints: for all i, yi (w · xi) > 1
Variant: ranking constraints (e.g., to model click-through feedback): for all i and all j ≠ l, w · φ(xi, yi,l) > w · φ(xi, yi,j) + 1, where yi,l is preferred to yi,j
But now you have exponentially many constraints … but Thorsten is a clever man.
Review of Kernels
The kernel perceptron
For each instance xi: compute ŷi = sign(vk · xi)
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the kernel trick
The kernel perceptron
For each instance xi: compute ŷi = sign(vk · xi)
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the “kernel trick”
Other kernel methods (SVMs, Gaussian processes) aren’t constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
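Since every update adds ±xj to v, the predictor can be kept entirely in dual form. A minimal sketch (the kernel interface and the RBF example are illustrative assumptions, not from the slides):

```python
import numpy as np

def kernel_perceptron_train(X, y, K, epochs=1):
    """Dual form of the same algorithm: v is never built explicitly. Since every
    update adds +/- xj, we have v . x = sum_j alpha_j * yj * K(xj, x), where
    alpha_j simply counts the mistakes made on example j."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(len(X)))
            if y[i] * score <= 0:
                alpha[i] += 1          # record the mistake instead of updating v
    return alpha

# any PSD kernel can be plugged in, e.g. an RBF kernel:
rbf = lambda a, b, s=1.0: np.exp(-np.dot(a - b, a - b) / (2 * s * s))
```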
Extracting Relationships
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.

23rd July :51 GMT Microsoft was in violation of the GPL (General Public License) on the Hyper-V code it released to open source this week. After Redmond covered itself in glory by opening up the code, it now looks like it may have acted simply to head off any potentially embarrassing legal dispute over violation of the GPL. The rest was theater. As revealed by Stephen Hemminger - a principal engineer with open-source network vendor Vyatta - a network driver in Microsoft's Hyper-V used open-source components licensed under the GPL and statically linked to binary parts. The GPL does not permit the mixing of closed and open-source elements. … Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.

NAME                 TITLE               ORGANIZATION
Stephen Hemminger    principal engineer  Vyatta
Greg Kroah-Hartman   programmer          Novell
Greg Kroah-Hartman   lead                Linux Driver Proj.

What is IE? As a task it is: starting with some text and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.
What is “Information Extraction”
Techniques: NER + segment + classify entity pairs from the same segment

Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.
[highlighted entities: Hemminger, Linux Driver Project, lead, Greg Kroah-Hartman, Novell, programmer, Microsoft]

One-stage process: classify (E1,E2) as unrelated, or as employedBy, employerOf, hasTitle, titleOf, hasPosition, positionInCompany
Two-stage process: classify (E1,E2) as related or not; then classify each related (E1,E2) as one of the relation types above
Bunescu & Mooney’s papers
Kernels vs Structured Output Spaces
Two kinds of structured learning:
HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured. E.g., for a linear-chain CRF, the output is a sequence of labels, i.e. a string in Yn.
New: Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured. EMNLP: the structure is derived from a dependency graph.
Tasks: ACE relations
Dependency graphs for sentences
[dependency graph for: “Protesters seized several pumping stations, holding 127 Shell workers hostage”]
Dependency graphs for sentences
CFG dependency parsers produce dependency trees; context-sensitive formalisms produce dependency DAGs.
Disclaimer: this is a shortest path, not the shortest path
x = x1 × … × xn, where each xi is a set of features for one position on the path; e.g. with set sizes 4, 1, 3, 1, 4 the product x1 × x2 × x3 × x4 × x5 spans 4·1·3·1·4 = 48 features.
K(x1 × … × xn, y1 × … × yn) = |(x1 × … × xn) ∩ (y1 × … × yn)| = ∏i |xi ∩ yi| (and 0 if the two paths have different lengths)
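A sketch of that product-of-intersections computation. Only the counting scheme follows the slide; the example feature sets are made up:

```python
def sp_kernel(x, y):
    """Kernel between two shortest dependency paths, each given as a list of
    per-position feature sets. The count of common features in the cartesian
    products factors into a product of per-position intersection sizes."""
    if len(x) != len(y):            # paths of different length never match
        return 0
    k = 1
    for xi, yi in zip(x, y):
        k *= len(xi & yi)           # common features at this position
    return k

# a 5-position path with set sizes 4,1,3,1,4, as in the slide's 48-feature example
x = [{"protesters", "NNS", "Noun", "PERSON"}, {"<-"},
     {"seized", "VBD", "Verb"}, {"->"},
     {"stations", "NNS", "Noun", "FACILITY"}]
print(sp_kernel(x, x))              # 4*1*3*1*4 = 48
```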
Results
CCG vs CFG: context-sensitive CCG parser vs Collins’ (context-free) parser
S1, S2: one multi-class SVM vs two SVMs (binary first, then multiclass)
K4 is the baseline (two-stage SVM, custom kernel)
Correct entity output is assumed
Some background on … edit distances
String distance metrics: Levenshtein
Edit-distance metrics for pairs of strings s, t
Distance is the shortest sequence of edit commands that transforms s into t.
Simplest set of operations:
Copy a character from s over to t (cost 0)
Delete a character in s (cost 1)
Insert a character in t (cost 1)
Substitute one character for another (cost 1)
This is “Levenshtein distance”.
Levenshtein distance - example
distance(“William Cohen”, “Willliam Cohon”)
Alignment (gap marked “-”):
s: W I L L - I A M   C O H E N
t: W I L L L I A M   C O H O N
One insert (the extra L, cost 1) and one substitution (E→O, cost 1): distance = 2.
Computing Levenshtein distance - 1
D(i,j) = score of the best alignment of s1..si with t1..tj
= min of:
  D(i-1,j-1), if si = tj      // copy
  D(i-1,j-1) + 1, if si ≠ tj  // substitute
  D(i-1,j) + 1                // insert
  D(i,j-1) + 1                // delete
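That recursion, filled bottom-up; a straightforward sketch, with the example pair from the earlier slide:

```python
def levenshtein(s, t):
    """D(i,j) computed bottom-up from the recursion above."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i                          # i inserts
    for j in range(len(t) + 1):
        D[0][j] = j                          # j deletes
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(
                D[i-1][j-1] + (0 if s[i-1] == t[j-1] else 1),  # copy/substitute
                D[i-1][j] + 1,               # insert
                D[i][j-1] + 1,               # delete
            )
    return D[len(s)][len(t)]

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2
```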
Computing Levenshtein distance - 2
D(i,j) = score of the best alignment of s1..si with t1..tj
= min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + 1           // insert
  D(i,j-1) + 1           // delete
(simplified by letting d(c,d) = 0 if c = d, 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.
Computing Levenshtein distance - 3
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + 1           // insert
  D(i,j-1) + 1           // delete
[worked DP table for a string against t = “COHEN”, first row initialized 1 2 3 4 5; the bottom-right cell is D(s,t)]
Computing Levenshtein distance - 4
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + 1           // insert
  D(i,j-1) + 1           // delete
A trace indicates where the min value came from, and can be used to recover the edit operations and/or a best alignment (there may be more than one).
Stopped HERE 10/13
Needleman-Wunsch distance
d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.)
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)  // subst/copy
  D(i-1,j) + G           // insert
  D(i,j-1) + G           // delete
G = “gap cost”
Example pair: “William Cohen” vs “Wukkuan Cigeb”
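A sketch of the same table with an arbitrary d and gap cost G. The adjacent-key cost table here is entirely made up, just to show the kind of d the slide means:

```python
def needleman_wunsch(s, t, d, G=1):
    """Levenshtein-style table, but with an arbitrary character distance d
    and a gap cost G for inserts/deletes."""
    D = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * G
    for j in range(1, len(t) + 1):
        D[0][j] = j * G
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i-1][j-1] + d(s[i-1], t[j-1]),   # subst/copy
                          D[i-1][j] + G,                     # insert
                          D[i][j-1] + G)                     # delete
    return D[len(s)][len(t)]

# a made-up d that charges adjacent-key typos half price:
ADJ = {("i","u"), ("l","k"), ("m","n"), ("o","i"), ("h","g"), ("n","b")}
typo_d = lambda a, b: 0 if a == b else (0.5 if (a,b) in ADJ or (b,a) in ADJ else 1)
print(needleman_wunsch("william cohen", "wukkuan cigeb", typo_d))
```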
Smith-Waterman distance - 1
D(i,j) = max of:
  0                      // start over
  D(i-1,j-1) - d(si,tj)  // subst/copy
  D(i-1,j) - G           // insert
  D(i,j-1) - G           // delete
The distance is the maximum of D(i,j) over all i,j in the table.
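A sketch of the local-alignment version: any cell may "start over" at 0, and the answer is the table's maximum rather than the corner. The example strings are illustrative:

```python
def smith_waterman(s, t, d, G=1):
    """Max-score local alignment per the recursion above."""
    best = 0
    D = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = max(0,                                  # start over
                          D[i-1][j-1] - d(s[i-1], t[j-1]),    # subst/copy
                          D[i-1][j] - G,                      # insert
                          D[i][j-1] - G)                      # delete
            best = max(best, D[i][j])
    return best

# the slide's parameters: G = 1, d(c,c) = -2 (a match *adds* 2), d(c,d) = +1
d = lambda a, b: -2 if a == b else 1
print(smith_waterman("mccohn", "cohen", d))   # score of the best local match
```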
Smith-Waterman distance - 2
Same recursion, with G = 1, d(c,c) = -2 (so a match adds 2), d(c,d) = +1.
[worked table against t = “COHEN”, partially filled]
Smith-Waterman distance - 3
Same recursion and parameters (G = 1, d(c,c) = -2, d(c,d) = +1).
[completed DP table; the best local-alignment score is the largest entry]
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions.
Split large text fields by separators like commas, etc., and found the minimal cost over all possible pairings of the subfields (since S-W assigns a large cost to large transpositions).
Results were competitive with plausible competitors.
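A sketch of that subfield-pairing trick. WEBFIND's exact combination isn't specified here; this is just the "min cost over all pairings" idea, with any string distance (e.g. the levenshtein sketch above) plugged in as dist:

```python
from itertools import permutations

def best_pairing_cost(field1, field2, dist):
    """Split both fields on commas and take the minimum total cost over all
    pairings of subfields, so transposed subfields (e.g. 'Cohen, William' vs
    'William Cohen' split differently) aren't punished the way one big
    transposition is under plain S-W."""
    parts1 = [p.strip() for p in field1.split(",")]
    parts2 = [p.strip() for p in field2.split(",")]
    if len(parts1) > len(parts2):
        parts1, parts2 = parts2, parts1
    # brute force over pairings; fine for the handful of subfields in a record
    return min(sum(dist(a, b) for a, b in zip(parts1, perm))
               for perm in permutations(parts2, len(parts1)))
```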
Results: S-W from Monge & Elkan
Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
  William W. Cohen
  William W. ‘Don’t call me Dubya’ Cohen
Intuitively, a single long insertion is “cheaper” than a lot of short insertions.
(Or, with the short insertions scattered through the sentence itself: “Intuitively, are springlest hulongru poinstertimon extisn’t ‘cheaper’ than a lot of short insertions”.)
Affine gap distances - 2
Idea: the current cost of a “gap” of n characters is nG.
Make this cost A + (n-1)B instead, where A is the cost of “opening” a gap and B is the cost of “continuing” it.
Affine gap distances - 3
D(i,j) = max of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  IS(i-1,j-1) + d(si,tj)
  IT(i-1,j-1) + d(si,tj)
IS(i,j) = max of:
  D(i-1,j) - A
  IS(i-1,j) - B
IT(i,j) = max of:
  D(i,j-1) - A
  IT(i,j-1) - B
IS(i,j): best score in which si is aligned with a “gap”
IT(i,j): best score in which tj is aligned with a “gap”
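A sketch of the three-table recursion (a Gotoh-style DP). The boundary conditions are my assumption; the slide only gives the interior recursion:

```python
NEG = float("-inf")

def affine_gap_score(s, t, d, A=2, B=1):
    """D holds alignments ending in a match/substitution; IS/IT hold alignments
    ending with a gap in s or t. Opening a gap costs A, extending it costs B."""
    n, m = len(s), len(t)
    D  = [[NEG] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        IS[i][0] = -A - (i - 1) * B          # a leading gap in t
    for j in range(1, m + 1):
        IT[0][j] = -A - (j - 1) * B          # a leading gap in s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d(s[i-1], t[j-1])
            D[i][j]  = max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1]) + sub
            IS[i][j] = max(D[i-1][j] - A, IS[i-1][j] - B)   # open vs continue
            IT[i][j] = max(D[i][j-1] - A, IT[i][j-1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])
```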
Affine gap distances - 4
[state diagram: three states D, IS, IT; match/substitution edges into D are labeled -d(si,tj); opening a gap (D into IS or IT) costs -A; the self-loops on IS and IT cost -B]
Affine gap distances – experiments from McCallum, Nigam & Ungar (KDD 2000)
Goal is to match data like this: [example records]
Now the NIPS paper
Similar representation for relation instances: x1 × … × xn, where each xi is a set…
…but instead of informative dependency-path elements, the x’s just represent adjacent tokens.
To compensate: use a richer kernel.
Motivation
Rules for protein-protein interaction look like:
  “interaction of (gap0-3) <Protein1> with (gap0-3) <Protein2>”
Used by a prior rule-based system.
Add the ability to match features of words (e.g., POS tags).
Add constraints: match words before & between, between only, or between & after the two proteins.
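One way to read such a rule is as a regular expression over tokens. The PROT placeholders stand in for a hypothetical NER preprocessing step; the pattern itself is illustrative:

```python
import re

# "interaction of (gap0-3) <P1> with (gap0-3) <P2>", with proteins pre-tagged
# as PROT<n> by an assumed upstream NER pass
pattern = re.compile(
    r"interaction of (?:\S+\s+){0,3}(PROT\d+) with (?:\S+\s+){0,3}(PROT\d+)")

m = pattern.search("the interaction of wild-type PROT1 with phosphorylated PROT2")
print(m.groups() if m else None)   # ('PROT1', 'PROT2')
```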
Subsequence kernel
Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.
Relaxation of the old kernel: we don’t have to match everywhere, just at selected locations.
For every position spanned by our matching pattern, we pay a penalty of λ.
To pick a “feature” inside (x1 … xn)’:
  pick a subset of locations i = i1, …, ik, and then
  pick a feature value in each location.
In the preprocessed vector x’, weight every feature for i by λ^length(i) = λ^(ik - i1 + 1).
Subsequence kernel w/cost c(x,y)
Only counts u that align with last char of s and t
Dynamic programming computation
Kn(s,t): number of matches between s and t of size n
K’n(s,t): number of matches between s and t of size n, scored as if the final position matched; i.e., the recursion “remembers” that there is a match to the right
K’’n(s,t): number of matches between s and t that match the last char of s to something; i.e., the recursion “remembers” that the final char of s matches
The recursion cases split on: skipping position i in s vs. including it, and the final position of s not matched vs. matched.
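A sketch of the gap-weighted subsequence kernel on plain strings (the Lodhi et al. form; Bunescu & Mooney generalize positions to feature sets). The K/K' recursion follows the standard formulation; memoizing K' stands in for the full DP table:

```python
from functools import lru_cache

def subseq_kernel(s, t, n, lam=0.5):
    """Counts common subsequences of length n, each weighted by
    lam ** (total span covered in s and in t)."""

    @lru_cache(maxsize=None)
    def Kp(i, s_len, t_len):
        # K'_i: scored as if a match were still required to the right
        if i == 0:
            return 1.0
        if s_len < i or t_len < i:
            return 0.0
        x = s[s_len - 1]
        return lam * Kp(i, s_len - 1, t_len) + sum(
            Kp(i - 1, s_len - 1, j) * lam ** (t_len - j + 1)
            for j in range(t_len) if t[j] == x)

    def K(s_len, t_len):
        if s_len < n or t_len < n:
            return 0.0
        x = s[s_len - 1]
        return K(s_len - 1, t_len) + lam ** 2 * sum(
            Kp(n - 1, s_len - 1, j) for j in range(t_len) if t[j] == x)

    return K(len(s), len(t))

print(subseq_kernel("cat", "cart", 2))  # weighted count of shared 2-subsequences
```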
Additional details
Special domain-specific tricks combine the subsequences matching in the fore, between, and aft sections of a relation-instance pair:
  count fore-between, between-aft, and between-only subsequences separately.
Subsequences are of length less than 4 (is DP even needed at that length?).
Results
Protein-protein interaction
ERK-A: no fore/aft sequences
Results