Learning Analogies and Semantic Relations Nov William Cohen
Announcements
Upcoming assignments:
–Wiki pages for October should be revised
–Wiki pages for November due tomorrow 11/30
–Projects due Fri 12/10
Project presentations next week:
–Monday 12/6 and Wed 12/8
–20min including time for Q/A
–30min for the group project
–(Order is reverse of mid-term project reports)
[Machine Learning, 2005]
Motivation Information extraction is about understanding entity names in text… … and also relations between entities. How do you determine if you “understand” an arbitrary relation? –For fixed relations R: labeled data (ACE) –For arbitrary relations: … ?
Evaluation
How do you measure the similarity of relation instances?
1. Create a feature vector r_{x:y} for each instance x:y (e.g. mason:stone, soldier:gun)
2. Use cosine distance.
Creating an instance vector for x:y
Generate a bunch of queries.
–“X of the Y” (“stone of the mason”)
–“X with the Y” (“soldier with the gun”)
–…
For each query q_j(X,Y), record the number of hits in a search engine as r_{x:y,j}
–Actually record log(#hits+1)
–Actually sometimes replace X with stem(X)*
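A minimal sketch of the instance vector and the cosine comparison, assuming the hit counts have already been fetched. The counts below are made-up placeholders, not real search-engine results, and the function names are illustrative only.

```python
import math

def instance_vector(hit_counts):
    """Turn raw hit counts (one per query pattern) into log(#hits + 1) features."""
    return [math.log(h + 1) for h in hit_counts]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical hit counts for three query patterns (assumed values, for illustration).
r_mason_stone = instance_vector([120, 3, 45])
r_carpenter_wood = instance_vector([200, 5, 60])
r_soldier_gun = instance_vector([2, 900, 10])

print(cosine(r_mason_stone, r_carpenter_wood))
print(cosine(r_mason_stone, r_soldier_gun))
```

With these toy counts the first pair of instances scores higher than the second, because their hits are spread across the patterns in similar proportions.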
The queries used Similar to Hearst ’92 & followups
Some results Ranking 369 possible x:y pairs as possible answers
How do you measure the similarity of relation instances? 1.Create a feature vector r x:y for each instance x:y 2.Use cosine distance to rank (a),…(d) 3.Test-taking strategy: -Define margin=(bestScore-secondBest) -If margin 0 then skip -If margin<θ and θ<0 then guess the top 2.
Results
Followup work
Given x:y pairs, replace vectors with rows in M’:
1. Look up synonyms x’, y’ of x and y and construct “near analogies” x’:y, x:y’. Drop any that don’t occur frequently.
 - e.g. “mason:stone” → “mason:rock”
2. Search for phrase “x Q y” or “y Q x”, using near analogies as well as original pair x:y, and any sequence of up to three words Q.
3. For each phrase create patterns by introducing wildcards.
4. Build a pair-pattern frequency matrix M.
5. Apply SVD to M to get best 300 dimensions M’.
Define sim1(x:y, u:v) = cosine distance in M’.
Compute similarity of x:y and u:v as average of sim1(p1,p2) for all pairs p1,p2 where (a) p1 is x:y or an alternate; (b) p2 is u:v or an alternate; and (c) sim1(p1,p2)>=sim1(x:y,u:v)
[Turney, CL 2006]
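The final averaging step can be sketched as follows, assuming sim1 is already available as a function. Here it is a lookup into a hand-made table of invented scores rather than a cosine over SVD-reduced rows, and all names are hypothetical.

```python
from itertools import product

def relational_similarity(pair_a, alts_a, pair_b, alts_b, sim1):
    """Average sim1 over original/alternate pairs that score at least
    as high as the original pairs do (conditions (a)-(c) on the slide)."""
    baseline = sim1(pair_a, pair_b)
    candidates = product([pair_a] + alts_a, [pair_b] + alts_b)
    kept = [s for s in (sim1(p1, p2) for p1, p2 in candidates) if s >= baseline]
    # kept is never empty: the original pairs always satisfy s >= baseline.
    return sum(kept) / len(kept)

# Toy sim1 table (assumed values, not results from the paper).
TOY_SIM1 = {
    ("mason:stone", "carpenter:wood"): 0.6,
    ("mason:stone", "carpenter:timber"): 0.4,
    ("mason:rock", "carpenter:wood"): 0.8,
    ("mason:rock", "carpenter:timber"): 0.7,
}

def toy_sim1(p1, p2):
    return TOY_SIM1[(p1, p2)]

score = relational_similarity(
    "mason:stone", ["mason:rock"], "carpenter:wood", ["carpenter:timber"], toy_sim1
)
print(score)
```

Condition (c) means alternates can only raise the score: any near analogy that looks less similar than the original pairs is dropped from the average.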
Results for LRA 56.5 On 50B word WMTS corpus… 40.3 VSM-WMTS
Additional application: relation classification
Relation classification
Ablation experiments - 1
Ablation experiments - 2 What is the effect of using many automatically-generated patterns vs only 64 manually-generated ones? (Most of manual patterns are found automatically). Feature selection in pattern space instead of SVD
Lessons and questions How are relations and surface patterns correlated? –One-many? (several class-subclass patterns) –Many-one? (some patterns are ambiguous) –Many-many? (and is it 10-10, , ?) Is it surprising that information about relation similarity is spread out across –So much text? –So many surface patterns?
Followup 2 … a pure corpus-based approach
Given M word pairs X,Y, construct feature vectors f_XY like this:
–Find phrases: left? X middle{0,3} Y right? (e.g., “the mason cut the stone with”) and stem
–In each phrase of n words, replace the words other than X and Y with wildcards in every possible combination, creating 2^(n-2) patterns (e.g., “* mason cut the stone with”, “the mason * the stone with”, … “* mason * * stone *”)
–Retain the 20M examples associated with the most X,Y pairs
–Weight a pattern that appears i times for X,Y as log(i+1).
–Normalize vectors to unit length
Use supervised learning on this representation [Turney, COLING 2008]
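A small sketch of the wildcard step and the weighting step, assuming phrases are already tokenized and stemmed. The function names are illustrative, and the counts passed to the weighting function are invented.

```python
import math
from itertools import product

def wildcard_patterns(words, x, y):
    """Generate every pattern obtained by replacing any subset of the
    words other than x and y with '*': 2^(n-2) patterns for n words."""
    free = [i for i, w in enumerate(words) if w not in (x, y)]
    patterns = []
    for mask in product([False, True], repeat=len(free)):
        out = list(words)
        for i, wild in zip(free, mask):
            if wild:
                out[i] = "*"
        patterns.append(" ".join(out))
    return patterns

def unit_log_vector(counts):
    """Weight each pattern count i as log(i + 1), then scale to unit length."""
    weights = [math.log(c + 1) for c in counts]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

patterns = wildcard_patterns("the mason cut the stone with".split(), "mason", "stone")
print(len(patterns))
print(unit_log_vector([3, 0, 1]))  # hypothetical pattern counts for one X,Y pair
```

The six-word example phrase has four replaceable words, so it yields 2^4 = 16 patterns, including the unmodified phrase and the fully wildcarded “* mason * * stone *”.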
Followup 2 … a pure corpus-based approach Given M word pairs X,Y, construct feature vectors f XY Use supervised learning for synonym-or-not [Turney, COLING 2008] Use 10-CV on 80 questions = 320 word pairs Accuracy 76.2% Rank = 9/15 compared to prior approaches (best, 97.5; avg human, 64.5)
Followup 2 … a pure corpus-based approach Given M word pairs X,Y, construct feature vectors f XY Use supervised learning for synonym-vs-antonym [Turney, COLING 2008] Use 10-CV on 136 sample questions Accuracy 75% First published results
Followup 2 … a pure corpus-based approach Given M word pairs X,Y, construct feature vectors f XY Use supervised learning for similar/associated/both [Turney, COLING 2008] Use 10-CV on 144 pairs labeled in psychological experiments Accuracy 77.1% First published results
Followup 2 … a pure corpus-based approach Given M word pairs X,Y, construct feature vectors f XY Use supervised learning for analogies [Turney, COLING 2008] From another problem Repeat 10x with a different “negative” example and average scores for test cases, then pick best answer Accuracy: 52.1% Rank: 3/12 prior papers (best 56.1%; avg student 57%)
Summary
Background for Wed: pair HMMs and generative models of alignment
Alignments and expectations Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998
HMM Example
States: 1, 2. Transitions: Pr(1->1), Pr(1->2), Pr(2->1), Pr(2->2)
Pr(1->x): d 0.3, h 0.5, b 0.2
Pr(2->x): a 0.3, e 0.5, o 0.2
Sample output: x^T = heehahaha, s^T =
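A sketch of sampling from this two-state HMM. The emission tables are the ones on the slide; the transition probabilities and the start state are not given there, so the values below are assumptions chosen only to make the example run.

```python
import random

# Emission tables from the slide.
EMIT = {1: {"d": 0.3, "h": 0.5, "b": 0.2}, 2: {"a": 0.3, "e": 0.5, "o": 0.2}}
# Transition probabilities are NOT on the slide: assumed values.
TRANS = {1: {1: 0.2, 2: 0.8}, 2: {1: 0.4, 2: 0.6}}

def sample_hmm(T, start=1, seed=0):
    """Sample T symbols; returns (output string, list of visited states)."""
    rng = random.Random(seed)
    state, xs, ss = start, [], []
    for _ in range(T):
        symbols, probs = zip(*EMIT[state].items())
        xs.append(rng.choices(symbols, weights=probs)[0])
        ss.append(state)
        nxt, tprobs = zip(*TRANS[state].items())
        state = rng.choices(nxt, weights=tprobs)[0]
    return "".join(xs), ss

x, s = sample_hmm(9)
print(x, s)
```

State 1 only emits consonants and state 2 only vowels, which is why the slide's sample heehahaha alternates the way it does.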
HMM Inference
[Trellis figure: columns t=1, t=2, ..., t=T over observations x_1, x_2, x_3, ..., x_T; rows l=1, l=2, ..., l=K]
Key point: Pr(s_i = l) depends only on Pr(l’->l) and s_{i-1}, so you can propagate probabilities forward...
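The forward propagation can be sketched like this, reusing the emission tables of the two-state example. The transition probabilities and the initial distribution are assumptions, since the slides do not give them.

```python
EMIT = {1: {"d": 0.3, "h": 0.5, "b": 0.2}, 2: {"a": 0.3, "e": 0.5, "o": 0.2}}
TRANS = {1: {1: 0.2, 2: 0.8}, 2: {1: 0.4, 2: 0.6}}  # assumed values
INIT = {1: 1.0, 2: 0.0}                              # assumed: always start in state 1

def forward(x, emit, trans, init):
    """Return Pr(x) for a non-empty string x by filling the trellis one column at a time.
    alpha[l] holds Pr(x_1..x_t, s_t = l) for the current column t."""
    states = list(init)
    alpha = {l: init[l] * emit[l].get(x[0], 0.0) for l in states}
    for ch in x[1:]:
        alpha = {
            l: emit[l].get(ch, 0.0) * sum(alpha[lp] * trans[lp][l] for lp in states)
            for l in states
        }
    return sum(alpha.values())

print(forward("heehahaha", EMIT, TRANS, INIT))
```

Each column needs only the previous column, so the cost is O(T K^2) for K states rather than a sum over all K^T state sequences.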
Pair HMM Notation Andrew will use “null”
Pair HMM Example 1 ePr(e)
Pair HMM Example 1 ePr(e) Sample run: z T =,,, Strings x,y produced by z T : x=heehee, y=teehe Notice that x,y is also produced by z 4 +, and many other edit strings
Distances based on pair HMMs
Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
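A sketch of that dynamic program for a single-state pair HMM, in the spirit of the memoryless edit model of Ristad and Yianilos. The empty string stands in for the null symbol, and the edit-operation probabilities are toy values, not taken from the slides.

```python
def pair_forward(x, y, ops, p_end):
    """Return Pr(x, y): the total probability of all edit strings that produce (x, y).
    ops maps (a, b) to a probability, with "" standing for the null symbol."""
    n, m = len(x), len(y)
    # alpha[i][j] = probability of emitting x[:i] and y[:j]
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):          # top-down
        for j in range(m + 1):      # left-to-right
            if i > 0:               # emit (x_i, null)
                alpha[i][j] += alpha[i - 1][j] * ops.get((x[i - 1], ""), 0.0)
            if j > 0:               # emit (null, y_j)
                alpha[i][j] += alpha[i][j - 1] * ops.get(("", y[j - 1]), 0.0)
            if i > 0 and j > 0:     # emit (x_i, y_j)
                alpha[i][j] += alpha[i - 1][j - 1] * ops.get((x[i - 1], y[j - 1]), 0.0)
    return alpha[n][m] * p_end

# Toy model over the one-letter alphabet {a}; the probabilities sum to 1 with p_end.
TOY_OPS = {("a", "a"): 0.4, ("a", ""): 0.2, ("", "a"): 0.2}
TOY_END = 0.2

print(pair_forward("a", "a", TOY_OPS, TOY_END))
```

Three edit strings produce the pair ("a", "a") in the toy model (one substitution, or a deletion and an insertion in either order), and the matrix sums over all of them without enumerating them.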
Pair HMM Inference
[Trellis figure: columns t=1, t=2, ..., t=T; rows v=1, v=2, ..., v=K]
Pair HMM Inference
[Trellis figure: columns t=1, t=2, ..., t=T; rows v=1, v=2, ..., v=K; cells annotated with emission counts i=1, i=2, i=3]
One difference: after i emissions of pair HMM, we do not know the column position
Pair HMM Inference: Forward-Backward
[Trellis figure: columns t=1, t=2, ..., t=T; rows v=1, v=2, ..., v=K]
Multiple states
[Figure: states SUB, IX, …, IY; emission tables (e, Pr(e)) shown for SUB and IX]
An extension: multiple states
[Figure: stacked trellises over t=1, t=2, ..., t=T and v=1, v=2, ..., v=K, one per state l=1, l=2; states labeled SUB, IX]
Conceptually, add a “state” dimension to the model
EM methods generalize easily to this setting