NER with Models Allowing Long-Range Dependencies


1 NER with Models Allowing Long-Range Dependencies
William W. Cohen 10/12

2 Some models we’ve looked at
- HMMs: generative sequential model.
- MEMMs (aka maxent tagging) and stacked learning: cascaded sequences of "ordinary" classifiers (for stacking, also sequential classifiers).
- Linear-chain CRFs: similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning]; equivalently, an MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [Sha & Pereira].
- Stacked sequential learning: meta-learning, using as features the cross-validated predictions of a simpler model on nearby nodes in a chain.

3 Some models we haven’t looked at
Conditional Graphical Models (Perez-Cruz & Ghahramani):
- Assume an arbitrary graph of nodes X.
- Learn to predict the pair of labels (Yi, Yj) on each edge using SVMs.
- Inference: predict each edge's pair of labels and get an associated confidence; finally, use Viterbi (or something like it) to get the single best consistent set of labels (see the sketch below).
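A minimal sketch of that inference step, assuming some classifier has already produced a confidence for every edge's label pair. Exhaustive search over labelings stands in for the Viterbi-style step, and all names here (LABELS, edge_score, best_joint_labeling) are hypothetical:

```python
from itertools import product

LABELS = ["B", "I", "O"]

def best_joint_labeling(n_nodes, edges, edge_score):
    """edge_score[(i, j)][(yi, yj)] holds the classifier's confidence (e.g. an SVM
    score) that nodes i and j take labels (yi, yj).  Exhaustive search over all
    labelings stands in for the Viterbi-style step; only feasible for tiny graphs."""
    best, best_y = float("-inf"), None
    for y in product(LABELS, repeat=n_nodes):
        total = sum(edge_score[(i, j)][(y[i], y[j])] for (i, j) in edges)
        if total > best:
            best, best_y = total, y
    return list(best_y)
```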

4 Some models we haven’t looked at
Dependency networks (Toutanova et al, 2003):
- Assume an arbitrary graph of nodes X.
- Learn an "every state" predictor (instead of a next-state predictor): Pr(Xi | W1, …, Wk) for each variable Xi, where the Wj's are the neighbors of Xi. Train the local predictors using the true labels of the W's.
- Inference: a popular choice is Gibbs sampling (a sketch follows below).
  1. Guess initial values Xi^0 for each variable.
  2. For t = 1…T, for i = 1…N: draw a new value Xi^t using Pr(Xi | W1^(t-1), …, Wk^(t-1)).
  3. Finally, use the average value of Xi over the last T−B iterations.
- (Actually, Toutanova et al use an approximate Viterbi search instead.)
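A minimal Gibbs-sampling sketch for a dependency network, assuming the local conditionals are already trained. `local_prob` is a hypothetical stand-in for the "every state" predictor; this sketch resamples each node from the most recent neighbor values:

```python
import random

LABELS = ["B", "I", "O"]

def gibbs_infer(n_nodes, local_prob, T=1000, B=200, seed=0):
    """local_prob(i, labels) returns {label: Pr(X_i = label | current neighbor labels)}."""
    rng = random.Random(seed)
    labels = [rng.choice(LABELS) for _ in range(n_nodes)]        # 1. random initial values
    counts = [{y: 0 for y in LABELS} for _ in range(n_nodes)]
    for t in range(T):                                           # 2. T sweeps over all nodes
        for i in range(n_nodes):
            probs = local_prob(i, labels)
            labels[i] = rng.choices(LABELS, weights=[probs[y] for y in LABELS])[0]
        if t >= B:                                               # 3. average over the last T-B sweeps
            for i in range(n_nodes):
                counts[i][labels[i]] += 1
    return [max(c, key=c.get) for c in counts]                   # most frequent value per node
```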

5 Example DNs – bidirectional chains
[Figure: a bidirectional chain of label nodes Y1, Y2, …, Yi over the sentence "When will dr Cohen post the notes".]

6 DN examples
[Figure: the same chain of label nodes Yi over "When will dr Cohen post the notes", with the current values of the labels shown.]
How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random.
2. Pick some j, and compute Pr(Yj | the current values of its neighbors).
3. Set the new value of Yj according to this distribution.
4. Go back to (2).

7 DN Examples
[Figure: the chain of label nodes Y1, Y2, …, Yi over "When will dr Cohen post the notes".]

8 DN Examples
[Figure: a two-layer dependency network over "When will dr Cohen post the notes": a POS layer Z1, Z2, …, Zi and a BIO/NER layer Y1, Y2, …, Yi.]

9 Example DNs – “skip” chains
[Figure: a "skip" chain over "Dr Yu and his wife Mi N. Yu": label nodes Y1, Y2, …, Y7, with a skip edge linking the two occurrences of "Yu" (the label y for the next/previous position where x = xj).]

10 Why does Gibbs sampling work?
Feeling lucky? Suppose X1^t, …, Xn^t were drawn from the "correct" distribution for some t… Then X1^(t+1), …, Xn^(t+1) would also be drawn from the correct distribution, and so on.
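Put slightly more formally (standard notation, not from the slide): each Gibbs update leaves the target distribution π invariant, so once the samples have the correct distribution they keep it.

```latex
% Resampling X_i from its conditional leaves \pi unchanged:
\sum_{x_i'} \pi(x_i', x_{-i}) \, \pi(x_i \mid x_{-i})
  \;=\; \pi(x_{-i}) \, \pi(x_i \mid x_{-i})
  \;=\; \pi(x_i, x_{-i}).
```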

11 Some models we’ve looked at
Linear-chain CRFs:
- Similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning].
- An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture].
Dependency nets, aka MRFs learned with pseudo-likelihood:
- Local conditional probabilities, plus Gibbs sampling (or something like it) for inference.
- Easy to use a network that is not a linear chain.
Question: why can't we use general MRFs for CRFs as well?
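Before getting to that question, for reference, the pseudo-likelihood objective behind "MRFs learned w/ pseudo-likelihood", in standard notation (not copied from the slides): instead of the full conditional likelihood, each node's local conditional given its neighbors N(i) is maximized.

```latex
\mathrm{PL}(\theta) \;=\; \sum_{i} \log P_\theta\!\left(y_i \mid y_{N(i)},\, x\right)
```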

12 When will prof Cohen post …
[Figure: the label lattice for "When will prof Cohen post …", one column of candidate labels B/I/O per word; from the edges you can see the locality of the linear-chain model.]

13 With Z[j,y] we can also compute stuff like:
What's the probability that y2 = "B"? What's the probability that y2 = "B" and y3 = "I"?
[Figure: the same B/I/O lattice over "When will prof Cohen post …".]
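In the usual forward-backward notation (made explicit here, not spelled out on the slide), those quantities come straight out of the α/β tables and the partition function Z:

```latex
P(y_2 = \mathrm{B} \mid x) \;=\; \frac{\alpha_2(\mathrm{B})\,\beta_2(\mathrm{B})}{Z(x)},
\qquad
P(y_2 = \mathrm{B},\, y_3 = \mathrm{I} \mid x)
  \;=\; \frac{\alpha_2(\mathrm{B})\, M_3(\mathrm{B}, \mathrm{I} \mid x)\,\beta_3(\mathrm{I})}{Z(x)},
```

where M3(B, I | x), in the style of Sha & Pereira's transition matrices, bundles the edge features and the node features at position 3.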

14 Another visualization of the MRF
[Figure: each node can be labeled Black or White; the amount of ink on a node or edge represents its potential. In this example, all-black and all-white are the only assignments.]

15 [Figure: a larger grid of Black/White nodes.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.

16 [Figure: the same Black/White network.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.

17 Forward-Backward review
α_i(j) = total weight of paths from the start node a to node(Xi = j)
β_i(j) = total weight of paths from node(Xi = j) to the end node b
Pr(Xi = j) = 1/Z × total weight of paths from a to b through node(Xi = j)
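A minimal forward-backward sketch over a small label lattice, assuming hypothetical node and edge potentials (node_pot and edge_pot are stand-ins, not the slides' actual feature-based potentials); it shows how Z and the per-position marginals fall out of the α/β tables:

```python
LABELS = ["B", "I", "O"]

def forward_backward(node_pot, edge_pot):
    """node_pot[i][y]: potential of label y at position i;
    edge_pot[y][y2]: potential of the transition y -> y2."""
    n = len(node_pot)
    alpha = [{y: 0.0 for y in LABELS} for _ in range(n)]
    beta = [{y: 0.0 for y in LABELS} for _ in range(n)]
    for y in LABELS:
        alpha[0][y] = node_pot[0][y]          # alpha: total weight of paths from the start
        beta[n - 1][y] = 1.0                  # beta: total weight of paths to the end
    for i in range(1, n):
        for y in LABELS:
            alpha[i][y] = node_pot[i][y] * sum(
                alpha[i - 1][y0] * edge_pot[y0][y] for y0 in LABELS)
    for i in range(n - 2, -1, -1):
        for y in LABELS:
            beta[i][y] = sum(
                edge_pot[y][y2] * node_pot[i + 1][y2] * beta[i + 1][y2] for y2 in LABELS)
    Z = sum(alpha[n - 1][y] for y in LABELS)  # total weight of all paths through the lattice
    marginals = [{y: alpha[i][y] * beta[i][y] / Z for y in LABELS} for i in range(n)]
    return Z, marginals
```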

18 Belief Propagation in Factor Graphs (review?)
[Figure: the sentence "When will prof Cohen post …" drawn as a factor graph: variable nodes X1 … X5 (each with candidate labels B/I/O) connected by pairwise factor nodes f12, f23, f34, f45.]

19

20 Belief Propagation on Trees
For each leaf a: walk away from that leaf to every node X, keeping track of the total weight of all paths from a to X, and compute this incrementally as you go. When you reach a node X with k neighbors, wait until k−1 walks have arrived; then multiply the signals and send them on. After you're done you have "α, β values" for each node. (A sketch follows below.)
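A compact sum-product sketch of this procedure on a tree, under assumed inputs: adj is a dict from node to its neighbor list, and node_pot / edge_pot are hypothetical potentials (edge_pot must be defined for both edge directions). The returned unnormalized beliefs play the role of the "α, β values":

```python
LABELS = ["B", "I", "O"]

def tree_beliefs(adj, node_pot, edge_pot, root=0):
    msg = {}                                     # msg[(u, v)][y] = message from u to v

    def send(u, v):
        m = {}
        for yv in LABELS:
            total = 0.0
            for yu in LABELS:
                val = node_pot[u][yu] * edge_pot[(u, v)][(yu, yv)]
                for w in adj[u]:
                    if w != v:                   # multiply the signals already received at u
                        val *= msg[(w, u)][yu]
                total += val
            m[yv] = total
        msg[(u, v)] = m

    def collect(v, parent):                      # leaves send first, walking toward the root
        for w in adj[v]:
            if w != parent:
                collect(w, v)
                send(w, v)

    def distribute(v, parent):                   # then pass messages back out to the leaves
        for w in adj[v]:
            if w != parent:
                send(v, w)
                distribute(w, v)

    collect(root, None)
    distribute(root, None)
    beliefs = {v: {y: node_pot[v][y] for y in LABELS} for v in adj}
    for v in adj:
        for w in adj[v]:
            for y in LABELS:
                beliefs[v][y] *= msg[(w, v)][y]
    return beliefs
```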

21 Belief Propagation on Graphs (no longer just Trees)
For each leaf a: walk away from that leaf to every node X, keeping track of the total weight of all paths from a to X, and compute this incrementally as you go. When you reach a node X with k neighbors, wait until k−1 walks converge; then multiply the signals and send them on. On a graph with cycles the walks may never fully converge, so after you're bored you have (approximate) "α, β values" for each node; this is the loopy belief propagation used below for skip-chain CRFs.

22 CRF learning – from Sha & Pereira

23 CRF learning – from Sha & Pereira

24 CRF learning – from Sha & Pereira
i.e., the expected value, under λ, of fi(x, yj, yj+1). The partition function Zλ(x) that normalizes Pr(y|x) is the "total flow" through the MRF graph. In general, this is not tractable.
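For reference, the gradient these slides are describing, written in standard linear-chain CRF notation (this is the standard form, not copied from the slide images):

```latex
\frac{\partial}{\partial \lambda_i} \log P_\lambda(y \mid x)
  \;=\; \sum_{j} f_i(x, y_j, y_{j+1})
  \;-\; \mathbb{E}_{y' \sim P_\lambda(\cdot \mid x)}\Big[\sum_{j} f_i(x, y'_j, y'_{j+1})\Big],
\qquad
P_\lambda(y \mid x) \;=\; \frac{1}{Z_\lambda(x)} \exp\Big(\sum_{j,i} \lambda_i f_i(x, y_j, y_{j+1})\Big).
```

The second term is the expected value under λ of fi(x, yj, yj+1) mentioned above, and Zλ(x) is the total flow through the lattice; both are computable with forward-backward on a linear chain, but not tractable for a general MRF.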

25 Skip-chain CRFs: Sutton & McCallum
- Connect adjacent words with edges.
- Connect pairs of identical capitalized words.
- We don't want too many "skip" edges.
(A sketch of building these edges follows below.)
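A minimal sketch of the graph construction only (potentials and features are not shown); skip_chain_edges is a hypothetical helper name:

```python
from collections import defaultdict

def skip_chain_edges(tokens):
    """Edges for a skip-chain CRF over one document: linear edges between adjacent
    words, plus "skip" edges between pairs of identical capitalized words."""
    edges = [(i, i + 1) for i in range(len(tokens) - 1)]      # linear chain
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():                                 # capitalized word
            positions[tok].append(i)
    for tok, idxs in positions.items():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                edges.append((idxs[a], idxs[b]))              # skip edge
    return edges
```

For example, skip_chain_edges("Dr Yu and his wife Mi N. Yu".split()) adds a skip edge between the two occurrences of "Yu", on top of the ordinary chain edges.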

26 Skip-chain CRFs: Sutton & McCallum
Inference: loopy belief propagation

27 Skip-chain CRF results

28 Krishnan & Manning: "An effective two-stage model…"

29 Repetition of names across the corpus is even more important in other domains…

30 How to use these regularities
Stacked CRFs with special features:
- Token-majority: the majority label assigned to a token (e.g., token "Melinda" → person).
- Entity-majority: the majority label assigned to an entity (e.g., tokens inside "Bill & Melinda Gates Foundation" → organization).
- Super-entity-majority: the majority label assigned to entities that are super-strings of an entity (e.g., tokens inside "Melinda Gates" → organization).
Compute these both within the document and across the corpus (a sketch of the token-majority feature follows below).
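A minimal sketch of the token-majority feature, assuming hypothetical inputs: docs_with_predictions is a list of documents, each a list of (token, predicted_label) pairs produced by the first-stage CRF.

```python
from collections import Counter, defaultdict

def token_majority(docs_with_predictions):
    """For each token string, the label most often assigned to it by the
    first-stage CRF across the corpus (or across one document, if called
    on a single-document list)."""
    votes = defaultdict(Counter)
    for doc in docs_with_predictions:
        for token, label in doc:
            votes[token][label] += 1
    return {token: counts.most_common(1)[0][0] for token, counts in votes.items()}
```

The second-stage CRF then sees, for each token, its first-stage prediction plus this majority label (and the analogous entity-level majorities), which is how repetitions like "Melinda" → person propagate across the document and the corpus.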

31

32
- Candidate phrase classification with general CRFs; local templates control overlap; global templates are like "skip" edges.
- CRF + hand-coded external classifier (with Gibbs sampling) to handle long-range edges.

33 [Kou & Cohen, SDM-2007]

34

