NER with Models Allowing Long-Range Dependencies


1 NER with Models Allowing Long-Range Dependencies
William W. Cohen 10/12

2 Some models we’ve looked at
- HMMs: generative sequential model.
- MEMMs (aka maxent tagging) and stacked learning: cascaded sequences of "ordinary" classifiers (for stacking, also sequential classifiers).
- Linear-chain CRFs: similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning]; equivalently, an MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [Sha & Pereira].
- Stacked sequential learning: meta-learning, using as features the cross-validated predictions of a simpler model on nearby nodes in a chain.

3 Some models we haven’t looked at
Conditional Graphical Models (Perez-Cruz & Ghahramani):
- Assume an arbitrary graph of nodes X.
- Learn to predict the pair of labels (Yi, Yj) on each edge using SVMs.
- Inference: predict each edge's pair of labels and get an associated confidence; finally, use Viterbi (or something like it) to get the single best consistent set of labels (see the sketch below).
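A minimal sketch of that inference step, assuming some classifier has already produced a confidence for every edge's label pair. Exhaustive search over labelings stands in for the Viterbi-style step, and all names here (LABELS, edge_score, best_joint_labeling) are hypothetical:

```python
from itertools import product

LABELS = ["B", "I", "O"]

def best_joint_labeling(n_nodes, edges, edge_score):
    """edge_score[(i, j)][(yi, yj)] holds the classifier's confidence (e.g. an SVM
    score) that nodes i and j take labels (yi, yj).  Exhaustive search over all
    labelings stands in for the Viterbi-style step; only feasible for tiny graphs."""
    best, best_y = float("-inf"), None
    for y in product(LABELS, repeat=n_nodes):
        total = sum(edge_score[(i, j)][(y[i], y[j])] for (i, j) in edges)
        if total > best:
            best, best_y = total, y
    return list(best_y)
```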

4 Some models we haven’t looked at
Dependency networks (Toutanova et al, 2003):
- Assume an arbitrary graph of nodes X.
- Learn an "every state" predictor (instead of a next-state predictor): Pr(Xi | W1, …, Wk) for each variable Xi, where the Wj's are the neighbors of Xi. Train the local predictors using the true labels of the W's.
- Inference: a popular choice is Gibbs sampling (a sketch follows below).
  1. Guess initial values Xi^0 for each variable.
  2. For t = 1…T, for i = 1…N: draw a new value Xi^t using Pr(Xi | W1^(t-1), …, Wk^(t-1)).
  3. Finally, use the average value of Xi over the last T−B iterations.
- (Actually, Toutanova et al use an approximate Viterbi search instead.)
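A minimal Gibbs-sampling sketch for a dependency network, assuming the local conditionals are already trained. `local_prob` is a hypothetical stand-in for the "every state" predictor; this sketch resamples each node from the most recent neighbor values:

```python
import random

LABELS = ["B", "I", "O"]

def gibbs_infer(n_nodes, local_prob, T=1000, B=200, seed=0):
    """local_prob(i, labels) returns {label: Pr(X_i = label | current neighbor labels)}."""
    rng = random.Random(seed)
    labels = [rng.choice(LABELS) for _ in range(n_nodes)]        # 1. random initial values
    counts = [{y: 0 for y in LABELS} for _ in range(n_nodes)]
    for t in range(T):                                           # 2. T sweeps over all nodes
        for i in range(n_nodes):
            probs = local_prob(i, labels)
            labels[i] = rng.choices(LABELS, weights=[probs[y] for y in LABELS])[0]
        if t >= B:                                               # 3. average over the last T-B sweeps
            for i in range(n_nodes):
                counts[i][labels[i]] += 1
    return [max(c, key=c.get) for c in counts]                   # most frequent value per node
```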

5 Example DNs – bidirectional chains
[Figure: a bidirectional chain of label nodes Y1, Y2, …, Yi over the sentence "When will dr Cohen post the notes".]

6 DN examples
[Figure: the same chain of label nodes Yi over "When will dr Cohen post the notes", with the current values of the labels shown.]
How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random.
2. Pick some j, and compute Pr(Yj | the current values of its neighbors).
3. Set the new value of Yj according to this distribution.
4. Go back to (2).

7 DN Examples
[Figure: the chain of label nodes Y1, Y2, …, Yi over "When will dr Cohen post the notes".]

8 DN Examples
[Figure: a two-layer dependency network over "When will dr Cohen post the notes": a POS layer Z1, Z2, …, Zi and a BIO/NER layer Y1, Y2, …, Yi.]

9 Example DNs – “skip” chains
[Figure: a "skip" chain over "Dr Yu and his wife Mi N. Yu": label nodes Y1, Y2, …, Y7, with a skip edge linking the two occurrences of "Yu" (the label y for the next/previous position where x = xj).]

10 Why does Gibbs sampling work?
Feeling lucky? Suppose X1^t, …, Xn^t were drawn from the "correct" distribution for some t… Then X1^(t+1), …, Xn^(t+1) would also be drawn from the correct distribution, and so on.
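Put slightly more formally (standard notation, not from the slide): each Gibbs update leaves the target distribution π invariant, so once the samples have the correct distribution they keep it.

```latex
% Resampling X_i from its conditional leaves \pi unchanged:
\sum_{x_i'} \pi(x_i', x_{-i}) \, \pi(x_i \mid x_{-i})
  \;=\; \pi(x_{-i}) \, \pi(x_i \mid x_{-i})
  \;=\; \pi(x_i, x_{-i}).
```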

11 Some models we’ve looked at
Linear-chain CRFs:
- Similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning].
- An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture].
Dependency nets, aka MRFs learned with pseudo-likelihood:
- Local conditional probabilities, plus Gibbs sampling (or something like it) for inference.
- Easy to use a network that is not a linear chain.
Question: why can't we use general MRFs for CRFs as well?
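Before getting to that question, for reference, the pseudo-likelihood objective behind "MRFs learned w/ pseudo-likelihood", in standard notation (not copied from the slides): instead of the full conditional likelihood, each node's local conditional given its neighbors N(i) is maximized.

```latex
\mathrm{PL}(\theta) \;=\; \sum_{i} \log P_\theta\!\left(y_i \mid y_{N(i)},\, x\right)
```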

12 When will prof Cohen post …
[Figure: the label lattice for "When will prof Cohen post …", one column of candidate labels B/I/O per word; from the edges you can see the locality of the linear-chain model.]

13 With Z[j,y] we can also compute stuff like:
What's the probability that y2 = "B"? What's the probability that y2 = "B" and y3 = "I"?
[Figure: the same B/I/O lattice over "When will prof Cohen post …".]
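In the usual forward-backward notation (made explicit here, not spelled out on the slide), those quantities come straight out of the α/β tables and the partition function Z:

```latex
P(y_2 = \mathrm{B} \mid x) \;=\; \frac{\alpha_2(\mathrm{B})\,\beta_2(\mathrm{B})}{Z(x)},
\qquad
P(y_2 = \mathrm{B},\, y_3 = \mathrm{I} \mid x)
  \;=\; \frac{\alpha_2(\mathrm{B})\, M_3(\mathrm{B}, \mathrm{I} \mid x)\,\beta_3(\mathrm{I})}{Z(x)},
```

where M3(B, I | x), in the style of Sha & Pereira's transition matrices, bundles the edge features and the node features at position 3.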

14 Another visualization of the MRF
[Figure: each node can be labeled Black or White; the amount of ink on a node or edge represents its potential. In this example, all-black and all-white are the only assignments.]

15 [Figure: a larger grid of Black/White nodes.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.

16 [Figure: the same Black/White network.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.

17 Forward-Backward review
α_i(j) = total weight of paths from the start node a to node(Xi = j)
β_i(j) = total weight of paths from node(Xi = j) to the end node b
Pr(Xi = j) = 1/Z × total weight of paths from a to b through node(Xi = j)
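A minimal forward-backward sketch over a small label lattice, assuming hypothetical node and edge potentials (node_pot and edge_pot are stand-ins, not the slides' actual feature-based potentials); it shows how Z and the per-position marginals fall out of the α/β tables:

```python
LABELS = ["B", "I", "O"]

def forward_backward(node_pot, edge_pot):
    """node_pot[i][y]: potential of label y at position i;
    edge_pot[y][y2]: potential of the transition y -> y2."""
    n = len(node_pot)
    alpha = [{y: 0.0 for y in LABELS} for _ in range(n)]
    beta = [{y: 0.0 for y in LABELS} for _ in range(n)]
    for y in LABELS:
        alpha[0][y] = node_pot[0][y]          # alpha: total weight of paths from the start
        beta[n - 1][y] = 1.0                  # beta: total weight of paths to the end
    for i in range(1, n):
        for y in LABELS:
            alpha[i][y] = node_pot[i][y] * sum(
                alpha[i - 1][y0] * edge_pot[y0][y] for y0 in LABELS)
    for i in range(n - 2, -1, -1):
        for y in LABELS:
            beta[i][y] = sum(
                edge_pot[y][y2] * node_pot[i + 1][y2] * beta[i + 1][y2] for y2 in LABELS)
    Z = sum(alpha[n - 1][y] for y in LABELS)  # total weight of all paths through the lattice
    marginals = [{y: alpha[i][y] * beta[i][y] / Z for y in LABELS} for i in range(n)]
    return Z, marginals
```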

18 Belief Propagation in Factor Graphs (review?)
[Figure: the sentence "When will prof Cohen post …" drawn as a factor graph: variable nodes X1 … X5 (each with candidate labels B/I/O) connected by pairwise factor nodes f12, f23, f34, f45.]

19

20 Belief Propagation on Trees
For each leaf a: walk away from that leaf to every node X, keeping track of the total weight of all paths from a to X, and compute this incrementally as you go. When you reach a node X with k neighbors, wait until k−1 walks have arrived; then multiply the signals and send them on. After you're done you have "α, β values" for each node. (A sketch follows below.)
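A compact sum-product sketch of this procedure on a tree, under assumed inputs: adj is a dict from node to its neighbor list, and node_pot / edge_pot are hypothetical potentials (edge_pot must be defined for both edge directions). The returned unnormalized beliefs play the role of the "α, β values":

```python
LABELS = ["B", "I", "O"]

def tree_beliefs(adj, node_pot, edge_pot, root=0):
    msg = {}                                     # msg[(u, v)][y] = message from u to v

    def send(u, v):
        m = {}
        for yv in LABELS:
            total = 0.0
            for yu in LABELS:
                val = node_pot[u][yu] * edge_pot[(u, v)][(yu, yv)]
                for w in adj[u]:
                    if w != v:                   # multiply the signals already received at u
                        val *= msg[(w, u)][yu]
                total += val
            m[yv] = total
        msg[(u, v)] = m

    def collect(v, parent):                      # leaves send first, walking toward the root
        for w in adj[v]:
            if w != parent:
                collect(w, v)
                send(w, v)

    def distribute(v, parent):                   # then pass messages back out to the leaves
        for w in adj[v]:
            if w != parent:
                send(v, w)
                distribute(w, v)

    collect(root, None)
    distribute(root, None)
    beliefs = {v: {y: node_pot[v][y] for y in LABELS} for v in adj}
    for v in adj:
        for w in adj[v]:
            for y in LABELS:
                beliefs[v][y] *= msg[(w, v)][y]
    return beliefs
```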

21 Belief Propagation on Graphs (no longer just Trees)
For each leaf a: walk away from that leaf to every node X, keeping track of the total weight of all paths from a to X, and compute this incrementally as you go. When you reach a node X with k neighbors, wait until k−1 walks converge; then multiply the signals and send them on. On a graph with cycles the walks may never fully converge, so after you're bored you have (approximate) "α, β values" for each node; this is the loopy belief propagation used below for skip-chain CRFs.

22 CRF learning – from Sha & Pereira

23 CRF learning – from Sha & Pereira

24 CRF learning – from Sha & Pereira
i.e., the expected value, under λ, of fi(x, yj, yj+1). The partition function Zλ(x) that normalizes Pr(y|x) is the "total flow" through the MRF graph. In general, this is not tractable.
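For reference, the gradient these slides are describing, written in standard linear-chain CRF notation (this is the standard form, not copied from the slide images):

```latex
\frac{\partial}{\partial \lambda_i} \log P_\lambda(y \mid x)
  \;=\; \sum_{j} f_i(x, y_j, y_{j+1})
  \;-\; \mathbb{E}_{y' \sim P_\lambda(\cdot \mid x)}\Big[\sum_{j} f_i(x, y'_j, y'_{j+1})\Big],
\qquad
P_\lambda(y \mid x) \;=\; \frac{1}{Z_\lambda(x)} \exp\Big(\sum_{j,i} \lambda_i f_i(x, y_j, y_{j+1})\Big).
```

The second term is the expected value under λ of fi(x, yj, yj+1) mentioned above, and Zλ(x) is the total flow through the lattice; both are computable with forward-backward on a linear chain, but not tractable for a general MRF.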

25 Skip-chain CRFs: Sutton & McCallum
- Connect adjacent words with edges.
- Connect pairs of identical capitalized words.
- We don't want too many "skip" edges.
(A sketch of building these edges follows below.)
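A minimal sketch of the graph construction only (potentials and features are not shown); skip_chain_edges is a hypothetical helper name:

```python
from collections import defaultdict

def skip_chain_edges(tokens):
    """Edges for a skip-chain CRF over one document: linear edges between adjacent
    words, plus "skip" edges between pairs of identical capitalized words."""
    edges = [(i, i + 1) for i in range(len(tokens) - 1)]      # linear chain
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():                                 # capitalized word
            positions[tok].append(i)
    for tok, idxs in positions.items():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                edges.append((idxs[a], idxs[b]))              # skip edge
    return edges
```

For example, skip_chain_edges("Dr Yu and his wife Mi N. Yu".split()) adds a skip edge between the two occurrences of "Yu", on top of the ordinary chain edges.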

26 Skip-chain CRFs: Sutton & McCallum
Inference: loopy belief propagation

27 Skip-chain CRF results

28 Krishnan & Manning: "An effective two-stage model…"

29 Repetition of names across the corpus is even more important in other domains…

30 How to use these regularities
Stacked CRFs with special features:
- Token-majority: the majority label assigned to a token (e.g., token "Melinda" → person).
- Entity-majority: the majority label assigned to an entity (e.g., tokens inside "Bill & Melinda Gates Foundation" → organization).
- Super-entity-majority: the majority label assigned to entities that are super-strings of an entity (e.g., tokens inside "Melinda Gates" → organization).
Compute these both within the document and across the corpus (a sketch of the token-majority feature follows below).
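A minimal sketch of the token-majority feature, assuming hypothetical inputs: docs_with_predictions is a list of documents, each a list of (token, predicted_label) pairs produced by the first-stage CRF.

```python
from collections import Counter, defaultdict

def token_majority(docs_with_predictions):
    """For each token string, the label most often assigned to it by the
    first-stage CRF across the corpus (or across one document, if called
    on a single-document list)."""
    votes = defaultdict(Counter)
    for doc in docs_with_predictions:
        for token, label in doc:
            votes[token][label] += 1
    return {token: counts.most_common(1)[0][0] for token, counts in votes.items()}
```

The second-stage CRF then sees, for each token, its first-stage prediction plus this majority label (and the analogous entity-level majorities), which is how repetitions like "Melinda" → person propagate across the document and the corpus.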

31

32
- Candidate phrase classification with general CRFs; local templates control overlap; global templates are like "skip" edges.
- CRF + hand-coded external classifier (with Gibbs sampling) to handle long-range edges.

33 [Kou & Cohen, SDM-2007]

34

