NER with Models Allowing Long-Range Dependencies

Presentation transcript:

NER with Models Allowing Long-Range Dependencies William W. Cohen 10/12

Some models we've looked at
- HMMs: a generative sequential model.
- MEMMs, a.k.a. maxent tagging; stacked learning: cascaded sequences of "ordinary" classifiers (for stacking, also sequential classifiers).
- Linear-chain CRFs:
  - similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning];
  - an MRF (undirected graphical model) with edge and node potentials defined via features that depend on X, Y [Sha & Pereira].
- Stacked sequential learning: meta-learning, using as features the cross-validated predictions of a simpler model on nearby nodes in a chain.

Some models we haven't looked at
Conditional Graphical Models (Perez-Cruz & Ghahramani)
- Assume an arbitrary graph over the nodes X.
- Learn to predict the pair of labels (Yi, Yj) on each edge using SVMs.
- Inference: predict each pair of edge labels and get an associated confidence; finally, use Viterbi (or something like it) to get the single best consistent set of labels.

Some models we haven't looked at
Dependency networks (Toutanova et al., 2003)
- Assume an arbitrary graph over the nodes X.
- Learn an "every state" predictor (instead of a next-state predictor): Pr(Xi | W1, ..., Wk) for each variable Xi, where the Wj's are the neighbors of Xi. Train the local predictors using the true labels of the W's.
- Inference: a popular choice is Gibbs sampling (sketched below).
  1. Guess initial values Xi^0 for the Xi's.
  2. For t = 1...T, for i = 1...N: draw a new value for Xi^t using Pr(Xi | W1^(t-1), ..., Wk^(t-1)).
  3. Finally, use the average value of Xi over the last T-B iterations.
- (Actually, Toutanova et al. use an approximate Viterbi instead.)
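A minimal sketch of the Gibbs-sampling loop described above, assuming the local "every state" predictors are already trained and exposed as a scoring function; the names (gibbs_infer, local_prob, and so on) are placeholders of mine, not Toutanova et al.'s code.

```python
import random
from collections import defaultdict

def gibbs_infer(nodes, labels, local_prob, T=1000, burn_in=200, seed=0):
    """Approximate per-node label distributions by Gibbs sampling.

    nodes      : list of variable ids (e.g. token positions)
    labels     : list of possible labels (e.g. ["B", "I", "O"])
    local_prob : callable (node, label, assignment) -> unnormalized score of
                 `label` for `node`, looking at the neighbors' current values
    """
    rng = random.Random(seed)
    state = {n: rng.choice(labels) for n in nodes}          # guess initial values
    counts = defaultdict(lambda: defaultdict(float))

    for t in range(T):
        for n in nodes:
            # local conditional for node n given the current neighbor values
            scores = [local_prob(n, y, state) for y in labels]
            z = sum(scores)
            r, acc = rng.random() * z, 0.0
            for y, s in zip(labels, scores):                # draw a new value
                acc += s
                if r <= acc:
                    state[n] = y
                    break
        if t >= burn_in:                                    # average over the last T-B sweeps
            for n in nodes:
                counts[n][state[n]] += 1.0

    kept = float(T - burn_in)
    return {n: {y: c / kept for y, c in counts[n].items()} for n in nodes}
```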

Example DNs – bidirectional chains. [Figure: a bidirectional chain of label nodes Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes".]

DN examples. [Figure: the chain of label nodes Yi over "When will dr Cohen post the notes", with the current values shown.]
How do we do inference? Iteratively:
1. Pick values for Y1, Y2, ... at random.
2. Pick some j, and compute the local conditional distribution for Yj given the current values of its neighbors.
3. Set the new value of Yj according to this distribution.
4. Go back to (2).

DN Examples. [Figure: a chain of label nodes Y1, Y2, ..., Yi over "When will dr Cohen post the notes".]

DN Examples. [Figure: a two-layer dependency net with a POS chain Z1, Z2, ..., Zi and a BIO/NER chain Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes".]

Example DNs – "skip" chains. [Figure: a chain Y1, Y2, ..., Y7 over the sentence "Dr Yu and his wife Mi N. Yu", with "skip" edges linking the labels of repeated words; features on y can look at the next/previous position where x = xj.]

Why does Gibbs sampling work? Feeling lucky? Suppose X1^t, ..., Xn^t were drawn from the "correct" distribution for some t. Then X1^(t+1), ..., Xn^(t+1) would also be drawn from the correct distribution, and so on.

Some models we've looked at ...
- Linear-chain CRFs:
  - similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning];
  - an MRF (undirected graphical model) with edge and node potentials defined via features that depend on X, Y [my lecture].
- Dependency nets, a.k.a. MRFs learned with pseudo-likelihood: local conditional probabilities, plus Gibbs sampling (or something like it) for inference. Easy to use a network that is not a linear chain.
Question: why can't we use general MRFs for CRFs as well?
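To make the "MRFs learned with pseudo-likelihood" point concrete: pseudo-likelihood (a standard definition, not transcribed from these slides) replaces the joint likelihood with a sum of local conditional log-likelihoods,

\[
\ell_{\text{PL}}(\theta) \;=\; \sum_{i} \log \Pr_\theta\!\big(Y_i \mid Y_{N(i)}, X\big),
\]

where N(i) is the set of neighbors of node i in the (not necessarily linear-chain) graph. Each term depends only on a node and its Markov blanket, so no global partition function Z_theta(x) is needed during training; the price is that test-time inference still needs Gibbs sampling or a similar approximation.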

[Figure: the B/I/O label lattice for "When will prof Cohen post ...", shown twice; the annotation points out that each potential can only see local context.]

With Z[j,y] we can also compute quantities like: what's the probability that y2 = "B"? What's the probability that y2 = "B" and y3 = "I"? [Figure: the B/I/O label lattice for "When will prof Cohen post ...".]
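Assuming Z[j,y] denotes the forward sum (the total weight of label paths ending at position j with label y), these marginals take the standard forward-backward form; the beta and psi notation below is mine, not the slide's:

\[
\Pr(y_2 = \text{B} \mid x) = \frac{Z[2,\text{B}]\,\beta_2(\text{B})}{Z(x)},
\qquad
\Pr(y_2 = \text{B},\, y_3 = \text{I} \mid x) = \frac{Z[2,\text{B}]\,\psi_3(\text{B},\text{I},x)\,\beta_3(\text{I})}{Z(x)},
\]

where \(\beta_j(y)\) is the matching backward sum, \(\psi_3\) is the local node-plus-edge potential at position 3, and \(Z(x) = \sum_y Z[n,y]\) is the partition function.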

Another visualization of the MRF: ink = potential. [Figure: a chain of nodes, each taking value B(lack) or W(hite); the all-black and all-white assignments are the only ones shown.]

[Figure: the same B/W chain, with potentials drawn as ink.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.

Forward-Backward review
- alpha_i(j) = total weight of paths from a to node(X_i = j)
- Pr(X_i = j) = (1/Z) * total weight of paths from a to b through node(X_i = j)
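A minimal sketch of sum-product forward-backward on a linear chain, assuming per-position unary potentials and a shared transition matrix; the interface and names are mine, not a transcription of the slide.

```python
import numpy as np

def forward_backward(unary, trans):
    """Sum-product forward-backward on a linear chain.

    unary : (T, K) float array, unary[t, y]  = potential of label y at position t
    trans : (K, K) float array, trans[y, y'] = potential of transition y -> y'
    Returns (alpha, beta, Z, marginals), where marginals[t, y] = Pr(Y_t = y | x).
    """
    T, K = unary.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))

    # alpha[t, y]: total weight of all label paths from the start to node (t, y)
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] * (alpha[t - 1] @ trans)

    # beta[t, y]: total weight of all label paths from node (t, y) to the end
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (unary[t + 1] * beta[t + 1])

    Z = alpha[-1].sum()            # partition function: the "total flow"
    marginals = alpha * beta / Z   # e.g. marginals[1, B] = Pr(y2 = "B" | x)
    return alpha, beta, Z, marginals
```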

Belief Propagation in Factor Graphs (review?). [Figure: a factor graph for "When will prof Cohen post ..." with variable nodes X1-X5 (labels B/I/O) and pairwise factors f12, f23, f34, f45.]

Belief Propagation on Trees
- For each leaf a, walk away from that leaf to every node X, keeping track of the total weight of all paths from a to X. Compute this incrementally as you go.
- When you reach a node X with k neighbors, wait until k-1 walks converge; then multiply the signals and send them on.
- After you're done, you have "alpha, beta values" for each node.
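A sketch of the leaf-inward message passing described above, for a tree-structured MRF with node and edge potentials; the data layout and names here are assumptions of mine, not the lecture's code.

```python
import numpy as np
from collections import defaultdict

def tree_bp(nodes, edges, unary, pairwise):
    """Sum-product belief propagation on a tree-structured MRF.

    nodes    : list of node ids
    edges    : list of (u, v) pairs forming a tree
    unary    : dict node -> (K,) array of node potentials
    pairwise : dict (u, v) -> (K, K) array of edge potentials (u indexes rows)
    Returns beliefs: dict node -> normalized (K,) marginal.
    """
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    msgs = {}  # (u, v) -> message from u to v, a (K,) array

    def message(u, v):
        # total weight of all paths on u's side of the edge, sent toward v
        if (u, v) not in msgs:
            incoming = np.ones(len(unary[u]))
            for w in nbrs[u]:
                if w != v:                     # wait for the other k-1 walks ...
                    incoming *= message(w, u)  # ... then multiply the signals
            psi = pairwise[(u, v)] if (u, v) in pairwise else pairwise[(v, u)].T
            msgs[(u, v)] = psi.T @ (unary[u] * incoming)
        return msgs[(u, v)]

    beliefs = {}
    for x in nodes:
        b = np.asarray(unary[x], dtype=float).copy()
        for w in nbrs[x]:
            b *= message(w, x)                 # combine the "alpha, beta values"
        beliefs[x] = b / b.sum()
    return beliefs
```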

Belief Propagation on Trees Graphs (the same procedure, but on a graph that may have cycles, i.e. loopy BP)
- For each leaf a, walk away from that leaf to every node X, keeping track of the total weight of all paths from a to X. Compute this incrementally as you go.
- When you reach a node X with k neighbors, wait until k-1 walks converge; then multiply the signals and send them on.
- After you're bored (convergence is no longer guaranteed), you have approximate "alpha, beta values" for each node.

CRF learning – from Sha & Pereira. [Slide shows the conditional log-likelihood and its gradient as equation images.]

CRF learning – from Sha & Pereira (continued). The second term of the gradient is the expected value, under lambda, of f_i(x, y_j, y_j+1). The partition function Z_lambda(x), the normalizer of Pr_lambda(y|x), is the "total flow" through the MRF graph. In general (beyond chains and trees), computing it is not tractable.
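The Sha & Pereira equations themselves are not reproduced in this transcript; as a reminder (standard CRF material, reconstructed here rather than copied from the slides), the conditional log-likelihood and its gradient are

\[
\mathcal{L}(\lambda) = \sum_k \Big( \sum_i \lambda_i f_i(x^{(k)}, y^{(k)}) - \log Z_\lambda(x^{(k)}) \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \lambda_i}
= \sum_k \Big( f_i(x^{(k)}, y^{(k)}) - \mathbb{E}_{y \sim \Pr_\lambda(y \mid x^{(k)})}\big[ f_i(x^{(k)}, y) \big] \Big),
\]

where \(f_i(x,y) = \sum_j f_i(x, y_j, y_{j+1})\) sums the feature over positions, and \(Z_\lambda(x)\) is the partition function (the "total flow"), computable by forward-backward when the graph is a chain.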

Skip-chain CRFs: Sutton & McCallum
- Connect adjacent words with edges.
- Connect pairs of identical capitalized words.
- We don't want too many "skip" edges.

Skip-chain CRFs: Sutton & McCallum. Inference: loopy belief propagation.
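A tiny sketch of how the graph for a skip-chain CRF might be assembled, connecting adjacent tokens plus pairs of identical capitalized tokens; this illustrates the edge-construction rule above and is not Sutton & McCallum's code.

```python
from itertools import combinations

def skip_chain_edges(tokens):
    """Return the edge set of a skip-chain CRF over `tokens`:
    linear edges between adjacent positions, plus "skip" edges between
    pairs of positions holding the same capitalized word."""
    edges = [(i, i + 1) for i in range(len(tokens) - 1)]   # linear chain

    # group positions of identical capitalized words
    positions = {}
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions.setdefault(tok, []).append(i)

    # connect each pair of occurrences (few such pairs, so not too many skip edges)
    for occ in positions.values():
        edges.extend(combinations(occ, 2))
    return edges

# Example:
# skip_chain_edges("Dr Yu and his wife Mi N. Yu".split())
# -> the linear edges plus a skip edge (1, 7) linking the two occurrences of "Yu"
```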

Skip-chain CRF results

Krishnan & Manning: "An effective two-stage model..."

Repetition of names across the corpus is even more important in other domains…

How to use these regularities: stacked CRFs with special features.
- Token-majority: the majority label assigned to a token (e.g., token "Melinda" -> person).
- Entity-majority: the majority label assigned to an entity (e.g., tokens inside "Bill & Melinda Gates Foundation" -> organization).
- Super-entity-majority: the majority label assigned to entities that are super-strings of an entity (e.g., tokens inside "Melinda Gates" -> organization).
- Compute these both within a document and across the corpus (a sketch of the token-majority feature follows below).
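A minimal sketch of how a token-majority feature could be computed from a first-stage tagger's predictions; the function and variable names are mine, not from the paper.

```python
from collections import Counter, defaultdict

def token_majority(first_stage_predictions):
    """Majority label assigned to each token by a first-stage tagger.

    first_stage_predictions: iterable of (token, predicted_label) pairs,
    pooled over one document or over the whole corpus.
    Returns dict token -> its majority label, used as an extra feature
    for the second-stage (stacked) CRF.
    """
    votes = defaultdict(Counter)
    for token, label in first_stage_predictions:
        votes[token][label] += 1
    return {tok: counts.most_common(1)[0][0] for tok, counts in votes.items()}

# e.g. if the first stage tagged "Melinda" as I-PER in most of its mentions,
# token_majority(...)["Melinda"] == "I-PER", and that value is added as a
# feature on every occurrence of "Melinda" in the second-stage CRF.
```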

- Candidate phrase classification with general CRFs; local templates control overlap, and global templates act like "skip" edges.
- CRF plus a hand-coded external classifier (with Gibbs sampling) to handle long-range edges.

[Kou & Cohen, SDM-2007]