Conditional Random Fields


Conditional Random Fields
Mark Stamp

Intro
- Hidden Markov Models (HMMs) are used in:
  - Bioinformatics
  - Natural language processing
  - Speech recognition
  - Malware detection/analysis
  - And many, many other applications
- Bottom line: HMMs are very useful. Everybody knows that!

Generic HMM
- Recall that A is the state transition matrix of a Markov process
  - This implies that Xi depends only on Xi-1
- The matrix B holds the observation probabilities
  - Note that the probability of Oi depends only on Xi
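
To make these two assumptions concrete, here is a minimal sketch (invented numbers, Python/numpy) that computes the joint probability P(X, O) of a state sequence and an observation sequence using exactly the dependencies above: Xi on Xi-1, and Oi on Xi.

```python
import numpy as np

# Hypothetical two-state HMM with three observation symbols.
pi = np.array([0.6, 0.4])                 # initial state distribution
A  = np.array([[0.7, 0.3],                # state transition probabilities
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],           # observation probabilities per state
               [0.7, 0.2, 0.1]])

def joint_prob(states, obs):
    """P(X, O) under the HMM assumptions: X_i depends only on X_{i-1},
    and O_i depends only on X_i."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for i in range(1, len(states)):
        p *= A[states[i - 1], states[i]] * B[states[i], obs[i]]
    return p

print(joint_prob(states=[0, 0, 1], obs=[2, 1, 0]))
```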

HMM Limitations
- Strong independence assumptions:
  - The observation depends only on the current state
  - The current state depends only on the previous state
- Often such independence is not realistic
  - An observation can depend on several states
  - And/or the current state might depend on several previous states

HMMs
- Within the HMM framework, we can...
  - Increase N, the number of hidden states
  - And/or use a higher order Markov process
    - "Order 2" means the hidden state depends on the 2 immediately previous hidden states
    - Order > 1 relaxes the independence constraint
- More hidden states gives more "breadth"; higher order gives increased "depth"

Beyond HMMs
- HMMs do not fit some situations
  - For example, arbitrary dependencies between state transitions and/or observations
- Here, we focus on one generalization of the HMM: Conditional Random Fields (CRFs)
- There are other generalizations; we mention a few
- Mostly, we focus on the "big picture"

HMM Revisited
- The diagram illustrates the graph structure of an HMM
  - That is, an HMM is a directed line graph
- Can other types of graphs work? Would they make sense?

MEMM
- In an HMM, the observation sequence O is related to the states X via the B matrix
  - And O affects X in training, but not in scoring
- We might want X to depend on O in scoring
- Maximum Entropy Markov Model (MEMM)
  - State Xi is a function of Xi-1 and Oi
- The MEMM focuses on "Problem 2", that is, determining the (hidden) states

Generic MEMM
- How does this differ from an HMM?
  - State Xi is a function of Xi-1 and Oi
  - We cannot generate Oi using an MEMM, while we can do so using an HMM
- While an HMM can be used to generate observation sequences that fit a given model, your humble author is not aware of many applications where this feature is particularly useful...

MEMM vs HMM
- HMM: find the "best" state sequence X
  - That is, solve HMM Problem 2
  - The solution is the X that maximizes
    P(X|O) ∝ Π P(Oi|Xi) P(Xi|Xi-1)
- MEMM: find the "best" state sequence X
  - P(X|O) = Π P(Xi|Xi-1,Oi)
  - where P(x|y,o) = 1/Z(o,y) exp(Σ wj fj(o,x))
- Note the form of the MEMM probability function, which is very different from the HMM case. Also note that for an MEMM, the observation directly affects the probability P(X|O).
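
As a concrete (and entirely hypothetical) illustration of the MEMM local probability P(x|y,o) = 1/Z(o,y) exp(Σ wj fj(o,x)), the sketch below normalizes exponentiated feature scores over the candidate successor states. The feature functions, weights, and observation values are invented for illustration.

```python
import numpy as np

def memm_local_prob(obs, prev_state, states, features, weights):
    """P(x | prev_state, obs) = exp(sum_j w_j * f_j(obs, x)) / Z(obs, prev_state).
    Z normalizes over the candidate successor states of prev_state; in practice
    the feature functions often also look at prev_state."""
    scores = np.array([sum(w * f(obs, x) for w, f in zip(weights, features))
                       for x in states])
    expz = np.exp(scores)
    return dict(zip(states, expz / expz.sum()))

# Hypothetical binary feature functions f_j(obs, x) and weights w_j.
features = [
    lambda o, x: 1.0 if (o == "shorts" and x == "H") else 0.0,
    lambda o, x: 1.0 if (o == "coat" and x == "C") else 0.0,
]
weights = [1.5, 2.0]

print(memm_local_prob("coat", prev_state="H", states=["H", "C"],
                      features=features, weights=weights))
```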

MEMM vs HMM
- Note the sum Σ wj fj(o,x) in the MEMM probability
  - This sum is over the entire sequence
  - Any useful feature of the input observation can affect the probability
- The MEMM is more "general" in this sense, as compared to the HMM
- But the MEMM creates a new problem, one that does not occur in HMMs

Label Bias Problem
- The MEMM uses dynamic programming (DP), also known as the Viterbi algorithm
  - HMM (Problem 2) does not use DP
  - The HMM α-pass uses a sum, while DP uses a max
- In the MEMM, probability is "conserved"
  - Probability must be split among successor states (not so in an HMM)
- Is this good or bad?
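
For reference, here is a minimal Viterbi (DP) sketch for an HMM, in the same numpy style as the earlier sketch; note that replacing the max/argmax with a sum would give the α-pass.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for an HMM (max instead of sum)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    back  = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            cand = delta[t - 1] * A[:, j]       # best predecessor for state j
            back[t, j]  = np.argmax(cand)
            delta[t, j] = cand[back[t, j]] * B[j, obs[t]]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```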

Label Bias Problem
- Only one possible successor state in the MEMM?
  - All probability is passed along to that state
  - In effect, the observation is ignored
- More generally, if there is one dominant successor, the observation doesn't matter much
- The CRF solves the label bias problem of the MEMM
  - So, the observation matters
  - We won't go into the details here...

Label Bias Problem Example
- [State diagram omitted: Hot (H), Medium (M), and Cold (C) states, with transition probabilities such that M almost always transitions to H]
- In the M state...
  - The observation does little (MEMM)
  - The observation can matter more (HMM)
- In a CRF, the transition from M to H still occurs (almost surely), but the observation at M could affect the resulting probability. In contrast, in an MEMM, all of the probability that arrives at M must be passed along to its successor state, regardless of the observation at M.

Conditional Random Fields
- CRFs are a generalization of HMMs
  - Generalization to other graphs, specifically undirected graphs
- The linear chain CRF is the simplest case
- But CRFs also generalize to arbitrary (undirected) graphs
  - That is, we can have arbitrary dependencies between states and observations

Simplest Case of CRF
- How is it different from an HMM/MEMM?
  - More things can depend on each other
- The case illustrated is a linear chain CRF
- More general graph structures can also work
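
Here is a minimal sketch of how a linear chain CRF scores a label sequence (made-up weights, not a training algorithm): an unnormalized score sums transition and emission-style feature weights along the chain, and the partition function Z(O) is computed with a forward-style recursion over all label sequences.

```python
import numpy as np

def crf_log_score(trans, emit, labels, obs):
    """Unnormalized log score of one label sequence under a linear chain CRF."""
    s = emit[labels[0], obs[0]]
    for t in range(1, len(obs)):
        s += trans[labels[t - 1], labels[t]] + emit[labels[t], obs[t]]
    return s

def crf_log_Z(trans, emit, obs, n_labels):
    """log Z(O): forward-style recursion summing over all label sequences."""
    alpha = emit[:, obs[0]].copy()
    for t in range(1, len(obs)):
        alpha = np.logaddexp.reduce(alpha[:, None] + trans, axis=0) + emit[:, obs[t]]
    return np.logaddexp.reduce(alpha)

# Hypothetical feature weights: 2 labels, 3 observation symbols.
trans = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
emit  = np.log(np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]))
obs, labels = [2, 1, 0], [0, 0, 1]

log_p = crf_log_score(trans, emit, labels, obs) - crf_log_Z(trans, emit, obs, 2)
print(np.exp(log_p))   # P(labels | obs)
```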

Another View
- Next, we consider a deeper connection between HMMs and CRFs
- But first, we need some background:
  - Naïve Bayes
  - Logistic regression
- These topics are very useful in their own right...
  - ...so wake up and pay attention!

What Are We Doing Here?
- Recall, O is the observation and X is the state
- Ideally, we want to model P(X,O)
  - All possible interactions of Xs and Os
- But P(X,O) involves lots of parameters
  - Like the complete covariance matrix
  - Lots of data needed for "training", and too much work to train
- Generally, this problem is intractable
- For example, we probably don't care much about all of the possible interactions between the observations, so it makes sense not to expend a lot of effort trying to model those interactions.

What to Do?
- Simplify, simplify, simplify...
  - We need to make the problem tractable
  - And then hope we get decent results
- In Naïve Bayes, assume independence
- In regression analysis, try to fit a specific function to the data
- Eventually, we'll see that this is relevant to HMMs and CRFs

Naïve Bayes
- Why is it "naïve"?
  - Assume the features of the observation are independent (given the class)
  - Probably not true, but it simplifies things
  - And it often works well in practice
- Why does independence simplify things?
- Recall covariance: for X = (x1,...,xn) and Y = (y1,...,yn), if the means are 0, then
  Cov(X,Y) = (x1y1 + ... + xnyn) / n

Naïve Bayes
- Independence implies the covariance is 0
  - If so, only the diagonal elements of the covariance matrix are non-zero
- We only need means and variances, not the entire covariance matrix
  - Far fewer parameters to estimate
  - And a lot less data needed for training
- Bottom line: a practical solution
- Note: independence implies the covariance is 0, but a covariance of 0 does not, in general, imply independence.
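
A tiny numpy illustration of the parameter savings (hypothetical data): with n features, a full covariance matrix has n(n+1)/2 free parameters, while the Naïve Bayes assumption keeps only the n per-feature variances.

```python
import numpy as np

n = 20                                   # number of features
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, n))        # hypothetical training observations

full_cov = np.cov(data, rowvar=False)    # full covariance: n*(n+1)/2 parameters
diag_var = data.var(axis=0)              # Naive Bayes keeps only the n variances

print(full_cov.shape, n * (n + 1) // 2)  # (20, 20)  210
print(diag_var.shape)                    # (20,)
```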

Naïve Bayes
- Why is it "Bayes"?
- Because it uses Bayes' Theorem:
  P(A|B) = P(B|A) P(A) / P(B)
- That is,
  P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|not A) P(not A))
- More generally,
  P(Ai|B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj), where the Aj form a partition

Bayes Formula Example
- Consider a test for an illegal drug
  - If you use the drug, 98% test positive (TPR = sensitivity)
  - If you don't use it, 99% test negative (TNR = specificity)
  - In the overall population, 5/1000 use the drug
- Let A = uses the drug, B = tests positive
- Then P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|not A) P(not A))
  = .98 × .005 / (.98 × .005 + .01 × .995) = 0.329966 ≈ 33%
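
The arithmetic above is easy to check in a few lines (a direct transcription of the numbers on the slide).

```python
# Drug test example: A = "uses the drug", B = "tests positive".
sensitivity = 0.98    # P(B | A), true positive rate
specificity = 0.99    # P(not B | not A), true negative rate
prevalence  = 0.005   # P(A), 5 users per 1000 people

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_user_given_positive = sensitivity * prevalence / p_positive
print(p_user_given_positive)   # about 0.33
```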

Naïve Bayes
- Why is this relevant?
- Suppose we classify based on an observation O
  - Compute P(X|O) = P(O|X) P(X) / P(O)
  - Where X is one possible class (state)
  - And P(O|X) is easy to compute
- Repeat for all possible classes X
  - The biggest probability gives the most likely class X
  - We can ignore P(O) since it is constant
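
A minimal sketch of this classification rule, assuming (purely for illustration) Gaussian class-conditional densities with independent features, so only per-class means and variances are needed; the classes, parameters, and observation below are invented.

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_classify(o, class_params, priors):
    """Return argmax_X P(O|X) P(X), with independent (diagonal) Gaussian features."""
    best, best_score = None, -np.inf
    for label, (means, variances) in class_params.items():
        # log P(O|X) = sum over features of log N(o_i; mean_i, var_i)
        log_likelihood = norm.logpdf(o, loc=means, scale=np.sqrt(variances)).sum()
        score = log_likelihood + np.log(priors[label])
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical two-class problem with 3-dimensional observations.
class_params = {
    "malware": (np.array([2.0, 0.5, 1.0]), np.array([1.0, 0.2, 0.5])),
    "benign":  (np.array([0.0, 0.1, 0.2]), np.array([0.8, 0.1, 0.3])),
}
priors = {"malware": 0.3, "benign": 0.7}

print(naive_bayes_classify(np.array([1.8, 0.4, 0.9]), class_params, priors))
```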

Regression Analysis
- Generically, a method for measuring the relationship between 2 or more things
  - E.g., house price vs house size
- First, we consider linear regression, since it is the simplest case
- Then logistic regression
  - More complicated, but often more useful
  - Used for binary classifiers

Linear Regression
- Suppose x is house size (square feet) and y is the sale price
  - Could be a vector x of observations instead
- The points in the scatter plot represent recent sales results
- How do we use this info?
  - Given a house to sell...
  - Given a recent sale...

Linear Regression
- The blue line is the "best fit"
  - Minimum squared error (vertical distance from each point to the line), i.e., linear least squares
- What good is it?
  - Given a new point, how well does it fit in?
  - Given x, predict y
- This sounds familiar...
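
A minimal least-squares sketch with invented house data: fit y ≈ b0 + b1 x, then use the fitted line to predict y for a new x.

```python
import numpy as np

# Hypothetical data: x = house size (sq ft), y = sale price ($1000s).
x = np.array([1000, 1500, 1800, 2400, 3000], dtype=float)
y = np.array([200,  280,  310,  405,  500],  dtype=float)

# Fit y ~ b0 + b1 * x by ordinary least squares.
b1, b0 = np.polyfit(x, y, deg=1)

# Predict the price of a new listing.
x_new = 2000
print(b0 + b1 * x_new)
```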

Regression Analysis
- In many problems, there are only 2 outcomes
  - Binary classifier, e.g., malware vs benign
  - Or "malware of a specific type" vs "other"
- Then x is an observation (vector), but each y is either 0 or 1
  - Linear regression is not so good here (why?)
- A better idea: logistic regression
  - Fit a logistic function instead of a line

Binary Classification
- Suppose we compute a score for many files
  - Score is on the x-axis
  - Output is on the y-axis: 1 if the file is malware, 0 if the file is "other"
- Linear regression is not very useful here

Binary Classification
- Instead of a line...
  - Use a function better suited to 0/1 data: the logistic function
  - The transition from 0 to 1 is more abrupt than for a line
- Why is this better?
  - Less wasted range between 0 and 1

Logistic Regression
- The logistic function: F(t) = 1 / (1 + e^(-t))
  - Input: -∞ to ∞
  - Output: 0 to 1, so it can be interpreted as a probability
- Here, t = b0 + b1x
  - Or t = b0 + b1x1 + ... + bmxm
  - I.e., x is the observation

Logistic Regression
- Instead of fitting a line to the data... fit a logistic function to the data
- And instead of least squares error... measure the "deviance", the distance from the ideal case (where the ideal is the "saturated model")
- Iterative process to find the parameters
  - Find the best fit F(t) using the data points
- More complex training than the linear case...
  - ...but better suited to binary classification
- Actually, finding the parameters is much more complex than the linear least squares algorithm used in linear regression.
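
A minimal sketch of the iterative fitting step, using plain gradient descent on the negative log-likelihood (one simple way to fit the parameters; real implementations typically use more sophisticated optimizers). The data and learning rate below are invented.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(x, y, lr=0.01, iters=5000):
    """Fit F(t) = 1/(1 + e^(-t)) with t = b0 + b1*x by gradient descent
    on the negative log-likelihood (iterative, unlike linear least squares)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        p = sigmoid(b0 + b1 * x)
        b0 -= lr * np.sum(p - y)            # gradient wrt the intercept
        b1 -= lr * np.sum((p - y) * x)      # gradient wrt the slope
    return b0, b1

# Hypothetical scores (x) and binary labels (y): 1 = malware, 0 = benign.
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
y = np.array([ 0,    0,    1,   0,   1,   1,   1  ])

b0, b1 = fit_logistic(x, y)
print(sigmoid(b0 + b1 * 0.5))   # estimated P(malware) for a score of 0.5
```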

Conditional Probability
- Recall, we would like to model P(X,O)
  - Observe that P(X,O) includes all relationships between the Xs and Os
  - Too complex, too many parameters, too...
- So we settle for P(X|O)
  - A lot fewer parameters
  - The problem is tractable
  - Works well in practice

Generative vs Discriminative
- We are interested in P(X|O)
- Generative models
  - Focus on P(O|X) P(X)
  - As in Naïve Bayes (without the denominator)
- Discriminative models
  - Focus directly on P(X|O)
  - Like logistic regression
- Tradeoffs?

Generative vs Discriminative
- Naïve Bayes is a generative model
  - Since it uses P(O|X) P(X)
  - Good in the unsupervised case (unlabeled data)
- Logistic regression is discriminative
  - It deals directly with P(X|O)
  - No need to expend effort modeling O, so more freedom to model X
  - The unsupervised case is an "active area of research"
- In principle, there are fewer parameters of concern in the discriminative case, but we have efficient algorithms in the generative case. So maybe the tradeoff here is "advantages in theory" vs "efficiency in practice".

HMM and Naïve Bayes
- Connection(s) between Naïve Bayes and HMM?
- Recall HMM Problem 2
  - For a given O, find the "best" (hidden) state sequence X
  - We use P(X|O) to determine the best X
- The alpha pass is used in solving Problem 2
  - Looking closely at the alpha pass...
  - It is based on computing P(O|X) P(X), with probabilities from the model λ
- Note that the "alpha pass" is usually known as the forward algorithm and the "beta pass" as the backward algorithm.

HMM and Naïve Bayes
- Connection(s) between Naïve Bayes and HMM?
- An HMM can be viewed as a sequential version of Naïve Bayes
  - Classifications over a series of observations
  - The HMM uses info about state transitions
- Conversely, Naïve Bayes is a "static" version of an HMM
- Bottom line: HMM is a generative model

CRF and Logistic Regression
- Connection between CRF and logistic regression?
- A linear chain CRF is a sequential version of logistic regression
  - Classification over a series of observations
  - The CRF uses info about state transitions
- Conversely, logistic regression can be viewed as a static (linear chain) CRF
- Bottom line: CRF is a discriminative model

Generative vs Discriminative
- Naïve Bayes and logistic regression
  - A "generative-discriminative pair"
- HMM and (linear chain) CRF
  - Another generative-discriminative pair
  - Sequential versions of those above
- Are there other such pairs?
  - Yes, based on further generalizations
  - What's more general than sequential?

General CRF
- We can define a CRF on any (undirected) graph structure, not just a linear chain
- In a general CRF, training and scoring are not as efficient, so...
  - The linear chain CRF is used most in practice
- In special cases, it might be worth considering a more general CRF
  - Determining such a structure is very problem-specific.

Generative Directed Model
- We can view an HMM as defined on a (directed) line graph
- Could consider a similar process on more general (directed) graph structures
  - This more general case is known as a "generative directed model"
- Algorithms (training, scoring, etc.) are not as efficient in the more general case

Generative-Discriminative Pair
- Generative directed model
  - As the name implies, a generative model
- General CRF
  - A discriminative model
- So, this gives us a 3rd generative-discriminative pair
- Summary on the next slide...

Generative-Discriminative Pairs
- Naïve Bayes (generative) <-> logistic regression (discriminative): single classification
- HMM (generative) <-> linear chain CRF (discriminative): sequences
- Generative directed model (generative) <-> general CRF (discriminative): general graphs

HCRF
- Yes, you guessed it... the Hidden Conditional Random Field
- So, what is hidden?
- To be continued...

Algorithms
- Where are the algorithms? This is a CS class, after all...
- Yes, CRF algorithms do exist
  - They are omitted here, since a lot of background is needed
  - It would take too long to cover it all, and we've got better things to do
- So, just use existing implementations
  - It's your lucky day...
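
For example, here is a sketch of linear chain CRF tagging using the third-party sklearn-crfsuite package; the API details are from memory and should be checked against the package documentation, and the features and labels below are invented.

```python
# pip install sklearn-crfsuite
import sklearn_crfsuite

# Each training example is a sequence: a list of per-token feature dicts
# plus a parallel list of labels. Features here are made up for illustration.
X_train = [
    [{"word": "rainy", "is_cold": True},  {"word": "sunny", "is_cold": False}],
    [{"word": "snow",  "is_cold": True},  {"word": "snow",  "is_cold": True}],
]
y_train = [["C", "H"], ["C", "C"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",        # L-BFGS training
    c1=0.1, c2=0.1,           # L1 / L2 regularization
    max_iterations=100,
)
crf.fit(X_train, y_train)

print(crf.predict([[{"word": "rainy", "is_cold": True}]]))
```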

References
- E. Chen, Introduction to conditional random fields
- Y. Ko, Maximum entropy Markov models and conditional random fields
- A. Quattoni, Tutorial on conditional random fields for sequence prediction
- The blog post by E. Chen is the easiest to read of these references. The slides by Y. Ko are also fairly readable and good, especially with respect to the algorithms. Many other sources can be found online, but most (including some of the references listed here) are challenging to read (to put it mildly...).

References
- C. Sutton and A. McCallum, An introduction to conditional random fields, Foundations and Trends in Machine Learning, 4(4):267-373, 2011
- H.M. Wallach, Conditional random fields: An introduction, 2004