CS774. Markov Random Field: Theory and Application. Lecture 19. Kyomin Jung, KAIST. Nov 12, 2009.

Sequence Labeling Problem Many NLP problems can be viewed as sequence labeling: each token in a sequence is assigned a label, and the labels of tokens depend on the labels of other tokens in the sequence, particularly their neighbors (they are not i.i.d.).

Part Of Speech (POS) Tagging Annotate each word in a sentence with a part-of-speech tag. The lowest level of syntactic analysis; useful for subsequent syntactic parsing and word sense disambiguation. Example: John/PN saw/V the/Det saw/N and/Con decided/V to/Part take/V it/Pro to/Prep the/Det table/N

Bioinformatics Sequence labeling is also valuable for labeling genetic sequences in genome analysis, e.g. marking each position of a DNA strand such as AGCTAACGTTCGATACGGATTACAGCCT as belonging to an exon or an intron.

Sequence Labeling as Classification Classify each token independently, but use as input features information about the surrounding tokens (a sliding window). For example, in "John saw the saw and decided to take it to the table.", the classifier centered on "John" outputs PN; sliding the window one token to the right, the classifier centered on "saw" outputs V. A minimal sketch of such window features follows.
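As a minimal sketch (not from the slides), window features for a per-token classifier might be extracted like this in Python; the feature names and the "<pad>" symbol are illustrative choices:

    def window_features(tokens, i, width=2):
        """Features for classifying tokens[i]: the token itself plus the
        tokens in a +/- `width` window around it (padded at the edges)."""
        feats = {"word": tokens[i].lower()}
        for offset in range(-width, width + 1):
            if offset == 0:
                continue
            j = i + offset
            neighbor = tokens[j] if 0 <= j < len(tokens) else "<pad>"
            feats[f"word[{offset:+d}]"] = neighbor.lower()
        return feats

    tokens = "John saw the saw and decided to take it to the table .".split()
    print(window_features(tokens, 3))
    # {'word': 'saw', 'word[-2]': 'saw', 'word[-1]': 'the',
    #  'word[+1]': 'and', 'word[+2]': 'decided'}

Any off-the-shelf classifier can then be trained on these feature dictionaries, one training instance per token.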

Probabilistic Sequence Models Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment. Two standard models:  Hidden Markov Model (HMM)  Conditional Random Field (CRF)

Hidden Markov Model Probabilistic generative model for sequences. A finite state machine with probabilistic transitions and probabilistic generation of outputs from states. Assume an underlying set of states in which the model can be. Assume probabilistic transitions between states over time. Assume a probabilistic generation of tokens from states.
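In the standard Rabiner-style notation, which the following slides use implicitly, an HMM with states s_1, ..., s_N and observation symbols v_1, ..., v_M is parameterized by \lambda = (A, B, \pi), where

    a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \qquad
    b_j(k) = P(o_t = v_k \mid q_t = s_j), \qquad
    \pi_i = P(q_1 = s_i).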

Sample HMM for POS [State diagram: states PropNoun (emitting John, Mary, Alice, Jerry, Tom), Noun (cat, dog, car, pen, bed, apple), Det (a, the, that), and Verb (bit, ate, saw, played, hit, gave), connected by probabilistic transitions (e.g. 0.05) to each other and to a stop state.]

Sample HMM Generation [Animation over the same diagram: from the start state the model moves to PropNoun and emits "John", then to Verb and emits "bit", then to Det and emits "the", then to Noun and emits "apple", and finally transitions to stop, having generated "John bit the apple".]
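The slide's exact probabilities did not transcribe, so the Python sketch below uses made-up transition and emission values of the same shape (state names and word lists follow the diagram) to show how such a generator walks the state machine:

    import random

    # Toy HMM loosely based on the slide's diagram; the probabilities
    # below are illustrative, not the slide's actual values.
    START, STOP = "<s>", "</s>"

    transitions = {           # P(next state | current state)
        START:      {"PropNoun": 0.5, "Det": 0.4, "Noun": 0.1},
        "PropNoun": {"Verb": 0.9, STOP: 0.1},
        "Det":      {"Noun": 1.0},
        "Noun":     {"Verb": 0.7, STOP: 0.3},
        "Verb":     {"Det": 0.6, "PropNoun": 0.3, STOP: 0.1},
    }

    emissions = {             # P(word | state)
        "PropNoun": {"John": 0.3, "Mary": 0.3, "Alice": 0.2, "Tom": 0.2},
        "Noun":     {"cat": 0.2, "dog": 0.2, "apple": 0.3, "table": 0.3},
        "Det":      {"the": 0.6, "a": 0.3, "that": 0.1},
        "Verb":     {"bit": 0.3, "ate": 0.3, "saw": 0.2, "gave": 0.2},
    }

    def sample(dist):
        """Draw one key from a {key: probability} dictionary."""
        keys, probs = zip(*dist.items())
        return random.choices(keys, weights=probs, k=1)[0]

    def generate():
        """Walk the state machine from START to STOP, emitting one word per state."""
        words, tags = [], []
        state = sample(transitions[START])
        while state != STOP:
            words.append(sample(emissions[state]))
            tags.append(state)
            state = sample(transitions[state])
        return words, tags

    words, tags = generate()
    print(" ".join(words))   # e.g. "John bit the apple"
    print(" ".join(tags))    # e.g. "PropNoun Verb Det Noun"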

Three Useful HMM Tasks Observation likelihood: To classify sequences. Most likely state sequence: To tag each token in a sequence with a label. Maximum likelihood training: To train models to fit empirical training data.

Observation Likelihood Given a sequence of observations, O, and a model with a set of parameters, λ, what is the probability that this observation sequence was generated by this model, P(O | λ)? This allows an HMM to be used as a language model: a formal probabilistic model of a language that assigns a probability to each string, saying how likely that string was to have been generated by the language. Useful for two tasks:  Sequence Classification  Most Likely Sequence

Sequence Classification Assume an HMM is available for each category (i.e. each language). What is the most likely category for a given observation sequence, i.e. which category's HMM is most likely to have generated it? Used in speech recognition to find the word model most likely to have generated a given sound or phoneme sequence, e.g. deciding between the Austin and Boston models for the phoneme sequence "ah s t e n" by asking whether P(O | Austin) > P(O | Boston).

Most Likely State Sequence Given an observation sequence, O, and a model, λ, what is the most likely state sequence, Q = Q_1, Q_2, ..., Q_T, that generated this sequence from this model? Used for sequence labeling, e.g. tagging "John gave the dog an apple."
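The standard dynamic-programming solution to this task is the Viterbi algorithm (not shown on the slide). A minimal NumPy sketch, assuming dense pi (N,), A (N, N), and B (N, M) arrays, integer-encoded observations, and strictly positive probabilities so the logs stay finite:

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Most likely state sequence for observation indices `obs`,
        given initial probabilities pi, transition matrix A, and
        emission matrix B. Works in log space to avoid underflow."""
        N, T = len(pi), len(obs)
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
        delta = np.zeros((T, N))            # best log-score ending in each state
        backptr = np.zeros((T, N), dtype=int)
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from i to j
            backptr[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Trace the best path backwards from the best final state.
        path = [int(delta[T - 1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        return list(reversed(path))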

Observation Likelihood: Efficient Solution Markov assumption: the probability of the current state depends only on the immediately previous state, not on any earlier history (via the transition probability distribution, A). Forward-Backward algorithm: uses dynamic programming to exploit this fact and compute the observation likelihood efficiently, in O(N²T) time (N: number of states, T: number of tokens in the sequence).
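A minimal NumPy sketch of the forward pass, which by itself suffices to compute P(O | λ); a production implementation would add scaling or log-space arithmetic to avoid underflow on long sequences:

    import numpy as np

    def forward_likelihood(pi, A, B, obs):
        """P(O | lambda) via the forward algorithm: alpha[t, j] is the
        probability of the first t+1 observations, ending in state j.
        Runs in O(N^2 T) time for N states and T observations."""
        N, T = len(pi), len(obs)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha[T - 1].sum()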

Maximum Likelihood Training Given an observation sequence, O, what set of parameters, λ, for a given model maximizes the probability that this data was generated from this model, P(O | λ)? Only an unannotated observation sequence (or set of sequences) generated from the model is needed; in this sense, it is unsupervised.
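The slide does not name the algorithm, but the standard choice for this task is Baum-Welch, an instance of EM. With forward values \alpha_t(i) and backward values \beta_t(i), its re-estimation formulas are

    \gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}, \qquad
    \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},

    \hat{\pi}_i = \gamma_1(i), \qquad
    \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
    \hat{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.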

Supervised HMM Training If training sequences are labeled (tagged) with the underlying state sequences that generated them, then the parameters λ = {A, B, π} can all be estimated directly from counts accumulated over the labeled sequences (with appropriate smoothing). [Diagram: training sequences "John ate the apple", "A dog bit Mary", "Mary hit the dog", and "John gave Mary the cat", tagged with Det, Noun, PropNoun, and Verb, are fed into a supervised HMM training procedure.]
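A minimal Python sketch of this count-and-normalize estimation, using add-alpha smoothing as the "appropriate smoothing"; the smoothing scheme and the tiny corpus below are illustrative, not from the slide:

    from collections import defaultdict

    def estimate_hmm(tagged_sentences, alpha=1.0):
        """Estimate (pi, A, B) from sentences of (word, tag) pairs by
        relative-frequency counting with add-alpha smoothing."""
        init = defaultdict(float)
        trans = defaultdict(lambda: defaultdict(float))
        emit = defaultdict(lambda: defaultdict(float))
        tags, vocab = set(), set()
        for sent in tagged_sentences:
            prev = None
            for word, tag in sent:
                tags.add(tag)
                vocab.add(word)
                emit[tag][word] += 1
                if prev is None:
                    init[tag] += 1       # first tag of the sentence
                else:
                    trans[prev][tag] += 1
                prev = tag

        def normalize(counts, support):
            total = sum(counts.values()) + alpha * len(support)
            return {x: (counts.get(x, 0.0) + alpha) / total for x in support}

        pi = normalize(init, tags)
        A = {t: normalize(trans[t], tags) for t in tags}
        B = {t: normalize(emit[t], vocab) for t in tags}
        return pi, A, B

    corpus = [
        [("John", "PropNoun"), ("ate", "Verb"), ("the", "Det"), ("apple", "Noun")],
        [("A", "Det"), ("dog", "Noun"), ("bit", "Verb"), ("Mary", "PropNoun")],
    ]
    pi, A, B = estimate_hmm(corpus)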

Generative vs. Discriminative Sequence Labeling Models HMMs are generative models and are not directly designed to maximize the performance of sequence labeling: they model the joint distribution P(O, Q). A Conditional Random Field (CRF) is specifically designed and trained to maximize the performance of sequence labeling: it models the conditional distribution P(Q | O).
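Concretely, in the HMM notation above, the joint distribution an HMM defines factorizes as

    P(O, Q) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t),

whereas a CRF parameterizes P(Q | O) directly, as defined on the next slide.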

Definition of CRF
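The body of this slide did not transcribe. For reference, the definition from Lafferty, McCallum, and Pereira (2001) is: let G = (V, E) be a graph such that Y = (Y_v)_{v \in V}. Then (X, Y) is a conditional random field if, when conditioned on X, the variables Y_v obey the Markov property with respect to G:

    p(Y_v \mid X, Y_w,\ w \neq v) = p(Y_v \mid X, Y_w,\ w \sim v),

where w \sim v means that w and v are neighbors in G.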

An Example of CRF

Sequence Labeling [Figure: the two graphical models side by side, each with label variables Y_1, Y_2, ..., Y_T and observations X_1, X_2, ..., X_T. The HMM is a directed, generative model; the linear-chain CRF is an undirected, discriminative model.]

Conditional Distribution for a Linear-Chain CRF A typical form of CRF for sequence labeling is the linear-chain model, whose conditional distribution is given below.
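The formula on the slide did not transcribe; the standard linear-chain form, with feature functions f_k over adjacent label pairs, the observation sequence, and the position, and learned weights \lambda_k, is

    p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \right), \qquad
    Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t) \right).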

CRF Experimental Results Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks:  Part-of-speech tagging  Noun phrase chunking  Named entity recognition  … However, CRFs are much slower to train and do not scale as well to large amounts of training data: training for POS tagging on the full Penn Treebank (~1M words) currently takes "over a week."

CRF Summary CRF is a discriminative approach to sequence labeling, whereas HMMs are generative. Discriminative methods are usually more accurate since they are trained for a specific performance task. CRFs also easily allow adding additional token features without making additional independence assumptions. Training time is increased, since a complex optimization procedure is needed to fit the supervised training data.