CSCI 5832 Natural Language Processing

CSCI 5832 Natural Language Processing, Lecture 12. Jim Martin, Spring 2007.

Today: 2/13
- Review entropy
- Back to parts of speech
- Break
- Tagging

Entropy
Defined as H(X) = -Σ_x p(x) log2 p(x), the average information (in bits) of an outcome drawn from the distribution p.
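A minimal sketch of this definition in Python, using made-up toy coin distributions (not an example from the lecture):

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A fair coin carries 1 bit of surprise per flip; a biased coin carries less.
print(entropy({"H": 0.5, "T": 0.5}))   # 1.0
print(entropy({"H": 0.9, "T": 0.1}))   # ~0.47
```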

Entropy
Two views of this:
- As a means of measuring the information (surprise) in a given string with respect to some input
- As a means of measuring how well a given (learned) model fits a given corpus (one big, long string)

Cross-Entropy
Defined as H(p, m) = -Σ_x p(x) log2 m(x), where p is the true (unknown) distribution and m is the distribution defined by the current model.
For any incorrect model, H(p) < H(p, m). Why? An incorrect model is, in some sense, a model that is being surprised by what it sees. Surprise means probabilities that are lower than they should be, which means the measured entropy is higher than it should be.
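A sketch of the same idea in code, again with invented toy distributions: an incorrect model assigns lower probabilities than it should, so its cross-entropy exceeds the true entropy.

```python
import math

def cross_entropy(p, m):
    """Cross-entropy H(p, m) in bits: average surprise under model m
    when outcomes are actually drawn from the true distribution p."""
    return -sum(p[x] * math.log2(m[x]) for x in p if p[x] > 0)

true_p  = {"H": 0.5, "T": 0.5}
model_m = {"H": 0.9, "T": 0.1}   # an incorrect model of the fair coin

print(cross_entropy(true_p, true_p))   # H(p)    = 1.0
print(cross_entropy(true_p, model_m))  # H(p, m) ~ 1.74 > H(p)
```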

Entropy
View the cross-entropy of a sequence as the average level of surprise in the sequence. The more you are surprised by a given corpus, the worse a job your model was doing at predicting it.

So… If you have two models, m1 and m2, the one with the lower cross-entropy is the better one.
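In practice that comparison is made empirically: score each model by its average negative log probability per word on held-out text. A hedged sketch, with two hypothetical unigram models standing in for real ones:

```python
import math

def per_word_cross_entropy(model, text):
    """Average negative log2 probability per token: an empirical estimate
    of the model's cross-entropy on this corpus (lower is better)."""
    return -sum(math.log2(model(w)) for w in text) / len(text)

# Two hypothetical unigram models over a tiny vocabulary (made-up numbers).
m1 = lambda w: {"the": 0.5, "dog": 0.25, "ran": 0.25}.get(w, 1e-6)
m2 = lambda w: {"the": 0.4, "dog": 0.4, "ran": 0.2}.get(w, 1e-6)

held_out = ["the", "dog", "ran"]
print(per_word_cross_entropy(m1, held_out))
print(per_word_cross_entropy(m2, held_out))  # whichever prints lower wins
```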

Parts of Speech
- Start with eight basic categories: noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
- These categories are based on morphological and distributional properties (not semantics)
- Some cases are easy; others are murky

Parts of Speech
Two kinds of category:
- Closed class: prepositions, articles, conjunctions, pronouns
- Open class: nouns, verbs, adjectives, adverbs

Sets of Parts of Speech: Tagsets
- There are various standard tagsets to choose from; some have many more tags than others
- The choice of tagset is based on the application
- Accurate tagging can be done even with large tagsets

Tagging
Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence. Assume we have:
- A tagset
- A dictionary that gives the possible set of tags for each entry
- A text to be tagged
- A reason?
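To make the setup concrete, a toy version of such a dictionary in Python; the entries and tags are invented for illustration, using the "I want to race" sentence that appears later in the lecture:

```python
# Hypothetical tag dictionary: each word maps to its possible tags.
tag_dictionary = {
    "I":    {"PP"},
    "want": {"VB", "NN"},
    "to":   {"TO", "IN"},
    "race": {"VB", "NN"},
}

sentence = ["I", "want", "to", "race"]
for word in sentence:
    print(word, sorted(tag_dictionary.get(word, {"UNK"})))
```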

Three Methods
- Rules
- Probabilities
- Sort of both

Rules
- Hand-crafted rules for ambiguous words that test the context to make appropriate choices
- Early attempts were fairly error-prone
- Extremely labor-intensive

Probabilities
We want the best set of tags for a sequence of words (a sentence): argmax_T P(T|W), where W is the sequence of words and T is a sequence of tags.
By Bayes' rule, argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W) = argmax_T P(W|T) P(T), since P(W) is the same for every candidate tag sequence.

Tag Sequence: P(T)
How do we get the probability of a specific tag sequence?
- Count the number of times the sequence occurs and divide by the number of sequences of that length? Not likely to work.
- Instead, make a Markov assumption and use N-grams over tags: P(T) is the product of the probabilities of the N-grams that make it up.

P(T): Bigram Example
<s> Det Adj Adj Noun </s>
P(T) ≈ P(Det|<s>) P(Adj|Det) P(Adj|Adj) P(Noun|Adj)
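A small sketch of that computation; the transition probabilities below are made-up numbers, not counts from any corpus:

```python
# Invented bigram tag-transition probabilities, for illustration only.
bigram_prob = {
    ("<s>", "Det"):  0.6,
    ("Det", "Adj"):  0.3,
    ("Adj", "Adj"):  0.1,
    ("Adj", "Noun"): 0.5,
}

def tag_sequence_prob(tags):
    """P(T) under the bigram (first-order Markov) approximation."""
    prob = 1.0
    for prev, curr in zip(tags, tags[1:]):
        prob *= bigram_prob.get((prev, curr), 0.0)
    return prob

print(tag_sequence_prob(["<s>", "Det", "Adj", "Adj", "Noun"]))  # 0.009
```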

Counts
Where do you get the N-gram counts? From a large hand-tagged corpus.
- For bigrams, count all the (Tag_i, Tag_i+1) pairs, then smooth the counts to get rid of the zeroes.
- Alternatively, you can learn them from an untagged corpus.
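A sketch of that counting step with add-one smoothing; the two hand-tagged sentences are invented stand-ins for a real corpus:

```python
from collections import Counter

# Toy hand-tagged corpus (invented), with sentence-boundary pseudo-tags.
tagged_corpus = [
    ["<s>", "PP", "VB", "TO", "VB", "</s>"],
    ["<s>", "Det", "Adj", "Noun", "VB", "</s>"],
]

bigram_counts, unigram_counts = Counter(), Counter()
for tags in tagged_corpus:
    for prev, curr in zip(tags, tags[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

tagset = {t for tags in tagged_corpus for t in tags}

def transition_prob(prev, curr, k=1.0):
    """Add-k smoothed estimate of P(curr | prev); unseen pairs get a small nonzero probability."""
    return (bigram_counts[(prev, curr)] + k) / (unigram_counts[prev] + k * len(tagset))

print(transition_prob("<s>", "PP"))   # seen once in the toy data
print(transition_prob("PP", "Noun"))  # never seen, but not zero after smoothing
```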

What about P(W|T)?
First, it's odd: it asks for the probability of seeing "The big red dog" given "Det Adj Adj Noun". Could we collect up all the times we see that tag sequence and count how often "The big red dog" shows up? Again, not likely to work.

P(W|T)
We'll make the following assumption (because it's easy): each word in the sequence depends only on its corresponding tag. So P(W|T) ≈ ∏_i P(word_i | tag_i). How do you get the statistics for that?
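One answer, consistent with the counts slide above: estimate P(word | tag) by relative frequency from the same hand-tagged data. A sketch with a few invented word/tag pairs:

```python
from collections import Counter

# Invented word/tag pairs standing in for a hand-tagged corpus.
tagged_words = [
    ("I", "PP"), ("want", "VB"), ("to", "TO"), ("race", "VB"),
    ("the", "Det"), ("race", "NN"),
]

word_tag_counts = Counter(tagged_words)
tag_counts = Counter(tag for _, tag in tagged_words)

def emission_prob(word, tag):
    """P(word | tag) by relative frequency (unsmoothed)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(emission_prob("race", "VB"))  # 0.5: half of the VB tokens here are "race"
print(emission_prob("race", "NN"))  # 1.0: the only NN token here is "race"
```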

So…
We start with argmax_T P(T|W) = argmax_T P(W|T) P(T) and, applying the two approximations above, get argmax_T ∏_i P(word_i | tag_i) P(tag_i | tag_i-1).
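A hedged sketch of scoring one candidate tag sequence with that product; the transition and emission lookups are stubbed with made-up constants purely so the code runs:

```python
# Score a (words, tags) pair under the bigram-transition, word-given-tag model.
def sequence_score(words, tags, trans, emit):
    score = trans("<s>", tags[0]) * emit(words[0], tags[0])
    for i in range(1, len(words)):
        score *= trans(tags[i - 1], tags[i]) * emit(words[i], tags[i])
    return score

# Stub probability functions (arbitrary constants, for illustration only).
trans = lambda prev, curr: 0.5   # pretend P(curr | prev)
emit  = lambda word, tag:  0.1   # pretend P(word | tag)

print(sequence_score(["I", "want", "to", "race"], ["PP", "VB", "TO", "VB"], trans, emit))
```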

Break
Quiz:
- Chapter 2: all
- Chapter 3: Sections 3.1-3.9
- Chapter 5: pages 1-26

HMMs
This is an HMM:
- The states in the model are the tags, and the observations are the words.
- The state-to-state transitions are driven by the bigram statistics.
- The observed words depend solely on the state you're in.
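Written out as plain data, such a tagging HMM might look like the following sketch; the tags follow the "I want to race" example used later, but every number is invented:

```python
# A toy HMM for tagging as plain dictionaries (all probabilities invented).
hmm = {
    "states": ["PP", "VB", "TO", "NN"],          # hidden states = tags
    "transitions": {                             # P(tag_i | tag_i-1), from bigram counts
        ("<s>", "PP"): 0.4, ("PP", "VB"): 0.5,
        ("VB", "TO"): 0.3,  ("TO", "VB"): 0.6,
    },
    "emissions": {                               # P(word | tag)
        ("PP", "I"): 0.3,  ("VB", "want"): 0.1,
        ("TO", "to"): 0.9, ("VB", "race"): 0.02,
    },
}
```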

HMMs
Why "hidden"? You have a sequence of observations and a set of unseen states that gave rise to (generated) that sequence of observations. Those states are hidden from you.

State Machine

State Machine with Observations

Unwinding in Time
That diagram isn't too helpful, since it isn't keeping track of where we are in the sequence in the right way. So we'll in effect unroll it (since we know the length of the sequence we're dealing with).

State Space
(Trellis figure: one column for each word of "I want to race", with candidate tags such as NN, VB, and PP at each position.)

State Space: Observation Probs
(The same trellis over "I want to race", now with candidate tags NN, TO, VB, and PP at each position, <s> and </s> boundary states, and observation probabilities attached.)

State Space
How big is that space? How many paths through the machine? How many things to argmax over? Is that a lot? (With T candidate tags and an N-word sentence, there are on the order of T^N paths.)

State Space
Fortunately, the Markov assumption, combined with the laws of probability, allows us to use dynamic programming to get around this.

Viterbi Algorithm
Efficiently returns the most likely path: argmax P(path | observations).
- Sweep through the columns, multiplying the probability stored at each node in the previous column by the transition probability into the current node and by the appropriate observation probability
- Store the MAX probability at each node
- Keep track of where you came from (a backpointer)
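A compact Viterbi sketch in Python; the tags and probabilities are invented toy numbers for the "I want to race" example, and unlisted transitions or emissions fall back to a tiny constant rather than zero:

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Return (best_prob, best_tag_sequence) by dynamic programming over the trellis."""
    # best[i][tag] = (prob of best path ending in tag at position i, backpointer tag)
    best = [{}]
    for tag in tags:
        best[0][tag] = (trans.get((start, tag), 1e-6) * emit.get((tag, words[0]), 1e-6), None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            prob, prev = max(
                (best[i - 1][p][0] * trans.get((p, tag), 1e-6) * emit.get((tag, words[i]), 1e-6), p)
                for p in tags
            )
            best[i][tag] = (prob, prev)
    # Trace the backpointers from the best final state.
    prob, last = max((best[-1][t][0], t) for t in tags)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return prob, list(reversed(path))

tags = ["PP", "VB", "TO", "NN"]
trans = {("<s>", "PP"): 0.4, ("PP", "VB"): 0.5, ("VB", "TO"): 0.3,
         ("TO", "VB"): 0.6, ("TO", "NN"): 0.1}
emit = {("PP", "I"): 0.3, ("VB", "want"): 0.1, ("TO", "to"): 0.9,
        ("VB", "race"): 0.02, ("NN", "race"): 0.01}

print(viterbi(["I", "want", "to", "race"], tags, trans, emit))  # picks PP VB TO VB
```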

Viterbi

Viterbi Example

Viterbi
How fast is that? (Each column takes O(T^2) work for T tags, so the full sweep is O(T^2 N) for an N-word sentence, rather than the O(T^N) cost of enumerating every path.)

Next time
Quiz on Chapters 2, 3, 4, and 5.