Text Models

Why?
– To “understand” text
– To assist in text search & ranking
– For autocompletion
– Part-of-Speech Tagging

Simple application: spelling suggestions
Say that we have a dictionary of words
– A real dictionary, or the result of crawling
– Could also contain sentences instead of words
Now we are given a word w not in the dictionary
How can we correct it to something in the dictionary?

String editing
Given two strings (sequences), the “distance” between them is defined as the minimum number of “character edit operations” needed to turn one sequence into the other.
Edit operations: delete, insert, modify (a character)
– A cost is assigned to each operation (e.g., uniform cost of 1)

Edit distance
Already a simple model for language: it models the creation of strings (and errors in them) through simple edit operations

Distance between strings
Edit distance between strings = minimum number of edit operations that can be used to get from one string to the other
– Symmetric because of the particular choice of edit operations and uniform cost
Example: distance(“Willliam Cohon”, “William Cohen”) = 2

Finding the edit distance
An “alignment” problem: deciding how to align the two strings
Can we try all alignments? How many (reasonable) options are there?

Dynamic Programming
An umbrella name for a collection of algorithms
Main idea: reuse the computation of sub-problems, combined in different ways

Example: Fibonacci
def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)
Exponential time!

Fib with Dynamic Programming
table = {}   # memoization table: n -> fib(n)

def fib(n):
    global table
    if n in table:                 # reuse a previously computed sub-problem
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    value = fib(n-1) + fib(n-2)
    table[n] = value
    return value

Using a partial solution
Partial solution:
– Alignment of s up to location i, with t up to location j
How to reuse? Try all options for the “last” operation

Base cases: D(i,0) = i and D(0,i) = i, for i deletions / insertions
Easy to generalize to arbitrary cost functions!
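A minimal sketch of the full dynamic program (standard Levenshtein recurrence with uniform cost 1; the function name and example call are my own, not from the slides):

def edit_distance(s, t):
    # D[i][j] = edit distance between s[:i] and t[:j]
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                # i deletions
    for j in range(n + 1):
        D[0][j] = j                                # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i-1] == t[j-1] else 1     # last characters match, or one modify
            D[i][j] = min(D[i-1][j] + 1,           # delete s[i-1]
                          D[i][j-1] + 1,           # insert t[j-1]
                          D[i-1][j-1] + sub)       # modify / keep
    return D[m][n]

print(edit_distance("Willliam Cohon", "William Cohen"))   # -> 2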

Models
– Bag-of-words
– N-grams
– Hidden Markov Models
– Probabilistic Context Free Grammars

Bag-of-words
Every document is represented as a bag of the words it contains
“Bag” means that we keep the multiplicity (= number of occurrences) of each word
Very simple, but we lose all track of structure
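As a toy illustration (my own example, not from the slides), a bag of words is just a multiset of the document's tokens:

from collections import Counter

doc = "the dog chased the cat"
bag = Counter(doc.split())   # keeps multiplicities, discards all word order
print(bag)                   # Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})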

n-grams
Limited structure: a sliding window of n words

n-gram model
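The formula on this slide is missing from the transcript; the standard n-gram approximation it refers to is
P(w_1, …, w_N) = Π_i P(w_i | w_1, …, w_i-1) ≈ Π_i P(w_i | w_i-n+1, …, w_i-1)
i.e., the chain rule, with each word conditioned only on the previous n-1 words (a bigram model uses n = 2).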

How would we infer the probabilities?
Issues:
– Overfitting
– Probability 0 for unseen n-grams

How would we infer the probabilities? Maximum Likelihood:
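The slide's equation is not in the transcript; for a bigram model the maximum-likelihood estimate is the standard
P_ML(w_i | w_i-1) = count(w_i-1 w_i) / count(w_i-1)
which assigns probability 0 to any bigram never seen in training, hence the need for smoothing.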

"add-one" (Laplace) smoothing V = Vocabulary size

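Reconstructing the standard add-one smoothing formula referred to above (it is not present in the transcript):
P_Laplace(w_i | w_i-1) = (count(w_i-1 w_i) + 1) / (count(w_i-1) + V)
A small sketch of bigram estimation with add-one smoothing (my own illustration; the function name and toy corpus are made up):

from collections import Counter

def train_bigram_laplace(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    V = len(unigrams)                       # vocabulary size
    def prob(prev, word):
        # add-one smoothed estimate of P(word | prev)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

prob = train_bigram_laplace(["the dog barks", "the cat meows"])
print(prob("the", "dog"))                   # (1 + 1) / (2 + 5)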
Good-Turing Estimate

Good-Turing
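The Good-Turing formulas on these two slides are missing from the transcript; the standard estimate they refer to replaces an observed count c by
c* = (c + 1) * N_c+1 / N_c
where N_c is the number of distinct n-grams occurring exactly c times, and it reserves a total probability mass of N_1 / N for unseen n-grams (N = total number of observed n-gram tokens).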

More than a fixed n: Linear Interpolation
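The interpolation formula itself is not in the transcript; the standard form mixes the different orders:
P_interp(w_i | w_i-2, w_i-1) = λ_1 * P(w_i) + λ_2 * P(w_i | w_i-1) + λ_3 * P(w_i | w_i-2, w_i-1),  with λ_1 + λ_2 + λ_3 = 1
The λ weights are typically tuned on held-out data.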

Precision vs. Recall

Richer Models HMM PCFG

Motivation: Part-of-Speech Tagging – Useful for ranking – For machine translation – Word-Sense Disambiguation – …

Part-of-Speech Tagging
Tag this word. This word is a tag.
He dogs like a flea
The can is in the fridge
The sailor dogs me every day

A Learning Problem Training set: tagged corpus – Most famous is the Brown Corpus with about 1M words – The goal is to learn a model from the training set, and then perform tagging of untagged text – Performance tested on a test-set

Simple Algorithm
Assign to each word its most popular tag in the training set
Problem: ignores context
– “dogs” will always be tagged as a noun…
– “can” will always be tagged as a verb
Still, achieves around 80% correctness for real-life test sets
– Goes up to as high as 90% when combined with some simple rules
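A minimal sketch of this baseline (my own illustration; the toy tagged corpus and the Penn-style tag names are made up):

from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    # tagged_sentences: list of [(word, tag), ...]
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # keep only the most popular tag for each word seen in training
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, most_popular, default="NN"):
    # unknown words fall back to a default tag (here: noun)
    return [(w, most_popular.get(w.lower(), default)) for w in words]

train = [[("I", "PRP"), ("can", "MD"), ("swim", "VB")],
         [("you", "PRP"), ("can", "MD"), ("run", "VB")],
         [("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("red", "JJ")]]
model = train_baseline(train)
print(tag("the can is red".split(), model))
# 'can' always gets its most frequent training tag (MD), even here: context is ignored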

Hidden Markov Model (HMM)
Model: sentences are generated by a probabilistic process
In particular, a Markov chain whose states correspond to parts of speech
– Transitions between states are probabilistic
– In each state a word is output; the output word is again chosen probabilistically, based on the state

HMM
An HMM is:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout
“Hidden” because we see only the outputs, not the sequence of states traversed

Example

3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence (this is tagging)
3) Given a training set, find the model that would make the observations most likely

Tagging Find the most likely sequence of states that led to an observed output sequence Problem: exponentially many possible sequences!

Viterbi Algorithm
Dynamic Programming
V_t,k is the probability of the most probable state sequence
– generating the first t+1 observations (X_0, …, X_t)
– and terminating at state k
V_0,k = Pstart(k) * Pout(k, X_0)
V_t,k = Pout(k, X_t) * max_k’ { V_t-1,k’ * Ptrans(k’, k) }
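A compact sketch of the recursion above (my own illustration; the toy states, words and probabilities are made up, and the HMM is represented with plain dictionaries):

def viterbi(observations, states, Pstart, Ptrans, Pout):
    # V[t][k]: probability of the best state sequence emitting X_0..X_t and ending in k
    # back[t][k]: the predecessor state achieving that maximum
    V = [{k: Pstart[k] * Pout[k].get(observations[0], 0.0) for k in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for k in states:
            prev, score = max(((kp, V[t-1][kp] * Ptrans[kp][k]) for kp in states),
                              key=lambda pair: pair[1])
            V[t][k] = Pout[k].get(observations[t], 0.0) * score
            back[t][k] = prev
    # follow the back-pointers from the best final state to recover the path
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

states = ["N", "V"]                                        # toy tag set
Pstart = {"N": 0.6, "V": 0.4}
Ptrans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
Pout   = {"N": {"dogs": 0.5, "flies": 0.5}, "V": {"dogs": 0.2, "flies": 0.8}}
print(viterbi(["dogs", "flies"], states, Pstart, Ptrans, Pout))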

Finding the path Note that we are interested in the most likely path, not only in its probability So we need to keep track at each point of the argmax – Combine them to form a sequence What about top-k?

Complexity O(T*|S|^2), where T is the sequence (= sentence) length and |S| is the number of states (= number of possible tags)

Computing the probability of a sequence
Forward probabilities: α_t(k) is the probability of seeing the sequence X_0 … X_t and terminating at state k
Backward probabilities: β_t(k) is the probability of seeing the sequence X_t+1 … X_n given that the Markov process is at state k at time t

Computing the probabilities
Forward algorithm
α_0(k) = Pstart(k) * Pout(k, X_0)
α_t(k) = Pout(k, X_t) * Σ_k’ { α_t-1(k’) * Ptrans(k’, k) }
P(X_0, …, X_n) = Σ_k α_n(k)
Backward algorithm
β_t(k) = P(X_t+1 … X_n | state at time t is k)
β_t(k) = Σ_k’ { Ptrans(k, k’) * Pout(k’, X_t+1) * β_t+1(k’) }
β_n(k) = 1 for all k
P(X_0, …, X_n) = Σ_k Pstart(k) * Pout(k, X_0) * β_0(k)
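A direct transcription of the forward recursion into code (same toy dictionary representation as the Viterbi sketch above; my own illustration):

def forward(observations, states, Pstart, Ptrans, Pout):
    # alpha[t][k] = P(X_0..X_t, state at time t is k)
    alpha = [{k: Pstart[k] * Pout[k].get(observations[0], 0.0) for k in states}]
    for t in range(1, len(observations)):
        alpha.append({k: Pout[k].get(observations[t], 0.0) *
                         sum(alpha[t-1][kp] * Ptrans[kp][k] for kp in states)
                      for k in states})
    return sum(alpha[-1].values())        # P(X_0, ..., X_n): sum over final states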

Learning the HMM probabilities
Expectation-Maximization (EM) Algorithm
1. Start with initial probabilities
2. Compute E_ij, the expected number of transitions from i to j while generating a sequence, for each i, j (see next)
3. Set the probability of transition from i to j to be E_ij / (Σ_k E_ik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence

Estimating the expected counts
By sampling
– Re-run a random execution of the model 100 times
– Count transitions
By analysis
– Use Bayes’ rule on the formula for the sequence probability
– Called the Forward-Backward algorithm
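For reference (a standard formula, not shown in the transcript), the analytical expected transition counts combine the forward and backward probabilities computed above:
E_ij = Σ_t α_t(i) * Ptrans(i, j) * Pout(j, X_t+1) * β_t+1(j) / P(X_0, …, X_n)
i.e., the posterior probability that the i→j transition is used at time t, summed over all positions t; the emission counts are obtained analogously from α_t(k) * β_t(k).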

Accuracy
Tested experimentally
Exceeds 96% for the Brown corpus
– Trained on half and tested on the other half
Compare with the 80-90% achieved by the trivial algorithm
The hard cases are few, but they are very hard…

NLTK
Natural Language ToolKit
Open-source Python modules for NLP tasks
– Including stemming, POS tagging and much more
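For example, NLTK's off-the-shelf tokenizer and tagger can be used like this (a sketch; the resource names downloaded below may vary across NLTK versions):

import nltk

# one-time downloads of the tokenizer model and the default POS tagger
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(tokens))   # list of (word, tag) pairs; exact tags depend on the tagger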

Context Free Grammars
Context Free Grammars are a more natural model for natural language
Syntax rules are very easy to formulate using CFGs
Provably more expressive than Finite State Machines
– E.g., can check for balanced parentheses

Context Free Grammars
Non-terminals
Terminals
Production rules
– V → w, where V is a non-terminal and w is a sequence of terminals and non-terminals

Context Free Grammars Can be used as acceptors Can be used as a generative model Similarly to the case of Finite State Machines How long can a string generated by a CFG be?

Stochastic Context Free Grammar
Non-terminals
Terminals
Production rules, each associated with a probability
– V → w, where V is a non-terminal and w is a sequence of terminals and non-terminals

Chomsky Normal Form
Every rule is of the form
– V → V1 V2, where V, V1, V2 are non-terminals, or
– V → t, where V is a non-terminal and t is a terminal
Every (S)CFG (that does not generate the empty string) can be written in this form
Makes designing many algorithms easier
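As a toy illustration (made up for these notes, not from the slides), a tiny stochastic grammar already in Chomsky Normal Form:
S → NP VP (1.0)
NP → DT NN (0.7) | NN NN (0.3)
VP → VB NP (1.0)
DT → the (1.0)
NN → dog (0.5) | bone (0.5)
VB → chases (1.0)
The probabilities of the rules for each non-terminal sum to 1.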

Questions
What is the probability of a string?
– Defined as the sum of the probabilities of all possible derivations of the string
Given a string, what is its most likely derivation?
– Also called the Viterbi derivation or parse
– An easy adaptation of the Viterbi Algorithm for HMMs
Given a training corpus and a CFG (no probabilities), learn the probabilities of the derivation rules

Inside-outside probabilities
Inside probability: the probability of generating w_p … w_q from non-terminal N^j
Outside probability: the total probability of beginning with the start symbol N^1 and generating N^j (spanning w_p … w_q) and everything outside w_p … w_q

CYK algorithm: [tree diagram: the inside probability of N^j over w_p … w_q decomposed via a rule N^j → N^r N^s, with N^r covering w_p … w_d and N^s covering w_d+1 … w_q]

CYK algorithm: [tree diagram for the outside probability, case where N^j is a left child: N^f → N^j N^g, with N^g covering w_q+1 … w_e, inside the full parse of w_1 … w_m rooted at N^1]

CYK algorithm: [tree diagram for the symmetric case where N^j is a right child: N^f → N^g N^j, with N^g covering w_e … w_p-1]

Outside probability
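The outside-probability formula is missing from the transcript; in the standard notation (α_j(p,q) = outside probability of N^j over w_p … w_q, β = inside probability) the recursion mirrors the two diagrams above:
α_j(p,q) = Σ_f,g Σ_e>q α_f(p,e) * P(N^f → N^j N^g) * β_g(q+1, e)  +  Σ_f,g Σ_e<p α_f(e,q) * P(N^f → N^g N^j) * β_g(e, p-1)
with base case α_1(1,m) = 1 and α_j(1,m) = 0 for j ≠ 1.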

Probability of a sentence
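The formulas themselves are not in the transcript; with β_j(p,q) denoting the inside probability of N^j over w_p … w_q, the standard bottom-up (CYK-style) computation is
β_j(p,p) = P(N^j → w_p)
β_j(p,q) = Σ_r,s Σ_d=p..q-1 P(N^j → N^r N^s) * β_r(p,d) * β_s(d+1,q)
and the probability of the sentence is P(w_1 … w_m) = β_1(1,m), the inside probability of the start symbol over the whole string.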

The probability that a binary rule is used (1)

The probability that N^j is used (2)

The probability that a unary rule is used (3)

Multiple training sentences (1) (2)
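These formula slides are empty in the transcript; the standard inside-outside quantities they refer to are (conditioning on the observed sentence, with P = P(w_1 … w_m)):
(1) the probability that the binary rule N^j → N^r N^s is used to cover w_p … w_q with split point d:  α_j(p,q) * P(N^j → N^r N^s) * β_r(p,d) * β_s(d+1,q) / P
(2) the probability that N^j covers w_p … w_q:  α_j(p,q) * β_j(p,q) / P
(3) the probability that the unary rule N^j → w_p is used at position p:  α_j(p,p) * P(N^j → w_p) / P
Summing these over positions gives the expected rule counts for the EM re-estimation step; with multiple training sentences, the expected counts are summed over all sentences before re-normalizing.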