
CPSC 503 Computational Linguistics


1 CPSC 503 Computational Linguistics
Model Evaluation, Intro to HMMs. Lecture 9, Giuseppe Carenini. CPSC503 Spring 2004

2 Today 11/2
Summary: n-grams
How to evaluate probabilistic models
Markov Models: Visible Markov Models and Hidden Markov Models

3 Why n-grams?
(A) Assign a probability to a sentence: part-of-speech tagging, word-sense disambiguation, probabilistic parsing.
(B) Predict the next word: speech recognition, handwriting recognition, augmentative communication for the disabled.
Why would you want to assign a probability to a sentence? Why would you want to predict the next word?

4 Impossible to estimate!
Assuming 10^4 word types and an average sentence of 10 words, what is the sample space? ((10^4)^10 = 10^40 possible sentences.) The chain rule does not help. With the Markov assumption: unigram sample space? Bigram sample space? Trigram sample space?
Most sentences/sequences will not appear, or will appear only once, so your corpus should be much larger than your sample space. Sparse data: smoothing techniques.
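To make the sparsity point concrete, here is a minimal sketch (the toy corpus and the add-one smoothing choice are mine, for illustration) that counts bigrams and shows that even with a tiny vocabulary most of the V^2 bigram sample space is never observed:

```python
from collections import Counter

# Toy corpus (hypothetical); real estimates need a corpus much larger than the sample space.
corpus = "the cat sat on the mat the dog sat on the cat".split()

vocab = sorted(set(corpus))
V = len(vocab)
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(w for w, _ in zip(corpus, corpus[1:]))

print(f"vocabulary size V = {V}, bigram sample space = V^2 = {V * V}")
print(f"distinct bigrams actually observed = {len(bigrams)}")

def p_add_one(w_prev, w):
    """Add-one (Laplace) smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (contexts[w_prev] + V)

print("P(sat | cat) =", round(p_add_one("cat", "sat"), 3))
print("P(dog | cat) =", round(p_add_one("cat", "dog"), 3))  # unseen bigram, but non-zero after smoothing
```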

5 Model Evaluation
How different is our approximation q from the actual distribution p? Use the relative entropy (KL divergence):
D(p || q) = Σ_{x ∈ X} p(x) log( p(x) / q(x) )
Def. The relative entropy is a measure of how different two probability distributions (over the same event space) are: the average number of bits wasted by encoding events from a distribution p with (a code based on) distribution q. Mutual information is a special case: I(X; Y) = D( p(x,y) || p(x)p(y) ). Conventions: 0 · log(0/q) = 0 and p · log(p/0) = ∞.
Example: P = (1/4, 3/4), Q = (1/2, 1/2).
D(P || Q) = (1/4) · log(1/4 ÷ 1/2) + (3/4) · log(3/4 ÷ 1/2) ≈ -1/4 + (3/4)(0.585) ≈ 0.19 bits   (log2(3/2) ≈ 0.6)
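A small sketch of the slide's example: computing D(P || Q) for P = (1/4, 3/4) and Q = (1/2, 1/2), with log base 2 so the answer is in bits.

```python
from math import log2

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.25, 0.75]
Q = [0.5, 0.5]
print(round(kl_divergence(P, Q), 3))  # ~0.189 bits
print(round(kl_divergence(Q, P), 3))  # ~0.208 bits: note that D is not symmetric
```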

6 Entropy
Def. 1: a measure of uncertainty.
Def. 2: a measure of the information we need to resolve an uncertain situation.
Def. 3: a measure of the information we obtain from an experiment that resolves an uncertain situation.
X is not limited to numbers; it ranges over a set of basic elements, which can be words, parts-of-speech, etc.
Let p(x) = P(X = x), where x ∈ X. Then
H(p) = H(X) = - Σ_{x ∈ X} p(x) log2 p(x)
It is normally measured in bits.
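A minimal sketch of the definition, evaluated on a few toy distributions (chosen here just for illustration):

```python
from math import log2

def entropy(p):
    """H(X) = -sum over x of p(x) * log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(px * log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))               # 1.0 bit: a fair coin, maximal uncertainty over two outcomes
print(round(entropy([0.25, 0.75]), 3))   # ~0.811 bits: a biased coin, less uncertainty
print(entropy([1.0]))                    # 0.0 bits: no uncertainty at all
```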

7 Entropy rate and the entropy of a language
Computing the entropy of a language directly is not feasible, so let's try to find an alternative measure: the Shannon-McMillan-Breiman theorem.
Assumptions: the language is generated by a stationary and ergodic process.
Ergodic process: from each state there is a non-zero probability of ending up in any other state.
Stationary: probability distributions over words/sequences of words are time invariant (for a bigram model, if the probability depends only on the previous word at time t, it will depend only on the previous word at any other future time).
Is natural language stationary and ergodic? NL is not stationary: the probability of upcoming words can depend on events that were arbitrarily distant, and it is time dependent.
Under these assumptions, the entropy can be computed by taking the average log probability of a very long sample.

8 Cross-Entropy
Defined between a probability distribution P and another distribution Q (a model for P): H(P, Q) = H(P) + D(P || Q) = - Σ_x p(x) log q(x).
Applied to language, we could not use the KL divergence (relative entropy) directly, because we do not know the true distribution P; but we will see that we can use this related measure of distance, the cross-entropy.
Between two models Q1 and Q2, the more accurate is the one with the lower cross-entropy.
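A sketch of how this is used in practice: since we cannot enumerate the true distribution P, we estimate H(P, Q) as the average negative log probability a model assigns to a long sample drawn from P (the Shannon-McMillan-Breiman argument from the previous slide). The per-word probabilities below are hypothetical, just to show the comparison between two models.

```python
from math import log2

def cross_entropy(model_probs):
    """Average negative log2 probability per word: an estimate of H(P, Q)."""
    return -sum(log2(p) for p in model_probs) / len(model_probs)

# Hypothetical probabilities assigned by two models Q1 and Q2 to the same test sample.
q1_probs = [0.2, 0.1, 0.05, 0.3, 0.25]
q2_probs = [0.1, 0.05, 0.01, 0.2, 0.1]

print(round(cross_entropy(q1_probs), 3))  # ~2.74 bits/word
print(round(cross_entropy(q2_probs), 3))  # ~3.99 bits/word: Q1 is the more accurate model
```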

9 Model Evaluation: in practice
A: Split the corpus into a training set and a testing set.
B: Train the model on the training set (counting frequencies, smoothing).
C: Apply the model to the testing set and compute its cross-entropy / perplexity.
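A minimal end-to-end sketch of steps A-C, using a toy corpus and a unigram model with add-one smoothing (both choices are mine, for illustration); perplexity is 2 raised to the cross-entropy.

```python
from collections import Counter
from math import log2

# A: split the (toy, hypothetical) corpus into a training set and a testing set
corpus = "the cat sat on the mat and the dog sat on the rug".split()
cut = int(0.8 * len(corpus))
train, test = corpus[:cut], corpus[cut:]

# B: train the model on the training set (count frequencies, smooth with add-one)
counts = Counter(train)
vocab = set(corpus)
V, N = len(vocab), len(train)

def p(word):
    """Add-one smoothed unigram probability."""
    return (counts[word] + 1) / (N + V)

# C: apply the model to the testing set: cross-entropy and perplexity
cross_entropy = -sum(log2(p(w)) for w in test) / len(test)
perplexity = 2 ** cross_entropy
print(f"cross-entropy = {cross_entropy:.3f} bits/word, perplexity = {perplexity:.1f}")
```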

10 Knowledge-Formalisms Map (including probabilistic formalisms)
Linguistic knowledge: morphology, syntax, semantics, pragmatics, discourse and dialogue.
Formalisms (and their probabilistic versions): state machines (finite-state automata, finite-state transducers, Markov models); rule systems (e.g., (probabilistic) context-free grammars); logical formalisms (first-order logic); AI planners.
This is my conceptual map, the master plan. Markov models are used for part-of-speech tagging and dialogue. Syntax is the study of the formal relationships between words: how words are clustered into classes (that determine how they group and behave) and how they group with their neighbors into phrases.

11 Markov Models: state machines
Generalizations of finite-state automata: probabilistic transitions and, later, probabilistic symbol emission.
[Figure: a state machine with states q0, q1, q2, q3 and symbols k, z, r, with transition probabilities such as 0.2, 0.8, 1, and 0.5.]
Their first use was in modeling the letter sequences in works of Russian literature; they were later developed as a general statistical tool.

12 Example of a Markov Chain
[Figure: a Markov chain over the states t, i, p, a, h, e, with a Start state and transition probabilities such as .6, .4, .3, and 1.]

13 Markov Chain: formal description
1. Stochastic transition matrix A, where a_ij = P(X_{t+1} = s_j | X_t = s_i) and each row sums to 1.
2. Probabilities of the initial states, π_i = P(X_1 = s_i).
[Figure: the transition matrix and initial-state probabilities for the example chain over t, i, p, a, h, e, with entries such as .3, .4, .6, and 1.]
Manning/Schütze, 2000: 318
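Because the matrix in the figure did not survive the transcript, here is a sketch with made-up entries over the same six states, showing the two ingredients of the formal description and the stochastic constraints (each row of A sums to 1, as does π):

```python
states = ["t", "i", "p", "a", "h", "e"]

# Hypothetical stochastic transition matrix: A[i][j] = P(X_{t+1} = states[j] | X_t = states[i]).
A = [
    [0.0, 0.3, 0.3, 0.4, 0.0, 0.0],  # from t
    [0.0, 0.0, 0.6, 0.0, 0.0, 0.4],  # from i
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # from p
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # from a
    [0.0, 0.4, 0.0, 0.0, 0.0, 0.6],  # from h
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # from e
]

# Hypothetical initial-state probabilities: pi[i] = P(X_1 = states[i]).
pi = [0.6, 0.4, 0.0, 0.0, 0.0, 0.0]

assert all(abs(sum(row) - 1.0) < 1e-9 for row in A), "each row of A must sum to 1"
assert abs(sum(pi) - 1.0) < 1e-9, "pi must sum to 1"
print("A and pi define a valid Markov chain over", states)
```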

14 Markov Assumptions
Let X = (X_1, ..., X_T) be a sequence of random variables taking values in some finite set S = {s_1, ..., s_n}, the state space. The Markov properties are:
Limited Horizon: P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)
Time Invariant: P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1), i.e., the dependency does not change over time.
If X possesses these properties, then X is said to be a Markov Chain.

15 Markov Chain
Probability of a sequence of states X_1 ... X_T:
P(X_1, ..., X_T) = P(X_1) P(X_2 | X_1) ... P(X_T | X_{T-1}) = π_{X_1} · Π_{t=1}^{T-1} a_{X_t X_{t+1}}
Example: similar to .......?
Manning/Schütze, 2000: 320
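A sketch of this computation on a small hypothetical chain; note the parallel with computing the probability of a word sequence under a bigram model.

```python
# Hypothetical three-state chain, just to show the computation.
pi = {"t": 0.5, "i": 0.3, "p": 0.2}
A = {
    "t": {"t": 0.1, "i": 0.6, "p": 0.3},
    "i": {"t": 0.4, "i": 0.2, "p": 0.4},
    "p": {"t": 0.7, "i": 0.2, "p": 0.1},
}

def sequence_probability(seq):
    """P(X_1 ... X_T) = pi[X_1] * product over t of A[X_t][X_{t+1}]."""
    prob = pi[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        prob *= A[prev][curr]
    return prob

print(round(sequence_probability(["t", "i", "p"]), 3))  # 0.5 * 0.6 * 0.4 = 0.12
```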

16 Hidden Markov Model (State Emission)
[Figure: a four-state HMM (s1, s2, s3, s4) with a Start state; each state has its own emission probabilities for the symbols a, b, i (values such as .6, .4, .9, .1), and the transitions carry probabilities such as .3, .7, .5, and 1.]

17 Hidden Markov Model (Arc Emission)
[Figure: the same four states s1, s2, s3, s4 and Start state, but the emission probabilities for the symbols a, b, i are attached to the transitions (arcs) rather than to the states.]

18 Hidden Markov Model
State Emission Model: the symbol o emitted at time t depends only on the state at time t.
Arc Emission Model: the symbol o emitted at time t depends on the state at time t and the state at time t+1.

19 Hidden Markov Model
Formal specification as a five-tuple (S, K, π, A, B):
Set of states S
Output alphabet K
Initial state probabilities π
State transition probabilities A (why do they sum up to 1?)
Symbol emission probabilities B
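A sketch of the five-tuple as a plain data structure, with hypothetical numbers and state emission for simplicity; the asserts check the constraint the slide asks about: each row is a conditional probability distribution, so it must sum to 1.

```python
# Hypothetical state-emission HMM (S, K, pi, A, B).
S = ["s1", "s2"]                     # set of states
K = ["a", "b"]                       # output alphabet
pi = {"s1": 0.7, "s2": 0.3}          # initial state probabilities
A = {                                # state transition probabilities
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s1": 0.5, "s2": 0.5},
}
B = {                                # symbol emission probabilities
    "s1": {"a": 0.9, "b": 0.1},
    "s2": {"a": 0.2, "b": 0.8},
}

assert abs(sum(pi.values()) - 1.0) < 1e-9
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A.values())
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in B.values())
print("valid HMM with states", S, "and output alphabet", K)
```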

20 Program to simulate an (arc emission) HMM
t := 1
Start in state s_i with probability π_i (i.e., X_1 = i)
Forever do:
  Move from state s_i to state s_j with probability a_ij (i.e., X_{t+1} = j)
  Emit observation symbol o_t = k with probability b_ijk
  t := t + 1
end
Not terribly interesting....
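A direct transcription of this pseudocode into runnable Python, with a hypothetical two-state arc-emission HMM and a finite number of steps instead of "forever"; here b[i][j] is the emission distribution on the arc from state i to state j.

```python
import random

random.seed(0)

# Hypothetical arc-emission HMM over states 0 and 1 with output alphabet {"a", "b"}.
pi = [0.6, 0.4]                       # initial state probabilities
a = [[0.7, 0.3],                      # a[i][j] = P(move from state i to state j)
     [0.4, 0.6]]
b = [[{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}],   # b[i][j] = emission dist. on arc i -> j
     [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]]

def pick(dist):
    """Sample a key (or index) from a discrete probability distribution."""
    items = list(dist.items()) if isinstance(dist, dict) else list(enumerate(dist))
    r = random.random()
    for key, prob in items:
        r -= prob
        if r <= 0:
            return key
    return items[-1][0]

state = pick(pi)                      # start in state s_i with probability pi_i
observations = []
for t in range(10):                   # "forever" truncated to 10 steps
    next_state = pick(a[state])       # move from s_i to s_j with probability a_ij
    observations.append(pick(b[state][next_state]))  # emit o_t = k with probability b_ijk
    state = next_state

print("".join(observations))
```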

21 Three fundamental questions for HMMs
Decoding: finding the probability of an observation (brute force, or the Forward/Backward algorithm). Given a model μ = (A, B, π), how do we efficiently compute how likely a certain observation is, that is, P(O | μ)?
Finding the best state sequence (the Viterbi algorithm). Given the observation sequence O and a model μ, how do we choose a state sequence (X_1, ..., X_{T+1}) that best explains the observations?
Training: finding the model parameters which best explain the observations. Given an observation sequence O and a space of possible models found by varying the model parameters μ = (A, B, π), how do we find the model that best explains the observed data?
Manning/Schütze, 2000: 325

22 Computing the probability of an observation sequence
P(O | μ) = Σ_X P(O, X | μ) = Σ_X P(X | μ) P(O | X, μ), a sum over all state sequences X of products of state transition and symbol emission probabilities.
Example on the sample arc-emission HMM: what is P(b, i)?
This is simply the sum of the probability of the observation occurring according to each possible state sequence. Direct evaluation of this expression, however, is extremely inefficient.
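A brute-force sketch of that sum for an arc-emission HMM with hypothetical parameters, evaluated on the observation (b, i); the cost grows as |S|^(T+1), which is why the Forward algorithm is used instead.

```python
from itertools import product

# Hypothetical arc-emission HMM over two states with output alphabet {"a", "b", "i"}.
states = [0, 1]
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]          # a[i][j] = P(X_{t+1} = j | X_t = i)
b = [[{"a": 0.5, "b": 0.3, "i": 0.2}, {"a": 0.1, "b": 0.4, "i": 0.5}],
     [{"a": 0.2, "b": 0.2, "i": 0.6}, {"a": 0.3, "b": 0.3, "i": 0.4}]]  # b[i][j][k]

def observation_probability(O):
    """P(O | mu): sum over all state sequences X_1..X_{T+1} of
    pi[X_1] * product over t of a[X_t][X_{t+1}] * b[X_t][X_{t+1}][O_t] (brute force)."""
    T = len(O)
    total = 0.0
    for X in product(states, repeat=T + 1):   # all |S|^(T+1) state sequences
        p = pi[X[0]]
        for t in range(T):
            p *= a[X[t]][X[t + 1]] * b[X[t]][X[t + 1]][O[t]]
        total += p
    return total

print(round(observation_probability(["b", "i"]), 4))
```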

23 Decoding Example
Enumerating state sequences on the sample arc-emission HMM:
s1, s1, s1 = 0 ?
s1, s2, s1 = 0 ?
..........
s1, s4, s4 = .6 * .7 * .6 * .4 * .5
s2, s4, s3 = 0 ?
s2, s1, s4 = .4 * .4 * .7 * 1 * .5
..........
Complexity?
Manning/Schütze, 2000: 327

24 Next Time
Finish HMMs
Part-of-Speech Tagging

