1
NLP
2
Introduction to NLP
3
Sequence of random variables that aren’t independent
Examples
–weather reports
–text
4
Limited horizon: P(X_{t+1} = s_k | X_1, …, X_t) = P(X_{t+1} = s_k | X_t)
Time invariant (stationary): = P(X_2 = s_k | X_1)
Definition: in terms of a transition matrix A and initial state probabilities π.
5
[State transition diagram: states a–f with arcs labeled with transition probabilities, e.g. start→d = 1.0, d→a = 0.7, a→b = 0.8]
6
P(X_1, …, X_T) = P(X_1) P(X_2|X_1) P(X_3|X_1,X_2) … P(X_T|X_1,…,X_{T-1})
              = P(X_1) P(X_2|X_1) P(X_3|X_2) … P(X_T|X_{T-1})
Example: P(d, a, b) = P(X_1 = d) P(X_2 = a|X_1 = d) P(X_3 = b|X_2 = a) = 1.0 × 0.7 × 0.8 = 0.56
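The chain-rule computation above can be checked with a few lines of code. This is a minimal sketch, assuming a dictionary keyed by (previous state, state) pairs that holds only the arcs from the diagram; the initial probability P(X_1 = d) = 1.0 is modeled here as a transition out of a dummy start state.

```python
# Transition probabilities from the state diagram above (only the arcs
# needed for this example; every missing entry is treated as 0).
trans = {
    ("start", "d"): 1.0,
    ("d", "a"): 0.7,
    ("a", "b"): 0.8,
}

def sequence_probability(states, trans):
    """P(X_1, ..., X_T) for a Markov chain: product of one-step transitions."""
    prob = 1.0
    prev = "start"
    for s in states:
        prob *= trans.get((prev, s), 0.0)
        prev = s
    return prob

print(sequence_probability(["d", "a", "b"], trans))  # 1.0 * 0.7 * 0.8 ≈ 0.56
```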
7
Motivation
–Observing a sequence of symbols
–The sequence of states that led to the generation of the symbols is hidden
Definition
–Q = sequence of states
–O = sequence of observations, drawn from a vocabulary
–q_0, q_f = special (start, final) states
–A = state transition probabilities
–B = symbol emission probabilities
–π = initial state probabilities
–µ = (A, B, π) = complete probabilistic model
8
Uses
–part of speech tagging
–speech recognition
–gene sequencing
9
Can be used to model state sequences and observation sequences
Example:
–P(s, w) = ∏_i P(s_i|s_{i-1}) P(w_i|s_i)
[Diagram: state sequence S_0, S_1, S_2, S_3, …, S_n, with each state S_i emitting word W_i]
10
Pick start state from π
For t = 1..T
–Move to another state based on A
–Emit an observation based on B
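A minimal sketch of this generation procedure, using the two-state G/H model defined on the following slides (the helper function names are illustrative, not part of the original):

```python
import random

# Parameters of the two-state example HMM from the following slides.
pi = {"G": 1.0, "H": 0.0}                       # initial state probabilities
A = {"G": {"G": 0.8, "H": 0.2},                 # transition probabilities
     "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},       # emission probabilities
     "H": {"x": 0.3, "y": 0.5, "z": 0.2}}

def sample(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def generate(T):
    """Pick a start state from pi; then for t = 1..T move according to A
    and emit an observation from the new state according to B."""
    state = sample(pi)
    observations = []
    for _ in range(T):
        state = sample(A[state])                 # move to another state based on A
        observations.append(sample(B[state]))    # emit an observation based on B
    return observations

print(generate(5))   # e.g. ['x', 'y', 'x', 'x', 'z']
```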
11
[State diagram: states G and H plus a start state, with transition probabilities G→G = 0.8, G→H = 0.2, H→G = 0.6, H→H = 0.4]
12
P(O_t = k | X_t = s_i, X_{t+1} = s_j) = b_{ijk}

      x     y     z
G    0.7   0.2   0.1
H    0.3   0.5   0.2
13
Initial
–P(G|start) = 1.0, P(H|start) = 0.0
Transition
–P(G|G) = 0.8, P(G|H) = 0.6, P(H|G) = 0.2, P(H|H) = 0.4
Emission
–P(x|G) = 0.7, P(y|G) = 0.2, P(z|G) = 0.1
–P(x|H) = 0.3, P(y|H) = 0.5, P(z|H) = 0.2
14
Starting in state G (or H), P(yz) = ?
Possible sequences of states:
–GG
–GH
–HG
–HH
P(yz) = P(yz, GG) + P(yz, GH) + P(yz, HG) + P(yz, HH)
      = .8 × .2 × .8 × .1 + .8 × .2 × .2 × .2 + .2 × .5 × .4 × .2 + .2 × .5 × .6 × .1
      = .0128 + .0064 + .0080 + .0060
      = .0332
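The same sum can be reproduced by brute-force enumeration over all state sequences, as in this sketch (transition first, then emission, matching the arithmetic above; feasible only for tiny models):

```python
from itertools import product

A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1}, "H": {"x": 0.3, "y": 0.5, "z": 0.2}}

def brute_force_likelihood(start, observations):
    """Sum the joint probability of the observations over every possible
    state sequence, starting from the given state."""
    total = 0.0
    for states in product(A.keys(), repeat=len(observations)):
        p, prev = 1.0, start
        for s, o in zip(states, observations):
            p *= A[prev][s] * B[s][o]   # transition, then emission
            prev = s
        total += p
    return total

print(brute_force_likelihood("G", ["y", "z"]))   # ≈ 0.0332
```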
15
The states encode the most recent history
The transitions encode likely sequences of states
–e.g., Adj-Noun or Noun-Verb
–or perhaps Art-Adj-Noun
Use MLE to estimate the transition probabilities
16
Estimating the emission probabilities
–Harder than the transition probabilities
–There may be novel uses of word/POS combinations
Suggestions
–It is possible to use standard smoothing
–As well as heuristics (e.g., based on the spelling of the words)
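A sketch of what the MLE estimates with smoothing might look like; the toy tagged corpus and the choice of add-one smoothing are illustrative assumptions, not from the slides:

```python
from collections import Counter

# Toy tagged corpus: a list of (word, tag) sentences (illustrative only).
corpus = [
    [("the", "Art"), ("old", "Adj"), ("man", "Noun"), ("runs", "Verb")],
    [("the", "Art"), ("man", "Noun"), ("runs", "Verb")],
]

tag_bigrams = Counter()     # (previous tag, tag) pairs
context_counts = Counter()  # how often each tag (or <s>) appears as a context
emissions = Counter()       # (tag, word) pairs
tag_counts = Counter()      # how often each tag appears
vocab, tags = set(), set()

for sentence in corpus:
    prev = "<s>"
    for word, tag in sentence:
        tag_bigrams[(prev, tag)] += 1
        context_counts[prev] += 1
        emissions[(tag, word)] += 1
        tag_counts[tag] += 1
        vocab.add(word)
        tags.add(tag)
        prev = tag

def p_transition(tag, prev):
    """MLE transition estimate with add-one smoothing over the tag set."""
    return (tag_bigrams[(prev, tag)] + 1) / (context_counts[prev] + len(tags))

def p_emission(word, tag):
    """MLE emission estimate with add-one smoothing over the vocabulary."""
    return (emissions[(tag, word)] + 1) / (tag_counts[tag] + len(vocab))

print(p_transition("Noun", "Adj"), p_emission("man", "Noun"))   # 0.4 0.5
```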
17
The observer can only see the emitted symbols
Observation likelihood
–Given the observation sequence S and the model µ = (A, B, π), what is the probability P(S|µ) that the sequence was generated by that model?
Being able to compute the probability of the observation sequence turns the HMM into a language model
18
Tasks
–Given µ = (A, B, π), find P(O|µ)
–Given O and µ, what is the most likely state sequence (X_1, …, X_{T+1})?
–Given O and a space of all possible µ, find the model that best describes the observations
Decoding – most likely state sequence
–tag each token with a label
Observation likelihood
–classify sequences
Learning
–train models to fit empirical data
19
Find the most likely sequence of tags, given the sequence of words
–t* = argmax_t P(t|w)
Given the model µ, it is possible to compute P(t|w) for all values of t
In practice, there are way too many combinations
Possible solution:
–Use beam search (partial hypotheses)
–At each state, only keep the k best hypotheses so far
–May not work
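One possible shape of the beam-search idea (a sketch, not guaranteed to find the optimal sequence; the dictionary-based parameters reuse the two-state example and are only for illustration):

```python
def beam_decode(observations, A, B, pi, k=2):
    """Keep only the k highest-scoring partial state sequences after each
    observation; may prune the true best path, so it can fail."""
    # Each hypothesis is a (score, state sequence) pair.
    beam = [(pi[s] * B[s].get(observations[0], 0.0), [s]) for s in pi]
    beam = sorted(beam, key=lambda h: h[0], reverse=True)[:k]
    for obs in observations[1:]:
        candidates = []
        for score, seq in beam:
            for s in A[seq[-1]]:
                candidates.append(
                    (score * A[seq[-1]][s] * B[s].get(obs, 0.0), seq + [s]))
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:k]
    return beam[0]   # (score, state sequence) of the best surviving hypothesis

pi = {"G": 1.0, "H": 0.0}
A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1}, "H": {"x": 0.3, "y": 0.5, "z": 0.2}}
print(beam_decode(["y", "z", "x"], A, B, pi, k=2))
```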
20
Find the best path up to observation i and state s
Characteristics
–Uses dynamic programming
–Memoization
–Backpointers
21
[Viterbi trellis: columns of G and H states between start and end nodes, with arcs labeled with transition probabilities such as P(H|G) and emission probabilities such as P(y|G)]
22
[Trellis with observations y, z: computing the value of the first G node]
P(G, t=1) = P(start) × P(G|start) × P(y|G)
23
[Trellis: computing the value of the first H node]
P(H, t=1) = P(start) × P(H|start) × P(y|H)
24
[Trellis: computing the value of the second H node]
P(H, t=2) = max(P(G, t=1) × P(H|G) × P(z|H), P(H, t=1) × P(H|H) × P(z|H))
25
[Trellis: P(H, t=2) highlighted]
26
[Trellis diagram (continued)]
27
[Trellis: reaching the end node, P(end, t=3)]
28
[Trellis: computing P(end, t=3)]
P(end, t=3) = max(P(G, t=2) × P(end|G), P(H, t=2) × P(end|H))
29
[Trellis: the completed computation]
P(end, t=3) = max(P(G, t=2) × P(end|G), P(H, t=2) × P(end|H))
P(end, t=3) = best score for the sequence
Use the backpointers to find the sequence of states.
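A compact sketch that consolidates the trellis steps above (dictionary-based tables for the two-state example; the final transition into the end state is left out because the slides do not give values for P(end|G) and P(end|H)):

```python
def viterbi(observations, A, B, pi):
    """Best state sequence for the observations.
    V[t][s] = probability of the best path that ends in state s after the
    first t+1 observations; back[t][s] remembers the predecessor of s."""
    states = list(pi)
    V = [{s: pi[s] * B[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        V.append({})
        back.append({})
        for s in states:
            prev_best = max(states, key=lambda p: V[-2][p] * A[p][s])
            V[-1][s] = V[-2][prev_best] * A[prev_best][s] * B[s].get(obs, 0.0)
            back[-1][s] = prev_best
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

pi = {"G": 1.0, "H": 0.0}
A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1}, "H": {"x": 0.3, "y": 0.5, "z": 0.2}}
print(viterbi(["y", "z"], A, B, pi))   # (['G', 'G'], 0.016)
```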
30
Given multiple HMMs
–e.g., for different languages
–which one is the most likely to have generated the observation sequence?
Naïve solution
–try all possible state sequences
31
Dynamic programming method
–Computing a forward trellis that encodes all possible state paths
–Based on the Markov assumption: the probability of being in any state at a given time point only depends on the probabilities of being in all states at the previous time point
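A sketch of the forward computation; it mirrors the Viterbi sketch above (same dictionary conventions, emission from the start state chosen by π), with sums in place of the max:

```python
def forward_likelihood(observations, A, B, pi):
    """P(O|mu): sum over all state paths, computed column by column.
    alpha[s] = probability of emitting the observations so far and being
    in state s at the current time step."""
    alpha = {s: pi[s] * B[s].get(observations[0], 0.0) for s in pi}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * A[p][s] for p in alpha) * B[s].get(obs, 0.0)
                 for s in pi}
    return sum(alpha.values())

pi = {"G": 1.0, "H": 0.0}
A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1}, "H": {"x": 0.3, "y": 0.5, "z": 0.2}}
print(forward_likelihood(["y", "z"], A, B, pi))   # ≈ 0.024 with this model
```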
32
Supervised
–Training sequences are labeled
Unsupervised
–Training sequences are unlabeled
–Known number of states
Semi-supervised
–Some training sequences are labeled
33
Estimate the state transition probabilities using MLE
Estimate the observation probabilities using MLE
Use smoothing
34
Given:
–observation sequences
Goal:
–build the HMM
Use EM (Expectation Maximization) methods
–forward-backward (Baum-Welch) algorithm
–Baum-Welch finds an approximate (locally optimal) solution for argmax_µ P(O|µ)
35
Algorithm
–Randomly set the parameters of the HMM
–Until the parameters converge, repeat:
  E step – determine the probability of the various state sequences for generating the observations
  M step – re-estimate the parameters based on these probabilities
Notes
–the algorithm guarantees that at each iteration the likelihood of the data P(O|µ) increases
–it can be stopped at any point to give a partial solution
–it converges to a local maximum
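A simplified Baum-Welch sketch for a single observation sequence, using numpy; encoding the observations as integers and dropping the special start/end states are simplifications made here, not choices from the slides:

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
    """Sketch of forward-backward (Baum-Welch) re-estimation for one sequence."""
    rng = np.random.default_rng(seed)
    # Randomly set the parameters of the HMM (rows normalized to sum to 1).
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(n_states)
    pi /= pi.sum()
    T = len(obs)

    for _ in range(n_iter):
        # E step: forward and backward probabilities.
        alpha = np.zeros((T, n_states))
        beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[T - 1].sum()

        # Posterior state occupancies and expected transition counts.
        gamma = alpha * beta / likelihood
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / likelihood

        # M step: re-estimate the parameters from the expected counts.
        pi = gamma[0]
        A = xi / gamma[:-1].sum(axis=0, keepdims=True).T
        B = np.zeros_like(B)
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= gamma.sum(axis=0, keepdims=True).T
    return A, B, pi

# Toy usage: observations encoded as integers (x=0, y=1, z=2).
A, B, pi = baum_welch([1, 2, 0, 0, 1, 2, 0], n_states=2, n_symbols=3)
print(A, B, pi, sep="\n")
```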
36
NLP