1
Advanced Data Analysis. Lecture 4: Hidden Markov Models. Piotr Synak
2
A Markov System (diagram: three states s1, s2, s3). Has N states, called s1, s2, .., sN. There are discrete timesteps, t = 0, t = 1, … Here N = 3 and t = 0.
3
A Markov System. Has N states, called s1, s2, .., sN. There are discrete timesteps, t = 0, t = 1, … On the t'th timestep the system is in exactly one of the available states; call it q_t. Note: q_t ∈ {s1, s2, .., sN}. Here N = 3, t = 0, and the current state is q_t = q_0 = s3.
4
A Markov System. Has N states, called s1, s2, .., sN. There are discrete timesteps, t = 0, t = 1, … On the t'th timestep the system is in exactly one of the available states; call it q_t. Note: q_t ∈ {s1, s2, .., sN}. Between each timestep, the next state is chosen randomly. Here N = 3, t = 1, and the current state is q_t = q_1 = s2.
5
A Markov System. Has N states, called s1, s2, .., sN. There are discrete timesteps, t = 0, t = 1, … On the t'th timestep the system is in exactly one of the available states; call it q_t. Note: q_t ∈ {s1, s2, .., sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state. Here N = 3, t = 1, q_t = q_1 = s2, and the transition probabilities are:
P(q_t+1 = s1 | q_t = s1) = 0, P(q_t+1 = s2 | q_t = s1) = 0, P(q_t+1 = s3 | q_t = s1) = 1
P(q_t+1 = s1 | q_t = s2) = 1/2, P(q_t+1 = s2 | q_t = s2) = 1/2, P(q_t+1 = s3 | q_t = s2) = 0
P(q_t+1 = s1 | q_t = s3) = 1/3, P(q_t+1 = s2 | q_t = s3) = 2/3, P(q_t+1 = s3 | q_t = s3) = 0
6
A Markov System. Has N states, called s1, s2, .., sN. There are discrete timesteps, t = 0, t = 1, … On the t'th timestep the system is in exactly one of the available states; call it q_t. Note: q_t ∈ {s1, s2, .., sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state. Here N = 3, t = 1, q_t = q_1 = s2, and the transition probabilities are:
P(q_t+1 = s1 | q_t = s1) = 0, P(q_t+1 = s2 | q_t = s1) = 0, P(q_t+1 = s3 | q_t = s1) = 1
P(q_t+1 = s1 | q_t = s2) = 1/2, P(q_t+1 = s2 | q_t = s2) = 1/2, P(q_t+1 = s3 | q_t = s2) = 0
P(q_t+1 = s1 | q_t = s3) = 1/3, P(q_t+1 = s2 | q_t = s3) = 2/3, P(q_t+1 = s3 | q_t = s3) = 0
This is often notated with arcs between states (in the diagram the arcs are labelled 1/2, 1/3, 2/3 and 1).
7
Markov Property. q_t+1 is conditionally independent of {q_t-1, q_t-2, …, q_1, q_0} given q_t. In other words: P(q_t+1 = s_j | q_t = s_i) = P(q_t+1 = s_j | q_t = s_i, any earlier history). (Same three-state example as before, with the transition probabilities listed on the previous slide.) Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q_0, q_1, q_2, q_3, q_4)?
8
Markov Property. q_t+1 is conditionally independent of {q_t-1, q_t-2, …, q_1, q_0} given q_t. In other words: P(q_t+1 = s_j | q_t = s_i) = P(q_t+1 = s_j | q_t = s_i, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q_0, q_1, q_2, q_3, q_4)? Answer: the chain q_0 → q_1 → q_2 → q_3 → q_4.
9
Markov Property. Answer: the chain q_0 → q_1 → q_2 → q_3 → q_4, in which each of these probability tables is identical. Notation: a_ij = P(q_t+1 = s_j | q_t = s_i), so the table for any timestep is the N × N matrix with rows i = 1..N and columns j = 1..N, where row i is (a_i1, a_i2, …, a_ij, …, a_iN).
10
A Blind Robot. STATE q = (Location of Robot, Location of Human). A human (H) and a robot (R) wander around randomly on a grid… Note: N (number of states) = 18 × 18 = 324.
11
Dynamics of the System. Each timestep the human moves randomly to an adjacent cell, and the robot also moves randomly to an adjacent cell (q_0 = the starting configuration shown in the figure). Typical questions: "What's the expected time until the human is crushed like a bug?" "What's the probability that the robot will hit the left wall before it hits the human?" "What's the probability the robot crushes the human on the next time step?"
12
Example Question. "It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?" If the robot is blind: we can compute this in advance — we'll do this first. If the robot is omniscient (i.e., it knows the state at time t): it can compute the answer directly — too easy, we won't do this. If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable — the main body of this lecture.
13
What is P(q_t = s)? The slow, stupid answer. Step 1: Work out how to compute P(Q) for any path Q = q_1 q_2 q_3 .. q_t. Given we know the start state q_1 (i.e., P(q_1) = 1):
P(q_1 q_2 .. q_t) = P(q_1 q_2 .. q_t-1) P(q_t | q_1 q_2 .. q_t-1) = P(q_1 q_2 .. q_t-1) P(q_t | q_t-1) = P(q_2 | q_1) P(q_3 | q_2) … P(q_t | q_t-1).
Step 2: Use this knowledge to get P(q_t = s): sum P(Q) over all paths Q of length t that end in s. Why is this stupid? Because the computation is exponential in t.
14
What is P(q_t = s)? The clever answer. For each state s_i, define p_t(i) = probability that the state is s_i at time t = P(q_t = s_i). This is easy to define inductively: p_0(i) is 1 for the known start state and 0 for every other state, and p_t+1(j) = Σ_i p_t(i) P(q_t+1 = s_j | q_t = s_i) = Σ_i p_t(i) a_ij. Computation is simple: just fill in a table with one row per timestep t = 0, 1, .., t_final and one column per state, in order of increasing t (see the sketch below). The cost of computing p_t(i) for all states s_i is now O(t N²); the slow way was O(N^t). This was a simple example, meant to warm you up to this trick, called Dynamic Programming, because HMMs do many tricks like this.
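To make the table-filling concrete, here is a minimal sketch (my addition, not part of the original slides). The transition matrix A is the three-state example above, the helper name state_distribution is illustrative, and the start state s3 matches the earlier figure.

```python
import numpy as np

# A[i, j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}) for the three-state example
A = np.array([[0,   0,   1  ],     # from s1
              [1/2, 1/2, 0  ],     # from s2
              [1/3, 2/3, 0  ]])    # from s3

def state_distribution(start_state, t_final):
    p = np.zeros(len(A)); p[start_state] = 1.0      # p_0: start state known with certainty
    table = [p]
    for _ in range(t_final):
        p = p @ A                                   # p_{t+1}(j) = sum_i p_t(i) * A[i, j]
        table.append(p)
    return np.array(table)                          # row t is (p_t(1), .., p_t(N))

print(state_distribution(start_state=2, t_final=4))  # starting in s3, as in the earlier figure
```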
15
Hidden State. "It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?" If the robot is blind: we can compute this in advance — we did this first. If the robot is omniscient (i.e., it knows the state at time t): it can compute the answer directly — too easy, we won't do this. If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable — the main body of this lecture.
16
Hidden State. The previous example tried to estimate P(q_t = s_i) unconditionally (using no observed evidence). Suppose we can observe something that's affected by the true state. Example: proximity sensors, which tell us the contents of the 8 adjacent squares (W denotes "WALL"). The figure contrasts the true state q_t with what the robot sees: the observation O_t.
17
Noisy Hidden State. Example: noisy proximity sensors, which unreliably tell us the contents of the 8 adjacent squares (W denotes "WALL"). The figure contrasts the true state q_t, the uncorrupted observation, and what the robot actually sees: the observation O_t.
18
Noisy Hidden State. O_t is noisily determined depending on the current state. Assume that O_t is conditionally independent of {q_t-1, q_t-2, …, q_1, q_0, O_t-1, O_t-2, …, O_1, O_0} given q_t. In other words: P(O_t = X | q_t = s_i) = P(O_t = X | q_t = s_i, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q_0, q_1, q_2, q_3, q_4, O_0, O_1, O_2, O_3, O_4)? Answer: the chain q_0 → q_1 → q_2 → q_3 → q_4 with an additional arrow from each q_t to its observation O_t. Notation: b_i(k) = P(O_t = k | q_t = s_i), arranged as an N × M table with one row per state i = 1..N and one column per observation symbol k = 1..M.
19
Hidden Markov Models. Our robot with noisy sensors is a good example of an HMM. Question 1: State Estimation — what is P(q_T = S_i | O_1 O_2 … O_T)? It will turn out that a new cute trick will get this for us. Question 2: Most Probable Path — given O_1 O_2 … O_T, what is the most probable path that I took? And what is that probability? Yet another famous trick, the Viterbi algorithm, gets this. Question 3: Learning HMMs — given O_1 O_2 … O_T, what is the maximum likelihood HMM that could have produced this string of observations? Very, very useful. Uses the EM algorithm.
20
HMM Notation (from Rabiner's survey*). The states are labeled S_1 S_2 .. S_N. For a particular trial, let T be the number of observations; T is also the number of states passed through. O = O_1 O_2 .. O_T is the sequence of observations, and Q = q_1 q_2 .. q_T is the notation for a path of states. λ = (N, M, {π_i}, {a_ij}, {b_i(k)}) is the specification of an HMM. *L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
21
HMM Formal Definition. An HMM, λ, is a 5-tuple consisting of: N, the number of states; M, the number of possible observations {1, 2, .., M}; the starting-state probabilities π_1, π_2, .., π_N, with P(q_0 = S_i) = π_i (this is new — in our previous example the start state was deterministic); the state transition probabilities a_ij = P(q_t+1 = S_j | q_t = S_i), an N × N table; and the observation probabilities b_i(k) = P(O_t = k | q_t = S_i), an N × M table.
22
Here's an HMM (state diagram with the output symbols available at each state; arcs labelled 1/3 and 2/3). N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. (An array encoding of this model follows below.)
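For later reference, this example HMM can be written down directly as arrays. This is a hypothetical encoding (my addition): the symbols X, Y, Z are mapped to indices 0, 1, 2, and the small generate helper mirrors the sampling procedure walked through on the next slides.

```python
import numpy as np

pi = np.array([1/2, 1/2, 0])                 # starting-state probabilities pi_i
A  = np.array([[0,   1/3, 2/3],              # a_ij = P(q_{t+1} = S_j | q_t = S_i)
               [1/3, 0,   2/3],
               [1/3, 1/3, 1/3]])
B  = np.array([[1/2, 1/2, 0  ],              # b_i(k) = P(O_t = k | q_t = S_i), columns X, Y, Z
               [0,   1/2, 1/2],
               [1/2, 0,   1/2]])

def generate(T, rng=np.random.default_rng(0)):
    """Sample a state path and an observation sequence of length T."""
    states, obs = [], []
    q = rng.choice(3, p=pi)                  # start randomly in state 1 or 2
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(3, p=B[q]))    # emit one output symbol at random
        q = rng.choice(3, p=A[q])            # move to the next state
    return states, obs
```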
23
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: q_0 = __ (a 50-50 choice between S_1 and S_2), O_0 = __; q_1 = __, O_1 = __; q_2 = __, O_2 = __.
24
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: q_0 = S_1, O_0 = X; q_1 = __ (go to S_3 with probability 2/3 or S_2 with probability 1/3), O_1 = __; q_2 = __, O_2 = __.
25
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: q_0 = S_1, O_0 = X; q_1 = S_3, O_1 = __ (a 50-50 choice between Z and X); q_2 = __, O_2 = __.
26
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: q_0 = S_1, O_0 = X; q_1 = S_3, O_1 = X; q_2 = __ (each of the three next states is equally likely), O_2 = __.
27
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: q_0 = S_1, O_0 = X; q_1 = S_3, O_1 = X; q_2 = S_3, O_2 = __ (a 50-50 choice between Z and X).
28
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: q_0 = S_1, O_0 = X; q_1 = S_3, O_1 = X; q_2 = S_3, O_2 = Z.
29
Here's an HMM. N = 3, M = 3. π_1 = 1/2, π_2 = 1/2, π_3 = 0. a_11 = 0, a_12 = 1/3, a_13 = 2/3; a_21 = 1/3, a_22 = 0, a_23 = 2/3; a_31 = 1/3, a_32 = 1/3, a_33 = 1/3. b_1(X) = 1/2, b_1(Y) = 1/2, b_1(Z) = 0; b_2(X) = 0, b_2(Y) = 1/2, b_2(Z) = 1/2; b_3(X) = 1/2, b_3(Y) = 0, b_3(Z) = 1/2. Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. The generated sequence: q_0 = ?, O_0 = X; q_1 = ?, O_1 = X; q_2 = ?, O_2 = Z. This is what the observer has to work with…
30
Prob. of a series of observations. What is P(O) = P(O_1 O_2 O_3) = P(O_1 = X, O_2 = X, O_3 = Z)? Slow way: sum over all paths Q. How do we compute P(Q) for an arbitrary path Q? How do we compute P(O|Q) for an arbitrary path Q? P(Q) = P(q_1, q_2, q_3) = P(q_1) P(q_2, q_3 | q_1) (chain rule) = P(q_1) P(q_2 | q_1) P(q_3 | q_2, q_1) (chain rule) = P(q_1) P(q_2 | q_1) P(q_3 | q_2) (why? the Markov property). Example in the case Q = S_1 S_3 S_3: P(Q) = 1/2 × 2/3 × 1/3 = 1/9. P(O|Q) = P(O_1 O_2 O_3 | q_1 q_2 q_3) = P(O_1 | q_1) P(O_2 | q_2) P(O_3 | q_3). Example in the case Q = S_1 S_3 S_3: P(O|Q) = P(X | S_1) P(X | S_3) P(Z | S_3) = 1/2 × 1/2 × 1/2 = 1/8. P(O) would need 27 P(Q) computations and 27 P(O|Q) computations. A sequence of 20 observations would need 3^20 ≈ 3.5 billion P(Q) computations and 3.5 billion P(O|Q) computations. So let's be smarter…
31
The prob. of a given series of observations, non-exponential-cost-style. Given observations O_1 O_2 … O_T, define α_t(i) = P(O_1 O_2 … O_t, q_t = S_i | λ), where 1 ≤ t ≤ T. α_t(i) = the probability that, in a random trial, we'd have seen the first t observations and we'd have ended up in S_i as the t'th state visited. In our example, what is α_2(3)?
32
α_t(i): easy to define recursively. α_t(i) = P(O_1 O_2 … O_t, q_t = S_i | λ). (α_t(i) can also be defined stupidly by considering all paths of length t. How?)
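The recursion the slide alludes to (the standard forward recursion, stated here because the original equations did not survive the export) is: α_1(i) = π_i · b_i(O_1), and α_t+1(j) = b_j(O_t+1) · Σ_i α_t(i) · a_ij. The "stupid" alternative sums P(Q, O_1 .. O_t) over all N^(t-1) paths Q of length t that end in S_i.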
33
In our example we saw O_1 O_2 O_3 = X X Z (same three-state HMM as before).
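As a concrete check, here is a small sketch (my addition, reusing the hypothetical array encoding from earlier) that runs the forward recursion on X X Z; it gives α_2(3) = 1/12, answering the question posed two slides back, and P(X X Z) = 1/36.

```python
import numpy as np

pi = np.array([1/2, 1/2, 0])
A  = np.array([[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]])
B  = np.array([[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]])   # rows: states; columns: X, Y, Z

def forward(obs):
    """alpha[t-1, i-1] = P(O_1 .. O_t, q_t = S_i | lambda)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]                     # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]   # alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i alpha_t(i) a_ij
    return alpha

alpha = forward([0, 0, 2])       # O_1 O_2 O_3 = X X Z
print(alpha[1, 2])               # alpha_2(3) = 1/12
print(alpha[-1].sum())           # P(X X Z) = sum_i alpha_3(i) = 1/36
```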
34
Easy Question. We can cheaply compute α_t(i) = P(O_1 O_2 … O_t, q_t = S_i). (How) can we cheaply compute P(O_1 O_2 … O_t)? (How) can we cheaply compute P(q_t = S_i | O_1 O_2 … O_t)?
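Answers (not spelled out on the slide, but they follow directly from the definition of α): P(O_1 O_2 … O_t) = Σ_i α_t(i), and P(q_t = S_i | O_1 O_2 … O_t) = α_t(i) / Σ_j α_t(j).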
35
Most probable path given observations
36
Efficient MPP computation. We're going to compute the following variables: δ_t(i) = max over q_1 q_2 .. q_t-1 of P(q_1 q_2 .. q_t-1, q_t = S_i, O_1 .. O_t) = the probability of the path of length t-1 with the maximum chance of doing all these things: …occurring, and …ending up in state S_i, and …producing output O_1 … O_t. Define mpp_t(i) = that path. So δ_t(i) = Prob(mpp_t(i)).
37
The Viterbi Algorithm. Now, suppose we have all the δ_t(i)'s and mpp_t(i)'s for all i — each mpp_t(i) ends in state S_i at time t with probability δ_t(i). How do we get δ_t+1(j) and mpp_t+1(j)?
38
The Viterbi Algorithm (time t → time t+1). The most probable path whose last two states are S_i S_j is the most probable path to S_i, followed by the transition S_i → S_j. What is the prob of that path? δ_t(i) × P(S_i → S_j, O_t+1 | λ) = δ_t(i) a_ij b_j(O_t+1). SO the most probable path to S_j has S_i* as its penultimate state, where i* = argmax_i δ_t(i) a_ij b_j(O_t+1). Summary (with i* defined above): δ_t+1(j) = δ_t(i*) a_i*j b_j(O_t+1), and mpp_t+1(j) = mpp_t(i*) followed by S_j.
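A minimal Viterbi sketch (my addition, not the slides' code), using the same hypothetical pi, A, B arrays as before; it tracks δ_t(j) and the back-pointers i*, then reads off a most probable path for X X Z.

```python
import numpy as np

pi = np.array([1/2, 1/2, 0])
A  = np.array([[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]])
B  = np.array([[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]])   # rows: states; columns: X, Y, Z

def viterbi(obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))                 # delta[t-1, j-1] = delta_t(j)
    back  = np.zeros((T, N), dtype=int)      # back[t-1, j-1]  = the i* for state j at time t
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores   = delta[t-1][:, None] * A             # scores[i, j] = delta_t(i) * a_ij
        back[t]  = scores.argmax(axis=0)               # i* = argmax_i delta_t(i) a_ij
        delta[t] = scores.max(axis=0) * B[:, obs[t]]   # delta_{t+1}(j) = delta_t(i*) a_{i*j} b_j(O_{t+1})
    path = [int(delta[-1].argmax())]                   # best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))            # follow the back-pointers
    return [f"S{i+1}" for i in reversed(path)], delta[-1].max()

print(viterbi([0, 0, 2]))   # a most probable state path for X X Z, and its probability
```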
39
What's Viterbi used for? Classic example: speech recognition — signal → words. The HMM observable is the signal; the hidden state is part of word formation. What is the most probable word given this signal? (Utterly gross simplification: in practice there are many levels of inference, not one big jump.)
40
HMMs are used and useful. But how do you design an HMM? Occasionally (e.g. in our robot example) it is reasonable to deduce the HMM from first principles. But usually, especially in speech or genetics, it is better to infer it from large amounts of data: O_1 O_2 .. O_T with a big "T". (The figure contrasts the observations used previously in the lecture with the observations used in the next bit.)
41
Inferring an HMM. Remember, we've been doing things like P(O_1 O_2 .. O_T | λ). That "λ" is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them. As usual, we could use: (i) MAX LIKELIHOOD: λ = argmax_λ P(O_1 .. O_T | λ); or (ii) BAYES: work out P(λ | O_1 .. O_T) and then take E[λ] or argmax_λ P(λ | O_1 .. O_T).
42
Max likelihood HMM estimation. Define γ_t(i) = P(q_t = S_i | O_1 O_2 … O_T, λ) and ε_t(i,j) = P(q_t = S_i, q_t+1 = S_j | O_1 O_2 … O_T, λ). γ_t(i) and ε_t(i,j) can be computed efficiently for all i, j, t. Σ_t γ_t(i) = the expected number of transitions out of state i during the path; Σ_t ε_t(i,j) = the expected number of transitions from state i to state j during the path.
43
HMM estimation
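The re-estimation formulas that belong here (the standard Baum-Welch updates, written out because the original equations did not survive the export) are: π̂_i = γ_1(i); â_ij = Σ_{t=1..T-1} ε_t(i,j) / Σ_{t=1..T-1} γ_t(i), i.e. expected transitions i → j divided by expected transitions out of i; and b̂_i(k) = Σ_{t : O_t = k} γ_t(i) / Σ_{t=1..T} γ_t(i), i.e. expected time spent in state i while emitting symbol k divided by expected time spent in state i.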
44
EM for HMMs. If we knew λ we could estimate EXPECTATIONS of quantities such as the expected number of times in state i and the expected number of transitions i → j. If we knew quantities such as the expected number of times in state i and the expected number of transitions i → j, we could compute the MAX LIKELIHOOD estimate of λ = ({a_ij}, {b_i(k)}, {π_i}). Roll on the EM Algorithm…
45
EM 4 HMMs. 1. Get your observations O_1 … O_T. 2. Guess your first λ estimate λ(0); set k = 0. 3. k = k + 1. 4. Given O_1 … O_T and λ(k−1), compute γ_t(i) and ε_t(i,j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N. 5. Compute the expected frequency of state i and the expected frequency of transitions i → j. 6. Compute new estimates of a_ij, b_i(k), π_i accordingly; call them λ(k). 7. Go to 3, unless converged. Also known (for the HMM case) as the Baum-Welch algorithm. (A code sketch follows below.)
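One way these steps can be written down is sketched below (my formulation, not the slides' code). The function name baum_welch and the random initialisation are illustrative choices; the sketch does no rescaling, whereas a production implementation would scale the α/β tables or work in log space to avoid underflow on long sequences.

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=50, seed=0):
    """Rough Baum-Welch (EM) sketch for a discrete-output HMM."""
    obs = np.asarray(obs)                    # observation symbols in {0, .., M-1}
    T = len(obs)
    rng = np.random.default_rng(seed)
    # Step 2: guess a first estimate lambda(0) -- random row-stochastic matrices
    pi = rng.random(N); pi /= pi.sum()
    A  = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B  = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 4 (E-step): forward and backward passes, then gamma and epsilon
        alpha = np.zeros((T, N)); beta = np.ones((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
        likelihood = alpha[-1].sum()                       # P(O | current lambda)
        gamma = alpha * beta / likelihood                  # gamma[t, i] = P(q_t = S_i | O, lambda)
        eps = (alpha[:-1, :, None] * A[None, :, :] *
               (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood   # eps[t, i, j]
        # Steps 5-6 (M-step): expected frequencies -> new estimates of pi, A, B
        pi = gamma[0]
        A = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return pi, A, B, likelihood

# e.g. fit a 3-state, 3-symbol HMM to the toy sequence X X Z X Y Z (0=X, 1=Y, 2=Z)
print(baum_welch([0, 0, 2, 0, 1, 2], N=3, M=3)[3])
```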
46
Bad news: there are lots of local minima. Good news: the local minima are usually adequate models of the data. Notice: EM does not estimate the number of states — that must be given. Often, HMMs are forced to have some links with zero probability; this is done by setting a_ij = 0 in the initial estimate λ(0). An easy extension of everything seen today: HMMs with real-valued outputs.
47
What You Should Know. What is an HMM? Computing (and defining) α_t(i). The Viterbi algorithm. Outline of the EM algorithm. To be very happy with the kind of maths and analysis needed for HMMs. Fairly thorough reading of Rabiner* up to page 266 [up to but not including "IV. Types of HMMs"]. *L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
48
Markov Systems, Markov Decision Processes, and Dynamic Programming
49
Discounted Rewards. An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in their life? 20 + 20 + 20 + 20 + 20 + … = infinity. What's wrong with this argument?
50
Discounted Rewards. "A reward (payment) in the future is not worth quite as much as a reward now" — because of the chance of obliteration, and because of inflation. Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?
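Worked out: with discount factor 0.9 the A.P.'s future discounted sum of rewards is 20 + 0.9·20 + 0.9²·20 + … = 20 / (1 − 0.9) = 200, i.e. about $200K rather than infinity.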
51
Discount Factors. People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is: (reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + … (an infinite sum).
52
The Academic Life. States (with rewards): A. Assistant Prof (20), B. Assoc. Prof (60), T. Tenured Prof (400), S. On the Street (10), D. Dead (0). Assume discount factor γ = 0.9. Define J_A = expected discounted future rewards starting in state A, J_B = expected discounted future rewards starting in state B, and likewise J_T, J_S, J_D for states T, S, D. How do we compute J_A, J_B, J_T, J_S, J_D? (The transition probabilities on the diagram's arcs, such as 0.7, 0.6, 0.3 and 0.2, are not reproduced here.)
53
Computing the Future Rewards of an Academic
54
A Markov System with Rewards… Has a set of states {S_1, S_2, ··, S_N}. Has a transition probability matrix P, with P_ij = Prob(Next = S_j | This = S_i). Each state has a reward {r_1, r_2, ··, r_N}. There's a discount factor γ, with 0 < γ < 1. On each time step: 0. Assume your state is S_i. 1. You get given reward r_i. 2. You randomly move to another state, with P(NextState = S_j | This = S_i) = P_ij. 3. All future rewards are discounted by γ.
55
Solving a Markov System. Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. Then J*(S_i) = r_i + γ × (expected future rewards starting from your next state) = r_i + γ (P_i1 J*(S_1) + P_i2 J*(S_2) + ··· + P_iN J*(S_N)). Using vector notation, write J = (J*(S_1), …, J*(S_N))ᵀ, R = (r_1, …, r_N)ᵀ, and P = the N × N matrix of transition probabilities P_ij. Question: can you invent a closed-form expression for J in terms of R, P and γ?
56
Solving a Markov System with Matrix Inversion Upside: You get an exact answer Downside:
57
Solving a Markov System with Matrix Inversion Upside: You get an exact answer Downside: If you have 100,000 states you’re solving a 100,000 by 100,000 system of equations.
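The closed form the earlier question is after follows from the vector equation J = R + γPJ: rearranging gives (I − γP) J = R, so J = (I − γP)⁻¹ R. That inverse (or linear solve) is the matrix inversion referred to above, roughly N³ work — hence the downside for 100,000 states.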
58
Value Iteration: another way to solve a Markov System Define J 1 (S i ) = Expected discounted sum of rewards over the next 1 time step. J 2 (S i ) = Expected discounted sum rewards during next 2 steps J 3 (S i ) = Expected discounted sum rewards during next 3 steps : J k (S i ) = Expected discounted sum rewards during next k steps J 1 (S i ) = (what?) J 2 (S i ) = (what?) : J k+1 (S i ) = (what?)
59
Value Iteration: another way to solve a Markov System. Define J_1(S_i) = expected discounted sum of rewards over the next 1 time step, J_2(S_i) = expected discounted sum of rewards during the next 2 steps, J_3(S_i) = expected discounted sum of rewards during the next 3 steps, …, J_k(S_i) = expected discounted sum of rewards during the next k steps. Then J_1(S_i) = r_i. What are J_2(S_i) and, in general, J_k+1(S_i)? (N = number of states.)
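The standard answers to the blanks (not filled in on these slides): J_1(S_i) = r_i, J_2(S_i) = r_i + γ Σ_{j=1..N} P_ij J_1(S_j), and in general J_k+1(S_i) = r_i + γ Σ_{j=1..N} P_ij J_k(S_j).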
60
Let's do Value Iteration. Three states: SUN (reward +4), WIND (reward 0), HAIL (reward -8), with discount factor γ = 1/2 = 0.5. We will fill in a table of J_k(SUN), J_k(WIND), J_k(HAIL) for k = 1, …, 5 (completed on the next slide).
61
Let's do Value Iteration. Three states: SUN (reward +4), WIND (reward 0), HAIL (reward -8), with discount factor γ = 1/2 = 0.5.
k | J_k(SUN) | J_k(WIND) | J_k(HAIL)
1 | 4 | 0 | -8
2 | 5 | -1 | -10
3 | 5 | -1.25 | -10.75
4 | 4.94 | -1.44 | -11
5 | 4.88 | -1.52 | -11.11
62
Value Iteration for solving Markov Systems. Compute J_1(S_i) for each i; compute J_2(S_i) for each i; …; compute J_k(S_i) for each i. As k → ∞, J_k(S_i) → J*(S_i). Why? Because the contribution of rewards more than k steps away is bounded by γ^k (max reward) / (1 − γ), which shrinks to zero. When to stop? When max_i |J_k+1(S_i) − J_k(S_i)| < ε. This is faster than matrix inversion (N³ style) if the transition matrix is sparse. (A code sketch follows below.)
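A short sketch of this procedure on the SUN/WIND/HAIL example. The transition matrix P below is an assumption (the original diagram did not survive the export) chosen to be consistent with the table two slides back; the last line shows the matrix-inversion answer for comparison.

```python
import numpy as np

r = np.array([4.0, 0.0, -8.0])            # rewards for SUN, WIND, HAIL
P = np.array([[0.5, 0.5, 0.0],            # assumed: SUN  -> SUN 1/2, WIND 1/2
              [0.5, 0.0, 0.5],            # assumed: WIND -> SUN 1/2, HAIL 1/2
              [0.0, 0.5, 0.5]])           # assumed: HAIL -> WIND 1/2, HAIL 1/2
gamma = 0.5

J = r.copy()                              # J_1(S_i) = r_i
print(1, J)
for k in range(2, 6):
    J = r + gamma * (P @ J)               # J_{k+1}(S_i) = r_i + gamma * sum_j P_ij J_k(S_j)
    print(k, np.round(J, 2))              # reproduces the rows of the table above

# Exact answer for comparison: J* = (I - gamma P)^(-1) R
print(np.linalg.solve(np.eye(3) - gamma * P, r))
```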
63
A Markov Decision Process (γ = 0.9). Four states with rewards: Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10). You run a startup company. In every state you must choose between Saving money (S) or Advertising (A). (The diagram's transition arcs, labelled with probabilities 1 and 1/2, are not reproduced here.)
64
Markov Decision Processes. An MDP has: a set of states {S_1 ··· S_N}; a set of actions {a_1 ··· a_M}; a set of rewards {r_1 ··· r_N} (one for each state); and a transition probability function P_ij^k = Prob(Next = S_j | This = S_i and I use action a_k). On each step: 0. Call the current state S_i. 1. Receive reward r_i. 2. Choose an action from {a_1 ··· a_M}. 3. If you choose action a_k you'll move to state S_j with probability P_ij^k. 4. All future rewards are discounted by γ.
65
A Policy. A policy is a mapping from states to actions. Examples — Policy Number 1: PU → S, PF → A, RU → S, RF → A. Policy Number 2: PU → A, PF → A, RU → A, RF → A. How many possible policies are there in our example? Which of the above two policies is best? How do you compute the optimal policy? (The two small diagrams showing each policy's arcs over the states PU +0, PF +0, RU +10, RF +10 are not reproduced here.)
66
Interesting Fact For every M.D.P. there exists an optimal policy. It’s a policy such that for every possible start state there is no better option than to follow the policy. (Not proved in this lecture)
67
Computing the Optimal Policy Idea One: Run through all possible policies. Select the best. What’s the problem ??
68
Optimal Value Function. Define J*(S_i) = expected discounted future rewards, starting from state S_i, assuming we use the optimal policy. (Example MDP: three states S_1 (+0), S_2 (+3), S_3 (+2) with actions A and B; the diagram's arcs, labelled with probabilities such as 1/2, 1/3 and 1, are not reproduced here.) Question: what (by inspection) is an optimal policy for that MDP (assume γ = 0.9)? What is J*(S_1)? What is J*(S_2)? What is J*(S_3)?
69
Computing the Optimal Value Function with Value Iteration Define J k (S i ) = Maximum possible expected sum of discounted rewards I can get if I start at state S i and I live for k time steps. Note that J 1 (S i ) = r i
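With actions in the picture, the recursion becomes (standard form, written out here because the slide's equation was an image): J_k+1(S_i) = max over actions a of [ r_i + γ Σ_{j=1..N} P_ij^a J_k(S_j) ], where P_ij^a is the probability of moving to S_j from S_i under action a. This is the update used to fill the table on the next slides, and it is the "Bellman's Equation" referred to a few slides later.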
70
Let's compute J_k(S_i) for our example: a table of J_k(PU), J_k(PF), J_k(RU), J_k(RF) for k = 1, …, 6 (filled in on the next slide).
71
Let's compute J_k(S_i) for our example:
k | J_k(PU) | J_k(PF) | J_k(RU) | J_k(RF)
1 | 0 | 0 | 10 | 10
2 | 0 | 4.5 | 14.5 | 19
3 | 2.03 | 6.53 | 25.08 | 18.55
4 | 3.852 | 12.20 | 29.63 | 19.26
5 | 7.22 | 15.07 | 32.00 | 20.40
6 | 10.03 | 17.65 | 33.58 | 22.43
72
Bellman's Equation — Value Iteration for solving MDPs. Compute J_1(S_i) for all i; compute J_2(S_i) for all i; …; compute J_n(S_i) for all i; … until converged. Also known as Dynamic Programming.
73
Finding the Optimal Policy. 1. Compute J*(S_i) for all i using Value Iteration (a.k.a. Dynamic Programming). 2. Define the best action in state S_i as shown below. (Why does this work?)
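The definition that goes in step 2 (standard, reconstructed here because the slide's formula was an image): π*(S_i) = argmax over actions a of [ r_i + γ Σ_{j=1..N} P_ij^a J*(S_j) ] — i.e. the action that looks best one step ahead when the future is valued by J*. It works because J* already accounts for behaving optimally from the next state onwards.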
74
Applications of MDPs This extends the search algorithms of your first lectures to the case of probabilistic next states. Many important problems are MDPs…. … Robot path planning … Travel route planning … Elevator scheduling … Bank customer retention … Autonomous aircraft navigation … Manufacturing processes … Network switching & routing
75
Asynchronous D.P. Value Iteration: “Backup S 1 ”, “Backup S 2 ”, ···· “Backup S N ”, then “Backup S 1 ”, “Backup S 2 ”, ···· repeat : : There’s no reason that you need to do the backups in order! Random Order …still works. Easy to parallelize (Dyna, Sutton 91) On-Policy Order Simulate the states that the system actually visits. Efficient Order e.g. Prioritized Sweeping [Moore 93] Q-Dyna [Peng & Williams 93]
76
Policy Iteration: another way to compute optimal policies. Write π(S_i) = the action selected in the i'th state; then π is a policy. Write π(t) = the policy on the t'th iteration. Algorithm: π(0) = any randomly chosen policy. For all i, compute J(0)(S_i) = the long-term reward starting at S_i using π(0). Then improve: π(1)(S_i) = argmax_a [ r_i + γ Σ_j P_ij^a J(0)(S_j) ]; compute J(1); π(2)(S_i) = …; and so on. Keep computing π(1), π(2), π(3), … until π(k) = π(k+1). You now have an optimal policy. (A code sketch follows below.)
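A generic sketch of this loop (my formulation, not the slides' code): policy evaluation by solving the linear system exactly, then greedy improvement, repeated until the policy stops changing. Here P is assumed to be a list of per-action N × N transition matrices, r the reward vector, and gamma the discount factor.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    N, num_actions = len(r), len(P)
    policy = np.zeros(N, dtype=int)                      # pi(0): an arbitrary starting policy
    while True:
        # Policy evaluation: solve J = r + gamma * P_pi J exactly for the current policy
        P_pi = np.array([P[policy[i]][i] for i in range(N)])
        J = np.linalg.solve(np.eye(N) - gamma * P_pi, r)
        # Policy improvement: greedy one-step lookahead against J
        Q = np.array([r + gamma * (P[a] @ J) for a in range(num_actions)])   # Q[a, i]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, J                             # pi(k) == pi(k+1): optimal policy found
        policy = new_policy
```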
77
Policy Iteration vs. Value Iteration: which is best? It depends. Lots of actions? Choose Policy Iteration. Already got a fair policy? Policy Iteration. Few actions, acyclic? Value Iteration. Best of both worlds: Modified Policy Iteration [Puterman] — a simple mix of value iteration and policy iteration. A third approach: Linear Programming.
78
Time to Moan What’s the biggest problem(s) with what we’ve seen so far?
79
Dealing with large numbers of states. Don't use a table with one VALUE entry per STATE (s_1, s_2, …, s_15122189)… use a Function Approximator (generalizers, e.g. splines) or Hierarchies (variable resolution, multi-resolution, memory-based methods) [Munos 1999].
80
Function approximation for value functions. Polynomials [Samuel; Boyan; much O.R. literature] — checkers, channel routing, radio therapy. Neural nets [Barto & Sutton; Tesauro; Crites; Singh; Tsitsiklis] — backgammon, pole balancing, elevators, Tetris, cell phones. Splines — economists, controls. Downside: all convergence guarantees disappear.
81
Memory-based Value Functions. J("state") = J(most similar state in memory to "state"), or the average of J over the 20 most similar states, or a weighted average of J over the 20 most similar states. [Jeff Peng; Atkeson & Schaal; Geoff Gordon (proved stuff); Schneider; Boyan & Moore 98.] "Planet Mars Scheduler".
82
Hierarchical Methods. Continuous state space: "split a state when it is statistically significant that a split would improve performance" (e.g. Simmons et al. 83; Chapman & Kaelbling 92; Mark Ring 94; Munos 96, with interpolation!) or "prove a need for a higher resolution" (Moore 93; Moore & Atkeson 95). Discrete space: Chapman & Kaelbling 92; McCallum 95 (includes hidden state) — a kind of decision-tree value function. Multiresolution: a hierarchy with high-level "managers" abstracting low-level "servants" — many O.R. papers; Dayan & Sejnowski's Feudal learning; Dietterich 1998 (MAX-Q hierarchy); Moore, Baird & Kaelbling 2000 (airports hierarchy).
83
What You Should Know Definition of a Markov System with Discounted rewards How to solve it with Matrix Inversion How (and why) to solve it with Value Iteration Definition of an MDP, and value iteration to solve an MDP Policy iteration Great respect for the way this formalism generalizes the deterministic searching of the start of the class But awareness of what has been sacrificed.