Siddiqi and Moore, Fast Inference and Learning in Large-State-Space HMMs Sajid M. Siddiqi Andrew W. Moore The Auton Lab Carnegie Mellon.

1 Siddiqi and Moore, Fast Inference and Learning in Large-State-Space HMMs Sajid M. Siddiqi Andrew W. Moore The Auton Lab Carnegie Mellon University

3 Siddiqi and Moore, Sajid Siddiqi: Happy Sajid Siddiqi: Discontented

4 Siddiqi and Moore, Hidden Markov Models (state diagram with edge probabilities such as 1/3 and 1; state sequence q_0, q_1, q_2, q_3, q_4)

5 Siddiqi and Moore, Hidden Markov Models (state diagram with edge probabilities such as 1/3 and 1; state sequence q_0, q_1, q_2, q_3, q_4)
Each of these probability tables is identical.
Notation: a_ij = P(q_{t+1} = s_j | q_t = s_i)

i    P(q_{t+1}=s_1 | q_t=s_i)   P(q_{t+1}=s_2 | q_t=s_i)   …   P(q_{t+1}=s_j | q_t=s_i)   …   P(q_{t+1}=s_N | q_t=s_i)
1    a_11   a_12   …   a_1j   …   a_1N
2    a_21   a_22   …   a_2j   …   a_2N
3    a_31   a_32   …   a_3j   …   a_3N
:    :      :          :          :
i    a_i1   a_i2   …   a_ij   …   a_iN
N    a_N1   a_N2   …   a_Nj   …   a_NN

6 Siddiqi and Moore, Observation Model (hidden states q_0, q_1, q_2, q_3, q_4 with observations O_0, O_1, O_2, O_3, O_4)

7 Siddiqi and Moore, Observation Model (hidden states q_0, q_1, q_2, q_3, q_4 with observations O_0, O_1, O_2, O_3, O_4)
Notation: b_i(k) = P(O_t = k | q_t = s_i)

i    P(O_t=1 | q_t=s_i)   P(O_t=2 | q_t=s_i)   …   P(O_t=k | q_t=s_i)   …   P(O_t=M | q_t=s_i)
1    b_1(1)   b_1(2)   …   b_1(k)   …   b_1(M)
2    b_2(1)   b_2(2)   …   b_2(k)   …   b_2(M)
3    b_3(1)   b_3(2)   …   b_3(k)   …   b_3(M)
:    :        :            :            :
i    b_i(1)   b_i(2)   …   b_i(k)   …   b_i(M)
:    :        :            :            :
N    b_N(1)   b_N(2)   …   b_N(k)   …   b_N(M)

8-10 Siddiqi and Moore, Some Famous HMM Tasks
Question 1: State Estimation. What is P(q_T = S_i | O_1 O_2 … O_T)?

11-12 Siddiqi and Moore, Some Famous HMM Tasks
Question 1: State Estimation. What is P(q_T = S_i | O_1 O_2 … O_T)?
Question 2: Most Probable Path. Given O_1 O_2 … O_T, what is the most probable path that I took?

13 Siddiqi and Moore, Some Famous HMM Tasks
Question 1: State Estimation. What is P(q_T = S_i | O_1 O_2 … O_T)?
Question 2: Most Probable Path. Given O_1 O_2 … O_T, what is the most probable path that I took?
Example path: Woke up at 8.35, Got on Bus at 9.46, Sat in lecture 10.05-11.22…

14-15 Siddiqi and Moore, Some Famous HMM Tasks
Question 1: State Estimation. What is P(q_T = S_i | O_1 O_2 … O_T)?
Question 2: Most Probable Path. Given O_1 O_2 … O_T, what is the most probable path that I took?
Question 3: Learning HMMs. Given O_1 O_2 … O_T, what is the maximum likelihood HMM that could have produced this string of observations?

16 Siddiqi and Moore, Some Famous HMM Tasks
Question 1: State Estimation. What is P(q_T = S_i | O_1 O_2 … O_T)?
Question 2: Most Probable Path. Given O_1 O_2 … O_T, what is the most probable path that I took?
Question 3: Learning HMMs. Given O_1 O_2 … O_T, what is the maximum likelihood HMM that could have produced this string of observations?
(Diagram: states labelled Eat, Bus, walk with transition probabilities a_AA, a_AB, a_BA, a_BB, a_BC, a_CB, a_CC; observations O_{t-1}, O_t, O_{t+1} with emission probabilities b_A(O_{t-1}), b_B(O_t), b_C(O_{t+1}).)

17-18 Siddiqi and Moore, Basic Operations in HMMs
For an observation sequence O = O_1 … O_T, the three basic HMM operations are:

Problem                                               Algorithm           Complexity
Evaluation: calculating P(O | λ)                      Forward-Backward    O(TN^2)
Inference: computing Q* = argmax_Q P(O, Q | λ)        Viterbi Decoding    O(TN^2)
Learning: computing λ* = argmax_λ P(O | λ)            Baum-Welch (EM)     O(TN^2)

T = # timesteps, N = # states
This talk: a simple approach to reducing the complexity in N.

19 Siddiqi and Moore, Reducing Quadratic N penalty
Why does it matter?
- Quadratic HMM algorithms hinder HMM computations when N is large.
- Several promising applications for efficient large-state-space HMM algorithms in:
  - biological sequence analysis
  - speech recognition
  - real-time HMM systems such as for activity monitoring

20-21 Siddiqi and Moore, Idea One: Sparse Transition Matrix
Only K << N non-zero next-state probabilities.

22 Siddiqi and Moore, Idea One: Sparse Transition Matrix
Only K << N non-zero next-state probabilities.
Only O(TNK).

23 Siddiqi and Moore, Idea One: Sparse Transition Matrix
Only K << N non-zero next-state probabilities.
Only O(TNK).
But can get very badly confused by "impossible transitions".
Cannot learn the sparse structure (once chosen, it cannot change).

24 Siddiqi and Moore, Dense-Mostly-Constant Transitions
- K non-constant probabilities per row
- DMC HMMs comprise a richer and more expressive class of models than sparse HMMs
(Figure: a DMC transition matrix with K=2)

25 Siddiqi and Moore, Dense-Mostly-Constant Transitions
The transition model for state i now comprises:
- NC_i = { j : s_i → s_j is a non-constant transition probability }
- c_i = the transition probability for s_i to all states not in NC_i
- a_ij = the non-constant transition probability for s_i → s_j
Example: NC_3 = {2, 5}, c_3 = 0.05, a_32 = 0.25, a_35 = 0.6
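To make the notation concrete, here is a minimal Python sketch that expands one DMC row into its dense form. The function name `dmc_row_to_dense` is hypothetical, and N = 5 is an assumption (it is the value for which the slide's example row sums to one); neither appears in the slides.

```python
import numpy as np

def dmc_row_to_dense(n_states, nc, a, c):
    """Expand a DMC row into a dense probability vector.

    nc: set of 1-indexed target states with non-constant probability (NC_i)
    a:  dict mapping each j in nc to a_ij
    c:  constant probability c_i shared by every state not in nc
    """
    row = np.full(n_states, c, dtype=float)
    for j in nc:
        row[j - 1] = a[j]  # the slides index states from 1
    return row

# Slide example for state 3: NC_3 = {2, 5}, c_3 = 0.05, a_32 = 0.25, a_35 = 0.6.
# N = 5 is an assumption chosen so that the row is a valid distribution.
row3 = dmc_row_to_dense(5, {2, 5}, {2: 0.25, 5: 0.6}, 0.05)
```

Storing only NC_i, the K values a_ij and the single constant c_i is what lets later steps avoid touching all N entries of a row.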

26-31 Siddiqi and Moore, HMM Filtering
Goal: P(q_t = s_i | O_1, O_2, …, O_t), computed from the forward variables α_t(i).
A table with columns α_t(1), α_t(2), α_t(3), …, α_t(N) is filled in one row per timestep, t = 1, 2, 3, …, 9.
Cost: O(TN^2)
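The filtering equations did not survive in this transcript, so the following is a sketch of the standard forward recursion for a discrete HMM rather than a copy of the slide. The initial distribution `pi` and the toy parameters are assumptions made for illustration.

```python
import numpy as np

def forward_filter(pi, A, B, obs):
    """Return filt[t, i] = P(q_t = s_i | O_1..O_t) for a discrete HMM.

    pi:  (N,) initial state distribution (assumed; not given on the slides)
    A:   (N, N) transition matrix, A[i, j] = a_ij
    B:   (N, M) observation matrix, B[i, k] = b_i(k)
    obs: sequence of observation symbols in 0..M-1
    """
    T, N = len(obs), len(pi)
    filt = np.zeros((T, N))
    alpha = pi * B[:, obs[0]]
    filt[0] = alpha / alpha.sum()
    for t in range(1, T):
        # filt[t-1] @ A costs O(N^2); repeated T times this gives O(TN^2).
        alpha = (filt[t - 1] @ A) * B[:, obs[t]]
        filt[t] = alpha / alpha.sum()  # normalising gives the filtered posterior
    return filt

# Toy two-state model with made-up numbers.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
filt = forward_filter(pi, A, B, [0, 1, 1])
```

Normalising at every step also keeps the values from underflowing on long sequences.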

32-33 Siddiqi and Moore, Fast Evaluation in DMC HMMs
- One term is O(N), but common to all j per timestep t.
- The other term is O(K) for each α_t(j).
This yields O(TNK) complexity for the evaluation problem.
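The two costs on this slide can be illustrated by splitting the forward sum into a shared constant part and per-entry corrections. This is a sketch under assumptions: the array layout (`nc_idx`, `nc_val`), the function name, and the toy numbers are hypothetical, and the slide's own equation is not preserved here.

```python
import numpy as np

def dmc_forward_step(alpha, c, nc_idx, nc_val, b_next):
    """One forward-recursion step using the DMC structure.

    alpha:  (N,) forward variables at time t
    c:      (N,) constant probability c_i of each row
    nc_idx: (N, K) 0-indexed columns of the non-constant entries of each row
    nc_val: (N, K) the corresponding probabilities a_ij
    b_next: (N,) observation likelihoods b_j(O_{t+1})
    """
    # Shared term sum_i alpha_t(i) * c_i: O(N), identical for every j.
    pred = np.full(alpha.shape[0], alpha @ c)
    # Corrections alpha_t(i) * (a_ij - c_i) for the N*K non-constant entries.
    np.add.at(pred, nc_idx, alpha[:, None] * (nc_val - c[:, None]))
    return pred * b_next

# Tiny check against the dense O(N^2) step (N = 4, K = 1, made-up numbers).
c = np.array([0.1, 0.2, 0.05, 0.2])
nc_idx = np.array([[1], [0], [3], [2]])
nc_val = np.array([[0.7], [0.4], [0.85], [0.4]])
A = np.tile(c[:, None], (1, 4))
A[np.arange(4)[:, None], nc_idx] = nc_val
alpha = np.array([0.1, 0.2, 0.3, 0.4])
b_next = np.array([0.5, 0.6, 0.7, 0.8])
fast = dmc_forward_step(alpha, c, nc_idx, nc_val, b_next)
slow = (alpha @ A) * b_next
```

`np.add.at` is used instead of fancy-index `+=` so that several rows pointing at the same column all contribute.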

34 Siddiqi and Moore, Fast Inference in DMC HMMs
The Viterbi algorithm uses dynamic programming to calculate the globally optimal state sequence Q_g = argmax_Q P(Q, O | λ).
Define δ_t(i) as the highest probability of any single state path that ends in s_i at time t and accounts for the first t observations.
The δ variables can be computed in O(TN^2) time, with an O(N) inductive step.
Under the DMC assumption, this step can be carried out in O(K) time:
- O(N), but common to all j per timestep t
- O(K) for each δ_t(j)
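A sketch of how the Viterbi inductive step might exploit the same structure, replacing the sum with a max. It relies on an assumption that is labelled here and in the code: every non-constant a_ij is at least as large as its row constant c_i, so taking the max with the shared term never overestimates. The function name and toy numbers are hypothetical.

```python
import numpy as np

def dmc_viterbi_step(delta, c, nc_idx, nc_val, b_next):
    """One Viterbi step using the DMC structure.

    ASSUMPTION: nc_val[i, k] >= c[i] for every entry. Without it the shared
    term could exceed the true value for a column j that lies in NC_i.
    """
    # Shared term max_i delta_t(i) * c_i: O(N), identical for every j.
    best = np.full(delta.shape[0], np.max(delta * c))
    # Non-constant candidates delta_t(i) * a_ij, N*K of them in total.
    np.maximum.at(best, nc_idx, delta[:, None] * nc_val)
    return best * b_next

# Tiny check against the dense O(N^2) step (N = 4, K = 1, made-up numbers).
c = np.array([0.1, 0.2, 0.05, 0.2])
nc_idx = np.array([[1], [0], [3], [2]])
nc_val = np.array([[0.7], [0.4], [0.85], [0.4]])
A = np.tile(c[:, None], (1, 4))
A[np.arange(4)[:, None], nc_idx] = nc_val
delta = np.array([0.1, 0.2, 0.3, 0.4])
b_next = np.array([0.5, 0.6, 0.7, 0.8])
fast = dmc_viterbi_step(delta, c, nc_idx, nc_val, b_next)
slow = (delta[:, None] * A).max(axis=0) * b_next
```

A full decoder would also record the maximising predecessor for each j in order to backtrack; that bookkeeping is omitted here.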

35 Siddiqi and Moore, Learning a DMC HMM

36-37 Siddiqi and Moore, Learning a DMC HMM
Idea One: Ask user to tell us the DMC structure, then learn the parameters using EM.
- Simple
- But in general, we don't know the DMC structure

38-40 Siddiqi and Moore, Learning a DMC HMM
Idea Two: Use EM to learn the DMC structure too
1. Guess DMC structure
2. Find expected transition counts and observation parameters, given current model and observations
3. Find maximum likelihood DMC model given counts
4. Go to 2
DMC structure can (and does) change!
In fact, just start with an all-constant transition model.
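One plausible reading of step 3 is sketched below; the slides do not give the formula, so treat this as an assumption. For each row, the K largest normalised expected counts become the non-constant entries and the remaining probability mass is spread evenly as the constant c_i. The function name `dmc_m_step` and the count matrix are made up for illustration.

```python
import numpy as np

def dmc_m_step(counts, K):
    """Fit a DMC transition model to an (N, N) matrix of expected counts.

    Returns (nc_idx, nc_val, c): the 0-indexed columns and values of the K
    non-constant entries per row, and the per-row constant probability.
    """
    N = counts.shape[0]
    P = counts / counts.sum(axis=1, keepdims=True)   # dense row-normalised estimate
    nc_idx = np.argsort(-P, axis=1)[:, :K]           # K largest entries per row
    nc_val = np.take_along_axis(P, nc_idx, axis=1)
    c = (1.0 - nc_val.sum(axis=1)) / (N - K)         # leftover mass shared equally
    return nc_idx, nc_val, c

# Made-up expected counts for a 4-state model.
counts = np.array([[6.0, 1.0, 1.0, 2.0],
                   [1.0, 1.0, 7.0, 1.0],
                   [2.0, 2.0, 2.0, 4.0],
                   [5.0, 3.0, 1.0, 1.0]])
nc_idx, nc_val, c = dmc_m_step(counts, K=1)
```

Because the top-K set is recomputed from the counts on every iteration, the structure is free to change between EM rounds, which matches the remark that it "can (and does) change".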

41 Siddiqi and Moore, Learning a DMC HMM
Step 2: Find expected transition counts and observation parameters, given current model and observations

42-45 Siddiqi and Moore, We want: new estimate of … where …

46-50 Siddiqi and Moore, We want … where …
α is a T × N matrix and β is a T × N matrix. Can get this in O(TN) time.

51-58 Siddiqi and Moore, We want … where …
α and β are T × N matrices; S is an N × N matrix.
S_24 is the dot product of columns α_{*2} and β_{*4}.
Cost: O(TN^2)
Speedups considered:
- Strassen
- Approximate α by DMC
- Approximate randomized A^T B
- Sparse structure fine
- Fixed DMC is fine
- Speedup without approximation
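The dot-product-of-columns picture can be written as a single matrix product. The sketch below assumes the standard Baum-Welch identity, in which the expected count for i → j is proportional to a_ij times the sum over t of α_t(i) b_j(O_{t+1}) β_{t+1}(j); it further assumes that `beta_star[t, j]` already holds b_j(O_{t+1}) β_{t+1}(j) aligned with row t of `alpha`. The random arrays are placeholders, not real forward and backward values.

```python
import numpy as np

def transition_scores(alpha, beta_star):
    """S[i, j] = sum_t alpha[t, i] * beta_star[t, j].

    Each of the N*N entries is a length-T dot product of two columns,
    so the dense computation costs O(TN^2).
    """
    return alpha.T @ beta_star

def reestimate_transitions(A, S):
    """New a_ij proportional to a_ij * S_ij, normalised so each row sums to one."""
    new_A = A * S
    return new_A / new_A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, N = 50, 5
alpha = rng.random((T, N))       # placeholder for forward variables
beta_star = rng.random((T, N))   # placeholder for scaled backward variables
S = transition_scores(alpha, beta_star)
A = np.full((N, N), 1.0 / N)
new_A = reestimate_transitions(A, S)
```

With 0-indexing, the slide's S_24 corresponds to `S[1, 3]`, the dot product of `alpha[:, 1]` and `beta_star[:, 3]`.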

59 Siddiqi and Moore, We want … where … (α and β are T × N matrices; S is an N × N matrix with entry S_24 highlighted)
Insight One: only need the top K entries in each row of S.
Insight Two: values in rows of α and β are often very skewed.

60-65 Siddiqi and Moore, α-biggies(i) and β-biggies(j), for the T × N matrices α and β
- For i = 1..N, store indexes of the R largest values in the i'th column of α.
- For j = 1..N, store indexes of the R largest values in the j'th column of β.
- R << T. Takes O(TN) time to do all indexes.
Annotated costs:
- R'th largest value in i'th column of α: O(1) time to obtain.
- A second term: O(1) time to obtain (precached for all j in time O(TN)).
- The remaining term: O(R) computation.
There's an important detail I'm omitting here to do with prescaling the rows of α and β.

66-70 Siddiqi and Moore, Computing the i'th row of S (entries S_ij for j = 1, 2, 3, …, N)
- In O(NR) time, we can put upper and lower bounds on S_ij for j = 1, 2, …, N.
- Only need exact values of S_ij for the K largest values within the row.
- Ignore j's that can't be the best.
- Be exact for the rest: O(N) time each.
- If there's enough pruning, total time is O(TN + RN^2).
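The bounding-and-pruning idea can be sketched as follows. The exact bound on the slides is not preserved in this transcript, so the bound used here is an assumption: the lower bound is the partial dot product over the R largest α entries, and the upper bound adds the R'th largest α value times the remaining β mass. All names and the synthetic skewed data are hypothetical, and the prescaling detail mentioned on the previous slide is ignored.

```python
import numpy as np

def top_k_in_row(alpha, beta, beta_colsum, i, K, R):
    """Find the K largest S[i, j] = alpha[:, i] . beta[:, j] using bounds.

    alpha, beta: (T, N) non-negative arrays; beta_colsum: cached beta.sum(axis=0).
    Returns (columns, exact values, number of candidates left after pruning).
    """
    N = beta.shape[1]
    col = alpha[:, i]
    big = np.argpartition(-col, R - 1)[:R]   # timesteps of the R largest alpha values
    thresh = col[big].min()                  # R'th largest value in the column
    lower = col[big] @ beta[big]             # exact partial sums, O(R) per j
    rest = np.maximum(beta_colsum - beta[big].sum(axis=0), 0.0)
    upper = lower + thresh * rest            # every other alpha value is <= thresh
    kth_lower = np.partition(lower, N - K)[N - K]  # K'th largest lower bound
    cand = np.flatnonzero(upper >= kth_lower)      # j's that can still be in the top K
    exact = col @ beta[:, cand]              # exact values only for the survivors
    order = np.argsort(-exact)[:K]
    return cand[order], exact[order], cand.size

# Synthetic, heavily skewed data (an assumption standing in for real alpha and beta).
rng = np.random.default_rng(0)
alpha = rng.random((200, 8)) ** 20
beta = rng.random((200, 8)) ** 20
beta_colsum = beta.sum(axis=0)
S_check = alpha.T @ beta                     # brute-force reference, O(TN^2)
idx, vals, n_cand = top_k_in_row(alpha, beta, beta_colsum, i=0, K=2, R=10)
```

The result is exact: pruning only discards columns whose upper bound falls below the K'th best lower bound, so how much work is saved depends on how skewed the columns are.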

71 Siddiqi and Moore, Evaluation and Inference Speedup Dataset: synthetic data with T=2000 time steps

72 Siddiqi and Moore, Parameter Learning Speedup Dataset: synthetic data with T=2000 time steps

73 Siddiqi and Moore, Performance Experiments
Datasets:
- DMC-friendly dataset: 2-D Gaussian 20-state DMC HMM with K=5 (20,000 train, 5,000 test)
- Anti-DMC dataset: 2-D Gaussian 20-state regular HMM with steadily varying, well-distributed transition probabilities (20,000 train, 5,000 test)
- Motionlogger dataset: accelerometer data from two sensors worn over several days (10,000 train, 4,720 test)
Models:
- Regular and DMC HMMs: 20 states
- Small HMM: 5-state regular HMM
- Uniform HMM: 20-state HMM with uniform transition probabilities

74-80 Siddiqi and Moore, Learning Curves for DMC-friendly data

81-87 Siddiqi and Moore, Learning Curves for Anti-DMC data

88-94 Siddiqi and Moore, Learning Curves for Motionlogger data

95 Siddiqi and Moore, Tradeoffs between N and K
- We vary N and K while keeping the number of transition parameters (N×K) constant.
- Increasing N and decreasing K allows more states for modeling data features but fewer parameters per state for temporal structure.

96 Siddiqi and Moore, Tradeoffs between N and K
Average test-set log-likelihoods at convergence.
Datasets: A: DMC-friendly, B: Anti-DMC, C: Motionlogger
Each dataset has a different optimal N-vs-K tradeoff.

97 Siddiqi and Moore, Regularization with DMC HMMs
- # of transition parameters in regular 100-state HMM: 10,000
- # of transition parameters in DMC 100-state HMM with K=5: 500

98 Siddiqi and Moore, Conclusions
- DMC HMMs are an important class of models that allow parameterized complexity-vs-efficiency tradeoffs in large state spaces.
- The speedup can be several orders of magnitude.
- Even for non-DMC domains, DMC HMMs yield higher scores than baseline models.
- The DMC HMM model can be applied to arbitrary state spaces and observation densities.

99 Siddiqi and Moore, Related Work
- Felzenszwalb et al. (2003): fast HMM algorithms when transition probabilities can be expressed as distances in an underlying parameter space
- Murphy and Paskin (2002): fast inference in hierarchical HMMs cast as DBNs
- Salakhutdinov et al. (2003): combined EM and conjugate gradient for faster HMM learning when missing information amount is high
- Beam Search: widely used heuristic in word recognition for speech systems

100-101 Siddiqi and Moore, Future Work
- Investigate DMC HMMs as a regularization mechanism
- Eliminate the R parameter using an automatic backoff evaluation approach
- Devise ways to automatically set the K parameter, and to have per-row K parameters
The End
