Fast State Discovery for HMM Model Selection and Learning. Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU)

6 Consider a sequence of real-valued observations (speech, sensor readings, stock prices, …). We can model it purely based on contextual properties. However, we would miss important temporal structure. [Figure: $O_t$ plotted against $t$]

7 Current efficient approaches learn the wrong model. [Figure: $O_t$ plotted against $t$]

8 STACS successfully discovers the overlapping states. [Figure: $O_t$ plotted against $t$]

9 Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model. [Figure: $O_t$ plotted against $t$]

10 Definitions and Notation
An HMM is $\lambda = \{A, B, \pi\}$ where
– $A$: $N \times N$ transition matrix
– $B$: observation model $\{\mu_s, \Sigma_s\}$ for each of $N$ states
– $\pi$: $N \times 1$ prior probability vector
– $T$: size of observation sequence $O_1, \ldots, O_T$
– $q_t$: the state the HMM is in at time $t$, $q_t \in \{s_1, \ldots, s_N\}$

11 HMMs as Bayes Nets; HMMs as Finite State Machines. [Figures: a Bayes net with hidden states $q_0, \ldots, q_4$ and observations $O_0, \ldots, O_4$; a finite state machine over states $s_1$, $s_2$, $s_3$]

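14 Operations on HMMs
Problem | Algorithm | Complexity
Likelihood evaluation: $L(\lambda \mid O) = P(O \mid \lambda)$ | Forward-Backward | $O(TN^2)$
Path inference: $Q^* = \arg\max_{Q} P(O, Q \mid \lambda)$ | Viterbi | $O(TN^2)$
Parameter learning (for fixed $N$): $\lambda^* = \arg\max_{\lambda, Q} P(O, Q \mid \lambda)$; $\lambda^* = \arg\max_{\lambda} P(O \mid \lambda)$ | Viterbi Training; Baum-Welch (EM) | $O(TN^2)$
Model selection: $\lambda^* = \arg\max_{\lambda, Q, N} P(O, Q \mid \lambda)$; $\lambda^* = \arg\max_{\lambda, N} P(O \mid \lambda)$ | ?? | want $O(TN^2)$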
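The first two rows of the table can be made concrete with a minimal sketch. It assumes one-dimensional Gaussian observation densities and log-space parameters; the function names (`gaussian_logpdf`, `forward_loglik`, `viterbi`) are illustrative and do not come from the slides. Each time step loops over all pairs of states, which is where the $O(TN^2)$ cost comes from.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (assumed observation model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_loglik(obs, log_pi, log_A, means, variances):
    """Forward pass: log P(O | lambda), O(T N^2) time."""
    n = len(log_pi)
    alpha = [log_pi[i] + gaussian_logpdf(obs[0], means[i], variances[i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + log_A[i][j] for i in range(n)])
                 + gaussian_logpdf(x, means[j], variances[j])
                 for j in range(n)]
    return logsumexp(alpha)

def viterbi(obs, log_pi, log_A, means, variances):
    """Most likely path Q* and log P(O, Q* | lambda), O(T N^2) time."""
    n = len(log_pi)
    delta = [log_pi[i] + gaussian_logpdf(obs[0], means[i], variances[i])
             for i in range(n)]
    backptr = []
    for x in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: delta[i] + log_A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] + log_A[best_i][j]
                             + gaussian_logpdf(x, means[j], variances[j]))
        backptr.append(step)
        delta = new_delta
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]
```

Working in log space avoids underflow on long sequences; the joint log-likelihood of the Viterbi path is never larger than the forward log-likelihood, because the latter sums over all paths.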

18 Previous Approaches
Multi-restart Baum-Welch for each N is inefficient and highly prone to local optima
Bottom-up:
– State merging [Stolcke & Omohundro 1994]
– Entropic state pruning [Brand 1999]
Advantage:
– More robust to local optima
Problems:
– Require a loose upper bound on N, which adds complexity
– Difficult to decide which states to prune/merge
Top-down:
– ML Successive State Splitting [Ostendorf & Singer 1997]
– Heuristic split-merge [Li & Biswas 1999]
Advantage:
– More robust to local optima, and more scalable
Problems:
– Previous methods are not effective at state discovery, and are still slow for large N
We propose Simultaneous Temporal and Contextual Splitting (STACS), a top-down approach that is much better at state discovery while being at least as efficient, and a variant, V-STACS, that is much faster.

19 STACS
input: $n_0$, data sequence $O = \{O_1, \ldots, O_T\}$
output: HMM of appropriate size
$\lambda \leftarrow$ $n_0$-state initial HMM
repeat
    optimize $\lambda$ over sequence $O$
    choose a subset of states $\Delta$
    for each $s \in \Delta$
        design a candidate model $\lambda_s$:
            choose a relevant subset of sequence $O$
            split state $s$, optimize $\lambda_s$ over subset
        score $\lambda_s$
    end for
    if $\max_s \text{score}(\lambda_s) > \text{score}(\lambda)$
        $\lambda \leftarrow$ best-scoring candidate from $\{\lambda_s\}$
    else
        terminate, return current $\lambda$
    end if
end repeat
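The control flow of the loop above can be sketched as a generic greedy search. This is only a skeleton under simplifying assumptions: `optimize`, `propose_splits`, and `score` are hypothetical callables supplied by the caller, and a single score is used for both ranking and acceptance, whereas the slides describe two kinds of scoring.

```python
def top_down_search(model, optimize, propose_splits, score):
    """Greedy split loop: keep the best candidate while it beats the current model."""
    while True:
        model = optimize(model)              # optimize lambda over the sequence
        candidates = propose_splits(model)   # one candidate per chosen state
        if not candidates:
            return model
        best = max(candidates, key=score)
        if score(best) > score(model):
            model = best                     # accept the best-scoring split
        else:
            return model                     # no split helps: terminate
```

For example, with integer "models" standing in for the number of states, `top_down_search(1, lambda n: n, lambda n: [n + 1], lambda n: -(n - 3) ** 2)` grows the model until the score stops improving.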

23 STACS
Learn parameters using EM, calculate the Viterbi path $Q^*$
Consider splits on all states, e.g. for state $s_2$:
– Choose a subset $D = \{O_t : Q^*(t) = s_2\}$
– Note that $|D| = O(T/N)$
[Figure: two-state HMM with states $s_1$, $s_2$]
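Selecting the subset $D$ for a state amounts to filtering the observations by the Viterbi path. A minimal sketch, with the helper name `split_subset` chosen here for illustration:

```python
def split_subset(obs, viterbi_path, state):
    """Return the time indices and observations O_t with Q*(t) == state."""
    indices = [t for t, q in enumerate(viterbi_path) if q == state]
    return indices, [obs[t] for t in indices]
```

Because every time step is assigned to exactly one state, the subsets for the $N$ states partition the $T$ observations, so their average size is $T/N$.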

27 STACS
Split the state
Constrain $\lambda_s$ to $\lambda$ except for offspring states' observation densities and all their transition probabilities, both in and out
Learn the free parameters using two-state EM over $D$. This optimizes the partially observed likelihood $P(O, Q^*_{\setminus D} \mid \lambda_s)$
Update $Q^*$ over $D$ to get $R^*$
[Figure: three-state HMM with states $s_1$, $s_2$, $s_3$ after the split]

30 STACS
Scoring is of two types:
– The candidates are compared to each other according to their Viterbi path likelihoods
– The best candidate in this ranking is compared to the un-split model $\lambda$ using BIC, i.e. $\log P(\text{model} \mid \text{data}) \approx \log P(\text{data} \mid \text{model}) - \text{complexity penalty}$
[Figure: candidate three-state models compared with each other; the best candidate compared with the un-split two-state model]
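The BIC comparison can be illustrated with the standard form of the criterion, log-likelihood minus $(k/2)\log T$ for $k$ free parameters. The slides do not spell out the penalty, so this form and the parameter count below (one-dimensional Gaussian observations) are assumptions made for the sketch, as are the function names.

```python
import math

def num_free_params(n_states):
    """Free parameters of an N-state HMM with 1-D Gaussian observations (assumed)."""
    transitions = n_states * (n_states - 1)   # each row of A sums to one
    prior = n_states - 1                      # pi sums to one
    observations = 2 * n_states               # one mean and one variance per state
    return transitions + prior + observations

def bic_score(log_lik, n_params, n_obs):
    """Standard BIC: log-likelihood minus (k / 2) * log(T)."""
    return log_lik - 0.5 * n_params * math.log(n_obs)

def accept_split(loglik_split, loglik_unsplit, n_states, n_obs):
    """True if the (N+1)-state candidate has a higher BIC than the N-state model."""
    split = bic_score(loglik_split, num_free_params(n_states + 1), n_obs)
    unsplit = bic_score(loglik_unsplit, num_free_params(n_states), n_obs)
    return split > unsplit
```

A split that leaves the log-likelihood unchanged is always rejected, because the extra parameters only increase the penalty.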

34 Viterbi STACS (V-STACS)
Recall that STACS learns the free parameters using two-state EM over $D$. However, EM also has "winner-take-all" variants
V-STACS uses two-state Viterbi training over $D$ to learn the free parameters, which uses hard updates vs. STACS' soft updates
The Viterbi path likelihood is used to approximate the BIC vs. the un-split model in V-STACS
[Figure: three-state HMM with states $s_1$, $s_2$, $s_3$]
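The difference between soft and hard updates can be seen in a deliberately simplified setting: one update of the means of two equally weighted one-dimensional Gaussians with a shared variance. This ignores transition probabilities, which the two-state EM and Viterbi training described above also update, so it is an illustration of the two update styles and not the V-STACS procedure itself. All names are illustrative.

```python
import math

def responsibilities(x, means, var):
    """Posterior probability of each of two equally weighted Gaussian components."""
    logs = [-(x - m) ** 2 / (2.0 * var) for m in means]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def soft_update(data, means, var):
    """EM-style update: every point contributes to both means, weighted by posterior."""
    resp = [responsibilities(x, means, var) for x in data]
    return [sum(r[k] * x for r, x in zip(resp, data)) / sum(r[k] for r in resp)
            for k in range(2)]

def hard_update(data, means, var):
    """Winner-take-all update: each point contributes only to its most likely component."""
    groups = [[], []]
    for x in data:
        r = responsibilities(x, means, var)
        groups[0 if r[0] >= r[1] else 1].append(x)
    # Keep the old mean if a component wins no points.
    return [sum(g) / len(g) if g else m for g, m in zip(groups, means)]
```

With data `[0.0, 1.0, 4.0, 5.0]` and initial means `[0.0, 5.0]`, the hard update averages each half separately, while the soft update pulls each mean slightly toward the other cluster.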

35 Time Complexity
Optimizing $N$ candidates takes
– $N \times O(T)$ time for STACS
– $N \times O(T/N)$ time for V-STACS
Scoring $N$ candidates takes $N \times O(T)$ time
$\Rightarrow$ Candidate search and scoring is $O(TN)$
Best-candidate evaluation is
– $O(TN^2)$ for BIC in STACS
– $O(TN)$ for approximate BIC in V-STACS
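Adding up the terms listed above for one pass of candidate generation and evaluation gives the following totals. This is a worked restatement of the slide's figures, not an additional result:

```latex
\begin{align*}
\text{STACS:}\quad
  & \underbrace{N \cdot O(T)}_{\text{optimize}}
  + \underbrace{N \cdot O(T)}_{\text{score}}
  + \underbrace{O(TN^2)}_{\text{BIC}}
  = O(TN) + O(TN^2) = O(TN^2) \\
\text{V-STACS:}\quad
  & \underbrace{N \cdot O(T/N)}_{\text{optimize}}
  + \underbrace{N \cdot O(T)}_{\text{score}}
  + \underbrace{O(TN)}_{\text{approx. BIC}}
  = O(T) + O(TN) + O(TN) = O(TN)
\end{align*}
```

In STACS the exact BIC evaluation dominates, while in V-STACS no term exceeds $O(TN)$.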

41 Other Methods
Li-Biswas
– Generates two candidates:
    splits state with highest variance
    merges pair of closest states (rarely chosen)
– Optimizes all candidate parameters over entire sequence
ML-SSS
– Generates $2N$ candidates, splitting each state in two ways:
    Contextual split: optimizes offspring states' observation densities with 2-Gaussian mixture EM, assumes offspring connected "in parallel"
    Temporal split: optimizes offspring states' observation densities, self-transitions and mutual transitions with EM, assumes offspring "in series"
– Optimizes split of state $s$ over all timesteps with nonzero posterior probability of being in state $s$ [i.e. $O(T)$ data points]

42 Results

43 Data sets
Australian Sign-Language data collected from 2 Flock 5DT instrumented gloves and an Ascension Flock-of-Birds tracker [Kadous 2002] (available in the UCI KDD Archive)
Other data sets obtained from the literature:
– Robot, MoCap, MLog, Vowel

44 Learning HMMs of Predetermined Size: Scalability. Robot data (others similar)

45 Learning HMMs of Predetermined Size: Log-Likelihood. Learning a 40-state HMM on Robot data (others similar)

46 Learning HMMs of Predetermined Size. Learning 40-state HMMs

49 Model Selection: Synthetic Data
Generalize (4 states, T = 1000) to (10 states, T = 10,000)
Both STACS and V-STACS discovered 10 states and the correct underlying transition structure
Li-Biswas and ML-SSS failed to find the 10-state model
10-state Baum-Welch also failed to find the correct observation and transition models, even with 50 restarts!

50 Model Selection: BIC score. MoCap data (others similar)

51 Model Selection

52 Sign-language recognition
Initial results on sign-language word recognition
95 distinct words, 27 instances each, divided 8:1
Average classification accuracies and HMM sizes: [Table with columns: Accuracy, final N]

54 Conclusion
– STACS: an efficient method for HMM model selection that effectively discovers hidden states
– Even when learning HMMs with known size, it is better to discover states using STACS up to the desired $N$
The Point: Temporal modeling is essential when discovering states in sequential data, even at the cost of modeling a bit less contextual information

57 Future Work
Scaling to large state-spaces
– As $N$ grows, can save time by avoiding repeated testing of bad candidates
– Merge back final states that are very similar
Generalize to discovering latent variable cardinality in Bayesian Networks
Comparisons to componential models
– What's better: a large, cleverly-built HMM with exact inference, or a richer model with approximate inference?

59 [Figure: Viterbi table with rows $t = 1, \ldots, 9$ and columns $\delta_t(1), \delta_t(2), \delta_t(3), \ldots, \delta_t(N)$]

61 The Viterbi path is marked in the table. Suppose we split state $N$ into $s_1, s_2$

62 [Figure: the same table with column $\delta_t(N)$ replaced by columns $\delta_t(s_1)$ and $\delta_t(s_2)$; some entries in the new columns are shown as ??]
