Fast State Discovery for HMM Model Selection and Learning. Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU).


1. Fast State Discovery for HMM Model Selection and Learning. Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU)

2. [Figure: a sequence of real-valued observations O_t plotted against time t] Consider a sequence of real-valued observations (speech, sensor readings, stock prices, …).

3–6. We can model it purely based on contextual properties; however, we would miss important temporal structure. [Figures: the same sequence, first clustered contextually, then with its temporal structure highlighted]

7. Current efficient approaches learn the wrong model.

8. Our method successfully discovers the overlapping states.

9. Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model.

10. Motion Capture

11. Definitions and Notation
An HMM is λ = {A, B, π}, where
– A: N × N transition matrix
– B: observation model {μ_s, Σ_s} for each of the N states
– π: N × 1 prior probability vector
– T: length of the observation sequence O_1, …, O_T
– q_t: the state the HMM is in at time t, q_t ∈ {s_1, …, s_N}
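To make the notation concrete, here is a minimal container for λ = {A, B, π} with Gaussian observation models; this is a sketch of my own, not the authors' code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianHMM:
    A: np.ndarray      # (N, N) transition matrix; row s gives P(q_{t+1} | q_t = s)
    mu: np.ndarray     # (N, d) observation means, one per state
    Sigma: np.ndarray  # (N, d, d) observation covariances, one per state
    pi: np.ndarray     # (N,) prior probability vector

    @property
    def N(self) -> int:
        return self.A.shape[0]
```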

12. Operations on HMMs
– Likelihood evaluation: L(λ; O) = P(O|λ). Algorithm: Forward-Backward. Complexity: O(TN²).
– Path inference: Q* = argmax_Q P(O, Q|λ). Algorithm: Viterbi. Complexity: O(TN²).
– Parameter learning (for fixed N): λ* = argmax_{λ,Q} P(O, Q|λ) [Viterbi training] or λ* = argmax_λ P(O|λ) [Baum-Welch (EM)]. Complexity: O(TN²).

13. Operations on HMMs (adding the model-selection problem to the table above)
– Model selection: λ* = argmax_{λ,Q,N} P(O, Q|λ) or λ* = argmax_{λ,N} P(O|λ). Algorithm: ?? (want O(TN²)).
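As a concrete sketch of the first row of the table (my code, reusing the GaussianHMM container above): the forward pass does one N × N update per timestep, which is where the O(TN²) cost comes from.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(hmm, O):
    """Forward algorithm in log space: returns log P(O | lambda) for O of shape (T, d)."""
    logB = np.stack([multivariate_normal.logpdf(O, hmm.mu[s], hmm.Sigma[s])
                     for s in range(hmm.N)], axis=1)   # (T, N) emission log-probs
    logA = np.log(hmm.A)
    alpha = np.log(hmm.pi) + logB[0]                   # forward message at t = 0
    for t in range(1, O.shape[0]):                     # O(N^2) work per step
        alpha = logsumexp(alpha[:, None] + logA, axis=0) + logB[t]
    return logsumexp(alpha)
```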

14–17. Previous Approaches
Multi-restart Baum-Welch for each candidate N is inefficient and highly prone to local minima.

Bottom-up: state merging [Stolcke & Omohundro 1994], entropic state pruning [Brand 1999].
– Advantage: more robust to local minima.
– Problems: require a loose upper bound on N, which adds complexity; difficult to decide which states to prune/merge.

Top-down: ML Successive State Splitting [Ostendorf & Singer 1997], heuristic split-merge [Li & Biswas 1999].
– Advantage: more robust to local minima, and more scalable.
– Problems: previous methods are not effective at state discovery, and still slow for large N.

We propose Simultaneous Temporal and Contextual Splitting (STACS), a top-down approach that is much better at state discovery while being at least as efficient, and a variant, V-STACS, that is much faster.

18–20. Bayesian Information Criterion (BIC) for Model Selection
We would like to compute the posterior probability over model sizes:
 P(model size | data) ∝ P(data | model size) · P(model size)
 log P(model size | data) = log P(data | model size) + log P(model size) (up to an additive constant)

BIC assumes a prior that penalizes complexity (favors smaller models):
 log P(model size | data) ≈ log P(data | model size, λ_MLE) − (#FP/2) log T
where #FP = number of free parameters, T = length of the data sequence, and λ_MLE is the ML parameter estimate.

BIC is an asymptotic approximation to the true posterior.
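A sketch of this score for a Gaussian HMM (the free-parameter accounting below is mine; the slides do not spell out #FP): π contributes N−1 parameters, A contributes N(N−1), and each state's Gaussian contributes d + d(d+1)/2.

```python
import numpy as np

def bic_score(log_lik, N, d, T):
    """BIC ~ log P(data | size, lambda_MLE) - (#FP / 2) log T."""
    n_free = (N - 1) + N * (N - 1) + N * (d + d * (d + 1) // 2)
    return log_lik - 0.5 * n_free * np.log(T)
```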

21. Algorithm Summary (STACS/V-STACS)
Initialize an n0-state HMM randomly.
for n = n0 … Nmax:
– Learn model parameters.
– for i = 1 … n: split state i, optimize by constrained EM (STACS) or constrained Viterbi training (V-STACS); calculate the approximate BIC score of the split model.
– Choose the best split based on approximate BIC.
– Compare it to the original model with exact BIC (STACS) or approximate BIC (V-STACS).
– If the larger model is not chosen, stop. (A driver-loop sketch follows.)
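The summary reads naturally as a driver loop. In this hedged sketch, learn_params, split_and_optimize, approx_bic, and exact_bic are hypothetical helpers standing in for the steps described above:

```python
def stacs(hmm, O, n_max):
    """Grow an HMM one state at a time until BIC stops improving."""
    while hmm.N <= n_max:
        hmm = learn_params(hmm, O)                    # EM over the full sequence
        candidates = [split_and_optimize(hmm, O, s)   # one candidate per state
                      for s in range(hmm.N)]
        best = max(candidates, key=lambda c: approx_bic(c, O))
        if exact_bic(best, O) <= exact_bic(hmm, O):   # larger model rejected
            return hmm                                # stop: keep current size
        hmm = best
    return hmm
```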

22. STACS
input: n0, data sequence O = {O_1, …, O_T}
output: HMM λ of appropriate size
λ ← n0-state initial HMM
repeat
 optimize λ over sequence O
 choose a subset of states Σ
 for each s ∈ Σ:
  design a candidate model λ_s:
   choose a relevant subset of sequence O
   split state s, optimize λ_s over the subset
   score λ_s
 end for
 if max_s score(λ_s) > score(λ):
  λ ← best-scoring candidate from {λ_s}
 else
  terminate, return current λ
 end if
end repeat

23. Learn parameters λ using EM; calculate the Viterbi path Q*. [Figure: HMM with states s1, s2. Slides 23–37 repeat the STACS pseudocode panel from slide 22 alongside each step; it is omitted below.]

24. Consider splits on all states, e.g. state s2.

25. Choose a subset D = {O_t : Q*(t) = s2}.
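A sketch of this step (my code, reusing the GaussianHMM container): run Viterbi to get Q*, then collect the timesteps it assigns to the state s being split.

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_path(hmm, O):
    """Most likely state sequence Q* = argmax_Q P(O, Q | lambda)."""
    T = O.shape[0]
    logB = np.stack([multivariate_normal.logpdf(O, hmm.mu[s], hmm.Sigma[s])
                     for s in range(hmm.N)], axis=1)
    logA = np.log(hmm.A)
    delta = np.log(hmm.pi) + logB[0]
    back = np.zeros((T, hmm.N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: best path ending with i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    Q = np.empty(T, dtype=int)
    Q[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):          # backtrack through the pointers
        Q[t] = back[t + 1, Q[t + 1]]
    return Q

# D = {O_t : Q*(t) = s}:
# Q_star = viterbi_path(hmm, O); D = O[Q_star == s]
```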

26. Note that |D| = O(T/N).

27. Split the state. [Figure: state s2 split into offspring s2, s3]

28. Constrain λ_s to equal λ except for the offspring states' observation densities and all their transition probabilities, both in and out.

29. Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q*\D | λ_s). (A candidate-construction sketch follows.)
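A sketch of how such a candidate might be built before the constrained optimization (the initialization details here are my assumptions, not the paper's): duplicate state s, share its incoming transition mass between the two offspring, and perturb one offspring's mean; two-state EM over D then re-estimates only these free parameters.

```python
import copy
import numpy as np

def make_split_candidate(hmm, s, eps=1e-2, rng=None):
    """Duplicate state s into offspring (s, N); everything else is copied from lambda."""
    rng = np.random.default_rng() if rng is None else rng
    cand = copy.deepcopy(hmm)
    N, d = hmm.mu.shape
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = hmm.A
    A[N, :] = A[s, :]                 # new state starts with the parent's out-transitions
    A[:, N] = A[:, s] / 2.0           # incoming mass shared between the offspring
    A[:, s] /= 2.0
    cand.A = A                        # rows still sum to 1 by construction
    cand.mu = np.vstack([hmm.mu, hmm.mu[s] + eps * rng.standard_normal(d)])
    cand.Sigma = np.concatenate([hmm.Sigma, hmm.Sigma[s:s + 1]])
    cand.pi = np.append(hmm.pi, hmm.pi[s] / 2.0)
    cand.pi[s] /= 2.0
    # two-state EM over D then refines only the offspring parameters
    return cand
```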

30. Update Q* over D to get R*.

31. Scoring is of two types:

32. The candidates are compared to each other according to their Viterbi-path likelihoods. [Figure: two candidate splits compared]

33. The best candidate in this ranking is compared to the un-split model λ using BIC, i.e. log P(model | data) ≈ log P(data | model) − complexity penalty.
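Putting the two stages together (a sketch: viterbi_log_lik, the path likelihood log P(O, Q* | λ), is a hypothetical helper, while log_likelihood and bic_score are sketched earlier):

```python
def choose_model(hmm, candidates, O, d):
    """Rank candidates by Viterbi-path likelihood; accept the winner only if BIC improves."""
    T = O.shape[0]
    best = max(candidates, key=lambda c: viterbi_log_lik(c, O))
    if bic_score(log_likelihood(best, O), best.N, d, T) > \
       bic_score(log_likelihood(hmm, O), hmm.N, d, T):
        return best
    return hmm
```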

34. Viterbi STACS (V-STACS)

35. Recall that STACS learns the free parameters using two-state EM over D. However, EM also has "winner-take-all" variants.

36. V-STACS uses two-state Viterbi training over D to learn the free parameters, which uses hard updates vs. STACS' soft updates.

37. In V-STACS, the Viterbi-path likelihood is used to approximate the BIC when comparing against the un-split model.
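A sketch of the hard M-step this relies on (my code): each point in D is hard-assigned to one offspring by the current Viterbi path, and the offspring densities are re-estimated from those assignments only.

```python
import numpy as np

def hard_mstep(O_D, assign, mu, Sigma):
    """One winner-take-all update for two offspring over the subset D.
    O_D: (|D|, d) points; assign: (|D|,) array of 0/1 offspring labels."""
    for k in (0, 1):
        X = O_D[assign == k]
        if len(X) > 1:                  # skip degenerate assignments
            mu[k] = X.mean(axis=0)
            Sigma[k] = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # small ridge for stability
    return mu, Sigma
```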

38. Time Complexity
– Optimizing N candidates takes N × O(T) time for STACS and N × O(T/N) time for V-STACS.
– Scoring N candidates takes N × O(T) time.
– ⇒ Candidate search and scoring is O(TN).
– Best-candidate evaluation is O(TN²) for exact BIC in STACS and O(TN) for approximate BIC in V-STACS.

39–44. Other Methods
Li-Biswas:
– Generates two candidates: splits the state with the highest variance; merges the pair of closest states (rarely chosen).
– Optimizes all candidate parameters over the entire sequence.

ML-SSS:
– Generates 2N candidates, splitting each state in two ways:
 Contextual split: optimizes the offspring states' observation densities with 2-Gaussian-mixture EM; assumes the offspring are connected "in parallel".
 Temporal split: optimizes the offspring states' observation densities, self-transitions, and mutual transitions with EM; assumes the offspring are connected "in series".
– Optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s, i.e. O(T) data points.

45. Results

46. Data sets
– Australian Sign-Language data collected from two 5DT instrumented gloves and an Ascension Flock-of-Birds tracker [Kadous 2002; available in the UCI KDD Archive].
– Other data sets obtained from the literature: Robot, MoCap, MLog, Vowel.

47. Learning HMMs of Predetermined Size: Scalability [Plot: Robot data; other data sets similar]

48. Learning HMMs of Predetermined Size: Log-Likelihood [Plot: learning a 40-state HMM on Robot data; other data sets similar]

49. Learning HMMs of Predetermined Size [Results for learning 40-state HMMs]

50–52. Model Selection: Synthetic Data
– Generalize from (4 states, T = 1,000) to (10 states, T = 10,000).
– Both STACS and V-STACS discovered 10 states and the correct underlying transition structure.
– Li-Biswas and ML-SSS failed to find the 10-state model; 10-state Baum-Welch also failed to find the correct observation and transition models, even with 50 restarts!

53. Model Selection: BIC score [Plot: MoCap data; other data sets similar]

54. Model Selection

55. Sign-language recognition
– Initial results on sign-language word recognition: 95 distinct words, 27 instances each, divided 8:1 into training and test sets.
– Average classification accuracies and final HMM sizes N were compared. [Table: accuracy and final N per method]
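For context, word recognition with per-word HMMs typically works as below; this is my reconstruction of the setup, not the authors' code (log_likelihood is the forward-pass sketch from earlier):

```python
def classify(O, word_models):
    """word_models: dict mapping each of the 95 words to its trained HMM."""
    return max(word_models, key=lambda w: log_likelihood(word_models[w], O))
```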

56–60. Modeling motion capture data
– 35-dimensional data (thanks to Adrien Treuille).
– Original data vs. STACS simulation (found 235 states) vs. Baum-Welch (run with 235 states). [Video]

61. Discovering Underlying Structure
– Sparse dynamics are difficult to learn using regular EM.
– STACS smoothly tiles the low-dimensional manifold of observations along with the correct dynamic structure.

62. Conclusion
– A better method for HMM model selection and learning: it discovers hidden states, avoids local minima, and is faster than Baum-Welch.
– Even when learning HMMs of known size, it is better to discover states using STACS up to the desired N.
– Widespread applicability: classification, recognition, and prediction for real-valued sequential-data problems.


64–69. [Appendix: Viterbi trellis figures] The trellis holds δ_t(1), δ_t(2), δ_t(3), …, δ_t(N) for each timestep t, and the Viterbi path is denoted by a highlighted sequence of cells through it. Suppose we split state N into s1, s2: the new rows δ_t(s1), δ_t(s2) are at first unknown ("??") and are then filled in, while the rows for the unsplit states are reused.
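For reference, the standard recursion that fills the δ-trellis in these figures (the slides show the trellis only, so the formula is supplied here) is:

```latex
\delta_1(i) = \pi_i \, b_i(O_1), \qquad
\delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i) \, a_{ij} \big] \, b_j(O_t).
```

After splitting state N into s1 and s2, only the rows δ_t(s1) and δ_t(s2) carry new values; the other rows can be reused, which is presumably what the "??" entries being filled in depict.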