
Fast Inference and Learning in Large-State-Space HMMs
Sajid M. Siddiqi and Andrew W. Moore
The Auton Lab, Carnegie Mellon University

Outline: HMM Overview / Reducing quadratic complexity in the number of states (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Hidden Markov Models: a chain of hidden states q_0, q_1, q_2, q_3, q_4, ..., where each state q_t emits an observation O_t.

Transition Model: row i, column j of the table below gives P(q_{t+1} = s_j | q_t = s_i).

      j=1    j=2    ...   j=N
 i=1  a_11   a_12   ...   a_1N
 i=2  a_21   a_22   ...   a_2N
  :    :      :             :
 i=N  a_N1   a_N2   ...   a_NN

This table is the same at every timestep. Notation: a_ij = P(q_{t+1} = s_j | q_t = s_i).

Observation Model: row i, column k of the table below gives P(O_t = k | q_t = s_i) for observation symbols k = 1, ..., M.

      k=1     k=2     ...   k=M
 i=1  b_1(1)  b_1(2)  ...   b_1(M)
 i=2  b_2(1)  b_2(2)  ...   b_2(M)
  :     :       :             :
 i=N  b_N(1)  b_N(2)  ...   b_N(M)

Notation: b_i(k) = P(O_t = k | q_t = s_i).

Some Famous HMM Tasks
Question 1: State Estimation. What is P(q_T = s_i | O_1 O_2 ... O_T)?
Question 2: Most Probable Path. Given O_1 O_2 ... O_T, what is the most probable path that I took? (Woke up at 8.35, got on bus at 9.46, sat in lecture ...)
Question 3: Learning HMMs. Given O_1 O_2 ... O_T, what is the maximum-likelihood HMM that could have produced this string of observations?
[Figure: a three-state HMM over the states Eat, Bus, Walk (A, B, C), with transition probabilities a_AA, a_AB, a_BA, a_BB, a_BC, a_CB, a_CC and emission probabilities b_A(O_{t-1}), b_B(O_t), b_C(O_{t+1}).]

Basic Operations in HMMs
For an observation sequence O = O_1 ... O_T, the three basic HMM operations are:

  Problem                                           Algorithm          Complexity
  Evaluation: calculating P(O | λ)                  Forward-Backward   O(TN^2)
  Inference: computing Q* = argmax_Q P(O, Q | λ)    Viterbi Decoding   O(TN^2)
  Learning: computing λ* = argmax_λ P(O | λ)        Baum-Welch (EM)    O(TN^2)

(T = number of timesteps, i.e. datapoints; N = number of states.)

This talk: a simple approach to reducing the complexity in N.
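For reference, here is a minimal NumPy sketch (my own, not code from the talk) of the standard O(TN^2) forward pass behind the Evaluation row above; the array names and the per-step rescaling scheme are assumptions.

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """O(T N^2) forward pass ('evaluation') for a discrete-output HMM.
    pi: (N,) initial distribution; A: (N, N) with A[i, j] = P(q_{t+1}=s_j | q_t=s_i);
    B: (N, M) with B[i, k] = P(O_t=k | q_t=s_i); obs: length-T array of symbol indices.
    Returns log P(O | model), rescaling at every step for numerical stability."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # the O(N^2) step, done T-1 times
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik
```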

Outline: HMM Overview / Reducing quadratic complexity (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Reducing Quadratic Complexity in N
Why does it matter? Quadratic HMM algorithms hinder HMM computations when N is large, and there are several promising applications for efficient large-state-space HMM algorithms: topic modeling, speech recognition, real-time HMM systems such as activity monitoring, and more.

Idea One: Sparse Transition Matrix
Allow only K << N non-zero next-state probabilities per row, which brings the cost down to O(TNK). But such a model can get very badly confused by "impossible transitions", and it cannot learn the sparse structure (once chosen, the structure cannot change).

Dense-Mostly-Constant (DMC) Transitions
Each row has K non-constant transition probabilities, and every other entry in the row shares a single constant value. DMC HMMs comprise a richer and more expressive class of models than sparse HMMs. [Figure: a DMC transition matrix with K = 2.]

The transition model for state i now consists of:
  K    = the number of non-constant values per row
  NC_i = { j : s_i → s_j is a non-constant transition probability }
  c_i  = the transition probability from s_i to all states not in NC_i
  a_ij = the non-constant transition probability for s_i → s_j
Example: with K = 2, NC_3 = {2, 5}, c_3 = 0.05, a_32 = 0.25, a_35 = 0.6.
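One way to picture the DMC parameterization is as a small per-row record. The layout and field names below are my own illustration, not the authors' data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DMCRow:
    """One row of a Dense-Mostly-Constant transition matrix."""
    nc_idx: np.ndarray   # the K column indices j in NC_i
    nc_val: np.ndarray   # the K non-constant probabilities a_ij
    const: float         # c_i, shared by every column not in NC_i

    def prob(self, j: int) -> float:
        """Return P(q_{t+1} = s_j | q_t = s_i) in O(K) time."""
        hits = np.where(self.nc_idx == j)[0]
        return float(self.nc_val[hits[0]]) if hits.size else self.const

# The example from the slide: K = 2, NC_3 = {2, 5}, c_3 = 0.05, a_32 = 0.25, a_35 = 0.6
row3 = DMCRow(nc_idx=np.array([2, 5]), nc_val=np.array([0.25, 0.6]), const=0.05)
```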

Outline: HMM Overview / Reducing quadratic complexity in the number of states (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Evaluation in Regular HMMs
We want P(q_t = s_i | O_1, O_2, ..., O_t). Define the "forward variables"
    α_t(i) = P(O_1, O_2, ..., O_t, q_t = s_i | λ),
so that
    P(q_t = s_i | O_1, O_2, ..., O_t) = α_t(i) / Σ_j α_t(j).
The forward variables can be computed recursively:
    α_1(i) = π_i b_i(O_1),    α_{t+1}(j) = b_j(O_{t+1}) Σ_i α_t(i) a_ij.


The forward pass fills a T × N table of values α_t(1), α_t(2), ..., α_t(N), one row per timestep t = 1, 2, 3, ...; each entry sums over all N previous states, so the total cost is O(TN^2).

Similarly, the "backward variables"
    β_t(i) = P(O_{t+1}, ..., O_T | q_t = s_i, λ)
satisfy the recursion
    β_T(i) = 1,    β_t(i) = Σ_j a_ij b_j(O_{t+1}) β_{t+1}(j),
which also costs O(TN^2).
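For completeness, a textbook-style sketch (not from the talk) of the backward pass that fills the β table, using the same array conventions as the forward sketch earlier; it is unscaled, so practical code would rescale each row just as the forward pass does.

```python
import numpy as np

def backward_table(A, B, obs):
    """Standard O(T N^2) backward pass filling the T x N beta table."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```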

Fast Evaluation in DMC HMMs
Under the DMC transition model the forward recursion splits into two parts:
    α_{t+1}(j) = b_j(O_{t+1}) [ Σ_i c_i α_t(i) + Σ_{i : j ∈ NC_i} (a_ij − c_i) α_t(i) ].
The first sum is O(N), but it is computed only once per row of the α table (i.e. once per timestep); the second part costs O(K) for each α_t(j) entry. This yields O(TNK) complexity for the evaluation problem.
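A minimal sketch of the O(TNK) forward pass, under my own data layout (rows[i] = (nc_idx, nc_val, const), mirroring the record sketched earlier) rather than the authors' implementation: the shared constant sum is computed once per timestep and each non-constant entry contributes one O(1) correction.

```python
import numpy as np

def dmc_forward_filter(pi, rows, B, obs):
    """O(T N K) forward pass for a DMC transition model (illustrative sketch)."""
    N = len(pi)
    alpha = pi * B[:, obs[0]]
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        # constant part: one O(N) sum, shared by every destination state j
        base = sum(alpha[i] * rows[i][2] for i in range(N))
        nxt = np.full(N, base)
        # corrections: O(1) per non-constant entry, N*K entries in total
        for i, (nc_idx, nc_val, const) in enumerate(rows):
            nxt[nc_idx] += alpha[i] * (nc_val - const)
        alpha = nxt * B[:, o]
        alpha = alpha / alpha.sum()
    return alpha   # filtered distribution P(q_T = s_i | O_1 ... O_T)
```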

Fast Inference in DMC HMMs
The Viterbi recursion in the regular model,
    δ_{t+1}(j) = b_j(O_{t+1}) max_i [ δ_t(i) a_ij ],
is O(N^2) per timestep. In the DMC model it becomes an O(NK) recursion: the contribution of the constant entries is O(N) but is computed only once per row of the δ table, and then each δ_t(j) entry needs only O(K) extra work for the non-constant entries.
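To make the split concrete, here is a correctness-first Python sketch of one Viterbi step under the same DMC row layout; it is my own illustration, not the O(NK) bookkeeping the talk describes, and the names delta, rows, and b_next are assumptions.

```python
import numpy as np

def dmc_viterbi_step(delta, rows, b_next):
    """One Viterbi step under a DMC transition model (illustrative sketch).
    rows[i] = (nc_idx, nc_val, const); b_next[j] = b_j(O_{t+1})."""
    N = len(delta)
    # score of reaching j from i through the *constant* entry of row i
    const_score = np.array([delta[i] * rows[i][2] for i in range(N)])
    order = np.argsort(-const_score)          # best constant predecessors first
    owners = [set() for _ in range(N)]        # owners[j] = {i : j in NC_i}
    for i, (nc_idx, _, _) in enumerate(rows):
        for j in nc_idx:
            owners[int(j)].add(i)
    new_delta = np.empty(N)
    for j in range(N):
        # best predecessor whose entry for j really is the constant c_i ...
        best = next((const_score[i] for i in order if i not in owners[j]), 0.0)
        # ... compared against the non-constant entries that point at j
        for i in owners[j]:
            k = int(np.where(rows[i][0] == j)[0][0])
            best = max(best, delta[i] * rows[i][1][k])
        new_delta[j] = best * b_next[j]
    return new_delta
```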

Outline: HMM Overview / Reducing quadratic complexity in the number of states (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Learning a DMC HMM
Idea One: ask the user to tell us the DMC structure, then learn the parameters using EM. Simple! But in general we don't know the DMC structure.
Idea Two: use EM to learn the DMC structure as well (a sketch of step 3 follows this list):
1. Guess a DMC structure.
2. Find expected transition counts and observation parameters, given the current model and observations.
3. Find the maximum-likelihood DMC model given the counts.
4. Go to step 2.
The DMC structure can (and does) change across iterations; in fact, we can just start with an all-constant transition model.
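As promised above, a hedged sketch of step 3: turning expected transition counts into a maximum-likelihood DMC model. The tie-breaking and normalization choices are my own assumptions, not necessarily the authors' M-step.

```python
import numpy as np

def counts_to_dmc(S, K):
    """Given expected transition counts S (N x N), keep the K largest entries of
    each row as non-constant probabilities and share the remaining mass uniformly
    as the row's constant value.  Zero rows and ties are ignored for brevity."""
    N = S.shape[0]
    rows = []
    for i in range(N):
        p = S[i] / S[i].sum()                    # row-normalized expected counts
        nc_idx = np.argsort(-p)[:K]              # the K most-used successor states
        nc_val = p[nc_idx]
        const = (1.0 - nc_val.sum()) / (N - K)   # c_i, shared by the other N-K states
        rows.append((nc_idx, nc_val, const))
    return rows
```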

Learning a DMC HMM, step 2: find expected transition counts and observation parameters, given the current model and observations.

We want a new estimate of each transition probability,
    â_ij = (expected number of transitions s_i → s_j) / (expected number of visits to s_i)
         = Σ_t P(q_t = s_i, q_{t+1} = s_j | O, λ) / Σ_t P(q_t = s_i | O, λ).
Applying Bayes' rule to both terms expresses them through the two T × N tables of forward variables α and backward variables β. The denominator terms,
    P(q_t = s_i | O, λ) = α_t(i) β_t(i) / P(O | λ),
can be computed for all t and i in O(TN) time.

The numerator requires the N × N matrix S of expected transition counts, where S_ij is the dot product of the i-th column of the α table with the j-th column of the β table (e.g. S_24 pairs column α_{*2} with column β_{*4}). Computing all of S this way is an O(TN^2) dot-product-of-columns operation.

Candidate speedups: Strassen-style fast matrix multiplication; approximating the count matrix by a DMC structure; approximate randomized A^T B products; settling for a sparse structure; settling for a fixed DMC structure. What we actually want is a speedup without approximation.
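A NumPy sketch of the dot-product-of-columns computation of S, my own reconstruction rather than the talk's code; it folds the emission weighting and the division by P(O | λ) in explicitly, which the slides roll into a prescaling detail mentioned later.

```python
import numpy as np

def expected_transition_counts(alpha, beta, A, B, obs):
    """S_ij = sum_t xi_t(i,j), written as a matrix product so the
    dot-product-of-columns structure is explicit.  alpha and beta are the
    unscaled T x N forward and backward tables."""
    wbeta = beta[1:] * B[:, obs[1:]].T        # (T-1, N): b_j(O_{t+1}) * beta_{t+1}(j)
    S = A * (alpha[:-1].T @ wbeta)            # S_ij = a_ij * <alpha column i, wbeta column j>
    return S / alpha[-1].sum()                # divide by P(O | lambda)
```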

Two insights make this faster. Insight One: we only need the top K entries in each row of S. Insight Two: the values in the columns of α and β are often very skewed.

α-biggies and β-biggies: for i = 1..N, store the indices of the R largest values in the i-th column of α; for j = 1..N, store the indices of the R largest values in the j-th column of β, with R << T. Building all of these index lists takes O(TN) time. (There is an important detail omitted here to do with prescaling the rows of α and β.) The R-th largest value in each column is also precached for all columns in O(TN) time, so it can be looked up in O(1), and the partial sums over the stored indices take only O(R) computation per entry of S.

Computing the i-th row of S: in O(NR) time we can put upper and lower bounds on S_ij for j = 1, 2, ..., N. We only need exact values of S_ij for the K largest entries within the row, so we ignore the j's that cannot be among the best and are exact for the rest (O(N) time each). If there is enough pruning, the total time is O(TN + RN^2).
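Below is a sketch, with assumed argument names and the prescaling detail ignored, of how the cached biggies give cheap bounds on a single entry S_ij. Any j whose upper bound falls below the K-th best lower bound found so far cannot be among the row's top K, so only the surviving candidates need an exact dot product.

```python
import numpy as np

def sij_bounds(a_col, wb_col, a_big, wb_big, a_rth, wb_rth):
    """Upper/lower bounds on the column dot product behind S_ij using only the
    R largest entries of each column.  a_big / wb_big: time indices of those
    entries; a_rth / wb_rth: the R-th largest values themselves."""
    support = np.union1d(a_big, wb_big)                  # at most 2R timesteps
    exact_part = float(a_col[support] @ wb_col[support])
    lower = exact_part                                    # dropped terms are all >= 0
    upper = exact_part + (len(a_col) - support.size) * a_rth * wb_rth
    return lower, upper
```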

In short: sub-quadratic evaluation, sub-quadratic inference, 'nearly' sub-quadratic learning, and fully connected transition models are all allowed, at the cost of some extra work to extract the 'important' transitions from the data.

Outline: HMM Overview / Reducing quadratic complexity in the number of states (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Evaluation and Inference Speedup [figure]. Dataset: synthetic data with T = 2000 time steps.

Parameter Learning Speedup [figure]. Dataset: synthetic data with T = 2000 time steps.

Outline: HMM Overview / Reducing quadratic complexity in the number of states (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Datasets
DMC-friendly: generated from a 2-D Gaussian 20-state DMC HMM with K = 5 (20,000 train, 5,000 test).
Anti-DMC: generated from a 2-D Gaussian 20-state regular HMM with steadily varying, well-distributed transition probabilities (20,000 train, 5,000 test).
Motionlogger: accelerometer data from two sensors worn over several days (10,000 train, 4,720 test).

HMMs Used
Regular and DMC HMMs: 20 states. Baseline 1: a 5-state regular HMM (do we really need a large HMM?). Baseline 2: a 20-state HMM with uniform transition probabilities (does the transition model matter?).

Learning Curves for DMC-friendly data [figure]: the DMC model achieves the full model's score.

Learning Curves for Anti-DMC data [figure]: the DMC model is worse than the full model.

Learning Curves for Motionlogger data [figure]: the DMC model achieves the full model's score, and the baselines do much worse.

Regularization with DMC HMMs
Number of transition parameters in a regular 100-state HMM: 10,000. Number of transition parameters in a DMC 100-state HMM with K = 5: 500.
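Written out, using the slide's convention of counting only the non-constant transition entries for the DMC model:

```latex
N^2 = 100^2 = 10{,}000
\qquad \text{versus} \qquad
N \times K = 100 \times 5 = 500
```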

Tradeoffs between N and K
We vary N and K while keeping the number of transition parameters (N × K) constant. Increasing N and decreasing K allows more states for modeling data features, but fewer parameters per state for temporal structure.

Average test-set log-likelihoods at convergence [table] on datasets A: DMC-friendly, B: Anti-DMC, C: Motionlogger. Each dataset has a different optimal N-vs-K tradeoff.

Outline: HMM Overview / Reducing quadratic complexity in the number of states (the model; algorithms for fast evaluation and inference; algorithms for fast learning) / Results (speed; accuracy) / Conclusion

Conclusions
DMC HMMs are an important class of models that allow parameterized complexity-vs-efficiency tradeoffs in large state spaces. The speedup can be several orders of magnitude. Even for non-DMC domains, DMC HMMs yield higher scores than the baseline models. The DMC HMM model can be applied to arbitrary state spaces and observation densities.

Related Work
Felzenszwalb et al. (2003): fast HMM algorithms when transition probabilities can be expressed as distances in an underlying parameter space.
Murphy and Paskin (2002): fast inference in hierarchical HMMs cast as DBNs.
Salakhutdinov et al. (2003): combining EM and conjugate gradient for faster HMM learning when the amount of missing information is high.
Ghahramani and Jordan (1996): factorial HMMs for a distributed representation of large state spaces.
Beam search: a widely used heuristic for Viterbi inference in speech systems.

Future Work
Eliminate the R parameter using an automatic backoff evaluation approach. Investigate DMC HMMs as a regularization mechanism. Compare robustness against overfitting with factorial HMMs for large-state-space problems.

Thank You!