Ensemble Learning Method for Hidden Markov Models

1 Ensemble Learning Method for Hidden Markov Models
Anis Hamdi, May 19th, 2010. Advisor: Dr. Hichem Frigui

2 Outline Introduction Hidden Markov Models Ensemble HMM classifier
Motivations; Ensemble HMM architecture: similarity matrix computation, hierarchical clustering, model training, decision level fusion; Application to Landmine Detection; Proposed Future Work. Speaker notes: after the introduction, I will give some background material on HMMs. Then I will present the proposed ensemble method for HMMs: first the motivations, then the details of its four steps. In part 4, I will show the results of the eHMM on a real-world data set, landmine detection. Finally, conclusions and future work.

3 Introduction Classification is one of the key tasks in data mining.
Statistical learning problems in many fields involve sequential data: speech signals, stock market prices, protein sequences, etc. The scope of this work is the classification of sequential data. The standard approach in model-based classification is to learn one model per class. The main challenge for complex classification problems is how to account for the intra-class variability. For static data, Gaussian mixture models have been widely used. For sequential data, we intend to use a mixture of Hidden Markov Models to model the potential intra-class variability.

4 Introduction (cont.)
Figure: a mixture of HMM models λ1, λ2, λ3. S: sequences; the pi's are the HMM mixture probabilities; the θi are the HMM parameters.

5 Related work: Discrete HMMs
Given a set of N states {s1, s2, …, sN} and a set of M observation symbols {v1, v2, …, vM}, the process moves from one state to another, generating a sequence of states q1, q2, …, qt, … such that P(qt = sj | qt-1 = si) = aij, 1 ≤ i, j ≤ N: the state transition probabilities. The states are not visible, but at each state the model randomly generates one observation ot according to P(ot = vk | qt = si) = bik, 1 ≤ i ≤ N, 1 ≤ k ≤ M: the state emission probabilities. The probability that the system starts at state i is P(q1 = si) = πi, 1 ≤ i ≤ N: the initial state probabilities. The compact representation of the HMM is λ = (A, B, π). Figure: hidden states q1, q2, q3 emitting observations o1, o2, o3, with parameters π, A, B.
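The following sketch illustrates, with assumed toy values, how the discrete HMM parameters λ = (A, B, π) can be represented and used to generate an observation sequence; the state and symbol counts are illustrative and not taken from the presentation.

```python
import numpy as np

# Hypothetical discrete HMM with N = 3 hidden states and M = 2 observation symbols.
A = np.array([[0.7, 0.2, 0.1],   # A[i, j] = P(q_t = s_j | q_{t-1} = s_i)
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
B = np.array([[0.9, 0.1],        # B[i, k] = P(o_t = v_k | q_t = s_i)
              [0.5, 0.5],
              [0.1, 0.9]])
pi = np.array([0.6, 0.3, 0.1])   # pi[i] = P(q_1 = s_i)

def sample_sequence(A, B, pi, T, rng=np.random.default_rng(0)):
    """Generate a length-T observation sequence from the model (A, B, pi)."""
    q = rng.choice(len(pi), p=pi)                              # draw the initial state
    observations = []
    for _ in range(T):
        observations.append(rng.choice(B.shape[1], p=B[q]))   # emit a symbol from state q
        q = rng.choice(A.shape[1], p=A[q])                     # move to the next state
    return np.array(observations)
```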

6 Related work: Discrete HMMs (cont.)
Evaluation problem. Given the model λ = (A, B, π) and the observation sequence O = o1 o2 … oT, calculate the probability that model λ has generated sequence O. Solved with the forward-backward procedure. Decoding problem. Given the HMM λ = (A, B, π) and the observation sequence O = o1 o2 … oT, find the most likely sequence of hidden states that produced this observation sequence O. Solved with the Viterbi algorithm. Learning problem. Given K training observation sequences O = [O(1) O(2) … O(K)] and the general structure of the HMM (number of hidden states and number of codewords), determine the HMM parameters λ = (A, B, π) that best fit the training data. Solved with Maximum Likelihood (ML), Minimum Classification Error (MCE), or Variational Bayesian (VB) training.
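As a concrete illustration of the first two problems, here is a minimal pure-NumPy sketch of the scaled forward procedure (evaluation) and the Viterbi algorithm (decoding) for a discrete HMM in the (A, B, π) representation above; it is a simplified reference implementation, not the code used in the thesis.

```python
import numpy as np

def forward_loglikelihood(obs, A, B, pi):
    """Evaluation problem: log P(O | lambda) via the scaled forward procedure."""
    T = len(obs)
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for t in range(T):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]   # propagate and apply emission
        c = alpha.sum()                           # scaling factor to avoid underflow
        loglik += np.log(c)
        alpha = alpha / c
    return loglik

def viterbi(obs, A, B, pi):
    """Decoding problem: most likely hidden state path, computed in log space."""
    T, N = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = logpi + logB[:, obs[0]]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA            # scores[i, j]: best path ending i -> j
        psi[t] = scores.argmax(axis=0)            # best predecessor of each state j
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack through the stored choices
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```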

7 Outline Introduction Hidden Markov Models Ensemble HMM classifier
Motivations; Ensemble HMM architecture: similarity matrix computation, hierarchical clustering, model training, decision level fusion; Application to Landmine Detection; Proposed Future Work

8 Ensemble HMM: Motivations
Figure: sequences belonging to class 1 and sequences belonging to class 0. Using all sequences to train a single model for class 1 may lead to too much averaging of the sequences and a loss of the discriminative characteristics within class 1. One model needs to be learned for each group of similar sequences. How to group sequences? Ground truth is not sufficient.

9 Ensemble HMM: Overview
We assume that the data is generated by K HMM models. These different models reflect the "natural" partitions within the data, regardless of the ground truth labels. Partitioning and model identification are achieved through clustering in the log-likelihood space. The resulting clusters can vary: different sizes, homogeneous or heterogeneous. We adapt the learning to the different clusters and fuse the multiple HMM outputs.

10 eHMM: Block Diagram
Block diagram of the eHMM: the training data goes through similarity matrix computation and hierarchical clustering; homogeneous clusters are trained with Baum-Welch (models λ1 … λJ), mixed clusters with MCE training (one model per class, λJ+1,1 … λE,C, combined with a max), and small clusters with VB training (models λE+1 … λK); the model confidences are combined by decision level fusion.

11 eHMM: Similarity Matrix Computation Fitting individual models to sequences
Initial HMM for each sequence: fix the number of states, N; cluster the sequence elements into N clusters; each cluster center is a state representative; define the codebook symbols as the sequence vectors. Training: the Baum-Welch algorithm is used to learn the HMM parameters that best fit a particular sequence. Overfitting: we want each model to fit its corresponding sequence perfectly; we are not looking to use the trained models for generalization. Figure: individual models λ1, λ2, …, λR, one per training sequence.
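A sketch of fitting one model per training sequence, assuming the third-party hmmlearn package for Baum-Welch training (CategoricalHMM in recent versions, MultinomialHMM in older ones) and scikit-learn's KMeans for the state representatives. For simplicity, the sketch quantizes each sequence vector to its nearest state representative, whereas the slide defines the codebook symbols as the sequence vectors themselves.

```python
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm  # assumed third-party discrete-HMM trainer

def fit_sequence_model(sequence, n_states=4, seed=0):
    """Overfit one discrete HMM lambda_r to a single training sequence O_r.

    `sequence` is a (T, d) array of feature vectors."""
    # State representatives: cluster the sequence elements into N clusters.
    km = KMeans(n_clusters=n_states, n_init=10, random_state=seed).fit(sequence)
    symbols = km.labels_.reshape(-1, 1)       # quantized sequence (codebook indices)
    # Baum-Welch re-estimation on this one sequence only (overfitting is intended).
    model = hmm.CategoricalHMM(n_components=n_states, n_iter=50, random_state=seed)
    model.fit(symbols)
    return model, km.cluster_centers_         # the model and its state representatives
```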

12 eHMM: Similarity Matrix Computation Computing the similarity matrix
Test each training sequence with each learned model and construct a pair-wise penalized log-likelihood matrix L. Pr(Oi | λj): the probability of sequence Oi being generated by model λj. sq(i): the representative of state q in model λi. q(ij) = q1(ij) … qT(ij): the most likely hidden state sequence that generated sequence Oi from model λj. α: mixing factor. L is not symmetric, so a symmetrization scheme is used to transform it into a similarity matrix.

13 eHMM: Similarity Matrix Computation Penalized loglikelihood
Log-likelihood of sequence Oi being generated from model λj: two similar sequences should have high likelihood values of being generated from their respective HMM models. Viterbi path mismatch term: two similar sequences should have similar Viterbi paths. Mixing factor α: a trade-off parameter between the likelihood-based similarity and the Viterbi-path-mismatch-based similarity.
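The exact penalized log-likelihood equation is not reproduced in the transcript; the sketch below shows one plausible way of combining the two terms and symmetrizing the result. The mixing of the terms and the rescaling are assumptions of this sketch, not the thesis formula.

```python
import numpy as np

def penalized_similarity_matrix(loglik, path_mismatch, alpha=0.5):
    """Build a pair-wise similarity matrix for the R training sequences.

    loglik[i, j]        : log Pr(O_i | lambda_j)
    path_mismatch[i, j] : Viterbi-path mismatch between O_i under lambda_i and lambda_j
    alpha               : mixing factor between the two terms (assumed form)
    """
    # Penalized log-likelihood: reward model fit, penalize Viterbi-path disagreement.
    L = alpha * loglik - (1.0 - alpha) * path_mismatch
    # Rescale to [0, 1] and symmetrize (one simple scheme; the thesis may use another).
    L = (L - L.min()) / (L.max() - L.min() + 1e-12)
    return 0.5 * (L + L.T)
```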

14 eHMM: Similarity-based Clustering
The previous step resulted in a penalized-log-likelihood-based similarity matrix. Since the data is available in relational form, we use a standard agglomerative hierarchical clustering algorithm with the complete-link inter-cluster distance. Agglomerative hierarchical clustering is a bottom-up approach that starts with each data point as its own cluster and then iteratively merges the most similar clusters according to an inter-cluster distance. In the complete-link algorithm, the distance between two clusters is the maximum of all pair-wise distances between sequences in the two clusters; it produces compact clusters.
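A minimal sketch of this step using SciPy's complete-link agglomerative clustering; converting the similarity matrix to a distance matrix (distance = 1 − similarity) and cutting the dendrogram into K clusters are assumptions consistent with the description above (K = 20 is the value used later for the GPR data).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_sequences(similarity, n_clusters=20):
    """Complete-link clustering of sequences from a pair-wise similarity matrix."""
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)                  # zero self-distances
    condensed = squareform(distance, checks=False)   # condensed form expected by linkage
    Z = linkage(condensed, method='complete')        # complete-link merges (dendrogram)
    return fcluster(Z, t=n_clusters, criterion='maxclust')  # cluster label per sequence
```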

15 eHMM: Models Construction Models initialization
For each model λk: Initial values for the initial state and state transition probabilities (πk and Ak) of model λk are obtained by averaging the initial state and state transition probabilities of the individual models of the sequences belonging to cluster k. The state representatives, s(k), of model λk are obtained by clustering the observation vectors of the sequences belonging to cluster k into N clusters. The codebook symbols, v(k), of model λk are obtained by clustering the observation vectors of the sequences belonging to cluster k into M clusters. For each symbol v(k)m, the membership in each state s(k)n is computed. Once π, A, and B are initialized, we proceed to the training. Figure: cluster models λ1, λ2, …, λK.
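A sketch of this initialization for one cluster. The soft membership of each codebook symbol in each state (the initial B matrix) is computed here from distances between k-means centers, which is an assumed stand-in for the membership equation on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_cluster_model(seq_models, cluster_vectors, n_states=4, n_symbols=8, seed=0):
    """Initialize lambda_k = (A, B, pi) for one cluster (a sketch of the described scheme).

    seq_models      : list of (A_r, pi_r) pairs from the per-sequence models in cluster k
    cluster_vectors : stacked observation vectors of all sequences in cluster k
    """
    # pi_k and A_k: averages of the individual per-sequence models.
    A0 = np.mean([A for A, _ in seq_models], axis=0)
    pi0 = np.mean([pi for _, pi in seq_models], axis=0)

    # State representatives and codebook symbols via k-means on the raw vectors.
    states = KMeans(n_clusters=n_states, n_init=10, random_state=seed).fit(cluster_vectors)
    symbols = KMeans(n_clusters=n_symbols, n_init=10, random_state=seed).fit(cluster_vectors)

    # B0[n, m]: membership of symbol v_m in state s_n, here a soft assignment based on
    # distances between centers (an assumed form of the membership equation on the slide).
    d = np.linalg.norm(states.cluster_centers_[:, None, :]
                       - symbols.cluster_centers_[None, :, :], axis=2)
    B0 = np.exp(-d)
    B0 /= B0.sum(axis=1, keepdims=True)              # each state's emissions sum to 1
    return A0, B0, pi0
```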

16 eHMM: Models Construction Models training
For clusters whose sequences are similar and mainly belong to the same ground-truth class, the class-conditional posterior probability is expected to be unimodal and peaked around the maximum likelihood estimate of the parameters; maximum likelihood estimation results in HMM parameters that best fit this particular class. For clusters with a mixture of sequences belonging to different classes, the posterior distribution is expected to be multimodal; we initialize a model for each class within the cluster and focus on finding the class boundaries within the posterior: the model parameters are jointly optimized so that the overall misclassification error is minimized. MLE and MCE approaches need a large number of data points to give good estimates of the model parameters; for clusters with a small number of sequences, a Bayesian approach is used to approximate the class-conditional posterior distribution, and variational Bayesian training is therefore suitable. (Speaker note: the variational Bayesian implementation is not yet complete.)

17 eHMM: Models Construction Models training
For clusters that are dominated by sequences from only one class, we use the standard Baum-Welch re-estimation procedure, yielding models λjBW, j = 1..J. For clusters with a mixture of observations belonging to different classes, we use discriminative training based on minimizing the misclassification error to learn a model for each class, yielding models λi,cMCE, i = J+1..E, c = 1..C. For clusters containing a small number of sequences, we use a variational Bayesian method to update the model parameters given the observed data, yielding models λkVB, k = E+1..K.
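A dispatch sketch of this per-cluster training strategy. The trainer callables, the minimum cluster size, and the purity threshold are hypothetical placeholders; the thesis does not state these values in the transcript.

```python
def train_cluster_models(clusters, trainers, min_size=10, purity_threshold=0.9):
    """Send each cluster to one of the three trainers described above (a sketch).

    clusters : list of (sequences, labels, init_model) tuples, one per cluster
    trainers : dict with 'baum_welch', 'mce', and 'vb' callables standing in for
               the actual training routines (not implemented here)"""
    trained = []
    for sequences, labels, init_model in clusters:
        purity = max(labels.count(c) for c in set(labels)) / len(labels)
        if len(sequences) < min_size:
            trained.append(trainers['vb'](sequences, init_model))           # small cluster
        elif purity >= purity_threshold:
            trained.append(trainers['baum_welch'](sequences, init_model))   # homogeneous
        else:
            trained.append(trainers['mce'](sequences, labels, init_model))  # mixed classes
    return trained
```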

18 eHMM: Decision Level Fusion
Let Г = {λjBW, λi,cMCE, λkVB}, where j = 1..J, i = J+1..E, c = 1..C, and k = E+1..K, be the resulting mixture model after the eHMM training. To test a new sequence O, we evaluate it in each model of Г and fuse the resulting confidences, as detailed on the following slides.

19 eHMM: Decision Level Fusion
Let F(r, k) = log Pr(Or | γk), 1 ≤ r ≤ R, 1 ≤ k ≤ K, be the R-by-K log-likelihood matrix. Each row Fi, i = 1..R, of F represents the feature vector of sequence i in the decision space. Thus, the set of sequences is mapped to a Euclidean confidence space.
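A small sketch of this mapping: each sequence is scored against the K trained cluster models to build the R-by-K matrix F. The score_fn callable stands in for whatever log-likelihood routine the trained models expose and is an assumption of this sketch.

```python
import numpy as np

def loglik_feature_matrix(sequences, models, score_fn):
    """Map each sequence O_r to its K-dimensional log-likelihood feature vector F_r.

    score_fn(seq, model) is assumed to return log Pr(O_r | gamma_k) for one of the
    K trained cluster models; the result is the R-by-K matrix F used for fusion."""
    R, K = len(sequences), len(models)
    F = np.empty((R, K))
    for r, seq in enumerate(sequences):
        for k, model in enumerate(models):
            F[r, k] = score_fn(seq, model)
    return F
```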

20 eHMM: Decision Level Fusion ANN combination
Simple combination methods could be used, such as the mean, the maximum, or majority voting. However, these methods are not trainable and require the proper identification of cluster-to-class associations. Thus, we use a simple neural network to model the potentially nonlinear mapping between the individual confidence values and the predicted output confidence/class. The combination function is learned by the network, and the final output is a sigmoid function of the combined confidences.
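A sketch of the ANN combiner using scikit-learn's MLPClassifier with sigmoid (logistic) units; the network size and training settings are illustrative assumptions, not the configuration used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_ann_fusion(F_train, labels, hidden_units=10, seed=0):
    """Train a small feed-forward network to fuse the K per-model confidences.

    F_train is the R-by-K log-likelihood matrix; labels are assumed to be
    0 = clutter, 1 = mine."""
    net = MLPClassifier(hidden_layer_sizes=(hidden_units,),
                        activation='logistic',       # sigmoid units, as on the slide
                        max_iter=2000, random_state=seed)
    net.fit(F_train, labels)
    return net

def fused_confidence(net, F_test):
    """Final output confidence: probability of the mine class for each test sequence."""
    return net.predict_proba(F_test)[:, 1]
```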

21 eHMM: Decision Level Fusion HME combination
The input to the HME network is a K-dimensional vector F. The network is comprised of expert networks and gating networks. Each expert network computes an output of the form f(U · F), where U is a weight vector and f is a link function; f is the identity function for regression problems and the logistic function for binary classification.

22 eHMM: Decision Level Fusion HME combination
The gating networks compute normalized mixing weights from F using weight vectors vi (typically a softmax over the vi · F scores). The weight vectors U and vi are the HME parameters and can be learned using a gradient descent method or an EM-like approach.
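A minimal single-level HME forward pass consistent with the description above; the softmax gate and logistic experts are the standard HME choices and are assumptions here, since the slide's equations are not reproduced in the transcript.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hme_output(F, U, V):
    """Single-level HME forward pass (a sketch).

    F : (K,) input vector of per-model confidences
    U : (E, K) expert weight vectors; experts use the logistic link for binary output
    V : (E, K) gating weight vectors; the gate is a softmax over the v_i . F scores
    """
    expert_out = sigmoid(U @ F)                 # mu_i = f(U_i . F)
    gate_scores = V @ F
    gate = np.exp(gate_scores - gate_scores.max())
    gate /= gate.sum()                          # g_i = exp(v_i . F) / sum_j exp(v_j . F)
    return float(gate @ expert_out)             # overall confidence: sum_i g_i * mu_i
```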

23 Outline Introduction Hidden Markov Models Ensemble HMM classifier
Application to Landmine Detection: GPR data, EHD feature extraction, baseline HMM classifier, experimental results; Proposed Future Work

24 Application to Landmine Detection: GPR data
Ground Penetrating Radar (GPR) offers the promise of detecting landmines with little or no metal content, at the expense of a higher false alarm rate. A GPR signature is a 3-dimensional matrix of sample values S(z, x, y), where (z, x, y) represent the depth, cross-track position, and down-track position, respectively. The down-track position is treated as the time variable in our HMM modeling. Figures: the NIITEK vehicle-mounted GPR system, GPR scans, and a GPR signature.

25 Application to Landmine Detection: EHD feature extraction
Simple edge detector operators are used to identify edges and group them into five categories: horizontal, vertical, diagonal, anti-diagonal, and isotropic (non-edge). Figure: illustration of the EHD feature extraction process.
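As a rough illustration of the idea (not the exact operators used in the thesis), the sketch below computes a five-bin edge histogram over 2×2 cells using MPEG-7-style edge operators; the kernels, cell size, and threshold are assumptions.

```python
import numpy as np

# Hypothetical 2x2 edge operators for the first four categories; the fifth bin (NE)
# collects cells whose strongest response falls below the threshold.
KERNELS = [
    np.array([[1.0,  1.0], [-1.0, -1.0]]),                    # horizontal (H)
    np.array([[1.0, -1.0], [ 1.0, -1.0]]),                    # vertical (V)
    np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),        # diagonal (D)
    np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),        # anti-diagonal (AD)
]

def edge_histogram(block, threshold=10.0):
    """Return a normalized 5-bin histogram (H, V, D, AD, NE) of edge types in a 2-D block."""
    h = np.zeros(5)
    rows, cols = block.shape
    for r in range(0, rows - 1, 2):
        for c in range(0, cols - 1, 2):
            cell = block[r:r + 2, c:c + 2]
            responses = [abs((kern * cell).sum()) for kern in KERNELS]
            if max(responses) >= threshold:
                h[int(np.argmax(responses))] += 1   # dominant edge category
            else:
                h[4] += 1                           # isotropic / non-edge cell
    return h / max(h.sum(), 1)
```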

26 Application to Landmine Detection: Baseline HMM classifier
The baseline HMM classifier has two HMM models, one for mines and one for the background. Each model has four states. The mine model assumes that mine signatures have a hyperbolic shape. Each model produces a likelihood value and a most likely Viterbi path (by backtracking through the model states), obtained with the forward-backward and the Viterbi algorithms, respectively. The confidence value assigned to each observation sequence O is computed from these two model outputs. Figures: illustration of the baseline HMM mine model and of the baseline HMM architecture.
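The exact confidence formula on the slide is not reproduced in the transcript; a log-likelihood ratio between the mine and background models is one common choice for a two-model HMM detector and is used below purely as an illustrative assumption.

```python
def baseline_confidence(loglik_mine, loglik_background):
    """Assumed confidence of the two-model baseline HMM: a log-likelihood ratio.

    loglik_mine and loglik_background are log P(O | mine model) and
    log P(O | background model) for the same observation sequence O."""
    return loglik_mine - loglik_background
```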

27 Application to Landmine Detection: eHMM landmine detector

28 Application to Landmine Detection: eHMM landmine detector
(1) Feature extraction results in a set of R sequences of length T = 15 each. (2) Similarity matrix computation: fit a model to each sequence, compute the likelihood and Viterbi path of each sequence in each model, and deduce the pair-wise similarity matrix. (3) Pair-wise similarity-based clustering, using the standard hierarchical algorithm with the complete-link distance and K = 20 clusters.

29 Application to Landmine Detection: eHMM landmine detector (cont.)
(4) Model initialization and training. For each cluster k, the initialization of λk = (A, B, π) is done using the sequences (and their corresponding models λr) belonging to the cluster. For clusters that will be trained using MCE, one model is initialized for each class: λk mine and λk background. Training is done according to the procedure described earlier: large clusters dominated by a majority of mine or clutter signatures are trained using maximum likelihood estimation; large clusters containing a mixture of signatures from both classes are trained using MCE-based discriminative training; small clusters are trained using the variational Bayesian method. (5) Decision level fusion, done using the ANN and HME fusion methods detailed earlier for the general ensemble HMM classifier.

30 Application to Landmine Detection: The dataset
The eHMM was trained and tested on GPR data collected by a NIITEK system at 3 different locations, for a total of 12 lanes and 1616 signatures: 605 mine signatures and 1011 clutter signatures. The EHD features are used, so each signature is represented by a sequence of EHD feature vectors.

31 Application to Landmine Detection: eHMM clustering results
Figures: (a) similarity matrix after clustering, (b) dendrogram. As shown in figure (a), the diagonal blocks of the matrix are darker, which corresponds to higher intra-cluster similarities. The dendrogram in figure (b) shows that, at a certain threshold, we can identify two main groups of clusters. On the left-hand side of the dendrogram, clusters (3, 17, …, 13) contain mainly clutter signatures. On the right-hand side, clusters (1, 15, …, 5) are dominated by mine signatures, as can be seen in the next figure.

32 Application to Landmine Detection: eHMM clustering results
Some clusters (e.g. clusters 1, 5, and 15) have a large number of mines and few or no clutter alarms. Other clusters are dominated by clutter with few mines (e.g. clusters 2, 13, and 18); the few mines included in these clusters typically have weak signatures, being either low-metal mines or mines buried at deep depths. The remaining clusters are composed of a mixture of mine and clutter signatures (e.g. clusters 3, 6, and 11); the mines within these clusters are either low-metal mines (figure (b), cluster 6) or mines buried at deep depths (figure (c), clusters 3 and 11). Figure: distribution of the alarms in each cluster, (a) per class, (b) per type, (c) per depth.

33 Application to Landmine Detection: eHMM clustering results
Figure: example models for two of the clusters, showing their sequences, state representatives, the emission probabilities of the five EHD symbols (H, V, D, AD, NE) for states S1–S4, and the 4×4 state transition matrices A.

34 Application to Landmine Detection: Individual models performances
Figure 1: (a) a sample signature from cluster 1, (b) model responses to the signature in (a). As expected, the highest likelihood occurs when testing this sequence with the HMM model of cluster 1. Moreover, the higher likelihood values correspond to the mine-dominated clusters (1, 4, 5, 6, …, 14, 15, 17, …). Figure 2 ((a) a sample signature from cluster 2, (b) model responses to the signature in (a)) shows that a test sequence belonging to cluster 2 has high likelihoods in the clutter-dominated clusters' models.

35 Application to Landmine Detection: Individual models performances
Even though the two models are dominated by mine signatures, we see that not all confidence values are highly correlated. In fact, some strong mine signatures have high likelihoods in model 5 and lower likelihoods in model 1 (upper left side of the scatter plot, region R1). This can be attributed to the fact that cluster 5 contains mainly strong mines and is thus more likely to yield a high log-likelihood when testing a strong mine signature. On the other hand, in region R2, the model of cluster 1 performs better, as it gives higher likelihood values to the "weak" mines in that region. In the proposal report, a similar scatter plot between model 5 (strong mines) and model 2 (clutter) is presented; it shows the decorrelation between the two models. Figure: scatter plot of the log-likelihoods of the training data in model 5 (strong mines) versus model 1 (weak mines); clutter, low metal (LM), and high metal (HM) signatures at different depths are shown with different symbols and colors.

36 Application to Landmine Detection: Individual models performances
The individual ROCs show that the models perform differently at different false alarm rates. We also notice that no model consistently outperforms the other models. Figure: individual ROCs of some models; solid lines: clusters dominated by mines; dashed lines: clusters dominated by clutter.

37 Application to Landmine Detection: eHMM performance
For the remainder of the experiments, we use a 4-fold cross-validation technique to average the results of the eHMM on unseen data. In each fold, the eHMM is trained on a subset of the original data (three-fourths of the data samples) and tested on the remaining samples. Figure: comparison of the eHMM with the best 3 cluster models (1, 2, and 12).
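A sketch of the 4-fold protocol described above; train_ehmm and score_ehmm are hypothetical placeholders for the full eHMM pipeline (clustering, per-cluster training, fusion) and its test-time scoring, and the stratified splitting is an assumption of this sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def four_fold_confidences(sequences, labels, train_ehmm, score_ehmm, seed=0):
    """Train on 3/4 of the alarms and score the held-out 1/4, over 4 folds."""
    labels = np.asarray(labels)
    confidences = np.zeros(len(labels), dtype=float)
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(np.zeros((len(labels), 1)), labels):
        model = train_ehmm([sequences[i] for i in train_idx], labels[train_idx])
        confidences[test_idx] = score_ehmm(model, [sequences[i] for i in test_idx])
    return confidences  # one eHMM confidence per alarm, usable for ROC scoring
```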

38 Application to Landmine Detection: eHMM performance
ANN-based eHMM. Figure: comparison of the eHMM with the baseline HMM.

39 Application to Landmine Detection: eHMM performance
The eHMM outperforms the baseline HMM, as the majority of mine signatures are located above the diagonal of the scatter plot. Mine signatures belonging to region R1 are assigned relatively high confidence values by the eHMM but low confidence values by the baseline HMM; these are all weak mines (low metal and buried at 3" or deeper). Signatures in region R2 are assigned relatively high confidence values by both classifiers. Region R3 contains mainly background signatures and weak mine signatures (low-metal mines buried at deep depths) that are assigned low confidence values by both classifiers. Figure: scatter plot of the confidence values of the test data in the eHMM vs. the baseline HMM classifier; clutter, low metal (LM), and high metal (HM) signatures at different depths are shown with different symbols and colors.

40 Conclusions Ensemble HMM classifier is proposed
Learn one model per training sequence; cluster the sequences in the log-likelihood space; learn an HMM model for each cluster using training techniques adapted to the cluster. The multiple models are expected to capture the intra-class variations, and their outputs are fused using an ANN or an HME. In an application to the landmine detection problem, the eHMM steps are individually analyzed and the overall performance is significantly better than that of the baseline DHMM.

41 Proposed Future Work eHMM implementation improvements Applications
eHMM implementation improvements: joint optimization of the clustering, training, and fusion steps; use variational Bayesian learning for small clusters; use BIC to optimize the HMM model structures; use BIC to optimize the number of clusters. Applications: identify potential cross-domain applications to evaluate the eHMM; compare the eHMM performance to other ensemble methods, such as the AdaBoost algorithm with HMMs as weak classifiers.

42 Thank you! Questions?

