Profile Hidden Markov Models (PHMM)
Mark Stamp
Hidden Markov Models
Here, we assume you know about HMMs
o If not, see "A revealing introduction to hidden Markov models"
Executive summary of HMMs
o HMM is a machine learning technique…
o …and a discrete hill climb technique
o Train a model based on an observation sequence
o Score any given sequence to determine how closely it matches the model
o Efficient algorithms and many useful applications
HMM Notation
Recall, an HMM is denoted λ = (A, B, π)
Observation sequence is O
Notation:
o T = length of the observation sequence
o N = number of states in the model
o M = number of observation symbols
o A = state transition probability matrix (N x N)
o B = observation probability matrix (N x M)
o π = initial state distribution
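To make the executive summary concrete, here is a minimal sketch of scoring with a plain HMM's forward algorithm; the matrices below are illustrative values, not from these slides, and scaling (needed for long sequences) is omitted:

```python
import numpy as np

# Illustrative HMM lambda = (A, B, pi): N = 2 states, M = 3 observation symbols
A  = np.array([[0.7, 0.3],          # state transition probabilities
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],     # row i = observation probabilities in state i
               [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])           # initial state distribution

def forward_score(O):
    """Return P(O | lambda) via the forward algorithm."""
    alpha = pi * B[:, O[0]]                # base case
    for t in range(1, len(O)):
        alpha = (alpha @ A) * B[:, O[t]]   # induction step
    return alpha.sum()

print(forward_score([0, 1, 2, 1]))
```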
Hidden Markov Models
Among the many uses for HMMs…
Speech analysis
Music search engine
Malware detection
Intrusion detection systems (IDS)
And more all the time
Limitations of HMMs
Positional information not considered
o HMM has no "memory" beyond the previous state
o Higher order models have more "memory"
o But no explicit use of positional information
With HMM, no insertions or deletions
These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions can occur
Profile HMM
Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
o In some ways, PHMM is easier than HMM
o In some ways, PHMM is more complex
The basic idea of PHMM?
o Define multiple B matrices
o Almost like having an HMM for each position in the sequence
In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o Analogous to the training phase for an HMM
Generate PHMM based on the MSA
o This is easy, once the MSA is known
o Again, the hard part is generating the MSA
Then we can score sequences using the PHMM
o Use the forward algorithm, similar to HMM
Training: PHMM vs HMM
Training a PHMM
o Determining the MSA: challenging
o Determining the PHMM matrices: easy
Training an HMM
o Appending the training sequences: trivial
o Determining the HMM matrices: challenging
PHMM and HMM are, in this sense, almost opposites…
Generic View of PHMM
Have delete, insert, and match states
o Match states correspond to HMM states
Arrows are possible transitions
o Each transition has a probability
Transition probabilities form the A matrix
Emission probabilities form the B matrices
o In a PHMM, observations are called emissions
o Match and insert states have emissions
PHMM without Gaps
If no gaps, the PHMM is simple
Illustrate such a PHMM as a chain of match states, M1 → M2 → … → Mn
Here, Mi is the i-th "match state"
o This diagram neglects the B matrices
o Recall that there is a distinct B matrix for each match state
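A minimal sketch of the "distinct B matrix per match state" idea as a data structure; the DNA alphabet and the emission values are illustrative assumptions:

```python
# One emission distribution (a "B matrix" column) per match state;
# a gapless PHMM is just this list, traversed left to right.
phmm_no_gaps = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # emissions at M1
    {"A": 0.2, "C": 0.5, "G": 0.2, "T": 0.1},  # emissions at M2
    {"A": 0.1, "C": 0.1, "G": 0.2, "T": 0.6},  # emissions at M3
]

def score_no_gaps(x):
    """Probability of x under the gapless model: one symbol per match state,
    so the score is just the product of per-state emission probabilities."""
    assert len(x) == len(phmm_no_gaps)
    p = 1.0
    for symbol, emissions in zip(x, phmm_no_gaps):
        p *= emissions.get(symbol, 0.0)
    return p

print(score_no_gaps("ACT"))  # 0.7 * 0.5 * 0.6 = 0.21
```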
PHMM with Insertions
If we also allow for insertions, the diagram gains an insert state Ii (with a self-loop) between consecutive match states
The self-loop allows for multiple insertions
PHMM with Deletions
If instead we allow for deletions, the diagram gains a delete state Di above each match state
Note that a deletion skips over the corresponding match state
Generic View of PHMM
Circles are delete states, diamonds are insert states, squares are match states
Note the many possible transitions
PHMM Notation
Notation:
o X = (x_1, x_2, …, x_n) is the sequence of emitted (observed) symbols
o Mi, Ii, Di are the i-th match, insert, and delete states
o a_{X,Y} is the transition probability from state X to state Y
o e_{Mi}(k) is the probability of emitting symbol k at match state Mi (similarly e_{Ii}(k) at insert states)
Match state probabilities are easily determined from the MSA
o a_{Mi,Mi+1} for transitions between match states
o e_{Mi}(k) for the emission probability at match state Mi
Many other transition probabilities
o For example, a_{Mi,Ii} and a_{Mi,Di+1}
Emissions at all match & insert states
o Remember, "emission" == "observation"
Multiple Sequence Alignment
First we show MSA construction
o This is the difficult part
o Lots of ways to do this
o "Best" way depends on the specific problem
Then construct the PHMM from the MSA
o This is the easy part
o Standard algorithm to generate the PHMM
How to score a sequence?
o Forward algorithm, similar to HMM
MSA
How to construct an MSA?
o Construct pairwise alignments
o Combine pairwise alignments into the MSA
Allow gaps to be inserted
o To make better matches
Gaps tend to weaken PHMM scoring
o So, tradeoff between number of gaps and strength of score
Global vs Local Alignment
For these pairwise alignment examples…
o "-" is a gap
o "|" means the elements are aligned
o "*" marks omitted beginning/ending symbols
Global vs Local Alignment
Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we do MSA
o More gaps, more random sequences match…
o …and the result is less useful for scoring
We usually only consider local alignment
o That is, omit the ends for a better alignment
For simplicity, we do global alignment in the examples presented here
Pairwise Alignment
Allow gaps when aligning
How to score an alignment?
o Based on an n x n substitution matrix S
o Where n is the number of symbols
What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, an HMM is used
o Other?
Local alignment? Additional issues arise…
Pairwise Alignment Example
Tradeoff: gaps vs misaligned elements
o Depends on the matrix S and the gap penalty function
Substitution Matrix
For example, masquerade detection
o Detect an imposter using a computer account
Consider 4 different operations
o E == send email
o G == play games
o C == C programming
o J == Java programming
How similar are these to each other?
Substitution Matrix
Consider 4 different operations:
o E, G, C, J
Possible substitution matrix: see the illustrative sketch below
Diagonal matches
o High positive scores
Which others are most similar?
o J and C, so substituting C for J gets a high score
Game playing and programming are very different
o So substituting G for C gets a negative score
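A plausible substitution matrix consistent with the intuition above; the specific scores here are illustrative assumptions, chosen only to show a strong diagonal, a high C/J score, and a negative G/C score:

```python
# Operations E, G, C, J; upper triangle stored, lookup is symmetric.
S = {
    ("E", "E"):  9, ("E", "G"): -2, ("E", "C"): -1, ("E", "J"): -1,
    ("G", "G"):  9, ("G", "C"): -4, ("G", "J"): -4,
    ("C", "C"): 10, ("C", "J"):  5,
    ("J", "J"): 10,
}

def sub_score(a, b):
    """Symmetric lookup into the substitution matrix S."""
    return S.get((a, b), S.get((b, a)))

print(sub_score("C", "J"))  # 5: similar operations
print(sub_score("G", "C"))  # -4: dissimilar operations
```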
Substitution Matrix
Depending on the problem, it might be easy or very difficult to find a useful S matrix
Consider masquerade detection based on UNIX commands
o Sometimes difficult to say how "close" two commands are
Suppose instead we are aligning DNA sequences
o Biologically valid reasons for the S matrix
Gap Penalty
Generally must allow gaps to be inserted
But gaps make the alignment more generic
o Less useful for scoring, so we penalize gaps
How to penalize gaps? Two common ways:
Linear gap penalty function: g(x) = ax
o A constant penalty for every gap
Affine gap penalty function: g(x) = a + b(x - 1)
o Gap opening penalty a, and constant penalty b for each extension of an existing gap
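A quick sketch of the two gap penalty functions; the parameter values a and b are illustrative:

```python
def linear_gap(x, a=4):
    """Linear penalty: every gap position costs the same."""
    return a * x

def affine_gap(x, a=11, b=1):
    """Affine penalty: opening a gap costs a, each extension costs b."""
    return a + b * (x - 1)

# A run of 5 gaps under each scheme:
print(linear_gap(5))   # 20
print(affine_gap(5))   # 15 -- long gaps penalized less harshly
```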
Pairwise Alignment Algorithm
We use dynamic programming
o Based on the S matrix and the gap penalty function
Notation:
o x = (x_1, …, x_n) and y = (y_1, …, y_m) are the sequences to align
o S(x_i, y_j) is the substitution score for aligning x_i with y_j
o d is the (linear) gap penalty
o F(i, j) is the score of the best alignment of x_1, …, x_i with y_1, …, y_j
Pairwise Alignment DP
Initialization:
o F(0, 0) = 0
o F(i, 0) = -id and F(0, j) = -jd
Recursion:
o F(i, j) = max{ F(i-1, j-1) + S(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }
where the three cases correspond to aligning x_i with y_j, a gap in y, and a gap in x, respectively
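A minimal sketch of this dynamic program; the substitution function passed in and the gap penalty value are illustrative assumptions:

```python
def align_score(x, y, sub, d=4):
    """Score of the best global alignment of x and y, via the DP above,
    with substitution function sub(a, b) and linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d                  # leading gaps in y
    for j in range(1, m + 1):
        F[0][j] = -j * d                  # leading gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i-1][j-1] + sub(x[i-1], y[j-1]),  # align x_i with y_j
                F[i-1][j] - d,                      # gap in y
                F[i][j-1] - d,                      # gap in x
            )
    return F[n][m]

# Toy substitution function: +9 for a match, -3 for a mismatch
print(align_score("EEGCJ", "EGCCJ", lambda a, b: 9 if a == b else -3))
```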
MSA from Pairwise Alignments
Given pairwise alignments…
How to construct the MSA?
Generally use "progressive alignment"
o Select one pairwise alignment
o Select another and combine it with the first
o Continue to add more until all are used
Relatively easy (good)
Gaps proliferate, and it's unstable (bad)
MSA from Pairwise Alignments
Lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily "best" or most popular
Feng-Doolittle progressive alignment
o Compute scores for all pairs of n sequences
o Select n-1 alignments that a) "connect" all sequences and b) maximize pairwise scores
o Then generate a minimum spanning tree
o For the MSA, add sequences in the order that they appear in the spanning tree
MSA Construction
Create pairwise alignments
o Generate substitution matrix S
o Dynamic programming for the pairwise alignments
Use pairwise alignments to make the MSA
o Use pairwise alignments to construct a spanning tree (e.g., via Prim's algorithm; see the sketch below)
o Add sequences in spanning tree order (from high score, inserting gaps as needed)
o Note: the gap penalty is used here
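A sketch of the spanning-tree step: Prim's algorithm run over pairwise alignment scores, grown as a maximum spanning tree since higher scores are better. The score table for four sequences is a made-up stand-in:

```python
def max_spanning_tree(scores, n):
    """Prim's algorithm over pairwise alignment scores: grow a maximum
    spanning tree, returning edges in the order sequences are connected."""
    in_tree = {1}                 # start from sequence 1 (arbitrary)
    order = []
    while len(in_tree) < n:
        # Pick the highest-scoring edge joining the tree to a new sequence
        best = max(
            ((i, j) for i in in_tree for j in range(1, n + 1) if j not in in_tree),
            key=lambda e: scores[(min(e), max(e))],
        )
        in_tree.add(best[1])
        order.append(best)
    return order

# Illustrative symmetric scores for 4 sequences, keyed (low, high)
scores = {(1, 2): 85, (1, 3): 63, (1, 4): 74, (2, 3): 69, (2, 4): 22, (3, 4): 90}
print(max_spanning_tree(scores, 4))  # [(1, 2), (1, 4), (4, 3)]
```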
MSA Example
Suppose we have 10 sequences, and a pairwise alignment score has been computed for each pair
MSA Example: Spanning Tree
Spanning tree based on the pairwise scores
So, process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
MSA Snapshot
Intermediate step and final result
o Use "+" for a neutral symbol
o Then "-" for gaps in the MSA
Note the increase in gaps
PHMM from MSA
In a PHMM, determine the match and insert states & probabilities from the MSA
"Conservative" columns become match states
o Half or less of the symbols are gaps
Other columns become insert states
o Majority of the symbols are gaps
Delete states are a separate issue
PHMM States from MSA
Consider a simpler MSA…
Columns 1, 2, 6 are match states 1, 2, 3, respectively
o Since less than half of their symbols are gaps
Columns 3, 4, 5 are combined to form insert state 2
o Since more than half of their symbols are gaps
o This insert state lies between match states 2 and 3
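A sketch of the column classification rule; the MSA below is invented for illustration:

```python
def classify_columns(msa):
    """Label each MSA column 'match' if at most half its symbols are gaps,
    else 'insert' (consecutive insert columns merge into one insert state)."""
    labels = []
    for j in range(len(msa[0])):
        column = [row[j] for row in msa]
        gaps = column.count("-")
        labels.append("match" if gaps <= len(msa) / 2 else "insert")
    return labels

# Illustrative 4-row MSA: columns 3-5 are mostly gaps
msa = ["AC---T",
       "AG---T",
       "ACCAGT",
       "AC---A"]
print(classify_columns(msa))
# ['match', 'match', 'insert', 'insert', 'insert', 'match']
```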
Probabilities from MSA
Emission probabilities
o Based on the symbol distribution in the match and insert states
State transition probabilities
o Based on the transitions in the MSA
Probabilities from MSA
Emission probabilities:
o e_{Mi}(k) = (count of symbol k in the column for match state Mi) / (count of all non-gap symbols in that column)
But 0 probabilities are bad
o Model overfits the data
o So, use the "add one" rule
o Add one to each numerator, and add the number of distinct symbols to each denominator
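A sketch of the add-one rule for one match-state emission distribution, assuming a four-symbol alphabet and an invented column:

```python
from collections import Counter

def emission_probs(column, alphabet):
    """Add-one-smoothed emission distribution for one match-state column
    of an MSA; gaps are not emitted, so they are excluded from counts."""
    symbols = [c for c in column if c != "-"]
    counts = Counter(symbols)
    total = len(symbols) + len(alphabet)   # add alphabet size to denominator
    return {k: (counts[k] + 1) / total for k in alphabet}

print(emission_probs("AAC-A", "ACGT"))
# {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
```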
Probabilities from MSA
More emission probabilities:
o For insert states, e_{Ii}(k) is computed the same way, from the symbols in the insert-state columns
But 0 probabilities are still bad
o Model overfits the data
o Again, use the "add one" rule
o Add one to each numerator, and add the number of distinct symbols to each denominator
Probabilities from MSA
Transition probabilities:
o a_{X,Y} = (number of transitions from state X to state Y) / (total number of transitions out of state X)
We look at some examples
o Note that "-" is the delete state
First, consider the begin state: count how each sequence in the MSA leaves it (to M1, I0, or D1)
Again, use the add-one rule
Probabilities from MSA
Transition probabilities
When there is no information in the MSA, set the probabilities to uniform
For example, I1 does not appear in the MSA, so
o a_{I1,M2} = a_{I1,I1} = a_{I1,D2} = 1/3
Probabilities from MSA
Transition probabilities, another example
What about transitions from state D1?
Can only go to M2, so a_{D1,M2} = 1
Again, use the add-one rule:
o a_{D1,M2} = 2/4, a_{D1,I1} = 1/4, a_{D1,D2} = 1/4
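The same add-one rule as code, applied to the transition counts out of a single state; the counts below reproduce the D1 example (one observed transition, to M2):

```python
def transition_probs(counts):
    """Add-one-smoothed transition distribution out of one state:
    add 1 to each count, add the number of successors to the denominator."""
    total = sum(counts.values()) + len(counts)
    return {succ: (c + 1) / total for succ, c in counts.items()}

# D1 was observed once, transitioning to M2; never to I1 or D2
print(transition_probs({"M2": 1, "I1": 0, "D2": 0}))
# {'M2': 0.5, 'I1': 0.25, 'D2': 0.25}
```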
PHMM Emission Probabilities
Emission probabilities for the given MSA
o Using the add-one rule
PHMM Transition Probabilities
Transition probabilities for the given MSA
o Using the add-one rule
PHMM Summary
Construct pairwise alignments
o Usually, use dynamic programming
Use these to construct the MSA
o Lots of ways to do this
Using the MSA, determine probabilities
o Emission probabilities
o State transition probabilities
Then we have trained a PHMM
o Now what?
PHMM Scoring
Want to score sequences to see how closely they match the PHMM
How did we score using an HMM?
o Forward algorithm
How to score sequences with a PHMM?
o Forward algorithm (surprised?)
But the algorithm is a little more complex
o Due to the more complex state transitions
Forward Algorithm
Notation:
o Index j refers to states (one per MSA column), and index i refers to symbols of the scored sequence
o x_i is the i-th observation (emission) symbol
o q_{x_i} is the probability of x_i in the "random model"
o Base case is F_{M0}(0) = 0
o F_{Mj}(i) is the score of x_1, …, x_i ending at match state Mj (note that in a PHMM, i and j may not agree); F_{Ij}(i) and F_{Dj}(i) are defined similarly
o Some states are undefined
o Undefined states are ignored in the calculation
Forward Algorithm
Compute P(X|λ) recursively, in log-odds form relative to the random model:
o F_{Mj}(i) = log( e_{Mj}(x_i) / q_{x_i} ) + log[ a_{Mj-1,Mj} exp(F_{Mj-1}(i-1)) + a_{Ij-1,Mj} exp(F_{Ij-1}(i-1)) + a_{Dj-1,Mj} exp(F_{Dj-1}(i-1)) ]
o F_{Ij}(i) = log( e_{Ij}(x_i) / q_{x_i} ) + log[ a_{Mj,Ij} exp(F_{Mj}(i-1)) + a_{Ij,Ij} exp(F_{Ij}(i-1)) + a_{Dj,Ij} exp(F_{Dj}(i-1)) ]
o F_{Dj}(i) = log[ a_{Mj-1,Dj} exp(F_{Mj-1}(i)) + a_{Ij-1,Dj} exp(F_{Ij-1}(i)) + a_{Dj-1,Dj} exp(F_{Dj-1}(i)) ]
Note that F_{Mj}(i) depends on F_{Mj-1}(i-1), F_{Ij-1}(i-1), and F_{Dj-1}(i-1)
o And on the corresponding state transition probabilities
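A sketch of the PHMM forward algorithm on a toy model. For brevity it works in plain probability space rather than the log-odds form above, and all model parameters (alphabet, emissions, uniform-ish transitions) are invented:

```python
from functools import lru_cache

# Toy PHMM with N = 2 match states over alphabet {A, B}
N = 2
eM = {1: {"A": 0.8, "B": 0.2}, 2: {"A": 0.3, "B": 0.7}}   # match emissions
eI = {j: {"A": 0.5, "B": 0.5} for j in range(N + 1)}       # insert emissions
# a[(state_type, j)][to_type]: the "to" index is implied, since M and D
# advance the column j by 1 while I stays in column j.
a = {(s, j): {"M": 0.4, "I": 0.3, "D": 0.3} for s in "MID" for j in range(N + 1)}

def score(x):
    """P(x | PHMM) via the forward algorithm (probability space)."""
    n = len(x)

    @lru_cache(maxsize=None)
    def f(state, i, j):
        if state == "M" and i == 0 and j == 0:
            return 1.0                    # base case: the begin state
        if i < 0 or j < 0 or j > N:
            return 0.0                    # undefined states contribute 0
        if state == "M":                  # M_j emits x_i, comes from column j-1
            if i < 1 or j < 1:
                return 0.0
            return eM[j][x[i - 1]] * sum(
                a[(s, j - 1)]["M"] * f(s, i - 1, j - 1) for s in "MID")
        if state == "I":                  # I_j emits x_i, stays in column j
            if i < 1:
                return 0.0
            return eI[j][x[i - 1]] * sum(
                a[(s, j)]["I"] * f(s, i - 1, j) for s in "MID")
        if j < 1:                         # state == "D": emits nothing
            return 0.0
        return sum(a[(s, j - 1)]["D"] * f(s, i, j - 1) for s in "MID")

    # Exit: transition from any final-column state into the end state
    return sum(a[(s, N)]["M"] * f(s, n, N) for s in "MID")

print(score("AB"))
```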
We will see examples of PHMMs used in security applications later
In particular,
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands
o Malware detection based on dynamically extracted API calls
References
o R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998
o L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011
o S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009