Download presentation
Presentation is loading. Please wait.
Published byReynold Houston Modified over 9 years ago
1
Hidden Markov Models As used to summarize multiple sequence alignments, and score new sequences
2
Time flies like an arrow, but fruit flies like a banana The Markov Property: the state of a system depends only on its previous state. –Works very well for speech recognition: a sound is modified by the sound immediately preceding it. –For DNA and protein sequences, there is no reason to think that the subunit just upstream is the only influence: probably the whole neighborhood is important. –So, HMMs are just an approximation, but they work very well. In a first order Markov model, the current event only depends on the immediately preceding event. Second order models use the two preceding events, third order models use the three preceding, etc. –as far as I know, all the HMMs for protein sequences that are readily available are first order models.
3
Hidden Markov Models Position-specific substitution matrices have an important weakness: they don’t deal very well with gaps caused by small insertions and deletions. –A we know, gaps are part of most sequence alignments. The standard solution these days: profile hidden Markov models (HMM). –Like PSSM, they are a statistical model of a multiple sequence alignment, but they easily incorporate gaps. –HMMs can be used to evaluate a query sequence: to determine how likely the query is to fit the model.
4
Hidden Markov Models The basic problem is: given a series of observations, determine the most probable set of hidden states that generate them. –For matching a query sequence to a model, each residue in the query is assigned one of 3 states: it is a match/mismatch to a residue in the model, or it is deleted, or it is inserted but, to understand HMMs, we need a simpler and more concrete example, taken from Wikipedia, which we can call Life in St. Paul. –I have a friend (Carol) in St. Paul who has only 3 interests in life: going for a walk, shopping, and cleaning her apartment. –St. Paul weather can be either sunny or rainy. Carol decides which of her 3 activities she will do based on the weather. –Every day she e-mails me to tell me what she is doing. (exciting!) –From the sequence of her activities, I infer the most likely weather for each day.
5
Terminology The HMM for Life in St. Paul has 2 states: rainy and sunny. –These states are hidden: we don’t know them and can only infer them from Carol’s behavior. For each set, there is a set of transition probabilities: what will happen the next day. –Which depends only on what happened today (the Markov property) –prob R -> R (stays rainy) = 0.7 –prob R -> S (goes from rain to sun) = 0.3 –prob S-> S (stays sunny) = 0.6 –prob S -> R (goes from sun to rain) = 0.4 For each state, there is a set of emission probabilities: what Carol’s behavior will be under different weather conditions. –These behaviors are our observations: she tells us what she is doing each day. –if state is Rainy: walk = 0.1 clean = 0.5 shop = 0.4 –if state is Sunny: walk = 0.6 clean = 0.1 shop = 0.3 A path is a sequence of states through the model: what the state is for each day. We are trying to determine the optimal (most likely) path that fits our observations.
6
State Diagrams One way to draw state diagram is to have arrows going back and forth between states. Note that there is also a transition back to the same state. –This represents eternity: each day is in one of the other state.
7
State Diagrams Another way is to move forward in time, with a new pair of Rainy and Sunny states for each new day. I like this because it is a closer approximation to progressing through a sequence, trying to get a match between the query sequence and the model.
8
Basic Questions We are interested in 3 basic questions: –Given the model and its parameters (the transition and emission probabilities), what is the most likely path that explains our observations? This is done using the Viterbi algorithm. –What is the overall probability that the model fits our observations? This is the most common use of HMMs in bioinformatics: determining the probability that the query sequence fits the model Uses the forward algorithm, which is very similar to Viterbi –Given a set of observations, how can we accurately estimate parameters for the model? This uses the Baum-Welch algorithm.
9
Viterbi Algorithm Used to determine most probable path (series of states) given a set of observed data. –the forward algorithm is closely related and can be performed at the same time: it generates the overall probability of the observed sequence. –based on the idea that if you always take the most likely next step, you will get the most likely overall path. Given a first order model, this seems quite reasonable. –Very similar to dynamic programming Go through the observations from beginning to end, inferring a state of the hidden machine for each observation. Keep track of 3 values: overall probability (prob), Viterbi path (v_path), and Viterbi probability (v_prob) (i.e. the probability of the observations given the Viterbi path). Test all possible next steps. The probability of a possible step given its corresponding observation is the probability of the transition to that next state times the emission probability of the observation in that state. –for overall probability prob, multiply each new prob by the old prob, then add together (i.e. overall probability is sum of all possible next steps) –for Viterbi prob, just take the highest next-step probability and multiply it by the previous v_prob –for Viterbi path, add next state to list of previous states
10
Our Example observations: WCSW start at beginning. We will simply assume that the day before the beginning was sunny Observation is W (walk), so go with the most probable. –P(Walk | Sunny) = 0.6; P(Walk | Rainy) = 0.1 –prob (overall probability) = 0.6 –v_prob (Viterbi probability) = 0.6 –v_path = S Step 2 (next day) observation is C (clean) –Given that previous day was sunny: 1.stays sunny. Transition prob = 0.6. Prob of cleaning on a sunny day (emission prob) = 0.1. Total probability for this possibility (i.e. it stays sunny) = 0.6 x 0.1 = 0.06. 2.turns rainy. Transition prob = 0.4. Prob of cleaning on a rainy day = 0.5. Total prob for this step = 0.4 x 0.5 = 0.20. Therefore this is the most likely state of step 2 update overall probability: prob = 0.6 x 0.06 + 0.6 x 0.2 = 0.152 update Viterbi probabilty: v_prob = 0.6 x 0.2 = 0.12 update Viterbi path: v_path = SR
11
More step 3. Obs = S (shop) –given previous day was Rainy: 1.stays rainy = 0.7. Prob of shopping on a rainy day = 0.4. Prob for this step = 0.7 x 0.4 = 0.28 2.turns sunny = 0.3. Prob of shopping on a sunny day = 0.3. Total = 0.3 x 0.3 = 0.09 –best is R. v_path now is SRR –overall prob = previous prob (0.152) x sum of this step’s possibilities: 0.152 x (0.28 + 0.09) = 0.0562 –v_prob = previous v_prob x best current prob = 0.12 x 0.28 = 0.0336.
12
Still More Step 4 (last). Obs = W previous day was R: 1.R-> R transition prob = 0.7. Emission probability of a walk on a rainy day is 0.1. Total for this step = 0.7 x 0.1 = 0.07. 2.R -> S prob = 0.3. Emission prob = 0.6. Total = 0.3 x 0.6 = 0.18. –Best is S (turns sunny). v_path = SRRS –overall prob = previous (0.0562) x sum of current possibilities (0.07 + 0.18 = 0.25) = 0.562 x 0.25 = 0.0141. –v_prob = previous (0.0336) x best (0.18) = 0.00605. We have now determined the best set of states that would give us the observed data (v_path = SRRS), and the overall probability of the observed data (given the model) (prob = 0.0141). It is obvious that multiplying probabilities soon gets to very small numbers, so in general they are converted to logarithms and added.
13
Results from a Longer Run activity: CCSCCCSCWCWSCSCSSSCSSSWWWWSCCSSWCCSWWSCWSCSSCSSSWW weather: RRRRRRSSSRSSRSRSRRRRRRSSSSSSRRRSRRRSSSRSSRRRSSRRSS predicted: RRRRRRRRSRSSRRRRRRRRRRSSSSSRRRRSRRRSSSRSSRRRRRRRSS Viterbi prob = 5.48e-30 Overall prob = 1.92e-24 Percent correct = 86.0 %
14
Estimating Model Parameters The parameters that need to be estimated are the transition and emission probabilities. Since they must sum to 1, our St. Paul weather HMM needs only 2 transition probabilities and 2 emission probabilities estimated. –Sequence models have very large numbers of parameters to estimate, which requires large amounts of data. If you have training data with known states (e.g. you know when it is sunny or rainy, plus your friend’s activities, for a period of time: parameters come from simply calculating the proportion of each transition or emission. –Example: on rainy days, Carol took 6 walks, cleaned 30 times, and shopped 24 times. Emission probability estimates are 6/60 = 0.1, 30/60 =0.5, 24/60 = 0.4. (very frequentist!) –problem of 0 counts. Add a “pseudocount” so this probabilty isn’t zero (because it might actually happen). Typically add 1 to numerator and total number of possibilites to denominator. e.g. for walks, emission prob = (6 +1) / (60 + 3) = 7/63 = 0.111
15
More Parameter Estimation If you don’t know the underlying states: you just have a series of observations: you can use the Baum-Welch algorithm to estimate parameters. –Need the backward algorithm in addition to the forward algorithm. The backward algorithm works exactly the same as the forward algorithm, except that you calculate from the end to the beginning. The Baum-Welch algorithm starts with a set of parameters that have been estimated or guessed. –Then feed each training sequence through the model, with overall probabilities calculated both forward and backward. –Then use these probabilities to weight the contribution of each sequence to the transition and emission probabiliites. –Then use this to adjust the parameters.
16
Profile HMMs Start with a multiple sequence alignment: we want to make a statistical model of it.
18
Building the Model: Match States If we were just trying to do an ungapped alignment, we could use a very simple, unbranched HMM, where every match state can only transition to another match state –Each match state has an emission probability for each amino acid, with different emission probabilities for every match state –This is essentially what a PSSM is: the scores in each column of a PSSM could be scaled to 0-1, giving emission probabilities. –The transition probabilities are all 1: the only choice is to move to the next match state.
19
Insertion States In a multiple alignment, there are often columns which are gaps in most sequences but have amino acids in a few sequences. –These columns are best dealt with as insertion states. As you move through the model, emitting the query sequence, insertion states emit the extra amino acids found in these columns. Insertion states have emission probabilities that are usually the same as the overall proportion of each amino acid in the database. Insertion states loop back on themselves, meaning that multiple residues can be inserted here. Insertion states are entered from one match state and then exited to the next match state: insertion occurs between adjacent amino acids.
20
Deletion States Deletions are places in multiple alignments where most of the sequences have amino acids but a few of them have gaps. Deletion states are used to skip over match states. –You can skip multiple match states by moving from one deletion state to the next. –Deletion states act like affine gap penalties: the transition probability for entering a deletion state from a match state is equivalent to the gap opening penalty, and the transition probability from one deletion state to the next is the gap extension penalty. Unlike match states and insertion states, deletion states are silent: they emit nothing.
21
The Full HMM A full HMM model for a making a profile of a multiple alignment. Note that it is necessary to have begin and end states too, and often a bit more complicated than is shown here. These are needed to deal with local alignments as opposed to global. HMMs also usually have transitions directly from insertion states to deletion states, but these are considered very improbable, but their existence helps in model building.
22
Variable start and end points We can assume that most proteins will have amino acids before and after the region matched by the model. It is also possible to have sequences that only match part of the model. We model this by: –Allowing any insertion at the beginning or end to occur with no penalty. These are called “free insert states”. – allowing transitions from the start to any point in the model, and from any point in the model to the end state. This is equivalent to filling out a dynamic programming matrix, where all possible alignments are considered.
23
Determining the Parameters Start with a good multiple alignment, which generally involves hand refinement. This is often refered to as “curated”. Most of the columns will become match states. –Pseudocounts are needed to make sure that each match state has a non-zero probability of emitting all 20 amino acids. Insertion and deletion states are given some background low transition probabilities to start with. – columns that are mostly gaps are considered insertion states, and gaps in columns with mostly amino acids are modeled with deletion states.
24
Using Profile HMMs The query sequence is used as a set of observations that are emitted from the HMM if the proper path is followed. Either the Viterbi algorithm (to give the probability for the most likely path) or the forward algorithm (to give the probability of emission by all possible paths) can be used. To get log-odds scores, the probability of emitting the sequence from the model is compared to the probability of generating it randomly given the background frequencies of each amino acid.
25
A Profile HMM Example We are going to use a DNA example because then we only have 4 emission probabilities to deal with. This is a section of a repeated sequence in Bacillus megaterium. There are 15 sequences, and the alignment is 16 bases long. The first project is to parameterize the model, to estimate transition and emission probabilities. Then we will use the model to score various sequences. GG-GGAAAAACGTATT TG-GGACAAAAGTATT TG-GAACAAAAGTATG TACGGACAAAATTATT T--GAAGAAAAGTATG TA-GAACAAAAGTAGG TG-GAACAAACGCATT CGGGACAAA-AGTATT TGGGGTAAA-AGTATT TGAGACAAA-AGTAGT TGAGACAAA-AGTATA TGGGACAAAGAGTATT TG-AAACAAAGATATT CG-GAACAAAAGTATT TA-GGACAAAAGTGTT
26
Setting Up the Model The first problem is dealing with gaps: which ones should be inserts and which ones should be deletes? –A standard rule of thumb (that we are going to violate): if more than ½ the sequences have the same gap, it’s an insert, and if less than ½ have it it’s a delete. Nine sequences have a gap in column 3, and one also has a gap in column 2. –By the above rule, column 3 should be an insert and column 2 a deletion, but this means that we will have a transition directly from deletion to insertion, and we would like to avoid that if possible. –So, let’s call columns 2 and 3 deletions. Four sequences have a gap in column 10. This could be a deletion, but I want to make it an insertion just so we have one of them. GG-GGAAAAACGTATT TG-GGACAAAAGTATT TG-GAACAAAAGTATG TACGGACAAAATTATT T--GAAGAAAAGTATG TA-GAACAAAAGTAGG TG-GAACAAACGCATT CGGGACAAA-AGTATT TGGGGTAAA-AGTATT TGAGACAAA-AGTAGT TGAGACAAA-AGTATA TGGGACAAAGAGTATT TG-AAACAAAGATATT CG-GAACAAAAGTATT TA-GGACAAAAGTGTT
27
More Set Up Considering the gaps in columns 2 and 3 as delete states means that the bases in these columns in other sequences are a match state (i.e. aligned bases). However, considering column 10 as an insertion means that the bases in column 10 come from an insert state, not a match state, so the model doesn’t have a match state for column 10. The final model thus has 15 match states, with the corresponding insert and delete states. –Most of the insert and delete states aren’t used in our sequences, so they will have low probabilities. But still, they need to be in the model. To simplify things a bit, we are doing this as a global model, so we don’t need to worry about free insert states before or after the model, and we aren’t going to allow entering or exiting the model anyplace other than the beginning or the end, column 12345678910111213141516 state M1M2 / D1 M3 / D2 M4M5M6M7M8M9I9M10M11M2M13M14M15
28
Parameterization What parameters do we need? Emissions: –Each match state needs emission probabilities for all 4 bases –The insert states also need emission probabilities for all 4 bases. these are generally based on the random model: the background frequencies of the bases in the relevant genome or database Transitions: –For the specific cases on columns 2 and 3, we need match -> delete (M->D) probabilities, and a delete -> delete (D->D) probability. –For column 10, we need a M->I probability, and an I->I probability (for which we have no data). –We also need general M->M, M->D, and M->I probabilities for the other columns –Several other numbers also, but many of them will be covered by the need to have all transition probabilities coming from a given state sum to one.
29
Emission Probabilities Background levels (i.e. what the probabilities for each base would be if a bases were simply chosen at random) –We will use these for emissions from insert states. –We could use base frequencies just from this data set: A = 0.44, C = 0.08, G = 0.25, T = 0.23. These numbers are very biased. And, if you use them, you are assuming that whatever sequence you run the model against is a member of this family. That is, you are assuming the thing you are trying to prove. (Bad!) –So, let us use overall base frequencies. These sequences are all from B. megaterium, which has a GC content of about 38%. Thus, G = C = 0.19 and A = T = 0.31. Specific emission probabilities for each match state. –Count up the number of each base (not including gaps) in each column. –We are clearly missing some data: need pseudocounts to fill in.
30
Emission Pseudocounts The simplest way to do pseudocounts is the Laplace method: adding 1 to the numerator and 4 (i.e. total types of base) to the denominator: –Freq(C in column 1) = (count of C’s + 1) / (total number of bases + 4) –= (2 + 1) / (15 + 4) = 0.158 –As compared to actual frequency = 2/15 = 0.133 –There are no A’s in column 1, so the probability of A from column 1 = 1/19 = 0.052 A somewhat more sophisticated method is to use overall base frequencies for each base. –Freq(C in column 1) = (count of C’s + 0.19) / (total number of bases + 1) = 2.19/16 = 0.137 –Freq(A in column 1) = 0.31/16 = 0.019 The base frequency method could be altered by multiplying the pseudocounts by some constant, as an estimate of our uncertainty of how likely we are to find a sequence with an A first. –For example, to be more equivalent to the Laplace method, multiply by 4: –Freq(C in column 1) = (count of C’s + (4 * 0.19) ) / (total number of bases + 4) = 2.76/19 = 0.145 –Freq(A in column 1) = (4 * 0.31)/19 = 0.065 –Note how different the probabilities are for A. We will just say that how to apply pseudocounts is an area of heuristics and active research. We will use the overall base frequency method.
31
Transition Frequencies Overall M->D. There are 225 total transitions that occur here, and 9 of them are to delete states. So, we calculate M->D background frequency as 9/225 = 0.040. –For D->D, there is 1 case out of 9 delete states where the sequence continues the deletion, so we will use 1/9 = 0.111 as the D->D probability. –Then, D->M = 1 – (D->D) = 0.888 Overall M->I. There are 11 transitions to an insert state (column 10). So, the overall M->I probability is 11/225 = 0.044. –No cases of I->I in the data. So, we arbitrarily decide to make it the same as D->D (0.111), since we arbitrarily decided which columns to treat as inserts and which to treat as deletes. –I->M is thus the same as D->M = 0.888 This leaves the background M->M transition, which is 1 – (M->I + M- >D) = 1 – (0.040 + 0.044) = 0.916. We also need a low probability for I->D and D->I, which shouldn’t happen, so we set them at 0.00001
32
Specific Transitions We need to deal with columns where we have inserts and deletes. Column 2 has 1 M->D and 14 M->M. –Need to add in pseudocounts from the overall data, so: prob(M->D in column 2) = (M->D count + 0.04) / (total transitions in column 2 + 1) = 1.04/16 = 0.065. --M->I in column 2 is the background level, 0.044 –M->M for column 2 is 1 – 0.065 - 0.044 = 0.891 Column 3 has 8 M->D and 6 M->M (there is also the D->D already covered). –Prob(M->D in column 3 ) = 8.04/15 = 0.536 –Prob (M->M in column 3) = 1 – 0.536 - 0.044 = 0.420 Column 10 is an insert with 10 M->I and 5 M->M transitions –Prob(M->I in column 10) = 10.044/16 = 0.628 –Prob (M->D in column 10) = 0.04 (background) –Prob (M->M in column 10 is 1 – 0.628 – 0.04 = 0.332
33
Emission Probability Tables matchACGT 10.0280.1300.0780.764 20.2290.0050.7500.015 30.3490.1540.4640.033 40.0900.0050.8900.014 50.6530.0050.3280.014 60.6530.2550.0150.077 70.4030.5050.0780,014 80.9650.0050.0150.014 90.9650.0050.0150.014 100.7780.1300.0780.014 110.0900.0050.8280.077 120.0280.0670.0150.889 130.9030.0050.0780.014 140.0280.0050.1400.827 150.0900.0050.2030.702 ACGT overall0.310.19 0.31
34
Transitions Specific M1 M20.891 M1->D0.065 M2->M30.420 M2->D0.536 M9->M100.332 M9->I0.628 Default M->M0.916 M->I0.044 M->D0.040 D->M0.888 D->I0.0001 D->D0.111 I->M0.888 I->I0.111 I->D0.0001
35
Scoring a Sequence Whew! We have now estimated parameters for all transitions and emissions. Scoring a sequence. We are going to use both the Viterbi algorithm and the forward algorithm to determine the most likely path through the model and the overall probability of emitting that sequence. –Note that we really should convert everything to logarithms –Also, it is standard practice to express emission probabilities as odds rations, which means dividing them by the overall base frequencies. –We are not going to do either of these things here, in the interest of simplification and clarity. Let’s just score the first sequence in the list: –GG-GGAAAAACGTATT –Remove the gap, since a sequence derived from real data is not going to come with a gap (which came from a multiple alignment program) –GGGGAAAAACGTATT
36
Scoring GGGGAAAAACGTATT Base 1 is G. To start the global model off, we are going to require that this be a match state. –The emission probability for G in M1 is 0.078, so this is the initial overall probability and Viterbi probability. Base 2 is also G. There are 3 possibilities for this base: it might be a match state (M2), or it might the result of an insert state, or it might be the result of entering a delete state (and thus match a later base. We choose the most likely: –M1->M2 has a 0.891 probability, and the probability of emitting a G in column 2 is 0.750. So, this probability is 0.891 * 0.750 = 0.668 –M1->D = 0.065 –M1->I, then emitting a G from the insert state = 0.044 * 0.19 = 0.008 –M1->M2 is most likely. So, Viterbi probability = previous prob * this prob = 0.078 * 0.668 = 0.052. Overall prob = 0.078 * (0.668 + 0.065 + 0.008) = 0.078 * 0.741 = 0.058
37
More Scoring Base 3 is also a G. –M2->M3 has 0.420 probability and 0.464 chance of emitting a G. 0.420 * 0.464 = 0.195 –M2->D has 0.536 probability –M1->I, then emitting a G from the insert state = 0.044 * 0.19 = 0.008 –Choose M2->D. Viterbi = 0.052 * 0.536 = 0.028. –Overall = 0.058 * (0.195 + 0.536 + 0.008) = 0.058 * 0.739 = 0.043. We are now in a delete state between M2 and M4; we skipped the M3 state. Since delete states are silent, the G in position 3 hasn’t been emitted yet. –From the delete state we can either move to another delete state (skipping the M4 state in addition to M3) or we can move to M4 and emit the G. –D->M4 = 0.888 and M4 emitting a G = 0.890, so prob = 0.888 * 0.890 = 0.790 –D->D = 0.111 –Move to M4. Viterbi = 0.028 * 0.790 = 0.022. – Overall = 0.043 * (0.790 + 0.111) = 0.043 * 0.901 = 0.039. We can now move on to base 4 (another G) Our path so far: M1->M2->D->M4. We have emitted the first 3 bases. GGGGAAAAACGTATT
38
Still More Scoring GGG GAAAA ACGTATT The next several bases are easy. Since the probability of moving to a delete or insert state is low, we just have to be sure that the M->M probability times the emission probability stays above 0.044. M4->M5 : G prob = 0.916 * 0.328 = 0.300 –Viterbi prob = 0.022 * 0.300 = 0.0066 –Overall prob = 0.039 * (0.300 + 0.040 + (0.044 * 0.19) ) = 0.039 * 0.3484 = 0.0136 M5->M6 : A prob = 0.916 * 0.653 = 0.598 –Viterbi prob = 0.0066 * 0.598 = 0.00395 –Overall prob = 0.0136 * (0.598 + 0.040 + (0.044 * 0.31) ) = 0.0136 * 0.6516 = 0.0089 M6->M7 : A prob = 0.916 * 0.403 = 0.369 –Viterbi prob = 0.00395 * 0.369 = 0.00146 –Overall prob = 0.0089 * (0.369 + 0.040 + (0.044 * 0.31) ) = 0.0089 * 0.423 = 0.00376 M7->M8 : A prob = 0.916 * 0.965 = 0.884 –Viterbi prob = 0.00146 * 0.884 = 0.00129 –Overall prob = 0.00376 * (0.884 + 0.040 + (0.044 * 0.31) ) = 0.00376 * 0.938 = 0.00353 M8->M9 : A prob = 0.916 * 0.965 = 0.884 –Viterbi prob = 0.00129 * 0.884 = 0.00114 –Overall prob = 0.00353 * (0.884 + 0.040 + (0.044 * 0.31) ) = 0.00353 * 0.938 = 0.00331
39
Yet More At this point we have emitted positions 1- 8, and the most probable path is M1->M2->D->M4->M5- >M6->M7->M8->M9 GGG GAAAA ACGTATT Since the transition out of M9 is not the standard one, we need to pause and think it through. –M9->M10 = 0.332. Emission prob for A from M10 is 0.778. 0.332 * 0.778 = 0.258 –M9->I = 0.628. Emission prob for A from an insert state (i.e. background probability) is 0.31 0.628 * 0.31 = 0.195. –Thus our best choice, the most probable path, is M9- >M10. However, looking at the aligned sequences we can see that this is the wrong choice. Don’t despair: correction occurs in the next step. –Viterbi prob = 0.00114 * 0.258 = 0.000294 –Overall prob = 0.00331 * (0.258 + 0.195 + 0.040) = 0.00331 * 0.493 = 0.00163 GG-GGAAAAACGTATT TG-GGACAAAAGTATT TG-GAACAAAAGTATG TACGGACAAAATTATT T--GAAGAAAAGTATG TA-GAACAAAAGTAGG TG-GAACAAACGCATT CGGGACAAA-AGTATT TGGGGTAAA-AGTATT TGAGACAAA-AGTAGT TGAGACAAA-AGTATA TGGGACAAAGAGTATT TG-AAACAAAGATATT CG-GAACAAAAGTATT TA-GGACAAAAGTGTT
40
Yet Still More At this point we have emitted positions 1- 8, and the most probable path is M1->M2->D->M4->M5->M6->M7->M8->M9->M10 GGG GAAAAA CGTATT At M10, we can: –move to M11 and emit a C. Prob = 0.916 * 0.005 = 0.0046 –Move to an insert state and emit a C. Prob = 0.044 * 0.19 = 0.0083. –Move to a delete state. Prob = 0.04. This would be the best choice, but it leads to a mess: delete all the remaining match states, then inserting all the remaining bases in the query sequence at the end. It clearly shows the need for dynamic programming. And while we are at it, switching to logarithms at the beginning would greater ease calculations. –So, to continue our example, we move from M10 to an insert state and emit a C. Viterbi prob = 0.000294 * 0.0083 = 2.44 x 10 -6 Overall prob = 0.00163 * (0.0046 + 0.0083) = 2.10 x 10 -5
41
To the End… Our path so far: – M1->M2->D->M4->M5->M6->M7->M8->M9->M10->I –GGG GAAAAAC GTATT From the insert state we can: –I->I and emit a G, with probability 0.111 * 0.19 = 0.0211 –I->M11, with prob 0.888 * 0.828 = 0.735 Viterbi prob = 2.44 x 10 -6 * 0.735 = 1.79 x 10 -6 Overall prob = 2.10 x 10 -5 * (0.0211 + 0.735) =1.58 x 10 -5 The remaining steps are all match states, so we skip the calculations: –Final Viterbi probability = 4.46 x 10 -7 –Final overall prob = 6.79 x 10 -6
42
Final probability We need to know what the probability would be for the random model, with every base inserted according to its overall frequency in the genome. GGGGAAAAACGTATT has 6 G/C and 9 A/T, so the random probability is: (0.19) 6 * (0.31) 9 = 1.24 x 10 -9 We compare to the overall probability of 6.79 x 10 -6 by dividing, giving 5459. This means that the overall score for this sequence is 5459 times more likely than chance to match the model.
43
Interpro Maybe the most comprehensive site today (2011). InterProScan covers several different HMM databases, including TIGRfams, PFAM, and others http://www.ebi.ac.uk/interpro/ Sequences are generally classified as members of a superfamily, family, or subfamily. Different member databases supply different information. It also has a programming (web services) interface, for doing many submissions automatically.
44
Sequences to try >BMQ_3096 | QMB1551_chromosome:3062647-3060926| MVNKILKAPEGPPSDVERIKDESNYLRGTLGETMLDRISSGISEDDNRLMKFHGSYLQDD RDLRNERQKQKLEPAYQFMLRVRTPGGVSTPEQWLVMDDLAQKYGNGTLKLTTRQAFQMH GILKWNMKKTIQEIHASLLDTIAACGDVNRNVMCNPNPYQSEVHAEVFEWSKKLSDYLLP RTRAYHELWLDEEKVISTPEVEEEVEPMYGPLYLPRKFKIGVAVPPSNDIDVYSQDLGFI AILEDEKLVGFNVAIGGGMGMTHGDKATYPQLAKVIGFCRPDQILEVAEKVITIQRDYGN RSVRKNARFKYTVDRLGLETVKEELENRLGWSLDEAKSYHFDHNGDRYGWEKGVKGKWHF TLFVQGGRIADFEDYKLMTGLREIAKVHSGDFRLTANQNLIIANVSTEKKKQISDLIEQY GLTDGKHYSALRRSSLACVSLPTCGLAMAEAERYLPVLLEKIEAIVDENGLRDKEITIRM TGCPNGCARPALGEIAFIGKAPGKYNMYLGAAFDGSRLSKMYRENISEEQILNELRVLLP RYAKEREEGEHFGDYVIRAGVIEAVTDGTNFHA –http://www.ebi.ac.uk/Tools/services/web_iprscan/toolresult.ebi?tool=iprscan&jobId=iprscan-I20111003-174548-0114-52596225-oy >BMQ_0452 | QMB1551_chromosome:415338-416369| MWKKRIWVIASIVIVCLLIPYSLPIVFALLTALALEGIVQKLHQSFKMKRVYAVLMTFIL YVVSLILIGFFIVRTIVTQVVALSKIAPSFVKEIYETAIYPTIIKWKYYSNALPTEVISS IERTLEKTVNSLDSLLKSTVEIIISFAATVPGFLLEFLIYLVALVFISLELPAIKNKIKS FLTEETKYKLQVVITQLIQAGVGFIKAQIILSVMTFVMAYVGLWLLDVPYTALLSLLIVI VDILPILGTGSVLVPWGIFALTQGHDSLAIGLFILFGVITVVRRVVEPKVFSTNMGISPL AALVSLYIGFKLLGFVGLFLGPALVILYDALRKVGIIQINFKL –http://www.ebi.ac.uk/Tools/services/web_iprscan/toolresult.ebi?tool=iprscan&jobId=iprscan-I20111003-173646-0614-88204841-pg
45
Full Length HMMs The J. Craig Venter Institute (formerly known as TIGR) has a large collection of HMMs, called TIGRFAMS. –They are classified according to their isology: how specific they are. The least specific are motifs, then domains, then families (super and sub), and most specific are equivalogs. –Equivalogs are homologous proteins which all have the same function. They are usually full length proteins, not just domains. A good hit on an equivalog is strong evidence that you have determined the protein’s function. –Lots of useful information about the HMMs also. –Two cutoff scores: above the trusted cutoff there should be no false positives. Below the noise cutoff the match is of no value. –http://www.jcvi.org/cgi-bin/tigrfams/index.cgihttp://www.jcvi.org/cgi-bin/tigrfams/index.cgi Panther seems to do similar things, with explicit linking to experimental data.
46
Protein Domain HMMs Domains are areas of proteins that perform some function important to the protein’s overall function. –It is not uncommon to have several different domains in the same protein –Across species lines, the domains that perform a specific tasks can sometimes be found together in a single protein, and other times found in separate proteins. PFAM: probably stands for “Protein Families”. http://pfam.sanger.ac.uk/ http://pfam.sanger.ac.uk/ –PFAM HMMs are generally domains as opposed to full-length proteins such as are found in TIGRfams. –Nice sequence logos!
47
Domain HMMs based on 3-D Structure Superfamily: HMMs of domains classified according to the SCOP (Structural Classification of Proteins) scheme, which is based on 3-dimensional protein structure and grouped by evolutionary relationship. –SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/index.html Look at statistics and top of hierarchy.http://scop.mrc-lmb.cam.ac.uk/scop/index.html CATH: 3-D structures chopped up into domains, then used to find families of evolutionarily related proteins. Based on PBD database of stuctures.
48
HMMS based on motifs and special features Prosite and PRINTS: Contain short patterns “motifs” and rules for distinguishing them in different proteins. TM-HMM: finds segments that go through membranes. –This may be the easiest way to detect an integral membrane protein. –A related HMM program scans for N-terminal signal peptides, for secretion as well as membrane insertion.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.