
1 Hidden Markov Models
Fundamentals and applications to bioinformatics.

2 Markov Chains Given a finite discrete set S of possible states, a Markov chain process occupies one of these states at each unit of time. The process either stays in the same state or moves to some other state in S, and it does so in a stochastic rather than deterministic way. The process is memoryless and time homogeneous.
Memorylessness: if at some time t the process is in state Sj, the probability that one time unit later it is in state Sk depends only on Sj, and not on the history of states it was in before time t. The current state is all that matters in determining the probabilities for the states that the process will occupy in the future.
Time homogeneity: given that at time t the process is in state Sj, the probability that one time unit later it is in state Sk is independent of t.

3 Transition Matrix Let S = {S1, S2, S3}. A Markov chain is described by a table of transition probabilities such as the following:
[The slide shows a 3×3 transition table and the corresponding state diagram over S1, S2, S3; the recoverable entries are 1, 1/3, 2/3, 1/2 and 1/6, but their placement in the table did not survive the transcript.]
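A minimal Python sketch of how such a chain can be represented and simulated. The matrix values below are illustrative stand-ins (rows sum to 1), since the slide's exact table could not be recovered:

```python
import numpy as np

states = ["S1", "S2", "S3"]
# Illustrative row-stochastic matrix; A[i, j] = P(next state = j | current state = i).
A = np.array([
    [0.0, 1.0, 0.0],   # from S1
    [1/3, 0.0, 2/3],   # from S2
    [1/2, 1/6, 1/3],   # from S3
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution

rng = np.random.default_rng(0)
state = 0  # start in S1
for t in range(10):
    state = rng.choice(3, p=A[state])  # memoryless: next state depends only on the current one
    print(t + 1, states[state])
```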

4 A simple example Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy, cloudy, or sunny. We postulate that on day t the weather is characterized by exactly one of the three states above, and give ourselves a transition probability matrix A given by:
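The matrix itself did not survive the transcript. A plausible reconstruction is the classic weather example from Rabiner's HMM tutorial, which is consistent with slide 6's expected run of five consecutive sunny days (that computation forces $a_{33} = 0.8$):

$$A = \{a_{ij}\} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$

with state 1 = rain or snow, state 2 = cloudy, and state 3 = sunny.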

5 - 2 - Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be “sun-sun-rain-rain-sun-cloudy-sun”?
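Assuming the reconstructed matrix above, the answer chains the transition probabilities along the given weather sequence (day 1 being sunny is given, so no initial-state factor is needed):

$$P(\text{sun,sun,rain,rain,sun,cloudy,sun} \mid \text{day 1 sunny}) = a_{33}a_{33}a_{31}a_{11}a_{13}a_{32}a_{23} = 0.8 \cdot 0.8 \cdot 0.1 \cdot 0.4 \cdot 0.3 \cdot 0.1 \cdot 0.2 \approx 1.5 \times 10^{-4}$$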

6 - 3 - Given that the model is in a known state Si, what is the probability that it stays in that state for exactly d days? The answer is $p_i(d) = (a_{ii})^{d-1}(1 - a_{ii})$. Thus the expected number of consecutive days spent in state Si is $\bar{d}_i = \sum_{d=1}^{\infty} d\, p_i(d) = \frac{1}{1 - a_{ii}}$. So the expected number of consecutive sunny days, according to the model, is $1/(1 - 0.8) = 5$.

7 Elements of an HMM What if each state does not correspond to an observable (physical) event? What if the observation is a probabilistic function of the state? An HMM is characterized by the following:
1) N, the number of states in the model.
2) M, the number of distinct observation symbols per state.
3) The state transition probability distribution $A = \{a_{ij}\}$, where $a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i]$.
4) The observation symbol probability distribution in state $S_j$, $B = \{b_j(k)\}$, where $b_j(k) = P[O_t = v_k \mid q_t = S_j]$ is the probability that the k-th observation symbol appears at time t, given that the model is in state $S_j$.
5) The initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P[q_1 = S_i]$.
Although the states are hidden, for many practical applications there is often some physical significance attached to the states or to sets of states of the model. In the urn-and-ball model, the states correspond to the urns. The observation symbols correspond to the physical output of the system being modeled; for the urn-and-ball model, they are the colors of the balls selected from the urns. Components (1) and (2) describe the structure of the model, and (3), (4) and (5) describe the parameters.
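As a concrete sketch, the parameters can be collected in a small container; the class and field names here are our own, not from the slides:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """Elements of an HMM; N and M are implied by the array shapes."""
    A:  np.ndarray  # (N, N) transitions, A[i, j] = P(q_{t+1} = S_j | q_t = S_i)
    B:  np.ndarray  # (N, M) emissions,   B[j, k] = P(O_t = v_k | q_t = S_j)
    pi: np.ndarray  # (N,)   initial state distribution

    def __post_init__(self):
        # Structural checks: every row must be a probability distribution.
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.isclose(self.pi.sum(), 1.0)
```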

8 Three Basic Problems for HMMs
1) Given the observation sequence O = O1O2…OT and a model λ = (A, B, π), how do we efficiently compute P(O | λ)?
2) Given the observation sequence O and a model λ, how do we choose a corresponding state sequence Q = q1q2…qT which is optimal in some meaningful sense?
3) How do we adjust the model parameters λ to maximize P(O | λ)?
Problem 1 is the evaluation problem: given a model and a sequence of observations, how do we compute the probability that the observed sequence was produced by the model? We can also view it as scoring how well a given model matches a given observation sequence, and this viewpoint is extremely useful. For example, if we are trying to choose among several competing models, the solution to this problem allows us to choose the model that best matches the observations.
In Problem 2 we attempt to uncover the hidden part of the model, i.e. to find the "correct" state sequence. For practical situations we use an optimality criterion to solve this problem as well as possible; unfortunately, there are several reasonable optimality criteria that can be imposed.
In Problem 3 we attempt to optimize the model parameters so as to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence, since it is used to "train" the HMM. The training problem is crucial in applications of HMMs, since it allows one to optimally adapt model parameters to observed training data.

9 Solution to Problem (1) Given an observed output sequence O, we have
$P(O \mid \lambda) = \sum_{q_1,\dots,q_T} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T)$,
where the sum runs over all $N^T$ possible state sequences. Each term is a product of about 2T factors, so the total number of operations is on the order of $2T\,N^T$. Unless T is quite small, this calculation is computationally infeasible: for example, if N = 4 and T = 100, the number of calculations is on the order of $10^{62}$; it would take longer than the age of the universe. Fortunately, there is a much more efficient algorithm, called the forward algorithm.
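A minimal sketch of this direct (infeasible) calculation, enumerating every state path; it is only usable on toy sizes, which is exactly the slide's point:

```python
from itertools import product
import numpy as np

def prob_obs_brute_force(O, A, B, pi):
    """P(O | model) by summing over all N**T state paths: O(T * N**T) work."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for path in product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0], O[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], O[t]]
        total += p
    return total
```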

10 The Forward Algorithm It focuses on the calculation of the quantity
$\alpha(t, i) = P[O_1 O_2 \cdots O_t,\ q_t = S_i \mid \lambda]$,
which is the joint probability that the sequence of observations seen up to and including time t is O1,…,Ot, and that the state of the HMM at time t is Si. Once these quantities are known,
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha(T, i)$.

11 …continuation The calculation of the α(t, i)'s is by induction on t. The base case is
$\alpha(1, i) = \pi_i\, b_i(O_1)$,
and from the formula above we get the recursion
$\alpha(t+1, j) = \Big[\sum_{i=1}^{N} \alpha(t, i)\, a_{ij}\Big]\, b_j(O_{t+1})$.
This recursion simplifies the computation of P[O] and gives an algorithm for the solution of Problem (1) that requires on the order of $T N^2$ computations.
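A minimal Python sketch of this recursion (variable names are ours); on toy inputs it agrees with the brute-force sum above, which is a useful sanity check:

```python
import numpy as np

def forward(O, A, B, pi):
    """Forward algorithm: alpha[t, i] = P(O_1..O_{t+1}, q = S_i), 0-based t.
    P(O | model) is alpha[-1].sum(). O(T * N^2) work instead of O(T * N^T)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # base case: pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]   # sum over all predecessor states
    return alpha
```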

12 Backward Algorithm Another approach is the backward algorithm. Specifically, we calculate β(t, i), the probability of the partial observation sequence from time t + 1 to the end, given state Si at time t:
$\beta(t, i) = P[O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda]$.
Setting $\beta(T, i) = 1$, the recursion is
$\beta(t, i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta(t+1, j)$.
Again, by induction one can find the β(t, i)'s starting with the value t = T − 1, then t = T − 2, and so on, eventually working back to t = 1.
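A matching sketch of the backward pass, in the same style as the forward sketch above:

```python
import numpy as np

def backward(O, A, B):
    """Backward algorithm: beta[t, i] = P(O_{t+2}..O_T | q = S_i), 0-based t,
    filled in from t = T-1 down to t = 0."""
    N, T = A.shape[0], len(O)
    beta = np.ones((T, N))                                # base case: beta(T, i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])      # sum over successor states j
    return beta
```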

13 Solution to Problem (2) Given an observed sequence O = O1,…,OT of outputs, we want to compute efficiently a state sequence Q = q1,…,qT that has the highest conditional probability given O. In other words, we want to find a Q that makes P[Q | O] maximal. There may be many Q’s that make P[Q | O] maximal. We give an algorithm to find one of them.

14 The Viterbi Algorithm It is divided into two steps: first it finds $\max_Q P[Q \mid O]$, and then it backtracks to find a Q that realizes this maximum. First define, for arbitrary t and i, δ(t, i) to be the maximum probability, over all ways to end in state Si at time t, of having observed the sequence O1O2…Ot. Then
$\max_Q P[Q \text{ and } O] = \max_i \delta(T, i)$.
The probability P[Q and O] is the joint probability of Q and O, not a conditional probability. Our aim is to find a sequence Q for which the maximum conditional probability is achieved.

15 - 2 - But
$P[Q \mid O] = \dfrac{P[Q \text{ and } O]}{P[O]}$.
Since the denominator on the RHS does not depend on Q, we have
$\arg\max_Q P[Q \mid O] = \arg\max_Q P[Q \text{ and } O]$.
We calculate the δ(t, i)'s inductively:
$\delta(1, i) = \pi_i\, b_i(O_1)$, and
$\delta(t+1, j) = \Big[\max_i \delta(t, i)\, a_{ij}\Big]\, b_j(O_{t+1})$.

16 - 3 - Finally, we recover the qt's as follows. Define and put
$q_T = \arg\max_i \delta(T, i)$.
This is the last state in the desired state sequence. The remaining qt for t < T are found recursively by defining
$\psi(t+1, j) = \arg\max_i \delta(t, i)\, a_{ij}$
and putting
$q_t = \psi(t+1, q_{t+1})$.
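A minimal sketch of the full two-step procedure (variable names are ours; delta and psi correspond to the slides' δ and ψ):

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Returns one state sequence Q maximizing P(Q and O), equivalently P(Q | O)."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta(t-1, i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()                    # last state: argmax_i delta(T, i)
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1][q[t + 1]]               # backtrack through psi
    return q
```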

17 Solution to Problem (3) We are given a set of observed data from an HMM whose topology is known, and we wish to estimate the parameters of that HMM. We briefly describe the intuition behind the Baum-Welch method of parameter estimation. Assume that the alphabet size M and the number of states N are fixed at the outset. The data we use to estimate the parameters constitute a set of observed sequences {O(d)}.
The parameter space is usually far too large to allow exact calculation of a set of parameter estimates that maximizes the probability of the data. Instead, we employ algorithms that find locally optimal sets of parameters. This focus on local estimation means the procedure is heuristic, so its efficacy must be evaluated empirically, using benchmarks and test sets for which there are known outcomes.
Some further comments are in order. It is not necessary to assume that the data come from an HMM. Instead, it is usually more accurate to assume that the data are generated by some random process that we try to fit with an HMM; sometimes a tight fit is achievable and sometimes it is not. This shows that we should use the term "estimation of parameters" cautiously: our aim is to set the parameters at values providing a good fit to the data, rather than to estimate the parameters of a true underlying model.

18 The Baum-Welch Algorithm
We start by setting the parameters π_i, a_ij, b_i(k) at some initial values; these can be chosen from a uniform distribution, or chosen to incorporate prior knowledge about the model. Using these initial parameter values we then calculate:
1) π_i* = the expected proportion of times in state Si at the first time point, given {O(d)}.

19 - 2 -
2) $a_{ij}^* = E[N_{ij}] / E[N_i]$
3) $b_i^*(k) = E[N_i(k)] / E[N_i]$
where N_ij is the random number of times q_t(d) = Si and q_{t+1}(d) = Sj for some d and t; N_i is the random number of times q_t(d) = Si for some d and t; and N_i(k) is the random number of times q_t(d) = Si and the model emits symbol k, for some d and t.

20 Upshot It can be shown that if λ = (π_i, a_ij, b_i(k)) is replaced by λ* = (π_i*, a_ij*, b_i*(k)), then P[{O(d)} | λ*] ≥ P[{O(d)} | λ], with equality holding if and only if λ* = λ. Thus successive iterations continually increase the probability of the data, given the model. Iterations continue until a local maximum of the probability is reached. The new parameters are efficiently calculable, but we shall refrain from showing how this is done.
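A minimal sketch of one re-estimation pass for a single observed sequence, reusing the forward and backward sketches above (the multi-sequence case sums the expected counts over all O(d)); the vectorized computation and names are ours, not the slides':

```python
import numpy as np

def baum_welch_step(O, A, B, pi):
    """One Baum-Welch (EM) iteration for a single sequence O of symbol indices."""
    T, N, M = len(O), A.shape[0], B.shape[1]
    alpha, beta = forward(O, A, B, pi), backward(O, A, B)
    pO = alpha[-1].sum()                                  # P(O | current model)
    gamma = alpha * beta / pO                             # gamma[t, i] = P(q_t = S_i | O)
    # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, O[1:]].T * beta[1:])[:, None, :]) / pO
    pi_new = gamma[0]                                     # expected proportion at t = 1
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # E[N_ij] / E[N_i]
    B_new = np.zeros((N, M))
    for k in range(M):                                    # E[N_i(k)] / E[N_i]
        B_new[:, k] = gamma[np.array(O) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new
```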

21 HMM applications Gene finding MSA Protein family modeling

22 Gene finding

23 What is a (protein-coding) gene?
DNA → (transcription) → mRNA → (translation) → peptide. For example, the DNA sequence CCTGAGCCAACTATTGATGAA is transcribed into the mRNA CCUGAGCCAACUAUUGAUGAA, which is translated into a peptide.

24 Some facts about human genes
Genes comprise about 3% of the genome. Average gene length: ~8,000 bp. Average of 5-6 exons per gene. Average exon length: ~200 bp. Average intron length: ~2,000 bp. ~8% of genes have a single exon. Some exons can be as small as 1 or 3 bp. HUMFMR1S is not atypical: its 17 exons (about 2,000 bp in total, i.e. 3% of the gene) lie in a 67,000 bp gene.

25 What is a gene, ctd? In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequences) are removed, leaving exons (expressed sequences), and yet other regions remain untranslated. The translated sequence comes in triples called codons, beginning and ending with a unique start codon (ATG) and one of three stop codons (TAA, TAG, TGA). There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcription start sites, polyA sites, branching sites, and so on.

26 In more detail (color ~state)

27 Gene Finding Challenges
Need the correct reading frame Introns can interrupt an exon in mid-codon There is no hard and fast rule for identifying donor and acceptor splice sites Signals are very weak

28 The idea behind a GHMM genefinder
States represent standard gene features: intergenic region, exon, intron, and perhaps more (promoter, 5'UTR, 3'UTR, poly-A, …). Observations embody state-dependent base composition, dependence, and signal features. In a GHMM, state duration must be modeled as well. Finally, reading frames and both strands must be dealt with.

29 Gene model B = gene start, S = translation start, D = donor, A = acceptor, T = translation stop, E = gene end.

30 Why HMMs might be a good fit for Gene Finding
Classification: classifying observations within a sequence. Order: a DNA sequence is a set of ordered observations. Grammar / Architecture: the grammatical structure of a gene (and the beginnings of our architecture) maps naturally onto the model's states. Success measure: the number of complete exons correctly labeled. Training data: available from various genome annotation projects.

31 Not for exam Half a model for a genefinder

32 Not for exam Splice sites can be included in the exons

33 Not for exam Beyond position-specific distributions
The bases in splice sites exhibit dependence, and not simply of the nearest-neighbor kind. High-order (non-stationary) Markov models would be one option, but the number of parameters in relation to the amount of data rules them out. The class of variable length Markov models (VLMMs), deriving from early research by Rissanen, proves valuable in this context. However, there is likely room for more research here.

34 Not for exam HMM Gene Finders: VEIL A straight HMM gene finder.
Takes advantage of grammatical structure and modular design. Uses many states that can emit only one symbol to get around state-independence assumptions.

35 Not for exam Motif detection

36 Not for exam Motifs An HMM defines a probability distribution over possible sequences. In our case: observations = nucleotides; a series = a sequence of observations.

37 Not for exam MSA Consider the following DNA motif alignment:
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

38 Not for exam MSA The alignment matches the regular expression [AT] [CG] [AC] [ACGT]* A [TG] [GC]:
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

39 Not for exam MSA The regular expression can:
Determine if the sequence in question fits the criteria of the search or not The regular expression cannot: Determine how well the sequence in question fits the criteria of the search

40 Not for exam Deriving the HMM from a known alignment
Statistics: each column in the alignment generates a state. Count the occurrences of A, C, G, T in each column to determine the emission probabilities for that state. Insertions are trickier.
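A minimal counting sketch for the match-state emission probabilities (the function name is ours; insert/delete states and transition probabilities are omitted, and gap handling is simplified):

```python
from collections import Counter

def column_emissions(alignment):
    """Per-column emission probabilities from an alignment, by simple counting
    (no pseudocounts yet; gaps are skipped)."""
    probs = []
    for col in zip(*alignment):                    # iterate over alignment columns
        counts = Counter(c for c in col if c != '-')
        total = sum(counts.values())
        probs.append({nt: n / total for nt, n in counts.items()})
    return probs

aln = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
print(column_emissions(aln)[0])   # column 1: {'A': 0.8, 'T': 0.2}
```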

41 Not for exam Deriving the HMM
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

42 Not for exam Using the HMM
How well does a given sequence fit the family? Let's try it.
Exceptional sequence: T G C T - - A G G
Consensus sequence: A C A C - - A T C

43 Not for exam Using the HMM
Exceptional sequence: P(TGCT- -AGG) = (.2*1)*(.2*1)*(.2*.6)*(.2*.6)*(1*1)*(.2*1)*(.2) ~= 0.0023e-2
Consensus sequence: P(ACAC- -ATC) = (.8*1)*(.8*1)*(.8*.6)*(.4*.6)*(1*1)*(.8*1)*(.8) ~= 4.7e-2
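The two products can be checked directly; the factors are copied from the slide (each pair is an emission probability times a transition probability along the path, and restoring a (.2*1) factor that appears to have been dropped in the transcript makes the exceptional product come out to the printed value):

```python
import math

exceptional = [.2*1, .2*1, .2*.6, .2*.6, 1*1, .2*1, .2]
consensus   = [.8*1, .8*1, .8*.6, .4*.6, 1*1, .8*1, .8]
print(math.prod(exceptional))   # ~2.3e-05, i.e. 0.0023e-2
print(math.prod(consensus))     # ~4.7e-02
```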

44 Not for exam Using the HMM

45 Not for exam Log-odds Log-odds is computed as
$LO(S) = \ln\big(P(S) / 0.25^L\big)$
where P(S) is the sequence probability under the model, as before, and $0.25^L$ is the null model, which treats all L emitted nucleotides as uniformly random. A better null-model estimate would use the overall frequency of nucleotides in the organism's genome.

46 Not for exam Log-odds Consensus sequence:
$LO(ACAC- -ATC) = \ln\big(4.7 \times 10^{-2} / 0.25^{7}\big) \approx 6.64$
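A quick check of this value; the natural log is an inference on our part, chosen because it reproduces the slide's 6.64 (a base-2 log would give about 9.6):

```python
import math
P = 4.7e-2                      # P(ACAC--ATC) from the previous slide
L = 7                           # number of emitted (non-gap) nucleotides
print(math.log(P / 0.25**L))    # ~6.65, matching the slide's 6.64 up to rounding
```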

47 Not for exam A drawback and pseudocounts
It is dangerous to estimate a probability distribution from just a few examples. A pseudocount is a fake count: pretend you saw a nucleotide in a position even though it wasn't there. This allows for the small possibility that something other than what you have observed may occur.

48 Not for exam How pseudocounts help
If, for instance, you have only the first two sequences of the alignment
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
and you score sequence 4, you get P(4) = .5 * 0 * 1 * … = 0, when in fact we already know that sequence 4 is part of the same family.
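A minimal sketch of add-one (Laplace) pseudocounts applied to a single column; the helper is hypothetical and assumes a DNA alphabet:

```python
from collections import Counter

def column_probs(col, alphabet="ACGT", pseudocount=1):
    """Emission probabilities for one alignment column with add-one pseudocounts:
    no nucleotide gets probability zero, so plausible family members are no
    longer scored as impossible."""
    counts = Counter(c for c in col if c != '-')
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {nt: (counts[nt] + pseudocount) / total for nt in alphabet}

# With only the first two sequences, column 1 is ('A', 'T'):
print(column_probs("AT"))   # A and T get 2/6 each; C and G get 1/6 instead of 0
```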

49 Not for exam MSA

50 Not for exam Multiple Sequence Alignment
The MSA of a set of sequences may be viewed as an evolutionary history of the sequences. HMMs often provide an MSA as good as, if not better than, other methods:
The approach is well grounded in probability theory.
No sequence ordering is required.
Insertion/deletion penalties are not needed.
Experimentally derived information may be incorporated.
The sequences to be aligned are used as the training data to train the parameters of the model. For each sequence, the Viterbi algorithm is then used to determine a path most likely to have produced that sequence.

51 Not for exam MSA with HMMs Construct a profile HMM.
Find the most likely path for each sequence; the sequence of match/insert/delete states is the alignment.

52 Not for exam MSA Consider the sequences CAEFDDH and CDAEFPDDH. Suppose the model has length 10 and their most likely paths through the model are
m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10 and
m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10.
The alignment induced is found by aligning positions that were generated by the same match state. This leads to the alignment
C-AEF-DDH
CDAEFPDDH
The likely paths are found via the Viterbi algorithm.
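A sketch of the induced-alignment step: residues generated by the same match state go in the same column. The helper is hypothetical; following the slide's paths, we assume m0 and m10 are non-emitting begin/end states (which makes the path lengths match the sequence lengths):

```python
def residues_by_match_state(seq, path, silent=("m0", "m10")):
    """Assign each residue of seq to the path state that emitted it.
    Delete states and the begin/end states emit nothing."""
    it = iter(seq)
    out = {}
    for s in path:
        if s in silent or s.startswith('d'):
            continue                          # non-emitting state
        out.setdefault(s, []).append(next(it))
    return out

p1 = "m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10".split()
p2 = "m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10".split()
r1 = residues_by_match_state("CAEFDDH", p1)
r2 = residues_by_match_state("CDAEFPDDH", p2)
# Aligning on shared match states (i1's 'D' and m6's 'P' become gap
# columns for sequence 1, and the all-delete column m5 is dropped) gives:
#   C-AEF-DDH
#   CDAEFPDDH
```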

53 Not for exam Protein Families

54 Not for exam Pfam Pfam is a web-based resource maintained by the Sanger Center. Pfam uses the basic theory described above to determine protein domains in a query sequence.
Suppose that a new protein is obtained for which no information is available except the raw sequence. We wish to "annotate" this sequence: annotation is the process of assigning to a sequence biologically relevant information, such as where the functional domains are, what their homology is to known domains, and what their function is.
A protein usually has one or more functional domains, namely portions of the protein that have essential function and thus low tolerance for amino acid substitutions. Proteins in different families often share high homology in one or more domains. Entire protein families can be characterized by HMMs, as in the previous section, or one can characterize just the functional domains.

55 Not for exam Protein Family Classification
Pfam is a large collection of multiple sequence alignments and hidden Markov models. It covers many common protein domains and families: over 73% of all known protein sequences have at least one match, across 5,193 different protein families.

56 Not for exam Pfam An initial multiple alignment of seed sequences is built using a program such as Clustal. The alignment is hand-scrutinized and adjusted, and additional sequences are added to the family by comparing the HMM against sequence databases. The resulting full alignments, with the additional family members, may look worse than the initial seed alignments.

57 Not for exam Pfam Family Types
family - the default classification, stating that members are related
domain - a structural unit found in multiple protein contexts
repeat - a domain that is not stable in itself, but forms a domain or structure when combined in multiple tandem repeats
motif - shorter sequence units found outside of domains

58 Not for exam Pfam Links to the Pfam software and example entries.

