
Hidden Markov model BioE 480 Sept 16, 2004

In general, we have Bayes' theorem: P(X|Y) = P(Y|X)P(X)/P(Y). Event X: the die is loaded; event Y: 3 sixes.
Example: Assume we know that, on average, extracellular proteins have a slightly different amino acid (a.a.) composition than intracellular ones, e.g., more cysteines. How do we use this information to predict, for a new protein sequence x = x_1 x_2 … x_n, whether it is intracellular or extracellular?
– We first split the training examples from Swiss-Prot into intracellular and extracellular proteins, leaving aside those that are unclassifiable.
– We then estimate a set of a.a. frequencies for intracellular proteins, q^int_a, and a set of extracellular frequencies, q^ext_a.
– We also estimate the probabilities that any new sequence is extracellular or intracellular, p_ext and p_int, called prior probabilities, because they are our best guesses about a sequence before we actually see the sequence itself.

We now have the likelihood of a sequence under each model:
P(x|ext) = ∏_i q^ext_{x_i} and P(x|int) = ∏_i q^int_{x_i}.
Because we assume that every sequence must be either extracellular or intracellular, we have:
P(x) = p_ext P(x|ext) + p_int P(x|int).
By Bayes' theorem,
P(ext|x) = p_ext P(x|ext) / (p_ext P(x|ext) + p_int P(x|int)).
This is the number we want: the posterior probability that a sequence is extracellular.
– It is our best guess after we have seen the data.
In practice it is more complicated: transmembrane proteins have both intracellular and extracellular components.
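A minimal sketch of this classification in Python (the frequency tables and priors below are invented placeholders, not values estimated from Swiss-Prot):

    import math

    # Hypothetical a.a. frequencies (only a few residues shown); real values
    # would be estimated from the classified Swiss-Prot training sets.
    q_ext = {'C': 0.04, 'L': 0.08, 'S': 0.08}
    q_int = {'C': 0.01, 'L': 0.10, 'S': 0.09}
    p_ext, p_int = 0.3, 0.7   # prior probabilities (assumed values)

    def log_likelihood(seq, q):
        # log P(x|model) = sum_i log q_{x_i}
        return sum(math.log(q[a]) for a in seq)

    def posterior_ext(seq):
        # Bayes: P(ext|x) = p_ext P(x|ext) / (p_ext P(x|ext) + p_int P(x|int))
        le = math.log(p_ext) + log_likelihood(seq, q_ext)
        li = math.log(p_int) + log_likelihood(seq, q_int)
        m = max(le, li)   # subtract the max before exponentiating, for stability
        return math.exp(le - m) / (math.exp(le - m) + math.exp(li - m))

    print(posterior_ext("CCLSC"))   # cysteine-rich, so the posterior favors ext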

Random Model R: For two sequences x and y, of lengths n and m, let x_i be the i-th symbol in x and y_i the i-th symbol in y. Assume that letter a occurs independently with some frequency q_a.
– The probability of the two sequences x and y is just the product of the probabilities of each amino acid: P(x,y|R) = ∏_i q_{x_i} ∏_j q_{y_j}.
An alternative model, the Match Model M: aligned pairs of residues occur with a joint probability p_ab. Its value can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue c in their common ancestor.
– c might be the same as a and/or b.
– The probability of the whole alignment is: P(x,y|M) = ∏_i p_{x_i y_i}.
The ratio of these two likelihoods is the odds ratio: P(x,y|M) / P(x,y|R) = ∏_i p_{x_i y_i} / (∏_i q_{x_i} ∏_i q_{y_i}) = ∏_i p_{x_i y_i} / (q_{x_i} q_{y_i}).
To make this additive, we take the logarithm of this ratio, the log-odds ratio: S = Σ_i s(x_i, y_i), where s(a,b) = log( p_ab / (q_a q_b) ).

Here s(a,b) is the log-likelihood ratio of the residue pair (a,b) occurring as an aligned pair, as opposed to an unaligned pair. A biologist may write down an ad hoc substitution matrix based on intuition, but it actually implies the "target frequencies" p_ab: any substitution matrix is making a statement about the probability of observing ab pairs in real alignments.
How do we develop an evolutionary model?
– Parameterize it by the probability of residue A mutating to residue B: P_AB.
– Statistical modeling: these parameters cannot simply be assigned; rather, they have to be estimated from data. Suppose we know sequences s and s' are related: find the P_AB that maximize the likelihood of the data under the model.
– Maximum likelihood: maximize the data likelihood under the model; the resulting estimates are given on the next slide.
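A small sketch of turning target and background frequencies into additive log-odds scores (all numbers here are invented for illustration):

    import math

    # Invented target frequencies p_ab and background frequencies q_a.
    p = {('L', 'L'): 0.06, ('L', 'V'): 0.02, ('V', 'V'): 0.04}
    q = {'L': 0.09, 'V': 0.07}

    def s(a, b):
        # s(a,b) = log( p_ab / (q_a q_b) ), here in natural-log units
        p_ab = p.get((a, b)) or p.get((b, a))   # the pair is unordered
        return math.log(p_ab / (q[a] * q[b]))

    # Score an ungapped alignment additively: S = sum_i s(x_i, y_i).
    x, y = "LVL", "VVL"
    print(sum(s(xi, yi) for xi, yi in zip(x, y)))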

A substitution matrix can be obtained once alignments of sequences are compiled, with a different matrix for each evolutionary time t. How do we estimate it?
– The probability that a given residue A is substituted by B within evolutionary distance t: P(B|A, t).
– Ignore the directionality of time (reversibility): q_A P(B|A, t) = q_B P(A|B, t).
– Assume that the distribution of amino acids (a.a.) does not change during evolution. P(B|A, t) can then be estimated from the relative frequency of the pair (A,B) in the known alignment of s and s', and the relative frequency q_A of residue A.
– A substitution matrix over a longer time scale is obtained by matrix multiplication: P(2t) = P(t) P(t).
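A sketch of this extrapolation with a toy two-letter alphabet (the one-step matrix below is made up; rows are the current residue, columns the residue after time t):

    import numpy as np

    # Toy one-step substitution matrix P(t); each row sums to 1.
    P1 = np.array([[0.99, 0.01],
                   [0.02, 0.98]])

    # Longer time scales by matrix multiplication: P(n t) = P(t)^n.
    P2 = P1 @ P1                            # two time units
    P100 = np.linalg.matrix_power(P1, 100)  # one hundred time units

    print(P2)    # still close to the identity matrix
    print(P100)  # much closer to the stationary residue distribution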

Regular Expressions
Widely used in many programs, especially on Unix: awk, grep, sed, and perl.
– Used for searching text files for a pattern. E.g., search all files containing "C. elegans" or "Caenorhabditis elegans" with the regular expression:
% grep "C[\.a-z]* elegans" *
– This matches any line containing a C, followed by any number of lower-case letters or ".", then a space, and then "elegans".
– Another example: PROSITE patterns.
– Difficulty: the patterns need to be very broad and complex, because protein "spelling" is much more free than English spelling.

Example (using DNA because it has fewer letters than a.a.):
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
A regular expression for this is: [AT][CG][AC][ACGT]*A[TG][GC]
Problem: it does not distinguish between:
TGCT--AGG (highly implausible: an exceptional character in each position)
ACAC--ATC (the consensus sequence)
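A quick check of this weakness in Python (a sketch; the gaps are stripped before matching, since the regular expression describes ungapped sequences):

    import re

    pattern = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

    for seq in ["ACAC--ATC", "TGCT--AGG"]:
        ungapped = seq.replace("-", "")
        # fullmatch: the entire ungapped sequence must fit the expression
        print(seq, bool(pattern.fullmatch(ungapped)))
    # Both print True: the regular expression accepts the consensus and the
    # implausible sequence equally, with no notion of a score.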

Alternative: score sequences by how well they fit the alignment.
– E.g., a probability of 4/5 = 0.8 for an A in the first position, 1/5 = 0.2 for a T; etc.
– After the third position in the alignment, 3 out of 5 sequences have "insertions" of varying lengths, so we say the probability of making an insertion is 3/5, and thus 2/5 for not making one.
– A diagram of this (next slide) is a hidden Markov model!

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
[Diagram: a chain of match states, one per conserved alignment column, each labeled with its emission probabilities (state 1: A 0.8, T 0.2; state 2: C 0.8, G 0.2; state 4: A 1.0; state 5: T 0.8, G 0.2; state 6: C 0.8, G 0.2), plus an insert state above the main line. The remaining emission and transition probabilities are marked "?" and are filled in on the next slide.]

Hidden Markov Model
A box is called a "(match) state":
– one state for each term in the regular expression.
– Probabilities: estimated by counting in the multiple alignment how many times each event occurs.
"Insertion": a state above the other states.
– Emission probabilities of the NTs: count all occurrences of the four NTs in this region of the alignment: A 1/5, C 2/5, G 1/5, and T 1/5.
– Transition probabilities: sequences 2, 3 and 5 each make one insertion, and sequence 2 makes two more, for five inserted residues in total.
The total number of transitions back to the main line is 3: three sequences have insertions, and all of them come back to the main states.
Therefore, the probability of making a transition to the next state is 3/5, and the probability of making a transition back to itself (continuing to insert) is 2/5.
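A sketch of estimating these numbers by counting (the split into match columns and an insert region is hard-coded for this toy alignment):

    from collections import Counter

    aln = ["ACA---ATG",
           "TCAACTATC",
           "ACAC--AGC",
           "AGA---ATC",
           "ACCG--ATC"]

    match_cols = [0, 1, 2, 6, 7, 8]   # conserved (shaded) columns
    insert_cols = [3, 4, 5]           # highly variable insert region

    # Emission probabilities of each match state: column frequencies.
    for j in match_cols:
        column = [s[j] for s in aln if s[j] != "-"]
        freqs = {nt: n / len(column) for nt, n in Counter(column).items()}
        print("match", j, freqs)

    # Insert-state emissions: pool all residues in the insert region.
    ins = [s[j] for s in aln for j in insert_cols if s[j] != "-"]
    print("insert", {nt: n / len(ins) for nt, n in Counter(ins).items()})

    # Transitions out of the insert state: 5 inserted residues in total,
    # 3 exits back to the main line, so P(exit) = 3/5 and P(self) = 2/5.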

Scoring Sequences
Consensus sequence: ACACATC.
– Probability of the 1st A: 4/5.
– This is multiplied by the probability of the transition from the first state to the second, which is 1, then by the emission probability of the C in the second state, and so on. The total probability of the consensus works out to about 4.7 × 10^-2.
How do we score the exceptional sequence TGCT--AGG?
– Its probability is about 2.3 × 10^-5, roughly 2000 times smaller.
– We can now get a score for each sequence measuring how well it fits the motif.
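A sketch of this multiplication along the model path, using the emission and transition probabilities estimated above (the state path through the insert state is supplied explicitly here rather than inferred):

    # Emission probabilities per state (match states m1..m6, insert state i).
    E = {
        "m1": {"A": 0.8, "T": 0.2},
        "m2": {"C": 0.8, "G": 0.2},
        "m3": {"A": 0.8, "C": 0.2},
        "i":  {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
        "m4": {"A": 1.0},
        "m5": {"T": 0.8, "G": 0.2},
        "m6": {"C": 0.8, "G": 0.2},
    }
    # Transitions that differ from 1: entering, continuing, leaving the insert.
    T = {("m3", "i"): 0.6, ("m3", "m4"): 0.4, ("i", "i"): 0.4, ("i", "m4"): 0.6}

    def path_probability(seq, states):
        p, prev = 1.0, None
        for a, s in zip(seq, states):
            if prev is not None:
                p *= T.get((prev, s), 1.0)   # unlisted transitions are 1
            p *= E[s].get(a, 0.0)
            prev = s
        return p

    path = ["m1", "m2", "m3", "i", "m4", "m5", "m6"]
    consensus = path_probability("ACACATC", path)
    exceptional = path_probability("TGCTAGG", path)
    print(consensus)                 # about 4.7e-2
    print(consensus / exceptional)   # about 2000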

For the other four original sequences, the probability depends very strongly on the length of the sequence.
– The raw probability is therefore not a good number to use as a score.
– Use the log-odds ratio instead: log(observed / random), where the random model (null model) is that the sequences are random strings of NTs, so the probability of a sequence of length L is 0.25^L.

The log-odds score is: S = log( P(sequence | HMM) / 0.25^L ).
Other null models are possible: instead of 0.25, use the background NT composition.
– When a sequence fits the HMM very well: high log-odds score.
– When it fits the null model better: negative score.
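Continuing the sketch above, the conversion of a raw probability into a log-odds score against the 0.25^L null model:

    import math

    def log_odds(p_model, length, q_null=0.25):
        # S = log P(x|HMM) - L log q_null (natural logarithm here)
        return math.log(p_model) - length * math.log(q_null)

    print(log_odds(4.7e-2, 7))   # consensus: clearly positive
    print(log_odds(2.3e-5, 7))   # exceptional sequence: near zero or negative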

The second sequence has a raw score almost as low as the exceptional sequence
– because it has three inserts.
– But its log-odds score is much higher than that of the exceptional sequence: excellent discrimination.
– Still, a high log-odds score may not be a "hit": there will always be random hits when searching a database, so we also need to look at the E-value and P-value.
If the alignment had no gaps or insertions:
– there would be no insert state;
– all probabilities associated with the arrows (the transition probabilities) would equal 1 and could be ignored;
– the HMM would then work exactly as a weight matrix of log-odds scores.

What is hidden?
Come back to the occasionally dishonest casino: it uses a fair die most of the time, but with probability 0.05 it switches to a loaded die, and with probability 0.1 it switches back. The switching between dice is a Markov process (it depends only on the previous state). The observed sequence of rolls comes from a hidden Markov process, because the casino wouldn't tell you which rolls were made with the loaded die.
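A small simulation of this process (the loaded-die emission probabilities, with a six rolled half the time, are an assumption, following the standard version of this example):

    import random

    # Probability of switching dice after each roll.
    SWITCH = {"fair": 0.05, "loaded": 0.1}
    # Emission probabilities: fair die uniform, loaded die favoring six.
    EMIT = {"fair": [1/6] * 6,
            "loaded": [0.1] * 5 + [0.5]}   # assumed loaded-die distribution

    def roll_sequence(n, state="fair"):
        rolls, states = [], []
        for _ in range(n):
            rolls.append(random.choices(range(1, 7), weights=EMIT[state])[0])
            states.append(state)
            if random.random() < SWITCH[state]:
                state = "loaded" if state == "fair" else "fair"
        return rolls, states

    rolls, states = roll_sequence(20)
    print(rolls)    # observed: the rolls
    print(states)   # hidden: which die produced each roll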

Profile HMMs
Profile HMMs allow position-dependent gap penalties.
– Obtained from a multiple alignment.
– Can be used to search a database for other members of the family, just like a standard profile.
Structure of the model:
– Main states (bottom): model the columns of the alignment. Their probabilities are calculated from the frequencies of the a.a. or NTs.
– Insert states (diamonds): model highly variable regions in the alignment. Often their emission probabilities are a fixed distribution, e.g., the background composition.

– Delete states (circles): silent or null states. They do not match any residues; they are there so that it is possible to jump over one or more columns, i.e., to model the case where just a few of the sequences have a "-" at a position.
Example: [Figure: a multiple alignment with shaded conserved columns and a white, highly variable insertion region, with the corresponding profile HMM; described on the next slide.]

The insertion region (white) is one whose alignment is highly uncertain. The shaded regions are the columns that correspond to main states in the HMM.
– Emission probabilities: for each non-insert column, we make a main state and set the probabilities equal to the amino acid frequencies.
– Transition probabilities: count how many sequences use the various transitions, as before.
Delete states: two transitions from a main state to a delete state are shown with dashed lines:
– from "begin" to the first delete state;
– from main state 12 to delete state 13.
– Both correspond to dashes in the alignment. Since only one sequence (out of 30) has these gaps, the probability of each delete transition is 1/30.
– The 4th sequence continues its deletion to the end: the probability from delete 13 to delete 14 is 1, and from delete 14 to the end is also 1.

Pseudo-counts
It is dangerous to estimate a probability distribution from just a few observed amino acids.
– If there are two sequences, both with Leu at a position: P = 1 for Leu, but P = 0 for all other residues at this position. Yet we know that Val often substitutes for Leu. The probability of a whole sequence can easily become 0 if a single Leu is substituted by a Val; equivalently, the log-odds score is minus infinity.
How do we avoid "over-fitting" (strong conclusions drawn from very little evidence)? Use pseudocounts:
– Pretend to have more counts than those observed in the data.
– A. Add 1 to all the counts: Leu gets probability 3/22, every other a.a. 1/22 (2 real counts plus 20 pseudocounts).

Adding 1 to all counts amounts to assuming a priori that all a.a. are equally likely.
Another approach: use the background composition as pseudocounts.
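A minimal sketch of the two pseudocount schemes (the uniform background used in the second scheme is a placeholder for real composition estimates):

    from collections import Counter

    AA = "ACDEFGHIKLMNPQRSTVWY"
    observed = Counter({"L": 2})   # two sequences, both with Leu

    # Scheme A: add-one (Laplace) pseudocounts.
    total = sum(observed.values()) + len(AA)
    p_laplace = {a: (observed[a] + 1) / total for a in AA}
    print(p_laplace["L"], p_laplace["V"])   # 3/22 and 1/22

    # Scheme B: background-composition pseudocounts with weight alpha.
    bg = {a: 1 / len(AA) for a in AA}       # placeholder background
    alpha = 20
    total_b = sum(observed.values()) + alpha
    p_bg = {a: (observed[a] + alpha * bg[a]) / total_b for a in AA}
    print(p_bg["L"])   # equals the Laplace value when bg is uniform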