Hidden Markov Models
Presentation by Jeff Rosenberg, Toru Sakamoto, Freeman Chen

Major sources:
1. Tutorial: Stochastic Modeling Techniques: Understanding and Using Hidden Markov Models (Leslie Grate, Richard Hughey, Kevin Karplus, Kimmen Sjölander)
2. An Introduction to Hidden Markov Models for Biological Sequences (Anders Krogh) [Chapter 4 in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pages 45-63. Elsevier, 1998]
3. Hidden Markov Models and Protein Sequence Analysis (Rachel Karchin)

The Plan
- Modeling Biological Sequences
- Markov Chains
- Hidden Markov Models
- Issues
- Examples
- Techniques and Algorithms
- Doing it with Mathematica

Biological Sequences

FA12_HUMAN  VVGGLVALRGAHPYIAALYWGHSFCAGSLIAPC
TRYP_PIG    IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWV
TRY1_BOVIN  IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWV
URT1_DESRO  STGGLFTDITSHPWQAAIFAQNRRSSGERFLCGG
TRY1_SALSA  IVGGYECKAYSQTHQVSLNSGYHFCGGSLVNENWV
TRY1_RAT    IVGGYTCPEHSVPYQVSLNSGYHFCGGSLINDQWV
NRPN_MOUSE  ILEGRECIPHSQPWQAALFQGERLICGGVLVGDRW
COGS_UCAPU  IVGGVEAVPNSWPHQAALFIDDMYFCGGSLISPEW

Because these segments have similar or identical biological functions, it is presumed that the observed sequences are related through a history of branching changes from sequences that existed in the past. Alignment procedures and biological activity show that the essential functionality (enzyme sites, specific metal binding, etc.) varies little from species to species; these are the conserved regions of the sequence. Assigning relationships among sequences is muddled by the variability of the other regions. So how can we find similarities among existing sequences that reflect such (hypothesized) historical changes? If we assume various "models" of the past (ancestry of sequences), we can calculate the probabilities of observing the particular sequences extant today; however, we must use some form of Bayes' formula to construct probable hypothetical paths connecting sequence changes and divergence over time. Several computational tools have been and are being developed for this. Today we take a closer look at methods based on the concept of a hidden Markov model.

FA12_HUMAN: Coagulation factor XII [Precursor] (human)
TRYP_PIG: Trypsin [Precursor] (pig)
TRY1_BOVIN: Trypsinogen, cationic [Precursor] [Fragment] (bovine/cow)
URT1_DESRO: Salivary plasminogen activator alpha 1 [Precursor] (vampire bat)
TRY1_SALSA: Trypsin I [Precursor] (Atlantic salmon)
TRY1_RAT: Trypsin I, anionic [Precursor] (rat)
NRPN_MOUSE: Neuropsin [Precursor] (mouse)
COGS_UCAPU: Brachyurin (Atlantic sand fiddler crab)

Sequences and Models
- Many biological sequences (DNA/RNA, proteins) have very "subtle" rules for their structure; they clearly form "families" and are "related," yet simple measures or descriptions of these relationships or rules rarely apply.
- There is a need for some kind of "model" that can be used to identify relationships among sequences and distinguish members of families from non-members.
- Given the complexity and variability of these biological structures, any practical model must have a probabilistic component: it will be a stochastic model rather than a mechanistic one. It will be evaluated by the (statistical) accuracy and usefulness of its predictions, rather than by how closely its internal features correspond to any internal mechanism in the structures being modeled.
- HMMs can also be used to recognize various kinds of "grammars"; in fact, their most widespread use initially was for speech recognition (and other pattern-recognition tasks), where the "symbols" were phonemes and the sequence was the vocalization of a word or sentence by a speaker.

Markov Chains
- A system with a set of m possible states, Si. At each of a sequence of discrete points in time t >= 0, the system is in exactly one of those states; the state at time t is designated qt. The movement from qt to qt+1 is probabilistic, and depends only on the states of the system at or prior to t.
- An initial state distribution: π(i) = Prob(q0 = Si).
- The process terminates either at time T or when it reaches a designated final state Sf.
- If the process is guaranteed to eventually reach some state which it cannot leave, it is an "absorbing" Markov process.

Markov Chains of Order N
- Nth-order Markov chain (N >= 0): transition probabilities out of state qt depend only on the values of qt, qt-1, ..., qt-(N-1).
- Typically we deal with a 1st-order Markov chain, so only qt itself affects the transition probabilities.
- In a 1st-order chain, for each state Sj there is a set of m probabilities for selecting the next state to move to: ai,j = Prob(qt+1 = Si | qt = Sj) [1 <= i <= m, t >= 0].
- If there is some ordering of states such that ai,j = 0 whenever i < j (i.e., no "non-trivial" loops), this is a "linear" (or "left-to-right") Markov process.
- "Homogeneous" Markov model: ai,j is independent of t.
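
To make the notation concrete, here is a minimal simulation of a homogeneous 1st-order chain with an initial distribution pi, transition probabilities a, and an absorbing final state Sf. The states and numbers are illustrative assumptions, not values from the slides.

```python
import random

# A minimal sketch of the 1st-order, homogeneous Markov chain described above.
# The states, initial distribution pi, and transition table a are made up.

pi = {"S1": 0.8, "S2": 0.2, "Sf": 0.0}
a = {                                 # a[current][next] = Prob(q_{t+1} = next | q_t = current)
    "S1": {"S1": 0.5, "S2": 0.4, "Sf": 0.1},
    "S2": {"S1": 0.3, "S2": 0.5, "Sf": 0.2},
    "Sf": {"Sf": 1.0},                # Sf cannot be left: the chain is "absorbing"
}

def sample(dist):
    """Draw one outcome from a {value: probability} distribution."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

def run_chain(max_t=50):
    """Generate q_0, q_1, ... until Sf is reached or max_t steps elapse."""
    q = sample(pi)
    path = [q]
    for _ in range(max_t):
        if q == "Sf":
            break
        q = sample(a[q])
        path.append(q)
    return path

print(run_chain())
```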

Simple Markov Models
- We might use a Markov chain to model a sequence where the symbol in position n depends on the symbol(s) in positions n-1, ..., n-N. For example, if a protein is more likely to have Lys after the sequence Arg-Cys, this could be encoded as (a small part of) a 2nd-order Markov model.
- If the probabilities of a given symbol are the same for all positions in the sequence, and independent of the symbols in other positions, then we can use the "degenerate" 0th-order Markov chain, where the probability of a given symbol is constant, regardless of the preceding symbol (or of the position in the sequence).
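
In the spirit of the Arg-Cys -> Lys example, a higher-order chain's parameters can be estimated simply by counting contexts. The short training sequences below (one-letter codes, R = Arg, C = Cys, K = Lys) are made up for illustration.

```python
from collections import Counter, defaultdict

# A minimal sketch of estimating 2nd-order transition probabilities from data.
training_seqs = ["MARCKLVRCK", "GGRCKTRCAL", "PLRCKMRCKQ"]   # illustrative only

pair_counts = Counter()                 # counts of each 2-symbol context
next_counts = defaultdict(Counter)      # counts of the symbol following each context

for seq in training_seqs:
    for i in range(len(seq) - 2):
        context, nxt = seq[i:i+2], seq[i+2]
        pair_counts[context] += 1
        next_counts[context][nxt] += 1

def prob_next(context, symbol):
    """Maximum-likelihood estimate of Prob(symbol | preceding two symbols)."""
    if pair_counts[context] == 0:
        return 0.0
    return next_counts[context][symbol] / pair_counts[context]

print(prob_next("RC", "K"))   # estimated Prob(Lys | Arg, Cys)
```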

Hidden Markov Models (HMMs)
- In a Markov chain (or model), the states are themselves visible; they can be considered the outputs of the system (or deterministically associated with those outputs).
- However, if each state can emit (generate) any of several possible output symbols vk, from an output alphabet O of M symbols, on a probabilistic basis, then it is not possible (in general) to determine the sequence of the states themselves; they are "hidden."
- Classic example: the "urn" game.
  - A set of N urns (states), each containing various colored balls (output symbols, with a total of M colors available), sits behind a curtain.
  - Player 1 selects an urn at random (with Markov assumptions), then picks a ball at random from that urn and announces its color to player 2.
  - Player 1 then repeats the above process, a total of T times.
  - Player 2 must determine the sequence of urns selected, based on the sequence of colors announced.
- For example, we might hypothesize that some underlying historical or causal mechanism, not directly accessible to observation, is generating the DNA or protein sequences that we observe in living organisms. Note that we are not actually hypothesizing any particular historical or causal mechanism: the state notion is an abstraction that encapsulates whatever connection there might be between various possible sequences.

Additional Parameters for HMMs
- In addition to the transition probabilities, each state has a prescribed probability distribution for emitting (producing) a symbol vk from O: bi,k = Prob(vk | Si). If qt = Si, then the generated output at time t is vk with probability bi,k.
- So an HMM is "doubly stochastic": both the (hidden) state transition process and the (visible) output symbol generation process are probabilistic.
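
Putting the two probability distributions together, the generative ("urn game") process can be sketched as follows; all parameter values are illustrative assumptions.

```python
import random

# A minimal generative sketch of the "doubly stochastic" process: at each step a
# hidden state is chosen from a, and a visible symbol is drawn from that state's
# emission distribution b. Only the symbols are revealed; the states stay hidden.

pi = {"urn1": 0.6, "urn2": 0.4}
a = {"urn1": {"urn1": 0.7, "urn2": 0.3},      # a[i][j] = Prob(next state j | current state i)
     "urn2": {"urn1": 0.4, "urn2": 0.6}}
b = {"urn1": {"red": 0.8, "blue": 0.2},       # b[i][k] = Prob(emit symbol k | state i)
     "urn2": {"red": 0.3, "blue": 0.7}}

def sample(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value

def generate(T=10):
    """Return (hidden state sequence, visible output sequence) of length T."""
    states, symbols = [], []
    q = sample(pi)
    for _ in range(T):
        states.append(q)
        symbols.append(sample(b[q]))          # player 2 sees only this sequence
        q = sample(a[q])
    return states, symbols

hidden, observed = generate()
print("observed:", observed)   # the urns (states) are not announced
```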

Bayesian Aspects of HMM Usage
- Given an HMM M, we can relatively easily calculate the probability of occurrence of an arbitrary output sequence s: P(s | M).
- However, we often want to determine the underlying set of states, transition probabilities, etc. (the model) that is "most likely" to have produced the output sequence s that we have observed: P(M | s).
- Bayes' formula for sequence "recognition": P(M | s) = P(s | M) P(M) / P(s).
- It is very hard to find this "absolute" probability: it depends on specific a priori probabilities we are unlikely to know.
- Instead, make it a "discrimination" problem: define a "null model" N and find P(M | s) / P(N | s).
- These issues are actually fairly general, and apply to the use of many kinds of probabilistic models, not just HMMs.
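
One step of algebra, spelled out here for clarity, shows why the discrimination form avoids the unknowable quantities (this is just Bayes' formula applied to both models, nothing beyond the slide's own assumptions). The P(s) denominator cancels in the ratio:

P(M | s) / P(N | s) = [P(s | M) P(M) / P(s)] / [P(s | N) P(N) / P(s)] = [P(s | M) / P(s | N)] * [P(M) / P(N)]

and taking logarithms gives

log [P(M | s) / P(N | s)] = log [P(s | M) / P(s | N)] + log [P(M) / P(N)].

The second term does not depend on s, so ranking sequences by the log-odds score log P(s | M) - log P(s | N) ranks them by posterior odds as well.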

Issues in Using HMMs
- Model architecture/topology
- Training
  - Selecting an appropriate training set
  - Finding an "optimal" HMM that fits that set
  - Must avoid overfitting
- Scoring (for HMM construction, sequence recognition)
  - How likely is it that our sequence was generated by our HMM?
  - Versus some "null model": this converts a very difficult recognition problem into a tractable discrimination problem
  - The score is the "relative likelihood" for our HMM
- Efficiency of evaluation
  - Pruning the search: dynamic programming
  - Using log-odds scores
- Model architecture: often a "linear" model is used, based on the process of aligning biological sequences with some chosen "consensus" sequence (but it must allow for gaps and insertions)

A Simple HMM for Some DNA Sequences
[Figure: tables of the transition probabilities (ai,j) for each state (Si) and of the emission probabilities (bi,k); not reproduced in this transcript.]
This shows "raw" probabilities; in actual algorithmic usage, "log-odds" values are usually used.
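
The slide's actual probability tables did not survive transcription, so the sketch below only shows the shape such a specification could take, with made-up numbers for a two-state DNA model; every value is an illustrative assumption.

```python
# Illustrative stand-in for a simple DNA HMM: a transition table a[i][j] and, for
# each emitting state, an emission distribution b[i][k] over the alphabet ACGT.

a = {
    "begin": {"s1": 1.0},
    "s1":    {"s1": 0.6, "s2": 0.3, "end": 0.1},
    "s2":    {"s1": 0.2, "s2": 0.7, "end": 0.1},
}

b = {
    "s1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # an A/T-rich state
    "s2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # a G/C-rich state
}

# Sanity check: every transition row and emission row must sum to 1.
for name, table in [("a", a), ("b", b)]:
    for state, row in table.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9, (name, state)
```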

HMMs and Multiple Alignments
- Can convert a multiple alignment into an HMM:
  - Create a node for each column in which most sequences have an aligned residue
  - Columns with many missing letters go to Insert states
  - Emission probabilities for Match states are computed from the relative frequencies in the alignment column, usually with the aid of a regularizer (to avoid zero-probability cases)
  - Emission probabilities for Insert states are taken from background frequencies
- Can also create a multiple alignment from a linear HMM:
  - Find the Viterbi (most likely) path through the HMM for each sequence
  - Each match state on that path creates a column in the alignment
  - Ignore letters from insert states, or show them in lower case
- Setting transition probabilities is equivalent to setting gap penalties in sequence alignment; this is more an art than a science.
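
As one concrete illustration of the emission-probability step, here is a minimal sketch that turns alignment columns into regularized emission distributions; the tiny alignment and the add-one pseudocount are assumptions made for the example, not values from the slides.

```python
from collections import Counter

# Derive column emission probabilities from a toy DNA alignment, using a simple
# fixed pseudocount as the regularizer (to avoid zero-probability cases).

alignment = [
    "ACG-T",
    "ACG-T",
    "AGGCT",
    "ACGAT",
]
alphabet = "ACGT"
pseudocount = 1.0   # add-one ("Laplace") regularization

def match_emissions(column):
    """Regularized emission distribution for one alignment column."""
    counts = Counter(c for c in column if c != "-")          # gaps are not emissions
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {x: (counts[x] + pseudocount) / total for x in alphabet}

columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
for i, col in enumerate(columns):
    print(i, match_emissions(col))
```

In a real profile HMM this would be applied only to the columns assigned to Match states (see the next slide), but the regularization step itself is the same.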

Protein Sequence Alignment

ALYW-------GHSFCAGSL
AIFAKHRRSPGERFLCGGIL
AIYRRHRG-GSVTYVCGGSL
AIFAQNRRSSGERFLCGGIL
ALFQGE------RLICGGVL
ALFIDD------MYFCGGSL
AIYHYS------SFQCGGVL
SLNS-------GSHFCGGSL

Three kinds of states: Match states, Insertion states, Deletion states.

Building a profile HMM from an alignment would be trivial if there were no gaps (indels), or only one insert state, in the aligned protein sequences. When we allow indels of variable lengths, a number of additional components are needed to model them. In the design of a profile HMM, one incorporates insertion and deletion states with prescribed properties. How do we assign residues and gaps to match, insert, and delete states?
1. Threshold method: a column with >= 50% gaps ('-') is assigned to an insertion state, and the residues in that column are labeled as inserts. The choice of threshold can be changed.
2. All other columns are match states. A gap ('-') in a match column is treated as a deletion.
The distinction is important when considering state transition probabilities: gaps in insertion states are ignored, but gaps in match states are relevant.
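
A minimal sketch of the threshold method just described, applied to the alignment shown on this slide; the 50% threshold follows the slide, the rest of the code is only illustrative.

```python
# Classify alignment columns: columns with at least 50% gaps become insert
# columns; all others become match columns, in which any gap counts as a deletion.

alignment = [
    "ALYW-------GHSFCAGSL",
    "AIFAKHRRSPGERFLCGGIL",
    "AIYRRHRG-GSVTYVCGGSL",
    "AIFAQNRRSSGERFLCGGIL",
    "ALFQGE------RLICGGVL",
    "ALFIDD------MYFCGGSL",
    "AIYHYS------SFQCGGVL",
    "SLNS-------GSHFCGGSL",
]
gap_threshold = 0.5

def classify_columns(seqs, threshold=gap_threshold):
    """Return a list of 'match' / 'insert' labels, one per alignment column."""
    n_seqs, n_cols = len(seqs), len(seqs[0])
    labels = []
    for j in range(n_cols):
        gaps = sum(1 for seq in seqs if seq[j] == "-")
        labels.append("insert" if gaps / n_seqs >= threshold else "match")
    return labels

labels = classify_columns(alignment)
print(labels)
print("match columns:", sum(1 for x in labels if x == "match"))
```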

Topology of Profile HMM

[Figure: standard display of a profile HMM with match states M1..M4 (squares), insert states I0..I4 (diamonds), and delete states D1..D4 (circles); not reproduced in this transcript.]

This is the standard display of a profile HMM; it is a linear 1st-order HMM. Each position ("node") in a sequence model contains three states: a Match, an Insertion, and a Deletion state. They may not all be active, depending on the specific model parameters. Match states are drawn as squares, Insertion states as diamonds, and Deletion states as circles.
Note some things here: by assigning a set of states to each position in the model sequence, we obtain the effect of position-dependent (but not neighbor- or history-dependent) transition probabilities. Also, assuming no insertion state has a looping probability of 1, this process will eventually reach the end state (with probability 1). There is not necessarily any finite bound on the length of the generated sequence, however.
Allowed transitions from node i:
Mi -> Mi+1, Mi -> Ii, Mi -> Di+1
Di -> Mi+1, Di -> Ii, Di -> Di+1
Ii -> Mi+1, Ii -> Ii, Ii -> Di+1
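
The allowed transitions listed above can be captured as a small data structure; the sketch below does so for a model with four nodes. The handling of the begin and end states is an assumption made for illustration, since the slide lists only the interior transitions.

```python
# Enumerate the transition structure of a linear profile HMM as a dictionary
# mapping each state to its allowed successors.

def profile_topology(n_nodes=4):
    """Allowed transitions of a profile HMM with n_nodes match positions."""
    edges = {"begin": ["M1", "I0", "D1"], "I0": ["I0", "M1", "D1"]}   # begin/I0 handling assumed
    for i in range(1, n_nodes + 1):
        nxt = ["end"] if i == n_nodes else [f"M{i+1}", f"D{i+1}"]
        edges[f"M{i}"] = nxt + [f"I{i}"]
        edges[f"D{i}"] = nxt + [f"I{i}"]
        edges[f"I{i}"] = nxt + [f"I{i}"]      # the self-loop allows insertions of any length
    return edges

for state, successors in profile_topology().items():
    print(state, "->", ", ".join(successors))
```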

Regularizers
- Used to avoid overfitting to the training set
- Substitution matrices
  - Identify more likely amino acid substitutions, reflecting biochemical similarities/differences
  - Fixed for all positions in a sequence; one value for a given pair of amino acids
- Pseudocounts
  - For protein sequences, typically based on observed (relative) frequencies of the various amino acids
  - "Universal" frequencies or position/type-dependent values
- Dirichlet mixtures
  - Probabilistic combinations of Dirichlet densities
  - Densities over probability distributions: i.e., the probability density of various distributions of symbols (in a given sequence position)
  - Used to generate data-dependent pseudocounts

Dirichlet mixtures are mixtures of Dirichlet densities, which jointly assign probabilities to all distributions of the symbol alphabet. For instance, a distribution over amino acids that gave tryptophan probability 0.5 and glycine probability 0.5 would probably be given low probability by a mixture of densities estimated on alignment columns; we simply don't see many distributions like that. However, since the alignments used to estimate these densities were of fairly close homologs, these mixtures give pure distributions, and mixtures of amino acids sharing common physico-chemical attributes, fairly high probability.
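
As a rough illustration of how a Dirichlet mixture yields data-dependent pseudocounts, the sketch below uses a 4-letter DNA alphabet and two made-up mixture components; real protein mixtures use 20 letters and components estimated from large alignment collections, so treat every number here as an assumption.

```python
import math

# Posterior-mean emission estimate for one alignment column under a (tiny)
# Dirichlet mixture prior: weight each component by how well it explains the
# observed counts, then blend the resulting pseudocount estimates.

alphabet = "ACGT"
mixture = [                                     # (mixture weight q_k, Dirichlet parameters alpha_k)
    (0.5, {"A": 2.0, "C": 0.5, "G": 0.5, "T": 2.0}),   # favors A/T-rich columns
    (0.5, {"A": 0.5, "C": 2.0, "G": 2.0, "T": 0.5}),   # favors G/C-rich columns
]

def log_marginal(counts, alpha):
    """log P(counts | alpha) under a Dirichlet-multinomial (multinomial coefficient omitted)."""
    A, N = sum(alpha.values()), sum(counts.values())
    out = math.lgamma(A) - math.lgamma(N + A)
    for x in alphabet:
        out += math.lgamma(counts[x] + alpha[x]) - math.lgamma(alpha[x])
    return out

def regularized_probs(counts):
    """Posterior-mean emission probabilities for one alignment column."""
    logs = [math.log(q) + log_marginal(counts, alpha) for q, alpha in mixture]
    m = max(logs)
    raw = [math.exp(v - m) for v in logs]
    weights = [w / sum(raw) for w in raw]       # responsibility of each mixture component
    N = sum(counts.values())
    return {x: sum(w * (counts[x] + alpha[x]) / (N + sum(alpha.values()))
                   for w, (q, alpha) in zip(weights, mixture))
            for x in alphabet}

column_counts = {"A": 5, "C": 0, "G": 1, "T": 4}            # counts from one alignment column
print(regularized_probs(column_counts))
```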

Algorithms for HMM Tasks (1)
Three major problems:
- Determine the HMM parameters (given some HMM topology and a training set)
- Calculate the (relative) likelihood of a given output sequence under a given HMM
- Find the optimal (most likely) path through a given HMM for a specific output sequence, and its (relative) likelihood

Forward/Backward Algorithm
- Used for determining the parameters of an HMM from training-set data
- Calculates the probability of going forward to a given state (from the initial state) while generating a prefix of the observed sequence, and of generating the remainder of that sequence (a member of the training set) from that state onward
- Iteratively adjusts the model parameters
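
A minimal sketch of the forward recursion on a toy two-state model; all parameters are illustrative assumptions, and production code would work in log space or with scaling to avoid underflow.

```python
# The forward algorithm sums the probabilities of all state paths that could have
# produced the observed sequence, one prefix at a time.

pi = {"s1": 0.6, "s2": 0.4}
a = {"s1": {"s1": 0.7, "s2": 0.3},
     "s2": {"s1": 0.4, "s2": 0.6}}
b = {"s1": {"A": 0.5, "C": 0.1, "G": 0.1, "T": 0.3},
     "s2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def forward(seq):
    """Return P(seq | model) by summing over all hidden state paths."""
    # f[i] = P(observations so far, current state = i)
    f = {i: pi[i] * b[i][seq[0]] for i in pi}
    for symbol in seq[1:]:
        f = {j: sum(f[i] * a[i][j] for i in f) * b[j][symbol] for j in a}
    return sum(f.values())

print(forward("ACGTGT"))
```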

Algorithms for HMM Tasks (2)
Baum-Welch (Expectation-Maximization, EM) Algorithm
- Often used to determine the HMM parameters
- Can also determine the most likely path for a (set of) output sequence(s)
- Adds up probabilities over all possible paths, then re-updates the parameters and iterates
- Cannot guarantee a global optimum; very expensive

Forward Algorithm
- Calculates the probability of a particular output sequence given the HMM
- Straightforward summation of products of (partial path) probabilities

Viterbi Algorithm
- Classical dynamic programming algorithm
- Chooses the "best" path (at each point), based on log-odds scores
- Saves the results of subproblems and re-uses them as part of higher-level evaluations
- More efficient than Baum-Welch
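
A matching sketch of the Viterbi algorithm, in log space, with a traceback via back pointers; the toy parameters repeat those of the forward sketch above and are again only assumptions.

```python
import math

# Viterbi keeps, for every state at each position, only the best-scoring path
# ending there, plus a back pointer used to recover that path afterwards.

pi = {"s1": 0.6, "s2": 0.4}
a = {"s1": {"s1": 0.7, "s2": 0.3},
     "s2": {"s1": 0.4, "s2": 0.6}}
b = {"s1": {"A": 0.5, "C": 0.1, "G": 0.1, "T": 0.3},
     "s2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(seq):
    """Return (log probability of the best path, the best state path)."""
    v = {i: math.log(pi[i]) + math.log(b[i][seq[0]]) for i in pi}
    back = []                                         # back[t][j] = best predecessor of j at step t
    for symbol in seq[1:]:
        ptr, nxt = {}, {}
        for j in a:
            best_i = max(v, key=lambda i: v[i] + math.log(a[i][j]))
            ptr[j] = best_i
            nxt[j] = v[best_i] + math.log(a[best_i][j]) + math.log(b[j][symbol])
        back.append(ptr)
        v = nxt
    # Traceback from the best final state.
    state = max(v, key=v.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return v[path[0]], list(reversed(path))

print(viterbi("ACGTGT"))
```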

HMMs for Protein/Gene Sequence Analysis
- Using any of various means, identify a set of related sequences with conserved regions
- Make 1st-order Markov assumptions: transitions are independent of sequence history and sequence content (other than at the substitution site itself)
- Construct an HMM based on the set of sequences
- Use this HMM to search for additional members of the family, possibly performing alignments
  - Search by comparing the fit to the HMM against the fit to some null model
- For phylogenetic trees, we are also concerned with the lengths of the paths involved and with shared intermediate states (sequences)
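
One way the search-by-discrimination step could look, assuming an i.i.d. background null model and a toy two-state HMM; the candidate sequences and the acceptance threshold are likewise made up.

```python
import math

# Score each candidate by log P(seq | family HMM) - log P(seq | null model) and
# keep those above a threshold.

pi = {"s1": 0.6, "s2": 0.4}
a = {"s1": {"s1": 0.7, "s2": 0.3},
     "s2": {"s1": 0.4, "s2": 0.6}}
b = {"s1": {"A": 0.5, "C": 0.1, "G": 0.1, "T": 0.3},
     "s2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # i.i.d. null model

def log_prob_model(seq):
    """log P(seq | HMM), computed with the forward recursion."""
    f = {i: pi[i] * b[i][seq[0]] for i in pi}
    for symbol in seq[1:]:
        f = {j: sum(f[i] * a[i][j] for i in f) * b[j][symbol] for j in a}
    return math.log(sum(f.values()))

def log_odds(seq):
    return log_prob_model(seq) - sum(math.log(background[x]) for x in seq)

candidates = ["ATATATTA", "GCGCCGGC", "ACGTACGT"]
threshold = 0.0
for seq in candidates:
    score = log_odds(seq)
    print(seq, round(score, 3), "member:", score > threshold)
```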

Very Simple Viterbi DNA Sequence Alignment
Score +1 for a match, -1 for a mismatch, and -2 for a "gap"; remember the "best" score at each step (a code sketch of the fill and traceback follows the next slide).
[Figure: dynamic-programming score matrix, not reproduced in this transcript; the slide shows min = -1099.]

Very Simple Viterbi: Traceback
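
A minimal sketch of the alignment computation from the last two slides: fill the score matrix with +1/-1/-2, remember the best move into each cell, then trace back to recover one best alignment. The example sequences are made up; the slides' own matrix is not reproduced here.

```python
# Simple global DNA alignment by dynamic programming, with traceback.

MATCH, MISMATCH, GAP = 1, -1, -2

def align(x, y):
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]       # remembers the best step into each cell
    for i in range(1, n + 1):
        score[i][0], move[i][0] = i * GAP, "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = j * GAP, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (MATCH if x[i-1] == y[j-1] else MISMATCH)
            up, left = score[i-1][j] + GAP, score[i][j-1] + GAP
            score[i][j], move[i][j] = max((diag, "diag"), (up, "up"), (left, "left"))
    # Traceback from the bottom-right corner.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            ax.append(x[i-1]); ay.append(y[j-1]); i, j = i - 1, j - 1
        elif step == "up":
            ax.append(x[i-1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j-1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))

print(align("GATTACA", "GCATGCT"))
```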

Onward to Mathematica!