CSE182-L9 Modeling Protein domains using HMMs

Profiles Revisited
Profiles are a powerful way of capturing domain information:
Pr(sequence x | profile P) = Pr(x_1|P_1) · Pr(x_2|P_2) · …
Ex: Pr(AGCTTGTA | P) = 0.9 · 0.2 · 0.7 · 0.4 · 0.3 · …
Why do we want a different formalism?
(Figure: a profile matrix with one column per position and rows A, C, G, T.)
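A minimal sketch (not from the lecture; the profile values and sequence below are made up) of scoring a sequence against a position-specific profile, in Python:

# Minimal sketch: score a sequence against a position-specific profile.
def profile_prob(x, profile):
    """Pr(x | P) = product over positions i of Pr(x[i] | P_i).
    profile: one dict per position, mapping residue -> probability."""
    p = 1.0
    for residue, column in zip(x, profile):
        p *= column.get(residue, 0.0)
    return p

# Example profile of length 3 over {A, C, G, T} (invented numbers).
P = [
    {'A': 0.9, 'C': 0.05, 'G': 0.03, 'T': 0.02},
    {'A': 0.1, 'C': 0.2,  'G': 0.6,  'T': 0.1},
    {'A': 0.1, 'C': 0.7,  'G': 0.1,  'T': 0.1},
]
print(profile_prob("AGC", P))  # 0.9 * 0.6 * 0.7 = 0.378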

Profiles might need to be extended
A domain might only be a part of a sequence, and we need to identify and score that part:
–Pr[x|P] = max_i Pr[x_i … x_{i+l} | P]
A sequence containing a domain might contain indels in the domain.
–EX: AACTTCGGA matched against the profile over {A, C, G, T}.
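A hedged sketch of the ungapped sliding-window version, which scores every length-l window of x against the profile and keeps the best; it reuses profile_prob and the made-up profile P from the sketch above:

def best_window_score(x, profile):
    """max over start positions i of Pr(x[i : i+l] | P), where l = len(profile)."""
    l = len(profile)
    best, best_start = 0.0, None
    for i in range(len(x) - l + 1):
        p = profile_prob(x[i:i + l], profile)
        if p > best:
            best, best_start = p, i
    return best, best_start

print(best_window_score("TTAGCAA", P))  # best-scoring length-3 window and its start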

Profiles & Indels
Indels can be allowed when matching sequences to a profile. However, how do we construct the profile in the first place?
–Without indels, multiple alignment of the sequences is trivial.
–Multiple alignment in the presence of indels is much trickier; similarly, construction of the profile is harder.
–The HMM formalism allows you to do both: alignment of a single sequence to the HMM, and construction of the HMM given unaligned sequences.

Profiles and Compositional Signals
While most protein domains have position-dependent properties, other biological signals do not. Ex:
1. CpG islands
2. Signals for protein targeting, transmembrane regions, etc.

Protein targeting
Proteins act in specific compartments of the cell, often distinct from where they are 'manufactured'. How does the cellular machinery know where to send each protein?

Protein targeting
In 1970, Günter Blobel showed that proteins have an N-terminal signal sequence which directs them to the membrane. Proteins also have to be transported to other organelles: nucleus, mitochondria, …
Can we computationally identify the 'signal' that distinguishes the cellular compartment?

Predicting Transmembrane Regions
For transmembrane proteins, can we predict the transmembrane, outer, and inner regions?
(Figure: a membrane protein with its transmembrane, outer, and inner segments labeled.)
These signals are of varying length and depend on compositional, rather than position-dependent, properties.

The answer is HMMs
Certainly, it is not the only answer! Many other formalisms have been proposed, such as neural networks, with varying degrees of success.

Revisiting automata
Recall that profiles were introduced as a generalization of automata: the OR operator is replaced by a distribution. However, the * operator (Kleene closure) has no direct analog. Instead, we introduce probabilities directly into automata.

Profile HMMs
Problem: Given a sequence x and a profile P of length l,
–compute Pr[x|P] = max_i Pr[x_i … x_{i+l} | P]
Construct an automaton with a state (node) for every column, plus extra states at the beginning and end. In each step, the automaton emits a symbol and moves to a new state. Unlike finite automata, the emissions and transitions are governed by probabilistic rules.

Emission probabilities for states
Each state π emits a base (residue) with a probability defined by the profile.
Thus Pr(A|s_1) = 0.9, Pr(T|s_4) = 0.4, …
(Figure: the profile HMM with states s_0, s_1, …, s_9, one per profile column over {A, C, G, T}.)

Transition probabilities
The probability of emitting a base depends upon the state. An initial state π_0 is defined. We move to subsequent states using probabilistic transitions.
EX: Pr(s_0 → s_1) = Pr(s_0 → s_0) = 0.5, Pr(s_1 → s_2) = …

Emission of a sequence
What is the probability that the automaton emitted the sequence x_1 … x_n?
The problem becomes easier if we know the sequence of states π = π_0 … π_n that the automaton was in.

Computing Pr(x|π)
Example: Consider the sequence GGGAACTTGTACCC
–x = G G G A A C T T G T A C C C
–π = … (the state at each step)
Pr(x|π) is the product of the emission probabilities Pr(x_i|π_i) and the transition probabilities Pr(π_i → π_{i+1}) along the path.
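As a sketch (the dictionary layout for A and E is an assumption, not the lecture's notation), Pr(x|π) can be computed in log space as the product of the transition and emission probabilities along the known path:

import math

def path_log_prob(x, path, A, E, start_state=0):
    """log Pr(x, path) = sum over steps of log A[(prev, cur)] + log E[cur][x_i].
    A: dict (prev_state, cur_state) -> transition probability.
    E: dict state -> {residue: emission probability}.
    path: one state per emitted symbol of x."""
    logp = 0.0
    prev = start_state
    for residue, state in zip(x, path):
        logp += math.log(A[(prev, state)]) + math.log(E[state][residue])
        prev = state
    return logp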

HMM: Formal definition
M = (Σ, Q, A, E), where
Σ: the alphabet (EX: {A, C, G, T})
Q: set of states, with an initial state q_0 and, perhaps, a final state (EX: {s_0, …, s_9})
A: transition probabilities, A[i,j] = Pr(s_i → s_j) (EX: A[0,1] = 0.5)
E: emission probabilities, e_i(b) = Pr(b|s_i) (EX: e_3(T) = 0.7)
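One possible way to hold M = (Σ, Q, A, E) in code; a sketch with a toy 3-state model and invented probabilities (not the lecture's profile HMM), using dense matrices indexed by state number with state 0 as q_0:

from dataclasses import dataclass

@dataclass
class HMM:
    alphabet: list   # Σ, e.g. ['A', 'C', 'G', 'T']
    n_states: int    # |Q|; state 0 is the initial state q_0
    A: list          # A[i][j] = Pr(s_i -> s_j)
    E: list          # E[i][b] = Pr(alphabet[b] | s_i)

toy = HMM(
    alphabet=['A', 'C', 'G', 'T'],
    n_states=3,
    A=[[0.5, 0.5, 0.0],
       [0.0, 0.5, 0.5],
       [0.0, 0.0, 1.0]],
    E=[[0.25, 0.25, 0.25, 0.25],
       [0.9,  0.05, 0.03, 0.02],
       [0.1,  0.7,  0.1,  0.1]],
)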

Pr(x|M)
Given the state sequence π, we know how to compute Pr(x|π). We would like to compute
–Pr(x|M) = max_π Pr(x|π)
Let |x| = n and |Q| = m, with s_m being the final state. As is common in dynamic programming, we consider the probability of generating a prefix of x.
Define v(i,j) = probability of generating x_1 … x_i and ending in state s_j.
Is it sufficient to compute v(i,j) for all i, j?
–Yes, because Pr(x|M) = v(n,m)

Computing v(i,j)
If the previous state (at time i-1) was s_l,
–then v(i,j) = v(i-1,l) · A[l,j] · e_j(x_i)
The previous state must be one of the m states. Therefore
v(i,j) = max_{l ∈ Q} { v(i-1,l) · A[l,j] } · e_j(x_i)

The Viterbi Algorithm
v(i,j) = max_{l ∈ Q} { v(i-1,l) · A[l,j] } · e_j(x_i)
We take logarithms (and an approximation) to maintain precision:
–S(i,j) = log(e_j(x_i)) + max_{l ∈ Q} { S(i-1,l) + log(A[l,j]) }
The Viterbi algorithm:
For i = 1..n
–For j = 1..m
  S(i,j) = log(e_j(x_i)) + max_{l ∈ Q} { S(i-1,l) + log(A[l,j]) }
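A hedged implementation sketch of this recurrence in log space, under the dense-matrix layout assumed in the HMM sketch above (state 0 is q_0, emissions indexed by alphabet position); it returns the best log score and the traceback of states:

import math

NEG_INF = float('-inf')

def viterbi(x, alphabet, A, E):
    """S(i,j) = log e_j(x_i) + max_l { S(i-1,l) + log A[l][j] }, with traceback."""
    m, n = len(A), len(x)
    sym = [alphabet.index(c) for c in x]

    def lg(p):
        return math.log(p) if p > 0 else NEG_INF

    # S[i][j]: best log-probability of emitting x[0..i-1] and ending in state j.
    S = [[NEG_INF] * m for _ in range(n + 1)]
    back = [[0] * m for _ in range(n + 1)]
    S[0][0] = 0.0  # start in q_0 before emitting anything

    for i in range(1, n + 1):
        for j in range(m):
            best_l = max(range(m), key=lambda l: S[i - 1][l] + lg(A[l][j]))
            S[i][j] = lg(E[j][sym[i - 1]]) + S[i - 1][best_l] + lg(A[best_l][j])
            back[i][j] = best_l

    # Best final state, then trace the state path backwards.
    j_final = max(range(m), key=lambda l: S[n][l])
    path, j = [], j_final
    for i in range(n, 0, -1):
        path.append(j)
        j = back[i][j]
    path.reverse()
    return S[n][j_final], path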

Probability of being in specific states
What is the probability that we were in state k at step i?
Pr[state k at step i | x] = Pr[all paths that pass through state k at step i and emit x] / Pr[all paths that emit x]
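The denominator Pr[x] is the total probability over all state paths; it can be computed with the forward algorithm, which is the Viterbi recurrence with max replaced by a sum. A minimal sketch, under the same dense-matrix assumptions as above (no rescaling, so it will underflow on long sequences):

def forward_prob(x, alphabet, A, E):
    """Pr(x | M): f(i,j) = e_j(x_i) * sum_l f(i-1,l) * A[l][j], summed over end states."""
    m = len(A)
    sym = [alphabet.index(c) for c in x]
    f = [1.0 if j == 0 else 0.0 for j in range(m)]  # start in q_0
    for b in sym:
        f = [E[j][b] * sum(f[l] * A[l][j] for l in range(m)) for j in range(m)]
    return sum(f)

The numerator combines a forward pass up to step i with a symmetric backward pass from step i (the forward-backward idea).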