Introduction to Profile HMMs

Slides:



Advertisements
Similar presentations
Hidden Markov Model in Biological Sequence Analysis – Part 2
Advertisements

HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Profile HMMs Tandy Warnow BioE/CS 598AGB. Profile Hidden Markov Models Basic tool in sequence analysis Look more complicated than they really are Used.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
Hidden Markov Models Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.
Patterns, Profiles, and Multiple Alignment.
Hidden Markov Models: Applications in Bioinformatics Gleb Haynatzki, Ph.D. Creighton University March 31, 2003.
Hidden Markov Models Ellen Walker Bioinformatics Hiram College, 2008.
Profiles for Sequences
JM - 1 Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models Jarek Meller Jarek Meller Division.
Using PFAM database’s profile HMMs in MATLAB Bioinformatics Toolkit Presentation by: Athina Ropodi University of Athens- Information Technology in Medicine.
درس بیوانفورماتیک December 2013 مدل ‌ مخفی مارکوف و تعمیم ‌ های آن به نام خدا.
Biochemistry and Molecular Genetics Computational Bioscience Program Consortium for Comparative Genomics University of Colorado School of Medicine
Hidden Markov Models Pairwise Alignments. Hidden Markov Models Finite state automata with multiple states as a convenient description of complex dynamic.
Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 14: Introduction to Hidden Markov Models Martin Russell.
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT. 2 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs.
. Sequence Alignment via HMM Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
Hidden Markov Models: an Introduction by Rachel Karchin.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT
. Class 5: HMMs and Profile HMMs. Review of HMM u Hidden Markov Models l Probabilistic models of sequences u Consist of two parts: l Hidden states These.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Genome evolution: a sequence-centric approach Lecture 3: From Trees to HMMs.
Elze de Groot1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Profile Hidden Markov Models PHMM 1 Mark Stamp. Hidden Markov Models  Here, we assume you know about HMMs o If not, see “A revealing introduction to.
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Probabilistic Sequence Alignment BMI 877 Colin Dewey February 25, 2014.
Introduction to Profile Hidden Markov Models
Hidden Markov Models As used to summarize multiple sequence alignments, and score new sequences.
CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
The dynamic nature of the proteome
Protein Sequence Alignment and Database Searching.
Hidden Markov Models for Sequence Analysis 4
String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
PHMMs for Metamorphic Detection Mark Stamp 1PHMMs for Metamorphic Detection.
PGM 2003/04 Tirgul 2 Hidden Markov Models. Introduction Hidden Markov Models (HMM) are one of the most common form of probabilistic graphical models,
Protein Domain Database
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
CZ5226: Advanced Bioinformatics Lecture 6: HHM Method for generating motifs Prof. Chen Yu Zong Tel:
Multiple alignment using hidden Markove models November 21, 2001 Kim Hye Jin Intelligent Multimedia Lab
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
(H)MMs in gene prediction and similarity searches.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Profile Hidden Markov Models PHMM 1 Mark Stamp. Hidden Markov Models  Here, we assume you know about HMMs o If not, see “A revealing introduction to.
Hidden Markov Models BMI/CS 576
Free for Academic Use. Jianlin Cheng.
Genome Annotation (protein coding genes)
INTRODUCTION TO BIOINFORMATICS
The ideal approach is simultaneous alignment and tree estimation.
Gil McVean Department of Statistics, Oxford
Learning Sequence Motif Models Using Expectation Maximization (EM)
Pfam: multiple sequence alignments and HMM-profiles of protein domains
Hidden Markov Models Part 2: Algorithms
Hidden Markov Models (HMMs)
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
Lecture 13: Hidden Markov Models and applications
Applying principles of computer science in a biological context
Presentation transcript:

Introduction to Profile HMMs Tandy Warnow

Profile Hidden Markov Models Basic tool in sequence analysis Look more complicated than they really are Used to model a family of sequences Can be built from a multiple sequence alignment Algorithms using profile HMMs are based on dynamic programming (much like Needleman-Wunsch)

Profile Given a gap-less multiple sequence alignment, we can build a profile describing what we see: S1 = A C T A G S2 = A C A A G S3 = A T T T G S4 = G T T C G 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0

Dealing with zero-probability emissions Note that our profiles had some letters that had zero probability. Most profile HMMs don’t allow zero-probability emissions for letters in the alphabet. One way of dealing with this is the “add-one” rule (see Mona Singh’s lecture), but there are others. When you don’t modify your empirical distribution at all to create the emission probabilities, this is called an “unadjusted profile”.

Using a profile 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 0.75 0.0 0.25 0.0 0.5 0.25 0.0 0.75 0.50 0.25 0.0 0.0 1.0 End 1.0 1.0 1.0 1.0 1.0 1.0 Start       End The profile yields a probability distribution of sequences – here, all of the same length.

What is the probability of generating s = ATATG? 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 0.75 0.0 0.25 0.0 0.5 0.25 0.0 0.75 0.50 0.25 0.0 0.0 1.0 End 1.0 1.0 1.0 1.0 1.0 1.0 Start       End The profile yields a probability distribution of sequences – here, all of the same length.

What is the probability of generating s = AAAAA? 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 0.75 0.0 0.25 0.0 0.5 0.25 0.0 0.75 0.50 0.25 0.0 0.0 1.0 End 1.0 1.0 1.0 1.0 1.0 1.0 Start       End The profile yields a probability distribution of sequences – here, all of the same length.

What are the most probable sequences for this model? 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 0.75 0.0 0.25 0.0 0.5 0.25 0.0 0.75 0.50 0.25 0.0 0.0 1.0 End 1.0 1.0 1.0 1.0 1.0 1.0 Start       End The profile yields a probability distribution of sequences – here, all of the same length.

How would you compute the probability of generating a sequence s? 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 0.75 0.0 0.25 0.0 0.5 0.25 0.0 0.75 0.50 0.25 0.0 0.0 1.0 End 1.0 1.0 1.0 1.0 1.0 1.0 Start       End The profile yields a probability distribution of sequences – here, all of the same length.

Notes Each profile generates strings of exactly the same length. This is not very useful for dealing with biological data. If you are given a sequence that is generated by the profile, then you know exactly what path was used to generate the sequence. That is, none of the states are hidden.

Insertions and Deletions (Indels) Insertions: events that increase the sequence length, such as: AAAAA -> AATCGAAA AAAAA -> AAAAATCGATTA Deletions: events that decrease the sequence length, such as: ATCGA -> AGA AAAAA -> AA

Allowing for sequence length heterogeneity The profile shown in the previous slides only had match states (indicated by rectangles). It doesn’t allow any insertions or deletions. To model indels, we just have to add additional states to the graphical model. Insertion states: Diamonds (have non-zero emission probabilities) Deletion states: Circles (nothing emitted)

From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

How many paths can generate s = AAA? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

How many paths can generate s = AA? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

How many paths can generate the empty string? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

Can you find a string for which only one path can generate it? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

A more general topology for a profile HMM From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

What has changed between this model and the previous one? From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

HMMER http://hmmer.janelia.org One of the most popular collection of tools to perform analyses based on profile HMMs. HMMER web server: interactive sequence similarity searching, NAR 2011, http://nar.oxfordjournals.org/content/39/suppl_2/W29 See PFAM, http://pfam.xfam.org, for how profile HMMs are used to represent groups of functionally and structurally related proteins.

Building Profile HMMs Profile HMMs can be built from a given multiple sequence alignment – this is not too difficult. Profile HMMs can also be built from unaligned sequences. This is a bit complicated, and often uses the Baum-Welch algorithm.

Using Profile HMMs Given a Profile HMM computed for a multiple sequence alignment, you can use it to Classify new sequences into families (e.g., protein families and superfamilies) Infer function of new sequences Add related sequences into the multiple sequence alignment Compute multiple sequence alignments for groups of related sequences

Using profile HMMs for Protein Family Classification Given two profile HMMs (H1 and H2), and a sequence s, you can determine which one is more likely to generate s using dynamic programming. Computing Pr(s|H) can be done using dynamic programming (the “forward algorithm”). Given a set of protein families, compute profile HMMs for each, and then find the family whose profile HMM is most likely to generate s.

What is the probability of generating s = AAA? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

What paths can generate s = AAA? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

For each path that can generate s = AAA, what is its probability? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

Computing Pr(AAA|H) is not trivial From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

What about Pr(A|H)? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

Hidden states For some sequences, there is no path through the profile HMM that can generate the sequence. For some other sequences, there is exactly one path. And then for some others, there is more than one path… and you can’t tell which one generated the sequence! More generally, you can’t tell which state generated each letter in the sequence – so the states are said to be “hidden”.

What path is most likely to generate sequence AA? From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

Can we compute the maximum probability path for the general case? From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html