Profile HMMs
Biology 162 Computational Genetics
Todd Vision
16 Sep 2004


Outline
Profile HMMs generate MSAs
States and transitions for
– Matches, insertions, deletions, silent and flanking states
Statistics
– Null model, E values
Training
– Model construction, weighting training sequences and including pseudocounts (which have a Bayesian interpretation)
Existing tools
– Interpro, including Pfam and HMMER

Globins
Helix
HBA_HUMAN   -DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL-
HBB_HUMAN   GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKL-
MYG_PHYCA   KHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKH-
LGB2_LUPLU  LK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG-
Consensus   .l.t....kHg.kV. a l. L..H. K.

Hidden Markov models
Observed sequence of symbols
Hidden sequence of underlying states
Transition probabilities govern transitions among states
Emission probabilities govern the likelihood of observing a symbol in a particular state

Profile HMMs
Use scores rather than emission probabilities directly

A PSSM as a simple HMM
[Diagram: begin → M_1 → … → M_i → M_{i+1} → … → M_L → end]
With emission probabilities unique to each match state
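The chain of match states above can be sketched as a list of per-position emission tables, with an ungapped sequence scored by summing per-position log-odds. This is a minimal illustration only: the three-state DNA model, the uniform background, and the name `pssm_log_odds` are assumptions made for the example, not values from the lecture.

```python
import math

# Hypothetical toy model: one emission distribution per match state M_1..M_3.
MATCH_EMISSIONS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
# Assumed uniform background (null) frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pssm_log_odds(seq, emissions=MATCH_EMISSIONS, background=BACKGROUND):
    """Log-odds score in bits for an ungapped sequence of length L.

    Every transition M_i -> M_{i+1} has probability 1 in this model,
    so only the emission terms contribute to the score.
    """
    if len(seq) != len(emissions):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log2(e[x] / background[x]) for e, x in zip(emissions, seq))
```

For example, `pssm_log_odds("AGT")` adds two positive terms for the favored residues and zero for the uninformative third column.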

But what about gaps?
Ignore them (BLOCKS database)
OR
Model them
– Insert states have background emission probabilities
[Diagram: begin → M_1 → … → M_i → … → M_L → end, with insert state I_i attached at M_i]

Gap scores
For an insert of length k with background emission probabilities, we have affine gap scores
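A short derivation of why the score is affine, written with the transition probabilities a_{MI} (match to insert), a_{II} (insert self-loop) and a_{IM} (insert to match); these symbol names are assumed here. Because insert emissions equal the background, every emission log-odds term is zero and only transitions remain:

```latex
\begin{aligned}
\text{score}(k) &= \log a_{MI} + (k-1)\,\log a_{II} + \log a_{IM} \\
                &= \underbrace{\log a_{MI} + \log a_{IM}}_{\text{gap open}}
                   + (k-1)\,\underbrace{\log a_{II}}_{\text{gap extend}}
\end{aligned}
```

The cost is linear in k with a fixed offset, which is the definition of an affine gap penalty.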

Length distribution of inserts
Geometric distribution
[Diagram: insert state I with transitions a_MI (match to insert), a_II (insert to itself), a_IM (insert to match)]
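A minimal sketch of the geometric length distribution implied by a single self-looping insert state. It assumes the only exit from the insert state is back to a match state, so that a_IM = 1 - a_II; the function names are illustrative.

```python
def insert_length_pmf(k, a_ii):
    """P(insert length = k), k >= 1, for self-loop probability a_ii.

    Assumes the insert state either loops (a_ii) or exits (1 - a_ii).
    """
    if k < 1:
        raise ValueError("insert length must be at least 1")
    if not 0.0 <= a_ii < 1.0:
        raise ValueError("a_ii must be in [0, 1)")
    return (a_ii ** (k - 1)) * (1.0 - a_ii)


def mean_insert_length(a_ii):
    """Expected insert length under the same geometric distribution."""
    return 1.0 / (1.0 - a_ii)
```

With `a_ii = 0.75`, for instance, the expected insert length is 4 residues, and short inserts are always more probable than long ones.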

Which columns are match states?
Options
– Assign columns to be match states by eye
– Heuristic, e.g. no more than 50% gaps per column
– Maximum a posteriori (MAP) model construction: an O(L^2) dynamic programming algorithm exists to find the model that optimizes the score on the training data
Helix
HBA_HUMAN   -DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL-
HBB_HUMAN   GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKL-
MYG_PHYCA   KHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKH-
LGB2_LUPLU  LK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG-
Consensus   .l.t....kHg.kV. a l. L..H. K.
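The gap-fraction heuristic can be sketched in a few lines. This is only an illustration of the second option above: the function name, the default threshold argument, and the set of gap characters are assumptions, and it does not implement MAP model construction.

```python
def match_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Return 0-based indices of columns treated as match states.

    A column is a match column when its fraction of gap characters
    does not exceed max_gap_fraction; other columns become inserts.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_rows = len(alignment)
    columns = []
    for j in range(length):
        gaps = sum(row[j] in gap_chars for row in alignment)
        if gaps / n_rows <= max_gap_fraction:
            columns.append(j)
    return columns
```

On the toy alignment `["AC-T", "A--T", "AG-T", "-GCT"]` the third column is 75% gaps, so it would be modeled by an insert state while the other three columns become match states.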

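Two ways to handle deletions
Transitions between match states
Silent deletion states (no emission)
[Diagram 1: begin → M_1 → … → M_i → … → M_L → end, with transitions that skip over match states]
[Diagram 2: begin → M_1 → M_2 → M_3 → end, with silent delete states D_1, D_2, D_3 alongside the match states]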

Profile HMM

Flanking states
Many sites in a sequence may be assigned to 'flanking' states (N, C, or J)
Transitions should force at least one match state to be traversed

Local or global alignment?
Are transitions allowed
– From start to internal match?
– From internal match to end?
Are there states that can emit sequences before and after the profile?
Do transitions allow the profile to be repeated?
In HMMs
– Global/local behavior is governed by the model, not the algorithm
– Behavior may differ w.r.t. the profile and the sequence

Null model
[Diagram: null model with states S, N, and T]
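The null model supplies the denominator of the log-odds score. As a sketch in assumed notation (x is the query sequence; the score is expressed in bits by taking the logarithm base 2):

```latex
S_{\text{bit}} = \log_2 \frac{P(x \mid \text{profile HMM})}{P(x \mid \text{null model})}
```

A positive score means the sequence is better explained by the profile than by background composition alone.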

Extreme value problems
How to convert the bit score S_bit to an expect value (E-value)?
Since alignment is not truly local, theory used for BLAST does not hold here
Solutions (both available in HMMER)
– Conservative approximation valid for any profile
– Empirically fit extreme value distribution using simulated sequences (must be done once for every profile HMM)

Why not always use full model?
The sum of probabilities is constrained to be one
Spreading probability among many paths decreases power to discriminate among them
You should always choose the most restrictive model (fewest transitions) consistent with your purpose

Which algorithm to use?
Three choices
– Viterbi: maximum likelihood path
– Forward: sum of probabilities of all possible paths
– Forward-backward: probability of each state at each position
For database search
– Query sequence against a database of profile HMMs
– Profile HMM against a database of sequences?
For alignment
– Adding new sequence(s) to an existing alignment
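The difference between the first two choices can be seen on a toy two-state HMM. This is a minimal sketch under stated assumptions: the states, probabilities, and function names are invented for illustration, the model has no silent delete states (a real profile HMM does), and plain probabilities are used where a practical implementation would work in log space to avoid underflow.

```python
# Hypothetical toy HMM: a match-like state "M" and an insert-like state "I".
STATES = ("M", "I")
START = {"M": 0.9, "I": 0.1}
TRANS = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.4, "I": 0.6}}
EMIT = {"M": {"A": 0.9, "C": 0.1}, "I": {"A": 0.5, "C": 0.5}}


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (probability, path) of the single most probable state path."""
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda r: v[-1][r] * trans[r][s])
            row[s] = v[-1][prev] * trans[prev][s] * emit[s][x]
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[-1][last], path[::-1]


def forward(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the total probability of obs summed over all state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())
```

The forward probability is never smaller than the Viterbi probability, since the best path is one of the paths being summed; the gap between them grows as probability is spread over more alternative paths.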

Training a profile HMM
Weighting training sequences
– We saw the same problem when scoring multiple alignments
– Same approaches are used for profile HMMs
Estimating transition probabilities
– Taken care of by MAP model construction
Estimating emission probabilities
– We will assume the alignment is correct
– Only issue is how to add pseudocounts

Better pseudocounts
Laplace's Rule
– Ignores background frequencies of residues
Background frequency pseudocounts
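The two schemes can be compared with a small sketch. The formulas used are common textbook forms and are assumed here: Laplace's rule adds one count per residue, while background pseudocounts add `weight * q_a` for a residue with background frequency `q_a`. The function names and the `weight` parameter are illustrative.

```python
def laplace_estimate(counts, alphabet):
    """Emission estimates with one pseudocount added per residue."""
    total = sum(counts.get(a, 0) for a in alphabet)
    return {a: (counts.get(a, 0) + 1) / (total + len(alphabet))
            for a in alphabet}


def background_estimate(counts, background, weight):
    """Emission estimates with pseudocounts proportional to background.

    `background` maps residue -> frequency (summing to 1); `weight` is
    the total number of pseudocounts added to the column.
    """
    total = sum(counts.get(a, 0) for a in background)
    return {a: (counts.get(a, 0) + weight * q) / (total + weight)
            for a, q in background.items()}
```

With a uniform background and `weight` equal to the alphabet size the two estimates coincide; they differ as soon as the background is non-uniform, because unobserved but common residues then receive more probability than unobserved rare ones.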

Pseudocounts as Bayesian priors
Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D)
– P(θ | D): posterior
– P(θ): prior
– P(D | θ): likelihood
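To make the link to pseudocounts concrete, here is a sketch in assumed notation: θ are the emission probabilities of one column, D is the observed data summarized by residue counts c_a, and the prior is a Dirichlet distribution with parameters α_a. With a multinomial likelihood, the posterior mean estimate is:

```latex
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta), \qquad
\theta \sim \mathrm{Dirichlet}(\alpha)
\;\Rightarrow\;
\hat{\theta}_a = \frac{c_a + \alpha_a}{\sum_b \left(c_b + \alpha_b\right)}
```

The prior parameters α_a therefore act exactly like pseudocounts added to the observed counts.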

Dirichlet mixture pseudocounts
Background probabilities are not uniform throughout the protein
– e.g. exposed loops (hydrophilic residues abundant) vs. buried core (small side chains abundant)
Different sets of pseudocount priors (Dirichlet distributions) for each environment
Pseudocounts for I i are determined by a mixture of Dirichlet distributions fit to position i

Evolutionary pseudocounts
Related to phylogenetic methods we will see later
– Calculate probability of each residue having been the common ancestor of the residues in a column
– Calculate probability of each residue as a descendant
– Use these probabilities as priors with appropriate weighting
Requires use of a position-independent scoring matrix (e.g. PAM)

Queries vs. subjects
Two directions of search are possible
– Sequence query against database of profile HMMs
– Profile HMM against a database of sequences
Bit scores will be the same regardless
But E-values will differ
– Search space (i.e. number of subjects in database) can differ considerably
– It is usually more sensitive to search a database of profile HMMs
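The dependence of the E-value on search space can be sketched with the usual relation E = N × P, where P is the probability of a score at least as high by chance against one subject and N is the number of subjects. The function name is illustrative, and the database sizes in the usage note are hypothetical.

```python
def e_value(p_value, n_subjects):
    """Expected number of chance hits scoring at least as well.

    p_value: chance probability of the score against a single subject.
    n_subjects: number of subjects (profiles or sequences) searched.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be a probability")
    if n_subjects < 0:
        raise ValueError("n_subjects must be non-negative")
    return p_value * n_subjects
```

For the same bit score and per-comparison P of 1e-6, a database of a few thousand profile HMMs gives an E-value well below 0.01, while a hypothetical database of a million sequences gives an E-value near 1, which is one way to see why the first direction is usually more sensitive.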

Interpro
Regular expressions
– PROSITE
PSSMs, other motifs
– PROSITE, PRINTS, PRODOM
Profile HMMs
– Pfam
– SMART
– TIGRFAMs
– PIR SuperFamily
– SUPERFAMILY

Interpro v8

Pfam
A profile HMM database
– Based on Swissprot and TREMBL
Current version (v15) has 7503 families
– ~75% of all new protein sequences match an existing Pfam profile
Profiles constructed semi-automatically
– New families identified
– Seed alignment manually optimized
– Profile HMM constructed
– All matching sequences aligned to HMM

HMMER
Used in construction of Pfam
– Can build a profile (with MAP algorithm)
– Can search a sequence against a profile and vice versa (i.e. with forward algorithm)
– Can add new sequences to an alignment (via Viterbi)
– Uses Plan 7 profiles
User sets the local/global behavior

HMMER2.0 [2.3.1]
NAME  fn3
ACC   PF
DESC  Fibronectin type III domain
LENG  84
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   hmmbuild -F HMM_ls.ann SEED.ann
COM   hmmcalibrate --seed 0 HMM_ls.ann
NSEQ  108
DATE  Mon Jul 26 14:10:
CKSUM 1153
GA
TC
NC
XT
NULT
NULE
EVD
HMM   A C D E F G H I K L M N P Q R S T V W Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -13 * *

Summary
Profile HMMs generate MSAs
States and transitions for
– Matches, Insertions (which can model affine gaps), Deletions, which allow local alignment, Silent states and flanking states
Statistics
– Scored relative to a null model and E values must be determined empirically
Training
– MAP model construction, training sequence weighting and pseudocounts (which have a Bayesian interpretation)
Existing tools
– Interpro, including Pfam and HMMER

Assignment
Look over study guide
– Posted on Blackboard
Turn in lab/problem set on Tuesday
Midterm on Thursday