Profile HMMs for sequence families and Viterbi equations Linda Muselaars and Miranda Stobbe.

2 Example alignment
HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-

3 Overview chapter 5: Ungapped score matrices. Adding insert and delete states to obtain profile HMMs. Deriving profile HMMs from multiple alignments. Searching with profile HMMs. Profile HMM variants for non-global alignments. More on estimation of probabilities. Optimal model construction. Weighting training sequences.

5 Key issues: Identifying the relationship of an individual sequence to a sequence family. How to build a profile HMM. Using profile HMMs to detect potential membership in a family. Using profile HMMs to give an alignment of a sequence to the family.

6 Key issues (2): Lollipops for a valuable (up to the speakers to decide) contribution to this lecture.

7 Needed theory: Emission probabilities. Silent states. Pair HMMs. The Viterbi algorithm. The Forward algorithm.

8 Contents: Ungapped score matrices. Adding insert and delete states to obtain profile HMMs. Deriving profile HMMs from multiple alignments (non-probabilistic profiles; basic profile HMM parameterisation). Searching with profile HMMs. Profile HMM variants for non-global alignments.

9 Example alignment (the asterisks mark the 21 gap-free columns)
HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-
               *********************

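10 Ungapped regions: Gaps tend to line up, so we can first consider models for the ungapped regions. Specify independent probabilities e_i(a) of observing residue a in position i. As usual, what we score with is the log-odds ratio log(e_i(a) / q_a) against the background distribution q_a. The resulting table of scores is a position-specific score matrix (PSSM).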
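
As a minimal sketch of the position-specific score matrix idea, the Python below estimates e_i(a) for each column of an ungapped block and scores a window by summing log(e_i(a) / q_a). The function names, the toy DNA block, the uniform background and the pseudocount of 1 are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def build_pssm(block, alphabet, background, pseudocount=1.0):
    """Return pssm[i][a] = log(e_i(a) / q_a) for an ungapped block of equal-length sequences."""
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("block must be rectangular (ungapped, equal lengths)")
    total = len(block) + pseudocount * len(alphabet)
    pssm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in block)
        # e_i(a): smoothed relative frequency of residue a in column i (assumed estimator)
        pssm.append({a: math.log(((counts[a] + pseudocount) / total) / background[a])
                     for a in alphabet})
    return pssm

def score(pssm, window):
    """Sum of per-position log-odds scores for a window as long as the PSSM."""
    return sum(column[a] for column, a in zip(pssm, window))

alphabet = "ACGT"
background = {a: 0.25 for a in alphabet}   # hypothetical uniform background q_a
block = ["ACGT", "ACGA", "ACCT"]           # hypothetical ungapped block
pssm = build_pssm(block, alphabet, background)
```

A window resembling the block, such as `score(pssm, "ACGT")`, scores higher than an unrelated one such as `score(pssm, "TTTT")`, because each column rewards the residues that were frequent at that position.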

11 Drawbacks: Multiple alignments do have gaps, and these need to be accounted for. One approach, used for example by the BLOCKS database, is to combine the scores of several ungapped regions. Instead, we will develop a single probabilistic model for the whole extent of the alignment.

13 Short review: Emission probability e_k(a): the probability that symbol a is seen when in state k. Silent states: states in an HMM that do not emit symbols.

14 Building the model (1): We need position-sensitive gap scores. Start from an HMM with a repetitive structure of match states M_j. Transitions between consecutive match states have probability 1. Emission probabilities: e_{M_i}(a). [Figure: Begin → ... → M_j → ... → End]

15 Building the model (2): To deal with insertions, add a set of new states I_i. Each I_i has an emission distribution e_{I_i}(a), normally set to the background distribution q_a. [Figure: Begin, M_j and End, with insert state I_j]

16 Building the model (3): To deal with deletions, forward jumps between match states are possible, but to allow arbitrarily long gaps we use silent states D_j. [Figure: Begin, M_j and End, with delete state D_j]

17 Costs for additional states: An insertion costs the sum of the costs of the transitions and emissions involved: one M→I transition, a number of I→I transitions and one I→M transition. A deletion costs the sum of an M→D transition, a number of D→D transitions and a D→M transition.
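
The cost bookkeeping above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the function names are hypothetical, and a single value is used for each transition type, whereas in a profile HMM every transition probability is position specific. When the insert emissions equal the background q_a, their log-odds contribution is zero, which is the default here.

```python
import math

def insertion_log_score(n, a_MI, a_II, a_IM, insert_log_odds=0.0):
    """Log-score of inserting n >= 1 residues: M->I, (n - 1) times I->I, then I->M."""
    if n < 1:
        raise ValueError("an insertion has at least one residue")
    transitions = math.log(a_MI) + (n - 1) * math.log(a_II) + math.log(a_IM)
    # each inserted residue adds log(e_I(a) / q_a), which is 0 when e_I = q
    return transitions + n * insert_log_odds

def deletion_log_score(n, a_MD, a_DD, a_DM):
    """Log-score of skipping n >= 1 match states: M->D, (n - 1) times D->D, then D->M."""
    if n < 1:
        raise ValueError("a deletion skips at least one match state")
    return math.log(a_MD) + (n - 1) * math.log(a_DD) + math.log(a_DM)
```

Each extra inserted residue adds log a_II and each extra skipped match state adds log a_DD, so in this simplified form the gap score is affine in the gap length.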

18 Full model [Figure: profile HMM with Begin, M_j, I_j, D_j and End]

19 Comparison with pair HMM [Figure: pair HMM with Begin and End, a match state M emitting p_{x_i y_j}, and insert states X and Y emitting q_{x_i} and q_{y_j}]

21 Non-probabilistic profiles: Profiles without an underlying probabilistic model. Scores are set to averages of standard substitution scores. Anomalies: conservation of columns is not taken into account, and scores for gaps do not behave properly.

22 Example
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
Column 1 contains five V, one F and one I, so the score for residue a in column 1 would be set to: (5/7) s(V,a) + (1/7) s(F,a) + (1/7) s(I,a), where s is the standard substitution score.

23 Basic profile HMM parameterisation: Objective: make the probability distribution peak around members of the family. Available parameters: the length of the model, and the transition and emission probabilities.

24 Length of the model: Which multiple alignment columns do we assign to match states, and which to insert states? Heuristic rule: columns in which more than 50% of the characters are gaps should be modelled by insert states.
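
The heuristic rule is easy to state in code. The sketch below is only an illustration: `assign_columns` is a hypothetical helper, and the input is the five-sequence toy DNA alignment (Bat, Rat, Cat, Gnat, Goat) that these slides use as their worked example.

```python
def assign_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Label each alignment column 'match' or 'insert' using the more-than-50%-gaps rule."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    labels = []
    for c in range(n_cols):
        gaps = sum(1 for row in alignment if row[c] == gap)
        # strictly more than half gaps: model the column with an insert state
        labels.append("insert" if gaps / n_rows > max_gap_fraction else "match")
    return labels

alignment = ["AG---C",   # Bat
             "A-AG-C",   # Rat
             "AG-AA-",   # Cat
             "--AAAC",   # Gnat
             "AG---C"]   # Goat
labels = assign_columns(alignment)
match_columns = [c + 1 for c, label in enumerate(labels) if label == "match"]
```

With this input, columns 3 and 5 have three gaps out of five and become insert columns, which leaves four match columns and therefore a model of length 4.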

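25 Probability parameters: Transition probability: a_{kl} = A_{kl} / \sum_{l'} A_{kl'}, where A_{kl} is the number of transitions from state k to state l and the denominator is the number of transitions from state k to any state. Emission probability: e_k(a) = E_k(a) / \sum_{a'} E_k(a'), where E_k(a) is the number of times a is emitted from state k. In the limit of many sequences these are accurate and consistent estimates. Pseudocount method: Laplace's rule (add one to each count).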
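
A small sketch of the counting estimate with Laplace's rule follows. The helper name `laplace_estimate` is hypothetical; the example counts (three M→M and one M→D transition out of the first match state, no M→I) are what the five-sequence toy alignment in these slides gives when its columns 1, 2, 4 and 6 are taken as match columns.

```python
from fractions import Fraction

def laplace_estimate(counts, outcomes, pseudocount=1):
    """Turn observed counts into probabilities, adding a pseudocount to every outcome."""
    total = sum(counts.get(o, 0) for o in outcomes) + pseudocount * len(outcomes)
    return {o: Fraction(counts.get(o, 0) + pseudocount, total) for o in outcomes}

# Transitions out of M1 in the toy example: 3 x M->M, 1 x M->D, 0 x M->I (derived counts)
transitions_m1 = laplace_estimate({"M": 3, "D": 1}, ["M", "D", "I"])
```

Exact fractions are used so the result can be compared directly with hand calculations; here it gives 4/7, 2/7 and 1/7 for the M, D and I transitions.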

26 Example
Bat   AG---C
Rat   A-AG-C
Cat   AG-AA-
Gnat  --AAAC
Goat  AG---C
      ** * *

27 Example continued [Figure: profile HMM with Begin, M1 to M4, I0 to I4, D1 to D4 and End]
Match-state emission probabilities:
M1: A 5/8, C 1/8, G 1/8, T 1/8
M2: A 1/7, C 1/7, G 4/7, T 1/7
M3: A 3/7, C 1/7, G 2/7, T 1/7
M4: A 1/8, C 5/8, G 1/8, T 1/8
Transition probabilities out of M1: a_{M1 M2} = 4/7, a_{M1 D2} = 2/7, a_{M1 I1} = 1/7
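
The emission values of this example can be reproduced from the alignment with a short sketch. It assumes that alignment columns 1, 2, 4 and 6 are the match columns (the ones picked by the 50% gap rule) and that Laplace's rule adds one pseudocount per nucleotide; `match_emissions` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]  # Bat, Rat, Cat, Gnat, Goat
match_columns = [0, 1, 3, 5]  # zero-based indices of alignment columns 1, 2, 4, 6 (assumed)

def match_emissions(alignment, match_columns, alphabet="ACGT", gap="-"):
    """Emission table for each match state, with one pseudocount per symbol."""
    tables = []
    for c in match_columns:
        counts = Counter(row[c] for row in alignment if row[c] != gap)
        total = sum(counts.values()) + len(alphabet)
        tables.append({a: Fraction(counts[a] + 1, total) for a in alphabet})
    return tables

tables = match_emissions(alignment, match_columns)
```

Gap characters are skipped when counting, so a match column with fewer residues gets a smaller denominator: 8 for M1 and M4, 7 for M2 and M3.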

29 Searching with profile HMMs: Obtaining significant matches of a sequence to the profile HMM: Viterbi algorithm, P(x, π* | M); Forward algorithm, P(x | M). Giving an alignment of a sequence to the family: the highest scoring, or Viterbi, alignment.

30 V^M_j(i): log-odds score of the best path matching subsequence x_{1…i} to the submodel up to state j, ending with x_i being emitted by state M_j. V^I_j(i): log-odds score of the best path ending in x_i being emitted by I_j. V^D_j(i): score of the best path ending in state D_j. Pair HMM: Viterbi equations

31 Viterbi equations
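
The equations themselves are not part of this transcript. For reference, the standard log-odds Viterbi recurrences for a profile HMM, in the notation of Durbin et al. (chapter 5), are given below; the assumption is that the slide showed this form, with V^M_j(i), V^I_j(i) and V^D_j(i) the scores of the best paths ending in M_j, I_j and D_j.

```latex
\begin{align*}
V^{M}_{j}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max
  \begin{cases}
    V^{M}_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\
    V^{I}_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\
    V^{D}_{j-1}(i-1) + \log a_{D_{j-1}M_j}
  \end{cases} \\
V^{I}_{j}(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max
  \begin{cases}
    V^{M}_{j}(i-1) + \log a_{M_{j}I_j} \\
    V^{I}_{j}(i-1) + \log a_{I_{j}I_j} \\
    V^{D}_{j}(i-1) + \log a_{D_{j}I_j}
  \end{cases} \\
V^{D}_{j}(i) &= \max
  \begin{cases}
    V^{M}_{j-1}(i) + \log a_{M_{j-1}D_j} \\
    V^{I}_{j-1}(i) + \log a_{I_{j-1}D_j} \\
    V^{D}_{j-1}(i) + \log a_{D_{j-1}D_j}
  \end{cases}
\end{align*}
```

The delete recurrence has no emission term because D_j is silent, and its right-hand side keeps the same sequence index i because no residue is consumed.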

32 Forward algorithm
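
As with the Viterbi slide, the formulas are not preserved here. The standard log-odds forward recurrences for a profile HMM (again following Durbin et al., chapter 5, and assumed to match the slide) replace each maximum by a sum over the three predecessor states:

```latex
\begin{align*}
F^{M}_{j}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j-1}M_j}\, e^{F^{M}_{j-1}(i-1)}
            + a_{I_{j-1}M_j}\, e^{F^{I}_{j-1}(i-1)}
            + a_{D_{j-1}M_j}\, e^{F^{D}_{j-1}(i-1)} \Big] \\
F^{I}_{j}(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j}I_j}\, e^{F^{M}_{j}(i-1)}
            + a_{I_{j}I_j}\, e^{F^{I}_{j}(i-1)}
            + a_{D_{j}I_j}\, e^{F^{D}_{j}(i-1)} \Big] \\
F^{D}_{j}(i) &= \log\Big[ a_{M_{j-1}D_j}\, e^{F^{M}_{j-1}(i)}
            + a_{I_{j-1}D_j}\, e^{F^{I}_{j-1}(i)}
            + a_{D_{j-1}D_j}\, e^{F^{D}_{j-1}(i)} \Big]
\end{align*}
```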

33 Initialisation and termination: Viterbi algorithm: – Initialisation: – Termination: Forward algorithm: – Initialisation: – Termination:
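
One common textbook convention, assumed here because the slide's own formulas are not preserved, is to treat Begin as M_0 with V^M_0(0) = 0 and End as M_{L+1}, reached with the ordinary match-state recurrence minus the emission term. The Python sketch below applies that convention to compute the Viterbi log-odds score. It is a simplified illustration: `viterbi_score` and the toy parameters are hypothetical, transition probabilities are shared across positions rather than position specific, insert emissions equal the background (log-odds 0), and no traceback is kept.

```python
import math

NEG_INF = float("-inf")

def viterbi_score(x, match_logodds, logt):
    """Best-path log-odds score of sequence x against a profile HMM.

    match_logodds[j - 1][a] is log(e_Mj(a) / q_a); logt maps 'MM', 'MI', 'MD',
    'IM', 'II', 'ID', 'DM', 'DI', 'DD' to log transition probabilities.
    """
    L, n = len(match_logodds), len(x)
    # V[s][j][i]: best score for x[:i] with the path ending in state s_j
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # initialisation: Begin treated as M_0

    def best(j, i, dest):
        # best predecessor among M_j, I_j, D_j moving into a state of type dest
        return max(V[s][j][i] + logt[s + dest] for s in "MID")

    for i in range(n + 1):
        for j in range(L + 1):
            if i >= 1 and j >= 1:
                V["M"][j][i] = match_logodds[j - 1][x[i - 1]] + best(j - 1, i - 1, "M")
            if i >= 1:
                V["I"][j][i] = best(j, i - 1, "I")  # insert log-odds is 0 when e_I = q
            if j >= 1:
                V["D"][j][i] = best(j - 1, i, "D")  # silent state, no residue consumed
    return best(L, n, "M")  # termination: End treated as M_{L+1}, no emission

def logodds(preferred, p=0.7, q=0.25):
    rest = (1 - p) / 3
    return {a: math.log((p if a == preferred else rest) / q) for a in "ACGT"}

match_logodds = [logodds("A"), logodds("C")]  # toy two-state model preferring "AC"
logt = {k: math.log(v) for k, v in {
    "MM": 0.8, "MI": 0.1, "MD": 0.1,
    "IM": 0.6, "II": 0.3, "ID": 0.1,
    "DM": 0.6, "DI": 0.1, "DD": 0.3}.items()}
score_ac = viterbi_score("AC", match_logodds, logt)
```

For the toy model the best path for "AC" is Begin, M1, M2, End, so `score_ac` equals the two match log-odds plus three log a_MM terms. Replacing `max` in `best` by a log-sum-exp would give the corresponding forward score.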

34 Alternative to log-odds scoring: Log likelihood score (LL score). Strongly length dependent. Solutions: divide by sequence length, or use a Z-score. Which method is preferred?
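
Both corrections can be sketched briefly. The function names are hypothetical, and the Z-score version makes a strong simplifying assumption: `background_scores` holds LL scores of unrelated sequences of about the same length as the query, so that a single mean and standard deviation are enough. A fuller treatment would model how the mean and standard deviation change with length.

```python
import statistics

def length_normalised(ll_score, length):
    """LL score per residue."""
    return ll_score / length

def z_score(ll_score, background_scores):
    """Number of standard deviations by which ll_score exceeds the background mean."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (ll_score - mean) / sd

background = [-100.0, -102.0, -98.0, -101.0, -99.0]  # hypothetical non-family LL scores
z = z_score(-80.0, background)
```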

36 Demo

37 Part of the profile HMM

38 Scoring

39 Part of the multiple alignment

40 Relative frequencies

42 Flanking model states: Used to model the flanking sequences to the actual profile match itself. Extra probabilities needed: emission probability q_a; 'looping' transition probability (1 - η); transition probability from the left flanking state, which depends on the application.

43 Model for local alignment (Smith-Waterman style) [Figure: profile HMM with M_j, I_j and D_j, two Begin and two End states, and two flanking states Q]

44 Model for overlap matches [Figure: profile HMM with Begin, M_j, I_j, D_j and End, and two flanking states Q]

45 Model for repeat matches [Figure: profile HMM with M_j, I_j and D_j, two Begin and two End states, and one flanking state Q]

46 Summary: Construction of a profile HMM for different kinds of alignments. Using profile HMMs to detect potential membership in a family. Using profile HMMs to give an alignment of a sequence to the family.

47 BLAST versus profile HMM. Discussion subject.