Hidden Markov Model Ed Anderson and Sasha Tkachev.

Who Was Markov? Graduate of Saint Petersburg University (1878), where he became a professor in 1886. Applied the method of continued fractions, pioneered by his teacher Pafnuty Chebyshev, to probability theory. He proved the central limit theorem under fairly general assumptions. Most remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes. In 1923 Norbert Wiener became the first to treat rigorously a continuous Markov process. The foundation of a general theory was provided during the 1930s by Andrei Kolmogorov. Excerpted from: Andrei A. Markov. Born: 14 June 1856 in Ryazan, Russia. Died: 20 July 1922 in Petrograd, Russia.

What is the Hidden Markov Model? Clipped from

What Makes HMM Useful? Efficiency: the algorithms are simple enough to be performant for real-time speech recognition, and speed is advantageous when dealing with large biological data sets. Strong theoretical basis: probability distributions must sum to 1, scores are not influenced by ad hoc criteria, and scores may be compared across different experiments of varying size and complexity. Well suited for analyzing noisy, time-phased, or sequentially connected events.

What are HMM’s Limitations? Model building is not so easy: “Since HMM training algorithms are local optimizers, it pays to build HMMs on pre-aligned data whenever possible… the parameter space may be complex, with many spurious local optima that can trap a training algorithm.” (1) Distance between related states must be constant. This is a disadvantage when analyzing distant and arbitrarily spaced items, such as amino acids in folded proteins and RNA base pairs. (1) Eddy, S.R., Profile hidden Markov models, Bioinformatics Review, 1998, Vol. 14, no. 9, pg. 757

A Concrete Example Example adapted from Can you guess the weather based on a person’s activity? Use the Forward algorithm to calculate the probabilities.
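The Forward calculation for the weather question can be sketched in a few lines of Python. The probabilities below are not taken from the slides; they are assumed values for a two-state model (Rainy, Sunny) with three possible activities (walk, shop, clean), chosen only to make the arithmetic concrete. The function sums over all hidden weather sequences without enumerating them, carrying one running total per state.

```python
# Assumed, illustrative parameters; the slide's actual numbers are not preserved.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(observations, states, start, trans, emit):
    """Return P(observations), summed over every hidden-state path.

    Expects a non-empty observation sequence.
    """
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[prev] * trans[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


likelihood = forward(("walk", "shop", "clean"), STATES, START, TRANS, EMIT)
print(f"P(walk, shop, clean) = {likelihood:.6f}")
```

Under these assumed parameters the three observed activities have a total probability of about 0.0336. The work grows linearly with the number of observations, which is the efficiency claimed for HMM algorithms earlier in the talk.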

How to Avoid False Optima? Is it necessary to calculate every possible path? The Viterbi algorithm can help. Example from
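It is not necessary to score every path. With N states and T observations there are N^T possible paths, but the Viterbi algorithm keeps only the best path into each state at each step, so the work is proportional to N^2 * T. The sketch below is a minimal illustration, not the slide's own example: the two-state weather model and all of its probabilities are assumed values.

```python
# Assumed, illustrative parameters; not taken from the slides.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations, states, start, trans, emit):
    """Return (probability, path) of the single most likely state sequence.

    Expects a non-empty observation sequence.
    """
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that gives the highest score on arrival in s.
            prob, path = max(
                ((best[prev][0] * trans[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


best_prob, best_path = viterbi(("walk", "shop", "clean"), STATES, START, TRANS, EMIT)
print(best_path, round(best_prob, 5))
```

For long sequences the products underflow, so a practical version would add log probabilities instead of multiplying; the plain products are kept here to stay close to the recursion as usually stated.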

HMM In Speech Recognition. Handling a single word: each HMM is evaluated against the input using the Viterbi search. Every senone gets an HMM. [Figure: the words ONE (W AH N), TWO (T UW), and THREE (TH R IY) split into phonemes; 5-state HMM.] Adapted from Shir, O. M., Speech Recognition Seminar, 10/15/03, Leiden Institute of Advanced Computer Science.

HMM In Speech Recognition. [Figure: trellis of states over time. Legend: state with best path-score; state with path-score < best; state without a valid path-score.] $P_j(t) = \max_i \left[ P_i(t-1) \, a_{ij} \, b_j(t) \right]$, where $P_j(t)$ is the total path-score ending up at state j at time t, $a_{ij}$ is the state transition probability from i to j, and $b_j(t)$ is the score for state j given the input at time t. Taken from Shir, O. M., Speech Recognition Seminar, 10/15/03, Leiden Institute of Advanced Computer Science.

HMM in Bioinformatics: sequence profiling, gene finding, protein secondary structure prediction, radiation hybrid mapping, genetic linkage mapping, and phylogenetic analysis.

HMM in Sequence Profiling. Review of Lecture 7 highlights: emission probabilities and transition probabilities.
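As a reminder of what an emission probability is in a profile, the sketch below estimates one match state's emissions from a single alignment column. It is a simplified illustration under stated assumptions: a DNA alphabet, gap characters ignored, and a flat pseudocount of 1 per residue so that unseen residues do not get probability zero. The function name and the pseudocount scheme are hypothetical and are not how any particular package builds its models.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet for the illustration


def column_emissions(column, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate emission probabilities for one alignment column.

    Characters outside the alphabet (such as the gap symbol '-') are ignored.
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


# One column taken from five aligned sequences; the fifth has a gap here.
emissions = column_emissions("AAAC-")
print(emissions)
```

With three A's and one C observed, the smoothed estimate gives A a probability of 0.5 and C 0.25, leaving 0.125 each for G and T. Transition probabilities can be estimated the same way, by counting how often each state is followed by each other state.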

HMM in Sequence Profiling. Log-odds scores are comparable across sequences of different lengths. Taken from lecture 7 slides, apparently from Krogh, “Computational Methods in Molecular Biology”, pages 45-63, Elsevier, 1998.
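The reason log-odds scores travel well across lengths is that the raw probability of a sequence shrinks with every added residue, while dividing by a background (null) model of the same length cancels that shrinkage. The sketch below shows the idea for an ungapped profile. It makes several assumptions: a uniform DNA background, a hypothetical three-position profile, and no transition probabilities.

```python
import math

BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumed uniform null model

# Hypothetical position-specific emission probabilities (each row sums to 1).
PROFILE = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"C": 0.5, "A": 0.25, "G": 0.125, "T": 0.125},
    {"G": 0.5, "A": 0.25, "C": 0.125, "T": 0.125},
]


def log_odds_score(sequence, profile, background=BACKGROUND):
    """Sum of log2(model probability / background probability), in bits."""
    if len(sequence) != len(profile):
        raise ValueError("this ungapped sketch needs one residue per position")
    return sum(
        math.log2(probs[residue] / background[residue])
        for residue, probs in zip(sequence, profile)
    )


consensus_score = log_odds_score("ACG", PROFILE)
poor_score = log_odds_score("TTT", PROFILE)
print(consensus_score, poor_score)
```

A positive score means the profile explains the sequence better than the background does, and a negative score means it explains it worse. Here the consensus sequence scores 3 bits and the poorly matching one scores -3 bits.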

Why HMM for Sequence Analysis? Position-specific scoring methods make intuitive sense. BLAST and FASTA use pairwise alignment as opposed to profile scoring. Profile methods have historically used ad hoc scoring systems. HMM gap penalties are grounded in probability theory. HMMs provide a coherent, probabilistic model (2). (2) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pps

Profile HMM Software. ‘Motif’ models have strings of match states separated by a small number of insert states. ‘Profile’ models have insert and delete states associated with each match state (3). (3) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pps (4) Ibid., Figure 3 on page

HMMER Architecture Both local and global profile alignment. (5) Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

How Does it Work? Generative models work by recursive enumeration of possible sequences from a finite set of rules. The Plan 7 architecture explicitly models the entire target sequence, regardless of how much of that sequence matches the main model. All alignments to a Plan 7 model are “global” alignments! (6) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

HMMER Programs (7)
hmmalign - align sequences to an HMM profile
hmmbuild - build a profile HMM from an alignment
hmmcalibrate - calibrate HMM search statistics
hmmconvert - convert between profile HMM file formats
hmmemit - generate sequences from a profile HMM
hmmfetch - retrieve an HMM from an HMM database
hmmindex - create a binary SSI index for an HMM database
hmmpfam - search one or more sequences against an HMM database
hmmsearch - search a sequence database with a profile HMM
HMMER’s native alignment format is called Stockholm format, the format of the Pfam protein database, which allows extensive markup and annotation. HMMER can read alignments in several common formats, including the output of the CLUSTAL family of programs, Wisconsin/GCG MSF format, the input format for the PHYLIP phylogenetic analysis programs, and “aligned FASTA” format (where the sequences in a FASTA file contain gap symbols, so that they are all the same length).
(7) Excerpted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct
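The “aligned FASTA” condition described above, every sequence padded with gap symbols to the same length, is easy to check before handing a file to an alignment-based tool. The following is a minimal sketch in plain Python. The function names are hypothetical, the parser handles only simple FASTA text, and it is not part of HMMER.

```python
def parse_fasta(text):
    """Parse simple FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header, or a placeholder if it is empty.
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_aligned(records):
    """True when every sequence, gap symbols included, has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


ALIGNED = ">seq1\nVLSPADK-T\n>seq2\nVLS-ADKNT\n"
UNALIGNED = ">seq1\nVLSPADKT\n>seq2\nVLSADKNTK\n"

print(is_aligned(parse_fasta(ALIGNED)))    # both rows are 9 columns wide
print(is_aligned(parse_fasta(UNALIGNED)))  # rows are 8 and 9 residues long
```

Equal length is necessary but not sufficient: the check cannot tell whether the gaps were placed sensibly, only that the rows line up into columns.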

Building a profile with hmmbuild (8)
> hmmbuild globin.hmm globins50.msf
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3 (April 2003)
Copyright (C) HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
Alignment file: globins50.msf
File format: MSF
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: MAP (gapmax hint: 0.50)
Null model used: (default)
Prior used: (default)
Sequence weighting method: G/S/C tree weights
New HMM file: globin.hmm
Alignment: #1
Number of sequences: 50
Number of columns: 308
Constructed a profile HMM (length 143)
Average score: bits
Minimum score: bits
Maximum score: bits
Std. deviation: bits
(8) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Calibrating the profile (9)
> hmmcalibrate globin.hmm
hmmcalibrate -- calibrate HMM search statistics
HMMER 2.3 (April 2003)
Copyright (C) HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
HMM file: globin.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples: 5000
random seed:
histogram(s) saved to: [not saved]
POSIX threads:
HMM : globins50
mu :
lambda :
max :
(9) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Searching the sequence DB (10)
Header Section
> hmmsearch globin.hmm Artemia.fa
hmmsearch - search a sequence database with a profile HMM
HMMER 2.3 (April 2003)
Copyright (C) HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
HMM file: globin.hmm [globins50]
Sequence database: Artemia.fa
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <= 10
per-domain Eval cutoff: [none]
Query HMM: globins50
Accession: [none]
Description: [none]
[HMM has been calibrated; E-values are empirical estimates]
(10) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Searching the sequence DB (cont.) (11): Sequence Top Hits Section. (11) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Searching the sequence DB (cont.) (12): Alignment Output Section. (12) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Searching the sequence DB (cont.) (13): Score Histogram Section. (13) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Local versus Global Alignment (14)
HMMER does not do local (Smith/Waterman) and global (Needleman/Wunsch) style alignments in the same way that most computational biology analysis programs do.
To HMMER, whether local or global alignments are allowed is part of the model, rather than being accomplished by running a different algorithm.
You must choose what kind of alignments you want to allow when you build the model.
By default, hmmbuild builds models that allow alignments that are global with respect to the HMM and local with respect to the sequence, and that allow multiple domains to hit per sequence.
(14) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct

Experimental Observations. My tests on the clipped SH3 domain sequence in the Krogh paper (15): The insert gap penalty was small but significant. The number of inserts had a linear, negative effect on the score. Relative to the overall score, the inserts and deletes had a small effect. [Figure: Avg Log Odds by Domain] (15) Krogh, “Computational Methods in Molecular Biology”, pages 45-63, Elsevier, 1998.