Hidden Markov Models. What are they good for? Morten Nielsen, CBS.



Absolutely nothing!

Objectives
Introduce hidden Markov models (HMMs) and understand that they are just weight matrices with gaps
See the beauty of sequence profiles
  – Position-specific scoring matrices (PSSMs)
Understand which biological problems are best described using HMMs
  – And which are not!

Outline
What is an HMM
  – What are they good for?
How to construct an HMM
How to “score” a sequence to an HMM
  – Viterbi decoding
HMMs that made a difference
  – Profile HMMs
  – TMHMM
Links to HMM packages

Markov Models
A model with no memory
  – What I decide depends only on the “state” now, not on what I have learned in the past
  – No dependence on i-1, i-2, …

A Markov model? No memory
Model generates numbers
Fair: 1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6
Loaded: 1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2
The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1
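As a rough illustration of the unfair casino, the sketch below generates rolls together with the hidden die sequence. The emission and switch probabilities are the ones on the slide; the function name `roll_casino`, the `seed` argument, and starting with the fair die are assumptions made only for this example.

```python
import random

# Emission probabilities for each die, as given on the slide.
EMISSION = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}
# Probability of switching to the other die after a roll, as given on the slide.
SWITCH = {"F": 0.05, "L": 0.1}


def roll_casino(n_rolls, seed=None, start_state="F"):
    """Return (hidden_states, rolls) for n_rolls throws in the unfair casino."""
    rng = random.Random(seed)
    state = start_state  # assumption: the slide gives no start distribution
    states, rolls = [], []
    for _ in range(n_rolls):
        faces = list(EMISSION[state])
        weights = [EMISSION[state][face] for face in faces]
        rolls.append(rng.choices(faces, weights=weights)[0])
        states.append(state)
        # No memory: the next die depends only on the current die.
        if rng.random() < SWITCH[state]:
            state = "L" if state == "F" else "F"
    return "".join(states), rolls


states, rolls = roll_casino(20, seed=0)
print(rolls)   # what an observer sees
print(states)  # the hidden path the observer does not see
```

Only `rolls` would be visible to a player, which is why the state sequence is called hidden.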

Why hidden?
Model generates numbers
Does not tell which die was used
Alignment (decoding) can give the most probable solution/path (Viterbi)
  – FFFFFFLLLLLL
Or the most probable set of states
  – FFFFFFLLLLLL
Fair: 1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6
Loaded: 1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2
The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1

HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Example from A. Krogh
Core region defines the number of states in the HMM (red)
Insertion and deletion statistics are derived from the non-core part of the alignment (black)
[Figure label: Core of alignment]

HMM construction
[Figure: HMM diagram; each state lists emission probabilities for A, C, G, T]
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
5 matches: A, 2xC, T, G
5 transitions in gap region
  – C out, G out
  – A-C, C-T, T out
  – Out transition 3/5
  – Stay transition 2/5
ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3 x 10^-2
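The counting behind the construction can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the three gapped columns form the insert region, uses plain relative frequencies (no pseudo counts), and the names `build_model` and `frequencies` are hypothetical.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
INSERT_COLUMNS = range(3, 6)  # assumption: the gapped columns are the non-core region


def frequencies(symbols):
    """Relative frequency of each symbol in an iterable."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


def build_model(alignment, insert_columns):
    """Estimate emissions and insert transitions by counting in the alignment."""
    core = [c for c in range(len(alignment[0])) if c not in insert_columns]
    # One match state per core column.
    match_emissions = [frequencies(row[c] for row in alignment) for c in core]
    # Letters each sequence places in the insert region.
    inserts = ["".join(row[c] for c in insert_columns).replace("-", "")
               for row in alignment]
    insert_emissions = frequencies("".join(inserts))
    n_with_insert = sum(1 for ins in inserts if ins)
    enter_insert = n_with_insert / len(alignment)
    # Each inserted letter is followed by either a "stay" or an "out" transition.
    leave_insert = n_with_insert / sum(len(ins) for ins in inserts)
    return match_emissions, insert_emissions, enter_insert, leave_insert


match, insert, enter, leave = build_model(ALIGNMENT, INSERT_COLUMNS)
print(match[0], insert, enter, leave)
```

With these five sequences the insert state sees the letters A, C, C, T, G and three of its five outgoing transitions leave the state, matching the 3/5 and 2/5 on the slide.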

Align sequence to HMM
ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3 x 10^-2
TCAACTATC: 0.2 x 1 x 0.8 x 1 x 0.8 x 0.6 x 0.2 x 0.4 x 0.4 x 0.4 x 0.2 x 0.6 x 1 x 1 x 0.8 x 1 x 0.8 = 0.0075 x 10^-2
ACAC--AGC = 1.2 x 10^-2
Consensus:
  – ACAC--ATC = 4.7 x 10^-2
  – ACA---ATC = 13.1 x 10^-2
Exceptional:
  – TGCT--AGG = [...] x 10^-2
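The products on this slide can be reproduced with a short sketch. The parameter values below are derived from the five-sequence example alignment by simple counting; the layout (three match states, one insert state, three match states) and the function name `path_probability` are assumptions made for illustration.

```python
# Match-state emissions, one dict per core column of the example alignment.
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
ENTER_INSERT, SKIP_INSERT = 0.6, 0.4   # after the third match state
STAY_INSERT, LEAVE_INSERT = 0.4, 0.6   # inside the insert state


def path_probability(row):
    """Probability of an aligned row such as 'ACAC--ATC' (columns 3-5 are inserts)."""
    head, insert, tail = row[:3], row[3:6].replace("-", ""), row[6:]
    p = 1.0
    for state, symbol in zip(MATCH[:3], head):
        p *= state.get(symbol, 0.0)
    if insert:
        p *= ENTER_INSERT
        for symbol in insert:
            p *= INSERT[symbol]
        # One "stay" per extra inserted letter, then one "out" transition.
        p *= STAY_INSERT ** (len(insert) - 1) * LEAVE_INSERT
    else:
        p *= SKIP_INSERT
    for state, symbol in zip(MATCH[3:], tail):
        p *= state.get(symbol, 0.0)
    return p


for row in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACAC--ATC", "ACA---ATC"]:
    print(row, path_probability(row))
```

All transitions between consecutive match states are 1 in this example, which is why they do not appear in the code.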

Align sequence to HMM - Null model
Score depends strongly on length
Null model is a random model. For a sequence of length L its probability is 0.25^L
Log-odds score for sequence S
  – log( P(S) / 0.25^L )
Positive score means more likely than the null model
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
  – ACAC--ATC = 6.7
  – ACA---ATC = 6.3
Exceptional:
  – TGCT--AGG = [...] Note!
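A small sketch of the log-odds calculation follows. The listed scores appear to be consistent with the natural logarithm and with L counted as the number of residues (gaps excluded); both are inferences from the numbers rather than something the slide states, and `log_odds` is a hypothetical helper name.

```python
import math


def log_odds(probability, n_residues, background=0.25):
    """Log-odds of a sequence against a uniform null model over 4 nucleotides."""
    return math.log(probability / background ** n_residues)


# Probabilities are the HMM products from the example; lengths count residues only.
print(round(log_odds(0.032768, 6), 1))      # ACA---ATG
print(round(log_odds(7.5497472e-5, 9), 1))  # TCAACTATC
print(round(log_odds(0.131072, 6), 1))      # ACA---ATC
```

Dividing by 0.25^L removes most of the length dependence, so long and short sequences can be compared on the same scale.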

Model decoding (Viterbi)
Example: What was the series of dice used to generate this output?
Log model (log10 of the probabilities)
Fair: 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78
Loaded: 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3

Dynamic programming: computation of scores
[Figure: alignment matrix; axis letters T C G C A T C A; marked cell x]
Any given point in the matrix can only be reached from three possible positions (you cannot “align backwards”).
=> The best-scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
Each new score is found by choosing the maximum of three possibilities.
For each square in the matrix: keep track of where the best score came from.
Fill in scores one row at a time, starting in the upper left corner of the matrix and ending in the lower right corner.
score(x,y) = max of
  – score(x,y-1) - gap-penalty
  – score(x-1,y-1) + substitution-score(x,y)
  – score(x-1,y) - gap-penalty
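The recurrence can be sketched as a score-only global alignment. The match/mismatch values and the linear gap penalty are placeholder assumptions (the slide does not give numbers), and the traceback that the slide mentions is left out to keep the example short.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap_penalty=2):
    """Fill the score matrix row by row and return the lower-right cell."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: only gaps are possible.
    for x in range(1, rows):
        score[x][0] = score[x - 1][0] - gap_penalty
    for y in range(1, cols):
        score[0][y] = score[0][y - 1] - gap_penalty
    for x in range(1, rows):
        for y in range(1, cols):
            substitution = match if a[x - 1] == b[y - 1] else mismatch
            # Each cell is reachable from exactly three neighbours.
            score[x][y] = max(
                score[x][y - 1] - gap_penalty,
                score[x - 1][y - 1] + substitution,
                score[x - 1][y] - gap_penalty,
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

To recover the alignment itself, one would also store which of the three cases gave the maximum in each cell and walk back from the lower right corner.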

Model decoding (Viterbi)
Same example and log model as on the first Viterbi decoding slide.
[Figure: dynamic programming table with rows F and L, filled in column by column; surviving entries: Null, -3.08]
Identify what series of dice was used to generate this output.
Series of dice is FFFFLLL
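A minimal Viterbi sketch for the casino model is given below, working in log10 space like the log model on the slides. The rolls from the slide are not available, so the example input is invented; the uniform start probabilities are also an assumption, as is the function name `viterbi`.

```python
import math

STATES = "FL"
START = {"F": 0.5, "L": 0.5}  # assumption: start probabilities are not on the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Most probable state path for a non-empty list of die rolls."""
    log = math.log10
    # score[s] = best log-probability of any path ending in state s.
    score = {s: log(START[s]) + log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + log(TRANS[best_prev][s])
                            + log(EMIT[s][roll]))
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical rolls, not the ones from the slide.
print(viterbi([1, 2, 3, 4, 5, 1, 2, 3, 6, 6, 6, 6, 6, 6, 6, 6]))
```

Because switching dice is expensive (log10 of 0.05 is about -1.3), a single six among ordinary rolls is usually not enough to make the best path leave the fair state.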

HMMs and weight matrices
In the case of un-gapped alignments, HMMs become simple weight matrices

HMM construction
[Figure: HMM diagram (emission probabilities for A, C, G, T in each state), marked with an X]
ACA---ATG
score = 0.8 x 1 x 0.8 x 1 x 0.8 x 1 x 1 x 1 x 0.8 x 1 x 0.2 = 8.2 x 10^-2
or
log-score = log(0.8) + log(0.8) + log(0.8) + log(1) + log(0.8) + log(0.2)

HMMs and weight matrices
In the case of un-gapped alignments, HMMs become simple weight matrices
To achieve high performance, the emission frequencies are estimated using the techniques of
  – Sequence weighting
  – Pseudo counts
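As a rough sketch of how pseudo counts enter a weight matrix, the example below builds a log-odds matrix from ungapped sequences. The pseudo count scheme (a fixed weight spread over a uniform background) is one simple choice assumed here, not necessarily the one used in the lecture, and sequence weighting is left out entirely. The function names are hypothetical.

```python
import math
from collections import Counter


def weight_matrix(sequences, alphabet="ACGT", pseudo_weight=1.0):
    """Log-odds weight matrix from ungapped, equal-length sequences."""
    background = 1 / len(alphabet)  # assumption: uniform background
    matrix = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column) + pseudo_weight
        matrix.append({
            # Pseudo counts keep unseen letters from getting probability zero.
            s: math.log(((counts[s] + pseudo_weight * background) / total)
                        / background)
            for s in alphabet
        })
    return matrix


def score(matrix, sequence):
    """Sum of the position-specific log-odds for one sequence."""
    return sum(column[s] for column, s in zip(matrix, sequence))


pssm = weight_matrix(["ACA", "TCA", "ACA", "AGA", "ACC"])
print(score(pssm, "ACA"), score(pssm, "TGC"))
```

Without the pseudo counts, any letter not seen in a column would get a score of minus infinity, which is rarely what is wanted with only a handful of sequences.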

HMMs. What are they good for?
Weight matrices do not deal with insertions and deletions
In alignments, this is done in an ad hoc manner by optimizing the two gap penalties, for the first gap and for gap extension
An HMM is a natural framework in which insertions/deletions are dealt with explicitly

Profile HMMs
Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in the same manner
Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)
Profile HMMs are ideally suited to describe such position-specific variations

What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

Alignment scoring matrices
Blosum62 score matrix. Fg=1. Ng=0?
[Figure: score matrix for LAGDSD (columns) against FIGDSL (rows)]
Score = 2-1+6+6+4 = 17
LAGDS
I-GDS

What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
This scoring matrix is identical at all positions in the protein sequence!
[Figure: EVVFIGDSLVQLMHQC compared with AGDS.GGGDS; two positions marked X]

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

Sequence profiles
In reality, not all positions in a protein are equally likely to mutate
Some amino acids (active sites) are highly conserved, and the score for a mismatch must be very high
Other amino acids can mutate almost for free, and the score for a mismatch should be lower than the BLOSUM score
Sequence profiles can capture these differences

Profile HMMs
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
[Figure annotations on the alignment:]
  – Core: positions with < 2 gaps
  – Deletion
  – Insertion
  – Conserved: must have a G
  – Non-conserved: anything can match

HMM vs. alignment
Detailed description of core
  – Conserved/variable positions
Price for insertions/deletions varies at different locations in the sequence
These features cannot be captured in conventional alignments

Profile-profile scoring matrix 1K7C.A 1WAB._

Profile HMMs
All M/D pairs must be visited once
[Figure labels: L1- Y2A3V4R5- I6; P1D2P3P4I4P5D6P7]

Example. Sequence profiles
Alignment of protein sequences 1PLC._ and 1GYC.A
  – E-value > 1000
Profile alignment
  – Align 1PLC._ against Swiss-Prot
  – Make position-specific weight matrix from alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
  – E-value < [...]
Rmsd=3.3

Example continued
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
F + G++ N+ + +G + +
Sbjct: VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F G + ++ G+ G V
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd=3.3 Å
[Figure: model in red, structure in blue]

HMMs. What are they good for II Transmembrane helix proteins

HMMs. What are they good for II Transmembrane helix proteins TMHMM. A. Krogh, 2001

Gene Finding

HMM packages
HMMER
  – S.R. Eddy, WashU St. Louis. Freely available.
SAM
  – R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
META-MEME
  – William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
NET-ID, HMMpro
  – Freely available to academia, nominal license fee for commercial users.
  – Allows HMM architecture construction.
EasyGibbs
  – Webserver for Gibbs sampling of protein sequences