Sequence motifs, information content, logos, and HMM’s

Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU

Outline
- Pattern recognition
- Multiple alignment and sequence motifs
- Regular expression and probabilities
- Weight matrix construction and consensus sequence
- Sequence weighting
- Low (pseudo) counts
- Information content
- Sequence logos
- Mutual information
- Example from the real world
- HMM’s and profile HMM’s
- TMHMM (trans-membrane protein)
- Gene finding
- Links to HMM packages

Pattern recognition
- 10 peptides from the MHCpep database that bind to the MHC complex A*0201:
  ALAKAAAAM ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
- Which of the following are most likely to bind? (1) FLLTRILTI, (2) WLDQVPFSV, (3) TVILGVLLL
- Regular expression: X1[LMIV]2X3…X8[MVL]9
- According to the regular expression, 2 and 3 will bind and 1 will not. It cannot tell whether 2 is more likely to bind than 3.
- The truth is that 1 and 2 bind, 1 binds the strongest, and 3 does not bind.
- A probabilistic model can capture this!
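The regular expression on this slide can be tried directly. The following is a minimal sketch using Python's standard `re` module; the variable names (`MOTIF`, `candidates`, `matches`) are illustrative and not from the slides.

```python
import re

# X1[LMIV]2X3...X8[MVL]9: any residue at positions 1 and 3-8,
# anchor residues at positions 2 and 9 (9-mer peptides only).
MOTIF = re.compile(r"^.[LMIV].{6}[MVL]$")

candidates = ["FLLTRILTI", "WLDQVPFSV", "TVILGVLLL"]
matches = {pep: bool(MOTIF.match(pep)) for pep in candidates}

for pep, hit in matches.items():
    print(pep, "matches" if hit else "does not match")
```

The match is a yes/no answer, so the two matching peptides cannot be ranked against each other, which is the limitation the slide points out.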

Multiple alignment and sequence motifs
- Core
- Consensus sequence
- Weight matrices
- Problems: sequence weights, low counts

----------MLEFVVEADLPGIKA--------
----------MLEFVVEFALPGIKA--------
----------MLEFVVEFDLPGIAA--------
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********
             FVVEFDLPG  Consensus
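As a rough illustration of how a consensus is read off the core, the sketch below takes the nine starred core columns of the alignment and picks the most frequent residue per column. Several columns are tied in the raw counts, so the tie-breaking rule (first residue seen from the top) is an assumption made here, not something the slide specifies.

```python
from collections import Counter

# Core 9-mers: the starred columns of the 14 aligned sequences.
CORE = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]

def consensus(peptides):
    """Most frequent residue per column; ties go to the residue seen first."""
    result = []
    for column in zip(*peptides):
        counts = Counter(column)
        best = max(counts.values())
        # Scanning the column top-down keeps the first-seen residue among ties.
        result.append(next(aa for aa in column if counts[aa] == best))
    return "".join(result)

print(consensus(CORE))
```

Because the first three sequences are near-identical, they dominate the raw counts, which is the motivation for sequence weighting.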

Sequence weighting 1 - Clustering (slow, but accurate)
- Homologous sequences are clustered, and each sequence in a cluster of n sequences gets weight 1/n. In the alignment, the first three sequences (core FVVEADLPG, FVVEFALPG, FVVEFDLPG) are homologous, so each gets weight 1/3.
- Consensus sequence with clustering: YRQELDPLV
- Previous consensus (no weighting): FVVEFDLPG

Sequence weighting 2 - Henikoff & Henikoff (fast)

Peptide    Weight
FVVEADLPG  0.37
FVVEFALPG  0.43
FVVEFDLPG  0.32
YLQDSDPDS  0.59
MKQFINMWQ  0.90
LMQNYDPNL  0.68
PAERDSDVV  0.75
LKLTLTNLI  0.85
VWIEMQWCD  0.84
YRLRWDPRD  0.51
WRPDIVLEN  0.71
VLENNVDGV  0.59
YCNVLVSPD  0.71
FRSACSISV  0.75

- $w'_{aa} = 1/(r \cdot s)$, where r is the number of different amino acids in a column and s is the number of occurrences of the amino acid in that column.
- Normalize so that $\sum_{aa} w_{aa} = 1$ for each column.
- The sequence weight is the sum of $w_{aa}$ over the sequence.

Example, first column: r = 7 (F, Y, M, L, P, V, W)
F: s = 4, w' = 1/28, w = 0.055
Y: s = 3, w' = 1/21, w = 0.073
M, P, W: s = 1, w' = 1/7, w = 0.218
L, V: s = 2, w' = 1/14, w = 0.109
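The per-column weights $w' = 1/(r s)$ can be computed in a few lines. This is a minimal sketch: the helper names are illustrative, and the final normalization (sequence weights scaled to sum to 1) is an assumption that differs from the per-column normalization on the slide, so the absolute numbers are not expected to reproduce the slide's weight column.

```python
from collections import Counter

CORE = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]

def column_weights(column):
    """Unnormalized w' = 1/(r*s) for each amino acid in one column."""
    counts = Counter(column)
    r = len(counts)                      # number of different amino acids
    return {aa: 1.0 / (r * s) for aa, s in counts.items()}

def henikoff_weights(peptides):
    """Sum w' over columns per sequence, then scale weights to sum to 1."""
    weights = [0.0] * len(peptides)
    for column in zip(*peptides):
        w = column_weights(column)
        for k, aa in enumerate(column):
            weights[k] += w[aa]
    total = sum(weights)
    return [x / total for x in weights]

first_column = [p[0] for p in CORE]
print(column_weights(first_column))      # F gets 1/28, Y 1/21, ...
for pep, w in zip(CORE, henikoff_weights(CORE)):
    print(pep, round(w, 3))
```

The relative ordering is the useful part: sequences made of common residues (the three near-identical ones) receive the smallest weights.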

Low count correction
- Limited number of data: poor sampling of sequence space.
- I is not found at position P1 (the first core position of the alignment). Does this mean that I can never be found at P1?
- No! Use a Blosum matrix to estimate the pseudo frequency of I.

Low count correction using Blosum matrices
- Every time, for instance, L or V is observed, I is also likely to occur.
- Estimate the low (pseudo) count correction using this approach.
- As more data are included, the pseudo count correction becomes less important.

Blosum62 substitution frequencies:
     I     L     V
L  0.12  0.38  0.10
V  0.16  0.13  0.27

Neff: number of sequences. β: weight on pseudo count (weight on prior).
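The slide's formula did not survive in the transcript, so the sketch below assumes a commonly used form: the pseudo frequency is $g_a = \sum_b f_b \, q(a \mid b)$ and the corrected frequency is $p_a = (\alpha f_a + \beta g_a)/(\alpha + \beta)$ with $\alpha = N_{eff} - 1$. That formula, the observed frequencies, and the values of `n_eff` and `beta` are assumptions for illustration; only the six Blosum62 values come from the slide.

```python
# q(target | observed) values copied from the slide's Blosum62 excerpt.
BLOSUM_COND = {
    "L": {"I": 0.12, "L": 0.38, "V": 0.10},
    "V": {"I": 0.16, "L": 0.13, "V": 0.27},
}

def pseudo_frequency(observed, target):
    """g_target = sum over observed residues b of f_b * q(target | b)."""
    return sum(f * BLOSUM_COND[aa][target] for aa, f in observed.items())

def corrected_frequency(f, g, n_eff, beta):
    """Assumed mixing rule: (alpha*f + beta*g) / (alpha + beta), alpha = n_eff - 1."""
    alpha = n_eff - 1
    return (alpha * f + beta * g) / (alpha + beta)

# Hypothetical column containing only L and V in equal amounts; I is never observed.
observed = {"L": 0.5, "V": 0.5}
g_i = pseudo_frequency(observed, "I")

print("pseudo frequency of I:", g_i)
print("few sequences :", corrected_frequency(0.0, g_i, n_eff=5, beta=50))
print("many sequences:", corrected_frequency(0.0, g_i, n_eff=500, beta=50))
```

With more sequences (larger `n_eff`) the observed frequency dominates and the estimate for the unseen residue shrinks, matching the remark that the correction becomes less important as data accumulate.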

Information content
- Information and entropy:
  - Conserved amino acid regions contain a high degree of information (high order == low entropy).
  - Variable amino acid regions contain a low degree of information (low order == high entropy).
- Shannon information (D): $D = \log_2(N) + \sum_i p_i \log_2 p_i$ (for proteins N = 20, for DNA N = 4)
- Conserved residue: $p_A = 1$, $p_{i \neq A} = 0$, so $D = \log_2(N)$ (= 4.3 for proteins)
- Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., so $D = 0$
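The two limiting cases can be checked numerically. A minimal sketch follows; the function name `information_content` is illustrative, and terms with $p_i = 0$ are skipped using the usual convention $0 \log 0 = 0$.

```python
import math

def information_content(p, n=20):
    """Shannon information D = log2(N) + sum_i p_i log2 p_i (0 log 0 taken as 0)."""
    return math.log2(n) + sum(x * math.log2(x) for x in p if x > 0)

conserved = [1.0] + [0.0] * 19      # one residue always observed
variable = [0.05] * 20              # all 20 residues equally likely

print(round(information_content(conserved), 2))
print(round(information_content(variable), 2))
```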

Sequence logo (example: MHC class I)
- The height of a column is equal to D, so high-information positions stand out.
- The relative height of a letter is $p_A$.
- A highly useful tool to visualize sequence motifs.
- http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

More on logos
- Information content: $D = \sum_i p_i \log_2 (p_i / q_i)$
- Shannon, $q_i = 1/N = 0.05$: $D = \sum_i p_i \log_2 p_i - \sum_i p_i \log_2 (1/N) = \log_2 N + \sum_i p_i \log_2 p_i$
- Kullback-Leibler, $q_i$ = background frequency: V/L/A are more frequent than, for instance, C/H/W.
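A short numerical check of the identity above: with a uniform background the relative-entropy form reduces to the Shannon form. This is a sketch with illustrative names and a made-up four-letter distribution.

```python
import math

def kl_information(p, q):
    """D = sum_i p_i log2(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_information(p):
    """D = log2(N) + sum_i p_i log2 p_i, where N = len(p)."""
    return math.log2(len(p)) + sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]            # hypothetical DNA column (N = 4)
uniform = [0.25] * 4

print(kl_information(p, uniform), shannon_information(p))
```

With a non-uniform background, a residue that is rare in nature contributes more information when observed than a common one at the same frequency.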

Mutual information
Peptides: ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS
- P(G1) = 2/9 = 0.22, ...
- P(V6) = 4/9 = 0.44, ...
- P(G1,V6) = 2/9 = 0.22
- P(G1)*P(V6) = 8/81 = 0.10
- log(0.22/0.10) > 0
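The counts on this slide can be reproduced from the nine peptides. The sketch below also sums the pairwise terms into a mutual information value for two positions; the log base (2) and the function names are assumptions, since the slide leaves the base unspecified.

```python
import math
from collections import Counter

PEPTIDES = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]
N = len(PEPTIDES)

def column(i):
    """Residues at 1-based position i."""
    return [p[i - 1] for p in PEPTIDES]

p_g1 = column(1).count("G") / N
p_v6 = column(6).count("V") / N
p_g1_v6 = sum(1 for p in PEPTIDES if p[0] == "G" and p[5] == "V") / N
term = math.log2(p_g1_v6 / (p_g1 * p_v6))   # > 0: G1 and V6 co-occur more than expected

def mutual_information(i, j):
    """Sum over observed residue pairs of P(a,b) * log2(P(a,b) / (P(a) P(b)))."""
    ci, cj = Counter(column(i)), Counter(column(j))
    pairs = Counter(zip(column(i), column(j)))
    return sum(
        (c / N) * math.log2((c / N) / ((ci[a] / N) * (cj[b] / N)))
        for (a, b), c in pairs.items()
    )

print(p_g1, p_v6, p_g1_v6, term)
print(mutual_information(1, 6))
```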

Mutual information (figure): 313 binding peptides compared with 313 random peptides.

Learning higher order correlation
- Neural networks can learn higher order correlations!
- What does this mean?
  0 0 => 0
  0 1 => 1
  1 0 => 1
  1 1 => 0
- No linear function can learn this pattern.

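Learning higher order correlation
  0 0 => 0; 1 0 => 1
  1 1 => 0; 0 1 => 1
(Figure: two networks on inputs X1 and X2. The network with direct weights W1 and W2 only has no solution. The network with hidden units h1 and h2, input weights W11, W12, W21, W22 and output weights V1 and V2 has a solution.)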
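The pattern above is the XOR function. As a small illustration, the sketch below hand-picks weights for a network with two hidden units that reproduces it, and searches a coarse grid of weights for a single linear threshold unit without finding any solution. The specific weights, the step activation, and the grid are choices made for this example, not values from the slides.

```python
import itertools

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(x):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1 acts as OR
    h2 = step(x1 + x2 - 1.5)        # hidden unit 2 acts as AND
    return step(h1 - h2 - 0.5)      # output: OR but not AND

# Grid search over a single unit step(w1*x1 + w2*x2 + b): no weights work.
grid = [k / 2 for k in range(-6, 7)]
linear_solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if all(step(w1 * x1 + w2 * x2 + b) == y for (x1, x2), y in XOR.items())
]

print([xor_net(*x) for x in XOR], linear_solutions)
```

The empty search result is not an accident of the grid: the four constraints on a single unit contradict each other, so no choice of weights can satisfy them.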

End of first part. Take a deep breath. Smile to your neighbor.

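How to score a sequence to a probability matrix?
- $p_{ij}$ describes a motif (i is a position in the motif, j an amino acid).
- The probability that a peptide $a_1 \ldots a_L$ fits the motif is $P(\text{pep} \mid \text{motif}) = \prod_i p_{i a_i}$.
- The probability that the peptide fits a random model is $P(\text{pep} \mid \text{random}) = \prod_i q_{a_i}$.
- The ratio of the two gives the odds: $\prod_i p_{i a_i} / q_{a_i}$.
- The log gives the score: $S = \sum_i \log (p_{i a_i} / q_{a_i})$.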

Weight matrices
- Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts.
- Construct a weight matrix as $W_{ij} = \log(p_{ij}/q_j)$.
- Here i is a position in the motif and j an amino acid; $q_j$ is the prior (background) frequency for amino acid j.
- W is an L x 20 matrix, where L is the motif length.
- Score a sequence against the weight matrix by looking up and adding L values from the matrix.

Weight matrix 2
- What are log-odds scores ($W_{ij} = \log(p_{ij}/q_j)$)?
- Does a monthly income of $2000 mean that you are rich? It depends on where you live: in Denmark no, in Argentina yes.
- You must always compare your measured value ($p_{ij}$) to a background ($q_j$).
- In nature, not all amino acids are found equally often: $P_A = 0.070$, $P_W = 0.013$.
- Finding 6% A is hence not significant, but 6% W is highly significant.
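The A versus W comparison can be made concrete with the background frequencies quoted on the slide. A minimal sketch, assuming log base 2 (the slide does not state the base):

```python
import math

BACKGROUND = {"A": 0.070, "W": 0.013}   # background frequencies from the slide

def log_odds(p_observed, q_background):
    """W = log2(p / q); base 2 is an assumption of this sketch."""
    return math.log2(p_observed / q_background)

score_a = log_odds(0.06, BACKGROUND["A"])   # 6% A: slightly below background
score_w = log_odds(0.06, BACKGROUND["W"])   # 6% W: far above background

print(round(score_a, 2), round(score_w, 2))
```

The same observed frequency gives a score near zero for alanine and a clearly positive score for tryptophan.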

Scoring sequences to a weight matrix

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1   0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2  -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3   0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4  -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5  -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6  -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7   1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8  -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9  -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5

Peptide    Score
ILYQVPFSV  15.0
ALPYWNFAT  -3.4
MTAQWWLDA   0.8

Which peptide is most likely to bind? Which peptide second?
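Scoring amounts to one table lookup per position followed by a sum. The sketch below uses the matrix values as printed on the slide; the names `AA`, `ROWS`, `W`, and `score` are illustrative. Because the printed entries are rounded to one decimal, the third peptide sums to 0.7 here rather than the 0.8 shown on the slide.

```python
AA = "ARNDCQEGHILKMFPSTWYV"   # column order of the matrix

ROWS = """
0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4
0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0
-0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7
-1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0
-0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1
1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5
-2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1
-0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
"""

# One row per motif position, one column per amino acid.
W = [[float(x) for x in line.split()] for line in ROWS.strip().splitlines()]

def score(peptide):
    """Sum the matrix entry for the residue found at each position."""
    return sum(W[i][AA.index(aa)] for i, aa in enumerate(peptide))

for pep in ["ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"]:
    print(pep, round(score(pep), 1))
```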

Example from real life
- 10 peptides from the MHCpep database that bind to the MHC complex (relevant for immune system recognition):
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
- Estimate the sequence motif and weight matrix.
- Evaluate motif “correctness” on 528 peptides.

Example (cont.)
Prediction accuracy (Pearson correlation) for motifs estimated from the 10 peptides:
- Raw sequence counting (no sequence weighting, no pseudo count): 0.45
- Sequence weighting, no pseudo count: 0.5
- Sequence weighting and pseudo count: 0.60
Motif found on a large dataset: prediction accuracy 0.79
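The accuracy measure used here is the Pearson correlation between predicted scores and measured values. A minimal sketch of the calculation follows; the two lists are made-up toy numbers for illustration only and are not the 528-peptide evaluation data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

# Hypothetical predicted scores and measured binding values.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.2, 1.9, 3.2, 3.8]

print(round(pearson(predicted, measured), 3))
```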

Hidden Markov Models
- Weight matrices do not deal with insertions and deletions.
- In alignments, this is done in an ad hoc manner by optimizing the two gap penalties (first gap and gap extension).
- An HMM is a natural framework in which insertions and deletions are dealt with explicitly.

Why hidden?
- The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1.
- The model generates numbers: 312453666641
- It does not tell which die was used.
- Alignment (decoding) can give the most probable solution/path (Viterbi), FFFFFFLLLLLL, or the most probable set of states.

State   Stay  Switch  Emissions
Fair    0.95  0.05    1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6
Loaded  0.9   0.10    1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2
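To see why the state sequence is "hidden", it helps to generate data from the casino model: the observer receives only the rolls, not the states. This is a small sketch using the probabilities on the slide; starting in the fair state and the function name `sample` are assumptions of the example.

```python
import random

EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},                # fair die
    "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5},    # loaded die
}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}

def sample(length, rng, start="F"):
    """Return (rolls, states); only the rolls would be visible to an observer."""
    state, rolls, states = start, [], []
    for _ in range(length):
        faces, probs = zip(*EMIT[state].items())
        rolls.append(rng.choices(faces, weights=probs)[0])
        states.append(state)
        targets, probs = zip(*TRANS[state].items())
        state = rng.choices(targets, weights=probs)[0]
    return "".join(rolls), "".join(states)

rolls, states = sample(12, random.Random(0))
print(rolls)
print(states)
```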

HMM (a simple example)
Example from A. Krogh:

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

- The core of the alignment defines the number of states in the HMM (red).
- Insertion and deletion statistics are derived from the non-core part of the alignment (black).
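One way to make "core" concrete is to count gaps per column. The sketch below uses the rule "fewer than 2 gaps", borrowed from the profile HMM slide later in this deck, and derives match-state emission frequencies from the core columns; the function names are illustrative.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]

def core_columns(alignment, max_gaps=1):
    """0-based indices of columns with at most max_gaps gap characters."""
    return [
        i for i, col in enumerate(zip(*alignment))
        if col.count("-") <= max_gaps
    ]

def emission_frequencies(alignment, col):
    """Relative frequency of each non-gap symbol in one column."""
    symbols = [seq[col] for seq in alignment if seq[col] != "-"]
    counts = Counter(symbols)
    return {base: c / len(symbols) for base, c in counts.items()}

core = core_columns(ALIGNMENT)
print(core)
for i in core:
    print(i + 1, emission_frequencies(ALIGNMENT, i))
```

The six core columns become six match states, and the three remaining columns feed the insert-state statistics.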

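HMM construction

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

- Gap region: 5 emissions (A, 2xC, T, G), giving insert-state emission probabilities A 0.2, C 0.4, G 0.2, T 0.2.
- 5 transitions in the gap region: C out, G out, A-C, C-T, T out. Out transition 3/5 = 0.6, stay transition 2/5 = 0.4.
- Three of the five sequences enter the gap region after the third core position (transition 0.6); two skip it (transition 0.4). All other transitions between core states are 1.
- Core state emissions: position 1: A 0.8, T 0.2; position 2: C 0.8, G 0.2; position 3: A 0.8, C 0.2; position 4: A 1; position 5: T 0.8, G 0.2; position 6: C 0.8, G 0.2.
- ACA---ATG: $0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.4 \times 1 \times 1 \times 0.8 \times 1 \times 0.2 = 3.3 \times 10^{-2}$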

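Align sequence to HMM
- ACA---ATG: $0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.4 \times 1 \times 1 \times 0.8 \times 1 \times 0.2 = 3.3 \times 10^{-2}$
- TCAACTATC: $0.2 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.6 \times 0.2 \times 0.4 \times 0.4 \times 0.4 \times 0.2 \times 0.6 \times 1 \times 1 \times 0.8 \times 1 \times 0.8 = 0.0075 \times 10^{-2}$
- ACAC--AGC = $1.2 \times 10^{-2}$
- AGA---ATC = $3.3 \times 10^{-2}$
- ACCG--ATC = $0.59 \times 10^{-2}$
- Consensus: ACAC--ATC = $4.7 \times 10^{-2}$, ACA---ATC = $13.1 \times 10^{-2}$
- Exceptional: TGCT--AGG = $0.0023 \times 10^{-2}$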
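The products above follow a fixed path through the model, which makes them easy to reproduce. The sketch below hard-codes the small model as read from the construction slide (six match states, one insert state between match states 3 and 4); the data layout and the function name `path_probability` are choices made for this example.

```python
# Match-state emissions for the six core positions.
MATCH = [
    {"A": 0.8, "T": 0.2}, {"C": 0.8, "G": 0.2}, {"A": 0.8, "C": 0.2},
    {"A": 1.0}, {"T": 0.8, "G": 0.2}, {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}   # insert-state emissions
TO_INSERT, SKIP_INSERT = 0.6, 0.4                   # after match state 3
STAY, LEAVE = 0.4, 0.6                              # inside the insert state

def path_probability(aligned):
    """Probability of an aligned sequence: 3 core, optional inserts, 3 core."""
    seq = aligned.replace("-", "")
    head, inserted, tail = seq[:3], seq[3:-3], seq[-3:]
    p = 1.0
    for state, base in zip(MATCH[:3], head):
        p *= state.get(base, 0.0)
    if inserted:
        p *= TO_INSERT
        for base in inserted:
            p *= INSERT[base]
        p *= STAY ** (len(inserted) - 1) * LEAVE
    else:
        p *= SKIP_INSERT
    for state, base in zip(MATCH[3:], tail):
        p *= state.get(base, 0.0)
    return p   # all match-to-match transitions are 1

for s in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACA---ATC", "TGCT--AGG"]:
    print(s, f"{path_probability(s):.2e}")
```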

Align sequence to HMM - Null model
- The score depends strongly on length.
- The null model is a random model. For length L the score is $0.25^L$.
- Log-odds score for sequence S: $\log(P(S)/0.25^L)$
- A positive score means more likely than the null model.

Sequence    Log-odds
ACA---ATG    4.9
TCAACTATC    3.0
ACAC--AGC    5.3
AGA---ATC    4.9
ACCG--ATC    4.6
Consensus:
ACAC--ATC    6.7
ACA---ATC    6.3
Exceptional:
TGCT--AGG   -0.97  (Note!)
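The log-odds values can be recovered from the probabilities on the previous slide. The sketch below uses the natural logarithm, which is an inference from the numbers (it reproduces the listed scores), and takes the rounded probabilities from the slide as input, so the last digit can differ slightly.

```python
import math

def log_odds(p_model, length):
    """ln( P(S) / 0.25**L ) for a DNA sequence of L emitted symbols."""
    return math.log(p_model / 0.25 ** length)

# (probability from the slide, number of non-gap symbols)
examples = {
    "ACA---ATG": (3.3e-2, 6),
    "TCAACTATC": (0.0075e-2, 9),
    "ACAC--AGC": (1.2e-2, 7),
    "TGCT--AGG": (0.0023e-2, 7),
}

scores = {seq: log_odds(p, n) for seq, (p, n) in examples.items()}
for seq, s in scores.items():
    print(seq, round(s, 2))
```

Dividing by the null model removes most of the length dependence: the long sequence TCAACTATC has a tiny raw probability but still scores well above zero.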

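Model decoding (Viterbi)
The unfair casino, log model (log10 of the probabilities):
- Transitions: Fair to Fair -0.02, Fair to Loaded -1.3, Loaded to Loaded -0.05, Loaded to Fair -1.
- Emissions, Fair: 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78
- Emissions, Loaded: 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3

Example: 1245666, decoded path FFFFLLL

Roll    1      2      4      5      6      6      6
F     -0.78  -1.58  -2.38  -3.18  -3.98  -4.78  -5.58
L      Null  -3.08  -3.88  -4.68  -4.78  -5.13  -5.48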
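The table can be regenerated with a short Viterbi implementation in log10 space. This is a sketch rather than a reference implementation: starting in the fair state with probability 1 is an assumption read off the table (the Loaded entry for the first roll is "Null"), and the names are illustrative.

```python
import math

STATES = "FL"
START = {"F": 1.0, "L": 0.0}                     # assumed: always start fair
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},
    "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5},
}

def log10(p):
    """log10 with log(0) mapped to minus infinity."""
    return math.log10(p) if p > 0 else float("-inf")

def viterbi(rolls):
    """Return (most probable state path, its log10 probability)."""
    score = {s: log10(START[s]) + log10(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in s at this roll.
            prev = max(STATES, key=lambda t: score[t] + log10(TRANS[t][s]))
            pointers[s] = prev
            new_score[s] = score[prev] + log10(TRANS[prev][s]) + log10(EMIT[s][roll])
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    best = score[state]
    path = [state]
    for pointers in reversed(back):              # trace back from the end
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), best

path, best = viterbi("1245666")
print(path, round(best, 2))
```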

HMM’s and weight matrices
- In the case of un-gapped alignments, HMM’s become simple weight matrices.
- To achieve high performance, the emission frequencies are estimated using the techniques of sequence weighting and pseudo counts.

Profile HMM’s
- Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner.
- Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix).
- Profile HMM’s are ideally suited to describe such position-specific variations.

Profile HMM’s
Core: positions with < 2 gaps. The figure marks examples of insertion, conserved, and deletion positions.

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

HMM vs alignment
For the same multiple alignment, a profile HMM gives:
- A detailed description of the core.
- Conserved/variable positions.
- A price for insertions/deletions that varies at different locations in the sequence.
These features cannot be captured in conventional alignments.

Profile HMM’s: all M/D pairs must be visited once.

Example: sequence profiles
- Alignment of protein sequences 1PLC._ and 1GYC.A: E-value > 1000.
- Profile alignment:
  - Align 1PLC._ against Swiss-Prot.
  - Make a position-specific weight matrix from the alignment.
  - Use this matrix to align 1PLC._ against 1GYC.A.
- Result: E-value < $10^{-22}$, Rmsd = 3.3.

Example continued
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

Query:  3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

(Figure: model in red, template in blue. Rmsd = 3.3 Å.)

TMHMM (trans-membrane HMM) (Sonnhammer, von Heijne, and Krogh)
- Difference in amino acid composition: easy in an HMM, difficult in an alignment.
- Model TM length distribution: easy in an HMM, difficult in an alignment.

HMM packages
- HMMER (http://hmmer.wustl.edu/): S.R. Eddy, WashU St. Louis. Freely available.
- SAM (http://www.cse.ucsc.edu/research/compbio/sam.html): R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
- META-MEME (http://metameme.sdsc.edu/): William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
- NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html): Freely available to academia, nominal license fee for commercial users. Allows HMM architecture construction.

The unfair casino as a model file: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1

trainanhmm 1.221 Copyright (C) 1998 by Anders Krogh
Header {alphabet 123456;}
begin { trans Fair:0.5 Loaded:0.5; }
Fair { trans Fair:0.95 Loaded:0.05;
Loaded { trans Fair:0.1 Loaded:0.9; letter 6:0.5;

State   Stay  Switch  Emissions
Fair    0.95  0.05    1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6
Loaded  0.9   0.10    1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2