CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
Sequence motifs, information content, logos, and HMMs
Morten Nielsen, CBS


CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
Sequence motifs, information content, logos, and HMMs
Morten Nielsen, CBS, BioCentrum, DTU

Outline
Pattern recognition
– Regular expression and probabilities
Information content
– Sequence logos
Multiple alignment and sequence motifs
Weight matrix construction
– Sequence weighting
– Low (pseudo) counts
Example from the real world
HMMs and profile HMMs
– Viterbi decoding
– TMHMM (trans-membrane protein)
Links to HMM packages

MHC class I with peptide
(Figure: peptide bound to MHC class I, anchor positions marked)

How to predict MHC binding?
Structure based
– Molecular dynamics
– Calculate binding energy explicitly
Sequence based
– Forget about the MHC molecule itself!
– Learn binding motif from peptides known to bind MHC

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV 
FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence information
Calculate $p_a$ at each position
Entropy: $S = -\sum_a p_a \log p_a$
Information content: $I = \log(20) - S = \log(20) + \sum_a p_a \log p_a$
Conserved positions
– $P_V = 1$, $P_{REST} = 0$ => $S = 0$, $I = \log(20)$
Mutable positions
– $P_{aa} = 1/20$ => $S = \log(20)$, $I = 0$
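The two limiting cases on this slide can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (bits) and a 20-letter alphabet; the function names are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy S = -sum_a p_a * log2(p_a) of one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_content(column, alphabet_size=20):
    """I = log2(alphabet_size) - S; largest for fully conserved columns."""
    return math.log2(alphabet_size) - column_entropy(column)

conserved = "V" * 20              # P_V = 1, all other amino acids 0
uniform = "ARNDCQEGHILKMFPSTWYV"  # every amino acid once, p = 1/20

print(information_content(conserved))  # close to log2(20)
print(information_content(uniform))    # close to 0
```

Changing the log base only rescales S and I; the conserved/mutable contrast is unaffected.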

Information content
(Table: per-position frequencies for A R N D C Q E G H I L K M F P S T W Y V, with entropy S and information content I per position; values not preserved)

Sequence logo
Height of a column equals I
Relative height of a letter is p
Highly useful tool to visualize sequence motifs
(Figure: sequence logo for HLA-A0201, high information positions marked)

Characterizing a binding motif from small data sets
10 MHC restricted peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
4. Which positions are important for binding?

Simple motifs: Yes/No rules
10 MHC restricted peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Only 11 of 212 peptides identified!
Need more flexible rules
– If not fit P1 but fit P3 then ok
Not all positions are equally important
– We know that P2 and P9 determine binding more than other positions
Cannot discriminate between good and very good binders

Simple motifs: Yes/No rules
10 MHC restricted peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Example (the first two peptides will not fit the motif)
RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47
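One way to read the yes/no rule is "a residue is allowed at a position only if it was observed there in the training peptides". The sketch below assumes that reading (the slides do not spell out the exact rule); `allowed_residues` and `fits_motif` are hypothetical helper names.

```python
PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def allowed_residues(peptides):
    """For each position, the set of amino acids seen in the training peptides."""
    return [set(column) for column in zip(*peptides)]

def fits_motif(peptide, allowed):
    """Yes/no rule: every residue must have been observed at its position."""
    return len(peptide) == len(allowed) and all(
        aa in ok for aa, ok in zip(peptide, allowed)
    )

allowed = allowed_residues(PEPTIDES)
for query in ["RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"]:
    print(query, fits_motif(query, allowed))
```

Under this reading the first two query peptides are rejected (R was never seen at P1, N never at P5), which agrees with the slide; the rule gives no graded score, so it cannot separate good from very good binders.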

Extended motifs
Fitness of aa at each position given by P(aa)
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Example P1
– $P_A = 6/10$
– $P_G = 2/10$
– $P_T = P_K = 1/10$
– $P_C = P_D = \dots = P_V = 0$
Problems
– Few data
– Data redundancy/duplication
Example
RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47
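The P1 numbers above come from plain counting. A minimal sketch of that raw counting follows; `position_frequencies` is an illustrative name, and no weighting or pseudo counts are applied.

```python
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(peptides):
    """Raw amino acid frequencies per position; unseen residues are simply absent."""
    n = len(peptides)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*peptides)
    ]

freqs = position_frequencies(PEPTIDES)
p1 = freqs[0]
print(p1)  # A, G, T and K are the only residues with non-zero frequency at P1
```

Any residue missing from a dictionary has frequency 0, which is the "few data" problem listed on the slide: a single unseen residue would zero out a product of probabilities.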

Sequence information
Raw sequence counting
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Sequence weighting
Poor or biased sampling of sequence space
Similar sequences (weight 1/5 each): ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
Other peptides: GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Example P1
– $P_A = 2/6$
– $P_G = 2/6$
– $P_T = P_K = 1/6$
– $P_C = P_D = \dots = P_V = 0$
Example
RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47

Sequence weighting
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
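The weighted P1 frequencies (A = 2/6, G = 2/6, T = K = 1/6) can be reproduced with a simple clustering scheme. The sketch below is an assumption, not necessarily the weighting method used in the lecture: peptides within Hamming distance 1 of a cluster representative share one unit of weight. `cluster_weights` and `weighted_frequencies` are hypothetical names.

```python
from collections import defaultdict

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_weights(peptides, max_distance=1):
    """Greedy clustering: a peptide joins the first cluster whose representative
    (its first member) is within max_distance; each member of a cluster of
    size n then gets weight 1/n."""
    clusters = []
    for pep in peptides:
        for cluster in clusters:
            if hamming(pep, cluster[0]) <= max_distance:
                cluster.append(pep)
                break
        else:
            clusters.append([pep])
    return [(pep, 1.0 / len(cluster)) for cluster in clusters for pep in cluster]

def weighted_frequencies(weighted, position):
    """Weighted amino acid frequencies at one (0-based) position."""
    totals = defaultdict(float)
    for pep, weight in weighted:
        totals[pep[position]] += weight
    norm = sum(totals.values())
    return {aa: total / norm for aa, total in totals.items()}

weighted = cluster_weights(PEPTIDES)
p1_weighted = weighted_frequencies(weighted, 0)
print(p1_weighted)
```

The five ALAKAAAA* peptides collapse into one cluster, so they contribute a total weight of 1 instead of 5, and the effective number of sequences drops from 10 to 6.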

Pseudo counts
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
I is not found at position P9. Does this mean that I is forbidden (P(I) = 0)?
No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9

The Blosum matrix
(Table: 20 x 20 matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V; values not preserved)
Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I)

Pseudo count estimation
Calculate observed amino acid frequencies $f_a$
Pseudo frequency for amino acid b: $g_b = \sum_a f_a \, q(b|a)$, where $q(b|a)$ is the Blosum-derived probability of finding b given a
Example peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Weight on pseudo count
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Pseudo counts are important when only limited data is available
With large data sets only “true” observations should count
$p_a = \frac{\alpha f_a + \beta g_a}{\alpha + \beta}$
$\alpha$ is the effective number of sequences (N-1), $\beta$ is the weight on the prior

Weight on pseudo count
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Example
If $\alpha$ is large, $p \approx f$ and only the observed data defines the motif
If $\alpha$ is small, $p \approx g$ and the pseudo counts (or prior) define the motif
$\beta$ is normally in the range 50-200
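The pseudo count mixing can be sketched on a toy problem. Everything numeric below is hypothetical: a three-letter alphabet X, Y, Z and a made-up conditional matrix `q_cond` stand in for the 20 amino acids and the Blosum-derived q(b|a), which are not reproduced here. The formulas assumed are $g_b = \sum_a f_a q(b|a)$ and $p_a = (\alpha f_a + \beta g_a)/(\alpha + \beta)$.

```python
def pseudo_frequencies(f, q_cond):
    """g_b = sum_a f_a * q(b|a): frequency expected for b given what was observed."""
    return {b: sum(f[a] * q_cond[a][b] for a in f) for b in f}

def mix(f, g, alpha, beta):
    """p_a = (alpha * f_a + beta * g_a) / (alpha + beta)."""
    return {a: (alpha * f[a] + beta * g[a]) / (alpha + beta) for a in f}

# Toy observed frequencies: X was never seen at this position.
f = {"X": 0.0, "Y": 0.5, "Z": 0.5}

# Hypothetical conditional probabilities q(b|a); each row sums to 1.
# These are illustrative numbers, NOT values from a Blosum matrix.
q_cond = {
    "X": {"X": 0.6, "Y": 0.3, "Z": 0.1},
    "Y": {"X": 0.3, "Y": 0.6, "Z": 0.1},
    "Z": {"X": 0.1, "Y": 0.1, "Z": 0.8},
}

g = pseudo_frequencies(f, q_cond)
p_small_data = mix(f, g, alpha=9, beta=50)      # e.g. N = 10 sequences
p_large_data = mix(f, g, alpha=10000, beta=50)  # observations dominate

print(g)
print(p_small_data)
print(p_large_data)
```

With few sequences the unseen letter X gets a clearly non-zero probability because it is "similar" to the observed Y; with many sequences the estimate returns to the observed frequencies.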

Sequence weighting and pseudo counts
Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47
$P_{7P}$ and $P_{7S}$ > 0

Position specific weighting
We know that positions 2 and 9 are anchor positions for most MHC binding motifs
– Increase weight on high information positions
(Figure: motif found on large data set)

Weight matrices
Estimate amino acid frequencies from alignment including sequence weighting and pseudo count
(Table: frequency matrix with columns A R N D C Q E G H I L K M F P S T W Y V; values not preserved)
What do the numbers mean?
– P2(V) > P2(M). Does this mean that V enables binding more than M?
– In nature not all amino acids are found equally often
PA = 0.070, PW = …
Finding 6% A is hence not significant, but 6% W is highly significant
In nature V is found more often than M, so we must somehow rescale with the background

How to score a sequence to a probability matrix?
$p_{ij}$ describes a motif (i is the position, j the amino acid)
The probability that a peptide $a_1 \dots a_L$ fits the motif is $P(\text{pep} \mid M) = \prod_i p_{i a_i}$
The probability that the peptide fits a random model is $P(\text{pep} \mid N) = \prod_i q_{a_i}$
The ratio of the two gives the odds: $O = \prod_i p_{i a_i} / q_{a_i}$
The log gives the score: $S = \sum_i \log(p_{i a_i} / q_{a_i})$

Weight matrices
A weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$
– where i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
W is an L x 20 matrix, L is the motif length
(Table: weight matrix with columns A R N D C Q E G H I L K M F P S T W Y V; values not preserved)

Scoring a sequence to a weight matrix
Score sequences to the weight matrix by looking up and adding L values from the matrix
(Table: weight matrix with columns A R N D C Q E G H I L K M F P S T W Y V; values not preserved)
RLLDDTPEV
GLLGNVSTV
ALAKAAAAL
Which peptide is most likely to bind? Which peptide second?
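Scoring against a log-odds matrix is a lookup and a sum. The sketch below builds $W_{ij} = \log_2(p_{ij}/q_j)$ under simplifying assumptions that are not in the slides: raw counts, a uniform pseudo count of 1 per amino acid instead of Blosum-based pseudo counts, no sequence weighting, and a flat background $q_j = 1/20$.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def weight_matrix(peptides, pseudocount=1.0, background=None):
    """L x 20 log-odds matrix W[i][aa] = log2(p_i(aa) / q(aa)).
    Uniform pseudo counts and a flat background are simplifying assumptions."""
    if background is None:
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    n = len(peptides)
    denominator = n + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for column in zip(*peptides):
        counts = Counter(column)
        matrix.append({
            aa: math.log2((counts[aa] + pseudocount) / denominator / background[aa])
            for aa in AMINO_ACIDS
        })
    return matrix

def score(peptide, matrix):
    """Sum of the L matrix entries selected by the peptide."""
    return sum(row[aa] for aa, row in zip(peptide, matrix))

W = weight_matrix(PEPTIDES)
scores = {pep: score(pep, W) for pep in ["RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"]}
for pep, s in sorted(scores.items(), key=lambda item: -item[1]):
    print(pep, round(s, 2))
```

Because this simplified matrix uses neither sequence weighting nor Blosum pseudo counts, the five redundant ALAKAAAA* peptides dominate it, so its ranking should not be expected to match a matrix built with the full procedure.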

Example from real life
10 peptides from MHCpep database: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Bind to the MHC complex
Relevant for immune system recognition
Estimate sequence motif and weight matrix
Evaluate motif “correctness” on 528 peptides

Prediction accuracy
Pearson correlation 0.45

Predictive performance

End of first part
Take a deep breath
Smile to your neighbor

Hidden Markov Models
Weight matrices do not deal with insertions and deletions
In alignments, this is done in an ad hoc manner by optimizing the two gap penalties for first gap and gap extension
An HMM is a natural framework where insertions/deletions are dealt with explicitly

Why hidden?
The unfair casino: Loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1
Fair: 1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6
Loaded: 1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2
Model generates numbers
– Does not tell which die was used
Alignment (decoding) can give the most probable solution/path (Viterbi)
– FFFFFFLLLLLL
Or most probable set of states
– FFFFFFLLLLLL
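To see why the states are "hidden", it helps to generate rolls from the casino model and look only at the numbers. This is a minimal sketch using the probabilities on this slide; the 0.5/0.5 start probabilities are taken from the model file on the last slide, and `simulate` and `pick` are illustrative names.

```python
import random

START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},
}

def pick(distribution, rng):
    """Draw one key of a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]

def simulate(n_rolls, seed=0):
    """Return (rolls, hidden_path) for n_rolls throws in the unfair casino."""
    rng = random.Random(seed)
    state = pick(START, rng)
    rolls, path = [], []
    for _ in range(n_rolls):
        path.append(state)
        rolls.append(pick(EMIT[state], rng))
        state = pick(TRANS[state], rng)
    return rolls, "".join(path)

rolls, path = simulate(50, seed=1)
print("".join(map(str, rolls)))  # what an observer sees
print(path)                      # what the casino knows
```

An observer only gets the first printed line; decoding is the task of recovering something like the second line from it.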

HMM (a simple example)
Example from A. Krogh
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Core of alignment: the core region defines the number of states in the HMM (red)
Insertion and deletion statistics are derived from the non-core part of the alignment (black)

HMM construction
(Figure: HMM diagram with emission probabilities over A, C, G, T for each state and transition probabilities between states)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Insert state: 5 matches: A, 2xC, T, G
5 transitions in gap region
– C out, G out
– A-C, C-T, T out
– Out transition 3/5
– Stay transition 2/5
ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
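The counts on this slide can be derived directly from the alignment. The sketch below assumes that gap-free columns are the core (match) positions and that all gapped columns form one contiguous insert region, which holds for this example; the function names are illustrative.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]

def core_columns(alignment):
    """Indices of columns without gaps; these become match states."""
    return [i for i, col in enumerate(zip(*alignment)) if "-" not in col]

def match_emissions(alignment):
    """Emission frequencies of each match state, estimated by counting."""
    n = len(alignment)
    columns = list(zip(*alignment))
    return [
        {base: count / n for base, count in Counter(columns[i]).items()}
        for i in core_columns(alignment)
    ]

def insert_statistics(alignment):
    """Emission and transition estimates for the single insert region."""
    core = set(core_columns(alignment))
    insert_cols = [i for i in range(len(alignment[0])) if i not in core]
    inserted = [
        "".join(seq[i] for i in insert_cols if seq[i] != "-") for seq in alignment
    ]
    used = [s for s in inserted if s]           # sequences that enter the insert state
    residues = Counter("".join(used))
    total = sum(residues.values())
    stay = sum(len(s) - 1 for s in used)        # insert -> insert transitions
    leave = len(used)                           # insert -> next match transitions
    return {
        "emissions": {b: c / total for b, c in residues.items()},
        "enter": len(used) / len(alignment),
        "stay": stay / (stay + leave),
        "leave": leave / (stay + leave),
    }

emissions = match_emissions(ALIGNMENT)
insert = insert_statistics(ALIGNMENT)
print(emissions[0])  # first match state
print(insert)
```

The first match state comes out as A = 0.8, T = 0.2, and the insert state gets stay 2/5 and out 3/5, matching the factors used in the ACA---ATG product on the slide.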

Align sequence to HMM
ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
TCAACTATC: 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2
ACAC--AGC = 1.2x10^-2
AGA---ATC = 3.3x10^-2
ACCG--ATC = 0.59x10^-2
Consensus: ACAC--ATC = 4.7x10^-2, ACA---ATC = 13.1x10^-2
Exceptional: TGCT--AGG = … x10^-2

Align sequence to HMM - Null model
Score depends strongly on length
Null model is a random model. For length L the score is $0.25^L$
Log-odds score for sequence S: $\log(P(S)/0.25^L)$
Positive score means more likely than Null model
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus: ACAC--ATC = 6.7, ACA---ATC = 6.3
Exceptional: TGCT--AGG = … (Note!)
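The arithmetic for the two sequences whose factors are written out can be reproduced in a few lines. This sketch assumes natural logarithms, which is what reproduces the 4.9 and 3.0 shown above, and takes L as the number of residues (gaps excluded); `path_probability` and `log_odds` are illustrative names.

```python
import math

def path_probability(factors):
    """Product of the emission and transition probabilities along one path."""
    return math.prod(factors)

def log_odds(p_sequence, length, background=0.25):
    """log( P(S) / background**length ), natural log."""
    return math.log(p_sequence / background ** length)

# Factors copied from the slides: emissions and transitions, in path order.
aca_atg = [0.8, 1, 0.8, 1, 0.8, 0.4, 1, 1, 0.8, 1, 0.2]           # 6 residues
tcaactatc = [0.2, 1, 0.8, 1, 0.8, 0.6, 0.2, 0.4, 0.4, 0.4, 0.2,
             0.6, 1, 1, 0.8, 1, 0.8]                              # 9 residues

p_short = path_probability(aca_atg)
p_long = path_probability(tcaactatc)

print(p_short, log_odds(p_short, length=6))
print(p_long, log_odds(p_long, length=9))
```

The raw probabilities differ by more than two orders of magnitude mainly because the second sequence is longer; dividing by the null model removes most of that length effect.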

Model decoding (Viterbi)
The unfair casino, log model (log10 of the emission probabilities)
Fair: 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78
Loaded: 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3
(Figure: Viterbi table with rows Null, F and L for the example sequence; decoded path FFFFLLL)
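Viterbi decoding for the casino model can be written as a short dynamic program in log space. This is a sketch under stated assumptions: start probabilities 0.5/0.5 (as in the model file on the last slide), log10 scores as on this slide, and a non-empty roll sequence. The example sequence from the slide is not preserved, so the inputs below are made up.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},
}

def viterbi(rolls):
    """Most probable state path for a non-empty list of die rolls (1..6)."""
    # Best log10 score of any path ending in each state after the first roll.
    scores = {s: math.log10(START[s]) + math.log10(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in s at this step.
            prev = max(STATES, key=lambda r: scores[r] + math.log10(TRANS[r][s]))
            pointers[s] = prev
            new_scores[s] = (
                scores[prev] + math.log10(TRANS[prev][s]) + math.log10(EMIT[s][roll])
            )
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi([6] * 20))            # a long run of sixes
print(viterbi([1, 2, 3, 4, 5] * 4)) # no sixes at all
print(viterbi([1, 3, 2, 4, 5, 2, 6, 6, 6, 6, 6, 6, 6, 6]))
```

Working with sums of logs instead of products of probabilities avoids numerical underflow on long sequences and mirrors the log model shown on the slide.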

HMMs and weight matrices
In the case of un-gapped alignments HMMs become simple weight matrices
To achieve high performance, the emission frequencies are estimated using the techniques of
– Sequence weighting
– Pseudo counts

Profile HMMs
Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner
Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)
Profile HMMs are ideally suited to describe such position specific variations

Profile HMMs
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Annotations: Conserved; Non-conserved; Insertion; Deletion
Core: Position with < 2 gaps

HMM vs. alignment
Detailed description of core
– Conserved/variable positions
Price for insertions/deletions varies at different locations in sequence
These features cannot be captured in conventional alignments

Profile HMMs
(Figure: profile HMM architecture)
All M/D pairs must be visited once

Example: Sequence profiles
Alignment of protein sequences 1PLC._ and 1GYC.A
– E-value > 1000
Profile alignment
– Align 1PLC._ against Swiss-prot
– Make position specific weight matrix from alignment
– Use this matrix to align 1PLC._ against 1GYC.A
– E-value < …
Rmsd = 3.3

Example continued
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
F + G++ N+ + +G + +
Sbjct: VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F G + ++ G+ G V
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd = 3.3 Å
(Figure: superimposed structures, model in red, template in blue)

TMHMM (trans-membrane HMM)
(Sonnhammer, von Heijne, and Krogh)
Model TM length distribution. Easy in HMM. Difficult in alignment.
Difference in amino acid composition. Easy in HMM. Difficult in alignment.

HMM packages
HMMER
– S.R. Eddy, WashU St. Louis. Freely available.
SAM
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
META-MEME
– William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
NET-ID, HMMpro
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.

The unfair casino: Loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1
Fair: 1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6
Loaded: 1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2
Model file for trainanhmm (Copyright (C) 1998 by Anders Krogh):
    Header {alphabet ;}
    begin { trans Fair:0.5 Loaded:0.5; }
    Fair { trans Fair:0.95 Loaded:0.05; }
    Loaded { trans Fair:0.1 Loaded:0.9; letter 6:0.5; }