Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs). Morten Nielsen, Department of Systems Biology, DTU.

Objectives
Introduce hidden Markov models and understand that they are just weight matrices with gaps.
How to construct an HMM.
How to align/score sequences against an HMM: Viterbi decoding, Forward decoding, Backward decoding, Posterior decoding.
How to use and construct a profile HMM (HMMer).

Markov Chains
States: three states - Sunny, Cloudy, Rainy.
State transition matrix: the probability of the weather given the previous day's weather.
Initial distribution: the probability of the system being in each of the states at time 0.
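To make the definition concrete, here is a minimal Python sketch (not part of the original slides) of such a weather Markov chain. The numeric transition probabilities are illustrative placeholders, since the slide only shows the state diagram.

import random

states = ["Sunny", "Cloudy", "Rainy"]

# Illustrative transition matrix: P(next day's weather | today's weather).
# Each row sums to 1; the actual numbers from the slide's diagram are not reproduced here.
transition = {
    "Sunny":  {"Sunny": 0.7, "Cloudy": 0.2, "Rainy": 0.1},
    "Cloudy": {"Sunny": 0.3, "Cloudy": 0.4, "Rainy": 0.3},
    "Rainy":  {"Sunny": 0.2, "Cloudy": 0.3, "Rainy": 0.5},
}

# Initial distribution: probability of each state at time 0.
initial = {"Sunny": 1 / 3, "Cloudy": 1 / 3, "Rainy": 1 / 3}

def simulate(n_days):
    """Sample a weather sequence from the Markov chain."""
    day = random.choices(states, weights=[initial[s] for s in states])[0]
    path = [day]
    for _ in range(n_days - 1):
        day = random.choices(states, weights=[transition[day][s] for s in states])[0]
        path.append(day)
    return path

print(simulate(7))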

Hidden Markov Models
Hidden states: the (true) states of a system that may be described by a Markov process (e.g., the weather).
Observable states: the states of the process that are 'visible' (e.g., seaweed dampness).
Transition probabilities connect the hidden states to each other; emission probabilities connect each hidden state to the observable states.

TMHMM (trans-membrane HMM) (Sonnhammer, von Heijne, and Krogh). The model has extracellular, trans-membrane, and intracellular states.

TMHMM (trans-membrane HMM) (Sonnhammer, von Heijne, and Krogh). The HMM models the trans-membrane segment length distribution; this is a power of HMMs that is difficult to capture in a conventional alignment. Example trans-membrane segment: ALLYVDWQILPVIL.

Gene finding (figure removed).

Weight matrix construction SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

PSSM construction
Calculate amino acid frequencies at each position using sequence weighting and pseudo counts.
Define a background model using background amino acid frequencies.
The PSSM score is the log-odds ratio of the observed (position-specific) frequency to the background frequency, log(p/q).
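A minimal sketch of this construction, using a few of the 9-mer peptides from the previous slide. The flat background and the simple pseudo-count scheme (parameter beta) are stand-ins for the sequence-weighting and Blosum-based pseudo counts used in the course, and the function names are illustrative.

import math
from collections import Counter

AAS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = {aa: 1 / 20 for aa in AAS}   # flat background; a real PSSM would use database frequencies

def build_pssm(peptides, beta=1.0):
    """Log-odds PSSM from equal-length peptides, with simple pseudo counts."""
    n = len(peptides)
    matrix = []
    for pos in range(len(peptides[0])):
        counts = Counter(p[pos] for p in peptides)
        # observed counts mixed with background pseudo counts, then log-odds against the background
        matrix.append({aa: math.log((counts[aa] + beta * BACKGROUND[aa]) / (n + beta) / BACKGROUND[aa])
                       for aa in AAS})
    return matrix

def score(peptide, matrix):
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

peptides = ["SLLPAIVEL", "YLLPAIVHI", "TLWVDPYEV", "GLVPFLVSV"]   # first few 9-mers from the slide
pssm = build_pssm(peptides)
print(round(score("SLLPAIVEL", pssm), 2))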

More on scoring. The score compares the probability of the observation given the model with the probability of the observation given the prior (background):
score = log [ P(observation | Model) / P(observation | Prior) ]

Hidden Markov Models
Weight matrices do not deal with insertions and deletions.
In alignments this is handled in an ad hoc manner, by optimizing two gap penalties: one for opening a gap and one for extending it.
An HMM is a natural framework where insertions and deletions are dealt with explicitly.

Multiple sequence alignment - learning from evolution
QUERY  HAMDIRCYHSGG-PLHL-GEI-EDFNGQSCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q8K2P6 HAMDIRCYHSGG-PLHL-GEI-EDFNGQSCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q8TAC1 HAMDIRCYHSGG-PLHL-GDI-EDFDGRPCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q07947 FAVQDTCTHGDW-ALSE-GYL-DGD----VVECTLHFGKFCVRTGK--VKAL------PA
P0A185 YATDNLCTHGSA-RMSD-GYL-EGRE----IECPLHQGRFDVCTGK--ALC------APV
P0A186 YATDNLCTHGSA-RMSD-GYL-EGRE----IECPLHQGRFDVCTGK--ALC------APV
Q51493 YATDNLCTHGAA-RMSD-GFL-EGRE----IECPLHQGRFDVCTGR--ALC------APV
A5W4F0 FAVQDTCTHGDW-ALSD-GYL-DGD----IVECTLHFGKFCVRTGK--VKAL------PA
P0C620 FAVQDTCTHGDW-ALSD-GYL-DGD----IVECTLHFGKFCVRTGK--VKAL------PA
P08086 FAVQDTCTHGDW-ALSD-GYL-DGD----IVECTLHFGKFCVRTGK--VKAL------PA
Q52440 FATQDQCTHGEW-SLSE-GGY-LDGD---VVECSLHMGKFCVRTGK-------------V
Q7N4V8 FAVDDRCSHGNA-SISE-GYL-ED---NATVECPLHTASFCLRTGK--ALCL------PA
P37332 FATQDRCTHGDW-SLSDGGYL-EGD----VVECSLHMGKFCVRTGK-------------V
A7ZPY3 YAINDRCSHGNA-SMSE-GYL-EDD---ATVECPLHAASFCLKTGK--ALCL------PA
P0ABW1 YAINDRCSHGNA-SMSE-GYL-EDD---ATVECPLHAASFCLKTGK--ALCL------PA
A8A346 YAINDRCSHGNA-SMSE-GYL-EDD---ATVECPLHAASFCLKTGK--ALCL------PA
P0ABW0 YAINDRCSHGNA-SMSE-GYL-EDD---ATVECPLHAASFCLKTGK--ALCL------PA
P0ABW2 YAINDRCSHGNA-SMSE-GYL-EDD---ATVECPLHAASFCLKTGK--ALCL------PA
Q3YZ13 YAINDRCSHGNA-SMSE-GYL-EDD---ATVECPLHAASFCLKTGK--ALCL------PA
Q06458 YALDNLEPGSEANVLSR-GLL-GDAGGEPIVISPLYKQRIRLRDG---------------
(Core, insertion, and deletion regions are marked in the original figure.)

Why hidden? The unfair casino: the loaded die has p(6) = 0.5; the probability of switching from fair to loaded is 0.05, and from loaded to fair 0.1. The model generates a series of numbers, e.g. 312453666641, but does not tell which die was used. Decoding (alignment) can give the most probable solution/path, e.g. FFFFFFLLLLLL (Viterbi), or the most probable state at each position.
The model: the Fair die emits 1-6 each with probability 1/6; the Loaded die emits 1-5 each with probability 1/10 and 6 with probability 1/2; transitions are fair-to-fair 0.95, fair-to-loaded 0.05, loaded-to-loaded 0.9, loaded-to-fair 0.1.
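A minimal sketch of this casino model in Python; it samples both the hidden die labels and the visible rolls, although an observer would only see the rolls. The 0.5/0.5 start probabilities are an assumption (the Viterbi slides later use the same values).

import random

TRANSITION = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMISSION = {"F": {r: 1 / 6 for r in "123456"},
            "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}   # assumed initial probabilities

def generate(n):
    """Return (hidden state path, observed rolls) sampled from the casino HMM."""
    state = random.choices("FL", weights=[START[s] for s in "FL"])[0]
    path, rolls = [], []
    for _ in range(n):
        path.append(state)
        rolls.append(random.choices("123456", weights=[EMISSION[state][r] for r in "123456"])[0])
        state = random.choices("FL", weights=[TRANSITION[state][s] for s in "FL"])[0]
    return "".join(path), "".join(rolls)

path, rolls = generate(12)
print(rolls)   # what the observer sees
print(path)    # the hidden truth, e.g. a run of F's and a run of L's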

HMM (a simple example), from A. Krogh.
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
The core region of the alignment (red in the slide) defines the number of match states in the HMM; insertion and deletion statistics are derived from the non-core part of the alignment (black).

HMM construction (supervised learning)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
In the gap (insert) region there are 5 emitted symbols (A, 2xC, T, G) and 5 transitions (C out, G out, A-C, C-T, T out): out transitions 3/5, stay transitions 2/5. All other emission and transition probabilities are likewise estimated by counting; for example the first match column gives A 0.8 and T 0.2, and the insert state emits A 0.2, C 0.4, G 0.2, T 0.2. (The slide shows the resulting HMM diagram with all A/C/G/T frequencies.) Scoring the first sequence along its path:
ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3x10^-2
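A rough counting sketch of this supervised estimation. It assumes, as the slide does, that alignment columns 1-3 and 7-9 are core (match) columns and columns 4-6 form the insert region; with that assumption it recovers the 0.8/0.2 match emissions, the 0.2/0.4/0.2/0.2 insert emissions, and the 3/5 out and 2/5 stay insert transitions quoted above.

from collections import Counter

alignment = ["ACA---ATG",
             "TCAACTATC",
             "ACAC--AGC",
             "AGA---ATC",
             "ACCG--ATC"]

match_cols = [0, 1, 2, 6, 7, 8]      # core columns (red in the slide), 0-based
insert_cols = [3, 4, 5]              # non-core region (black in the slide)
n = len(alignment)

# Match-state emissions: one nucleotide distribution per core column.
match_emissions = [{nt: Counter(seq[col] for seq in alignment)[nt] / n for nt in "ACGT"}
                   for col in match_cols]

# Insert-state emissions: pooled over all non-gap symbols in the insert region.
inserted = [seq[c] for seq in alignment for c in insert_cols if seq[c] != "-"]
insert_emission = {nt: inserted.count(nt) / len(inserted) for nt in "ACGT"}

# Insert-state transitions: each inserted symbol is followed either by another
# insertion ("stay") or by the next match state ("out").
stay = sum(1 for seq in alignment for i, c in enumerate(insert_cols[:-1])
           if seq[c] != "-" and seq[insert_cols[i + 1]] != "-")
total = len(inserted)                 # every inserted symbol makes exactly one transition

print("match -> insert:", sum(1 for seq in alignment if seq[3] != "-") / n)     # 0.6
print("insert stay:", stay / total, "insert out:", (total - stay) / total)      # 0.4 and 0.6
print("first match column:", match_emissions[0])                                # A 0.8, T 0.2
print("insert emissions:", insert_emission)                                     # A .2, C .4, G .2, T .2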

Scoring a sequence to an HMM
ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
TCAACTATC: 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2
ACAC--AGC = 1.2x10^-2
Consensus: ACAC--ATC = 4.7x10^-2, ACA---ATC = 13.1x10^-2
Exceptional: TGCT--AGG = 0.0023x10^-2

Aligning a sequence to an HMM - the null model
The raw score depends strongly on the sequence length. The null model is a random model: for a sequence of length L its probability is 0.25^L. The log-odds score for a sequence S is log( P(S|M) / 0.25^L ); a positive score means the sequence is more likely under the model than under the null model. This is just like the PSSM log(p/q) score.
Raw probabilities: ACA---ATG = 3.3x10^-2, TCAACTATC = 0.0075x10^-2, ACAC--AGC = 1.2x10^-2, ACAC--ATC (consensus) = 4.7x10^-2, ACA---ATC (consensus) = 13.1x10^-2, TGCT--AGG (exceptional) = 0.0023x10^-2.
Log-odds scores: ACA---ATG = 4.9, TCAACTATC = 3.0, ACAC--AGC = 5.3, AGA---ATC = 4.9, ACCG--ATC = 4.6, ACAC--ATC (consensus) = 6.7, ACA---ATC (consensus) = 6.3, TGCT--AGG (exceptional) = -0.97.
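A small check of that arithmetic (a sketch, not from the slides): using natural logs and counting only the emitted nucleotides (gaps excluded) as L reproduces the log-odds values above, up to rounding of the input probabilities.

import math

def log_odds(p_model, sequence):
    """Log-odds of a sequence versus a uniform 0.25-per-nucleotide null model (natural log)."""
    L = sum(1 for c in sequence if c != "-")      # gaps are not emitted symbols
    return math.log(p_model / 0.25 ** L)

for seq, p in [("ACA---ATG", 3.3e-2), ("TCAACTATC", 0.0075e-2),
               ("ACAC--AGC", 1.2e-2), ("TGCT--AGG", 0.0023e-2)]:
    print(seq, round(log_odds(p, seq), 1))
# prints roughly 4.9, 3.0, 5.3 and -1.0 (-0.97 on the slide); small differences
# come from the rounded probabilities used as input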

Aligning a sequence to an HMM
Find the path through the HMM states that has the highest probability. For pairwise alignment we found the path through the scoring matrix that had the highest sum of scores. The number of possible paths rapidly gets very large, making brute-force search infeasible, just as checking all paths did not work for alignment. Use dynamic programming: the Viterbi algorithm does the job.

The Viterbi algorithm. The unfair casino again: the loaded die has p(6) = 0.5; switching from fair to loaded has probability 0.05 and from loaded to fair 0.1. The Fair die emits 1-6 each with probability 1/6; the Loaded die emits 1-5 each with probability 1/10 and 6 with probability 1/2. The model generates a series of numbers, e.g. 312453666641.

Model decoding (Viterbi). Example: 566. What was the most likely series of dice used to generate this output? Brute force over all 2^3 state paths:
FFF = 0.5*0.167*0.95*0.167*0.95*0.167 = 0.0021
FFL = 0.5*0.167*0.95*0.167*0.05*0.5 = 0.00033
FLF = 0.5*0.167*0.05*0.5*0.1*0.167 = 0.000035
FLL = 0.5*0.167*0.05*0.5*0.9*0.5 = 0.00094
LFF = 0.5*0.1*0.1*0.167*0.95*0.167 = 0.00013
LFL = 0.5*0.1*0.1*0.167*0.05*0.5 = 0.000021
LLF = 0.5*0.1*0.9*0.5*0.1*0.167 = 0.00038
LLL = 0.5*0.1*0.9*0.5*0.9*0.5 = 0.0101
LLL is the most probable path.
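The same enumeration as a short sketch, with the 0.5 start factor used in the products above. Brute force scales as 2^L in the sequence length, which is why the dynamic-programming solution on the following slides is needed.

from itertools import product

TRANSITION = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMISSION = {"F": {r: 1 / 6 for r in "123456"},
            "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}

def path_probability(path, rolls):
    """Joint probability of a hidden-state path and the observed rolls."""
    p = START[path[0]] * EMISSION[path[0]][rolls[0]]
    for prev, cur, roll in zip(path, path[1:], rolls[1:]):
        p *= TRANSITION[prev][cur] * EMISSION[cur][roll]
    return p

rolls = "566"
for path in product("FL", repeat=len(rolls)):
    print("".join(path), f"{path_probability(path, rolls):.2e}")   # LLL comes out on top (~1.0e-02)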

Or in log space. Example: 566. The log (base 10) model: Fair emissions 1-6: -0.78 each; Loaded emissions 1-5: -1 each, 6: -0.3; transitions: fair-to-fair -0.02, fair-to-loaded -1.3, loaded-to-fair -1, loaded-to-loaded -0.046.
Log(P(LLL|M)) = log(0.5*0.1*0.9*0.5*0.9*0.5) = log(0.0101), or equivalently
Log(P(LLL|M)) = log(0.5)+log(0.1)+log(0.9)+log(0.5)+log(0.9)+log(0.5) = -0.3 - 1 - 0.046 - 0.3 - 0.046 - 0.3 = -1.99

Model decoding (Viterbi). Example: 566611234. What was the most likely series of dice used to generate this output?
First column (observation 5): F = 0.5*0.167, log(F) = log(0.5) + log(0.167) = -1.08; L = 0.5*0.1, log(L) = log(0.5) + log(0.1) = -1.30.
Table so far (observations 5 6 6 6 1 1 2 3 4): F: -1.08; L: -1.30.

Model decoding (Viterbi). Second column (observation 6), considering every way of reaching each state:
FF = 0.5*0.167*0.95*0.167, Log(FF) = -0.30 - 0.78 - 0.02 - 0.78 = -1.88
LF = 0.5*0.1*0.1*0.167, Log(LF) = -0.30 - 1 - 1 - 0.78 = -3.08
FL = 0.5*0.167*0.05*0.5, Log(FL) = -0.30 - 0.78 - 1.30 - 0.30 = -2.68
LL = 0.5*0.1*0.9*0.5, Log(LL) = -0.30 - 1 - 0.046 - 0.3 = -1.65
The best way to reach F is F-F (-1.88) and the best way to reach L is L-L (-1.65).

Model decoding (Viterbi). The best scores are entered in the second column of the table: F: -1.08, -1.88; L: -1.30, -1.65.

Model decoding (Viterbi). Third column (observation 6): of the eight three-state paths, the best one ending in F is FFF = 0.5*0.167*0.95*0.167*0.95*0.167 = 0.0021, and the best one ending in L is LLL = 0.5*0.1*0.9*0.5*0.9*0.5 = 0.0101; the remaining six paths have lower probability.

Model decoding (Viterbi). In log space: Log(P(FFF)) = -2.68 and Log(P(LLL)) = -1.99.

Model decoding (Viterbi). The table is extended with the third column: F: -1.08, -1.88, -2.68; L: -1.30, -1.65, -1.99.

Model decoding (Viterbi). The fourth column is filled in the same way; for F it gives -3.48: F: -1.08, -1.88, -2.68, -3.48; L: -1.30, -1.65, -1.99.

Model decoding (Viterbi). Now we can formalize the algorithm. In log space, the score of state l at the next position is the new match (emission) term plus the best combination of an old max score and a transition:
V_l(i+1) = log e_l(x_{i+1}) + max_k [ V_k(i) + log a_kl ]
where e_l(x) is the emission probability of symbol x in state l and a_kl is the transition probability from state k to state l.

Model decoding (Viterbi) - can you do it? Example: 566611234. What was the most likely series of dice used to generate this output? Fill out the table using the Viterbi recursion, add the arrows for backtracking, and find the optimal path.
Table so far: F: -1.08, -1.88, -2.68, -3.48; L: -1.30, -1.65, -1.99.

Model decoding (Viterbi) - can you do it? Fill out the rest of the table using the Viterbi recursion (the slide shows an intermediate, partially filled table).

Model decoding (Viterbi) - can you do it? Example: 566611234. What was the most likely series of dice used to generate this output? Fill out the table using the Viterbi recursive algorithm, add the arrows for backtracking, and find the optimal path. The completed table:
obs:   5      6      6      6      1      1      2      3      4
F:   -1.08  -1.88  -2.68  -3.48  -4.12  -4.92  -5.72  -6.52  -7.33
L:   -1.30  -1.65  -1.99  -2.34  -3.39  -4.44  -5.49  -6.53  -7.57

Model decoding (Viterbi). Example: 566611234. Backtracking from the best score in the last column gives the most likely path: LLLLFFFFF.
obs:   5      6      6      6      1      1      2      3      4
F:   -1.08  -1.88  -2.68  -3.48  -4.12  -4.92  -5.73  -6.53  -7.33
L:   -1.30  -1.65  -1.99  -2.34  -3.39  -4.44  -5.48  -6.52  -7.57
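A compact Viterbi sketch in Python, working in log10 as the tables above do; run on 566611234 with the casino model it returns the path LLLLFFFFF and, up to rounding, the F/L scores in the table. Variable names are illustrative.

import math

STATES = "FL"
START = {"F": 0.5, "L": 0.5}
TRANSITION = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMISSION = {"F": {r: 1 / 6 for r in "123456"},
            "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def viterbi(rolls):
    """Most probable hidden-state path, computed in log10 space with backtracking."""
    log = math.log10
    V = [{s: log(START[s]) + log(EMISSION[s][rolls[0]]) for s in STATES}]
    back = []
    for x in rolls[1:]:
        column, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda k: V[-1][k] + log(TRANSITION[k][s]))
            column[s] = V[-1][best_prev] + log(TRANSITION[best_prev][s]) + log(EMISSION[s][x])
            pointers[s] = best_prev
        V.append(column)
        back.append(pointers)
    state = max(STATES, key=lambda s: V[-1][s])       # best final state
    path = [state]
    for pointers in reversed(back):                   # follow the backtracking arrows
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), V

path, V = viterbi("566611234")
print(path)                                           # LLLLFFFFF
print([round(col["F"], 2) for col in V])              # -1.08, -1.88, ... (F row of the table)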

Model decoding (Viterbi). What happens if you have three dice? The table simply gets one row per state, e.g. F, L1, and L2 (with starting scores such as -1.0, -1.2, -1.3 on the slide); the recursion is unchanged.

And if you have a trans-membrane model? The question becomes: what is the most likely path (alignment) of a protein sequence (e.g. D G V L I M A Q ...) through the model? The same table is filled in, now with one row per model state (e.g. intra-cellular iC, trans-membrane, and extra-cellular xC).

The Forward algorithm. The Viterbi algorithm finds the most probable path giving rise to a given sequence. Another interesting question is: what is the probability that a given sequence can be generated by the hidden Markov model? This is calculated by summing over all paths giving rise to the sequence.

The Forward algorithm calculates the summed probability over all paths giving rise to a given sequence. The number of possible paths is very large, making (once more) brute-force calculation infeasible; use dynamic (recursive) programming.

The Forward algorithm. Say we know f_k(i), the probability of generating the sequence up to and including x_i and ending in state k. Then the probability of generating the sequence up to and including element i+1 and ending in state l is
f_l(i+1) = p_l(x_{i+1}) * sum_k f_k(i) * a_kl
where p_l(x_{i+1}) is the probability of observing x_{i+1} in state l, and a_kl is the transition probability from state k to state l. The total probability of the sequence is then P(x) = sum_k f_k(L).

Forward algorithm. The same table layout as for Viterbi is used (observations 5 6 6 6 1 1 2 3 4, one row per state F and L), but now it holds summed probabilities rather than log scores of single best paths.

Forward algorithm. First column (observation 5): f_F(1) = 0.5 x 1/6 = 8.3x10^-2 and f_L(1) = 0.5 x 0.1 = 5x10^-2.

Forward algorithm. Second column (observation 6): f_F(2) = 1/6 x (0.083 x 0.95 + 0.05 x 0.1) = 1.4x10^-2 (and, in the same way, f_L(2) = 0.5 x (0.083 x 0.05 + 0.05 x 0.9) = 2.46x10^-2).

Forward algorithm. Can you do it yourself? Fill out the empty cells in the table. What is P(x)?
obs:   5        6        6        6        1        1        2        3        4
F:   8.30e-2     ?     2.63e-3  6.08e-4  1.82e-4  3.66e-5     ?     1.09e-6  1.79e-7
L:   5.00e-2  2.46e-2  1.14e-2     ?     4.71e-4  4.33e-5     ?     4.00e-7  4.14e-8

Forward algorithm. The completed table:
obs:   5        6        6        6        1        1        2        3        4
F:   8.30e-2  1.40e-2  2.63e-3  6.08e-4  1.82e-4  3.66e-5  6.48e-6  1.09e-6  1.79e-7
L:   5.00e-2  2.46e-2  1.14e-2  5.20e-3  4.71e-4  4.33e-5  4.08e-6  4.00e-7  4.14e-8
P(x) is the sum of the last column: 1.79e-7 + 4.14e-8 = 2.2e-7.
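A forward-algorithm sketch that should reproduce this table (up to rounding) and P(x) as the sum of the last column; the model dictionaries are the same as in the Viterbi sketch, restated here so the snippet runs on its own.

TRANSITION = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMISSION = {"F": {r: 1 / 6 for r in "123456"},
            "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}

def forward(rolls, states="FL"):
    """Forward matrix: f[i][k] = P(x_1..x_i, state at position i is k)."""
    f = [{k: START[k] * EMISSION[k][rolls[0]] for k in states}]
    for x in rolls[1:]:
        prev = f[-1]
        f.append({l: EMISSION[l][x] * sum(prev[k] * TRANSITION[k][l] for k in states)
                  for l in states})
    return f

f = forward("566611234")
print(f"P(x) = {sum(f[-1].values()):.2e}")        # about 2.2e-07
print([f"{col['F']:.2e}" for col in f])           # 8.33e-02, 1.40e-02, ... (F row of the table)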

Posterior decoding (the Backward algorithm). Another interesting question is: what is the probability that observation x_i came from state k, given the observed sequence x?

The Backward algorithm. The probability of generating the sequence up to and including x_i, ending in state k, is given by the forward algorithm. The probability of generating the rest of the sequence, starting from state k at position i, is given by the backward algorithm (the '?' cell in the slide's table).

The Backward algorithm.
Forward: f_k(i) = P(x_1..x_i, state_i = k).
Backward: b_k(i) = P(x_{i+1}..x_L | state_i = k), computed with the recursion b_k(i) = sum_l a_kl * p_l(x_{i+1}) * b_l(i+1), starting from b_k(L) = 1.
Forward/Backward: f_k(i) * b_k(i) = P(x, state_i = k).

Backward algorithm. The recursion starts at the right-hand end of the table: with b_F(9) = b_L(9) = 1, the next-to-last column is b_F(8) = 0.95 x 1/6 + 0.05 x 0.1 = 0.163 and b_L(8) = 0.1 x 1/6 + 0.9 x 0.1 = 0.107.

Backward algorithm. The completed table (positions 1-8; position 9 is 1 for both states):
F:   6.83e-7  3.18e-6  1.76e-5  1.07e-4  6.69e-5  4.19e-3  2.61e-2  0.163
L:   3.27e-6  7.14e-6  1.52e-5  2.98e-5  2.08e-4  1.54e-3  1.23e-2  0.107
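A matching backward-algorithm sketch for the same casino model; its next-to-last column gives the 0.163/0.107 values shown in the table.

TRANSITION = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMISSION = {"F": {r: 1 / 6 for r in "123456"},
            "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def backward(rolls, states="FL"):
    """Backward matrix: b[i][k] = P(x_{i+1}..x_L | state at position i is k); last column all 1s."""
    b = [{k: 1.0 for k in states}]
    for x in reversed(rolls[1:]):                    # walk from the end of the sequence towards the start
        nxt = b[0]
        b.insert(0, {k: sum(TRANSITION[k][l] * EMISSION[l][x] * nxt[l] for l in states)
                     for k in states})
    return b

b = backward("566611234")
print(f"b_F(8) = {b[-2]['F']:.3f}, b_L(8) = {b[-2]['L']:.3f}")    # 0.163 and 0.107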

Backward algorithm. Note that the sum of the first column of the backward matrix is NOT equal to the sum of the last column of the forward matrix. This is because the first column of the backward matrix gives the probability of generating the sequence AFTER the first observation, conditioned on the state at position 1. You therefore cannot read P(x) directly off the backward matrix.

Posterior decoding. What is the posterior probability that observation x_i came from state k, given the observed sequence x? Or: what is the probability that a given amino acid is part of a trans-membrane helix, given that the protein sequence is x?

Posterior decoding. The posterior probability that observation x_i came from state k, given the observed sequence x, is
P(state_i = k | x) = f_k(i) * b_k(i) / P(x)
i.e. the product of the forward and backward values at that position, divided by the total probability of the sequence.
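A short sketch of that formula, assuming the forward() and backward() functions (and model dictionaries) from the previous sketches are in scope.

def posterior(rolls, states="FL"):
    """Posterior P(state at position i is k | whole sequence x) for every position i."""
    f, b = forward(rolls), backward(rolls)
    p_x = sum(f[-1].values())                        # P(x) from the forward matrix
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(rolls))]

for pos, column in enumerate(posterior("566611234"), start=1):
    print(pos, f"P(Loaded) = {column['L']:.2f}")     # each column sums to 1 over the two states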

Posterior decoding. The probability is context dependent: the posterior at a given position depends on the entire observed sequence, not only on the symbol emitted at that position.

Training of HMMs
Supervised training: if each symbol is assigned to one state, all parameters can be found by simply counting the numbers of emissions and transitions, as we did for the DNA model.
Un-supervised training: we do not know to which state each symbol is assigned, so another approach is needed - find the emission and transition parameters that most likely produce the observed data. The Baum-Welch algorithm does this.

Supervised learning: as in the HMM construction example above - the emission and transition probabilities are estimated by counting over the aligned DNA sequences.

Un-supervised learning. Training data (observed rolls only): 312453666641456667543 and 5666663331234. The model parameters - the transition probabilities a11, a12, a21, a22 and the emission probabilities e1(1)..e1(6) for the Fair state and e2(1)..e2(6) for the Loaded state - are unknown and must be estimated from the data.

Baum-Welch. The expected number of times state k emits each symbol, and the expected number of k-to-l transitions, are computed from the forward and backward matrices, summing over the symbols in each sequence j and then over all training sequences; normalizing these expected counts gives the new emission and transition probabilities, and the procedure is iterated.
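A minimal single-sequence, single-iteration sketch of these updates, following the standard forward-backward (Durbin et al.) formulas. It assumes the forward() and backward() functions and the TRANSITION/EMISSION dictionaries from the sketches above are in scope, and uses one of the training sequences from the previous slide.

def baum_welch_step(rolls, states="FL", trans=TRANSITION, emit=EMISSION):
    """One re-estimation of the transition and emission probabilities from a single sequence."""
    f, b = forward(rolls), backward(rolls)
    p_x = sum(f[-1].values())

    # Expected transition counts A[k][l] and expected emission counts E[k][symbol].
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in "123456"} for k in states}
    for i, x in enumerate(rolls):
        for k in states:
            E[k][x] += f[i][k] * b[i][k] / p_x
            if i + 1 < len(rolls):
                for l in states:
                    A[k][l] += f[i][k] * trans[k][l] * emit[l][rolls[i + 1]] * b[i + 1][l] / p_x

    # Normalize the expected counts into new probability estimates.
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit = {k: {s: E[k][s] / sum(E[k].values()) for s in "123456"} for k in states}
    return new_trans, new_emit

new_trans, new_emit = baum_welch_step("312453666641456667543")
print(new_emit["L"])     # re-estimated Loaded-die emission probabilities after one iteration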

HMMs and weight matrices. In the case of un-gapped alignments, HMMs become simple weight matrices. To achieve high performance, the emission frequencies are estimated using sequence weighting and pseudo counts.

Profile HMMs. Alignments based on conventional scoring matrices (e.g. BLOSUM62) score all positions in a sequence in an equal manner. But some positions are highly conserved and some are highly variable (more so than described by the BLOSUM matrix). Profile HMMs are ideally suited to describe such position-specific variation.

Sequence profiles (conserved and non-conserved positions, insertions, and deletions):
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
At a conserved position, matching anything but the conserved residue (here G) gives a large negative score; at a non-conserved position, anything can match.

HMM vs. alignment. A profile HMM gives a detailed description of the core, distinguishes conserved from variable positions, and lets the price for insertions and deletions vary at different locations in the sequence. These features cannot be captured in conventional alignments.

Profile HMMs. When a sequence is aligned to the model, all match/delete (P/D) pairs must be visited exactly once. In the slide's example the sequence LYAVRI is threaded through a seven-position profile as L->P1, D2 (deletion), Y->P3, A->P4, V->I4 (insertion), R->P5, D6 (deletion), I->P7.

Profile HMM Un-gapped profile HMM is just a sequence profile

Un-gapped profile HMM is just a sequence profile: a chain of match states P1..P7, each with its own emission probabilities (e.g. V: 0.08, Y: 0.01), and all transition probabilities a_kl = 1.0.

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Profile HMM (deletions and insertions)

Profile HMM (deletions and insertions): the same multiple sequence alignment as shown earlier (QUERY, Q8K2P6, ...), with core, insertion, and deletion regions marked.

The HMMer program. HMMer is an open-source program suite for profile HMM analysis of biological sequences. It is used to make the Pfam database of protein families (http://pfam.sanger.ac.uk/).

An HMMer example, from the CASP8 competition. What is the best PDB template for building a model for the sequence T0391?
>T0391 rieske ferredoxin, mouse, 157 residues
SDPEISEQDEEKKKYTSVCVGREEDIRKSERMTAVVHDREVVIFYHKGEYHAMDIRCYHS
GGPLHLGEIEDFNGQSCIVCPWHKYKITLATGEGLYQSINPKDPSAKPKWCSKGVKQRIH
TVKVDNGNIYVTLSKEPFKCDSDYYATGEFKVIQSSS

An HMMer example. What is the best PDB template for building a model for the sequence T0391? Using Blast: no hits. Alternatives: use Psi-Blast, or use HMMer.

An HMMer example, using HMMer: make a multiple alignment using Blast, build a model using hmmbuild, and find a PDB template using hmmsearch.

An HMMer example.
Make a multiple alignment using Blast:
blastpgp -j 3 -e 0.001 -m 6 -i T0391.fsa -d sp -b 10000000 -v 10000000 > T0391.fsa.blastout
Convert it to Stockholm format:
# STOCKHOLM 1.0
QUERY  DPEISEQDEEKKKYTSVCVGREEDIRKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q8K2P6 DPEISEQDEEKKKYTSVCVGREEDIRKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q8TAC1 ----SAQDPEKREYSSVCVGREDDIKKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Build the HMM:
hmmbuild T0391.hmm T0391.fsa.blastout.sto
Search for templates in PDB:
hmmsearch T0391.hmm pdb > T0391.out

An HMMer example - hmmsearch output:
Sequence   Description                                Score   E-value   N
--------   -----------                                -----   -------   ---
3D89.A     mol:aa ELECTRON TRANSPORT                  178.6   2.1e-49   1
2E4Q.A     mol:aa ELECTRON TRANSPORT                  163.7   6.7e-45   1
2E4P.B     mol:aa ELECTRON TRANSPORT                  163.7   6.7e-45   1
2E4P.A     mol:aa ELECTRON TRANSPORT                  163.7   6.7e-45   1
2E4Q.C     mol:aa ELECTRON TRANSPORT                  163.7   6.7e-45   1
2YVJ.B     mol:aa OXIDOREDUCTASE/ELECTRON TRANSPORT   163.7   6.7e-45   1
1FQT.A     mol:aa OXIDOREDUCTASE                      160.9   4.5e-44   1
1FQT.B     mol:aa OXIDOREDUCTASE                      160.9   4.5e-44   1
2QPZ.A     mol:aa METAL BINDING PROTEIN               137.3   5.6e-37   1
2Q3W.A     mol:aa ELECTRON TRANSPORT                  116.2   1.3e-30   1
1VM9.A     mol:aa ELECTRON TRANSPORT                  116.2   1.3e-30   1

An HMMer example. This is the structure we are trying to predict:
HEADER    ELECTRON TRANSPORT                      22-MAY-08   3D89
TITLE     CRYSTAL STRUCTURE OF A SOLUBLE RIESKE FERREDOXIN FROM MUS
TITLE    2 MUSCULUS
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: RIESKE DOMAIN-CONTAINING PROTEIN;
COMPND   3 CHAIN: A;
COMPND   4 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: MUS MUSCULUS;
SOURCE   3 ORGANISM_COMMON: MOUSE;
SOURCE   4 GENE: RFESD;
SOURCE   5 EXPRESSION_SYSTEM: ESCHERICHIA COLI;

Validation: CE structural alignment. CE 2E4Q A 3D89 A (run on IRIX machines at CBS)
Structure Alignment Calculator, version 1.01, last modified: May 25, 2000. CE Algorithm, version 1.00, 1998.
Chain 1: /usr/cbs/bio/src/ce_distr/data.cbs/pdb2e4q.ent:A (Size=109)
Chain 2: /usr/cbs/bio/src/ce_distr/data.cbs/pdb3d89.ent:A (Size=157)
Alignment length = 101  Rmsd = 2.03A  Z-Score = 5.5  Gaps = 20 (19.8%)  CPU = 1s  Sequence identities = 18.1%
Chain 1:  2 TFTKACSVDEVPPGEALQVSHDAQKVAIFNVDGEFFATQDQCTHGEWSLSEGGYLDG----DVVECSLHM
Chain 2: 16 TSVCVGREEDIRKSERMTAVVHDREVVIFYHKGEYHAMDIRCYHSGGPLH-LGEIEDFNGQSCIVCPWHK
Chain 1: 68 GKFCVRTGKVKS-----PPPC---------EPLKVYPIRIEGRDVLVDFSRAALH
Chain 2: 85 YKITLATGEGLYQSINPKDPSAKPKWCSKGVKQRIHTVKVDNGNIYVTL-SKEPF

HMM packages
HMMER (http://hmmer.wustl.edu/) - S.R. Eddy, WashU St. Louis. Freely available.
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html) - R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
META-MEME (http://metameme.sdsc.edu/) - William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html) - Freely available to academia, nominal license fee for commercial users. Allows HMM architecture construction.
EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/) - Webserver for Gibbs sampling of protein sequences.